Solving the imbalanced data issue: automatic urgency detection for instructor assistance in MOOC discussion forums

Alrajhi, Laila; Alamri, Ahmed; Pereira, Filipe Dwan; Cristea, Alexandra I.; Oliveira, Elaine H. T.

doi:10.1007/s11257-023-09381-y

Authors

Laila Alrajhi laila.m.alrajhi@durham.ac.uk
PGR Student Doctor of Philosophy

Ahmed Alamri

Filipe Dwan Pereira

Professor Alexandra Cristea alexandra.i.cristea@durham.ac.uk
Professor

Elaine H. T. Oliveira

Abstract

In MOOCs, identifying urgent comments on discussion forums is an ongoing challenge. Urgent comments require immediate reactions from instructors, which can improve interaction with learners and potentially reduce drop-out rates; however, the task is difficult, as truly urgent comments are rare. From a data analytics perspective, this represents a highly unbalanced (sparse) dataset. Here, we aim to automate the identification of urgent comments, based on fine-grained learner modelling, to be used for automatic recommendations to instructors. To showcase and compare these models, we apply them to the first gold standard dataset for Urgent iNstructor InTErvention (UNITE), which we created by labelling FutureLearn MOOC data. We implement both benchmark shallow classifiers and deep learning models. Importantly, we not only compare, for the first time for the unbalanced problem, several data balancing techniques (text augmentation, text augmentation with undersampling, and undersampling alone), but also propose several new pipelines for combining different augmenters for text augmentation. Results show that models with undersampling can predict most urgent cases, and that 3X augmentation + undersampling usually attains the best performance. We additionally validate the best models on a generic benchmark dataset (Stanford). As a case study, we showcase how naïve Bayes with a count vector can adaptively support instructors in answering learner questions/comments, potentially saving time or increasing efficiency in supporting learners. Finally, we show that the errors from the classifier mirror the disagreements between annotators. Thus, our proposed algorithms perform at least as well as a ‘super-diligent’ human instructor (with the time to consider all comments).
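
As a rough illustration of the undersampling idea mentioned in the abstract, the following minimal sketch randomly drops majority-class (non-urgent) comments until both classes are the same size. It is not the authors' implementation: the function name `undersample`, the label encoding (0 = non-urgent, 1 = urgent), the fixed seed, and the toy data are all assumptions made for illustration, and the article itself should be consulted for the exact balancing procedure.

```python
import random
from collections import Counter


def undersample(texts, labels, majority_label=0, seed=0):
    """Randomly drop majority-class items until both classes are equal in size.

    Hypothetical helper: the label encoding (0 = non-urgent, 1 = urgent)
    is an assumption, not taken from the article.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Keep every minority item and an equally sized random majority subset.
    keep_majority = rng.sample(majority, min(len(minority), len(majority)))
    kept = sorted(minority + keep_majority)  # preserve original ordering
    return [texts[i] for i in kept], [labels[i] for i in kept]


# Toy data (invented): 8 non-urgent comments and 2 urgent ones.
texts = [f"comment {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

bal_texts, bal_labels = undersample(texts, labels)
print(Counter(bal_labels))
```

The balanced lists could then be passed to any text classifier, for example a count-vector naïve Bayes model of the kind the abstract describes.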

Citation

Alrajhi, L., Alamri, A., Pereira, F. D., Cristea, A. I., & Oliveira, E. H. T. (2024). Solving the imbalanced data issue: automatic urgency detection for instructor assistance in MOOC discussion forums. User Modeling and User-Adapted Interaction, 34(3), 797-852. https://doi.org/10.1007/s11257-023-09381-y

Journal Article Type	Article
Acceptance Date	Aug 9, 2023
Online Publication Date	Dec 1, 2023
Publication Date	Jul 1, 2024
Deposit Date	Jan 10, 2024
Publicly Available Date	Jan 10, 2024
Journal	User Modeling and User-Adapted Interaction
Print ISSN	0924-1868
Electronic ISSN	1573-1391
Publisher	Springer
Peer Reviewed	Peer Reviewed
Volume	34
Issue	3
Pages	797-852
DOI	https://doi.org/10.1007/s11257-023-09381-y
Keywords	MOOCs, Machine learning, Undersampling, Error analysis, Natural language processing, Adaptive models, Imbalanced data, Text augmentation, Urgent comments
Public URL	https://durham-repository.worktribe.com/output/2118421

Files

Published Journal Article (Advance Online Version) (2.1 Mb)
PDF

Licence
http://creativecommons.org/licenses/by/4.0/

Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/

Copyright Statement
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.