Latifah Almuqren
AraCust: a Saudi Telecom Tweets corpus for sentiment analysis
Almuqren, Latifah; Cristea, Alexandra
Abstract
Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have relied on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated AraCust, our 20,000-tweet Gold Standard Corpus (GSC), the first Telecom GSC for Arabic Sentiment Analysis (ASA) for Dialectal Arabic (DA). AraCust contains Saudi dialect tweets, processed from a self-collected Arabic tweets dataset, and has been annotated for sentiment analysis, i.e., manually labelled (k=0.60). In addition, we illustrate AraCust's power by performing an exploratory data analysis of the features that arise from the nature of our corpus, to assist with choosing the right ASA methods for it. To evaluate our Gold Standard Corpus AraCust, we first ran a simple experiment, using a supervised classifier, to offer benchmark outcomes for forthcoming works. We then applied the same supervised classifier to a publicly available Arabic dataset created from Twitter, ASTD (Nabil, Aly & Atiya, 2015). The results on AraCust exceed those on ASTD, with 91% accuracy and an 89% F1avg score. The AraCust corpus will be released, together with code useful for its exploration, via GitHub as a part of this submission.
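The agreement figure reported for the manual labelling (k=0.60) is a chance-corrected agreement score. As a rough illustration only, the sketch below computes Cohen's kappa for two annotators. The abstract does not say which kappa variant or how many annotators were used, so the two-annotator form, the function name `cohen_kappa`, and the toy labels are all assumptions made for this example and are not taken from AraCust.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: two annotators and nominal labels; the source abstract
    does not state which agreement statistic was actually used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy labels, not drawn from the AraCust corpus.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the two annotators agree on 8 of 10 items, but because some agreement is expected by chance, the resulting kappa is noticeably lower than the raw 80% agreement rate.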
Citation
Almuqren, L., & Cristea, A. (2021). AraCust: a Saudi Telecom Tweets corpus for sentiment analysis. PeerJ Computer Science, 7, Article e510. https://doi.org/10.7717/peerj-cs.510
Journal Article Type | Article |
---|---|
Online Publication Date | May 20, 2021 |
Publication Date | 2021 |
Deposit Date | Oct 7, 2021 |
Publicly Available Date | Oct 7, 2021 |
Journal | PeerJ Computer Science |
Electronic ISSN | 2376-5992 |
Publisher | PeerJ |
Peer Reviewed | Peer Reviewed |
Volume | 7 |
Article Number | e510 |
DOI | https://doi.org/10.7717/peerj-cs.510 |
Public URL | https://durham-repository.worktribe.com/output/1231610 |
Files
Published Journal Article
(4 Mb)
PDF
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/
Copyright Statement
Copyright
2021 Almuqren and Cristea
Distributed under
Creative Commons CC-BY 4.0