A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters
Authors
Sardar Jaf
Minghui Dong
Yuen-Hsien Tseng
Yanfeng Lu
Liang-Chih Yu
Lung-Hao Lee
Chung-Hsien Wu
Haizhou Li
Abstract
In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
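The word-duplication effect described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's method: the mapping `UNIFY_MAP` and the helper names `describe` and `unify` are assumptions made for illustration, and the two character pairs are simply commonly cited look-alikes in the Unicode Arabic block. In a semi-automatic workflow, a person would presumably review candidate pairs before they enter such a table.

```python
import unicodedata

# Illustrative mapping only (an assumption, not the paper's list):
# each key is a code point that is often rendered like its value
# in Kurdish or Farsi text, at least in initial and medial positions.
UNIFY_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TRANSLATION_TABLE = str.maketrans(UNIFY_MAP)


def describe(word):
    """Return (code point, Unicode name) pairs for manual inspection."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in word]


def unify(text):
    """Replace every mapped character with its chosen canonical form."""
    return text.translate(_TRANSLATION_TABLE)


# The same word typed with two different first letters.
word_keheh = "\u06A9\u062A\u0627\u0628"  # starts with KEHEH (U+06A9)
word_kaf = "\u0643\u062A\u0627\u0628"    # starts with KAF (U+0643)

raw_types = {word_keheh, word_kaf}
unified_types = {unify(w) for w in raw_types}

# Two distinct word types before unification, one after.
print(len(raw_types), len(unified_types))  # prints: 2 1
```

Calling `describe(word_kaf)` lists the code points behind a word, which is one way to spot why two visually similar tokens are counted separately.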
Citation
Jaf, S., Dong, M., Tseng, Y.-H., Lu, Y., Yu, L.-C., Lee, L.-H., Wu, C.-H., & Li, H. (2016, November). A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters. Presented at The 20th International Conference on Asian Language Processing, Tainan, Taiwan.
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | The 20th International Conference on Asian Language Processing. |
Start Date | Nov 21, 2016 |
End Date | Nov 23, 2016 |
Acceptance Date | Aug 28, 2016 |
Online Publication Date | Mar 13, 2017 |
Publication Date | Mar 13, 2017 |
Deposit Date | Oct 21, 2016 |
Publicly Available Date | Oct 24, 2016 |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 228-231 |
Book Title | Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan. |
DOI | https://doi.org/10.1109/ialp.2016.7875974 |
Public URL | https://durham-repository.worktribe.com/output/1149675 |
Files
Accepted Conference Proceeding (PDF, 667 Kb)
Copyright Statement
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.