A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters

Jaf, Sardar; Dong, Minghui; Tseng, Yuen-Hsien; Lu, Yanfeng; Yu, Liang-Chih; Lee, Lung-Hao; Wu, Chung-Hsien; Li, Haizhou

Authors

Sardar Jaf

Minghui Dong

Yuen-Hsien Tseng

Yanfeng Lu

Liang-Chih Yu

Lung-Hao Lee

Chung-Hsien Wu

Haizhou Li



Abstract

In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). Ambiguous characters cause the same word to appear under several distinct encodings (word duplication), so it is important to identify and unify them during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
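The problem the abstract describes can be illustrated with a minimal Python sketch. It is not the method from the paper: the mapping table, the choice of target code points, and the function names `unify` and `report_code_points` are assumptions made here for illustration. The two pairs used (ARABIC LETTER KAF U+0643 versus ARABIC LETTER KEHEH U+06A9, and ARABIC LETTER YEH U+064A versus ARABIC LETTER FARSI YEH U+06CC) are well-known look-alikes in at least some positions and fonts. The report step stands in for the automatic part, and the hand-written table for the manual part, of one possible semi-automatic workflow.

```python
import unicodedata
from collections import Counter

# Hypothetical unification table (an assumption, not taken from the paper).
# Keys are variants to replace; values are the code points kept as canonical.
UNIFY_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(UNIFY_MAP)


def unify(text):
    """Replace every variant listed in UNIFY_MAP with its canonical code point."""
    return text.translate(_TABLE)


def report_code_points(text):
    """List (code point, Unicode name, count) for each non-space character,
    sorted by code point, so a person can spot look-alike entries."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]


# The same word typed with two different first letters.
kaf_spelling = "\u0643\u062A\u0627\u0628"    # starts with U+0643
keheh_spelling = "\u06A9\u062A\u0627\u0628"  # starts with U+06A9

# Before unification the two strings count as different words.
distinct_before = len({kaf_spelling, keheh_spelling})
# After unification they collapse into a single vocabulary entry.
distinct_after = len({unify(kaf_spelling), unify(keheh_spelling)})
```

In this sketch `distinct_before` is 2 and `distinct_after` is 1, which is the word duplication effect the abstract refers to. Which code point to keep as canonical is language-dependent, so the table would need review by someone who knows the target language.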

Citation

Jaf, S., Dong, M., Tseng, Y., Lu, Y., Yu, L., Lee, L., …Li, H. (2017). A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters. In Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan (228-231). https://doi.org/10.1109/ialp.2016.7875974

Presentation Conference Type: Conference Paper (Published)
Conference Name: The 20th International Conference on Asian Language Processing.
Start Date: Nov 21, 2016
End Date: Nov 23, 2016
Acceptance Date: Aug 28, 2016
Online Publication Date: Mar 13, 2017
Publication Date: Mar 13, 2017
Deposit Date: Oct 21, 2016
Publicly Available Date: Oct 24, 2016
Publisher: Institute of Electrical and Electronics Engineers
Pages: 228-231
Book Title: Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan.
DOI: https://doi.org/10.1109/ialp.2016.7875974
Public URL: https://durham-repository.worktribe.com/output/1149675

Files

Accepted Conference Proceeding (667 Kb)
PDF

Copyright Statement
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.




