Sardar Jaf
A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters
Jaf, Sardar; Dong, Minghui; Tseng, Yuen-Hsien; Lu, Yanfeng; Yu, Liang-Chih; Lee, Lung-Hao; Wu, Chung-Hsien; Li, Haizhou
Authors
Minghui Dong
Yuen-Hsien Tseng
Yanfeng Lu
Liang-Chih Yu
Lung-Hao Lee
Chung-Hsien Wu
Haizhou Li
Abstract
In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
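The kind of ambiguity described in the abstract can be illustrated with a minimal Python sketch. It is not the procedure from the paper: the mapping table, the direction of each replacement, and the function names `character_inventory` and `unify` are assumptions made for illustration. The only facts relied on are the Unicode code points themselves, for example ARABIC LETTER KAF (U+0643) and ARABIC LETTER KEHEH (U+06A9), which can render alike in some positions and fonts. The inventory step stands in for the manual review part of a semi-automatic workflow, and the translation step for the automatic part.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping for illustration only: variant code point -> chosen
# canonical code point. The right choice depends on the target language.
UNIFY_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}


def character_inventory(text):
    """List each character from the Arabic block (U+0600..U+06FF) with its
    code point, Unicode name and frequency, so a person can spot look-alikes."""
    counts = Counter(ch for ch in text if "\u0600" <= ch <= "\u06FF")
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


def unify(text, mapping=UNIFY_MAP):
    """Replace every variant character with its chosen canonical form."""
    return text.translate(str.maketrans(mapping))


# Two encodings of the same visible word, differing only in the first letter.
word_a = "\u0643\u062A\u0627\u0628"  # starts with ARABIC LETTER KAF
word_b = "\u06A9\u062A\u0627\u0628"  # starts with ARABIC LETTER KEHEH

print(word_a == word_b)                # the raw strings differ
print(unify(word_a) == unify(word_b))  # they match after unification
for row in character_inventory(word_a + word_b):
    print(row)
```

In such a sketch, the inventory output would be reviewed by someone who knows the script, and only confirmed look-alike pairs would be added to the mapping before running `unify` over a corpus.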
Citation
Jaf, S., Dong, M., Tseng, Y., Lu, Y., Yu, L., Lee, L., …Li, H. (2017). A Semi-automatic Approach to Identifying and Unifying Ambiguously Encoded Arabic-Based Characters. In Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan (228-231). https://doi.org/10.1109/ialp.2016.7875974
Presentation Conference Type | Conference Paper (Published) |
---|---|
Conference Name | The 20th International Conference on Asian Language Processing. |
Start Date | Nov 21, 2016 |
End Date | Nov 23, 2016 |
Acceptance Date | Aug 28, 2016 |
Online Publication Date | Mar 13, 2017 |
Publication Date | Mar 13, 2017 |
Deposit Date | Oct 21, 2016 |
Publicly Available Date | Oct 24, 2016 |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 228-231 |
Book Title | Proceedings of the 2016 International Conference on Asian Language Processing (IALP), 21-23 November 2016, Tainan, Taiwan. |
DOI | https://doi.org/10.1109/ialp.2016.7875974 |
Public URL | https://durham-repository.worktribe.com/output/1149675 |
Files
Accepted Conference Proceeding (PDF, 667 Kb)
Copyright Statement
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.