Sardar Jaf
A Simple Approach to Unify Ambiguously Encoded Kurdish Characters
Authors
Jaf, Sardar
Abstract
In this study we outline a potential problem in the normalisation stage of processing texts written in a modified version of the Arabic alphabet. The main resource available for processing resource-scarce languages is raw text. We have identified a challenge that must be addressed when normalising such texts. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. In data harvested from the Internet, many characters may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify ambiguous characters during the normalisation stage of most text processing tasks. We demonstrate cases of ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.
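To make the idea of ambiguous characters concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small, hand-written mapping (`UNIFY`, a hypothetical name) from Arabic code points to the visually similar code points commonly used in Kurdish and Farsi text. The paper's own character table is not reproduced here. The `inventory` helper (also hypothetical) lists every code point in a text so that a person can review candidates, which is roughly what a semi-automatic workflow might look like.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping, assumed for illustration only (not taken from the paper):
# Arabic code points on the left, look-alike Kurdish/Farsi code points on the right.
UNIFY = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}


def inventory(text):
    """Return (code point, Unicode name, count) for each non-space character,
    sorted by code point, so a human can spot look-alike characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]


def unify(text, mapping=UNIFY):
    """Replace every character listed in `mapping` with its chosen target."""
    return text.translate(str.maketrans(mapping))


# Two spellings that can look the same on screen but differ in encoding.
word_a = "\u0643\u0648\u0631\u062F\u06CC"  # Arabic KAF, Farsi YEH
word_b = "\u06A9\u0648\u0631\u062F\u064A"  # KEHEH, Arabic YEH

for row in inventory(word_a + " " + word_b):
    print(row)

print(word_a == word_b)                # the raw strings differ
print(unify(word_a) == unify(word_b))  # after unification they match
```

The manual step in this sketch is reviewing the inventory and deciding which code points belong in the mapping. Applying the mapping afterwards is mechanical.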
Citation
Jaf, S. (2016, September). A Simple Approach to Unify Ambiguously Encoded Kurdish Characters. Presented at Second International Conference of Computational Linguistics in Bulgaria, Sofia, Bulgaria
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | Second International Conference of Computational Linguistics in Bulgaria |
Start Date | Sep 9, 2016 |
Acceptance Date | Jul 27, 2016 |
Publication Date | Sep 9, 2016 |
Deposit Date | Aug 22, 2016 |
Publicly Available Date | Aug 23, 2016 |
Pages | 86-94 |
Series ISSN | 2367-5578, 2367-5675 |
Public URL | https://durham-repository.worktribe.com/output/1150394 |
Publisher URL | http://dcl.bas.bg/clib/proceedings/ |
Files
Published Conference Proceeding
(6.2 Mb)
PDF
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/
Accepted Conference Proceeding
(110 Kb)
PDF
Copyright Statement
This work is available under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0).