A Simple Approach to Unify Ambiguously Encoded Kurdish Characters

Jaf, Sardar

Authors

Sardar Jaf

Abstract

In this study we outline a potential problem in the normalisation stage of processing texts written in a modified version of the Arabic alphabet. Raw text is the main resource available for processing resource-scarce languages, and we have identified a challenge that must be addressed when normalising it. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. In data harvested from the Internet, many characters may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify such characters during the normalisation stage of most text processing tasks. We demonstrate cases of ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.
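As an illustration of the kind of ambiguity the abstract describes, the following is a minimal Python sketch and not the paper's own procedure. It assumes a small hand-built mapping (`UNIFY_MAP`, a hypothetical name) from variant code points to one chosen canonical code point. The two pairs shown, Arabic Kaf/Keheh and Arabic Yeh/Farsi Yeh, are well-known look-alikes in some letter positions; the choice of which member is canonical is an assumption made here. The `inventory` helper (also hypothetical) lists the code points found in a text so that a person can review them, which is one way the "semi-automatic" step could be approached.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping: variant code point -> chosen canonical code point.
# Which member of each pair is treated as canonical is an assumption.
UNIFY_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}

# Translation table built once and reused for every call.
_TABLE = str.maketrans(UNIFY_MAP)


def unify(text: str) -> str:
    """Replace every variant character with its canonical counterpart."""
    return text.translate(_TABLE)


def inventory(text: str):
    """List (code point, Unicode name, count) for each non-space character,
    most frequent first, so look-alike characters can be reviewed by hand."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = "\u0643\u064A \u06A9\u06CC"  # same letters, two encodings
    for row in inventory(sample):
        print(row)
    print(unify(sample) == "\u06A9\u06CC \u06A9\u06CC")
```

Reviewing the inventory before extending the mapping keeps a person in the loop, which matters because some look-alike characters are genuinely distinct letters in one language and interchangeable in another.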

Citation

Jaf, S. (2016, September). A Simple Approach to Unify Ambiguously Encoded Kurdish Characters. Presented at the Second International Conference of Computational Linguistics in Bulgaria, Sofia, Bulgaria.

Presentation Conference Type	Conference Paper (published)
Conference Name	Second International Conference of Computational Linguistics in Bulgaria
Start Date	Sep 9, 2016
Acceptance Date	Jul 27, 2016
Publication Date	Sep 9, 2016
Deposit Date	Aug 22, 2016
Publicly Available Date	Aug 23, 2016
Pages	86-94
Series ISSN	2367-5578,2367-5675
Public URL	https://durham-repository.worktribe.com/output/1150394
Publisher URL	http://dcl.bas.bg/clib/proceedings/