A Simple Approach to Unify Ambiguously Encoded Kurdish Characters

Jaf, Sardar


Authors

Sardar Jaf



Abstract

In this study we outline a potential problem in the normalisation stage of processing texts written in a modified version of the Arabic alphabet. Raw text is the main resource available for processing resource-scarce languages, and we have identified a challenge that must be addressed when normalising such text. Many less-resourced languages, such as Kurdish, Farsi, Urdu and Pashtu, use a modified version of the Arabic writing system. In data harvested from the Internet, many characters may have exactly the same form but be encoded with different Unicode values (ambiguous characters). It is important to identify ambiguous characters during the normalisation stage of most text processing tasks. We demonstrate cases of ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying ambiguously encoded characters.
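
The idea of identifying and then unifying ambiguous characters can be illustrated with a minimal Python sketch. It is not the method from the paper: the function names `report_code_points` and `unify` and the `CANONICAL` table are hypothetical, and the two mapped pairs (Arabic Yeh to Farsi Yeh, Arabic Kaf to Keheh) are only well-known examples of letters that look alike in some positions within a word. In a semi-automatic workflow, a person would review the report and decide which mappings to apply.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping: variant code point -> chosen canonical code point.
# Illustrative only; this is NOT the mapping table from the paper.
CANONICAL = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def report_code_points(text):
    """List characters from the basic Arabic block (U+0600..U+06FF) with counts.

    The output is meant for manual review: look-alike letters show up as
    separate rows with different code points and Unicode names.
    """
    counts = Counter(ch for ch in text if "\u0600" <= ch <= "\u06FF")
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


def unify(text, mapping=CANONICAL):
    """Replace each variant code point with its chosen canonical form."""
    return text.translate(str.maketrans(mapping))


if __name__ == "__main__":
    sample = "\u0643\u064A"  # Arabic Kaf followed by Arabic Yeh
    for row in report_code_points(sample):
        print(row)
    print([f"U+{ord(ch):04X}" for ch in unify(sample)])
```

Keeping the report step separate from the replacement step means the mapping table stays under human control, which matters because some look-alike pairs are distinct letters in one language and interchangeable in another.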

Citation

Jaf, S. (2016). A Simple Approach to Unify Ambiguously Encoded Kurdish Characters.

Conference Name: Second International Conference of Computational Linguistics in Bulgaria
Conference Location: Sofia, Bulgaria
Start Date: Sep 9, 2016
Acceptance Date: Jul 27, 2016
Publication Date: Sep 9, 2016
Deposit Date: Aug 22, 2016
Publicly Available Date: Aug 23, 2016
Pages: 86-94
Series ISSN: 2367-5578, 2367-5675
Publisher URL: http://dcl.bas.bg/clib/proceedings/

Files


Accepted Conference Proceeding (110 Kb)
PDF

Copyright Statement
This work is available under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0).




