Automatic Arabic diacritics restoration based on deep nets

A. Sallab, M. Rashwan, A. Rafea, H.M. Rafaat, “Automatic Arabic diacritics restoration based on deep nets”, Arabic NLP workshop, EMNLP 2014: Conference on Empirical Methods in Natural Language Processing, October 25-29, 2014, Doha, Qatar.

In this paper, the Arabic diacritics restoration problem is tackled under the deep learning framework. We present the Confused Subset Resolution (CSR) method to improve classification accuracy, together with an Arabic Part-of-Speech (PoS) tagging framework based on deep neural nets. Special focus is given to syntactic diacritization, which still suffers from low accuracy, as indicated by related work. The approach is evaluated against state-of-the-art systems reported in the literature on challenging datasets collected from different domains. Standard datasets such as the LDC Arabic Tree Bank are used, in addition to custom ones that are available online so that the results can be replicated. The proposed techniques reduce the syntactic classification error to 9.9% and the morphological classification error to 3%, compared to 12.7% and 3.8% for the best results reported in the literature, a relative error reduction of about 22% over the best reported systems.
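The 22% figure is a relative error reduction, not a difference in percentage points. As a minimal sketch of that arithmetic (the function name `relative_error_reduction` is chosen here for illustration and does not come from the paper), the reported error rates can be compared as follows:

```python
def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_error - new_error) / baseline_error


# Error rates (in percent) as quoted in the abstract.
syntactic = relative_error_reduction(12.7, 9.9)
morphological = relative_error_reduction(3.8, 3.0)

print(f"syntactic: {syntactic:.1%}")
print(f"morphological: {morphological:.1%}")
```

With the abstract's numbers, the syntactic reduction comes out at about 22% and the morphological reduction at about 21%.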

Dr. Ahmad El Sallab

