Digitizing Arabic Books
Arabic is the sixth most spoken language in the world, with more than 420 million speakers. However, its digital representation clearly lags behind and is disproportionate to its real-world usage and presence.
With the continuous rise of Large Language Models (LLMs), digital machine-readable Arabic content is of great importance. These models rely on such data to understand the language and to build a view of the real world. The fact that the culture, beliefs, ideas, and opinions of Arabic speakers are not well represented might be a graver problem than the under-representation of Arabic content online.
To this end, using Optical Character Recognition (OCR) might help. As simple as this sounds, the Arabic NLP community might still be deterred from using OCR. As a current PhD student focusing on Arabic, I myself do not know much about the quality of the output of OCR models for Arabic content. A recent paper showed that OCR models might struggle with the less common calligraphic scripts of Arabic [1].
In this blog post, I dream of mobilizing the interest of the Arabic NLP community to use, evaluate, and improve OCR systems in order to enrich the Arabic content online. Assessing the performance of these models on dialectal documents is an understudied research question that is vital for the research community.
Case Study: Parallel Tunisian Constitution Corpus (PTCC)
Official and legal documents in Arab countries, such as the constitution, are primarily issued in Arabic, and are sometimes available in other languages such as English and French. It is extremely rare to find these documents written in a variety of Dialectal Arabic.
While reading the paper "TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect", I found that the 2014 Tunisian Constitution had been translated into Tunisian Arabic, in an attempt to make it more accessible to Tunisians.
I managed to find two versions of the constitution as PDF files, one written in Modern Standard Arabic (MSA) and one in Tunisian Arabic. While text embedded in EPUB files can be parsed automatically (e.g., using EbookLib), text embedded in PDF files poses a challenge. Even tedious manual copying of the text from the PDF files is not an option, as the Arabic text tends to get garbled. For example, the first article reads: “الفصل 1. تونس دولة حرّة مسْتقلّة ذات سيادة يعني حتّى دولة وإلآ سلطة أخرى ما تتدخل فيها, وتونس مسلمة ولوغتها العربية معنتها أغلبية شعبها مسلم ولوغتو العربية ونظامها جمهوري يعني الشعب هو إلي ينتخب رئيس الجمهورية موش كيما المُلوكية يكون فيها الحكم بالوراثة. الفصل هاذا ما إنجّموش نبدلوه.” Copying it from the PDF yields the following:
الفصل
. 1
تونس دولة حرّ ة م
ة ذات سيادة يعني حتّ قلّ تْ سْ
ى دولة
إلاّ و
سلطة
تْ أخرى ما ت
دخّل
، فيها
وتونس مسلمة
لوغتها العربية و
معنتها
أغلبية شعبها
مسلم ولوغتو العربية ونظامها
ج
مهوري يعني الشّعب
إ لّ هو
ي ينتخب رئيس الجمهورية
موش كيما الم لوكية يكون فيها الحكم
.
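For contrast, the EPUB case mentioned above is much friendlier. The following is a minimal sketch, assuming EbookLib is installed; the helper names (html_to_text, epub_to_text) are my own and purely illustrative. Each chapter of an EPUB is an XHTML document, so the text only needs its markup stripped.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an (X)HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return the visible text of an (X)HTML string, one text node per line."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.chunks)


def epub_to_text(epub_path):
    """Concatenate the text of all document items (chapters) in an EPUB file."""
    # Imported lazily so html_to_text can be used without EbookLib installed.
    import ebooklib
    from ebooklib import epub

    book = epub.read_epub(epub_path)
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        # Assumption: chapters are UTF-8 encoded, which is the common case.
        parts.append(html_to_text(item.get_content().decode("utf-8")))
    return "\n".join(parts)
```

Because the characters are stored in logical order inside the XHTML, the Arabic text comes out intact, unlike the copied PDF text above.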
Another option is to use Optical Character Recognition (OCR) to transform images of text into machine-readable text. Tesseract is an open-source OCR engine that supports Arabic and has a Python interface, PyTesseract. I used PyTesseract as per the following instructions, followed by an ad-hoc script for aligning the articles extracted from the two PDF files. The aligned sentences form the Parallel Tunisian Constitution Corpus (PTCC), which can be accessed through: https://huggingface.co/datasets/AMR-KELEG/PTCC.
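The OCR and alignment steps could look roughly like the sketch below. It is not the script used to build PTCC. The function names, the use of pdf2image to rasterize the pages, the 300 dpi setting, and the assumption that every article starts on a new line with the word "الفصل" followed by its number are all illustrative. It also assumes that Tesseract's Arabic language data (ara) is installed.

```python
import re

# Map Arabic-Indic digits to ASCII so article numbers compare equal.
ARABIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

# Assumed header format: a line starting with "الفصل" followed by a number.
ARTICLE_HEADER = re.compile(r"^\s*الفصل\s+(\d+)", re.MULTILINE)


def ocr_pdf(pdf_path, lang="ara", dpi=300):
    """OCR every page of a PDF and return the concatenated text."""
    # Imported lazily so the helpers below work without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)


def split_articles(text):
    """Split OCR output into a {article number: article text} dictionary."""
    text = text.translate(ARABIC_DIGITS)
    matches = list(ARTICLE_HEADER.finditer(text))
    articles = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse whitespace and drop the punctuation that follows the number.
        body = " ".join(text[match.end():end].split()).lstrip(" .:-")
        # Keep the first occurrence if OCR produces a duplicate header.
        articles.setdefault(int(match.group(1)), body)
    return articles


def align_articles(msa_text, tunisian_text):
    """Pair articles that carry the same number in both versions."""
    msa = split_articles(msa_text)
    tunisian = split_articles(tunisian_text)
    shared = sorted(msa.keys() & tunisian.keys())
    return [(number, msa[number], tunisian[number]) for number in shared]
```

With two local files (the file names here are placeholders), align_articles(ocr_pdf("constitution_msa.pdf"), ocr_pdf("constitution_tunisian.pdf")) would return a list of (article number, MSA text, Tunisian text) tuples. Articles whose header the OCR misreads are silently dropped, so the output still needs manual checking.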
[1] HICMA: The Handwriting Identification for Calligraphy and Manuscripts in Arabic Dataset (Ismail et al., ArabicNLP-WS 2023)