In this post, I will try to keep a list of the publicly available resources that I can find. Many Arabic resources go unused because they are hard to find.
Datasets
Dataset | Description | Languages | Published |
ALUE | Arabic Language Understanding Evaluation, a set of shared tasks for benchmarking models | MSA & Dialects | 2020 |
LinCE | An NER dataset (stats?) and a language identification corpus (each token in a sentence is annotated with its language, MSA or EGA) | MSA & ARZ | 2020 |
MADAR | A set of sentences manually translated by professional translators from different cities into different dialects from Travel (Subtask 1) and Twitter (Subtask 2) data | 26 different dialects from different Arab cities, including MSA | 2019 |
CALM | “CALM contains transcripts from 65 movies (comprising 655,858 word tokens), 88 scripted television programs (396,734 word tokens), and internet texts (1,092,442 word tokens). Some of the content has been annotated, and annotation is ongoing.” | ARZ | 2019 |
HARD, paper | 93,700 hotel reviews in Arabic collected from Booking.com | MSA & Dialects | 2018 |
ASTD | Arabic Sentiment Tweets Dataset contains over 10k Arabic tweets classified into four classes: subjective positive, subjective negative, subjective mixed, and objective | NA | 2015 |
Arabic sentiment analysis | A two-label dataset of Arabic tweets (2,000 positive and 2,000 negative) | NA | 2013 |
LABR | Large Scale Arabic Book Reviews Dataset contains over 63,000 book reviews in Arabic collected from Goodreads | NA | 2013 |
AQMAR, Topcode challenge | Wikipedia-based NER dataset | ARA | 2012 |
ANERCorp, ANERCorp - CAMeL Lab Train/Test Splits | A relatively old Arabic NER (Person, Location, Organisation, Miscellaneous) corpus of 150K tokens (11% are entities) | NA | 2007 |
Hate speech datasets | A set of different hate speech datasets in different languages including Arabic | NA | NA |
Raw corpora
Corpus | Description | Languages | Published |
OSCAR | OSCAR, or Open Super-large Crawled Aggregated coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture | MSA & ARZ | 2020 |
Abu El-Khair Corpus | An Arabic text corpus of more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique | ARA | 2016 |
ArabicWeb16 | A public web crawl of 150,211,934 Arabic webpages | MSA and Dialects | 2016 |
Shamela | Epub files parsed into text | CA? | - |
Tools
Tool name | Description | Programming language | Languages | Published |
Stanza | TBC (A tool for NER and POS? for multiple languages including Arabic), uses AQMAR and PADT | Python (PyTorch) | MSA & EGA | 2020 (Christopher D. Manning) |
Misc. resources