In this post, I will try to keep a list of the publicly available Arabic resources (datasets, corpora, and tools) that I can find. Many Arabic resources go unused because they are hard to find.

Datasets

| Dataset | Description | Languages | Published |
| --- | --- | --- | --- |
| ALUE | Arabic Language Understanding Evaluation: a set of shared tasks for benchmarking models | MSA & Dialects | 2020 |
| LinCE | A NER dataset (stats?) and a language identification corpus (each token in the sentence is annotated with its language, MSA or EGA) | MSA & ARZ | 2020 |
| MADAR | A set of sentences manually translated by professional translators from different cities into different dialects, drawn from Travel (Subtask 1) and Twitter (Subtask 2) data | 26 different dialects from different Arab cities, including MSA | 2019 |
| CALM | “CALM contains transcripts from 65 movies (comprising 655,858 word tokens), 88 scripted television programs (396,734 word tokens), and internet texts (1,092,442 word tokens). Some of the content has been annotated, and annotation is ongoing.” | ARZ | 2019 |
| HARD (paper) | 93,700 Arabic hotel reviews collected from Booking.com | MSA & Dialects | 2018 |
| ASTD | Arabic Sentiment Tweets Dataset: over 10k Arabic tweets classified into four classes (subjective positive, subjective negative, subjective mixed, and objective) | NA | 2015 |
| Arabic sentiment analysis | A two-label dataset of Arabic tweets (2,000 positive and 2,000 negative) | NA | 2013 |
| LABR | Large Scale Arabic Book Reviews Dataset: over 63,000 Arabic book reviews collected from Goodreads | NA | 2013 |
| AQMAR, Topcode challenge | Wikipedia-based NER dataset | ARA | 2012 |
| ANERCorp, ANERCorp - CAMeL Lab Train/Test Splits | A relatively old Arabic NER corpus (Person, Location, Organisation, Miscellaneous) of 150K tokens (11% are entities) | NA | 2007 |
| Hate speech datasets | A collection of hate speech datasets in different languages, including Arabic | NA | NA |

Raw corpora

| Corpus | Description | Languages | Published |
| --- | --- | --- | --- |
| OSCAR | OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture | MSA & ARZ | 2020 |
| Abu El-Khair Corpus | An Arabic text corpus of more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique | ARA | 2016 |
| ArabicWeb16 | A public web crawl of 150,211,934 Arabic webpages | MSA & Dialects | 2016 |
| Shamela | Epub files parsed into text | CA? | - |

Tools

| Tool name | Description | Programming language | Languages | Published |
| --- | --- | --- | --- | --- |
| stanza | TBC (A tool for NER and POS? for multiple languages including Arabic), uses AQMAR and PADT | python (pytorch) | MSA & EGA | 2020 (Christopher D. Manning.) |
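
As a quick illustration of the stanza entry, here is a minimal sketch of running POS tagging and NER on Arabic text. The helper name `tag_arabic` and the processor list are my own choices for illustration. I am also assuming that the default Arabic package ships both a POS model (trained on PADT) and an NER model (trained on AQMAR), so check the stanza model list for your version before relying on it.

```python
def tag_arabic(text, processors="tokenize,mwt,pos,ner"):
    """Return (POS pairs, entity pairs) for an Arabic string using stanza.

    `tag_arabic` is a hypothetical helper, not part of stanza itself.
    """
    # Imported lazily so this module can be loaded without stanza installed.
    import stanza

    # Downloads the default Arabic models on first use (needs network access).
    stanza.download("ar")

    # Build a pipeline restricted to the processors we need.
    nlp = stanza.Pipeline("ar", processors=processors)
    doc = nlp(text)

    # Universal POS tag for every word in every sentence.
    pos_pairs = [(word.text, word.upos)
                 for sentence in doc.sentences
                 for word in sentence.words]

    # Named entities with their types, as predicted by the NER model.
    entity_pairs = [(ent.text, ent.type) for ent in doc.ents]
    return pos_pairs, entity_pairs
```

Calling `tag_arabic` on an Arabic sentence should return two lists of `(text, label)` tuples. The first call is slow because the models are downloaded and loaded, so in practice you would build the pipeline once and reuse it.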


Misc. resources