Amr Keleg | Arabic NLP resources

In this post, I will try to keep a list of the publicly available Arabic NLP resources that I can find. Many Arabic resources go unused because they are hard to find.

Datasets

Dataset	Description	Languages	Published
ALUE	Arabic Language Understanding Evaluation is a set of shared tasks for benchmarking models	MSA & Dialects	2020
LinCE	A NER dataset (stats?) and Language Identification corpus (each token in the sentence is annotated with the language MSA or EGA)	MSA & ARZ	2020
MADAR	A set of sentences manually translated by professional translators from different cities into different dialects from Travel (Subtask 1) and Twitter (Subtask 2) data	26 different dialects from different Arab cities including MSA	2019
CALM	“CALM contains transcripts from 65 movies (comprising 655,858 word tokens), 88 scripted television programs (396,734 word tokens), and internet texts (1,092,442 word tokens). Some of the content has been annotated, and annotation is ongoing.”	ARZ	2019
HARD, paper	93,700 hotel reviews in Arabic collected from Booking.com	MSA & Dialects	2018
ASTD	Arabic Sentiment Tweets Dataset contains over 10k Arabic sentiment tweets classified into four classes: subjective positive, subjective negative, subjective mixed, and objective	NA	2015
Arabic sentiment analysis	A two label dataset of Arabic tweets (2000 positive and 2000 negative)	NA	2013
LABR	Large Scale Arabic Book Reviews Dataset contains over 63,000 book reviews in Arabic collected from Goodreads	NA	2013
AQMAR, Topcode challenge	Wikipedia-based NER dataset	ARA	2012
ANERCorp, ANERCorp - CAMeL Lab Train/Test Splits	A relatively old Arabic NER (Person, Location, Organisation, Miscellaneous) corpus of 150K tokens (11% are entities)	NA	2007
Hate speech datasets	A set of different hate speech datasets in different languages including Arabic	NA	NA
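
As a quick sanity check on figures like the ones quoted for ANERCorp (150K tokens, 11% of them entities), here is a minimal sketch of how such statistics could be computed. It assumes a hypothetical one-token-per-line layout where the last whitespace-separated field is the tag and `O` marks tokens outside any entity. The function name `entity_token_ratio`, the sample sentence, and its tags are my own illustrations and are not taken from any of the datasets above, so check the actual file format of a corpus before reusing this.

```python
from typing import Iterable, Tuple


def entity_token_ratio(lines: Iterable[str], outside_tag: str = "O") -> Tuple[int, float]:
    """Return (token_count, share of tokens whose tag is not `outside_tag`)."""
    total = 0
    entities = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Blank or malformed lines (e.g. sentence separators) are skipped.
            continue
        total += 1
        # Assumption: the tag is the last field on the line.
        if parts[-1] != outside_tag:
            entities += 1
    ratio = entities / total if total else 0.0
    return total, ratio


# Hypothetical sample in the assumed "token tag" layout.
sample = [
    "ذهب O",
    "أحمد B-PERS",
    "إلى O",
    "القاهرة B-LOC",
    "",
]

total, ratio = entity_token_ratio(sample)
print(total, ratio)
```

For a real corpus, the same function can be fed an open file handle (opened with `encoding="utf-8"`) instead of the in-memory list.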

Raw corpora

Corpus	Description	Languages	Published
OSCAR	OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture	MSA & ARZ	2020
Abu El-Khair Corpus	An Arabic text corpus of more than five million newspaper articles. It contains over a billion and a half words in total, about three million of which are unique	ARA	2016
ArabicWeb16	A public web crawl of 150,211,934 Arabic webpages	MSA and Dialects	2016
Shamela	Epub files parsed into text	CA?	-

Tools

Tool name	Description	Programming language	Languages	Published
stanza	TBC (A tool for NER and POS? for multiple languages including Arabic), uses AQMAR and PADT	python (pytorch)	MSA & EGA	2020 (Christopher D. Manning.)

Arabic related workshops

WANLP: WANLP2020
OSACT: OSACT4

Misc. resources

SA paper including names of some datasets
Speech Recognition
Speech Recognition
Q&A from ask.fm
NER
Arabizi - Taha
Other gazetteers
ARZ and dialectal resources (mainly for segmentation and POS)
Arabic Morphological analysis
CNN and BBC data
Tunisian dialect resources repo
Random Arabic datasets
NYU-AD
ELXIR: A program able to inflect words?
ANETAC: English names with their Arabic transliterations
MISC Arabic sources - Big Science
THE ARABIC LEARNER’s WRITING TOOLKIT
