Resources
Arabic Dialect Identification (ADI)
- Multilabel Arabic Dialect Identification (MLADI) Dataset and Leaderboard
- Dataset Description: Each sample’s validity in 11 different country-level dialects is manually assessed by 3 annotators from each country.
- Dataset Access Form: https://forms.gle/gdgTToxG2tH5xT27A
- Leaderboard: https://huggingface.co/spaces/AMR-KELEG/MLADI
- Baseline Model: https://huggingface.co/AMR-KELEG/NADI2024-baseline
- Leaderboard Description: The MLADI leaderboard serves as a public interface for benchmarking ADI models using an ‘extended version’ of the NADI 2024 test set, the first multi-label country-level ADI dataset.
- Papers:
- Amr Keleg, Sharon Goldwater, and Walid Magdy. 2025. Revisiting Common Assumptions about Arabic Dialects in NLP. In Proceedings of ACL 2025.
- Muhammad Abdul-Mageed, Amr Keleg, AbdelRahim Elmadany, Chiyu Zhang, Injy Hamed, Walid Magdy, Houda Bouamor, and Nizar Habash. 2024. NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task. In Proceedings of the Second Arabic Natural Language Processing Conference, pages 709–728, Bangkok, Thailand.
- Error Analysis for the predictions of a SOTA single-label ADI model.
- Description: Speakers recruited from 7 different Arab countries checked whether the model’s predictions on the samples counted as errors are also valid.
- Annotations: https://github.com/AMR-KELEG/ADI-under-scrutiny/blob/master/data/annotations.tar.gz
- Model: https://huggingface.co/AMR-KELEG/ADI-NADI-2023
- Original Evaluation Dataset: QADI’s test set
- Paper: Amr Keleg and Walid Magdy. 2023. Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification. In Proceedings of ArabicNLP 2023, pages 385–398, Singapore (Hybrid).