High-quality corpora are foundational to Natural Language Processing (NLP): they enable both the training and the rigorous evaluation of computational models. Rich textual resources exist for high-resource languages, but they remain scarce for many others, particularly understudied Arabic dialects such as Kuwaiti Arabic (KA). This paper introduces Hazawi+, a multi-domain textual corpus comprising over 7 million tokens of KA dialectal stories and novels. Unlike social network texts, Hazawi+ is specifically designed to capture the linguistic features of narrative text, including morphological complexity, informal syntax, and pragmatic nuances, making it a valuable resource for developing NLP models in low-resource settings. The entire corpus was morphologically annotated automatically using CAMeL tools specialized for Gulf Arabic, and annotation quality was then validated through a rigorous manual review of a 105,770-token sample by two language experts. To demonstrate Hazawi+’s immediate usability, we present a complementary empirical study in which a synthetic dataset of KA stories is generated programmatically and then used in a downstream task to train a classifier that distinguishes human-written from bot-generated narratives. This experiment serves as a proof of concept, underscoring Hazawi+’s potential to give researchers insight into dialectal linguistic patterns and to improve the precision of language processing tasks for the Kuwaiti Arabic dialect.
Published in: ACM Transactions on Asian and Low-Resource Language Information Processing
DOI: 10.1145/3800688