Parsing Arabic sentences, specifically iʿrāb, poses unique challenges due to the language's intricate morphology, diverse syntactic structures, and rich contextual nuances. This study evaluates the performance of leading general-purpose Large Language Models (LLMs) in Arabic iʿrāb parsing using a novel human-annotated dataset that systematically covers a wide range of grammatical phenomena. A tailored evaluation framework assesses performance across detailed syntactic and morphological features. Under matched multi-shot prompting (the basis for cross-model comparisons), Claude-3.5-Sonnet achieved the highest overall F1 score (0.84), followed by GPT-4o (0.83) and Gemini-1.5-Pro (0.77). Conversely, less advanced models such as Claude-2.1 and GPT-3.5-turbo struggled with complex constructions, highlighting persistent linguistic limitations. Multi-shot prompting substantially improved accuracy across proprietary models, yielding gains of up to 18% in complex categories and underscoring the value of in-context learning. Additionally, evaluations of open-source models (DeepSeek-chat-v3-0324 and LLaMA-4-Scout) established baseline performance levels, confirming substantial gaps relative to proprietary models. The findings reveal ongoing challenges such as diacritic sensitivity and semantic ambiguity, while establishing a robust benchmark for Arabic grammatical parsing in general-purpose LLMs. All resources (dataset, codebase, and evaluation outputs) are available at https://github.com/alsabahi2030/Arabic-LLM-Parsing.

• Human-annotated dataset of 1100 Arabic sentences across 11 categories and 34 subtypes
• Tailored evaluation framework assessing Arabic syntactic and morphological parsing
• Comparative analysis of Claude, GPT, Gemini, DeepSeek, and LLaMA-4 on Arabic
• Multi-shot prompting boosts parsing accuracy considerably in complex categories
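As a concrete illustration of the two techniques named above, the following minimal Python sketch builds a multi-shot prompt from exemplar (sentence, iʿrāb) pairs and scores a model's predicted grammatical feature labels against gold annotations with a micro-averaged F1. This is not the authors' released framework: the exemplar sentences, feature labels, and helper names (build_multishot_prompt, feature_f1) are hypothetical placeholders, and the actual dataset schema and scoring logic are in the linked repository.

```python
"""Illustrative sketch: multi-shot prompt assembly and feature-level F1
for Arabic i'rab parsing. All exemplars and labels are hypothetical."""
from collections import Counter

# Hypothetical few-shot exemplars: (sentence, gold i'rab analysis) pairs.
EXEMPLARS = [
    ("ذهب الولدُ إلى المدرسةِ",
     "ذهب: فعل ماضٍ مبني على الفتح | الولدُ: فاعل مرفوع وعلامة رفعه الضمة"),
    ("إنَّ العلمَ نورٌ",
     "إنَّ: حرف توكيد ونصب | العلمَ: اسم إنّ منصوب وعلامة نصبه الفتحة"),
]

def build_multishot_prompt(target_sentence: str) -> str:
    """Concatenate an instruction, the exemplars, and the target sentence."""
    parts = ["Provide the full i'rab (grammatical analysis) of the sentence."]
    for sent, analysis in EXEMPLARS:
        parts.append(f"Sentence: {sent}\nI'rab: {analysis}")
    parts.append(f"Sentence: {target_sentence}\nI'rab:")
    return "\n\n".join(parts)

def feature_f1(gold: list[str], predicted: list[str]) -> float:
    """Micro-averaged F1 over multisets of feature labels
    (e.g. grammatical role, case) extracted from an analysis."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    tp = sum(min(gold_counts[f], pred_counts[f]) for f in gold_counts)
    precision = tp / max(sum(pred_counts.values()), 1)
    recall = tp / max(sum(gold_counts.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(build_multishot_prompt("كتب الطالبُ الدرسَ"))
    # Hypothetical gold vs. predicted feature labels for one sentence.
    gold = ["فاعل", "مرفوع", "مفعول به", "منصوب"]
    pred = ["فاعل", "مرفوع", "مفعول به", "مجرور"]
    print(f"feature F1 = {feature_f1(gold, pred):.2f}")  # 0.75
```

Micro-averaging, used here for simplicity, weights every feature label equally across a sentence, which is one plausible way to arrive at a single overall F1 spanning both syntactic and morphological features as reported above.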
Published in: International Journal of Information Management Data Insights
Volume 6, Issue 1, Article 100404