Parsing Arabic sentences, specifically iʿrāb, poses unique challenges due to the language's intricate morphology, diverse syntactic structures, and rich contextual nuances. This study evaluates the performance of leading general-purpose Large Language Models (LLMs) in Arabic iʿrāb parsing using a novel human-annotated dataset that systematically covers a wide range of grammatical phenomena. A tailored evaluation framework assesses performance across detailed syntactic and morphological features. Under matched multi-shot prompting (the basis for cross-model comparisons), Claude-3.5-Sonnet achieved the highest overall F1 score (0.84), followed by GPT-4o (0.83) and Gemini-1.5-Pro (0.77). Conversely, less advanced models such as Claude-2.1 and GPT-3.5-turbo struggled with complex constructions, highlighting persistent linguistic limitations. Multi-shot prompting substantially improved accuracy across proprietary models, yielding gains of up to 18% in complex categories and underscoring the value of in-context learning. Additionally, evaluations of open-source models (DeepSeek-chat-v3-0324 and LLaMA-4-Scout) established baseline performance levels, confirming substantial gaps relative to proprietary models. The findings reveal ongoing challenges such as diacritic sensitivity and semantic ambiguity, while establishing a robust benchmark for Arabic grammatical parsing in general-purpose LLMs. All resources (dataset, codebase, and evaluation outputs) are available at https://github.com/alsabahi2030/Arabic-LLM-Parsing.

• Human-annotated dataset of 1100 Arabic sentences across 11 categories and 34 subtypes
• Tailored evaluation framework assessing Arabic syntactic and morphological parsing
• Comparative analysis of Claude, GPT, Gemini, DeepSeek, and LLaMA-4 on Arabic
• Multi-shot prompting boosts parsing accuracy considerably in complex categories
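As a concrete illustration of the two techniques named above, the following minimal Python sketch builds a multi-shot prompt from exemplar (sentence, iʿrāb) pairs and scores a model's predicted grammatical feature labels against gold annotations with a micro-averaged F1. This is not the authors' released framework: the exemplar sentences, feature labels, and helper names (build_multishot_prompt, feature_f1) are hypothetical placeholders, and the actual dataset schema and scoring logic are in the linked repository.

```python
"""Illustrative sketch: multi-shot prompt assembly and feature-level F1
for Arabic i'rab parsing. All exemplars and labels are hypothetical."""
from collections import Counter

# Hypothetical few-shot exemplars: (sentence, gold i'rab analysis) pairs.
EXEMPLARS = [
    ("ذهب الولدُ إلى المدرسةِ",
     "ذهب: فعل ماضٍ مبني على الفتح | الولدُ: فاعل مرفوع وعلامة رفعه الضمة"),
    ("إنَّ العلمَ نورٌ",
     "إنَّ: حرف توكيد ونصب | العلمَ: اسم إنّ منصوب وعلامة نصبه الفتحة"),
]

def build_multishot_prompt(target_sentence: str) -> str:
    """Concatenate an instruction, the exemplars, and the target sentence."""
    parts = ["Provide the full i'rab (grammatical analysis) of the sentence."]
    for sent, analysis in EXEMPLARS:
        parts.append(f"Sentence: {sent}\nI'rab: {analysis}")
    parts.append(f"Sentence: {target_sentence}\nI'rab:")
    return "\n\n".join(parts)

def feature_f1(gold: list[str], predicted: list[str]) -> float:
    """Micro-averaged F1 over multisets of feature labels
    (e.g. grammatical role, case) extracted from an analysis."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    tp = sum(min(gold_counts[f], pred_counts[f]) for f in gold_counts)
    precision = tp / max(sum(pred_counts.values()), 1)
    recall = tp / max(sum(gold_counts.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(build_multishot_prompt("كتب الطالبُ الدرسَ"))
    # Hypothetical gold vs. predicted feature labels for one sentence.
    gold = ["فاعل", "مرفوع", "مفعول به", "منصوب"]
    pred = ["فاعل", "مرفوع", "مفعول به", "مجرور"]
    print(f"feature F1 = {feature_f1(gold, pred):.2f}")  # 0.75
```

Micro-averaging, used here for simplicity, weights every feature label equally across a sentence, which is one plausible way to arrive at a single overall F1 spanning both syntactic and morphological features as reported above.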
Published in: International Journal of Information Management Data Insights
Volume 6, Issue 1, Article 100404