This paper presents the first systematic evaluation of ArkTS code generation with large language models. Our benchmark comprises 300 prompts across three difficulty levels and measures Pass@1, compilation rate, and generation time in milliseconds; it maps compiler messages into syntax, type, undefined-reference, and other failure categories, and adds an independent LLM judge with a fixed scoring rubric. Evaluating 21 models, we find that functional correctness remains low and compilation rates vary widely: DeepSeek-R1 reaches 22.7% Pass@1, Claude-3.7-Sonnet 13.7%, and Gemini-2.0-Pro-Experimental-02-05 12.5%, while several widely used systems return no correct ArkTS solutions at Pass@1. Performance degrades from Easy to Hard, and degrades again when moving from algorithms and data structures to API usage and UI design, indicating that reactive state and lifecycle handling remain difficult. Low syntax error rates often coincide with frequent undefined references and type mismatches, which keep Pass@1 low even when code compiles, and short generation time does not guarantee higher compilation success. A small held-out study that adds one compact ArkTS example with explicit imports and types, then applies a single compiler-guided repair step, raises Pass@1 from 23.3% to 36.7% at a modest latency cost; the largest gains come from fewer missing imports, clearer public signatures, and quick fixes to bracket balance and unresolved symbols. These observations point to practical gaps in ArkTS typing, imports, and lifecycle usage that must be addressed before code generation becomes dependable.