Search for a command to run...
Purpose: Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study developed the large language model (LLM)-based automated Quality Assurance (QA) for auto-contouring (LAQUA) system using a multimodal LLM, Gemini 2.5 Pro, and evaluated its feasibility as a clinical primary screening tool to streamline the QA workflow. Methods: Twenty male pelvic CT scans from an open dataset were utilized. Three distinct auto-contouring software packages (OncoStudio, RatoGuide prototype and syngo.via) were evaluated. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input into Gemini 2.5 Pro. The LLM was instructed to rate the contour quality on a 5-point clinical scale (5: Optimal; 4: Acceptable; 3: Suboptimal; 2: Unacceptable; redraw from scratch; 1: Unacceptable; organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ) were calculated. Additionally, to assess screening performance, sensitivity and specificity were calculated by dichotomizing the scores into "Pass" and "Fail" using two different cutoffs (scores ≥ 3 and ≥ 4 as "Pass"). Finally, the alignment of the rationales provided by the LLM with the auto-contouring quality was evaluated by two board-certified radiation oncologists. This was conducted using a Likert scale assessing four domains (error detection, hallucination, clinical relevance, and anatomical understanding), each scored out of 2 points. Results: The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated organs (ρ: 0.567 – 0.835; quadratic weighted κ : 0.639 – 0.804), with the rectum showing the highest correlation. Regarding screening performance, a cutoff of ≥3 as "Pass" achieved the highest sensitivity and specificity in specific subgroups, but with wide 95% confidence intervals (CIs). A cutoff of ≥4 as "Pass" narrowed the CIs, yielding the highest sensitivity in the rectum (0.976) and the highest specificity in the left femoral head (0.933). Qualitatively, the LLM's rationales achieved an overall mean score of 1.70 ± 0.48 (out of 2), with 155 of 291 outputs receiving perfect scores across all criteria. Conclusions: The LAQUA system demonstrated substantial agreement with expert evaluations in AI-based auto-contouring quality assessment. While potential overestimation bias (risk of missing "Fail" cases) warrants caution, the observed sensitivity suggests its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.