This study develops and evaluates a customized Generative AI chatbot designed to enhance access to large-scale educational data. The chatbot assists researchers and policymakers in exploring complex datasets, such as NAEP, through natural language queries. It was built on a Retrieval-Augmented Generation (RAG) framework that integrates multiple specialized agents to retrieve, interpret, and synthesize educational data; one agent was selected as a case study for performance evaluation. The study compared an automated Large Language Model (LLM)-based evaluation ("LLM-as-a-judge") with human expert ratings to examine validity and consistency across three criteria: correctness, completeness, and communication quality. A total of 141 expert-generated questions reflecting typical user queries were used, each accompanied by a reference answer and source documentation, and the chatbot's responses were rated on the three dimensions by human experts. For the automated evaluation, the judging model was given the rubric, the human-written reference answers, and the retrieved RAG content to generate quality assessments. Interrater reliability among human raters and between human raters and the LLM-as-a-judge was computed with quadratic weighted kappa (QWK). Findings showed that the LLM-as-a-judge approach achieved agreement levels comparable to those among human raters across all evaluation dimensions. Reliability analyses revealed no significant differences between inter-human and human-to-LLM agreement, except in the communication dimension, where human-to-LLM consistency was higher. These results indicate that the LLM-as-a-judge method can serve as a viable and consistent alternative to human evaluation for assessing customized RAG-based chatbots. Integrating LLM-based evaluation into the assessment of Generative AI chatbots provides a scalable, reliable, and cost-effective complement to traditional human review. With human oversight for calibration and validation, this approach enables more efficient and consistent evaluation practices, advancing AI tools that broaden access to large-scale educational data.
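The abstract specifies only the inputs supplied to the judging model, not how they were assembled. The sketch below illustrates one plausible way to combine the rubric, the reference answer, and the retrieved RAG content into a single grading prompt. All names (EvalItem, build_judge_prompt, the rubric wording, the 1-5 scale) are hypothetical illustrations, not the study's actual implementation.

```python
# Hypothetical sketch of judge-prompt assembly for LLM-as-a-judge scoring.
# Field names, rubric text, and the scoring scale are assumptions for
# illustration; they are not taken from the study.
from dataclasses import dataclass

RUBRIC = """Rate the chatbot response on a 1-5 scale for each dimension:
- Correctness: factual accuracy against the reference answer and sources.
- Completeness: coverage of every part of the question.
- Communication: clarity and appropriateness for the intended audience."""

@dataclass
class EvalItem:
    question: str           # expert-generated question
    reference_answer: str   # human-written reference answer
    retrieved_context: str  # content retrieved by the RAG pipeline
    chatbot_response: str   # response under evaluation

def build_judge_prompt(item: EvalItem) -> str:
    """Combine rubric, reference answer, and retrieved RAG content
    into one grading prompt for the judging LLM."""
    return (
        f"{RUBRIC}\n\n"
        f"Question:\n{item.question}\n\n"
        f"Reference answer:\n{item.reference_answer}\n\n"
        f"Retrieved context:\n{item.retrieved_context}\n\n"
        f"Chatbot response to grade:\n{item.chatbot_response}\n\n"
        "Return one integer score (1-5) per dimension as JSON, e.g. "
        '{"correctness": 4, "completeness": 3, "communication": 5}.'
    )
```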
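Quadratic weighted kappa, the reliability statistic used above, has a standard definition. With k ordered rating categories, observed counts O_ij of rater pairs assigning scores (i, j), and chance-expected counts E_ij derived from each rater's marginal distribution:

```latex
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(k-1)^2}
```

The quadratic weights penalize disagreements in proportion to the squared distance between categories, so a one-point difference on the rubric costs far less agreement than a large rating gap.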
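In practice the statistic is available off the shelf; a minimal Python sketch using scikit-learn, with invented scores purely for illustration:

```python
# Minimal QWK computation, assuming ratings are parallel lists of integer
# scores on the same scale. The example scores are invented, not study data.
from sklearn.metrics import cohen_kappa_score

human_rater = [5, 4, 3, 5, 2, 4]  # hypothetical scores from one human rater
llm_judge = [5, 4, 4, 5, 2, 3]    # hypothetical scores from the LLM judge

# weights="quadratic" turns Cohen's kappa into quadratic weighted kappa.
qwk = cohen_kappa_score(human_rater, llm_judge, weights="quadratic")
print(f"QWK (human vs. LLM-as-a-judge): {qwk:.2f}")
```

The same call, applied to each pair of human raters, yields the inter-human baseline against which human-to-LLM agreement is compared.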