This study develops and evaluates a customized Generative AI chatbot designed to enhance access to large-scale educational data. The chatbot assists researchers and policymakers in exploring complex datasets, such as NAEP, through natural language queries. It was built on a Retrieval-Augmented Generation (RAG) framework that integrates multiple specialized agents to retrieve, interpret, and synthesize educational data; one agent was selected as a case study for performance evaluation. The study compared an automated Large Language Model (LLM)-based evaluation ("LLM-as-a-judge") with human expert ratings to examine validity and consistency across three criteria: correctness, completeness, and communication quality. A total of 141 expert-generated questions reflecting typical user queries were used, each accompanied by a reference answer and source documentation, and the chatbot's responses were rated on the three dimensions by human experts. For the automated evaluation, the judging model was given the rubric, the human-written reference answers, and the retrieved RAG content to generate quality assessments. Interrater reliability among human raters and between human raters and the LLM-as-a-judge was computed with quadratic weighted kappa (QWK). Findings showed that the LLM-as-a-judge approach achieved agreement levels comparable to those among human raters across all evaluation dimensions. Reliability analyses revealed no significant differences between inter-human and human-to-LLM agreement, except in the communication dimension, where human-to-LLM consistency was higher. These results indicate that the LLM-as-a-judge method can serve as a viable and consistent alternative to human evaluation for assessing customized RAG-based chatbots. Integrating LLM-based evaluation into the assessment of Generative AI chatbots provides a scalable, reliable, and cost-effective complement to traditional human review. With human oversight for calibration and validation, this approach enables more efficient and consistent evaluation practices, advancing AI tools that broaden access to large-scale educational data.
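The abstract specifies only the inputs supplied to the judging model, not how they were assembled. The sketch below illustrates one plausible way to combine the rubric, the reference answer, and the retrieved RAG content into a single grading prompt. All names (EvalItem, build_judge_prompt, the rubric wording, the 1-5 scale) are hypothetical illustrations, not the study's actual implementation.

```python
# Hypothetical sketch of judge-prompt assembly for LLM-as-a-judge scoring.
# Field names, rubric text, and the scoring scale are assumptions for
# illustration; they are not taken from the study.
from dataclasses import dataclass

RUBRIC = """Rate the chatbot response on a 1-5 scale for each dimension:
- Correctness: factual accuracy against the reference answer and sources.
- Completeness: coverage of every part of the question.
- Communication: clarity and appropriateness for the intended audience."""

@dataclass
class EvalItem:
    question: str           # expert-generated question
    reference_answer: str   # human-written reference answer
    retrieved_context: str  # content retrieved by the RAG pipeline
    chatbot_response: str   # response under evaluation

def build_judge_prompt(item: EvalItem) -> str:
    """Combine rubric, reference answer, and retrieved RAG content
    into one grading prompt for the judging LLM."""
    return (
        f"{RUBRIC}\n\n"
        f"Question:\n{item.question}\n\n"
        f"Reference answer:\n{item.reference_answer}\n\n"
        f"Retrieved context:\n{item.retrieved_context}\n\n"
        f"Chatbot response to grade:\n{item.chatbot_response}\n\n"
        "Return one integer score (1-5) per dimension as JSON, e.g. "
        '{"correctness": 4, "completeness": 3, "communication": 5}.'
    )
```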
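Quadratic weighted kappa, the reliability statistic used above, has a standard definition. With k ordered rating categories, observed counts O_ij of rater pairs assigning scores (i, j), and chance-expected counts E_ij derived from each rater's marginal distribution:

```latex
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(k-1)^2}
```

The quadratic weights penalize disagreements in proportion to the squared distance between categories, so a one-point difference on the rubric costs far less agreement than a large rating gap.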
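In practice the statistic is available off the shelf; a minimal Python sketch using scikit-learn, with invented scores purely for illustration:

```python
# Minimal QWK computation, assuming ratings are parallel lists of integer
# scores on the same scale. The example scores are invented, not study data.
from sklearn.metrics import cohen_kappa_score

human_rater = [5, 4, 3, 5, 2, 4]  # hypothetical scores from one human rater
llm_judge = [5, 4, 4, 5, 2, 3]    # hypothetical scores from the LLM judge

# weights="quadratic" turns Cohen's kappa into quadratic weighted kappa.
qwk = cohen_kappa_score(human_rater, llm_judge, weights="quadratic")
print(f"QWK (human vs. LLM-as-a-judge): {qwk:.2f}")
```

The same call, applied to each pair of human raters, yields the inter-human baseline against which human-to-LLM agreement is compared.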