Evaluating Large Language Models on Aerospace Medicine Principles

2025 · 1 citation · Journal Article

Authors

C. Davis · Louisiana State University Health Sciences Center New Orleans

Shawn M. Pickett · Landauer (United States)

Abstract

Introduction: Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting.

Method: To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions.

Results: When queried with 857 free-response questions from Aerospace Medicine Boards Questions and Answers, ChatGPT-4 had a mean reader score from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that potentially could be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions.

Conclusion: There is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.

Topics & Keywords

Artificial Intelligence in Healthcare and Education · Artificial Intelligence in Healthcare · Machine Learning in Healthcare

UN Sustainable Development Goals

Peace, Justice and Strong Institutions

Publication Details

Published in: Wilderness and Environmental Medicine

Volume 36, Issue 1_suppl, pp. 44S-52S

DOI: 10.1177/10806032251330628

Field-Weighted Citation Impact: 0.66