Representing Injuries in Trauma Patients: Development and Evaluation of Embeddings for Injuries

2026 · 0 citations · Journal Article · Green Open Access

Authors

Kelvin Szolnoky · Karolinska Institutet

Jonatan Attergrim · Karolinska University Hospital

Awais Ashfaq · Hallands sjukhus Halmstad

Henrik Linusson · Varberg Hospital

Martin Gerdin Wärnberg · Karolinska University Hospital

Johanna Berg · Malmö University

Abstract

Background: Trauma patients present with heterogeneous injury patterns that are challenging to represent in statistical models. Traditional approaches either use high-dimensional one-hot encoding, resulting in sparse features, or aggregate injuries into summary scores that lose patient-specific detail. This study developed data-driven ICD-10 embeddings for trauma injuries and evaluated their ability to preserve injury information.

Methods: Using the National Trauma Data Bank, we trained autoencoder models on all trauma patients from 2018 to generate dense vector representations of ICD-10 injury codes. We evaluated embeddings of dimensions 2, 4, 8, 16, and 32 against one-hot encoding using three prediction tasks: in-hospital mortality, emergency department disposition, and blood transfusion within 24 hours. For each hospital included, we trained separate logistic regression and LightGBM models using 2018 data from that hospital, then evaluated performance on 2019 data from the same hospital. Performance was measured using area under the receiver operating characteristic curve (AUC) and stratified by hospital size.

Results: In LightGBM models, 8-dimensional embeddings improved AUC over one-hot encoding by 0.08 (95% CI: 0.06, 0.10) in small hospitals, 0.03 (0.02, 0.04) in medium hospitals, and 0.02 (0.01, 0.02) in large hospitals, with comparable performance in major hospitals (0.00 [-0.01, 0.01]). In logistic regression, 32-dimensional embeddings improved AUC by 0.03 (0.01, 0.05), 0.02 (0.01, 0.03), and 0.02 (0.02, 0.03) in small, medium, and large hospitals respectively, with similar performance in major hospitals (0.01 [0.00, 0.01]).

Conclusion: ICD-10 code injury embeddings with ≥8 dimensions preserve clinically relevant information and can outperform one-hot encoding while reducing dimensionality. The embeddings and software are openly available to support further trauma research and applications.
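The contrast the abstract draws between one-hot encoding and dense embeddings, and the AUC metric used to compare them, can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the four-code vocabulary, the 2-dimensional embedding values, and the mean-pooling step used to combine a patient's codes are assumptions made for illustration, not the authors' released embeddings or their actual method.

```python
from typing import Dict, List, Sequence

# Illustrative ICD-10 injury code categories; a real vocabulary would hold
# every injury code observed in the training data.
VOCAB: List[str] = ["S06.5", "S27.0", "S36.0", "S72.0"]

# Hypothetical 2-dimensional embeddings. Real values would come from a
# trained autoencoder; these numbers are made up for the example.
EMBEDDINGS: Dict[str, List[float]] = {
    "S06.5": [0.9, 0.1],
    "S27.0": [0.2, 0.8],
    "S36.0": [0.1, 0.7],
    "S72.0": [0.5, 0.5],
}


def multi_hot(codes: Sequence[str], vocab: Sequence[str]) -> List[int]:
    """One column per vocabulary code: sparse, and as wide as the vocabulary."""
    present = set(codes)
    return [1 if code in present else 0 for code in vocab]


def mean_embedding(codes: Sequence[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of a patient's codes (mean pooling is an assumption)."""
    if not codes:
        raise ValueError("patient has no injury codes")
    dim = len(next(iter(table.values())))
    totals = [0.0] * dim
    for code in codes:
        for i, value in enumerate(table[code]):
            totals[i] += value
    return [total / len(codes) for total in totals]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A patient with two injuries, represented both ways.
patient_codes = ["S27.0", "S36.0"]
sparse_features = multi_hot(patient_codes, VOCAB)            # length == len(VOCAB)
dense_features = mean_embedding(patient_codes, EMBEDDINGS)   # length == embedding dim
```

The one-hot vector grows with the number of distinct injury codes, while the embedded vector keeps a fixed length regardless of vocabulary size. A comparison like the one reported in the Results would compute `auc` on held-out predictions from a model trained on each representation and take the difference.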

Topics & Keywords

Trauma and Emergency Care Studies · Medical Coding and Health Information · Injury Epidemiology and Prevention

UN Sustainable Development Goals

Good health and well-being

Publication Details

Published in: medRxiv

DOI: 10.64898/2026.01.03.26343379

Field-Weighted Citation Impact: 0.00