ICD coding of death certificates with generative language models

2026 · 0 citations · Journal Article · Gold Open Access

Authors

Isabel Coutinho · Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento

Bruno Martins · Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento

Afonso Moreira · Direção Geral do Território

André Peralta‐Santos · Direção Geral do Território

Abstract

Although large language models can achieve remarkable results in most text generation tasks, these models have been less used in text classification problems, of which ICD coding of clinical documents is one example. In this work, we propose different strategies to adapt a LLaMA generative language model to the ICD coding task. In one such strategy, we only use a language modeling objective for training, followed by constrained decoding at inference time, rather than fine-tuning the model for discriminative classification. We specifically use free-text descriptions in Portuguese death certificates to train a relatively small LLaMA model for assigning ICD codes to the underlying cause of death, and we compare it against a BERT encoder model, which is typically used to address text classification tasks. Experiments show that generative language models can achieve strong results in ICD coding of death certificates, with a classification accuracy that is at least in line with the results obtained using encoder models. We thus demonstrate that language generation can be a suitable approach for ICD coding, allowing for multiple related tasks, such as coding the underlying or the multiple causes contributing to a death, to be performed with a single unified model.
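
The abstract mentions constrained decoding at inference time but does not describe its mechanics. The sketch below illustrates the general idea only and is not the authors' implementation: a prefix tree over the valid codes restricts which symbol may be generated next, so the output is always a valid code. Several things here are assumptions made for illustration: single characters stand in for model tokens, `toy_scores` is a hypothetical stand-in for a language model's next-token scores, `END` is a hypothetical end marker, and the four ICD-10 codes (written without the dot) are an arbitrary sample.

```python
from typing import Callable, Dict, Iterable, List

END = "<eos>"  # hypothetical end-of-code marker (assumption)

# Small illustrative sample of ICD-10 codes, written without the dot.
VALID_CODES = ["I219", "I251", "J189", "C349"]


def build_trie(codes: Iterable[str]) -> dict:
    """Build a character-level prefix tree over the valid codes."""
    root: dict = {}
    for code in codes:
        node = root
        for symbol in list(code) + [END]:
            node = node.setdefault(symbol, {})
    return root


def constrained_greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]], trie: dict
) -> str:
    """Greedy decoding that only considers symbols allowed by the trie."""
    node, prefix = trie, []
    while True:
        scores = score_fn(prefix)
        allowed = list(node.keys())  # continuations that keep the code valid
        best = max(allowed, key=lambda s: scores.get(s, float("-inf")))
        if best == END:
            return "".join(prefix)
        prefix.append(best)
        node = node[best]


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a model: fixed scores, ignoring the prefix.

    "X" scores highest, so unconstrained greedy decoding would emit it,
    but it is never a valid continuation and is therefore never chosen.
    """
    return {"X": 5.0, "5": 3.0, "I": 2.0, "J": 1.0, "2": 1.0,
            "C": 0.5, "1": 0.5, "9": 0.2, "8": 0.1, "3": 0.1,
            "4": 0.1, END: 0.0}


decoded = constrained_greedy_decode(toy_scores, build_trie(VALID_CODES))
print(decoded)  # I251
```

With a real model, the same restriction would be applied over subword token IDs rather than characters. In the Hugging Face Transformers library, for instance, the `prefix_allowed_tokens_fn` argument of `generate` accepts a callback of this kind; whether the paper uses that mechanism is not stated in the abstract.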

Topics & Keywords

Machine Learning in Healthcare · Topic Modeling · Authorship Attribution and Profiling

UN Sustainable Development Goals

Reduced inequalities

Publication Details

Published in: PLOS Digital Health

Volume 5, Issue 2, e0001245

DOI: 10.1371/journal.pdig.0001245

Field-Weighted Citation Impact: 0.00