We read with great interest the article by Correia-Neto et al. [1], “Machine Learning for Predicting Malignant Transformation in Actinic Cheilitis: A Prognostic Support System Based on Demographic and Clinical Descriptors.” The authors should be commended for addressing an important and understudied question: whether readily available clinical and demographic descriptors can be leveraged through machine learning (ML) to support prognostication in actinic cheilitis (AC). Given the clinical uncertainty surrounding malignant transformation (MT) in AC, this effort is timely and potentially impactful. At the same time, the article raises broader methodological and conceptual issues regarding the use of ML in small, imbalanced clinical datasets.

Of the 340 included patients, only 18 experienced MT (5.29%). To address class imbalance, the authors applied ADASYN oversampling, expanding the minority class to 319 synthetic MT instances. While this approach is technically sound within the ML literature, it fundamentally alters the empirical structure of the dataset: the model is effectively trained on a synthetic distribution in which the minority class is no longer rare [2]. The reported performance metrics (96.72% accuracy and an AUC of 0.9498 for XGBoost) must therefore be interpreted with caution. When synthetic cases approach parity with real cases, there is a substantial risk that the classifier learns the geometry of artificially generated data rather than the biological signal underlying malignant transformation.

Relatedly, the use of five-fold cross-validation on a dataset augmented by synthetic samples may inflate internal validity without guaranteeing external robustness. Because synthetic cases are derived from the original minority class, there is a non-negligible possibility that highly similar synthetic observations appear in both training and testing folds, even if technically separated by cross-validation. In small clinical datasets, this can lead to optimistic estimates of discrimination [3]. The absence of an external validation cohort is therefore a critical limitation, particularly when the authors suggest that these models may be “ready to be employed as a support system.”

Another issue concerns the conceptual framing of prediction versus rediscovery of known risk factors. SHAP analysis identified ulceration, multifocality, and longer duration as dominant predictors of MT. These are well-established clinical indicators of higher malignant risk in AC. While the convergence between SHAP outputs and prior knowledge supports face validity, it also invites the question of incremental value: does the ML model meaningfully outperform a well-calibrated multivariable logistic regression built on the same descriptors? No direct comparison with conventional statistical modeling is provided. In a dataset of this size, especially with only 18 true MT events, simpler and more transparent approaches may offer comparable performance with greater interpretability and fewer assumptions [4]. Both concerns (confining resampling to training folds and benchmarking against a logistic baseline) are illustrated in the sketch below.
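To make these points concrete, the following minimal sketch (our own illustration, not the authors' pipeline) applies ADASYN only inside the training portion of each cross-validation fold and benchmarks XGBoost against a plain logistic regression. It assumes Python with scikit-learn, imbalanced-learn, and xgboost, and uses simulated data with an event rate similar to the cohort's, since the original dataset is not available to us.

```python
# Leakage-free evaluation of a rare-event classifier: a sketch, not the
# authors' pipeline. Assumptions: simulated data standing in for the
# 340-patient cohort (~5% event rate); illustrative hyperparameters.
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Simulated stand-in for the cohort: 340 patients, roughly 5% with MT.
X, y = make_classification(n_samples=340, n_features=8, n_informative=4,
                           weights=[0.95], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Placing ADASYN inside the pipeline means synthetic minority cases are
# generated from the training fold only; every held-out fold used for
# scoring contains untouched real cases.
models = {
    "logistic regression": Pipeline([
        ("scale", StandardScaler()),
        ("adasyn", ADASYN(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "xgboost": Pipeline([
        ("adasyn", ADASYN(random_state=0)),
        ("clf", XGBClassifier(n_estimators=200, max_depth=3,
                              eval_metric="logloss")),
    ]),
}

for name, pipe in models.items():
    res = cross_validate(pipe, X, y, cv=cv,
                         scoring=["roc_auc", "neg_brier_score"])
    auc = res["test_roc_auc"]
    brier = -res["test_neg_brier_score"]  # lower means better calibration
    print(f"{name}: AUC {auc.mean():.3f} (+/- {auc.std():.3f}), "
          f"Brier {brier.mean():.3f}")

# The protocol the letter cautions against would instead oversample first
# (ADASYN().fit_resample(X, y)) and cross-validate afterwards, letting
# near-duplicate synthetic cases straddle training and testing folds.
```

Under this protocol, both discrimination (AUC) and a simple calibration measure (the Brier score) are estimated on real cases only; any gap relative to figures obtained after global oversampling quantifies the optimism described above.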
The temporal structure of the data also deserves attention. Patients classified as “no MT” were required to have only a minimum follow-up of one year. Given that malignant transformation in AC may occur over prolonged periods, some individuals labeled as negative may in fact represent delayed positives [5]. In ML terms, this introduces potential label noise that is not random but time-dependent. Models trained under such conditions risk learning short-term correlates of biopsy or referral patterns rather than true long-term transformation risk.

The study intentionally excluded treatment variables because of a lack of standardization. While understandable, this omission may limit the clinical interpretability of the predictions. If sunscreen adherence, lesion excision, or behavioral modification influence MT risk, then the model's predictions implicitly assume a static management context. A prognostic support system deployed in real practice would need to account for dynamic interventions; otherwise, its outputs may reflect historical management patterns rather than intrinsic lesion biology.

More broadly, this article highlights a recurring tension in contemporary oral pathology: high-performance ML metrics derived from relatively small, retrospective, single-center datasets are increasingly interpreted as evidence of clinical readiness. Yet discrimination alone is insufficient. Calibration, decision-curve analysis, and clinical utility assessment are essential to determine whether a given probability threshold changes management in a way that improves outcomes. Without such analyses, even an AUC approaching 0.95 does not establish that the tool will alter surveillance intervals, biopsy decisions, or patient prognosis.

None of these considerations diminishes the importance of the authors' contribution. On the contrary, this work underscores the need for multi-institutional, longitudinal AC registries capable of supporting robust external validation and head-to-head comparisons between ML and traditional modeling. It also invites the field to adopt reporting standards tailored to AI in clinical research, ensuring transparency in preprocessing, resampling, and validation strategies.

Correia-Neto et al. [1] have opened an important conversation about ML-driven risk stratification in actinic cheilitis. We believe that advancing this line of inquiry will require not only increasingly sophisticated algorithms, but also careful attention to dataset structure, event rates, temporal validity, and demonstrable clinical utility. In this respect, the article is as much a stimulus for methodological reflection as it is a contribution to prognostic modeling in oral potentially malignant disorders.

The author contributed to the conception, analysis, interpretation of data, and drafting of the manuscript.

The authors declare no conflicts of interest.

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.