A copula based supervised filter for feature selection in machine learning driven diabetes risk prediction

2026 · 0 citations · Journal Article · Gold Open Access

Authors

Agnideep Aich · University of Louisiana at Lafayette

Md. Monzur Murshed · Minnesota State University, Mankato

Sameera Hewage · West Liberty University

Amanda S. Mayeaux · University of Louisiana at Lafayette

Abstract

Effective feature selection is critical for building robust and interpretable predictive models, particularly in medical applications where identifying risk factors in the most extreme patient strata is essential. Traditional methods often focus on average associations, potentially overlooking predictors whose importance is concentrated in the tails of the data distribution. In this study, we introduce a novel, computationally efficient supervised filter that leverages a Gumbel copula implied upper-tail concordance score ([Formula: see text], a monotone transformation of Kendall's τ) to rank features by their tendency to be simultaneously extreme with the positive class. We evaluated this method against four standard baselines (Mutual Information, mRMR, ReliefF, and L1/Elastic-Net) across four classifiers on two diabetes datasets: a large-scale public health survey (CDC, [Formula: see text]) and a classic clinical benchmark (PIMA, [Formula: see text]). Our analysis included comprehensive statistical tests, permutation importance, and robustness checks. On the CDC dataset, our method was the fastest selector and reduced the feature space by ≈52%. While this resulted in a minimal but statistically significant performance trade-off compared to using all 21 features, our filter significantly outperformed standard filters (Mutual Information, mRMR) and was statistically indistinguishable from the strong ReliefF baseline. On the PIMA dataset (8 predictors), our method's ranking produced the numerically highest ROC-AUC, despite paired DeLong tests showing no statistically significant differences versus strong baselines. PIMA thus serves as a ranking-only sanity check that our upper-tail criterion behaves sensibly in a low-dimensional clinical setting. Across both datasets, the Gumbel-[Formula: see text] selector consistently identified clinically coherent and impactful predictors. 
We conclude that feature selection via upper-tail dependence is an efficient and interpretable screening approach that can complement standard feature-selection baselines in public health and clinical risk prediction.
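
The ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so this assumes the textbook Gumbel copula relations: Kendall's τ = 1 − 1/θ and upper-tail dependence λ_U = 2 − 2^(1/θ), which combine to λ_U = 2 − 2^(1 − τ) for τ in [0, 1]. The function names, the tie-corrected tau-b estimator, the clipping of negative τ to zero, and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Tie-corrected Kendall's tau-b, computed naively in O(n^2)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from tau-b
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom > 0 else 0.0


def gumbel_upper_tail_score(tau: float) -> float:
    """Map tau to the Gumbel-implied lambda_U = 2 - 2**(1 - tau).

    Assumption: the Gumbel family only covers tau >= 0, so negative
    values are clipped to 0 (score 0) in this sketch.
    """
    tau = min(max(tau, 0.0), 1.0)
    return 2.0 - 2.0 ** (1.0 - tau)


def rank_features(
    features: Dict[str, Sequence[float]], y: Sequence[float], k: int
) -> List[Tuple[str, float]]:
    """Score every feature against the binary label and keep the top k."""
    scored = [
        (name, gumbel_upper_tail_score(kendall_tau_b(values, y)))
        for name, values in features.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy data: 8 patients, label 1 = positive class.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "glucose": [1, 2, 3, 4, 5, 6, 7, 8],   # rises with the label
    "noise": [4, 1, 3, 2, 2, 3, 1, 4],     # unrelated to the label
    "inverse": [8, 7, 6, 5, 4, 3, 2, 1],   # falls with the label
}
ranked = rank_features(toy_features, labels, k=3)
```

Because the score is a monotone function of τ, the ordering matches a ranking by non-negative τ. The transformation only changes the scale, expressing each score as an implied upper-tail dependence between 0 and 1.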

Topics & Keywords

Artificial Intelligence in Healthcare · Machine Learning in Healthcare · Imbalanced Data Classification Techniques

Publication Details

Published in: Scientific Reports

DOI: 10.1038/s41598-026-41874-9

Field-Weighted Citation Impact: 0.00
