Recently, several machine learning (ML) algorithms for right-censored data, including Oblique Random Survival Forest (ORSF), have been used to develop risk prediction tools in cardiovascular disease (CVD) and oncology research. ORSF uses a hyperplane to represent a single split, in contrast to the conventional univariate (axis-based) approach used in methods such as Random Survival Forest (RSF). However, ORSF faces the challenge of identifying the relevant features when constructing the hyperplane. Hence, we propose three novel feature selection-based hyperplane variants of ORSF and evaluate their predictive performance: (a) ORSF-LASSO, based on LASSO regression, (b) ORSF-MRMR, based on the Minimum Redundancy Maximum Relevance (MRMR) framework, and (c) ORSF-CARS, based on the correlation-adjusted regression survival (CARS) score. Nine versions of these variants were evaluated against the Penalized Cox Proportional Hazards Model, RSF, and the original ORSF on ten public CVD and oncology datasets using Harrell’s C-index, D-calibration, and the integrated Brier score (IBS). The models were trained using three-fold cross-validation in R version 4.2.1 and the mlr3 ecosystem. The newly proposed models showed high discrimination in the CVD datasets, with ORSF-LASSO-min being the most consistent top performer. Furthermore, in the CVD datasets, one of the proposed models achieved the lowest D-calibration relative to existing models. In the oncology datasets, one or more new models outperformed existing models in three of five datasets. ORSF-MRMR-3q, a novel model, exhibited the lowest D-calibration in two oncology datasets. The sensitivity analysis indicated that the performance of the newly proposed methods was consistent with the primary analysis. Our findings suggest that the three proposed variants have the potential to predict time-to-event outcomes in CVD and oncology prognosis research.
Nonetheless, the proposed methods need to be validated in comprehensive and varied datasets with prolonged follow-up periods and across multiple health domains.
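To illustrate the distinction between a univariate (axis-based) split and an oblique (hyperplane) split, the following is a minimal Python sketch. It is not the authors' implementation (the study reports using R and mlr3); the function names, the toy data, and the weights are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def axis_split(rows: List[Row], feature: int, cutpoint: float) -> Tuple[List[int], List[int]]:
    """Univariate (axis-based) split: route each row using one feature only."""
    left = [i for i, x in enumerate(rows) if x[feature] <= cutpoint]
    right = [i for i, x in enumerate(rows) if x[feature] > cutpoint]
    return left, right


def oblique_split(rows: List[Row], weights: Sequence[float], cutpoint: float) -> Tuple[List[int], List[int]]:
    """Oblique split: route each row using a weighted sum of features (a hyperplane)."""
    scores = [sum(w * v for w, v in zip(weights, x)) for x in rows]
    left = [i for i, s in enumerate(scores) if s <= cutpoint]
    right = [i for i, s in enumerate(scores) if s > cutpoint]
    return left, right


# Hypothetical toy data: four observations, two features each.
rows = [(1.0, 4.0), (2.0, 1.0), (3.0, 3.0), (4.0, 0.0)]

# Axis-based split on feature 0 at an illustrative cutpoint.
axis_left, axis_right = axis_split(rows, feature=0, cutpoint=2.5)

# Oblique split with illustrative weights; a zero weight would drop that feature.
oblique_left, oblique_right = oblique_split(rows, weights=(1.0, 1.0), cutpoint=4.5)

print(axis_left, axis_right)
print(oblique_left, oblique_right)
```

In this sketch, a weight of zero removes a feature from the hyperplane, which is one way to picture what a feature selection step contributes when the split is constructed. How the weights and cutpoint are actually estimated in each proposed variant is not described in this section and is not reproduced here.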