Relevance. Cardiovascular diseases remain a leading cause of mortality globally, creating a high demand for automated diagnostic systems. However, the development of reliable machine learning models for electrocardiogram (ECG) analysis is often hindered by the fact that only small, imbalanced datasets are available, which limits the effectiveness of deep learning approaches. The object of research is the process of automated processing and classification of electrocardiographic signals for diagnostic purposes. The subject of the research comprises methods of beat-centric feature extraction, patient-level aggregation strategies, and machine learning algorithms for cardiovascular risk prediction. The purpose of this paper is to develop and evaluate a reliable classification framework, optimized for small datasets, that increases prediction accuracy by combining patient-level feature aggregation with explainable machine learning models. To achieve this goal, the following tasks were addressed: 1) implementation of a robust preprocessing pipeline using a refined Pan-Tompkins algorithm for precise beat-centric segmentation; 2) development of a statistical feature aggregation strategy to mitigate local signal variability; and 3) optimization and validation of a Random Forest classifier. The methodology includes digital signal processing (Butterworth filtering), feature engineering (HRV and wavelet analysis), and 10-fold stratified cross-validation to assess generalization on limited data. Research results. The study proposes a pipeline that begins with standard signal preprocessing, followed by precise R-peak detection and beat-centric segmentation. Physiological features (HRV, wavelet, morphological) are then extracted from individual segments and statistically aggregated at the patient level.
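The patient-level aggregation step described above can be illustrated with a minimal sketch. The abstract does not state which summary statistics are used, so the choice of mean, standard deviation, minimum, and maximum here is an assumption, as are the function name `aggregate_patient_features` and the example feature names `qrs_width_ms` and `r_amplitude_mv`.

```python
from statistics import mean, pstdev


def aggregate_patient_features(beat_features):
    """Collapse a list of per-beat feature dicts into one patient-level dict.

    Assumption: every beat dict carries the same numeric feature names.
    The four summary statistics are illustrative, not taken from the paper.
    """
    if not beat_features:
        raise ValueError("at least one beat is required")

    aggregated = {}
    for name in sorted(beat_features[0]):
        values = [beat[name] for beat in beat_features]
        aggregated[f"{name}_mean"] = mean(values)
        aggregated[f"{name}_std"] = pstdev(values)  # population standard deviation
        aggregated[f"{name}_min"] = min(values)
        aggregated[f"{name}_max"] = max(values)
    return aggregated


# Hypothetical per-beat features for a single patient.
beats = [
    {"qrs_width_ms": 90.0, "r_amplitude_mv": 1.0},
    {"qrs_width_ms": 100.0, "r_amplitude_mv": 1.2},
    {"qrs_width_ms": 110.0, "r_amplitude_mv": 1.1},
]
patient_vector = aggregate_patient_features(beats)
```

Each patient then contributes a single row to the classifier input regardless of how many beats were recorded, which is one plausible way that aggregation could dampen beat-to-beat variability.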
Experiments on a dataset of 164 subjects demonstrated that the proposed patient-level aggregation strategy significantly outperformed traditional segment-based analysis. The final Random Forest model achieved an ROC-AUC score of 0.84. Feature importance analysis confirmed the critical role of Heart Rate Variability (HRV) metrics, particularly SDNN and RMSSD, in differentiating between healthy and high-risk subjects.
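For readers unfamiliar with the two HRV metrics named above, the following sketch computes them from R-peak timestamps using their standard definitions: SDNN is the standard deviation of the intervals between successive normal beats, and RMSSD is the root mean square of successive interval differences. The function names and the sample RR series are hypothetical, and the use of the sample standard deviation (divisor n - 1) for SDNN is an assumed convention; the abstract does not describe the authors' implementation.

```python
import math


def rr_intervals_ms(r_peak_times_s):
    """Convert R-peak timestamps (seconds) to successive RR intervals (ms)."""
    return [(b - a) * 1000.0 for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]


def sdnn(rr_ms):
    """Standard deviation of RR intervals (sample form, divisor n - 1)."""
    n = len(rr_ms)
    if n < 2:
        raise ValueError("SDNN needs at least two RR intervals")
    centre = sum(rr_ms) / n
    return math.sqrt(sum((x - centre) ** 2 for x in rr_ms) / (n - 1))


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    if not diffs:
        raise ValueError("RMSSD needs at least two RR intervals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical RR series in milliseconds.
rr = [800.0, 810.0, 790.0, 800.0]
sdnn_value = sdnn(rr)
rmssd_value = rmssd(rr)
```

SDNN reflects overall variability across the recording, while RMSSD emphasizes short-term, beat-to-beat changes, so the two metrics carry complementary information.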
Published in: Innovative technologies and scientific solutions for industries