Accurate and reliable classification of heart sounds is critical for early screening of cardiovascular abnormalities. However, most existing deep learning approaches rely on fully supervised training and are evaluated primarily under in-distribution settings, limiting their robustness and generalizability in real-world clinical scenarios where data distributions and acquisition conditions vary. In addition, the scarcity of labeled phonocardiogram (PCG) data and the lack of systematic robustness evaluation further hinder the practical deployment of automated heart sound analysis systems. This study proposes a self-supervised ConvNeXt-based framework for heart sound classification that leverages contrastive representation learning to exploit unlabeled PCG recordings prior to task-specific fine-tuning. Log-Mel spectrograms are used as input representations, and the encoder is pretrained using a contrastive objective to learn invariant acoustic features, followed by supervised optimization with a lightweight classification head. To emphasize practical reliability, the model is evaluated using multi-seed experiments, validation-driven threshold calibration, and domain-aware testing protocols. In addition, Grad-CAM is employed to provide visual explanations of model predictions. Experiments conducted on the PhysioNet heart sound dataset demonstrate that the proposed framework achieves stable and consistent performance across multiple runs, with a favorable balance between sensitivity and specificity and a mean accuracy of approximately 94%. The results show that self-supervised pretraining improves representation robustness and reduces sensitivity to initialization and data partitioning. Visual attribution maps further indicate that the model focuses on clinically meaningful time–frequency regions associated with cardiac events, supporting the interpretability and clinical relevance of the proposed approach.
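The abstract states only that the encoder is pretrained "using a contrastive objective" on paired views; the exact loss is not specified. A minimal NT-Xent-style sketch of such an objective, assuming two augmented log-Mel views per recording and L2-normalised encoder embeddings (the function name `nt_xent` and temperature `tau` are illustrative, not from the paper):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """Simplified NT-Xent contrastive loss for paired embeddings.

    z1, z2: (N, D) L2-normalised embeddings of two augmented views
    of the same N recordings; row i of z1 and row i of z2 are positives.
    """
    z = np.concatenate([z1, z2], axis=0)           # (2N, D) stacked views
    sim = z @ z.T / tau                            # temperature-scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    n = z1.shape[0]
    # index of each row's positive partner in the stacked matrix
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimising this loss pulls the two views of each recording together while pushing apart embeddings of different recordings, which is what drives the encoder toward augmentation-invariant acoustic features.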
• A self-supervised ConvNeXt framework is introduced for robust heart sound representation learning from unlabeled PCG data.
• Domain-aware and multi-seed evaluation protocols are employed to assess robustness under distribution shifts.
• Clinically guided threshold calibration achieves a favorable sensitivity–specificity trade-off for screening applications.
• Grad-CAM visual explanations reveal physiologically meaningful acoustic regions supporting model interpretability.
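The paper's exact calibration rule for "validation-driven threshold calibration" is not detailed in the abstract; one common screening-oriented scheme, sketched here under that assumption (the function `calibrate_threshold` and the `min_sensitivity` target are hypothetical), selects the validation-set threshold with the highest specificity among those meeting a minimum sensitivity:

```python
import numpy as np

def calibrate_threshold(y_true, scores, min_sensitivity=0.90):
    """Choose a decision threshold on a validation set.

    y_true: (N,) binary labels (1 = abnormal), scores: (N,) model probabilities.
    Among thresholds whose sensitivity meets the screening target,
    return the one with the highest specificity.
    """
    best_t, best_spec = 0.5, -1.0
    for t in np.unique(scores):                    # candidate cut-points
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        if sens >= min_sensitivity and spec > best_spec:
            best_t, best_spec = t, spec
    return best_t
```

Fixing the threshold on validation data (rather than the default 0.5) is what lets a screening system trade a small amount of specificity for the high sensitivity clinical use requires.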
Published in: Intelligence-Based Medicine
Volume 14, Article 100377