This repository contains the machine learning (ML) analysis framework developed at the Institute of Communication and Computer Systems (ICCS), National Technical University of Athens (NTUA), as part of Work Package 3 (Task T3.1) of the EU Horizon project ONCOSCREEN. ONCOSCREEN aims to develop non-invasive, point-of-care cancer screening tools. This software component supports the ONCO-VOC breath analyser, a miniaturised device equipped with an array of 48 molecularly modified gold nanoparticle (GNP) sensors designed to detect volatile organic compounds (VOCs) in exhaled breath that are indicative of colorectal cancer (CRC). The sensor array design is inspired by the mammalian olfactory system, in which overlapping receptor affinities encode complex chemical signatures rather than individual compounds. The framework processes raw sensor time series collected in multi-site clinical trials (Mainz, UKSH Lübeck, and IPO Porto) and implements the full analytical pipeline from signal preprocessing to classification. Key stages are ambient-corrected signal processing; feature extraction (164 features per subject, including AUC, gradient, and phase-specific statistics); statistical filtering via Welch's t-test and the Mann–Whitney U test with Bonferroni correction; dimensionality reduction (PCA, PLS-DA); and nested leave-one-out cross-validation (LOOCV) benchmarking of eleven classifiers, including LightGBM, XGBoost, neural networks, SVM, and logistic regression variants. Applied to Clinical Phase A data (461 breath samples; 47 CRC, 132 healthy no-risk controls, plus intermediate risk groups), the framework achieved ROC-AUC scores ranging from 87.4% to 91.3%, with a best balanced accuracy of 88.0% at the threshold that maximises Youden's J index. These results demonstrate that the ONCO-VOC breath signal carries a robust discriminative signature for CRC that holds across sites and that even linear models can detect.
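The balanced accuracy above is reported at the Youden's J optimal threshold, that is, the decision threshold that maximises J = sensitivity + specificity - 1. The following is a minimal, standard-library sketch of that selection step, not the repository's actual implementation. The function names and the toy labels and scores are hypothetical and chosen only for illustration; they are not drawn from the clinical data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Classify as positive when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in labels")
    return tp / (tp + fn), tn / (tn + fp)


def youden_optimal_threshold(
    labels: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float]:
    """Scan every observed score as a candidate threshold; return (threshold, J)."""
    best_threshold, best_j = None, float("-inf")
    for candidate in sorted(set(scores)):
        sens, spec = sensitivity_specificity(labels, scores, candidate)
        j = sens + spec - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


def balanced_accuracy(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> float:
    """Mean of sensitivity and specificity at the given threshold."""
    sens, spec = sensitivity_specificity(labels, scores, threshold)
    return (sens + spec) / 2.0


if __name__ == "__main__":
    # Hypothetical toy data: 1 = CRC, 0 = control; scores are classifier outputs.
    toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
    threshold, j = youden_optimal_threshold(toy_labels, toy_scores)
    print(threshold, j, balanced_accuracy(toy_labels, toy_scores, threshold))
```

Because balanced accuracy equals (J + 1) / 2, the threshold that maximises J also maximises balanced accuracy. In a cross-validated setting, the threshold would normally be chosen on held-out predictions rather than on the data used to fit the model.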
The software is released as open-source under the MIT License in fulfilment of EU Horizon Open Science requirements. Clinical data are not included due to patient privacy constraints.