MS-based proteomics offers powerful opportunities for biomarker discovery; nevertheless, it is associated with technical challenges, including missing values and batch effects. Although imputation and batch-correction methods are well established in proteomics, their impact remains incompletely characterized in large-scale clinical proteomics datasets. Here, we examine the practical impact and interaction of three popular imputation methods (Gaussian, ½ LOD, KNN) in combination with three batch-effect correction approaches (ComBat, ComBat with disease covariate, MNN) on differential abundance analysis in a CE-MS urine peptidomics dataset of 1,050 samples across 13 batches from chronic kidney disease (CKD) patients and controls. Downstream effects were assessed based on peptide validation between discovery and validation sets. Imputation method choice had minimal impact on the final list of disease-associated peptides (DAPs), given the missingness structure and normalization strategy. In contrast, batch-effect correction largely affected the results: MNN and especially unadjusted ComBat removed a large proportion of DAPs (~50% and >90%, respectively), whereas inclusion of disease status in the ComBat model largely preserved biological signal. This study highlights how popular preprocessing choices can affect biological signal, showing that imputation and batch-effect correction interact and jointly influence downstream results, underscoring the need for caution when applying batch-effect correction.

STATEMENT OF SIGNIFICANCE OF THE STUDY: Finding reliable biomarkers in clinical proteomics requires addressing the technical noise that can hide true biological signals.
In this work, we examine the practical impact and interaction of commonly used imputation and batch-effect correction methods on the list of peptides that emerge as differentially abundant. Instead of relying on simulations or small datasets, we examine a large, real-world urine-peptidomics cohort of more than 1,000 samples screened for chronic kidney disease. The results demonstrate that, in datasets such as the one used here, different preprocessing strategies can lead to substantially different outcomes. Imputation and batch-effect correction were found to be interdependent, and batch-effect correction can remove meaningful biological differences, highlighting the importance of applying such corrections with caution.
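
The loss of biological signal under covariate-free batch correction can be illustrated with a toy simulation. The sketch below is not ComBat and does not use the study's data. It assumes a single peptide, two batches with unevenly distributed disease status, and purely additive batch offsets. It then compares plain per-batch mean centering with a linear model that includes disease status and removes only the batch term. All numbers (group sizes, effect size, offsets, noise level) are hypothetical and chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: disease status is unevenly distributed across batches.
# Batch 0: 80 cases, 20 controls. Batch 1: 20 cases, 80 controls.
disease = np.array([1] * 80 + [0] * 20 + [1] * 20 + [0] * 80)
batch = np.array([0] * 100 + [1] * 100)

true_effect = 1.0                    # assumed case-control shift (log scale)
batch_shift = np.array([0.0, 2.0])   # assumed additive batch offsets
noise = rng.normal(0.0, 0.2, size=disease.size)
y = true_effect * disease + batch_shift[batch] + noise


def case_control_diff(values):
    """Difference between the mean of cases and the mean of controls."""
    return values[disease == 1].mean() - values[disease == 0].mean()


# Covariate-free correction: subtract each batch's own mean.
# Because batches differ in case fraction, this also removes part of
# the disease effect.
naive = y.copy()
for b in np.unique(batch):
    naive[batch == b] -= y[batch == b].mean()

# Covariate-aware correction: fit y ~ intercept + disease + batch by
# least squares, then subtract only the estimated batch term.
design = np.column_stack([np.ones(y.size), disease, batch])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = y - coef[2] * batch

raw_diff = case_control_diff(y)
naive_diff = case_control_diff(naive)
adjusted_diff = case_control_diff(adjusted)

print(f"raw: {raw_diff:.2f}")
print(f"per-batch centering: {naive_diff:.2f}")
print(f"disease-aware model: {adjusted_diff:.2f}")
```

Under these assumptions, per-batch centering shrinks the pooled case-control difference well below the simulated effect of 1.0 (0.64 in the noise-free limit of this design), while the disease-aware model recovers it closely. ComBat additionally applies empirical Bayes shrinkage and scale adjustment, so this sketch conveys only the direction of the effect, not its magnitude in real data.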