Purpose Ensemble methods can enhance segmentation performance, but their effectiveness depends on the integration strategy. We investigated whether the probabilistic framework of the STAPLE (Simultaneous Truth and Performance Level Estimation) algorithm could leverage model diversity arising from different loss functions to improve kidney tumor segmentation accuracy compared with individual models and conventional soft voting.
Methods We utilized CT scans from 210 patients in the KiTS19 dataset with expert-annotated kidney and tumor structures. Five model variants were developed using the nnU-Net framework: two 2D U-Nets and three 3D U-Nets, each trained with a different hybrid loss function ($L_{\mathrm{CE+Dice}}$, $L_{\mathrm{TopK+Dice}}$, $L_{\mathrm{CE+GDice}}$). Five approaches were compared: individual 2D U-Net, individual 3D U-Net, majority voting ensemble, soft voting ensemble, and STAPLE ensemble. Models underwent 5-fold cross-validation, and performance was evaluated on 63 test patients using the Dice similarity coefficient (DSC), Jaccard index (JI), 95th-percentile Hausdorff distance (HD95), precision, and recall. Statistical significance was assessed using Wilcoxon signed-rank tests with Benjamini–Hochberg correction. Generalizability was evaluated on liver tumor segmentation using the LiTS17 dataset.
Results In KiTS19 tumor segmentation, individual model DSCs ranged from 0.64 ± 0.27 (2D models) to 0.70 ± 0.24 (3D models). Majority voting achieved a DSC of 0.70 ± 0.27 and soft voting achieved 0.71 ± 0.26, while STAPLE reached 0.74 ± 0.23 (adjusted p < 0.05). JI improved from 0.53–0.59 (individual models) to 0.63 ± 0.24 (STAPLE). HD95 decreased to 11.81 ± 13.43 with STAPLE. With STAPLE, precision and recall reached 0.88 ± 0.20 and 0.72 ± 0.24, respectively. In LiTS17 liver tumor segmentation, STAPLE similarly outperformed soft voting (DSC: 0.76 ± 0.10 vs. 0.71 ± 0.18, adjusted p < 0.05).
Conclusions The STAPLE algorithm achieved superior performance in primary segmentation metrics compared to individual models, majority voting, and soft voting (STAPLE > soft voting > majority voting), demonstrating the benefits of probabilistic ensemble methods for kidney tumor segmentation. Stratified analysis revealed that STAPLE’s advantage was most pronounced for medium-sized tumors, where performance variability was reduced by 45%. The approach showed consistent effectiveness in liver tumor segmentation, suggesting potential for broader clinical applications.
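To make the difference between the two baseline fusion rules concrete, the following is a minimal sketch of majority voting and soft voting on a toy example, scored with DSC and JI. It is illustrative only and is not the authors' pipeline: the function names, the 0.5 threshold, the three hypothetical models, and the four-voxel flattened "image" are all assumptions introduced here. STAPLE itself, which iteratively estimates each model's reliability, is not implemented in this sketch.

```python
from typing import List, Sequence


def majority_vote(prob_maps: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Binarize each model's probabilities, then keep voxels where more than half of the models vote foreground."""
    n_models = len(prob_maps)
    fused = []
    for voxel_probs in zip(*prob_maps):
        votes = sum(1 for p in voxel_probs if p >= threshold)
        fused.append(1 if 2 * votes > n_models else 0)
    return fused


def soft_vote(prob_maps: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Average the models' probabilities per voxel, then binarize the mean."""
    n_models = len(prob_maps)
    return [1 if sum(voxel_probs) / n_models >= threshold else 0
            for voxel_probs in zip(*prob_maps)]


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 2 * overlap / total if total else 1.0


def jaccard(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return overlap / union if union else 1.0


# Hypothetical foreground probabilities from three models over four voxels.
model_a = [0.90, 0.45, 0.20, 0.60]
model_b = [0.80, 0.45, 0.10, 0.40]
model_c = [0.70, 0.95, 0.30, 0.40]
truth = [1, 1, 0, 0]

maj = majority_vote([model_a, model_b, model_c])
soft = soft_vote([model_a, model_b, model_c])

print(maj, f"DSC={dice(maj, truth):.2f}", f"JI={jaccard(maj, truth):.2f}")
print(soft, f"DSC={dice(soft, truth):.2f}", f"JI={jaccard(soft, truth):.2f}")
```

In this toy case the second voxel is recovered by soft voting but not by majority voting, because one confident model (0.95) outweighs two marginal ones (0.45) once probabilities are averaged. Both rules still weight every model equally; STAPLE differs in that it estimates per-model sensitivity and specificity and weights the votes accordingly.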