Introduction
Marine debris detection from satellite imagery is challenged by two major factors: extreme class imbalance, with debris pixels accounting for less than 0.01% of image content, and the need for robust generalization across diverse geographic and temporal domains for operational deployment. Although existing methods often report strong within-dataset performance, cross-dataset generalization, where models trained on one dataset are applied to entirely different geographic regions, remains insufficiently investigated.

Methods
To address this limitation, we conducted rigorous bidirectional cross-dataset validation experiments using the MARIDA and MADOS datasets. The problem was reformulated as a binary segmentation task and addressed using a standard U-Net architecture combined with a composite imbalance-aware loss and a rarity-aware sampling strategy. Two experimental settings were considered: training on MARIDA and testing on MADOS, and training on MADOS and testing on MARIDA.

Results
The experiments revealed asymmetric cross-dataset generalization. Models trained on the geographically diverse MADOS dataset achieved an F1-score of 0.890 when tested on MARIDA, corresponding to only a 1.25% decrease from the within-dataset baseline of 0.901. In contrast, models trained on MARIDA achieved an F1-score of 0.833 on MADOS, representing a 7.55% decrease. The average cross-dataset degradation was 4.38%, which is substantially lower than the typical 10–25% performance drops reported in remote sensing domain-shift scenarios. Despite comparable patch counts (2,529 for MADOS versus 2,173 for MARIDA), the superior transferability of MADOS-trained models indicates that geographic diversity across globally distributed tiles is more beneficial than exhaustive annotation within concentrated regions.
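The degradation figures above are relative F1 decreases with respect to a within-dataset baseline. The following minimal sketch recomputes them from the rounded scores quoted in the text. It assumes that the 0.901 baseline applies to both transfer directions, which the abstract states explicitly only for the MADOS-to-MARIDA case; the function and variable names are illustrative. Because the inputs are rounded to three decimals, the recomputed values may differ from the quoted percentages in the second decimal place.

```python
def relative_drop(baseline: float, cross: float) -> float:
    """Relative decrease of `cross` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - cross) / baseline


# Rounded F1-scores quoted in the text.
# Assumption: the same within-dataset baseline is used for both directions.
baseline_f1 = 0.901
cross_f1 = {
    "MADOS->MARIDA": 0.890,  # trained on MADOS, tested on MARIDA
    "MARIDA->MADOS": 0.833,  # trained on MARIDA, tested on MADOS
}

# Per-direction relative degradation and its unweighted mean.
drops = {name: relative_drop(baseline_f1, f1) for name, f1 in cross_f1.items()}
mean_drop = sum(drops.values()) / len(drops)

for name, drop in drops.items():
    print(f"{name}: {drop:.2f}% relative F1 decrease")
print(f"mean: {mean_drop:.2f}%")
```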
Moreover, the MADOS-to-MARIDA cross-dataset F1-score of 0.890 exceeded MAP-Mapper's within-dataset F1-score of 0.880 and closely approached MariNeXt's reported performance of 0.891.

Discussion
These findings show that careful data formulation and training design can enable standard architectures to achieve strong cross-domain performance under extreme class imbalance, approaching or even surpassing more specialized models in realistic deployment conditions. The results provide practical guidance for operational marine debris monitoring systems: spatially stratified sampling across diverse marine environments should be prioritized, F1-scores in the range of 0.86–0.89 can be expected when deploying on previously unseen regions without fine-tuning, and a two-stage strategy should be considered in which models are first trained on geographically diverse data and then optionally adapted for region-specific applications. To the best of our knowledge, this is the first systematic cross-dataset validation study involving both MARIDA and MADOS, demonstrating that binary reformulation supports generalization-preserving marine debris detection across geographic and temporal domain shifts.
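The abstract does not specify how the rarity-aware sampling strategy is implemented. As one possible illustration only, the sketch below uses inverse-frequency weighting, so that patches containing debris and background-only patches are drawn equally often in expectation. The weighting scheme, the function `rarity_weights`, and the toy patch list are all assumptions introduced here and are not taken from the study.

```python
import random
from collections import Counter


def rarity_weights(has_debris):
    """Return one sampling weight per patch, inversely proportional to the
    size of the group (debris / background-only) the patch belongs to."""
    counts = Counter(has_debris)
    return [1.0 / counts[flag] for flag in has_debris]


# Hypothetical toy dataset: 2 of 10 patches contain debris pixels.
has_debris = [True, False, False, False, False,
              True, False, False, False, False]
weights = rarity_weights(has_debris)

# Draw patch indices with replacement according to the weights.
rng = random.Random(0)  # fixed seed for reproducibility
batch = rng.choices(range(len(has_debris)), weights=weights, k=1000)

# Fraction of sampled patches that contain debris.
debris_share = sum(has_debris[i] for i in batch) / len(batch)
print(f"debris patches in sample: {debris_share:.1%}")
```

With these weights each group carries the same total probability mass, so the rare debris patches are oversampled relative to their 20% share of the toy dataset. Other schemes, such as weights based on per-patch debris pixel fraction, would fit the same interface.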