47-Dataset Benchmark Validation Results for Pathway Subtyping Framework v0.4.0

20260 citationsDatasetgreen Open Access

Authors

Rohit Chauhan · Hill Top Research (United States)

Abstract

================================================================================QUICK FACTS================================================================================ Total Datasets: 47Success Rate: 37/47 (78.7%) - TCGA: 12/12 (100%) - GEO: 25/35 (71%) Total Samples: 36,551File Count: 36 files in archiveUncompressed: 340 KB ================================================================================PACKAGE CONTENTS================================================================================ 📊 PRIMARY DATA✓ bootstrap_threshold_calibration_47datasets_zenodo.csv (4.9 KB) - 47 datasets with silhouette, bootstrap ARI, sample counts, status - 37 PASS, 10 ERROR (documented constraints) 📖 DOCUMENTATION✓ README.md - Quick start guide✓ ZENODO_47DATASETS_README.md - Complete methodology (11 KB)✓ LICENSE - MIT License✓ requirements.txt - Python dependencies 📁 SCRIPTS (Reproducible Pipeline)✓ run_47benchmarks.py - TCGA validation✓ run_47benchmarks_geo.py - GEO validation✓ run_geo_final_10.py - Final 10 GEO datasets✓ merge_complete_47_results.py - Merge results✓ benchmark_data_loaders.py - Data loading factory✓ benchmark_validation.py - Metrics computation 📦 BENCHMARK LOADERS (Supporting Libraries)✓ local_gdc_loader.py ⭐ CRITICAL - TCGA loader with transpose fix✓ loaders.py - GEO REST API loader✓ translators.py - Gene ID translation✓ est_translator.py - EST probe translation✓ gene_translation.py - Translation pipeline✓ translation_cache.py - Result caching✓ api.py - API utilities✓ errors.py - Custom exceptions✓ __init__.py - Package init 📋 MANIFESTS (Dataset Specifications)✓ benchmark_47datasets_manifest.csv - Master manifest (all 47)✓ tcga_remaining_6datasets_manifest.csv - TCGA Phase 2✓ geo_remaining_32datasets_manifest.csv - GEO Phase 2✓ geo_final_10datasets_manifest.csv - GEO Phase 3✓ benchmark_failed_only_manifest.csv - Failed datasets✓ dataset_candidate_list_47.csv - Selection criteria✓ test_manifest_geo.csv - Test manifest 📚 TECHNICAL DOCS✓ BENCHMARK_SPLIT_STRATEGY.md - Architecture & design✓ GDC-API-FIX-SUMMARY.md - LocalGDCDataLoader fix details✓ PHASE3-DEBUGGING-SESSION-NOTES.md - Debugging log ================================================================================VALIDATION METRICS================================================================================ Silhouette Coefficient (37 PASS): Mean: 0.28 Median: 0.15 Range: 0.03 - 0.97 Bootstrap ARI 5th Percentile (37 PASS): Mean: 0.36 Median: 0.00 Range: -0.14 - 1.00 ================================================================================ERROR ANALYSIS (10 Datasets - NOT Validation Failures)================================================================================ 10 datasets have fundamental data constraints: Too Few Samples (5): - GSE71861, GSE39666, GSE94331, GSE51861, GSE81110 - Require n ≥ 10 for reliable k-selection in range [2,10] Data Quality Issues (3): - GSE33133, GSE75688, GSE99254 - NaN values, type mismatches, or format incompatibilities Download/Format Failures (2): - GSE20685, GSE29006 - Connection errors or expression matrix extraction failures These are unfixable data constraints, not framework issues.

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19324360

Command Palette

47-Dataset Benchmark Validation Results for Pathway Subtyping Framework v0.4.0

Authors

Abstract

Topics & Keywords

Publication Details