Abstract

Background: Genomic markers such as ESR1, PGR, EGFR, MKI67, and FOXC1, along with clinically validated multigene signatures like Oncotype DX, MammaPrint, Prosigna, and EndoPredict, are widely used to inform prognosis and guide therapeutic decisions in breast cancer. However, broader use of these gene expression-based assays is often constrained by their high cost and limited accessibility. To address this challenge, we developed an AI model capable of inferring the spatial distribution of gene expression directly from routine digitized H&E-stained whole slide images (WSIs).

Methods: We trained a weakly supervised multiple instance learning model on a publicly available dataset (TCGA) containing paired bulk RNA-seq and H&E WSIs across multiple disease types. Although trained solely on bulk tissue-level expression data, the model accurately reconstructs spatial gene expression patterns, offering a scalable and cost-effective alternative to spatial transcriptomics. This pan-cancer model was trained to predict expression levels of thousands of genes, including known breast cancer-relevant genes such as ESR1, PGR, FOXA1, AURKA, BIRC5, MELK, MYBL2, PLK1, CDC20, CCNE1 and NAT1. Once trained, the model was evaluated on a held-out validation dataset of breast cancer WSIs to measure slide-level prediction performance using Pearson’s correlation. Spatial predictions were validated on external WSI datasets (HEST and CPTAC) that include associated spatial transcriptomics and on an external WSI dataset (IMPRESS) with associated immunohistochemistry (IHC) data.

Results: Our model accurately predicted the expression of over 2,500 genes with a Pearson correlation greater than 0.6. Notably, breast cancer-relevant genes such as ESR1, FOXA1, and AURKA showed particularly high correlation between predicted expression values and ground truth bulk RNA-seq measurements. For spatial validation, we qualitatively compared the predicted spatial expression patterns of PD-L1, CD163, and CD8 with corresponding IHC-stained tissue sections. Regions of high predicted expression aligned with areas of high marker density in the IHC images. Additionally, comparison with spatial transcriptomic data from 10x Genomics demonstrated strong concordance between our model’s spatial predictions and the ground truth measurements, particularly for breast cancer-relevant genes.

Conclusions: Our results demonstrate that routine H&E-stained whole slide images can be leveraged to accurately infer both bulk and spatial gene expression using a weakly supervised AI model. This capability enables large-scale, retrospective spatial analysis of archival tissue and opens new avenues for understanding tumor heterogeneity in breast cancer using standard pathology slides. By providing gene-level insights without the need for additional staining or molecular assays, this approach may augment current histopathological evaluation and support more informed clinical research, risk stratification, and treatment planning.

Citation Format: S. S. Chavan, C. Feng, H. Muhammad, H. S. Basu, W. Huang, R. Roy, G. Wilding, G. B. Mills, S. Kummar. Inferring Spatial Genomic Expression profiles from H&E-Stained Histology Using Weakly Supervised Deep Learning in Breast Cancer [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PD11-05.
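Illustrative note: the abstract does not describe the model architecture, so the following is only a generic sketch of how a multiple instance learning model trained on slide-level (bulk) labels can still yield tile-level, and therefore spatial, predictions. It assumes attention-weighted pooling over precomputed tile embeddings. The function name `mil_forward` and the parameters `W_gene`, `b_gene`, and `w_attn` are hypothetical and are not taken from the authors' work.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def mil_forward(tiles, W_gene, b_gene, w_attn):
    """Generic attention-pooling MIL forward pass (assumed design, not the authors').

    tiles  : (n_tiles, d) tile embeddings from one whole slide image
    W_gene : (d, n_genes) linear head mapping an embedding to gene expression
    b_gene : (n_genes,) bias of the linear head
    w_attn : (d,) attention scoring vector

    Returns (slide_pred, tile_pred, attn).
    """
    tile_pred = tiles @ W_gene + b_gene   # per-tile expression estimates
    attn = softmax(tiles @ w_attn)        # non-negative weights summing to 1
    slide_pred = attn @ tile_pred         # weighted average over tiles
    return slide_pred, tile_pred, attn

# Toy usage with random values standing in for real embeddings and weights.
rng = np.random.default_rng(0)
tiles = rng.normal(size=(50, 16))         # 50 tiles, 16-dim embeddings
W_gene = rng.normal(size=(16, 3))         # 3 hypothetical genes
b_gene = np.zeros(3)
w_attn = rng.normal(size=16)

slide_pred, tile_pred, attn = mil_forward(tiles, W_gene, b_gene, w_attn)
print(slide_pred.shape, tile_pred.shape, attn.shape)
```

In a design like this, only `slide_pred` would be compared with the bulk RNA-seq label during training, while `tile_pred`, placed back at each tile's slide coordinates, would give a spatial expression map at inference time.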
Published in: Clinical Cancer Research
Volume 32, Issue 4_Supplement, pp. PD11-05
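Illustrative note: the slide-level evaluation described in the Methods (Pearson’s correlation between predicted and measured expression, with genes counted above a 0.6 threshold) could be computed along the following lines. This is a minimal sketch on toy data, not the authors' evaluation code. The function names `per_gene_pearson` and `count_genes_above` and the array layout (slides in rows, genes in columns) are assumptions.

```python
import numpy as np

def per_gene_pearson(pred, truth):
    """Pearson r for each gene (column) across slides (rows).

    pred, truth : arrays of shape (n_slides, n_genes).
    Genes with zero variance in either array get NaN.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if pred.shape != truth.shape:
        raise ValueError("pred and truth must have the same shape")
    pc = pred - pred.mean(axis=0)          # center each gene
    tc = truth - truth.mean(axis=0)
    num = (pc * tc).sum(axis=0)
    den = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(den > 0, num / den, np.nan)
    return r

def count_genes_above(r, threshold=0.6):
    """Number of genes whose correlation exceeds the threshold (NaN is ignored)."""
    with np.errstate(invalid="ignore"):
        return int(np.sum(r > threshold))

# Toy data: 4 slides, 3 genes (values are made up for illustration).
demo_truth = np.array([[1.0, 2.0, 3.0],
                       [2.0, 4.0, 1.0],
                       [3.0, 6.0, 2.0],
                       [4.0, 8.0, 5.0]])
demo_pred = np.column_stack([
    2.0 * demo_truth[:, 0] + 1.0,          # perfectly correlated with gene 0
    -demo_truth[:, 1],                     # perfectly anti-correlated with gene 1
    np.full(4, 7.0),                       # constant prediction: r undefined
])

r = per_gene_pearson(demo_pred, demo_truth)
print(r, count_genes_above(r))
```

Computing the correlation per gene across held-out slides, rather than per slide across genes, matches the way the abstract reports results (a count of genes exceeding a correlation threshold).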