In recent years, foundation models (FMs) have begun to reshape numerical simulations on high-performance computing (HPC) platforms. These large, pre-trained AI models enable rapid predictions across a broad range of physical domains, including Earth system modeling, fluid dynamics, materials science, as well as complex multi-modal simulations in aerospace engineering and fusion research. By training on diverse datasets, FMs learn intricate relationships and underlying physical behavior while also enabling the quantification of uncertainty in their predictions. This capability allows simulations that once required days of numerical calculation to be completed in minutes (FM inference), supporting real-time design optimization, uncertainty-aware decision making, and more comprehensive exploration of complex scenarios.

Foundation models are trained on extensive simulation datasets, effectively emulating slow, iterative solvers. They serve as reusable "base models" that can be adapted to specific applications, such as inverse design and uncertainty quantification, without necessitating retraining from scratch. Consequently, FMs unlock new possibilities in engineering and scientific research, enabling swift, physics-informed optimization cycles across various domains.

The real strength of foundation models comes from their ability to learn rich physical and chemical relationships from large, diverse datasets and to assist scientists with complex, high-dimensional tasks. Rather than replacing traditional simulation or expert judgment, these models support high-throughput virtual experimentation and help researchers navigate design choices and tradeoffs that would otherwise be difficult to explore. In Earth system science, for example, models such as DeepMind's GraphCast and NVIDIA's FourCastNet use large spatiotemporal datasets to improve weather and climate prediction while providing actionable guidance to human forecasters.
Similar approaches are emerging in other domains. In materials science, models trained on crystal structure databases help researchers prioritize promising candidates for further study, while in drug discovery they support protein structure prediction and molecular design. Related techniques are also being applied in fusion energy research, where data-driven models assist physicists with plasma control and scenario planning in complex operational environments.

FMs are driving a shift in numerical simulation workflows. This change parallels the long-established scientific and engineering practice of employing reduced-order or surrogate models to gain swift understanding, particularly when full-fidelity simulations are too time-intensive. Historically, methods like Reynolds-averaged simulations have been utilized for quick engineering analysis by simplifying the underlying physics, thereby avoiding the substantial cost of repeatedly solving extensive, high-resolution models. FMs share this foundational motivation but provide a more flexible and expressive contemporary alternative. By learning from large collections of simulation and experimental data, and by leveraging architectures such as transformers and graph neural networks, they can capture complex relationships across regimes while still enabling fast, adaptable emulation. This makes them a powerful complement to traditional solvers, supporting quicker iteration and more informed decision making without sacrificing necessary fidelity.

FMs are trained predominantly with self-supervised objectives on large simulation and observational data, improving data efficiency by learning structure directly from unlabeled data. We incorporate physics-aware inductive biases (e.g., conservation, symmetries) so that the models remain consistent with governing laws and suitable for scientific use.
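To make the physics-aware objective concrete, the following minimal sketch combines a self-supervised reconstruction term with a soft conservation penalty. The loss form, the weighting `lam`, and the toy field are illustrative choices of ours, not a prescription from any particular FM:

```python
import numpy as np

def physics_aware_loss(pred, target, lam=0.1):
    """Toy objective: reconstruction error plus a soft conservation penalty.
    The names and weighting are hypothetical, for illustration only."""
    # Self-supervised reconstruction term (mean squared error).
    recon = np.mean((pred - target) ** 2)
    # Soft constraint: the total "mass" of the predicted field should
    # match that of the reference field (a conserved quantity).
    conservation = (pred.sum() - target.sum()) ** 2 / pred.size
    return recon + lam * conservation

rng = np.random.default_rng(0)
field = rng.normal(size=(16, 16))
noisy = field + 0.01 * rng.normal(size=(16, 16))
loss = physics_aware_loss(noisy, field)
```

In practice such penalties are added to the pretraining loss so that gradient descent itself pushes the emulator toward physically consistent states.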
FMs are treated as reusable emulators that complement, rather than replace, high-fidelity numerical simulations.

The remainder of this article is organized as follows: (i) AI-ready data management; (ii) FM architecture and design considerations; (iii) parallelization experiences for scaling ViT-based FMs; (iv) training and inference acceleration on HPC; and finally a discussion that formalizes trust via a verification-and-validation protocol and presents a quantitative breakeven analysis under matched accuracy/compute budgets.

As FMs scale to production on HPC systems, AI-ready data is key to achieving the speedups and generalization needed. Building high-quality datasets requires programmatic pipelines for collection, quality control, de-duplication, consistent labeling, and machine-readable provenance, moving beyond generic FAIR to a campaign-oriented governance model. By recording simulation campaigns with standardized workflows and ontologies, data curation can be driven by impact metrics like time-to-insight and downstream model quality, aligning data work with FM utility rather than mere storage volume.

At exascale, storing all high-order fields is infeasible, making compression a first-class design choice. Modern error-bounded lossy methods can significantly reduce footprint [1]. For FMs, fidelity must be defined by physics-aware criteria (e.g., controlling errors in energy or gradients) and by quantifying how compression perturbations affect surrogates and emulators [2]. Integrated in situ or in transit, compression also acts as a learnable encoder, yielding compact features that minimize I/O and memory during pretraining and fine-tuning while maintaining strict error controls for diagnostics [3].

Scalable FMs also demand a unified description of scientific data across domains. We advocate a token-centric view of "AI-ready" description mirroring FM tokenization.
Descriptor tokens capture meshes, coordinates, units, and provenance; feature tokens represent compressed field content. Fusion mechanisms bind tokens across simulations, experiments, and modalities into coherent scientific narratives. Anchoring these tokens in community standards (e.g., netCDF-like schemas for Earth system modeling) provides model teams with a stable, extensible interface to simulation data. By integrating campaign governance, physics-aware compression, and token-centric schemas, data management establishes a robust feedback loop from simulation to FM pretraining and inference on HPC [2]. This foundational work directly optimizes the end-to-end objective: accelerating time-to-solution and ensuring trustworthy predictions, which underpins the necessary architectural and parallelization strategies.

Many numerical simulations are developed based on spatiotemporal physics laws; consequently, FMs for scientific simulation must accurately capture spatiotemporal physics while also mapping efficiently to HPC hardware. Generally, this involves utilizing graph-based models that respect irregular meshes or Vision Transformer (ViT)-based models [4] that operate on regular grids. For example, on regular latitude-longitude grids, ViT-style encoders employ patch embeddings that align with GPU tensor cores and batched I/O. This alignment facilitates the computation of large-scale spatial correlations with predictable memory layouts. For spherical or unstructured domains, graph neural networks encode locality and adjacency through message passing, thereby preserving mesh topology without the need for computationally expensive remeshing.
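The grid-aligned patch embedding described above can be sketched as follows. The patch size, channel count, and random projection are illustrative stand-ins for a trained ViT embedding layer (which is typically implemented as a strided convolution):

```python
import numpy as np

def patchify(field, p):
    """Split a (C, H, W) field into non-overlapping p x p patches and
    flatten each into a token vector; a sketch of ViT patch embedding."""
    c, h, w = field.shape
    assert h % p == 0 and w % p == 0
    tokens = (field.reshape(c, h // p, p, w // p, p)
                   .transpose(1, 3, 0, 2, 4)       # (H/p, W/p, C, p, p)
                   .reshape((h // p) * (w // p), c * p * p))
    return tokens

rng = np.random.default_rng(2)
field = rng.normal(size=(3, 32, 64))   # e.g., 3 variables on a lat-lon grid
tokens = patchify(field, p=8)          # 32 tokens of dimension 192
embed = tokens @ rng.normal(size=(3 * 8 * 8, 128))  # learned projection in practice
```

Because every patch has identical shape and stride, the resulting tensors map cleanly onto batched GEMMs, which is the property that aligns with GPU tensor cores.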
Temporal structure is commonly modeled using autoregressive or time-conditioned transformers; however, long temporal windows significantly increase attention memory, communication overhead, and checkpoint sizes, all of which represent key bottlenecks on distributed HPC systems.

Two principles align architectures with the end-to-end objectives of utilization, time-to-solution, and trustworthy predictions highlighted earlier. First, physics-aware inductive biases constrain models to respect invariants, conservation, and symmetries, improving data efficiency and reliability across regimes. Second, token- and compute-adaptivity control cost by scaling with solution complexity rather than nominal grid size. Practical mechanisms include windowed or hierarchical attention for long-range dependencies, sparsity and mixture-of-experts for conditional computation, and adaptive token pruning or region-of-interest refinement that focuses capacity where dynamics are active. Linear- or subquadratic-attention variants further reduce the quadratic growth of standard self-attention, enabling extreme resolutions on modern accelerators.

Looking ahead, diffusion-based generative models [5] provide a principled path to probabilistic forecasting, ensemble generation, and uncertainty quantification in chaotic systems, complementing deterministic surrogates. Combined with multi-resolution training curricula and schema-driven data pipelines, these choices tie model design to the same data and workflow principles used for AI-ready datasets.

Parallelizing FMs for numerical simulation on HPC is essential for efficiency and scale, yet challenging. Increasing resolution (e.g., for climate) stresses the quadratic cost of ViT self-attention; multi-variable fields also raise memory and communication pressure.
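The quadratic cost just noted is what linear-attention variants remove. A minimal kernelized sketch, in the spirit of linear transformers but with an ad hoc positive feature map of our choosing, never materializes the $n \times n$ attention matrix:

```python
import numpy as np

def linear_attention(q, k, v):
    """Kernelized linear attention: replace softmax(QK^T)V, which is
    O(n^2) in sequence length, with phi(Q)(phi(K)^T V), which is O(n).
    phi is an illustrative positive feature map, not a tuned kernel."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6       # keeps weights positive
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                                   # (d, d_v) summary, O(n d d_v)
    z = qf @ kf.sum(axis=0, keepdims=True).T        # per-token normalizer, O(n d)
    return (qf @ kv) / z

rng = np.random.default_rng(3)
n, d = 1024, 32              # 1024 tokens, yet no 1024 x 1024 matrix is formed
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(q, k, v)
```

The trade-off, as discussed above for TILES-style kernels, is that exact global softmax attention is approximated; the accuracy-cost impact has to be measured per application.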
Making these models practical requires mixed-precision training and hybrid parallelization that respect interconnect bandwidth, memory residency, and checkpointing constraints.

Scaling FMs to an extreme level presents significant challenges. Experience from ORBIT-class efforts highlights four levers for ViT-based FMs on HPC: (i) linear-complexity attention (e.g., TILES) reduces self-attention from $O(n^2)$ to $O(n)$, enabling multi-billion-token contexts at high resolution; (ii) hybrid parallelization (e.g., tensor/model sharding plus data parallelism such as STOP) scales to 100B+ parameters while controlling communication and memory; (iii) lightweight architectural efficiency (e.g., Reslim) improves utilization without sacrificing accuracy; and (iv) domain specialization with robust uncertainty quantification maintains physical consistency in multi-variable predictions [6,7]. These techniques trade exact global attention for sparse/global routing and introduce modest auxiliary indices/buffers; we evaluate their accuracy-cost impact explicitly.

Complementing ViT-focused scaling, emerging linear-time generative surrogates pair diffusion models with state-space architectures (Mamba) to target $O(N)$ complexity and lower VRAM [8]. Mapping 1D Mamba to 2D/3D fields uses multi-directional and zigzag scanning; hybrids retain a small fraction of self-attention to preserve global coherence. These designs reduce activation/KV traffic and interconnect pressure, enable longer contexts on fixed high-bandwidth memory (HBM) budgets, and naturally improve the scalability of diffusion-based FMs.

Performance of FMs is often limited more by memory capacity/bandwidth and collective communication than by peak FLOP/s. Costs for processing multi-channel 2D/3D fields over long contexts are dominated by activation/KV-cache residency, optimizer/state footprints, tensor layout conversions, and synchronization.
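The multi-directional scanning mentioned above can be illustrated with the simplest case, a zigzag ordering of a 2D field. This sketch shows only the token ordering used to feed a 1D state-space model, not the Mamba recurrence itself:

```python
import numpy as np

def zigzag_scan(field):
    """Map a 2D field to a 1D token sequence by reversing every other
    row, so consecutive tokens remain spatially adjacent; one of the
    scan orders used to apply 1D state-space models to 2D data."""
    rows = [row if i % 2 == 0 else row[::-1]
            for i, row in enumerate(field)]
    return np.concatenate(rows)

grid = np.arange(12).reshape(3, 4)
seq = zigzag_scan(grid)
# Row 1 is traversed right-to-left, so the walk never "jumps" across
# the grid: 0 1 2 3, then 7 6 5 4, then 8 9 10 11.
```

Multi-directional variants run several such scans (row-wise, column-wise, and their reverses) and fuse the results so that no spatial direction is systematically favored.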
Training emphasizes throughput (time-to-train), while inference emphasizes latency (time-to-first-token and response time). Both must be assessed by end-to-end utilization within the simulation-to-AI loop [9].

Several techniques mitigate memory traffic and replication. IO-aware and fused attention kernels (e.g., FlashAttention-2 [10]) reduce high-bandwidth memory (HBM) movement. Distributed training uses hybrid parallelism (data, tensor/model, pipeline) and shards parameters, gradients, and optimizer states to reduce replication and communication. Inference uses KV-cache management like PagedAttention to improve batching efficiency. Topology-aware placement, tensor fusion, and overlapping collectives with compute convert theoretical peak performance into sustained throughput.

Cross-layer co-design further improves efficiency. Sequence parallelism and tiling constrain activation footprints. Linear/subquadratic attention variants reduce quadratic growth. Activation checkpointing trades compute for memory, while gradient accumulation increases effective batch size. Conditional computation, such as sparse activations and mixture-of-experts, scales cost with solution complexity. Reliability is critical: mixed/low-precision training requires scaling and clipping for numerical stability, and deterministic kernels support fast recovery and reproducibility.

HPC-native runtimes, such as task-based systems (e.g., PaRSEC [11]), provide orchestration by expressing the workflow as dependency graphs that overlap collectives with computation, prioritizing latency-critical inference alongside throughput-driven training. Integrating these techniques with schema-driven I/O and in-situ/in-transit compression keeps data movement low and the simulation-to-inference loop active.
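Gradient accumulation, one of the memory-for-compute trades listed above, can be shown with a toy model. The linear regression, learning rate, and batch sizes are hypothetical, chosen only to demonstrate the mechanics of deferring the optimizer step:

```python
import numpy as np

def train_with_accumulation(w, batches, lr=0.2, accum_steps=4):
    """Sum micro-batch gradients and apply one optimizer step every
    `accum_steps` batches: the effective batch size grows by that
    factor while peak activation memory stays at one micro-batch."""
    grad_sum = np.zeros_like(w)
    for step, (x, y) in enumerate(batches, start=1):
        pred = x @ w
        grad_sum += 2.0 * x.T @ (pred - y) / len(x)   # MSE gradient
        if step % accum_steps == 0:
            w = w - lr * grad_sum / accum_steps       # averaged update
            grad_sum = np.zeros_like(w)
    return w

rng = np.random.default_rng(4)
w_true = np.array([1.0, -2.0])
batches = [(x, x @ w_true) for x in
           (rng.normal(size=(8, 2)) for _ in range(64))]
w = train_with_accumulation(np.zeros(2), batches)     # recovers w_true
```

In distributed training the same idea also amortizes gradient all-reduce cost, since collectives fire once per accumulated step rather than per micro-batch.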
By refining these strategies, we can harness the full potential of FMs in numerical simulations, delivering trustworthy predictions on production supercomputers.

Despite their potential, the adoption of foundation models for numerical simulations faces significant challenges, particularly regarding trustworthiness and interpretability in industrial and scientific settings. Adoption ultimately depends on trust. Robust validation is necessary to compare FM outputs with traditional numerical solvers and observed data, quantify uncertainty through methods like ensembles or diffusion techniques, and enforce physical constraints. Establishing a continuous feedback loop in simulation workflows to detect, diagnose, and correct regime-specific errors will enhance reliability over time. Moreover, evaluation should focus on end-to-end time-to-quality and sustained utilization rather than isolated metrics. Ensuring stability under mixed or low precision requires gradient scaling and clipping, while reproducible, shard-aware checkpointing facilitates rapid recovery. Such alignment allows FMs to provide swift, credible scientific insights while optimizing supercomputing resources.

A standardized evaluation framework for assessing FM performance could greatly enhance their deployment in scientific domains. This framework should incorporate metrics such as accuracy, efficiency, and trustworthiness, mirroring the validation processes in traditional numerical simulations, which include code verification, physics consistency checks, and cross-solver benchmarking with confidence intervals [12]. Additionally, a hardware-aware simulation platform can rigorously evaluate FM pipelines across various computational architectures, establishing pre-registered acceptance thresholds [13].
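Ensemble-based uncertainty quantification of the kind described above can be sketched as follows. The perturbed linear "models" are a hypothetical stand-in for independently trained surrogates, and the exact reference map stands in for a high-fidelity solver:

```python
import numpy as np

def ensemble_forecast(models, x):
    """The ensemble mean is the point forecast; the ensemble spread is
    a cheap uncertainty estimate to be validated against the error
    relative to a reference solver."""
    preds = np.stack([m @ x for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(5)
base = rng.normal(size=(4, 4))                       # stand-in "true dynamics"
models = [base + 0.05 * rng.normal(size=(4, 4)) for _ in range(16)]
x = rng.normal(size=4)
mean, spread = ensemble_forecast(models, x)
reference = base @ x                                 # high-fidelity result
abs_err = np.abs(mean - reference)                   # compare against spread
```

A well-calibrated ensemble shows spread commensurate with error; systematic mismatch between the two is exactly the regime-specific failure the feedback loop above is meant to catch.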
Understanding the tradeoff between accuracy and computational time will enable practitioners to allocate resources effectively based on their specific needs and available capacity, ultimately facilitating informed decision-making.
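The breakeven analysis behind this tradeoff reduces to simple amortization arithmetic: at matched accuracy, the FM pays off once its training cost is recovered by per-run savings. The node-hour figures below are hypothetical:

```python
def breakeven_runs(train_hours, solver_hours, inference_hours):
    """Number of surrogate inferences after which FM training cost is
    amortized, assuming matched accuracy and a common cost unit
    (node-hours). All inputs are campaign-specific estimates."""
    saving_per_run = solver_hours - inference_hours
    if saving_per_run <= 0:
        return float("inf")      # surrogate never pays off
    return train_hours / saving_per_run

# Hypothetical campaign: 20,000 node-hours of training, 50 node-hours
# per high-fidelity solver run, 0.5 node-hours per FM inference.
n = breakeven_runs(20_000, 50.0, 0.5)   # ~404 runs to break even
```

The same arithmetic, run with a site's own measured costs and pre-registered accuracy thresholds, gives practitioners a defensible basis for deciding when to train a surrogate at all.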