This is a white paper released as part of ETP4HPC's Strategic Research Agenda 6.

Energy efficiency and sustainability (EE&S) have become central challenges for modern High-Performance Computing (HPC) and AI infrastructures. While energy savings are often viewed as the main issue, sustainability also includes ecological, economic, and societal dimensions. A full life-cycle perspective, from planning and procurement to operation and decommissioning, is necessary to avoid rebound effects such as the Jevons paradox [1].

HPC and AI systems increasingly interact with their environment. Their multi-megawatt, highly variable power demand influences grid stability and requires better forecasting and closer coordination with energy providers. Future operations must incorporate energy- and carbon-aware scheduling, while heat reuse concepts and standardized sustainability metrics (PUE, WUE, ERE) become integral to system design and evaluation.

Growing hardware heterogeneity (CPUs, GPUs, accelerators, and emerging neuromorphic or quantum devices) offers efficiency potential that is still underutilized due to insufficient software optimization. Progress requires stronger hardware–software co-design, greater support for developers in exploiting advanced architectural features, and systematic use of monitoring data to identify inefficient applications. European semiconductor initiatives further enable alignment between hardware and computational requirements.

Software, algorithms, and workflows are equally important. Energy-efficient algorithms, data reduction techniques (compression, filtering, deduplication), improved I/O strategies, and optimized workflows can significantly reduce resource usage. Strengthening reproducibility and robust experiment packaging across diverse systems helps prevent wasted compute time and supports sustainable software lifecycles.

Digital twins (DTs) are emerging as key tools for lifecycle management. They support planning, operational optimization, predictive analysis, and end-of-life decisions. Effective DTs require harmonized monitoring, standard metrics, and shared methodologies. A European knowledge initiative could accelerate adoption and ensure consistent practices across HPC sites.

Raising awareness and empowering stakeholders (developers, users, operators, vendors, and funding bodies) is essential. Transparent monitoring infrastructures and vendor-agnostic sustainability indicators enable informed decision-making. User feedback mechanisms (dashboards, reports, incentives) can encourage more energy-conscious behaviour. Funding agencies, particularly the European Commission, should embed sustainability metrics, full life-cycle assessments [ISO 14040/14044], and continuous reporting requirements into procurement and project evaluations.

In the post-exascale era, simple performance scaling through increased energy use is no longer viable. With growing AI workloads and the need to reduce energy demand and CO₂ emissions, coordinated progress across hardware, software, operations, infrastructure, and user behaviour is imperative. Only a combined effort by all stakeholders will enable high-performing HPC and AI systems to meet future needs while aligning with long-term sustainability goals.

[1] https://en.wikipedia.org/wiki/Jevons_paradox