<h2>Dataset Description</h2> This dataset contains anonymized job-level records from the Eagle high-performance computing (HPC) system. Each record represents a Slurm batch job and includes scheduling metadata, resource requests, resource utilization, CPU and GPU energy consumption measurements, and computed efficiency metrics. Personally identifiable fields (user, account, and job name) have been replaced with cryptographic hashes. Energy metrics include both TDP-estimated CPU energy and measured node-level and GPU-level energy from iLO and Ganglia monitoring systems. <h3>Developed by</h3> National Laboratory of the Rockies (NLR), <a href="https://ror.org/036266993">ROR: https://ror.org/036266993</a> <h3>Contributed by</h3> HPC Operations and Data Analytics teams at NLR. <h3>Dataset short description</h3> Anonymized Slurm job records from the NLR Eagle HPC system, including job scheduling, resource allocation, CPU and GPU energy measurements, and efficiency metrics. <h3>Over what timeframe was the data collected or generated? Does this timeframe align with when the underlying phenomena or events occurred?</h3> The dataset covers the operational lifetime of the Eagle HPC system, with timestamps in the Mountain Time zone. Slurm data was processed nightly after midnight, so the database was always current through the prior day. The collection timeframe aligns directly with the underlying job scheduling events as they occurred on the Eagle system. <h3>What resources were used?</h3><h4>Facilities:</h4><ul><li><strong>Eagle HPC System</strong>, National Laboratory of the Rockies (NLR), <a href="https://ror.org/036266993">ROR: https://ror.org/036266993</a></li></ul><h4>Funding:</h4> U.S. Department of Energy, Office of Energy Efficiency and Renewable Energy (EERE). <h4>Other Supporting Entities:</h4> N/A <h2>Sharing/Access Information</h2><h3>Reuse restrictions placed on the data:</h3> The dataset has been anonymized by hashing sensitive fields (user, account, and job name). 
Reuse is subject to the license specified in this datacard. Users should not attempt to re-identify individuals from hashed fields. <h3>Provide DOIs and BibTeX citations to publications that cite or use the data.</h3> N/A <h3>Provide DOIs, citations, or links to other publicly accessible locations of the data.</h3> N/A <h3>Provide DOIs, citations, or links and descriptions of relationships to ancillary data sets.</h3> This dataset is derived from the Eagle schema of the NLR HPC job database. <h2>Data & File Overview</h2><h3>List all files contained in the dataset.</h3> Format: <strong>File</strong> | <strong>Description</strong>

esif.hpc.eagle.job-anon.zip | Zipped Hive-partitioned Apache Parquet dataset containing anonymized job records from the Eagle Slurm scheduler. Each row is a parent job record with scheduling metadata, resource requests/usage, CPU and GPU energy measurements, and computed efficiency metrics.

esif.hpc.eagle.job-anon-energy-metrics.zip | Zipped Hive-partitioned Apache Parquet dataset containing anonymized job records from the Eagle Slurm scheduler. Each row is a parent job record with scheduling metadata, resource requests/usage, CPU and GPU energy measurements, computed efficiency metrics, and additional energy metrics calculated from iLO and Ganglia.

datacard.md | This datacard file describing the dataset.

<h3>Describe the relationship(s) between files.</h3> The Parquet datasets are the primary data files. The datacard provides documentation. In the source database, each job record may have associated job_step records (not included here) that contain finer-grained per-step resource usage data, including TRESUsage fields. <h3>Describe any additional related data collected that was not included in the current data package.</h3> The source database contains additional tables not included in this extract, including job_step (per-step resource usage including TRESUsage fields). Raw Slurm slurm_data JSONB fields have also been excluded. 
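Hive-partitioned Parquet datasets encode partition keys as key=value directory names inside the archive. A minimal, standard-library sketch of recovering those keys from a file path (the partition key names below are hypothetical; inspect the extracted archive for the actual layout):

```python
from pathlib import PurePosixPath

def hive_partition_keys(path: str) -> dict:
    """Parse key=value partition directories from a Hive-partitioned path."""
    keys = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            keys[key] = value
    return keys

# Hypothetical partition layout for illustration only:
example = "esif.hpc.eagle.job-anon/year=2019/month=03/part-000.parquet"
print(hive_partition_keys(example))  # {'year': '2019', 'month': '03'}
```

Parquet readers such as pyarrow can consume these directories directly as a partitioned dataset; the sketch above only shows how the directory names carry column values.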
<h3>Are there multiple versions of this dataset?</h3> N/A <h2>Methodological Information</h2><h3>How was the data for each instance obtained or generated?</h3> Each instance is a parent job record collected from the Slurm workload manager on the Eagle HPC system via the sacct command. The data represents real job submissions, scheduling decisions, and resource consumption. Calculated fields (efficiency metrics, energy measurements) are derived from the raw Slurm data through database triggers and batch functions. Energy data is enriched from two additional sources: node-level power from iLO (Integrated Lights-Out) monitoring, and GPU-level power from Ganglia monitoring. <h3>For each instrument, facility, or source used to generate and collect the data, what mechanisms or procedures were used?</h3> Slurm data was collected via the sacct command and ingested through the following pipeline: Eagle Jobs API → Redpanda message queue (hpc-eagle-job topic) → StreamSets on Snowy → HPCMON API → Sage PostgreSQL database. Slurm data was processed nightly after midnight. Node-level energy data was collected from iLO (HP Integrated Lights-Out) management interfaces. GPU energy data was collected from Ganglia monitoring. Both energy sources were joined to job records via node lists and time ranges. <h3>To create the final dataset, was any preprocessing/cleaning/labeling of raw data done?</h3> Yes. 
The following preprocessing was applied: <ol><li><strong>Anonymization</strong>: The fields name, user, and account were replaced with cryptographic hashes to prevent re-identification.</li><li><strong>Column derivation</strong>: Several columns are calculated from raw Slurm fields, including queue_wait (start_time − submit_time), cpu_eff (TotalCPU / CPUTime), and max_mem_eff.</li><li><strong>State simplification</strong>: A state_simple column maps detailed Slurm states (e.g., "CANCELLED BY 12345") to simplified labels (e.g., "CANCELLED").</li><li><strong>QoS accounting</strong>: An accounting_qos column applies business rules: buy-in partitions are labeled "buy-in"; standby partitions are labeled "standby"; otherwise the Slurm QoS is used.</li><li><strong>Energy enrichment</strong>: CPU TDP-estimated energy is calculated from cpu_used, CPU TDP (200W for Intel Xeon Gold 6154), and core count (18 cores). Node-level measured energy is joined from iLO data. GPU-level measured energy is joined from Ganglia data.</li><li><strong>Timezone handling</strong>: Eagle's Slurm export did not include timezone offsets. Timezone-aware columns (submit_time_tz, start_time_tz, end_time_tz) were populated from the LEX accounting database which stores correct timezone information. The non-tz columns may be incorrect across daylight saving boundaries.</li></ol><h3>Is the software that was used to preprocess/clean/label the data available?</h3> The data is loaded and processed using PostgreSQL functions. These are internal to the NLR HPC operations database and are not publicly released at this time. <h3>Describe any standards and calibration information, if appropriate.</h3> All timestamps are in the Mountain Time zone. The non-timezone columns (submit_time, start_time, end_time) use the timestamp datatype without timezone and may be off by one hour across daylight saving boundaries. 
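The one-hour ambiguity in naive Mountain Time timestamps can be demonstrated with the standard library's zoneinfo module (a minimal sketch; the date and times are illustrative, not drawn from the dataset):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

denver = ZoneInfo("America/Denver")  # Mountain Time, as used on Eagle

# A naive timestamp recorded during the fall-back hour occurs twice on the
# clock; Python's `fold` attribute distinguishes the two occurrences.
naive = datetime(2019, 11, 3, 1, 30)          # DST ended 2019-11-03 in the US
first = naive.replace(tzinfo=denver, fold=0)  # first pass, MDT (UTC-6)
second = naive.replace(tzinfo=denver, fold=1) # second pass, MST (UTC-7)

# The two readings of the same naive timestamp differ by exactly one hour
# in UTC, which is why the *_tz columns should be preferred near DST boundaries.
delta_hours = (second.astimezone(ZoneInfo("UTC"))
               - first.astimezone(ZoneInfo("UTC"))).total_seconds() / 3600
print(delta_hours)  # 1.0
```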
The timezone-aware columns (submit_time_tz, start_time_tz, end_time_tz) are sourced from LEX accounting data and correctly handle DST transitions. CPU TDP energy estimates use a fixed 200W TDP for the Intel Xeon Gold 6154 with 18 cores per CPU. Node-level energy is measured via iLO. GPU energy is measured via Ganglia. <h3>Describe the environmental and experimental conditions relevant to the dataset.</h3> The Eagle system was located at the NLR campus. Compute nodes used Intel Xeon Gold 6154 processors (18 cores, 200W TDP). Node configurations included standard compute, bigmem, GPU, DAV, and other specialized partitions. Available partitions: bigmem, bigmem-8600, bigscratch, csc, dav, ddn, debug, gpu, haswell, long, mono, short, standard. Available QoS levels: Unknown, normal, buy-in, debug, penalty, high, standby. Available job states: CANCELLED, COMPLETED, FAILED, NODE_FAIL, OUT_OF_MEMORY, PENDING, RUNNING, TIMEOUT. <h3>Describe any quality-assurance procedures performed on the data.</h3> The data have been cleaned and validated through the standard data processes used to support Eagle operations. While these preprocessing and quality-control steps are integral to the dataset, the underlying software and pipelines are not publicly available. <h2>Data-Specific Information</h2><h3>What data does each instance within the dataset consist of?</h3> Each instance (row) represents a single parent Slurm job on the Eagle system. The data includes raw Slurm scheduling fields (timestamps, resource requests, resource usage, state), anonymized identifiers, derived efficiency metrics, and measured energy consumption from both node-level (iLO) and GPU-level (Ganglia) monitoring. <h3>Number of variables:</h3> 62 <h3>Number of cases/rows:</h3> Approximately 13,800,000. 
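The derived columns and the TDP energy estimate described in the preprocessing steps can be sketched in plain Python. This is a hedged approximation, not the actual database logic (which is not public): the helper names are ours, the partition-name substring tests for buy-in/standby are an assumption, and the TDP formula is one plausible reading of "cpu_used, CPU TDP, and core count"; the 200 W TDP and 18-core figures come from this datacard.

```python
from datetime import datetime, timedelta

CPU_TDP_WATTS = 200.0   # Intel Xeon Gold 6154 TDP, per this datacard
CORES_PER_CPU = 18      # cores per processor, per this datacard

def queue_wait(submit_time: datetime, start_time: datetime) -> timedelta:
    """queue_wait = start_time - submit_time."""
    return start_time - submit_time

def cpu_eff(total_cpu_s: float, cpu_time_s: float) -> float:
    """cpu_eff = TotalCPU / CPUTime (fraction of allocated CPU time used)."""
    return total_cpu_s / cpu_time_s if cpu_time_s else 0.0

def state_simple(state: str) -> str:
    """Map detailed Slurm states (e.g. 'CANCELLED BY 12345') to simple labels."""
    return state.split()[0].upper() if state.strip() else state

def accounting_qos(partition: str, qos: str) -> str:
    """Business rules: buy-in/standby partitions override the Slurm QoS.
    How such partitions are identified is not specified in the datacard;
    the substring test here is an assumption for illustration."""
    if "buy-in" in partition:
        return "buy-in"
    if "standby" in partition:
        return "standby"
    return qos

def cpu_tdp_energy_joules(cpu_used_core_seconds: float) -> float:
    """TDP-estimated CPU energy: core-seconds scaled by per-core TDP."""
    return cpu_used_core_seconds * CPU_TDP_WATTS / CORES_PER_CPU
```

For example, a job that used 18 core-hours (64,800 core-seconds) would be estimated at 64,800 × 200 / 18 = 720,000 J of CPU energy under this reading.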
<h3>Variable descriptions:</h3> Format: <strong>Variable Name</strong> | <strong>Description</strong> | <strong>Unit</strong> | <strong>Value Labels</strong> | <strong>Slurm sacct Field</strong>

id | Unique primary key (full job ID string) | N/A | | JobID
job_id | Numeric job ID in Slurm | N/A | | JobIDRaw
array_pos | Array index if job array, else null | N/A | | ArrayTaskID
name_hash | Anonymized hash of the job name | N/A | hash string | JobName
user_hash | Anonymized hash of the submitting user | N/A | hash string | User
account_hash | Anonymized hash of the allocation account | N/A | hash string | Account
partition | HPC queue/partition requested | N/A | bigmem, bigmem-8600, bigscratch, csc, dav, ddn, debug, gpu, haswell, long, mono, short, standard |