JIT-Mozilla-Perf, a dataset for performance regression prediction

2026 · 0 citations · Dataset · green Open Access

Authors

Gregory Mierzwinski · Mozilla Foundation

Abstract

The **JIT-Mozilla-Perf dataset** is a curated Mozilla performance regression dataset designed for research on just-in-time risk prediction, commit-level performance regression detection, and debugging-policy simulation. It integrates data from three upstream Mozilla systems:

- **Treeherder** performance alerts and test metadata,
- **Bugzilla** bug metadata and regressor/culprit links,
- **Mercurial autoland** commit history and code diffs.

The dataset is suitable both for reproducible data extraction studies and for downstream modeling and simulation.

At the core of the dataset is a bug-to-revision linking pipeline that identifies Mozilla performance regression bugs, associates them with regressor revisions, and reconstructs a single **net code diff** for each bug by locating the newest contiguous block of `Bug <id>` commits in the `autoland` Mercurial repository. These bug-linked diffs are then converted into a structured textual format for language-model training and evaluation.

The main LLM-ready artifact is `perf_llm_struc.jsonl`, where each record contains a `commit_id`, a prompt composed of the commit message plus a structured diff representation, and a binary label (`"1"` for regressor, `"0"` otherwise). A filtered variant, `perf_llm_struc_no_fw_2_6_18.jsonl`, relabels examples after excluding selected Treeherder frameworks.

The dataset also contains the supporting performance-testing metadata required for realistic downstream simulation. These artifacts include Treeherder alert summaries, failing performance signatures, signature metadata, per-revision performance coverage, signature-group co-occurrence structures, and signature-group job duration estimates. Together, these files support experiments not only in commit classification, but also in simulation of batch testing and culprit localization strategies under realistic platform-specific test execution constraints.
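
The "newest contiguous block of `Bug <id>` commits" step can be illustrated with a minimal sketch. This is not the companion repository's implementation: the function names, the `node` and `desc` field names, and the rule that a commit belongs to a bug when its message starts with `Bug <id>` are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional

# Assumption: a commit belongs to a bug when its message starts with "Bug <id>".
BUG_RE = re.compile(r"^\s*bug\s+(\d+)\b", re.IGNORECASE)


def bug_id_of(message: str) -> Optional[int]:
    """Return the bug id at the start of a commit message, or None."""
    match = BUG_RE.match(message)
    return int(match.group(1)) if match else None


def newest_contiguous_block(commits: List[Dict[str, str]], bug_id: int) -> List[Dict[str, str]]:
    """Return the newest uninterrupted run of commits for `bug_id`.

    `commits` must be ordered oldest to newest. The hypothetical keys
    `node` (revision hash) and `desc` (commit message) are illustrative.
    """
    block: List[Dict[str, str]] = []
    # Walk from newest to oldest; start collecting at the first match
    # and stop as soon as the run is interrupted by another commit.
    for commit in reversed(commits):
        if bug_id_of(commit["desc"]) == bug_id:
            block.append(commit)
        elif block:
            break
    block.reverse()  # restore oldest-to-newest order
    return block


history = [
    {"node": "a", "desc": "Bug 1 - initial landing"},
    {"node": "b", "desc": "Bug 2 - unrelated change"},
    {"node": "c", "desc": "Bug 1 - relanding, part 1"},
    {"node": "d", "desc": "Bug 1 - relanding, part 2"},
    {"node": "e", "desc": "Bug 3 - another change"},
]
block_nodes = [c["node"] for c in newest_contiguous_block(history, 1)]
print(block_nodes)
```

In this toy history the earlier landing of Bug 1 (`a`) is ignored because a commit for another bug separates it from the newer run (`c`, `d`). A net diff for the bug would then presumably span from the parent of the block's oldest commit to its newest commit, although the exact diffing procedure is not specified here.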

In the current repository snapshot, the dataset includes approximately **4,301 Treeherder alert summaries**, **216 alert rows linked to regression bugs and tests**, **717,470 Bugzilla bug records**, **813,494 autoland commit metadata records**, **11,384 training samples for performance regression prediction**, **43,301 Treeherder performance signatures**, **1,608 signature groups**, and **20,530 historical per-revision performance testing records**. These counts can change slightly if the dataset is regenerated with a different extraction window or after upstream Mozilla data changes.

Primary files in this deposition include:

- `alert_summaries.csv`: raw Treeherder performance alert summaries.
- `alerts_with_bug_and_test_info.csv`: bug-linked Treeherder alerts and regressed test/platform information.
- `perf_bugs.csv`: filtered Bugzilla performance bugs labeled as regressors or regressions.
- `all_commits.jsonl`: exported Mercurial `autoland` commit metadata.
- `perf_bugs_with_diff.jsonl`: bug-linked net diffs reconstructed from contiguous commit blocks.
- `perf_llm_struc.jsonl`: structured-diff classification dataset for LLMs.
- `perf_llm_struc_no_fw_2_6_18.jsonl`: framework-filtered variant of the LLM dataset.
- `all_signatures.jsonl`, `sig_groups.jsonl`, `sig_group_job_durations.csv`, and `perf_jobs_per_revision_details_rectified.jsonl`: performance-testing metadata used for simulation and analysis.
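
As a rough illustration of how the LLM-ready records might be consumed, the sketch below parses JSON Lines records and tallies the binary labels. Only `commit_id` is named in the description above; the `prompt` and `label` key names, the helper names, and the sample records are assumptions and should be checked against the actual file.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Parse JSON Lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_counts(records: Iterable[Dict[str, str]], label_key: str = "label") -> Counter:
    """Count records per label; `label_key` is an assumed field name."""
    return Counter(str(record[label_key]) for record in records)


# Made-up sample records in the assumed layout ("1" = regressor, "0" = otherwise).
sample_lines = [
    '{"commit_id": "aaa111", "prompt": "Bug 1 - message and diff", "label": "1"}',
    "",
    '{"commit_id": "bbb222", "prompt": "Bug 2 - message and diff", "label": "0"}',
    '{"commit_id": "ccc333", "prompt": "Bug 3 - message and diff", "label": "0"}',
]
counts = label_counts(iter_records(sample_lines))
print(dict(counts))

# With the real file, the same helpers could be applied to an open handle:
#   with open("perf_llm_struc.jsonl", encoding="utf-8") as handle:
#       print(label_counts(iter_records(handle)))
```

Streaming the file line by line keeps memory use low, which may matter because each prompt embeds a full structured diff.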

The full extraction and preparation pipeline is available in the companion repository: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm`

Relevant extraction code directories are:

- Treeherder extraction scripts: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm/tree/master/data_extraction/treeherder`
- Bugzilla extraction scripts: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm/tree/master/data_extraction/bugzilla`
- Mercurial extraction scripts: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm/tree/master/data_extraction/mercurial`
- Dataset preparation script: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm/blob/master/data_extraction/data_preparation.py`
- Dataset documentation: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm/blob/master/datasets/mozilla_perf/README.md`
- Downstream batch-testing simulator: `https://github.com/Ali-Sayed-Salehi/jit-dp-llm/tree/master/analysis/batch_testing`

A typical regeneration workflow is:

- extract Treeherder alerts and failing signatures,
- fetch Bugzilla bug metadata and construct performance bug labels,
- export Mercurial `autoland` commits and reconstruct bug-linked net diffs,
- generate the structured LLM dataset,
- build signature, coverage, duration, and grouping artifacts for downstream simulation.

The software in the companion repository is released under the **MIT License**. Because this dataset is derived from Mozilla infrastructure and source history, users should also consider the terms, licenses, and attribution requirements associated with the upstream Mozilla data sources.

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18829343
