This repository is the replication package for the paper "Risk-Aware Batch Testing for Performance Regression Detection". It contains the complete artifact chain used in the paper: the JIT-Mozilla-Perf dataset, the data extraction pipeline, model fine-tuning and inference code for commit-level performance regression prediction, and the replay-based CI simulation framework used to evaluate batching strategies.

The companion JIT-Mozilla-Perf dataset is archived separately on Zenodo at https://doi.org/10.5281/zenodo.18829344. This replication package is available on GitHub: https://github.com/Ali-Sayed-Salehi/jit-dp-llm/tree/zenodo-batch-perf

The package supports reproduction of the paper's full workflow:
- (1) construction of the JIT-Mozilla-Perf dataset from Mozilla production data sources,
- (2) fine-tuning of commit-level performance regression risk models,
- (3) inference to generate chronological commit risk scores, and
- (4) replay-based simulation of risk-aware batching strategies.

The core dataset used by the paper is stored under datasets/mozilla_perf/. Its main modeling artifact, perf_llm_struc_no_fw_2_6_18.jsonl, contains 11,384 chronologically ordered commit instances derived from Mozilla performance alerts, Bugzilla performance bugs, and Mercurial Autoland history. The repository also includes the simulation metadata needed to model realistic performance testing behavior, including failing performance signatures, signature groups, per-revision coverage, and job-duration estimates. The files under datasets/mozilla_perf/ in this replication package belong to the same dataset family described in the paper and are the artifacts consumed by the training and simulation code documented here.
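Because the main modeling artifact is a JSONL file of chronologically ordered commit instances, it can be consumed one JSON object per line. The sketch below shows this pattern on a tiny in-memory stand-in; the field names used here ("revision", "push_timestamp", "label") are illustrative assumptions, not the file's documented schema.

```python
import io
import json

# Minimal sketch of reading a JSONL dataset shaped like
# datasets/mozilla_perf/perf_llm_struc_no_fw_2_6_18.jsonl: one commit
# instance per line. Field names here are illustrative assumptions.
sample = io.StringIO(
    '{"revision": "abc123", "push_timestamp": 1700000000, "label": 0}\n'
    '{"revision": "def456", "push_timestamp": 1700000100, "label": 1}\n'
)

instances = [json.loads(line) for line in sample if line.strip()]

# The dataset is chronologically ordered, so time-respecting train/eval/test
# splits can be taken by slicing rather than shuffling.
assert all(
    a["push_timestamp"] <= b["push_timestamp"]
    for a, b in zip(instances, instances[1:])
)
print(len(instances))
```

For the real file, replace the `io.StringIO` stand-in with `open("datasets/mozilla_perf/perf_llm_struc_no_fw_2_6_18.jsonl")`.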
The replication package includes prediction artifacts that can be used directly as simulator inputs, including:
- analysis/batch_testing/final_test_results_perf_codebert_eval.json
- analysis/batch_testing/final_test_results_perf_codebert_final_test.json

These artifacts allow users to rerun the main Optuna-based batch-testing experiments without retraining models.

The paper evaluates ModernBERT, CodeBERT, and LLaMA 3.1 8B as performance regression risk predictors, then uses their risk scores to drive batching strategies such as Time-Window Batching (TWB), Fixed-Size Batching (FSB), Risk-Adaptive Stream Batching (RASB), Risk-Aged Priority Batching (RAPB), and Risk-Adaptive Trigger Batching (RATB). The main reported result is that RAPB-la provides the strongest overall balance between cost and timeliness: it reduces total tests by 32.4%, reduces maximum time-to-culprit by 26.2%, and yields an estimated annual infrastructure saving of about $491K relative to the production-inspired baseline.

The paper-relevant repository paths are:
- datasets/mozilla_perf/
- data_extraction/treeherder/
- data_extraction/bugzilla/
- data_extraction/mercurial/
- data_extraction/data_preparation.py
- llama/
- analysis/batch_testing/
- slurm_scripts/speed/
- docker/Dockerfile.llama-train-environment

Detailed reproduction instructions are provided in the repository README. The fastest rerun path is to use the packaged CodeBERT prediction JSON files as inputs to analysis/batch_testing/simulation.py. The full regeneration path rebuilds the dataset, fine-tunes the risk predictors, runs inference on the eval and test splits, and then reruns the simulator.
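To make the risk-adaptive idea concrete, here is a toy sketch in the spirit of RASB: commits accumulate into the current batch until their summed risk crosses a budget, at which point the batch is flushed for testing. The budget value and the flush rule are illustrative assumptions for exposition only, not the logic implemented in analysis/batch_testing/simulation.py.

```python
# Toy risk-adaptive batching sketch (RASB-like): low-risk commits form
# large, cheap batches, while a high-risk commit quickly exhausts the
# risk budget and triggers an early flush, shortening time-to-culprit.
# The budget and flush rule are illustrative assumptions, not the
# paper's implementation.

def risk_adaptive_batches(risk_scores, risk_budget=1.0):
    """Group a chronological stream of risk scores into batches."""
    batches, current, total = [], [], 0.0
    for score in risk_scores:
        current.append(score)
        total += score
        if total >= risk_budget:      # risky content flushes the batch sooner
            batches.append(current)
            current, total = [], 0.0
    if current:                       # flush any trailing partial batch
        batches.append(current)
    return batches

scores = [0.1, 0.1, 0.2, 0.9, 0.05, 0.05]
batches = risk_adaptive_batches(scores)
print(batches)  # [[0.1, 0.1, 0.2, 0.9], [0.05, 0.05]]
```

This captures the trade-off the paper's strategies tune: a larger risk budget lowers test cost but delays culprit isolation, while a smaller budget does the opposite.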