This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.

## New Benchmarks & Tasks

A big wave of new evaluation tasks this release:

- AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
- BabiLong and LongBench v2 for long-context evaluation by @jannalulu in #3287, #3338
- GraphWalks by @jannalulu in #3377
- ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
- Icelandic WinoGrande by @jmichaelov in #3277
- CLIcK Korean benchmark by @shing100 in #3173
- MMLU-Redux (generative) and its Spanish translation by @luiscosio in #2705
- EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
- EQBench in Spanish and Catalan by @priverabsc in #3168
- Anthropic discrim-eval by @Helw150 in #3091
- XNLI-VA by @FranValero97 in #3194
- Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
- HumanEval infilling by @its-alpesh in #3299
- CNN-DailyMail 3.0.0 by @preordinary in #3426
- Global PIQA and new `acc_norm_bytes` metric by @baberabb in #3368

## Fixes & Improvements

### Core Changes

- Python 3.10 minimum by @jannalulu in #3337
- Unpinned `datasets` library by @baberabb in #3316
- BOS token handling: delegate to the tokenizer; `add_bos_token` now defaults to `None` by @baberabb in #3347
- Renamed `LOGLEVEL` env var to `LMEVAL_LOG_LEVEL` to avoid conflicts by @fxmarty-amd in #3418
- Resolve duplicate task names with safeguards by @giuliolovisotto in #3394

### Task Fixes

- Fixed MMLU-Redux to exclude samples whose `error_type` is not `"ok"` and to display a summary table by @fxmarty-amd in #3410, #3406
- Fixed AIME answer extraction by @jannalulu in #3353
- Fixed LongBench evaluation and group handling by @TimurAysin and @jannalulu in #3273, #3359, #3361
- Fixed crows_pairs dataset by @jannalulu in #3378
- Fixed Gemma tokenizer `add_bos_token` not updating by @DarkLight1337 in #3206
- Fixed lambada_multilingual_stablelm by @jmichaelov and @HallerPatrick in #3294, #3222
- Fixed CodeXGLUE by @gsaltintas in #3238
- Pinned correct MMLUSR version by @christinaexyou in #3350
- Updated minerva_math by @baberabb in #3259

### Backend Fixes

- Fixed vLLM import errors when vLLM is not installed by @fxmarty-amd in #3292
- Fixed vLLM `data_parallel_size>1` issue by @Dornavineeth in #3303
- Resolved deprecated `vllm.utils.get_open_port` by @DarkLight1337 in #3398
- Fixed GPT-series model bugs by @zinccat in #3348
- Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
- Fixed `additional_config` parsing by @brian-dellabetta in #3393
- Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
- Fixed no-output error handling by @Oseltamivir in #3395
- Replaced deprecated `torch_dtype` with `dtype` by @AbdulmalikDS in #3415
- Fixed custom task config reading by @SkyR0ver in #3425

## Model & Backend Support

- OpenAI GPT-5 support by @babyplutokurt in #3247
- Azure OpenAI support by @zinccat in #3349
- Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
- OpenVINO text2text models by @nikita-savelyevv in #3101
- Intel XPU support for HFLM by @kaixuanliu in #3211
- Attention head steering support by @luciaquirke in #3279
- Leverage vLLM's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185

## What's Changed

- Remove trust_remote_code: True from updated datasets by @Avelina9X in https://github.com/EleutherAI/lm-evaluation-harness/pull/3213
- Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in https://github.com/EleutherAI/lm-evaluation-harness/pull/3234
- Fix add_bos_token not updated for Gemma tokenizer by @DarkLight1337 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3206
- remove incomplete compilation instructions, solves #3233 by @ceferisbarov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3242
- Update utils.py by @Anri-Lombard in https://github.com/EleutherAI/lm-evaluation-harness/pull/3246
- Adding support for OpenAI GPT-5 model by @babyplutokurt in https://github.com/EleutherAI/lm-evaluation-harness/pull/3247
- Add xnli_va dataset by @FranValero97 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3194
- Add ZhoBLiMP benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3218
- Add BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3221
- Add TurBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3219
- Add LM-SynEval Benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3184
- Fix unknown group key to tag in yaml config for lambada_multilingual_stablelm by @HallerPatrick in https://github.com/EleutherAI/lm-evaluation-harness/pull/3222
- update minerva_math by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3259
- feat: Add CLIcK task by @shing100 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3173
- Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3091
- Add support for OpenVINO text2text generation models by @nikita-savelyevv in https://github.com/EleutherAI/lm-evaluation-harness/pull/3101
- Update MMLU-ProX task by @weihao1115 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3174
- Support for AIME dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3248
- feat(scrolls): delete chat_template from kwargs by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3267
- pacify pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3268
- Fix codexglue by @gsaltintas in https://github.com/EleutherAI/lm-evaluation-harness/pull/3238
- Add BHS benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3265
- Add acc_norm metric to BLiMP-NL by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3272
- Add acc_norm metric to ZhoBLiMP by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3271
- Add EsBBQ and CaBBQ tasks by @valleruizf in https://github.com/EleutherAI/lm-evaluation-harness/pull/3167
- Add support for steering individual attention heads by @luciaquirke in https://github.com/EleutherAI/lm-evaluation-harness/pull/3279
- Add the Icelandic WinoGrande benchmark by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3277
- Ignore seed when splitting batch in chunks with groupby by @slimfrkha in https://github.com/EleutherAI/lm-evaluation-harness/pull/3047
- [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in https://github.com/EleutherAI/lm-evaluation-harness/pull/3292
- Fix LongBench Evaluation by @TimurAysin in https://github.com/EleutherAI/lm-evaluation-harness/pull/3273
- add intel xpu support for HFLM by @kaixuanliu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3211
- feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in https://github.com/EleutherAI/lm-evaluation-harness/pull/2705
- Add BabiLong by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3287
- Add AIME to task description by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3296
- Add humaneval_infilling task by @its-alpesh in https://github.com/EleutherAI/lm-evaluation-harness/pull/3299
- Add eqbench tasks in Spanish and Catalan by @priverabsc in https://github.com/EleutherAI/lm-evaluation-harness/pull/3168
- [fix] add math and longbench to test dependencies by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3321
- Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in https://github.com/EleutherAI/lm-evaluation-harness/pull/3303
- unpin datasets; update pre-commit by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3316
- bump to python 3.10 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3337
- Longbench v2 by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3338
- Leverage vllm's tokenizer_info endpoint to avoid manual duplication by @m-misiura in https://github.com/EleutherAI/lm-evaluation-harness/pull/3185
- Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/3317
- remove duplicate tags/groups by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3343
- Align humaneval_64_instruct task label in README to name in yaml file by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/3344
- Fixes bugs when using gpt series model by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3348
- [fix] aime doesn't extract answers by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3353
- add global_piqa; add acc_norm_bytes metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3368
- [fix] crows_pairs dataset by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3378
- Fix issue 3355 assertion error by @marksverdhei in https://github.com/EleutherAI/lm-evaluation-harness/pull/3356
- fix(gsm8k): align README to yaml file by @neoheartbeats in https://github.com/EleutherAI/lm-evaluation-harness/pull/3388
- added azure openai support by @zinccat in https://github.com/EleutherAI/lm-evaluation-harness/pull/3349
- Delegate BOS to the tokenizer; add_bos_token defaults to None by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/3347
- fix trust_remote_code=True for longbench by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3361
- [feat] add graphwalks by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3377
- Longbench group fix by @jannalulu in https://github.com/EleutherAI/lm-evaluation-harness/pull/3359
- Resolve deprecation of vllm.utils.get_open_port by @DarkLight1337 in https://github.com/EleutherAI/lm
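As a migration note for the `LOGLEVEL` → `LMEVAL_LOG_LEVEL` rename above: wrapper scripts that configured logging from the old variable need updating. A minimal sketch of honoring the renamed variable in your own script — the `INFO` fallback and the logger name here are assumptions for illustration, not the harness's documented defaults:

```python
import logging
import os

# Scripts that previously read LOGLEVEL should now read LMEVAL_LOG_LEVEL.
# The INFO fallback below is an assumption for this sketch, not necessarily
# the harness's own default.
level_name = os.environ.get("LMEVAL_LOG_LEVEL", "INFO").upper()
level = getattr(logging, level_name, logging.INFO)
logging.basicConfig(level=level)

# Hypothetical wrapper-script logger, just to demonstrate the effect.
logger = logging.getLogger("eval_wrapper")
logger.debug("only emitted when LMEVAL_LOG_LEVEL=DEBUG")
```

Setting the variable to an unrecognized name falls back to `INFO` rather than raising, via the `getattr` default.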