## Highlights

The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.

## Breaking Change: Lightweight Core with Optional Backends

`pip install lm_eval` no longer installs the HuggingFace/torch stack by default (#3428). The core package no longer includes model backends; install the ones you need explicitly:

```bash
pip install lm_eval          # core only, no model backends
pip install "lm_eval[hf]"    # HuggingFace backend (transformers, torch, accelerate)
pip install "lm_eval[vllm]"  # vLLM backend
pip install "lm_eval[api]"   # API backends (OpenAI, Anthropic, etc.)
```

Additional breaking change: accessing model classes via module attribute no longer works:

```python
# This still works:
from lm_eval.models.huggingface import HFLM

# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
```

## CLI Refactor

The CLI now uses explicit subcommands and supports YAML config files (#3440):

```bash
lm-eval run --model hf --tasks hellaswag      # run evaluations
lm-eval run --config my_config.yaml           # load args from a YAML config
lm-eval ls tasks                              # list available tasks
lm-eval validate --tasks hellaswag,arc_easy   # validate task configs
```

For backward compatibility, omitting `run` still works:

```bash
lm-eval --model hf --tasks hellaswag
```

See `lm-eval --help` or the CLI documentation for details.
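With backends now optional, downstream code that imports them should fail gracefully when an extra isn't installed. Below is a minimal sketch of such a guard; the `load_optional_backend` helper is our own illustration, not part of `lm_eval`, and stdlib modules stand in for the real backend so the snippet runs even without the harness installed:

```python
import importlib


def load_optional_backend(module_path: str, class_name: str):
    """Return the requested class, or None if the optional
    dependency (e.g. an lm_eval extra) is not installed."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    return getattr(module, class_name, None)


# With the [hf] extra installed, this would resolve HFLM:
#   hf = load_optional_backend("lm_eval.models.huggingface", "HFLM")

# Stdlib stand-ins keep this sketch runnable anywhere:
present = load_optional_backend("json", "JSONDecoder")        # installed
absent = load_optional_backend("not_an_installed_pkg", "X")   # missing
```

If `absent` comes back `None`, your code can surface a clear message such as "install the HF backend: `pip install lm_eval[hf]`" instead of a raw `ImportError` deep inside the harness.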
## Other Improvements

- Decoupled `ContextSampler` with new `build_qa_turn` helper (#3429)
- Normalized `gen_kwargs` with `truncation_side` support for vLLM (#3509)

## New Benchmarks & Tasks

- PISA task by @HallerPatrick in #3412
- SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
- OpenAI Multilingual MMLU by @Helw150 in #3473
- ULQA benchmark by @keramjan in #3340
- IFEval in Spanish and Catalan by @juliafalcao in #3467
- TruthfulQA-VA for Catalan by @sgs97ua in #3469
- Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
- NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444

## Model Support

- Ministral-3 adapter (`hf-mistral3`) by @medhakimbedhief in #3487

## Fixes & Improvements

### Task Fixes

- Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
- Fixed `gen_prefix` delimiter handling in multiple-choice tasks by @baberabb in #3508
- Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
- Fixed `a=0` as a valid answer index in `build_qa_turn` by @ezylopx5 in #3488
- Fixed `fewshot_config` not being applied to fewshot docs by @baberabb in #3461
- Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
- Fixed `gsm8k_cot_llama` `target_delimiter` issue by @baberabb in #3526
- Updated LIBRA task utils by @bond005 in #3520

### Backend Fixes

- Fixed vLLM off-by-one `max_length` error by @baberabb in #3503
- Resolved deprecated `vllm.transformers_utils.get_tokenizer` import by @DarkLight1337 in #3482
- Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
- Removed deprecated `AutoModelForVision2Seq` by @baberabb in #3522
- Fixed Anthropic chat model mapping by @lucafossen in #3453
- Fixed bug preventing `=` sign in checkpoint names by @mrinaldi97 in #3517
- Fixed `pretty_print_task` for external custom configs by @safikhanSoofiyani in #3436
- Fixed CLI regressions by @fxmarty-amd in #3449

## New Contributors

- @safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
- @lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
- @Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
- @ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
- @juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
- @medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
- @ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
- @keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
- @bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
- @mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
- @wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523

**Full Changelog**: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.9.10
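As an aside, the `a=0` answer-index fix in `build_qa_turn` (#3488) illustrates a classic Python pitfall worth knowing when writing task utilities: a bare truthiness check silently discards index `0`. The sketch below is a hypothetical reconstruction of that bug pattern (the function names are ours, not the harness's actual code), assuming the original check treated a falsy answer index as "missing":

```python
def pick_answer_buggy(choices, answer_index=None):
    """Bug pattern: `not 0` is True, so answer index 0 is dropped."""
    if not answer_index:
        return None
    return choices[answer_index]


def pick_answer_fixed(choices, answer_index=None):
    """Fix: only None means 'no answer'; 0 is a valid index."""
    if answer_index is None:
        return None
    return choices[answer_index]


choices = ["A", "B", "C"]
# pick_answer_buggy(choices, 0) silently returns None,
# while pick_answer_fixed(choices, 0) correctly returns "A".
```

The general rule: when `0`, `""`, or an empty list is a legitimate value, compare against `None` explicitly rather than relying on truthiness.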