BENCHMARKS

llmpm benchmark

Evaluate any model against 68+ industry-standard tasks directly from the terminal. Compare models, track regressions, and reproduce leaderboard results — all with a single command.

CLI COMMANDS

Install benchmark backend (one-time setup)

pip install llmpm[benchmark]

Run a single benchmark on an installed model

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks mmlu

Run the full Open LLM Leaderboard suite

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks openllm

Run multiple tasks in one pass

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks ifeval,hellaswag,mmlu

Benchmark any Hugging Face model directly (no install needed)

llmpm benchmark pretrained=gpt2 --tasks hellaswag

Quick smoke test — limit to 100 examples

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks mmlu --limit 100

Override few-shot count

llmpm benchmark pretrained=gpt2 --tasks hellaswag --num-fewshot 10
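
The few-shot override above can be scripted into a small sweep. A minimal sketch, assuming a POSIX shell; the commands are echoed rather than executed so the sketch runs even without llmpm installed — remove the `echo` to run them for real:

```shell
# Sweep --num-fewshot over several values for the same model and task.
# "echo" prints each command instead of running it (remove to execute).
for SHOTS in 0 5 10; do
  echo llmpm benchmark pretrained=gpt2 --tasks hellaswag --num-fewshot "$SHOTS"
done
```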

Save full HTML report to ./results/

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks ifeval --output ./results/

List all supported benchmark tasks

llmpm benchmark --list-tasks
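
The commands above compose naturally into a model comparison. A minimal sketch, assuming a POSIX shell; the models and results layout are illustrative, and the commands are echoed so the sketch runs without llmpm installed — remove the `echo` to execute them:

```shell
# Run the same task list across several models, giving each run its own
# report directory under ./results/.
TASKS="ifeval,hellaswag,mmlu"

for MODEL in meta-llama/Llama-3.2-3B-Instruct pretrained=gpt2; do
  # Derive a directory name from the model id ("/" and "=" become "_").
  OUTDIR="./results/$(printf '%s' "$MODEL" | tr '/=' '__')"
  # Prints the command; remove "echo" to actually run the benchmark.
  echo llmpm benchmark "$MODEL" --tasks "$TASKS" --output "$OUTDIR"
done
```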

AVAILABLE TASKS: 68

LEADERBOARD SUITES

openllm  Open LLM Leaderboard — runs the full leaderboard suite in one pass (--tasks openllm)

CORE BENCHMARKS

hellaswag  HellaSwag — sentence-completion commonsense (10-shot)
gpqa  GPQA Diamond — 198 expert-level science QA (biology, chemistry, physics) [open mirror]
hle  Humanity's Last Exam — 3,000 expert-level questions across science, math & humanities
mmlu  MMLU — 57-subject academic knowledge (5-shot)
arc_challenge  ARC Challenge — 4-choice science QA, hard set (25-shot)
arc_easy  ARC Easy — 4-choice science QA, easy set (0-shot)
gsm8k  GSM8K — grade-school math word problems (5-shot)
truthfulqa  TruthfulQA — truthfulness & factuality (0-shot)
winogrande  Winogrande — Winograd-schema commonsense (5-shot)
ifeval  IFEval — verifiable instruction-following evaluation (0-shot)
bbh  BIG-Bench Hard — 23 hard reasoning tasks (3-shot CoT)
mmlu_pro  MMLU-Pro — harder 10-option MMLU variant (5-shot)

MATH & REASONING

hendrycks_math  MATH — competition-level math problems (4-shot)
minerva_math  Minerva Math — math reasoning with chain-of-thought (4-shot)
gsm_plus  GSM+ — augmented GSM8K with harder variants
mathqa  MathQA — algebraic & scientific word problems (0-shot)
asdiv  ASDiv — diverse arithmetic word problems
arithmetic  Arithmetic — basic n-digit arithmetic operations (0-shot)
bigbench  BIG-Bench — 200+ diverse tasks covering reasoning, knowledge, and language
anli  Adversarial NLI — robust NLI across 3 rounds (0-shot)

CODE GENERATION

humaneval  HumanEval — 164 Python programming problems (pass@k)
mbpp  MBPP — 374 crowd-sourced Python programming problems

READING COMPREHENSION

drop  DROP — discrete reasoning over paragraphs (3-shot)
squadv2  SQuAD v2 — reading comprehension with unanswerable Qs (1-shot)
race  RACE — reading comprehension (0-shot)
coqa  CoQA — conversational QA (0-shot)
super_glue  SuperGLUE — 8-task language understanding suite
glue  GLUE — 9-task NLU benchmark (validation sets)

KNOWLEDGE & FACTUALITY

triviaqa  TriviaQA — open-domain trivia QA (0-shot)
nq_open  Natural Questions — open-domain QA (0-shot)
sciq  SciQ — science exam QA with supporting documents (0-shot)
webqs  WebQuestions — Freebase-grounded QA (0-shot)
agieval  AGIEval — human-centric standardized exam tasks

COMMONSENSE REASONING

piqa  PIQA — physical intuition QA (0-shot)
siqa  SocialIQA — social-interaction reasoning (0-shot)
openbookqa  OpenBookQA — open-book science QA (0-shot)
commonsense_qa  CommonsenseQA — commonsense multiple-choice (0-shot)
wsc273  WSC273 — Winograd Schema Challenge (0-shot)
logiqa  LogiQA — logical reading comprehension (0-shot)
logiqa2  LogiQA 2.0 — updated logical reading comprehension (0-shot)
swag  SWAG — grounded commonsense inference (0-shot)
babi  bAbI — 20 QA tasks for reasoning over short stories

LONG CONTEXT

longbench  LongBench — bilingual long-context understanding benchmark

LANGUAGE MODELING & PERPLEXITY

wikitext  WikiText — word-level perplexity on WikiText-103
lambada  LAMBADA — last-word prediction requiring passage understanding

MEDICAL & SCIENTIFIC

medqa  MedQA — US medical licensing exam (USMLE) QA
medmcqa  MedMCQA — Indian medical entrance exam QA
pubmedqa  PubMedQA — biomedical research question answering

SAFETY & BIAS

wmdp  WMDP — hazardous knowledge benchmark (bio, chem, cyber)
bbq  BBQ — bias benchmark for QA across 9 social categories
crows_pairs  CrowS-Pairs — measuring social biases in language models
hendrycks_ethics  ETHICS — moral knowledge across 5 ethical frameworks
realtoxicityprompts  RealToxicityPrompts — toxicity of model continuations

MULTILINGUAL

mgsm  MGSM — multilingual grade-school math (11 languages)
belebele  Belebele — multilingual reading comprehension (122 languages)
global_mmlu  Global MMLU — MMLU translated into 42 languages
xnli  XNLI — cross-lingual NLI in 15 languages
xcopa  XCOPA — cross-lingual commonsense (11 languages)
xwinograd  XWinograd — multilingual Winograd schemas
okapi  Okapi — multilingual instruction-following benchmark
ceval  C-Eval — Chinese academic knowledge benchmark
cmmlu  CMMLU — Chinese multi-subject language understanding
kmmlu  KMMLU — Korean multi-subject language understanding

SUMMARIZATION

cnn_dailymail  CNN/DailyMail — news article summarization