BENCHMARKS

llmpm benchmark

Evaluate any model against 68+ industry-standard tasks directly from the terminal. Compare models, track regressions, and reproduce leaderboard results — all with a single command.

CLI COMMANDS

Install benchmark backend (one-time setup)

pip install llmpm[benchmark]

Run a single benchmark on an installed model

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks mmlu

Run the full Open LLM Leaderboard suite

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks openllm

Run multiple tasks in one pass

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks ifeval,hellaswag,mmlu

Benchmark any Hugging Face model directly (no install needed)

llmpm benchmark pretrained=gpt2 --tasks hellaswag

Quick smoke test — limit to 100 examples

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks mmlu --limit 100

Override few-shot count

llmpm benchmark pretrained=gpt2 --tasks hellaswag --num-fewshot 10
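
The few-shot override above can be scripted into a small sweep. A minimal sketch, assuming a POSIX shell; the commands are echoed rather than executed so the sketch runs even without llmpm installed — remove the `echo` to run them for real:

```shell
# Sweep --num-fewshot over several values for the same model and task.
# "echo" prints each command instead of running it (remove to execute).
for SHOTS in 0 5 10; do
  echo llmpm benchmark pretrained=gpt2 --tasks hellaswag --num-fewshot "$SHOTS"
done
```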

Save full HTML report to ./results/

llmpm benchmark meta-llama/Llama-3.2-3B-Instruct --tasks ifeval --output ./results/

List all supported benchmark tasks

llmpm benchmark --list-tasks
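
The commands above compose naturally into a model comparison. A minimal sketch, assuming a POSIX shell; the models and results layout are illustrative, and the commands are echoed so the sketch runs without llmpm installed — remove the `echo` to execute them:

```shell
# Run the same task list across several models, giving each run its own
# report directory under ./results/.
TASKS="ifeval,hellaswag,mmlu"

for MODEL in meta-llama/Llama-3.2-3B-Instruct pretrained=gpt2; do
  # Derive a directory name from the model id ("/" and "=" become "_").
  OUTDIR="./results/$(printf '%s' "$MODEL" | tr '/=' '__')"
  # Prints the command; remove "echo" to actually run the benchmark.
  echo llmpm benchmark "$MODEL" --tasks "$TASKS" --output "$OUTDIR"
done
```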

AVAILABLE TASKS: 68

LEADERBOARD SUITES

openllm  Open LLM Leaderboard — runs the full leaderboard suite in one pass (--tasks openllm)

CORE BENCHMARKS

hellaswag  HellaSwag — sentence-completion commonsense (10-shot)
gpqa  GPQA Diamond — 198 expert-level science QA (biology, chemistry, physics) [open mirror]
hle  Humanity's Last Exam — 3,000 expert-level questions across science, math & humanities
mmlu  MMLU — 57-subject academic knowledge (5-shot)
arc_challenge  ARC Challenge — 4-choice science QA, hard set (25-shot)
arc_easy  ARC Easy — 4-choice science QA, easy set (0-shot)
gsm8k  GSM8K — grade-school math word problems (5-shot)
truthfulqa  TruthfulQA — truthfulness & factuality (0-shot)
winogrande  Winogrande — Winograd-schema commonsense (5-shot)
ifeval  IFEval — verifiable instruction-following evaluation (0-shot)
bbh  BIG-Bench Hard — 23 hard reasoning tasks (3-shot CoT)
mmlu_pro  MMLU-Pro — harder 10-option MMLU variant (5-shot)

MATH & REASONING

hendrycks_math  MATH — competition-level math problems (4-shot)
minerva_math  Minerva Math — math reasoning with chain-of-thought (4-shot)
gsm_plus  GSM+ — augmented GSM8K with harder variants
mathqa  MathQA — algebraic & scientific word problems (0-shot)
asdiv  ASDiv — diverse arithmetic word problems
arithmetic  Arithmetic — basic n-digit arithmetic operations (0-shot)
bigbench  BIG-Bench — 200+ diverse tasks covering reasoning, knowledge, and language
anli  Adversarial NLI — robust NLI across 3 rounds (0-shot)

CODE GENERATION

humaneval  HumanEval — 164 Python programming problems (pass@k)
mbpp  MBPP — 374 crowd-sourced Python programming problems

READING COMPREHENSION

drop  DROP — discrete reasoning over paragraphs (3-shot)
squadv2  SQuAD v2 — reading comprehension with unanswerable Qs (1-shot)
race  RACE — reading comprehension (0-shot)
coqa  CoQA — conversational QA (0-shot)
super_glue  SuperGLUE — 8-task language understanding suite
glue  GLUE — 9-task NLU benchmark (validation sets)

KNOWLEDGE & FACTUALITY

triviaqa  TriviaQA — open-domain trivia QA (0-shot)
nq_open  Natural Questions — open-domain QA (0-shot)
sciq  SciQ — science exam QA with supporting documents (0-shot)
webqs  WebQuestions — Freebase-grounded QA (0-shot)
agieval  AGIEval — human-centric standardized exam tasks

COMMONSENSE REASONING

piqa  PIQA — physical intuition QA (0-shot)
siqa  SocialIQA — social-interaction reasoning (0-shot)
openbookqa  OpenBookQA — open-book science QA (0-shot)
commonsense_qa  CommonsenseQA — commonsense multiple-choice (0-shot)
wsc273  WSC273 — Winograd Schema Challenge (0-shot)
logiqa  LogiQA — logical reading comprehension (0-shot)
logiqa2  LogiQA 2.0 — updated logical reading comprehension (0-shot)
swag  SWAG — grounded commonsense inference (0-shot)
babi  bAbI — 20 QA tasks for reasoning over short stories

LONG CONTEXT

longbench  LongBench — bilingual long-context understanding benchmark

LANGUAGE MODELING & PERPLEXITY

wikitext  WikiText — word-level perplexity on WikiText-103
lambada  LAMBADA — last-word prediction requiring passage understanding

MEDICAL & SCIENTIFIC

medqa  MedQA — US medical licensing exam (USMLE) QA
medmcqa  MedMCQA — Indian medical entrance exam QA
pubmedqa  PubMedQA — biomedical research question answering

SAFETY & BIAS

wmdp  WMDP — hazardous knowledge benchmark (bio, chem, cyber)
bbq  BBQ — bias benchmark for QA across 9 social categories
crows_pairs  CrowS-Pairs — measuring social biases in language models
hendrycks_ethics  ETHICS — moral knowledge across 5 ethical frameworks
realtoxicityprompts  RealToxicityPrompts — toxicity of model continuations

MULTILINGUAL

mgsm  MGSM — multilingual grade-school math (11 languages)
belebele  Belebele — multilingual reading comprehension (122 languages)
global_mmlu  Global MMLU — MMLU translated into 42 languages
xnli  XNLI — cross-lingual NLI in 15 languages
xcopa  XCOPA — cross-lingual commonsense (11 languages)
xwinograd  XWinograd — multilingual Winograd schemas
okapi  Okapi — multilingual instruction-following benchmark
ceval  C-Eval — Chinese academic knowledge benchmark
cmmlu  CMMLU — Chinese multi-subject language understanding
kmmlu  KMMLU — Korean multi-subject language understanding

SUMMARIZATION

cnn_dailymail  CNN/DailyMail — news article summarization