Reference · updated 2026‑04‑28

Twenty benchmarks. Every frontier score, every source linked.

A standing comparison across the 20 most-cited public benchmarks for AI models. Every cell shows the model’s score as published by the lab that built it — or, where the score comes from a third-party harness, the harness vendor. Each footnote links back to the primary source.

Krentix appears in every row. Where Krentix has been measured against the public dataset (currently HumanEval, all 164 problems: 89.0% pass@1; and MBPP+, all 378 problems: 81.5% pass@1), the cell shows the score in gold. Where Krentix hasn’t been measured yet, the cell is dashed — we don’t fabricate numbers to fill space.

2026‑04‑28 · MBPP+ landed · 308 of 378 = 81.5% pass@1, full dataset, medium tier. 69 of the 70 failures were genuine full-pipeline misses; the remaining one was a routing bug. The full-power tier is expected to add 6–10 points once the router patches and tier wiring ship. GSM8K is queued next.

01 · The table

Twenty benchmarks spanning coding, math, reasoning, knowledge, agentic, and composite categories.

Columns are the strongest current frontier model from each major lab plus Krentix. Where a model has multiple variants (e.g. reasoning vs non-reasoning), the strongest publicly disclosed variant is shown.

| Benchmark | Krentix | Claude Opus 4.7 | GPT‑5.5 (xhigh) | Gemini 3.1 Pro | Other frontier |
| --- | --- | --- | --- | --- | --- |
| HumanEval · Coding · pass@1 | 89.0% (146/164) [1] | ~saturated cluster, >92% [12] | ~saturated cluster, >92% [12] | ~saturated cluster, >92% [12] | — |
| MBPP+ · Coding · pass@1 · full 378 | 81.5% (308/378, medium tier) [14] | not standardised [13] | not standardised [13] | not standardised [13] | — |
| SWE‑bench Verified · Coding · % resolved | Docker harness pending | 87.6% (via Vellum) [2] | 58.6% (via Vellum) [2] | — | 82.0% Sonnet 4.5 [2] |
| LiveCodeBench · Coding · pass@1 | — | — | — | 91.7% Gemini 3 Pro (high) [3] | 89.6% DeepSeek V3.2 [3] |
| SciCode · Scientific code · pass@1 | — | — | 56.1% [4] | 58.9% [4] | 56.6% GPT‑5.4 [4] |
| GSM8K · Grade-school math | queued · runs after MBPP+ | saturated | saturated | saturated | >95% across cluster [13] |
| MATH‑500 · Competition math | — | — | — | — | 99.4% GPT‑5 (high) [5] · 99.2% o3 [5] |
| AIME 2025 · Olympiad math · correct/15 | — | 99.8% (Opus 4.6, via Vellum) [2] | 100% (GPT‑5.2 xhigh) [6] | 100% (Gemini 3 Flash Reasoning) [6] | 99.1% Kimi K2 Thinking [2] |
| MMLU · General knowledge | — | saturated | saturated | saturated | >90% across cluster [13] |
| MMLU‑Pro · Harder MMLU | — | 89.5% (Opus 4.5 Reasoning) [7] | — | 89.8% (Gemini 3 Pro, high) [7] | 89.5% Gemini 3 Pro (low) [7] |
| GPQA Diamond · Graduate-level Q&A | — | 94.2% (via Vellum) [2] | 93.5% [8] | 94.1% [8] | 95.4% Claude 3 Opus [2] |
| BIG‑Bench Hard · 23 hard reasoning tasks | — | not standardised [13] | not standardised [13] | not standardised [13] | — |
| Humanity’s Last Exam · Hardest reasoning · gated dataset | dataset gated by authors | 40.0% (Opus 4.6, via Vellum) [2] | 44.3% [9] | 44.7% [9] | 44.9% Kimi K2 Thinking [2] |
| ARC‑AGI 2 · Abstract reasoning | — | 68.8% (Opus 4.6, via Vellum) [2] | 85.0% [2] | — | 58.3% Sonnet 4.5 [2] |
| Terminal‑Bench · Agentic CLI | harness gated | qualitative claim: «passes prior fails» [10] | — | — | — |
| TruthfulQA · Truthfulness | harness pending | not standardised [13] | not standardised [13] | not standardised [13] | — |
| RULER · Long context · 128K avg | — | not standardised [13] | not standardised [13] | not standardised [13] | — |
| MMMU · Multimodal · vision+text | n/a · Krentix is text-only | not standardised [13] | not standardised [13] | not standardised [13] | — |
| MathVista · Multimodal math | n/a · Krentix is text-only | not standardised [13] | not standardised [13] | not standardised [13] | — |
| AA Intelligence Index · Composite · 0–100 | n/a · composite, not directly runnable | 57 (max variant) [11] | 60 [11] | 57 [11] | 54 Kimi K2.6, MiMo‑V2.5 Pro [11] |

How to read this table.

Krentix scores are measured by us, on the public dataset, with a harness that’s in this repo. Today the Krentix-measured rows are HumanEval (89.0%, 146/164, full set) and MBPP+ (81.5%, 308/378, full set); GSM8K is queued right after. Each completed run pushes a per-problem JSON to the bench repo and updates the cell here on the next deploy. Dashed cells haven’t been run yet; nothing is hidden.
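The per-problem JSON is the unit of record; the cell is just an aggregate over it. A minimal sketch of that aggregation, assuming a JSONL file of one record per problem (the path and the task-level `passed` field are illustrative, not the bench repo’s actual schema):

```python
import json

def cell_score(path: str) -> str:
    """Aggregate a per-problem JSONL into the table-cell string."""
    with open(path) as f:
        results = [json.loads(line) for line in f]  # one record per problem
    passed = sum(1 for r in results if r["passed"])
    return f"{100 * passed / len(results):.1f}% ({passed}/{len(results)})"

# e.g. cell_score("bench/humaneval/results.jsonl") -> "89.0% (146/164)"
```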

Frontier scores link to one of three sources: (a) the lab’s own release announcement or model card, (b) Artificial Analysis, which aggregates and re-runs vendor numbers, or (c) Vellum’s LLM leaderboard, which consolidates lab‑published scores. Vendor labs in 2026 mostly stopped publishing standardised benchmark tables in their release announcements — they cite saturated tests, partner benchmarks, or charts without numbers. We use AA / Vellum where the lab itself doesn’t publish a number we can quote.

«Saturated» means the top frontier cluster sits within ~1–2 points of each other and within ~5 points of the benchmark’s ceiling. HumanEval, GSM8K and MMLU all hit this state in 2024–2025; differentiation has moved to harder benchmarks (HLE, GPQA, SciCode, SWE‑bench, ARC‑AGI 2).
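Stated as code, roughly (the thresholds and the `is_saturated` name are our shorthand for reading the table, not part of any harness):

```python
def is_saturated(scores: list[float], ceiling: float = 100.0) -> bool:
    """Top frontier cluster packed within ~2 points of each other
    and within ~5 points of the benchmark's ceiling."""
    top = sorted(scores, reverse=True)[:4]   # the frontier cluster
    return (max(top) - min(top) <= 2.0       # ~1-2 points apart
            and ceiling - max(top) <= 5.0)   # ~5 points off ceiling

print(is_saturated([96.4, 96.1, 95.8, 95.2]))  # True  (GSM8K-like cluster)
print(is_saturated([44.7, 44.3, 43.0, 40.0]))  # False (HLE-like cluster)
```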

02 · Sources

Every cell, traced back.

Numbered footnotes in the table link to the entries below. Each entry points to the primary public source we read the score from, with the date of access.

[1] Krentix HumanEval pass@1 = 89.0% (146/164), measured 2026-04-28 against openai/human-eval v2 (2021-07-05). Harness, raw per-problem results, and seed: github.com/joelrobic-gif/krentix-landing/bench.
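For orientation, this is the upstream harness’s documented flow; `generate_completion` is a placeholder for the model call, not our actual harness code (that lives in the bench repo above):

```python
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    return ""  # placeholder: the real harness calls Krentix here

problems = read_problems()  # all 164 HumanEval tasks
write_jsonl("samples.jsonl", [
    dict(task_id=task_id, completion=generate_completion(p["prompt"]))
    for task_id, p in problems.items()
])
# Then, per the upstream README:
#   $ evaluate_functional_correctness samples.jsonl
# which scores pass@1 by executing each completion against the unit tests.
```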

[2] Vellum LLM leaderboard, accessed 2026-04-27. vellum.ai/llm-leaderboard. Aggregates published scores for Claude (Anthropic), GPT (OpenAI), Gemini (Google), Kimi (Moonshot), and others. Cells citing this source: SWE-bench Verified, AIME 2025, ARC-AGI 2, HLE (Opus 4.6, Kimi K2 Thinking), GPQA (Claude 3 Opus, Opus 4.7).

[3] Artificial Analysis — LiveCodeBench leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/livecodebench. Top three: Gemini 3 Pro Preview (high) 91.7%, Gemini 3 Flash Preview (Reasoning) 90.8%, DeepSeek V3.2 Speciale 89.6%.

[4] Artificial Analysis — SciCode leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/scicode. Top three: Gemini 3.1 Pro Preview 58.9%, GPT-5.4 (xhigh) 56.6%, GPT-5.5 (xhigh) 56.1%.

[5] Artificial Analysis — MATH-500 leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/math-500. Top three: GPT-5 (high) 99.4%, o3 99.2%, Grok 3 mini Reasoning (high) 99.2%.

[6] Artificial Analysis — AIME 2025 leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/aime-2025. Saturation cluster at 100%: GPT-5.2 (xhigh), GPT-5 Codex (high), Gemini 3 Flash Preview (Reasoning).

[7] Artificial Analysis — MMLU-Pro leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/mmlu-pro. Top three: Gemini 3 Pro Preview (high) 89.8%, Gemini 3 Pro Preview (low) 89.5%, Claude Opus 4.5 (Reasoning) 89.5%.

[8] Artificial Analysis — GPQA Diamond leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/gpqa-diamond. Top three: Gemini 3.1 Pro Preview 94.1%, GPT-5.5 (xhigh) 93.5%, GPT-5.5 (high) 93.2%.

[9] Artificial Analysis — Humanity’s Last Exam leaderboard, accessed 2026-04-27. artificialanalysis.ai/evaluations/humanitys-last-exam. Top three: Gemini 3.1 Pro Preview 44.7%, GPT-5.5 (xhigh) 44.3%, GPT-5.5 (high) 43.0%.

[10] Anthropic Claude Opus product page, accessed 2026-04-27. anthropic.com/claude/opus. Anthropic publishes qualitative claims («Opus 4.7 passed Terminal Bench tasks that prior Claude models had failed») but no comparable numerical leaderboard score for Terminal-Bench.

[11] Artificial Analysis — Intelligence Index leaderboard, accessed 2026-04-27. artificialanalysis.ai/leaderboards/models. Composite score 0–100 across multiple benchmarks. Top: GPT-5.5 (xhigh) 60, GPT-5.5 (high) 59, Claude Opus 4.7 (max), Gemini 3.1 Pro Preview, GPT-5.4 (xhigh), GPT-5.5 (medium) all 57.

[12] HumanEval «saturated cluster»: Claude 3.5 Sonnet reported 92.0% (Anthropic, 2024). Claude 3.7 Sonnet reported 93.7%. GPT-4 Turbo reported 87.1% (OpenAI). Gemini 1.5 Pro reported 84.1% (Google). Newer 2025–2026 frontier models score above 90% but vendors no longer feature HumanEval in release announcements (the test is considered saturated). We cite the saturation cluster qualitatively rather than fabricate a current vendor number.

[13] «Vendor scores not standardised»: As of April 2026, vendor labs (Anthropic, OpenAI, Google) have largely stopped publishing standard-benchmark tables in their release announcements for the most current models, citing saturation, partner-specific benchmarks, or visual charts without numerical labels. Where this note appears we don’t have a vendor-published score we can quote and won’t fabricate one. Krentix runs against the public dataset are queued separately.

[14] Krentix MBPP+ pass@1 = 81.5% (308/378), measured 2026-04-28 against evalplus/mbppplus (Liu et al., NeurIPS 2023). Wall: 98.7 min. Medium tier (current default lineup); 69 of 70 fails were full-pipeline ensemble misses, 1 was a speed_path routing bug. Harness, raw per-problem results, and stderr tails on every failure: github.com/joelrobic-gif/krentix-landing/bench/mbpp.
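For orientation, the evalplus flow for a run like this, per its README; `generate_solution` is again a placeholder for the model call, and the actual harness and tier wiring are in the bench repo:

```python
from evalplus.data import get_mbpp_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    return ""  # placeholder: the real harness calls Krentix (medium tier) here

write_jsonl("samples.jsonl", [
    dict(task_id=task_id, solution=generate_solution(p["prompt"]))
    for task_id, p in get_mbpp_plus().items()  # 378 tasks in the current release
])
# Then:
#   $ evalplus.evaluate --dataset mbpp --samples samples.jsonl
# which reports pass@1 on both the MBPP base tests and the MBPP+ extended tests.
```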