Unbiased AI Bench (UAB) · Glass box for model evals.
Every leaderboard, with receipts.

EnigmaEval

Hard reasoning suite on the Scale leaderboard.
Source · Scale Labs
Version · scale-labs snapshot 2026-05-01
Scores · 16

Passport

Thin verified coverage
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
Source · Scale Labs
Metric · Pass rate (%)
Judge · Rubric
Direction · higher is better
Group ID · scale_enigma_current
Domain · Reasoning / math / science

What it measures vs what it misses

✓ Measures

Performance on advanced reasoning problems.

✗ Misses

Preference, price, and broad product usability.

Why this counts
It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses
It still misses product usability, latency, and whether the model stays correct in messy real workflows.
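
The comparable-group rule can be illustrated with a small sketch. The page does not state how the percentile is computed, so the mid-rank convention used here (models scoring strictly lower, plus half of the ties, divided by group size) is an assumption, and `percentile_in_group` is a hypothetical helper rather than anything the site exposes. The scores are the pass rates listed in the leaderboard on this page.

```python
from typing import Dict

# Pass rates (%) copied from this page's leaderboard.
# All entries belong to one comparable group: scale_enigma_current.
SCORES: Dict[str, float] = {
    "o3": 71, "DeepSeek Reasoner": 68, "Gemini 2.5 Pro": 66,
    "GPT-5": 64, "GPT-5.4": 64, "GPT-5.4 mini": 64, "GPT-5.4 nano": 64,
    "Claude Opus 4.1": 62, "Claude Opus 4": 62,
    "Claude Opus 4.6": 62, "Claude Opus 4.7": 62,
    "Claude Sonnet 4": 58, "Claude Sonnet 4.5": 58, "Claude Sonnet 4.6": 58,
    "Qwen3 235B A22B": 57, "Grok 4": 54,
}


def percentile_in_group(model: str, scores: Dict[str, float]) -> float:
    """Mid-rank percentile of `model` within a single benchmark/version group.

    ASSUMPTION: the mid-rank convention is illustrative only; the site's
    actual formula is not given on this page.
    """
    own = scores[model]
    below = sum(1 for s in scores.values() if s < own)
    ties = sum(1 for s in scores.values() if s == own)  # includes the model itself
    return 100.0 * (below + 0.5 * ties) / len(scores)


print(percentile_in_group("o3", SCORES))      # 96.875
print(percentile_in_group("GPT-5", SCORES))   # 68.75
print(percentile_in_group("Grok 4", SCORES))  # 3.125
```

Because the denominator is the size of this one group (16 scores), the result says nothing about how a model would place in a different benchmark or a different snapshot of this one.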

Leaderboard · this benchmark version

Rank · Model · Source · Date · Pass rate
#1 · o3 · SL · Apr 1, 2026 · 71%
#2 · DeepSeek Reasoner · SL · Apr 1, 2026 · 68%
#3 · Gemini 2.5 Pro · SL · Apr 1, 2026 · 66%
#4 · GPT-5 · SL · Apr 1, 2026 · 64%
#5 · GPT-5.4 · SL · Apr 1, 2026 · 64%
#6 · GPT-5.4 mini · SL · Apr 1, 2026 · 64%
#7 · GPT-5.4 nano · SL · Apr 1, 2026 · 64%
#8 · Claude Opus 4.1 · SL · Apr 1, 2026 · 62%
#9 · Claude Opus 4 · SL · Apr 1, 2026 · 62%
#10 · Claude Opus 4.6 · SL · Apr 1, 2026 · 62%
#11 · Claude Opus 4.7 · SL · Apr 1, 2026 · 62%
#12 · Claude Sonnet 4 · SL · Apr 1, 2026 · 58%
#13 · Claude Sonnet 4.5 · SL · Apr 1, 2026 · 58%
#14 · Claude Sonnet 4.6 · SL · Apr 1, 2026 · 58%
#15 · Qwen3 235B A22B · SL · Apr 1, 2026 · 57%
#16 · Grok 4 · SL · Apr 1, 2026 · 54%