EnigmaEval

Hard reasoning suite on the Scale leaderboard.

Source · Scale Labs
Version · scale-labs snapshot 2026-05-01
Scores · 16

Passport

Thin verified coverage

This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.

source · Scale Labs
metric · Pass rate (%)
judge · Rubric
direction · higher better
group id · scale_enigma_current
domain · Reasoning / math / science

What it measures vs what it misses

✓ Measures

Performance on advanced reasoning problems.

✗ Misses

Preference, price, and broad product usability.

Why this counts

It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.

Comparable-group rule

This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses

It still misses product usability, latency, and whether the model stays correct in messy real workflows.
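The comparable-group rule above can be illustrated with a minimal sketch. The page does not state how its percentile is computed, so the formula here (share of models in the same group scoring at or below the target) is an assumption, as are the names `Score` and `percentile_in_group`. The three `scale_enigma_current` rows reuse pass rates from this leaderboard; the `other_benchmark_v1` row is hypothetical and exists only to show that scores from another group are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    group_id: str  # benchmark/version group, e.g. "scale_enigma_current"
    value: float   # pass rate in percent; higher is better


def percentile_in_group(scores, model, group_id):
    """Percent of models in `group_id` scoring at or below `model`.

    Assumed formula; only rows sharing the exact group id are compared.
    """
    group = [s for s in scores if s.group_id == group_id]
    target = next((s for s in group if s.model == model), None)
    if target is None:
        raise KeyError(f"{model!r} has no score in group {group_id!r}")
    at_or_below = sum(1 for s in group if s.value <= target.value)
    return 100.0 * at_or_below / len(group)


scores = [
    Score("o3", "scale_enigma_current", 71.0),
    Score("GPT-5", "scale_enigma_current", 64.0),
    Score("Grok 4", "scale_enigma_current", 54.0),
    # Hypothetical row from a different group; it never enters the
    # comparison for "scale_enigma_current".
    Score("o3", "other_benchmark_v1", 40.0),
]

for name in ("o3", "GPT-5", "Grok 4"):
    pct = percentile_in_group(scores, name, "scale_enigma_current")
    print(f"{name}: {pct:.1f}")
```

Because the filter runs before any ranking, a model's percentile in one group says nothing about its standing in another, which is why the figure is not a universal score.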

Leaderboard · this benchmark version

#1 · o3 · SL · Apr 1, 2026 · 71%
#2 · DeepSeek Reasoner · SL · Apr 1, 2026 · 68%
#3 · Gemini 2.5 Pro · SL · Apr 1, 2026 · 66%
#4 · GPT-5 · SL · Apr 1, 2026 · 64%
#5 · GPT-5.4 · SL · Apr 1, 2026 · 64%
#6 · GPT-5.4 mini · SL · Apr 1, 2026 · 64%
#7 · GPT-5.4 nano · SL · Apr 1, 2026 · 64%
#8 · Claude Opus 4.1 · SL · Apr 1, 2026 · 62%
#9 · Claude Opus 4 · SL · Apr 1, 2026 · 62%
#10 · Claude Opus 4.6 · SL · Apr 1, 2026 · 62%
#11 · Claude Opus 4.7 · SL · Apr 1, 2026 · 62%
#12 · Claude Sonnet 4 · SL · Apr 1, 2026 · 58%
#13 · Claude Sonnet 4.5 · SL · Apr 1, 2026 · 58%
#14 · Claude Sonnet 4.6 · SL · Apr 1, 2026 · 58%
#15 · Qwen3 235B A22B · SL · Apr 1, 2026 · 57%
#16 · Grok 4 · SL · Apr 1, 2026 · 54%