UAB · Unbiased AI Bench
Glass box for model evals.
Every leaderboard, with receipts.
MASK
Live · updated continuously
Benchmarks · /benchmarks/scale-mask


Scale's hidden-goal safety benchmark, focused on whether models stay honest under conflicting incentives.
Source · Scale Labs
Version · scale-labs snapshot 2026-05-01
Scores · 14

Passport

Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source · Scale Labs
metric · Honesty score (%)
judge · Rubric
direction · higher better
group id · scale_mask_current
domain · Safety
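The passport fields above can be modeled as a small record. This is a hypothetical sketch only — the class and field names are assumed for illustration, not the site's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkPassport:
    """One benchmark's metadata card (hypothetical shape)."""
    source: str      # who runs the eval
    metric: str      # what the headline number means
    judge: str       # how outputs are scored
    direction: str   # "higher better" or "lower better"
    group_id: str    # comparable-group key for percentiles
    domain: str

# The MASK passport shown on this page, as such a record.
mask = BenchmarkPassport(
    source="Scale Labs",
    metric="Honesty score (%)",
    judge="Rubric",
    direction="higher better",
    group_id="scale_mask_current",
    domain="Safety",
)
```

Keeping `group_id` on the record makes the comparable-group rule below mechanical: percentiles are only ever computed among passports sharing the same key.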

What it measures vs what it misses

✓ Measures

Whether a model stays honest instead of covertly optimizing against the user.

✗ Misses

General capability breadth. Tool-use or retrieval quality.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
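The comparable-group rule — percentiles computed only against models in the same benchmark/version group — can be sketched in a few lines. This is an illustrative implementation, not the site's code; the `Score` record and `group_percentile` helper are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Score:
    model: str
    group_id: str  # e.g. "scale_mask_current"
    value: float   # honesty score, higher is better

def group_percentile(scores: list[Score], model: str, group_id: str) -> float:
    """Percentile of `model` among scores in the same group only.

    Scores from other benchmark/version groups never enter the comparison,
    which is exactly why the percentile is not a universal score.
    """
    group = [s.value for s in scores if s.group_id == group_id]
    mine = next(s.value for s in scores
                if s.model == model and s.group_id == group_id)
    below = sum(1 for v in group if v < mine)
    return 100.0 * below / len(group)
```

For example, with two models in `scale_mask_current` and one model in a different group, the outside model has no effect: the higher of the two in-group scores lands at the 50th percentile of a two-model group.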

Leaderboard · this benchmark version

#1 · Claude Opus 4.6 · SL · Apr 29, 2026 · 96.3%
#2 · GPT-OSS 120B · SL · Apr 29, 2026 · 92.0%
#3 · GPT-5.4 · SL · Apr 29, 2026 · 91.7%
#4 · GPT-OSS 20B · SL · Apr 29, 2026 · 86.5%
#5 · GPT-5.1 · SL · Apr 29, 2026 · 86.3%
#6 · GPT-5 · SL · Apr 29, 2026 · 79.3%
#7 · GPT-5.4 mini · SL · Apr 29, 2026 · 79.3%
#8 · GPT-5.4 nano · SL · Apr 29, 2026 · 79.3%
#9 · Gemini 2.5 Pro · SL · Apr 29, 2026 · 55.7%
#10 · DeepSeek Reasoner · SL · Apr 29, 2026 · 53.0%
#11 · Llama 4 Maverick · SL · Apr 29, 2026 · 49.7%
#12 · Gemini 3.1 Flash-Lite Preview · SL · Apr 29, 2026 · 48.4%
#13 · Gemini 3 Pro Preview · SL · Apr 29, 2026 · 42.6%
#14 · Gemini 3.1 Pro · SL · Apr 29, 2026 · 42.4%