UAB · Unbiased AI Bench
Glass box for model evals.
Every leaderboard, with receipts.
MASK
Live · updated continuously
Benchmarks · /benchmarks/scale-mask


Scale's hidden-goal safety benchmark, focused on whether models stay honest under conflicting incentives.
Source · Scale Labs
Version · scale-labs snapshot 2026-05-01
Scores · 14

Passport

Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source · Scale Labs
metric · Honesty score (%)
judge · Rubric
direction · higher better
group id · scale_mask_current
domain · Safety
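The passport fields above can be modeled as a small record. This is a hypothetical sketch only — the class and field names are assumed for illustration, not the site's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkPassport:
    """One benchmark's metadata card (hypothetical shape)."""
    source: str      # who runs the eval
    metric: str      # what the headline number means
    judge: str       # how outputs are scored
    direction: str   # "higher better" or "lower better"
    group_id: str    # comparable-group key for percentiles
    domain: str

# The MASK passport shown on this page, as such a record.
mask = BenchmarkPassport(
    source="Scale Labs",
    metric="Honesty score (%)",
    judge="Rubric",
    direction="higher better",
    group_id="scale_mask_current",
    domain="Safety",
)
```

Keeping `group_id` on the record makes the comparable-group rule below mechanical: percentiles are only ever computed among passports sharing the same key.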

What it measures vs what it misses

✓ Measures

Whether a model stays honest instead of covertly optimizing against the user.

✗ Misses

General capability breadth. Tool-use or retrieval quality.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
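The comparable-group rule — percentiles computed only against models in the same benchmark/version group — can be sketched in a few lines. This is an illustrative implementation, not the site's code; the `Score` record and `group_percentile` helper are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Score:
    model: str
    group_id: str  # e.g. "scale_mask_current"
    value: float   # honesty score, higher is better

def group_percentile(scores: list[Score], model: str, group_id: str) -> float:
    """Percentile of `model` among scores in the same group only.

    Scores from other benchmark/version groups never enter the comparison,
    which is exactly why the percentile is not a universal score.
    """
    group = [s.value for s in scores if s.group_id == group_id]
    mine = next(s.value for s in scores
                if s.model == model and s.group_id == group_id)
    below = sum(1 for v in group if v < mine)
    return 100.0 * below / len(group)
```

For example, with two models in `scale_mask_current` and one model in a different group, the outside model has no effect: the higher of the two in-group scores lands at the 50th percentile of a two-model group.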

Leaderboard · this benchmark version

#1 · Claude Opus 4.6 · SL · Apr 29, 2026 · 96.3%
#2 · GPT-OSS 120B · SL · Apr 29, 2026 · 92.0%
#3 · GPT-5.4 · SL · Apr 29, 2026 · 91.7%
#4 · GPT-OSS 20B · SL · Apr 29, 2026 · 86.5%
#5 · GPT-5.1 · SL · Apr 29, 2026 · 86.3%
#6 · GPT-5 · SL · Apr 29, 2026 · 79.3%
#7 · GPT-5.4 mini · SL · Apr 29, 2026 · 79.3%
#8 · GPT-5.4 nano · SL · Apr 29, 2026 · 79.3%
#9 · Gemini 2.5 Pro · SL · Apr 29, 2026 · 55.7%
#10 · DeepSeek Reasoner · SL · Apr 29, 2026 · 53.0%
#11 · Llama 4 Maverick · SL · Apr 29, 2026 · 49.7%
#12 · Gemini 3.1 Flash-Lite Preview · SL · Apr 29, 2026 · 48.4%
#13 · Gemini 3 Pro Preview · SL · Apr 29, 2026 · 42.6%
#14 · Gemini 3.1 Pro · SL · Apr 29, 2026 · 42.4%