HiL-Bench
Live · updated continuously

Scale human-in-the-loop benchmark spanning software engineering and tool-assisted workflows.
Source · Scale Labs
Version · scale-labs snapshot 2026-05-01
Scores · 6

Passport

Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source · Scale Labs
metric · Success rate (%)
judge · Rubric
direction · higher is better
group id · scale_hil_current
domain · Coding
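
The passport is effectively a small structured record, so a quick sketch can show how its fields would be consumed downstream. This is a hypothetical model, not the site's code: the `Passport` dataclass and `sort_scores` helper are invented names, and only the field values come from the passport above. The point is that `direction` is what tells a consumer that a larger success rate ranks first.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passport:
    """Hypothetical record mirroring the passport fields above."""
    source: str
    metric: str
    judge: str
    direction: str      # "higher" or "lower" is better
    group_id: str
    domain: str

HIL_BENCH = Passport(
    source="Scale Labs",
    metric="Success rate (%)",
    judge="Rubric",
    direction="higher",
    group_id="scale_hil_current",
    domain="Coding",
)

def sort_scores(scores: list[float], passport: Passport) -> list[float]:
    # "direction · higher is better" means the best score sorts first.
    return sorted(scores, reverse=(passport.direction == "higher"))
```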

What it measures vs what it misses

✓ Measures

Recovery and execution quality on coding-heavy tasks with intervention paths.

✗ Misses

Pure chat fluency. Standalone latency metrics.

Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than in curated marketing examples.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score; a short sketch of the arithmetic follows below.

What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
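The comparable-group rule is the part aggregation scripts most often get wrong, so here is a minimal sketch of a within-group percentile. The site does not publish its exact formula; this assumes the common "percent of group scores at or below the given score" convention, and `percentile_in_group` is a hypothetical helper, not the site's API. Restricting `group_scores` to a single group id (here, scale_hil_current) is the whole point of the rule.

```python
def percentile_in_group(score: float, group_scores: list[float]) -> float:
    """Percentile of `score` among scores from the SAME benchmark/version
    group only (here, group id scale_hil_current). Assumed convention:
    percent of the group scoring at or below; higher is better."""
    if not group_scores:
        raise ValueError("empty comparison group")
    at_or_below = sum(1 for s in group_scores if s <= score)
    return 100.0 * at_or_below / len(group_scores)
```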

Leaderboard · this benchmark version

#1 · GPT-5.5 · SL · Apr 29, 2026 · 29.1%
#2 · Claude Opus 4.7 · SL · Apr 29, 2026 · 27.7%
#3 · Claude Opus 4.6 · SL · Apr 29, 2026 · 24.3%
#4 · Gemini 3.1 Pro · SL · Apr 29, 2026 · 20.3%
#5 · GPT-5.4 · SL · Apr 29, 2026 · 9.3%
#6 · Grok 4.20 · SL · Apr 29, 2026 · 8.0%
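
As a usage sketch, the six scores above can be run through the same within-group percentile idea; the helper is inlined here so the snippet stands alone. Only the scores come from this leaderboard, everything else is illustrative.

```python
def percentile_in_group(score: float, group_scores: list[float]) -> float:
    # Same assumed "percent at or below" convention as the sketch above.
    return 100.0 * sum(s <= score for s in group_scores) / len(group_scores)

# The six scores listed above (group id scale_hil_current).
group = [29.1, 27.7, 24.3, 20.3, 9.3, 8.0]
for score in group:
    print(f"{score:>5.1f}% -> in-group percentile {percentile_in_group(score, group):.0f}")
```

GPT-5.5's 29.1% lands at the 100th percentile of this group only, which is exactly why the comparable-group rule warns that the percentile is not a universal score.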