Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source
Scale Labs
metric
Success rate (%)
judge
Rubric
direction
higher better
group id
scale_hil_current
domain
Coding
What it measures vs what it misses
✓ Measures
Recovery and execution quality on coding-heavy tasks with intervention paths.
✗ Misses
Pure chat fluency. Standalone latency metrics.
Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.
Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
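As a rough illustration of the comparable-group rule, the sketch below computes a percentile only among models that share the same group id. The record layout, the model names, the scores, the second group id `other_group`, and the percentile definition (share of models in the group scoring at or below the given model) are all assumptions made for this example; the source does not state how its percentile is calculated.

```python
from collections import defaultdict


def percentile_within_group(records):
    """Return {(group_id, model): percentile} computed per group.

    Assumed definition: percentile = share of models in the same group
    whose score is less than or equal to this model's score.
    Higher scores are treated as better, matching the card's direction.
    """
    by_group = defaultdict(list)
    for record in records:
        by_group[record["group_id"]].append(record)

    result = {}
    for group_id, rows in by_group.items():
        scores = [row["score"] for row in rows]
        for row in rows:
            at_or_below = sum(1 for s in scores if s <= row["score"])
            result[(group_id, row["model"])] = 100.0 * at_or_below / len(scores)
    return result


# Hypothetical data: model names and success rates are invented.
records = [
    {"model": "model_a", "group_id": "scale_hil_current", "score": 62.0},
    {"model": "model_b", "group_id": "scale_hil_current", "score": 48.0},
    {"model": "model_c", "group_id": "scale_hil_current", "score": 55.0},
    {"model": "model_a", "group_id": "other_group", "score": 10.0},
]

pcts = percentile_within_group(records)
for key in sorted(pcts):
    print(key, round(pcts[key], 1))
```

In this sketch `model_a` lands at the top of both groups even though its raw scores differ widely, which is why a percentile from one group should not be read against another.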