Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source: Scale Labs
metric: Score (%)
judge: Rubric
direction: higher better
group id: scale_tutorbench_current
domain: Reasoning / math / science
What it measures vs what it misses
✓ Measures
How well a model tutors through multi-step academic problems. Instruction quality, pedagogy, and reasoning support on teaching-style prompts.
✗ Misses
Live classroom preference. Latency and cost.
Why this counts
It is one of the cleaner reads on deliberate reasoning strength rather than style or popularity.
Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It still misses product usability, latency, and whether the model stays correct in messy real workflows.
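As a rough illustration of the comparable-group rule, the sketch below ranks each model only against other models that share its group id. The page does not say how the percentile is computed, so the definition used here (the share of scores in the same group that are at or below the model's score) is an assumption, and the model names, scores, and second group id are hypothetical.

```python
from collections import defaultdict


def group_percentiles(rows):
    """Return {(model, group_id): percentile} for (model, group_id, score) rows.

    Assumed definition (not stated by the source): the percentage of scores
    in the same group that are less than or equal to the model's own score.
    Higher scores are treated as better.
    """
    # Collect scores per group so that groups are never mixed.
    scores_by_group = defaultdict(list)
    for _model, group_id, score in rows:
        scores_by_group[group_id].append(score)

    percentiles = {}
    for model, group_id, score in rows:
        group_scores = scores_by_group[group_id]
        at_or_below = sum(1 for s in group_scores if s <= score)
        percentiles[(model, group_id)] = 100.0 * at_or_below / len(group_scores)
    return percentiles


# Hypothetical models and scores; only the first group id appears on the page.
rows = [
    ("model_a", "scale_tutorbench_current", 50.0),
    ("model_b", "scale_tutorbench_current", 70.0),
    ("model_c", "scale_tutorbench_current", 60.0),
    ("model_d", "some_other_group", 10.0),
]
pcts = group_percentiles(rows)
```

In this toy data, `model_d` lands at the top of its own group despite having the lowest raw score overall, which is why a percentile from one group cannot be read as a universal score.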