Visible tradeoffs
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source: Scale Labs
metric: APR (%)
judge: Rubric
direction: higher better
group id: scale_vtb_current
domain: Vision understanding
What it measures vs what it misses
✓ Measures
Visual interpretation and reasoning over benchmark images and prompts.
✗ Misses
Image generation quality. Tool-use orchestration beyond the judged task.
Why this counts
It is useful when the model must read charts, UI, screenshots, or visual scenes rather than text alone.
Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not tell you whether the model can generate or edit images well.
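As a rough illustration of the comparable-group rule, the sketch below ranks a model only against peers that share its group id. The percentile formula (share of peers scoring at or below the model), the function name, the model names, the scores, and the second group `other_group` are all assumptions made for illustration; the source does not state how the percentile is actually computed.

```python
def percentile_within_group(scores, model, group_id):
    """Percentile of `model` among models in the same group only.

    scores: list of (model_name, group_id, score) tuples, higher is better.
    Assumed formula: percentage of same-group peers scoring <= the model.
    """
    # Keep only scores from the exact benchmark/version group.
    peers = [s for _, g, s in scores if g == group_id]
    # Look up the target model's score inside that group.
    target = next(s for m, g, s in scores if m == model and g == group_id)
    return 100.0 * sum(s <= target for s in peers) / len(peers)


# Hypothetical data: model names and scores are invented for this example.
scores = [
    ("model_a", "scale_vtb_current", 40.0),
    ("model_b", "scale_vtb_current", 55.0),
    ("model_c", "scale_vtb_current", 70.0),
    ("model_d", "scale_vtb_current", 25.0),
    ("model_x", "other_group", 90.0),  # different group, never compared
]

result = percentile_within_group(scores, "model_b", "scale_vtb_current")
print(result)
```

Under these assumptions, the score from `other_group` has no effect on the result, which is the sense in which the percentile is not a universal score.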