Visible tradeoffs
This is a human preference signal, so it tells you what people liked side by side, not what is formally correct.
source: Arena
metric: Arena rating (rating)
judge: Human
direction: higher better
group id: arena_code_2026_q2
domain: Coding
What it measures vs what it misses
✓ Measures
Human preference over coding outputs. Perceived usefulness and style fit in side-by-side code tasks.
✗ Misses
Pass/fail correctness. Latency and cost.
Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than in marketing examples.
Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
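The comparable-group rule can be illustrated with a minimal sketch. It assumes the percentile is the share of models in the same group whose rating is at or below the given model's rating; the page does not state the exact formula, so treat that definition as an assumption. The model names, ratings, and the second group id are hypothetical placeholders, not real results.

```python
from collections import defaultdict


def percentile_within_group(records):
    """Map (group_id, model) to a percentile computed only against
    models that share the same group_id.

    Assumed definition: percent of models in the group whose rating
    is at or below this model's rating (higher rating is better).
    """
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["group_id"]].append(rec)

    result = {}
    for group_id, members in by_group.items():
        ratings = [m["rating"] for m in members]
        for m in members:
            at_or_below = sum(1 for r in ratings if r <= m["rating"])
            result[(group_id, m["model"])] = 100.0 * at_or_below / len(ratings)
    return result


# Hypothetical data for illustration only.
records = [
    {"group_id": "arena_code_2026_q2", "model": "model_a", "rating": 1200},
    {"group_id": "arena_code_2026_q2", "model": "model_b", "rating": 1100},
    {"group_id": "arena_code_2026_q2", "model": "model_c", "rating": 1000},
    # Same model name in a different (hypothetical) group: ranked separately.
    {"group_id": "other_group", "model": "model_a", "rating": 900},
]

percentiles = percentile_within_group(records)
```

Because each group is ranked on its own, the same model can hold different percentiles in different groups, which is why the figure should not be read as a universal score.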