Verified but aging
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source: Terminal-Bench
metric: Accuracy (%)
judge: Objective
direction: higher better
group id: terminal_bench_2_live
domain: Coding
What it measures vs what it misses
✓ Measures
End-to-end task success on hard terminal workflows that require planning, editing, debugging, and execution.
✗ Misses
IDE-native workflows, code review quality, and non-terminal product engineering work.
Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.
Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
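The comparable-group rule can be made concrete with a small sketch. The page does not state how its percentile is computed, so this assumes one common definition: the share of models in the same group whose accuracy is at or below the target model's accuracy. The `Result` type, the `group_percentile` function, the model names, the scores, and the `other_group` id are all hypothetical and exist only for illustration; only `terminal_bench_2_live` comes from the card above.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    group_id: str
    accuracy: float  # Accuracy (%), higher is better


def group_percentile(results, model, group_id):
    """Percentile of `model` computed only against results sharing `group_id`.

    Assumed definition (not confirmed by the source): the percentage of
    models in the group whose accuracy is <= the target model's accuracy.
    """
    # Restrict the comparison to the exact benchmark/version group.
    peers = [r for r in results if r.group_id == group_id]
    target = next((r for r in peers if r.model == model), None)
    if target is None:
        raise KeyError(f"{model!r} has no result in group {group_id!r}")
    at_or_below = sum(1 for r in peers if r.accuracy <= target.accuracy)
    return 100.0 * at_or_below / len(peers)


# Made-up scores for illustration only; these are not real benchmark results.
RESULTS = [
    Result("model-a", "terminal_bench_2_live", 40.0),
    Result("model-b", "terminal_bench_2_live", 55.0),
    Result("model-c", "terminal_bench_2_live", 30.0),
    Result("model-d", "terminal_bench_2_live", 55.0),
    Result("model-a", "other_group", 90.0),  # hypothetical second group
]

if __name__ == "__main__":
    for group in ("terminal_bench_2_live", "other_group"):
        print(group, group_percentile(RESULTS, "model-a", group))
```

In this sketch the same model gets a different percentile in each group, because results from `other_group` never enter the `terminal_bench_2_live` comparison. That is the sense in which the percentile is not a universal score.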