UAB · Unbiased AI Bench
Glass box for model evals. Every leaderboard, with receipts.
Debugging
Live · updated continuously
Benchmarks · /benchmarks/bridgebench-debugging


Debugging-focused engineering evaluation.
Source · BridgeBench
Version · bridgebench snapshot 2026-05-01
Scores · 11

Passport

Verified but aging
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
Source · BridgeBench
Metric · Score (%)
Judge · Rubric
Direction · Higher is better
Group ID · bridgebench_debugging_2026_04
Domain · Coding

What it measures vs what it misses

✓ Measures

Ability to locate and fix bugs in realistic coding tasks.

✗ Misses

Cost. Arena-style human preference ("taste").

Why this counts
It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses
It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
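The comparable-group rule above can be sketched in a few lines. This is an illustrative example only, assuming scores are stored as (model, group_id, score) records; the record shape, function name, and sample data are assumptions, not the site's actual schema or API.

```python
# Illustrative sketch of the comparable-group rule: rank scores only
# within one exact benchmark/version group, never across groups.
# The record shape (model, group_id, score) is an assumption.

def group_leaderboard(records, group_id):
    """Return (model, score) pairs for one group, best score first.

    records: iterable of (model, group_id, score) tuples.
    Scores belonging to other groups are excluded entirely,
    so a high score elsewhere never leaks into this ranking.
    """
    rows = [(model, score) for (model, gid, score) in records if gid == group_id]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical records for demonstration; values are made up.
records = [
    ("Model A", "bridgebench_debugging_2026_04", 87.0),
    ("Model B", "bridgebench_debugging_2026_04", 77.5),
    ("Model C", "some_other_group", 99.0),  # ignored: different group
]
board = group_leaderboard(records, "bridgebench_debugging_2026_04")
# board → [("Model A", 87.0), ("Model B", 77.5)]; Model C is excluded.
```

The point of filtering before sorting is exactly the rule stated above: a percentile computed this way is meaningful only inside the group shown on the page, not as a universal score.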

Leaderboard · this benchmark version

#1 · Claude Opus 4.6 · BB · Undated · 87%
#2 · Claude Sonnet 4.6 · BB · Undated · 86.6%
#3 · Grok 4.20 · BB · Undated · 86.3%
#4 · Claude Opus 4.7 · BB · Undated · 86.2%
#5 · Grok 4.3 · BB · Undated · 86.1%
#6 · Gemini 3.1 Pro Preview · BB · Undated · 85.9%
#7 · GPT-5.4 · BB · Undated · 85.6%
#8 · o4 mini · BB · Undated · 85.6%
#9 · GPT-5.4 mini · BB · Undated · 84.1%
#10 · GPT-5.4 nano · BB · Undated · 81.2%
#11 · GPT-5.5 · BB · Undated · 77.5%