Verified but aging
This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.
source: BridgeBench
metric: Pushback rate (%)
judge: Rubric
direction: higher better
group id: bridgebench_pushback_2026_04
domain: Professional reasoning
What it measures vs what it misses
✓ Measures
Resistance to confidently accepting bogus assumptions in expert-style prompts.
✗ Misses
Coding execution quality. Latency and cost.
Why this counts
Resistance to confidently accepting bogus assumptions in expert-style prompts.
Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.
What it misses
Coding execution quality.
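The comparable-group rule can be illustrated with a minimal Python sketch. It is not BridgeBench's actual computation: the percentile definition (share of models in the same group scoring at or below a given model), the function name `group_percentiles`, and every model name and score are assumptions made up for illustration. The only point is that each percentile is computed inside one group id and never across groups.

```python
from collections import defaultdict


def group_percentiles(rows, higher_is_better=True):
    """Return {(model, group_id): percentile}, computed only within each group.

    rows is a list of (model, group_id, score) tuples.
    Assumed definition: percentile = share of models in the same group whose
    score is no better than this model's score, expressed as 0-100.
    """
    # Bucket scores by group id so groups are never mixed.
    by_group = defaultdict(list)
    for model, group_id, score in rows:
        by_group[group_id].append((model, score))

    result = {}
    for group_id, entries in by_group.items():
        n = len(entries)
        for model, score in entries:
            if higher_is_better:
                no_better = sum(1 for _, s in entries if s <= score)
            else:
                no_better = sum(1 for _, s in entries if s >= score)
            result[(model, group_id)] = 100.0 * no_better / n
    return result


# Hypothetical placeholder rows; none of these scores are real results.
rows = [
    ("model_a", "bridgebench_pushback_2026_04", 80.0),
    ("model_b", "bridgebench_pushback_2026_04", 60.0),
    ("model_c", "bridgebench_pushback_2026_04", 40.0),
    ("model_a", "other_benchmark_v1", 10.0),
    ("model_d", "other_benchmark_v1", 90.0),
]

percentiles = group_percentiles(rows)
for key in sorted(percentiles):
    print(key, round(percentiles[key], 1))
```

In this made-up data, `model_a` sits at the top of one group and at the bottom half of the other, which is why a percentile from one group id says nothing about standing in another.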