Security

Security-oriented coding evaluation.

Source · BridgeBench
Version · bridgebench snapshot 2026-05-01
Scores · 10

Passport

Verified but aging

This is a rubric-judged signal, so it is more structured than arena taste but still depends on the scoring rubric.

source · BridgeBench
metric · Score (%)
judge · Rubric
direction · higher better
group id · bridgebench_security_2026_04
domain · Coding

What it measures vs what it misses

✓ Measures

Security issue discovery and remediation quality.

✗ Misses

Latency and cost.

Why this counts

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Comparable-group rule

This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses

It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
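The comparable-group rule can be illustrated with a short sketch. The page does not state how the percentile is computed, so the "share of models in the same group scoring at or below this model" definition used here is an assumption, and `percentile_in_group` is a hypothetical helper name. The scores are the ones listed on the leaderboard for this benchmark version.

```python
GROUP_ID = "bridgebench_security_2026_04"

# (group id, model, score %) rows copied from the leaderboard for this version.
SCORES = [
    (GROUP_ID, "GPT-5.5", 85.3),
    (GROUP_ID, "Claude Sonnet 4.6", 85.3),
    (GROUP_ID, "Gemini 3.1 Pro Preview", 85.2),
    (GROUP_ID, "GPT-5.4", 84.4),
    (GROUP_ID, "GPT-5.4 mini", 83.3),
    (GROUP_ID, "Grok 4.3", 81.9),
    (GROUP_ID, "Claude Opus 4.6", 81.6),
    (GROUP_ID, "GPT-5.4 nano", 80.0),
    (GROUP_ID, "Grok 4.20", 78.9),
    (GROUP_ID, "Claude Opus 4.7", 76.0),
]


def percentile_in_group(rows, group_id, model):
    """Percent of models in the same group scoring at or below `model`.

    Assumption: this "at or below" definition is illustrative only and is
    not taken from the benchmark's documentation.
    """
    # Only rows from the exact benchmark/version group are compared.
    group = {name: score for gid, name, score in rows if gid == group_id}
    if model not in group:
        raise KeyError(f"{model!r} has no score in group {group_id!r}")
    target = group[model]
    at_or_below = sum(1 for score in group.values() if score <= target)
    return 100.0 * at_or_below / len(group)


print(percentile_in_group(SCORES, GROUP_ID, "Claude Opus 4.7"))  # 10.0
```

Rows from any other group id are filtered out before counting, so a score from a different benchmark or snapshot never moves a model's percentile here. Under this definition, tied models such as the two at 85.3% receive the same percentile.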

Leaderboard · this benchmark version

#1 · GPT-5.5 · BB · Undated · 85.3%
#2 · Claude Sonnet 4.6 · BB · Undated · 85.3%
#3 · Gemini 3.1 Pro Preview · BB · Undated · 85.2%
#4 · GPT-5.4 · BB · Undated · 84.4%
#5 · GPT-5.4 mini · BB · Undated · 83.3%
#6 · Grok 4.3 · BB · Undated · 81.9%
#7 · Claude Opus 4.6 · BB · Undated · 81.6%
#8 · GPT-5.4 nano · BB · Undated · 80%
#9 · Grok 4.20 · BB · Undated · 78.9%
#10 · Claude Opus 4.7 · BB · Undated · 76%