WebDev Arena

Blind preference arena for web app generation and front-end implementation quality.

Source · Arena
Version · arena snapshot 2026-05-01
Scores · 61

Passport

Visible tradeoffs

This is a human preference signal, so it tells you what people liked side by side, not what is formally correct.

source · Arena
metric · Arena rating (rating)
judge · Human
direction · higher better
group id · arena_webdev_2026_q2
domain · Coding

What it measures vs what it misses

✓ Measures

Perceived usefulness and polish on browser-facing coding tasks. How often a model's generated web experience wins in side-by-side judgments.

✗ Misses

Objective correctness or runtime reliability. Accessibility, maintainability, and deploy-time quality unless voters notice them directly.
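The "Measures" entry above describes side-by-side wins. As a rough illustration only, the sketch below turns a rating gap into an approximate head-to-head preference rate. It assumes the Arena rating sits on an Elo-style logistic scale (base 10, 400 points), which this page does not state; the function name and the `scale` parameter are hypothetical.

```python
def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under an ASSUMED Elo-style logistic model.

    The 400-point, base-10 scale is an assumption, not something this page confirms.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings for the top two entries of this benchmark version.
opus_4_7 = 1561
opus_4_6 = 1543

p = expected_win_probability(opus_4_7, opus_4_6)
print(f"Estimated preference rate for #1 over #2: {p:.3f}")
```

Under that assumption, an 18-point gap is only a slight edge over a coin flip, so small differences between neighboring ranks should not be over-read.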

Why this counts

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than in marketing examples.

Comparable-group rule

This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses

It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
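The comparable-group rule can be sketched as a percentile computed separately per group id. This is a minimal sketch, not the site's actual formula: the percentile definition (share of models in the same group scoring at or below a given model, with higher being better) and the function name are assumptions. The second group and its model are hypothetical placeholders, and the first group uses only three rows from the leaderboard, so the printed values are illustrative rather than the real percentiles.

```python
from collections import defaultdict


def percentiles_by_group(rows):
    """Map (group_id, model) -> percentile within that group only.

    rows: iterable of (group_id, model, score), higher score is better.
    Percentile here = share of same-group models scoring <= this model (an assumption).
    """
    groups = defaultdict(list)
    for group_id, model, score in rows:
        groups[group_id].append((model, score))

    result = {}
    for group_id, entries in groups.items():
        scores = [score for _, score in entries]
        for model, score in entries:
            at_or_below = sum(1 for s in scores if s <= score)
            result[(group_id, model)] = 100.0 * at_or_below / len(scores)
    return result


rows = [
    # Three rows taken from this benchmark version (a subset, for illustration).
    ("arena_webdev_2026_q2", "Claude Opus 4.7", 1561),
    ("arena_webdev_2026_q2", "GPT-5.5", 1447),
    ("arena_webdev_2026_q2", "devstral-medium-2507", 1091),
    # Hypothetical group and model, included only to show that groups never mix.
    ("hypothetical_other_group", "hypothetical-model", 0.5),
]

for key, pct in percentiles_by_group(rows).items():
    print(key, round(pct, 1))
```

Because each group is ranked in isolation, a score from one benchmark version is never compared against a score from another, even when the two use very different scales.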

Leaderboard · this benchmark version

#1 · Claude Opus 4.7 · AR · May 1, 2026 · 1,561
#2 · Claude Opus 4.6 · AR · May 1, 2026 · 1,543
#3 · GLM-5.1 · AR · May 1, 2026 · 1,534
#4 · Claude Sonnet 4.6 · AR · May 1, 2026 · 1,527
#5 · Kimi K2.6 · AR · May 1, 2026 · 1,526
#6 · muse-spark · AR · May 1, 2026 · 1,509
#7 · MiMo-V2.5-Pro · AR · May 1, 2026 · 1,475
#8 · Claude Opus 4.5 · AR · May 1, 2026 · 1,467
#9 · Qwen3.6 Plus · AR · May 1, 2026 · 1,467
#10 · deepseek-v4-pro-thinking · AR · May 1, 2026 · 1,455
#11 · Gemini 3.1 Pro Preview · AR · May 1, 2026 · 1,453
#12 · GPT-5.5 · AR · May 1, 2026 · 1,447
#13 · mimo-v2.5 · AR · May 1, 2026 · 1,444
#14 · GLM-4.7 · AR · May 1, 2026 · 1,440
#15 · Gemini 3 Pro Preview · AR · May 1, 2026 · 1,438
#16 · GPT-5.4 · AR · May 1, 2026 · 1,437
#17 · GLM-5 · AR · May 1, 2026 · 1,437
#18 · kimi-k2.5-thinking · AR · May 1, 2026 · 1,430
#19 · MiMo-V2-Pro · AR · May 1, 2026 · 1,430
#20 · MiniMax-M2.7 · AR · May 1, 2026 · 1,411
#21 · Grok 4.3 · AR · May 1, 2026 · 1,408
#22 · kimi-k2.5-instant · AR · May 1, 2026 · 1,408
#23 · GPT-5.3 Codex · AR · May 1, 2026 · 1,407
#24 · GPT-5.4 mini · AR · May 1, 2026 · 1,400
#25 · Grok 4.20 · AR · May 1, 2026 · 1,399
#26 · GPT-5 · AR · May 1, 2026 · 1,393
#27 · GPT-5.4 nano · AR · May 1, 2026 · 1,393
#28 · minimax-m2.1-preview · AR · May 1, 2026 · 1,392
#29 · Gemini 3 Flash · AR · May 1, 2026 · 1,389
#30 · Qwen3.5 397B A17B · AR · May 1, 2026 · 1,387
#31 · Claude Sonnet 4.5 · AR · May 1, 2026 · 1,386
#32 · Claude Opus 4.1 · AR · May 1, 2026 · 1,385
#33 · Claude Opus 4 · AR · May 1, 2026 · 1,385
#34 · MiniMax-M2.5 · AR · May 1, 2026 · 1,383
#35 · deepseek-v3.2-thinking · AR · May 1, 2026 · 1,368
#36 · Qwen3.5 122B A10B · AR · May 1, 2026 · 1,363
#37 · GLM-4.6 · AR · May 1, 2026 · 1,355
#38 · Qwen3.5 27B · AR · May 1, 2026 · 1,350
#39 · GPT-5.2 · AR · May 1, 2026 · 1,335
#40 · DeepSeek Chat · AR · May 1, 2026 · 1,332
#41 · kimi-k2-thinking-turbo · AR · May 1, 2026 · 1,330
#42 · Claude Haiku 4.5 · AR · May 1, 2026 · 1,317
#43 · MiniMax-M2 · AR · May 1, 2026 · 1,304
#44 · MiMo-V2-Flash · AR · May 1, 2026 · 1,300
#45 · DeepSeek V3.2 Exp · AR · May 1, 2026 · 1,286
#46 · Qwen3-Coder 480B A35B · AR · May 1, 2026 · 1,281
#47 · KAT-Coder-Pro V1 · AR · May 1, 2026 · 1,258
#48 · Qwen3.5 35B A3B · AR · May 1, 2026 · 1,248
#49 · Trinity Large Thinking · AR · May 1, 2026 · 1,246
#50 · GPT-5.1 · AR · May 1, 2026 · 1,239
#51 · Gemini 3.1 Flash-Lite Preview · AR · May 1, 2026 · 1,238
#52 · Qwen3.5 Flash · AR · May 1, 2026 · 1,236
#53 · Grok 4.1 Fast · AR · May 1, 2026 · 1,234
#54 · Mistral Large 3 · AR · May 1, 2026 · 1,222
#55 · Grok 4.1 · AR · May 1, 2026 · 1,207
#56 · Gemini 2.5 Pro · AR · May 1, 2026 · 1,203
#57 · Devstral 2 · AR · May 1, 2026 · 1,199
#58 · Mercury 2 · AR · May 1, 2026 · 1,165
#59 · Grok 4 Fast · AR · May 1, 2026 · 1,149
#60 · Grok Code Fast · AR · May 1, 2026 · 1,139
#61 · devstral-medium-2507 · AR · May 1, 2026 · 1,091