Terminal-Bench 2.0

Official Terminal-Bench 2.0 leaderboard for realistic multi-step terminal tasks.

Source · Terminal-Bench
Version · terminal-bench snapshot 2026-05-01
Scores · 24

Passport

Verified but aging

This is an objective signal, so it is mainly about measurable task performance rather than public taste.

Source · Terminal-Bench
Metric · Accuracy (%)
Judge · Objective
Direction · higher better
Group id · terminal_bench_2_live
Domain · Coding

What it measures vs what it misses

✓ Measures

End-to-end task success on hard terminal workflows that require planning, editing, debugging, and execution.

✗ Misses

IDE-native workflows, code review quality, and non-terminal product engineering work.

Why this counts

It tells you whether the model can generate, repair, and reason over code under evaluator pressure rather than marketing examples.

Comparable-group rule

This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses

It does not fully capture repo-scale iteration, IDE ergonomics, or long debugging loops.
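The comparable-group rule above can be sketched in a few lines of Python. The page does not state its exact percentile formula, so the definition used here (the share of other models in the same group that a model scores at or above) is an assumption, as are the names `Score` and `percentile_in_group`. The three `terminal_bench_2_live` rows come from the leaderboard on this page; the last row is a hypothetical placeholder added only to show that other groups are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    group_id: str
    accuracy: float  # percent, higher is better


def percentile_in_group(scores, model, group_id):
    """Percent of OTHER models in the same group that `model` scores at or above.

    Assumed definition for illustration; the site's own formula is not documented here.
    """
    # Only rows from the exact benchmark/version group take part in the comparison.
    group = [s for s in scores if s.group_id == group_id]
    target = next(s for s in group if s.model == model)
    others = [s for s in group if s.model != model]
    if not others:
        return 100.0
    at_or_below = sum(1 for s in others if s.accuracy <= target.accuracy)
    return 100.0 * at_or_below / len(others)


scores = [
    Score("GPT-5.5", "terminal_bench_2_live", 82.0),
    Score("GPT-5.3 Codex", "terminal_bench_2_live", 75.1),
    Score("Gemini 3 Pro Preview", "terminal_bench_2_live", 69.4),
    # Hypothetical placeholder row from a different group; it must not affect the result.
    Score("hypothetical-model", "hypothetical_group", 99.0),
]

print(percentile_in_group(scores, "GPT-5.3 Codex", "terminal_bench_2_live"))
```

Because the filter runs before any comparison, a percentile computed this way says nothing about how a model would place in a different benchmark or version group.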

Leaderboard · this benchmark version

#1 · GPT-5.5 · TERMINAL-BENCH · Apr 23, 2026 · 82%
#2 · GPT-5.3 Codex · TERMINAL-BENCH · Feb 6, 2026 · 75.1%
#3 · Gemini 3 Pro Preview · TERMINAL-BENCH · Jan 6, 2026 · 69.4%
#4 · Claude Opus 4.6 · TERMINAL-BENCH · Feb 6, 2026 · 62.9%
#5 · GPT-5.2 · TERMINAL-BENCH · Dec 18, 2025 · 62.9%
#6 · GPT-5.1 · TERMINAL-BENCH · Nov 24, 2025 · 60.4%
#7 · Claude Opus 4.5 · TERMINAL-BENCH · Nov 22, 2025 · 57.8%
#8 · Gemini 3 Flash Preview · TERMINAL-BENCH · Jan 7, 2026 · 51.7%
#9 · GPT-5 · TERMINAL-BENCH · Nov 4, 2025 · 49.6%
#10 · GPT-5.4 · TERMINAL-BENCH · Nov 4, 2025 · 49.6%
#11 · Claude Sonnet 4.5 · TERMINAL-BENCH · Oct 31, 2025 · 42.8%
#12 · Claude Opus 4.1 · TERMINAL-BENCH · Oct 31, 2025 · 38%
#13 · Claude Opus 4 · TERMINAL-BENCH · Oct 31, 2025 · 38%
#14 · Claude Opus 4.7 · TERMINAL-BENCH · Oct 31, 2025 · 38%
#15 · Gemini 2.5 Pro · TERMINAL-BENCH · Oct 31, 2025 · 32.6%
#16 · GPT-5.4 mini · TERMINAL-BENCH · Nov 4, 2025 · 31.9%
#17 · Claude Haiku 4.5 · TERMINAL-BENCH · Nov 3, 2025 · 29.8%
#18 · Grok 4 · TERMINAL-BENCH · Nov 2, 2025 · 27.2%
#19 · Grok Code Fast · TERMINAL-BENCH · Nov 3, 2025 · 25.8%
#20 · Qwen3-Coder 480B A35B · TERMINAL-BENCH · Nov 2, 2025 · 25.4%
#21 · GPT-OSS 120B · TERMINAL-BENCH · Nov 1, 2025 · 18.7%
#22 · Gemini 2.5 Flash · TERMINAL-BENCH · Nov 3, 2025 · 17.1%
#23 · GPT-5.4 nano · TERMINAL-BENCH · Nov 4, 2025 · 11.5%
#24 · GPT-OSS 20B · TERMINAL-BENCH · Nov 3, 2025 · 3.4%
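The list above assigns consecutive ranks to models with identical scores (for example, 62.9% appears at both #4 and #5), and the page does not say how those ties are broken. As a hedged illustration only, the sketch below shows one common alternative, competition ranking, where tied scores share a rank and the next rank is skipped. The function name `competition_ranks` is hypothetical, and the four rows are copied from the leaderboard.

```python
def competition_ranks(rows):
    """Rank (model, score) pairs so that equal scores share the same rank.

    Higher scores rank first, matching the "higher better" direction of this benchmark.
    This is an assumed tie policy, not necessarily the one the leaderboard uses.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    prev_score = None
    prev_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        # A tied score reuses the previous rank; otherwise the rank is the position.
        rank = prev_rank if score == prev_score else position
        ranked.append((rank, model, score))
        prev_score, prev_rank = score, rank
    return ranked


rows = [
    ("Gemini 3 Pro Preview", 69.4),
    ("Claude Opus 4.6", 62.9),
    ("GPT-5.2", 62.9),
    ("GPT-5.1", 60.4),
]

for rank, model, score in competition_ranks(rows):
    print(f"#{rank} · {model} · {score}%")
```

Under this policy the two 62.9% entries would both be reported at the same rank, which makes it explicit that the benchmark does not separate them.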