TERMINAL-BENCH · benchmark platform

Terminal-Bench

Agent benchmark for hard, realistic multi-step tasks completed inside terminal environments.

verification status

verified

Last checked May 1, 2026

Evidence ledger

Modalities: code
Cadence: release-based
API: not public
Evaluations: 24
Verification: verified
Verified runtime: 21
Manual verified: 0
Relay / mirrored: 0
Backfilled: 3

Relay sources mirror another provider's public page; manual rows are checked against the cited page; backfilled rows are historical inserts; seeded rows are demo fixtures. Relay rows are supporting evidence, not first-party measurements.
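
The ledger counts can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the dictionary keys are hypothetical names, and it assumes every evaluation falls into exactly one provenance category, which the page does not state outright. The numbers are the ones shown in the ledger.

```python
# Counts as shown in the evidence ledger; the key names are hypothetical.
ledger = {
    "verified_runtime": 21,
    "manual_verified": 0,
    "relay_mirrored": 0,
    "backfilled": 3,
}
total_evaluations = 24


def ledger_is_consistent(counts, total):
    """Assumption: the provenance categories partition the evaluations."""
    return sum(counts.values()) == total


print(ledger_is_consistent(ledger, total_evaluations))
```

Under that assumption the categories account for all 24 evaluations (21 + 0 + 0 + 3).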

Operational state

snapshot

Latest pull

May 1, 2026

json

parser

Loaded 21 Terminal-Bench 2.0 benchmark records from verified rows.

0.1.0

verify

terminal-bench verification finished with status verified.

May 1, 2026

verified

open

terminal-bench contains 7 unmapped model labels.

May 1, 2026

model_alias
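
An unmapped-label report like the one above could be produced by checking each raw model label against an alias table. This is a minimal sketch, not the platform's actual logic: the function name, the normalization rule (trim and lowercase), and the example labels are all assumptions made for illustration.

```python
def find_unmapped(labels, alias_table):
    """Return labels with no alias entry, first-seen order, no duplicates.

    Assumption: labels are matched after trimming whitespace and lowercasing.
    """
    seen = set()
    unmapped = []
    for label in labels:
        key = label.strip().lower()
        if key not in alias_table and key not in seen:
            seen.add(key)
            unmapped.append(label)
    return unmapped


# Placeholder data; these are not real model labels from the source.
alias_table = {"example-model-a": "vendor/example-model-a"}
labels = ["Example-Model-A", "example-model-b", "example-model-b "]

unmapped = find_unmapped(labels, alias_table)
print(f"{len(unmapped)} unmapped model labels: {unmapped}")
```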

Benchmarks from this source

Terminal-Bench 2.0

Agentic terminal coding

Accuracy

Latest change explanation

terminal-bench changed relative to the previous run, terminal-bench-20260501T202649Z. Causes: parser_diff, benchmark_movement.

Parser output changed: The parser metadata or warnings shifted relative to the previous run.
Benchmark coverage or values moved: 1 benchmark row was added, 0 were removed, and 0 existing rows changed value or evaluation date.
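
The added/removed/changed counts can be derived by comparing two snapshots keyed by row. The sketch below is a hedged illustration: the snapshot shape (a dict from a row id to a score and evaluation date), the function names, and the sample rows are hypothetical. Only the cause labels parser_diff and benchmark_movement come from the page.

```python
def diff_rows(previous, current):
    """Compare two snapshots; each maps row id -> (value, evaluation_date)."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        key for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


def change_causes(summary, parser_changed):
    """Map the diff to cause labels (assumed rule: any row movement counts)."""
    causes = []
    if parser_changed:
        causes.append("parser_diff")
    if any(summary.values()):
        causes.append("benchmark_movement")
    return causes


# Made-up rows for illustration; not real benchmark results.
previous = {"row-1": (0.50, "2026-04-01")}
current = {"row-1": (0.50, "2026-04-01"), "row-2": (0.40, "2026-05-01")}

summary = diff_rows(previous, current)
print(summary, change_causes(summary, parser_changed=True))
```

With these sample rows the diff reports one added row and nothing removed or changed, which mirrors the shape of the explanation above.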