TERMINAL-BENCH · benchmark platform
Terminal-Bench
Agent benchmark for hard, realistic multi-step tasks completed inside terminal environments.
verification status
verified
Last checked May 1, 2026
Evidence ledger
Modalities: code
Cadence: release-based
API: not public
Evaluations: 24
Verification: verified
Verified runtime: 21
Manual verified: 0
Relay / mirrored: 0
Backfilled: 3
Relay sources mirror another provider's public page; manual rows are checked against the cited page; backfilled rows are historical inserts; seeded rows are demo fixtures. Relay rows are supporting evidence, not first-party measurements.
Operational state
snapshot
Latest pull
json · May 1, 2026
parser
Loaded 21 Terminal-Bench 2.0 benchmark records from verified rows.
ok · 0.1.0
verify
terminal-bench verification finished with status verified.
verified · May 1, 2026
open
terminal-bench contains 7 unmapped model labels.
model_alias · May 1, 2026
Benchmarks from this source
Terminal-Bench 2.0
Agentic terminal coding
Accuracy
Latest change explanation
terminal-bench changed versus terminal-bench-20260501T202649Z. Causes: parser_diff, benchmark_movement.
- Parser output changed: The parser metadata or warnings shifted relative to the previous run.
- Benchmark coverage or values moved: 1 benchmark row was added, 0 were removed, and 0 existing rows changed value or evaluation date.
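
The added / removed / changed counts above could be derived by comparing two snapshots keyed by row identity. The following is a minimal sketch under stated assumptions, not the platform's actual implementation: the row key (model label, benchmark name), the value shape (score, evaluation date), the function name `diff_rows`, and all sample values are hypothetical placeholders, not real Terminal-Bench results.

```python
from typing import Dict, Tuple

# Hypothetical shapes: key = (model label, benchmark name),
# value = (score, evaluation date). Neither is confirmed by the source.
RowKey = Tuple[str, str]
RowValue = Tuple[float, str]


def diff_rows(
    previous: Dict[RowKey, RowValue],
    current: Dict[RowKey, RowValue],
) -> Dict[str, int]:
    """Count rows added, removed, and changed between two snapshots."""
    added = current.keys() - previous.keys()
    removed = previous.keys() - current.keys()
    # A row counts as changed if it exists in both snapshots but its
    # score or evaluation date differs.
    changed = {
        key
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    }
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}


# Placeholder data for illustration only.
previous = {("model-a", "Terminal-Bench 2.0"): (0.50, "2026-04-01")}
current = dict(previous)
current[("model-b", "Terminal-Bench 2.0")] = (0.40, "2026-05-01")

summary = diff_rows(previous, current)
print(summary)
```

With this placeholder data, one row is added and nothing else moves, which mirrors the shape of the movement reported above.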