UAB · Unbiased AI Bench · Glass box for model evals.
Every leaderboard, with receipts.
Long Context Reasoning
Live · updated continuously
Benchmarks · /benchmarks/artificial-analysis-long-context-reasoning


Artificial Analysis benchmark for extracting, reasoning over, and synthesizing long-form documents.
Source · Artificial Analysis
Version · artificial-analysis snapshot 2026-05-01
Scores · 6

Passport

Thin verified coverage. This is an objective signal, so it is mainly about measurable task performance rather than public taste.
Source · Artificial Analysis
Metric · Score (%)
Judge · Objective
Direction · Higher is better
Group ID · aa_long_context_reasoning_current
Domain · Long context

What it measures vs what it misses

✓ Measures

Reasoning quality across long documents that require multi-step synthesis. Whether long context windows translate into usable document comprehension.

✗ Misses

Short-form chat preference. Image-heavy document workflows.

Why this counts

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Comparable-group rule

This percentile only compares models within the exact benchmark/version group shown here. It is not a universal score.

What it misses

It does not guarantee good synthesis quality once real documents, tools, and latency constraints are involved.
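The comparable-group rule can be made concrete with a minimal sketch: a model's percentile is computed only against the scores in its own benchmark/version group (here, the group id aa_long_context_reasoning_current). The function name and the "beats or ties" convention below are illustrative assumptions, not this site's actual implementation.

```python
def group_percentile(score: float, group_scores: list[float]) -> float:
    """Percent of scores in the same group that this score beats or ties.

    Only scores from one benchmark/version group are passed in, so the
    result is never comparable across groups -- matching the rule above.
    """
    if not group_scores:
        raise ValueError("empty group")
    beaten_or_tied = sum(1 for s in group_scores if s <= score)
    return 100.0 * beaten_or_tied / len(group_scores)

# Scores (%) from this benchmark version's leaderboard; higher is better.
group = [75.7, 75.6, 75.6, 75.6, 75.6, 75.0]
print(group_percentile(75.7, group))  # top of this six-model group: 100.0
```

Because the denominator is the group size, the same raw score would land at a different percentile in any other benchmark/version group, which is exactly why the page warns it is not a universal score.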

Leaderboard · this benchmark version

#1 · GPT-5.2 · AA · May 1, 2026 · 75.7%
#2 · GPT-5 · AA · May 1, 2026 · 75.6%
#3 · GPT-5.4 · AA · May 1, 2026 · 75.6%
#4 · GPT-5.4 mini · AA · May 1, 2026 · 75.6%
#5 · GPT-5.4 nano · AA · May 1, 2026 · 75.6%
#6 · GPT-5.1 · AA · May 1, 2026 · 75.0%
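Note that the four models tied at 75.6% still receive distinct consecutive ranks (#2 through #5) rather than a shared rank. A hedged sketch of that ordinal-ranking behavior, assuming ties are broken by original listing order (which a stable sort preserves):

```python
# Entries as (model, score %) from the leaderboard above.
entries = [
    ("GPT-5.2", 75.7),
    ("GPT-5", 75.6),
    ("GPT-5.4", 75.6),
    ("GPT-5.4 mini", 75.6),
    ("GPT-5.4 nano", 75.6),
    ("GPT-5.1", 75.0),
]

# Python's sorted() is stable, so equal scores keep their listing order,
# and enumerate() hands out distinct ordinal ranks even across ties.
ranked = sorted(entries, key=lambda e: e[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"#{rank} · {model} · {score}%")
```

An alternative convention, competition ranking, would label the tied cluster #2, #2, #2, #2 and skip to #6; the page evidently uses ordinal ranks instead.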