UAB · Unbiased AI Bench · Glass box for model evals.
Every leaderboard, with receipts.
Long Context Reasoning
Live · updated continuously
Benchmarks · /benchmarks/artificial-analysis-long-context-reasoning


Artificial Analysis benchmark for extracting, reasoning over, and synthesizing long-form documents.
Source · Artificial Analysis
Version · artificial-analysis snapshot 2026-05-01
Scores · 6

Passport

Thin verified coverage. This is an objective signal, so it is mainly about measurable task performance rather than public taste.
Source · Artificial Analysis
Metric · Score (%)
Judge · Objective
Direction · Higher is better
Group ID · aa_long_context_reasoning_current
Domain · Long context

What it measures vs what it misses

✓ Measures

Reasoning quality across long documents that require multi-step synthesis. Whether long context windows translate into usable document comprehension.

✗ Misses

Short-form chat preference. Image-heavy document workflows.

Why this counts

It checks whether long-context claims survive contact with retrieval, memory, or long-document tasks.

Comparable-group rule

This percentile only compares models within the exact benchmark/version group shown here. It is not a universal score.

What it misses

It does not guarantee good synthesis quality once real documents, tools, and latency constraints are involved.
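The comparable-group rule can be made concrete with a minimal sketch: a model's percentile is computed only against the scores in its own benchmark/version group (here, the group id aa_long_context_reasoning_current). The function name and the "beats or ties" convention below are illustrative assumptions, not this site's actual implementation.

```python
def group_percentile(score: float, group_scores: list[float]) -> float:
    """Percent of scores in the same group that this score beats or ties.

    Only scores from one benchmark/version group are passed in, so the
    result is never comparable across groups -- matching the rule above.
    """
    if not group_scores:
        raise ValueError("empty group")
    beaten_or_tied = sum(1 for s in group_scores if s <= score)
    return 100.0 * beaten_or_tied / len(group_scores)

# Scores (%) from this benchmark version's leaderboard; higher is better.
group = [75.7, 75.6, 75.6, 75.6, 75.6, 75.0]
print(group_percentile(75.7, group))  # top of this six-model group: 100.0
```

Because the denominator is the group size, the same raw score would land at a different percentile in any other benchmark/version group, which is exactly why the page warns it is not a universal score.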

Leaderboard · this benchmark version

#1 · GPT-5.2 · AA · May 1, 2026 · 75.7%
#2 · GPT-5 · AA · May 1, 2026 · 75.6%
#3 · GPT-5.4 · AA · May 1, 2026 · 75.6%
#4 · GPT-5.4 mini · AA · May 1, 2026 · 75.6%
#5 · GPT-5.4 nano · AA · May 1, 2026 · 75.6%
#6 · GPT-5.1 · AA · May 1, 2026 · 75.0%
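Note that the four models tied at 75.6% still receive distinct consecutive ranks (#2 through #5) rather than a shared rank. A hedged sketch of that ordinal-ranking behavior, assuming ties are broken by original listing order (which a stable sort preserves):

```python
# Entries as (model, score %) from the leaderboard above.
entries = [
    ("GPT-5.2", 75.7),
    ("GPT-5", 75.6),
    ("GPT-5.4", 75.6),
    ("GPT-5.4 mini", 75.6),
    ("GPT-5.4 nano", 75.6),
    ("GPT-5.1", 75.0),
]

# Python's sorted() is stable, so equal scores keep their listing order,
# and enumerate() hands out distinct ordinal ranks even across ties.
ranked = sorted(entries, key=lambda e: e[1], reverse=True)
for rank, (model, score) in enumerate(ranked, start=1):
    print(f"#{rank} · {model} · {score}%")
```

An alternative convention, competition ranking, would label the tied cluster #2, #2, #2, #2 and skip to #6; the page evidently uses ordinal ranks instead.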