UAB · Unbiased AI Bench · Glass box for model evals.
Every leaderboard, with receipts.
Instruction following
Live · updated continuously
Benchmarks · /benchmarks/livebench-instruction-following


LiveBench instruction-following slice across paraphrase and constrained generation tasks.
Source · LiveBench
Version · livebench snapshot 2026-05-01
Scores · 32

Passport

Verified but aging
This is an objective signal, so it is mainly about measurable task performance rather than public taste.
source · LiveBench
metric · Score (%)
judge · Objective
direction · higher better
group id · livebench_instruction_following_2026_04
domain · Chat / text
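The passport fields above can be modeled as a small typed record. This is a minimal sketch, assuming a hypothetical `BenchmarkPassport` class and a `better` helper that are not part of the site; only the field values come from the passport shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkPassport:
    """Hypothetical record mirroring the passport fields shown above."""
    source: str
    metric: str
    judge: str
    direction: str  # "higher better" or "lower better"
    group_id: str   # exact comparable-group key
    domain: str

    def better(self, a: float, b: float) -> bool:
        """True if score `a` beats score `b` under this passport's direction."""
        return a > b if self.direction == "higher better" else a < b


# Values taken from the passport on this page:
passport = BenchmarkPassport(
    source="LiveBench",
    metric="Score (%)",
    judge="Objective",
    direction="higher better",
    group_id="livebench_instruction_following_2026_04",
    domain="Chat / text",
)
print(passport.better(85.2, 82.3))  # True
```

Keeping the direction inside the record means ranking code never hardcodes "higher is better" for a benchmark where the opposite might hold.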

What it measures vs what it misses

✓ Measures

How well a model obeys specified output instructions on recent tasks.

✗ Misses

Broader reasoning depth. Human preference over style.

Why this counts
It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Comparable-group rule
This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses
It does not prove deeper reasoning, tool use, or enterprise workflow reliability.
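The comparable-group rule can be sketched as a percentile computed only against scores that share the same benchmark/version group id. This is an illustrative sketch, not the site's actual formula; `group_percentile` and the sample scores are assumptions.

```python
def group_percentile(score: float, group_scores: list[float]) -> float:
    """Percentile of `score` within its comparable group only.

    Sketch of the comparable-group rule: a model is ranked against
    models in the same benchmark/version group, never across groups.
    """
    if not group_scores:
        raise ValueError("empty comparable group")
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


# Hypothetical scores drawn from one benchmark/version group:
slice_scores = [85.2, 84.8, 82.8, 82.3, 82.2]
print(group_percentile(82.8, slice_scores))  # 40.0
```

Because the denominator is the group size, the same raw score can land at very different percentiles in different groups, which is exactly why the percentile is not a universal score.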

Leaderboard · this benchmark version

#1 · o1 Preview · LB Dec 10, 2024 · 85.2%
#2 · Gemini Experimental · LB Dec 10, 2024 · 84.8%
#3 · Gemini 2.0 Pro Experimental · LB Feb 5, 2025 · 82.8%
#4 · o1 · LB Mar 4, 2025 · 82.3%
#5 · Claude Sonnet 3.7 · LB Feb 24, 2025 · 82.2%
#6 · Gemini 2.0 Flash-Lite · LB Feb 27, 2025 · 81.3%
#7 · Gemini 2.5 Pro · LB Mar 25, 2025 · 81.0%
#8 · Gemini 2.0 Flash · LB Apr 7, 2025 · 80.9%
#9 · Grok 2 mini · LB Oct 17, 2024 · 80.7%
#10 · DeepSeek Reasoner · LB Jan 20, 2025 · 80.6%
#11 · o3 mini · LB Feb 1, 2025 · 80.3%
#12 · Gemini 1.5 Flash 8B · LB Dec 10, 2024 · 76.8%
#13 · Grok 3 · LB Mar 18, 2025 · 75.9%
#14 · Gemini 1.5 Flash · LB Dec 10, 2024 · 75.8%
#15 · GPT-4o · LB Mar 27, 2025 · 74.1%
#16 · Gemini 1.5 Pro · LB Dec 10, 2024 · 73.3%
#17 · o1 mini · LB Dec 10, 2024 · 72.6%
#18 · GPT-4.5 Preview · LB Feb 27, 2025 · 72.3%
#19 · GPT-4 · LB Jul 10, 2024 · 71.8%
#20 · Claude Sonnet 3.5 · LB Dec 10, 2024 · 71.1%
#21 · Grok 2 · LB Dec 16, 2024 · 70.8%
#22 · Grok Beta · LB Dec 10, 2024 · 70.5%
#23 · GPT-4o mini · LB Dec 10, 2024 · 69.7%
#24 · GPT-4 Turbo · LB Dec 10, 2024 · 68.9%
#25 · Claude Haiku 3.5 · LB Dec 10, 2024 · 68.8%
#26 · Claude Haiku 4.5 · LB Dec 10, 2024 · 68.8%
#27 · DeepSeek Chat · LB Dec 11, 2024 · 67.8%
#28 · Claude Opus 3 · LB Dec 10, 2024 · 65.6%
#29 · Claude Sonnet 3 · LB Jun 11, 2024 · 63.6%
#30 · GPT-3.5 Turbo · LB Jun 11, 2024 · 62.9%
#31 · Grok 3 mini · LB Mar 14, 2025 · 62.4%
#32 · Claude Haiku 3 · LB Dec 12, 2024 · 57.0%
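A leaderboard like the one above carries an implicit invariant: ranks ascend while scores never increase (ties, such as the two 68.8% entries, are allowed). A minimal sketch of that check, using an assumed `(rank, model, score)` row format and a subset of the rows above:

```python
# Subset of the leaderboard rows above, as (rank, model, score) tuples.
rows = [
    (1, "o1 Preview", 85.2),
    (2, "Gemini Experimental", 84.8),
    (3, "Gemini 2.0 Pro Experimental", 82.8),
    (25, "Claude Haiku 3.5", 68.8),
    (26, "Claude Haiku 4.5", 68.8),  # tie with the previous row
    (32, "Claude Haiku 3", 57.0),
]


def ranks_consistent(rows):
    """Ranks must strictly ascend while scores never increase."""
    return all(
        r1 < r2 and s1 >= s2
        for (r1, _, s1), (r2, _, s2) in zip(rows, rows[1:])
    )


print(ranks_consistent(rows))  # True
```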