Instruction following

LiveBench instruction-following slice across paraphrase and constrained generation tasks.

Source · LiveBench
Version · livebench snapshot 2026-05-01
Scores · 32

Passport

Verified but aging

This is an objective signal, so it is mainly about measurable task performance rather than public taste.

source

LiveBench

metric

Score (%)

judge

Objective

direction

higher better

group id

livebench_instruction_following_2026_04

domain

Chat / text

What it measures vs what it misses

✓ Measures

How well a model obeys specified output instructions on recent tasks.

✗ Misses

Broader reasoning depth. Human preference over style.

Why this counts

It tests whether the model is actually useful in normal conversational turns, not just on narrow correctness tasks.

Comparable-group rule

This percentile only compares models inside the exact benchmark/version group shown here. It is not a universal score.

What it misses

It does not prove deeper reasoning, tool use, or enterprise workflow reliability.
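The comparable-group rule can be illustrated with a minimal sketch. The page does not state how the percentile is computed, so the formula here (share of models in the same group scoring at or below a given model) is an assumption, as are the function name `percentile_within_group` and the row layout. The four LiveBench rows are a subset copied from the leaderboard on this page; the second group and its single row are hypothetical placeholders, included only to show that groups never mix.

```python
from collections import defaultdict


def percentile_within_group(rows):
    """Return {(group_id, model): percentile}, computed per group only.

    rows: iterable of (group_id, model, score) tuples.
    Assumed definition: percentage of models in the SAME group whose
    score is at or below this model's score.
    """
    by_group = defaultdict(list)
    for group_id, model, score in rows:
        by_group[group_id].append((model, score))

    result = {}
    for group_id, entries in by_group.items():
        n = len(entries)
        for model, score in entries:
            at_or_below = sum(1 for _, s in entries if s <= score)
            result[(group_id, model)] = 100.0 * at_or_below / n
    return result


GROUP = "livebench_instruction_following_2026_04"  # group id from this page

rows = [
    # Subset of the leaderboard on this page (not the full 32 rows).
    (GROUP, "o1 Preview", 85.2),
    (GROUP, "Claude Sonnet 3.7", 82.2),
    (GROUP, "GPT-4o", 74.1),
    (GROUP, "Claude Haiku 3", 57.0),
    # Hypothetical row in a hypothetical group, for illustration only.
    ("hypothetical_other_group", "Model X", 99.0),
]

percentiles = percentile_within_group(rows)
for (group_id, model), p in sorted(percentiles.items()):
    print(f"{group_id:45s} {model:20s} {p:5.1f}")
```

Because only four of the 32 rows are used, the printed percentiles will not match the ones shown on the live page. The point is that the hypothetical Model X, despite its higher raw score, has no effect on the percentiles inside the LiveBench group.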

Leaderboard · this benchmark version

#1 · o1 Preview

LB · Dec 10, 2024

85.2%

#2 · Gemini Experimental

LB · Dec 10, 2024

84.8%

#3 · Gemini 2.0 Pro Experimental

LB · Feb 5, 2025

82.8%

#4 · o1

LB · Mar 4, 2025

82.3%

#5 · Claude Sonnet 3.7

LB · Feb 24, 2025

82.2%

#6 · Gemini 2.0 Flash-Lite

LB · Feb 27, 2025

81.3%

#7 · Gemini 2.5 Pro

LB · Mar 25, 2025

81%

#8 · Gemini 2.0 Flash

LB · Apr 7, 2025

80.9%

#9 · Grok 2 mini

LB · Oct 17, 2024

80.7%

#10 · DeepSeek Reasoner

LB · Jan 20, 2025

80.6%

#11 · o3 mini

LB · Feb 1, 2025

80.3%

#12 · Gemini 1.5 Flash 8B

LB · Dec 10, 2024

76.8%

#13 · Grok 3

LB · Mar 18, 2025

75.9%

#14 · Gemini 1.5 Flash

LB · Dec 10, 2024

75.8%

#15 · GPT-4o

LB · Mar 27, 2025

74.1%

#16 · Gemini 1.5 Pro

LB · Dec 10, 2024

73.3%

#17 · o1 mini

LB · Dec 10, 2024

72.6%

#18 · GPT-4.5 Preview

LB · Feb 27, 2025

72.3%

#19 · GPT-4

LB · Jul 10, 2024

71.8%

#20 · Claude Sonnet 3.5

LB · Dec 10, 2024

71.1%

#21 · Grok 2

LB · Dec 16, 2024

70.8%

#22 · Grok Beta

LB · Dec 10, 2024

70.5%

#23 · GPT-4o mini

LB · Dec 10, 2024

69.7%

#24 · GPT-4 Turbo

LB · Dec 10, 2024

68.9%

#25 · Claude Haiku 3.5

LB · Dec 10, 2024

68.8%

#26 · Claude Haiku 4.5

LB · Dec 10, 2024

68.8%

#27 · DeepSeek Chat

LB · Dec 11, 2024

67.8%

#28 · Claude Opus 3

LB · Dec 10, 2024

65.6%

#29 · Claude Sonnet 3

LB · Jun 11, 2024

63.6%

#30 · GPT-3.5 Turbo

LB · Jun 11, 2024

62.9%

#31 · Grok 3 mini

LB · Mar 14, 2025

62.4%

#32 · Claude Haiku 3

LB · Dec 12, 2024

57%