UAB · Unbiased AI Bench · Glass box for model evals.
Every leaderboard, with receipts.
Live · updated continuously

Which model should you use,
and how sure should you be?

Start from the job you need done, then drill into the evidence only when the shortlist is narrow enough to matter.
Default job · model selection
Presets · 6 visible scenarios
Supporting evidence · always available
Step 1 · Ask the job

Describe the job in one line.

Results update live while you type. Use the button only when you want the app to apply the suggested preset and filter changes from your wording.

Query

Live update · Preset · Everyday chatbot · Combined public record

Current question: Everyday chatbot with combined public record
Use case

General-purpose chat quality with decent reasoning and enough context to feel useful day to day.

Evidence mode

Combined mode can use clearly labeled provider-official receipts while independent third-party coverage catches up.

Access model
Primary filters
Current weighting: coverage, recency, and included evidence
This preset weights chat / text, reasoning / math / science, and long context, with a 60% coverage floor and a 120-day recency window. Provider-official receipts can contribute as labeled hybrid evidence in this mode. Relay, backfilled, and seeded-demo evidence stay out unless you explicitly allow them.
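The filtering rules above can be sketched as a small eligibility check. This is a minimal sketch under stated assumptions: the `EvidenceRow` shape, the evidence-kind labels, and the domain names are hypothetical stand-ins, not the product's real schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class EvidenceRow:
    domain: str     # e.g. "chat text", "long context"
    kind: str       # "independent", "provider-official", "relay", "backfilled", ...
    eval_date: date

PRESET_DOMAINS = {"chat text", "reasoning math science", "long context"}
COVERAGE_FLOOR = 0.60                 # 60% coverage floor
RECENCY_WINDOW = timedelta(days=120)  # 120-day recency window
ALLOWED_KINDS = {"independent", "provider-official"}  # combined mode

def eligible_rows(rows: list[EvidenceRow], today: date) -> list[EvidenceRow]:
    """Keep rows with an allowed evidence label, inside the recency window,
    that land on one of the preset's domains."""
    return [r for r in rows
            if r.kind in ALLOWED_KINDS
            and today - r.eval_date <= RECENCY_WINDOW
            and r.domain in PRESET_DOMAINS]

def clears_coverage_floor(rows: list[EvidenceRow]) -> bool:
    """A model stays visible when enough preset domains carry eligible evidence."""
    covered = {r.domain for r in rows}
    return len(covered) / len(PRESET_DOMAINS) >= COVERAGE_FLOOR
```

Relay and backfilled rows drop out at the `ALLOWED_KINDS` check, which mirrors the page's "stay out unless you explicitly allow them" behavior.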
Step 2 · Current read

The current shortlist for Everyday chatbot with combined public record is Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5.

Data snapshot: May 1, 2026 · No blocked sources excluded · Combined public record
Visible tradeoffs: The current evidence supports a shortlist, not a single winner.
Current shortlist: Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.5
Headline mode: Shortlist, not single winner
Coverage: 67% visible · 67% verified
Preset: Everyday chatbot
Evidence mode: Combined public record
Hybrid receipts
Why these finalists made the cut: top reasons behind the current answer
  • Current shortlist: Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5.
  • Claude Opus 4.7 is the strongest exact-match option still visible.
  • Claude Opus 4.7 currently leads the fit score at 97.6, but the evidence is still too mixed for a single headline winner.
What to pressure test: where the current answer is still fragile
  • No single winner: The current public record is only strong enough to support a shortlist, not a single headline crown.
  • Counter-case · Gemini 3.1 Pro: Gemini 3.1 Pro is strongest on Long context and Chat / text for this preset.
  • Evidence risk: The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.
Step 3 · Pressure test the call: Read the argument before you commit

What would flip the answer

  • If you tighten benchmark spread: Claude Opus 4.7 still holds if you care more about aligned evidence than upside.
  • If you tighten recency: Claude Opus 4.7 remains viable because the visible receipts are still fairly fresh.
  • If you require open-weight: No open-weight model currently clears the same evidence floor.
  • If cost and speed matter more: No clearly cheaper alternative currently clears the same evidence floor.

Why this is not a clean win

  • The current evidence supports a shortlist, not a single winner.
  • Gemini 3.1 Pro remains close enough that a different weighting can still flip the public answer.

Receipts discipline

  • Current shortlist: Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5
  • Evidence 1: Strongest exact-match option: Claude Opus 4.7
  • Evidence 2: Strongest indirect contender: DeepSeek Reasoner
  • Evidence 3: Best open-weight finalist: No open-weight finalist on this surface
Decision Buckets

Exact leaders first, then indirect and missing-coverage cases

The guide now keeps exact-match winners strict while still surfacing strong models that would previously have disappeared.
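One plausible reading of the three-bucket split, sketched from the visible/exact/indirect shares shown on each card. The thresholds and the function itself are assumptions: the live guide appears to also apply per-domain coverage checks (e.g. missing long-context evidence can demote a card), so treat this as an illustration of the split, not the product's actual rule.

```python
def bucket(visible: float, exact: float, indirect: float,
           floor: float = 0.6) -> str:
    """Assign a model to a decision bucket from its evidence shares.

    `visible`, `exact`, and `indirect` are fractions of the preset's
    domains (0.0-1.0). The 0.6 floor is an illustrative placeholder.
    """
    if exact >= floor:
        return "primary"      # exact-match leaders
    if indirect > 0 and visible >= floor:
        return "secondary"    # strong contenders with indirect evidence
    return "tertiary"         # tracked but under-benchmarked
```

Under this reading, a card like DeepSeek Reasoner (67% visible, 33% exact, 33% indirect) lands in the secondary bucket, while a card with only 33% visible coverage stays tertiary even with some indirect evidence.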

Primary bucket

Exact-match leaders

#1

Claude Opus 4.7

Anthropic · frontier · 67% visible · 67% exact · 0% indirect

Hybrid receipts
Visible tradeoffs: 1.5% benchmark spread · 100% freshness · exact alias
Fit score: 97.6
Strongest evidence: reasoning math science · chat text

Claude Opus 4.7 is strongest on Reasoning / math / science and Chat / text for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
3
Manual checks
1
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (99.1).
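Each card states that the base score is a weighted mean of preset-domain benchmark fit. A minimal sketch of that aggregation, assuming hypothetical per-domain weights (the real weights are not published on this page):

```python
def base_score(domain_fit: dict[str, float],
               weights: dict[str, float]) -> float:
    """Weighted mean of per-domain benchmark fit scores.

    `weights` covers the preset's domains and need not sum to 1;
    the numbers used here are illustrative, not the live figures.
    """
    total = sum(weights[d] for d in domain_fit)
    return sum(fit * weights[d] for d, fit in domain_fit.items()) / total
```

For example, with equal weights a model scoring 80 on chat text and 100 on long context gets a base score of 90; tilting the weights toward chat text pulls the mean toward 80.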
#2

Gemini 3.1 Pro

Google · frontier · 100% visible · 100% exact · 0% indirect

Preview · Hybrid receipts · Track rollup
Visible tradeoffs: 31.4% benchmark spread · 100% freshness · exact direct
Fit score: 90.2
Strongest evidence: long context · chat text

Gemini 3.1 Pro is strongest on Long context and Chat / text for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
6
Manual checks
2
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (90.1).
#3

GPT-5.5

OpenAI · frontier · 67% visible · 67% exact · 0% indirect

Hybrid receipts
Visible tradeoffs: 20.9% benchmark spread · 100% freshness · exact alias
Fit score: 81.9
Strongest evidence: chat text · reasoning math science

GPT-5.5 is strongest on Chat / text and Reasoning / math / science for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
3
Manual checks
1
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (83.4).
#4

GPT-5.4

OpenAI · frontier · 67% visible · 67% exact · 0% indirect

Hybrid receipts
Visible tradeoffs: 9% benchmark spread · 92.5% freshness · exact direct
Fit score: 81.3
Strongest evidence: chat text · reasoning math science

GPT-5.4 is strongest on Chat / text and Reasoning / math / science for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
5
Manual checks
1
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (82.8).
#5

Gemini 2.5 Pro

Google · frontier · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 19.9% benchmark spread · 100% freshness · exact direct
Fit score: 73.0
Strongest evidence: chat text · reasoning math science

Gemini 2.5 Pro is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Parser or mapping changes recently moved the Artificial Analysis and Arena sources.

Verified rows
5
Manual checks
0
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (73.6).
Secondary bucket

Strong contenders with indirect evidence

#6

DeepSeek Reasoner

DeepSeek · budget · 67% visible · 33% exact · 33% indirect

Visible tradeoffs: 9.4% benchmark spread · 100% freshness · exact alias
Fit score: 50.3
Strongest evidence: chat text · reasoning math science

DeepSeek Reasoner is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Indirect evidence covers 33% of the preset domains.

Verified rows
4
Manual checks
0
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (52.1).
Tertiary bucket

Tracked but under-benchmarked

These models are in the official registry, but the current benchmark surface still has missing or indirect coverage.

Llama 4 Maverick

Meta · 67% visible · 67% exact · 0% indirect

Llama 4 Maverick has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

Qwen3 235B A22B

Qwen · 67% visible · 67% exact · 0% indirect

Qwen3 235B A22B has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

Claude Haiku 4.5

Anthropic · 33% visible · 33% exact · 0% indirect

Claude Haiku 4.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

DeepSeek Chat

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek Chat has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

DeepSeek V3 (Dec)

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3 (Dec) has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3 0324

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3 0324 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3.1

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3.1 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3.1 Terminus

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3.1 Terminus has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3.2 Exp

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek V3.2 Exp has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

DeepSeek V4 Flash (Max)

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V4 Flash (Max) has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V4 Pro (Max)

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V4 Pro (Max) has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

Gemini 2.5 Flash

Google · 33% visible · 33% exact · 0% indirect

Gemini 2.5 Flash has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

Hot takes with receipts

The product should generate public claims worth arguing with, not just filter state.

Open delta artifact
alert
10 review items still need manual judgment

The product keeps parser and mapping ambiguity visible instead of silently guessing.

models
Arena moved via real benchmark movement

8 benchmark rows were added, 0 removed, and 118 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:09Z -> 2026-05-01T22:04:34Z.

models
Artificial Analysis moved via real benchmark movement

75 benchmark rows were added, 0 removed, and 20 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:53Z -> 2026-05-01T22:05:29Z.

models
MTEB moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:33Z -> 2026-05-01T22:04:57Z.

models
Terminal-Bench moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:49Z -> 2026-05-01T22:05:24Z.


What changed this week

alert
10 review items still need manual judgment

The product keeps parser and mapping ambiguity visible instead of silently guessing.

models
Arena moved via real benchmark movement

8 benchmark rows were added, 0 removed, and 118 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:09Z -> 2026-05-01T22:04:34Z.

Evidence window: 2026-05-01T20:26:09Z -> 2026-05-01T22:04:34Z

models
Artificial Analysis moved via real benchmark movement

75 benchmark rows were added, 0 removed, and 20 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:53Z -> 2026-05-01T22:05:29Z.

Evidence window: 2026-05-01T20:26:53Z -> 2026-05-01T22:05:29Z

models
MTEB moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:33Z -> 2026-05-01T22:04:57Z.

Evidence window: 2026-05-01T20:26:33Z -> 2026-05-01T22:04:57Z

models
Terminal-Bench moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:49Z -> 2026-05-01T22:05:24Z.

Evidence window: 2026-05-01T20:26:49Z -> 2026-05-01T22:05:24Z

product
Initial glass-box matrix release

Added matrix homepage, comparable-group normalization, per-cell receipts, source pages, and custom composite preview.

Evidence window: 2026-04-16

models
Methodology contract published

Documented comparability rules, raw-vs-normalized behavior, and why unlike metrics are never averaged by default.

Evidence window: 2026-04-16

models
Artificial Analysis ID rule adopted

Stable model and creator IDs are now the preferred external identity keys when available.

Evidence window: 2026-04-15

Build / data stamp

Read this before trusting a headline.

Data snapshot: May 1, 2026 · Registry verification passed · 9 providers · 826 tracked models · Page refreshed May 7, 2026

If this stamp lags behind the repo, you are likely looking at an older build or cached deploy.

Quick routes

Jump straight to the page you need.

These shortcuts resolve into public URLs instead of hidden state. Use them to open a recommendation page, compare workspace, head-to-head page, disagreement page, change log feed, or a specific model, benchmark, or source.
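A sketch of how shortcuts like these can resolve into shareable URLs rather than hidden state. The route table and URL shapes below are hypothetical; the app's real paths are not shown on this page.

```python
from urllib.parse import quote

# Hypothetical route table; keys and URL templates are illustrative.
ROUTES = {
    "recommend": "/recommend/{slug}",
    "compare":   "/compare/{slug}",
    "versus":    "/vs/{slug}",
    "changes":   "/changes/latest",
}

def resolve(kind: str, query: str = "") -> str:
    """Turn a shortcut kind plus free-text query into a public URL."""
    slug = quote(query.lower().strip().replace(" ", "-"))
    return ROUTES[kind].format(slug=slug)
```

For instance, `resolve("compare", "gpt-5 claude opus")` yields a stable compare-page path that can be pasted anywhere, which is the point of resolving into public URLs instead of session state.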

  • Resolve a recommendation into a public artifact · "best open model for long-context research" · Research assistant
  • Send a shortlist into compare mode · "compare gpt-5, claude opus, gemini pro" · Everyday chatbot
  • Open a head-to-head debate page · "gpt-5 vs claude opus" · Everyday chatbot
  • Open a disagreement artifact · "benchmark controversy for livebench coding" · Coding copilot
  • Open the latest public movement · "what changed this week" · Everyday chatbot
  • Jump straight to an entity page · "open model gpt-5" · Open-weight shortlist