UAB · Unbiased AI Bench · Glass box for model evals.
Every leaderboard, with receipts.
Live · updated continuously

Which model should you use,
and how sure should you be?

Start from the job you need done, then drill into the evidence only when the shortlist is narrow enough to matter.
Default job · model selection
Presets · 6 visible scenarios
Supporting evidence · always available
Step 1 · Ask the job

Describe the job in one line.

Results update live while you type. Use the button only when you want the app to apply the suggested preset and filter changes from your wording.

Query

Live update · Preset · Everyday chatbot · Combined public record

Current question: Everyday chatbot with combined public record
Use case

General-purpose chat quality with decent reasoning and enough context to feel useful day to day.

Evidence mode

Combined mode can use clearly labeled provider-official receipts while independent third-party coverage catches up.

Access model
Primary filters
Current weighting: coverage, recency, and included evidence
This preset weights chat / text, reasoning / math / science, and long context, with a 60% coverage floor and a 120-day recency window. Provider-official receipts can contribute as labeled hybrid evidence in this mode. Relay, backfilled, and seeded-demo evidence stay out unless you explicitly allow them.
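The filtering rules above can be sketched as a small eligibility check. This is a minimal sketch under stated assumptions: the `EvidenceRow` shape, the evidence-kind labels, and the domain names are hypothetical stand-ins, not the product's real schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class EvidenceRow:
    domain: str     # e.g. "chat text", "long context"
    kind: str       # "independent", "provider-official", "relay", "backfilled", ...
    eval_date: date

PRESET_DOMAINS = {"chat text", "reasoning math science", "long context"}
COVERAGE_FLOOR = 0.60                 # 60% coverage floor
RECENCY_WINDOW = timedelta(days=120)  # 120-day recency window
ALLOWED_KINDS = {"independent", "provider-official"}  # combined mode

def eligible_rows(rows: list[EvidenceRow], today: date) -> list[EvidenceRow]:
    """Keep rows with an allowed evidence label, inside the recency window,
    that land on one of the preset's domains."""
    return [r for r in rows
            if r.kind in ALLOWED_KINDS
            and today - r.eval_date <= RECENCY_WINDOW
            and r.domain in PRESET_DOMAINS]

def clears_coverage_floor(rows: list[EvidenceRow]) -> bool:
    """A model stays visible when enough preset domains carry eligible evidence."""
    covered = {r.domain for r in rows}
    return len(covered) / len(PRESET_DOMAINS) >= COVERAGE_FLOOR
```

Relay and backfilled rows drop out at the `ALLOWED_KINDS` check, which mirrors the page's "stay out unless you explicitly allow them" behavior.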
Step 2 · Current read

The current shortlist for Everyday chatbot with combined public record is Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5.

Data snapshot: May 1, 2026 · No blocked sources excluded · Combined public record
Visible tradeoffs: The current evidence supports a shortlist, not a single winner.
Current shortlist: Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.5
Headline mode: Shortlist, not single winner
Coverage: 67% visible · 67% verified
Preset: Everyday chatbot
Evidence mode: Combined public record
Hybrid receipts
Why these finalists made the cut: top reasons behind the current answer
  • Current shortlist: Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5.
  • Claude Opus 4.7 is the strongest exact-match option still visible.
  • Claude Opus 4.7 currently leads the fit score at 97.6, but the evidence is still too mixed for a single headline winner.
What to pressure test: where the current answer is still fragile
  • No single winner: The current public record is only strong enough to support a shortlist, not a single headline crown.
  • Counter-case · Gemini 3.1 Pro: Gemini 3.1 Pro is strongest on Long context and Chat / text for this preset.
  • Evidence risk: The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.
Step 3 · Pressure test the call: Read the argument before you commit

What would flip the answer

  • If you tighten benchmark spread: Claude Opus 4.7 still holds if you care more about aligned evidence than upside.
  • If you tighten recency: Claude Opus 4.7 remains viable because the visible receipts are still fairly fresh.
  • If you require open-weight: No open-weight model currently clears the same evidence floor.
  • If cost and speed matter more: No clearly cheaper alternative currently clears the same evidence floor.

Why this is not a clean win

  • The current evidence supports a shortlist, not a single winner.
  • Gemini 3.1 Pro remains close enough that a different weighting can still flip the public answer.

Receipts discipline

  • Current shortlist: Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.5
  • Evidence 1: Strongest exact-match option: Claude Opus 4.7
  • Evidence 2: Strongest indirect contender: DeepSeek Reasoner
  • Evidence 3: Best open-weight finalist: No open-weight finalist on this surface
Decision Buckets

Exact leaders first, then indirect and missing-coverage cases

The guide now keeps exact-match winners strict while still surfacing strong models that would previously have disappeared.
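One plausible reading of the three-bucket split, sketched from the visible/exact/indirect shares shown on each card. The thresholds and the function itself are assumptions: the live guide appears to also apply per-domain coverage checks (e.g. missing long-context evidence can demote a card), so treat this as an illustration of the split, not the product's actual rule.

```python
def bucket(visible: float, exact: float, indirect: float,
           floor: float = 0.6) -> str:
    """Assign a model to a decision bucket from its evidence shares.

    `visible`, `exact`, and `indirect` are fractions of the preset's
    domains (0.0-1.0). The 0.6 floor is an illustrative placeholder.
    """
    if exact >= floor:
        return "primary"      # exact-match leaders
    if indirect > 0 and visible >= floor:
        return "secondary"    # strong contenders with indirect evidence
    return "tertiary"         # tracked but under-benchmarked
```

Under this reading, a card like DeepSeek Reasoner (67% visible, 33% exact, 33% indirect) lands in the secondary bucket, while a card with only 33% visible coverage stays tertiary even with some indirect evidence.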

Primary bucket

Exact-match leaders

#1

Claude Opus 4.7

Anthropic · frontier · 67% visible · 67% exact · 0% indirect

Hybrid receipts
Visible tradeoffs: 1.5% benchmark spread · 100% freshness · exact alias
Fit score: 97.6
Strongest evidence: reasoning math science · chat text

Claude Opus 4.7 is strongest on Reasoning / math / science and Chat / text for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
3
Manual checks
1
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (99.1).
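Each card states that the base score is a weighted mean of preset-domain benchmark fit. A minimal sketch of that aggregation, assuming hypothetical per-domain weights (the real weights are not published on this page):

```python
def base_score(domain_fit: dict[str, float],
               weights: dict[str, float]) -> float:
    """Weighted mean of per-domain benchmark fit scores.

    `weights` covers the preset's domains and need not sum to 1;
    the numbers used here are illustrative, not the live figures.
    """
    total = sum(weights[d] for d in domain_fit)
    return sum(fit * weights[d] for d, fit in domain_fit.items()) / total
```

For example, with equal weights a model scoring 80 on chat text and 100 on long context gets a base score of 90; tilting the weights toward chat text pulls the mean toward 80.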
#2

Gemini 3.1 Pro

Google · frontier · 100% visible · 100% exact · 0% indirect

Preview · Hybrid receipts · Track rollup
Visible tradeoffs: 31.4% benchmark spread · 100% freshness · exact direct
Fit score: 90.2
Strongest evidence: long context · chat text

Gemini 3.1 Pro is strongest on Long context and Chat / text for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
6
Manual checks
2
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (90.1).
#3

GPT-5.5

OpenAI · frontier · 67% visible · 67% exact · 0% indirect

Hybrid receipts
Visible tradeoffs: 20.9% benchmark spread · 100% freshness · exact alias
Fit score: 81.9
Strongest evidence: chat text · reasoning math science

GPT-5.5 is strongest on Chat / text and Reasoning / math / science for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
3
Manual checks
1
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (83.4).
#4

GPT-5.4

OpenAI · frontier · 67% visible · 67% exact · 0% indirect

Hybrid receipts
Visible tradeoffs: 9% benchmark spread · 92.5% freshness · exact direct
Fit score: 81.3
Strongest evidence: chat text · reasoning math science

GPT-5.4 is strongest on Chat / text and Reasoning / math / science for this preset.

The current lead depends partly on provider-official eval receipts, so it should travel with a hybrid-evidence caveat until independent coverage deepens.

Some visible coverage is coming from provider-official receipts while independent coverage catches up.

Verified rows
5
Manual checks
1
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (82.8).
#5

Gemini 2.5 Pro

Google · frontier · 67% visible · 67% exact · 0% indirect

Visible tradeoffs: 19.9% benchmark spread · 100% freshness · exact direct
Fit score: 73.0
Strongest evidence: chat text · reasoning math science

Gemini 2.5 Pro is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Parser or mapping changes recently moved the Artificial Analysis and Arena sources.

Verified rows
5
Manual checks
0
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (73.6).
Secondary bucket

Strong contenders with indirect evidence

#6

DeepSeek Reasoner

DeepSeek · budget · 67% visible · 33% exact · 33% indirect

Visible tradeoffs: 9.4% benchmark spread · 100% freshness · exact alias
Fit score: 50.3
Strongest evidence: chat text · reasoning math science

DeepSeek Reasoner is strongest on Chat / text and Reasoning / math / science for this preset.

The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, source verification state, and any backfilled or relay evidence still in play.

Indirect evidence covers 33% of the preset domains.

Verified rows
4
Manual checks
0
Relay rows
0
Backfilled rows
0
Base score is the weighted mean of preset-domain benchmark fit (52.1).
Tertiary bucket

Tracked but under-benchmarked

These models are in the official registry, but the current benchmark surface still has missing or indirect coverage.

Llama 4 Maverick

Meta · 67% visible · 67% exact · 0% indirect

Llama 4 Maverick has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

Qwen3 235B A22B

Qwen · 67% visible · 67% exact · 0% indirect

Qwen3 235B A22B has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Long context.
chat text · reasoning math science

Claude Haiku 4.5

Anthropic · 33% visible · 33% exact · 0% indirect

Claude Haiku 4.5 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

DeepSeek Chat

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek Chat has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

DeepSeek V3 (Dec)

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3 (Dec) has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3 0324

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3 0324 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3.1

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3.1 has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3.1 Terminus

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V3.1 Terminus has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V3.2 Exp

DeepSeek · 33% visible · 33% exact · 0% indirect

DeepSeek V3.2 Exp has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

DeepSeek V4 Flash (Max)

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V4 Flash (Max) has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

DeepSeek V4 Pro (Max)

DeepSeek · 33% visible · 33% exact · 33% indirect

DeepSeek V4 Pro (Max) has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Only indirect or proxy evidence is currently available.
  • Missing benchmark coverage in Reasoning / math / science, Long context.
  • Proxy benchmark mappings are available but kept out of exact-match winners.
chat text

Gemini 2.5 Flash

Google · 33% visible · 33% exact · 0% indirect

Gemini 2.5 Flash has direct evidence on part of this preset, but not enough to clear the exact-match floor.

  • Missing benchmark coverage in Reasoning / math / science, Long context.
chat text

Hot takes with receipts

The product should generate public claims worth arguing with, not just filter state.

Open delta artifact
alert
10 review items still need manual judgment

The product keeps parser and mapping ambiguity visible instead of silently guessing.

models
Arena moved via real benchmark movement

8 benchmark rows were added, 0 removed, and 118 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:09Z -> 2026-05-01T22:04:34Z.

models
Artificial Analysis moved via real benchmark movement

75 benchmark rows were added, 0 removed, and 20 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:53Z -> 2026-05-01T22:05:29Z.

models
MTEB moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:33Z -> 2026-05-01T22:04:57Z.

models
Terminal-Bench moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:49Z -> 2026-05-01T22:05:24Z.


What changed this week

alert
10 review items still need manual judgment

The product keeps parser and mapping ambiguity visible instead of silently guessing.

models
Arena moved via real benchmark movement

8 benchmark rows were added, 0 removed, and 118 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:09Z -> 2026-05-01T22:04:34Z.

Evidence window: 2026-05-01T20:26:09Z -> 2026-05-01T22:04:34Z

models
Artificial Analysis moved via real benchmark movement

75 benchmark rows were added, 0 removed, and 20 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:53Z -> 2026-05-01T22:05:29Z.

Evidence window: 2026-05-01T20:26:53Z -> 2026-05-01T22:05:29Z

models
MTEB moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:33Z -> 2026-05-01T22:04:57Z.

Evidence window: 2026-05-01T20:26:33Z -> 2026-05-01T22:04:57Z

models
Terminal-Bench moved via new benchmark coverage

1 benchmark row was added, 0 removed, and 0 existing rows changed value or evaluation date. Window: 2026-05-01T20:26:49Z -> 2026-05-01T22:05:24Z.

Evidence window: 2026-05-01T20:26:49Z -> 2026-05-01T22:05:24Z

product
Initial glass-box matrix release

Added matrix homepage, comparable-group normalization, per-cell receipts, source pages, and custom composite preview.

Evidence window: 2026-04-16

models
Methodology contract published

Documented comparability rules, raw-vs-normalized behavior, and why unlike metrics are never averaged by default.

Evidence window: 2026-04-16

models
Artificial Analysis ID rule adopted

Stable model and creator IDs are now the preferred external identity keys when available.

Evidence window: 2026-04-15

Build / data stamp

Read this before trusting a headline.

Data snapshot: May 1, 2026 · Registry verification passed · 9 providers · 826 tracked models · Page refreshed May 7, 2026

If this stamp lags behind the repo, you are likely looking at an older build or cached deploy.

Quick routes

Jump straight to the page you need.

These shortcuts resolve into public URLs instead of hidden state. Use them to open a recommendation page, compare workspace, head-to-head page, disagreement page, change log feed, or a specific model, benchmark, or source.
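A sketch of how shortcuts like these can resolve into shareable URLs rather than hidden state. The route table and URL shapes below are hypothetical; the app's real paths are not shown on this page.

```python
from urllib.parse import quote

# Hypothetical route table; keys and URL templates are illustrative.
ROUTES = {
    "recommend": "/recommend/{slug}",
    "compare":   "/compare/{slug}",
    "versus":    "/vs/{slug}",
    "changes":   "/changes/latest",
}

def resolve(kind: str, query: str = "") -> str:
    """Turn a shortcut kind plus free-text query into a public URL."""
    slug = quote(query.lower().strip().replace(" ", "-"))
    return ROUTES[kind].format(slug=slug)
```

For instance, `resolve("compare", "gpt-5 claude opus")` yields a stable compare-page path that can be pasted anywhere, which is the point of resolving into public URLs instead of session state.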

  • Resolve a recommendation into a public artifact · "best open model for long-context research" · Research assistant
  • Send a shortlist into compare mode · "compare gpt-5, claude opus, gemini pro" · Everyday chatbot
  • Open a head-to-head debate page · "gpt-5 vs claude opus" · Everyday chatbot
  • Open a disagreement artifact · "benchmark controversy for livebench coding" · Coding copilot
  • Open the latest public movement · "what changed this week" · Everyday chatbot
  • Jump straight to an entity page · "open model gpt-5" · Open-weight shortlist