#1 Claude Opus 4.7
Anthropic · frontier · 67% visible · 67% exact · 0% indirect
Hybrid receipts
Visible tradeoffs: 1.5% benchmark spread · 100% freshness · exact alias
Fit score: 97.6
Strongest evidence: Reasoning / math / science · Chat / text
Claude Opus 4.7 is strongest on Reasoning / math / science and Chat / text for this preset.
The current lead depends partly on provider-official eval receipts, so it should carry a hybrid-evidence caveat until independent coverage deepens.
- Verified rows: 3
- Manual checks: 1
- Relay rows: 0
- Backfilled rows: 0
Base score is the weighted mean of preset-domain benchmark fit (99.1).
#2 Gemini 3.1 Pro
Google · frontier · 100% visible · 100% exact · 0% indirect
Preview · Hybrid receipts · Track rollup
Visible tradeoffs: 31.4% benchmark spread · 100% freshness · exact direct
Fit score: 90.2
Strongest evidence: Long context · Chat / text
Gemini 3.1 Pro is strongest on Long context and Chat / text for this preset.
The current lead depends partly on provider-official eval receipts, so it should carry a hybrid-evidence caveat until independent coverage deepens.
- Verified rows: 6
- Manual checks: 2
- Relay rows: 0
- Backfilled rows: 0
Base score is the weighted mean of preset-domain benchmark fit (90.1).
#3 GPT-5.5
OpenAI · frontier · 67% visible · 67% exact · 0% indirect
Hybrid receipts
Visible tradeoffs: 20.9% benchmark spread · 100% freshness · exact alias
Fit score: 81.9
Strongest evidence: Chat / text · Reasoning / math / science
GPT-5.5 is strongest on Chat / text and Reasoning / math / science for this preset.
The current lead depends partly on provider-official eval receipts, so it should carry a hybrid-evidence caveat until independent coverage deepens.
- Verified rows: 3
- Manual checks: 1
- Relay rows: 0
- Backfilled rows: 0
Base score is the weighted mean of preset-domain benchmark fit (83.4).
#4 GPT-5.4
OpenAI · frontier · 67% visible · 67% exact · 0% indirect
Hybrid receipts
Visible tradeoffs: 9% benchmark spread · 92.5% freshness · exact direct
Fit score: 81.3
Strongest evidence: Chat / text · Reasoning / math / science
GPT-5.4 is strongest on Chat / text and Reasoning / math / science for this preset.
The current lead depends partly on provider-official eval receipts, so it should carry a hybrid-evidence caveat until independent coverage deepens.
- Verified rows: 5
- Manual checks: 1
- Relay rows: 0
- Backfilled rows: 0
Base score is the weighted mean of preset-domain benchmark fit (82.8).
#5 Gemini 2.5 Pro
Google · frontier · 67% visible · 67% exact · 0% indirect
Visible tradeoffs: 19.9% benchmark spread · 100% freshness · exact direct
Fit score: 73.0
Strongest evidence: Chat / text · Reasoning / math / science
Gemini 2.5 Pro is strongest on Chat / text and Reasoning / math / science for this preset.
The visible evidence mix still leans on weaker or split signals, especially around Reasoning / math / science, the source-verification state, and any backfilled or relay evidence still in play. Recent parser or mapping changes also shifted the Artificial Analysis and Arena sources.
- Verified rows: 5
- Manual checks: 0
- Relay rows: 0
- Backfilled rows: 0
Base score is the weighted mean of preset-domain benchmark fit (73.6).
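Each card closes with the same rollup note: the base score is the weighted mean of preset-domain benchmark fit. A minimal sketch of that rollup follows. The domain names, weights, and per-domain fit values below are hypothetical illustrations; the cards above publish only the final rolled-up number, not the inputs.

```python
def base_score(domain_fits: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-domain benchmark fit (0-100 scale).

    domain_fits: fit score per preset domain.
    weights: relative weight per domain; they need not sum to 1,
    since the sum is normalized out below.
    """
    total_weight = sum(weights[d] for d in domain_fits)
    weighted_sum = sum(domain_fits[d] * weights[d] for d in domain_fits)
    return weighted_sum / total_weight

# Hypothetical inputs for illustration only -- the real preset
# weights and per-domain fits are not shown in the cards.
fits = {"reasoning_math_science": 99.5, "chat_text": 98.7}
weights = {"reasoning_math_science": 0.6, "chat_text": 0.4}
print(round(base_score(fits, weights), 1))  # 99.2
```

Because the weights are normalized by their own sum, the preset can express them as raw emphasis values (e.g. 3 and 2) rather than fractions that sum to 1.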