UAB · Unbiased AI Bench · Glass box for model evals.
Every leaderboard, with receipts.
Methodology

No statistical fruit salad.

Raw scores stay raw. Percentiles happen only inside exact comparable groups. We verify the receipt, infer only when the evidence is explicit, and refuse to guess aliases, mappings, or unsupported matches.
Layer 1 · raw receipts
Layer 2 · verified and labeled records
Layer 3 · track-aware recommendations and composites
1

Verify the receipt

Fetch the published leaderboard or dataset, save the snapshot, and keep the pointer to the public receipt. We verify the source URL, content type, and snapshot hash before we treat anything as a measurement.

Source URL preserved · Snapshot hash logged · Capture time stored
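The receipt step above can be sketched in a few lines. This is a minimal illustration, not the product's actual code; the `Snapshot` record and `capture_snapshot` helper are hypothetical names, and the point is only that the raw bytes are hashed before any parsing happens, so the pointer to the public receipt stays verifiable later.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Snapshot:
    """Pointer to a public receipt: where it came from, what it hashed to, when."""
    source_url: str
    content_type: str
    sha256: str
    captured_at: str  # ISO 8601 UTC timestamp

def capture_snapshot(source_url: str, content_type: str, body: bytes) -> Snapshot:
    """Hash the raw bytes before parsing so the snapshot is auditable later."""
    return Snapshot(
        source_url=source_url,
        content_type=content_type,
        sha256=hashlib.sha256(body).hexdigest(),
        captured_at=datetime.now(timezone.utc).isoformat(),
    )

snap = capture_snapshot("https://example.com/leaderboard.json",
                        "application/json", b'{"rows": []}')
```

Freezing the dataclass keeps the record immutable once captured, mirroring the rule that raw receipts are never rewritten downstream.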
2

Parse carefully

Parse the source into structured records. We infer a match only when the aliases and mappings are explicit enough to support it, and anomalies are logged rather than smoothed over. If the evidence is not there, the item stays open instead of being silently guessed.

Parser version attached · Anomalies logged · Manual review opened
3

Label provenance

Attach benchmark metadata and provenance: judge type, metric family, directionality, benchmark version, modality, comparable group, and claim origin. Relay rows mirror another provider's public page; provider-official rows come from the model lab's own public eval pages and stay explicitly labeled as hybrid evidence; manual rows are human-checked; backfilled rows are historical inserts; seeded rows are demo fixtures.

Judge type kept · Provenance kept · Comparable group kept
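One way to picture the labeled record is as a typed struct whose claim origin is a closed set. This sketch assumes nothing about the product's schema beyond the fields named in the step above; the field names and the example values are illustrative.

```python
from dataclasses import dataclass
from typing import Literal

# The five claim origins named in the methodology, as a closed type.
ClaimOrigin = Literal["relay", "provider-official", "manual", "backfilled", "seeded"]

@dataclass(frozen=True)
class ProvenanceLabel:
    judge_type: str          # e.g. "exact-match", "llm-judge", "human"
    metric_family: str       # e.g. "accuracy", "elo", "pass-rate"
    higher_is_better: bool   # directionality
    benchmark_version: str
    modality: str
    comparable_group: str    # the only scope where normalization may happen
    claim_origin: ClaimOrigin

row = ProvenanceLabel("exact-match", "accuracy", True, "v1.0", "text",
                      "mmlu/v1.0/exact-match", "provider-official")
```

Making `claim_origin` a `Literal` means a type checker rejects any row whose origin is not one of the five labeled kinds, which is the typed version of "stay explicitly labeled."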
4

Normalize locally

Normalize only inside the exact comparable group. The product does not flatten unrelated units into one global score.

Within-group only · No universal scalar · Coverage gaps remain visible
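Within-group normalization can be sketched as percentile ranking keyed by comparable group. The group keys and scores below are invented; what matters is that each group is ranked against itself and no cross-group scalar is ever produced.

```python
def within_group_percentiles(scores_by_group):
    """Percentile-rank each score only against its own comparable group."""
    out = {}
    for group, scores in scores_by_group.items():
        ordered = sorted(scores.values())
        n = len(ordered)
        out[group] = {
            model: sum(s <= score for s in ordered) / n  # fraction at or below
            for model, score in scores.items()
        }
    return out

ranks = within_group_percentiles({
    "mmlu/v1.0": {"model-a": 0.81, "model-b": 0.74},
    "swebench/verified": {"model-a": 0.33},  # different units, never merged
})
```

Note that `model-a` gets two separate percentiles, one per group; the function has no code path that could average them into a universal score.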
5

Publish the readout

Expose secondary metrics that describe the shape of the visible evidence. Product recommendations can rank a reviewed release track rather than a single raw model ID, but raw receipts stay attached to their exact source labels. Preview rows stay preview-labeled, provider-official rows stay hybrid-labeled, and neither is silently rewritten as independent consensus.

Coverage · Spread · Freshness · Open reviews · Track rollups
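Three of the secondary metrics above (coverage, spread, freshness) can be computed from a single group snapshot. The function name, arguments, and thresholds here are assumptions for illustration; only the idea that these numbers describe the shape of the visible evidence, rather than crowning a winner, comes from the methodology.

```python
from datetime import datetime, timezone
from statistics import pstdev

def group_readout(group_scores, captured_at, now, expected_models):
    """Secondary metrics describing the shape of the visible evidence."""
    values = list(group_scores.values())
    return {
        "coverage": len(group_scores) / len(expected_models),  # visible fraction
        "spread": pstdev(values) if len(values) > 1 else 0.0,  # score dispersion
        "freshness_days": (now - captured_at).days,            # snapshot age
    }

metrics = group_readout(
    {"model-a": 0.8, "model-b": 0.6},
    captured_at=datetime(2025, 1, 7, tzinfo=timezone.utc),
    now=datetime(2025, 1, 10, tzinfo=timezone.utc),
    expected_models=["model-a", "model-b", "model-c", "model-d"],
)
```

A coverage of 0.5 here is reported as-is; the gap stays visible instead of being hidden by a composite.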
6

State the limits

We verify public receipts, snapshot hashes, parser outputs, and human-check metadata. We infer only when the mapping rule is explicit enough to audit later. We do not guess aliases, provenance, modality, or pricing bands from vibes.

Verified · Inferred · Never guessed