Verify the receipt
Fetch the published leaderboard or dataset, save the snapshot, and keep the pointer to the public receipt. We verify the source URL, content type, and snapshot hash before we treat anything as a measurement.
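A minimal sketch of that snapshot check, assuming a hypothetical ExpectedSource record; the field names and the exact comparisons are illustrative, not the product's actual schema:

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ExpectedSource:
    url: str            # public receipt URL we committed to (assumed field)
    content_type: str   # e.g. "application/json" or "text/html"
    sha256: str | None  # known snapshot hash, or None on first capture

def verify_snapshot(expected: ExpectedSource, fetched_url: str,
                    content_type: str, body: bytes) -> str:
    # The source URL must point at the same host as the public receipt.
    if urlparse(fetched_url).netloc != urlparse(expected.url).netloc:
        raise ValueError(f"unexpected host: {fetched_url}")
    # The content type must match what the receipt claims to serve.
    if not content_type.startswith(expected.content_type):
        raise ValueError(f"unexpected content type: {content_type}")
    # Hash the snapshot; a mismatch means the receipt changed and needs re-review.
    digest = hashlib.sha256(body).hexdigest()
    if expected.sha256 is not None and digest != expected.sha256:
        raise ValueError("snapshot hash does not match stored receipt")
    return digest
```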
Parse the source into structured records. We infer only when aliases, mappings, and anomalies are explicit enough to support the match. If not, the item stays open instead of being silently guessed.
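A sketch of the explicit-mapping rule under those assumptions: a hypothetical ALIASES table drives the match, and anything it does not cover stays open; the entries and the resolve_model_id helper are illustrative, not the real mapping:

```python
# Explicit, auditable alias table (contents are illustrative only).
ALIASES = {
    "gpt-4o-2024-08-06": "gpt-4o",
    "claude-3-5-sonnet-20241022": "claude-3.5-sonnet",
}

def resolve_model_id(raw_label: str) -> tuple[str | None, str]:
    key = raw_label.strip().lower()
    if key in ALIASES:
        return ALIASES[key], "matched"  # explicit rule, traceable later
    return None, "open"                 # no silent guessing; item stays open
```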
Attach benchmark metadata and provenance: judge type, metric family, directionality, benchmark version, modality, comparable group, and claim origin. Relay rows mirror another provider's public page; provider-official rows come from the model lab's own public eval pages and stay explicitly labeled as hybrid evidence; manual rows are human-checked; backfilled rows are historical inserts; seeded rows are demo fixtures.
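A sketch of the provenance tag each row might carry; the ClaimOrigin values mirror the origins listed above, while the surrounding field names are assumptions rather than the product's record layout:

```python
from dataclasses import dataclass
from enum import Enum

class ClaimOrigin(Enum):
    RELAY = "relay"                          # mirrors another provider's public page
    PROVIDER_OFFICIAL = "provider_official"  # lab's own eval page; hybrid evidence
    MANUAL = "manual"                        # human-checked
    BACKFILLED = "backfilled"                # historical insert
    SEEDED = "seeded"                        # demo fixture

@dataclass
class BenchmarkRow:
    benchmark: str
    benchmark_version: str
    metric_family: str
    higher_is_better: bool      # directionality
    judge_type: str
    modality: str
    comparable_group: str
    claim_origin: ClaimOrigin
    source_url: str             # pointer back to the public receipt
```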
Normalize only inside the exact comparable group. The product does not flatten unrelated units into one global score.
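A sketch of group-scoped normalization under those constraints, assuming each row carries a comparable_group key and a raw score; min-max scaling here stands in for whatever transform the group actually uses, and it is never applied across groups:

```python
from collections import defaultdict

def normalize_within_groups(rows: list[dict]) -> list[dict]:
    # Bucket rows by their comparable group; nothing crosses a group boundary.
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[row["comparable_group"]].append(row)

    out = []
    for group_rows in groups.values():
        scores = [r["score"] for r in group_rows]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid divide-by-zero on single-row groups
        for r in group_rows:
            out.append({**r, "normalized": (r["score"] - lo) / span})
    return out
```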
Expose secondary metrics that describe the shape of the visible evidence. Product recommendations can rank a reviewed release track rather than a single raw model ID, but raw receipts stay attached to their exact source labels. Preview rows stay preview-labeled, provider-official rows stay hybrid-labeled, and neither is silently rewritten as independent consensus.
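A sketch of how a recommendation can summarize a release track while leaving each receipt's source label untouched; release_track, claim_origin, and normalized are assumed field names carried over from the sketches above:

```python
def track_summary(rows: list[dict], track: str) -> dict:
    track_rows = [r for r in rows if r["release_track"] == track]
    if not track_rows:
        return {"release_track": track, "evidence_count": 0}
    return {
        "release_track": track,
        "evidence_count": len(track_rows),
        # Source labels ride along as-is; they are never rewritten as consensus.
        "labels": sorted({r["claim_origin"] for r in track_rows}),
        "mean_normalized": sum(r["normalized"] for r in track_rows) / len(track_rows),
    }
```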
We verify public receipts, snapshot hashes, parser outputs, and human-check metadata. We infer only when the mapping rule is explicit enough to audit later. We do not guess aliases, provenance, modality, or pricing bands from vibes.