SL · benchmark platform
Scale Labs
Rubric-heavy frontier evals across agentic coding, visual-language understanding, spoken dialogue, tutoring, and hard reasoning.
verification status
verified
Last checked May 1, 2026
Evidence ledger
Modalities: text, code, vision, audio
Cadence: release-based
API: not public
Evaluations: 98
Verification: verified
Verified runtime: 70
Manual verified: 0
Relay / mirrored: 0
Backfilled: 28
Relay sources mirror another provider's public page; manual rows are checked against the cited page; backfilled rows are historical inserts; seeded rows are demo fixtures. Relay rows are supporting evidence, not first-party measurements.
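As a quick consistency check on the ledger, the four row categories should sum to the reported evaluation total (70 + 0 + 0 + 28 = 98). The sketch below assumes the counts are copied by hand from this page; the dictionary keys and the `non_relay_count` helper are illustrative names, not part of any published API (the API is listed as not public).

```python
# Row counts copied from the evidence ledger on this page.
# The key names are hypothetical labels chosen for this sketch.
ledger = {
    "verified_runtime": 70,
    "manual_verified": 0,
    "relay_mirrored": 0,
    "backfilled": 28,
}
reported_evaluations = 98


def non_relay_count(rows):
    """Count rows that are not relay/mirrored, since relay rows are
    described as supporting evidence rather than first-party measurements."""
    return sum(n for kind, n in rows.items() if kind != "relay_mirrored")


# The categories should account for every reported evaluation.
assert sum(ledger.values()) == reported_evaluations

print(non_relay_count(ledger))
```

Because the relay count is zero for this source, excluding relay rows leaves the total unchanged; the distinction would matter for a source that mirrors another provider's page.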
Operational state
snapshot
Latest pull
json · May 1, 2026
parser
Loaded 70 verified benchmark records for scale-labs.
ok · 0.1.0
verify
scale-labs verification finished with status verified.
verified · May 1, 2026
Benchmarks from this source
EnigmaEval · Hard reasoning · Pass rate
VISTA · Vision-language understanding · Score
TutorBench · STEM tutoring quality · Score
VTB · Vision-language reasoning · APR
PRBench Legal · Professional legal reasoning · Score
HiL-Bench · Human-in-the-loop software tasks · Success rate
MASK · Hidden-goal honesty and safety · Honesty score
MultiNRC · Multilingual reasoning · Score
Latest change explanation
scale-labs matched scale-labs-20260501T202649Z; no notable change causes were detected.