AR · benchmark platform
Arena
Blind human-preference arenas across chat, coding, vision, image, video, document, and search.
verification status
verified
Last checked May 1, 2026
Evidence ledger
Modalities: text, code, vision, document, image, video, search
Cadence: continuous
API: not public
Evaluations: 773
Verification: verified
Verified runtime: 764
Manual verified: 0
Relay / mirrored: 0
Backfilled: 9
Relay sources mirror another provider's public page; manual rows are checked against the cited page; backfilled rows are historical inserts; seeded rows are demo fixtures. Relay rows are supporting evidence, not first-party measurements.
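The ledger counts can be cross-checked with simple arithmetic. The following is a minimal Python sketch, assuming the four provenance buckets are mutually exclusive and together account for every evaluation. The dictionary keys and the `reconcile` and `supporting_only` helpers are hypothetical names chosen for illustration, not part of any Arena interface.

```python
# Counts copied from the evidence ledger; key names are illustrative only.
LEDGER = {
    "verified_runtime": 764,
    "manual_verified": 0,
    "relay_mirrored": 0,
    "backfilled": 9,
}
TOTAL_EVALUATIONS = 773


def reconcile(ledger: dict[str, int], total: int) -> int:
    """Return total minus the sum of all buckets (0 means the ledger balances)."""
    return total - sum(ledger.values())


def supporting_only(ledger: dict[str, int]) -> int:
    """Rows treated as supporting evidence rather than first-party measurements."""
    return ledger["relay_mirrored"]


if __name__ == "__main__":
    print(reconcile(LEDGER, TOTAL_EVALUATIONS))
    print(supporting_only(LEDGER))
```

With the counts listed for this source, the buckets sum to 773, so `reconcile` returns 0 and no rows fall into the relay category.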
Operational state
snapshot
Latest pull
json · May 1, 2026
parser
Parsed 764 Arena leaderboard records.
ok · 0.4.0
verify
arena verification finished with status verified.
verified · May 1, 2026
open
Arena exposed leaderboard rows that are not yet mapped into the canonical registry: muse-spark (4), kimi-k2.6 (4), kimi-k2.5-thinking (4), kimi-k2.5-instant (4), mimo-v2.5 (4), glm-5.1 (3), mimo-v2.5-pro (3), deepseek-v4-pro-thinking (3), glm-5 (3), qwen3.6-plus (3), glm-4.6 (3), glm-4.7 (3), mimo-v2-pro (3), deepseek-v3.2-thinking (3), kimi-k2-thinking-turbo (3)
model_alias · May 1, 2026
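The open mapping item lists each unmapped alias as `name (count)`. One way to turn that message into structured data is sketched below in Python. The regular expression, the `parse_unmapped` helper, and the shortened `MESSAGE` excerpt are assumptions made for illustration; what each count measures is not stated on the page.

```python
import re

# Excerpt of the open mapping message; the full list follows the same pattern.
MESSAGE = (
    "Arena exposed leaderboard rows that are not yet mapped into the "
    "canonical registry: muse-spark (4), kimi-k2.6 (4), glm-5.1 (3)"
)

# A model name (letters, digits, dots, hyphens, underscores) followed by "(N)".
ENTRY = re.compile(r"([\w.\-]+)\s*\((\d+)\)")


def parse_unmapped(message: str) -> dict[str, int]:
    """Map each unmapped alias to the count shown in parentheses."""
    # Only look at the part after the first ": " so the preamble is ignored.
    _, _, tail = message.partition(": ")
    return {name: int(count) for name, count in ENTRY.findall(tail)}


unmapped = parse_unmapped(MESSAGE)
```

Applied to the full message rather than this excerpt, the same pattern would yield 15 aliases whose counts sum to 50.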
Benchmarks from this source
Text Arena · Blind chat preference · Arena rating
Code Arena · Blind coding preference · Arena rating
Vision Arena · Blind multimodal preference · Arena rating
WebDev Arena · Blind web app preference · Arena rating
Search Arena · Blind search-grounded preference · Arena rating
Document Arena · Blind document preference · Arena rating
Text-to-Image Arena · Blind image generation preference · Arena rating
Image Edit Arena · Blind image editing preference · Arena rating
Text-to-Video Arena · Blind video generation preference · Arena rating
Image-to-Video Arena · Blind image-to-video preference · Arena rating
Video Edit Arena · Blind video editing preference · Arena rating
Latest change explanation
arena changed versus the previous run, arena-20260501T202609Z. Causes: source_snapshot, parser_diff, mapping, benchmark_movement.
- Source snapshot changed: The saved raw source snapshot changed relative to the previous run.
- Parser output changed: The parser metadata or warnings shifted relative to the previous run.
- Mapping or review queue changed: Mapping-related signals changed across 1 field.
- Benchmark coverage or values moved: 8 benchmark rows were added, 0 removed, and 118 existing rows changed value or evaluation date.
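The added, removed, and changed counts in the last item can be derived by comparing two runs keyed by benchmark and model. The Python sketch below shows one way to do that, assuming each row is identified by a `(benchmark, model)` pair and carries a rating and an evaluation date. The `Row` type, the `movement` function, and all sample rows are hypothetical placeholders, not actual Arena data or the platform's own implementation.

```python
from typing import NamedTuple


class Row(NamedTuple):
    rating: float
    evaluated_on: str  # ISO date string


# Key: (benchmark, model). All rows here are made-up placeholders.
previous = {
    ("Text Arena", "model-a"): Row(1400.0, "2026-04-30"),
    ("Text Arena", "model-b"): Row(1350.0, "2026-04-30"),
    ("Code Arena", "model-a"): Row(1380.0, "2026-04-30"),
}
current = {
    ("Text Arena", "model-a"): Row(1402.0, "2026-05-01"),  # value and date changed
    ("Text Arena", "model-b"): Row(1350.0, "2026-04-30"),  # unchanged
    ("Code Arena", "model-a"): Row(1380.0, "2026-05-01"),  # date changed only
    ("Code Arena", "model-c"): Row(1300.0, "2026-05-01"),  # new row
}


def movement(prev: dict, curr: dict) -> dict[str, int]:
    """Count rows added, removed, or changed in value or evaluation date."""
    added = curr.keys() - prev.keys()
    removed = prev.keys() - curr.keys()
    changed = {k for k in curr.keys() & prev.keys() if curr[k] != prev[k]}
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}


summary = movement(previous, current)
```

A row counts as changed when either its rating or its evaluation date differs, which matches the wording "changed value or evaluation date" in the explanation.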