Source: Arena
Shows: observed user preference on vision-grounded prompts and multimodal comparisons, that is, how often a model wins when people judge image-aware answers head to head.
Does not show: ground-truth visual reasoning accuracy, or task-specific breakdowns such as OCR, charts, or diagram-heavy workflows.
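
To make "how often a model wins head to head" concrete, here is a minimal sketch of a raw win rate computed from pairwise votes. The vote records, the model names (`model-x`, `model-y`, `model-z`), and the choice to drop ties are all assumptions made for illustration; this is not how Arena computes its published scores.

```python
from collections import Counter

# Hypothetical votes: (model_a, model_b, winner).
# winner is one of the two model names, or "tie".
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-x", "model-y", "model-y"),
    ("model-y", "model-x", "model-x"),
    ("model-x", "model-z", "tie"),
]


def win_rates(votes):
    """Return each model's share of decisive (non-tie) battles that it won."""
    wins, battles = Counter(), Counter()
    for model_a, model_b, winner in votes:
        if winner == "tie":
            continue  # assumption: ties are excluded from the rate
        battles[model_a] += 1
        battles[model_b] += 1
        wins[winner] += 1
    # Models that only appear in tied battles get no entry.
    return {model: wins[model] / battles[model] for model in battles}


rates = win_rates(votes)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.2f}")
```

A rate like this only reflects which answer voters preferred. It says nothing about whether the preferred answer read the image correctly, which is why preference data and ground-truth accuracy are listed separately.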