source
Arena
Observed user preference in direct head-to-head chat comparisons. How often a model wins when users do not know the identity of the models.
Ground-truth correctness. Task coverage outside the prompt mix currently flowing through the arena.