Benchmarks

coding

HumanEval | Hand-written Python coding problems. | pass@1 | 6
SWE-bench Verified | Real-world GitHub issue resolution. | % | 2

general

Chatbot Arena | Crowdsourced pairwise preference Elo. | elo | 3

math

AIME 2024 | American Invitational Mathematics Examination, 2024 problems. | % | 1
MATH | Competition mathematics problems. | % | 4

multimodal

MMMU | Multimodal college-level reasoning. | % | 4

reasoning

GPQA Diamond | Graduate-level science questions, expert-curated. | % | 2
MMLU | Massive Multitask Language Understanding (57 academic subjects). | % | 5