Wikibench — the AI model benchmark wiki

coding


HumanEval	Hand-written Python coding problems.	pass@1	6
SWE-bench Verified	Real-world GitHub issue resolution.	%	2


Chatbot Arena	Elo ratings from crowdsourced pairwise preference votes.	elo	3


AIME 2024	American Invitational Mathematics Examination, 2024 problems.	%	1
MATH	Competition mathematics problems.	%	4


MMMU	Multimodal college-level reasoning.	%	4


GPQA Diamond	Graduate-level science questions, expert-curated.	%	2
MMLU	Massive Multitask Language Understanding — 57 academic subjects.	%	5