Benchmarks
coding
| HumanEval | Hand-written Python coding problems. | pass@1 | 6 |
| SWE-bench Verified | Real-world GitHub issue resolution. | % | 2 |
general
| Chatbot Arena | Crowdsourced pairwise preference Elo. | elo | 3 |
math
multimodal
| MMMU | Multimodal college-level reasoning. | % | 4 |
reasoning
| GPQA Diamond | Graduate-level science questions, expert-curated. | % | 2 |
| MMLU | Massive Multitask Language Understanding, covering 57 academic subjects. | % | 5 |