Wikibench
A community-edited reference of AI model benchmark results. Browse models, benchmarks, and providers, compare any two models side by side, or view the leaderboards.
Benchmark coverage
Recorded results by benchmark category.
- coding: 8
- reasoning: 7
- math: 5
- multimodal: 4
- general: 3
Models by provider
Catalogued models for the most represented providers.
- Anthropic: 2
- OpenAI: 2
- Google DeepMind: 1
- Meta: 1
- Mistral AI: 1
Featured leaderboards
Chatbot Arena
| # | Model | Elo |
|---|---|---|
| 1 | Gemini 2.0 Flash | 1356 |
| 2 | GPT-4o | 1287 |
| 3 | Claude 3.5 Sonnet | 1271 |
GPQA Diamond
| # | Model | Accuracy |
|---|---|---|
| 1 | Claude Opus 4.7 | 83.3% |
| 2 | o1 | 78.0% |
HumanEval
| # | Model | pass@1 |
|---|---|---|
| 1 | Claude Opus 4.7 | 95.0% |
| 2 | o1 | 92.4% |
| 3 | Claude 3.5 Sonnet | 92.0% |
| 4 | Mistral Large 2 | 92.0% |
| 5 | GPT-4o | 90.2% |