HumanEval
164 hand-written Python programming problems, each scored by running unit tests against the model's generated solution.
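The pass@1 column in the leaderboard below is the fraction of problems solved on a single attempt. This page does not say how each listed score was produced, so the following is only a minimal sketch of the unbiased pass@k estimator described in the original HumanEval paper. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are illustrative assumptions, not any provider's actual evaluation harness.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: attempt budget being scored (k=1 for pass@1)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: four problems, one sample each, three solved.
example = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = mean_pass_at_k(example, k=1)
```

With one sample per problem the estimator reduces to the plain solve rate, so `score` here is 3 out of 4. Drawing more samples per problem lowers the variance of the estimate without changing what pass@1 measures.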
Leaderboard
| # | Model | Provider | pass@1 | Evaluated | Source |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 95.0% | — | |
| 2 | o1 | OpenAI | 92.4% | — | |
| 3 | Claude 3.5 Sonnet | Anthropic | 92.0% | — | |
| 4 | Mistral Large 2 | Mistral AI | 92.0% | — | |
| 5 | GPT-4o | OpenAI | 90.2% | — | |
| 6 | Llama 3.3 70B | Meta | 88.4% | — | |