
HumanEval


Category: coding
Score unit: pass@1
Higher is better: yes

A set of 164 hand-written Python programming problems. A generated solution counts as correct when it passes the problem's unit tests.
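The pass@1 score unit can be illustrated with the unbiased pass@k estimator commonly used for this benchmark: with n samples generated per problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The sketch below is illustrative only. The function names `pass_at_k` and `benchmark_score` are hypothetical, and this page does not state how the listed results were sampled or computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; results holds (n, c) per problem (assumed layout)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem outcomes, one sample each: (samples, passes).
    demo = [(1, 1), (1, 0), (1, 1), (1, 1)]
    print(f"pass@1 = {benchmark_score(demo):.1%}")
```

With a single sample per problem, pass@1 is simply the fraction of problems solved on the first attempt, which is how a percentage such as those in the leaderboard would typically be read.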

Leaderboard

#  Model              Provider    pass@1  Evaluated  Source
1  Claude Opus 4.7    Anthropic   95.0%   -
2  o1                 OpenAI      92.4%   -
3  Claude 3.5 Sonnet  Anthropic   92.0%   -
4  Mistral Large 2    Mistral AI  92.0%   -
5  GPT-4o             OpenAI      90.2%   -
6  Llama 3.3 70B      Meta        88.4%   -


Wikibench: community-edited AI benchmark data. Content licensed CC BY-SA 4.0.