o1

Reasoning-tuned model.

Benchmark results

Benchmark      Category    Score   Verified  Source
AIME 2024      math        83.3%   yes
GPQA Diamond   reasoning   78.0%   yes
HumanEval      coding      92.4%   yes
MATH           math        94.8%   yes
