Multi-step Reasoning (2022)

BIG-Bench Hard (BBH)

BIG-Bench Hard is a curated subset of 23 challenging tasks from BIG-Bench that require multi-step reasoning and on which chain-of-thought prompting significantly improves performance. Tasks span algorithmic reasoning, logical deduction, causal judgment, and more. By 2024–2025, frontier models were approaching saturation (>90% accuracy) on BBH, prompting the creation of the harder BBEH variant.
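
Evaluation is a straightforward exact-match loop: prompt the model with each task input (typically with a few-shot chain-of-thought prompt) and compare the final answer against the reference. A minimal sketch in Python, assuming the community "lukaemon/bbh" mirror on the Hugging Face Hub (fields "input" and "target"); model_answer is a hypothetical stand-in for your own model call:

```python
from datasets import load_dataset

def model_answer(prompt: str) -> str:
    # Hypothetical helper: call your model here, ideally with a chain-of-
    # thought prompt ("Let's think step by step."), then return only the
    # extracted final answer.
    raise NotImplementedError

# One of the 23 BBH tasks; each task is a separate config with a "test" split.
task = load_dataset("lukaemon/bbh", "logical_deduction_three_objects", split="test")

correct = 0
for example in task:
    prediction = model_answer(example["input"])
    correct += prediction.strip() == example["target"].strip()

print(f"accuracy = {correct / len(task):.1%}")
```

Reported BBH scores are typically the mean of these per-task accuracies across all 23 tasks.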

Metrics: accuracy
Current State of the Art

Claude 3.5 Sonnet (Anthropic): 93.1 accuracy

Top Models Performance Comparison

Top 5 models ranked by accuracy

1. Claude 3.5 Sonnet: 93.1 (100.0% of best)
2. Gemini 1.5 Pro: 89.2 (95.8% of best)
3. Gemma 3 27B IT: 87.6 (94.1% of best)
4. Claude 3 Opus: 86.8 (93.2% of best)
5. Llama 3.1 405B: 85.9 (92.3% of best)
Best Score: 93.1
Top Model: Claude 3.5 Sonnet
Models Compared: 5
Score Range: 7.2
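
The summary figures above are simple derivations from the five scores in the leaderboard; the following sketch reproduces them, including the "% of best" values plotted in the comparison chart:

```python
scores = {
    "Claude 3.5 Sonnet": 93.1,
    "Gemini 1.5 Pro": 89.2,
    "Gemma 3 27B IT": 87.6,
    "Claude 3 Opus": 86.8,
    "Llama 3.1 405B": 85.9,
}

best = max(scores.values())
print(f"Best Score: {best}")                              # 93.1
print(f"Top Model: {max(scores, key=scores.get)}")        # Claude 3.5 Sonnet
print(f"Models Compared: {len(scores)}")                  # 5
print(f"Score Range: {best - min(scores.values()):.1f}")  # 7.2

# "% of best" normalizes each score against the top score.
for model, score in scores.items():
    print(f"{model}: {score / best:.1%} of best")         # 100.0% ... 92.3%
```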

Primary metric: accuracy

#  Model              Access       Organization     Score  Date
1  Claude 3.5 Sonnet  API          Anthropic        93.1   Mar 2026
2  Gemini 1.5 Pro     API          Google           89.2   Mar 2026
3  Gemma 3 27B IT                  Google DeepMind  87.6   Mar 2026
4  Claude 3 Opus      API          Anthropic        86.8   Mar 2026
5  Llama 3.1 405B     Open Source  Meta             85.9   Mar 2026

Other Multi-step Reasoning Datasets