Multi-step Reasoning | 2022 | English
BIG-Bench Hard (BBH)
BIG-Bench Hard is a curated subset of 23 challenging tasks from BIG-Bench that require multi-step reasoning, where chain-of-thought prompting significantly helps performance. Tasks include algorithmic reasoning, logical deduction, causal judgment, and more. By 2024–2025, frontier models were approaching saturation (>90%) on BBH, prompting the creation of the harder BBEH variant.
Metrics: accuracy
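BBH tasks are typically scored by exact-match accuracy: a model's answer counts as correct only if it matches the reference answer after light normalization. Below is a minimal sketch of such a scorer; the `normalize` helper and the example answers are illustrative assumptions, not part of the official BBH evaluation code.

```python
# Illustrative BBH-style exact-match accuracy scorer.
# normalize() and the sample answers below are hypothetical,
# not taken from the official BBH release.

def normalize(answer: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing periods."""
    return answer.strip().strip(".").lower()

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Example: multiple-choice and yes/no style answers, as in many BBH tasks.
preds = ["(A)", "valid", "No"]
refs = ["(A)", "invalid", "no"]
print(exact_match_accuracy(preds, refs))  # → 0.6666666666666666
```

Real harnesses additionally parse the final answer out of a chain-of-thought transcript before scoring, which this sketch omits.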
Current State of the Art: Claude 3.5 Sonnet (Anthropic), 93.1 accuracy
Top Models Performance Comparison (top 5 models ranked by accuracy, the primary metric):

- Best Score: 93.1
- Top Model: Claude 3.5 Sonnet
- Models Compared: 5
- Score Range: 7.2
| # | Model | Score | Date |
|---|---|---|---|
| 1 | Claude 3.5 Sonnet (API, Anthropic) | 93.1 | Mar 2026 |
| 2 | Gemini 1.5 Pro (API, Google) | 89.2 | Mar 2026 |
| 3 | Gemma 3 27B IT (Google DeepMind) | 87.6 | Mar 2026 |
| 4 | Claude 3 Opus (API, Anthropic) | 86.8 | Mar 2026 |
| 5 | Llama 3.1 405B (Open Source, Meta) | 85.9 | Mar 2026 |