BIG-Bench Hard (BBH) is a curated subset of 23 challenging BIG-Bench tasks that require multi-step reasoning and on which chain-of-thought prompting substantially improves performance. The tasks cover algorithmic reasoning, logical deduction, causal judgment, and more. By 2024–2025, frontier models were approaching saturation (>90%) on BBH, which led to the creation of the harder BIG-Bench Extra Hard (BBEH) variant.
Accuracy is the reported evaluation metric for BIG-Bench Hard; higher is better. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
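As a rough illustration of how an accuracy figure like the ones listed here could be computed, the sketch below scores predictions by exact match and aggregates them two ways: an unweighted mean over tasks (macro) and a mean over all examples (micro). This is a minimal sketch under stated assumptions, not the method used by any source in the table. The function names, the case-insensitive matching rule, and the toy records are all hypothetical, and published BBH scores may differ in answer extraction, number of shots, and aggregation.

```python
from collections import defaultdict


def exact_match(prediction: str, target: str) -> bool:
    """Assumed matching rule: equal after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == target.strip().lower()


def bbh_accuracy(records):
    """Score an iterable of (task, prediction, target) tuples.

    Returns (per_task, macro, micro), all as percentages.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in records:
        total[task] += 1
        correct[task] += exact_match(prediction, target)
    if not total:
        raise ValueError("no records to score")

    per_task = {task: 100.0 * correct[task] / total[task] for task in total}
    macro = sum(per_task.values()) / len(per_task)  # every task weighs the same
    micro = 100.0 * sum(correct.values()) / sum(total.values())  # every example weighs the same
    return per_task, macro, micro


# Made-up toy records for illustration only; not real model outputs.
records = [
    ("boolean_expressions", "True", "True"),
    ("boolean_expressions", "False", "True"),
    ("navigate", "Yes", "Yes"),
    ("navigate", "no", "No"),
    ("navigate", "Yes", "Yes"),
    ("navigate", "Yes", "No"),
]

per_task, macro, micro = bbh_accuracy(records)
print(per_task, macro, round(micro, 2))
```

The macro and micro averages diverge whenever tasks have different numbers of examples, which is one reason scores reported by different sources are not always directly comparable.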
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Claude 3.5 Sonnet | verified | 93.1 | 2026 | Source |
| 02 | Gemini 1.5 Pro | verified | 89.2 | 2026 | Source |
| 03 | Qwen3-235B-A22B | unverified | 88.87 | 2025 | Paper, Code |
| 04 | Step-3.5-Flash Base | unverified | 88.2 | 2026 | Paper, Code |
| 05 | Gemma-3-27b | verified | 87.6 | 2026 | Source |
| 06 | Claude 3 Opus | verified | 86.8 | 2026 | Source |
| 07 | Llama 3.1 405B | verified | 85.9 | 2026 | Source |
| 08 | MiniCPM-o 4.5-Instruct | unverified | 81.1 | 2026 | Paper, Code |
| 09 | Apertus-70B-Instruct | unverified | 64.2 | 2025 | Paper, Code |
| 10 | Llama 2 70B (5-shot) | unverified | 51.2 | 2023 | Paper, Code |
| 11 | SmolLM2 (1.7B) | unverified | 32.2 | 2025 | Paper, Code |