
BIG-Bench Hard.

BIG-Bench Hard (BBH) is a curated subset of 23 challenging BIG-Bench tasks that require multi-step reasoning and on which chain-of-thought (CoT) prompting significantly improves performance. Tasks include algorithmic reasoning, logical deduction, causal judgment, and more. By 2024–2025, frontier models were approaching saturation (>90% accuracy) on BBH, prompting the creation of the harder BIG-Bench Extra Hard (BBEH) variant.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Accuracy

Accuracy is the reported evaluation metric for BIG-Bench Hard. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
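As a rough illustration of how an accuracy figure like the ones below can be derived, the following sketch scores predictions by exact match and reports both a per-task average and a pooled average. It assumes exact-match scoring after trimming whitespace and lowercasing; the helper names (`normalize`, `task_accuracies`, `macro_accuracy`, `micro_accuracy`) and the toy records are hypothetical, and individual sources on this leaderboard may have used different answer extraction or averaging.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def task_accuracies(records):
    """Return exact-match accuracy (in percent) for each task.

    `records` is an iterable of (task, prediction, target) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in records:
        total[task] += 1
        if normalize(prediction) == normalize(target):
            correct[task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def macro_accuracy(records):
    """Unweighted mean of per-task accuracies (each task counts equally)."""
    per_task = task_accuracies(records)
    return sum(per_task.values()) / len(per_task)


def micro_accuracy(records):
    """Accuracy over all examples pooled together (larger tasks weigh more)."""
    records = list(records)
    hits = sum(normalize(p) == normalize(t) for _, p, t in records)
    return 100.0 * hits / len(records)


# Toy, made-up predictions for two tasks; not real model outputs.
records = [
    ("boolean_expressions", "True", "True"),
    ("boolean_expressions", "False", "True"),
    ("date_understanding", "(B)", "(B)"),
    ("date_understanding", "(A)", "(A)"),
    ("date_understanding", "(C)", "(A)"),
    ("date_understanding", "(D)", "(D)"),
]

print(task_accuracies(records))
print(macro_accuracy(records))
print(micro_accuracy(records))
```

The two averages can differ when tasks have different sizes, which is one reason scores reported by different sources are not always directly comparable.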

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Links | Notes |
|------|-------|-------|-------|------|-------|-------|
| 01 | Claude 3.5 Sonnet | verified | 93.1 | 2026 | Source | Claude 3.5 Sonnet (Oct 2024). 3-shot CoT. From llm-stats.com leaderboard. |
| 02 | Gemini 1.5 Pro | verified | 89.2 | 2026 | Source | Gemini 1.5 Pro. 3-shot CoT. |
| 03 | Qwen3-235B-A22B | unverified | 88.87 | 2025 | Paper, Code | |
| 04 | Step-3.5-Flash Base | unverified | 88.2 | 2026 | Paper, Code | |
| 05 | Gemma-3-27b | verified | 87.6 | 2026 | Source | Gemma 3 27B. 3-shot CoT. |
| 06 | Claude 3 Opus | verified | 86.8 | 2026 | Source | Claude 3 Opus. 3-shot CoT. |
| 07 | Llama 3.1 405B | verified | 85.9 | 2026 | Source | Llama 3.1 405B Instruct. Confirmed in official Meta model card (CoT). Source: https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct |
| 08 | MiniCPM-o 4.5-Instruct | unverified | 81.1 | 2026 | Paper, Code | |
| 09 | Apertus-70B-Instruct | unverified | 64.2 | 2025 | Paper, Code | |
| 10 | Llama 2 70B (5-shot) | unverified | 51.2 | 2023 | Paper, Code | |
| 11 | SmolLM2 (1.7B) | unverified | 32.2 | 2025 | Paper, Code | |