GPT-5 Leads Aider Polyglot at 88% --- Real-World Coding Benchmark
OpenAI's GPT-5 with high reasoning effort tops the Aider Polyglot coding benchmark at 88.0%, establishing a clear lead over o3-pro (84.9%) and Gemini 2.5 Pro (83.1%). Claude Sonnet 4 disappoints at 61.0%, trailing the pack by a wide margin.
The Aider Polyglot benchmark measures a model's ability to edit code across multiple programming languages in real-world scenarios. Unlike synthetic benchmarks that test isolated function generation, Polyglot evaluates how well models handle actual codebases with existing context, multiple files, and cross-language dependencies. This makes it one of the most practical measures of a model's coding utility.
GPT-5 with high reasoning effort achieves 88.0%, a 3.1 percentage point lead over the second-place o3-pro. This is a decisive gap on a benchmark where top models have historically been separated by fractions of a percent. OpenAI dominates the top 6 with five entries, while Anthropic's Claude Sonnet 4 lands near the bottom of the leaderboard at 61.0%.
Aider Polyglot Leaderboard
| # | Model | Provider | Score | vs. #1 |
|---|---|---|---|---|
| 1 | GPT-5 (high reasoning) | OpenAI | 88.0% | --- |
| 2 | o3-pro | OpenAI | 84.9% | -3.1% |
| 3 | Gemini 2.5 Pro | Google | 83.1% | -4.9% |
| 4 | o3 | OpenAI | 82.6% | -5.4% |
| 5 | GPT-5 (medium reasoning) | OpenAI | 81.3% | -6.7% |
| 6 | o4-mini (high) | OpenAI | 80.8% | -7.2% |
| 7 | DeepSeek R1 | DeepSeek | 76.7% | -11.3% |
| 8 | Gemini 2.0 Flash Thinking | Google | 75.5% | -12.5% |
| 9 | Claude Sonnet 4 | Anthropic | 61.0% | -27.0% |
Aider Polyglot benchmark scores as of March 2026. Models ranked by percentage of tasks completed correctly across multiple programming languages.
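The "vs. #1" column is a plain score difference against the leader. A minimal Python sketch that reproduces it from the raw scores in the table above:

```python
# Reproduce the leaderboard's "vs. #1" column from the published scores.
scores = {
    "GPT-5 (high reasoning)": 88.0,
    "o3-pro": 84.9,
    "Gemini 2.5 Pro": 83.1,
    "o3": 82.6,
    "GPT-5 (medium reasoning)": 81.3,
    "o4-mini (high)": 80.8,
    "DeepSeek R1": 76.7,
    "Gemini 2.0 Flash Thinking": 75.5,
    "Claude Sonnet 4": 61.0,
}

leader = max(scores.values())  # 88.0
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    delta = round(score - leader, 1)  # e.g. -3.1 for o3-pro
    print(f"{model}: {score:.1f}% ({delta:+.1f})")
```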
OpenAI Dominates the Top 6
Provider Breakdown
| Provider | Models | Best |
|---|---|---|
| OpenAI | 5 | 88.0% |
| Google | 2 | 83.1% |
| DeepSeek | 1 | 76.7% |
| Anthropic | 1 | 61.0% |
Key Takeaways
- OpenAI holds 5 of the top 6 positions, showing depth across its model lineup
- GPT-5's reasoning effort setting matters: high (88.0%) vs. medium (81.3%) is a 6.7 point gap
- Google's Gemini 2.5 Pro is the only non-OpenAI model in the top 3
- Claude Sonnet 4 at 61.0% is 27 points behind the leader, a surprisingly large gap
GPT-5 Reasoning Effort: High vs. Medium
GPT-5 appears twice on the leaderboard with different reasoning effort settings. The high reasoning configuration scores 88.0% while medium reasoning scores 81.3%, a 6.7 percentage point difference. This demonstrates that scaling inference-time compute continues to deliver meaningful gains on practical coding tasks.
The gap between high and medium reasoning is larger than the gap between o3-pro and Gemini 2.5 Pro (1.8 points), suggesting that reasoning effort selection has become one of the most impactful levers for coding performance. Teams choosing between cost and quality should benchmark their specific workloads to find the optimal reasoning level.
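In practice, trading cost against quality comes down to which effort level you request per call. A hedged sketch of how such a request might be assembled for the OpenAI SDK (the `reasoning_effort` parameter and the `gpt-5` model identifier are assumptions here; check your SDK version's documentation before relying on them):

```python
def build_request(prompt: str, effort: str) -> dict:
    """Assemble chat-completion kwargs for a given reasoning effort level.

    The parameter names below are assumptions for illustration, not a
    confirmed API surface.
    """
    assert effort in {"low", "medium", "high"}
    return {
        "model": "gpt-5",            # assumed model identifier
        "reasoning_effort": effort,  # high scored 88.0%, medium 81.3%
        "messages": [{"role": "user", "content": prompt}],
    }

# Then, with an OpenAI client configured:
# client.chat.completions.create(**build_request("Fix the failing test", "high"))
```

Benchmarking your own workload at each effort level, as suggested above, is the only reliable way to pick a default.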
Claude Sonnet 4: What Happened?
Claude Sonnet 4's 61.0% score is the most notable result on the leaderboard --- not because it's a bad model, but because it trails the leader by 27 points. For a model family that has been competitive on other coding benchmarks like SWE-bench, this is a significant underperformance.
The Aider Polyglot benchmark specifically tests multi-language editing ability with Aider's diff-based workflow. Claude's lower score may reflect difficulty with Aider's specific edit format rather than a fundamental coding capability gap. Benchmark methodology always matters: a model that excels at generating code from scratch may struggle with edit-based workflows, and vice versa.
That said, for teams using Aider as their primary coding assistant, this result is directly actionable: GPT-5 or o3-pro will deliver substantially better results in that specific workflow.
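Aider's diff-based workflow asks the model to emit search/replace edit blocks rather than whole files, and a model that cannot reproduce the search text verbatim fails the edit. A minimal sketch of applying one such edit (a simplification of Aider's actual format, used here only to illustrate why edit-based scoring can diverge from generation-based scoring):

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply a single search/replace edit, failing loudly on a bad match.

    A model that mis-quotes the SEARCH text gets a hard failure here, which
    is one plausible (assumed, not measured) way a strong code generator can
    still lose points on an edit-based benchmark.
    """
    if search not in source:
        raise ValueError("SEARCH block does not match the file")
    if source.count(search) > 1:
        raise ValueError("SEARCH block is ambiguous (multiple matches)")
    return source.replace(search, replace, 1)

code = "def add(a, b):\n    return a - b\n"  # buggy function
fixed = apply_edit(code, "return a - b", "return a + b")
print(fixed)  # the failing subtraction is now an addition
```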
About the Aider Polyglot Benchmark
| Attribute | Details |
|---|---|
| What it measures | Multi-language code editing using Aider's diff-based workflow |
| Languages tested | Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more |
| Task type | Edit existing code to pass failing tests |
| Why it matters | Reflects real-world coding assistant usage --- editing, not just generating |
| Source | aider.chat/docs/leaderboards |
The Bottom Line
GPT-5 with high reasoning is the best model for Aider-based coding workflows, and it is not close. At 88.0%, it outperforms every competitor by at least 3 points. OpenAI's dominance across the top 6 positions shows that their investment in reasoning capabilities has translated directly into practical coding performance.
For teams evaluating coding assistants, the choice depends on workflow. If you use Aider, GPT-5 or o3-pro are the clear picks. If cost is a concern, GPT-5 with medium reasoning (81.3%) or o4-mini (80.8%) offer strong performance at lower compute cost. Claude Sonnet 4's 61.0% means Aider users should look elsewhere, though Claude may still excel in other coding contexts.
Track the latest Aider Polyglot results and compare coding models across all benchmarks on CodeSOTA.