
GPT-5 Leads Aider Polyglot at 88% --- Real-World Coding Benchmark

OpenAI's GPT-5 with high reasoning effort tops the Aider Polyglot coding benchmark at 88.0%, establishing a clear lead over o3-pro (84.9%) and Gemini 2.5 Pro (83.1%). Claude Sonnet 4 disappoints at 61.0%, trailing the pack by a wide margin.

GPT-5 (high): 88.0% · o3-pro: 84.9% · Gemini 2.5 Pro: 83.1% · Claude Sonnet 4: 61.0%

The Aider Polyglot benchmark measures a model's ability to edit code across multiple programming languages in real-world scenarios. Unlike synthetic benchmarks that test isolated function generation, Polyglot evaluates how well models handle actual codebases with existing context, multiple files, and cross-language dependencies. This makes it one of the most practical measures of a model's coding utility.

GPT-5 with high reasoning effort achieves 88.0%, a 3.1 percentage point lead over the second-place o3-pro. This is a decisive gap on a benchmark where top models have historically been separated by fractions of a percent. OpenAI dominates the top 6 with five entries, while Anthropic's Claude Sonnet 4 lands near the bottom of the leaderboard at 61.0%.

Aider Polyglot Leaderboard

#   Model                      Provider    Score   vs. #1
1   GPT-5 (high reasoning)     OpenAI      88.0%   ---
2   o3-pro                     OpenAI      84.9%   -3.1%
3   Gemini 2.5 Pro             Google      83.1%   -4.9%
4   o3                         OpenAI      82.6%   -5.4%
5   GPT-5 (medium reasoning)   OpenAI      81.3%   -6.7%
6   o4-mini (high)             OpenAI      80.8%   -7.2%
7   DeepSeek R1                DeepSeek    76.7%   -11.3%
8   Gemini 2.0 Flash Thinking  Google      75.5%   -12.5%
9   Claude Sonnet 4            Anthropic   61.0%   -27.0%

Aider Polyglot benchmark scores as of March 2026. Models ranked by percentage of tasks completed correctly across multiple programming languages.
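The "vs. #1" column is simply each model's score subtracted from the leader's. A quick sketch reproducing it, with the scores hard-coded from the table above:

```python
# Leaderboard scores from the table above (percent of tasks solved correctly)
scores = {
    "GPT-5 (high reasoning)": 88.0,
    "o3-pro": 84.9,
    "Gemini 2.5 Pro": 83.1,
    "o3": 82.6,
    "GPT-5 (medium reasoning)": 81.3,
    "o4-mini (high)": 80.8,
    "DeepSeek R1": 76.7,
    "Gemini 2.0 Flash Thinking": 75.5,
    "Claude Sonnet 4": 61.0,
}

leader = max(scores.values())  # 88.0

# Gap to the leader, in percentage points (rounded to one decimal)
gaps = {model: round(leader - s, 1) for model, s in scores.items()}
print(gaps["o3-pro"])           # 3.1
print(gaps["Claude Sonnet 4"])  # 27.0
```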

OpenAI Dominates the Top 6

Provider Breakdown

Provider    Models   Best
OpenAI      5        88.0%
Google      2        83.1%
DeepSeek    1        76.7%
Anthropic   1        61.0%

Key Takeaways

  • OpenAI holds 5 of the top 6 positions, showing depth across its model lineup
  • GPT-5's reasoning effort setting matters: high (88.0%) vs. medium (81.3%) is a 6.7 point gap
  • Google's Gemini 2.5 Pro is the only non-OpenAI model in the top 3
  • Claude Sonnet 4 at 61.0% is 27 points behind the leader, a surprisingly large gap

GPT-5 Reasoning Effort: High vs. Medium

GPT-5 appears twice on the leaderboard with different reasoning effort settings. The high reasoning configuration scores 88.0% while medium reasoning scores 81.3%, a 6.7 percentage point difference. This demonstrates that inference-time compute scaling continues to deliver meaningful gains on practical coding tasks.

The gap between high and medium reasoning is larger than the gap between o3-pro and Gemini 2.5 Pro (1.8 points), suggesting that reasoning effort selection has become one of the most impactful levers for coding performance. Teams choosing between cost and quality should benchmark their specific workloads to find the optimal reasoning level.
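In OpenAI's API, reasoning effort is a request parameter rather than a separate model, so sweeping it across your own workload is straightforward. A minimal sketch of building such a request (the `gpt-5` model id and the accepted effort values here are assumptions for illustration; `reasoning_effort` is the parameter OpenAI documents for its reasoning models):

```python
def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a Chat Completions payload with an explicit reasoning effort.

    Assumptions: the "gpt-5" model id and the {low, medium, high} effort
    levels are illustrative; check OpenAI's current API reference.
    """
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-5",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

# The same prompt at two effort levels differs only in this one field:
hi = build_request("Fix the failing test in utils.py", effort="high")
med = build_request("Fix the failing test in utils.py", effort="medium")
```

Benchmarking cost against quality on your own tasks is then a matter of sweeping this one field and comparing pass rates and token spend.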

Claude Sonnet 4: What Happened?

Claude Sonnet 4's 61.0% score is the most notable result on the leaderboard --- not because it's a bad model, but because it trails the leader by 27 points. For a model family that has been competitive on other coding benchmarks like SWE-bench, this is a significant underperformance.

The Aider Polyglot benchmark specifically tests multi-language editing ability with Aider's diff-based workflow. Claude's lower score may reflect difficulty with Aider's specific edit format rather than a fundamental coding capability gap. Benchmark methodology always matters: a model that excels at generating code from scratch may struggle with edit-based workflows, and vice versa.
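Aider's edit format asks the model to emit SEARCH/REPLACE blocks rather than whole files, and applying one is essentially an exact-match string substitution. A toy applier to make that workflow concrete (this parser is a deliberate simplification of Aider's real format handling):

```python
def apply_search_replace(source: str, block: str) -> str:
    """Apply one Aider-style SEARCH/REPLACE block to `source`.

    Simplified sketch: assumes a single well-formed block whose SEARCH
    text appears verbatim in the source (real Aider is more lenient).
    """
    _, _, rest = block.partition("<<<<<<< SEARCH\n")
    search, _, tail = rest.partition("\n=======\n")
    replace = tail.split("\n>>>>>>> REPLACE")[0]
    if search not in source:
        raise ValueError("SEARCH text not found in source")
    return source.replace(search, replace, 1)

code = "def add(a, b):\n    return a - b\n"
edit = (
    "<<<<<<< SEARCH\n"
    "    return a - b\n"
    "=======\n"
    "    return a + b\n"
    ">>>>>>> REPLACE"
)
fixed = apply_search_replace(code, edit)
print(fixed)  # def add(a, b):\n    return a + b
```

An edit fails outright if the model's SEARCH text does not match the file exactly, which is why edit-format adherence matters as much as raw coding ability on this benchmark.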

That said, for teams using Aider as their primary coding assistant, this result is directly actionable: GPT-5 or o3-pro will deliver substantially better results in that specific workflow.

About the Aider Polyglot Benchmark

What it measures: Multi-language code editing using Aider's diff-based workflow
Languages tested: Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more
Task type: Edit existing code to pass failing tests
Why it matters: Reflects real-world coding assistant usage --- editing, not just generating
Source: aider.chat/docs/leaderboards

The Bottom Line

GPT-5 with high reasoning is the best model for Aider-based coding workflows, and it is not close. At 88.0%, it outperforms every competitor by at least 3 points. OpenAI's dominance across the top 6 positions shows that their investment in reasoning capabilities has translated directly into practical coding performance.

For teams evaluating coding assistants, the choice depends on workflow. If you use Aider, GPT-5 or o3-pro are the clear picks. If cost is a concern, GPT-5 with medium reasoning (81.3%) or o4-mini (80.8%) offer strong performance at lower compute cost. Claude Sonnet 4's 61.0% means Aider users should look elsewhere, though Claude may still excel in other coding contexts.

Track the latest Aider Polyglot results and compare coding models across all benchmarks on CodeSOTA.
