# Kimi K2: Dark Horse Hits 94.5% on HumanEval
Moonshot AI's Kimi K2 0905 has quietly posted the second-highest HumanEval score on record, trailing only Anthropic's Claude Opus 4.6. It is another signal that the coding-benchmark arms race among Chinese AI labs is far from over.
## HumanEval Leaderboard
| # | Model | Organization | HumanEval (%) |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 96.3 |
| 2 | Kimi K2 0905 | Moonshot AI | 94.5 |
| 3 | GPT-4o | OpenAI | 92.1 |
| 4 | DeepSeek V3.2 | DeepSeek | 91.8 |
| 5 | Gemini 2.5 Pro | Google | 91.4 |
| 6 | Qwen 3 Coder | Alibaba | 90.7 |
| 7 | GLM-4.7 | Zhipu AI | 89.2 |
| 8 | MiniMax M2.1 | MiniMax | 88.5 |
*HumanEval pass@1 scores. Kimi K2 0905 highlighted. Five of the top eight positions are held by Chinese AI labs.*
## The Quiet Climb
Unlike the fanfare that typically accompanies a new benchmark leader, Moonshot AI dropped Kimi K2 0905 without a press conference or a leaderboard victory lap. The model appeared on HumanEval with a 94.5% pass@1 score, slotting in directly below Claude Opus 4.6's 96.3% and above every other model on the board. Only 1.8 percentage points separate the two.
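For context, HumanEval's pass@1 is computed with the unbiased pass@k estimator from the original benchmark paper. A minimal sketch (the sample counts below are illustrative, not Moonshot AI's actual evaluation setup):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n samples drawn per problem, c of them passing all tests.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every k-subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 189 of 200 samples passing
# yields a 94.5% pass@1 estimate.
print(round(pass_at_k(200, 189, 1), 4))  # 0.945
```

With k=1 the estimator reduces to the plain fraction of samples that pass, which is why pass@1 is often described simply as the single-shot pass rate.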
This is a significant jump from the original Kimi K2, which scored in the high 80s. The 0905 revision suggests continued internal iteration, with Moonshot AI treating coding performance as a first-class optimization target rather than a secondary capability.
## The Chinese AI Lab Arms Race
Kimi K2's result is part of a broader pattern. Chinese AI labs now occupy five of the top eight positions on HumanEval: Moonshot AI (Kimi K2), DeepSeek (V3.2), Alibaba (Qwen 3 Coder), Zhipu AI (GLM-4.7), and MiniMax (M2.1). A year ago, only one Chinese lab appeared in the top ten.
The competitive dynamics are clear. Each lab is iterating rapidly, with model revisions landing weeks rather than months apart. DeepSeek pushed the MoE architecture to 671B parameters. Alibaba has invested heavily in code-specific fine-tuning for Qwen. Zhipu AI focused on mathematical reasoning as a pathway to better code. And Moonshot AI, the youngest of the group, has taken a research-first approach that is now paying visible dividends.
*[Chart: Chinese Labs in the Top 8 vs. Western Labs in the Top 8]*
## What Makes K2 Different
Moonshot AI has been less forthcoming about K2's architecture than some competitors, but several details have emerged. The model uses a massive Mixture-of-Experts backbone, reportedly exceeding 1 trillion total parameters with a fraction activated per token. This is consistent with the trend toward MoE as the dominant architecture for frontier coding models.
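The sparse-activation idea behind MoE, a learned router sends each token to only a few experts, can be sketched in miniature. Everything below uses toy dimensions and says nothing about K2's actual router or expert sizes:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Top-k mixture-of-experts routing: score every expert with a
    linear router, keep the top_k highest-scoring experts, softmax
    their scores, and return the weighted sum of only those experts'
    outputs. All other experts (and their parameters) stay idle
    for this token."""
    logits = gate_w @ x                           # (n_experts,) router scores
    chosen = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                  # softmax over the chosen few
    return sum(wi * (experts[i] @ x) for wi, i in zip(w, chosen))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))          # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts)               # only 2 of 16 experts run
```

At trillion-parameter scale the same mechanism is what keeps inference affordable: only the activated experts' weights touch each token, so compute per token tracks the active fraction, not the total parameter count.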
The "0905" revision number suggests this is a September 2025 checkpoint that has been through extensive post-training optimization. Moonshot AI has published research on reinforcement learning from code execution feedback, which likely plays a role in the model's strong performance on function-level code synthesis tasks like HumanEval.
Critically, K2 0905 excels specifically at the kind of self-contained function generation that HumanEval measures. Whether this translates to real-world software engineering tasks (as measured by SWE-bench) remains an open question. The model's SWE-bench scores, while competitive, do not match its HumanEval dominance.
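HumanEval's format makes the "self-contained function generation" point concrete: each problem is a signature-plus-docstring prompt, the model emits a function body, and a hidden test harness decides pass or fail. An illustrative mock-up in that style (not an actual benchmark item):

```python
# Prompt shown to the model: signature plus docstring only.
PROMPT = '''def running_max(xs):
    """Return a list where element i is the max of xs[:i+1]."""
'''

# A model completion is appended verbatim to the prompt.
COMPLETION = """    out, cur = [], float("-inf")
    for x in xs:
        cur = max(cur, x)
        out.append(cur)
    return out
"""

def check(candidate):
    """Hidden test harness: the sample passes only if no assert fails."""
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    assert candidate([]) == []

ns = {}
exec(PROMPT + COMPLETION, ns)    # compile prompt + completion together
check(ns["running_max"])         # raises nothing, so this sample passes
```

Everything the model needs is in the prompt, and nothing outside the single function is tested, which is precisely the gap between HumanEval and repository-scale tasks like SWE-bench.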
## Bottom Line
Kimi K2 0905 is the real deal on HumanEval. A 94.5% pass@1 score puts Moonshot AI within striking distance of Claude Opus 4.6, and firmly ahead of GPT-4o, DeepSeek, and Gemini. The gap between Chinese and Western labs on coding benchmarks has effectively closed.
But benchmark scores tell only part of the story. HumanEval measures function-level code synthesis: clean, well-specified problems with clear test cases. Real-world software engineering involves navigating ambiguous requirements, legacy codebases, and multi-file dependencies. Until K2 shows comparable strength on SWE-bench and production workloads, the crown remains with Claude.
What is no longer in doubt: the next HumanEval SOTA will come from one of at least six labs capable of producing it. The era of any single company holding a comfortable lead on coding benchmarks is over.