# Kimi K2: Dark Horse Hits 94.5% on HumanEval
Moonshot AI's Kimi K2 0905 has quietly posted the second-highest HumanEval score on record, trailing only Anthropic's Claude Opus 4.6. It is another signal that the coding-benchmark arms race among Chinese AI labs is far from over.
## HumanEval Leaderboard
| # | Model | Organization | HumanEval (%) |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 96.3 |
| 2 | Kimi K2 0905 | Moonshot AI | 94.5 |
| 3 | GPT-4o | OpenAI | 92.1 |
| 4 | DeepSeek V3.2 | DeepSeek | 91.8 |
| 5 | Gemini 2.5 Pro | Google | 91.4 |
| 6 | Qwen 3 Coder | Alibaba | 90.7 |
| 7 | GLM-4.7 | Zhipu AI | 89.2 |
| 8 | MiniMax M2.1 | MiniMax | 88.5 |
*HumanEval pass@1 scores. Kimi K2 0905 highlighted. Five of the top eight positions are held by Chinese AI labs.*
## The Quiet Climb
Unlike the fanfare that typically accompanies a new benchmark leader, Moonshot AI dropped Kimi K2 0905 without a press conference or a leaderboard victory lap. The model appeared on HumanEval with a 94.5% pass@1 score, slotting in directly below Claude Opus 4.6's 96.3% and above every other model on the board. Only 1.8 percentage points separate the two.
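For context, HumanEval's pass@1 is computed with the unbiased pass@k estimator from the original benchmark paper. A minimal sketch (the sample counts below are illustrative, not Moonshot AI's actual evaluation setup):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n samples drawn per problem, c of them passing all tests.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every k-subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 189 of 200 samples passing
# yields a 94.5% pass@1 estimate.
print(round(pass_at_k(200, 189, 1), 4))  # 0.945
```

With k=1 the estimator reduces to the plain fraction of samples that pass, which is why pass@1 is often described simply as the single-shot pass rate.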
This is a significant jump from the original Kimi K2, which scored in the high 80s. The 0905 revision suggests continued internal iteration, with Moonshot AI treating coding performance as a first-class optimization target rather than a secondary capability.
## The Chinese AI Lab Arms Race
Kimi K2's result is part of a broader pattern. Chinese AI labs now occupy five of the top eight positions on HumanEval: Moonshot AI (Kimi K2), DeepSeek (V3.2), Alibaba (Qwen 3 Coder), Zhipu AI (GLM-4.7), and MiniMax (M2.1). A year ago, only one Chinese lab appeared in the top ten.
The competitive dynamics are clear. Each lab is iterating rapidly, with model revisions landing weeks rather than months apart. DeepSeek pushed the MoE architecture to 671B parameters. Alibaba has invested heavily in code-specific fine-tuning for Qwen. Zhipu AI focused on mathematical reasoning as a pathway to better code. And Moonshot AI, the youngest of the group, has taken a research-first approach that is now paying visible dividends.
*[Chart: Chinese Labs in the Top 8 vs. Western Labs in the Top 8]*
## What Makes K2 Different
Moonshot AI has been less forthcoming about K2's architecture than some competitors, but several details have emerged. The model uses a massive Mixture-of-Experts backbone, reportedly exceeding 1 trillion total parameters with a fraction activated per token. This is consistent with the trend toward MoE as the dominant architecture for frontier coding models.
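The sparse-activation idea behind MoE, a learned router sends each token to only a few experts, can be sketched in miniature. Everything below uses toy dimensions and says nothing about K2's actual router or expert sizes:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Top-k mixture-of-experts routing: score every expert with a
    linear router, keep the top_k highest-scoring experts, softmax
    their scores, and return the weighted sum of only those experts'
    outputs. All other experts (and their parameters) stay idle
    for this token."""
    logits = gate_w @ x                           # (n_experts,) router scores
    chosen = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                  # softmax over the chosen few
    return sum(wi * (experts[i] @ x) for wi, i in zip(w, chosen))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
gate_w = rng.normal(size=(n_experts, d))          # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts)               # only 2 of 16 experts run
```

At trillion-parameter scale the same mechanism is what keeps inference affordable: only the activated experts' weights touch each token, so compute per token tracks the active fraction, not the total parameter count.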
The "0905" revision number suggests this is a September 2025 checkpoint that has been through extensive post-training optimization. Moonshot AI has published research on reinforcement learning from code execution feedback, which likely plays a role in the model's strong performance on function-level code synthesis tasks like HumanEval.
Critically, K2 0905 excels specifically at the kind of self-contained function generation that HumanEval measures. Whether this translates to real-world software engineering tasks (as measured by SWE-bench) remains an open question. The model's SWE-bench scores, while competitive, do not match its HumanEval dominance.
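HumanEval's format makes the "self-contained function generation" point concrete: each problem is a signature-plus-docstring prompt, the model emits a function body, and a hidden test harness decides pass or fail. An illustrative mock-up in that style (not an actual benchmark item):

```python
# Prompt shown to the model: signature plus docstring only.
PROMPT = '''def running_max(xs):
    """Return a list where element i is the max of xs[:i+1]."""
'''

# A model completion is appended verbatim to the prompt.
COMPLETION = """    out, cur = [], float("-inf")
    for x in xs:
        cur = max(cur, x)
        out.append(cur)
    return out
"""

def check(candidate):
    """Hidden test harness: the sample passes only if no assert fails."""
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    assert candidate([]) == []

ns = {}
exec(PROMPT + COMPLETION, ns)    # compile prompt + completion together
check(ns["running_max"])         # raises nothing, so this sample passes
```

Everything the model needs is in the prompt, and nothing outside the single function is tested, which is precisely the gap between HumanEval and repository-scale tasks like SWE-bench.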
## Bottom Line
Kimi K2 0905 is the real deal on HumanEval. A 94.5% pass@1 score puts Moonshot AI within striking distance of Claude Opus 4.6, and firmly ahead of GPT-4o, DeepSeek, and Gemini. The gap between Chinese and Western labs on coding benchmarks has effectively closed.
But benchmark scores tell only part of the story. HumanEval measures function-level code synthesis: clean, well-specified problems with clear test cases. Real-world software engineering involves navigating ambiguous requirements, legacy codebases, and multi-file dependencies. Until K2 shows comparable strength on SWE-bench and production workloads, the crown remains with Claude.
What is no longer in doubt: the next HumanEval SOTA will come from one of at least six labs capable of producing it. The era of any single company holding a comfortable lead on coding benchmarks is over.