News
Technical deep-dives into trending AI models. Benchmark analysis, architecture breakdowns, practical recommendations.
Featured
Gemini 3.1 Flash Live: Real-Time Multimodal Voice for AI Agents
Google's real-time voice model handles audio + images + video + text with 128K context. 90.8% on ComplexFuncBench Audio. Powers Gemini Live and Search Live.
Claude Opus 4.5 Hits 80.9% on SWE-bench Verified
Anthropic's flagship model leads the software engineering benchmark, resolving 4 out of 5 real GitHub issues. HumanEval at 97.8% with Opus 4.6. Agent scaffolding is now the differentiator.
Gemini 3 Pro Dominates LiveCodeBench at 91.7%
Google's latest model crushes competitive programming benchmarks. LiveCodeBench Pro Elo: 2887. DeepSeek V3.2 Speciale follows at 89.6%. The Gemini advantage is algorithmic reasoning.
MiniMax M2.1: The New SWE-bench Leader at 90% Lower Cost
229B-parameter MoE model achieves 74.0% on SWE-bench Verified, beating Claude Sonnet 4.5 while costing $0.30/1M tokens. Technical analysis of the cost-efficiency champion.
GLM-4.7: 95.7% on AIME 2025 - Math Reasoning Breakthrough
Zhipu AI's 358B MoE model sets new records in mathematical reasoning, surpassing GPT-5.1 High on AIME. Interleaved thinking and MIT license.
TRELLIS.2: Production-Ready 3D Assets in 3 Seconds
Microsoft Research's 4B-parameter model generates game-ready PBR assets from single images. O-Voxel architecture enables 1536x resolution.
Recent
GPT-5 Leads Aider Polyglot at 88% — Real-World Coding Benchmark
OpenAI's GPT-5 with high reasoning tops the Aider coding benchmark, followed by o3-pro (84.9%) and Gemini 2.5 Pro (83.1%). Claude Sonnet 4 disappoints at 61%.
DeepSeek V3.2 Speciale: Open-Source Model at 89.6% LiveCodeBench
DeepSeek's latest open model closes the gap with proprietary leaders. V3.1-Think hits 66% on SWE-bench. The open-source code generation frontier advances rapidly.
Google Chirp 3 HD: Instant Voice Cloning in 31 Languages
8 distinct voice personalities, real-time streaming, and voice cloning from short samples. GA on Vertex AI. The TTS landscape is fragmenting between LLM-native and dedicated models.
Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro
OpenAI stops reporting Verified scores, citing contamination concerns. Agent scaffolding inflates scores (81% with agents vs 69% standalone). The benchmark wars heat up.
Kimi K2: Dark Horse Hits 94.5% HumanEval
Moonshot AI's Kimi K2 0905 quietly reaches the second-best HumanEval score, behind only Claude Opus 4.6. The Chinese AI lab arms race continues on coding benchmarks.
Gemini 2.5 Pro TTS: LLM-Native Speech at 4.7 MOS
30 speakers, 80+ locales, prompt-controlled emotion and style. Google's bet: TTS should be a capability of the LLM, not a separate model. Flash variant optimized for real-time.
Tencent HY-MT1.5: Translation Model Beats Google by 15-65%
1.8B-parameter model from the WMT2025 winner achieves near-Gemini-3.0-Pro performance while running on smartphones. Supports 33 languages.
LiquidAI LFM2-2.6B: Edge Model Beats 680B DeepSeek R1
2.6B dense model trained with pure RL surpasses models 263x larger on instruction-following. Hybrid convolution-attention architecture enables phone deployment.
Wan2.2 Animate: First Open-Source MoE Video Model
Alibaba's 14B MoE model combines motion transfer and character animation. 720p at 24fps for ~$0.40 per 5s clip vs $2 for Veo 3.
Z-Image-Turbo: FLUX-Quality Images on 16GB GPUs
Alibaba's 6B distilled model achieves near-FLUX quality in 8 steps on consumer hardware. Apache 2.0 license enables commercial use.
Stay ahead of model releases
We cover what matters: benchmarks, not press releases.
About our coverage
CodeSOTA tracks trending AI models from Hugging Face, arXiv, and major conferences. Our analysis focuses on verified benchmark results, not marketing claims.
Each article includes technical specifications, benchmark comparisons, deployment requirements, and practical recommendations for different use cases.