News
Technical deep-dives into trending AI models. Benchmark analysis, architecture breakdowns, practical recommendations.
Featured
Gemini 3.1 Flash Live: Real-Time Multimodal Voice for AI Agents
Google's real-time voice model handles audio + images + video + text with 128K context. 90.8% on ComplexFuncBench Audio. Powers Gemini Live and Search Live.
Claude Opus 4.5 Hits 80.9% on SWE-bench Verified
Anthropic's flagship model leads the software engineering benchmark, resolving 4 out of 5 real GitHub issues. HumanEval at 97.8% with Opus 4.6. Agent scaffolding is now the differentiator.
Gemini 3 Pro Dominates LiveCodeBench at 91.7%
Google's latest model crushes competitive programming benchmarks. LiveCodeBench Pro Elo: 2887. DeepSeek V3.2 Speciale follows at 89.6%. The Gemini advantage is algorithmic reasoning.
MiniMax M2.1: The New SWE-bench Leader at 90% Lower Cost
229B-parameter MoE model achieves 74.0% on SWE-bench Verified, beating Claude Sonnet 4.5 while costing $0.30/1M tokens. Technical analysis of the cost-efficiency champion.
GLM-4.7: 95.7% on AIME 2025 - Math Reasoning Breakthrough
Zhipu AI's 358B MoE model sets new records in mathematical reasoning, surpassing GPT-5.1 High on AIME. Interleaved thinking and MIT license.
TRELLIS.2: Production-Ready 3D Assets in 3 Seconds
Microsoft Research's 4B-parameter model generates game-ready PBR assets from single images. O-Voxel architecture enables 1536x resolution.
Recent
GPT-5 Leads Aider Polyglot at 88% — Real-World Coding Benchmark
OpenAI's GPT-5 with high reasoning tops the Aider coding benchmark, followed by o3-pro (84.9%) and Gemini 2.5 Pro (83.1%). Claude Sonnet 4 disappoints at 61%.
DeepSeek V3.2 Speciale: Open-Source Model at 89.6% LiveCodeBench
DeepSeek's latest open model closes the gap with proprietary leaders. V3.1-Think hits 66% on SWE-bench. The open-source code generation frontier advances rapidly.
Google Chirp 3 HD: Instant Voice Cloning in 31 Languages
8 distinct voice personalities, real-time streaming, and voice cloning from short samples. GA on Vertex AI. The TTS landscape is fragmenting between LLM-native and dedicated models.
Is SWE-bench Verified Contaminated? OpenAI Shifts to SWE-bench Pro
OpenAI stops reporting Verified scores, citing contamination concerns. Agent scaffolding inflates scores (81% with agents vs 69% standalone). The benchmark wars heat up.
Kimi K2: Dark Horse Hits 94.5% HumanEval
Moonshot AI's Kimi K2 0905 quietly reaches the second-best HumanEval score, behind only Claude Opus 4.6. The Chinese AI lab arms race continues on coding benchmarks.
Gemini 2.5 Pro TTS: LLM-Native Speech at 4.7 MOS
30 speakers, 80+ locales, prompt-controlled emotion and style. Google's bet: TTS should be a capability of the LLM, not a separate model. Flash variant optimized for real-time.
Tencent HY-MT1.5: Translation Model Beats Google by 15-65%
1.8B-parameter model from the WMT2025 winner achieves near-Gemini-3.0-Pro performance while running on smartphones. Supports 33 languages.
LiquidAI LFM2-2.6B: Edge Model Beats 680B DeepSeek R1
2.6B dense model trained with pure RL surpasses models 263x larger on instruction-following. Hybrid convolution-attention architecture enables phone deployment.
Wan2.2 Animate: First Open-Source MoE Video Model
Alibaba's 14B MoE model combines motion transfer and character animation. 720p at 24fps for ~$0.40 per 5s clip vs $2 for Veo 3.
Z-Image-Turbo: FLUX-Quality Images on 16GB GPUs
Alibaba's 6B distilled model achieves near-FLUX quality in 8 steps on consumer hardware. Apache 2.0 license enables commercial use.
Stay ahead of model releases
We cover what matters: benchmarks, not press releases.
About our coverage
CodeSOTA tracks trending AI models from Hugging Face, arXiv, and major conferences. Our analysis focuses on verified benchmark results, not marketing claims.
Each article includes technical specifications, benchmark comparisons, deployment requirements, and practical recommendations for different use cases.