Codesota · Guides

Editorial deep-dives on models, benchmarks & evaluation

29 guides · 7 sections

§ Guides

Long-form, dated.

The guides archive. Each piece takes a model family, a benchmark, or a method and treats it as a subject in its own right — with the same standard of evidence we apply to the registry itself. Grouped by what the guide is about, not by when it was written.

§ 01

Models & leaderboards

Comparative deep-dives on model families — what leads on which benchmark, and at what cost.

6 guides

Best AI code generation models compared →

Claude Opus 4, GPT-5, Gemini 2.5 Pro, DeepSeek-V3 and Qwen2.5-Coder on HumanEval, SWE-bench and LiveCodeBench — with pricing.

Best TTS models compared →

Vendor APIs and open-weight models split into separate tracks. Blind Elo first, then latency, cloning, VRAM and licence.

The state of multimodal AI →

What VLMs can actually do in 2026. GPT-5 vision, Claude Opus 4 and Gemini 2.5 Pro on MMMU, MathVista and video tasks.

Image segmentation: SAM 2 vs Mask2Former →

SAM 2, OneFormer, Mask2Former and SegGPT on ADE20K and COCO. Code, decision matrix, failure modes.

Time series: classical vs foundation models →

ARIMA, Prophet, PatchTST, TimesFM, Chronos and Moirai. When classical methods still win.

Graph neural networks: when and why →

GCN, GAT, GraphSAGE, GIN and GPS explained. OGB benchmarks, PyG code, real-world applications.

§ 02

Agents & coding

What benchmarks say about agent behaviour, and what actually happens when you run them.

9 guides

Agentic AI benchmarks explained →

SWE-bench, RE-bench, HCAST, WebArena, GAIA, OSWorld. What they measure, who wins, where scores diverge from reality.

SWE-bench explained: methodology & contamination →

How tasks are constructed, how scoring works, which variants exist, and why contamination matters. Updated April 2026.

Understanding Claude Code →

Build software by describing what you want in plain English. A visual guide to Claude Code for non-technical readers.

The new programmable layer: AI coding concepts →

Agents, prompts, context, MCP, LSP and hooks — the abstractions reshaping software development.

DSPy: programming language models →

Signatures, modules, optimizers and production patterns for programming (not prompting) LLMs.

Atropos: LLM reinforcement learning →

Nous Research's framework for training LLMs through diverse environments. 4.6× improvement on tool calling.

RAG vs fine-tuning vs long context →

A decision framework with cost analysis, benchmarks and code for retrieval, fine-tuning and million-token context.

The prompting framework tarpit →

Eight frameworks (RTF, TAG, RACE, …) benchmarked. None improved accuracy. What actually works.

Prompting frameworks (PL) →

Polish version: healthy scepticism towards RTF/TAG/RACE, grounded in measurements.

§ 03

Speech & audio

ASR and TTS compared on the numbers that matter: WER, MOS, latency and price per hour.

1 guide

Speech recognition: Whisper vs Gemini vs Deepgram →

WER on LibriSpeech, pricing per hour, latency, streaming support. An ASR model showdown.

§ 04

Documents & retrieval

OCR, invoice extraction and visual document retrieval — where vision LLMs are finally displacing specialist pipelines.

2 guides

Invoice processing with vision language models →

GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, Qwen3-VL on DocVQA and OmniDocBench. Pricing, code, production deployment.

IRPAPERS: visual document retrieval benchmark →

3,230 pages, 166 papers, 180 questions. Cohere Embed v4 leads at 58% Recall@1. Full text-vs-image results.

§ 05

Applied domains

Domain-specific playbooks — regulation, manufacturing, tracking — written for engineers who have to ship.

3 guides

Medical AI regulation cheat sheet →

FDA, EU MDR and MHRA pathways for developers. Risk classification, timelines, costs, common pitfalls.

Anomaly detection for manufacturing →

PatchCore, EfficientAD and AnomalyGPT on MVTec AD. Edge vs cloud deployment, ROI analysis.

Kalman filter for object tracking →

From state-estimation theory to production tracking. SORT, DeepSORT and ByteTrack with working code.

§ 06

Research method

How to read the field honestly — and why scale tends to win.

5 guides

How to read an ML paper (and why most benchmarks lie) →

The three-pass method, red flags in benchmarks, and a 20-point checklist for evaluating claims.

Token margins: what the 70× subscription giveaway reveals →

Three simulations of frontier inference economics — subscription arbitrage floors, first-principles serving cost, and why 33–40% blended margins imply ~90% API margins.

The bitter lesson: why compute wins →

Rich Sutton's 2019 insight. General methods leveraging computation beat human-engineered approaches.

Few-shot learning is dead, long live foundation models →

From Siamese nets to GPT-3: how foundation models absorbed few-shot learning, and the niches that remain.

RL from Atari to robotics: a visual timeline →

From DQN (2013) to physical world models (2026). Paradigm shifts, Atari SOTA, RL for LLMs.

§ 07

By reader

Longer-form pieces written for a specific audience — researcher, practitioner, buyer.

3 guides

ML research landscape 2025 →

Trends across 1,519 papers (2013–2025). Saturating fields, emerging benchmarks, reproducibility stats.

SOTA tracker guide →

How to track state-of-the-art across ML tasks. Current leaders in Scene Text Detection, Document Layout Analysis and more.

Document processing — executive brief →

CTO/CIO guide to document processing. 58+ OCR models, vendor comparison, build-vs-buy decisions.

§ Elsewhere

Read sideways.

Dated dispatches on model releases and benchmark shifts.

Methodology →

How scores are verified and how retractions are recorded.

The live registry. Every task, every score, every date.

Papers with Code →

What happened to the archived Meta registry — and the replacement.