Pick a task.
Choose the best model.
CodeSOTA starts from the job you need done, then maps it to tasks, benchmarks, model pages, papers, and dated sources. Benchmarks are evidence, not the product: the product is deciding which model to trust for the work in front of you.
Use search when you already have a model, benchmark, paper, dataset, or a short problem description in mind. Use the modality map when you would rather browse from a capability area.
Describe the task before you trust a leaderboard.
Query models, tasks, benchmarks, papers, and datasets in one place. Good results should point to typed registry objects and dated sources, not a flattened marketing rank.
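As a rough sketch, a typed search hit might look like the following. The class and field names here are illustrative assumptions, not the actual registry schema:

```python
from dataclasses import dataclass

# Illustrative only: these type and field names are assumptions,
# not the real CodeSOTA response schema.
@dataclass
class SearchHit:
    kind: str        # "model" | "task" | "benchmark" | "paper" | "dataset"
    slug: str        # stable registry identifier, e.g. "swe-bench"
    title: str
    source_url: str  # inspectable provenance, not a flattened rank
    accessed: str    # ISO date the source was last checked
```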
One representative result per capability area.
A compact snapshot of the nine capability areas. The homepage only prints a model when the registry row is verified and has an inspectable source URL; otherwise the row stays pending instead of promoting a stale or weak claim.
- Results: 9,102
- Models tracked: 163
- Datasets indexed: 371
- Capability areas: 9
| Capability | Evidence | Trusted model | Metric | Score | Source | Snapshot |
|---|---|---|---|---|---|---|
| Language & Knowledge | MMLU-Pro | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Vision & Documents | OCRBench v2 | Pending audit | overall | Pending | pending source | 2026-04-27 |
| Audio & Speech | WildASR | Pending audit | WER (lower) | Pending | pending source | 2026-04-27 |
| Multimodal Media | VQA-v2 | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23 |
| Agents & Tool Use | GAIA | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Structured Data & Forecasting | MTEB | Pending audit | avg | Pending | pending source | 2026-04-27 |
| Robotics, Control & RL | Atari 2600 | Pending audit | human-normalized score | Pending | pending source | 2026-04-27 |
| Science, Medicine & Industry | MVTec-AD | Pending audit | score | Pending | pending source | 2026-04-27 |
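The gating rule behind the Pending rows is simple enough to sketch. A minimal illustration, assuming each registry row carries a verification flag and a source URL; the names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegistryRow:
    model: str
    score: Optional[float]
    verified: bool
    source_url: Optional[str]  # must be inspectable to print

def homepage_cell(row: RegistryRow) -> str:
    # Print a model only when the row is verified AND its source can
    # be inspected; otherwise stay pending rather than promote a
    # stale or weak claim.
    if row.verified and row.source_url:
        return f"{row.model} ({row.score})"
    return "Pending audit"
```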
Practical routes. Benchmarks as evidence.
The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.
- Language & Knowledge: Reasoning, exams, retrieval, and knowledge-heavy language tasks.
- Vision & Documents: Images, detection, OCR, layout, tables, and document parsing.
- Audio & Speech: ASR, audio tagging, voice assistants, speech quality, and TTS.
- Multimodal Media: VQA, charts, video, image-text reasoning, and media understanding.
- Code & Software Engineering: Code generation, repair, repository tasks, and verified software work.
- Agents & Tool Use: Long-horizon tool use, browser work, OS tasks, and workflow execution.
- Structured Data & Forecasting: Embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.
- Robotics, Control & RL: Simulation, control, games, embodied agents, and manipulation.
- Science, Medicine & Industry: Scientific QA, medical imaging, industrial inspection, and applied AI.
A leaderboard row is not a fact until it can be inspected.
CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.
- Dated scores: Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.
- Metric direction: Every benchmark declares whether higher or lower is better before a winner is selected.
- Source tiers: Paper, vendor, reproduced, and registry-maintained rows are labeled separately.
- Provenance trail: Benchmark pages connect model, paper, dataset, code, and source URL where available.
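Taken together, the contract is small enough to sketch in code. The tier names and the higher/lower rule come from the list above; everything else is an illustrative assumption:

```python
from dataclasses import dataclass

SOURCE_TIERS = ("paper", "vendor", "reproduced", "registry-maintained")

@dataclass
class EvidenceRow:
    model: str
    score: float
    direction: str    # "higher" or "lower" -- declared by the benchmark
    source_tier: str  # one of SOURCE_TIERS
    accessed: str     # dated score: when this row was checked
    snapshot_id: str

def better(a: EvidenceRow, b: EvidenceRow) -> EvidenceRow:
    # Direction is resolved before a winner is selected: WER-style
    # metrics pick the lower score, accuracy-style the higher.
    if a.direction == "lower":
        return a if a.score < b.score else b
    return a if a.score > b.score else b
```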
The registry is callable.
Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.
```
curl https://www.codesota.com/api/sota/swe-bench
curl https://www.codesota.com/api/sota?area=vision-documents
```

```json
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
```
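A minimal client sketch against that endpoint, caching by snapshot_id so a rerun cites the same dated evidence. It assumes the requests library; the cache layout is illustrative:

```python
import json
import pathlib

import requests

BASE = "https://www.codesota.com/api/sota"
CACHE = pathlib.Path(".codesota-cache")

def sota(task: str) -> dict:
    """Fetch the current leader for a task and cache the snapshot used."""
    payload = requests.get(f"{BASE}/{task}", timeout=10).json()
    CACHE.mkdir(exist_ok=True)
    # Key the cache by snapshot_id so reruns cite the same dated
    # evidence instead of silently picking up a newer leaderboard.
    snapshot = payload["leader"]["snapshot_id"]
    (CACHE / f"{task}-{snapshot}.json").write_text(json.dumps(payload, indent=2))
    return payload

leader = sota("swe-bench")["leader"]
print(leader["model"], leader["score"], f"({leader['source']})")
```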
Trained something that beats the table?
Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
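For illustration, a submission payload might carry the same fields the registry already tracks. This shape is an assumption, not the documented format:

```python
# Hypothetical submission payload; the real format lives in the API docs.
submission = {
    "task": "swe-bench",
    "model": "your-model-name",
    "metric": "resolve rate",
    "score": 0.0,                 # the claimed value
    "source": "paper",            # paper | vendor | reproduced
    "source_url": "https://...",  # inspectable evidence, required
    "accessed": "2026-04-27",     # date the source was checked
}
```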