121 tasks by modality · 9,102 benchmark results

Pick a task.
Choose the best model.

CodeSOTA starts from the job you need done, then maps it to tasks, benchmarks, model pages, papers, and dated sources. Benchmarks are evidence, not the product: the product is deciding which model to trust for the work in front of you.

Use search when you know a model, benchmark, paper, dataset, or short problem description. Use the modality map when you want to browse from a capability area.

§ 01 · Registry search

Describe the task before you trust a leaderboard.

Query models, tasks, benchmarks, papers, and datasets in one place. Good results should point to typed registry objects and dated sources, not a flattened marketing rank.

Example queries
§ 02 · Current frontier

One representative result per capability area.

A compact snapshot of the nine capability areas. The homepage only prints a model when the registry row is verified and has an inspectable source URL; otherwise the row stays pending instead of promoting a stale or weak claim.


Results · 9,102
Models tracked · 163
Datasets indexed · 371
Capability areas · 9
API schema →
Representative evidence · inspect the task page
Capability | Evidence | Trusted model | Metric | Score | Source | Snapshot
Language & Knowledge | MMLU-Pro | Pending audit | accuracy | Pending | pending source | 2026-04-27
Vision & Documents | OCRBench v2 | Pending audit | overall | Pending | pending source | 2026-04-27
Audio & Speech | WildASR | Pending audit | WER (lower) | Pending | pending source | 2026-04-27
Multimodal Media | VQA-v2 | Pending audit | accuracy | Pending | pending source | 2026-04-27
Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23
Agents & Tool Use | GAIA | Pending audit | accuracy | Pending | pending source | 2026-04-27
Structured Data & Forecasting | MTEB | Pending audit | avg | Pending | pending source | 2026-04-27
Robotics, Control & RL | Atari 2600 | Pending audit | human-normalized score | Pending | pending source | 2026-04-27
Science, Medicine & Industry | MVTec-AD | Pending audit | score | Pending | pending source | 2026-04-27
Fig 2 · Each row shows one representative benchmark for orientation, not proof that a whole capability has a single canonical test. Rows that are unverified, lack a source, or carry a malformed source link are deliberately withheld from the homepage model claim. Scores are drawn from the open JSON at /data/benchmarks.json.
API mirror
curl https://www.codesota.com/api/sota/swe-bench
Docs
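The homepage rule described above (only print a model when the row is verified and its source URL is inspectable) can be sketched as a small filter. This is a hypothetical sketch: the field names `verified` and `source_url`, and the sample rows, are assumptions for illustration, not the real schema of /data/benchmarks.json.

```python
from urllib.parse import urlparse

def is_printable(row: dict) -> bool:
    """Homepage rule sketch: a row may name a model only if it is
    verified AND carries a well-formed, inspectable source URL.
    Field names here are assumed, not the real registry schema."""
    if not row.get("verified"):
        return False
    url = urlparse(row.get("source_url", ""))
    return url.scheme in ("http", "https") and bool(url.netloc)

# Hypothetical rows mirroring the snapshot table above.
rows = [
    {"model": "Claude Opus 4.7", "verified": True,
     "source_url": "https://example.com/swe-bench-verified"},
    {"model": "unaudited entry", "verified": False, "source_url": ""},
    {"model": "bad link entry", "verified": True, "source_url": "not-a-url"},
]

printable = [r["model"] for r in rows if is_printable(r)]
print(printable)  # only the verified row with a well-formed source survives
```

Everything else stays "Pending", exactly as the snapshot table shows, rather than promoting a stale or weak claim.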
§ 03 · Capability map

Practical routes. Benchmarks as evidence.

The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.

Route
Language & Knowledge

Reasoning, exams, retrieval, and knowledge-heavy language tasks.

MMLU-Pro · GPQA · MTEB
Route
Vision & Documents

Images, detection, OCR, layout, tables, and document parsing.

COCO · OCRBench · OmniDocBench
Route
Audio & Speech

ASR, audio tagging, voice assistants, speech quality, and TTS.

WildASR · VoiceBench · ESC-50
Route
Multimodal Media

VQA, charts, video, image-text reasoning, and media understanding.

VQA-v2 · TextVQA · MMMU
Route · 87.6%
Code & Software Engineering

Code generation, repair, repository tasks, and verified software work.

HumanEval · LiveCodeBench · SWE-bench
Route · 87.6%
Agents & Tool Use

Long-horizon tool use, browser work, OS tasks, and workflow execution.

GAIA · WebArena · OSWorld
Route
Structured Data & Forecasting

Embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.

MTEB · tabular · graph suites
Route
Robotics, Control & RL

Simulation, control, games, embodied agents, and manipulation.

Atari · Habitat · LIBERO
Route
Science, Medicine & Industry

Scientific QA, medical imaging, industrial inspection, and applied AI.

CheXpert · MVTec-AD · MedQA
Fig 4 · Each tile is a route into the registry. Detailed pages can still separate capability, modality, domain, benchmark role, and trust flags without forcing all of that into the homepage.
§ 04 · Trust layer

A leaderboard row is not a fact until it can be inspected.

CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.

01

Dated scores

Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.

02

Metric direction

Every benchmark declares whether higher or lower is better before a winner is selected.

03

Source tiers

Paper, vendor, reproduced, and registry-maintained rows are labeled separately.

04

Provenance trail

Benchmark pages connect model, paper, dataset, code, and source URL where available.
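Rule 02 above (declare metric direction before selecting a winner) is easy to get wrong: a WER leaderboard sorted "highest first" crowns the worst model. A minimal sketch of direction-aware leader selection, with invented row data for illustration:

```python
def pick_leader(rows: list[dict], direction: str) -> dict:
    """Select a leader only after the benchmark has declared its
    metric direction; 'lower' covers error-style metrics like WER.
    Refuse to pick a winner when no direction is declared."""
    if direction not in ("higher", "lower"):
        raise ValueError("benchmark must declare a metric direction first")
    key = lambda r: r["score"]
    return min(rows, key=key) if direction == "lower" else max(rows, key=key)

# Hypothetical ASR rows: for WER, the lower score is the better model.
asr_rows = [{"model": "model-a", "score": 7.1},
            {"model": "model-b", "score": 5.4}]
print(pick_leader(asr_rows, "lower")["model"])
```

Flipping `direction` to `"higher"` would pick model-a instead, which is exactly the silent inversion the trust layer guards against.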

§ 05 · API

The registry is callable.

Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.

API docs
curl https://www.codesota.com/api/sota/swe-bench
curl https://www.codesota.com/api/sota?area=vision-documents
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
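A client consuming this payload should key its cache on the task plus `snapshot_id`, so later analysis can cite exactly which snapshot it used rather than re-scraping. A minimal sketch, assuming only the response shape shown above (the in-memory dict stands in for whatever cache the caller actually uses):

```python
import json

# The sample payload from the API section above.
payload = """{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}"""

cache: dict[tuple[str, str], dict] = {}

def record_snapshot(raw: str) -> dict:
    """Parse one /api/sota response and cache it under
    (task, snapshot_id), preserving the dated evidence trail."""
    doc = json.loads(raw)
    cache[(doc["task"], doc["leader"]["snapshot_id"])] = doc
    return doc

doc = record_snapshot(payload)
print(sorted(cache))  # [('swe-bench', '2026-04-27')]
```

Re-running the same query against a newer snapshot adds a second key instead of overwriting the first, which keeps old frontier claims distinguishable from current ones.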
§ 10 · Register

Trained something that beats the table?

Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.