RL-environment indexSOTA registryOpen JSONDaily arXiv scan

The source of record for RL environments & SOTA · snapshot 2026-04-27

The data terminal for
RL environments & SOTA-with-code.

One aggregated, dated, source-tiered registry of the evals, RL environments, models, and papers that move the frontier — the place AI labs check to see which environments actually separate models, and the team that builds the verifiable-reward environments that do.

163

Models

371

Benchmarks

9,102

Results

RL envs

Capabilities

121

Tasks

For AI labs: RL environments →RLVR data we build /api/sotaOpen JSON · no key · citable

Live standings2026-04-27

RL environments · by discriminative power

Environment	Spread	DP
Terminal-Bench 2.0	87%	0.86
OSWorld-Verified	81%	0.81
DeepSWE	65%	0.65
FrontierSWE	59%	0.59
GBA Eval	53%	0.53

SOTA registry · current frontier

Benchmark	Leader	Score
OCRBench v2	ovis2-5-9b	63.40
SWE-bench Verified	Claude Opus 4.7	87.6%

Open the RL-environment terminal →

§ 01 · Start here

Three ways in. All backed by the same evidence.

Whether you are choosing what to cite, deciding what to train on, or tracking the frontier, you land on the same dated, source-tiered registry underneath.

Pick an eval01

Browse benchmarks by what they prove.

Every eval with status, saturation, and lift evidence — so you cite numbers that still separate models, not dead ones.

Browse benchmarks →

Improve a capability02

Get the environments that lift it.

Pick a capability gap; get RL environments and datasets ranked by discriminative power — the ones most likely to move a strong model.

Open the recommender →

See what’s new03

The frontier feed.

Latest evals, environments, models, and papers — chronological, dated, and linked back to the registry rows they touch.

Open the feed →

§ 02 · Current frontier

One representative result per capability area.

A compact snapshot of the nine capability areas. The homepage only prints a model when the registry row is verified and has an inspectable source URL; otherwise the row stays pending instead of promoting a stale or weak claim.

Results: 9,102
Models tracked: 163
Datasets indexed: 371
Capability areas: 9

API schema →

Representative evidence · inspect the task page

.jsonopen

Capability	Evidence	Trusted model	Metric	Score	Source	Snapshot
Language & Knowledge	MMLU-Pro	Pending audit	accuracy	Pending	pending source	2026-04-27
Vision & Documents	OCRBench v2	ovis2-5-9b	overall	63.40	paperswithcode-public-api	2026-05-18
Audio & Speech	WildASR	Pending audit	WER (lower)	Pending	pending source	2026-04-27
Multimodal Media	VQA-v2	Pending audit	accuracy	Pending	pending source	2026-04-27
Code & Software Engineering	SWE-bench Verified	Claude Opus 4.7	resolve rate	87.6%	vendor	2026-04-23
Agents & Tool Use	GAIA	Pending audit	accuracy	Pending	pending source	2026-04-27
Structured Data & Forecasting	MTEB	Pending audit	avg	Pending	pending source	2026-04-27
Robotics, Control & RL	Atari 2600	Pending audit	human-normalized score	Pending	pending source	2026-04-27
Science, Medicine & Industry	MVTec-AD	Pending audit	score	Pending	pending source	2026-04-27

Fig 2 · Each row shows one representative benchmark for orientation, not proof that a whole capability has one canonical test. Unverified rows, missing sources, and malformed source links are deliberately withheld from the first-page model claim. Scores are drawn from the open JSON at /data/benchmarks.json.

API mirror

curl https://www.codesota.com/api/sota/swe-bench

Docs

Cited & referenced by

Researchers and analysts
cite the registry.

Univ. of Surrey · AAAI 2026Tomasz Tunguz · Theory VenturesUseAIAPIAlternativeToHacker Newsr/MachineLearning

See all citations →

§ 03 · Capability map

Practical routes. Benchmarks as evidence.

The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.

Route

Language & Knowledge

Reasoning, exams, retrieval, and knowledge-heavy language tasks.

MMLU-Pro · GPQA · MTEB

Route63.40

Vision & Documents

Images, detection, OCR, layout, tables, and document parsing.

COCO · OCRBench · OmniDocBench

Route

Audio & Speech

ASR, audio tagging, voice assistants, speech quality, and TTS.

WildASR · VoiceBench · ESC-50

Route

Multimodal Media

VQA, charts, video, image-text reasoning, and media understanding.

VQA-v2 · TextVQA · MMMU

Route87.6%

Code & Software Engineering

Code generation, repair, repository tasks, and verified software work.

HumanEval · LiveCodeBench · SWE-bench

Route87.6%

Agents & Tool Use

Long-horizon tool use, browser work, OS tasks, and workflow execution.

GAIA · WebArena · OSWorld

Route

Structured Data & Forecasting

Embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.

MTEB · tabular · graph suites

Route

Robotics, Control & RL

Simulation, control, games, embodied agents, and manipulation.

Atari · Habitat · LIBERO

Route

Science, Medicine & Industry

Scientific QA, medical imaging, industrial inspection, and applied AI.

CheXpert · MVTec-AD · MedQA

Fig 4 · Each tile is a route into the registry. Detailed pages can still separate capability, modality, domain, benchmark role, and trust flags without forcing all of that into the homepage.

§ 04 · Trust layer

A leaderboard row is not a fact until it can be inspected.

CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.

Dated scores

Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.

Metric direction

Every benchmark declares whether higher or lower is better before a winner is selected.

Source tiers

Paper, vendor, reproduced, and registry-maintained rows are labeled separately.

Provenance trail

Benchmark pages connect model, paper, dataset, code, and source URL where available.

§ 04b · The RL-environment market

The neutral measure for the environments labs train on.

A market of RL-environment startups is selling to frontier labs — and the labs keep asking the same question: does this environment actually separate models, or is it saturated? CodeSOTA is the independent party that answers it. If you build environments, we certify yours discriminates. If you train models, we tell you which ones are worth the run.

Neutral third partySame lens on every envReproducible scoringLab-grade evidence

See the ranking →Get your environment measured

§ 05 · API

The registry is callable.

Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.

Public JSONSnapshot IDsCORS-openTask-first

API docs

codesota/sota-api

curl https://www.codesota.com/api/sota/swe-bench
curl https://www.codesota.com/api/sota?area=vision-documents

{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}

§ 10 · Register

Trained something
that beats the table?

Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.

Submit a score ↵Read the methodology

The data terminal forRL environments & SOTA-with-code.

Three ways in. All backed by the same evidence.

Browse benchmarks by what they prove.

Get the environments that lift it.

The frontier feed.

One representative result per capability area.

Practical routes. Benchmarks as evidence.

A leaderboard row is not a fact until it can be inspected.

Dated scores

Metric direction

Source tiers

Provenance trail

The neutral measure for the environments labs train on.

The registry is callable.

Trained somethingthat beats the table?

The data terminal for
RL environments & SOTA-with-code.

Trained something
that beats the table?