Codesota · Ontology
Areas → tasks → benchmarks → results
Issue: April 27, 2026
Editorial · Ontology

How Codesota models evidence.

Codesota is not just a list of benchmarks. It is an ontology for machine-learning evidence: capability areas contain tasks, tasks contain benchmark protocols, and benchmarks define datasets, metrics, splits, prompts, constraints, and aggregation rules. Every result points back to a paper, vendor report, reproduction, or correction.

This page explains the object model behind the website, the score submission flow, and the /api/sota endpoint.

§ 01 · Entities

The nouns in the registry.

Each entity exists to prevent a common benchmark error: mixing tasks with datasets, treating datasets as complete benchmark protocols, ranking numbers without metric direction, or quoting a score with no source.

Tasks

Capability area

id shape: vision-documents

A stable top-level capability group. Areas keep navigation readable without mixing modalities, methods, domains, and benchmark families.

Example · Language & Knowledge, Vision & Documents, Agents & Tool Use

Benchmarks

Task

id shape: document-ocr

The capability being evaluated. Tasks are stable enough for APIs and aliases, even when individual benchmarks age out.

Example · Document OCR, code generation, visual question answering

Protocol

Benchmark

id shape: omnidocbench

The evaluation protocol: one or more datasets, splits, metrics, prompts, constraints, aggregation rules, and versioned scoring instructions.

Example · OmniDocBench, LiveCodeBench, SWE-bench Verified, MTEB

Examples and splits

Dataset

id shape: kitab-bench-data

The raw or curated evaluation data used by a benchmark. A dataset can be reused by several benchmarks, and a benchmark can combine several datasets.

Example · ImageNet, KITAB, DocVQA, MMLU-Pro, COCO

Direction and unit

Metric

id shape: pass@1

The scoring function and comparison direction. Metrics can be generic, task-specific, or benchmark-specific.

Example · pass@1, resolve rate, CER, mAP, exact match, MTEB avg score

Result rows

Model

id shape: paddleocr-vl-1.5

A model, checkpoint, API release, agent scaffold, or system being evaluated. Models are canonicalized before rows are ranked.

Example · GPT-5, Claude Opus 4.7, PaddleOCR-VL, Qwen3

Provenance

Result

id shape: benchmark + model + metric + date

The atomic fact in Codesota: one model or system, one benchmark protocol, one metric value, one date, one source trail.

Example · Claude Opus 4.7 on SWE-bench Verified, resolve rate, dated source

Evidence

Paper / source

id shape: arxiv or vendor URL

The citation, reproduction package, vendor page, benchmark report, or correction note that justifies a result row.

Example · arXiv paper, GitHub reproduction, official leaderboard, vendor report
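Rendered as code, these nouns are small records. A minimal sketch in Python; the field names (task_id, higher_is_better, source_url) are assumptions drawn from the cards above, not the registry's actual schema:

from dataclasses import dataclass

@dataclass
class Metric:
    id: str                        # e.g. "pass@1"
    higher_is_better: bool | None  # comparison direction; None until a reviewer sets it
    unit: str = ""                 # optional unit

@dataclass
class Benchmark:
    id: str                        # e.g. "omnidocbench"
    task_id: str                   # owning task, e.g. "document-ocr"
    dataset_ids: list[str]         # a benchmark can combine several datasets
    metric_ids: list[str]          # metrics the protocol defines
    protocol_version: str          # versioned scoring instructions

@dataclass
class Result:
    benchmark_id: str              # one benchmark protocol
    model_id: str                  # one canonicalized model or system
    metric_id: str                 # one metric
    value: float                   # one metric value
    date: str                      # one date (ISO 8601)
    source_url: str                # one source trail: paper, vendor page, reproduction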

§ 02 · Relationships

The graph is small on purpose.

The core hierarchy is strict. Evidence objects attach to it. That keeps pages, APIs, and contribution review aligned.

Capability area → has many → Tasks
Task → has many → Benchmarks
Benchmark → uses → Datasets
Benchmark → defines → Metrics and protocol
Benchmark → has many → Results
Model → has many → Results
Paper / source → supports → Models and results
Result → selects → SOTA pick per benchmark
Lineage → orders → Benchmarks over time
Submission → proposes → New or corrected results
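Because the hierarchy is strict, the legal edges fit in one literal. A sketch of the table above as a closed lookup; the lowercase labels are assumptions, not registry identifiers:

EDGES = {
    ("area", "has many"): {"task"},
    ("task", "has many"): {"benchmark"},
    ("benchmark", "uses"): {"dataset"},
    ("benchmark", "defines"): {"metric"},
    ("benchmark", "has many"): {"result"},
    ("model", "has many"): {"result"},
    ("source", "supports"): {"model", "result"},
    ("result", "selects"): {"sota pick"},
    ("lineage", "orders"): {"benchmark"},
    ("submission", "proposes"): {"result"},
}

def edge_ok(subject: str, relation: str, obj: str) -> bool:
    # True only for triples in the closed table; review rejects everything else.
    return obj in EDGES.get((subject, relation), set())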
§ 03 · Lifecycle

From claim to registry row.

The ontology is what lets Codesota accept messy real-world evidence while publishing stable, inspectable objects.

01

Extract

A paper, leaderboard, vendor post, or community submission names models, benchmarks, datasets, metrics, protocols, and scores.

02

Canonicalize

Messy names are mapped to existing IDs: benchmark aliases resolve to protocols, dataset aliases resolve to data objects, models are de-duplicated, and task ownership is checked.
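In code, canonicalization is mostly alias tables. A hypothetical sketch; the alias entries and the resolver below are illustrative, not Codesota's real tables:

# Hypothetical alias tables: messy incoming names -> canonical registry ids.
BENCHMARK_ALIASES = {
    "omni doc bench": "omnidocbench",
    "swe bench verified": "swe-bench-verified",
}
MODEL_ALIASES = {
    "paddleocr vl 1.5": "paddleocr-vl-1.5",
}

def canonicalize(kind: str, raw: str) -> str:
    # Map a messy incoming name to an existing id, or fail loudly for human review.
    table = {"benchmark": BENCHMARK_ALIASES, "model": MODEL_ALIASES}[kind]
    key = " ".join(raw.lower().replace("-", " ").split())
    if key not in table:
        raise KeyError(f"unmapped {kind} name {raw!r}: needs a new alias or a new id")
    return table[key]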

03

Validate

Metric direction, result date, source type, hidden status, protocol version, and benchmark comparability are reviewed before a row can influence a leaderboard.
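As a gate, validation is a checklist that collects reasons rather than failing on the first one. A sketch covering a subset of the checks above (hidden status and comparability omitted), reusing the records from § 01:

def validation_problems(row: Result, metric: Metric, benchmark: Benchmark) -> list[str]:
    # Collect every reason this row may not influence a leaderboard yet.
    problems = []
    if metric.higher_is_better is None:
        problems.append("metric direction not set")
    if not row.date:
        problems.append("result date missing")
    if not row.source_url:
        problems.append("no source trail")
    if row.metric_id not in benchmark.metric_ids:
        problems.append("metric not defined by this benchmark protocol")
    return problems  # empty list: the row may enter ranking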

04

Rank

The best valid row becomes the current SOTA pick for a benchmark or task. /api/sota exposes that pick with runners-up and provenance.
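Ranking itself then needs only the metric's direction. A sketch, again reusing the § 01 records:

def rank(rows: list[Result], metric: Metric) -> list[Result]:
    # rows[0] is the current pick; the rest are the runners-up /api/sota returns.
    # Direction comes from the metric object, never guessed from the numbers.
    return sorted(rows, key=lambda r: r.value, reverse=bool(metric.higher_is_better))

# pick, *runners_up = rank(valid_rows, metric)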

05

Revise

Corrections, contamination flags, lineage changes, and new benchmark successors are appended rather than silently rewriting history.
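Appending rather than rewriting is the simplest storage rule that keeps history inspectable. A sketch, assuming a hypothetical newline-delimited event log; the ids and reasons are illustrative:

import json, time

def append_event(log_path: str, kind: str, payload: dict) -> None:
    # Corrections are new events that point at old rows; nothing is edited in place.
    event = {"ts": time.time(), "kind": kind, **payload}
    with open(log_path, "a") as log:  # append-only: history stays inspectable
        log.write(json.dumps(event) + "\n")

# A contamination flag supersedes a row by pointing at it, not by overwriting it.
append_event("registry.log", "contamination-flag",
             {"result_id": "hypothetical-result-id", "reason": "train/test overlap reported"})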

§ 04 · Live snapshot

The hierarchy in practice.

These are the highest-coverage public areas right now, with a few tasks and datasets shown under each. Until the schema migration lands, these benchmark protocols are read from legacy dataset rows. The full browser remains at /browse.

§ 05 · API contract

Why ontology matters to callers.

The API can be simple because the ontology is explicit. A task alias resolves to a canonical task; the task points to benchmark protocols; each benchmark defines datasets and metrics; result rows decide the pick.

Example response shape
GET /api/sota/ocr

task           -> document-ocr
benchmark      -> omnidocbench
datasets       -> [omnidocbench-pages, docvqa-tables, formula-splits]
metric         -> edit / layout / table / formula aggregate
pick.model_id  -> paddleocr-vl-1.5
pick.score     -> numeric metric value
source_url     -> evidence trail
snapshot_id    -> cache key for this registry state
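From the caller's side the contract is one GET plus a cache key. A sketch using Python's standard library; the host is a placeholder, and only the path and field names come from the shape above:

import json
from urllib.request import urlopen

# Placeholder host; the response fields mirror the shape shown above.
with urlopen("https://codesota.example/api/sota/ocr") as resp:
    sota = json.load(resp)

print(sota["task"])              # "document-ocr"
print(sota["pick"]["model_id"])  # e.g. "paddleocr-vl-1.5"
print(sota["snapshot_id"])       # same id => same registry state, safe to cache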