How Codesota models evidence.
Codesota is not just a list of benchmarks. It is an ontology for machine-learning evidence: capability areas contain tasks; tasks contain benchmark protocols; benchmarks define datasets, metrics, splits, prompts, constraints, and aggregation rules; and every result points back to a paper, vendor report, reproduction, or correction.
This page explains the object model behind the website, the score submission flow, and the /api/sota endpoint.
The nouns in the registry.
Each entity exists to prevent a common benchmark error: mixing tasks with datasets, treating datasets as complete benchmark protocols, ranking numbers without metric direction, or quoting a score with no source.
Capability area
A stable top-level capability group. Areas keep navigation readable without mixing modalities, methods, domains, and benchmark families.
Example · Language & Knowledge, Vision & Documents, Agents & Tool Use
Task
The capability being evaluated. Tasks are stable enough for APIs and aliases, even when individual benchmarks age out.
Example · document OCR, code generation, visual question answering
Benchmark
The evaluation protocol: one or more datasets, splits, metrics, prompts, constraints, aggregation rules, and versioned scoring instructions.
Example · OmniDocBench, LiveCodeBench, SWE-bench Verified, MTEB
Dataset
The raw or curated evaluation data used by a benchmark. A dataset can be reused by several benchmarks, and a benchmark can combine several datasets.
Example · ImageNet, KITAB, DocVQA, MMLU-Pro, COCO
Metric
The scoring function and comparison direction. Metrics can be generic, task-specific, or benchmark-specific.
Example · pass@1, resolve rate, CER, mAP, exact match, MTEB avg score
Model
A model, checkpoint, API release, agent scaffold, or system being evaluated. Models are canonicalized before rows are ranked.
Example · GPT-5, Claude Opus 4.7, PaddleOCR-VL, Qwen3
Result
The atomic fact in Codesota: one model or system, one benchmark protocol, one metric value, one date, one source trail.
Example · Claude Opus 4.7 on SWE-bench Verified, resolve rate, dated source
Paper / source
The citation, reproduction package, vendor page, benchmark report, or correction note that justifies a result row.
Example · arXiv paper, GitHub reproduction, official leaderboard, vendor report
The graph is small on purpose.
The core hierarchy is strict. Evidence objects attach to it. That keeps pages, APIs, and contribution review aligned.
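The hierarchy is easiest to see as types. Below is a minimal sketch in TypeScript; every interface and field name is an illustrative assumption, not the registry's published schema.

// Minimal sketch of the Codesota object model. All names here are
// illustrative assumptions; the real schema is not published on this page.
interface CapabilityArea {
  id: string;
  name: string;            // e.g. "Language & Knowledge"
  taskIds: string[];
}

interface Task {
  id: string;              // stable, e.g. "document-ocr"
  aliases: string[];       // messy names that resolve here
  benchmarkIds: string[];
}

interface Benchmark {
  id: string;              // the protocol, e.g. "omnidocbench"
  version: string;         // scoring instructions are versioned
  datasetIds: string[];    // a benchmark can combine several datasets
  metricIds: string[];
}

interface Dataset {
  id: string;              // e.g. "docvqa"; reusable across benchmarks
  name: string;
}

interface Metric {
  id: string;              // e.g. "pass@1", "cer"
  direction: "higher" | "lower";  // comparison direction, required for ranking
}

interface Model {
  id: string;              // canonical ID after de-duplication
  aliases: string[];
}

interface Result {
  modelId: string;         // one model or system
  benchmarkId: string;     // one benchmark protocol
  metricId: string;
  value: number;           // one metric value
  date: string;            // one date (ISO 8601)
  sourceId: string;        // one source trail
}

interface Source {
  id: string;
  kind: "paper" | "vendor" | "reproduction" | "correction";
  url: string;
}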
From claim to registry row.
The ontology is what lets Codesota accept messy real-world evidence while publishing stable, inspectable objects.
Extract
A paper, leaderboard, vendor post, or community submission names models, benchmarks, datasets, metrics, protocols, and scores.
Canonicalize
Messy names are mapped to existing IDs: benchmark aliases resolve to protocols, dataset aliases resolve to data objects, models are de-duplicated, and task ownership is checked. This step, together with validation and ranking, is sketched in code after the list.
Validate
Metric direction, result date, source type, hidden status, protocol version, and benchmark comparability are reviewed before a row can influence a leaderboard.
Rank
The best valid row becomes the current SOTA for a dataset or task. /api/sota exposes that pick with runners-up and provenance.
Revise
Corrections, contamination flags, lineage changes, and new benchmark successors are appended rather than silently rewriting history.
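A hedged sketch of the canonicalize, validate, and rank steps, using the Result and Metric types from the sketch above. The alias table and the validation rules are invented for illustration; they are not Codesota's actual logic.

// Illustrative only: a toy alias table and a toy ranking pass.
const benchmarkAliases: Record<string, string> = {
  "swebench verified": "swe-bench-verified",
  "omni doc bench": "omnidocbench",
};

function canonicalizeBenchmark(raw: string): string {
  const key = raw.trim().toLowerCase();
  return benchmarkAliases[key] ?? key;  // unknown names pass through for review
}

function isValid(row: Result, metric: Metric): boolean {
  // Real validation also reviews source type, hidden status, protocol
  // version, and comparability; this only gates the obvious fields.
  return row.metricId === metric.id && row.sourceId !== "" && row.date !== "";
}

function pickSota(rows: Result[], metric: Metric): Result | undefined {
  // Metric direction decides the sort; the best valid row is the pick and
  // the rest become runners-up. Corrections arrive as appended rows, so
  // re-running this pass is how history gets revised without rewriting it.
  return rows
    .filter((row) => isValid(row, metric))
    .sort((a, b) =>
      metric.direction === "higher" ? b.value - a.value : a.value - b.value
    )[0];
}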
The hierarchy in practice.
These are the highest-coverage public areas right now. Until the schema migration lands, their benchmark protocols are read from legacy dataset rows. The full browser remains at /browse.
Natural Language Processing
Computer Vision
Reasoning
Computer Code
Agentic AI
Medical
Time-series
Why the ontology matters to callers.
The API can be simple because the ontology is explicit. A task alias resolves to a canonical task; the task points to benchmark protocols; each benchmark defines datasets and metrics; result rows decide the pick.
GET /api/sota/ocr
task -> document-ocr
benchmark -> omnidocbench
datasets -> [omnidocbench-pages, docvqa-tables, formula-splits]
metric -> edit / layout / table / formula aggregate
pick.model_id -> paddleocr-vl-1.5
pick.score -> numeric metric value
source_url -> evidence trail
snapshot_id -> cache key for this registry state
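A caller-side sketch in TypeScript, assuming the response fields mirror the block above; the host is a placeholder, and field names should be checked against the live endpoint before relying on them.

// Placeholder host; field names mirror the example response above.
async function currentSota(taskAlias: string) {
  const res = await fetch(`https://codesota.example/api/sota/${taskAlias}`);
  if (!res.ok) throw new Error(`sota lookup failed: ${res.status}`);
  const body = await res.json();
  // snapshot_id names the registry state this pick was computed from, so
  // callers can cache against it or diff two snapshots over time.
  console.log(body.task, body.benchmark, body.pick.model_id, body.pick.score);
  return body;
}

Because a task alias like "ocr" resolves server-side to the canonical task, callers never need to know the canonical IDs up front; the response hands them back alongside the pick.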