One endpoint per task.
Ready to use.
You shouldn't have to read a benchmark paper to transcribe a receipt. CodeSOTA is turning every AI task into a hosted endpoint with three tiers — SOTA, balanced, cheap — so you pick the trade-off and we run the rest.
The manifesto
Intelligence
as a commodity.
Oil has grades. Electricity has tariffs. Shipping has class codes. Every mature market commoditizes by standardizing the contract, not the molecule. Intelligence is next.
Grade
Contract
Quality cert
Spot price
OpenAI, Anthropic, Google — they're refineries. They output something extraordinary, but a refinery's output is only useful once the market around it standardizes how you buy, price, and substitute it.
CodeSOTA is building the standard behind the contract. Benchmarks are the assay. Task endpoints are the grade — one served today (/v1/ocr), the rest in flight. Everything else is implementation.
§ 01 — The thesis
The models already exist.
Nobody wants to pick.
Intelligence is moving from general and centralized to specific and everywhere. For any given task — OCR, transcription, translation, extraction — there is already an open-source model that comes within a few points of the frontier at a tiny fraction of the cost.
On OmniDocBench, the document-parsing benchmark CodeSOTA tracks, PaddleOCR-VL-1.5 scores 94.50 — higher than GPT-5.4 (85.80) and Gemini 2.5 Pro (84.20), at roughly 1/167th the price per 1,000 pages. The open model is measurably better and two orders of magnitude cheaper.
How the 167× is computed: PaddleOCR-VL-1.5's self-hosted cost of $0.09/1K pages is the amortized inference cost on a single A100 at typical utilization, not a retail price. GPT-5.4's $15/1K is OpenAI's published list price. So this is a COGS-vs-retail comparison, which flatters the delta. For a straight retail-vs-retail comparison, see the /ocr leaderboard, where every row shows the actual hosted price you can buy today.
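The delta above is plain division. A quick sketch, using only the two prices quoted in this section:

```python
# The two price points from this section: retail list price vs amortized COGS.
gpt_retail_per_1k_pages = 15.00    # GPT-5.4, OpenAI's published list price, $/1K pages
paddle_cogs_per_1k_pages = 0.09    # PaddleOCR-VL-1.5, self-hosted on one A100, $/1K pages

ratio = gpt_retail_per_1k_pages / paddle_cogs_per_1k_pages
print(f"{ratio:.0f}x")  # -> 167x
```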
The problem isn't that the models don't exist. The problem is that picking one means reading papers, reconciling benchmarks, renting GPUs, and stitching APIs. On OCR today, CodeSOTA picks for you. We're building toward the rest, in design partnership with early teams.
§ 02 — How a task endpoint works
One request. Three possible answers.
You choose the trade-off.
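As a sketch of the call shape: one body, one tier field, nothing else to decide. The field names follow the /v1/ocr example later on this page; the `ocr_request` helper is illustrative, not a published SDK, and in practice you would POST the body to the endpoint.

```python
def ocr_request(file: str, tier: str = "balanced",
                max_cost_usd: float = 0.01, timeout_s: int = 30) -> dict:
    """Build a request body for POST /v1/ocr. Same body, three possible tiers."""
    assert tier in {"sota", "balanced", "cheap"}
    return {"file": file, "tier": tier,
            "max_cost_usd": max_cost_usd, "timeout_s": timeout_s}

body = ocr_request("invoice.pdf")   # balanced is the default tier
print(body["tier"])                 # -> balanced
```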
§ 03 — Worked example
Watch it work on document OCR.
This is live data from codesota.com/ocr. The same table that powers the leaderboard also powers the router: whichever row is #1 this week is what the sota tier calls under the hood. When the ranking changes, the endpoint quietly follows.
| # | Model | Score | $/1K pages | Tier |
|---|---|---|---|---|
| 1 | GLM-OCR open-source | 94.62 | $0.09 | sota |
| 2 | PaddleOCR-VL-1.5 open-source | 94.50 | $0.09 | balanced |
| 3 | dots.ocr 3B open-source | 88.41 | $0.04 | cheap |
| 4 | MonkeyOCR-pro open-source | 86.96 | $0.03 | cheap |
| 5 | GPT-5.4 closed API | 85.80 | $15.00 | — |
| 6 | Gemini 2.5 Pro closed API | 84.20 | $12.50 | — |
| 7 | Mistral OCR 3 closed API | 83.40 | $1.00 | — |
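The routing the text describes can be sketched in a few lines over the table above. The tier rules here are illustrative, not the production router: sota takes whichever row is #1 this week, cheap takes the cheapest row above a quality bar, and the balanced rule is a placeholder.

```python
# Leaderboard rows from the table above: (model, score, $/1K pages).
LEADERBOARD = [
    ("GLM-OCR",          94.62, 0.09),
    ("PaddleOCR-VL-1.5", 94.50, 0.09),
    ("dots.ocr 3B",      88.41, 0.04),
    ("MonkeyOCR-pro",    86.96, 0.03),
]

def route(tier: str) -> str:
    rows = sorted(LEADERBOARD, key=lambda r: r[1], reverse=True)
    if tier == "sota":
        return rows[0][0]                        # whichever row is #1 this week
    if tier == "cheap":
        above_bar = [r for r in rows if r[1] >= 88.0]   # illustrative quality bar
        return min(above_bar, key=lambda r: r[2])[0]    # cheapest that clears it
    return rows[1][0]                            # balanced: placeholder rule, next-best quality

print(route("sota"))  # -> GLM-OCR
```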
§ 04 — The menu
Balanced is the default.
Everything else is opt-in.
The whole point of a three-tier menu is that you almost never need the top one. If a 30B open-source checkpoint is within a couple points of the frontier for 1/20th the price, that's the tier you should be calling 99% of the time — and it's the tier /v1/<task> routes to unless you say otherwise.
balanced · 20× cheaper.
An open-source model that sits within a few points of the frontier at a tiny fraction of the price. A 30–70B open checkpoint on commodity GPUs. This is what you call 99% of the time.
sota · The last few points.
For the last few points of accuracy that actually matter: compliance runs, eval suites, audit trails. Most workloads don't need this. When they do, one flag flips.
cheap · The quality bar, nothing more.
The smallest model that still clears the quality bar. 3–8B open checkpoints, distilled variants, or classic CNNs where a VLM is overkill. When you're running a million calls, money wins.
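At a million pages, the tier choice is plain arithmetic. Prices are the ones from the leaderboard table in § 03:

```python
# Cost of one million pages at each $/1K-pages price from the leaderboard.
PRICE_PER_1K_PAGES = {
    "sota (GLM-OCR)":              0.09,
    "balanced (PaddleOCR-VL-1.5)": 0.09,
    "cheap (dots.ocr 3B)":         0.04,
    "frontier API (GPT-5.4)":      15.00,
}

PAGES = 1_000_000

def cost(price_per_1k: float, pages: int = PAGES) -> float:
    return pages / 1000 * price_per_1k

for name, price in PRICE_PER_1K_PAGES.items():
    print(f"{name:>30}: ${cost(price):>9,.2f}")
# cheap runs the million pages for $40; the frontier API bills $15,000.
```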
§ 05 — Why CodeSOTA is the right place for this
We're not another router.
We're the benchmark layer, first.
Benchmarks are the fuel
Task-first, not model-first
Independence is the product
The flywheel compounds
§ 06 — Built for agents, not just humans
The customer isn't always human.
Autonomous agents — Hermes, Claude, OpenCode, your own home-grown loop — need two different things from an inference layer: a brain to run their reasoning cycle, and tools to solve concrete tasks inside it. CodeSOTA covers both.
Need a brain
Which LLM runs the agent loop best?
Tool-use accuracy, instruction following, tokens/sec, and $/1M tokens are all tracked on CodeSOTA agent benchmarks. Pick the reasoning engine the same way you pick a task tier — by data, not vibes.
Need tools
Which specialized API solves the sub-task?
A multimodal LLM can read a PDF, but PaddleOCR-VL does it 167× cheaper and more accurately. A frontier model can transcribe audio, but Whisper is 50× cheaper. Agents should delegate modality-bound tasks to the right specialist, and the API surface has to make that easy.
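A sketch of that delegation pattern: the agent keeps its LLM for reasoning and exposes the task endpoint as a tool. The tool schema and the `dispatch` helper are illustrative, not a published CodeSOTA SDK.

```python
# Hypothetical tool definition: modality-bound work goes to the specialist endpoint,
# not to the agent's own LLM.
OCR_TOOL = {
    "name": "document_ocr",
    "description": "Extract text from a document via POST /v1/ocr.",
    "parameters": {
        "file": {"type": "string", "description": "Path or URL of the document"},
        "tier": {"type": "string", "enum": ["sota", "balanced", "cheap"], "default": "balanced"},
    },
}

def dispatch(tool_call: dict) -> str:
    # In a real agent loop this would POST to the endpoint; here we only route the name.
    if tool_call["name"] == "document_ocr":
        return f"ocr:{tool_call['arguments']['file']}"
    raise ValueError(f"unknown tool {tool_call['name']}")

print(dispatch({"name": "document_ocr", "arguments": {"file": "invoice.pdf"}}))
```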
Founder's note · why we're building this
“I've been picking the LLM that powers my Hermes agent for months now. The Pareto-optimal choice on quality-vs-cost changes every few weeks — a new release, a price drop, a benchmark update, a latency regression. I don't want to babysit that decision anymore. If I don't, why would anyone else?”
— Kacper Wikiel · CodeSOTA
// endpoint
POST /v1/ocr
// agent picks at call time
{
  "file": "invoice.pdf",
  "tier": "balanced",
  "max_cost_usd": 0.01,
  "timeout_s": 30
}
// capabilities the agent can reason about
{
  "task": "document-ocr",
  "tiers": {
    "sota": { "quality": 0.946, "usd_per_1k": 0.09 },
    "balanced": { "quality": 0.945, "usd_per_1k": 0.09 },
    "cheap": { "quality": 0.884, "usd_per_1k": 0.04 }
  },
  "benchmark": "omnidocbench",
  "updated": "2026-03-28"
}
§ 07 — The task catalog
OCR is live. The rest are in flight.
Every row is one task, one stable API contract, three tier choices. Backed by a dedicated CodeSOTA benchmark.
Today · serving requests
Roadmap · open to design partners
Request priority on a roadmap endpoint →
§ 08 — Commitments
What we won't pretend.
Opinionated means we pick for you.
SOTA moves. Endpoints follow, unless you pin.
tier: "sota" is a contract for quality: when a better model tops the leaderboard, the endpoint quietly follows. That's the point of routing the choice away from you.
Pinning is a first-class option.
POST /v1/ocr
{
  "file": "invoice.pdf",
  "model": "glm-ocr@2026-03-01",
  "pin": true
}
Use tier when you want the best answer today. Use model + pin when you need the same answer a year from now.
Independence is non-negotiable.
§ 09 — Traction
And the curve has already turned.
The benchmark side of CodeSOTA started pulling traffic in late 2025, right as the OCR leaderboard and task pages came online. This is the surface the task router will sit on top of — and it's already compounding.
Visitors
18,899
Page views
43,310
MoM growth
+71%
The benchmark pages are the top of the funnel. The task endpoints are the conversion. The vision above isn't a bet — it's a roadmap on a curve that's already moving.
Build with us
Pick a task. Get the best model.
Pay the right price.
We're shipping the task catalog endpoint by endpoint. If a task we haven't covered yet is blocking you, come talk to us.