Codesota · Vol. II · No. 17
The open registry of state of the art · est. 2025
Issue: April 22, 2026

Live registry · 17 modalities · 9,247 benchmarks

ML results,
linked to the evidence.

Codesota connects papers, code, datasets, models, and benchmark results in one public registry. Use it to see what was measured, where the number came from, and whether anyone has reproduced it.

Explore tasks → · Recent papers
Free · no paywall · optional signup for edits

Which voice actually sounds better?

We will run blind audio tests where people hear A, B, and C reading the same text, then judge which one is best. Those votes will become a human preference layer for ranking TTS systems, alongside WER, latency, and cost.

Blind round · A/B/C · hidden

Fig 1 · Every system reads the same text. The listener never sees model or vendor names.

§ 02 · Benchmark

MTEB, English.

56 tasks across retrieval, classification, clustering, STS and reranking. Scores are averages on the public split; submissions require a public checkpoint and a reproducibility script.

Metric: Average · higher is better
Models: 164 submitted · 151 reproduced
Last submission: 4 hours ago
Citation: Muennighoff et al., 2023

Full method → arXiv:2210.07316

Top-8 · March 2026

.csv · .json

#	Model	Org	Params	Submitted	Avg	Δ
01	Qwen3-8B-Embed	Alibaba	8.0B	Mar 18, 2026	74.23	+0.31
02	NV-Embed-v3	NVIDIA	7.9B	Feb 02, 2026	73.92	+0.12
03	Voyage-3-large	Voyage	N/A	Jan 11, 2026	73.80	+0.04
04	KaLM-v3-embedding	HIT	1.8B	Jan 04, 2026	73.12	+0.22
05	BGE-M3-reranker-lite	BAAI	0.6B	Dec 12, 2025	72.44	-0.01
06	OpenAI text-embedding-3-lg	OpenAI	N/A	May 02, 2024	70.10	0.00
07	Cohere embed-v4	Cohere	N/A	Nov 08, 2025	69.82	+0.08
08	E5-mistral-7b	Microsoft	7.1B	Jan 29, 2024	66.63	0.00

Fig 2 · Averages shown to two decimal places. Δ is change since previous submission by same organization. Shaded row marks current SOTA. All scores reproduced independently before publication.

§ 03 · Progress

Twenty-four months
of state of the art.

Embeddings have climbed 24 points on the MTEB average since late 2021, but progress is not monotonic. We mark every step-up and preserve the underlying submission record, so that a regression is visible even when a press release is not.

Dates reflect public release. Only independently reproduced results contribute to the SOTA line. Unverified submissions are listed separately and marked with a hollow marker.
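The rule above amounts to a running maximum over reproduced submissions in release order. The following is a minimal sketch of that idea, not the registry's actual code: the `Submission` type and the `sota_line` function are hypothetical names, the four reproduced rows are taken from the Top-8 table, and the unverified row is invented purely to show that it is skipped.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Submission:
    model: str
    released: date    # public release date
    score: float      # MTEB average, higher is better
    reproduced: bool  # True once independently reproduced


def sota_line(submissions):
    """Return the reproduced submissions that raised the best score so far."""
    best = float("-inf")
    steps = []
    # Walk submissions in release order; unverified rows never move the line.
    for sub in sorted(submissions, key=lambda s: s.released):
        if sub.reproduced and sub.score > best:
            best = sub.score
            steps.append(sub)
    return steps


history = [
    Submission("E5-mistral-7b", date(2024, 1, 29), 66.63, True),
    Submission("OpenAI text-embedding-3-lg", date(2024, 5, 2), 70.10, True),
    Submission("Cohere embed-v4", date(2025, 11, 8), 69.82, True),
    # Invented row for illustration only: high score, but not reproduced.
    Submission("hypothetical-unverified", date(2025, 12, 1), 75.00, False),
    Submission("BGE-M3-reranker-lite", date(2025, 12, 12), 72.44, True),
]

steps = sota_line(history)
print([s.model for s in steps])
```

In this toy history the Cohere row stays in the record but does not set a step, because it scores below the earlier 70.10, and the unverified row is ignored despite its higher score.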

Fig 3 · MTEB avg. · n=164 submissions · 16 SOTA-setting models annotated

Legend: SOTA-setting · Submission

§ 04 · Coverage

Seventeen modalities. One registry.

Everyone tracks LLM scores. We also track what your pipeline depends on: OCR, ASR, detection, retrieval, and translation, with the same standard of evidence.

OCR / document: 1.64 CER, ↓
Text embedding: 74.2 MTEB avg, ↑
ASR / English: 2.1 WER, ↓
Image classification: 91.8 top-1, ↑
Object detection: 64.1 mAP, ↑
Semantic seg.: 58.9 mIoU, ↑
Translation (en-de): 46.2 BLEU, ↑
Summarization: 48.3 ROUGE-L, ↑
Hallucination det.: 81.4 F1, ↑
Hybrid retrieval: 62.7 nDCG@10, ↑
Code generation: 89.4 pass@1, ↑
SWE-bench verified: 64.7 solve, ↑

Fig 4 · Best published score by quarter, past two years. Dot marks current SOTA. Full task list: ocr · asr · mtr · det · seg · cls · emb · rag · sum · nmt · tts · qa · ranker · halluc · retr · code · swe.

§ 05

Methodology

Why these numbers can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. No reviewer, no reproduction, no retraction. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repository link alone, but a frozen commit, a declared environment, a recorded seed. If the code does not run, the row does not publish.

Second, every benchmark has a metric direction. It sounds trivial. It is not. Half the confusion in the field comes from tables that do not say whether higher is better; ours do.

Third, every score carries a date and survives its author. When a model regresses, the record is preserved. The table never silently forgets.
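To make the second point concrete, here is a minimal sketch of a direction-aware comparison. The `Metric` type and `is_improvement` function are hypothetical, and the `"lower"` value is an assumption: the registry's sample response only shows `"direction": "higher"`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    direction: str  # "higher" or "lower" ("lower" is an assumed value)


def is_improvement(metric, new_score, old_score):
    """True if new_score beats old_score under the metric's declared direction."""
    if metric.direction == "higher":
        return new_score > old_score
    if metric.direction == "lower":
        return new_score < old_score
    raise ValueError(f"unknown direction: {metric.direction!r}")


mteb = Metric("avg", "higher")
cer = Metric("CER", "lower")

print(is_improvement(mteb, 74.23, 73.92))  # higher average wins
print(is_improvement(cer, 1.70, 1.64))     # higher error rate does not
```

Raising on an unknown direction, rather than defaulting to one, keeps a mislabeled benchmark from silently ranking its rows backwards.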

We are building the registry we wish we had when we were training the models ourselves.

§ 06 · Access

A registry, queryable.

Every score on the site is also available as JSON under the same URL. Point a notebook at it, build a dashboard, or write a survey paper. The registry is open and versioned.

Base: api.codesota.com/v1
Auth: None for reads · bearer token for submissions
Limit: 1,000 req/hr anonymous · 10,000 signed
Format: JSON · CSV · Parquet

Read the API reference →

Request.http

# curl codesota.com/v1/query

GET /v1/benchmarks/mteb
  ?since=2025-01&verified=true
  &sort=score.desc&limit=8

Response · 200 · 42 ms

{
  "benchmark": "mteb",
  "metric": { "name": "avg", "direction": "higher" },
  "reproduced": 151,
  "results": [
    {
      "model": "Qwen3-8B-Embed",
      "score": 74.23,
      "date": "2026-03-18",
      "repro": { "commit": "a41e90b", "seed": 42 },
      "source": "verified"
    },
    { "model": "NV-Embed-v3", "score": 73.92, ... },
    ...
  ]
}
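From a notebook, the same request could look roughly like the sketch below, which uses only the Python standard library. The helper names (`benchmark_url`, `top_models`, `fetch`) are hypothetical, the `https` scheme is an assumption, and the response shape is taken from the sample above rather than from the API reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://api.codesota.com/v1"  # base from this page; https is assumed


def benchmark_url(benchmark, **params):
    """Build a read-only query URL for one benchmark."""
    query = urlencode(params)
    return f"{BASE}/benchmarks/{benchmark}" + (f"?{query}" if query else "")


def top_models(payload):
    """Extract (model, score) pairs from a response shaped like the sample."""
    return [(row["model"], row["score"]) for row in payload["results"]]


def fetch(url):
    """Fetch and decode JSON. Makes a network call, so it is not run here."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


url = benchmark_url("mteb", since="2025-01", verified="true",
                    sort="score.desc", limit=8)
print(url)

# Offline stand-in for fetch(url), trimmed from the sample response.
sample = {
    "benchmark": "mteb",
    "results": [
        {"model": "Qwen3-8B-Embed", "score": 74.23},
        {"model": "NV-Embed-v3", "score": 73.92},
    ],
}
print(top_models(sample))
```

Reads need no token according to the access table, so `fetch(url)` sends no `Authorization` header; anonymous callers should stay under the stated 1,000 requests per hour.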

§ 07 · Traction

Five months, actual people.

Unique monthly visitors · April extrapolated linearly (22 days observed → 30)

Dec 2025 · 3,174 · launch
Jan 2026 · 3,428
Feb 2026 · 2,697
Mar 2026 · 5,204
Apr 2026 · 7,666 observed (22 of 30 days) → 10,454 projected

Flat solid bars are observed visitors. The hatched continuation on April is the linear projection assuming the remaining 8 days run at the same pace as the first 22. No growth amplification is applied.
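The April projection is plain proportional scaling, which can be checked in a few lines. The function name `project_month` is hypothetical; the inputs are the figures quoted above.

```python
def project_month(observed_visitors, days_observed, days_in_month):
    """Scale the observed count to a full month at the same daily pace."""
    daily_rate = observed_visitors / days_observed
    return round(daily_rate * days_in_month)


april = project_month(7666, days_observed=22, days_in_month=30)
print(april)  # 10454
```

That is 7,666 / 22 ≈ 348.45 visitors per day, times 30 days ≈ 10,453.6, which rounds to the 10,454 shown.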

§ 08 · Contribute

Trained something
that beats the table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and annotate the step on the chart with your name if it takes the top rank.

Submit a score ↵ · Read submission guide

Recent submissions

KaLM-v3-embedding · @katelin.ml · MTEB · Reproduced · 12 min
ParaDet-R50 · @aai-research · COCO-val · Queued · 34 min
Voyage-3-large · @voyage · BEIR · Reproduced · 52 min
WhisperX-large-v4 · @mkatamaran · LibriSpeech · Revise · 71 min
EVA-02-L · @yihuaxie · ImageNet-val · Reproduced · 98 min