Codesota · Vol. II · No. 17
The open registry of state of the art · est. 2025
Issue: April 22, 2026
Live registry · 17 modalities · 9,247 benchmarks

ML results,
linked to the evidence.

Codesota connects papers, code, datasets, models, and benchmark results in one public registry. Use it to see what was measured, where the number came from, and whether anyone has reproduced it.

Explore tasks · Recent papers
Free · no paywall · optional signup for edits
Register now · TTS listening study

Which voice actually sounds better?

We will run blind audio tests where people hear A, B, and C reading the same text, then judge which one is best. Those votes will become a human preference layer for ranking TTS systems, alongside WER, latency, and cost.

Register for the study · How the study works
Same prompt · hidden vendor · human vote
Blind round · A/B/C · A: hidden · B: hidden · C: hidden
Fig 1 · Every system reads the same text. The listener never sees model or vendor names.
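The study design is not specified beyond blind A/B/C voting, so the following is only a minimal sketch of how such a round could be blinded and tallied. The function names, the system identifiers, and the win-rate aggregation are all assumptions made for illustration; the actual study may rank systems differently.

```python
import random
from collections import Counter

LABELS = ("A", "B", "C")

def make_blind_round(systems, rng):
    """Shuffle three systems into the hidden slots A, B and C."""
    if len(systems) != len(LABELS):
        raise ValueError("a round needs exactly three systems")
    shuffled = list(systems)
    rng.shuffle(shuffled)
    return dict(zip(LABELS, shuffled))

def win_rates(rounds_and_votes):
    """Turn (slot -> system mapping, voted slot) pairs into per-system win rates."""
    wins = Counter()
    appearances = Counter()
    for mapping, voted_label in rounds_and_votes:
        for system in mapping.values():
            appearances[system] += 1
        wins[mapping[voted_label]] += 1
    return {system: wins[system] / appearances[system] for system in appearances}

# Hypothetical system names; the listener only ever sees the keys A, B, C.
round_one = make_blind_round(["sys-1", "sys-2", "sys-3"], random.Random(0))
```

The mapping from slot to system stays server-side, and only the voted slot comes back from the listener, which is what keeps the vendor hidden.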
§ 02 · Benchmark

MTEB, English.

56 tasks across retrieval, classification, clustering, STS and reranking. Scores are averages on the public split; submissions require a public checkpoint and a reproducibility script.


Metric: Average · higher is better
Models: 164 submitted · 151 reproduced
Last submission: 4 hours ago
Citation: Muennighoff et al., 2023
Full method → arXiv:2210.07316
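Top-8 · March 2026
.csv · .json

| # | Model | Org | Params | Submitted | Avg | Δ |
|---|-------|-----|--------|-----------|-----|---|
| 01 | Qwen3-8B-Embed | Alibaba | 8.0B | Mar 18, 2026 | 74.23 | +0.31 |
| 02 | NV-Embed-v3 | NVIDIA | 7.9B | Feb 02, 2026 | 73.92 | +0.12 |
| 03 | Voyage-3-large | Voyage | N/A | Jan 11, 2026 | 73.80 | +0.04 |
| 04 | KaLM-v3-embedding | HIT | 1.8B | Jan 04, 2026 | 73.12 | +0.22 |
| 05 | BGE-M3-reranker-lite | BAAI | 0.6B | Dec 12, 2025 | 72.44 | -0.01 |
| 06 | OpenAI text-embedding-3-lg | OpenAI | N/A | May 02, 2024 | 70.10 | 0.00 |
| 07 | Cohere embed-v4 | Cohere | N/A | Nov 08, 2025 | 69.82 | +0.08 |
| 08 | E5-mistral-7b | Microsoft | 7.1B | Jan 29, 2024 | 66.63 | 0.00 |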
Fig 2 · Averages shown to two decimal places. Δ is change since previous submission by same organization. Shaded row marks current SOTA. All scores reproduced independently before publication.
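The Δ column can be read as a per-organization difference between consecutive submissions. A minimal sketch of that calculation follows; the record fields (`org`, `model`, `date`, `score`) and the sample rows are hypothetical, and treating an organization's first submission as having no Δ is an assumption.

```python
def deltas_by_org(submissions):
    """Return {model: change vs. the same org's previous submission, or None}."""
    last_score = {}
    deltas = {}
    # ISO dates sort chronologically as plain strings.
    for sub in sorted(submissions, key=lambda s: s["date"]):
        previous = last_score.get(sub["org"])
        deltas[sub["model"]] = (
            None if previous is None else round(sub["score"] - previous, 2)
        )
        last_score[sub["org"]] = sub["score"]
    return deltas

# Hypothetical rows, not taken from the registry.
rows = [
    {"org": "org-a", "model": "model-a1", "date": "2025-06-01", "score": 70.00},
    {"org": "org-b", "model": "model-b1", "date": "2025-09-01", "score": 71.50},
    {"org": "org-a", "model": "model-a2", "date": "2026-01-15", "score": 70.31},
]
print(deltas_by_org(rows))
```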
§ 03 · Progress

Twenty-four months
of state of the art.

Embeddings have climbed 24 points on the MTEB average since late 2021, but progress is not monotonic. We mark every step-up and preserve the underlying submission record, so that a regression is visible even when a press release is not.

Dates reflect public release. Only independently reproduced results contribute to the SOTA line. Unverified submissions are listed separately and marked with a hollow marker.
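One way to picture the SOTA line is as a running best over reproduced submissions ordered by release date. The sketch below assumes a higher-is-better metric and a hypothetical record shape (`model`, `date`, `score`, `verified`); four rows borrow values from the Top-8 table purely as sample input, and the unverified row is invented.

```python
def sota_line(submissions):
    """Return the step-ups: verified submissions that beat every earlier verified score."""
    steps = []
    best = float("-inf")
    # ISO dates sort chronologically as plain strings.
    for sub in sorted(submissions, key=lambda s: s["date"]):
        if not sub["verified"]:
            continue  # unverified results never move the line
        if sub["score"] > best:
            best = sub["score"]
            steps.append((sub["date"], sub["model"], sub["score"]))
    return steps

sample = [
    {"model": "E5-mistral-7b", "date": "2024-01-29", "score": 66.63, "verified": True},
    {"model": "OpenAI text-embedding-3-lg", "date": "2024-05-02", "score": 70.10, "verified": True},
    {"model": "Cohere embed-v4", "date": "2025-11-08", "score": 69.82, "verified": True},
    {"model": "hypothetical-unverified", "date": "2026-02-20", "score": 75.00, "verified": False},
    {"model": "Qwen3-8B-Embed", "date": "2026-03-18", "score": 74.23, "verified": True},
]
steps = sota_line(sample)
```

In this sample the Cohere row stays in the record but is not a step-up, and the unverified 75.00 does not raise the line.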

[Chart: MTEB avg. (y-axis 49.1 to 74.9) over time (x-axis Aug '21 to Mar '26). Annotated models: SBERT-v2, INSTRUCTOR, E5-large, GTE-large, BGE-M3, E5-mistral-7b, OpenAI-3-lg, NV-Embed-v2, NV-Embed-v3, Qwen3-8B-Embed.]
Fig 3 · MTEB avg. · n=164 submissions · 16 SOTA-setting models annotated
Legend: SOTA-setting · Submission
§ 04 · Coverage

Seventeen modalities. One registry.

Everyone tracks LLM scores. We also track what your pipeline depends on: OCR, ASR, detection, retrieval, and translation, with the same standard of evidence.

| Task | Best score | Metric | Direction |
|------|------------|--------|-----------|
| OCR / document | 1.64 | CER | ↓ |
| Text embedding | 74.2 | MTEB avg | ↑ |
| ASR / English | 2.1 | WER | ↓ |
| Image classification | 91.8 | top-1 | ↑ |
| Object detection | 64.1 | mAP | ↑ |
| Semantic seg. | 58.9 | mIoU | ↑ |
| Translation (en-de) | 46.2 | BLEU | ↑ |
| Summarization | 48.3 | ROUGE-L | ↑ |
| Hallucination det. | 81.4 | F1 | ↑ |
| Hybrid retrieval | 62.7 | nDCG@10 | ↑ |
| Code generation | 89.4 | pass@1 | ↑ |
| SWE-bench verified | 64.7 | solve | ↑ |
Fig 4 · Best published score by quarter, past two years. Dot marks current SOTA. Full task list: ocr · asr · mtr · det · seg · cls · emb · rag · sum · nmt · tts · qa · ranker · halluc · retr · code · swe.
§ 05 · Methodology

Why these numbers can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. No reviewer, no reproduction, no retraction. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repository link alone, but a frozen commit, a declared environment, a recorded seed. If the code does not run, the row does not publish.

Second, every benchmark has a metric direction. It sounds trivial. It is not. Half the confusion in the field comes from tables that do not say whether higher is better; ours do.

Third, every score carries a date and survives its author. When a model regresses, the record is preserved. The table never silently forgets.

We are building the registry we wish we had when we were training the models ourselves.
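The three rules above can be sketched as a small record type plus two checks. This is an illustration under assumptions, not the registry's schema: the class, the field names, and the `"lower"` direction value are hypothetical (only `"higher"` appears in the published response format), and the commit and seed values are the ones shown in the API sample.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Submission:
    model: str
    score: float
    date: str         # ISO date; the row keeps it even after a regression
    commit: str       # frozen commit hash
    environment: str  # declared environment, e.g. an image tag or lockfile id
    seed: int         # recorded random seed

def can_publish(sub: Submission, ran_ok: bool) -> bool:
    """A row publishes only if its provenance is complete and the code ran."""
    return ran_ok and bool(sub.commit) and bool(sub.environment)

def is_improvement(new: float, old: float, direction: str) -> bool:
    """Compare two scores using the benchmark's declared metric direction."""
    if direction == "higher":
        return new > old
    if direction == "lower":
        return new < old
    raise ValueError(f"unknown metric direction: {direction!r}")

sub = Submission("Qwen3-8B-Embed", 74.23, "2026-03-18", "a41e90b", "hypothetical-env", 42)
```

Making the direction an explicit argument means a CER of 1.64 and an MTEB average of 74.23 go through the same comparison without either being misread.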

§ 06 · Access

A registry, queryable.

Every score on the site is also available as JSON under the same URL. Point a notebook at it, build a dashboard, or write a survey paper. The registry is open and versioned.

Base: api.codesota.com/v1
Auth: None for reads · bearer token for submissions
Limit: 1,000 req/hr anonymous · 10,000 signed
Format: JSON · CSV · Parquet
Read the API reference →
Request.http
# curl codesota.com/v1/query

GET /v1/benchmarks/mteb
  ?since=2025-01&verified=true
  &sort=score.desc&limit=8
Response · 200 · 42 ms
{
  "benchmark": "mteb",
  "metric": { "name": "avg", "direction": "higher" },
  "reproduced": 151,
  "results": [
    {
      "model": "Qwen3-8B-Embed",
      "score": 74.23,
      "date": "2026-03-18",
      "repro": { "commit": "a41e90b", "seed": 42 },
      "source": "verified"
    },
    { "model": "NV-Embed-v3", "score": 73.92, ... },
    ...
  ]
}
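For notebook use, the same request can be sketched with the Python standard library. The `https` scheme, the helper names, and the assumption that every result carries `model` and `score` keys are inferred from the sample above rather than from the API reference, so treat this as a starting point only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.codesota.com/v1"  # scheme is an assumption

def build_url(benchmark, **params):
    """Build the query URL for one benchmark, e.g. 'mteb'."""
    return f"{BASE_URL}/benchmarks/{benchmark}?{urlencode(params)}"

def fetch_results(benchmark, **params):
    """Perform the GET request and decode the JSON body (needs network access)."""
    with urlopen(build_url(benchmark, **params), timeout=10) as response:
        return json.load(response)

def top_models(payload):
    """Extract (model, score) pairs from a response shaped like the sample."""
    return [(row["model"], row["score"]) for row in payload["results"]]

url = build_url("mteb", since="2025-01", verified="true",
                sort="score.desc", limit=8)
```

Reads need no token according to the access table, so `fetch_results` sends no headers; anonymous callers should stay under the stated 1,000 requests per hour.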
§ 07 · Traction

Five months, actual people.

Unique monthly visitors · April extrapolated linearly (22 days observed → 30)
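| Month | Unique visitors | Note |
|-------|-----------------|------|
| Dec 2025 | 3,174 | launch |
| Jan 2026 | 3,428 | |
| Feb 2026 | 2,697 | |
| Mar 2026 | 5,204 | |
| Apr 2026 | 7,666 | 22 of 30 days · 10,454 projected |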
Flat solid bars are observed visitors. The hatched continuation on April is the linear projection assuming the remaining 8 days run at the same pace as the first 22. No growth amplification is applied.
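The April projection is a straight proportional scale-up of the observed count. A short sketch of the arithmetic, with a hypothetical function name:

```python
def project_month(observed_visitors, days_observed, days_in_month):
    """Linear projection: assume the remaining days run at the observed daily pace."""
    if not 0 < days_observed <= days_in_month:
        raise ValueError("days_observed must be between 1 and days_in_month")
    return round(observed_visitors * days_in_month / days_observed)

april_projection = project_month(7666, days_observed=22, days_in_month=30)
```

With 7,666 visitors over 22 days, the daily pace is about 348.5, and 30 days at that pace gives 10,454, matching the figure shown.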
§ 08 · Contribute

Trained something
that beats the table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and annotate the step on the chart with your name if it takes the top rank.

Recent submissions
KaLM-v3-embedding · @katelin.ml · MTEB · Reproduced · 12 min
ParaDet-R50 · @aai-research · COCO-val · Queued · 34 min
Voyage-3-large · @voyage · BEIR · Reproduced · 52 min
WhisperX-large-v4 · @mkatamaran · LibriSpeech · Revise · 71 min
EVA-02-L · @yihuaxie · ImageNet-val · Reproduced · 98 min