Codesota · Vol. II · No. 17
The open registry of state of the art · est. 2025
Issue: April 22, 2026
Live registry · 17 modalities · 9,247 benchmarks

The state of the art,
measured honestly.

Codesota is the registry ML engineers consult before choosing a model — every benchmark reproduced, every submission traced to code, every score dated. A calmer, stricter successor to Papers with Code.

Browse benchmarks · Read the methodology
Free · no paywall · no signup
§ 02 · Benchmark

MTEB, English.

56 tasks across retrieval, classification, clustering, STS and reranking. Scores are averages on the public split; submissions require a public checkpoint and a reproducibility script.


Metric
Average · higher is better
Models
164 submitted · 151 reproduced
Last submission
4 hours ago
Citation
Muennighoff et al., 2023
Full method → arXiv:2210.07316
Top-8 · March 2026
#   Model                       Org        Params  Submitted      Avg    Δ
01  Qwen3-8B-Embed              Alibaba    8.0B    Mar 18, 2026   74.23  +0.31
02  NV-Embed-v3                 NVIDIA     7.9B    Feb 02, 2026   73.92  +0.12
03  Voyage-3-large              Voyage     —       Jan 11, 2026   73.80  +0.04
04  KaLM-v3-embedding           HIT        1.8B    Jan 04, 2026   73.12  +0.22
05  BGE-M3-reranker-lite        BAAI       0.6B    Dec 12, 2025   72.44  -0.01
06  OpenAI text-embedding-3-lg  OpenAI     —       May 02, 2024   70.10   0.00
07  Cohere embed-v4             Cohere     —       Nov 08, 2025   69.82  +0.08
08  E5-mistral-7b               Microsoft  7.1B    Jan 29, 2024   66.63   0.00
Fig 2 · Averages shown to two decimal places. Δ is change since previous submission by same organization. Shaded row marks current SOTA. All scores reproduced independently before publication.
§ 03 · Progress

Twenty-four months
of state of the art.

Embeddings have climbed 24 points on the MTEB average since late 2021 — but progress is not monotonic. We mark every step-up and preserve the underlying submission record, so that a regression is visible even when a press release is not.

Dates reflect public release. Only independently reproduced results contribute to the SOTA line. Unverified submissions are listed separately and marked with a hollow marker.
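The SOTA line in Fig 3 is, in effect, a running maximum over dated, reproduced submissions. A minimal sketch of deriving such a step function; the helper name is illustrative, and all scores except the two that appear in Fig 2 (66.63 and 74.23) are made-up placeholders, not registry data:

```python
from datetime import date

# Illustrative (model, date, score) records; higher is better for MTEB avg.
submissions = [
    ("E5-large",       date(2022, 12, 1), 62.00),
    ("GTE-large",      date(2023, 8, 1),  63.10),
    ("E5-mistral-7b",  date(2024, 1, 29), 66.63),
    ("BGE-M3",         date(2024, 2, 1),  66.00),
    ("Qwen3-8B-Embed", date(2026, 3, 18), 74.23),
]

def sota_steps(records):
    """Return only the submissions that set a new state of the art,
    scanning in date order and keeping a running maximum."""
    best = float("-inf")
    steps = []
    for model, day, score in sorted(records, key=lambda r: r[1]):
        if score > best:
            best = score
            steps.append((model, day, score))
    return steps

for model, day, score in sota_steps(submissions):
    print(f"{day}  {score:6.2f}  {model}")
```

A submission that scores below the running best (BGE-M3 in this toy data) stays in the record but never becomes a step, which is exactly what keeps regressions visible.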

[Fig 3 chart · MTEB avg. vs. time, Aug '21 to Mar '26 · y-axis 49.1–74.9 · annotated SOTA-setting models: SBERT-v2, INSTRUCTOR, E5-large, GTE-large, BGE-M3, E5-mistral-7b, OpenAI-3-lg, NV-Embed-v2, NV-Embed-v3, Qwen3-8B-Embed]
Fig 3 · MTEB avg. · n=164 submissions · 16 SOTA-setting models annotated
Legend: SOTA-setting · Submission
§ 04 · Coverage

Seventeen modalities. One registry.

Everyone tracks LLM scores. We also track what your pipeline depends on — OCR, ASR, detection, retrieval, translation — with the same standard of evidence.

OCR / document · 1.64 CER ↓
Text embedding · 74.2 MTEB avg ↑
ASR / English · 2.1 WER ↓
Image classification · 91.8 top-1 ↑
Object detection · 64.1 mAP ↑
Semantic seg. · 58.9 mIoU ↑
Translation (en-de) · 46.2 BLEU ↑
Summarization · 48.3 ROUGE-L ↑
Hallucination det. · 81.4 F1 ↑
Hybrid retrieval · 62.7 nDCG@10 ↑
Code generation · 89.4 pass@1 ↑
SWE-bench verified · 64.7 solve ↑
Fig 4 · Best published score by quarter, past two years. Dot marks current SOTA. Full task list: ocr · asr · mtr · det · seg · cls · emb · rag · sum · nmt · tts · qa · ranker · halluc · retr · code · swe.
§ 05
Methodology

Why these numbers can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. No reviewer, no reproduction, no retraction. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repository link alone — a frozen commit, a declared environment, a recorded seed. If the code does not run, the row does not publish.

Second, every benchmark has a metric direction. It sounds trivial. It is not. Half the confusion in the field comes from tables that do not say whether higher is better; ours do.

Third, every score carries a date and survives its author. When a model regresses — and they do regress — the record is preserved. The table never silently forgets.
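The three requirements above (pinned code, an explicit metric direction, a dated record) amount to a simple record check before a row may publish. A sketch under assumed field names, not Codesota's actual schema or pipeline:

```python
import re
from datetime import date

# Hypothetical minimal field set for a submission row.
REQUIRED = {"model", "score", "date", "commit", "seed", "direction"}

def validate_submission(rec: dict) -> list:
    """Return a list of problems; an empty list means the record may publish."""
    problems = [f"missing field: {k}" for k in REQUIRED - rec.keys()]
    if "commit" in rec and not re.fullmatch(r"[0-9a-f]{7,40}", rec["commit"]):
        problems.append("commit must be a frozen git hash")
    if "direction" in rec and rec["direction"] not in ("higher", "lower"):
        problems.append("direction must be 'higher' or 'lower'")
    if "date" in rec and not isinstance(rec["date"], date):
        problems.append("date must be a real date, not free text")
    return problems

rec = {"model": "Qwen3-8B-Embed", "score": 74.23, "date": date(2026, 3, 18),
       "commit": "a41e90b", "seed": 42, "direction": "higher"}
print(validate_submission(rec))  # → []
```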

We are building the registry we wish we had when we were training the models ourselves.

§ 06 · Access

A registry, queryable.

Every score on the site is also available as JSON under the same URL. Point a notebook at it, build a dashboard, write a survey paper — the registry is open and versioned.

Base: api.codesota.com/v1
Auth: none for reads · bearer token for submissions
Limit: 1,000 req/hr anonymous · 10,000 signed
Format: JSON · CSV · Parquet
Read the API reference →
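The access table above is enough to build a read query with no auth token. A sketch that only constructs the URL; the endpoint and parameter names are taken from the example request on this page, the `https` scheme and the helper are assumptions:

```python
from urllib.parse import urlencode

BASE = "https://api.codesota.com/v1"  # base URL from the access table

def benchmark_url(slug, **params):
    """Build a registry query URL; reads need no auth token.
    Parameters are sorted so the same query always yields the same URL."""
    query = urlencode(sorted(params.items()))
    return f"{BASE}/benchmarks/{slug}?{query}"

url = benchmark_url("mteb", since="2025-01", verified="true",
                    sort="score.desc", limit=8)
print(url)
```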
Request · .http

# curl api.codesota.com/v1/benchmarks/mteb
GET /v1/benchmarks/mteb
    ?since=2025-01&verified=true
    &sort=score.desc&limit=8
Response · 200 · 42 ms
{
  "benchmark": "mteb",
  "metric": { "name": "avg", "direction": "higher" },
  "reproduced": 151,
  "results": [
    {
      "model": "Qwen3-8B-Embed",
      "score": 74.23,
      "date": "2026-03-18",
      "repro": { "commit": "a41e90b", "seed": 42 },
      "source": "verified"
    },
    { "model": "NV-Embed-v3", "score": 73.92, ... },
    ...
  ]
}
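The `metric.direction` field in the response is what makes ranking unambiguous on the consumer side. A sketch of using it, with the payload trimmed to the fields shown in the example; the helper is illustrative, not a Codesota client:

```python
import json

# Trimmed version of the example response above.
payload = json.loads("""{
  "benchmark": "mteb",
  "metric": { "name": "avg", "direction": "higher" },
  "results": [
    { "model": "NV-Embed-v3", "score": 73.92 },
    { "model": "Qwen3-8B-Embed", "score": 74.23 }
  ]
}""")

def ranked(doc):
    """Sort results best-first, using the declared metric direction
    instead of assuming higher is always better."""
    best_first = doc["metric"]["direction"] == "higher"
    return sorted(doc["results"], key=lambda r: r["score"], reverse=best_first)

top = ranked(payload)[0]
print(top["model"], top["score"])  # → Qwen3-8B-Embed 74.23
```

The same function ranks a CER or WER benchmark correctly once `direction` is `"lower"`, with no special-casing per task.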
§ 07 · Traction

Five months, actual people.

Unique monthly visitors · April extrapolated linearly (22 days observed → 30)
Dec 2025
3,174
launch
Jan 2026
3,428
Feb 2026
2,697
Mar 2026
5,204
Apr 2026
7,666
10,454 projected
22 of 30 days
Solid bars are observed visitors. The hatched continuation on April is the linear projection, assuming the remaining 8 days run at the same pace as the first 22; no growth amplification is applied.
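The April figure is plain linear extrapolation from the observed pace, and one line reproduces the stated 10,454:

```python
observed, days_seen, days_total = 7_666, 22, 30

# Daily pace over the observed window, scaled to the full month;
# no growth amplification applied.
projected = round(observed / days_seen * days_total)
print(projected)  # → 10454
```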
§ 08 · Contribute

Trained something
that beats the table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the chart with your name.

Recent submissions
KaLM-v3-embedding · @katelin.ml · MTEB · Reproduced · 12 min
ParaDet-R50 · @aai-research · COCO-val · Queued · 34 min
Voyage-3-large · @voyage · BEIR · Reproduced · 52 min
WhisperX-large-v4 · @mkatamaran · LibriSpeech · Revise · 71 min
EVA-02-L · @yihuaxie · ImageNet-val · Reproduced · 98 min