00 - Embeddings

Which embedding model should you use in production?

MTEB is the shortlist, not the deployment answer. In production, choose embeddings by retrieval quality, vector size, context length, serving latency, re-embedding cost, and whether your data can leave your infrastructure.

[Diagram: documents -> embedding model (1024 dims) -> vector index. Measure: recall@k, nDCG, latency, storage, re-embedding cost.]
01 - Recommendation

Production picks by constraint.

You need one production default

Qwen3-Embedding-0.6B

Strong MTEB score, 1024 dimensions, 32k context, and a realistic serving footprint.

You need best possible benchmark quality

KaLM-Embedding-Gemma3-12B

Highest listed MTEB average and retrieval score, but expensive to serve and store.

You cannot host models

text-embedding-3-large or voyage-3.5

Managed APIs remove ops burden; test retrieval on your own corpus before buying the leaderboard story.

You need multilingual open-source

bge-m3 or Qwen3-Embedding-0.6B

Both keep vectors compact at 1024 dimensions. Qwen3 is stronger on aggregate, while BGE has mature retrieval tooling.

02 - Performance

Quality plus serving cost.

Scores are from the local MTEB snapshot used on the CodeSOTA MTEB page. Storage assumes float32 vectors before quantization or compression.

| Model | Production pick | MTEB avg | Retrieval | Rerank | Dims | Context | Params | Storage (per 1M vectors) | Latency |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | Default self-hosted production | 64.34 | 64.65 | 61.41 | 1024 | 32k | 0.6B | 3.8 GB | Low |
| KaLM-Embedding-Gemma3-12B | Offline quality ceiling | 72.32 | 75.66 | 67.27 | 3840 | 32k | 11.76B | 14.3 GB | High |
| Qwen3-Embedding-8B | High-quality self-hosted | 70.58 | 70.88 | 65.63 | 4096 | 32k | 8B | 15.3 GB | High |
| bge-m3 | Practical multilingual baseline | 59.56 | 57.89 | 56.78 | 1024 | 8k | 568M | 3.8 GB | Low |
| text-embedding-3-large | Managed API default | 58.96 | 56.12 | 54.12 | 3072 | 8k | API | 11.4 GB | Network |
| voyage-3.5 | Managed API for long-context RAG | 58.46 | 55.89 | 53.45 | 1024 | 32k | API | 3.8 GB | Network |
Fig 1 - Storage footprint is dimensions x 4 bytes x 1M vectors, reported in binary gigabytes (2^30 bytes). Float16 halves it; int8 quantization cuts it to roughly one quarter, in both cases before index overhead.
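
The storage arithmetic is easy to check for your own corpus size. The sketch below is illustrative only: the function name `vector_storage_gib` and the dtype table are assumptions made for this example, and it counts raw vector bytes without any index overhead. It divides by 2^30, which is the unit the table's figures appear to use.

```python
# Bytes per stored vector component at each precision (raw vectors only).
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2, "int8": 1}


def vector_storage_gib(dims: int, num_vectors: int, dtype: str = "float32") -> float:
    """Raw vector storage in GiB (2**30 bytes), excluding index overhead."""
    total_bytes = dims * BYTES_PER_COMPONENT[dtype] * num_vectors
    return total_bytes / 2**30


# Dimensions taken from the table above; 1M vectors each.
for dims in (1024, 3072, 3840, 4096):
    sizes = {
        dtype: round(vector_storage_gib(dims, 1_000_000, dtype), 1)
        for dtype in BYTES_PER_COMPONENT
    }
    print(dims, sizes)
```

The float32 values reproduce the table's 3.8, 11.4, 14.3, and 15.3 figures. Real int8 indexes usually store a little extra (scales or offsets), which is why the caption says "roughly" one quarter.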
03 - Rule of thumb

Do not pick by MTEB average alone.

For RAG, measure recall@20 and answer quality on your own documents. A model with a lower aggregate MTEB score can win if it retrieves your domain language better or keeps vector storage small enough to allow a larger candidate pool.

Add a reranker when precision matters. Bi-encoder embeddings are the recall layer; cross-encoders or LLM rerankers are the precision layer.
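
As a rough illustration of the two layers, the sketch below retrieves a candidate pool by cosine similarity and then reorders only that pool with a slower scorer. Everything here is hypothetical: `retrieve_then_rerank` and `token_overlap` are names invented for this example, the vectors are toy values, and `token_overlap` is a trivial stand-in for a real cross-encoder or LLM reranker.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    query_text: str,
    doc_vecs: List[Sequence[float]],
    doc_texts: List[str],
    rerank_score: Callable[[str, str], float],
    recall_k: int = 20,
    final_k: int = 5,
) -> List[int]:
    """Return document indices: embedding recall first, reranker precision second."""
    # Stage 1 (recall): rank every document by embedding similarity.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:recall_k]
    # Stage 2 (precision): re-score only the candidate pool.
    reranked = sorted(
        candidates,
        key=lambda i: rerank_score(query_text, doc_texts[i]),
        reverse=True,
    )
    return reranked[:final_k]


def token_overlap(query: str, doc: str) -> float:
    """Toy reranker (assumption): fraction of query tokens found in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


doc_texts = [
    "reset your password from the login page",
    "password policy for administrators",
    "billing and invoices overview",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]  # toy embeddings

top = retrieve_then_rerank(
    [0.85, 0.25, 0.05], "reset password",
    doc_vecs, doc_texts, token_overlap, recall_k=2, final_k=2,
)
print(top)
```

In this toy data the embedding stage ranks document 1 slightly ahead of document 0, and the reranker swaps them. The reranker can only reorder what stage 1 returned, so a larger `recall_k` trades latency for a better chance that the right document is in the pool.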

Minimal production eval
  1. Sample 200 real queries and label the expected source documents for each.
  2. Measure recall@10, recall@20, nDCG@10, and no-answer behavior.
  3. Track p50/p95 embedding latency and index query latency separately.
  4. Calculate vector storage before choosing dimensions.
  5. Re-run after chunking, metadata filters, or reranking changes.
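
The metrics in steps 2 and 3 can be computed with the standard library alone. This is a minimal sketch under stated assumptions: relevance is binary, document ids are unique within each result list, and the helper names (`recall_at_k`, `ndcg_at_k`, `latency_percentiles`) and sample data are invented for illustration. No-answer behavior is not covered here, since how to score it depends on how your system signals "no result".

```python
import math
from statistics import mean, quantiles
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary relevance: gain 1 for a relevant doc, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    # Ideal ranking puts every relevant doc (up to k) at the top.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """p50 and p95 from raw latency samples (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100)  # 99 cut points: cuts[i] is percentile i+1
    return {"p50": cuts[49], "p95": cuts[94]}


# Hypothetical eval set: ranked results per query plus the labeled sources.
runs = [
    {"retrieved": ["d1", "d7", "d3"], "relevant": {"d1", "d3"}},
    {"retrieved": ["d9", "d2", "d4"], "relevant": {"d2"}},
]

mean_recall_2 = mean(recall_at_k(r["retrieved"], r["relevant"], 2) for r in runs)
mean_ndcg_3 = mean(ndcg_at_k(r["retrieved"], r["relevant"], 3) for r in runs)
print(f"recall@2={mean_recall_2:.2f} nDCG@3={mean_ndcg_3:.3f}")

# Keep embedding latency and index query latency in separate sample lists.
embed_ms = [12.0, 14.0, 13.0, 15.0, 40.0, 12.5, 13.5, 14.5, 16.0, 13.0]
print(latency_percentiles(embed_ms))
```

With real data, swap in k=10 and k=20 and a per-query latency log. Reporting the mean alone can hide queries that retrieve nothing relevant, so it is worth listing the queries with recall of zero as well.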