Which embedding model should you use in production?

MTEB is the shortlist, not the deployment answer. In production, choose embeddings by retrieval quality, vector size, context length, serving latency, re-embedding cost, and whether your data can leave your infrastructure.


01 - Recommendation

Production picks by constraint.

You need one production default

Qwen3-Embedding-0.6B

Strong MTEB score, 1024 dimensions, 32k context, and a realistic serving footprint.

You need best possible benchmark quality

KaLM-Embedding-Gemma3-12B

Highest listed MTEB average and retrieval score, but expensive to serve and store.

You cannot host models

text-embedding-3-large or voyage-3.5

Managed APIs remove ops burden; test retrieval on your own corpus before buying the leaderboard story.

You need multilingual open-source

bge-m3 or Qwen3-Embedding-0.6B

Both keep vectors compact; Qwen3 is stronger on aggregate, BGE has mature retrieval tooling.

02 - Performance

Quality plus serving cost.

Scores are from the local MTEB snapshot used on the CodeSOTA MTEB page. Storage assumes float32 vectors before quantization or compression.

Model	Production pick	MTEB avg	Retrieval	Rerank	Dims	Context	Params	Storage	Latency
Qwen3-Embedding-0.6B	Default self-hosted production	64.34	64.65	61.41	1024	32k	0.6B	3.8 GB / 1M vectors	Low
KaLM-Embedding-Gemma3-12B	Offline quality ceiling	72.32	75.66	67.27	3840	32k	11.76B	14.3 GB / 1M vectors	High
Qwen3-Embedding-8B	High-quality self-hosted	70.58	70.88	65.63	4096	32k	8B	15.3 GB / 1M vectors	High
bge-m3	Practical multilingual baseline	59.56	57.89	56.78	1024	8k	568M	3.8 GB / 1M vectors	Low
text-embedding-3-large	Managed API default	58.96	56.12	54.12	3072	8k	API	11.4 GB / 1M vectors	Network
voyage-3.5	Managed API for long-context RAG	58.46	55.89	53.45	1024	32k	API	3.8 GB / 1M vectors	Network

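Fig 1 - Storage footprint is dimensions x 4 bytes x 1M vectors, reported in binary gigabytes (2^30 bytes each): 1024 x 4 x 1,000,000 = 4.096 billion bytes, or about 3.8 GB in the table. Float16 halves it; int8 quantization cuts it to roughly one quarter before index overhead.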
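The storage column can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming raw vector payload only (no index overhead) and binary gigabytes; the function names `vector_storage_bytes` and `to_gib` are illustrative, not part of any library.

```python
def vector_storage_bytes(dims: int, num_vectors: int = 1_000_000,
                         bytes_per_value: int = 4) -> int:
    """Raw vector payload in bytes; index overhead is not included."""
    return dims * bytes_per_value * num_vectors


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gigabytes (2**30 bytes)."""
    return num_bytes / 2**30


# Dimensions taken from the table above.
models = [
    ("Qwen3-Embedding-0.6B", 1024),
    ("text-embedding-3-large", 3072),
    ("KaLM-Embedding-Gemma3-12B", 3840),
    ("Qwen3-Embedding-8B", 4096),
]

for name, dims in models:
    fp32 = to_gib(vector_storage_bytes(dims, bytes_per_value=4))
    fp16 = to_gib(vector_storage_bytes(dims, bytes_per_value=2))
    # int8 stores 1 byte per value; per-vector scale factors are ignored here.
    int8 = to_gib(vector_storage_bytes(dims, bytes_per_value=1))
    print(f"{name}: float32 {fp32:.1f}, float16 {fp16:.1f}, int8 {int8:.1f} (per 1M vectors)")
```

Swap in your own vector count before comparing models: at 50M chunks the gap between 1024 and 4096 dimensions stops being a rounding error.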

03 - Rule of thumb

Do not pick by MTEB average alone.

For RAG, measure recall@20 and answer quality on your own documents. A model with a lower aggregate MTEB score can win if it retrieves your domain language better, or if its vectors are small enough that you can afford to search and rerank a larger candidate pool.
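Recall@k on your own corpus is simple to compute once you have labeled queries. The sketch below assumes each query comes with a ranked list of retrieved document IDs and a set of known relevant IDs; the IDs and the `recall_at_k` and `mean_recall_at_k` helpers are hypothetical.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined for a query with no relevant documents")
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)


def mean_recall_at_k(runs: Sequence[tuple[Sequence[str], Iterable[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Toy labeled runs: (ranked retrieved IDs, relevant IDs).
runs = [
    (["d3", "d7", "d1", "d9"], {"d1", "d2"}),   # finds 1 of 2 relevant docs
    (["d5", "d4", "d8", "d6"], {"d4"}),         # finds the only relevant doc
]

print(mean_recall_at_k(runs, k=20))
```

Run the same labeled set against each candidate model; the difference in mean recall@20 on your data is the number to compare, not the leaderboard average.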

Add a reranker when precision matters. Bi-encoder embeddings are the recall layer; cross-encoders or LLM rerankers are the precision layer.
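The two layers can be sketched as a retrieve-then-rerank pipeline. This is a toy illustration under stated assumptions: vectors are plain Python lists, and `token_overlap` is a stand-in for a real cross-encoder or LLM reranker, which you would plug in through the `score_pair` argument. All names and data here are hypothetical.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float], doc_vecs: dict[str, Sequence[float]], k: int) -> list[str]:
    """Recall layer: rank all documents by embedding similarity, keep the top k."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: list[str], doc_texts: dict[str, str],
           score_pair: Callable[[str, str], float], top_n: int) -> list[str]:
    """Precision layer: rescore each (query, document) pair, keep the top n."""
    ranked = sorted(candidates, key=lambda d: score_pair(query, doc_texts[d]), reverse=True)
    return ranked[:top_n]


def token_overlap(query: str, text: str) -> float:
    """Placeholder scorer; replace with a cross-encoder or LLM reranker."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


doc_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
doc_texts = {
    "a": "billing refund policy",
    "b": "how to reset a password",
    "c": "office opening hours",
}

candidates = retrieve([1.0, 0.05], doc_vecs, k=2)          # wide, cheap pass
final = rerank("reset password", candidates, doc_texts, token_overlap, top_n=1)
print(candidates, final)
```

The design point is the split in cost: the first stage scores every document cheaply, while the expensive scorer only sees the k candidates that survive.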

Minimal production eval

Sample 200 real queries and label the source documents each one should retrieve.
Measure recall@10, recall@20, nDCG@10, and no-answer behavior (what comes back when no relevant document exists).
Track p50/p95 embedding latency and index query latency separately.
Calculate vector storage before choosing dimensions.
Re-run after chunking, metadata filters, or reranking changes.
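Two of the checklist items, nDCG@10 and p50/p95 latency, can be sketched with the standard library alone. The example assumes binary relevance labels and unique document IDs in each ranked list; `ndcg_at_k`, `time_calls_ms`, and `p50_p95` are illustrative helper names, and the function being timed is whatever embedding or index call you want to measure.

```python
import math
import statistics
import time
from typing import Callable, Iterable, Sequence


def ndcg_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 10) -> float:
    """nDCG@k with binary relevance (gain 1 for a relevant doc, else 0)."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(retrieved_ids[:k]) if doc_id in relevant)
    # Ideal ranking puts every relevant doc at the top.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0


def time_calls_ms(fn: Callable[[str], object], inputs: Iterable[str]) -> list[float]:
    """Wall-clock latency in milliseconds for each call to fn."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def p50_p95(latencies_ms: Sequence[float]) -> tuple[float, float]:
    """Median and 95th percentile; needs at least two samples."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


print(ndcg_at_k(["x", "d1", "y"], {"d1"}))       # relevant doc at rank 2
print(p50_p95(time_calls_ms(str.upper, ["query text"] * 200)))
```

Call `time_calls_ms` once around the embedding step and once around the index query, so a slow vector store is not blamed on the model or the other way round.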