
Feature Extraction

Feature extraction, generating dense vector embeddings from text, is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field accelerated sharply in 2023-2024 with instruction-tuned embedding models like E5-Mistral, GTE-Qwen2, and Nomic Embed that turned decoder-only LLMs into embedding engines, pushing MTEB average scores past 70 across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality: a 7B-parameter embedding model substantially outperforms a 110M-parameter one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to any dimension without retraining, making deployment flexible across latency and storage budgets.
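The Matryoshka idea mentioned above is simple at inference time: keep only the first k dimensions of an embedding and re-normalize before computing cosine similarity. A minimal sketch of that truncation step, using toy hand-written vectors in place of real model outputs (the 8-dimensional values below are illustrative, not from any model):

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` coordinates
    and re-normalize so cosine similarity stays meaningful."""
    v = vec[:dim]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def cosine(a, b):
    # Vectors are already unit-normalized, so cosine is a dot product.
    return sum(x * y for x, y in zip(a, b))

# Toy 8-d "embeddings"; a real model would emit 768-4096 dims.
doc = [0.9, 0.3, 0.1, 0.05, 0.02, 0.01, 0.0, 0.0]
query = [0.8, 0.4, 0.05, 0.1, 0.01, 0.0, 0.01, 0.0]

sim_full = cosine(truncate_embedding(doc, 8), truncate_embedding(query, 8))
sim_half = cosine(truncate_embedding(doc, 4), truncate_embedding(query, 4))
print(f"cosine at 8 dims: {sim_full:.3f}, at 4 dims: {sim_half:.3f}")
```

Because Matryoshka-trained models concentrate information in the leading dimensions, the truncated similarity tends to track the full one closely, which is what makes the latency/storage trade-off cheap.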

Datasets: 1 · Results: 6 · Canonical metric: accuracy · Canonical benchmark: MTEB Leaderboard

MTEB Leaderboard

Massive Text Embedding Benchmark spanning 8 task categories.

Primary metric: accuracy

Top 10

Leading models on the MTEB Leaderboard.

Rank  Model                    Avg. score  Year  Source
1     NV-Embed-v2                    72.3  2024  paper
2     GTE-Qwen2-7B-instruct          72.0  2024  paper
3     voyage-3-large                 70.3  2025  paper
4     E5-Mistral-7B-instruct         66.6  2024  paper
5     jina-embeddings-v3             65.2  2024  paper
6     text-embedding-3-large         64.6  2024  paper
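Several of the leaders above (E5-Mistral-7B-instruct, GTE-Qwen2-7B-instruct, NV-Embed) are instruction-tuned: queries are wrapped in a task instruction before embedding, while documents are embedded as-is. A sketch of the E5-Mistral-style template (the exact wording comes from that model's card and is an assumption here; check the card for the model you deploy):

```python
def format_instructed_query(task_description: str, query: str) -> str:
    """Wrap a search query in an E5-Mistral-style instruction prefix.
    Documents are typically embedded without any prefix."""
    return f"Instruct: {task_description}\nQuery: {query}"

text = format_instructed_query(
    "Given a web search query, retrieve relevant passages",
    "how do matryoshka embeddings work?",
)
print(text)
```

Getting this template wrong (or omitting it) is a common source of silent retrieval-quality regressions, since the model still returns plausible-looking vectors either way.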

All datasets

1 dataset tracked for this task.

