Codesota · Tasks · Feature Extraction
Natural Language Processing · feature-extraction

Feature Extraction.

Feature extraction, the generation of dense vector embeddings from text, is the infrastructure layer behind semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, and the field expanded rapidly in 2023-2024 with instruction-tuned embedding models such as E5-Mistral and GTE-Qwen2, which turned decoder-only LLMs into embedding engines and pushed average MTEB scores past 70 across 50+ tasks, alongside smaller encoder-based models such as Nomic Embed. The key insight was that pre-training scale transfers to embedding quality: a 7B-parameter embedding model typically outperforms a 110M-parameter one by a wide margin on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to a shorter prefix of their dimensions without retraining, making deployment flexible across latency and storage budgets.
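To make the truncation idea concrete, here is a minimal sketch in plain Python of how a Matryoshka-style embedding might be shortened and then used for similarity ranking. The helper names (`truncate_embedding`, `rank_documents`) and the toy 8-dimensional vectors are invented for illustration; they are not output from any model listed on this page. Truncation of this kind is only expected to preserve ranking quality when the model was trained with a Matryoshka objective.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then re-normalize to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_documents(query, docs, dim):
    """Return document indices sorted by similarity to the query at `dim` dimensions."""
    q = truncate_embedding(query, dim)
    scores = [cosine_similarity(q, truncate_embedding(d, dim)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Toy 8-dimensional vectors, invented for illustration (not real model output).
query = [0.9, 0.1, 0.3, 0.0, 0.05, 0.02, 0.01, 0.03]
docs = [
    [0.8, 0.2, 0.4, 0.1, 0.00, 0.05, 0.02, 0.01],  # points in roughly the query's direction
    [0.1, 0.9, 0.0, 0.3, 0.04, 0.01, 0.06, 0.02],  # points in a different direction
]

full_ranking = rank_documents(query, docs, dim=8)   # all 8 dimensions
short_ranking = rank_documents(query, docs, dim=4)  # half the storage per vector
print(full_ranking, short_ranking)
```

In this toy case both rankings agree, which is the behavior Matryoshka training aims for: the shorter vectors cost half the storage and less compute per comparison while keeping the ordering. Real deployments would typically check retrieval quality at each candidate dimension before committing to one.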

Datasets: 1 · Results: 44 · Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

MTEB Leaderboard

Massive Text Embedding Benchmark across 8 task categories

Primary metric: accuracy
§ 03 · Top 10

Leading models.

Leading models on MTEB Leaderboard.

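| # | Model | mteb-score | Year | Source |
|---|-------|------------|------|--------|
| 1 | QZhou-Embedding | 76.0 | 2025 | paper |
| 2 | Qwen3-Embedding-8B | 75.2 | 2025 | paper |
| 3 | Jasper-Token-Compression-600M | 74.8 | 2025 | paper |
| 4 | Qwen3-Embedding-4B | 74.6 | 2025 | paper |
| 5 | LGAI-Embedding-Preview | 74.1 | 2025 | paper |
| 6 | F2LLM-4B | 73.7 | 2025 | paper |
| 7 | gemini-embedding-001 | 73.3 | 2025 | paper |
| 8 | F2LLM-v2-14B | 73.1 | 2026 | paper |
| 9 | F2LLM-v2-8B | 72.9 | 2026 | paper |
| 10 | F2LLM-v2-4B | 72.4 | 2026 | paper |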

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

MTEB Leaderboard (canonical) · 44 results · metric: accuracy · top: QZhou-Embedding, 76.0
§ 05 · Related tasks

Other tasks in Natural Language Processing.

Fill-Mask · Named Entity Recognition · Natural Language Inference · Polish Conversation Quality · Polish Cultural Competency · Polish Emotional Intelligence · Polish LLM General · Polish Text Understanding