
Feature Extraction

Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 as instruction-tuned models like E5-Mistral and GTE-Qwen2 turned decoder-only LLMs into embedding engines, while compact open models like Nomic Embed closed in on far larger systems, pushing average MTEB scores past 70 across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality: a 7B-parameter embedding model handily beats a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to any dimension without retraining, making deployment flexible across latency and storage budgets.


Feature extraction produces dense vector representations (embeddings) from text, enabling semantic search, clustering, and downstream ML tasks. Sentence-transformers (SBERT) democratized the field, and modern embedding models like E5, GTE, and Nomic-embed achieve remarkably strong performance at small model sizes. Embeddings are the hidden backbone of every RAG and search system.

History

2013

Word2Vec (Mikolov et al.) produces the first widely-used dense word embeddings via skip-gram and CBOW

2014

GloVe (Pennington et al.) combines co-occurrence statistics with dense embeddings; becomes a standard feature input

2018

ELMo (Peters et al.) introduces contextualized word embeddings from BiLSTM language models

2018

BERT's [CLS] token and hidden states become the default feature extraction method for transfer learning

2019

Sentence-BERT (Reimers & Gurevych) fine-tunes BERT with siamese networks for sentence-level embeddings

2022

E5 (Wang et al., Microsoft) and Instructor show that contrastive pretraining with instructions produces superior embeddings

2023

Cohere embed-v3 launches and commercial embedding APIs go mainstream; open-source BGE and GTE model families close the gap

2024

OpenAI text-embedding-3 and Nomic-embed-text launch; GTE-Qwen2 and NV-Embed (NVIDIA) top the MTEB leaderboard; Matryoshka embeddings enable flexible dimensionality

2025

ModernBERT-embed and Arctic-embed push open-source embeddings to parity with commercial APIs on MTEB

How Feature Extraction Works

Feature Extraction Pipeline
1

Tokenization

Input text is split into subword tokens; special tokens ([CLS], [SEP]) are added for pooling boundaries

2

Transformer encoding

Tokens pass through 6-24 transformer layers; each layer refines contextualized representations

3

Pooling

Token representations are aggregated into a single vector via mean pooling (preferred) or [CLS] token extraction

4

Normalization

The output vector is L2-normalized to unit length, enabling cosine similarity as a simple dot product

5

Contrastive training

Models are trained with InfoNCE loss on positive/negative text pairs to push similar texts closer in vector space
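Steps 3-5 above can be sketched in a few lines of NumPy on synthetic token embeddings. This is a minimal illustration of the mechanics (mask-aware mean pooling, L2 normalization, and InfoNCE with in-batch negatives), not any particular model's implementation:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Step 3: average token vectors, ignoring padding positions."""
    mask = attention_mask[:, :, None]              # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)                      # number of real tokens
    return summed / counts

def l2_normalize(x, eps=1e-12):
    """Step 4: scale each vector to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(query_emb, doc_emb, temperature=0.05):
    """Step 5: InfoNCE with in-batch negatives. Row i of doc_emb is
    the positive for query i; every other row serves as a negative."""
    sims = query_emb @ doc_emb.T / temperature     # (batch, batch)
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 8))    # 2 texts, 5 tokens each, 8-dim
mask = np.array([[1, 1, 1, 0, 0],      # first text has 2 padding tokens
                 [1, 1, 1, 1, 1]], dtype=float)

emb = l2_normalize(mean_pool(tokens, mask))
# Because the vectors are unit-length, cosine similarity reduces to
# a plain dot product:
cos = emb[0] @ emb[1]
```

In a real model the `tokens` tensor would be the last hidden state of the transformer encoder; everything downstream of it works exactly as shown.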

Current Landscape

Feature extraction / text embeddings in 2025 are a mature, commoditized capability. The MTEB leaderboard shows dozens of models achieving similar top-tier performance, and differentiation is now on dimensionality, speed, and specialization rather than raw quality. Open-source models (Nomic, GTE, E5) have reached parity with commercial APIs. The trend is toward instruction-aware embeddings that adapt their representation based on the task description, and Matryoshka training that lets you truncate dimensions at inference time to trade quality for speed and storage.
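Mechanically, the Matryoshka truncation mentioned above is just slicing and re-normalizing. A sketch on a random unit vector (a Matryoshka-trained model, unlike random data, concentrates the coarsest semantic information in the leading dimensions, which is what makes the prefix usable):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` dimensions and re-normalize to unit length,
    so cosine similarity still works as a dot product afterwards."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(1)
full = rng.normal(size=1024)
full /= np.linalg.norm(full)           # a unit-length 1024-dim embedding

short = truncate_embedding(full, 256)  # 4x less storage per vector
```

The same trick applies at any prefix length the model was trained for, which is why one index can serve multiple latency and storage budgets.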

Key Challenges

Task specificity: embeddings optimized for search may perform poorly for classification and vice versa

Long document encoding: most models truncate at 512 tokens, losing information from longer texts

Cross-lingual alignment: embedding models struggle to place semantically equivalent texts from different languages close together in vector space

Domain shift: general-purpose embeddings underperform on specialized domains (biomedical, legal, code)

Evaluation: MTEB provides a holistic benchmark but individual use cases may not align with aggregate scores
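The 512-token limit noted above is commonly worked around by chunking: split the document into overlapping windows, embed each window, and either index the chunks individually or average their embeddings. A minimal word-level sketch (production systems chunk on model tokens rather than whitespace-split words):

```python
def chunk_text(text, window=512, overlap=64):
    """Split text into overlapping word windows so that no span longer
    than `window` words is silently lost to truncation."""
    words = text.split()
    if len(words) <= window:
        return [text]
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks

# A 1200-word document becomes three overlapping 512-word-max chunks.
doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, window=512, overlap=64)
```

Indexing chunks individually preserves the most information; averaging chunk embeddings gives a single document vector at the cost of blurring local detail.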

Quick Recommendations

Best overall (MTEB)

NV-Embed-v2 or GTE-Qwen2-7B-instruct

Top MTEB scores across retrieval, classification, and clustering subtasks

Production search/RAG

Nomic-embed-text-v1.5 or E5-large-v2

384-1024 dim, fast inference, strong retrieval quality with Matryoshka support

Multilingual embeddings

multilingual-e5-large

Covers 100+ languages with consistent cross-lingual alignment

API-based (no self-hosting)

OpenAI text-embedding-3-large

3072 dims, Matryoshka-compatible, strong MTEB performance with simple API

Lightweight / on-device

all-MiniLM-L6-v2

22M params, 384 dims, runs in <5ms — ideal for edge and mobile deployment

What's Next

The future of embeddings is multimodal (text + image + code in one vector space), late-interaction architectures (ColBERT-style) that preserve token-level information while remaining fast, and dynamic embeddings that update as documents change. Expect embedding models to merge with rerankers into unified retrieval models, and task-specific fine-tuning to become a one-line operation through adapter libraries.
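The ColBERT-style late interaction mentioned above keeps one vector per token and scores a query against a document by taking, for each query token, its maximum similarity to any document token, then summing over query tokens (MaxSim). A NumPy sketch with toy token vectors:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style relevance: for each query token, find its best
    match among the document tokens, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sims = q @ d.T                       # (n_query, n_doc) cosine matrix
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(2)
query = rng.normal(size=(4, 16))         # 4 query tokens, 16-dim each
# doc_a contains near-copies of the query tokens plus filler tokens;
# doc_b is entirely unrelated random tokens.
doc_a = np.vstack([query + 0.01 * rng.normal(size=(4, 16)),
                   rng.normal(size=(6, 16))])
doc_b = rng.normal(size=(10, 16))

relevant, irrelevant = maxsim_score(query, doc_a), maxsim_score(query, doc_b)
```

Because matching happens per token rather than on one pooled vector, exact-term signals survive; the cost is storing and comparing one vector per token instead of one per document.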

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
