
Feature Extraction

Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 as instruction-tuned models like E5-Mistral and GTE-Qwen2 turned decoder-only LLMs into embedding engines, while compact open models like Nomic Embed closed in on far larger systems, pushing average MTEB scores past 70 across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality: a 7B-parameter embedding model handily beats a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to any dimension without retraining, making deployment flexible across latency and storage budgets.


Feature extraction produces dense vector representations (embeddings) from text, enabling semantic search, clustering, and downstream ML tasks. Sentence-transformers (SBERT) democratized the field, and modern embedding models like E5, GTE, and Nomic-embed achieve remarkably strong performance at small model sizes. Embeddings are the hidden backbone of every RAG and search system.

History

2013

Word2Vec (Mikolov et al.) produces the first widely-used dense word embeddings via skip-gram and CBOW

2014

GloVe (Pennington et al.) combines co-occurrence statistics with dense embeddings; becomes a standard feature input

2018

ELMo (Peters et al.) introduces contextualized word embeddings from BiLSTM language models

2018

BERT's [CLS] token and hidden states become the default feature extraction method for transfer learning

2019

Sentence-BERT (Reimers & Gurevych) fine-tunes BERT with siamese networks for sentence-level embeddings

2022

E5 (Wang et al., Microsoft) and Instructor show that contrastive pretraining with instructions produces superior embeddings

2023

Cohere embed-v3 launches and commercial embedding APIs go mainstream; open-source BGE and GTE model families close the gap

2024

OpenAI text-embedding-3 and Nomic-embed-text launch; GTE-Qwen2 and NV-Embed (NVIDIA) top the MTEB leaderboard; Matryoshka embeddings enable flexible dimensionality

2025

ModernBERT-embed and Arctic-embed push open-source embeddings to parity with commercial APIs on MTEB

How Feature Extraction Works

Feature Extraction Pipeline
1

Tokenization

Input text is split into subword tokens; special tokens ([CLS], [SEP]) are added for pooling boundaries

2

Transformer encoding

Tokens pass through 6-24 transformer layers; each layer refines contextualized representations

3

Pooling

Token representations are aggregated into a single vector via mean pooling (preferred) or [CLS] token extraction

4

Normalization

The output vector is L2-normalized to unit length, enabling cosine similarity as a simple dot product

5

Contrastive training

Models are trained with InfoNCE loss on positive/negative text pairs to push similar texts closer in vector space
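Steps 3-5 above can be sketched in a few lines of NumPy on synthetic token embeddings. This is a minimal illustration of the mechanics (mask-aware mean pooling, L2 normalization, and InfoNCE with in-batch negatives), not any particular model's implementation:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Step 3: average token vectors, ignoring padding positions."""
    mask = attention_mask[:, :, None]              # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1)                      # number of real tokens
    return summed / counts

def l2_normalize(x, eps=1e-12):
    """Step 4: scale each vector to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(query_emb, doc_emb, temperature=0.05):
    """Step 5: InfoNCE with in-batch negatives. Row i of doc_emb is
    the positive for query i; every other row serves as a negative."""
    sims = query_emb @ doc_emb.T / temperature     # (batch, batch)
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 8))    # 2 texts, 5 tokens each, 8-dim
mask = np.array([[1, 1, 1, 0, 0],      # first text has 2 padding tokens
                 [1, 1, 1, 1, 1]], dtype=float)

emb = l2_normalize(mean_pool(tokens, mask))
# Because the vectors are unit-length, cosine similarity reduces to
# a plain dot product:
cos = emb[0] @ emb[1]
```

In a real model the `tokens` tensor would be the last hidden state of the transformer encoder; everything downstream of it works exactly as shown.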

Current Landscape

Feature extraction / text embeddings in 2025 are a mature, commoditized capability. The MTEB leaderboard shows dozens of models achieving similar top-tier performance, and differentiation is now on dimensionality, speed, and specialization rather than raw quality. Open-source models (Nomic, GTE, E5) have reached parity with commercial APIs. The trend is toward instruction-aware embeddings that adapt their representation based on the task description, and Matryoshka training that lets you truncate dimensions at inference time to trade quality for speed and storage.
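Mechanically, the Matryoshka truncation mentioned above is just slicing and re-normalizing. A sketch on a random unit vector (a Matryoshka-trained model, unlike random data, concentrates the coarsest semantic information in the leading dimensions, which is what makes the prefix usable):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` dimensions and re-normalize to unit length,
    so cosine similarity still works as a dot product afterwards."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(1)
full = rng.normal(size=1024)
full /= np.linalg.norm(full)           # a unit-length 1024-dim embedding

short = truncate_embedding(full, 256)  # 4x less storage per vector
```

The same trick applies at any prefix length the model was trained for, which is why one index can serve multiple latency and storage budgets.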

Key Challenges

Task specificity: embeddings optimized for search may perform poorly for classification and vice versa

Long document encoding: most models truncate at 512 tokens, losing information from longer texts

Cross-lingual alignment: embedding models struggle to place semantically equivalent texts from different languages close together in vector space

Domain shift: general-purpose embeddings underperform on specialized domains (biomedical, legal, code)

Evaluation: MTEB provides a holistic benchmark but individual use cases may not align with aggregate scores
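The 512-token limit noted above is commonly worked around by chunking: split the document into overlapping windows, embed each window, and either index the chunks individually or average their embeddings. A minimal word-level sketch (production systems chunk on model tokens rather than whitespace-split words):

```python
def chunk_text(text, window=512, overlap=64):
    """Split text into overlapping word windows so that no span longer
    than `window` words is silently lost to truncation."""
    words = text.split()
    if len(words) <= window:
        return [text]
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks

# A 1200-word document becomes three overlapping 512-word-max chunks.
doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, window=512, overlap=64)
```

Indexing chunks individually preserves the most information; averaging chunk embeddings gives a single document vector at the cost of blurring local detail.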

Quick Recommendations

Best overall (MTEB)

NV-Embed-v2 or GTE-Qwen2-7B-instruct

Top MTEB scores across retrieval, classification, and clustering subtasks

Production search/RAG

Nomic-embed-text-v1.5 or E5-large-v2

384-1024 dim, fast inference, strong retrieval quality with Matryoshka support

Multilingual embeddings

multilingual-e5-large

Covers 100+ languages with consistent cross-lingual alignment

API-based (no self-hosting)

OpenAI text-embedding-3-large

3072 dims, Matryoshka-compatible, strong MTEB performance with simple API

Lightweight / on-device

all-MiniLM-L6-v2

22M params, 384 dims, runs in <5ms — ideal for edge and mobile deployment

What's Next

The future of embeddings is multimodal (text + image + code in one vector space), late-interaction architectures (ColBERT-style) that preserve token-level information while remaining fast, and dynamic embeddings that update as documents change. Expect embedding models to merge with rerankers into unified retrieval models, and task-specific fine-tuning to become a one-line operation through adapter libraries.
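The ColBERT-style late interaction mentioned above keeps one vector per token and scores a query against a document by taking, for each query token, its maximum similarity to any document token, then summing over query tokens (MaxSim). A NumPy sketch with toy token vectors:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style relevance: for each query token, find its best
    match among the document tokens, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sims = q @ d.T                       # (n_query, n_doc) cosine matrix
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(2)
query = rng.normal(size=(4, 16))         # 4 query tokens, 16-dim each
# doc_a contains near-copies of the query tokens plus filler tokens;
# doc_b is entirely unrelated random tokens.
doc_a = np.vstack([query + 0.01 * rng.normal(size=(4, 16)),
                   rng.normal(size=(6, 16))])
doc_b = rng.normal(size=(10, 16))

relevant, irrelevant = maxsim_score(query, doc_a), maxsim_score(query, doc_b)
```

Because matching happens per token rather than on one pooled vector, exact-term signals survive; the cost is storing and comparing one vector per token instead of one per document.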

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
