Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the task where transformers first proved their dominance over LSTMs. BERT (2018) set the template, but the real revolution came when instruction-tuned LLMs like GPT-4 and Llama 3 started matching fine-tuned classifiers zero-shot, threatening to make task-specific training obsolete. SST-2, AG News, and IMDB remain standard benchmarks, though the field increasingly cares about multilingual and low-resource performance where English-centric models still stumble. The open question: does a 70B parameter model doing classification via prompting actually beat a 100M fine-tuned encoder when you factor in latency and cost?
Text classification assigns labels to documents — sentiment, topic, intent, toxicity. Fine-tuned BERT variants dominated from 2018-2022, but GPT-4 and Claude now match or beat them zero-shot on most benchmarks. The real differentiator is cost: a distilled DeBERTa-v3 at 86M params handles production traffic at 1/1000th the cost of an API call.
History
2013: Word2Vec (Mikolov et al.) enables dense text representations, replacing bag-of-words for downstream classification
2014: Kim's CNN for sentence classification shows shallow convnets match RNNs on sentiment tasks
2018: BERT (Devlin et al.) achieves SOTA across all 9 GLUE tasks, including SST-2 and MNLI, with bidirectional pretraining
2019: RoBERTa (Liu et al.) pushes GLUE to 88.5 with a better pretraining recipe: more data, longer training, dynamic masking
2020: Prompt-based classification emerges: GPT-3 few-shot rivals fine-tuned models on many tasks
2021: DeBERTa (He et al.) introduces disentangled attention, surpassing the human baseline on SuperGLUE
2022: SetFit (Tunstall et al.) achieves competitive accuracy with only 8 labeled examples per class using contrastive fine-tuning
2023: GPT-4 zero-shot matches fine-tuned DeBERTa on SST-2 (96.4%), shifting the economics of text classification
2024: DeBERTa-v3-large remains the cost-performance king for production deployments; ModernBERT offers an updated encoder architecture
How Text Classification Works
Tokenization
Text is split into subword tokens using BPE (GPT) or WordPiece (BERT), typically 128-512 tokens per input
Encoding
Tokens pass through transformer layers that build contextualized representations via self-attention
Pooling
The [CLS] token representation (BERT-style) or mean pooling aggregates the sequence into a fixed-size vector
Classification head
A linear layer maps the pooled vector to class logits; softmax produces probability distribution over labels
Fine-tuning
The entire model is fine-tuned on labeled data with cross-entropy loss, typically 3-5 epochs with learning rate ~2e-5
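The pipeline above can be sketched numerically. This is a minimal numpy illustration of the pooling, classification head, softmax, and cross-entropy steps; the encoder output is stubbed with random vectors, and the batch size, sequence length, and hidden size are illustrative, not prescribed by any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for the encoder: in a real model, contextualized token vectors
# come out of the transformer layers. Here: batch of 4 sequences,
# 128 tokens each, hidden size 768.
batch, seq_len, hidden, num_classes = 4, 128, 768, 3
token_states = rng.normal(size=(batch, seq_len, hidden))

# Pooling: mean over tokens. BERT-style models often use the [CLS]
# vector instead, i.e. token_states[:, 0].
pooled = token_states.mean(axis=1)                  # (batch, hidden)

# Classification head: one linear layer mapping to class logits.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)
logits = pooled @ W + b                             # (batch, num_classes)

# Softmax turns logits into a probability distribution over labels.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Cross-entropy loss against integer labels, minimized during fine-tuning.
labels = np.array([0, 2, 1, 0])
loss = -np.log(probs[np.arange(batch), labels]).mean()
print(probs.shape, float(loss))
```

During fine-tuning, the gradient of this loss flows back through the head and the whole encoder, typically with a learning rate around 2e-5 as noted above.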
Current Landscape
Text classification in 2025 is a solved problem in the narrow sense — accuracy on standard benchmarks is at or above human level. The real competition is on efficiency, cost, and edge deployment. Fine-tuned encoder models (DeBERTa, ModernBERT) dominate production because they're 100-1000x cheaper per inference than LLM API calls. LLMs dominate prototyping and zero-shot scenarios. The emerging middle ground is task-specific distillation: use GPT-4 to label data, then train a tiny model for deployment.
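The distillation workflow described here can be sketched end to end. In the toy version below, `teacher_label` is a stand-in for an LLM labeling call (an assumption for illustration, not a real API), and the "tiny model" is a bag-of-words logistic regression trained on the teacher's labels with plain gradient descent.

```python
import numpy as np

def teacher_label(text: str) -> int:
    """Stand-in for an LLM labeling call (e.g. prompting a frontier model
    with the text and label set). Here: a toy keyword rule, 1 = positive."""
    return 1 if "good" in text or "great" in text else 0

# Unlabeled production traffic, labeled by the teacher.
texts = ["great product", "bad service", "good value", "awful support",
         "great support", "bad value", "good product", "awful service"]
y = np.array([teacher_label(t) for t in texts])

# Tiny "student": bag-of-words features + logistic regression.
vocab = sorted({w for t in texts for w in t.split()})
X = np.array([[t.split().count(w) for w in vocab] for t in texts], float)

w = np.zeros(len(vocab))
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid
    w -= 0.5 * X.T @ (p - y) / len(y)     # gradient step on log-loss

preds = (1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(int)
print(preds.tolist())  # student reproduces the teacher's labels
```

Only the cheap student model serves production traffic; the expensive teacher is called once per training example, not once per request.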
Key Challenges
Class imbalance in real-world datasets — spam, toxicity, and fraud are rare events requiring stratified sampling and focal loss
Domain shift between pretraining data (web text) and target domains (medical, legal, financial) degrades zero-shot performance
Label noise in crowd-sourced annotations can cap effective accuracy below 95% regardless of model quality
Multilingual classification lacks high-quality labeled data for most of the world's 7,000+ languages
Latency constraints in production — BERT-base at 110M params is often too slow for real-time classification at scale
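Of the mitigations named above, focal loss is the most concrete. A minimal numpy sketch of the standard formulation (Lin et al., 2017); the gamma and alpha values are the paper's common defaults, not something this page specifies:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights easy, well-classified examples so
    training focuses on rare classes like spam or fraud.
    probs: (n, num_classes) softmax outputs; labels: (n,) int classes."""
    p_t = probs[np.arange(len(labels)), labels]        # prob of true class
    return float(np.mean(-alpha * (1 - p_t) ** gamma * np.log(p_t)))

# A confident correct prediction contributes almost nothing...
easy = np.array([[0.99, 0.01]])
# ...while a misclassified rare positive dominates the loss.
hard = np.array([[0.10, 0.90]])

print(focal_loss(easy, np.array([0])))   # ~2.5e-7: easy example is ignored
print(focal_loss(hard, np.array([0])))   # ~0.47: model focuses here
```

With plain cross-entropy the two examples would differ by a factor of about 200; the `(1 - p_t) ** gamma` term widens that gap by several more orders of magnitude, which is why focal loss helps when positives are rare.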
Quick Recommendations
Best accuracy (English)
DeBERTa-v3-large fine-tuned
Consistently tops GLUE/SuperGLUE; 304M params is manageable for most deployments
Zero-shot (no training data)
GPT-4o or Claude 3.5 Sonnet
Best zero-shot accuracy across diverse tasks without any fine-tuning
Low-latency production
DistilBERT or MiniLM-L6
6-layer distilled models run in <5ms on CPU with minimal accuracy loss
Few-shot (8-64 examples)
SetFit with all-MiniLM-L6-v2
Contrastive learning approach needs minimal labels and no prompts
Multilingual
XLM-RoBERTa-large
Covers 100 languages; fine-tunable for cross-lingual transfer
What's Next
The frontier is moving toward universal classifiers that handle arbitrary label sets without retraining (true zero-shot via natural language descriptions), efficient on-device models under 50M params, and multimodal classification that jointly reasons over text, images, and metadata. Expect ModernBERT and similar updated encoder architectures to replace the aging BERT/RoBERTa generation in production stacks.
Benchmarks & SOTA
GLUE
General Language Understanding Evaluation
Collection of 9 NLU tasks including sentiment analysis, textual entailment, and question answering. Standard benchmark for general language understanding.
State of the Art: DeBERTa-v3-large (Microsoft), 91.8 average score
SuperGLUE
More difficult successor to GLUE with 8 challenging tasks. Designed to be hard for current models.
State of the Art: DeBERTa-v3-large (Microsoft), 91.4 average score
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.