Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the task where transformers first proved their dominance over LSTMs. BERT (2018) set the template, but the real revolution came when instruction-tuned LLMs like GPT-4 and Llama 3 started matching fine-tuned classifiers zero-shot, threatening to make task-specific training obsolete. SST-2, AG News, and IMDB remain standard benchmarks, though the field increasingly cares about multilingual and low-resource performance where English-centric models still stumble. The open question: does a 70B parameter model doing classification via prompting actually beat a 100M fine-tuned encoder when you factor in latency and cost?
Text classification assigns labels to documents — sentiment, topic, intent, toxicity. Fine-tuned BERT variants dominated from 2018-2022, but GPT-4 and Claude now match or beat them zero-shot on most benchmarks. The real differentiator is cost: a distilled DeBERTa-v3 at 86M params handles production traffic at 1/1000th the cost of an API call.
History
2013: Word2Vec (Mikolov et al.) enables dense text representations, replacing bag-of-words for downstream classification
2014: Kim's CNN for sentence classification shows shallow convnets match RNNs on sentiment tasks
2018: BERT (Devlin et al.) achieves SOTA across all 9 GLUE tasks, including SST-2 and MNLI, with bidirectional pretraining
2019: RoBERTa (Liu et al.) pushes GLUE to 88.5 with a better pretraining recipe: more data, longer training, dynamic masking
2020: Prompt-based classification emerges: GPT-3 few-shot rivals fine-tuned models on many tasks
2021: DeBERTa (He et al.) introduces disentangled attention, surpassing the human baseline on SuperGLUE
2022: SetFit (Tunstall et al.) achieves competitive accuracy with only 8 labeled examples per class using contrastive fine-tuning
2023: GPT-4 zero-shot matches fine-tuned DeBERTa on SST-2 (96.4%), shifting the economics of text classification
2024: DeBERTa-v3-large remains the cost-performance king for production deployments; ModernBERT offers an updated encoder architecture
How Text Classification Works
Tokenization
Text is split into subword tokens using BPE (GPT) or WordPiece (BERT), typically 128-512 tokens per input
Encoding
Tokens pass through transformer layers that build contextualized representations via self-attention
Pooling
The [CLS] token representation (BERT-style) or mean pooling aggregates the sequence into a fixed-size vector
Classification head
A linear layer maps the pooled vector to class logits; softmax produces probability distribution over labels
Fine-tuning
The entire model is fine-tuned on labeled data with cross-entropy loss, typically 3-5 epochs with learning rate ~2e-5
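The pipeline above can be sketched numerically. This is a minimal numpy illustration of the pooling, classification head, softmax, and cross-entropy steps; the encoder output is stubbed with random vectors, and the batch size, sequence length, and hidden size are illustrative, not prescribed by any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for the encoder: in a real model, contextualized token vectors
# come out of the transformer layers. Here: batch of 4 sequences,
# 128 tokens each, hidden size 768.
batch, seq_len, hidden, num_classes = 4, 128, 768, 3
token_states = rng.normal(size=(batch, seq_len, hidden))

# Pooling: mean over tokens. BERT-style models often use the [CLS]
# vector instead, i.e. token_states[:, 0].
pooled = token_states.mean(axis=1)                  # (batch, hidden)

# Classification head: one linear layer mapping to class logits.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)
logits = pooled @ W + b                             # (batch, num_classes)

# Softmax turns logits into a probability distribution over labels.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Cross-entropy loss against integer labels, minimized during fine-tuning.
labels = np.array([0, 2, 1, 0])
loss = -np.log(probs[np.arange(batch), labels]).mean()
print(probs.shape, float(loss))
```

During fine-tuning, the gradient of this loss flows back through the head and the whole encoder, typically with a learning rate around 2e-5 as noted above.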
Current Landscape
Text classification in 2025 is a solved problem in the narrow sense — accuracy on standard benchmarks is at or above human level. The real competition is on efficiency, cost, and edge deployment. Fine-tuned encoder models (DeBERTa, ModernBERT) dominate production because they're 100-1000x cheaper per inference than LLM API calls. LLMs dominate prototyping and zero-shot scenarios. The emerging middle ground is task-specific distillation: use GPT-4 to label data, then train a tiny model for deployment.
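The distillation workflow described here can be sketched end to end. In the toy version below, `teacher_label` is a stand-in for an LLM labeling call (an assumption for illustration, not a real API), and the "tiny model" is a bag-of-words logistic regression trained on the teacher's labels with plain gradient descent.

```python
import numpy as np

def teacher_label(text: str) -> int:
    """Stand-in for an LLM labeling call (e.g. prompting a frontier model
    with the text and label set). Here: a toy keyword rule, 1 = positive."""
    return 1 if "good" in text or "great" in text else 0

# Unlabeled production traffic, labeled by the teacher.
texts = ["great product", "bad service", "good value", "awful support",
         "great support", "bad value", "good product", "awful service"]
y = np.array([teacher_label(t) for t in texts])

# Tiny "student": bag-of-words features + logistic regression.
vocab = sorted({w for t in texts for w in t.split()})
X = np.array([[t.split().count(w) for w in vocab] for t in texts], float)

w = np.zeros(len(vocab))
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid
    w -= 0.5 * X.T @ (p - y) / len(y)     # gradient step on log-loss

preds = (1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(int)
print(preds.tolist())  # student reproduces the teacher's labels
```

Only the cheap student model serves production traffic; the expensive teacher is called once per training example, not once per request.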
Key Challenges
Class imbalance in real-world datasets — spam, toxicity, and fraud are rare events requiring stratified sampling and focal loss
Domain shift between pretraining data (web text) and target domains (medical, legal, financial) degrades zero-shot performance
Label noise in crowd-sourced annotations can cap effective accuracy below 95% regardless of model quality
Multilingual classification lacks high-quality labeled data for most of the world's 7,000+ languages
Latency constraints in production — BERT-base at 110M params is often too slow for real-time classification at scale
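Of the mitigations named above, focal loss is the most concrete. A minimal numpy sketch of the standard formulation (Lin et al., 2017); the gamma and alpha values are the paper's common defaults, not something this page specifies:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Focal loss: down-weights easy, well-classified examples so
    training focuses on rare classes like spam or fraud.
    probs: (n, num_classes) softmax outputs; labels: (n,) int classes."""
    p_t = probs[np.arange(len(labels)), labels]        # prob of true class
    return float(np.mean(-alpha * (1 - p_t) ** gamma * np.log(p_t)))

# A confident correct prediction contributes almost nothing...
easy = np.array([[0.99, 0.01]])
# ...while a misclassified rare positive dominates the loss.
hard = np.array([[0.10, 0.90]])

print(focal_loss(easy, np.array([0])))   # ~2.5e-7: easy example is ignored
print(focal_loss(hard, np.array([0])))   # ~0.47: model focuses here
```

With plain cross-entropy the two examples would differ by a factor of about 200; the `(1 - p_t) ** gamma` term widens that gap by several more orders of magnitude, which is why focal loss helps when positives are rare.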
Quick Recommendations
Best accuracy (English)
DeBERTa-v3-large fine-tuned
Consistently tops GLUE/SuperGLUE; 304M params is manageable for most deployments
Zero-shot (no training data)
GPT-4o or Claude 3.5 Sonnet
Best zero-shot accuracy across diverse tasks without any fine-tuning
Low-latency production
DistilBERT or MiniLM-L6
6-layer distilled models run in <5ms on CPU with minimal accuracy loss
Few-shot (8-64 examples)
SetFit with all-MiniLM-L6-v2
Contrastive learning approach needs minimal labels and no prompts
Multilingual
XLM-RoBERTa-large
Covers 100 languages; fine-tunable for cross-lingual transfer
What's Next
The frontier is moving toward universal classifiers that handle arbitrary label sets without retraining (true zero-shot via natural language descriptions), efficient on-device models under 50M params, and multimodal classification that jointly reasons over text, images, and metadata. Expect ModernBERT and similar updated encoder architectures to replace the aging BERT/RoBERTa generation in production stacks.
Benchmarks & SOTA
GLUE
General Language Understanding Evaluation
Collection of 9 NLU tasks including sentiment analysis, textual entailment, and question answering. Standard benchmark for general language understanding.
State of the Art: DeBERTa-v3-large (Microsoft), 91.8 average score
SuperGLUE
More difficult successor to GLUE with 8 challenging tasks. Designed to be hard for current models.
State of the Art: DeBERTa-v3-large (Microsoft), 91.4 average score
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.