Text classification
Text classification is the task of automatically assigning predefined categories or labels to text based on its content, typically using natural language processing (NLP) models. Common applications include sentiment analysis (e.g., positive/negative reviews), spam detection, intent detection, toxicity filtering, and topic categorization (e.g., organizing news articles).
Fine-tuned BERT variants dominated the task from 2018 to 2022, but GPT-4 and Claude now match or beat them zero-shot on most benchmarks. The real differentiator is cost: a distilled DeBERTa-v3 at 86M params handles production traffic at roughly 1/1000th the cost of an API call.
History
2013: Word2Vec (Mikolov et al.) enables dense text representations, replacing bag-of-words features for downstream classification
2014: Kim's CNN for sentence classification shows shallow convnets match RNNs on sentiment tasks
2018: BERT (Devlin et al.) achieves SOTA across the GLUE suite, including SST-2 and MNLI, with bidirectional pretraining
2019: RoBERTa (Liu et al.) pushes GLUE to 88.5 with a better pretraining recipe: more data, longer training, dynamic masking
2020: Prompt-based classification emerges: GPT-3 few-shot rivals fine-tuned models on many tasks
2021: DeBERTa (He et al.) introduces disentangled attention, surpassing the human baseline on SuperGLUE
2022: SetFit (Tunstall et al.) achieves competitive accuracy with only 8 labeled examples per class using contrastive fine-tuning
2023: GPT-4 zero-shot matches fine-tuned DeBERTa on SST-2 (96.4%), shifting the economics of text classification
2024-25: DeBERTa-v3-large remains the cost-performance king for production deployments; ModernBERT (late 2024) offers an updated encoder architecture
How Text Classification Works
Tokenization
Text is split into subword tokens using BPE (GPT) or WordPiece (BERT), typically 128-512 tokens per input
Encoding
Tokens pass through transformer layers that build contextualized representations via self-attention
Pooling
The [CLS] token representation (BERT-style) or mean pooling aggregates the sequence into a fixed-size vector
Classification head
A linear layer maps the pooled vector to class logits; softmax produces probability distribution over labels
Fine-tuning
The entire model is fine-tuned on labeled data with cross-entropy loss, typically 3-5 epochs with learning rate ~2e-5
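The five steps above map almost one-to-one onto a standard Hugging Face fine-tuning setup. A minimal sketch, assuming the transformers library and a BERT-style checkpoint; the model name, example texts, and hyperparameters are illustrative, not prescriptive:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint choice is an assumption; any BERT-style encoder works the same way.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Tokenization: text -> subword IDs, padded/truncated to a fixed length.
batch = tokenizer(
    ["great movie!", "utterly dull"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

# Encoding, [CLS] pooling, and the linear classification head all run inside
# the forward pass; passing labels also returns the cross-entropy loss.
out = model(**batch, labels=labels)
print(out.logits.softmax(dim=-1))

# Fine-tuning: one optimizer step shown; in practice loop for ~3-5 epochs.
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
out.loss.backward()
optim.step()
```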
Current Landscape
Text classification in 2025 is a solved problem in the narrow sense — accuracy on standard benchmarks is at or above human level. The real competition is on efficiency, cost, and edge deployment. Fine-tuned encoder models (DeBERTa, ModernBERT) dominate production because they're 100-1000x cheaper per inference than LLM API calls. LLMs dominate prototyping and zero-shot scenarios. The emerging middle ground is task-specific distillation: use GPT-4 to label data, then train a tiny model for deployment.
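A minimal sketch of that label-then-distill pattern, assuming the openai Python client and an API key in the environment; the prompt, model name, label set, and example texts are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
LABELS = ["positive", "negative"]

def llm_label(text: str) -> str:
    """Ask a frontier model for a single-word silver label."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Classify the sentiment as one of {LABELS}. "
                       f"Reply with the label only.\n\nText: {text}",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

# Step 1: label a raw corpus with the LLM.
silver = [(t, llm_label(t)) for t in ["great movie!", "utterly dull"]]
# Step 2: fine-tune a small encoder on the silver labels (see the sketch
# under "How Text Classification Works"), then serve the small model.
```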
Key Challenges
Class imbalance in real-world datasets: spam, toxicity, and fraud are rare events requiring stratified sampling and loss reweighting such as focal loss (see the sketch after this list)
Domain shift between pretraining data (web text) and target domains (medical, legal, financial) degrades zero-shot performance
Label noise in crowd-sourced annotations can cap effective accuracy below 95% regardless of model quality
Multilingual classification lacks high-quality labeled data for most of the world's 7,000+ languages
Latency constraints in production — BERT-base at 110M params is often too slow for real-time classification at scale
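For the class-imbalance point above, focal loss (Lin et al., 2017) down-weights easy examples so rare classes contribute more to the gradient. A minimal PyTorch sketch; gamma=2.0 is the paper's default, not a tuned value:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    # Per-example cross-entropy, left unreduced so it can be reweighted.
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # model's probability for the true class
    # (1 - pt)^gamma is ~0 for easy examples and ~1 for hard ones.
    return ((1.0 - pt) ** gamma * ce).mean()

# Drop-in replacement for F.cross_entropy in a fine-tuning loop.
loss = focal_loss(torch.randn(4, 2), torch.tensor([0, 1, 1, 0]))
```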
Quick Recommendations
Best accuracy (English)
DeBERTa-v3-large fine-tuned
Consistently tops GLUE/SuperGLUE; 304M params is manageable for most deployments
Zero-shot (no training data)
GPT-4o or Claude 3.5 Sonnet
Best zero-shot accuracy across diverse tasks without any fine-tuning
Low-latency production
DistilBERT or MiniLM-L6
6-layer distilled models run in <5ms on CPU with minimal accuracy loss
Few-shot (8-64 examples)
SetFit with all-MiniLM-L6-v2
Contrastive fine-tuning needs minimal labels and no prompts; see the sketch after this list
Multilingual
XLM-RoBERTa-large
Covers 100 languages; fine-tunable for cross-lingual transfer
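For the few-shot recommendation, a minimal SetFit sketch, assuming the setfit and datasets libraries; the texts are placeholders, and the trainer API varies slightly across setfit versions:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Contrastive fine-tuning of a small sentence-transformer plus a classifier head.
model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

train_ds = Dataset.from_dict({
    "text": ["great movie!", "utterly dull", "loved every minute", "a slog"],
    "label": [1, 0, 1, 0],  # in practice: ~8 labeled examples per class
})

trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()
print(model.predict(["best film of the year"]))
```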
What's Next
The frontier is moving toward universal classifiers that handle arbitrary label sets without retraining (true zero-shot via natural language descriptions), efficient on-device models under 50M params, and multimodal classification that jointly reasons over text, images, and metadata. Expect ModernBERT and similar updated encoder architectures to replace the aging BERT/RoBERTa generation in production stacks.
Benchmarks & SOTA
SuperGLUE
The more difficult successor to GLUE (2019), with 8 challenging tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC) designed to be hard for then-current models. The leaderboard saturated above the human baseline (89.8) around 2022 and has seen no frontier submissions since; frontier evaluation has moved on to MMLU, GPQA, BIG-Bench Hard, and HELM.
State of the Art: Vega v2 (6B), JD Explore Academy, 91.3 (SuperGLUE avg)
GLUE (dev)
General Language Understanding Evaluation (GLUE)
GLUE (2018) is a widely used benchmark suite for evaluating natural language understanding (NLU) systems. It aggregates nine sentence- or sentence-pair tasks drawn from established datasets (CoLA, SST-2, MRPC, STS-B, QQP, MNLI matched/mismatched, QNLI, RTE, and WNLI) and includes a hand-crafted diagnostic set (AX) for fine-grained linguistic analysis. The benchmark defines standard training/validation/test splits and an aggregate score; many papers report the aggregated dev-set score to compare models. GLUE was introduced in "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding" (Wang et al., 2018) and is hosted at the GLUE website and in major dataset libraries (Hugging Face, TensorFlow Datasets).
State of the Art: DeBERTa-v3-large, Microsoft, 91.4 (average score, dev)
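As a usage note, GLUE tasks load directly from the Hugging Face datasets library mentioned above; a minimal sketch (task names follow the hub's "glue" configurations):

```python
from datasets import load_dataset

# Each GLUE sub-task is a configuration of the "glue" dataset.
sst2 = load_dataset("glue", "sst2")   # single-sentence binary sentiment
mnli = load_dataset("glue", "mnli")   # premise/hypothesis entailment
print(sst2["train"][0])               # fields: sentence, label, idx
```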
Related Tasks
Machine Translation
Machine Translation is the task of automatically translating text from one natural language to another. The goal is to produce translations that preserve the meaning, style, and grammatical correctness of the source text while being fluent in the target language.
Language Modeling
Language Modeling is the task of predicting the next word or character in a sequence given the previous context. Language models learn the probability distribution of word sequences and are foundational for many NLP applications including text generation, machine translation, and speech recognition.