
Text classification

Text classification is the machine learning task of automatically assigning predefined categories or labels to text based on its content, typically using natural language processing (NLP). It involves analyzing a text to understand its meaning and then applying the most appropriate label; common applications include sentiment analysis (e.g., positive/negative reviews), spam detection, and topic categorization (e.g., organizing news articles).


Text classification assigns labels to documents: sentiment, topic, intent, toxicity. Fine-tuned BERT variants dominated from 2018 to 2022, but GPT-4 and Claude now match or beat them zero-shot on most benchmarks. The real differentiator is cost: a distilled DeBERTa-v3 at 86M parameters handles production traffic at roughly 1/1000th the cost of an API call.

History

2013

Word2Vec (Mikolov et al.) enables dense text representations, replacing bag-of-words for downstream classification

2014

Kim's CNN for sentence classification shows shallow convnets match RNNs on sentiment tasks

2018

BERT (Devlin et al.) achieves SOTA on SST-2, MNLI, and 9 other GLUE tasks with bidirectional pretraining

2019

RoBERTa (Liu et al.) pushes GLUE to 88.5 with better pretraining recipe — more data, longer training, dynamic masking

2020

DeBERTa (He et al.) introduces disentangled attention, surpassing human baseline on SuperGLUE

2021

Prompt-based classification emerges: GPT-3 few-shot rivals fine-tuned models on many tasks

2022

SetFit (Tunstall et al.) achieves competitive accuracy with only 8 labeled examples per class using contrastive fine-tuning

2023

GPT-4 zero-shot matches fine-tuned DeBERTa on SST-2 (96.4%), shifting economics of text classification

2024

DeBERTa-v3-large remains the cost-performance king for production deployments; ModernBERT offers updated architecture

How Text classification Works

Text classification Pipeline
1

Tokenization

Text is split into subword tokens using BPE (GPT) or WordPiece (BERT), typically 128-512 tokens per input

2

Encoding

Tokens pass through transformer layers that build contextualized representations via self-attention

3

Pooling

The [CLS] token representation (BERT-style) or mean pooling aggregates the sequence into a fixed-size vector

4

Classification head

A linear layer maps the pooled vector to class logits; softmax produces probability distribution over labels

5

Fine-tuning

The entire model is fine-tuned on labeled data with cross-entropy loss, typically 3-5 epochs with learning rate ~2e-5
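Steps 3 to 5 above can be sketched numerically. The following is a minimal numpy illustration with random values standing in for a trained encoder and head; it is not a real model, just the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 6 token vectors of size 8 from a hypothetical encoder,
# and a randomly initialized classification head over 3 labels.
hidden = rng.normal(size=(6, 8))   # (seq_len, hidden_size) token representations
W = rng.normal(size=(8, 3))        # linear head weights
b = np.zeros(3)                    # linear head bias

# Step 3: mean pooling aggregates the sequence into one fixed-size vector.
pooled = hidden.mean(axis=0)       # (hidden_size,)

# Step 4: linear layer -> logits -> softmax probability distribution over labels.
logits = pooled @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Step 5 (the training signal): cross-entropy loss against a gold label.
gold = 1
loss = -np.log(probs[gold])

print(probs, probs.argmax(), loss)
```

In a real fine-tuning run, the gradient of this loss flows back through the head and all encoder layers, typically for 3-5 epochs at a learning rate near 2e-5.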

Current Landscape

Text classification in 2025 is a solved problem in the narrow sense — accuracy on standard benchmarks is at or above human level. The real competition is on efficiency, cost, and edge deployment. Fine-tuned encoder models (DeBERTa, ModernBERT) dominate production because they're 100-1000x cheaper per inference than LLM API calls. LLMs dominate prototyping and zero-shot scenarios. The emerging middle ground is task-specific distillation: use GPT-4 to label data, then train a tiny model for deployment.
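The label-then-distill pattern can be sketched end to end. Here the "teacher" is a stubbed function standing in for an LLM labeling call (a hypothetical stand-in, not a real API), and the "student" is a deliberately tiny bag-of-words nearest-centroid classifier trained on the teacher's labels.

```python
from collections import Counter, defaultdict

def teacher_label(text):
    # Hypothetical stand-in for an expensive LLM labeling call.
    return "positive" if "great" in text or "love" in text else "negative"

# Unlabeled production-style text gets annotated once by the teacher...
unlabeled = ["great movie, loved it", "terrible plot", "love this", "boring and slow"]
labeled = [(t, teacher_label(t)) for t in unlabeled]

# ...then a cheap student is trained: one word-count centroid per class.
centroids = defaultdict(Counter)
for text, label in labeled:
    centroids[label].update(text.split())

def student_predict(text):
    # Score each class by word overlap with its centroid.
    words = text.split()
    return max(centroids, key=lambda c: sum(centroids[c][w] for w in words))

print(student_predict("what a great film"))
```

At deployment time only the student runs, which is where the 100-1000x cost advantage over per-request API calls comes from; in practice the student would be a distilled transformer rather than a word-count model.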

Key Challenges

Class imbalance in real-world datasets — spam, toxicity, and fraud are rare events requiring stratified sampling and focal loss

Domain shift between pretraining data (web text) and target domains (medical, legal, financial) degrades zero-shot performance

Label noise in crowd-sourced annotations can cap effective accuracy below 95% regardless of model quality

Multilingual classification lacks high-quality labeled data for most of the world's 7,000+ languages

Latency constraints in production — BERT-base at 110M params is often too slow for real-time classification at scale
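The focal loss mentioned in the class-imbalance bullet is straightforward to write down; a minimal numpy version, evaluated on the predicted probability of the true class:

```python
import numpy as np

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Focal loss on the predicted probability of the true class.

    The (1 - p)^gamma factor down-weights easy, confidently-correct
    examples so training gradient concentrates on rare or hard ones;
    alpha is a class-rebalancing weight.
    """
    p_true = np.asarray(p_true, dtype=float)
    return -alpha * (1.0 - p_true) ** gamma * np.log(p_true)

easy = focal_loss(0.95)  # confident correct prediction: near-zero loss
hard = focal_loss(0.10)  # misclassified rare event: large loss
print(easy, hard)
```

With gamma=0 and alpha=1 this reduces to ordinary cross-entropy; raising gamma increasingly suppresses the loss contribution of the abundant easy class.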

Quick Recommendations

Best accuracy (English)

DeBERTa-v3-large fine-tuned

Consistently tops GLUE/SuperGLUE; 304M params is manageable for most deployments

Zero-shot (no training data)

GPT-4o or Claude 3.5 Sonnet

Best zero-shot accuracy across diverse tasks without any fine-tuning

Low-latency production

DistilBERT or MiniLM-L6

6-layer distilled models run in <5ms on CPU with minimal accuracy loss

Few-shot (8-64 examples)

SetFit with all-MiniLM-L6-v2

Contrastive learning approach needs minimal labels and no prompts

Multilingual

XLM-RoBERTa-large

Covers 100 languages; fine-tunable for cross-lingual transfer
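The SetFit recommendation above relies on turning a handful of labels into many contrastive training pairs. This sketch shows only that pair-generation step with made-up example sentences; the actual embedding fine-tuning would use a sentence-transformers model such as all-MiniLM-L6-v2.

```python
from itertools import combinations

# A few labeled examples per class (illustrative data).
examples = [
    ("refund never arrived", "complaint"),
    ("charged twice for one order", "complaint"),
    ("how do I reset my password", "question"),
    ("what are your opening hours", "question"),
]

# Same-class pairs become positives (label 1), cross-class pairs
# negatives (label 0) for contrastive fine-tuning of the encoder.
pairs = [
    (t1, t2, 1 if l1 == l2 else 0)
    for (t1, l1), (t2, l2) in combinations(examples, 2)
]

positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
print(len(pairs), len(positives), len(negatives))
```

The quadratic blow-up is the point: n examples per class yield far more than n training signals, which is why 8 labels per class can be enough. After the encoder is tuned on these pairs, a simple logistic head is fit on the resulting embeddings.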

What's Next

The frontier is moving toward universal classifiers that handle arbitrary label sets without retraining (true zero-shot via natural language descriptions), efficient on-device models under 50M params, and multimodal classification that jointly reasons over text, images, and metadata. Expect ModernBERT and similar updated encoder architectures to replace the aging BERT/RoBERTa generation in production stacks.

Benchmarks & SOTA

SuperGLUE


2019 · 7 results

More difficult successor to GLUE with 8 challenging tasks. Designed to be hard for current models.

State of the Art

DeBERTa-v3-large

Microsoft

91.4

average score

GLUE

GLUE & SuperGLUE

2018 · 5 results

The General Language Understanding Evaluation (GLUE, 2018) and its harder successor SuperGLUE (2019) are multi-task NLU benchmarks; together they cover sub-tasks such as CoLA, SST-2, MRPC, STS-B, and MNLI from GLUE and BoolQ, COPA, WSC, and ReCoRD from SuperGLUE. The leaderboard saturated near the human baseline of about 91 in 2022 and has seen no frontier submissions since; frontier evaluation has moved to MMLU, GPQA, BIG-Bench Hard, and HELM.

State of the Art

Vega v2 (6B)

JD Explore Academy

91.3

SuperGLUE avg

GLUE (dev)

General Language Understanding Evaluation (GLUE)

0 results

GLUE (General Language Understanding Evaluation) is a widely-used benchmark suite for evaluating natural language understanding (NLU) systems. It aggregates nine sentence- or sentence-pair tasks drawn from established datasets — CoLA, SST-2, MRPC, STS-B, QQP, MNLI (matched/mismatched), QNLI, RTE, and WNLI — and also includes a hand-crafted diagnostic set (AX) for fine-grained linguistic analysis. The benchmark defines standard training/validation/test splits and an aggregate score (commonly reported on the dev or test sets) to summarize overall NLU performance; many papers report the GLUE dev-set aggregated score to compare models. GLUE was introduced in the paper “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding” (Wang et al., 2018) and is hosted at the GLUE website and in major dataset libraries (Hugging Face, TensorFlow Datasets).

No results tracked yet
