Text Classification
Assign labels to text automatically. From hand-crafted rules in the 1960s to zero-shot LLM prompting today — and when each approach still makes sense.
What is Text Classification?
Text classification assigns predefined categories to text. Given an input document, email, tweet, or sentence, the model outputs one or more labels with confidence scores. It is among the most widely deployed NLP tasks in production — powering spam filters, customer support routing, content moderation, medical coding, legal document triage, and sentiment dashboards across the industry.
The task sounds simple, but the design space is enormous. You can classify with a regex, a Naive Bayes model, a fine-tuned BERT, or a prompted GPT-4. The right choice depends on your accuracy requirements, latency budget, label set stability, and how much labeled data you have. This lesson covers all of them.
Sentiment Analysis
Classify text as positive, negative, or neutral. Used for product reviews, social media monitoring, brand tracking, and earnings call analysis.
Topic Classification
Categorize documents by subject. News articles, support tickets, research papers, regulatory filings.
Intent Detection
Understand user goals in conversational AI. Route customer queries, trigger workflows, escalate issues.
Content Moderation
Filter harmful content at scale. Hate speech, spam, misinformation, NSFW content, policy violations.
60 Years of Classifying Text
Text classification has been reinvented at least five times. Each generation solved a real limitation of the last, and each — including the current LLM era — introduced new trade-offs that the next generation will have to address. Understanding this progression is the fastest way to build intuition for which tool to reach for.
The history matters because every approach is still in active production use somewhere. Naive Bayes still powers spam filters at ISPs. SVMs still classify medical records. BERT fine-tunes still handle the majority of high-throughput classification. Knowing the lineage helps you pick the right tool, not just the newest one.
Naive Bayes — The First Text Classifier
In the early 1960s, Maron and Kuhns at the RAND Corporation published "On Relevance, Probabilistic Indexing and Information Retrieval," applying Bayes' theorem to classify documents by subject. The "naive" assumption — that words in a document are conditionally independent given the class — is wildly wrong linguistically but works shockingly well in practice. The reason: classification doesn't need to model language perfectly; it only needs to find features that discriminate between classes.
# Naive Bayes: P(class | document) ∝ P(class) × ∏ P(word_i | class)
# "naive" = assume each word is independent given the class
P(spam | "free money click now") ∝
P(spam) × P("free"|spam) × P("money"|spam) × P("click"|spam) × P("now"|spam)
# 0.3 × 0.08 × 0.05 × 0.04 × 0.02
P(ham | "free money click now") ∝
P(ham) × P("free"|ham) × P("money"|ham) × P("click"|ham) × P("now"|ham)
# 0.7 × 0.01 × 0.02 × 0.005 × 0.03
# Spam wins by a large margin despite the independence assumption being wrong

Naive Bayes dominated text classification for three decades. Paul Graham's 2002 essay "A Plan for Spam" brought it mainstream, and SpamAssassin used it to filter millions of emails daily. Even today, it remains one of the strongest baselines — if your system can't beat Naive Bayes, something is wrong with your pipeline.
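The arithmetic above can be turned into a working classifier in a few lines of plain Python. This is a minimal multinomial Naive Bayes sketch with Laplace smoothing over a made-up four-document corpus, not a production spam filter:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class priors and smoothed per-class word likelihoods."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc.lower().split())
    vocab = {w for counter in counts.values() for w in counter}
    def log_likelihood(word, c):
        # Laplace (add-one) smoothing so unseen words don't zero out a class
        return math.log((counts[c][word] + 1) / (sum(counts[c].values()) + len(vocab)))
    return priors, log_likelihood

def predict(text, priors, log_likelihood):
    # log P(class) + sum of log P(word | class): the naive independence assumption
    scores = {c: math.log(p) + sum(log_likelihood(w, c) for w in text.lower().split())
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = ["free money now", "click to win money", "meeting at noon", "lunch at noon today"]
labels = ["spam", "spam", "ham", "ham"]
priors, ll = train_nb(docs, labels)
print(predict("free money click now", priors, ll))  # spam
```

Real Naive Bayes filters differ mainly in tokenization and feature weighting; the probabilistic core is exactly this.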
Support Vector Machines
Vapnik and colleagues developed SVMs at Bell Labs, and Thorsten Joachims (1998) showed they were devastatingly effective for text. The key insight: text classification involves high-dimensional, sparse feature spaces (thousands of unique words), and SVMs thrive in exactly this regime. They find the maximum-margin hyperplane separating classes — the decision boundary with the widest gap to the nearest training examples.
# SVM with tf-idf features — dominated text classification for a decade
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50000, ngram_range=(1, 2))),
    ('svm', LinearSVC(C=1.0))
])
classifier.fit(train_texts, train_labels)
predictions = classifier.predict(test_texts)
# Accuracy on 20 Newsgroups: ~85-88% — still a strong baseline

— Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML, 137–142.
SVMs with tf-idf features set the standard for over a decade. They handled high-dimensional sparse data gracefully, required modest compute, and generalized well from small training sets. The main limitation: they treated text as a bag of words with no understanding of word order, syntax, or meaning.
Logistic Regression & Feature Engineering
While SVMs found the maximum margin, logistic regression (also called Maximum Entropy or MaxEnt in NLP) offered calibrated probabilities — critical for systems that need to know how confident a prediction is, not just what the prediction is. The real innovation of this era was feature engineering: n-grams, character features, POS tags, gazetteer lookups, hand-crafted lexicons like LIWC and SentiWordNet. Kaggle competitions in 2010–2015 were won by teams with the best feature pipelines, not the best algorithms. The models were commodities; the features were the moat.
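A sketch of that era's recipe using scikit-learn's tf-idf plus logistic regression, on a tiny invented corpus (the texts and labels here are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus; real pipelines of this era stacked n-grams,
# character features, and lexicon counts on top of tf-idf
train_texts = [
    "great product, highly recommend", "love it, works perfectly",
    "terrible quality, broke instantly", "awful, complete waste of money",
]
train_labels = ["pos", "pos", "neg", "neg"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(C=1.0, max_iter=1000)),
])
clf.fit(train_texts, train_labels)

# Unlike LinearSVC, predict_proba returns a probability per class
probs = clf.predict_proba(["love this great product"])[0]
for label, p in zip(clf.classes_, probs):
    print(f"{label}: {p:.3f}")
```

The practical difference from `LinearSVC`: `predict_proba` gives you a per-class probability you can threshold, route on, or calibrate, instead of just a hard label.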
Convolutional Neural Networks for Text
In 2014, Yoon Kim published "Convolutional Neural Networks for Sentence Classification" — a deceptively simple architecture that applied 1D convolutions over pre-trained Word2Vec embeddings. Filters of widths 3, 4, and 5 captured local n-gram patterns, followed by max-pooling and a softmax classifier. Despite its simplicity, TextCNN beat previous methods on multiple benchmarks and became the go-to baseline for neural text classification.
The paper's lasting contribution was showing that pre-trained word vectors + a simple neural architecture could replace years of hand-crafted feature engineering. The era of "feature engineering as competitive advantage" was ending.
— Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP. 18,000+ citations.
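Kim's architecture is small enough to sketch as a single forward pass. This NumPy version uses random weights in place of trained filters and Word2Vec vectors, so it shows shapes and data flow rather than real predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, n_filters, n_classes = 20, 50, 8, 2

# Pre-trained embeddings would go here; random vectors stand in for Word2Vec
x = rng.normal(size=(seq_len, embed_dim))

def conv_maxpool(x, width, n_filters, rng):
    """1D convolution of the given width over the sequence, then max-over-time pooling."""
    W = rng.normal(size=(n_filters, width * embed_dim)) * 0.1
    windows = np.stack([x[i:i + width].ravel() for i in range(len(x) - width + 1)])
    feature_maps = np.maximum(windows @ W.T, 0)   # ReLU, shape (positions, n_filters)
    return feature_maps.max(axis=0)               # max-pool over positions -> (n_filters,)

# Filters of widths 3, 4, 5 capture local n-gram patterns
features = np.concatenate([conv_maxpool(x, w, n_filters, rng) for w in (3, 4, 5)])

# Final linear layer + softmax classifier
W_out = rng.normal(size=(n_classes, features.size)) * 0.1
logits = W_out @ features
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # two class probabilities summing to 1
```

Training (backprop through the filters and, optionally, the embeddings) is what the paper adds on top of this forward pass.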
BERT — The Inflection Point
In 2018, Devlin, Chang, Lee, and Toutanova at Google released BERT (Bidirectional Encoder Representations from Transformers) and NLP changed permanently. BERT was pre-trained on 3.3 billion words using masked language modeling and next sentence prediction, then fine-tuned on downstream tasks with just a single linear layer on top.
For text classification specifically, BERT established the pre-train then fine-tune paradigm: take a model that already understands language, add a classification head, and train on a few thousand labeled examples. This obliterated the need for feature engineering and set new state-of-the-art on every GLUE and SuperGLUE subtask simultaneously.
"We obtain new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), [and] SQuAD v1.1 question answering Test F1 to 93.2."
— Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. 90,000+ citations.
Variants followed rapidly: RoBERTa (Liu et al., 2019) showed BERT was undertrained and pushed accuracy higher with more data and longer training. DistilBERT (Sanh et al., 2019) distilled BERT into a model 40% smaller and 60% faster while retaining 97% of performance — the production workhorse. DeBERTa (He et al., 2021) introduced disentangled attention and surpassed human performance on the SuperGLUE benchmark.
Zero-shot Classification via NLI
In 2019, Yin, Hay, and Roth proposed a clever reformulation: instead of training a classifier for each label set, recast classification as natural language inference (NLI). Given a text and a hypothesis like "This text is about sports," an NLI model determines if the hypothesis is entailed by the text. The entailment probability becomes the classification score. No task-specific training needed.
This was a paradigm shift: for the first time, you could classify text into any categories you could describe in natural language, without a single labeled example. Facebook's BART-large-MNLI became the standard implementation, trained on the Multi-Genre Natural Language Inference corpus (Williams et al., 2018).
— Yin, W., Hay, J., & Roth, D. (2019). Benchmarking Zero-shot Text Classification. EMNLP.
LLM Prompting — Classification as Conversation
GPT-3 (Brown et al., 2020) demonstrated that large language models could perform classification through in-context learning: provide a few examples in the prompt, and the model classifies new inputs without any weight updates. GPT-4, Claude, and Gemini pushed this further — achieving near-SOTA accuracy on standard benchmarks through prompting alone, with the ability to explain their reasoning.
The trade-off is fundamental: LLMs offer maximum flexibility (any labels, any explanation, chain-of-thought reasoning) at the cost of 10–100x higher latency and cost compared to fine-tuned models. At 1,000 classifications per minute, a DistilBERT fine-tune costs pennies; GPT-4 costs dollars. The decision is always about volume, latency, and accuracy requirements.
— Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS. 40,000+ citations.
The throughline: 1961 → 2026
Six decades. One task. Five paradigm shifts: Naive Bayes → SVMs → TextCNN → BERT fine-tuning → zero-shot NLI → LLM prompting.
Each generation didn't kill the last. Naive Bayes still runs in production. SVMs still classify at ISPs. BERT fine-tunes handle the bulk of high-throughput classification. LLMs handle the long tail of ambiguous, novel, or low-volume tasks. The right tool depends on the constraints, not the calendar.
The Benchmarks: GLUE & SuperGLUE
Text classification progress is measured primarily against two benchmark suites created at NYU, the University of Washington, and DeepMind. Understanding them is essential because they define what "state of the art" means in this space.
GLUE (General Language Understanding Evaluation)
Released in 2018 by Wang, Singh, et al., GLUE aggregates 9 tasks into a single score. It was designed to be hard — but BERT-style models surpassed the human baseline within about a year, and the benchmark was considered "solved" by early 2020.
SST-2 (Sentiment)
Binary sentiment on movie reviews. 67K examples. Human: 97.8%
MNLI (NLI)
3-class entailment across 10 genres. 393K examples. Human: 92.0%
QQP (Paraphrase)
Are two Quora questions semantically equivalent? 364K pairs.
QNLI / RTE / WNLI
Question answering, textual entailment, coreference — all as classification.
— Wang, A. et al. (2018). GLUE: A Multi-Task Benchmark for NLU. ICLR.
SuperGLUE
Released in 2019 as GLUE's harder successor. More challenging tasks requiring commonsense reasoning, reading comprehension, and causal inference. DeBERTa (He et al., 2021) was the first model to surpass human performance (90.3 vs 89.8), followed by GPT-4.
BoolQ
Yes/no questions about Wikipedia passages. Human: 89.0%
CB (CommitmentBank)
3-class textual entailment requiring pragmatic inference.
WiC (Word in Context)
Does a polysemous word have the same sense in two sentences?
WSC (Winograd Schema)
Pronoun resolution requiring world knowledge. The hardest subtask.
— Wang, A. et al. (2019). SuperGLUE: A Stickier Benchmark for NLU. NeurIPS.
SST-2 Accuracy Over Time
SST-2 binary sentiment classification accuracy. Higher is better. The gap between SVM (2005) and DeBERTa (2021) is 15 percentage points — earned over 16 years of architectural innovation.
Why benchmarks matter (and where they fail)
GLUE and SuperGLUE measure general language understanding on clean, English data. Your production data is likely domain-specific, noisy, multilingual, or adversarial. A model that scores 97% on SST-2 might score 70% on your internal Slack messages. Always evaluate on your own data — benchmarks tell you what's possible, not what you'll get.
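One way to make that evaluation habitual: a small pure-Python harness you can point at any classifier's predictions. This sketch computes accuracy and macro-F1, which punishes models that ignore rare classes; the labels below are illustrative:

```python
from collections import defaultdict

def evaluate(y_true, y_pred):
    """Accuracy and macro-averaged F1 over whatever labels appear in the data."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for label in set(y_true) | set(y_pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)

y_true = ["pos", "pos", "neg", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]
acc, macro_f1 = evaluate(y_true, y_pred)
print(f"accuracy={acc:.2f} macro_f1={macro_f1:.2f}")  # accuracy=0.67 macro_f1=0.45
```

Note how macro-F1 (0.45) is far below accuracy (0.67) here: the model never predicts "neu" correctly, and macro averaging surfaces that.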
Zero-shot Classification: No Training Data Required
Zero-shot classification uses a model trained on natural language inference (NLI) to classify text into any categories you provide — no training required. The model was never taught your specific labels; it reasons about whether a hypothesis ("This text is about technology") is entailed by the input text.
Zero-shot with BART-MNLI
Hugging Face Transformers

from transformers import pipeline

# BART trained on Multi-Genre NLI (Williams et al., 2018)
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

# Single-label classification — labels are mutually exclusive
result = classifier(
    "The Federal Reserve raised interest rates by 25 basis points",
    candidate_labels=["finance", "politics", "technology", "sports"]
)
print(result['labels'][0], f"({result['scores'][0]:.3f})")
# finance (0.962)

# Multi-label classification — each label scored independently
result = classifier(
    "Apple announces AI chip for data centers, stock surges 5%",
    candidate_labels=["technology", "business", "science", "politics"],
    multi_label=True  # Key parameter: sigmoid instead of softmax
)
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.3f}")
#   technology: 0.953
#   business: 0.821
#   science: 0.231
#   politics: 0.018

How it works under the hood
For each candidate label, the model constructs a hypothesis: "This example is {label}." It then scores the entailment probability — how likely is it that the hypothesis follows from the premise (your input text)?
# Internally, for each label:
premise = "The Federal Reserve raised interest rates by 25 basis points"
hypothesis = "This example is finance."  # Constructed from label

# NLI model outputs: P(entailment), P(neutral), P(contradiction)
# P(entailment) becomes the classification score for "finance"
# Repeat for all labels, normalize with softmax (or sigmoid for multi-label)
Zero-shot with OpenAI / Claude
LLM Prompting

from openai import OpenAI
import json
client = OpenAI()
def classify_text(text: str, labels: list[str]) -> dict:
    """Zero-shot classification via structured LLM output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "system",
            "content": f"""Classify the text into one of these categories: {labels}.
Return JSON: {{"label": "...", "confidence": 0.0-1.0, "reasoning": "..."}}"""
        }, {
            "role": "user",
            "content": text
        }]
    )
    return json.loads(response.choices[0].message.content)
result = classify_text(
    "The service was okay, nothing special but not terrible either.",
    ["positive", "negative", "neutral"]
)
print(result)
# {"label": "neutral", "confidence": 0.85,
#  "reasoning": "Mixed signals — acknowledges it wasn't bad but..."}

LLM-based classification gives you explainability for free — the model can articulate why it chose a label. This is valuable for auditing, debugging, and building trust with stakeholders. The cost: ~100ms latency and ~$0.001 per classification with GPT-4o.
Fine-tuned Classifiers: Maximum Accuracy & Speed
For production systems with fixed categories and high throughput requirements, fine-tuning a pre-trained transformer on your labeled data delivers the best accuracy-to-latency ratio. The model learns your specific label semantics and domain vocabulary.
Sentiment with DistilBERT (Pre-trained)
SST-2 Fine-tuned — 91.3% accuracy

from transformers import pipeline

# Load DistilBERT fine-tuned on Stanford Sentiment Treebank
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Single prediction
result = sentiment("This movie was absolutely terrible and a waste of time.")
print(result)
# [{'label': 'NEGATIVE', 'score': 0.9998}]
# Batch processing — 10x faster than individual calls
texts = [
    "I love this product, best purchase ever!",
    "The service was awful, waited 3 hours.",
    "It's okay I guess, nothing remarkable.",
    "Incredible experience, exceeded all expectations!",
]
results = sentiment(texts, batch_size=32)
for text, res in zip(texts, results):
    print(f"  {res['label']:8s} ({res['score']:.3f}) | {text[:50]}")
# POSITIVE (0.999) | I love this product, best purchase ever!
# NEGATIVE (0.999) | The service was awful, waited 3 hours.
# POSITIVE (0.724) | It's okay I guess, nothing remarkable.
# POSITIVE (0.999) | Incredible experience, exceeded all expectations!

Fine-tune Your Own Classifier
Hugging Face Trainer API

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# 1. Load pre-trained model + tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g., positive/negative/neutral
)

# 2. Prepare your dataset
dataset = load_dataset("csv", data_files={
    "train": "train.csv",  # columns: text, label
    "test": "test.csv"
})

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# 3. Train — typically 2-5 epochs is enough
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,  # Standard for BERT fine-tuning
        weight_decay=0.01,
        eval_strategy="epoch",
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
# With 5,000 examples, expect ~90-95% accuracy on most binary tasks
# With 500 examples, expect ~80-85% — still often beats zero-shot

Fine-tuning takes 5–30 minutes on a single GPU. The result is a model that runs inference at 10–50ms per batch of 32 — 100x faster than an LLM API call. This is why most high-throughput production classification still uses fine-tuned transformers.
Pre-trained Models Worth Knowing
| Model | Task | Latency | Parameters |
|---|---|---|---|
| distilbert-base-uncased-finetuned-sst-2 | Sentiment (2-class) | ~5ms | 66M |
| cardiffnlp/twitter-roberta-base-sentiment | Sentiment (3-class) | ~10ms | 125M |
| facebook/bart-large-mnli | Zero-shot (any labels) | ~100ms | 407M |
| MoritzLaurer/DeBERTa-v3-large-mnli | Zero-shot (SOTA NLI) | ~200ms | 304M |
| SamLowe/roberta-base-go_emotions | 28 emotion labels | ~10ms | 125M |
Latency is approximate, measured on a single CPU core with batch size 1. GPU inference is 5–10x faster.
Multi-class vs Multi-label Classification
A subtle but critical distinction that determines your loss function, evaluation metrics, and inference logic.
Multi-class
Exactly one label per text. Labels are mutually exclusive. Use softmax + cross-entropy loss.
Sentiment analysis
[positive OR negative OR neutral]
# Softmax: probabilities sum to 1.0
logits = model(text)     # [2.1, -0.5, 0.3]
probs = softmax(logits)  # [0.81, 0.06, 0.13]
label = argmax(probs)    # "positive"
Multi-label
Zero or more labels per text. Labels are independent. Use sigmoid + binary cross-entropy per label.
Article tagging
[tech AND finance AND breaking]
# Sigmoid: each label scored independently
logits = model(text)     # [2.1, 1.8, -2.0, -0.1]
probs = sigmoid(logits)  # [0.89, 0.86, 0.12, 0.48]
labels = [l for l, p in zip(label_names, probs) if p > 0.5]
# ["tech", "finance"]
Common mistake
Using softmax for multi-label classification. If a news article is about both technology and business, softmax forces the probabilities to sum to 1, artificially suppressing one label. Use sigmoid (independent binary decisions per label) whenever multiple labels can co-occur. In zero-shot NLI, this is the multi_label=True parameter.
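The effect is easy to see numerically. In this toy example (invented logits for an article that is genuinely both tech and business), softmax pushes "business" below a 0.5 threshold while sigmoid lets both true labels pass:

```python
import math

labels = ["tech", "business", "science", "politics"]
logits = [2.1, 1.8, -2.0, -1.5]   # the article is both tech AND business

# Softmax: scores compete for a fixed probability mass of 1.0
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]

# Sigmoid: each label is an independent binary decision
sigmoid = [1 / (1 + math.exp(-z)) for z in logits]

for lab, s, g in zip(labels, softmax, sigmoid):
    print(f"{lab:9s} softmax={s:.2f} sigmoid={g:.2f}")

# Softmax suppresses "business" below 0.5; sigmoid keeps both true labels above it
assert [l for l, p in zip(labels, softmax) if p > 0.5] == ["tech"]
assert [l for l, p in zip(labels, sigmoid) if p > 0.5] == ["tech", "business"]
```

The logits are identical in both branches; only the normalization differs, and it alone decides whether co-occurring labels survive thresholding.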
Confidence Scores & Production Thresholds
Every classification model outputs probability scores. Turning those scores into production decisions requires thresholding — and the right threshold depends entirely on the cost of mistakes.
Production Threshold Strategy
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def classify_with_routing(text: str, auto_threshold=0.92, flag_threshold=0.65):
    """Three-tier classification with human-in-the-loop fallback.

    >= auto_threshold: Act automatically (high precision)
    >= flag_threshold: Flag for human review (balanced)
    <  flag_threshold: Reject / mark as uncertain
    """
    result = classifier(text)[0]
    score = result['score']
    label = result['label']
    if score >= auto_threshold:
        return {"label": label, "score": score, "action": "auto"}
    elif score >= flag_threshold:
        return {"label": label, "score": score, "action": "review"}
    else:
        return {"label": "UNCERTAIN", "score": score, "action": "skip"}

# High confidence — auto-process
print(classify_with_routing("This product is absolutely amazing!"))
# {'label': 'POSITIVE', 'score': 0.9998, 'action': 'auto'}

# Medium confidence — needs human review
print(classify_with_routing("It was fine I guess, met expectations."))
# {'label': 'POSITIVE', 'score': 0.7834, 'action': 'review'}

# Low confidence — skip or escalate
print(classify_with_routing("The product arrived."))
# {'label': 'UNCERTAIN', 'score': 0.5612, 'action': 'skip'}

High threshold (0.90+)
Auto-delete spam. Auto-route support tickets. Flag for compliance.
Optimizes for precision — when you act, you're almost always right.
Medium threshold (0.70–0.90)
Suggest labels for humans to confirm. Pre-sort review queues.
Balanced — useful for human-in-the-loop workflows.
Low threshold (0.50–0.70)
Catch all potentially harmful content. Never miss a fraud signal.
Optimizes for recall — you catch everything, even at the cost of false alarms.
Calibration warning
Model confidence scores are not calibrated probabilities by default. A model that outputs 0.90 is not necessarily right 90% of the time. Neural networks tend to be overconfident — they output scores close to 0 or 1 even when uncertain. For critical applications (medical, legal, financial), apply temperature scaling or Platt scaling to calibrate confidence scores before setting thresholds.
— Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML.
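A minimal temperature-scaling sketch, assuming you have held-out validation logits and labels (the values below are invented and deliberately overconfident). One scalar T is fit to minimize validation NLL, then divides every logit before softmax:

```python
import math

def nll(logits_list, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

# Invented validation set: large-margin logits, but two predictions are wrong,
# which is the signature of an overconfident model
val_logits = [[4.0, -3.0], [3.5, -2.5], [-3.0, 4.0], [2.0, -1.0], [-2.0, 2.5]]
val_labels = [0, 0, 1, 1, 0]

# Fit T by grid search (the paper uses LBFGS on the same objective)
best_T = min((t / 10 for t in range(1, 101)),
             key=lambda T: nll(val_logits, val_labels, T))

# T > 1 softens probabilities; the ranking of predictions is unchanged
p_raw = 1 / (1 + math.exp(-(4.0 - (-3.0))))
p_cal = 1 / (1 + math.exp(-(4.0 - (-3.0)) / best_T))
print(f"T={best_T:.1f}: confidence {p_raw:.3f} -> {p_cal:.3f}")
```

Because every logit is divided by the same T, accuracy is untouched; only the confidence scores move, which is exactly what you want before setting thresholds.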
Decision Framework: Which Approach to Use
The right approach depends on four variables: how much labeled data you have, how fast you need inference, whether your labels change, and your accuracy requirements.
Decision Matrix
| Scenario | Approach | Accuracy | Latency | Cost/1M |
|---|---|---|---|---|
| Exploring, labels unknown | LLM prompting | ~93% | ~500ms | $100+ |
| Labels fixed, no labeled data | Zero-shot NLI | ~88% | ~100ms | $5–20 |
| 500–5K labeled examples | Fine-tune DistilBERT | ~92% | ~5ms | $0.50 |
| 50K+ labeled, accuracy critical | Fine-tune DeBERTa | ~97% | ~20ms | $2 |
| Dynamic labels, need reasoning | LLM + structured output | ~95% | ~500ms | $50–200 |
Accuracy estimates are approximate for binary/3-class sentiment tasks. Your domain will vary. Cost is for self-hosted GPU (fine-tuned) or API pricing (LLM/NLI).
Start with zero-shot, graduate to fine-tuning
The best workflow: use BART-MNLI or LLM prompting to validate your label taxonomy and generate an initial labeled dataset. Once you have 1,000+ validated examples, fine-tune DistilBERT for production. This avoids the cold-start problem.
This is the approach used at most startups: prototype with GPT-4, ship with DistilBERT.
The cost crossover
At roughly 10,000 classifications per day, the cost of LLM API calls exceeds the cost of training and hosting a fine-tuned model. Below that volume, the engineering time to set up fine-tuning isn't worth it. This is the inflection point for most teams.
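The crossover is simple arithmetic. With illustrative prices (every number below is an assumption, not a quote from any provider), the break-even lands in the same ballpark as the figure above:

```python
# All numbers are illustrative assumptions; substitute your real pricing
llm_cost_per_call = 0.001    # assumed cost of one LLM API classification
hosting_per_month = 300.0    # assumed small-GPU hosting for a fine-tuned model
one_off_setup = 2000.0       # assumed labeling + fine-tuning engineering cost
months = 12                  # comparison horizon

def crossover_daily_volume():
    """Daily volume where LLM spend over the horizon matches the fine-tune route."""
    fine_tune_total = one_off_setup + hosting_per_month * months
    llm_spend_per_daily_call = llm_cost_per_call * 30 * months
    return fine_tune_total / llm_spend_per_daily_call

print(f"break-even ~ {crossover_daily_volume():,.0f} classifications/day")
```

Under these assumptions the break-even is roughly 15,000 classifications/day; cheaper hosting or pricier LLM calls pull it toward the ~10,000 figure cited above.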
Hybrid: LLM for hard cases, fine-tune for easy ones
Route high-confidence predictions through a fast fine-tuned model and send uncertain cases (confidence below 0.80) to an LLM for chain-of-thought reasoning. This gives you the speed of fine-tuned models for 85% of traffic and the accuracy of LLMs for the ambiguous tail.
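A sketch of that router. `fast_model` and `llm_classify` are hypothetical stand-ins (a toy heuristic and a constant) so the control flow runs end to end; in practice they would wrap your fine-tuned pipeline and your LLM call:

```python
CONFIDENCE_CUTOFF = 0.80

def fast_model(text: str) -> tuple[str, float]:
    """Hypothetical stand-in for a fine-tuned DistilBERT pipeline."""
    # Toy heuristic so the sketch runs; replace with pipeline(text)[0]
    if "amazing" in text:
        return "POSITIVE", 0.99
    return "POSITIVE", 0.61

def llm_classify(text: str) -> tuple[str, float]:
    """Hypothetical stand-in for an LLM call with chain-of-thought reasoning."""
    return "NEUTRAL", 0.90

def route(text: str) -> dict:
    label, score = fast_model(text)
    if score >= CONFIDENCE_CUTOFF:
        return {"label": label, "score": score, "backend": "fine-tuned"}
    # Uncertain: escalate the ambiguous tail to the slower, smarter model
    label, score = llm_classify(text)
    return {"label": label, "score": score, "backend": "llm"}

print(route("This is amazing!"))   # handled by the fast model
print(route("It arrived."))        # escalated to the LLM
```

The cutoff (0.80 here, per the text) is itself a tunable threshold: raising it sends more traffic to the LLM and shifts the cost/accuracy balance.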
Key Takeaways
1. Text classification has a 60-year history. Naive Bayes (1961), SVMs (1998), TextCNN (2014), BERT (2018), zero-shot NLI (2019), LLM prompting (2020). Each approach is still in production somewhere.
2. GLUE and SuperGLUE are the standard benchmarks. SST-2 measures binary sentiment. MNLI measures natural language inference. DeBERTa-v3 holds SOTA on both, but always evaluate on your data.
3. Zero-shot NLI is the fastest path to a working classifier. No training data, any labels, change them on the fly. Accuracy is 5–10 points below fine-tuned but sufficient for prototyping and low-volume production.
4. Fine-tuned DistilBERT is the production workhorse. 91.3% SST-2 accuracy at 5ms inference. Costs almost nothing to run. Start with zero-shot, graduate to fine-tuning when you have 1K+ labeled examples and 10K+ daily classifications.
5. Thresholds matter more than models. Setting the right confidence threshold for your use case — high for automation, low for flagging — often improves production outcomes more than switching to a better model.
Practice Exercise
Build a three-approach classifier and compare results:
1. Install dependencies: `pip install transformers torch scikit-learn datasets`
2. Classify 20 product reviews with DistilBERT (fine-tuned) and BART-MNLI (zero-shot). Compare accuracy and latency.
3. Try zero-shot with custom labels ("urgent", "question", "complaint", "praise") on the same texts.
4. Implement the three-tier threshold strategy. Find the threshold that gives zero false positives on your test set.
5. Bonus: Fine-tune DistilBERT on a small subset (100 examples from `datasets.load_dataset("imdb")`) and measure how accuracy scales with dataset size.
Further Reading
- Devlin et al. (2019) — The BERT paper. Essential reading for understanding the pre-train/fine-tune paradigm.
- Liu et al. (2019) — RoBERTa: shows BERT was undertrained and demonstrates an improved pre-training recipe.
- Yin, Hay, & Roth (2019) — The zero-shot text classification paper. Foundational for NLI-based approaches.
- He et al. (2021) — DeBERTa: disentangled attention, first model to surpass human performance on SuperGLUE.
- Wang et al. (2018) — GLUE benchmark paper. Defines the standard evaluation suite for NLU.