Text Classification
Assign labels to text automatically. From hand-crafted rules in the 1960s to zero-shot LLM prompting today — and when each approach still makes sense.
What is Text Classification?
Text classification assigns predefined categories to text. Given an input document, email, tweet, or sentence, the model outputs one or more labels with confidence scores. It is among the most widely deployed NLP tasks in production — powering spam filters, customer support routing, content moderation, medical coding, legal document triage, and sentiment dashboards across the industry.
The task sounds simple, but the design space is enormous. You can classify with a regex, a Naive Bayes model, a fine-tuned BERT, or a prompted GPT-4. The right choice depends on your accuracy requirements, latency budget, label set stability, and how much labeled data you have. This lesson covers all of them.
Sentiment Analysis
Classify text as positive, negative, or neutral. Used for product reviews, social media monitoring, brand tracking, and earnings call analysis.
Topic Classification
Categorize documents by subject. News articles, support tickets, research papers, regulatory filings.
Intent Detection
Understand user goals in conversational AI. Route customer queries, trigger workflows, escalate issues.
Content Moderation
Filter harmful content at scale. Hate speech, spam, misinformation, NSFW content, policy violations.
60 Years of Classifying Text
Text classification has been reinvented at least five times. Each generation solved a real limitation of the last, and each — including the current LLM era — introduced new trade-offs that the next generation will have to address. Understanding this progression is the fastest way to build intuition for which tool to reach for.
The history matters because every approach is still in active production use somewhere. Naive Bayes still powers spam filters at ISPs. SVMs still classify medical records. BERT fine-tunes still handle the majority of high-throughput classification. Knowing the lineage helps you pick the right tool, not just the newest one.
Naive Bayes — The First Text Classifier
In the early 1960s, Maron and Kuhns at the RAND Corporation published "On Relevance, Probabilistic Indexing and Information Retrieval," applying Bayes' theorem to classify documents by subject. The "naive" assumption — that words in a document are conditionally independent given the class — is wildly wrong linguistically but works shockingly well in practice. The reason: classification doesn't need to model language perfectly; it only needs to find features that discriminate between classes.
# Naive Bayes: P(class | document) ∝ P(class) × ∏ P(word_i | class)
# "naive" = assume each word is independent given the class
P(spam | "free money click now") ∝
P(spam) × P("free"|spam) × P("money"|spam) × P("click"|spam) × P("now"|spam)
# 0.3 × 0.08 × 0.05 × 0.04 × 0.02
P(ham | "free money click now") ∝
P(ham) × P("free"|ham) × P("money"|ham) × P("click"|ham) × P("now"|ham)
# 0.7 × 0.01 × 0.02 × 0.005 × 0.03
# Spam wins by a large margin despite the independence assumption being wrong

Naive Bayes dominated text classification for three decades. Paul Graham's 2002 essay "A Plan for Spam" brought it mainstream, and SpamAssassin used it to filter millions of emails daily. Even today, it remains one of the strongest baselines — if your system can't beat Naive Bayes, something is wrong with your pipeline.
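The arithmetic above can be turned into a working classifier in a few lines of plain Python. This is a minimal multinomial Naive Bayes sketch with Laplace smoothing over a made-up four-document corpus, not a production spam filter:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class priors and smoothed per-class word likelihoods."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, y in zip(docs, labels):
        counts[y].update(doc.lower().split())
    vocab = {w for counter in counts.values() for w in counter}
    def log_likelihood(word, c):
        # Laplace (add-one) smoothing so unseen words don't zero out a class
        return math.log((counts[c][word] + 1) / (sum(counts[c].values()) + len(vocab)))
    return priors, log_likelihood

def predict(text, priors, log_likelihood):
    # log P(class) + sum of log P(word | class): the naive independence assumption
    scores = {c: math.log(p) + sum(log_likelihood(w, c) for w in text.lower().split())
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = ["free money now", "click to win money", "meeting at noon", "lunch at noon today"]
labels = ["spam", "spam", "ham", "ham"]
priors, ll = train_nb(docs, labels)
print(predict("free money click now", priors, ll))  # spam
```

Real Naive Bayes filters differ mainly in tokenization and feature weighting; the probabilistic core is exactly this.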
Support Vector Machines
Vapnik and colleagues developed SVMs at Bell Labs, and Thorsten Joachims (1998) showed they were devastatingly effective for text. The key insight: text classification involves high-dimensional, sparse feature spaces (thousands of unique words), and SVMs thrive in exactly this regime. They find the maximum-margin hyperplane separating classes — the decision boundary with the widest gap to the nearest training examples.
# SVM with tf-idf features — dominated text classification for a decade
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50000, ngram_range=(1, 2))),
    ('svm', LinearSVC(C=1.0))
])
classifier.fit(train_texts, train_labels)
predictions = classifier.predict(test_texts)
# Accuracy on 20 Newsgroups: ~85-88% — still a strong baseline

— Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML, 137–142.
SVMs with tf-idf features set the standard for over a decade. They handled high-dimensional sparse data gracefully, required modest compute, and generalized well from small training sets. The main limitation: they treated text as a bag of words with no understanding of word order, syntax, or meaning.
Logistic Regression & Feature Engineering
While SVMs found the maximum margin, logistic regression (also called Maximum Entropy or MaxEnt in NLP) offered calibrated probabilities — critical for systems that need to know how confident a prediction is, not just what the prediction is. The real innovation of this era was feature engineering: n-grams, character features, POS tags, gazetteer lookups, hand-crafted lexicons like LIWC and SentiWordNet. Kaggle competitions in 2010–2015 were won by teams with the best feature pipelines, not the best algorithms. The models were commodities; the features were the moat.
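A sketch of that era's recipe using scikit-learn's tf-idf plus logistic regression, on a tiny invented corpus (the texts and labels here are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus; real pipelines of this era stacked n-grams,
# character features, and lexicon counts on top of tf-idf
train_texts = [
    "great product, highly recommend", "love it, works perfectly",
    "terrible quality, broke instantly", "awful, complete waste of money",
]
train_labels = ["pos", "pos", "neg", "neg"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(C=1.0, max_iter=1000)),
])
clf.fit(train_texts, train_labels)

# Unlike LinearSVC, predict_proba returns a probability per class
probs = clf.predict_proba(["love this great product"])[0]
for label, p in zip(clf.classes_, probs):
    print(f"{label}: {p:.3f}")
```

The practical difference from `LinearSVC`: `predict_proba` gives you a per-class probability you can threshold, route on, or calibrate, instead of just a hard label.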
Convolutional Neural Networks for Text
In 2014, Yoon Kim published "Convolutional Neural Networks for Sentence Classification" — a deceptively simple architecture that applied 1D convolutions over pre-trained Word2Vec embeddings. Filters of widths 3, 4, and 5 captured local n-gram patterns, followed by max-pooling and a softmax classifier. Despite its simplicity, TextCNN beat previous methods on multiple benchmarks and became the go-to baseline for neural text classification.
The paper's lasting contribution was showing that pre-trained word vectors + a simple neural architecture could replace years of hand-crafted feature engineering. The era of "feature engineering as competitive advantage" was ending.
— Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. EMNLP. 18,000+ citations.
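Kim's architecture is small enough to sketch as a single forward pass. This NumPy version uses random weights in place of trained filters and Word2Vec vectors, so it shows shapes and data flow rather than real predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, n_filters, n_classes = 20, 50, 8, 2

# Pre-trained embeddings would go here; random vectors stand in for Word2Vec
x = rng.normal(size=(seq_len, embed_dim))

def conv_maxpool(x, width, n_filters, rng):
    """1D convolution of the given width over the sequence, then max-over-time pooling."""
    W = rng.normal(size=(n_filters, width * embed_dim)) * 0.1
    windows = np.stack([x[i:i + width].ravel() for i in range(len(x) - width + 1)])
    feature_maps = np.maximum(windows @ W.T, 0)   # ReLU, shape (positions, n_filters)
    return feature_maps.max(axis=0)               # max-pool over positions -> (n_filters,)

# Filters of widths 3, 4, 5 capture local n-gram patterns
features = np.concatenate([conv_maxpool(x, w, n_filters, rng) for w in (3, 4, 5)])

# Final linear layer + softmax classifier
W_out = rng.normal(size=(n_classes, features.size)) * 0.1
logits = W_out @ features
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # two class probabilities summing to 1
```

Training (backprop through the filters and, optionally, the embeddings) is what the paper adds on top of this forward pass.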
BERT — The Inflection Point
In 2018, Devlin, Chang, Lee, and Toutanova at Google released BERT (Bidirectional Encoder Representations from Transformers) and NLP changed permanently. BERT was pre-trained on 3.3 billion words using masked language modeling and next sentence prediction, then fine-tuned on downstream tasks with just a single linear layer on top.
For text classification specifically, BERT established the pre-train then fine-tune paradigm: take a model that already understands language, add a classification head, and train on a few thousand labeled examples. This obliterated the need for feature engineering and set new state-of-the-art on every GLUE and SuperGLUE subtask simultaneously.
"We obtain new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), [and] SQuAD v1.1 question answering Test F1 to 93.2."
— Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. 90,000+ citations.
Variants followed rapidly: RoBERTa (Liu et al., 2019) showed BERT was undertrained and pushed accuracy higher with more data and longer training. DistilBERT (Sanh et al., 2019) distilled BERT into a model 40% smaller and 60% faster while retaining 97% of performance — the production workhorse. DeBERTa (He et al., 2021) introduced disentangled attention and surpassed human performance on the SuperGLUE benchmark.
Zero-shot Classification via NLI
In 2019, Yin, Hay, and Roth proposed a clever reformulation: instead of training a classifier for each label set, recast classification as natural language inference (NLI). Given a text and a hypothesis like "This text is about sports," an NLI model determines if the hypothesis is entailed by the text. The entailment probability becomes the classification score. No task-specific training needed.
This was a paradigm shift: for the first time, you could classify text into any categories you could describe in natural language, without a single labeled example. Facebook's BART-large-MNLI became the standard implementation, trained on the Multi-Genre Natural Language Inference corpus (Williams et al., 2018).
— Yin, W., Hay, J., & Roth, D. (2019). Benchmarking Zero-shot Text Classification. EMNLP.
LLM Prompting — Classification as Conversation
GPT-3 (Brown et al., 2020) demonstrated that large language models could perform classification through in-context learning: provide a few examples in the prompt, and the model classifies new inputs without any weight updates. GPT-4, Claude, and Gemini pushed this further — achieving near-SOTA accuracy on standard benchmarks through prompting alone, with the ability to explain their reasoning.
The trade-off is fundamental: LLMs offer maximum flexibility (any labels, any explanation, chain-of-thought reasoning) at the cost of 10–100x higher latency and cost compared to fine-tuned models. At 1,000 classifications per minute, a DistilBERT fine-tune costs pennies; GPT-4 costs dollars. The decision is always about volume, latency, and accuracy requirements.
— Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS. 40,000+ citations.
The throughline: 1961 → 2026
Six decades. One task. Five paradigm shifts: Naive Bayes → SVMs → TextCNN → BERT fine-tuning → zero-shot NLI → LLM prompting.
Each generation didn't kill the last. Naive Bayes still runs in production. SVMs still classify at ISPs. BERT fine-tunes handle the bulk of high-throughput classification. LLMs handle the long tail of ambiguous, novel, or low-volume tasks. The right tool depends on the constraints, not the calendar.
The Benchmarks: GLUE & SuperGLUE
Text classification progress is measured primarily against two benchmark suites created at NYU, the University of Washington, and DeepMind. Understanding them is essential because they define what "state of the art" means in this space.
GLUE (General Language Understanding Evaluation)
Released in 2018 by Wang, Singh, et al., GLUE aggregates 9 tasks into a single score. It was designed to be hard — but BERT-style models surpassed the human baseline within about a year, and the benchmark was considered "solved" by early 2020.
SST-2 (Sentiment)
Binary sentiment on movie reviews. 67K examples. Human: 97.8%
MNLI (NLI)
3-class entailment across 10 genres. 393K examples. Human: 92.0%
QQP (Paraphrase)
Are two Quora questions semantically equivalent? 364K pairs.
QNLI / RTE / WNLI
Question answering, textual entailment, coreference — all as classification.
— Wang, A. et al. (2018). GLUE: A Multi-Task Benchmark for NLU. ICLR.
SuperGLUE
Released in 2019 as GLUE's harder successor. More challenging tasks requiring commonsense reasoning, reading comprehension, and causal inference. DeBERTa (He et al., 2021) was the first model to surpass human performance (90.3 vs 89.8), followed by GPT-4.
BoolQ
Yes/no questions about Wikipedia passages. Human: 89.0%
CB (CommitmentBank)
3-class textual entailment requiring pragmatic inference.
WiC (Word in Context)
Does a polysemous word have the same sense in two sentences?
WSC (Winograd Schema)
Pronoun resolution requiring world knowledge. The hardest subtask.
— Wang, A. et al. (2019). SuperGLUE: A Stickier Benchmark for NLU. NeurIPS.
SST-2 Accuracy Over Time
SST-2 binary sentiment classification accuracy. Higher is better. The gap between SVM (2005) and DeBERTa (2021) is 15 percentage points — earned over 16 years of architectural innovation.
Why benchmarks matter (and where they fail)
GLUE and SuperGLUE measure general language understanding on clean, English data. Your production data is likely domain-specific, noisy, multilingual, or adversarial. A model that scores 97% on SST-2 might score 70% on your internal Slack messages. Always evaluate on your own data — benchmarks tell you what's possible, not what you'll get.
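One way to make that evaluation habitual: a small pure-Python harness you can point at any classifier's predictions. This sketch computes accuracy and macro-F1, which punishes models that ignore rare classes; the labels below are illustrative:

```python
from collections import defaultdict

def evaluate(y_true, y_pred):
    """Accuracy and macro-averaged F1 over whatever labels appear in the data."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for label in set(y_true) | set(y_pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)

y_true = ["pos", "pos", "neg", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neg", "pos"]
acc, macro_f1 = evaluate(y_true, y_pred)
print(f"accuracy={acc:.2f} macro_f1={macro_f1:.2f}")  # accuracy=0.67 macro_f1=0.45
```

Note how macro-F1 (0.45) is far below accuracy (0.67) here: the model never predicts "neu" correctly, and macro averaging surfaces that.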
Zero-shot Classification: No Training Data Required
Zero-shot classification uses a model trained on natural language inference (NLI) to classify text into any categories you provide — no training required. The model was never taught your specific labels; it reasons about whether a hypothesis ("This text is about technology") is entailed by the input text.
Zero-shot with BART-MNLI
Hugging Face Transformers

from transformers import pipeline

# BART trained on Multi-Genre NLI (Williams et al., 2018)
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

# Single-label classification — labels are mutually exclusive
result = classifier(
    "The Federal Reserve raised interest rates by 25 basis points",
    candidate_labels=["finance", "politics", "technology", "sports"]
)
print(result['labels'][0], f"({result['scores'][0]:.3f})")
# finance (0.962)

# Multi-label classification — each label scored independently
result = classifier(
    "Apple announces AI chip for data centers, stock surges 5%",
    candidate_labels=["technology", "business", "science", "politics"],
    multi_label=True  # Key parameter: sigmoid instead of softmax
)
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label}: {score:.3f}")
#   technology: 0.953
#   business: 0.821
#   science: 0.231
#   politics: 0.018

How it works under the hood
For each candidate label, the model constructs a hypothesis: "This example is {label}." It then scores the entailment probability — how likely is it that the hypothesis follows from the premise (your input text)?
# Internally, for each label:
premise = "The Federal Reserve raised interest rates by 25 basis points"
hypothesis = "This example is finance."  # Constructed from label

# NLI model outputs: P(entailment), P(neutral), P(contradiction)
# P(entailment) becomes the classification score for "finance"
# Repeat for all labels, normalize with softmax (or sigmoid for multi-label)
Zero-shot with OpenAI / Claude
LLM Prompting

from openai import OpenAI
import json
client = OpenAI()
def classify_text(text: str, labels: list[str]) -> dict:
    """Zero-shot classification via structured LLM output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "system",
            "content": f"""Classify the text into one of these categories: {labels}.
Return JSON: {{"label": "...", "confidence": 0.0-1.0, "reasoning": "..."}}"""
        }, {
            "role": "user",
            "content": text
        }]
    )
    return json.loads(response.choices[0].message.content)
result = classify_text(
    "The service was okay, nothing special but not terrible either.",
    ["positive", "negative", "neutral"]
)
print(result)
# {"label": "neutral", "confidence": 0.85,
#  "reasoning": "Mixed signals — acknowledges it wasn't bad but..."}

LLM-based classification gives you explainability for free — the model can articulate why it chose a label. This is valuable for auditing, debugging, and building trust with stakeholders. The cost: ~100ms latency and ~$0.001 per classification with GPT-4o.
Fine-tuned Classifiers: Maximum Accuracy & Speed
For production systems with fixed categories and high throughput requirements, fine-tuning a pre-trained transformer on your labeled data delivers the best accuracy-to-latency ratio. The model learns your specific label semantics and domain vocabulary.
Sentiment with DistilBERT (Pre-trained)
SST-2 Fine-tuned — 91.3% accuracy

from transformers import pipeline

# Load DistilBERT fine-tuned on Stanford Sentiment Treebank
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Single prediction
result = sentiment("This movie was absolutely terrible and a waste of time.")
print(result)
# [{'label': 'NEGATIVE', 'score': 0.9998}]
# Batch processing — 10x faster than individual calls
texts = [
    "I love this product, best purchase ever!",
    "The service was awful, waited 3 hours.",
    "It's okay I guess, nothing remarkable.",
    "Incredible experience, exceeded all expectations!",
]
results = sentiment(texts, batch_size=32)
for text, res in zip(texts, results):
    print(f"  {res['label']:8s} ({res['score']:.3f}) | {text[:50]}")
# POSITIVE (0.999) | I love this product, best purchase ever!
# NEGATIVE (0.999) | The service was awful, waited 3 hours.
# POSITIVE (0.724) | It's okay I guess, nothing remarkable.
# POSITIVE (0.999) | Incredible experience, exceeded all expectations!

Fine-tune Your Own Classifier
Hugging Face Trainer API

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset

# 1. Load pre-trained model + tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g., positive/negative/neutral
)

# 2. Prepare your dataset
dataset = load_dataset("csv", data_files={
    "train": "train.csv",  # columns: text, label
    "test": "test.csv"
})

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# 3. Train — typically 2-5 epochs is enough
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,  # Standard for BERT fine-tuning
        weight_decay=0.01,
        eval_strategy="epoch",
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
# With 5,000 examples, expect ~90-95% accuracy on most binary tasks
# With 500 examples, expect ~80-85% — still often beats zero-shot

Fine-tuning takes 5–30 minutes on a single GPU. The result is a model that runs inference at 10–50ms per batch of 32 — 100x faster than an LLM API call. This is why most high-throughput production classification still uses fine-tuned transformers.
Pre-trained Models Worth Knowing
| Model | Task | Latency | Parameters |
|---|---|---|---|
| distilbert-base-uncased-finetuned-sst-2 | Sentiment (2-class) | ~5ms | 66M |
| cardiffnlp/twitter-roberta-base-sentiment | Sentiment (3-class) | ~10ms | 125M |
| facebook/bart-large-mnli | Zero-shot (any labels) | ~100ms | 407M |
| MoritzLaurer/DeBERTa-v3-large-mnli | Zero-shot (SOTA NLI) | ~200ms | 304M |
| SamLowe/roberta-base-go_emotions | 28 emotion labels | ~10ms | 125M |
Latency is approximate, measured on a single CPU core with batch size 1. GPU inference is 5–10x faster.
Multi-class vs Multi-label Classification
A subtle but critical distinction that determines your loss function, evaluation metrics, and inference logic.
Multi-class
Exactly one label per text. Labels are mutually exclusive. Use softmax + cross-entropy loss.
Sentiment analysis
[positive OR negative OR neutral]
# Softmax: probabilities sum to 1.0
logits = model(text)     # [2.1, -0.5, 0.3]
probs = softmax(logits)  # [0.81, 0.06, 0.13]
label = argmax(probs)    # "positive"
Multi-label
Zero or more labels per text. Labels are independent. Use sigmoid + binary cross-entropy per label.
Article tagging
[tech AND finance AND breaking]
# Sigmoid: each label scored independently
logits = model(text)     # [2.1, 1.8, -2.0, -0.1]
probs = sigmoid(logits)  # [0.89, 0.86, 0.12, 0.48]
labels = [l for l, p in zip(label_names, probs) if p > 0.5]
# ["tech", "finance"]
Common mistake
Using softmax for multi-label classification. If a news article is about both technology and business, softmax forces the probabilities to sum to 1, artificially suppressing one label. Use sigmoid (independent binary decisions per label) whenever multiple labels can co-occur. In zero-shot NLI, this is the multi_label=True parameter.
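The effect is easy to see numerically. In this toy example (invented logits for an article that is genuinely both tech and business), softmax pushes "business" below a 0.5 threshold while sigmoid lets both true labels pass:

```python
import math

labels = ["tech", "business", "science", "politics"]
logits = [2.1, 1.8, -2.0, -1.5]   # the article is both tech AND business

# Softmax: scores compete for a fixed probability mass of 1.0
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]

# Sigmoid: each label is an independent binary decision
sigmoid = [1 / (1 + math.exp(-z)) for z in logits]

for lab, s, g in zip(labels, softmax, sigmoid):
    print(f"{lab:9s} softmax={s:.2f} sigmoid={g:.2f}")

# Softmax suppresses "business" below 0.5; sigmoid keeps both true labels above it
assert [l for l, p in zip(labels, softmax) if p > 0.5] == ["tech"]
assert [l for l, p in zip(labels, sigmoid) if p > 0.5] == ["tech", "business"]
```

The logits are identical in both branches; only the normalization differs, and it alone decides whether co-occurring labels survive thresholding.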
Confidence Scores & Production Thresholds
Every classification model outputs probability scores. Turning those scores into production decisions requires thresholding — and the right threshold depends entirely on the cost of mistakes.
Production Threshold Strategy
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def classify_with_routing(text: str, auto_threshold=0.92, flag_threshold=0.65):
    """Three-tier classification with human-in-the-loop fallback.

    >= auto_threshold: Act automatically (high precision)
    >= flag_threshold: Flag for human review (balanced)
    <  flag_threshold: Reject / mark as uncertain
    """
    result = classifier(text)[0]
    score = result['score']
    label = result['label']
    if score >= auto_threshold:
        return {"label": label, "score": score, "action": "auto"}
    elif score >= flag_threshold:
        return {"label": label, "score": score, "action": "review"}
    else:
        return {"label": "UNCERTAIN", "score": score, "action": "skip"}

# High confidence — auto-process
print(classify_with_routing("This product is absolutely amazing!"))
# {'label': 'POSITIVE', 'score': 0.9998, 'action': 'auto'}

# Medium confidence — needs human review
print(classify_with_routing("It was fine I guess, met expectations."))
# {'label': 'POSITIVE', 'score': 0.7834, 'action': 'review'}

# Low confidence — skip or escalate
print(classify_with_routing("The product arrived."))
# {'label': 'UNCERTAIN', 'score': 0.5612, 'action': 'skip'}

High threshold (0.90+)
Auto-delete spam. Auto-route support tickets. Flag for compliance.
Optimizes for precision — when you act, you're almost always right.
Medium threshold (0.70–0.90)
Suggest labels for humans to confirm. Pre-sort review queues.
Balanced — useful for human-in-the-loop workflows.
Low threshold (0.50–0.70)
Catch all potentially harmful content. Never miss a fraud signal.
Optimizes for recall — you catch everything, even at the cost of false alarms.
Calibration warning
Model confidence scores are not calibrated probabilities by default. A model that outputs 0.90 is not necessarily right 90% of the time. Neural networks tend to be overconfident — they output scores close to 0 or 1 even when uncertain. For critical applications (medical, legal, financial), apply temperature scaling or Platt scaling to calibrate confidence scores before setting thresholds.
— Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML.
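A minimal temperature-scaling sketch, assuming you have held-out validation logits and labels (the values below are invented and deliberately overconfident). One scalar T is fit to minimize validation NLL, then divides every logit before softmax:

```python
import math

def nll(logits_list, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

# Invented validation set: large-margin logits, but two predictions are wrong,
# which is the signature of an overconfident model
val_logits = [[4.0, -3.0], [3.5, -2.5], [-3.0, 4.0], [2.0, -1.0], [-2.0, 2.5]]
val_labels = [0, 0, 1, 1, 0]

# Fit T by grid search (the paper uses LBFGS on the same objective)
best_T = min((t / 10 for t in range(1, 101)),
             key=lambda T: nll(val_logits, val_labels, T))

# T > 1 softens probabilities; the ranking of predictions is unchanged
p_raw = 1 / (1 + math.exp(-(4.0 - (-3.0))))
p_cal = 1 / (1 + math.exp(-(4.0 - (-3.0)) / best_T))
print(f"T={best_T:.1f}: confidence {p_raw:.3f} -> {p_cal:.3f}")
```

Because every logit is divided by the same T, accuracy is untouched; only the confidence scores move, which is exactly what you want before setting thresholds.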
Decision Framework: Which Approach to Use
The right approach depends on four variables: how much labeled data you have, how fast you need inference, whether your labels change, and your accuracy requirements.
Decision Matrix
| Scenario | Approach | Accuracy | Latency | Cost/1M |
|---|---|---|---|---|
| Exploring, labels unknown | LLM prompting | ~93% | ~500ms | $100+ |
| Labels fixed, no labeled data | Zero-shot NLI | ~88% | ~100ms | $5–20 |
| 500–5K labeled examples | Fine-tune DistilBERT | ~92% | ~5ms | $0.50 |
| 50K+ labeled, accuracy critical | Fine-tune DeBERTa | ~97% | ~20ms | $2 |
| Dynamic labels, need reasoning | LLM + structured output | ~95% | ~500ms | $50–200 |
Accuracy estimates are approximate for binary/3-class sentiment tasks. Your domain will vary. Cost is for self-hosted GPU (fine-tuned) or API pricing (LLM/NLI).
Start with zero-shot, graduate to fine-tuning
The best workflow: use BART-MNLI or LLM prompting to validate your label taxonomy and generate an initial labeled dataset. Once you have 1,000+ validated examples, fine-tune DistilBERT for production. This avoids the cold-start problem.
This is the approach used at most startups: prototype with GPT-4, ship with DistilBERT.
The cost crossover
At roughly 10,000 classifications per day, the cost of LLM API calls exceeds the cost of training and hosting a fine-tuned model. Below that volume, the engineering time to set up fine-tuning isn't worth it. This is the inflection point for most teams.
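The crossover is simple arithmetic. With illustrative prices (every number below is an assumption, not a quote from any provider), the break-even lands in the same ballpark as the figure above:

```python
# All numbers are illustrative assumptions; substitute your real pricing
llm_cost_per_call = 0.001    # assumed cost of one LLM API classification
hosting_per_month = 300.0    # assumed small-GPU hosting for a fine-tuned model
one_off_setup = 2000.0       # assumed labeling + fine-tuning engineering cost
months = 12                  # comparison horizon

def crossover_daily_volume():
    """Daily volume where LLM spend over the horizon matches the fine-tune route."""
    fine_tune_total = one_off_setup + hosting_per_month * months
    llm_spend_per_daily_call = llm_cost_per_call * 30 * months
    return fine_tune_total / llm_spend_per_daily_call

print(f"break-even ~ {crossover_daily_volume():,.0f} classifications/day")
```

Under these assumptions the break-even is roughly 15,000 classifications/day; cheaper hosting or pricier LLM calls pull it toward the ~10,000 figure cited above.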
Hybrid: LLM for hard cases, fine-tune for easy ones
Route high-confidence predictions through a fast fine-tuned model and send uncertain cases (confidence below 0.80) to an LLM for chain-of-thought reasoning. This gives you the speed of fine-tuned models for 85% of traffic and the accuracy of LLMs for the ambiguous tail.
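A sketch of that router. `fast_model` and `llm_classify` are hypothetical stand-ins (a toy heuristic and a constant) so the control flow runs end to end; in practice they would wrap your fine-tuned pipeline and your LLM call:

```python
CONFIDENCE_CUTOFF = 0.80

def fast_model(text: str) -> tuple[str, float]:
    """Hypothetical stand-in for a fine-tuned DistilBERT pipeline."""
    # Toy heuristic so the sketch runs; replace with pipeline(text)[0]
    if "amazing" in text:
        return "POSITIVE", 0.99
    return "POSITIVE", 0.61

def llm_classify(text: str) -> tuple[str, float]:
    """Hypothetical stand-in for an LLM call with chain-of-thought reasoning."""
    return "NEUTRAL", 0.90

def route(text: str) -> dict:
    label, score = fast_model(text)
    if score >= CONFIDENCE_CUTOFF:
        return {"label": label, "score": score, "backend": "fine-tuned"}
    # Uncertain: escalate the ambiguous tail to the slower, smarter model
    label, score = llm_classify(text)
    return {"label": label, "score": score, "backend": "llm"}

print(route("This is amazing!"))   # handled by the fast model
print(route("It arrived."))        # escalated to the LLM
```

The cutoff (0.80 here, per the text) is itself a tunable threshold: raising it sends more traffic to the LLM and shifts the cost/accuracy balance.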
Key Takeaways
1. Text classification has a 60-year history. Naive Bayes (1961), SVMs (1998), TextCNN (2014), BERT (2018), zero-shot NLI (2019), LLM prompting (2020). Each approach is still in production somewhere.
2. GLUE and SuperGLUE are the standard benchmarks. SST-2 measures binary sentiment. MNLI measures natural language inference. DeBERTa-v3 holds SOTA on both, but always evaluate on your data.
3. Zero-shot NLI is the fastest path to a working classifier. No training data, any labels, change them on the fly. Accuracy is 5–10 points below fine-tuned but sufficient for prototyping and low-volume production.
4. Fine-tuned DistilBERT is the production workhorse. 91.3% SST-2 accuracy at 5ms inference. Costs almost nothing to run. Start with zero-shot, graduate to fine-tuning when you have 1K+ labeled examples and 10K+ daily classifications.
5. Thresholds matter more than models. Setting the right confidence threshold for your use case — high for automation, low for flagging — often improves production outcomes more than switching to a better model.
Practice Exercise
Build a three-approach classifier and compare results:
1. Install dependencies: `pip install transformers torch scikit-learn datasets`
2. Classify 20 product reviews with DistilBERT (fine-tuned) and BART-MNLI (zero-shot). Compare accuracy and latency.
3. Try zero-shot with custom labels ("urgent", "question", "complaint", "praise") on the same texts.
4. Implement the three-tier threshold strategy. Find the threshold that gives zero false positives on your test set.
5. Bonus: Fine-tune DistilBERT on a small subset (100 examples from `datasets.load_dataset("imdb")`) and measure how accuracy scales with dataset size.
Further Reading
- Devlin et al. (2019) — The BERT paper. Essential reading for understanding the pre-train/fine-tune paradigm.
- Liu et al. (2019) — RoBERTa: shows BERT was undertrained and demonstrates an improved pre-training recipe.
- Yin, Hay, & Roth (2019) — The zero-shot text classification paper. Foundational for NLI-based approaches.
- He et al. (2021) — DeBERTa: disentangled attention, first model to surpass human performance on SuperGLUE.
- Wang et al. (2018) — GLUE benchmark paper. Defines the standard evaluation suite for NLU.