Text classification task router

Classification turns text into labels: intent, topic, sentiment, risk, moderation category, or routing decision. GLUE and SuperGLUE are historical reference points; your label set and class imbalance are the real test.

Benchmark

GLUE - SuperGLUE - domain eval

Current pick

DeBERTa v3

01 - Explainer

What this task measures.

Text classification maps an input string to one or more labels: sentiment, intent, topic, urgency, moderation category, compliance risk, or routing destination. The important design choice is whether labels are stable enough for a fine-tuned classifier or fluid enough that an LLM policy classifier is easier to maintain.

02 - Benchmarks

Use a benchmark ladder.

One leaderboard rarely captures the task. Use the canonical benchmark for lineage, then add harder or more domain-specific checks before choosing a model.

Benchmark	Role	Metric	Caveat
GLUE	Historical NLU suite	Average task score	Saturated; more useful for tracing model lineage than for selecting a production classifier.
SuperGLUE	Harder NLU suite	Average task score	Still a broad language-understanding proxy, not a domain-label benchmark.
SST-2 / GoEmotions	Sentiment and emotion	Accuracy / macro F1	Good for public comparison; label definitions rarely match business policies.
Local validation set	Production gate	Macro F1 / AUROC / calibration	Required for imbalance, drift, threshold tuning, and costly minority-class misses.
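Threshold tuning on a local validation set can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the helper names (`f1_at_threshold`, `best_threshold`) and the validation scores are hypothetical, and it assumes a binary setup where label 1 is the rare, costly class.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the positive class when predicting positive at score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return (threshold, f1) with the highest positive-class F1 among observed scores."""
    candidates = sorted(set(scores))
    return max(
        ((t, f1_at_threshold(scores, labels, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical validation data: 1 marks the rare class.
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

threshold, f1 = best_threshold(val_scores, val_labels)
default_f1 = f1_at_threshold(val_scores, val_labels, 0.5)
print(threshold, round(f1, 3), round(default_f1, 3))
```

On this toy data the tuned threshold (0.35) recovers the two rare-class examples that the default 0.5 cutoff misses. In practice the threshold would be chosen on one split and confirmed on another to avoid overfitting to the validation set.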

03 - Evaluation

What to compare.

The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.

Axis	Value	Why it matters
Historical benchmark	GLUE / SuperGLUE	Useful for lineage, but too saturated to separate frontier models.
Production metric	Macro F1 / AUROC / calibration	Accuracy hides minority-class misses and bad confidence estimates.
Model families	DeBERTa, SetFit, zero-shot NLI, LLMs	Pick by label stability, data volume, and explanation needs.
Failure mode	Label drift	Support tickets, policy categories, and abuse labels change over time.
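To make the production-metric row concrete, here is a small sketch comparing accuracy with macro F1 on an imbalanced label set, plus a simple binned expected calibration error. The function names, the 95/5 class split, and the five-bin setting are illustrative assumptions, not values from any benchmark.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy within each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        bucket_acc = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - bucket_acc)
    return ece


# Hypothetical imbalanced set: a classifier that always predicts the majority class.
y_true = ["ok"] * 95 + ["abuse"] * 5
y_pred = ["ok"] * 100
acc = accuracy(y_true, y_pred)   # 0.95
mf1 = macro_f1(y_true, y_pred)   # about 0.49: the missed "abuse" class drags it down

# Hypothetical overconfident model: 90% confidence, 60% actually correct.
ece = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
print(acc, round(mf1, 3), round(ece, 3))
```

The always-majority classifier looks strong on accuracy while never catching the minority class, which is the gap macro F1 exposes. The calibration figure does the same for confidence scores that run ahead of actual correctness.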

04 - Routing

Pick by task shape.

Stable labels, high volume

Fine-tuned encoder

Best latency and cost once you have labeled examples.

Few examples

SetFit or small fine-tune

Works well when each class has only a handful of samples.

No training data

Zero-shot NLI

Good for first-pass labeling before annotation exists.
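The mechanism behind zero-shot NLI classification can be sketched as follows: each candidate label is turned into a hypothesis sentence, and the label whose hypothesis the input most strongly entails wins. In this sketch `zero_shot_classify`, the hypothesis template, and the sample labels are hypothetical, and `toy_entail_score` is a keyword-overlap stand-in, not a real NLI model. In practice an NLI-trained model would supply the entailment score.

```python
def zero_shot_classify(text, labels, entail_score, template="This text is about {}."):
    """Rank labels by how strongly `text` entails each label's hypothesis sentence."""
    scored = [(label, entail_score(text, template.format(label))) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_entail_score(premise, hypothesis):
    """Stand-in scorer (NOT a real NLI model): share of hypothesis words found in the premise."""
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = hypothesis.lower().replace(".", "").split()
    hits = sum(word in premise_words for word in hypothesis_words)
    return hits / len(hypothesis_words)


ranked = zero_shot_classify(
    "My billing statement shows a duplicate charge.",
    ["billing", "shipping", "login"],
    toy_entail_score,
)
print(ranked[0][0])
```

Because the scorer is passed in as a function, the same ranking logic works unchanged once the toy scorer is replaced by a real entailment model.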

Decisions must be explained

LLM classifier

Use when rationale and flexible policy language matter.
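The four routes above can be read as a small decision function. This is one possible encoding, not a definitive rule: the function name `pick_approach`, the precedence of the checks, and the `few_shot_cutoff` of 50 examples per class are all assumptions made for illustration.

```python
def pick_approach(labeled_per_class, labels_stable, needs_rationale, few_shot_cutoff=50):
    """Map a task's shape to one of the four routes.

    few_shot_cutoff is a hypothetical boundary between "few examples" and
    "high volume"; tune it against your own validation results.
    """
    # Rationale requirements or fluid policy labels favor an LLM classifier.
    if needs_rationale or not labels_stable:
        return "LLM classifier"
    # With no labeled data at all, zero-shot NLI gives a first pass.
    if labeled_per_class == 0:
        return "Zero-shot NLI"
    # A handful of examples per class suits few-shot fine-tuning.
    if labeled_per_class < few_shot_cutoff:
        return "SetFit or small fine-tune"
    # Stable labels with plenty of data: train a dedicated encoder.
    return "Fine-tuned encoder"


print(pick_approach(labeled_per_class=2000, labels_stable=True, needs_rationale=False))
print(pick_approach(labeled_per_class=8, labels_stable=True, needs_rationale=False))
```

Checking the rationale and label-stability conditions first reflects the idea that maintenance needs can outweigh data volume. A team with different priorities might reasonably order the checks differently.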

05 - Related

Need implementation details?

Open the lower-level explainer for architecture, code examples, and implementation options.

Open GLUE editorial ->