Text classification maps an input string to one or more labels: sentiment, intent, topic, urgency, moderation category, compliance risk, or routing destination. GLUE and SuperGLUE are historical reference points; your own label set and class imbalance are the real test. The key design choice is whether your labels are stable enough for a fine-tuned classifier or fluid enough that an LLM policy classifier is easier to maintain.
One leaderboard rarely captures the task. Use the canonical benchmark for lineage, then add harder or more domain-specific checks before choosing a model.
| Benchmark | Role | Metric | Caveat |
|---|---|---|---|
| GLUE | Historical NLU suite | Average task score | Saturated; useful for model lineage more than production classifier selection. |
| SuperGLUE | Harder NLU suite | Average task score | Still a broad language-understanding proxy, not a domain-label benchmark. |
| SST-2 / GoEmotions | Sentiment / emotion | Accuracy / macro F1 | Good for public comparison; label definitions rarely match business policies. |
| Local validation set | Production gate | Macro F1 / AUROC / calibration | Required for imbalance, drift, threshold tuning, and costly minority-class misses. |
The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.
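
To make the case for a local validation gate concrete, here is a minimal sketch in plain Python that compares accuracy with macro F1 on an imbalanced label set. The labels (`ok`, `abuse`), the 95/5 split, and the always-predict-majority model are made up for illustration; in practice you would likely use a library such as scikit-learn rather than these hand-rolled helpers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)


# Hypothetical validation set: 95 benign items, 5 abusive ones.
y_true = ["ok"] * 95 + ["abuse"] * 5
# Hypothetical degenerate model that always predicts the majority class.
y_pred = ["ok"] * 100

acc = accuracy(y_true, y_pred)   # 0.95
mf1 = macro_f1(y_true, y_pred)   # about 0.49

print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

The model misses every `abuse` item yet still reports 95% accuracy, while macro F1 drops below 0.5 because the minority class contributes an F1 of zero.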
| Axis | Value | Why it matters |
|---|---|---|
| Historical benchmark | GLUE / SuperGLUE | Useful for lineage, but saturated for frontier model selection. |
| Production metric | Macro F1 / AUROC / calibration | Accuracy hides minority-class misses and bad confidence estimates. |
| Model families | DeBERTa, SetFit, zero-shot NLI, LLMs | Pick by label stability, data volume, and explanation needs. |
| Failure mode | Label drift | Support tickets, policy categories, and abuse labels change over time. |
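
One rough way to watch for the label drift named in the table is to compare the label mix of a reference window against a recent one. The sketch below uses total variation distance between the two label distributions and also flags labels that did not exist in the reference window. The ticket labels, the counts, and the `DRIFT_THRESHOLD` value are all hypothetical. Note that this only catches shifts in the label mix; it will not detect an existing label whose definition has quietly changed.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 to 1)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical label windows, e.g. last quarter vs. this week.
reference = ["billing"] * 50 + ["login"] * 40 + ["abuse"] * 10
current = ["billing"] * 30 + ["login"] * 40 + ["abuse"] * 20 + ["refund"] * 10

DRIFT_THRESHOLD = 0.1  # assumed cutoff; tune on your own history

tvd = total_variation_distance(
    label_distribution(reference), label_distribution(current)
)
new_labels = set(current) - set(reference)
drifted = tvd > DRIFT_THRESHOLD or bool(new_labels)

print(f"tvd={tvd:.2f} new_labels={sorted(new_labels)} drifted={drifted}")
```

A check like this is cheap enough to run on every batch of predictions or annotations; a flagged window is a prompt to re-sample and re-label, not proof that the model is wrong.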
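
| Model family | When to use |
|---|---|
| Fine-tuned encoder (e.g. DeBERTa) | Best latency and cost once you have labeled examples. |
| SetFit | Works well when each class has only a handful of samples. |
| Zero-shot NLI | Good for first-pass labeling before annotation exists. |
| LLMs | Use when rationale and flexible policy language matter. |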
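
The guidance above (pick by label stability, data volume, and explanation needs) can be read as a small decision rule. The sketch below encodes it as a hypothetical helper, `suggest_model_family`. The function name, its parameters, and the `few_shot_max=16` cutoff for "a handful of samples" are assumptions made for illustration, not thresholds from any benchmark.

```python
def suggest_model_family(
    examples_per_class,
    labels_stable=True,
    needs_rationale=False,
    few_shot_max=16,  # assumed cutoff for "a handful" of samples per class
):
    """Return a starting-point model family for a classification task."""
    # Fluid policies or a need for explanations favor an LLM classifier.
    if needs_rationale or not labels_stable:
        return "LLM classifier"
    # No annotation yet: zero-shot NLI gives a first-pass labeling.
    if examples_per_class == 0:
        return "zero-shot NLI"
    # A handful of samples per class: few-shot methods such as SetFit.
    if examples_per_class <= few_shot_max:
        return "SetFit"
    # Enough labeled data and stable labels: fine-tune an encoder.
    return "fine-tuned encoder"


print(suggest_model_family(0))
print(suggest_model_family(8))
print(suggest_model_family(5000))
print(suggest_model_family(5000, labels_stable=False))
```

Treat the result as a shortlist rather than a verdict: the final choice still has to pass your local validation set.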