Text Classification
Assign labels to text automatically. Reproduce SST-2 benchmark accuracy and push beyond the baseline.
GLUE (SST-2) — Stanford Sentiment Treebank
Binary sentiment classification. Part of the GLUE benchmark suite.
What is Text Classification?
Text classification assigns predefined categories to text. It powers spam filters, customer support routing, content moderation, and sentiment dashboards.
Sentiment Analysis
Positive, negative, or neutral. Product reviews, social media.
Intent Detection
Understand user goals in conversational AI.
Two Approaches: Zero-shot vs Fine-tuned
Zero-shot Classification
- Pro: No training data needed
- Pro: Works with any labels
- Con: Lower accuracy
- Con: Higher latency
Fine-tuned Classifiers
- Pro: Highest accuracy
- Pro: Fast inference (ms)
- Con: Requires labeled data
- Con: Fixed categories
Zero-shot Classification
Zero-shot with BART-MNLI
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "I love this product! Best purchase ever.",
    candidate_labels=["positive", "negative", "neutral"]
)
print(result)
# {'labels': ['labels': ['positive', 'negative', 'neutral'],
#  'scores': [0.9845, 0.0098, 0.0057]}
Fine-tuned Classifiers
Sentiment with DistilBERT (SST-2)
91.3% accuracy
from transformers import pipeline
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
texts = ["I love this!", "This is awful.", "It's okay I guess."]
results = sentiment(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.2f})")
Benchmark: GLUE and SST-2
The Stanford Sentiment Treebank (SST-2) is the standard benchmark for binary sentiment classification, part of the GLUE suite.
SST-2 Accuracy
SST-2 binary sentiment classification accuracy. Human baseline is ~97%.
Confidence Scores and Thresholds
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
def classify_with_threshold(text, threshold=0.85):
    result = classifier(text)[0]
    if result['score'] >= threshold:
        return result['label'], result['score']
    return 'UNCERTAIN', result['score']
print(classify_with_threshold("I love this!")) # ('POSITIVE', 0.9998)
print(classify_with_threshold("It's fine I guess."))  # ('UNCERTAIN', 0.7234)
Reproduce
Replicate DistilBERT on SST-2: 91.3% Accuracy
Evaluate distilbert-base-uncased-finetuned-sst-2-english on the SST-2 validation set and reproduce its published accuracy of 91.3%.
Reproduce Script
from transformers import pipeline
from datasets import load_dataset
# Load model and dataset
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0  # Use GPU if available, else remove this line
)
dataset = load_dataset("glue", "sst2", split="validation")
# Evaluate
correct = 0
total = len(dataset)
label_map = {"POSITIVE": 1, "NEGATIVE": 0}
for example in dataset:
    result = classifier(example["sentence"])[0]
    predicted = label_map[result["label"]]
    if predicted == example["label"]:
        correct += 1
accuracy = correct / total * 100
print(f"SST-2 Accuracy: {accuracy:.1f}%")
# Expected: ~91.3%
What you need
1. pip install transformers datasets torch
2. SST-2 validation set downloads automatically (~1 MB)
3. ~5 minutes on CPU, ~30 seconds on GPU
Target: Your reproduced accuracy should be within ±0.5% of 91.3%. The SST-2 validation set has 872 examples — each misclassification shifts accuracy by ~0.11%.
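The arithmetic behind that granularity can be checked directly in plain Python (no model or data download needed):

```python
# SST-2 validation set size, as stated above
n_examples = 872

# Accuracy shift caused by a single flipped prediction, in percentage points
shift_per_example = 100 / n_examples
print(f"{shift_per_example:.2f} pp per misclassification")  # ~0.11

# Number of errors consistent with the published 91.3% accuracy
errors = round(n_examples * (1 - 0.913))
print(f"~{errors} misclassified examples")
```

So reproducing 91.3% exactly means landing on roughly 76 errors out of 872; a handful of borderline sentences scoring differently across library versions explains the ±0.5% tolerance.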
Improve
Beat 91.3% on SST-2
DistilBERT is a distilled model — it trades accuracy for speed. The full BERT-large hits 93.5%, and RoBERTa-large reaches 96.4%. Can you close the gap with a small model?
Strategies to explore
Fine-tune RoBERTa-base on SST-2
RoBERTa-base is the same size as BERT-base but was pre-trained longer, on more data, with an improved recipe. Fine-tuning it on SST-2 for 3 epochs typically reaches 94%+.
Knowledge distillation
Distill a large model (DeBERTa-v3) into a DistilBERT-sized model. Can you exceed 91.3% while keeping inference under 10ms?
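A framework-free sketch of the standard distillation objective (Hinton-style soft targets with temperature); in a real training loop this would be computed over tensors, and distillation_loss here is illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-target cross-entropy (student vs teacher)
    and hard-label cross-entropy (student vs ground truth)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target cross-entropy, scaled by T^2 to keep gradient
    # magnitudes comparable across temperatures
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    soft *= temperature ** 2
    # Standard cross-entropy against the hard label
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard

# A student that agrees with its teacher incurs a lower loss than one
# that contradicts it:
print(distillation_loss([2.0, -1.0], [2.0, -1.0], true_label=0))
print(distillation_loss([-1.0, 2.0], [2.0, -1.0], true_label=0))
```

The teacher's soft probabilities carry more signal than a hard label alone (e.g. "90% positive" vs "55% positive"), which is what lets a small student exceed what it could learn from the labels by themselves.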
Data augmentation
Use back-translation or paraphrase mining to expand the training set. More diverse training data often improves generalization.
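A back-translation sketch, assuming the Helsinki-NLP opus-mt checkpoints on the Hugging Face Hub; back_translate is a helper introduced here, and paraphrase quality (and thus label preservation) should be spot-checked before training on the output:

```python
from transformers import pipeline

def back_translate(texts, pivot="fr"):
    """Paraphrase English sentences by translating to a pivot language
    and back. Model names are Helsinki-NLP opus-mt checkpoints."""
    to_pivot = pipeline("translation",
                        model=f"Helsinki-NLP/opus-mt-en-{pivot}")
    to_en = pipeline("translation",
                     model=f"Helsinki-NLP/opus-mt-{pivot}-en")
    pivoted = [r["translation_text"] for r in to_pivot(texts)]
    return [r["translation_text"] for r in to_en(pivoted)]
```

Running several pivot languages (fr, de, es) over the SST-2 training set yields paraphrases that keep the sentiment label while varying surface form; deduplicate against the originals before adding them to training.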
Ensemble methods
Combine predictions from multiple small models. Majority vote or weighted average can outperform individual models.
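A plain-Python majority-vote sketch (no libraries beyond the standard library); majority_vote is a name introduced here:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label list per model, all the same length.
    Returns one label per example by plurality vote."""
    n_examples = len(predictions[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(model[i] for model in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Three hypothetical models disagree on the last example:
model_a = ["POSITIVE", "NEGATIVE", "POSITIVE"]
model_b = ["POSITIVE", "NEGATIVE", "POSITIVE"]
model_c = ["POSITIVE", "NEGATIVE", "NEGATIVE"]
print(majority_vote([model_a, model_b, model_c]))
# ['POSITIVE', 'NEGATIVE', 'POSITIVE']
```

A weighted variant would average each model's softmax scores instead of counting hard labels, which tends to help when the models are calibrated differently.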
The real challenge: Human baseline on SST-2 is ~97%. Can you get closer to human-level accuracy? Every percentage point above 91.3% is a meaningful contribution — these are well-studied benchmark numbers.
Submit Your Result
Submit your SST-2 evaluation result. Include your training code so peers can reproduce and verify your accuracy.