Level 1: Single Blocks (~30 min)

Text Classification

Assign labels to text automatically. Reproduce SST-2 benchmark accuracy and push beyond the baseline.

Target Benchmark

GLUE (SST-2) — Stanford Sentiment Treebank

Binary sentiment classification. Part of the GLUE benchmark suite.

DistilBERT: 91.3% SST-2 accuracy

What is Text Classification?

Text classification assigns predefined categories to text. It powers spam filters, customer support routing, content moderation, and sentiment dashboards.

Sentiment Analysis

Positive, negative, or neutral. Product reviews, social media.

"Love this product!" -> positive (0.98)

Intent Detection

Understand user goals in conversational AI.

"Cancel my order" -> cancel_order (0.95)

Two Approaches: Zero-shot vs Fine-tuned

Zero-shot Classification

  • + No training data needed
  • + Works with any labels
  • - Lower accuracy
  • - Higher latency

Fine-tuned Classifiers

  • + Highest accuracy
  • + Fast inference (ms)
  • - Requires labeled data
  • - Fixed categories

Zero-shot Classification

Zero-shot with BART-MNLI

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "I love this product! Best purchase ever.",
    candidate_labels=["positive", "negative", "neutral"]
)
print(result)
# {'labels': ['positive', 'negative', 'neutral'],
#  'scores': [0.9845, 0.0098, 0.0057]}
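Zero-shot also supports overlapping categories: with `multi_label=True`, each candidate label is scored independently (one entailment score per label instead of a softmax across labels), so a text can activate several labels at once. A sketch with hypothetical aspect labels:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

# multi_label=True scores each label independently, so the scores
# no longer sum to 1 and several labels can score high at once
result = classifier(
    "The battery life is great but shipping took forever.",
    candidate_labels=["battery", "shipping", "price"],
    multi_label=True,
)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```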

Fine-tuned Classifiers

Sentiment with DistilBERT (SST-2): 91.3% accuracy

from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

texts = ["I love this!", "This is awful.", "It's okay I guess."]
results = sentiment(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.2f})")

Benchmark: GLUE and SST-2

The Stanford Sentiment Treebank (SST-2) is the standard benchmark for binary sentiment classification, part of the GLUE suite.

SST-2 Accuracy

  • RoBERTa-large: 96.4%
  • GPT-4 (zero-shot): 95%
  • DeBERTa-v3-base: 94.8%
  • BERT-large: 93.5%
  • DistilBERT: 91.3%
  • BART-MNLI (zero-shot): 88%

SST-2 binary sentiment classification accuracy. Human baseline is ~97%.

Confidence Scores and Thresholds

Classifier pipelines return a confidence score with each label. Thresholding that score lets you route uncertain predictions to a fallback, such as a human review queue.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def classify_with_threshold(text, threshold=0.85):
    result = classifier(text)[0]
    if result['score'] >= threshold:
        return result['label'], result['score']
    return 'UNCERTAIN', result['score']

print(classify_with_threshold("I love this!"))       # ('POSITIVE', 0.9998)
print(classify_with_threshold("It's fine I guess.")) # ('UNCERTAIN', 0.7234)

Stage 1: Reproduce

Replicate DistilBERT on SST-2: 91.3% Accuracy

Evaluate distilbert-base-uncased-finetuned-sst-2-english on the SST-2 validation set and reproduce its published accuracy of 91.3%.

Reproduce Script

from transformers import pipeline
from datasets import load_dataset

# Load model and dataset
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0  # use GPU 0; remove this line to run on CPU
)

dataset = load_dataset("glue", "sst2", split="validation")

# Evaluate
correct = 0
total = len(dataset)
label_map = {"POSITIVE": 1, "NEGATIVE": 0}

for example in dataset:
    result = classifier(example["sentence"])[0]
    predicted = label_map[result["label"]]
    if predicted == example["label"]:
        correct += 1

accuracy = correct / total * 100
print(f"SST-2 Accuracy: {accuracy:.1f}%")
# Expected: ~91.3%

What you need

  1. pip install transformers datasets torch
  2. The SST-2 validation set downloads automatically (~1 MB)
  3. Runtime: ~5 minutes on CPU, ~30 seconds on GPU

Target: Your reproduced accuracy should be within ±0.5% of 91.3%. The SST-2 validation set has 872 examples — each misclassification shifts accuracy by ~0.11%.

Stage 2: Improve

Beat 91.3% on SST-2

DistilBERT is a distilled model — it trades accuracy for speed. The full BERT-large hits 93.5%, and RoBERTa-large reaches 96.4%. Can you close the gap with a small model?

Strategies to explore

Fine-tune RoBERTa-base on SST-2

RoBERTa-base is the same size as BERT-base but benefits from better pre-training (more data, longer training, no next-sentence objective). Fine-tuning it on SST-2 for 3 epochs typically reaches 94%+.

Knowledge distillation

Distill a large model (DeBERTa-v3) into a DistilBERT-sized model. Can you exceed 91.3% while keeping inference under 10ms?
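The core of distillation is the training loss. A sketch of the standard soft-target objective (Hinton-style: temperature-scaled KL against the teacher plus cross-entropy against gold labels; the temperature and mixing weight are tuning assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft teacher targets with hard gold labels.
    temperature and alpha are hyperparameters to tune."""
    # Soft term: KL between temperature-scaled distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard term: ordinary cross-entropy on the gold labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

During training, the teacher (e.g. a fine-tuned DeBERTa-v3) runs in eval mode to produce `teacher_logits` for each batch, and only the student's weights are updated.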

Data augmentation

Use back-translation or paraphrase mining to expand the training set. More diverse training data often improves generalization.
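Back-translation can be sketched with two public MT checkpoints (the Helsinki-NLP OPUS models; French as the pivot language is an arbitrary choice, and the Marian tokenizers additionally require the sentencepiece package):

```python
from transformers import pipeline

# EN -> FR -> EN round trip: the result is usually a paraphrase
en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text):
    french = en_fr(text)[0]["translation_text"]
    return fr_en(french)[0]["translation_text"]

print(back_translate("The movie was surprisingly good."))
```

Each augmented sentence keeps its original sentiment label; deduplicate round trips that come back unchanged before adding them to the training set.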

Ensemble methods

Combine predictions from multiple small models. Majority vote or weighted average can outperform individual models.
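A sketch of both combination rules on hypothetical model outputs (the labels, probabilities, and weights below are made-up illustrations):

```python
from collections import Counter

def majority_vote(labels):
    # labels: one predicted label per model for a single example
    return Counter(labels).most_common(1)[0][0]

def weighted_positive_prob(probs, weights):
    # probs: each model's positive-class probability
    # weights: e.g. each model's validation accuracy
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

# Three hypothetical models scoring one review
print(majority_vote(["POSITIVE", "POSITIVE", "NEGATIVE"]))  # POSITIVE
print(weighted_positive_prob([0.9, 0.8, 0.4], [0.94, 0.91, 0.88]))
```

Ensembles help most when the member models make different mistakes, so prefer diverse architectures or training seeds over near-identical copies.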

The real challenge: Human baseline on SST-2 is ~97%. Can you get closer to human-level accuracy? Every percentage point above 91.3% is a meaningful contribution — these are well-studied benchmark numbers.

Submit Your Result

Submit your SST-2 evaluation result. Include your training code so peers can reproduce and verify your accuracy.
