For Enthusiasts

The SOTA Tracker's Guide

Everything you need to track state-of-the-art models across machine learning tasks. From understanding SOTA to finding the latest breakthroughs in your favorite domains.

Updated December 2025 | 15 min read | Based on 1,519 papers

What is SOTA and Why Track It?

SOTA (State-of-the-Art) represents the best-performing model on a specific benchmark at a given time. It's the answer to "what's the best we can do right now?" for any ML task.

Why Enthusiasts Track SOTA

  • Follow the cutting edge of ML research in real-time
  • Understand which approaches are winning and why
  • Spot emerging trends before they become mainstream
  • Benchmark your own models against the best

The SOTA Tuple

Every SOTA result is defined by three components:

# SOTA Tuple
Task: Image Classification
Dataset: ImageNet
Metric: Top-1 Accuracy
# Current SOTA: 91.5%
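In code, this tuple maps naturally onto a small record type. A minimal Python sketch (the class and field names are illustrative, not from any particular tracking library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SotaRecord:
    """One leaderboard entry: the (task, dataset, metric) tuple plus the best value."""
    task: str
    dataset: str
    metric: str
    value: float

record = SotaRecord(
    task="Image Classification",
    dataset="ImageNet",
    metric="Top-1 Accuracy",
    value=91.5,
)
print(record.metric, record.value)  # Top-1 Accuracy 91.5
```

Making the record `frozen` (immutable) is a deliberate choice: a SOTA result at a given time is a fact, and a new result should be a new record, not a mutation.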

How to Read Benchmark Results

Understanding Metrics

Higher is Better

  • Accuracy: Percentage of correct predictions (0-100%)
  • F-Measure: Harmonic mean of precision and recall (0-100)
  • mAP: Mean Average Precision for object detection
  • BLEU: Translation quality score (0-100)
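Most of these "higher is better" scores reduce to simple ratios. As one concrete example, the F-measure above is computed directly from precision and recall. A minimal sketch (the 0-100 scaling follows the ranges listed here):

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1), on a 0-100 scale."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance: high precision cannot
# compensate for low recall, and vice versa.
print(f_measure(80.0, 60.0))  # 2*80*60/140 ≈ 68.57
```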

Lower is Better

  • Error Rate: Percentage of incorrect predictions
  • CER: Character Error Rate for OCR
  • WER: Word Error Rate for speech recognition
  • Perplexity: Language model uncertainty
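WER is a good example of how these "lower is better" metrics are computed: it is the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal sketch (the function name is illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the cat sit"))  # 1 substitution / 3 words ≈ 0.33
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why leaderboards usually report it as a percentage rather than capping it at 100%.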

Context Matters

1. Dataset Size & Difficulty

90% accuracy on MNIST (handwritten digits) is trivial; 90% on ImageNet is world-class. Always check what the baseline models achieve.

2. Training Data

Models trained on private datasets or massive compute may not be reproducible. Check whether the model used external data beyond the standard training set.

3. Test Set Contamination

Be skeptical of results that are significantly better than the previous SOTA. The test set might have leaked into the training data, especially for large language models.

4. Inference Speed

A model that's 0.5% more accurate but 10x slower may not be better for your use case. Look for latency/throughput benchmarks alongside accuracy.
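To weigh that trade-off yourself, per-sample latency is easy to measure with a wall-clock timer. A minimal sketch, where `predict` and `inputs` are stand-ins for your model's forward pass and your test data:

```python
import time

def benchmark_latency(predict, inputs, warmup=3):
    """Rough average per-sample latency in milliseconds for any callable."""
    for x in inputs[:warmup]:   # warm-up runs, excluded from timing
        predict(x)
    start = time.perf_counter()
    for x in inputs:
        predict(x)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / len(inputs)

# Toy stand-in for a model forward pass
latency_ms = benchmark_latency(lambda x: x * 2, list(range(100)))
print(f"{latency_ms:.4f} ms/sample")
```

Real benchmarks also control batch size and hardware, but even this rough number lets you compare "0.5% more accurate" against "10x slower" concretely.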

Getting Started with Reproducing Results

Step-by-Step Reproduction Guide

1. Find the Paper and Code

Start with papers that have official code repositories. Look for GitHub links in the paper or on the dataset leaderboard page.

2. Check Requirements

Review compute requirements (GPU memory, training time) and dependencies (PyTorch/TensorFlow version, CUDA version).

# Check common requirements
GPU: 24GB VRAM (RTX 3090/4090)
PyTorch: 2.0+
CUDA: 11.8+

3. Download the Dataset

Most datasets require registration before download. Popular ones (ICDAR, COCO, ImageNet) have standard download scripts.

4. Use Pre-trained Weights

Start with inference using pre-trained weights. Full training can take days or weeks, so verify the model works before attempting to retrain.

# Example: Load pre-trained model
python eval.py --checkpoint model.pth \
    --dataset icdar2015 --split test

5. Compare Your Results

If your numbers match within 0.5-1%, you've successfully reproduced the result. Larger gaps suggest setup issues or undocumented training details.
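That tolerance check is trivial to automate alongside your evaluation script. A minimal sketch (names and the default tolerance are illustrative):

```python
def reproduced(reported: float, yours: float, tolerance: float = 1.0) -> bool:
    """True if your metric is within `tolerance` points of the reported number."""
    return abs(reported - yours) <= tolerance

print(reproduced(91.5, 90.9))  # 0.6 points off -> True
print(reproduced(91.5, 88.0))  # 3.5 points off -> False
```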

Common Pitfalls

  • Using different data preprocessing than the paper
  • Missing data augmentation details
  • Evaluating on the wrong dataset split (train vs. test)
  • Ignoring random seeds (some models are sensitive)
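The seed pitfall in particular is cheap to guard against. A minimal sketch using only Python's standard `random` module; real training code would also seed NumPy and the DL framework (e.g. `numpy.random.seed`, `torch.manual_seed`):

```python
import random

def set_seed(seed: int) -> None:
    """Pin Python's RNG so runs are repeatable.
    (Extend with framework-specific seeding in real training code.)"""
    random.seed(seed)

set_seed(42)
a = [random.random() for _ in range(3)]
set_seed(42)
b = [random.random() for _ in range(3)]
print(a == b)  # identical draws after re-seeding -> True
```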

Best Practices

  • Start with smaller datasets to debug faster
  • Document your exact environment (Docker helps)
  • Check GitHub issues for known reproduction problems
  • Join the paper's community (Discord, forums) for help

Explore Individual Leaderboards

Dive deep into specific tasks and datasets. Each leaderboard shows historical progression, code implementations, and benchmark details.

Ready to Track SOTA?

Start exploring benchmarks and following the latest breakthroughs in ML research.