Self-Supervised Learning
Learning representations without labeled data.
Self-supervised learning (SSL) trains models on unlabeled data by defining pretext tasks — predicting masked tokens, matching augmented views, or reconstructing corrupted inputs. SSL is the foundation of modern AI: BERT, GPT, CLIP, DINO, and MAE all use self-supervised pretraining, making it the dominant paradigm for learning representations.
History
BERT (2018) introduces masked language modeling — predicting masked tokens in text
GPT-1 (2018) uses autoregressive language modeling as self-supervised pretraining
SimCLR (Chen et al., 2020) establishes contrastive learning for visual SSL
BYOL (Bootstrap Your Own Latent, 2020) achieves SSL without negative pairs
DINO (2021) shows self-distillation produces powerful visual features with ViT
MAE (Masked Autoencoder, 2021) applies BERT-style masking to image patches
data2vec (2022) unifies SSL across vision, language, and speech
DINOv2 (2023) — Meta's universal visual feature extractor trained on 142M curated images
I-JEPA (Assran et al., 2023) — Joint Embedding Predictive Architecture for learning world models
Today, self-supervised pretraining is the default — virtually every foundation model uses it
How Self-Supervised Learning Works
Pretext Task Design
Define a task using only unlabeled data: mask tokens and predict them (MLM), predict the next token (autoregressive), match augmented views (contrastive), or reconstruct masked patches (MAE).
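As a rough illustration of the first pretext task listed above, here is a minimal sketch of BERT-style token masking in pure Python. The token list, mask rate, and seed are illustrative choices, not values from any specific implementation; a real pipeline would also sometimes keep or randomly replace selected tokens rather than always inserting the mask symbol.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """BERT-style masking: hide a fraction of tokens; the model's
    pretext task is to predict the originals from the context."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append((i, tok))  # position and original token to predict
        else:
            corrupted.append(tok)
    return corrupted, targets

corrupted, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"])
# With this seed, the first token is hidden: targets == [(0, "the")]
```

The `targets` list is exactly the supervision signal — it is derived from the data itself, which is what makes the task self-supervised.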
Large-Scale Pretraining
The model is trained on massive unlabeled datasets — internet text for language, ImageNet/web images for vision, or audio corpora for speech.
Representation Learning
Through solving the pretext task, the model learns general-purpose representations that capture semantic structure.
Transfer / Fine-Tuning
Pretrained representations are adapted to downstream tasks — classification, detection, generation — via fine-tuning, linear probing, or in-context learning.
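To make the probing idea concrete, here is a toy stand-in for evaluating frozen features: a nearest-class-mean classifier (one simple linear probe variant) fit on embeddings that a pretrained encoder would produce. The feature values and class names are hypothetical; a standard linear probe would instead train a logistic-regression head on much higher-dimensional features.

```python
import math

def class_means(features, labels):
    """Average the frozen embeddings of each class (the probe's 'weights')."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for j, v in enumerate(f):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(feature, means):
    """Assign the class whose mean embedding is closest (Euclidean)."""
    return min(means, key=lambda y: math.dist(feature, means[y]))

# Hypothetical frozen 2-D embeddings from a pretrained encoder.
feats = [[0.9, 0.1], [1.1, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
means = class_means(feats, labels)
print(predict([1.0, 0.2], means))  # classifies the new embedding as "cat"
```

The point of probing is that the encoder is never updated — if a classifier this simple works on top of the frozen features, the pretraining has already done the representational work.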
Current Landscape
Self-supervised learning in 2025 is not a technique — it's the paradigm. Every foundation model (GPT-4, Claude, Gemini, Llama, DINO, CLIP) uses SSL pretraining. The interesting research frontier has moved from 'does SSL work?' (yes, definitively) to 'what is the best pretext task?' and 'how to scale efficiently?' In vision, the debate is between contrastive methods (CLIP, DINO), reconstructive methods (MAE), and predictive methods (I-JEPA). In language, autoregressive modeling has clearly won. The field is increasingly focused on multi-modal SSL that bridges vision, language, and other modalities.
Key Challenges
Pretext task design — the choice of self-supervised objective significantly affects downstream performance
Compute requirements — SSL pretraining at scale requires thousands of GPU-hours (DINOv2: 12K GPU-hours)
Representation collapse — contrastive methods can converge to trivial solutions where all representations are identical
Evaluation standardization — no consensus on how to compare SSL methods (linear probe vs. fine-tuning vs. few-shot)
Data quality — SSL amplifies data biases, since the model learns whatever structure is present in the unlabeled data
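The collapse failure mode above can be seen directly in the InfoNCE loss used by contrastive methods. The sketch below (pure Python, toy 2-D embeddings chosen for illustration) shows that well-separated embeddings give a near-zero loss, while fully collapsed embeddings pin the loss at log(N) — a constant that provides no useful gradient signal.

```python
import math

def info_nce(anchor, positive, negatives, temp=0.1):
    """InfoNCE: pull the anchor toward its positive view, push it away
    from negatives. If every embedding collapses to the same point, all
    similarities are equal and the loss saturates at log(N)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    exps = [math.exp(s / temp) for s in sims]
    return -math.log(exps[0] / sum(exps))

# Healthy embeddings: positive close, negatives far -> loss near zero.
low = info_nce([1, 0], [0.9, 0.1], [[0, 1], [-1, 0]])
# Collapsed embeddings: everything identical -> loss stuck at log(3).
collapsed = info_nce([1, 0], [1, 0], [[1, 0], [1, 0]])
```

Methods like BYOL and DINO avoid this trivial solution without explicit negatives, using asymmetric networks, stop-gradients, or centering instead.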
Quick Recommendations
Visual feature extraction
DINOv2
Best general-purpose visual features — works for classification, detection, segmentation without fine-tuning
Language pretraining
GPT-style autoregressive / BERT-style MLM
The foundation of all modern LLMs
Multi-modal
CLIP (contrastive) / I-JEPA (predictive)
CLIP for text-image alignment; I-JEPA for learning predictive world models
Domain-specific SSL
MAE fine-tuned on domain data
Masked autoencoding adapts well to medical, satellite, and scientific imaging
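Since masked autoencoding keeps coming up in these recommendations, here is a minimal sketch of MAE-style patch masking, assuming a flattened grid of image patches; the 75% ratio matches the MAE paper's default, while the grid size and seed are illustrative. The encoder would see only the kept patches, and a lightweight decoder would reconstruct the masked ones.

```python
import random

def mask_patches(num_patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking: keep a small random subset of patches;
    the pretext task is to reconstruct the masked majority."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    n_keep = int(num_patches * (1 - mask_ratio))
    keep, masked = sorted(ids[:n_keep]), sorted(ids[n_keep:])
    return keep, masked

keep, masked = mask_patches(16)  # e.g. a 4x4 grid -> 4 visible, 12 masked
```

Because the encoder processes only ~25% of patches, pretraining is cheap relative to contrastive methods, which is part of why MAE adapts well to new domains.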
What's Next
The frontier is self-supervised world models — learning predictive representations of how the world works from video and sensory data, not just static images and text. I-JEPA and video prediction models point toward agents that understand physics and causality through self-supervised observation. Expect SSL to become invisible — assumed as the default, not a research contribution.