Fill-Mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.
Fill-mask (masked language modeling) predicts missing tokens in text, serving as both a pretraining objective and a probe for linguistic knowledge. BERT popularized it, and it remains the core training signal for encoder models like RoBERTa, DeBERTa, and ModernBERT. As a standalone task it's mostly used for analysis and education rather than production applications.
History
Word2Vec's CBOW model predicts center words from context — a precursor to masked prediction
BERT (Devlin et al.) introduces Masked Language Modeling: randomly mask 15% of tokens and predict them from bidirectional context
RoBERTa (Liu et al.) shows that dynamic masking and more training data significantly improve MLM-pretrained models
ALBERT pairs MLM with sentence-order prediction and shares parameters across layers for a more parameter-efficient model
ELECTRA (Clark et al.) replaces MLM with replaced-token detection — more sample-efficient pretraining
DeBERTa-v3 swaps plain MLM for ELECTRA-style replaced-token detection (an MLM-trained generator feeds the discriminator), achieving SOTA on downstream tasks
Fill-mask is adopted as a probing tool to study what language models know about syntax, semantics, and world knowledge
ModernBERT revives the encoder architecture with updated training recipes, using MLM as the core objective
How Fill-Mask Works
Masking
Random tokens in the input are replaced with a [MASK] token. During training, BERT selects 15% of positions (80% become [MASK], 10% a random token, 10% are left unchanged); at inference, the user chooses which positions to mask
Bidirectional encoding
The entire sequence (with masked positions) is processed by the transformer encoder, attending in all directions
Prediction
A classification head over the vocabulary predicts the original token at each masked position
Scoring
The model outputs a probability distribution over the vocabulary; top-k predictions are returned with confidence scores
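The prediction and scoring steps above can be sketched in a few lines. This is a minimal illustration using NumPy to stand in for the encoder's output logits at one masked position; the toy vocabulary, logit values, and the `top_k_predictions` helper are illustrative, not from any real checkpoint:

```python
import numpy as np

def top_k_predictions(logits, vocab, k=3):
    """Softmax the vocabulary logits at one masked position and
    return the k most likely tokens with confidence scores."""
    z = logits - logits.max()            # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # probability distribution over the vocab
    order = np.argsort(probs)[::-1][:k]  # indices of the top-k tokens
    return [(vocab[i], float(probs[i])) for i in order]

# Toy 5-word vocabulary and made-up logits for one [MASK] position.
vocab = ["capital", "city", "heart", "pride", "centre"]
logits = np.array([4.1, 2.3, 1.0, 0.2, 1.9])
for token, score in top_k_predictions(logits, vocab):
    print(f"{token}: {score:.3f}")  # top-k tokens with confidence scores
```

In a real model the logits come from a classification head over the full vocabulary (30k+ entries for BERT-style tokenizers), computed independently at each masked position.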
Current Landscape
Fill-mask as a standalone task is primarily an educational and analytical tool in 2025 — it's how encoder models are pretrained, but it's not itself a production task. The real value of MLM is as a pretraining objective that produces models (DeBERTa, RoBERTa, ModernBERT) used for classification, NER, and other downstream tasks. The debate between MLM and autoregressive pretraining is settled: both work, but autoregressive models (GPT-style) scale to generation while MLM models excel at understanding tasks.
Key Challenges
MLM trains on only 15% of tokens per pass — less efficient than autoregressive LM or replaced-token detection
The [MASK] token doesn't appear at inference time for downstream tasks, creating a train-test mismatch
Fill-mask predictions are local — they don't capture long-range document-level coherence
Tokenizer artifacts: subword tokenization means the model predicts subword pieces, not always complete words
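The train-test mismatch above comes from BERT's 80/10/10 corruption recipe: of the ~15% of positions selected for the loss, only 80% are actually replaced with [MASK]. A hedged sketch of that selection step, assuming integer token IDs (the `MASK_ID` value matches BERT's vocabulary, but the sequence IDs here are arbitrary):

```python
import random

MASK_ID = 103  # [MASK] in BERT's WordPiece vocabulary

def bert_style_mask(token_ids, vocab_size, rng, select_prob=0.15):
    """Select ~15% of positions for the MLM loss; of those,
    80% -> [MASK], 10% -> random token, 10% left unchanged.
    Returns (corrupted_ids, positions where loss is computed)."""
    corrupted = list(token_ids)
    targets = []
    for i in range(len(token_ids)):
        if rng.random() >= select_prob:
            continue                      # position not selected this pass
        targets.append(i)                 # loss is computed here either way
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID        # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # else 10%: keep the original token unchanged
    return corrupted, targets

rng = random.Random(0)
ids = list(range(1000, 1032))  # a 32-token toy sequence
corrupted, targets = bert_style_mask(ids, vocab_size=30522, rng=rng)
```

RoBERTa's "dynamic masking" simply re-runs this selection on every epoch instead of fixing it once at preprocessing time, so each pass computes the loss at a different ~15% of positions.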
Quick Recommendations
Best MLM model
DeBERTa-v3-large or ModernBERT-large
Top fill-mask accuracy with modern pretraining; strong transfer to downstream tasks
Linguistic probing
BERT-base-uncased
Most studied model; extensive literature on what it captures at each layer
Multilingual fill-mask
XLM-RoBERTa-large
Covers 100 languages with consistent MLM performance
Efficient pretraining
ELECTRA-large
Replaced-token detection learns from all tokens, not just 15%; more sample-efficient
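The sample-efficiency claim above can be made concrete: replaced-token detection derives a binary real-vs-replaced label at every position, while MLM only gets a learning signal at the masked ~15%. A toy comparison of label coverage (the helper names and token strings are illustrative, not ELECTRA's actual implementation):

```python
def rtd_labels(original, corrupted):
    """Replaced-token detection: every position gets a binary label
    (1 = swapped in by the generator, 0 = original), so the
    discriminator receives a learning signal at all tokens."""
    return [int(o != c) for o, c in zip(original, corrupted)]

def mlm_loss_positions(masked, mask_token="[MASK]"):
    """Masked LM: the loss is only computed where the input was masked."""
    return [i for i, tok in enumerate(masked) if tok == mask_token]

original  = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = ["the", "dog", "sat", "on", "the", "mat"]   # generator swapped one token
masked    = ["the", "[MASK]", "sat", "on", "the", "mat"]

print(rtd_labels(original, corrupted))  # -> [0, 1, 0, 0, 0, 0]: 6 signals
print(mlm_loss_positions(masked))       # -> [1]: loss at 1 of 6 positions
```

This per-position signal is why ELECTRA-style pretraining (and DeBERTa-v3's variant of it) reaches comparable downstream quality with less compute than vanilla MLM.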
What's Next
MLM's role will continue as the pretraining objective for efficient encoder models. ModernBERT and future encoder architectures will use improved variants (whole-word masking, span masking, replaced-token detection) rather than vanilla BERT-style token masking. Expect MLM to remain important wherever bidirectional encoding outperforms autoregressive models — classification, retrieval, and structured prediction tasks.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Ranking passages or documents by relevance to a query; the retrieval backbone of search engines and RAG pipelines (MS MARCO, BEIR).
Table Question Answering
Answering natural-language questions over tabular data, returning a cell or computed value (WikiTableQuestions, SQA).