
Fill-Mask

Fill-mask (masked language modeling) is the original BERT pretraining objective: mask roughly 15% of tokens and predict the originals from bidirectional context. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa, which still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing the bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.


Fill-mask (masked language modeling) predicts missing tokens in text, serving as both a pretraining objective and a probe for linguistic knowledge. BERT popularized it, and it remains the core training signal for encoder models like RoBERTa, DeBERTa, and ModernBERT. As a standalone task it's mostly used for analysis and education rather than production applications.
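The training-side half of this objective is simple enough to sketch directly. The toy implementation below follows BERT's published masking recipe (select ~15% of positions; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). The token strings and vocabulary are illustrative only, not tied to any real tokenizer.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style dynamic masking: returns (inputs, labels).

    labels[i] holds the original token at selected positions and None elsewhere,
    so the MLM loss is computed only on the ~15% of selected positions.
    """
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:                      # 80% of selections: replace with [MASK]
                inputs[i] = MASK
            elif r < 0.9:                    # 10%: replace with a random vocabulary token
                inputs[i] = rng.choice(VOCAB)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Because the selection is redrawn on every call, this is the "dynamic masking" that RoBERTa showed beats masking each sentence once at preprocessing time.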

History

2013

Word2Vec's CBOW model predicts center words from context — a precursor to masked prediction

2018

BERT (Devlin et al.) introduces Masked Language Modeling: randomly mask 15% of tokens and predict them from bidirectional context

2019

RoBERTa (Liu et al.) shows that dynamic masking and more training data significantly improve MLM-pretrained models

2019

ALBERT uses sentence-order prediction alongside MLM for more parameter-efficient pretraining

2020

ELECTRA (Clark et al.) replaces MLM with replaced-token detection — more sample-efficient pretraining

2021

DeBERTa-v3 (He et al.) replaces MLM with ELECTRA-style replaced-token detection — its generator is still MLM-trained — achieving SOTA on downstream tasks

2023

Fill-mask used as a probing tool to study what language models know about syntax, semantics, and world knowledge

2024

ModernBERT revives the encoder architecture with updated training recipes, using MLM as the core objective

How Fill-Mask Works

Fill-Mask Pipeline
1

Masking

Random positions in the input are selected for prediction — roughly 15% of tokens during training (BERT replaces 80% of these with [MASK], 10% with a random token, and leaves 10% unchanged), user-chosen at inference

2

Bidirectional encoding

The entire sequence (with masked positions) is processed by the transformer encoder, attending in all directions

3

Prediction

A classification head over the vocabulary predicts the original token at each masked position

4

Scoring

The model outputs a probability distribution over the vocabulary; top-k predictions are returned with confidence scores
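Steps 3 and 4 reduce to a softmax over vocabulary logits at the masked position followed by a top-k sort. A minimal sketch, using a hand-picked four-word vocabulary and made-up logits in place of a real model's output:

```python
import math

def top_k_predictions(logits, vocab, k=3):
    """Turn raw logits at one masked position into top-k (token, probability) pairs."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

vocab = ["mat", "dog", "sofa", "moon"]
preds = top_k_predictions([4.0, 1.0, 3.0, 0.5], vocab, k=2)
```

With these toy logits the ranking comes back "mat" first, then "sofa" — the same (token, score) shape that fill-mask APIs typically return.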

Current Landscape

Fill-mask as a standalone task is primarily an educational and analytical tool in 2025 — it's how encoder models are pretrained, but it's not itself a production task. The real value of MLM is as a pretraining objective that produces models (DeBERTa, RoBERTa, ModernBERT) used for classification, NER, and other downstream tasks. The debate between MLM and autoregressive pretraining is settled: both work, but autoregressive models (GPT-style) scale to generation while MLM models excel at understanding tasks.

Key Challenges

MLM trains on only 15% of tokens per pass — less efficient than autoregressive LM or replaced-token detection

The [MASK] token doesn't appear at inference time for downstream tasks, creating a train-test mismatch

Fill-mask predictions are local — they don't capture long-range document-level coherence

Tokenizer artifacts: subword tokenization means the model predicts subword pieces, not always complete words
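The last challenge is concrete in BERT's WordPiece convention, where continuation pieces carry a "##" prefix: a single masked position yields one piece, so a rarer word like "unpredictable" spans several positions and must be reassembled. A small merging helper (the "##" convention is BERT's; the example tokens are illustrative):

```python
def merge_wordpieces(pieces):
    """Join WordPiece subwords back into words; '##' marks a continuation piece."""
    words = []
    for p in pieces:
        if p.startswith("##") and words:
            words[-1] += p[2:]   # glue continuation onto the previous word
        else:
            words.append(p)
    return words
```

For example, `["the", "un", "##predict", "##able", "cat"]` merges back to `["the", "unpredictable", "cat"]` — which is why single-mask fill-mask demos skew toward words the tokenizer keeps whole.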

Quick Recommendations

Best MLM model

DeBERTa-v3-large or ModernBERT-large

Top fill-mask accuracy with modern pretraining; strong transfer to downstream tasks

Linguistic probing

BERT-base-uncased

Most studied model; extensive literature on what it captures at each layer

Multilingual fill-mask

XLM-RoBERTa-large

Covers 100 languages with consistent MLM performance

Efficient pretraining

ELECTRA-large

Replaced-token detection learns from all tokens, not just 15%; more sample-efficient
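ELECTRA's sample-efficiency point can be made concrete: the discriminator gets a binary label at every position, not just the ~15% that MLM masks. A toy sketch of the labeling step, with the generator that produces the corrupted sequence omitted:

```python
def rtd_labels(original, corrupted):
    """ELECTRA-style replaced-token detection labels: 1 = replaced, 0 = original.

    Every position carries a label, so the discriminator learns from all tokens
    rather than only the masked subset used in MLM.
    """
    assert len(original) == len(corrupted), "sequences must be aligned"
    return [int(o != c) for o, c in zip(original, corrupted)]
```

Here `rtd_labels(["the", "cat", "sat"], ["the", "dog", "sat"])` yields `[0, 1, 0]` — one supervised signal per token.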

What's Next

MLM's role will continue as the pretraining objective for efficient encoder models. ModernBERT and future encoder architectures will use improved variants (whole-word masking, span masking, replaced-token detection) rather than vanilla BERT-style token masking. Expect MLM to remain important wherever bidirectional encoding outperforms autoregressive models — classification, retrieval, and structured prediction tasks.

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
