
Fill-Mask

Fill-mask (masked language modeling) is the original BERT pretraining objective: mask roughly 15% of tokens and predict the originals from bidirectional context. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa, which still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing the bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.


Fill-mask (masked language modeling) predicts missing tokens in text, serving as both a pretraining objective and a probe for linguistic knowledge. BERT popularized it, and it remains the core training signal for encoder models like RoBERTa, DeBERTa, and ModernBERT. As a standalone task it's mostly used for analysis and education rather than production applications.
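The training-side half of this objective is simple enough to sketch directly. The toy implementation below follows BERT's published masking recipe (select ~15% of positions; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). The token strings and vocabulary are illustrative only, not tied to any real tokenizer.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style dynamic masking: returns (inputs, labels).

    labels[i] holds the original token at selected positions and None elsewhere,
    so the MLM loss is computed only on the ~15% of selected positions.
    """
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:                      # 80% of selections: replace with [MASK]
                inputs[i] = MASK
            elif r < 0.9:                    # 10%: replace with a random vocabulary token
                inputs[i] = rng.choice(VOCAB)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```

Because the selection is redrawn on every call, this is the "dynamic masking" that RoBERTa showed beats masking each sentence once at preprocessing time.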

History

2013

Word2Vec's CBOW model predicts center words from context — a precursor to masked prediction

2018

BERT (Devlin et al.) introduces Masked Language Modeling: randomly mask 15% of tokens and predict them from bidirectional context

2019

RoBERTa (Liu et al.) shows that dynamic masking and more training data significantly improve MLM-pretrained models

2019

ALBERT uses sentence-order prediction alongside MLM for more parameter-efficient pretraining

2020

ELECTRA (Clark et al.) replaces MLM with replaced-token detection — more sample-efficient pretraining

2021

DeBERTa-v3 (He et al.) replaces MLM with ELECTRA-style replaced-token detection — its generator is still MLM-trained — achieving SOTA on downstream tasks

2023

Fill-mask used as a probing tool to study what language models know about syntax, semantics, and world knowledge

2024

ModernBERT revives the encoder architecture with updated training recipes, using MLM as the core objective

How Fill-Mask Works

Fill-Mask Pipeline
1

Masking

Random positions in the input are selected for prediction — roughly 15% of tokens during training (BERT replaces 80% of these with [MASK], 10% with a random token, and leaves 10% unchanged), user-chosen at inference

2

Bidirectional encoding

The entire sequence (with masked positions) is processed by the transformer encoder, attending in all directions

3

Prediction

A classification head over the vocabulary predicts the original token at each masked position

4

Scoring

The model outputs a probability distribution over the vocabulary; top-k predictions are returned with confidence scores
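Steps 3 and 4 reduce to a softmax over vocabulary logits at the masked position followed by a top-k sort. A minimal sketch, using a hand-picked four-word vocabulary and made-up logits in place of a real model's output:

```python
import math

def top_k_predictions(logits, vocab, k=3):
    """Turn raw logits at one masked position into top-k (token, probability) pairs."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

vocab = ["mat", "dog", "sofa", "moon"]
preds = top_k_predictions([4.0, 1.0, 3.0, 0.5], vocab, k=2)
```

With these toy logits the ranking comes back "mat" first, then "sofa" — the same (token, score) shape that fill-mask APIs typically return.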

Current Landscape

Fill-mask as a standalone task is primarily an educational and analytical tool in 2025 — it's how encoder models are pretrained, but it's not itself a production task. The real value of MLM is as a pretraining objective that produces models (DeBERTa, RoBERTa, ModernBERT) used for classification, NER, and other downstream tasks. The debate between MLM and autoregressive pretraining is settled: both work, but autoregressive models (GPT-style) scale to generation while MLM models excel at understanding tasks.

Key Challenges

MLM trains on only 15% of tokens per pass — less efficient than autoregressive LM or replaced-token detection

The [MASK] token doesn't appear at inference time for downstream tasks, creating a train-test mismatch

Fill-mask predictions are local — they don't capture long-range document-level coherence

Tokenizer artifacts: subword tokenization means the model predicts subword pieces, not always complete words
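The last challenge is concrete in BERT's WordPiece convention, where continuation pieces carry a "##" prefix: a single masked position yields one piece, so a rarer word like "unpredictable" spans several positions and must be reassembled. A small merging helper (the "##" convention is BERT's; the example tokens are illustrative):

```python
def merge_wordpieces(pieces):
    """Join WordPiece subwords back into words; '##' marks a continuation piece."""
    words = []
    for p in pieces:
        if p.startswith("##") and words:
            words[-1] += p[2:]   # glue continuation onto the previous word
        else:
            words.append(p)
    return words
```

For example, `["the", "un", "##predict", "##able", "cat"]` merges back to `["the", "unpredictable", "cat"]` — which is why single-mask fill-mask demos skew toward words the tokenizer keeps whole.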

Quick Recommendations

Best MLM model

DeBERTa-v3-large or ModernBERT-large

Top fill-mask accuracy with modern pretraining; strong transfer to downstream tasks

Linguistic probing

BERT-base-uncased

Most studied model; extensive literature on what it captures at each layer

Multilingual fill-mask

XLM-RoBERTa-large

Covers 100 languages with consistent MLM performance

Efficient pretraining

ELECTRA-large

Replaced-token detection learns from all tokens, not just 15%; more sample-efficient
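ELECTRA's sample-efficiency point can be made concrete: the discriminator gets a binary label at every position, not just the ~15% that MLM masks. A toy sketch of the labeling step, with the generator that produces the corrupted sequence omitted:

```python
def rtd_labels(original, corrupted):
    """ELECTRA-style replaced-token detection labels: 1 = replaced, 0 = original.

    Every position carries a label, so the discriminator learns from all tokens
    rather than only the masked subset used in MLM.
    """
    assert len(original) == len(corrupted), "sequences must be aligned"
    return [int(o != c) for o, c in zip(original, corrupted)]
```

Here `rtd_labels(["the", "cat", "sat"], ["the", "dog", "sat"])` yields `[0, 1, 0]` — one supervised signal per token.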

What's Next

MLM's role will continue as the pretraining objective for efficient encoder models. ModernBERT and future encoder architectures will use improved variants (whole-word masking, span masking, replaced-token detection) rather than vanilla BERT-style token masking. Expect MLM to remain important wherever bidirectional encoding outperforms autoregressive models — classification, retrieval, and structured prediction tasks.

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
