Fill-Mask

Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.
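The masking step described above can be sketched in plain Python. This is a minimal illustration of BERT's corruption scheme (select ~15% of positions; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged), not any library's actual implementation; the token ids and vocabulary size are placeholder assumptions.

```python
import random

MASK_ID = 103        # placeholder [MASK] id (matches BERT-base convention)
VOCAB_SIZE = 30522   # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked language modeling.

    Returns (inputs, labels): labels is -100 (ignored by the loss)
    everywhere except the selected positions, where it holds the
    original token the model must predict.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:          # select ~15% of positions
            labels[i] = tok                   # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```

Leaving 10% of selected tokens unchanged forces the model to produce useful representations even for positions that look intact, which reduces the train/inference mismatch introduced by the [MASK] symbol.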

Datasets: 1 · Results: 3 · Canonical metric: accuracy
Canonical Benchmark

GLUE

General Language Understanding Evaluation for masked language models

Primary metric: accuracy

Top 10

Leading models on GLUE.

Rank  Model              Avg. score  Year  Source
1     DeBERTa-v3-large   91.4        2023  paper
2     ALBERT-xxlarge-v2  89.4        2020  paper
3     RoBERTa-large      88.5        2019  paper

All datasets

1 dataset tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.
