Fill-Mask

Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.
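The masking step described above can be sketched in plain Python. This is a minimal illustration of BERT's corruption scheme (select ~15% of positions; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged), not any library's actual implementation; the token ids and vocabulary size are placeholder assumptions.

```python
import random

MASK_ID = 103        # placeholder [MASK] id (matches BERT-base convention)
VOCAB_SIZE = 30522   # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked language modeling.

    Returns (inputs, labels): labels is -100 (ignored by the loss)
    everywhere except the selected positions, where it holds the
    original token the model must predict.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:          # select ~15% of positions
            labels[i] = tok                   # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token unchanged
    return inputs, labels
```

Leaving 10% of selected tokens unchanged forces the model to produce useful representations even for positions that look intact, which reduces the train/inference mismatch introduced by the [MASK] symbol.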

Datasets: 1 · Results: 3 · Canonical metric: accuracy
Canonical Benchmark

GLUE

General Language Understanding Evaluation for masked language models

Primary metric: accuracy

Top 10

Leading models on GLUE.

Rank  Model              Avg. score  Year  Source
1     DeBERTa-v3-large   91.4        2023  paper
2     ALBERT-xxlarge-v2  89.4        2020  paper
3     RoBERTa-large      88.5        2019  paper

All datasets

1 dataset tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.
