Natural Language Processing
Processing and understanding text. Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.
Language Modeling
Predicting the next word or token in a sequence. Core task for GPT-style models.
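To make the task concrete, here is a minimal sketch of next-token prediction with a toy Laplace-smoothed bigram model (a hypothetical illustration, not how GPT-style models are implemented):

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Count bigram and unigram frequencies over a tokenized corpus."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return bigrams, unigrams

def next_token_probs(bigrams, unigrams, prev, vocab, alpha=1.0):
    """Laplace-smoothed P(token | prev) over a fixed vocabulary."""
    denom = unigrams[prev] + alpha * len(vocab)
    return {t: (bigrams[(prev, t)] + alpha) / denom for t in vocab}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bigrams, unigrams = train_bigram_lm(corpus)
vocab = {"the", "cat", "dog", "sat", "</s>"}
probs = next_token_probs(bigrams, unigrams, "the", vocab)
```

Modern language models replace the count table with a neural network, but the evaluation target is the same: assign high probability to the actual next token.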
Machine Translation
Translating text from one language to another (WMT benchmarks).
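WMT systems are typically scored with BLEU. A simplified single-reference, sentence-level version (modified n-gram precision with a brevity penalty; the official metric aggregates at corpus level up to 4-grams) can be sketched as:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions against one reference, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        total = max(len(candidate) - n + 1, 0)
        if total == 0:
            return 0.0
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```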
Question Answering
Answering questions based on context (SQuAD, Natural Questions).
SQuAD 2.0: 150K questions on Wikipedia articles, including 50K unanswerable questions. Tests reading comprehension and the ability to recognize when a question cannot be answered from the passage.
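Extractive QA answers are usually scored with exact match and token-overlap F1. A stripped-down version of the F1 computation (the official SQuAD script additionally lowercases and strips punctuation and articles before comparing):

```python
from collections import Counter

def squad_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string,
    as used for SQuAD-style extractive question answering."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```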
Text Classification
Categorizing text into predefined classes (sentiment, topic).
GLUE: Collection of 9 NLU tasks including sentiment analysis, textual entailment, and question answering. Standard benchmark for general language understanding.
SuperGLUE: More difficult successor to GLUE with 8 challenging tasks, designed to remain hard for current models.
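Classification benchmarks report metrics beyond plain accuracy; GLUE, for example, scores the CoLA task with the Matthews correlation coefficient, which stays informative on imbalanced label distributions. A self-contained binary version:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1).
    +1 is perfect prediction, 0 is chance-level, -1 is total disagreement."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```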
Named Entity Recognition
Identifying and classifying named entities in text (CoNLL).
CoNLL-2003: Reuters news stories annotated with 4 entity types (PER, ORG, LOC, MISC). The standard NER benchmark.
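NER datasets like CoNLL are commonly distributed with BIO tags (B- begins an entity, I- continues it, O is outside); evaluation compares whole decoded spans, not individual tags. A minimal BIO decoder:

```python
def bio_to_spans(tokens, tags):
    """Decode per-token BIO tags into (entity_type, text) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # "O" tag, or an I- tag that does not continue the open span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans
```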
Text Summarization
Generating concise summaries of longer documents (CNN/DailyMail, XSum).
CNN/DailyMail: 300K news articles paired with multi-sentence summaries. Standard benchmark for abstractive summarization.
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
SNLI: 570K human-written English sentence pairs labeled entailment, contradiction, or neutral, with a balanced label distribution.
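NLI systems map a premise-hypothesis pair to one of the three labels. As a deliberately weak illustration of the input/output format (not a real model, and a heuristic known to fail on adversarial NLI examples), a lexical-overlap baseline with an assumed threshold of 0.8:

```python
def overlap_baseline(premise, hypothesis, threshold=0.8):
    """Toy NLI heuristic: predict 'entailment' when most hypothesis
    words appear in the premise, otherwise 'neutral'."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(p & h) / len(h) if h else 0.0
    return "entailment" if overlap >= threshold else "neutral"
```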
Semantic Textual Similarity
Measuring similarity between text pairs (STS Benchmark).
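STS Benchmark systems output a similarity score per sentence pair and are evaluated by correlating those scores with human judgments, conventionally via Pearson correlation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted similarity scores and
    gold human ratings (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```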
Reading Comprehension
Understanding and answering questions about passages.