
Language Modeling

Language Modeling is the task of predicting the next word or character in a sequence given the previous context. Language models learn the probability distribution of word sequences and are foundational for many NLP applications including text generation, machine translation, and speech recognition.

55 datasets · 14 results

Language modeling — predicting the next token given preceding context — is the foundational task that powers all modern NLP. GPT-4, Claude, Llama, and Gemini are all language models at their core. Perplexity on held-out text remains the key intrinsic metric, but downstream task performance has become the real measure of progress.
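Since perplexity is just the exponentiated mean negative log-likelihood per token, it falls straight out of a model's cross-entropy loss. A minimal sketch using PyTorch and the Hugging Face transformers API (the GPT-2 checkpoint and the sample text are illustrative stand-ins, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works the same way; GPT-2 is just small and public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "Language modeling is the task of predicting the next token."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels=ids makes the model return the mean cross-entropy of
    # each position's prediction of the *following* token.
    loss = model(input_ids=ids, labels=ids).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")  # PPL = exp(mean NLL)
```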

History

2003

Bengio et al. introduce the feedforward neural language model as an alternative to count-based n-gram models

2013

Word2Vec shows that language model byproducts (embeddings) transfer to downstream NLP tasks

2017

Transformer architecture (Vaswani et al.) enables massively parallel training, replacing recurrent models

2018

GPT (Radford et al.) demonstrates that autoregressive pretraining on unlabeled text (the BooksCorpus) produces representations useful for downstream tasks

2019

GPT-2 (1.5B params, trained on 40GB of WebText) shows emergent generation quality; OpenAI delays its full release over misuse concerns

2020

GPT-3 (175B params) demonstrates in-context learning — the model performs tasks from examples in the prompt

2023

GPT-4 and Claude 2 approach expert-level performance on many NLP, coding, and reasoning benchmarks

2023

Llama 2 (Meta) opens the floodgates for open-weight LLMs; Mistral-7B matches Llama 2 13B

2024

Llama 3.1 405B, DeepSeek-V3, and Qwen2.5-72B close the gap with proprietary frontier models

2025

Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 compete on reasoning, coding, and agentic capabilities; Llama 4 and DeepSeek-R1 push open-source further

How Language Modeling Works

Language Modeling Pipeline

1. Tokenization: Text is encoded into subword tokens using BPE (GPT), SentencePiece (Llama), or custom tokenizers; vocabulary sizes range from 32K to 256K.

2. Embedding: Each token is mapped to a dense vector; positional information is added via learned or rotary (RoPE) position embeddings.

3. Transformer layers: Tokens pass through N layers of multi-head self-attention and feed-forward networks; modern models use 32-128 layers.

4. Next-token prediction: A linear head projects the final hidden state to vocabulary logits; a softmax gives the probability distribution over the next token.

5. Training: Cross-entropy loss on next-token prediction over trillions of tokens from web text, code, and curated data.
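Put together, the five steps are compact enough to sketch end to end. The following PyTorch sketch is illustrative only: the dimensions, vocabulary size, and random batch are placeholders rather than settings from any real model, and learned positions stand in for the RoPE used by most current LLMs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, vocab=32000, d=256, layers=4, heads=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)      # step 2: token embedding
        self.pos_emb = nn.Embedding(max_len, d)    # learned positions (RoPE in real models)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)  # step 3: N layers
        self.head = nn.Linear(d, vocab)            # step 4: project to vocab logits

    def forward(self, ids):                        # ids: (batch, seq), step 1's output
        pos = torch.arange(ids.size(1), device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1)).to(ids.device)
        x = self.blocks(x, mask=causal)            # causal mask: attend to the left only
        return self.head(x)                        # (batch, seq, vocab) logits

model = TinyLM()
ids = torch.randint(0, 32000, (2, 128))            # placeholder token-id batch
logits = model(ids)
# Step 5: cross-entropy on next-token prediction (targets are inputs shifted by one).
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 32000), ids[:, 1:].reshape(-1))
loss.backward()
```

Frontier models differ from this sketch in scale (tens of billions of parameters, trillions of tokens, sharded across thousands of GPUs), not in the objective: training is still next-token cross-entropy.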

Current Landscape

Language modeling in 2025 is the foundation of the entire AI industry. The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) continue to hold: more compute and data produce better models. But the frontier has shifted from pure scale to efficiency (MoE architectures, DeepSeek), reasoning (o1-style inference-time compute), and post-training (RLHF, DPO, Constitutional AI). Open-source models lag frontier by 6-12 months but are increasingly competitive. The Chinchilla-optimal training paradigm has given way to over-training smaller models for cheaper inference.
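For reference, the Chinchilla analysis models held-out loss with a simple parametric form in parameter count N and training tokens D (the exponents below are the paper's approximate fitted values):

```latex
% Hoffmann et al., 2022: E is the irreducible loss of natural text;
% fitted exponents are roughly \alpha \approx 0.34, \beta \approx 0.28.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing this under a fixed compute budget C ≈ 6ND yields the well-known compute-optimal rule of thumb of roughly 20 training tokens per parameter; over-trained models such as Llama 3 8B (about 15T tokens, nearly 2,000 tokens per parameter) deliberately overshoot that ratio to buy cheaper inference.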

Key Challenges

Scaling cost: training a frontier model costs $50-500M+ in compute; only a handful of organizations can afford it

Data quality and curation are arguably more important than model size — garbage in, garbage out at scale

Evaluation: perplexity doesn't capture reasoning ability; benchmarks saturate quickly; human evaluation is expensive

Alignment: making models helpful, harmless, and honest through RLHF/RLAIF adds complexity and potential capability loss

Inference cost: serving large models requires expensive GPU clusters; efficiency techniques (quantization, speculative decoding) are critical (a minimal sketch of speculative decoding follows this list)
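Speculative decoding is worth unpacking because it exploits a structural fact about transformers: verifying k proposed tokens costs one parallel forward pass of the large model, while generating them one by one would cost k sequential passes. The sketch below is a simplified greedy-acceptance variant; `target` and `draft` are hypothetical callables returning next-token logits, and production systems accept/reject against full probability distributions rather than argmax.

```python
import torch

@torch.no_grad()
def speculative_decode(target, draft, ids, k=4, max_new=64):
    """Greedy speculative decoding sketch. `target`/`draft` are stand-ins:
    callables mapping (1, seq) token ids to (1, seq, vocab) logits and
    sharing one tokenizer. Accepting only the drafted prefix the target
    agrees with keeps the output identical to plain greedy decoding."""
    goal = ids.size(1) + max_new
    while ids.size(1) < goal:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal = ids
        for _ in range(k):
            nxt = draft(proposal)[:, -1].argmax(-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=1)
        # 2) The target scores every proposed position in ONE forward pass.
        verify = target(proposal)[:, ids.size(1) - 1 : -1].argmax(-1)
        drafted = proposal[:, ids.size(1):]
        # 3) Keep the longest prefix where draft and target agree.
        n_ok = int((verify == drafted).long().cumprod(-1).sum())
        accepted = drafted[:, :n_ok]
        if n_ok < k:
            # On the first disagreement, take the target's own token, so
            # every loop iteration makes progress even if nothing matched.
            accepted = torch.cat([accepted, verify[:, n_ok : n_ok + 1]], dim=1)
        ids = torch.cat([ids, accepted], dim=1)
    return ids[:, :goal]
```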

Quick Recommendations

Best frontier model

Claude 3.5 Sonnet, GPT-4o, or Gemini 2.0 Pro

Top performance on reasoning, coding, and instruction following; competitive pricing

Open-source (large)

Llama 3.1 405B or DeepSeek-V3-671B (MoE)

Approaching frontier model quality; self-hostable for full data control

Open-source (efficient)

Qwen2.5-72B or Llama 3.1 70B

Best quality at the 70B scale; fits on 2x A100 with quantization

Small / edge

Llama 3.2 3B or Phi-3.5 Mini (3.8B)

Runs on mobile and laptop hardware; surprisingly capable for their size

Research / perplexity benchmark

GPT-4 or Gemini 1.5 Pro

Among the strongest reported results on standard LM evaluations; note that frontier labs rarely publish raw perplexity numbers

What's Next

The next phase is test-time compute scaling (thinking longer to solve harder problems), multi-modal native models (text + image + audio + video in one architecture), and agentic models that can use tools, write code, and take actions. Expect the open-source gap to continue closing, with 70B-class models matching today's frontier within a year. Architecture innovations (state-space models, hybrid attention-SSM) may complement or partially replace pure transformers.

Benchmarks & SOTA

MMLU-Pro

1 result

The MMLU-Pro dataset contains 12K complex questions across various disciplines, including biology, business, chemistry, computer science, economics, engineering, math, physics, and psychology. It has 10 options per question, compared to the original MMLU's 4, making it more challenging. It also integrates more reasoning-focused problems, where Chain-of-Thought (CoT) evaluation can score significantly higher than answer-only perplexity-based (PPL) evaluation.

State of the Art

Qwen2.5-Plus

72.5

Accuracy

MMLU-Redux

MMLU-Redux: Massive Multitask Language Understanding Redux

1 result

A carefully re-annotated version of the MMLU benchmark dataset with 30 subjects and 100 randomly sampled questions per subject (3,000 questions total). MMLU-Redux addresses numerous ground truth errors found in the original MMLU dataset. The analysis revealed that approximately 6.49% of MMLU questions contain errors, with some subjects like Virology containing errors in 57% of questions. This dataset provides a more accurate and reliable evaluation of language model capabilities.

State of the Art

Qwen2.5-72B-Instruct

86.8

Accuracy

IFEval

Instruction-Following Eval

1 result

A straightforward and easy-to-reproduce evaluation benchmark for large language models focused on instruction-following capabilities. IFEval contains around 500 prompts (541 in the train split) with verifiable instructions that can be objectively evaluated by heuristics, such as "write in more than 400 words", "mention the keyword of AI at least 3 times", "use no commas", or "include at least 3 highlighted sections". The benchmark identifies 25 types of verifiable instructions including punctuation constraints, length requirements, detectable content/format requirements, and keyword usage. Each prompt contains one or more verifiable instructions with corresponding kwargs for verification. This benchmark is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the Open LLM Leaderboard.
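Because every IFEval instruction is machine-verifiable, scoring requires no judge model, only string heuristics over the response. A minimal sketch of such checkers (the function names and demo response are illustrative, not the official implementation):

```python
import re

def check_min_words(response: str, n: int) -> bool:
    """'write in more than 400 words' -> word count must exceed n."""
    return len(response.split()) > n

def check_keyword_count(response: str, keyword: str, n: int) -> bool:
    """'mention the keyword of AI at least 3 times' (case-insensitive)."""
    return len(re.findall(re.escape(keyword), response, re.IGNORECASE)) >= n

def check_no_commas(response: str) -> bool:
    """'use no commas'."""
    return "," not in response

response = "AI systems follow instructions. AI benchmarks verify this. AI wins."
print(check_keyword_count(response, "AI", 3))  # True
print(check_no_commas(response))               # True
print(check_min_words(response, 400))          # False
```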

State of the Art

Qwen2.5-Plus

86.3

Accuracy

GPQA

GPQA: Graduate-Level Google-Proof Q&A Benchmark

1 result

GPQA is a graduate-level, "Google-proof" science QA benchmark written and validated by domain experts; the score here is reported as Avg@8 in the paper.

State of the Art

Qwen2.5-Plus

49.7

Accuracy

MATH

MATH (Measuring Mathematical Problem Solving) Dataset

1 result

MATH is a benchmark dataset of challenging competition-level mathematics problems introduced by Hendrycks et al. (NeurIPS Datasets & Benchmarks / arXiv 2103.03874). The dataset contains about 12,500 problems drawn from math competitions and is annotated with full step-by-step solutions (expressed in LaTeX and natural language) and final answers. Problems are organized by subject (e.g., algebra, counting & probability, geometry, number theory, precalculus) and difficulty level and are commonly distributed as a ~12,000-example training set plus a 500-example test set in public conversions. MATH is intended to evaluate and train models on mathematical problem solving and derivation generation (reasoning) and has been widely used as a benchmark for LLM math reasoning.

State of the Art

Qwen2.5-Plus

84.7

Accuracy

MGSM

Multilingual Grade School Math (MGSM)

1 result

Multilingual Grade School Math (MGSM) is a multilingual benchmark of grade-school math word problems introduced in the paper “Language Models are Multilingual Chain-of-Thought Reasoners” (arXiv:2210.03057). It contains the same 250 problems from GSM8K, each manually translated into 10 typologically diverse languages (Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu) plus English. MGSM is used to evaluate multilingual reasoning and chain-of-thought capabilities of language models (includes inputs, targets, and manually translated few-shot exemplars). License: CC BY-SA 4.0. Size: 250 problems × languages (1K<n<10K overall). Note: referenced as MGS / MGSM in some papers (reported in pre-training comparisons).

State of the Art

Qwen2.5-72B-Instruct

88.16

Accuracy

Arena-Hard

Arena-Hard (Arena-Hard-Auto)

1 result

Arena-Hard is a human-aligned benchmark of challenging open-ended prompts sourced from live crowd platforms (notably Chatbot Arena) designed to robustly separate LLM capability and reflect human preference. It was introduced in the paper “From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline” (arXiv:2406.11939). The Arena-Hard-Auto variant (published on Hugging Face as Arena-Hard-Auto / v0.1) is an automatic evaluation suite that contains 500 challenging user queries extracted from Chatbot Arena and uses an LLM-as-a-judge (the dataset authors report prompting GPT-4-Turbo to act as judge, comparing model responses against a baseline such as GPT-4-0314). The BenchBuilder pipeline described in the paper automates extracting high-quality prompts from crowdsourced data and producing an automatically-judged benchmark with high correlation and separability relative to the live Chatbot Arena. Common uses: automatic and human-aligned evaluation of instruction-tuned LLMs and benchmarking alignment/safety/helpfulness.

State of the Art

Qwen2.5-Plus

81.4

Accuracy

RULER

RULER: What’s the Real Context Size of Your Long-Context Language Models?

1 result

RULER is a synthetic, configurable long-context benchmarking suite for evaluating language models’ ability to use very long contexts. Introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (arXiv:2404.06654), RULER extends the common “needle-in-a-haystack” (NIAH) retrieval test into a richer set of controlled variations with flexible configurations for sequence length and task complexity. The benchmark is designed to probe more than simple retrieval by varying task types and difficulty and to measure model performance across many context lengths (the authors report evaluations up to 1M tokens). The code and data-generation tools are provided by the authors in the public NVIDIA RULER GitHub repository (https://github.com/NVIDIA/RULER).

State of the Art

Qwen2.5-72B-Instruct

95.1

Accuracy

okapi MMLU (translated)

okapi MMLU (translated MMLU for multilingual evaluation)

1 result

A translated / multilingual version of the MMLU (Measuring Massive Multitask Language Understanding) benchmark adapted for multilingual evaluation. MMLU is a 57-task, multiple-choice benchmark covering subjects across humanities, social sciences, and STEM requiring broad world knowledge and problem-solving. The "okapi MMLU (translated)" assets on Hugging Face provide MMLU questions and answers translated into multiple languages (examples on HF include many languages such as id, vi, ar, bn, de, es, fr, etc.). The translated MMLU variants are commonly used for multilingual few-shot evaluation (the Okapi paper reports using translated MMLU in 5-shot evaluations). License on the HF repos is listed as CC-BY-NC-4.0. Source references: the original MMLU paper (Hendrycks et al., arXiv:2009.03300) and the Okapi project (Okapi: instruction-tuned LLMs; arXiv:2307.16039) and the Hugging Face dataset pages (e.g., jon-tow/okapi_mmlu and SEACrowd/okapi_m_mmlu).

State of the Art

Qwen2.5-72B-Instruct

79.97

Accuracy

MT-Bench

MT-Bench (Multi-Turn Benchmark)

1 result

MT-Bench is a multi-turn benchmark for evaluating the conversational and instruction-following abilities of large language model (LLM) chat assistants. It was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685). MT-Bench is a collection of open-ended, multi-turn question/prompts designed to probe coherence, context maintenance, reasoning, and helpfulness in dialogue. The benchmark is commonly evaluated using a “LLM-as-a-judge” methodology (using strong LLMs such as GPT-4 to score/rank responses), which the authors show can achieve high agreement with human preferences. Public Hugging Face mirrors of the MT-Bench data (e.g., philschmid/mt-bench and lighteval/mt-bench) commonly expose an 80-item multi-turn set that is widely used for reporting a numeric MT-Bench score.

State of the Art

Qwen2.5-72B-Instruct

9.35

Score (1-10)

LV-Eval

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

1 result

LV-Eval is a bilingual long-context benchmark designed to evaluate large language models at very large context lengths (up to 256k tokens). It provides controllable evaluation across five length levels (16k, 32k, 64k, 128k, 256k) and includes multiple QA-style tasks (single-hop and multi-hop QA) drawn from several bilingual datasets. The benchmark incorporates techniques to reduce knowledge leakage and increase difficulty and objectivity: confusing facts insertion (CFI), keyword and phrase replacement (KPR), and a keyword-recall-based metric evaluated at multiple lengths. LV-Eval is provided with balanced numbers of instances across lengths and is intended to stress-test long-context capabilities of LLMs.

State of the Art

Qwen2.5-72B-Instruct

60.4

Accuracy

LongBench-Chat

LongBench-Chat: Long Context Instruction-Following Benchmark

1 result

LongBench-Chat is a benchmark for evaluating instruction-following capabilities of large language models on queries of 10k-100k in length. It was introduced in the LongAlign paper to test how well models can follow instructions over very long contexts.

State of the Art

Qwen2.5-72B-Instruct

8.72

Score (1-10)

GSM8K

1 result

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems.

State of the Art

Qwen2.5-Plus

96

Accuracy

Livebench

1 result

LiveBench is a benchmark designed to limit test-set contamination: questions are refreshed regularly from recently released sources, and answers are scored objectively against ground truth rather than by an LLM judge. The result tracked here is for its language category.

State of the Art

Qwen2.5-Plus

54.6

Accuracy

MultiChallenge

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

0 results

MultiChallenge is a multi-turn conversational evaluation benchmark designed to measure LLMs' ability to conduct realistic, multi-turn conversations with human users. The benchmark identifies four categories of realistic conversational challenges (e.g., instruction retention, inference/memory across turns, handling versioned or updated information, and context allocation) that require integrated instruction-following, context management, and in-context reasoning. The dataset was created via a hybrid data-generation process (LLM agents plus human review) and includes an automatic evaluation pipeline that uses LLMs-as-judges with instance-level rubrics, which the authors report aligns well with experienced human raters. In the paper's reported evaluations, current frontier models score well below saturation on MultiChallenge (all <50% average accuracy; top reported model Claude 3.5 Sonnet reached ≤41.4%), demonstrating that MultiChallenge exposes realistic multi-turn failure modes not captured by prior multi-turn benchmarks. The benchmark is accompanied by a public leaderboard (Scale) and a GitHub repo with details and data generation code. Table 6 in the paper summarizes the multi-turn evaluation setup and results.

No results tracked yet

SafetyBench

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

0 results

SafetyBench is a comprehensive benchmark for evaluating the safety of large language models. It contains 11,435 diverse multiple-choice safety questions spanning seven safety categories (Offensiveness; Unfairness & Bias; Physical Health; Mental Health; Illegal Activities; Ethics & Morality; Privacy & Property). The benchmark includes both Chinese and English data (the authors release language-specific test files such as test_zh.json and test_en.json) and is intended for automatic evaluation of LLM safety via multiple-choice accuracy per-category and overall (the paper reports overall and per-category scores, e.g., in Table 7).

No results tracked yet

SysBench (ISR)

SysBench (system-message-following benchmark)

0 results

SysBench is a system-message-following benchmark for evaluating Large Language Models (LLMs). It measures how well models adhere to system messages across dimensions such as constraint complexity, instruction misalignment, and multi-turn stability. The benchmark provides evaluation examples (the Hugging Face dataset includes a test split stored as system_benchmark_eval_datas.json) and reports results using an ISR metric (reported in the paper) to quantify system-message-following performance. The dataset and code are publicly released by PKU-Baichuan-MLSystemLab (GitHub) and are hosted on Hugging Face.

No results tracked yet

ZeroSCROLLS/QuALITY

QuALITY (ZeroSCROLLS subset)

0 results

QuALITY (as used in ZeroSCROLLS) is the QuALITY multiple-choice reading-comprehension / question-answering dataset subset included in the ZeroSCROLLS zero-shot long-context benchmark. The original QuALITY dataset (Pang et al., NAACL 2022; arXiv:2112.08608) contains English passages with very long contexts (average ~5,000 tokens) and human-authored multiple-choice questions and distractors; questions were written and validated by annotators who read the full passage, so many require deep comprehension and cannot be solved by simple skimming or short excerpts. In ZeroSCROLLS the QuALITY data is adapted/used as a zero-shot test (and small validation) set to evaluate long-context model understanding in a zero-shot setting (see ZeroSCROLLS paper arXiv:2305.14196). Use cases: long-document QA / reading comprehension, multiple-choice QA over long contexts.

No results tracked yet

HiddenMath

HiddenMath

0 results

HiddenMath is reported to be a hidden/internal benchmark of competition-style mathematics problems used to evaluate large language models. Publicly-available evidence is limited: an LLM benchmark listing (LLMDB) describes HiddenMath as "Google’s internal holdout set of competition math problems" and reports scores on a 0–100 accuracy scale. No public dataset release, Hugging Face dataset page, or dedicated paper was found; the dataset appears to be a private/held-out test set used in model evaluation (reported in Gemma 3 Technical Report Table 6 as "HiddenMath", metric = accuracy). Source: LLMDB entry for HiddenMath (https://llmdb.com/benchmarks/hiddenmath).

No results tracked yet

Bird-SQL (dev)

BIRD-SQL (BIg Bench for Large-Scale Database-Grounded Text-to-SQLs)

0 results

Development split (dev) of BIRD-SQL (BIRD). BIRD-SQL is a large cross-domain text-to-SQL benchmark designed to evaluate natural-language-to-SQL parsing against realistic, value-rich relational databases. BIRD contains 12,751 text-to-SQL question–SQL pairs grounded on 95 databases (total ~33.4 GB) spanning ~37 professional domains; it emphasizes database values (dirty/noisy values and external-knowledge grounding) to better match real-world DB assistant scenarios. The benchmark provides standard splits including a development (dev) split (the dev archive is distributed by the authors) which is commonly used for model evaluation (accuracy / execution metrics in papers). The evaluation result referenced corresponds specifically to the development split as reported in Table 6 (metric: accuracy).

No results tracked yet

Global MMLU-Lite

Global-MMLU-Lite

0 results

Global-MMLU-Lite is a compact multilingual evaluation subset of the Global-MMLU benchmark. The Lite version covers 16 languages (a subset of the full 42-language Global-MMLU) and contains human-translated / post-edited MMLU-style multiple-choice questions. For each included language, the dataset provides 200 Culturally Sensitive (CS) and 200 Culturally Agnostic (CA) samples (i.e., 400 examples per language). The Lite split selects languages from Global-MMLU that were fully human-translated or post-edited, enabling a smaller, reproducible evaluation set for multilingual model comparisons. License: Apache-2.0. (Source: Hugging Face dataset card for CohereLabs/Global-MMLU-Lite and the Global MMLU paper arXiv:2412.03304 / ACL 2025.)

No results tracked yet

BBH

BIG-Bench Hard (BBH)

0 results

BIG-Bench Hard (BBH) is a curated subset of challenging tasks from the BIG-Bench benchmark selected because prior language model evaluations underperformed average human raters. Introduced in "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them" (Suzgun et al., 2022/ACL Findings 2023), BBH comprises 23 diverse, hard reasoning and understanding tasks (examples: boolean_expressions, logical_deduction_three/five/seven_objects, dyck_languages, multistep_arithmetic_two, object_counting, tracking_shuffled_objects, salient_translation_error_detection, etc.). BBH is explicitly evaluated with few-shot and chain-of-thought (CoT) prompting to study whether CoT helps solve these harder tasks. The suite is commonly distributed on Hugging Face and GitHub as "BIG-Bench Hard" and is widely used as a benchmark for advanced reasoning capabilities.

No results tracked yet

SimpleQA

0 results

SimpleQA is a benchmark from OpenAI that evaluates short-form factuality in large language models: models answer fact-seeking questions that have a single, indisputable answer, and responses are graded as correct, incorrect, or not attempted.

No results tracked yet

SuperGPQA

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

0 results

SuperGPQA is a large multiple-choice question benchmark for evaluating LLM knowledge and reasoning across 285 graduate-level disciplines. The public dataset (HF: m-a-p/SuperGPQA) contains ~26.5K question instances (train split) and was constructed to include at least 50 questions per discipline. Each example includes fields such as question, options, answer (and answer_letter), discipline/field/subfield labels, difficulty, and an is_calculation flag. The benchmark was released with an open-data license (ODC-BY) and is intended for evaluation of LLM factual knowledge and problem solving across highly specialized academic and professional subject areas.

No results tracked yet

AutoLogi

AutoLogi: Automated Logic Puzzle Benchmark

0 results

AutoLogi is a bilingual benchmark of automatically generated, open-ended logic puzzles designed to evaluate the logical reasoning abilities of large language models. Instances are synthesized by a programmatic generator with program-based verification to ensure solvability and correctness, and the generation process supports controllable difficulty levels to better distinguish model capabilities. The dataset was published alongside the paper “AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models” (arXiv:2502.16906). The Hugging Face release (qzhu/AutoLogi) is licensed under Apache-2.0 and contains on the order of 1K–10K examples. Used in post-training evaluations (Table 11) of Qwen3.

No results tracked yet

FACTS Grounding

0 results

Evaluates LLMs' ability to generate long-form responses that are factually accurate and strictly "grounded" in provided context documents, thereby mitigating hallucination. Tasks require models to generate responses based exclusively on documents up to 32,000 tokens long.

No results tracked yet

C-Eval

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

0 results

C-Eval is a comprehensive Chinese evaluation suite for foundation models containing 13,948 multiple-choice questions across 52 disciplines and four difficulty levels (middle school, high school, college, and professional). It also provides a C-Eval HARD subset of especially challenging questions. The benchmark is designed to assess knowledge and reasoning abilities of Chinese/Chinese-aware large language models; the authors publish dataset files, code, and examples on the project website and GitHub, and the dataset is hosted on Hugging Face (ceval/ceval-exam). (Paper: C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, arXiv:2305.08322; NeurIPS 2023 Datasets & Benchmarks track.)

No results tracked yet

WikiText Perplexity

0 results

Language modeling quality measured by perplexity on Wikipedia text

No results tracked yet

EvalPlus

EvalPlus

0 results

EvalPlus is an evaluation framework and leaderboard for LLMs on code-generation tasks (LLM4Code). The EvalPlus project provides rigorously extended test suites for popular coding benchmarks (notably HumanEval+ and MBPP+) and tooling to evaluate models (pass@1, chat vs completion, etc.). HumanEval+ and MBPP+ are enlarged, hand-verified test sets (HumanEval+ ~80x more tests than original HumanEval; MBPP+ ~35x more tests than original MBPP) maintained by the EvalPlus team. In the NeurIPS paper “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation” (arXiv:2305.01210) the authors report an aggregate coding score referred to as “EvalPlus” (used in e.g., Table 3) which is computed from the constituent benchmarks (HumanEval, MBPP, HumanEval+, MBPP+). Primary sources: EvalPlus GitHub & website (https://github.com/evalplus, https://evalplus.github.io/leaderboard.html), Hugging Face dataset pages for the extended datasets (HumanEval+: https://huggingface.co/datasets/evalplus/humanevalplus , MBPP+: https://huggingface.co/datasets/evalplus/mbppplus), and the NeurIPS / arXiv paper (arXiv:2305.01210).

No results tracked yet

Multi-IF

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following

0 results

Multi-IF is a benchmark for evaluating large language models on multi-turn, multilingual instruction-following. It extends the IFEval framework by incorporating multi-turn sequences and translating English prompts into seven additional languages, producing 4,501 multilingual conversations where each conversation has three turns. The benchmark uses a hybrid annotation/evaluation framework combining LLMs and human annotators and was used to evaluate state-of-the-art LLMs. Languages covered include English, French, Spanish, Portuguese, Hindi, Chinese, Russian, and Italian. The dataset and evaluation code are hosted by Facebook/Meta on Hugging Face and GitHub (license: CC-BY-NC-2.0).

No results tracked yet

INCLUDE

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

0 results

INCLUDE is a multilingual, knowledge- and reasoning-centric evaluation benchmark built from local academic and professional exam sources to measure multilingual LLM performance in real regional contexts. According to the paper (arXiv:2411.19799) INCLUDE comprises a large evaluation suite (the paper reports 197,243 QA pairs in total) covering regional/cultural knowledge across many topics and 44 written languages. A released Hugging Face dataset variant (CohereLabs/include-base-44) is a curated subset described as "INCLUDE-base (44 languages)" and contains 22,637 4-option multiple-choice questions spanning 57 topics (domains include chemistry, biology, legal, finance, medical, climate, art, code). Metadata on the HF page lists the 44 languages, Apache-2.0 license, task categories (multiple-choice, text2text-generation), and links to the paper. Note: the Qwen3 paper (arXiv:2505.09388) reports using INCLUDE with 10% sampling for some evaluations (used in post-training, Table 11). Source: arXiv:2411.19799 and Hugging Face dataset page CohereLabs/include-base-44.

No results tracked yet

Winogrande

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

0 results

WinoGrande is a large-scale Winograd-style commonsense reasoning dataset introduced to probe pronoun resolution and robust commonsense understanding. Inspired by the original Winograd Schema Challenge, WinoGrande contains ~44k fill-in-the-blank problems with binary options (right/wrong antecedent). Instances were collected via a careful crowdsourcing pipeline and then filtered with an adversarial filtering algorithm (AFLITE) to reduce dataset-specific statistical biases; roughly half the examples were identified as adversarial in the original release. The benchmark is designed to be harder and less exploitable by spurious correlations than earlier WSC variants; reported human performance is very high (~94%) while state-of-the-art models (as of the paper) were substantially lower. The dataset is available from the authors (allenai) and hosted on the Hugging Face datasets hub.

No results tracked yet

CommonsenseQA

CommonsenseQA

0 results

CommonsenseQA is a multiple-choice question-answering benchmark that tests commonsense/world knowledge. Questions were created by crowdworkers based on ConceptNet relations: for a source concept the authors extracted multiple target concepts that share a semantic relation, and workers authored questions that mention the source concept and discriminate among the targets. The set contains roughly 12k questions (paper reports 12,247 questions; the Hugging Face dataset card lists 12,102) with one correct answer and four distractors (5-way multiple choice). The dataset includes standard train/validation/test splits (see paper) and was shown to be challenging for strong baselines (BERT-large baseline ~56% vs. human ~89% per the original paper).

No results tracked yet

OpenBookQA

OpenBookQA (Open Book Question Answering)

0 results

OpenBookQA is a multiple-choice question answering dataset modeled after open-book exams to probe deeper understanding and multi-step reasoning. The dataset provides an “open book” of elementary-level science facts (≤1.3k facts) plus roughly 6k multiple-choice questions that require combining a provided core science fact with broad common-sense or world knowledge to answer. Each example contains a question stem, four answer choices, an answer key, and an associated core fact (the ‘‘open book’’ fact). The data is split into train (~4.96k questions), validation (500) and test (500). It was created to encourage research on reasoning and knowledge-combination beyond surface-level reading comprehension.

No results tracked yet

OpenRewrite-Eval

OPENREWRITEEVAL (OpenRewriteEval)

0 results

OPENREWRITEEVAL (OpenRewriteEval) is a benchmark for evaluating long-form, open-ended text rewriting by large language models. It covers a wide variety of rewriting types expressed through natural-language instructions and is designed to measure content preservation and to detect hallucinations or unintended modifications introduced by models when rewriting long-form text. The Hugging Face reupload (gabrielmbmb/OpenRewriteEval) contains a single split (train) with ~1.63k examples; fields include source (original long-form text), target (desired rewritten text), comment, and a task label with 6 classes (different rewriting types). The HF dataset page notes it was reuploaded from the original RewriteLM GitHub repository for convenience.

No results tracked yet

ARC

AI2 Reasoning Challenge (ARC)

0 results

The AI2 Reasoning Challenge (ARC) is a benchmark of 7,787 natural, grade-school-level multiple-choice science questions (authored for human tests) designed to encourage research in advanced question answering and reasoning. The question set is partitioned into two subsets: ARC-Challenge (questions that simple retrieval and word co-occurrence algorithms get wrong; ~2.59k questions) and ARC-Easy (~5.2k questions). The release also includes the ARC Corpus, a large corpus of science-relevant sentences (~14 million sentences) intended to support retrieval/knowledge components. ARC focuses on questions requiring deeper knowledge and reasoning than many earlier QA datasets and provides baseline implementations; it is widely used for multiple-choice and open-domain QA evaluation. License: CC BY-SA 4.0. Language: English.

No results tracked yet

HellaSwag

HellaSwag: Can a Machine Really Finish Your Sentence?

0 results

HellaSwag is a multiple-choice commonsense sentence-completion / commonsense NLI benchmark introduced by Zellers et al. (ACL 2019). Each example provides a short context and four candidate endings; the task is to pick the most plausible continuation. The dataset was constructed using Adversarial Filtering (AF) to select challenging, machine-generated distractors (making examples trivial for humans but difficult for models). Source contexts are drawn from domains such as ActivityNet captions and WikiHow. Standard splits on the Hugging Face / official release are roughly: train ≈ 39.9k, validation 10k, test 10k (≈60k total). Human accuracy reported >95%, while contemporary models at publication time scored substantially lower (paper reports under ~48%).

No results tracked yet

TriviaQA

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

0 results

TriviaQA is a large-scale reading-comprehension / question-answering dataset introduced by Joshi et al. (ACL 2017). It contains over 650K question-answer-evidence triples (about 95K question-answer pairs authored by trivia enthusiasts) with independently gathered evidence documents (about six evidence documents per question on average). The dataset provides both a reading-comprehension (RC) version (contexts where answers appear) and an unfiltered / open-domain style version (where not all retrieved documents necessarily contain the answer). TriviaQA was designed to be more challenging than prior RC datasets: questions are often compositional, exhibit high syntactic/lexical variability relative to answer-evidence sentences, and frequently require cross-sentence reasoning. The original paper provides RC and open-domain splits and baselines; data and downloads are available from the project page and via Hugging Face.

No results tracked yet

GPQA Diamond

0 results

GPQA Diamond is the highest-quality, hardest subset of the GPQA benchmark. It consists of 198 expert-validated multiple-choice questions in biology, physics, and chemistry, and is designed to be a challenging benchmark for advanced AI reasoning, driving progress in scalable oversight and structured problem-solving.

No results tracked yet

MMMLU

0 results

The Multilingual Massive Multitask Language Understanding (MMMLU) dataset was released by OpenAI on Hugging Face to evaluate multilingual large language models across diverse linguistic, cognitive, and cultural contexts.

No results tracked yet

C-SimpleQA

Chinese SimpleQA (C-SimpleQA)

0 results

Chinese SimpleQA (C-SimpleQA) is a Chinese-language benchmark for evaluating the factuality of large language models on short question answering. It was designed to be diverse and high-quality: it covers six major topics with 99 diverse subtopics, uses static (time-stable) reference answers, and includes a thorough quality-control process to ensure reliable evaluation. The dataset is provided in JSON format and is intended for easy, reproducible evaluation of model factuality on short Chinese questions. License: CC BY-NC-SA 4.0.

No results tracked yet

LongBench v2

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

0 results

LongBench v2 is a long-context benchmark designed to evaluate large language models’ ability to perform deep understanding and reasoning across realistic long-context multitasks. The benchmark contains 503 challenging multiple-choice questions with contexts ranging from ~8k to 2M words (majority under ~128k). It covers six major categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code-repository understanding, and long structured-data understanding. The authors provide evaluation modes with and without chain-of-thought (CoT) reasoning and categorize examples by short/medium/long context lengths to measure model performance as context size grows. Data and code are available from the project page and the Hugging Face dataset repository; the dataset is tagged for multiple-choice, question-answering, text-classification, and table-question-answering tasks.

No results tracked yet

AIME 2025

0 results

This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.

No results tracked yet

AIME 2024

0 results

The AIME 2024 dataset contains problems from the American Invitational Mathematics Examination (AIME) 2024. It is primarily used for evaluating Large Language Models' (LLMs) mathematical reasoning and problem-solving capabilities on complex mathematical problems. Each record includes an ID, problem statement, detailed solution process, and the final numerical answer. The dataset covers various mathematical domains (geometry, algebra, number theory, etc.) and is known for its high difficulty level.

No results tracked yet

ECLeKTic

ECLeKTic: A Multi-Lingual Knowledge Testing Dataset

0 results

ECLeKTic is a multilingual closed-book question-answering dataset for evaluating cross-lingual knowledge transfer: questions are based on content available in only one source language, and models must answer them in the other covered languages.

No results tracked yet

MRCR v2 (1M)

MRCR (Multi-Round Co-reference Resolution), 1M-token context variant

0 results

MRCR benchmark variant for evaluating long-context language models with 1M token context window

No results tracked yet

MRCR v2 (≤128K)

MRCR (Multi-Round Co-reference Resolution), context lengths up to 128K tokens

0 results

MRCR benchmark variant for evaluating language models with context window up to 128K tokens

No results tracked yet

ZebraLogic

0 results

ZebraLogic is a benchmark of logic-grid ("Zebra" or Einstein-style) puzzles used to evaluate the constraint-satisfaction and logical reasoning abilities of large language models.

No results tracked yet

WritingBench

WritingBench: A Comprehensive Benchmark for Generative Writing

0 results

A comprehensive benchmark for evaluating LLMs' writing capabilities across 1,000 real-world queries spanning 6 primary domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, and Advertising & Marketing) and 100 fine-grained subdomains. Each query averages 1,500+ tokens and is paired with 5 instance-specific evaluation criteria. The benchmark uses a hybrid construction pipeline combining Model-Augmented Query Generation and Human-in-the-Loop Refinement. Evaluation is conducted through a query-dependent framework with dynamic criteria generation and rubric-based scoring on a 10-point scale, using either LLM evaluators (Claude-Sonnet-4) or a fine-tuned critic model.

No results tracked yet

MMLU

0 results

MMLU (Measuring Massive Multitask Language Understanding) is a popular benchmark used to evaluate the capabilities of large language models: a multiple-choice collection spanning 57 subjects across STEM, the humanities, and the social sciences. It has inspired numerous versions and spin-offs (e.g., MMLU-Pro, MMLU-Redux, MMMLU, Global-MMLU).

No results tracked yet

Creative Writing Benchmark v3

EQ-Bench Creative Writing Benchmark v3

0 results

A comprehensive benchmark for evaluating the creative writing capabilities of large language models using a hybrid rubric and Elo scoring system. The evaluation uses 32 distinct writing prompts across 3 iterations (96 items total) with temperature 0.7 and min_p 0.1. Each generated piece is assessed by a judge model (Claude 3.7 Sonnet) against a comprehensive rubric, followed by pairwise matchups using the Glicko-2 rating system that accounts for win margins. The benchmark is designed for enhanced discrimination at the top end of model performance and includes prompts challenging models in humor, romance, spatial awareness, and unique perspectives. It implements bias mitigation strategies for length, position, verbosity, and poetic incoherence. Used for the official Creative Writing leaderboard on EQ-Bench.com.

No results tracked yet

DROP

Discrete Reasoning Over Paragraphs (DROP)

0 results

DROP (Discrete Reasoning Over Paragraphs) is an English reading-comprehension benchmark that requires discrete, multi-step reasoning over paragraphs (e.g., addition, counting, sorting, and resolving references to multiple passage positions). Introduced by Dua et al. in "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs" (NAACL/ACL 2019; arXiv:1903.00161), the dataset was crowdsourced and adversarially created to avoid shallow shortcuts. The full collection contains approximately 96k question–answer pairs over ~6.7k passages (train ≈77k, dev ≈9.5k, hidden test ≈9.5k). Publicly-available splits on Hugging Face and other mirrors contain the train and dev splits (train ≈77.4k, validation ≈9.54k). Answers include span-based answers and free-form/numeric answers (numerical reasoning is a core focus). Evaluation follows common QA practice with word-level F1 and exact match (EM). The dataset is provided under a CC BY license and is hosted/mirrored by the Allen Institute for AI and on Hugging Face.
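The word-level F1 that DROP reports is bag-of-words overlap between predicted and gold answers. A minimal sketch, simplified relative to the official evaluation script (which also strips articles and punctuation and handles numbers and answer sets specially):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between a predicted and a gold answer string."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("scored 24 points", "24 points"))  # 0.8
```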

No results tracked yet

AlignBench

AlignBench: Benchmarking Chinese Alignment of Large Language Models

0 results

A comprehensive multi-dimensional benchmark for evaluating large language models' alignment capabilities in Chinese. AlignBench contains 683 high-quality samples curated through a human-in-the-loop data curation pipeline across 8 main categories: Fundamental Language Ability (68 samples), Chinese Advanced Understanding (58), Open-ended Questions (38), Writing Ability (75), Logical Reasoning (92), Mathematics (112), Task-oriented Role Play (116), and Professional Knowledge (124). Each sample includes a task-oriented query, a high-quality reference answer with evidence from reliable web sources, and corresponding category classification. The benchmark uses a multi-dimensional rule-calibrated LLM-as-Judge approach with Chain-of-Thought to generate explanations and ratings (1-10 scale), employing GPT-4 or the dedicated CritiqueLLM evaluator (which recovers 95% of GPT-4's evaluation ability). The evaluation ensures high reliability and interpretability through point-wise grading, Chain-of-Thought reasoning, and rule-calibrated referencing. Since release, AlignBench has been adopted by top Chinese LLMs including ChatGLM, Qwen, DeepSeek, Yi, Baichuan, and Abab.

No results tracked yet

MATH 500

0 results

MATH 500 is a 500-problem evaluation subset of the MATH benchmark, designed to evaluate language models' ability to solve mathematical problems. Its questions span Algebra, Intermediate Algebra, Precalculus, Geometry, Number Theory, Prealgebra, and Counting & Probability, across difficulty levels 1 to 5.

No results tracked yet

Penn Treebank (WSJ Section 23)

Penn Treebank (Wall Street Journal, Section 23)

0 results

The Penn Treebank (PTB) WSJ portion is a widely used annotated corpus of Wall Street Journal newswire text (roughly 1 million words). It was originally described in Marcus et al., 1993 ("Building a Large Annotated Corpus of English: The Penn Treebank") and distributed as the Treebank releases (e.g. Treebank-3 / LDC99T42). The WSJ portion is annotated for part-of-speech (POS) and syntactic constituency trees and is commonly used for parsing, POS tagging and language modeling research. Section 23 of the WSJ is the standard test set in many parsing and language-modeling evaluations (e.g., parsing train/dev/test splits often use sections 02–21 for training, 22 for development and 23 for test). Hugging Face hosts a text-only PTB dataset (ptb-text-only/ptb_text_only) which provides the PTB text splits (the HF dataset notes that the source is the Penn Treebank Project / WSJ material and that licensing is via LDC). Note: the original Penn Treebank was published in Computational Linguistics (Marcus et al., 1993) and the corpus distribution is controlled by the LDC (Treebank releases such as LDC99T42).

No results tracked yet
