Language Modeling
Language modeling is the task of predicting the next token (a word or subword) in a sequence given the preceding context. By learning the probability distribution over token sequences, language models underpin text generation, machine translation, speech recognition, and virtually every modern NLP application: GPT-4, Claude, Llama, and Gemini are all language models at their core. Perplexity on held-out text remains the key intrinsic metric, but downstream task performance has become the real measure of progress.
History
2003: Bengio et al. introduce neural language models with feedforward networks, replacing n-gram models
2013: Word2Vec shows that language model byproducts (embeddings) transfer to downstream NLP tasks
2017: The Transformer architecture (Vaswani et al.) enables massively parallel training, replacing recurrent models
2018: GPT (Radford et al.) demonstrates that autoregressive pretraining on unlabeled text produces useful representations
2019: GPT-2 (1.5B params, trained on 40GB of web text) shows emergent generation quality; OpenAI delays release over misuse concerns
2020: GPT-3 (175B params) demonstrates in-context learning: the model performs tasks from examples in the prompt
2023: GPT-4 and Claude 2 reach broadly expert-level performance across NLP, coding, and reasoning
2023: Llama 2 (Meta) opens the floodgates for open-weight LLMs; Mistral-7B matches the larger Llama 2 13B
2024: Llama 3.1 405B, DeepSeek-V3, and Qwen2.5-72B close the gap with proprietary frontier models
2024-2025: Claude 3.5, GPT-4o, and Gemini 2.0 compete on reasoning, coding, and agentic capabilities; Llama 4 and DeepSeek-R1 push open-source further
How Language Modeling Works
Tokenization
Text is encoded into subword tokens using BPE (GPT), SentencePiece (Llama), or custom tokenizers; vocabulary sizes range from 32K to 256K
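As an illustration, the core of BPE training repeatedly fuses the most frequent adjacent symbol pair. A minimal sketch in Python (the toy corpus and function names are our own, not any particular tokenizer's API):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn byte-pair-encoding merges from a toy word-frequency table.

    `word_freqs` maps a whitespace-separated symbol sequence (one entry
    per word; symbols start out as characters) to its corpus frequency.
    Returns the learned merge rules in order.
    """
    words = {tuple(w.split()): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the winning pair.
        words = {merge_pair(symbols, best): f for symbols, f in words.items()}
    return merges

def merge_pair(symbols, pair):
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
print(bpe_merges(corpus, 3))   # learns "es", "est", "lo" first
```

Production tokenizers add byte-level fallback, special tokens, and fast merge application, but the learned vocabulary comes from exactly this frequency-driven loop.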
Embedding
Each token is mapped to a dense vector; positional information is added via learned or rotary (RoPE) position embeddings
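A minimal sketch of the rotary scheme, assuming the standard RoPE formulation (pairs of dimensions rotated by a position-dependent angle); the function name is illustrative:

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate dimension pairs (2i, 2i+1) of x by angle pos * base**(-2i/d).

    Because each pair is rotated in proportion to its absolute position,
    the dot product between a rotated query and a rotated key depends
    only on their relative offset.
    """
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s        # standard 2-D rotation
        out[2 * i + 1] = a * s + b * c
    return out

query = [0.1, -0.4, 0.7, 0.2, -0.3, 0.5, 0.9, -0.8]
rotated = rope(query, pos=5)   # same vector, position-5 rotation applied
```

The relative-offset property is what lets RoPE-based models extrapolate attention patterns across positions without a learned embedding per index.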
Transformer layers
Tokens pass through N layers of multi-head self-attention and feed-forward networks; modern models use 32-128 layers
Next-token prediction
A linear head projects the final hidden state to vocabulary logits; softmax gives probability distribution over next token
Training
Cross-entropy loss on next-token prediction over trillions of tokens from web text, code, and curated data
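The steps above can be sketched end to end: a softmax over vocabulary logits, the cross-entropy of the correct next token, and perplexity as the exponential of the mean loss. This toy example assumes per-step logits are already available:

```python
import math

def softmax(logits):
    m = max(logits)                         # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_next_token_loss(logits_per_step, target_ids):
    """Average cross-entropy (in nats) of the correct next token."""
    nll = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        nll += -math.log(softmax(logits)[target])
    return nll / len(target_ids)

# Toy 4-token vocabulary, two prediction steps.
step_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.2]]
targets = [0, 1]
loss = mean_next_token_loss(step_logits, targets)
perplexity = math.exp(loss)   # perplexity is exp of mean cross-entropy
```

Training frameworks compute the same quantity in batched tensor form and backpropagate through it over trillions of tokens.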
Current Landscape
Language modeling in 2025 is the foundation of the entire AI industry. The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) continue to hold: more compute and data produce better models. But the frontier has shifted from pure scale to efficiency (MoE architectures, DeepSeek), reasoning (o1-style inference-time compute), and post-training (RLHF, DPO, Constitutional AI). Open-source models lag frontier by 6-12 months but are increasingly competitive. The Chinchilla-optimal training paradigm has given way to over-training smaller models for cheaper inference.
Key Challenges
Scaling cost: training a frontier model costs $50-500M+ in compute; only a handful of organizations can afford it
Data quality and curation are arguably more important than model size — garbage in, garbage out at scale
Evaluation: perplexity doesn't capture reasoning ability; benchmarks saturate quickly; human evaluation is expensive
Alignment: making models helpful, harmless, and honest through RLHF/RLAIF adds complexity and potential capability loss
Inference cost: serving large models requires expensive GPU clusters; efficiency techniques (quantization, speculative decoding) are critical
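As one example of the efficiency techniques named above, here is a minimal sketch of symmetric absmax int8 quantization (function names are illustrative, not a specific library's API):

```python
def quantize_absmax(weights):
    """Symmetric absmax int8 quantization: map floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quants, scale):
    return [q * scale for q in quants]

w = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)   # close to w, at a quarter of fp32 memory
```

Real systems quantize per channel or per block and often keep outlier dimensions in higher precision, but the scale-and-round core is the same.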
Quick Recommendations
Best frontier model
Claude 3.5 Sonnet, GPT-4o, or Gemini 2.0 Pro
Top performance on reasoning, coding, and instruction following; competitive pricing
Open-source (large)
Llama 3.1 405B or DeepSeek-V3-671B (MoE)
Approaching frontier model quality; self-hostable for full data control
Open-source (efficient)
Qwen2.5-72B or Llama 3.1 70B
Best quality at the 70B scale; fits on 2x A100 with quantization
Small / edge
Llama 3.2 3B or Phi-3.5 Mini (3.8B)
Runs on mobile and laptop hardware; surprisingly capable for their size
Research / perplexity benchmark
GPT-4 or Gemini 1.5 Pro
Among the strongest reported results on standard LM benchmarks
What's Next
The next phase is test-time compute scaling (thinking longer to solve harder problems), multi-modal native models (text + image + audio + video in one architecture), and agentic models that can use tools, write code, and take actions. Expect the open-source gap to continue closing, with 70B-class models matching today's frontier within a year. Architecture innovations (state-space models, hybrid attention-SSM) may complement or partially replace pure transformers.
Benchmarks & SOTA
MMLU-Pro
The MMLU-Pro dataset contains 12K complex questions across various disciplines, including biology, business, chemistry, computer science, economics, engineering, math, physics, and psychology. It has 10 options per question, compared to the original MMLU's 4, making it more challenging. It also integrates more reasoning-focused problems, on which chain-of-thought (CoT) prompting can score significantly higher than direct perplexity-based (PPL) answer scoring.
State of the Art
Qwen2.5-Plus
72.5
Accuracy
MMLU-Redux
MMLU-Redux: Massive Multitask Language Understanding Redux
A carefully re-annotated version of the MMLU benchmark covering 30 subjects with 100 randomly sampled questions per subject (3,000 questions total). MMLU-Redux addresses numerous ground-truth errors found in the original 57-subject MMLU dataset: the authors' analysis revealed that approximately 6.49% of MMLU questions contain errors, with some subjects, such as Virology, containing errors in 57% of questions. The re-annotated subset provides a more accurate and reliable evaluation of language model capabilities.
State of the Art
Qwen2.5-72B-Instruct
86.8
Accuracy
IFEval
Instruction-Following Eval
A straightforward and easy-to-reproduce evaluation benchmark for large language models focused on instruction-following capabilities. IFEval contains around 500 prompts (541 in the train split) with verifiable instructions that can be objectively evaluated by heuristics, such as "write in more than 400 words", "mention the keyword of AI at least 3 times", "use no commas", or "include at least 3 highlighted sections". The benchmark identifies 25 types of verifiable instructions including punctuation constraints, length requirements, detectable content/format requirements, and keyword usage. Each prompt contains one or more verifiable instructions with corresponding kwargs for verification. This benchmark is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the Open LLM Leaderboard.
State of the Art
Qwen2.5-Plus
86.3
Accuracy
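The verifiable instructions IFEval uses (e.g. "use no commas", minimum word counts, keyword frequency) can be checked by simple string heuristics. A sketch with illustrative checker names, not the official IFEval implementation:

```python
def check_no_commas(text):
    return "," not in text

def check_min_words(text, n):
    return len(text.split()) >= n

def check_keyword_count(text, keyword, n):
    # Naive substring count; a real verifier would match word boundaries.
    return text.lower().count(keyword.lower()) >= n

response = ("AI systems are improving fast. Many researchers now "
            "study AI safety and AI alignment.")
results = [
    check_no_commas(response),               # "use no commas"
    check_min_words(response, 10),           # "write more than 10 words"
    check_keyword_count(response, "AI", 3),  # "mention AI at least 3 times"
]
passed = all(results)
```

Because every check is deterministic, IFEval scores are reproducible without an LLM judge, which is why it remains a core leaderboard benchmark.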
GPQA
GPQA: Graduate-Level Google-Proof Q&A Benchmark
GPQA is a graduate-level, "Google-proof" multiple-choice science QA benchmark written and validated by domain experts in biology, physics, and chemistry; the score here is reported as Avg@8 in the paper.
State of the Art
Qwen2.5-Plus
49.7
Accuracy
MATH
MATH (Measuring Mathematical Problem Solving) Dataset
MATH is a benchmark dataset of challenging competition-level mathematics problems introduced by Hendrycks et al. (NeurIPS Datasets & Benchmarks / arXiv 2103.03874). The dataset contains about 12,500 problems drawn from math competitions and is annotated with full step-by-step solutions (expressed in LaTeX and natural language) and final answers. Problems are organized by subject (e.g., algebra, counting & probability, geometry, number theory, precalculus) and difficulty level and are commonly distributed as a ~12,000-example training set plus a 500-example test set in public conversions. MATH is intended to evaluate and train models on mathematical problem solving and derivation generation (reasoning) and has been widely used as a benchmark for LLM math reasoning.
State of the Art
Qwen2.5-Plus
84.7
Accuracy
MGSM
Multilingual Grade School Math (MGSM)
Multilingual Grade School Math (MGSM) is a multilingual benchmark of grade-school math word problems introduced in the paper “Language Models are Multilingual Chain-of-Thought Reasoners” (arXiv:2210.03057). It contains the same 250 problems from GSM8K, each manually translated into 10 typologically diverse languages (Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu) plus English. MGSM is used to evaluate multilingual reasoning and chain-of-thought capabilities of language models (includes inputs, targets, and manually translated few-shot exemplars). License: CC BY-SA 4.0. Size: 250 problems × languages (1K<n<10K overall). Note: referenced as MGS / MGSM in some papers (reported in pre-training comparisons).
State of the Art
Qwen2.5-72B-Instruct
88.16
Accuracy
Arena-Hard
Arena-Hard (Arena-Hard-Auto)
Arena-Hard is a human-aligned benchmark of challenging open-ended prompts sourced from live crowd platforms (notably Chatbot Arena) designed to robustly separate LLM capability and reflect human preference. It was introduced in the paper “From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline” (arXiv:2406.11939). The Arena-Hard-Auto variant (published on Hugging Face as Arena-Hard-Auto / v0.1) is an automatic evaluation suite that contains 500 challenging user queries extracted from Chatbot Arena and uses an LLM-as-a-judge (the dataset authors report prompting GPT-4-Turbo to act as judge, comparing model responses against a baseline such as GPT-4-0314). The BenchBuilder pipeline described in the paper automates extracting high-quality prompts from crowdsourced data and producing an automatically-judged benchmark with high correlation and separability relative to the live Chatbot Arena. Common uses: automatic and human-aligned evaluation of instruction-tuned LLMs and benchmarking alignment/safety/helpfulness.
State of the Art
Qwen2.5-Plus
81.4
Accuracy
RULER
RULER: What’s the Real Context Size of Your Long-Context Language Models?
RULER is a synthetic, configurable long-context benchmarking suite for evaluating language models’ ability to use very long contexts. Introduced in the paper “RULER: What’s the Real Context Size of Your Long-Context Language Models?” (arXiv:2404.06654), RULER extends the common “needle-in-a-haystack” (NIAH) retrieval test into a richer set of controlled variations with flexible configurations for sequence length and task complexity. The benchmark is designed to probe more than simple retrieval by varying task types and difficulty and to measure model performance across many context lengths (the authors report evaluations up to 1M tokens). The code and data-generation tools are provided by the authors in the public NVIDIA RULER GitHub repository (https://github.com/NVIDIA/RULER).
State of the Art
Qwen2.5-72B-Instruct
95.1
Accuracy
okapi MMLU (translated)
okapi MMLU (translated MMLU for multilingual evaluation)
A translated / multilingual version of the MMLU (Measuring Massive Multitask Language Understanding) benchmark adapted for multilingual evaluation. MMLU is a 57-task, multiple-choice benchmark covering subjects across humanities, social sciences, and STEM requiring broad world knowledge and problem-solving. The "okapi MMLU (translated)" assets on Hugging Face provide MMLU questions and answers translated into multiple languages (examples on HF include many languages such as id, vi, ar, bn, de, es, fr, etc.). The translated MMLU variants are commonly used for multilingual few-shot evaluation (the Okapi paper reports using translated MMLU in 5-shot evaluations). License on the HF repos is listed as CC-BY-NC-4.0. Source references: the original MMLU paper (Hendrycks et al., arXiv:2009.03300) and the Okapi project (Okapi: instruction-tuned LLMs; arXiv:2307.16039) and the Hugging Face dataset pages (e.g., jon-tow/okapi_mmlu and SEACrowd/okapi_m_mmlu).
State of the Art
Qwen2.5-72B-Instruct
79.97
Accuracy
MT-Bench
MT-Bench (Multi-Turn Benchmark)
MT-Bench is a multi-turn benchmark for evaluating the conversational and instruction-following abilities of large language model (LLM) chat assistants. It was introduced in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685). MT-Bench is a collection of open-ended, multi-turn questions/prompts designed to probe coherence, context maintenance, reasoning, and helpfulness in dialogue. The benchmark is commonly evaluated using an “LLM-as-a-judge” methodology (using strong LLMs such as GPT-4 to score/rank responses), which the authors show can achieve high agreement with human preferences. Public Hugging Face mirrors of the MT-Bench data (e.g., philschmid/mt-bench and lighteval/mt-bench) commonly expose an 80-item multi-turn set that is widely used for reporting a numeric MT-Bench score.
State of the Art
Qwen2.5-72B-Instruct
9.35
Score (1-10)
LV-Eval
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
LV-Eval is a bilingual long-context benchmark designed to evaluate large language models at very large context lengths (up to 256k tokens). It provides controllable evaluation across five length levels (16k, 32k, 64k, 128k, 256k) and includes multiple QA-style tasks (single-hop and multi-hop QA) drawn from several bilingual datasets. The benchmark incorporates techniques to reduce knowledge leakage and increase difficulty and objectivity: confusing facts insertion (CFI), keyword and phrase replacement (KPR), and a keyword-recall-based metric evaluated at multiple lengths. LV-Eval is provided with balanced numbers of instances across lengths and is intended to stress-test long-context capabilities of LLMs.
State of the Art
Qwen2.5-72B-Instruct
60.4
Accuracy
LongBench-Chat
LongBench-Chat: Long Context Instruction-Following Benchmark
LongBench-Chat is a benchmark for evaluating instruction-following capabilities of large language models on queries of 10k-100k tokens in length. It was introduced in the LongAlign paper to test how well models follow instructions over very long contexts.
State of the Art
Qwen2.5-72B-Instruct
8.72
Score (1-10)
GSM8K
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems.
State of the Art
Qwen2.5-Plus
96
Accuracy
LiveBench
LiveBench is a benchmark designed to limit test-set contamination: new questions are released on a rolling basis, drawn from recent sources such as fresh math competitions, arXiv papers, and news articles, and answers are scored objectively against ground truth rather than by an LLM judge. It spans categories including math, coding, reasoning, language, data analysis, and instruction following; the score tracked here corresponds to the language category.
State of the Art
Qwen2.5-Plus
54.6
Accuracy
MultiChallenge
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
MultiChallenge is a multi-turn conversational evaluation benchmark designed to measure LLMs' ability to conduct realistic, multi-turn conversations with human users. The benchmark identifies four categories of realistic conversational challenges (e.g., instruction retention, inference/memory across turns, handling versioned or updated information, and context allocation) that require integrated instruction-following, context management, and in-context reasoning. The dataset was created via a hybrid data-generation process (LLM agents plus human review) and includes an automatic evaluation pipeline that uses LLMs-as-judges with instance-level rubrics, which the authors report aligns well with experienced human raters. In the paper's reported evaluations, current frontier models score well below saturation on MultiChallenge (all <50% average accuracy; top reported model Claude 3.5 Sonnet reached ≤41.4%), demonstrating that MultiChallenge exposes realistic multi-turn failure modes not captured by prior multi-turn benchmarks. The benchmark is accompanied by a public leaderboard (Scale) and a GitHub repo with details and data generation code. Table 6 in the paper summarizes the multi-turn evaluation setup and results.
No results tracked yet
SafetyBench
SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions
SafetyBench is a comprehensive benchmark for evaluating the safety of large language models. It contains 11,435 diverse multiple-choice safety questions spanning seven safety categories (Offensiveness; Unfairness & Bias; Physical Health; Mental Health; Illegal Activities; Ethics & Morality; Privacy & Property). The benchmark includes both Chinese and English data (the authors release language-specific test files such as test_zh.json and test_en.json) and is intended for automatic evaluation of LLM safety via multiple-choice accuracy per-category and overall (the paper reports overall and per-category scores, e.g., in Table 7).
No results tracked yet
SysBench (ISR)
SysBench (system-message-following benchmark)
SysBench is a system-message-following benchmark for evaluating Large Language Models (LLMs). It measures how well models adhere to system messages across dimensions such as constraint complexity, instruction misalignment, and multi-turn stability. The benchmark provides evaluation examples (the Hugging Face dataset includes a test split stored as system_benchmark_eval_datas.json) and reports results using an ISR metric (reported in the paper) to quantify system-message-following performance. The dataset and code are publicly released by PKU-Baichuan-MLSystemLab (GitHub) and are hosted on Hugging Face.
No results tracked yet
ZeroSCROLLS/QuALITY
QuALITY (ZeroSCROLLS subset)
QuALITY (as used in ZeroSCROLLS) is the QuALITY multiple-choice reading-comprehension / question-answering dataset subset included in the ZeroSCROLLS zero-shot long-context benchmark. The original QuALITY dataset (Pang et al., NAACL 2022; arXiv:2112.08608) contains English passages with very long contexts (average ~5,000 tokens) and human-authored multiple-choice questions and distractors; questions were written and validated by annotators who read the full passage, so many require deep comprehension and cannot be solved by simple skimming or short excerpts. In ZeroSCROLLS the QuALITY data is adapted/used as a zero-shot test (and small validation) set to evaluate long-context model understanding in a zero-shot setting (see ZeroSCROLLS paper arXiv:2305.14196). Use cases: long-document QA / reading comprehension, multiple-choice QA over long contexts.
No results tracked yet
HiddenMath
HiddenMath
HiddenMath is reported to be a hidden/internal benchmark of competition-style mathematics problems used to evaluate large language models. Publicly-available evidence is limited: an LLM benchmark listing (LLMDB) describes HiddenMath as "Google’s internal holdout set of competition math problems" and reports scores on a 0–100 accuracy scale. No public dataset release, Hugging Face dataset page, or dedicated paper was found; the dataset appears to be a private/held-out test set used in model evaluation (reported in Gemma 3 Technical Report Table 6 as "HiddenMath", metric = accuracy). Source: LLMDB entry for HiddenMath (https://llmdb.com/benchmarks/hiddenmath).
No results tracked yet
Bird-SQL (dev)
BIRD-SQL (BIg Bench for Large-Scale Database-Grounded Text-to-SQLs)
Development split (dev) of BIRD-SQL (BIRD). BIRD-SQL is a large cross-domain text-to-SQL benchmark designed to evaluate natural-language-to-SQL parsing against realistic, value-rich relational databases. BIRD contains 12,751 text-to-SQL question–SQL pairs grounded on 95 databases (total ~33.4 GB) spanning ~37 professional domains; it emphasizes database values (dirty/noisy values and external-knowledge grounding) to better match real-world DB assistant scenarios. The benchmark provides standard splits including a development (dev) split (the dev archive is distributed by the authors) which is commonly used for model evaluation (accuracy / execution metrics in papers). The evaluation result referenced corresponds specifically to the development split as reported in Table 6 (metric: accuracy).
No results tracked yet
Global MMLU-Lite
Global-MMLU-Lite
Global-MMLU-Lite is a compact multilingual evaluation subset of the Global-MMLU benchmark. The Lite version covers 16 languages (a subset of the full 42-language Global-MMLU) and contains human-translated / post-edited MMLU-style multiple-choice questions. For each included language, the dataset provides 200 Culturally Sensitive (CS) and 200 Culturally Agnostic (CA) samples (i.e., 400 examples per language). The Lite split selects languages from Global-MMLU that were fully human-translated or post-edited, enabling a smaller, reproducible evaluation set for multilingual model comparisons. License: Apache-2.0. (Source: Hugging Face dataset card for CohereLabs/Global-MMLU-Lite and the Global MMLU paper arXiv:2412.03304 / ACL 2025.)
No results tracked yet
BBH
BIG-Bench Hard (BBH)
BIG-Bench Hard (BBH) is a curated subset of challenging tasks from the BIG-Bench benchmark selected because prior language model evaluations underperformed average human raters. Introduced in "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them" (Suzgun et al., 2022/ACL Findings 2023), BBH comprises 23 diverse, hard reasoning and understanding tasks (examples: boolean_expressions, logical_deduction_three/five/seven_objects, dyck_languages, multistep_arithmetic_two, object_counting, tracking_shuffled_objects, salient_translation_error_detection, etc.). BBH is explicitly evaluated with few-shot and chain-of-thought (CoT) prompting to study whether CoT helps solve these harder tasks. The suite is commonly distributed on Hugging Face and GitHub as "BIG-Bench Hard" and is widely used as a benchmark for advanced reasoning capabilities.
No results tracked yet
SimpleQA
SimpleQA is a benchmark from OpenAI that evaluates short-form factuality in large language models. It consists of short fact-seeking questions, each with a single indisputable answer, and responses are graded as correct, incorrect, or not attempted, making it a measure of hallucination and calibration rather than reasoning.
No results tracked yet
SuperGPQA
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
SuperGPQA is a large multiple-choice question benchmark for evaluating LLM knowledge and reasoning across 285 graduate-level disciplines. The public dataset (HF: m-a-p/SuperGPQA) contains ~26.5K question instances (train split) and was constructed to include at least 50 questions per discipline. Each example includes fields such as question, options, answer (and answer_letter), discipline/field/subfield labels, difficulty, and an is_calculation flag. The benchmark was released with an open-data license (ODC-BY) and is intended for evaluation of LLM factual knowledge and problem solving across highly specialized academic and professional subject areas.
No results tracked yet
AutoLogi
AutoLogi: Automated Logic Puzzle Benchmark
AutoLogi is a bilingual benchmark of automatically generated, open-ended logic puzzles designed to evaluate the logical reasoning abilities of large language models. Instances are synthesized by a programmatic generator with program-based verification to ensure solvability and correctness, and the generation process supports controllable difficulty levels to better distinguish model capabilities. The dataset was published alongside the paper “AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models” (arXiv:2502.16906). The Hugging Face release (qzhu/AutoLogi) is licensed under Apache-2.0 and contains on the order of 1K–10K examples. Used in post-training evaluations (Table 11) of Qwen3.
No results tracked yet
FACTS Grounding
Evaluates LLMs' ability to generate long-form responses that are factually accurate and strictly "grounded" in provided context documents, thereby mitigating hallucination. Tasks require models to generate responses based exclusively on documents up to 32,000 tokens long.
No results tracked yet
C-Eval
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
C-Eval is a comprehensive Chinese evaluation suite for foundation models containing 13,948 multiple-choice questions across 52 disciplines and four difficulty levels (middle school, high school, college, and professional). It also provides a C-Eval HARD subset of especially challenging questions. The benchmark is designed to assess knowledge and reasoning abilities of Chinese/Chinese-aware large language models; the authors publish dataset files, code, and examples on the project website and GitHub, and the dataset is hosted on Hugging Face (ceval/ceval-exam). (Paper: C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, arXiv:2305.08322; NeurIPS 2023 Datasets & Benchmarks track.)
No results tracked yet
WikiText Perplexity
Language modeling quality measured by perplexity on Wikipedia text, typically the WikiText-2 and WikiText-103 corpora of verified Good and Featured articles (Merity et al., 2016)
No results tracked yet
EvalPlus
EvalPlus
EvalPlus is an evaluation framework and leaderboard for LLMs on code-generation tasks (LLM4Code). The EvalPlus project provides rigorously extended test suites for popular coding benchmarks (notably HumanEval+ and MBPP+) and tooling to evaluate models (pass@1, chat vs completion, etc.). HumanEval+ and MBPP+ are enlarged, hand-verified test sets (HumanEval+ ~80x more tests than original HumanEval; MBPP+ ~35x more tests than original MBPP) maintained by the EvalPlus team. In the NeurIPS paper “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation” (arXiv:2305.01210) the authors report an aggregate coding score referred to as “EvalPlus” (used in e.g., Table 3) which is computed from the constituent benchmarks (HumanEval, MBPP, HumanEval+, MBPP+). Primary sources: EvalPlus GitHub & website (https://github.com/evalplus, https://evalplus.github.io/leaderboard.html), Hugging Face dataset pages for the extended datasets (HumanEval+: https://huggingface.co/datasets/evalplus/humanevalplus , MBPP+: https://huggingface.co/datasets/evalplus/mbppplus), and the NeurIPS / arXiv paper (arXiv:2305.01210).
No results tracked yet
Multi-IF
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
Multi-IF is a benchmark for evaluating large language models on multi-turn, multilingual instruction-following. It extends the IFEval framework by incorporating multi-turn sequences and translating English prompts into seven additional languages, producing 4,501 multilingual conversations where each conversation has three turns. The benchmark uses a hybrid annotation/evaluation framework combining LLMs and human annotators and was used to evaluate state-of-the-art LLMs. Languages covered include English, French, Spanish, Portuguese, Hindi, Chinese, Russian, and Italian. The dataset and evaluation code are hosted by Facebook/Meta on Hugging Face and GitHub (license: CC-BY-NC-2.0).
No results tracked yet
INCLUDE
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
INCLUDE is a multilingual, knowledge- and reasoning-centric evaluation benchmark built from local academic and professional exam sources to measure multilingual LLM performance in real regional contexts. According to the paper (arXiv:2411.19799) INCLUDE comprises a large evaluation suite (the paper reports 197,243 QA pairs in total) covering regional/cultural knowledge across many topics and 44 written languages. A released Hugging Face dataset variant (CohereLabs/include-base-44) is a curated subset described as "INCLUDE-base (44 languages)" and contains 22,637 4-option multiple-choice questions spanning 57 topics (domains include chemistry, biology, legal, finance, medical, climate, art, code). Metadata on the HF page lists the 44 languages, Apache-2.0 license, task categories (multiple-choice, text2text-generation), and links to the paper. Note: the Qwen3 paper (arXiv:2505.09388) reports using INCLUDE with 10% sampling for some evaluations (used in post-training, Table 11). Source: arXiv:2411.19799 and Hugging Face dataset page CohereLabs/include-base-44.
No results tracked yet
Winogrande
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
WinoGrande is a large-scale Winograd-style commonsense reasoning dataset introduced to probe pronoun resolution and robust commonsense understanding. Inspired by the original Winograd Schema Challenge, WinoGrande contains ~44k fill-in-the-blank problems with binary options (right/wrong antecedent). Instances were collected via a careful crowdsourcing pipeline and then filtered with an adversarial filtering algorithm (AFLITE) to reduce dataset-specific statistical biases; roughly half the examples were identified as adversarial in the original release. The benchmark is designed to be harder and less exploitable by spurious correlations than earlier WSC variants; reported human performance is very high (~94%) while state-of-the-art models (as of the paper) were substantially lower. The dataset is available from the authors (allenai) and hosted on the Hugging Face datasets hub.
No results tracked yet
CommonsenseQA
CommonsenseQA
CommonsenseQA is a multiple-choice question-answering benchmark that tests commonsense/world knowledge. Questions were created by crowdworkers based on ConceptNet relations: for a source concept the authors extracted multiple target concepts that share a semantic relation, and workers authored questions that mention the source concept and discriminate among the targets. The set contains roughly 12k questions (paper reports 12,247 questions; the Hugging Face dataset card lists 12,102) with one correct answer and four distractors (5-way multiple choice). The dataset includes standard train/validation/test splits (see paper) and was shown to be challenging for strong baselines (BERT-large baseline ~56% vs. human ~89% per the original paper).
No results tracked yet
OpenBookQA
OpenBookQA (Open Book Question Answering)
OpenBookQA is a multiple-choice question answering dataset modeled after open-book exams to probe deeper understanding and multi-step reasoning. The dataset provides an “open book” of elementary-level science facts (≤1.3k facts) plus roughly 6k multiple-choice questions that require combining a provided core science fact with broad common-sense or world knowledge to answer. Each example contains a question stem, four answer choices, an answer key, and an associated core fact (the ‘‘open book’’ fact). The data is split into train (~4.96k questions), validation (500) and test (500). It was created to encourage research on reasoning and knowledge-combination beyond surface-level reading comprehension.
No results tracked yet
OpenRewrite-Eval
OPENREWRITEEVAL (OpenRewriteEval)
OPENREWRITEEVAL (OpenRewriteEval) is a benchmark for evaluating long-form, open-ended text rewriting by large language models. It covers a wide variety of rewriting types expressed through natural-language instructions and is designed to measure content preservation and to detect hallucinations or unintended modifications introduced by models when rewriting long-form text. The Hugging Face reupload (gabrielmbmb/OpenRewriteEval) contains a single split (train) with ~1.63k examples; fields include source (original long-form text), target (desired rewritten text), comment, and a task label with 6 classes (different rewriting types). The HF dataset page notes it was reuploaded from the original RewriteLM GitHub repository for convenience.
No results tracked yet
ARC
AI2 Reasoning Challenge (ARC)
The AI2 Reasoning Challenge (ARC) is a benchmark of 7,787 natural, grade-school-level multiple-choice science questions (authored for human tests) designed to encourage research in advanced question answering and reasoning. The question set is partitioned into two subsets: ARC-Challenge (questions that simple retrieval and word co-occurrence algorithms get wrong; ~2.59k questions) and ARC-Easy (~5.2k questions). The release also includes the ARC Corpus, a large corpus of science-relevant sentences (~14 million sentences) intended to support retrieval/knowledge components. ARC focuses on questions requiring deeper knowledge and reasoning than many earlier QA datasets and provides baseline implementations; it is widely used for multiple-choice and open-domain QA evaluation. License: CC BY-SA 4.0. Language: English.
No results tracked yet
HellaSwag
HellaSwag: Can a Machine Really Finish Your Sentence?
HellaSwag is a multiple-choice commonsense sentence-completion / commonsense NLI benchmark introduced by Zellers et al. (ACL 2019). Each example provides a short context and four candidate endings; the task is to pick the most plausible continuation. The dataset was constructed using Adversarial Filtering (AF) to select challenging, machine-generated distractors (making examples trivial for humans but difficult for models). Source contexts are drawn from domains such as ActivityNet captions and WikiHow. Standard splits on the Hugging Face / official release are roughly: train ≈ 39.9k, validation 10k, test 10k (≈60k total). Human accuracy reported >95%, while contemporary models at publication time scored substantially lower (paper reports under ~48%).
No results tracked yet
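Multiple-choice benchmarks like HellaSwag (and ARC or OpenBookQA above) are commonly scored by having the language model assign a log-likelihood to each candidate ending, length-normalized so longer endings are not penalized, and picking the highest-scoring one. A minimal sketch of that scoring loop, with a `token_logprob` callable standing in for a real language model (the function names here are illustrative, not from any official evaluation harness):

```python
import math


def score_ending(context_tokens, ending_tokens, token_logprob):
    """Length-normalized log-likelihood of an ending given a context.

    token_logprob(prefix, tok) returns log P(tok | prefix) under the model.
    """
    total = 0.0
    prefix = list(context_tokens)
    for tok in ending_tokens:
        total += token_logprob(prefix, tok)
        prefix.append(tok)
    # Normalize by ending length so longer candidates are not penalized.
    return total / len(ending_tokens)


def pick_ending(context_tokens, candidate_endings, token_logprob):
    """Return the index of the most plausible candidate ending."""
    scores = [score_ending(context_tokens, e, token_logprob)
              for e in candidate_endings]
    return max(range(len(scores)), key=scores.__getitem__)
```

With a real model, `token_logprob` would wrap a forward pass over the tokenized context; the same loop applies to any four-way completion task.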
TriviaQA
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
TriviaQA is a large-scale reading-comprehension / question-answering dataset introduced by Joshi et al. (ACL 2017). It contains over 650K question-answer-evidence triples (about 95K question-answer pairs authored by trivia enthusiasts) with independently gathered evidence documents (about six evidence documents per question on average). The dataset provides both a reading-comprehension (RC) version (contexts where answers appear) and an unfiltered / open-domain style version (where not all retrieved documents necessarily contain the answer). TriviaQA was designed to be more challenging than prior RC datasets: questions are often compositional, exhibit high syntactic/lexical variability relative to answer-evidence sentences, and frequently require cross-sentence reasoning. The original paper provides RC and open-domain splits and baselines; data and downloads are available from the project page and via Hugging Face.
No results tracked yet
GPQA Diamond
GPQA Diamond is the highest-quality subset of GPQA (Graduate-Level Google-Proof Q&A), a benchmark of expert-written, expert-validated multiple-choice questions in biology, physics, and chemistry. The full GPQA set contains 448 questions; the Diamond subset comprises the 198 hardest, on which domain experts agree but skilled non-experts fail. It is designed to be a challenging benchmark for advanced AI reasoning and drives progress in scalable oversight and structured problem-solving.
No results tracked yet
MMMLU
The Multilingual Massive Multitask Language Understanding (MMMLU) dataset was released by OpenAI on Hugging Face. It is a professionally human-translated version of the MMLU test set covering 14 languages, used to evaluate multilingual large language models across diverse linguistic, cognitive, and cultural contexts.
No results tracked yet
C-SimpleQA
Chinese SimpleQA (C-SimpleQA)
Chinese SimpleQA (C-SimpleQA) is a Chinese-language benchmark for evaluating the factuality of large language models on short question answering. It was designed to be diverse and high-quality: it covers six major topics with 99 diverse subtopics, uses static (time-stable) reference answers, and includes a thorough quality-control process to ensure reliable evaluation. The dataset is provided in JSON format and is intended for easy, reproducible evaluation of model factuality on short Chinese questions. License: CC BY-NC-SA 4.0.
No results tracked yet
LongBench v2
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
LongBench v2 is a long-context benchmark designed to evaluate large language models’ ability to perform deep understanding and reasoning across realistic long-context multitasks. The benchmark contains 503 challenging multiple-choice questions with contexts ranging from ~8k to 2M words (majority under ~128k). It covers six major categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code-repository understanding, and long structured-data understanding. The authors provide evaluation modes with and without chain-of-thought (CoT) reasoning and categorize examples by short/medium/long context lengths to measure model performance as context size grows. Data and code are available from the project page and the Hugging Face dataset repository; the dataset is tagged for multiple-choice, question-answering, text-classification, and table-question-answering tasks.
No results tracked yet
AIME 2025
This dataset contains the 30 problems from the American Invitational Mathematics Examination (AIME) 2025-I & II (15 problems per exam). It is used to evaluate LLMs' competition-level mathematical reasoning; each answer is an integer from 0 to 999.
No results tracked yet
AIME 2024
The AIME 2024 dataset contains problems from the American Invitational Mathematics Examination (AIME) 2024. It is primarily used for evaluating Large Language Models' (LLMs) mathematical reasoning and problem-solving capabilities on complex mathematical problems. Each record includes an ID, problem statement, detailed solution process, and the final numerical answer. The dataset covers various mathematical domains (geometry, algebra, number theory, etc.) and is known for its high difficulty level.
No results tracked yet
ECLeKTic
ECLeKTic: A Multi-Lingual Knowledge Testing Dataset
ECLeKTic is a multilingual closed-book question-answering benchmark for evaluating cross-lingual knowledge transfer. Questions are authored from content available in only one language's Wikipedia and then translated into the other evaluation languages, so a model must transfer knowledge acquired in one language to answer the same question asked in another.
No results tracked yet
MRCR v2 (1M)
Multi-Round Co-reference Resolution (1M context)
MRCR (Multi-Round Co-reference Resolution) is a long-context benchmark in which the model must retrieve and distinguish between multiple similar items ("needles") placed in a long multi-turn conversation. This variant evaluates long-context language models with a 1M token context window.
No results tracked yet
MRCR v2 (≤128K)
Multi-Round Co-reference Resolution (≤128K context)
MRCR (Multi-Round Co-reference Resolution) is a long-context benchmark in which the model must retrieve and distinguish between multiple similar items ("needles") placed in a long multi-turn conversation. This variant evaluates language models with context windows up to 128K tokens.
No results tracked yet
ZebraLogic
ZebraLogic is a benchmark of logic grid puzzles ("Einstein" or zebra puzzles) for evaluating LLMs' systematic, multi-step logical reasoning. Each puzzle specifies a set of houses and attributes together with a list of clues, and the model must deduce the unique assignment of attributes that satisfies every constraint.
No results tracked yet
WritingBench
WritingBench: A Comprehensive Benchmark for Generative Writing
A comprehensive benchmark for evaluating LLMs' writing capabilities across 1,000 real-world queries spanning 6 primary domains (Academic & Engineering, Finance & Business, Politics & Law, Literature & Art, Education, and Advertising & Marketing) and 100 fine-grained subdomains. Each query averages 1,500+ tokens and is paired with 5 instance-specific evaluation criteria. The benchmark uses a hybrid construction pipeline combining Model-Augmented Query Generation and Human-in-the-Loop Refinement. Evaluation is conducted through a query-dependent framework with dynamic criteria generation and rubric-based scoring on a 10-point scale, using either LLM evaluators (Claude-Sonnet-4) or a fine-tuned critic model.
No results tracked yet
MMLU
MMLU (Measuring Massive Multitask Language Understanding) is a popular benchmark used to evaluate the capabilities of large language models. It is a multidisciplinary collection of four-option multiple-choice questions spanning 57 subjects, from elementary mathematics to law and medicine, and has inspired numerous other versions and spin-offs.
No results tracked yet
Creative Writing Benchmark v3
EQ-Bench Creative Writing Benchmark v3
A comprehensive benchmark for evaluating the creative writing capabilities of large language models using a hybrid rubric and Elo scoring system. The evaluation uses 32 distinct writing prompts across 3 iterations (96 items total) with temperature 0.7 and min_p 0.1. Each generated piece is assessed by a judge model (Claude 3.7 Sonnet) against a comprehensive rubric, followed by pairwise matchups using the Glicko-2 rating system that accounts for win margins. The benchmark is designed for enhanced discrimination at the top end of model performance and includes prompts challenging models in humor, romance, spatial awareness, and unique perspectives. It implements bias mitigation strategies for length, position, verbosity, and poetic incoherence. Used for the official Creative Writing leaderboard on EQ-Bench.com.
No results tracked yet
DROP
Discrete Reasoning Over Paragraphs (DROP)
DROP (Discrete Reasoning Over Paragraphs) is an English reading-comprehension benchmark that requires discrete, multi-step reasoning over paragraphs (e.g., addition, counting, sorting, and resolving references to multiple passage positions). Introduced by Dua et al. in "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs" (NAACL/ACL 2019; arXiv:1903.00161), the dataset was crowdsourced and adversarially created to avoid shallow shortcuts. The full collection contains approximately 96k question–answer pairs over ~6.7k passages (train ≈77k, dev ≈9.5k, hidden test ≈9.5k). Publicly-available splits on Hugging Face and other mirrors contain the train and dev splits (train ≈77.4k, validation ≈9.54k). Answers include span-based answers and free-form/numeric answers (numerical reasoning is a core focus). Evaluation follows common QA practice with word-level F1 and exact match (EM). The dataset is provided under a CC BY license and is hosted/mirrored by the Allen Institute for AI and on Hugging Face.
No results tracked yet
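The word-level F1 and exact-match metrics used for DROP can be sketched as follows. This follows the common SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and deliberately omits the official DROP script's extra handling of numbers and multi-span answers, so it is a simplification rather than the reference implementation:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Bag-of-words F1 over normalized tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Per-example scores are averaged over the dataset; with multiple gold answers the maximum score against any gold answer is taken.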
AlignBench
AlignBench: Benchmarking Chinese Alignment of Large Language Models
A comprehensive multi-dimensional benchmark for evaluating large language models' alignment capabilities in Chinese. AlignBench contains 683 high-quality samples curated through a human-in-the-loop data curation pipeline across 8 main categories: Fundamental Language Ability (68 samples), Chinese Advanced Understanding (58), Open-ended Questions (38), Writing Ability (75), Logical Reasoning (92), Mathematics (112), Task-oriented Role Play (116), and Professional Knowledge (124). Each sample includes a task-oriented query, a high-quality reference answer with evidence from reliable web sources, and a corresponding category classification. The benchmark uses a multi-dimensional rule-calibrated LLM-as-Judge approach with Chain-of-Thought to generate explanations and ratings (1-10 scale), employing GPT-4 or the dedicated CritiqueLLM evaluator (which recovers 95% of GPT-4's evaluation ability). The evaluation ensures high reliability and interpretability through point-wise grading, Chain-of-Thought reasoning, and rule-calibrated referencing. Since release, AlignBench has been adopted by top Chinese LLMs including ChatGLM, Qwen, DeepSeek, Yi, Baichuan, and Abab.
No results tracked yet
MATH 500
MATH 500 is a 500-problem subset of the MATH benchmark of competition mathematics problems, used to evaluate language models' ability to solve mathematical problems. The problems span seven subjects: Algebra, Intermediate Algebra, Precalculus, Geometry, Number Theory, Prealgebra, and Counting & Probability, across five difficulty levels (1 to 5).
No results tracked yet
Penn Treebank (WSJ Section 23)
Penn Treebank (Wall Street Journal, Section 23)
The Penn Treebank (PTB) WSJ portion is a widely used annotated corpus of Wall Street Journal newswire text (roughly 1 million words). It was originally described in Marcus et al., 1993 ("Building a Large Annotated Corpus of English: The Penn Treebank") and distributed as the Treebank releases (e.g. Treebank-3 / LDC99T42). The WSJ portion is annotated for part-of-speech (POS) and syntactic constituency trees and is commonly used for parsing, POS tagging and language modeling research. Section 23 of the WSJ is the standard test set in many parsing and language-modeling evaluations (e.g., parsing train/dev/test splits often use sections 02–21 for training, 22 for development and 23 for test). Hugging Face hosts a text-only PTB dataset (ptb-text-only/ptb_text_only) which provides the PTB text splits (the HF dataset notes that the source is the Penn Treebank Project / WSJ material and that licensing is via LDC). Note: the original Penn Treebank was published in Computational Linguistics (Marcus et al., 1993) and the corpus distribution is controlled by the LDC (Treebank releases such as LDC99T42).
No results tracked yet
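Language-modeling results on PTB are conventionally reported as per-word perplexity: the exponential of the average negative log-likelihood the model assigns to the test tokens. A minimal sketch of the computation, given per-token log-probabilities from any model:

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_logprobs: iterable of natural-log probabilities, one per token,
    as assigned by the model to the held-out text (e.g. WSJ Section 23).
    """
    logprobs = list(token_logprobs)
    avg_nll = -sum(logprobs) / len(logprobs)
    return math.exp(avg_nll)
```

A sanity check: a model that spreads probability uniformly over a V-word vocabulary assigns log(1/V) to every token and therefore has perplexity exactly V.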
Related Tasks
Machine Translation
Machine Translation is the task of automatically translating text from one natural language to another. The goal is to produce translations that preserve the meaning, style, and grammatical correctness of the source text while being fluent in the target language.
Text classification
Text classification is a machine learning process of automatically assigning predefined categories or labels to text based on its content, often using natural language processing (NLP). It involves analyzing text to understand its meaning and then applying the most appropriate label, with common applications including sentiment analysis (e.g., positive/negative reviews), spam detection, and topic categorization (e.g., organizing news articles).