| 01 | Commonsense Reasoning: answering questions that require everyday knowledge about how the physical and social… | Massive Multitask Language Understanding (legacy, ambiguous: MMLU is saturated and better treated as a general-knowledge / legacy LLM eval, not canonical commonsense reasoning) | o3 | 92.9% accuracy | 82 |
| 02 | Mathematical Reasoning: benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco… | Mathematics Aptitude Test of Heuristics | Claude Opus 4.5 | 90.7% accuracy | 79 |
| 03 | Multi-step Reasoning: maintaining coherent inference chains across 5+ sequential steps is the meta-capabili… | Graduate-Level Google-Proof Q&A Diamond | Gemini 2.5 Pro | 84.0% accuracy | 53 |
| 04 | Question Answering: now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning,… | Natural Questions: a Benchmark for Question Answering Research | — | — | 26 |
| 05 | Text Summarization: compresses documents while preserving key information — a task that became dramatically mor… | CNN/DailyMail Summarization | BRIO | 47.8% ROUGE-1 | 15 |
| 06 | Logical Reasoning: formal deduction, constraint satisfaction, and syllogistic inference — an area that exposes a core weak… | LogiQA | GPT-4o | 56.3% accuracy | 12 |
| 07 | Natural Language Inference: determining entailment relationships between sentences (SNLI, MNLI). | Stanford Natural Language Inference | GPT-4o | 92.6% accuracy | 8 |
| 08 | Text Ranking: the invisible backbone of every search engine and RAG pipeline. The field was transformed by C… | BEIR (legacy, ambiguous: a legacy retrieval snapshot; split modern retrieval, reranking, multilingual, and long-context RAG evals before calling this current SOTA) | NV-Embed-v2 | 62.65 nDCG@10 | 8 |
| 09 | Named Entity Recognition (NER): extracting structured mentions — people, organizations, locations, dates — from u… | CoNLL-2003 Named Entity Recognition | GLiNER-multitask | 93.8% F1 | 7 |
| 10 | Arithmetic Reasoning: solving computation-heavy problems stated in natural language, testing whether models ca… | Math Word Problem Repository | GPT-4o | 97.2% accuracy | 6 |
| 11 | Text Embeddings: generating dense vector embeddings for retrieval, ranking, clustering, and semantic search. | MTEB English (legacy 2024 snapshot; historical, ambiguous: NV-Embed-v2 is a historical MTEB English 56-task snapshot, not a fresh 2026 embedding frontier) | NV-Embed-v2 | 72.31 avg. score | 6 |
| 12 | Entity Linking: linking mentions to knowledge-base entities. | AIDA-CoNLL-YAGO (test-b) | GENRE | 93.30 micro-F1 | 3 |
| 13 | Knowledge Graph Completion: predicting missing links in knowledge graphs. | FB15k-237 Knowledge Graph Completion | NBFNet | 0.415 MRR | 3 |
| 14 | Relation Extraction: extracting relationships between entities from text. | TAC Relation Extraction Dataset | LUKE | 72.7% F1 | 3 |
| 15 | Semantic Textual Similarity: how close two pieces of text are in meaning — the foundation of duplicate detecti… | STS Benchmark | GTE-Qwen2-7B-instruct | 88.40 Spearman | 3 |
| 16 | Table Question Answering: bridging natural language and structured data — asking "what was Q3 revenue?" over a s… | WikiTableQuestions | GPT-4 | 75.3% accuracy | 3 |
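Two of the scores above are rank-based rather than accuracy-based: BEIR reports nDCG@10 and FB15k-237 reports MRR. A minimal sketch of both metrics (function names are illustrative, not taken from any benchmark toolkit):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def mrr(first_hit_ranks):
    """Mean reciprocal rank over queries; each rank is the 1-based position
    of the first correct answer, or 0 if no correct answer was retrieved."""
    return sum(1.0 / r for r in first_hit_ranks if r > 0) / len(first_hit_ranks)
```

A perfect ranking gives nDCG@k of 1.0; a query whose correct entity lands at position 2 contributes 0.5 to MRR.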
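The STS Benchmark number is a Spearman rank correlation between model similarity scores and human judgments. A self-contained sketch with a hand-rolled, tie-aware ranking (in practice `scipy.stats.spearmanr` is the usual choice):

```python
def rankdata(xs):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because it correlates ranks rather than raw values, Spearman rewards a model that orders sentence pairs correctly even if its similarity scale differs from the human one.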
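BRIO's 47.8% on CNN/DailyMail is ROUGE-1, a unigram-overlap F1 between the generated and reference summaries. A bare-bones sketch (whitespace tokenization and no stemming, unlike the official ROUGE package):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same pattern over bigrams and longest common subsequences, respectively.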
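The text-ranking and text-embedding rows both rest on the same primitive: embed the query and documents as dense vectors, then rank documents by cosine similarity. A minimal sketch with toy vectors standing in for real model embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    return sorted(range(len(doc_vecs)),
                  key=lambda i: cosine(query_vec, doc_vecs[i]),
                  reverse=True)
```

The ranking this produces is exactly what nDCG@10 then scores against graded relevance labels.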