Retrieval
Retrieval is the process of fetching relevant information from a vast knowledge base or database to answer a user's query or enhance a model's response, most notably seen in Retrieval-Augmented Generation (RAG) systems. RAG combines traditional search capabilities with large language models (LLMs) to ensure accuracy, provide up-to-date information, and ground AI responses in factual, external data rather than relying solely on a model's internal, potentially outdated knowledge.
Retrieval is a core task across text, code, and image domains. Below are the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
BEIR
BEIR — Benchmarking-IR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
BEIR (Benchmarking-IR) is a heterogeneous, zero-shot information retrieval benchmark that consolidates 18 publicly available datasets from diverse retrieval tasks and domains (e.g., fact-checking, question-answering, biomedical IR, news retrieval, argument retrieval, duplicate question retrieval, citation prediction, tweets). It provides a common evaluation framework for IR models (lexical, sparse, dense, late-interaction, re-ranking) and is commonly reported using metrics such as nDCG@10 (averaged across datasets), MRR, and recall. The BEIR code and data are available from the project GitHub and the Hugging Face dataset hub.
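BEIR's headline metric, nDCG@10, can be computed per query as below. This is a self-contained sketch of the exponential-gain formulation; BEIR's official harness delegates evaluation to pytrec_eval, whose exact gain convention may differ:

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: doc ids in system ranking order.
    qrels: dict doc_id -> graded relevance (missing ids count as 0).
    Uses the exponential gain (2**rel - 1) / log2(rank + 1).
    """
    gains = [(2 ** qrels.get(d, 0) - 1) / math.log2(i + 2)
             for i, d in enumerate(ranked_ids[:k])]
    # Ideal DCG: relevance grades sorted best-first.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    igains = [(2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal)]
    denom = sum(igains)
    return sum(gains) / denom if denom > 0 else 0.0
```

The benchmark-level number is then the mean of this value over all queries of a dataset, and the BEIR average is the mean over its datasets.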
No results tracked yet
StackOverflow-QA (StackQA)
StackOverflow-QA (StackQA)
StackOverflow-QA (aka StackQA) is a retrieval benchmark constructed from Stack Overflow question/answer posts where both queries and candidate documents can contain long mixed content of natural language and code. It is provided in a retrieval format (queries, corpus, qrels/scores) and intended for code+text information retrieval evaluations (e.g., dense single-vector retrieval). The Hugging Face mirror (mteb/stackoverflow-qa) shows splits and typical fields (query-id, corpus-id, score) and sizes: ~15.9k default rows (train: ~14k, test: ~1.99k) with corpus/queries subsets (~19.9k). This dataset has been used in recent code-IR benchmarks (e.g., CoIR) and evaluated with metrics such as nDCG@10 for single-vector retrieval.
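The flat (query-id, corpus-id, score) rows described above are typically regrouped into the nested qrels mapping that IR evaluators expect. The rows below are made-up toy data that only mimic the field names on the Hugging Face card:

```python
# Hypothetical rows mimicking the (query-id, corpus-id, score) fields of
# mteb/stackoverflow-qa; the real split would be loaded with a dataset
# library rather than written out by hand.
rows = [
    {"query-id": "q1", "corpus-id": "d10", "score": 1},
    {"query-id": "q1", "corpus-id": "d11", "score": 0},
    {"query-id": "q2", "corpus-id": "d10", "score": 1},
]

def build_qrels(rows):
    """Group flat relevance triples into {query_id: {doc_id: score}}."""
    qrels = {}
    for r in rows:
        qrels.setdefault(r["query-id"], {})[r["corpus-id"]] = r["score"]
    return qrels

qrels = build_qrels(rows)
# → {"q1": {"d10": 1, "d11": 0}, "q2": {"d10": 1}}
```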
No results tracked yet
MLDR (English subset)
MLDR (Multilingual Long-Document Retrieval) — English subset
MLDR (Multilingual Long-Document Retrieval) is a long-document retrieval benchmark intended for evaluating embedding and retrieval models on lengthy texts. The dataset samples lengthy articles from Wikipedia, Wudao and mC4 across 13 typologically diverse languages, then randomly selects paragraphs and uses GPT-3.5 to generate questions based on those paragraphs; each generated question paired with its sampled article forms a retrieval example. The full multilingual release contains on the order of 200,000 long documents; papers and implementations that cite MLDR sometimes evaluate an English-only subset (the “English subset”) for in-domain (fine-tuned) and out-of-domain (no finetuning) retrieval, reporting metrics such as nDCG@10. Source: Hugging Face dataset page (Shitao/MLDR) and related project docs (e.g., BGE evaluation docs, third-party benchmarks).
No results tracked yet
CodeSearchNet (CSN)
CodeSearchNet Corpus
CodeSearchNet (CodeSearchNet Corpus) is a benchmark and large corpus for semantic code search. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). For roughly 2 million functions the dataset includes automatically generated, query-like natural language derived from function documentation (docstrings). The CodeSearchNet Challenge portion provides a manually annotated evaluation set consisting of 99 natural language queries with ~4k expert relevance annotations to measure retrieval performance. The dataset is commonly used for code-to-text retrieval / semantic code search and supports evaluation with IR metrics such as MRR and nDCG.
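MRR, the metric most often reported on CodeSearchNet, scores each query by the reciprocal rank of its first relevant result. A minimal sketch, with illustrative function and argument names:

```python
def mrr(rankings, relevant):
    """Mean Reciprocal Rank.

    rankings: {query_id: [doc_id, ...]} in system ranking order.
    relevant: {query_id: set of relevant doc ids}.
    A query whose relevant docs are never retrieved contributes 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0
```

For example, a query whose first relevant document appears at rank 2 contributes 0.5 to the mean.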
No results tracked yet
INRIA Copydays (strong subset)
INRIA CopyDays
INRIA CopyDays (Copydays) is a standard benchmark dataset from INRIA (H. Jégou and collaborators) for image copy-detection / near-duplicate image retrieval. The dataset contains original (unmodified) images and corresponding transformed copies produced with various image distortions; it is provided with named subsets, including a "strong" subset of heavily modified copies (examples of strong modifications: cropping, rotation, compression, large photometric/geometric changes). INRIA CopyDays is widely used to evaluate the robustness of image-retrieval and copy-detection systems; many works evaluate on the CopyDays strong subset and commonly augment the evaluation by adding distractors from large web collections such as YFCC100M (e.g., reporting results on the strong subset with 10k YFCC100M distractors). Sources: INRIA dataset page for Jégou's datasets (INRIA Holidays / CopyDays) and the Hugging Face dataset entry (randall-lab/INRIA-CopyDays).
No results tracked yet
Revisited Paris (R_Par) — Medium split
Revisited Paris (RParis / R_Par / RParis6k) — Medium split
Revisited Paris (often written RParis or R_Par) is the "revisited"/re-annotated version of the Paris 6k landmark image retrieval dataset introduced by Radenović et al. (CVPR 2018 / arXiv:1803.11285). The revisited benchmark fixes annotation errors, adds 15 new challenging queries to the original 55 (70 queries in total), provides per-query bounding boxes and reliable ground-truth files (e.g., gnd_rparis6k.mat), and defines three evaluation protocols of increasing difficulty (Easy, Medium, Hard). The "Medium" split refers to the Medium-difficulty evaluation protocol defined in the paper and is the setting most commonly used for reporting mAP in image retrieval evaluations. The dataset is widely used in instance/landmark image retrieval research and is available for download along with the revisited annotations. As one example of reported usage, the DINO paper (arXiv:2104.14294) reports Medium-split mAP using models pretrained on Google Landmarks v2 (GLDv2).
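The mAP figure reported for the Medium protocol averages per-query average precision over all queries. Below is a plain-AP sketch with illustrative names; note that the revisited protocol additionally ignores per-query "junk" images before scoring, which this simplified version does not handle:

```python
def average_precision(ranked_ids, positives):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked_ids, start=1):
        if doc in positives:
            hits += 1
            score += hits / k
    return score / len(positives) if positives else 0.0

def mean_average_precision(rankings, positives_by_query):
    """mAP over {query_id: ranked doc ids} and {query_id: positive ids}."""
    aps = [average_precision(ranked, positives_by_query.get(q, set()))
           for q, ranked in rankings.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

For instance, a ranking that places the two positives at ranks 1 and 3 gets AP = (1/1 + 2/3) / 2 ≈ 0.833.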
No results tracked yet
AmsterTime
AmsterTime: A Visual Place Recognition Benchmark Dataset for Severe Domain Shift
AmsterTime is a visual place recognition (VPR) benchmark designed to evaluate retrieval and verification under severe domain shift (temporal, viewpoint and camera changes). The dataset contains ~2,500 carefully curated image pairs that match the same scene in Amsterdam: historical archival images (1200+ license-free images from the Amsterdam City Archive) paired with contemporary street-level images sourced from Mapillary. Matches were human-verified and the benchmark supports verification and retrieval evaluations (mean Average Precision (mAP) reported for retrieval). The authors evaluate non-learning, supervised and self-supervised baselines (e.g., ResNet-101 pre-trained on Landmarks) and provide extracted feature sets and dataset releases via a data repository (4TU ResearchData) and a GitHub repo.
No results tracked yet
Related Tasks
General
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Some examples of the first omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.