Multimodal · visual-document-retrieval

Cross-Modal Retrieval.

Cross-modal retrieval finds the best match between items in different modalities: given text, find the right image; given an image, find the right caption. CLIP (2021) reshaped the field by learning a shared embedding space from 400M image-text pairs, and it spawned an ecosystem of models such as SigLIP, EVA-CLIP, and OpenCLIP that power applications ranging from search engines to generative model guidance. The challenge has since shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment still exposes failures on long-tail queries and compositional descriptions.
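
To make the shared-embedding idea concrete, here is a minimal sketch of text-to-image retrieval scored with Recall@K. It assumes text and image embeddings have already been produced by some dual encoder (a CLIP-style model, for instance); the toy vectors and the helper names `cosine`, `rank_images`, and `recall_at_k` are hypothetical and exist only for illustration. It also assumes the text at index i is paired with the image at index i.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_vec, image_vecs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine(text_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_vecs, image_vecs, k):
    """Fraction of text queries whose paired image appears in the top k."""
    hits = 0
    for i, text_vec in enumerate(text_vecs):
        if i in rank_images(text_vec, image_vecs)[:k]:
            hits += 1
    return hits / len(text_vecs)

# Toy embeddings (assumed, not from any real model).
image_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_vecs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]

print(recall_at_k(text_vecs, image_vecs, 1))
print(recall_at_k(text_vecs, image_vecs, 2))
```

The second query sits almost halfway between two images, so its paired image only appears at rank 2. That is a small-scale version of the fine-grained discrimination problem described above: Recall@1 penalizes it, while Recall@2 does not.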

Datasets: 1 · Results: 0 · Canonical metric: ndcg-at-5
§ 02 · Canonical benchmark

The reference dataset.

ViDoRe

Visual document retrieval benchmark for page-level document search

Primary metric: ndcg-at-5
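
For readers unfamiliar with the metric, the following is a rough sketch of how nDCG@5 can be computed for a single query. It assumes linear gain (the relevance grade itself) with a log2 rank discount, and it assumes the input list holds the relevance of every candidate page in ranked order, so the ideal ordering can be derived by sorting. The function names are hypothetical, and a given benchmark's official evaluation code may differ in gain function or tie handling.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant page, retrieved at rank 2 (assumed toy ranking).
print(ndcg_at_k([0, 1, 0, 0, 0]))
```

With a single relevant page, the score depends only on where that page lands: rank 1 gives 1.0, rank 2 gives 1/log2(3), and any rank below 5 gives 0. A leaderboard score would typically be the mean of this value over all queries.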
§ 03 · Top 10

Leading models.

Leading models on ViDoRe.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

ViDoRe (canonical) · 0 results · ndcg-at-5
§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Audio-Text-to-Text · Image Captioning · Image-Text-to-Image · Image-Text-to-Text · Image-Text-to-Video · Text-to-Image Generation · Video Understanding