
Cross-Modal Retrieval

Cross-modal retrieval finds the best match between items in different modalities — given text, find the right image; given an image, find the right caption. CLIP (2021) revolutionized the field by learning a shared embedding space from 400M image-text pairs, spawning an entire ecosystem of models like SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.


Cross-modal retrieval finds relevant items across different modalities — retrieving images from text queries, videos from audio, or any modality from any other. This is the foundation of multimodal search, powering image search engines, content recommendation, and accessibility tools that bridge modality gaps.

History

2013

DeViSE (Google) projects visual and semantic embeddings into a shared space for zero-shot visual recognition

2017

VSE++ introduces hard-negative mining for visual-semantic embedding learning, significantly improving image-text retrieval

2021

CLIP (OpenAI) trains a contrastive vision-language model on 400M web-scraped image-text pairs, revolutionizing zero-shot retrieval

2022

BLIP (Salesforce) combines contrastive, matching, and generation objectives for unified vision-language understanding and retrieval

2023

SigLIP replaces CLIP's softmax loss with sigmoid pairwise loss, improving training efficiency and retrieval accuracy

2023

ImageBind (Meta) extends contrastive learning to 6 modalities: images, text, audio, depth, thermal, and IMU

2024

Nomic Embed Vision and Jina CLIP v2 advance open text-image retrieval, with Jina CLIP v2 adding multilingual support and strong scores on MTEB multimodal benchmarks

2024

ColPali introduces late-interaction retrieval over document page images, bypassing OCR pipelines entirely

2025

Unified embedding models in the ONE-PEACE and LanguageBind line mature, supporting retrieval across text, image, audio, video, and depth in a single model

How Cross-Modal Retrieval Works

Cross-Modal Retrieval Pipeline
1

Modality-specific Encoding

Each input modality is processed by its own encoder — ViT for images, text transformer for queries, audio encoder for sound. Each encoder produces a fixed-size embedding vector.

2

Shared Embedding Space

Modality-specific embeddings are projected into a shared vector space via learned projection heads. Contrastive learning (InfoNCE, sigmoid loss) trains the model so that semantically matching pairs (e.g., an image and its caption) are close together while non-matching pairs are far apart.
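A minimal NumPy sketch of this contrastive objective can make it concrete. This is a toy version of CLIP-style symmetric InfoNCE with illustrative variable names and a fixed temperature, not CLIP's actual training code:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Put embeddings on the unit sphere so dot product = cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, text) pairs.

    img_emb, txt_emb: (batch, dim) outputs of the projection heads.
    Row i of each matrix is assumed to be a matching pair; every other
    row in the batch serves as an in-batch negative.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    labels = np.arange(len(logits))          # true pairs sit on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)            # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Cross-entropy in both directions: image->text and text->image.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch, dim = 8, 64
img = rng.normal(size=(batch, dim))
loss_random = info_nce_loss(img, rng.normal(size=(batch, dim)))
loss_aligned = info_nce_loss(img, img + 0.01 * rng.normal(size=(batch, dim)))
# Aligned pairs should yield a much lower loss than random pairings.
```

In a real model the temperature is a learned parameter and the negatives come from large batches (CLIP used ~32k), which is a key driver of embedding quality.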

3

Similarity Search

At inference time, the query (e.g., text) is encoded and nearest neighbors are found in the embedding space using cosine similarity or dot product. Approximate nearest neighbor (ANN) indices (FAISS, ScaNN, HNSW) enable millisecond search over billions of vectors.
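The search step reduces to a top-k scan over normalized vectors. A brute-force NumPy sketch (with made-up toy data) shows the logic; at billion scale an ANN index such as FAISS, ScaNN, or HNSW replaces the exhaustive matrix product:

```python
import numpy as np

def build_index(embeddings):
    """The 'index' here is just the L2-normalized matrix; an ANN library
    would build a quantized or graph-based structure instead."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index, query, k=3):
    """Return (ids, scores) of the k most cosine-similar items."""
    q = query / np.linalg.norm(query)
    scores = index @ q                     # cosine similarity to every item
    top = np.argsort(-scores)[:k]          # exact top-k; ANN approximates this
    return top, scores[top]

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 64))               # e.g. image embeddings
index = build_index(corpus)
query = corpus[42] + 0.05 * rng.normal(size=64)    # text embedding near item 42
ids, scores = search(index, query, k=3)            # ids[0] should be 42
```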

4

Re-ranking (optional)

Top-k retrieved candidates may be re-ranked by a more expensive cross-encoder (late-interaction or full cross-attention model) for improved precision.
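The late-interaction variant of re-ranking can be sketched as ColBERT/ColPali-style MaxSim scoring. This is a schematic NumPy version with toy token embeddings, not either system's implementation:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best-matching
    document token, then sum. This preserves token-level detail that a
    single pooled vector loses."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()       # MaxSim over doc tokens, sum over query

def rerank(query_tokens, candidates):
    """Re-rank top-k candidates (each a token-embedding matrix) by MaxSim."""
    scores = [maxsim_score(query_tokens, c) for c in candidates]
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(2)
q = rng.normal(size=(4, 32))
good = np.vstack([q, rng.normal(size=(10, 32))])   # contains the query tokens
bad = rng.normal(size=(14, 32))                    # unrelated tokens only
order = rerank(q, [bad, good])                     # 'good' should rank first
```

Because document token embeddings can be precomputed and indexed, MaxSim re-ranking is far cheaper than full cross-attention while recovering much of its precision.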

Current Landscape

Cross-modal retrieval in 2025 is built on the CLIP/SigLIP foundation — contrastive vision-language pretraining remains the dominant paradigm, with SigLIP's sigmoid loss increasingly preferred over CLIP's softmax because it avoids batch-wide normalization and scales more gracefully. The field has expanded from image-text to arbitrary modality pairs via models like ImageBind and ONE-PEACE. Document retrieval has been transformed by ColPali, which retrieves document pages as images rather than as extracted text. Production search systems (Google Lens, Pinterest, Spotify) rely on variants of these contrastive embedding models. The key trend is late-interaction retrieval (ColBERT-style), which preserves token-level information for finer-grained matching while remaining scalable.
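The sigmoid-versus-softmax distinction is easy to see in code. A toy NumPy sketch of a SigLIP-style pairwise loss follows; the temperature and bias are learned parameters in the real model and are fixed here purely for illustration:

```python
import numpy as np

def sigmoid_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """SigLIP-style sketch: every (image, text) pair in the batch is an
    independent binary classification (match vs. no match), so there is
    no batch-wide softmax normalization."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * (img @ txt.T) + bias
    labels = 2 * np.eye(len(img_emb)) - 1       # +1 on diagonal, -1 elsewhere
    x = labels * logits
    return np.mean(np.log1p(np.exp(-x)))        # mean -log sigmoid(labels*logits)

rng = np.random.default_rng(3)
img = rng.normal(size=(8, 64))
loss_aligned = sigmoid_pairwise_loss(img, img + 0.01 * rng.normal(size=(8, 64)))
loss_random = sigmoid_pairwise_loss(img, rng.normal(size=(8, 64)))
# Matched pairs should produce a lower loss than random pairings.
```

Because each pair is scored independently, the loss decomposes across devices without gathering the full similarity matrix, which is the practical reason it trains more efficiently at large batch sizes.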

Key Challenges

Fine-grained discrimination — retrieving 'the red car in the parking lot' vs. 'the blue car in the parking lot' requires attribute-level embedding precision

Compositionality — queries like 'a horse riding an astronaut' (vs. the reverse) test whether embeddings capture relational structure, not just bag-of-concepts

Domain shift — CLIP-style models trained on web data underperform on specialized domains (medical imaging, satellite imagery, scientific figures) without fine-tuning

Scalability — maintaining sub-100ms retrieval latency over billion-scale indices while supporting multiple modalities is an engineering challenge

Asymmetric retrieval — text-to-image and image-to-text are not equally hard: a typical caption describes only a fraction of an image's content, so a text query underspecifies its target while an image query must match captions covering only part of the scene, and the two directions need separate tuning and evaluation

Quick Recommendations

Best general text-image retrieval

SigLIP-SO400M

Strong zero-shot retrieval accuracy, efficient ViT-SO400M backbone, and widely adopted as the vision encoder in VLMs

Best for document retrieval

ColPali (ColQwen2)

Retrieves documents by visual appearance — no OCR needed; strong on visually rich documents, tables, and infographics

Best multi-modal (6+ modalities)

ImageBind (Meta)

Binds six modalities — images, text, audio, depth, thermal, and IMU — into a single embedding space, enabling retrieval between any pair

Best for production search

Jina CLIP v2

Optimized for production deployment with multilingual support, efficient inference, and strong MTEB retrieval scores

Best for video-text retrieval

InternVideo2.5

State-of-the-art video-text retrieval on MSR-VTT, DiDeMo, and ActivityNet Captions benchmarks

What's Next

Expect unified embedding models that handle any-to-any retrieval across text, images, video, audio, 3D, and code in a single model. Generative retrieval — where a model directly generates document identifiers instead of computing similarity — may replace dense retrieval for certain use cases. Personal multimodal search (searching your own photos, videos, and documents with natural language) will become a killer app as on-device embedding models improve.


Related Tasks

Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.

Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

Image-Text-to-Text

Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF reduces hallucination about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

Image-Text-to-Video

Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.
