Level 2: Pipelines (~30 min)

Caption + Search: Image Search Pipeline

How teaching machines to describe images unlocked the most practical approach to visual search — and why a two-stage pipeline often beats end-to-end alternatives.

A Decade of Teaching Machines to See and Speak

Image captioning — generating natural language descriptions of visual content — is one of the hardest problems in AI. It requires a model to perceive objects, understand their relationships, infer context, and produce grammatical, accurate text. The field progressed through four distinct generations, each solving a fundamental limitation of the last.

Understanding this evolution matters because the captioning model you choose for a search pipeline determines the ceiling of your search quality. A caption that misses "the dog is wearing a red bandana" means the query "dog with bandana" will never match.

Era I: Detection + Templates
2010–2013

Detect, Then Fill Templates

The earliest image captioning systems were two-stage: first, run object detectors (often pre-trained on ImageNet or PASCAL VOC) to identify nouns and attributes. Then, fill in sentence templates like "A [color] [object] is [action] in [scene]." Kulkarni et al. (2011) built one of the first complete systems, combining Felzenszwalb's deformable parts model for object detection with a Conditional Random Field to select words and a simple language model to smooth the output.

# Template-era captioning (conceptual)
objects = detect_objects(image)     # ["dog", "park", "ball"]
attributes = detect_attributes(image)  # ["golden", "sunny"]
scene = classify_scene(image)      # "outdoor"

caption = f"A {attributes[0]} {objects[0]} in a {attributes[1]} {scene}"
# "A golden dog in a sunny outdoor"  <- grammatically awkward, misses action

The captions were stilted and missed relationships (who is doing what to whom). But the insight was powerful: if you can name what's in an image, you can search for it with text. The entire Caption+Search paradigm descends from this observation.

Kulkarni, G. et al. (2013). BabyTalk: Understanding and Generating Image Descriptions. TPAMI, 35(12).

Era II: Neural Encoder-Decoder
2015

Show and Tell — The CNN+LSTM Revolution

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan at Google proposed a deceptively simple architecture: pass the image through a CNN (GoogLeNet/Inception) to get a single feature vector, then feed that vector as the initial hidden state of an LSTM decoder that generates a caption word by word. No templates. No hand-crafted rules. End-to-end training on image-caption pairs from MS-COCO.

"We present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image."

Vinyals, O. et al. (2015). Show and Tell: A Neural Image Caption Generator. CVPR.

Show and Tell won the 2015 MS-COCO Captioning Challenge. Its core architecture — visual encoder + language decoder — remains the blueprint that every captioning model follows, even GPT-4V. The captions were fluent but often generic: "a man is standing on a sidewalk" instead of "a firefighter directing traffic at a rain-soaked intersection."
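The encoder-decoder loop is easy to sketch. Below is a toy numpy version with random weights standing in for the trained CNN and LSTM (a real LSTM has gates; this uses a single tanh recurrence). The shape of the greedy decoding loop is the point, not the output:

```python
import numpy as np

# Toy sketch of the Show-and-Tell decoding loop. The image feature seeds
# the hidden state; the decoder then emits one word per step until <eos>.
# All weights are random placeholders, not a trained model.
rng = np.random.default_rng(0)
vocab = ["<bos>", "<eos>", "a", "dog", "park", "playing", "in"]
V, D = len(vocab), 16

W_img = rng.normal(size=(D, 32))   # projects CNN features -> hidden state
W_h = rng.normal(size=(D, D))      # toy recurrence (real model: LSTM gates)
W_e = rng.normal(size=(D, V))      # token embeddings
W_out = rng.normal(size=(V, D))    # hidden state -> vocabulary logits

def generate(image_features, max_len=10):
    h = np.tanh(W_img @ image_features)       # init hidden from the image
    token = vocab.index("<bos>")
    words = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_e[:, token])  # consume the previous token
        token = int(np.argmax(W_out @ h))     # greedy decode
        if vocab[token] == "<eos>":
            break
        words.append(vocab[token])
    return " ".join(words)

caption = generate(rng.normal(size=32))
```

In the real model the argmax is usually replaced by beam search, and training maximizes the log-likelihood of the reference caption given the image.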

2015

Show, Attend and Tell — Visual Attention

Kelvin Xu et al. at the Université de Montréal identified the bottleneck: compressing an entire image into a single vector loses spatial information. Their solution was attention — at each decoding step, the model learns to focus on different spatial regions of the image. When generating "dog," it attends to the dog region; when generating "frisbee," it shifts to the frisbee.

This was one of the earliest applications of attention mechanisms to vision (predating the Transformer by two years). It improved caption quality and, crucially, made the model interpretable — you could visualize exactly where the model was "looking" as it generated each word. The paper has 19,000+ citations.

Xu, K. et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML.
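Soft attention itself is only a few lines: score each region against the decoder state, softmax the scores, and take the weighted sum as the context vector. A toy numpy sketch with placeholder shapes:

```python
import numpy as np

def soft_attention(regions, hidden):
    """regions: (num_regions, feat_dim); hidden: (feat_dim,).
    Returns the context vector and the attention distribution."""
    scores = regions @ hidden               # alignment score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over regions
    context = weights @ regions             # weighted sum of region features
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 8))   # e.g. a 7x7 CNN grid, toy 8-dim features
hidden = rng.normal(size=8)          # current decoder state
context, weights = soft_attention(regions, hidden)
```

Visualizing `weights` over the image grid is exactly how the paper's attention maps were produced.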

2016–2019

Bottom-Up Attention & Object-Level Features

Anderson et al. (2018) from the University of Adelaide and Microsoft replaced the CNN grid features with bottom-up attention: first, use a Faster R-CNN object detector to propose salient image regions, then let the caption decoder attend over these object-level features instead of raw grid cells. This gave the model a vocabulary of "things" (detected objects with bounding boxes) rather than raw pixels.

Bottom-up features became the de facto standard. They powered the winning entries in VQA and captioning competitions for three consecutive years and were used in early versions of LXMERT, ViLBERT, and OSCAR — the first wave of vision-language pre-training.

Anderson, P. et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and VQA. CVPR. 7,000+ citations.

Era III: Vision-Language Pre-training
January 2021

CLIP — Contrastive Language-Image Pre-training

Alec Radford et al. at OpenAI trained a dual-encoder model on 400 million image-text pairs scraped from the internet. CLIP didn't generate captions — it learned a shared embedding space where images and their descriptions had high cosine similarity. This enabled zero-shot image classification and direct image-text retrieval without any captioning step.

CLIP's visual encoder (ViT) became the standard backbone for almost every subsequent vision-language model. Its contrastive pre-training objective demonstrated that scaling training data, not model size, was the ingredient the field had been missing.

Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. 25,000+ citations.
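The contrastive objective is compact enough to sketch. Below, random vectors stand in for encoder outputs; the symmetric cross-entropy over the in-batch similarity matrix is the structure CLIP trains with:

```python
import numpy as np

# Toy sketch of CLIP's symmetric contrastive loss: matched image/text
# pairs sit on the diagonal of the batch similarity matrix and should
# out-score every mismatched pair in both directions.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarities
    labels = np.arange(len(logits))         # diagonal = matched pairs

    def xent(l):
        p = np.exp(l - l.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # image->text and text->image directions, averaged
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
loss = clip_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
```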

January 2022

BLIP — Bootstrapping Language-Image Pre-training

Junnan Li et al. at Salesforce Research solved a critical data quality problem. Web-scraped image-text pairs are noisy — alt-text is often irrelevant or wrong. BLIP introduced a bootstrapping approach: train a captioner and a filter jointly, use the captioner to generate synthetic captions for web images, then use the filter to remove noisy pairs from both the original and synthetic data. Train again on the cleaned dataset.

BLIP unified three capabilities in one model: image-text contrastive learning (like CLIP), image-text matching, and image-conditioned text generation (captioning). This made it the first practical model for building Caption+Search pipelines — you could use a single checkpoint for both captioning and retrieval.

Li, J. et al. (2022). BLIP: Bootstrapping Language-Image Pre-training. ICML.
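The bootstrapping loop (CapFilt) can be sketched in a few lines. `captioner` and `itm_score` below are hypothetical stand-ins for BLIP's jointly trained captioner and image-text matching filter; only the control flow follows the paper:

```python
# Sketch of BLIP-style caption bootstrapping. For each web pair, generate
# a synthetic caption, then keep only the (image, caption) pairs -- original
# or synthetic -- that the filter scores above a threshold.
def capfilt(web_pairs, captioner, itm_score, threshold=0.5):
    """web_pairs: list of (image, alt_text). Returns a cleaned dataset."""
    cleaned = []
    for image, alt_text in web_pairs:
        synthetic = captioner(image)          # captioner proposes a new caption
        for caption in (alt_text, synthetic):
            if itm_score(image, caption) >= threshold:
                cleaned.append((image, caption))
    return cleaned

# Toy demo with stub models (purely illustrative scoring)
pairs = [("img1", "a dog in a park"), ("img2", "buy now!!!")]
fake_captioner = lambda img: f"photo of {img}"
fake_itm = lambda img, cap: 0.9 if ("dog" in cap or "photo" in cap) else 0.1
cleaned = capfilt(pairs, fake_captioner, fake_itm)
# the noisy alt-text "buy now!!!" is filtered out; its synthetic caption survives
```

The cleaned dataset is then used for another round of pre-training, which is what made web-scale noisy data usable.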

January 2023

BLIP-2 — The Q-Former Bridge

Junnan Li et al. introduced the Q-Former (Querying Transformer) — a lightweight module that acts as a bridge between a frozen image encoder (ViT) and a frozen LLM (OPT or Flan-T5). Instead of training the entire model end-to-end (billions of parameters), BLIP-2 only trained the Q-Former (~188M parameters) to extract the most language-relevant visual features.

# BLIP-2 architecture (conceptual)
image_features = frozen_vit(image)           # (257, 1408) -- frozen
query_output = q_former(learnable_queries, image_features)  # (32, 768) -- trained
caption = frozen_llm.generate(query_output)  # "a dog playing fetch in a park"

# Only Q-Former parameters are updated during training
# Total trainable: ~188M (vs ~13B for the full model)

This "bridge" architecture was the key insight: you could combine any vision encoder with any LLM without retraining either. BLIP-2 achieved state-of-the-art on captioning, VQA, and retrieval while using 54x fewer trainable parameters than Flamingo. It remains the most popular open-source model for production Caption+Search pipelines.

Li, J. et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML.

Era IV: LLM-Powered Vision
April 2023

LLaVA — Large Language and Vision Assistant

Haotian Liu et al. at the University of Wisconsin took a more direct approach: connect CLIP's visual encoder to LLaMA via a simple linear projection layer, then instruction-tune on GPT-4-generated visual conversation data. The result was the first open-source model that could hold detailed visual conversations — not just caption an image, but answer follow-up questions, describe specific regions, and reason about visual content.

For Caption+Search, LLaVA's advantage was prompt-controlled captioning. Instead of a fixed caption style, you could prompt: "Describe this image focusing on the colors and materials of clothing" — generating captions tuned to your specific search use case.

Liu, H. et al. (2023). Visual Instruction Tuning. NeurIPS.

September 2023

GPT-4V — Captioning Reaches Human Level

OpenAI's GPT-4 with Vision was the first model to produce captions that consistently matched or exceeded human-written descriptions in detail and accuracy. It could read text in images, understand charts and diagrams, identify fine-grained attributes ("a 1960s Eames lounge chair in walnut with black leather"), and reason about spatial relationships and implicit context.

For Caption+Search pipelines, GPT-4V represented a ceiling breakthrough: the captioning model was no longer the quality bottleneck. The limitation shifted to the embedding and retrieval stage. The trade-off was cost ($0.01–0.03 per image) and latency (2–5 seconds per image), making it impractical for real-time captioning but excellent for batch indexing high-value collections.

2024–present

The Open-Source Surge

The field has converged on a standard architecture: frozen ViT encoder + bridge module + LLM decoder. Open-source models now rival GPT-4V on most captioning benchmarks:

InternVL 2.5

Shanghai AI Lab. 78B parameters at the largest scale. Matches GPT-4V on most benchmarks. Open weights.

Qwen2-VL

Alibaba. Dynamic resolution. Strong OCR. 2B/7B/72B variants. Apache 2.0.

LLaVA-OneVision

Multi-image + video understanding. Single-image, multi-image, and video in one model.

Molmo

Allen AI. Trained on human-annotated captions (not synthetic). Highest caption faithfulness.

The throughline: 2011 → 2026

Four generations. One architecture, refined relentlessly:

2010–2013 · Templates: Detect objects, fill slots (Kulkarni, Farhadi)
2015–2019 · Encoder-Decoder: CNN visual features + LSTM/attention text generation (Vinyals, Xu, Anderson)
2022–2023 · Bridge Models: Frozen encoders + lightweight adapters (BLIP, BLIP-2)
2023–now · LLM-Powered: Visual tokens fed into instruction-tuned LLMs (LLaVA, GPT-4V, Qwen2-VL)

Every advance improved the same pipeline: see the image → describe it in words. Better captions mean better search, because the caption is the semantic interface between visual content and text retrieval.

The Image Search Problem

You have 10,000 product photos, medical scans, or surveillance frames. A user asks: "Find images of dogs playing in the rain." How do you search?

Traditional approaches require manual tagging — someone labels each image with keywords. This doesn't scale, it's subjective, and it can never anticipate every possible query. AI gives us two fundamentally different alternatives:

Manual Tagging

image_001.jpg: ["dog", "park", "playing"]
image_002.jpg: ["cat", "sleeping", "couch"]
# Misses: breed, weather, time of day,
# action details, spatial relations...

Does not scale. Misses unanticipated queries.

AI Captioning

image_001.jpg: "a golden retriever
  playing fetch in a rainy park"
image_002.jpg: "an orange tabby cat
  sleeping on a gray couch near a window"

Captures context, relations, attributes. Searchable.

Pipeline Architecture: Three Stages

The Caption+Search pipeline chains three building blocks. Each stage can be independently upgraded without touching the others — a key architectural advantage over monolithic approaches.

1

Caption (Image → Text)

A vision-language model generates a natural language description. This runs once per image at indexing time — the most expensive step, but amortized over all future queries.

Models: BLIP-2 (best value), GPT-4o (highest quality), Qwen2-VL (best open-source)

2

Embed (Text → Vector)

Convert each caption into a dense vector using a text embedding model. Store these vectors in an index (FAISS, Pinecone, pgvector). This step is fast — thousands of captions per second.

Models: BGE-M3 (multilingual), text-embedding-3-large (API), GTE-Qwen2 (long context)

3

Search (Query → Results)

Embed the user's text query with the same embedding model, then find the most similar caption vectors via approximate nearest neighbor search. Return the corresponding images. Sub-10ms latency at 1M+ images.

Indexes: FAISS (local), Pinecone (managed), Qdrant (hybrid), pgvector (Postgres-native)

Key Insight: The Semantic Bridge

By converting images to text first, you leverage the full power of text embedding models that have been trained on billions of sentence pairs and fine-tuned on retrieval benchmarks. A top MTEB model understands that "dogs playing" should match "a golden retriever chasing a ball" — semantic matching that CLIP's 512-dimensional joint space cannot capture with the same nuance. The caption is a semantic bridge between pixels and meaning.

Step 1: Image Captioning with BLIP-2

BLIP-2 is the workhorse for production captioning. It combines a frozen ViT-G encoder (1.3B params) with a lightweight Q-Former bridge (188M params) and an OPT or Flan-T5 language model. The result: high-quality captions at a fraction of GPT-4V's cost and latency.

Install

Python 3.10+
pip install transformers torch accelerate pillow

BLIP-2 Captioning

Local (GPU)
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load BLIP-2 with Flan-T5-XL (downloads ~15GB on first run)
processor = Blip2Processor.from_pretrained(
    "Salesforce/blip2-flan-t5-xl"
)
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",
    torch_dtype=torch.float16,
    device_map="auto"
)

def caption_image(image_path, prompt=None):
    """Generate a detailed caption for an image.

    Args:
        image_path: Path to image file
        prompt: Optional prompt to guide caption style.
                e.g. "Describe this image in detail, including
                colors, objects, and their spatial relationships."
    """
    image = Image.open(image_path).convert("RGB")

    if prompt:
        inputs = processor(image, text=prompt, return_tensors="pt").to(
            model.device, torch.float16
        )
    else:
        inputs = processor(image, return_tensors="pt").to(
            model.device, torch.float16
        )

    generated_ids = model.generate(**inputs, max_new_tokens=100)
    caption = processor.batch_decode(
        generated_ids, skip_special_tokens=True
    )[0].strip()

    return caption

# Basic caption
print(caption_image("photo.jpg"))
# "a golden retriever playing fetch in a park on a sunny day"

# Prompted caption (more detail for search)
print(caption_image("photo.jpg",
    prompt="Describe this image in detail:"
))
# "a young golden retriever with a red collar is catching
#  a yellow tennis ball mid-air in a green park. The sky is
#  clear blue and there are oak trees in the background."

GPT-4V Captioning (API)

Cloud API
import openai
import base64

client = openai.OpenAI()

def caption_with_gpt4v(image_path, detail="high"):
    """Caption using GPT-4V for maximum quality.

    Cost: ~$0.01-0.03 per image depending on resolution.
    Latency: ~2-5 seconds per image.
    Best for: high-value collections, batch indexing.
    """
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Describe this image in one detailed paragraph. "
                        "Include: objects, their attributes (color, size, "
                        "material), spatial relationships, actions, setting, "
                        "lighting, and mood. Be specific and factual."
                    )
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{b64_image}",
                        "detail": detail
                    }
                }
            ]
        }],
        max_tokens=300
    )

    return response.choices[0].message.content

# Example output:
# "A young golden retriever with a glossy coat and red nylon
#  collar leaps approximately two feet off the ground to catch
#  a bright yellow tennis ball. The dog is in a well-maintained
#  public park with freshly mowed grass. Behind it, mature oak
#  trees cast dappled shadows. The lighting suggests late
#  afternoon with warm, golden-hour tones. Two blurred figures
#  sit on a bench in the background."

Captioning Model Comparison (2026)

Model               | VRAM   | Speed | Caption Quality | Cost
BLIP-base           | ~2 GB  | ~0.1s | Basic           | Free (local)
BLIP-2 (Flan-T5-XL) | ~16 GB | ~0.5s | Good            | Free (local)
LLaVA-1.6 (34B)     | ~40 GB | ~2s   | Very good       | Free (local)
Qwen2-VL (7B)       | ~16 GB | ~0.8s | Very good       | Free (local)
GPT-4o              | API    | ~3s   | Best            | ~$0.01/img

BLIP-2 with Flan-T5-XL hits the best cost/quality ratio for most production use cases. Use GPT-4o when caption quality directly drives revenue (e-commerce, medical imaging).

Steps 2–3: Embed, Index, Search

Once you have captions, the remaining pipeline is pure text retrieval — a problem with well-established, battle-tested solutions. Here is the complete pipeline from captions to searchable index:

Additional Dependencies

Python
pip install sentence-transformers faiss-cpu

Complete Caption + Embed + Search Pipeline

Production-Ready
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import json
import os

# --- Configuration ---
EMBED_MODEL = "BAAI/bge-base-en-v1.5"  # 768-dim, strong retrieval
CAPTION_FILE = "captions.jsonl"          # Pre-generated captions
INDEX_FILE = "image_search.index"

# --- Step 2: Embed all captions ---
embed_model = SentenceTransformer(EMBED_MODEL)

# Load pre-generated captions (from Step 1)
# Each line: {"path": "images/001.jpg", "caption": "a dog..."}
entries = []
with open(CAPTION_FILE) as f:
    for line in f:
        entries.append(json.loads(line))

captions = [e["caption"] for e in entries]
paths = [e["path"] for e in entries]

# Encode all captions in one batch (GPU-accelerated)
print(f"Embedding {len(captions)} captions...")
embeddings = embed_model.encode(
    captions,
    normalize_embeddings=True,  # Required for cosine similarity
    show_progress_bar=True,
    batch_size=256
)
embeddings = embeddings.astype("float32")

# --- Step 3: Build FAISS index ---
dimension = embeddings.shape[1]  # 768 for bge-base
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine sim
index.add(embeddings)

# Persist to disk
faiss.write_index(index, INDEX_FILE)
print(f"Index built: {index.ntotal} vectors, {dimension} dimensions")

# --- Search function ---
def search_images(query: str, k: int = 5) -> list[dict]:
    """Semantic search over captioned images.

    Args:
        query: Natural language search query
        k: Number of results to return

    Returns:
        List of {path, caption, score} dicts, sorted by relevance
    """
    query_vec = embed_model.encode(
        [query], normalize_embeddings=True
    ).astype("float32")

    scores, indices = index.search(query_vec, k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            continue  # FAISS returns -1 for missing results
        results.append({
            "path": paths[idx],
            "caption": captions[idx],
            "score": round(float(score), 4)
        })
    return results

# --- Example ---
results = search_images("dog playing in the rain")
for r in results:
    print(f"  {r['score']:.3f}  {r['path']}")
    print(f"          {r['caption']}")
    print()
Example output:
  0.871  images/dog_rain_park.jpg
          a golden retriever catching a ball in a rainy park

  0.743  images/puppy_puddle.jpg
          a small puppy splashing through puddles on a wet sidewalk

  0.652  images/dogs_garden.jpg
          two dogs running through a garden with wet grass

  0.501  images/cat_window_rain.jpg
          an orange cat watching rain through a window
~0.5s   Caption per image (BLIP-2)
~3KB    Storage per image (embedding + caption)
<5ms    Search latency (100K images)

Architecture Comparison

Caption+Search vs CLIP: The Real Trade-offs

The most important architectural decision in image search. Both approaches work — but they fail in very different ways.

CLIP embeds images and text into the same 512-dimensional vector space, enabling direct image-to-text similarity without any captioning step. It is simpler, faster to index, and requires fewer moving parts. So why would anyone use the more complex Caption+Search pipeline?

Caption + Search

  • + Uses SOTA text embeddings (768–3072 dim, MTEB-optimized)
  • + Captions are human-readable (debugging, auditing, display)
  • + Supports hybrid search (BM25 + semantic on caption text)
  • + Each component upgradeable independently
  • - Slower indexing (caption generation bottleneck)
  • - Information loss: caption misses visual details the model doesn't describe

Best for: E-commerce catalogs, content management, medical imaging, any domain where you need explainability or complex multi-attribute queries.

CLIP Direct

  • + Simpler pipeline (one model, no intermediate text)
  • + Faster indexing (no caption generation step)
  • + Captures visual features that are hard to describe in words
  • + Works for abstract/artistic content, visual styles, textures
  • - 512-dim joint space compresses both modalities, losing nuance
  • - No intermediate representation for debugging or filtering

Best for: Quick prototypes, real-time indexing, style/aesthetic search, zero-shot classification, abstract or hard-to-describe visual content.

What Most Tutorials Get Wrong

The comparison is usually framed as "CLIP is simpler, Caption+Search is more powerful." The reality is more nuanced. There are queries where CLIP wins, and the difference comes from the nature of the query, not the overall system quality:

"red vintage car on a coastal road" → Caption+Search wins (multiple specific attributes)
"images with a melancholic mood" → CLIP wins (abstract visual quality)
"product photos with white background" → CLIP wins (visual style, not content)
"person wearing a blue scarf near a fountain" → Caption+Search wins (compositional, multi-object)

The best production systems use both: CLIP embeddings for visual similarity, caption embeddings for semantic queries, combined with reciprocal rank fusion. This is called a hybrid visual retrieval pipeline.
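Reciprocal rank fusion needs only the two ranked lists, not comparable scores, which is why it works across CLIP and caption retrievers whose similarity scales differ. A minimal sketch (the image ids are hypothetical):

```python
# Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)).
# Documents ranked high in multiple lists accumulate the largest scores;
# k=60 is the conventional damping constant.
def rrf(ranked_lists, k=60):
    """ranked_lists: lists of doc ids, best first. Returns fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

clip_results = ["img_07", "img_31", "img_12"]      # visual similarity order
caption_results = ["img_31", "img_04", "img_07"]   # semantic similarity order
fused = rrf([clip_results, caption_results])
print(fused[0])  # "img_31" -- it appears near the top of both lists
```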

CLIP Direct Image Search (for comparison)

CLIP ViT-B/32
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
import faiss
import numpy as np

# Load CLIP
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path):
    """Embed image directly into CLIP's joint space."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    vec = features.numpy().flatten()
    return vec / np.linalg.norm(vec)

def embed_text(text):
    """Embed text query into CLIP's joint space."""
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    vec = features.numpy().flatten()
    return vec / np.linalg.norm(vec)

# Index images (no captioning needed). `image_paths` is your list of
# image file paths, e.g. collected via os.listdir or glob.
image_vecs = np.array([embed_image(p) for p in image_paths])
index = faiss.IndexFlatIP(image_vecs.shape[1])  # 512-dim
index.add(image_vecs.astype("float32"))

# Search
query_vec = embed_text("dog playing in the rain")
D, I = index.search(query_vec.reshape(1, -1).astype("float32"), k=5)
for score, idx in zip(D[0], I[0]):
    print(f"  {score:.3f}  {image_paths[idx]}")

Decision Matrix

Requirement                     | Caption+Search | CLIP Direct | Hybrid
Complex multi-attribute queries | Best           | Good        | Best
Abstract / style queries        | Weak           | Best        | Best
Indexing speed                  | Slower         | Fast        | Slower
Explainability                  | Yes (captions) | No          | Yes
Hybrid text+keyword search      | Native         | No          | Native
Implementation complexity       | Medium         | Low         | High

Production: Scaling to Millions of Images

The pipeline above works for prototypes. At scale, three things change: batch processing, approximate nearest neighbors, and caption quality monitoring.

Production Pipeline with SQLite + FAISS-IVF

Scales to 10M+ images
import sqlite3
import faiss
import os

# Reuses `embeddings`, `embed_model`, and `caption_image`
# from the pipeline above.

# --- Metadata store ---
conn = sqlite3.connect("image_search.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        path TEXT UNIQUE NOT NULL,
        caption TEXT NOT NULL,
        model_version TEXT DEFAULT 'blip2-flan-t5-xl',
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

# --- IVF index for fast ANN search at scale ---
# IndexFlatIP is exact but O(n) per query.
# IndexIVFFlat partitions the space into clusters:
# search only the nearest nprobe clusters.
dimension = 768
nlist = 1024          # number of Voronoi cells
quantizer = faiss.IndexFlatIP(dimension)
index = faiss.IndexIVFFlat(
    quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT
)

# Must train IVF on representative sample before adding vectors
sample = embeddings[:min(len(embeddings), 50000)]
index.train(sample)
index.add(embeddings)
index.nprobe = 32     # search 32 of 1024 clusters (tunable)

# --- Batch captioning with progress ---
from tqdm import tqdm

def batch_caption_and_index(image_dir, batch_size=32):
    """Caption and index all images in a directory."""
    paths = [
        os.path.join(image_dir, f)
        for f in os.listdir(image_dir)
        if f.lower().endswith(('.jpg', '.jpeg', '.png', '.webp'))
    ]

    for i in tqdm(range(0, len(paths), batch_size)):
        batch_paths = paths[i:i + batch_size]
        batch_captions = [caption_image(p) for p in batch_paths]

        # Embed batch
        batch_embeds = embed_model.encode(
            batch_captions, normalize_embeddings=True
        ).astype("float32")

        # Add to FAISS
        index.add(batch_embeds)

        # Store metadata
        for path, caption in zip(batch_paths, batch_captions):
            conn.execute(
                "INSERT OR IGNORE INTO images (path, caption) "
                "VALUES (?, ?)",
                (path, caption)
            )
        conn.commit()

    # Persist
    faiss.write_index(index, "image_search_ivf.index")

Caption Quality Is Your Ceiling

The most common failure mode in Caption+Search is not the embedding model or the vector index — it's the caption quality. A caption that says "a person standing outside" when the image shows "a firefighter directing traffic at a rain-soaked intersection" means every query about firefighters, traffic, intersections, or rain will miss this image. Invest in the best captioning model your budget allows, and audit a random sample of captions monthly to catch quality regressions.
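A monthly audit can be as simple as pulling a random sample from the metadata store and eyeballing the captions against the source images. A sketch assuming the SQLite `images` table from the production pipeline above:

```python
import sqlite3

# Pull a random sample of (path, caption) rows for manual review.
# SQLite's ORDER BY RANDOM() does the sampling server-side.
def sample_captions_for_audit(db_path, n=50):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT path, caption FROM images ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()
    conn.close()
    return rows

# Example review loop (paths/db name follow the pipeline above):
# for path, caption in sample_captions_for_audit("image_search.db", n=25):
#     print(f"{path}\n  -> {caption}\n")
```

Storing `model_version` alongside each caption (as the schema above does) also lets you re-caption only the rows produced by an older, weaker model.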

Key Papers

The papers that defined the field, in chronological order. Reading the first three gives you the conceptual foundation; the rest are for going deeper.

Show and Tell: A Neural Image Caption Generator

Vinyals, O. et al. (2015). CVPR. arxiv.org/abs/1411.4555

Established the CNN encoder + LSTM decoder paradigm. 12,000+ citations.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Xu, K. et al. (2015). ICML. arxiv.org/abs/1502.03044

Introduced visual attention to captioning. Made the model interpretable. 19,000+ citations.

Learning Transferable Visual Models From Natural Language Supervision (CLIP)

Radford, A. et al. (2021). ICML. arxiv.org/abs/2103.00020

Contrastive pre-training on 400M image-text pairs. Made zero-shot visual search practical. 25,000+ citations.

BLIP: Bootstrapping Language-Image Pre-training

Li, J. et al. (2022). ICML. arxiv.org/abs/2201.12086

Unified captioning, retrieval, and matching. Bootstrap filtering for data quality.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and LLMs

Li, J. et al. (2023). ICML. arxiv.org/abs/2301.12597

Q-Former bridge architecture. 54x fewer trainable params than Flamingo. Production workhorse.

Visual Instruction Tuning (LLaVA)

Liu, H. et al. (2023). NeurIPS. arxiv.org/abs/2304.08485

Connected CLIP ViT to LLaMA via linear projection. Enabled prompt-controlled captioning.

Bottom-Up and Top-Down Attention for Image Captioning and VQA

Anderson, P. et al. (2018). CVPR. arxiv.org/abs/1707.07998

Object-level attention features. Dominated captioning and VQA benchmarks for 3 years. 7,000+ citations.

Key Takeaways

  1. Caption+Search is a two-stage pipeline — a vision-language model generates text descriptions, then text embedding models enable semantic search over those descriptions. Each stage is independently upgradeable.

  2. Caption quality is your search ceiling — the best embedding model in the world cannot find information that was never captured in the caption. Invest in the best captioning model your budget allows.

  3. CLIP and Caption+Search are complementary, not competing — CLIP excels at visual similarity and abstract queries; Caption+Search excels at compositional, multi-attribute, and keyword-filterable queries. Production systems use both.

  4. Captioning has reached human level — the field progressed from template filling (2011) to GPT-4V-class descriptions (2023) in just twelve years. The bottleneck has shifted from caption quality to retrieval quality.

Practice Exercise

Build your own image search system and compare approaches:

  1. Collect 50+ images in a folder. Photos from your phone, a product catalog, or stock images all work.
  2. Caption all images with BLIP-2. Save captions to a JSONL file. Read through 10–20 captions manually — are they accurate? What details are missing?
  3. Build the full pipeline (embed + FAISS index). Test 10 diverse queries. Record which results are relevant and which are not.
  4. Build a CLIP index over the same images. Run the same 10 queries. Compare: where does Caption+Search win? Where does CLIP win?
  5. Bonus: Try prompted captioning with BLIP-2. Does "Describe this image in detail including colors, materials, and spatial relationships" produce better search results than unprompted captions?
