Image Captioning
Generate natural language descriptions of image content. Enables text-based search over visual content.
How Vision Language Models Work
A technical deep-dive into vision-language models. From image captioning to multimodal reasoning with LLaVA, GPT-4V, and beyond.
What VLMs Can Do
Vision-Language Models understand images and generate text. One model, many tasks.
- Image Captioning: describe image content
- VQA: answer questions about images
- OCR/Document: extract text from images
- Visual Reasoning: complex inference about images
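All of these tasks can be driven from the same checkpoint just by changing the prompt. Below is a minimal sketch using the Ollama Python client (also used in the examples further down); it assumes a local Ollama server with a LLaVA model pulled (`ollama pull llava`) and a local image file `photo.jpg`, both of which are illustrative.

```python
import ollama

# One local VLM, four tasks: only the text prompt changes.
task_prompts = {
    'captioning': 'Describe this image in one sentence.',
    'vqa': 'How many people are in this image?',
    'ocr': 'Transcribe any text visible in this image.',
    'reasoning': 'What is likely to happen next in this scene?',
}

for task, prompt in task_prompts.items():
    response = ollama.chat(
        model='llava',  # assumes `ollama pull llava` has been run
        messages=[{'role': 'user', 'content': prompt, 'images': ['photo.jpg']}],
    )
    print(f"{task}: {response['message']['content']}")
```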
How VLMs Process Images
The image is split into patches and encoded by a vision encoder (typically a ViT); those features are then mapped into the language model's embedding space so the LLM can attend to them alongside ordinary text tokens.
VLM Evolution
From CLIP to GPT-4V to open-source alternatives like Qwen2-VL.
VLM Architectures
How to connect vision and language.
- CLIP-style (Contrastive): separate image and text encoders, contrastive loss (sketched below)
- LLaVA-style (Projector): vision encoder + linear projection + LLM
- Qwen-VL (Native): vision tokens processed directly in the transformer
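To make the contrastive row concrete, here is a small sketch of CLIP-style scoring with Hugging Face Transformers: the two encoders embed image and text separately, and their similarity ranks candidate captions. This gives zero-shot retrieval and classification but no free-form generation. The checkpoint name, image path, and captions are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Two separate encoders trained with a contrastive loss
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('image.jpg')
captions = ['a photo of a dog', 'a photo of a cat', 'a city street at night']

inputs = processor(text=captions, images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into a distribution over the captions
probs = outputs.logits_per_image.softmax(dim=1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f'{p:.2f}  {caption}')
```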
LLaVA Architecture (Most Popular)
Simple but effective: freeze the vision encoder, train the projector + LLM.
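A minimal sketch of that recipe, with illustrative dimensions (1024-dim CLIP ViT-L/14 patch features projected into a 4096-dim LLM embedding space); LLaVA-1.5 uses a two-layer MLP projector, while the original LLaVA used a single linear layer.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM's token space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):       # (batch, num_patches, vision_dim)
        return self.mlp(patch_features)      # (batch, num_patches, llm_dim)

# Illustrative shapes: 576 patches (24x24) from the vision encoder, 32 text tokens
patch_features = torch.randn(1, 576, 1024)   # frozen vision encoder output
text_embeddings = torch.randn(1, 32, 4096)   # embedded text tokens from the LLM
image_tokens = VisionProjector()(patch_features)

# The LLM attends over the concatenated image-and-text token sequence
llm_input = torch.cat([image_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 608, 4096])
```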
VLM Benchmarks
How to evaluate multimodal understanding; a toy scoring loop is sketched after the table.
| Model | MMBench | SEED-Bench | MME | Type |
|---|---|---|---|---|
| GPT-4o | 83.4 | 77.1 | 2070 | Proprietary |
| Gemini 1.5 Pro | 80.6 | 75.8 | 2015 | Proprietary |
| Qwen2-VL-72B | 82.0 | 76.5 | 2055 | Open |
| InternVL2-76B | 81.2 | 75.4 | 2000 | Open |
| LLaVA 1.5-13B | 68.2 | 63.0 | 1570 | Open |
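The benchmarks above ship their own harnesses and answer-matching rules, so the snippet below is only a toy illustration of the general shape of a VQA-style evaluation loop. It assumes the `model` and `processor` loaded in the quickstart example below, plus a hypothetical list of (image path, question, answer) triples.

```python
from PIL import Image

# Hypothetical evaluation triples; real benchmarks provide these
eval_set = [
    ('img1.jpg', 'What color is the car?', 'red'),
    ('img2.jpg', 'How many dogs are there?', '2'),
]

correct = 0
for path, question, answer in eval_set:
    conversation = [{'role': 'user', 'content': [
        {'type': 'image'},
        {'type': 'text', 'text': question},
    ]}]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=Image.open(path), text=prompt, return_tensors='pt').to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)
    # Decode only the newly generated tokens, not the prompt
    prediction = processor.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    correct += int(answer.lower() in prediction.lower())

print(f'Accuracy: {correct / len(eval_set):.2%}')
```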
Code Examples
Get started with VLMs in Python.
```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
# Load LLaVA model
model = LlavaForConditionalGeneration.from_pretrained(
'llava-hf/llava-1.5-7b-hf',
torch_dtype=torch.float16,
device_map='auto'
)
processor = AutoProcessor.from_pretrained('llava-hf/llava-1.5-7b-hf')
# Load image
image = Image.open('image.jpg')
# Create conversation
conversation = [
{
'role': 'user',
'content': [
{'type': 'image'},
{'type': 'text', 'text': 'What is happening in this image?'}
]
}
]
# Process and generate
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output[0], skip_special_tokens=True)
```

Quick Reference
- GPT-4o / Claude 3.5
- Gemini 2.0 Flash
- Qwen2-VL-72B
- InternVL2
- LLaVA 1.6
- BLIP-2
- LLaVA 1.5-7B
- Qwen2-VL-2B
Use Cases
- ✓ Accessibility (alt text generation)
- ✓ Photo library organization
- ✓ Content moderation descriptions
- ✓ RAG pipeline input for image search
Architectural Patterns
VLM Captioning
Use a vision-language model (GPT-4V, Claude, LLaVA) to generate detailed captions.
Pros:
- Rich, detailed descriptions
- Can follow specific prompts
- Handles complex scenes
Cons:
- Slower and more expensive
- May hallucinate details
Specialized Captioning Models
Use dedicated captioning models like BLIP-2 or CoCa.
Pros:
- Fast inference
- Optimized for the task
- Lower cost
Cons:
- Less flexible prompting
- May miss nuances
Caption + Text RAG Pipeline
Generate captions, embed them, and run ordinary text retrieval over the captions: a two-stage approach (see the sketch after this list).
Pros:
- Leverages mature text RAG
- Captions are human-readable
- Easy debugging
Cons:
- Information loss in captioning
- Slower indexing
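A hedged sketch of that two-stage pattern: `caption_image()` is a placeholder for any of the captioners shown in the code examples below (BLIP-2, GPT-4o, or LLaVA), and sentence-transformers is one common choice for the text-embedding stage; model names and file paths are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Stage 1: caption every image up front.
# caption_image() is a placeholder for any captioner from the examples below.
image_paths = ['photos/beach.jpg', 'photos/office.jpg', 'photos/dog.jpg']
captions = [caption_image(path) for path in image_paths]

# Stage 2: embed the captions and run ordinary text retrieval over them.
embedder = SentenceTransformer('all-MiniLM-L6-v2')
caption_embeddings = embedder.encode(captions, convert_to_tensor=True)

query = 'a dog playing outside'
query_embedding = embedder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, caption_embeddings)[0]

best = int(scores.argmax())
print(f'Best match: {image_paths[best]} ("{captions[best]}", score {float(scores[best]):.2f})')
```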
Implementations
API Services
- GPT-4 Vision (OpenAI): State-of-the-art for detailed, accurate captions. Best for complex scenes.
- Claude 3.5 Sonnet (Anthropic): Excellent vision capabilities with nuanced descriptions.
Open Source
- LLaVA (Apache 2.0): Strong open-source VLM; LLaVA-1.6 is a significant improvement over 1.5.
Code Examples
Image Captioning with BLIP-2
Generate captions using Salesforce BLIP-2
```bash
pip install transformers torch pillow accelerate
```

```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch
# Load BLIP-2
processor = Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b')
model = Blip2ForConditionalGeneration.from_pretrained(
'Salesforce/blip2-opt-2.7b',
torch_dtype=torch.float16,
device_map='auto'
)
# Caption an image
image = Image.open('photo.jpg').convert('RGB')
inputs = processor(image, return_tensors='pt').to('cuda', torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f'Caption: {caption}')
```

Detailed Captioning with GPT-4 Vision
Get rich descriptions using OpenAI's vision API
```bash
pip install openai
```

```python
from openai import OpenAI
import base64
client = OpenAI()
# Read and encode image
with open('photo.jpg', 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
response = client.chat.completions.create(
model='gpt-4o',
messages=[
{
'role': 'user',
'content': [
{
'type': 'text',
'text': 'Describe this image in detail for search indexing. '
'Include objects, actions, setting, colors, and mood.'
},
{
'type': 'image_url',
'image_url': {
'url': f'data:image/jpeg;base64,{image_data}'
}
}
]
}
],
max_tokens=300
)
caption = response.choices[0].message.content
print(f'Caption: {caption}')
```

Local Captioning with LLaVA
Run a vision-language model locally with Ollama
```bash
pip install ollama
```

```python
import ollama
import base64
# Read image
with open('photo.jpg', 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
# Generate caption with LLaVA via Ollama
response = ollama.chat(
model='llava:13b',
messages=[
{
'role': 'user',
'content': 'Describe this image in detail.',
'images': [image_data]
}
]
)
caption = response['message']['content']
print(f'Caption: {caption}')
```

Quick Facts
- Input: Image
- Output: Text
- Implementations: 3 open source, 2 API
- Patterns: 3 approaches