Multimodal
Evaluate models that combine vision and language: image captioning, visual question answering, text-to-image generation, video understanding, and cross-modal retrieval.
5 tasks
2 datasets
0 results
Image Captioning
Generating text descriptions of images (COCO Captions).
1 dataset
0 results
330K images with 5 captions each. Standard benchmark for image captioning.
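For a quick sense of the task, the sketch below captions a single image with an off-the-shelf model. It assumes the Hugging Face transformers, Pillow, and requests packages and uses the Salesforce/blip-image-captioning-base checkpoint and a sample COCO image URL as illustrative choices; it is not tied to the datasets indexed here.

```python
# Minimal image-captioning sketch (assumes: transformers, Pillow, requests installed;
# the BLIP checkpoint and image URL are illustrative choices).
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Generate a short free-form caption for the image.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```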
Visual Question Answering
Answering questions about images (VQA, GQA).
1 dataset
0 results
265K images with 1.1M questions. Balanced dataset to reduce language biases found in v1.
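To show what a VQA system consumes and produces, here is a minimal inference sketch using the transformers visual-question-answering pipeline. The ViLT checkpoint, image path, and question are illustrative assumptions, not part of this index.

```python
# Minimal VQA inference sketch (assumes: transformers and Pillow installed;
# checkpoint name, image path, and question are illustrative).
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# The pipeline returns candidate answers ranked by confidence score.
preds = vqa(image="kitchen.jpg", question="How many chairs are at the table?")
print(preds[0]["answer"], preds[0]["score"])
```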
Text-to-Image Generation
Generating images from text descriptions (Stable Diffusion, DALL-E).
0 datasets
0 results
No datasets indexed yet. Contribute on GitHub
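Even without an indexed dataset, the task itself is straightforward to demo. The sketch below generates an image from a text prompt with the diffusers library; the Stable Diffusion checkpoint name, the prompt, and the assumption of a CUDA GPU are all illustrative.

```python
# Text-to-image generation sketch with diffusers (assumes: diffusers, torch,
# and a CUDA GPU; the checkpoint name and prompt are illustrative).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Sample one image for the prompt and save it to disk.
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```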
Video Understanding
Understanding and reasoning about video content.
0 datasets
0 results
No datasets indexed yet. Contribute on GitHub
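One common form of video understanding is action recognition over short clips. The sketch below uses the transformers video-classification pipeline; the VideoMAE checkpoint and the local video file are illustrative assumptions, and a video-decoding backend such as decord is assumed to be installed.

```python
# Video-classification sketch (assumes: transformers with a video backend such as
# decord installed; checkpoint name and video file are illustrative).
from transformers import pipeline

classifier = pipeline("video-classification", model="MCG-NJU/videomae-base-finetuned-kinetics")

# Returns the top action labels with confidence scores for the clip.
preds = classifier("cooking_clip.mp4")
print(preds[0]["label"], preds[0]["score"])
```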
Cross-Modal Retrieval
Retrieving items across different modalities (image-text).
0 datasets
0 results
No datasets indexed yet. Contribute on GitHub
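Cross-modal retrieval is typically scored by embedding both modalities into a shared space and ranking by similarity. The sketch below runs text-to-image retrieval with a CLIP model and reports Recall@1; the checkpoint, image files, and captions are placeholder assumptions, and caption i is assumed to be paired with image i.

```python
# Text-to-image retrieval sketch with CLIP (assumes: transformers, torch, Pillow;
# checkpoint, image files, and captions are placeholders; caption i pairs with image i).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p).convert("RGB") for p in ["dog.jpg", "ramen.jpg", "bike.jpg"]]
captions = ["a dog running on a beach", "a bowl of ramen", "a red bicycle"]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_text  # (num_captions, num_images) similarity matrix

# Recall@1: fraction of captions whose top-ranked image is the paired one.
top1 = sims.argmax(dim=-1)
recall_at_1 = (top1 == torch.arange(len(captions))).float().mean().item()
print(f"Recall@1: {recall_at_1:.2f}")
```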