Multimodal
Image Captioning
Generating text descriptions of images (COCO Captions).
1 dataset, 0 results
Image captioning is a key task in multimodal machine learning. The standard benchmarks used to evaluate models are listed below, along with current state-of-the-art results where available.
Benchmarks & SOTA
Related Tasks
Visual Question Answering
Answering questions about images (VQA, GQA).
Text-to-Image Generation
Generating images from text descriptions (Stable Diffusion, DALL-E).
Video Understanding
Understanding and reasoning about video content.
Cross-Modal Retrieval
Retrieving items across different modalities (image-text).