Microsoft COCO Captions (COCO 2017 Captions) is the image-captioning portion of the MS COCO (Microsoft Common Objects in Context) benchmark. Each annotated image carries five independently written, human-authored natural-language captions. The COCO 2017 release reorganized the original COCO images into the train2017/val2017/test2017 splits; the commonly used captioned splits are train2017 (~118,287 images) and val2017 (5,000 images), so caption annotations cover roughly 123k images with five captions each. COCO Captions is a standard benchmark for image-to-text and vision-language tasks and is widely used to train and evaluate image captioning and vision-language models with metrics such as BLEU, METEOR, and CIDEr. (Dataset paper: "Microsoft COCO: Common Objects in Context", arXiv:1405.0312; the caption annotations and evaluation protocol are described in "Microsoft COCO Captions: Data Collection and Evaluation Server", arXiv:1504.00325.)
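For concreteness, here is a minimal sketch of loading the captions and scoring a results file with the standard `pycocotools` and `pycocoevalcap` packages. The annotation path and the results file name (`my_captions.json`) are assumptions for illustration, not part of this page:

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Load the val2017 caption annotations (path is an assumption).
coco = COCO("annotations/captions_val2017.json")

# Inspect one image: there are typically five reference captions per image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:
    print(ann["caption"])

# Score a results file of the form [{"image_id": ..., "caption": ...}, ...].
coco_res = coco.loadRes("my_captions.json")
coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()
coco_eval.evaluate()
print(coco_eval.eval)  # e.g. Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr
```

Note that `pycocoevalcap` shells out to the reference Java implementation for METEOR, so a Java runtime is needed for the full metric suite.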
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
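As a rough illustration of the kind of reproduction script we can run, a hypothetical skeleton is sketched below. `load_model` and `generate_caption` stand in for your own checkpoint-loading and inference code, and all file names are placeholders; the output feeds directly into the evaluation sketch above:

```python
import json

from pycocotools.coco import COCO

from my_model import load_model, generate_caption  # hypothetical: your own code


def main() -> None:
    model = load_model("checkpoint.pt")  # placeholder checkpoint path
    coco = COCO("annotations/captions_val2017.json")  # assumed annotation path

    # Generate one caption per val2017 image.
    results = []
    for img in coco.loadImgs(coco.getImgIds()):
        caption = generate_caption(model, img["file_name"])  # hypothetical inference call
        results.append({"image_id": img["id"], "caption": caption})

    # Write results in the format expected by COCO.loadRes / COCOEvalCap.
    with open("my_captions.json", "w") as f:
        json.dump(results, f)


if __name__ == "__main__":
    main()
```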