Model card
BLIP-2
Salesforce · open-source · Unknown params · Frozen image encoder + Q-Former + frozen LLM
Bootstrapped vision-language pre-training with a Q-Former connecting a frozen image encoder to a frozen LLM (OPT or FlanT5 backbone). Published 2023. Source: arxiv:2301.12597.
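The Q-Former's core move is a small set of learned query vectors that cross-attend over the frozen image encoder's output features, distilling them into a fixed number of vectors the LLM can consume. A minimal pure-Python sketch of that cross-attention step, with toy sizes (the real Q-Former uses 32 queries, learned key/value projections, and multi-head attention; none of that detail is reproduced here):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, feats):
    """Each learned query attends over all frozen image features.

    queries: list of query vectors (trainable in the real model)
    feats:   list of image-feature vectors (frozen encoder outputs)
    Returns one attention-mixed vector per query.
    """
    d = len(feats[0])
    out = []
    for q in queries:
        # Scaled dot-product scores of this query against every feature.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in feats]
        w = softmax(scores)
        # Weighted average of the feature vectors.
        out.append([sum(wi * f[j] for wi, f in zip(w, feats)) for j in range(d)])
    return out

# Toy run: 4 queries over 6 image features of width 8 (illustrative sizes only).
queries = [[(i + 1) * 0.1] * 8 for i in range(4)]
feats = [[(j + 1) * 0.05] * 8 for j in range(6)]
mixed = cross_attention(queries, feats)
print(len(mixed), len(mixed[0]))  # one output vector per query, feature width preserved
```

The output is a fixed-length sequence regardless of how many image features the encoder produces, which is what lets the frozen LLM ingest it like a short prompt.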
§ 01 · Benchmarks
Every benchmark BLIP-2 has a recorded score for.
| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | COCO Captions | Multimodal · Image Captioning | CIDEr | 145.80 | #1 | 2023-01-30 | arxiv:2301.12597 |
| 02 | VQA v2.0 | Multimodal · Visual Question Answering | accuracy | 82.2% | #4 | 2023-01-30 | arxiv:2301.12597 |
| 03 | TextVQA | Multimodal · Visual Question Answering | accuracy | 42.5% | #9 | 2023-01-30 | arxiv:2301.12597 |
The Rank column shows this model's position among all models scored on the same benchmark + metric. #1 marks the current SOTA. Rows are sorted by rank, then by newest result.
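The ranking rule above can be sketched as a sort over all models sharing a benchmark + metric pair. A minimal sketch with a hypothetical helper and toy leaderboard values (not real competitor scores):

```python
def rank_on_benchmark(results, model, benchmark, metric, higher_is_better=True):
    """Rank `model` among all models scored on the same benchmark + metric.

    results: list of (model, benchmark, metric, value) tuples.
    Returns the 1-based rank; #1 is the current best score.
    """
    scores = [(m, v) for (m, b, met, v) in results if b == benchmark and met == metric]
    ordered = sorted(scores, key=lambda mv: mv[1], reverse=higher_is_better)
    return 1 + [m for m, _ in ordered].index(model)

# Toy leaderboard (competitor names and values are illustrative only).
results = [
    ("BLIP-2", "COCO Captions", "CIDEr", 145.8),
    ("ModelA", "COCO Captions", "CIDEr", 140.0),
    ("ModelB", "COCO Captions", "CIDEr", 133.3),
]
print(rank_on_benchmark(results, "BLIP-2", "COCO Captions", "CIDEr"))  # → 1
```

For metrics where lower is better, `higher_is_better=False` flips the sort direction.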
§ 03 · Papers
1 paper with results for BLIP-2:
- 2023-01-30 · Multimodal · 3 results
  BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
§ 04 · Related models
Other Salesforce models scored on Codesota.
§ 05 · Sources & freshness
Where these numbers come from.
arxiv · 3 results · 3 of 3 rows marked verified.