Model card
BLIP-2
Salesforce · open-source · Unknown params · Frozen image encoder + Q-Former + frozen LLM
Bootstrapped vision-language pre-training with a Q-Former connecting a frozen image encoder to a frozen LLM (OPT or FlanT5 backbone). Published 2023. Source: arxiv:2301.12597.
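The Q-Former's core move is a small set of learned query vectors that cross-attend over the frozen image encoder's output features, distilling them into a fixed number of vectors the LLM can consume. A minimal pure-Python sketch of that cross-attention step, with toy sizes (the real Q-Former uses 32 queries, learned key/value projections, and multi-head attention; none of that detail is reproduced here):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, feats):
    """Each learned query attends over all frozen image features.

    queries: list of query vectors (trainable in the real model)
    feats:   list of image-feature vectors (frozen encoder outputs)
    Returns one attention-mixed vector per query.
    """
    d = len(feats[0])
    out = []
    for q in queries:
        # Scaled dot-product scores of this query against every feature.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in feats]
        w = softmax(scores)
        # Weighted average of the feature vectors.
        out.append([sum(wi * f[j] for wi, f in zip(w, feats)) for j in range(d)])
    return out

# Toy run: 4 queries over 6 image features of width 8 (illustrative sizes only).
queries = [[(i + 1) * 0.1] * 8 for i in range(4)]
feats = [[(j + 1) * 0.05] * 8 for j in range(6)]
mixed = cross_attention(queries, feats)
print(len(mixed), len(mixed[0]))  # one output vector per query, feature width preserved
```

The output is a fixed-length sequence regardless of how many image features the encoder produces, which is what lets the frozen LLM ingest it like a short prompt.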
§ 01 · Benchmarks
Every benchmark BLIP-2 has a recorded score for.
| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | COCO Captions | Multimodal · Image Captioning | CIDEr | 145.80 | #1 | 2023-01-30 | arxiv:2301.12597 |
| 02 | VQA v2.0 | Multimodal · Visual Question Answering | accuracy | 82.2% | #4 | 2023-01-30 | arxiv:2301.12597 |
| 03 | TextVQA | Multimodal · Visual Question Answering | accuracy | 42.5% | #9 | 2023-01-30 | arxiv:2301.12597 |
The Rank column shows this model's position among all models scored on the same benchmark + metric. #1 marks the current SOTA. Rows are sorted by rank, then by newest result.
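The ranking rule above can be sketched as a sort over all models sharing a benchmark + metric pair. A minimal sketch with a hypothetical helper and toy leaderboard values (not real competitor scores):

```python
def rank_on_benchmark(results, model, benchmark, metric, higher_is_better=True):
    """Rank `model` among all models scored on the same benchmark + metric.

    results: list of (model, benchmark, metric, value) tuples.
    Returns the 1-based rank; #1 is the current best score.
    """
    scores = [(m, v) for (m, b, met, v) in results if b == benchmark and met == metric]
    ordered = sorted(scores, key=lambda mv: mv[1], reverse=higher_is_better)
    return 1 + [m for m, _ in ordered].index(model)

# Toy leaderboard (competitor names and values are illustrative only).
results = [
    ("BLIP-2", "COCO Captions", "CIDEr", 145.8),
    ("ModelA", "COCO Captions", "CIDEr", 140.0),
    ("ModelB", "COCO Captions", "CIDEr", 133.3),
]
print(rank_on_benchmark(results, "BLIP-2", "COCO Captions", "CIDEr"))  # → 1
```

For metrics where lower is better, `higher_is_better=False` flips the sort direction.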
§ 03 · Papers
1 paper with results for BLIP-2:
- 2023-01-30 · Multimodal · 3 results
  BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
§ 04 · Related models
Other Salesforce models scored on Codesota.
§ 05 · Sources & freshness
Where these numbers come from.
arxiv · 3 results · 3 of 3 rows marked verified.