BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages.

2023: Junnan Li, Dongxu Li, S. Savarese, Steven Hoi

Ranked #1 on Image Retrieval on COCO

https://arxiv.org/pdf/2301.12597v1.pdf
