Monday Feb 06, 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages.

2023: Junnan Li, Dongxu Li, S. Savarese, Steven Hoi

Ranked #1 on Image Retrieval on COCO

https://arxiv.org/pdf/2301.12597v1.pdf
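To make the architecture described in the abstract concrete, here is a minimal, self-contained PyTorch sketch of the core idea: a small set of learnable query tokens cross-attends to features from a frozen image encoder, and the result is projected into the embedding space of a frozen language model. All class names, dimensions, and hyperparameters below (QFormerSketch, num_queries=32, vision_dim, llm_dim, a single cross-attention block) are illustrative assumptions, not the paper's actual Q-Former or its two-stage training recipe.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Illustrative bridge between a frozen image encoder and a frozen LLM.

    Only the parameters defined here would be trained; the image encoder
    and the language model themselves stay frozen, which is the source of
    BLIP-2's pre-training efficiency.
    """

    def __init__(self, num_queries=32, vision_dim=1024, hidden_dim=768, llm_dim=2560):
        super().__init__()
        # Learnable query embeddings: the only "new" tokens in the pipeline.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Cross-attention from the queries to frozen image-encoder features.
        self.cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads=8, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4), nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        # Linear projection into the frozen LLM's token-embedding space.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen encoder.
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        # Returns (batch, num_queries, llm_dim): soft visual "prompt" tokens
        # that can be prepended to the frozen LLM's text embeddings.
        return self.to_llm(q)

if __name__ == "__main__":
    # Stand-in for frozen-encoder output: 4 images, 257 patch tokens, 1024-d.
    feats = torch.randn(4, 257, 1024)
    bridge = QFormerSketch()
    visual_prompts = bridge(feats)
    print(visual_prompts.shape)  # torch.Size([4, 32, 2560])
```

The design point this sketch illustrates is that the trainable bridge is tiny compared to the frozen backbones, so the modality gap is closed without end-to-end training of the large models.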