Language Models Can See: Plugging Visual Controls in Text Generation

In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generation with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation.

2022: Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, N. Collier

https://arxiv.org/pdf/2205.02655v1.pdf
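The plug-and-play idea in the abstract, an LM proposing tokens while an image-text matching model steers the choice, can be illustrated with a toy reranking step. This is only a minimal sketch under stated assumptions and not the authors' scoring function: the names `softmax` and `pick_next_token`, the weight `beta`, and all numbers are hypothetical. In a real system the candidate probabilities would come from an LM such as GPT-2 and the similarity scores from an image-text model such as CLIP.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(candidates, lm_probs, image_sims, beta=1.0):
    """Pick one token from the LM's top-k candidates.

    candidates: candidate next tokens proposed by the LM (hypothetical)
    lm_probs:   LM probability of each candidate (hypothetical values)
    image_sims: image-text similarity of each continuation (hypothetical values)
    beta:       assumed weight of the visual term; 0 ignores the image
    """
    visual = softmax(image_sims)
    scores = [p + beta * v for p, v in zip(lm_probs, visual)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Made-up numbers: the LM alone prefers "car", the image favours "dog".
candidates = ["dog", "car", "tree"]
lm_probs = [0.30, 0.45, 0.25]
image_sims = [3.0, 0.5, 1.0]

print(pick_next_token(candidates, lm_probs, image_sims, beta=0.0)[0])
print(pick_next_token(candidates, lm_probs, image_sims, beta=1.0)[0])
```

With the visual weight set to zero the choice follows the LM alone; raising it lets the image-text score override the LM's preference, and neither model needs any training for this step.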
