image

paper, code

TL;DR

  • I read this because… : aka BLIP-2, the much-talked-about model
  • task : Vision-Language Pretraining -> zero-shot VQA, captioning, image-text retrieval
  • problem : Vision-language pretraining is too expensive
  • idea : Freeze the vision model and the language model, and learn a Q-Former as a bridge between them.
  • input : image, text
  • output : text
  • architecture : ViT + OPT (decoder-only) or Flan-T5 (encoder-decoder) + Querying Transformer (Q-Former)
  • objective : Image-Text Matching(ITM), Image-Grounded Text Generation(ITG), Image-Text Contrastive Learning(ITC)
  • baseline : SimVLM, BeiT-3, Flamingo, Frozen, VL-T5, VLKD, OSCAR, VinVL, Florence, ALBEF, …
  • data : COCO, Visual Genome, CC3M, CC12M, SBU, LAION-400M -> evaluated on NoCaps, COCO Caption, Flickr30K
  • evaluation : Do it yourself…
  • result : Far fewer trainable parameters, yet SOTA performance.
  • contribution : Proposing an efficient way to do vision-language pretraining.
  • etc. : Unlike Flamingo, which never released its weights, this one released the weights and even uploaded them to HF, so people seem to use it a lot.

Details

image

Querying Transformer (Q-Former)

Something to link the frozen image encoder and the frozen LLM. It extracts the same number of output features regardless of image resolution.

Uses learnable query embeddings, trained with self-attention plus cross-attention to the visual encoder. Initialized from pretrained $BERT_{base}$, with the cross-attention layers trained from scratch. 188M parameters; 32 queries with 768 hidden dim. The output query representation is $Z$. The dimension of $Z$, 32 $\times$ 768, is much smaller than the dimension of the frozen image feature (257 $\times$ 1024 for ViT-L/14).
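
The shape reduction above can be sketched with a toy single-head cross-attention in numpy. All weights here are random stand-ins for learned projections, not the actual model; the point is only that 32 queries yield a fixed 32 $\times$ 768 output no matter how many image tokens come in.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, image_feats, d_model):
    """Toy single-head cross-attention: learned queries attend to
    frozen image features. Weights are random stand-ins."""
    Wq = rng.normal(size=(queries.shape[-1], d_model))
    Wk = rng.normal(size=(image_feats.shape[-1], d_model))
    Wv = rng.normal(size=(image_feats.shape[-1], d_model))
    Q, K, V = queries @ Wq, image_feats @ Wk, image_feats @ Wv
    scores = Q @ K.T / np.sqrt(d_model)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

learned_queries = rng.normal(size=(32, 768))   # 32 learnable queries
img_feats_vitl = rng.normal(size=(257, 1024))  # ViT-L/14: 257 tokens x 1024
Z = cross_attention(learned_queries, img_feats_vitl, 768)
print(Z.shape)  # (32, 768), independent of the 257 input tokens
```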

Pretraining

  • first stage : Vision-Language Representation Learning from a Frozen Image Encoder image

  • Image-Text Contrastive Learning(ITC) Aligns the image and text representations so that positive pairs get higher similarity and negative pairs lower. Compute the pairwise similarity between each query representation in $Z$ from the image transformer and the [CLS]-token representation $t$ from the text transformer, then take the highest one across the 32 learned queries. A unimodal self-attention mask is used so that query and text cannot see each other.
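
A minimal numpy sketch of that max-over-queries similarity, with random embeddings standing in for real Q-Former/text outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def itc_similarity(Z, t):
    """ITC-style image-text score: cosine similarity between each of the
    query outputs and the text [CLS] embedding, keeping the highest."""
    Zn = Z / np.linalg.norm(Z, axis=-1, keepdims=True)  # (32, d)
    tn = t / np.linalg.norm(t)                          # (d,)
    return (Zn @ tn).max()                              # best query wins

Z = rng.normal(size=(32, 768))  # query representations for one image
t = rng.normal(size=768)        # [CLS] representation for one caption
sim = itc_similarity(Z, t)
```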

  • Image-Grounded Text Generation(ITG) A loss for generating text given an image. Since the Q-Former structure has no direct interaction between the frozen image encoder and the text tokens, the queries are forced to extract visual features that carry all the information needed for the text. A multimodal causal self-attention mask is applied: queries cannot see text, and text tokens use causal masking (similar to UniLM).
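
The multimodal causal mask can be built explicitly. This is a sketch of the masking rule described above (True = attention allowed), not taken from the released code:

```python
import numpy as np

def multimodal_causal_mask(n_query, n_text):
    """ITG mask: queries see only queries; each text token sees all
    queries plus itself and earlier text tokens (causal)."""
    n = n_query + n_text
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_query, :n_query] = True                       # query -> query
    mask[n_query:, :n_query] = True                       # text  -> query
    causal = np.tril(np.ones((n_text, n_text), dtype=bool))
    mask[n_query:, n_query:] = causal                     # text -> past text
    return mask

m = multimodal_causal_mask(4, 3)  # small sizes for readability
```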

  • Image-Text Matching(ITM) Learns fine-grained alignment between image and text, framed as binary classification of whether an image-text pair matches. Query and text are both visible to each other. Each query embedding is fed to a two-class linear head, and the logits are averaged over all queries to produce the matching score. Hard negative mining is used (following ALBEF).
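
The logit-averaging step is simple enough to show directly; the classifier weights below are random placeholders for the learned two-class head:

```python
import numpy as np

rng = np.random.default_rng(0)

def itm_score(query_embeds, W, b):
    """ITM head: a two-class linear classifier applied to every query
    embedding; logits are averaged over queries for the match score."""
    logits = query_embeds @ W + b  # (n_query, 2)
    return logits.mean(axis=0)     # (2,) averaged matching logits

query_embeds = rng.normal(size=(32, 768))  # Q-Former output for a pair
W = rng.normal(size=(768, 2)) * 0.01       # stand-in for the learned head
b = np.zeros(2)
match_logits = itm_score(query_embeds, W, b)
```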

  • second stage : Generative Learning from a Frozen LLM Adds generative language capability by connecting the Q-Former to the LLM. The output query embeddings $Z$ are projected by a fully-connected layer to the LLM's text-embedding dimension and prepended to the input text embeddings.
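
The prepend-as-soft-prompt step, sketched with random stand-in weights (the 2560 hidden size assumes something like OPT-2.7B; swap in whatever the frozen LLM uses):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_prompt(Z, W_proj, text_embeds):
    """Project Q-Former output Z to the LLM hidden size and prepend it
    to the text embeddings, acting as a 32-token visual soft prompt."""
    visual_prefix = Z @ W_proj  # (32, d_llm)
    return np.concatenate([visual_prefix, text_embeds], axis=0)

Z = rng.normal(size=(32, 768))                 # Q-Former output
d_llm = 2560                                   # e.g. OPT-2.7B hidden size
W_proj = rng.normal(size=(768, d_llm)) * 0.01  # learned FC projection
text_embeds = rng.normal(size=(10, d_llm))     # 10 text-token embeddings
llm_input = soft_prompt(Z, W_proj, text_embeds)
print(llm_input.shape)  # (42, 2560): 32 visual tokens + 10 text tokens
```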

image

Experiment

The data is listed above. CapFilt with $BLIP_{large}$ was used to create synthetic captions for the web images, and CLIP ViT-L/14 was used to rank them, keeping only the top-2 captions per image as training data.
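
The ranking step amounts to cosine similarity plus a top-k cut. A sketch with random embeddings standing in for real CLIP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_top_k(image_emb, caption_embs, k=2):
    """Rank candidate captions by CLIP-style cosine similarity to the
    image and return the indices of the top-k (the paper keeps top-2)."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=-1, keepdims=True)
    sims = caps @ img
    return np.argsort(sims)[::-1][:k]

image_emb = rng.normal(size=512)          # stand-in CLIP image embedding
caption_embs = rng.normal(size=(5, 512))  # 5 candidate synthetic captions
top2 = keep_top_k(image_emb, caption_embs)
```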

image image

Result

image image
  • A strong image encoder matters: ViT-G > ViT-L. A larger LLM is also better.
  • FlanT5 (instruction-tuned) > OPT (unsupervised) on VQA
image image image

Without vision-language representation learning (the first stage), generative learning performs poorly; the model cannot bridge the modality gap.