image

paper, code

TL;DR

  • I read this because… : aka BLIP-2, the much-talked-about model
  • task : Vision-Language Pretraining -> zero-shot VQA, captioning, image-text retrieval
  • problem : Vision-language pretraining is too expensive
  • idea : Freeze the vision model and the language model, and learn a Q-Former as a bridge between them.
  • input : image, text
  • output : text
  • architecture : ViT + OPT (decoder-only) or Flan-T5 (encoder-decoder) + Querying Transformer (Q-Former)
  • objective : Image-Text Matching(ITM), Image-Grounded Text Generation(ITG), Image-Text Contrastive Learning(ITC)
  • baseline : SimVLM, BeiT-3, Flamingo, Frozen, VL-T5, VLKD, OSCAR, VinVL, Florence, ALBEF, …
  • data : COCO, Visual Genome, CC3M, CC12M, SBU, LAION-400M -> evaluated on NoCaps, COCO Caption, Flickr30K
  • evaluation : Do it yourself…
  • result : Far fewer trainable parameters, yet SOTA performance.
  • contribution : Proposing an efficient way to do vision-language pretraining.
  • etc. : Unlike Flamingo, which never released its weights, this one released the weights and even uploaded them to HF, so people seem to use it a lot.

Details

image

Querying Transformer (Q-Former)

Something to link the frozen image encoder and the frozen LLM. It extracts the same number of output features regardless of image resolution.

Uses learnable query embeddings, trained with self-attention plus cross-attention to the visual encoder. Initialized from pretrained $BERT_{base}$, with the cross-attention layers trained from scratch. 188M parameters; 32 queries with 768 hidden dim. The output query representation is $Z$. The dimension of $Z$, 32 $\times$ 768, is much smaller than the dimension of the frozen image feature (257 $\times$ 1024 for ViT-L/14).
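
The shape reduction above can be sketched with a toy single-head cross-attention in numpy. All weights here are random stand-ins for learned projections, not the actual model; the point is only that 32 queries yield a fixed 32 $\times$ 768 output no matter how many image tokens come in.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, image_feats, d_model):
    """Toy single-head cross-attention: learned queries attend to
    frozen image features. Weights are random stand-ins."""
    Wq = rng.normal(size=(queries.shape[-1], d_model))
    Wk = rng.normal(size=(image_feats.shape[-1], d_model))
    Wv = rng.normal(size=(image_feats.shape[-1], d_model))
    Q, K, V = queries @ Wq, image_feats @ Wk, image_feats @ Wv
    scores = Q @ K.T / np.sqrt(d_model)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

learned_queries = rng.normal(size=(32, 768))   # 32 learnable queries
img_feats_vitl = rng.normal(size=(257, 1024))  # ViT-L/14: 257 tokens x 1024
Z = cross_attention(learned_queries, img_feats_vitl, 768)
print(Z.shape)  # (32, 768), independent of the 257 input tokens
```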

Pretraining

  • first stage : Vision-Language Representation Learning from a Frozen Image Encoder image

  • Image-Text Contrastive Learning(ITC) Aligns the image and text representations so that positive pairs get higher similarity and negative pairs lower. Compute the pairwise similarity between each query representation in $Z$ from the image transformer and the [CLS]-token representation $t$ from the text transformer, then take the highest one across the 32 learned queries. A unimodal self-attention mask is used so that query and text cannot see each other.
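
A minimal numpy sketch of that max-over-queries similarity, with random embeddings standing in for real Q-Former/text outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def itc_similarity(Z, t):
    """ITC-style image-text score: cosine similarity between each of the
    query outputs and the text [CLS] embedding, keeping the highest."""
    Zn = Z / np.linalg.norm(Z, axis=-1, keepdims=True)  # (32, d)
    tn = t / np.linalg.norm(t)                          # (d,)
    return (Zn @ tn).max()                              # best query wins

Z = rng.normal(size=(32, 768))  # query representations for one image
t = rng.normal(size=768)        # [CLS] representation for one caption
sim = itc_similarity(Z, t)
```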

  • Image-Grounded Text Generation(ITG) A loss for generating text given an image. Since the Q-Former structure has no direct interaction between the frozen image encoder and the text tokens, the queries are forced to extract visual features that carry all the information needed for the text. A multimodal causal self-attention mask is applied: queries cannot see text, and text tokens use causal masking (similar to UniLM).
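
The multimodal causal mask can be built explicitly. This is a sketch of the masking rule described above (True = attention allowed), not taken from the released code:

```python
import numpy as np

def multimodal_causal_mask(n_query, n_text):
    """ITG mask: queries see only queries; each text token sees all
    queries plus itself and earlier text tokens (causal)."""
    n = n_query + n_text
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_query, :n_query] = True                       # query -> query
    mask[n_query:, :n_query] = True                       # text  -> query
    causal = np.tril(np.ones((n_text, n_text), dtype=bool))
    mask[n_query:, n_query:] = causal                     # text -> past text
    return mask

m = multimodal_causal_mask(4, 3)  # small sizes for readability
```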

  • Image-Text Matching(ITM) Learns fine-grained alignment between image and text, framed as binary classification of whether an image-text pair matches. Query and text are both visible to each other. Each query embedding is fed to a two-class linear head, and the logits are averaged over all queries to produce the matching score. Hard negative mining is used (following ALBEF).
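
The logit-averaging step is simple enough to show directly; the classifier weights below are random placeholders for the learned two-class head:

```python
import numpy as np

rng = np.random.default_rng(0)

def itm_score(query_embeds, W, b):
    """ITM head: a two-class linear classifier applied to every query
    embedding; logits are averaged over queries for the match score."""
    logits = query_embeds @ W + b  # (n_query, 2)
    return logits.mean(axis=0)     # (2,) averaged matching logits

query_embeds = rng.normal(size=(32, 768))  # Q-Former output for a pair
W = rng.normal(size=(768, 2)) * 0.01       # stand-in for the learned head
b = np.zeros(2)
match_logits = itm_score(query_embeds, W, b)
```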

  • second stage : Generative Learning from a Frozen LLM Adds generative language capability by connecting the Q-Former to the LLM. The output query embeddings $Z$ are projected by a fully-connected layer to the LLM's text-embedding dimension and prepended to the input text embeddings.
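
The prepend-as-soft-prompt step, sketched with random stand-in weights (the 2560 hidden size assumes something like OPT-2.7B; swap in whatever the frozen LLM uses):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_prompt(Z, W_proj, text_embeds):
    """Project Q-Former output Z to the LLM hidden size and prepend it
    to the text embeddings, acting as a 32-token visual soft prompt."""
    visual_prefix = Z @ W_proj  # (32, d_llm)
    return np.concatenate([visual_prefix, text_embeds], axis=0)

Z = rng.normal(size=(32, 768))                 # Q-Former output
d_llm = 2560                                   # e.g. OPT-2.7B hidden size
W_proj = rng.normal(size=(768, d_llm)) * 0.01  # learned FC projection
text_embeds = rng.normal(size=(10, d_llm))     # 10 text-token embeddings
llm_input = soft_prompt(Z, W_proj, text_embeds)
print(llm_input.shape)  # (42, 2560): 32 visual tokens + 10 text tokens
```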

image

Experiment

The data is listed above. CapFilt with $BLIP_{large}$ was used to create synthetic captions for the web images, and CLIP ViT-L/14 was used to rank them, keeping only the top-2 captions per image as training data.
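
The ranking step amounts to cosine similarity plus a top-k cut. A sketch with random embeddings standing in for real CLIP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_top_k(image_emb, caption_embs, k=2):
    """Rank candidate captions by CLIP-style cosine similarity to the
    image and return the indices of the top-k (the paper keeps top-2)."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=-1, keepdims=True)
    sims = caps @ img
    return np.argsort(sims)[::-1][:k]

image_emb = rng.normal(size=512)          # stand-in CLIP image embedding
caption_embs = rng.normal(size=(5, 512))  # 5 candidate synthetic captions
top2 = keep_top_k(image_emb, caption_embs)
```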

image image

Result

image image
  • A strong image encoder matters: ViT-G > ViT-L. A larger LLM is also better.
  • FlanT5 (instruction-tuned) > OPT (unsupervised) on VQA
image image image

Without vision-language representation learning (the first stage), generative learning performs poorly; the model cannot bridge the modality gap.