image

paper

TL;DR

  • I read this because… : it’s been mentioned a lot, so I thought I’d give it a try.
  • task : object detection, captioning, VQA, text VQA, …
  • problem : scale VLMs the way we scaled LLMs.
  • idea : grow the vision encoder from ViT-e (3.9B) to ViT-22B and the language model from mT5-XXL (13B) to UL2 (32B)
  • input/output : image + text -> text (or visual tokens for the BEiT objective)
  • architecture : ViT + UL2, an encoder-decoder that takes image patches together with text as input.
  • objective : (a) span corruption (b) split-captioning (c) captioning (d) VQA (e) VQG (f) objective-aware VQA (g) captioning on Episodic WebLI (h) pix2struct objective (i) captioning on short video (j) BEiT-like image-token prediction
  • baseline : PaLI, Flamingo, GIT
  • data : CC3M, WebLI (proposed in PaLI), VQ2A-CC3M, …
  • evaluation : each…
  • result : SOTA with fine-tuning on 25+ VLM benchmarks.
  • contribution : scaling PaLI
  • etc. :

Details

image

A methodology that assigns a prefix to each pretraining task, so many tasks can be trained at once and the model switches behavior based on the prefix. No separate architectures (e.g. MoE) are needed.
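As a minimal sketch, the prefix mechanism just prepends a task tag to the text input; the prefix strings below are illustrative, not the paper's exact prompts:

```python
# Minimal sketch of prefix-conditioned multi-task training data.
# The prefix strings here are made up, not the paper's actual prompts.
def make_example(task: str, text_in: str, target: str) -> dict:
    prefixes = {
        "captioning": "Generate the caption in EN:",
        "vqa": "Answer in EN:",
        "ocr": "Transcribe the text:",
    }
    return {"input": f"{prefixes[task]} {text_in}", "target": target}

ex = make_example("vqa", "What color is the bus?", "red")
print(ex["input"])  # -> "Answer in EN: What color is the bus?"
```

At inference time the same prefix tells the model which behavior to produce, with one shared set of weights.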

This line of papers tends to focus a bit more on the multilingual side. image It looks like the visual tokens are simply pushed in as input. No pooling?

Training ViT-e: performance gains from scaling were marginal on ImageNet, but significant on multi-modal tasks. image

image

PaLI’s full size is shown here: image

For the same parameter increase, scaling the vision model improved performance more than scaling the language model. image

WebLI is a multi-modal dataset built to cover images found on the web.

  • 10 billion images and 12 billion alt-texts
  • from English-only datasets to 109 languages
  • use a publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs

So it’s alt-text + OCR from the image… not M3W-style (interleaved) after all.
image

ablation for each objective image

mixing ratio image
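The mixing itself amounts to sampling an objective per example according to mixture weights; the weights below are made up for illustration, not the paper's ratios:

```python
import random

# Illustrative sketch: sample a pretraining objective per example
# according to mixture weights. These weights are NOT the paper's ratios.
MIXTURE = {
    "span_corruption": 0.1,
    "captioning": 0.4,
    "vqa": 0.2,
    "pix2struct": 0.3,
}

def sample_objective(rng: random.Random) -> str:
    objectives = list(MIXTURE)
    weights = [MIXTURE[o] for o in objectives]
    return rng.choices(objectives, weights=weights, k=1)[0]

# Over many draws, the empirical counts track the mixture weights.
rng = random.Random(0)
counts = {o: 0 for o in MIXTURE}
for _ in range(10_000):
    counts[sample_objective(rng)] += 1
```

The ablations then vary these weights (or zero out an objective entirely) and measure downstream impact.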

Limitations: 1) multilingual ability degrades on some benchmarks after fine-tuning on English only; 2) it’s unclear whether synonyms are evaluated fairly, since the benchmarks are in English. image

image

Asking “Are there specific objects in this image?”

image

Adding objective-aware training improved overall performance on visual question answering, visual entailment, and captioning.

  • The QA pairs are said to have been created with the VQ2A method (https://arxiv.org/pdf/2205.01883.pdf), i.e. generating QA from captions. image

  • The configuration looks like this

  • candidate answer extraction : POS-based

    • question generation : a T5-XXL model, further fine-tuned on SQuAD1.1
    • question-answer filtering : if the produced answer does not match the answer candidate given as input to the question generation model, the generated question is discarded. Filtering uses a T5-XXL model further fine-tuned on SQuAD1.1 and Natural Questions.
  • least-to-most prompting image
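The VQ2A generate-then-filter (round-trip) step above can be sketched as follows; the two model calls are stubbed with toy callables, whereas the paper uses T5-XXL fine-tuned on SQuAD1.1 (plus Natural Questions for the filter):

```python
# Schematic of VQ2A-style round-trip filtering; model calls are stubbed.
from typing import Callable, Optional, Tuple

def generate_qa(caption: str,
                answer_candidate: str,
                question_gen: Callable[[str, str], str],
                qa_model: Callable[[str, str], str]) -> Optional[Tuple[str, str]]:
    # 1. Generate a question conditioned on the caption and candidate answer.
    question = question_gen(caption, answer_candidate)
    # 2. Round-trip check: answer the generated question from the caption.
    predicted = qa_model(caption, question)
    # 3. Keep the pair only if the QA model recovers the original candidate.
    if predicted.strip().lower() != answer_candidate.strip().lower():
        return None
    return question, answer_candidate

# Toy stand-ins for the two models:
qg = lambda cap, ans: f"What is {ans}?"
qa_good = lambda cap, q: "a red bus"
qa_bad = lambda cap, q: "a blue car"

print(generate_qa("a red bus on the street", "a red bus", qg, qa_good))
print(generate_qa("a red bus on the street", "a red bus", qg, qa_bad))  # None
```

The round trip is what keeps noisy captions from producing unanswerable or mismatched QA pairs.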

The idea mirrors research showing that students thrive when a difficult question is broken into smaller chunks and they are prompted to work through each one. I’m not sure whether this amounts to tuning for CoT, learning to decompose, or something else.

Model

image This is a few-shot example, but the model architecture hasn’t changed from PaLI:

  • language model : a UL2 variant, 32B — the language encoder-decoder is a bit bigger than before (previously 13B)
  • vision model : 22B, from “Scaling vision transformers to 22 billion parameters” (https://arxiv.org/pdf/2302.05442.pdf)
  • there is also a high-resolution training phase, explained below

Training objectives

image

Training procedure

In stage 1, the visual encoder (after mixed-objective training) is kept frozen, while the rest of the parameters are trained on a total of 2.2B examples at the base resolution 224×224 (native to ViT-22B), using the entire mixture. In stage 2, it continues training using only the OCR-related objectives (pix2struct and split-ocr) plus the object detection objective; this is done in several substages, during which image resolution is gradually increased to 448×448, 672×672 and finally 756×756.
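The schedule above can be written down as a plain config sketch; it encodes only what the text states (per-substage step counts are not given here):

```python
# Config sketch of the two-stage training schedule described above.
# Only facts from the text are encoded; substage step counts are unknown.
STAGE_1 = {
    "resolution": (224, 224),            # native ViT-22B resolution
    "objectives": "entire mixture",
    "visual_encoder_frozen": True,
    "examples": "2.2B",
}

STAGE_2_OBJECTIVES = ["pix2struct", "split-ocr", "object detection"]
STAGE_2_RESOLUTIONS = [(448, 448), (672, 672), (756, 756)]  # increased gradually

for res in STAGE_2_RESOLUTIONS:
    print(f"stage 2 substage at {res[0]}x{res[1]}: {STAGE_2_OBJECTIVES}")
```

Freezing the visual encoder in stage 1 means only the language side adapts to the mixture at base resolution; the high-resolution substages then specialize the OCR/detection behavior.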

Result

image

Per-task finetuning

image

Multi-task finetuning

image

Few-shot performance

image

Zero-shot detection

image