TL;DR
- I read this because : it’s been mentioned a lot, so I thought I’d give it a try.
- task : object detection, captioning, VQA, text VQA, …
- problem : Let’s grow VLMs like we did LLMs.
- idea : scale the vision encoder from ViT-e (3.9B) to ViT-22B (22B) and the language model from mT5-XXL (13B) to UL2 (35B)
- input/output : image + text -> text (or visual tokens for the BEiT-style objective)
- architecture : ViT + UL2 encoder-decoder; image patch embeddings are fed into the encoder together with the text input.
- objective : (a) span corruption (b) split-captioning (c) captioning (d) VQA (e) VQG (f) object-aware VQA (g) captioning on Episodic WebLI (h) pix2struct objective (i) captioning on short videos (j) BEiT-like image-token prediction
- baseline : PaLI, Flamingo, GIT
- data : CC3M, WebLI (introduced in PaLI), VQ2A-CC3M, …
- evaluation : each…
- result : SOTA after fine-tuning on 25+ VLM benchmarks.
- contribution : scaling PaLI
- etc. :
Details
Related Work
- Mixture of Denoisers
proposed in UL2. https://arxiv.org/pdf/2205.05131.pdf
A method that trains multiple pretraining objectives at once, each tagged with a prefix so the model knows which behavior to apply. It’s still a single model — not necessarily multiple expert architectures like MoE.
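A toy illustration of the idea (not UL2’s actual code; the span lengths and sentinel tokens are simplified): each training example is corrupted by one of several denoisers and tagged with its mode prefix ([R]/[S]/[X] in UL2), so one model learns all of them.

```python
import random

MODE_PREFIXES = {"regular_span": "[R]", "sequential": "[S]", "extreme_span": "[X]"}

def make_denoising_example(tokens, mode):
    if mode == "sequential":
        # prefix-LM style denoiser: keep a prefix as input, predict the continuation
        cut = random.randint(1, len(tokens) - 1)
        inputs, targets = tokens[:cut], tokens[cut:]
    else:
        # span corruption: mask one contiguous span (short for [R], long for [X])
        span = max(1, int(len(tokens) * (0.15 if mode == "regular_span" else 0.5)))
        start = random.randint(0, len(tokens) - span)
        inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
        targets = ["<extra_id_0>"] + tokens[start:start + span]
    return [MODE_PREFIXES[mode]] + inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(make_denoising_example(tokens, "regular_span"))
print(make_denoising_example(tokens, "sequential"))
```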
These papers tend to focus a bit more on multilingual settings.
It looks like the visual tokens are just fed directly as input. No pooling?
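A minimal sketch of how that wiring might look, assuming no pooling (module and parameter names here are mine, not from the paper): patch features from the ViT are projected to the language model’s width and concatenated with the embedded text tokens as the encoder input.

```python
import torch
import torch.nn as nn

class ToyPaLI(nn.Module):
    """Toy PaLI-style model: ViT patch features + text tokens -> seq2seq text."""
    def __init__(self, vit, vit_dim, text_embed, encoder_decoder, d_model):
        super().__init__()
        self.vit = vit                           # vision encoder (ViT-22B in the paper)
        self.text_embed = text_embed             # token embedding of the language model
        self.proj = nn.Linear(vit_dim, d_model)  # map patch features to the LM width
        self.encoder_decoder = encoder_decoder   # UL2-style encoder-decoder transformer

    def forward(self, image, input_ids, decoder_input_ids):
        patch_feats = self.vit(image)              # (B, num_patches, vit_dim)
        visual_tokens = self.proj(patch_feats)     # one token per patch, no pooling
        text_tokens = self.text_embed(input_ids)   # (B, text_len, d_model)
        encoder_input = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.encoder_decoder(encoder_input, decoder_input_ids)  # text logits
```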
Training ViT-e: scaling gains were marginal on ImageNet, but significant on multimodal tasks.
PaLI’s full size is this
For the same increase in parameter count, scaling the visual model gave a larger performance improvement than scaling the language model.
WebLI is a multimodal dataset built to cover images found on the web.
- 10 billion images and 12 billion alt-texts
- from English-only datasets to 109 languages
- uses a publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs
So it’s basically alt-text + OCR per image… not M3W-style interleaved documents after all.
ablation for each objective
mixing ratio
Limitations: 1) multilingual ability is lost on some benchmarks after fine-tuning with English only; 2) it’s unclear whether synonyms are evaluated well, since the benchmarks are in English.
PreSTU: pre-training for scene-text understanding https://arxiv.org/pdf/2209.05534.pdf Proposes a ‘split-OCR’ task: feed the OCR text up to the m-th token as input and predict the OCR text from the (m+1)-th token onward.
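A small sketch of how that split might be built (my reading of the task, not PreSTU’s code): pick an index m, give the prefix to the encoder alongside the image, and make the model generate the rest.

```python
import random

def split_ocr_example(ocr_tokens, m=None):
    """Split OCR text at position m: the prefix is given, the rest must be generated."""
    if m is None:
        m = random.randint(1, len(ocr_tokens) - 1)
    source = ocr_tokens[:m]   # fed to the encoder alongside the image
    target = ocr_tokens[m:]   # the model has to read these off the image
    return source, target

src, tgt = split_ocr_example("GRAND OPENING SALE 50% OFF EVERYTHING".split())
print(src, "->", tgt)
```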
object-aware task https://arxiv.org/pdf/2209.04372.pdf
Asking questions like “Are there specific objects in this image?”
Adding the object-aware objective improved overall performance on visual question answering, visual entailment, and captioning.
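A hedged sketch of how such object-aware QA pairs could be built from detected object labels (the templates and negative sampling here are my assumption, not copied from the paper):

```python
import random

def object_aware_qa(present_objects, vocabulary, num_negatives=1):
    """Build yes/no questions about objects, mixing in absent objects as negatives."""
    qa_pairs = [(f"Are there {obj} in this image?", "yes") for obj in present_objects]
    absent = [obj for obj in vocabulary if obj not in present_objects]
    for obj in random.sample(absent, k=min(num_negatives, len(absent))):
        qa_pairs.append((f"Are there {obj} in this image?", "no"))
    return qa_pairs

print(object_aware_qa({"dogs", "frisbees"}, ["dogs", "frisbees", "cats", "cars"]))
```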
The QA pairs are said to be created with the VQ2A method https://arxiv.org/pdf/2205.01883.pdf — a way to create QA from captions.
The configuration looks like this
- candidate answer extraction : POS-based
- question generation : a T5-XXL model further fine-tuned on SQuAD 1.1
- question-answer filtering : answer the generated question with a QA model (T5-XXL further fine-tuned on SQuAD 1.1 and Natural Questions); if its answer does not match the candidate answer that was fed to the question-generation model, the generated question is discarded.
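Putting the pipeline together, roughly (the three model calls are placeholders, not a real API): extract candidate answers from the caption, generate one question per candidate, and keep only the questions whose round-trip answer matches the candidate.

```python
def vq2a(caption, extract_candidates, generate_question, answer_question):
    """Generate (question, answer) pairs from a caption with round-trip filtering."""
    qa_pairs = []
    for candidate in extract_candidates(caption):           # POS-based answer spans
        question = generate_question(caption, candidate)     # question-generation model
        roundtrip = answer_question(caption, question)        # independent QA model
        if roundtrip.strip().lower() == candidate.strip().lower():
            qa_pairs.append((question, candidate))            # keep only consistent pairs
    return qa_pairs
```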
least-to-most prompting
The idea mirrors how students thrive when a difficult question is broken into smaller chunks and they are prompted to work through them toward the answer. I’m not sure whether this amounts to tuning for CoT, learning to decompose, or something else.
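An illustrative least-to-most prompt (made-up content, just to show the shape): the hard question is first decomposed into sub-questions, and each sub-answer is appended to the context before the next question is asked.

```python
# Stage 1: ask the model to decompose the hard question.
decomposition = (
    "Q: How many minutes are left in a 90-minute movie after watching 1 hour?\n"
    "To answer this, we first need to answer: How many minutes is 1 hour?\n"
)
# Stage 2: answer the sub-question, then the original question with that answer in context.
solution = (
    "Q: How many minutes is 1 hour?\n"
    "A: 60 minutes.\n"
    "Q: How many minutes are left in a 90-minute movie after watching 1 hour?\n"
    "A: 90 - 60 = 30 minutes.\n"
)
print(decomposition + solution)
```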
Model
Training objectives
Training procedure
In stage 1, the visual encoder (after mixed-objective training) is kept frozen, while the rest of the parameters are trained on a total of 2.2B examples at the base resolution 224×224 (native to ViT-22B), using the entire mixture. In stage 2, it continues training using only the OCR-related objectives (pix2struct and split-ocr) plus the object detection objective; this is done in several substages, during which image resolution is gradually increased to 448×448, 672×672 and finally 756×756.
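A very loose sketch of that two-stage schedule (helper and attribute names are hypothetical, not the paper’s code): stage 1 freezes the ViT and trains on the full mixture at the base resolution; stage 2 trains only the OCR-related and detection objectives while stepping the resolution up through the substages.

```python
def train_pali_x(model, full_mixture, ocr_and_detection, train_on):
    """Rough two-stage schedule: frozen ViT on the full mixture, then OCR-heavy
    objectives while the image resolution is stepped up."""
    # Stage 1: freeze the visual encoder, train everything else at 224x224.
    for p in model.vit.parameters():
        p.requires_grad = False
    train_on(model, tasks=full_mixture, resolution=224)

    # Stage 2: only OCR-related (pix2struct, split-OCR) + detection objectives,
    # with the image resolution increased in substages.
    for resolution in (448, 672, 756):
        train_on(model, tasks=ocr_and_detection, resolution=resolution)
```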