TL;DR
- I read this because : it’s been mentioned a lot, so I thought I’d give it a try.
- task : object detection, captioning, VQA, text VQA, …
- problem : Let’s grow VLMs like we did LLMs.
- idea : scale the vision encoder from ViT-e (3.9B) to ViT-22B (22B) and the language model from mT5-XXL (13B) to UL2 (35B)
- input/output : image + text -> text (or visual tokens for the BEiT-style objective)
- architecture : ViT + UL2 encoder-decoder; image patch embeddings are fed into the encoder together with the text input.
- objective : (a) span corruption (b) split-captioning (c) captioning (d) VQA (e) VQG (f) object-aware VQA (g) captioning on Episodic WebLI (h) pix2struct objective (i) captioning on short videos (j) BEiT-like image-token prediction
- baseline : PaLI, Flamingo, GIT
- data : CC3M, WebLI (introduced in PaLI), VQ2A-CC3M, …
- evaluation : each…
- result : SOTA after fine-tuning on 25+ VLM benchmarks.
- contribution : scaling PaLI
- etc. :
Details
Related Work
- Mixture of Denoisers
proposed in UL2. https://arxiv.org/pdf/2205.05131.pdf
A method that trains multiple pretraining objectives at once, each tagged with a prefix so the model knows which behavior to apply. It’s still a single model — not necessarily multiple expert architectures like MoE.
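A toy illustration of the idea (not UL2’s actual code; the span lengths and sentinel tokens are simplified): each training example is corrupted by one of several denoisers and tagged with its mode prefix ([R]/[S]/[X] in UL2), so one model learns all of them.

```python
import random

MODE_PREFIXES = {"regular_span": "[R]", "sequential": "[S]", "extreme_span": "[X]"}

def make_denoising_example(tokens, mode):
    if mode == "sequential":
        # prefix-LM style denoiser: keep a prefix as input, predict the continuation
        cut = random.randint(1, len(tokens) - 1)
        inputs, targets = tokens[:cut], tokens[cut:]
    else:
        # span corruption: mask one contiguous span (short for [R], long for [X])
        span = max(1, int(len(tokens) * (0.15 if mode == "regular_span" else 0.5)))
        start = random.randint(0, len(tokens) - span)
        inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
        targets = ["<extra_id_0>"] + tokens[start:start + span]
    return [MODE_PREFIXES[mode]] + inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(make_denoising_example(tokens, "regular_span"))
print(make_denoising_example(tokens, "sequential"))
```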
These papers tend to focus a bit more on multilingual settings.
It looks like the visual tokens are just fed directly as input. No pooling?
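A minimal sketch of how that wiring might look, assuming no pooling (module and parameter names here are mine, not from the paper): patch features from the ViT are projected to the language model’s width and concatenated with the embedded text tokens as the encoder input.

```python
import torch
import torch.nn as nn

class ToyPaLI(nn.Module):
    """Toy PaLI-style model: ViT patch features + text tokens -> seq2seq text."""
    def __init__(self, vit, vit_dim, text_embed, encoder_decoder, d_model):
        super().__init__()
        self.vit = vit                           # vision encoder (ViT-22B in the paper)
        self.text_embed = text_embed             # token embedding of the language model
        self.proj = nn.Linear(vit_dim, d_model)  # map patch features to the LM width
        self.encoder_decoder = encoder_decoder   # UL2-style encoder-decoder transformer

    def forward(self, image, input_ids, decoder_input_ids):
        patch_feats = self.vit(image)              # (B, num_patches, vit_dim)
        visual_tokens = self.proj(patch_feats)     # one token per patch, no pooling
        text_tokens = self.text_embed(input_ids)   # (B, text_len, d_model)
        encoder_input = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.encoder_decoder(encoder_input, decoder_input_ids)  # text logits
```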
Training ViT-e: scaling gains were marginal on ImageNet, but significant on multimodal tasks.
PaLI’s full size is this
For the same increase in parameter count, scaling the visual model gave a larger performance improvement than scaling the language model.
WebLI is a multimodal dataset built to cover images found on the web.
- 10 billion images and 12 billion alt-texts
- from English-only datasets to 109 languages
- uses a publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs
So it’s basically alt-text + OCR per image… not M3W-style interleaved documents after all.
ablation for each objective
mixing ratio
Limitations: 1) multilingual ability is lost on some benchmarks after fine-tuning with English only; 2) it’s unclear whether synonyms are evaluated well, since the benchmarks are in English.
PreSTU: pre-training for scene-text understanding https://arxiv.org/pdf/2209.05534.pdf Proposes a ‘split-OCR’ task: feed the OCR text up to the m-th token as input and predict the OCR text from the (m+1)-th token onward.
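A small sketch of how that split might be built (my reading of the task, not PreSTU’s code): pick an index m, give the prefix to the encoder alongside the image, and make the model generate the rest.

```python
import random

def split_ocr_example(ocr_tokens, m=None):
    """Split OCR text at position m: the prefix is given, the rest must be generated."""
    if m is None:
        m = random.randint(1, len(ocr_tokens) - 1)
    source = ocr_tokens[:m]   # fed to the encoder alongside the image
    target = ocr_tokens[m:]   # the model has to read these off the image
    return source, target

src, tgt = split_ocr_example("GRAND OPENING SALE 50% OFF EVERYTHING".split())
print(src, "->", tgt)
```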
object-aware task https://arxiv.org/pdf/2209.04372.pdf
Asking questions like “Are there specific objects in this image?”
Adding the object-aware objective improved overall performance on visual question answering, visual entailment, and captioning.
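A hedged sketch of how such object-aware QA pairs could be built from detected object labels (the templates and negative sampling here are my assumption, not copied from the paper):

```python
import random

def object_aware_qa(present_objects, vocabulary, num_negatives=1):
    """Build yes/no questions about objects, mixing in absent objects as negatives."""
    qa_pairs = [(f"Are there {obj} in this image?", "yes") for obj in present_objects]
    absent = [obj for obj in vocabulary if obj not in present_objects]
    for obj in random.sample(absent, k=min(num_negatives, len(absent))):
        qa_pairs.append((f"Are there {obj} in this image?", "no"))
    return qa_pairs

print(object_aware_qa({"dogs", "frisbees"}, ["dogs", "frisbees", "cats", "cars"]))
```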
The QA pairs are said to be created with the VQ2A method https://arxiv.org/pdf/2205.01883.pdf — a way to create QA from captions.
The configuration looks like this
- candidate answer extraction : POS-based
- question generation : a T5-XXL model further fine-tuned on SQuAD 1.1
- question-answer filtering : answer the generated question with a QA model (T5-XXL further fine-tuned on SQuAD 1.1 and Natural Questions); if its answer does not match the candidate answer that was fed to the question-generation model, the generated question is discarded.
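Putting the pipeline together, roughly (the three model calls are placeholders, not a real API): extract candidate answers from the caption, generate one question per candidate, and keep only the questions whose round-trip answer matches the candidate.

```python
def vq2a(caption, extract_candidates, generate_question, answer_question):
    """Generate (question, answer) pairs from a caption with round-trip filtering."""
    qa_pairs = []
    for candidate in extract_candidates(caption):           # POS-based answer spans
        question = generate_question(caption, candidate)     # question-generation model
        roundtrip = answer_question(caption, question)        # independent QA model
        if roundtrip.strip().lower() == candidate.strip().lower():
            qa_pairs.append((question, candidate))            # keep only consistent pairs
    return qa_pairs
```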
least-to-most prompting
The idea mirrors how students thrive when a difficult question is broken into smaller chunks and they are prompted to work through them toward the answer. I’m not sure whether this amounts to tuning for CoT, learning to decompose, or something else.
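An illustrative least-to-most prompt (made-up content, just to show the shape): the hard question is first decomposed into sub-questions, and each sub-answer is appended to the context before the next question is asked.

```python
# Stage 1: ask the model to decompose the hard question.
decomposition = (
    "Q: How many minutes are left in a 90-minute movie after watching 1 hour?\n"
    "To answer this, we first need to answer: How many minutes is 1 hour?\n"
)
# Stage 2: answer the sub-question, then the original question with that answer in context.
solution = (
    "Q: How many minutes is 1 hour?\n"
    "A: 60 minutes.\n"
    "Q: How many minutes are left in a 90-minute movie after watching 1 hour?\n"
    "A: 90 - 60 = 30 minutes.\n"
)
print(decomposition + solution)
```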
Model
Training objectives
Training procedure
In stage 1, the visual encoder (after mixed-objective training) is kept frozen, while the rest of the parameters are trained on a total of 2.2B examples at the base resolution 224×224 (native to ViT-22B), using the entire mixture. In stage 2, it continues training using only the OCR-related objectives (pix2struct and split-ocr) plus the object detection objective; this is done in several substages, during which image resolution is gradually increased to 448×448, 672×672 and finally 756×756.
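A very loose sketch of that two-stage schedule (helper and attribute names are hypothetical, not the paper’s code): stage 1 freezes the ViT and trains on the full mixture at the base resolution; stage 2 trains only the OCR-related and detection objectives while stepping the resolution up through the substages.

```python
def train_pali_x(model, full_mixture, ocr_and_detection, train_on):
    """Rough two-stage schedule: frozen ViT on the full mixture, then OCR-heavy
    objectives while the image resolution is stepped up."""
    # Stage 1: freeze the visual encoder, train everything else at 224x224.
    for p in model.vit.parameters():
        p.requires_grad = False
    train_on(model, tasks=full_mixture, resolution=224)

    # Stage 2: only OCR-related (pix2struct, split-OCR) + detection objectives,
    # with the image resolution increased in substages.
    for resolution in (448, 672, 756):
        train_on(model, tasks=ocr_and_detection, resolution=resolution)
```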