[127] Linearly Mapping from Image to Text Space

TL;DR

I read this because.. : Is CLIP pretrained better when attached to LM? Is a vision backbone trained with image only better in VLM?
task : image captioning, VQA
problem : Which pretrained vision backbone is good in VLM?
idea : Let’s learn only linear maps, which is a harsher setting than Frozen and MAGMA, and extract the performance -> LIMBeR
input/output : image, task query, (optional) question
architecture : (vision) CLIP RN50x16, NFRN50, BEiT-Large (language) GPT-J(6billion) (linear map) 4096 dim projection
objective : language model loss
baseline : tuning MAGMA, Blind (image security), NFRN50
data : (train) CC3M -> (eval) NoCaps, COCO, VQAv2
evaluation : CIDEr-D, CLIP-S, Ref-S, {0,1,2,4}-shot accuracy
result : Often performs better than more trained MAGMA. freeze is sufficient.
contribution : ablation for multiple vision backbones.
etc. :

Details

The architecture itself is simple! vision backbone coarse feature map with linear projection and prefix it like a soft prompt in lm to learn vlm. The point is to learn only linear projection.
This is where performance analysis is fun

MAGMA trains adapters on vision backbone + LM with similar architecture, but the proposed LiMBER often performs better than this.
Among the vision backbone, CLIP has language supervision, BEiT has no language supervision at all (self-supervision), and NFRNet50 is ImageNet22K, so it can be said that it is in the middle (classification, but after all, the classification is based on WordNet (?), so it can be said that it has language supervision indirectly), and CLIP is the best.
BEiT is the most interesting, especially for VQA {1,2,4}-shot, which performs worse than blind (VQA without seeing any images). It’s better than random NFRNet, but hardly helpful.
However, if you attach BEiT to the decoder and attach BEiT-FT with additional training to image classification (what data did you use?), it may outperform CLIP -> After all, self-supervision systems such as MAE or BEiT need to be finetuned to the downstream task.

c.f.

In the MAE paper, linear probing performed worse than MoCo trained with InfoNCE loss, which is slightly closer to classification. -> but it gets even better when finetuning layers but… Masked Autoencoding Does Not Help Natural Language Supervision at Scale This paper also. In CLIP, doing MAE helps at million scale, but makes it worse at billion -> In the end, self-supervision shines for small amounts of data, but when you have a large corpus like clips, why not?

BEiT failure cases

There are multiple caption metrics, but the dominance of vision backbone is consistent

TL;DR#

Details#

TL;DR

Details