image

paper

TL;DR

  • task : personalized vision and language => personalized image retrieval/object detection/segmentation
  • problem : We want to learn user-specific objects efficiently, but adding adapters to CLIP degrades performance on previously learned classes.
  • idea : Learn a new concept by adding it to the vocabulary as a new word. To do this, 1) learn an inversion function that maps an image to the input word embedding that would produce it, 2) initialize the new concept’s word embedding by passing a few images of the concept through this inversion function, and 3) refine the embedding using the textual information about the new concept.
  • architecture : CLIP
  • objective : Pull the text embedding of “A photo of a [new vocab]” toward the image-encoder embeddings of the concept’s photos, while pushing it away from the text embedding of the concept’s super-concept.
  • baseline : Adapter, text-only CLIP, COLLIE
  • data : YouTube-VOS, DeepFashion2 (personalized benchmarks built on both are introduced in this paper)
  • result : SOTA
  • contribution : Propose a new task. Efficient architecture!
  • Limitation or something I don’t understand: I should probably re-read CLIP. Also look into Deep Sets.

Details

new setup, personalized vision & language

image
  • A new sentence S and image I are fed into the pretrained model h(S, I).
  • We want to add a new concept C so that the model can work with the extended vocabulary V’ = V ∪ C.
  • The learner is given a few images of concept C plus a short text description of the new concept (e.g., “mug,” “short sleeve top”)
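The setup above can be sketched as follows (a minimal illustration with my own names, not the paper’s code): a CLIP-style scorer h(S, I) compares a sentence embedding with an image embedding, so personalization only has to make sentences containing the new token embed well.

```python
import torch
import torch.nn.functional as F

def h(sentence_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Score h(S, I): cosine similarity of sentence and image embeddings."""
    return F.cosine_similarity(sentence_emb, image_emb, dim=0)

def retrieve(sentence_emb: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """Rank gallery images (n, d) by similarity to the sentence embedding,
    most similar first — i.e. personalized image retrieval."""
    scores = F.cosine_similarity(gallery, sentence_emb.unsqueeze(0))
    return torch.argsort(scores, descending=True)
```

Once the new concept’s word embedding is learned, retrieval with a personalized sentence is just this ranking; nothing else in the frozen model changes.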

Adaptor vs add new vocab

image

If we fine-tune or add adapters instead of adding new vocabulary, the encoder outputs for old classes get distorted (forgetting). The method instead starts from the assumption that the pretrained word-embedding space is expressive enough to represent new concepts.
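The vocabulary-expansion idea can be sketched like this (a hypothetical minimal version, not the paper’s implementation): append one new row to a frozen embedding table and let gradients flow only into that row, so old word representations are untouched by construction.

```python
import torch
import torch.nn as nn

def add_new_vocab(embedding: nn.Embedding, init_vec: torch.Tensor) -> nn.Embedding:
    """Return a new embedding table with `init_vec` appended as the last row.

    Old rows are copied unchanged, and a gradient hook zeroes their updates,
    so only the new concept's embedding is trained.
    """
    old_weight = embedding.weight.data                      # (V, d)
    new_weight = torch.cat([old_weight, init_vec.view(1, -1)], dim=0)
    expanded = nn.Embedding(new_weight.size(0), new_weight.size(1))
    expanded.weight.data.copy_(new_weight)
    # Mask gradients so previously learned words cannot drift.
    expanded.weight.register_hook(
        lambda g: torch.cat([torch.zeros_like(g[:-1]), g[-1:]], dim=0))
    return expanded
```

Here `init_vec` would come from the inversion function described below; the new concept simply gets the next free token id.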

Architecture

image

Learn the inverse mapping function with a DeepSets network, which is permutation-invariant over the set of input image embeddings.
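A DeepSets-style inverter can be sketched as below (layer sizes and names are my own assumptions): a per-image network φ, a mean pool over the set, then a network ρ that outputs one word embedding. The mean makes the result independent of the order of the few concept images.

```python
import torch
import torch.nn as nn

class DeepSetInverter(nn.Module):
    """Map a *set* of image embeddings for one concept to a word embedding."""

    def __init__(self, img_dim: int = 512, word_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, word_dim))

    def forward(self, image_embs: torch.Tensor) -> torch.Tensor:
        # image_embs: (n_images, img_dim), an unordered set.
        pooled = self.phi(image_embs).mean(dim=0)   # order-independent pooling
        return self.rho(pooled)                     # (word_dim,)
```

The output is used to initialize the new vocabulary entry before refinement with the loss below.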

Loss

image
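A hedged sketch of the objective described in the TL;DR (not the paper’s exact formula, and the margin is my own assumption): pull the text embedding of “A photo of a [new vocab]” toward the concept’s image embeddings, and push it away from the super-concept’s text embedding.

```python
import torch
import torch.nn.functional as F

def personalization_loss(concept_text: torch.Tensor,  # (d,) sentence with new token
                         image_embs: torch.Tensor,    # (n, d) embedded concept photos
                         super_text: torch.Tensor,    # (d,) super-concept sentence
                         margin: float = 0.2) -> torch.Tensor:
    # Pull term: average cosine distance to the concept's images.
    pull = 1.0 - F.cosine_similarity(image_embs, concept_text.unsqueeze(0)).mean()
    # Push term: penalize similarity to the super-concept beyond a margin.
    push = F.relu(F.cosine_similarity(concept_text, super_text, dim=0) - margin)
    return pull + push
```

Only the new word embedding (and the inverter) would receive gradients from this loss; the CLIP encoders stay frozen.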