image

paper

TL;DR

  • task : personalized vision and language => personalized image retrieval/object detection/segmentation
  • problem : We want to learn user-specific objects efficiently, but adding adapters to CLIP degrades performance on previously learned classes.
  • idea : Learn a new concept by adding it to the vocabulary as a new word. To do this, 1) learn an inversion function that maps an image to the input word embedding that would produce it, 2) initialize the new concept’s word embedding by passing a few images of the concept through this inversion function, and 3) refine the embedding using the textual information about the new concept.
  • architecture : CLIP
  • objective : Pull the text embedding of “A photo of a [new vocab]” toward the image-encoder embeddings of the concept’s photos, while pushing it away from the text embedding of the concept’s super-concept.
  • baseline : Adapter, text-only CLIP, COLLIE
  • data : YouTube-VOS, DeepFashion2 (personalized benchmarks built on both are introduced in this paper)
  • result : SOTA
  • contribution : Propose a new task. Efficient architecture!
  • Limitation or something I don’t understand: I should probably re-read CLIP. Also look into Deep Sets.

Details

new setup, personalized vision & language

image
  • A new sentence S and image I are fed into the pretrained model h(S, I).
  • We want to add a new concept C so that the model can work with the extended vocabulary V’ = V ∪ C.
  • The learner is given a few images of concept C plus a short text description of the new concept (e.g., “mug,” “short sleeve top”)
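The setup above can be sketched as follows (a minimal illustration with my own names, not the paper’s code): a CLIP-style scorer h(S, I) compares a sentence embedding with an image embedding, so personalization only has to make sentences containing the new token embed well.

```python
import torch
import torch.nn.functional as F

def h(sentence_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Score h(S, I): cosine similarity of sentence and image embeddings."""
    return F.cosine_similarity(sentence_emb, image_emb, dim=0)

def retrieve(sentence_emb: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """Rank gallery images (n, d) by similarity to the sentence embedding,
    most similar first — i.e. personalized image retrieval."""
    scores = F.cosine_similarity(gallery, sentence_emb.unsqueeze(0))
    return torch.argsort(scores, descending=True)
```

Once the new concept’s word embedding is learned, retrieval with a personalized sentence is just this ranking; nothing else in the frozen model changes.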

Adaptor vs add new vocab

image

If we fine-tune or add adapters instead of adding new vocabulary, the encoder outputs for old classes get distorted (forgetting). The method instead starts from the assumption that the pretrained word-embedding space is expressive enough to represent new concepts.
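The vocabulary-expansion idea can be sketched like this (a hypothetical minimal version, not the paper’s implementation): append one new row to a frozen embedding table and let gradients flow only into that row, so old word representations are untouched by construction.

```python
import torch
import torch.nn as nn

def add_new_vocab(embedding: nn.Embedding, init_vec: torch.Tensor) -> nn.Embedding:
    """Return a new embedding table with `init_vec` appended as the last row.

    Old rows are copied unchanged, and a gradient hook zeroes their updates,
    so only the new concept's embedding is trained.
    """
    old_weight = embedding.weight.data                      # (V, d)
    new_weight = torch.cat([old_weight, init_vec.view(1, -1)], dim=0)
    expanded = nn.Embedding(new_weight.size(0), new_weight.size(1))
    expanded.weight.data.copy_(new_weight)
    # Mask gradients so previously learned words cannot drift.
    expanded.weight.register_hook(
        lambda g: torch.cat([torch.zeros_like(g[:-1]), g[-1:]], dim=0))
    return expanded
```

Here `init_vec` would come from the inversion function described below; the new concept simply gets the next free token id.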

Architecture

image

Learn the inverse mapping function with a DeepSets network, which is permutation-invariant over the set of input image embeddings.
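A DeepSets-style inverter can be sketched as below (layer sizes and names are my own assumptions): a per-image network φ, a mean pool over the set, then a network ρ that outputs one word embedding. The mean makes the result independent of the order of the few concept images.

```python
import torch
import torch.nn as nn

class DeepSetInverter(nn.Module):
    """Map a *set* of image embeddings for one concept to a word embedding."""

    def __init__(self, img_dim: int = 512, word_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, word_dim))

    def forward(self, image_embs: torch.Tensor) -> torch.Tensor:
        # image_embs: (n_images, img_dim), an unordered set.
        pooled = self.phi(image_embs).mean(dim=0)   # order-independent pooling
        return self.rho(pooled)                     # (word_dim,)
```

The output is used to initialize the new vocabulary entry before refinement with the loss below.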

Loss

image
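A hedged sketch of the objective described in the TL;DR (not the paper’s exact formula, and the margin is my own assumption): pull the text embedding of “A photo of a [new vocab]” toward the concept’s image embeddings, and push it away from the super-concept’s text embedding.

```python
import torch
import torch.nn.functional as F

def personalization_loss(concept_text: torch.Tensor,  # (d,) sentence with new token
                         image_embs: torch.Tensor,    # (n, d) embedded concept photos
                         super_text: torch.Tensor,    # (d,) super-concept sentence
                         margin: float = 0.2) -> torch.Tensor:
    # Pull term: average cosine distance to the concept's images.
    pull = 1.0 - F.cosine_similarity(image_embs, concept_text.unsqueeze(0)).mean()
    # Push term: penalize similarity to the super-concept beyond a margin.
    push = F.relu(F.cosine_similarity(concept_text, super_text, dim=0) - margin)
    return pull + push
```

Only the new word embedding (and the inverter) would receive gradients from this loss; the CLIP encoders stay frozen.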