image

paper, code

TL;DR

  • I read this because.. : It points out that a single image can be described by multiple texts. How should that ambiguity be handled? (e.g. Song Kang-ho, actor, man)
  • task : contrastive learning
  • problem : One image can be described by text at different levels of specificity (dog standing on snow, puppy, giggle~)
  • idea : Move CLIP's embedding space from Euclidean space to hyperbolic space.
  • input/output : image/text -> score
  • architecture : Same as CLIP
  • objective : contrastive + entailment loss
  • baseline : CLIP trained on YFCC-100M (by SLIP)
  • data : YFCC-100M
  • evaluation : image-text retrieval, zero-shot image classification
  • result : Improved performance. For a given image, the texts retrieved while traversing toward [ROOT] become progressively more generic.
  • contribution : Probably the first work with CLIP in hyperbolic space?
  • etc. :

Details

Motivation

image

Arch

image

Lifting embeddings onto the hyperboloid

After passing through the CLIP encoder, each image and text comes out as an n-dimensional vector $v_{enc}$, to which we append a zero component. The resulting $v = [v_{enc}, 0]\in\mathbb{R}^{n+1}$ lies in the tangent space at the hyperboloid origin O, because its Lorentzian inner product with O is zero. In the Lorentz model only the space components need to be computed explicitly, since the time component follows from the hyperboloid constraint. In that case, the exponential map (which projects vectors from the tangent space onto the manifold) simplifies as follows.

image

image

This means that if you take the embedding from the CLIP encoder and apply that transformation, you will end up in hyperbolic space.
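The lifting step can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the function names `exp_map_origin` and `lorentz_inner`, the fixed curvature parameter `curv` (the space has curvature `-curv`), and the `(space, time)` return layout are assumptions made for illustration. It uses the standard exponential map at the hyperboloid origin, $x_{space} = \frac{\sinh(\sqrt{c}\,\lVert v\rVert)}{\sqrt{c}\,\lVert v\rVert} v$, and recovers the time component from the constraint $\langle x, x\rangle_L = -1/c$.

```python
import math


def exp_map_origin(v_enc, curv=1.0):
    """Lift a Euclidean encoder output onto the Lorentz hyperboloid.

    Hypothetical helper: returns (x_space, x_time) for curvature -curv.
    """
    norm = math.sqrt(sum(a * a for a in v_enc))
    r = math.sqrt(curv) * norm
    # sinh(r) / r tends to 1 as r -> 0, so the zero vector maps to the origin.
    scale = math.sinh(r) / r if r > 1e-12 else 1.0
    x_space = [scale * a for a in v_enc]
    # The time component is fixed by the hyperboloid constraint.
    x_time = math.sqrt(1.0 / curv + sum(a * a for a in x_space))
    return x_space, x_time


def lorentz_inner(x_space, x_time, y_space, y_time):
    """Lorentzian inner product: <x_space, y_space> - x_time * y_time."""
    return sum(a * b for a, b in zip(x_space, y_space)) - x_time * y_time


x_space, x_time = exp_map_origin([0.3, -0.4], curv=1.0)
# A lifted point should satisfy <x, x>_L = -1 / curv up to rounding error.
self_inner = lorentz_inner(x_space, x_time, x_space, x_time)
print(self_inner)
```

Because only the space components are stored and the time component is recomputed on demand, the lifted embedding has the same dimensionality as the encoder output.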

The Lorentzian inner product is shown below. It gives the geodesic distance between two embeddings, and the negative distance serves as the similarity in the contrastive loss.

image
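To make the similarity concrete, here is a hedged sketch of a CLIP-style symmetric contrastive loss that uses the negative geodesic distance $d(x, y) = \frac{1}{\sqrt{c}}\cosh^{-1}(-c\,\langle x, y\rangle_L)$ as the logit. All function names are hypothetical, and the fixed `curv=1.0` and `temperature=0.07` are assumptions chosen for illustration rather than values taken from these notes.

```python
import math


def lift(v_enc, curv=1.0):
    """Exponential map at the origin; returns a (space, time) pair."""
    norm = math.sqrt(sum(a * a for a in v_enc))
    r = math.sqrt(curv) * norm
    scale = math.sinh(r) / r if r > 1e-12 else 1.0
    space = [scale * a for a in v_enc]
    time = math.sqrt(1.0 / curv + sum(a * a for a in space))
    return space, time


def lorentz_inner(x, y):
    """<x, y>_L = <x_space, y_space> - x_time * y_time."""
    (xs, xt), (ys, yt) = x, y
    return sum(a * b for a, b in zip(xs, ys)) - xt * yt


def lorentz_distance(x, y, curv=1.0):
    # Clamp to >= 1 because rounding can push the argument slightly below 1.
    arg = max(1.0, -curv * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(curv)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one row, using log-sum-exp for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def contrastive_loss(images, texts, curv=1.0, temperature=0.07):
    """Symmetric loss; images[k] and texts[k] are treated as the positive pair."""
    n = len(images)
    logits = [
        [-lorentz_distance(i, t, curv) / temperature for t in texts]
        for i in images
    ]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


images = [lift([1.0, 0.0]), lift([0.0, 1.0])]
texts = [lift([0.9, 0.1]), lift([0.1, 0.9])]
matched = contrastive_loss(images, texts)
mismatched = contrastive_loss(images, texts[::-1])
print(matched, mismatched)
```

In this toy batch the correctly paired ordering should give the lower loss, since each image is closer to its own text than to the other one.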

Entailment loss

image

The following loss is added to the contrastive loss. I'm not sure I fully understand the math, but my intuition is that for a {text, image} pair the text should entail the image, so the image embedding should fall inside the entailment cone of the text embedding.

image

image
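The intuition above can be sketched as code. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the loss takes the form $\max(0, \mathrm{ext}(x, y) - \mathrm{aper}(x))$, where $\mathrm{aper}(x) = \sin^{-1}\!\big(\tfrac{2K}{\sqrt{c}\,\lVert x_{space}\rVert}\big)$ is the half-aperture of the cone at the text embedding $x$ and $\mathrm{ext}(x, y)$ is the exterior angle to the image embedding $y$. The function names, the constant `min_radius=0.1` (standing in for $K$), and the `eps` clamps are illustrative assumptions.

```python
import math


def _norm(v):
    return math.sqrt(sum(a * a for a in v))


def half_aperture(x_space, curv=1.0, min_radius=0.1, eps=1e-8):
    """Half-aperture of the entailment cone at x; narrower farther from the origin."""
    arg = 2.0 * min_radius / (math.sqrt(curv) * _norm(x_space) + eps)
    return math.asin(min(1.0 - eps, arg))


def exterior_angle(x_space, y_space, curv=1.0, eps=1e-8):
    """Exterior angle at x between the origin-to-x geodesic and the x-to-y geodesic."""
    x_time = math.sqrt(1.0 / curv + sum(a * a for a in x_space))
    y_time = math.sqrt(1.0 / curv + sum(a * a for a in y_space))
    # c * <x, y>_L, with the time components derived from the space components.
    c_xyl = curv * (sum(a * b for a, b in zip(x_space, y_space)) - x_time * y_time)
    numer = y_time + c_xyl * x_time
    denom = _norm(x_space) * math.sqrt(max(c_xyl * c_xyl - 1.0, eps)) + eps
    ratio = max(-1.0 + eps, min(1.0 - eps, numer / denom))
    return math.acos(ratio)


def entailment_loss(text_space, image_space, curv=1.0):
    """Zero when the image lies inside the text's cone, positive otherwise."""
    gap = exterior_angle(text_space, image_space, curv) - half_aperture(text_space, curv)
    return max(0.0, gap)


text = [0.5, 0.0]
inside = entailment_loss(text, [2.0, 0.0])    # image farther out along the same ray
outside = entailment_loss(text, [-2.0, 0.0])  # image on the opposite side of the origin
print(inside, outside)
```

The first call places the image directly "behind" the text as seen from the origin, so it falls inside the cone and incurs no penalty. The second places it on the opposite side, so the penalty is positive.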

Results

image image
  • Text embeddings are more generic and more widely distributed than image embeddings
  • Image and text embeddings occupy completely separate regions of the space

Ablations

image image

Image Traverse

image