[132] Hyperbolic Image-Text Representations

paper , code

TL;DR

I read this because : it points out that one image can be described by several texts at different levels of specificity (e.g. Song Kang-ho, actor, man), and I was curious how that ambiguity is handled.
task : contrastive learning
problem : The text for a single image can be written at different levels of specificity (dog standing on snow, puppy, giggle~)
idea : Move CLIP's embedding space from Euclidean space to hyperbolic space.
input/output : image/text -> score
architecture : Same as CLIP
objective : contrastive + entailment loss
baseline : CLIP trained with YFCC-100M(by SLIP)
data : YFCC-100M
evaluation : image-text retrieval, zero-shot image classification
result : Improved performance. When traversing from an image embedding toward [ROOT], the retrieved texts become progressively more generic.
contribution : Probably the first work with CLIP in hyperbolic space?
etc. :

Details

Motivation

Arch

Lifting embeddings onto the hyperboloid

The CLIP encoders output an n-dimensional vector $v_{enc}$ for each image and text. To move it into hyperbolic space, first append a zero: $v =[v_{enc}, 0]\in\mathbb{R}^{n+1}$. This $v$ lies in the tangent space at the hyperboloid origin $O$, because its Lorentz inner product with $O$ is zero. Only the space components of the Lorentz model need to be computed; the time component follows from the hyperboloid constraint. With $v_{space} = v_{enc}$ and curvature $-c$, the exponential map (which projects vectors from the tangent space onto the manifold) at the origin reduces to:

$$x_{space} = \frac{\sinh(\sqrt{c}\,\lVert v_{space}\rVert)}{\sqrt{c}\,\lVert v_{space}\rVert}\, v_{space}$$

In other words, applying this transformation to the embedding from the CLIP encoder places it in hyperbolic space.
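The lifting step can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' code: the function name `exp_map_origin`, the `curv` parameter (the positive value $c$ for curvature $-c$), and the small-norm guard are assumptions made for illustration. The time component is recovered from the hyperboloid constraint $\langle x, x\rangle_{\mathcal{L}} = -1/c$.

```python
import math

def exp_map_origin(v_space, curv=1.0):
    """Lift a Euclidean encoder output onto the hyperboloid (hypothetical helper).

    v_space: list of floats from the encoder (the tangent vector at the origin,
             without the appended zero, which contributes nothing here).
    Returns (x_space, x_time).
    """
    norm = math.sqrt(sum(a * a for a in v_space))
    scaled = math.sqrt(curv) * norm
    # sinh(t) / t tends to 1 as t tends to 0, so guard the near-zero vector.
    factor = math.sinh(scaled) / scaled if scaled > 1e-8 else 1.0
    x_space = [factor * a for a in v_space]
    # Time component from the constraint <x, x>_L = -1/c.
    x_time = math.sqrt(1.0 / curv + sum(a * a for a in x_space))
    return x_space, x_time

x_space, x_time = exp_map_origin([0.3, -0.4], curv=1.0)
# Should be close to -1/c, confirming the point lies on the hyperboloid.
print(sum(a * a for a in x_space) - x_time ** 2)
```

In a real model the encoder outputs are usually rescaled before this step, since `sinh` grows exponentially with the input norm; that detail is left out here.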

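The Lorentzian inner product is $\langle x, y\rangle_{\mathcal{L}} = \langle x_{space}, y_{space}\rangle - x_{time}\,y_{time}$. Similarity between embeddings is computed from this inner product (the paper uses the negative Lorentzian distance, which is a function of it), and the contrastive loss is applied on top of that similarity.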
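To make the similarity computation concrete, here is a minimal sketch of the Lorentzian inner product and the geodesic distance derived from it, $d(x, y) = \sqrt{1/c}\,\cosh^{-1}(-c\,\langle x, y\rangle_{\mathcal{L}})$. The function names and the clamp against rounding error are assumptions for illustration; using the negative distance as the contrastive similarity follows the paper's description.

```python
import math

def time_component(space, curv=1.0):
    """Time coordinate implied by the hyperboloid constraint (hypothetical helper)."""
    return math.sqrt(1.0 / curv + sum(a * a for a in space))

def lorentz_inner(x_space, x_time, y_space, y_time):
    """<x, y>_L = <x_space, y_space> - x_time * y_time."""
    return sum(a * b for a, b in zip(x_space, y_space)) - x_time * y_time

def lorentz_distance(x_space, y_space, curv=1.0):
    """Geodesic distance between two points given by their space components."""
    x_time = time_component(x_space, curv)
    y_time = time_component(y_space, curv)
    inner = lorentz_inner(x_space, x_time, y_space, y_time)
    # In exact arithmetic -c * <x, y>_L >= 1; clamp to absorb rounding error.
    return math.acosh(max(1.0, -curv * inner)) / math.sqrt(curv)

def similarity(x_space, y_space, curv=1.0):
    """Similarity fed to the contrastive loss: negative distance."""
    return -lorentz_distance(x_space, y_space, curv)

root = [0.0, 0.0]
point = [3.0, 4.0]
print(lorentz_distance(root, point))  # distance of the point from the origin
print(similarity(point, point))       # a point is maximally similar to itself
```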

Entailment loss

An entailment loss is added to the contrastive loss. I'm not sure I fully understand the math, but my intuition is that for a matched {text, image} pair, the text should entail the image.
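As I understand the paper, each text embedding $x$ defines a cone that opens away from the origin, and the loss penalizes an image embedding $y$ only when it falls outside that cone: $\max(0, \mathrm{ext}(x, y) - \mathrm{aper}(x))$. The sketch below follows that reading. The function names, the `min_radius` value of 0.1, and the epsilon guards are assumptions for illustration, so treat it as a rough sketch rather than the reference implementation.

```python
import math

def time_component(space, curv=1.0):
    """Time coordinate implied by the hyperboloid constraint (hypothetical helper)."""
    return math.sqrt(1.0 / curv + sum(a * a for a in space))

def half_aperture(x_space, curv=1.0, min_radius=0.1):
    """Half-aperture of the cone at x; it narrows as x moves away from the origin."""
    norm = math.sqrt(sum(a * a for a in x_space))
    ratio = 2.0 * min_radius / (math.sqrt(curv) * norm + 1e-8)
    return math.asin(min(1.0, ratio))

def exterior_angle(x_space, y_space, curv=1.0):
    """Angle at x between the ray from the origin through x and the geodesic to y."""
    x_time = time_component(x_space, curv)
    y_time = time_component(y_space, curv)
    dot = sum(a * b for a, b in zip(x_space, y_space))
    c_xy = curv * (dot - x_time * y_time)  # c * <x, y>_L
    x_norm = math.sqrt(sum(a * a for a in x_space))
    numer = y_time + c_xy * x_time
    denom = x_norm * math.sqrt(max(c_xy ** 2 - 1.0, 1e-8)) + 1e-8
    cos_val = max(-1.0, min(1.0, numer / denom))
    return math.acos(cos_val)

def entailment_loss(text_space, image_space, curv=1.0):
    """Zero when the image lies inside the text's cone, positive otherwise."""
    angle = exterior_angle(text_space, image_space, curv)
    return max(0.0, angle - half_aperture(text_space, curv))

text = [1.0, 0.0]
print(entailment_loss(text, [3.0, 0.0]))   # image farther out along the same ray
print(entailment_loss(text, [-3.0, 0.0]))  # image on the opposite side of the origin
```

The first call should give no penalty and the second a large one, which matches the intuition that the image sits "under" the more generic text.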

Results

Text embeddings are more generic and more widely distributed than image embeddings.
The image and text embeddings occupy completely separate regions of the space.
