image

paper

TL;DR

  • I read this because.. : NeurIPS 2023, graph
  • task : multi-modal training -> image retrieval, VQA, Visual Entailment, Image Classification, GLUE
  • Problem :** CLIP is too simple with only two labels, “match” and “not matched”, which does not contain any semantic information between text and image.
  • idea :** CLIP + knowlege graph. takes a {head, relation, tail} triplet as input, not a text-image pair. Head or Tail can be either image or text.
  • architecture : Take the CLIP architecture, but without pooling, concat + Transformer Encoder stack to pull features
  • objective : Remove relation or tail (or head) from a triplet and make a prediction. 1) When removing relation, it is just a classification problem (E2R loss) 2) When removing tail, the representation of tail, head, and relation should be close to the same triplet (E2E Loss) 3) GNN is attached so that the representation of tail is similar to the representation of transformer after GNN (E2G Loss) 4) KL divergence with CLIP teacher leads to KD (KD Loss)
  • baseline : CLIP, UNITER, OSCAR, ViLT, … and more
  • data : VisualSem(WordNet + ImageNet), Visual Genome, ConceptNet, COCO Caption, CC3M
  • result : SOTA.
  • contribution : formulate data in the form of a triplet so that it can be CLIP trained.

Details

Motivation

image

Dataset

image

Additionally, for image-text pairs, you can arbitrarily specify a relation, such as is a image of, is a caption of, to make it a triplet.

Architecture

image image
  • f$ is a text or image encoder image
image image

Just index the representation for the relation

Loss

  • Triplet based loss Like mlm, we’re going to cover up some of the triplet elements and ask you to guess

E2E loss

If the entity (head or tail) is masked, estimate the loss as below image

Masking is just a 0 vector cat format image

To make the representation of a tail and the representation of a head, relation that is part of the same triplet as that tail, closer together.

E2R loss

Matching relation is just a matter of categorization image

  • Graph-based loss To make the entity representation similar between the GNN and transformer passes, we’ll use the image

Continuous Learning

KL Divergence with Results from Pretrained CLIPs image

Experiement setup

image

Result

Image Retrieval

image

VQA, SNLI_VE

image

snli_ve is said to be this data. image https://github.com/necla-ml/SNLI-VE

GLUE

image

Image Classification

image

Ablation

image

  • Better than CLIP + KG

Did you solve the problem from motivation?

image

We reran the evaluation only for VQAs with properties like color in VQA, and it performed better.