[97] Contrastive Language-Image Pre-Training with Knowledge Graph

TL;DR

I read this because.. : NeurIPS 2023, graph
task : multi-modal training -> image retrieval, VQA, Visual Entailment, Image Classification, GLUE
Problem : CLIP's supervision is too simple: each image-text pair only gets a binary label, "matched" or "not matched", which carries no semantic information about how the text and the image relate.
idea : CLIP + knowledge graph. The input is a {head, relation, tail} triplet rather than a text-image pair. The head and the tail can each be either an image or text.
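
A minimal sketch of the triplet input described above, to make the data format concrete. The names `Entity`, `Triplet`, and `pair_to_triplet`, the example relation labels, and the head/tail order chosen for an image-text pair are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Entity:
    """A head or tail entity; it can be either an image or a piece of text."""
    modality: Literal["image", "text"]
    content: str  # assumption: image path for images, raw string for text


@dataclass(frozen=True)
class Triplet:
    head: Entity
    relation: str
    tail: Entity


def pair_to_triplet(image_path: str, caption: str,
                    relation: str = "caption_of") -> Triplet:
    """Wrap an ordinary image-text pair as a triplet.

    The relation label "caption_of" is a placeholder chosen for this sketch.
    """
    return Triplet(Entity("text", caption), relation, Entity("image", image_path))


# A knowledge-graph style triplet with an image head and a text tail.
kg_triplet = Triplet(Entity("image", "dog.jpg"), "is_a", Entity("text", "animal"))

# A caption dataset sample expressed in the same format.
pair_triplet = pair_to_triplet("beach.jpg", "a beach at sunset")
```

With both kinds of data in one format, the same model and losses can in principle consume knowledge-graph triplets and plain caption pairs.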
architecture : Reuse the CLIP encoders but skip the pooling step, then concatenate the unpooled features and pass them through a stack of Transformer encoder layers to extract the final features.
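
A rough sketch of the "no pooling, then concat" step, assuming each encoder returns one feature vector per token or patch. The function name `concat_triplet_tokens` and the segment ids are assumptions added for illustration; the Transformer encoder stack that would consume the sequence is omitted.

```python
from typing import List, Tuple

Token = List[float]  # one feature vector per text token or image patch


def concat_triplet_tokens(head: List[Token], relation: List[Token],
                          tail: List[Token]) -> Tuple[List[Token], List[int]]:
    """Concatenate unpooled token features and record each token's segment.

    Segment ids (0 = head, 1 = relation, 2 = tail) are an assumption of this
    sketch, not something stated in the notes.
    """
    tokens = head + relation + tail
    segments = [0] * len(head) + [1] * len(relation) + [2] * len(tail)
    return tokens, segments


head_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # e.g. three image patches
rel_tokens = [[1.0, 0.0]]                           # e.g. one relation token
tail_tokens = [[0.0, 1.0], [0.5, 0.5]]              # e.g. two text tokens

tokens, segments = concat_triplet_tokens(head_tokens, rel_tokens, tail_tokens)
```

Keeping every token instead of a single pooled vector lets the later encoder layers attend across head, relation, and tail.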
objective : Remove the relation or the tail (or head) from a triplet and predict it. 1) When the relation is removed, this is just a classification problem over relations (E2R loss). 2) When the tail is removed, the representation predicted from the remaining head and relation should be close to the tail representation of the same triplet (E2E loss). 3) A GNN is attached so that the representation of the tail is similar to the Transformer representation after it has passed through the GNN (E2G loss). 4) A KL divergence against a CLIP teacher provides knowledge distillation (KD loss).
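
The four losses above can be sketched in plain Python as follows. This is an illustration under assumptions, not the paper's implementation: the contrastive form (InfoNCE over a batch), the temperature 0.07, the direction of the KL term, and the unweighted sum are all choices made for this sketch.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """E2R-style loss: classify the removed relation."""
    return -math.log(softmax(logits)[target])


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def contrastive_loss(queries: List[List[float]], keys: List[List[float]],
                     temperature: float = 0.07) -> float:
    """E2E / E2G-style loss: queries[i] should match keys[i], not other keys."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        losses.append(cross_entropy(logits, i))
    return sum(losses) / len(losses)


def kl_divergence(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KD-style loss: KL(teacher || student) over softmax distributions."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy batch of two triplets with 2-d embeddings (made-up numbers).
relation_logits, true_relation = [2.0, 0.5, -1.0], 0
predicted_tail = [[1.0, 0.1], [0.0, 1.0]]   # from head + relation
actual_tail = [[1.0, 0.0], [0.1, 1.0]]      # encoded tail entities
gnn_output = [[0.9, 0.1], [0.2, 0.8]]       # after the GNN

total = (cross_entropy(relation_logits, true_relation)        # E2R
         + contrastive_loss(predicted_tail, actual_tail)      # E2E
         + contrastive_loss(predicted_tail, gnn_output)       # E2G
         + kl_divergence([1.0, 0.2], [0.8, 0.3]))             # KD
```

In practice each term would likely carry its own weight; equal weights are used here only to keep the sketch short.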
baseline : CLIP, UNITER, OSCAR, ViLT, … and more
data : VisualSem(WordNet + ImageNet), Visual Genome, ConceptNet, COCO Caption, CC3M
result : SOTA.
contribution : formulates the data as triplets so that knowledge-graph data can be used for CLIP-style training.