image

paper

TL;DR

  • I read this because.. : NeurIPS 2023, graph
  • task : multi-modal training -> image retrieval, VQA, Visual Entailment, Image Classification, GLUE
  • problem : CLIP supervises image-text pairs with only two labels, “match” and “not matched”, which is too coarse to capture the semantic relations between text and images
  • idea : CLIP + knowledge graph. The input is a {head, relation, tail} triplet rather than a text-image pair. The head and tail can each be either an image or a text.
  • architecture : keeps the CLIP architecture, but instead of pooling the encoder outputs, concatenates them and stacks a Transformer encoder on top to extract features
  • objective : mask the relation or the tail (or head) of a triplet and predict it. 1) When the relation is masked, it is a plain classification problem (E2R loss). 2) When the tail is masked, the tail representation and the representation of the head and relation from the same triplet are pulled together (E2E loss). 3) A GNN is attached so that the entity representation from the GNN and the one from the Transformer become similar (E2G loss). 4) Knowledge distillation from a CLIP teacher via KL divergence (KD loss).
  • baseline : CLIP, UNITER, OSCAR, ViLT, … and many others
  • data : VisualSem(WordNet + ImageNet), Visual Genome, ConceptNet, COCO Caption, CC3M
  • result : SOTA.
  • contribution : a formulation that lets CLIP be trained on triplet-form data.

Details

Motivation

image

Dataset

image

In addition, image-text pairs are turned into triplets by assigning an ad hoc relation such as is a image of or is a caption of.
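The conversion of image-text pairs into triplets can be sketched as follows. This is only an illustration: the `Triplet` type, the `pair_to_triplets` helper, and the direction assigned to each relation (image as head for "is a image of", caption as head for "is a caption of") are assumptions, not details confirmed by the paper.

```python
from typing import List, NamedTuple


class Triplet(NamedTuple):
    """A knowledge-graph style sample; head and tail may be an image or a text."""
    head: str
    relation: str
    tail: str


def pair_to_triplets(image_id: str, caption: str) -> List[Triplet]:
    """Wrap one image-text pair as two triplets with hand-assigned relations.

    Assumption: the image is the head of "is a image of" and the caption is
    the head of "is a caption of".
    """
    return [
        Triplet(image_id, "is a image of", caption),
        Triplet(caption, "is a caption of", image_id),
    ]


triplets = pair_to_triplets("coco_000001.jpg", "a dog running on the beach")
```

With this wrapping, plain caption datasets such as COCO Caption or CC3M can share one input format with the knowledge-graph data.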

Architecture

image image
  • $f$ is the text or image encoder image
image image

The relation representation is obtained by simple indexing.
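A minimal sketch of how the triplet sequence might be assembled, assuming the relation vector is a row looked up from a table and the unpooled head and tail token features are concatenated around it. The table contents, `EMBED_DIM`, and the helper names are hypothetical; the actual layout is in the figures and is not reproduced here.

```python
import random

EMBED_DIM = 4  # hypothetical feature width
RELATIONS = ["is a image of", "is a caption of"]

random.seed(0)
# One vector per relation type; in a real model this table would be learned.
relation_table = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in RELATIONS]
relation_to_index = {name: i for i, name in enumerate(RELATIONS)}


def relation_embedding(name):
    """Look the relation vector up by its index."""
    return relation_table[relation_to_index[name]]


def build_sequence(head_tokens, relation_name, tail_tokens):
    """Concatenate unpooled head tokens, the relation vector, and tail tokens."""
    return head_tokens + [relation_embedding(relation_name)] + tail_tokens


head = [[0.0] * EMBED_DIM for _ in range(3)]  # stand-in for encoder outputs f(head)
tail = [[1.0] * EMBED_DIM for _ in range(2)]  # stand-in for encoder outputs f(tail)
sequence = build_sequence(head, "is a image of", tail)
```

The resulting sequence is what a stacked Transformer encoder would consume in this sketch.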

Loss

  • Triplet based loss: as in MLM, part of the triplet is masked and the model is asked to predict it

E2E loss

When an entity (head or tail) is masked, the loss is computed as below image

Masking is done by simply concatenating a zero vector image

That is, the tail representation is pulled toward the representation of the head and relation that belong to the same triplet.
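One common way to pull matching representations together is an InfoNCE-style contrastive loss, sketched below. Whether the paper's E2E loss has exactly this form is an assumption, since the formula sits in the figure; the function names and the `temperature` value are illustrative only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def zero_mask(dim):
    """Placeholder for a masked entity: a zero vector, as noted above."""
    return [0.0] * dim


def info_nce(queries, targets, temperature=0.07):
    """Contrastive loss: queries[i] should match targets[i] and no other target.

    queries: outputs for (head, relation, masked tail) inputs
    targets: representations of the true tails, in the same order
    """
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query lines up with its own tail and grows when the pairing is wrong.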

E2R loss

Predicting the relation is just a classification problem image
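Treating relation prediction as classification usually means a softmax cross-entropy over the relation types. The sketch below assumes that form; `relation_cross_entropy` and the example logits are hypothetical.

```python
import math


def relation_cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample: -log p(target relation)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


# Hypothetical scores over three relation types; index 0 is the true relation.
confident = relation_cross_entropy([5.0, 0.0, 0.0], target_index=0)
uniform = relation_cross_entropy([0.0, 0.0, 0.0], target_index=0)
```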

  • Graph-based loss: makes the entity representation from the GNN and the one from the Transformer similar image

Continuous Learning

KL divergence against the output of the pretrained CLIP image
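A distillation term of this kind can be sketched as the KL divergence between the teacher's and the student's softmax distributions. The direction KL(teacher || student) and the use of raw logits without a temperature are assumptions here, not details taken from the paper.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the two softmax distributions."""
    p = softmax(teacher_logits)  # frozen pretrained CLIP (teacher)
    q = softmax(student_logits)  # model being trained (student)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


same = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
drifted = kd_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
```

The term is zero while the student reproduces the teacher's distribution and grows as the student drifts away, which is what keeps the original CLIP behavior from being forgotten.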

Experiment setup

image

Result

Image Retrieval

image

VQA, SNLI_VE

image

snli_ve is apparently this kind of data image https://github.com/necla-ml/SNLI-VE

GLUE

image

Image Classification

image

Ablation

image

  • Performs better than CLIP + KG

Did it solve the problem raised in the motivation?

image

They re-evaluated on only the VQA questions about properties such as color, and report that performance was better.