image

paper

TL;DR

  • I read this because : I'm interested in CLIPScore
  • task : evaluation for captioning
  • problem : previous reference-based evaluation metrics tend to be biased toward familiar words
  • idea : use CLIP to score captions!
  • input/output : {image, caption, (optionally) references} -> score
  • architecture : CLIP ViT-B/32
  • baseline : BLEU-1, BLEU-4, ROUGE-L, BERTScore, CIDEr, SPICE
  • data : Flickr8K-Expert, Flickr8K-CF, Pascal-50S, FOIL hallucination detection
  • evaluation : Kendall correlation with human judgement (Flickr8K-Expert, Flickr8K-CF); accuracy (Pascal-50S, FOIL)
  • result : highest correlation with human judgement, high accuracy, and one of the metrics consistently selected when running forward selection over captioning metrics
  • contribution : proposes a simple metric that improves on previous reference-based evaluation, backed by a massive amount of analysis
  • etc. : so when the idea is this simple, this much analysis is what it takes to publish a paper..

Details

motivation

image

CLIPScore

image
  • c: caption์˜ CLIP text embedding
  • v: image์˜ CLIP vision embedding
  • w is set to 2.5 ๊ทธ๋ƒฅ ํ•ด์„์˜ ์šฉ์ด์„ฑ์„ ์œ„ํ•ด ์ถ”๊ฐ€ํ•œ rescaling scalar.
    • cosine์€ ์ด๋ก  ์ƒ [-1, 1] scale์„ ๊ฐ€์ ธ์•ผํ•˜์ง€๋งŒ ํ•œ๋ฒˆ๋„ negative๋ฅผ ๋ณธ์ ์ด ์—†๋‹ค๊ณ 
    • score๊ฐ€ ํ•ญ์ƒ [0, 0.4] ์‚ฌ์ด์—์„œ ์œ„์น˜ํ•˜๋Š”๊ฑธ๋กœ ๋ณด์—ฌ์„œ [0, 1]๋กœ ๋งŒ๋“œ๋ ค๊ณ  2.5๋ฅผ ๊ณฑํ•จ footnote์— region-leval/token-level correspondence models(maybe FILIP?!)์ด ์„ฑ๋Šฅ์ด ๋” ์ข‹์ง€ ์•Š์•˜๋‹ค๊ณ  ์„œ์ˆ .

RefCLIP-S

referecne caption๋„ ํ™œ์šฉํ•˜๋Š” ๋ฒ„์ „.

image
  • r: CLIP text embeddings of the references
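RefCLIP-S combines the image-text score with the best caption-reference similarity (clamped at 0) via a harmonic mean. A sketch under the assumption that the same w = 2.5 rescaling applies to both terms (scaling both terms equally only rescales the harmonic mean, so rankings are unchanged):

```python
import numpy as np

W = 2.5  # same rescaling constant as CLIP-S

def cos(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_s(c, v):
    return W * max(cos(c, v), 0.0)

def refclip_s(c, v, refs):
    """Harmonic mean of CLIP-S(c, v) and the best
    caption-reference cosine similarity, clamped at 0."""
    ref_term = W * max(max(cos(c, r) for r in refs), 0.0)
    img_term = clip_s(c, v)
    if img_term + ref_term == 0:
        return 0.0
    return 2 * img_term * ref_term / (img_term + ref_term)
```

Because the harmonic mean is dominated by the smaller term, a caption must match both the image and at least one reference to score well.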

Caption-level Likert judgements

  • Flickr8K-Expert: 17K scores from “expert” human raters over 5,664 images, grading each caption from 1 (unrelated to the image) to 4 (describes the image without errors) image

leaderboard ์˜ค ์ด ๋ฒค์น˜๋งˆํฌ 1์œ„๊ฐ€ ๋„ค์ด๋ฒ„ ๋…ผ๋ฌธ์ด๋„น .. Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

  • Flickr8K-CF: 48K binary judgements over {image, caption} pairs covering 1K images, collected via crowdsourcing image

  • Composite (https://arxiv.org/pdf/1511.03292.pdf): 12K human judgements over MSCOCO, Flickr8K, and Flickr30K
    image

System-level correlation for MSCOCO

COCO captioner๋“ค ๊ฒฐ๊ณผ๋ž‘ ๋น„๊ตํ•˜๋Š”? ๋ฐ์ดํ„ฐ๊ฐ€ 12๊ฐœ ๋ฐ–์— ์—†๋‹ค๊ณ  ํ•จ

Sensitivity of CLIP-S to hallucination

์‚ฌ๋žŒ์˜ ํ‰๊ฐ€๊ฐ€ “speicificity"๋ณด๋‹ค “correctness"์— ๋” ๋งŽ์€ ์˜ํ–ฅ์„ ์ค€๋‹ค๊ณ  ํ•จ ์ด๋ฅผ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด hallucination ๋ฐ์ดํ„ฐ์…‹์ธ FOIL(https://arxiv.org/pdf/1705.01359.pdf )๋กœ ํ‰๊ฐ€ MSCOCO์—์„œ single noun phrase์—์„œ ๋ช…์‚ฌ๋ฅผ ๋น„์Šทํ•œ ๋‹จ์–ด๋กœ ์น˜ํ™˜์„ ํ•˜๋Š” ํ˜•ํƒœ (e.g., switching โ€œmotorcycle” for โ€œbicycle”) 32K์˜ sentence์— ๋Œ€ํ•ด ์น˜ํ™˜ํ•œ ๋ฌธ์žฅ์ด ๊ทธ๋ ‡์ง€ ์•Š์€ ๋ฌธ์žฅ๋ณด๋‹ค ๋” ๋†’์€ score๋ฅผ ์ฃผ์—ˆ๋Š”์ง€๋กœ ํ‰๊ฐ€.

image
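The FOIL evaluation reduces to a pairwise comparison: for each sentence pair, count whether the true caption outscores the foiled one. A minimal sketch (the function name is my own):

```python
import numpy as np

def foil_accuracy(true_scores, foil_scores):
    """Fraction of pairs where the original caption receives a
    strictly higher score than its foiled (noun-swapped) version."""
    true_scores = np.asarray(true_scores, dtype=float)
    foil_scores = np.asarray(foil_scores, dtype=float)
    return float(np.mean(true_scores > foil_scores))

# toy scores: the metric ranks the true caption higher in 2 of 3 pairs
print(foil_accuracy([2.0, 1.5, 0.9], [1.0, 1.6, 0.5]))  # 0.666...
```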

Sensitivity of CLIP-S to memorization

ํ˜น์‹œ CLIP ํ•™์Šต ๊ณผ์ •์—์„œ caption์„ ๋ฐฐ์šด ๊ฑธ๊นŒ๋ด ์ง์ ‘ ๋ฐ์ดํ„ฐ์…‹ ๋ชจ์•„์„œ ํ•จ

Which metrics should I report?

  • R2๋ฅผ ๊ธฐ์ค€์œผ๋กœ 10๊ฐœ์˜ Metric์— ๋Œ€ํ•ด forward selection ์ง„ํ–‰.
  • BLEU-1, BLEU-4, METEOR, CIDEr, ROUGE-L, SPICE, BERT-S(RoBERTa-F), TIGEr, ViLBERTScore-F, and CLIP-S image

์ ์–ด๋„ ์ƒ์œ„ 4๊ฐœ๊ฐœ์—์„œ ์„ ํƒ๋จ์„ ํ™•์ธ ๋˜ํ•œ metric๋ผ๋ฆฌ correlate๋˜์–ด ์žˆ์ง€๋งŒ redundantํ•˜์ง€๋Š” ์•Š์Œ์„ ํ™•์ธ. SPICE ๊ฐ™์€ reference ๊ธฐ๋ฐ˜์ด๋ž‘ ๊ฐ™์ด ์“ฐ๋Š”๊ฒŒ ๋” ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค๊ณ  ํ•จ