image

paper , code

TL;DR

  • I read this because.. : CLIP reward
  • task : captioning with reward
  • problem : ๊ธฐ์กด์˜ metric(cider, ..)๋“ค์€ ๊ฐ€์žฅ salientํ•œ object์— ๋Œ€ํ•ด annotate ๋˜์–ด์žˆ๋Š” ์บก์…˜์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฏ€๋กœ finegrainedํ•œ ์ •๋ณด๋ฅผ ๋‹ด์ง€ ๋ชปํ•œ๋‹ค
  • idea : CLIP-Score๋ฅผ reward๋กœ ์‚ฌ์šฉํ•˜์ž
  • input/output : image -> caption
  • architecture : CLIP-Res50 + encoder-decoder transformer(6 layer)
  • objective : REINFORCE objective with CLIP-S
  • baseline : MLE, CIDEr, CLIP-S, CIDEr-CLIP-S, CLIP-S + Grammar
  • data : MS COCO karpathy split
  • evaluation : Text-Based(BLEU, CIDEr, METOR, ROUGE-L, BERT-S), Image Based(CLIP-S, RefCLIP-S), T2I retrieval, FineCapEval(proposed), human eval
  • result : text based๋ณด๋‹จ ๋‹น์—ฐํžˆ ์•ˆ์ข‹์ง€๋งŒ Image eval์— ๋Œ€ํ•ด์„œ๋Š” ์šฐ์„ธํ•œ ์„ฑ์ . ํŠนํžˆ background ๋“ฑ ์„ธ๋ฐ€ํ•œ ๋ถ€๋ถ„์— ๋Œ€ํ•œ ๋ฒค์น˜๋งˆํฌ์ธ FineCapEval์—์„œ MLE, CIDEr based๋ณด๋‹ค ์ข‹์€ ์„ฑ์ 
  • contribution : motivation – ์‹คํ—˜ – ํ‰๊ฐ€๊ฐ€ ์ž˜ ์ด์–ด์ง
  • etc. : LM์„ agent๋กœ ๋ณด๋Š”๊ฒŒ ์˜›๋‚ ๋ถ€ํ„ฐ ์žˆ์—ˆ๊ตฌ๋‚˜,, ์˜›๋‚  ๋…ผ๋ฌธ๋„ ์ข€ ์ฝ์ž,,

Details

image

Preliminary

teacher-forcing ์ด ์•„๋‹ˆ๋ผ captioning model์„ ์ผ์ข…์˜ agent๋กœ ๋ณด๋Š” ๊ฒƒ์˜ ์›๋ฅ˜๋Š” ์ด ๋…ผ๋ฌธ Sequence Level Training with Recurrent Neural Networks(ICLR'16, https://arxiv.org/pdf/1511.06732 ) image

REINFORCE ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ BLEU, ROUGE-L ๋ฅผ reward๋กœ ํ•˜๋Š” captioning model reward๊ฐ€ variance๊ฐ€ ๋„ˆ๋ฌด ์ปค์„œ ๋ฒ ์ด์Šค๋ผ์ธ์„ ๋นผ๋Š”๊ฑด ์•„๋ž˜ ๋…ผ๋ฌธ Self-critical Sequence Training for Image Captioning(CVPR'16 https://arxiv.org/pdf/1612.00563 ) image

์œ„๋Š” REINFORCE with baseline์— ๋Œ€ํ•œ ์ผ๋ฐ˜์ ์ธ ์ˆ˜์‹์ด๊ณ  $r(w^s)$๋Š” ์ƒ˜ํ”Œ๋ง decoding, b๋Š” greedy decodeํ•œ sequence์˜ reward๋ฅผ ์‚ฌ์šฉํ•จ

proposed

image image image
  • $R(I,c)=CLIP-S(I,c)$

๊ทผ๋ฐ ์ด๋ ‡๊ฒŒ ํ• ๊ฒฝ์šฐ CLIP text encoder๊ฐ€ ๋ฌธ๋ฒ•์— ๋Œ€ํ•ด์„œ๋Š” ์•ฝํ•ด์„œ ๋ฌธ๋ฒ•์ด ํ‹€๋ฆฐ ์บก์…˜์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์—ˆ์Œ. ๊ทธ๋ž˜์„œ ์ผ๋ถ€๋Ÿฌ ๋ฌธ๋ฒ•์„ ํ‹€๋ฆฌ๊ฒŒํ•œ ๋ฌธ์žฅ์„ ์ž„์˜๋กœ ๋งŒ๋“ค์–ด์„œ ๋ฌธ๋ฒ•์ด ๋งž๋Š”์ง€ ์•ˆ๋งž๋Š”์ง€์— ๋Œ€ํ•ด head๋กœ ๋ถ™์—ฌ binary๋กœ ์˜ˆ์ธกํ•˜๊ฒŒ ํ•จ. ๊ทธ๋ฆฌ๊ณ  ์ƒ์„ฑ๋œ ์บก์…˜์˜ grammar ์ ์ˆ˜๋„ Reward์— ์ถ”๊ฐ€ํ•จ

image

MLE๋กœ 15์—ํญ ๋จผ์ € ํ•™์Šตํ•˜๊ณ  25 ์—ํญ์€ ๊ฐ๊ฐ์˜ Reward๋กœ ํ•™์Šต

Result

image

proposed FineCapEval

image

Human evaluation result image