
paper, code

TL;DR

  • I read this because : CLIP reward
  • task : captioning with reward
  • problem : Existing metrics (CIDEr, ...) are computed against reference captions annotated only for the most salient objects, so they don’t capture fine-grained information.
  • idea : Use CLIP-Score as a reward
  • input/output : image -> caption
  • architecture : CLIP-Res50 + encoder-decoder Transformer (6 layers)
  • objective : REINFORCE objective with CLIP-S
  • baseline : MLE, CIDEr, CLIP-S, CIDEr-CLIP-S, CLIP-S + Grammar
  • data : MS COCO Karpathy split
  • evaluation : text-based (BLEU, CIDEr, METEOR, ROUGE-L, BERT-S), image-based (CLIP-S, RefCLIP-S), T2I retrieval, FineCapEval (proposed), human eval
  • result : Naturally scores lower on text-based metrics, but dominates on image-based ones. Better than the MLE- and CIDEr-trained baselines, especially on FineCapEval, a benchmark for fine details such as the background.
  • contribution : clear motivation, solid experimentation, strong evaluation
  • etc. : Treating an LM as an agent has been around for a long time; let's read some old papers.

Details


Preliminary

The idea of viewing the captioning model as an agent, rather than training it with teacher forcing, originated in Sequence Level Training with Recurrent Neural Networks (ICLR'16, https://arxiv.org/pdf/1511.06732).

That paper trains a captioning model with the REINFORCE algorithm using BLEU and ROUGE-L as rewards. Subtracting a baseline from the reward, because the raw reward has too much variance, is described in Self-critical Sequence Training for Image Captioning (CVPR'17, https://arxiv.org/pdf/1612.00563).

This is the general formula for REINFORCE with a baseline, where $r(w^s)$ is the reward of the sampled sequence and $b$ is the reward of the greedy-decoded sequence.
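Written out, following the SCST paper's notation (with $w^s$ the sampled caption and $\hat{w}$ the greedy-decoded one, so $b = r(\hat{w})$), the gradient estimate is:

$$\nabla_\theta L(\theta) \approx -\big(r(w^s) - r(\hat{w})\big)\,\nabla_\theta \log p_\theta(w^s)$$

If the sample beats the greedy baseline, the advantage is positive and its log-probability is pushed up; otherwise it is pushed down.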

Proposed

  • $R(I, c) = \text{CLIP-S}(I, c)$
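As a reminder, CLIP-S (Hessel et al., 2021) is a rescaled, clipped cosine similarity between the CLIP image and text embeddings. A minimal numpy sketch (the function name is mine; `w = 2.5` is the constant from the CLIPScore paper):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIP-S(I, c) = w * max(cos(E_I, E_c), 0).

    image_emb / text_emb: CLIP embeddings of the image and the caption.
    The max(., 0) keeps the reward non-negative; w = 2.5 rescales it
    into roughly [0, 1] territory in practice.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = float(image_emb @ text_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    return w * max(cos, 0.0)
```

This scalar is what gets plugged in as $R(I, c)$ for each sampled caption.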

However, the CLIP text encoder is not very sensitive to grammar, so reward-only training sometimes produces ungrammatical captions. To fix this, the authors randomly corrupt sentences to make them intentionally ungrammatical, attach a head that makes a binary prediction of whether a caption is grammatical, and add this grammar score to the reward.
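Negatives for the grammar head come from rule-based corruption of real captions. A sketch of the idea (the specific ops below — swap, repeat, drop — are illustrative; the paper's exact corruption recipe may differ):

```python
import random

def corrupt_caption(tokens, rng=random):
    """Turn a grammatical caption into an ungrammatical negative
    for training a binary grammar classifier."""
    op = rng.choice(["swap", "repeat", "drop"])
    t = list(tokens)
    if op == "swap" and len(t) >= 2:
        i, j = rng.sample(range(len(t)), 2)  # scramble word order
        t[i], t[j] = t[j], t[i]
    elif op == "repeat":
        i = rng.randrange(len(t))            # duplicate a token
        t.insert(i, t[i])
    elif op == "drop" and len(t) >= 2:
        t.pop(rng.randrange(len(t)))         # delete a token
    return t
```

The classifier is then trained to separate original captions (label 1) from corrupted ones (label 0), and its probability output serves as the grammar score.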


Training runs 15 epochs with MLE first, then 25 epochs of fine-tuning with each reward.
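Putting the pieces together, the per-batch reward-phase loss is self-critical REINFORCE with CLIP-S (optionally plus the grammar score) as the reward. A minimal numpy sketch of the loss value (names are mine; in a real framework the advantage would be detached so gradients flow only through the log-probs):

```python
import numpy as np

def scst_loss(log_probs_sample, reward_sample, reward_greedy):
    """Self-critical REINFORCE loss (forward value only):

        L = -mean_b[ (r(w^s_b) - r(greedy_b)) * sum_t log p(w^s_{b,t}) ]

    log_probs_sample: (batch, seq_len) token log-probs of sampled captions.
    reward_sample / reward_greedy: (batch,) rewards, e.g. CLIP-S
    (+ grammar score) of the sampled vs. greedy-decoded captions.
    """
    advantage = reward_sample - reward_greedy   # baseline subtraction
    seq_log_prob = log_probs_sample.sum(axis=1)
    return float(np.mean(-advantage * seq_log_prob))
```

A sample that scores better than the greedy caption gets a positive advantage, so minimizing this loss raises its likelihood.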

Result


Proposed FineCapEval


Human evaluation result