TL;DR
- I read this because.. : CLIP reward
- task : captioning with reward
- problem : Existing metrics (CIDEr, ..) are computed against reference captions that annotate only the most salient objects, so they don't capture fine-grained information.
- idea : Use CLIP-Score as a reward
- input/output : image -> caption
- architecture : CLIP-Res50 + encoder-decoder transformer(6 layer)
- objective : REINFORCE objective with CLIP-S
- baseline : MLE, CIDEr, CLIP-S, CIDEr-CLIP-S, CLIP-S + Grammar
- data : MS COCO karpathy split
- evaluation : Text-based (BLEU, CIDEr, METEOR, ROUGE-L, BERT-S), image-based (CLIP-S, RefCLIP-S), T2I retrieval, FineCapEval (proposed), human eval
- result : As expected, worse on the text-based metrics, but dominant on the image-based ones. Better than the MLE- and CIDEr-trained models, especially on FineCapEval, a benchmark for fine details such as the background.
- contribution : motivation – experimentation – good at evaluation
- etc. : Treating the LM as an agent has been around for a long time, so let's read some of the older papers.
Details
Preliminary
The idea of treating the captioning model as a kind of agent, rather than training it only with teacher forcing, originated in this paper:
Sequence Level Training with Recurrent Neural Networks (ICLR'16, https://arxiv.org/pdf/1511.06732)
Captioning model using REINFORCE algorithm with BLEU, ROUGE-L as rewards
Subtracting a baseline from the reward, because the raw reward has too much variance, is described in this paper:
Self-critical Sequence Training for Image Captioning (CVPR'17, https://arxiv.org/pdf/1612.00563)
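The general REINFORCE-with-baseline gradient is $\nabla_\theta L(\theta) \approx -(r(w^s) - b)\,\nabla_\theta \log p_\theta(w^s)$, where $r(w^s)$ is the reward of the caption $w^s$ obtained by sampling decoding and $b$ is the reward of the greedy-decoded sequence.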
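As a rough illustration of REINFORCE with a greedy-decoding baseline, the sketch below computes the surrogate loss for a single caption. The function name `self_critical_loss` and its arguments are hypothetical, and the per-token log-probabilities are assumed to come from whatever captioning model is being trained; this is not the papers' actual code.

```python
import math


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Surrogate loss for one caption under REINFORCE with a baseline.

    sample_logprobs: per-token log p(w_t | w_<t, image) of the sampled caption w^s
    sample_reward:   r(w^s), reward of the sampled caption
    greedy_reward:   b, reward of the greedy-decoded caption (the baseline)
    """
    advantage = sample_reward - greedy_reward
    # Differentiating this value w.r.t. the model parameters gives
    # -(r(w^s) - b) * grad log p(w^s), the REINFORCE-with-baseline gradient.
    return -advantage * sum(sample_logprobs)


# Toy example: a two-token sampled caption that scores higher than the greedy one.
logps = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(logps, sample_reward=0.8, greedy_reward=0.6)
```

When the sampled caption beats the greedy one the advantage is positive, so minimizing the loss raises the sampled caption's log-probability; when it does worse, the probability is pushed down.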
proposed
- $R(I,c)=\text{CLIP-S}(I,c)$
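For reference, a minimal sketch of how a CLIP-S style reward can be computed from an image embedding and a caption embedding. It assumes the CLIPScore definition, a rescaled cosine similarity clipped at zero with weight $w = 2.5$, and it assumes the embeddings were already produced by the CLIP image and text encoders. The function name `clip_s` and the toy vectors are hypothetical.

```python
import math


def clip_s(image_emb, text_emb, w=2.5):
    """CLIP-S(I, c) = w * max(cos(image_emb, text_emb), 0).

    w = 2.5 is assumed from the CLIPScore definition. Both embeddings are
    assumed to be non-zero vectors of equal length.
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    cosine = dot / (norm_i * norm_t)
    # Negative similarities are clipped to zero before rescaling.
    return w * max(cosine, 0.0)


# Toy 2-d "embeddings", for illustration only.
aligned = clip_s([3.0, 4.0], [3.0, 4.0])
opposed = clip_s([1.0, 0.0], [-1.0, 0.0])
```

Because the reward depends only on the image and the generated caption, no reference captions are needed during RL training.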
Optimizing CLIP-S alone sometimes yields ungrammatical captions, because the CLIP text encoder is not very sensitive to grammar. To address this, the authors randomly generate intentionally ungrammatical sentences and train a head on the text encoder to make a binary prediction of whether a sentence is grammatical. The grammar score of the generated caption is then added to the reward.
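The grammar part can be sketched in two pieces: building ungrammatical negatives, and adding the grammar score to the reward. The perturbation operations below (repeat, insert, shuffle) are an assumption about how such negatives might be generated, `grammar_weight` is a hypothetical mixing weight, and `grammar_prob` stands in for the output of the binary grammar head, which is not implemented here.

```python
import random


def perturb(tokens, op, rng):
    """Return an (assumed) ungrammatical copy of a tokenized caption."""
    tokens = list(tokens)  # copy so the original caption is left untouched
    i = rng.randrange(len(tokens))
    if op == "repeat":
        tokens.insert(i, tokens[i])           # duplicate one token in place
    elif op == "insert":
        tokens.insert(i, rng.choice(tokens))  # insert a random token from the caption
    elif op == "shuffle":
        rng.shuffle(tokens)                   # scramble the word order
    else:
        raise ValueError(f"unknown op: {op}")
    return tokens


def combined_reward(clip_score, grammar_prob, grammar_weight=1.0):
    """Reward = CLIP-S plus a weighted grammar score for the generated caption."""
    return clip_score + grammar_weight * grammar_prob


rng = random.Random(0)
caption = "a dog runs on the beach".split()
negatives = [perturb(caption, op, rng) for op in ("repeat", "insert", "shuffle")]
reward = combined_reward(clip_score=0.7, grammar_prob=0.9)
```

Pairs of (original caption, label 1) and (perturbed caption, label 0) would then serve as training data for the binary head.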
Train for 15 epochs with MLE first, then for 25 epochs with each reward
Result
proposed FineCapEval
Human evaluation result