
paper, code

TL;DR

  • I read this because : CLIP reward
  • task : captioning with reward
  • problem : Existing metrics (CIDEr, ...) are computed against reference captions annotated only for the most salient objects, so they don’t capture fine-grained information.
  • idea : Use CLIP-Score as a reward
  • input/output : image -> caption
  • architecture : CLIP-Res50 + encoder-decoder Transformer (6 layers)
  • objective : REINFORCE objective with CLIP-S
  • baseline : MLE, CIDEr, CLIP-S, CIDEr-CLIP-S, CLIP-S + Grammar
  • data : MS COCO Karpathy split
  • evaluation : text-based (BLEU, CIDEr, METEOR, ROUGE-L, BERT-S), image-based (CLIP-S, RefCLIP-S), T2I retrieval, FineCapEval (proposed), human eval
  • result : Naturally scores lower on text-based metrics, but dominates on image-based ones. Better than the MLE- and CIDEr-trained baselines, especially on FineCapEval, a benchmark for fine details such as the background.
  • contribution : clear motivation, solid experimentation, strong evaluation
  • etc. : Treating an LM as an agent has been around for a long time; let's read some old papers.

Details


Preliminary

The idea of viewing the captioning model as an agent, rather than training it with teacher forcing, originated in Sequence Level Training with Recurrent Neural Networks (ICLR'16, https://arxiv.org/pdf/1511.06732).

That paper trains a captioning model with the REINFORCE algorithm using BLEU and ROUGE-L as rewards. Subtracting a baseline from the reward, because the raw reward has too much variance, is described in Self-critical Sequence Training for Image Captioning (CVPR'17, https://arxiv.org/pdf/1612.00563).

This is the general formula for REINFORCE with a baseline, where $r(w^s)$ is the reward of the sampled sequence and $b$ is the reward of the greedy-decoded sequence.
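Written out, following the SCST paper's notation (with $w^s$ the sampled caption and $\hat{w}$ the greedy-decoded one, so $b = r(\hat{w})$), the gradient estimate is:

$$\nabla_\theta L(\theta) \approx -\big(r(w^s) - r(\hat{w})\big)\,\nabla_\theta \log p_\theta(w^s)$$

If the sample beats the greedy baseline, the advantage is positive and its log-probability is pushed up; otherwise it is pushed down.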

Proposed

  • $R(I, c) = \text{CLIP-S}(I, c)$
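As a reminder, CLIP-S (Hessel et al., 2021) is a rescaled, clipped cosine similarity between the CLIP image and text embeddings. A minimal numpy sketch (the function name is mine; `w = 2.5` is the constant from the CLIPScore paper):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIP-S(I, c) = w * max(cos(E_I, E_c), 0).

    image_emb / text_emb: CLIP embeddings of the image and the caption.
    The max(., 0) keeps the reward non-negative; w = 2.5 rescales it
    into roughly [0, 1] territory in practice.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = float(image_emb @ text_emb) / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    return w * max(cos, 0.0)
```

This scalar is what gets plugged in as $R(I, c)$ for each sampled caption.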

However, the CLIP text encoder is not very sensitive to grammar, so reward-only training sometimes produces ungrammatical captions. To fix this, the authors randomly corrupt sentences to make them intentionally ungrammatical, attach a head that makes a binary prediction of whether a caption is grammatical, and add this grammar score to the reward.
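Negatives for the grammar head come from rule-based corruption of real captions. A sketch of the idea (the specific ops below — swap, repeat, drop — are illustrative; the paper's exact corruption recipe may differ):

```python
import random

def corrupt_caption(tokens, rng=random):
    """Turn a grammatical caption into an ungrammatical negative
    for training a binary grammar classifier."""
    op = rng.choice(["swap", "repeat", "drop"])
    t = list(tokens)
    if op == "swap" and len(t) >= 2:
        i, j = rng.sample(range(len(t)), 2)  # scramble word order
        t[i], t[j] = t[j], t[i]
    elif op == "repeat":
        i = rng.randrange(len(t))            # duplicate a token
        t.insert(i, t[i])
    elif op == "drop" and len(t) >= 2:
        t.pop(rng.randrange(len(t)))         # delete a token
    return t
```

The classifier is then trained to separate original captions (label 1) from corrupted ones (label 0), and its probability output serves as the grammar score.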


Training runs 15 epochs with MLE first, then 25 epochs of fine-tuning with each reward.
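Putting the pieces together, the per-batch reward-phase loss is self-critical REINFORCE with CLIP-S (optionally plus the grammar score) as the reward. A minimal numpy sketch of the loss value (names are mine; in a real framework the advantage would be detached so gradients flow only through the log-probs):

```python
import numpy as np

def scst_loss(log_probs_sample, reward_sample, reward_greedy):
    """Self-critical REINFORCE loss (forward value only):

        L = -mean_b[ (r(w^s_b) - r(greedy_b)) * sum_t log p(w^s_{b,t}) ]

    log_probs_sample: (batch, seq_len) token log-probs of sampled captions.
    reward_sample / reward_greedy: (batch,) rewards, e.g. CLIP-S
    (+ grammar score) of the sampled vs. greedy-decoded captions.
    """
    advantage = reward_sample - reward_greedy   # baseline subtraction
    seq_log_prob = log_probs_sample.sum(axis=1)
    return float(np.mean(-advantage * seq_log_prob))
```

A sample that scores better than the greedy caption gets a positive advantage, so minimizing this loss raises its likelihood.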

Result


Proposed FineCapEval


Human evaluation result