[145] CLIPScore: A Reference-free Evaluation Metric for Image Captioning

paper

TL;DR

I read this because.. : I’m interested in CLIP-based scores.
task : evaluation for captioning
problem : Previous reference-based evaluations tend to be biased towards familiar words
idea : Use CLIP to score captions!
input/output : {image, caption, (optionally) references} -> score
architecture : CLIP ViT-B/32
baseline : BLEU-1, BLEU-4, ROUGE-L, BERT-score, CIDEr, SPICE
data : Flickr8K-Expert, Flickr8K-CF, Composite, Pascal-50S, FOIL hallucination detection
evaluation : Kendall correlation with human judgment (Flickr8K-Expert, Flickr8K-CF); accuracy (Pascal-50S, FOIL)
result : Highest correlation with human judgment and high accuracy; it is also one of the metrics that is always selected when forward selection is run over the captioning metrics.
contribution : Proposed a simple metric that improves on the older reference-based evaluation! The analysis is extensive.
etc. : When the idea is this simple, this is the level of analysis it takes to write a paper.

Details

motivation

`CLIPScore`

c: CLIP text embedding of the caption
v: CLIP vision embedding of the image
w: set to 2.5. A rescaling scalar added just for ease of interpretation.
Cosine similarity theoretically ranges over [-1, 1], but the authors report never observing a negative value.
Multiplying by 2.5 stretches the score to roughly [0, 1], because the raw cosine always seems to fall in [0, 0.4].
A footnote states that region-level/token-level correspondence models (maybe FILIP?!) did not perform better.
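
Putting the pieces together, the score in the paper is CLIP-S(c, v) = w · max(cos(c, v), 0). Below is a minimal sketch of that formula only. The toy 2-d vectors stand in for real CLIP embeddings, and the function names `cosine` and `clip_s` are my own, not from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_s(c, v, w=2.5):
    """CLIP-S: rescaled cosine similarity, with negative values clipped to 0."""
    return w * max(cosine(c, v), 0.0)

# Toy stand-ins for a caption embedding (c) and an image embedding (v).
caption_emb = [3.0, 4.0]
image_emb = [4.0, 3.0]
print(clip_s(caption_emb, image_emb))        # cosine is 0.96 here, so roughly 2.4
print(clip_s([1.0, 0.0], [-1.0, 0.0]))       # negative cosine is clipped to 0.0
```

The toy vectors are far more similar than real CLIP embeddings tend to be, so the value lands above 1. With real embeddings, where the cosine reportedly stays in about [0, 0.4], the same code yields scores in about [0, 1].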

`RefCLIP-S`

A version that also uses the reference captions.

r: CLIP text embedding of a reference caption
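
As I understand the paper's definition, RefCLIP-S is the harmonic mean of CLIP-S(c, v) and the best clipped cosine between the caption and any reference, max(max over r of cos(c, r), 0). A minimal sketch under that reading, again with toy vectors in place of real CLIP embeddings and with my own function names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ref_clip_s(c, v, refs, w=2.5):
    """Harmonic mean of the image term (CLIP-S) and the best reference term."""
    image_term = w * max(cosine(c, v), 0.0)
    ref_term = max(max(cosine(c, r) for r in refs), 0.0)
    if image_term + ref_term == 0.0:
        return 0.0  # both terms clipped to zero; avoid dividing by zero
    return 2.0 * image_term * ref_term / (image_term + ref_term)

# Toy stand-ins: one caption, one image, two reference captions.
caption_emb = [3.0, 4.0]
image_emb = [4.0, 3.0]
reference_embs = [[3.0, 4.0], [0.0, 1.0]]
print(ref_clip_s(caption_emb, image_emb, reference_embs))
```

Because it is a harmonic mean, a caption has to agree with both the image and at least one reference to score well. A low value on either side drags the result down.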

Caption-level Likert judgments

Flickr8K-Expert: 17K “expert” human judgments over 5664 images. Captions are graded on a scale of 1 to 4 (1 = unrelated to the image, 4 = describes the image without any errors).

leaderboard: Oh, #1 on this benchmark is a Naver paper… Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Flickr8K-CF: a dataset of crowd-sourced binary quality judgments over 48K {image, caption} pairs (1K images)
Composite (https://arxiv.org/pdf/1511.03292.pdf): 12K human judgments over images from MSCOCO, Flickr8K, and Flickr30K
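
For these corpora the evaluation is a Kendall correlation between metric scores and human ratings. Below is a small sketch of Kendall's tau-b, which corrects for ties (handy since 1-4 ratings tie a lot). Using tau-b here is my assumption for illustration, so check the paper for which variant is used on which corpus. The scores and ratings are made-up toy numbers.

```python
import math

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: pairwise rank agreement, corrected for ties."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = math.sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    return (concordant - discordant) / denom

# Toy data: metric scores for four captions and their human ratings (1 to 4).
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 2, 2, 4]
print(kendall_tau_b(metric_scores, human_ratings))
```

In practice a library routine such as SciPy's `kendalltau` would be the natural choice. The hand-rolled version is only here to show what is being counted.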

System-level correlation for MSCOCO

How well does the metric track the results of COCO captioning systems? The authors note that there are only 12 data points.

Sensitivity of CLIP-S to hallucination

Human evaluation is influenced more by “correctness” than by “specificity”. To evaluate this, the paper uses the hallucination dataset FOIL (https://arxiv.org/pdf/1705.01359.pdf ).
FOIL takes MSCOCO captions and swaps a noun in a single noun phrase for a similar word (e.g., switching “motorcycle” for “bicycle”).
Over 32K sentences, the evaluation checks whether the metric scores the original (non-substituted) sentence higher than the substituted one.
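
The FOIL check boils down to a pairwise accuracy. A minimal sketch, assuming each item is a (true caption score, FOIL caption score) pair and that a tie counts as a miss; the tie rule and the toy numbers are my assumptions, not from the paper.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the true caption outscores its FOIL counterpart."""
    wins = sum(1 for true_score, foil_score in score_pairs if true_score > foil_score)
    return wins / len(score_pairs)

# Toy (true_score, foil_score) pairs; the last one is a tie and counts as a miss.
toy_pairs = [(0.81, 0.62), (0.70, 0.74), (0.66, 0.51), (0.59, 0.59)]
print(pairwise_accuracy(toy_pairs))
```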

Sensitivity of CLIP-S to memorization

The authors collect a new dataset themselves, in case CLIP memorized the captions during its training.

Which metrics should I report?

Forward selection over 10 metrics, based on R².
BLEU-1, BLEU-4, METEOR, CIDEr, ROUGE-L, SPICE, BERT-S(RoBERTa-F), TIGEr, ViLBERTScore-F, and CLIP-S

The paper confirms that CLIP-S is selected within at least the top four. The selection also shows that the metrics are correlated but not redundant. It would be better to report CLIP-S together with a reference-based metric like SPICE.
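
To make the procedure concrete, here is a rough sketch of greedy forward selection by R²: at each step, add the metric that most raises the R² of a least-squares fit to the human ratings. The paper's exact regression setup is not reproduced here. The metric names "A", "B", "C" and all numbers are hypothetical toy data.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    pred = [sum(beta[a] * X[i][a] for a in range(p)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((y[i] - pred[i]) ** 2 for i in range(n))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_selection(metrics, human):
    """Greedily add the metric that gives the highest R^2; return the order."""
    remaining = dict(metrics)
    chosen, order = [], []
    while remaining:
        best = max(remaining, key=lambda name: r_squared(chosen + [remaining[name]], human))
        chosen.append(remaining.pop(best))
        order.append(best)
    return order

# Hypothetical toy data: three metrics scored on six captions, plus human ratings.
human = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
metrics = {
    "A": [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
    "B": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "C": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
print(forward_selection(metrics, human))
```

The sketch assumes the metric columns are not collinear, since otherwise the normal equations become singular. A metric that gets picked early adds information the already-chosen metrics lack, which is how I read the "correlated but not redundant" point.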
