TL;DR
- Why I read this: I wanted to see how the data was created and how it was evaluated.
- task: three tasks are proposed: (1) FOIL detection, (2) FOIL word detection, (3) FOIL word correction
- problem: do vision-language models (captioning, VQA, etc.) really understand both modalities well?
- idea: replace one word in a caption with another, similar word
- input/output: {image, caption} -> (1) whether the caption is FOIL or not, (2) where the FOIL word is, (3) the corrected word
- objective: cross-entropy loss
- baseline: SOTA VQA and captioning models at the time; an LSTM that only saw captions; CNN + LSTM
- data: 65K train / 32K test images, 197K train / 99K test captions, built from COCO's captions
- evaluation: (1) accuracy, (2) whether the foil word is localized correctly (evaluated on the foil noun only / on all nouns), (3) given the FOIL word, whether it is corrected back to the original word
- contribution: later reused as a hallucination measure, among other things
- etc.:
    - This was about the most reasonable design possible in '17
    - Not a very popular evaluation set -> the latest LVLM benchmarks are probably a better choice now
    - Changing only a single noun is a limitation
Details
Task
num samples
How data is created
- Create target::foil pairs from MS-COCO objects that share a supercategory
    - Remove multi-word object names, e.g. "traffic light"
    - Split categories between train and test
        - A target::foil pair used in training is not used in testing
- Create the foil captions
    - Replace the target word in the caption with its foil word
    - Only replace with objects that do not appear in the image
        - e.g., in "A dog and a cat eat," you cannot replace "dog" with "cat," because the image already contains a cat
    - Use a captioning model (NeuralTalk) to keep only the hardest foil captions
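The data-creation steps above can be sketched roughly as follows. The supercategory table and captions here are toy stand-ins (not MS-COCO itself), and the NeuralTalk hardest-foil filtering step is omitted:

```python
# A minimal sketch of the foil-caption generation procedure, assuming a toy
# object->supercategory table in place of the real MS-COCO annotations.
SUPERCATEGORY = {
    "dog": "animal", "cat": "animal", "horse": "animal",
    "car": "vehicle", "bus": "vehicle", "truck": "vehicle",
}

def candidate_foils(target: str) -> list[str]:
    """All single-word objects sharing the target's supercategory."""
    super_cat = SUPERCATEGORY[target]
    return [w for w, s in SUPERCATEGORY.items()
            if s == super_cat and w != target and " " not in w]

def make_foil_captions(caption: str, objects_in_image: set[str]) -> list[str]:
    """Swap one annotated object word for a foil that is NOT in the image."""
    foils = []
    words = caption.split()
    for i, w in enumerate(words):
        if w not in SUPERCATEGORY:
            continue  # only annotated object words are candidates for swapping
        for foil in candidate_foils(w):
            if foil in objects_in_image:
                continue  # e.g. don't turn "dog" into "cat" if a cat is visible
            foils.append(" ".join(words[:i] + [foil] + words[i + 1:]))
    return foils

# The cat is in the image, so "dog" cannot become "cat"; "horse" is fine.
print(make_foil_captions("a dog and a cat eat", {"dog", "cat"}))
# -> ['a horse and a cat eat', 'a dog and a horse eat']
```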
Evaluation
- T1 is a binary classification (real vs. FOIL caption)
- T2 finds the foil word given {image, FOIL caption}
- T3 corrects the foil word given {image, FOIL caption, FOIL word}
For T1, the captioning model removes each word from the caption in turn and predicts a replacement for it; if the caption with the model's predicted word scores higher than the original caption, the caption is classified as FOIL.
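The T1 decision rule above can be sketched as follows. `predict_word` and `caption_score` are hypothetical stand-ins for a real captioning model's fill-in and scoring functions, not the paper's actual implementation:

```python
# Hedged sketch of the T1 detection rule: if the model prefers its own word
# over the caption's word at any position, flag the caption as FOIL.

def detect_foil(image, caption: str, predict_word, caption_score) -> bool:
    """Flag the caption as FOIL if, for some position, the caption with the
    model's own predicted word scores higher than the original caption."""
    words = caption.split()
    original_score = caption_score(image, caption)
    for i in range(len(words)):
        context = words[:i] + words[i + 1:]          # caption with word i removed
        predicted = predict_word(image, context, i)  # model's fill-in for slot i
        if predicted == words[i]:
            continue  # model agrees with the caption at this position
        rescored = caption_score(
            image, " ".join(words[:i] + [predicted] + words[i + 1:]))
        if rescored > original_score:
            return True  # model prefers its own word -> original word looks wrong
    return False
```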
For T2, the occlusion method from Towards Transparent AI Systems: Interpreting Visual Question Answering Models (https://arxiv.org/pdf/1608.08974.pdf) is used: mask the words in the caption one by one, forward the masked caption through the model, and measure how much the score of the originally predicted answer changes.
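A minimal sketch of that occlusion procedure, assuming a hypothetical `caption_score` stand-in for the trained model's scorer:

```python
# Sketch of occlusion-based foil localization (T2): mask each word in turn,
# re-forward, and pick the position whose masking changes the score the most.

def locate_foil(image, caption: str, caption_score) -> int:
    """Return the index of the word whose occlusion changes the score most."""
    words = caption.split()
    base = caption_score(image, caption)

    def score_change(i: int) -> float:
        masked = " ".join("<mask>" if j == i else w
                          for j, w in enumerate(words))
        return abs(caption_score(image, masked) - base)

    return max(range(len(words)), key=score_change)
```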
For T3, a linear regression over target words is trained (so that seems to be the only newly learned component?).
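One way to read "linear regression over target words" is a linear map from joint image-caption features onto one-hot word labels; a toy sketch under that assumption (the features, vocabulary, and closed-form fit are all illustrative, not the paper's setup):

```python
import numpy as np

# Hedged sketch of T3 as linear regression: fit a linear map from toy
# (image + caption) features onto one-hot target-word labels in closed form,
# then correct by taking the argmax over the vocabulary.
rng = np.random.default_rng(0)
vocab = ["dog", "cat", "horse"]

# toy features: 200 examples, 16-dim joint embedding; labels come from a
# hidden linear map so the problem is learnable by construction
X = rng.normal(size=(200, 16))
W_true = rng.normal(size=(16, len(vocab)))
y = (X @ W_true).argmax(axis=1)       # toy "correct target word" labels
Y = np.eye(len(vocab))[y]             # one-hot regression targets

# closed-form least-squares fit of the linear regression weights
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

pred = (X @ W).argmax(axis=1)         # predicted target word per example
accuracy = (pred == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```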
Analysis
- Some of the generated dataset samples are badly constructed