TL;DR
- Why I read this: I wanted to see how the data was created and how it was evaluated.
- task: three tasks are proposed: (1) FOIL detection, (2) FOIL word detection, (3) FOIL word correction
- problem: do vision-language models (captioning, VQA, etc.) really understand both modalities well?
- idea: replace one word in a caption with another, similar word
- input/output: {image, caption} -> (1) whether the caption is FOIL or not, (2) where the FOIL word is, (3) the corrected word
- objective: cross-entropy loss
- baseline: SOTA VQA and captioning models at the time; an LSTM that only saw captions; CNN + LSTM
- data: 65K train / 32K test images, 197K train / 99K test captions, built from COCO's captions
- evaluation: (1) accuracy, (2) whether the foil word is localized correctly (evaluated on the foil noun only / on all nouns), (3) given the FOIL word, whether it is corrected back to the original word
- contribution: later reused as a hallucination measure, among other things
- etc.:
    - This was about the most reasonable design possible in '17
    - Not a very popular evaluation set -> the latest LVLM benchmarks are probably a better choice now
    - Changing only a single noun is a limitation
Details
Task
num samples
How data is created
- Create target::foil pairs from MS-COCO objects that share a supercategory
    - Remove multi-word object names, e.g. "traffic light"
    - Split categories between train and test
        - A target::foil pair used in training is not used in testing
- Create the foil captions
    - Replace the target word in the caption with its foil word
    - Only replace with objects that do not appear in the image
        - e.g., in "A dog and a cat eat," you cannot replace "dog" with "cat," because the image already contains a cat
    - Use a captioning model (NeuralTalk) to keep only the hardest foil captions
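The data-creation steps above can be sketched roughly as follows. The supercategory table and captions here are toy stand-ins (not MS-COCO itself), and the NeuralTalk hardest-foil filtering step is omitted:

```python
# A minimal sketch of the foil-caption generation procedure, assuming a toy
# object->supercategory table in place of the real MS-COCO annotations.
SUPERCATEGORY = {
    "dog": "animal", "cat": "animal", "horse": "animal",
    "car": "vehicle", "bus": "vehicle", "truck": "vehicle",
}

def candidate_foils(target: str) -> list[str]:
    """All single-word objects sharing the target's supercategory."""
    super_cat = SUPERCATEGORY[target]
    return [w for w, s in SUPERCATEGORY.items()
            if s == super_cat and w != target and " " not in w]

def make_foil_captions(caption: str, objects_in_image: set[str]) -> list[str]:
    """Swap one annotated object word for a foil that is NOT in the image."""
    foils = []
    words = caption.split()
    for i, w in enumerate(words):
        if w not in SUPERCATEGORY:
            continue  # only annotated object words are candidates for swapping
        for foil in candidate_foils(w):
            if foil in objects_in_image:
                continue  # e.g. don't turn "dog" into "cat" if a cat is visible
            foils.append(" ".join(words[:i] + [foil] + words[i + 1:]))
    return foils

# The cat is in the image, so "dog" cannot become "cat"; "horse" is fine.
print(make_foil_captions("a dog and a cat eat", {"dog", "cat"}))
# -> ['a horse and a cat eat', 'a dog and a horse eat']
```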
Evaluation
- T1 is a binary classification (real vs. FOIL caption)
- T2 finds the foil word given {image, FOIL caption}
- T3 corrects the foil word given {image, FOIL caption, FOIL word}
For T1, the captioning model removes each word from the caption in turn and predicts a replacement for it; if the caption with the model's predicted word scores higher than the original caption, the caption is classified as FOIL.
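The T1 decision rule above can be sketched as follows. `predict_word` and `caption_score` are hypothetical stand-ins for a real captioning model's fill-in and scoring functions, not the paper's actual implementation:

```python
# Hedged sketch of the T1 detection rule: if the model prefers its own word
# over the caption's word at any position, flag the caption as FOIL.

def detect_foil(image, caption: str, predict_word, caption_score) -> bool:
    """Flag the caption as FOIL if, for some position, the caption with the
    model's own predicted word scores higher than the original caption."""
    words = caption.split()
    original_score = caption_score(image, caption)
    for i in range(len(words)):
        context = words[:i] + words[i + 1:]          # caption with word i removed
        predicted = predict_word(image, context, i)  # model's fill-in for slot i
        if predicted == words[i]:
            continue  # model agrees with the caption at this position
        rescored = caption_score(
            image, " ".join(words[:i] + [predicted] + words[i + 1:]))
        if rescored > original_score:
            return True  # model prefers its own word -> original word looks wrong
    return False
```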
For T2, the occlusion method from Towards Transparent AI Systems: Interpreting Visual Question Answering Models (https://arxiv.org/pdf/1608.08974.pdf) is used: mask the words in the caption one by one, forward the masked caption through the model, and measure how much the score of the originally predicted answer changes.
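A minimal sketch of that occlusion procedure, assuming a hypothetical `caption_score` stand-in for the trained model's scorer:

```python
# Sketch of occlusion-based foil localization (T2): mask each word in turn,
# re-forward, and pick the position whose masking changes the score the most.

def locate_foil(image, caption: str, caption_score) -> int:
    """Return the index of the word whose occlusion changes the score most."""
    words = caption.split()
    base = caption_score(image, caption)

    def score_change(i: int) -> float:
        masked = " ".join("<mask>" if j == i else w
                          for j, w in enumerate(words))
        return abs(caption_score(image, masked) - base)

    return max(range(len(words)), key=score_change)
```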
For T3, a linear regression over target words is trained (so that seems to be the only newly learned component?).
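One way to read "linear regression over target words" is a linear map from joint image-caption features onto one-hot word labels; a toy sketch under that assumption (the features, vocabulary, and closed-form fit are all illustrative, not the paper's setup):

```python
import numpy as np

# Hedged sketch of T3 as linear regression: fit a linear map from toy
# (image + caption) features onto one-hot target-word labels in closed form,
# then correct by taking the argmax over the vocabulary.
rng = np.random.default_rng(0)
vocab = ["dog", "cat", "horse"]

# toy features: 200 examples, 16-dim joint embedding; labels come from a
# hidden linear map so the problem is learnable by construction
X = rng.normal(size=(200, 16))
W_true = rng.normal(size=(16, len(vocab)))
y = (X @ W_true).argmax(axis=1)       # toy "correct target word" labels
Y = np.eye(len(vocab))[y]             # one-hot regression targets

# closed-form least-squares fit of the linear regression weights
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

pred = (X @ W).argmax(axis=1)         # predicted target word per example
accuracy = (pred == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```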
Analysis
- Some of the generated dataset samples are badly constructed