TL;DR
- I read this because : it is the predecessor of #169.
- task : text-to-image alignment evaluation
- problem : when evaluating text-to-image and image-to-text generation models, it is important to check whether an image and a text are semantically aligned.
- idea : (zero-shot) LLM + VQA pipeline proposal / (fine-tuned) VNLI model
- input/output : {image, text} -> score
- architecture : VQ^2 (spaCy, T5-XXL, PaLI-17B), VNLI (BLIP2, PaLI-17B)
- baseline : CLIP, BLIP, BLIP2, PaLI, TIFA
- data : 44K examples created with ConGen for training the VNLI model
- evaluation : SeeTRUE benchmark (proposed) -> ROC AUC
- result : Better than TIFA
- contribution : VQ^2, presented as if it were the first of its kind? I'm not sure whether it was concurrent with TIFA.
Details
Proposed SeeTRUE benchmark
- EditBench: created for this paper; images generated with SD v1.4 and v2.1 from COCO and DrawBench captions
- COCO-Con: contradicting captions generated from COCO captions via the ConGen method below.
- PickaPic-Con: PickaPic images captioned with BLIP2
SeeTRUE generation
- ConGen: ask the PaLM model to generate contradicting captions, then use an NLI model to keep the candidate with the highest contradiction score.
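The ConGen selection step can be sketched as below. Both helpers are toy stand-ins (my own names, not the paper's): `propose_contradictions` replaces the PaLM prompt and `contradiction_score` replaces the NLI model's P(contradiction), so the end-to-end logic runs without any model.

```python
# Sketch of ConGen: an LLM proposes contradicting captions, an NLI model
# scores each (caption, candidate) pair, keep the most contradictory one.

def propose_contradictions(caption):
    # Stand-in for prompting PaLM; a real pipeline would call an LLM here.
    return [
        caption.replace("two", "three"),
        caption.replace("dog", "cat"),
        caption + " at night",
    ]

def contradiction_score(premise, hypothesis):
    # Stand-in for an NLI model's P(contradiction | premise, hypothesis);
    # a toy word-difference heuristic so the sketch is runnable.
    changed = sum(a != b for a, b in zip(premise.split(), hypothesis.split()))
    extra = abs(len(premise.split()) - len(hypothesis.split()))
    return min(1.0, 0.3 * changed + 0.1 * extra)

def congen(caption):
    # Keep the candidate the NLI stand-in finds most contradictory.
    candidates = propose_contradictions(caption)
    return max(candidates, key=lambda c: contradiction_score(caption, c))
```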
VQ^2
First generate candidate answers, then turn each into a question with a question generation (QG) model, and filter the (question, answer) pairs with a QA model. Each surviving question is then put to the VQA model, and the final score is the average of the VQA model's confidence in the expected answers.
- answer spans are extracted with spaCy (POS tags + dependency parse tree)
- QG uses T5-XXL
- the QA model is T5-XXL trained on SQuAD 2.0 and Natural Questions
- the VQA model is PaLI-17B
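The loop above can be sketched as follows. All four model calls are injected as plain functions (my own framing, not the paper's code): in the paper they correspond to spaCy span extraction, T5-XXL QG, T5-XXL QA, and PaLI-17B VQA.

```python
# Minimal sketch of VQ^2 scoring for one {image, text} pair.

def vq2_score(image, text, extract_spans, gen_question, qa_model, vqa_confidence):
    """Return an alignment score for an {image, text} pair."""
    pairs = []
    for answer in extract_spans(text):
        question = gen_question(text, answer)
        # Round-trip filter: keep (question, answer) only if a text-only QA
        # model recovers the same answer from the generated question.
        if qa_model(text, question) == answer:
            pairs.append((question, answer))
    if not pairs:
        return 0.0
    # Ask the VQA model each question against the image; the score is the
    # average confidence it assigns to the expected answer.
    return sum(vqa_confidence(image, q, a) for q, a in pairs) / len(pairs)
```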
E2E VNLI model
Further fine-tuning of BLIP2 and PaLI-17B on 44K examples generated with ConGen
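One plausible way such a training set is assembled (my own sketch and naming, under the assumption that each image keeps its original caption as a positive pair and a ConGen contradiction as a negative pair):

```python
# Sketch: build VNLI training examples from (image, caption) records.
# `make_contradiction` stands in for the ConGen step.

def build_vnli_examples(records, make_contradiction):
    """records: iterable of (image_id, caption) pairs."""
    examples = []
    for image_id, caption in records:
        # Original caption -> entailed (label 1).
        examples.append({"image": image_id, "text": caption, "label": 1})
        # ConGen contradiction -> not entailed (label 0).
        examples.append(
            {"image": image_id, "text": make_contradiction(caption), "label": 0}
        )
    return examples
```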
Results
- Winoground results
- Correlation with human judgments
- Can also be used for reranking
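Reranking with such an alignment scorer reduces to sorting candidates by score; `score_fn` below is a stand-in (my name) for any {image, text} -> score model such as the VNLI scorer.

```python
# Sketch: rerank generated candidate images for a prompt by alignment score.

def rerank(prompt, images, score_fn):
    """Return images sorted best-first by alignment with the prompt."""
    return sorted(images, key=lambda img: score_fn(img, prompt), reverse=True)
```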