TL;DR
- I read this because : it is the predecessor of #169.
- task : text-to-image alignment evaluation
- problem : when evaluating text-to-image and image-to-text generation models, it is important to check whether an image and a text are semantically aligned.
- idea : (zero-shot) LLM + VQA pipeline proposal / (fine-tuned) VNLI model
- input/output : {image, text} -> score
- architecture : VQ^2 (spaCy, T5-XXL, PaLI-17B), VNLI (BLIP2, PaLI-17B)
- baseline : CLIP, BLIP, BLIP2, PaLI, TIFA
- data : 44K examples created with ConGen for training the VNLI model
- evaluation : SeeTRUE benchmark (proposed) -> ROC AUC
- result : Better than TIFA
- contribution : VQ^2, presented as if it were the first of its kind? I'm not sure whether it was concurrent with TIFA.
Details
Proposed SeeTRUE benchmark
- EditBench: created for this paper; images generated with SD v1.4 and v2.1 from COCO and DrawBench captions
- COCO-Con: contradicting captions generated from COCO captions via the ConGen method below.
- PickaPic-Con: PickaPic images captioned with BLIP2
SeeTRUE generation
- ConGen: ask the PaLM model to generate contradicting captions, then use an NLI model to keep the candidate with the highest contradiction score.
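The ConGen selection step can be sketched as below. Both helpers are toy stand-ins (my own names, not the paper's): `propose_contradictions` replaces the PaLM prompt and `contradiction_score` replaces the NLI model's P(contradiction), so the end-to-end logic runs without any model.

```python
# Sketch of ConGen: an LLM proposes contradicting captions, an NLI model
# scores each (caption, candidate) pair, keep the most contradictory one.

def propose_contradictions(caption):
    # Stand-in for prompting PaLM; a real pipeline would call an LLM here.
    return [
        caption.replace("two", "three"),
        caption.replace("dog", "cat"),
        caption + " at night",
    ]

def contradiction_score(premise, hypothesis):
    # Stand-in for an NLI model's P(contradiction | premise, hypothesis);
    # a toy word-difference heuristic so the sketch is runnable.
    changed = sum(a != b for a, b in zip(premise.split(), hypothesis.split()))
    extra = abs(len(premise.split()) - len(hypothesis.split()))
    return min(1.0, 0.3 * changed + 0.1 * extra)

def congen(caption):
    # Keep the candidate the NLI stand-in finds most contradictory.
    candidates = propose_contradictions(caption)
    return max(candidates, key=lambda c: contradiction_score(caption, c))
```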
VQ^2
First generate candidate answers, then turn each into a question with a question generation (QG) model, and filter the (question, answer) pairs with a QA model. Each surviving question is then put to the VQA model, and the final score is the average of the VQA model's confidence in the expected answers.
- answer spans are extracted with spaCy (POS tags + dependency parse tree)
- QG uses T5-XXL
- the QA model is T5-XXL trained on SQuAD 2.0 and Natural Questions
- the VQA model is PaLI-17B
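The loop above can be sketched as follows. All four model calls are injected as plain functions (my own framing, not the paper's code): in the paper they correspond to spaCy span extraction, T5-XXL QG, T5-XXL QA, and PaLI-17B VQA.

```python
# Minimal sketch of VQ^2 scoring for one {image, text} pair.

def vq2_score(image, text, extract_spans, gen_question, qa_model, vqa_confidence):
    """Return an alignment score for an {image, text} pair."""
    pairs = []
    for answer in extract_spans(text):
        question = gen_question(text, answer)
        # Round-trip filter: keep (question, answer) only if a text-only QA
        # model recovers the same answer from the generated question.
        if qa_model(text, question) == answer:
            pairs.append((question, answer))
    if not pairs:
        return 0.0
    # Ask the VQA model each question against the image; the score is the
    # average confidence it assigns to the expected answer.
    return sum(vqa_confidence(image, q, a) for q, a in pairs) / len(pairs)
```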
E2E VNLI model
Further fine-tuning of BLIP2 and PaLI-17B on 44K examples generated with ConGen
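One plausible way such a training set is assembled (my own sketch and naming, under the assumption that each image keeps its original caption as a positive pair and a ConGen contradiction as a negative pair):

```python
# Sketch: build VNLI training examples from (image, caption) records.
# `make_contradiction` stands in for the ConGen step.

def build_vnli_examples(records, make_contradiction):
    """records: iterable of (image_id, caption) pairs."""
    examples = []
    for image_id, caption in records:
        # Original caption -> entailed (label 1).
        examples.append({"image": image_id, "text": caption, "label": 1})
        # ConGen contradiction -> not entailed (label 0).
        examples.append(
            {"image": image_id, "text": make_contradiction(caption), "label": 0}
        )
    return examples
```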
Results
- Winoground results
- Correlation with human judgments
- Can also be used for reranking
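Reranking with such an alignment scorer reduces to sorting candidates by score; `score_fn` below is a stand-in (my name) for any {image, text} -> score model such as the VNLI scorer.

```python
# Sketch: rerank generated candidate images for a prompt by alignment score.

def rerank(prompt, images, score_fn):
    """Return images sorted best-first by alignment with the prompt."""
    return sorted(images, key=lambda img: score_fn(img, prompt), reverse=True)
```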