

TL;DR

  • why I read this : predecessor of #169
  • task : text-to-image alignment evaluation
  • problem : when evaluating text-to-image and image-to-text generation models, we need a way to judge whether an image and a text are semantically aligned
  • idea : (zero-shot) LLM + VQA pipeline / (fine-tuned) end-to-end VNLI model
  • input/output : {image, text} -> alignment score
  • architecture : VQ^2 (spaCy, T5-XXL, PaLI-17B), VNLI (BLIP2, PaLI-17B)
  • baseline : CLIP, BLIP, BLIP2, PaLI, TIFA
  • data : 44K examples created with ConGen for training the VNLI model
  • evaluation : SeeTRUE benchmark (proposed) -> ROC AUC
  • result : outperforms TIFA
  • contribution : presents VQ^2 as if it were the first of its kind? unclear to me whether it was concurrent with TIFA

Details


Proposed SeeTRUE benchmark

  • EditBench: created for this paper; images generated with Stable Diffusion v1.4 and v2.1 from COCO and DrawBench captions
  • COCO-Con: contradicting captions for COCO captions, created with the ConGen method below
  • PickaPic-Con: captions for PickaPic images generated with BLIP2

SeeTRUE generation

  • ConGen: prompt the PaLM model to generate contradicting captions, then use an NLI model to keep the candidate with the highest contradiction score.
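
The selection step can be sketched as follows (a toy sketch; `select_most_contradictory` and the stub NLI scorer are my names and assumptions, not the paper's code):

```python
# ConGen's selection step: an LLM proposes contradicting captions; an NLI
# model scores each candidate against the original caption, and the most
# contradictory candidate is kept.

def select_most_contradictory(original, candidates, nli_contradiction_score):
    """Return the candidate with the highest NLI contradiction probability."""
    scored = [(nli_contradiction_score(original, c), c) for c in candidates]
    return max(scored)[1]

# Toy stand-in for a real NLI model (the paper uses a trained NLI classifier):
def toy_nli(premise, hypothesis):
    # Pretend "no dog" strongly contradicts a caption that mentions a dog.
    return 0.9 if "no dog" in hypothesis else 0.1

caption = "a dog sleeping on a red couch"
candidates = ["a dog sitting on a red couch", "a red couch with no dog on it"]
print(select_most_contradictory(caption, candidates, toy_nli))
# -> a red couch with no dog on it
```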

VQ^2

Answer spans are extracted from the text first; a question generation (QG) model produces a question for each span, and a QA model filters out inconsistent question–answer pairs. Each question is then put to a VQA model, and the final score is the average of the VQA model's confidence in the expected answers.

  • answer spans are extracted with spaCy's POS tags + dependency parse tree
  • QG uses T5-XXL
  • QA model is T5-XXL trained on SQuAD2.0 and Natural Questions
  • VQA model is PaLI-17B
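
The scoring loop reduces to averaging VQA confidences over the generated question–answer pairs. A minimal sketch with stubbed components (all names are my assumptions; the real models are T5-XXL and PaLI-17B, and the QA filtering step is omitted here):

```python
# VQ^2 scoring sketch: extract answer spans, generate a question per span,
# then average the VQA model's confidence in each expected answer.

def vq2_score(text, image, extract_answer_spans, generate_question, vqa_confidence):
    """Average VQA confidence over (question, expected answer) pairs."""
    pairs = [(generate_question(text, span), span)
             for span in extract_answer_spans(text)]
    if not pairs:
        return 0.0
    return sum(vqa_confidence(image, q, a) for q, a in pairs) / len(pairs)

# Toy stand-ins: spans via a word list, the "image" as a set of visible objects.
spans = lambda text: [w for w in text.split() if w in {"dog", "couch"}]
qg = lambda text, span: f"Is there a {span} in the image?"
vqa = lambda image, q, a: 1.0 if a in image else 0.0

print(vq2_score("a dog on a couch", {"dog", "couch"}, spans, qg, vqa))  # -> 1.0
print(vq2_score("a dog on a couch", {"couch"}, spans, qg, vqa))        # -> 0.5
```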

E2E VNLI model

Additional training of BLIP2 and PaLI-17B on the 44K examples generated with ConGen.
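
A hedged sketch of how such a VNLI scorer can be read out at inference time (the interface and names are my assumptions, not the paper's code): the fine-tuned model is asked whether the text matches the image, and the alignment score is the probability of answering "yes", renormalized against "no".

```python
import math

def vnli_score(image, text, yes_no_logprobs):
    """Turn yes/no log-probabilities into an alignment score in [0, 1]."""
    logprobs = yes_no_logprobs(image, text)
    p_yes = math.exp(logprobs["yes"])
    p_no = math.exp(logprobs["no"])
    return p_yes / (p_yes + p_no)  # renormalize over the two answers

# Toy stand-in for the fine-tuned BLIP2/PaLI model:
toy_model = lambda image, text: {"yes": math.log(0.6), "no": math.log(0.2)}
print(round(vnli_score("img", "a dog on a couch", toy_model), 2))  # -> 0.75
```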

Result

  • Winoground results

  • correlation with human judgments

  • can also be used for reranking
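
The reranking use is straightforward: score each candidate image against the prompt with VQ^2 or VNLI and keep the best. A toy sketch (the scorer below is a word-overlap stand-in, not the paper's model):

```python
def rerank(prompt, images, alignment_score):
    """Sort candidate images by text-image alignment, best first."""
    return sorted(images, key=lambda img: alignment_score(prompt, img), reverse=True)

# Toy scorer: word overlap between the prompt and an (imagined) image description.
score = lambda prompt, desc: (
    len(set(prompt.split()) & set(desc.split())) / len(prompt.split())
)

candidates = ["a couch", "a dog on a couch", "a cat"]
print(rerank("dog couch", candidates, score)[0])  # -> a dog on a couch
```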