
paper, code

TL;DR

  • I read this because : T2I evaluation and word-level annotation, etc.
  • task : T2I evaluation
  • problem : existing QG-based methods (DSG, QG2) suffer from LLM hallucination in the question-generation step; the authors change how that step behaves.
  • IDEA : a slightly different way of prompting the question-generation step.
  • input/output : {image, text} -> score
  • architecture : PaLM (QA) + PaLI (VQA)
  • baseline : METEOR, SPICE, CLIP, TIFA, DSG
  • data : proposed Gecko2K
  • evaluation : proposed Gecko
  • result : higher human correlation than the other metrics.
  • contribution : dataset proposal; word-level annotation.
  • etc. : I’m not sure what they actually did with the word-level annotations afterward. Is it just that they correlated better than Likert (absolute-score) annotation?
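The {image, text} -> score pipeline in the bullets above (an LLM generates questions from the prompt, a VQA model answers them against the image, and the score is the fraction answered correctly) can be sketched as a toy. Both model calls below are stubs I made up to stand in for the PaLM and PaLI steps; the questions and answers are illustrative only:

```python
# Toy sketch of a QG/A-style T2I evaluation score (TIFA / DSG / Gecko family).
# Both model calls are stubbed; in the paper they would be an LLM for
# question generation and a VQA model for question answering.

def generate_questions(prompt: str) -> list[tuple[str, str]]:
    """Stub for the LLM QG step: (question, expected_answer) pairs."""
    # Hypothetical output for "a red cat on a blue mat":
    return [
        ("Is there a cat?", "yes"),
        ("Is the cat red?", "yes"),
        ("Is the mat blue?", "yes"),
    ]

def vqa_answer(image, question: str) -> str:
    """Stub for the VQA step; a real system would run a VQA model on the image."""
    fake_answers = {
        "Is there a cat?": "yes",
        "Is the cat red?": "no",   # pretend the generated cat came out grey
        "Is the mat blue?": "yes",
    }
    return fake_answers[question]

def qa_score(image, prompt: str) -> float:
    """Fraction of prompt-derived questions the image answers correctly."""
    qa_pairs = generate_questions(prompt)
    correct = sum(vqa_answer(image, q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)

print(qa_score(None, "a red cat on a blue mat"))  # -> 0.6666666666666666 (2 of 3 correct)
```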

Details

problems in DSG

image image

proposed

image

result

image

I was really curious how the word-level (WL) score was measured for CLIP here, so I read the paper: the meta-metric is Spearman correlation. They collect word-level annotations, average them into a score per {image, caption} pair, and then measure how closely each metric's scores track those human scores.
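That meta-evaluation is just a rank correlation between metric scores and human scores. A minimal sketch with made-up numbers, using the classic no-ties Spearman formula rather than a library call:

```python
# Meta-evaluation sketch: Spearman correlation between a metric's scores
# and human scores over the same {image, caption} pairs. Pure stdlib.

def ranks(xs):
    """Rank values 1..n (this toy data has no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid when there are no ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up scores for five {image, caption} pairs.
human_scores  = [0.9, 0.2, 0.7, 0.4, 0.95]   # e.g., averaged word-level annotations
metric_scores = [0.8, 0.1, 0.6, 0.65, 0.9]   # e.g., a metric's outputs

print(spearman(human_scores, metric_scores))  # -> 0.9
```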

word-level annotation

image image
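One plausible reading of the aggregation (my assumption, not confirmed by the figures): each caption word gets a binary aligned/misaligned label, and the per-{image, caption} human score is the mean of those labels. A sketch under that assumption:

```python
# Assumed aggregation of word-level annotations: each caption word is labeled
# 1 (depicted in the image) or 0 (not), and the caption-level human score is
# the mean of the word labels. This is an illustrative reading of the note,
# not necessarily the paper's exact scheme.

def caption_score(word_labels: dict[str, int]) -> float:
    return sum(word_labels.values()) / len(word_labels)

labels = {"red": 0, "cat": 1, "blue": 1, "mat": 1}  # "red" judged missing
print(caption_score(labels))  # -> 0.75
```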

Evaluating performance across different CLIP models

image

SigLIP performs well. For the same model, the variant that saw more data was better. Not in all cases, but the larger models tended to be better.

For PyramidCLIP, it looks like this: image