TL;DR
- I read this because : T2I evaluation and word-level annotation, etc.
- task : T2I evaluation
- problem : the existing DSG / QG2 methodology causes LLM hallucination and changes the way the skill behaves.
- idea : a slightly different way of calling the skill.
- input/output : {image, text} -> score
- architecture : PaLM (QA) + PaLI (VQA)
- baseline : METEOR, SPICE, CLIP, TIFA, DSG
- data : proposed Gecko2k
- evaluation : proposed Gecko
- result : higher human correlation than the other metrics.
- contribution : proposed dataset; word-level annotation.
- etc. : I’m not sure what they did after collecting the word-level annotations. Is it just that they were better than Likert (absolute score) annotations?
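The meta-evaluation loop implied by the bullets above (each metric maps {image, text} to a score, and metrics are ranked by correlation with human ratings) can be sketched as follows. All numbers are toy values, not from the paper; only the Spearman computation itself is real.

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy metric scores vs. human ratings for five {image, caption} pairs.
metric_scores = [0.31, 0.72, 0.55, 0.90, 0.12]
human_ratings = [2, 4, 3, 5, 1]
print(spearman(metric_scores, human_ratings))  # 1.0: identical ordering
```

A metric with higher Spearman rho against the human ratings is the one that orders {image, caption} pairs more like people do, which is how "human correlation is higher than other metrics" gets operationalized.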
Details
problems in DSG
proposed
result
I was really curious how they measured WL (word-level) correlation for CLIP here, so I read the paper: the metric is Spearman. They collect word-level annotations, average them into a single score per {image, caption} pair, and then check how closely that score tracks human preference.
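My reading of the WL-to-score step above, as a minimal sketch. The caption, the per-word flags, and the three annotators are all made up for illustration; the only assumption is "average all word flags into one alignment score per pair".

```python
# Hypothetical word-level annotations for one {image, caption} pair:
# 1 = annotator judged the word as depicted in the image, 0 = not depicted.
caption = ["a", "red", "cat", "on", "a", "skateboard"]
annotations = [
    [1, 1, 1, 1, 1, 0],  # annotator 1: "skateboard" not depicted
    [1, 1, 1, 1, 1, 0],  # annotator 2
    [1, 0, 1, 1, 1, 0],  # annotator 3: also flags "red"
]

def word_level_score(flags_per_annotator):
    """Average all word flags across annotators into one alignment score."""
    flat = [f for flags in flags_per_annotator for f in flags]
    return sum(flat) / len(flat)

score = word_level_score(annotations)
print(round(score, 3))  # 14 of 18 flags are 1 -> 0.778
```

That per-pair score is then the "human" side fed into the Spearman correlation against each metric's score.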
word-level annotation
Evaluate performance for different CLIP models
SigLIP performs well. For the same model, the variant that saw more data was better. Larger models were better too, though not in every case.
For PyramidCLIP, it looks like this:
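For reference, the per-pair score behind these CLIP-model comparisons is just cosine similarity between the image and text embeddings; CLIPScore (Hessel et al., 2021) additionally rescales it as 2.5 * max(cos, 0). A stdlib-only sketch with stand-in embeddings (real ones would come from a CLIP/SigLIP image and text encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style rescaling: w * max(cos, 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)

# Stand-in 3-d embeddings for one {image, caption} pair.
img = [0.2, 0.1, 0.9]
txt = [0.3, 0.0, 0.8]
print(round(clip_score(img, txt), 3))
```

Swapping the encoder (SigLIP, a larger ViT, PyramidCLIP, ...) changes only where the embeddings come from; the scoring and the Spearman meta-evaluation stay the same, which is what makes the model comparison clean.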