TL;DR
- I read this because : T2I evaluation and word-level annotation, etc.
- task : T2I evaluation
- problem : the existing DSG / QG2 methodology causes LLM hallucination and changes the way the skill behaves.
- idea : a slightly different way of calling the skill.
- input/output : {image, text} -> score
- architecture : PaLM (QA) + PaLI (VQA)
- baseline : METEOR, SPICE, CLIP, TIFA, DSG
- data : proposed Gecko2k
- evaluation : proposed Gecko
- result : higher human correlation than the other metrics.
- contribution : proposed dataset; word-level annotation.
- etc. : I’m not sure what they did after collecting the word-level annotations. Is it just that they were better than Likert (absolute score) annotations?
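The meta-evaluation loop implied by the bullets above (each metric maps {image, text} to a score, and metrics are ranked by correlation with human ratings) can be sketched as follows. All numbers are toy values, not from the paper; only the Spearman computation itself is real.

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy metric scores vs. human ratings for five {image, caption} pairs.
metric_scores = [0.31, 0.72, 0.55, 0.90, 0.12]
human_ratings = [2, 4, 3, 5, 1]
print(spearman(metric_scores, human_ratings))  # 1.0: identical ordering
```

A metric with higher Spearman rho against the human ratings is the one that orders {image, caption} pairs more like people do, which is how "human correlation is higher than other metrics" gets operationalized.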
Details
problems in DSG
proposed
result
I was really curious how they measured WL (word-level) correlation for CLIP here, so I read the paper: the metric is Spearman. They collect word-level annotations, average them into a single score per {image, caption} pair, and then check how closely that score tracks human preference.
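My reading of the WL-to-score step above, as a minimal sketch. The caption, the per-word flags, and the three annotators are all made up for illustration; the only assumption is "average all word flags into one alignment score per pair".

```python
# Hypothetical word-level annotations for one {image, caption} pair:
# 1 = annotator judged the word as depicted in the image, 0 = not depicted.
caption = ["a", "red", "cat", "on", "a", "skateboard"]
annotations = [
    [1, 1, 1, 1, 1, 0],  # annotator 1: "skateboard" not depicted
    [1, 1, 1, 1, 1, 0],  # annotator 2
    [1, 0, 1, 1, 1, 0],  # annotator 3: also flags "red"
]

def word_level_score(flags_per_annotator):
    """Average all word flags across annotators into one alignment score."""
    flat = [f for flags in flags_per_annotator for f in flags]
    return sum(flat) / len(flat)

score = word_level_score(annotations)
print(round(score, 3))  # 14 of 18 flags are 1 -> 0.778
```

That per-pair score is then the "human" side fed into the Spearman correlation against each metric's score.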
word-level annotation
Evaluate performance for different CLIP models
SigLIP performs well. For the same model, the variant that saw more data was better. Larger models were better too, though not in every case.
For PyramidCLIP, it looks like this:
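For reference, the per-pair score behind these CLIP-model comparisons is just cosine similarity between the image and text embeddings; CLIPScore (Hessel et al., 2021) additionally rescales it as 2.5 * max(cos, 0). A stdlib-only sketch with stand-in embeddings (real ones would come from a CLIP/SigLIP image and text encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style rescaling: w * max(cos, 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)

# Stand-in 3-d embeddings for one {image, caption} pair.
img = [0.2, 0.1, 0.9]
txt = [0.3, 0.0, 0.8]
print(round(clip_score(img, txt), 3))
```

Swapping the encoder (SigLIP, a larger ViT, PyramidCLIP, ...) changes only where the embeddings come from; the scoring and the Spearman meta-evaluation stay the same, which is what makes the model comparison clean.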