[155] Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

paper, code

TL;DR

I read this because.. : It is about T2I evaluation and mentions something word-level, so I read it.
task : T2I evaluation
problem : Existing methods such as DSG and QG2 suffer from LLM hallucination. This work changes how the skills operate.
idea : Invoke the skills in a slightly different way.
input/output : {image, text} -> score
architecture : PaLM (QA) + PaLI (VQA)
baseline : METEOR, SPICE, CLIP, TIFA, DSG
data : proposed Gecko2k
evaluation : proposed Gecko
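result : Higher correlation with human ratings than other metrics.
contribution : Proposes a dataset. Word-level annotation.
etc. : I read it fairly carefully, but I still don't understand what they did with the word-level annotations after collecting them. Is the claim simply that they worked better than Likert (absolute score) annotations?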
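
To make the {image, text} -> score line above concrete, here is a minimal sketch of a generic QA-based metric: questions are derived from the prompt, a VQA model answers them against the image, and the score is the fraction answered as expected. This is only the general QG/A pattern, not necessarily Gecko's exact procedure. `qga_score`, `toy_generate_qa`, and `toy_answer` are hypothetical names, and the toy functions stand in for PaLM and PaLI so the sketch runs without any model.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a QA pair is (question, expected_answer).
QAPair = Tuple[str, str]


def qga_score(
    image: object,
    prompt: str,
    generate_qa: Callable[[str], List[QAPair]],
    answer: Callable[[object, str], str],
) -> float:
    """Fraction of prompt-derived questions that the VQA model answers as expected."""
    qa_pairs = generate_qa(prompt)
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer(image, question).strip().lower() == expected.strip().lower()
        for question, expected in qa_pairs
    )
    return correct / len(qa_pairs)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_generate_qa(prompt: str) -> List[QAPair]:
    # One yes/no question per content word in the prompt.
    stop_words = {"a", "and", "the"}
    return [(f"Is there a {w}?", "yes") for w in prompt.split() if w not in stop_words]


def toy_answer(image: object, question: str) -> str:
    # Here the "image" is just a set of visible object names.
    word = question[len("Is there a "):-1]
    return "yes" if word in image else "no"


score = qga_score({"cat", "sofa"}, "a cat and a dog", toy_generate_qa, toy_answer)
print(score)
```

In this toy run the "image" contains a cat but no dog, so one of the two questions is answered as expected.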

Details

problems in DSG

proposed

result

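I was very curious how WL was measured for CLIP here, so I went back to the paper. Since the reported metric is Spearman correlation, it seems that after the word-level annotation they aggregated the labels (something like an average) into a single score per {image, caption} pair, and then simply checked how closely each metric's scores agree with those human scores.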
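
My reading above can be sketched in a few lines. Two things here are assumptions on my part, not taken from the paper: the aggregation (share of prompt words marked as depicted) and all the numbers, which are made up. `word_level_to_score`, `average_ranks`, and `spearman` are hypothetical helper names. Spearman is computed as Pearson correlation on average ranks.

```python
from typing import List, Sequence


def word_level_to_score(word_labels: Sequence[bool]) -> float:
    """Assumed aggregation: share of prompt words annotators marked as depicted."""
    return sum(word_labels) / len(word_labels)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5  # undefined if either input is constant


# Made-up word-level labels for four {image, caption} pairs.
annotations = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [False, False, False],
]
human_scores = [word_level_to_score(a) for a in annotations]

# Made-up metric scores for the same four pairs (e.g. CLIP similarities).
metric_scores = [0.31, 0.27, 0.29, 0.18]

rho = spearman(metric_scores, human_scores)
print(rho)
```

The metric swaps the order of the second and third pairs relative to the human scores, so the correlation is high but not perfect.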

word-level annotation

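Performance evaluation of various CLIP models

SigLIP does do well. For the same model, the variant trained on more data did better. Not in every case, but larger models generally did better.

Pyramid CLIP looks like the following.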
