[165] Rich Human Feedback for Text-to-Image Generation

TL;DR

I read this because.. : CVPR best paper. Related to my own research.
task : evaluation in T2I generation. score with feedback
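problem : Existing score-based methods are hard to interpret and cannot indicate which parts are wrong.
idea : Human annotators are given an image and its text prompt and mark the incorrect parts; they also assign aesthetic / alignment / plausible scores. A model is then trained on these annotations.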
input/output : {image, text} -> 3 scores(aesthetic / alignment / plausible), tokens with align label, heatmap for misalignment
architecture : ViT / T5X / SA
objective : MSE (score and heatmap) + CE loss (misaligned token prediction)
baseline : (score) CLIPScore, PickScore, fine-tuned CLIP, (heatmap) CLIP Gradient
data : proposed RichHF-18K
evaluation : (image heatmap) MSE (gt=0) or saliency heatmap evaluation, (misaligned tokens) precision, recall, F1, (scores) Spearman, Kendall correlation
result : Predicts feedback better than the baselines. The predictions are then used in three ways, each showing improvement: (1) data filtering, (2) as a reward for the image model, (3) providing the heatmap and asking the model to regenerate.
contribution : Dataset, benchmark, model, and model improvement using that model.. I guess it takes this much to get a best paper..
etc. :
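
The objective line above (MSE on scores and heatmap, plus CE on misaligned-token prediction) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation. The function names, the flat-list heatmap, the loss weights `w_score` / `w_heatmap` / `w_token`, and the per-token binary form of the cross-entropy are all assumptions made for the example.

```python
import math


def mse(pred, target):
    """Mean squared error over two equal-length flat sequences."""
    assert len(pred) == len(target) and len(pred) > 0
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean cross-entropy over per-token misalignment probabilities.

    labels: 1 = token marked misaligned, 0 = token aligned (assumed encoding).
    """
    assert len(probs) == len(labels) and len(probs) > 0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def total_loss(pred_scores, gt_scores,
               pred_heatmap, gt_heatmap,
               token_probs, token_labels,
               w_score=1.0, w_heatmap=1.0, w_token=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_score * mse(pred_scores, gt_scores)
            + w_heatmap * mse(pred_heatmap, gt_heatmap)
            + w_token * binary_cross_entropy(token_probs, token_labels))


if __name__ == "__main__":
    loss = total_loss(
        pred_scores=[0.7, 0.4, 0.9], gt_scores=[0.8, 0.5, 0.9],
        pred_heatmap=[0.0, 0.2, 0.9, 0.1], gt_heatmap=[0.0, 0.0, 1.0, 0.0],
        token_probs=[0.1, 0.8, 0.2], token_labels=[0, 1, 0],
    )
    print(f"total loss: {loss:.4f}")
```

The heatmap is flattened to a list here only to keep the sketch dependency-free; a real model would work on 2D tensors.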
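
The evaluation metrics listed above for misaligned tokens (precision, recall, F1) and for scores (Spearman correlation) can be illustrated with a small dependency-free sketch. The helper names and the 0/1 token-label encoding are assumptions for this example, not the paper's evaluation code. Kendall correlation is left out to keep it short.

```python
import math


def precision_recall_f1(pred_labels, gt_labels):
    """Token-level precision/recall/F1, treating 1 = misaligned as the positive class."""
    pairs = list(zip(pred_labels, gt_labels))
    tp = sum(1 for p, g in pairs if p == 1 and g == 1)
    fp = sum(1 for p, g in pairs if p == 1 and g == 0)
    fn = sum(1 for p, g in pairs if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
    print(spearman([0.2, 0.5, 0.9, 0.4], [1, 3, 5, 2]))
```

In practice, library routines such as `scipy.stats.spearmanr` and `scipy.stats.kendalltau` would be the more usual choice for the correlation part.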