TL;DR
- I read this because : CVPR best paper, and it is closely related to my own research.
- task : evaluation of text-to-image (T2I) generation; scores plus actionable feedback
- Problem : traditional score-based methods are hard to interpret and don't tell you what is wrong with a generation.
- Idea : given a generated {image, text} pair, have human annotators mark what is wrong (misaligned tokens, problematic image regions) and additionally rate aesthetic / alignment / plausibility scores; train a model on this data to predict the same feedback.
- input/output : {image, text} -> 3 scores (aesthetic / alignment / plausibility), per-token alignment labels, heatmaps for implausible/misaligned regions
- architecture : ViT (image encoder) / T5X (text encoder) / self-attention (SA) fusion
- objective : MSE loss (scores and heatmaps) + CE loss (misaligned-token prediction)
- baselines : (scores) CLIPScore, PickScore, fine-tuned CLIP; (heatmaps) CLIP gradient-based saliency
- data : proposed RichHF-18K dataset
- evaluation : (heatmaps) MSE (gt = 0 for artifact-free images) or saliency-heatmap metrics; (misaligned tokens) precision, recall, F1; (scores) Spearman and Kendall correlations
- result : the predicted feedback matches human judgments better than the baselines. The authors then use the feedback model to improve generation in three ways: (1) filtering training data, (2) using predicted scores as a reward for the image model, and (3) feeding the heatmap of problem regions back in so those regions are regenerated.
- contribution : dataset, benchmark, model, and demonstrated model improvements using that model… it deserves the best-paper award.
- etc. :
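The objective line above (MSE for scores and heatmaps, CE for misaligned-token prediction) can be sketched as a single combined loss. This is a minimal illustration, not the paper's implementation: the function names, loss weights, and tensor shapes are my own assumptions.

```python
import math

def mse(pred, target):
    # Mean squared error over flat lists; used for both the three scalar
    # scores and a (flattened) per-pixel heatmap.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def token_ce(logits, labels):
    # Binary cross-entropy over per-token "misaligned" labels:
    # logits[i] is a raw score that token i is misaligned, labels[i] in {0, 1}.
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)

def rich_feedback_loss(pred, target, w_score=1.0, w_heat=1.0, w_tok=1.0):
    # Combined objective: MSE on the scores, MSE on the heatmap,
    # CE on the misaligned-token labels. Weights are illustrative, not the paper's.
    return (w_score * mse(pred["scores"], target["scores"])
            + w_heat * mse(pred["heatmap"], target["heatmap"])
            + w_tok * token_ce(pred["token_logits"], target["token_labels"]))
```

A perfect prediction drives the MSE terms to zero and the CE term toward zero, so the loss cleanly decomposes per output head.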
Details
What did they do?
architecture of the rich-feedback model
Result
performance of the feedback model
Improved generation models using the feedback model
Results from a Muse model fine-tuned only on high-scoring images / using predicted scores as reward guidance
Give the model the heatmap of flawed regions and have it regenerate those regions
Before and after fine-tuning comparison
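The evaluation protocol in the TL;DR (precision/recall/F1 over misaligned tokens, rank correlation for scores) can be sketched in plain Python. These are textbook metric definitions, not code from the paper; a real evaluation would use `sklearn` / `scipy.stats.spearmanr`.

```python
def token_prf(pred_labels, gt_labels):
    # Precision / recall / F1 over binary per-token misalignment predictions.
    tp = sum(1 for p, g in zip(pred_labels, gt_labels) if p and g)
    fp = sum(1 for p, g in zip(pred_labels, gt_labels) if p and not g)
    fn = sum(1 for p, g in zip(pred_labels, gt_labels) if not p and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def spearman(x, y):
    # Spearman rank correlation between predicted and human scores
    # (no tie handling; use scipy.stats.spearmanr in practice).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Spearman (and Kendall) correlation is the natural choice here because absolute score scales differ between models and annotators; only the ranking of images matters.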