TL;DR
- I read this because : CVPR best paper, and it is closely related to my own research.
- task : evaluation of text-to-image (T2I) generation; scores plus actionable feedback
- Problem : traditional score-based methods are hard to interpret and don't tell you what is wrong with a generation.
- Idea : given a generated {image, text} pair, have human annotators mark what is wrong (misaligned tokens, problematic image regions) and additionally rate aesthetic / alignment / plausibility scores; train a model on this data to predict the same feedback.
- input/output : {image, text} -> 3 scores (aesthetic / alignment / plausibility), per-token alignment labels, heatmaps for implausible/misaligned regions
- architecture : ViT (image encoder) / T5X (text encoder) / self-attention (SA) fusion
- objective : MSE loss (scores and heatmaps) + CE loss (misaligned-token prediction)
- baselines : (scores) CLIPScore, PickScore, fine-tuned CLIP; (heatmaps) CLIP gradient-based saliency
- data : proposed RichHF-18K dataset
- evaluation : (heatmaps) MSE (gt = 0 for artifact-free images) or saliency-heatmap metrics; (misaligned tokens) precision, recall, F1; (scores) Spearman and Kendall correlations
- result : the predicted feedback matches human judgments better than the baselines. The authors then use the feedback model to improve generation in three ways: (1) filtering training data, (2) using predicted scores as a reward for the image model, and (3) feeding the heatmap of problem regions back in so those regions are regenerated.
- contribution : dataset, benchmark, model, and demonstrated model improvements using that model… it deserves the best-paper award.
- etc. :
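The objective line above (MSE for scores and heatmaps, CE for misaligned-token prediction) can be sketched as a single combined loss. This is a minimal illustration, not the paper's implementation: the function names, loss weights, and tensor shapes are my own assumptions.

```python
import math

def mse(pred, target):
    # Mean squared error over flat lists; used for both the three scalar
    # scores and a (flattened) per-pixel heatmap.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def token_ce(logits, labels):
    # Binary cross-entropy over per-token "misaligned" labels:
    # logits[i] is a raw score that token i is misaligned, labels[i] in {0, 1}.
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(logits)

def rich_feedback_loss(pred, target, w_score=1.0, w_heat=1.0, w_tok=1.0):
    # Combined objective: MSE on the scores, MSE on the heatmap,
    # CE on the misaligned-token labels. Weights are illustrative, not the paper's.
    return (w_score * mse(pred["scores"], target["scores"])
            + w_heat * mse(pred["heatmap"], target["heatmap"])
            + w_tok * token_ce(pred["token_logits"], target["token_labels"]))
```

A perfect prediction drives the MSE terms to zero and the CE term toward zero, so the loss cleanly decomposes per output head.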
Details
What did they do?
architecture of the rich-feedback model
Result
performance of the feedback model
Improved generation models using the feedback model
Results from a Muse model fine-tuned only on high-scoring images / using predicted scores as reward guidance
Give the model the heatmap of flawed regions and have it regenerate those regions
Before and after fine-tuning comparison
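The evaluation protocol in the TL;DR (precision/recall/F1 over misaligned tokens, rank correlation for scores) can be sketched in plain Python. These are textbook metric definitions, not code from the paper; a real evaluation would use `sklearn` / `scipy.stats.spearmanr`.

```python
def token_prf(pred_labels, gt_labels):
    # Precision / recall / F1 over binary per-token misalignment predictions.
    tp = sum(1 for p, g in zip(pred_labels, gt_labels) if p and g)
    fp = sum(1 for p, g in zip(pred_labels, gt_labels) if p and not g)
    fn = sum(1 for p, g in zip(pred_labels, gt_labels) if not p and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def spearman(x, y):
    # Spearman rank correlation between predicted and human scores
    # (no tie handling; use scipy.stats.spearmanr in practice).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Spearman (and Kendall) correlation is the natural choice here because absolute score scales differ between models and annotators; only the ranking of images matters.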