
paper

TL;DR

  • I read this because : CVPR best paper, and closely related to my own research.
  • task : evaluation for T2I generation; scores plus rich feedback.
  • problem : traditional score-based methods are hard to interpret and don't tell you what is going wrong.
  • idea : for each {image, text} pair, have human annotators mark what is wrong (misaligned regions and words) and also give aesthetic / alignment / plausibility scores; train a model on this data.
  • input/output : {image, text} -> 3 scores (aesthetic / alignment / plausibility), per-token misalignment labels, heatmap of misaligned/implausible regions
  • architecture : ViT (image) / T5X (text) / self-attention fusion
  • objective : MSE (scores and heatmap) + CE loss (misaligned-token prediction)
  • baseline : (scores) CLIPScore, PickScore, fine-tuned CLIP; (heatmap) CLIP gradient-based saliency
  • data : proposed RichHF-18K
  • evaluation : (heatmap) MSE (gt = 0 for clean images) or saliency-heatmap metrics; (misaligned tokens) precision, recall, F1; (scores) Spearman and Kendall correlation
  • result : outperforms baselines on feedback prediction. The authors use the model in three ways to improve generation: (1) filtering training data, (2) providing rewards/guidance to the image model, and (3) regenerating the regions flagged by the heatmap.
  • contribution : dataset, benchmark, model, and model improvements using them… no wonder it won best paper…
  • etc. :
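The objective above (MSE on the three scores and the heatmap, cross-entropy on the misaligned-token labels) can be sketched as a single combined loss. This is a minimal NumPy sketch, not the paper's implementation; the function name, shapes, and weighting (unweighted sum) are all assumptions:

```python
import numpy as np

def rich_feedback_loss(pred_scores, gt_scores,
                       pred_heatmap, gt_heatmap,
                       token_logits, token_labels):
    """Combined training loss (hypothetical sketch).
    Assumed shapes: scores (3,) for aesthetic/alignment/plausibility,
    heatmap (H, W), token_logits (T, 2), token_labels (T,) in {0, 1}."""
    # MSE over the three predicted scores
    score_loss = np.mean((pred_scores - gt_scores) ** 2)
    # MSE over the misalignment/implausibility heatmap
    heatmap_loss = np.mean((pred_heatmap - gt_heatmap) ** 2)
    # Cross-entropy over binary aligned/misaligned token labels
    shifted = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce_loss = -np.mean(log_probs[np.arange(len(token_labels)), token_labels])
    # Unweighted sum; the paper may weight the terms differently
    return score_loss + heatmap_loss + ce_loss
```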
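The misaligned-token evaluation (precision, recall, F1) is also easy to sketch. A hypothetical helper, assuming equal-length 0/1 label vectors with 1 = misaligned:

```python
def token_prf(pred, gt):
    """Precision/recall/F1 for binary misaligned-token labels
    (1 = misaligned). `pred` and `gt` are equal-length 0/1 sequences."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, `token_prf([1, 0, 1, 0], [1, 1, 0, 0])` gives precision 0.5, recall 0.5, F1 0.5.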

Details

What to do?

[figure]

Architecture of the rich feedback model:

[figure]

Result

Performance of the feedback model:

[figures]

Improving generation models with the feedback model

  • Results from a Muse model fine-tuned only on high-scoring samples / using the predicted scores as reward guidance. [figure]

  • Give the generator the heatmap of flagged regions and have it redraw them. [figure]

  • Before/after fine-tuning comparison. [figure]