TL;DR
- I read this because : related to my own research
- task : Learn human preference for T2I generation product
- Problem : Measuring with FID is not a good representation of human preference. We need an open-source preference dataset.
- Idea : Create a webpage to collect human preference data
- input/output : {image, prompt} -> score
- architecture : ViT-H/14
- objective : KL divergence
- baseline : Aesthetic score, CLIP-H, ImageReward, HPS, Human Expert
- data : Pick-a-Pic (the paper uses 583K training rankings, with 500 validation and 500 test samples)
- evaluation : pairwise accuracy, counting a preference only when the score difference exceeds a threshold; Spearman correlation with human experts
- result : highest accuracy and correlation; reranking with PickScore was preferred over the classifier-free guidance technique
- contribution : releases a large dataset and the model, and reports the performance gains from using them
- etc. : the NeurIPS paper discloses a lot of detail about the dataset
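The input/output ({image, prompt} -> score) and KL objective above can be sketched as follows. This is a minimal numpy sketch under my assumptions: the encoders are stand-ins for the ViT-H/14 CLIP text/image towers, and `temperature` is an assumed scaling constant, not the paper's exact value.

```python
import numpy as np

def pickscore(text_emb, img_emb, temperature=100.0):
    """PickScore-style score: scaled dot product of L2-normalized
    text and image embeddings ({image, prompt} -> score).
    The embeddings here stand in for CLIP ViT-H/14 outputs."""
    t = text_emb / np.linalg.norm(text_emb)
    i = img_emb / np.linalg.norm(img_emb)
    return temperature * float(t @ i)

def kl_preference_loss(s1, s2, p=(1.0, 0.0)):
    """KL divergence between the label distribution p over the two
    images and the softmax of their predicted scores."""
    z = np.array([s1, s2])
    z = z - z.max()                      # numerical stability
    q = np.exp(z) / np.exp(z).sum()      # predicted preference distribution
    eps = 1e-12
    # sum only over labels with nonzero mass (0 * log 0 := 0)
    return float(sum(pi * np.log(pi / (qi + eps))
                     for pi, qi in zip(p, q) if pi > 0))
```

With equal scores and a tie label `p=(0.5, 0.5)`, the loss is zero; as the preferred image's score grows, the loss shrinks toward zero.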
Details
annotation
- prompts are entered by users
- Image generation is supported by Stable Diffusion 2.1, Dreamlike Photoreal 2.0, and Stable Diffusion XL variants
Pick-a-Pic Dataset
- Total 968K ranking
- The paper used 583K rankings from 37K prompts and 4K users
- Various measures taken to ensure data quality (email verification, bot detection, …)
PickScore
CLIP
finetuning loss
$s$ : score, $x$ : prompt, $y_1, y_2$ : the two images
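With these symbols, the objective can be written out; this is my reconstruction from the paper's description, where $p$ is the label distribution ($(1,0)$, $(0,1)$, or $(0.5,0.5)$ for ties):

```latex
\hat{p}_i = \frac{\exp s(x, y_i)}{\exp s(x, y_1) + \exp s(x, y_2)}, \qquad
\mathcal{L} = \mathrm{KL}\!\left(p \,\Vert\, \hat{p}\right)
            = \sum_{i \in \{1,2\}} p_i \log \frac{p_i}{\hat{p}_i}
```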
They tried in-batch negatives, but they did not perform well. Training: 4,000 steps, lr 3e-6, batch size 128, warmup 500 steps; reportedly took less than an hour on 8 A100s.
Result
rerank via CLIP-H vs Pick-a-Pic
accuracy
What we learned with classifier-free guidance
correlation with human experts
Comparison to other models
why not COCO? Generating images from COCO prompts is still the most popular evaluation setup, but COCO prompts describe generic objects, which is not what real users ask for.
Just generated vs. reranked with PickScore
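Reranking here means best-of-n selection: generate several candidates for one prompt and keep the highest-scoring one. A minimal sketch, where `score_fn` is a stand-in for PickScore (names and signature are my own, not the paper's):

```python
def rerank(candidates, score_fn, prompt):
    """Best-of-n selection: given candidate images generated for one
    prompt, return the candidate that score_fn rates highest.
    score_fn(prompt, image) stands in for a preference model
    such as PickScore."""
    return max(candidates, key=lambda img: score_fn(prompt, img))
```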