[167] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

TL;DR

I read this because.. : related to my personal research
task : learning human preferences over T2I generation outputs
problem : FID does not reflect human preference well. An open-source preference dataset is needed.
idea : build a web app and collect human preference data through it
input/output : {image, prompt} -> score
architecture : ViT-H/14
objective : KL divergence
baseline : Aesthetic score, CLIP-H, ImageReward, HPS, Human Expert
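data : Pick-a-Pic data (the version used in the paper has 583K training samples, plus 500 validation and 500 test samples)
evaluation : accuracy, where one image counts as preferred when the score difference exceeds a threshold; Spearman correlation with human experts
result : highest accuracy and correlation. Applying it together with classifier-free guidance produced outputs that were preferred more often.
contribution : releases a very large dataset, the model, and the improvements obtained with it.
etc. : NeurIPS papers seem to release datasets quite often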
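
The evaluation line above can be sketched roughly as follows. This is a minimal sketch, not the paper's evaluation code: the function names, the label encoding (1 or 2 for the preferred image, 0 for a tie), and the half credit given when only one side is a tie (`tie_credit`) are assumptions made for illustration.

```python
def predict_preference(score_1, score_2, threshold):
    """Return 1 or 2 for the preferred image, or 0 when the gap is below the threshold."""
    diff = score_1 - score_2
    if abs(diff) < threshold:
        return 0  # scores too close: predict a tie
    return 1 if diff > 0 else 2


def preference_accuracy(score_pairs, labels, threshold, tie_credit=0.5):
    """Average credit over all pairs.

    Assumption: an exact match earns 1.0, a mismatch where exactly one side
    is a tie earns `tie_credit`, and everything else earns 0.0.
    """
    total = 0.0
    for (score_1, score_2), label in zip(score_pairs, labels):
        prediction = predict_preference(score_1, score_2, threshold)
        if prediction == label:
            total += 1.0
        elif prediction == 0 or label == 0:
            total += tie_credit
    return total / len(labels)
```

The threshold would typically be tuned on the validation split before reporting test accuracy.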

Prompts are entered by users.
Images are generated with Stable Diffusion 2.1, Dreamlike Photoreal 2.0, and Stable Diffusion XL variants.

$s$: score, $x$: prompt, $y_1, y_2$: images
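
Using the symbols above, the KL-divergence objective from the TL;DR can be sketched as follows: the two scores $s(x, y_1)$ and $s(x, y_2)$ are turned into a predicted preference distribution with a softmax, and the loss is the KL divergence from the human preference distribution to that prediction. This is a minimal sketch under assumptions: the function names are hypothetical, and encoding the target as `(1, 0)`, `(0, 1)`, or `(0.5, 0.5)` for a tie is assumed here rather than taken from these notes.

```python
import math


def log_preference_probs(score_1, score_2):
    """Log-softmax over the two scores s(x, y_1) and s(x, y_2)."""
    m = max(score_1, score_2)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(score_1 - m) + math.exp(score_2 - m))
    return score_1 - log_z, score_2 - log_z


def kl_preference_loss(score_1, score_2, target):
    """KL(target || predicted) for one (prompt, image_1, image_2) example.

    `target` is the human preference distribution; terms with zero target
    probability contribute nothing (0 * log 0 is treated as 0).
    """
    log_pred = log_preference_probs(score_1, score_2)
    loss = 0.0
    for p, log_q in zip(target, log_pred):
        if p > 0.0:
            loss += p * (math.log(p) - log_q)
    return loss
```

With a one-hot target the loss reduces to the usual cross-entropy on the preferred image; with a tie target it is zero only when both scores are equal.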

They also tried in-batch negatives, but reportedly the results were worse.
Training: 4000 steps, lr 3e-6, batch size 128, 500 warmup steps. Reportedly took under an hour on 8 A100s.
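
For reference, the warmup in the training setup above could look like the sketch below. Only the peak learning rate (3e-6), the 500 warmup steps, and the 4000 total steps come from these notes; the linear shape of the warmup and the constant rate afterwards are assumptions, since the notes do not record the schedule used after warmup.

```python
def learning_rate(step, peak_lr=3e-6, warmup_steps=500):
    """Learning rate at a given 0-indexed training step.

    Assumption: linear warmup to `peak_lr`, then a constant rate.
    The post-warmup behaviour is not specified in the notes.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Schedule over the 4000 training steps mentioned in the notes.
schedule = [learning_rate(step) for step in range(4000)]
```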

Reranking via CLIP-H vs Pick-a-Pic
Accuracy
Trained with classifier-free guidance
Correlation with human experts
Comparison with other models
Why not COCO? Generating images from COCO prompts is reportedly still the most common evaluation setup. COCO covers generic objects, which differs from what users actually want.
Plain generation vs. reranking with PickScore
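
The reranking comparison above amounts to generating several candidates for one prompt, scoring each, and keeping the highest-scoring one. A minimal sketch follows; `score_fn` stands in for a PickScore-style scorer taking `(prompt, image)`, and both the function names and the word-overlap `toy_score` (which treats caption strings as "images") are hypothetical placeholders, not the released model's API.

```python
def rerank_by_score(prompt, images, score_fn):
    """Return the candidates sorted from highest to lowest score."""
    scored = [(score_fn(prompt, image), index) for index, image in enumerate(images)]
    # Python's sort is stable, so tied candidates keep their generation order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [images[index] for _, index in scored]


def pick_best(prompt, images, score_fn):
    """Keep only the top-scoring candidate."""
    return rerank_by_score(prompt, images, score_fn)[0]


def toy_score(prompt, image):
    """Placeholder scorer: counts words shared by the prompt and a caption string."""
    return len(set(prompt.split()) & set(image.split()))
```

In practice `score_fn` would wrap the trained preference model, and `images` would be the outputs of one text-to-image model sampled with different seeds.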