
paper, page

TL;DR

  • I read this because… : I was curious about dataset filtering / evaluation
  • task : CLIP
  • problem : no open, large-scale image-text dataset
  • idea : build a pool from Common Crawl + systematically study filtering
  • input/output : image / text -> similarity score
  • architecture : same as CLIP
  • objective : contrastive loss
  • baseline : LAION-2B
  • data : CommonPool 12.8B -> (filtered) DataComp-1B (1.4B pairs)
  • evaluation : zero-shot ImageNet / ImageNet-A / … (detailed below) + retrieval
  • result : Higher performance than LAION-2B
  • contribution : datasets made publicly available; ablations of various filtering techniques; a competition to stimulate data-centric research directions
  • etc. :
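The objective is CLIP's symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. A minimal numpy sketch of that loss, assuming the image/text embeddings are already computed (function name and details are mine, not the paper's code):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B); matched pairs on the diagonal
    labels = np.arange(logits.shape[0])

    def xent(l):
        # row-wise softmax cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched pairs sit on the diagonal of the similarity matrix; every other pair in the batch serves as a negative.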

Details

Evaluation

  • zs image classification
  • 22 datasets evaluated in the original CLIP paper
  • 6 distribution-shifted ImageNets: ImageNet-Sketch, ImageNet-V2, ImageNet-A, ImageNet-O, ImageNet-R, ObjectNet
  • 13 VTAB datasets: https://arxiv.org/pdf/1910.04867.pdf
  • 3 WILDS datasets: a benchmark of 10 datasets reflecting distribution shifts that naturally arise in real-world applications, e.g. shifts across hospitals for tumor identification, across camera traps for wildlife monitoring, and across time and location in satellite imaging and poverty mapping (WILDS: A benchmark of in-the-wild distribution shifts). iWildCam2020-wilds (wildlife…), Camelyon17-wilds (cellular tissue…), RxRx1-wilds (RNA…)
  • WinoGAViL: commonsense association task https://paperswithcode.com/dataset/winogavil. I still don't understand what it is even after looking at it.
  • Finally, two fairness datasets: FairFace, UTKFace -> race-matched classification
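Zero-shot classification here works the usual CLIP way: embed a few prompt templates per class name, average them into a class prototype, and assign each image to the most cosine-similar prototype. A minimal numpy sketch, assuming precomputed embeddings (names are mine, not from the paper):

```python
import numpy as np

def class_prototypes(template_embs):
    """template_embs: (num_classes, num_templates, dim) text embeddings of
    prompts like 'a photo of a {label}'. Average over templates, re-normalize."""
    proto = template_embs.mean(axis=1)
    return proto / np.linalg.norm(proto, axis=1, keepdims=True)

def zero_shot_predict(img_embs, prototypes):
    """Label each image with the class whose prototype is most cosine-similar."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    return (img @ prototypes.T).argmax(axis=1)
```

No classifier is trained; the text encoder's prompt embeddings act as the classification head, which is what makes the benchmark sensitive to pretraining data quality.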

Some discoveries

  • High correlation between zs retrieval and linear probing

  • High correlation between performance with small datasets and performance with large datasets

  • High correlation between ImageNet and other datasets

Datasets with low correlation performed close to random guessing.

It’s all so esoteric… the only useful low-correlation datasets here are ImageNet-A and Country211?! And unsurprisingly, the OCR-leaning datasets (Rendered SST2, SVHN) were also uncorrelated.
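The correlations above are rank correlations across candidate training sets: does the ordering of filtering methods on ImageNet match their ordering on another benchmark? A minimal Spearman sketch (no tie handling), just to make that comparison concrete:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two score vectors (ties not handled)."""
    # argsort of argsort turns raw scores into 0-based ranks
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    # Pearson correlation of the ranks
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

A value near +1 means the two benchmarks rank the filtering methods the same way; near 0 (as for the OCR-leaning datasets) means one benchmark tells you nothing about the other.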

cf. hyperparameters like batch size barely change the ranking of data-filtering methods
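Among the filtering baselines the paper ablates, the simplest is CLIP-score thresholding: keep only the pairs whose image and caption embeddings agree most. A minimal sketch of that idea, assuming precomputed embeddings (the function name and keep-fraction are mine):

```python
import numpy as np

def clip_score_filter(img_embs, txt_embs, keep_frac=0.3):
    """Keep the top keep_frac of image-text pairs by cosine similarity
    (CLIP-score filtering). Returns indices of the retained pairs."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    scores = (img * txt).sum(axis=1)          # per-pair cosine similarity
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(scores)[-k:]            # indices of the best-aligned pairs
```

The hparam-robustness note above is what makes a small-scale ablation like this trustworthy: the relative ranking of filters tends to hold as batch size and scale change.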