TL;DR
- I read this because.. : I was curious about dataset filtering / evaluation
- task : CLIP
- problem : open large image-text dataset
- idea : common crawl + study
- input/output : image / text -> similarity score
- architecture : Same as CLIP
- objective : contrastive loss
- baseline : LAION-2B
- data : CommonPool 12.8B -> (filtered) DataComp 1.4B
- evaluation : zero-shot ImageNet, ImageNet-A, etc. (detailed below) + retrieval
- result : Higher performance than LAION-2B
- contribution : Makes the datasets publicly available, ablates various filtering techniques, and runs a competition to stimulate research directions that focus on data.
- etc. :
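To make the input/output and objective lines above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It assumes the embeddings are already computed; the toy 2-d vectors and the fixed `temperature=0.07` are assumptions for illustration (CLIP learns its temperature), and this is not the paper's training code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Similarity score between one image embedding and one text embedding."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    """
    # logits[i][j] = similarity(image i, text j) / temperature
    logits = [
        [cosine_similarity(im, tx) / temperature for tx in text_embs]
        for im in image_embs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text -> image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    # Toy, made-up embeddings: pair 0 and pair 1 are each well aligned.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.1, 0.9]]
    print("loss for aligned pairs:", clip_contrastive_loss(images, texts))
```

The loss is small when each image is most similar to its own caption and large when the pairs are shuffled.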
Details
Evaluation
- zs-image classification
- 22 datasets evaluated in the original CLIP paper
- 6 distribution-shifted ImageNets : ImageNet-Sketch, ImageNet-V2, ImageNet-A, ImageNet-O, ImageNet-R, ObjectNet
- 13 VTAB datasets: https://arxiv.org/pdf/1910.04867.pdf
- 3 WILDS datasets: WILDS ("WILDS: A benchmark of in-the-wild distribution shifts") is a benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification, across camera traps for wildlife monitoring, and across time and location in satellite imaging and poverty mapping. e.g. iWildCam2020-wilds (wildlife..), Camelyon17-wilds (cellular tissue..), RxRx1-wilds (RNA…)
- WinoGAViL : commonsense association task https://paperswithcode.com/dataset/winogavil (I still don't understand what it is, even after looking at it.)
- Finally, two fairness datasets: FairFace, UTKFace -> race-matched classification
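As a reminder of what "zs-image classification" means for the datasets listed above, here is a minimal sketch: embed one text prompt per class, then predict the class whose text embedding is most similar to the image embedding. The 2-d embeddings and class names below are made-up placeholders, not outputs of a real encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the class name whose text embedding is closest to the image."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))


def accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose zero-shot prediction matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical text embeddings for prompts like "a photo of a dog".
    class_text_embs = {"dog": [1.0, 0.1], "cat": [0.1, 1.0]}
    # Hypothetical image embeddings with their true labels.
    image_embs = [[0.9, 0.2], [0.2, 0.8]]
    labels = ["dog", "cat"]
    print("zero-shot accuracy:", accuracy(image_embs, labels, class_text_embs))
```

No classifier is trained on the target dataset, which is why the same model can be scored on all of these benchmarks without fine-tuning.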
Some discoveries
High correlation between zs retrieval and linear probing
High correlation between performance with small datasets and performance with large datasets
High correlation between ImageNet and other datasets
Datasets with low correlation were the ones where performance was close to random guessing:
- WinoGAViL (I don't understand it even after looking at it): https://paperswithcode.com/dataset/winogavil
- Wildlife: https://paperswithcode.com/dataset/iwildcam-2021
- Autonomous driving: https://github.com/harshilpatel312/KITTI-distance-estimation
- A collection of misclassifications from ImageNet: https://paperswithcode.com/dataset/imagenet-a
- Satellite image: https://paperswithcode.com/dataset/fmow
- Airplane type classification: https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
- Categorize which country this photo was taken in: https://paperswithcode.com/dataset/country211
- Medical: https://camelyon17.grand-challenge.org/ , https://patchcamelyon.grand-challenge.org/
- 3D objects relationship: https://paperswithcode.com/dataset/clevr
It's all so esoteric… The only useful ones here seem to be ImageNet-A and Country211?! And unsurprisingly, the OCR-style datasets (Rendered SST2, SVHN) were also uncorrelated.
cf. hyperparameters such as batch size cause little change in the ranking of data filtering methods
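One way to put a number on "high correlation" and "little change in rank" is a Spearman rank correlation between two lists of scores for the same set of filtering methods. The sketch below is only an illustration: these notes do not record which statistic the paper uses, and the scores are made-up placeholders, not results.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical accuracies of four filtering methods (same order in both
    # lists), e.g. at a small scale vs. a large scale, or at two batch sizes.
    setting_a = [0.10, 0.15, 0.12, 0.20]
    setting_b = [0.40, 0.55, 0.50, 0.63]
    print("rank correlation:", spearman(setting_a, setting_b))
```

A value near 1 means the methods are ordered the same way in both settings, which is what makes small-scale experiments informative about large-scale ones.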