
paper, code

TL;DR

  • why I read this : For CLIPScore-style scoring, would a sigmoid-trained SigLIP give much different scores than a softmax-trained CLIP? I was curious about the loss part and its effect.
  • task : CLIP
  • problem : The softmax in the InfoNCE loss is unstable to train, and summing over the negative pairs in the denominator requires an all-gather of features across devices, which makes training inefficient.
  • idea : propose a sigmoid loss. See details below.
  • input/output : {image, text} -> score
  • architecture : ViT-B/16, (LiT setting) ViT-B/8, ViT-g/14
  • objective : Sigmoid Loss
  • baseline : CLIP, OpenCLIP, EVA-CLIP, CLIPA-v2
  • data : WebLI dataset, using only the English image-text pairs
  • evaluation : ImageNet-1k / COCO R@1
  • result : Better performance than the comparison group, though the training data differs. lol, I didn't look closely, but presumably the step counts were matched…
  • contribution : Sigmoid loss proposal. Various ablation experiments.
  • etc. :

Details

Sigmoid Loss

The existing InfoNCE (softmax) loss:

$$-\frac{1}{2|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\left(\log\frac{e^{t\, x_i\cdot y_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\, x_i\cdot y_j}} + \log\frac{e^{t\, x_i\cdot y_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\, x_j\cdot y_i}}\right)$$

Here the softmax normalization is computed twice, once along each axis: image → text and text → image.
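As a sketch, the two-axis softmax loss can be written in plain NumPy (function name is mine; the temperature $t$ is learnable in practice but fixed here):

```python
import numpy as np

def softmax_contrastive_loss(img, txt, t=10.0):
    """Two-axis InfoNCE: log-softmax over rows (image -> text) and over
    columns (text -> image); matched pairs sit on the diagonal.
    img, txt: (B, D) L2-normalized embeddings."""
    logits = t * (img @ txt.T)                                  # (B, B) similarities
    i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    B = img.shape[0]
    return -(np.diag(i2t).sum() + np.diag(t2i).sum()) / (2 * B)
```

Note that both denominators need the full (B, B) similarity matrix, which is where the all-gather comes from in the distributed case.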

$$-\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{j=1}^{|\mathcal{B}|}\log\frac{1}{1+e^{z_{ij}(-t\, x_i\cdot y_j + b)}}$$

This is the proposed sigmoid loss, where $z_{ij}$ is the label: $1$ for a positive pair and $-1$ for a negative one. Since negatives vastly outnumber positives, a learnable temperature $t'$ (with $t = e^{t'}$) and bias $b$ are introduced to counter the imbalance, initialized to $\log 10$ and $-10$ respectively.
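A minimal NumPy sketch of the loss above (function name is mine; $t'$ and $b$ are learnable parameters in the paper, fixed at their initial values here):

```python
import numpy as np

def sigmoid_contrastive_loss(img, txt, t_prime=np.log(10.0), b=-10.0):
    """Every image-text pair becomes an independent binary classification:
    z_ij = +1 on the diagonal (matched pairs), -1 everywhere else."""
    B = img.shape[0]
    t = np.exp(t_prime)                       # temperature, init exp(log 10) = 10
    logits = t * (img @ txt.T) + b            # (B, B) pairwise logits
    z = 2 * np.eye(B) - 1                     # +1 positives, -1 negatives
    # -log sigmoid(z * logits) = log(1 + exp(-z * logits))
    return np.sum(np.log1p(np.exp(-z * logits))) / B
```

With $b = -10$, all $B^2 - B$ negative logits start strongly negative, so the huge mass of negatives contributes little loss at initialization.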

At first glance I wondered whether this is really different from the softmax operation, since all negatives still have to be computed. (figure: chunked computation of the sigmoid loss)

The chunked scheme matters because, for softmax, computing the denominator requires an all-gather of features across devices. For the sigmoid loss, negative pairs still enter the loss, but each pair's term is independent: a positive pair needs no normalization over the negatives, so each device can simply process one chunk at a time, swapping text chunks between neighbors, which is more efficient.
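A single-host simulation of that chunking idea (my own sketch, not the paper's code): each "device" keeps its image chunk while the text chunks rotate past it, so the full $B \times B$ matrix is never materialized at once, yet the total is identical to the full loss.

```python
import numpy as np

def chunked_sigmoid_loss(img, txt, n_chunks, t=10.0, b=-10.0):
    """Sigmoid loss computed block by block. Positives appear only when a
    device meets its own text chunk (j == i); every other block is pure
    negatives, and each block's loss term is independent of the rest."""
    img_chunks = np.split(img, n_chunks)
    txt_chunks = np.split(txt, n_chunks)
    total = 0.0
    for i, ic in enumerate(img_chunks):
        for step in range(n_chunks):          # simulate neighbor-to-neighbor swaps
            j = (i + step) % n_chunks
            logits = t * (ic @ txt_chunks[j].T) + b
            c = ic.shape[0]
            z = (2 * np.eye(c) - 1) if j == i else -np.ones_like(logits)
            total += np.sum(np.log1p(np.exp(-z * logits)))
    return total / img.shape[0]
```

Because the per-block sums commute, the result matches the unchunked loss exactly, with per-device memory dropping from $O(B^2)$ to $O((B/n)^2)$ per step.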

(figure: sigmoid vs. softmax performance across batch sizes)

Sigmoid beats softmax in the LiT setting; in the plain CLIP setting it wins at moderately small batch sizes.

Performance

(tables: ImageNet-1k zero-shot / COCO R@1 results)

Ablations

(figure: robustness to data perturbations)

Said to be more robust to data perturbations.

(table: negative-masking ablation)
  • hard: a strategy that masks out hard negative samples
  • Hard, matched pairs: masking reduces the number of pairs actually seen during training, so the effective sample count is smaller

I wonder whether there is a better benchmark for hard negatives.