
paper, code

TL;DR

  • why I read this : For CLIPScore-style scoring, would a sigmoid-trained SigLIP give much different scores than a softmax-trained CLIP? I was curious about the loss part and its effect.
  • task : CLIP
  • problem : The softmax in the InfoNCE loss is unstable to train, and summing over the negative pairs in the denominator requires an all-gather of features across devices, which makes training inefficient.
  • idea : propose a sigmoid loss. See details below.
  • input/output : {image, text} -> score
  • architecture : ViT-B/16, (LiT setting) ViT-B/8, ViT-g/14
  • objective : Sigmoid Loss
  • baseline : CLIP, OpenCLIP, EVA-CLIP, CLIPA-v2
  • data : WebLI dataset, using only the English image-text pairs
  • evaluation : ImageNet-1k / COCO R@1
  • result : Better performance than the comparison group, though the training data differs. lol, I didn't look closely, but presumably the step counts were matched…
  • contribution : Sigmoid loss proposal. Various ablation experiments.
  • etc. :

Details

Sigmoid Loss

The existing InfoNCE (softmax) loss:

$$-\frac{1}{2|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\left(\log\frac{e^{t\, x_i\cdot y_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\, x_i\cdot y_j}} + \log\frac{e^{t\, x_i\cdot y_i}}{\sum_{j=1}^{|\mathcal{B}|} e^{t\, x_j\cdot y_i}}\right)$$

Here the softmax normalization is computed twice, once along each axis: image → text and text → image.
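As a sketch, the two-axis softmax loss can be written in plain NumPy (function name is mine; the temperature $t$ is learnable in practice but fixed here):

```python
import numpy as np

def softmax_contrastive_loss(img, txt, t=10.0):
    """Two-axis InfoNCE: log-softmax over rows (image -> text) and over
    columns (text -> image); matched pairs sit on the diagonal.
    img, txt: (B, D) L2-normalized embeddings."""
    logits = t * (img @ txt.T)                                  # (B, B) similarities
    i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    B = img.shape[0]
    return -(np.diag(i2t).sum() + np.diag(t2i).sum()) / (2 * B)
```

Note that both denominators need the full (B, B) similarity matrix, which is where the all-gather comes from in the distributed case.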

$$-\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\sum_{j=1}^{|\mathcal{B}|}\log\frac{1}{1+e^{z_{ij}(-t\, x_i\cdot y_j + b)}}$$

This is the proposed sigmoid loss, where $z_{ij}$ is the label: $1$ for a positive pair and $-1$ for a negative one. Since negatives vastly outnumber positives, a learnable temperature $t'$ (with $t = e^{t'}$) and bias $b$ are introduced to counter the imbalance, initialized to $\log 10$ and $-10$ respectively.
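A minimal NumPy sketch of the loss above (function name is mine; $t'$ and $b$ are learnable parameters in the paper, fixed at their initial values here):

```python
import numpy as np

def sigmoid_contrastive_loss(img, txt, t_prime=np.log(10.0), b=-10.0):
    """Every image-text pair becomes an independent binary classification:
    z_ij = +1 on the diagonal (matched pairs), -1 everywhere else."""
    B = img.shape[0]
    t = np.exp(t_prime)                       # temperature, init exp(log 10) = 10
    logits = t * (img @ txt.T) + b            # (B, B) pairwise logits
    z = 2 * np.eye(B) - 1                     # +1 positives, -1 negatives
    # -log sigmoid(z * logits) = log(1 + exp(-z * logits))
    return np.sum(np.log1p(np.exp(-z * logits))) / B
```

With $b = -10$, all $B^2 - B$ negative logits start strongly negative, so the huge mass of negatives contributes little loss at initialization.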

At first glance I wondered whether this is really different from the softmax operation, since all negatives still have to be computed. (figure: chunked computation of the sigmoid loss)

The chunked scheme matters because, for softmax, computing the denominator requires an all-gather of features across devices. For the sigmoid loss, negative pairs still enter the loss, but each pair's term is independent: a positive pair needs no normalization over the negatives, so each device can simply process one chunk at a time, swapping text chunks between neighbors, which is more efficient.
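A single-host simulation of that chunking idea (my own sketch, not the paper's code): each "device" keeps its image chunk while the text chunks rotate past it, so the full $B \times B$ matrix is never materialized at once, yet the total is identical to the full loss.

```python
import numpy as np

def chunked_sigmoid_loss(img, txt, n_chunks, t=10.0, b=-10.0):
    """Sigmoid loss computed block by block. Positives appear only when a
    device meets its own text chunk (j == i); every other block is pure
    negatives, and each block's loss term is independent of the rest."""
    img_chunks = np.split(img, n_chunks)
    txt_chunks = np.split(txt, n_chunks)
    total = 0.0
    for i, ic in enumerate(img_chunks):
        for step in range(n_chunks):          # simulate neighbor-to-neighbor swaps
            j = (i + step) % n_chunks
            logits = t * (ic @ txt_chunks[j].T) + b
            c = ic.shape[0]
            z = (2 * np.eye(c) - 1) if j == i else -np.ones_like(logits)
            total += np.sum(np.log1p(np.exp(-z * logits)))
    return total / img.shape[0]
```

Because the per-block sums commute, the result matches the unchunked loss exactly, with per-device memory dropping from $O(B^2)$ to $O((B/n)^2)$ per step.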

(figure: sigmoid vs. softmax performance across batch sizes)

Sigmoid beats softmax in the LiT setting; in the plain CLIP setting it wins at moderately small batch sizes.

Performance

(tables: ImageNet-1k zero-shot / COCO R@1 results)

Ablations

(figure: robustness to data perturbations)

Said to be more robust to data perturbations.

(table: negative-masking ablation)
  • hard: a strategy that masks out hard negative samples
  • Hard, matched pairs: masking reduces the number of pairs actually seen during training, so the effective sample count is smaller

I wonder whether there is a better benchmark for hard negatives.