image

paper, blog

TL;DR

  • I read this because… : it’s been a hot topic in various places.
  • task : bind many modalities into one embedding space -> image / audio / thermal classification, etc.
  • problem : collecting paired data between every pair of modalities is practically impossible (audio-thermal?!)
  • idea : bind everything to the image modality, with images as the hub
  • input/output : image + video / audio / depth / thermal / IMU
  • architecture : pretrained CLIP; the image/text encoders are frozen, plus a separate encoder for each modality
  • objective : InfoNCE
  • baseline : classification SOTA / supervised models for each benchmark
  • data : AudioSet (audio), SUN RGB-D (depth), LLVIP (thermal), Ego4D (video, IMU)
  • evaluation : zero-shot cross-modal retrieval / zero-shot classification (build a text embedding per class and pick the closest one)
  • result : strong few-shot performance on audio/depth. “Emergent” retrieval: modality pairs never used in training can still be retrieved across.
  • contribution : integrates multiple modalities. Putting images at the center is a nice idea. Good performance.
  • etc. : completely read-only
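A minimal numpy sketch of the InfoNCE objective listed above, in the symmetric CLIP style: matched (image, other-modality) pairs sit on the diagonal of the similarity matrix. Function names, shapes, and the temperature value are my own assumptions, not from the paper:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(img_emb, mod_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, other-modality) pairs.
    Positives are on the diagonal of the (B, B) similarity matrix."""
    img = l2_normalize(img_emb)
    mod = l2_normalize(mod_emb)
    logits = img @ mod.T / temperature            # (B, B) scaled cosine sims
    labels = np.arange(len(logits))               # positives on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->modality and modality->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

With perfectly aligned pairs the loss is near zero; with mismatched pairs it grows, which is exactly the pressure that pulls each modality toward its paired image.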

Details

image image image image

For modalities that also have text pairs, those are trained on as well, so the learned space stays text-aligned. The “Absolute SOTA” baselines are supervised models.

image

“Emergent” means the text-audio pairs were never used directly in training.
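Mechanically, that emergent retrieval is just nearest-neighbor search by cosine similarity in the shared space; a toy sketch with my own function name and shapes:

```python
import numpy as np

def retrieve(query_emb, gallery_embs, k=3):
    """Rank gallery embeddings by cosine similarity to the query.
    Because all modalities share one space, the query can be audio
    and the gallery text, even if audio-text pairs were never seen
    together during training."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                     # cosine similarity per gallery item
    top = np.argsort(-scores)[:k]      # indices of the k best matches
    return top, scores[top]
```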

image

You can do arithmetic on the embeddings, like the old word-embedding analogies lol
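That word2vec-style arithmetic amounts to summing unit embeddings and renormalizing, then using the result as a query (a sketch under my own naming; the paper does not spell out this exact recipe):

```python
import numpy as np

def compose(emb_a, emb_b):
    """Add two unit-normalized embeddings and renormalize. The result
    can serve as a combined query, e.g. an image embedding plus an
    audio embedding to retrieve images matching both."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    v = a + b
    return v / np.linalg.norm(v)
```

By construction the composed vector is equally similar to both inputs, which is why it behaves like an "A + B" query.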

image

OD (object detection) works too…

image

Of course… scaling up the image encoder helps. Good.