TL;DR
- I read this because… : controversial in many places
- task : align many modalities into one embedding space -> image / audio / thermal classification
- problem : collecting pairs between every modality is practically impossible (audio - thermal?!)
- idea : bind every modality to images, with the image modality as the hub
- input/output : image + video / audio / depth / thermal / IMU
- architecture : pretrained CLIP; the image/text encoders are frozen; a separate encoder per modality
- objective : InfoNCE
- baseline : classification sota / supervised for each benchmark
- data : AudioSet, SUN RGB-D(depth), LLVIP(thermal), Ego4D(video IMU)
- evaluation : zero-shot cross-modal retrieval / zero-shot classification (create class text embedding and classify as closest)
- result : good few-shot performance on audio/depth. “emergent retrieval”: pairs never used in training are still retrieved well when measured
- contribution : integrates multiple modalities. It’s nice to see images sitting in the middle. Good performance
- etc. : read it in full
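The InfoNCE objective in the list above can be sketched in a few lines. This is a toy numpy version I wrote to pin down the idea, not the paper's implementation; `info_nce` and its arguments are my names. Each batch row is an (image, other-modality) positive pair, and all other rows act as negatives:

```python
import numpy as np

def info_nce(img_emb, mod_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, other-modality) pairs.

    img_emb, mod_emb: (N, D) arrays; row i of each forms a positive pair,
    the remaining rows in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    mod = mod_emb / np.linalg.norm(mod_emb, axis=1, keepdims=True)
    logits = img @ mod.T / temperature  # (N, N); positives on the diagonal
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy of the diagonal (true-pair) entries
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average both retrieval directions: image->modality and modality->image
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned embeddings drive the loss toward zero; unrelated embeddings sit near log(N).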
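The zero-shot classification step (“create class text embedding and classify as closest”) is simple enough to sketch. A toy numpy version, assuming the text/query embeddings are already computed; the function name is mine:

```python
import numpy as np

def zero_shot_classify(query_emb, class_text_embs):
    """Assign each query (e.g. an audio or depth embedding) the class whose
    text-prompt embedding is nearest by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return (q @ t.T).argmax(axis=1)  # index of the closest class prompt
```

Because all modalities land in one shared space, the same text prompts classify audio, depth, or thermal inputs without any labeled training data for that modality.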
Details
If a text pair exists, it is trained on as well, so what the model learns ends up text-paired. The absolute SOTA numbers come from supervised training.
“emergent” means they never directly used, e.g., a text-audio pair set
Embeddings can be combined with arithmetic, like old-school word embeddings lol
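That word2vec-style arithmetic is just vector addition in the shared space. A toy numpy sketch (my own names, precomputed embeddings assumed): sum two unit embeddings, say an image plus a sound, and retrieve the nearest gallery item:

```python
import numpy as np

def arithmetic_retrieve(emb_a, emb_b, gallery):
    """word2vec-style arithmetic: add two unit-normalized embeddings
    (e.g. an image embedding + an audio embedding) and return the index
    of the nearest gallery embedding by cosine similarity."""
    combo = emb_a / np.linalg.norm(emb_a) + emb_b / np.linalg.norm(emb_b)
    combo /= np.linalg.norm(combo)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return int((g @ combo).argmax())
```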
OD…
Of course… I turned on the image encoder. Good.