TL;DR
- I read this because… : controversial in many places
- task : align many modalities into one embedding space -> image / audio / thermal classification
- problem : collecting pairs between every modality is practically impossible (audio - thermal?!)
- idea : bind every modality to images, with the image modality as the hub
- input/output : image + video / audio / depth / thermal / IMU
- architecture : pretrained CLIP; the image/text encoders are frozen; a separate encoder per modality
- objective : InfoNCE
- baseline : classification sota / supervised for each benchmark
- data : AudioSet, SUN RGB-D(depth), LLVIP(thermal), Ego4D(video IMU)
- evaluation : zero-shot cross-modal retrieval / zero-shot classification (create class text embedding and classify as closest)
- result : good few-shot performance on audio/depth. “emergent retrieval”: pairs never used in training are still retrieved well when measured
- contribution : integrates multiple modalities. It’s nice to see images sitting in the middle. Good performance
- etc. : read it in full
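The InfoNCE objective in the list above can be sketched in a few lines. This is a toy numpy version I wrote to pin down the idea, not the paper's implementation; `info_nce` and its arguments are my names. Each batch row is an (image, other-modality) positive pair, and all other rows act as negatives:

```python
import numpy as np

def info_nce(img_emb, mod_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, other-modality) pairs.

    img_emb, mod_emb: (N, D) arrays; row i of each forms a positive pair,
    the remaining rows in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    mod = mod_emb / np.linalg.norm(mod_emb, axis=1, keepdims=True)
    logits = img @ mod.T / temperature  # (N, N); positives on the diagonal
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy of the diagonal (true-pair) entries
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average both retrieval directions: image->modality and modality->image
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned embeddings drive the loss toward zero; unrelated embeddings sit near log(N).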
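The zero-shot classification step (“create class text embedding and classify as closest”) is simple enough to sketch. A toy numpy version, assuming the text/query embeddings are already computed; the function name is mine:

```python
import numpy as np

def zero_shot_classify(query_emb, class_text_embs):
    """Assign each query (e.g. an audio or depth embedding) the class whose
    text-prompt embedding is nearest by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return (q @ t.T).argmax(axis=1)  # index of the closest class prompt
```

Because all modalities land in one shared space, the same text prompts classify audio, depth, or thermal inputs without any labeled training data for that modality.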
Details
If a text pair exists, it is trained on as well, so what the model learns ends up text-paired. The absolute SOTA numbers come from supervised training.
“emergent” means they never directly used, e.g., a text-audio pair set
Embeddings can be combined with arithmetic, like old-school word embeddings lol
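That word2vec-style arithmetic is just vector addition in the shared space. A toy numpy sketch (my own names, precomputed embeddings assumed): sum two unit embeddings, say an image plus a sound, and retrieve the nearest gallery item:

```python
import numpy as np

def arithmetic_retrieve(emb_a, emb_b, gallery):
    """word2vec-style arithmetic: add two unit-normalized embeddings
    (e.g. an image embedding + an audio embedding) and return the index
    of the nearest gallery embedding by cosine similarity."""
    combo = emb_a / np.linalg.norm(emb_a) + emb_b / np.linalg.norm(emb_b)
    combo /= np.linalg.norm(combo)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return int((g @ combo).argmax())
```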
OD…
Of course… I turned on the image encoder. Good.