[115] ImageBind: One Embedding Space To Bind Them All

TL;DR

I read this because.. : It was being debated in several places. I kept meaning to read it, and then someone presented it at our paper study group.
task : align many modalities into one embedding space -> image / audio / thermal classification
problem : collecting paired data for every combination of modalities is practically impossible (audio - thermal?!)
idea : use image as the hub and bind every other modality to the image modality
input/output : image + video / audio / depth / thermal / IMU
architecture : pretrained CLIP. The image and text encoders are frozen. A separate encoder for each of the other modalities
objective : InfoNCE
baseline : classification SOTA / supervised models on each benchmark
data : AudioSet, SUN RGB-D(depth), LLVIP(thermal), Ego4D(video IMU)
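evaluation : zero-shot cross-modal retrieval / zero-shot classification (build a text embedding per class and assign the nearest one)
result : good few-shot performance on audio / depth. They call it "emergent retrieval": performance is measured on modality pairs that were never paired during training, and it is still decent.
contribution : unifies multiple modalities. Shows that putting image in the middle works well. Good performance
etc. : skimmed it very roughly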
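
To make the objective and evaluation lines above concrete, here is a minimal sketch in plain Python, not the paper's implementation. It assumes embeddings are small toy lists of floats, that the loss is the symmetric sum of the image-to-modality and modality-to-image InfoNCE terms, and that the temperature is 0.07 (a placeholder value). The function names `info_nce` and `zero_shot_classify` are made up for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(image_embs, other_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where image_embs[i] is paired with other_embs[i].

    temperature=0.07 is an assumed placeholder, not a value taken from the notes.
    """
    q = [l2_normalize(v) for v in image_embs]
    k = [l2_normalize(v) for v in other_embs]
    n = len(q)
    # logits[i][j]: scaled similarity between image i and other-modality sample j
    logits = [[dot(q[i], k[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], i.e. the matching pair is the positive
        # and every other sample in the batch acts as a negative.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    logits_t = [list(col) for col in zip(*logits)]
    # image -> other direction plus other -> image direction
    return cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits_t)


def zero_shot_classify(sample_emb, class_text_embs):
    """Return the class whose text embedding has the highest cosine similarity."""
    s = l2_normalize(sample_emb)
    scores = {name: dot(s, l2_normalize(e)) for name, e in class_text_embs.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
    audio_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(images, audio_aligned) < info_nce(images, audio_swapped))
    print(zero_shot_classify([0.9, 0.1], {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}))
```

Because the class text embeddings live in the same space as the image embeddings, any modality trained only against images can be classified this way, which is how the "emergent" zero-shot results are measured.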