image

paper

TL;DR

  • task : personalized vision and language => personalized image retrieval/object detection/segmentation
  • problem : user-specificํ•œ object๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ•™์Šตํ•˜๊ณ  ์‹ถ๋‹ค. CLIP์— adaptor๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์€ ์ด์ „ class๋“ค์˜ ์„ฑ๋Šฅ์„ ์•…ํ™”์‹œํ‚ค๋Š” ํšจ๊ณผ๊ฐ€ ์žˆ์Œ.
  • idea : ์ƒˆ๋กœ์šด concept์„ ์ƒˆ๋กœ์šด vocab์œผ๋กœ ์ถ”๊ฐ€ํ•˜์—ฌ ํ•™์Šต ํ•˜์ž! ์ด๋ฅผ ์œ„ํ•ด 1) ์ด๋ฏธ์ง€๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ input word embedding์„ ์ฐพ๋Š” inverse function์„ ํ•™์Šตํ•˜๊ณ  2) ์ƒˆ concept์˜ ์ด๋ฏธ์ง€ ๋ช‡์žฅ์„ inverse function์„ ํ†ต๊ณผ์‹œ์ผœ ์ƒˆ concept์˜ word embedding์„ ์ดˆ๊ธฐํ™”ํ•œ๋‹ค 3) ์ƒˆ concept์˜ textual ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  finetuningํ•œ๋‹ค.
  • architecture : CLIP
  • objective : ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”๋ฅผ ํ†ต๊ณผํ•œ ์ž„๋ฒ ๋”ฉ๊ณผ A photo of a [new vocab]์˜ ์ž„๋ฒ ๋”ฉ์ด ๊ฐ€๊นŒ์›Œ์ง€๋„๋ก, ์ƒˆ concept์˜ super-concept๊ณผ์˜ ์ž„๋ฒ ๋”ฉ์€ ๋ฉ€์–ด์ง€๋„๋ก ํ•™์Šต
  • baseline : Adapter, text-only CLIP, COLLIE
  • data : Youtube-VOS, DeepFashion2(both introduced in this paper)
  • result : SOTA
  • contribution : ์ƒˆ๋กœ์šด ํƒœ์Šคํฌ ์ œ์•ˆ. ํšจ์œจ์ ์ธ ์•„ํ‚คํ…์ณ!
  • limitation or ์ดํ•ด ์•ˆ๋˜๋Š” ๋ถ€๋ถ„ : CLIP ๋‹ค์‹œ ์ฝ์–ด์•ผ๋ ๋“ฏ? Deep Sets?

Details

new setup, personalized vision & language

image
  • pretrained model h(S, I)์— ์ƒˆ๋กœ์šด sentence S์™€ ์ด๋ฏธ์ง€ I๊ฐ€ ๋“ค์–ด๊ฐ.
  • ์ƒˆ๋กœ์šด concept์ธ C๊ฐ€ ๋“ค์–ด๊ฐ€์„œ V’ = V U C ๋กœ ํ•™์Šต๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๊ธธ ์›ํ•จ
  • ํ•™์Šต ์‹œ์—๋Š” concept C์— ๋Œ€ํ•œ ๋ช‡๊ฐœ์˜ ์ด๋ฏธ์ง€์™€ ์ƒˆ๋กœ์šด ์ปจ์…‰์— ๋Œ€ํ•œ ์„ค๋ช… ํ…์ŠคํŠธ(e.g. “mug”, “short sleeve top”)๊ฐ€ ์ฃผ์–ด์ง

Adaptor vs new vocab ์ถ”๊ฐ€

image

์ƒˆ๋กœ์šด vocab์„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š์œผ๋ฉด ์ด์ „ class์— ๋Œ€ํ•œ encoder output์ด ๋ญ‰๊ฐœ์ง„๋‹ค. ์šฐ๋ฆฌ์˜ ํ…์ŠคํŠธ์ž„๋ฒ ๋”ฉ์ด ์ƒˆ๋กœ์šด ์ปจ์…‰์„ ํ’ˆ์„ ์ˆ˜ ์žˆ์„ ์ •๋„๋กœ ํฌ๋‹ค๋Š” ๊ฐ€์ •์œผ๋กœ ๋ชจ๋ธ์ด ์‹œ์ž‘

Architecture

image

DeepSets์ด๋ž€ ๋„คํŠธ์›Œํฌ๋กœ inverse mapping function ํ•™์Šต

Loss

image