image

paper, code

TL;DR

  • I read this because.. : It came up a lot in our paper-reading group, so I read it.. only to realize I had already read it before.
  • task : object detection, recast as a phrase grounding problem for training
  • problem : Image classification models only classify within a fixed set of categories, which makes them hard to apply to the real world. CLIP addressed this with image-text pairs, but that is image classification; we want to solve object-detection-level tasks the same way!
  • idea : Turn object detection into phrase grounding: the classes are given as a prompt, and the model learns to align image regions with the words of that prompt.
  • architecture : 1) Visual Encoder (Swin) + DyHead 2) Pretrained BERT 3) early (deep) fusion of 1 and 2.
  • objective : cls loss(with alignment score!) + regressor loss
  • baseline : Faster RCNN, DyHead
  • data : COCO, LVIS, Flickr30K, Objects365, GoldG, OpenImages, Visual Genome, ImageNetBoxes
  • evaluation : AP
  • result : 1) On COCO and LVIS, which were not seen during training, it surpasses many supervised baselines 2) SOTA when fine-tuned on COCO 3) On 13 downstream object detection tasks, 1-shot GLIP rivals a fully supervised Dynamic Head.
  • contribution : CLIP in object detection
  • limitation / things I cannot understand :

Details

preliminaries

Data

  • COCO : 80 object categories, training 118K, valid 5K, test 41K
  • LVIS : long-tail object detection. 1,000+ categories.
  • Flickr30K : images, each with 5 reference sentences. A dataset for image captioning
  • Objects365 : 365 categories, 2 million images, 30 million bounding boxes
  • GoldG : 0.8M samples of human-annotated grounding data, curated in the MDETR paper
  • OpenImages : 15,851,536 boxes on 600 categories, 478,000 crowdsourced images with 6,000+ categories
  • Visual Genome : 108,077 Images, 5.4 M Region Descriptions, 2.3M Relationships
  • ImageNetBoxes : ?

Architecture

Object detection is trained with the sum of two losses: a localization loss and a classification loss. Localization is outside the scope of this paper; it only tackles the classification part.

In a standard object detection setup, the classification loss is defined as follows. image
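For readers without the figure, the standard formulation can be written out roughly as below. This is a transcription of the usual notation, so treat the symbols as assumptions: $O \in \mathbb{R}^{N \times d}$ are the $N$ region/box features, $W \in \mathbb{R}^{c \times d}$ is the classifier weight matrix over $c$ classes, and $T \in \{0,1\}^{N \times c}$ is the target matching between regions and classes.

```latex
\mathcal{L} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{loc}}

O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
S_{\text{cls}} = O W^{\top}, \qquad
\mathcal{L}_{\text{cls}} = \mathrm{loss}(S_{\text{cls}};\, T)
```

Here $S_{\text{cls}} \in \mathbb{R}^{N \times c}$ holds one logit per (region, class) pair.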

Here, instead of a classifier, a separate Language Encoder processes the prompt alongside the Image Encoder, and the dot product of the two outputs becomes the alignment score. This score replaces the classifier logits. image
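Written as equations, the swap looks roughly like this. The symbols are assumed notation rather than copied from the figure: $P \in \mathbb{R}^{M \times d}$ are the contextual features of the $M$ prompt tokens, and $O \in \mathbb{R}^{N \times d}$ are the region features.

```latex
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
P = \mathrm{Enc}_L(\mathrm{Prompt}), \qquad
S_{\text{ground}} = O P^{\top}
```

$S_{\text{ground}} \in \mathbb{R}^{N \times M}$ takes the place of the classifier logits $S_{\text{cls}} \in \mathbb{R}^{N \times c}$, so the learned weight matrix $W$ is replaced by token features computed from the prompt.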

This score is then fed into the loss in the same way, except that it has more dimensions than the number of classes (multi-word phrases, sub-word tokenization, the [no_obj] token).
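A minimal sketch of why the token dimension is larger than the class count, and how class-level targets could be expanded to token-level targets. Everything here is illustrative: `build_prompt` and `expand_targets` are hypothetical helpers, a whitespace split stands in for BERT's WordPiece tokenizer, and treating every added token (separators and `[no_obj]`) as a negative is my reading of the paper, not a confirmed detail.

```python
def build_prompt(class_names):
    """Concatenate class names into one prompt; record each class's token span."""
    tokens, spans = [], []
    for name in class_names:
        words = name.split()            # assumption: whitespace split, not WordPiece
        start = len(tokens)
        tokens.extend(words)
        spans.append((start, len(tokens)))
        tokens.append(".")              # separator token, never a positive target
    tokens.append("[no_obj]")           # extra token mentioned in the note
    return tokens, spans


def expand_targets(class_targets, spans, num_tokens):
    """Expand N x c class targets into N x M token targets."""
    token_targets = []
    for row in class_targets:
        out = [0] * num_tokens
        for cls_idx, is_positive in enumerate(row):
            if is_positive:
                start, end = spans[cls_idx]
                for t in range(start, end):
                    out[t] = 1          # every token of a matched phrase is positive
        token_targets.append(out)
    return token_targets


class_names = ["person", "hair drier"]          # c = 2 classes
tokens, spans = build_prompt(class_names)       # M = 6 tokens
class_targets = [[1, 0], [0, 1], [0, 0]]        # N = 3 regions, last one is background
token_targets = expand_targets(class_targets, spans, len(tokens))
```

With two classes the prompt already has six tokens, which is the extra dimensionality the note refers to.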

A binary sigmoid loss can be used as the loss.
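As a rough illustration, the alignment score and a binary sigmoid loss can be computed as below. This is a sketch in plain Python with toy features, not the paper's implementation: the function names are hypothetical, and plain binary cross-entropy on logits is used where a detector would typically use a focal variant.

```python
import math


def alignment_scores(regions, tokens):
    """Dot product between every region feature and every token feature (N x M)."""
    return [[sum(o * p for o, p in zip(region, token)) for token in tokens]
            for region in regions]


def binary_sigmoid_loss(scores, targets):
    """Mean binary cross-entropy over all (region, token) logits."""
    total, count = 0.0, 0
    for score_row, target_row in zip(scores, targets):
        for s, t in zip(score_row, target_row):
            # Numerically stable form of -[t*log(sig(s)) + (1-t)*log(1-sig(s))]
            total += max(s, 0.0) - s * t + math.log1p(math.exp(-abs(s)))
            count += 1
    return total / count


regions = [[1.0, 0.0], [0.0, 1.0]]                   # N = 2 region features, d = 2
tokens = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]        # M = 3 token features
targets = [[1, 0, 0], [0, 1, 0]]                     # token-level targets (N x M)

scores = alignment_scores(regions, tokens)
loss = binary_sigmoid_loss(scores, targets)
```

Each (region, token) pair is scored independently, which is why a sigmoid rather than a softmax over tokens fits: a region can be positive for several tokens of the same phrase.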

image

The detection models are Faster R-CNN and DynamicHead (SOTA); the image encoders are Swin-T and Swin-L, and the text encoder is BERT. image

Deep fusion is nothing fancy: instead of combining only the final outputs of the two encoders (which is called late fusion), the encoders exchange information as the layers are stacked. On the BERT side, new layers are added on top of the existing ones, and it is the outputs of those added layers that get exchanged.
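A minimal sketch of one such fusion step, assuming a single-head bidirectional cross-attention with identity projections. The real model uses learned projections, multiple heads, a DyHead module on the image side and a BERT layer on the text side; `image_layer` and `text_layer` below are hypothetical stand-ins for those.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_attend(image_feats, text_feats):
    """Bidirectional cross-attention; projections omitted (assumption)."""
    d = len(image_feats[0])
    attn = [[v / math.sqrt(d) for v in row]
            for row in matmul(image_feats, transpose(text_feats))]           # N x M
    text_to_image = matmul([softmax(r) for r in attn], text_feats)            # N x d
    image_to_text = matmul([softmax(r) for r in transpose(attn)], image_feats)  # M x d
    return text_to_image, image_to_text


def fusion_step(image_feats, text_feats, image_layer, text_layer):
    """Add the cross-modal context to each stream, then run each stream's own layer."""
    t2i, i2t = cross_attend(image_feats, text_feats)
    fused_image = [[a + b for a, b in zip(r, u)] for r, u in zip(image_feats, t2i)]
    fused_text = [[a + b for a, b in zip(r, u)] for r, u in zip(text_feats, i2t)]
    return image_layer(fused_image), text_layer(fused_text)


def identity(feats):
    return feats                                    # placeholder for a real layer


image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 regions, d = 2
text_feats = [[1.0, 0.0], [0.0, 1.0]]               # M = 2 tokens
new_image, new_text = fusion_step(image_feats, text_feats, identity, identity)
```

Stacking several such steps is what distinguishes this from late fusion, where the two streams would only meet once, at the final dot product.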

Result

image