image

paper

TL;DR

  • task : open vocab object detection
  • problem : Existing object detection models predict over a closed set of classes, so they are hard to extend. The open vocab object detection methods proposed to fix this run an RPN first and predict the class afterwards, which makes bbox prediction for new classes difficult.
  • idea : Use DETR to do object detection end2end! Replace what used to be the class label with a CLIP text embedding.
  • architecture : The image and text (= class) are embedded with CLIP, then added to the object queries to form conditional queries. Since one image can contain several objects, each conditional query is copied N times. Bipartite matching is then no longer [obj] vs. [no obj]; given the input image and a conditional query, it becomes [matched] vs. [not matched].
  • objective : bce(match / not match) + bbox loss(gIoU, L1) + embedding reconstruction loss(L1)
  • baseline : OVR-CNN, ViLD
  • data : COCO, LVIS
  • result : SOTA against other open vocab object detection models on both overall AP and AP for novel classes
  • contribution : end2end open vocab object detection
  • limitation or things I don't understand : Do we already hold embeddings for every base class / novel class (the R of them that the paper mentions), and predict by matching against all of them? I find this confusing. If so, is training done with something like in-batch negatives?
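
To make the architecture and objective bullets concrete, here is a minimal NumPy sketch of how conditional queries could be built and how the matched / not matched BCE term could be computed. It is not the paper's implementation. The linear projection `proj`, the toy shapes, the placeholder matching logits, and the hand-picked matched pairs are all assumptions made for illustration; in a real model the logits would come from the DETR decoder and the matched pairs from bipartite matching.

```python
import numpy as np


def build_conditional_queries(object_queries, clip_embeddings, proj):
    """Add each projected CLIP embedding to a copy of the N object queries.

    object_queries:  (N, d)       learned DETR object queries
    clip_embeddings: (R, d_clip)  CLIP text or image embeddings, one per condition
    proj:            (d_clip, d)  hypothetical linear map into the query space
    returns:         (R, N, d)    N conditional queries for each of the R conditions
    """
    cond = clip_embeddings @ proj  # (R, d)
    # Broadcasting copies the N queries once per condition and adds the condition.
    return object_queries[None, :, :] + cond[:, None, :]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    per_elem = (
        np.maximum(logits, 0.0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(per_elem.mean())


# Toy sizes (assumed): N queries, R conditions, query dim d, CLIP dim d_clip.
rng = np.random.default_rng(0)
N, R, d, d_clip = 3, 2, 4, 5
object_queries = rng.normal(size=(N, d))
clip_embeddings = rng.normal(size=(R, d_clip))
proj = rng.normal(size=(d_clip, d))

cond_queries = build_conditional_queries(object_queries, clip_embeddings, proj)
print(cond_queries.shape)  # (2, 3, 4)

# Placeholder for the decoder's matching head: one logit per conditional query.
match_logits = cond_queries.sum(axis=-1)  # (R, N)

# Assumed outcome of bipartite matching: query 0 of condition 0 and
# query 2 of condition 1 were paired with ground-truth boxes.
matched = np.zeros((R, N))
matched[0, 0] = 1.0
matched[1, 2] = 1.0

loss = bce_with_logits(match_logits, matched)
```

Under this reading, the binary target only says whether a query matched a box for the embedding it was conditioned on, so the same head can be reused for a class name that never appeared in training.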

Details

image

image

image

image