image

paper

TL;DR

  • I read this because.. : It is a method for fine-tuning conservatively without losing CLIP's pretrained capabilities. Found it while looking for LiT-related papers.
  • task : CLIP
  • problem : Fine-tuning CLIP on a reference domain can make it lose the general-domain knowledge it originally learned.
  • idea : Ensemble zero-shot CLIP with the model fine-tuned on the target domain -> do the ensembling by interpolating weights!
  • input/output : {image, text} -> score
  • architecture : CLIP, ViT, BASIC-L
  • objective : InfoNCE
  • baseline : zs-CLIP, fine-tuned CLIP
  • data : WIT (CLIP), JFT-300M (ViT) -> ImageNet, ImageNetV2, ImageNet-R, ImageNet Sketch, ObjectNet, ImageNet-A
  • evaluation : Accuracy on the original domain and on shifted domains.
  • result : Keeps ImageNet performance while also improving performance on the datasets with domain shift.
  • contribution : Simple idea, easy to implement, and still performs well.
  • etc. :

Details

image

Using a moving average of the parameters has a kind of ensemble effect.
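To make the ensemble reading concrete, here is a minimal sketch of an exponential moving average (EMA) over parameters. The function name `ema_update`, the `decay` value, and the toy checkpoints are illustrative assumptions, not taken from the paper. Unrolling the recursion shows that the EMA is a weighted average of every checkpoint seen so far, which is why it can be read as a cheap ensemble.

```python
def ema_update(ema_params, new_params, decay):
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, new_params)]


# Hypothetical parameter snapshots from three training steps (toy values).
checkpoints = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

ema = list(checkpoints[0])
for params in checkpoints[1:]:
    ema = ema_update(ema, params, decay=0.5)

# The same result written as an explicit weighted average of all checkpoints:
# 0.25 * step0 + 0.25 * step1 + 0.5 * step2
explicit = [
    0.25 * a + 0.25 * b + 0.5 * c
    for a, b, c in zip(*checkpoints)
]

print(ema, explicit)
```

With `decay=0.5` the latest checkpoint gets weight 0.5 and older ones share the rest. A larger decay would spread the weight over more checkpoints.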

domain shift data

image

Weight-space ensemble for finetuning

Very simple..

  1. Take a pretrained CLIP and fine-tune it on the target domain. You can fine-tune the whole model (end-to-end) or only the final classifier (LC).
  2. Pick a mixing coefficient and take the element-wise weighted average of the two sets of weights. image

In principle alpha has to be searched for greedily, but setting it to 0.5 gave results almost identical to the optimum.
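The two steps above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the convention that `alpha = 0` recovers the zero-shot weights and `alpha = 1` the fine-tuned weights, and it stores parameters as plain lists keyed by name (a real model would hold tensors). The name `interpolate_weights` and the parameter names are hypothetical.

```python
def interpolate_weights(theta_zero_shot, theta_fine_tuned, alpha):
    """Element-wise (1 - alpha) * theta_zero_shot + alpha * theta_fine_tuned."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if theta_zero_shot.keys() != theta_fine_tuned.keys():
        raise ValueError("both models must share the same parameter names")

    mixed = {}
    for name, w0 in theta_zero_shot.items():
        w1 = theta_fine_tuned[name]
        if len(w0) != len(w1):
            raise ValueError(f"shape mismatch for {name}")
        mixed[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w0, w1)]
    return mixed


# Toy parameters (made-up values) standing in for the two checkpoints.
zero_shot = {"encoder.weight": [0.0, 2.0], "head.bias": [1.0]}
fine_tuned = {"encoder.weight": [1.0, 4.0], "head.bias": [3.0]}

# alpha = 0.5 is the default the notes report as close to the optimum.
wise_ft = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
print(wise_ft)
```

Because the result is a single set of weights, inference costs the same as for either model alone. Sweeping `alpha` over a grid on held-out data is one way to do the search mentioned above.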

Result

image

First figure : the x-axis is accuracy on ImageNet (the reference distribution) and the y-axis is accuracy on the datasets with distribution shift. Purple is zero-shot CLIP, blue is models simply trained on that data, and orange is models fine-tuned on that data.

Second figure : with WiSE-FT, accuracy on the distribution-shifted datasets can be improved without losing reference accuracy.

image

The fine-tuned models lose performance on the datasets with distribution shift. The proposed WiSE-FT does better than plain fine-tuning even on the reference domain (86.2 -> 87.1), and it also improves on the datasets with distribution shift.

image

CLIP fine-tuning itself tends to swing a lot in performance depending on the hyperparameters -> with a weight-space ensemble you land on the frontier!

image

Performance is better than fine-tuning on each domain separately!

Analysis

image

The zero-shot model and the linear classifier behaved differently, while linear classifiers behaved similarly to each other. -> This diversity seems to be why the ensemble effect was larger.
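One simple way to quantify how differently two classifiers behave is the fraction of examples on which their predictions disagree. The sketch below is only an illustration of that idea. The measure, the name `disagreement_rate`, and the prediction lists are assumptions made up for the example, not the paper's diversity metrics or results.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of examples where the two classifiers predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("need at least one prediction")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Made-up class predictions on eight examples (not real model outputs).
zero_shot_preds = [0, 1, 2, 1, 0, 2, 1, 0]
linear_a_preds = [0, 2, 2, 0, 0, 1, 1, 2]
linear_b_preds = [0, 2, 2, 0, 0, 1, 1, 0]

zs_vs_linear = disagreement_rate(zero_shot_preds, linear_a_preds)
linear_vs_linear = disagreement_rate(linear_a_preds, linear_b_preds)
print(zs_vs_linear, linear_vs_linear)
```

In this toy setup the zero-shot and linear predictions disagree more often than the two linear classifiers do. That is the kind of pattern the note describes.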

image

Ensembling the weights improved performance more than ensembling the outputs!
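The difference between the two kinds of ensembling can be seen on a tiny nonlinear model. The sketch below uses a hypothetical scalar model `w2 * relu(w1 * x)` with made-up weights. It only shows that the two procedures give different predictions once the model is nonlinear. It says nothing about which one is more accurate.

```python
def relu(v):
    return max(0.0, v)


def forward(params, x):
    """Toy one-hidden-unit model: w2 * relu(w1 * x)."""
    return params["w2"] * relu(params["w1"] * x)


def output_ensemble(p0, p1, x, alpha):
    """Interpolate the two models' outputs (needs two forward passes)."""
    return (1.0 - alpha) * forward(p0, x) + alpha * forward(p1, x)


def weight_ensemble(p0, p1, x, alpha):
    """Interpolate the weights first, then run a single forward pass."""
    mixed = {k: (1.0 - alpha) * p0[k] + alpha * p1[k] for k in p0}
    return forward(mixed, x)


# Made-up weights standing in for the zero-shot and fine-tuned models.
zero_shot = {"w1": 1.0, "w2": 2.0}
fine_tuned = {"w1": 3.0, "w2": 4.0}

out_ens = output_ensemble(zero_shot, fine_tuned, x=1.0, alpha=0.5)
wt_ens = weight_ensemble(zero_shot, fine_tuned, x=1.0, alpha=0.5)
print(out_ens, wt_ens)
```

The weight-space version also needs only one forward pass at inference time, while the output-space version needs one pass per model.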