
paper

TL;DR

  • I read this because.. : wanted to read it after #116. Seonghyeon introduced it a while ago, but I didn't catch the details. With the GPT craze it's been coming up a lot again lately.
  • task : vision-language modeling for general use! VQA, object detection, VizWiz, HatefulMemes …
  • input : text interleaved with images/video
  • output : free-form text
  • problem : CLIP-style models only produce a score for an image-text pair, so they can only be applied to closed-set tasks like image classification. They lack the language-generation ability needed for open-ended tasks such as captioning or VQA.
  • idea : do it the LM way! Take a pretrained LLM and feed in visual information through cross-attention over visual tokens.
  • architecture : the LM is Chinchilla (70B). The image input goes through NFNet, the final feature map is flattened, and a Perceiver Resampler extracts a few latent vectors from it. Cross-attention layers (trained from scratch) inserted between the LM layers inject the visual information. For stable training, they use tanh gating with an alpha initialized to 0.
  • objective : NLL loss given images. Each text token can only attend to the image immediately preceding it. The total loss is a weighted sum over the datasets.
  • baseline : the few-shot / fine-tuned models on each benchmark
  • data : MultiModal MassiveWeb (M3W, 43M webpages), ALIGN (1.8B pairs), LTIP (312M), Video & Text Pairs (VTP, 27M) (notably, none of the data was annotated for the purpose of training deep-learning models!) -> evaluated on 16 image/video-and-language benchmarks
  • evaluation : compared in the zero-shot / 32-shot settings
  • result : a single Flamingo beats most few-shot models, and it even beats fine-tuned models on a number of benchmarks
  • contribution : perhaps the first token-generation-based vision-and-language model?
  • limitation / things I cannot understand :

Details

  • At an ECCV workshop, Jean-Baptiste once talked about why he came to work on Flamingo and what he felt while doing it

introduction์— ์จ์žˆ๋Š”๊ฑฐ๋ž‘ ๋น„์Šทํ•œ ๋‚ด์šฉ. CLIP๋ฅ˜ ์—ฐ๊ตฌ๋ฅผ ํ–ˆ์—ˆ๋Š”๋ฐ ํ’€ ์ˆ˜ ์žˆ๋Š” task๊ฐ€ ํ•œ์ •์ ์ด์—ˆ๋‹ค. -> flamingo๋กœ ๋„˜์–ด๊ฐ ๊ฒฐ๊ตญ ์–ด๋–ค ์ธํ„ฐํŽ˜์ด์Šค๊ฐ€ ๋‹ค์–‘ํ•œ ํƒœ์Šคํฌ๋ฅผ ํ’€ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ธ๊ฐ€? application์— ์ ํ•ฉํ•  ๊ฒƒ์ธ๊ฐ€?๋ฅผ ๋ฌธ์ œ์˜์‹์œผ๋กœ ์‚ผ์€ ๊ฒƒ ๊ฐ™๋‹น ๋ฌธ์ œ ์˜์‹์„ ์•„ํ‚คํ…์ณ๊ฐ€ ์•„๋‹ˆ๋ผ ํ’€ ์ˆ˜ ์žˆ๋Š” task ๋“ค๋กœ ์žก์€ ๋“ฏ~ ํ  ์ ์  ์•„ํ‚คํ…์ณ๊ฐ€ ์ค‘์š”ํ•œ๊ฒŒ ์•„๋‹ˆ๋ผ ๋ฐ์ดํ„ฐ/ํ•™์Šต/ํƒœ์Šคํฌ ๋“ฑ์ด ์ค‘์š”ํ•œ ๊ฒƒ ๊ฐ™๋„ค.. ๋‚˜๋Š” ์ด์ œ ๋ฌด์–ผ ์Œ“์•„์•ผ ํ•˜๋‚˜

Preliminaries

  • Normalizer-Free ResNet https://arxiv.org/pdf/2102.06171.pdf ResNet's batch norm makes the model sensitive to batch size and lets images within a batch interact with each other; this model was designed to remove it.

  • Perceiver https://arxiv.org/pdf/2103.03206.pdf

A 2021 DeepMind model for representing diverse modalities (image / video, etc.) efficiently. It uses an asymmetric attention module so that a small set of latent units iteratively cross-attends to the inputs (similar to DETR, though the details probably differ). Comparable performance on image classification / audio / point clouds, etc. (c.f. they repeatedly cite Set Transformer as the most related work)
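The "small set of latents that cross-attend to the inputs" idea can be sketched in a few lines. This is a toy, single-step version with identity projections (the real model uses learned Q/K/V projections and stacks this block); the point is that cost is O(N·R) rather than O(N²), and the output size is fixed by the latents, not the input length:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(latents, inputs):
    """One Perceiver-style cross-attention step: a small set of
    latent vectors (R, d) queries a long input sequence (N, d).
    The output keeps the fixed latent shape whatever N is."""
    d = latents.shape[-1]
    attn = softmax(latents @ inputs.T / np.sqrt(d))  # (R, N)
    return attn @ inputs                             # (R, d)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))   # R=8 learned latent units
short = rng.normal(size=(64, 16))    # N=64 input features
long = rng.normal(size=(4096, 16))   # N=4096 input features

# Output shape is fixed by the latents, not the input length.
assert latent_cross_attention(latents, short).shape == (8, 16)
assert latent_cross_attention(latents, long).shape == (8, 16)
```

This same mechanism is what the Perceiver Resampler reuses in Flamingo to squeeze a variable number of visual features into a few latent vectors.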

  • Chinchilla A DeepMind model from March 2022. https://arxiv.org/pdf/2203.15556.pdf Its predecessor Gopher only scaled up the model size while keeping the training data the same, and they judged that the model had been underfit. "By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled." … These people are insane! The finding: double the model size, and you should also double the number of training tokens. The model has 4× fewer parameters than Gopher (280B) but 4× more training data, and it beats Gopher.
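Back-of-the-envelope check of that trade-off, using the common C ≈ 6ND training-FLOPs approximation; the ~300B / ~1.4T token counts for Gopher / Chinchilla are my recollection of the papers, not from this note:

```python
def train_flops(params, tokens):
    # Common approximation: training compute C ~= 6 * N * D FLOPs.
    return 6 * params * tokens

gopher = train_flops(280e9, 300e9)      # Gopher: 280B params, ~300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # Chinchilla: 70B params, ~1.4T tokens

# Roughly the same compute budget, spent very differently.
assert 0.8 < chinchilla / gopher < 1.3

# Compute-optimal rule: double the model -> double the tokens,
# so tokens grow linearly with parameters (~20 tokens/param here).
assert abs(1.4e12 / 70e9 - 20) < 1
```

Under this rule a 2× bigger compute-optimal model costs 4× the compute, since both factors in 6ND double.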

They increase the batch size partway through training -> why? Probably answerable by reading the 120-page Gopher paper https://arxiv.org/pdf/2112.11446.pdf …

Dataset

  • M3W Image-text data extracted from the HTML of 43M webpages. Relative positions come from the DOM structure: an `<image>` token is inserted into the text at each image's location, and an `<EOC>` (end of chunk) token is inserted before each image and at the end of the document. From each document they randomly sample a subsequence of L=256 tokens (isn't that too small? I guess it mostly covers the text right before each image?) and keep at most 5 images.

  • ALIGN A dataset built from the alt-text (tag) attached to images on the web https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
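A sketch of how the M3W interleaving and the "attend only to the most recent image" rule might look. The `<image>` / `<EOC>` tokens are from the paper; everything else (whitespace tokenization, the helper names) is purely illustrative:

```python
def interleave(chunks):
    """Build an M3W-style interleaved token list from (has_image, text)
    chunks. '<image>' marks where an image sat in the DOM; '<EOC>' is
    inserted before each image and at the end of the document.
    Whitespace tokenization is just for illustration."""
    tokens = []
    for has_image, text in chunks:
        if has_image:
            tokens += ["<EOC>", "<image>"]
        tokens += text.split()
    tokens.append("<EOC>")
    return tokens

def preceding_image_index(tokens):
    """Per-token index of the most recent '<image>' (-1 = none yet).
    This encodes the rule that each text token may only attend to
    the single image immediately preceding it."""
    idx, current = [], -1
    for t in tokens:
        if t == "<image>":
            current += 1
        idx.append(current)
    return idx

toks = interleave([(True, "a cat on a mat"), (True, "a dog in a car")])
ids = preceding_image_index(toks)
assert toks[:2] == ["<EOC>", "<image>"]
assert ids[2:7] == [0] * 5   # "a cat on a mat" attends to image 0
assert ids[-1] == 1          # the final <EOC> follows image 1
```

The per-token image index would then be turned into a cross-attention mask over the visual latents.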

Architecture

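A toy numpy version of the gated cross-attention idea (single head, no projections or FFN — just the gating mechanics). With alpha initialized to 0, tanh(alpha) = 0 and the layer is exactly the identity, so the frozen LM's behavior is preserved at the start of training:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Toy Flamingo-style gated cross-attention layer:
    out = x + tanh(alpha) * CrossAttn(x, visual)."""
    def __init__(self, d):
        self.alpha = 0.0  # learnable scalar gate, initialized to 0
        self.d = d

    def __call__(self, x, visual):
        attn = softmax(x @ visual.T / np.sqrt(self.d)) @ visual
        return x + np.tanh(self.alpha) * attn

rng = np.random.default_rng(0)
layer = GatedCrossAttention(d=16)
x = rng.normal(size=(10, 16))       # text hidden states
visual = rng.normal(size=(8, 16))   # resampled visual latents

# At initialization the layer passes the text through untouched.
assert np.allclose(layer(x, visual), x)

# Once alpha moves away from 0, visual information flows in.
layer.alpha = 1.0
assert not np.allclose(layer(x, visual), x)
```

In the actual model this block (plus a gated FFN) is interleaved between the frozen LM layers and is one of the only parts trained from scratch.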

Objective


๊ฐ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ gradient๋ฅผ accumulateํ•˜๋Š”๊ฒŒ ์ˆœ์ฐจ์ (round-robin)์œผ๋กœ ํ•˜๋Š”๊ฒƒ๋ณด๋‹ค ๋” ์ข‹์•˜์Œ ๊ทธ๋ฆฌ๊ณ  per-dataset weights์ธ $\lambda _m$์„ ํŠœ๋‹ํ•˜๋Š”๊ฒŒ ์„ฑ๋Šฅ์— ํฌ๋ฆฌํ‹ฐ์ปฌํ–ˆ๋‹ค๊ณ  ํ•˜๋„น

Results


Tanh gating

etc.

c.f. Found while searching what the "x" in x-attn stands for: a paper showing that skipping full fine-tuning and tuning only the cross-attention layers still gives good performance. The domain is MT. Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation https://arxiv.org/pdf/2104.08771.pdf