image

paper

TL;DR

  • I read this because.. : aka Cream. A colleague's paper
  • task : DocVQA
  • problem : VQA without OCR is limited in performance, while feeding OCR output to the model as input consumes a large number of tokens
  • idea : use OVD and OCR, extract features from their outputs with an auxiliary encoder, then consume those features via cross-attention (CA)
  • input/output : image, OCR results (box and text), OVD results (box and class text), question -> answer
  • architecture : Vision Encoder (CLIP ViT-L / LAION-2B), Auxiliary encoder (mBART), decoder (mBART, standalone mode), LLM (Vicuna).
  • objective : text read, masked text prediction, captioning, qa, qg / CL loss + LM loss -> qa / LM loss
  • baseline : feeding OCR results directly into an LLM, BLIP, UDOP, Pix2Struct, MatCha, Donut, T5
  • data : (text read and masked text prediction) IIT-CDIP, Webvicob, (captioning) CC3M, (QA + QG) WKVVQA, SquadVQA, TydiVQA (proposed in this paper)
  • evaluation : (ChartQA) Accuracy, ANLS, nED, BERTScore, PPL
  • result : far better than simply feeding OCR output to an LLM. Among document-specific models, it is SOTA among multi-task models except on InfoVQA. The best performance overall belongs to UDOP.
  • contribution : proposes a way to make good use of OCR tokens in the document domain. Proposes a CL method that keeps performance stable even when OCR is unreliable.
  • etc. : the appendix is quite thorough
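
For reference, ANLS from the evaluation list above can be sketched as follows. This follows the commonly used definition (normalized Levenshtein distance with a 0.5 threshold, best match over the accepted answers). The lowercasing and stripping are assumptions of this sketch and may not match the paper's evaluation script exactly.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings
    references:  list of lists of accepted answer strings (one list per question)
    """
    scores = []
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # normalization is an assumption of this sketch
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["hallo", "xyz"], [["hello"], ["hello", "world"]]))
```

The threshold is what distinguishes ANLS from a plain edit-distance score: a small OCR-style slip still earns partial credit, while an unrelated answer earns none.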

Details

Architecture

image

The overall structure is similar to BLIP-2. image

The difference is that, in addition to the vision encoder output, it also uses an auxiliary encoder! The vision encoder output and the aux encoder output are concatenated and fed to the decoder through cross-attention.
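
A minimal sketch of that data flow, assuming a single attention head with no learned projections and toy sizes (the real model uses full transformer layers; every name and dimension here is hypothetical). The thing to notice is that the output length depends only on the number of queries, not on how many aux tokens are concatenated.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention without learned projections.

    queries: n_q vectors (decoder-side queries)
    memory:  n_m vectors (encoder features to attend over)
    Returns n_q vectors, each a weighted average of the memory vectors.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[k] for w, m in zip(weights, memory)) for k in range(dim)])
    return out


def random_features(n, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
dim = 8
queries = random_features(4, dim, rng)        # toy stand-in for the decoder queries
vision_feats = random_features(16, dim, rng)  # toy stand-in for vision encoder patch outputs
aux_feats = random_features(50, dim, rng)     # toy stand-in for aux encoder outputs (OCR/OVD tokens)

memory = vision_feats + aux_feats             # concatenate along the sequence axis
out = cross_attention(queries, memory)
print(len(memory), len(out), len(out[0]))     # 66 4 8
```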

The motivation for using CA here is that text-rich images produce so many OCR results that they would consume far too many tokens if passed as input! image

image image

The figure is drawn in a slightly confusing way (as if the image were cropped), but the positive pair for the contrastive loss seems to be an aux output from the figure above and the output of the corresponding patch (the one whose coordinates overlap it). As for why, the paper explains that this helps when the OCR output is noisy or only partially available. image
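
My reading of the positive-pair construction, as a rough sketch: map each OCR/OVD box to the grid patches it overlaps, then apply an InfoNCE-style loss that pulls the aux feature toward those patches. The cosine similarity, the temperature of 0.07, and the function names are all assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def overlapping_patches(box, image_size, grid):
    """Indices of grid patches that intersect an (x0, y0, x1, y1) pixel box."""
    width, height = image_size
    rows, cols = grid
    pw, ph = width / cols, height / rows
    x0, y0, x1, y1 = box
    eps = 1e-9  # a box ending exactly on a patch border does not spill over
    c0 = max(0, int(x0 // pw))
    c1 = min(cols - 1, int((x1 - eps) // pw))
    r0 = max(0, int(y0 // ph))
    r1 = min(rows - 1, int((y1 - eps) // ph))
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def contrastive_loss(aux_feats, aux_boxes, patch_feats, image_size, grid, temperature=0.07):
    """InfoNCE-style loss: each aux token should match the patches its box overlaps."""
    losses = []
    for feat, box in zip(aux_feats, aux_boxes):
        positives = set(overlapping_patches(box, image_size, grid))
        logits = [cosine(feat, p) / temperature for p in patch_feats]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        pos_mass = sum(e for i, e in enumerate(exps) if i in positives)
        losses.append(-math.log(pos_mass / sum(exps)))
    return sum(losses) / len(losses)


# Toy 2x2 patch grid on a 100x100 image, with one-hot patch features.
patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
boxes = [(0, 0, 40, 40), (60, 60, 90, 90)]  # overlap patch 0 and patch 3 respectively
aligned = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
swapped = [[0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
print(contrastive_loss(aligned, boxes, patches, (100, 100), (2, 2)))
print(contrastive_loss(swapped, boxes, patches, (100, 100), (2, 2)))
```

The aligned case gives a loss near zero and the swapped case a large one, which is the behaviour one would want if the vision patches are supposed to carry the OCR information on their own.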

image image

Because the vision encoder patches are pulled close to the OCR token encoder outputs, the paper seems to argue that performance holds up even when some OCR results are missing. OVD, on the other hand, uses OWL-ViT (with the COCO 80 classes), and the paper says performance barely drops on DocVQA even without OVD (81.2 -> 80.9, A.2.). I suspect that may just be because it is DocVQA.

Dataset

image

Training

details

  • LM : CL = 1: 0.5
  • the number of learnable queries is 224
  • when feeding images to the vision encoder, uses the variable-resolution image input from Pix2Struct (https://github.com/long8v/PTIR/issues/140)
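
A rough sketch of two of the details above, under stated assumptions. `total_loss` reads "LM : CL = 1 : 0.5" as per-loss coefficients. `variable_resolution_grid` follows the Pix2Struct idea of choosing an aspect-ratio-preserving patch grid under a patch budget. The `patch_size` and `max_patches` defaults are illustrative only and are not taken from the paper.

```python
import math


def total_loss(lm_loss, cl_loss, lm_weight=1.0, cl_weight=0.5):
    """Weighted sum, reading 'LM : CL = 1 : 0.5' as per-loss coefficients."""
    return lm_weight * lm_loss + cl_weight * cl_loss


def variable_resolution_grid(height, width, patch_size=16, max_patches=1024):
    """Pick a (rows, cols) patch grid that keeps the aspect ratio within a patch budget.

    patch_size and max_patches are illustrative defaults, not values from the paper.
    """
    # Scale factor at which the image would contain exactly max_patches patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


rows, cols = variable_resolution_grid(height=1000, width=500)
# The image would then be resized to (rows * patch_size, cols * patch_size).
print(rows, cols, rows * cols)
print(total_loss(lm_loss=2.0, cl_loss=1.0))
```

A tall page keeps its tall grid instead of being squashed into a square, which matters for documents where small text would otherwise be distorted.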

Result

image image image

Arithmetic is improved

image

Attaching the LLM improves arithmetic, but it reportedly also generates incorrect text at times. image