paper

TL;DR

  • I read this because.. : I heard that the document domain handles variable resolution in ViT somewhat differently
  • task : document understanding / UI / image captioning
  • problem : Process the image input directly instead of going through a pipeline. Some images have extreme aspect ratios. Handle not only documents but also things like UIs in one model.
  • idea : The web has plenty of pages, so render HTML into a screenshot and have the model generate the original HTML.
  • input/output : web image containing text -> text
  • architecture : ViT encoder + text decoder. Base (282M): 12 encoder layers, 768 hidden dim. Large (1.3B): 18 encoder layers, 1536 hidden dim.
  • objective : autoregressive cross-entropy loss (HTML reconstruction + masked token prediction)
  • baseline : Donut, UDOP, PaLI, VTP, DQAN, LATr, UIB, VUT
  • data : pretraining on 80M screenshots built by downloading URLs from the C4 corpus -> DocVQA, InfographicVQA, ChartQA, AI2D, OCR-VQA, RefExp (find the part of the screen that a natural-language expression refers to), Widget Captioning (caption what a selected element such as a button does in an app screenshot, e.g. "find location")
  • evaluation : ANLS for DocVQA/InfoVQA, exact match for AI2D/RefExp/OCR-VQA, relaxed accuracy(RA) for Chart QA, CIDEr for generation task
  • result : SOTA among models that take only image input. Beyond that, loses to PaLI on captioning and to UDOP on DocVQA and InfoVQA. Beats Donut everywhere.
  • contribution : SOTA on several tasks. In particular, probably the first to tackle UI tasks in the same model.
  • etc. : Can't believe I'm only reading this now..
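
The metrics named in the evaluation bullet can be sketched as follows. This is a minimal per-sample version, assuming the usual ANLS threshold of 0.5 and the usual 5% numeric tolerance for relaxed accuracy. The lowercasing and whitespace stripping are my own normalization assumptions, and the function names are hypothetical.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """Per-sample ANLS: best (1 - normalized edit distance) over the references,
    set to 0 when the normalized distance reaches the threshold."""
    pred = prediction.strip().lower()  # normalization is an assumption
    best = 0.0
    for ref in references:
        ref = ref.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        best = max(best, 1.0 - nl if nl < threshold else 0.0)
    return best


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Relaxed accuracy: numeric answers may deviate by `tolerance` (relative),
    everything else must match exactly."""
    try:
        p, t = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if t == 0.0:
        return p == 0.0
    return abs(p - t) / abs(t) <= tolerance
```

The dataset-level score is the mean of these per-sample values.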

Details

variable-resolution

ViTs are usually trained on images resized to a square, which (1) distorts the image and (2) hurts performance later when moving to high resolution, because the sequence length grows. The approach proposed here preserves the aspect ratio and resizes the image so that the patch sequence fills the maximum sequence length as fully as possible (the patch size itself does not change).
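
A minimal sketch of this sizing rule: pick the largest uniform scale such that the patch grid still fits the patch budget. The 16-pixel patch and the 2048-patch budget are illustrative defaults I chose, not values from these notes, and the function names are hypothetical.

```python
import math


def variable_resolution_grid(height, width, patch_size=16, max_patches=2048):
    """Return (rows, cols) of the patch grid after an aspect-preserving resize."""
    # Largest uniform scale s with (s*H/p) * (s*W/p) <= max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down to whole patches; clamp so extreme ratios keep at least one patch.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


def resized_shape(height, width, patch_size=16, max_patches=2048):
    """Pixel size the image is resized to; the patch size stays fixed."""
    rows, cols = variable_resolution_grid(height, width, patch_size, max_patches)
    return rows * patch_size, cols * patch_size
```

For an 800x600 input this gives a 52x39 grid (2028 of 2048 patches used), resized to 832x624, so the 4:3 ratio is kept.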

Pretraining

HTML pages are rendered from C4 URLs. The HTML is simplified first: (1) only visible elements are kept, and (2) when a node has no visible element of its own but has a child, the child is replaced by the grandchildren. Only text, alt-text, and filenames are used. The model is told to recover the HTML of the region marked with a blue box in the screenshot.
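
The two simplification rules can be sketched on a toy DOM tree. This is my reading of the rules, not the authors' code: `Node`, `simplify`, and `linearize` are hypothetical names, "visible" is approximated as "has non-empty text", and a child is only collapsed when it carries no text of its own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    text: str = ""  # visible text, alt-text, or image filename
    children: List["Node"] = field(default_factory=list)


def simplify(node):
    """Return the simplified subtree, or None if nothing visible remains."""
    kept = [c for c in (simplify(child) for child in node.children) if c is not None]
    if not node.text and not kept:
        return None  # rule (1): drop nodes with no visible content below them
    # Rule (2): a text-less node whose only child is also text-less
    # adopts the grandchildren, which removes chained wrappers.
    while not node.text and len(kept) == 1 and not kept[0].text and kept[0].children:
        kept = kept[0].children
    return Node(node.tag, node.text, kept)


def linearize(node):
    """Serialize the simplified tree into the target string for the decoder."""
    inner = node.text + "".join(linearize(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


page = Node("div", children=[
    Node("div", children=[Node("div", children=[Node("p", "hi")])]),
    Node("script"),              # nothing visible, dropped
    Node("img", "logo.png"),     # filename kept as text
])
target = linearize(simplify(page))
```

Here the nested wrapper `div > div > div > p` collapses to `div > div > p`, and the empty `script` node disappears from the target.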

In addition, parts of the text are masked with red boxes and the model has to predict them, a kind of masked language modeling on images. About 50% of the text is masked.

Curriculum learning

Training the above from scratch is unstable, so the model is first taught to read: BooksCorpus text is rendered with random colors and random fonts, and the model trains on it for about 30K steps (200K in Donut).

Finetuning

Just as a GPT-style model takes the question as part of its input, here the question (and similar prompt text) is rendered directly into the image.

Training Details

  • 282M / 1.3B (Donut 143M)
  • 12 layers 768 hidden dim / 18 layers 1536 hidden dim
  • 128 image patches
  • 128 decoder sequence length
  • output is kept within 128 characters.
  • batch size 2048 with 64 TPUs / batch size 1024 with 128 TPUs (196 with 64 A100s in Donut)
  • validated with BLEU.

Result

Loses to PaLI on captioning, and on text-rich tasks such as DocVQA it also loses to UDOP, which uses OCR and other extra inputs. Given the pretraining data, it probably cannot avoid losing to models trained on large amounts of caption data. Otherwise it beats Donut / GIT, and on the UI tasks in particular it dominates, SOTA by a wide margin.

Ablation

  • pretraining component

Removing screenshot parsing causes the largest drop; removing warmup or masking causes drops of similar size.

  • variable-resolution

Padding does noticeably worse. Stretching is the variant that distorts the aspect ratio.
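
A rough way to see what each baseline costs, assuming both resize onto a square canvas (my simplification, the exact ablation setup may differ). The helper names are hypothetical.

```python
def padding_fill_ratio(height, width):
    """Fraction of a square canvas that holds real pixels when the image is
    scaled to fit and the remainder is padding."""
    return min(height, width) / max(height, width)


def stretch_distortion(height, width):
    """Factor by which the short side is stretched relative to the long side
    when the image is squashed to a square."""
    return max(height, width) / min(height, width)
```

For a 2000x500 screenshot, padding leaves only a quarter of the patches carrying content, while stretching distorts the aspect ratio by a factor of 4. The variable-resolution resize avoids both costs.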