image

paper, code, blog

TL;DR

  • I read this because.. : high MathVista performance
  • task : MLLM
  • problem : a single VLM that covers multi-image and video inputs as well
  • idea : configure anyres slightly differently for each domain. Curate the data carefully and train!
  • input/output : {image or images or video, question} -> answer
  • architecture : SigLIP SO400M + 2 layer MLP + Qwen2 {0.5B, 7.6B, 72.7B}
  • objective : CE loss
  • baseline : QwenVL, Gemini-Pro, Claude 3.5 Sonnet, GPT4V, GPT4o, VILA, Cambrian, InternVL
  • data : stage 1.0 (still LCS-558K), stage 1.5 (3.5M LLaVA recap, UReader, SynDog, Chinese ShareGPT4V), stage 2.0 (curated Single Image 3.2M and OneVision 1.6M)
  • evaluation : AI2D, ChartQA, DocVQA, InfoVQA, MathVerse, MathVista, MMBench, MME, MMStar, MMMU, MMVet, SeedBench, ScienceQA, ImageDC, RealWorldQA, … multi-image benchmarks (5), video benchmarks (9)
  • result : on single-image evals, compared with the same-scale InternVL2-8B, the only score that looks meaningfully higher is probably MathVista (63.2). Solid performance on the multi-image and video benchmarks.
  • contribution : got video benchmark numbers out quickly.
  • etc. :
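
To make the "objective : CE loss" item concrete, here is a minimal pure-Python sketch of token-level cross-entropy for the {image, question} -> answer setup. It assumes, as is common for LLaVA-style instruction tuning but not stated in these notes, that the loss is averaged only over answer tokens; `masked_ce_loss`, the toy logits, and the mask are all hypothetical names and values for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def masked_ce_loss(logits_per_pos, target_ids, loss_mask):
    # Mean cross-entropy over positions where loss_mask is 1.
    # Assumption: visual and question positions are masked out (mask = 0).
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_pos, target_ids, loss_mask):
        if not keep:
            continue
        total += -log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0

# Toy sequence: vocabulary of 3, four positions.
# The first two positions stand in for image/question tokens (masked),
# the last two for answer tokens (contribute to the loss).
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 0, 1, 2]
mask = [0, 0, 1, 1]
loss = masked_ce_loss(logits, targets, mask)
```

With uniform logits on the two answer positions, the loss equals ln(3), which is the expected value for a uniform guess over three classes.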

Details

  • thumbnail image

  • anyres changes image

  • how anyres is applied per modality image

image
  • stage 1: still LCS

  • stage 1.5 image

  • stage 2 image

anyres is not applied in stage 1. Data is fed in so that the sequence length gets progressively longer.
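
A rough sketch of why turning anyres on lengthens the sequence: without anyres an image contributes one base view, while with anyres it contributes the base view plus a grid of crops. The numbers here are assumptions, not taken from these notes: 729 tokens per view (27 x 27 patches from a 384 px SigLIP input), and `max_tokens` is a hypothetical cap that shrinks per-view tokens when the total would be too long.

```python
TOKENS_PER_VIEW = 729  # assumption: 27 x 27 patches per 384 px SigLIP view

def visual_tokens(grid_w, grid_h, use_anyres, max_tokens=None):
    # Number of visual tokens one image contributes to the LLM sequence.
    if not use_anyres:
        # Stage-1 style: only the resized base view is encoded.
        return TOKENS_PER_VIEW
    # anyres: one base view plus grid_w * grid_h high-resolution crops.
    views = grid_w * grid_h + 1
    total = views * TOKENS_PER_VIEW
    if max_tokens is not None and total > max_tokens:
        # Hypothetical cap: reduce tokens per view so the total fits.
        per_view = max_tokens // views
        total = views * per_view
    return total

stage1_tokens = visual_tokens(2, 2, use_anyres=False)
anyres_tokens = visual_tokens(2, 2, use_anyres=True)
capped_tokens = visual_tokens(2, 2, use_anyres=True, max_tokens=2000)
```

Under these assumptions a 2 x 2 grid goes from 729 tokens (no anyres) to 3645 tokens (5 views), which is the kind of growth the staged schedule has to absorb.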

Result

image image image