
paper, code, blog

TL;DR

  • I read this because: high MathVista performance
  • task: MLLM
  • problem: one VLM that covers single-image, multi-image, and video all at once
  • idea: set up AnyRes a little differently for each modality, then collect data and train!
  • input/output: {image(s) or video, question} -> answer
  • architecture: SigLIP SO400M + 2-layer MLP + Qwen2 {0.5B, 7.6B, 72.7B}
  • objective: CE loss
  • baseline: Qwen-VL, Gemini Pro, Claude 3.5 Sonnet, GPT-4V, GPT-4o, VILA, Cambrian, InternVL
  • data: stage 1.0 (still LCS-558K), stage 1.5 (3.5M: LLaVA recap, UReader, SynthDoG, Chinese ShareGPT4V), stage 2.0 (curated Single-Image 3.2M and OneVision 1.6M)
  • evaluation: AI2D, ChartQA, DocVQA, InfoVQA, MathVerse, MathVista, MMBench, MME, MMStar, MMMU, MM-Vet, SEED-Bench, ScienceQA, ImageDC, RealWorldQA, … multi-image benchmarks (5), video benchmarks (9)
  • result: compared with InternVL2-8B at the same scale, MathVista (63.2) seems to be the only single-image benchmark with a clearly higher score; good performance on the multi-image and video benchmarks
  • contribution: quickly claiming the video benchmarks
  • etc. :
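The architecture and objective bullets can be sketched end to end. This is a toy numpy sketch, not the paper's implementation: random features stand in for SigLIP SO400M patch embeddings and a random linear head stands in for Qwen2; all dimensions and weights are made-up values. It shows the 2-layer MLP projector mapping visual features into the LLM embedding space, concatenation with text embeddings, and a CE loss computed only on the text positions.

```python
import numpy as np

rng = np.random.default_rng(0)
V_DIM, L_DIM, VOCAB = 32, 64, 100  # toy dimensions, not the paper's

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(v, w1, b1, w2, b2):
    # the 2-layer MLP projector: Linear -> GELU -> Linear
    return gelu(v @ w1 + b1) @ w2 + b2

def ce_loss(logits, labels, ignore=-100):
    # token-level cross-entropy; positions labeled `ignore`
    # (the visual tokens) contribute nothing to the loss
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    mask = labels != ignore
    return -logp[mask, labels[mask]].mean()

w1, b1 = rng.normal(size=(V_DIM, L_DIM)) * 0.1, np.zeros(L_DIM)
w2, b2 = rng.normal(size=(L_DIM, L_DIM)) * 0.1, np.zeros(L_DIM)
head = rng.normal(size=(L_DIM, VOCAB)) * 0.1  # toy LM head (Qwen2 stand-in)

vision = rng.normal(size=(9, V_DIM))  # 9 "patch" features for one image
text = rng.normal(size=(5, L_DIM))    # 5 text-token embeddings
seq = np.concatenate([project(vision, w1, b1, w2, b2), text], axis=0)

labels = np.concatenate([np.full(9, -100), rng.integers(0, VOCAB, 5)])
loss = ce_loss(seq @ head, labels)
print(round(float(loss), 3))
```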

Details

  • thumbnail (figure)

  • AnyRes changes (figure)
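As I read the figure, the AnyRes change is a "higher AnyRes" scheme: allow more crops, but if the chosen grid's total visual-token count exceeds a threshold, shrink each crop's token grid (the paper mentions bilinear interpolation; the integer side-length shrink below is a simple stand-in). The 729-tokens-per-crop and threshold values are toy assumptions.

```python
import math

def tokens_after_threshold(n_crops, tokens_per_crop, threshold):
    # keep the crop grid as-is when the total token count fits
    total = n_crops * tokens_per_crop
    if total <= threshold:
        return tokens_per_crop
    # otherwise shrink each crop's token-grid side so the total fits
    # (stand-in for the bilinear interpolation mentioned in the paper)
    side = math.isqrt(tokens_per_crop)
    new_side = int(math.sqrt(threshold / n_crops))
    return min(side, new_side) ** 2

print(tokens_after_threshold(5, 729, 8000))   # under budget: 729 per crop
print(tokens_after_threshold(10, 729, 4000))  # over budget: shrunk to 400
```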

  • How AnyRes is applied per modality (figure)

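A rough way to read the per-modality figure: the visual-token budget is spent differently per input type. A single image gets a base view plus an a×b crop grid, multiple images get one base view each, and video frames each get a pooled, cheaper representation. The constants below (729 tokens per view, 2×2 pooling of video frames) are illustrative assumptions, not the paper's exact configuration.

```python
TOKENS_PER_VIEW = 729  # e.g. a 27x27 patch grid per view (assumed)
VIDEO_POOL = 4         # assumed 2x2 pooling of each frame's tokens

def visual_token_budget(modality, n_items=1, grid=(1, 1)):
    # toy per-modality token accounting for the AnyRes variants
    if modality == "single-image":
        a, b = grid
        return TOKENS_PER_VIEW * (1 + a * b)   # base view + a*b crops
    if modality == "multi-image":
        return TOKENS_PER_VIEW * n_items       # one base view per image
    if modality == "video":
        return (TOKENS_PER_VIEW // VIDEO_POOL) * n_items  # pooled frames
    raise ValueError(modality)

print(visual_token_budget("single-image", grid=(2, 2)))  # 729 * 5 = 3645
print(visual_token_budget("multi-image", n_items=4))     # 729 * 4 = 2916
print(visual_token_budget("video", n_items=32))          # 182 * 32 = 5824
```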
  • stage 1: Still LCS

  • stage 1.5 (figure)

  • stage 2 (figure)

AnyRes is not applied in stage 1. Across training, data is fed in order of increasingly long sequence lengths.
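That scheduling amounts to sorting (or bucketing) the training mix by sequence length so short samples come first and long video samples last. The sample records and lengths below are hypothetical.

```python
# hypothetical training samples with precomputed sequence lengths
samples = [
    {"id": "doc-qa", "seq_len": 4096},
    {"id": "caption", "seq_len": 512},
    {"id": "video-qa", "seq_len": 8192},
    {"id": "vqa", "seq_len": 1024},
]

# order the mix from short to long sequences
curriculum = sorted(samples, key=lambda s: s["seq_len"])
print([s["id"] for s in curriculum])
# -> ['caption', 'vqa', 'doc-qa', 'video-qa']
```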

Result

(result table figures)