TL;DR
- I read this because : high MathVista performance
- task : MLLM
- problem : a single VLM that handles single-image, multi-image, and video inputs all at once
- IDEA : configure AnyRes slightly differently for each modality, then collect data and train!
- input/output : {image or images or video, question} -> answer
- architecture : SigLIP SO400M + 2-layer MLP + Qwen2 {0.5B, 7.6B, 72.7B}
- objective : CE loss
- baseline : Qwen-VL, Gemini Pro, Claude 3.5 Sonnet, GPT-4V, GPT-4o, VILA, Cambrian, InternVL
- data : stage 1.0 (still LCS-558K), stage 1.5 (3.5M: LLaVA-ReCap, UReader, SynthDoG, Chinese ShareGPT4V), stage 2.0 (curated Single-Image 3.2M and OneVision 1.6M)
- evaluation : AI2D, ChartQA, DocVQA, InfoVQA, MathVerse, MathVista, MMBench, MME, MMStar, MMMU, MMVet, SeedBench, ScienceQA, ImageDC, RealWorldQA, … multi-image benchmarks (5), video benchmarks (9)
- result : against InternVL2-8B at the same scale, MathVista (63.2) seems to be the only single-image eval that is significantly higher; good performance on the multi-image and video benchmarks
- contribution : being quick to stake out the video benchmarks
- etc. :
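The architecture bullet above can be sketched at the shape level. This is a minimal sketch, not the paper's code: the 729-tokens-per-crop figure matches SigLIP SO400M at 384×384, but the LLM hidden size (3584 for the 7B variant) and all function names are my assumptions.

```python
# Shape-level sketch of SigLIP SO400M + 2-layer MLP projector + Qwen2 (hypothetical names).

VISION_DIM = 1152          # SigLIP SO400M hidden size
LLM_DIM = 3584             # Qwen2-7B hidden size (assumption for the 7.6B variant)
TOKENS_PER_CROP = 27 * 27  # 729 patch tokens per 384x384 crop

def encode_crops(n_crops: int) -> tuple[int, int]:
    """SigLIP encodes each crop independently; tokens concatenate along the sequence."""
    return (n_crops * TOKENS_PER_CROP, VISION_DIM)

def mlp_project(shape: tuple[int, int]) -> tuple[int, int]:
    """The 2-layer MLP projector maps vision features into the LLM embedding space."""
    n_tokens, _ = shape
    return (n_tokens, LLM_DIM)

def llm_input_length(n_crops: int, n_text_tokens: int) -> int:
    """Projected visual tokens are placed alongside the text tokens for Qwen2."""
    n_visual, _ = mlp_project(encode_crops(n_crops))
    return n_visual + n_text_tokens

# e.g. a base view + 4 high-res crops, with a 50-token question:
print(llm_input_length(n_crops=5, n_text_tokens=50))  # 5*729 + 50 = 3695
```

Training then applies plain CE loss over the answer tokens of this combined sequence.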
Details
AnyRes changes
How AnyRes is applied per modality
stage 1: Still LCS
stage 1.5
stage 2
AnyRes is not applied in stage 1. Data is fed in the direction of increasingly long sequence lengths.
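The per-modality AnyRes scheme can be sketched as a token budget. The exact caps here are my assumptions about the configuration (base view of 729 tokens; single images get extra high-res crops; multi-image gets one base view per image; video frames are pooled down to 196 tokens each):

```python
# Sketch of per-modality visual token budgeting under AnyRes (caps are assumptions).

BASE_TOKENS = 729    # 27x27 SigLIP tokens per 384x384 view
FRAME_TOKENS = 196   # 14x14 tokens per video frame after pooling (assumption)

def visual_token_count(modality: str, n: int) -> int:
    """n = high-res crops (single image), images (multi-image), or frames (video)."""
    if modality == "single_image":
        # base thumbnail view + n full-resolution crops, all at 729 tokens each
        return BASE_TOKENS * (1 + n)
    if modality == "multi_image":
        # one base view per image, no per-image crops
        return BASE_TOKENS * n
    if modality == "video":
        # aggressive pooling keeps many-frame videos affordable
        return FRAME_TOKENS * n
    raise ValueError(f"unknown modality: {modality}")

print(visual_token_count("single_image", 9))  # 729 * 10 = 7290
print(visual_token_count("multi_image", 8))   # 729 * 8  = 5832
print(visual_token_count("video", 32))        # 196 * 32 = 6272
```

The point of the per-modality split is that the three budgets land in a similar range, so one model sees comparable sequence lengths whether the input is one detailed image, several images, or many frames.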