TL;DR
- I read this because : high MathVista performance
- task : MLLM
- problem : a single VLM that handles single-image, multi-image, and video inputs all at once
- IDEA : configure AnyRes slightly differently for each modality, then collect data and train!
- input/output : {image or images or video, question} -> answer
- architecture : SigLIP SO400M + 2-layer MLP + Qwen2 {0.5B, 7.6B, 72.7B}
- objective : CE loss
- baseline : Qwen-VL, Gemini Pro, Claude 3.5 Sonnet, GPT-4V, GPT-4o, VILA, Cambrian, InternVL
- data : stage 1.0 (still LCS-558K), stage 1.5 (3.5M: LLaVA-ReCap, UReader, SynthDoG, Chinese ShareGPT4V), stage 2.0 (curated Single-Image 3.2M and OneVision 1.6M)
- evaluation : AI2D, ChartQA, DocVQA, InfoVQA, MathVerse, MathVista, MMBench, MME, MMStar, MMMU, MMVet, SeedBench, ScienceQA, ImageDC, RealWorldQA, … multi-image benchmarks (5), video benchmarks (9)
- result : against InternVL2-8B at the same scale, MathVista (63.2) seems to be the only single-image eval that is significantly higher; good performance on the multi-image and video benchmarks
- contribution : being quick to stake out the video benchmarks
- etc. :
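The architecture bullet above can be sketched at the shape level. This is a minimal sketch, not the paper's code: the 729-tokens-per-crop figure matches SigLIP SO400M at 384×384, but the LLM hidden size (3584 for the 7B variant) and all function names are my assumptions.

```python
# Shape-level sketch of SigLIP SO400M + 2-layer MLP projector + Qwen2 (hypothetical names).

VISION_DIM = 1152          # SigLIP SO400M hidden size
LLM_DIM = 3584             # Qwen2-7B hidden size (assumption for the 7.6B variant)
TOKENS_PER_CROP = 27 * 27  # 729 patch tokens per 384x384 crop

def encode_crops(n_crops: int) -> tuple[int, int]:
    """SigLIP encodes each crop independently; tokens concatenate along the sequence."""
    return (n_crops * TOKENS_PER_CROP, VISION_DIM)

def mlp_project(shape: tuple[int, int]) -> tuple[int, int]:
    """The 2-layer MLP projector maps vision features into the LLM embedding space."""
    n_tokens, _ = shape
    return (n_tokens, LLM_DIM)

def llm_input_length(n_crops: int, n_text_tokens: int) -> int:
    """Projected visual tokens are placed alongside the text tokens for Qwen2."""
    n_visual, _ = mlp_project(encode_crops(n_crops))
    return n_visual + n_text_tokens

# e.g. a base view + 4 high-res crops, with a 50-token question:
print(llm_input_length(n_crops=5, n_text_tokens=50))  # 5*729 + 50 = 3695
```

Training then applies plain CE loss over the answer tokens of this combined sequence.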
Details
AnyRes changes
How AnyRes is applied per modality
stage 1: Still LCS
stage 1.5
stage 2
AnyRes is not applied in stage 1. Data is fed in the direction of increasingly long sequence lengths.
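The per-modality AnyRes scheme can be sketched as a token budget. The exact caps here are my assumptions about the configuration (base view of 729 tokens; single images get extra high-res crops; multi-image gets one base view per image; video frames are pooled down to 196 tokens each):

```python
# Sketch of per-modality visual token budgeting under AnyRes (caps are assumptions).

BASE_TOKENS = 729    # 27x27 SigLIP tokens per 384x384 view
FRAME_TOKENS = 196   # 14x14 tokens per video frame after pooling (assumption)

def visual_token_count(modality: str, n: int) -> int:
    """n = high-res crops (single image), images (multi-image), or frames (video)."""
    if modality == "single_image":
        # base thumbnail view + n full-resolution crops, all at 729 tokens each
        return BASE_TOKENS * (1 + n)
    if modality == "multi_image":
        # one base view per image, no per-image crops
        return BASE_TOKENS * n
    if modality == "video":
        # aggressive pooling keeps many-frame videos affordable
        return FRAME_TOKENS * n
    raise ValueError(f"unknown modality: {modality}")

print(visual_token_count("single_image", 9))  # 729 * 10 = 7290
print(visual_token_count("multi_image", 8))   # 729 * 8  = 5832
print(visual_token_count("video", 32))        # 196 * 32 = 6272
```

The point of the per-modality split is that the three budgets land in a similar range, so one model sees comparable sequence lengths whether the input is one detailed image, several images, or many frames.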