
paper

TL;DR

  • I read this because.. : I read this a while ago, but revisited only the video parts
  • task : MLLM -> GUI agent, grounding, video understanding
  • problem : a better open-source model!
  • idea : more diverse data, efficient ViT, absolute-time MRoPE
  • input/output : {image, video, question} -> answer
  • architecture : custom ViT (window attention, native resolution (2D-RoPE)) + Qwen2.5 LLM
  • objective : SFT -> DPO
  • baseline : proprietary models (Claude 3.5, GPT-4o), InternVL, Qwen2-VL
  • data : interleaved data, image/video QAs, long-context videos, …
  • evaluation : {text, image, video} bench, GUI, grounding, OCR bench
  • result : SOTA on almost all benchmarks
  • contribution : an efficient ViT variant with strong abilities across many domains
  • etc. :

Details

architecture

  • For computational efficiency, only 4 layers use full attention; the rest use windowed attention with 112 x 112 windows (8 x 8 patches) – see the sketch after this list
    • CLIP pretraining is redone from scratch for this ViT
  • Looking at it now, every model size uses the same ViT
  • Keeps the MRoPE introduced in Qwen2-VL, but makes the temporal IDs depend on absolute time (which second of the video the frame actually comes from) rather than on the input frame index – see the second sketch after this list
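
A minimal sketch of the windowed-attention layout above, assuming 32 ViT layers with full attention at indices {7, 15, 23, 31} (I believe this matches the released config, but treat it as an assumption); projections, MLPs, and padding are omitted, so this only shows the partition bookkeeping:

```python
import torch
import torch.nn.functional as F

def window_partition(x, grid_h, grid_w, win=8):
    # (grid_h * grid_w, dim) patch tokens -> (num_windows, win*win, dim).
    # Assumes grid_h and grid_w are divisible by win; real code pads instead.
    dim = x.shape[-1]
    x = x.view(grid_h, grid_w, dim)
    x = x.view(grid_h // win, win, grid_w // win, win, dim)
    return x.permute(0, 2, 1, 3, 4).reshape(-1, win * win, dim)

def window_unpartition(w, grid_h, grid_w, win=8):
    # Inverse of window_partition.
    dim = w.shape[-1]
    x = w.view(grid_h // win, grid_w // win, win, win, dim)
    return x.permute(0, 2, 1, 3, 4).reshape(grid_h * grid_w, dim)

grid_h = grid_w = 32                 # 32 x 32 patches of 14 px ~= a 448 x 448 image
tokens = torch.randn(grid_h * grid_w, 64)
full_attn_layers = {7, 15, 23, 31}   # assumed indices of the 4 full-attention layers

for layer in range(32):
    if layer in full_attn_layers:
        # full self-attention over all patches: O(N^2) in the token count
        q = tokens.unsqueeze(0)
        tokens = F.scaled_dot_product_attention(q, q, q).squeeze(0)
    else:
        # attention only inside 8x8-patch (112 x 112 px) windows: O(N * 64)
        w = window_partition(tokens, grid_h, grid_w)
        w = F.scaled_dot_product_attention(w, w, w)
        tokens = window_unpartition(w, grid_h, grid_w)
```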
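And a toy contrast for the absolute-time temporal IDs. The `tokens_per_second = 2` scale is my assumption (I believe the released config uses this value); the point is only that ID gaps track wall-clock gaps instead of frame indices:

```python
import torch

def video_temporal_ids(frame_times_s, tokens_per_second=2):
    # Qwen2-VL: temporal ID = frame index, so spacing says nothing about
    # how much real time passed between sampled frames.
    frame_index_ids = torch.arange(len(frame_times_s))
    # Qwen2.5-VL: temporal ID tied to the frame's absolute timestamp, so
    # ID gaps reflect real time gaps even under dynamic-FPS sampling.
    t = torch.as_tensor(frame_times_s, dtype=torch.float32)
    absolute_time_ids = torch.round(t * tokens_per_second).long()
    return frame_index_ids, absolute_time_ids

# frames drawn at a varying rate: dense early, sparse later
idx_ids, time_ids = video_temporal_ids([0.0, 0.5, 1.0, 4.0, 10.0])
print(idx_ids.tolist())   # [0, 1, 2, 3, 4]   – a gap of 1 everywhere
print(time_ids.tolist())  # [0, 1, 2, 8, 20]  – 0.5 s vs 3 s vs 6 s gaps survive
```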

data

  • pretraining
  • Grounding Data with Absolute Position Coordinates (sketch after this list)
  • Document Omni-Parsing Data
  • Video Data
    • During training, the FPS is sampled dynamically (sketch after this list).
    • For videos longer than 30 minutes, video captions are generated via multi-frame captioning.
    • Video grounding data includes timestamps in both the second-based and the hour-minute-second-frame format (sketch after this list).
  • Agent data
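
For the grounding data, a toy contrast between [0, 1000]-normalized coordinates (Qwen2-VL style) and the absolute pixel coordinates Qwen2.5-VL trains on; the function name and the numbers are mine:

```python
def norm1000_to_pixels(bbox, width, height):
    # Qwen2-VL-style box normalized to [0, 1000] -> absolute pixel coordinates.
    # Qwen2.5-VL skips the normalization and reads/writes pixel values directly,
    # which pairs naturally with native-resolution input.
    x1, y1, x2, y2 = bbox
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)

# On a 1920 x 1080 frame, the normalized box (215, 213, 510, 713) is the same
# region as the absolute box ~(413, 230, 979, 770):
print(norm1000_to_pixels((215, 213, 510, 713), 1920, 1080))
```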
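Dynamic FPS sampling, as I read it, just means drawing a different frame rate per training sample; the fps choices and the frame cap below are illustrative assumptions, not the paper's numbers:

```python
import random

def sample_frame_times(duration_s, fps_choices=(0.5, 1.0, 2.0, 4.0), max_frames=512):
    # Draw an fps for this sample and return the absolute timestamps of the
    # sampled frames; absolute-time MRoPE keeps the timing signal consistent
    # regardless of which fps was drawn.
    fps = random.choice(fps_choices)
    n = min(int(duration_s * fps), max_frames)
    return [i / fps for i in range(n)], fps

times, fps = sample_frame_times(90.0)   # a 90 s clip
```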
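And the two timestamp formats for video grounding; the exact string layouts and the fps behind the frame field are assumptions:

```python
def to_hmsf(t_seconds, fps=25.0):
    # second-based timestamp -> hour-minute-second-frame string
    total_frames = round(t_seconds * fps)
    h, rem = divmod(int(t_seconds), 3600)
    m, s = divmod(rem, 60)
    f = total_frames % round(fps)
    return f"{h:02d}:{m:02d}:{s:02d}:{f:02d}"

t = 3725.48
print(f"{t:.2f}s")   # second-based: "3725.48s"
print(to_hmsf(t))    # HMSF:         "01:02:05:12"
```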

training recipe

  • Given that the vision encoder has relatively fewer parameters and that we introduced window attention to further reduce its computational demands, we focused on balancing the computational load of the LLM across different GPUs.

  • Specifically, we dynamically packed data samples based on their corresponding input sequence lengths to the LLM, ensuring consistent computational loads. In the first and second phases, data were uniformly packed to a sequence length of 8,192, while in the third phase, the sequence length was increased to 32,768 to accommodate the model’s enhanced capacity for handling longer sequences. (A minimal packing sketch after this list.)

  • rejection sampling for enhanced reasoning
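
A minimal sketch of the packing idea above (first-fit-decreasing here; the paper does not spell out its exact policy):

```python
def pack_samples(sample_lengths, max_len=8192):
    # Greedily pack variable-length samples into fixed-capacity bins so that
    # every packed sequence presents a similar compute load to the LLM.
    bins = []  # each bin: [remaining capacity, [sample indices]]
    for idx, length in sorted(enumerate(sample_lengths), key=lambda x: -x[1]):
        for b in bins:
            if b[0] >= length:           # first bin with room wins
                b[0] -= length
                b[1].append(idx)
                break
        else:
            bins.append([max_len - length, [idx]])
    return [indices for _, indices in bins]

lengths = [5000, 3000, 2000, 7000, 1000, 4000]
print(pack_samples(lengths))   # [[3, 4], [0, 1], [5, 2]] – each pack <= 8192 tokens
# the third phase would use max_len=32768 instead
```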

Post-training

  • SFT / DPO

Performance

  • video bench

    • Video-MME
      • https://video-mme.github.io/home_page.html
      • multiple video domain
      • 0 s ~ 60 min (avg 1,017 s)
      • perception, recognition, OCR, temporal reasoning
      • 2,700 QA pairs
      • The videos are sampled at 2 fps, and the upper limit is 480 frames – so does that mean they sample at 2 fps and simply drop everything past 480 frames? At 2 fps, 480 frames covers only 240 s (4 min), far less than the hour-long videos.
      • probably MCQA
    • Video-MMMU
      • https://videommmu.github.io/ – leans toward education/knowledge domains
      • art, humanities, medicine, business, science, engineering
      • Almost all of it seems to be educational videos. Probably MCQA
    • MMVU
      • https://arxiv.org/abs/2501.12380
      • Seems similar to Video-MMMU
      • expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering
    • MVBench
      • https://arxiv.org/abs/2311.17005
      • Spatial Understanding, Temporal Understanding
      • video source: Perception Test, CLEVRER, MiT V1, STAR, Charades-STA – doesn't look that strong
    • MMBench-Video
      • https://mmbench-video.github.io/
      • perception / reasoning
      • video duration: 0 min ~ 6 min – too short, so it doesn't look that useful
      • there seem to be various video categories: people, news, science, advertisement, …
    • LongVideoBench
      • https://longvideobench.github.io/
      • The LongVideoBench highlights referred reasoning questions, which are dependent on long frame inputs and cannot be well-addressed by a single frame or a few sparse frames
        • many somewhat unusual tasks, e.g. referring to a person appearing in a specific frame (questions are relatively fine-grained)
      • 3,763 videos, 6,678 eval QAs, avg duration 473 s – lots of long videos, so it looks good
      • domain: life, movie, knowledge, news
    • LVBench
      • https://lvbench.github.io/
      • avg duration 4,101 s / 1,549 QAs
      • multi-domain (including sports domain)
      • temporal grounding, entity recognition, key information retrieval, reasoning, … – looks relatively practical!
    • EgoSchema
    • Perception Test
    • MLVU
      • https://arxiv.org/pdf/2406.04264
      • multi-domain
      • closed-ended tasks: NeedleQA, PlotQA, action order, action count — don't look all that practical
      • open-ended tasks: video summarization, sub-scene captioning, anomaly recognition (e.g., a fight caught on CCTV)
    • TempCompass
      • https://arxiv.org/abs/2403.00476
      • Doesn't look practical: action, fine-grained action, attribute change, event order, direction
      • seems closer to low-level vision
    • Charades-STA
  • spatial

    • (results figure)
  • OCR related

    • (results figure)
  • pure image

    • (results figure)
  • pure text

    • (results figure)