
paper

TL;DR

  • I read this because: it’s been a while since I first read it; this time I re-read the video parts.
  • task : MLLM -> GUI agent, grounding, video understanding
  • problem : better open source model!
  • idea : more diverse data, efficient ViT, MRoPE
  • input/output : {image, video, question} -> answer
  • architecture : custom ViT (window attention, native resolution(2D-RoPE)) + Qwen2.5 LLM
  • objective : SFT -> DPO
  • baseline : proprietary models (Claude 3.5, GPT-4o), InternVL, Qwen2-VL
  • data : interleaved data, image QAs, video QAs, long-context videos, …
  • evaluation : {text, image, video} bench, GUI, grounding, OCR bench
  • result : SOTA on almost all benchmarks
  • contribution : an efficient ViT variant plus strong abilities across many domains
  • etc. :

Details

architecture

(architecture figures)
  • For efficient computation, only 4 layers use full attention; the rest use windowed attention with 112 × 112 windows (8 × 8 patches)
  • The ViT is retrained, starting over from CLIP pretraining
  • Looking at it now, every model size uses the same ViT.
  • (figure)
  • Keep the MRoPE introduced in Qwen2-VL, but make the temporal ID depend on absolute time (how many seconds into the video each frame is) instead of on the frame index.
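A minimal sketch of the absolute-time idea (the function name and the 2-IDs-per-second granularity are my assumptions, not the paper’s code): tying temporal position IDs to seconds means the same real-time interval always spans the same ID distance, regardless of the sampling FPS.

```python
def temporal_ids(frame_times_sec, ids_per_second=2):
    """Map each frame's absolute timestamp to a temporal position ID.

    With frame-index-based positions, a clip sampled at 1 fps and the
    same clip sampled at 2 fps would get different temporal spacing;
    tying IDs to absolute seconds keeps time intervals consistent.
    """
    return [round(t * ids_per_second) for t in frame_times_sec]

# The same 4-second clip sampled at different FPS:
print(temporal_ids([0.0, 1.0, 2.0, 3.0]))       # 1 fps -> [0, 2, 4, 6]
print(temporal_ids([0.0, 0.5, 1.0, 1.5, 2.0]))  # 2 fps -> [0, 1, 2, 3, 4]
```

Note that frame IDs now encode "when", not "which frame": the 2-fps clip lands on every intermediate ID, the 1-fps clip skips every other one, but both span the same range.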

data

  • pretraining
  • Grounding Data with Absolute Position Coordinates
  • Document Omni-Parsing Data
  • Video Data
  • Dynamically sample the FPS during training.
  • For videos longer than 30 minutes, generate video captions by aggregating multi-frame captions
  • For video grounding data, include timestamps in both second-based and hour-minute-second-frame formats
  • Agent data
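A quick sketch of the two timestamp formats mentioned above (the exact string layout and the 2-fps frame index are my assumptions, not the paper’s spec):

```python
def to_hmsf(seconds, fps=2.0):
    """Convert an absolute time in seconds to hour:minute:second:frame.

    The same grounding target can then be written either second-based
    ("75.5") or as an hmsf string ("00:01:15:01" at 2 fps).
    """
    whole = int(seconds)
    h, rem = divmod(whole, 3600)
    m, s = divmod(rem, 60)
    f = int(round((seconds - whole) * fps))  # sub-second frame index
    return f"{h:02d}:{m:02d}:{s:02d}:{f:02d}"

print(to_hmsf(75.5))    # -> "00:01:15:01"
print(to_hmsf(3725.0))  # -> "01:02:05:00"
```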

training recipe

(training recipe figure)
  • Given that the vision encoder has relatively fewer parameters and that we introduced window attention to further reduce its computational demands, we focused on balancing the computational load of the LLM across different GPUs.

  • Specifically, we dynamically packed data samples based on their corresponding input sequence lengths to the LLM, ensuring consistent computational loads. In the first and second phases, data were uniformly packed to a sequence length of 8,192, while in the third phase, the sequence length was increased to 32,768 to accommodate the model’s enhanced capacity for handling longer sequences.

  • rejection sampling for enhanced reasoning
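The dynamic packing described above can be sketched as simple first-fit bin packing by LLM input length (the concrete algorithm isn’t given in my notes, so this greedy variant is an assumption):

```python
def pack_samples(lengths, max_len=8192):
    """Greedily pack variable-length samples into fixed-budget sequences.

    Each bin collects sample indices whose total LLM input length stays
    <= max_len, so every packed sequence presents a similar compute load
    to each GPU. First-fit decreasing: place longest samples first.
    """
    bins = []  # each bin: [total_length, [sample indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        for b in bins:
            if b[0] + n <= max_len:
                b[0] += n
                b[1].append(idx)
                break
        else:
            bins.append([n, [idx]])
    return [indices for _, indices in bins]

lengths = [5000, 3000, 2500, 2000, 800]
print(pack_samples(lengths))  # -> [[0, 1], [2, 3, 4]]
```

In phase three the same routine would just run with max_len=32768.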

Post-training

  • SFT / DPO
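For reference, the DPO objective applied after SFT, for a single preference pair (beta=0.1 is a common default, not necessarily the paper’s setting):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    chosen-vs-rejected log-prob gap of the policy minus that of the
    frozen reference (SFT) model. Widening the margin lowers the loss."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

print(dpo_loss(0.0, 0.0, 0.0, 0.0))  # zero margin -> log 2 ≈ 0.693
```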

Performance

  • video bench

  • The videos are sampled at 2 fps, with an upper limit of 480 frames. So if sampling at 2 fps would give more than 480 frames, do they just stop sampling past that point?
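One plausible answer to the question above, as a sketch: instead of truncating after frame 480 (which would drop the tail of the video), fall back to uniform sampling of 480 frames over the full duration. The fallback behavior here is my guess, not something stated in the paper.

```python
def sample_frame_times(duration_sec, fps=2.0, max_frames=480):
    """Sample timestamps at `fps`; if that would exceed max_frames,
    fall back to max_frames uniformly spaced over the whole video."""
    n = int(duration_sec * fps)
    if n <= max_frames:
        return [i / fps for i in range(n)]
    step = duration_sec / max_frames
    return [i * step for i in range(max_frames)]

# 2 fps covers up to 240 s; a 10-minute video triggers the fallback:
print(len(sample_frame_times(600)))  # -> 480
print(len(sample_frame_times(120)))  # -> 240
```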

  • Probably MCQA

    • Video-MMMU
  • https://videommmu.github.io/ a bit more of an education/knowledge domain

    • art, humanities, medicine, business, science, engineering
  • Seems like it’s mostly educational video. Maybe MCQA

  • Similar to MMVU-Video

    • expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering
    • MVBench
  • video source: Perception Test, CLEVRER, MiT V1, STAR, Charades-STA – the source mix doesn’t look great

  • video duration: 0 min to 6 min – too short to be a convincing test

  • There are many video categories, such as people, news, science, advertisement, …

    • LongVideoBench
      • https://longvideobench.github.io/
      • The LongVideoBench highlights referred reasoning questions, which are dependent on long frame inputs and cannot be well-addressed by a single frame or a few sparse frames
  • Lots of unusual tasks, such as referring to people in specific frames (questions are relatively fine-grained)

  • num videos 3763, eval QAs 6678, avg duration 473 s – looks good with lots of long ones

  • avg 4101 sec / 1549 QAs

    • multi-domain (including sports domain)
  • temporal grounding, entity recognition, key information retrieval, reasoning, .. – seems relatively practical!

  • https://arxiv.org/pdf/2305.13786 / videos about stuck things / robotic data

  • close-ended task: NeedleQA, PlotQA, action order, action count - doesn’t seem that practical

  • open-ended task: video summarization, sub-scene captioning, anomaly recognition (fighting on CCTV)

  • Doesn’t seem practical: action, fine-grained action, attribute change, event order, direction

  • Looks more like low-level vision

  • Seems like the task is: given the caption, find the timestamps where it occurs.

  • Is there such a thing as a temporal grounding task?

  • spatial

    • (results table)
  • OCR related

    • (results table)
  • pure image

    • (results table)
  • pure text

    • (results table)