Image

paper, code, dataset

TL;DR

  • I read this because : fully open-source video language model.
  • task : video language model
  • problem : they want a fully open-source model, not one built on synthetic data distilled from closed models.
  • idea : a model built on top of several open-source models (almost all Meta models); similar motivation to Molmo.
  • input/output : (video, image, (optional) mask) + question -> answer
  • architecture : VE {PE L/14, PE G/14} + LLM {Llama3.2 1B-3B, Llama3.1 8B}
  • objective : ce loss (alignment, mid-training, SFT)
  • baseline : GPT4o, Gemini 1.5 Pro, Gemini 2.0 Flash, Qwen2VL, InternVL2.5, Qwen2.5VL, Llava-OV
  • data : pretrain 1M (from SA-1B + caption), mid-training 64.7M synthetic caption (LLaMa-3V-90B), SFT human-annotated 2.87M
  • evaluation : image bench, video bench
  • result : competitive performance.
  • contribution : fully open-source model; the data is released too!
  • etc. :
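The architecture line above (PE L/14 vision encoder + Llama LLM) implies how the input sequence is assembled: per-frame patch tokens concatenated with the question tokens. A minimal token-count sketch, assuming a 448px input resolution and 8 sampled frames (illustrative values, not from the paper; patch size 14 comes from "PE L/14"):

```python
# Sketch: how many tokens the LLM sees for (video + question).
# Patch size 14 follows from PE L/14; the 448px resolution and
# 8-frame sampling below are assumptions for illustration.

def visual_token_count(image_size: int = 448, patch_size: int = 14) -> int:
    """Number of patch tokens per frame for a ViT-style encoder."""
    per_side = image_size // patch_size
    return per_side * per_side

def sequence_length(num_frames: int, num_text_tokens: int,
                    image_size: int = 448, patch_size: int = 14) -> int:
    """Total LLM sequence length: frames * patches + question tokens."""
    return num_frames * visual_token_count(image_size, patch_size) + num_text_tokens

print(visual_token_count())    # 32 * 32 = 1024 patches per frame
print(sequence_length(8, 64))  # 8 * 1024 + 64 = 8256
```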

Details

  • thumbnail
Image
  • overview Image

data

  • overall
    • Image
  • details
    • Image
    • all training data[^1]

synthetic data pipeline (66.1M)

  • image data engine
    • images - natural images, documents
    • feed {caption, OCR, meta} to Llama -> caption, QA
  • video data
    • use https://www.scenedetect.com/ to extract ~30-second video clips, then {caption from Llama-3V, video caption from initial PLM, video meta (action, time tags)} – Llama3 –> caption, QA
  • scaling law
    • Image
  • Limitation of synthetic data
    • Image
    • ์–ด๋ ค์šด ๋ฌธ์ œ์— ๋Œ€ํ•œ scaling law๋Š” ๋šœ๋ ทํ•˜์ง€ ์•Š์Œ -> human annotated ๊ฐ€ ํ•„์š”ํ•˜๊ฒ ๋‹ค.

human-annotated high quality data

  • PLM-FGQA
    • fine-grained human activity
    • Image
  • PLM-STC
    • spatio-temporal
    • use SAM2 to build mask tubelets; one group of annotators is asked to find interesting, moving objects, then other annotators describe the object's motion/actions over the video's timeline.
  • video-region caption (522.7K / train 476.2K / rest in PLM-VideoBench)
    • RCap (194.2K): Given the video region and timestamps, the model generates a caption
    • RTLoc (194.2K): Given the video region and caption, the model localizes the action
    • RDCap (122.3K): Given the video region, the model generates dense, localized captions
  • Image
  • Fine-Grained Question Answering (FGQA) : fine-grained activity understanding (e.g., painting โ€œverticallyโ€ vs. โ€œhorizontallyโ€ in Fig. 6, first)
    • MBAcc (multi-binary accuracy)
      • 4,371 questions
  • Smart Glasses Question Answering (SGQA) :
    • answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device
    • LLM as a judge (Llama3.3 70B)
    • 665, human-annotated
  • Video Region Captioning (RCap)
    • LLM as a judge (Llama3.3 70B)
    • 10,060 human-annotated
  • Region Dense Video Captioning (RDCap)
    • model must generate a detailed description of all events involving a specific subject of interest (e.g., person, animal, or object)
    • must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible
    • 2,620 samples
    • SODA score (SODA: story-oriented dense video captioning evaluation framework)
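The RDCap output contract above (tuples that tile the whole video duration, including periods when the subject is not visible) can be checked with a small validator; the helper name and tolerance are assumptions:

```python
# Sketch of the RDCap output contract: a prediction is a sequence of
# (start, end, caption) tuples that must tile the full video duration
# with no gaps or overlaps (invisible periods get their own segment).
# Function name and epsilon tolerance are assumptions.

def covers_duration(segments: list[tuple[float, float, str]],
                    duration: float, eps: float = 1e-6) -> bool:
    segs = sorted(segments)
    if not segs or abs(segs[0][0]) > eps:
        return False  # must start at t=0
    t = segs[0][0]
    for start, end, _ in segs:
        if abs(start - t) > eps or end <= start:
            return False  # gap, overlap, or empty segment
        t = end
    return abs(t - duration) <= eps  # must end at the video's end

print(covers_duration(
    [(0, 4, "dog runs"), (4, 9, "dog not visible"), (9, 12, "dog digs")],
    12.0))  # True: segments tile [0, 12]
print(covers_duration(
    [(0, 4, "dog runs"), (5, 12, "dog digs")],
    12.0))  # False: gap between 4 and 5
```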

Results

benchmarks

  • PLM-VideoBench
    • Image
    • GPT-4o๊ฐ€
  • video bench
  • image bench Image
    • ๋‹ค๋ฅธ ์˜คํ”ˆ์†Œ์Šค๋Š” ๋น„์Šทํ•œ๋ฐ MMMU๊ฐ€ ๋งŽ์ด ์ฐจ์ด๋‚˜๋Š”๊ตฐ ใ…‹ใ…‹
    • RealWorldQA
  • Ablation studies
    • Image
  • Long video bench
    • Image

[^1] Image