TL;DR
- I read this because : it is a fully open-source video language model.
- task : video language model
- problem : existing VLMs rely on synthetic data distilled from closed models; the authors want to build one that is fully open source.
- idea : a model built on top of several open-source components (mostly Meta models); similar motivation to Molmo.
- input/output : (video, image, (optional) mask) + question -> answer
- architecture : VE {PE L/14, PE G/14} + LLM {Llama3.2 1B-3B, Llama3.1 8B}
- objective : CE loss (alignment, mid-training, SFT)
- baseline : GPT4o, Gemini 1.5 Pro, Gemini 2.0 Flash, Qwen2VL, InternVL2.5, Qwen2.5-VL, LLaVA-OV
- data : pretrain 1M (from SA-1B + caption), mid-training 64.7M synthetic caption (LLaMa-3V-90B), SFT human-annotated 2.87M
- evaluation : image bench, video bench
- result : solid performance across image and video benchmarks
- contribution : fully open-source model; the data is released too!
- etc. :
Details
- thumbnail
- overview
data
- overall
- details
- all training data[^1]
synthetic data pipeline (66.1M)
- image data engine
- images: natural images, documents
- given {caption, OCR, meta}, a Llama model generates captions and QA pairs
- video data
- uses https://www.scenedetect.com/ to extract 30-second video clips; {caption from Llama-3V, video caption from an initial PLM, video meta (action, time tags)} are fed to Llama 3 to produce captions and QA
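The clip-extraction step above can be sketched as follows; a minimal sketch, assuming shot boundaries already come from PySceneDetect (here hard-coded for illustration) and that longer shots are simply chopped into 30-second windows:

```python
# Sketch of the 30-second clip-extraction step. Assumption: shot boundaries
# (start, end) in seconds come from a shot detector such as PySceneDetect;
# the windowing policy below is an illustration, not the paper's exact code.

MAX_CLIP_SEC = 30.0

def split_into_clips(shot_boundaries):
    """Turn [(start, end), ...] shots into clips no longer than MAX_CLIP_SEC."""
    clips = []
    for start, end in shot_boundaries:
        t = start
        while t < end:
            clips.append((t, min(t + MAX_CLIP_SEC, end)))
            t += MAX_CLIP_SEC
    return clips

if __name__ == "__main__":
    # e.g. a 75 s shot followed by a 20 s shot
    shots = [(0.0, 75.0), (75.0, 95.0)]
    print(split_into_clips(shots))
    # -> [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0), (75.0, 95.0)]
```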
- scaling law
- Limitation of synthetic data
- the scaling law is not clear-cut on hard problems -> human-annotated data is needed
human-annotated high quality data
- PLM-FGQA
- fine-grained human activity
- PLM-STC
- spatio-temporal
- SAM2 is used to build mask tubelets; annotators are first asked to find interesting, moving objects, then other annotators write about the object's motion within the action over the video's timeline.
- video-region captions (522.7K / train 476.2K / the rest go to PLM-VideoBench)
- RCap (194.2K): Given the video region and timestamps, the model generates a caption;
- RTLoc (194.2K): Given the video region and caption, the model localizes the action; and
- RDCap (122.3K): Given the video region, the model generates dense, localized captions.
- Fine-Grained Question Answering (FGQA) : fine-grained activity understanding (e.g., painting "vertically" vs. "horizontally" in Fig. 6, first)
- MBAcc
- 4,371 questions
- Smart Glasses Question Answering (SGQA) :
- answers open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device
- LLM as a judge (Llama3.3 70B)
- 665, human annotated
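The LLM-as-a-judge protocol above can be sketched as a prompt builder plus a score parser; a minimal sketch, where the prompt wording and the 1-5 scale are my assumptions (the notes only say Llama-3.3 70B is the judge):

```python
import re

# Hedged sketch of LLM-as-a-judge scoring for open-ended SGQA answers.
# Assumptions: the judge replies with "Score: N" on a 1-5 scale; the actual
# prompt used with Llama-3.3 70B is not specified in the notes.

JUDGE_TEMPLATE = (
    "You are grading a model answer against a human reference.\n"
    "Question: {question}\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Reply with 'Score: N' where N is an integer from 1 (wrong) to 5 (perfect)."
)

def build_judge_prompt(question, reference, answer):
    """Fill the judge template with one QA pair to grade."""
    return JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)

def parse_score(judge_reply):
    """Extract the integer score from the judge's reply, or None if absent."""
    m = re.search(r"Score:\s*(\d)", judge_reply)
    return int(m.group(1)) if m else None

print(parse_score("Score: 4 -- mostly correct"))  # -> 4
```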
- Video Region Captioning (RCap).
- LLM as a judge (Llama3.3 70B)
- 10,060 human annotated
- Region Dense Video Captioning (RDCap).
- model must generate a detailed description of all events involving a specific subject of interest (e.g., person, animal, or object)
- must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible
- 2620 samples
- SODA score ("SODA: Story Oriented Dense Video Captioning Evaluation Framework")
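The RDCap output contract above (a sequence of (start, end, caption) tuples covering the whole video, including off-screen periods) can be checked with a small tiling test; a minimal sketch, where the example captions and the "subject not visible" wording are illustrative assumptions:

```python
# Sketch of the RDCap output format: sorted (start, end, caption) tuples must
# tile [0, duration] with no gaps or overlaps, with filler segments covering
# periods when the subject is not visible (segment wording is an assumption).

def covers_video(segments, duration):
    """Check that (start, end, caption) tuples tile [0, duration] exactly."""
    segs = sorted(segments)
    if not segs or segs[0][0] != 0 or segs[-1][1] != duration:
        return False
    # each segment must start exactly where the previous one ended
    return all(prev[1] == cur[0] for prev, cur in zip(segs, segs[1:]))

dense = [
    (0, 4, "a dog enters the yard"),
    (4, 9, "subject not visible"),
    (9, 15, "the dog digs near the fence"),
]
print(covers_video(dense, 15))  # -> True
```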
Results
benchmarks
- PLM-VideoBench
- compared against GPT-4o
- video bench
- the gap on VideoMME is quite large
- Charades-STA
- image bench
- similar to other open-source models overall, but MMMU shows a large gap
- RealWorldQA
- basic real-world spatial understanding capabilities of multimodal models
- 765 images, each with a question and an easily verifiable answer; anonymized images taken from vehicles
- https://huggingface.co/datasets/nirajandhakal/realworldqa
- Ablation studies
- Long video bench
[^1]