Image

paper, code, dataset

TL;DR

  • I read this because : fully open-source video language model.
  • task : video language model
  • problem : they want a fully open-source model, not one built on synthetic data distilled from closed models.
  • idea : a model built on top of several open-source models (almost all Meta models); similar motivation to Molmo.
  • input/output : (video, image, (optional) mask) + question -> answer
  • architecture : VE {PE L/14, PE G/14} + LLM {Llama3.2 1B-3B, Llama3.1 8B}
  • objective : ce loss (alignment, mid-training, SFT)
  • baseline : GPT4o, Gemini 1.5 Pro, Gemini 2.0 Flash, Qwen2VL, InternVL2.5, Qwen2.5VL, Llava-OV
  • data : pretrain 1M (from SA-1B + caption), mid-training 64.7M synthetic caption (LLaMa-3V-90B), SFT human-annotated 2.87M
  • evaluation : image bench, video bench
  • result : competitive performance.
  • contribution : fully open-source model; the data is released too!
  • etc. :
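The architecture line above (PE L/14 vision encoder + Llama LLM) implies how the input sequence is assembled: per-frame patch tokens concatenated with the question tokens. A minimal token-count sketch, assuming a 448px input resolution and 8 sampled frames (illustrative values, not from the paper; patch size 14 comes from "PE L/14"):

```python
# Sketch: how many tokens the LLM sees for (video + question).
# Patch size 14 follows from PE L/14; the 448px resolution and
# 8-frame sampling below are assumptions for illustration.

def visual_token_count(image_size: int = 448, patch_size: int = 14) -> int:
    """Number of patch tokens per frame for a ViT-style encoder."""
    per_side = image_size // patch_size
    return per_side * per_side

def sequence_length(num_frames: int, num_text_tokens: int,
                    image_size: int = 448, patch_size: int = 14) -> int:
    """Total LLM sequence length: frames * patches + question tokens."""
    return num_frames * visual_token_count(image_size, patch_size) + num_text_tokens

print(visual_token_count())    # 32 * 32 = 1024 patches per frame
print(sequence_length(8, 64))  # 8 * 1024 + 64 = 8256
```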

Details

  • thumbnail
Image
  • overview Image

data

  • overall
    • Image
  • details
    • Image
    • all training data[^1]

synthetic data pipeline (66.1M)

  • image data engine
    • images - natural images, documents
    • feed {caption, OCR, meta} to Llama -> caption, QA
  • video data
    • use https://www.scenedetect.com/ to extract ~30-second video clips, then {caption from Llama-3V, video caption from initial PLM, video meta (action, time tags)} – Llama3 –> caption, QA
  • scaling law
    • Image
  • Limitation of synthetic data
    • Image
    • ์–ด๋ ค์šด ๋ฌธ์ œ์— ๋Œ€ํ•œ scaling law๋Š” ๋šœ๋ ทํ•˜์ง€ ์•Š์Œ -> human annotated ๊ฐ€ ํ•„์š”ํ•˜๊ฒ ๋‹ค.

human-annotated high quality data

  • PLM-FGQA
    • fine-grained human activity
    • Image
  • PLM-STC
    • spatio-temporal
    • use SAM2 to build mask tubelets; one group of annotators is asked to find interesting, moving objects, then other annotators describe the object's motion/actions over the video's timeline.
  • video-region caption (522.7K / train 476.2K / rest in PLM-VideoBench)
    • RCap (194.2K): Given the video region and timestamps, the model generates a caption
    • RTLoc (194.2K): Given the video region and caption, the model localizes the action
    • RDCap (122.3K): Given the video region, the model generates dense, localized captions
  • Image
  • Fine-Grained Question Answering (FGQA) : fine-grained activity understanding (e.g., painting โ€œverticallyโ€ vs. โ€œhorizontallyโ€ in Fig. 6, first)
    • MBAcc (multi-binary accuracy)
      • 4,371 questions
  • Smart Glasses Question Answering (SGQA) :
    • answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device
    • LLM as a judge (Llama3.3 70B)
    • 665, human-annotated
  • Video Region Captioning (RCap)
    • LLM as a judge (Llama3.3 70B)
    • 10,060 human-annotated
  • Region Dense Video Captioning (RDCap)
    • model must generate a detailed description of all events involving a specific subject of interest (e.g., person, animal, or object)
    • must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible
    • 2,620 samples
    • SODA score (SODA: story-oriented dense video captioning evaluation framework)
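The RDCap output contract above (tuples that tile the whole video duration, including periods when the subject is not visible) can be checked with a small validator; the helper name and tolerance are assumptions:

```python
# Sketch of the RDCap output contract: a prediction is a sequence of
# (start, end, caption) tuples that must tile the full video duration
# with no gaps or overlaps (invisible periods get their own segment).
# Function name and epsilon tolerance are assumptions.

def covers_duration(segments: list[tuple[float, float, str]],
                    duration: float, eps: float = 1e-6) -> bool:
    segs = sorted(segments)
    if not segs or abs(segs[0][0]) > eps:
        return False  # must start at t=0
    t = segs[0][0]
    for start, end, _ in segs:
        if abs(start - t) > eps or end <= start:
            return False  # gap, overlap, or empty segment
        t = end
    return abs(t - duration) <= eps  # must end at the video's end

print(covers_duration(
    [(0, 4, "dog runs"), (4, 9, "dog not visible"), (9, 12, "dog digs")],
    12.0))  # True: segments tile [0, 12]
print(covers_duration(
    [(0, 4, "dog runs"), (5, 12, "dog digs")],
    12.0))  # False: gap between 4 and 5
```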

Results

benchmarks

  • PLM-VideoBench
    • Image
    • GPT-4o๊ฐ€
  • video bench
  • image bench Image
    • ๋‹ค๋ฅธ ์˜คํ”ˆ์†Œ์Šค๋Š” ๋น„์Šทํ•œ๋ฐ MMMU๊ฐ€ ๋งŽ์ด ์ฐจ์ด๋‚˜๋Š”๊ตฐ ใ…‹ใ…‹
    • RealWorldQA
  • Ablation studies
    • Image
  • Long video bench
    • Image

[^1] Image