TL;DR
- I read this because.. : recent LVLM with strong performance
- task : LVLM
- problem : efficient LVLM
- idea : uses the [sos] token's representation to temporarily widen, then shrink, the intermediate latent dimension (Phantom Dimension); proposes a DPO-like Phantom Optimization
- input/output : image, question -> answer
- architecture : VE(Intern-ViT 300M), Projector MLP, LLM(Qwen2-0.5B, InternLM2-1.8B, Phi-3-mini-3.8B, InternLM2.5-7B)
- objective : SFT loss + SimPO
- baseline : closed and open LVLM models
- data : ShareGPT4o-Images(57K), ShareGPT4V(755K), ALLaVA-VFLAN/Text(548K), MiniGemini(DocVQA, ChartQA, DVQA, AI2D), Science and Mathematical Reasoning(SMR – Arxiv-QA, TextBookQA), GLLaVA, MathVision, MathInstruct, MathPlus
- evaluation : Science QA, AI2D, ChartQA, SEED, POPE, HallB, MME, MathVista, MMB, MM-Vet, LLaVA-w
- result : strong performance among models of similar scale
- contribution :
- etc. :
Details
proposed
Phantom Dimension
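A minimal numpy sketch of how I understand the idea (all names here are my own, not the paper's: the paper injects the [sos] representation inside MHSA, which I simplify to a concat-then-project around the token sequence):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8        # LLM hidden dimension
seq_len = 5  # number of tokens

# hypothetical inputs: token hidden states and the [sos] token's representation
H = rng.standard_normal((seq_len, d))
sos = rng.standard_normal(d)

# Phantom Dimension (simplified): temporarily widen the latent dimension
# by appending the [sos]-derived vector to every token's hidden state...
H_wide = np.concatenate([H, np.tile(sos, (seq_len, 1))], axis=1)  # (seq_len, 2d)

# ...then project back down to the original hidden dimension,
# so the rest of the LLM sees the usual width
W_down = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
H_out = H_wide @ W_down  # (seq_len, d)

print(H_wide.shape, H_out.shape)  # (5, 16) (5, 8)
```

The point is that the extra capacity exists only transiently, so parameter count and the downstream hidden size stay small.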
Phantom Optimization
Seems essentially identical to the SimPO objective?
{question, chosen, rejected} triplets are generated with GPT-4o-mini, then validated with GPT-4o
result
ChartQA
data links
https://github.com/ByungKwanLee/Phantom/tree/master?tab=readme-ov-file#-download-training-datasets