image

paper, code, dataset

TL;DR

  • I read this because : latest LVLM with high reported performance
  • task : LVLM
  • problem : efficient LVLM
  • IDEA : temporarily enlarge and then shrink the intermediate (latent) dimension using the representation of the [sos] token (Phantom Dimension); a DPO-like preference objective, Phantom Optimization
  • input/output : image, question -> answer
  • architecture : VE(Intern-ViT 300M), Projector MLP, LLM(Qwen2-0.5B, InternLM2-1.8B, Phi-3-mini-3.8B, InternLM2.5-7B)
  • objective : SFT loss + SimPO
  • baseline : closed and open LVLM models
  • data : ShareGPT4o-Images(57K), ShareGPT4V(755K), ALLaVA-VFLAN/Text(548K), MiniGemini(DocVQA, ChartQA, DVQA, AI2D), Science and Mathematical Reasoning(SMR – Arxiv-QA, TextBookQA), GLLaVA, MathVision, MathInstruct, MathPlus
  • evaluation : Science QA, AI2D, ChartQA, SEED, POPE, HallB, MME, MathVista, MMB, MM-Vet, LLaVA-w
  • result : Good performance among similarly scaled models
  • contribution :
  • etc. :
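The architecture bullet above (vision encoder → MLP projector → LLM) can be sketched as a minimal forward pass. This is an illustrative skeleton, not the authors' code: all module names, dimensions, and stand-in layers (a `Linear` for InternViT, a tiny `TransformerEncoder` for the LLM) are my assumptions.

```python
import torch
import torch.nn as nn

class TinyLVLM(nn.Module):
    """Illustrative LVLM skeleton: vision encoder -> MLP projector -> LLM.
    Every module is a toy stand-in (e.g. for InternViT-300M / Qwen2-0.5B)."""
    def __init__(self, vis_dim=64, llm_dim=128, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)   # stand-in vision encoder
        self.projector = nn.Sequential(                     # the MLP projector
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.embed = nn.Embedding(vocab, llm_dim)           # LLM token embeddings
        self.llm = nn.TransformerEncoder(                   # stand-in LLM backbone
            nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, image_feats, question_ids):
        vis = self.projector(self.vision_encoder(image_feats))  # (B, Nv, llm_dim)
        txt = self.embed(question_ids)                          # (B, Nt, llm_dim)
        seq = torch.cat([vis, txt], dim=1)                      # prepend image tokens
        return self.lm_head(self.llm(seq))                      # next-token logits

model = TinyLVLM()
logits = model(torch.randn(2, 8, 64), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 13, 1000])
```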

Details

proposed

  • Phantom Dimension image

  • Phantom Optimization image
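My rough reading of Phantom Dimension in code: the [sos] token's hidden state conditions a temporary up-projection of the latent dimension, which is then projected back down. The gating mechanism, layer names, and dimensions here are my assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PhantomDimensionBlock(nn.Module):
    """Sketch: temporarily widen the hidden dimension, modulated by the
    [sos] token's representation, then shrink back (assumed mechanism)."""
    def __init__(self, d=128, d_phantom=256):
        super().__init__()
        self.up = nn.Linear(d, d_phantom)     # enlarge the latent dimension
        self.gate = nn.Linear(d, d_phantom)   # gate derived from the [sos] state
        self.down = nn.Linear(d_phantom, d)   # shrink back to the original dim

    def forward(self, hidden, sos_idx=0):
        sos = hidden[:, sos_idx]                        # (B, d): [sos] representation
        g = torch.sigmoid(self.gate(sos)).unsqueeze(1)  # (B, 1, d_phantom)
        wide = self.up(hidden) * g                      # widen + modulate by [sos]
        return hidden + self.down(wide)                 # residual at the original dim

block = PhantomDimensionBlock()
x = torch.randn(2, 10, 128)
print(block(x).shape)  # torch.Size([2, 10, 128])
```

The point of the widen-then-shrink shape is that extra capacity is used only transiently, so the model's interface dimension (and parameter count downstream) stays unchanged.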

Is this the same as the SimPO objective?

image
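For comparison, the SimPO objective (from the SimPO paper, not this one) is a reference-free preference loss on length-normalized sequence log-probabilities: −log σ(β·logπ(y_w|x)/|y_w| − β·logπ(y_l|x)/|y_l| − γ). A minimal sketch, with illustrative β/γ values:

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: reference-free preference loss on length-normalized
    sequence log-probs (beta/gamma here are illustrative defaults)."""
    r_w = beta * logp_chosen / len_chosen       # avg log-prob reward, chosen
    r_l = beta * logp_rejected / len_rejected   # avg log-prob reward, rejected
    return -F.logsigmoid(r_w - r_l - gamma).mean()

loss = simpo_loss(torch.tensor([-10.0]), torch.tensor([-30.0]),
                  torch.tensor(5.0), torch.tensor(6.0))
print(float(loss))  # small positive value: chosen is clearly preferred
```

Unlike DPO, there is no reference policy term; the length normalization and the target margin γ are what make the objective "simple".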

Create {question, chosen, rejected} triplets with GPT-4o-mini and validate them with GPT-4o, e.g. image

image
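The triplet pipeline above could be sketched like this. The model names come from the post, but the `generate`/`validate` callables and prompts below are placeholders of my own, not the authors' code or any real API:

```python
from dataclasses import dataclass

@dataclass
class PreferenceTriplet:
    question: str
    chosen: str      # preferred answer (kept only if the validator accepts it)
    rejected: str    # dispreferred answer

def build_triplet(question, generate, validate):
    """generate: callable standing in for GPT-4o-mini; validate: callable
    standing in for the GPT-4o check. Both are assumptions for illustration."""
    chosen = generate(f"Answer correctly: {question}")
    rejected = generate(f"Answer with a plausible but wrong detail: {question}")
    if not validate(question, chosen, rejected):
        return None  # discard triplets the validator rejects
    return PreferenceTriplet(question, chosen, rejected)

# toy stand-ins so the sketch runs end to end
triplet = build_triplet(
    "What is in the chart?",
    generate=lambda prompt: f"answer<{prompt[:12]}>",
    validate=lambda q, c, r: c != r)
print(triplet is not None)  # True
```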

result

image image

ChartQA

https://github.com/ByungKwanLee/Phantom/tree/master?tab=readme-ov-file#-download-training-datasets