
paper, code, dataset

TL;DR

  • I read this because: recommended by
  • task: reasoning in LVLMs
  • problem: we want LVLMs to produce longer reasoning traces, like GPT-o1
  • idea: build the data and train on it. Decompose answers into stages, then run a beam search over each stage at inference
  • architecture: Llama-3.2V
  • objective: CE loss (SFT)
  • baseline: Llama-3.2V
  • data: LLaVA-CoT-100k (proposed)
  • evaluation: MMStar, MMBench, MM-Vet, MathVista, AI2D
  • result: improved performance
  • contribution: dataset release

Details

  • thumbnail image

  • inference examples image

  • How to structure your answers image

Responses are generated with GPT-4o and then filtered: outputs whose tag structure mismatches are dropped, and GPT-4o also checks the content inside the <summary>, </summary> tags against the GT answer to verify it is a good answer. image
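The structural part of that filtering could look like the sketch below: check that every stage tag pair appears, in order. This is my own guess at the check, assuming the four stage tags used in the paper (summary, caption, reasoning, conclusion); the exact tag set and rules are assumptions.

```python
import re

# Assumed stage tags, in the order they should appear in a response.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def has_valid_structure(response: str) -> bool:
    """Return True if each <stage>...</stage> pair appears once, in order."""
    pos = 0
    for tag in STAGES:
        m = re.search(rf"<{tag}>.*?</{tag}>", response, flags=re.DOTALL)
        if m is None or m.start() < pos:
            return False  # missing tag, or tags out of order
        pos = m.end()
    return True

good = ("<summary>s</summary><caption>c</caption>"
        "<reasoning>r</reasoning><conclusion>a</conclusion>")
bad = "<summary>s</summary><conclusion>a</conclusion>"
```

Responses that fail this check would be regenerated or dropped; the content-level check (comparing the summary against the GT answer) is a separate GPT-4o call.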

  • source datasets used for generation image

The data sources overlap with https://github.com/long8v/PTIR/issues/203 image

  • run a beam search over each stage image

I’m not sure this should be called “beam search”; it looks like it uses an external verifier. What prompt was used? I couldn’t find which model serves as the verifier. image
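My reading of the stage-level search, sketched below: at each stage, sample N candidate continuations and commit the one the verifier scores highest before moving on. `generate_stage` and `verifier_score` are hypothetical stand-ins for the LVLM and the (unspecified) verifier model.

```python
import random

def generate_stage(context: str, stage: str, seed: int) -> str:
    # stand-in for sampling one stage continuation from the LVLM
    random.seed(hash((context, stage, seed)) % (2**32))
    return f"<{stage}>candidate-{random.randint(0, 99)}</{stage}>"

def verifier_score(context: str, candidate: str) -> float:
    # stand-in: a real system would query a judge/verifier model here
    return random.random()

def stage_level_search(question: str, stages, n: int = 4) -> str:
    context = question
    for stage in stages:
        candidates = [generate_stage(context, stage, i) for i in range(n)]
        best = max(candidates, key=lambda c: verifier_score(context, c))
        context += best  # commit the best candidate before the next stage
    return context

out = stage_level_search(
    "Q: ...", ["summary", "caption", "reasoning", "conclusion"]
)
```

With beam width 1 per stage this is effectively stage-wise best-of-N, which is why the “beam search” naming reads oddly to me.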

  • Training hparam image

Result

image

They curated their own set of “reasoning benchmarks”. *direct training* is the model further SFTed on the original VQA sets; *w/o structured tags* trains without the tags like <summary>. MMStar, MM-Vet, and MathVista improve, while AI2D does better when trained to answer directly.

image

In the MMStar breakdown, the reasoning-related categories (math, science, etc.) go up; perception does not, but the drop is insignificant.

  • stage-level beam search image

How did they run BoN without mentioning any reward-model training? image
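My guess is the BoN baseline reuses a judge model’s preference instead of a learned RM. A minimal sketch, where `judge_prefers` is a hypothetical stand-in (here a toy length heuristic, not the paper’s actual criterion):

```python
def judge_prefers(question: str, a: str, b: str) -> bool:
    # stand-in heuristic: prefer the longer, more detailed answer;
    # a real system would ask a judge model to compare a and b
    return len(a) >= len(b)

def best_of_n(question: str, candidates: list[str]) -> str:
    """Pairwise-compare candidates and keep the judge's preferred one."""
    best = candidates[0]
    for cand in candidates[1:]:
        if judge_prefers(question, cand, best):
            best = cand
    return best

winner = best_of_n("Q", ["short", "a much longer answer", "mid answer"])
```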

  • comparison with other models image