
paper, code, dataset

TL;DR

  • I read this because: recommended by
  • task: reasoning in LVLMs
  • problem: we want LVLMs to produce longer reasoning traces, like GPT-o1
  • idea: build the data and train on it. Decompose answers into stages, then run a beam search over each stage at inference
  • architecture: Llama-3.2V
  • objective: CE loss (SFT)
  • baseline: Llama-3.2V
  • data: LLaVA-CoT-100k (proposed)
  • evaluation: MMStar, MMBench, MM-Vet, MathVista, AI2D
  • result: improved performance
  • contribution: dataset release

Details

  • thumbnail image

  • inference examples image

  • How to structure your answers image

Responses are generated with GPT-4o and then filtered: outputs whose tag structure mismatches are dropped, and GPT-4o also checks the content inside the <summary>, </summary> tags against the GT answer to verify it is a good answer. image
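The structural part of that filtering could look like the sketch below: check that every stage tag pair appears, in order. This is my own guess at the check, assuming the four stage tags used in the paper (summary, caption, reasoning, conclusion); the exact tag set and rules are assumptions.

```python
import re

# Assumed stage tags, in the order they should appear in a response.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def has_valid_structure(response: str) -> bool:
    """Return True if each <stage>...</stage> pair appears once, in order."""
    pos = 0
    for tag in STAGES:
        m = re.search(rf"<{tag}>.*?</{tag}>", response, flags=re.DOTALL)
        if m is None or m.start() < pos:
            return False  # missing tag, or tags out of order
        pos = m.end()
    return True

good = ("<summary>s</summary><caption>c</caption>"
        "<reasoning>r</reasoning><conclusion>a</conclusion>")
bad = "<summary>s</summary><conclusion>a</conclusion>"
```

Responses that fail this check would be regenerated or dropped; the content-level check (comparing the summary against the GT answer) is a separate GPT-4o call.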

  • source datasets used for generation image

The data sources overlap with https://github.com/long8v/PTIR/issues/203 image

  • run a beam search over each stage image

I’m not sure this should be called “beam search”; it looks like it uses an external verifier. What prompt was used? I couldn’t find which model serves as the verifier. image
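My reading of the stage-level search, sketched below: at each stage, sample N candidate continuations and commit the one the verifier scores highest before moving on. `generate_stage` and `verifier_score` are hypothetical stand-ins for the LVLM and the (unspecified) verifier model.

```python
import random

def generate_stage(context: str, stage: str, seed: int) -> str:
    # stand-in for sampling one stage continuation from the LVLM
    random.seed(hash((context, stage, seed)) % (2**32))
    return f"<{stage}>candidate-{random.randint(0, 99)}</{stage}>"

def verifier_score(context: str, candidate: str) -> float:
    # stand-in: a real system would query a judge/verifier model here
    return random.random()

def stage_level_search(question: str, stages, n: int = 4) -> str:
    context = question
    for stage in stages:
        candidates = [generate_stage(context, stage, i) for i in range(n)]
        best = max(candidates, key=lambda c: verifier_score(context, c))
        context += best  # commit the best candidate before the next stage
    return context

out = stage_level_search(
    "Q: ...", ["summary", "caption", "reasoning", "conclusion"]
)
```

With beam width 1 per stage this is effectively stage-wise best-of-N, which is why the “beam search” naming reads oddly to me.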

  • Training hparam image

Result

image

They curated their own set of “reasoning benchmarks”. *direct training* is the model further SFTed on the original VQA sets; *w/o structured tags* trains without the tags like <summary>. MMStar, MM-Vet, and MathVista improve, while AI2D does better when trained to answer directly.

image

In the MMStar breakdown, the reasoning-related categories (math, science, etc.) go up; perception does not, but the drop is insignificant.

  • stage-level beam search image

How did they run BoN without mentioning any reward-model training? image
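My guess is the BoN baseline reuses a judge model’s preference instead of a learned RM. A minimal sketch, where `judge_prefers` is a hypothetical stand-in (here a toy length heuristic, not the paper’s actual criterion):

```python
def judge_prefers(question: str, a: str, b: str) -> bool:
    # stand-in heuristic: prefer the longer, more detailed answer;
    # a real system would ask a judge model to compare a and b
    return len(a) >= len(b)

def best_of_n(question: str, candidates: list[str]) -> str:
    """Pairwise-compare candidates and keep the judge's preferred one."""
    best = candidates[0]
    for cand in candidates[1:]:
        if judge_prefers(question, cand, best):
            best = cand
    return best

winner = best_of_n("Q", ["short", "a much longer answer", "mid answer"])
```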

  • comparison with other models image