TL;DR
- I read this because : it was recommended
- task : reasoning in LVLM
- problem : we want the LVLM to produce longer reasoning chains, like GPT-o1
- idea : build structured reasoning data and train on it; break the answer into stages, then run a beam search over each stage
- architecture : Llama 3.2V
- objective : CE loss (additional SFT on top of an already-SFTed model)
- baseline : Llama 3.2V
- data : Llava-CoT-100k (proposed)
- evaluation : mmstar, mmbench, mmvet, mathvista, ai2d
- result : performance improves on most reasoning benchmarks
- contribution : release of the Llava-CoT-100k dataset
Details
thumbnail
inference examples
How the answers are structured
GPT-4o generates the answers, and any answer whose structure doesn't match the expected format is filtered out.
GPT-4o also checks the content inside the <summary>…</summary> tags against the GT answer and filters out answers that don't hold up.
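The structural filtering step could be sketched as below. The note only mentions the <summary> tag; the other stage names are my assumption about the paper's format, and the check itself is a plausible implementation, not the authors' code.

```python
import re

# Hypothetical stage tags: only <summary> is confirmed by the note;
# the rest are assumed stages of the structured answer.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def has_valid_structure(answer: str) -> bool:
    """Return True if every stage tag pair appears, in order."""
    pos = 0
    for tag in STAGES:
        # Non-greedy match of one <tag>...</tag> pair (may span lines).
        m = re.search(rf"<{tag}>.*?</{tag}>", answer, flags=re.DOTALL)
        if m is None or m.start() < pos:
            return False  # tag missing or out of order
        pos = m.end()
    return True
```

Answers that fail this check would be regenerated or dropped; the <summary>-vs-GT check is a separate, model-based (GPT-4o) pass.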
- Generated image source
https://github.com/long8v/PTIR/issues/203
overlap source with this
- run a beam search over each structural stage
I don't see why this counts as "beam search"; it looks like candidates are scored by an external verifier.
What prompt is used for the verifier? I couldn't find which model is used.
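As I understand it, the stage-level search could look like the sketch below. The `generate_stage` and `verify` callables are hypothetical stand-ins (the paper's verifier model and candidate count are not stated in these notes), and keeping only one candidate per stage makes this effectively per-stage Best-of-N.

```python
from typing import Callable, List

def stage_level_search(
    question: str,
    stages: List[str],
    generate_stage: Callable[[str, str], str],  # (question, prefix) -> candidate text
    verify: Callable[[str, str], float],        # (question, partial answer) -> score
    n_candidates: int = 4,
) -> str:
    """Greedy stage-level search: sample N candidates per stage,
    keep the one the verifier scores highest, then move on."""
    answer = ""
    for _ in stages:
        candidates = [generate_stage(question, answer) for _ in range(n_candidates)]
        # Beam width 1: keep only the best-scoring continuation.
        best = max(candidates, key=lambda c: verify(question, answer + c))
        answer += best
    return answer
```

With a beam width above 1 this would become a true beam search, which may be why the paper uses that name.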
- training hyperparameters
Result
The authors picked their own set of "reasoning benchmarks".
direct training means SFTing directly on the original VQA set's answers; w/o structured tags means training without tags like <summary>.
mmstar, mmvet, and mathvista improve; on ai2d, direct training on the plain answers does better.
In the mmstar breakdown, reasoning-related categories (math, science, etc.) go up; perception does not, but the gap is insignificant.
- stage level beam search
How is Best-of-N done here without any mention of training a reward model?
- comparison with other models