TL;DR
- Why I read this: to improve MathVista performance
- Task: LVLM (math reasoning)
- Problem: existing math LVLMs have drawbacks: G-LLaVA is limited to geometric reasoning, and Math-LLaVA has limited CoT capability.
- Idea: build a dataset spanning multiple math disciplines, with CoT annotations
- Architecture: LLaVA (CLIP-ViT-Large, DeepSeekMath-RL)
- Objective: CE loss + PPO loss
- Baselines: closed LLMs, LLMs, math LLMs, open-source MLLMs (G-LLaVA-7B, Math-LLaVA-13B, LLaVA-1.5-7B, LLaVA-NeXT-34B)
- Data: (align) LLaVA-Pretrain + Geo170K-align; (instruct) LLaVA-Instruct; (math instruct) MultiMath300K-instruction, Geo170K-qa, MathV360K; (PPO) MultiMath300K-val, GSM8K-train, MATH-train, CMATH-train
- Evaluation: MathVista, MathVerse, GSM8K, MATH, CMATH, GaoKao
- Result: best MathVista and MathVerse performance among open-source models, and the best text math benchmark scores of any open-source MLLM.
- Contribution: both a new dataset and high performance on text and vision math benchmarks
- Etc.: the findings may be unsurprising, but the analysis is interesting
Details
Thumbnail
Proposed MultiMath-300K
- Images are newly collected under their own license (source: http://test.xuekubao.com/)
- Not just QA; images are also captioned
- Covers geometry problem solving, automatic theorem proving, and mathematical word problems
- Claimed to be English/Chinese, but it seems to be almost entirely Chinese…?
- Also covers CoT
Collection Method
- Round 1: generate step-by-step reasoning chains with GPT-4o, using the original data as a hint
- Round 2: check whether the GPT-4o chain is consistent with the standard answer; if not, revise the reasoning steps
- Round 3: compare the GPT-4o answer with the standard answer and keep only the correct samples
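The three rounds above can be sketched as a small filtering loop. This is a minimal sketch under my own assumptions: `call_gpt4o`, `extract_final_answer`, and the prompts are hypothetical stand-ins (here `call_gpt4o` is a stub returning a canned chain), not the paper's actual pipeline.

```python
def call_gpt4o(prompt: str) -> str:
    """Stub standing in for a GPT-4o API call; returns a canned CoT."""
    return "Step 1: ... Step 2: ... Answer: 42"

def extract_final_answer(cot: str) -> str:
    """Pull the text after the last 'Answer:' marker."""
    return cot.rsplit("Answer:", 1)[-1].strip()

def collect_cot(question: str, standard_answer: str, max_revisions: int = 1):
    # Round 1: generate a step-by-step chain, using the original data as a hint.
    cot = call_gpt4o(f"Solve step by step.\nQ: {question}\nHint: {standard_answer}")
    # Round 2: compare against the standard answer; revise inconsistent chains.
    for _ in range(max_revisions):
        if extract_final_answer(cot) == standard_answer:
            break
        cot = call_gpt4o(f"The chain disagrees with {standard_answer}. Revise:\n{cot}")
    # Round 3: keep only samples whose final answer matches the standard answer.
    return cot if extract_final_answer(cot) == standard_answer else None
```

Samples that still disagree with the standard answer after revision are dropped (`None`), which is how round 3 filters the dataset.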
training
- (align) LLaVA-Pretrain + Geo170K-align: 1 epoch
- (instruct) LLaVA-Instruct: the ViT is also fully tuned
- (math instruct) MultiMath300K-instruction, Geo170K-qa, MathV360K
- (PPO) built from MultiMath300K-val, GSM8K-train, MATH-train, and CMATH-train
Process-supervised RL
- Prompt for CoT reasoning so the model generates multiple reasoning steps
- GPT-4o evaluates correctness, locates the step where the error occurs, and generates a correct solution
- This yields preferred/dispreferred pairs -> train a reward model -> PPO
- PPO is trained with reward scores on the reasoning steps generated by the actor model
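A sketch of how the preferred/dispreferred pairs might be formed once GPT-4o has flagged the first erroneous step and produced a corrected continuation. The function name, the error index convention, and the toy steps are all my own illustration, not the paper's code.

```python
def build_preference_pair(steps, first_error_idx, corrected_steps):
    """Share the prefix up to the first error; the dispreferred chain keeps
    the faulty step, the preferred chain swaps in the corrected continuation."""
    prefix = steps[:first_error_idx]
    dispreferred = prefix + steps[first_error_idx:]
    preferred = prefix + corrected_steps
    return preferred, dispreferred

# Toy example: step 1 contains the arithmetic error (12 / 4 != 4).
steps = ["set x = area / width", "x = 12 / 4 = 4", "answer: 4"]
corrected = ["x = 12 / 3 = 4", "answer: 4"]
pref, dispref = build_preference_pair(steps, 1, corrected)
```

The reward model is then trained to score `pref` above `dispref`, and PPO optimizes the actor against the per-step scores from that reward model.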
Result
Best performance among open-source models, though still behind closed models
Text performance
Other open-source math-specific models underperform LLaVA-NeXT.
Contribution of RL
The domains used in the PPO stage (CMATH, GSM8K, MATH) improved, while unused ones did not. MathVista increased by 0.8 (vs. gains of 1.3 and 1.6 from the align and SFT stages, respectively), and MathVerse decreased by 0.2.
LLM backbone
Significant performance gap vs. Vicuna: MathVista 42.9 vs. 50.0, wow. This may be partly because MultiMath is mostly in Chinese. Still, Table 3 shows the model did learn regardless.