image

paper, code

TL;DR

  • I read this because: I want to improve MathVista performance
  • task : LVLM
  • Problem: The existing math-focused LVLM works, G-LLaVA and Math-LLaVA, suffer from limited geometric reasoning capabilities and limited CoT capabilities, respectively.
  • idea: Create a dataset with different math disciplines + CoTs
  • architecture : llava (clip-vit-large, DeepSeekMath-RL)
  • objective : ce loss + ppo loss
  • baseline : closed LLMs, LLMs, Math LLMs, Open-Source MLLMs (G-LLaVA-7B, Math-LLaVA-13B, LLaVA-1.5-7B, LLaVA-NeXT-34B)
  • data : (align) LLaVA-Pretrain + Geo170K-align (instruct) LLaVA-instruct (math instruct) MultiMath-300K-instruction, Geo170K-qa, MathV360K (PPO) MultiMath-300K-val, GSM8K-train, MATH-train, CMATH-train
  • evaluation : Mathvista, Mathverse, GSM8K, MATH, CMATH, GaoKao
  • Result : Highest MathVista and MathVerse scores among open-source MLLMs, plus the best text-only math benchmark results of any open-source model
  • contribution : Proposes the dataset and shows high performance on both text and vision math benchmarks
  • etc. : The content may be obvious, but the analysis is interesting

Details

Thumbnail

image

proposed MultiMath-300K

image image

  • Images come from a source they licensed themselves (http://test.xuekubao.com/ )
  • Not just QA, but also captioned
  • Covers geometry problem solving, automatic theorem proving, and mathematical word problems.
  • It claims English/Chinese coverage, but it’s almost all Chinese…?
  • Includes CoT annotations

image

Collection Method

image

  • Round 1: Generate step-by-step reasoning chains with GPT-4o, using the original data as a hint
  • Round 2: Check whether the GPT-4o reasoning chain is consistent with the standard answer; if not, have the reasoning steps revised
  • Round 3: Compare the GPT-4o answer with the standard answer and keep only the samples whose answers are correct
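The three rounds can be sketched as a simple generate-verify-filter loop. This is a runnable toy, not the paper's code: `ask_gpt4o` stands in for a real GPT-4o API call, and all function and prompt names are my own assumptions.

```python
def ask_gpt4o(prompt: str) -> str:
    """Stub for a GPT-4o call; a real pipeline would hit the OpenAI API."""
    return "Step 1: ... Step 2: ... Answer: 42"

def extract_answer(chain: str) -> str:
    """Pull the final answer out of a reasoning chain."""
    return chain.rsplit("Answer:", 1)[-1].strip()

def collect_cot(problem: str, standard_answer: str, max_fix_rounds: int = 1):
    # Round 1: generate a step-by-step chain, using the original data as a hint.
    chain = ask_gpt4o(f"Solve step by step.\nProblem: {problem}\nHint: {standard_answer}")
    # Round 2: if the chain disagrees with the standard answer,
    # ask the model to revise the inconsistent reasoning steps.
    for _ in range(max_fix_rounds):
        if extract_answer(chain) == standard_answer:
            break
        chain = ask_gpt4o(f"Your answer disagrees with {standard_answer}. Fix the steps:\n{chain}")
    # Round 3: keep the sample only if the final answer matches the standard one.
    return chain if extract_answer(chain) == standard_answer else None

sample = collect_cot("6 * 7 = ?", "42")  # kept, since the stub answer matches
```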

training

  • (align) LLaVA-Pretrain + Geo170K-align : 1 epoch
  • (instruct) LLaVA-instruct: the ViT is also fully fine-tuned
  • (math instruct) MultiMath-300K-instruction, Geo170K-qa, MathV360K
  • (PPO) Built from MultiMath-300K-val, GSM8K-train, MATH-train, and CMATH-train as sources.
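The four stages above can be summarized as a config sketch. The stage names and keys are mine; only the stage-1 epoch count is stated in these notes, so everything else is left unspecified.

```python
# Sketch of the staged training recipe as a config list (keys are my own).
TRAINING_STAGES = [
    {"stage": "align", "data": ["LLaVA-Pretrain", "Geo170K-align"],
     "epochs": 1, "tune_vit": False},
    {"stage": "instruct", "data": ["LLaVA-instruct"],
     "tune_vit": True},  # ViT is fully fine-tuned here
    {"stage": "math-instruct",
     "data": ["MultiMath-300K-instruction", "Geo170K-qa", "MathV360K"]},
    {"stage": "ppo",
     "data": ["MultiMath-300K-val", "GSM8K-train", "MATH-train", "CMATH-train"]},
]
```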

Process-supervised RL

image

  • Run CoT reasoning to generate multiple reasoning steps
  • Have GPT-4o judge correctness, locate the step where the error occurs, and generate the correct solution
  • This yields a preferred/dispreferred pair set -> reward model training -> PPO
  • Run PPO with the reward model scoring the reasoning steps generated by the actor model
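One way to read the prefer/disprefer construction, as a toy sketch (function and field names are my assumptions, not the paper's): steps before the first located error are shared context, the erroneous continuation is dispreferred, and the GPT-4o-corrected continuation is preferred.

```python
def build_preference_pair(steps, first_error_idx, corrected_steps):
    """Turn step-level error localization into one preference pair.

    steps           : the actor model's reasoning steps
    first_error_idx : index of the first erroneous step (found by GPT-4o)
    corrected_steps : the corrected solution from GPT-4o
    """
    context = steps[:first_error_idx]           # shared, error-free prefix
    preferred = context + corrected_steps[first_error_idx:]
    dispreferred = steps                        # original chain with the error
    return {"prefer": preferred, "disprefer": dispreferred}

pair = build_preference_pair(
    steps=["expand (a+b)^2", "drop the 2ab term", "answer: a^2+b^2"],
    first_error_idx=1,
    corrected_steps=["expand (a+b)^2", "keep the 2ab term", "answer: a^2+2ab+b^2"],
)
```

A reward model trained on such pairs then scores the actor's reasoning steps during PPO.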

Result

image

Highest performance among open-source models, though still behind closed models.

Text performance

image

Other math-specific open-source models fall behind LLaVA-NeXT.

contribution of RL

image

The domains used in the PPO stage (CMATH, GSM8K, MATH) improved, while the unused ones did not. MathVista rose by only 0.8 (vs. gains of 1.3 and 1.6 from the align and SFT stages), and MathVerse dropped by 0.2.

LLM backbone

image

Significant performance gap vs. Vicuna: MathVista 42.9 vs. 50.0, striking. This may be partly because MultiMath is mostly in Chinese. Still, Table 3 shows the Vicuna variant does learn something.