[183] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

TL;DR

I read this because.. : to improve MathVista performance
task : LVLM
problem : existing math-focused LVLMs each have a limitation: G-LLaVA is limited to geometric reasoning, and Math-LLaVA has limited CoT ability
idea : build a dataset that covers diverse math domains and adds CoT
architecture : LLaVA (clip-vit-large, DeepSeekMath-RL)
objective : CE loss + PPO loss
baseline : closed LLMs, LLMs, Math LLMs, Open-Source MLLMs (G-LLaVA-7B, Math-LLaVA-13B, LLaVA-1.5-7B, LLaVA-NeXT-34B)
data : (align) LLaVA-Pretrain + geo170k-align (instruct) LLaVA-instruct (math instruct) MultiMath300k-instruction, Geo170k-qa, MathV360k (PPO) MultiMath300K-val, GSM8K-train, Math-train, CMATH-train
evaluation : MathVista, MathVerse, GSM8K, MATH, CMATH, GaoKao
result : highest MathVista and MathVerse scores among open-source models; on text-only math benchmarks it is also SOTA compared with other MLLMs.
contribution : proposes a dataset and reaches strong performance on both text and vision
etc. : the content may be predictable, but there are many analyses, which made it a fun read
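
The objective entry in the TL;DR can be written out as follows. This is a sketch using the standard textbook forms of token-level cross-entropy and the clipped PPO surrogate, not necessarily the exact formulation in the paper. The symbols are my own notation: $x$ is the text prompt, $v$ the image, $y$ the target tokens, $\hat{A}_t$ the advantage estimate, and $\epsilon$ the clip range.

```latex
\mathcal{L}_{\text{CE}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\left(y_t \mid y_{<t}, x, v\right)

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad
\mathcal{L}_{\text{PPO}}(\theta) = -\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
```

The data entry lists PPO as its own stage, which suggests the CE loss is used in the align / instruct stages and the PPO loss only in the final stage, rather than both being summed in a single step.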

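The authors bought the image license themselves to build the dataset (http://test.xuekubao.com/ )
Some samples come with captions, not only QA
Covers geometry problem solving, automatic theorem proving, and mathematical word problems
It is described as English/Chinese, but it looks like almost all of it is Chinese..?
Covers CoT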

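round 1: use GPT-4o to generate step-by-step reasoning chains, with the original data given as a hint
round 2: use GPT-4o to check whether the generated reasoning chain is consistent with the standard answer; if it is inconsistent, revise the reasoning steps
round 3: compare the GPT-4o answer with the standard answer and keep only the samples whose answer is correct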
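
The three rounds above can be sketched as a small filtering loop. This is only an illustration under my own assumptions: `Problem`, `three_round_filter`, and the four callables (`generate`, `is_consistent`, `revise`, `extract_answer`) are hypothetical names standing in for GPT-4o prompts, and exact string matching in round 3 is a simplification of however the paper actually compares answers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    question: str
    original_solution: str  # original data, passed to the model as a hint
    standard_answer: str


def three_round_filter(
    problems: List[Problem],
    generate: Callable[[Problem], str],             # round 1 (hypothetical GPT-4o call)
    is_consistent: Callable[[str, Problem], bool],  # round 2 check (hypothetical)
    revise: Callable[[str, Problem], str],          # round 2 fix (hypothetical)
    extract_answer: Callable[[str], str],           # pulls the final answer from a chain
) -> List[Tuple[Problem, str]]:
    """Return (problem, reasoning chain) pairs that survive all three rounds."""
    kept = []
    for p in problems:
        # Round 1: generate a step-by-step chain, using the original data as a hint.
        chain = generate(p)
        # Round 2: if the chain disagrees with the standard answer, revise its steps.
        if not is_consistent(chain, p):
            chain = revise(chain, p)
        # Round 3: keep the sample only if the final answer matches the standard answer.
        if extract_answer(chain).strip() == p.standard_answer.strip():
            kept.append((p, chain))
    return kept
```

Passing the model calls in as callables keeps the loop testable with stubs, so the filtering logic can be checked without any API access.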

Process-supervised RL

Not as good as the closed models, but the highest performance among open-source models.

The other open-source math-specialized models score lower than LLaVA-NeXT.

The domains used in the PPO stage (CMATH, GSM8K, MATH) improve; the ones not used do not. MathVista goes up by 0.8 (the align and SFT stages added 1.3 and 1.6 respectively), while MathVerse drops by 0.2.

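There is a large performance gap compared with Vicuna: MathVista 42.9 vs 50.0, which is striking. The fact that MultiMath is mostly Chinese is probably part of the reason. Still, Table 3 shows that training did not fail.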