TL;DR
- I read this because… : it’s mentioned a lot, as if it’s one of the main ways to learn about PRMs.
- task : math solving
- problem : I want to train a Process Reward Model (PRM), but human annotation is too expensive
- idea : use MCTS to estimate the value of each step, use that as the PRM label, then run step-level PPO
- architecture : LLaMA2-7B/13B/70B, LLemma-7B/34B, Mistral-7B, Deepseek-67B
- objective : (PRM) BCE loss; (RL) PPO loss
- baseline : (train/infer) ORM, self-consistency, self-consistency + ORM; (data) rule-based, BART NLI
- data : 170K solutions for GSM8K / 270K for MATH
- evaluation : GSM8K, MATH accuracy
- result : Good performance
- contribution : Twitter says it’s the first PRM paper after OpenAI’s?
Details
- thumbnail
- PRM loss
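The step-level BCE objective can be sketched in plain Python (a minimal sketch: the paper trains an LLM scoring head, and the function/variable names here are illustrative):

```python
import math

def prm_bce_loss(step_logits, step_labels):
    """Mean binary cross-entropy over the steps of one solution.

    step_logits: one raw score per reasoning step (illustrative stand-in
                 for the model's per-step outputs).
    step_labels: 0/1 correctness label per step, from the automatic annotation.
    """
    total = 0.0
    for z, y in zip(step_logits, step_labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(step_logits)

# A logit of 0 gives p = 0.5, so the per-step loss is ln 2 ≈ 0.693.
print(prm_bce_loss([0.0], [1]))
```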
- automatic process annotation
Think of the value estimation as being done with MCTS: exhaustively rolling out every continuation of each step would explode the number of cases, so it is approximated with MCTS (https://gusals1620.tistory.com/3).
In the end, they use HARD estimation, because it doesn’t require finding hyperparameters per model? (I guess soft estimation could be trained with MSE, right?)
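The hard vs. soft labeling from rollouts can be sketched as follows (a minimal sketch; `rollout_correct` is an assumed pre-computed matrix of answer checks):

```python
def step_labels(rollout_correct):
    """rollout_correct[s][i]: did rollout i from the prefix ending at step s
    reach the gold answer?

    Hard estimation: label = 1 if ANY rollout reaches the correct answer.
    Soft estimation: label = fraction of rollouts that are correct.
    """
    hard = [1 if any(r) else 0 for r in rollout_correct]
    soft = [sum(r) / len(r) for r in rollout_correct]
    return hard, soft

# Two steps, N=4 rollouts each: some correct from step 1, none from step 2.
hard, soft = step_labels([[True, False, True, False],
                          [False, False, False, False]])
# hard == [1, 0], soft == [0.5, 0.0]
```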
parameter setting
The generator and completer are each fine-tuned on MetaMath for 3 epochs
They are trained on the GSM8K and MATH training data to produce ORM/PRM training data -> afterwards, 15 solutions are generated per problem
The completer decodes N=8 completions using Llemma-7B (how is a completer different from a generator… is the generator the model that writes the solution and the completer the model that does the rollouts? Can the two be different models?)
LLaMA2-70B and Llemma-34B are used for verification
The policy model for PPO learning is based on Llama2-7B and Mistral-7B
I’m not sure why the models are so different.
result
Best verification method among 256 sampled solutions (best-of-256).
Good performance compared to other training methods (ORM + PPO / RFT)
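Best-of-N verification with the PRM can be sketched like this (aggregating a solution’s step scores by their minimum is one common choice and is assumed here, as is the toy scorer):

```python
def best_of_n(candidates, score_steps):
    """Pick the candidate whose worst PRM step score is highest.

    candidates: list of sampled solutions.
    score_steps: function mapping a solution to its per-step PRM probabilities
                 (a stand-in for the trained PRM).
    """
    return max(candidates, key=lambda sol: min(score_steps(sol)))

# Toy example: per-step scores are precomputed per solution.
scores = {"sol_a": [0.9, 0.2, 0.8], "sol_b": [0.7, 0.6, 0.7]}
best = best_of_n(["sol_a", "sol_b"], lambda s: scores[s])
# best == "sol_b" (its minimum step score 0.6 beats sol_a's 0.2)
```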
- There was a prior attempt to label steps with a BART NLI model, and there is an ablation against it (https://arxiv.org/abs/2206.02336)
- Looking at (a)(b), Math-Shepherd performs better than the verifier / ORM baselines, and performance also improves as the model gets larger for both
- (c) Compared to self-consistency: if the reward model is much smaller than the generator, performance degrades as the number of solutions per problem grows – the reward model should be as strong as the generator
- (d) Much better performance than in (a) when the verifier is larger; the gap over SC also grows much larger