TL;DR
- I read this because… : it’s mentioned a lot, as if it’s one of the main ways to learn about PRMs.
- task : math solving
- problem : I want to train a Process Reward Model (PRM), but human annotation is too expensive
- idea : use MCTS to estimate the value of each step, use that as the PRM label, then run step-level PPO
- architecture : LLaMA2-7B/13B/70B, LLemma-7B/34B, Mistral-7B, Deepseek-67B
- objective : (PRM) BCE loss; (RL) PPO loss
- baseline : (train/infer) ORM, self-consistency, self-consistency + ORM; (data) rule-based, BART NLI
- data : 170K solutions for GSM8K / 270K for MATH
- evaluation : GSM8K, MATH accuracy
- result : Good performance
- contribution : Twitter says it’s the first PRM paper after OpenAI’s?
Details
- thumbnail
- PRM loss
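The step-level BCE objective can be sketched in plain Python (a minimal sketch: the paper trains an LLM scoring head, and the function/variable names here are illustrative):

```python
import math

def prm_bce_loss(step_logits, step_labels):
    """Mean binary cross-entropy over the steps of one solution.

    step_logits: one raw score per reasoning step (illustrative stand-in
                 for the model's per-step outputs).
    step_labels: 0/1 correctness label per step, from the automatic annotation.
    """
    total = 0.0
    for z, y in zip(step_logits, step_labels):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(step_logits)

# A logit of 0 gives p = 0.5, so the per-step loss is ln 2 ≈ 0.693.
print(prm_bce_loss([0.0], [1]))
```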
- automatic process annotation
Think of the value estimation as being done with MCTS: exhaustively rolling out every continuation of each step would explode the number of cases, so it is approximated with MCTS (https://gusals1620.tistory.com/3).
In the end, they use HARD estimation, because it doesn’t require finding hyperparameters per model? (I guess soft estimation could be trained with MSE, right?)
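The hard vs. soft labeling from rollouts can be sketched as follows (a minimal sketch; `rollout_correct` is an assumed pre-computed matrix of answer checks):

```python
def step_labels(rollout_correct):
    """rollout_correct[s][i]: did rollout i from the prefix ending at step s
    reach the gold answer?

    Hard estimation: label = 1 if ANY rollout reaches the correct answer.
    Soft estimation: label = fraction of rollouts that are correct.
    """
    hard = [1 if any(r) else 0 for r in rollout_correct]
    soft = [sum(r) / len(r) for r in rollout_correct]
    return hard, soft

# Two steps, N=4 rollouts each: some correct from step 1, none from step 2.
hard, soft = step_labels([[True, False, True, False],
                          [False, False, False, False]])
# hard == [1, 0], soft == [0.5, 0.0]
```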
parameter setting
The generator and completer are each fine-tuned on MetaMath for 3 epochs
They are trained on the GSM8K and MATH training data to produce ORM/PRM training data -> afterwards, 15 solutions are generated per problem
The completer decodes N=8 completions using Llemma-7B (how is a completer different from a generator… is the generator the model that writes the solution and the completer the model that does the rollouts? Can the two be different models?)
LLaMA2-70B and Llemma-34B are used for verification
The policy model for PPO learning is based on Llama2-7B and Mistral-7B
I’m not sure why the models are so different.
result
Best verification method among 256 sampled solutions (best-of-256).
Good performance compared to other training methods (ORM + PPO / RFT)
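Best-of-N verification with the PRM can be sketched like this (aggregating a solution’s step scores by their minimum is one common choice and is assumed here, as is the toy scorer):

```python
def best_of_n(candidates, score_steps):
    """Pick the candidate whose worst PRM step score is highest.

    candidates: list of sampled solutions.
    score_steps: function mapping a solution to its per-step PRM probabilities
                 (a stand-in for the trained PRM).
    """
    return max(candidates, key=lambda sol: min(score_steps(sol)))

# Toy example: per-step scores are precomputed per solution.
scores = {"sol_a": [0.9, 0.2, 0.8], "sol_b": [0.7, 0.6, 0.7]}
best = best_of_n(["sol_a", "sol_b"], lambda s: scores[s])
# best == "sol_b" (its minimum step score 0.6 beats sol_a's 0.2)
```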
- There was a prior attempt to label steps with a BART NLI model, and there is an ablation against it (https://arxiv.org/abs/2206.02336)
- Looking at (a)(b), Math-Shepherd performs better than the verifier / ORM baselines, and performance also improves as the model gets larger for both
- (c) Compared to self-consistency: if the reward model is much smaller than the generator, performance degrades as the number of solutions per problem grows – the reward model should be as strong as the generator
- (d) Much better performance than in (a) when the verifier is larger; the gap over SC also grows much larger