Image

paper

TL;DR

  • I read this because : it mentions
  • task : RL for reasoning
  • problem : The value model used in PPO for credit assignment does not appear to learn well.
  • idea : Use Monte Carlo (MC) estimates of the value instead of a value model
  • architecture : DeepSeekMath 7B, Rho Math 1.1B
  • objective : PPO loss, without learning a value network.
  • baseline : PPO, DPO+, RestEM
  • data : GSM8K, MATH
  • evaluation : Pass@1 accuracy
  • result : 1) Outperforms PPO. 2) Takes longer per step, but converges faster overall because it runs extra inference mid-trajectory (wall-clock efficiency). 3) Lower KL divergence at the same performance level (KL-divergence efficiency).
  • contribution : Similar to GRPO/RLOO, but keeps PPO-style per-step credit assignment by replacing the value network with MC estimates.
  • etc. : The supplementary material was well organized, which was nice to see. On the ICLR OpenReview thread, the reviewers dug in deeply.

Details

intro

  • Image
  • Some steps are critically important and should be weighted more heavily, and there is a delay between action and reward; this is the credit assignment problem, one of the central problems in RL.
  • PPO learns a value network, but prior studies have shown that it often does not learn well and merely acts as a baseline for the policy gradient, or that replacing it with the average reward works better.
  • How well is the value network learning?
  • I want to get an unbiased value estimate -> VinePPO

thumbnail

Image Image

Accurate Credit Assignment with VinePPO

Keep PPO’s first term (the clipped policy objective), but estimate the value at each step with MC rollouts instead of a value network. How often to take a step is controlled by a hyperparameter.

Image

K is the number of MC samples per step; it works fine even at K = 1 or 2.

Image
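The per-step MC value and advantage computation described above can be sketched roughly as follows. This is a minimal sketch, not the paper’s implementation; `sample_return` is a hypothetical callback that completes a trajectory from an intermediate state and returns its scalar reward, and the zero-intermediate-reward assumption follows the usual outcome-reward setup.

```python
import statistics

def mc_value(sample_return, state, K=2):
    # Monte Carlo value estimate: roll out K completions from the
    # intermediate state and average their final returns.
    return statistics.mean(sample_return(state) for _ in range(K))

def vineppo_advantages(states, terminal_reward, sample_return, K=2):
    # Per-step advantage A(s_t) = r_t + V(s_{t+1}) - V(s_t), with
    # intermediate rewards of 0 and the task reward arriving only at
    # the end; V(.) comes from MC rollouts, not a value network.
    values = [mc_value(sample_return, s, K) for s in states]
    next_values = values[1:] + [terminal_reward]
    return [nv - v for v, nv in zip(values, next_values)]
```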

Result

  • Step-by-step accuracy for VinePPO and PPO Image

  • accuracy vs. wall-clock time for VinePPO and PPO

Image

VinePPO takes longer per iteration but converges faster.

  • accuracy vs. KL divergence
Image

Why is low KL divergence good? (also from https://github.com/long8v/PTIR/issues/221 ) It is considered good because it indicates that the pretrained model’s knowledge is not lost, and that performance improves by exploiting that knowledge well (similar in spirit to adding a KL-divergence penalty term).

  • temperature tolerance
Image

https://github.com/long8v/PTIR/issues/221 Here the temperature used to sample trajectories was raised to 1.2, whereas in practice it is usually < 1. PPO’s performance drops under this setting, but this is not the case for VinePPO.

  • accuracy of the value model
Image

Ground-truth values were obtained with 256 MC rollouts, and the horizontal axis is the predicted value. PPO’s value network produces many false positives and false negatives, while VinePPO’s estimates correlate highly with the ground truth.
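The kind of comparison above can be sketched as below. This is an illustrative sketch, not the paper’s evaluation code: the 0.5 decision threshold is an assumption, and the MC ground-truth values (256 rollouts per state in the paper) are taken as given inputs.

```python
def value_error_counts(predicted, ground_truth, thresh=0.5):
    # Classify value predictions against MC ground-truth values.
    # False positive: the model predicts a high value (>= thresh)
    # for a state whose true value is low; false negative: the
    # reverse. The 0.5 threshold is an assumption for illustration.
    pairs = list(zip(predicted, ground_truth))
    fp = sum(p >= thresh and g < thresh for p, g in pairs)
    fn = sum(p < thresh and g >= thresh for p, g in pairs)
    return fp, fn
```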

Image

They measured the error on incorrect trajectories and found that for PPO, the longer the reasoning, the higher the error. This is explained by the early steps (far left) having less diversity across the training data, so their values may have been memorized.

Some details

Image

End-of-Sequence (EOS) Trick: As detailed in Appendix A, rewards are only applied at the final token of a response, which corresponds to the EOS token when the response is complete. For responses that exceed the maximum length, we truncate the response to the maximum length and apply the reward to the last token of the truncated sequence.

How did they handle the case where the generation is longer than the max length?
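As quoted above, the stated answer is truncation: a minimal sketch of that reward placement, assuming token IDs and a scalar task reward (function and argument names are made up):

```python
def eos_reward(token_ids, reward, max_len):
    # EOS trick sketch: per-token rewards are zero everywhere except
    # the final token. If the response exceeds max_len, truncate it
    # and place the reward on the last kept token instead of the EOS.
    ids = token_ids[:max_len]
    per_token = [0.0] * len(ids)
    per_token[-1] = reward
    return ids, per_token
```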

footnote