TL;DR
- I read this because… : I’m interested in dense RLHF.
- task : RLHF
- problem : sparse reward in RL is a problem
- idea : distribute the reward over tokens using the reward model’s attention map, instead of giving it only at the end
- input/output : Q -> A
- architecture : GPT-2, OpenLLaMA
- objective : PPO objective // reward formula changes
- baseline : standard RLHF (PPO with a sparse terminal reward), a reward evenly distributed over token length, ABC-D (which uses the attention map of the actor model instead of the reward model)
- data : IMDb (GPT-2), RedPajama / Anthropic helpful + harmless preference data
- evaluation : average reward reached by the action model over time steps -> MMLU (I don’t cover this part).
- result : Theoretically has the same optimal solution as RLHF. Converges faster and seems to reach a better local optimum.
- contribution : dense rewards for RLHF at almost no cost! Improved stability!
- etc. : Can I just evaluate the average of the rewards?!
Details
motivation
In RLHF for LLMs, the reward is sparse: it arrives only at the end of the sequence, which is a problem.
This is especially unreliable for longer sequences.
preliminary
proposed ABC
LLM generation can be viewed as a kind of sequential decision making. It can be expressed as a finite-horizon (because sentences always end…) MDP problem.
Our goal is to find a policy that maximizes the discounted reward below.
What we want is to assign credit to each token for the reward that is only observed at the last token selection.
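The equation referred to here isn’t reproduced in these notes; under the MDP framing it is presumably the standard discounted-return objective:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]$$

where $\gamma \in (0, 1]$ is the discount factor, $r_t$ the per-step reward, and $T$ the (finite) episode length.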
where $\alpha_i$ is the head average of the last layer’s attention map when the reward model predicts the reward at the last token (i.e., the row of the attention matrix indexed by the last token).
Distributing the reward at each time step $t$ according to this attention vector: that is ABC.
- $R_\phi$ : they call this the reward model…. Maybe it’s the sparse reward, i.e., the predicted reward placed only at the last step?!
- $r_C$ : predicted reward of the last token
- In practice, we use a $\beta$, $1 - \beta$ interpolation of $R_\phi$ and $\alpha \times r_C$.
- Observed that as $\beta$ gets larger, performance gets better.
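A minimal sketch of the redistribution step, assuming the attention tensor shape `(num_heads, seq_len, seq_len)` and the function name are hypothetical; the notes leave ambiguous which term $\beta$ weights, so here $\beta$ is assumed to weight the dense attention term, consistent with "larger $\beta$ performs better":

```python
import numpy as np

def abc_dense_rewards(attn_last_layer, r_c, beta=0.5):
    """Sketch of ABC: redistribute the reward model's scalar reward r_c
    over all tokens using its last-layer attention map.

    attn_last_layer : (num_heads, seq_len, seq_len) attention weights
        from the reward model's final layer (assumed shape).
    r_c  : scalar reward predicted at the last token.
    beta : weight on the dense, attention-based term (assumption).
    """
    # alpha_i: head-average of the last-token row of the attention map
    alpha = attn_last_layer.mean(axis=0)[-1]   # (seq_len,)
    # Softmax attention rows already sum to 1; renormalize to guard
    # against numerical drift
    alpha = alpha / alpha.sum()

    seq_len = alpha.shape[0]
    sparse = np.zeros(seq_len)                 # R_phi: reward only at
    sparse[-1] = r_c                           # the last token

    dense = alpha * r_c                        # attention-weighted credit

    # beta / (1 - beta) interpolation of the two reward signals
    return beta * dense + (1.0 - beta) * sparse
```

Since $\alpha$ sums to 1, the total reward per sequence stays $r_C$ for any $\beta$, which matches the claim that ABC has the same optimal solution as standard RLHF.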
result
- ABC-D: uses the attention map of the policy (actor) network instead of the reward model’s
Limitation
- tokenizer issue: ABC needs the reward model’s tokenizer to match the action model’s tokenizer, which is not always possible.
- over-optimized RM: it is also possible that ABC overfits to the RM more, which hasn’t been fully explored.
- only positive: attention weights are non-negative, so every per-token reward shares the sign of $r_C$. Negative credit could be obtained with an attribution method like DeepLIFT.