TL;DR
- I read this because… : I’m interested in dense RLHF.
- task : RLHF
- problem : sparse reward in RL is a problem
- idea : distribute the reward over tokens using the reward model’s attention map, instead of giving it only at the end
- input/output : Q -> A
- architecture : GPT-2, OpenLLaMA
- objective : PPO objective // reward formula changes
- baseline : standard RLHF (PPO with a sparse terminal reward), a reward evenly distributed over token length, ABC-D (which uses the attention map of the actor model instead of the reward model)
- data : IMDb (GPT-2), RedPajama / Anthropic helpful + harmless preference data
- evaluation : average reward reached by the action model over time steps -> MMLU (I don’t cover this part).
- result : Theoretically has the same optimal solution as RLHF. Converges faster and seems to reach a better local optimum.
- contribution : dense rewards for RLHF at almost no cost! Improved stability!
- etc. : Can I just evaluate the average of the rewards?!
Details
motivation
In RLHF for LLMs, the reward is sparse: it arrives only at the end of the sequence, which is a problem.
This is especially unreliable for longer sequences.
preliminary
proposed ABC
LLM generation can be viewed as a kind of sequential decision making. It can be expressed as a finite-horizon (because sentences always end…) MDP problem.
Our goal is to find a policy that maximizes the discounted reward below.
What we want is to assign credit to each token for the reward that is only observed at the last token selection.
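The equation referred to here isn’t reproduced in these notes; under the MDP framing it is presumably the standard discounted-return objective:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]$$

where $\gamma \in (0, 1]$ is the discount factor, $r_t$ the per-step reward, and $T$ the (finite) episode length.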
where $\alpha_i$ is the head average of the last layer’s attention map when the reward model predicts the reward at the last token (i.e., the row of the attention matrix indexed by the last token).
Distributing the reward at each time step $t$ according to this attention vector: that is ABC.
- $R_\phi$ : they call this the reward model…. Maybe it’s the sparse reward, i.e., the predicted reward placed only at the last step?!
- $r_C$ : predicted reward of the last token
- In practice, we use a $\beta$, $1 - \beta$ interpolation of $R_\phi$ and $\alpha \times r_C$.
- Observed that as $\beta$ gets larger, performance gets better.
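A minimal sketch of the redistribution step, assuming the attention tensor shape `(num_heads, seq_len, seq_len)` and the function name are hypothetical; the notes leave ambiguous which term $\beta$ weights, so here $\beta$ is assumed to weight the dense attention term, consistent with "larger $\beta$ performs better":

```python
import numpy as np

def abc_dense_rewards(attn_last_layer, r_c, beta=0.5):
    """Sketch of ABC: redistribute the reward model's scalar reward r_c
    over all tokens using its last-layer attention map.

    attn_last_layer : (num_heads, seq_len, seq_len) attention weights
        from the reward model's final layer (assumed shape).
    r_c  : scalar reward predicted at the last token.
    beta : weight on the dense, attention-based term (assumption).
    """
    # alpha_i: head-average of the last-token row of the attention map
    alpha = attn_last_layer.mean(axis=0)[-1]   # (seq_len,)
    # Softmax attention rows already sum to 1; renormalize to guard
    # against numerical drift
    alpha = alpha / alpha.sum()

    seq_len = alpha.shape[0]
    sparse = np.zeros(seq_len)                 # R_phi: reward only at
    sparse[-1] = r_c                           # the last token

    dense = alpha * r_c                        # attention-weighted credit

    # beta / (1 - beta) interpolation of the two reward signals
    return beta * dense + (1.0 - beta) * sparse
```

Since $\alpha$ sums to 1, the total reward per sequence stays $r_C$ for any $\beta$, which matches the claim that ABC has the same optimal solution as standard RLHF.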
result
- ABC-D: uses the attention map of the policy (actor) network instead of the reward model’s
Limitation
- tokenizer issue: ABC needs the reward model’s tokenizer to match the action model’s tokenizer, which is not always possible.
- over-optimized RM: it is also possible that ABC overfits to the RM more, which hasn’t been fully explored.
- only positive: attention weights are non-negative, so every per-token reward shares the sign of $r_C$. Negative credit could be obtained with an attribution method like DeepLIFT.