
paper, code

TL;DR

  • why I read this : I’m interested in dense-reward RLHF.
  • task : RLHF
  • problem : the RL reward is sparse (a single scalar given only at the end of the sequence)
  • idea : redistribute the reward over tokens using the reward model’s attention map, instead of giving it only at the end
  • input/output : Q -> A
  • architecture : GPT-2, OpenLLaMA
  • objective : PPO objective // only the reward formula changes
  • baselines : standard RLHF (plain PPO with the sparse reward), reward evenly distributed over token length, ABC-D (uses the attention map of the actor model, not the reward model)
  • data : IMDb (GPT-2), RedPajama / Anthropic helpful + harmless preference data
  • evaluation : average reward achieved by the policy model over time steps -> benchmarks like MMLU aren’t evaluated here.
  • result : provably has the same optimal solution as RLHF. Converges faster and seems to reach a better local optimum.
  • contribution : dense rewards for RLHF, cheaply! Improved stability!
  • etc. : Can I really just evaluate the average of the rewards?!

Details

motivation


Sparse rewards are a problem for LLMs: a single scalar reward arrives only after the whole sequence is generated.

This is especially unreliable for longer sequences.

preliminary


proposed ABC

LLM text generation can be viewed as a kind of sequential decision making. It can be expressed as a finite-horizon MDP (because sentences always end…).

We can see that the goal is to find actions that maximize the discounted return.
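The objective (the original figure is lost) is presumably the standard discounted-return objective, assuming reward $r_t$ at step $t$ and discount factor $\gamma$:

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_t \right]
```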

What we have is a reward predicted only when the last token is selected.

where $\alpha_i$ is the head-averaged attention weight from the last layer of the reward model when it predicts the reward at the last token (i.e., the vector is the row of the attention matrix indexed by the last token).
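A minimal sketch of that extraction, assuming we already have the reward model’s last-layer attention as nested lists of shape (heads, seq_len, seq_len); the function name and input layout here are hypothetical:

```python
def attention_credit(last_layer_attn):
    """Head-average the reward model's last-layer attention and take the
    row for the final (reward-predicting) token, giving one alpha_i per token.

    last_layer_attn: list of per-head (seq_len x seq_len) matrices, rows sum to 1.
    """
    num_heads = len(last_layer_attn)
    seq_len = len(last_layer_attn[0])
    # average over heads, then read the last token's row
    alpha = [sum(head[-1][i] for head in last_layer_attn) / num_heads
             for i in range(seq_len)]
    total = sum(alpha)                 # ~1.0 already; renormalize for safety
    return [a / total for a in alpha]

# toy example: 2 heads, 3 tokens, uniform attention
attn = [[[1/3] * 3] * 3, [[1/3] * 3] * 3]
print(attention_credit(attn))  # each token gets credit 1/3
```

With uniform attention every token gets equal credit; in practice the row concentrates on whichever tokens the reward model attends to when scoring.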

Splitting the reward at time step $t$ across the attention map, turning it into a vector of per-token rewards, gives ABC (Attention-Based Credit).

  • $R_\phi$ : they call this the reward model…. Maybe it means the sparse reward, i.e., the reward predicted only at the last step?!
  • $r_C$ : predicted reward of the last token
  • In practice, a $\beta$ / $(1 - \beta)$ interpolation of $R_\phi$ and $\alpha \times r_C$ is used.
  • They observed that performance improves as $\beta$ grows.
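A sketch of that interpolation under one plausible reading (my assumption, not necessarily the paper’s exact formula): every token receives the $\beta$-weighted attention share of $r_C$, and the last token additionally keeps the $(1-\beta)$-weighted sparse reward $R_\phi$:

```python
def abc_dense_reward(alpha, r_c, r_phi, beta):
    """Per-token dense rewards from attention credit alpha (sums to 1)."""
    dense = [beta * a * r_c for a in alpha]   # spread r_c over tokens by attention
    dense[-1] += (1.0 - beta) * r_phi         # sparse reward stays on the final token
    return dense

# attention credit concentrated on the last token; beta = 0.8
print(abc_dense_reward([0.1, 0.2, 0.7], r_c=1.0, r_phi=1.0, beta=0.8))
# roughly [0.08, 0.16, 0.76]; total stays 1.0 when r_c == r_phi
```

Note the total reward per sequence is preserved under this reading, which is consistent with the claim that ABC shares its optimal solution with standard RLHF.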

result

  • ABC-D: uses the attention map of the policy (actor) network instead of the reward model’s

Limitation

  • tokenizer issues : the method requires the reward model’s tokenizer to match the action model’s tokenizer; mismatched tokenizers aren’t currently supported
  • over-optimized RM : it is also possible that ABC overfits to the reward model even more, which I haven’t fully explored
  • only positive : attention weights are non-negative, so all redistributed rewards are positive. Negative credit could come from an attribution method like DeepLIFT