
paper, code

TL;DR

  • I read this because : it was mentioned in a video about o1
  • task : Improve reward model
  • problem : With LLM-as-judge, scores come with an interpretable rationale; can’t a reward model do the same?
  • idea : Have the RM first generate a critique, then attach a reward head that predicts the score conditioned on that critique.
  • input/output : {question, answer} -> {critique, reward score}
  • architecture : Llama-3-8B / 70B
  • objective : SFT loss + RM loss (Bradley-Terry model)
  • baseline : classic RM model
  • data : UltraLlama (proposed: UltraFeedback + UltraInteract subsets as prompts, with Llama-3-8B-Instruct generating responses) + Llama-3.1-405B-Instruct generating critiques and judgments as the oracle
  • evaluation : pairwise preference classification on RewardBench, Best-of-N (BoN) win rate on ArenaHard
  • result : The CLoud technique helps in all categories. On-policy is always better than off-policy. They also test a self-consistency technique, but it only helps on reasoning.
  • contribution : Good in that it makes the RM interpretable? Not sure it will see much use.
  • etc. :
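The {question, answer} -> {critique, reward score} interface above can be sketched as a toy snippet; `cloud_reward`, `generate_critique`, and `reward_head` are illustrative names, not the paper's actual code:

```python
# Toy sketch of the CLoud interface: the RM writes a critique first,
# then a scalar reward head scores the answer conditioned on it.
# All names here are illustrative stand-ins, not the paper's code.
def cloud_reward(question, answer, generate_critique, reward_head):
    critique = generate_critique(question, answer)   # step 1: critique
    score = reward_head(question, answer, critique)  # step 2: scalar reward
    return critique, score

# Stub components standing in for the LLM and the linear reward head.
critique, score = cloud_reward(
    "What is 2+2?", "5",
    generate_critique=lambda q, a: "The answer is incorrect; 2+2=4.",
    reward_head=lambda q, a, c: -1.0 if "incorrect" in c else 1.0,
)
```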

Details

  • thumbnail

Simple: prompt the model to generate a critique, then condition on the full context (including that critique) and attach a reward head on top. The SFT loss (to generate critiques) and the RM loss are learned at once.

($\lambda$ was tuned to 5/4 for 8B and 3/4 for 70B)
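The joint objective can be written out as a minimal sketch, assuming the standard Bradley-Terry pairwise loss $-\log\sigma(r_\text{chosen} - r_\text{rejected})$; the function names are illustrative:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    # Standard pairwise RM loss: -log sigmoid(r_chosen - r_rejected).
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def cloud_loss(sft_loss, r_chosen, r_rejected, lam):
    # Joint objective: critique SFT loss + lambda * Bradley-Terry RM loss.
    # The note above reports lambda = 5/4 (8B) and 3/4 (70B).
    return sft_loss + lam * bradley_terry_loss(r_chosen, r_rejected)
```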

  • training overview

Training initially uses oracle critiques. The oracle data is UltraLlama (proposed: UltraFeedback + UltraInteract subsets as prompts, with Llama-3-8B-Instruct generating responses), and Llama-3.1-405B-Instruct generates the critiques and judgments. (Oracle judgment creation prompt)

This is followed by training on self-generated critiques. It looks like this is done only once, though, not iterated N times?
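The two-stage recipe above (SFT on oracle critiques, then joint training on self-generated critiques) can be sketched as follows; all class and method names are illustrative stand-ins, not the authors' code:

```python
# Sketch of the two-stage training loop (illustrative names only).
def train_cloud(model, prompts, oracle_critiques, preference_pairs):
    # Stage 1: SFT on oracle critiques (from Llama-3.1-405B-Instruct).
    for prompt, critique in zip(prompts, oracle_critiques):
        model.sft_step(prompt, critique)
    # Stage 2: regenerate critiques with the model itself (on-policy),
    # then optimize SFT + RM losses jointly on the preference pairs.
    for chosen, rejected in preference_pairs:
        c_chosen = model.generate_critique(chosen)
        c_rejected = model.generate_critique(rejected)
        model.joint_step(chosen, c_chosen, rejected, c_rejected)

# Minimal stub that records calls, to show the control flow.
class StubModel:
    def __init__(self):
        self.sft_calls, self.joint_calls = [], []
    def sft_step(self, prompt, critique):
        self.sft_calls.append((prompt, critique))
    def generate_critique(self, example):
        return f"critique of {example}"
    def joint_step(self, chosen, c_chosen, rejected, c_rejected):
        self.joint_calls.append((chosen, rejected))
```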

Result

  • Does the CLoud technique actually help?

It is found to be effective across the board. I’m not sure it’s right to evaluate only the RM, though.

  • on-policy vs. off-policy : whether to keep using oracle critiques (off-policy) or switch to self-generated ones (on-policy)

On-policy is clearly more effective.

  • Self-consistency effects : have the model generate multiple reasoning traces (here, critiques), then average the scores produced behind them.

It didn’t help except on reasoning. Moreover, on ArenaHard it has no effect at all.
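For reference, the self-consistency scoring described above reduces to averaging N sampled scores; in this toy sketch, `sample_score` stands in for one critique-generation-plus-scoring call:

```python
# Toy sketch of self-consistency scoring: sample N critiques and
# average the reward scores produced behind each one.
def self_consistent_reward(question, answer, sample_score, n=8):
    scores = [sample_score(question, answer) for _ in range(n)]
    return sum(scores) / len(scores)
```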


Even within reasoning, it only helped when the required reasoning was 1–2 steps; otherwise, no gain.