TL;DR
- I read this because: it was mentioned in a video about o1
- task : Improve reward model
- PROBLEM : scores from an LLM-as-judge are interpretable; can’t a reward model be interpretable in the same way?
- idea : Have the RM first generate a critique, then attach a reward head that predicts the score conditioned on that critique.
- input/output : {question, answer} -> {critique, reward score}
- architecture : Llama-3-8B / 70B
- objective : SFT loss + RM loss (Bradley-Terry model)
- baseline : classic RM model
- data : UltraLlama (proposed: prompts from a subset of UltraFeedback + UltraInteract, responses generated by Llama-3-8B-Instruct), with Llama-3.1-405B-Instruct generating oracle critiques and judgments
- evaluation : pairwise preference classification on RewardBench; BoN (Best-of-N) win rate on ArenaHard
- result : the CLoud technique helps across all categories. On-policy training always beats off-policy. They also test self-consistency, but it only helps on reasoning.
- contribution : valuable in that it makes the RM interpretable? Not sure how widely it will be adopted.
- etc. :
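A minimal sketch of the stated objective, assuming the standard Bradley-Terry pairwise loss plus token-level cross-entropy on the critique (function and argument names are my own, and the exact weighting placement is an assumption):

```python
import torch
import torch.nn.functional as F

def cloud_loss(chosen_reward, rejected_reward, critique_logits, critique_labels, lam=1.0):
    """Sketch of a CLoud-style objective: Bradley-Terry RM loss plus
    SFT (cross-entropy) loss on the critique tokens, weighted by lam."""
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
    rm_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # SFT loss: next-token cross-entropy on the critique targets
    sft_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,  # mask out prompt/padding positions
    )
    return rm_loss + lam * sft_loss
```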
Details
- thumbnail
Simple: tell the model to generate a critique, feed the context back in with the generated critique appended, and attach a reward head on top.
Train the SFT loss (so the model learns to generate critiques) and the RM loss jointly.
($\lambda$ is reported as 5/4 for 8B and 3/4 for 70B)
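Presumably the combined objective has this shape (my reconstruction from the note above; which term $\lambda$ weights is an assumption):

```latex
\mathcal{L}
= \underbrace{-\log \sigma\big(r_\theta(x, y_{\text{chosen}}, c) - r_\theta(x, y_{\text{rejected}}, c)\big)}_{\text{RM loss (Bradley-Terry)}}
+ \lambda \, \underbrace{\mathcal{L}_{\text{SFT}}}_{\text{critique cross-entropy}}
```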
- training overview
First, train on oracle critiques.
The oracle data is UltraLlama (proposed: prompts from a subset of UltraFeedback + UltraInteract, responses generated by Llama-3-8B-Instruct), with Llama-3.1-405B-Instruct producing the critiques and judgments.
(Oracle judgment creation prompt)
Then train on self-generated critiques. It seems this is done only once, not iterated N times.
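The two data regimes above can be sketched as follows (function and field names are my own, not from the paper): off-policy batches pair each prompt with the oracle critique, while on-policy batches pair it with the model's self-generated critique.

```python
def build_training_batches(oracle_examples, model_generate, prompts):
    """Sketch of the two CLoud critique regimes.
    off-policy: critique targets come from the oracle (e.g. Llama-3.1-405B-Instruct).
    on-policy:  critique targets are self-generated by the RM being trained."""
    off_policy = [(ex["prompt"], ex["oracle_critique"]) for ex in oracle_examples]
    on_policy = [(p, model_generate(p)) for p in prompts]
    return off_policy, on_policy
```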
Result
- What’s the point of the CLoud technique?
It is effective across the board. Though I’m not sure evaluating only the RM is the right test.
- on-policy vs off-policy
(off-policy here means continuing to train on oracle critiques rather than self-generated ones)
On-policy is clearly more effective.
- Self-consistency effects
Sample multiple pieces of reasoning (here, critiques) and average the scores predicted behind them.
It didn’t help outside of reasoning, and on ArenaHard it had no effect at all.
Even within reasoning, it only helped when the solution took one or two reasoning steps; otherwise there was no gain.
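The averaging step above can be sketched as (the scorer callable and its signature are hypothetical):

```python
import statistics

def self_consistency_score(generate_critique_and_score, question, answer, n=8):
    """Self-consistency for a critique-generating RM (sketch): sample n
    critiques and average the reward predicted behind each one.
    `generate_critique_and_score` is a hypothetical callable that returns
    (critique_text, score) for one sampled critique."""
    scores = [generate_critique_and_score(question, answer)[1] for _ in range(n)]
    return statistics.mean(scores)
```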