TL;DR
- I read this because: it was mentioned in a video about o1
- task : Improve reward model
- PROBLEM : scores from an LLM-as-judge are interpretable; can’t a reward model be interpretable in the same way?
- idea : Have the RM first generate a critique, then attach a reward head that predicts the score conditioned on that critique.
- input/output : {question, answer} -> {critique, reward score}
- architecture : Llama-3-8B / 70B
- objective : SFT loss + RM loss (Bradley-Terry model)
- baseline : classic RM model
- data : UltraLlama (proposed: prompts from a subset of UltraFeedback + UltraInteract, responses generated by Llama-3-8B-Instruct), with Llama-3.1-405B-Instruct generating oracle critiques and judgments
- evaluation : pairwise preference classification on RewardBench; BoN (Best-of-N) win rate on ArenaHard
- result : the CLoud technique helps across all categories. On-policy training always beats off-policy. They also test self-consistency, but it only helps on reasoning.
- contribution : valuable in that it makes the RM interpretable? Not sure how widely it will be adopted.
- etc. :
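A minimal sketch of the stated objective, assuming the standard Bradley-Terry pairwise loss plus token-level cross-entropy on the critique (function and argument names are my own, and the exact weighting placement is an assumption):

```python
import torch
import torch.nn.functional as F

def cloud_loss(chosen_reward, rejected_reward, critique_logits, critique_labels, lam=1.0):
    """Sketch of a CLoud-style objective: Bradley-Terry RM loss plus
    SFT (cross-entropy) loss on the critique tokens, weighted by lam."""
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
    rm_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # SFT loss: next-token cross-entropy on the critique targets
    sft_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,  # mask out prompt/padding positions
    )
    return rm_loss + lam * sft_loss
```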
Details
- thumbnail
Simple: tell the model to generate a critique, feed the context back in with the generated critique appended, and attach a reward head on top.
Train the SFT loss (so the model learns to generate critiques) and the RM loss jointly.
($\lambda$ is reported as 5/4 for 8B and 3/4 for 70B)
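Presumably the combined objective has this shape (my reconstruction from the note above; which term $\lambda$ weights is an assumption):

```latex
\mathcal{L}
= \underbrace{-\log \sigma\big(r_\theta(x, y_{\text{chosen}}, c) - r_\theta(x, y_{\text{rejected}}, c)\big)}_{\text{RM loss (Bradley-Terry)}}
+ \lambda \, \underbrace{\mathcal{L}_{\text{SFT}}}_{\text{critique cross-entropy}}
```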
- training overview
First, train on oracle critiques.
The oracle data is UltraLlama (proposed: prompts from a subset of UltraFeedback + UltraInteract, responses generated by Llama-3-8B-Instruct), with Llama-3.1-405B-Instruct producing the critiques and judgments.
(Oracle judgment creation prompt)
Then train on self-generated critiques. It seems this is done only once, not iterated N times.
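The two data regimes above can be sketched as follows (function and field names are my own, not from the paper): off-policy batches pair each prompt with the oracle critique, while on-policy batches pair it with the model's self-generated critique.

```python
def build_training_batches(oracle_examples, model_generate, prompts):
    """Sketch of the two CLoud critique regimes.
    off-policy: critique targets come from the oracle (e.g. Llama-3.1-405B-Instruct).
    on-policy:  critique targets are self-generated by the RM being trained."""
    off_policy = [(ex["prompt"], ex["oracle_critique"]) for ex in oracle_examples]
    on_policy = [(p, model_generate(p)) for p in prompts]
    return off_policy, on_policy
```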
Result
- What’s the point of the CLoud technique?
It is effective across the board. Though I’m not sure evaluating only the RM is the right test.
- on-policy vs off-policy
(off-policy here means continuing to train on oracle critiques rather than self-generated ones)
On-policy is clearly more effective.
- Self-consistency effects
Sample multiple pieces of reasoning (here, critiques) and average the scores predicted behind them.
It didn’t help outside of reasoning, and on ArenaHard it had no effect at all.
Even within reasoning, it only helped when the solution took one or two reasoning steps; otherwise there was no gain.
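The averaging step above can be sketched as (the scorer callable and its signature are hypothetical):

```python
import statistics

def self_consistency_score(generate_critique_and_score, question, answer, n=8):
    """Self-consistency for a critique-generating RM (sketch): sample n
    critiques and average the reward predicted behind each one.
    `generate_critique_and_score` is a hypothetical callable that returns
    (critique_text, score) for one sampled critique."""
    scores = [generate_critique_and_score(question, answer)[1] for _ in range(n)]
    return statistics.mean(scores)
```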