TL;DR
- Why I read it: it's a survey paper, recommended by
Details
Divided into search, learning, policy initialization, and reward.
Policy Initialization
o1 has human-like reasoning behaviors.
An alternative proposal: related research includes a study called divergent CoT (https://arxiv.org/abs/2407.03181), which infers multiple CoTs in sequence (similar to journey learning in O1 Journey, or the sequential search discussed later).
Reward Design
At the top level, reward models break down into ORM (outcome) / PRM (process).
ORMs are represented by #209 and PRMs by Lightman et al. In a PRM, the "process" unit can be a token (https://arxiv.org/abs/2404.12358) or a step.
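The ORM/PRM split above can be sketched in a few lines. This is a hypothetical illustration, not any paper's API: `score_fn` stands in for a learned reward model.

```python
# Hypothetical sketch of the ORM vs. PRM distinction.
# `score_fn` is a placeholder for a learned reward model.

def orm_score(solution: str, score_fn) -> float:
    # Outcome reward model: one scalar for the whole solution.
    return score_fn(solution)

def prm_score(steps: list[str], score_fn) -> list[float]:
    # Process reward model: one scalar per intermediate step,
    # each conditioned on the prefix so far.
    scores = []
    for i in range(len(steps)):
        prefix = "\n".join(steps[: i + 1])
        scores.append(score_fn(prefix))
    return scores
```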
reward from env
- from realistic env: for code, the compiler or interpreter can be used as the reward source.
- from simulator environment: learning a verifier for a math problem and using it at test time is a kind of simulator. Learning the reward signal works, but the current policy differs from the policy the RM was trained on, causing a distribution-shift problem. To address this, world models (= also learning the transition function) are sometimes mentioned.
- from AI judgment
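For the realistic-env case, a minimal sketch of using the interpreter as the reward source: execute a candidate solution against test cases and return the pass rate. The `solution` function name and the test-case format are illustrative assumptions.

```python
# Sketch: interpreter-as-reward. Reward = fraction of test cases passed.
# The expected function name "solution" is an illustrative assumption.

def env_reward(candidate_src: str, test_cases: list[tuple[tuple, object]]) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # syntax/runtime error at load -> zero reward
    except Exception:
        return 0.0
    fn = namespace.get("solution")
    if fn is None:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply earns no credit
    return passed / len(test_cases)
```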
reward from data
- learn from preference data: DPO, or RM learning for PPO
- learn from expert data: inverse reinforcement learning is widely used in RL but hasn't been properly introduced to LLMs. Unlike preference data, expert data is easy to obtain, but it's a bit tricky to learn from because it requires adversarial training.
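The preference-data route can be made concrete with the DPO objective on a single pair. A hedged sketch: `logp_*` are summed token log-probs under the policy and the frozen reference model, and `beta` is the usual DPO temperature.

```python
import math

# Sketch of the DPO loss on one preference pair:
# -log sigmoid(beta * [(logpi(yw) - logref(yw)) - (logpi(yl) - logref(yl))])

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # numerically: -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log(1.0 + math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; pushing the chosen response's log-prob up (relative to the reference) drives the loss toward zero.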
reward shaping
An LLM's reward arrives only at the last token; redistributing it across intermediate tokens is called reward shaping.
Some literature uses Q-values for this, but other work argues this is bad because Q-values are policy-dependent.
Potential-based reward shaping is how RL has traditionally shaped rewards (https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf).
https://arxiv.org/pdf/2410.08146 builds a PRM from an ORM with a similar approach, and there's research showing that DPO already does potential-based reward shaping (https://arxiv.org/abs/2404.12358).
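Potential-based shaping (Ng et al., 1999, linked above) is one line: the shaped reward adds a discounted potential difference, and the optimal policy is unchanged for any choice of potential function phi.

```python
# Potential-based reward shaping:
#   r'(s, a, s') = r(s, a, s') + gamma * phi(s') - phi(s)
# Over a trajectory the phi terms telescope, so the optimal policy
# is invariant to the choice of phi.

def shaped_reward(r: float, phi_s: float, phi_s_next: float,
                  gamma: float = 1.0) -> float:
    return r + gamma * phi_s_next - phi_s
```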
**Conjecture about o1
- Probably uses multiple reward models: a PRM for math, for example.
- Seems to have a very robust RM, since fine-tuning is possible on a few-shot sample.
- The reward seems to be generated while the LLM generates, rather than via a value head. (Not sure about the reasoning behind these guesses...)
Search
BoN sampling: sample multiple candidates and use an RM to pick the best one. There are studies on speculative rejection, etc.
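Best-of-N is simple enough to sketch directly. `sample_fn` and `rm_score` are placeholders for the policy and the reward model.

```python
# Best-of-N: sample N candidates from the policy, score each with a
# reward model, keep the argmax. sample_fn/rm_score are placeholders.

def best_of_n(prompt: str, sample_fn, rm_score, n: int = 8) -> str:
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=rm_score)
```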
Beam Search: widely used. There are various follow-up studies, such as search with token-level reward from DPO (TreeBoN), search with values from a value model, etc. This reward-guided search is said to align better with downstream tasks.
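A sketch of the reward-guided variant at the step level: expand every beam with k candidate next steps, score partial sequences with a (process) reward model, and keep the top `beam_width`. `expand_fn` and `prm_score` are placeholder model calls, not any paper's API.

```python
# Step-level reward-guided beam search sketch.
# expand_fn(beam, k) proposes k candidate next steps for a partial
# solution; prm_score scores a partial sequence (placeholders).

def reward_guided_beam_search(prompt, expand_fn, prm_score,
                              beam_width=4, k=4, max_steps=10):
    beams = [[prompt]]
    for _ in range(max_steps):
        candidates = []
        for beam in beams:
            for step in expand_fn(beam, k):
                candidates.append(beam + [step])
        # keep the beam_width highest-scoring partial sequences
        beams = sorted(candidates, key=prm_score, reverse=True)[:beam_width]
    return beams[0]
```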
Monte Carlo Tree Search: advantageous in large search spaces. An algorithm that iterates through selection, expansion, evaluation, and backpropagation to balance exploration and exploitation. There are studies doing this at the token, step, and solution level (I haven't read them). Other search methods include DFS, BFS, A*, Tree-of-Thoughts, best-first search, etc.
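The four-phase loop above can be sketched compactly. This is a generic UCB-based MCTS over reasoning states, not a reproduction of any specific paper; `children_fn` (proposes next steps) and `value_fn` (evaluates a state) are placeholders.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    # unvisited children are explored first
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, children_fn, value_fn, n_iter=100):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # selection: descend by UCB while the node is expanded
        while node.children:
            node = max(node.children, key=ucb)
        # expansion: attach candidate next states
        for s in children_fn(node.state):
            node.children.append(Node(s, parent=node))
        # evaluation: score the selected node's state
        v = value_fn(node.state)
        # backpropagation: update statistics up to the root
        while node is not None:
            node.visits += 1
            node.value += v
            node = node.parent
    # return the most-visited root action
    return max(root.children, key=lambda n: n.visits).state
```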
Sequential Revision: sequentially refining past answers is called sequential revision. SELF-REFINE and Snell et al. are representative studies. Whether it really works is debatable: some argue models can't self-correct without external feedback (https://arxiv.org/abs/2310.01798, ICLR), while others say discriminating is easier than generating (https://aligned.substack.com/p/ai-assisted-human-feedback), so it could work. (Like talking about zero-shot without fine-tuning.)
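The refine loop itself is a few lines. A SELF-REFINE-style sketch with placeholder model calls (`generate`, `critique`, `revise` are assumptions, not the paper's API); note the critique step is exactly the external-feedback channel the skeptical paper says is needed.

```python
# Sequential revision sketch: generate, critique, revise, repeat.
# critique returns None when it is satisfied (placeholder convention).

def sequential_revision(prompt, generate, critique, revise, max_rounds=3):
    answer = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, answer)
        if feedback is None:  # critic is satisfied: stop refining
            break
        answer = revise(prompt, answer, feedback)
    return answer
```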
Tree search + sequential revision: https://arxiv.org/abs/2406.07394 is a representative study. It runs Snell-et-al.-style sequential revision on n samples and combines them BoN-style with a verifier.
**Conjecture about o1: For train-time search, the data was presumably created with tree search or BoN, and for domains like code/math, probably using an external environment. For test-time search, sequential revision was likely used; tree search was not, because it's too much overhead for inference.
Learning
**Conjecture about o1: Behavior cloning is more effective, so presumably it was done first, followed by DPO or PPO, applied iteratively to improve performance (similar to Llama 3, #205).