TL;DR
- I read this because : ORM (Outcome Reward Model) is mentioned a lot. I'm not sure this is the exact paper being referred to, but it is cited in the Omega PRM paper.
- task : LLM in math problem solving
- problem : LMs have made a lot of progress, but they still struggle with multi-step mathematical reasoning.
- idea : Propose a dataset. After finetuning, sample 100 solutions per problem, label them correct/incorrect, and train a verifier on them. At test time, sample several solutions and select the one the verifier scores highest as the final answer.
- architecture : GPT3 6B / 175B
- objective : scalar head on the LM for the verifier (maybe BCE loss?) / CE loss for finetuning
- baseline : finetuning
- data : GSM8K (proposed)
- evaluation : test set solve rate
- result : the 175B finetuned model outperforms the 6B one
- contribution : GSM8K proposal / multi-step math reasoning problem solved? / predecessor of RFT…?
- etc. :
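The sample-then-verify idea above can be sketched as best-of-n selection. Everything here (`toy_generate`, `toy_score`) is a hypothetical stand-in, not the paper's actual GPT-3 sampler or verifier:

```python
def best_of_n(problem, generate, verifier_score, n=100):
    """Sample n candidate solutions and keep the verifier's top pick."""
    candidates = [generate(problem, i) for i in range(n)]
    return max(candidates, key=verifier_score)

# Toy stand-ins for illustration only (not a real LM or verifier):
def toy_generate(problem, seed):
    return f"solution-{seed % 5}"  # pretend sampled completion

def toy_score(candidate):
    return int(candidate.split("-")[1])  # pretend verifier score

print(best_of_n("2+2?", toy_generate, toy_score, n=10))  # → solution-4
```

At test time the paper samples many completions per problem and returns the verifier's highest-scoring one, which is exactly this `max` over candidates.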
Details
The verifier overfits quickly on the 100 sampled solutions per problem, so training is limited to 2 epochs.
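A minimal sketch of the verifier objective as the note guesses it (BCE on a scalar head), using a toy logistic model rather than the actual GPT-3 setup; the 2-epoch limit mirrors the overfitting note above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross-entropy for one (prediction, label) pair."""
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

# Toy labeled samples: (feature, label), label 1 = solution judged correct.
data = [(2.0, 1), (1.5, 1), (-1.0, 0), (-2.5, 0)]

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(2):  # only 2 epochs, since the verifier overfits fast
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += bce(p, y)
        # gradient of BCE w.r.t. the logit is (p - y)
        w -= lr * (p - y) * x
        b -= lr * (p - y)
    print(f"epoch {epoch}: mean BCE = {total / len(data):.3f}")
```

The scalar head in the paper plays the role of `w * x + b` here: a single score per solution, squashed to a correctness probability.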