
paper

TL;DR

  • I read this because… : ORM (Outcome Reward Model) comes up frequently. I'm not sure this is the exact paper being referred to, but it is cited in the OmegaPRM paper.
  • task : LLMs for math word problem solving
  • problem : LMs have made a lot of progress, but they still struggle with multi-step mathematical reasoning.
  • idea : Propose a dataset. After finetuning, sample 100 solutions per problem, label them correct/incorrect, and train a verifier on them. At test time, sample several solutions and select the one the verifier scores highest as the final answer.
  • architecture : GPT-3 6B / 175B
  • objective : cross-entropy loss for the LM; a scalar head for the verifier (BCE loss, maybe?)
  • baseline : finetuning only (no verifier)
  • data : GSM8K (proposed in this paper)
  • evaluation : test-set solve rate
  • result : the 175B finetuned model outperforms the 6B one
  • contribution : proposed GSM8K / progress toward multi-step math reasoning / a predecessor of RFT…?
  • etc. :
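The sample-then-verify selection described in the idea bullet can be sketched as follows. `generate_solutions` and `verifier_score` are hypothetical stand-ins for the finetuned generator and the trained verifier, not the paper's actual code:

```python
# Best-of-N selection with a verifier: sample N candidate solutions,
# score each with the verifier, and return the highest-scoring one.
def best_of_n(problem, generate_solutions, verifier_score, n=100):
    candidates = generate_solutions(problem, n)               # n sampled solutions
    scores = [verifier_score(problem, c) for c in candidates]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx]

if __name__ == "__main__":
    # Toy demo with stub generator/verifier (assumptions, not the paper's models)
    gen = lambda p, n: [f"solution {i}" for i in range(n)]
    score = lambda p, c: int(c.split()[-1])  # stub score: the trailing index
    print(best_of_n("2 + 2 = ?", gen, score, n=3))
```

The paper samples many (e.g. 100) completions per test problem, so the final accuracy depends on both the generator's sample diversity and the verifier's ranking quality.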

Details


The verifier overfits quickly when trained on the 100 sampled solutions per problem, so verifier training is limited to 2 epochs.
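On the verifier objective: the note above guesses BCE, which would fit a scalar head that outputs the probability a solution is correct. A minimal sketch of that loss in pure Python (my assumption about the loss form, not confirmed from the paper):

```python
import math

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy for a verifier's scalar probability p
    against the correctness label y (1 = correct, 0 = incorrect)."""
    p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction gives low loss; a confident wrong one, high loss.
```

With labels coming from answer-checking the 100 sampled solutions, each (problem, solution) pair becomes one binary training example for the verifier.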