TL;DR
- I read this because.. : it is speculated that the “STaR” in the rumored Q* refers to this paper.
- task : problem solving
- problem : Wouldn’t the model perform better if it learned to generate rationales?
- idea : Since heuristics can only go so far, let the model generate its own rationales. If it can’t reach the correct answer, hint it with the correct answer (rationalization).
- input/output : Q -> rationale -> A
- architecture : GPT-J
- objective : CE loss
- baseline : direct answer tuned GPT-J, Few-shot GPT-J, Few-shot LaMDA 137B
- data : (source) GSM8K, CommonsenseQA, arithmetic problems
- evaluation : accuracy
- result : Accuracy improves faster, and the model solves problems it previously could not (final accuracy increases).
- contribution : self-improvement? self-evolution? emphasizing the role of rationales?
- etc. :
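The Q -> rationale -> A setup above can be sketched as a single training string with CE loss over the whole sequence (the template and function name are my own assumptions, not from the paper):

```python
# Minimal sketch of the Q -> rationale -> A training format (the
# template and function name are assumptions, not the paper's exact format).
def format_example(question, rationale, answer):
    """Concatenate Q, rationale, and A into one training string.

    Cross-entropy loss is taken over the whole sequence, so the model
    learns to emit the rationale before the final answer.
    """
    return f"Q: {question}\nA: {rationale} The answer is {answer}.\n"

example = format_example(
    "If 3 cars each have 4 wheels, how many wheels are there?",
    "Each of the 3 cars has 4 wheels, so there are 3 * 4 = 12 wheels.",
    "12",
)
```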
Details
STaR
Two key details: 1) hints (rationalization) are used only for questions the model answers incorrectly; 2) each round of fine-tuning starts from the original base model, not from the previous checkpoint. Does the quality of the rationales keep improving across iterations? This seems a little different from other models…
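The two details above fit into the STaR outer loop roughly as follows. This is a toy sketch: `finetune` here is a stub that just memorizes kept examples, whereas a real run samples rationales from an LLM and does gradient fine-tuning; all names are my own.

```python
# Toy sketch of the STaR outer loop. `finetune` is a stub that memorizes
# the kept examples; a real implementation samples rationales from an LLM
# and runs gradient fine-tuning. Names are assumptions, not the paper's.

def finetune(base_model, train_set):
    """Stub fine-tuning: returns a model that 'knows' the kept examples."""
    memory = {q: (r, a) for q, r, a in train_set}

    def model(question, hint=None):
        if question in memory:        # learned during "fine-tuning"
            return memory[question]
        if hint is not None:          # rationalization toward the hinted answer
            return f"reasoning toward {hint}", hint
        return "no idea", None        # unsolved without a hint

    return model


def star(base_model, dataset, n_iterations=3):
    model = base_model
    for _ in range(n_iterations):
        train_set = []
        for question, answer in dataset:
            rationale, predicted = model(question)
            if predicted != answer:
                # Detail 1: hint only the questions answered incorrectly.
                rationale, predicted = model(question, hint=answer)
            if predicted == answer:   # keep only rationales reaching the answer
                train_set.append((question, rationale, answer))
        # Detail 2: fine-tune from the ORIGINAL base model each iteration,
        # not from the previous checkpoint.
        model = finetune(base_model, train_set)
    return model


base = finetune(None, [])  # "base model" with empty memory
trained = star(base, [("2+2", "4"), ("3+3", "6")], n_iterations=2)
```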
The paper claims that filtering out incorrect rationales makes the training process approximate an RL (policy-gradient) objective.
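As I recall the paper's argument (reconstructed from memory, so indices and notation may differ slightly), the objective being approximated rewards sampled rationale–answer pairs only when the answer is correct:

```latex
J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
  \left[ \mathbb{1}(\hat{y}_i = y_i) \right]

\nabla J = \sum_i \mathbb{E}
  \left[ \mathbb{1}(\hat{y}_i = y_i)\, \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right]
```

Keeping only rationales that reach the correct answer is exactly the indicator zeroing out the gradient contribution of incorrect samples.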
Result
In the results plot, color indicates the number of digits in the arithmetic problem.
The model can solve digit counts it never saw during training.