
paper

TL;DR

  • I read this because.. : surveying the reasoning area; this one was recommended to me
  • task : try out test-time scaling
  • architecture : Llama 3.2 1B instruct / Llama 3.2 3B instruct / Llama 3.1 70B
  • baseline : (infer) zero-shot CoT / (PRM) Math-Shepherd
  • data : (PRM) RLHFlow/Llama3.1-8B-PRM-Deepseek-Data
  • evaluation : MATH-500 accuracy
  • result : with PRM-guided search, small models (1B/3B) can beat 8B zero-shot CoT on MATH-500
  • contribution : organizes and benchmarks test-time scaling methods

Details

Strategies for test-time compute scaling

  • self-refinement: models refine their own outputs or “thoughts” by identifying and correcting errors in subsequent iterations; this requires the model to have a built-in refinement capability.
  • Search Against a Verifier: generation models produce multiple answers and a separate verifier selects among them. The verifier could be hard-coded, but here a learned verifier is assumed (https://github.com/long8v/PTIR/issues/209). Given a verifier, it can be used in BoN or tree search.

๊ฐ€๋Šฅํ•œ ๋ฐฉ๋ฒ•๋“ค

  • Best-of-N: Generate multiple responses per problem and assign scores to each candidate answer, typically using a reward model. Then select the answer with the highest reward (or a weighted variant discussed later). This approach emphasizes answer quality over frequency.
  • Beam search: A systematic search method that explores the solution space, often combined with a process reward model (PRM) to optimise both the sampling and evaluation of intermediate steps in problem-solving. Unlike conventional reward models that produce a single score on the final answer, PRMs provide a sequence of scores, one for each step of the reasoning process. This ability to provide fine-grained feedback makes PRMs a natural fit for search methods with LLMs.
  • Diverse verifier tree search (DVTS): An extension of beam search we developed that splits the initial beams into independent subtrees, which are then expanded greedily using a PRM. This method improves solution diversity and overall performance, particularly with larger test-time compute budgets.

Experimental setup


Result

  • majority voting / self-consistency
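
As a reference point, majority voting (self-consistency) simply picks the most frequent final answer among N sampled completions; a minimal sketch (helper-free, names are mine):

```python
from collections import Counter

def majority_vote(answers):
    # answers: final answers extracted from N sampled completions.
    # Self-consistency returns the most frequent one, ignoring any scores.
    return Counter(answers).most_common(1)[0][0]
```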

  • Best-of-N: vanilla Best-of-N picks the answer the RM scores highest, while weighted Best-of-N sums the RM scores across identical answers and picks the answer with the highest total. An ORM score is typically used here, but a PRM is used for comparison.
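
A minimal sketch of the two selection rules (function names are mine, not from the post):

```python
from collections import defaultdict

def best_of_n(candidates):
    # candidates: list of (answer, rm_score) pairs from N completions.
    # Vanilla BoN: take the single highest-scored answer.
    return max(candidates, key=lambda c: c[1])[0]

def weighted_best_of_n(candidates):
    # Weighted BoN: sum RM scores over identical answers, then take the
    # answer with the highest total, rewarding both quality and frequency.
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)
```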


์ด๋•Œ PRM์€ step๋ณ„๋กœ ์ ์ˆ˜๊ฐ€ ๋‚˜์˜ค๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋–ค score๋ฅผ ์“ธ์ง€๋„ ๋ฌธ์ œ์ธ๋ฐ ์ด์— ๋Œ€ํ•ด์„œ๋Š” deep mind ๋…ผ๋ฌธ ์—์„œ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ Last๊ฐ€ ๊ฐ€์žฅ ์ข‹์•˜์Œ


๊ฐœ์ค‘์—๋Š” weighted BoN์ด ๊ฐ€์žฅ ์ข‹์•˜๋‹ค. ๋‹ค๋งŒ ์•„์ง 8B zs-cot๋ฅผ ๋ชป์ด๊ธด๋‹ค.

  • Beam search with process reward models (a minimal sketch in code follows the steps)
  1. Run beam search, maintaining N beams at every step.
  2. Split steps using a stopping criterion defined as \n or \n\n.
  3. Score each step with the PRM and select the top N out of the M candidates.
  4. Repeat step 3.
  5. Continue until EOS or the maximum search depth is reached.

hparm์€ ์•„๋ž˜์™€ ๊ฐ™์Œ

  • N beams with compute budgets of 4, 16, 64, 256
  • Fixed beam width M = 4
  • Sampling with temperature T = 0.8
  • Up to 40 iterations, i.e. a tree with a maximum depth of 40 steps.

์ด๋ ‡๊ฒŒ ํ•˜๋‹ˆ๊นŒ 8B๋ฅผ ์ด๊ฒผ๊ณ  ์ ˆ๋Œ€์ ์ธ MATH ์ ์ˆ˜ ์ž์ฒด๋„ ๊ดœ์ฐฎ์€ ํŽธ์ด๋ผ๊ณ  ํ•จ (CS PhD ์ ์ˆ˜ ํ‰๊ท ์ด 0.4๋ผ๊ณ  ํ•จ(์—„์ฒญ ์–ด๋ ต๋„ค..))

  • When does beam search work well?

์ „๋ฐ˜์ ์œผ๋กœ Majority voting ์€ ๊ณ„์‚ฐ ๋ณต์žก๋„ ๋Œ€๋น„ ๊ฐ€์žฅ ๊ตฌ๋ฆผ. ์–ด๋ ค์šธ ์ˆ˜๋ก beam search๊ฐ€ ๋” ์ž˜ํ•˜์ง€๋งŒ ๋‚œ์ด๋„ 1~2์— ๋Œ€ํ•ด์„œ๋Š” BoN์ด๋‚˜ ์‹ฌ์ง€์–ด majority voting ๋ณด๋‹ค ์•ˆ์ข‹์Œ

  • DVTS: boosting performance with diversity

tree ๋ž‘ ๋ญ๊ฐ€ ๋‹ค๋ฅด๋ƒ๋ฉด M๊ฐœ๋ฅผ ํ™•์žฅํ•  ๋•Œ tree๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ ํ‚ค์šฐ๋Š”๊ฒŒ ๋‹ค๋ฅธ๋“ฏ (e.g. “A world” vs “A happy” ์ด๋ ‡๊ฒŒ ๋‘๊ฐœ์˜ path๊ฐ€ ์‚ด์•„ ์žˆ์„ ๋•Œ N๊ฐœ๋ฅผ ๋ฝ‘์„ ๋•Œ A world๊ฐ€ ๋” ์œ ๋งํ•ด์„œ N๊ฐœ๊ฐ€ ๋ฝ‘ํž ์ˆ˜ ์žˆ๋Š”๋ฐ, ๊ทธ๋ ‡๊ฒŒ ๋ง๊ณ  ๋…๋ฆฝ์ ์œผ๋กœ N๊ฐœ๋ฅผ ๋‚˜๋ˆ ์„œ expandํ•˜๋Š” ๊ธฐ๋ฒ•์ธ๋“ฏ)


์ด๋ ‡๊ฒŒ ํ–ˆ์„ ๋•Œ answer๊ฐœ์ˆ˜๊ฐ€ ์ปค์งˆ ๋•Œ scaling์ด ๋” ์ž˜๋จ! image

๋‚œ์ด๋„๋ณ„๋กœ ๋ถ„๋ฆฌํ–ˆ์„ ๋•Œ DVTS๋Š” ๋‚œ์ด๋„๊ฐ€ ๋‚ฎ๊ณ  N์ด ํด ๋•Œ ๋” ์ž˜ํ•˜๊ณ  beam search๋Š” ๋‚œ์ด๋„์— ์ƒ๊ด€์—†์ด N์ด ์ž‘์„ ๋•Œ ๋” ์ž˜ํ–ˆ๋‹ค ์ถ”๊ฐ€๋กœ ๊ฐ€์žฅ ์–ด๋ ค์šด ๋‚œ์ด๋„์—์„œ๋Š” beam search๊ฐ€ ์ „๋ฐ˜์ ์œผ๋กœ ๋” ์ž˜ํ•˜๋„น

  • compute optimal scaling

The concept comes from wanting to pick the best strategy given a compute budget N. Since this is hard to compute directly, the DeepMind work treats problem difficulty as given, finds the best-performing approach at each difficulty level, and then selects strategies accordingly (??). The performance when doing this:
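
As I understand it, the selection reduces to a lookup built from held-out results; a hypothetical sketch (the results table and strategy names are my stand-ins, not the post's code):

```python
def compute_optimal_strategy(difficulty, budget, results,
                             strategies=("best_of_n", "beam_search", "dvts")):
    # results: accuracy measured per (difficulty, budget, strategy) on a
    # held-out set. The "compute-optimal" policy just picks, per cell,
    # whichever strategy scored best there.
    return max(strategies, key=lambda s: results[(difficulty, budget, s)])
```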

  • Scaling up to larger models

  • next steps

  1. The Power of Strong Verifiers: Strong verifiers play a critical role in enhancing performance. However, their current limitations are apparent, as highlighted in benchmarks like ProcessBench. Improving the robustness and generalization of verifiers will be crucial for advancing these methods.

  2. The Challenge of Self-Verification: The ultimate goal—or “holy grail”—is achieving self-verification, where models can validate their own outputs autonomously. This approach appears to be what models like o1 are doing, but remains difficult to implement in practice. Unlike standard supervised fine-tuning (SFT), self-verification demands more nuanced strategies. The recent DeepMind paper on self-verification and SCoRe sheds light on this challenge and offers a pathway for future research.

  3. Integrating โ€œThoughtsโ€ into the Process: Incorporating explicit intermediate steps or โ€œthoughtsโ€ during generation could further enhance reasoning and decision-making. By integrating structured reasoning into the search process, we may unlock better performance on complex tasks.

  4. Search as a Data Generation Tool: This method can also serve as a powerful data generation process, creating high-quality training datasets. For example, fine-tuning models like Llama 1B on correct traces produced by search could yield significant gains. This on-policy approach resembles techniques like ReST or V-StaR but with the added benefits of search, offering a promising direction for iterative improvement.

  5. A Call for More PRMs: Open process reward models (PRMs) are relatively rare, limiting their broader application. Developing and sharing more PRMs for different domains is a critical area where the community can contribute significantly.

  6. Expanding Beyond Verifiable Domains: While current methods excel in domains like math and code, where solutions are inherently verifiable, extending these techniques to other areas remains a major challenge. How can we adapt these strategies for less structured or subjective tasks? This is a vital question for future exploration.