TL;DR
- Why I read this: reasoning-side survey; recommended by
- Task: test-time compute scaling
- Architecture: Llama 3.2 1B Instruct / Llama 3.2 3B Instruct / Llama 3.1 70B
- Baseline: (inference) zero-shot CoT; (PRM) Math-Shepherd
- Data: (PRM) RLHFlow/Llama3.1-8B-PRM-Deepseek-Data
- Evaluation: MATH-500 accuracy
- Result:
- Contribution: organizing and testing test-time scaling methods
Details
Strategies for test-time compute scaling
- Self-refinement: the model refines its own outputs or “thoughts” by identifying and correcting errors in subsequent iterations. Requires the model to have built-in refinement capabilities.
- Search against a verifier: generate multiple candidate answers and use a separate verifier to choose among them. The verifier could be hard-coded, but here we assume a learned verifier (https://github.com/long8v/PTIR/issues/209). Given a verifier, it can drive Best-of-N or tree search.
Possible methods
- Best-of-N: Generate multiple responses per problem and assign scores to each candidate answer, typically using a reward model. Then select the answer with the highest reward (or a weighted variant discussed later). This approach emphasizes answer quality over frequency.
- Beam search: A systematic search method that explores the solution space, often combined with a process reward model (PRM) to optimise both the sampling and evaluation of intermediate steps in problem-solving. Unlike conventional reward models that produce a single score on the final answer, PRMs provide a sequence of scores, one for each step of the reasoning process. This ability to provide fine-grained feedback makes PRMs a natural fit for search methods with LLMs.
- Diverse verifier tree search (DVTS): An extension of beam search we developed that splits the initial beams into independent subtrees, which are then expanded greedily using a PRM. This method improves solution diversity and overall performance, particularly with larger test-time compute budgets.
Experimental setup
Result
majority voting / Self-consistency
Best-of-N Vanilla Best-of-N has the reward model select the highest-scoring candidate as the answer; weighted Best-of-N sums the RM scores across candidates that share the same final answer and picks the answer with the highest total. An ORM score is the usual choice here, but a PRM is used for comparison.
In this case the PRM gives one score per step, so the question is how to reduce them to a single score; as in the DeepMind paper, using the last step’s score works best.
Weighted BoN was the best of the bunch. However, it still doesn’t beat 8B zero-shot CoT.
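The two selection rules above can be sketched as follows; this is a minimal illustration, assuming each candidate comes with its final answer and a list of per-step PRM scores (the data layout is my assumption, not the post's code). Following the last-step finding, a candidate's score is its final step's PRM score.

```python
from collections import defaultdict

def best_of_n(candidates, weighted=True):
    """Select an answer from N candidates scored by a PRM.

    `candidates` is a list of (final_answer, step_scores) pairs, where
    step_scores are the PRM scores of each reasoning step. A candidate's
    overall score is its last step's score.
    """
    if not weighted:
        # Vanilla BoN: pick the single highest-scoring candidate's answer.
        return max(candidates, key=lambda c: c[1][-1])[0]
    # Weighted BoN: sum scores over candidates sharing the same final
    # answer, so answers that are both frequent and well-scored win.
    totals = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += step_scores[-1]
    return max(totals, key=totals.get)
```

Note how weighted BoN can prefer an answer whose individual samples score lower than the single best candidate, as long as it appears often enough.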
- Beam search with process reward models
- Keep N beams alive at each search step.
- Split generations into steps using a stopping criterion such as "\n" or "\n\n".
- Score each candidate step with the PRM and select the top N out of the M expansions.
- Repeat the expand-and-score loop.
- Proceed until EOS or the maximum search depth is reached.
The hyperparameters look like this:
- N beams at compute scales of 4, 16, 64, 256
- Fixed beam width M = 4
- Sampling with temperature T = 0.8
- Up to 40 iterations, i.e. a tree of maximum depth 40
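The loop above can be sketched in Python. This is a hedged sketch, not the original implementation: `generate_step`, `prm_score`, and the `[EOS]` end marker are illustrative stand-ins for the sampler (one step per call, split at "\n\n"), the PRM scorer, and end-of-sequence detection.

```python
def beam_search(prompt, generate_step, prm_score, n=4, m=4, max_iters=40):
    """Step-level beam search guided by a PRM (illustrative sketch).

    generate_step(partial) samples one reasoning step continuing a
    partial solution; prm_score(partial) scores its newest step.
    Keeps n beams; every live beam is expanded with m candidate steps.
    """
    beams, finished = [prompt], []
    for _ in range(max_iters):
        # Expand every live beam with m sampled next steps (T ~ 0.8).
        candidates = [b + generate_step(b) for b in beams for _ in range(m)]
        # Completed solutions leave the search frontier.
        finished += [c for c in candidates if c.endswith("[EOS]")]
        live = [c for c in candidates if not c.endswith("[EOS]")]
        # Score candidates with the PRM and keep the top n beams.
        beams = sorted(live, key=prm_score, reverse=True)[:n]
        if not beams:
            break
    return max(finished or beams, key=prm_score)
```

With n = m the frontier stays at n beams per depth, matching the fixed-budget setup described above.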
The post reports beating the 8B baseline, and the absolute MATH score is respectable (a CS PhD student averages around 0.4; that benchmark is really hard…).
- When does beam search work well?
Overall, majority voting gives the worst return on compute. The harder the problem, the better beam search does, but on difficulty levels 1-2 it is worse than BoN or even majority voting.
- DVTS: boosting performance with diversity
The difference from plain beam search is that the subtrees are grown independently. For example, with two live prefixes “A world” and “A happy”, if “A world” looks more promising, plain beam search may spend all N paths on it; DVTS instead splits the N paths into independent subtrees and expands each one on its own, preserving diversity.
This scales better as the number of answers grows!
When separated by difficulty, DVTS did better when difficulty was low and N was large, and beam search did better when N was small, regardless of difficulty. Additionally, beam search does better overall on the hardest difficulty level.
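A minimal sketch of the DVTS idea, reusing the same illustrative `generate_step` / `prm_score` / `[EOS]` stand-ins as above (assumptions, not the post's code): the n-beam budget is split into n // m independent subtrees, each expanded greedily with the PRM.

```python
def dvts(prompt, generate_step, prm_score, n=16, m=4, max_iters=40):
    """Diverse verifier tree search (sketch): split the n-beam budget
    into n // m independent subtrees so that one promising prefix
    cannot absorb the whole budget."""
    solutions = []
    for _ in range(n // m):  # one independent subtree per budget slot
        beam = prompt
        for _ in range(max_iters):
            # Greedy expansion: sample m candidate steps, keep the best.
            steps = [beam + generate_step(beam) for _ in range(m)]
            beam = max(steps, key=prm_score)
            if beam.endswith("[EOS]"):
                break
        solutions.append(beam)
    # Pick the final answer across subtrees, e.g. by PRM score.
    return max(solutions, key=prm_score)
```

Because each subtree commits greedily to its own best step, no cross-subtree comparison ever reallocates paths, which is exactly what keeps the solutions diverse.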
- compute optimal scaling
The idea is to choose the best strategy for a given compute budget N. Since this is hard to determine directly, the DeepMind researchers took the difficulty level as given, picked the best-performing approach for each difficulty level, and used that (??). Doing this, performance improves over any single fixed strategy.
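The per-difficulty selection can be pictured as a simple lookup; the method names and thresholds below are illustrative assumptions consistent with the observations above (majority voting acceptable only on easy problems, beam search at small N and hardest difficulty, DVTS at large N), not the paper's actual table.

```python
def pick_strategy(difficulty, budget):
    """Compute-optimal strategy selection (illustrative sketch).

    difficulty: 1 (easy) .. 5 (hard); budget: number of generations N.
    """
    if difficulty <= 2:
        # Easy problems: cheap voting suffices at small budgets.
        return "majority" if budget <= 16 else "weighted_bon"
    if budget <= 16:
        return "beam_search"  # beam search shines at small N
    # Large budgets: DVTS's diversity pays off, except at the hardest level.
    return "dvts" if difficulty <= 4 else "beam_search"
```

In the actual setup, difficulty itself must be estimated (e.g. from verifier scores), which is part of why this recipe is hard to apply directly.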
Scaling up to larger models
next steps
The Power of Strong Verifiers: Strong verifiers play a critical role in enhancing performance. However, their current limitations are apparent, as highlighted in benchmarks like ProcessBench. Improving the robustness and generalization of verifiers will be crucial for advancing these methods.
The Challenge of Self-Verification: The ultimate goal—or “holy grail”—is achieving self-verification, where models can validate their own outputs autonomously. This approach appears to be what models like o1 are doing, but remains difficult to implement in practice. Unlike standard supervised fine-tuning (SFT), self-verification demands more nuanced strategies. The recent DeepMind paper on self-correction, SCoRe, sheds light on this challenge and offers a pathway for future research.
Integrating “Thoughts” into the Process: Incorporating explicit intermediate steps or “thoughts” during generation could further enhance reasoning and decision-making. By integrating structured reasoning into the search process, we may unlock better performance on complex tasks.
Search as a Data Generation Tool: This method can also serve as a powerful data generation process, creating high-quality training datasets. For example, fine-tuning models like Llama 1B on correct traces produced by search could yield significant gains. This on-policy approach resembles techniques like ReST or V-StaR but with the added benefits of search, offering a promising direction for iterative improvement.
A Call for More PRMs: Open process reward models (PRMs) are relatively rare, limiting their broader application. Developing and sharing more PRMs for different domains is a critical area where the community can contribute significantly.
Expanding Beyond Verifiable Domains: While current methods excel in domains like math and code, where solutions are inherently verifiable, extending these techniques to other areas remains a major challenge. How can we adapt these strategies for less structured or subjective tasks? This is a vital question for future exploration.
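The search-as-data-generation idea above (ReST / V-StaR style) can be sketched as a simple filtering loop; `search_fn` and `check_answer` are hypothetical hooks for the search procedure and the correctness check, not names from any of the cited works.

```python
def build_sft_dataset(problems, search_fn, check_answer):
    """Use search as a data engine: run search on each problem, keep
    only traces whose final answer verifies, and emit (prompt,
    completion) pairs for fine-tuning a model such as Llama 1B."""
    dataset = []
    for problem in problems:
        trace = search_fn(problem)        # e.g. beam search with a PRM
        if check_answer(problem, trace):  # keep only correct traces
            dataset.append({"prompt": problem, "completion": trace})
    return dataset
```

Because the traces come from the model's own search, fine-tuning on them stays close to on-policy, which is the property those iterative-improvement methods rely on.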