Compute-Optimal Scaling
Compute-Optimal Scaling is the notion of selecting the optimal configuration (beam width, search budget, etc.) dynamically / for binned question.
Approaches to “Scaling Test-Time Compute”
Three primary approaches:
- best-of-n: roll out a bunch, reject
- Beam Search: check against intermediate
- lookahead search: MCTSish (do lookahead rollouts)
Key insight
- On easy qusetion, beam search shows over-optimization and best of n is good
- on medium/hard questions, beam search is better
Lookahead seems bad?
Method
Learning a Value Function
[Wang 2312.08935]
Sequential vs. Parallel Sampling in Scaling Test Time Compute
What’s the trade-off between sampling a bunch in parallel vs. thinking sequentially.
Key insight
- on easy questions, sequential is best
- on harder questions, there’s a good ratio of trade offs
Test Time vs. Pretraining compute
Key insight
- on easy questions, test time scaling is good
- on medium/hard questions, pretraining scaling is better
Questions
- why do you think lookahead search works worse/not better than others? is there some problem specific heuristics that trigger this?