Appendix B. Guest Lecture: Noam Brown — RL for LLM Reasoning

Shurui Liu

From Training Scaling to Inference Scaling

The lecture's central theme is that AI progress has historically come from scaling data and training compute, but reasoning models add another scaling axis: test-time or inference compute.

Standard chat models often answer quickly. Reasoning models spend more computation per problem, using internal chains of thought, search, verification, or multi-sample aggregation. This mirrors earlier AI systems for poker and Go, where planning and search at decision time produced large capability gains.

This changes the usual way of comparing systems. A single fixed benchmark score hides the inference budget used to obtain it. For reasoning systems, accuracy should be read together with dollars, tokens, wall-clock time, parallelism, and the number of attempts or search nodes.

Lessons from Poker and Go

Poker AI demonstrated the importance of planning under imperfect information. Techniques such as subgame solving allowed systems to reason more deeply in relevant parts of the game tree.

AlphaGo and AlphaGo Zero demonstrated that learned models plus search can outperform direct policy execution. In Go, test-time search substantially improved performance over a policy network alone.

The general lesson is Sutton's "bitter lesson": scalable methods that leverage computation, especially search and learning, tend to win in the long run.

The poker example is especially relevant because poker is imperfect-information and adversarial. A system cannot simply search a fully observed game tree from the current state. It must reason over hidden information, beliefs, and strategically important subgames. The Go example makes the complementary point: even with a strong learned policy, search at inference time can convert a good local move predictor into a much stronger decision maker.

Test-Time Scaling for LLMs

Several strategies scale inference compute:

  • Chain-of-thought prompting: induce the model to produce intermediate reasoning.
  • Best-of-$N$: sample many solutions and select one using a verifier or reward model.
  • Majority vote / consensus: sample many answers and take the most common.
  • Search over reasoning traces: explore and evaluate intermediate solution paths.
  • RL-trained reasoning policies: train models to produce longer or better reasoning traces.

Consensus can improve accuracy but may saturate, either because samples are correlated or because the correct answer is not the most common one. In the lecture's Minerva example, consensus improved accuracy on MATH, but the slide also emphasizes that the gain flatlines before 100 samples. More structured reasoning and verification can continue scaling where naive majority vote stalls.
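The difference between consensus and verifier-based selection can be sketched in a few lines. This is a minimal illustration rather than anything shown in the lecture: the sampled answers and the `toy_score` verifier are hypothetical, chosen so that the most common answer differs from the one the verifier prefers.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common final answer (ties go to the earliest seen)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the answer with the highest verifier score."""
    return max(answers, key=score)


# Hypothetical final answers from five samples on one problem.
samples = ["12", "15", "12", "15", "15"]


def toy_score(answer: str) -> float:
    """Toy verifier (an assumption): it rates the answer "12" highest."""
    return 1.0 if answer == "12" else 0.0


consensus_answer = majority_vote(samples)         # picks the most frequent answer
verified_answer = best_of_n(samples, toy_score)   # picks the verifier's favorite
```

If "12" were the correct answer here, adding more samples from the same distribution would not rescue majority vote, while a sufficiently reliable verifier could still select the minority answer.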

RL for Reasoning

The lecture mentions GRPO-style algorithms and systems such as DeepSeek-R1-Zero, where accuracy and response length increase together during RL. The important concept is that RL can shape not only final answers but also the distribution over reasoning processes: how long the model thinks, what intermediate steps it takes, and how it searches.

For a reasoning problem $x$, the model samples a solution trace $y$. A reward may come from correctness, a verifier, or another evaluation signal:

$$ \max_{\theta} \; \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)} \left[ R(x,y) \right]. $$

The RL challenge is credit assignment across long generated traces. The benefit is that the model can learn strategies that are hard to specify as demonstrations.
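To make the credit-assignment issue concrete, the standard score-function (REINFORCE) form of the gradient is worth writing out. The token decomposition $y = (y_1, \dots, y_T)$ and the baseline $b(x)$ are notation introduced here for illustration; the lecture notes above do not fix a particular estimator.

$$ \nabla_\theta \, \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)} \left[ R(x,y) \right] = \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)} \left[ \big( R(x,y) - b(x) \big) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]. $$

With only a trace-level reward, every token in the trace is weighted by the same scalar $R(x,y) - b(x)$, so the estimator cannot tell which intermediate steps actually helped. Any $b(x)$ that does not depend on $y$ leaves the gradient unbiased and serves only to reduce variance.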

A group-relative update, at a high level, samples multiple responses for the same question and compares them within that group. This gives a local baseline: a correct or high-scoring trace is reinforced relative to weaker traces for the same prompt. That framing is useful for reasoning tasks because prompts differ widely in difficulty, so raw rewards are not equally informative across prompts.
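The within-group comparison can be sketched as follows. This is a hedged illustration in the spirit of GRPO-style methods, not the lecture's implementation: dividing by the group standard deviation and the `eps` constant are assumptions made here, and the reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)        # group mean acts as the local baseline
    sigma = pstdev(rewards)   # population std of the group (assumed normalizer)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0/1 correctness rewards for four samples of the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A prompt where every sample succeeds yields zero advantages: no learning signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct traces receive positive advantages and incorrect ones negative, regardless of how hard the prompt is overall. When every sample in a group gets the same reward, the advantages vanish, which is one way to see why prompts that are far too easy or far too hard contribute little.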

When rewards are verifiable, RL can reinforce reasoning traces without relying only on human preference labels. But this does not solve every problem. The reward may only score the final answer, while the useful behavior is distributed across many intermediate tokens. Longer traces can help exploration and self-correction, but they can also waste compute or rationalize a wrong answer. Algorithm design therefore needs both a reward signal and a compute-aware policy objective.
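One simple way to make an objective compute-aware is to subtract a per-token cost from the correctness reward. The lecture does not prescribe this form; the function below and its `token_cost` value are hypothetical, meant only to show how the trade-off could be expressed.

```python
def compute_aware_reward(correct: bool, num_tokens: int, token_cost: float = 1e-4) -> float:
    """Correctness reward minus a linear penalty on generated tokens (assumed form)."""
    base = 1.0 if correct else 0.0
    return base - token_cost * num_tokens


# Hypothetical traces: same answer quality, different lengths.
short_correct = compute_aware_reward(correct=True, num_tokens=500)
long_correct = compute_aware_reward(correct=True, num_tokens=2000)
short_wrong = compute_aware_reward(correct=False, num_tokens=500)
```

Under this sketch a correct short trace outranks a correct long one, and both outrank a wrong trace as long as the penalty stays small relative to the correctness reward. Setting `token_cost` too high would discourage the long traces that help exploration and self-correction.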

Serial and Parallel Test-Time Compute

Chain-of-thought reasoning is serial: later tokens depend on earlier thoughts. This can improve depth but adds latency.

Best-of-$N$, consensus, and multi-agent approaches are more parallel: many attempts can be run simultaneously. They reduce wall-clock latency for a given compute budget but may be less compute-efficient than a single well-directed search.

A useful evaluation should report performance as a function of both compute and time, not just one fixed benchmark score.
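A small bookkeeping sketch shows why compute and time need to be reported separately. The `InferenceRun` record, its fields, and the numbers are all hypothetical, and the model assumes a constant decoding speed per stream with no batching overhead.

```python
from dataclasses import dataclass


@dataclass
class InferenceRun:
    attempts: int              # number of independent samples
    tokens_per_attempt: int    # generated tokens in each sample
    tokens_per_second: float   # decoding speed of one stream (assumed constant)
    parallel: bool             # True if all attempts run simultaneously

    @property
    def total_tokens(self) -> int:
        """Total compute, measured in generated tokens."""
        return self.attempts * self.tokens_per_attempt

    @property
    def wall_clock_seconds(self) -> float:
        """Latency: one attempt's duration if parallel, the sum otherwise."""
        per_attempt = self.tokens_per_attempt / self.tokens_per_second
        return per_attempt if self.parallel else per_attempt * self.attempts


# One long serial chain versus eight short parallel samples.
long_chain = InferenceRun(attempts=1, tokens_per_attempt=8000,
                          tokens_per_second=50.0, parallel=False)
best_of_8 = InferenceRun(attempts=8, tokens_per_attempt=1000,
                         tokens_per_second=50.0, parallel=True)
```

Both runs generate 8000 tokens, yet under these assumptions the serial chain takes 160 seconds and the parallel run takes 20. Which of the two is more accurate is an empirical question, which is why accuracy should be reported alongside both quantities.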

A practical distinction:

  • Serial compute buys depth: one trajectory can build on earlier intermediate results, but latency grows with the length of the reasoning process.
  • Parallel compute buys breadth: many attempts can be run at once, but repeated samples may be correlated and may duplicate the same mistake.
  • Search and verification try to combine both: branch over candidates, use a verifier or reward model to allocate more compute to promising paths, and stop when confidence is high enough.
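The last item can be illustrated in its simplest form: sample candidates one at a time, score each with a verifier, and stop once a score clears a confidence threshold or the budget runs out. This is a deliberately reduced sketch, sequential sampling with verifier-gated stopping rather than a full tree search; the function name, the threshold, and the toy scores are assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple


def sample_until_confident(
    candidates: Iterable[str],
    verifier: Callable[[str], float],
    threshold: float,
    max_attempts: int,
) -> Tuple[Optional[str], float, int]:
    """Score candidates in order; stop early once one clears the threshold."""
    best: Optional[str] = None
    best_score = float("-inf")
    used = 0
    for candidate in candidates:
        if used >= max_attempts:
            break                      # budget exhausted
        used += 1
        score = verifier(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= threshold:
            break                      # confident enough: stop spending compute
    return best, best_score, used


# Hypothetical candidates and verifier scores.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.95, "d": 0.1}
answer, confidence, attempts_used = sample_until_confident(
    ["a", "b", "c", "d"], toy_scores.__getitem__, threshold=0.8, max_attempts=4
)
```

Here the loop stops after two attempts even though a later candidate would have scored higher, which shows the trade-off: early stopping saves compute on easy problems at the cost of sometimes missing a better answer.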

The lecture uses recent reasoning systems as examples of increasing time scales: seconds, minutes, hours, and even days or weeks for external scaffolds. The lesson is not that longer is always better; it is that capability should be plotted against the full inference budget.

Safety and Preparedness Implications

The lecture argues that safety evaluations must account for test-time compute. A model that appears limited under a small inference budget may become much more capable with large budgets, long-running scaffolds, or many parallel attempts.

Implications:

  • Preparedness evaluations should project capabilities across inference budgets.
  • Long-horizon behavior is difficult to evaluate because the only definitive test may require actually running the system for a long time.
  • Inference capacity becomes strategically important, not just model weights.
  • Expensive capabilities today may become cheap later, so high-compute inference can preview future risks.
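A back-of-the-envelope calculation shows how quickly a rare capability can become likely under many attempts. The formula below assumes independent attempts and a perfect way of recognizing success, both idealizations that the lecture does not claim; since real samples are correlated, it is better read as an optimistic estimate than as a prediction. The success rate of 1% is hypothetical.

```python
def success_within_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical task that a model solves 1% of the time per attempt.
p = 0.01
projection = {k: success_within_k(p, k) for k in (1, 10, 100, 1000)}
```

Under these assumptions a 1% per-attempt success rate rises to roughly 63% with 100 attempts, so an evaluation run at a single small budget can substantially understate what the system can do when given more inference compute.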

Takeaways

Reasoning-focused RL shifts attention from "what answer does the model produce immediately?" to "how does the model allocate computation to solve a problem?" Planning, search, verification, and RL-trained reasoning traces are increasingly central to frontier LLM performance. This connects modern LLM post-training back to classic RL themes: reward design, credit assignment, exploration, planning, and compute-aware evaluation.