Appendix A. Guest Lecture: Archit Sharma — Post-Training, RLHF, and DPO

Shurui Liu

LLM Training Pipeline

The guest lecture frames modern language-model training as a sequence:

  1. Pretraining: next-token prediction on broad natural data.
  2. Mid-training: targeted domain mixtures or lower-volume curated data.
  3. Supervised fine-tuning / instruction tuning: curated prompt-response pairs that teach the model to follow instructions.
  4. Preference optimization / RLHF: optimize behavior toward implicit human preferences.

Pretraining learns broad statistical structure: syntax, facts, coreference, lexical semantics, sentiment, code, math patterns, and some world/agent modeling. But language modeling alone is not the same as being a helpful assistant.

The important distinction is objective mismatch. Next-token prediction teaches the model to continue text from the training distribution. Assistant behavior instead asks for instruction following, calibrated helpfulness, refusals when appropriate, style control, and robustness to ambiguous requests. Those properties can be latent in the pretrained model, but they are not directly selected for by the pretraining loss.

Instruction Tuning

Instruction tuning finetunes on many tasks represented as instruction-output pairs:

$$ \max_ \theta \sum_ {(x,y)\in\mathcal{D}_ {\text{SFT}}} \log \pi_ \theta(y\mid x). $$

It improves task following and generalization to unseen tasks, especially with scale in both model size and task diversity. Benchmarks such as MMLU and BIG-Bench test broad multitask capabilities.
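As a minimal sketch of the objective above, assume a hypothetical toy setup in which the model's logits at each position are given as plain Python lists. The loss is then the negative log-likelihood summed over response tokens only, since the objective conditions on the prompt $x$ rather than predicting it. The function names and the `is_response` mask are illustrative assumptions, not part of the lecture.

```python
import math

def log_softmax(logits):
    # Stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(logits_per_pos, target_ids, is_response):
    """Negative log-likelihood of the response tokens.

    logits_per_pos[t] are the (hypothetical) model logits used to predict
    target_ids[t]; is_response[t] is False for prompt positions, which are
    excluded because the objective is log pi(y | x), not log pi(x, y).
    """
    total = 0.0
    for logits, target, keep in zip(logits_per_pos, target_ids, is_response):
        if keep:
            total -= log_softmax(logits)[target]
    return total

# Toy example: 4-token vocabulary, one prompt position, two response positions.
uniform = [0.0, 0.0, 0.0, 0.0]
confident = [0.0, 0.0, 5.0, 0.0]  # strongly favors token 2
loss = sft_loss([uniform, uniform, uniform], [1, 2, 3], [False, True, True])
better = sft_loss([uniform, confident, uniform], [1, 2, 3], [False, True, True])
```

Under a uniform model each response token costs $\log 4$, and making the model confident about a correct token lowers the loss.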

A useful mental model is that instruction tuning turns broad pretrained competence into an interface. The model may already know facts, code patterns, or translation patterns, but supervised examples teach it which behavior should be produced when a user asks a task in natural language.

Limitations:

  • Demonstrations are expensive to collect.
  • Open-ended generation may have no single correct answer.
  • Token-level likelihood penalizes all deviations similarly, even though some mistakes are much worse than others.
  • Human-written demonstrations can be suboptimal.
  • The training objective is still not exactly "satisfy human preferences."

RLHF Objective

For prompt $x$ and sampled response $y\sim\pi_ \theta(\cdot\mid x)$, suppose we have reward $R(x,y)$. The ideal objective is:

$$ \max_ \theta \mathbb{E}_ {y\sim\pi_ \theta(\cdot\mid x)} [R(x,y)]. $$

REINFORCE gives:

$$ \nabla_ \theta \mathbb{E}_ {y\sim\pi_ \theta}[R(x,y)] = \mathbb{E}_ {y\sim\pi_ \theta} \left[ R(x,y)\nabla_ \theta\log\pi_ \theta(y\mid x) \right]. $$

In practice, rewards are learned from preferences and optimization includes a KL penalty to a reference model:

$$ \max_ \pi \mathbb{E}_ {x,y\sim\pi} \left[ r_ \phi(x,y) -\beta\log\frac{\pi(y\mid x)}{\pi_ {\text{ref}}(y\mid x)} \right]. $$

The KL term prevents the policy from moving too far from the supervised/reference model and exploiting reward-model artifacts.
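The KL-penalized objective above can be read per sample: each response's reward is reduced by $\beta$ times its policy/reference log-ratio. The sketch below, which assumes a hypothetical prompt with only three possible responses and made-up probabilities and rewards, checks that averaging this shaped reward under the policy equals expected reward minus $\beta$ times the KL divergence.

```python
import math

def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Per-sample term r(x, y) - beta * log(pi(y|x) / pi_ref(y|x))."""
    return reward - beta * (logp_policy - logp_ref)

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical numbers for a single prompt with three candidate responses.
policy = [0.7, 0.2, 0.1]
reference = [0.4, 0.4, 0.2]
rewards = [1.0, 0.5, 0.0]
beta = 0.1

# Expectation of the shaped reward under the policy.
objective = sum(
    p * kl_shaped_reward(r, math.log(p), math.log(q), beta)
    for p, q, r in zip(policy, reference, rewards)
)

# The same quantity written as E[r] - beta * KL(pi || pi_ref).
expected_reward = sum(p * r for p, r in zip(policy, rewards))
objective_via_kl = expected_reward - beta * kl_divergence(policy, reference)
```

Because the KL divergence is non-negative, the penalized objective is never larger than the raw expected reward; a larger `beta` makes drifting from the reference more costly.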

In sequence modeling, the "action" can be taken to be either the whole sampled response or each individual token in it. The reward is usually non-differentiable with respect to the tokens, so policy-gradient methods use the score-function identity: increase the log-probability of sampled outputs that receive high reward and decrease it for outputs that receive low reward. Real RLHF systems add variance reduction, a value function, batching, KL control, and many implementation details; the simple REINFORCE equation is only the conceptual core.
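To make the score-function identity concrete, the sketch below assumes a hypothetical softmax policy over just three whole responses, parameterized directly by its logits. It computes the REINFORCE expectation exactly by enumeration and compares it with a finite-difference gradient of the expected reward. The rewards and logits are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(logits, rewards):
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def reinforce_gradient(logits, rewards):
    """Exact E_y[R(y) * grad log pi(y)], enumerating every response.

    For a softmax policy, d log pi(y) / d logit_k = 1[k == y] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, (p_y, r_y) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            score = (1.0 if k == y else 0.0) - probs[k]
            grad[k] += p_y * r_y * score
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference gradient of the expected reward, for comparison."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 3.0]  # hypothetical scalar rewards per response
score_grad = reinforce_gradient(logits, rewards)
fd_grad = finite_difference_gradient(logits, rewards)
```

With real language models the response space cannot be enumerated, so the expectation is replaced by an average over sampled responses, which is where variance reduction becomes important.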

Preference Modeling

Given a preferred response $y_ w$ and dispreferred response $y_ l$ for prompt $x$, train a reward model using:

$$ \max_ \phi \mathbb{E}_ {(x,y_ w,y_ l)} \left[ \log\sigma(r_ \phi(x,y_ w)-r_ \phi(x,y_ l)) \right]. $$

This is the same pairwise preference model introduced in the reward-learning lecture, specialized to text responses.

Pairwise comparisons are used because absolute human scores are noisy and poorly calibrated across raters, prompts, and days. It is often easier to answer "which response is better?" than "what scalar reward should this response receive?" The tradeoff is that the pairwise loss depends only on reward differences, so the learned reward is identified only up to a prompt-dependent additive shift, and it may fail badly on model outputs that differ from the comparison data.
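A minimal sketch of the pairwise reward-model loss for a single comparison, assuming the two scalar rewards have already been produced by some reward model (the function names are illustrative). Since the loss depends only on the difference of the two rewards, adding the same constant to both leaves it unchanged.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) for both signs of z.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)

# Hypothetical reward-model scores for one (y_w, y_l) pair.
loss = preference_loss(2.0, 0.5)
shifted = preference_loss(2.0 + 10.0, 0.5 + 10.0)  # same gap, shifted rewards
tie = preference_loss(1.0, 1.0)                    # the model is indifferent
wrong = preference_loss(0.5, 2.0)                  # the model ranks the pair backwards
```

A tie costs $\log 2$, a correctly ranked pair costs less, and a reversed pair costs more.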

Direct Preference Optimization

DPO simplifies RLHF by deriving a supervised preference loss that directly updates the policy without separately training a reward model and running PPO.

The key theoretical starting point is the KL-regularized reward maximization objective:

$$ \max_ \pi \mathbb{E}_ {y\sim\pi(\cdot\mid x)}[r(x,y)] -\beta D_ {\mathrm{KL}}\left(\pi(\cdot\mid x)\,\|\,\pi_ {\text{ref}}(\cdot\mid x)\right). $$

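The optimal policy has the closed form $\pi^*(y\mid x)\propto\pi_ {\text{ref}}(y\mid x)\exp(r(x,y)/\beta)$. Taking logs and solving for the reward gives:

$$ r(x,y)= \beta\log\frac{\pi^*(y\mid x)}{\pi_ {\text{ref}}(y\mid x)} +\beta\log Z(x), $$

where $Z(x)=\sum_ y\pi_ {\text{ref}}(y\mid x)\exp(r(x,y)/\beta)$ is a normalizer that depends only on the prompt, so it cancels in reward differences between two responses to the same prompt. Substituting this reward parameterization into the preference likelihood yields the DPO loss: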

$$ \mathcal{L}_ {\text{DPO}}(\theta) = - \mathbb{E}_ {(x,y_ w,y_ l)} \left[ \log\sigma\left( \beta \left[ \log\frac{\pi_ \theta(y_ w\mid x)}{\pi_ {\text{ref}}(y_ w\mid x)} - \log\frac{\pi_ \theta(y_ l\mid x)}{\pi_ {\text{ref}}(y_ l\mid x)} \right] \right) \right]. $$

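The update widens the margin between the policy/reference log-ratio of the preferred response and that of the dispreferred response. This is more precise than saying it always raises the absolute likelihood of the preferred response: the loss depends only on the difference of the two log-ratios, so with shared parameters the likelihood of $y_ w$ can fall as long as the likelihood of $y_ l$ falls further.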
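The sketch below computes the DPO loss for one preference triple from sequence log-probabilities, and also checks the reward identity numerically. Rearranging that identity gives $\pi^*\propto\pi_ {\text{ref}}\exp(r/\beta)$, so for a hypothetical prompt with three possible responses the implicit reward gap $\beta$ times the log-ratio difference should recover the true reward gap. All numbers and function names are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) on a finite response set."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

beta = 0.5
ref = [0.5, 0.3, 0.2]        # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]    # hypothetical true rewards
pi_star = optimal_policy(ref, rewards, beta)

# Implicit reward gap between response 2 and response 0; log Z(x) cancels.
implicit_gap = beta * (
    math.log(pi_star[2] / ref[2]) - math.log(pi_star[0] / ref[0])
)

# When the policy equals the reference, the margin is zero.
init_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0, beta)
# Widening the log-ratio margin in favor of y_w reduces the loss.
later_loss = dpo_loss(-11.0, -16.0, -12.0, -15.0, beta)
```

In practice the four log-probabilities come from summing token log-probabilities of each response under the trainable policy and a frozen reference model.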

RLHF vs. DPO

RLHF:

  • Flexible and powerful.
  • Separates reward modeling from policy optimization.
  • Often uses PPO and online sampling.
  • More complex and sensitive to reward hacking and KL tuning.

DPO:

  • Simpler supervised-style optimization.
  • Uses preference pairs directly.
  • Avoids explicit reward-model training for the policy update.
  • Strongly depends on the quality and coverage of preference data and the choice of reference model.
  • Does not leverage online data in the same way as an RLHF loop that repeatedly samples the current policy, labels fresh failures, and retrains.

Both are ways to align a pretrained model with preferences rather than only imitate demonstrations.

Frontier Post-Training Concerns

The guest lecture emphasizes that learned rewards can be unreliable and model behavior is hard to control. Preference optimization changes style and capability, but it can also amplify evaluator blind spots. As models become stronger, post-training becomes less about a single algorithm and more about the whole pipeline: data construction, reward/preference quality, online sampling, safety constraints, evaluation, and monitoring.

Important failure modes:

  • Reward overoptimization: the policy can improve the learned reward while true human quality stops improving or worsens.
  • Reward hacking: the policy finds outputs that exploit annotation or reward-model artifacts.
  • Style overfitting: RLHF/DPO can make outputs more verbose, deferential, or list-like without necessarily making them more correct.
  • Sycophancy and control failures: a model can learn to agree with the user, over-apologize, over-refuse, or optimize politeness at the expense of truthfulness.
  • Objective balancing: helpfulness, harmlessness, honesty, abstention, brevity, and domain-specific preferences are not a single clean scalar objective.
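Reward overoptimization can be illustrated with a deliberately artificial toy, which is not a result from the lecture. Assume four possible responses, a proxy reward that agrees with the true reward except on one response that exploits an artifact, and a policy of the KL-regularized form $\pi\propto\pi_ {\text{ref}}\exp(r_ {\text{proxy}}/\beta)$. Shrinking $\beta$ corresponds to optimizing the proxy harder. All numbers are made up.

```python
import math

def kl_optimal_policy(ref_probs, rewards, beta):
    """Policy proportional to pi_ref(y) * exp(r(y) / beta) on a finite response set."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def expected(probs, values):
    return sum(p * v for p, v in zip(probs, values))

# Hypothetical toy problem: response 3 is rare under the reference policy,
# scores highest under the proxy reward, and is actually the worst response.
ref_probs = [0.50, 0.30, 0.19, 0.01]
true_reward = [0.0, 1.0, 2.0, -1.0]
proxy_reward = [0.0, 1.0, 2.0, 3.0]

betas = [2.0, 1.0, 0.5, 0.25]  # smaller beta = weaker KL penalty
proxy_scores, true_scores = [], []
for beta in betas:
    policy = kl_optimal_policy(ref_probs, proxy_reward, beta)
    proxy_scores.append(expected(policy, proxy_reward))
    true_scores.append(expected(policy, true_reward))
    print(f"beta={beta:<5} proxy={proxy_scores[-1]:.3f} true={true_scores[-1]:.3f}")

reference_true_score = expected(ref_probs, true_reward)
```

In this toy the proxy score rises every time $\beta$ shrinks, while the true score first improves over the reference policy and then falls below it once the policy concentrates on the exploiting response.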

One response is to use verifiable rewards when possible. Math, code, some science tasks, and other problems with checkable solutions can provide sharper reward signals than general human preference labels. This is part of why reasoning models fit naturally into the post-training story. The limitation is that many assistant behaviors remain non-verifiable: tone, judgment, open-ended writing quality, and safety in underspecified situations still require preference modeling or other forms of supervision.
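As a rough sketch of a verifiable reward, assume a hypothetical math task where each prompt has a known numeric answer and the response's final number is taken as its answer. The extraction rule and function names are illustrative assumptions; real checkers are usually stricter about answer format.

```python
import re

def extract_final_answer(response):
    """Return the last number appearing in the response as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None

def verifiable_reward(response, reference_answer):
    """Binary reward: 1.0 if the final number matches the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no checkable answer was produced
    return 1.0 if float(answer) == float(reference_answer) else 0.0

# Hypothetical responses to the prompt "What is 3 + 4?"
correct = verifiable_reward("3 + 4 = 7, so the answer is 7", "7")
incorrect = verifiable_reward("I think it is 8.", "7")
missing = verifiable_reward("I am not sure.", "7")
```

The reward is cheap and unambiguous, but it only checks the final answer, so it says nothing about tone, reasoning quality, or any behavior that lacks a reference solution.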

Another response is AI feedback, including constitutional-style pipelines. A model can critique a harmful or low-quality answer under a written principle, revise the answer, and provide preference or reward signals for training. This can scale supervision and make the desired behavior more explicit, but it inherits the limits of the model providing the feedback and the principles used to guide it.