6. Reward Learning

Shurui Liu

Why Rewards Are Hard

In games, rewards may be built into the environment. In real-world domains such as robotics, dialogue, or driving, the true objective is often hard to specify. Proxy rewards can be exploited, and demonstrations may be expensive or impossible.

Reward learning asks how to infer a useful reward signal from easier supervision:

  • Goal-state examples.
  • Successful and unsuccessful outcomes.
  • Demonstrations.
  • Human ratings.
  • Pairwise preferences.
  • AI-generated feedback.

The key warning is that rewards are optimized, not just predicted. A proxy that looks accurate on a validation set can still fail when an RL policy searches for states or outputs that maximize it. This is distribution shift in reward space: the learner deliberately moves toward places where the learned reward may be least reliable.

Reward functions are also non-identifiable. Adding a constant to all rewards, scaling rewards by a positive constant, or adding certain potential-based shaping terms can leave preferences or optimal policies unchanged while changing learning dynamics. Practical RL therefore cares about both what the reward represents and how its scale interacts with the optimizer, entropy bonuses, KL penalties, and value targets.
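
As a worked illustration of the shaping case, assume a discount factor $\gamma<1$ and a bounded potential function $\Phi$ over states (both symbols are introduced here for illustration and do not appear elsewhere in these notes). A potential-based shaping term modifies the reward as

$$ r'(s,a,s') = r(s,a,s') + \gamma\Phi(s') - \Phi(s). $$

Along any trajectory $s_ 0,a_ 0,s_ 1,\dots$ the added terms telescope:

$$ \sum_ {t=0}^{\infty}\gamma^t r'(s_ t,a_ t,s_ {t+1}) = \sum_ {t=0}^{\infty}\gamma^t r(s_ t,a_ t,s_ {t+1}) - \Phi(s_ 0). $$

Every policy's return from $s_ 0$ shifts by the same amount $-\Phi(s_ 0)$, so the ranking of policies is unchanged, even though the per-step targets seen by the learner are different.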

Goal Classifiers

A goal classifier learns whether a state belongs to a desired goal set $G$:

$$ c_ \theta(s)\approx \Pr(s\in G). $$

Procedure:

  1. Collect positive examples of successful states and negative examples of unsuccessful states.
  2. Train a binary classifier.
  3. Use the classifier output as a reward:

$$ r(s)=c_ \theta(s) \quad\text{or}\quad r(s)=\log c_ \theta(s). $$

If the classifier estimates $c_ \theta(s)=\Pr(y=1\mid s)$ from balanced positive and negative data, the logit

$$ \log\frac{c_ \theta(s)}{1-c_ \theta(s)} $$

can be interpreted as a density-ratio-style reward. The main intuition is still simple: states that look more like success examples receive higher reward.
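
To see where the density-ratio reading comes from, write $p_ +(s)$ and $p_ -(s)$ for the state densities of the positive and negative examples (notation assumed here, not taken from the slides). With balanced classes, $\Pr(y=1)=\Pr(y=0)=\tfrac{1}{2}$, and Bayes' rule gives, for an ideal classifier,

$$ c_ \theta(s)=\frac{p_ +(s)}{p_ +(s)+p_ -(s)} \quad\Longrightarrow\quad \log\frac{c_ \theta(s)}{1-c_ \theta(s)}=\log\frac{p_ +(s)}{p_ -(s)}. $$

A trained classifier only approximates this. With unbalanced classes the logit picks up the additive constant $\log\frac{\Pr(y=1)}{\Pr(y=0)}$, which shifts every reward by the same amount.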

The main failure mode is reward hacking: RL finds states that fool the classifier rather than truly satisfying the task. A mitigation is to update the classifier during RL, adding states visited by the policy as negatives. This creates an adversarial loop similar to GAN training:

  • Classifier distinguishes success states from policy-generated states.
  • Policy tries to reach states the classifier considers successful.

This can work well but may be unstable and requires regularization.

Three details from the slides are worth remembering:

  • The negative set should include states visited by the current policy, because those are exactly where the classifier will be optimized against.
  • Balanced positive/negative classifier batches matter; otherwise the classifier can collapse to a base-rate solution that is not useful as a reward.
  • Regularization matters because a high-capacity classifier can become overconfident on visual artifacts or spurious features, which creates exploitable reward spikes.
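
The points above can be sketched in a few lines. This is a minimal toy, not the method from the slides: it assumes a linear logistic classifier, 2-D states, and hand-written state lists, and every function name (`balanced_batch`, `sgd_step`, `train_goal_classifier`) is hypothetical. It shows balanced batches, policy-visited states used as negatives, and L2 weight decay as one simple form of regularization.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classifier_prob(w, b, s):
    # c_theta(s): estimated probability that state s is a goal state.
    return sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)

def balanced_batch(goal_states, policy_states, half, rng):
    # Draw equally many positives and negatives, regardless of buffer sizes.
    pos = [(rng.choice(goal_states), 1.0) for _ in range(half)]
    neg = [(rng.choice(policy_states), 0.0) for _ in range(half)]
    return pos + neg

def sgd_step(w, b, batch, lr=0.5, weight_decay=1e-2):
    # One gradient step on the mean logistic loss plus L2 weight decay.
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for s, y in batch:
        err = classifier_prob(w, b, s) - y
        grad_w = [g + err * si for g, si in zip(grad_w, s)]
        grad_b += err
    n = len(batch)
    w = [wi - lr * (g / n + weight_decay * wi) for wi, g in zip(w, grad_w)]
    return w, b - lr * grad_b / n

def train_goal_classifier(goal_states, policy_states, steps=300, seed=0):
    rng = random.Random(seed)
    w, b = [0.0] * len(goal_states[0]), 0.0
    for _ in range(steps):
        w, b = sgd_step(w, b, balanced_batch(goal_states, policy_states, 8, rng))
    return w, b

# Toy data: goal states cluster near (1, 1); the policy visits states near (0, 0).
goal_states = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
policy_states = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.2]]

# In the adversarial loop, fresh rollout states would be appended as negatives
# before each classifier update (a hypothetical rollout is shown as a literal).
policy_states.extend([[0.2, 0.1]])

w, b = train_goal_classifier(goal_states, policy_states)
reward_at_goal = classifier_prob(w, b, [1.0, 1.0])
reward_at_policy_state = classifier_prob(w, b, [0.0, 0.0])
```

In a full system the classifier update and the policy update would alternate, and the classifier would typically be a neural network over images or features. The sketch covers only the classifier side.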

The GAN analogy is useful but imperfect. The RL setting is harder because the generator is a sequential control policy, not a direct sampler over independent examples.

Preference-Based Reward Learning

Humans often find relative comparisons easier than absolute scores. Given two trajectories $\tau_ w$ and $\tau_ l$, where a human says $\tau_ w\succ\tau_ l$, define trajectory reward:

$$ r_ \theta(\tau)=\sum_ {(s,a)\in\tau}r_ \theta(s,a). $$

Use a Bradley-Terry preference model:

$$ \Pr(\tau_ a\succ\tau_ b) = \sigma(r_ \theta(\tau_ a)-r_ \theta(\tau_ b)), $$

where $\sigma$ is the logistic sigmoid.

The reward model is trained by maximizing:

$$ \max_ \theta \mathbb{E}_ {(\tau_ w,\tau_ l)} \left[ \log\sigma(r_ \theta(\tau_ w)-r_ \theta(\tau_ l)) \right]. $$

For $k$ ranked trajectories, all preferred/dispreferred pairs can contribute to the loss.

The Bradley-Terry form says preferences depend on reward differences, not on the absolute reward level. Adding a constant to all trajectory rewards does not change pairwise probabilities, so preference data cannot pin down the reward's offset. This is one reason preference reward models often need regularization, and downstream RL usually includes a KL penalty or another constraint.
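
A minimal sketch of the loss, assuming trajectories are represented as lists of per-step rewards and using a hypothetical `toy_reward` in place of a learned $r_ \theta$. The function returns the negative of the objective above, so it would be minimized. The last lines check numerically that a constant offset on trajectory rewards leaves the loss unchanged, while rescaling does not.

```python
import math
from itertools import combinations

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_nll(pairs, reward_fn):
    # Mean negative log-likelihood over (preferred, dispreferred) trajectory pairs.
    losses = [-log_sigmoid(reward_fn(win) - reward_fn(lose)) for win, lose in pairs]
    return sum(losses) / len(losses)

def pairs_from_ranking(ranked):
    # ranked is ordered best-first; combinations() preserves that order,
    # so each pair comes out as (preferred, dispreferred).
    return list(combinations(ranked, 2))

def toy_reward(trajectory):
    # Hypothetical stand-in for r_theta(tau): sum of per-step rewards.
    return sum(trajectory)

ranked = [[1.0, 0.5], [0.4, 0.4], [0.2, 0.1]]  # best to worst
pairs = pairs_from_ranking(ranked)             # k = 3 gives 3 pairs

base = bradley_terry_nll(pairs, toy_reward)
shifted = bradley_terry_nll(pairs, lambda t: toy_reward(t) + 5.0)
scaled = bradley_terry_nll(pairs, lambda t: 2.0 * toy_reward(t))
```

Here `base` and `shifted` agree up to floating-point error, while `scaled` differs. The offset is invisible to the loss, but the scale is not.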

A complete preference-learning loop can be online:

  1. Use the current policy to generate candidate trajectory snippets.
  2. Ask a human or evaluator to rank or compare them.
  3. Update the reward model with the Bradley-Terry loss.
  4. Update the policy to maximize the learned reward.
  5. Repeat with new policy-generated data.

Online preference collection focuses labels on decisions the current policy is actually making, but it is expensive and can introduce feedback loops if the evaluator is inconsistent or the reward model overfits early preferences.

RLHF for Language Models

In LLM post-training, a prompt $x$ and response $y$ form the object being scored. A reward model estimates:

$$ r_ \theta(x,y). $$

The high-level RLHF pipeline:

  1. Pretrain a language model on next-token prediction.
  2. Supervised fine-tune on high-quality prompt-response examples.
  3. Sample multiple responses for prompts.
  4. Collect human preferences over responses.
  5. Train a reward model from preferences.
  6. Fine-tune the language model to maximize learned reward, often with PPO and a KL penalty to stay close to the supervised model.

A typical RL objective is:

$$ \max_ \pi \mathbb{E}_ {y\sim\pi(\cdot\mid x)} [r_ \theta(x,y)] -\beta D_ {\mathrm{KL}}(\pi(\cdot\mid x)\,\|\,\pi_ {\mathrm{ref}}(\cdot\mid x)). $$

The KL term matters because learned rewards are imperfect. Without a constraint, the policy may exploit reward-model weaknesses.

The LLM setting is an MDP in a degenerate but useful sense: the state is the prompt plus generated prefix, the action is the next token, and the episode ends when the response is complete. Often the reward is only given at the end, which makes credit assignment across tokens difficult.
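
The token-level view can be made concrete with a small sketch. The `ResponseEpisode` class, the `EOS` marker, and the toy reward function are all hypothetical names chosen for illustration. The point is only that every step before the end returns zero reward, and the whole response is scored once at termination.

```python
from dataclasses import dataclass

EOS = "<eos>"  # hypothetical end-of-sequence token

@dataclass
class ResponseEpisode:
    prompt: tuple
    prefix: tuple = ()

    @property
    def state(self):
        # State = prompt plus the tokens generated so far.
        return self.prompt + self.prefix

    @property
    def done(self):
        return bool(self.prefix) and self.prefix[-1] == EOS

    def step(self, token, terminal_reward_fn):
        # Action = next token. Reward is zero until the response is complete.
        self.prefix = self.prefix + (token,)
        if self.done:
            return terminal_reward_fn(self.prompt, self.prefix)
        return 0.0

def toy_terminal_reward(prompt, response):
    # Stand-in for a reward model r_theta(x, y); not a real scoring rule.
    return 1.0 if "there" in response else 0.0

episode = ResponseEpisode(prompt=("say", "hello"))
rewards = [episode.step(tok, toy_terminal_reward) for tok in ["hi", "there", EOS]]
```

The resulting `rewards` list is zero everywhere except the final step, which is the credit-assignment difficulty described above.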

The KL penalty can also be read as a per-response reward shaping term:

$$ r_ {\mathrm{KL}}(x,y)= r_ \theta(x,y)- \beta\log\frac{\pi(y\mid x)}{\pi_ {\mathrm{ref}}(y\mid x)}. $$

This penalizes responses that become much more likely under the optimized policy than under the reference model. It is not only a safety device; it also stabilizes optimization when the reward model is imperfect.
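
A minimal sketch of the shaped reward, assuming per-token log-probabilities are available from both models and that the sequence log-probability is their sum. The function names are hypothetical. Averaging the log-ratio term over responses sampled from $\pi$ gives the KL divergence in the objective above.

```python
def sequence_logprob(token_logprobs):
    # log pi(y | x) is the sum of per-token log-probabilities.
    return sum(token_logprobs)

def kl_shaped_reward(reward, policy_token_logprobs, ref_token_logprobs, beta):
    # r_KL(x, y) = r_theta(x, y) - beta * log(pi(y|x) / pi_ref(y|x))
    log_ratio = (sequence_logprob(policy_token_logprobs)
                 - sequence_logprob(ref_token_logprobs))
    return reward - beta * log_ratio

# Toy numbers: the policy assigns this response more probability than the reference does,
# so the shaped reward is lower than the raw reward-model score.
shaped = kl_shaped_reward(1.0, [-0.1, -0.2], [-0.5, -0.7], beta=0.1)

# When policy and reference agree on a response, no penalty is applied.
unchanged = kl_shaped_reward(1.0, [-0.3, -0.4], [-0.3, -0.4], beta=0.1)
```

With these inputs the log-ratio is 0.9, so `shaped` is approximately 0.91, and `unchanged` equals the raw reward of 1.0.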

RLAIF

Reinforcement learning from AI feedback replaces or supplements human preference labels with model-generated judgments. The slide's key intuition is that critique can be easier than generation: a model may be better at choosing the less harmful or more helpful response than writing the best response from scratch.

The same preference-learning machinery applies, but label quality depends on the evaluator model and its constitution, rubric, or prompting.

RLAIF is not a different RL algorithm. It changes the source of preference labels. The same concerns remain: evaluator bias, reward-model overoptimization, and the need to keep the optimized policy close enough to a trusted reference.

Takeaways

Rewards should not be taken for granted. Learned rewards make task specification more practical, but because RL optimizes hard against the reward, any learned reward can be exploited. Good systems combine reward learning with regularization, online data collection, preference auditing, and conservative policy updates.