0. RL Basics

9 min · Shurui Liu

What Deep Reinforcement Learning Studies

Deep reinforcement learning studies sequential decision-making problems in which an agent repeatedly observes information, acts, receives consequences, and then acts again. The "deep" part means that the learned objects, such as policies, value functions, dynamics models, or reward models, are usually represented with neural networks.

This differs from standard supervised learning in three important ways:

The data is not independent and identically distributed. An agent's action affects future states and therefore affects what data it sees next.
Feedback is indirect. The learner is often not told the correct action, only whether outcomes were good or bad.
The objective is long-term performance, not just immediate prediction accuracy.

Deep RL appears in robotics, control, games, autonomous driving, recommendation, web agents, and language-model post-training. The same mathematical abstractions cover low-level motor commands and high-level decisions such as text generation or tool use.

MDPs, POMDPs, and Trajectories

The standard fully observed formalism is a Markov decision process (MDP):

$$ \mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \rho_ 0, \gamma, H). $$

Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\rho_ 0(s_ 1)$ is the initial-state distribution, $p(s_ {t+1}\mid s_ t,a_ t)$ is the transition dynamics, $r(s_ t,a_ t)$ is the reward, $\gamma \in [0,1]$ is the discount factor, and $H$ is the horizon.

The Markov property says that the next state depends on the current state and action, not on the full past:

$$ p(s_ {t+1}\mid s_ 1,a_ 1,\ldots,s_ t,a_ t)=p(s_ {t+1}\mid s_ t,a_ t). $$

In a partially observed MDP (POMDP), the agent receives observations $o_ t$ rather than the full state $s_ t$. In that case, a memoryless policy $\pi(a_ t\mid o_ t)$ may be insufficient because the future observation distribution can depend on the whole history. A practical fix is to provide history or memory:

$$ \pi_ \theta(a_ t \mid o_ {t-m:t}) \quad \text{or} \quad \pi_ \theta(a_ t \mid h_ t), $$

where $h_ t$ is a recurrent or transformer state.
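As a minimal sketch of the windowed input $o_ {t-m:t}$, the helper below keeps the last $m+1$ observations in a fixed-length buffer. The class name `ObservationHistory` and the choice to pad the window with copies of the first observation at reset are assumptions made for illustration, not part of the formalism above.

```python
from collections import deque


class ObservationHistory:
    """Keeps the most recent m+1 observations, i.e. the window o_{t-m:t}."""

    def __init__(self, m):
        # A deque with maxlen drops the oldest entry automatically on append.
        self.buffer = deque(maxlen=m + 1)

    def reset(self, first_obs):
        # Assumption: at the start of an episode, pad with the first observation.
        self.buffer.clear()
        for _ in range(self.buffer.maxlen):
            self.buffer.append(first_obs)
        return tuple(self.buffer)

    def step(self, obs):
        # Append the newest observation and return the current window.
        self.buffer.append(obs)
        return tuple(self.buffer)


history = ObservationHistory(m=2)
window = history.reset("o1")   # window of length 3, all padded with "o1"
window = history.step("o2")    # newest observation is last in the tuple
```

A policy network would consume `window` in place of a single observation. Recurrent or transformer policies replace this explicit buffer with a learned summary $h_ t$.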

A trajectory is a recorded sequence of interactions. For a length-$T$ trajectory with $T$ actions, it is often convenient to include the final next state:

$$ \tau = (s_ 1,a_ 1,s_ 2,a_ 2,\ldots,s_ T,a_ T,s_ {T+1}). $$

Under policy $\pi_ \theta$, its probability is

$$ p_ \theta(\tau)=\rho_ 0(s_ 1)\prod_ {t=1}^{T}\pi_ \theta(a_ t\mid s_ t)p(s_ {t+1}\mid s_ t,a_ t). $$

For POMDPs, replace $s_ t$ in the policy input with observations or histories.
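The factorization of $p_ \theta(\tau)$ can be sketched on a small tabular example. The two-state, two-action MDP below is hypothetical (every number is made up), and the function names are illustrative. The code samples a trajectory and evaluates its log-probability term by term.

```python
import math
import random

# Hypothetical 2-state, 2-action tabular MDP; all numbers are made up.
RHO0 = [0.8, 0.2]                    # RHO0[s] = rho_0(s)
P = [                                # P[s][a][s2] = p(s2 | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
PI = [[0.7, 0.3], [0.4, 0.6]]        # PI[s][a] = pi(a | s)


def sample_trajectory(T, rng):
    """Sample (s_1, a_1, ..., s_T, a_T, s_{T+1}); returns (states, actions)."""
    states = [rng.choices([0, 1], weights=RHO0)[0]]
    actions = []
    for _ in range(T):
        s = states[-1]
        a = rng.choices([0, 1], weights=PI[s])[0]
        actions.append(a)
        states.append(rng.choices([0, 1], weights=P[s][a])[0])
    return states, actions


def log_prob(states, actions):
    """log rho_0(s_1) + sum_t [log pi(a_t|s_t) + log p(s_{t+1}|s_t,a_t)]."""
    lp = math.log(RHO0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        lp += math.log(PI[s][a]) + math.log(P[s][a][s_next])
    return lp


rng = random.Random(0)
states, actions = sample_trajectory(T=5, rng=rng)
print(states, actions, log_prob(states, actions))
```

Only the $\log \pi_ \theta(a_ t\mid s_ t)$ terms depend on the policy parameters, which is why policy-gradient methods can differentiate $\log p_ \theta(\tau)$ without knowing the dynamics.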

Discounting, Horizons, and Terminal States

There are two common ways to make cumulative reward mathematically well-defined:

Finite horizon: the episode ends after $H$ steps, so the return is a finite sum.
Infinite horizon with discounting: rewards are weighted by $\gamma^t$, and $\gamma<1$ keeps the total bounded when rewards are bounded.

For a trajectory starting at time $t$, the discounted return is

$$ G_ t=\sum_ {k=0}^{H-t}\gamma^k r(s_ {t+k},a_ {t+k}). $$

Terminal states have value zero because no future rewards are collected after termination:

$$ V^\pi(s_ {\mathrm{terminal}})=0. $$

Discounting has two interpretations. Mathematically, it makes infinite sums converge. Conceptually, it makes earlier rewards more important and creates an effective horizon of roughly $1/(1-\gamma)$. For example, $\gamma=0.99$ means rewards about 100 steps away still matter substantially; $\gamma=0.9$ makes the agent much more myopic.
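The following sketch computes a discounted return with a backward recursion and evaluates the effective-horizon heuristic. The function names are illustrative. At $\gamma=0.99$ the weight on a reward 100 steps away is $0.99^{100}\approx 0.37$, whereas at $\gamma=0.9$ it is below $10^{-4}$.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], accumulated from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Heuristic horizon 1 / (1 - gamma); only meaningful for gamma < 1."""
    return 1.0 / (1.0 - gamma)


for gamma in (0.9, 0.99):
    weight_at_100 = gamma ** 100
    # With constant reward 1, a long discounted sum approaches 1 / (1 - gamma).
    long_sum = discounted_return([1.0] * 2000, gamma)
    print(gamma, effective_horizon(gamma), weight_at_100, long_sum)
```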

Policies, Values, and Q-Functions

A policy is the agent's behavior:

$$ \pi_ \theta(a\mid s) = \Pr(a_ t=a\mid s_ t=s). $$

Policies may be deterministic, stochastic, discrete, continuous, or structured generative models. Stochasticity is useful for exploration and for modeling demonstrations where multiple actions can be reasonable.

The reinforcement-learning objective is expected cumulative reward:

$$ J(\theta)=\mathbb{E}_ {\tau\sim p_ \theta(\tau)} \left[\sum_ {t=1}^T \gamma^{t-1} r(s_ t,a_ t)\right]. $$

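The value function of a policy is the expected discounted future reward from a state. It is written here in infinite-horizon form, with time re-indexed so that the state of interest is $s_ 0$: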

$$ V^\pi(s)=\mathbb{E}_ \pi\left[\sum_ {t=0}^{\infty}\gamma^t r(s_ t,a_ t)\mid s_ 0=s\right]. $$

The Q-function is expected future reward after taking a specific first action:

$$ Q^\pi(s,a)=\mathbb{E}_ \pi\left[\sum_ {t=0}^{\infty}\gamma^t r(s_ t,a_ t)\mid s_ 0=s,a_ 0=a\right]. $$

The advantage function measures how much better action $a$ is than the policy's average action at state $s$:

$$ A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s). $$

The Bellman equations connect these quantities recursively:

$$ V^\pi(s)=\mathbb{E}_ {a\sim\pi(\cdot\mid s),\,s'\sim p(\cdot\mid s,a)} \left[r(s,a)+\gamma V^\pi(s')\right], $$

$$ Q^\pi(s,a)=r(s,a)+\gamma \mathbb{E}_ {s'\sim p(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')} \left[Q^\pi(s',a')\right]. $$

These equations are the backbone of actor-critic, Q-learning, offline RL, and planning methods.
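To make the recursion concrete, here is a minimal sketch that computes $V^\pi$, $Q^\pi$, and $A^\pi$ for a hypothetical two-state, two-action MDP by sweeping the Bellman equation until it stops changing. The MDP numbers, the fixed sweep count, and the function names are assumptions chosen for illustration.

```python
GAMMA = 0.9
# Hypothetical tabular MDP; all numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]         # R[s][a] = r(s, a)
PI = [[0.7, 0.3], [0.4, 0.6]]        # PI[s][a] = pi(a | s)


def q_from_v(V):
    """Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]."""
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2))
             for a in range(2)] for s in range(2)]


def evaluate_policy(num_sweeps=1000):
    """Repeat V(s) <- E_{a ~ pi}[Q(s,a)] starting from V = 0."""
    V = [0.0, 0.0]
    for _ in range(num_sweeps):
        Q = q_from_v(V)
        V = [sum(PI[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V


V = evaluate_policy()
Q = q_from_v(V)
A = [[Q[s][a] - V[s] for a in range(2)] for s in range(2)]

# The policy-weighted advantage is zero at every state (up to float error).
for s in range(2):
    print(s, V[s], Q[s], sum(PI[s][a] * A[s][a] for a in range(2)))
```

The last column illustrates that $\mathbb{E}_ {a\sim\pi}[A^\pi(s,a)]=0$, which follows directly from $V^\pi(s)=\mathbb{E}_ {a\sim\pi}[Q^\pi(s,a)]$.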

The optimal value functions replace "follow $\pi$" with "act optimally":

$$ V^\ast(s)=\max_ a Q^\ast(s,a), $$

$$ Q^\ast(s,a)=r(s,a)+\gamma \mathbb{E}_ {s'\sim p(\cdot\mid s,a)} \left[\max_ {a'}Q^\ast(s',a')\right]. $$

If $Q^\ast$ is known, an optimal deterministic policy is obtained by acting greedily (ties may be broken arbitrarily):

$$ \pi^\ast(s)\in\arg\max_ a Q^\ast(s,a). $$

Note: $Q^\pi$ evaluates a fixed policy, while $Q^\ast$ already assumes optimal behavior after the first action.

State-Visitation Distributions

When we write expectations over states in RL, the state distribution is usually induced by the policy. A useful discounted state distribution is

$$ d^\pi(s)=(1-\gamma)\sum_ {t=0}^{\infty}\gamma^t \Pr(s_ t=s\mid \pi). $$

Policy-gradient and actor-critic updates are naturally expectations over $d^\pi(s)\pi(a\mid s)$. This is why on-policy algorithms need fresh data from the current policy: changing the policy changes not only the action probabilities but also the states that will be visited later.

The normalization factor $(1-\gamma)$ is optional. Some derivations use the unnormalized occupancy

$$ \rho^\pi(s)=\sum_ {t=0}^{\infty}\gamma^t\Pr(s_ t=s\mid\pi), $$

which differs from $d^\pi$ by the constant factor $1/(1-\gamma)$. The constant does not change the direction of a policy gradient, but it does change the exact scaling of formulas. When comparing equations across papers, first check whether their occupancy measure is normalized.
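A small numerical sketch of both occupancy measures, assuming a hypothetical two-state, two-action MDP and truncating the infinite sum at a fixed number of steps. The numbers, the truncation length, and the function names are illustrative assumptions.

```python
GAMMA = 0.9
# Hypothetical tabular MDP; all numbers are made up.
RHO0 = [0.8, 0.2]                    # initial-state distribution
P = [[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.1, 0.9]]]
PI = [[0.7, 0.3], [0.4, 0.6]]        # PI[s][a] = pi(a | s)


def state_transition_under_policy():
    """P_pi[s][s2] = sum_a pi(a|s) p(s2|s,a)."""
    return [[sum(PI[s][a] * P[s][a][s2] for a in range(2))
             for s2 in range(2)] for s in range(2)]


def occupancy(num_steps=500):
    """Truncated sums for rho(s) = sum_t gamma^t Pr(s_t = s) and d = (1-gamma) rho."""
    P_pi = state_transition_under_policy()
    mu = list(RHO0)                  # mu[s] = Pr(s_t = s), starting at t = 0
    rho = [0.0, 0.0]
    discount = 1.0
    for _ in range(num_steps):
        rho = [rho[s] + discount * mu[s] for s in range(2)]
        mu = [sum(mu[s] * P_pi[s][s2] for s in range(2)) for s2 in range(2)]
        discount *= GAMMA
    d = [(1.0 - GAMMA) * x for x in rho]
    return d, rho


d, rho = occupancy()
print(d, sum(d))       # normalized: sums to roughly 1
print(rho, sum(rho))   # unnormalized: sums to roughly 1 / (1 - gamma)
```

Changing any entry of `PI` changes `P_pi` and therefore `d`, which is the tabular version of the statement that a policy update also shifts the states visited later.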

The occupancy measure is the formal reason that "off-policy" is not a magic word. If data were collected from $\pi_ \beta$, expectations are over $d^{\pi_ \beta}(s)\pi_ \beta(a\mid s)$. Learning about a new policy $\pi$ requires either correction terms, conservative assumptions, or enough coverage that the relevant state-action pairs for $\pi$ are present in the data.

Bellman Operators and Dynamic Programming

In a tabular MDP with known transition probabilities, policy evaluation can be written as repeatedly applying the Bellman operator for a fixed policy:

$$ (\mathcal{B}^\pi V)(s)= \mathbb{E}_ {a\sim\pi(\cdot\mid s),s'\sim p(\cdot\mid s,a)} \left[r(s,a)+\gamma V(s')\right]. $$

The optimality operator replaces the expectation over policy actions with maximization:

$$ (\mathcal{B}^\ast V)(s)= \max_ a \mathbb{E}_ {s'\sim p(\cdot\mid s,a)} \left[r(s,a)+\gamma V(s')\right]. $$

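For $\gamma<1$, both operators are $\gamma$-contractions in the sup norm. Writing $\mathcal{B}$ for either $\mathcal{B}^\pi$ or $\mathcal{B}^\ast$, for any two value functions $V$ and $U$:

$$ \|\mathcal{B}V-\mathcal{B}U\|_ \infty \le \gamma\|V-U\|_ \infty. $$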

This contraction gives the clean tabular story:

Policy evaluation converges to $V^\pi$ by repeated Bellman backups.
Policy iteration alternates policy evaluation with greedy improvement.
Value iteration repeatedly applies $\mathcal{B}^\ast$ and then extracts a greedy policy.
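The tabular story can be sketched in a few lines. The example below assumes a hypothetical two-state, two-action MDP with known dynamics. It applies $\mathcal{B}^\ast$ until the sup-norm change falls under a tolerance, extracts a greedy policy, and numerically checks the contraction inequality on one pair of value functions. All names and numbers are illustrative.

```python
GAMMA = 0.9
# Hypothetical tabular MDP; all numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]         # R[s][a] = r(s, a)


def backup(V, s, a):
    """One-step lookahead r(s,a) + gamma * E_{s'}[V(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2))


def bellman_optimality(V):
    """(B* V)(s) = max_a backup(V, s, a)."""
    return [max(backup(V, s, a) for a in range(2)) for s in range(2)]


def sup_norm(V, U):
    return max(abs(v - u) for v, u in zip(V, U))


def value_iteration(tol=1e-10):
    """Apply B* repeatedly until successive iterates differ by less than tol."""
    V = [0.0, 0.0]
    while True:
        V_new = bellman_optimality(V)
        if sup_norm(V_new, V) < tol:
            return V_new
        V = V_new


def greedy_policy(V):
    """Pick an action attaining the max in B* at each state."""
    policy = []
    for s in range(2):
        q = [backup(V, s, a) for a in range(2)]
        policy.append(q.index(max(q)))
    return policy


V_star = value_iteration()
print(V_star, greedy_policy(V_star))

# Contraction check on two arbitrary value functions.
V, U = [0.0, 0.0], [10.0, -5.0]
print(sup_norm(bellman_optimality(V), bellman_optimality(U)), GAMMA * sup_norm(V, U))
```

The stopping rule relies on the contraction: once successive iterates are within `tol`, the Bellman residual of the returned function is below `tol` as well.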

Deep RL keeps the Bellman idea but loses several assumptions. Neural approximators do not store an independent value for each state-action pair, sampled minibatches replace exact expectations, and bootstrapped targets depend on the current learned function. Most stability tricks in later lectures are ways to recover some of the tabular fixed-point behavior without having the exact tabular setting.

Monte Carlo, TD, and Bootstrapping

There are two basic ways to estimate values from experience:

Monte Carlo: wait for the rollout to finish and regress to the realized return.
Temporal-difference learning: regress to one reward plus a bootstrapped value estimate.

For a value function, these targets are

$$ y_ t^{\mathrm{MC}}=\sum_ {k=0}^{T-t}\gamma^k r_ {t+k}, $$

and

$$ y_ t^{\mathrm{TD}}=r_ t+\gamma V(s_ {t+1}). $$

Monte Carlo targets are unbiased samples of the return under the rollout policy but can have high variance. TD targets are lower variance and can be used before an episode ends, but they are biased when the current value estimate is wrong. Actor-critic, Q-learning, offline RL, and model-based RL all reuse this same bias-variance tradeoff.
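The two targets can be computed side by side for a single finished rollout. This is a sketch under stated assumptions: `next_values[t]` stands for the current estimate $V(s_ {t+1})$, with zero at the terminal state, and the reward and value numbers are made up.

```python
def mc_targets(rewards, gamma):
    """y_t^MC = sum_k gamma^k r_{t+k}, computed backward over a finished rollout."""
    targets = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        targets[t] = g
    return targets


def td_targets(rewards, next_values, gamma):
    """y_t^TD = r_t + gamma * V(s_{t+1}); next_values[t] is the estimate of V(s_{t+1})."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


rewards = [1.0, 0.0, 2.0]
gamma = 0.5

mc = mc_targets(rewards, gamma)
# If the value estimates happen to equal the realized future returns,
# the TD targets coincide with the Monte Carlo targets on this rollout.
td_exact = td_targets(rewards, [1.0, 2.0, 0.0], gamma)
# With a poor value estimate (all zeros), the TD targets are biased.
td_poor = td_targets(rewards, [0.0, 0.0, 0.0], gamma)
print(mc, td_exact, td_poor)
```

Note that `td_targets` needs only one transition per target, so it can be evaluated before the episode ends, while `mc_targets` needs the full reward sequence.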

Algorithm Families

The course organizes deep RL methods by how they use data and what they learn:

Imitation learning: learn from expert demonstrations, usually without rewards.
Policy gradients: directly differentiate the expected-reward objective.
Actor-critic: learn both a policy and a value estimate to reduce policy-gradient variance.
Value-based methods: learn the value of optimal actions and choose greedily.
Offline RL: learn from a fixed dataset without collecting new online data.
Model-based RL: learn or use a dynamics model for synthetic data or planning.
Multi-task and goal-conditioned RL: share models and data across tasks or goals.
Reward learning: infer rewards from goals, demonstrations, or preferences.

Different algorithms trade off stability, data efficiency, ease of reward specification, action-space type, and how dangerous or expensive online exploration is.

Policy Evaluation and Policy Improvement

Many algorithms are variants of generalized policy iteration:

Policy evaluation: estimate how good the current policy is, using $V^\pi$ or $Q^\pi$.
Policy improvement: choose actions or update a policy to make high-value actions more likely.

Exact dynamic programming can do this if the transition model and reward are known. Model-free RL replaces exact expectations with sampled transitions. Actor-critic estimates $V^\pi$ or $Q^\pi$ and improves a parameterized policy. Q-learning tries to skip explicit policy evaluation and directly approximate $Q^\ast$.
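In the exact tabular case, the evaluation and improvement loop can be sketched as plain policy iteration. The code assumes a hypothetical two-state, two-action MDP with known dynamics and rewards, deterministic policies stored as a list of action indices, and a fixed number of evaluation sweeps. All of these are illustrative choices.

```python
GAMMA = 0.9
# Hypothetical tabular MDP; all numbers are made up.
P = [[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s2] = p(s2 | s, a)
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]         # R[s][a] = r(s, a)


def q_value(V, s, a):
    """Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2))


def evaluate(policy, num_sweeps=500):
    """Policy evaluation: repeated Bellman backups for a deterministic policy."""
    V = [0.0, 0.0]
    for _ in range(num_sweeps):
        V = [q_value(V, s, policy[s]) for s in range(2)]
    return V


def improve(V):
    """Policy improvement: act greedily with respect to the evaluated values."""
    policy = []
    for s in range(2):
        q = [q_value(V, s, a) for a in range(2)]
        policy.append(q.index(max(q)))
    return policy


def policy_iteration():
    policy = [0, 0]                  # arbitrary starting policy
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:     # greedy policy is stable: stop
            return policy, V
        policy = new_policy


policy, V = policy_iteration()
print(policy, V)
```

Model-free methods keep the same two-step structure but replace `q_value`, which needs `P` and `R`, with estimates built from sampled transitions.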

Remarks

$r(s,a)$ may mean the deterministic reward or the expected reward $\mathbb{E}[R\mid s,a]$.

A POMDP can be converted into a belief-state MDP in principle, but practical deep RL usually feeds histories, recurrent hidden states, or transformer contexts rather than exact Bayesian beliefs.
