\documentclass{article}

\usepackage{fancyhdr} % Required for custom headers
\usepackage{lastpage} % Required to determine the last page for the footer
\usepackage{extramarks} % Required for headers and footers
\usepackage[usenames,dvipsnames]{color} % Required for custom colors
\usepackage{graphicx} % Required to insert images
\usepackage{listings} % Required for insertion of code
\usepackage{courier} % Required for the courier font
\usepackage{enumerate}
\usepackage{enumitem}
\usepackage{booktabs}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{caption}
\usepackage{subcaption}
\captionsetup[table]{skip=4pt}
\usepackage{framed}
\usepackage{bm}
\usepackage[most]{tcolorbox}


\usepackage{xcolor}
\graphicspath{{img/}} % set of paths to search for images
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}

\newenvironment{myitemize}
{ \begin{itemize}
		\setlength{\itemsep}{0pt}
		\setlength{\parskip}{0pt}
		\setlength{\parsep}{0pt}     }
	{ \end{itemize}                  } 

\usepackage{biblatex} % bibliography
\addbibresource{papers.bib}

\usepackage{tikz}
\usetikzlibrary{positioning,patterns,fit}

\newcommand{\ifans}[1]{\ifanswers \color{red} \textbf{Solution: } #1 \color{black}}

 \newcommand\solution[1]{\paragraph{\textcolor{purple}{Solution}}{\color{purple} #1}}
 \newcommand\solutionInline[1]{\textcolor{purple}{#1}}

%\newcommand\solution[1]{}
%\newcommand\solutionInline[1]{}

\newcommand{\E}{\mathbb{E}}
\newcommand{\given}{\,|\,}

\newcommand{\safereward}{r_{\text{safe}}}
\newcommand{\lowreward}{\underline{r}_{\text{risk}}}
\newcommand{\highreward}{\overline{r}_{\text{risk}}}
\newcommand{\consreward}{r_{\text{cons}}}

% Margins
\topmargin=-0.45in
\evensidemargin=0in
\oddsidemargin=0in
\textwidth=6.5in
\textheight=9.0in
\headsep=0.25in

\linespread{1.1} % Line spacing

% Set up the header and footer
\pagestyle{fancy}
\rhead{\hmwkAuthorName} % Top left header
\lhead{\hmwkClass: \hmwkTitle} % Top center head
\lfoot{\lastxmark} % Bottom left footer
\cfoot{} % Bottom center footer
\rfoot{Page\ \thepage\ of\ \protect\pageref{LastPage}} % Bottom right footer
\renewcommand\headrulewidth{0.4pt} % Size of the header rule
\renewcommand\footrulewidth{0.4pt} % Size of the footer rule

\setlength\parindent{0pt} % Removes all indentation from paragraphs

%----------------------------------------------------------------------------------------
%	CODE INCLUSION CONFIGURATION
%----------------------------------------------------------------------------------------

\definecolor{MyDarkGreen}{rgb}{0.0,0.4,0.0} % This is the color used for comments
\lstloadlanguages{Python}
\lstset{language=Python,
        frame=single, % Single frame around code
        basicstyle=\footnotesize\ttfamily, % Use small true type font
        keywordstyle=[1]\color{Blue}\bf,
        keywordstyle=[2]\color{Purple},
        keywordstyle=[3]\color{Blue}\underbar, % Custom functions underlined and blue
        identifierstyle=, % Nothing special about identifiers
        commentstyle=\usefont{T1}{pcr}{m}{sl}\color{MyDarkGreen}\small, % Comments small dark green courier font
        stringstyle=\color{Purple}, % Strings are purple
        showstringspaces=false, % Don't put marks in string spaces
        tabsize=5, % 5 spaces per tab
        morekeywords={rand},
        morekeywords=[2]{on, off, interp},
        morekeywords=[3]{test},
        morecomment=[l][\color{Blue}]{...}, % Line continuation (...) like blue comment
        numbers=left, % Line numbers on left
        firstnumber=1, % Line numbers start with line 1
        numberstyle=\tiny\color{Blue}, % Line numbers are blue and small
        stepnumber=5 % Line numbers go in steps of 5
}

\newcommand{\perlscript}[2]{
\begin{itemize}
\item[]\lstinputlisting[caption=#2,label=#1]{#1.pl}
\end{itemize}
}

%----------------------------------------------------------------------------------------
%	NAME AND CLASS SECTION
%----------------------------------------------------------------------------------------

\newcommand{\hmwkTitle}{Assignment \#3} % Assignment title
\newcommand{\hmwkClass}{CS\ 234} % Course/class
\newcommand{\hmwkAuthorName}{} % Your name

%----------------------------------------------------------------------------------------
%	TITLE PAGE
%----------------------------------------------------------------------------------------

\title{
\vspace{-1in}
\textmd{\textbf{\hmwkClass:\ \hmwkTitle}}}
\author{}
\date{} % Insert date here if you want it to appear below your name

\begin{document}

\maketitle
\vspace{-.5in}
\begin{framed}
{\bf Due date: February 26, 2021 at 6:00 PM (18:00) PST}
\\[1em]
These questions require thought, but do not require long answers. Please be as concise as possible.
\\[1em]
We encourage students to discuss in groups for assignments. \textbf{However, each student must finish the
problem set and programming assignment individually, and must turn in her/his assignment.} We ask
that you abide by the university Honor Code and that of the Computer Science department, and make
sure that all of your submitted work is done by yourself. If you have discussed the problems with others,
please include a statement saying who you discussed problems with. Failure to follow these instructions
will be reported to the Office of Community Standards. We reserve the right to run a fraud-detection software on your code.
\\[1em]
Please review any additional instructions posted on the assignment page at
http://web.stanford.edu/class/cs234/assignments.html. When you are ready to submit, please
follow the instructions on the course website.
\\[1em]
\end{framed}

\section{Policy Gradient Methods (50 pts coding + 15 pts writeup)}
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments, and implement a neural network baseline for variance reduction. The framework for the policy gradient algorithm is setup in \texttt{main.py}, and everything that you need to implement is in the files \texttt{network\_utils.py}, \texttt{policy.py}, \texttt{policy\_gradient.py} and \texttt{baseline\_network.py}. The file has detailed instructions for each implementation task, but an overview of key steps in the algorithm is provided here.
\subsection{REINFORCE}
Recall the policy gradient theorem,
\[ \nabla_\theta J(\theta) = \mathbb E_{\pi_\theta} \left[ \nabla_\theta \log\pi_\theta(a|s) Q^{\pi_\theta} (s,a) \right] \]
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$ as unbiased estimates of $Q^{\pi_\theta}(s,a)$. 
The REINFORCE estimator can be expressed as the gradient of the following objective function:
\[ J(\theta) = \frac{1}{\sum T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log(\pi_\theta(a^i_t|s^i_t)) G^i_t \]
where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau^i =(s^i_0, a^i_0, r^i_0, s^i_1, \dots, s^i_{T_i}, a^i_{T_i}, r^i_{T_i})$ is trajectory $i$.

\subsection{Baseline}
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled return(s) $G_t$ can have high variance. To reduce variance, we subtract a baseline $b_{\phi}(s)$ from the estimated returns when computing the policy gradient. A good baseline is the state value function, $V^{\pi_\theta}(s)$, which requires a training update to $\phi$ to minimize the following mean-squared error loss:
\[ L_{\text{MSE}}(\phi) = \frac{1}{\sum T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} (b_{\phi}(s^i_t) - G^i_t)^2\]
\subsection{Advantage Normalization}

After subtracting the baseline, we get the following new objective function:

\[ J(\theta) = \frac{1}{\sum T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log(\pi_\theta(a^i_t|s^i_t)) \hat{A}^i_t \]

where

\[\hat{A}^i_t = G^i_t - b_{\phi}(s^i_t)\]

A second variance reduction technique is to normalize the computed advantages, $\hat{A}^i_t$, so that they have mean $0$ and standard deviation $1$. From a theoretical perspective, we can consider centering the advantages to be simply adjusting the advantages by a constant baseline, which does not change the policy gradient. Likewise, rescaling the advantages effectively changes the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the empirical advantages.

\subsection{Coding Questions (50 pts)}
The functions that you need to implement in \texttt{network\_utils.py}, \texttt{policy.py}, \texttt{policy\_gradient.py}, and \texttt{baseline\_network.py} are enumerated here. Detailed instructions for each function can be found in the comments in each of these files.

Note: The "batch size" for all the arguments is $\sum T_i$ since we already flattened out all the episode observations, actions, and rewards for you.

In \texttt{network\_utils.py},
\begin{itemize}
\item \texttt{build\_mlp}
\end{itemize}

In \texttt{policy.py},
\begin{itemize}
\item \texttt{BasePolicy.act}
\item \texttt{CategoricalPolicy.action\_distribution}
\item \texttt{GaussianPolicy.\_\_init\_\_}
\item \texttt{GaussianPolicy.std}
\item \texttt{GaussianPolicy.action\_distribution}
\end{itemize}

In \texttt{policy\_gradient.py},
\begin{itemize}
\item \texttt{PolicyGradient.init\_policy}
\item \texttt{PolicyGradient.get\_returns}
\item \texttt{PolicyGradient.normalize\_advantage}
\item \texttt{PolicyGradient.update\_policy}
\end{itemize}

In \texttt{baseline\_network.py},
\begin{itemize}
\item \texttt{BaselineNetwork.\_\_init\_\_}
\item \texttt{BaselineNetwork.forward}
\item \texttt{BaselineNetwork.calculate\_advantage}
\item \texttt{BaselineNetwork.update\_baseline}
\end{itemize}

\subsection{Testing}
We have provided some basic tests to sanity check your implementation. \textbf{Please note that the tests are not comprehensive, and passing them does not guarantee a correct implementation}. Use the following command to run the tests:
\begin{tcolorbox}
\begin{verbatim}
python run_basic_tests.py
\end{verbatim}
\end{tcolorbox}
You can also add additional tests of your own design in \texttt{tests/test\_basic.py}.

\subsection{Writeup Questions (15 pts)}
\begin{enumerate}
\item[(a) (3 pts)] To compute the REINFORCE estimator, you will need to calculate the values $\{G_t\}_{t=1}^{T}$ (we drop the trajectory index $i$ for simplicity), where
$$
G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
$$
Naively, computing all these values takes $O(T^2)$ time. Describe how to compute them in $O(T)$ time.

\solution{
	Your solution here
}

\item[(b) (12 pts)] The general form for running your policy gradient implementation is as follows:
\begin{tcolorbox}
\begin{verbatim}
python main.py --env-name ENV --seed SEED --no-baseline
\end{verbatim}
\end{tcolorbox}
if not using a baseline, or
\begin{tcolorbox}
\begin{verbatim}
python main.py --env-name ENV --seed SEED --baseline
\end{verbatim}
\end{tcolorbox}
if using a baseline. Here \texttt{ENV} should be \texttt{cartpole}, \texttt{pendulum}, or \texttt{cheetah}, and \texttt{SEED} should be a positive integer.

For each of the 3 environments, choose 3 random seeds and run the algorithm both without baseline and with baseline. Then plot the results using
\begin{tcolorbox}
\begin{verbatim}
python plot.py --env-name ENV --seeds SEEDS
\end{verbatim}
\end{tcolorbox}
where \texttt{SEEDS} should be a comma-separated list of seeds which you want to plot (e.g. \texttt{--seeds 1,2,3}).
\textbf{Please include the plots (one for each environment) in your writeup, and comment on whether or not you observe improved performance when using a baseline.}

We have the following expectations about performance to receive full credit:
\begin{itemize}
    \item cartpole: Should reach the max reward of 200 (although it may not stay there)
    \item pendulum: Should reach the max reward of 1000 (although it may not stay there)
    \item cheetah: Should reach at least 200 (Could be as large as 950)
\end{itemize}
\end{enumerate}

\solution{
	Your solution here
}

\newpage
\section{Reducing Variance in Policy Gradient Methods (35 pts)}

In class, we explored REINFORCE as a policy gradient method with no bias but high variance. In this problem, we will explore methods to dramatically reduce variance in policy gradient methods, potentially at the cost of increased bias. 

Let us consider an infinite horizon MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \gamma \rangle$. Let us define 

\begin{equation} 
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
\end{equation}

An approximation to the policy gradient is defined as

\begin{equation}
    g = \mathbb{E}_{\substack{s_{0:\infty}  \\ a_{0:\infty}}}[\sum_{t=0}^{\infty} A^{\pi}(s_t, a_t) \nabla_{\theta} \text{ log } \pi_{\theta}(a_t, s_t)]
\end{equation}

where the colon notation $a:b$ represents the range $[a, a+1, a+2, ... b]$ inclusive of both ends. 

\begin{enumerate}

\item[(a) (5 pts)] Let us define the partial sum $R_t = \sum_{i=0}^t r_i$. Show that it is not necessarily true that $\text{Var}(R_{t+1}) \geq \text{Var}(R_{t})$. [Hint: Construct a counterexample MDP where this statement does not hold.]

\solution{
	Your solution here
}

\item[(b) (10 pts)] Prove that $\text{Var}(R_{t+1}) \geq \text{Var}(R_{t})$ is true if we assume that $r_{t+1}$ is, on average, correlated with the previous rewards, i.e. $\frac{1}{t+1} \sum_{i=0}^t \text{Cov}(r_i, r_{t+1}) > 0$. 

\solution{
	Your solution here
}

\item[(c) (5 pts)] In practice, we do not have access to the true function $A^{\pi}(s_t, a_t)$, so we would like to obtain an estimate instead. We will consider the general form of an estimator $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$ that can be a function of the entire trajectory. 

Let $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = \hat{Q}_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$, where for all $s_t, a_t$, we have that $\hat{Q}_t$ is an unbiased estimator of the true $Q^{\pi}$. Namely, we have that $\mathbb{E}_{\substack{s_{t+1:\infty}  \\ a_{t+1:\infty}}}[\hat{Q}_t(s_{t:\infty}, a_{t:\infty})] = Q^{\pi}(s_t, a_t)$. Note that $b_t$ is an arbitrary function of the actions and states sampled before $a_t$. Prove that by using this estimate of $\hat{A}_t$, we obtain an unbiased estimate of the policy gradient $g$. In other words, prove that $\mathbb{E}_{\substack{s_{0:\infty}  \\ a_{0:\infty}}}[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_{\theta} \text{ log } \pi_{\theta}(a_t, s_t)] = g$. 

\solution{
	Your solution here
}


\item [(d) (5 pts)] We will now look at a few different variants of $\hat{A}_t$. Recall the TD error $\delta_t^{\hat{V}}(s_t, a_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$. If $\hat{V} = V^{\pi}$, prove that $\delta_t^{\hat{V}}$ is an unbiased estimate of $A^{\pi}$. 

\solution{
	Your solution here
}


\item [(e) (5 pts)] Let us define $\hat{A}_t^{(k)} = \sum_{i=0}^{k-1} \gamma^i \delta^{\hat{V}}_{t + i}$. Show that $\hat{A}_t^{(k)} =  - \hat{V}(s_t) + \gamma^k \hat{V}(s_{t+k}) + \sum_{i=0}^{k-1} \gamma^i r_{t+i}$. In general, how does bias and variance change as $k$ increases? (a few sentences of justification would suffice, no formal proof is necessary)

\solution{
	Your solution here
}




\item [(f) (5 pts)] Show that $\hat{A}_t^{(\infty)} = \sum_{i=0}^{\infty} \gamma^i r_{t+i} - \hat{V}(s_t)$. 

\solution{
	Your solution here
}




\end{enumerate}

\printbibliography

\end{document}
