\documentclass{article}

\usepackage{fancyhdr} % Required for custom headers
\usepackage{lastpage} % Required to determine the last page for the footer
\usepackage{extramarks} % Required for headers and footers
\usepackage[usenames,dvipsnames]{color} % Required for custom colors
\usepackage{graphicx} % Required to insert images
\usepackage{listings} % Required for insertion of code
\usepackage{courier} % Required for the courier font
\usepackage{enumerate}
\usepackage{enumitem}
\usepackage{booktabs}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{caption}
\usepackage{subcaption}
\captionsetup[table]{skip=4pt}
\usepackage{framed}
\usepackage{bm}
\usepackage[most]{tcolorbox}
\usepackage{physics}
\usepackage{xcolor}
\graphicspath{{img/}} % set of paths to search for images
\usepackage{amsfonts}
\usepackage{hyperref}
\hypersetup{
     colorlinks = true,
     linkcolor = blue,
     anchorcolor = blue,
     citecolor = red,
     filecolor = blue,
     urlcolor = blue
     }

\newenvironment{myitemize}
{ \begin{itemize}
		\setlength{\itemsep}{0pt}
		\setlength{\parskip}{0pt}
		\setlength{\parsep}{0pt}     }
	{ \end{itemize}                  } 

\usepackage{biblatex} % bibliography
\addbibresource{papers.bib}

\usepackage{tikz}
\usetikzlibrary{positioning,patterns,fit}

\newif\ifanswers % flag controlling whether solutions are shown
\newcommand{\ifans}[1]{\ifanswers \color{red} \textbf{Solution: } #1 \color{black} \fi}

 \newcommand\solution[1]{\paragraph{\textcolor{purple}{Solution}}{\color{purple} #1}}
 \newcommand\solutionInline[1]{\textcolor{purple}{#1}}

%\newcommand\solution[1]{}
%\newcommand\solutionInline[1]{}

\newcommand{\E}{\mathbb{E}}
\newcommand{\given}{\,|\,}

\newcommand{\safereward}{r_{\text{safe}}}
\newcommand{\lowreward}{\underline{r}_{\text{risk}}}
\newcommand{\highreward}{\overline{r}_{\text{risk}}}
\newcommand{\consreward}{r_{\text{cons}}}

% Margins
\topmargin=-0.45in
\evensidemargin=0in
\oddsidemargin=0in
\textwidth=6.5in
\textheight=9.0in
\headsep=0.25in

\linespread{1.1} % Line spacing

% Set up the header and footer
\pagestyle{fancy}
\rhead{\hmwkAuthorName} % Top right header
\lhead{\hmwkClass: \hmwkTitle} % Top left header
\lfoot{\lastxmark} % Bottom left footer
\cfoot{} % Bottom center footer
\rfoot{Page\ \thepage\ of\ \protect\pageref{LastPage}} % Bottom right footer
\renewcommand\headrulewidth{0.4pt} % Size of the header rule
\renewcommand\footrulewidth{0.4pt} % Size of the footer rule

\setlength\parindent{0pt} % Removes all indentation from paragraphs

%----------------------------------------------------------------------------------------
%	CODE INCLUSION CONFIGURATION
%----------------------------------------------------------------------------------------

\definecolor{MyDarkGreen}{rgb}{0.0,0.4,0.0} % This is the color used for comments
\lstloadlanguages{Python}
\lstset{language=Python,
        frame=single, % Single frame around code
        basicstyle=\footnotesize\ttfamily, % Use small true type font
        keywordstyle=[1]\color{Blue}\bfseries,
        keywordstyle=[2]\color{Purple},
        keywordstyle=[3]\color{Blue}\underbar, % Custom functions underlined and blue
        identifierstyle=, % Nothing special about identifiers
        commentstyle=\usefont{T1}{pcr}{m}{sl}\color{MyDarkGreen}\small, % Comments small dark green courier font
        stringstyle=\color{Purple}, % Strings are purple
        showstringspaces=false, % Don't put marks in string spaces
        tabsize=5, % 5 spaces per tab
        morekeywords={rand},
        morekeywords=[2]{on, off, interp},
        morekeywords=[3]{test},
        morecomment=[l][\color{Blue}]{...}, % Line continuation (...) like blue comment
        numbers=left, % Line numbers on left
        firstnumber=1, % Line numbers start with line 1
        numberstyle=\tiny\color{Blue}, % Line numbers are blue and small
        stepnumber=5 % Line numbers go in steps of 5
}

\newcommand{\perlscript}[2]{
\begin{itemize}
\item[]\lstinputlisting[caption=#2,label=#1]{#1.pl}
\end{itemize}
}

%----------------------------------------------------------------------------------------
%	NAME AND CLASS SECTION
%----------------------------------------------------------------------------------------

\newcommand{\hmwkTitle}{Assignment \#3} % Assignment title
\newcommand{\hmwkClass}{CS\ 234} % Course/class
\newcommand{\hmwkAuthorName}{} % Your name

%----------------------------------------------------------------------------------------
%	TITLE PAGE
%----------------------------------------------------------------------------------------

\title{
\vspace{-1in}
\textmd{\textbf{\hmwkClass:\ \hmwkTitle}}}
\author{}
\date{} % Insert date here if you want it to appear below your name

\begin{document}

\maketitle
\vspace{-.5in}
\begin{framed}
{\bf Due date: February 16, 2022 at 6:00 PM (18:00) PST}
\\[1em]
These questions require thought, but do not require long answers. Please be as concise as possible.
\\[1em]
We encourage students to discuss in groups for assignments. \textbf{However, each student must finish the
problem set and programming assignment individually, and must turn in her/his assignment.} We ask
that you abide by the university Honor Code and that of the Computer Science department, and make
sure that all of your submitted work is done by yourself. If you have discussed the problems with others,
please include a statement saying who you discussed problems with. Failure to follow these instructions
will be reported to the Office of Community Standards. We reserve the right to run fraud-detection software on your code.
\\[1em]
Please review any additional instructions posted on the assignment page at
\url{http://web.stanford.edu/class/cs234/assignments.html}. When you are ready to
submit, please follow the instructions on the course website.
\\[1em]
\end{framed}

\section{Policy Gradient Methods (50 pts coding + 15 pts writeup)}
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments and to implement a neural network baseline for variance reduction. The framework for the policy gradient algorithm is set up in \texttt{main.py}, and everything that you need to implement is in the files \texttt{network\_utils.py}, \texttt{policy.py}, \texttt{policy\_gradient.py} and \texttt{baseline\_network.py}. These files contain detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here.
\subsection{REINFORCE}
Recall the policy gradient theorem,
\[ \nabla_\theta J(\theta) = \mathbb E_{\pi_\theta} \left[ \nabla_\theta \log\pi_\theta(a|s) Q^{\pi_\theta} (s,a) \right] \]
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$ as unbiased estimates of $Q^{\pi_\theta}(s,a)$. 
The REINFORCE estimator can be expressed as the gradient of the following objective function:
\[ J(\theta) = \frac{1}{\sum T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log(\pi_\theta(a^i_t|s^i_t)) G^i_t \]
where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau^i =(s^i_0, a^i_0, r^i_0, s^i_1, \dots, s^i_{T_i}, a^i_{T_i}, r^i_{T_i})$ is trajectory $i$.
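
For concreteness, the following is a minimal PyTorch sketch of this surrogate objective (the tensor names \texttt{log\_probs} and \texttt{returns} are illustrative assumptions, not the starter code's interface); maximizing $J(\theta)$ corresponds to minimizing the negated objective below:
\begin{lstlisting}
import torch

def reinforce_loss(log_probs, returns):
    # log_probs: shape (sum_i T_i,), log pi_theta(a_t | s_t) for every step
    # returns:   shape (sum_i T_i,), Monte Carlo returns G_t (treated as constants)
    # Negative surrogate objective: a gradient step minimizing this loss
    # ascends J(theta).
    return -(log_probs * returns.detach()).mean()
\end{lstlisting}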

\subsection{Baseline}
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled returns $G_t$ can have high variance. To reduce variance, we subtract a baseline $b_{\phi}(s)$ from the estimated returns when computing the policy gradient. A good choice of baseline is the state value function $V^{\pi_\theta}(s)$, which we approximate by training $b_{\phi}$ to minimize the following mean-squared error loss:
\[ L_{\text{MSE}}(\phi) = \frac{1}{\sum T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} (b_{\phi}(s^i_t) - G^i_t)^2\]
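
As a rough sketch (again with illustrative names rather than the assignment's exact interface), one gradient step on this regression objective might look like:
\begin{lstlisting}
import torch

def baseline_update(baseline_net, optimizer, observations, returns):
    # Fit b_phi(s) to the Monte Carlo returns G_t with a mean-squared error loss.
    predictions = baseline_net(observations).squeeze(-1)  # shape (sum_i T_i,)
    loss = torch.nn.functional.mse_loss(predictions, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
\end{lstlisting}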
\subsection{Advantage Normalization}

After subtracting the baseline, we get the following new objective function:

\[ J(\theta) = \frac{1}{\sum T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log(\pi_\theta(a^i_t|s^i_t)) \hat{A}^i_t \]

where

\[\hat{A}^i_t = G^i_t - b_{\phi}(s^i_t)\]

A second variance reduction technique is to normalize the computed advantages, $\hat{A}^i_t$, so that they have mean $0$ and standard deviation $1$. From a theoretical perspective, centering the advantages amounts to adjusting them by a constant baseline, which does not change the policy gradient in expectation. Likewise, rescaling the advantages effectively changes the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the empirical advantages.
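
A minimal sketch of this normalization step (illustrative only; a small constant is added to the denominator to avoid division by zero):
\begin{lstlisting}
import numpy as np

def normalize_advantage(advantages, eps=1e-8):
    # Shift and rescale the empirical advantages to mean 0 and std 1.
    return (advantages - advantages.mean()) / (advantages.std() + eps)
\end{lstlisting}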

\subsection{Coding Questions (50 pts)}
The functions that you need to implement in \texttt{network\_utils.py}, \texttt{policy.py}, \texttt{policy\_gradient.py}, and \texttt{baseline\_network.py} are enumerated here. Detailed instructions for each function can be found in the comments in each of these files.

Note: The ``batch size'' for all the arguments is $\sum T_i$ since we already flattened out all the episode observations, actions, and rewards for you.

In \texttt{network\_utils.py},
\begin{itemize}
\item \texttt{build\_mlp}
\end{itemize}

In \texttt{policy.py},
\begin{itemize}
\item \texttt{BasePolicy.act}
\item \texttt{CategoricalPolicy.action\_distribution}
\item \texttt{GaussianPolicy.\_\_init\_\_}
\item \texttt{GaussianPolicy.std}
\item \texttt{GaussianPolicy.action\_distribution}
\end{itemize}

In \texttt{policy\_gradient.py},
\begin{itemize}
\item \texttt{PolicyGradient.init\_policy}
\item \texttt{PolicyGradient.get\_returns}
\item \texttt{PolicyGradient.normalize\_advantage}
\item \texttt{PolicyGradient.update\_policy}
\end{itemize}

In \texttt{baseline\_network.py},
\begin{itemize}
\item \texttt{BaselineNetwork.\_\_init\_\_}
\item \texttt{BaselineNetwork.forward}
\item \texttt{BaselineNetwork.calculate\_advantage}
\item \texttt{BaselineNetwork.update\_baseline}
\end{itemize}

\subsection{Testing}
We have provided some basic tests to sanity check your implementation. \textbf{Please note that the tests are not comprehensive, and passing them does not guarantee a correct implementation}. Use the following command to run the tests:
\begin{tcolorbox}
\begin{verbatim}
python run_basic_tests.py
\end{verbatim}
\end{tcolorbox}
You can also add additional tests of your own design in \texttt{tests/test\_basic.py}.

\subsection{Writeup Questions (15 pts)}
\begin{enumerate}
\item[(a) (3 pts)] To compute the REINFORCE estimator, you will need to calculate the values $\{G_t\}_{t=1}^{T}$ (we drop the trajectory index $i$ for simplicity), where
$$
G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
$$
Naively, computing all these values takes $O(T^2)$ time. Describe how to compute them in $O(T)$ time.

\item[(b) (12 pts)] The general form for running your policy gradient implementation is as follows:
\begin{tcolorbox}
\begin{verbatim}
python main.py --env-name ENV --seed SEED --no-baseline
\end{verbatim}
\end{tcolorbox}
if not using a baseline, or
\begin{tcolorbox}
\begin{verbatim}
python main.py --env-name ENV --seed SEED --baseline
\end{verbatim}
\end{tcolorbox}
if using a baseline. Here \texttt{ENV} should be \texttt{cartpole}, \texttt{pendulum}, or \texttt{cheetah}, and \texttt{SEED} should be a positive integer.

For each of the 3 environments, choose 3 random seeds and run the algorithm both without baseline and with baseline. Then plot the results using
\begin{tcolorbox}
\begin{verbatim}
python plot.py --env-name ENV --seeds SEEDS
\end{verbatim}
\end{tcolorbox}
where \texttt{SEEDS} should be a comma-separated list of seeds which you want to plot (e.g. \texttt{--seeds 1,2,3}).
\textbf{Please include the plots (one for each environment) in your writeup, and comment on whether or not you observe improved performance when using a baseline.}

We have the following expectations about performance to receive full credit:
\begin{itemize}
    \item cartpole: Should reach the max reward of 200 (although it may not stay there)
    \item pendulum: Should reach the max reward of 1000 (although it may not stay there)
    \item cheetah: Should reach at least 200 (could be as large as 950)
\end{itemize}
\end{enumerate}

\newpage
\section{Natural Gradient Methods (20 pts)}

Consider a finite MDP given by $(S, s_0, A, R, P)$, where $S$ is a set of states, $s_0$ is the start state, $A$ is a set of actions, $R$ is a reward function $R : S \times A \rightarrow [0, R_{\max}]$, and $P$ is the transition model. A policy for this MDP can be characterized as $\pi(a; s,\theta)$, denoting the probability of taking action $a$ in state $s$, parameterized by $\theta \in \mathbb{R}^{|\theta|}$. The average reward $\eta(\theta)$\footnote{We are abusing notation by writing $\eta(\theta)$ instead of $\eta(\pi_\theta)$, since it is really a function on $\{ \pi(a; s, \theta): \theta \in \mathbb{R}^{|\theta|}\}$, but we can ignore this.} of this policy is defined in the equation below, where $d^\pi(s)$ is the stationary probability that the policy $\pi(a;s,\theta)$ is in state $s$:

\begin{equation}
\eta(\theta) = \sum_{s, a} d^\pi(s)\pi(a;s,\theta)R(s,a)
\end{equation}

To find a policy parameterized by $\theta$ that maximizes the average reward, we can perform gradient ascent on $\eta(\theta)$ with respect to $\theta$. By the policy gradient theorem, the gradient of the average reward with respect to $\theta$ is:

\begin{equation}
\nabla \eta(\theta) = \sum_{s, a}d^\pi(s)\nabla\pi(a;s,\theta)Q^\pi(s, a)
\end{equation}

The gradient ascent step to maximize $\eta(\theta)$ with respect to $\theta$ is then:

\begin{equation}
\theta' \gets \theta + \nabla\eta(\theta)
\end{equation}

By definition, $\nabla\eta(\theta)$ is the direction of steepest ascent in the parameter space $\mathbb{R}^{|\theta|}$ as measured by the Euclidean metric. However, in this case $\theta$ parameterizes the distribution $\pi(a;s,\theta)$; therefore, it might be useful to instead ascend in the steepest direction in $\mathbb{R}^{|\theta|}$ as measured by a divergence between the $\pi(a; s, \theta)$ distributions (e.g.\ the KL divergence). One such method, the natural gradient, redefines the gradient ascent step as:

\begin{equation}
\theta' \gets \theta + \mathbf{F}^{-1}(\theta)\nabla\eta(\theta)
\end{equation}

where $\mathbf{F}(\theta)$ is the Fisher Information Matrix for $\pi(a;s,\theta)$:

\begin{equation}
\mathbf{F}(\theta) = \mathbb{E}_{d^\pi(s)} \left[ \mathbb{E}_{\pi(a;s,\theta)} \left[ \nabla \log \pi(a;s,\theta) \nabla \log \pi(a;s,\theta)^\top \right]\right]
\end{equation}
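
Although it is not needed for the derivations below, the expectation defining $\mathbf{F}(\theta)$ can be approximated from samples. The sketch below is illustrative only (the array name and shapes are assumptions): given $N$ sampled score vectors $\nabla \log \pi(a_n; s_n, \theta)$ with $(s_n, a_n)$ drawn from $d^\pi(s)\,\pi(a; s, \theta)$, their empirical second moment is a Monte Carlo estimate of $\mathbf{F}(\theta)$.
\begin{lstlisting}
import numpy as np

def estimate_fisher(score_samples):
    # score_samples: shape (N, d); row n is grad_theta log pi(a_n; s_n, theta).
    # Averaging the outer products gives a Monte Carlo estimate of F(theta).
    n = score_samples.shape[0]
    return score_samples.T @ score_samples / n
\end{lstlisting}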

\begin{enumerate}

\item[(a) (10 pts)] 
In this problem, we will use linear function approximation to approximate $Q^\pi(s, a)$, and show that for a specific linear approximation $\hat{Q}^\pi(s, a; \vb{w})$ and particular objective function, the weights $\vb{w^*}$ that minimize the objective are precisely the natural gradient of the average reward: $\mathbf{F}^{-1}(\theta)\nabla \eta(\theta)$. 

We define our feature vector $\vb{x}$ and linear approximation $\hat{Q}$ as follows:
\[\vb{x}^\pi(s, a) = \nabla \log \pi(a; s, \theta), \qquad \hat{Q}^\pi(s, a; \vb{w}) = \vb{w}^\top \vb{x}^\pi(s, a)\]
where the $i$-th element $\vb{x}^\pi(s,a)_i = [\nabla \log \pi(a; s, \theta)]_i = \dfrac{\partial \log \pi(a; s, \theta)}{\partial \theta_i}$. Our objective function is
\[J^\pi(\vb{w}) = \sum_{s,a} d^\pi(s )\pi(a;s,\theta)(\hat{Q}^\pi(s, a; \vb{w}) - Q^\pi(s, a))^2\]

Let $\vb{w}^*$ minimize this objective $J^\pi(\vb{w})$. Prove that
\[\vb{w}^* = \mathbf{F}^{-1}(\theta)\nabla\eta(\theta).\]


\item[(b) (10 pts)]
We have now computed the weights $\vb{w}^*$ that minimize our error objective $J^\pi(\vb{w})$. We then make our gradient ascent update to our parameter $\theta$:
\[\theta' \gets \theta + \alpha\mathbf{F}^{-1}(\theta)\nabla\eta(\theta)\]
where $\alpha$ is the step size. We will show that updating $\theta$ only changes the policy $\pi(a; s, \theta)$ locally, that is, by a factor that depends on our linear approximation $\hat{Q}^\pi(s, a; \vb{w^*})$. Prove
\[\pi(a; s, \theta') = \pi(a; s, \theta)(1 + \alpha\hat{Q}^\pi(s, a; \vb{w^*})) + O(\alpha^2).\]
Hint: Use a first-order Taylor expansion of $\pi(a; s, \theta')$ in $\theta'$ around $\theta$.

\end{enumerate}
Natural gradient methods like the ones you have explored in this problem are closely related to modern RL algorithms like Trust Region Policy Optimization (TRPO). TRPO updates policies to take the largest possible step while ensuring the KL-Divergence of the policy update isn't too large. At a high level, this prevents the policy from straying outside of the area where our function approximation is accurate. In part (b), you showed a related result for our natural gradient method: the policy change differs by a factor that depends on our linear approximation.

\newpage
\section{Ethical concerns with Policy Gradients (5 pts)}
In this assignment, we focus on policy gradients, an extremely popular and useful model-free technique for RL. However, policy gradient methods collect data from the environment with a potentially suboptimal policy during the learning process. While this is acceptable in simulators like MuJoCo or Atari, such exploration in real-world settings such as healthcare and education presents challenges.

Consider a case study of a Stanford CS course considering introducing an RL-based chatbot for office hours. For each assignment, some students will be given 100\% human CA office hours; others, 100\% chatbot; others, a mix of both. The reward signal is the students' grades on each assignment. Since the AI chatbot will learn through experience, at any given point in the quarter the help given by the chatbot might be better or worse than the help given by a randomly selected human CA.

If students are randomly assigned to a condition for each assignment, some students will be assigned more chatbot hours and others fewer. In addition, some students will be assigned more chatbot hours at the beginning of the term (when the chatbot has had fewer interactions and may have lower effectiveness) and fewer at the end, and vice versa. All students will be graded according to the same standards, regardless of which type of help they have received.

Researchers who experiment on human subjects are morally responsible for ensuring their well being and protecting them from being harmed by the study. A foundational document in research ethics, the \href{https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/read-the-belmont-report/index.html#xethical}{Belmont Report}, identifies three core principles of responsible research:

\begin{enumerate}
    \item \textbf{Respect for persons:} individuals are capable of making choices about their own lives on the basis of their personal goals. Research participants should be informed about the study they are considering undergoing, asked for their consent, and not coerced into giving it. Individuals who are less capable of giving informed consent, such as young children, should be protected in other ways.
    \item \textbf{Beneficence:} the principle of beneficence describes an obligation to ensure the well-being of subjects. It has been summarized as “do not harm” or “maximize possible benefits and minimize possible harms.”
    \item \textbf{Justice:} the principle of justice requires treating all people equally and distributing benefits and harms to them equitably. 
\end{enumerate}

\begin{enumerate}
    \item[(a)] \textbf{[4 pts]} In 4-6 sentences, describe \textbf{two} experimental  design or research choices that researchers planning the above experiment ought to make in order to respect these principles. Justify the importance of these choices using one of the three ethical principles above and indicating which principle you have chosen. For example, “Researchers ought to ensure that students advised by the chatbot are able to revise their assignments after submission with the benefit of human advice if needed. If they did not take this precaution, the principle of justice would be violated because the risk of harm from poor advice from the AI chatbot would be distributed unevenly.” 
\end{enumerate}

At universities, research experiments that involve human subjects are subject by federal law to Institutional Review Board (IRB) approval. The purpose of the IRB is to protect human subjects of research: to “assure, both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of humans participating as subjects in the research” (\href{https://www.fda.gov/about-fda/center-drug-evaluation-and-research-cder/institutional-review-boards-irbs-and-protection-human-subjects-clinical-trials}{reference}). The IRB process was established in response to abuses of human subjects in the name of medical research performed during WWII (\href{https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/read-the-belmont-report/index.html#xethical}{reference}). The IRB is primarily intended to address the responsibilities of the researcher towards the subjects. Familiarize yourself with Stanford’s IRB Research Compliance process at \href{https://researchcompliance.stanford.edu/panels/hs/forms/for-researchers#need}{this link}.

\begin{enumerate}
    \item[(b)] \textbf{[1 pt]} If you were conducting the above experiment, what process would you need to follow at Stanford (whom would you email, or where would you upload a research protocol) to get clearance?
\end{enumerate}

\end{document}
