Welcome to CME 241: Reinforcement Learning for Stochastic Control Problems in Finance
• Ashwin's Office Hours: Fri 2-4pm (or by appointment) in ICME M09 (Ashwin's office room)
• Course Assistant (CA) Jeffrey Gu's Office Hours: Mon 2-3pm in Huang Engg. Basement
Grade will be based on
- 25% Midterm Exam (on Theory, Modeling and Algorithms)
- 40% Final Exam (on Theory, Modeling and Algorithms)
- 35% Assignments: Programming, Technical Writing and Theory Problem-Solving (to be done throughout the course)
Learning Material will be a combination of
Lecture-by-Lecture (tentative) schedule with corresponding lecture slides, reading/videos, and assignments
Date | Lecture Slides | Reading/Videos | Suggested Assignments
January 9 
Course Overview 

- Install/Setup on your laptop with LaTeX, Python 3 (and optionally Jupyter notebook)
- Create a git repo for this course where you can upload and organize all the code and technical writing you will do as part of assignments and self-learning
- Let the Course Assistant (CA) know of your git repo, so we can periodically review your assignments and other self-learning work

January 11 
Markov Processes (MP) and Markov Reward Processes (MRP) 

- Write out the MP/MRP definitions and MRP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)
- Think about the data structures/class design (in Python 3) to represent MP/MRP and implement them with clear type declarations
- Remember your data structure/code design must resemble the Mathematical/notational formalism as much as possible
- Specifically, the data structure/code design of MRP should build incrementally on (and not be independent of) that of MP
- Separately implement the r(s,s') and the R(s) = \sum_{s'} p(s,s') * r(s,s') definitions of MRP
- Write code to convert/cast the r(s,s') definition of MRP to the R(s) definition of MRP (put some thought into code design here)
- Write code to generate the stationary distribution for an MP (see the sketch after this list for one possible design)
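
For those who want a starting point, here is a minimal sketch of one possible MP/MRP design, assuming a dictionary-of-dictionaries transition representation; the class and method names are illustrative choices, not the required design:

```python
from typing import Dict, Tuple
import numpy as np

S = str  # state identifier (a hypothetical choice; any hashable type works)

class MP:
    """Finite Markov Process: state -> {next state -> transition probability}."""
    def __init__(self, transitions: Dict[S, Dict[S, float]]) -> None:
        self.states = list(transitions.keys())
        self.transitions = transitions

    def stationary_distribution(self) -> Dict[S, float]:
        # Solve pi = pi P via the eigenvector of P^T for eigenvalue 1
        # (assumes an irreducible chain with a unique stationary distribution)
        P = np.array([[self.transitions[s].get(s1, 0.0) for s1 in self.states]
                      for s in self.states])
        eigvals, eigvecs = np.linalg.eig(P.T)
        v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
        v = v / v.sum()
        return {s: v[i] for i, s in enumerate(self.states)}

class MRP(MP):
    """MRP as an incremental extension of MP, using the r(s, s') reward convention."""
    def __init__(self, transitions: Dict[S, Dict[S, float]],
                 rewards: Dict[Tuple[S, S], float]) -> None:
        super().__init__(transitions)
        self.rewards = rewards  # r(s, s')

    def reward_function(self) -> Dict[S, float]:
        # Convert r(s, s') to R(s) = \sum_{s'} p(s, s') * r(s, s')
        return {s: sum(p * self.rewards.get((s, s1), 0.0)
                       for s1, p in self.transitions[s].items())
                for s in self.states}
```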

January 16 
Markov Decision Processes (MDP), Value Function, and Bellman Equations


- Write the Bellman equation for the MRP Value Function and code to calculate the MRP Value Function (based on the Matrix-inversion method you learnt in this lecture; see the sketch after this list)
- Write out the MDP definition, Policy definition and MDP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)
- Think about the data structure/class design (in Python 3) to represent MDP, Policy, Value Function, and implement them with clear type definitions
- The data structure/code design of MDP should build incrementally on (and not be independent of) that of MRP
- Separately implement the r(s,s',a) and R(s,a) = \sum_{s'} p(s,s',a) * r(s,s',a) definitions of MDP
- Write code to convert/cast the r(s,s',a) definition of MDP to the R(s,a) definition of MDP (put some thought into code design here)
- Write code to create an MRP given an MDP and a Policy
- Write out all 8 MDP Bellman Equations and also the transformation from the Optimal Action-Value function to the Optimal Policy (in LaTeX)
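
A minimal sketch of the matrix-inversion calculation for the MRP Value Function (the array-based representation and the function name are my own assumptions):

```python
import numpy as np

def mrp_value_function(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Solve the MRP Bellman equation V = R + gamma * P @ V by matrix inversion:
    V = (I - gamma * P)^{-1} R, where P is the transition probability matrix
    and R is the expected-reward vector R(s)."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```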

January 18 
Dynamic Programming Algorithms


- Write code for the Policy Evaluation (tabular) algorithm
- Write code for the Policy Iteration (tabular) algorithm
- Write code for the Value Iteration (tabular) algorithm (see the sketch after this list)
- Those familiar with function approximation (deep networks, or simply linear in features) can try writing code for the above algorithms with function approximation (a.k.a. Approximate DP)
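
One possible shape for the tabular Value Iteration code, assuming a hypothetical dictionary representation of the MDP transitions; adapt it to whatever MDP class design you settled on earlier:

```python
from typing import Dict, List, Tuple

# Hypothetical tabular MDP representation:
#   transitions[s][a] = list of (next_state, probability, reward) triples
Transitions = Dict[str, Dict[str, List[Tuple[str, float, float]]]]

def value_iteration(transitions: Transitions, gamma: float,
                    tol: float = 1e-8) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Tabular Value Iteration: repeatedly apply the Bellman optimality operator
    until the value function stops changing, then extract the greedy policy."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            q = {a: sum(p * (r + gamma * V[s1]) for s1, p, r in outcomes)
                 for a, outcomes in actions.items()}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s1])
                                                for s1, p, r in actions[a]))
              for s, actions in transitions.items()}
    return V, policy
```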

January 23 
Understanding Risk-Aversion through Utility Theory (as a prereq to Application Problem 1)


- Work out (in LaTeX) the equations for Absolute/Relative Risk Premia for CARA/CRRA respectively (the target results are sketched after this list)
- Write the solutions to Portfolio Applications covered in class with precise notation (in LaTeX)
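
As a sanity check on the derivations, the target results should look roughly like this (my own notation; A and R denote the Arrow-Pratt measures of absolute and relative risk aversion):

```latex
% Arrow-Pratt measures: A(x) = -U''(x)/U'(x), \quad R(x) = x \, A(x)
\[
\pi_A \approx \tfrac{1}{2}\,\sigma_x^2\,A(\bar{x}),
\qquad
\pi_R \approx \tfrac{1}{2}\,\sigma_{x/\bar{x}}^2\,R(\bar{x})
\]
% CARA: U(x) = -e^{-a x}/a gives A(x) = a; for x \sim \mathcal{N}(\mu,\sigma^2)
%   the absolute risk premium is exact: \pi_A = a\sigma^2/2.
% CRRA: U(x) = x^{1-\gamma}/(1-\gamma) gives R(x) = \gamma, so \pi_R \approx \gamma\sigma^2/2.
```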

January 25 
Application Problem 1 - Optimal Asset Allocation/Consumption (Merton's 1969 Portfolio Problem)


- Model Merton's Portfolio problem as an MDP (write the model in LaTeX; a sketch of the continuous-time formulation follows this list)
- Implement this MDP model in code
- Try recovering the closed-form solution with a DP algorithm that you implemented previously
- Model a real-world Portfolio Allocation+Consumption problem as an MDP (including real-world frictions and constraints)
- Exam Practice Problem: Optimal Asset Allocation in Discrete Time (Solution)
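
A sketch of the continuous-time formulation to aim for; the notation is a suggestion and the lecture's exact setup may differ slightly:

```latex
% State: (t, W_t) (time and wealth); Action: (\pi_t, c_t) = (risky-asset fraction, consumption rate)
% Wealth dynamics (risk-free rate r, risky-asset drift \mu, volatility \sigma):
\[
dW_t = \big( (r + \pi_t(\mu - r))\,W_t - c_t \big)\,dt + \pi_t \sigma W_t\, dz_t
\]
% Reward: U(c_t)\,dt at each instant (plus any terminal utility of wealth at T).
% For CRRA utility, the closed-form optimal allocation is the constant
% \pi^* = \frac{\mu - r}{\gamma \sigma^2}.
```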

January 30 
Application Problem 2 - Optimal Exercise of American Options


- Implement the Black-Scholes formulas for European Call/Put Pricing (see the sketch after this list)
- Implement a standard binary tree/grid-based numerical algorithm for American Option Pricing and ensure it validates against the Black-Scholes formula for Europeans
- Implement the Longstaff-Schwartz Algorithm and ensure it validates against the binary tree/grid-based solution for path-independent options
- Explore/Discuss an Approximate Dynamic Programming solution as an alternative to the Longstaff-Schwartz Algorithm
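
A compact sketch of the Black-Scholes formulas (the function name and signature are arbitrary choices):

```python
from math import exp, log, sqrt
from scipy.stats import norm

def black_scholes(spot: float, strike: float, tau: float,
                  r: float, sigma: float, is_call: bool = True) -> float:
    """Black-Scholes price of a European call/put on a non-dividend-paying stock.
    spot = current price, strike = K, tau = time to expiry (in years),
    r = continuously-compounded risk-free rate, sigma = annualized volatility."""
    d1 = (log(spot / strike) + (r + 0.5 * sigma ** 2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    if is_call:
        return spot * norm.cdf(d1) - strike * exp(-r * tau) * norm.cdf(d2)
    return strike * exp(-r * tau) * norm.cdf(-d2) - spot * norm.cdf(-d1)
```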

February 1 
Application Problem 3 - Optimal Trade Order Execution


- Work out (in LaTeX) the solution to the Linear Impact model we covered in class (the headline result is sketched after this list)
- Model a real-world Optimal Trade Order Execution problem as an MDP (with the complete order book included in the State)
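
For reference, under the simplest linear-impact assumptions (a Bertsimas-Lo-style, risk-neutral setup, which may differ from the exact model in the lecture), the headline result is the uniform-split schedule:

```latex
% Selling a total of X shares over T discrete periods with linear price impact,
% the optimal risk-neutral schedule trades the same quantity every period:
\[
N_t^* = \frac{X}{T}, \qquad t = 1, \dots, T
\]
```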

February 6 
Model-free (RL) Prediction with Monte Carlo and Temporal Difference


- Write code for the interface for tabular RL algorithms. The core of this interface should be a mapping from a (state, action) pair to a sampling of the (next state, reward) pair. It is important that this interface doesn't expose the state-transition probability model or the reward model.
- Implement any tabular Monte-Carlo algorithm for Value Function prediction
- Implement the tabular 1-step TD algorithm for Value Function prediction (see the sketch after this list)
- Test the above implementations of the Monte-Carlo and TD Value Function prediction algorithms versus the DP Policy Evaluation algorithm on an example MDP
- Prove that a fixed learning rate (step size alpha) for MC is equivalent to an exponentially decaying average of episode returns
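
A sketch of tabular TD(0) prediction against a sampling interface. Since prediction is for a fixed policy, the interface here is simplified to a state -> (next state, reward) sampler; all names and type choices are hypothetical:

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Tuple

# Hypothetical sampling interface for a fixed policy: given a state, return a
# sampled (next_state, reward); next_state is None when the episode terminates.
# Note that it exposes samples only, never the transition/reward model itself.
SampleNext = Callable[[str], Tuple[Optional[str], float]]

def td0_prediction(sample_next: SampleNext, start_state: Callable[[], str],
                   gamma: float, alpha: float, num_episodes: int) -> Dict[str, float]:
    """Tabular TD(0) prediction of the Value Function of the sampled policy."""
    V: Dict[str, float] = defaultdict(float)
    for _ in range(num_episodes):
        s = start_state()
        while s is not None:
            s_next, r = sample_next(s)
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])  # move V(s) toward the TD target
            s = s_next
    return dict(V)
```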

February 8 
Midterm Exam (Solutions) 
February 13 
Model-free (RL) Prediction with Eligibility Traces (TD(Lambda))

- Optional: David Silver's corresponding video (youtube) on Model-free Prediction
- n-Step TD section of the Sutton-Barto textbook (pages 141-145)
- Optional: TD(Lambda) and Eligibility-Traces-based Prediction is covered on pages 287-297, but this treatment is for the more general case of function approximation of the Value Function (we've only covered tabular RL algorithms so far).

- Implement the Forward-View TD(Lambda) algorithm for Value Function Prediction
- Implement the Backward-View TD(Lambda), i.e., Eligibility Traces algorithm for Value Function Prediction (see the sketch after this list)
- Implement these algorithms as offline or online algorithms (offline means updates happen only after a full simulation trace, online means updates happen at every time step)
- Test these algorithms on some example MDPs, compare them versus DP Policy Evaluation, and plot their accuracy as a function of Lambda
- Prove that Offline Forward-View TD(Lambda) and Offline Backward-View TD(Lambda) are equivalent. We covered the proof for Lambda = 1 in class. Do the proof for arbitrary Lambda (similar telescoping argument as done in class) for the case where a state appears only once in an episode.
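
A sketch of the online Backward-View TD(Lambda) update with accumulating eligibility traces, reusing the hypothetical sampling interface from the TD(0) sketch above:

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Tuple

SampleNext = Callable[[str], Tuple[Optional[str], float]]  # as in the TD(0) sketch

def td_lambda_backward(sample_next: SampleNext, start_state: Callable[[], str],
                       gamma: float, lam: float, alpha: float,
                       num_episodes: int) -> Dict[str, float]:
    """Online Backward-View TD(lambda) prediction using accumulating eligibility traces."""
    V: Dict[str, float] = defaultdict(float)
    for _ in range(num_episodes):
        E: Dict[str, float] = defaultdict(float)  # eligibility traces, reset each episode
        s = start_state()
        while s is not None:
            s_next, r = sample_next(s)
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            delta = target - V[s]           # TD error at this step
            E[s] += 1.0                     # accumulate the trace of the visited state
            for state in list(E):           # update every state in proportion to its trace
                V[state] += alpha * delta * E[state]
                E[state] *= gamma * lam     # decay traces
            s = s_next
    return dict(V)
```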

February 15 
Model-free Control (RL for Optimal Value Function/Policy)

- Optional: David Silver's corresponding video (youtube) on Model-free Control
- MC and TD-based Control sections of the Sutton-Barto textbook (pages 96-111, 129-134, 146-149)
- Optional: SARSA(Lambda) is covered on pages 303-307, but this treatment is for the more general case of function approximation of the Value Function (we've only covered tabular RL algorithms so far)

- Prove the Epsilon-Greedy Policy Improvement Theorem (we sketched the proof in class)
- Provide (with clear mathematical notation) the definition of GLIE (Greedy in the Limit with Infinite Exploration)
- Implement the tabular SARSA and tabular SARSA(Lambda) algorithms
- Implement the tabular Q-Learning algorithm (see the sketch after this list)
- Test the above algorithms on some example MDPs by using DP Policy Iteration/Value Iteration solutions as a benchmark
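
A sketch of tabular Q-Learning with epsilon-greedy exploration, against a hypothetical (state, action) -> (next state, reward) sampling interface; all names are assumptions:

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical control interface: sample (next_state, reward) for a (state, action)
# pair; next_state is None when the episode terminates.
SampleSA = Callable[[str, str], Tuple[Optional[str], float]]

def q_learning(sample: SampleSA, actions: Dict[str, List[str]],
               start_state: Callable[[], str], gamma: float, alpha: float,
               epsilon: float, num_episodes: int) -> Dict[Tuple[str, str], float]:
    """Tabular Q-Learning: epsilon-greedy behaviour policy, greedy (max) bootstrap target."""
    Q: Dict[Tuple[str, str], float] = defaultdict(float)
    for _ in range(num_episodes):
        s = start_state()
        while s is not None:
            if random.random() < epsilon:                        # explore
                a = random.choice(actions[s])
            else:                                                # exploit
                a = max(actions[s], key=lambda act: Q[(s, act)])
            s_next, r = sample(s, a)
            max_next = 0.0 if s_next is None else max(Q[(s_next, act)] for act in actions[s_next])
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
            s = s_next
    return dict(Q)
```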

February 20 and 22 
RL with Function Approximation (including Deep RL and Batch Methods)


- Write code for the interface for RL algorithms with value function approximation. The core of this interface should be a function from a (state, action) pair to a sampling of the (next state, reward) pair. It is important that this interface doesn't expose the state-transition probability model or the reward model.
- Implement any Monte-Carlo Prediction algorithm with Value Function approximation
- Implement the 1-step TD Prediction algorithm with Value Function approximation (a linear-approximation sketch follows this list)
- Implement the Eligibility-Traces-based TD(Lambda) Prediction algorithm with Value Function approximation
- Implement SARSA and SARSA(Lambda) with Value Function approximation
- Implement Q-Learning with Value Function approximation
- Optional: Implement LSTD and LSPI
- Test the above algorithms versus the DP Policy Evaluation and DP Policy Iteration/Value Iteration algorithms on an example MDP
- Project Suggestion: Customize the LSPI algorithm for American Option Pricing (see this paper)
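
A sketch of semi-gradient TD(0) prediction with a linear value function approximation; the feature-map and sampling interfaces are assumptions:

```python
from typing import Callable, Optional, Tuple
import numpy as np

# Hypothetical interfaces (both are assumptions, not a prescribed design):
SampleNext = Callable[[str], Tuple[Optional[str], float]]  # state -> (next state, reward)
Features = Callable[[str], np.ndarray]                     # state -> feature vector phi(s)

def semi_gradient_td0(sample_next: SampleNext, start_state: Callable[[], str],
                      phi: Features, num_features: int, gamma: float,
                      alpha: float, num_episodes: int) -> np.ndarray:
    """Semi-gradient TD(0) prediction with a linear value function V(s) = w . phi(s)."""
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s = start_state()
        while s is not None:
            s_next, r = sample_next(s)
            v_next = 0.0 if s_next is None else float(w @ phi(s_next))
            delta = r + gamma * v_next - float(w @ phi(s))   # TD error
            w += alpha * delta * phi(s)                      # semi-gradient update
            s = s_next
    return w
```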

February 27 and March 1 
Value Function Geometry and Gradient TD

- This week, aim to catch up on all the non-optional reading from previous weeks
- Optional: Those of you who have made more progress on reading can aim to do some of the Optional reading from previous weeks (e.g., the DQN paper and the LSPI paper)

- This week, aim to catch up on the coding assignments from previous weeks. Aim to do the basic algorithms
- Those of you who have made more progress on coding can try implementing the more advanced algorithms in my suggested assignment work from previous weeks (e.g., LSTD/LSPI)
- Write, with proper notation, the derivations of the solutions of the Linear Systems for Bellman Error minimization and Projected Bellman Error minimization (the lecture slides have the derivation, but aim to do it yourself; the target systems are sketched after this list)
- Optional: Write, with proper notation, the derivation of Gradient TD (the last two slides of the lecture have the derivation, but aim to do it yourself)
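
For reference, the two linear systems you should arrive at look roughly like this (my own notation: \Phi is the feature matrix, D the diagonal matrix of the state distribution, P the transition matrix, R the reward vector):

```latex
% Bellman Error (BE) minimization: a D-weighted least-squares problem
\[
w_{BE} = \big( (\Phi - \gamma P \Phi)^\top D\, (\Phi - \gamma P \Phi) \big)^{-1}
         (\Phi - \gamma P \Phi)^\top D\, R
\]
% Projected Bellman Error (PBE) minimization: the TD / LSTD fixed point
\[
w_{PBE} = \big( \Phi^\top D\, (\Phi - \gamma P \Phi) \big)^{-1} \Phi^\top D\, R
\]
```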

March 6 
Policy Gradient Algorithms


- Write the Proof (with precise notation) of the Policy Gradient Theorem
- Derive the score function for the softmax policy (for a finite set of actions); see the sketch after this list
- Derive the score function for the Gaussian policy (for continuous actions)
- Write code for the REINFORCE Algorithm (Monte-Carlo Policy Gradient Algorithm, i.e., no Critic)
- Optional: Write code for Actor-Critic Policy Gradient Algorithms (with TD(0) and with Eligibility-Traces-based TD(Lambda))
- Write the Proof (with proper notation) of the Compatible Function Approximation Theorem
- Optional: Write code for the Natural Policy Gradient Algorithm based on Compatible Linear Function Approximation for the Critic
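
As a check on the score-function derivations, the results should look roughly like the following (assuming linear-in-features action preferences for the softmax policy, and a fixed-variance Gaussian with linear mean for the continuous case):

```latex
% Softmax policy over a finite action set: \pi_\theta(a \mid s) \propto e^{\theta^\top \phi(s,a)}
\[
\nabla_\theta \log \pi_\theta(a \mid s)
  = \phi(s,a) - \sum_{b} \pi_\theta(b \mid s)\, \phi(s,b)
  = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]
\]
% Gaussian policy for continuous actions, with mean \mu(s) = \theta^\top \phi(s) and fixed variance \sigma^2:
\[
\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\big(a - \theta^\top \phi(s)\big)\,\phi(s)}{\sigma^2}
\]
```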

March 8 
Integrating Learning and Planning


- Aim to catch up on the coding assignment of trying to solve the finance problem of your choice with an RL Algorithm

March 13 
Exploration versus Exploitation


- Aim to catch up on the coding assignment of trying to solve the finance problem of your choice with an RL Algorithm

March 15 
Case Study on Planning, Control and Pricing in the Real-World Retail Industry

- Review all of the course material to prepare for the Final Exam

- Upload all of your assignment work to the github account you created at the start of the course
- Inform the Course Assistant that you have uploaded all of your work
- Ensure that all of your assignment work can be accessed by the Course Assistant at github.com (make sure you have pushed and not just committed)

March 22 (3:30pm-6:30pm in Bldg 200 (Lane History Corner), Room 203)
Final Exam (Solutions) 
Purpose and Grading of Assignments
- Assignments are not to be treated as "tests/exams" with a right/wrong answer
- Rather, they should be treated as part of your learning experience
- You will TRULY understand ideas/models/algorithms only when you WRITE down the Mathematics and the Code precisely
- In other words, simply reading the Mathematics or the Code gives you a false sense of understanding things
- Take the initiative to make up your own assignments, especially on topics you feel you don't quite understand
- Individual assignments won't get a grade and there are no due dates for the assignments
- Rather, the entire body of assignment work throughout the course will be graded (upload regularly to your course git repo)
- It will be graded less on correctness and completeness, and more on:
  - Coding and Technical Writing style that is clear and modular
  - Demonstration of curiosity and commitment to learning through the overall body of assignment work
  - Extent of engagement in asking questions and seeking feedback for improvements