Welcome to CME 241: Reinforcement Learning for Stochastic Control Problems in Finance

Instructor: Ashwin Rao

• Winter 2019 Classes: Wed & Fri 4:30-5:50pm. Bldg 200 (Lane History Corner), Room 203

• Ashwin's Office Hours: Fri 2-4pm (or by appointment) in ICME M09 (Ashwin's office)

• Course Assistant (CA) Jeffrey Gu's Office Hours: Mon 2-3pm in Huang Engg. Basement

Grade will be based on

Learning Material will be a combination of

Lecture-by-Lecture (tentative) schedule with corresponding lecture slides, reading/videos, and assignments

January 9 Course Overview
  • Install/set up LaTeX and Python 3 (and optionally Jupyter Notebook) on your laptop
  • Create a git repo for this course where you can upload and organize all the code and technical writing you will do as part of assignments and self-learning
  • Let the Course Assistant (CA) know of your git repo, so we can periodically review your assignments and other self-learning work
January 11 Markov Processes (MP) and Markov Reward Processes (MRP)
  • Write out the MP/MRP definitions and MRP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)
  • Think about the data structures/class design (in Python 3) to represent MP/MRP and implement them with clear type declarations
  • Remember that your data structure/code design should resemble the mathematical/notational formalism as closely as possible
  • Specifically, the data structure/code design of the MRP should build incrementally on that of the MP (rather than being independent of it); see the sketch after this list
  • Separately implement the r(s,s') and the R(s) = \sum_{s'} p(s,s') * r(s,s') definitions of MRP
  • Write code to convert/cast the r(s,s') definition of MRP to the R(s) definition of MRP (put some thought into code design here)
  • Write code to generate the stationary distribution for an MP
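A minimal Python sketch covering several of the items above, assuming a tabular representation where states are indexed 0..n-1, the transition probabilities form a dense matrix, and the chain is ergodic (so the stationary distribution is well defined); the class and attribute names are illustrative, not the course's prescribed design:

    import numpy as np

    class MP:
        def __init__(self, transition: np.ndarray):
            self.P = transition  # P[s, s'] = p(s, s')

        def stationary_distribution(self) -> np.ndarray:
            # Left eigenvector of P with eigenvalue 1, normalized to sum to 1
            # (assumes an ergodic chain).
            eigvals, eigvecs = np.linalg.eig(self.P.T)
            pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
            return pi / pi.sum()

    class MRP(MP):  # MRP extends MP incrementally rather than independently
        def __init__(self, transition: np.ndarray, reward_ss: np.ndarray, gamma: float):
            super().__init__(transition)
            self.r_ss = reward_ss  # r(s, s')
            self.gamma = gamma

        def reward_s(self) -> np.ndarray:
            # Convert r(s, s') to R(s) = \sum_{s'} p(s, s') * r(s, s')
            return (self.P * self.r_ss).sum(axis=1)

Making MRP a subclass of MP mirrors the "incremental, not independent" design requirement, and reward_s performs the r(s,s') to R(s) conversion.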
January 16 Markov Decision Processes (MDP), Value Function, and Bellman Equations
  • Write the Bellman equation for the MRP Value Function and code to calculate the MRP Value Function (based on the matrix-inversion method you learned in this lecture)
  • Write out the MDP definition, Policy definition and MDP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)
  • Think about the data structure/class design (in Python 3) to represent MDP, Policy, Value Function, and implement them with clear type definitions
  • The data structure/code design of the MDP should build incrementally on that of the MRP (rather than being independent of it)
  • Separately implement the r(s,s',a) and R(s,a) = \sum_{s'} p(s,s',a) * r(s,s',a) definitions of MDP
  • Write code to convert/cast the r(s,s',a) definition of MDP to the R(s,a) definition of MDP (put some thought into code design here)
  • Write code to create an MRP given an MDP and a Policy (see the sketch after this list)
  • Write out all 8 MDP Bellman Equations and also the transformation from Optimal Action-Value function to Optimal Policy (in LaTeX)
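A minimal, self-contained sketch of a few of the coding items above (illustrative function and variable names, not the course's prescribed API), assuming a tabular MDP with transition tensor P[s, a, s'], expected reward R[s, a], a stochastic policy pi[s, a], and discount factor gamma:

    import numpy as np

    def mdp_to_mrp(P: np.ndarray, R: np.ndarray, pi: np.ndarray):
        # Average over actions under the policy to get the implied MRP.
        P_pi = np.einsum("sa,sat->st", pi, P)  # P_pi[s, s'] = \sum_a pi(a|s) * p(s, s', a)
        R_pi = np.einsum("sa,sa->s", pi, R)    # R_pi[s] = \sum_a pi(a|s) * R(s, a)
        return P_pi, R_pi

    def mrp_value_function(P_pi: np.ndarray, R_pi: np.ndarray, gamma: float) -> np.ndarray:
        # Matrix-inversion solution of the MRP Bellman equation: V = (I - gamma * P_pi)^{-1} * R_pi
        n = P_pi.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

The second function is also the matrix-inversion calculation of the MRP Value Function asked for in the previous lecture's list.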
January 18 Dynamic Programming Algorithms
  • Write code for Policy Evaluation (tabular) algorithm
  • Write code for Policy Iteration (tabular) algorithm
  • Write code for the Value Iteration (tabular) algorithm (see the sketch after this list)
  • Those familiar with function approximation (deep networks, or simply linear in features) can try writing code for the above algorithms with function approximation (a.k.a. Approximate DP)
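A minimal sketch of tabular Value Iteration, reusing the same illustrative MDP arrays (P[s, a, s'] and R[s, a]) as the sketch above; Policy Evaluation and Policy Iteration follow the same pattern, with the max over actions replaced by an expectation under the current policy:

    import numpy as np

    def value_iteration(P: np.ndarray, R: np.ndarray, gamma: float, tol: float = 1e-8):
        V = np.zeros(P.shape[0])
        while True:
            # Q[s, a] = R(s, a) + gamma * \sum_{s'} p(s, s', a) * V(s')
            Q = R + gamma * np.einsum("sat,t->sa", P, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)  # optimal Value Function and a greedy optimal policy
            V = V_new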
January 23 Understanding Risk-Aversion through Utility Theory (as a pre-req to Application Problem 1)
  • Work out (in LaTeX) the equations for Absolute/Relative Risk Premia for CARA/CRRA respectively (a worked CARA example follows this list)
  • Write the solutions to Portfolio Applications covered in class with precise notation (in LaTeX)
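A short worked example for the CARA case (a sketch in one standard parameterization; adapt to the notation used in class): with utility U(x) = -e^{-a x} / a and a normally distributed outcome x \sim N(\mu, \sigma^2), we have E[U(x)] = -e^{-a\mu + a^2\sigma^2/2} / a, so the certainty-equivalent value is x_{CE} = \mu - \frac{a\sigma^2}{2} and the Absolute Risk Premium is \pi_A = E[x] - x_{CE} = \frac{a\sigma^2}{2}. The CRRA/Relative Risk Premium case follows the same pattern with a lognormally distributed outcome.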
January 25 Application Problem 1 - Optimal Asset Allocation/Consumption (Merton's 1969 Portfolio Problem)
  • Model Merton's Portfolio problem as an MDP (write the model in LaTeX)
  • Implement this MDP model in code
  • Try recovering the closed-form solution with a DP algorithm that you implemented previously (the closed-form allocation is noted after this list for reference)
  • Model a real-world Portfolio Allocation+Consumption problem as an MDP (including real-world frictions and constraints)
  • Exam Practice Problem: Optimal Asset Allocation in Discrete Time (Solution)
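For reference on recovering the closed-form solution (a sketch of the standard Merton 1969 result under CRRA utility with relative risk-aversion \gamma, riskless rate r, and a single risky asset with drift \mu and volatility \sigma; verify against the exact setup covered in class): the optimal fraction of wealth allocated to the risky asset is constant in time and wealth, \pi^* = \frac{\mu - r}{\gamma \sigma^2}, which makes a convenient benchmark for your DP implementation.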
January 30 Application Problem 2 - Optimal Exercise of American Options
  • Implement Black-Scholes formulas for European Call/Put Pricing (see the sketch after this list)
  • Implement standard binary tree/grid-based numerical algorithm for American Option Pricing and ensure it validates against Black-Scholes formula for Europeans
  • Implement Longstaff-Schwartz Algorithm and ensure it validates against binary tree/grid-based solution for path-independent options
  • Explore/Discuss an Approximate Dynamic Programming solution as an alternative to Longstaff-Schwartz Algorithm
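A minimal sketch of the Black-Scholes formulas for European Call/Put pricing (no dividends), which serve as the validation benchmark mentioned above; the function and parameter names are illustrative:

    from math import log, sqrt, exp, erf

    def _norm_cdf(x: float) -> float:
        # Standard normal CDF via the error function (avoids a scipy dependency).
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def black_scholes_price(spot: float, strike: float, expiry: float,
                            rate: float, vol: float, is_call: bool = True) -> float:
        d1 = (log(spot / strike) + (rate + 0.5 * vol * vol) * expiry) / (vol * sqrt(expiry))
        d2 = d1 - vol * sqrt(expiry)
        if is_call:
            return spot * _norm_cdf(d1) - strike * exp(-rate * expiry) * _norm_cdf(d2)
        return strike * exp(-rate * expiry) * _norm_cdf(-d2) - spot * _norm_cdf(-d1)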
February 1 Application Problem 3 - Optimal Trade Order Execution
  • Work out (in LaTeX) the solution to the Linear Impact model we covered in class (the headline result is noted after this list)
  • Model a real-world Optimal Trade Order Execution problem as an MDP (with complete order book included in the State)
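For reference on the Linear Impact item (a sketch of the risk-neutral, linear permanent-impact setting in the spirit of Bertsimas-Lo; verify against the exact model covered in class): when each trade of size N_t moves the price linearly (permanent impact) and the objective is to minimize the expected cost of executing X shares over T periods, the optimal policy is the static equal split N_t^* = X / T for every t.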
February 6 Model-free (RL) Prediction With Monte Carlo and Temporal Difference
  • Write code for the interface for tabular RL algorithms. The core of this interface should be a mapping from a (state, action) pair to a sample of the (next state, reward) pair. It is important that this interface does not expose the state-transition probability model or the reward model.
  • Implement any tabular Monte-Carlo algorithm for Value Function prediction
  • Implement the tabular 1-step TD algorithm for Value Function prediction (see the sketch after this list)
  • Test the above implementation of Monte-Carlo and TD Value Function prediction algorithms versus DP Policy Evaluation algorithm on an example MDP
  • Prove that fixed learning rate (step size alpha) for MC is equivalent to an exponentially decaying average of episode returns
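A minimal sketch of tabular TD(0) prediction built on the kind of sampling interface described in the first item of this list; the interface shown here (a function returning a sampled (next_state, reward, done) triple for a state under the fixed policy being evaluated) and all names are illustrative:

    import random
    from collections import defaultdict
    from typing import Callable, Dict, List, Tuple

    def td0_prediction(sample_transition: Callable[[object], Tuple[object, float, bool]],
                       start_states: List[object],
                       gamma: float, alpha: float, num_episodes: int) -> Dict[object, float]:
        V: Dict[object, float] = defaultdict(float)
        for _ in range(num_episodes):
            s = random.choice(start_states)
            done = False
            while not done:
                s_next, r, done = sample_transition(s)
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])  # TD(0) update toward the bootstrapped target
                s = s_next
        return dict(V)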
February 8 Midterm Exam (Solutions)
February 13 Model-free (RL) Prediction with Eligibility Traces (TD(Lambda))
  • Optional: David Silver's corresponding video (youtube) on Model-free Prediction
  • n-Step TD section of Sutton-Barto textbook (pages 141-145)
  • Optional: TD(Lambda) and Eligibility Traces-based Prediction is covered on pages 287-297, but this treatment is for the more general case of function approximation of Value Function (we've only covered tabular RL algorithms so far).
  • Implement Forward-View TD(Lambda) algorithm for Value Function Prediction
  • Implement Backward-View TD(Lambda), i.e., the Eligibility Traces algorithm, for Value Function Prediction (see the sketch after this list)
  • Implement these algorithms as offline or online algorithms (offline means updates happen only after a full simulation trace, online means updates happen at every time step)
  • Test these algorithms on some example MDPs, compare them versus DP Policy Evaluation, and plot their accuracy as a function of Lambda
  • Prove that Offline Forward-View TD(Lambda) and Offline Backward View TD(Lambda) are equivalent. We covered the proof of Lambda = 1 in class. Do the proof for arbitrary Lambda (similar telescoping argument as done in class) for the case where a state appears only once in an episode.
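A minimal sketch of online Backward-View TD(Lambda) prediction with accumulating eligibility traces, using the same illustrative sampling interface as the TD(0) sketch above:

    import random
    from collections import defaultdict
    from typing import Callable, Dict, List, Tuple

    def td_lambda_prediction(sample_transition: Callable[[object], Tuple[object, float, bool]],
                             start_states: List[object],
                             gamma: float, lam: float, alpha: float,
                             num_episodes: int) -> Dict[object, float]:
        V: Dict[object, float] = defaultdict(float)
        for _ in range(num_episodes):
            E: Dict[object, float] = defaultdict(float)  # eligibility traces, reset each episode
            s = random.choice(start_states)
            done = False
            while not done:
                s_next, r, done = sample_transition(s)
                delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
                E[s] += 1.0  # accumulating trace for the current state
                for state in list(E.keys()):
                    V[state] += alpha * delta * E[state]
                    E[state] *= gamma * lam  # decay every trace
                s = s_next
        return dict(V)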
February 15 Model-free Control (RL for Optimal Value Function/Policy)
  • Optional: David Silver's corresponding video (youtube) on Model-free Control
  • MC and TD-based Control sections of Sutton-Barto textbook (pages 96-111, 129-134, 146-149)
  • Optional: SARSA(Lambda) is covered on pages 303-307, but this treatment is for the more general case of function approximation of Value Function (we've only covered tabular RL algorithms so far)
  • Prove the Epsilon-Greedy Policy Improvement Theorem (we sketched the proof in Class)
  • Provide (with clear mathematical notation) the definition of GLIE (Greedy in the Limit with Infinite Exploration)
  • Implement the tabular SARSA and tabular SARSA(Lambda) algorithms
  • Implement the tabular Q-Learning algorithm (see the sketch after this list)
  • Test the above algorithms on some example MDPs by using DP Policy Iteration/Value Iteration solutions as a benchmark
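A minimal sketch of tabular Q-Learning with epsilon-greedy exploration, assuming an illustrative sampling interface from a (state, action) pair to a sampled (next_state, reward, done) triple and a function giving the finite action set for each state:

    import random
    from collections import defaultdict
    from typing import Callable, Dict, List, Tuple

    def q_learning(sample_transition: Callable[[object, object], Tuple[object, float, bool]],
                   actions: Callable[[object], List[object]],
                   start_states: List[object],
                   gamma: float, alpha: float, epsilon: float,
                   num_episodes: int) -> Dict[Tuple[object, object], float]:
        Q: Dict[Tuple[object, object], float] = defaultdict(float)
        for _ in range(num_episodes):
            s = random.choice(start_states)
            done = False
            while not done:
                acts = actions(s)
                if random.random() < epsilon:
                    a = random.choice(acts)                 # explore
                else:
                    a = max(acts, key=lambda x: Q[(s, x)])  # exploit
                s_next, r, done = sample_transition(s, a)
                best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions(s_next))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # off-policy (max) target
                s = s_next
        return dict(Q)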
February 20 and 22 RL with Function Approximation (including Deep RL and Batch Methods)
  • Write code for the interface for RL algorithms with value function approximation. The core of this interface should be a function from a (state, action) pair to a sample of the (next state, reward) pair. It is important that this interface does not expose the state-transition probability model or the reward model.
  • Implement any Monte-Carlo Prediction algorithm with Value Function approximation
  • Implement the 1-step TD Prediction algorithm with Value Function approximation (see the sketch after this list)
  • Implement Eligibility-Traces-based TD(lambda) Prediction algorithm with Value Function approximation
  • Implement SARSA and SARSA(Lambda) with Value Function approximation
  • Implement Q-Learning with Value Function approximation
  • Optional: Implement LSTD and LSPI
  • Test the above algorithms versus DP Policy Evaluation and DP Policy Iteration/Value Iteration algorithm on an example MDP
  • Project Suggestion: Customize the LSPI algorithm for American Option Pricing (see this paper)
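A minimal sketch of semi-gradient TD(0) prediction with a linear Value Function approximation V(s) \approx w^\top \phi(s), using the same illustrative sampling interface as the earlier tabular sketches; phi is a user-supplied feature map, and SARSA/Q-Learning with approximation follow the same pattern applied to Q(s, a):

    import random
    import numpy as np
    from typing import Callable, List, Tuple

    def linear_td0_prediction(sample_transition: Callable[[object], Tuple[object, float, bool]],
                              phi: Callable[[object], np.ndarray],
                              start_states: List[object],
                              gamma: float, alpha: float, num_episodes: int) -> np.ndarray:
        w = np.zeros_like(phi(start_states[0]), dtype=float)
        for _ in range(num_episodes):
            s = random.choice(start_states)
            done = False
            while not done:
                s_next, r, done = sample_transition(s)
                v_next = 0.0 if done else float(w @ phi(s_next))
                delta = r + gamma * v_next - float(w @ phi(s))
                w += alpha * delta * phi(s)  # semi-gradient update of the weight vector
                s = s_next
        return w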
February 27 and March 1 Value Function Geometry and Gradient TD
  • This week, aim to catch up on all the non-optional reading from previous weeks
  • Optional: Those of you who have made more progress on reading can aim to do some of the Optional reading from previous weeks (e.g., the DQN paper and the LSPI paper)
  • This week, aim to catch up on the coding assignments from previous weeks. Aim to do the basic algorithms
  • Those of you who have made more progress on coding can try implementing the more advanced algorithms in my suggested assignment work from previous weeks (e.g., LSTD/LSPI)
  • Write, with proper notation, the derivations of the solutions to the linear systems for Bellman Error-minimization and Projected Bellman Error-minimization (the lecture slides have the derivation, but aim to do it yourself; the resulting linear system is noted after this list for reference)
  • Optional: Write, with proper notation, the derivation of Gradient TD (the last two slides of the lecture have the derivation, but aim to do it yourself)
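For reference on the Projected Bellman Error item (a sketch in the standard linear function approximation notation, with feature matrix \Phi, diagonal on-policy state-distribution matrix D, policy transition matrix P^\pi, reward vector R^\pi, and discount \gamma; confirm against the lecture's notation): setting the projected Bellman error to zero gives the linear system \Phi^\top D (\Phi w - R^\pi - \gamma P^\pi \Phi w) = 0, i.e., A w = b with A = \Phi^\top D (\Phi - \gamma P^\pi \Phi) and b = \Phi^\top D R^\pi, whose solution w = A^{-1} b is the fixed point that LSTD estimates from samples.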
March 6 Policy Gradient Algorithms
  • Write Proof (with precise notation) of the Policy Gradient Theorem
  • Derive the score function for a softmax policy (for a finite set of actions)
  • Derive the score function for a Gaussian policy (for continuous actions); both score functions are noted after this list for reference
  • Write code for the REINFORCE Algorithm (Monte-Carlo Policy Gradient Algorithm, i.e., no Critic)
  • Optional: Write code for Actor-Critic Policy Gradient Algorithms (with TD(0) and with Eligibility-Traces-based TD(Lambda))
  • Write Proof (with proper notation) of the Compatible Function Approximation Theorem
  • Optional: Write code for the Natural Policy Gradient Algorithm based on Compatible Linear Function Approximation for the Critic
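For reference on the two score-function derivations above (a sketch in one common parameterization; adapt to the notation used in class): for a softmax policy over a finite action set with linear action preferences, \pi_\theta(a|s) \propto e^{\theta^\top \phi(s,a)}, the score function is \nabla_\theta \log \pi_\theta(a|s) = \phi(s,a) - \sum_b \pi_\theta(b|s) \phi(s,b); for a Gaussian policy over continuous actions with mean \theta^\top \phi(s) and fixed variance \sigma^2, the score function is \nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \theta^\top \phi(s)) \phi(s)}{\sigma^2}.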
March 8 Integrating Learning and Planning
  • Aim to catch up on the coding assignment of trying to solve the finance problem of your choice with an RL Algorithm
March 13 Exploration versus Exploitation
  • Aim to catch up on the coding assignment of trying to solve the finance problem of your choice with an RL Algorithm
March 15 Case Study on Planning, Control and Pricing in Real-World Retail Industry
  • Review all of the course material to prepare for the Final Exam
  • Upload all of your assignment work to the git repo you created at the start of the course
  • Inform the Course Assistant that you have uploaded all of your work
  • Ensure that all of your assignment work can be accessed by the Course Assistant at github.com (make sure you have pushed and not just committed)
March 22 (3:30pm-6:30pm in Bldg 200 (Lane History Corner), Room 203) Final Exam (Solutions)

Purpose and Grading of Assignments