Project 2

Due date: November 6th at 11:59pm

This project is a competition to find the best policy for three different Markov decision processes, given sampled transitions, each consisting of a state (s), an action (a), a reward (r), and a next state (sp). The best policy is the one that maximizes the total expected reward.

Three csv-formatted datasets are provided.

  1. small.csv 10 x 10 grid world (100 states) with 4 actions. Use sub2ind((10, 10), x, y) to find the integer state from the x and y coordinates. Actions are 1:left, 2:right, 3:up, 4:down. The discount factor is 0.95.

  2. medium.csv MountainCarContinuous-v0 environment from OpenAI Gym with altered parameters. State measurements are given by integers with 500 possible position values and 100 possible velocity values (50,000 possible state measurements). 1+pos+500*vel gives the integer corresponding to a state with position pos and velocity vel. There are 7 actions that represent different amounts of acceleration. This problem is undiscounted, but ends when the goal (the flag) is reached. Note that, since the discrete state measurements are calculated after the simulation, the data in medium.csv does not quite satisfy the Markov property.

  3. large.csv (or large.csv.tar.xz) It’s a secret! An MDP with 10101010 states and 125 actions, with a discount factor of 0.95.
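The two state encodings above can be written out explicitly. The following Python sketch assumes the conventions stated above: sub2ind-style column-major indexing with 1-indexed grid coordinates for small.csv, and the 1+pos+500*vel encoding for medium.csv (the function names are ours, for illustration only).

```python
def small_state(x, y):
    # Column-major linear index, matching sub2ind((10, 10), x, y)
    # for 1-indexed grid coordinates x, y in 1..10.
    return x + (y - 1) * 10          # yields 1..100

def medium_state(pos, vel):
    # medium.csv encoding: pos in 0..499, vel in 0..99.
    return 1 + pos + 500 * vel       # yields 1..50000
```

For example, small_state(10, 10) is 100 and medium_state(499, 99) is 50000, matching the stated state counts.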

Each line in the CSV file represents a transition from state (s) to (sp) (along with the associated reward (r)), when taking an action (a).
You might not have data for every state, nor are all states equally important. Use your best judgment in handling this.
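One possible (not required) approach is to fit a maximum-likelihood model from the sampled transitions and run value iteration on it. The sketch below assumes the CSV has a header row with columns named s, a, r, sp, and falls back to action 1 for states with no data; it is an illustration, not the intended solution.

```python
import csv
from collections import defaultdict

def load_transitions(path):
    # Each row: s, a, r, sp (header row assumed; adjust if absent).
    with open(path) as f:
        return [(int(row["s"]), int(row["a"]), float(row["r"]), int(row["sp"]))
                for row in csv.DictReader(f)]

def value_iteration(data, n_states, n_actions, gamma, iters=100):
    # Maximum-likelihood model: transition counts and mean rewards.
    counts = defaultdict(lambda: defaultdict(int))
    rsum = defaultdict(float)
    for s, a, r, sp in data:
        counts[(s, a)][sp] += 1
        rsum[(s, a)] += r

    def q(s, a, U):
        n = sum(counts[(s, a)].values())
        if n == 0:
            return None                      # (s, a) never observed
        exp_next = sum(c * U[sp] for sp, c in counts[(s, a)].items()) / n
        return rsum[(s, a)] / n + gamma * exp_next

    U = [0.0] * (n_states + 1)               # index 0 unused; states are 1-based
    for _ in range(iters):
        for s in range(1, n_states + 1):
            qs = [v for a in range(1, n_actions + 1)
                  if (v := q(s, a, U)) is not None]
            if qs:
                U[s] = max(qs)

    # Greedy policy; default to action 1 for states never seen in the data.
    pi = []
    for s in range(1, n_states + 1):
        best_a, best_q = 1, float("-inf")
        for a in range(1, n_actions + 1):
            v = q(s, a, U)
            if v is not None and v > best_q:
                best_a, best_q = a, v
        pi.append(best_a)
    return pi
```

For the large problem, iterating over every (state, action) pair this way may be too slow; exploiting sparsity or problem structure, as suggested above, is where judgment comes in.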

Rules

  • You can use any programming language but you cannot use any package directly related to reinforcement learning. You can use general optimization packages and libraries for things like function approximation and ODE solving so long as you discuss what you use in your writeup and make it clear how it is used in your code.

  • You can use different approaches and algorithms for the different problems. It will probably help to look at the data carefully and use the structure of the problem to your advantage.

  • Discussion is encouraged, but you must write your own code; anything else violates the honor code.

  • You may look at the code for the MountainCarContinuous-v0 environment, but note that we do not use exactly the same parameters.

Submission

  1. Log in to http://canvas.stanford.edu

  2. Use the sidebar to go to Assignments. Click on Assignment 2.
    Use the button at the bottom which says Load Project 2 - Reinforcement Learning to go to Vocareum.

  3. You should find the three CSV files: small.csv, medium.csv and large.csv in your workspace.
    Either use Vocareum’s editor to work on the project, or upload your code along with the following files to your workspace:

    • writeup.pdf, small.policy, medium.policy and large.policy.
    • The writeup.pdf should briefly describe your strategy. It should be no more than 1 or 2 pages, describing your algorithms and their performance characteristics. Remember to include the running/training time for each policy.
  4. Click on the Submit button.
    It will complain if any files are missing and tell you your score for each .policy file. It will also submit your .policy files for evaluation. Evaluating the policies takes a while, so unless there is some urgent issue with the leaderboard, wait at least 20 minutes for it to update.

FAQ

  • What’s the output format for the .policy file?

    • Your program should output a file with the same name as the input filename, but with a .policy extension, e.g., “small.policy”.
      Each output file should contain an action for every possible state in the problem. The ith row contains the action to be taken from the ith state. small.policy should contain 100 lines, medium.policy should contain 50000 lines, and large.policy should contain 10101010 lines. On Linux/macOS, you can check the number of lines using wc -l <file>, e.g., wc -l small.policy.
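Writing the required format is a one-liner per state: one integer action per line, in state-index order. A minimal sketch (the function name and the demo filename are ours):

```python
def write_policy(pi, path):
    # pi: a list with one integer action per state, in state-index order.
    with open(path, "w") as f:
        f.write("\n".join(str(a) for a in pi) + "\n")

# Hypothetical 3-state example; real files need 100 / 50000 / 10101010 lines.
write_policy([1, 2, 3], "demo.policy")
```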
  • What does the score represent?

    • For each problem, the score is the expected value for a state drawn from the initial distribution D, that is, $E_{s \sim D}[U(s)]$. For all problems, we evaluate this expectation using simulations. If the scores are too close in the end, we’ll increase the number of Monte Carlo simulations. The scores that you see on the leaderboard are after subtracting the score of a random policy.
  • What’s the reward?

    • All we can say is that the reward is a function of both the states and actions.
  • What packages can we use?

    • For the computational part, sparse matrix libraries (e.g., scipy.sparse in Python, or the builtins in Julia) could be useful. Feel free to use any deep learning or machine learning package as well (as long as you are not using any RL implementations therein). But a lot of effort will go into building a model from the data, so data munging libraries like DataFrames.jl or pandas for Python, and graphing libraries like StatPlots.jl, might be useful.
  • What about the terminal states?

    • There are no terminal states for small.csv and large.csv. For medium.csv, the simulation ends when the new position is beyond some threshold (where the flag is). You may want to think about which integer state measurements correspond to this case.
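One way to look for these states in the data (a heuristic, not the official answer): decode the position from the state integer using the inverse of the 1+pos+500*vel encoding, and check which states appear as sp but are never left. The function names below are ours, for illustration.

```python
def medium_pos_vel(s):
    # Inverse of the medium.csv encoding s = 1 + pos + 500*vel.
    return (s - 1) % 500, (s - 1) // 500

def candidate_terminals(data):
    # States that appear as sp but never as s: the data never shows them
    # being left, a hint (not proof) that the episode ended there.
    seen_s = {s for s, _, _, _ in data}
    seen_sp = {sp for _, _, _, sp in data}
    return sorted(seen_sp - seen_s)
```

Checking the decoded positions of these candidate states against the rest of the data should suggest where the flag threshold lies.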