The search algorithms explored in the previous assignment work great when you know exactly the results of your actions. Unfortunately, the real world is not so predictable. One of the key aspects of an effective AI is the ability to reason in the face of uncertainty.
Markov decision processes (MDPs) can be used to formalize uncertain situations. In this homework, you will implement algorithms to find the optimal policy in these situations. You will then formalize a modified version of Blackjack as an MDP, and apply your algorithm to find the optimal policy.
In this problem, you will perform the value iteration updates manually on a very basic game just to solidify your intuitions about solving MDPs. The set of possible states in this game is {-2, -1, 0, 1, 2}. You start at state 0, and if you reach either -2 or 2, the game ends. At each state, you can take one of two actions: {-1, +1}.
If you're in state $s$ and choose -1:
Let's implement value iteration to compute the optimal policy on an arbitrary MDP. Later, we'll create the specific MDP for Blackjack.
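As a rough illustration of what value iteration computes, here is a minimal, self-contained sketch. It does not use the assignment's util.py classes; the function name value_iteration, the callback-style MDP description, and the toy two-state "dice" MDP are all assumptions made for this example only.

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# Each transition is (next_state, probability, reward).
Transition = Tuple[State, float, float]


def value_iteration(
    states: List[State],
    actions: Callable[[State], List[Action]],
    succ_prob_reward: Callable[[State, Action], List[Transition]],
    discount: float = 1.0,
    epsilon: float = 1e-6,
) -> Tuple[Dict[State, float], Dict[State, Action]]:
    """Return (V, pi): approximate optimal values and a greedy policy."""

    def q_value(V: Dict[State, float], s: State, a: Action) -> float:
        # Expected immediate reward plus discounted value of the successor.
        return sum(p * (r + discount * V[s2]) for s2, p, r in succ_prob_reward(s, a))

    V = {s: 0.0 for s in states}
    while True:
        # Terminal states have no transitions, so their Q-values are all 0.
        new_V = {s: max(q_value(V, s, a) for a in actions(s)) for s in states}
        converged = max(abs(new_V[s] - V[s]) for s in states) < epsilon
        V = new_V
        if converged:
            break

    pi = {s: max(actions(s), key=lambda a: q_value(V, s, a)) for s in states}
    return V, pi


# Hypothetical toy MDP: in state "in" you may quit (reward 10, game ends) or
# stay (reward 4, then the game ends with probability 1/3).
def toy_transitions(s: State, a: Action) -> List[Transition]:
    if s == "end":
        return []  # terminal state
    if a == "quit":
        return [("end", 1.0, 10.0)]
    return [("in", 2 / 3, 4.0), ("end", 1 / 3, 4.0)]


V, pi = value_iteration(["in", "end"], lambda s: ["stay", "quit"], toy_transitions)
```

In this toy MDP, staying satisfies $V = 4 + \frac{2}{3} V$, so the value of "in" converges to 12 and the greedy action there is to stay.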
Let $V_1$ be the optimal value function for the original MDP, and $V_2$ the optimal value function for the modified MDP. Is it always the case that $V_1(s_\text{start})\geq V_2(s_\text{start})$? If so, prove it on the written portion and put return None for each of the code blocks. Otherwise, construct a counterexample by filling out CounterexampleMDP in submission.py.
Let us define a new MDP with states $\text{States}' = \text{States} \cup \{ o \}$, where $o$ is a new state. Let's use the same actions ($\text{Actions}'(s) = \text{Actions}(s)$), but we need to keep the discount $\gamma' = 1$. Your job is to define new transition probabilities $T'(s, a, s')$ and rewards $\text{Reward}'(s, a, s')$ in terms of the old MDP such that the optimal values $V_\text{opt}(s)$ for all $s \in \text{States}$ are equal under the original MDP and the new MDP.
Hint: If you're not sure how to approach this problem, go back to the notes from the first MDP lecture and read closely the slides on convergence, toward the end of the deck.
Now that we have gotten a bit of practice with general-purpose MDP algorithms, let's use them to play (a modified version of) Blackjack. For this problem, you will be creating an MDP to describe states, actions, and rewards in this game.
For our version of Blackjack, the deck can contain an arbitrary collection of cards with different face values. At the start of the game, the deck contains the same number of each card of each face value; we call this number the 'multiplicity'. For example, a standard deck of 52 cards would have face values $[1, 2, \ldots, 13]$ and multiplicity 4. You could also have a deck with face values $[1,5,20]$; if we used multiplicity 10 in this case, there would be 30 cards in total (10 each of 1s, 5s, and 20s). The deck is shuffled, meaning that each permutation of the cards is equally likely.
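The deck description above can be made concrete with a small sketch. The helper name initial_deck_counts is hypothetical and is not part of the assignment code; it only shows how a list of face values and a multiplicity determine the per-value card counts.

```python
from typing import List, Tuple


def initial_deck_counts(card_values: List[int], multiplicity: int) -> Tuple[int, ...]:
    """One count per face value; every face value starts with `multiplicity` cards."""
    return tuple(multiplicity for _ in card_values)


# The two decks mentioned in the text.
standard_counts = initial_deck_counts(list(range(1, 14)), 4)
custom_counts = initial_deck_counts([1, 5, 20], 10)

standard_total = sum(standard_counts)  # total number of cards in the standard deck
custom_total = sum(custom_counts)      # total number of cards in the [1, 5, 20] deck
```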
The game occurs in a sequence of rounds. Each round, the player either (i) takes the next card from the top of the deck (costing nothing), (ii) peeks at the top card (costing peekCost, in which case that card will be drawn in the next round), or (iii) quits the game. (Note: it is not possible to peek twice in a row; if the player tries to peek twice in a row, then succAndProbReward() should return [].)
The game continues until one of the following conditions becomes true:
In this problem, your state $s$ will be represented as a 3-element tuple:
(totalCardValueInHand, nextCardIndexIfPeeked, deckCardCounts)
As an example, assume the deck has card values $[1, 2, 3]$ with multiplicity 1, and the threshold is 4. Initially, the player has no cards, so her total is 0; this corresponds to state (0, None, (1, 1, 1)).
At this point, she can take, peek, or quit.
If she takes, the three possible successor states (each with probability 1/3) are (1, None, (0, 1, 1)), (2, None, (1, 0, 1)), and (3, None, (1, 1, 0)). She will receive a reward of 0 for reaching any of these states. (Remember, even though she now has a card in her hand for which she may receive a reward at the end of the game, the reward is not actually granted until the game ends.)
If she peeks, the three possible successor states are (0, 0, (1, 1, 1)), (0, 1, (1, 1, 1)), and (0, 2, (1, 1, 1)). She will receive (immediate) reward -peekCost for reaching any of these states.
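Things to remember about the states after a peek action: From (0, 0, (1, 1, 1)), taking a card will lead to the state (1, None, (0, 1, 1)) deterministically. The second element of the state tuple, nextCardIndexIfPeeked, is an index into deckCardCounts, so it always lies between 0 and len(deckCardCounts)-1, inclusive.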
If she quits, the resulting state is (0, None, None). (Remember that setting the deck to None signifies the end of the game.)
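The take transitions in the example above follow from the deck counts alone. The sketch below shows one way to enumerate them; the helper name draw_distribution is hypothetical, and it deliberately leaves out the hand total, the threshold check, peeking, and rewards, so it is not a full succAndProbReward() implementation.

```python
from typing import List, Tuple


def draw_distribution(
    deck_card_counts: Tuple[int, ...],
) -> List[Tuple[int, float, Tuple[int, ...]]]:
    """List (card_index, probability, remaining_counts) for the next draw.

    Because the deck is uniformly shuffled, the chance of drawing a given
    face value is its count divided by the number of cards left.
    """
    total = sum(deck_card_counts)
    outcomes = []
    for index, count in enumerate(deck_card_counts):
        if count == 0:
            continue  # no cards of this face value remain
        remaining = list(deck_card_counts)
        remaining[index] -= 1
        outcomes.append((index, count / total, tuple(remaining)))
    return outcomes


# From the starting deck (1, 1, 1), each card index is equally likely.
first_draw = draw_distribution((1, 1, 1))
```

The remaining counts returned for the deck (1, 1, 1) are exactly the third elements of the three successor states listed for the take action.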
As another example, suppose her current state is (3, None, (1, 1, 0)), and the threshold remains 4.
If she quits, the successor state is (3, None, None). If she takes, the successor states are (3 + 1, None, (0, 1, 0)) or (3 + 2, None, None). Note that in the second successor state, the deck is set to None to signify the game ended with a bust. You should also set the deck to None if the deck runs out of cards.
Implement this game as an MDP by filling out the succAndProbReward() function of class BlackjackMDP.
Fill in peekingMDP() to return an instance of BlackjackMDP where the optimal action is to peek in at least 10% of states. Hint: Before randomly assigning values, think of the case when you really want to peek instead of blindly taking a card.
So far, we've seen how MDP algorithms can take an MDP which describes the full dynamics of the game and return an optimal policy. But suppose you go into a casino, and no one tells you the rewards or the transitions. We will see how reinforcement learning can allow you to play the game and learn its rules and strategy at the same time!
You will work with QLearningAlgorithm, which is an instance of an RLAlgorithm. As discussed in class, reinforcement learning algorithms are capable of executing a policy while simultaneously improving that policy. Look in simulate(), in util.py, to see how the RLAlgorithm will be used. In short, your QLearningAlgorithm will be run in a simulation of the MDP, and will alternately be asked for an action to perform in a given state (QLearningAlgorithm.getAction), and then be informed of the result of that action (QLearningAlgorithm.incorporateFeedback), so that it may learn better actions to perform in the future.
We are using Q-learning with function approximation, which means $\hat Q_\text{opt}(s, a) = \mathbf w \cdot \phi(s, a)$, where in code, $\mathbf w$ is self.weights, $\phi$ is the featureExtractor function, and $\hat Q_\text{opt}$ is self.getQ. We have implemented QLearningAlgorithm.getAction as a simple $\epsilon$-greedy policy. Your job is to implement QLearningAlgorithm.incorporateFeedback(), which should take an $(s, a, r, s')$ tuple and update self.weights according to the standard Q-learning update.
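For reference, the standard Q-learning update with linear function approximation moves the weights toward the target $r + \gamma \max_{a'} \hat Q_\text{opt}(s', a')$ along the feature vector $\phi(s, a)$. The sketch below is a standalone illustration, not the assignment's class: the function names, the sparse (name, value) feature format, and the convention that a terminal next state is passed as None are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Hashable, List, Optional, Tuple

Features = List[Tuple[Hashable, float]]  # sparse features as (name, value) pairs


def q_value(weights: DefaultDict[Hashable, float], features: Features) -> float:
    """Dot product of the weight vector with a sparse feature vector."""
    return sum(weights[name] * value for name, value in features)


def q_learning_update(
    weights: DefaultDict[Hashable, float],
    feature_extractor: Callable[[Hashable, Hashable], Features],
    actions: Callable[[Hashable], List[Hashable]],
    discount: float,
    step_size: float,
    s: Hashable,
    a: Hashable,
    r: float,
    s_next: Optional[Hashable],
) -> None:
    """Apply w <- w - eta * (Q(s, a) - (r + gamma * V(s'))) * phi(s, a) in place."""
    if s_next is None:
        v_next = 0.0  # assumed convention: None marks a terminal state
    else:
        v_next = max(
            q_value(weights, feature_extractor(s_next, a2)) for a2 in actions(s_next)
        )
    target = r + discount * v_next
    features = feature_extractor(s, a)
    residual = q_value(weights, features) - target
    for name, value in features:
        weights[name] -= step_size * residual * value


# Usage with one indicator feature per (state, action) pair.
weights: DefaultDict[Hashable, float] = defaultdict(float)
indicator = lambda s, a: [((s, a), 1.0)]
only_go = lambda s: ["go"]

q_learning_update(weights, indicator, only_go, 1.0, 0.5, "s0", "go", 10.0, None)
q_learning_update(weights, indicator, only_go, 1.0, 0.5, "s0", "go", 10.0, None)
q_learning_update(weights, indicator, only_go, 1.0, 0.5, "s1", "go", 0.0, "s0")
```

With indicator features, each update simply moves the stored Q-value a fraction step_size of the way toward the target, which is the tabular Q-learning rule as a special case.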
Call simulate using your Q-learning code and the identityFeatureExtractor() on the MDP smallMDP (defined for you in submission.py), with 30,000 trials and the default explorationProb. Next, use value iteration to find the optimal policy for smallMDP. How does the Q-learning policy compare with the policy found by value iteration (i.e., for how many states do they produce a different action)? (Don't forget to set the explorationProb of your Q-learning algorithm to 0 after learning the policy.)
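Counting the states on which two policies disagree can be done with a few lines. The sketch below assumes each policy is available as a plain dict from state to action; the function name, the example states, and the action strings are hypothetical and only illustrate the comparison.

```python
from typing import Dict, Hashable, Tuple


def count_policy_differences(
    policy_a: Dict[Hashable, Hashable],
    policy_b: Dict[Hashable, Hashable],
) -> Tuple[int, int]:
    """Return (number of differing states, number of states compared).

    Only states present in both policies are compared.
    """
    shared = policy_a.keys() & policy_b.keys()
    differing = sum(1 for s in shared if policy_a[s] != policy_b[s])
    return differing, len(shared)


# Illustrative policies over three Blackjack-style states.
vi_policy = {
    (0, None, (1, 1, 1)): "take",
    (1, None, (0, 1, 1)): "take",
    (3, None, (1, 1, 0)): "quit",
}
ql_policy = {
    (0, None, (1, 1, 1)): "take",
    (1, None, (0, 1, 1)): "peek",
    (3, None, (1, 1, 0)): "quit",
}

num_different, num_compared = count_policy_differences(vi_policy, ql_policy)
```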
Now run simulate on largeMDP, again with 30,000 trials. How does the policy learned in this case compare to the policy learned by value iteration? What went wrong?
Note: We have provided the helper function run4bHelper in grader.py to help you run simulate_QL_over_MDP with appropriate arguments. Implementing simulate_QL_over_MDP and using run4bHelper are entirely optional, but they will probably be useful as you work to answer this question.
Implement blackjackFeatureExtractor as described in the code comments. Using this feature extractor, you should be able to get pretty close to the optimum on the largeMDP.
To explore this scenario, let's take a brief look at how a policy learned using value iteration responds to a change in the rules of the MDP.
Run value iteration on originalMDP (defined for you in submission.py) to compute an optimal policy for that MDP.
Next, simulate that policy on newThresholdMDP (also defined for you in submission.py) by calling simulate with an instance of FixedRLAlgorithm that has been instantiated using the policy you computed with value iteration. What is the expected reward from this simulation? Hint: read the documentation (comments) for the simulate function in util.py, and look specifically at the format of the function's return value.
Now run Q-learning on originalMDP (30,000 trials). Then, using the learned parameters, run Q-learning again on newThresholdMDP (again, 30,000 trials). What is your expected reward under the new Q-learning policy? Explain how the rewards compare with those obtained when FixedRLAlgorithm is used. Why are they different?
Note: As in 4(b), we have provided the helper function run4dHelper in grader.py to help you run compare_changed_MDP with appropriate arguments. Implementing compare_changed_MDP and using run4dHelper are entirely optional, but they will probably be useful as you work to answer this question.