TuTh 2:05-3:15 pm, Terman M33.

Instructor: Ashish Goel. Office hours: Mondays 4-5, Terman 311.

Guests/auditors, please subscribe to msande325-win0809-guests by going to mailman.stanford.edu.

- Course description and requirements

- Lecture notes (always under construction; please let me know if/when you find bugs; Stanford distribution only for now; last modified Mar 5th, 2009). Also see the lower bound supplement.

- Reading list
- Project suggestions

- Homeworks: (1) Given 1/22/09. (2) Given 2/10/09.
- Related class at University of Pennsylvania, taught by Sudipto Guha. We will co-ordinate lecture notes. Also check out the reading list assembled by Kamesh Munagala.

Course Description

From the bulletin: Markov decision processes; optimization with sparse priors; multi-armed bandit problems and the Gittins' index; regret bounds for multi-armed bandit problems; stochastic knapsack and the adaptivity gap; budgeted learning; adversarial queueing theory; stochastic scheduling and routing; stochastic inventory problems; multi-stage and multi-objective stochastic optimization. Prerequisites: MS&E 221 or equivalent; and MS&E 212 or CS 261 or equivalent.

Informally: The first part of the class will focus on topics that can loosely be called "Algorithmic decision theory": Algorithms for designing provably (near)-optimal and computationally efficient strategies for decision making under uncertainty, using available and acquired data, as well as probabilistic models thereon. The second part will focus on stochastic versions of combinatorial optimization problems.

Course requirements: There will be four homework assignments and an optional project. Please sign up for a project only if you believe you will be able to devote a substantial amount of attention to it. Also, every student may be expected to scribe one lecture.

Reading list (will be updated continually; papers marked as [C] will be covered in class; papers marked as [R] are for your reference; this list is not exhaustive)

[C] A short proof of the Gittins Index theorem. J. Tsitsiklis. Notice that this proof assumes that the time for which an arm gets played follows a continuous distribution, whereas the proof in class assumed a discrete distribution.

[C] A spreadsheet illustrating a computation of the Gittins' index for Beta priors.
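If you prefer code to a spreadsheet, here is a rough sketch of one standard way to do the same computation (this is my own illustration, not the method in the spreadsheet; the function name and parameters are made up). It approximates the Gittins index of a Bernoulli arm with a Beta posterior by calibration: binary-search the known rate lam of a "standard" arm at which we are indifferent between retiring to the standard arm and pulling the unknown arm once more. The horizon is truncated, so the answer carries an error on the order of discount**horizon.

```python
from functools import lru_cache

def gittins_beta(alpha, beta, discount=0.9, horizon=60, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with a
    Beta(alpha, beta) posterior, via calibration against a standard
    arm that pays a known rate lam forever."""

    @lru_cache(maxsize=None)
    def value(a, b, steps_left, lam):
        # optimal value of "pull the unknown arm or retire to lam",
        # truncated to steps_left remaining periods
        if steps_left == 0:
            return 0.0
        retire = lam * (1 - discount ** steps_left) / (1 - discount)
        p = a / (a + b)  # posterior mean of the success probability
        pull = p * (1 + discount * value(a + 1, b, steps_left - 1, lam)) + \
               (1 - p) * discount * value(a, b + 1, steps_left - 1, lam)
        return max(retire, pull)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        retire = lam * (1 - discount ** horizon) / (1 - discount)
        if value(alpha, beta, horizon, lam) > retire + 1e-12:
            lo = lam  # pulling is still strictly better: index > lam
        else:
            hi = lam
    return (lo + hi) / 2
```

As a sanity check, the index of the uniform Beta(1, 1) prior exceeds its posterior mean 1/2, reflecting the value of exploration.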

[R] Four proofs of Gittins' multiarmed bandit theorem. E. Frostig and G. Weiss. In particular, see the proof which involves the fair and prevailing prices.

[R] Elicitation of a Beta prior for Bayesian inference in clinical trials. Y. Wu, W. Shih, and D. Moore. This is just to support my contention that the multi-armed bandit problems are important not just for new Internet related applications but also traditional applications in health-care etc. It would be enough to read the abstract.

[R] Asymptotically efficient adaptive allocation rules. T. Lai and H. Robbins. This is the original "regret paper", and is a classic. Read the treatment of the lower bound from this paper, but for upper bounds, the proofs have been simplified and generalized (sometimes with slightly weaker constants, but that is not a major concern in this course).

[R] Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. R. Agarwal. This contains a simplified proof of the main result of Lai and Robbins. It would be most instructive to read sections 1 and 2 of this paper, and skip the rest, since (modulo constants), these sections are subsumed by the next paper.

[C] Finite-time Analysis of the Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, and P. Fischer. Specifically, read the analysis of UCB1. Then contemplate the following question: why is the performance of this algorithm apparently worse if the arms have very similar means? Can you derive a bound that does not depend on the difference in the means?
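For concreteness, here is a minimal sketch of UCB1 as analyzed in that paper (the simulation harness and arm interface are my own; rewards are assumed to lie in [0, 1]):

```python
import math

def ucb1(reward_fns, horizon):
    """Play UCB1 for `horizon` rounds.

    reward_fns: list of zero-argument functions, each returning a
    reward in [0, 1] when its arm is pulled.
    Returns (total reward, play counts per arm)."""
    n_arms = len(reward_fns)
    counts = [0] * n_arms   # times each arm was played
    sums = [0.0] * n_arms   # total reward collected from each arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialize
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_i)
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = reward_fns[arm]()
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```

The confidence term sqrt(2 ln t / n_i) shrinks slowly for rarely played arms, which is exactly what forces the O(log n) exploration of each suboptimal arm in the analysis.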

[C] The Nonstochastic Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. Read the proofs for the very first algorithm (section 3) followed by a quick look at section 4. Then ask the following question: where did all the probabilistic analysis disappear in this proof? Also read the lower bound in the appendix.
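The very first algorithm of section 3 (Exp3) is short enough to sketch in code; this is my own rendering, with the mixing parameter gamma and the reward interface chosen for illustration (rewards in [0, 1]):

```python
import math
import random

def exp3(reward_fns, horizon, gamma=0.1):
    """Exp3 for the nonstochastic bandit problem.

    Each round: mix the exponential weights with uniform exploration,
    pull an arm, and update its weight using an importance-weighted
    reward estimate (which keeps the estimate unbiased)."""
    K = len(reward_fns)
    weights = [1.0] * K
    total = 0.0
    for t in range(horizon):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        r = reward_fns[arm]()
        total += r
        est = r / probs[arm]  # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * est / K)
        # renormalize for numerical stability (leaves probs unchanged)
        m = max(weights)
        weights = [w / m for w in weights]
    return total
```

Note where the randomness lives: the algorithm randomizes, but the rewards may be an arbitrary (oblivious) adversarial sequence, which is why the paper's analysis is largely deterministic.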

[C] Efficient algorithms for online decision problems. A. Kalai and S. Vempala. A nice exposition of the full information model. Read the first three sections. Note that this algorithm can be used to solve a large variety of linear optimization problems sequentially, without any assumptions on the linear utilities in each time period. Also note that the bounds that they obtain in the simplest multi-armed bandit setting are stronger than in the partial information model, most notably, the sqrt(N) changes to a log(N) -- verify this for yourself by finding the optimum epsilon for the simplest algorithm in this paper.

[R] Online convex programming and generalized infinitesimal gradient ascent. M. Zinkevich. Read the algorithm. Then before reading the proof, consider the following question: how would you use this technique in a black-box fashion to solve the linear generalization of the bandits problem, as described by Kalai and Vempala in section 1.1? Before you assume that this is trivial, note that Zinkevich requires a convex decision space, whereas Kalai and Vempala assume that the decision space is arbitrary, possibly just a set of points.
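The algorithm itself is just projected gradient descent with a decaying step size; here is a minimal sketch in that style (the interface and step-size constant are my own choices):

```python
import math

def ogd(grads, x0, project, eta=0.5):
    """Online projected gradient descent, in the style of Zinkevich.

    grads[t](x) returns the gradient of the round-t convex loss at x
    (x is a list of floats); project maps a point back into the convex
    decision set. The step size eta / sqrt(t) yields O(sqrt(T)) regret
    against the best fixed point. Returns the points played."""
    x = list(x0)
    played = []
    for t, grad in enumerate(grads, start=1):
        played.append(list(x))
        g = grad(x)
        step = eta / math.sqrt(t)
        x = project([xi - step * gi for xi, gi in zip(x, g)])
    return played
```

The projection step is where the convexity of the decision space is used, which is exactly the obstacle in the question above: a finite set of points has no such projection.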

[R] Logarithmic Regret Algorithms for Online Convex Optimization. E. Hazan, A. Agarwal, and S. Kale. Read the first two sections and make sure you understand the connection to Zinkevich's results and those of Kalai and Vempala.

[R] Universal portfolios. T. Cover. Another classic. Read the problems in this paper and think about how you would solve them using more recent techniques. Then go back and read the rest of Hazan et al (don't worry too much about the proofs since the techniques used are not similar to those used in the rest of this class).

[R] The value of knowing the demand curve. R. Kleinberg and T. Leighton. An interesting application of the ideas in the UCB algorithm of Auer et al, as well as new insights.

[R] Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. V. Dani and T. Hayes. Read the discussion on adaptivity.

[C] Allocating bandwidth for bursty connections. J. Kleinberg, Y. Rabani, E. Tardos. Read the sections on load balancing. Observe how the lower bound and the upper bound seem almost contradictory -- the lower bound says that "no definition of effective bandwidth works" while the upper bound says that "the standard definition of effective bandwidth works"! The trick, of course, is in dealing with exceptional jobs separately and doing binary search to find the right cut-off. An improvement in the constant factor would be very interesting, especially if the improvement is large.

[R] Stochastic Load Balancing and Related Problems. A. Goel and P. Indyk. Read the sections on Poisson and Exponential jobs. Then ask the following questions: (a) How does the Poisson case for load balancing relate to the majorization based proof we saw in class? (b) Most importantly, can the kinds of techniques we used for exponential jobs be used for Bernoulli jobs? For general jobs? A PTAS for the Bernoulli or general problem would be spectacular.

[C] Simultaneous optimization via approximate majorization for concave profits or convex costs. A. Goel and A. Meyerson. Read sections 2, 3.4, and section 4 (skip the proofs of section 4).

[C] Approximating the stochastic knapsack: the benefit of adaptivity. B. Dean, M. Goemans and J. Vondrak.

[C] Approximation algorithms for budgeted learning problems. S. Guha and K. Munagala. Notice how the proof uses techniques very similar to those used by Dean, Goemans, and Vondrak.

[C] Multi-armed Bandits with Metric Switching Costs. S. Guha and K. Munagala. Notice the nice balancing trick in the LP.

[C] The Ratio Index for Budgeted Learning, with Applications. A. Goel, S. Khanna, B. Null. Read the definition of the ratio index, the statement of the result, and then just focus on sections 3 and 4.

(Many) more to come shortly.

Project suggestions.

My preference would be if you were to do this in teams of two. Remember, the project is optional but if you choose to do one, I would expect you to devote substantial effort. I will add more project suggestions if needed. Also, please feel free to come up with a formulation of your own.

- Fast computation of the Gittins index for finite state spaces. The best known appears to be O(n^3) where n is the size of the state space of the Markov chain of the arm. This can be achieved by analyzing pivoting rules for simplex, or more simply by just implementing Tsitsiklis's recursive rule as in section 3 of this paper. Can you improve this running time? Contrary to what I may have said or thought or appeared to say or think in class, I don't think any longer that this is easy.
- The above will not apply to Beta priors. Assume that you are allowed an error of epsilon in computing the Gittins index of the state (alpha, beta). What is the fastest algorithm you can devise?
- For the truly adventurous: can you devise a fully strongly polynomial time algorithm for solving the discounted infinite horizon Markov Decision Process? The "most strongly polynomial" known algorithm is strongly polynomial in everything except the discount factor.
- Can you find the best possible approximation for the budgeted learning problem? Check out this LP based and this index based approximation algorithm.
- Advertising auctions with priors
- Strategic fraud in multi-armed bandit problems: Suppose each arm corresponds to a strategic agent. For example, each arm may be an advertiser who wants good placement. Assume this strategic agent can inflate the reward probability initially. What kind of a game does this impose? This will involve first doing a survey of known results in this direction. Also, suppose the agent could return with a new pseudonym if its prior falls very low. What kind of a social cost does this impose?
- Data-oriented: Can you show, under some reasonable prior, that the NetFlix data set does not have sufficient information to obtain a better than x% prediction? This will not help you win the NetFlix challenge, since this is in the wrong direction, but it would be interesting nevertheless.

- Product pricing: Consider the scenario where you can offer a good at one of K price points. Users arrive sequentially. The paper on the value of knowing the demand curve studies this problem with a view towards minimizing regret. Can you design optimum (or near-optimum) strategies in a model where you are given a prior on the user valuations? What priors make sense? This will also involve doing a survey of existing work.

- Correlated bandits: What if different arms are correlated? We have some problem formulations that I would be happy to discuss individually, or you can find your own problem in this space.