<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Nima Hamidi</title>
    <link>/~hamidi/author/nima-hamidi/</link>
      <atom:link href="/~hamidi/author/nima-hamidi/index.xml" rel="self" type="application/rss+xml" />
    <description>Nima Hamidi</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 01 Jan 2021 00:00:00 +0000</lastBuildDate>
    <image>
      <url>/~hamidi/images/icon_hu8a5816724c4a49e81ada0aad75c4f4f9_4068_512x512_fill_lanczos_center_2.png</url>
      <title>Nima Hamidi</title>
      <link>/~hamidi/author/nima-hamidi/</link>
    </image>
    
    <item>
      <title>On Worst-case Regret of Linear Thompson Sampling</title>
      <link>/~hamidi/publication/hamidi-2020-worst/</link>
      <pubDate>Thu, 11 Jun 2020 00:00:00 +0000</pubDate>
      <guid>/~hamidi/publication/hamidi-2020-worst/</guid>
      <description>&lt;p&gt;In this paper, we consider the worst-case regret of Linear Thompson Sampling (LinTS) for the linear bandit problem. Russo and Van Roy (2014) show that the Bayesian regret of LinTS is bounded above by $\widetilde{\mathcal{O}}(d\sqrt{T})$, where $T$ is the time horizon and $d$ is the number of parameters. While this bound matches the minimax lower bounds for this problem up to logarithmic factors, whether a similar worst-case regret bound holds is still unknown. The only known worst-case regret bound for LinTS, due to Agrawal and Goyal (2013b) and Abeille et al. (2017), is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, which requires the posterior variance to be inflated by a factor of $\widetilde{\mathcal{O}}(\sqrt{d})$. Although this bound is a factor of $\sqrt{d}$ away from the minimax optimal rate, we show in this paper that it is essentially the best possible, settling an open problem stated in Russo et al. (2018). Specifically, we construct examples showing that, without the inflation, LinTS can incur linear regret up to time $\exp(\Omega(d))$. We then demonstrate that, under mild conditions, a slightly modified version of LinTS requires only an $\widetilde{\mathcal{O}}(1)$ inflation, where the constant depends on the diversity of the optimal arm.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Randomized Elliptical Potential Lemma with an Application to Linear Thompson Sampling</title>
      <link>/~hamidi/publication/hamidi-2021-randomized/</link>
      <pubDate>Fri, 01 Jan 2021 00:00:00 +0000</pubDate>
      <guid>/~hamidi/publication/hamidi-2021-randomized/</guid>
      <description>&lt;p&gt;In this note, we introduce a randomized version of the well-known elliptical potential lemma, which is widely used in the analysis of algorithms for sequential learning and decision-making problems such as stochastic linear bandits. Our randomized elliptical potential lemma relaxes the Gaussian assumptions on the observation noise and on the prior distribution of the problem parameters. We then use this generalization to prove an improved Bayesian regret bound for Thompson sampling for linear stochastic bandits with changing action sets, where the prior and noise distributions are general. This bound is minimax optimal up to constants.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms</title>
      <link>/~hamidi/publication/bayati-2020-unreasonable/</link>
      <pubDate>Mon, 24 Feb 2020 00:00:00 +0000</pubDate>
      <guid>/~hamidi/publication/bayati-2020-unreasonable/</guid>
      <description>&lt;p&gt;We study the structure of regret-minimizing policies in the &lt;em&gt;many-armed&lt;/em&gt; Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$. We first show that &lt;em&gt;subsampling&lt;/em&gt; is a critical step for designing optimal policies.  In particular, the standard UCB algorithm leads to sub-optimal regret bounds in the many-armed regime.  However, a subsampled UCB (SS-UCB), which samples $\Theta(\sqrt{T})$ arms and executes UCB only on that subset, is rate-optimal.&lt;/p&gt;
&lt;p&gt;Despite its theoretically optimal regret, even SS-UCB performs poorly in practice due to excessive exploration of suboptimal arms. In particular, in numerical experiments SS-UCB performs worse than a simple greedy algorithm (and its subsampled version) that pulls the empirically best arm at every time period.
We show that these insights hold even in a contextual setting, using real-world data.
These empirical results suggest a novel form of &lt;em&gt;free exploration&lt;/em&gt; in the many-armed regime that benefits greedy algorithms. We theoretically study this new source of free exploration and find that it is deeply connected to the probability of a certain tail event under the prior distribution of arm rewards.  This is a fundamentally distinct phenomenon from the free exploration discussed in the recent literature on contextual bandits, where free exploration arises due to variation in contexts.
We use this insight to prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits when $k &amp;gt; \sqrt{T}$, and achieves sublinear regret with more general distributions.  This is a case where theoretical rate optimality does not tell the whole story: when complemented by the empirical observations of our paper, the power of greedy algorithms becomes quite evident. Taken together, from a practical standpoint, our results suggest that in applications it may be preferable to use a variant of the greedy algorithm in the many-armed regime.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A General Framework to Analyze Stochastic Linear Bandit</title>
      <link>/~hamidi/publication/hamidi-2020-general/</link>
      <pubDate>Wed, 12 Feb 2020 00:00:00 +0000</pubDate>
      <guid>/~hamidi/publication/hamidi-2020-general/</guid>
      <description>&lt;p&gt;In this paper, we study the well-known stochastic linear bandit problem where a decision-maker sequentially chooses among a set of given actions in $\mathbb{R}^d$,  observes their noisy linear reward, and aims to maximize her cumulative expected reward over a horizon of length $T$. We first introduce a general family of algorithms for the problem and prove that they achieve the best-known performance (i.e., are rate-optimal). Our second contribution is to show that several well-known algorithms for the problem, such as &lt;em&gt;optimism in the face of uncertainty linear bandit&lt;/em&gt; (OFUL), &lt;em&gt;Thompson sampling&lt;/em&gt; (TS), and OLS Bandit (a variant of &lt;em&gt;$\epsilon$-greedy&lt;/em&gt;), are special cases of our family of algorithms. Therefore, we obtain a unified proof of rate optimality for all of these algorithms, in both Bayesian and frequentist settings. Our new unified technique also yields a number of new results, such as poly-logarithmic (in $T$) regret bounds for OFUL and TS under a generalized gap assumption and a margin condition as in Goldenshluger and Zeevi (2013). A key component of our analysis is the introduction of a new notion of &lt;em&gt;uncertainty complexity&lt;/em&gt; that directly captures the complexity of uncertainty in the action sets, which we show is connected to the regret analysis of any policy.&lt;/p&gt;
&lt;p&gt;Our third and most important contribution, from both theoretical and practical points of view, is the introduction of a new rate-optimal algorithm called &lt;em&gt;Sieved-Greedy&lt;/em&gt; (SG) by combining insights from uncertainty complexity and a new (and general) notion of &lt;em&gt;optimism in expectation&lt;/em&gt;. Specifically, SG works by filtering out the actions with relatively low uncertainty and then chooses one among the remaining actions greedily. Our empirical simulations show that SG significantly outperforms existing benchmarks by combining the best attributes of both greedy and OFUL algorithms.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Personalizing Many Decisions with High-Dimensional Covariates</title>
      <link>/~hamidi/publication/hamidi-2019-personalizing/</link>
      <pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate>
      <guid>/~hamidi/publication/hamidi-2019-personalizing/</guid>
      <description>&lt;p&gt;We consider the $k$-armed stochastic contextual bandit problem with $d$-dimensional features, when both $k$ and $d$ can be large. To the best of our knowledge, all existing algorithms for this problem have regret bounds that scale as polynomials of degree at least two in $k$ and $d$. The main contribution of this paper is to introduce and theoretically analyze a new algorithm (REAL-bandit) whose regret scales as $r^2(k+d)$, where $r$ is the rank of the $k\times d$ matrix of unknown parameters. REAL-bandit relies on ideas from the low-rank matrix estimation literature and a new row-enhancement subroutine that yields sharper bounds for estimating each row of the parameter matrix, which may be of independent interest.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>On Low-rank Trace Regression Under General Sampling Distribution</title>
      <link>/~hamidi/publication/hamidi-2019-low/</link>
      <pubDate>Thu, 29 Aug 2019 00:00:00 +0000</pubDate>
      <guid>/~hamidi/publication/hamidi-2019-low/</guid>
      <description>&lt;p&gt;In this paper, we study trace regression, in which a matrix of parameters $\mathbf{B}^\star$ is estimated via &lt;em&gt;convex relaxation of rank-penalized regression&lt;/em&gt; or &lt;em&gt;non-convex optimization&lt;/em&gt;. It is known that these estimators satisfy near-optimal error bounds under assumptions on the rank, coherence, or spikiness of $\mathbf{B}^\star$. We first introduce a general notion of spikiness for $\mathbf{B}^\star$ and prove non-asymptotic bounds on the estimation error. Our approach relies on a generic recipe for proving &lt;em&gt;restricted strong convexity&lt;/em&gt; of the sampling operator of the trace regression. Second, we prove similar error bounds when the regularization parameter is chosen via cross-validation. This result is significant in that the existing theory on cross-validated estimators (Kale and Vassilvitskii (2011); Kumar et al. (2013); Abou-Moustafa and Szepesvari (2017)) does not apply to our setting, since our estimators are not known to satisfy the required notion of stability. Third, we apply our general results to four sub-problems: (1) matrix completion, (2) multi-task learning, (3) compressed sensing with Gaussian ensembles, and (4) compressed sensing with factored measurements. For (1), (3), and (4) we recover error bounds matching those in the literature, and for (2) we obtain (to the best of our knowledge) the first such error bound.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
