Khashayar Khosravi

For the most up-to-date list of my publications, please see my Google Scholar page.

Working Papers

  1. Optimal and Greedy Algorithms for Multi-Armed Bandits with Many Arms (Joint work with M. Bayati, N. Hamidi, and R. Johari)

    We characterize Bayesian regret in a stochastic multi-armed bandit problem with a large but finite number of arms. In particular, we assume the number of arms \(k\) is \(T^{\alpha}\), where \(T\) is the time-horizon and \(\alpha\) is in \((0,1)\). We consider a Bayesian setting where the reward distribution of each arm is drawn independently from a common prior, and provide a complete analysis of expected regret with respect to this prior. Our results exhibit a sharp distinction around \(\alpha = 1/2\). When \(\alpha < 1/2\), the fundamental lower bound on regret is \(\Omega(k)\); and it is achieved by a standard UCB algorithm. When \(\alpha > 1/2\), the fundamental lower bound on regret is \(\Omega(\sqrt{T})\), and it is achieved by an algorithm that first subsamples \(\sqrt{T}\) arms uniformly at random, then runs UCB on just this subset. Interestingly, we also find that a sufficiently large number of arms allows the decision-maker to benefit from “free” exploration if she simply uses a greedy algorithm. In particular, this greedy algorithm exhibits a regret of \(\tilde{O}(\max(k,T/\sqrt{k}))\), which translates to a sublinear (though not optimal) regret in the time horizon. We show empirically that this is because the greedy algorithm rapidly disposes of underperforming arms, a beneficial trait in the many-armed regime. Technically, our analysis of the greedy algorithm involves a novel application of the Lundberg inequality, an upper bound for the ruin probability of a random walk.
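
    The subsample-then-UCB strategy is easy to sketch. The following toy simulation assumes Bernoulli arms and the standard UCB1 index; these choices, and all constants, are illustrative rather than the paper's exact setup.

```python
import numpy as np

def subsampled_ucb(reward_means, T, rng):
    """Subsample ~sqrt(T) of the k arms uniformly at random, then run UCB1
    on that subset only. `reward_means` holds the (unknown to the learner)
    Bernoulli means of all arms; it is used solely to simulate rewards."""
    k = len(reward_means)
    m = min(k, int(np.ceil(np.sqrt(T))))           # size of the random subset
    subset = rng.choice(k, size=m, replace=False)
    counts = np.zeros(m)                           # pulls per subsampled arm
    sums = np.zeros(m)                             # cumulative reward per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= m:
            i = t - 1                              # initialize: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            i = int(np.argmax(ucb))
        r = rng.binomial(1, reward_means[subset[i]])
        counts[i] += 1
        sums[i] += r
        total += r
    return total
```

    When \(k = T^{\alpha}\) with \(\alpha > 1/2\), the random subset of \(\sqrt{T}\) arms still contains a near-optimal arm with high probability, which is what makes the subsampled algorithm match the \(\Omega(\sqrt{T})\) lower bound.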

  2. Non-Parametric Inference Adaptive to Intrinsic Dimension (Joint work with G. Lewis and V. Syrgkanis)

    We consider non-parametric estimation and inference for conditional moment models in high dimensions. We show that even when the dimension \(D\) of the conditioning variable is larger than the sample size \(n\), estimation and inference are feasible as long as the distribution of the conditioning variable has small intrinsic dimension \(d\), as measured by the doubling dimension. Our estimator is based on a sub-sampled ensemble of the \(k\)-nearest neighbors \(Z\)-estimator. We show that if the intrinsic dimension of the covariate distribution equals \(d\), then the finite-sample estimation error of our estimator is of order \(n^{-1/(d+2)}\) and our estimate is \(n^{1/(d+2)}\)-asymptotically normal, irrespective of \(D\). We discuss extensions and applications to heterogeneous treatment effect estimation.
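
    As a toy illustration of the sub-sampled \(k\)-NN ensemble, specialized to the simplest conditional-moment problem (estimating a conditional mean); the constants and the averaging scheme here are illustrative, not the paper's tuned choices.

```python
import numpy as np

def subsampled_knn_mean(X, y, x0, k=5, n_subsamples=50, subsample_frac=0.5, rng=None):
    """Average k-NN estimates of E[y | X = x0] over many random subsamples
    of the data. The paper studies general Z-estimators; the conditional
    mean is used here purely for illustration."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(y)
    s = max(k, int(subsample_frac * n))            # subsample size
    estimates = []
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=s, replace=False)
        d = np.linalg.norm(X[idx] - x0, axis=1)    # distances to the query point
        nn = idx[np.argsort(d)[:k]]                # k nearest neighbors in the subsample
        estimates.append(y[nn].mean())
    return float(np.mean(estimates))
```

    The point of the construction is that the neighbor distances, and hence the estimation error, scale with the intrinsic dimension \(d\) of the covariate distribution rather than with the ambient dimension \(D\).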

  3. Matrix Completion Methods for Causal Panel Data Models, GitHub Repository (Joint work with S. Athey, M. Bayati, N. Doudchenko, and G. Imbens)

    A central tool in causal inference is treatment effect estimation from observational data, an analysis made challenging by potential confounders. We consider panel data models and, motivated by recent advances in the matrix-completion literature, introduce and study a class of estimators that minimize the distance between the estimated matrix and the observed matrix while favoring less complex models. We prove consistency of this estimator by generalizing existing results in the matrix-completion literature to allow the patterns of missing data to include a dependency structure. We also present new insights concerning the connections between the interactive fixed effects models and the literatures on program evaluation under unconfoundedness and on synthetic control methods.
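
    For concreteness, one standard estimator from the matrix-completion literature, soft-impute-style singular-value thresholding, can be sketched in a few lines; the nuclear-norm penalty plays the role of "favoring less complex models". This is an illustration of the general approach, not necessarily the exact estimator studied in the paper.

```python
import numpy as np

def soft_impute(Y, observed, lam, n_iters=200):
    """Minimal matrix-completion sketch: alternately fill the missing
    entries with the current low-rank estimate, then soft-threshold the
    singular values (the proximal step for a nuclear-norm penalty).
    `observed` is a boolean mask of the same shape as Y."""
    L = np.where(observed, Y, 0.0)
    for _ in range(n_iters):
        filled = np.where(observed, Y, L)          # impute missing entries
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        L = U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
    return L
```

    In the causal-panel setting, the "missing" entries are the counterfactual (untreated) outcomes of treated unit-period pairs, which is exactly where the dependency structure in the missingness pattern comes from.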

Published Papers

  1. Mostly Exploration-Free Algorithms for Contextual Bandits (Joint work with H. Bastani and M. Bayati)
    Management Science (forthcoming).

    The contextual bandit literature usually focuses on algorithms that trade off exploration and exploitation. Exploration-free greedy algorithms are desirable in settings where exploration is costly or unethical (e.g., clinical trials), but they may be sub-optimal in general. We prove that greedy algorithms are rate-optimal for the two-armed contextual bandit problem if there is sufficient randomness in the covariates. Furthermore, even absent this condition, we prove that greedy algorithms are optimal with some positive probability. Motivated by this result, we introduce and study the Greedy-First algorithm, which uses the observed covariates and outcomes to determine whether or not to follow the greedy algorithm. We prove that Greedy-First is rate-optimal without any additional assumptions on the context distribution or the number of arms.
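
    A rough sketch of the switching idea for a linear-payoff contextual bandit: follow the greedy arm while the observed covariates supply enough natural exploration, and otherwise fall back to forced exploration. The minimum-eigenvalue test, the constant `c0`, and the epsilon-greedy fallback below are illustrative stand-ins for the paper's actual monitoring statistic and fallback algorithm.

```python
import numpy as np

class GreedyFirst:
    """Sketch of a Greedy-First-style rule. Plays greedily while every
    arm's design matrix keeps growing (covariate diversity is doing the
    exploration for free); switches permanently to epsilon-greedy if the
    minimum eigenvalue stops growing linearly in t."""

    def __init__(self, d, n_arms, c0=0.01, eps=0.1, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.c0, self.eps = c0, eps
        self.switched = False
        self.covs = [np.eye(d) * 1e-6 for _ in range(n_arms)]  # per-arm X'X
        self.XtY = [np.zeros(d) for _ in range(n_arms)]        # per-arm X'y
        self.t = 0

    def choose(self, x):
        self.t += 1
        if not self.switched:
            lam_min = min(np.linalg.eigvalsh(C)[0] for C in self.covs)
            if self.t > 10 and lam_min < self.c0 * self.t:
                self.switched = True       # greedy is not getting free exploration
        if self.switched and self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.covs)))      # forced exploration
        # Greedy: pick the arm with the highest estimated reward.
        thetas = [np.linalg.solve(C, b) for C, b in zip(self.covs, self.XtY)]
        return int(np.argmax([x @ th for th in thetas]))

    def update(self, arm, x, reward):
        self.covs[arm] += np.outer(x, x)
        self.XtY[arm] += reward * x
```

    The appeal of this structure is that the decision to abandon greedy play is data-driven, so forced exploration is incurred only when the covariates themselves fail to provide it.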

  2. Multiclass classification, information, divergence and surrogate risk (Joint work with J.C. Duchi and F. Ruan)
    Annals of Statistics 46(6B):3246-3275, 2018.

    In multiclass classification, the decision-maker aims to minimize the misclassification rate. This translates to the empirical zero-one loss minimization problem, which is NP-hard. Therefore, instead of minimizing the zero-one loss directly, researchers usually minimize a surrogate loss and study when the resulting estimator is consistent. In many applications, making decisions based on raw data is undesirable, as the data may be very high-dimensional and carry extraneous information. In this setting, instead of only learning an estimator (or discriminant function), the decision-maker also wishes to learn a quantizer. We characterize the equivalence between loss functions, meaning that optimizing either of two losses yields the same optimal discriminant and quantizer, complementing and generalizing a previous result of Nguyen et al. in the binary setting. We also provide a unifying view of statistical information measures, multi-way Bayesian hypothesis testing, loss functions for multiclass classification problems, and multi-distribution \(f\)-divergences.
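
    A small worked example of the surrogate-loss idea: instead of the NP-hard zero-one minimization, minimize the convex cross-entropy surrogate by gradient descent on a linear model, then evaluate the zero-one loss one actually cares about. This is standard textbook material, included only to make the surrogate/target distinction concrete; it is not the paper's construction.

```python
import numpy as np

def fit_softmax(X, y, n_classes, lr=0.2, n_iters=500):
    """Gradient descent on the average cross-entropy (softmax) loss,
    a classification-calibrated surrogate for the zero-one loss."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                       # one-hot targets
    for _ in range(n_iters):
        logits = X @ W
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
        W -= lr * X.T @ (P - Y) / n                # gradient of the surrogate
    return W

def zero_one_loss(W, X, y):
    """The non-convex loss we actually wanted to minimize."""
    return float(np.mean(np.argmax(X @ W, axis=1) != y))
```

    Consistency of the surrogate means that, in the limit, driving the cross-entropy down also drives the zero-one loss to its minimum; the paper's contribution concerns when two such losses additionally induce the same optimal quantizer.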

  3. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling (Joint work with H. Inan and R. Socher)
    ICLR 2017.

    Recurrent neural networks have been very successful at predicting sequences of words in tasks such as language modeling. However, all such models are based on the conventional classification framework, where the model is trained against one-hot targets and each word is represented both as an input and as an output in isolation. This causes inefficiencies in learning, both in utilizing all of the available information and in the number of parameters that must be trained. We introduce a novel theoretical framework that facilitates better learning in language modeling, and show that our framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables. Our framework leads to state-of-the-art performance on the Penn Treebank with a variety of network models.
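
    The tying itself amounts to reusing the input embedding matrix as the output projection. A shape-level sketch (the single `tanh` layer below is an illustrative stand-in for the actual recurrent network):

```python
import numpy as np

def tied_lm_logits(token_ids, E, W_h):
    """Toy forward pass with tied weights: the output projection reuses the
    input embedding matrix E (vocab x d), so logits = h @ E.T and no
    separate (d x vocab) output matrix is needed."""
    h = np.tanh(E[token_ids] @ W_h)   # (seq, d) hidden states; W_h is a stand-in
    return h @ E.T                    # (seq, vocab) logits share parameters with E
```

    An untied model would carry a separate d-by-vocab output matrix; for large vocabularies, tying roughly halves the embedding-related parameter count.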

Other Publications

  1. Finding the Winner of a Hypothesis Test via Sample Allocation Optimization (Joint work with K. Modarresi)
    Procedia Computer Science, Elsevier, Vol. 108, 2017.

  2. Steganographic Schemes with Multiple \(q\)-ary Changes per Block of Pixels (Joint work with I. Gholampour)
    Signal Processing (Elsevier), Vol. 108, March 2015.

  3. Interpolation of Steganographic Schemes (Joint work with I. Gholampour)
    Signal Processing (Elsevier), Vol. 98, May 2014.


Thesis

  1. Learning, Decision-Making, and Inference with Limited Experimentation, Doctoral Dissertation
    Stanford University, September 2019.