#### In the Pipeline

**Debiasing Linear Prediction. (arxiv)** *Nilesh Tripuraneni and Lester Mackey.*

Standard methods in supervised learning separate training and prediction: the model is fit independently of any test points it may encounter. However, can knowledge of the next test point $\mathbf{x}_\star$ be exploited to improve prediction accuracy? We address this question in the context of linear prediction, showing how debiasing techniques can be used transductively to combat regularization bias. We first lower bound the $\mathbf{x}_\star$ prediction error of ridge regression and the Lasso, showing that they must incur significant bias in certain test directions. Then, building on techniques from semi-parametric inference, we provide non-asymptotic upper bounds on the $\mathbf{x}_\star$ prediction error of two transductive, debiased prediction rules. We conclude by showing the efficacy of our methods on both synthetic and real data, highlighting the improvements test-point-tailored debiasing can provide in settings with distribution shift.

**A Kernel Stein Test for Comparing Latent Variable Models. (arxiv)** *Heishiro Kanagawa, Wittawat Jitkrittum, Lester Mackey, Kenji Fukumizu, and Arthur Gretton.*

We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018, Jitkrittum et al., 2018) which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015), which cannot exploit the latent structure.

**Teacher-Student Compression with Generative Adversarial Networks. (arxiv, poster)** *Ruishan Liu, Nicolo Fusi, and Lester Mackey.*

More accurate machine learning models often demand more computation and memory at test time, making them difficult to deploy on CPU- or memory-constrained devices. Model compression (also known as distillation) alleviates this burden by training a less expensive student model to mimic the expensive teacher model while maintaining most of the original accuracy. However, when fresh data is unavailable for the compression task, the teacher's training data is typically reused, leading to suboptimal compression. In this work, we propose to augment the compression dataset with synthetic data from a generative adversarial network (GAN) designed to approximate the training data distribution. Our GAN-assisted model compression (GAN-MC) significantly improves student accuracy for expensive models such as large random forests and deep neural networks on both tabular and image datasets. Building on these results, we propose a comprehensive metric---the Compression Score---to evaluate the quality of synthetic datasets based on their induced model compression performance. The Compression Score captures both data diversity and discriminability, and we illustrate its benefits over the popular Inception Score in the context of image classification.

**DeepMiner: Discovering Interpretable Representations for Mammogram Classification and Explanation. (arxiv, code)** *Jimmy Wu, Bolei Zhou, Diondra Peck, Scott Hsieh, Vandana Dialani, Lester Mackey, and Genevieve Patterson.*

We propose DeepMiner, a framework to discover interpretable representations in deep neural networks and to build explanations for medical predictions. By probing convolutional neural networks (CNNs) trained to classify cancer in mammograms, we show that many individual units in the final convolutional layer of a CNN respond strongly to diseased tissue concepts specified by the BI-RADS lexicon. After expert annotation of the interpretable units, our proposed method is able to generate explanations for CNN mammogram classification that are correlated with ground truth radiology reports on the DDSM dataset. We show that DeepMiner not only enables better understanding of the nuances of CNN classification decisions, but also possibly discovers new visual knowledge relevant to medical diagnosis.

#### Publications

**Minimum Stein Discrepancy Estimators. (arxiv)** *Alessandro Barp, Francois-Xavier Briol, Andrew B. Duncan, Mark Girolami, and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). Forthcoming.

When maximum likelihood estimation is infeasible, one often turns to score matching, contrastive divergence, or minimum probability flow learning to obtain tractable parameter estimates. We provide a unifying perspective of these techniques as minimum Stein discrepancy estimators and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. We establish the consistency, asymptotic normality, and robustness of DKSD and DSM estimators, derive stochastic Riemannian gradient descent algorithms for their efficient optimization, and demonstrate their advantages over score matching in models with non-smooth densities or heavy tailed distributions.

**Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. (arxiv)** *Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu.*

Advances in Neural Information Processing Systems (NeurIPS). Forthcoming.

Sampling with Markov chain Monte Carlo methods typically amounts to discretizing some continuous-time dynamics with numerical integration. In this paper, we establish the convergence rate of sampling algorithms obtained by discretizing smooth Ito diffusions exhibiting fast Wasserstein-2 contraction, based on local deviation properties of the integration scheme. In particular, we study a sampling algorithm constructed by discretizing the overdamped Langevin diffusion with the method of stochastic Runge-Kutta. For strongly convex potentials that are smooth up to a certain order, its iterates converge to the target distribution in 2-Wasserstein distance in Õ(dε^{-2/3}) iterations. This improves upon the best-known rate for strongly log-concave sampling based on the overdamped Langevin equation using only the gradient oracle without adjustment. In addition, we extend our analysis of stochastic Runge-Kutta methods to uniformly dissipative diffusions with possibly non-convex potentials and show they achieve better rates compared to the Euler-Maruyama scheme in terms of the dependence on tolerance ε. Numerical studies show that these algorithms lead to better stability and lower asymptotic errors.

**Accelerating Rescaled Gradient Descent: Fast Optimization of Smooth Functions. (arxiv)** *Ashia Wilson, Lester Mackey, and Andre Wibisono.*

Advances in Neural Information Processing Systems (NeurIPS). Forthcoming.

We present a family of algorithms, called descent algorithms, for optimizing convex and non-convex functions. We also introduce a new first-order algorithm, called rescaled gradient descent (RGD), and show that RGD achieves a faster convergence rate than gradient descent provided the function is strongly smooth -- a natural generalization of the standard smoothness assumption on the objective function. When the objective function is convex, we present two novel frameworks for "accelerating" descent methods, one in the style of Nesterov and the other in the style of Monteiro and Svaiter, using a single Lyapunov function. Rescaled gradient descent can be accelerated under the same strong smoothness assumption using both frameworks. We provide several examples of strongly smooth loss functions in machine learning and numerical experiments that verify our theoretical findings. We also present several extensions of our novel Lyapunov framework, including deriving optimal universal tensor methods and extending our framework to the coordinate setting.

**Improving Subseasonal Forecasting in the Western U.S. with Machine Learning. (arxiv, our SubseasonalRodeo dataset, slides, poster)** *Jessica Hwang, Paulo Orenstein, Judah Cohen, Karl Pfeiffer, and Lester Mackey.*

International Conference on Knowledge Discovery and Data Mining (KDD). Aug. 2019.

Water managers in the western United States (U.S.) rely on longterm forecasts of temperature and precipitation to prepare for droughts and other wet weather extremes. To improve the accuracy of these longterm forecasts, the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a year-long real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two to four weeks and four to six weeks in advance. Here we present and evaluate our machine learning approach to the Rodeo and release our SubseasonalRodeo dataset, collected to train and evaluate our forecasting system.

Our system is an ensemble of two nonlinear regression models. The first integrates the diverse collection of meteorological measurements and dynamic model forecasts in the SubseasonalRodeo dataset and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone is significantly more accurate than the debiased operational U.S. Climate Forecasting System (CFSv2), and our ensemble skill exceeds that of the top Rodeo competitor for each target variable and forecast horizon. Moreover, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40-50% for temperature and 129-169% for precipitation. We hope that both our dataset and our methods will help to advance the state of the art in subseasonal forecasting.

**Stein Point Markov Chain Monte Carlo. (arxiv, code)** *Wilson Ye Chen, Alessandro Barp, Francois-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, and Chris J. Oates.*

International Conference on Machine Learning (ICML). June 2019.

An important task in machine learning and statistics is the approximation of a probability measure by an empirical measure supported on a discrete point set. Stein Points are a class of algorithms for this task, which proceed by sequentially minimising a Stein discrepancy between the empirical measure and the target and, hence, require the solution of a non-convex optimisation problem to obtain each new point. This paper removes the need to solve this optimisation problem by, instead, selecting each new point based on a Markov chain sample path. This significantly reduces the computational cost of Stein Points and leads to a suite of algorithms that are straightforward to implement. The new algorithms are illustrated on a set of challenging Bayesian inference problems, and rigorous theoretical guarantees of consistency are established.

**Measuring Sample Quality with Diffusions. (arxiv, code)** *Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, and Lester Mackey.*

Annals of Applied Probability. Forthcoming.

Stein's method for measuring convergence to a continuous target distribution relies on an operator characterizing the target and Stein factor bounds on the solutions of an associated differential equation. While such operators and bounds are readily available for a diversity of univariate targets, few multivariate targets have been analyzed. We introduce a new class of characterizing operators based on Ito diffusions and develop explicit multivariate Stein factor bounds for any target with a fast-coupling Ito diffusion. As example applications, we develop computable and convergence-determining diffusion Stein discrepancies for log-concave, heavy-tailed, and multimodal targets and use these quality measures to select the hyperparameters of biased Markov chain Monte Carlo (MCMC) samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate MCMC. Our results establish a near-linear relationship between diffusion Stein discrepancies and Wasserstein distances, improving upon past work even for strongly log-concave targets. The exposed relationship between Stein factors and Markov process coupling may be of independent interest.

**Stratification of amyotrophic lateral sclerosis patients: a crowdsourcing approach. (article)** *Robert Kueffner, Neta Zach, Maya Bronfeld, Raquel Norel, Nazem Atassi, Venkat Balagurusamy, Barbara DiCamillo, Adriano Chio, Merit Cudkowicz, Donna Dillenberger, Javier Garcia-Garcia, Orla Hardiman, Bruce Hoff, Joshua Knight, Melanie L. Leitner, Guang Li, Lara Mangravite, Thea Norman, Liuxia Wang, The ALS Stratification Consortium (including Lester Mackey), Jinfeng Xiao, Wen-Chieh Fang, Jian Peng, Chen Yang, Huan-Jui Chang, and Gustavo Stolovitzky.*

Scientific Reports. January 2019.

Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease where substantial heterogeneity in clinical presentation urgently requires a better stratification of patients for the development of drug trials and clinical care. In this study we explored stratification through a crowdsourcing approach, the DREAM Prize4Life ALS Stratification Challenge. Using data from >10,000 patients from ALS clinical trials and 1479 patients from community-based patient registers, more than 30 teams developed new approaches for machine learning and clustering, outperforming the best current predictions of disease outcome. We propose a new method to integrate and analyze patient clusters across methods, showing a clear pattern of consistent and clinically relevant sub-groups of patients that also enabled the reliable classification of new patients. Our analyses reveal novel insights in ALS and describe for the first time the potential of crowdsourcing to uncover hidden patient sub-populations, and to accelerate disease understanding and therapeutic development.

**A Multifactorial Model of T Cell Expansion and Durable Clinical Benefit in Response to a PD-L1 Inhibitor. (article, code, The ASCO Post)** *Mark DM Leiserson, Vasilis Syrgkanis, Amy Gilson, Miroslav Dudik, Sharon Gillett, Jennifer Chayes, Christian Borgs, Dean F Bajorin, Jonathan Rosenberg, Samuel Funt, Alexandra Snyder, and Lester Mackey.*

PLOS One. December 2018.

Checkpoint inhibitor immunotherapies have had major success in treating patients with late-stage cancers, yet only a minority of patients benefit. Mutation load and PD-L1 staining are leading biomarkers associated with response, but each is an imperfect predictor. A key challenge to predicting response is modeling the interaction between the tumor and immune system. We begin to address this challenge with a multifactorial model for response to anti-PD-L1 therapy. We train a model to predict immune response in patients after treatment based on 36 clinical, tumor, and circulating features collected prior to treatment. We analyze data from 21 bladder cancer patients using the elastic net high-dimensional regression procedure and, as training set error is a biased and overly optimistic measure of prediction error, we use leave-one-out cross-validation to obtain unbiased estimates of accuracy on held-out patients. In held-out patients, the model explains 79% of the variance in T cell clonal expansion. This predicted immune response is multifactorial, as the variance explained is at most 23% if clinical, tumor, or circulating features are excluded. Moreover, if patients are triaged according to predicted expansion, only 38% of non-durable clinical benefit (DCB) patients need be treated to ensure that 100% of DCB patients are treated. In contrast, using mutation load or PD-L1 staining alone, one must treat at least 77% of non-DCB patients to ensure that all DCB patients receive treatment. Thus, integrative models of immune response may improve our ability to anticipate clinical benefit of immunotherapy.

**S2S reboot: An argument for greater inclusion of machine learning in subseasonal to seasonal forecasts. (article)** *Judah Cohen, Dim Coumou, Jessica Hwang, Lester Mackey, Paulo Orenstein, Sonja Totz, and Eli Tziperman.*

WIREs Climate Change. December 2018.

The discipline of seasonal climate prediction began as an exercise in simple statistical techniques. However, today the large government forecast centers almost exclusively rely on complex fully coupled dynamical forecast systems for their subseasonal to seasonal (S2S) predictions while statistical techniques are mostly neglected and those techniques still in use have not been updated in decades. In this Opinion Article, we argue that new statistical techniques mostly developed outside the field of climate science, collectively referred to as machine learning, can be adopted by climate forecasters to increase the accuracy of S2S predictions. We present an example of where unsupervised learning demonstrates higher accuracy in a seasonal prediction than the state-of-the-art dynamical systems. We also summarize some relevant machine learning methods that are most applicable to climate prediction. Finally, we show by comparing real-time dynamical model forecasts with observations from winter 2017/2018 that dynamical model forecasts are almost entirely insensitive to polar vortex (PV) variability and the impact on sensible weather. Instead, statistical forecasts more accurately predicted the resultant sensible weather from a mid-winter PV disruption than the dynamical forecasts. The important implication from the poor dynamical forecasts is that if Arctic change influences mid-latitude weather through PV variability, then the ability of dynamical models to demonstrate the existence of such a pathway is compromised. We conclude by suggesting that S2S prediction will be most beneficial to the public by incorporating mixed or a hybrid of dynamical forecasts and updated statistical techniques such as machine learning.

**Global Non-convex Optimization with Discretized Diffusions. (arxiv, poster)** *Murat A. Erdogdu, Lester Mackey, and Ohad Shamir.*

Advances in Neural Information Processing Systems (NeurIPS). December 2018.

An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.
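For readers unfamiliar with the baseline named in the first sentence of this abstract, here is a minimal sketch of an Euler discretization of the Langevin diffusion applied to minimization. It illustrates only the classical Langevin scheme, not the more general diffusions the paper designs. The function name `langevin_euler`, the step size, the inverse temperature, and the iteration count are all illustrative assumptions rather than values from the paper.

```python
import numpy as np

def langevin_euler(grad_f, x0, step=1e-2, inv_temp=10.0, n_steps=5000, seed=0):
    """Euler discretization of dX = -grad f(X) dt + sqrt(2 / inv_temp) dW.

    All default values are hypothetical choices made for illustration.
    Returns the final iterate after n_steps updates.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    # Standard deviation of the Gaussian noise injected at every step.
    noise_scale = np.sqrt(2.0 * step / inv_temp)
    for _ in range(n_steps):
        x = x - step * grad_f(x) + noise_scale * rng.standard_normal(x.shape)
    return x

# Usage: minimize f(x) = 0.5 * ||x - 1||^2, whose gradient is x - 1.
x_final = langevin_euler(lambda x: x - 1.0, x0=[5.0, -5.0], inv_temp=1e4)
```

A larger inverse temperature concentrates the stationary distribution more tightly around the minimizers of f, at the cost of slower exploration on non-convex objectives.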

**Random Feature Stein Discrepancies. (arxiv, code, poster)** *Jonathan H. Huggins and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). December 2018.

Computable Stein discrepancies have been deployed for a variety of applications, including sampler selection in posterior inference, approximate Bayesian inference, and goodness-of-fit testing. Existing convergence-determining Stein discrepancies admit strong theoretical guarantees but suffer from a computational cost that grows quadratically in the sample size. While linear-time Stein discrepancies have been proposed for goodness-of-fit testing, they exhibit avoidable degradations in testing power---even when power is explicitly optimized. To address these shortcomings, we introduce feature Stein discrepancies (ΦSDs), a new family of quality measures that can be cheaply approximated using importance sampling. We show how to construct ΦSDs that provably determine the convergence of a sample to its target and develop high-accuracy approximations---random ΦSDs (RΦSDs)---which are computable in near-linear time. In our experiments with sampler selection for approximate posterior inference and goodness-of-fit testing, RΦSDs typically perform as well as or better than quadratic-time KSDs while being orders of magnitude faster to compute.

**Stein Points. (arxiv, code, bib)** *Wilson Ye Chen, Lester Mackey, Jackson Gorham, Francois-Xavier Briol, and Chris J. Oates.*

International Conference on Machine Learning (ICML). July 2018.

An important task in computational statistics and machine learning is to approximate a posterior distribution p(x) with an empirical measure supported on a set of representative points {x_i}_{i=1}^n. This paper focuses on methods where the selection of points is essentially deterministic, with an emphasis on achieving accurate approximation when n is small. To this end, we present 'Stein Points'. The idea is to exploit either a greedy or a conditional gradient method to iteratively minimise a kernel Stein discrepancy between the empirical measure and p(x). Our empirical results demonstrate that Stein Points enable accurate approximation of the posterior at modest computational cost. In addition, theoretical results are provided to establish convergence of the method.

**Orthogonal Machine Learning: Power and Limitations. (arxiv, slides, code, bib)** *Lester Mackey, Vasilis Syrgkanis, and Ilias Zadik.*

International Conference on Machine Learning (ICML). July 2018.

Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a k-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear regression setting popular in causal inference, we show that we can construct second-order orthogonal moments if and only if the treatment residual is not normally distributed. Our proof relies on Stein's lemma and may be of independent interest. We conclude by demonstrating the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.
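As a quick consistency check on the rates quoted in this abstract, substituting small values of k into the general requirement shows that first-order orthogonality recovers the familiar double machine learning rate, while second-order orthogonality tolerates a slower nuisance rate:

$$
n^{-1/(2k+2)}\Big|_{k=1} = n^{-1/4}, \qquad n^{-1/(2k+2)}\Big|_{k=2} = n^{-1/6}.
$$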

**Accurate Inference for Adaptive Linear Models. (arxiv, code, bib)** *Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy.*

International Conference on Machine Learning (ICML). July 2018.

Estimators computed from adaptively collected data do not behave like their non-adaptive brethren. Rather, the sequential dependence of the collection policy can lead to severe distributional biases that persist even in the infinite data limit. We develop a general method -- W-decorrelation -- for transforming the bias of adaptive linear regression estimators into variance. The method uses only coarse-grained information about the data collection policy and does not need access to propensity scores or exact knowledge of the policy. We bound the finite-sample bias and variance of W-decorrelation and develop asymptotically correct confidence intervals based on a novel martingale central limit theorem. We then demonstrate the empirical benefits of the generic W-decorrelation procedure in two different adaptive data settings: the multi-armed bandit and the autoregressive time series settings.

**Expert identification of visual primitives used by CNNs during mammogram classification. (arxiv, code, poster, bib)** *Jimmy Wu, Diondra Peck, Scott Hsieh, Vandana Dialani, Constance D. Lehman, Bolei Zhou, Vasilis Syrgkanis, Lester Mackey, and Genevieve Patterson.*

SPIE Medical Imaging. February 2018.

This work interprets the internal representations of deep neural networks trained for classifying diseased tissue in 2D mammograms. We propose an expert-in-the-loop interpretation method to label the behavior of internal units in convolutional neural networks (CNNs). Expert radiologists identify that the visual patterns detected by the units are correlated with meaningful medical phenomena such as mass tissue and calcified vessels. We demonstrate that several trained CNN models are able to produce explanatory descriptions to support the final classification decisions. We view this as an important first step toward interpreting the internal representations of medical classification CNNs and explaining their predictions.

**Empirical Bayesian Analysis of Simultaneous Changepoints in Multiple Data Sequences. (arxiv, code, bib)** *Zhou Fan and Lester Mackey.*

Annals of Applied Statistics. December 2017.

Copy number variations in cancer cells and volatility fluctuations in stock prices are commonly manifested as changepoints occurring at the same positions across related data sequences. We introduce a Bayesian modeling framework, BASIC, that employs a changepoint prior to capture the co-occurrence tendency in data of this type. We design efficient algorithms to sample from and maximize over the BASIC changepoint posterior and develop a Monte Carlo expectation-maximization procedure to select prior hyperparameters in an empirical Bayes fashion. We use the resulting BASIC framework to analyze DNA copy number variations in the NCI-60 cancer cell lines and to identify important events that affected the price volatility of S&P 500 stocks from 2000 to 2009.

**Measuring Sample Quality with Kernels. (arxiv, slides, code, bib)** *Jackson Gorham and Lester Mackey.*

International Conference on Machine Learning (ICML). August 2017.

Approximate Markov chain Monte Carlo (MCMC) offers the promise of more rapid sampling at the cost of more biased inference. Since standard MCMC diagnostics fail to detect these biases, researchers have developed computable Stein discrepancy measures that provably determine the convergence of a sample to its target distribution. This approach was recently combined with the theory of reproducing kernels to define a closed-form kernel Stein discrepancy (KSD) computable by summing kernel evaluations across pairs of sample points. We develop a theory of weak convergence for KSDs based on Stein's method, demonstrate that commonly used KSDs fail to detect non-convergence even for Gaussian targets, and show that kernels with slowly decaying tails provably determine convergence for a large class of target distributions. The resulting convergence-determining KSDs are suitable for comparing biased, exact, and deterministic sample sequences and simpler to compute and parallelize than alternative Stein discrepancies. We use our tools to compare biased samplers, select sampler hyperparameters, and improve upon existing KSD approaches to one-sample hypothesis testing and sample quality improvement.
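To make "summing kernel evaluations across pairs of sample points" concrete, the sketch below computes a KSD with the Langevin Stein operator. It assumes an inverse multiquadric kernel k(x, y) = (c² + ‖x − y‖²)^β with β in (−1, 0) as one example of a kernel with slowly decaying tails. The function name `imq_ksd`, the defaults c = 1 and β = −1/2, and the standard Gaussian target in the usage lines are illustrative assumptions, not details taken from the abstract or the linked code.

```python
import numpy as np

def imq_ksd(points, score, c=1.0, beta=-0.5):
    """Kernel Stein discrepancy of an equally weighted sample.

    points: (n, d) array of sample points.
    score:  function mapping an (n, d) array to the (n, d) array of
            gradients of the log target density at each point.
    c, beta: hypothetical kernel parameters for k = (c^2 + ||x-y||^2)^beta.
    """
    x = np.asarray(points, dtype=float)
    n, d = x.shape
    s = score(x)                              # s[i] = grad log p(x_i)
    diff = x[:, None, :] - x[None, :, :]      # diff[i, j] = x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)
    u = c ** 2 + sqdist
    k = u ** beta
    # Gradient of k in its first argument; the gradient in the second
    # argument is the negative of this.
    grad_x = 2.0 * beta * u[..., None] ** (beta - 1.0) * diff
    # Trace of the mixed second derivative, sum_d d^2 k / (dx_d dy_d).
    trace = (-4.0 * beta * (beta - 1.0) * u ** (beta - 2.0) * sqdist
             - 2.0 * beta * d * u ** (beta - 1.0))
    term_xs = np.einsum("ijd,jd->ij", grad_x, s)    # grad_x k . s(x_j)
    term_ys = -np.einsum("ijd,id->ij", grad_x, s)   # grad_y k . s(x_i)
    term_ss = k * (s @ s.T)                         # k * s(x_i) . s(x_j)
    stein_kernel = trace + term_xs + term_ys + term_ss
    # Guard against tiny negative sums caused by floating-point rounding.
    return np.sqrt(max(stein_kernel.sum(), 0.0)) / n

# Usage: a standard Gaussian target has score(x) = -x.
rng = np.random.default_rng(0)
on_target = rng.standard_normal((200, 2))
off_target = on_target + 2.0
ksd_on = imq_ksd(on_target, lambda x: -x)
ksd_off = imq_ksd(off_target, lambda x: -x)
```

The pairwise tensor makes the cost quadratic in the sample size, which matches the pairwise-sum description in the abstract; the shifted sample should receive the larger discrepancy.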

**Improving Gibbs Sampler Scan Quality with DoGS. (arxiv, bib)** *Ioannis Mitliagkas and Lester Mackey.*

International Conference on Machine Learning (ICML). August 2017.

The pairwise influence matrix of Dobrushin has long been used as an analytical tool to bound the rate of convergence of Gibbs sampling. In this work, we use Dobrushin influence as the basis of a practical tool to certify and efficiently improve the quality of a discrete Gibbs sampler. Our Dobrushin-optimized Gibbs samplers (DoGS) offer customized variable selection orders for a given sampling budget and variable subset of interest, explicit bounds on total variation distance to stationarity, and certifiable improvements over the standard systematic and uniform random scan Gibbs samplers. In our experiments with joint image segmentation and object recognition, Markov chain Monte Carlo maximum likelihood estimation, and Ising model inference, DoGS consistently deliver higher-quality inferences with significantly smaller sampling budgets than standard Gibbs samplers.

**Predicting Patient "Cost Blooms" in Denmark: A Longitudinal Population-based Study. (pdf, bib)** *Suzanne Tamang, Arnold Milstein, Henrik Toft Sorensen, Lars Pedersen, Lester Mackey, Jean-Raymond Betterton, Lucas Janson, and Nigam Shah.*

BMJ Open. January 2017.

**Objectives:** To compare the ability of standard vs. enhanced models to predict future high-cost patients, especially those who move from a lower to the upper decile of per capita healthcare expenditures within one year - i.e., "cost bloomers." **Design:** We developed alternative models to predict being in the upper decile of healthcare expenditures in Year 2 of a sample, based on data from Year 1. Our six alternative models ranged from a standard cost-prediction model with four variables (i.e., traditional model features), to our largest enhanced model with 1,053 nontraditional model features. To quantify any increases in predictive power that enhanced models achieved over standard tools, we compared the prospective predictive performance of each model. **Participants and setting:** We used the population of Western Denmark between 2004 and 2011 (2,146,801 individuals) to predict future high-cost patients and examine characteristics of high-cost cohorts. Using the most recent two-year period (2010-11) for model evaluation, our whole-population model used a cohort of 1,557,950 individuals with a full year of active residency in Year 1 (2010). Our cost-bloom model excluded the 155,795 individuals who were already high cost at the population level in Year 1, resulting in 1,402,155 individuals for prediction of cost bloomers in Year 2 (2011). **Primary outcome measures:** Using unseen data from a future year, we evaluated each model's prospective predictive performance by calculating the ratio of predicted high-cost patient expenditures to the actual high-cost patient expenditures in Year 2 - i.e., cost capture. **Results:** Our best enhanced model achieved a 21 percent and 30 percent improvement in cost capture over a standard diagnosis-based model for predicting population-level high-cost patients and cost bloomers, respectively. **Conclusions:** In combination with modern statistical learning methods for analyzing large datasets, models enhanced with a large and diverse set of features led to better performance, especially for predicting future cost bloomers.

**Predicting inpatient clinical order patterns with probabilistic topic models vs. conventional order sets. (pdf, bib)** *Jonathan H. Chen, Mary K. Goldstein, Steven M. Asch, Lester Mackey, and Russ B. Altman.*

Journal of the American Medical Informatics Association. September 2016.

**Objective** Build probabilistic topic model representations of hospital admissions processes and compare the ability of such models to predict clinical order patterns as compared to preconstructed order sets. **Materials and Methods** The authors evaluated the first 24 hours of structured electronic health record data for > 10 K inpatients. Drawing an analogy between structured items (e.g., clinical orders) and words in a text document, the authors performed latent Dirichlet allocation probabilistic topic modeling. These topic models use initial clinical information to predict clinical orders for a separate validation set of > 4 K patients. The authors evaluated these topic model-based predictions vs existing human-authored order sets by area under the receiver operating characteristic curve, precision, and recall for subsequent clinical orders. **Results** Existing order sets predict clinical orders used within 24 hours with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% by using probabilistic topic models to summarize clinical data into up to 32 topics. Many of these latent topics yield natural clinical interpretations (e.g., "critical care," "pneumonia," "neurologic evaluation"). **Discussion** Existing order sets tend to provide nonspecific, process-oriented aid, with usability limitations impairing more precise, patient-focused support. Algorithmic summarization has the potential to breach this usability barrier by automatically inferring patient context, but with potential tradeoffs in interpretability. **Conclusion** Probabilistic topic modeling provides an automated approach to detect thematic trends in patient care and generate decision support content. A potential use case finds related clinical orders for decision support.

**Efron-Stein Inequalities for Random Matrices. (pdf, bib)** *Daniel Paulin, Lester Mackey, and Joel A. Tropp.*

Annals of Probability. September 2016.

This paper establishes new concentration inequalities for random matrices constructed from independent random variables. These results are analogous with the generalized Efron-Stein inequalities developed by Boucheron et al. The proofs rely on the method of exchangeable pairs.

**Multivariate Stein Factors for a Class of Strongly Log-concave Distributions. (arxiv, bib)***Lester Mackey and Jackson Gorham.*

Electronic Communications in Probability. September 2016.

We establish uniform bounds on the low-order derivatives of Stein equation solutions for a broad class of multivariate, strongly log-concave target distributions. These "Stein factor" bounds deliver control over Wasserstein and related smooth function distances and are well-suited to analyzing the computable Stein discrepancy measures of Gorham and Mackey. Our arguments of proof are probabilistic and feature the synchronous coupling of multiple overdamped Langevin diffusions.

**Jet-Images -- Deep Learning Edition. (pdf, code, bib)***Luke de Oliveira, Michael Kagan, Lester Mackey, Benjamin Nachman, and Ariel Schwartzman.*

Journal of High Energy Physics. July 2016.

Building on the notion of a particle physics detector as a camera and the collimated streams of high energy particles, or jets, it measures as an image, we investigate the potential of machine learning techniques based on deep learning architectures to identify highly boosted W bosons. Modern deep learning algorithms trained on jet images can out-perform standard physically-motivated feature driven approaches to jet tagging. We develop techniques for visualizing how these features are learned by the network and what additional information is used to improve performance. This interplay between physically-motivated feature driven tools and supervised learning algorithms is general and can be used to significantly increase the sensitivity to discover new particles and new forces, and gain a deeper understanding of the physics within jets.

**Fuzzy Jets. (pdf, code, bib)***Lester Mackey, Benjamin Nachman, Ariel Schwartzman, and Conrad Stansbury.*

Journal of High Energy Physics. June 2016.

Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets. To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets, are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variables. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.

**Measuring Sample Quality with Stein's Method. (arxiv, poster, code, bib)***Jackson Gorham and Lester Mackey.*

Advances in Neural Information Processing Systems (NIPS). December 2015.

To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. However, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. To address these challenges, we introduce a new computable quality measure based on Stein's method that quantifies the maximum discrepancy between sample and target expectations over a large class of test functions. We use our tool to compare exact, biased, and deterministic sample sequences and illustrate applications to hyperparameter selection, convergence rate assessment, and quantifying bias-variance tradeoffs in posterior inference.

**Weighted Classification Cascades for Optimizing Discovery Significance in the HiggsML Challenge. (pdf, bib)***Lester Mackey, Jordan Bryan, and Man Yue Mo.*

Proceedings of the NIPS Workshop on High Energy Physics, Machine Learning, and the HiggsML Data Challenge. August 2015.

We introduce a minorization-maximization approach to optimizing common measures of discovery significance in high energy physics. The approach alternates between solving a weighted binary classification problem and updating class weights in a simple, closed-form manner. Moreover, an argument based on convex duality shows that an improvement in weighted classification error on any round yields a commensurate improvement in discovery significance. We complement our derivation with experimental results from the 2014 Higgs boson machine learning challenge.

**Distributed Matrix Completion and Robust Factorization. (pdf, website, code, bib)***Lester Mackey, Ameet Talwalkar, and Michael I. Jordan.*

Journal of Machine Learning Research. April 2015.

**Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. (pdf, website, bib)***Robert Kuffner, Neta Zach, Raquel Nore, Johann Hawe, David Schoenfeld, Liuxia Wang, Guang Li, Lilly Fang, Lester Mackey, Orla Hardiman, Merit Cudkowicz, Alexander Sherman, Gokhan Ertaylan, Moritz Grosse-Wentrup, Torsten Hothorn, Jules van Ligtenberg, Jakob H. Macke, Timm Meyer, Bernhard Scholkopf, Linh Tran, Rubio Vaughan, Gustavo Stolovitzky, and Melanie L. Leitner.*

Nature Biotechnology. November 2014.

- Editor's choice for Science Translational Medicine, November 2014.

**Combinatorial Clustering and the Beta Negative Binomial Process. (pdf, code, bib)***Tamara Broderick, Lester Mackey, John Paisley, and Michael I. Jordan.*

IEEE Transactions on Pattern Analysis and Machine Intelligence. April 2014.

**Matrix Concentration Inequalities via the Method of Exchangeable Pairs. (pdf, bib, Joel Tropp's talk)***Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp.*

Annals of Probability. March 2014.

**Corrupted Sensing: Novel Guarantees for Separating Structured Signals. (pdf, bib)***Rina Foygel and Lester Mackey.*

IEEE Transactions on Information Theory. February 2014.

**Distributed Low-rank Subspace Segmentation. (pdf, code, bib)***Ameet Talwalkar, Lester Mackey, Yadong Mu, Shih-Fu Chang, and Michael I. Jordan.*

IEEE International Conference on Computer Vision (ICCV). December 2013.

**The Asymptotics of Ranking Algorithms. (pdf, bib)***John C. Duchi, Lester Mackey, and Michael I. Jordan.*

Annals of Statistics. November 2013.

**Joint Link Prediction and Attribute Inference using a Social-Attribute Network. (pdf, website, bib)***Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Eui Chul Richard Shin, Emil Stefanov, Elaine (Runting) Shi, and Dawn Song.*

ACM Transactions on Intelligent Systems and Technology. March 2013.

**Divide-and-Conquer Matrix Factorization. (pdf, website, code, bib)***Lester Mackey, Ameet Talwalkar, and Michael I. Jordan.*

Advances in Neural Information Processing Systems (NIPS). December 2011.

**Visually Relating Gene Expression and in vivo DNA Binding Data. (pdf, bib)***Min-Yu Huang, Lester Mackey, Soile Keranen, Gunther Weber, Michael Jordan, David Knowles, Mark Biggin, and Bernd Hamann.*

IEEE International Conference on Bioinformatics and Biomedicine (BIBM). November 2011.

**Mixed Membership Matrix Factorization. (pdf, supp info, slides, code, bib)***Lester Mackey, David Weiss, and Michael I. Jordan.*

International Conference on Machine Learning (ICML). June 2010.

Handbook of Mixed Membership Models and Their Applications. November 2014.

**On the Consistency of Ranking Algorithms. (pdf, slides, bib)***John Duchi, Lester Mackey, and Michael I. Jordan.*

International Conference on Machine Learning (ICML). June 2010.

- Winner of the ICML 2010 Best Student Paper Award.

**Deflation Methods for Sparse PCA. (pdf, poster, code, bib)***Lester Mackey.*

Advances in Neural Information Processing Systems (NIPS). December 2008.

**Fault-tolerant Typed Assembly Language. (pdf, bib)***Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker.*

ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). June 2007.

- Joint winner of the PLDI 2007 Best Paper Award.
- Selected for SIGPLAN CACM Research Highlights, September 2008.

**Static Typing for a Faulty Lambda Calculus. (pdf, bib)***David Walker, Lester Mackey, Jay Ligatti, George Reis, and David August.*

ACM SIGPLAN International Conference on Functional Programming (ICFP). September 2006.

**Participatory Design with Proxies: Developing a Desktop-PDA System to Support People with Aphasia. (pdf, bib)***Jordan Boyd-Graber, Sonya Nikolova, Karyn Moffatt, Kenrick Kin, Joshua Lee, Lester Mackey, Marilyn Tremaine, and Maria Klawe.*

SIGCHI Conference on Human Factors in Computing Systems (CHI). April 2006.

#### Other Work

**Deriving Matrix Concentration Inequalities from Kernel Couplings. (arxiv)***Daniel Paulin, Lester Mackey, and Joel A. Tropp. May 2013.*

**Feature-Weighted Linear Stacking. (arxiv, Joe Sill's talk)***Joint work with Joe Sill, Gabor Takacs, and David Lin. November 2009.*

**Anomaly Detection for Asynchronous and Incomplete Data. (pdf)***Joint work with John Duchi and Fabian Wauthier.*

Advanced Topics in Computer Systems (UC Berkeley CS 262A, E. Brewer). December 2008.

**Scalable Dyadic Kernel Machines. (pdf)**

Advanced Topics in Learning and Decision Making (UC Berkeley CS 281B, P. Bartlett). May 2008.

**Latent Dirichlet Markov Random Fields for Semi-supervised Image Segmentation and Object Recognition. (pdf)**

Statistical Learning Theory (UC Berkeley CS 281A, M. Jordan) and Computer Vision (UC Berkeley CS 280, J. Malik). December 2007.

#### Invited Talks

**Improving Subseasonal Forecasting in the Western U.S. with Machine Learning. (slides)**

- Statistics & Data Science Conference (SDSCon), MIT, Apr. 2019.
- Computer Science Colloquium, Cornell University, Nov. 2018.

**Probabilistic Inference and Learning with Stein's Method. (slides)**

- Data, Learning, and Inference Workshop (DALI), George, South Africa, Jan. 2019.
- Charles Stein Memorial Session, Joint Statistical Meetings, Vancouver, Canada, July 2018.

**Orthogonal Machine Learning: Power and Limitations. (slides)**

- Robust and High-Dimensional Statistics Workshop, Simons Institute for the Theory of Computing, Oct. 2018.

**Measuring Sample Quality with Stein's Method. (slides)**

- Gatsby Unit Seminar, University College London, Oct. 2016.
- Seminar, University of Liege, Sep. 2016.
- Quetelet Seminar, Ghent University, Sep. 2016.
- International Conference on Monte Carlo and Quasi-Monte Carlo Methods (MCQMC), Stanford, CA, Aug. 2016.
- Statistics Seminar, Columbia University, Feb. 2016.
- Quasi-Monte Carlo Invited Session, IMS-ISBA Joint Meeting (MCMSki V), Jan. 2016.
- Wharton Statistics Seminar, University of Pennsylvania, Dec. 2015.
- Neyman Seminar, UC Berkeley, Sep. 2015.
- IMS-Microsoft Research Workshop: Foundations of Data Science, Cambridge, MA, June 2015.
- Stochastics and Statistics Seminar, MIT, May 2015.
- Statistics Seminar, Stanford University, May 2015.

**Measuring Sample Quality with Kernels. (slides)**

- Bayes, Machine Learning, and Deep Learning Invited Session, International Society for Bayesian Analysis (ISBA) World Meeting, June 2018.
- Harvard / MIT Econometrics Workshop, MIT, Mar. 2018.
- SAMSI Workshop on Trends and Advances in Monte Carlo Sampling Algorithms, Duke University, Dec. 2017.
- SAMSI Workshop on Quasi-Monte Carlo and High-Dimensional Sampling Methods, Duke University, Aug. 2017.
- Borchard Colloquium on Concentration Inequalities, High Dimensional Statistics, and Stein's Method, Missilac, France, July 2017.
- New England Machine Learning Day, Cambridge, MA, May 2017.
- Machine Learning Seminar, MIT, Mar. 2017.

**Statistics for Social Good**

- AI Now Symposium on the Social and Economic Impact of Artificial Intelligence Technologies, MIT, July 2017.
- Data Science @ Stanford Seminar, Stanford, June 2016.

**Matrix Completion and Matrix Concentration. (slides)**

- IDSS Special Seminar, MIT, Feb. 2016.
- Statistics Seminar, Harvard University, Nov. 2014.
- Blackwell-Tapia Conference, Los Angeles, CA, Nov. 2014.
- Information Systems Laboratory Colloquium, Stanford University, April 2013.
- Statistics Seminar, Yale University, April 2013.
- Statistics Seminar, Columbia University, April 2013.
- Computer Science Seminar, University of Southern California, May 2012.
- Statistics Seminar, Stanford University, Jan. 2012.

**Divide-and-Conquer Matrix Factorization. (slides)**

- CS Department Colloquium, Princeton University, Dec. 2015.
- Workshop on Big Data: Theoretical and Practical Challenges, Paris, France, May 2013.
- Kaggle, San Francisco, CA, Feb. 2013.
- Statistical Science Seminar Series, Duke University, Jan. 2012.
- CMS Seminar, Caltech, Jan. 2012.
- San Francisco Bay Area Machine Learning Meetup, San Francisco, CA, Nov. 2011.

**Predicting ALS Disease Progression with Bayesian Additive Regression Trees. (slides)**

- Big Data in Biomedicine Conference, Stanford University, May 2015.
- Guest Lecture, Stats 202, Stanford University, Nov. 2013.
- Statistics Seminar, Stanford University, April 2013.
- RECOMB Conference on Regulatory and Systems Genomics, San Francisco, CA, Nov. 2012.

**Weighted Classification Cascades for Optimizing Discovery Significance. (slides)**

- NIPS Workshop on High-energy particle physics, machine learning, and the HiggsML data challenge (HEPML), December 2014.

**Ranking, Aggregation, and You. (slides)**

- Statistics Seminar, University of Chicago, Oct. 2014.
- Yale MacMillan-CSAP Workshop on Quantitative Research Methods, Yale University, Sep. 2014.
- Wharton Statistics Seminar, University of Pennsylvania, Sep. 2014.
- Statistics Seminar, Carnegie Mellon University, Sep. 2014.
- Western Section Meeting, American Mathematical Society, Nov. 2013.
- Statistics Seminar, Stanford University, Sep. 2013.
- Stanford Statistics/Machine Learning Reading Group, Stanford University, Nov. 2012.

**Dividing, Conquering, and Mixing Matrix Factorizations. (slides)**

Technicolor, Palo Alto, CA, June 2013.

**Stein's Method for Matrix Concentration. (slides)**

- Institut National de Recherche en Informatique et en Automatique (INRIA), Dec. 2012.
- Berkeley Probability Seminar, University of California, Berkeley, May 2012.

**Build a Better Netflix, Win a Million Dollars?**

SPARC Camp, Aug. 2014. (slides)

USA Science and Engineering Festival, Washington, DC, Apr. 2012. (slides)

**The Story of the Netflix Prize: An Ensembler's Tale. (slides, video)**

National Academies' Seminar, Washington, DC, Nov. 2011.

**Mixed Membership Matrix Factorization. (slides)**

Joint Statistical Meetings, Miami Beach, FL, July 2011.

**False Event Identification and Beyond: A Machine Learning Approach.***Presented with Ariel Kleiner.*

Comprehensive Test Ban Treaty Organization Technical Meeting on Data Mining, Vienna, Austria, Nov. 2009.

**The Dinosaur Planet Approach to the Netflix Prize.**

- LIDS Seminar Series, MIT, Nov. 2008, presented with David Weiss.
- Guest Lecture, Stat 157, U.C. Berkeley, Sept. 2008.
- Process Driven Trading Group, Morgan Stanley, April 2008, presented with David Lin and David Weiss.