300 Pasteur Drive, Palo Alto, CA 94304
Research Overview
Welcome! I am a postdoctoral scientist working in functional genomics and machine learning, in the lab of Anshul Kundaje at Stanford University. My genomics research has focused on computationally exploring epigenetic regulation of transcription at the single-cell level, with applications in basic cancer and aging research. My machine learning and statistics research includes methods for semi-supervised and representation learning and decision theory, as well as work on more foundational aspects of statistics and probability of stochastic processes.
I completed my PhD at UC San Diego in machine learning, where I was advised by Yoav Freund. During my PhD, I developed semi-supervised algorithms to combine ensembles of predictors, and also worked on stochastic processes. I spent summers interning with the Machine Learning group at Microsoft Research NYC, the video content analysis group at Google Research Mountain View, and the Interaction and Intent group at Microsoft Research Silicon Valley.
Research Manuscripts
Click on each paper title for a very unofficial one-sentence summary.
Preprints

P-value peeking and estimating extrema. [arXiv]
In old and new statistical hypothesis tests, the reported p-values can be optimally adapted to any sampling strategy, eliminating the need to prespecify sample sizes.
Sharp finite-sample concentration of independent variables. [arXiv]
Empirical distributions from i.i.d. data are concentrated, with a very simple information-theoretic proof.
Linking generative adversarial learning and binary classification. [arXiv]
Generative adversarial learning of a distribution, using a classifier learned by risk minimization, is always equivalent to f-divergence minimization. Akshay Balsubramani, Yoav Freund. Preprint.

Learning to abstain from binary prediction. [arXiv]
The problem of binary classification with an abstaining predictor centers around the tradeoff between abstaining and making a prediction error. We characterize this tradeoff optimally well, both theoretically and empirically with efficient algorithms that use labeled and unlabeled data. 
PAC-Bayes iterated logarithm bounds for martingale mixtures. [arXiv]
Any mixture of stochastic processes with high probability stays within an optimally characterized range of its conditional mean, at all times along its sample path, and with respect to all "posterior" mixing distributions. 
Sharp finite-time iterated-logarithm martingale concentration. [arXiv]
Any stochastic process with high probability stays within a narrow, optimally characterized range of its conditional mean, at all times along its sample path.
Papers

A genome-wide atlas of co-essential modules assigns function to uncharacterized genes. [bioRxiv]
By measuring essentiality of genes in a broad spectrum of cancer cell lines using CRISPR, we can infer known and unknown functional relationships between genes.
Nature Genetics, 2021. 
WILDS: A benchmark of in-the-wild distribution shifts. [arXiv]
"In the wild" shifts between training and test distributions are commonplace and benchmarked here, with significant effects on performance of predictive models.
ML Retrospectives Workshop, NeurIPS, 2020. 
Learning transport cost from subset correspondence. [arXiv]
Information about partial correspondences between analogous datasets can be used to learn custom metrics for use by optimal transport methods.
International Conference on Learning Representations (ICLR), 2020 (conference track). 
An adaptive nearest neighbor rule for classification. [arXiv] [code] [demo] [spotlight]
Nearest-neighbor classifiers can be modified to give robust, provable, practically checkable confidence sets by choosing the neighborhood size according to local label noise.
Neural Information Processing Systems (NeurIPS), 2019. 
Semantically decomposing the latent spaces of generative adversarial networks. [arXiv] [code] [demo]
When learning a latent space for generating data, any given axis of variation in the data can be disentangled from the rest in the latent space, using an efficient model-agnostic pairwise training strategy.
International Conference on Learning Representations (ICLR), 2018 (conference track). 
The ENCODE-DREAM Challenge to predict genome-wide binding of regulatory proteins to DNA. [pdf]
An open challenge to design a genome-wide predictor of transcription factor binding.
Machine Learning Challenges as a Research Tool, NIPS, 2017. 
Optimal binary autoencoding with pairwise correlations. [arXiv] [code] [discussion]
Efficient and practical biconvex learning of binary autoencoders is strongly optimal, using pairwise correlations between encoding and decoding layers.
International Conference on Learning Representations (ICLR), 2017 (conference track). 
Sequential nonparametric testing with the law of the iterated logarithm. [arXiv]
When performing nonparametric testing of the difference in mean between two distributions (and many other problems besides), we devise rigorous sequential tests that use as few samples as possible, adapting to the unknown mean difference.
Conference on Uncertainty in Artificial Intelligence (UAI), 2016. 
Optimal binary classifier aggregation for general losses. [arXiv] [spotlight]
The minimax optimal way to combine a set of binary classifiers of varying competences with unlabeled data is an artificial neuron, with a sigmoid-shaped transfer function that only depends on the evaluation loss function.
Neural Information Processing Systems (NIPS), 2016. Short version in Workshop on Learning Faster from Easy Data, NIPS, 2015. 
Instance-dependent regret bounds for dueling bandits. [paper]
Online learning from limited (bandit) pairwise feedback between actions is easy when a few actions are better than the rest and the matrix of pairwise preferences is well-conditioned.
Conference on Learning Theory (COLT), 2016. 
Scalable semi-supervised aggregation of classifiers. [arXiv]
There is an efficient way to use unlabeled data to combine the trees of a random forest, which often performs better than random forests for binary classification.
Neural Information Processing Systems (NIPS), 2015. 
Optimally combining classifiers using unlabeled data. [arXiv]
The minimax optimal way to combine a set of binary classifiers of known competences with unlabeled data resembles a weighted majority vote, and is efficiently learnable.
Conference on Learning Theory (COLT), 2015. 
The fast convergence of incremental PCA. [arXiv]
Natural algorithms for incremental linear-time and -space principal component analysis (PCA) converge quickly to the optimum, despite the problem's nonconvexity.
Neural Information Processing Systems (NIPS), 2013.
Workshop Only

Cross-species transcription factor binding prediction via domain-adaptive neural networks.
Machine Learning in Computational Biology, 2019. 
An empirical comparison of sparse vs. embedding techniques on many-class text classification.
Rare features can be usefully predictive in (text) classification problems with many classes and features.
Workshop on Extreme Classification, NIPS, 2013.
Theses
* Indicates equal authorship.
Other writing
I maintain a blog where I post research-related content that hasn't made it into papers (yet).
Biography
Before the PhD, I was an Associate at Strand Life Sciences, where I did statistical genomics, developing tools for genomics researchers. Previously, I received a B.S. (High Honors) in Electrical Engineering and Computer Science at UC Berkeley. On the way to that degree, I minored in (quantum) physics at Berkeley as well. Before that, I lived in various parts of India, the US, and Singapore.
Miscellaneous
Some suggestions on research which I believe in.
I used to play the violin (and occasionally still do); before college, I got a distinction in it (unfortunately recordings are lost!). I also played the Carnatic classical style, which is less polyphonic but melodically richer than the Western European classical tradition.
I have always enjoyed traveling and do so whenever the opportunity arises. I like running, occasionally structured. In my free time, I sometimes write on history and philosophy tidbits I find interesting.
This site is (still and perennially) under construction.