Sublinear-Sample Property Estimation
Given samples from a distribution over an alphabet of size $n$, which properties of the distribution can be estimated from only $o(n)$ samples? This question is inspired by the early work of Good and Turing on frequency estimation, and by that of Fisher on estimating the number of new species one can expect to discover. I will discuss some recent developments in this area (joint work with Paul Valiant), starting with establishing that a broad class of properties, which includes entropy, support size, and various distance metrics between pairs of distributions, can be estimated using $O(n/\log n)$ samples, and that, up to constant factors, this is optimal. This improves upon a long line of work from both the statistics and computer science communities, in which the best estimators for each of these properties required a linear number of samples. I will also describe some exciting open questions in these directions.
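To make the gap concrete, here is a minimal sketch of the naive "plug-in" (empirical) entropy estimator, the kind of linear-sample baseline the abstract's result improves upon. This is an illustrative example, not the estimator from the talk; the function name `plugin_entropy` is hypothetical.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Naive plug-in entropy estimate (in nats): compute the empirical
    distribution of the samples and return its Shannon entropy.
    Note this is illustrative, not the sublinear-sample estimator
    discussed in the talk."""
    counts = Counter(samples)
    m = len(samples)
    # Entropy of the empirical distribution; each symbol observed c times
    # contributes -(c/m) * log(c/m).
    return -sum((c / m) * math.log(c / m) for c in counts.values())
```

The plug-in estimate can never exceed $\log m$, the log of the number of samples, since at most $m$ distinct symbols appear. So for the uniform distribution over $n$ symbols, whose true entropy is $\log n$, any $m = o(n)$ samples leave the plug-in estimator badly biased downward, which is one intuition for why more sophisticated estimators are needed in the sublinear regime.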
Greg joined Stanford's CS Department in Fall 2013, after receiving his PhD from Berkeley and spending a year at Microsoft Research.