In Spring 2021, I took Yanjun Han’s excellent course “Information-theoretic Lower Bounds in Data Science (EE 378C)”. For the entirety of the course, we looked at methods of proving statistical lower bounds, but being a pragmatist I was always looking for ways that this could inform how we construct statistical estimators. “Dualizing Le Cam’s method” (Polyanskiy and Wu, 2021) promised exactly this, so I chose it for my reading project.
I’ll try to illustrate the main takeaway from this paper without the hefty proofs that come with it.
$\newcommand\E{\mathbb E}\newcommand\what{\widehat} \newcommand{\Ber}{\mathrm{Ber}}$
We have a statistical model determined by its parameters $\theta\in\Theta$, and it produces observations $X\in \mathcal X$ (i.e. our model is a stochastic kernel $\Theta \to \mathcal X$). We want to estimate a function of the parameters $T(\theta)$, so we come up with an estimator $\widehat T(X)$.
The so-called minimax risk is the smallest risk we can guarantee against an adversary who controls the parameters. Formally, the minimax risk (under squared loss) is $$R^* := \inf_{\what T}\sup_{\theta\in \Theta} \E[(\what T - T(\theta))^2],$$ where the infimum is over all estimators $\what T = \what T(X)$ and the expectation is over $X$ drawn from the model with parameter $\theta$.
If we would like to achieve the lowest minimax risk, what should we use as the estimator $\what T$?
The first result concerns the setting where we have $n$ i.i.d. observations from some distribution. To keep things simple, we will always refer back to the following toy problem.
Toy problem. Given $n$ i.i.d. observations from a $\Ber(p)$ (with $p$ unknown), establish the minimax risk of estimating $p$.
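As a numerical warm-up (not from the paper, just my own sanity check, assuming numpy), here is the worst-case mean-squared error of the naive sample-mean estimator over a grid of values of $p$. Its worst case sits near $p = 1/2$ and scales like $1/(4n)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def worst_case_mse(n, n_trials=10_000):
    """Empirical worst-case MSE of the sample-mean estimator for Ber(p), over a grid of p."""
    worst = 0.0
    for p in np.linspace(0.05, 0.95, 19):
        x = rng.binomial(1, p, size=(n_trials, n))  # n_trials datasets of n Bernoulli(p) draws
        p_hat = x.mean(axis=1)                      # the plug-in estimator: sample mean
        worst = max(worst, np.mean((p_hat - p) ** 2))
    return worst

for n in [10, 100, 1000]:
    print(n, worst_case_mse(n), 1 / (4 * n))        # compare against p(1-p)/n at p = 1/2
```

This only says something about one particular estimator, of course; the question is whether we can do better, and how we would know when to stop.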
There are two pre-conditions to check: the parameter set $\Theta$ has to be convex, and the functional $T$ has to be affine in $\theta$. Both hold for the toy problem ($\Theta = [0,1]$ and $T(p) = p$).
With these fulfilled, the first result tells us the minimax rate up to constant factors. Define the $\chi^2$-modulus of continuity $\delta_{\chi^2}(t)$ as the largest number such that a $\chi^2$-divergence of at most $t^2$ between two distributions in the model corresponds to a difference of at most $\delta_{\chi^2}(t)$ in the functional $T$.
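In symbols (this is how I'd write the definition, with $P_\theta$ denoting the distribution of a single observation under $\theta$):
$$\delta_{\chi^2}(t) := \sup\big\{\,|T(\theta_1) - T(\theta_2)| \;:\; \theta_1,\theta_2\in\Theta,\ \chi^2(P_{\theta_1}\,\|\,P_{\theta_2}) \le t^2\,\big\}.$$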
Ok, so we compute this for Bernoulli distributions.
$$\chi^2(\Ber(p)|| \Ber(q)) = \frac{(p-q)^2}{q(1-q)}$$
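Spelling out the two terms of the sum (my own arithmetic), in case the formula looks unfamiliar:
$$\chi^2(\Ber(p)\,\|\,\Ber(q)) = \frac{(p-q)^2}{q} + \frac{\big((1-p)-(1-q)\big)^2}{1-q} = (p-q)^2\left(\frac1q + \frac{1}{1-q}\right) = \frac{(p-q)^2}{q(1-q)}.$$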
Thus, if $\chi^2 \le t^2$, then $(p-q)^2 \le t^2\, q(1-q) \le t^2/4$, i.e. $|p-q| \le t/2$, so $\delta_{\chi^2}(t) \le t/2$, and it’s not hard to check that equality is achieved at $q=1/2$, $p = (1-t)/2$.
It turns out that the minimax rate (the square root of the minimax risk $R^*$) is $\delta_{\chi^2}(n^{-1/2})$ up to constant factors, so for the toy problem it’s $\Theta(n^{-1/2})$.
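For the toy problem this closes the loop (my own arithmetic):
$$\delta_{\chi^2}(n^{-1/2}) = \frac{n^{-1/2}}{2} = \frac{1}{2\sqrt n}, \qquad R^* \asymp \delta_{\chi^2}(n^{-1/2})^2 = \frac{1}{4n},$$
which matches the worst-case risk of the sample mean from the simulation above.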
What was the estimator we needed to use? It turns out to be some kind of average $$\what T = \frac 1 n \sum_{i=1}^n g(X_i)$$ and if you trace through the proof, we want to pick a $g$ that, for pairs of distributions $\pi,\pi'$ with $\chi^2$-divergence at most $1/n$, attains the supremum (in the limit) in the variational representation of $\chi^2$:
$$\chi^2(\pi|| \pi') = \sup_g \frac{(\E_{\pi}[g] - \E_{\pi'}[g])^2}{\mathrm{Var}_{\pi'}[g]}$$
but basically $g(x) = x$ works: the ratio above is invariant under affine transformations of $g$, and any non-constant function on $\{0,1\}$ is an affine transform of the identity.
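To see this concretely (my own check): with $g(x) = x$, $\pi = \Ber(p)$ and $\pi' = \Ber(q)$,
$$\frac{(\E_\pi[g]-\E_{\pi'}[g])^2}{\mathrm{Var}_{\pi'}[g]} = \frac{(p-q)^2}{q(1-q)} = \chi^2(\Ber(p)\,\|\,\Ber(q)),$$
so the identity attains the supremum for every pair $(p,q)$ at once. It also explains, at least to me, why a ratio of this shape is the natural thing to control: for i.i.d. $X_i \sim \pi$ the averaging estimator satisfies
$$\E\big[(\what T - T)^2\big] = \big(\E_\pi[g] - T\big)^2 + \frac{\mathrm{Var}_\pi[g]}{n},$$
so a good $g$ trades off shifts in the mean against variance.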
The beautiful thing is that this rate comes with a matching lower bound automatically!
There was something very strange and subtle happening with our choice of $g$. We want to distinguish all pairs of distributions $\pi,\pi'$ at most $1/n$ apart (in $\chi^2$-divergence) - we were lucky that our choice of $g$ worked for any pair of distributions, but in general what do we do?
The crux of the proof is as follows: we have upper bounded the average risk with some functional $F$ (of the distribution $\pi$ and the estimator $g$), so we would like to upper-bound $$\inf_g \sup_\pi F(g, \pi).$$ (Sanity check: the choice of $g$ is clear - we pick the $g$ which minimizes $\sup_\pi F(g,\pi)$.)
We would like to swap the inf and the sup, since it’s much easier to minimize over $g$ than $\pi$. However, the minimax theorem only lets us do this if $F$ is convex in $g$ and concave in $\pi$ - and in our case $F$ is convex in both arguments (specifically, $F$ is some kind of bias-variance decomposition).
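To see why convex-convex is a problem, here is a toy example (mine, not from the paper): take $F(g,\pi) = (g-\pi)^2$ with $g,\pi\in[0,1]$, which is convex in both arguments. Then
$$\inf_g \sup_\pi \,(g-\pi)^2 = \tfrac14 \qquad\text{but}\qquad \sup_\pi \inf_g \,(g-\pi)^2 = 0,$$
so swapping the inf and the sup would change the answer.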
Aside: I never managed to remember which argument has to be convex and which concave - here’s the trick. We can swap the inf and the sup only if the optimal pair $(g,\pi)$ is a saddle point, so $g$ sits at the bottom of a convex cross-section (we minimize over it) while $\pi$ sits at the top of a concave cross-section (we maximize over it).
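A minimal example of a genuine saddle (again mine, not from the paper): $F(g,\pi) = g^2 - \pi^2$. Here
$$\inf_g \sup_\pi \,(g^2 - \pi^2) = 0 = \sup_\pi \inf_g \,(g^2 - \pi^2),$$
with the saddle point at $(g,\pi) = (0,0)$: convex in $g$ along $\pi = 0$, concave in $\pi$ along $g = 0$.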
Also, the content of this theorem is quite nontrivial. If you knew it just for bi-affine (i.e. affine in each argument) functionals $F$, for instance, then you would already know that a Nash equilibrium exists for zero-sum games, by setting $F, -F$ as the respective utility functions…
The ingenious solution given by the authors is as follows: the bias term in $F$ (which we call $B(\pi, g)$) is affine in $g$ but only convex in $\pi$. However, there is an expanded space of variables $\pi^+$ on which we can find an upper bound $B^+(\pi^+, g)$ which is affine in $\pi^+$ and satisfies $$B(\pi, g) \le B^+(\pi^+ ,g)$$
so if in $F$ we replace $B$ with $B^+$ (which we call $F^+$), we get something that is concave in $\pi^+$ and convex in $g$, so we can properly swap the inf and the sup.
To summarize:
$$\inf_g \sup_\pi F(\pi ,g) \le \inf_g \sup_{\pi^+} F^+(\pi^+, g) = \sup_{\pi^+} \inf_g F^+(\pi^+, g).$$
So what should we take for $g$? After the swap, it suffices to find the best $g$ against the “worst” (least favorable) $\pi^+$ - and this $g$ automatically works against every $\pi^+$.
Let me elaborate on the affine relaxation. If $a(x)$ is an affine function, then $|a(x)|$ is convex. But we can also write it as $$|a(x)| = \max_{\xi\in [0,2]} \big(a(x) - \xi\, a(x)\big) \le \max_{\xi\in [0,2],\, x'} \big(a(x) - \xi\, a(x')\big)$$ and it turns out the right-hand side is affine in $(x, \xi x', \xi)$.
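To unpack the trick (my own check): write $a(x) = c + \langle w, x\rangle$. First,
$$\max_{\xi\in[0,2]} \big(a(x) - \xi\, a(x)\big) = \max_{\xi\in[0,2]} (1-\xi)\,a(x) = |a(x)|,$$
taking $\xi = 0$ when $a(x) \ge 0$ and $\xi = 2$ when $a(x) \le 0$. Second, the relaxed objective expands as
$$a(x) - \xi\, a(x') = c + \langle w, x\rangle - \xi c - \langle w, \xi x'\rangle,$$
which is affine in the triple $(x,\ \xi x',\ \xi)$, as claimed.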
(Exercise: if $a(x)$ is affine, so is $y\cdot a(x/y)$ over the joint space $X\times Y$. Why?)