In Spring 2021, I took Yanjun Han’s excellent course “Information-theoretic Lower Bounds in Data Science (EE 378C)”. For the entirety of the course, we looked at methods of proving statistical lower bounds, but being a pragmatist I was always looking for ways that this could inform how we construct statistical estimators. “Dualizing Le Cam’s method” (Polyanskiy and Wu, 2021) promised exactly this, so I chose it for my reading project.
I’ll try to illustrate the main takeaway from this paper without the hefty proofs that come with it.
$\newcommand\E{\mathbb E}\newcommand\what{\widehat} \newcommand{\Ber}{\mathrm{Ber}}$
We have a statistical model determined by its parameters $\theta\in\Theta$, and it produces observations $X\in \mathcal X$ (i.e. our model is a stochastic kernel $\Theta \to \mathcal X$). We want to estimate a function of the parameters $T(\theta)$, so we come up with an estimator $\widehat T(X)$.
The so-called minimax risk is the smallest risk we can guarantee against an adversary who controls the parameters. Formally, the minimax risk (under squared loss) is $$R^* := \inf_{\what T}\sup_{\theta\in \Theta} \E[(\what T - T(\theta))^2],$$ where the infimum is over all estimators $\what T = \what T(X)$ and the expectation is over $X$ drawn from the model with parameter $\theta$.
If we would like to achieve the lowest minimax risk, what should we use as the estimator $\what T$?
The first result concerns the setting where we have $n$ i.i.d. observations from some distribution. To keep things simple, we will always refer back to the following toy problem.
Toy problem. Given $n$ i.i.d. observations from a $\Ber(p)$ (with $p$ unknown), establish the minimax risk of estimating $p$.
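As a numerical warm-up (not from the paper, just my own sanity check, assuming numpy), here is the worst-case mean-squared error of the naive sample-mean estimator over a grid of values of $p$. Its worst case sits near $p = 1/2$ and scales like $1/(4n)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def worst_case_mse(n, n_trials=10_000):
    """Empirical worst-case MSE of the sample-mean estimator for Ber(p), over a grid of p."""
    worst = 0.0
    for p in np.linspace(0.05, 0.95, 19):
        x = rng.binomial(1, p, size=(n_trials, n))  # n_trials datasets of n Bernoulli(p) draws
        p_hat = x.mean(axis=1)                      # the plug-in estimator: sample mean
        worst = max(worst, np.mean((p_hat - p) ** 2))
    return worst

for n in [10, 100, 1000]:
    print(n, worst_case_mse(n), 1 / (4 * n))        # compare against p(1-p)/n at p = 1/2
```

This only says something about one particular estimator, of course; the question is whether we can do better, and how we would know when to stop.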
There are two pre-conditions to check: the parameter set $\Theta$ has to be convex, and the functional $T$ has to be affine in $\theta$. Both hold for the toy problem ($\Theta = [0,1]$ and $T(p) = p$).
With these fulfilled, the first result tells us the minimax rate up to constant factors. Define the $\chi^2$-modulus of continuity $\delta_{\chi^2}(t)$ as the largest number such that a $\chi^2$-divergence of at most $t^2$ between two distributions in the model corresponds to a difference of at most $\delta_{\chi^2}(t)$ in the functional $T$.
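In symbols (this is how I'd write the definition, with $P_\theta$ denoting the distribution of a single observation under $\theta$):
$$\delta_{\chi^2}(t) := \sup\big\{\,|T(\theta_1) - T(\theta_2)| \;:\; \theta_1,\theta_2\in\Theta,\ \chi^2(P_{\theta_1}\,\|\,P_{\theta_2}) \le t^2\,\big\}.$$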
Ok, so we compute this for Bernoulli distributions.
$$\chi^2(\Ber(p)|| \Ber(q)) = \frac{(p-q)^2}{q(1-q)}$$
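Spelling out the two terms of the sum (my own arithmetic), in case the formula looks unfamiliar:
$$\chi^2(\Ber(p)\,\|\,\Ber(q)) = \frac{(p-q)^2}{q} + \frac{\big((1-p)-(1-q)\big)^2}{1-q} = (p-q)^2\left(\frac1q + \frac{1}{1-q}\right) = \frac{(p-q)^2}{q(1-q)}.$$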
Thus, if $\chi^2 \le t^2$, then $(p-q)^2 \le t^2\, q(1-q) \le t^2/4$, i.e. $|p-q| \le t/2$, so $\delta_{\chi^2}(t) \le t/2$, and it’s not hard to check that equality is achieved at $q=1/2$, $p = (1-t)/2$.
It turns out that the minimax rate (the square root of the minimax risk $R^*$) is $\delta_{\chi^2}(n^{-1/2})$ up to constant factors, so for the toy problem it’s $\Theta(n^{-1/2})$.
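For the toy problem this closes the loop (my own arithmetic):
$$\delta_{\chi^2}(n^{-1/2}) = \frac{n^{-1/2}}{2} = \frac{1}{2\sqrt n}, \qquad R^* \asymp \delta_{\chi^2}(n^{-1/2})^2 = \frac{1}{4n},$$
which matches the worst-case risk of the sample mean from the simulation above.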
What was the estimator we needed to use? It turns out to be some kind of average $$\what T = \frac 1 n \sum_{i=1}^n g(X_i)$$ and if you trace through the proof, we want to pick a $g$ that, for pairs of distributions $\pi,\pi'$ with $\chi^2$-divergence at most $1/n$, attains the supremum (in the limit) in the variational representation of $\chi^2$:
$$\chi^2(\pi|| \pi') = \sup_g \frac{(\E_{\pi}[g] - \E_{\pi'}[g])^2}{\mathrm{Var}_{\pi'}[g]}$$
but basically $g(x) = x$ works: the ratio above is invariant under affine transformations of $g$, and any non-constant function on $\{0,1\}$ is an affine transform of the identity.
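To see this concretely (my own check): with $g(x) = x$, $\pi = \Ber(p)$ and $\pi' = \Ber(q)$,
$$\frac{(\E_\pi[g]-\E_{\pi'}[g])^2}{\mathrm{Var}_{\pi'}[g]} = \frac{(p-q)^2}{q(1-q)} = \chi^2(\Ber(p)\,\|\,\Ber(q)),$$
so the identity attains the supremum for every pair $(p,q)$ at once. It also explains, at least to me, why a ratio of this shape is the natural thing to control: for i.i.d. $X_i \sim \pi$ the averaging estimator satisfies
$$\E\big[(\what T - T)^2\big] = \big(\E_\pi[g] - T\big)^2 + \frac{\mathrm{Var}_\pi[g]}{n},$$
so a good $g$ trades off shifts in the mean against variance.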
The beautiful thing is that this rate comes with a matching lower bound automatically!
There was something very strange and subtle happening with our choice of $g$. We want to distinguish all pairs of distributions $\pi,\pi'$ at most $1/n$ apart (in $\chi^2$-divergence) - we were lucky that our choice of $g$ worked for any pair of distributions, but in general what do we do?
The crux of the proof is as follows: we have upper bounded the average risk with some functional $F$ (of the distribution $\pi$ and the estimator $g$), so we would like to upper-bound $$\inf_g \sup_\pi F(g, \pi).$$ (Sanity check: the choice of $g$ is clear - we pick the $g$ which minimizes $\sup_\pi F(g,\pi)$.)
We would like to swap the inf and the sup, since it’s much easier to minimize over $g$ than $\pi$. However, the minimax theorem only lets us do this if $F$ is convex in $g$ and concave in $\pi$ - and in our case $F$ is convex in both arguments (specifically, $F$ is some kind of bias-variance decomposition).
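To see why convex-convex is a problem, here is a toy example (mine, not from the paper): take $F(g,\pi) = (g-\pi)^2$ with $g,\pi\in[0,1]$, which is convex in both arguments. Then
$$\inf_g \sup_\pi \,(g-\pi)^2 = \tfrac14 \qquad\text{but}\qquad \sup_\pi \inf_g \,(g-\pi)^2 = 0,$$
so swapping the inf and the sup would change the answer.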
Aside: I never managed to remember which argument has to be convex and which concave - here’s the trick. We can swap the inf and the sup only if the optimal pair $(g,\pi)$ is a saddle point, so $g$ sits at the bottom of a convex cross-section (we minimize over it) while $\pi$ sits at the top of a concave cross-section (we maximize over it).
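A minimal example of a genuine saddle (again mine, not from the paper): $F(g,\pi) = g^2 - \pi^2$. Here
$$\inf_g \sup_\pi \,(g^2 - \pi^2) = 0 = \sup_\pi \inf_g \,(g^2 - \pi^2),$$
with the saddle point at $(g,\pi) = (0,0)$: convex in $g$ along $\pi = 0$, concave in $\pi$ along $g = 0$.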
Also, the content of this theorem is quite nontrivial. If you knew it just for bi-affine (i.e. affine in each argument) functionals $F$, for instance, then you would already know that a Nash equilibrium exists for zero-sum games, by setting $F, -F$ as the respective utility functions…
The ingenious solution given by the authors is as follows: the bias term in $F$ (which we call $B(\pi, g)$) is affine in $g$ but only convex in $\pi$. However, there is an expanded space of variables $\pi^+$ on which we can find an upper bound $B^+(\pi^+, g)$ which is affine in $\pi^+$ and satisfies $$B(\pi, g) \le B^+(\pi^+ ,g)$$
so if in $F$ we replace $B$ with $B^+$ (which we call $F^+$), we get something that is concave in $\pi^+$ and convex in $g$, so we can properly swap the inf and the sup.
To summarize:
$$\inf_g \sup_\pi F(\pi ,g) \le \inf_g \sup_{\pi^+} F^+(\pi^+, g) = \sup_{\pi^+} \inf_g F^+(\pi^+, g).$$
So what should we take for $g$? After the swap, it suffices to find the best $g$ against the “worst” (least favorable) $\pi^+$ - and this $g$ automatically works against every $\pi^+$.
Let me elaborate on the affine relaxation. If $a(x)$ is an affine function, then $|a(x)|$ is convex. But we can also write it as $$|a(x)| = \max_{\xi\in [0,2]} \big(a(x) - \xi\, a(x)\big) \le \max_{\xi\in [0,2],\, x'} \big(a(x) - \xi\, a(x')\big)$$ and it turns out the right-hand side is affine in $(x, \xi x', \xi)$.
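To unpack the trick (my own check): write $a(x) = c + \langle w, x\rangle$. First,
$$\max_{\xi\in[0,2]} \big(a(x) - \xi\, a(x)\big) = \max_{\xi\in[0,2]} (1-\xi)\,a(x) = |a(x)|,$$
taking $\xi = 0$ when $a(x) \ge 0$ and $\xi = 2$ when $a(x) \le 0$. Second, the relaxed objective expands as
$$a(x) - \xi\, a(x') = c + \langle w, x\rangle - \xi c - \langle w, \xi x'\rangle,$$
which is affine in the triple $(x,\ \xi x',\ \xi)$, as claimed.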
(Exercise: if $a(x)$ is affine, so is $y\cdot a(x/y)$ over the joint space $X\times Y$. Why?)