David Kewei Lin

Why does the variance shrink when you truncate a normal distribution?

$\providecommand{\E}{\mathbb E} \providecommand{\Var}{\mathrm{Var}}\providecommand{\I}{\mathbb I} \providecommand{\Cov}{\mathrm{Cov}}$ I encountered this fact while solving a problem and I was a little shocked that (1) I had never seen this fact before and (2) I had trouble proving it:

Let $Z \sim \mathcal N(0,1)$. Then $\mathrm{Var}(Z | Z < c) \le \mathrm{Var}(Z)$.

(If you think this fact is obvious, check whether your intuition also claims it for arbitrary distributions, because it is false in general: take a distribution supported on three values $a,b,b+\epsilon$ with $p(a)=p(b) = \delta$ small, and notice that conditioning on being less than $b+\epsilon$ actually increases the variance.)
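
If you want to convince yourself numerically before reading on, here is a quick sanity check (a sketch, not a proof; the particular values of $c$, $a$, $b$, $\epsilon$, $\delta$ below are arbitrary choices of mine), using `scipy.stats.truncnorm` for the truncated normal and the three-point distribution from the parenthetical above:

```python
# A quick numerical sanity check (not a proof). The constants are arbitrary.
import numpy as np
from scipy import stats

# Truncated standard normal: Var(Z | Z < c) should be <= 1 for every c.
for c in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    var_trunc = stats.truncnorm(a=-np.inf, b=c).var()
    print(f"c = {c:+.1f}:  Var(Z | Z < c) = {var_trunc:.4f}  (Var(Z) = 1)")

# Three-point counterexample: support {a, b, b+eps} with p(a) = p(b) = delta small.
a, b, eps, delta = 0.0, 1.0, 0.01, 0.001
xs = np.array([a, b, b + eps])
ps = np.array([delta, delta, 1 - 2 * delta])

def variance(xs, ps):
    ps = ps / ps.sum()                      # renormalize (this is the conditioning)
    m = np.dot(ps, xs)
    return np.dot(ps, (xs - m) ** 2)

print("unconditional variance:   ", variance(xs, ps))         # tiny, order delta
print("variance given X < b+eps: ", variance(xs[:2], ps[:2]))  # (b-a)^2/4 = 0.25
```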

The solution is not too hard if you have the patience to integrate (and ChatGPT/Claude will happily do it) - but I do not have such patience. Being somewhat unsatisfied, I googled around to find a better explanation, and it turns out that this fact generalizes to any $Z$ with a log-concave distribution.

Background: unnormalized distributions

I used to teach Stanford’s core intro to probability class (for the CS department). One of the deeper ideas I tried to convey in the class was to ignore the normalizing constants for distributions. Like, the formula for the normal distribution $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$ seems scary, but is maybe half as scary if you write it as $$p(x) \propto \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

And there’s a perfectly good reason for the $\exp$ as well (and in general, for lots of distributions to look like $p(x) \propto e^{f(x)}$). Bayesian updating gives one example: suppose you had some (prior) distribution over $x$, you observed $y$, and you knew $p(y|x)$. How would you update your prior (i.e. compute $p(x|y)$)? The answer is that $$p(x|y) \propto p(x,y) = p(x) \cdot p(y|x)$$ or in words, we multiply the probability update onto our existing prior. (When $x$ and $y$ are boolean, the slick way to do this is to work with odds, which simply multiply: if a rare event has 1:100 odds and you observe 10:1 evidence for it, the event now has 1:10 odds.)
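
In code, the odds form of the update is literally one multiplication; here is a minimal sketch using the numbers from the parenthetical above:

```python
# Bayesian updating with odds: posterior odds = prior odds * likelihood ratio.
# Numbers are the 1:100 prior and the 10:1 evidence from the example above.
prior_odds = 1 / 100          # P(event) : P(no event)
likelihood_ratio = 10 / 1     # P(evidence | event) : P(evidence | no event)

posterior_odds = prior_odds * likelihood_ratio   # p(x|y) ∝ p(x) * p(y|x)
print(posterior_odds)                            # 0.1, i.e. 1:10 odds

# If you want a probability instead of odds:
print(posterior_odds / (1 + posterior_odds))     # ≈ 0.0909
```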

Some very basic stat-mech

Ok, so if that was convincing to you we can start to consider some probability distribution $p(x) \propto e^{f(x)}$. We can do all the hocus-pocus we want, but to get back a proper probability distribution we have to figure out the normalizing constant: $$Z = \sum_x e^{f(x)}$$ (I’m using $\sum$ because I come from the CS/discrete math tradition, but if you come from physics you might prefer $\int dx$.)

A thing physicists like to do is add a linear term $hx$ to the exponent. So really, we have a so-called partition function (of $h$): $$Z(h) = \sum_x e^{f(x) + hx}.$$

This is useful for a bunch of reasons, but a really cool trick is that we can recover the moments (more precisely, the cumulants) of $X$ just from $Z$:

$$ \begin{align*} (\log Z)' |_{h=0} &= \E[X] \\ (\log Z)''|_{h=0} &= \Var[X] \end{align*} $$ and so on.
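
Here is a quick finite-difference check of these two identities on a small discrete distribution (the support and the choice of $f$ below are arbitrary):

```python
# Check that (log Z)'(0) = E[X] and (log Z)''(0) = Var[X] for p(x) ∝ e^{f(x)}.
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0])
f = -xs**2 + xs                              # any f works; p(x) ∝ e^{f(x)}

def logZ(h):
    return np.log(np.sum(np.exp(f + h * xs)))

ps = np.exp(f) / np.sum(np.exp(f))           # the normalized distribution
mean = np.dot(ps, xs)
var = np.dot(ps, xs**2) - mean**2

eps = 1e-4                                   # central finite differences at h = 0
d1 = (logZ(eps) - logZ(-eps)) / (2 * eps)
d2 = (logZ(eps) - 2 * logZ(0.0) + logZ(-eps)) / eps**2
print(d1, "vs", mean)                        # these should agree
print(d2, "vs", var)                         # and so should these
```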

So okay, let’s try to see if this gets us anywhere. Let $X$ be a standard normal and $X^* = X | X < c$. The respective distributions are $$ \begin{align*} p_X(x) &\propto \exp\left(-\frac{x^2}{2}\right) \\ p_{X^*}(x) &\propto \exp\left(-\frac{x^2}{2}\right) \I_{x < c} \end{align*} $$

Therefore, $$ \begin{align*} \Var(X^*) - \Var(X) &= (\log Z^* - \log Z)''|_{h=0} \\ &= \left(\log \frac{\int \exp(-\tfrac {x^2} 2 + hx) \I_{x < c} dx}{\int \exp(-\tfrac {x^2} 2 + hx) dx}\right)''\bigg|_{h=0}. \end{align*} $$ Now, a density proportional to $\exp(-x^2/2 + hx)$ is a normal distribution, albeit now with mean $h$. To shift the mean back to zero, substitute $x \mapsto x+h$ in both integrals: the $e^{h^2/2}$ factors cancel and the indicator becomes $\I_{x < c-h}$, so $$ \left(\log \frac{\int \exp(-\tfrac {x^2} 2 + hx) \I_{x < c} dx}{\int \exp(-\tfrac {x^2} 2 + hx) dx}\right)''\bigg|_{h=0} = (\log \Phi(c-h))''|_{h=0}. $$ So the question becomes simply: is $\Phi$ log-concave? The answer is yes, and it’s not too hard to check by hand, but in the spirit of laziness I will instead appeal to some well-known facts about log-concavity.
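
(Before moving on, here is a numerical check of the identity we just derived, namely $\Var(X^*) - \Var(X) = (\log \Phi)''(c)$; the particular values of $c$ are arbitrary choices of mine.)

```python
# Check Var(X | X < c) - Var(X) = (log Φ(c-h))''|_{h=0} = (log Φ)''(c).
import numpy as np
from scipy import stats

def d2_log_Phi(c, eps=1e-4):
    # central second difference of log Φ at c
    lp = stats.norm.logcdf
    return (lp(c + eps) - 2 * lp(c) + lp(c - eps)) / eps**2

for c in [-1.0, 0.0, 1.5]:
    lhs = stats.truncnorm(a=-np.inf, b=c).var() - 1.0    # Var(X*) - Var(X)
    rhs = d2_log_Phi(c)
    print(f"c = {c:+.1f}:  {lhs:.5f}  vs  {rhs:.5f}")
```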

Log-concavity

A function $f: \mathbb R^n \to \mathbb R_{\ge 0}$ is log-concave if … drumroll please … $\log f$ is concave. (I wish names in math were always this straightforward!) Similarly, a random variable $X$ is log-concave if its pdf is log-concave.

As an example, $p_{X^*}$ above is log-concave: its log is (up to a constant) the concave function $-x^2/2$ on the convex set $\{x < c\}$ and $-\infty$ outside of it, and restricting a concave function to a convex set (declaring it $-\infty$ elsewhere) keeps it concave.

How does this help us? Well, it turns out that integrals of log-concave functions are still log-concave. This is a special case of a more general fact:

Any marginal of a log-concave distribution is log-concave. (That is: if we integrate a log-concave function over one of its coordinates, the result, as a function of the remaining coordinates, is still log-concave.)

Applying this to the function $e^{-z^2/2} \I_{x>z}$ (which is log-concave in $(x,z)$, since the half-plane $\{x > z\}$ is convex) and integrating out $z$ gives us that $\Phi$ is log-concave.
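
Here is the same claim checked numerically (a sketch, not a proof): the marginal is $\sqrt{2\pi}\,\Phi(x)$, so its log should have non-positive second differences on any grid.

```python
# The marginal ∫ e^{-z²/2} 1{z < x} dz equals √(2π) Φ(x); check that its log is
# concave by looking at second differences on a grid (they should all be ≤ 0).
import numpy as np
from scipy import stats

xs = np.linspace(-6, 6, 241)
log_marginal = stats.norm.logcdf(xs)        # log Φ, up to an additive constant
second_diff = np.diff(log_marginal, n=2)
print(second_diff.max())                    # negative: log Φ is concave here
```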

If you’re interested in the proof, I wrote it up here. The crux is the following fact:

(Prekopa-Leindler) Suppose $f,g,h$ are non-negative functions on $\mathbb R^n$ and $\lambda\in (0,1)$. If $h(\lambda x + (1-\lambda) y) \ge f(x)^{\lambda} g(y)^{1-\lambda}$ for all $x,y \in \mathbb R^n$, then $$\int h(x) dx \ge \left(\int f(x) dx\right)^\lambda \left(\int g(x) dx\right)^{1-\lambda}.$$
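
To spell out how this implies the marginal fact above: suppose $F(x,z)$ is log-concave and let $M(x) = \int F(x,z)\, dz$. Fix $x_0, x_1$ and $\lambda \in (0,1)$, and apply Prekopa-Leindler in the $z$ variable to $f(z) = F(x_0, z)$, $g(z) = F(x_1, z)$, and $h(z) = F(\lambda x_0 + (1-\lambda) x_1, z)$. Log-concavity of $F$ says exactly that $h(\lambda z_0 + (1-\lambda) z_1) \ge f(z_0)^\lambda g(z_1)^{1-\lambda}$ for all $z_0, z_1$, so the conclusion reads $$M(\lambda x_0 + (1-\lambda) x_1) \ge M(x_0)^\lambda M(x_1)^{1-\lambda},$$ i.e. $M$ is log-concave.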

Here is a smattering of other facts related to log-concavity:

(Borell) If a measure has a density and satisfies the (multiplicative) Brunn-Minkowski inequality, i.e. $$\mu(\lambda A + (1-\lambda) B) \ge \mu(A)^\lambda \mu(B)^{1-\lambda}$$ for all compact sets $A,B$ and $\lambda\in (0,1)$, then it is log-concave (and vice versa).

(Brascamp-Lieb; generalized Poincare inequality) If $X$ has density $\propto e^{f(x)}$ with $f$ smooth and strictly concave, then for any smooth $g$, $\Var(g(X)) \le \E[\nabla g(X)^\top (-\nabla^2 f(X))^{-1} \nabla g(X)]$.
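
As a concrete illustration of that last bound: for a standard Gaussian, $-\nabla^2 f = 1$, so it reads $\Var(g(X)) \le \E[g'(X)^2]$ (the usual Gaussian Poincare inequality). A quick Monte Carlo check, with a couple of test functions of my own choosing:

```python
# Monte Carlo check of Var(g(X)) ≤ E[g'(X)²] for X ~ N(0,1) (where -f'' = 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

tests = [
    ("g(x) = x^2  ", x**2, 2 * x),           # pairs (g(X), g'(X))
    ("g(x) = sin x", np.sin(x), np.cos(x)),
]
for name, gx, dgx in tests:
    print(f"{name}:  Var(g) = {gx.var():.4f}  <=  E[g'^2] = {(dgx ** 2).mean():.4f}")
```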

Hoeffding’s inequality

Here’s another spot where thinking about the log-partition function helps a lot. One way to bound the deviation of a sum of arbitrary (but bounded) independent random variables from its mean is Hoeffding’s inequality:

(Hoeffding) Let $X_1,\ldots,X_n$ be iid random variables with $X_i \in [a,b]$. Then, for all $\epsilon > 0$, their sum $S_n$ satisfies $$P\left[\left|S_n - \E\left[S_n\right]\right| \ge \epsilon\right] \le 2 \exp\left(-\frac{2\epsilon^2}{n(b-a)^2}\right).$$
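
(As an empirical sanity check of the statement, not needed for anything below, here is a small simulation with iid Uniform$[0,1]$ variables; the values of $n$ and $\epsilon$ are arbitrary choices of mine.)

```python
# Empirical tail probability of |S_n - E[S_n]| ≥ ε vs the Hoeffding bound,
# for iid Uniform[0,1] (so a = 0, b = 1, and (b-a)² = 1).
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 10.0, 200_000

S = rng.random((trials, n)).sum(axis=1)          # one S_n per trial
empirical = np.mean(np.abs(S - n / 2) >= eps)    # E[S_n] = n/2
bound = 2 * np.exp(-2 * eps**2 / n)
print(empirical, "<=", bound)
```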

The proof is pretty standard: apply Markov’s inequality to the moment generating function (the Chernoff trick); since the variables are independent the MGF of the sum factors, so we just have to bound a single term: $$ \E[e^{s(X-\E X)}] \le \exp\left(\frac{s^2(b-a)^2}{8}\right). $$

I remember Jensen showed this to me and we both found it surprisingly hard. But here’s the solution: after shifting $X$ to have mean 0, we just need to prove that $$Z(s) := \log \E[e^{sX}] \le \frac{s^2(b-a)^2}{8}$$

(Note that this variable $s$ is not like $h$ from before! If you were a physicist you’d call this $\beta$.) Now, $Z(0) = 0$ and $Z'(0) = \E[X] = 0$, so $$Z(s) = \int_0^s Z'(t) dt = \int_0^s \int_0^t Z''(u) \, du \, dt \le \frac{s^2}{2} \cdot \frac{(b-a)^2}{4},$$ because $Z''(u)$ can be interpreted as the variance of a distribution supported on $[a,b]$ (namely, the $e^{ux}$-tilted version of $X$’s distribution), and the maximal possible variance of such a distribution is $\frac{(b-a)^2}{4}$ (by a smoothing argument).
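
Here is a numerical sketch of both ingredients for one mean-zero distribution on $[a,b] = [-1,1]$ (the distribution itself is an arbitrary choice of mine):

```python
# Check Z(s) = log E[e^{sX}] ≤ s²(b-a)²/8, and that Z''(u), i.e. the variance of
# the e^{ux}-tilted distribution, stays below (b-a)²/4. Here [a, b] = [-1, 1].
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])
ps = np.array([0.25, 0.5, 0.25])                 # mean zero, supported on [-1, 1]

def Z(s):
    return np.log(np.dot(ps, np.exp(s * xs)))

def tilted_var(u):
    w = ps * np.exp(u * xs)
    w = w / w.sum()                              # the tilted distribution
    return np.dot(w, xs**2) - np.dot(w, xs)**2   # = Z''(u)

ss = np.linspace(-3, 3, 61)
print(all(Z(s) <= s**2 * (2**2) / 8 for s in ss))      # Z(s) ≤ s²(b-a)²/8
print(all(tilted_var(u) <= (2**2) / 4 for u in ss))    # Z''(u) ≤ (b-a)²/4
```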

A challenge

What does a multivariate version of the first problem look like? When I fed the initial problem to ChatGPT, it hallucinated (?) this:

Let $X$ be a $\mathbb R^n$-valued log-concave r.v. and $K$ be a convex set. Show that $\Cov(X|X\in K) \preceq \Cov(X)$.

If you like a challenge, try to prove this! (Be warned - I’m not sure if this is the correct statement.)
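
(Not a proof, and the statement itself may be off, but here is a Monte Carlo sanity check for one log-concave $X$, a correlated Gaussian, and one convex $K$, a half-space; both are my own choices. We check that $\Cov(X) - \Cov(X \mid X \in K)$ is positive semidefinite by looking at its eigenvalues.)

```python
# Monte Carlo check that Cov(X) - Cov(X | X ∈ K) ⪰ 0 for a Gaussian X and a
# half-space K. Eigenvalues should be ≥ 0 up to Monte Carlo noise.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=2_000_000)

in_K = X @ np.array([1.0, -0.5]) < 0.3           # K = {x : <(1, -0.5), x> < 0.3}
diff = np.cov(X.T) - np.cov(X[in_K].T)
print(np.linalg.eigvalsh(diff))                  # both ≥ 0 (up to sampling noise)
```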

Conclusion

Log-concavity is certainly a concept I want to understand better, and it’s great that I’ve managed to “encounter it in the wild” for once. Hope you enjoyed this post!