David Kewei Lin

Why does the variance shrink when you truncate a normal distribution?

$\providecommand{\E}{\mathbb E} \providecommand{\Var}{\mathrm{Var}}\providecommand{\I}{\mathbb I} \providecommand{\Cov}{\mathrm{Cov}}$ I encountered this fact while solving a problem and I was a little shocked that (1) I had never seen this fact before and (2) I had trouble proving it:

Let $Z \sim \mathcal N(0,1)$. Then $\mathrm{Var}(Z | Z < c) \le \mathrm{Var}(Z)$.

(If you think this fact is obvious, check whether your intuition also claims it for arbitrary distributions, because it is false in general: take a distribution supported on three values $a,b,b+\epsilon$ with $p(a)=p(b) = \delta$ small, and notice that conditioning on being less than $b+\epsilon$ actually increases the variance.)
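
If you want to convince yourself numerically before reading on, here is a quick sanity check (a sketch, not a proof; the particular values of $c$, $a$, $b$, $\epsilon$, $\delta$ below are arbitrary choices of mine), using `scipy.stats.truncnorm` for the truncated normal and the three-point distribution from the parenthetical above:

```python
# A quick numerical sanity check (not a proof). The constants are arbitrary.
import numpy as np
from scipy import stats

# Truncated standard normal: Var(Z | Z < c) should be <= 1 for every c.
for c in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    var_trunc = stats.truncnorm(a=-np.inf, b=c).var()
    print(f"c = {c:+.1f}:  Var(Z | Z < c) = {var_trunc:.4f}  (Var(Z) = 1)")

# Three-point counterexample: support {a, b, b+eps} with p(a) = p(b) = delta small.
a, b, eps, delta = 0.0, 1.0, 0.01, 0.001
xs = np.array([a, b, b + eps])
ps = np.array([delta, delta, 1 - 2 * delta])

def variance(xs, ps):
    ps = ps / ps.sum()                      # renormalize (this is the conditioning)
    m = np.dot(ps, xs)
    return np.dot(ps, (xs - m) ** 2)

print("unconditional variance:   ", variance(xs, ps))         # tiny, order delta
print("variance given X < b+eps: ", variance(xs[:2], ps[:2]))  # (b-a)^2/4 = 0.25
```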

The solution is not too hard if you have the patience to integrate (and ChatGPT/Claude will happily do it) - but I do not have such patience. Being somewhat unsatisfied, I googled around to find a better explanation, and it turns out that this fact generalizes to any $Z$ with a log-concave distribution.

Background: unnormalized distributions

I used to teach Stanford’s core intro to probability class (for the CS department). One of the deeper ideas I tried to convey in the class was to ignore the normalizing constants for distributions. Like, the formula for the normal distribution $$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$ seems scary, but is maybe half as scary if you write it as $$p(x) \propto \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

And there’s a perfectly good reason for the $\exp$ as well (and in general, for lots of distributions to look like $p(x) \propto e^{f(x)}$). Bayesian updating gives one example: suppose you had some (prior) distribution over $x$, you observed $y$, and you knew $p(y|x)$. How would you update your prior (i.e. compute $p(x|y)$)? The answer is that $$p(x|y) \propto p(x,y) = p(x) \cdot p(y|x)$$ or in words, we multiply the probability update onto our existing prior. (When $x$ and $y$ are boolean, the slick way to do this is to work with odds, which simply multiply: if a rare event has 1:100 odds and you observe 10:1 evidence for it, the event now has 1:10 odds.)
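
In code, the odds form of the update is literally one multiplication; here is a minimal sketch using the numbers from the parenthetical above:

```python
# Bayesian updating with odds: posterior odds = prior odds * likelihood ratio.
# Numbers are the 1:100 prior and the 10:1 evidence from the example above.
prior_odds = 1 / 100          # P(event) : P(no event)
likelihood_ratio = 10 / 1     # P(evidence | event) : P(evidence | no event)

posterior_odds = prior_odds * likelihood_ratio   # p(x|y) ∝ p(x) * p(y|x)
print(posterior_odds)                            # 0.1, i.e. 1:10 odds

# If you want a probability instead of odds:
print(posterior_odds / (1 + posterior_odds))     # ≈ 0.0909
```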

Some very basic stat-mech

Ok, so if that was convincing to you we can start to consider some probability distribution $p(x) \propto e^{f(x)}$. We can do all the hocus-pocus we want, but to get back a proper probability distribution we have to figure out the normalizing constant: $$Z = \sum_x e^{f(x)}$$ (I’m using $\sum$ because I come from the CS/discrete math tradition, but if you come from physics you might prefer $\int dx$.)

A thing physicists like to do is add a linear term $hx$ to the exponent. So really, we have a so-called partition function (of $h$): $$Z(h) = \sum_x e^{f(x) + hx}.$$

This is useful for a bunch of reasons, but a really cool trick is that we can recover the moments (more precisely, the cumulants) of $X$ just from $Z$:

$$ \begin{align*} (\log Z)' |_{h=0} &= \E[X] \\ (\log Z)''|_{h=0} &= \Var[X] \end{align*} $$ and so on.
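
Here is a quick finite-difference check of these two identities on a small discrete distribution (the support and the choice of $f$ below are arbitrary):

```python
# Check that (log Z)'(0) = E[X] and (log Z)''(0) = Var[X] for p(x) ∝ e^{f(x)}.
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0])
f = -xs**2 + xs                              # any f works; p(x) ∝ e^{f(x)}

def logZ(h):
    return np.log(np.sum(np.exp(f + h * xs)))

ps = np.exp(f) / np.sum(np.exp(f))           # the normalized distribution
mean = np.dot(ps, xs)
var = np.dot(ps, xs**2) - mean**2

eps = 1e-4                                   # central finite differences at h = 0
d1 = (logZ(eps) - logZ(-eps)) / (2 * eps)
d2 = (logZ(eps) - 2 * logZ(0.0) + logZ(-eps)) / eps**2
print(d1, "vs", mean)                        # these should agree
print(d2, "vs", var)                         # and so should these
```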

So okay, let’s try to see if this gets us anywhere. Let $X$ be a standard normal and $X^* = X | X < c$. The respective distributions are $$ \begin{align*} p_X(x) &\propto \exp\left(-\frac{x^2}{2}\right) \\ p_{X^*}(x) &\propto \exp\left(-\frac{x^2}{2}\right) \I_{x < c} \end{align*} $$

Therefore, $$ \begin{align*} \Var(X^*) - \Var(X) &= (\log Z^* - \log Z)''|_{h=0} \\ &= \left(\log \frac{\int \exp(-\tfrac {x^2} 2 + hx) \I_{x < c} dx}{\int \exp(-\tfrac {x^2} 2 + hx) dx}\right)''\bigg|_{h=0}. \end{align*} $$ Now, a density proportional to $\exp(-x^2/2 + hx)$ is a normal distribution, albeit now with mean $h$. To shift the mean back to zero, substitute $x \mapsto x+h$ in both integrals: the $e^{h^2/2}$ factors cancel and the indicator becomes $\I_{x < c-h}$, so $$ \left(\log \frac{\int \exp(-\tfrac {x^2} 2 + hx) \I_{x < c} dx}{\int \exp(-\tfrac {x^2} 2 + hx) dx}\right)''\bigg|_{h=0} = (\log \Phi(c-h))''|_{h=0}. $$ So the question becomes simply: is $\Phi$ log-concave? The answer is yes, and it’s not too hard to check by hand, but in the spirit of laziness I will instead appeal to some well-known facts about log-concavity.
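
(Before moving on, here is a numerical check of the identity we just derived, namely $\Var(X^*) - \Var(X) = (\log \Phi)''(c)$; the particular values of $c$ are arbitrary choices of mine.)

```python
# Check Var(X | X < c) - Var(X) = (log Φ(c-h))''|_{h=0} = (log Φ)''(c).
import numpy as np
from scipy import stats

def d2_log_Phi(c, eps=1e-4):
    # central second difference of log Φ at c
    lp = stats.norm.logcdf
    return (lp(c + eps) - 2 * lp(c) + lp(c - eps)) / eps**2

for c in [-1.0, 0.0, 1.5]:
    lhs = stats.truncnorm(a=-np.inf, b=c).var() - 1.0    # Var(X*) - Var(X)
    rhs = d2_log_Phi(c)
    print(f"c = {c:+.1f}:  {lhs:.5f}  vs  {rhs:.5f}")
```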

Log-concavity

A function $f: \mathbb R^n \to \mathbb R_{\ge 0}$ is log-concave if … drumroll please … $\log f$ is concave. (I wish names in math were always this straightforward!) Similarly, a random variable $X$ is log-concave if its pdf is log-concave.

As an example, $p_{X^*}$ above is log-concave: its log is (up to a constant) the concave function $-x^2/2$ on the convex set $\{x < c\}$ and $-\infty$ outside of it, and restricting a concave function to a convex set (declaring it $-\infty$ elsewhere) keeps it concave.

How does this help us? Well, it turns out that integrals of log-concave functions are still log-concave. This is a special case of a more general fact:

Any marginal of a log-concave distribution is log-concave. (That is: if we integrate a log-concave function over one of its coordinates, the result, as a function of the remaining coordinates, is still log-concave.)

Applying this to the function $e^{-z^2/2} \I_{x>z}$ (which is log-concave in $(x,z)$, since the half-plane $\{x > z\}$ is convex) and integrating out $z$ gives us that $\Phi$ is log-concave.
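
Here is the same claim checked numerically (a sketch, not a proof): the marginal is $\sqrt{2\pi}\,\Phi(x)$, so its log should have non-positive second differences on any grid.

```python
# The marginal ∫ e^{-z²/2} 1{z < x} dz equals √(2π) Φ(x); check that its log is
# concave by looking at second differences on a grid (they should all be ≤ 0).
import numpy as np
from scipy import stats

xs = np.linspace(-6, 6, 241)
log_marginal = stats.norm.logcdf(xs)        # log Φ, up to an additive constant
second_diff = np.diff(log_marginal, n=2)
print(second_diff.max())                    # negative: log Φ is concave here
```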

If you’re interested in the proof, I wrote it up here. The crux is the following fact:

(Prekopa-Leindler) Suppose $f,g,h$ are non-negative functions on $\mathbb R^n$ and $\lambda\in (0,1)$. If $h(\lambda x + (1-\lambda) y) \ge f(x)^{\lambda} g(y)^{1-\lambda}$ for all $x,y \in \mathbb R^n$, then $$\int h(x) dx \ge \left(\int f(x) dx\right)^\lambda \left(\int g(x) dx\right)^{1-\lambda}.$$
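
To spell out how this implies the marginal fact above: suppose $F(x,z)$ is log-concave and let $M(x) = \int F(x,z)\, dz$. Fix $x_0, x_1$ and $\lambda \in (0,1)$, and apply Prekopa-Leindler in the $z$ variable to $f(z) = F(x_0, z)$, $g(z) = F(x_1, z)$, and $h(z) = F(\lambda x_0 + (1-\lambda) x_1, z)$. Log-concavity of $F$ says exactly that $h(\lambda z_0 + (1-\lambda) z_1) \ge f(z_0)^\lambda g(z_1)^{1-\lambda}$ for all $z_0, z_1$, so the conclusion reads $$M(\lambda x_0 + (1-\lambda) x_1) \ge M(x_0)^\lambda M(x_1)^{1-\lambda},$$ i.e. $M$ is log-concave.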

Here is a smattering of other facts related to log-concavity:

(Borell) If a measure has a density and satisfies the (multiplicative) Brunn-Minkowski inequality, i.e. $$\mu(\lambda A + (1-\lambda) B) \ge \mu(A)^\lambda \mu(B)^{1-\lambda}$$ for all compact sets $A,B$ and $\lambda\in (0,1)$, then it is log-concave (and vice versa).

(Brascamp-Lieb; generalized Poincare inequality) If $X$ has density $\propto e^{f(x)}$ with $f$ smooth and strictly concave, then for any smooth $g$, $\Var(g(X)) \le \E[\nabla g(X)^\top (-\nabla^2 f(X))^{-1} \nabla g(X)]$.
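
As a concrete illustration of that last bound: for a standard Gaussian, $-\nabla^2 f = 1$, so it reads $\Var(g(X)) \le \E[g'(X)^2]$ (the usual Gaussian Poincare inequality). A quick Monte Carlo check, with a couple of test functions of my own choosing:

```python
# Monte Carlo check of Var(g(X)) ≤ E[g'(X)²] for X ~ N(0,1) (where -f'' = 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

tests = [
    ("g(x) = x^2  ", x**2, 2 * x),           # pairs (g(X), g'(X))
    ("g(x) = sin x", np.sin(x), np.cos(x)),
]
for name, gx, dgx in tests:
    print(f"{name}:  Var(g) = {gx.var():.4f}  <=  E[g'^2] = {(dgx ** 2).mean():.4f}")
```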

Hoeffding’s inequality

Here’s another spot where thinking about the log-partition function helps a lot. One way to bound the deviation of a sum of arbitrary (but bounded) independent random variables from its mean is Hoeffding’s inequality:

(Hoeffding) Let $X_1,\ldots,X_n$ be iid random variables with $X_i \in [a,b]$. Then, for all $\epsilon > 0$, their sum $S_n$ satisfies $$P\left[\left|S_n - \E\left[S_n\right]\right| \ge \epsilon\right] \le 2 \exp\left(-\frac{2\epsilon^2}{n(b-a)^2}\right).$$
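
(As an empirical sanity check of the statement, not needed for anything below, here is a small simulation with iid Uniform$[0,1]$ variables; the values of $n$ and $\epsilon$ are arbitrary choices of mine.)

```python
# Empirical tail probability of |S_n - E[S_n]| ≥ ε vs the Hoeffding bound,
# for iid Uniform[0,1] (so a = 0, b = 1, and (b-a)² = 1).
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 10.0, 200_000

S = rng.random((trials, n)).sum(axis=1)          # one S_n per trial
empirical = np.mean(np.abs(S - n / 2) >= eps)    # E[S_n] = n/2
bound = 2 * np.exp(-2 * eps**2 / n)
print(empirical, "<=", bound)
```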

The proof is pretty standard: apply Markov’s inequality to the moment generating function (the Chernoff trick); since the variables are independent the MGF of the sum factors, so we just have to bound a single term: $$ \E[e^{s(X-\E X)}] \le \exp\left(\frac{s^2(b-a)^2}{8}\right). $$

I remember Jensen showed this to me and we both found it surprisingly hard. But here’s the solution: after shifting $X$ to have mean 0, we just need to prove that $$Z(s) := \log \E[e^{sX}] \le \frac{s^2(b-a)^2}{8}$$

(Note that this variable $s$ is not like $h$ from before! If you were a physicist you’d call this $\beta$.) Now, $Z(0) = 0$ and $Z'(0) = \E[X] = 0$, so $$Z(s) = \int_0^s Z'(t) dt = \int_0^s \int_0^t Z''(u) \, du \, dt \le \frac{s^2}{2} \cdot \frac{(b-a)^2}{4},$$ because $Z''(u)$ can be interpreted as the variance of a distribution supported on $[a,b]$ (namely, the $e^{ux}$-tilted version of $X$’s distribution), and the maximal possible variance of such a distribution is $\frac{(b-a)^2}{4}$ (by a smoothing argument).
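
Here is a numerical sketch of both ingredients for one mean-zero distribution on $[a,b] = [-1,1]$ (the distribution itself is an arbitrary choice of mine):

```python
# Check Z(s) = log E[e^{sX}] ≤ s²(b-a)²/8, and that Z''(u), i.e. the variance of
# the e^{ux}-tilted distribution, stays below (b-a)²/4. Here [a, b] = [-1, 1].
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])
ps = np.array([0.25, 0.5, 0.25])                 # mean zero, supported on [-1, 1]

def Z(s):
    return np.log(np.dot(ps, np.exp(s * xs)))

def tilted_var(u):
    w = ps * np.exp(u * xs)
    w = w / w.sum()                              # the tilted distribution
    return np.dot(w, xs**2) - np.dot(w, xs)**2   # = Z''(u)

ss = np.linspace(-3, 3, 61)
print(all(Z(s) <= s**2 * (2**2) / 8 for s in ss))      # Z(s) ≤ s²(b-a)²/8
print(all(tilted_var(u) <= (2**2) / 4 for u in ss))    # Z''(u) ≤ (b-a)²/4
```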

A challenge

What does a multivariate version of the first problem look like? When I fed the initial problem to ChatGPT, it hallucinated (?) this:

Let $X$ be a $\mathbb R^n$-valued log-concave r.v. and $K$ be a convex set. Show that $\Cov(X|X\in K) \preceq \Cov(X)$.

If you like a challenge, try to prove this! (Be warned - I’m not sure if this is the correct statement.)
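
(Not a proof, and the statement itself may be off, but here is a Monte Carlo sanity check for one log-concave $X$, a correlated Gaussian, and one convex $K$, a half-space; both are my own choices. We check that $\Cov(X) - \Cov(X \mid X \in K)$ is positive semidefinite by looking at its eigenvalues.)

```python
# Monte Carlo check that Cov(X) - Cov(X | X ∈ K) ⪰ 0 for a Gaussian X and a
# half-space K. Eigenvalues should be ≥ 0 up to Monte Carlo noise.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=2_000_000)

in_K = X @ np.array([1.0, -0.5]) < 0.3           # K = {x : <(1, -0.5), x> < 0.3}
diff = np.cov(X.T) - np.cov(X[in_K].T)
print(np.linalg.eigvalsh(diff))                  # both ≥ 0 (up to sampling noise)
```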

Conclusion

Log-concavity is certainly a concept I want to understand better, and it’s great that I’ve managed to “encounter it in the wild” for once. Hope you enjoyed this post!