Sampling from a Gibbs' distribution

$\newcommand\E{\mathbb E}$ $\newcommand \cN {\mathcal N}$ $\newcommand \one{\mathbf 1}$ $\newcommand \Var {\mathrm{Var}}$ $\renewcommand \Im {\mathrm{Im}}$ $\renewcommand \Re {\mathrm{Re}}$ $\newcommand \tr {\mathrm {tr}}$ $\newcommand \argmin {\mathop{\mathrm{argmin}}}$

Here’s a folklore fact from statistical physics: given a random variable $X$, sampling from $\propto p(X)e^{\beta X}$ is exactly the same as:

(1) sampling a large number of independent $X$’s

(2) conditioning the sum to be $\sum X_i = N(u + o(1))$, where $$ u = \frac{\E Xe^{\beta X}}{\E e^{\beta X}}$$ (i.e. conditioning the mean to match that of the target distribution)

(3) picking one of the samples at random.

In this post, we’ll try to state it rigorously and prove it. Pingback to Why Boltzmann? - we’re going far beyond the reasons given in that post.
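Here’s a quick numerical sketch of the claim, taking $X\sim\cN(0,1)$ for concreteness (so the tilted law $\propto p(x)e^{\beta x}$ is exactly $\cN(\beta,1)$ and $u=\beta$), and relaxing the conditioning in (2) to a thin window around $u$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of the folklore fact, with X ~ N(0,1).  For a standard Gaussian the tilted
# law  p(x) e^{beta x}  is exactly N(beta, 1), and u = E[X e^{bX}] / E[e^{bX}] = beta.
beta, N, eps, n_batches = 0.3, 25, 0.03, 400_000

# (1) many independent batches of N samples each
X = rng.standard_normal((n_batches, N))

# (2) keep the batches whose empirical mean is (approximately) u = beta
kept = X[np.abs(X.mean(axis=1) - beta) < eps]

# (3) pick one coordinate at random from each surviving batch
picks = kept[np.arange(len(kept)), rng.integers(0, N, size=len(kept))]

print(f"accepted batches: {len(kept)}")
print(f"picked samples : mean = {picks.mean():.3f}, std = {picks.std():.3f}")
print(f"tilted target  : mean = {beta:.3f}, std = 1.000   (agreement up to O(1/N) and window effects)")
```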

Warm-up: why does the CLT work?

Ignoring rigorous considerations, the answer tends to involve some kind of generating function. You can think of it as a rack for the various moments of a random variable $X$:

$$\E[e^{tX}] = 1 + t\E[X] + \frac{t^2}{2} \E[X^2] + ...$$

(One could also consider more “directly” $\E[t^X]$ but there won’t be too much of a difference, since it’s just a variable change anyway.)

Concretely, let’s suppose that $X$ was uniformly $\pm 1$, so the generating function is $\frac{e^t + e^{-t}}{2}$ (also known as $\cosh t$). What is the distribution of $$S_n = \frac{1}{\sqrt n}\left(X_1 + X_2 + \cdots + X_n\right)?$$

Well, let’s compute the generating function: $$ \begin{align*} \E[e^{tS_n}] &= \E[e^{tX_1/\sqrt{n}}]^n \\ &= \left(1 + \frac{t^2}{2n} + \frac{t^4}{4!\cdot n^2} + \cdots \right)^n\\ &\to \exp(t^2/2) \qquad\text{as $n\to \infty$} \end{align*} $$

which is the generating function of a standard Gaussian $\cN(0,1)$. The same argument extends to almost everything else (e.g. bounded random variables) - technically if your generating function doesn’t converge then you’ll have problems, but let’s not care about that for now.
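Here’s what that convergence looks like numerically (with an arbitrary choice of $t$):

```python
import numpy as np

# For Rademacher X, E[e^{t S_n}] = cosh(t / sqrt(n))^n, which should approach e^{t^2/2}.
t = 1.5
for n in [1, 10, 100, 10_000]:
    print(f"n = {n:>6}:  cosh(t/sqrt n)^n = {np.cosh(t / np.sqrt(n)) ** n:.6f}")
print(f"limit e^(t^2/2) = {np.exp(t ** 2 / 2):.6f}")
```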

(I think the actual way to do this uses the characteristic function $\E[e^{itX}]$, which is sort of the complex-rotated version. I’m not sure what exactly this gains - presumably that the characteristic function always exists (since $|e^{itX}| = 1$), even when the moment generating function doesn’t converge.)

Background: the thermodynamical story

Thermodynamical systems consist of a vast number of individual particles that are indistinguishable and individually immeasurable; instead, we measure the evolution of so-called “macro” quantities like temperature and pressure.

For a very simple model, we can imagine that we have $N$ particles in various states $X_1,...,X_N$ (which are random, over the set of states). Each state $X_i$ has some energy level $E(X_i)$. However, if we have a closed system, the total energy must be conserved, so there is some invariant quantity like $$U = E(X_1) + ... + E(X_N)$$ which is roughly constant. The question now is: can I figure out the distribution of energy among the particles?

The answer turns out to be yes - it is the probability distribution over states that (1) gives the correct mean energy and (2) has the maximal entropy. Specifically, we’re looking for probabilities $p_x$ satisfying $U/N = \sum_x p_x E(x)$ that maximize the entropy $\sum_x p_x \log (1/p_x)$.

A straightforward Lagrange multipliers computation tells us that $p_x \propto e^{-\beta E(x)}$, where $\beta$ is whichever value gives the correct mean energy. We sometimes call $\beta$ the inverse temperature. This answer is also called a Gibbs’ distribution.
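For completeness, here’s that computation, with multipliers $\beta$ and $\mu$ for the energy and normalization constraints: we look for a stationary point of

$$\mathcal L(p) = \sum_x p_x \log \frac{1}{p_x} - \beta\Big(\sum_x p_x E(x) - U/N\Big) - \mu\Big(\sum_x p_x - 1\Big).$$

Setting $\frac{\partial \mathcal L}{\partial p_x} = -\log p_x - 1 - \beta E(x) - \mu = 0$ gives $p_x = e^{-1-\mu} e^{-\beta E(x)} \propto e^{-\beta E(x)}$, with the multipliers pinned down by the two constraints.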

This interpretation of things is not new; we see hints of it in other areas of probability.


Yet, on the concrete level, there is one unsatisfying thing: why are we allowed to invoke the principle of maximum entropy in the first place? Thinking back to the CLT - is there also a reason why the resulting distribution has maximal entropy? The normal distribution is the maximum-entropy distribution for a fixed variance (i.e. setting $E(x) = x^2$), but I don’t see how that comes up at all.

There is some sense that in the limit, the result “forgets” all the moments above the variance, which, if we squint hard enough, seems to be happening in the generating function computation. Maybe we can do the same…


On another level, the thermodynamics interpretation is incredibly attractive to me because the ability to interpret “continuous objects” as a limit of a sequence of discrete ones lends us the possibility of doing combinatorial arguments. (What’s possible in the world of Poisson and Exponential distributions warrants a whole article by itself.)

The rigorous setup

Actually, physicists have already done a rigorous version of this computation, but it involves discrete energy levels and assumes that $U$ is kept exactly constant.

This is in chapter 4 of Pathria and Beale (exact reference in the future).

How they did it was: they discretized $E(X)$ (so it only takes integer values), and then considered the distribution of a single state conditioned on the total energy, i.e. $$p_x = P(X_1 = x | E(X_1) + E(X_2) + ... + E(X_N) = U),$$ where the $X_i$ are independent, uniform samples of a state.

Of course, if we just care about the distribution of the energy levels, we would just write the state as the energy level: $$p_x = P(X_1 = x | X_1 + ... + X_N = U) = \frac{\E[\one_{X_1 = x} \one_{X_1 + ... + X_N = U}]}{\E[\one_{X_1 + ... + X_N = U}]}$$ where the $X_i$ are independent and identically distributed. In this case, we can also read off the distribution of $X_1$ from the g.f.:

$$ \E[e^{tX_1} | X_1 + ... + X_N = U] = \frac{\E[e^{tX_1} \one_{X_1 + ... + X_N = U}]}{\E[\one_{X_1 + ... + X_N = U}]}.$$
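Even in this discrete form, the claim is easy to check numerically: here’s a small sketch with a toy three-level spectrum ($E \in \{0,1,2\}$, uniform a priori), computing the conditional law of $X_1$ exactly by convolution and comparing it to the Gibbs weights $e^{-\beta x}$ whose mean matches $U/N$ (the two should agree up to $O(1/N)$ corrections):

```python
import numpy as np
from scipy.optimize import brentq

# Toy discrete check: energy levels 0, 1, 2, uniform a priori.
levels = np.array([0, 1, 2])
p = np.full(3, 1 / 3)          # a-priori distribution over levels
N, U = 30, 20                  # N particles, total energy U, so mean energy u = U/N = 2/3

# Distribution of X_2 + ... + X_N by repeated convolution (array index = total energy).
rest = np.array([1.0])
for _ in range(N - 1):
    rest = np.convolve(rest, p)

# Exact conditional law  P(X_1 = x | X_1 + ... + X_N = U).
cond = np.array([p[x] * rest[U - x] for x in levels])
cond /= cond.sum()

# Gibbs weights e^{-beta x} with the same mean energy u = U/N.
u = U / N
gibbs_mean = lambda b: (levels * np.exp(-b * levels)).sum() / np.exp(-b * levels).sum()
beta = brentq(lambda b: gibbs_mean(b) - u, -20, 20)
gibbs = np.exp(-beta * levels)
gibbs /= gibbs.sum()

print("conditional:", np.round(cond, 4))
print("Gibbs      :", np.round(gibbs, 4))
```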

The indicator version is workable, but (1) kinda gross, and (2) doesn’t really work for real-valued energy levels. What we are really saying here is that this should work for any thin band around $U$. So instead of the indicator, we consider a sharply decaying kernel $\exp(-\frac{\lambda^2} 2(X_1 + X_2 + ... + X_N - U)^2)$ with the intent of taking $\lambda \to \infty$.

So here’s attempt 1: write $u = U/N$, then the following is close to what we want:

$$\frac{\E[\exp(tX_1 - \frac{\lambda^2} 2(X_1 + X_2 + ... + X_N - Nu)^2)]}{\E[\exp( - \frac{\lambda^2} 2(X_1 + X_2 + ... + X_N - Nu)^2)]}.$$

Let’s invoke a little secret sauce that allows us to deal with integrals of exponentials of squares (essentially the Hubbard–Stratonovich trick). Introduce an auxiliary random variable $\gamma \sim \cN(0,1)$; then we know that

$$\E_\gamma[e^{it\gamma}] = e^{-t^2/2}$$

so we can write:

$$ \begin{align*} \E_X[\exp( - \frac{\lambda^2}{2}(X_1 + X_2 + ... + X_N - Nu)^2)] &= \E_{X,\gamma}[\exp(i\lambda \gamma(X_1 + X_2 + ... + X_N - Nu))] \\ &= \E_\gamma \left( \E_{X_1}[\exp(i\lambda \gamma (X_1 - u))]^N\right) \end{align*} $$

We have an integral over a fixed space with something to the power of $N$ as $N\to\infty$. The leading order term can be estimated using Laplace approximations.
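Before we get to that, here’s a quick numerical check of the Gaussian-auxiliary identity above, for a small $N$ (with $X$ uniform $\pm 1$ as in the warm-up; $N$, $\lambda$, $u$ are arbitrary choices):

```python
import numpy as np
from math import comb

# Check  E_X[exp(-(lam^2/2)(S_N - N u)^2)]  =  E_gamma[ E_X1[exp(i lam gamma (X1 - u))]^N ]
# for X uniform on {-1, +1}; N, lam, u are arbitrary small choices.
N, lam, u = 8, 0.7, 0.3

# Left side: S_N = N - 2k with probability C(N, k) / 2^N.
lhs = sum(comb(N, k) / 2**N * np.exp(-lam**2 / 2 * (N - 2 * k - N * u) ** 2)
          for k in range(N + 1))

# Right side: for X = +-1,  E_X1[exp(i lam g (X1 - u))] = exp(-i lam g u) cos(lam g);
# integrate against the N(0,1) density of the auxiliary gamma on a grid.
g = np.linspace(-10, 10, 200_001)
inner = np.exp(-1j * lam * g * u) * np.cos(lam * g)
rhs = np.sum(np.exp(-g**2 / 2) / np.sqrt(2 * np.pi) * inner**N) * (g[1] - g[0])

print(f"LHS = {lhs:.6f}")
print(f"RHS = {rhs.real:.6f}   (imaginary part ~ {abs(rhs.imag):.1e})")
```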

Detour: Laplace approximations and saddle point methods

The rough idea

The general idea of a Laplace approximation is that if I have an integral $$\int f(x) g(x)^N dx$$ then as $N\to\infty$, this integral grows like $C g(x_0)^N$ (up to polynomial-in-$N$ factors) if there is a unique maximizer $x_0$ of $|g(x)|$. In fact, one can also figure out the exact form of $C$ by expanding $\log g(x)$ up to the quadratic term: if

$$\log g(x) \approx \log g(x_0) + \frac{g'(x_0)}{g(x_0)} \cdot(x-x_0) + \frac{ g(x_0) g''(x_0) - g'(x_0)^2}{g(x_0)^2} \cdot \frac{(x-x_0)^2}{2}$$

then we pretend that we only care about a neighborhood around the maximizer $x_0$:

$$\int f(x) g(x)^N dx \approx f(x_0) g(x_0)^N\int \exp\left(A(x-x_0) + B\cdot \frac{(x-x_0)^2}{2}\right)^Ndx = f(x_0) g(x_0)^N C_g,$$ where $A = \frac{g'(x_0)}{g(x_0)}$ and $B = \frac{g(x_0) g''(x_0) - g'(x_0)^2}{g(x_0)^2}$ are the coefficients from the expansion above.

The constant $C_g$ can be computed but I won’t really bother here, because in most cases we will end up taking a ratio of integrals:

$$\frac{\int f(x) g(x)^N dx}{\int g(x)^N dx} \to f(x_0)$$
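Here’s a quick numerical illustration of that ratio, with a toy choice of $f$ and $g$ (where $g(x) = e^{-x^2}$, so $x_0 = 0$ and the ratio should tend to $f(0) = 2$):

```python
import numpy as np

# Ratio form of the Laplace approximation:
#   int f(x) g(x)^N dx / int g(x)^N dx  ->  f(x0),   x0 the maximizer of g.
f = lambda x: 2 + np.sin(x) + x**2      # arbitrary smooth f, with f(0) = 2
g = lambda x: np.exp(-x**2)             # unique maximizer at x0 = 0

x = np.linspace(-5, 5, 400_001)
for N in [1, 10, 100, 1000]:
    ratio = np.sum(f(x) * g(x) ** N) / np.sum(g(x) ** N)
    print(f"N = {N:>4}:  ratio = {ratio:.6f}   (f(x0) = 2)")
```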


Let’s also work through a concrete case. If $x_0=0$ and $g$ is a moment generating function $g(t) = \E[e^{tX}]$, then this is even nicer: $$\log \E[e^{tX}] \approx \E[X]\cdot t + \Var(X) \frac{t^2}{2}.$$

Wait, did you say maximizer earlier on? But this thing is convex!

Ok, let’s revisit what we were looking at in the previous section. It turns out that what we really need to look at are $g$’s which are characteristic functions. We get

$$\log \E[e^{itX}] \approx i\E[X] \cdot t - \Var(X) \cdot \frac{t^2}{2}.$$

The main problem is that now we have an imaginary part to $\log g$, so we have to think about how this affects the original Laplace approximation. The answer is that this is really bad: the imaginary part gives us “oscillations” that cancel out and make the contribution to the integral much smaller than the size of the integrand suggests. For instance: $$\int_{-\infty}^{\infty} e^{iat - bt^2/2} dt = \sqrt{\frac{2\pi}{b}}\, e^{-a^2 / 2b}$$ which is a factor of $\exp(-a^2/2b)$ smaller than it would be without the imaginary part.
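(A quick numerical check of that integral, with arbitrary $a$ and $b$:)

```python
import numpy as np

# Check  int exp(iat - bt^2/2) dt  =  sqrt(2*pi/b) * exp(-a^2/(2b)).
a, b = 1.3, 2.0
t = np.linspace(-30, 30, 600_001)
numeric = np.sum(np.exp(1j * a * t - b * t**2 / 2)) * (t[1] - t[0])
exact = np.sqrt(2 * np.pi / b) * np.exp(-a**2 / (2 * b))
print(numeric.real, exact)   # the imaginary part vanishes by symmetry
```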

The solution? Well, it involves a handful of complex analysis. So let me just recap the few things you need here.

An ounce of complex analysis

There’s just a small handful of concepts relevant here:

(1) The complex analogue for differentiability is holomorphicity. You don’t have to worry too much about this - most things you see will be holomorphic (e.g. $e^z$).

(2) If I have a holomorphic function $f$, then integrating $f$ along two different paths with the same endpoints gives the same answer: $$\int_{\Gamma} f(z) dz = \int_{\Gamma'} f(z) dz$$ (related: integrals along closed loops equal 0; there’s a quick numerical check of this after the list.)

(3) What even is a path integral? We can convert them into regular integrals by looking for some function that parametrizes the path, $\gamma: [0,1] \to \Gamma$; then $$\int_\Gamma f(z) dz = \int_0^1 f(\gamma(t)) \gamma'(t) dt.$$ The nice thing is that the answer doesn’t depend on which parametrization you take, as long as you trace the same path.

(4) If I write a holomorphic function as $f(u+iv)$, then the Cauchy-Riemann equations imply that $$\frac{\partial^2 f}{\partial u^2} + \frac{\partial^2 f}{\partial v^2} = 0$$ This is not strictly speaking necessary, but it will give some nice interpretation for what we’re about to see.
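Here’s the promised numerical check of (2) and (3): integrate $e^z$ from $-i$ to $i$ along a straight vertical segment and along a detour through $z = 1$; both should equal $e^{i} - e^{-i} = 2i\sin 1$.

```python
import numpy as np

def path_integral(f, points, n=20_001):
    """Integrate f along the piecewise-linear path through `points` (trapezoidal rule)."""
    total = 0.0 + 0.0j
    for z0, z1 in zip(points[:-1], points[1:]):
        t = np.linspace(0.0, 1.0, n)
        fz = f(z0 + t * (z1 - z0))                                            # f(gamma(t))
        total += np.sum((fz[:-1] + fz[1:]) / 2) * (t[1] - t[0]) * (z1 - z0)   # times gamma'(t) dt
    return total

direct = path_integral(np.exp, [-1j, 1j])        # straight up the imaginary axis
detour = path_integral(np.exp, [-1j, 1.0, 1j])   # detour through z = 1
print(direct, detour, 2j * np.sin(1))            # all three should agree
```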

So the relevance here is that the integral we’re considering can be thought of as a path integral from $-i\infty$ to $+i\infty$:

$$\E_\gamma \left( \E_{X_1}[\exp(i\lambda \gamma (X_1 - u))]^N\right) = -i \int_{-i\infty \to i\infty} \E_{X_1}[\exp(\lambda z(X_1 - u))]^N \cdot \frac{e^{z^2/2} dz}{\sqrt{2\pi}}$$

The key here is that, thanks to the $e^{z^2/2}$ term, everything decays rapidly as $|\Im(z)| \to \infty$ (as long as $\Re(z)$ stays bounded). So really, using (2) we can do the following “free of charge”:

$$\int_{-it \to it} (...) = \underbrace{\int_{-it \to \beta-it}(...)}_{\text{goes to 0 as }t\to\infty} + \int_{\beta-it \to \beta+it}(...) + \underbrace{\int_{\beta+it \to it}(...)}_{\text{goes to 0 as }t\to\infty} $$

Hence the moral of the story is that we can really sneak in an extra $\beta$ free of charge:

$$\E_\gamma \left( \E_{X_1}[\exp(i\lambda \gamma (X_1 - u))]^N\right) = \E_\gamma \left( e^{\beta^2/2 + i\beta\gamma}\, \E_{X_1}[\exp(\lambda (\beta+i\gamma) (X_1 - u))]^N\right).$$ (The extra factor $e^{\beta^2/2 + i\beta\gamma}$ is just the Gaussian weight evaluated on the shifted contour; it’s smooth and nonvanishing, so it gets absorbed into the “$f$” part of the Laplace approximation and drops out of the ratios we care about.)

That’s all the complex analysis we needed.

Saddle points

Back to where we were stuck earlier. So we assumed that the maximum happened at $x_0=0$, and we had $$\log \E[e^{itX}] \approx i\E[X] \cdot t - \Var(X) \cdot \frac{t^2}{2}$$ and we were concerned that $\E[X]$ could be non-zero. What we learnt earlier is that we could sneak an extra $\beta$ into the exponent for free. This means that we can “tilt” the distribution of $X$: if $p(X_\beta) \propto p(X) e^{\lambda\beta X}$, then really $$\E_\gamma \left( \E_{X_1}[\exp(\lambda (\beta+i\gamma) (X_1 - u))]^N\right) = \left(\text{const. independent of }\gamma\right)\cdot \E_\gamma \left( \E_{X_{1,\beta}}[\exp(i\lambda \gamma (X_{1,\beta} - u))]^N\right)$$

So what we can do is to tilt $X$ until the tilted mean matches $u$, so that we do not have an imaginary linear term and $g(x_0)$ remains the largest contributor. In particular (absorbing $\lambda$ into $\beta$ from here on), we seek $\beta$ such that $\E[(X - u)e^{\beta (X - u)}] = 0$, i.e. $\frac{\E[Xe^{\beta X}]}{\E[e^{\beta X}]} = u$. This particular $\beta$ is also given by $$\beta = \argmin_\beta \E[e^{\beta (X - u)}].$$
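Finding this $\beta$ numerically is just a one-dimensional convex minimization. Here’s a small sketch with a toy distribution ($X$ uniform on $\{-1, 0, 2\}$) and an arbitrary target mean $u$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy choice: X uniform on {-1, 0, 2}, target mean u for the tilted law.
vals = np.array([-1.0, 0.0, 2.0])
u = 0.8

# beta = argmin_beta  E[exp(beta (X - u))]   (a convex function of beta)
mgf_centered = lambda b: np.mean(np.exp(b * (vals - u)))
beta = minimize_scalar(mgf_centered).x

# Check: the tilted mean  E[X e^{beta X}] / E[e^{beta X}]  should equal u.
w = np.exp(beta * vals)
print(f"beta = {beta:.4f},  tilted mean = {(vals * w).sum() / w.sum():.4f}  (target u = {u})")
```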

Once we have this, we can proceed with the Laplace approximation as usual.

This is sometimes called the saddle point method, because at the minimizer $z=\beta$, applying fact (4) means that we must have $\log \E[e^{z X}]$ convex in the real direction and concave in the imaginary direction, so the minimizer along the real axis becomes a maximizer along the vertical line from $\beta-i\infty$ to $\beta +i\infty$.

So the upshot is some statement like

$$ \frac{\int f(i\gamma)\, \E[e^{i\gamma (X - u)}]^N\, d\gamma}{\int\E[e^{i\gamma (X - u)}]^N\, d\gamma} \longrightarrow f(\beta) $$ where $\beta$ minimizes $\E[e^{\beta (X - u)}]$ (equivalently, $\beta$ is the tilt whose mean matches $u$). In the original problem, the $f$ we care about is exactly $$f(i\gamma) = \E[e^{(t+i\gamma)X}] / \E[e^{i\gamma X}]$$ so the answer is exactly $$f(\beta) = \E[e^{(t+\beta)X}] / \E[e^{\beta X}].$$ This is exactly the generating function of the Gibbs distribution $\propto p(x)e^{\beta x}$. Therefore, we may conclude that

A Gibbs’ distribution $\propto p(x) e^{\beta f(x)}$ naturally arises as the distribution of a single coordinate out of $N$ independent copies, conditioned on the average value $\frac 1 N \sum f(X_i)$ being roughly some fixed value.

Follow-up questions

  1. One odd observation is that this seems to hold for any $\lambda$ at all, so really we could have slapped on any Gaussian kernel we wanted centered at $u$ and still obtained the same result.

  2. I think there’s more to be investigated between how complex analysis is related to Laplace’s approximation, especially when we consider a path going through multiple critical points.

  3. The million-dollar question is whether this interpretation gives any new insight into the mean-field approximation bounds. I’m not yet sure, but a good starting point would be to relook at Augeri (2021) and to see if things now have a nicer interpretation. I’m under the impression that quantities on the Gibbs distribution should be easier to bound but I’m not entirely sure.

  4. Maybe this can be further coupled to some kind of Gaussian variable thing? Stochastic localization?

  5. One could think about the actual realization of the energy in the quadratic case, where $f(\sigma) = \frac{\beta}{2} \sigma^\top J\sigma$. If we have $N$ copies $\sigma^{(1)}, ..., \sigma^{(N)}$, then actually we can write $$\frac 1 N \sum_{k} f(\sigma^{(k)}) = \frac \beta 2 \sum_{i, j} J_{ij}\left(\frac{1}{N}\sum_k \sigma_i^{(k)} \sigma_j^{(k)}\right)$$ If we write $\Sigma_{ij} = \frac{1}{\sqrt{N}}\sum_k \sigma_i^{(k)} \sigma_j^{(k)}$, then it turns out that in the “limit” we have that $\Sigma_{ij}\sim \mathcal N(0,1)$ independently where $i\lt j$ (see the numerical sketch after this list). This can be done by interpreting the whole setup as a generating function in $J_{ij}$ (where we zero out all entries with $j\le i$), then $$\E_{\Sigma} \exp(\tr(J\Sigma)) = \left(\E_\sigma \left(1+ \frac{1}{\sqrt{N}} \sigma^\top J \sigma + \frac{1}{2N}(\sigma^\top J \sigma)^2 + ... \right)\right)^N = \exp(\sum_{i\lt j} J_{ij}^2/2)$$

  6. Showerthought but: let’s say $U, V$ are two independent standard normals. Then $U+iV$ is rotationally symmetric about 0. Hence $\E[(U+iV)^4]=0$, thus $2\E[U^4] - 6\E[U^2]^2 = 0$, so $\E[U^4] = 3$.
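(A quick Monte Carlo check of item 5, with iid uniform $\pm 1$ spins and illustrative sizes: the rescaled overlaps $\Sigma_{ij}$, $i < j$, should look like independent standard Gaussians.)

```python
import numpy as np

rng = np.random.default_rng(1)

# With iid uniform +-1 spins, the rescaled overlaps
#   Sigma_{ij} = (1/sqrt(N)) * sum_k sigma_i^(k) sigma_j^(k)
# should look like iid N(0,1) for i < j.  Sizes below are illustrative.
n, N, trials = 4, 1_000, 2_000
sigma = rng.choice([-1, 1], size=(trials, N, n))

Sigma = np.einsum('tki,tkj->tij', sigma, sigma) / np.sqrt(N)   # one overlap matrix per trial

s01, s23 = Sigma[:, 0, 1], Sigma[:, 2, 3]
print(f"Sigma_01: mean = {s01.mean():+.3f}, var = {s01.var():.3f}   (expect ~0, ~1)")
print(f"Sigma_23: mean = {s23.mean():+.3f}, var = {s23.var():.3f}   (expect ~0, ~1)")
print(f"corr(Sigma_01, Sigma_23) = {np.corrcoef(s01, s23)[0, 1]:+.3f}   (expect ~0)")
```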