Over the summer, I gave weekly talks as part of a reading group. The first half of the talks was centered on information complexity, while the second half was about the random k-SAT threshold and methods from statistical physics. Here’s an attempt to transcribe (and revise) the “sales pitch” I gave for statistical physics.
(Credits. The first example, which I took from Chatterjee’s introduction to the subject, is also largely a sales pitch for large deviations theory.)
$\newcommand\E{\mathbb E}$ $\newcommand\I{\mathbb I}$ $\newcommand\eps{\epsilon}$
Here’s a problem:
Flip $N$ coins. What is the probability that at least $2/3$ of them are tails?
The exact probability here can be computed directly in terms of binomial coefficients, but let’s try to get a more qualitative understanding (especially about how it changes in $N$).
Rewrite the problem slightly:
Let $X_i$ take the values $\pm 1$ uniformly and independently. Establish the asymptotics of $$P(X_1+X_2+...+X_N \ge N/3)$$ in $N$.
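Before getting clever, it’s worth sanity-checking numerically. Here’s a few lines of plain Python computing the exact tail via binomial coefficients (the sum is at least $N/3$ exactly when at least $\lceil 2N/3\rceil$ of the $X_i$ equal $+1$); the last column is the probability’s $N$-th root, the quantity whose limit we’re after:

```python
from math import comb

# Exact tail probability: X_1 + ... + X_N >= N/3 iff the number of +1's
# is at least ceil(2N/3).
for N in (30, 90, 300, 900):
    k0 = -(-2 * N // 3)  # ceil(2N/3)
    p = sum(comb(N, k) for k in range(k0, N + 1)) / 2**N
    print(f"N={N:4d}   P={p:.3e}   P^(1/N)={p ** (1 / N):.4f}")
# The last column creeps upward toward the mystery base.
```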
Here’s an (incorrect) approach: by the central limit theorem, $\frac{X_1+...+X_N}{\sqrt N}$ is approximately a standard Gaussian $Z$, so $$P(X_1+...+X_N \ge N/3) = P\left(\frac{X_1+...+X_N}{\sqrt N} \ge \frac{\sqrt N}{3}\right) \approx P\left(Z \ge \frac{\sqrt N}{3}\right) \approx e^{-N/18},$$ using the Gaussian tail estimate $P(Z \ge t) \approx e^{-t^2/2}$.
Ok, so $P(X\ge N/3)$ is exponentially decreasing in $N$ (writing $X = X_1+...+X_N$ from here on). That much is correct, but it turns out that this argument gives the wrong base constant $e^{-1/18}$. Where exactly we went wrong is that the CLT only states that $$\frac{X_1+...+X_N}{\sqrt N} \overset d \to Z,$$ so the convergence is guaranteed for any $P(X\ge c\sqrt{N})$ with $c$ constant, but not a “moving target” value like $c=\sqrt N/3$.
So the puzzle here is to figure out:
For which $\alpha$ do we have $$P(X\ge N/3) \overset \cdot = \alpha^N ?$$ (Here $\overset\cdot=$ means equality up to factors that are subexponential in $N$.)
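One quick way to see numerically that $e^{-1/18}$ can’t be the base: if it were, the ratio $P(X\ge N/3)/e^{-N/18}$ would shrink only polynomially (like $1/\sqrt N$), but in fact it dies off exponentially fast. A few lines of Python make this visible:

```python
from math import comb, exp

# If exp(-1/18) were the right base, this ratio would shrink only like
# 1/sqrt(N); instead it collapses exponentially fast.
for N in (1000, 4000, 8000):
    k0 = -(-2 * N // 3)
    p = sum(comb(N, k) for k in range(k0, N + 1)) / 2**N
    print(N, p / exp(-N / 18))
```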
Let’s look at the hypercube $\{\pm 1\}^N$. We want to understand the size of the subset $$S= \{(x_1,x_2,...,x_N)\in \{\pm 1\}^N: x_1+x_2+...+x_N \ge N/3 \},$$ since the probability in question is exactly $|S|/2^N$.
We need a bunch of inputs from physics:

- An *energy* on configurations: here, let’s reward membership in $S$ by setting $f(x) = \I\{x\in S\}$.
- A *partition function* at inverse temperature $\beta$: $$Z = \sum_{x\in\{\pm 1\}^N} 2^{\beta f(x)} = |S|\, 2^\beta + (2^N - |S|).$$ (All exponentials and logs are base $2$ from here on; this only amounts to rescaling $\beta$.) For $\beta$ large enough, the first term dominates, so $\log Z \approx \beta + \log|S|$.
- *Gibbs’ inequality*, a.k.a. the variational principle: for any distribution $\nu$ on $\{\pm 1\}^N$, $$\beta\, \E_\nu f(X) + H(\nu) \le \log Z,$$ with equality exactly when $\nu$ is the Gibbs measure $\nu(x) \propto 2^{\beta f(x)}$.
So now assume that $\nu$ is a product distribution (i.e. it is made of independent $X_i$’s, each equal to $1$ with probability $p$). We try to maximize the LHS (for $\beta$ large enough!) by understanding its two parts:

- The energy term $\beta\, \E_\nu f(X) = \beta\, P_\nu(X \in S)$ wants $\nu$ to put essentially all of its mass on $S$; since $\E_\nu[X_1+...+X_N] = N(2p-1)$, the law of large numbers says this requires $p \ge 2/3$.
- The entropy term $H(\nu)$, which is $N$ times the entropy of a single coordinate, wants $p$ as close to $1/2$ as possible.

The best compromise is to sit at (really, just above) the boundary $p = 2/3$: the energy term stays $\approx \beta$ while we give up as little entropy as we can.
At this point, we can plug this distribution back in and obtain $$\beta + \log |S| \approx \log Z \approx \beta + N h_2(1/3)$$ where $h_2(p)$ is the binary entropy function defined by $$h_2(p) = -p\log p -(1-p) \log (1-p).$$ (This also happens to be the entropy of a two-valued r.v. whose probability mass splits as $p$ and $1-p$; in particular $H(\nu) = N h_2(2/3)$ above, and $h_2(2/3) = h_2(1/3)$ by symmetry.)
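Here’s a small numerical sketch of this optimization, under the conventions above ($f$ the indicator of $S$, base-$2$ weights in $Z$): for a modest $N$, compute $\log Z$ exactly and scan the bound $\beta \E_\nu f + H(\nu)$ over the bias $p$. Gibbs’ inequality holds at every $p$, and the maximizer sits near $p = 2/3$ (for finite $N$ it lands a bit above):

```python
from math import comb, log2

N, beta = 60, 20
k0 = -(-2 * N // 3)                      # x is in S iff at least k0 coords are +1
S = sum(comb(N, k) for k in range(k0, N + 1))
log_Z = log2(S * 2**beta + (2**N - S))   # Z = |S| 2^beta + (2^N - |S|)

def h2(p):                               # binary entropy, in bits
    return -p * log2(p) - (1 - p) * log2(1 - p)

def lhs(p):                              # beta E_nu f + H(nu) for a bias-p product nu
    Ef = sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(k0, N + 1))
    return beta * Ef + N * h2(p)

best_val, best_p = max((lhs(i / 1000), i / 1000) for i in range(1, 1000))
print(f"log2 Z = {log_Z:.2f}")
print(f"best product bound = {best_val:.2f} at p = {best_p:.3f}")
# The bound never exceeds log2 Z, and the maximizer hovers a bit above 2/3.
```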
In any case, the answer is $|S| \overset\cdot= (2^{h_2(1/3)})^N = (3/2^{2/3})^N$, so $\alpha = 2^{h_2(1/3)}/2 = 3/2^{5/3} \approx 0.9449$, just shy of the CLT impostor $e^{-1/18} \approx 0.9460$. Did you guess it right?
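Double-checking the arithmetic:

```python
from math import exp, log2

h = -(1/3) * log2(1/3) - (2/3) * log2(2/3)
print(2**h, 3 / 2**(2/3))        # both ~1.88988: the per-coin count of S
print(2**h / 2, 3 / 2**(5/3))    # both ~0.94494: our alpha
print(exp(-1/18))                # ~0.94595: the near-miss CLT guess
```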
That’s where my talk ended.
Gibbs’ inequality lives in “log space”, and we can get a combinatorial interpretation by exponentiating it. But first, we have to take a detour.
Let’s say $(X_1,X_2,...,X_n)$ was drawn from $n$ independent copies of $\nu$, and write $p(x_1,...,x_n) = \nu(x_1)\cdots\nu(x_n)$ for the likelihood. The typical likelihood is $2^{-nH(\nu)}$: by the law of large numbers, $$\frac{1}{n} \log p(X_1,X_2,...,X_n) \overset p \to -H(\nu).$$ Concretely, this means that for any $\eps > 0$ and sufficiently large $n$, $$P\left(p(X_1,X_2,...,X_n) = 2^{-n(H(\nu) \pm \eps)}\right)\ge 1-\eps.$$
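This concentration is easy to watch happen; here’s a tiny simulation, taking $\nu$ to be the bias-$2/3$ distribution from earlier:

```python
import random
from math import log2

# nu on {-1, +1} with P(+1) = 2/3, so H(nu) = h2(1/3) ~ 0.9183 bits
q = 2 / 3
H = -q * log2(q) - (1 - q) * log2(1 - q)

random.seed(0)
for n in (100, 1000, 10000):
    xs = [1 if random.random() < q else -1 for _ in range(n)]
    log_lik = sum(log2(q if x == 1 else 1 - q) for x in xs)
    print(n, -log_lik / n, "vs", H)   # -(1/n) log2 p(X_1,...,X_n) -> H(nu)
```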
This has a useful consequence: each such “typical” sequence has likelihood at most $2^{-n(H(\nu)-\eps)}$, so any event of probability at least $\delta$ consisting of typical sequences must contain at least $\delta\cdot 2^{n(H(\nu)-\eps)}$ of them. Let’s put this to work.
We consider $n$ large enough that the following both hold: $$ \begin{aligned} P\left(p(X_1,X_2,...,X_n) = 2^{-n(H(\nu) \pm \eps)}\right) &\ge 1-\eps,\\ P\left(\beta(f(X_1)+...+f(X_n)) = n(\beta \E_\nu f(X) \pm \eps)\right) &\ge 1-\eps. \end{aligned}$$ (The second line is the law of large numbers again, this time applied to $f(X_i)$.)
Their intersection has probability mass at least $1-2\eps$, so the number of such sequences is at least $(1-2\eps)2^{n(H(\nu)-\eps)}$. On the other hand, $Z^n$ is the sum of $2^{\beta(f(x_1) + ... + f(x_n))}$ over all length-$n$ sequences, and each sequence in the intersection contributes at least $2^{n(\beta \E_\nu f(X) - \eps)}$. Thus $$ \begin{aligned} Z^n &\ge (1-2\eps)\, 2^{n(H(\nu) - \eps)} \cdot 2^{n(\beta \E_\nu f(X) -\eps )}\\ &= (1-2\eps)\, 2^{n(\beta \E_\nu f + H(\nu) - 2\eps)}. \end{aligned} $$ Taking logs and dividing by $n$, we get $$\log Z \ge \beta \E_{\nu} f + H(\nu) - 2\eps + \frac{1}{n}\log(1-2\eps);$$ the last term vanishes as $n \to \infty$, and then we simply take $\eps \to 0$.
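As a closing sanity check, the inequality we just derived is saturated by the Gibbs measure, which is quick to verify exactly on a small hypercube (same $f$ and base-$2$ convention as before):

```python
from itertools import product
from math import log2

N, beta = 10, 3
space = list(product((-1, 1), repeat=N))

def f(x):                                 # indicator that x_1 + ... + x_N >= N/3
    return 1 if sum(x) >= N / 3 else 0

Z = sum(2 ** (beta * f(x)) for x in space)
gibbs = {x: 2 ** (beta * f(x)) / Z for x in space}

Ef = sum(nu_x * f(x) for x, nu_x in gibbs.items())
H = -sum(nu_x * log2(nu_x) for nu_x in gibbs.values())
print(beta * Ef + H, log2(Z))             # equal, up to float rounding
```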
The actual way to prove these things rigorously is large deviations theory. There’s a whole textbook about it, so I will refrain from jamming it into a footnote of this blog post. Until next time!