Singular learning theory is a topic that I’ve always wanted to learn, since it sits at the intersection of learning theory and algebraic geometry, two fields which I’ve spent some time learning already. The core interface it has with algebraic geometry is through this one asymptotic integral, so I thought I would spend some time trying to understand it.
I was reading these slides from Shaowei Lin and it starts with the following claim:
(Singular asymptotics, AGV) Writing $\omega^\kappa = \omega_1^{\kappa_1} \cdots \omega_d^{\kappa_d}$ (and similarly for $\omega^\tau$), then as $N\to\infty$ the approximation $$ \int_{\R^d_{\ge 0}} e^{-N\omega^\kappa }\omega^\tau d\omega \approx CN^{-\lambda} (\log N)^{\theta - 1} $$ is correct to leading order, where $\lambda = \min_i \frac{\tau_i + 1}{\kappa_i}$ and $\theta$ is the multiplicity of the minimum (i.e. the number of indices $i$ attaining it).
Somehow, the integral knows to pick out the index $i$ for which this is minimal, and folds in the multiplicity in log form.
Why is this equation interesting? To me, this seems like it is the backbone of the main equation in singular learning. I’ll state what it is in a later post, since there is a lengthier context we need.
I want to attempt some simple cases of this integral, and see if we can recover the proof, or at least the intuition behind it.
The one variable case should be easy enough. We want $$ \int_0^\infty e^{-Nx^\kappa} x^\tau dx $$ We start with a simple substitution $u=x^\kappa$, so $x=u^{1/\kappa}$ and $dx = \frac{1}{\kappa} u^{1/\kappa - 1} du$. Then we have $$ \begin{aligned} \int_0^\infty e^{-Nx^\kappa} x^\tau dx &= \int_0^\infty e^{-Nu} u^{\tau/\kappa} \frac{1}{\kappa} u^{1/\kappa - 1} du \\ &= \frac{1}{\kappa} \int_0^\infty e^{-Nu} u^{\frac{\tau+1}{\kappa} - 1} du \end{aligned} $$ Aha, so the quantity $\frac{\tau+1}{\kappa}$ has made an appearance. Let’s call it $\lambda$. Then, let’s substitute $u\mapsto Nu$: $$ \begin{aligned} &= \frac{1}{\kappa}\int_0^\infty \frac{1}{N} e^{-u} \left(\frac{u}{N}\right)^{\lambda-1} du \\ &= \frac{1}{\kappa N^\lambda} \int_0^\infty e^{-u} u^{\lambda-1} du \\ &= \frac{\Gamma(\lambda)}{\kappa} \cdot N^{-\lambda} \end{aligned} $$ Awesome, so this is exactly what we want!
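Since the substitutions above are exact, the one-variable formula can be checked numerically. A quick sketch (the values of $N$, $\kappa$, $\tau$ below are arbitrary):

```python
import numpy as np
from math import gamma
from scipy.integrate import quad

def one_var_integral(N, kappa, tau):
    """Numerically compute the integral of exp(-N x^kappa) x^tau over [0, inf)."""
    val, _ = quad(lambda x: np.exp(-N * x**kappa) * x**tau, 0, np.inf)
    return val

# Arbitrary test values; lam = (tau + 1) / kappa is the predicted exponent.
N, kappa, tau = 50.0, 3.0, 1.0
lam = (tau + 1) / kappa
exact = gamma(lam) / kappa * N**(-lam)
print(one_var_integral(N, kappa, tau), exact)  # the two should agree
```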
Let’s try to do the integral
$$\int_{0}^\infty \int_{0}^\infty e^{-Nxy} dx\ dy.$$
I’ll be honest, initially I thought - couldn’t we just factor the expression across the variables? But obviously this is wrong, since we have a product of the variables in the exponent.
I started by reading Section 3 of Shaowei’s exercises here. There’s a nice idea to consider the so-called “state density function” $$ v(t) = \frac d {dt} \int_{0 < xy < t} dx\ dy. $$ (If we were integrating over a non-constant, we would integrate over that function here as well.) Then, our original integral becomes $$ \int_{0}^\infty \int_{0}^\infty e^{-Nxy} dx\ dy = \int_{0}^\infty e^{-Nt} v(t) dt. $$
This is really just a generalization of what we were doing earlier in the single variable case.
So let’s find $v(t)$. What is $\int_{0 < xy < t} dx\ dy$? For fixed $y > 0$, the condition $0 < xy < t$ means $x$ ranges over $(0, t/y)$, so $$ \frac d {dt}\int_{0 < xy < t} dx\ dy = \frac d {dt} \int_0^\infty \frac{t}{y} dy = \int_0^\infty \frac{1}{y} dy $$ …which doesn’t converge! Whaaaaaaat?
It turns out the original claim handwaved over some convergence issues. If we go digging around in Shaowei’s thesis, we see that (e.g. in Theorem 3.16) the conditions require that we integrate over a compact domain (or equivalently, that the integrand is supported on a compact domain).
So the theorem should read instead:
(More precise version) For any compact $\Omega\subset \R^d_{\ge 0}$, $$ \int_{\Omega} e^{-N\omega^\kappa }\omega^\tau d\omega \approx C N^{-\lambda} (\log N)^{\theta - 1}. $$
For our purposes, we can think of a sequence $\Omega_i \uparrow \R^d_{\ge 0}$, and try to evaluate that.
This is kind of bad news for us, because it complicates the computation a fair bit. Let’s pick $\Omega = [0,M]^2$ for now, and try to evaluate it “by hand”: $$ \int_{0}^M \int_{0}^M e^{-Nxy} dx\ dy = \int_0^M \frac{1}{Ny} (1-e^{-NMy}) dy $$ This integral, unfortunately, has no closed form, so we estimate: the integrand is roughly $M$ for small $y$ and roughly $\frac{1}{Ny}$ for large $y$.
Picking the threshold to be at $1/(MN)$ allows us to be at most a constant off, so $$ \begin{aligned} &\approx \int_0^{1/(MN)} M dy + \int_{1/(MN)}^M \frac{1}{Ny} dy \\ &= \frac 1 N + \frac{1}{N} \left(\log M - \log \frac{1}{MN}\right) \\ &= \frac 1 N \left(\log N + 2\log M + 1\right) \end{aligned} $$ and here we can declare victory (even though technically we haven’t established the convergence of $C$, just that it has to be bounded).
I actually threw this into ChatGPT o1, and it came up with the following: first, define $E_1$ by $$ \int_0^a \frac{1-e^{-t}}{t} dt = \gamma + \log a + E_1(a) $$ where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant. It is known that $E_1(a)$ has the asymptotic expansion $$ E_1(a) = \frac{e^{-a}}{a} \left(1 - \frac{1}{a} + \cdots\right) $$ so $E_1(a)$ decays to $0$ exponentially fast as $a\to \infty$. The overall integral evaluates to $$ \int_0^M \int_0^M e^{-Nxy} dx\ dy = \frac{1}{N} \left(\log N + 2\log M +\gamma + E_1(NM^2)\right) $$ so we were not too far off!
The surprising thing is that this suggests that this estimate is really good - past the first two terms, the error is exponentially small in $NM^2$!
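This closed form is easy to check against brute-force quadrature; scipy’s `exp1` is exactly the $E_1$ above (a sketch with arbitrary $N$ and $M$):

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.special import exp1

def box_integral(N, M):
    """Compute the integral of exp(-N x y) over [0, M]^2 by direct quadrature."""
    # dblquad takes the inner variable first; here the integrand is symmetric anyway.
    val, _ = dblquad(lambda y, x: np.exp(-N * x * y), 0, M, 0, M)
    return val

N, M = 200.0, 2.0
closed_form = (np.log(N) + 2 * np.log(M) + np.euler_gamma + exp1(N * M**2)) / N
print(box_integral(N, M), closed_form)  # the two should agree
```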
It’s still very counterintuitive to me that we have to fix the compact domain before taking $N\to\infty$. But let’s try to figure out if we can sniff out what’s going on here.
Let’s think of the weight distribution - this should be pretty “spiky”, concentrating near the origin and the axes. To see this precisely, we chop up $[0, M]$ into $[0, M']$ and $[M', M]$ in each variable. The same computation as before gives $$ \int_0^{M'} \int_0^M e^{-Nxy} dx\ dy = \frac{1}{N} \left(\log N + \log M' + \log M +\gamma + E_1(NMM')\right) $$

Assuming that the $E_1$ term is small, we have very clean expressions for what the weight in each quadrant is:
[picture]
So let’s think of two possibilities:
(1) If we fix $M'=1$, then we see that the leading term is concentrated in the box $[0,1]\times [0,1]$. The terms in $M$ come from the “legs” $[0,1]\times [1, M]$, up to corrections exponentially small in $N$.
(2) Let’s push this further: pick $M'$ such that all the $E_1$ terms are $O(1)$. This means that by ignoring the $E_1$ term (and $\gamma$), we can be at most $O(1/N)$ off. The correct threshold for this is $M'\approx 1/\sqrt{N}$. So, we see that:
So all of the leading terms greater than $O(1/N)$ come from the legs!
Combining both (or just simply setting $M=1$), we conclude that the bulk must sit in “legs” of the form $[N^{-1/2}, 1]\times [0,N^{-1/2}]$.
It’s perhaps not crazy to extrapolate to the following explanation: the mass of the integral concentrates in thin “legs” hugging the axes, and integrating the $\frac{1}{Ny}$-shaped density along a leg is exactly what produces the $\log N$.
We could stare at this some more, but we could also move on to the next simplest example.
Let’s try to do the integral $$ \int_0^{M'} \int_0^M x e^{-Nxy} dx\ dy. $$
(Yes, we’re going immediately for the version with different limits, because we know that’s what we want for the quadrant analysis.)
Shockingly, o1 tells me that this has a closed form:
$$ \int_{0}^{M'}\!\int_{0}^{M} x\,e^{-Nxy}\ dx\ dy = \frac{M}{N} - \frac{1}{N^2M'} + \frac{e^{-NM'M}}{N^2M'} $$
Woah, this looks kind of different. In particular, this tells us that the weight is fully concentrated along the $x$-axis, and we don’t particularly care what happens along the $y$-axis. In fact, picking $M'=1/(MN)$, we see that most of the weight sits in $[0, M] \times [0, 1/(MN)]$, and is more or less evenly distributed along $x$.
This is perhaps not surprising if we interpret this integral as a weighted version of the previous integral - what happened was that for $[0, N^{-1/2}] \times [0, 1]$, the extra $x$ factor completely nukes the value of the integral (including the main term earlier on), so we’re left with just the $O(1/N)$ bits.
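Nothing stops us from double-checking o1’s closed form numerically either (a sketch with arbitrary $N$, $M$, $M'$):

```python
import numpy as np
from scipy.integrate import dblquad

def weighted_box(N, M, Mp):
    """Compute the integral of x exp(-N x y) over [0, M] x [0, M'] by quadrature."""
    # dblquad integrates the outer variable (x) over [0, M], the inner (y) over [0, Mp].
    val, _ = dblquad(lambda y, x: x * np.exp(-N * x * y), 0, M, 0, Mp)
    return val

N, M, Mp = 100.0, 2.0, 0.5
closed_form = M / N - 1 / (N**2 * Mp) + np.exp(-N * Mp * M) / (N**2 * Mp)
print(weighted_box(N, M, Mp), closed_form)  # the two should agree
```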
At this point, I wasn’t getting more out of this, so I decided to revisit the 1-variable case.
There are actually a few key pieces that we can pick up from the 1-variable case.
By substitution, the only integral we ever need to consider is $$\int_\Omega e^{-N\omega} \omega^{\lambda-1} d\omega.$$
(If you’re keeping track, we’ve dropped a constant factor of $1/\prod_i \kappa_i$ here, which doesn’t affect the asymptotics in $N$.)
The next thing we can zoom in on is how we can deal with the compactness of the domain. One question we’ve had all this while is whether (or why) the size of the domain matters (for instance, having the dependency on $M$ when we pick the domain to be $[0,M]^d$). It shouldn’t, if the main contributing region gets arbitrarily near to the origin as $N\to\infty$, but it could if the “legs” become the main contributing term. We’ll keep this question around as a litmus test for our computations.
Well, recall that the next thing we did was a substitution pulling $N$ out of the exponent. In the 1-variable case, this means that
$$\int_0^1 e^{-Nx} x^{\lambda-1} dx = N^{-\lambda} \int_0^N e^{-x} x^{\lambda-1} dx$$
and here is the crux: the resulting integral converges to $\Gamma(\lambda)$ in $N$, and furthermore it does not matter whether we picked $[0,1]$ or $[0,M]$ as the domain. With the substitution $x\mapsto Nx$ we have made our integral “scale invariant”.
Now, let me try to attempt the same thing but for two variables, and we will see if we can squeeze out the $\log N$ term. We start with $$\int_{[0,1]^2} e^{-Nxy} (xy)^{\lambda-1} dx\ dy = N^{-\lambda}\int_{[0,\sqrt{N}]^2} e^{-xy} (xy)^{\lambda-1} dx\ dy$$
where we evenly distribute the scaling. Now, here we revisit the “state density function” idea: we want $$ \int_0^t v(s)\, ds = \mathrm{Vol}(\{xy\le t: x,y\in [0,\sqrt{N}]\}) $$
which works out to $v(t) = \log N - \log t$ (and one can sanity check that at $t=N$ we have $v(N) = 0$). Thus, $$ \int_{[0,\sqrt{N}]^2} e^{-xy} (xy)^{\lambda-1} dx\ dy = \int_0^{N} e^{-t} t^{\lambda-1} v(t) dt. $$
However, most of the weight of the integral sits in the region $t\in [0,\lambda]$, and in this region $v(t)$ is roughly $\log N$. So for the purposes of asymptotics, this ends up being $\Gamma(\lambda)\log N$ to leading order.
This can be done in general: if you have $k$ variables, then $v(t)$ has the nice form $$v(t) = \frac{(\log N - \log t)^{k-1}}{(k-1)!}$$ so we recover exactly the conclusion that we want.
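This reduction through $v(t)$ is an exact identity, not just an asymptotic, so we can check it numerically to high precision. A sketch for $k = 2$, with arbitrary $\lambda$ and $N$:

```python
import numpy as np
from scipy.integrate import quad, dblquad

lam, N = 1.5, 50.0

# Direct evaluation of the integral of exp(-N x y) (x y)^{lam-1} over [0,1]^2.
direct, _ = dblquad(lambda y, x: np.exp(-N * x * y) * (x * y)**(lam - 1),
                    0, 1, 0, 1)

# Reduction: after rescaling, N^{-lam} * integral of exp(-t) t^{lam-1} (log N - log t)
# over [0, N], using the state density v(t) = log N - log t.
reduced, _ = quad(lambda t: np.exp(-t) * t**(lam - 1) * (np.log(N) - np.log(t)),
                  0, N)
reduced *= N**(-lam)

print(direct, reduced)  # the two should agree
```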
Finally, the last question we need to answer is why we only care about the indices with the minimal $\lambda$. I think the answer to this will be apparent when we apply the rescaling trick to the variables themselves.
In order to see this, let’s try the following integral
$$\int_{[0,1]^3} e^{-Nxyz} z\, dx\, dy\, dz.$$
The theory tells us that this integral should be “similar” to the two-variable integral $\int_{[0,1]^2} e^{-Nxy}\, dx\, dy$, so let’s do our best:
$$ \begin{aligned} \int_{[0,1]^3} e^{-Nxyz} z\, dx\, dy\, dz &= N^{-1}\int_{[0,\sqrt{N}]^2 \times [0,1]} e^{-xyz} z\, dx\, dy\, dz \\ &= N^{-1} \int_{[0,1]} z \left(\int_{[0,\sqrt{N}]^2} e^{-xyz} dx\, dy\right) dz \\ &= N^{-1} \int_{[0,1]} \left(\int_{[0,\sqrt{Nz}]^2} e^{-xy} dx\, dy\right) dz \\ &\approx N^{-1}\int_{[0,1]} \log N dz = N^{-1} \log N \end{aligned} $$
So the key trick here was to lump $z$ in together with $N$. Here, we also see that if we changed the limits for $z$ to $[0, M_z]$, we would certainly see that term come in as an additional factor, but not so much for $x$ or $y$.
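We can sanity-check the $N^{-1}\log N$ answer numerically (a sketch with scipy’s `tplquad`; this is a leading-order claim, so at moderate $N$ we only expect agreement up to $O(1/\log N)$ relative error):

```python
import numpy as np
from scipy.integrate import tplquad

N = 500.0

# Integral of z exp(-N x y z) over [0,1]^3; tplquad takes the innermost variable first.
val, _ = tplquad(lambda z, y, x: z * np.exp(-N * x * y * z),
                 0, 1, 0, 1, 0, 1)

prediction = np.log(N) / N
print(val / prediction)  # roughly 1, up to O(1/log N) relative error
```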
But what would stop us from doing this in a different order? For instance, why not scale only $z$?
$$ \begin{aligned} \int_{[0,1]^3} e^{-Nxyz} z\, dx\, dy\, dz &= N^{-2}\int_{[0,1]^2\times [0, N]} e^{-xyz} z\, dx\, dy\, dz \\ &= N^{-2}\int_{[0,1]^2} \frac{1}{(xy)^2}\left(\int_0^{Nxy} e^{-z} z\, dz\right) dx\, dy \end{aligned} $$

Aha, so this is where the problem strikes. The integral $\int_{[0,1]^2} \frac{1}{(xy)^2} dx\, dy$ diverges, so we cannot justify taking the limit $N\to\infty$ on the inner term.
To be extremely clear, let’s switch out $z$ for $z^{\lambda-1}$. By our hypothesis, this trick must work when (and only when) $z$ has the smallest $\lambda$ (i.e. $\lambda <1$).
$$ \begin{aligned} \int_{[0,1]^3} e^{-Nxyz} z^{\lambda-1}\, dx\, dy\, dz &= N^{-\lambda} \int_{[0,1]^2\times [0, N]} e^{-xyz} z^{\lambda-1}\, dx\, dy\, dz \\ &= N^{-\lambda}\int_{[0,1]^2} (xy)^{-\lambda}\left(\int_0^{Nxy} e^{-z} z^{\lambda-1}\, dz\right) dx\, dy \end{aligned} $$
Because $\int_{[0,1]^2} (xy)^{-\lambda} dx\,dy = \frac{1}{(1-\lambda)^2}$ converges precisely when $\lambda < 1$, and the inner integral converges to $\Gamma(\lambda)$, this gives us precisely what we want!
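Concretely, the displays above predict $\int_{[0,1]^3} e^{-Nxyz} z^{\lambda-1}\, dx\, dy\, dz \approx \frac{\Gamma(\lambda)}{(1-\lambda)^2} N^{-\lambda}$. For $\lambda = 1/2$ the $z$-integral has the closed form $\int_0^1 z^{-1/2} e^{-az}\, dz = \sqrt{\pi/a}\,\operatorname{erf}(\sqrt{a})$, which reduces the check to a double integral. A sketch (the constant converges slowly, so we only expect rough agreement at moderate $N$):

```python
import numpy as np
from math import gamma
from scipy.integrate import dblquad
from scipy.special import erf

lam, N = 0.5, 1.0e4

def z_slice(a):
    """Integral of z^{-1/2} exp(-a z) over [0, 1]: sqrt(pi/a) * erf(sqrt(a))."""
    return np.sqrt(np.pi / a) * erf(np.sqrt(a))

val, _ = dblquad(lambda y, x: z_slice(N * x * y), 0, 1, 0, 1)
prediction = gamma(lam) / (1 - lam)**2 * N**(-lam)
print(val / prediction)  # slightly below 1
```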
These pieces fit together into a complete argument for the main integral in full generality, but I’ll leave that for you to work out yourself if you’re interested.
We also pass the litmus test: for indices $i$ with non-minimal $\lambda_i$, we expect the boundary $M_i$ to show up in the final answer as a factor of $M_i^{\kappa_i(\lambda_i - \lambda)}$, coming from $\int_0^{M_i} \omega_i^{\tau_i - \lambda\kappa_i} d\omega_i$.
I think by bumping around as we did, we uncovered a lot more than if we had just read the proof. The original proof in AGV is hard to understand, but Section 3 of Shaowei’s exercises here (as well as Chapter 4 of Watanabe’s grey book) give a good overview.
The key here is the Mellin transform, which turns $f(t): (0,\infty)\to \mathbb C$ into
$$F(z) = \int_0^\infty f(t) t^z dt.$$
Under regularity conditions, this transform can be inverted:
$$f(t) = \frac{1}{2\pi i} \int_{\sigma - i\infty}^{\sigma + i\infty} F(z) t^{-z-1} dz$$ (where we need absolute convergence on some band around $\mathrm{Re} (z) = \sigma$).
The implication for us is that actually, $v(t)$ can be constructed from a “zeta function”. In the $\lambda = 1$ case (where we had an explicit formula for $v(t)$), the zeta function is $$\zeta(z) = \int_{[0,1]^k} \omega^z d\omega =\int_0^1 v(t) t^z dt$$ which is the Mellin transform of $v(t)$. But because we can explicitly evaluate $$\zeta(z) = \frac{1}{(1+z)^k}$$ the inverse Mellin transform can be read off to be $v(t) = \frac{(-\log t)^{k-1}}{(k-1)!}$ (which matches what we had before), and then the original integral is straightforward.
This can be done much more generally. Let’s say we want to evaluate the leading term of $$Z(N) = \int_\Omega e^{-N|f(\omega)|} g(\omega) d\omega.$$
Then, we can define the state density in general to be $$v(t) = \frac d {dt} \int_{0 < |f(\omega)| < t} g(\omega) d\omega$$ and then the zeta function is $$\zeta(z) = \int_\Omega g(\omega) |f(\omega)|^z d\omega = \int_0^\infty v(t) t^z dt.$$ Typically, $\zeta(z)$ is some meromorphic function, and we have the following handy result:
(Example 4.7 from the Grey book) Inverting $F(z) = \frac{\theta^z}{(z+\lambda)^k}$ gives us $$f(t) = \frac{\theta^{-\lambda}}{(k-1)!} t^{\lambda-1} (\log \theta - \log t)^{k-1}_{+}$$
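Under the convention $F(z) = \int_0^\infty f(t) t^z\, dt$ used above, this inversion pair is easy to verify numerically: the Mellin transform of the claimed $f(t)$ should reproduce $\frac{\theta^z}{(z+\lambda)^k}$. A sketch with arbitrary $\theta$, $\lambda$, $k$, and test point $z$:

```python
import numpy as np
from math import factorial
from scipy.integrate import quad

theta, lam, k = 2.0, 0.5, 3

def f(t):
    """The claimed inverse: theta^{-lam}/(k-1)! * t^{lam-1} (log theta - log t)_+^{k-1}."""
    if t >= theta:
        return 0.0
    return theta**(-lam) / factorial(k - 1) * t**(lam - 1) \
        * (np.log(theta) - np.log(t))**(k - 1)

def mellin(z):
    # Mellin transform with the document's t^z convention; f vanishes past theta.
    val, _ = quad(lambda t: f(t) * t**z, 0, theta)
    return val

z = 1.3
print(mellin(z), theta**z / (z + lam)**k)  # the two should agree
```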
so by taking the Laurent series and inverting, we can get $v(t)$ and directly evaluate $$Z(N) = \int_\Omega e^{-N|f(\omega)|} g(\omega) d\omega = \int_0^\infty v(t) e^{-Nt} dt.$$
Squinting a bit, what we have done is exactly a special case of this general argument.
I think this integral was surprisingly tricky. Initially, I was kind of skeptical of the zeta function approach, but we ended up roughly inventing it ourselves anyway. It certainly gives us more context to move forward to the main claim in SLT.
One interaction which I’m curious about is how this relates to statistical physics, where the partition function takes into account the randomness of the exponent. In that world, all minima are assumed to be regular, but there are possibly exponentially many of them in a low-temperature regime.