X-posted from the SIMO X-Men blog.
This is the last installment in the series sparked off by the SMO Open Q4 problem, which led us to introduce some basic learning theory.
If you’re deciding whether to read this, the post splits roughly into two parts:
- the first part is about dealing with a certain flavor of inequality, where you average a maximum over all $2^n$ combinations of $\pm$ signs (a representative quantity is displayed below).
- the second part is more conventional ML theory content, tying generalization error to the Rademacher complexity of loss functions.
I would suggest reading just the first part unless you are really dying to see where Rademacher complexity comes from.
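For concreteness, the flavor of quantity in the first part is

$$ \mathbb E_{\sigma \in \{\pm 1\}^n} \left[ \max_{a \in A} \sum_{i=1}^n \sigma_i a_i \right], $$

where the average is over all $2^n$ equally likely sign vectors $\sigma$ and $A \subseteq \mathbb R^n$ is a finite set of vectors; up to normalization, this is the Rademacher complexity of $A$.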
In this blogpost, we’ll go off on a tangent and explore what it means to learn from data. In the process, we will get slightly closer (but not quite there) to the context where Rademacher complexity emerges.
This is the second blogpost in the sequence, continuing from BTS I.
$$\newcommand\lrangle[1]{\left\langle #1 \right\rangle}$$
$$\newcommand \E {\mathbb E}$$
$$\newcommand \cH {\mathcal H}$$
$$\newcommand \Var {\mathrm{Var}}$$
$$\newcommand \MSE {\mathrm{MSE}}$$
Recently, a problem of mine came out as Q4 on SMO Open 2024. I’ll go over how I proposed the problem and the theory behind it, which involves some machine learning ideas (!).
This will take a small sequence of posts:
- Part 1: a complete backstory for how this problem came about
- Part 1.5: trying to understand Rademacher complexity through examples
- Part 2: an introduction to learning theory
$\newcommand\E{\mathbb E}$
$\newcommand \cN {\mathcal N}$
$\newcommand \one{\mathbf 1}$
$\newcommand \Var {\mathrm{Var}}$
$\renewcommand \Im {\mathrm{Im}}$
$\renewcommand \Re {\mathrm{Re}}$
$\newcommand \tr {\mathrm {tr}}$
Here’s a folklore fact from statistical physics: given a random variable $X$ with density $p$, sampling from the tilted density $\propto p(x)e^{\beta x}$ is exactly the same as:
(1) sampling a large number $N$ of independent copies $X_1, \dots, X_N$,
(2) conditioning on the sum being $\sum X_i = N(u + o(1))$, where
$$ u = \frac{\E Xe^{\beta X}}{\E e^{\beta X}}$$
(i.e. conditioning the mean to match that of the target distribution)
(3) picking one of the samples at random.
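As an aside, this $u$ is precisely the derivative of the cumulant generating function of $X$:

$$ u = \frac{\E\, Xe^{\beta X}}{\E\, e^{\beta X}} = \frac{\mathrm d}{\mathrm d\beta} \log \E\, e^{\beta X}, $$

which is one way to remember the formula.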
In this post, we’ll try to state it rigorously and prove it.
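Before doing anything rigorous, here is a minimal numerical sketch of the fact (in Python; the Gaussian example and all parameter values below are illustrative choices of mine, not part of the claim). For $X \sim \cN(0,1)$, the tilted density $\propto e^{-x^2/2}e^{\beta x}$ is exactly $\cN(\beta, 1)$, so $u = \beta$:

```python
import numpy as np

# Sanity check of the folklore fact for X ~ N(0, 1).
# Tilting the standard Gaussian density by exp(beta * x) gives exactly N(beta, 1),
# so the tilted mean is u = beta. All parameter values below are arbitrary.
rng = np.random.default_rng(0)
beta, N, trials, eps = 0.5, 20, 200_000, 0.05
u = beta

# (1) sample many batches of N independent X's
batches = rng.standard_normal((trials, N))

# (2) condition on the sum: keep only batches with sum X_i ~ N * u
kept = batches[np.abs(batches.mean(axis=1) - u) < eps]

# (3) from each surviving batch, pick one of the N samples uniformly at random
picks = kept[np.arange(len(kept)), rng.integers(N, size=len(kept))]

print(f"{len(kept)} of {trials} batches survive the conditioning")
print(f"picked-sample mean: {picks.mean():.3f}  (tilted mean u = {u})")
print(f"picked-sample var:  {picks.var():.3f}  (tilted var = 1, exact as N -> oo)")
```

Even at this small $N$, the picked samples match the tilted mean well; the variance comes out slightly below $1$ because conditioning on the sample mean removes a degree of freedom (for the Gaussian, $X_i$ given the mean is $\cN(u, 1 - 1/N)$), an effect that disappears as $N \to \infty$.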