Mathematics as Search Problems
Talk by Henry is availalbe at Youtube.
Supervised Learning
A typical machine learning problem setup:
- Input: $\mathcal{X}$
- Output: $\mathcal{Y}$
- distribution $p$ on $\mathcal{X}\times\mathcal{Y}$
- data $(x_ i, y_ i)\sim p$ i.i.d
- Goal: want to find $h:\mathcal{X}\rightarrow \mathcal{Y}$, such that $(x_ i, y_ i)\sim p$, then $h(x_ i)\approx y_ i$.
We have a loss function to metrize "approximation": $$\ell: \mathcal{Y}\times\mathcal{Y}\rightarrow \mathbb{R}_ {\geq 0},$$ with $\ell(y,y) = 0$ for any $y$.
Then supervised learning is $\textrm{min}_ {h}\mathbb{E}_ {(x,y)\sim p}\ell(h(x),y)$, where $y$ is the label, and $\hat{y} = h(x)$ is the prediction.
More precisely, we have a searching space $\mathcal{F}$. Given $\mathcal{F}\subseteq \mathrm{Fun}(\mathcal{X},\mathcal{Y}),$ we want to find $\textrm{min}_ {h\in\mathcal{F}}\mathbb{E}_ {(x,y)\sim p}\ell(h(x),y)$.
If we sampled $(x_ i, y_ i)$ independly $1\leq i \leq n$, and if $\mathcal{F}:=\{h_ \theta: \theta\in \Theta\}$ is a family of functions parametrized by $\theta\in \Theta$, then our loss is estimated as $$\hat{L}_ {n}(\theta) := \frac{1}{n} \sum_ {i=1}^n\ell(h_ \theta(x_ i), y_ i).$$
Error source:
- Function class expression ability: $|\textrm{min} \mathbb{E}_ {(x,y)\sim p}\ell(h(x), y) - \textrm{min}_ {\theta\in \Theta}| L(\theta)$.
- Uniform convergence.
- Optimization error.
Let $\hat{\theta}$ be some $\theta$ we found by algorithm and $\theta^\ast$ is the optimal of $\textrm{min}_ {\theta\in \Theta}| L(\theta)$. Then $$L(\hat{\theta}) - L(\theta^\ast) = (L(\hat{\theta}) - \hat{L}_ n(\hat{\theta})) + (\hat{L}(\hat{\theta}) - \hat{L}(\theta^\ast)) + (\hat{L}_ n(\theta^\ast)-L(\theta^\ast))$$ decomposes the overall error into the three sources above.
Language Models
Theorem 1 (Universal Approximation Theorem).
Definition 2.
We say $\mathcal{F}\subseteq L^\infty_ {\textrm{loc}}(\mathbb{R}^n)$ is dense in $C(\mathbb{R}^n)$, if for any continuous function $g\in C(\mathbb{R}^n)$, for any $K\subseteq \mathbb{R}^n$ compact, there exists $f_ i\in \mathcal{F}$, with $$\textrm{lim}_ {j\rightarrow \infty} ||g-f_j||_ {L^\infty(K)} = 0.$$
Given an activation function $\sigma\in L_ {\textrm{loc}}^\infty(\mathbb{R})$ in $M$, where $M\subseteq L^\infty(\mathbb{R})$ consists of function $f$ such that the closure of the set $\{x\in\mathbb{R}: f\textrm{ discontinuous at }x\}$ has Lebesgue measure 0. Then $\Sigma_ n:= \textrm{span}(\sigma(\langle w,x\rangle)+\theta: w\in\mathbb{R}^n,\theta\in\mathbb{R})$ is dence in $C(\mathbb{R}^n)$ if and only if $\sigma\notin \mathbb{R}[x]$ almost everywhere.
This theoretically explains why one can even expect neural network could possibly work.
Definition 3.
Vocabulary is a set $S$ of "tokens".
Definition 4.
A language model predicts a distribution over tokens, given a sequence of tokens of some length $0\leq \ell\leq\ell_ {\textrm{max}}$. Formally, this is a function $$\mathcal{L}: \bigcup_ {j=0}^{\ell_ {\textrm{max}}} S^j \rightarrow \mathcal{P}(S).$$
The loss function $\ell(p_ 1, p_ 2) := H(p_ 1, p_ 2)=\mathbb{E}_ {p_ 1}(\log p_ 2) = H(p_ 1)+ D_ {\mathrm{KL}}(p_ 1||p_ 2).$
Math as a search problem
Why is this related to math?
A over-simplified workflow: Exploration $\rightarrow$ Formalization $\rightarrow$ Optimization $\rightarrow$ Distillation.
"Theorem" by Tao: a paper is submittalbe if it is a local max of results/effort.
We currenttly focus on exploration part, which includes finding valuable conjectures and informal reasoning to prove it.
Lesson (Sutton 2019) on search and learning.
Demo: OpenEvolve
Demo Google Colab is shared by Henry (It is still under review by Henry and me at this moment). If this is your first time using Colab and do not know what to do, see Shurui's quick guide.