David Kewei Lin


Online Learning and Statistical Physics

From the theory of ODEs, we know that an effective way to keep track of a seemingly untameable system is through a potential function (a.k.a. a Lyapunov function, or more simply a monovariant). We’ll see two examples from learning theory that use a potential function inspired by statistical physics.

Thanks to Caleb Koch for teaching this material during Li-Yang’s summer seminar (which in turn derived it from Understanding Machine Learning by Shalev-Shwartz and Ben-David).

$\newcommand{\bv}{\mathbf v}$ $\newcommand{\bw}{\mathbf w}$ $\newcommand{\lrangle}[1]{\left\langle #1 \right\rangle}$ $\newcommand\eps{\varepsilon}$ $\newcommand\E{\mathbb E}$ $\newcommand\I{\mathbb I}$

Online prediction with expert advice (Hedge)

Let’s say you would like to predict something every day (e.g. chance of rain tomorrow, stock prices), but this prediction problem is hard. Luckily, you have $d$ experts offering advice, and every day you have to decide whose advice you want to follow.

Because we’re pessimists, we think about the cost ($\in [0,1]$) of following each expert’s advice on day $t$: 0 represents a perfect prediction, while 1 represents the worst. So you could think of our problem as minimizing $$ \sum_{t=1}^T v_{c_t,t}$$ where $v_{i,t}$ is the cost of following the advice of expert $i$ on day $t$, and $c_t$ is the choice you make on day $t$.

It turns out that the extent to which we can minimize this at all is somewhat nuanced, so let’s look at a few special scenarios.

What the learning algorithm cannot do

We’re going to motivate the definition of our eventual goal by looking at a few scenarios where we throw our hands up and concede that we simply can’t do any better.

First, it is entirely possible that the expert predictions are completely doo-doo: if $v_{i,t}=1$ for all $(i,t)$, then we incur cost $T$ no matter what. So there is no absolute guarantee possible - instead, we have to

(1) measure the incurred cost relative to some “ideal” scenario

This difference in cost is sometimes called the regret, because this would be how much better you could have done in hindsight (in some sense).

So maybe we’d like to control the quantity $$\sum_{t=1}^T v_{c_t, t} - \min_{(o_t)} \sum_{t=1}^T v_{o_t, t}$$ where $(o_t)$ is the sequence of “oracle actions” - the best possible sequence of actions? But even this is too much to hope for: we could think of a coin-flip guessing game where there is a devil controlling the outcome of the coin flip. If you had two experts, one who always guesses heads and one who always guesses tails, then the oracle cost is 0 but your incurred cost against an adversarial coin-flip is once again $T$. The way around this is that we

(2) allow our choice $c_t$ to be randomized, and evaluate our loss in the average case

In game-theoretic terms, this would be allowing for “mixed strategies”, instead of just having a “pure strategy”. This partially circumvents the unfairness of the adversary - but it remains the case that we’re still down $1/2$ per round compared to the oracle. Furthermore, as the number of experts increases, the adversary could assign cost 0 to the expert we gave the least probabilistic weight to (which is then the oracle answer), and cost 1 to the rest, leaving us down $1-1/d$ per round.

At this point maybe we realize that the oracle is too powerful, and so we concede doing that well and say instead

(1’) measure against the best possible fixed strategy (i.e. not adaptive)

That is, we would like $$\E\left[\sum_{t=1}^T v_{c_t, t}\right] - \min_{i} \sum_{t=1}^T v_{i,t}$$ to be small, where the expectation is over our randomized choices $c_t$. Now, this turns out to be possible!

How well will we not be able to do?

To get a sense of how good a bound we can hope for, consider a completely randomized scenario where the $v_{i,t}$ are independently 0 or 1 with equal probability.

Regardless of what we do, our total cost is the sum of $T$ independent coin flips, so it is $T/2 + O(\sqrt{T})$. However, the oracle answer is the minimum of $d$ such sums, so approximating each sum as an independent normal we get that the answer is $T/2 - O(\sqrt{T\log d})$. (If you have never tried this computation, a hint is to think about the CDF.)
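If you’d like to see this numerically, here is a quick simulation sketch (the particular values of $T$ and $d$, and the uniformly random policy, are just illustrative choices): it draws i.i.d. fair-coin costs, compares any policy’s cost with the best expert in hindsight, and prints the $T/2 - O(\sqrt{T\log d})$ heuristic alongside.

```python
import numpy as np

# Rough check of the heuristic: any policy pays about T/2, while the best expert
# in hindsight (the minimum of d independent binomial sums) pays noticeably less.
rng = np.random.default_rng(0)
T, d = 10_000, 50
costs = rng.integers(0, 2, size=(T, d))                    # v_{i,t} i.i.d. in {0, 1}

our_cost = costs[np.arange(T), rng.integers(0, d, size=T)].sum()   # uniformly random choices
best_fixed = costs.sum(axis=0).min()                       # best fixed expert in hindsight

print(our_cost, best_fixed, T / 2 - np.sqrt(T * np.log(d) / 2))    # rough prediction for the minimum
```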

So a reasonable guess here is that we cannot do any better than $O(\sqrt{T\log d})$. Nicely enough, it turns out that the algorithm we will see achieves exactly this bound.

The algorithm

Here’s the intuition: the expert with the lowest cumulative cost so far is probably the best one. However, we saw that this kind of strategy can be adversarially counter-played, so we’d like a “smoothed out” version of it.

So the slightly better thing is: at time $t$, we form distribution $\bw_t$ such that $$\bw_t(i) \propto \exp\left(-\eta \cdot \sum_{t'\lt t} v_{i, t'}\right)$$ where $\eta \gt 0$ can be thought of as the inverse temperature or learning rate. The higher $\eta$ is, the more we skew our distribution according to the cumulative history.
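As a minimal sketch in code (the function name and interface below are just one illustrative choice, not a standard API), the whole algorithm is a running sum of costs plus a softmax:

```python
import numpy as np

def hedge(cost_vectors, eta):
    """Exponential-weights (Hedge) sketch.

    cost_vectors: iterable of length-d arrays with entries in [0, 1].
    Returns the total expected cost sum_t <w_t, v_t>.
    """
    cost_vectors = list(cost_vectors)
    d = len(cost_vectors[0])
    cumulative = np.zeros(d)                # sum_{t' < t} v_{i, t'}
    total = 0.0
    for v in cost_vectors:
        logits = -eta * cumulative
        w = np.exp(logits - logits.max())   # subtract the max for numerical stability
        w /= w.sum()                        # w_t(i) proportional to exp(-eta * cumulative_i)
        total += w @ v                      # expected cost <w_t, v_t> this round
        cumulative += v
    return total
```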

Now for the analysis. We would like to bound $\sum_t \lrangle{\bw_t, \bv_t}$ where $\bv_t = (v_{i,t})_i$. The key idea is that this sum is controlled by the increments in the free energy $\log Z_t$, where $Z_t$ is the normalization constant when forming the probability distribution $\bw_t$: $$\log Z_t = \log \sum_{i=1}^d \exp \left(-\eta \cdot \sum_{t'\lt t} v_{i, t'}\right).$$

The point is that, because of how we set up the distributions, the ratio of successive normalization constants (whose log is the increment in free energy) can be expressed as an expectation:

$$\frac{Z_{t+1}} {Z_t} = \E_{i\sim \bw_t} \exp(-\eta v_{i,t})$$
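To spell this identity out, expand $Z_{t+1}$ and peel off the most recent cost from each exponent: $$\frac{Z_{t+1}}{Z_t} = \frac 1{Z_t}\sum_{i=1}^d \exp\left(-\eta \sum_{t'\le t} v_{i,t'}\right) = \sum_{i=1}^d \bw_t(i)\, \exp(-\eta\, v_{i,t}),$$ since $\bw_t(i) = \exp\left(-\eta\sum_{t'\lt t} v_{i,t'}\right)/Z_t$ by definition.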

In particular, using $e^{-a} \le 1 - a + \frac {a^2} 2$ for $a\in [0,1]$, $$\frac{Z_{t+1}} {Z_t} = \E_{i\sim \bw_t} \exp(-\eta v_{i,t}) \le \E_{i\sim \bw_t} \left[1 - \eta v_{i,t} + \frac{\eta^2 v_{i,t}^2} 2 \right] $$ where we must assume that $\eta \le 1$ (so that $a = \eta v_{i,t} \in [0,1]$). You should already get a sense that this ends up giving us an upper bound on $\eta \lrangle{\bw_t, \bv_t}$. Taking logs and using $\log(1-x) \le -x$,

$$ \log Z_{t+1} - \log {Z_t} \le \log\left(1-\E_{i\sim\bw_t} \left[ \eta v_{i,t} - \frac{\eta^2}{2} v_{i,t}^2 \right] \right) \le -\eta \lrangle{\bw_t,\bv_t} + \frac{\eta^2}{2}$$ (in the last step we also used $v_{i,t}^2 \le 1$).

Hence, $$\log Z_T \le \log Z_0 -\eta\sum_{t\lt T} \lrangle{\bw_t, \bv_t} + \frac{\eta^2 T}{2},$$ while on the other hand $$\log Z_T \ge -\eta \min_{i} \sum_{t\lt T} v_{i,t}.$$ Putting everything together (and noting $Z_0 = d$), we get $$\sum_{t\lt T} \lrangle{\bw_t, \bv_t} - \min_i \sum_{t\lt T} v_{i,t} \le \frac{\log d}{\eta} + \frac{\eta T}{2}.$$ Hence, as long as $T \ge 2 \log d$ (so that $\eta \le 1$), we can pick $\eta = (2 \log d / T)^{1/2}$ to get that the difference is at most $\sqrt{2T\log d}$.
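As a quick sanity check, one can run the `hedge` sketch from above on random costs and compare the realized regret with the $\sqrt{2T\log d}$ bound (random costs are of course far from worst-case, so the observed regret should sit well below the bound):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 10_000, 50
costs = rng.random((T, d))                  # arbitrary costs in [0, 1]
eta = np.sqrt(2 * np.log(d) / T)            # the tuning suggested by the bound

regret = hedge(costs, eta) - costs.sum(axis=0).min()
print(regret, np.sqrt(2 * T * np.log(d)))   # regret should come out below the bound
```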

Commentary

Connections to FTRL/Mirror descent

Actually, the choice of $\bw_t$ here is precisely

$$\bw_{t+1} = \arg\min_{\bw} \left\{\lrangle{\bw, \bv_t} + \frac{1}{\eta} \mathrm{KL}(\bw \,\|\, \bw_t)\right\}.$$
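One way to see this: add a Lagrange multiplier $\lambda$ for the normalization constraint and differentiate with respect to each coordinate $\bw(i)$, $$v_{i,t} + \frac1\eta\left(\log\frac{\bw(i)}{\bw_t(i)} + 1\right) + \lambda = 0,$$ which gives $\bw_{t+1}(i) \propto \bw_t(i)\, e^{-\eta v_{i,t}}$; unrolling this recursion from a uniform $\bw_1$ recovers the exponential-weights formula above.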

This is a particular case of mirror descent - I won’t try to describe what it is, but I’ll just say it is a framework to unify this update rule with gradient descent: $x_{t+1} = x_t - \eta \nabla f(x_t)$ can also be written as $$x_{t+1} = \arg\min_x \left\{\lrangle{x, \nabla f(x_t)} + \frac{1}{2\eta} \|x-x_t\|^2\right\}$$ which is now eerily similar.
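Indeed, setting the gradient of the bracketed objective to zero recovers the gradient step: $$\nabla f(x_t) + \frac{1}{\eta}(x - x_t) = 0 \iff x = x_t - \eta\, \nabla f(x_t).$$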

Connections to replicator dynamics

Imagine we have some evolutionary scenario with $d$ species, whose costs $v_i(t)$ evolve continuously (lower cost = fitter), and let $w_i(t)$ be the proportion of species $i$ at time $t$. A natural modelling assumption is that each species’ per-capita growth rate is determined by how its cost compares to the population average $c$.

This leads to the natural rule $$\dot w_i = -w_i \cdot (v_i - c)$$ (where $c=\langle \mathbf v, \mathbf w \rangle$ is forced if we want the proportions to keep summing to 1). Playing a bit fast and loose, we can subtract two such equations to get $$\dot w_i / w_i - \dot w_j / w_j = - (v_i - v_j),$$ and solving this ODE gives $w_i(t)/w_j(t) = \exp(-\int_0^t (v_i - v_j)(s)\, ds) \cdot (w_i(0) / w_j(0))$, i.e. $w_i(t) \propto w_i(0)\exp(-\int_0^t v_i(s)\, ds)$ - a continuous-time version of the exponential weights.
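Here is a small numerical sketch of this correspondence (the cost functions, step size, and horizon below are arbitrary illustrative choices): Euler-integrate the replicator ODE and compare the result with the closed-form exponential of the integrated costs.

```python
import numpy as np

# Euler-integrate dw_i/dt = -w_i * (v_i(t) - <v(t), w>) and compare with
# the closed form w_i(t) proportional to exp(-integral_0^t v_i(s) ds).
d, dt, steps = 5, 1e-3, 20_000
v = lambda t: 0.5 + 0.5 * np.sin(t + np.arange(d))   # arbitrary smooth costs in [0, 1]

w = np.full(d, 1.0 / d)                              # uniform initial proportions
integral = np.zeros(d)                               # running integral of v_i
for k in range(steps):
    vt = v(k * dt)
    w = w + dt * (-w * (vt - vt @ w))                # replicator step
    integral += dt * vt

closed_form = np.exp(-integral)
closed_form /= closed_form.sum()
print(np.max(np.abs(w - closed_form)))               # should be small (just Euler error)
```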

Caveats

In a real-world application scenario, there are a few caveats you may want to think about.

AdaBoost

Here is a second algorithm that uses the exact same idea.

Suppose we have data from an input space $\mathcal X$, and each datapoint belongs to one of two classes labelled $+1$ or $-1$. A classifier would simply be a function $\mathcal X \to \{\pm 1\}$, and its error under a distribution $D$ (over $\mathcal X \times \{ \pm 1\}$) is just how often the classifier is wrong: $$\text{(error of $h$)} = \E_{(x,y) \sim D} \mathbb I_{h(x) \neq y}.$$

Let’s say we have a learning algorithm, but for whatever reason it is “weak” - it only produces simple classifiers (e.g. decision stumps: $x\mapsto 2\mathbb I[x \gt t] - 1$). It turns out that there is a way to combine the results of multiple runs of a weak learner into a strong learner!

The algorithm

The idea here is that when we iteratively train the weak learners, we want to focus on the examples that are hardest to classify. To facilitate this, we require that the weak learner is able to learn over any distribution $D$ over examples, and produces a classifier $h$ with error $\le 1/2 -\gamma$ (i.e. some amount better than random). Then, we can do the following: maintain a distribution $D^{(t)}$ over the examples, starting from the original $D$; at round $t$, train the weak learner on $D^{(t)}$ to get $h_t$ (with weighted error $\eps_t \le 1/2 - \gamma$), pick a weight $w_t$, and reweight each example by $\exp(-w_t\, y\, h_t(x))$ (renormalizing to get $D^{(t+1)}$), so that the examples $h_t$ got wrong receive more weight in the next round.
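Here is a minimal code sketch of that loop. The threshold-stump weak learner is a hypothetical stand-in for whatever weak learner you have, and the particular choice $w_t = \frac12\log\frac{1-\eps_t}{\eps_t}$ is the one the analysis below will justify.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Hypothetical weak learner: best single-feature threshold stump under weights D."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * (2 * (X[:, j] > thr) - 1)
                err = D @ (pred != y)
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    j, thr, sign = best_params
    return lambda X: sign * (2 * (X[:, j] > thr) - 1)

def adaboost(X, y, T, weak_learner=stump_weak_learner):
    """Boost a weak learner for T rounds; y should be +/-1 labels."""
    n = len(y)
    D = np.full(n, 1.0 / n)                                # D^(1): uniform over the examples
    hs, ws = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)
        eps = np.clip(D @ (h(X) != y), 1e-12, 1 - 1e-12)   # weighted error of h_t under D^(t)
        w = 0.5 * np.log((1 - eps) / eps)                  # the weight minimizing Z_t / Z_{t-1}
        D = D * np.exp(-w * y * h(X))                      # up-weight the examples h_t got wrong
        D /= D.sum()                                       # renormalize to get D^(t+1)
        hs.append(h)
        ws.append(w)
    return lambda X: np.sign(sum(w * h(X) for w, h in zip(ws, hs)))
```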

By simply taking a linear combination of the weak learners, the classifier $\mathrm{sign}(\sum w_t h_t(x))$ has exponentially small error in the number of rounds $T$!

The analysis

Define $f_t = \sum_{s \le t} w_sh_s$. Then, we can upper bound the indicator $$\mathbb I[\mathrm{sign} (f_t(x)) \neq y] \le e^{- y f_t(x)}$$ (if the sign is wrong then $y f_t(x) \le 0$, so the right-hand side is at least 1).

By averaging over $(x,y) \sim D$, we get that $$\text{(error of $f_t$)} \le \E_{x,y}[\exp(-y f_t(x))] =: Z_t.$$

Again, we look at how the potential shrinks, via the ratio $Z_t/Z_{t-1}$. It can be interpreted as an average over the reweighted distribution $D^{(t)}(x,y) \propto D(x,y)\, e^{-y f_{t-1}(x)}$ - exactly the distribution we train $h_t$ on:

$$\frac{Z_{t}} {Z_{t-1}} = \E_{(x,y)\sim D^{(t)}}[\exp(-w_t y h_t(x))]$$

and notice that $y\, h_t(x)$ is $+1$ if $h_t$ classifies the example correctly and $-1$ otherwise. So we can split this into $$ \E_{(x, y) \sim D^{(t)}} [\exp(-w_t y h_t(x))] = (1-\eps_t) \cdot e^{-w_t} + \eps_t \cdot e^{w_t}, $$ where $\eps_t$ is the weighted error of $h_t$ under $D^{(t)}$. Minimizing over $w_t$, the minimal value is $2\sqrt{\eps_t(1-\eps_t)}$. But we also assumed that $\eps_t \le 1/2-\gamma$, so this is at most $\sqrt{1-4\gamma^2} \le \exp(-2\gamma^2)$. Thus,

$$(\text{error of } f_t) \le Z_t \le Z_0 \exp(-2\gamma^2 t) = \exp(-2\gamma^2 t),$$ since $Z_0 = \E_{x,y}[e^{0}] = 1$.
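For completeness, the minimization over $w_t$ is a one-line calculus exercise: $$\frac{d}{dw}\left[(1-\eps)e^{-w} + \eps\, e^{w}\right] = 0 \iff e^{2w} = \frac{1-\eps}{\eps} \iff w = \frac12 \log\frac{1-\eps}{\eps},$$ at which point the value is $(1-\eps)\sqrt{\tfrac{\eps}{1-\eps}} + \eps\sqrt{\tfrac{1-\eps}{\eps}} = 2\sqrt{\eps(1-\eps)}$ - exactly the $w_t$ used in the code sketch above.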