For a long time now I’ve been searching for intuitive reasons why we should expect minimax duality to hold. One approach seems to come from online learning algorithms.
This section follows these notes.
Let $K$ be a convex set and let $f:K \times K \to \R$ be a convex-concave function. We want to show that
$$\min_x \max_y f(x,y) = \max_y \min_x f(x,y)$$
One can interpret this as a game between two players with inputs $x$ and $y$, where the first player has payout $-f$ while the second has payout $f$. Both players’ payouts are concave in their own move for any fixed move of the opponent.
The approach is as follows: we repeatedly play this game, producing moves $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots$. Suppose each player is able to “learn”, in the sense that the gap between their cumulative payout and the best payout under a single fixed move in hindsight (the “regret”) is sublinear in the number of rounds $T$. For the $x$-player this reads $$\sum_{t\le T} f(x^{(t)}, y^{(t)}) - \min_x \sum_{t\le T}f(x,y^{(t)}) = o(T),$$ and analogously for the $y$-player: $\max_y \sum_{t\le T} f(x^{(t)}, y) - \sum_{t\le T} f(x^{(t)}, y^{(t)}) = o(T)$.

Since $f$ is concave in $y$, replacing the $y^{(t)}$ by their average can only increase the second sum: $$\min_x \sum_{t\le T} f(x, y^{(t)}) \le T \min_x f(x, \bar y) \le T \max_y \min_x f(x,y)$$ where $\bar y := \frac 1 T \sum_{t\le T} y^{(t)}$. Doing the analogous thing on the $x$ side (using convexity in $x$ and the $y$-player’s regret bound) gives $$\max_y \min_x f(x, y) + o(1)\ge \frac{1}{T} \sum_{t\le T} f(x^{(t)}, y^{(t)}) \ge \min_x \max_y f(x, y) - o(1),$$ so taking $T \to \infty$ we conclude that $\max_y \min_x f \ge \min_x \max_y f$. This is the “hard” direction, so combining it with the “easy” direction ($\min_x \max_y f \ge \max_y \min_x f$, which always holds) we are done.
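To see this dynamic in action, here is a minimal numerical sketch (none of it from the notes above): I take a bilinear-plus-linear payoff $f(x,y) = x^\top A y + b^\top x + c^\top y$ on a box, let both players run projected online gradient descent/ascent, and watch the duality gap of the averaged iterates shrink. The box constraint and the closed-form inner min/max are just convenient choices for the demo.

```python
# A minimal numerical sketch of the repeated-game argument above. The payoff
# f(x, y) = x^T A y + b^T x + c^T y and the box constraint K = [-1, 1]^d are
# my choices (not from the notes): they make the projection and the inner
# min/max available in closed form.
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 5, 4, 20000
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
c = rng.standard_normal(m)

def proj_box(z):                      # Euclidean projection onto [-1, 1]^d
    return np.clip(z, -1.0, 1.0)

x, y = np.zeros(n), np.zeros(m)
x_sum, y_sum = np.zeros(n), np.zeros(m)
for t in range(1, T + 1):
    eta = 1.0 / np.sqrt(t)            # decaying step size (no-regret rate)
    gx = A @ y + b                    # nabla_x f(x, y)
    gy = A.T @ x + c                  # nabla_y f(x, y)
    x, y = proj_box(x - eta * gx), proj_box(y + eta * gy)  # descent / ascent
    x_sum += x
    y_sum += y

x_bar, y_bar = x_sum / T, y_sum / T
# Closed forms over the box:  max_y f(x_bar, y) = b.x_bar + ||A^T x_bar + c||_1
#                             min_x f(x, y_bar) = c.y_bar - ||A y_bar + b||_1
upper = b @ x_bar + np.abs(A.T @ x_bar + c).sum()
lower = c @ y_bar - np.abs(A @ y_bar + b).sum()
print(f"min_x f(x, y_bar) = {lower:.4f}  <=  value  <=  max_y f(x_bar, y) = {upper:.4f}")
print(f"duality gap = {upper - lower:.4f}")
```

The two printed numbers sandwich the value of the game, and their gap is exactly the $o(1)$ slack in the sandwich above, so it should shrink (roughly like $1/\sqrt T$) as $T$ grows.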
Remarks. To summarize, we have the following chain of equalities:
$$\max_y \min_x f = \lim_{T\to\infty} \min_x f(x, \bar y^{(T)}) = \lim_{T\to\infty} \min_x \frac{1}{T} \sum_{t\le T}f(x, y^{(t)}) = \lim_{T \to \infty}\frac{1}{T} \sum_{t\le T}f(x^{(t)}, y^{(t)}) = \min_x \max_y f $$
which I think suggests that the averaged iterates $\bar x^{(T)}, \bar y^{(T)}$ also “find” optimal strategies as $T\to \infty$.
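One way to make this precise (a standard consequence of the same two regret bounds together with Jensen’s inequality, not something spelled out in the notes): the averaged iterates form an approximate saddle point, $$\max_y f(\bar x^{(T)}, y) - \min_x f(x, \bar y^{(T)}) = o(1),$$ so for compact $K$ any limit point of $(\bar x^{(T)}, \bar y^{(T)})$ is an optimal pair of strategies.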
The following is from Ch. 3 of this textbook.
Turns out that this so-called “online convex optimization” is possible if $K$ is bounded and $f$ is Lipschitz. In particular, the algorithm used is simply “online gradient descent”:
$$x^{(t+1)} = \Pi_K(x^{(t)} - \eta_t \nabla_t)$$ where:
- $\Pi_K$ is the Euclidean projection onto $K$,
- $\nabla_t := \nabla_x f(x^{(t)}, y^{(t)})$ is the gradient of the loss revealed at round $t$, and
- $\eta_t > 0$ is a (typically decreasing) step size.
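As a concrete reference, here is a minimal sketch of this update, assuming for the projection that $K$ is a Euclidean ball (that choice, and the helper names `ogd` / `project_ball`, are mine, not from the textbook):

```python
# A minimal sketch of projected online gradient descent. K is taken to be a
# Euclidean ball of radius R so the projection has a one-line formula; swap
# `project_ball` out for other convex sets.
import numpy as np

def project_ball(x, R=1.0):
    """Euclidean projection Pi_K onto the ball {x : ||x|| <= R}."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

def ogd(grad, x0, T, D, G, project=project_ball):
    """Projected online gradient descent with step sizes eta_t = D/(G*sqrt(t)).

    grad(t, x) should return the gradient of the loss revealed at round t,
    evaluated at x -- in the game above, nabla_x f(x, y^(t)).
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for t in range(1, T + 1):
        eta = D / (G * np.sqrt(t))
        x = project(x - eta * grad(t, x))
        iterates.append(x)
    return iterates
```

The step-size schedule $\eta_t = D/(G\sqrt{t})$ is the one that appears in the regret bound below.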
This in fact achieves regret on the order of $\sqrt T$, that is, $$\sum_{t\le T} f(x^{(t)}, y^{(t)}) - \min_x \sum_{t\le T}f(x,y^{(t)}) = O(\sqrt T).$$
Here’s the rough idea: denote the minimizer as $$x^* = \argmin_x \sum_{t\le T} f(x, y^{(t)}).$$ Since $f$ is convex in $x$, we have $$f(x^{(t)}, y^{(t)}) -f(x^*, y^{(t)}) \le \nabla_t^\top (x^{(t)} - x^*),$$ so we can bound the difference at each time step $t$ term by term. Ignoring $\Pi_K$ (the projection only moves $x^{(t+1)}$ closer to $x^* \in K$, so it can only help), the update rule gives $$-2 \eta_t \nabla_t^\top (x^{(t)} - x^*) = \|x^{(t+1)} - x^*\|^2 - \|x^{(t)}-x^*\|^2 - \eta_t^2 \|\nabla_t\|^2.$$

Putting everything together, $$ \begin{align*} 2\left(\sum_{t\le T} f(x^{(t)},y^{(t)}) - f(x^*, y^{(t)})\right) &\le \sum_{t\le T} \left(\frac{ \|x^{(t)} - x^*\|^2 - \|x^{(t+1)}-x^*\|^2}{\eta_t} + \eta_t \|\nabla_t\|^2 \right)\\ &\le \sum_{t\le T} \|x^{(t)} - x^*\|^2 \left(\frac 1 {\eta_t} - \frac 1{\eta_{t-1}}\right) + \sum_{t\le T} \eta_t \|\nabla_t \|^2 \end{align*} $$ where we set $1/\eta_0 := 0$ and dropped the final $-\|x^{(T+1)} - x^*\|^2/\eta_T \le 0$ term.

At this stage, we should assume that the step sizes decrease over time, so the term in parentheses is nonnegative. Bounding $\|x^{(t)} - x^*\| \le D$ (the diameter of $K$) and $\|\nabla_t\| \le G$, the right-hand side is at most $D^2/\eta_T + G^2 \sum_{t\le T} \eta_t$. In the constant-step case $\eta_1=\cdots=\eta_T = \eta$ this reads $D^2/\eta + G^2 T\eta$, which is balanced when $\eta \approx D/(G\sqrt{T})$, so that is roughly the correct scale up to constants. Plugging in the decaying schedule $\eta_t = D/(G\sqrt{t})$ gives a bound of $\frac{3}{2} GD\sqrt{T} = o(T)$.
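To spell out that final plug-in: with $\eta_t = D/(G\sqrt{t})$ and the standard bound $\sum_{t\le T} t^{-1/2} \le 2\sqrt{T}$, $$\frac{D^2}{\eta_T} + G^2 \sum_{t\le T}\eta_t = GD\sqrt{T} + GD\sum_{t\le T}\frac{1}{\sqrt{t}} \le 3GD\sqrt{T},$$ and dividing by the factor of $2$ on the left-hand side of the regret inequality gives the stated $\frac{3}{2}GD\sqrt{T}$.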
Remark. Lipschitzness is actually really strong, since we need uniform control over all gradients: $$G \ge \max_{x,y} \|\nabla_x f(x, y)\|$$ and perhaps in general we might expect minimax duality to hold even for $K = \R^n$.
Sanity check. Where did we use the fact that $x^*$ was the minimizer? It wasn’t really used (in the sense that even for other values of $x^*$, the upper bound on the regret still holds), but for non-minimizers the regret can possibly be negative.