We put together the puzzle pieces from the various definitions of “conjugates” and “duality” (convex, Lagrange and minimax) to get a better picture of what duality is. In particular, we will look at convex conjugates, Lagrange duality, and the minimax theorem, and how saddle points tie them all together.
$\newcommand{\lrangle}[1]{\left\langle #1 \right\rangle}$ $\newcommand{\lrbrace}[1]{\left\{ #1 \right\}}$ $\newcommand{\E}{\mathbb E}$ $\newcommand{\R}{\mathbb R}$
For any function $f:\R^n \to \R$, the convex conjugate is $$f^*(u) := \sup_{x}\lrbrace{\lrangle{x,u} - f(x)}.$$
The definition is inspired by the convex differentiable case, where we have $$f(y) \ge f(x) + \lrangle{y-x, \nabla f(x)}$$ i.e. $f$ is lower bounded by its affine support at $x$.
Rearranging: $$\lrangle{x, \nabla f(x)} - f(x) \ge \lrangle{y, \nabla f(x)} - f(y)$$ i.e. when $u=\nabla f(x)$, the supremum defining $f^*(u)$ is attained at $x$. Hence, one really thinks of $u$ as living in the “dual space” (i.e. “gradient” space).
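To make this concrete, here is a minimal numerical sketch (the quadratic $f(x)=x^2$ and the brute-force grid supremum are just my toy setup): the conjugate should come out as $f^*(u) = u^2/4$, with the supremum at $u = f'(x_0) = 2x_0$ attained at $x_0$.

```python
import numpy as np

# Approximate the convex conjugate f*(u) = sup_x { x*u - f(x) } by a
# brute-force maximum over a grid, for the toy function f(x) = x^2.
xs = np.linspace(-10, 10, 100001)
fvals = xs**2

def conjugate(u):
    return np.max(xs * u - fvals)  # grid stand-in for the supremum

for u in [-2.0, 0.0, 1.0, 3.0]:
    print(u, conjugate(u), u**2 / 4)  # grid estimate vs the closed form u^2/4

# the "dual space as gradient space" picture: for u = f'(x0) = 2*x0,
# the maximizer of x*u - f(x) is x0 itself
x0 = 1.5
print(xs[np.argmax(xs * (2 * x0) - fvals)])  # ~1.5
```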
We have $f=f^{**}$ if $f$ is convex and lower-semicontinuous (i.e. $\liminf_{y\to x} f(y) \ge f(x)$ at every $x$).
In general, we have $f^{**}\le f$. Written in full, $$ \begin{aligned} f^{**}(x) &= \sup_{u} \lrbrace{\lrangle{x,u} - f^*(u)}\\ &= \sup_{u} \inf_{x'} \lrbrace{\lrangle{x-x', u} + f(x')}\\ &\le \inf_{x'} \sup_{u} \lrbrace{\lrangle{x-x', u} + f(x')} = f(x) \end{aligned} $$ (the last equality holds because the inner supremum is $+\infty$ unless $x'=x$), so convex duality occurs when equality holds.
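As a sanity check on $f^{**} \le f$, one can biconjugate numerically; for a nonconvex $f$, the biconjugate comes out as the convex envelope, so equality holds exactly where $f$ already agrees with it. A rough grid-based sketch (the double-well $f$ is my own toy example):

```python
import numpy as np

# Grid biconjugation: f* and f** as brute-force suprema over grids.
xs = np.linspace(-3, 3, 2001)
f = np.minimum((xs - 1)**2, (xs + 1)**2)  # nonconvex double well

def conj(vals, grid, dual_grid):
    # g(u) = max over the grid of { u*x - vals(x) }, for each u in dual_grid
    return np.max(np.outer(dual_grid, grid) - vals[None, :], axis=1)

us = np.linspace(-6, 6, 2001)
f_star = conj(f, xs, us)            # f*
f_star_star = conj(f_star, us, xs)  # f**

print(np.all(f_star_star <= f + 1e-6))  # True: f** <= f everywhere
# near-zero where f is already convex (|x| >= 1), i.e. f** = f there:
print(np.max(np.abs((f_star_star - f)[np.abs(xs) >= 1])))
# but f**(0) = 0 < 1 = f(0): the well between the two bumps gets convexified
print(f_star_star[np.abs(xs).argmin()], f[np.abs(xs).argmin()])
```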
Consider the following optimization problem $$ \begin{aligned} \text{minimize} &\qquad f(x)\\ \text{subject to} & \qquad h(x) \preceq 0 \end{aligned} $$ To solve this, one constructs the Lagrangian $$L(x,\lambda) = f(x) + \lrangle{\lambda, h(x)}$$ and oftentimes it is easy to solve the unconstrained optimization over $x$ for a given $\lambda$. So we define the dual function $$g(\lambda) := \inf_x L(x,\lambda)$$ and note that it is always a lower bound for the original optimization problem when $\lambda \succeq 0$, since $\lrangle{\lambda, h(x)} \le 0$ at any feasible $x$. So the “dual” problem is $$ \begin{aligned} \text{maximize} &\qquad g(\lambda)\\ \text{subject to} & \qquad \lambda \succeq 0 \end{aligned} $$
When this dual problem gives the same answer as the original problem, we say that strong duality holds.
$$ \begin{aligned} \min_{x: h(x) \preceq 0}f(x) &= \min_x \max_{\lambda \succeq 0} L(x,\lambda) \\ &\ge \max_{\lambda \succeq 0} \min_x L(x,\lambda) = \max_{\lambda \succeq 0} g(\lambda) \end{aligned} $$
A word on equality conditions: the most famous one is Slater’s constraint qualification - if $f, h$ are convex and there is even a single $x$ with $h(x) \prec 0$ (strictly feasible), then we have strong duality.
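Here is a toy end-to-end sketch (the problem $\min\, x^2$ subject to $1 - x \le 0$ is my own example; Slater holds since e.g. $x=2$ is strictly feasible, so we expect no duality gap):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy problem: minimize f(x) = x^2 subject to h(x) = 1 - x <= 0.
# Lagrangian L(x, lam) = x^2 + lam*(1 - x); analytically g(lam) = lam - lam^2/4.

def g(lam):
    # dual function: unconstrained minimum of L(., lam)
    return minimize_scalar(lambda x: x**2 + lam * (1 - x)).fun

dual_opt = max(g(lam) for lam in np.linspace(0, 4, 401))
print(dual_opt)  # ~1.0, attained at lam = 2
# primal optimum: min of x^2 over x >= 1 is 1 (at x = 1), so the gap is ~0
```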
We see a pattern here. One asks what we would need for $$\min_x \max_y f(x,y) = \max_y \min_x f(x,y)$$ Von Neumann’s minimax theorem establishes this for $f$ affine in each argument - the most famous case is when $x$ and $y$ are distributions: $$\min_{\mu} \max_\nu \mathbb E_{X\sim \mu, Y\sim \nu} [u(X,Y)] =\max_\nu \min_{\mu} \mathbb E_{X\sim \mu, Y\sim \nu} [u(X,Y)]$$ One might interpret $u$ as the payoff in a two-player zero-sum game (where player 1 minimizes and player 2 maximizes), and $\mu,\nu$ as the players’ (mixed) strategies. The inner operator represents the responding player choosing their optimal move, so an optimal pair $(\mu, \nu)$ is a Nash equilibrium: neither player can deviate to their advantage.
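For matrix games (finitely many pure strategies), both sides of this identity are linear programs, so the equality can be checked directly. A sketch using scipy.optimize.linprog on a random payoff matrix, with the standard formulation where the game value enters as an auxiliary variable:

```python
import numpy as np
from scipy.optimize import linprog

# Zero-sum matrix game: player 1 (rows, strategy mu) minimizes and player 2
# (columns, strategy nu) maximizes the expected payoff mu^T A nu.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
m, n = A.shape

# min over mu of max_j (A^T mu)_j : variables (mu, t), minimize t
res1 = linprog(
    c=np.r_[np.zeros(m), 1.0],
    A_ub=np.c_[A.T, -np.ones(n)],          # (A^T mu)_j - t <= 0 for all j
    b_ub=np.zeros(n),
    A_eq=np.r_[np.ones(m), 0.0][None, :],  # sum(mu) = 1
    b_eq=[1.0],
    bounds=[(0, None)] * m + [(None, None)],
)

# max over nu of min_i (A nu)_i : variables (nu, s), minimize -s
res2 = linprog(
    c=np.r_[np.zeros(n), -1.0],
    A_ub=np.c_[-A, np.ones(m)],            # s - (A nu)_i <= 0 for all i
    b_ub=np.zeros(m),
    A_eq=np.r_[np.ones(n), 0.0][None, :],  # sum(nu) = 1
    b_eq=[1.0],
    bounds=[(0, None)] * n + [(None, None)],
)

print(res1.fun, -res2.fun)  # min-max and max-min values coincide
```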
The general theorem actually holds for $f$ which is convex in $x$ and concave in $y$. We talk about why later via the connection to saddle points.
In fact, an even stronger generalization due to Sion shows that this property is almost purely “topological” - convexity in $x$ can be weakened to quasi-convexity (and concavity in $y$ to quasi-concavity), together with mild semicontinuity and compactness assumptions.
Here’s a trick I learnt from Borwein-Lewis (which probably originated with Rockafellar). One considers $$ \begin{aligned} h(u) &:= \min_x \max_y \lrbrace{f(x,y)+\lrangle{u, y}} \end{aligned} $$ It is not immediately obvious, but $h$ is convex (assuming $f$ is convex-concave): for each fixed $y$, $f(x,y) + \lrangle{u,y}$ is jointly convex in $(x,u)$, so the supremum $\max_y \lrbrace{f(x,y) + \lrangle{u,y}}$ is jointly convex in $(x,u)$, and partial minimization over $x$ preserves convexity.
Now, using the biconjugation formula from before (with dual variable $w$), $$ \begin{aligned} h^{**}(u) &= \sup_{w} \inf_{u'} \lrbrace{\lrangle{u-u', w} + h(u')}\\ &= \sup_{w} \lrbrace{\lrangle{u, w} + \min_x \inf_{u'} \max_y \lrbrace{f(x, y) + \lrangle{u', y - w}}}\\ &= \sup_{w} \lrbrace{\lrangle{u, w} + \min_x f(x, w)} \end{aligned} $$ where the last step is biconjugation of the closed concave function $f(x,\cdot)$, namely $\inf_{u'} \max_y \lrbrace{f(x,y) + \lrangle{u', y-w}} = f(x,w)$. In particular $h(0) = \min_x \max_y f(x,y)$ while $h^{**}(0) = \max_y \min_x f(x,y)$, so minimax duality holds iff $h(0) = h^{**}(0)$; since $h$ is convex, all we need is lower-semicontinuity of $h$ at $0$.
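As a quick illustration, take the convex-concave toy $f(x,y) = x^2 - y^2 + 2xy$ (my own example). A short computation gives $h(u) = u^2/8$: convex and finite, hence continuous at $0$, so the swap is valid for this $f$. A grid-based check:

```python
import numpy as np

# h(u) = min over x of max over y of { f(x,y) + u*y }
# for the convex-concave toy f(x,y) = x^2 - y^2 + 2xy (closed form: u^2/8).
xs = np.linspace(-5, 5, 1001)
ys = np.linspace(-5, 5, 1001)
X, Y = np.meshgrid(xs, ys, indexing="ij")  # axis 0 is x, axis 1 is y
F = X**2 - Y**2 + 2 * X * Y

def h(u):
    return np.min(np.max(F + u * Y, axis=1))

print([round(h(u), 3) for u in np.linspace(-2, 2, 9)])  # ~u^2/8 on the grid
print(h(0.0))                     # min-max value: ~0
print(np.max(np.min(F, axis=0)))  # max-min value: ~0, so they agree
```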
Another reason to expect this to be true is as follows: we can prove the standard minimax inequality by noting that for all $x, y$, $$ \max_{y'} f(x,y') \ge f(x, y) \ge \min_{x'} f(x', y) $$ and taking $\min_x$ on the left and $\max_y$ on the right gives $\min_x \max_y f \ge \max_y \min_x f$. However, imagine (in the 2-D case) the two “connected” best-response sets $$ \begin{aligned} M_y &:= \{(x,y) \mid y \in \arg\max_{y'} f(x,y')\}\\ m_x &:= \{(x,y) \mid x \in \arg\min_{x'} f(x',y)\}\\ \end{aligned} $$ If they intersect at some $(x^*, y^*) \in m_x\cap M_y$, we can run the reverse chain of inequalities $$ \begin{aligned} \min_x \max_y f(x,y) &\le \max_y f(x^*, y) \\ &= f(x^*, y^*)\\ &= \min_x f(x, y^*) \\ &\le \max_y \min_x f(x,y) \end{aligned} $$ where the first equality holds because $(x^*, y^*) \in M_y$ and the second because $(x^*, y^*) \in m_x$.
Thus, it all boils down to showing the existence of a saddle point. This is somewhat of a conundrum: in the 2-D case (with one-dimensional $x$ and $y$) the argument seems almost obvious, since in a square a path running left-to-right must necessarily intersect a path running bottom-to-top (under some assumptions), but this is far less clear in higher dimensions.
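For the 2-D toy $f(x,y) = x^2 - y^2 + 2xy$ from the sketch above, one can trace the two best-response curves on a grid and watch them cross:

```python
import numpy as np

# Best-response curves of f(x,y) = x^2 - y^2 + 2xy on a grid: M_y is the
# maximizer's response y to each x, m_x the minimizer's response x to each y.
xs = np.linspace(-2, 2, 201)
ys = np.linspace(-2, 2, 201)
X, Y = np.meshgrid(xs, ys, indexing="ij")
F = X**2 - Y**2 + 2 * X * Y

row_argmax = np.argmax(F, axis=1)  # index of best y for each x  (M_y)
col_argmin = np.argmin(F, axis=0)  # index of best x for each y  (m_x)

# a saddle point is a grid point lying on both curves
saddles = [(xs[i], ys[j]) for i, j in enumerate(row_argmax) if col_argmin[j] == i]
print(saddles)  # ~[(0.0, 0.0)]: the curves y = x and x = -y cross at the origin
```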
In the theory of statistical minimax lower bounds, we would like to lower-bound the minimax risk over all estimators: $$\inf_{\widehat T}\sup_{\theta\in \Theta} \E[(\widehat T - T(\theta))^2] \ge \text{(lower bound)}$$ If we knew that a certain lower bound were “tight”, we would also have a matching upper bound. To actually produce one, it’s common to relax the risk so as to massage it into convex-concave form, giving something like $$ \begin{aligned} \inf_{\widehat T}\sup_{\theta\in \Theta} \E[(\widehat T - T(\theta))^2] &\le \inf_{\widehat T}\sup_{\theta\in \Theta} f(\widehat T, \theta)\\ &= \sup_{\theta} \inf_{\widehat T} f(\widehat T, \theta) \end{aligned} $$ and the right-hand side is usually fairly easy to compute, since the inner infimum is the “retroactively best possible loss”: the best estimator chosen as if $\theta$ were known, under a slightly modified loss $f$.
Now, a practitioner will ask: “okay, so which $\widehat T$ does this actually suggest we use?” The answer is: any $\widehat T$ at the saddle point! It seems logically fallacious: we let nature pick its worst-case $\theta^*$ first, compute the best response $\widehat T^*$ to it, and then claim that this $\widehat T^*$ is simultaneously good against every $\theta$ - but that’s exactly what we are allowed to do by minimax theory.
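A minimal concrete instance (entirely my own toy construction): a single $\mathrm{Bernoulli}(\theta)$ observation with $\theta \in \{0.2, 0.8\}$ and squared loss. The inner infimum against a fixed prior is the Bayes estimator (the posterior mean); maximizing the Bayes risk over priors gives the least-favorable prior, and the Bayes estimator at that saddle point is the minimax estimator, with risk equalized across $\theta$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

thetas = np.array([0.2, 0.8])

def bayes_estimator(p):
    # best response to the prior (p, 1-p): the posterior mean for each x
    prior = np.array([p, 1 - p])
    T = np.zeros(2)
    for x in (0, 1):
        lik = thetas**x * (1 - thetas)**(1 - x)  # P(X = x | theta)
        post = prior * lik / np.sum(prior * lik)
        T[x] = np.sum(post * thetas)
    return T

def risk(T, theta):
    # E[(T(X) - theta)^2] for X ~ Bernoulli(theta)
    return (1 - theta) * (T[0] - theta)**2 + theta * (T[1] - theta)**2

def bayes_risk(p):
    T = bayes_estimator(p)
    return p * risk(T, 0.2) + (1 - p) * risk(T, 0.8)

res = minimize_scalar(lambda p: -bayes_risk(p), bounds=(0, 1), method="bounded")
T_star = bayes_estimator(res.x)
print(res.x)                              # least-favorable prior (~0.5 here)
print([risk(T_star, t) for t in thetas])  # equal risks: T_star is minimax
```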
There’s another example of this in the hardness vs randomness paradigm, but I don’t know if I’ll ever type it up.