David Kewei Lin


Towards understanding duality

We put together the puzzle pieces from the various definitions of “conjugates” and “duality” (convex, Lagrange and minimax) to get a better picture of what duality is. In particular, we will look at the various flavors of duality, how minimax duality reduces to convex duality, the role of saddle points, and applications in statistics and computer science.

$\newcommand{\lrangle}[1]{\left\langle #1 \right\rangle}$ $\newcommand{\lrbrace}[1]{\left\{ #1 \right\}}$ $\newcommand{\E}{\mathbb E}$

Various flavors of duality

Convex conjugates

For any function $f:\mathbb R^n \to \mathbb R$, the convex conjugate is $$f^*(u) := \sup_{x}\lrbrace{\lrangle{x,u} - f(x)}.$$

The definition is inspired by the convex differentiable case, where we have $$f(y) \ge f(x) + \lrangle{y-x, \nabla f(x)}$$ because $f$ is lower-bounded by its affine support (tangent) at $x$.

Rearranged: $$\lrangle{x, \nabla f(x)} - f(x) \ge \lrangle{y, \nabla f(x)} - f(y)$$ i.e. when $u=\nabla f(x)$, the maximizer is $x$. Hence, one really thinks of $u$ as living in the “dual space” (i.e. “gradient” space).
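The definition is concrete enough to check numerically. Here is a brute-force sketch (grid range and resolution are my own arbitrary choices) for $f(x) = e^x$, whose conjugate is known in closed form: $f^*(u) = u\log u - u$ for $u > 0$. It also confirms the “gradient space” picture: the maximizing $x$ satisfies $u = \nabla f(x) = e^x$, i.e. $x = \log u$.

```python
import numpy as np

# Brute-force convex conjugate: f*(u) = sup_x { x*u - f(x) },
# approximated by a sup over a fine grid (assumption: the true
# maximizer lies well inside the grid).
xs = np.linspace(-10, 10, 200001)

def conjugate(f, u):
    return np.max(xs * u - f(xs))

for u in [0.5, 1.0, 2.0]:
    approx = conjugate(np.exp, u)
    closed_form = u * np.log(u) - u                     # known conjugate of exp
    maximizer = xs[np.argmax(xs * u - np.exp(xs))]      # should be ~ log(u)
    print(u, approx, closed_form, maximizer)
```

The printed maximizer sits at $\log u$, matching the claim that the maximizer $x$ for a given $u$ is the point where $\nabla f(x) = u$.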

Convex Duality

We have $f=f^{**}$ if $f$ is convex and lower-semicontinuous (i.e. $\liminf_{x \to x_0} f(x) \ge f(x_0)$ at every point $x_0$).

In general, we have $f^{**}\le f$. Written in full, $$ \begin{aligned} f^{**}(x) &= \sup_{u} \lrbrace{\lrangle{x,u} - f^*(u)}\\ &= \sup_{u} \inf_{x'} \lrbrace{\lrangle{x-x', u} + f(x')}\\ &\le \inf_{x'} \sup_{u} \lrbrace{\lrangle{x-x', u} + f(x')} = f(x) \end{aligned} $$ (the last equality holds because the inner supremum is $+\infty$ unless $x'=x$), so convex duality occurs when equality holds.
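To see the gap when $f$ is not convex, here is a small numerical sketch (the double-well function and all grid ranges are my own choices): $f^{**}$ is the closed convex hull of $f$, so it flattens the nonconvex bump of $f(x) = \min\lrbrace{(x-1)^2, (x+1)^2}$, giving $f^{**}(0) = 0 < 1 = f(0)$, while the two agree where $f$ is already convex.

```python
import numpy as np

# Compute f* and f** by brute force on grids; f** is the convex hull of f.
xs = np.linspace(-4, 4, 2001)
us = np.linspace(-8, 8, 2001)

f = np.minimum((xs - 1)**2, (xs + 1)**2)                        # nonconvex double well
fstar = (xs[None, :] * us[:, None] - f[None, :]).max(axis=1)    # f*(u)
fss = (us[None, :] * xs[:, None] - fstar[None, :]).max(axis=1)  # f**(x)

i0 = np.abs(xs).argmin()        # x = 0: the hull flattens the bump here
i2 = np.abs(xs - 2).argmin()    # x = 2: f is locally convex, so f** = f
print(f[i0], fss[i0])           # 1.0 vs ~0.0
print(f[i2], fss[i2])           # 1.0 vs ~1.0
```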

Lagrangians.

Consider the following optimization problem $$ \begin{aligned} \text{minimize} &\qquad f(x)\\ \text{subject to} & \qquad h(x) \preceq 0 \end{aligned} $$ To solve this, one constructs the Lagrangian $$L(x,\lambda) = f(x) + \lrangle{\lambda, h(x)}$$ and oftentimes it is easy to solve the unconstrained optimization over $x$ for variable $\lambda$. So we define the dual function $$g(\lambda) := \inf_x L(x,\lambda)$$ and note that it is always a lower bound for the original optimal value when $\lambda \succeq 0$ (since $\lrangle{\lambda, h(x)} \le 0$ for feasible $x$). So the “dual” problem is $$ \begin{aligned} \text{maximize} &\qquad g(\lambda)\\ \text{subject to} & \qquad \lambda \succeq 0 \end{aligned} $$

When this dual problem gives the same answer as the original problem, we say that strong duality holds.

$$ \begin{aligned} \min_{x: h(x) \preceq 0}f(x) &= \min_x \max_{\lambda \succeq 0} L(x,\lambda) \\ &\ge \max_{\lambda \succeq 0} \min_x L(x,\lambda) = \max_{\lambda \succeq 0} g(\lambda) \end{aligned} $$

A word on equality conditions: the most famous one is Slater’s constraint qualification - if $f, h$ are convex and there is even a single strictly feasible $x$ (i.e. $h(x) \prec 0$), then strong duality holds.
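As a sanity check, here is a minimal numerical sketch on a toy instance of my own choosing: minimize $f(x) = x^2$ subject to $h(x) = 1 - x \le 0$. Slater's condition holds (e.g. $x = 2$ gives $h(x) = -1 < 0$), and indeed the primal and dual optima coincide at $1$, attained at $x = 1$ and $\lambda = 2$.

```python
import numpy as np

# Primal: minimize x^2 subject to 1 - x <= 0 (i.e. x >= 1).
xs = np.linspace(-5, 5, 2001)
lams = np.linspace(0, 10, 2001)

f = xs**2
h = 1 - xs

primal = f[h <= 0].min()                         # optimal value over feasible x
L = f[None, :] + lams[:, None] * h[None, :]      # Lagrangian L(x, lambda) on the grid
g = L.min(axis=1)                                # dual function g(lambda)
dual = g.max()

print(primal, dual)      # both 1.0; weak duality is also visible: g <= primal everywhere
```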

Generic Minimax Duality

We see a pattern here. One asks what we would need for $$\min_x \max_y f(x,y) = \max_y \min_x f(x,y)$$ Von Neumann’s minimax theorem establishes this for bilinear $f$ - the most famous case is when $x$ and $y$ are distributions: $$\min_{\mu} \max_\nu \mathbb E_{X\sim \mu, Y\sim \nu} [u(X,Y)] =\max_\nu \min_{\mu} \mathbb E_{X\sim \mu, Y\sim \nu} [u(X,Y)]$$ One might interpret $u$ as the payoff in a two-player zero-sum game (where player 1 minimizes and player 2 maximizes), and $\mu,\nu$ as the (mixed) strategies of the players. The inner operator represents the responding player choosing their optimal move, so an optimal pair $(\mu, \nu)$ is a Nash equilibrium: neither player can deviate to their advantage.
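Here is a sketch of the bilinear case using matching pennies (my choice of game; the grid resolution is arbitrary): the expected payoff $\mu^\top A \nu$ is bilinear in the mixed strategies, and both orders of optimization give the value $0$, attained at the uniform strategies.

```python
import numpy as np

# Matching pennies: player 1 (rows) minimizes, player 2 (cols) maximizes.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
ps = np.linspace(0, 1, 1001)             # P(player 1 plays row 0)
qs = np.linspace(0, 1, 1001)             # P(player 2 plays col 0)
P = np.stack([ps, 1 - ps], axis=1)       # mixed strategies mu
Q = np.stack([qs, 1 - qs], axis=1)       # mixed strategies nu
payoff = P @ A @ Q.T                     # payoff[i, j] = mu_i^T A nu_j

minmax = payoff.max(axis=1).min()        # min_mu max_nu
maxmin = payoff.min(axis=0).max()        # max_nu min_mu
print(minmax, maxmin)                    # both 0, at mu = nu = (1/2, 1/2)
```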

The general theorem actually holds for $f$ which is convex in $x$ and concave in $y$. We will talk about why later via the connection to saddle points.

In fact, an even stronger generalization due to Sion suggests that this property is almost purely “topological” - convexity/concavity can be weakened to quasi-convexity/quasi-concavity (together with semicontinuity and compactness of one of the domains).
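To make the convex-concave case concrete, here is a grid-based sketch with a toy function of my own choosing, $f(x,y) = x^2 - y^2 + xy$, which is convex in $x$ and concave in $y$: the two orders of optimization agree (both are $0$, at the saddle $(0,0)$).

```python
import numpy as np

# f is convex in x (for each y) and concave in y (for each x).
xs = np.linspace(-2, 2, 401)
ys = np.linspace(-2, 2, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")
F = X**2 - Y**2 + X * Y

minmax = F.max(axis=1).min()     # min over x of max over y
maxmin = F.min(axis=0).max()     # max over y of min over x
print(minmax, maxmin)            # both 0, attained at (x, y) = (0, 0)
```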

Reduction to convex duality.

Here’s a trick I learnt from Borwein-Lewis (which probably originated from Rockafellar). One considers $$ \begin{aligned} h(u) &:= \min_x \max_y \lrbrace{f(x,y)+\lrangle{u, y}} \end{aligned} $$ It is not immediately obvious, but $h$ is convex (assuming $f$ is convex-concave): for each fixed $y$, $f(x,y) + \lrangle{u,y}$ is jointly convex in $(x,u)$ (it is affine in $u$), a supremum of convex functions is convex, so $\max_y \lrbrace{f(x,y) + \lrangle{u,y}}$ is jointly convex in $(x,u)$, and partial minimization over $x$ preserves convexity.

Now, write $g_x(y) := -f(x,y)$, which is convex in $y$, so that $h(u) = \min_x g_x^*(u)$. Since the conjugate of a pointwise infimum is the supremum of the conjugates, $$h^*(y) = \max_x g_x^{**}(y) = \max_x \lrbrace{-f(x,y)} = -\min_x f(x,y)$$ (assuming each $g_x$ is closed, so $g_x^{**} = g_x$). Hence $$h^{**}(u) = \max_y \lrbrace{\lrangle{u,y} - h^*(y)} = \max_y \min_x \lrbrace{f(x,y) + \lrangle{u,y}}.$$ In particular $h(0) = \min_x \max_y f(x,y)$ and $h^{**}(0) = \max_y \min_x f(x,y)$, so minimax duality holds iff $h(0) = h^{**}(0)$. Since $h^{**}$ is the closed convex hull of the (already convex) $h$, all we need is lower-semicontinuity of $h$ at $0$.
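A quick numerical sketch of this reduction, with a toy convex-concave $f(x,y) = x^2 - y^2 + xy$ and grid ranges of my own choosing: the perturbation function $h(u)$ comes out convex, and $h(0) = \min_x \max_y f$ agrees with $\max_y \min_x f$, i.e. minimax duality holds here.

```python
import numpy as np

# h(u) = min_x max_y { f(x, y) + u*y } for f(x, y) = x^2 - y^2 + x*y.
xs = np.linspace(-3, 3, 601)
ys = np.linspace(-3, 3, 601)
us = np.linspace(-1, 1, 201)
X, Y = np.meshgrid(xs, ys, indexing="ij")
F = X**2 - Y**2 + X * Y

h = np.array([(F + u * Y).max(axis=1).min() for u in us])

# Convexity of h: each midpoint lies below the chord average (up to grid error).
assert np.all(h[1:-1] <= (h[:-2] + h[2:]) / 2 + 1e-3)

h0 = h[len(us) // 2]             # h(0) = min_x max_y f
maxmin = F.min(axis=0).max()     # max_y min_x f
print(h0, maxmin)                # equal here, so minimax duality holds
```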

Saddle points.

Another reason to expect this to be true is as follows: we can prove the standard minimax inequality by noting that $$ \begin{aligned} \max_{y'} f(x,y') &\ge f(x, y)\\ &\ge \min_{x'} f(x', y) \end{aligned} $$ then taking $\min_x$ on the left and $\max_y$ on the right gets us the answer. However, if we imagine (in the 2-D case) the two “connected” best-response sets $$ \begin{aligned} M &:= \{(x,y) \mid y \in \arg\max_{y'} f(x,y')\}\\ m &:= \{(x,y) \mid x \in \arg\min_{x'} f(x',y)\}\\ \end{aligned} $$ then if there is an intersection $(x^*, y^*) \in m\cap M$ (a saddle point), we can run the reverse chain of inequalities $$ \begin{aligned} \min_x \max_y f(x,y) &\le \max_y f(x^*, y) \\ &= f(x^*, y^*)\\ & = \min_x f(x, y^*) \\ &\le \max_y \min_x f(x,y) \end{aligned} $$ where the two equalities use $(x^*,y^*) \in M$ and $(x^*,y^*) \in m$ respectively.

Thus, it all boils down to showing the existence of a saddle point. This is somewhat of a conundrum: in the 2-D case (with one-dimensional $x$ and $y$) the argument seems almost obvious, since in a square a path running left-to-right must necessarily intersect another path running bottom-to-top (under some assumptions), but this is way less clear in higher dimensions, where existence is typically established via fixed-point theorems.
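For a toy convex-concave function such as $f(x,y) = x^2 - y^2 + xy$ (an example of my own choosing), the saddle point can be verified directly: $(x^*, y^*) = (0,0)$ satisfies $f(x, y^*) \ge f(x^*, y^*) \ge f(x^*, y)$ for all $x, y$.

```python
import numpy as np

# Verify the saddle-point inequalities f(x, y*) >= f(x*, y*) >= f(x*, y)
# for f(x, y) = x^2 - y^2 + x*y, whose saddle point is (x*, y*) = (0, 0).
def f(x, y):
    return x**2 - y**2 + x * y

xs = np.linspace(-2, 2, 401)
ys = np.linspace(-2, 2, 401)

assert np.all(f(xs, 0.0) >= f(0.0, 0.0))   # x* = 0 minimizes f(., y*)
assert np.all(f(0.0, ys) <= f(0.0, 0.0))   # y* = 0 maximizes f(x*, .)
print("saddle verified; value =", f(0.0, 0.0))
```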

Application in statistics

In the theory of statistical minimax lower bounds, we would like to lower-bound the minimax risk across a class of estimators: $$\inf_{\widehat T}\sup_{\theta\in \Theta} \E[(\widehat T - T(\theta))^2] \ge \text{(lower bound)}$$ If we knew that certain lower bounds were “tight”, then we would have a matching upper bound. To obtain this upper bound, it’s common to do some relaxations to massage the risk into convex-concave form, so that we have something like $$ \begin{aligned} \inf_{\widehat T}\sup_{\theta\in \Theta} \E[(\widehat T - T(\theta))^2] &\le \inf_{\widehat T}\sup_{\theta\in \Theta} f(\widehat T, \theta)\\ &= \sup_{\theta} \inf_{\widehat T} f(\widehat T, \theta) \end{aligned} $$ and the right-hand side is usually fairly easy to compute, since it is the “retroactively best possible loss for any estimator” on a slightly modified loss $f$.
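To illustrate on a deliberately tiny toy problem (entirely my own construction, not from the statistics literature): estimate $\theta \in \{0, 1\}$ with no data, using a constant guess $c$, under squared error. The minimax risk $\inf_c \sup_\theta$ and the worst-case Bayes risk $\sup_\pi \inf_c$ both equal $1/4$, with least favorable prior $\pi = 1/2$ and saddle-point estimator $c = 1/2$.

```python
import numpy as np

# Risk of the constant estimator c: (c - theta)^2 for theta in {0, 1}.
cs = np.linspace(0, 1, 1001)                 # estimators (constant guesses)
ps = np.linspace(0, 1, 1001)                 # priors P(theta = 1)
risk = np.stack([cs**2, (cs - 1)**2], axis=1)        # risk[c, theta]

minimax = risk.max(axis=1).min()                     # inf_c sup_theta
bayes = (1 - ps)[:, None] * risk[:, 0][None, :] + ps[:, None] * risk[:, 1][None, :]
maxmin = bayes.min(axis=1).max()                     # sup_pi inf_c (worst Bayes risk)
print(minimax, maxmin)    # both 0.25, at the saddle (c, pi) = (1/2, 1/2)
```

The estimator the saddle point suggests, $c = 1/2$, is exactly the Bayes estimator against the least favorable prior, which is the point made below.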

Now, a practitioner will ask: “okay, so which $\widehat T$ does this actually suggest we use?” The answer is: any $\widehat T$ at the saddle point! It seems logically fallacious that we get to respond optimally to the adversary’s least favorable choice, but that’s exactly what we are allowed to do by minimax theory.

Application in computer science

There’s another example of this in the hardness vs. randomness paradigm, but I don’t know if I will ever type it up.

Directions