Online Learning to Precondition I. Space dilation methods for linear systems
This post begins a series on what I call Online Learning to Precondition (OL2P). For a mathematical optimization problem, the core idea is to
improve the optimization landscape by learning from the behavior of the algorithm itself.
This principle is implicit in several existing optimization algorithms and can lead to powerful results. However, a systematic study of this idea seems to be lacking. In this series, I will explain the principle and, more importantly, give concrete instantiations, drawn both from the literature and from my own work. As the first example, I present a “new” algorithm for solving linear systems. The method has superlinear convergence and resembles a quasi-Newton method, but it is derived entirely from a learning principle.
Updates.
- 03/17/2026. Fix typos.
Introduction and background
Consider the convex quadratic minimization problem \[ \min_{x \in \mathbb{R}^n} \quad f(x) := \tfrac{1}{2}\langle x, Ax \rangle - \langle b, x \rangle, \] where \(A \in \mathbb{S}_{++}^n\) is positive definite with eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n > 0\). Its optimality condition \(0 = \nabla f(x) = Ax - b\) corresponds to solving a linear system. For simplicity, I assume that \(f(x^\star)\) is known. This can be arranged, for example, with a least-squares formulation \(f(x) = \tfrac{1}{2}\|Ax - b\|^2\), for which \(f(x^\star) = 0\).
Optimization with steepest descent
A standard algorithm for solving this problem is steepest descent \[ \begin{aligned} \alpha_k &= \arg\min_\alpha f(x^k - \alpha \nabla f(x^k)) = \tfrac{\|\nabla f(x^k)\|^2}{\langle \nabla f(x^k),\, A\nabla f(x^k)\rangle} \\ x^{k+1} &= x^k - \alpha_k \nabla f(x^k), \end{aligned} \] where \(\alpha_k\) minimizes \(f\) over the line \(\{x : x = x^k + \alpha \nabla f(x^k),\, \alpha \in \mathbb{R}\}\). From now on, let’s focus on a single step (from \(x\) to \(x^+\)) and drop the iteration index. Since \(f\) is a quadratic, we can explicitly write down the contraction ratio, denoted by \(r\), as \[ r := \tfrac{f(x^+) - f(x^\star)}{f(x) - f(x^\star)} = 1 - \tfrac{\|\nabla f(x)\|^4}{\|\nabla f(x)\|_A^2\, \|\nabla f(x)\|^2_{A^{-1}}} = 1 - \tfrac{1}{\left\langle \tfrac{\nabla f(x)}{\|\nabla f(x)\|}, A\tfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle \left\langle \tfrac{\nabla f(x)}{\|\nabla f(x)\|}, A^{-1}\tfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle} \] which, by Kantorovich’s inequality \(\tfrac{\|a\|^4}{\|a\|_A^2\, \|a\|_{A^{-1}}^2} \geq \tfrac{4\lambda_1\lambda_n}{(\lambda_1+\lambda_n)^2}\), gives \(r \leq 1 - \tfrac{4\lambda_1\lambda_n}{(\lambda_1+\lambda_n)^2} = \left(\tfrac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n}\right)^2\). When \(\lambda_1 \gg \lambda_n\), the problem is called ill-conditioned, since \(r\) can approach \(1\).
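As a quick numerical sanity check (my own sketch, not part of the derivation), both the closed-form contraction ratio and the Kantorovich bound can be verified with NumPy; the random instance and the eigenvalue range are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random SPD matrix with prescribed eigenvalues (arbitrary choice).
n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.linspace(1.0, 50.0, n)               # lam[0] = lam_n, lam[-1] = lam_1
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
f = lambda z: 0.5 * z @ A @ z - b @ z

# One exact-line-search steepest descent step from a random point.
x = rng.standard_normal(n)
g = A @ x - b                                  # gradient
alpha = (g @ g) / (g @ A @ g)
x_plus = x - alpha * g

# Observed contraction matches the closed form ...
r_obs = (f(x_plus) - f(x_star)) / (f(x) - f(x_star))
r_pred = 1 - (g @ g) ** 2 / ((g @ A @ g) * (g @ np.linalg.solve(A, g)))
assert np.isclose(r_obs, r_pred)

# ... and respects the Kantorovich bound ((lam_1 - lam_n)/(lam_1 + lam_n))^2.
r_bound = ((lam[-1] - lam[0]) / (lam[-1] + lam[0])) ** 2
assert r_obs <= r_bound + 1e-12
```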
More specifically, a bad contraction ratio occurs exactly when the product \(\left\langle \tfrac{\nabla f(x)}{\|\nabla f(x)\|}, A\tfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle \left\langle \tfrac{\nabla f(x)}{\|\nabla f(x)\|}, A^{-1}\tfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle\) is large, which forces at least one of the two factors to be large. Denote \(g := \tfrac{\nabla f(x)}{\|\nabla f(x)\|}\). Then
If the contraction ratio \(r\) is bad, then at least one of \(\langle g, Ag \rangle\) or \(\langle g, A^{-1}g \rangle\) is large.
A standard technique for dealing with ill-conditioned problems is preconditioning.
Preconditioning
There are many ways to write preconditioning; I’ll use a much simplified version. Let \(P\) be a positive definite matrix. Then solving the original problem is the same as solving \[ \min_x \quad \tfrac{1}{2}\langle x, PAPx \rangle - \langle Pb, x \rangle \] up to the change of variables \(x \leftarrow Px\): if \(y\) solves the new problem, then \(Py\) solves the original one. The new function has Hessian \(PAP\), and the choice of \(P\) should ideally make \(PAP\) look like the identity, say, \(\operatorname{dist}(PAP, I) \to 0\).
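Here is a small numerical check (a sketch of mine; the random instance is arbitrary) that the preconditioned objective agrees with the original one under the change of variables:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # positive definite Hessian
b = rng.standard_normal(n)
S = rng.standard_normal((n, n))
P = S @ S.T + n * np.eye(n)        # positive definite preconditioner

# Evaluating the preconditioned objective at y equals evaluating
# the original objective at x = P y.
y = rng.standard_normal(n)
lhs = 0.5 * y @ (P @ A @ P) @ y - (P @ b) @ y
x = P @ y
rhs = 0.5 * x @ A @ x - b @ x
assert np.isclose(lhs, rhs)
```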
The question is how to choose \(P\). Let’s start by choosing a function to measure the distance between a positive definite matrix and the identity \(I\). There are many matrix functions available for this purpose (e.g., the Frobenius norm \(\|P - Q\|_F\)). Here I use the \(\log\det\) divergence (strictly speaking a divergence rather than a distance): \[ V_d(P, Q) = \langle P, Q^{-1} \rangle - \log\det PQ^{-1} - n, \] which is the Bregman divergence \(V_d(P, Q) = d(P) - d(Q) - \langle \nabla d(Q), P - Q\rangle\) induced by \(d(X) = -\log\det X\). There are many technical details regarding the \(\log\det\) divergence, but it will just be used as a distance measure here: we want to find \(P\) such that \[ \operatorname{dist}(PAP, I) = V_d(PAP, I) = \operatorname{tr}(PAP) - \log\det PAP - n \] is close to \(0\). Let’s also assume \(P\) has the parametrized form \(P = I + \sigma vv^\top\) for some unit vector \(\|v\| = 1\) and \(\sigma > -1\) (so that \(P \succ 0\)). This form of \(P\) is known as a space-dilation operator in the optimization literature, since when it operates on a vector, it simply stretches (squeezes) the space in direction \(v\): \[ Pz = (I + \sigma vv^\top)z = z + \sigma \langle v, z \rangle v. \] Given \(P = I + \sigma vv^\top\), we can write \(A^+(v,\sigma) := PAP = (I + \sigma vv^\top)A(I + \sigma vv^\top)\). Using the \(\log\det\) divergence, the way \(V_d(PAP, I)\) changes w.r.t. \(V_d(A, I)\) can be explicitly computed: \[ \begin{aligned} \operatorname{tr}(PAP) &= \operatorname{tr}((I + \sigma vv^\top)(A + \sigma Avv^\top)) \\ &= \operatorname{tr}(A + \sigma Avv^\top + \sigma vv^\top A + \sigma^2 vv^\top Avv^\top) \\ &= \operatorname{tr}(A) + 2\sigma\langle v, Av\rangle + \sigma^2\langle v, Av\rangle \\ \det PAP &= \det[(I + \sigma vv^\top)A(I + \sigma vv^\top)] = (1+\sigma)^2 \det A. 
\end{aligned} \] using \(\det(I + \sigma vv^\top) = 1 + \sigma\) for a unit vector \(v\). Hence we can explicitly write \[ \begin{aligned} V_d(A^+(v,\sigma), I) &= V_d(PAP, I) \\ &= \operatorname{tr}(A) + 2\sigma\langle v, Av\rangle + \sigma^2\langle v, Av\rangle - \log[(1+\sigma)^2 \det A] - n \\ &= \operatorname{tr}(A) - \log\det A - n + 2\sigma\langle v, Av\rangle + \sigma^2\langle v, Av\rangle - 2\log(1+\sigma) \\ &= V_d(A, I) + \underbrace{2\sigma\langle v, Av\rangle + \sigma^2\langle v, Av\rangle - 2\log(1+\sigma)}_{\gamma(\sigma)}, \end{aligned} \] where \(\gamma(\sigma)\) is a scalar function of \(\sigma\) given \(v\). Since our goal is to reduce \(V_d(PAP, I)\), we should minimize \(\gamma(\sigma)\), whose minimizer is \(\sigma^\star = \sqrt{\tfrac{1}{\langle v, Av\rangle}} - 1\), giving \[ V_d(A^+(v, \sigma^\star), I) = V_d(A, I) + \underbrace{\log(\langle v, Av\rangle) - \langle v, Av\rangle + 1}_{\eta(\langle v, Av\rangle)}, \] where \(\eta(x) = \log x - x + 1 \leq 0\) is again a scalar function. It is always non-positive and attains its maximum value \(0\) at \(x = 1\). Hence if we can find \(v\) such that \(\langle v, Av\rangle\) is bounded away from \(1\) (e.g. \(|\langle v, Av\rangle - 1| \geq c_1 > 0\)), then we can ensure that \[ V_d(A^+(v, \sigma^\star), I) \leq V_d(A, I) - c_2 \] for some other constant \(c_2 > 0\). In other words, direction \(v\) helps to reduce the distance. Since \(\|v\| = 1\), having \(|\langle v, Av\rangle - 1| \geq c_1\) requires \(v\) to have either a large or a small Rayleigh quotient. The only question is where to find such a \(v\). Let’s rephrase the problem.
The distance can improve if we find \(v\) such that \(\langle v, Av\rangle\) is bounded away from \(1\).
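The closed-form decrease can also be checked numerically. The following sketch (random instance, arbitrary sizes) verifies that the dilation with \(\sigma^\star\) changes \(V_d(\cdot, I)\) by exactly \(\eta(\langle v, Av\rangle)\):

```python
import numpy as np

rng = np.random.default_rng(2)

def V(M):
    """Log-det divergence V_d(M, I) = tr(M) - logdet(M) - n for SPD M."""
    return np.trace(M) - np.linalg.slogdet(M)[1] - M.shape[0]

n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(0.5, 20.0, n)) @ Q.T

v = rng.standard_normal(n)
v /= np.linalg.norm(v)                     # unit dilation direction
a = v @ A @ v                              # Rayleigh quotient <v, Av>
sigma = 1 / np.sqrt(a) - 1                 # optimal dilation sigma*
P = np.eye(n) + sigma * np.outer(v, v)

# The divergence changes by exactly eta(a) = log(a) - a + 1 <= 0.
assert np.isclose(V(P @ A @ P), V(A) + np.log(a) - a + 1)
```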
Learning to precondition
Let’s put together the observations from the previous sections. After one steepest descent step, we have \[ r = 1 - \tfrac{1}{\left\langle \tfrac{\nabla f(x)}{\|\nabla f(x)\|}, A\tfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle \left\langle \tfrac{\nabla f(x)}{\|\nabla f(x)\|}, A^{-1}\tfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle} = 1 - \tfrac{1}{\langle g, Ag\rangle \langle g, A^{-1}g\rangle}. \]
Case 1. If \(r\) is not too large (say \(r \leq 0.99\)), then the product \(\langle g, Ag\rangle \langle g, A^{-1}g\rangle\) is at most \(100\), and the step makes good progress.
Case 2. If \(r \to 1\) (say \(r > 0.99\)), then \(\tfrac{1}{{\color{red}\langle g, Ag\rangle}\langle g, A^{-1}g\rangle} \to 0\) and at least one of them is large.
But the distance can improve if we find \(v\) such that \(\langle v, Av\rangle\) is bounded away from \(1\).
Clearly, there is no need to worry when the contraction is already good. The interesting case is Case 2 \(r \to 1\), where we know at least one of \(\langle {\color{red}g}, A{\color{red}g}\rangle\) and \(\langle g, A^{-1}g\rangle = \langle {\color{blue}(A^{-1}g)}, A{\color{blue}(A^{-1}g)}\rangle\) is large.
If \(\langle g, Ag\rangle \gg 1\), then we can let \(v \leftarrow {\color{red}g}\) and ensure \(V_d(A^+(v, \sigma^\star), I) \leq V_d(A, I) - c_2\).
If \(\langle g, A^{-1}g\rangle \gg 1\), then we can let \(v \leftarrow {\color{blue}\tfrac{A^{-1}g}{\|A^{-1}g\|}}\) (normalized so that \(\|v\| = 1\)) and again ensure \(V_d(A^+(v, \sigma^\star), I) \leq V_d(A, I) - c_2\), since \(\langle v, Av\rangle = \tfrac{\langle g, A^{-1}g\rangle}{\|A^{-1}g\|^2} \leq \tfrac{1}{\langle g, A^{-1}g\rangle} \ll 1\).
Therefore, every time Case 2 happens, we can ensure that \(V_d(A^+(v, \sigma^\star), I) \leq V_d(A, I) - c_2\). If, whenever Case 2 happens, we replace \(A \leftarrow A^+(v, \sigma^\star)\), \(b \leftarrow Pb\) and scale \(x \leftarrow P^{-1}x\), then the function value remains unchanged: \[ \tfrac{1}{2}\langle (P^{-1}x), PAP(P^{-1}x)\rangle - \langle Pb, P^{-1}x\rangle = \tfrac{1}{2}\langle x, Ax\rangle - \langle b, x\rangle. \] Finally, the nonnegativity of \(V_d\) implies Case 2 can happen at most \(\left\lceil\tfrac{V_d(A,I)}{c_2}\right\rceil\) times. The final complexity of steepest descent becomes \[ \mathcal{O}\!\left(V_d(A, I) + \log\!\left(\tfrac{1}{\varepsilon}\right)\right) \] iterations.
\begin{algorithm}
\caption{Space dilation for solving linear systems}
\begin{algorithmic}
\Require $A, b, x$ and $f(x) = \tfrac{1}{2}\langle x, Ax\rangle - \langle b, x\rangle$
\For{$k = 1, \ldots, K$}
\State \textbf{compute} $\alpha = \tfrac{\|\nabla f(x)\|^2}{\langle \nabla f(x), A\nabla f(x)\rangle}$, $r = \tfrac{f(x - \alpha\nabla f(x)) - f(x^\star)}{f(x) - f(x^\star)}$, $g = \tfrac{\nabla f(x)}{\|\nabla f(x)\|}$
\State $x \leftarrow x - \alpha\nabla f(x)$
\If{$r > 0.99$}
\If{$\langle g, Ag\rangle$ is large}
\State $v = g$
\Else \Comment{$\langle g, A^{-1}g\rangle$ must be large}
\State $v = \tfrac{A^{-1}g}{\|A^{-1}g\|}$
\EndIf
\State \textbf{compute} $\sigma^\star = \sqrt{\tfrac{1}{\langle v, Av\rangle}} - 1$
\State $P = I + \sigma^\star vv^\top$
\State $A \leftarrow PAP$, $b \leftarrow Pb$, $x \leftarrow P^{-1}x$
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
Theorem 1 Algorithm 1 finds an \(\varepsilon\)-optimal solution in \(\mathcal{O}\!\left(V_d(A, I) + \log\!\left(\tfrac{1}{\varepsilon}\right)\right)\) iterations.
The algorithm above is clearly not implementable due to the need to compute \(A^{-1}g\), but it suffices as a prototype for communicating the intuition. In the next section, I’ll remove the need for evaluating \(A^{-1}g\) by using the symmetrized \(\log\det\) divergence.
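For completeness, here is a toy NumPy sketch of Algorithm 1. It is not a practical method: the \(A^{-1}g\) branch is served by an explicit solve (exactly the oracle the text calls not implementable), and the "is large" test uses a threshold of \(10\), my own choice, valid because \(r > 0.99\) forces the larger of the two Rayleigh quotients above \(10\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-conditioned SPD instance; f(x*) is assumed known, as in the text.
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.array([1.0, 1.0, 1.0, 1.0, 1.0, 2500.0])) @ Q.T
b = rng.standard_normal(n)
f_star = -0.5 * b @ np.linalg.solve(A, b)
f = lambda z: 0.5 * z @ A @ z - b @ z

x = np.zeros(n)
gap0 = f(x) - f_star
for _ in range(20000):
    grad = A @ x - b
    if f(x) - f_star <= 1e-8 * gap0:
        break
    alpha = (grad @ grad) / (grad @ A @ grad)
    r = (f(x - alpha * grad) - f_star) / (f(x) - f_star)
    g = grad / np.linalg.norm(grad)
    x = x - alpha * grad                   # steepest descent step
    if r > 0.99:                           # bad contraction: learn from it
        if g @ A @ g > 10:                 # r > 0.99 forces one quotient > 10
            v = g
        else:
            v = np.linalg.solve(A, g)      # oracle A^{-1} g (not implementable)
            v = v / np.linalg.norm(v)      # dilation needs a unit vector
        sigma = 1 / np.sqrt(v @ A @ v) - 1
        P = np.eye(n) + sigma * np.outer(v, v)
        A, b, x = P @ A @ P, P @ b, np.linalg.solve(P, x)

assert f(x) - f_star <= 1e-8 * gap0
```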
Theorem 2 A variant of Algorithm 1 (in the next section) finds an \(\varepsilon\)-optimal solution in \(\mathcal{O}\!\left(V_d(A, I) + V_d(I, A) + \log\!\left(\tfrac{1}{\varepsilon}\right)\right)\) iterations. Moreover, the algorithm has superlinear convergence \[ \tfrac{f(x^{K+1}) - f(x^\star)}{f(x^1) - f(x^\star)} \leq \left(\tfrac{C}{K}\right)^K \] for some \(C > 0\).
As a final remark, the idea mentioned in this post was pioneered by (Monteiro et al. 2004) and (Dunagan and Harvey 2007). The authors of both papers observe that when an algorithm performs poorly (gives a bad contraction), it simultaneously provides information that helps improve the problem’s conditioning (reduce the distance). The approach in this post is inspired by (Dunagan and Harvey 2007), and our \(\log\det\)-based analysis partially resolves some shortcomings mentioned at the end of (Dunagan and Harvey 2007).
Space dilation methods for linear systems
Algorithm 1 in the previous section is not implementable due to the need to compute \(A^{-1}g\). This section fixes this issue. When \(A^{-1}g\) is not available, we need to deal with the case \(\langle g, A^{-1}g\rangle \gg 1\) while \(\langle g, Ag\rangle \to 1\).
The key observation is simple:
when \(\langle g, Ag\rangle \gg 1\), the vector \(g\) can help improve conditioning of \(A\),
which, symmetrically, should give
when \(\langle g, A^{-1}g\rangle \gg 1\), the vector \(g\) can help improve conditioning of \(A^{-1}\).
So let’s define \[ \operatorname{dist}(PAP, I) = \underbrace{V_d(PAP, I)}_{PAP \text{ to } I} + \underbrace{V_d((PAP)^{-1}, I)}_{(PAP)^{-1} \text{ to } I}, \] which, by the property of \(\log\det\) divergence \(V_d(P, Q) = V_d(Q^{-1}, P^{-1})\), gives \[ \operatorname{dist}(PAP, I) = V_d(PAP, I) + V_d(I, PAP). \] It is exactly the symmetrized version of the divergence. Now we can repeat the previous derivation with \(P = I + \sigma vv^\top\) and \(P^{-1} = I - \tfrac{\sigma}{1+\sigma}vv^\top\): \[ \begin{aligned} V_d(PAP, I) &= V_d(A, I) + 2\sigma\langle v, Av\rangle + \sigma^2\langle v, Av\rangle - 2\log(1+\sigma) \\ V_d(I, PAP) &= V_d(I, A) - \tfrac{2\sigma}{1+\sigma}\langle v, A^{-1}v\rangle + \tfrac{\sigma^2}{(1+\sigma)^2}\langle v, A^{-1}v\rangle + 2\log(1+\sigma), \end{aligned} \] and we have \[ \begin{aligned} &V_d(PAP, I) + V_d(I, PAP) \\ \leq\;& V_d(A, I) + V_d(I, A) + 2\sigma\langle v, Av\rangle + \sigma^2\langle v, Av\rangle {\color{red}- 2\log(1+\sigma)} \\ &\quad - \tfrac{2\sigma}{1+\sigma}\langle v, A^{-1}v\rangle + \tfrac{\sigma^2}{(1+\sigma)^2}\langle v, A^{-1}v\rangle {\color{red}+ 2\log(1+\sigma)} \\ =\;& V_d(A, I) + V_d(I, A) + \underbrace{(\sigma^2 + 2\sigma)\langle v, Av\rangle - \tfrac{\sigma^2 + 2\sigma}{(1+\sigma)^2}\langle v, A^{-1}v\rangle}_{\gamma(\sigma)}. \end{aligned} \] Minimizing over \(\sigma\), we have \(\sigma^\star = -1 + \left(\tfrac{\langle v, A^{-1}v\rangle}{\langle v, Av\rangle}\right)^{1/4}\) and that \[ V_d(A^+(v, \sigma^\star), I) + V_d(I, A^+(v, \sigma^\star)) \leq V_d(A, I) + V_d(I, A) - \left(\sqrt{\langle v, A^{-1}v\rangle} - \sqrt{\langle v, Av\rangle}\right)^2. \] The RHS is now the Hellinger distance between two scalars \(\langle v, A^{-1}v\rangle\) and \(\langle v, Av\rangle\). 
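Again the decrease admits a numerical check. The sketch below (random instance of my choosing) verifies that dilation with the new \(\sigma^\star\) decreases the symmetrized divergence by exactly \(\left(\sqrt{\langle v, A^{-1}v\rangle} - \sqrt{\langle v, Av\rangle}\right)^2\); note that in the symmetrized divergence the \(\log\det\) terms cancel, so \(V_d(M, I) + V_d(I, M) = \operatorname{tr}(M) + \operatorname{tr}(M^{-1}) - 2n\):

```python
import numpy as np

rng = np.random.default_rng(4)

def Vsym(M):
    """Symmetrized divergence V_d(M, I) + V_d(I, M) = tr(M) + tr(M^{-1}) - 2n."""
    return np.trace(M) + np.trace(np.linalg.inv(M)) - 2 * M.shape[0]

n = 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(0.2, 30.0, n)) @ Q.T

v = rng.standard_normal(n)
v /= np.linalg.norm(v)                       # unit dilation direction
a = v @ A @ v                                # <v, A v>
c = v @ np.linalg.solve(A, v)                # <v, A^{-1} v>
sigma = (c / a) ** 0.25 - 1                  # optimal sigma* for the symmetrized case
P = np.eye(n) + sigma * np.outer(v, v)

# Decrease equals the squared difference of the root Rayleigh quotients.
assert np.isclose(Vsym(P @ A @ P), Vsym(A) - (np.sqrt(c) - np.sqrt(a)) ** 2)
```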
To ensure that \[ \left|\sqrt{\langle v, A^{-1}v\rangle} - \sqrt{\langle v, Av\rangle}\right| \geq c_1, \] we need \(v\) to achieve sufficiently different Rayleigh quotients on \(A\) and \(A^{-1}\). Can we find such \(v\) when \(r = 1 - \tfrac{1}{\langle g, Ag\rangle \langle g, A^{-1}g\rangle} \to 1\)? Let’s do a concrete case analysis.
Case 1. If \(r \leq \tfrac{15}{16}\), then we are fine.
Case 2. If \(r > \tfrac{15}{16}\), then \(\tfrac{1}{\langle g, A^{-1}g\rangle \langle g, Ag\rangle} < \tfrac{1}{16}\), so \(\langle g, Ag\rangle \langle g, A^{-1}g\rangle > 16\) and \(\max\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} > 4\).
Case 2.1. If \(\min\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \leq 2\), then \(\max\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} > \tfrac{16}{\min\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\}} \geq 8\), so \(\left|\sqrt{\langle g, A^{-1}g\rangle} - \sqrt{\langle g, Ag\rangle}\right| \geq \sqrt{8} - \sqrt{2} = \sqrt{2}\). We can just take \(v = g\).
Case 2.2. If \(\min\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \geq 2\). This case is slightly more subtle: it is possible that \(\langle g, Ag\rangle\) and \(\langle g, A^{-1}g\rangle\) are equally large. Interestingly, we can show that running one step of power iteration on \(g\) generates a new direction that meets our need: let \(z = \tfrac{Ag}{\|Ag\|}\). We have \[ \begin{aligned} \langle z, Az\rangle = \tfrac{\langle g, A^3g\rangle}{\|Ag\|^2} &\geq \tfrac{\|Ag\|^2\langle g, Ag\rangle}{\|Ag\|^2} = \langle g, Ag\rangle {\color{red}\geq 2} \\ \langle z, A^{-1}z\rangle = \tfrac{\langle g, Ag\rangle}{\|Ag\|^2} &\leq \tfrac{\langle g, Ag\rangle}{\langle g, Ag\rangle^2} \leq \tfrac{1}{2}, \end{aligned} \] where both inequalities use spectral Cauchy-Schwarz: \(\langle g, A^3g\rangle \langle g, Ag\rangle \geq \langle g, A^2g\rangle^2 = \|Ag\|^4\) and \(\|Ag\|^2 \geq \langle g, Ag\rangle^2\). Taking \(v = z\) gives \(\left|\sqrt{\langle z, A^{-1}z\rangle} - \sqrt{\langle z, Az\rangle}\right| \geq \sqrt{2} - \tfrac{\sqrt{2}}{2} \geq \tfrac{1}{4}\).
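The Case 2.2 computation can be checked numerically. In the sketch below (eigenvalues are my own choice), \(g\) mixes the extreme eigenvectors so that both Rayleigh quotients are at least \(2\), and one power-iteration step separates them:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 7
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.geomspace(0.05, 40.0, n)              # eigenvalues (arbitrary choice)
A = Q @ np.diag(lam) @ Q.T

# Mix the extreme eigenvectors so that both Rayleigh quotients are >= 2,
# i.e. the hard Case 2.2 where neither quotient is small.
g = (Q[:, 0] + Q[:, -1]) / np.sqrt(2)
a = g @ A @ g                                  # about (0.05 + 40)/2
c = g @ np.linalg.solve(A, g)                  # about (20 + 1/40)/2
assert min(a, c) >= 2

# One power-iteration step separates the two quotients.
z = A @ g / np.linalg.norm(A @ g)
assert z @ A @ z >= a - 1e-10                  # quotient on A does not shrink
assert z @ np.linalg.solve(A, z) <= 1 / a + 1e-10  # quotient on A^{-1} collapses
```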
Now we have Algorithm 2.
\begin{algorithm}
\caption{Another space dilation algorithm}
\begin{algorithmic}
\Require $A, b, x$ and $f(x) = \tfrac{1}{2}\langle x, Ax\rangle - \langle b, x\rangle$
\For{$k = 1, \ldots, K$}
\State \textbf{compute} $\alpha = \tfrac{\|\nabla f(x)\|^2}{\langle \nabla f(x), A\nabla f(x)\rangle}$, $r = \tfrac{f(x - \alpha\nabla f(x)) - f(x^\star)}{f(x) - f(x^\star)}$, $g = \tfrac{\nabla f(x)}{\|\nabla f(x)\|}$
\State $x \leftarrow x - \alpha\nabla f(x)$
\If{$r > \tfrac{15}{16}$}
\If{Case 2.1 happens}
\State $v = g$
\Else
\State $v = \tfrac{Ag}{\|Ag\|}$
\EndIf
\State \textbf{compute} $\sigma^\star = \left(\tfrac{\langle v, A^{-1}v\rangle}{\langle v, Av\rangle}\right)^{1/4} - 1$
\State $P = I + \sigma^\star vv^\top$
\State $A \leftarrow PAP$, $b \leftarrow Pb$, $x \leftarrow P^{-1}x$
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
As a last minor point, Algorithm 2 requires computing \(\langle v, A^{-1}v\rangle\) in \(\sigma^\star\). When \(v = \tfrac{Ag}{\|Ag\|}\), this is not a problem, since \(\left\langle \tfrac{Ag}{\|Ag\|}, A^{-1}\tfrac{Ag}{\|Ag\|}\right\rangle = \tfrac{\langle g, Ag\rangle}{\|Ag\|^2}\). The only issue happens when \(v = g\), where \[ \tfrac{1}{2}\langle g, A^{-1}g\rangle = \tfrac{1}{2}\tfrac{\langle \nabla f(x), A^{-1}\nabla f(x)\rangle}{\|\nabla f(x)\|^2} = \tfrac{f(x) - f(x^\star)}{\|\nabla f(x)\|^2} \] holds for quadratic functions. Although no \(A^{-1}g\) is required, the formula does depend on the optimality gap \(f(x) - f(x^\star)\), which we assumed known.
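A quick check of this gap identity on a random quadratic (a sketch; the instance is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1.0, 9.0, n)) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
f = lambda z: 0.5 * z @ A @ z - b @ z

x = rng.standard_normal(n)
grad = A @ x - b
g = grad / np.linalg.norm(grad)

# <g, A^{-1} g> recovered from the optimality gap -- no A^{-1} needed.
lhs = 0.5 * g @ np.linalg.solve(A, g)
rhs = (f(x) - f(x_star)) / (grad @ grad)
assert np.isclose(lhs, rhs)
```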
Potential reduction
To prove the promised Theorem 2, I’ll introduce a tool for analyzing algorithms with this “learning to optimize” intuition. At each step of the algorithm, we
- either make progress (contraction \(r \to 0\)), or
- improve the landscape (\(V_d\) shrinks).
Thus, even when the contraction is poor, each step still decreases a suitable potential.
For the specific algorithm we consider, let’s define the potential function \[ \varphi(A, x) := \underbrace{\log(f(x) - f(x^\star))}_{\text{Optimization}} + \underbrace{[V_d(A, I) + V_d(I, A)]}_{\text{Learning}}. \]
The algorithm is slightly different from Algorithm 2: it performs a dilation at every iteration, with an \(r\)-dependent threshold for choosing between the two cases.
\begin{algorithm}
\caption{Space dilation with superlinear convergence rate}
\begin{algorithmic}
\Require $A, b, x$ and $f(x) = \tfrac{1}{2}\langle x, Ax\rangle - \langle b, x\rangle$
\For{$k = 1, \ldots, K$}
\State \textbf{compute} $\alpha = \tfrac{\|\nabla f(x)\|^2}{\langle \nabla f(x), A\nabla f(x)\rangle}$, $r = \tfrac{f(x - \alpha\nabla f(x)) - f(x^\star)}{f(x) - f(x^\star)}$, $g = \tfrac{\nabla f(x)}{\|\nabla f(x)\|}$
\State $x \leftarrow x - \alpha\nabla f(x)$
\If{$\min\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \leq \sqrt{\tfrac{2-r}{2(1-r)}}$}
\State $v = g$
\Else
\State $v = \tfrac{Ag}{\|Ag\|}$
\EndIf
\State \textbf{compute} $\sigma^\star = \left(\tfrac{\langle v, A^{-1}v\rangle}{\langle v, Av\rangle}\right)^{1/4} - 1$
\State $P = I + \sigma^\star vv^\top$
\State $A \leftarrow PAP$, $b \leftarrow Pb$, $x \leftarrow P^{-1}x$
\EndFor
\end{algorithmic}
\end{algorithm}
Theorem 2 (restated) Algorithm 3 satisfies \(\varphi(A^+, x^+) \leq \varphi(A, x) - \tfrac{1}{2}\log\tfrac{\mathrm{e}}{2}\). Moreover, superlinear convergence holds: \[ \tfrac{f(x^{K+1}) - f(x^\star)}{f(x^1) - f(x^\star)} \leq \left(\sqrt{\tfrac{4[V_d(A,I) + V_d(I,A)]}{K}}\right)^K. \]
Proof. Since \(r = 1 - \tfrac{1}{\langle g, Ag\rangle\langle g, A^{-1}g\rangle}\), we have \(\langle g, Ag\rangle\langle g, A^{-1}g\rangle = \tfrac{1}{1-r}\) and \(\max\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \geq \tfrac{1}{\sqrt{1-r}}\).
Case 1. If \(\min\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \leq \alpha := \sqrt{\dfrac{2-r}{2(1-r)}}\), \(v = g\) gives \(\max\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \geq \max\!\left\{\tfrac{1}{1-r}\alpha^{-1}, \tfrac{1}{\sqrt{1-r}}\right\}\), \[ \begin{aligned} \left(\sqrt{\langle v, A^{-1}v\rangle} - \sqrt{\langle v, Av\rangle}\right)^2 &\geq \max\!\left\{\max\!\left\{\tfrac{1}{1-r}\alpha^{-1}, \tfrac{1}{\sqrt{1-r}}\right\} - \alpha,\, 0\right\}^2 \\ &= \max\!\left\{\max\!\left\{\sqrt{\tfrac{2}{(2-r)(1-r)}}, \tfrac{1}{\sqrt{1-r}}\right\} - \sqrt{\tfrac{2-r}{2(1-r)}},\, 0\right\}^2 \\ &= \max\!\left\{\sqrt{\tfrac{2}{2-r}}\tfrac{1}{\sqrt{1-r}} - \sqrt{\tfrac{2-r}{2}}\tfrac{1}{\sqrt{1-r}},\, 0\right\}^2 \\ &= \tfrac{r^2}{2(r-2)(r-1)}. \end{aligned} \]
Case 2. If \(\min\{\langle g, Ag\rangle, \langle g, A^{-1}g\rangle\} \geq \alpha\), \(v = \tfrac{Ag}{\|Ag\|}\) gives \(\langle v, Av\rangle \geq \alpha\), \(\langle v, A^{-1}v\rangle \leq \tfrac{1}{\alpha}\) and \[ \begin{aligned} \left(\sqrt{\langle v, A^{-1}v\rangle} - \sqrt{\langle v, Av\rangle}\right)^2 &\geq \max\!\left\{\alpha - \tfrac{1}{\alpha},\, 0\right\}^2 \\ &= \max\!\left\{\sqrt{\tfrac{2-r}{2-2r}} - \sqrt{\tfrac{2-2r}{2-r}},\, 0\right\}^2 = \tfrac{r^2}{2(r-2)(r-1)}. \end{aligned} \]
Hence \(\left(\sqrt{\langle v, A^{-1}v\rangle} - \sqrt{\langle v, Av\rangle}\right)^2 \geq \tfrac{r^2}{2(2-r)(1-r)} \geq \tfrac{r^2}{4}\) in both cases. In view of the potential function, we have \[ \begin{aligned} \log(f(x^+) - f(x^\star)) &= \log(f(x) - f(x^\star)) + \log r \\ V_d(PAP, I) + V_d(I, PAP) &\leq V_d(A, I) + V_d(I, A) - \tfrac{r^2}{4}. \end{aligned} \] Hence, \[ \begin{aligned} \varphi(A^+, x^+) &\leq \varphi(A, x) + \log r - \tfrac{r^2}{4} \\ &\leq \varphi(A, x) + \max_{r > 0}\left[\log r - \tfrac{r^2}{4}\right] \\ &= \varphi(A, x) - \tfrac{1}{2}\log\tfrac{\mathrm{e}}{2}, \end{aligned} \] where the maximum is attained at \(r = \sqrt{2}\).
Finally, to show superlinear convergence, we denote \(r_k = \tfrac{f(x^{k+1}) - f(x^\star)}{f(x^k) - f(x^\star)}\). Summing the relation \[ V_d(PAP, I) + V_d(I, PAP) \leq V_d(A, I) + V_d(I, A) - \tfrac{r^2}{4} \] over \(K\) iterations and using the nonnegativity of the divergence gives \(\sum_{k=1}^K r_k^2 \leq 4[V_d(A, I) + V_d(I, A)]\). With the quadratic mean inequality, we have \[ \tfrac{1}{K}\sum_{k=1}^K r_k \leq \sqrt{\tfrac{1}{K}\sum_{k=1}^K r_k^2} \leq \sqrt{\tfrac{4[V_d(A,I) + V_d(I,A)]}{K}}, \] which, by the AM-GM inequality, implies \[ \tfrac{f(x^{K+1}) - f(x^\star)}{f(x^1) - f(x^\star)} = \prod_{k=1}^K r_k \leq \left(\tfrac{1}{K}\sum_{k=1}^K r_k\right)^K \leq \left(\sqrt{\tfrac{4[V_d(A,I) + V_d(I,A)]}{K}}\right)^K, \] and this completes the proof.
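To close, here is a NumPy sketch of Algorithm 3. It is my own rendering of the pseudocode: \(\langle g, A^{-1}g\rangle\) comes from the gap identity of the previous section, \(\langle v, A^{-1}v\rangle\) for \(v = \tfrac{Ag}{\|Ag\|}\) from \(\tfrac{\langle g, Ag\rangle}{\|Ag\|^2}\), and \(P^{-1} = I - \tfrac{\sigma}{1+\sigma}vv^\top\), so no inverse of \(A\) is ever formed; the instance and tolerances are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.geomspace(1.0, 100.0, n)) @ Q.T
b = rng.standard_normal(n)
f_star = -0.5 * b @ np.linalg.solve(A, b)      # f(x*) assumed known
f = lambda z: 0.5 * z @ A @ z - b @ z

x = np.zeros(n)
gap0 = f(x) - f_star
for _ in range(5000):
    grad = A @ x - b
    gap = f(x) - f_star
    if gap <= 1e-10 * gap0:
        break
    g = grad / np.linalg.norm(grad)
    a = g @ A @ g                              # <g, A g>
    c = 2 * gap / (grad @ grad)                # <g, A^{-1} g> via the gap identity
    r = 1 - 1 / (a * c)                        # contraction of the exact step
    x = x - (grad @ grad) / (grad @ A @ grad) * grad
    if min(a, c) <= np.sqrt((2 - r) / (2 * (1 - r))):
        v, va, vc = g, a, c
    else:
        Ag = A @ g
        v = Ag / np.linalg.norm(Ag)
        va, vc = v @ A @ v, a / (Ag @ Ag)      # <v, A^{-1} v> without A^{-1}
    sigma = (vc / va) ** 0.25 - 1
    P = np.eye(n) + sigma * np.outer(v, v)
    Pinv = np.eye(n) - sigma / (1 + sigma) * np.outer(v, v)
    A, b, x = P @ A @ P, P @ b, Pinv @ x

assert f(x) - f_star <= 1e-10 * gap0
```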