Adagrad and projections onto ellipsoids

((Caveat! I am not sure the manipulations done in this post are correct, but the gist is certainly there.))

One of my favorite optimization techniques is Adagrad, a first-order method that uses all the gradients observed so far to build a Hessian-like preconditioner. It calls for updates of the form:
\[
x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}} (x_t - \eta G_{t}^{-1/2} g_t),
\]
or more practically for high-dimensional problems,
\[
x_{t+1} = \Pi_{\mathcal{X}}^{\text{diag}(G_t)^{1/2}} (x_t - \eta \text{diag}(G_{t})^{-1/2} g_t).
\]
Here, \(g_t\) denotes the gradient at step \(t\), \(\eta\) is a fixed stepsize, \(\mathcal{X}\) is the constraint set for your optimization problem,
\[
G_t = \sum_{i=1}^t g_i g_i^T,
\]
and \(\Pi_{\mathcal{X}}^A\) is the projection onto the constraint set with respect to a distance function defined by a positive-definite matrix \(A\):
\[
\Pi_{\mathcal{X}}^A (y) := \text{argmin}_{x \in \mathcal{X}} \langle x - y, A (x-y) \rangle = \text{argmin}_{x \in \mathcal{X}} \|A^{1/2} (x-y)\|_2^2.
\]
By convention \(\Pi_{\mathcal{X}} = \Pi_{\mathcal{X}}^I\) denotes the Euclidean projection operator.
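
For concreteness, here is a minimal NumPy sketch of the diagonal, unconstrained variant of the update (the projection step is the whole point of the rest of this post, so it's omitted here); `grad` is a hypothetical function returning the gradient at a point, and `eps` is a small constant guarding against division by zero:

```python
import numpy as np

def adagrad(grad, x0, eta=0.1, n_steps=100, eps=1e-8):
    """Sketch of unconstrained diagonal Adagrad; `grad` maps a point to its gradient."""
    x = np.array(x0, dtype=float)
    G_diag = np.zeros_like(x)                    # running diag(G_t): sum of squared gradients
    for _ in range(n_steps):
        g = grad(x)
        G_diag += g * g
        x -= eta * g / (np.sqrt(G_diag) + eps)   # per-coordinate step sizes
    return x
```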

The neat thing about Adagrad is that it chooses a different effective step size for each variable, rather than using one fixed step size for all of them. I’ve found that Adagrad outperforms SGD on my unconstrained optimization problems.

Now I need to use Adagrad on a ball-constrained problem: \(\mathcal{X} = \{ x : \|x\|_2 \leq B \}\). Computing the associated projection operator \(\Pi_{\mathcal{X}}^A\) was a nice enough exercise in basic convex optimization that it seems worth recording here. Actually, given the ubiquity of this constraint set, I’m surprised the authors didn’t seem to provide this as an example in their paper (I may have missed it, since I only read the absolute minimum to get the gist of the idea and then skimmed the rest).

Adagrad’s connection to Euclidean projection onto ellipsoids

Our first step is to observe that
\[
\min_{\|x\|_2 \leq B} \|A^{1/2} (x - y)\|_2^2 = \min_{\|A^{-1/2} x\|_2 \leq B} \|x - A^{1/2}y\|_2^2
\]
since \(A^{1/2}\) is invertible (substitute \(u = A^{1/2} x\), so that the constraint \(\|x\|_2 \leq B\) becomes \(\|A^{-1/2} u\|_2 \leq B\)), and that the unique minimizer of the first program is a linear transformation of the unique minimizer of the second. Concretely,
\begin{align*}
\Pi_{\|x\|_2 \leq B}^{A}(y) & = A^{-1/2} \text{argmin}_{\|A^{-1/2} x\|_2 \leq B} \|x - A^{1/2}y\|_2^2 \\
& = A^{-1/2} \Pi_{\|A^{-1/2} x\|_2 \leq B} (A^{1/2} y).
\end{align*}
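
As a quick numerical sanity check of this identity, the sketch below solves both programs with cvxpy (assuming it is installed) for a random positive-definite \(A\) and compares the two minimizers:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, B = 5, 1.0
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                              # a random positive-definite A
y = rng.standard_normal(n)

sig, U = np.linalg.eigh(A)
A_half = U @ np.diag(np.sqrt(sig)) @ U.T             # A^{1/2}
A_neg_half = U @ np.diag(1.0 / np.sqrt(sig)) @ U.T   # A^{-1/2}

# Left-hand side: projection of y onto the ball, in the metric induced by A.
x1 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(A_half @ (x1 - y))),
           [cp.norm(x1, 2) <= B]).solve()

# Right-hand side: Euclidean projection of A^{1/2} y onto the ellipsoid,
# mapped back through A^{-1/2}.
x2 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(x2 - A_half @ y)),
           [cp.norm(A_neg_half @ x2, 2) <= B]).solve()

print(np.max(np.abs(x1.value - A_neg_half @ x2.value)))  # should be ~0
```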

The Lagrangian dual to Euclidean projection onto an ellipsoid

Thus, it suffices to determine the Euclidean projection operator onto an ellipsoid, \(\Pi_{\|Q x\|_2 \leq B}(z)\), where \(Q\) is positive-semidefinite. To do so, let’s write out the optimization problem that defines this projection operator:
\[
\min_{x^T Q^T Q x \leq B^2} (x - z)^T(x-z).
\]
The associated Lagrangian is:
\begin{align*}
L(x,\lambda) &= (x-z)^T(x-z) - \lambda ( B^2 - x^TQ^TQx) \\
& = x^T( I + \lambda Q^TQ) x - 2z^T x + z^Tz - \lambda B^2.
\end{align*}
Recall that \(L(x,\lambda)\) is a lower bound on the objective whenever \(x\) is feasible and \(\lambda \geq 0.\) To find a uniform lower bound, we minimize \(L\) with respect to \(x\) by setting \(\nabla_x L = 0\). This occurs when
\[
x = (I + \lambda Q^TQ)^{-1} z,
\]
and gives the uniform lower bound
\[
g(\lambda) = \min_x L(x, \lambda) = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2.
\]
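To see where this comes from, write \(M := I + \lambda Q^TQ\) and plug \(x = M^{-1} z\) back into the expanded form of \(L\):
\begin{align*}
L(M^{-1}z, \lambda) & = z^T M^{-1} M M^{-1} z - 2 z^T M^{-1} z + z^Tz - \lambda B^2 \\
& = z^Tz - z^T M^{-1} z - \lambda B^2 \\
& = z^T[ I - M^{-1} ]z - \lambda B^2.
\end{align*}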
The optimization problem we’re looking at satisfies strong duality (Slater’s condition holds, since \(x = 0\) is strictly feasible whenever \(B > 0\)), so maximizing \(g\) over the set of nonnegative \(\lambda\) gives the same optimal value as the original problem, and corresponding to the optimal \(\lambda\) there is an optimal \(x\). We now find the optimal \(\lambda\) and show how to recover the optimal \(x\) from it.

From the dual optimal point to the primal optimal point

First write the SVD of \(Q^TQ\) as \(Q^TQ = U \Sigma U^T.\) It follows that
\[
I - (I + \lambda Q^TQ)^{-1} = U \left[ \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} \right]_{ii} U^T,
\]
where \(\sigma_i = \Sigma_{ii}.\) The optimal \(\lambda\) then is the nonnegative \(\lambda\) which maximizes
\begin{align*}
g(\lambda) & = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2 \\
& = \sum\nolimits_i (U^Tz)_i^2 \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} - \lambda B^2.
\end{align*}

Observe that \(g(\lambda)\) is concave on \(\mathbb{R}^+,\) so its maximum occurs at \(0\) or the point where \(g^\prime(\lambda) = 0\). Thus, if the equation
\[
\sum\nolimits_i (U^Tz)_i^2 \frac{\sigma_i}{(1 + \lambda \sigma_i)^2} = \|Q(I + \lambda Q^TQ)^{-1} z\|_2^2 = B^2
\]
has a nonnegative solution, that solution is the optimal \(\lambda\). If not, then the optimal \(\lambda\) is \(0.\)
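
This is a one-dimensional root-finding problem, so it is easy to handle numerically. Below is a sketch in NumPy/SciPy of the resulting ellipsoid projection, assuming \(Q^TQ\) is positive semidefinite and \(B > 0\); the bracket for `brentq` is grown by doubling, which works because \(g^\prime\) is nonincreasing and tends to \(-B^2\).

```python
import numpy as np
from scipy.optimize import brentq

def ellipsoid_projection(Q, z, B):
    """Sketch: Euclidean projection of z onto {x : ||Q x||_2 <= B}."""
    QtQ = Q.T @ Q
    sig, U = np.linalg.eigh(QtQ)             # Q^T Q = U diag(sig) U^T
    w = (U.T @ z) ** 2

    def gprime(lam):
        # g'(lam) = sum_i (U^T z)_i^2 * sig_i / (1 + lam * sig_i)^2 - B^2
        return np.sum(w * sig / (1.0 + lam * sig) ** 2) - B ** 2

    if gprime(0.0) <= 0.0:                    # ||Q z||_2 <= B: z is already feasible
        lam = 0.0
    else:
        hi = 1.0
        while gprime(hi) > 0.0:               # grow the bracket until the sign flips
            hi *= 2.0
        lam = brentq(gprime, 0.0, hi)
    return np.linalg.solve(np.eye(len(z)) + lam * QtQ, z)
```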

Given this optimal \(\lambda,\) the corresponding optimal \(x\) is given by
\[
x = \Pi_{\|Qx\|_2 \leq B}(z) = (I + \lambda Q^TQ)^{-1} z.
\]

From the Euclidean projection on an ellipsoid back to the Adagrad projector

Putting all the pieces together, we get the following expression for the projection operator needed in Adagrad:
\begin{align*}
\Pi_{\|x\|_2 \leq B}^A(y) & = A^{-1/2} \Pi_{\|A^{-1/2}x\|_2 \leq B}(A^{1/2}y) \\
& = A^{-1/2} ( I + \lambda A^{-1})^{-1} A^{1/2} y \\
& = (I + \lambda A^{-1})^{-1}y = A(A + \lambda I)^{-1}y,
\end{align*}
where \(\lambda\) is either the nonnegative solution to the nonlinear equation
\begin{align*}
\sum\nolimits_i (U^T y)_i^2 \frac{\sigma_i^2}{(\sigma_i + \lambda)^2} & = \|(I + \lambda A^{-1})^{-1} y \|_2^2 \\
& = \|A(A + \lambda I)^{-1} y\|_2^2 = B^2,
\end{align*}
where \(A = U \Sigma U^T\) and \(\sigma_i = \Sigma_{ii},\) or, if such a solution does not exist, \(0.\)
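
Finally, here is a sketch of the resulting projection operator for full-matrix Adagrad, written in terms of \(A\) directly and assuming \(A\) is symmetric positive definite and \(B > 0\); the scalar equation above is solved with SciPy's `brentq`:

```python
import numpy as np
from scipy.optimize import brentq

def adagrad_ball_projection(A, y, B):
    """Sketch: Pi^A_{||x||_2 <= B}(y) = A (A + lam I)^{-1} y, with lam as derived above."""
    sig, U = np.linalg.eigh(A)                # A = U diag(sig) U^T
    w = (U.T @ y) ** 2

    def residual(lam):
        # ||A (A + lam I)^{-1} y||_2^2 - B^2
        return np.sum(w * (sig / (sig + lam)) ** 2) - B ** 2

    if residual(0.0) <= 0.0:                  # ||y||_2 <= B: y is already in the ball
        return y
    hi = 1.0
    while residual(hi) > 0.0:                 # grow the bracket until the sign flips
        hi *= 2.0
    lam = brentq(residual, 0.0, hi)
    return A @ np.linalg.solve(A + lam * np.eye(len(y)), y)
```

In the diagonal variant \(A = \text{diag}(G_t)^{1/2}\), the eigendecomposition is trivial (the eigenvalues are just the diagonal entries), so each evaluation of the residual costs only \(O(n)\).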