Adagrad and projections onto ellipsoids

((Caveat! I am not sure the manipulations done in this post are correct, but the gist is certainly there.))

One of my favorite optimization techniques is Adagrad, a first-order technique that approximates the Hessian by using all the gradients up to that point. It calls for updates of the form:
\[
x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}} (x_t - \eta G_{t}^{-1/2} g_t),
\]
or more practically for high-dimensional problems,
\[
x_{t+1} = \Pi_{\mathcal{X}}^{\text{diag}(G_t)^{1/2}} (x_t - \eta \text{diag}(G_{t})^{-1/2} g_t).
\]
Here, \(g_t\) denotes the gradient at step \(t\), \(\eta\) is a fixed stepsize, \(\mathcal{X}\) is the constraint set for your optimization problem,
\[
G_t = \sum_{i=1}^t g_i g_i^T,
\]
and \(\Pi_{\mathcal{X}}^A\) is the projection onto the constraint set with respect to a distance function defined by a positive-definite matrix \(A\):
\[
\Pi_{\mathcal{X}}^A (y) := \text{argmin}_{x \in \mathcal{X}} \langle x - y, A (x-y) \rangle = \text{argmin}_{x \in \mathcal{X}} \|A^{1/2} (x-y)\|_2^2.
\]
By convention \(\Pi_{\mathcal{X}} = \Pi_{\mathcal{X}}^I\) denotes the Euclidean projection operator.
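
For concreteness, here is a minimal numpy sketch of the diagonal update above on an unconstrained problem; the names (adagrad, grad_fn, eps) are mine, and eps is the usual small constant added to avoid division by zero. The projection step needed for a constrained \(\mathcal{X}\) is the subject of the rest of this post.

    import numpy as np

    def adagrad(grad_fn, x0, eta=0.1, eps=1e-8, num_steps=1000):
        """Diagonal Adagrad on an unconstrained problem.

        grad_fn(x) returns the (sub)gradient g_t at x; sq_grad_sum
        accumulates diag(G_t) = sum_t g_t**2 coordinatewise.
        """
        x = np.array(x0, dtype=float)
        sq_grad_sum = np.zeros_like(x)                    # diag(G_t)
        for _ in range(num_steps):
            g = grad_fn(x)
            sq_grad_sum += g * g
            x -= eta * g / (np.sqrt(sq_grad_sum) + eps)   # x_t - eta diag(G_t)^{-1/2} g_t
            # For constrained X, project x back onto X here with respect
            # to diag(G_t)^{1/2}, as derived in the rest of the post.
        return x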

The neat thing about Adagrad is that it chooses a different effective step size for each variable, rather than using one fixed step size for all of them. I’ve found that Adagrad outperforms sgd on my unconstrained optimization problems.

Now I need to use Adagrad on a ball-constrained problem: \(\mathcal{X} = \{ x : \|x\|_2 \leq B \}\). Computing the associated projection operator \(\Pi_{\mathcal{X}}^A\) was a nice enough exercise in basic convex optimization that it seems worth recording here. Actually, given the ubiquity of this constraint set, I’m surprised the authors didn’t seem to provide it as an example in their paper (I may have missed it, since I only read the absolute minimum to get the gist of the idea and then skimmed the rest).

Adagrad’s connection to Euclidean projection onto ellipsoids

Our first step is to observe that
\[
\min_{\|x\|_2 \leq B} \|A^{1/2} (x - y)\|_2^2 = \min_{\|A^{-1/2} x\|_2 \leq B} \|x - A^{1/2}y\|_2^2
\]
since \(A^{1/2}\) is invertible (substitute \(u = A^{1/2} x\) in the first program), and the unique minimizer of the first program is a linear transformation of the unique minimizer of the second. Concretely,
\begin{align*}
\Pi_{\|x\|_2 \leq B}^{A}(y) & = A^{-1/2} \text{argmin}_{\|A^{-1/2} x\|_2 \leq B} \|x - A^{1/2}y\|_2^2 \\
& = A^{-1/2} \Pi_{\|A^{-1/2} x\|_2 \leq B} (A^{1/2} y).
\end{align*}
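
As a quick numerical sanity check of this identity (this is not from the Adagrad paper; the data A, y, and B below are arbitrary), one can solve both programs with a generic solver, here scipy’s SLSQP, and compare the minimizers:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, B = 4, 1.0
    M = rng.standard_normal((n, n))
    A = M @ M.T + np.eye(n)                       # positive definite
    y = rng.standard_normal(n)

    # A^{1/2} and A^{-1/2} from the eigendecomposition of A.
    w, U = np.linalg.eigh(A)
    A_half = U @ np.diag(np.sqrt(w)) @ U.T
    A_nhalf = U @ np.diag(1.0 / np.sqrt(w)) @ U.T

    ball = [{"type": "ineq", "fun": lambda x: B - np.linalg.norm(x)}]
    ellipsoid = [{"type": "ineq", "fun": lambda x: B - np.linalg.norm(A_nhalf @ x)}]

    # min ||A^{1/2}(x - y)||^2  subject to  ||x||_2 <= B
    p1 = minimize(lambda x: np.sum((A_half @ (x - y)) ** 2),
                  np.zeros(n), constraints=ball, method="SLSQP")
    # min ||x - A^{1/2} y||^2  subject to  ||A^{-1/2} x||_2 <= B
    p2 = minimize(lambda x: np.sum((x - A_half @ y) ** 2),
                  np.zeros(n), constraints=ellipsoid, method="SLSQP")

    print(np.allclose(p1.x, A_nhalf @ p2.x, atol=1e-4))   # minimizers agree up to A^{-1/2}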

The Lagrangian dual to Euclidean projection onto an ellipsoid

Thus, it suffices to determine the Euclidean projection operator onto an ellipsoid, \(\Pi_{\|Q x\|_2 \leq B}(z)\), where \(Q\) is positive-semidefinite (in our application \(Q = A^{-1/2}\) is positive-definite). To do so, let’s write out the optimization problem that defines this projection operator:
\[
\min_{x^T Q^T Q x \leq B^2} (x - z)^T(x-z).
\]
The associated Lagrangian is:
\begin{align*}
L(x,\lambda) &= (x-z)^T(x-z) - \lambda ( B^2 - x^TQ^TQx) \\
& = x^T( I + \lambda Q^TQ) x - 2z^T x + z^Tz - \lambda B^2.
\end{align*}
Recall that \(L(x,\lambda)\) is a lower bound on the objective whenever \(x\) is feasible and \(\lambda \geq 0.\) To find a uniform lower bound, we minimize \(L\) with respect to \(x\) by setting \(\nabla_x L = 0\). This occurs when
\[
x = (I + \lambda Q^TQ)^{-1} z,
\]
and gives the uniform lower bound
\[
g(\lambda) = \min_x L(x, \lambda) = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2.
\]
The optimization problem we’re looking at satisfies strong duality (Slater’s condition holds, since \(x = 0\) is strictly feasible when \(B > 0\)), so maximizing \(g\) over the set of nonnegative \(\lambda\) gives the same optimal value as the original problem, and corresponding to the optimal \(\lambda\) there is an optimal \(x\). We now find the optimal \(\lambda\) and show how to recover the optimal \(x\).

From the dual optimal point to the primal optimal point

First write the SVD of \(Q^TQ\) as \(Q^TQ = U \Sigma U^T.\) It follows that
\[
I - (I + \lambda Q^TQ)^{-1} = U \left[ \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} \right]_{ii} U^T,
\]
where \(\sigma_i = \Sigma_{ii}.\) The optimal \(\lambda\) then is the nonnegative \(\lambda\) which maximizes
\begin{align*}
g(\lambda) & = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2 \\
& = \sum\nolimits_i (U^Tz)_i^2 \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} - \lambda B^2.
\end{align*}

Observe that \(g(\lambda)\) is concave on \(\mathbb{R}^+,\) so its maximum occurs at \(0\) or the point where \(g^\prime(\lambda) = 0\). Thus, if the equation
\[
\sum\nolimits_i (U^Tz)_i^2 \frac{\sigma_i}{(1 + \lambda \sigma_i)^2} = \|Q(I + \lambda Q^TQ)^{-1} z\|_2^2 = B^2
\]
has a nonnegative solution, that solution is the optimal \(\lambda\). If not, then the optimal \(\lambda\) is \(0.\)

Given this optimal \(\lambda,\) the corresponding optimal \(x\) is given by
\[
x = \Pi_{\|Qx\|_2 \leq B}(z) = (I + \lambda Q^TQ)^{-1} z.
\]
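
This recipe translates directly into a short routine. The sketch below (function name mine) checks whether \(\lambda = 0\) already works, and otherwise finds the root of the decreasing function \(g^\prime(\lambda)\) by bisection; Newton’s method would converge faster, but bisection keeps the sketch simple.

    import numpy as np

    def project_onto_ellipsoid(Q, z, B, tol=1e-10):
        """Euclidean projection of z onto {x : ||Q x||_2 <= B}."""
        QtQ = Q.T @ Q
        sigma, U = np.linalg.eigh(QtQ)        # Q^T Q = U diag(sigma) U^T
        w = (U.T @ z) ** 2                    # (U^T z)_i^2

        # g'(lam) = sum_i (U^T z)_i^2 sigma_i / (1 + lam sigma_i)^2 - B^2,
        # a decreasing function of lam >= 0.
        def gprime(lam):
            return np.sum(w * sigma / (1.0 + lam * sigma) ** 2) - B ** 2

        if gprime(0.0) <= 0.0:                # z already satisfies ||Q z||_2 <= B
            lam = 0.0
        else:
            lo, hi = 0.0, 1.0
            while gprime(hi) > 0.0:           # bracket the root
                hi *= 2.0
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if gprime(mid) > 0.0 else (lo, mid)
            lam = 0.5 * (lo + hi)

        return np.linalg.solve(np.eye(len(z)) + lam * QtQ, z)   # (I + lam Q^T Q)^{-1} z

With \(Q = A^{-1/2}\) and \(z = A^{1/2} y\), this is exactly the projection the Adagrad update needs, as spelled out next.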

From the Euclidean projection on an ellipsoid back to the Adagrad projector

Putting all the pieces together, we get the following expression for the projection operator needed in Adagrad:
\begin{align*}
\Pi_{\|x\|_2 \leq B}^A(y) & = A^{-1/2} \Pi_{\|A^{-1/2}x\|_2 \leq B}(A^{1/2}y) \\
& = A^{-1/2} ( I + \lambda A^{-1})^{-1} A^{1/2} y \\
& = (I + \lambda A^{-1})^{-1}y = A(A + \lambda I)^{-1}y,
\end{align*}
where \(\lambda\) is the nonnegative solution, when one exists, of the nonlinear equation
\begin{align*}
\sum\nolimits_i (U^T y)_i^2 \frac{\sigma_i^2}{(\sigma_i + \lambda)^2} & = \|(I + \lambda A^{-1})^{-1} y \|_2^2 \\
& = \|A(A + \lambda I)^{-1} y\|_2^2 = B^2,
\end{align*}
with \(A = U \Sigma U^T\) and \(\sigma_i = \Sigma_{ii};\) if no nonnegative solution exists, \(\lambda = 0.\)
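
In the practically relevant diagonal case \(A = \text{diag}(a)\), everything is coordinatewise: the projection is \(y_i \mapsto a_i y_i/(a_i + \lambda)\), and the equation for \(\lambda\) becomes \(\sum_i a_i^2 y_i^2/(a_i + \lambda)^2 = B^2.\) Here is a sketch (names mine), assuming \(a > 0\) coordinatewise, which holds when a small constant is added to \(\text{diag}(G_t)^{1/2}\):

    import numpy as np

    def adagrad_ball_projection(a, y, B, tol=1e-10):
        """Projection of y onto {x : ||x||_2 <= B} in the metric A = diag(a),
        i.e. y -> A (A + lam I)^{-1} y with lam chosen as derived above.
        In Adagrad, a = diag(G_t)^{1/2} (plus a small epsilon)."""
        a = np.asarray(a, dtype=float)
        y = np.asarray(y, dtype=float)

        def phi(lam):                          # ||A (A + lam I)^{-1} y||^2 - B^2
            return np.sum((a * y / (a + lam)) ** 2) - B ** 2

        if phi(0.0) <= 0.0:                    # ||y||_2 <= B: already feasible
            return y.copy()

        lo, hi = 0.0, 1.0
        while phi(hi) > 0.0:                   # bracket the root of the decreasing phi
            hi *= 2.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if phi(mid) > 0.0 else (lo, mid)
        lam = 0.5 * (lo + hi)
        return a * y / (a + lam)

Plugging this in as the projection step of the earlier diagonal Adagrad sketch (with a = np.sqrt(sq_grad_sum) + eps) gives the constrained update from the top of the post.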

  • Stephen

    This comes up a lot, so it’s good to cover it. I have a similar derivation in section 3.7.3 of my thesis (http://thesis.library.caltech.edu/6492/6/nesta.pdf ). I’ve seen two other (and earlier) mentions of it as well:

    J. Dahl, C. Hansen, S. H. Jensen, and T. L. Jensen,
    Algorithms and software for total variation image reconstruction via first-order methods, Numer. Algorithms, 53 (2010), no. 1.

    P. Weiss, L. Blanc-Feraud, and G. Aubert,
    Efficient schemes for total variation minimization under constraints in image processing, SIAM J. Scientific Computing, 31 (2009), 2047–2080.

    Adagrad is interesting — but I would think only the diagonal version is useful, since if you are not diagonal, then why not use BFGS or L-BFGS? There was some work in the late 80s and 90s on quasi-Newton methods that are only diagonal (but not constant diagonal), but the results did not seem that impressive, so most people use a constant diagonal quasi-Newton (which is aka Barzilai-Borwein).

    For finding lambda via solving the equation, of course you can do Newton (this is what we did in NESTA), but it looks like it’s a secular equation and maybe you can do even better using a third-order method (see http://en.wikipedia.org/wiki/Simple_rational_approximation ). I haven’t gone through the details yet.

    • swiftset

      I also doubt anyone does use the full-matrix form of Adagrad, since it requires a full matrix inversion. The diagonal form of Adagrad is popular in machine learning, because it is almost as simple as sgd, doesn’t require line searches (yuck!), and updates different parameters at a natural rate corresponding to the training data’s distribution rather than one fixed rate.