Mirror descent is, in a precise sense, a second-order algorithm

For one of our projects at eBay, I’ve been attempting a Poisson MLE fit on a dataset large enough that Fisher scoring is not feasible. The problem is that the observations also vary so wildly in scale that stochastic gradient descent does not work, period: because of the exponentiation involved, you need a very tiny step size to avoid overflow errors, but that step size is shared by all the parameters, so you cannot make progress this way.
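To illustrate the failure mode, here is a minimal sketch on synthetic data (the feature scales and variable names are my own invention, not the actual eBay setup). With a log link, the stochastic gradient involves \(\exp(x^\top w)\), and any step size small enough to keep the large-scale coordinate from overflowing is too small to move the others:

```python
import numpy as np

# Toy Poisson regression y ~ Poisson(exp(x . w)) with two features on very
# different scales, standing in for the scale problem in the real data.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([rng.normal(0.0, 1.0, n),     # small-scale feature
                     rng.normal(0.0, 100.0, n)])  # large-scale feature
w_true = np.array([0.5, 0.01])
y = rng.poisson(np.exp(X @ w_true))

w = np.zeros(2)
eta = 1e-3  # reasonable for the first coordinate, far too large for the second
for i in range(50):
    mu = np.exp(X[i] @ w)          # overflows once X[i] @ w exceeds ~709
    w -= eta * (mu - y[i]) * X[i]  # stochastic gradient of the Poisson NLL
print(w)  # typically inf/nan within a handful of steps
```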

An alternative is Adagrad, which maintains a separate step size for each parameter, but it seems to run into the same divergence issue, albeit much more slowly: slowly enough that it’s unclear to me whether the fit is actually diverging, or whether it ‘just’ needs to run a couple hundred iterations before it converges. So for the past week I’ve been massaging the initial conditions and the amount of information I hard-bake into the parametrization of the problem, trying to get Adagrad to work reasonably. Still no luck.
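For concreteness, the variant I mean is diagonal Adagrad, which divides each coordinate’s step by the root of its accumulated squared gradients. A minimal sketch (my own naming, not the production code):

```python
import numpy as np

def adagrad_step(w, g, g2_sum, eta=0.1, eps=1e-8):
    """One diagonal-Adagrad update: each coordinate's effective step size,
    eta / sqrt(accumulated squared gradient), shrinks independently, so
    large-scale coordinates are damped harder than small-scale ones."""
    g2_sum = g2_sum + g * g
    w = w - eta * g / (np.sqrt(g2_sum) + eps)
    return w, g2_sum
```

The per-coordinate denominators adapt only after large gradients have already been seen, which may be why the divergence is merely slowed rather than prevented.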

I just came across Raskutti and Mukherjee’s paper “The information geometry of mirror descent”, which seems relevant to my situation and is a nice (albeit in need of proofreading) read. The main result of the paper is that the mirror descent algorithm associated with the Bregman divergence of a function \(G\) is equivalent to natural gradient descent on the dual manifold, with the metric tensor given by the Hessian of the convex conjugate of \(G\). This sounds wonderful, because the connection between exponential families and Bregman divergences suggests that one can perform a first-order optimization in a certain dual manifold and reap all the benefits of having done Fisher scoring, a second-order algorithm, in parameter space. I have to reread the paper carefully to get a handle on the precise manipulations required, but this may be a nice alternative to Adagrad for my problem.
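In my own notation (so treat this as a sketch of the result rather than the paper’s statement), the equivalence falls out of the chain rule. Mirror descent with mirror map \(G\) updates

\[ \nabla G(\theta_{t+1}) = \nabla G(\theta_t) - \eta\, \nabla f(\theta_t), \]

which in the dual coordinates \(\mu = \nabla G(\theta)\), with \(\theta = \nabla G^*(\mu)\), is just \(\mu_{t+1} = \mu_t - \eta\, \nabla f(\theta_t)\). Natural gradient descent on \(\tilde f(\mu) = f(\nabla G^*(\mu))\) under the metric tensor \(\nabla^2 G^*(\mu)\) is

\[ \mu_{t+1} = \mu_t - \eta\, \bigl(\nabla^2 G^*(\mu_t)\bigr)^{-1} \nabla_\mu \tilde f(\mu_t), \]

and since the chain rule gives \(\nabla_\mu \tilde f(\mu) = \nabla^2 G^*(\mu)\, \nabla f(\theta)\), the metric cancels and the two updates coincide.

To get a feel for why this might fix my overflow problem, here is a toy one-parameter Poisson instance (synthetic data and variable names are mine): take \(G\) to be the Poisson log-partition function \(e^\theta\), so the dual coordinate is the mean \(\mu = e^\theta\), and the additive dynamics happen in mean space, where \(\theta = \log\mu\) stays tame.

```python
import numpy as np

# Toy: fit the natural parameter theta of a single Poisson from iid counts,
# i.e. minimize f(theta) = exp(theta) - ybar * theta (the average NLL up to
# a constant).  Mirror map G(theta) = exp(theta), so the dual coordinate is
# the mean mu = G'(theta) = exp(theta), and the inverse mirror map is log.
rng = np.random.default_rng(0)
y = rng.poisson(lam=50.0, size=10_000)
ybar = y.mean()

theta = 0.0  # deliberately bad initialization
eta = 0.5    # step size in the dual coordinate (assumed; tune as needed)
for _ in range(100):
    g = np.exp(theta) - ybar      # gradient of f in the primal coordinate
    mu = np.exp(theta) - eta * g  # plain additive step, taken in mean space
    mu = max(mu, 1e-12)           # stay inside the domain of log
    theta = np.log(mu)            # pull back through the inverse mirror map

print(theta, np.log(ybar))  # both ~log(50): the MLE is theta = log(ybar)
```

If I’m reading the paper right, for an exponential family \(\nabla^2 G^*\) is the Fisher information in the mean parametrization, so this plain first-order loop in \(\mu\) is implicitly doing the second-order preconditioning that the title alludes to.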

I wonder: is there a similarly geometric interpretation of what composite mirror descent does?

Update: A more readable recent paper, “Stochastic Discriminative EM” from UAI 2014, does a better job of explaining the interpretation of the dual manifold and has a very similar algorithm.