Wilkinson on a priori error analysis

I’ve been reading a lot of NLA lately (e.g., a recent paper on communication-avoiding RRQR), and necessarily brushing up on details I paid scant attention to in my NLA courses, like the different types of pivoting. That reading led me to this quote from a famous numerical analyst:

There is still a tendency to attach too much importance to the precise error bounds obtained by an a priori error analysis. In my opinion, the bound itself is the least important part of it. The main object of such an analysis is to expose the potential instabilities, if any, of an algorithm so that hopefully from the insight thus obtained one might be led to improved algorithms. Usually the bound itself is weaker than it might have been because of the necessity of restricting the mass of detail to a reasonable level and because of the limitations imposed by expressing the errors in terms of matrix norms. A priori bounds are not, in general, quantities that should be used in practice. Practical error bounds should usually be determined by some form of a posteriori error analysis, since this takes full advantage of the statistical distribution of rounding errors and of any special features, such as sparseness, in the matrix.

Can I get an amen? This could be the epigraph of the career I’m building. I strive for a priori analyses, whether of algorithms or physical systems, because in the best cases they enhance our understanding of the factors relevant to our problems. I seek them out in others’ work and try to provide them in my own because I’m deeply skeptical of purely empirical results: without sufficient theory, how do you know you’re not just avoiding the inputs that would expose some failing in your idea? This is why I’m an applied mathematician.

Nystrom vs Random Feature Maps

I haven’t seen a truly convincing study comparing Nystrom approximations to Random Feature Map approximations. On the one hand, a NIPS 2012 paper compared the two and argued that because the bases Nystrom approximations use are adaptive to the problem, whereas those used by RFMs are not, Nystrom approximations are more efficient.

This is an indisputable point, but the experiments in the paper are not convincing: they used the same number of Nystrom samples as random features. The fair comparison is to allot both methods the same number of FLOPs; since Nystrom methods involve an additional pseudoinversion of a matrix that becomes huge as the number of samples grows, one can potentially afford more random features than sample points for the same number of FLOPs. Also, as always, it is important to choose an appropriate kernel; the paper only considered RBF kernels.
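
To make the flop-counting argument concrete, here is a rough sketch of the dominant costs of forming the two kinds of features; the accounting (and the names n, d, m, D) is my own back-of-the-envelope version, not taken from the paper, and it ignores the downstream regression, whose cost also grows with the number of features.

def nystrom_flops(n, d, m):
    """Very rough flop count for rank-m Nystrom features of an n x n kernel matrix on d-dimensional data."""
    kernel_block = n * m * d   # evaluating the n x m block C of kernel entries
    pinv = m ** 3              # pseudoinverting the m x m intersection block W
    features = n * m ** 2      # forming the Nystrom features C * pinv(W)^(1/2)
    return kernel_block + pinv + features

def rfm_flops(n, d, D):
    """Very rough flop count for mapping n points in d dimensions to D random features."""
    return n * D * d           # projecting the data onto D random directions

# Give both methods the same budget and see how many random features it buys.
n, d, m = 10 ** 6, 10 ** 3, 10 ** 3
budget = nystrom_flops(n, d, m)
D = budget // (n * d)   # largest D with rfm_flops(n, d, D) <= budget
print(D)                # roughly 2000 here, i.e., about twice as many random features as Nystrom samples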

On the other hand, IBM researchers have recently used large-scale RFM approaches to get state-of-the-art performance on vision and speech tasks. Their results use the simplest of RFM approaches: linear regression on top of a very large number (~400K) of random Fourier features. The key to their success is a well-engineered ADMM approach to parallelizing the solution of the system. It’s not clear to me that a similar approach couldn’t be used to scale up a Nystrom-based solution and obtain similar results. Also, I’ve not seen anyone implement Zhang, Duchi, and Wainwright’s divide-and-conquer approach to kernel ridge regression; theoretically, this could also be used to distribute the cost of a truly large-scale Nystrom implementation.
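
For reference, here is a minimal sketch of the simplest RFM pipeline described above: random Fourier features for an RBF kernel followed by plain ridge regression. This is just the textbook Rahimi-Recht construction on toy data, not the distributed ADMM solver the IBM folks use; the names and parameters below are purely illustrative.

import numpy as np

def random_fourier_features(X, D, sigma, seed=0):
    """Map n x d data X to n x D random Fourier features approximating the RBF kernel
    exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=D)          # random phases
    return np.sqrt(2.0 / D) * np.cos(X.dot(W) + b)

def ridge_fit(Z, y, lam=1e-3):
    """Ridge regression on the feature matrix Z."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T.dot(Z) + lam * np.eye(D), Z.T.dot(y))

# Toy usage: regress a nonlinear function of the first coordinate.
X = np.random.randn(500, 10)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(500)
Z = random_fourier_features(X, D=200, sigma=1.0)
w = ridge_fit(Z, y)
predictions = Z.dot(w)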

Personally, I’m of the opinion that a well-engineered Nystrom solution (using uniform sampling, even) should always outperform a well-engineered RFM solution. But, I’m interested in seeing this convincingly demonstrated.

My podcast masterlist

Here’s an early Christmas gift to you: a list of podcasts I enjoy! For listening while you’re doing all your holiday season travelling.

APM: Marketplace
KCRW’s Left, Right, and Center
Newshour
BBC World Update: Daily Commute
Common Sense with Dan Carlin
PRI’s The World: Latest Edition
On the Media
The Young Turks Video Podcast
Citizen Radio
Best of the Left Podcast
The David Pakman Show
TWIB! Prime (This Week in Blackness)
MSNBC Rachel Maddow (video)
NPR: Intelligence Squared U.S. Debates Podcast

The Read
The Complete Guide to Everything
Throwing Shade
My Brother, My Brother, and Me

Cognitive Dissonance
Dogma Debate with David Smalley
No Religion Required Podcast
Thank God I’m Atheist Podcast
The Heathen Half Hour
The Herd Mentality
The Imaginary Friends Show
The Scathing Atheist
The Thinking Atheist Podcast

Mirror descent is, in a precise sense, a second order algorithm

For one of our projects at eBay, I’ve been attempting a Poisson MLE fit on a dataset large enough that Fisher scoring is not feasible. The problem is that the data also have such large variance in the scales of the observations that stochastic gradient descent does not work, period: because of the exponentiation involved, you need to take a very tiny step size to avoid overflow errors, but this step size is shared by all the parameters, so you can’t make progress this way.
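
Concretely, for a Poisson GLM with the canonical log link, each count \(y_i\) is modeled as Poisson with mean \(\exp(x_i^T \theta)\), so up to constants the negative log-likelihood and its gradient are
\[
\ell(\theta) = \sum_i \left[ \exp(x_i^T \theta) - y_i\, x_i^T \theta \right], \qquad
\nabla \ell(\theta) = \sum_i \left( \exp(x_i^T \theta) - y_i \right) x_i.
\]
One SGD step that pushes some \(x_i^T \theta\) even moderately far in the wrong direction makes the next \(\exp(x_i^T \theta)\) overflow, and the only knob for preventing that is the single global step size.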

An alternative is AdaGrad, which maintains a separate step size for each parameter, but that seems to run into the same divergence issue, albeit much more slowly; slowly enough that it’s unclear to me whether the fit is actually diverging, or if it ‘just’ needs to run a couple hundred iterations before it converges. So for the past week I’ve been massaging the initial conditions and the amount of information I hard-bake into the parametrization of the problem to see if I can get AdaGrad to work reasonably. Still no luck.
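
For reference, the (diagonal) AdaGrad update keeps a running sum of squared gradients for each coordinate \(j\), so every parameter gets its own effective step size:
\[
\theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{\epsilon + \sum_{s \le t} g_{s,j}^2}}\, g_{t,j},
\]
where \(g_{s,j}\) is the \(j\)-th coordinate of the stochastic gradient at step \(s\) and \(\epsilon\) is a small constant for numerical stability.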

I just came across Raskutti and Mukherjee’s paper “The information geometry of mirror descent”, which seems relevant to my situation and is a nice (albeit in need of proof-reading) read. The main result of the paper is that the mirror descent algorithm associated with the Bregman divergence of a function \(G\) is equivalent to natural gradient descent on the dual manifold, whose metric tensor is the Hessian of the convex conjugate of \(G\). This sounds wonderful, because the connection between exponential families and Bregman divergences suggests that one can perform a first-order optimization in a certain dual manifold and reap all the benefits of having done Fisher scoring, a second-order algorithm, in parameter space. I have to reread the paper carefully to get a handle on the precise manipulations required, but this may be a nice alternative to AdaGrad for my problem.
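
As I understand the bookkeeping: the mirror descent update for minimizing \(f\) with the Bregman divergence \(D_G\) is
\[
\theta_{t+1} = \arg\min_\theta \; \eta_t \langle \nabla f(\theta_t), \theta \rangle + D_G(\theta, \theta_t)
\quad\Longleftrightarrow\quad
\nabla G(\theta_{t+1}) = \nabla G(\theta_t) - \eta_t \nabla f(\theta_t).
\]
Writing \(H = G^*\) for the convex conjugate and \(\mu = \nabla G(\theta)\) for the dual coordinates (so \(\theta = \nabla H(\mu)\)), the chain rule gives \(\nabla_\mu f(\nabla H(\mu)) = \nabla^2 H(\mu)\, \nabla_\theta f(\theta)\), and the update becomes
\[
\mu_{t+1} = \mu_t - \eta_t \left[ \nabla^2 H(\mu_t) \right]^{-1} \nabla_\mu f(\nabla H(\mu_t)),
\]
i.e., a natural gradient step on the dual variables with metric tensor \(\nabla^2 H\).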

I wonder: is there a similarly geometric interpretation of what composite mirror descent does?

Update: A more readable recent paper, “Stochastic Discriminative EM” from UAI 2014, does a better job of explaining the interpretation of the dual manifold and has a very similar algorithm.

Algebra: it matters

I’m looking at two different models for learning polynomial functions, and trying to determine if they are equivalent. After a couple days of thinking, I’ve reduced the question to the following:

Can every symmetric polynomial of degree \(r\) in \(d\) variables that has no constant term be written as a sum of \(r\)-th powers of linear polynomials in those \(d\) variables plus a homogeneous polynomial of degree \(r\), each of whose monomials involves at most \(d-1\) of the variables?
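
In symbols (my notation; the number of summands \(m\) is not constrained): is there always a decomposition
\[
p(x_1, \ldots, x_d) = \sum_{i=1}^{m} \ell_i(x_1, \ldots, x_d)^r + q(x_1, \ldots, x_d),
\]
where \(p\) is the given symmetric polynomial of degree \(r\) with no constant term, each \(\ell_i\) is a linear polynomial, and \(q\) is homogeneous of degree \(r\) with every monomial involving at most \(d-1\) of the variables?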

Julia, once more

Julia + PyCall + CCall + Gadfly or PyPlot (+ Julia Studio?) looks delicious. The only feature that absolutely needs to be added is shared-memory parallelism (why wasn’t this a core feature of the language from the start?), but I’m extremely excited by the current awesomeness of the Julia ecosystem.

I recommend you get into it now, if you’re a scientific computation person.

Update: Julia has experimental support for shared-memory arrays on Unix, which is really all that I need at this point. Great!

a bit on word embeddings

Lately I’ve been working almost exclusively on continuous word representations, with the goal of finding vectorial representations of words that expose semantic and/or syntactic relationships between them.

As is typical for any interesting machine learning problem, there is a glut of clever models based on various assumptions (sparsity, hierarchical sparsity, low-rankness, etc.) that yield respectable embeddings. Arguably, however, the most well known of these representations are the word2vec models of Mikolov et al., which are part of a larger class of neural-network-based methods for learning word representations. This is largely due to the fact that the authors released an efficient implementation of their algorithm and demonstrated that the embeddings it computes exhibit nice algebraic properties (e.g., ‘king’ – ‘man’ + ‘woman’ is approximately ‘queen’).

I’m partial to the word2vec skip-gram model precisely because it is based on a simple and sensible (although contested) linguistic axiom known as the distributional hypothesis, namely that words appearing in similar contexts tend to have similar meanings, and because in its purest form it has a simple probabilistic interpretation that makes it amenable to nontrivial theoretical investigation (for instance, it’s now clear to me whence the mysterious algebraic properties of the fitted word embeddings come). The model parameterizes the probability distribution of the words in the neighborhood of each word in the vocabulary. Semantically similar words end up with similar vectorial embeddings precisely because the distributions of words in their neighborhoods are similar.
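
Concretely, in the pure skip-gram model each word \(w\) in the vocabulary \(V\) gets an ‘input’ vector \(v_w\) and an ‘output’ (context) vector \(u_w\), and the probability of observing word \(c\) in the neighborhood of \(w\) is a softmax over the whole vocabulary:
\[
p(c \mid w) = \frac{\exp(u_c^T v_w)}{\sum_{c' \in V} \exp(u_{c'}^T v_w)}.
\]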

Unfortunately, the pure skip-gram model is not easily fit, because calculating the partition function involves a sum over all the words in the vocabulary, which can easily number in the millions. The authors of the word2vec paper mitigate this by introducing two approximations to the exact model: one is a variational approximation that finds the best fit within a tree-structured class of probability distributions that are more computationally tractable (the hierarchical softmax), and the other is inspired by the principle of noise contrastive estimation, namely that a good model only has to be able to distinguish true observations from observations drawn from a sufficiently plausible noise distribution. The latter is very similar in spirit to the model of Mnih et al.
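
The second approximation, called negative sampling in the word2vec papers, replaces the softmax with a binary discrimination problem: for each observed (word, context) pair \((w, c)\), draw \(k\) ‘noise’ words from a noise distribution \(P_n\) and maximize
\[
\log \sigma(u_c^T v_w) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n} \left[ \log \sigma(-u_{c_i}^T v_w) \right],
\]
where \(\sigma\) is the logistic function.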

I’m curious to what extent these approximate models lose semantic information that would be captured by the pure model, if it could be fit. It is entirely possible that the ‘approximations’ are themselves more useful models than the intractable pure model. One of the projects I’m working on with my intern (Jiyan Yang) is attempting to quantify this experimentally, using a combination of smaller datasets and various formulations of the optimization problem (e.g., stochastic methods and score-matching-type ideas) to make it feasible to fit the exact model.

Our ambitions go beyond understanding the word2vec model. It seems that a large class of word embeddings can be viewed as attempts to fit probability distributions over the neighbors of each word, just as word2vec does, so I’m trying to work out a universal framework for probabilistic word embeddings. It turns out that this has strong connections to several other applications in machine learning, so the understanding gained here can be used in, e.g., collaborative filtering.

Installing Hadoop on Ubuntu (works for Ubuntu 12.04 and Hadoop 2.4.1)

I’m trying to use LDA on a large amount of data. A quick recap:

  1. Tried vowpal wabbit … it’s fast, I’ll give it that, but it’s also useless: the output is dubious (what I think are the topics look like they haven’t changed very much from the prior) *and* I have no idea how it maps onto topics and documents (the documentation is AWFUL, and the dimensions of the output files are WONKY).
  2. Tried two implementations of SCVB0, a stochastic collapsed variational Bayes LDA algorithm: one doesn’t work at all (as in, it stalls on any amount of data — so crap), and the other works fine and fast on medium amounts of data but apparently requires more than 5 days on a dataset of my size (I quit after 5 days; it’s worth noting that this code was removed from GitHub literally the day after I downloaded it — so dubious).
  3. For whatever reason, Mallet can’t even import (using import-file) a relatively small amount of data: the import process takes forever and then quits with an error — so crap.
  4. Now I’m trying to use Mr.LDA, a map-reduce implementation. Given my experience so far, I am not holding out much hope, but when one must… The issue with Mr.LDA is that when I’m importing my data on the eBay Hadoop cluster, the mappers run out of heap memory during the tokenization process.

After spending most of today trying to figure out how to increase the heap memory, I decided to just install Hadoop locally, where I can control the heap size, tokenize my data there, and then transfer the tokenized data to eBay’s HDFS and run the actual LDA job on the cluster. Ugh.

First I have to get Hadoop running locally. I followed the DigitalOcean tutorial, which leaves out crucial details about permissions. Addressing that oversight is the entire point of this post (that, and exorcising my crappy-implementations-of-LDA-induced demons).

I installed everything to run from my user account, as opposed to root. Specifically, I untarred the hadoop directory using my user account, then moved it to /etc with sudo, so the entire hadoop directory structure is owned by my username and login group. After creating the hadoop_store directory as root, I chowned that entire subtree over to my user account with ‘chown -R uname: /usr/local/hadoop_store’. Then I followed the directions as provided and got a working Hadoop system.

Sharing numpy arrays between processes using multiprocessing and ctypes

Because of its global interpreter lock, Python can’t actually run threads in parallel, so multithreading buys you nothing for CPU-bound work. To me, this is a ridiculous limitation that should be gotten rid of post-haste: a programming language is not modern unless it supports real multithreading.

Python does support multiprocessing, but the straightforward way of using it requires you to pass data between processes by pickling/unpickling rather than through shared memory. Needless to say, this slows down execution when large amounts of data need to be shared between processes.

In my case, I’ve been using multiprocessing to speed up the training of a recursive autoencoder: instead of training one autoencoder, I train an ensemble of around 30 simultaneously, and periodically average their parameters together and use the average as the initialization point for further training of all the models. Pickling the parameters here doesn’t seem to be that costly, because the optimization itself takes so much time. Still, it’s nice to know that there’s a relatively simple workaround that lets you avoid the pickling/unpickling process entirely: create the shared array as a ctypes object, and pass it into the initializer of each process, making it a global variable in each worker process. Here’s some example code:

from multiprocessing import Pool, sharedctypes
import numpy as np
import warnings

def populate(args):
    """ Populates the portion of the shared array arr indicated by the offset and blocksize options """
    offset, blocksize = args
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        v = np.ctypeslib.as_array(arr)

    for idx in range(offset, blocksize+offset):
        v[idx, :] = offset

    return offset

def _init(arr_to_populate):
    """ Each pool process calls this initializer. Load the array to be populated into that process's global namespace """
    global arr
    arr = arr_to_populate

size = 2000
blocksize = 100
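# Wrap a numpy array as a ctypes array and place it in shared memory; lock=False is
# safe here because each worker writes to a disjoint block of rows.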
tmp = np.ctypeslib.as_ctypes(np.zeros((size, 10)))
shared_arr = sharedctypes.Array(tmp._type_, tmp, lock=False) 

pool = Pool(processes=4, initializer=_init, initargs=(shared_arr, ))
result = pool.map(populate, zip(range(0, size, blocksize), [blocksize]*(size // blocksize)))

with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    final_arr = np.ctypeslib.as_array(shared_arr)

print(final_arr)

The warnings filter is there because a PEP 3118 buffer-interface compliance bug gives rise to some superfluous runtime warnings when the shared ctypes array is converted back to a numpy array.

Eigenvector two-condition number for a product of PSD matrices

I’m pushing to submit a preprint on the Nystrom method that has been knocking around for the longest time.

I find myself running into problems centered on expressions of the form \(B^{-1}A\), where \(A, B\) are SPSD matrices satisfying \(A \preceq B\). This expression will be familiar to numerical linear algebraists: there, \(B\) would be a preconditioner for a linear system \(Ax = b\), and the relevant quantity of interest is the spectral radius of \(B^{-1} A\).

It’s not hard to show that the spectral radius of this product is at most 1, but what I actually need is the norm of the product. Because the spectral radius is at most 1, we can use the bound
\[
\|B^{-1} A\|_2 \leq \kappa(U_{B^{-1}A})
\]
where \(\kappa(U_{B^{-1}A})\) is the two-condition number of the matrix of eigenvectors of \(B^{-1}A\).
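
Spelling out where this comes from: \(B^{-1}A\) is similar to the SPSD matrix \(B^{-1/2} A B^{-1/2}\), hence diagonalizable, say \(B^{-1}A = U_{B^{-1}A} \Lambda U_{B^{-1}A}^{-1}\), and submultiplicativity gives
\[
\|B^{-1} A\|_2 \leq \|U_{B^{-1}A}\|_2 \|\Lambda\|_2 \|U_{B^{-1}A}^{-1}\|_2 = \kappa(U_{B^{-1}A}) \, \rho(B^{-1}A) \leq \kappa(U_{B^{-1}A}).
\]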

In the applications I’m interested in, some rough numerical experiments show that this bound is good enough for my purposes (the two terms are of the same order). Assuming this is the case, how can I get a good theoretical bound on this condition number?