Nystrom vs Random Feature Maps

I haven’t seen a truly convincing study comparing Nystrom approximations to Random Feature Map approximations. On the one hand, a NIPS 2012 paper compared the two and argued that because the bases Nystrom approximations use are adaptive to the problem, whereas those used by RFMs are not, Nystrom approximations are more efficient.

This is an indisputable point, but the experiments done in the paper are not convincing: they used the same number of samples in Nystrom approximations as random features in RFMS. Instead, the fair comparison is to allot both methods the same number of FLOPs; since Nystrom methods involve an additional pseudoinversion of a (huge, for a large number of samples) matrix, one can potentially use more random features than sample points for the same number of FLOPs. Also, as always, it is important to choose an appropriate kernel — this paper only considered RBF kernels.

On the other hand, recently IBM researchers have used large-scale RFM approaches to get state-of-the-art performance on vision and speech tasks. Their results use the simplest of RFM approaches: linear regression on top of a very large number (~400K) random fourier features. The key to their success is a well-engineered ADMM approach to parallelizing the solution of the system. It’s not clear to me that a similar approach couldn’t be used to scale up a Nystrom-based solution and obtain similar results. Also, I’ve not seen anyone implement Wainwright et al.’s divide and conquer approach to kernel regression; theoretically, this could also be used to distribute the cost of a truly large-scale Nystrom implementation.

Personally, I’m of the opinion that a well-engineered Nystrom solution (using uniform sampling, even) should always outperform a well-engineered RFM solution. But, I’m interested in seeing this convincingly demonstrated.