I’ve been away from this blog for years now(!); this is a post that’s been sitting in the pipeline. More to come.
In their 2007 NIPS paper, “Random Features for Large-Scale Kernel Machines”, Rahimi and Recht propose using random feature maps to make nonlinear kernel methods practical over large training sets. The problem is that if \(n\), the number of training examples, is large, then the optimization problem involves dealing with an \(n \times n\) kernel matrix. Also, the resulting classifier (or regression equation, or what have you) requires the computation of \(k(x, x_i)\) for all the training points \(x_i\).
One way to mitigate these concerns is to use sampling methods like the Nyström extension. The Nyström extension and its ilk work by sampling and then interpolating to approximate the kernel. Rahimi and Recht’s idea is to use the fact that PSD kernels can be written in terms of feature maps, \(k(x,y) = \langle \phi(x), \phi(y) \rangle\). This implies that one could approximate the kernel as \(k(x,y) \approx z(x)^T z(y)\) if \(z(\cdot)\) is chosen appropriately. Note that for general kernels, the feature map \(\phi\) may be infinite-dimensional, so if \(z\) could be chosen to be a finite-dimensional mapping, we’d arguably have a simplification of the kernel representation. More practically interesting: although we’d still be dealing with a large \(n \times n\) kernel matrix, it would correspond to a linear kernel, for which much more efficient optimization methods exist. Furthermore, applying the trained classifier (or regression equation, or what have you) would require only the computation of \(w^T z(x)\), where \(w\) is a vector in the span of the \(z(x_i)\). Thus, finding an appropriate low-dimensional approximate feature map \(z\) would increase the efficiency with which kernel methods are both trained and applied.
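To make the computational payoff concrete, here is a minimal numpy sketch (my own illustration, not code from either paper). It uses the quadratic kernel \(k(x,y) = (x^T y)^2\), which has an exact finite-dimensional feature map \(z(x) = \mathrm{vec}(x x^T)\), just to show how an explicit \(z\) turns kernel ridge regression into plain ridge regression: an \(n \times n\) solve becomes a \(D \times D\) solve, and prediction becomes a single inner product \(w^T z(x)\).

```python
# Sketch: with an explicit finite-dimensional feature map z, kernel ridge
# regression reduces to ordinary ridge regression on z(x).  Illustrated with
# the quadratic kernel k(x, y) = (x^T y)^2, whose exact map is z(x) = vec(x x^T).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Kernel path: build the n x n kernel matrix and solve an n x n system.
K = (X @ X.T) ** 2
c = np.linalg.solve(K + lam * np.eye(n), y)             # dual coefficients

# Feature-map path: explicit z(x), then a D x D system with D = d^2.
Z = np.einsum('ni,nj->nij', X, X).reshape(n, d * d)     # rows are z(x_i)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(d * d), Z.T @ y)

# Predictions agree: sum_i c_i k(x, x_i) == w^T z(x).
x_new = rng.standard_normal(d)
print(c @ ((X @ x_new) ** 2), w @ np.outer(x_new, x_new).ravel())
```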
Rahimi and Recht’s key idea was to construct \(z(x)\) by stacking random features, chosen so that \(z(x)^T z(y)\) is an average of unbiased estimators of \(k(x,y)\). Concentration of measure then ensures that, uniformly over \(x, y\) in a compact set \(\Omega\),
\[
k(x,y) = \langle \phi(x), \phi(y) \rangle \approx z(x)^T z(y).
\]
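To spell out the concentration argument (this is my paraphrase, not a quotation of the paper): if each coordinate of \(z\) is built from an independent draw \(\omega_j\) from some sampling distribution \(\rho\), say
\[
z(x) = \frac{1}{\sqrt{m}} \left( z_{\omega_1}(x), \dots, z_{\omega_m}(x) \right), \quad\text{with}\quad \mathbb{E}_{\omega \sim \rho}\left[ z_\omega(x)\, z_\omega(y) \right] = k(x,y),
\]
then
\[
z(x)^T z(y) = \frac{1}{m} \sum_{j=1}^m z_{\omega_j}(x)\, z_{\omega_j}(y)
\]
is an average of i.i.d. unbiased (and, for the features used below, bounded) estimates of \(k(x,y)\); Hoeffding’s inequality gives pointwise concentration, and an \(\epsilon\)-net argument over the compact set \(\Omega\) upgrades this to a uniform guarantee.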
In particular, they proposed sampling from the Fourier transform of shift invariant kernels to form such estimators. In more detail, if
\[
k(x,y) = k(x-y) = \int_{\mathbb{R}^d} \exp(i \omega^T (x-y)) \rho(\omega) d\omega,
\]
then one can approximate \(k\) with a finite sum:
\[
k(x,y) \approx \frac{1}{m} \sum_{j=1}^m \exp(i \omega_j^T x) \, \overline{\exp(i \omega_j^T y)}
\]
when the \(\omega_j\) are sampled i.i.d. according to \(\rho\).
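As a quick numerical sanity check (my own sketch, not the authors’ code), here are random Fourier features for the Gaussian kernel \(k(x-y) = \exp(-\|x-y\|^2/2)\), whose spectral density \(\rho\) is a standard Gaussian. A standard real-valued variant takes \(z_\omega(x) = \sqrt{2} \cos(\omega^T x + b)\) with a random phase \(b\) uniform on \([0, 2\pi]\), which satisfies \(\mathbb{E}[z_\omega(x) z_\omega(y)] = k(x-y)\).

```python
# Sketch of random Fourier features for the Gaussian kernel
# k(x - y) = exp(-||x - y||^2 / 2), whose spectral density rho is N(0, I_d).
# Real-valued variant: z_w(x) = sqrt(2) * cos(w^T x + b), b ~ Uniform[0, 2*pi].
import numpy as np

rng = np.random.default_rng(1)
d, m = 10, 2000                      # input dimension, number of random features

W = rng.standard_normal((m, d))      # omega_j sampled i.i.d. from rho = N(0, I_d)
b = rng.uniform(0, 2 * np.pi, m)     # random phases

def z(X):
    """Map each row of X to its m-dimensional random Fourier feature vector."""
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

# Compare z(x)^T z(y) against the exact kernel on a few random points.
X = rng.standard_normal((5, d))
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
K_approx = z(X) @ z(X).T
print(np.max(np.abs(K_exact - K_approx)))   # small, and shrinks like 1/sqrt(m)
```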
This is a neat idea, but one is left with the question: now that we know how to approximate the nonlinear kernel matrix with a linear kernel matrix, what can we say about how this approximation affects our downstream application? Rahimi and Recht’s second paper, “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning”, answers this question for Lipschitz multiplicative loss functions of the form
\[
\ell(c, c^\prime) = \ell(c c^\prime), \quad\text{with}\quad |\ell(x) - \ell(y)| \leq L |x - y|
\]
and hypothesis space of the form
\[
\left\{ f(x) = \int_{\Omega} \alpha(\omega) \phi(x; \omega) d\omega \,\big|\, |\alpha(\omega)| \leq Cp(\omega) \right\},
\]
where the atoms \(\phi(x; \omega)\) are uniformly bounded w.r.t. both arguments, and \(p\) is a distribution over the parameter space \(\Omega\) (not to be confused with the compact data domain above).
The idea is that, since the contribution of each atom is bounded (by \(C \sup_{x,\omega} |\phi(x; \omega)|\)), the approximation error incurred by replacing the function in the true hypothesis space that minimizes the loss with the function in the finite-dimensional approximation space
\[
\left\{ f(x) = \sum_{k=1}^K \alpha_k \phi(x; \omega_k) \,\big|\, |\alpha_k| \leq \frac{C}{K} \right\}
\]
that minimizes the loss is small, with high probability over the \(\omega_k\), which are selected i.i.d. according to \(p\). Concisely: the Galerkin method works when the basis elements are selected randomly from an appropriate distribution.
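To make the recipe concrete (again my own sketch, with arbitrary choices of \(p\), \(\phi\), and data rather than anything from the paper): draw the \(\omega_k\) i.i.d. from \(p\), freeze the resulting features \(\phi(x; \omega_k)\), and fit only the weights \(\alpha_k\) by minimizing the empirical loss; below, ridge-regularized least squares stands in for the constrained minimization.

```python
# Sketch of the kitchen-sinks recipe: randomize the feature parameters, then
# optimize only the linear weights.  Here phi(x; omega) = cos(omega^T x + b),
# omega is drawn from a Gaussian p, and the data/target are toy choices.
import numpy as np

rng = np.random.default_rng(2)
n, d, K, lam = 500, 2, 300, 1e-2

X = rng.uniform(-3, 3, (n, d))                       # toy inputs
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.standard_normal(n)

# Step 1: sample omega_k ~ p (and random phases) i.i.d., then freeze them.
W = rng.standard_normal((K, d))
b = rng.uniform(0, 2 * np.pi, K)
Phi = np.cos(X @ W.T + b)                            # n x K matrix of phi(x_i; omega_k)

# Step 2: fit only alpha by (ridge-regularized) empirical risk minimization.
alpha = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

# The learned predictor is f(x) = sum_k alpha_k phi(x; omega_k).
X_test = rng.uniform(-3, 3, (200, d))
y_test = np.sin(X_test[:, 0]) * np.cos(X_test[:, 1])
pred = np.cos(X_test @ W.T + b) @ alpha
print(np.mean((pred - y_test) ** 2))                 # test mean-squared error
```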