Teaching kernel learning

Here’s a neat approach to teaching kernel learning (for empirical risk minimization), following section 2.2.6 of the book “First-order and Stochastic Optimization Methods for Machine Learning” by Guanghui Lan. Given the terse nature of this book, I did not expect to find such a straightforward pedagogical approach, and it is one I have not seen elsewhere. Let’s begin by assuming you understand the optimality conditions for unconstrained convex optimization.

Consider the convex empirical risk minimization problem

\[ \omega^\star = \text{argmin}_{\omega \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T x_i, y_i) + \lambda \|\omega\|_2^2.\]

Fermat’s condition implies that the optimal model weights are in the span of the training data points, \(\omega^\star = \sum_{i=1}^n \alpha_i x_i \) for some coefficients \(\alpha \in \mathbb{R}^n\).
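To see why (assuming, for simplicity, that \(\ell\) is differentiable in its first argument, with derivative \(\ell'\)), write out Fermat’s condition for the objective:

\[ 0 = \frac{1}{n} \sum_{i=1}^n \ell'\big( (\omega^\star)^T x_i, y_i \big) \, x_i + 2 \lambda \omega^\star \quad \Longrightarrow \quad \omega^\star = \sum_{i=1}^n \underbrace{\left( -\frac{1}{2 \lambda n} \ell'\big( (\omega^\star)^T x_i, y_i \big) \right)}_{\alpha_i} x_i. \]

Substituting this representation into the ERM, the problem can be recast as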

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell\left( \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle, y_i \right) + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle. \]

Defining the linear kernel matrix \(K = [ \langle x_i, x_j \rangle ]_{i,j=1,\ldots,n}\), this becomes

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

which is again a convex optimization problem. Once \(\alpha^\star\) is known, the model predicts the target for a new input \(x\) entirely in terms of the kernel function \( \kappa(x_1, x_2) = \langle x_1, x_2 \rangle \):

\[ (\omega^\star)^T x = \sum_{i=1}^n \alpha_i \langle x_i, x \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]
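To make this concrete, take the squared loss \(\ell(t, y) = (t - y)^2\). Setting the gradient of the kernelized objective to zero shows that \(\alpha = (K + \lambda n I)^{-1} y\) is a minimizer, so the entire fit-and-predict pipeline is a few lines of NumPy. The sketch below is just an illustration on synthetic data; the dimensions and the value of \(\lambda\) are arbitrary choices, not recommendations.

```python
import numpy as np

# Minimal kernel ridge regression with a linear kernel: squared loss plus
# lambda * alpha^T K alpha, solved in closed form as alpha = (K + lambda*n*I)^{-1} y.
rng = np.random.default_rng(0)

n, d = 50, 200          # more features than samples, so the kernel form is the cheaper one
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 1e-2              # regularization hyperparameter (arbitrary choice for the sketch)
K = X @ X.T             # n x n linear kernel matrix, K[i, j] = <x_i, x_j>

# Fit: for squared loss the first-order condition gives this linear system.
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

# Predict on new inputs: f(x) = sum_i alpha_i <x_i, x>.
X_new = rng.standard_normal((5, d))
preds = (X_new @ X.T) @ alpha
print(preds)
```

For other losses there is no closed form, but the resulting \(n\)-dimensional convex problem can be handed to any first-order method or off-the-shelf convex solver.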

This reformulation of the ERM is not particularly useful unless \(d > n\): the original ERM is a \(d\)-dimensional optimization problem, while the kernelized ERM is an \(n\)-dimensional optimization problem that also requires forming and working with an \(n \times n\) kernel matrix. When \(d\) is larger than \(n\), however, the kernel trick that led to the second ERM pays off, as the kernelized ERM may be more efficient to work with.

Now, consider that we want to learn using linear combinations of some fixed functions of the raw features \(x\), e.g. we want to learn using the features \(x_1, x_2, x_1^2, x_2^2, x_1 x_2, \ldots\). One can imagine that using such features could be of benefit over simply using the raw features. Let \(\phi(x)\) denote the feature map that sends the \(d\) raw features to \(D\) new features; then the ERM of interest is

\[ \min_{\omega \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T \phi(x_i), y_i) + \lambda \|\omega\|_2^2,\]

and the same reasoning as before tells us that \(\omega^\star = \sum_{i=1}^n \alpha_i \phi(x_i)\). Defining the nonlinear kernel function \( \kappa(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle \) and the corresponding nonlinear kernel matrix \(K = [ \kappa(x_i, x_j) ]_{i,j=1,\ldots,n}\), we see that we can fit this model by solving

\[ \min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

the exact same convex problem as before, and also predict using

\[ (\omega^\star)^T \phi(x) = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

Note that \(\phi\) could even be infinite-dimensional, and this approach still works: the ERM stated in terms of the feature map is not solvable on a computer, but the kernelized ERM is, as it involves only an \(n\)-dimensional optimization problem. In practice, even when \(\phi\) is finite-dimensional, \(D\) is typically so large that the kernel version of the ERM is more efficient to solve than the feature map version. Further, \(\phi\) is chosen so that the kernel function can be evaluated efficiently, without ever forming \(\phi(x)\) explicitly.
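As a tiny illustration of this last point, consider the degree-2 polynomial feature map on \(\mathbb{R}^d\) whose coordinates are all ordered products \(x_i x_j\), so \(D = d^2\). Its inner products collapse to \(\kappa(x, z) = (x^T z)^2\), which costs \(O(d)\) rather than \(O(d^2)\) to evaluate. The snippet below checks this numerically (the particular ordering of the monomials is just one convenient choice):

```python
import numpy as np

# The explicit degree-2 feature map: all d^2 ordered products x_i * x_j.
def phi(x):
    return np.outer(x, x).ravel()

# The corresponding kernel, evaluated without ever forming phi.
def poly2_kernel(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(1)
x, z = rng.standard_normal(6), rng.standard_normal(6)

# Both routes give the same inner product; only the costs differ
# (d^2 work for the feature map vs. d work for the kernel).
print(phi(x) @ phi(z))       # explicit feature-space inner product
print(poly2_kernel(x, z))    # kernel trick
```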

The canonical example of a feature map used in kernel learning is the Gaussian/radial basis feature map, an infinite-dimensional feature map consisting of all the monomials in the raw features, each downweighted by an appropriate exponential factor (it is tedious to write out, but you can look it up). The nice thing is that this choice of feature map leads to the RBF kernel function that we all know and love

\[ \kappa(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|_2^2}{2\sigma^2} \right), \]

where \(\sigma^2\) is a hyperparameter.
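Since the RBF kernel only needs pairwise squared distances, forming the kernel matrix and reusing the same closed-form ridge solve from the earlier sketch takes only a few lines. As before, this is just a sketch; \(\sigma\) and \(\lambda\) are hyperparameters one would normally tune, and the values below are arbitrary.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)), from pairwise squared distances.
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(2)
n, d = 80, 3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # a nonlinear target for the sketch

lam, sigma = 1e-3, 1.0
K = rbf_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)   # same closed form as before

X_new = rng.standard_normal((5, d))
preds = rbf_kernel(X_new, X, sigma) @ alpha           # f(x) = sum_i alpha_i kappa(x_i, x)
print(preds)
```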