Here’s a neat approach to teaching kernel learning (for empirical risk minimization), following section 2.2.6 of the book “First-order and Stochastic Optimization Methods for Machine Learning” by Guanghui Lan. Given how terse this book is, I did not expect to find a pedagogical approach this straightforward, and one I’ve not seen elsewhere. Let’s begin by assuming you understand the optimality conditions for unconstrained convex optimization.
Consider the convex empirical risk minimization problem
$$\omega^\star = \operatorname*{argmin}_{\omega \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^n \ell(\omega^T x_i, y_i) + \lambda \|\omega\|_2^2.$$
Fermat’s condition implies that the optimal model weights lie in the span of the training data points, $\omega^\star = \sum_{i=1}^n \alpha_i x_i$: setting the gradient of the objective to zero gives $\omega^\star = -\frac{1}{2\lambda n} \sum_{i=1}^n \ell'\big((\omega^\star)^T x_i, y_i\big)\, x_i$ (for differentiable $\ell$; the non-smooth case uses subgradients), which is a linear combination of the $x_i$. The problem can therefore be recast as
$$\alpha^\star = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^n \ell\Big(\sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle,\, y_i\Big) + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle.$$
Define the linear kernel matrix $K = [\langle x_i, x_j \rangle]_{i,j=1,\dots,n}$, and this becomes
$$\alpha^\star = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^n \ell\big((K\alpha)_i, y_i\big) + \lambda\, \alpha^T K \alpha,$$
which is again a convex optimization problem. Once $\alpha^\star$ is known, the model can be used to predict the value of the target given a new input, also in terms of the kernel function $\kappa(x_1, x_2) = \langle x_1, x_2 \rangle$:
$$(\omega^\star)^T x = \sum_{i=1}^n \alpha_i \langle x_i, x \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x).$$
This reformulation of the ERM is not particularly useful unless $d > n$: the original ERM is a $d$-dimensional optimization problem, while the kernelized ERM is an $n$-dimensional optimization problem that requires forming and working with an $n \times n$ kernel matrix. However, if $d$ is larger than $n$, the kernel trick that led to the second ERM pays off, as the kernelized problem may be more efficient to work with.
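To make this concrete, here’s a minimal sketch (my own illustration, not code from the book) for the special case of the squared loss $\ell(z, y) = (z - y)^2$, for which the optimality condition of the kernelized ERM is satisfied by the linear system $(K + n\lambda I)\alpha = y$, the familiar kernel ridge regression solution:

```python
import numpy as np

def fit_kernel_erm_squared_loss(K, y, lam):
    """Solve the kernelized ERM for the squared loss ℓ(z, y) = (z - y)^2.

    Any alpha with (K + n*lam*I) @ alpha = y makes the gradient of
    (1/n) * ||K @ alpha - y||^2 + lam * alpha @ K @ alpha vanish.
    """
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_linear_kernel(alpha, X_train, x_new):
    """Evaluate sum_i alpha_i * <x_i, x_new> for the linear kernel."""
    return alpha @ (X_train @ x_new)

# Illustrative synthetic data (n = 50 points, d = 5 features).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5)
K = X @ X.T                      # linear kernel matrix K[i, j] = <x_i, x_j>
alpha = fit_kernel_erm_squared_loss(K, y, lam=1e-3)
print(predict_linear_kernel(alpha, X, X[0]), y[0])  # the two should be close
```

For a general convex loss there is no closed form, but the same $n$-dimensional convex problem can be handed to any solver or first-order method.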
Now, suppose we want to learn using linear combinations of some fixed functions of the raw features $x$, e.g. the features $x_1, x_2, x_1^2, x_2^2, x_1 x_2, \dots$. One can imagine that such features could be more useful than the raw features alone. Let $\phi(x)$ denote the feature map that takes the raw features from $d$ dimensions into $D$ new features; then the ERM of interest is
$$\min_{\omega \in \mathbb{R}^D} \; \frac{1}{n} \sum_{i=1}^n \ell\big(\omega^T \phi(x_i), y_i\big) + \lambda \|\omega\|_2^2,$$
and the same reasoning as before tells us that $\omega^\star = \sum_{i=1}^n \alpha_i \phi(x_i)$. Defining the nonlinear kernel function $\kappa(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$ and the corresponding nonlinear kernel matrix $K = [\kappa(x_i, x_j)]_{i,j=1,\dots,n}$, we see that we can fit this model by solving
$$\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^n \ell\big((K\alpha)_i, y_i\big) + \lambda\, \alpha^T K \alpha,$$
the exact same convex problem as before, and also predict using
$$(\omega^\star)^T \phi(x) = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x).$$
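Here’s a sketch of this prediction rule (function and variable names are my own); the point is that only $\kappa$ is ever evaluated, and $\phi$ never appears explicitly:

```python
import numpy as np

def predict_kernel(alpha, X_train, x_new, kappa):
    """Evaluate (omega*)^T phi(x_new) = sum_i alpha_i * kappa(x_i, x_new).

    Only kappa is ever called; the feature map phi never appears explicitly.
    """
    return sum(a * kappa(x_i, x_new) for a, x_i in zip(alpha, X_train))

# Quick check with arbitrary (not fitted) coefficients and the linear kernel,
# which recovers the linear-kernel prediction rule from earlier.
rng = np.random.default_rng(0)
X, alpha = rng.standard_normal((10, 3)), rng.standard_normal(10)
print(predict_kernel(alpha, X, X[0], kappa=lambda u, v: u @ v))
```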
Note that $\phi$ could even be infinite-dimensional, and this approach still works: the ERM stated in terms of the feature map cannot be solved on a computer, but the kernelized one can, since it involves only an $n$-dimensional optimization problem. In practice, even when $\phi$ is finite-dimensional, $D$ is chosen large enough that the kernel version of the ERM is more efficient to solve than the feature-map version. Accordingly, $\phi$ is chosen so that the kernel function can be evaluated efficiently.
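To illustrate this point with a small example of my own (a degree-2 polynomial feature map, rather than the Gaussian one discussed next): the explicit feature map has $D = d^2$ coordinates, yet the matching kernel $\kappa(u, v) = (u^T v)^2$ costs only $O(d)$ to evaluate.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j (D = d^2 coordinates)."""
    return np.outer(x, x).ravel()

def kappa(u, v):
    """The matching kernel, <phi(u), phi(v)> = (u^T v)^2, evaluated in O(d) time."""
    return (u @ v) ** 2

rng = np.random.default_rng(0)
u, v = rng.standard_normal(100), rng.standard_normal(100)
print(np.isclose(phi(u) @ phi(v), kappa(u, v)))  # True: same inner product, D = 10000 vs O(d) work
```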
The canonical example of a feature map used in kernel learning is the Gaussian/radial basis feature map, an infinite-dimensional feature map consisting of all the monomials in the raw features, each downweighted by an appropriate exponential factor (it’s tedious to write out; you can look it up). The nice thing is that this choice of feature map leads to the RBF kernel function that we all know and love,
$$\kappa(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|_2^2}{2\sigma^2}\right),$$
where $\sigma^2$ is a hyperparameter.
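Putting the pieces together for the RBF kernel, here’s a sketch with illustrative data and hyperparameters of my own choosing, reusing the squared-loss closed form from earlier:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma2):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.sin(3 * X[:, 0]) * np.cos(X[:, 1])                 # a target no linear model can fit
K = rbf_kernel_matrix(X, sigma2=0.5)
alpha = np.linalg.solve(K + 100 * 1e-4 * np.eye(100), y)  # squared-loss fit, as before
print(np.max(np.abs(K @ alpha - y)))                      # max training residual; should be small for these settings
```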