Teaching kernel learning

Here’s a neat approach to teaching kernel learning (for empirical risk minimization), following section 2.2.6 of the book “First-order and Stochastic Optimization Methods for Machine Learning” by Guanghui Lan. Given how terse the book is, I did not expect to find such a straightforward pedagogical approach, and it’s one I haven’t seen elsewhere. Let’s begin by assuming you understand the optimality conditions for unconstrained convex optimization.

Consider the convex empirical risk minimization problem

\[ \omega^\star = \text{argmin}_{\omega \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T x_i, y_i) + \lambda \|\omega\|_2^2.\]

Fermat’s condition implies that the optimal model weights lie in the span of the training data points: setting the gradient of the objective to zero gives \(\omega^\star = -\tfrac{1}{2\lambda n} \sum_{i=1}^n \ell'((\omega^\star)^T x_i, y_i)\, x_i\), where \(\ell'\) denotes the derivative of \(\ell\) in its first argument (use a subgradient if \(\ell\) is nonsmooth). So \(\omega^\star = \sum_{i=1}^n \alpha_i x_i \) for some coefficients \(\alpha_i\), and the problem can be recast as

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle, y_i) + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle. \]

Defining the linear kernel matrix \(K = [ \langle x_i, x_j \rangle ]_{i,j=1,\ldots,n}\), this becomes

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

which is again a convex optimization problem. Once \(\alpha^\star\) is known, the model can be used to predict the value of the target given a new input, also in terms of the kernel function \( \kappa(x_1, x_2) = \langle x_1, x_2 \rangle \):

\[ (\omega^\star)^T x = \sum_{i=1}^n \alpha_i \langle x_i, x \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

This reformulation of the ERM is not particularly useful unless \(d > n\): the original ERM is a \(d\)-dimensional optimization problem, while the new one is an \(n\)-dimensional optimization problem that requires forming and working with an \(n \times n\) kernel matrix. However, when \(d\) is larger than \(n\), the kernel trick that led to the second ERM pays off, as the kernelized ERM may be more efficient to work with.
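
For the squared loss, this is easy to check numerically: in that case the kernelized problem has the closed-form solution \(\alpha^\star = (K + n\lambda I)^{-1} y\) (set the gradient to zero and assume \(K\) is nonsingular), and the primal and dual solutions give identical predictions. Here is a minimal numpy sketch of that check; the dimensions and data are made up for illustration.

```python
import numpy as np

# Minimal check (squared loss): the primal ridge solution and the kernelized
# ("dual") solution give identical predictions.  The closed form
# alpha = (K + n*lam*I)^{-1} y is specific to the squared loss.
rng = np.random.default_rng(0)
n, d, lam = 50, 200, 0.1            # d > n is the regime where the dual helps
X = rng.standard_normal((n, d))     # rows are the training points x_i
y = rng.standard_normal(n)

# Primal: omega = argmin (1/n)||X w - y||^2 + lam ||w||^2
omega = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Dual: alpha solves (K + n*lam*I) alpha = y, with the linear kernel K = X X^T
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

x_new = rng.standard_normal(d)
print(omega @ x_new)                # primal prediction
print(alpha @ (X @ x_new))          # dual prediction: sum_i alpha_i <x_i, x_new>
```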

Now, suppose we want to learn using linear combinations of some fixed functions of the raw features \(x\), e.g. we want to learn using the features \(x_1, x_2, x_1^2, x_2^2, x_1 x_2, \ldots\). One can imagine that using such features could be of benefit over simply using the raw features. Let \(\phi(x)\) denote the feature map that maps the raw features from \(d\) dimensions into \(D\) new features; then the ERM of interest is

\[ \min_{\omega \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T \phi(x_i), y_i) + \lambda \|\omega\|_2^2,\]

and the same reasoning as before tells us that \(\omega^\star = \sum_{i=1}^n \alpha_i \phi(x_i)\). Defining the nonlinear kernel function \( \kappa(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle \) and the corresponding nonlinear kernel matrix \(K = [ \kappa(x_i, x_j) ]_{i,j=1,\ldots,n}\), we see that we can fit this model by solving

\[ \min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

the exact same convex problem as before, and also predict using

\[ (\omega^\star)^T \phi(x) = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

Note that \(\phi\) could even be infinite-dimensional, and this approach still works: the ERM stated in terms of the feature map is not solvable on a computer, but the kernelized ERM is, as it is only an \(n\)-dimensional optimization problem. In practice, even when \(\phi\) is finite-dimensional, \(D\) is typically large enough that the kernel version of the ERM is more efficient to solve than the feature map version. Further, \(\phi\) is chosen so that the kernel function can be evaluated efficiently.
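
As a concrete finite-dimensional example (mine, not the book’s), take the degree-2 polynomial kernel \(\kappa(x_1, x_2) = (1 + \langle x_1, x_2 \rangle)^2\): it corresponds to a feature map \(\phi\) listing all monomials of degree at most 2 (with suitable scaling factors), so \(D = O(d^2)\), yet the kernel itself costs only \(O(d)\) to evaluate. A quick numerical sanity check:

```python
import numpy as np
from itertools import combinations

# Illustrative check: (1 + <x, z>)^2 equals <phi(x), phi(z)> for the explicit
# feature map phi(x) = (1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j for i < j).
def phi(x):
    d = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]
    feats += [xi ** 2 for xi in x]
    feats += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.array(feats)

rng = np.random.default_rng(1)
x, z = rng.standard_normal(5), rng.standard_normal(5)
print((1 + x @ z) ** 2)          # kernel evaluated in O(d) time
print(phi(x) @ phi(z))           # same value via the O(d^2)-dimensional map
```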

The canonical example of a feature map used in kernel learning is the Gaussian/radial basis feature map, an infinite-dimensional feature map that consists of all the monomials in the raw features, each downweighted by an appropriate exponential factor (it’s tedious to write out; you can look it up). The nice thing is that this choice of feature map leads to the RBF kernel function that we all know and love

\[ \kappa(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|_2^2}{2\sigma^2} \right), \]

where \(\sigma^2\) is a hyperparameter.
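
Putting the pieces together, here’s a minimal sketch of RBF kernel ridge regression built from the formulas above (squared loss, so \(\alpha^\star\) again has a closed form; the data and hyperparameters are placeholders).

```python
import numpy as np

# Minimal RBF kernel ridge regression, following the formulas above.
# Squared loss is assumed, so alpha = (K + n*lam*I)^{-1} y in closed form.
def rbf_kernel(X1, X2, sigma2):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma2))

rng = np.random.default_rng(2)
n, d, lam, sigma2 = 100, 3, 1e-2, 1.0
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = rbf_kernel(X, X, sigma2)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# Predict at new points: f(x) = sum_i alpha_i * kappa(x_i, x)
X_test = rng.standard_normal((5, d))
print(rbf_kernel(X_test, X, sigma2) @ alpha)
```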

Truncated SVD … how?

This question has been bothering me off and on for several months now: how *exactly* is the truncated SVD computed, and what is the cost? I’ve gathered that most methods are based on Krylov subspaces, and that implicitly restarted Arnoldi processes seem to be the most popular proposal … at least in the minuscule amount of literature I’ve been able to find on the subject (like only two or three papers, tops, that actually focus on computing truncated SVDs, as opposed to the multitudes that propose using them). However, I’ve no idea if that’s actually the case, or just the impression I’ve gotten.

But mostly, I want to know the compute time of the most efficient truncated SVD algorithm. Why? A naive estimate of the cost of computing the rank-\(k\) truncated SVD of an \(m\times n\) matrix using Krylov subspace techniques is \(\mathrm{O}(mnk)\), but in practice the compute time actually depends on properties of the matrix. On the other hand, basically all randomized low-rank techniques take around that many operations without depending on the properties of the matrix. Soooo, it’d be nice to know when randomized techniques are competitive. Since there’s so much literature on them, I assume they are (and the Halko–Martinsson–Tropp survey paper has some plots that show speedups).
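
For concreteness, this is the kind of head-to-head I have in mind: scipy’s svds (ARPACK-based, i.e. a Krylov method) against the basic randomized range-finder prototype from the Halko–Martinsson–Tropp survey. Just a sketch to compare outputs and timings, not a serious benchmark.

```python
import numpy as np
from scipy.sparse.linalg import svds   # Krylov-based truncated SVD (ARPACK)

def randomized_svd(A, k, oversample=10, n_iter=2):
    # Basic randomized range finder + small SVD, in the spirit of
    # Halko-Martinsson-Tropp; n_iter power iterations sharpen the range.
    rng = np.random.default_rng(0)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Y = A @ Omega
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 500)) @ rng.standard_normal((500, 300))
k = 20
U1, s1, Vt1 = svds(A, k=k)              # Krylov route
U2, s2, Vt2 = randomized_svd(A, k)      # randomized route
print(np.sort(s1)[::-1][:5])            # sort, since svds doesn't fix an order
print(s2[:5])                           # leading singular values roughly agree
```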

QOTD: total unimodularity

I started reading Matousek’s discrete geometry book. Specifically, chapter 12 on applications of high-dimensional polytopes. Mostly because he apparently draws a connection between graphs and the Brunn-Minkowski theory, and I’m interested to see exactly what that connection is and if it has any interesting implications.

I just finished the proof of the weak perfect graph conjecture (it’s been a while since I read a nontrivial pure math proof, so I haven’t actually *digested* it entirely). Now I’m on the exercises for that portion. So, here’s an interesting question involving the concept of total unimodularity, which is apparently one of those foundational concepts in an area called polyhedral combinatorics.

A matrix \(\mat{A}\) is called totally unimodular if every square submatrix of \(\mat{A}\) has determinant 0 or \(\pm 1.\) The question is to show that every nonsingular totally unimodular \(n \times n\) matrix maps the lattice \(\mathbb{Z}^n\) bijectively onto itself.

Update (solution)
It’s easy peasy. Each entry of \(\mat{A}\) is itself a \(1 \times 1\) submatrix, so it lies in \(\{-1,0,1\}\); in particular \(\mat{A}\) maps \(\mathbb{Z}^n\) into itself. Bijectivity follows from Cramer’s rule: since \(\mat{A}\) is nonsingular and totally unimodular, \(|\mathrm{det}(\mat{A})| = 1,\) and all the minors of \(\mat{A}\) are integral, so \(\mat{A}^{-1}\) is an integer matrix and maps \(\mathbb{Z}^n\) back into itself.
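
A brute-force numerical illustration of the statement, using a small interval matrix (consecutive ones in each row), which is a standard example of a totally unimodular matrix:

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular(A):
    # Brute force: every square submatrix must have determinant in {-1, 0, 1}.
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                if round(np.linalg.det(A[np.ix_(rows, cols)])) not in (-1, 0, 1):
                    return False
    return True

# A small nonsingular interval matrix (totally unimodular).
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1]])
assert is_totally_unimodular(A)
print(round(abs(np.linalg.det(A))))   # |det A| = 1
print(np.round(np.linalg.inv(A)))     # integer inverse => A maps Z^3 onto Z^3
```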

Are there perturbations that preserve incoherence and give nicely conditioned “submatrices”?

Let \(\mat{P}_{\mathcal{U}}\) denote the projection onto a \(k\)-dimensional subspace \(\mathcal{U}\) of \(\C^{n}.\) We say \(\mathcal{U}\) is \(\mu\)-coherent if \((\mat{P}_{\mathcal{U}})_{ii} \leq \mu \frac{k}{n} \) for all \(i = 1, \ldots, n.\)

Let \(\mat{A} \in \C^{n \times n} \) be an SPSD matrix whose top \(k\)-dimensional invariant subspace is \(\mu\)-coherent. Given \(\delta > 0\) and \(\mat{S} \in \C^{n \times \ell}\), where \(k < \ell \ll n,\) is there a matrix \(\mat{E}_{\delta,\mat{S}} \in \C^{n \times n} \) such that

  1. \(\hat{\mat{A}} := \mat{A} + \mat{E}_{\delta,\mat{S}}\) is SPSD,
  2. The top \(k\)-dimensional invariant subspace of \(\hat{\mat{A}}\) is also \(\mu\)-coherent,
  3. \(\|\hat{\mat{A}} - \mat{A}\|_2 < \delta \), and
  4. \(\mat{S}^t \hat{\mat{A}} \mat{S} \succeq \delta \cdot \mat{I}\) ?

It’d be fine if, instead of (2) holding, the coherence of the top \(k\)-dimensional invariant subspace of \(\hat{\mat{A}}\) were only slightly larger than that of \(\mat{A}\).
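
For reference, here’s how I compute the coherence in question (the test matrix below is arbitrary, just to make the snippet run):

```python
import numpy as np

def coherence(A, k):
    # Returns the smallest mu such that the top k-dimensional invariant
    # subspace U of the SPSD matrix A is mu-coherent:
    # mu = (n/k) * max_i (P_U)_{ii}, where P_U = U_k U_k^*.
    n = A.shape[0]
    _, U = np.linalg.eigh(A)            # eigenvalues in ascending order
    U_k = U[:, -k:]                     # basis for the top-k invariant subspace
    leverage = np.sum(np.abs(U_k) ** 2, axis=1)   # diagonal of P_U
    return (n / k) * leverage.max()

rng = np.random.default_rng(3)
n, k = 200, 5
G = rng.standard_normal((n, n))
A = G @ G.T                             # an arbitrary SPSD test matrix
print(coherence(A, k))                  # lies between 1 and n/k
```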

Sampling models and vector sums

What happens when you sample columns from a matrix whose columns all have roughly the same norm? It’s clear that you should have some kind of concentration: say if you sample \(\ell\) of \(n\) columns, then the Frobenius norm of the resulting matrix, appropriately scaled, should be a very good approximation of the Frobenius norm of the original matrix. This is easy to show (well, it’s easy to establish a rather weak probability tail bound on the norm) using a vector Bernstein inequality, where the ‘vectors’ are simply copies of the original matrix with all but one column set to zero. The key step is to change the sampling model from one where you randomly select a subset of cardinality \(\ell\) to a simpler one based on i.i.d. Bernoulli r.v.s, where you randomly select a subset with expected cardinality \(\ell\). The change is straightforward because in this case the supports of the vectors don’t overlap, so adding more vectors monotonically increases the Frobenius norm.
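
(A quick numerical illustration of the first claim, with a made-up Gaussian test matrix: sample \(\ell\) of \(n\) columns, rescale by \(\sqrt{n/\ell}\), and compare Frobenius norms.)

```python
import numpy as np

# Sample ell of n columns uniformly without replacement, rescale by
# sqrt(n/ell) so the squared Frobenius norm is unbiased, and compare.
rng = np.random.default_rng(6)
m, n, ell = 50, 1000, 100
A = rng.standard_normal((m, n))         # columns have comparable norms
idx = rng.choice(n, size=ell, replace=False)
A_sub = np.sqrt(n / ell) * A[:, idx]
print(np.linalg.norm(A, 'fro'), np.linalg.norm(A_sub, 'fro'))
```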

Harder question: what happens when you sample columns from \(\mat{X}\) and rows from \(\mat{Y}\), once again with appropriate rescaling, to form \(\hat{\mat{X}}\) and \(\hat{\mat{Y}}\)? In particular, what can you say about \(\|\hat{\mat{X}}\hat{\mat{Y}} - \mat{X}\mat{Y}\|_F\)? Again, assuming the Frobenius norms of \(\mat{X}\) and \(\mat{Y}\) are evenly distributed across their columns and rows, respectively, this difference should be small. It’s easy to get a bound on the expectation, but what happens in probability? Here you could again use a vector Bernstein inequality to get a reasonable result (I think), with vectors \(\text{vec}(\mat{X}_i \mat{Y}^i)\), but now the supports of the vectors overlap, so it’s not clear if the change of sampling models can be done.

So, the question is: let \(\mathbb{P}_{U(\ell)}\) denote the probability under the model of randomly selecting subsets of cardinality exactly \(\ell\), and \(\mathbb{P}_{B(\ell)}\) denote the probability under the model of randomly selecting subsets of expected cardinality \(\ell\). Is it the case that, for vectors in some vector space,
\[
\mathbb{P}_{U(\ell)}\left( \left\|\sum_{i \in I} v_i \right\| \geq t \right) \leq 2 \mathbb{P}_{B(\ell)}\left( \left\|\sum_{i \in I} v_i \right\| \geq t \right)?
\]
The constant may vary, or there may even be a small additive term on the rhs, and I really only need an answer in the case that the norm is the standard Euclidean norm on \(\R^n\).
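
In the meantime, a Monte Carlo comparison of the two sampling models is easy to set up; the sketch below just looks at the distribution of \(\|\hat{\mat{X}}\hat{\mat{Y}} - \mat{X}\mat{Y}\|_F\) under both models for a random test instance. It is only a sanity check, not evidence for the inequality.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p, ell, trials = 200, 30, 30, 40, 2000
X = rng.standard_normal((m, n))
Y = rng.standard_normal((n, p))
XY = X @ Y

def error(idx):
    # Rescale by n/ell (one over the marginal inclusion probability) so the
    # sampled product is unbiased, then measure the Frobenius-norm error.
    return np.linalg.norm((n / ell) * X[:, idx] @ Y[idx, :] - XY)

err_u = [error(rng.choice(n, size=ell, replace=False)) for _ in range(trials)]
err_b = [error(np.flatnonzero(rng.random(n) < ell / n)) for _ in range(trials)]

t = np.median(err_u)
print(np.mean(np.array(err_u) >= t))    # tail probability under U(ell)
print(np.mean(np.array(err_b) >= t))    # tail probability under B(ell)
```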

Verify these matrix properties (easy and fun)

I’m looking at products like \(\mat{G}^t \mat{A} \mat{G}\) where the columns of \(\mat{G}\) are (nonisotropic) normal vectors. Specifically, I’d like to know the distribution of the eigen/singular-values of this product. Surprisingly, I was unable to find any results in the literature on this, so I started reading Gupta and Nagar to see if I could work it out using Jacobian magic.
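
(It’s at least easy to poke at numerically; the dimensions and covariance below are placeholders I chose just to look at the empirical spectrum.)

```python
import numpy as np

# Empirical spectrum of G^T A G when the columns of G are i.i.d. N(0, Sigma)
# with a nonisotropic Sigma.  All parameters here are placeholders.
rng = np.random.default_rng(5)
n, k, trials = 100, 10, 500
A = np.diag(np.linspace(1.0, 10.0, n))             # a fixed SPSD test matrix
Sigma_sqrt = np.diag(np.sqrt(np.linspace(0.5, 2.0, n)))

spectra = []
for _ in range(trials):
    G = Sigma_sqrt @ rng.standard_normal((n, k))   # nonisotropic normal columns
    spectra.append(np.linalg.eigvalsh(G.T @ A @ G))
print(np.mean(spectra, axis=0))                    # average ordered eigenvalues
```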

In the preliminary material, they list some basic matrix algebra facts. Your mission, should you choose to accept it, is to prove the following:

  • If \(\mat{A} > \mat{0},\) \(\mat{B} > \mat{0},\) and \(\mat{A} - \mat{B} > \mat{0},\) then \(\mat{B}^{-1} - \mat{A}^{-1} > \mat{0}\)
  • If \(\mat{A}>\mat{0}\) and \(\mat{B} > \mat{0},\) then \(\det(\mat{A}+\mat{B}) > \det(\mat{A}) + \det(\mat{B})\)

The second’s easy, but for the first I found it necessary to use the fact that every positive matrix has a unique positive square root.
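
Neither fact is hard, but a random numerical spot-check is a fun two minutes (this proves nothing, of course; it just exercises the statements on random positive definite matrices):

```python
import numpy as np

# Spot-check the two facts on random positive definite matrices.
rng = np.random.default_rng(7)
n = 5

def random_pd():
    G = rng.standard_normal((n, n))
    return G @ G.T + 0.1 * np.eye(n)               # positive definite

for _ in range(100):
    B = random_pd()
    A = B + random_pd()                            # A - B > 0 by construction
    # Fact 1: B^{-1} - A^{-1} is positive definite.
    assert np.linalg.eigvalsh(np.linalg.inv(B) - np.linalg.inv(A)).min() > 0
    # Fact 2: det(A + B) > det(A) + det(B).
    assert np.linalg.det(A + B) > np.linalg.det(A) + np.linalg.det(B)
print("all checks passed")
```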

Some more, this time on Kronecker products:

  • If \(\mat{A}\) has eigenvalues \(\{\alpha_i\}\) and \(\mat{B}\) has eigenvalues \(\{\beta_j\},\) then \(\mat{A} \otimes \mat{B}\) has eigenvalues \(\{\alpha_i \beta_j\}.\)
  • If \(\mat{A}\) is \(m \times m\) and \(\mat{B}\) is \(n \times n\), then \(\det(\mat{A} \otimes \mat{B}) = \det(\mat{A})^n \det(\mat{B})^m\)

It’s useful here to note that
\[
(\mat{A} \otimes \mat{B}) (\mat{C} \otimes \mat{D}) = \mat{AC} \otimes \mat{BD},
\]
which has some obvious consequences, such as the fact that the Kronecker product of two orthogonal matrices is orthogonal. To be clear, the first Kronecker question can be addressed without any tedious calculations.
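
Both Kronecker facts are also easy to confirm numerically with numpy (a spot-check, not a proof; I use symmetric matrices so the spectra are real and easy to compare):

```python
import numpy as np

# Spot-check: the eigenvalues of A (x) B are the pairwise products
# alpha_i * beta_j, and det(A (x) B) = det(A)^n * det(B)^m
# for A m-by-m and B n-by-n.
rng = np.random.default_rng(8)
m, n = 3, 4
A = rng.standard_normal((m, m)); A = A + A.T
B = rng.standard_normal((n, n)); B = B + B.T
K = np.kron(A, B)

prod_eigs = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
kron_eigs = np.sort(np.linalg.eigvalsh(K))
print(np.allclose(prod_eigs, kron_eigs))
print(np.isclose(np.linalg.det(K), np.linalg.det(A) ** n * np.linalg.det(B) ** m))
```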