April 2014 – ChapterZero

I’m pushing to submit a preprint on the Nystrom method that has been knocking around for the longest time.

I find myself running into problems centering around expressions of the type \(B^{-1}A\), where \(A, B\) are SPSD matrices satisfying \(B \preceq A\). This expression will be familiar to numerical linear algebraists: there \(B\) would be a preconditioner for a linear system \(A x = b,\) and the relevant quantity of interest is the spectral radius of \(B^{-1} A\).

It’s not hard to show that the spectral radius of this product is at most 1… but instead, I’m interested in the norm of this product. Because the spectral radius of the product is at most 1, we can use the bound
\[
\|B^{-1} A\|_2 \leq \kappa(U_{B^{-1}A})
\]
where \(\kappa(U_{B^{-1}A})\) is the two-condition number of the matrix of eigenvectors of \(B^{-1}A \).

In the applications I’m interested in, some rough numerical experiments show that this bound is good enough for my purposes (the two terms are of the same order). Assuming this is the case, how can I get a good theoretical bound on this condition number?

I am not completely satisfied with the expositions of CCA that I’ve come across, so I decided to write one that reflects my own intuition.

CCA is useful in the case where you observe two random variables that are both noisy linear functions of some underlying latent random variable, and you want to use this fact to help you guess the latent variable. Formally, assume
\[
x = T_X z + n_x, \quad \text{ and } \quad y = T_Y z + n_y,
\]
where, without loss of generality, we assume that the entries of \(z\) are uncorrelated and unit variance. Here \(T_X\) and \(T_Y\) are matrices whose image spaces may be of different dimension, and \(n_x\) and \(n_y\) are white noise. For convenience, we assume that \(z\) is mean-zero, so that \(x\) and \(y\) are also.

In order to recover \(z\) from \(x\) and \(y\), we phrase the following question: find two transformations \(\Phi_X\) and \(\Phi_Y\) so that the entries of \(\Phi_X x\) are uncorrelated and unit variance, as are those of \(\Phi_Y y\), and the correlation of \(\Phi_X x\) with \(\Phi_Y y\) is maximized.

We could use PCA to whiten \(x\) and \(y\) individually to get two different noisy estimates of \(z\), but this would ignore the fact that knowing both \(x\) and \(y\) gives us more information on \(z\). Indeed, the requirement that the correlation of \(\Phi_X x\) and \(\Phi_Y y\) be maximized should tend to choose \(\Phi_X\) and \(\Phi_Y\) which remove the directions containing the noise \(n_x\) and \(n_y\).

CCA can then be formulated as the following:
\[
\max_{\Phi_X, \Phi_Y} \mathbb{E} \langle \Phi_X x, \Phi_Y y \rangle \text{ subject to }
\mathbb{E} \Phi_X x (\Phi_X x)^T = \mathbb{E} \Phi_Y y (\Phi_Y y)^T = I,
\]
or equivalently,
\[
\max_{\Phi_X, \Phi_Y} \text{Trace}(\Phi_Y C_{y x} \Phi_X^T) \text{ subject to } \Phi_X C_{xx} \Phi_X^T = \Phi_Y C_{yy} \Phi_Y^T = I,
\]
where \(C_{xx}, C_{yy}\) are the covariance matrices of \(x\) and \(y\) and \(C_{yx}\) is the cross-covariance matrix of \(y\) and \(x\) given by \(C_{yx} = \mathbb{E} (yx^T).\)

Since \(C_{xx}\) and \(C_{yy}\) are full-rank, by taking \(\Phi_X^\prime = \Phi_X C_{xx}^{1/2}\) and \(\Phi_Y^\prime = \Phi_Y C_{yy}^{1/2}, \)this program can be transformed to
\[
\max_{\Phi_X^\prime, \Phi_Y^\prime} \text{Trace}(\Phi_Y^\prime C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2} \Phi_X^{\prime T}) \text{ subject to } \Phi_x^\prime \Phi_x^{\prime T} = \Phi_Y^\prime \Phi_Y^{\prime T} = I.
\]

From the standard characterization of the SVD, it is clear that the solution \(\Phi_X^\prime\) and \(\Phi_Y^\prime\) are given by the top right and left singular vectors of the matrix \(C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2}.\) Equivalently, they are the top eigenvectors of the two matrices \(C_{xx}^{-1/2} C_{xy} C_{yy}^{-1} C_{yx} C_{xx}^{-1/2}\) and \(C_{yy}^{-1/2} C_{yx} C_{xx}^{-1} C_{xy} C_{yy}^{-1}.\) The eigenvectors of these two unwieldy matrices are also the top eigenvectors of the generalized eigenvalue problems given by
\begin{align*}
C_{xy} C_{yy}^{-1} C_{yx} \phi_i^x & = \lambda_i C_{xx} \phi_i^x \\
C_{yx} C_{xx}^{-1} C_{xy} \phi_i^y & = \lambda_i C_{yy} \phi_i^y.
\end{align*}
Here the \(\lambda_i\) are the squared canonical correlations, and the \(\phi^x_i, \phi^y_i\) are the canonical correlation basis vectors.

None of the non-variational characterizations given so far seem terribly efficiently, since they all involve two matrix inversion. It turns out that CCA can be done by using a QR decomposition instead. See Björck and Golub, 1973 for the details of the algorithm, and the connection between CCA and the principle angles between subspaces.

Month: April 2014

Eigenvector two-condition number for a product of PSD matrices

Canonical Correlation Analysis (CCA)