Canonical Correlation Analysis (CCA)

I am not completely satisfied with the expositions of CCA that I’ve come across, so I decided to write one that reflects my own intuition.

CCA is useful when you observe two random variables that are both noisy linear functions of some underlying latent random variable, and you want to exploit this fact to estimate the latent variable. Formally, assume
\[
x = T_X z + n_x, \quad \text{ and } \quad y = T_Y z + n_y,
\]
where, without loss of generality, we assume that the entries of \(z\) are uncorrelated and unit variance. Here \(T_X\) and \(T_Y\) are matrices whose image spaces may be of different dimension, and \(n_x\) and \(n_y\) are white noise. For convenience, we assume that \(z\) is mean-zero, so that \(x\) and \(y\) are also.
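To make this setup concrete, here is a minimal numpy sketch of the generative model; the dimensions, mixing matrices, and noise scale are arbitrary choices for illustration, not anything canonical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x, d_y, n = 3, 10, 8, 5000     # latent/observed dimensions and sample count (arbitrary)

T_X = rng.normal(size=(d_x, d_z))     # mixing matrices (arbitrary)
T_Y = rng.normal(size=(d_y, d_z))

Z = rng.normal(size=(d_z, n))                   # mean-zero, identity-covariance latent samples
X = T_X @ Z + 0.5 * rng.normal(size=(d_x, n))   # x = T_X z + n_x  (white noise)
Y = T_Y @ Z + 0.5 * rng.normal(size=(d_y, n))   # y = T_Y z + n_y
```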

In order to recover \(z\) from \(x\) and \(y\), we pose the following problem: find two transformations \(\Phi_X\) and \(\Phi_Y\) so that the entries of \(\Phi_X x\) are uncorrelated and unit variance, as are those of \(\Phi_Y y\), and the correlation of \(\Phi_X x\) with \(\Phi_Y y\) is maximized.

We could use PCA to whiten \(x\) and \(y\) individually to get two different noisy estimates of \(z\), but this would ignore the fact that knowing both \(x\) and \(y\) gives us more information on \(z\). Indeed, the requirement that the correlation of \(\Phi_X x\) and \(\Phi_Y y\) be maximized should tend to choose \(\Phi_X\) and \(\Phi_Y\) which remove the directions containing the noise \(n_x\) and \(n_y\).

CCA can then be formulated as the following:
\[
\max_{\Phi_X, \Phi_Y} \mathbb{E} \langle \Phi_X x, \Phi_Y y \rangle \text{ subject to }
\mathbb{E} \Phi_X x (\Phi_X x)^T = \mathbb{E} \Phi_Y y (\Phi_Y y)^T = I,
\]
or equivalently,
\[
\max_{\Phi_X, \Phi_Y} \text{Trace}(\Phi_Y C_{y x} \Phi_X^T) \text{ subject to } \Phi_X C_{xx} \Phi_X^T = \Phi_Y C_{yy} \Phi_Y^T = I,
\]
where \(C_{xx}, C_{yy}\) are the covariance matrices of \(x\) and \(y\) and \(C_{yx}\) is the cross-covariance matrix of \(y\) and \(x\) given by \(C_{yx} = \mathbb{E} (yx^T).\)
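The equivalence is just the identity \(\langle a, b \rangle = \text{Trace}(b a^T)\) together with linearity of expectation:
\[
\mathbb{E} \langle \Phi_X x, \Phi_Y y \rangle = \mathbb{E}\, \text{Trace}\big(\Phi_Y y (\Phi_X x)^T\big) = \text{Trace}\big(\Phi_Y\, \mathbb{E}(y x^T)\, \Phi_X^T\big) = \text{Trace}(\Phi_Y C_{yx} \Phi_X^T).
\]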

Since \(C_{xx}\) and \(C_{yy}\) are full rank (the white noise guarantees this), taking \(\Phi_X^\prime = \Phi_X C_{xx}^{1/2}\) and \(\Phi_Y^\prime = \Phi_Y C_{yy}^{1/2}\) transforms this program into
\[
\max_{\Phi_X^\prime, \Phi_Y^\prime} \text{Trace}(\Phi_Y^\prime C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2} \Phi_X^{\prime T}) \text{ subject to } \Phi_X^\prime \Phi_X^{\prime T} = \Phi_Y^\prime \Phi_Y^{\prime T} = I.
\]

From the standard variational characterization of the SVD, the rows of the optimal \(\Phi_X^\prime\) and \(\Phi_Y^\prime\) are given by the top right and left singular vectors, respectively, of the matrix \(C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2}.\) Equivalently, they are the top eigenvectors of the two matrices \(C_{xx}^{-1/2} C_{xy} C_{yy}^{-1} C_{yx} C_{xx}^{-1/2}\) and \(C_{yy}^{-1/2} C_{yx} C_{xx}^{-1} C_{xy} C_{yy}^{-1/2}.\) Undoing the change of variables, i.e. setting \(\phi_i^x = C_{xx}^{-1/2} v_i\) for each eigenvector \(v_i\) of the first matrix (and similarly for the second), turns these two unwieldy eigenproblems into the generalized eigenvalue problems
\begin{align*}
C_{xy} C_{yy}^{-1} C_{yx} \phi_i^x & = \lambda_i C_{xx} \phi_i^x \\
C_{yx} C_{xx}^{-1} C_{xy} \phi_i^y & = \lambda_i C_{yy} \phi_i^y.
\end{align*}
Here the \(\lambda_i\) are the squared canonical correlations, and the \(\phi^x_i, \phi^y_i\) are the canonical correlation basis vectors; they are exactly the rows of the optimal \(\Phi_X\) and \(\Phi_Y\) for the original program.
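Putting the pieces together, here is a sketch of the SVD route in numpy, continuing from the synthetic X, Y above. Empirical covariances stand in for the population ones, and in practice you would regularize them before inverting; none of the names below are canonical.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

Cxx, Cyy, Cyx = X @ X.T / n, Y @ Y.T / n, Y @ X.T / n   # empirical covariances

Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)     # C_xx^{-1/2} and C_yy^{-1/2}
U, s, Vt = np.linalg.svd(Wy @ Cyx @ Wx)   # s_i = sqrt(lambda_i), the canonical correlations

k = d_z                       # keep one canonical pair per latent dimension
Phi_X = Vt[:k] @ Wx           # rows are the phi_i^x (undoing the change of variables)
Phi_Y = U[:, :k].T @ Wy       # rows are the phi_i^y
```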

None of the non-variational characterizations given so far seems terribly efficient, since they all involve two matrix inversions. It turns out that CCA can instead be computed using a QR decomposition. See Björck and Golub, 1973 for the details of the algorithm, and for the connection between CCA and the principal angles between subspaces.
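For completeness, here is a QR-based sketch in the spirit of that algorithm (my own simplified rendering, not their exact pseudocode). QR-factor the centered data matrices, with samples as rows; the singular values of \(Q_X^T Q_Y\) are the cosines of the principal angles between the two column spaces, which are exactly the canonical correlations.

```python
import numpy as np

# Continuing from X, Y above; arrange samples as rows and center.
Xc = X.T - X.T.mean(axis=0)
Yc = Y.T - Y.T.mean(axis=0)

Qx, Rx = np.linalg.qr(Xc)     # thin QR factorizations of the data matrices
Qy, Ry = np.linalg.qr(Yc)

U, s, Vt = np.linalg.svd(Qx.T @ Qy)   # s: cosines of principal angles = canonical correlations

# Canonical weight vectors, recovered by solving against the triangular R factors.
# (The variates Xc @ Ax have unit norm; rescale by sqrt(n) for unit sample variance.)
Ax = np.linalg.solve(Rx, U)
Ay = np.linalg.solve(Ry, Vt.T)
```

This route avoids forming or inverting covariance matrices explicitly, which is both cheaper and numerically better behaved when the data are ill-conditioned.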