when do \(A\) and \(AA^T\) have the same spectrum?

There are matrices for which the spectrum of \(\mat{A}\) and \(\mat{A}\mat{A}^T\) are identical. As an example, take
\mat{A} = \begin{pmatrix}
\frac{1}{2} & 0 & \frac{1}{2} & 0 \\
0 & \frac{1}{2} & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
Freaky, right? I wouldn’t have believed it. I know one class of matrices for which this is true, but it’s a bit funky: if \(\mat{C}\) and \(\mat{B}\) are rectangular matrices related in a nice way (essentially, \(\mat{B}\) is \(\mat{C}\) plus a perturbation that is orthogonal to the range space of \(\mat{C}\)), then \(\mat{A} = \mat{C} \mat{B}^\dagger\) has this property. Here \(\dagger\) denotes pseudoinversion. The proof of this special case is complicated, and not particularly intuitive. So I wonder if there’s a simpler proof for the general case.

Summer of languages

This summer’s the summer of languages for me. I’m learning R piecemeal, because I’m working on a data analytics project that requires a lot of statistical analysis: learning a new language is a bother, but the unprecedented amount of statistical packages available for R justifies the investment. I also decided to dive into Julia, despite its youth, mostly because of this:

We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

(Did we mention it should be as fast as C?)

Well, that and the fact that Matlab isn’t readily available here at IBM. I could use my Caltech license, but the writing’s on the wall. I might as well train myself in a good alternative for when I no longer have that option. Octave isn’t an option I’m happy with: it looks kludgy, Matlab code doesn’t *actually* run without modification, and the plotting options are embarassingly sad (Gnuplot in 2012? come on).

Third week of the IBM internship

I’m working on an internal project that motivated me to look into survival analysis. Cool stuff, that. Essentially, you have a bunch of data about the lifetimes of the some objects and potentially related covariates, and from that data you’d like to estimate the functional relationship between the lifetimes and the covariates, in the form of the survival function \(P(T > x) = f(x, X_1, \ldots, X_n).\) It’s not clear that this is the best lens to approach the particular problem I’m looking at (churn)— for instance, you could think of this in machine learning or approximation theory terms rather than statistical terms—, but it’s certainly a very sensible approach.

I still have access to the online Springer books through Caltech, so I started to read Dynamic regression methods for survival analysis, which is more than a bit ambitious given my lack of statistical training. The first chapter’s quite inspiring though: I now want to learn about compensators and martingale central limit theorems. Apparently one can view a lot of statistical estimators as compensators (or something close), and the difference between the estimators and the actual quantities as a martingale (or, more generally, a zero-mean stochastic process). Then the asymptotic error of the estimator can be understood through the central limit theorem (\emph{or}, in the more general case, through empirical process theory— which for various reasons, including the connection to sums of independent r.v.s in Banach spaces, I’m also very interested in learning).

Of course, my first question is: is there a nonasymptotic version of the martingale central limit theorem? And my second question is: do these concepts make sense in the matrix case (whatever that is), and what’s known there? Is this something worth working on? Unfortunately, I know nothing about statistics … I think maybe there’s a relation to the self-normalizing processes that Joel mentioned I should look into. So there’s one more thing to add to my list of too-little-time investigations.

Links to interesting literature:

Truncated SVD … how?

This question has been bothering me off and on for several months now: how *exactly* is the truncated SVD computed, and what is the cost? I’ve gathered that most methods are based on Krylov subspaces, and that implicitly restarted Arnoldi processes seem to be the most popular proposal … in what miniscule amount of literature I’ve been able to find on the subject (like only two or three papers, tops, that actually focus on computing truncated SVDs, as opposed to the multitudes that propose using them). However, I’ve no idea if that’s the case, or just the impression I have.

But mostly, I want to know the compute time of the most efficient truncated SVD algorithm. Why? A naive estimate of getting the rank-\(k\) truncated SVD of an \(m\times n\) matrix using Krylov subspace techniques is \(\mathrm{O}(mnk)\), but in practice the compute time actually depends on properties of the matrix. On the other hand, basically all randomized low-rank techniques take around that many operations but without depending on the properties of the matrix. Soooo, it’d be nice to know when randomized techniques are competitive. Since there’s so much literature on them, I assume they are competitive (and the Halko–Martinsson–Tropp survey paper has some plots that show speedups).