Monument

Jeff Salyards is doing an Ask Me Anything (AMA) on Reddit tonight at 5 PCT. I started reading his debut novel, Scourge of the Betrayer, a couple of weeks ago, but couldn’t get into it for one reason or another. Maybe I was just coming off of an urban fantasy binge, and the relatively heavier setting of SOTB was just too much for me to get into at the time (the couple dozens of pages I read evoked Glenn Cook’s Dark Company novels, especially since the main character is a chronicler). Or maybe I was put off by the heavy-handed way in which a SECRET MISSION with EPIC IMPLICATIONS was being hinted at in the little I read (I mean, almost everyone reading this genre knows what to expect, ok?). Either way, I plan on finishing it eventually, if only to see why the Jeff chose a flail as the weapon of import. But I do expect that it gets better, eventually.

I mention Jeff’s AMA mostly because it got me thinking about his book, which in turn got me thinking about the Dark Company novels (which I love love love), which in turn got me thinking about Ian Graham’s excellent debut novel, Monument. I swear this book flew under the radar, and Ian Graham seems not to have written anything since. But when I come across a fantasy novel revolving around an anti-hero, at some point or another I end up comparing it with Monument. So far none of the anti-heroes I’ve come across have been as despicable and yet somehow comfortably human as Ballas. I’m not even going to try to sell the book to you: just go and read it.

A list of \(\|\cdot\|_{\infty \rightarrow 1}^*\) problems

Recall our old friend the \(\|\cdot\|_{\infty \rightarrow 1}^*\) norm (from the first version of this website, which I still haven’t gotten around to merging into the current iteration). Given a matrix \(\mat{A},\) it is a measure of the energy in the smallest expansion of \(\mat{A}\) in terms of rank one sign matrices:
\[
\|\mat{A}\|_{\infty \rightarrow 1}^* = \inf\left\{\sum_i d_i \,:\, \mat{A} = \sum_i d_i \vec{u}_i \vec{v}_i, \right\}
\]
where \(\{\vec{u}_i, \vec{v}_i\}\) is a collection of sign vectors. Note that the number of sign vectors necessary to achieve this minimum is not known a priori.

This is an interesting norm both intrinsically (as a potential source of factor models in statistics and machine learning) and because of its relations to other norms. See these slides for details.

An estimate of \(\|\cdot\|_{\infty \rightarrow 1}^*\) to within a multiplicative factor of about \(1.783\) can be obtained using a semidefinite program, but I suspect it’s NP hard to compute \(\|\cdot\|_{\infty \rightarrow 1}^*\) exactly (because \(\|\cdot\|_{\infty \rightarrow 1}^*\) is the trace dual of the \(\|\cdot\|_{\infty \rightarrow 1}\) norm, which is known to be NP-hard to compute — if the latter is also APX-hard, which I’m not sure of either way, then that means you definitely couldn’t compute \(\|\cdot\|_{\infty \rightarrow 1}^*\) efficiently in polynomial time). However, more interesting is the question of approximately obtaining the factorization corresponding to \(\|\cdot\|_{\infty \rightarrow 1}^*\): there doesn’t seem to be a good way to obtain such an approximate factorization from the SDP that gives the approximation of \(\|\cdot\|_{\infty \rightarrow 1}^*\), but maybe there’s a clever way to do so that I haven’t seen.

Other interesting questions abound: can we characterize \(\|\cdot\|_{\infty \rightarrow 1}^*\) or the number of rank 1 matrices needed to obtain the corresponding decomposition from a randomly chosen matrix (say i.i.d. Gaussian entries)? What if we just want probability bounds on the number of rank 1 matrices in the decomposition of a randomly chosen matrix?

I think the latter question is probably the easiest to answer of all posed here (except maybe coming up with an SDP-based approximate factorization).

Drawing to a close at IBM

Well damn. It’s been a long while. Time for another post about my not having posted in a while.

I’m wrapping up my internship here at IBM, preparing the exit talk on my research. It’s entitled “New results on the accuracy of randomized spectral methods.” I’ll put up the talk after we’ve wrapped up the paper and submitted it to ArXiv, in a month or so, but here’s a synopsis. Motivated by the fact that a lot of applications of randomized projection-based low-rank approximations use them to approximate subspaces rather than matrices (the canonical example being spectral clustering), we investigated the accuracy of the subspaces so obtained. It turns out that you have a pretty nice dependence on the number of steps of the power method you take, and the oversampling … the error bounds are almost usable. That’s a coup for machine learning applications! However, I think a sharper analysis (ours is as simple as possible) might be able to make the bounds tight enough for practical use. Along the way, we revisited the analysis of the classical subspace iteration method (unintentionally), and found a geometric interpretation of that “random space interaction with singular spaces” term that shows up in the analysis of random projection-based low-rank approximation error bounds.

I think the results are nice enough for a conference paper, but I’m (as usual) a little leery of claiming a strong original contribution, since a lot of what we did turns out to be only a few steps removed from others’ work. But them’s the breaks when you work in numerical linear algebra!

Problem: another inequality

Let \(\mat{F},\mat{G}\) be positive definite matrices (do they have to be definite?) and \(0 \leq p \leq 2.\) Show that
\[
\tr\left(\mat{F}^{p/2} – \mat{G}^{p/2}\right) \leq \frac{p}{2} \tr\left((\mat{F} – \mat{G})\mat{G}^{p/2 – 1}\right).
\]

I’m working on it. See if you can get a proof before I do đŸ™‚

when do \(A\) and \(AA^T\) have the same spectrum?

There are matrices for which the spectrum of \(\mat{A}\) and \(\mat{A}\mat{A}^T\) are identical. As an example, take
\[
\mat{A} = \begin{pmatrix}
\frac{1}{2} & 0 & \frac{1}{2} & 0 \\
0 & \frac{1}{2} & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}.
\]
Freaky, right? I wouldn’t have believed it. I know one class of matrices for which this is true, but it’s a bit funky: if \(\mat{C}\) and \(\mat{B}\) are rectangular matrices related in a nice way (essentially, \(\mat{B}\) is \(\mat{C}\) plus a perturbation that is orthogonal to the range space of \(\mat{C}\)), then \(\mat{A} = \mat{C} \mat{B}^\dagger\) has this property. Here \(\dagger\) denotes pseudoinversion. The proof of this special case is complicated, and not particularly intuitive. So I wonder if there’s a simpler proof for the general case.

Summer of languages

This summer’s the summer of languages for me. I’m learning R piecemeal, because I’m working on a data analytics project that requires a lot of statistical analysis: learning a new language is a bother, but the unprecedented amount of statistical packages available for R justifies the investment. I also decided to dive into Julia, despite its youth, mostly because of this:

We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

(Did we mention it should be as fast as C?)

Well, that and the fact that Matlab isn’t readily available here at IBM. I could use my Caltech license, but the writing’s on the wall. I might as well train myself in a good alternative for when I no longer have that option. Octave isn’t an option I’m happy with: it looks kludgy, Matlab code doesn’t *actually* run without modification, and the plotting options are embarassingly sad (Gnuplot in 2012? come on).

Third week of the IBM internship

I’m working on an internal project that motivated me to look into survival analysis. Cool stuff, that. Essentially, you have a bunch of data about the lifetimes of the some objects and potentially related covariates, and from that data you’d like to estimate the functional relationship between the lifetimes and the covariates, in the form of the survival function \(P(T > x) = f(x, X_1, \ldots, X_n).\) It’s not clear that this is the best lens to approach the particular problem I’m looking at (churn)— for instance, you could think of this in machine learning or approximation theory terms rather than statistical terms—, but it’s certainly a very sensible approach.

I still have access to the online Springer books through Caltech, so I started to read Dynamic regression methods for survival analysis, which is more than a bit ambitious given my lack of statistical training. The first chapter’s quite inspiring though: I now want to learn about compensators and martingale central limit theorems. Apparently one can view a lot of statistical estimators as compensators (or something close), and the difference between the estimators and the actual quantities as a martingale (or, more generally, a zero-mean stochastic process). Then the asymptotic error of the estimator can be understood through the central limit theorem (\emph{or}, in the more general case, through empirical process theory— which for various reasons, including the connection to sums of independent r.v.s in Banach spaces, I’m also very interested in learning).

Of course, my first question is: is there a nonasymptotic version of the martingale central limit theorem? And my second question is: do these concepts make sense in the matrix case (whatever that is), and what’s known there? Is this something worth working on? Unfortunately, I know nothing about statistics … I think maybe there’s a relation to the self-normalizing processes that Joel mentioned I should look into. So there’s one more thing to add to my list of too-little-time investigations.

Links to interesting literature:

Truncated SVD … how?

This question has been bothering me off and on for several months now: how *exactly* is the truncated SVD computed, and what is the cost? I’ve gathered that most methods are based on Krylov subspaces, and that implicitly restarted Arnoldi processes seem to be the most popular proposal … in what miniscule amount of literature I’ve been able to find on the subject (like only two or three papers, tops, that actually focus on computing truncated SVDs, as opposed to the multitudes that propose using them). However, I’ve no idea if that’s the case, or just the impression I have.

But mostly, I want to know the compute time of the most efficient truncated SVD algorithm. Why? A naive estimate of getting the rank-\(k\) truncated SVD of an \(m\times n\) matrix using Krylov subspace techniques is \(\mathrm{O}(mnk)\), but in practice the compute time actually depends on properties of the matrix. On the other hand, basically all randomized low-rank techniques take around that many operations but without depending on the properties of the matrix. Soooo, it’d be nice to know when randomized techniques are competitive. Since there’s so much literature on them, I assume they are competitive (and the Halko–Martinsson–Tropp survey paper has some plots that show speedups).

QOTD: total unimodularity

I started reading Matousek’s discrete geometry book. Specifically, chapter 12 on applications of high-dimensional polytopes. Mostly because he apparently draws a connection between graphs and the Brunn-Minkowski theory, and I’m interested to see exactly what that connection is and if it has any interesting implications.

I just finished the proof of the weak perfect graph conjecture (it’s been a while since I read a nontrivial pure math proof, so I haven’t actually *digested* it entirely). Now I’m on the exercises for that portion. So, here’s an interesting question involving the concept of total unimodularity, which is apparently one of those foundational concepts in an area called polyhedral combinatorics.

A matrix \(\mat{A}\) is called totally unimodular if every square submatrix of \(\mat{A}\) has determinant 0 or \(\pm 1.\) The question is to show that every nonsingular totally unimodular \(n \times n\) matrix maps the lattice \(\mathbb{Z}^n\) bijectively onto itself.

Update (solution)
It’s easy peasy. Clearly each entry of \(\mat{A}\) is in \(\{-1,0,1\}\), so \(\mat{A}\) maps \(\mathbb{Z}^n\) into itself. The fact that the mapping is bijective follows easily from Cramer’s rule, the fact that \(|\mathrm{det}(\mat{A})| = 1,\) and the fact that all the minors of \(\mat{A}\) are integral.