Truncated SVD … how?

This question has been bothering me off and on for several months now: how *exactly* is the truncated SVD computed, and what is the cost? I’ve gathered that most methods are based on Krylov subspaces, and that implicitly restarted Arnoldi processes seem to be the most popular proposal … at least in the minuscule amount of literature I’ve been able to find on the subject (only two or three papers, tops, actually focus on computing truncated SVDs, as opposed to the multitudes that propose using them). However, I’ve no idea if that’s really the case, or just the impression I have.

But mostly, I want to know the compute time of the most efficient truncated SVD algorithm. Why? A naive estimate of the cost of computing the rank-\(k\) truncated SVD of an \(m\times n\) matrix using Krylov subspace techniques is \(\mathrm{O}(mnk)\) operations, but in practice the compute time depends on properties of the matrix. On the other hand, basically all randomized low-rank techniques take around that many operations, but without depending on the properties of the matrix. Soooo, it’d be nice to know when randomized techniques are competitive. Since there’s so much literature on them, I assume they are (and the Halko–Martinsson–Tropp survey paper has some plots that show speedups).
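For concreteness, here’s a minimal sketch of the basic randomized scheme, in the spirit of the Halko–Martinsson–Tropp survey. The oversampling parameter p and the power-iteration count q are my own choices; this is an illustration, not a tuned implementation:

```python
import numpy as np

def randomized_svd(A, k, p=10, q=1):
    """Rank-k truncated SVD via a randomized range finder, in the spirit of
    the Halko-Martinsson-Tropp survey. The cost is dominated by the
    O(mn(k+p)) work in the products with A; the q power iterations help
    when the spectrum decays slowly, at the price of more passes over A."""
    rng = np.random.default_rng()
    G = rng.standard_normal((A.shape[1], k + p))  # Gaussian test matrix
    Y = A @ G                                     # sample the range of A
    for _ in range(q):                            # optional power iterations
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                        # orthonormal basis for range(Y)
    B = Q.T @ A                                   # reduce to a (k+p) x n problem
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k, :]
```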

QOTD: total unimodularity

I started reading Matousek’s discrete geometry book. Specifically, chapter 12 on applications of high-dimensional polytopes. Mostly because he apparently draws a connection between graphs and the Brunn-Minkowski theory, and I’m interested to see exactly what that connection is and if it has any interesting implications.

I just finished the proof of the weak perfect graph conjecture (it’s been a while since I read a nontrivial pure math proof, so I haven’t actually *digested* it entirely). Now I’m on the exercises for that portion. So, here’s an interesting question involving the concept of total unimodularity, which is apparently one of those foundational concepts in an area called polyhedral combinatorics.

A matrix \(\mat{A}\) is called totally unimodular if every square submatrix of \(\mat{A}\) has determinant 0 or \(\pm 1.\) The question is to show that every nonsingular totally unimodular \(n \times n\) matrix maps the lattice \(\mathbb{Z}^n\) bijectively onto itself.

Update (solution)
It’s easy peasy. Each entry of \(\mat{A}\) is itself a \(1 \times 1\) submatrix, hence lies in \(\{-1,0,1\}\), so \(\mat{A}\) maps \(\mathbb{Z}^n\) into itself. Bijectivity follows from Cramer’s rule: \(\mat{A}^{-1} = \mathrm{adj}(\mat{A})/\mathrm{det}(\mat{A})\), and since \(|\mathrm{det}(\mat{A})| = 1\) and all the minors of \(\mat{A}\) are integral, \(\mat{A}^{-1}\) has integer entries and so maps \(\mathbb{Z}^n\) into itself as well.
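To make the exercise concrete, here’s a brute-force numerical check (my own throwaway code; enumerating all square submatrices is exponential, so this is for tiny examples only):

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular(A):
    """Brute-force check that every square submatrix of A has determinant
    in {-1, 0, 1}; exponential in the dimensions, so tiny matrices only."""
    m, n = A.shape
    for size in range(1, min(m, n) + 1):
        for rows in combinations(range(m), size):
            for cols in combinations(range(n), size):
                if round(np.linalg.det(A[np.ix_(rows, cols)])) not in (-1, 0, 1):
                    return False
    return True

# An interval (consecutive-ones) matrix, a standard source of TU examples:
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
assert is_totally_unimodular(A)
print(np.linalg.inv(A))   # integral, as the Cramer's rule argument predicts
```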

Are there perturbations that preserve incoherence and give nicely conditioned “submatrices”?

Let \(\mat{P}_{\mathcal{U}}\) denote the projection onto a \(k\)-dimensional subspace \(\mathcal{U}\) of \(\C^{n}.\) We say \(\mathcal{U}\) is \(\mu\)-coherent if \((\mat{P}_{\mathcal{U}})_{ii} \leq \mu \frac{k}{n} \) for all \(i = 1, \ldots, n.\)

Let \(\mat{A} \in \C^{n \times n} \) be an SPSD matrix whose top \(k\)-dimensional invariant subspace is \(\mu\)-coherent. Given \(\delta > 0\) and \(\mat{S} \in \C^{n \times \ell}\), where \(k < \ell \ll n,\) is there a matrix \(\mat{E}_{\delta,\mat{S}} \in \C^{n \times n} \) such that

  1. \(\hat{\mat{A}} := \mat{A} + \mat{E}_{\delta,\mat{S}}\) is SPSD,
  2. The top \(k\)-dimensional invariant subspace of \(\hat{\mat{A}}\) is also \(\mu\)-coherent,
  3. \(\|\hat{\mat{A}} - \mat{A}\|_2 < \delta \), and
  4. \(\mat{S}^t \hat{\mat{A}} \mat{S} \succeq \delta \cdot \mat{I}\) ?

It’d be fine if, instead of (2) holding exactly, the coherence of the top \(k\)-dimensional invariant subspace of \(\hat{\mat{A}}\) were only slightly larger than that of \(\mat{A}\).
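As an aside, coherence is cheap to compute from an orthonormal basis \(\mat{U}\) of \(\mathcal{U}\), since \((\mat{P}_{\mathcal{U}})_{ii} = \|\mat{U}_{i,:}\|^2\) when \(\mat{P}_{\mathcal{U}} = \mat{U}\mat{U}^*\). A quick sketch (names mine):

```python
import numpy as np

def coherence(U):
    """Smallest mu such that span(U) is mu-coherent, where the columns of U
    are orthonormal: (P_U)_{ii} = ||U[i, :]||^2, so mu = (n/k) * max_i (P_U)_{ii}."""
    n, k = U.shape
    leverage = np.sum(np.abs(U) ** 2, axis=1)  # diagonal of P_U = U U^*
    return (n / k) * leverage.max()

# Random subspaces typically have small coherence:
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((100, 5)))
print(coherence(Q))
```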

I discover vegetables taste good in Dijon sauce then proceed to spread the gospel

I had a dozen eggs in the fridge that I needed to use up by Wednesday; now I’m down to four. I made Jaffa cakes last week (not bad, but not memorable, and definitely not worth the effort: my main complaint is that the cake portion is bland and has too tough a texture), and I left some chocolate blueberry cookie dough sitting in the fridge overnight, which I’ll bake up tonight. That accounted for four of the eggs. This post is about the two meals that account for another four, and maybe the remaining four as well 🙂

Steak, eggs, and vegetables in a Dijon sauce

I’m in love with this dish because it’s super simple to prepare, filling, and not too unhealthy: steak, eggs, and steamed veggies in a Dijon sauce. Surprisingly, considering how much I love meat, the star of the show here is the veggies in sauce: steamed baby carrots and French green beans (haricots verts) with an emulsion of Dijon mustard, red wine vinegar, EVOO, and some salt. Basically, it’s a quick and dirty substitute for Sauce Dijon.

If I’d known sauce + steamed vegetables was such a winning combination (don’t ask me why it took me this long in life to get a clue), I’d be a much healthier person 🙂 My mission for next week is to learn how to make Hollandaise sauce and its derivatives, and to find other vegetables to smother in sauce. No doubt Alton Brown has something interesting to say on the matter.

Puzzled by the definition of stationarity

I’m reading Santosh Vempala’s survey “Geometric Random Walks: A Survey,” and already I’m puzzled by one of the very first definitions he gives.

Define a Markov chain as a pair \((K, \mathcal{A})\) of a state space and a sigma-algebra on it, with transition probability measures \(P_u\) for each \(u \in K.\)

A distribution \(Q\) on \((K, \mathcal{A})\) is called stationary if one step from it gives the same distribution, i.e., for any \(A \in \mathcal{A},\)
\[
\int_A P_u(A) \, dQ(u) = Q(A).
\]

This definition makes sense in words, but mathematically it doesn’t seem sound: since \(P_u(A) \leq 1,\) the lhs is at most \(Q(A),\) with equality only when \(P_u(A) = 1\) for \(Q\)-a.e. \(u \in A.\) I must be missing something…

Update:
The definition should involve integration over the whole space:

A distribution \(Q\) on \((K, \mathcal{A})\) is called stationary if one step from it gives the same distribution, i.e., for any \(A \in \mathcal{A},\)
\[
\int_K P_u(A) \, dQ(u) = Q(A).
\]

That this is so can be seen from the discrete case, where stationarity reads \(\pi_j = \sum_i \pi_i p_{ij}\) and the sum runs over the whole state space. Just goes to show that you have to be careful even when reading peer-reviewed articles.
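Here’s a quick numerical sanity check of the corrected definition in the finite case (my own illustration, not from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)      # row-stochastic transition matrix

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
q = np.real(V[:, np.argmin(np.abs(w - 1))])
q /= q.sum()

A = [0, 2]                             # an arbitrary event A
lhs_K = sum(q[u] * P[u, A].sum() for u in range(n))  # integrate over all of K
lhs_A = sum(q[u] * P[u, A].sum() for u in A)         # integrate over A only
print(np.isclose(lhs_K, q[A].sum()))   # True: the corrected definition holds
print(np.isclose(lhs_A, q[A].sum()))   # generally False: the printed one fails
```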

My Android ebook reading workflow (optimized for frequent turnover of ebooks)

Synopsis: Aldiko’s not appropriate if you plan on constantly rotating the collection of ebooks on your mobile device. Instead, a combination of Calibre, Moon+Reader, and DropBox seems to do the trick. Consider using Moon+Reader as your default reader unless you need Aldiko’s capabilities for dealing with large ebook collections, as Moon+Reader has a more polished reading experience than Aldiko.

A major contributor to my decision to get an Android tablet was my desire to be able to read ebooks comfortably, namely with more portability and a form factor closer to the usual book experience. (The other major contributor was my desire to be able to read research papers comfortably, but that’s a discussion for another day.)

Until recently I was using Aldiko as my primary ebook reader, because it seems to be the most popular reader, judging both from the Android Market and from Google searches for “Android ebook reader” and similar terms. Aldiko’s great if you have a large, static, well-tagged collection of EPUBs. The interface makes it a breeze to find books based on tags or authors, and keeps track of your recent reads, but it has some really annoying drawbacks. First, it is awful at properly removing books: it leaves behind the tag information for files that have been deleted. Even reinstalling Aldiko doesn’t seem to fix this. On a related note, Aldiko often has multiple index entries for one book. This may have been because I had two formats of the same file, but one would expect any ereader to recognize this and adjust accordingly.

These are somewhat major usability issues, but the main problem I have with Aldiko is that it’s just not aimed at a user like me, who constantly switches out ebooks. Aldiko is not designed for rapid turnover of books. It’s aimed more at people who store all their ebooks on the device, add books occasionally, and don’t delete any books. That’s a dealbreaker for me, since I have lots of ebooks, don’t want to waste space on my tablet storing files I won’t look at for several years, and want to regularly switch out books I’ve finished reading for ones I plan on reading soon.

Tonight I’ve been experimenting with a new combination of Calibre, Moon+Reader, and DropBox that seems to suit my style better. Moon+Reader has a book database, just as Aldiko does, but it’s also more amenable to reading EPUBs on a file-by-file basis (I forgot to mention that Aldiko can’t just load an EPUB off the disk: you have to import it into Aldiko’s special-purpose database first). I’m quite impressed with the reading experience in Moon+Reader: it doesn’t have the collection organizational tools of Aldiko, but on a per-book basis, the navigation capabilities, the formatting options, and the overall feel of the application are more polished.

My setup is simple: in Calibre, I use the “connect to folder” option to load my DropBox folder as a device, then sync the books I’m currently reading to the device. On my tablet, I then open the files from DropBox using Moon+Reader. As long as I don’t reboot the tablet (or force kill DropBox), I’m able to interact with the file as though it were stored permanently on the tablet (e.g. Moon+Reader will store your location and return to it every time you visit the book). If you want, you can also favorite the ebook files in DropBox so that a local permanent copy of them is mirrored to your tablet, and then import them (from your DropBox scratch folder) into Moon+Reader. This method has the advantage that you don’t lose any metadata when you reboot the tablet, and if you delete the files from DropBox, they’re deleted locally and automagically from Moon+Reader. The only disadvantage to this technique is that the DropBox application doesn’t yet let you favorite folders, so if you have multiple ebooks in a folder, you have to favorite them all manually. This isn’t a big deal.

Symmetrized products of positive matrices are not positive

… duh. This is in relation to the symmetrized matrix AM-GM inequality conjectured by Recht and Re. The most obvious/direct way of attempting to prove this inequality would be to show the semidefinite inequality
\[
\left(\frac{1}{n}\sum\nolimits_{i=1}^n\mat{A}_i\right)^n \succeq \frac{1}{n!} \sum_\sigma \mat{A}_{\sigma(1)}\mat{A}_{\sigma(2)}\times\cdots\times \mat{A}_{\sigma(n)}.
\]

Unfortunately, the right hand side isn’t even a positive matrix in general. The proof is pretty ridiculously simple, but I console myself with the fact that it wasn’t immediately obvious, even to some folks I’ve discussed this problem with, that the symmetrized sum on the rhs can fail to be positive.

Consider the case of just two summands. For symmetric \(\mat{A}\) and \(\mat{B}\) we have \(\vec{x}^T \mat{A}\mat{B} \vec{x} = \vec{x}^T \mat{B}\mat{A} \vec{x},\) so \(\vec{x}^T (\mat{A}\mat{B} + \mat{B}\mat{A}) \vec{x} = 2\vec{x}^T \mat{A}\mat{B} \vec{x},\) and the symmetrized sum is positive exactly when the quadratic form of \(\mat{A}\mat{B}\) is nonnegative. If we demand the stronger property that \(\mat{A}\mat{B}\) itself be positive semidefinite, then its symmetry forces \(\mat{A}\mat{B} = \mat{B}\mat{A},\) so in that case the matrices must commute.
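Here’s a concrete two-by-two instance of the non-positivity, as a numerical check (my own example, not from the Recht–Re paper):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.0]])       # PSD: eigenvalues 1, 0
B = np.array([[1.0, 1.0],
              [1.0, 1.0]])       # PSD: eigenvalues 2, 0

S = A @ B + B @ A                # the symmetrized product
print(np.linalg.eigvalsh(S))     # approx [-0.414, 2.414]: S is not PSD
```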

QOTD (from an old IMO): show that none of these sums are integers

Here’s an old IMO question (or maybe it’s from one of the short or long lists) that I haven’t made any progress on:

Show that for any natural numbers \(a\) and \(n\) the sum
\[
1 + \frac{1}{1+a} + \frac{1}{1 + 2a} + \cdots + \frac{1}{1+n a}
\]
is not an integer.

An obvious and cool corollary (take \(a = 1\)) is that the harmonic numbers are never integers.
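No solution yet, but exact rational arithmetic makes the claim easy to test for small cases (a throwaway check of mine):

```python
from fractions import Fraction

def is_integer_sum(a, n):
    """Exact value of 1 + 1/(1+a) + ... + 1/(1+n*a); the i = 0 term is the 1."""
    s = sum(Fraction(1, 1 + i * a) for i in range(n + 1))
    return s.denominator == 1

# No integer sums among these small cases (a = 1 gives the harmonic numbers):
assert not any(is_integer_sum(a, n) for a in range(1, 30) for n in range(1, 30))
```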

Reading List for Feb and March 2012

My top picks, with links if you want to tag along:

Norms of sums of random vectors sampled without replacement

Let \(\mathbb{P}_{WOR(\ell)}\) denote a probability under the model where a set \(I\) of \(\ell\) indices is sampled without replacement from \(\{1,2,\ldots,n\}\), and \(\mathbb{P}_{WR(\ell)}\) denote a probability under the model where the \(\ell\) indices are chosen with replacement. Then is it the case that, for any collection \(\vec{v}_i\) of \(n\) vectors in a normed vector space,
\[
\mathbb{P}_{WOR(\ell)}\left(\left\| \sum\nolimits_{i \in I} \vec{v}_i \right\| \geq t \right) \leq \mathbb{P}_{WR(\ell)}\left(\left\| \sum\nolimits_{i \in I} \vec{v}_i \right\| \geq t \right)
\]
up to constants?
In general, the \(WOR(\ell)\) model is harder to work with, because the terms in the resulting sum are dependent, whereas the terms in the \(WR(\ell)\) sum are independent. So if this inequality held, it would make life easier.
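The conjectured domination is at least cheap to eyeball numerically; here’s a throwaway Monte Carlo sketch (all the parameter choices are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, ell, d, t, trials = 50, 10, 5, 10.0, 10000
V = rng.standard_normal((n, d))          # a fixed collection v_1, ..., v_n

def tail_prob(replace):
    """Estimate P(||sum of ell sampled rows of V|| >= t) under WR or WOR."""
    norms = np.array([
        np.linalg.norm(V[rng.choice(n, size=ell, replace=replace)].sum(axis=0))
        for _ in range(trials)])
    return (norms >= t).mean()

p_wor, p_wr = tail_prob(False), tail_prob(True)
print(p_wor, p_wr)   # conjecture: p_wor <= C * p_wr for a modest constant C
```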

I don’t know if this inequality is true in general, but there are classical exponential tail bounds for the rhs, and I believe that those bounds hold up to constants for the lhs. Here’s an outline of the argument. The key fact is that
\[
\E_{WOR(\ell)}f\left(\sum\nolimits_{i \in I} \vec{v}_i\right) \leq \E_{WR(\ell)}f\left(\sum\nolimits_{i \in I} \vec{v}_i\right)
\]
for any convex function \(f.\) This was originally observed by Hoeffding, but Gross provides a more readable account. Take \(f(\vec{x}) = \exp(\lambda \|\vec{x}\|)\) with \(\lambda \geq 0\) to see that the moment generating function under the \(WOR(\ell)\) model is dominated by the moment generating function under the \(WR(\ell)\) model.

Next step. We know that the deviation of the norm of the sum from its expectation under the \(WR(\ell)\) model satisfies a Bennett inequality (see Ledoux and Talagrand, chapter 6):
\[
\mathbb{P}_{WR(\ell)}\left( \left|\left\|\sum\nolimits_{i \in I} \vec{v}_i\right\| - \E_{WR(\ell)} \left\|\sum\nolimits_{i \in I} \vec{v}_i \right\| \right| \geq t \right) \leq \exp\left[\frac{t}{2a} - \left(\frac{t}{2a} + \frac{b^2}{4a^2} \right) \log\left(1 + \frac{2 a t}{b^2} \right) \right].
\]
Here \(a\) is a bound on the norms of the individual summands and \(b^2 = \ell/n \cdot \sum_{i=1}^n \|\vec{v}_i\|^2.\)
Via the standard tail bound/moment/mgf connection, this tail bound yields a bound on the mgf of the deviation under the \(WR(\ell)\) model that is strong enough to recover the Bennett inequality through the usual Chernoff argument (up to changes in the constants).
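Spelled out, the Chernoff step I have in mind is just Markov’s inequality applied to the exponentiated deviation: writing \(X\) for the norm of the sum and \(m\) for its expectation, for any \(\lambda > 0,\)
\[
\mathbb{P}(X - m \geq t) \leq e^{-\lambda t} \, \E \, e^{\lambda (X - m)},
\]
so any bound on the mgf of the deviation transfers to a tail bound after optimizing over \(\lambda.\)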

(Alternatively, I realized later, you can just look for a proof of this or a similar inequality that proceeds by bounding the mgf, and use that bound directly rather than reverse engineering a similar but looser bound. The catch is that you have to find a proof that uses the mgf. I was not able to, but I didn’t try too hard, and there are similar results in chapter 1 of Ledoux and Talagrand which are good enough for my purposes.)

Now we do some algebra to bound an mgf involving the norm of the sum formed under the \(WOR(\ell)\) model:
\[
\E_{WOR(\ell)} \exp\left(\lambda \left\|\sum\nolimits_{i \in I} \vec{v}_i \right\|\right) \leq \E_{WR(\ell)} \exp\left(\lambda \left\|\sum\nolimits_{i \in I} \vec{v}_i \right\|\right)
\]
from our first observation, and the right hand side is smaller than
\[
\E_{WR(\ell)} \exp\left(\lambda \bigg|\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\| - \E_{WR(\ell)}\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\|\bigg| \right) \exp\left(\lambda \E_{WR(\ell)}\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\|\right),
\]
so rearranging gives
\[
\E_{WOR(\ell)} \exp\left(\lambda \left[\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\| - \E_{WR(\ell)}\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\|\right] \right) \leq \E_{WR(\ell)} \exp\left(\lambda \bigg|\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\| - \E_{WR(\ell)}\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\|\bigg| \right).
\]

Recall that we have a bound on the rhs from the Bennett inequality. Thus we have a bound on the mgf of the deviation of the norm of the sum under the \(WOR(\ell)\) model above the expectation of the norm of the sum under the \(WR(\ell)\) model, and this bound is strong enough to generate a Bennett inequality. Thus (caveat: up to changes in the constants, since I have not done the calculations) we have that
\[
\mathbb{P}_{WOR(\ell)}\left(\left\| \sum\nolimits_{i \in I} \vec{v}_i \right\| \geq t + \E_{WR(\ell)}\left\|\sum\nolimits_{i \in I} \vec{v}_i \right\| \right) \leq \exp\left[\frac{t}{2a} - \left(\frac{t}{2a} + \frac{b^2}{4a^2} \right) \log\left(1 + \frac{2 a t}{b^2} \right) \right].
\]