# A useful trick for computing gradients w.r.t. matrix arguments, with some examples

I’ve spent hours this week and last week computing, recomputing, and checking expressions for matrix gradients of functions. It turns out that, except in the simplest of cases, the most pain-free method for finding such gradients is to use the Frechet derivative (this is one of the few concrete benefits I derived from the differential geometry course I took back in grad school).

Remember that the Frechet derivative of a function $$f : X \rightarrow \mathbb{R}$$ at a point $$x$$ is defined as the unique linear operator $$d$$ that is tangent to $$f$$ at $$x$$, i.e. that satisfies
$f(x+h) = f(x) + d(h) + o(\|h\|).$
This definition of differentiability makes sense whenever $$X$$ is a normed linear space. When $$X$$ is an inner-product space and $$f$$ has a gradient, the Frechet derivative exists and is given by $$d(h) = \langle \nabla f(x), h \rangle.$$

### Simple application

As an example application, let’s compute the gradient of the function
$f(X) = \langle A, XX^T \rangle := \mathrm{trace}(A^T XX^T) = \sum_{ij} A_{ij} (XX^T)_{ij}$
over the linear space of $$m$$ by $$n$$ real-valued matrices equipped with the Frobenius norm (so $$A$$ is $$m \times m$$). First we expand out $$f(X+H)$$ as
$f(X + H) = \langle A, (X+H)(X+H)^T \rangle = \langle A, XX^T + XH^T + HX^T + HH^T \rangle$
Now we observe that the terms which involve more than one power of $$H$$ are $$O(\|H\|^2) = o(\|H\|)$$ as $$H \rightarrow 0$$, so
$f(X + H) = f(X) + \langle A, XH^T + HX^T \rangle + o(\|H\|).$
It follows that
$d(H) = \langle A, XH^T + HX^T \rangle = \mathrm{trace}(A^TXH^T) + \mathrm{trace}(A^THX^T),$
which is clearly a linear function of $$H$$ as desired. To write this in a way that exposes the gradient, we use the
cyclicity properties of the trace, and exploit its invariance under transposes to see that
\begin{align}
d(H) & = \mathrm{trace}(HX^TA) + \mathrm{trace}(X^TA^T H) \\
& = \mathrm{trace}(X^TAH) + \mathrm{trace}(X^TA^T H) \\
& = \langle A^TX, H \rangle + \langle AX, H \rangle \\
& = \langle (A + A^T)X, H \rangle.
\end{align}
The gradient of $$f$$ at $$X$$ is evidently $$(A + A^T)X$$.
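
A claim like this is cheap to sanity-check numerically. Here’s a minimal finite-difference check in numpy (my own quick sketch, with arbitrarily chosen dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, m))   # A need not be symmetric
X = rng.standard_normal((m, n))

f = lambda X: np.trace(A.T @ X @ X.T)
grad = (A + A.T) @ X              # the claimed gradient

# d(H) = <grad, H> should match a finite-difference quotient
H = rng.standard_normal((m, n))
t = 1e-6
print((f(X + t * H) - f(X)) / t)  # finite-difference approximation of d(H)
print(np.sum(grad * H))           # <grad, H>, the Frobenius inner product
```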

### More complicated application

If you have the patience to work through a lot of algebra, you could probably calculate the above gradient component by component using the standard rules of differential calculus, then back out the simple matrix expression $$(A + A^T)X$$. But what if we partitioned $$X$$ into $$X = [\begin{matrix}X_1^T & X_2^T \end{matrix}]^T$$ and desired the derivative of
$f(X_1, X_2) = \mathrm{trace}\left(A \left[\begin{matrix} X_1 \\ X_2 \end{matrix}\right] \left[\begin{matrix}X_1 \\ X_2 \end{matrix} \right]^T\right)$
with respect to $$X_2$$? Then the necessary bookkeeping becomes even more tedious if you want to compute component-by-component derivatives (I imagine; I haven’t attempted it). On the other hand, the Frechet derivative route is not significantly more complicated.

Some basic manipulations allow us to claim
\begin{align}
f(X_1, X_2 + H) & = \mathrm{trace}\left(A \left[\begin{matrix} X_1 \\ X_2 + H \end{matrix}\right] \left[\begin{matrix}X_1 \\ X_2 + H \end{matrix} \right]^T\right) \\
& = f(X_1, X_2) + \mathrm{trace}\left(A \left[\begin{matrix} 0 & X_1 H^T \\
H X_1^T & H X_2^T + X_2 H^T + H H^T \end{matrix} \right]\right)
\end{align}
Once again we drop the $$o(\|H\|)$$ terms to see that
$d(H) = \mathrm{trace}\left(A \left[\begin{matrix} 0 & X_1 H^T \\ H X_1^T & H X_2^T + X_2 H^T \end{matrix} \right]\right).$
To find a simple expression for the gradient, we partition $$A$$ (conformally with our partitioning of $$X$$ into $$X_1$$ and $$X_2$$) as
$A = \left[\begin{matrix} A_1 & A_2 \\ A_3 & A_4 \end{matrix} \right].$
Given this partitioning,
\begin{align}
d(H) & = \mathrm{trace}\left(\left[\begin{matrix}
A_2 H X_1^T & \\
& A_3 X_1 H^T + A_4 H X_2^T + A_4 X_2 H^T
\end{matrix}\right] \right) \\
& = \langle A_2^TX_1, H \rangle + \langle A_3X_1, H \rangle + \langle A_4^T X_2, H \rangle + \langle A_4X_2, H \rangle \\
& = \langle (A_2^T + A_3)X_1 + (A_4^T + A_4)X_2, H \rangle.
\end{align}
The first equality comes from noting that the trace of a block matrix is the sum of the traces of its diagonal blocks, and the second comes from manipulating the traces using their cyclicity and invariance under transposes.

Thus $$\nabla_{X_2} f(X_1, X_2) = (A_2^T + A_3)X_1 + (A_4^T + A_4)X_2.$$
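
The same finite-difference check works for the partitioned gradient; again a small numpy sketch, with block sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2, n = 4, 3, 2
A = rng.standard_normal((m1 + m2, m1 + m2))
X1, X2 = rng.standard_normal((m1, n)), rng.standard_normal((m2, n))
A2, A3, A4 = A[:m1, m1:], A[m1:, :m1], A[m1:, m1:]  # conformal blocks of A

f = lambda X1, X2: np.trace(A @ np.vstack([X1, X2]) @ np.vstack([X1, X2]).T)
grad_X2 = (A2.T + A3) @ X1 + (A4.T + A4) @ X2       # the claimed gradient

H = rng.standard_normal((m2, n))
t = 1e-6
print((f(X1, X2 + t * H) - f(X1, X2)) / t)  # finite-difference quotient
print(np.sum(grad_X2 * H))                  # <grad, H>
```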

### A masterclass application

Maybe you didn’t find the last example convincing. Here’s a function whose matrix gradient I actually needed to compute (a task I defy you to accomplish using standard component-by-component calculus):
$f(V) = \langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) \rangle = \log(1^T \mathrm{e}^{VV^T})K^T1.$
Here, $$K$$ is an $$n \times n$$ matrix (nonsymmetric in general), $$V$$ is an $$n \times d$$ matrix, and $$1$$ is a column vector of ones of length $$n$$. The exponential $$\mathrm{e}^{VV^T}$$ is computed entrywise, as is the $$\log$$.

To motivate why you might want to take the gradient of this function, consider the situation that $$K_{ij}$$ measures how similar items $$i$$ and $$j$$ are in a nonsymmetric manner, and the rows of $$V$$ are coordinates for representations of the items in Euclidean space. Then $$(1^T K)_j$$ measures how similar item $$j$$ is to all the items, and
$(1^T \mathrm{e}^{VV^T})_j = \sum_{\ell=1}^n \mathrm{e}^{v_\ell^T v_j}$
is a measure of how similar the embedding $$v_j$$ is to the embeddings of all the items. Thus, if we constrain all the embeddings to have norm 1, maximizing $$f(V)$$ with respect to $$V$$ ensures that the embeddings capture the item similarities in some sense. (Why do you care about this particular sense? That’s another story altogether.)

Ignoring the constraints (you could use a projected gradient method for the optimization problem), we’re now interested in finding the gradient of $$f$$. In the following, I use the notation $$A \odot B$$ to indicate the pointwise product of two matrices.
\begin{align}
f(V + H) & = \langle 1^T K, \log(1^T \mathrm{e}^{(V+H)(V+H)^T}) \rangle \\
& = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot \mathrm{e}^{VH^T} \odot \mathrm{e}^{HV^T} \odot \mathrm{e}^{HH^T} ]) \rangle
\end{align}
One can use the series expansion of the exponential to see that
\begin{align}
\mathrm{e}^{VH^T} & = 11^T + VH^T + o(\|H\|), \\
\mathrm{e}^{HV^T} & = 11^T + HV^T + o(\|H\|), \text{ and}\\
\mathrm{e}^{HH^T} & = 11^T + o(\|H\|).
\end{align}
It follows that
\begin{multline}
f(V + H) = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot (11^T + VH^T + o(\|H\|)) \\
\odot (11^T + HV^T + o(\|H\|)) \odot (11^T + o(\|H\|)) ]) \rangle.
\end{multline}
Expanding the pointwise products and absorbing every higher-order term into $$o(\|H\|)$$, this becomes
\begin{align}
f(V + H) & = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot(11^T + VH^T + HV^T + o(\|H\|) )]) \rangle \\
& = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} + \mathrm{e}^{VV^T} \odot (VH^T + HV^T) + o(\|H\|)]) \rangle
\end{align}
Now recall the linear approximation of $$\log$$:
$\log(x) = \log(x_0) + \frac{1}{x_0} (x-x_0) + O(|x- x_0|^2).$
Apply this approximation pointwise to conclude that
\begin{multline}
f(V + H) = \langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) + \\
\{1^T \mathrm{e}^{VV^T}\}^{-1}\odot (1^T [\mathrm{e}^{VV^T} \odot (VH^T + HV^T) + o(\|H\|)]) \rangle,
\end{multline}
where $$\{x\}^{-1}$$ denotes the pointwise inverse of a vector.
Take $$D$$ to be the diagonal matrix with diagonal entries given by $$1^T \mathrm{e}^{VV^T}$$. Writing the row-vector inner product in column form, and using the symmetry of $$\mathrm{e}^{VV^T} \odot (VH^T + HV^T)$$, we have shown that
$f(V + H) = f(V) + \langle K^T1, D^{-1} [\mathrm{e}^{VV^T} \odot (VH^T + HV^T)]1 \rangle + o(\|H\|),$
so
\begin{align}
d(H) & = \langle K^T1, D^{-1} [\mathrm{e}^{VV^T} \odot (VH^T + HV^T)]1 \rangle \\
& = \langle D^{-1}K^T 11^T, \mathrm{e}^{VV^T} \odot (VH^T + HV^T) \rangle \\
& = \langle \mathrm{e}^{VV^T} \odot D^{-1}K^T 11^T, (VH^T + HV^T) \rangle.
\end{align}
The second equality follows from the standard properties of inner products, and the third from the observation that
$\langle A, B\odot C \rangle = \sum_{ij} A_{ij}B_{ij}C_{ij} = \langle B \odot A, C \rangle.$
Finally, manipulations in the vein of the two preceding examples allow us to claim that
$\nabla_V f(V) = [\mathrm{e}^{VV^T} \odot (11^T K D^{-1} + D^{-1} K^T 11^T)] V.$
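
This one is easy enough to get wrong that a numerical check is worthwhile. Here’s a sketch of $$f$$ and its claimed gradient in numpy (toy dimensions of my choosing; the exponential and log are entrywise, as above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3
K = rng.standard_normal((n, n))   # a generic nonsymmetric similarity matrix
V = rng.standard_normal((n, d))
one = np.ones(n)

def f(V):
    # f(V) = <1^T K, log(1^T e^{V V^T})>, exp and log taken entrywise
    return (one @ K) @ np.log(np.exp(V @ V.T).sum(axis=0))

def grad_f(V):
    E = np.exp(V @ V.T)
    w = (K.T @ one) / E.sum(axis=0)            # D^{-1} K^T 1 as a vector
    M = np.outer(one, w) + np.outer(w, one)    # 11^T K D^{-1} + D^{-1} K^T 11^T
    return (E * M) @ V

H = rng.standard_normal((n, d))
t = 1e-6
print((f(V + t * H) - f(V)) / t)   # finite-difference quotient
print(np.sum(grad_f(V) * H))       # <grad f(V), H>
```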

# A workaround for installing SQBLib

I spent about an hour today getting Carlos Becker’s SQBLib package for gradient boosted tree regression in MATLAB working. The issue is that it depends on liblbfgs, and following the instructions Carlos gave for compiling liblbfgs resulted in errors when MATLAB tried to link liblbfgs into his code.

Specifically, I got a mysterious error like
/usr/bin/ld: /home/agittens/Downloads/liblbfgs-1.10/lib/lbfgs.o: relocation R_X86_64_32S against `.text' can not be used when making a shared object; recompile with -fPIC

See the original install instructions. Here’s the fix that worked for me on a 64-bit Linux box:

1. Run `./configure --disable-shared --enable-static --enable-sse2` from the liblbfgs root directory.
2. Run `make --dry-run` to find the gcc command that generates lbfgs.o, and modify it to add the `-fPIC` compiler option. In my case, that resulted in the line
`gcc -DHAVE_CONFIG_H -I.. -I. -I../include -msse2 -DUSE_SSE -O3 -ffast-math -fPIC -Wall -c lbfgs.c -o lbfgs.o`
3. Move into the lib/ directory and execute this modified command.
4. Edit the GLN... case of the switch statement in the compileMex.m file to replace the current `%s/lib/.libs/lbfgs.o` portion of the mex call with `%s/lib/lbfgs.o`.
5. Run `compileMex(<liblbfgs root directory>)`.

# Quick note on the Chen, Chi, Goldsmith covariance sketching paper

NB: I will update this post as I read the paper, in case it turns out that the first issue I raised is not legitimately a concern.

Covariance estimation (and its natural extension, precision estimation) has always been an interesting topic for me, because it can represent a concise, concrete, and very broadly applicable instance of applied nonasymptotic random matrix theory. Likewise, I’m also quite interested in matrix sketching algorithms. Thus I was very excited to see the latest preprint by Chen, Chi, and Goldsmith on arXiv, which presents a convex optimization algorithm for recovering covariance matrices from randomized sketches obtained in a streaming manner. Their convex optimization problem is essentially the PhaseLift formulation used for recovering phase from magnitude, but their proof shows that it works for covariance matrix recovery.

I have only just started reading this paper, and I’m still excited, but I have two concerns already: first, it is not clear to me that the algorithm they propose is actually the algorithm they provide a guarantee for! At the least, the text must be corrected to make it clear that this is indeed the case. To be more precise, their algorithm is to compute a few random sketching vectors ahead of time, then as the rows of the matrix come in, compute the magnitude of the projection of each row onto *one* randomly chosen sketching vector. The measurement model described mathematically seems to compute the magnitude of the projection of each row onto *each* of the sketching vectors. Big difference there.

Second, their algorithm comes with a Frobenius norm guarantee, which is par for the course, but they make claims about things like subspace detection, for which, as far as I know, Frobenius norm guarantees are too weak to really be of interest. This may be a matter of preference, though, and practitioners probably don’t care about sharp guarantees as long as the method works in practice and has at least a semblance of theoretical support.

# Sampling uniformly from the set of partitions into a fixed number of nonempty sets

It’s easy to sample uniformly from the set of partitions of a set: you pick a number of bins from an appropriate distribution (in Stam’s urn model, the number of bins $$m$$ is drawn with probability proportional to $$m^n/m!$$), then randomly i.i.d. toss each element of the set into one of those bins. At the end of this procedure, the nonempty bins will constitute your uniformly sampled random partition. [literature ref: "Generation of a random partition of a finite set by an urn model" by Stam, 1983; pretty pictures ref: see this blog post].
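
Here’s a minimal Python sketch of this urn-model sampler (the weights $$P(m) \propto m^n/m!$$ are Stam’s; the crude tail truncation is my own choice):

```python
import math
import random

def stam_uniform_partition(n):
    """Uniform random partition of {1,...,n} via Stam's urn model."""
    # the number of urns m is drawn with P(m) proportional to m^n / m!
    logw, m = [], 1
    while True:
        logw.append(n * math.log(m) - math.lgamma(m + 1))
        if m > n and logw[-1] < max(logw) - 40:  # past the mode, tail negligible
            break
        m += 1
    top = max(logw)
    weights = [math.exp(x - top) for x in logw]
    m = random.choices(range(1, len(weights) + 1), weights=weights)[0]
    # toss each element into one of the m urns i.i.d. uniformly
    urns = {}
    for x in range(1, n + 1):
        urns.setdefault(random.randrange(m), []).append(x)
    return list(urns.values())  # the nonempty urns form the partition
```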

Is there a similarly efficient and simple algorithm for sampling uniformly from the set of partitions into a fixed number of nonempty sets? The only algorithm I’m aware of takes advantage of the fact that the number of such partitions is given by $$\left\{n \atop p \right\}$$, a Stirling number of the second kind, if the set has $$n$$ elements and we desire $$p$$ nonempty subsets in the partition. In particular, we have the identity
$\left\{ {n \atop p} \right\} = \left\{ {n-1 \atop p-1} \right\} + p \left\{ {n-1 \atop p} \right\}$
that comes from observing that there are only two ways to construct a partition of $$n$$ elements into $$p$$ nonempty sets: either partition the first $$n-1$$ elements into $$p-1$$ nonempty sets and take the remaining element as a singleton final set, or partition the first $$n-1$$ elements into $$p$$ nonempty sets and place the remaining element into any one of these $$p$$ sets.

This observation leads to a straightforward recursive sampling procedure: with probability $$\left\{ {n-1 \atop p-1} \right\}/\left\{ {n \atop p} \right\}$$, use the first procedure with a randomly sampled partition of the first $$n-1$$ elements into $$p-1$$ nonempty sets, otherwise use the second procedure with a randomly sampled partition of the first $$n-1$$ elements into $$p$$ nonempty sets.
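
Unrolled into a loop, the procedure looks as follows; this is a minimal Python sketch, leaning on Python’s native arbitrary-precision integers for the Stirling numbers:

```python
import random

def stirling_table(n, p):
    """S[m][j] = Stirling number of the second kind {m, j}."""
    S = [[0] * (p + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for m in range(1, n + 1):
        for j in range(1, min(m, p) + 1):
            S[m][j] = S[m - 1][j - 1] + j * S[m - 1][j]
    return S

def sample_partition(n, p):
    """Uniform partition of {1,...,n} into p nonempty sets, via the recurrence."""
    assert 1 <= p <= n
    S = stirling_table(n, p)
    new_block = [False] * (n + 1)
    m, j = n, p
    while m > 0:
        # element m is a singleton with probability S(m-1, j-1) / S(m, j)
        if random.randrange(S[m][j]) < S[m - 1][j - 1]:
            new_block[m] = True
            j -= 1
        m -= 1
    blocks = []
    for m in range(1, n + 1):   # replay the decisions from the bottom up
        if new_block[m]:
            blocks.append({m})
        else:
            random.choice(blocks).add(m)  # uniform over the blocks present
    return blocks
```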

Unfortunately, this is not an efficient procedure for several reasons. In particular, it requires computing Stirling numbers and taking their ratios. When $$n,p$$ are large, this requires both computational time and, more of a practical impediment, arbitrary-precision arithmetic. A straightforward implementation also relies on recursion, which is infeasible when $$n$$ is large. Clearly one can implement this algorithm without recursion (as in the sketch above); one can also use an asymptotic expansion of the Stirling numbers to approximate the ratio $$\left\{ {n-1 \atop p-1} \right\}/\left\{ {n \atop p} \right\}$$ when $$n,p$$ are large … but this comes at the cost of some unspecified inaccuracy and just doesn’t feel right.

So the question remains: is there an efficient and simple way to sample uniformly from the set of partitions into a fixed number of nonempty sets?

# I miss Mathematica

Why? Because Mathics is not up to helping me determine if indeed
$f(\{A_1, \ldots, A_p\}) = \frac{(n-p)!^2}{n! p!} \left(\frac{1}{p} \right)^{n-p} \frac{|A_1|^2 \cdots |A_p|^2}{|A_1|!\cdots |A_p|!}$
is a pmf over the set of partitions of the set $$\{1, \ldots, n\}$$ into $$p \leq n$$ nonempty sets. In particular, the sets $$A_1, \ldots, A_p$$ in the formula above satisfy $$|A_1| + \ldots +|A_p| = n$$ and $$|A_i| \geq 1$$ for all $$i.$$

I feel like I could empirically test this easily in Mathematica, but OMG trying to do it in Matlab is a real pain, so I gave up. Combinatorics and set manipulation in Matlab are, in general, an exercise in trying to make a smoothie with a grater: you can do it, eventually, but it’s going to take forever and make a mess.
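
For the record, the brute-force test I had in mind is short in, say, Python; here’s a rough sketch for small $$n$$, using exact rational arithmetic so that roundoff can’t muddy the verdict:

```python
from math import factorial
from fractions import Fraction

def partitions_into(elems, p):
    """Yield every partition of the list elems into exactly p nonempty blocks."""
    if p < 1 or p > len(elems):
        return
    if len(elems) == p:
        yield [[e] for e in elems]
        return
    if p == 1:
        yield [list(elems)]
        return
    first, rest = elems[0], elems[1:]
    for part in partitions_into(rest, p - 1):   # first element is a singleton
        yield [[first]] + part
    for part in partitions_into(rest, p):       # first element joins a block
        for i in range(p):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def f(sizes, n, p):
    val = Fraction(factorial(n - p) ** 2,
                   factorial(n) * factorial(p) * p ** (n - p))
    for s in sizes:
        val *= Fraction(s * s, factorial(s))
    return val

n, p = 7, 3
total = sum(f([len(b) for b in part], n, p)
            for part in partitions_into(list(range(n)), p))
print(total)   # f is a pmf for this (n, p) iff this prints 1
```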

# The convergence rate of the OLS estimator for linear regression

A lot of machine learning and modern statistics concerns the rate at which statistical estimators converge to the quantities they estimate. For instance, in classification, you use a collection of observations of covariates $$\mathbf{x}_i \in \mathbb{R}^p$$ and labels $$y_i \in \{\pm 1\}$$ to fit an estimated classification function $$h : \mathbb{R}^p \rightarrow \{\pm 1\},$$ and the relevant question is how close this estimated function is to the true classification function as the number of observations increases.

In this post, I’ll consider the consistency rate of the simple ordinary least-squares estimator for linear regression coefficients. The estimators of more practical interest in a modern setting (high-dimensional with fewer observations than variables) are more complicated, but OLS estimators are a natural starting point, and comparatively much easier to work with and understand.

To establish the setting, assume that our observations take the form $$y_i = \mathbf{x}_i^T \mathbf{\beta} + \varepsilon_i,$$ where the $$\mathbf{x}_i$$ are the covariates and the $$\varepsilon_i$$ are i.i.d. $$\mathcal{N}(0,1/n)$$ noise terms independent of the covariates. We can write this as the matrix–vector equation
$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon}.$
The rows of the $$n \times p$$ matrix $$\mathbf{X}$$ constitute the observed covariates.

Given data $$\mathbf{X}$$ and $$\mathbf{y}$$ that we expect conform to this model, the OLS estimator for the regression coefficients is $$\hat{\mathbf{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$ assuming $$\mathbf{X}$$ is full-rank. We are interested in how fast $$\hat{\mathbf{\beta}}$$ converges to $$\mathbf{\beta}$$ as the number of observations $$n$$ increases. It seems most natural to consider the $$\ell_2$$ error, since we are dealing with least-squares estimation.

If $$\mathbf{X}$$ is full-rank with its $$p$$th largest (i.e., smallest) singular value bounded away from $$0$$, then $$\mathbf{X}^\dagger \mathbf{X} = \mathbf{I}$$ and we can estimate
$\|\hat{\mathbf{\beta}} - \mathbf{\beta} \|_2 = \|(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{\varepsilon}\|_2 \leq \|\mathbf{X}^\dagger\|_2 \|\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{\varepsilon}\|_2.$
Since $$\mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$$ is a projection onto a $$p$$-dimensional subspace of $$\mathbb{R}^n$$ and $$\mathbf{\varepsilon} \sim \mathcal{N}(0, \mathbf{I}_n/n),$$ the second factor concentrates around $$\sqrt{p/n},$$ so in this case
$\|\hat{\mathbf{\beta}} - \mathbf{\beta} \|_2 \lesssim \sigma_p(\mathbf{X})^{-1} \sqrt{p/n}.$

It’s natural to ask whether this bound is optimal. To see that it is, consider the case of an orthogonal design, i.e. $$\mathbf{X}^T \mathbf{X} = \mathbf{I}_p.$$ Then
$\|\hat{\mathbf{\beta}} - \mathbf{\beta} \|_2 = \|\mathbf{X}^T \mathbf{\varepsilon}\|_2.$
Since $$\mathbf{X}^T \mathbf{\varepsilon}$$ is a $$\mathcal{N}(0, \mathbf{I}_p/n)$$ r.v., we see that the right-hand side quantity concentrates around $$\sqrt{p/n}$$.
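
A quick simulation bears the rate out. Here’s a numpy sketch of the orthogonal-design case (the dimensions and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, trials = 2000, 50, 200
beta = rng.standard_normal(p)

errs = []
for _ in range(trials):
    X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthogonal design: X^T X = I_p
    eps = rng.standard_normal(n) / np.sqrt(n)         # N(0, I_n / n) noise
    beta_hat = np.linalg.lstsq(X, X @ beta + eps, rcond=None)[0]
    errs.append(np.linalg.norm(beta_hat - beta))

print(np.mean(errs), np.sqrt(p / n))  # the error concentrates around sqrt(p/n)
```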

# We play the long game here

I’ve been off Being Human (the original UK version) for a while now. I stopped watching after the first episode of season 3, in part because the American version came out and I got into that before it started circling the drain, and in part because Mitchell gets on my damned last nerve.

Now I’m back to watching it after discovering that all the original cast members leave by the end of the show. I’m so looking forward to seeing emo-goth pretty boy Mitchell get dusted (or maybe they’ll dial up his already epic levels of self-pity, narcissism, and angst to the point that he combusts) that this was just what I needed to get back in the game.

Too bad that means Lenora Crichlow and Russell Tovey will be leaving eventually too. Their acting is the strongest reason I watch this show: when they cry, I cry. When Lenora gives her monologues, I cry a little. When Russell opens up about his feelings, I bawl. It’s pathetic, I know, but the point is that they’re very good at getting across the underlying message of the show: that even monsters can be human, because humanity is about our relationships.

Anyhooo, now I’m back to watching Robson Green play a homeless(?) drifter werewolf dad forced into underground supernatural cage fights.

# Roll On

I was reading, of all things, a romance novel when I came across this famous excerpt from Childe Harold’s Pilgrimage:

Roll on, thou deep and dark blue Ocean – roll!
Ten thousand fleets sweep over thee in vain;
Man marks the earth with ruin – his control
Stops with the shore; – upon the watery plain
The wrecks are all thy deed, nor doth remain
A shadow of man’s ravage, save his own,
When for a moment, like a drop of rain,
He sinks into thy depths with bubbling groan,
Without a grave, unknelled, uncoffined, and unknown.

There are few poets who affect me like Lord Byron does… or, more likely, there are lots of poets whose work would affect me similarly if only I knew of them.

# Monument

Jeff Salyards is doing an Ask Me Anything (AMA) on Reddit tonight at 5 PCT. I started reading his debut novel, Scourge of the Betrayer, a couple of weeks ago, but couldn’t get into it for one reason or another. Maybe I was just coming off of an urban fantasy binge, and the relatively heavier setting of SOTB was just too much for me to get into at the time (the couple dozen pages I read evoked Glen Cook’s Black Company novels, especially since the main character is a chronicler). Or maybe I was put off by the heavy-handed way in which a SECRET MISSION with EPIC IMPLICATIONS was being hinted at in the little I read (I mean, almost everyone reading this genre knows what to expect, ok?). Either way, I plan on finishing it eventually, if only to see why Jeff chose a flail as the weapon of import. But I do expect that it gets better, eventually.

I mention Jeff’s AMA mostly because it got me thinking about his book, which in turn got me thinking about the Black Company novels (which I love love love), which in turn got me thinking about Ian Graham’s excellent debut novel, Monument. I swear this book flew under the radar, and Ian Graham seems not to have written anything since. But whenever I come across a fantasy novel revolving around an anti-hero, at some point or another I end up comparing it with Monument. So far none of the anti-heroes I’ve come across have been as despicable and yet somehow comfortably human as Ballas. I’m not even going to try to sell the book to you: just go and read it.

# A list of $$\|\cdot\|_{\infty \rightarrow 1}^*$$ problems

Recall our old friend the $$\|\cdot\|_{\infty \rightarrow 1}^*$$ norm (from the first version of this website, which I still haven’t gotten around to merging into the current iteration). Given a matrix $$\mathbf{A},$$ it is a measure of the energy in the smallest expansion of $$\mathbf{A}$$ in terms of rank-one sign matrices:
$\|\mathbf{A}\|_{\infty \rightarrow 1}^* = \inf\left\{\sum_i d_i \,:\, \mathbf{A} = \sum_i d_i \mathbf{u}_i \mathbf{v}_i^T,\ d_i \geq 0 \right\},$
where $$\{\mathbf{u}_i, \mathbf{v}_i\}$$ is a collection of sign vectors. Note that the number of sign vectors necessary to achieve this minimum is not known a priori.

This is an interesting norm both intrinsically (as a potential source of factor models in statistics and machine learning) and because of its relations to other norms. See these slides for details.

An estimate of $$\|\cdot\|_{\infty \rightarrow 1}^*$$ to within a multiplicative factor of about $$1.783$$ can be obtained using a semidefinite program, but I suspect it’s NP-hard to compute $$\|\cdot\|_{\infty \rightarrow 1}^*$$ exactly: it is the trace dual of the $$\|\cdot\|_{\infty \rightarrow 1}$$ norm, which is known to be NP-hard to compute (and if the latter is also APX-hard, which I’m not sure of either way, then one definitely couldn’t even approximate $$\|\cdot\|_{\infty \rightarrow 1}^*$$ to arbitrary accuracy in polynomial time). More interesting, however, is the question of approximately obtaining the factorization that achieves $$\|\cdot\|_{\infty \rightarrow 1}^*$$: there doesn’t seem to be a good way to extract such an approximate factorization from the SDP that approximates the norm, but maybe there’s a clever way to do so that I haven’t seen.

Other interesting questions abound: can we characterize $$\|\cdot\|_{\infty \rightarrow 1}^*$$, or the number of rank-one matrices needed in the corresponding decomposition, for a randomly chosen matrix (say one with i.i.d. Gaussian entries)? What if we just want probability bounds on the number of rank-one matrices in the decomposition of a randomly chosen matrix?

I think the latter question is probably the easiest to answer of all posed here (except maybe coming up with an SDP-based approximate factorization).