# Julia, once more

Julia + PyCall + ccall + Gadfly or PyPlot (+ Julia Studio?) looks delicious. The only feature that absolutely needs to be added is shared-memory parallelism
(why wasn’t this an initial core feature of the language?), but I’m extremely excited by the current awesomeness of the Julia ecosystem.

I recommend you get into it now, if you’re a scientific computation person.

Update: Julia has experimental support for shared-memory arrays on Unix, which is really all that I need at this point. Great!

# a bit on word embeddings

Lately I’ve been working almost exclusively on continuous word representations, with the goal of finding vectorial representations of words which expose semantic and/or syntactic relationships between words.

As is typical for any interesting machine learning problem, there is a glut of clever models based on various assumptions (sparsity, hierarchical sparsity, low-rankness, etc.) that yield respectable embeddings. Arguably, however, the best known of these representations are the word2vec models due to Mikolov et al., which are part of a larger class of neural-network-based methods for learning word representations. This is largely due to the fact that the authors released an efficient implementation of their algorithm and demonstrated that the embeddings it calculates exhibit nice algebraic properties (e.g., ‘king’ – ‘man’ + ‘woman’ is approximately ‘queen’).

I’m partial to the word2vec (skip-gram) model precisely because it is based on a simple and sensible (although contested) linguistic axiom known as the distributional hypothesis— that words that appear in similar contexts tend to have similar meanings— and in its purest form it has a simple probabilistic interpretation that makes its application amenable to nontrivial theoretical investigation (for instance, it’s now clear to me whence the mysterious algebraic properties of the fitted word embeddings come). This model parameterizes the probability distribution of the words in the neighborhood of each word in the vocabulary. Semantically similar words end up having similar vectorial embeddings, precisely because the distributions of words in their neighborhoods are similar.
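To make the parameterization concrete, here is a toy numpy sketch of the exact (unapproximated) skip-gram conditional distribution. This is not word2vec's actual code; the vocabulary size, dimension, and the names `u`, `v` are hypothetical choices of mine.

```python
import numpy as np

# Toy sketch of the exact skip-gram parameterization. Each word w has an
# "input" vector v[w]; each word c has an "output" vector u[c]; the model
# posits p(c | w) = exp(u[c] . v[w]) / sum_{c'} exp(u[c'] . v[w]).
rng = np.random.default_rng(0)
vocab, dim = 1000, 50
v = 0.1 * rng.standard_normal((vocab, dim))
u = 0.1 * rng.standard_normal((vocab, dim))

def context_distribution(w):
    scores = u @ v[w]          # one score per word in the vocabulary
    scores -= scores.max()     # numerical stabilization of the softmax
    p = np.exp(scores)
    return p / p.sum()         # the partition function costs O(vocab) work

p = context_distribution(42)
```

The `p.sum()` in the last line of `context_distribution` is exactly the partition function that makes the pure model expensive to fit.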

Unfortunately, the pure skip-gram model is not easily fit, because the calculation of the partition function involves a sum over all the words in the vocabulary, which can easily be on the order of millions. The authors of the word2vec paper mitigate this situation by introducing two approximations to this exact model: one is a variational approximation that finds the best approximation with a tree-structured class of probability distributions that are more computationally tractable, and one is an approximation inspired by the principle of noise contrastive estimation— that a good model only has to be able to distinguish between true observations and observations drawn from a sufficiently plausible noise distribution. The latter model is very similar in spirit to the model of Mnih et al.
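The noise-contrastive flavor of approximation can be sketched in a few lines. This is a schematic of the negative-sampling style objective for a single (word, context) pair, not the word2vec implementation; the uniform noise draws and the constants are placeholders of mine (word2vec uses a smoothed unigram noise distribution).

```python
import numpy as np

# Sketch of a negative-sampling surrogate for one (word, context) pair:
# instead of normalizing over the whole vocabulary, ask the model to score
# the observed pair above k pairs drawn from a noise distribution.
rng = np.random.default_rng(0)
vocab, dim, k = 1000, 50, 5
v = 0.1 * rng.standard_normal((vocab, dim))   # "input" vectors
u = 0.1 * rng.standard_normal((vocab, dim))   # "output" vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w, c):
    noise = rng.integers(0, vocab, size=k)     # placeholder noise draws
    pos = np.log(sigmoid(u[c] @ v[w]))         # observed pair scored high
    neg = np.sum(np.log(sigmoid(-(u[noise] @ v[w]))))  # noise pairs scored low
    return -(pos + neg)                        # O(k) work instead of O(vocab)

loss = neg_sampling_loss(3, 17)
```

The point of the sketch is the cost: each training pair touches k + 1 output vectors rather than all of them.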

I’m curious as to what extent these approximate models lose semantic information that would be captured in the pure model, if it could be fit. It is entirely possible that the ‘approximations’ are themselves more useful models than the intractable pure model. One of the projects that I’m working on with my intern (Jiyan Yang) is attempting to experimentally quantify this, using a combination of smaller datasets and various methods of formulating the optimization problem (e.g., stochastic methods and score matching-type ideas) to make it feasible to fit the exact model.

We have more ambition than to understand the word2vec model. It seems that a large class of word embeddings can be considered to be attempting to fit probability distributions over the neighbors of a word, just as word2vec does, so I’m trying to figure out a universal framework for probabilistic word embeddings. It turns out that this has strong connections to several other applications in machine learning, so the understanding garnered here can be used in, e.g., collaborative filtering.

# Installing Hadoop on Ubuntu (works for Ubuntu 12.04 and Hadoop 2.4.1)

I’m trying to use LDA on a large amount of data. A quick recap:

1. Tried vowpal wabbit … it’s fast, I’ll give it that, but it’s also useless: the output is dubious (what I think are the topics look like they haven’t changed very much from the prior) *and* I have no idea how it maps onto topics and documents (the documentation is AWFUL, and the dimensions of the output files are WONKY).
2. Tried two implementations of SCVB0, a stochastic collapsed variational Bayes LDA algorithm: one doesn’t work at all (as in, it stalls on any amount of data — so crap), and the other works fine and fast on medium amounts of data but apparently requires more than 5 days on my size dataset (I quit after 5 days; it’s worth noting that this code was removed from github literally the next day after I downloaded it — so dubious).
3. For whatever reason, Mallet can’t even import (using import-file) a relatively small amount of data: the import process takes forever and then quits with an error — so crap.
4. Now I’m trying to use Mr.LDA, a map-reduce implementation. Given my experience so far, I am not holding out much hope, but when one must… The issue with Mr.LDA is that when I’m importing my data on the eBay Hadoop cluster, the mappers run out of heap memory during the tokenization process.

After spending most of today trying to figure out how to increase the heap memory, I decided to just install Hadoop locally, where I can control the heap size, and tokenize my data, then transfer the tokenized data to eBay’s HDFS and run the actual LDA job on the cluster. Ugh.

First I have to get Hadoop running locally. I followed the DigitalOcean tutorial, which leaves out crucial details about permissions. To address this oversight is the entire point of this post (that and to exorcise my crappy-implementations-of-LDA-induced demons).

I installed everything to run from my user account, as opposed to root. Specifically, I untarred the hadoop directory using my user account, then moved it to /etc with sudo, so the entire hadoop directory structure is under my username and login group. Then, after creating the hadoop_store directory as root, I chowned that entire subtree over to my user account with ‘chown -R uname: /usr/local/hadoop_store’. Then I followed the directions as provided and got a working hadoop system.

# Sharing numpy arrays between processes using multiprocessing and ctypes

Because of its global interpreter lock, Python threads can’t execute bytecode in parallel, so Python effectively doesn’t support multithreaded computation. To me, this is a ridiculous limitation that should be gotten rid of post-haste: a programming language is not modern unless it supports multithreading.

Python supports multiprocessing, but the straightforward manner of using multiprocessing requires you to pass data between processes using pickling/unpickling rather than sharing memory. Needless to say, this slows down execution when large amounts of data need to be shared by processes.

In my case, I’ve been using multiprocessing to speed up the training of a recursive autoencoder: instead of training one autoencoder, I train an ensemble of around 30 simultaneously, and periodically average their parameters together and use these as the initialization points for further training of all the models. Pickling the parameters here doesn’t seem to be that costly, because the optimization itself takes so much time. However, it’s nice to note that there’s a relatively simple workaround that allows you to avoid the pickling/unpickling process: create the shared array as a ctypes object, and pass it into the initializer of each process, making it a global variable in each worker process. Here’s some example code:

```python
from multiprocessing import Pool, sharedctypes

import numpy as np
import warnings

def populate(args):
    """Populate the portion of the shared array arr indicated by the
    offset and blocksize arguments."""
    offset, blocksize = args
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', RuntimeWarning)
        v = np.ctypeslib.as_array(arr)  # view the shared ctypes array as an ndarray

    for idx in range(offset, offset + blocksize):
        v[idx, :] = offset

    return offset

def _init(arr_to_populate):
    """Each pool process calls this initializer. Load the array to be
    populated into that process's global namespace."""
    global arr
    arr = arr_to_populate

size = 2000
blocksize = 100
tmp = np.ctypeslib.as_ctypes(np.zeros((size, 10)))
shared_arr = sharedctypes.Array(tmp._type_, tmp, lock=False)

pool = Pool(processes=4, initializer=_init, initargs=(shared_arr,))
result = pool.map(populate, zip(range(0, size, blocksize), [blocksize] * (size // blocksize)))

with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    final_arr = np.ctypeslib.as_array(shared_arr)

print(final_arr)
```


The warnings filter is there because a PEP 3118 compliance bug in ctypes gives rise to some superfluous RuntimeWarnings.

# Eigenvector two-condition number for a product of PSD matrices

I’m pushing to submit a preprint on the Nystrom method that has been knocking around for the longest time.

I find myself running into problems centered on expressions of the type $$B^{-1}A$$, where $$A, B$$ are SPSD matrices satisfying $$A \preceq B$$. This expression will be familiar to numerical linear algebraists: there $$B$$ would be a preconditioner for a linear system $$A x = b,$$ and the relevant quantity of interest is the spectral radius of $$B^{-1} A$$.

It’s not hard to show that the spectral radius of this product is at most 1… but instead, I’m interested in the norm of this product. Because the spectral radius of the product is at most 1, we can use the bound
$\|B^{-1} A\|_2 \leq \kappa(U_{B^{-1}A})$
where $$\kappa(U_{B^{-1}A})$$ is the two-condition number of the matrix of eigenvectors of $$B^{-1}A$$.
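To spell out where this bound comes from: $$B^{-1}A$$ is diagonalizable (it is similar to the symmetric matrix $$B^{-1/2} A B^{-1/2}$$, taking $$B$$ positive definite), say $$B^{-1}A = U_{B^{-1}A} \Lambda U_{B^{-1}A}^{-1}$$, and submultiplicativity of the spectral norm gives

```latex
\|B^{-1}A\|_2 = \|U_{B^{-1}A} \Lambda U_{B^{-1}A}^{-1}\|_2
  \leq \|U_{B^{-1}A}\|_2 \, \|\Lambda\|_2 \, \|U_{B^{-1}A}^{-1}\|_2
  = \kappa(U_{B^{-1}A}) \, \rho(B^{-1}A) \leq \kappa(U_{B^{-1}A}).
```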

In the applications I’m interested in, some rough numerical experiments show that this bound is good enough for my purposes (the two terms are of the same order). Assuming this is the case, how can I get a good theoretical bound on this condition number?

# Canonical Correlation Analysis (CCA)

I am not completely satisfied with the expositions of CCA that I’ve come across, so I decided to write one that reflects my own intuition.

CCA is useful in the case where you observe two random variables that are both noisy linear functions of some underlying latent random variable, and you want to use this fact to help you guess the latent variable. Formally, assume
$x = T_X z + n_x, \quad \text{ and } \quad y = T_Y z + n_y,$
where, without loss of generality, we assume that the entries of $$z$$ are uncorrelated and unit variance. Here $$T_X$$ and $$T_Y$$ are matrices whose image spaces may be of different dimension, and $$n_x$$ and $$n_y$$ are white noise. For convenience, we assume that $$z$$ is mean-zero, so that $$x$$ and $$y$$ are also.

In order to recover $$z$$ from $$x$$ and $$y$$, we pose the following problem: find two transformations $$\Phi_X$$ and $$\Phi_Y$$ so that the entries of $$\Phi_X x$$ are uncorrelated and unit variance, as are those of $$\Phi_Y y$$, and the correlation of $$\Phi_X x$$ with $$\Phi_Y y$$ is maximized.

We could use PCA to whiten $$x$$ and $$y$$ individually to get two different noisy estimates of $$z$$, but this would ignore the fact that knowing both $$x$$ and $$y$$ gives us more information on $$z$$. Indeed, the requirement that the correlation of $$\Phi_X x$$ and $$\Phi_Y y$$ be maximized should tend to choose $$\Phi_X$$ and $$\Phi_Y$$ which remove the directions containing the noise $$n_x$$ and $$n_y$$.

CCA can then be formulated as the following:
$\max_{\Phi_X, \Phi_Y} \mathbb{E} \langle \Phi_X x, \Phi_Y y \rangle \text{ subject to } \mathbb{E} \Phi_X x (\Phi_X x)^T = \mathbb{E} \Phi_Y y (\Phi_Y y)^T = I,$
or equivalently,
$\max_{\Phi_X, \Phi_Y} \text{Trace}(\Phi_Y C_{y x} \Phi_X^T) \text{ subject to } \Phi_X C_{xx} \Phi_X^T = \Phi_Y C_{yy} \Phi_Y^T = I,$
where $$C_{xx}, C_{yy}$$ are the covariance matrices of $$x$$ and $$y$$ and $$C_{yx}$$ is the cross-covariance matrix of $$y$$ and $$x$$ given by $$C_{yx} = \mathbb{E} (yx^T).$$

Since $$C_{xx}$$ and $$C_{yy}$$ are full-rank, by taking $$\Phi_X^\prime = \Phi_X C_{xx}^{1/2}$$ and $$\Phi_Y^\prime = \Phi_Y C_{yy}^{1/2},$$ this program can be transformed to
$\max_{\Phi_X^\prime, \Phi_Y^\prime} \text{Trace}(\Phi_Y^\prime C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2} \Phi_X^{\prime T}) \text{ subject to } \Phi_X^\prime \Phi_X^{\prime T} = \Phi_Y^\prime \Phi_Y^{\prime T} = I.$

From the standard characterization of the SVD, it is clear that the solutions $$\Phi_X^\prime$$ and $$\Phi_Y^\prime$$ are given by the top right and left singular vectors of the matrix $$C_{yy}^{-1/2} C_{yx} C_{xx}^{-1/2}.$$ Equivalently, they are the top eigenvectors of the two matrices $$C_{xx}^{-1/2} C_{xy} C_{yy}^{-1} C_{yx} C_{xx}^{-1/2}$$ and $$C_{yy}^{-1/2} C_{yx} C_{xx}^{-1} C_{xy} C_{yy}^{-1/2}.$$ The eigenvectors of these two unwieldy matrices correspond, after scaling by $$C_{xx}^{-1/2}$$ and $$C_{yy}^{-1/2}$$ respectively, to the top eigenvectors of the generalized eigenvalue problems given by
\begin{align*}
C_{xy} C_{yy}^{-1} C_{yx} \phi_i^x & = \lambda_i C_{xx} \phi_i^x \\
C_{yx} C_{xx}^{-1} C_{xy} \phi_i^y & = \lambda_i C_{yy} \phi_i^y.
\end{align*}
Here the $$\lambda_i$$ are the squared canonical correlations, and the $$\phi^x_i, \phi^y_i$$ are the canonical correlation basis vectors.

None of the non-variational characterizations given so far seems terribly efficient, since they all involve matrix inversions. It turns out that CCA can be computed using QR decompositions instead. See Björck and Golub, 1973 for the details of the algorithm, and for the connection between CCA and the principal angles between subspaces.
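For concreteness, here is a small numpy sketch of the whitened-SVD characterization of CCA (the direct route, not the QR algorithm; it forms the covariances explicitly, so it assumes $$C_{xx}$$ and $$C_{yy}$$ are well conditioned). The data sizes and noise level in the demo are arbitrary choices of mine.

```python
import numpy as np

def cca(X, Y, k):
    """Top-k canonical correlations and directions from data matrices whose
    rows are paired observations. Direct implementation of the whitened-SVD
    characterization; regularize Cxx, Cyy if they are ill-conditioned."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy, Cyx = X.T @ X / n, Y.T @ Y / n, Y.T @ X / n

    def inv_sqrt(C):
        w, U = np.linalg.eigh(C)
        return U @ np.diag(w ** -0.5) @ U.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    Uy, s, VxT = np.linalg.svd(Wy @ Cyx @ Wx)  # C_yy^{-1/2} C_yx C_xx^{-1/2}
    Phi_x = VxT[:k] @ Wx        # rows: canonical directions for x
    Phi_y = Uy[:, :k].T @ Wy    # rows: canonical directions for y
    return s[:k], Phi_x, Phi_y

# Recover a 2-d latent signal observed through two noisy linear maps.
rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 2))
X = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((5000, 5))
Y = z @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((5000, 4))
s, Phi_x, Phi_y = cca(X, Y, 2)
```

By construction, `Phi_x` whitens `X` (so `Phi_x @ Cxx @ Phi_x.T` is the identity), and the singular values `s` are the canonical correlations, which should be close to 1 here because the noise is small.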

# Decision time: MacPorts vs Homebrew vs Fink

My work macbook pro recently crapped out on me during an update of the OS (apparently something has a tendency to go wrong with the video card or its driver or something similar during this particular update for this particular model … sigh) so I’ve had the joy of reinstalling my personal ecosystem of software again.

One of the crucial decisions for me is whether to use MacPorts, HomeBrew, or Fink to allow me to manage the installation of non-trivial Unix packages. This post is really just to remind myself that at this go-round, I chose MacPorts because of these posts. Essentially, Fink is older and MacPorts is its successor, so therefore better in my simplistic mind, and apparently it’s a lot easier to set up a SciPy install with MacPorts.

# Adagrad and projecting onto a ball

((Caveat! I am not sure the manipulations done in this post are correct, but the gist is certainly there.))

One of my favorite optimization techniques is Adagrad, a first-order technique that approximates the Hessian by using all the gradients up to that point. It calls for updates of the form:
$x_{t+1} = \Pi_{\mathcal{X}}^{G_t^{1/2}} (x_t - \eta G_{t}^{-1/2} g_t),$
or more practically for high-dimensional problems,
$x_{t+1} = \Pi_{\mathcal{X}}^{\text{diag}(G_t)^{1/2}} (x_t - \eta \text{diag}(G_{t})^{-1/2} g_t).$
Here, $$g_t$$ denotes the gradient at step $$t$$, $$\eta$$ is a fixed stepsize, $$\mathcal{X}$$ is the constraint set for your optimization problem,
$G_t = \sum_{i=1}^t g_i g_i^T,$
and $$\Pi_{\mathcal{X}}^A$$ is the projection onto the constraint set with respect to a distance function defined by a positive-definite matrix $$A$$:
$\Pi_{\mathcal{X}}^A (y) := \text{argmin}_{x \in \mathcal{X}} \langle x - y, A (x-y) \rangle = \text{argmin}_{x \in \mathcal{X}} \|A^{1/2} (x-y)\|_2^2.$
By convention $$\Pi_{\mathcal{X}} = \Pi_{\mathcal{X}}^I$$ denotes the Euclidean projection operator.

The neat thing about Adagrad is that it chooses a different effective step size for each variable, rather than using one fixed step size for all of them. I’ve found that Adagrad outperforms SGD on my unconstrained optimization problems.
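To make the per-variable step size point concrete, here is a minimal sketch of diagonal Adagrad for an unconstrained problem; the test function and constants are mine, not from the paper.

```python
import numpy as np

def adagrad(grad, x0, eta=0.5, steps=2000, eps=1e-8):
    """Minimal diagonal Adagrad: x <- x - eta * diag(G_t)^{-1/2} g_t."""
    x = np.asarray(x0, dtype=float).copy()
    G = np.zeros_like(x)                     # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        G += g * g
        x -= eta * g / (np.sqrt(G) + eps)    # per-coordinate effective step
    return x

# Sanity check on a separable quadratic with very different curvatures:
# the coordinate with the 100x larger gradient automatically gets a
# proportionally smaller effective step.
target = np.array([1.0, -2.0])
scales = np.array([100.0, 1.0])
x_min = adagrad(lambda x: scales * (x - target), np.zeros(2))
```

Despite the 100:1 curvature ratio, both coordinates converge at comparable rates, which is the whole appeal.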

Now I need to use Adagrad on a ball-constrained problem: $$\mathcal{X} = \{ x : \|x\|_2 \leq B \}$$. Computing the associated projection operator $$\Pi_{\mathcal{X}}^A$$ was a nice enough exercise in basic convex optimization that it seems worth recording here. Actually, given the ubiquity of this constraint set, I’m surprised the authors didn’t seem to provide this as an example in their paper (I may have missed it, since I only read the absolute minimum to get the gist of the idea and then skimmed the rest).

Our first step is to observe that
$\min_{\|x\|_2 \leq B} \|A^{1/2} (x- y)\|_2^2 = \min_{\|A^{-1/2} x\| \leq B} \|x - A^{1/2}y\|_2^2$
since $$A^{1/2}$$ is invertible, and that the unique minimizer of the first program is a linear transformation of the unique minimizer of the second. Concretely,
\begin{align*}
\Pi_{\|x\|_2 \leq B}^{A}(y) & = A^{-1/2} \text{argmin}_{\|A^{-1/2} x\| \leq B} \|x - A^{1/2}y\|_2^2 \\
& = A^{-1/2} \Pi_{\|A^{-1/2} x\|_2 \leq B} (A^{1/2} y).
\end{align*}

### The Lagrangian dual to Euclidean projection onto an ellipsoid

Thus, it suffices to determine the Euclidean projection operator onto an ellipsoid, $$\Pi_{\|Q x\|_2 \leq B}(z)$$, where $$Q$$ is positive-semidefinite. To do so, let’s write out the optimization problem that defines this projection operator:
$\min_{x^T Q^T Q x \leq B^2} (x - z)^T(x-z).$
The associated Lagrangian is:
\begin{align*}
L(x,\lambda) &= (x-z)^T(x-z) - \lambda ( B^2 - x^TQ^TQx) \\
& = x^T( I + \lambda Q^TQ) x - 2z^T x + z^Tz - \lambda B^2.
\end{align*}
Recall that $$L(x,\lambda)$$ is a lower bound on the objective whenever $$x$$ is feasible and $$\lambda \geq 0.$$ To find a uniform lower bound, we minimize $$L$$ with respect to $$x$$ by setting $$\nabla_x L = 0$$. This occurs when
$x = (I + \lambda Q^TQ)^{-1} z,$
and gives the uniform lower bound
$g(\lambda) = \min_x L(x, \lambda) = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2.$
The optimization problem we’re looking at satisfies strong duality, so we know that maximizing $$g$$ over the set of nonnegative $$\lambda$$ gives the same optimal value as the original problem, and corresponding to the optimal $$\lambda$$ there is an optimal $$x$$. We now find the optimal $$\lambda$$ and show how to recover the optimal $$x$$.

### From the dual optimal point to the primal optimal point

First write the SVD of $$Q^TQ$$ as $$Q^TQ = U \Sigma U^T.$$ It follows that
$I - (I + \lambda Q^TQ)^{-1} = U \left[ \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} \right]_{ii} U^T,$
where $$\sigma_i = \Sigma_{ii}.$$ The optimal $$\lambda$$ then is the nonnegative $$\lambda$$ which maximizes
\begin{align*}
g(\lambda) & = z^T[ I - (I + \lambda Q^TQ)^{-1} ]z - \lambda B^2 \\
& = \sum\nolimits_i (U^Tz)_i^2 \frac{\lambda \sigma_{i}}{1 + \lambda \sigma_{i}} - \lambda B^2.
\end{align*}

Observe that $$g(\lambda)$$ is concave on $$\mathbb{R}^+,$$ so its maximum occurs at $$0$$ or the point where $$g^\prime(\lambda) = 0$$. Thus, if the equation
$\sum\nolimits_i (U^Tz)_i^2 \frac{\sigma_i}{(1 + \lambda \sigma_i)^2} = \|Q(I + \lambda Q^TQ)^{-1} z\|_2^2 = B^2$
has a nonnegative solution, that solution is the optimal $$\lambda$$. If not, then the optimal $$\lambda$$ is $$0.$$

Given this optimal $$\lambda,$$ the corresponding optimal $$x$$ is given by
$x = \Pi_{\|Qx\|_2 \leq B}(z) = (I + \lambda Q^TQ)^{-1} z.$

### From the Euclidean projection on an ellipsoid back to the Adagrad projector

Putting all the pieces together, we get the following expression for the projection operator needed in Adagrad:
\begin{align*}
\Pi_{\|x\|_2 \leq B}^A(y) & = A^{-1/2} \Pi_{\|A^{-1/2}x\|_2 \leq B}(A^{1/2}y) \\
& = A^{-1/2} ( I + \lambda A^{-1})^{-1} A^{1/2} y \\
& = (I + \lambda A^{-1})^{-1}y = A(A + \lambda I)^{-1}y,
\end{align*}
where $$\lambda$$ is either the nonnegative solution to the nonlinear equation
\begin{align*}
\sum\nolimits_i (\Sigma^{1/2} U^T y)_{n-i+1}^2 \frac{\sigma_i(A^{-1})}{(1 + \lambda \sigma_i(A^{-1}))^2} & = \|(I + \lambda A^{-1})^{-1} y \|_2^2 \\
& = \|A(A + \lambda I)^{-1} y\|_2^2 = B^2,
\end{align*}
where $$A = U \Sigma U^T,$$ or if such a solution does not exist, $$0.$$
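To check the final formula numerically, here is a sketch for diagonal $$A$$ (the case that matters for practical Adagrad), solving the scalar equation for $$\lambda$$ by bisection; the bracketing strategy and iteration count are ad hoc choices of mine.

```python
import numpy as np

def ball_projection(y, a, B):
    """Projection of y onto {x : ||x||_2 <= B} in the metric <x-y, A(x-y)>,
    with A = diag(a) positive definite. Returns A (A + lam I)^{-1} y, where
    lam >= 0 solves ||A (A + lam I)^{-1} y||_2 = B (lam = 0 if y is feasible)."""
    if np.linalg.norm(y) <= B:
        return y.copy()
    norm_at = lambda lam: np.linalg.norm(a / (a + lam) * y)
    lo, hi = 0.0, 1.0
    while norm_at(hi) > B:       # the norm decreases monotonically in lam
        hi *= 2.0
    for _ in range(200):         # bisection on the scalar equation
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_at(mid) > B else (lo, mid)
    lam = 0.5 * (lo + hi)
    return a / (a + lam) * y

a = np.array([1.0, 4.0, 9.0])
y = np.array([3.0, -2.0, 1.0])
x_star = ball_projection(y, a, 1.0)   # infeasible y lands on the boundary
```

Since the projection is the minimizer of the Mahalanobis objective over the ball, any other feasible point should have a larger objective value, which makes for an easy randomized sanity check.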

# Back of the envelope calculations of how fast your computer can do linear algebra operations

Let’s talk about CPU speed, practically. By practically, I mean, how fast can your CPU do linear algebra operations. And by linear algebra operations, I mean matrix-matrix multiplies.

First, you need to calculate how many FLOPS your computer can do. The following formula comes in handy:
$\text{nFLOPS} = \text{cores} \cdot \frac{\text{clock cycles}}{\text{second}} \cdot \frac{\text{FLOPS}}{\text{cycle}}.$
You probably already know the number of cores in your computer and its clock rate (cycles per second). The interesting quantity here is the number of FLOPS per cycle: this depends on the architecture of your CPU and what exactly you take to be the size of a float.

It’s standard to take a float to consist of 32 bits (single precision), so the number of FLOPS per cycle depends on how many multiples of 32 bits fit into your vector registers. SSE-capable CPUs have 128-bit registers, so they can do 4 FLOPS per cycle (this is the most common class of CPUs). AVX-capable CPUs have 256-bit registers, so they can do 8 FLOPS per cycle (e.g., the latest MacBook Pros are AVX capable).

Putting these bits together, I get that my workstation, which has 2 hexa-core SSE-capable CPUs, each running at 2 GHz, achieves
$\text{nFLOPS} = (2*6) * (2*10^9)*4 = 96 \text{GFLOPS}.$

The cost of a matrix-matrix multiply of two $$n$$-by-$$n$$ matrices is essentially $$n^3$$ floating point operations. Thus it should take this workstation about $$\frac{n^3}{96} * 10^{-9}$$ seconds to do this multiply.

E.g., in my case, the predicted time for multiplying two $$16\text{K} \times 16\text{K}$$ matrices (taking $$16\text{K} = 16000$$) is about 42.7 seconds. A quick Matlab check shows it does take about 43 seconds.
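The arithmetic above is easy to script; the hardware numbers below are the ones for my workstation from the text, so treat them as placeholders for your own machine.

```python
# Back-of-the-envelope peak throughput and matmul time for the
# 2 x hexa-core, 2 GHz, SSE-capable workstation described above.
cores = 2 * 6
clock_hz = 2e9
flops_per_cycle = 4            # SSE: 128-bit registers / 32-bit floats

peak_flops = cores * clock_hz * flops_per_cycle   # 96e9, i.e. 96 GFLOPS

n = 16000                      # the "16K x 16K" example
predicted_seconds = n**3 / peak_flops
```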

# A useful trick for computing gradients w.r.t. matrix arguments, with some examples

I’ve spent hours this week and last week computing, recomputing, and checking expressions for matrix gradients of functions. It turns out that, except in the simplest of cases, the most pain-free method for finding such gradients is to use the Frechet derivative (this is one of the few concrete benefits I derived from the differential geometry course I took back in grad school).

Remember that the Frechet derivative of a function $$f : X \rightarrow \mathbb{R}$$ at a point $$x$$ is defined as the unique linear operator $$d$$ that is tangent to $$f$$ at $$x$$, i.e. that satisfies
$f(x+h) = f(x) + d(h) + o(\|h\|).$
This definition of differentiability makes sense whenever $$X$$ is a normed linear space. If $$f$$ has a gradient, then the Frechet derivative exists and the gradient satisfies the relation $$d(h) = \langle \nabla f(x), h \rangle.$$

### Simple application

As an example application, let’s compute the gradient of the function
$f(X) = \langle A, XX^T \rangle := \mathrm{trace}(A^T XX^T) = \sum_{ij} A_{ij} (XX^T)_{ij}$
over the linear space of $$m$$ by $$n$$ real-valued matrices equipped with the Frobenius norm. First we can expand out $$f(X+H)$$ as
$f(X + H) = \langle A, (X+H)(X+H)^T \rangle = \langle A, XX^T + XH^T + HX^T + HH^T \rangle.$
Now we observe that the terms which involve more than one power of $$H$$ are $$O(\|H\|^2) = o(\|H\|)$$ as $$H \rightarrow 0$$, so
$f(X + H) = f(X) + \langle A, XH^T + HX^T \rangle + o(\|H\|).$
It follows that
$d(H) = \langle A, XH^T + HX^T \rangle = \mathrm{trace}(A^TXH^T) + \mathrm{trace}(A^THX^T),$
which is clearly a linear function of $$H$$ as desired. To write this in a way that exposes the gradient, we use the
cyclicity properties of the trace, and exploit its invariance under transposes to see that
\begin{align}
d(H) & = \mathrm{trace}(HX^TA) + \mathrm{trace}(X^TA^T H) \\
& = \mathrm{trace}(X^TAH) + \mathrm{trace}(X^TA^T H) \\
& = \langle AX, H \rangle + \langle A^TX, H \rangle \\
& = \langle (A + A^T)X, H \rangle.
\end{align}
The gradient of $$f$$ at $$X$$ is evidently $$(A + A^T)X$$.
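A quick finite-difference check of this gradient is reassuring; the sizes and the random test direction below are arbitrary choices of mine.

```python
import numpy as np

# Verify grad f(X) = (A + A^T) X for f(X) = trace(A^T X X^T) by comparing
# the directional derivative <grad, H> against a central difference.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, m))
X = rng.standard_normal((m, n))
H = rng.standard_normal((m, n))     # arbitrary perturbation direction

f = lambda M: np.trace(A.T @ M @ M.T)
grad = (A + A.T) @ X

eps = 1e-6
fd = (f(X + eps * H) - f(X - eps * H)) / (2 * eps)
```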

### More complicated application

If you have the patience to work through a lot of algebra, you could probably calculate the above gradient component by component using the standard rules of differential calculus, then back out the simple matrix expression $$(A + A^T)X$$. But what if we partitioned $$X$$ into $$X = [\begin{matrix}X_1^T & X_2^T \end{matrix}]^T$$ and desired the derivative of
$f(X_1, X_2) = \mathrm{trace}\left(A \left[\begin{matrix} X_1 \\ X_2 \end{matrix}\right] \left[\begin{matrix}X_1 \\ X_2 \end{matrix} \right]^T\right)$
with respect to $$X_2$$? Then the bookkeeping necessary becomes even more tedious if you want to compute component by component derivatives (I imagine, not having attempted it). On the other hand, the Frechet derivative route is not significantly more complicated.

Some basic manipulations allow us to claim
\begin{align}
f(X_1, X_2 + H) & = \mathrm{trace}\left(A \left[\begin{matrix} X_1 \\ X_2 + H \end{matrix}\right] \left[\begin{matrix}X_1 \\ X_2 + H \end{matrix} \right]^T\right) \\
& = f(X_1, X_2) + \mathrm{trace}\left(A \left[\begin{matrix} 0 & X_1 H^T \\
H X_1^T & H X_2^T + X_2 H^T + H H^T \end{matrix} \right]\right)
\end{align}
Once again we drop the $$o(\|H\|)$$ terms to see that
$d(H) = \mathrm{trace}\left(A \left[\begin{matrix} 0 & X_1 H^T \\ H X_1^T & H X_2^T + X_2 H^T \end{matrix} \right]\right).$
To find a simple expression for the gradient, we partition $$A$$ (conformally with our partitioning of $$X$$ into $$X_1$$ and $$X_2$$) as
$A = \left[\begin{matrix} A_1 & A_2 \\ A_3 & A_4 \end{matrix} \right].$
Given this partitioning,
\begin{align}
d(H) & = \mathrm{trace}\left(\left[\begin{matrix}
A_2 H X_1^T & \\
& A_3 X_1 H^T + A_4 H X_2^T + A_4 X_2 H^T
\end{matrix}\right] \right) \\
& = \langle A_2^TX_1, H \rangle + \langle A_3X_1, H \rangle + \langle A_4^T X_2, H \rangle + \langle A_4X_2, H \rangle \\
& = \langle (A_2^T + A_3)X_1 + (A_4^T + A_4)X_2, H \rangle.
\end{align}
The first equality comes from noting that the trace of a block matrix is simply the trace of its diagonal parts, and the second comes from manipulating the traces using their cyclicity and invariance to transposes.

Thus $$\nabla_{X_2} f(X_1, X_2) = (A_2^T + A_3)X_1 + (A_4^T + A_4)X_2.$$
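The same finite-difference sanity check applies to the partitioned gradient; again the block sizes and test direction are arbitrary.

```python
import numpy as np

# Verify grad_{X2} f = (A2^T + A3) X1 + (A4^T + A4) X2 for
# f(X1, X2) = trace(A [X1; X2] [X1; X2]^T), A partitioned conformally.
rng = np.random.default_rng(0)
m1, m2, n = 3, 4, 2
A = rng.standard_normal((m1 + m2, m1 + m2))
X1 = rng.standard_normal((m1, n))
X2 = rng.standard_normal((m2, n))
A2, A3, A4 = A[:m1, m1:], A[m1:, :m1], A[m1:, m1:]

f = lambda X2_: np.trace(A @ np.vstack([X1, X2_]) @ np.vstack([X1, X2_]).T)
grad = (A2.T + A3) @ X1 + (A4.T + A4) @ X2

eps = 1e-6
H = rng.standard_normal((m2, n))
fd = (f(X2 + eps * H) - f(X2 - eps * H)) / (2 * eps)
```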

### A masterclass application

Maybe you didn’t find the last example convincing. Here’s a function I needed to compute the matrix gradient for (a task which I defy you to accomplish using standard calculus operations):
$f(V) = \langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) \rangle = \log(1^T \mathrm{e}^{VV^T})K^T 1.$
Here, $$K$$ is an $$n \times n$$ matrix (nonsymmetric in general), $$V$$ is an $$n \times d$$ matrix, and $$1$$ is a column vector of ones of length $$n$$. The exponential $$\mathrm{e}^{VV^T}$$ is computed entrywise, as is the $$\log$$.

To motivate why you might want to take the gradient of this function, consider the situation that $$K_{ij}$$ measures how similar items $$i$$ and $$j$$ are in a nonsymmetric manner, and the rows of $$V$$ are coordinates for representations of the items in Euclidean space. Then $$(1^T K)_j$$ measures how similar item $$j$$ is to all the items, and
$(1^T \mathrm{e}^{VV^T})_j = \sum_{\ell=1}^n \mathrm{e}^{v_\ell^T v_j}$
is a measure of how similar the embedding $$v_j$$ is to the embeddings of all the items. Thus, if we constrain all the embeddings to have norm 1, maximizing $$f(V)$$ with respect to $$V$$ ensures that the embeddings capture the item similarities in some sense. (Why do you care about this particular sense? That’s another story altogether.)

Ignoring the constraints (you could use a projected gradient method for the optimization problem), we’re now interested in finding the gradient of $$f$$. In the following, I use the notation $$A \odot B$$ to indicate the pointwise product of two matrices.
\begin{align}
f(V + H) & = \langle 1^T K, \log(1^T \mathrm{e}^{(V+H)(V+H)^T}) \rangle \\
& = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot \mathrm{e}^{VH^T} \odot \mathrm{e}^{HV^T} \odot \mathrm{e}^{HH^T} ]) \rangle
\end{align}
One can use the series expansion of the exponential to see that
\begin{align}
\mathrm{e}^{VH^T} & = 11^T + VH^T + o(\|H\|), \\
\mathrm{e}^{HV^T} & = 11^T + HV^T + o(\|H\|), \text{ and}\\
\mathrm{e}^{HH^T} & = 11^T + o(\|H\|).
\end{align}
It follows that
\begin{multline}
f(V + H) = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot (11^T + VH^T + o(\|H\|)) \\
\odot (11^T + HV^T + o(\|H\|)) \odot (11^T + o(\|H\|)) ]) \rangle.
\end{multline}
Expanding the pointwise products and absorbing the higher-order terms into $$o(\|H\|)$$,
\begin{align}
f(V + H) & = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} \odot(11^T + VH^T + HV^T + o(\|H\|) )]) \rangle \\
& = \langle 1^T K, \log(1^T [\mathrm{e}^{VV^T} + \mathrm{e}^{VV^T} \odot (VH^T + HV^T) + o(\|H\|)]) \rangle
\end{align}
Now recall the linear approximation of $$\log$$:
$\log(x) = \log(x_0) + \frac{1}{x_0} (x-x_0) + O(|x- x_0|^2).$
Apply this approximation pointwise to conclude that
\begin{multline}
f(V + H) = \langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) + \\
\{1^T \mathrm{e}^{VV^T}\}^{-1}\odot (1^T [\mathrm{e}^{VV^T} \odot (VH^T + HV^T) + o(\|H\|)]) \rangle,
\end{multline}
where $$\{x\}^{-1}$$ denotes the pointwise inverse of a vector.
Take $$D$$ to be the diagonal matrix with diagonal entries given by $$1^T \mathrm{e}^{VV^T}$$. We have shown that
$f(V + H) = f(V) + \langle K^T1, D^{-1} [\mathrm{e}^{VV^T} \odot (VH^T + HV^T)]1 \rangle + o(\|H\|),$
so
\begin{align}
d(H) & = \langle K^T1, D^{-1} [\mathrm{e}^{VV^T} \odot (VH^T + HV^T)]1 \rangle \\
& = \langle D^{-1}K^T 11^T, \mathrm{e}^{VV^T} \odot (VH^T + HV^T) \rangle \\
& = \langle \mathrm{e}^{VV^T} \odot D^{-1}K^T 11^T, (VH^T + HV^T) \rangle.
\end{align}
The second equality follows from the standard properties of inner products and the third from the observation that
$\langle A, B\odot C \rangle = \sum_{ij} A_{ij} B_{ij} C_{ij} = \langle B \odot A, C \rangle.$
Finally, manipulations in the vein of the two preceding examples allow us to claim that
$\nabla_V f(V) = [\mathrm{e}^{VV^T} \odot (11^T K D^{-1} + D^{-1} K^T 11^T)] V.$

As a caveat, note that if instead $$f(V) = \log(1^T \mathrm{e}^{VV^T} ) K 1$$, then one should swap $$K$$ and $$K^T$$ in the last expression.
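Since this is exactly the kind of formula that is easy to get wrong, a finite-difference check is worthwhile. The check below takes $$f(V) = \langle 1^T K, \log(1^T \mathrm{e}^{VV^T}) \rangle$$, the form used in the derivation above; the small random sizes and the scaling of $$V$$ (to keep the entrywise exponentials moderate) are choices of mine.

```python
import numpy as np

# Check grad f(V) = [exp(VV^T) o (11^T K D^{-1} + D^{-1} K^T 11^T)] V
# for f(V) = <1^T K, log(1^T exp(VV^T))>, exponentials and logs entrywise.
rng = np.random.default_rng(0)
n, d = 6, 3
K = rng.standard_normal((n, n))
V = 0.3 * rng.standard_normal((n, d))   # keep exp(VV^T) moderate

def f(V_):
    S = np.exp(V_ @ V_.T).sum(axis=0)        # the row vector 1^T exp(VV^T)
    return float(K.sum(axis=0) @ np.log(S))  # <1^T K, log(1^T exp(VV^T))>

E = np.exp(V @ V.T)
c_over_S = K.sum(axis=0) / E.sum(axis=0)     # entries (1^T K)_j / (1^T E)_j
W = E * (c_over_S[None, :] + c_over_S[:, None])  # 11^T K D^{-1} + D^{-1} K^T 11^T
grad = W @ V

eps = 1e-6
H = rng.standard_normal((n, d))
fd = (f(V + eps * H) - f(V - eps * H)) / (2 * eps)
```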