Teaching kernel learning

Here’s a neat approach to teaching kernel learning (for empirical risk minimization), following section 2.2.6 of the book “First-order and Stochastic Optimization Methods for Machine Learning” by Guanghui Lan. Given the terse nature of this book, I did not expect to find such a straightforward pedagogical approach, one I’ve not seen elsewhere. Let’s begin by assuming you understand the optimality conditions for unconstrained convex optimization.

Consider the convex empirical risk minimization problem

\[ \omega^\star = \text{argmin}_{\omega \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T x_i, y_i) + \lambda \|\omega\|_2^2.\]

Fermat’s condition (the gradient vanishes at an unconstrained optimum) gives \(\omega^\star = -\frac{1}{2\lambda n} \sum_{i=1}^n \ell'((\omega^\star)^T x_i, y_i)\, x_i\), where \(\ell'\) denotes the derivative of \(\ell\) in its first argument; in particular, the optimal model weights are in the span of the training data points, \(\omega^\star = \sum_{i=1}^n \alpha_i x_i \), so the problem can be recast as

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell\left( \sum_{j=1}^n \alpha_j \langle x_i, x_j \rangle, y_i\right) + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle x_i, x_j \rangle. \]

Define the linear kernel matrix \(K = [ \langle x_i, x_j \rangle ]_{i,j=1,\ldots,n}\), and this becomes

\[ \alpha^\star = \text{argmin}_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

which is again a convex optimization problem. Once \(\alpha^\star\) is known, the model can be used to predict the value of the target given a new input, also in terms of the kernel function \( \kappa(x_1, x_2) = \langle x_1, x_2 \rangle \) :

\[ (\omega^\star)^T x = \sum_{i=1}^n \alpha_i \langle x_i, x \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

This reformulation of the ERM is not particularly useful unless \(d > n\): the original ERM is a \(d\)-dimensional optimization problem, while the reformulation is an \(n\)-dimensional optimization problem that additionally requires forming and working with an \(n \times n\) kernel matrix. When \(d\) is larger than \(n\), however, the kernel trick that led to the second ERM pays off, as the \(n\)-dimensional problem may be more efficient to work with.
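To make this concrete: for the special case of squared loss \(\ell(t, y) = (t - y)^2\), setting the gradient of the kernelized objective to zero shows that any \(\alpha\) satisfying \((K + n\lambda I)\alpha = y\) is optimal, so fitting reduces to a linear solve. Here’s a minimal pure-Python sketch (the training data, \(\lambda\), and test point are made up for illustration) that fits this way and checks that the kernelized predictor \(\sum_i \alpha_i \langle x_i, x \rangle\) agrees with the primal predictor \((\omega^\star)^T x\):

```python
# Kernelized ERM with a linear kernel and squared loss.
# For squared loss the optimality condition reduces to solving
# (K + n*lam*I) alpha = y.  All data below is toy data.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

# toy training set: n = 3 points in d = 2 dimensions
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, -1.0, 0.5]
lam = 0.1
n = len(X)

# linear kernel matrix K_ij = <x_i, x_j>
K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]

# solve (K + n*lam*I) alpha = y
A = [[K[i][j] + (n * lam if i == j else 0.0) for j in range(n)]
     for i in range(n)]
alpha = solve(A, y)

# predict at a new point two ways: kernel form and primal form
x_new = [0.5, -0.5]
pred_kernel = sum(alpha[i] * dot(X[i], x_new) for i in range(n))
w = [sum(alpha[i] * X[i][k] for i in range(n)) for k in range(2)]
pred_primal = dot(w, x_new)
print(abs(pred_kernel - pred_primal))  # agree up to floating point
```

The final check is exactly the prediction identity below: the kernelized model and the primal model \(\omega^\star = \sum_i \alpha_i x_i\) are the same predictor.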

Now, consider that we want to learn using linear combinations of some fixed functions of the raw features \(x\), e.g. we want to learn using the features \(x_1, x_2, x_1^2, x_2^2, x_1 x_2, \ldots\). One can imagine that using such features could be of benefit over simply using the raw features. Let \(\phi(x)\) denote the feature map that maps the raw features from \(d\) dimensions into \(D\) new features, then the ERM of interest is

\[ \min_{\omega \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^n \ell(\omega^T \phi(x_i), y_i) + \lambda \|\omega\|_2^2,\]

and the same reasoning as before tells us that \(\omega^\star = \sum_{i=1}^n \alpha_i \phi(x_i)\). Defining the nonlinear kernel function \( \kappa(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle \) and the corresponding nonlinear kernel matrix \(K = [ \kappa(x_i, x_j) ]_{i,j=1,\ldots,n}\), we see that we can fit this model by solving

\[ \min_{\alpha \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \ell( (K \alpha)_i, y_i) + \lambda \alpha^T K \alpha, \]

the exact same convex problem as before, and also predict using

\[ (\omega^\star)^T \phi(x) = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^n \alpha_i \kappa(x_i, x). \]

Note that \(\phi\) could even be infinite-dimensional, and this approach still works: the ERM stated in terms of the feature map cannot be solved on a computer, but the kernelized ERM can, since it is only an \(n\)-dimensional optimization problem. In practice, even when \(\phi\) is finite-dimensional, \(D\) is typically so large that the kernel version of the ERM is more efficient to solve than the feature map version. Further, \(\phi\) is chosen so that the kernel function can be evaluated efficiently.

The canonical example of a feature map used in kernel learning is the Gaussian/radial basis feature map, an infinite-dimensional feature map that consists of all the monomials in the raw features, each downweighted by an appropriate exponential weight (it’s tedious to write out; you can look it up). The nice thing is that this choice of feature map leads to the RBF kernel function that we all know and love,

\[ \kappa(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|_2^2}{2\sigma^2} \right), \]

where \(\sigma^2\) is a hyperparameter.
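Putting the pieces together, here’s a pure-Python sketch of kernel learning with the RBF kernel (toy 1-d data; the choices of \(\sigma^2\), \(\lambda\), and squared loss are illustrative, with squared loss again making the fit a linear solve). Note that the infinite-dimensional feature map never appears: only \(\kappa\) is ever evaluated.

```python
import math

def rbf(u, v, sigma2=1.0):
    """RBF kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

X = [[0.0], [1.0], [2.0], [3.0]]   # toy 1-d inputs
y = [0.0, 1.0, 0.0, -1.0]          # toy targets
lam, n = 0.01, len(X)

# nonlinear kernel matrix K_ij = kappa(x_i, x_j)
K = [[rbf(X[i], X[j]) for j in range(n)] for i in range(n)]

# fit: solve (K + n*lam*I) alpha = y (squared-loss special case)
A = [[K[i][j] + (n * lam if i == j else 0.0) for j in range(n)]
     for i in range(n)]
alpha = solve(A, y)

# predict at a new input purely through kernel evaluations
x_new = [1.5]
pred = sum(alpha[i] * rbf(X[i], x_new) for i in range(n))
print(pred)
```

The prediction line is exactly \(\sum_{i=1}^n \alpha_i \kappa(x_i, x)\): the model is fit and evaluated without ever touching \(\phi\).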

Every deep learning paper is wrong. Don’t at me. — seen in the bowels of Twitter

Bruh! I’ve been away for five years. Only posting now because I’m productively procrastinating (yes that is a lifehack link, don’t judge me) working on a grant whose writing is going to be like pulling nails.

This blog was supposed to serve as a journal (as much of a journal as something public-facing can be, when my identity is no secret), and it’s been a while since I put pen to paper. So what’s there to update on? I’m an assistant professor now, of Computer Science, in a wonderfully homey department in the northeast. Hence the grant writing. I’m still working on the same type of problems: machine learning and what I would call information-theoretic applied mathematics.

My personal life is more full-throated now. Maybe more to come on that in the future… I’m still parsing out exactly how to use social media in an age when your students and colleagues can look you up. That’s part of why I’m not on any social media, other than Youtube, where I fastidiously maintain a bevy of troll-adjacent accounts. I can only imagine how hard dating is when there’s the possibility you can run into students, postdocs, etc. on dating sites. There’s something about the neutered, purely intellectual nature of academia (which I love, don’t get it wrong) that makes it gauche to be caught trying to start a romantic relationship … maybe that’s just me. In fact, thinking of some other profs I know, that is entirely just me.

So about that quote: I am working on deep learning because that’s almost required nowadays in machine learning. But I am constitutionally opposed to much of the baselessness of it all. It’s a great opportunity for constructing theoretical foundations, and I appreciate that a lot of good work has been done in that direction. But fundamentally it seems like the advances come from clever engineering of inductive biases, and the theory is forever playing catch-up. And when you read many of these papers closely, the proofs are problematic, incorrect, or implicitly make assumptions that trivialize the results. I can’t give statistics on that; I’ve just seen it enough recently to be bothered by it. What I’m saying is, I want a paradigm shift. Look at this recent paper on foundation models (a term referring to large models trained in an unsupervised manner that can then be fine-tuned for other tasks). It doesn’t leave a good taste in my mouth.

I have more to say, but I’m already late for a meeting. Welcome back to me, and see you later.

How to make the IEEEtran bibliography font smaller

It’s ridiculously over-spaced by default, and in addition to correcting this, you may want to reduce the font size to get that extra half page you desperately need to eke out to meet the page limit.

To accomplish these two goals, put the following in your preamble:

\renewcommand{\IEEEbibitemsep}{0pt plus 2pt}
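That handles the spacing. For the font size half, one approach I believe works with most classes (treat this as an unverified sketch; `refs` is a placeholder bibliography name) is to scope a size switch around the bibliography in the document body rather than the preamble:

```latex
% In the document body, not the preamble: shrink the bibliography
% font by scoping a size switch around it.  \scriptsize is an
% option if \footnotesize doesn't recover enough space.
{\footnotesize
\bibliography{refs}}
```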

I know just slightly more than exactly enough to get myself in trouble, so caveat utilitor.

Morals, smorals

I just spent several disappointing minutes on Amazon checking out two books by Pinker and Harris recommended on one of the atheist podcasts I follow, both with theses relating to the development of morals. Unsurprisingly, I found a scathing review of Harris’s book (full disclosure: I *really* *really* don’t like the guy, ever since I bought his book “The End of Faith” on recommendation from another atheist podcast and was disgusted by some comments he made to the effect of how it’s ok to preemptively kill Muslims), but I was surprised to also find several criticisms of Pinker’s book that make the case that he engages in a horrendous amount of cherry-picking. So, where can I find a scientifically rigorous look at morals that delves into the philosophical implications of an atheistic worldview? I don’t know…

But the point of this post was less to kvetch and more to point readers to the full text of Crane Brinton’s “History of Western Morals“, one of my favorite books by one of my favorite authors. I wish I’d discovered this sooner.

Spark for Linear Algebra: don’t

Here are some pretty pictures of empirical orthogonal functions (EOFs; principal components, in climatology jargon) calculated for the 3D field of ocean temperatures using data collected over 32 years, generated by computing a rank-20 PCA of a 2 TB matrix.

[Figures: the surface temperature portions of the first, second, and third EOFs; the mean sea surface temperature; and the first two years of the first, second, and third temporal EOFs.]

I’ve been doing a lot of programming recently, implementing truncated SVDs in Spark on large datasets (hence the pictures above). I’m new to this game, but I suspect that Spark should only be used for linear algebra if you really need linear algebra as part of your larger Spark workflow: in my experience, you have to fiddle with too many esoteric knobs (e.g., several different memory settings, several network timeout settings, using the correct type of serialization at the correct location within your code) that are dependent on the size and nature of your data to make Spark run without strange errors, much less run efficiently. Then there’s the annoying JVM array size restriction: arrays are indexed by 32-bit signed integers, so an array can have at most 2^31 − 1 (about 2.15 billion) entries. Thus a matrix with 6 million rows can have only about 360 columns, and the largest square array you can form is about 46K by 46K. Another way to look at this is that an array of Java floats can hold at most about 8.6 GB. Compare this with C/C++, where your array size is essentially limited only by the amount of memory on your machine.
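The arithmetic behind those limits is quick to check (“float” here means a 4-byte Java float):

```python
# Sanity-check the JVM array-size arithmetic quoted above.
MAX_ENTRIES = 2**31 - 1  # Java arrays are indexed by 32-bit signed ints

# A 6-million-row dense matrix flattened into one array:
rows = 6_000_000
print(MAX_ENTRIES // rows)      # 357 columns, i.e. "about 360"

# The largest square matrix that fits in a single array:
print(int(MAX_ENTRIES ** 0.5))  # 46340, i.e. "about 46K by 46K"

# The largest float[] in bytes (4 bytes per Java float):
print(MAX_ENTRIES * 4 / 1e9)    # about 8.6 GB
```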

If *all* you’re doing is linear algebra, I recommend using a purpose-built tool, like Skylark (which we’re working on making callable from Spark). Possibly the only serious advantage Spark has over MPI-based solutions is that it can theoretically work with an unlimited amount of data, since it doesn’t need to store it all in RAM. You just have to be very careful how you process it, due to the aforementioned JVM array size restrictions.

Installing Spark with Hadoop 2 using spark-ec2

YARN does not seem to be configured correctly when you use the spark-ec2 script to install a Spark cluster on EC2. Here’s my short workaround for getting YARN to work (with a simple python script at the bottom):

  1. launch a cluster with e.g. spark-ec2 -k <keyname> -i <keyfile> -s <numslaves> --instance-type=<type> --placement-group=<placementgroupname> --hadoop-major-version=2 --copy-aws-credentials launch <clustername>
    This automatically copies your AWS access keys into the core-site.xml configuration file for Hadoop so you can pull data from S3 into hdfs. Unfortunately, Hadoop is configured to use YARN, but the YARN installation is broken. The next couple of steps will fix this (they are outlines; run the python script at the bottom, after exporting your AWS keys as in step 3, to implement them).
  2. ssh into the cluster master; you can use spark-ec2 get-master <clustername> to get the public dns for the master
  3. export AWS_ACCESS_KEY_ID=<key> and export AWS_SECRET_ACCESS_KEY=<key>
  4. shut down yarn, the (ephemeral) hdfs, tachyon and spark
  5. change the mapred-site.xml and yarn-site.xml configuration files of (ephemeral) hdfs to correctly configure YARN
  6. open the ports 8025,8030,8040 (and maybe 8033, 9000) of the master group to the slave group
  7. copy the (ephemeral) hdfs configuration files to all the slave machines
  8. start up (ephemeral) hdfs, yarn, tachyon, then spark in this order
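For step 5, the Hadoop 2 settings involved look roughly like the following (a sketch, not the author’s script: exact values depend on your cluster, and MASTER-PRIVATE-DNS is a placeholder for your master’s address):

```xml
<!-- mapred-site.xml: run MapReduce on top of YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: tell the node managers where the resource manager
     lives and enable the shuffle service that MapReduce jobs need -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>MASTER-PRIVATE-DNS</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```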

You should now be able to pull data from s3 (using s3n:// urls) to hdfs, use hadoop, run spark jobs, etc.

Run this python script on the master to implement steps 4–8. You may need to open some ports manually in the master security group (8033, 9000) etc … check the yarn log files under /mnt/ephemeral-hdfs on the master and a slave if you have issues.

My thoughts on the confederate flag

The Confederate flag is literally a symbol of treason. Not only that, but the most treasonous act in the entire history of the United States.

Anyone who buys into the ‘States Rights’ explanation for the Secession either doesn’t understand what rights we’re talking about, or doesn’t care: the right to slavery. Elide motivations all you want, the southern states withdrew because with the entry of the western states to the union and the North blocking the expansion of slavery into the west, they felt their way of life and economic foundations being threatened. ‘States Rights’ is a euphemism for the right to slavery. If you disagree: off the top of your head, list for me two other ‘states rights’ that motivated the secession.

Put the issue of motivation aside. What good came of the Confederacy? At the end of the day, their treasonous secession led directly to the death of hundreds of thousands, devastated the south, and increased the national debt from $60 million before the war to $2.7 billion afterwards! Why defend a symbol of such a wretched and shameful attempt to tear apart the United States? As you say, the only good I can think of coming from the Civil War is the Emancipation Proclamation. But somehow I doubt anyone’s flying the flag to represent that.

It goes without saying that individuals can do whatever they want on their own property— just as it goes without saying that I reserve the right to think you’re grossly insensitive at best and racist at worst and most likely—, but the Confederate flag should not be flown on government property.

A quick thought on Supernatural and some other tv shows

Just finished season 9 of Supernatural. You’ve got to give that show credit for being one of the few that *demands* a deus ex machina ending. Anything less, after all this fighting over who’s going to take God’s place and Castiel’s moments of mysterious grace, would be a let down. I can’t wait to see what season 10 has to offer.

While we’re on the topic of shows that appropriate Christian mythology for their own ends, I want to say that Messengers is crap. We’re supposed to believe that God turned some humans into angels and sent the Devil to earth to tempt some other humans into becoming the Four Horsemen of the Apocalypse, all because he’s testing humanity? What a convoluted and cruel way for an omnipotent being to decide what he wants to do next. It seems ridiculous: objectively, no more ridiculous than a strictly biblical eschatology, but the latter has the weight of time and belief to lend it a patina of respectability. The plot of Messengers is both silly and not original … a combination it’s hard to get excited over. The actual execution is also pretty bland.

Continuing in the theme, I’m looking forward to the Lucifer show. D. B. Woodside’s character — the angel trying to pressure Lucifer back into hell — reminds me of the angel from Constantine, but hopefully that’ll be the only point of similarity between the two shows. The concept of Lucifer turning to crime fighting is almost too much for me to wrap my head around (I can only imagine the reaction of various Christian action organizations), but I have a history of enjoying shows that revolve around immortals assisting cops. I hope he ends up having more powers than just the ability to convince people to tell him their innermost desires.

Marco Polo: I approve, so far

I finally got around to watching Marco Polo. This is perhaps surprising news, as anyone who knows me could guess this would be right up my alley: almost superhuman martial arts (at least in the promo material), an eastern setting, and a clash of nations …

I’m on episode three, and I’m surprised to say that I’m not at all disappointed! I’m enjoying seeing the tensions building within his empire as Kublai Khan tries to preserve the Mongolian spirit of his empire while incorporating the disparate cultures and religions of his client states. In particular, Khan wants to incorporate the learning and wisdom of the Chinese, but his warlords are uneasy with the changes they see in his court and they think he is moving away from the Mongol ways. Jingim, Khan’s half-Chinese son and heir, is particularly affected by this cultural conflict: the warlords of the Empire see him as weak and not Mongol enough to be the next Khan, and he himself blames his father for not raising him to be more Chinese.

I haven’t yet decided how I feel about the portrayal of the women in this series. On the one hand, it’s completely believable that in the societies depicted at that time women had no path to influence other than through the bedroom. On the other, the sexual intrigue seems like an excuse to show a lot of writhing naked women.

Probably the aspect of the show that excites me the most is the casting. There’s only ONE white person on the cast, that being of course Marco Polo. He is an outsider in a vast world that does not involve white people at all… and he has no real power to influence events. He is, literally, an observer. It’s refreshing to see a big budget American production tackle a story from another culture and make authentic casting decisions. I really hope Marco Polo doesn’t end up being another Great White Savior.