Here are some pretty pictures of empirical orthogonal functions (principal components in climatology jargon) calculated for the 3D field of ocean temperatures using data collected over 32 years, generated by computing a rank-20 PCA of a 2 TB matrix.
surface temperature portion of first EOF
surface temperature portion of second EOF
surface temperature portion of third EOF
mean sea surface temperature
first two years of first temporal EOF
first two years of second temporal EOF
first two years of third temporal EOF
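For readers who haven’t met EOFs before, here is a minimal NumPy sketch of what a rank-k PCA of a space-by-time matrix computes. It is not the Spark code that produced the pictures: the toy data, the layout (rows are grid points, columns are time steps), and the name `truncated_pca` are all assumptions made for illustration, and a dense SVD like this only works when the matrix fits in memory.

```python
import numpy as np

def truncated_pca(X, rank):
    """Return the time-mean, the top-`rank` spatial EOFs, the singular
    values, and the temporal EOFs of X (rows = grid points, cols = times)."""
    mean = X.mean(axis=1, keepdims=True)           # time-mean at each grid point
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :rank], s[:rank], Vt[:rank, :]

rng = np.random.default_rng(0)
# Hypothetical toy data: 500 grid points, 60 time steps, 3 dominant modes.
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 60))
X += 0.01 * rng.standard_normal(X.shape)           # small noise

mean, eofs, svals, temporal = truncated_pca(X, rank=3)
approx = mean + eofs @ np.diag(svals) @ temporal   # rank-3 reconstruction
rel_err = np.linalg.norm(X - approx) / np.linalg.norm(X)
print(eofs.shape, temporal.shape, rel_err)
```

The columns of `eofs` play the role of the spatial maps above, and the rows of `temporal` play the role of the time series.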
I’ve been doing a lot of programming recently, implementing truncated SVDs in Spark on large datasets (hence the pictures above). I’m new to this game, but I suspect that Spark should only be used for linear algebra if you really need linear algebra as part of your larger Spark workflow: in my experience, you have to fiddle with too many esoteric knobs (e.g., several different memory settings, several network timeout settings, using the correct type of serialization at the correct location within your code) that are dependent on the size and nature of your data to make Spark run without strange errors, much less run efficiently. Then there’s the annoying JVM array size restriction: arrays are indexed by 32-bit signed integers, so an array can have at most 2^31-1 (about 2.147 billion) entries. Thus a matrix with 6 million rows can only have about 357 columns, and the largest square array you can form is about 46K squared. Another way to look at this is that an array of Java floats can hold at most about 8.6 GB. Compare this with C/C++ where your array size is essentially limited only by the amount of memory on your machine.
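The arithmetic behind those limits is quick to check. This is only a back-of-the-envelope sketch: it assumes the limit is exactly 2^31-1 entries (some JVMs cap arrays slightly lower) and 4 bytes per Java float, and the variable names are mine.

```python
import math

MAX_JVM_ARRAY_LEN = 2**31 - 1    # largest value of a signed 32-bit int
BYTES_PER_FLOAT = 4              # a Java float is 32 bits

rows = 6_000_000
max_cols = MAX_JVM_ARRAY_LEN // rows                 # columns that fit in one array
max_square_side = math.isqrt(MAX_JVM_ARRAY_LEN)      # side of the largest square array
max_float_bytes = MAX_JVM_ARRAY_LEN * BYTES_PER_FLOAT

print(max_cols)                  # 357
print(max_square_side)           # 46340
print(max_float_bytes / 1e9)     # roughly 8.59 (GB)
```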
If *all* you’re doing is linear algebra, I recommend using a purpose-built tool, like Skylark (which we’re working on making callable from Spark). Possibly the only serious advantage Spark has over MPI-based solutions is that it can theoretically work with an unlimited amount of data, since it doesn’t need to store it all in RAM. You just have to be very careful how you process it, due to the aforementioned JVM array size restrictions.
YARN does not seem to be configured correctly when you use the spark-ec2 script to install a Spark cluster on EC2. Here’s my short workaround for getting YARN to work (with a simple python script at the bottom):
launch a cluster with e.g.
spark-ec2 -k <keyname> -i <keyfile> -s <numslaves> --instance-type=<type> --placement-group=<placementgroupname> --hadoop-major-version=2 --copy-aws-credentials launch <clustername>
This automatically copies your AWS access keys into the core-site.xml configuration file for Hadoop so you can pull data from S3 into hdfs. Unfortunately, Hadoop is configured to use YARN, but the YARN installation is broken. The next few steps will fix this (they are only outlines; to implement them, run the python script at the bottom after exporting your AWS keys in step 3).
- ssh into the cluster master; you can use spark-ec2 get-master <clustername> to get the public DNS for the master
export AWS_ACCESS_KEY_ID=<key> and
- shut down yarn, the (ephemeral) hdfs, tachyon and spark
- change the
yarn-site.xml configuration files of (ephemeral) hdfs to correctly configure YARN
- open the ports 8025,8030,8040 (and maybe 8033, 9000) of the master group to the slave group
- copy the (ephemeral) hdfs configuration files to all the slave machines
- start up (ephemeral) hdfs, yarn, tachyon, then spark in this order
You should now be able to pull data from s3 (using s3n:// urls) to hdfs, use hadoop, run spark jobs, etc.
Run this python script on the master to implement steps 4–8. You may need to open some ports manually in the master security group (8033, 9000, etc.); check the yarn log files under /mnt/ephemeral-hdfs on the master and a slave if you have issues.
The Confederate flag is literally a symbol of treason. Not only that, but of the most treasonous act in the entire history of the United States.
Anyone who buys into the ‘States Rights’ explanation for the Secession either doesn’t understand what rights we’re talking about, or doesn’t care: the right to slavery. Elide motivations all you want, the southern states withdrew because with the entry of the western states to the union and the North blocking the expansion of slavery into the west, they felt their way of life and economic foundations being threatened. ‘States Rights’ is a euphemism for the right to slavery. If you disagree: off the top of your head, list for me two other ‘states rights’ that motivated the secession.
Put the issue of motivation aside. What good came of the Confederacy? At the end of the day, their treasonous secession led directly to the death of hundreds of thousands, devastated the south, and increased the national debt from $60 million before the war to $2.7 billion afterwards! Why defend a symbol of such a wretched and shameful attempt to tear apart the United States? As you say, the only good I can think of coming from the Civil War is the Emancipation Proclamation. But somehow I doubt anyone’s flying the flag to represent that.
It goes without saying that individuals can do whatever they want on their own property (just as it goes without saying that I reserve the right to think you’re grossly insensitive at best and racist at worst and most likely), but the Confederate flag should not be flown on government property.
Just finished season 9 of Supernatural. You’ve got to give that show credit for being one of the few that *demands* a deus ex machina ending. Anything less, after all this fighting over who’s going to take God’s place and Castiel’s moments of mysterious grace, would be a letdown. I can’t wait to see what season 10 has to offer.
While we’re on the topic of shows that appropriate Christian mythology for their own ends, I want to say that Messengers is crap. We’re supposed to believe that God turned some humans into angels and sent the Devil to earth to tempt some other humans into becoming the Four Horsemen of the Apocalypse, all because he’s testing humanity? What a convoluted and cruel way for an omnipotent being to decide what he wants to do next. It seems ridiculous: objectively, no more ridiculous than a strictly biblical eschatology, but the latter has the weight of time and belief to lend it a patina of respectability. The plot of Messengers is both silly and not original … a combination it’s hard to get excited over. The actual execution is also pretty bland.
Continuing in the theme, I’m looking forward to the Lucifer show. D. B. Woodside’s character — the angel trying to pressure Lucifer back into hell — reminds me of the angel from Constantine, but hopefully that’ll be the only point of similarity between the two shows. The concept of Lucifer turning to crime fighting is almost too much for me to wrap my head around (I can only imagine the reaction of various Christian action organizations), but I have a history of enjoying shows that revolve around immortals assisting cops. I hope he ends up having more powers than just the ability to convince people to tell him their innermost desires.
I finally got around to watching Marco Polo. This is perhaps surprising news, as anyone who knows me could guess this would be right up my alley: almost superhuman martial arts (at least in the promo material), an eastern setting, and a clash of nations …
I’m on episode three, and I’m surprised to say that I’m not at all disappointed! I’m enjoying seeing the tensions building within his empire as Kublai Khan tries to preserve the Mongolian spirit of his empire while incorporating the disparate cultures and religions of his client states. In particular, Khan wants to incorporate the learning and wisdom of the Chinese, but his warlords are uneasy with the changes they see in his court and they think he is moving away from the Mongol ways. Jingim, Khan’s half-Chinese son and heir, is particularly affected by this cultural conflict: the warlords of the Empire see him as weak and not Mongol enough to be the next Khan, and he himself blames his father for not raising him to be more Chinese.
I haven’t yet decided how I feel about the portrayal of the women in this series. On the one hand, it’s completely believable that in the societies depicted at that time women had no path to influence other than through the bedroom. On the other, the sexual intrigue seems like an excuse to show a lot of writhing naked women.
Probably the aspect of the show that excites me the most is the casting. There’s only ONE white person on the cast, that being of course Marco Polo. He is an outsider in a vast world that does not involve white people at all… and he has no real power to influence events. He is, literally, an observer. It’s refreshing to see a big budget American production tackle a story from another culture and make authentic casting decisions. I really hope Marco Polo doesn’t end up being another Great White Savior.
I’ve been reading a lot of NLA lately (e.g., a recent paper on communication-avoiding RRQR), and necessarily brushing up on some details I paid scant attention to in my NLA courses, like the details of the different types of pivoting. Which led me to this quote by a famous numerical analyst:
There is still a tendency to attach too much importance to the precise error bounds obtained by an a priori error analysis. In my opinion, the bound itself is the least important part of it. The main object of such an analysis is to expose the potential instabilities, if any, of an algorithm so that hopefully from the insight thus obtained one might be led to improved algorithms. Usually the bound itself is weaker than it might have been because of the necessity of restricting the mass of detail to a reasonable level and because of the limitations imposed by expressing the errors in terms of matrix norms. A priori bounds are not, in general, quantities that should be used in practice. Practical error bounds should usually be determined by some form of a posteriori error analysis, since this takes full advantage of the statistical distribution of rounding errors and of any special features, such as sparseness, in the matrix.
Can I get an amen? This could be the epigraph of the career I’m building. I strive for a priori analyses, whether they are of algorithms or physical systems, because in the best cases they enhance our understanding of the factors relevant to our problems. I seek them out in others’ work and try to provide them in my own because I’m deeply skeptical of purely empirical results: without sufficient theory, how do you know you’re not just avoiding inputs that would expose some failing in your idea? This is why I’m an applied mathematician.
I haven’t seen a truly convincing study comparing Nystrom approximations to Random Feature Map approximations. On the one hand, a NIPS 2012 paper compared the two and argued that because the bases Nystrom approximations use are adaptive to the problem, whereas those used by RFMs are not, Nystrom approximations are more efficient.
This is an indisputable point, but the experiments done in the paper are not convincing: they used the same number of samples in Nystrom approximations as random features in RFMs. Instead, the fair comparison is to allot both methods the same number of FLOPs; since Nystrom methods involve an additional pseudoinversion of a (huge, for a large number of samples) matrix, one can potentially use more random features than sample points for the same number of FLOPs. Also, as always, it is important to choose an appropriate kernel, and this paper only considered RBF kernels.
On the other hand, recently IBM researchers have used large-scale RFM approaches to get state-of-the-art performance on vision and speech tasks. Their results use the simplest of RFM approaches: linear regression on top of a very large number (~400K) of random Fourier features. The key to their success is a well-engineered ADMM approach to parallelizing the solution of the system. It’s not clear to me that a similar approach couldn’t be used to scale up a Nystrom-based solution and obtain similar results. Also, I’ve not seen anyone implement Wainwright et al.’s divide-and-conquer approach to kernel regression; theoretically, this could also be used to distribute the cost of a truly large-scale Nystrom implementation.
Personally, I’m of the opinion that a well-engineered Nystrom solution (using uniform sampling, even) should always outperform a well-engineered RFM solution. But, I’m interested in seeing this convincingly demonstrated.
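To make the two approximations concrete, here is a small NumPy sketch that builds both for an RBF kernel and measures the relative Frobenius error against the exact kernel matrix. It is a toy under stated assumptions, not a reproduction of any of the experiments mentioned above: the data are random, the helper names (`rbf_kernel`, `nystrom`, `random_fourier`, `rel_err`) and the value of `gamma` are mine, and the (landmarks, features) budgets are arbitrary rather than calibrated to equal FLOPs.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Exact RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom(X, gamma, m, rng):
    """Nystrom approximation from m uniformly sampled landmark points."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)        # n x m block of the kernel
    W = C[idx]                              # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T      # the pseudoinversion step

def random_fourier(X, gamma, D, rng):
    """Approximation from D random Fourier features with random phases."""
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    return Z @ Z.T

def rel_err(K, K_hat):
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))           # hypothetical toy data
gamma = 0.1
K = rbf_kernel(X, X, gamma)

errors = {}
for m, D in [(50, 50), (50, 200)]:          # (landmarks, random features)
    errors[(m, D)] = (rel_err(K, nystrom(X, gamma, m, rng)),
                      rel_err(K, random_fourier(X, gamma, D, rng)))
    print(m, D, errors[(m, D)])
```

A convincing comparison would replace the hand-picked budgets with ones matched on actual cost, and would measure downstream regression error rather than kernel approximation error.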
Here’s an early Christmas gift to you: a list of podcasts I enjoy! For listening while you’re doing all your holiday season travelling.
KCRW’s Left, Right, and Center
BBC World Update: Daily Commute
Common Sense with Dan Carlin
PRI’s The World: Latest Edition
On the Media
The Young Turks Video Podcast
Best of the Left Podcast
The David Pakman Show
TWIB! Prime (This Week in Blackness)
MSNBC Rachel Maddow (video)
NPR: Intelligence Squared U.S. Debates Podcast
The Complete Guide to Everything
My Brother, My Brother, and Me
Dogma Debate with David Smalley
No Religion Required Podcast
Thank God I’m Atheist Podcast
The Heathen Half Hour
The Herd Mentality
The Imaginary Friends Show
The Scathing Atheist
The Thinking Atheist Podcast
For one of our projects at eBay, I’ve been attempting to do a Poisson MLE fit on a large enough dataset that Fisher scoring is not feasible. The problem is that the data also has such large variance in the scales of the observations that stochastic gradient descent does not work, period: because of the exponentiation involved, you need to take a very tiny step size to avoid overflow errors, but this step size is shared by all the parameters, so you can’t make progress in this way.
An alternative is Adagrad, which maintains separate step sizes for each parameter, but that seems to run into the same divergence issue, albeit much slower: slow enough that it’s unclear to me whether the fit is actually diverging, or if it ‘just’ needs to run a couple hundred iterations before it converges. So for the past week I’ve been massaging the initial conditions and amount of information I hard-bake into the parametrization of the problem to see if I can get Adagrad to work reasonably. Still no luck.
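For reference, this is roughly what the Adagrad update looks like for a Poisson regression with a log link. It is a sketch on small, well-scaled synthetic data, which is exactly the regime where things do not go wrong; the model form, the function name `adagrad_poisson`, and every hyperparameter here are illustrative assumptions, not the eBay setup.

```python
import numpy as np

def adagrad_poisson(X, y, lr=0.1, epochs=20, eps=1e-8, seed=0):
    """Fit w in y ~ Poisson(exp(X @ w)) with Adagrad on the per-sample
    negative log-likelihood exp(x.w) - y * (x.w)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    G = np.zeros(d)                          # running sum of squared gradients
    for _ in range(epochs):
        for i in rng.permutation(n):
            # The exponential here is what overflows on badly scaled data.
            g = (np.exp(X[i] @ w) - y[i]) * X[i]
            G += g * g
            w -= lr * g / (np.sqrt(G) + eps)   # separate step size per parameter
    return w

rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3, 0.2])          # hypothetical ground truth
X = rng.standard_normal((2000, 3))
y = rng.poisson(np.exp(X @ w_true))
w_hat = adagrad_poisson(X, y)
print(w_hat)
```

Because the per-parameter step is `lr / sqrt(G)`, a single huge early gradient shrinks that coordinate’s step for the rest of the run, which may be one reason progress looks so slow when the observations span many scales.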
I just came across Raskutti and Mukherjee’s paper “The information geometry of mirror descent”, which seems relevant to my situation and is a nice (albeit in need of proofreading) read. The main result of the paper is that the mirror descent algorithm associated with the Bregman divergence of a function \(G\) is equivalent to natural gradient descent in the dual manifold with metric tensor defined by the Hessian of the convex conjugate of \(G\). This sounds wonderful, because the connection between exponential families and Bregman divergences suggests that one can then perform a first-order optimization in a certain dual manifold, and reap all the benefits of having done Fisher scoring, a second-order algorithm, in parameter space. I have to reread the paper carefully to get a handle on the precise manipulations required, but this may be a nice alternative to Adagrad for my problem.
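In symbols, my understanding of the equivalence is the following. The notation is mine and may not match the paper’s: \(f\) is the objective, \(\alpha\) the step size, \(B_G\) the Bregman divergence of a strictly convex differentiable \(G\), and \(H\) the convex conjugate of \(G\). Mirror descent is

\[
\theta_{t+1} = \arg\min_{\theta} \left\{ \langle \theta, \nabla f(\theta_t) \rangle + \frac{1}{\alpha} B_G(\theta, \theta_t) \right\}
\quad\Longleftrightarrow\quad
\nabla G(\theta_{t+1}) = \nabla G(\theta_t) - \alpha \nabla f(\theta_t).
\]

Writing \(\mu = \nabla G(\theta)\), so that \(\theta = \nabla H(\mu)\), the chain rule gives \(\nabla_\mu f(\nabla H(\mu)) = \nabla^2 H(\mu)\, \nabla f(\theta)\), and the update becomes

\[
\mu_{t+1} = \mu_t - \alpha \left[ \nabla^2 H(\mu_t) \right]^{-1} \nabla_\mu f(\nabla H(\mu_t)),
\]

which is a natural gradient step in the \(\mu\) coordinates with metric \(\nabla^2 H\).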
I wonder: is there a similarly geometric interpretation of what composite mirror descent does?
Update: A more readable recent paper, “Stochastic Discriminative EM” from UAI 2014, does a better job of explaining the interpretation of the dual manifold and has a very similar algorithm.
I’m looking at two different models for learning polynomial functions, and trying to determine if they are equivalent. After a couple days of thinking, I’ve reduced the question to the following:
Can every symmetric polynomial of degree \(r\) in \(d\) variables that has no constant term be written as a sum of the \(r\)-th powers of linear polynomials in \(d\) variables and a homogeneous polynomial of degree \(r\) each of whose monomials involves at most \(d-1\) variables?
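As a sanity check, the smallest interesting case \(d = r = 2\) with \(p(x, y) = xy\) does work out, at least under my reading of the question (real coefficients, and negative coefficients allowed in the homogeneous part):

\[
xy = \left( \tfrac{1}{\sqrt{2}} (x + y) \right)^2 - \tfrac{1}{2} x^2 - \tfrac{1}{2} y^2 .
\]

Here the first term is the square of a linear polynomial in two variables, and each remaining monomial involves only \(d - 1 = 1\) variable.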