Installing Spark with Hadoop 2 using spark-ec2

YARN does not seem to be configured correctly when you use the spark-ec2 script to install a Spark cluster on EC2. Here’s my short workaround for getting YARN to work (with a simple Python script at the bottom):

  1. launch a cluster with e.g. spark-ec2 -k <keyname> -i <keyfile> -s <num-slaves> --instance-type=<type> --placement-group=<placementgroupname> --hadoop-major-version=2 --copy-aws-credentials launch <clustername>
    This automatically copies your AWS access keys into Hadoop's core-site.xml configuration file so you can pull data from S3 into HDFS. Unfortunately, Hadoop is configured to use YARN, but the YARN installation is broken. The next few steps fix this (they are only an outline; run the Python script at the bottom, after exporting your AWS keys in step 3, to implement them).
  2. ssh into the cluster master; you can use spark-ec2 get-master <clustername> to get the public dns for the master
  3. export AWS_ACCESS_KEY_ID=<key> and export AWS_SECRET_ACCESS_KEY=<key>
  4. shut down yarn, the (ephemeral) hdfs, tachyon and spark
  5. change the mapred-site.xml and yarn-site.xml configuration files of (ephemeral) hdfs to correctly configure YARN
  6. open ports 8025, 8030, 8040 (and maybe 8033, 9000) of the master security group to the slave security group (see the boto sketch after this list)
  7. copy the (ephemeral) hdfs configuration files to all the slave machines
  8. start up (ephemeral) hdfs, yarn, tachyon, then spark in this order
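
For step 6, a minimal sketch using the boto 2.x library is below; it can use the credentials exported in step 3. The region, the cluster name "mycluster" and the inclusion of the optional ports are assumptions, so adjust them to your setup (spark-ec2 names the two groups <clustername>-master and <clustername>-slaves).

    # Hedged sketch for step 6: allow the slave group to reach the master on the
    # YARN ports. Region, cluster name and the optional ports are assumptions.
    import boto.ec2
    from boto.exception import EC2ResponseError

    # picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
    conn = boto.ec2.connect_to_region("us-east-1")

    groups = {g.name: g for g in conn.get_all_security_groups()}
    master_group = groups["mycluster-master"]   # replace with <clustername>-master
    slave_group = groups["mycluster-slaves"]    # replace with <clustername>-slaves

    for port in [8025, 8030, 8040, 8033, 9000]:
        try:
            master_group.authorize(ip_protocol="tcp", from_port=port,
                                   to_port=port, src_group=slave_group)
        except EC2ResponseError:
            pass  # the rule probably exists already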

You should now be able to pull data from S3 (using s3n:// URLs) into HDFS, use Hadoop, run Spark jobs, etc.
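
As a quick check, a small PySpark job along these lines reads from S3 over s3n:// and writes into the ephemeral HDFS; submit it from the master (e.g. with /root/spark/bin/spark-submit). The bucket, key and output path are placeholders, not values from this post.

    # Sanity check: read a file from S3 over s3n:// and copy it into HDFS.
    from pyspark import SparkContext

    sc = SparkContext(appName="s3n-hdfs-check")

    lines = sc.textFile("s3n://my-bucket/some/input.txt")   # pulls data from S3
    print(lines.count())                                     # forces the read

    lines.saveAsTextFile("hdfs:///user/root/s3n-check")      # writes into HDFS
    sc.stop()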

Run this Python script on the master to implement steps 4–8. You may need to open some ports manually in the master security group (e.g. 8033, 9000); check the YARN log files under /mnt/ephemeral-hdfs on the master and on a slave if you run into issues.
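
A rough sketch of what such a script has to do for steps 4, 5, 7 and 8 (step 6 is the boto sketch above): the paths, start/stop script names and YARN property values below are assumptions based on the standard spark-ec2 layout under /root, not the original script's values, so check them against your cluster before running anything.

    #!/usr/bin/env python
    # Rough fix-up sketch for steps 4, 5, 7 and 8 (not the original script).
    # Paths, helper scripts and property values are assumptions based on the
    # standard spark-ec2 AMI layout; verify them on your cluster first.
    import os
    import socket
    import subprocess

    EPHEMERAL_HDFS = "/root/ephemeral-hdfs"
    SPARK_HOME = "/root/spark"
    TACHYON_HOME = "/root/tachyon"
    CONF_DIR = os.path.join(EPHEMERAL_HDFS, "conf")
    MASTER = socket.gethostname()  # the master's (private) hostname


    def run(cmd):
        print("running: " + cmd)
        subprocess.call(cmd, shell=True)


    def write_site(path, properties):
        # Overwrites the file with a minimal <configuration>; if your existing
        # mapred-site.xml / yarn-site.xml contain other properties you care
        # about, merge instead of overwriting.
        with open(path, "w") as f:
            f.write('<?xml version="1.0"?>\n<configuration>\n')
            for name, value in sorted(properties.items()):
                f.write("  <property>\n"
                        "    <name>%s</name>\n"
                        "    <value>%s</value>\n"
                        "  </property>\n" % (name, value))
            f.write("</configuration>\n")


    # step 4: shut everything down (yarn, ephemeral hdfs, tachyon, spark)
    run(EPHEMERAL_HDFS + "/sbin/stop-yarn.sh")
    run(EPHEMERAL_HDFS + "/sbin/stop-dfs.sh")
    run(TACHYON_HOME + "/bin/tachyon-stop.sh")
    run(SPARK_HOME + "/sbin/stop-all.sh")

    # step 5: a minimal YARN configuration: run MapReduce on YARN and point
    # every node at the resource manager on the master
    write_site(os.path.join(CONF_DIR, "mapred-site.xml"), {
        "mapreduce.framework.name": "yarn",
    })
    write_site(os.path.join(CONF_DIR, "yarn-site.xml"), {
        "yarn.resourcemanager.hostname": MASTER,
        "yarn.nodemanager.aux-services": "mapreduce_shuffle",
    })

    # step 7: push the updated configuration to all slaves; copy-dir is the
    # rsync helper that ships with spark-ec2 clusters
    run("/root/spark-ec2/copy-dir " + CONF_DIR)

    # step 8: bring everything back up in order: hdfs, yarn, tachyon, spark
    run(EPHEMERAL_HDFS + "/sbin/start-dfs.sh")
    run(EPHEMERAL_HDFS + "/sbin/start-yarn.sh")
    run(TACHYON_HOME + "/bin/tachyon-start.sh all Mount")  # or "all SudoMount"
    run(SPARK_HOME + "/sbin/start-all.sh")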