YARN does not seem to be configured correctly when you use the spark-ec2 script to install a Spark cluster on EC2. Here’s my short workaround for getting YARN to work (with a simple python script at the bottom):
1. launch a cluster with, e.g.,
spark-ec2 -k <keyname> -i <keyfile> -s <numslaves> --instance-type=<type> --placement-group=<placementgroupname> --hadoop-major-version=2 --copy-aws-credentials launch <clustername>
This automatically copies your AWS access keys into the core-site.xml configuration file for Hadoop, so you can pull data from S3 into HDFS. Unfortunately, Hadoop is configured to use YARN, but the YARN installation is broken. The next couple of steps will fix this (they are outlines; run the python script at the bottom, after exporting your AWS keys in step 3, to implement them).
2. ssh into the cluster master; you can use
spark-ec2 get-master <clustername>
to get the public DNS of the master.
3. export your AWS keys on the master:
export AWS_ACCESS_KEY_ID=<key>
export AWS_SECRET_ACCESS_KEY=<key>
4. shut down YARN, the (ephemeral) HDFS, Tachyon and Spark
5. change the mapred-site.xml and yarn-site.xml configuration files of the (ephemeral) HDFS to correctly configure YARN (see the sketch after this list)
6. open ports 8025, 8030, 8040 (and maybe 8033, 9000) of the master security group to the slave security group
7. copy the (ephemeral) HDFS configuration files to all the slave machines
8. start up the (ephemeral) HDFS, YARN, Tachyon, then Spark, in that order
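The python script at the bottom implements steps 4–8 in full; to make the outline concrete, here is a rough sketch of the same sequence. The paths, helper script names and configuration values below (the /root/... install locations, the /root/spark-ec2/slaves file, and the YARN/MapReduce properties) are assumptions about a stock spark-ec2 / Hadoop 2 layout rather than the exact contents of that script, so treat it as a starting point.

import os
import subprocess

# Assumed locations on the stock spark-ec2 AMI; adjust if your layout differs.
EPHEMERAL_HDFS = "/root/ephemeral-hdfs"
CONF_DIR = os.path.join(EPHEMERAL_HDFS, "conf")
SLAVES_FILE = "/root/spark-ec2/slaves"

def run(cmd):
    print(cmd)
    subprocess.check_call(cmd, shell=True)

def write_conf(filename, properties):
    # Write a minimal Hadoop-style XML config file. This overwrites the file,
    # so fold in any existing properties you want to keep.
    body = "\n".join(
        "  <property><name>%s</name><value>%s</value></property>" % (k, v)
        for k, v in sorted(properties.items()))
    with open(os.path.join(CONF_DIR, filename), "w") as f:
        f.write('<?xml version="1.0"?>\n<configuration>\n%s\n</configuration>\n' % body)

master = subprocess.check_output("hostname -f", shell=True).decode().strip()

# Step 4: stop spark, tachyon, yarn and the ephemeral hdfs.
run("/root/spark/sbin/stop-all.sh")
run("/root/tachyon/bin/tachyon-stop.sh")
run(EPHEMERAL_HDFS + "/sbin/stop-yarn.sh")
run(EPHEMERAL_HDFS + "/sbin/stop-dfs.sh")

# Step 5: point the node managers at the resource manager on the master
# (the ports match the ones opened in step 6) and tell MapReduce to run on YARN.
write_conf("yarn-site.xml", {
    "yarn.resourcemanager.hostname": master,
    "yarn.resourcemanager.resource-tracker.address": master + ":8025",
    "yarn.resourcemanager.scheduler.address": master + ":8030",
    "yarn.resourcemanager.address": master + ":8040",
    # "mapreduce_shuffle" on Hadoop 2.2+, "mapreduce.shuffle" on 2.0.x.
    "yarn.nodemanager.aux-services": "mapreduce_shuffle",
})
write_conf("mapred-site.xml", {"mapreduce.framework.name": "yarn"})

# Step 6 (opening the ports between the security groups) has to be done from
# the AWS console or CLI, not from the master, so it is not scripted here.

# Step 7: copy the configuration files to every slave.
with open(SLAVES_FILE) as f:
    slaves = [line.strip() for line in f if line.strip()]
for slave in slaves:
    run("rsync -az -e 'ssh -o StrictHostKeyChecking=no' %s/ %s:%s/"
        % (CONF_DIR, slave, CONF_DIR))

# Step 8: start everything back up: hdfs, yarn, tachyon, then spark.
run(EPHEMERAL_HDFS + "/sbin/start-dfs.sh")
run(EPHEMERAL_HDFS + "/sbin/start-yarn.sh")
run("/root/tachyon/bin/tachyon-start.sh all Mount")
run("/root/spark/sbin/start-all.sh")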
You should now be able to pull data from S3 (using s3n:// URLs) to HDFS, use Hadoop, run Spark jobs, etc.
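For example, from the pyspark shell on the master (where the SparkContext sc is already defined), something along these lines should work; the bucket, the S3 path and the /root/ephemeral-hdfs location are placeholders and assumptions, not anything prescribed by the setup above:

# Hypothetical example: pull a dataset from S3 into the ephemeral HDFS with a
# distcp MapReduce job (which exercises YARN), then run a trivial Spark job on it.
import subprocess
subprocess.check_call(
    "/root/ephemeral-hdfs/bin/hadoop distcp "
    "s3n://<bucket>/<path> /user/root/mydata",
    shell=True)
data = sc.textFile("/user/root/mydata")  # plain paths resolve to the default filesystem (the ephemeral HDFS here)
print(data.count())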
Run the python script at the bottom on the master to implement steps 4–8. You may need to open some additional ports manually in the master security group (8033, 9000, etc.); check the YARN log files under /mnt/ephemeral-hdfs on the master and a slave if you have issues.
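If YARN does not come up cleanly, a quick way to look at the newest daemon log is something like the snippet below; the logs/ subdirectory and the yarn-* file names are assumptions about the usual Hadoop layout under /mnt/ephemeral-hdfs.

import glob
import os

# Print the tail of the most recently written YARN daemon log.
# Assumes the standard yarn-<user>-<daemon>-<host>.log naming under
# /mnt/ephemeral-hdfs/logs; adjust the glob if your logs live elsewhere.
logs = glob.glob("/mnt/ephemeral-hdfs/logs/yarn-*")
if logs:
    newest = max(logs, key=os.path.getmtime)
    print(newest)
    with open(newest) as f:
        print("".join(f.readlines()[-50:]))
else:
    print("no yarn-* logs found under /mnt/ephemeral-hdfs/logs")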