Set Up a Spark 2.0 Cluster + R in AWS

Background

I have been compiling step-by-step documentation, using help guides, blog posts and insights from previous exercises. Now that Spark 2.0 is out, I figured it was a good opportunity to update my 1.6 documentation and to make it available to others.

The plan is to leverage a feature in AWS that allows you to replicate an existing server setup: the Amazon Machine Image (AMI).

AWS/EC2 Setup

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on Instances
  • Step 1 – Amazon Machine Image: Ubuntu
  • Step 2 – Instance Type: m4.large
  • Step 3 – No changes
  • Step 4 – Storage: Size 30 GiB
  • Step 5 – No changes
  • Step 6 – Security Group Name: spark. Click Add Rule, select Type: All Traffic | Source: My IP

    Click Add Rule, select Type: All Traffic | Source: Custom and type the first two numbers of the IP address followed by ‘.0.0/16’. (So, if your server’s internal IP address is 172.31.2.200, you’d enter 172.31.0.0/16. This gives every server in your VPC access to each other.)

  • After clicking Launch, the “Select existing pair or create a new pair” screen will appear; select: Create a new pair

    Key pair name: spark

    Click Download Key Pair

    Save the file

    Click Launch Instances

  • Go to the Instances section in the EC2 Dashboard section and click on the new instance
  • Copy the Public DNS address to a text editor on your laptop; from here on we’ll refer to it as MY_PUBLIC_DNS
  • Copy the Private IP address to a text editor on your laptop; from here on we’ll refer to it as MY_PRIVATE_IP

Key and Connection setup from laptop

Next, you need to add a passphrase to the certificate in the “Key Pair” file downloaded from AWS.

For ease, we’ll use spark as the Key Passphrase.

AWS provides step-by-step instructions for this part in its EC2 documentation (see the guide on connecting to your Linux instance from Windows using PuTTY).
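
If you are on macOS or Linux instead of Windows, a rough equivalent from a plain terminal looks like this (a sketch; spark.pem stands in for whatever name you gave the downloaded key file):

chmod 400 spark.pem                    # restrict permissions so ssh will accept the key
ssh-keygen -p -f spark.pem             # add the passphrase (we use: spark)
ssh -i spark.pem ubuntu@MY_PUBLIC_DNS  # open a terminal session on the instance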

Master server configuration

Terminal session

  • Start a new terminal session using the instructions in the Key and Connection setup from laptop section.
  • The initial password you need to enter when prompted with “Passphrase for key 'imported-openssh-key':” is: spark

Install Java

Tip: The terminal commands inside the boxes can be copied and pasted into your terminal session. In PuTTY you can use right-click as the “paste” command.

The latest version of Java needs to be installed

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
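
To confirm that Java 8 is now the default, check the version; the output should mention 1.8:

java -version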

Install Scala

Spark is based on Scala, so we need to install it early in the process

wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz
sudo mkdir -p /usr/local/src/scala
sudo tar -xvf scala-2.11.8.tgz -C /usr/local/src/scala

We’ll also need to tell Ubuntu where we installed Scala. To do this we update a setup file called ‘.bashrc’ (if you’d rather not edit it by hand, a shell alternative follows this list).

For editing, we’ll use an application called ‘vi’. This application is a very pared-down file editor.

  • Open the bashrc file
    vi .bashrc
  • Use the arrow keys to go to the bottom of the file
  • Press the {Insert} key
  • Type:
    export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
    export PATH=$SCALA_HOME/bin:$PATH
  • Press {Esc}
  • Type :wq and Enter (to save and close)
  • Ask Ubuntu to read the new bashrc file
    . .bashrc
  • Verify that the new Scala version is recognized. The command below should return something like Scala code runner version 2.11.8 …
    scala -version
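
If you prefer to skip vi, the same two lines can be appended and reloaded from the shell (equivalent to the steps above):

echo 'export SCALA_HOME=/usr/local/src/scala/scala-2.11.8' >> ~/.bashrc
echo 'export PATH=$SCALA_HOME/bin:$PATH' >> ~/.bashrc
. ~/.bashrc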

Install Spark

First, we will install git

sudo apt-get install git

Now, we’ll start by updating our Ubuntu server and then download the latest Spark installation files

sudo apt-get update
sudo apt-get upgrade
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz
tar -xvzf spark-2.0.0.tgz
cd spark-2.0.0

We have to build Spark on the server. The build command syntax changed going from Spark 1.6 to 2.0. Spark recommends using a tool called ‘Maven’ to build it, but I’ve been more successful using ‘SBT’.

sudo build/sbt package

Side comment – In my original notes I wrote ‘Go watch paint dry’, since it took the VM on my laptop 78 minutes to complete. In AWS, using the m4.large instance type, it should take around 4 minutes.
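
For reference, the Maven route that Spark recommends would look roughly like this (a sketch based on the standard Spark build instructions; I had better luck with SBT):

sudo build/mvn -DskipTests clean package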

Install R

We need to update the package list so that Ubuntu installs the latest R version. Here is the reference: https://cran.r-project.org/bin/linux/ubuntu/

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core

Install SparkR

This accomplishes two things: it builds the source for the ‘SparkR’ library in R, and it provides the SparkR installation that needs to be present on all of the workers so that your SparkR commands work.

cd /home/ubuntu/spark-2.0.0
sudo R/install-dev.sh  
sudo update-java-alternatives -s java-8-oracle
sudo R CMD javareconf

Pre-Install R packages

This was a big revelation for me: when you run functions using the spark.lapply command, any packages the function needs must be installed on each worker. The Spark documentation demonstrates ‘spark.lapply’ with the lm function, which ships with base R, so that example works everywhere; models like Random Forest will fail unless their packages are installed on every worker (a sketch follows the commands below).

sudo su - -c "R -e \"install.packages('ggplot2', repos='http://cran.us.r-project.org')\""
sudo su - -c "R -e \"install.packages('caret', repos='http://cran.us.r-project.org')\""
sudo su - -c "R -e \"install.packages('randomForest', repos='http://cran.us.r-project.org')\""
sudo su - -c "R -e \"install.packages('e1071', repos='http://cran.us.r-project.org')\""
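
To illustrate the point, here is a hypothetical sketch, assuming the cluster from this guide is up, a SparkR session is active, and randomForest was pre-installed as above:

# Train one small randomForest model per mtry value; each task runs on a worker,
# which is why the package must already be installed there.
models <- spark.lapply(c(2, 4, 6), function(mtry) {
  library(randomForest)
  randomForest(mpg ~ ., data = mtcars, mtry = mtry, ntree = 50)
})
length(models)  # should return 3, one fitted model per input value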

Start the Master

  • We’ll start the master service and close our terminal session
    sudo spark-2.0.0/sbin/start-master.sh
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080
  • Note the Spark Master URL; it should be something like: spark://ip-[MY_PRIVATE_IP but with dashes]:7077

Create the AWS Image

Now that we have completed the necessary setup, we will take a snapshot of the current state of our Master server. We will use this image to easily deploy the worker servers.

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on Instances
  • Right-click on the instance for the Master
  • Select Image and then Create Image
  • Image Name: spark
  • Click Create Image

Install RStudio

The steps below install the current version. To find updated instructions go to: https://www.rstudio.com/products/rstudio/download-server/ and select Debian/Ubuntu

  • Start a new terminal session
  • Download and install RStudio Server
    wget https://download2.rstudio.org/rstudio-server-0.99.903-amd64.deb
    sudo gdebi rstudio-server-0.99.903-amd64.deb
    sudo adduser rstudio
  • Start the Master server (creating the image can reboot the instance, so the master service may need to be restarted)
    sudo spark-2.0.0/sbin/start-master.sh
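
If the RStudio page does not come up on port 8787, RStudio Server ships a self-check you can run:

sudo rstudio-server verify-installation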

Connect RStudio to Spark via SparkR

Make sure to visit: https://spark.apache.org/docs/latest/sparkr.html. There was a syntax change between Spark 1.6 and 2.0: the ‘sparkRSQL.init’ command has been deprecated, so we’ll use ‘sparkR.session’.
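
For anyone porting 1.6 code, the old initialization looked roughly like this (deprecated in 2.0; shown only for comparison):

# Spark 1.6 style (deprecated): separate Spark context and SQL context
sc <- sparkR.init(master = "spark://[MY_PRIVATE_IP]:7077")
sqlContext <- sparkRSQL.init(sc)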

  • Navigate to RStudio: http://MY_PUBLIC_DNS:8787/
  • Log on as ‘rstudio’
  • Confirm that the latest version of R is loaded (as of this writing, 3.3.1)
  • In the console or a new R Script run the following commands
    library(SparkR, lib.loc = "/home/ubuntu/spark-2.0.0/R/lib")
    sparkR.session(master = "spark://[MY_PRIVATE_IP]:7077", sparkHome = "/home/ubuntu/spark-2.0.0", enableHiveSupport=FALSE)

If successful, the console should return something like: Java ref type org.apache.spark.sql.SparkSession id 1. You may also see some warnings in red; as long as these are warnings and not failures, we should be good to continue.

Spark Workers

Launch the workers

The AMI we created earlier will be used to deploy the workers.

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on AMIs
  • Right-click on the spark AMI
  • Select Launch
  • Step 2 – Instance Type: m4.large [You can select a different server size; a smaller server will run slower but may be more cost-effective]
  • Step 3 – Number of instances: 3 [A different number of instances can be selected]
  • Step 4 – Storage: Size 20 GiB
  • Step 5 – Name: worker
  • Step 6 – Select an existing group | Name: spark
  • Click Launch
  • After clicking Launch, the “Select existing pair or create a new pair” screen will appear; select: Choose an existing key pair

    Key pair name: spark

  • Click Launch Instances

Starting and connecting the workers

This part is a little repetitive: you will need to follow these steps for each of the workers that were deployed, and if you stop the instances in AWS you will need to follow them again (a scripted alternative follows this list):

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on Instances
  • Select a worker and note the Public DNS
  • Start a new terminal session that connects to that worker
  • Start the slave service and close the terminal session
    sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080; the new node(s) should be listed
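
To avoid repeating the steps by hand, a loop like this from your laptop should work (a sketch; the WORKER*_DNS names are placeholders for the Public DNS values shown in the EC2 console):

for host in WORKER1_DNS WORKER2_DNS WORKER3_DNS; do
  ssh -i spark.pem ubuntu@"$host" \
    "sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077"
done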

Testing SparkR

I highly recommend keeping two web browser sessions open, one for the Spark UI (port 8080) and one for RStudio (port 8787). This will show how the applications interact with each other as each SparkR command is executed.

  • Navigate to http://MY_PUBLIC_DNS:8787
  • Log on as ‘rstudio’
  • In the console or a new R Script run the following commands
    library(SparkR, lib.loc = "/home/ubuntu/spark-2.0.0/R/lib")
    sparkR.session(master = "spark://[MY_PRIVATE_IP]:7077", sparkHome = "/home/ubuntu/spark-2.0.0", enableHiveSupport=FALSE)
  • Load the ‘mtcars’ dataset into Spark
    ds <- createDataFrame(mtcars)
  • Run a ‘glm’ model using the Spark cluster (an optional scoring sketch follows this list)
    s_glm <- SparkR::glm(mpg ~ disp, data = ds, family = "gaussian")
    print(summary(s_glm))
  • Close the Spark connection
    sparkR.session.stop()
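
As an optional extra step, run before sparkR.session.stop(): you can score the fitted model back on the cluster (a sketch; predict() on a SparkR model returns a SparkDataFrame, so we collect a few rows to inspect them locally):

preds <- predict(s_glm, ds)
head(collect(preds))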