sparklyr + Spark 2.0 + AWS

Background

The new ‘sparklyr’ package provides a ‘dplyr’ backend for interacting with Spark. It also opens up Spark’s machine learning library, which in effect expands the functionality previously available to R via SparkR.

I wanted to see what it would take to get a functional Spark 2.0 cluster to interact with this package.

AWS/EC2 Setup

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on Instances
  • Step 1 – Amazon Machine Image (AMI): Ubuntu
  • Step 2 – Instance Type: m4.large
  • Step 3 – No changes
  • Step 4 – Storage: 20 GiB
  • Step 5 – No changes
  • Step 6 – Security Group Name: spark. Click Add Rule, select Type: All Traffic | Source: My IP

    Click Add Rule, select Type: All Traffic | Source: Custom and type the first two numbers of the IP address followed by ‘.0.0/16’. (So, if your server’s internal IP address is 172.31.2.200 then you’d enter 172.31.0.0/16. This gives every server in your VPC access to each other.)

  • After clicking Launch the “Select existing pair or create a new pair” screen will appear; select: Create a new key pair

    Key pair name: spark

    Click Download Key Pair

    Save the file

    Click Launch Instances

  • Go to the Instances section in the EC2 Dashboard section and click on the new instance
  • Copy the Public DNS address to a text editor on your laptop; from here on we’ll refer to it as MY_PUBLIC_DNS
  • Copy the Private IP address to a text editor on your laptop; from here on we’ll refer to it as MY_PRIVATE_IP

Key and Connection setup from laptop

Next, you need to add a passphrase to the private key (the “Key Pair” file) downloaded from AWS.

For ease, we’ll use spark as the Key Passphrase.

AWS provides step-by-step instructions for this part in its EC2 documentation. If you will be connecting from macOS or Linux rather than PuTTY, a rough equivalent is sketched below.
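This is only a sketch rather than the official AWS steps; the key-file path (~/Downloads/spark.pem) is an assumption, and MY_PUBLIC_DNS stands for the address you copied earlier:

# Restrict the key file's permissions (SSH refuses keys that are readable by others)
chmod 400 ~/Downloads/spark.pem
# Optionally protect the key with a passphrase; we use "spark", matching this walkthrough
ssh-keygen -p -f ~/Downloads/spark.pem
# Connect to the Master as the default "ubuntu" user
ssh -i ~/Downloads/spark.pem ubuntu@MY_PUBLIC_DNS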

Master server configuration

Terminal session

  • Start a new terminal session using the instructions in the Key and Connection setup from laptop section.
  • The initial passphrase you need to enter when prompted with “Passphrase for key 'imported-openssh-key':” is: spark

Install Java

Tip: The terminal commands inside the boxes can be copied and pasted into your terminal session. In PuTTY you can use right-click as the “paste” command.

The latest version of Java needs to be installed

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
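To confirm the installation worked, you can print the Java version; the exact output will vary, but it should mention version 1.8:

java -version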

Install Scala

Spark is based on Scala, so we need to install it early in the process

sudo wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz
cd /usr/local/src/
sudo mkdir scala
cd /home/ubuntu/
sudo tar -xvf scala-2.11.8.tgz -C /usr/local/src/scala

We’ll also need to tell Ubuntu where we installed Scala. To do this we update a setup file called ‘.bashrc’.

For editing, we’ll use an application called ‘vi’. This application is a very pared-down file editor. (An equivalent non-interactive approach is shown after these steps.)

  • Open the bashrc file
    vi .bashrc
  • Use the arrow keys to go to the bottom of the file
  • Press the {Insert} key
  • Type:
    export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
    export PATH=$SCALA_HOME/bin:$PATH
  • Press {Esc}
  • Type :wq and Enter (to save and close)
  • Ask Ubuntu to read the new bashrc file
    . .bashrc
  • Verify that the new Scala version is recognized. The command below should return something like Scala code runner version 2.11.8 …
    scala -version
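If you would rather not use vi, an equivalent non-interactive approach is to append the same two lines from the shell, re-read the file, and check the version again; this simply mirrors the steps above:

# Append the two export lines to .bashrc without opening an editor
echo 'export SCALA_HOME=/usr/local/src/scala/scala-2.11.8' >> ~/.bashrc
echo 'export PATH=$SCALA_HOME/bin:$PATH' >> ~/.bashrc
# Re-read the file and verify Scala is found
. ~/.bashrc
scala -version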

Install Spark

First, we will install git

sudo apt-get install git

Now, we’ll start by updating our Ubuntu server and then download the latest Spark installation files

sudo apt-get update
sudo apt-get upgrade
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz
tar -xvzf spark-2.0.0.tgz
cd spark-2.0.0

Building Spark to support ‘sparklyr’

We have to build Spark on the server; this is a change from the Spark 1.6 installation process. Spark recommends using a tool called ‘Maven’ to build Spark, but I’ve been more successful using ‘SBT’.

To enable ‘hive’ support, we’ll add two options at the end of the build command; this is different from the original SparkR installation.

sudo build/sbt package -Phive -Phive-thriftserver
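The build can take quite a while. As a rough sanity check once it finishes, you can ask the freshly built copy for its version from inside the spark-2.0.0 directory (assuming the build completed without errors):

bin/spark-submit --version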

Install R

We need to update the package list so that Ubuntu installs the latest R version. Here is the reference: https://cran.r-project.org/bin/linux/ubuntu/

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
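To verify that the newer R from the CRAN repository was installed rather than the stock Ubuntu version, you can check the version number (the exact value will vary):

R --version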

Start the Master

  • We’ll start the master service and close our terminal session
    sudo spark-2.0.0/sbin/start-master.sh
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080
  • Note the Spark Master URL; it should be something like spark://ip-[MY_PRIVATE_IP but with dashes]:7077 (see the example below)
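For example, using the sample private IP from earlier (172.31.2.200), the Master URL shown on that page would look like this:

spark://ip-172-31-2-200:7077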

Create the AWS Image

Now that we have completed the necessary setup, we will take a snapshot of the current state of our Master server. We will use this image to easily deploy the worker servers. (A command-line alternative to the console steps is sketched after the list below.)

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on Instances
  • Right-click on the instance used for the Master
  • Select Image and then Create Image
  • Image Name: spark
  • Click Create Image
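If you have the AWS CLI installed and configured on your laptop, the same image can be created from the command line; this is only an alternative sketch, and the instance ID below is a placeholder for your Master’s actual ID:

aws ec2 create-image --instance-id i-0123456789abcdef0 --name spark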

Install RStudio

The steps below install the current version. To find updated instructions go to: https://www.rstudio.com/products/rstudio/download-server/ and select Debian/Ubuntu

  • Start a new terminal session
  • Download and install RStudio Server
    wget https://download2.rstudio.org/rstudio-server-0.99.903-amd64.deb
    sudo gdebi rstudio-server-0.99.903-amd64.deb
    sudo adduser rstudio
  • Install the prerequisites to get ‘devtools’ to work. This step won’t be needed once ‘sparklyr’ is on CRAN
    sudo apt-get -y install libcurl4-gnutls-dev
    sudo apt-get -y install libssl-dev
  • Start the Master server
    sudo spark-2.0.0/sbin/start-master.sh

Spark Workers

Launch the workers

The AMI we created earlier will be used to deploy the workers.

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on AMIs
  • Right-click on the spark AMI
  • Select Launch
  • Step 2 – Instance Type: m4.large [You can select a different server size; a smaller server will run more slowly but may be more cost effective]
  • Step 3 – Number of instances: 3 [A different number of instances can be selected]
  • Step 4 – Storage: 20 GiB
  • Step 5 – Name: worker
  • Step 6 – Select an existing group | Name: spark
  • Click Launch
  • After clicking Launch the “Select existing pair or create a new pair” screen will appear; select: Choose an existing key pair

    Key pair name: spark

  • Launch Instance

Starting and connecting the workers

This part is a little repetitive: you will need to follow these steps for each of the workers that were deployed. Additionally, if you stop the instances in AWS, you will need to follow these steps again (a scripted alternative is sketched after this list):

  • Go to the AWS console: https://console.aws.amazon.com
  • Select EC2
  • Click on Instances
  • Select a worker and note the Public DNS
  • Start a new terminal session that connects to that worker
  • Start the slave service and close the terminal session. Important: use the Master’s private IP with dots, not dashes
    sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080 (the Master’s address); the new node(s) should be listed
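With several workers this gets tedious, so the per-worker steps can be scripted. The loop below is only a sketch: it assumes an OpenSSH client rather than PuTTY, a key file at ~/Downloads/spark.pem, and that the WORKER_DNS_* and MASTER_PRIVATE_IP placeholders are replaced with your actual values:

for worker in WORKER_DNS_1 WORKER_DNS_2 WORKER_DNS_3; do
  # Start the slave service on each worker, pointing it at the Master (dots, not dashes)
  ssh -i ~/Downloads/spark.pem ubuntu@$worker \
    "sudo /home/ubuntu/spark-2.0.0/sbin/start-slave.sh spark://MASTER_PRIVATE_IP:7077"
done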

Connect RStudio to Spark via ‘sparklyr’

  • Install the package using ‘devtools’
    install.packages("devtools")
    devtools::install_github("rstudio/sparklyr")
  • Load the libraries
    library(sparklyr)
    library(dplyr)
  • Connect to the Spark cluster. Unlike the SparkR connection, the Spark version needs to be defined. Important: Use dashes, not dots, in the host name
    my_context <- spark_connect(master = "spark://ip-[MY_PRIVATE_IP with Dashes]:7077", spark_home = "/home/ubuntu/spark-2.0.0", version="2.0.0")
  • Run a test script that compares the minimum syntax to go from ‘dplyr’ to ‘sparklyr’, in other words from running the code in memory locally to running it in Spark
ds <- mtcars %>%
  mutate(over20mpg = ifelse(mpg>20, "Yes","No")) %>%
  select(-mpg) %>%
  filter(qsec>15)

spark_ds <- copy_to(my_context, mtcars,  overwrite = TRUE) %>%
  mutate(over20mpg = ifelse(mpg>20, "Yes","No")) %>%
  select(-mpg) %>%
  filter(qsec>15)

print(head(ds));print(head(spark_ds))
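# Note (added sketch): ds is a local data frame, while spark_ds is a remote Spark
# table, so head(spark_ds) only runs a lazy query against the cluster. To pull the
# Spark-side result back into R as a local data frame, use dplyr's collect():
local_ds <- spark_ds %>% collect()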

spark_disconnect(my_context)

This script loads the data and fits the random forest model in one pipeline code segment:

my_context <- spark_connect(master = "spark://ip-[MY_PRIVATE_IP with Dashes]:7077", spark_home = "/home/ubuntu/spark-2.0.0", version="2.0.0")
spark_rf <- copy_to(my_context, mtcars,  overwrite = TRUE) %>%
  mutate(over20mpg = ifelse(mpg>20, "Yes","No")) %>%
  select(-mpg) %>%
  filter(qsec>15) %>% 
  ml_random_forest(over20mpg~.)
print(summary(spark_rf))
spark_disconnect(my_context)