.Spark-Bench - Benchmarking Apache Spark v1.0
Overview
These instructions describe how to benchmark Apache Spark, installed using the Apache Bigtop packaging tool, with Spark-Bench.
Spark-Bench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications.
It provides a number of built-in workloads and data generators while also providing users the capability of plugging in their own workloads.
The framework provides three independent levels of parallelism that allow users to accurately simulate a variety of use cases. Some examples of potential uses for Spark-Bench include, but are not limited to:
traditional benchmarking of algorithm implementations
stress-testing clusters
simulating multiple notebook users on one cluster
comparing multiple versions of Spark on multiple clusters
Highlights
Data Generation.
A data generator automatically creates input data sets of various sizes. Spark-Bench can generate data according to many different configurable generators. Generated data can be written to any storage addressable by Spark, including local files, HDFS, S3, etc.
Workloads
The atomic unit of organization in Spark-Bench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk, and write their output, if the user wants it, out to disk. Spark-Bench provides diverse and representative workloads (extensible to new workloads):
Machine learning: logistic regression, support vector machine, matrix factorization
Graph processing: pagerank, svdplusplus, triangle count
Streaming: twitter, pageview
SQL query applications: hive, RDDRelation
Configurations
Spark-Bench allows you to launch multiple spark-submit commands by creating and launching multiple spark-submit scripts. This flexibility allows:
Comparing benchmark times of the same workloads with different Spark settings
Simulating multiple batch applications hitting the same cluster at once.
Comparing benchmark times against two different Spark clusters!
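As an illustration, a single spark-bench config file can hold several spark-submit-config blocks, each launched as its own spark-submit. The sketch below follows the shape of the configuration echoed in the verification output later on this page (a spark-bench/spark-submit-config/workload-suites/workloads nesting); the master URLs and descriptions are placeholder values, not settings taken from this page:

```
spark-bench = {
  spark-submit-config = [
    {
      // hypothetical first target: the standalone cluster
      spark-args = { master = "spark://spark-master:7077" }
      workload-suites = [{
        descr = "SparkPi on the standalone cluster"
        benchmark-output = "console"
        workloads = [{ name = "sparkpi", slices = 10 }]
      }]
    },
    {
      // same workload with different Spark settings, for comparison
      spark-args = { master = "local[*]" }
      workload-suites = [{
        descr = "SparkPi in local mode"
        benchmark-output = "console"
        workloads = [{ name = "sparkpi", slices = 10 }]
      }]
    }
  ]
}
```

Each entry in spark-submit-config becomes a separate spark-submit invocation, which is how the comparisons listed above are expressed in practice.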
Metrics:
supported: job execution time, input data size, data process rate
under development: shuffle data, RDD size, resource consumption, integration with monitoring tool
Workload characterization and study of parameter impacts
Diverse and representative data sets: Wikipedia, Google web graph, Amazon movie review
Characterizing workloads in terms of resource consumption, data access patterns, timing information, job execution time, and shuffle data
Studying the impact of Spark configuration parameters
Prerequisites
OpenJDK8 installed
$ java -version
Docker installed
$ docker version
Apache Hadoop and Apache Spark should be installed from Apache Bigtop packages.
Follow the instructions here to install the Bigtop Hadoop and Spark components
BigTop Setup
Follow the instructions in here
Create docker Containers
1. Create a cluster of Bigtop docker containers
$ ./docker-hadoop.sh -C erp-18.06_debian-9.yaml -c 3
2. Login to each container.
$ docker container exec -it <container_name> bash
3. Verify Hadoop is installed in the containers.
$ hadoop
Configure Docker Containers
1. Follow these steps inside each container.
Create hadoop user
We need to create a dedicated user (hduser) for running Hadoop. This user needs to be added to the hadoop user group:
$ sudo adduser hduser -G hadoop
Give a password for hduser:
$ sudo passwd hduser
Add hduser to the sudoers list:
On Debian:
$ sudo adduser hduser sudo
On CentOS:
$ sudo usermod -aG wheel hduser
Switch to hduser
$ su - hduser
Generate an ssh key for hduser:
$ ssh-keygen -t rsa -P ""
Press <enter> to accept the default file name.
Enable ssh access to the local machine:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
$ chmod 700 $HOME/.ssh
Login as hduser:
$ su - hduser
Set Environment variables
Make sure the environment variables are set for Hadoop. Add them to your bash profile.
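One of the exports in the profile below derives JAVA_HOME by resolving the /usr/bin/java symlink with readlink -f and trimming the trailing bin/java with sed. A small self-contained illustration of that sed expression (the JDK path here is made up for demonstration):

```shell
# Strip the trailing "bin/java" from a resolved java path, the same
# transformation the JAVA_HOME export uses. The sample path is hypothetical.
path=/usr/lib/jvm/java-8-openjdk-amd64/bin/java
echo "$path" | sed "s:bin/java::"
# prints /usr/lib/jvm/java-8-openjdk-amd64/
```

On a real container, `readlink -f /usr/bin/java` supplies the input path by following the alternatives symlink chain to the actual JDK location.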
$ vi ~/.bashrc
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export HADOOP_USER_NAME=hdfs
export CLASSPATH=$CLASSPATH:.
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-2.7.2.jar:$HADOOP_HOME/client/hadoop-hdfs-2.7.2.jar:$HADOOP_HOME/hadoop-auth-2.7.2.jar:/usr/lib/hadoop-mapreduce/*:/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
export SPARK_HOME=/usr/lib/spark
export PATH=$HADOOP_HOME/bin:$PATH
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/bin/hadoop:$CLASSPATH:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-mapreduce/*:.
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/:.
export SPARK_MASTER_HOST=local[*]
$ source ~/.bashrc
Hosts file
Make sure the hosts file is set up correctly. The hosts file should look like the below:
$ sudo vi /etc/hosts
172.17.0.4 spark-master <containerID.apache.bigtop.org> <containerID>
172.17.0.3 spark-slave01 <containerID.apache.bigtop.org> <containerID>
172.17.0.2 spark-slave02 <containerID.apache.bigtop.org> <containerID>
127.0.0.1 localhost localhost.domain
::1 localhost
Spark Configurations
Make sure Spark is configured properly
/usr/lib/spark/conf/spark-env.sh
This file should have STANDALONE_SPARK_MASTER_HOST pointing to the Spark master IP address.
SPARK_MASTER_IP should also be set.
$ cp /usr/lib/spark/conf/slaves.template /usr/lib/spark/conf/slaves
Add the slaves' IP addresses to the file instead of 'localhost'.
Make sure spark-defaults.conf has the master set correctly
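For reference, a minimal spark-defaults.conf pointing at the standalone master might look like the fragment below. The host name matches the /etc/hosts entries above; port 7077 is Spark's default standalone master port, shown here as an illustrative value:

```
# /usr/lib/spark/conf/spark-defaults.conf (illustrative)
spark.master    spark://spark-master:7077
```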
Verify Spark
Ping to make sure you can reach all nodes:
$ ping spark-master
$ ping spark-slave01
$ ping spark-slave02
Make sure the command below shows the spark-master IP with the port as 'ESTABLISHED' and the other Spark ports as listening:
$ netstat -n -a
Stop and start Spark
$ /usr/lib/spark/sbin/stop-all.sh
$ /usr/lib/spark/sbin/start-all.sh
2. Install Spark-Bench
Install dependencies
Login to each container and install the gnupg package:
$ apt-get install gnupg
7. Install the sbt package inside all containers.
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
8. Set SBT_OPTS. Change your SBT heap space: building spark-bench takes more heap space than the default provided by SBT. There are several ways to set these options for SBT; this is just one. Add the line below to your bash profile.
$ export SBT_OPTS="-Xmx1536M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M"
Grab the .tgz source code from here, inside each container:
$ wget https://github.com/CODAIT/spark-bench/releases/download/v99/spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
Unpack the tar file and cd into the newly created folder, inside each container:
$ tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ cd spark-bench_2.3.0_0.4.0-RELEASE_99/
Run sbt compile:
$ sbt compile
3. Run Spark-Bench
$ ./bin/spark-bench.sh examples/minimal-example.conf
4. Verification
The output should be like below:
One run of SparkPi and that's it!
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
| name| timestamp|total_runtime| pi_approximate|input|workloadResultsOutputDir|slices|run|spark.driver.host|spark.driver.port|hive.metastore.warehouse.dir| spark.jars| spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master| spark.app.id| description|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|sparkpi|1498683099328| 1032871662|3.141851141851142| | | 10| 0| 10.200.22.54| 61657| :/Users/...|file:/Users/ecurt...|com.ibm.sparktc.s...| driver| client| local[2]|local-1498683099078|One run of SparkP...|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
You can also verify, while the job is running, by opening another terminal and running the command below:
$ jps -lm
The output should look like:
11699 org.apache.spark.deploy.SparkSubmit --master local[*] --class com.ibm.sparktc.sparkbench.cli.CLIKickoff /home/hduser/spark-bench_2.3.0_0.4.0-RELEASE/lib/spark-bench-2.3.0_0.4.0-RELEASE.jar {"spark-bench":{"spark-submit-config":[{"workload-suites":[{"benchmark-output":"console","descr":"One run of SparkPi and that's it!","workloads":[{"name":"sparkpi","slices":10}]}]}]}}
12045 sun.tools.jps.Jps -lm
11630 com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch examples/minimal-example.conf
5. SQL benchmark
Run the gen_data.sh script:
$ <SPARK_BENCH_HOME>/SQL/bin/gen_data.sh
Check if sample data sets are created in /SparkBench/sql/Input in HDFS.
If not, then there is a bug in the spark-bench scripts that needs to be fixed using the following steps:
- Open <SPARK_BENCH_HOME>/bin/funcs.sh and search for the function 'CPFROM'
- In the last else block, replace the two occurrences of the ${src} variable with: ${src:8}
- This problem was spotted by a colleague at AMD, who submitted a patch here: https://github.com/SparkTC/spark-bench/pull/34
- After making these changes, run the gen_data.sh script again and check if input data is created in HDFS this time. Then proceed to the next step.
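The ${src:8} expression in that patch is plain bash substring expansion: ${var:N} expands to the value of var with its first N characters removed, so it strips an eight-character prefix from the source path. A self-contained illustration, assuming a hypothetical file:/// URI prefix (the exact prefix being stripped depends on how CPFROM builds ${src}):

```shell
# ${var:N} drops the first N characters of var.
# "file:///" is 8 characters, so ${src:8} removes it. Hypothetical path:
src="file:///tmp/SparkBench/sql/Input"
echo "${src:8}"
# prints tmp/SparkBench/sql/Input
```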
Run the run.sh script
For SQL applications, this runs the RDDRelation workload by default.
$ <SPARK_BENCH_HOME>/SQL/bin/run.sh
6. Hive Workload
To run Hive workload, execute:
$ hive;
7. Streaming Applications
For streaming applications such as TwitterTag and StreamingLogisticRegression, first execute:
$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh # Run this in one terminal
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh # Run this in another terminal
In order to run a particular streaming app (default: PageViewStream):
Pass a subApp parameter to gen_data.sh or run.sh like this:
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh TwitterPopularTags
*Note: some subApps do not need the gen_data step. For those you will get a "no need" string in the output.
8. Other Workloads
https://hub.docker.com/r/alvarobrandon/spark-bench/
Reference
https://codait.github.io/spark-bench/compilation/
https://github.com/codait/spark-bench/tree/legacy