Spark-Bench - Benchmarking Apache Spark
Overview
These instructions describe how to benchmark Apache Spark installed with the Apache Bigtop packaging tool.
Spark-Bench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications.
It provides a number of built-in workloads and data generators while also providing users the capability of plugging in their own workloads.
The framework provides three independent levels of parallelism that allow users to accurately simulate a variety of use cases. Some examples of potential uses for Spark-Bench include, but are not limited to:
- traditional benchmarking of algorithm implementations
- stress-testing clusters
- simulating multiple notebook users on one cluster
- comparing multiple versions of Spark on multiple clusters
Highlights
- Data Generation.
A data generator automatically generates input data sets with various sizes. Spark-Bench has the capability to generate data according to many different configurable generators. Generated data can be written to any storage addressable by Spark, including local files, HDFS, S3, etc.
- Workloads
The atomic unit of organization in Spark-Bench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk, and write their output, if the user wants it, back to disk. Spark-Bench provides diverse and representative workloads (extensible with new workloads):
- Machine learning: logistic regression, support vector machine, matrix factorization
- Graph processing: pagerank, svdplusplus, triangle count
- Streaming: twitter, pageview
- SQL query applications: Hive, RDDRelation
- Configurations
Spark-Bench allows you to launch multiple spark-submit commands by creating and launching multiple spark-submit scripts. This flexibility allows, among other things (see the configuration sketch after this list):
- Comparing benchmark times of the same workloads with different Spark settings
- Simulating multiple batch applications hitting the same cluster at once.
- Comparing benchmark times against two different Spark clusters!
- Metrics:
- supported: job execution time, input data size, data process rate
- under development: shuffle data, RDD size, resource consumption, integration with monitoring tools
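As a hedged illustration of the configuration-driven approach, the sketch below shows what a configuration file with two spark-submit-config entries could look like, running the same SparkPi workload under two executor-memory settings. The workload-suites/workloads keys match the example output shown later on this page; the conf block follows the upstream spark-bench configuration documentation and should be verified against the version you install. Such a file is launched with bin/spark-bench.sh exactly like the shipped examples in the Run Spark-Bench step below.
spark-bench = {
  spark-submit-config = [
    {
      # first submit: 1g executors (the conf block maps to --conf options)
      conf = { "spark.executor.memory" = "1g" }
      workload-suites = [{
        descr = "SparkPi with 1g executors"
        benchmark-output = "console"
        workloads = [{ name = "sparkpi", slices = 10 }]
      }]
    }
    {
      # second submit: same workload, 2g executors
      conf = { "spark.executor.memory" = "2g" }
      workload-suites = [{
        descr = "SparkPi with 2g executors"
        benchmark-output = "console"
        workloads = [{ name = "sparkpi", slices = 10 }]
      }]
    }
  ]
}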
Workload characterization and study of parameter impacts
- Diverse and representative data sets: Wikipedia, Google web graph, Amazon movie reviews
- Characterizing workloads in terms of resource consumption, data access patterns, time information, job execution time, and shuffle data
- Studying the impact of Spark configuration parameters
Prerequisites
OpenJDK 8 installed
$ java -version
Docker installed
$ docker version
- Apache Hadoop and Apache Spark should be installed from Apache Bigtop packages.
- Follow the instructions here to install the Bigtop Hadoop and Spark components
BigTop Setup
Follow the instructions here.
Create Docker Containers
1. Create a cluster of Bigtop docker containers
$ ./docker-hadoop.sh -C erp-18.06_debian-9.yaml -c 3
2. Log in to each container (the docker ps tip after step 3 lists the container names).
$ docker container exec -it <container_name> bash
3. Verify Hadoop is installed in containers.
$ hadoop
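If you are unsure which container names to use when logging in, the standard docker ps command lists the containers created by the script above:
$ docker ps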
Configure Docker Containers
1. Follow these steps inside each container.
Create hadoop user
We need to create a dedicated user (hduser) for running Hadoop. This user needs to be added to the hadoop user group:
$ sudo useradd -m -s /bin/bash -G hadoop hduser
Give hduser a password:
$ sudo passwd hduser
Add hduser to sudoers list:
On Debian:
$ sudo adduser hduser sudo
On CentOS:
$ sudo usermod -aG wheel hduser
Switch to hduser
$ su - hduser
Generate ssh key for hduser
$ ssh-keygen -t rsa -P ""
Press <enter> to accept the default file name.
Enable SSH access to the local machine:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
$ chmod 700 $HOME/.ssh
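To confirm that passwordless SSH works before moving on (plain ssh, nothing extra assumed; accept the host key fingerprint on the first connection):
$ ssh hduser@localhost hostname
The command should print the container hostname without prompting for a password.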
Login as hduser
$ su - hduser
Set Environment variables
Make sure the environment variables are set for Hadoop. Add them to your bash profile.
$ vi ~/.bashrc
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export HADOOP_USER_NAME=hdfs
export CLASSPATH=$CLASSPATH:.
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-2.7.2.jar:$HADOOP_HOME/client/hadoop-hdfs-2.7.2.jar:$HADOOP_HOME/hadoop-auth-2.7.2.jar:/usr/lib/hadoop-mapreduce/*:/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
export SPARK_HOME=/usr/lib/spark
export PATH=$HADOOP_HOME/bin:$PATH
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/bin/hadoop:$CLASSPATH:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-mapreduce/*:.
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/:.
export SPARK_MASTER_HOST=local[*]
$ source ~/.bashrc
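A couple of standard checks to confirm the environment is picked up (hadoop and spark-submit should be on the PATH set above):
$ echo $HADOOP_HOME
$ hadoop version
$ spark-submit --version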
Hosts file
Make sure the hosts file is set up correctly. The hosts file should look like below:
$ sudo vi /etc/hosts
172.17.0.4 spark-master <containerID.apache.bigtop.org> <containerID>
172.17.0.3 spark-slave01 <containerID.apache.bigtop.org> <containerID>
172.17.0.2 spark-slave02 <containerID.apache.bigtop.org> <containerID>
127.0.0.1 localhost localhost.domain
::1 localhost
Spark Configurations
Make sure Spark is configured properly:
- /usr/lib/spark/conf/spark-env.sh
This file should have STANDALONE_SPARK_MASTER_HOST pointing to the Spark master IP address. SPARK_MASTER_IP should also be set.
- cp /usr/lib/spark/conf/slaves.template /usr/lib/spark/conf/slaves
Add the slave IP addresses to the file instead of 'localhost'.
- Make sure spark-defaults.conf has the master set correctly.
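For reference, a sketch of these files assuming the master/slave IP addresses from the hosts file above and the default standalone master port 7077; adjust the values to your own cluster:
/usr/lib/spark/conf/spark-env.sh
export STANDALONE_SPARK_MASTER_HOST=172.17.0.4
export SPARK_MASTER_IP=172.17.0.4
/usr/lib/spark/conf/slaves
172.17.0.3
172.17.0.2
/usr/lib/spark/conf/spark-defaults.conf
spark.master spark://172.17.0.4:7077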
Verify Spark
Ping each node to make sure all nodes are reachable:
$ ping spark-master
$ ping spark-slave01
$ ping spark-slave02
Make sure the command below shows the spark-master IP and port in the 'ESTABLISHED' state and the other Spark ports in the LISTENING state:
$ netstat -n -a
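To narrow the output down to the standalone master, you can filter on the master port (assuming the default 7077):
$ netstat -n -a | grep 7077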
Stop and start Spark
$ /usr/lib/spark/sbin/stop-all.sh
$ /usr/lib/spark/sbin/start-all.sh
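After the restart, a quick sanity check is to list the JVM processes; on the master node you should see a Master process and on each slave a Worker process (the standard Spark standalone daemons):
$ jps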
2. Install Spark-Bench
- Install dependencies
Log in to each container and install the gnupg package:
$ apt-get install gnupg
- Install the sbt package inside all containers.
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list $ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823 $ sudo apt-get update $ sudo apt-get install sbt
- Set SBT_OPTS to increase your SBT heap space. Building spark-bench takes more heap space than the default provided by SBT. There are several ways to set these options for SBT; this is just one. Add the line below to your bash profile:
$ export SBT_OPTS="-Xmx1536M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M"
- Inside each container, grab the .tgz from here:
$ wget https://github.com/CODAIT/spark-bench/releases/download/v99/spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
- Unpack the tar file and cd into the newly created folder, inside each container
$ tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ cd spark-bench_2.3.0_0.4.0-RELEASE_99/
- Run sbt compile
$ sbt compile
3. Run Spark-Bench
$ ./bin/spark-bench.sh examples/minimal-example.conf
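For orientation, the configuration in examples/minimal-example.conf corresponds roughly to the JSON shown in the jps output under Verification below: a single suite with one SparkPi workload using 10 slices and console output. A HOCON sketch of that shape (not a verbatim copy of the shipped file):
spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "One run of SparkPi and that's it!"
      benchmark-output = "console"
      workloads = [{
        name = "sparkpi"
        slices = 10
      }]
    }]
  }]
}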
4. Verification
- The output should look like the example below:
One run of SparkPi and that's it!
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
| name| timestamp|total_runtime| pi_approximate|input|workloadResultsOutputDir|slices|run|spark.driver.host|spark.driver.port|hive.metastore.warehouse.dir| spark.jars| spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master| spark.app.id| description|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|sparkpi|1498683099328| 1032871662|3.141851141851142| | | 10| 0| 10.200.22.54| 61657| :/Users/...|file:/Users/ecurt...|com.ibm.sparktc.s...| driver| client| local[2]|local-1498683099078|One run of SparkP...|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
- You can also verify while the job is running by opening another terminal instance and running the command below:
$ jps -lm
- The output should be like:
11699 org.apache.spark.deploy.SparkSubmit --master local[*] --class com.ibm.sparktc.sparkbench.cli.CLIKickoff /home/hduser/spark-bench_2.3.0_0.4.0-RELEASE/lib/spark-bench-2.3.0_0.4.0-RELEASE.jar {"spark-bench":{"spark-submit-config":[{"workload-suites":[{"benchmark-output":"console","descr":"One run of SparkPi and that's it!","workloads":[{"name":"sparkpi","slices":10}]}]}]}} 12045 sun.tools.jps.Jps -lm 11630 com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch examples/minimal-example.conf
5. SQL benchmark
- Run the gen_data.sh script
$ <SPARK_BENCH_HOME>/SQL/bin/gen_data.sh
Check if sample data sets are created in /SparkBench/sql/Input in HDFS.
If not, then there is a bug in the Spark-Bench scripts that needs to be fixed using the following steps:
- Open <SPARK_BENCH_HOME>/bin/funcs.sh and search for the function 'CPFROM'
- In the last else block, replace the two occurrences of the ${src} variable with this: ${src:8}
- This problem was spotted by a colleague at AMD, who has submitted a patch here: https://github.com/SparkTC/spark-bench/pull/34
- After making these changes, try running gen_data.sh script again and check if input data is created in HDFS this time. Then proceed to the next step.
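A quick way to perform that check with the standard HDFS CLI, using the input path mentioned above:
$ hdfs dfs -ls /SparkBench/sql/Input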
- Run the run.sh script
For SQL applications, the RDDRelation workload runs by default.
$ <SPARK_BENCH_HOME>/SQL/bin/run.sh
6. Hive Workload
To run Hive workload, execute:
$ hive;
7. Streaming Applications
For Streaming applications such as TwitterTag and StreamingLogisticRegression, first execute:
$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh   # Run this in one terminal
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh        # Run this in another terminal
To run a particular streaming app (default: PageViewStream), pass a subApp parameter to gen_data.sh or run.sh like this:
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh TwitterPopularTags
Note: some subApps do not need the gen_data step. For those, you will see a "no need" string in the output.
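For subApps that do need generated input, the same parameter is passed to gen_data.sh, for example with the default PageViewStream:
$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh PageViewStream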
8. Other Workloads
https://hub.docker.com/r/alvarobrandon/spark-bench/
Reference
https://codait.github.io/spark-bench/compilation/
https://github.com/codait/spark-bench/tree/legacy