.Spark-Bench - Benchmarking Apache Spark v1.0
Overview
These instructions describe how to benchmark Apache Spark, installed using the Apache Bigtop packaging tool, with Spark-Bench.
Spark-Bench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications.
It provides a number of built-in workloads and data generators while also providing users the capability of plugging in their own workloads.
The framework provides three independent levels of parallelism that allow users to accurately simulate a variety of use cases. Some examples of potential uses for Spark-Bench include, but are not limited to:
traditional benchmarking of algorithm implementations
stress-testing clusters
simulating multiple notebook users on one cluster
comparing multiple versions of Spark on multiple clusters
Highlights
Data Generation.
A data generator automatically creates input data sets of various sizes. Spark-Bench can generate data according to many different configurable generators. Generated data can be written to any storage addressable by Spark, including local files, HDFS, S3, etc.
Workloads
The atomic unit of organization in Spark-Bench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk, and write their output, if the user wants it, out to disk. Spark-Bench provides diverse and representative workloads (extensible to new workloads):
Machine learning: logistic regression, support vector machine, matrix factorization
Graph processing: pagerank, svdplusplus, triangle count
Streaming: twitter, pageview
SQL query applications: hive, RDDRelation
Configurations
Spark-Bench allows you to launch multiple spark-submit commands by creating and launching multiple spark-submit scripts. This flexibility allows:
Comparing benchmark times of the same workloads with different Spark settings
Simulating multiple batch applications hitting the same cluster at once.
Comparing benchmark times against two different Spark clusters!
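As an illustration, a single spark-bench config file can hold several spark-submit-config blocks, each launched as its own spark-submit. The sketch below follows the shape of the configuration echoed in the verification output later on this page (a spark-bench/spark-submit-config/workload-suites/workloads nesting); the master URLs and descriptions are placeholder values, not settings taken from this page:

```
spark-bench = {
  spark-submit-config = [
    {
      // hypothetical first target: the standalone cluster
      spark-args = { master = "spark://spark-master:7077" }
      workload-suites = [{
        descr = "SparkPi on the standalone cluster"
        benchmark-output = "console"
        workloads = [{ name = "sparkpi", slices = 10 }]
      }]
    },
    {
      // same workload with different Spark settings, for comparison
      spark-args = { master = "local[*]" }
      workload-suites = [{
        descr = "SparkPi in local mode"
        benchmark-output = "console"
        workloads = [{ name = "sparkpi", slices = 10 }]
      }]
    }
  ]
}
```

Each entry in spark-submit-config becomes a separate spark-submit invocation, which is how the comparisons listed above are expressed in practice.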
Metrics:
supported: job execution time, input data size, data process rate
under development: shuffle data, RDD size, resource consumption, integration with monitoring tool
Workload characterization and study of parameter impacts
Diverse and representative data sets: Wikipedia, Google web graph, Amazon movie review
Characterizing workloads in terms of resource consumption, data access patterns, timing information, job execution time, and shuffle data
Studying the impact of Spark configuration parameters
Prerequisites
OpenJDK8 installed
$ java -version
Docker installed
$ docker version
Apache Hadoop and Apache Spark should be installed from Apache Bigtop packages.
Follow the instructions here to install the Bigtop Hadoop and Spark components
BigTop Setup
Follow the instructions in here
Create docker Containers
1. Create a cluster of Bigtop docker containers
$ ./docker-hadoop.sh -C erp-18.06_debian-9.yaml -c 3
2. Login to each container.
$ docker container exec -it <container_name> bash
3. Verify Hadoop is installed in the containers.
$ hadoop
Configure Docker Containers
1. Follow these steps inside each container.
Create hadoop user
We need to create a dedicated user (hduser) for running Hadoop. This user needs to be added to the hadoop user group:
$ sudo adduser hduser -G hadoop
Give a password for hduser:
$ sudo passwd hduser
Add hduser to the sudoers list:
On Debian:
$ sudo adduser hduser sudo
On CentOS:
$ sudo usermod -aG wheel hduser
Switch to hduser
$ su - hduser
Generate an ssh key for hduser:
$ ssh-keygen -t rsa -P ""
Press <enter> to accept the default file name.
Enable ssh access to the local machine:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
$ chmod 700 $HOME/.ssh
Login as hduser:
$ su - hduser
Set Environment variables
Make sure the environment variables are set for Hadoop. Add them to your bash profile.
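One of the exports in the profile below derives JAVA_HOME by resolving the /usr/bin/java symlink with readlink -f and trimming the trailing bin/java with sed. A small self-contained illustration of that sed expression (the JDK path here is made up for demonstration):

```shell
# Strip the trailing "bin/java" from a resolved java path, the same
# transformation the JAVA_HOME export uses. The sample path is hypothetical.
path=/usr/lib/jvm/java-8-openjdk-amd64/bin/java
echo "$path" | sed "s:bin/java::"
# prints /usr/lib/jvm/java-8-openjdk-amd64/
```

On a real container, `readlink -f /usr/bin/java` supplies the input path by following the alternatives symlink chain to the actual JDK location.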
$ vi ~/.bashrc
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export HADOOP_USER_NAME=hdfs
export CLASSPATH=$CLASSPATH:.
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-2.7.2.jar:$HADOOP_HOME/client/hadoop-hdfs-2.7.2.jar:$HADOOP_HOME/hadoop-auth-2.7.2.jar:/usr/lib/hadoop-mapreduce/*:/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
export SPARK_HOME=/usr/lib/spark
export PATH=$HADOOP_HOME/bin:$PATH
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/bin/hadoop:$CLASSPATH:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-mapreduce/*:.
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/:.
export SPARK_MASTER_HOST=local[*]
$ source ~/.bashrc
Hosts file
Make sure the hosts file is set up correctly. The hosts file should look like the below:
$ sudo vi /etc/hosts
172.17.0.4 spark-master <containerID.apache.bigtop.org> <containerID>
172.17.0.3 spark-slave01 <containerID.apache.bigtop.org> <containerID>
172.17.0.2 spark-slave02 <containerID.apache.bigtop.org> <containerID>
127.0.0.1 localhost localhost.domain
::1 localhost
Spark Configurations
Make sure Spark is configured properly
/usr/lib/spark/conf/spark-env.sh
This file should have STANDALONE_SPARK_MASTER_HOST pointing to the Spark master IP address.
SPARK_MASTER_IP should also be set.
$ cp /usr/lib/spark/conf/slaves.template /usr/lib/spark/conf/slaves
Add the slaves' IP addresses to the file instead of 'localhost'.
Make sure spark-defaults.conf has the master set correctly
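For reference, a minimal spark-defaults.conf pointing at the standalone master might look like the fragment below. The host name matches the /etc/hosts entries above; port 7077 is Spark's default standalone master port, shown here as an illustrative value:

```
# /usr/lib/spark/conf/spark-defaults.conf (illustrative)
spark.master    spark://spark-master:7077
```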
Verify Spark
Ping to make sure you can reach all nodes:
$ ping spark-master
$ ping spark-slave01
$ ping spark-slave02
Make sure the command below shows the spark-master IP with the port as 'ESTABLISHED' and the other Spark ports as listening:
$ netstat -n -a
Stop and start Spark
$ /usr/lib/spark/sbin/stop-all.sh
$ /usr/lib/spark/sbin/start-all.sh
2. Install Spark-Bench
Install dependencies
Login to each container and install the gnupg package:
$ apt-get install gnupg
7. Install the sbt package inside all containers.
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
8. Set SBT_OPTS. Change your SBT heap space: building spark-bench takes more heap space than the default provided by SBT. There are several ways to set these options for SBT; this is just one. Add the line below to your bash profile.
$ export SBT_OPTS="-Xmx1536M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M"
Grab the .tgz source code from here, inside each container:
$ wget https://github.com/CODAIT/spark-bench/releases/download/v99/spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
Unpack the tar file and cd into the newly created folder, inside each container:
$ tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ cd spark-bench_2.3.0_0.4.0-RELEASE_99/
Run sbt compile:
$ sbt compile
3. Run Spark-Bench
$ ./bin/spark-bench.sh examples/minimal-example.conf
4. Verification
The output should be like below:
One run of SparkPi and that's it!
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
| name| timestamp|total_runtime| pi_approximate|input|workloadResultsOutputDir|slices|run|spark.driver.host|spark.driver.port|hive.metastore.warehouse.dir| spark.jars| spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master| spark.app.id| description|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|sparkpi|1498683099328| 1032871662|3.141851141851142| | | 10| 0| 10.200.22.54| 61657| :/Users/...|file:/Users/ecurt...|com.ibm.sparktc.s...| driver| client| local[2]|local-1498683099078|One run of SparkP...|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
You can also verify, while the job is running, by opening another terminal and running the command below:
$ jps -lm
The output should look like:
11699 org.apache.spark.deploy.SparkSubmit --master local[*] --class com.ibm.sparktc.sparkbench.cli.CLIKickoff /home/hduser/spark-bench_2.3.0_0.4.0-RELEASE/lib/spark-bench-2.3.0_0.4.0-RELEASE.jar {"spark-bench":{"spark-submit-config":[{"workload-suites":[{"benchmark-output":"console","descr":"One run of SparkPi and that's it!","workloads":[{"name":"sparkpi","slices":10}]}]}]}}
12045 sun.tools.jps.Jps -lm
11630 com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch examples/minimal-example.conf
5. SQL benchmark
Run the gen_data.sh script:
$ <SPARK_BENCH_HOME>/SQL/bin/gen_data.sh
Check if sample data sets are created in /SparkBench/sql/Input in HDFS.
If not, then there is a bug in the spark-bench scripts that needs to be fixed using the following steps:
- Open <SPARK_BENCH_HOME>/bin/funcs.sh and search for the function 'CPFROM'
- In the last else block, replace the two occurrences of the ${src} variable with: ${src:8}
- This problem was spotted by a colleague at AMD, who submitted a patch here: https://github.com/SparkTC/spark-bench/pull/34
- After making these changes, run the gen_data.sh script again and check if input data is created in HDFS this time. Then proceed to the next step.
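The ${src:8} expression in that patch is plain bash substring expansion: ${var:N} expands to the value of var with its first N characters removed, so it strips an eight-character prefix from the source path. A self-contained illustration, assuming a hypothetical file:/// URI prefix (the exact prefix being stripped depends on how CPFROM builds ${src}):

```shell
# ${var:N} drops the first N characters of var.
# "file:///" is 8 characters, so ${src:8} removes it. Hypothetical path:
src="file:///tmp/SparkBench/sql/Input"
echo "${src:8}"
# prints tmp/SparkBench/sql/Input
```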
Run the run.sh script
For SQL applications, this runs the RDDRelation workload by default.
$ <SPARK_BENCH_HOME>/SQL/bin/run.sh
6. Hive Workload
To run Hive workload, execute:
$ hive;
7. Streaming Applications
For streaming applications such as TwitterTag and StreamingLogisticRegression, first execute:
$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh # Run this in one terminal
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh # Run this in another terminal
In order to run a particular streaming app (default: PageViewStream):
Pass a subApp parameter to gen_data.sh or run.sh like this:
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh TwitterPopularTags
*Note: some subApps do not need the gen_data step. For those you will get a "no need" string in the output.
8. Other Workloads
https://hub.docker.com/r/alvarobrandon/spark-bench/
Reference
https://codait.github.io/spark-bench/compilation/
https://github.com/codait/spark-bench/tree/legacy