These instructions describe how to benchmark Apache Spark, installed using the Apache Bigtop packaging tool, with Spark-Bench.
Spark-Bench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications.
It provides a number of built-in workloads and data generators while also providing users the capability of plugging in their own workloads.
The framework provides three independent levels of parallelism that allow users to accurately simulate a variety of use cases. Some examples of potential uses for Spark-Bench include, but are not limited to:
Workload characterization and study of parameter impacts
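These levels correspond to the nesting of the configuration file: one or more spark-submit-config blocks, each containing workload-suites, each containing workloads. As a sketch, here is the structure of the minimal example used later in this guide, reconstructed from the serialized config visible in the jps output further down (HOCON format):

spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "One run of SparkPi and that's it!"
      benchmark-output = "console"
      workloads = [{
        name = "sparkpi"
        slices = 10
      }]
    }]
  }]
}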
OpenJDK 8 installed
$ java -version
Docker installed
$ docker version
Follow the instructions below.
1. Create a cluster of Bigtop docker containers
$ ./docker-hadoop.sh -C erp-18.06_debian-9.yaml -c 3
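To find the container names or IDs for the next step, list the running containers (the exact names depend on how the Bigtop provisioner labels them):

$ docker ps --format "{{.ID}}  {{.Names}}  {{.Image}}"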
2. Log in to each container.
$ docker container exec -it <container_name> bash
3. Verify Hadoop is installed in the containers.
$ hadoop
1. Follow these steps inside each container.
We need to create a dedicated user (hduser) for running Hadoop. This user needs to be added to the hadoop user group:
$ sudo useradd -m -G hadoop hduser
Set a password for hduser:
$ sudo passwd hduser
Add hduser to the sudoers list:
On Debian:
$ sudo adduser hduser sudo
On CentOS:
$ sudo usermod -aG wheel hduser
Switch to hduser
$ su - hduser
Generate an SSH key for hduser:
$ ssh-keygen -t rsa -P ""
Press <enter> to accept the default file name.
Enable SSH access to the local machine:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod 600 $HOME/.ssh/authorized_keys
$ chmod 700 $HOME/.ssh
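Verify that passwordless SSH to the local machine now works (the first connection will ask you to accept the host key):

$ ssh localhost hostname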
Log in as hduser:
$ su - hduser
Edit ~/.bashrc and add the environment variables below:
$ vi ~/.bashrc
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export HADOOP_USER_NAME=hdfs
export CLASSPATH=$CLASSPATH:.
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-2.7.2.jar:$HADOOP_HOME/client/hadoop-hdfs-2.7.2.jar:$HADOOP_HOME/hadoop-auth-2.7.2.jar:/usr/lib/hadoop-mapreduce/*:/usr/lib/hive/lib/*:/usr/lib/hadoop/lib/*:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
export SPARK_HOME=/usr/lib/spark
export PATH=$HADOOP_HOME/bin:$PATH
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/bin/hadoop:$CLASSPATH:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-mapreduce/*:.
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/:.
export SPARK_MASTER_HOST=local[*]
$ source ~/.bashrc
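As a quick sanity check that the new environment is in effect:

$ echo $HADOOP_CONF_DIR     # should print /etc/hadoop/conf
$ hadoop version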
Make sure the hosts file is set up correctly. It should look like the example below:
$ sudo vi /etc/hosts
172.17.0.4 spark-master <containerID.apache.bigtop.org> <containerID>
172.17.0.3 spark-slave01 <containerID.apache.bigtop.org> <containerID>
172.17.0.2 spark-slave02 <containerID.apache.bigtop.org> <containerID>
127.0.0.1 localhost localhost.domain
::1 localhost
Make sure spark-defaults.conf has the master set correctly
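For a standalone cluster this means a line like the following (7077 is the default standalone master port; in Bigtop packaging the file is typically under /etc/spark/conf — adjust if your layout differs):

spark.master    spark://spark-master:7077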
Ping each node to make sure they are all reachable:
$ ping spark-master
$ ping spark-slave01
$ ping spark-slave02
Make sure the command below shows the spark-master IP and port in the ESTABLISHED state and the other Spark ports in the LISTEN state:
$ netstat -n -a
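For example, to check just the standalone master port:

$ netstat -n -a | grep 7077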
Restart the Spark daemons:
$ /usr/lib/spark/sbin/stop-all.sh
$ /usr/lib/spark/sbin/start-all.sh
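After the restart, jps on the master node should list a Master process, and each slave should list a Worker:

$ jps | grep -E 'Master|Worker'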
Log in to each container and install the gnupg package:
$ apt-get install gnupg
7. Install the sbt package inside all containers.
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list $ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823 $ sudo apt-get update $ sudo apt-get install sbt |
8. Set SBT_OPTS to change your SBT heap space: building spark-bench takes more heap space than the SBT default. There are several ways to set these options for SBT; this is just one. Add the line below to your bash profile:
$ export SBT_OPTS="-Xmx1536M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M"
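Any subsequent sbt invocation will start with the enlarged heap; a simple way to confirm that sbt itself runs is:

$ sbt about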
Download the spark-bench release:
$ wget https://github.com/CODAIT/spark-bench/releases/download/v99/spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ cd spark-bench_2.3.0_0.4.0-RELEASE_99/
$ sbt compile
Run the minimal example; the console output should look similar to this:
$ ./bin/spark-bench.sh examples/minimal-example.conf
One run of SparkPi and that's it!
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
| name| timestamp|total_runtime| pi_approximate|input|workloadResultsOutputDir|slices|run|spark.driver.host|spark.driver.port|hive.metastore.warehouse.dir| spark.jars| spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master| spark.app.id| description|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|sparkpi|1498683099328| 1032871662|3.141851141851142| | | 10| 0| 10.200.22.54| 61657| :/Users/...|file:/Users/ecurt...|com.ibm.sparktc.s...| driver| client| local[2]|local-1498683099078|One run of SparkP...|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
While a benchmark is running, jps shows the spark-bench launcher and driver processes:
$ jps -lm
11699 org.apache.spark.deploy.SparkSubmit --master local[*] --class com.ibm.sparktc.sparkbench.cli.CLIKickoff /home/hduser/spark-bench_2.3.0_0.4.0-RELEASE/lib/spark-bench-2.3.0_0.4.0-RELEASE.jar {"spark-bench":{"spark-submit-config":[{"workload-suites":[{"benchmark-output":"console","descr":"One run of SparkPi and that's it!","workloads":[{"name":"sparkpi","slices":10}]}]}]}}
12045 sun.tools.jps.Jps -lm
11630 com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch examples/minimal-example.conf
Generate the input data for the SQL workloads:
$ <SPARK_BENCH_HOME>/SQL/bin/gen_data.sh
Check if sample data sets are created in /SparkBench/sql/Input in HDFS.
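For example:

$ hdfs dfs -ls /SparkBench/sql/Input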
If not, then there is a bug in the spark-bench scripts that needs to be fixed using the following steps:
- Open <SPARK_BENCH_HOME>/bin/funcs.sh and search for the function 'CPFROM'.
- In the last else block, replace the two occurrences of the ${src} variable with ${src:8} (see the substring sketch after these steps).
- This problem was spotted by a colleague at AMD, who has submitted a patch here: https://github.com/SparkTC/spark-bench/pull/34
- After making these changes, run the gen_data.sh script again and check whether the input data is created in HDFS this time. Then proceed to the next step.
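For reference, ${src:8} is bash substring expansion: it expands to the value of src with the first 8 characters removed. A hypothetical illustration (the actual value of src inside funcs.sh may differ):

$ src="hdfs:///SparkBench/sql/Input"
$ echo "${src:8}"
SparkBench/sql/Input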
Run the SQL workload:
$ <SPARK_BENCH_HOME>/SQL/bin/run.sh
To run the Hive workload, execute:
$ hive
For streaming applications such as TwitterTag and StreamingLogisticRegression, first execute:
$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh    # Run this in one terminal
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh         # Run this in another terminal
To run a particular streaming app (default: PageViewStream), pass a subApp parameter to gen_data.sh or run.sh, like this:
$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh TwitterPopularTags
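And similarly for the data generator, when the chosen subApp needs one:

$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh TwitterPopularTags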
*Note: some subApps do not need the gen_data step; for those, you will get a "no need" string in the output.
https://hub.docker.com/r/alvarobrandon/spark-bench/
https://codait.github.io/spark-bench/compilation/
https://github.com/codait/spark-bench/tree/legacy