Overview

These instructions describe how to run Spark-Bench benchmarks on Apache Spark installed with the Apache Bigtop packaging tool.


Spark-Bench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications.

It provides a number of built-in workloads and data generators while also providing users the capability of plugging in their own workloads.

The framework provides three independent levels of parallelism that allow users to accurately simulate a variety of use cases. Some examples of potential uses for Spark-Bench include, but are not limited to, the highlights below.


Highlights

Workload characterization and study of parameter impacts

Prerequisites

Bigtop Setup

Follow the Bigtop setup instructions here.

Create Docker Containers


1. Create a cluster of Bigtop Docker containers.

$ ./docker-hadoop.sh -C erp-18.06_debian-9.yaml -c 3
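
Here -C selects the provisioner configuration file (erp-18.06_debian-9.yaml above) and -c 3 creates a three-container cluster; option meanings as in the Bigtop Docker provisioner's usage text, so adjust the count as needed.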

2. Log in to each container.

$ docker container exec -it <container_name> bash
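
If you are unsure of the container names, list them first:

$ docker ps --format '{{.Names}}'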

3. Verify Hadoop is installed in each container.

$ hadoop
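
Running hadoop with no arguments should print its usage text. For a slightly stronger check, assuming HDFS has already been started in the containers:

$ hadoop version
$ hdfs dfsadmin -report # should list the cluster's live DataNodes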

Configure Docker Containers

1. Follow these steps inside each container.

2. Install Spark-Bench

$ wget https://github.com/CODAIT/spark-bench/releases/download/v99/spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ tar -xvzf spark-bench_2.3.0_0.4.0-RELEASE_99.tgz
$ cd spark-bench_2.3.0_0.4.0-RELEASE_99/
$ sbt compile # only needed when building from source; the -RELEASE tarball already ships compiled jars under lib/
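
Spark-Bench drives its workloads through spark-submit, so it must be able to locate Spark. A minimal sketch, assuming the Bigtop packages install Spark under /usr/lib/spark (adjust the path if your layout differs):

$ export SPARK_HOME=/usr/lib/spark # path used by Bigtop packages; change if your installation differs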


3. Run Spark-Bench

$ ./bin/spark-bench.sh examples/minimal-example.conf
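
For reference, minimal-example.conf describes a single SparkPi run with console output. The HOCON below is reconstructed from the configuration JSON visible in the jps output under step 4, so treat it as a sketch; the file shipped under examples/ may differ cosmetically:

spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "One run of SparkPi and that's it!"
      benchmark-output = "console"
      workloads = [{
        name = "sparkpi"
        slices = 10
      }]
    }]
  }]
}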

4. Verification

One run of SparkPi and that's it!                                               
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|   name|    timestamp|total_runtime|   pi_approximate|input|workloadResultsOutputDir|slices|run|spark.driver.host|spark.driver.port|hive.metastore.warehouse.dir|          spark.jars|      spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master|       spark.app.id|         description|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+
|sparkpi|1498683099328|   1032871662|3.141851141851142|     |                        |    10|  0|     10.200.22.54|            61657|                 :/Users/...|file:/Users/ecurt...|com.ibm.sparktc.s...|           driver|                 client|    local[2]|local-1498683099078|One run of SparkP...|
+-------+-------------+-------------+-----------------+-----+------------------------+------+---+-----------------+-----------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+-------------------+--------------------+


$ jps -lm
11699 org.apache.spark.deploy.SparkSubmit --master local[*] --class com.ibm.sparktc.sparkbench.cli.CLIKickoff /home/hduser/spark-bench_2.3.0_0.4.0-RELEASE/lib/spark-bench-2.3.0_0.4.0-RELEASE.jar {"spark-bench":{"spark-submit-config":[{"workload-suites":[{"benchmark-output":"console","descr":"One run of SparkPi and that's it!","workloads":[{"name":"sparkpi","slices":10}]}]}]}}
12045 sun.tools.jps.Jps -lm
11630 com.ibm.sparktc.sparkbench.sparklaunch.SparkLaunch examples/minimal-example.conf

5. SQL Benchmark

The SQL, Hive, and Streaming workloads below follow the legacy SparkTC/spark-bench layout (see the legacy branch under Reference); <SPARK_BENCH_HOME> refers to the root of that installation.

$ <SPARK_BENCH_HOME>/SQL/bin/gen_data.sh


Check whether the sample data sets were created under /SparkBench/sql/Input in HDFS.
If not, there is a bug in the Spark-Bench scripts that can be fixed as follows:
- Open <SPARK_BENCH_HOME>/bin/funcs.sh and search for the function 'CPFROM'.
- In the last else block, replace the two occurrences of the ${src} variable with ${src:8}.
- This problem was spotted by a colleague at AMD, who submitted a patch: https://github.com/SparkTC/spark-bench/pull/34
- After making these changes, run the gen_data.sh script again and check whether the input data is created in HDFS this time. Then proceed to the next step.
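
To confirm the generated input landed in HDFS (same path as the check above):

$ hdfs dfs -ls /SparkBench/sql/Input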

$ <SPARK_BENCH_HOME>/SQL/bin/run.sh


6. Hive Workload

To run the Hive workload, launch the Hive CLI:


$ hive
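
As a quick smoke test, you can also run a single statement non-interactively (the query here is just an illustrative choice):

$ hive -e 'SHOW DATABASES;' # confirms the CLI can reach the metastore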

7. Streaming Applications

For streaming applications such as TwitterTag or StreamingLogisticRegression, first execute:


$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh # Run this in one terminal


$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh # Run this in another terminal
To run a particular streaming app (default: PageViewStream), pass a subApp parameter to gen_data.sh or run.sh like this:

$ <SPARK_BENCH_HOME>/Streaming/bin/run.sh TwitterPopularTags

Note: some subApps do not need the gen_data step; for those you will see a "no need" string in the output.
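
The same subApp argument also works for the data-generation step, for example:

$ <SPARK_BENCH_HOME>/Streaming/bin/gen_data.sh TwitterPopularTags # only for subApps that need generated data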

8. Other Workloads

For other workloads, a third-party Docker image of Spark-Bench is available here:

https://hub.docker.com/r/alvarobrandon/spark-bench/


Reference

https://codait.github.io/spark-bench/compilation/

https://github.com/codait/spark-bench/tree/legacy


Errors and Resolutions