NOTE: The following steps were tested on Ubuntu 16.04.

Prerequisites

Code Block

language	bash

apt install -y maven

...

Code Block

language	bash

# get source of latest release
git clone https://github.com/intel-hadoop/HiBench -b HiBench-7.0
cd HiBench
# build all modules in HiBench
mvn -Dspark=2.2 -Dscala=2.11 clean package
# if you just want to build for hadoop and spark
mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package

Run Benchmark

...

Prerequisites

Code Block

language	bash

apt install -y bc python2.7 python-setuptools openssh-server
service ssh start
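
Before going further, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch; the helper name check_tools is hypothetical and not part of HiBench, and the command names are assumed to match the packages listed above.

```shell
# check_tools is a hypothetical helper (not part of HiBench).
# It reports every command that is not on the PATH and returns
# non-zero if at least one is missing.
check_tools() {
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      missing=1
    fi
  done
  return "$missing"
}

# Command names assumed from the packages installed above.
check_tools bc python2.7 ssh || echo "install the missing packages before running HiBench"
```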

...

A detailed description of the fields in hadoop.conf is listed below:

Property	Meaning
hibench.hadoop.home	The Hadoop installation location
hibench.hadoop.executable	The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir	Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master	The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release	Hadoop release provider. Supported values: apache, cdh5, hdp
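
As an illustration of the table above, the sketch below writes a hadoop.conf for an Apache Hadoop installation. The values are the placeholders from the table and must be replaced with real paths. The CONF_FILE and HADOOP_HOME_DIR variables are assumptions introduced only for this example. Note that the script overwrites the target file.

```shell
# Placeholder values taken from the table above; replace them with
# the paths of your own installation.
HADOOP_HOME_DIR="/YOUR/HADOOP/HOME"
# CONF_FILE is a variable introduced for this sketch only.
CONF_FILE="${CONF_FILE:-conf/hadoop.conf}"

mkdir -p "$(dirname "$CONF_FILE")"

# Each line is "property value", separated by whitespace.
cat > "$CONF_FILE" <<EOF
hibench.hadoop.home           ${HADOOP_HOME_DIR}
hibench.hadoop.executable     ${HADOOP_HOME_DIR}/bin/hadoop
hibench.hadoop.configure.dir  ${HADOOP_HOME_DIR}/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020/user/username
hibench.hadoop.release        apache
EOF
```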

Run Workload

HiBench workloads usually have two parts: prepare and run. For example, to run "wordcount":

...

Change input data size:
- Set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
Change parallelism:
- Change the properties below in conf/hibench.conf to control the parallelism.
  Property	Meaning
  hibench.default.map.parallelism	Mapper number in Hadoop
  hibench.default.shuffle.parallelism	Reducer number in Hadoop
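
The tuning steps above can be scripted. This is a minimal sketch, assuming conf/hibench.conf holds one whitespace-separated "property value" pair per line. The set_prop helper and the HIBENCH_CONF variable are hypothetical names used only here, and the values (small, 8) are illustrative rather than recommendations.

```shell
# set_prop FILE KEY VALUE (hypothetical helper, not part of HiBench):
# replace the line whose first field is KEY, or append "KEY VALUE"
# if no such line exists. Other lines, including comments, are kept.
set_prop() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v k="$key" -v v="$value" '
    $1 == k { print k, v; found = 1; next }
    { print }
    END { if (!found) print k, v }
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# HIBENCH_CONF is an assumed override; the default matches the path above.
conf="${HIBENCH_CONF:-conf/hibench.conf}"
if [ -f "$conf" ]; then
  set_prop "$conf" hibench.scale.profile small            # illustrative value
  set_prop "$conf" hibench.default.map.parallelism 8      # illustrative value
  set_prop "$conf" hibench.default.shuffle.parallelism 8  # illustrative value
fi
```

Rewriting the file through a temporary copy keeps the original intact if awk fails partway.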

Spark

Setup

A working HDFS service
A working YARN service, if Spark is tested in YARN mode
Working Spark: Spark can be configured to work in either "standalone mode" or "YARN mode". ("Mesos mode" is not covered because Mesos was not deployed when we ran HiBench.)
- Standalone mode: the easiest to set up. It provides almost all the same features as "YARN mode" if only Spark is running.
- YARN mode:

...

Code Block

language	bash

cp conf/hadoop.conf.template conf/hadoop.conf

Property	Meaning
hibench.hadoop.home	The Hadoop installation location
hibench.hadoop.executable	The path of hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir	Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master	The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release	Hadoop release provider. Supported values: apache, cdh5, hdp

Configure Spark

Create and edit conf/spark.conf:

...

Change input data size:
- Set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
Change parallelism:
Property	Meaning
hibench.default.map.parallelism	Partition number in Spark
hibench.default.shuffle.parallelism	Shuffle partition number in Spark

Change Spark job properties:

Property	Meaning
hibench.yarn.executor.num	Spark executor number in YARN mode
hibench.yarn.executor.cores	Spark executor cores in YARN mode
spark.executor.memory	Spark executor memory
spark.driver.memory	Spark driver memory
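
In YARN mode, the total number of cores requested is the executor number multiplied by the cores per executor, so it is worth checking that product against what the cluster can provide. The sketch below reads both values from a conf file and prints the product. It assumes whitespace-separated "property value" lines. The helper names and the SPARK_CONF variable are hypothetical.

```shell
# get_prop FILE KEY (hypothetical helper): print the value of the
# last "KEY VALUE" line in FILE, or nothing if KEY is absent.
get_prop() {
  awk -v k="$2" '$1 == k { v = $2 } END { if (v != "") print v }' "$1"
}

# total_executor_cores FILE: executor number times cores per executor.
total_executor_cores() {
  local file="$1" num cores
  num="$(get_prop "$file" hibench.yarn.executor.num)"
  cores="$(get_prop "$file" hibench.yarn.executor.cores)"
  if [ -z "$num" ] || [ -z "$cores" ]; then
    echo "executor settings not found in $file" >&2
    return 1
  fi
  echo $(( num * cores ))
}

# SPARK_CONF is an assumed override; the default matches the path above.
conf="${SPARK_CONF:-conf/spark.conf}"
if [ -f "$conf" ]; then
  echo "total executor cores requested: $(total_executor_cores "$conf")"
fi
```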

References

Sample log: Hadoop terasort

...

Version	Old Version 7	New Version 8
Changes made by	Jun He (Deactivated)	Jun He (Deactivated)
Saved on	Dec 21, 2017	Dec 21, 2017
