References

https://github.com/intel-hadoop/HiBench/wiki/Getting-Started


Introduction


HiBench is an open-source, Apache-licensed big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput, and system resource utilization.

It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, PageRank, Bayes, Kmeans, enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Storm and Samza. 

Build


NOTE: The following steps were tested on Ubuntu 16.04.

Prerequisites

Code Block
languagebash
apt install -y maven

Build

Code Block
languagebash
git clone https://github.com/intel-hadoop/HiBench.git   # get source
cd HiBench
# build all modules in HiBench (Spark version changed from the default)
mvn -Dspark=2.2 -Dscala=2.11 clean package
# if you just want to build for Hadoop and Spark
mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package
# create the user-defined configuration from its template
cd conf
cp 99-user_defined_properties.conf.template 99-user_defined_properties.conf

Set at least the following properties in 99-user_defined_properties.conf:

Code Block
hibench.hadoop.home     <Hadoop installation location>
hibench.spark.home      <Spark installation location>
hibench.hdfs.master     hdfs://<host>:8020
hibench.spark.master    spark://<host>:7077
hibench.hadoop.version  hadoop2   # set this in addition to the above, as HiBench was not able to detect the Hadoop version

Run Benchmark


Prerequisites

Code Block
languagebash
apt install -y bc python2.7 python-setuptools openssh-server
service ssh start

Hadoop

Setup

  • A working Hadoop cluster with HDFS and YARN (a quick sanity check for the running services is sketched below)
    • To set up a pseudo-distributed cluster, please refer to this link (Hadoop 2.x) or this link (Hadoop 3.0).
    • To set up a multi-node cluster, please refer to this link (Hadoop 2.x) or this link (Hadoop 3.0).
  • Start up the SSH service
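
Before running any workload, it helps to sanity-check that the cluster services are actually up (assuming the Hadoop binaries are on the PATH):

Code Block
languagebash
jps                    # expect NameNode, DataNode, ResourceManager and NodeManager processes
hdfs dfsadmin -report  # DataNode capacity and liveness
yarn node -list        # registered NodeManagers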

You may encounter two problems:

  1. Passphraseless ssh
    Hadoop requires a certain account to be able to log in to every node without a passphrase. Set this account up on each node with the following commands.

    Code Block
    languagebash
    mkdir -p ~/.ssh
    rm -f ~/.ssh/id_rsa*
    # scan and save the target host fingerprints
    ssh-keyscan -t ecdsa-sha2-nistp256 -H ${HOSTNAME} > ~/.ssh/known_hosts
    ssh-keyscan -t ecdsa-sha2-nistp256 -H localhost >> ~/.ssh/known_hosts
    ssh-keyscan -t ecdsa-sha2-nistp256 -H 0.0.0.0 >> ~/.ssh/known_hosts
    # generate a passphraseless key pair and authorize it
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys

  2. Hadoop user privilege
    It is recommended to run the Hadoop services as a non-root user; usually a dedicated user, hdfs, is created to run the HDFS and YARN services. If running as "root" is a must, the following environment variables are required.

    Code Block
    languagebash
    themeConfluence
    USER=$(whoami)
    export HDFS_NAMENODE_USER=${USER}
    export HDFS_DATANODE_USER=${USER}
    export HDFS_SECONDARYNAMENODE_USER=${USER}
    export YARN_RESOURCEMANAGER_USER=${USER}
    export YARN_NODEMANAGER_USER=${USER}
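
With the key from problem 1 in place, passphraseless login can be verified; this should print OK without prompting for a password:

Code Block
languagebash
ssh -o BatchMode=yes localhost true && echo OK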

Errors and Workarounds

Running WordCount Workload

Code Block
languagebash
workloads/wordcount/prepare/prepare.sh

'''Error 1:''' certain environment variables not found when running workloads/wordcount/spark/scala/bin/run.sh

Code Block
Traceback (most recent call last):
  File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 556, in <module>
    load_config(conf_root, workload_root, workload_folder, patching_config)
  File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 161, in load_config
    generate_optional_value()
  File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 374, in generate_optional_value
    HibenchConf["hibench.hadoop.examples.test.jar"] = OneAndOnlyOneFile(HibenchConf['hibench.hadoop.mapreduce.home'] + "/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient*-tests.jar")
  File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 114, in OneAndOnlyOneFile
    raise Exception("Need to match one and only one file!")
Exception: Need to match one and only one file!
/home/nbhoyar/HiBench/bin/functions/workload-functions.sh: line 39: .: filename argument required
.: usage: . filename [arguments]

'''Solution:''' Modified lines 358 and 374 in bin/functions/load-config.py to point to the correct path of ODPi Hadoop's example jar file.
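
For reference, the failing helper OneAndOnlyOneFile globs for the jobclient tests jar and requires exactly one match. Checking the pattern by hand (with HADOOP_HOME standing in for the configured hibench.hadoop.mapreduce.home, an assumption for illustration) shows whether zero or several jars triggered the exception:

Code Block
languagebash
# exactly one file must match, or load-config.py raises "Need to match one and only one file!"
ls ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient*-tests.jar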

'''Error 2:''' Spark master and history server not running.

'''Solution:''' Started Spark (start-all and history server).


Configure HiBench

HiBench requires Hadoop configuration info to set up and run the test workloads. The default configuration file is <HIBENCH_ROOT_DIR>/conf/hadoop.conf. A template configuration file can be used as a starting point.

Code Block
languagebash
cp conf/hadoop.conf.template conf/hadoop.conf

Usually these two fields need to be modified to match the Hadoop settings:

hibench.hadoop.home: points to the Hadoop root directory

hibench.hdfs.master: points to the HDFS service URI. This URI can be found in <HADOOP_ROOT_DIR>/etc/hadoop/core-site.xml under fs.defaultFS.
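
The value can be read straight from the Hadoop configuration (assuming HADOOP_HOME points at the installation):

Code Block
languagebash
grep -A 1 fs.defaultFS ${HADOOP_HOME}/etc/hadoop/core-site.xml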

A detailed description of each field in hadoop.conf follows:

Property                      Meaning
hibench.hadoop.home           The Hadoop installation location
hibench.hadoop.executable     The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir  Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master           The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release        Hadoop release provider. Supported values: apache, cdh5, hdp
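
Putting it together, a conf/hadoop.conf for a pseudo-distributed Apache Hadoop installed under /opt/hadoop might look like the sketch below; the path is illustrative, and the ${...} property references follow the template's own substitution style:

Code Block
hibench.hadoop.home           /opt/hadoop
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020
hibench.hadoop.release        apache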

Run Workload

HiBench's workloads usually have two parts: prepare and run. For example, to run "wordcount":

Code Block
languagebash
bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/hadoop/run.sh

The prepare.sh launches a Hadoop job to generate the input data on HDFS. The run.sh submits a Hadoop job to the cluster.
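
After prepare.sh completes, the generated input can be inspected on HDFS. By default HiBench keeps its data under a /HiBench directory; the exact path below is an assumption for illustration:

Code Block
languagebash
hdfs dfs -du -h /HiBench/Wordcount/Input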

View Report

When the benchmark is done, the report is output to <HIBENCH_ROOT_DIR>/report/hibench.report. It is a summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.

The report directory also includes further information for debugging and tuning.

  • <workload>/hadoop/bench.log: Raw logs on client side.
  • <workload>/hadoop/monitor.html: System utilization monitor results.
  • <workload>/hadoop/conf/<workload>.conf: Generated environment variable configurations for this workload.
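
For a quick look at these artifacts after a run (the workload name and path below are illustrative, relative to <HIBENCH_ROOT_DIR>):

Code Block
languagebash
tail -n 20 report/wordcount/hadoop/bench.log   # client-side log of the last run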

Tuning Benchmark

  • change input data size:
    • set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
  • change parallelism
    • Change the below properties in conf/hibench.conf to control the parallelism.

      Property                             Meaning
      hibench.default.map.parallelism      Mapper number in Hadoop
      hibench.default.shuffle.parallelism  Reducer number in Hadoop
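
For example, to run a larger dataset with more tasks, the relevant lines in conf/hibench.conf could be set as follows (the values are illustrative and should be adapted to the cluster):

Code Block
hibench.scale.profile                large
hibench.default.map.parallelism      16
hibench.default.shuffle.parallelism  8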


Spark

Setup

  • A working HDFS service
  • A working YARN service, if Spark is tested in YARN mode
  • Working Spark: Spark can be configured to work in either "standalone mode" or "YARN mode". ("Mesos mode" is not covered, as Mesos is not deployed when we run HiBench.)
    • Standalone mode: the easiest to set up, and it provides almost all the same features as YARN mode when only Spark is running (a start-up sketch follows this list).
    • YARN mode: Spark submits its jobs to the YARN service listed above.
  • Start SSH service
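
For standalone mode on a single node, the Spark services can be brought up with the scripts shipped in the Spark distribution (SPARK_HOME is assumed to point at the installation); this is also the fix noted for Error 2 above:

Code Block
languagebash
${SPARK_HOME}/sbin/start-all.sh              # start the master and the configured workers
${SPARK_HOME}/sbin/start-history-server.sh   # optional: history server for inspecting finished jobs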

Configure HiBench

Configure Hadoop

Hadoop is used to generate the input data of the workloads. Create and edit conf/hadoop.conf

Code Block
languagebash
cp conf/hadoop.conf.template conf/hadoop.conf


Property                      Meaning
hibench.hadoop.home           The Hadoop installation location
hibench.hadoop.executable     The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir  Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master           The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release        Hadoop release provider. Supported values: apache, cdh5, hdp

Configure Spark

Create and edit conf/spark.conf

Code Block
languagebash
cp conf/spark.conf.template conf/spark.conf


Set the following properties appropriately:

hibench.spark.home            The Spark installation location
hibench.spark.master          The Spark master, i.e. `spark://xxx:7077`, `yarn-client`
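
For instance, to test Spark on YARN in client mode, the two properties might read as follows (the Spark path is illustrative):

Code Block
hibench.spark.home    /opt/spark
hibench.spark.master  yarn-client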

Run Workload

HiBench's workloads usually have two parts: prepare and run. For example, to run "wordcount":

Code Block
languagebash
bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/spark/run.sh

The prepare.sh launches a Hadoop job to generate the input data on HDFS. The run.sh submits a Spark job to the cluster.

View Report

Same as "Hadoop benchmark", the report is outputed to <HIBENCH_ROOT_DIR>/report/hibench.report.

Meanwhile, detail information is generated for debugging and tuning.

  • <workload>/spark/bench.log: Raw logs on client side.
  • <workload>/spark/monitor.html: System utilization monitor results.
  • <workload>/spark/conf/<workload>.conf: Generated environment variable configurations for this workload.
  • <workload>/spark/conf/sparkbench/<workload>/sparkbench.conf: Generated configuration for this workload, used for mapping to environment variables.
  • <workload>/spark/conf/sparkbench/<workload>/spark.conf: Generated configuration for Spark.

Tuning Benchmark

  • change input data size:
    • set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
  • change parallelism

    Property                             Meaning
    hibench.default.map.parallelism      Partition number in Spark
    hibench.default.shuffle.parallelism  Shuffle partition number in Spark


  • change Spark job properties

    Property                     Meaning
    hibench.yarn.executor.num    Spark executor number in YARN mode
    hibench.yarn.executor.cores  Spark executor cores in YARN mode
    spark.executor.memory        Spark executor memory
    spark.driver.memory          Spark driver memory
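
As an illustration, executor sizing for a small YARN cluster could look like this in conf/spark.conf (the numbers are placeholders to adapt to the cluster):

Code Block
hibench.yarn.executor.num    4
hibench.yarn.executor.cores  2
spark.executor.memory        4g
spark.driver.memory          2g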