References
https://github.com/intel-hadoop/HiBench/wiki/Getting-Started
...
Introduction
...
HiBench is an open-source, Apache-licensed big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput, and system resource utilization.
It contains a set of Hadoop, Spark, and streaming workloads, including Sort, WordCount, TeraSort, PageRank, Bayes, K-means, enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Storm, and Samza.
Build
...
NOTE: The following steps were tested on Ubuntu 16.04.
Prerequisites
Code Block | ||
---|---|---|
| ||
apt install -y maven |
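Maven also needs a JDK to compile HiBench. Installing OpenJDK 8 here is an assumption of this guide (it is not part of the original steps), but it is the usual choice on Ubuntu 16.04:
Code Block | ||
---|---|---|
| ||
# assumption: OpenJDK 8 satisfies the JDK requirement for the Maven build
apt install -y openjdk-8-jdk
java -version   # verify the JDK is on the PATH |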
Build
Code Block | ||
---|---|---|
| ||
# get the source of the latest release
git clone https://github.com/intel-hadoop/HiBench.git -b HiBench-7.0
cd HiBench
# build all modules in HiBench (Spark version changed from the default)
mvn -Dspark=2.2 -Dscala=2.11 clean package
# create the user-defined properties file from its template
cd conf
cp 99-user_defined_properties.conf.template 99-user_defined_properties.conf |
...
# if you just want to build for hadoop and spark
mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package |
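If only one workload category is of interest, the build also accepts a module selector. The command below (building only the Spark machine-learning workloads) is a sketch based on the upstream HiBench build documentation and may need adjusting for your release:
Code Block | ||
---|---|---|
| ||
# build only the machine-learning workloads for the Spark framework
mvn -Psparkbench -Dmodules -Pml -Dspark=2.2 -Dscala=2.11 clean package |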
Run Benchmark
...
Prerequisites
Code Block | ||
---|---|---|
| ||
apt install -y bc python2.7 python-setuptools openssh-server
service ssh start |
Hadoop
Setup
- A working Hadoop cluster with HDFS and YARN (a start-up sketch follows this list)
- A running SSH service
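For reference, the commands below sketch how a single-node HDFS/YARN cluster is typically brought up. They assume HADOOP_HOME points at the Hadoop installation and that core-site.xml and hdfs-site.xml are already configured:
Code Block | ||
---|---|---|
| ||
# assumption: HADOOP_HOME points at the Hadoop installation
$HADOOP_HOME/bin/hdfs namenode -format        # first run only
$HADOOP_HOME/sbin/start-dfs.sh                # start NameNode/DataNode
$HADOOP_HOME/sbin/start-yarn.sh               # start ResourceManager/NodeManager
$HADOOP_HOME/bin/hdfs dfsadmin -report        # verify HDFS is up |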
You may encounter two problems:
...
Passphraseless ssh
Hadoop requires a certain account to log in to the nodes without a passphrase. This account should be set up on each node with the following commands:
Code Block | ||
---|---|---|
| ||
mkdir -p ~/.ssh
rm -f ~/.ssh/id_rsa*
# scan and save the target host fingerprints
ssh-keyscan -t ecdsa-sha2-nistp256 -H ${HOSTNAME} > ~/.ssh/known_hosts
ssh-keyscan -t ecdsa-sha2-nistp256 -H localhost >> ~/.ssh/known_hosts
ssh-keyscan -t ecdsa-sha2-nistp256 -H 0.0.0.0 >> ~/.ssh/known_hosts
# generate a passphraseless key and authorize it
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys |
Hadoop user privilege
It is recommended to run the Hadoop services as a non-root user; usually a dedicated user such as hdfs is created to run the HDFS and YARN services. If running as "root" is a must, the following exports are required before starting the services:
Code Block | ||
---|---|---|
| ||
USER=$(whoami)
export HDFS_NAMENODE_USER=${USER}
export HDFS_DATANODE_USER=${USER}
export HDFS_SECONDARYNAMENODE_USER=${USER}
export YARN_RESOURCEMANAGER_USER=${USER}
export YARN_NODEMANAGER_USER=${USER} |
Configure HiBench
HiBench requires Hadoop configuration information to set up and run the test workloads. Set at least the following properties in conf/99-user_defined_properties.conf (created from the template during the build):
hibench.hadoop.home <Hadoop installation location>
hibench.spark.home <Spark installation location>
hibench.hdfs.master hdfs://<host>:8020
hibench.spark.master spark://<host>:7077
hibench.hadoop.version hadoop2 # set this in addition to the properties above, as HiBench was not able to detect the Hadoop version automatically
Errors and Workarounds
'''Error 1:'''
Traceback (most recent call last):
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 556, in <module>
load_config(conf_root, workload_root, workload_folder, patching_config)
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 161, in load_config
generate_optional_value()
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 374, in generate_optional_value
HibenchConf["hibench.hadoop.examples.test.jar"] = OneAndOnlyOneFile(HibenchConf['hibench.hadoop.mapreduce.home'] + "/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient*-tests.jar")
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 114, in OneAndOnlyOneFile
raise Exception("Need to match one and only one file!")
Exception: Need to match one and only one file!
/home/nbhoyar/HiBench/bin/functions/workload-functions.sh: line 39: .: filename argument required
.: usage: . filename [arguments]
'''Solution:''' Modified lines 358 and 374 in bin/functions/load-config.py to reflect the correct path of the ODPi Hadoop examples jar file.
'''Error 2:''' Traceback (most recent call last):
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 556, in <module>
load_config(conf_root, workload_root, workload_folder, patching_config)
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 161, in load_config
generate_optional_value()
File "/home/nbhoyar/HiBench/bin/functions/load-config.py", line 434, in generate_optional_value
assert 0, "Get workers from spark master's web UI page failed, reason:%s\nPlease check your configurations, network settings, proxy settings, or set `hibench.masters.hostnames` and `hibench.slaves.hostnames` manually to bypass auto-probe" % e
AssertionError: Get workers from spark master's web UI page failed, reason:[Errno socket error] [Errno 111] Connection refused
Please check your configurations, network settings, proxy settings, or set `hibench.masters.hostnames` and `hibench.slaves.hostnames` manually to bypass auto-probe
/home/nbhoyar/HiBench/bin/functions/workload-functions.sh: line 39: .: filename argument required
'''Solution:''' Started Spark (start-all.sh and the history server).
'''Error 3:''' /home/nbhoyar/HiBench/bin/functions/workload-functions.sh: line 113: $1: unbound variable
'''Solution:''' Still under investigation; posted at https://github.com/intel-hadoop/HiBench/issues/279.
Apart from the user-defined properties above, the default HiBench configuration for Hadoop is <HIBENCH_ROOT_DIR>/conf/hadoop.conf. A template configuration file can be used as a starting point:
Code Block | ||
---|---|---|
| ||
cp conf/hadoop.conf.template conf/hadoop.conf |
Usually these two fields should be modified to match the Hadoop settings:
hibench.hadoop.home: points to the Hadoop root directory
hibench.hdfs.master: points to the HDFS service URI, which can be found in <HADOOP_ROOT_DIR>/etc/hadoop/core-site.xml under fs.defaultFS.
A detailed description of the fields in hadoop.conf is listed below; an example file follows the table:
Property | Meaning |
---|---|
hibench.hadoop.home | The Hadoop installation location |
hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username |
hibench.hadoop.release | Hadoop release provider. Supported value: apache, cdh5, hdp |
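Putting it together, a minimal conf/hadoop.conf might look like the snippet below. The /opt/hadoop path and localhost address are assumptions and must be adjusted to the actual installation:
Code Block | ||
---|---|---|
| ||
# minimal example; /opt/hadoop and localhost are assumed values
hibench.hadoop.home           /opt/hadoop
hibench.hadoop.executable     /opt/hadoop/bin/hadoop
hibench.hadoop.configure.dir  /opt/hadoop/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020
hibench.hadoop.release        apache |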
Run Workload
A HiBench workload usually has two parts: prepare and run. For example, to run "wordcount":
Code Block | ||
---|---|---|
| ||
bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/hadoop/run.sh |
The prepare.sh launches a Hadoop job to generate the input data on HDFS. The run.sh submits a Hadoop job to the cluster.
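To confirm that prepare.sh actually generated input data, the HDFS directory can be listed. The path below assumes the default hibench.hdfs.data.dir under hibench.hdfs.master and may differ in your configuration:
Code Block | ||
---|---|---|
| ||
# assumes the default HiBench data directory on HDFS
hdfs dfs -ls -R /HiBench/Wordcount |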
View Report
When the benchmark is done, the report is written to <HIBENCH_ROOT_DIR>/report/hibench.report. It is a summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.
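The report is a plain whitespace-separated text file, so it can be read directly; aligning the columns is only a readability convenience:
Code Block | ||
---|---|---|
| ||
# align the report columns for easier reading
column -t <HIBENCH_ROOT_DIR>/report/hibench.report |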
The report directory also includes further information for debugging and tuning.
- <workload>/hadoop/bench.log: raw logs on the client side.
- <workload>/hadoop/monitor.html: system utilization monitor results.
- <workload>/hadoop/conf/<workload>.conf: generated environment variable configuration for this workload.
Tuning Benchmark
- Change input data size: set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
- Change parallelism: set the properties below in conf/hibench.conf to control the parallelism (an example snippet follows this list).
Property | Meaning |
---|---|
hibench.default.map.parallelism | Mapper number in Hadoop |
hibench.default.shuffle.parallelism | Reducer number in Hadoop |
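As an illustration, the snippet below shows how these properties might be set in conf/hibench.conf; the values are assumptions and should be sized to the cluster:
Code Block | ||
---|---|---|
| ||
# example values only; size to the number of cores/disks in the cluster
hibench.scale.profile                large
hibench.default.map.parallelism      8
hibench.default.shuffle.parallelism  8 |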
Spark
Setup
- A working HDFS service
- A working YARN service, if Spark is tested in YARN mode
- Working Spark: Spark can be configured to work in either "standalone mode" or "YARN mode" ("Mesos mode" is not covered, since Mesos is not deployed when we run HiBench).
- Standalone mode: the easiest to set up; it provides almost all the same features as YARN mode when only Spark is running (a start-up sketch follows this list).
- YARN mode: requires the working HDFS and YARN services listed above.
- Start SSH service
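As a sketch, standalone mode can usually be brought up with the scripts shipped with Spark 2.x. SPARK_HOME is assumed to point at the Spark installation; starting the history server matches the workaround noted for Error 2 above:
Code Block | ||
---|---|---|
| ||
# assumption: SPARK_HOME points at the Spark installation (standalone mode)
$SPARK_HOME/sbin/start-master.sh            # master web UI on port 8080
$SPARK_HOME/sbin/start-slaves.sh            # workers listed in conf/slaves
$SPARK_HOME/sbin/start-history-server.sh    # see Error 2 above |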
Configure HiBench
Configure Hadoop
Hadoop is used to generate the input data for the workloads. Create and edit conf/hadoop.conf:
Code Block | ||
---|---|---|
| ||
cp conf/hadoop.conf.template conf/hadoop.conf |
Property | Meaning |
---|---|
hibench.hadoop.home | The Hadoop installation location |
hibench.hadoop.executable | The path of hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username |
hibench.hadoop.release | Hadoop release provider. Supported value: apache, cdh5, hdp |
Configure Spark
Create and edit conf/spark.conf:
Code Block | ||
---|---|---|
| ||
cp conf/spark.conf.template conf/spark.conf |
Set the following properties appropriately (an example file follows):
hibench.spark.home The Spark installation location
hibench.spark.master The Spark master, e.g. `spark://xxx:7077`, `yarn-client`
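A minimal conf/spark.conf might therefore look like the example below; the /opt/spark path and the choice of yarn-client mode are assumptions to adapt:
Code Block | ||
---|---|---|
| ||
# minimal example; /opt/spark and yarn-client mode are assumed values
hibench.spark.home    /opt/spark
hibench.spark.master  yarn-client |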
Run Workload
A HiBench workload usually has two parts: prepare and run. For example, to run "wordcount":
Code Block | ||
---|---|---|
| ||
bin/workloads/micro/wordcount/prepare/prepare.sh
bin/workloads/micro/wordcount/spark/run.sh |
The prepare.sh launches a Hadoop job to generate the input data on HDFS. The run.sh submits a Spark job to the cluster.
View Report
Same as "Hadoop benchmark", the report is outputed to <HIBENCH_ROOT_DIR>/report/hibench.report
.
Meanwhile, detail information is generated for debugging and tuning.
- <workload>/spark/bench.log: raw logs on the client side.
- <workload>/spark/monitor.html: system utilization monitor results.
- <workload>/spark/conf/<workload>.conf: generated environment variable configuration for this workload.
- <workload>/spark/conf/sparkbench/<workload>/sparkbench.conf: generated configuration for this workload, used for mapping to environment variables.
- <workload>/spark/conf/sparkbench/<workload>/spark.conf: generated configuration for Spark.
Tuning Benchmark
- Change input data size: set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
- Change parallelism: set the properties below in conf/hibench.conf.
Property | Meaning |
---|---|
hibench.default.map.parallelism | Partition number in Spark |
hibench.default.shuffle.parallelism | Shuffle partition number in Spark |
- Change Spark job properties (an example snippet follows this list):
Property | Meaning |
---|---|
hibench.yarn.executor.num | Spark executor number in YARN mode |
hibench.yarn.executor.cores | Spark executor cores in YARN mode |
spark.executor.memory | Spark executor memory |
spark.driver.memory | Spark driver memory |
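For illustration, these properties could be combined as below, with the parallelism settings in conf/hibench.conf and the executor and memory settings in conf/spark.conf; every value is an assumption to be sized to the cluster:
Code Block | ||
---|---|---|
| ||
# example values only; size to the cluster's executors and memory
hibench.default.map.parallelism      16
hibench.default.shuffle.parallelism  16
hibench.yarn.executor.num            4
hibench.yarn.executor.cores          4
spark.executor.memory                4g
spark.driver.memory                  2g |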
References
Sample log: Hadoop terasort