Introduction

...

HiBench is an open-source, Apache-licensed big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization.

It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, PageRank, Bayes, Kmeans and enhanced DFSIO. The streaming workloads target Spark Streaming, Storm and Samza.

Build

...

NOTE: The following steps were tested on Ubuntu 16.04.

Prerequisites

Code Block
languagebash
apt install -y maven

Build

Code Block
languagebash
# get source of latest release
git clone https://github.com/intel-hadoop/HiBench -b HiBench-7.0
cd HiBench
# build all modules in HiBench
mvn -Dspark=2.2 -Dscala=2.11 clean package
# if you just want to build for hadoop and spark
mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package

Run Benchmark

...

Prerequisites

Code Block
languagebash
apt install -y bc python2.7 python-setuptools openssh-server
service ssh start

Hadoop

Setup

  • A working Hadoop cluster with HDFS and YARN
    • To set up a pseudo-distributed cluster, please refer to this link (Hadoop 2.x) or this link (Hadoop 3.0).
    • To set up a multi-node cluster, please refer to this link (Hadoop 2.x) or this link (Hadoop 3.0).
  • Start the SSH service

...

  1. Passphraseless SSH
    Hadoop requires an account that can log in to the nodes without a passphrase. This account must be set up on every node. To set it up, use the following commands.

    Code Block
    languagebash
    mkdir -p ~/.ssh
    rm -f ~/.ssh/id_rsa*
    # scan and save target fingerprints
    ssh-keyscan -t ecdsa-sha2-nistp256 -H ${HOSTNAME} > ~/.ssh/known_hosts
    ssh-keyscan -t ecdsa-sha2-nistp256 -H localhost >> ~/.ssh/known_hosts
    ssh-keyscan -t ecdsa-sha2-nistp256 -H 0.0.0.0 >> ~/.ssh/known_hosts
    # generate key
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys


  2. Hadoop user privilege
    It is recommended to run Hadoop services as a non-root user. Usually a dedicated user, hdfs, is created to run the HDFS and YARN services. If the services must run as "root", the following commands are required.

    Code Block
    languagebash
    USER=$(whoami)
    export HDFS_NAMENODE_USER=${USER}
    export HDFS_DATANODE_USER=${USER}
    export HDFS_SECONDARYNAMENODE_USER=${USER}
    export YARN_RESOURCEMANAGER_USER=${USER}
    export YARN_NODEMANAGER_USER=${USER}
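
Before starting any Hadoop service, it can help to confirm that the passphraseless login from step 1 really works. The following is a minimal sketch and not part of HiBench or Hadoop; the function name check_passphraseless_ssh is hypothetical. It relies on the OpenSSH BatchMode option, which makes ssh fail instead of prompting for credentials.

```shell
# Return 0 if the current user can log in to the given host without being
# asked for a password or passphrase (hypothetical helper, not part of HiBench).
check_passphraseless_ssh() {
    local host="${1:-localhost}"
    # BatchMode=yes disables interactive prompts, so a missing key makes the
    # command fail rather than hang; ConnectTimeout bounds the wait in seconds.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

Typical usage would be `check_passphraseless_ssh localhost && echo "login ok"`, repeated for every node of the cluster.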


Configure HiBench

HiBench needs the Hadoop configuration to set up and run test workloads. The default configuration file is <HIBENCH_ROOT_DIR>/conf/hadoop.conf. A template configuration file can be used as a starting point.

...

The fields in hadoop.conf are described below:

Property                      Meaning
hibench.hadoop.home           The Hadoop installation location
hibench.hadoop.executable     The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir  The Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master           The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release        The Hadoop release provider. Supported values: apache, cdh5, hdp
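
As an illustration of how these fields fit together, the sketch below writes a minimal hadoop.conf for an Apache Hadoop installation. The helper name write_hadoop_conf and all values passed to it are hypothetical; it assumes the file uses the whitespace-separated "property value" layout and that the executable and configuration directory live under the Hadoop home as described in the table.

```shell
# Write a minimal hadoop.conf (hypothetical helper, not part of HiBench).
#   $1: HiBench conf directory   $2: Hadoop home   $3: root HDFS path
write_hadoop_conf() {
    local conf_dir="$1" hadoop_home="$2" hdfs_master="$3"
    {
        printf 'hibench.hadoop.home           %s\n' "$hadoop_home"
        printf 'hibench.hadoop.executable     %s\n' "${hadoop_home}/bin/hadoop"
        printf 'hibench.hadoop.configure.dir  %s\n' "${hadoop_home}/etc/hadoop"
        printf 'hibench.hdfs.master           %s\n' "$hdfs_master"
        # "apache" is one of the supported release values: apache, cdh5, hdp.
        printf 'hibench.hadoop.release        %s\n' "apache"
    } > "${conf_dir}/hadoop.conf"
}
```

For example, `write_hadoop_conf conf /opt/hadoop hdfs://localhost:8020/user/username`, where /opt/hadoop is only a placeholder for the real installation path.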

Run Workload

A HiBench workload usually has two parts: prepare and run. For example, to run "wordcount":

...

prepare.sh launches a Hadoop job that generates the input data on HDFS. run.sh then submits the Hadoop benchmark job to the cluster.
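
The two-step pattern can be wrapped in a small helper so that the benchmark never starts when data generation fails. This is a sketch under an assumption: it expects the scripts under bin/workloads/<category>/<workload>/ as in the HiBench 7.0 source tree, so check the paths against your checkout. The function name run_workload is hypothetical.

```shell
# Run the prepare step, then the benchmark step, of a single workload.
# Hypothetical helper; the directory layout is an assumption (see above).
#   $1: HiBench root   $2: category, e.g. micro
#   $3: workload, e.g. wordcount   $4: framework, e.g. hadoop
run_workload() {
    local root="$1" category="$2" workload="$3" framework="$4"
    local base="${root}/bin/workloads/${category}/${workload}"
    # Generate the input data on HDFS; give up if preparation fails.
    "${base}/prepare/prepare.sh" || return 1
    # Submit the benchmark job for the chosen framework.
    "${base}/${framework}/run.sh"
}
```

With that layout, `run_workload "$PWD" micro wordcount hadoop` would run wordcount on Hadoop from the HiBench root directory.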

View Report

When the benchmark finishes, the report is written to <HIBENCH_ROOT_DIR>/report/hibench.report. It summarizes each workload run: workload name, execution duration, data size, throughput per cluster and throughput per node.

...

  • <workload>/hadoop/bench.log: Raw logs on client side.
  • <workload>/hadoop/monitor.html: System utilization monitor results.
  • <workload>/hadoop/conf/<workload>.conf: Generated environment variable configurations for this workload.
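
To pull only a few columns out of hibench.report, a short awk filter is usually enough. The sketch below assumes the report is a whitespace-separated table whose header row starts with "Type" and whose columns are workload name, date, time, input data size, duration, throughput and throughput per node, in that order. That column order is an assumption, so verify it against the header of your own report. The function name summarize_report is hypothetical.

```shell
# Print "<workload> <duration> <throughput per node>" for every run in a report.
# Hypothetical helper; the column positions are an assumption (see above).
summarize_report() {
    # Skip the header row and any line with fewer than seven columns.
    awk '$1 != "Type" && NF >= 7 { print $1, $5, $7 }' "$1"
}
```

It would be called as `summarize_report report/hibench.report` from the HiBench root directory.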

Tuning Benchmark

  • Change the input data size
    • Set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
  • Change the parallelism
    • Change the properties below in conf/hibench.conf to control the parallelism.

      Property                             Meaning
      hibench.default.map.parallelism      Number of mappers in Hadoop
      hibench.default.shuffle.parallelism  Number of reducers in Hadoop
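
These properties can also be changed from a script instead of editing conf/hibench.conf by hand, which is convenient when sweeping over several scale profiles. The helper below is a sketch, not a HiBench tool; the name set_hibench_property is hypothetical, and it assumes each setting is a single "property value" line separated by whitespace.

```shell
# Set a property in a HiBench conf file: replace the existing line if the
# property is present, otherwise append it (hypothetical helper).
#   $1: conf file   $2: property name   $3: new value
set_hibench_property() {
    local file="$1" key="$2" value="$3" tmp
    tmp="$(mktemp)" || return 1
    awk -v k="$key" -v v="$value" '
        $1 == k { print k "    " v; found = 1; next }  # rewrite the matching line
        { print }                                       # keep every other line
        END { if (!found) print k "    " v }            # append if it was absent
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Copy back instead of moving so the original file permissions are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, `set_hibench_property conf/hibench.conf hibench.scale.profile large` followed by `set_hibench_property conf/hibench.conf hibench.default.map.parallelism 16`. Commented-out lines are left untouched because their first field is the comment marker, not the property name.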


Spark

Setup

  • A working HDFS service
  • A working YARN service, if Spark is tested in YARN mode
  • A working Spark installation: Spark can be configured to run in either "standalone mode" or "YARN mode". ("Mesos mode" is not covered, because Mesos is not deployed when we run HiBench.)
    • Standalone mode: the easiest to set up; it provides almost all the features of "YARN mode" when Spark is the only framework running.
    • YARN mode:
  • Start the SSH service

Configure HiBench

Configure Hadoop

Hadoop is used to generate the input data for the workloads. Create and edit conf/hadoop.conf:

Code Block
languagebash
cp conf/hadoop.conf.template conf/hadoop.conf


Property                      Meaning
hibench.hadoop.home           The Hadoop installation location
hibench.hadoop.executable     The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir  The Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master           The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release        The Hadoop release provider. Supported values: apache, cdh5, hdp

Configure Spark

Create and edit conf/spark.conf

...

hibench.spark.home            The Spark installation location
hibench.spark.master          The Spark master, e.g. `spark://xxx:7077` or `yarn-client`
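
A missing entry in spark.conf typically shows up only after the prepare step has already run, so a quick check before launching can save time. The following sketch reports any required property that is absent or has no value. The function name check_required_properties is hypothetical, and the check assumes whitespace-separated "property value" lines.

```shell
# Check that every listed property has a value in the given conf file.
# Prints the missing ones to stderr and returns non-zero if any is missing.
# Hypothetical helper, not part of HiBench.
#   $1: conf file   remaining arguments: required property names
check_required_properties() {
    local file="$1" key missing=0
    shift
    for key in "$@"; do
        # A property counts as set when a line starts with its exact name
        # and carries at least one more field (the value).
        if ! awk -v k="$key" '$1 == k && NF >= 2 { ok = 1 } END { exit !ok }' "$file"; then
            echo "missing: ${key}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the two properties above: `check_required_properties conf/spark.conf hibench.spark.home hibench.spark.master`.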

Run Workload

A HiBench workload usually has two parts: prepare and run. For example, to run "wordcount":

...

prepare.sh launches a Hadoop job that generates the input data on HDFS. run.sh then submits the Spark benchmark job to the cluster.

View Report

As with the Hadoop benchmark, the report is written to <HIBENCH_ROOT_DIR>/report/hibench.report.

...

  • <workload>/spark/bench.log: Raw logs on client side.
  • <workload>/spark/monitor.html: System utilization monitor results.
  • <workload>/spark/conf/<workload>.conf: Generated environment variable configurations for this workload.
  • <workload>/spark/conf/sparkbench/<workload>/sparkbench.conf: Generated configuration for this workload, which is mapped to environment variables.
  • <workload>/spark/conf/sparkbench/<workload>/spark.conf: Generated configuration for Spark.

Tuning Benchmark

  • Change the input data size
    • Set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
  • Change the parallelism

    Property                             Meaning
    hibench.default.map.parallelism      Number of partitions in Spark
    hibench.default.shuffle.parallelism  Number of shuffle partitions in Spark


  • Change the Spark job properties

    Property                     Meaning
    hibench.yarn.executor.num    Number of Spark executors in YARN mode
    hibench.yarn.executor.cores  Number of cores per Spark executor in YARN mode
    spark.executor.memory        Spark executor memory
    spark.driver.memory          Spark driver memory


References

Sample log: Hadoop terasort

...