...

NOTE: The following steps were tested on Ubuntu 16.04.

Prerequisites

Code Block
languagebash
apt install -y maven

...

Code Block
languagebash
# get source of latest release
git clone https://github.com/intel-hadoop/HiBench -b HiBench-7.0
cd HiBench
# build all modules in HiBench
mvn -Dspark=2.2 -Dscala=2.11 clean package
# if you just want to build for hadoop and spark
mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package

Run Benchmark

...

Prerequisites

Code Block
languagebash
apt install -y bc python2.7 python-setuptools openssh-server
service ssh start

...

The fields in hadoop.conf are described below:

Property                      Meaning
hibench.hadoop.home           The Hadoop installation location
hibench.hadoop.executable     The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir  The Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master           The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release        The Hadoop release provider. Supported values: apache, cdh5, hdp
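
As a rough illustration, the fields above could be filled in for an Apache Hadoop installation as sketched below. The Hadoop location /opt/hadoop, the variable names HADOOP_ROOT and HIBENCH_CONF, and the NameNode address are assumptions made for this example; substitute the values of your own cluster. Note that the script overwrites the target file.

```bash
#!/usr/bin/env bash
# Sketch: write a minimal conf/hadoop.conf for an Apache Hadoop installation.
# HADOOP_ROOT and HIBENCH_CONF are hypothetical variable names; /opt/hadoop is an assumed path.
set -eu

HADOOP_ROOT="${HADOOP_ROOT:-/opt/hadoop}"
HIBENCH_CONF="${HIBENCH_CONF:-conf/hadoop.conf}"

mkdir -p "$(dirname "$HIBENCH_CONF")"

# Each line is "<property> <value>", separated by whitespace.
cat > "$HIBENCH_CONF" <<EOF
hibench.hadoop.home           ${HADOOP_ROOT}
hibench.hadoop.executable     ${HADOOP_ROOT}/bin/hadoop
hibench.hadoop.configure.dir  ${HADOOP_ROOT}/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020/user/username
hibench.hadoop.release        apache
EOF

# Show the result for a quick visual check.
cat "$HIBENCH_CONF"
```

Deriving the executable and configuration directory from one root variable keeps the three paths consistent if Hadoop is later moved.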

Run Workload

A HiBench workload usually has two parts: prepare and run. For example, to run "wordcount":

...

  • Change the input data size:
    • Set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
  • Change the parallelism:
    • Change the properties below in conf/hibench.conf to control the parallelism.

      Property                             Meaning
      hibench.default.map.parallelism      Number of mappers in Hadoop
      hibench.default.shuffle.parallelism  Number of reducers in Hadoop
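
The two tuning steps above can also be scripted. The following is a minimal sketch, assuming the configuration file uses whitespace-separated "<property> <value>" lines; set_prop is a hypothetical helper written for this example (it is not part of HiBench), and the values 8 and 16 are arbitrary sample numbers. It works on a scratch file so that no real configuration is modified.

```bash
#!/usr/bin/env bash
# Sketch: change the scale profile and parallelism in a HiBench-style conf file.
# set_prop is a hypothetical helper written for this example, not part of HiBench.
set -eu

# Replace the value of an existing "<property> <value>" line, or append the line.
set_prop() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v k="$key" -v v="$value" '
    $1 == k { print k "    " v; found = 1; next }
    { print }
    END { if (!found) print k "    " v }
  ' "$file" > "$tmp"
  cat "$tmp" > "$file"   # write back in place to keep the file's permissions
  rm -f "$tmp"
}

# Scratch file with sample values; nothing under conf/ is touched.
demo_conf="${TMPDIR:-/tmp}/hibench-demo.conf"
cat > "$demo_conf" <<'EOF'
hibench.scale.profile              tiny
hibench.default.map.parallelism    8
EOF

# Reject anything that is not one of the documented profile names.
profile="large"
case "$profile" in
  tiny|small|large|huge|gigantic|bigdata) ;;
  *) echo "unknown scale profile: $profile" >&2; exit 1 ;;
esac

set_prop "$demo_conf" hibench.scale.profile "$profile"
set_prop "$demo_conf" hibench.default.map.parallelism 16
set_prop "$demo_conf" hibench.default.shuffle.parallelism 16   # not present yet, so it is appended

cat "$demo_conf"
```

To apply the same changes for real, call set_prop with conf/hibench.conf instead of the scratch file.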


Spark

Setup

  • A working HDFS service
  • A working YARN service, if Spark is tested in YARN mode
  • A working Spark installation: Spark can be configured to run in either "standalone mode" or "YARN mode". ("Mesos mode" is not covered because Mesos is not deployed when we run HiBench.)
    • Standalone mode: the easiest to set up; it provides almost all the same features as "YARN mode" if only Spark is running on the cluster.
    • YARN mode:

...

Code Block
languagebash
cp conf/hadoop.conf.template conf/hadoop.conf


Property                      Meaning
hibench.hadoop.home           The Hadoop installation location
hibench.hadoop.executable     The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop
hibench.hadoop.configure.dir  The Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop
hibench.hdfs.master           The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username
hibench.hadoop.release        The Hadoop release provider. Supported values: apache, cdh5, hdp

Configure Spark

Create and edit conf/spark.conf

...

  • Change the input data size:
    • Set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
  • Change the parallelism:

    Property                             Meaning
    hibench.default.map.parallelism      Number of partitions in Spark
    hibench.default.shuffle.parallelism  Number of shuffle partitions in Spark


  • Change the Spark job properties:

    Property                     Meaning
    hibench.yarn.executor.num    Number of Spark executors in YARN mode
    hibench.yarn.executor.cores  Number of cores per Spark executor in YARN mode
    spark.executor.memory        Spark executor memory
    spark.driver.memory          Spark driver memory
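
Before submitting a job in YARN mode, it can help to check how much the Spark job properties above request from the cluster in total. The sketch below is only an estimate under stated assumptions: get_prop is a hypothetical helper, the sample values (2 executors, 4 cores, 4g) are made up for illustration, memory is assumed to be given in whole gigabytes with a "g" suffix, and the per-container memory overhead that YARN adds on top of the executor memory is ignored.

```bash
#!/usr/bin/env bash
# Sketch: estimate the total cores and executor memory requested by a spark.conf.
# get_prop is a hypothetical helper; the values below are sample numbers only.
set -eu

# Print the value of the first "<property> <value>" line whose name matches.
get_prop() {
  awk -v k="$2" '$1 == k { print $2; exit }' "$1"
}

# Scratch file with sample settings; nothing under conf/ is touched.
demo_conf="${TMPDIR:-/tmp}/spark-demo.conf"
cat > "$demo_conf" <<'EOF'
hibench.yarn.executor.num      2
hibench.yarn.executor.cores    4
spark.executor.memory          4g
spark.driver.memory            4g
EOF

num="$(get_prop "$demo_conf" hibench.yarn.executor.num)"
cores="$(get_prop "$demo_conf" hibench.yarn.executor.cores)"
exec_mem="$(get_prop "$demo_conf" spark.executor.memory)"

# Total cores = executors x cores per executor.
total_cores=$(( num * cores ))
# Strip the trailing "g" and multiply; excludes YARN overhead and the driver.
total_mem_gb=$(( num * ${exec_mem%g} ))

echo "executors=${num} total_cores=${total_cores} executor_memory_total=${total_mem_gb}g"
```

If the totals exceed what the YARN NodeManagers offer, the executors may not all be granted, so it is worth comparing these numbers against the cluster capacity first.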


References

Sample log: Hadoop terasort

...