...
NOTE: The following steps were tested on Ubuntu 16.04.
Prerequisites
```
apt install -y maven
```
...
```
# get source of latest release
git clone https://github.com/intel-hadoop/HiBench -b HiBench-7.0
cd HiBench

# build all modules in HiBench
mvn -Dspark=2.2 -Dscala=2.11 clean package

# if you just want to build for hadoop and spark
mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package
```
Run Benchmark
...
Prerequisites
```
apt install -y bc python2.7 python-setuptools openssh-server
service ssh start
```
...
A detailed description of the fields in hadoop.conf is listed below:
Property | Meaning |
---|---|
hibench.hadoop.home | The Hadoop installation location |
hibench.hadoop.executable | The path of the hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username |
hibench.hadoop.release | Hadoop release provider. Supported value: apache, cdh5, hdp |
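For illustration, a minimal conf/hadoop.conf for a single-node Apache Hadoop setup might look like the sketch below; the installation path, HDFS port and user directory are assumptions and must be adapted to your cluster.

```
# conf/hadoop.conf -- illustrative values only (path, port and user are assumptions)
hibench.hadoop.home           /opt/hadoop-2.7.3
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop
hibench.hdfs.master           hdfs://localhost:8020/user/hibench
hibench.hadoop.release        apache
```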
Run Workload
HiBench workloads usually have two parts: prepare and run. For example, to run "wordcount",
...
- change input data size:
  - set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
- change parallelism: set the properties below in conf/hibench.conf to control the parallelism.

Property | Meaning |
---|---|
hibench.default.map.parallelism | Mapper number in Hadoop |
hibench.default.shuffle.parallelism | Reducer number in Hadoop |
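As a sketch of how these knobs fit together, the relevant lines in conf/hibench.conf might look like the following; the scale profile and parallelism numbers are assumptions chosen for a small test cluster, not recommended values.

```
# conf/hibench.conf -- illustrative values only (assumed small test cluster)
hibench.scale.profile                 large
hibench.default.map.parallelism       8
hibench.default.shuffle.parallelism   8
```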
Spark
Setup
- A working HDFS service
- A working YARN service, if Spark is tested in YARN mode
- A working Spark deployment: Spark can be configured to run in either "standalone mode" or "YARN mode". ("Mesos mode" is not covered here, as Mesos is not deployed when we run HiBench.)
  - Standalone mode: the easiest to set up; it provides almost all the same features as "YARN mode" when only Spark is running.
  - YARN mode:
...
```
cp conf/hadoop.conf.template conf/hadoop.conf
```
Property | Meaning |
---|---|
hibench.hadoop.home | The Hadoop installation location |
hibench.hadoop.executable | The path of hadoop executable. For Apache Hadoop, it is /YOUR/HADOOP/HOME/bin/hadoop |
hibench.hadoop.configure.dir | Hadoop configuration directory. For Apache Hadoop, it is /YOUR/HADOOP/HOME/etc/hadoop |
hibench.hdfs.master | The root HDFS path to store HiBench data, e.g. hdfs://localhost:8020/user/username |
hibench.hadoop.release | Hadoop release provider. Supported value: apache, cdh5, hdp |
Configure Spark
Create and edit conf/spark.conf:
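For illustration, the two settings that point HiBench at the Spark deployment might look like the sketch below; the installation path and master URL are assumptions for a local setup.

```
# conf/spark.conf -- illustrative values only (path and master URL are assumptions)
hibench.spark.home      /opt/spark-2.2.0-bin-hadoop2.7
# yarn-client for YARN mode, or e.g. spark://localhost:7077 for standalone mode
hibench.spark.master    yarn-client
```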
...
- change input data size:
  - set hibench.scale.profile in conf/hibench.conf. Available values are tiny, small, large, huge, gigantic and bigdata.
- change parallelism:

Property | Meaning |
---|---|
hibench.default.map.parallelism | Partition number in Spark |
hibench.default.shuffle.parallelism | Shuffle partition number in Spark |

- change Spark job properties:

Property | Meaning |
---|---|
hibench.yarn.executor.num | Spark executor number in YARN mode |
hibench.yarn.executor.cores | Spark executor cores in YARN mode |
spark.executor.memory | Spark executor memory |
spark.driver.memory | Spark driver memory |
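As a sketch, the Spark job properties above might be set in conf/spark.conf as follows; the executor counts and memory sizes are assumptions for a small test setup, not tuned recommendations.

```
# conf/spark.conf -- illustrative values only (assumed small YARN cluster)
hibench.yarn.executor.num     2
hibench.yarn.executor.cores   2
spark.executor.memory         2g
spark.driver.memory           2g
```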
References
Sample log: Hadoop terasort
...