...
Outline of the structure of the Ansible:
Quick note: to see how our Jenkins executes the full job, see https://github.com/Linaro/hpc_lab_setup/blob/tensorflowci/files/build_tensorflow.yml
install_python3.yml
This playbook is to be run first. It makes sure the environment contains all the "base" Python (2 & 3) dependencies (i.e. python-devel, python-setuptools, python-pip). Python 3 is sourced from EPEL.
This playbook installs the Python 3 requirements. It is kept separate at the moment because the second playbook should make use of the Python 3 interpreter (on the target/builder machine).
Sadly, Ansible's yum module only works when Ansible runs on Python 2; dnf has to be used when Ansible runs on Python 3. CentOS 7 does not provide what would be "python3(6)-dnf", and building dnf from source doesn't provide it either (and upgrading dnf has many dependencies).
But CentOS 8 could make use of it, and apt-based distros too.
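As an illustration, the dependency step described above could look like the tasks below. This is only a sketch, not the contents of install_python3.yml: the `builders` host group and the EPEL package names (`python36`, `python36-devel`, `python36-pip`) are assumptions to check against the target release.

```yaml
# Hypothetical sketch of a base Python dependency play (not the actual playbook).
- hosts: builders              # assumed inventory group name
  become: yes
  tasks:
    - name: Enable the EPEL repository (source of Python 3 on CentOS 7)
      yum:
        name: epel-release
        state: present

    - name: Install the base Python 2 dependencies
      yum:
        name:
          - python-devel
          - python-setuptools
          - python-pip
        state: present

    - name: Install Python 3 from EPEL
      yum:
        name:
          - python36           # assumed EPEL package names
          - python36-devel
          - python36-pip
        state: present
```

Because the yum module needs the Python 2 bindings on the target, a play like this would be run with `ansible_python_interpreter` pointing at the Python 2 interpreter.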
...
This playbook, as the name implies, does the rest of the job of building TensorFlow. By default (see roles/tensorflow/defaults/main.yml), it builds:
- TensorFlow 1.15.0 Release
- NumPy 1.17.3 Release
- Bazel 0.24.1-dist Release
This playbook also sets up a "builder" user (see roles/tensorflow/defaults/main.yml for the name), adds it to the wheel group (which is also changed to allow password-less sudo), and changes its bashrc so that Lmod is systematically loaded. This user can then be used to do the build.
To that effect, it fetches dependencies first (see roles/tensorflow/tasks/main.yml for the precise task order), including OpenBLAS, HDF5, FFTW, GCC 8.3.0 and Lmod from OpenHPC (see the Further Enhancements section), as well as openjdk8 for Bazel.
On the topic of Lmod, we are still looking for a way to install it and get it to work in the same bash script (without having to restart the bash session). BASH_ENV might be the trick.
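The user setup and the BASH_ENV idea can be sketched as follows. This is an illustration under assumptions, not the role's actual tasks: the user name `builder`, the Lmod init script path `/etc/profile.d/lmod.sh` and the module name `gnu8` are all hypothetical and need to be checked against the real installation. The last task relies on the fact that a non-interactive bash reads the file named by BASH_ENV at startup, which would make the `module` command available without restarting the session.

```yaml
# Hypothetical sketch; user name, paths and module names are assumptions.
- name: Create the build user and add it to the wheel group
  user:
    name: builder
    groups: wheel
    append: yes
    shell: /bin/bash

- name: Allow password-less sudo for the wheel group
  lineinfile:
    path: /etc/sudoers
    regexp: '^%wheel\s'
    line: '%wheel ALL=(ALL) NOPASSWD: ALL'
    validate: '/usr/sbin/visudo -cf %s'   # refuse to write a broken sudoers file

- name: Load Lmod from the build user's bashrc
  lineinfile:
    path: /home/builder/.bashrc
    line: 'source /etc/profile.d/lmod.sh'  # assumed Lmod init script
    create: yes
    owner: builder

- name: Use modules from a non-interactive shell via BASH_ENV
  shell: module load gnu8 && gcc --version  # assumed module name
  args:
    executable: /bin/bash
  environment:
    BASH_ENV: /etc/profile.d/lmod.sh
  become: yes
  become_user: builder
```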
...
Once NumPy is built, we can get to the nitty-gritty: building TensorFlow itself. The environment set up to this point, by both the Ansible and the bash, can build TensorFlow 1.15.0 and 2.0.0 just fine (but tensorflow-benchmarks hasn't been made to work with TensorFlow 2, and the split-up of tf.contrib is non-trivial).
The TensorFlow build has to be configured through environment variables, which are sourced from a script to maintain some sanity.
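To give an idea of what such a configuration involves, here is a sketch that passes the variables through the task's `environment` instead of a sourced script. The variable names are those read by TensorFlow's `configure` script in the 1.x series and should be checked against the checked-out version; the paths and values are assumptions, not the role's settings.

```yaml
# Hypothetical sketch; paths and values are assumptions, not the role's settings.
- name: Configure the TensorFlow build non-interactively
  shell: ./configure
  args:
    chdir: /home/builder/tensorflow                 # assumed checkout location
    executable: /bin/bash
  environment:
    PYTHON_BIN_PATH: /home/builder/venv/bin/python  # assumed venv path
    USE_DEFAULT_PYTHON_LIB_PATH: "1"
    TF_ENABLE_XLA: "0"
    TF_NEED_OPENCL_SYCL: "0"
    TF_NEED_ROCM: "0"
    TF_NEED_CUDA: "0"
    TF_DOWNLOAD_CLANG: "0"
    TF_NEED_MPI: "0"
    TF_SET_ANDROID_WORKSPACE: "0"
    CC_OPT_FLAGS: "-march=native"
```

Any question whose variable is left unset is asked interactively by `configure`, so the list has to be complete for an unattended run.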
The TensorFlow source also requires a patch, due to a missing requirement in Bazel's build configuration (i.e. WORKSPACE); it is a fairly well-known issue.
Then the build itself is where we feed the compiler arguments (see /roles/tensorflow/defaults/main.yml : tensorflow.c_optimizations and tensorflow.c_optimizations)
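As a sketch of how compiler arguments reach the build, the tasks below forward a set of options to Bazel with `--copt` and then produce the wheel. The variable `example_copts`, the checkout path and the output directory are hypothetical; the actual values come from the role defaults mentioned above.

```yaml
# Hypothetical sketch; example_copts and all paths are assumptions.
- name: Build the pip package builder with extra compiler options
  vars:
    example_copts: "--copt=-O3 --copt=-march=native"
  shell: >-
    bazel build --config=opt {{ example_copts }}
    //tensorflow/tools/pip_package:build_pip_package
  args:
    chdir: /home/builder/tensorflow

- name: Produce the TensorFlow wheel
  shell: >-
    ./bazel-bin/tensorflow/tools/pip_package/build_pip_package
    /home/builder/tensorflow_pkg
  args:
    chdir: /home/builder/tensorflow
```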
Once it is built (and the pip package is built as well), the final trick is to install h5py (the HDF5 Python library/API) via the command line (shell module in Ansible), with the variable HDF5_DIR pointing to the base of the HDF5 installation (this might be a problem with Lmod and needs further investigation).
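A minimal sketch of that h5py step, assuming a virtual environment under /home/builder/venv and a placeholder HDF5 prefix (both hypothetical). The `--no-binary` option forces pip to compile h5py from source so that it links against the HDF5 found under HDF5_DIR.

```yaml
# Hypothetical sketch; the venv path and HDF5 prefix are placeholders.
- name: Build h5py against the existing HDF5 installation
  shell: /home/builder/venv/bin/pip install --no-binary=h5py h5py
  environment:
    HDF5_DIR: /path/to/hdf5   # base of the HDF5 installation
```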
TensorFlow is built and ready to install!
As a final step, the playbook executes a "Hello World" script to make sure the TensorFlow installation is functional. This script is taken from the official beginner quickstart: https://www.tensorflow.org/tutorials/quickstart/beginner
Further Notes
The HPC SIG's scripts to build TensorFlow support only CentOS 7 at the time of writing.
Efforts will be made to ensure compatibility with CentOS 8, and then we will look into Debian environments (probably after adding the OpenBLAS and FFTW build).
The scripts/Ansible recipe makes use of Lmod to keep track of the libraries, and of OpenHPC's binaries for OpenBLAS, FFTW, GCC 8.3.0 and Lmod.
TensorFlow and NumPy 1.17.3 require Python 3, and the EOL of Python 2 is fast approaching. Sadly, the dnf/yum module in Ansible does not work reliably with Python 3 on CentOS 7 at the moment, so we still need Python 2, at least for the Ansible.
The Ansible makes use of venv (a.k.a. virtualenv) to keep (at least) the Python dependencies contained and easily identifiable.
GCC is used to build all necessary pieces (at the moment, 8.3.0 from OpenHPC).
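For the virtual environment mentioned in these notes, a sketch of how it could be created from Ansible is shown below. The location /home/builder/venv is an assumption, and plain `command` tasks are used here for simplicity; the actual role may well do this differently.

```yaml
# Hypothetical sketch; the venv location is an assumption.
- name: Create a virtual environment for the build
  command: python3 -m venv /home/builder/venv
  args:
    creates: /home/builder/venv/bin/activate   # skip if it already exists

- name: Upgrade the packaging tools inside the virtual environment
  command: /home/builder/venv/bin/pip install --upgrade pip setuptools wheel
```

Everything installed through /home/builder/venv/bin/pip then stays inside that directory, which is what keeps the Python dependencies easy to identify and remove.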
Further Enhancements
Note: The TensorFlow Ansible and Jenkins job is in a PR at the moment, to be merged with master hpc_lab_jenkins and put into production. Please add any issues encountered or comments here: https://github.com/Linaro/hpc_lab_setup/pull/91
The first area of focus for enhancement is the NumPy build: allowing optimizations, and looking at fetching the UMFPACK/AMD libraries to hook it up to.
A round of cleanup is also due, so that the build of individual components can be turned on and off, and components that are not built are instead fetched as pre-built packages via pip/yum install (from other builds; especially NumPy and Bazel).
Adding benchmark runs/testsuite runs of the stack's components is also a required step.
Then we will focus on adding more parts of the stack to the build :
...