Recipes
Please note: these recipes are a work in progress and subject to change, although those changes will not be introduced at the expense of functionality.
The Linaro HPC SIG CI loop will make use of the Ansible role (and the associated Jenkins job), while development efforts will make use of a bash script.
The Ansible role is part of the HPC lab's Jenkins infrastructure, and can thus be found nestled in the hpc_lab_setup repository, at: https://github.com/Linaro/hpc_lab_setup/tree/tensorflowci/files/ansible
As for the bash script, it can be found here (pending Linaro hosting).
In the following sections we will mostly focus on the Ansible role, as it is the production environment, but the bash script should be equally functional (if less pretty):
Outline of the structure of the Ansible:
install_python3.yml
This playbook is to be run first, and makes sure the environment contains all the "base" Python (2 & 3) dependencies (e.g. python-devel, python-setuptools, python-pip).
It also installs the python3 requirements. It is kept separate for now, since the second playbook should make use of the python3 interpreter (on the target/builder machine).
Sadly, the yum module only works when Ansible runs on python2, and dnf has to be used when Ansible runs on python3. CentOS 7 does not ship what would be a "python3(6)-dnf" package, building dnf from source does not provide it either, and upgrading dnf pulls in many dependencies.
CentOS 8 could make use of it, though, and so could apt-based distros.
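As a sketch of how this constraint can be handled (the task shapes below are illustrative, not copied from the actual role): the yum tasks can be pinned to a python2 interpreter while the rest of the play runs under python3, and dnf can take over on newer distros.

```yaml
# Illustrative only - not the actual tasks from install_python3.yml.
- name: Install base python dependencies (CentOS 7, yum needs python2)
  yum:
    name: [python-devel, python-setuptools, python-pip]
    state: present
  vars:
    ansible_python_interpreter: /usr/bin/python2  # pin yum to python2

- name: On CentOS 8, dnf works fine under python3
  dnf:
    name: python3-devel
    state: present
  when: ansible_distribution_major_version | int >= 8
```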
build_tensorflow.yml
This playbook, as the name implies, does the rest of the job of building TensorFlow.
To that end, it fetches dependencies first (see roles/tensorflow/tasks/main.yml for the precise task order), including OpenBLAS, HDF5, FFTW, GCC 8.3.0 and Lmod from OpenHPC (see the Further Enhancements section), as well as openjdk8 for Bazel.
On the topic of Lmod, we are working on finding a way to install it and get it working within the same bash script (without having to restart the bash session); BASH_ENV might be the trick.
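To illustrate the BASH_ENV idea with a generic sketch (this is not the actual Lmod profile script): any file named in BASH_ENV is sourced by every non-interactive bash started afterwards, so a freshly installed profile can take effect without restarting the session.

```shell
# Generic BASH_ENV demonstration (stand-in for the real Lmod profile):
# write a throwaway profile, then show that a child non-interactive
# bash sources it automatically before running its command.
profile=$(mktemp)
echo 'export GREETING=from_bash_env' > "$profile"

# The child bash reads "$profile" first, so GREETING is visible to it.
result=$(BASH_ENV="$profile" bash -c 'echo "$GREETING"')
echo "$result"

rm -f "$profile"
```

The same mechanism would let a just-installed Lmod init script be picked up by the subsequent build steps of the script.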
Then, it builds the v(irtual)env that will contain the pip dependencies and populates it with the first few that do not depend on numpy (e.g. pip, wheel, Cython, mock, future...), as well as the keras helper packages (versions as instructed upstream).
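The venv bootstrap can be sketched as below; the venv path is illustrative, and the exact pinned versions live in the role's defaults (hence the unpinned installs here).

```shell
# Sketch of the venv bootstrap; VENV_DIR is an illustrative path, and
# the role pins exact package versions that are omitted here.
VENV_DIR="${VENV_DIR:-$HOME/tf-venv}"
python3 -m venv "$VENV_DIR"

# Installers first, then the packages that do not depend on numpy.
"$VENV_DIR/bin/pip" install --upgrade pip wheel
"$VENV_DIR/bin/pip" install Cython mock future

# Upstream instructs installing the keras helper packages with
# --no-deps, so they do not drag in a prematurely-built numpy.
"$VENV_DIR/bin/pip" install --no-deps keras_applications keras_preprocessing
```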
We then proceed with the Bazel build, which is a straightforward affair.
Here we build Bazel 0.24.1, the minimum version required to build TensorFlow. TensorFlow is quite picky about its Bazel version... Thankfully, you can pretty much replace bazel_version (and bazel_url) with any available version (above 0.24.1) and it builds just fine.
After Bazel is built, we can go on with the NumPy build. The NumPy build does involve applying a workaround to address a GCC bug, pending OpenHPC picking up an up-to-date version of GCC 8.XX or fetching and using a GCC 9.X AArch64 build of the toolchain (see Further Enhancements).
Please do note that the aforementioned GCC bug breaks "pip install numpy>=1.15.3", and that the workaround disables all optimizations on one function (it reportedly breaks a testsuite test as well; see the upstream issue).
NumPy also has a mechanism for hooking up to BLAS/LAPACK and FFTW (and also to the UMFPACK and AMD libraries; this AMD has nothing to do with Advanced Micro Devices, see Further Enhancements): the .numpy-site.cfg file, to be found at /home/$USER/.numpy-site.cfg (ugly, yes...).
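For illustration, a .numpy-site.cfg could look like the sketch below; the section names follow NumPy's site.cfg conventions, but the OpenHPC-style install paths are assumptions, not the ones the recipes actually write.

```ini
; Hypothetical ~/.numpy-site.cfg - the paths are illustrative
; OpenHPC-style locations, not the exact ones the recipes install to.
[openblas]
libraries = openblas
library_dirs = /opt/ohpc/pub/libs/gnu8/openblas/0.3.7/lib
include_dirs = /opt/ohpc/pub/libs/gnu8/openblas/0.3.7/include

[fftw]
libraries = fftw3
library_dirs = /opt/ohpc/pub/libs/gnu8/fftw/3.3.8/lib
include_dirs = /opt/ohpc/pub/libs/gnu8/fftw/3.3.8/include
```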
The "setup.py bdist_wheel" command can be a bit finicky when in a venv: just run setup.py directly, without first calling the interpreter (and make sure wheel is installed).
Once NumPy is built, we can get to the nitty-gritty: building TensorFlow itself. The environment set up to this point, by both the Ansible role and the bash script, can build TensorFlow 1.15.0 and 2.0.0 just fine (but tensorflow-benchmarks has not been made to work with TensorFlow 2, and the tf.contrib split-up is non-trivial...).
Configuring the TensorFlow build has to be done by sourcing environment variables... which is achieved through a script, to maintain some sanity.
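That script can be sketched as follows for a CPU-only build; the variables are the ones TensorFlow's interactive ./configure would otherwise prompt for, but the values chosen here (and the -march flag) are illustrative.

```shell
# Illustrative pre-configure environment for a CPU-only AArch64 build;
# sourcing this lets ./configure run non-interactively.
export PYTHON_BIN_PATH="$(command -v python3)"
export PYTHON_LIB_PATH="$("$PYTHON_BIN_PATH" -c 'import site; print(site.getsitepackages()[0])')"
export TF_ENABLE_XLA=0
export TF_NEED_CUDA=0
export TF_NEED_ROCM=0
export TF_NEED_OPENCL_SYCL=0
export TF_DOWNLOAD_CLANG=0
export TF_SET_ANDROID_WORKSPACE=0
export CC_OPT_FLAGS="-march=armv8-a"
```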
TensorFlow itself also requires a patch, due to a missing requirement in Bazel's build configuration (i.e. WORKSPACE): it is a fairly well-known issue.
Then comes the build itself, where we feed in the compiler arguments (see roles/tensorflow/defaults/main.yml: tensorflow.c_optimizations).
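The shape of that build step can be sketched as below; the flag values are a stand-in for the tensorflow.c_optimizations variable, and the command is echoed rather than run, since a real invocation needs the full environment above.

```shell
# Illustrative shape of the Bazel build step. C_OPTS stands in for
# the tensorflow.c_optimizations role variable; values are examples.
C_OPTS="-march=armv8-a -O3"

cmd="bazel build --config=opt"
for f in $C_OPTS; do
    cmd="$cmd --copt=$f"   # each compiler flag becomes a --copt
done
cmd="$cmd //tensorflow/tools/pip_package:build_pip_package"

# Dry run: print the command instead of invoking Bazel.
echo "$cmd"
```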
Once TensorFlow is built (along with its pip package), the final trick is to install h5py (the HDF5 python lib/API) via the command line (the shell module in Ansible), with the HDF5_DIR variable pointing to the base of the HDF5 installation. (This might be a problem with Lmod; it needs further investigation.)
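That last step can be sketched as below; the HDF5 prefix is an illustrative OpenHPC-style path, not necessarily the one the role resolves.

```shell
# Point h5py's source build at the HDF5 tree. The prefix below is an
# illustrative OpenHPC-style location, not the role's actual path.
export HDF5_DIR="/opt/ohpc/pub/libs/gnu8/hdf5/1.10.5"

# With HDF5_DIR exported, building h5py from source links against it:
#   pip install --no-binary=h5py h5py
echo "h5py will build against: $HDF5_DIR"
```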
TensorFlow is built and ready to install!
Further Notes
The HPC SIG's scripts to build TensorFlow support only CentOS 7 at the time of writing.
Efforts will be made to ensure compatibility with CentOS 8, and then we will look into Debian environments (probably after adding the OpenBLAS and FFTW builds).
The scripts/Ansible recipe makes use of Lmod to keep track of the libraries, and of OpenHPC's binaries for OpenBLAS, FFTW, GCC 8.3.0 and Lmod.
TensorFlow and NumPy 1.17.3 require Python 3, and the EOL of Python 2 is fast approaching; sadly, the dnf/yum modules in Ansible seem very dodgy at the moment, so we still need python2, at least for Ansible itself.
The Ansible role makes use of venv (a.k.a. virtualenv) to keep (at least) the python dependencies contained and easily identifiable.
GCC is used to build all the necessary pieces (at the moment, 8.3.0 from OpenHPC).
Further Enhancements
The first domain of focus for enhancement would be the NumPy build: re-enabling optimizations, and looking at fetching the UMFPACK/AMD libraries to hook it up to.
A round of clean-up is also due, to make it possible to turn individual component builds on and off, and to pip/yum install pre-built packages instead (from other builds: especially numpy and bazel).
Then we will focus on adding more parts of the stack to the build :
fig 1 - Stack and Tools Diagram
The above diagram (fig 1) lays out the stack and tools: the ones we build at the moment with the recipes (orange/salmon), the ones that would be interesting to build (green), and the ones that probably are not (grey).
Typically, Keras is pure Python and should not impact performance, as it is only a high-level modelling library that sits on top of TensorFlow. Linux is also outside the scope of this endeavour.
HDF5 is the file format used by TensorFlow; since it is an I/O layer, it lies outside the core area of optimization and requires additional expertise to fine-tune. Nonetheless, it is certainly something to be scoped (hence the greenish/greyish colour).
OpenBLAS is certainly the next thing to be added to the build. Following that, integrating the FFTW build should also be interesting.
Above that in the stack sit SciPy and Python 3, which are also interesting domains to investigate.
Concerning GCC and LLVM, the yellow colour denotes that it might be interesting to fetch those from outside of distributions/OpenHPC, but not so much to build them. GCC especially, since OpenHPC's 8.3.0 (as well as RedHat's 4.8.5) contains a bug that breaks the NumPy build.
LLVM appears there but is not used in the recipes; it is the target of ongoing HPC-SIG work and seems to be able to build TensorFlow. It requires further investigation.
Both toolchains could be acquired through Linaro's Toolchain Group, as they run CI on both. The exact place to fetch them from remains to be investigated.