Investigating Gluten with the Velox Backend for Spark SQL

 

Background

Apache Spark is a well-established and mature project that has been developed over many years. It is widely recognized as one of the top frameworks for efficiently processing large-scale datasets. However, the Spark community has faced performance challenges and has made various optimizations over time. One significant optimization, introduced in Spark 2.0, is Whole-Stage Code Generation, which replaced the Volcano iterator model and achieved roughly a 2x speedup. Since then, most optimizations have focused on the query-plan level, and the performance improvements of individual operators have become less significant.

On the other hand, SQL engines have been a subject of research for many years. Several products and libraries, such as ClickHouse, Arrow, and Velox, can outperform Spark's JVM-based SQL engine. These libraries take advantage of features like native implementation, columnar data formats, and vectorized data processing. However, it is important to note that they operate on a single node.

Currently, the Gluten+Velox backend is only tested on Ubuntu 20.04, Ubuntu 22.04, and CentOS 8; support for other operating systems is still in progress. The long-term goal is to support several common operating systems as well as conda-based deployment.

Gluten currently builds with Spark 3.2.x and Spark 3.3.x, but is only fully tested in CI with 3.2.2 and 3.3.1. Supported/tested versions will be added or updated according to upstream changes.

Velox uses the script scripts/setup-xxx.sh to install its dependency libraries, but Arrow's dependencies are not covered by it. Velox also requires ninja for compilation, so we need to install all of these manually. We also need to set the JAVA_HOME environment variable: currently Java 8 is required, and support for Java 11/17 is not ready yet.
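For example, on Ubuntu the missing pieces can be installed and configured along these lines (a sketch; the JDK path is a distro-specific assumption):

# Install ninja and a JDK 8, then point JAVA_HOME at it.
sudo apt-get install -y ninja-build openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # adjust for your distro
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should report 1.8.x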

 

Gluten Architecture

Gluten is a new middle layer that offloads Spark SQL queries to native engines.

Gluten began as an Apache Arrow-based native SQL engine for offloading Spark SQL. It has since been generalized to support multiple native engines, including the Velox vectorized execution engine created by Meta (with the Spark integration led by Intel) and a ClickHouse execution engine developed by Kyligence.

With Gluten and Velox, Apache Spark users can expect performance gains and higher resource utilization.

The architecture chart of Gluten in Spark

The integration of a vectorized SQL engine or shared library with Spark opens up numerous opportunities for optimization, such as offloading functions and operators to a vectorized library, introducing just-in-time compilation engines, and enabling the use of hardware accelerators (e.g., GPUs and FPGAs). Project Gluten allows users to take advantage of these software and hardware innovations from within Spark.

 

The key components in Gluten

 

  • Query plan conversion, which converts Spark's physical plan into a Substrait plan for each stage.

  • Spark's unified memory management is used to control native memory allocation as well.

    • Gluten leverages Spark’s existing memory management system. It calls the Spark memory registration API for every native memory allocation/deallocation action.

    • Spark manages the memory for each task thread. If the thread needs more memory than is available, it can call the spill interface for operators that support this capability.

    • Spark’s memory management system protects against memory leaks and out-of-memory issues.

  • Columnar shuffle is used to shuffle columnar data directly. The shuffle service still reuses the one in Spark core, while the exchange operator is reimplemented to support the columnar data format.

    • Gluten reuses its Apache Arrow-based columnar shuffle manager as the default shuffle manager.

    • A third-party library is responsible for handling the data transformation from native to Arrow. Alternatively, developers are free to implement their own shuffle manager.

  • Gluten falls back to vanilla Spark for unsupported operators or functions (a sketch of how to observe fallbacks follows this list).

    • Gluten leverages the existing Spark JVM engine to check whether an operator is supported by the native library.

    • If not, Gluten falls back to the existing Spark-JVM-based operator.

    • This fallback mechanism comes at the cost of columnar-to-row/row-to-columnar data conversion. 

  • Metrics are very important for gaining insight into Spark's execution and identifying issues or bottlenecks.

    • Gluten collects metrics from the native library and shows them in the Spark UI.

    • Gluten supports Spark's metrics functionality. The default Spark metrics serve Java row-based data processing; Project Gluten extends them with a column-based API and additional metrics to facilitate the use of Gluten and give developers a means of debugging the native libraries.

  • Shim layer is used to support multiple releases of Spark.

    • To fully integrate with Spark, Gluten includes Shim Layer whose role is to support multiple versions of Spark.

    • Gluten only plans to support the latest two to three stable Spark releases (currently Spark 3.2 and 3.3), with no plans to support older releases.
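Whether an operator was offloaded or fell back is visible in the physical plan. Below is a hedged sketch, assuming a spark-sql session already configured with Gluten (see the configuration sketch in the Spark - Gluten - Velox section) and a hypothetical lineitem table; exact operator names vary by Gluten version:

# Offloaded operators typically appear with a "Transformer" suffix
# (e.g. ProjectExecTransformer); operators that fell back keep their vanilla
# Spark names, with ColumnarToRow/RowToColumnar transitions in between.
spark-sql -e "EXPLAIN SELECT l_orderkey, count(*) FROM lineitem GROUP BY l_orderkey;"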

Velox Architecture

Velox is an open source unified execution engine.

Velox is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML. Velox could serve as a native backend for Gluten.

A data computation engine is composed of a similar set of logical components: a language front end, an intermediate representation (IR), an optimizer, an execution runtime, and an execution engine.

Velox provides the building blocks required to implement execution engines, consisting of all data-intensive operations executed within a single host, such as expression evaluation, aggregation, sorting, joining, and more — also commonly referred to as the data plane. Velox therefore expects an optimized plan as input and executes it efficiently using the resources available on the local host.

 

Diagram by Philip Bell from https://engineering.fb.com/2023/03/09/open-source/velox-open-source-execution-engine/

 

Velox main components

  • Type: a generic type system that allows developers to represent scalar, complex, and nested data types, including structs, maps, arrays, functions (lambdas), decimals, tensors, and more.

  • Vector: an Apache Arrow–compatible columnar memory layout module supporting multiple encodings, such as flat, dictionary, constant, sequence/RLE, and frame of reference, in addition to a lazy materialization pattern and support for out-of-order result buffer population.

  • Expression Eval: a state-of-the-art vectorized expression evaluation engine built based on vector-encoded data, leveraging techniques such as common subexpression elimination, constant folding, efficient null propagation, encoding-aware evaluation, dictionary peeling, and memoization.

  • Functions: APIs that can be used by developers to build custom functions, providing a simple (row by row) and vectorized (batch by batch) interface for scalar functions and an API for aggregate functions.

    • A function package compatible with the popular PrestoSQL dialect is also provided as part of the library.

  • Operators: implementation of common SQL operators such as TableScan, Project, Filter, Aggregation, Exchange/Merge, OrderBy, TopN, HashJoin, MergeJoin, Unnest, and more.

  • I/O: a set of APIs that allows Velox to be integrated in the context of other engines and runtimes, such as:

    • Connectors: enables developers to specialize data sources and sinks for TableScan and TableWrite operators.

    • DWIO: an extensible interface providing support for encoding/decoding popular file formats such as Parquet, ORC, and DWRF.

    • Storage adapters: a byte-based extensible interface that allows Velox to connect to storage systems such as Tectonic, S3, HDFS, and more.

    • Serializers: a serialization interface targeting network communication where different wire protocols can be implemented, supporting PrestoPage and Spark’s UnsafeRow formats.

  • Resource management: a collection of primitives for handling computational resources, such as CPU and memory management, spilling, and memory and SSD caching.

Velox showcases the potential of combining current computation engines into a top-performing query execution component. It is currently being integrated with numerous data systems, which include analytical engines like Presto and Spark, as well as stream processing platforms, message buses, data warehouse ingestion infrastructure, and ML systems for data preprocessing and feature engineering.

Spark - Gluten - Velox

Velox is also being integrated into Spark as part of the Gluten project.

Gluten allows C++ execution engines (such as Velox) to be used within the Spark environment while executing Spark SQL queries.

Gluten decouples the Spark JVM and execution engine by creating a JNI API based on the Apache Arrow data format and Substrait query plans, thus allowing Velox to be used within Spark by simply integrating with Gluten’s JNI API.
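As a concrete illustration, here is a minimal sketch of enabling Gluten with the Velox backend in a Spark session. The configuration keys follow Gluten's documentation for the Spark 3.2/3.3 line; the bundle jar path is a placeholder:

# Enable the Gluten plugin, Velox backend, columnar shuffle, and off-heap memory.
spark-shell \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.gluten.sql.columnar.backend.lib=velox \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --jars /path/to/gluten-velox-bundle.jar   # placeholder path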

 

From The Gluten Open-Source Software Project: Modernizing Java-based Query Engines for the Lakehouse Era

 

Gluten and Velox on Arm64

Build Gluten with Velox backend:

Prerequisite

## install gcc and libraries to build arrow
sudo apt-get update && sudo apt-get install -y \
    locales wget tar tzdata git ccache cmake ninja-build build-essential \
    llvm-11-dev clang-11 libiberty-dev libdwarf-dev libre2-dev libz-dev \
    libssl-dev libboost-all-dev libcurl4-openssl-dev openjdk-8-jdk maven

Build Parameters

 

Build Scripts

export CPU_TARGET="aarch64"
cd /path/to/gluten
./dev/builddeps-veloxbe.sh

Build Issues on Arm64

Fbthrift building issue in Velox

/home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/thrift/compiler/lib/cpp2/util.cc:697:15: warning: ‘int SHA256_Final(unsigned char*, SHA256_CTX*)’ is deprecated: Since OpenSSL 3.0 [-Wdeprecated-declarations]
  697 |   SHA256_Final(mid, &hasher);
      |   ~~~~~~~~~~~~^~~~~~~~~~~~~~
In file included from /home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/thrift/compiler/lib/cpp2/util.cc:25:
/usr/include/openssl/sha.h:76:27: note: declared here
   76 | OSSL_DEPRECATEDIN_3_0 int SHA256_Final(unsigned char *md, SHA256_CTX *c);
      |                           ^~~~~~~~~~~~
[55/330] Generating templates.cc
FAILED: thrift/compiler/generate/templates.cc /home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/_build/thrift/compiler/generate/templates.cc
cd /home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/_build/thrift/compiler/generate && /home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/_build/bin/compiler_generate_build_templates /home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/thrift/compiler/generate/templates > templates.cc
Illegal instruction (core dumped)
[56/330] Building CXX object thrift/lib/cpp/CMakeFiles/concurrency.dir/concurrency/ThreadManager.cpp.o
ninja: build stopped: subcommand failed.

Debugging the Issue

1. Debug the Illegal instruction (core dumped)

2. No symbol

GDB found no symbol table; we should rebuild Gluten/Velox with a symbol table in Debug mode (see the sketch under step 3).

3. Debug the core dump
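A sketch of the rebuild-and-inspect workflow (paths are taken from the build log above; where the core file lands depends on the system's core_pattern):

# Allow core dumps, then re-run the crashing code generator by hand.
ulimit -c unlimited
BUILD=/home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/_build
$BUILD/bin/compiler_generate_build_templates \
    /home/yuqi/gluten/ep/build-velox/build/velox_ep/fbthrift/thrift/compiler/generate/templates > templates.cc
# Inspect the core; the backtrace is only meaningful after rebuilding
# fbthrift with -DCMAKE_BUILD_TYPE=Debug so that symbols are present.
gdb $BUILD/bin/compiler_generate_build_templates core
# (gdb) bt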

4. Analysis

All Arm64 compilation flags in Velox were set with -mcpu:

 


 

The Arm Neoverse N1 baseline architecture is v8.2, but the N1 includes extensions from the v8.1, v8.3, v8.4, and v8.5 architectures. When we specify -march, we confine the compiler to the baseline architecture, so it cannot take advantage of any architecture extensions beyond that baseline. To take advantage of all the features of a particular target, we should instead use the -mcpu flag, which simultaneously specifies the architecture with all its extensions and the microarchitecture.

It is possible to use -march and list out every architecture extension, but this is cumbersome and non-portable, and reverse-mapping multiple compiler flags to a single target is more trouble than it's worth. Instead, just use -mcpu=target to tell the compiler exactly what you want, and the compiler will do the rest.

Consider C code that invokes the atomic_add_fetch() intrinsic to atomically update an integer. Arm v8.0 has no special support for atomics, so compiling this code for Arm v8.0 generates multiple instructions to perform the atomic operation. Arm v8.1 defines the Large System Extension (LSE) instruction ldaddal, which can atomically update an integer in a single instruction.

Without LSE, the atomic_add_fetch call compiles to a retry loop of load-exclusive/store-exclusive (ldaxr/stlxr) instructions; with LSE, it compiles to a single ldaddal.
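This is easy to verify with a small experiment (a sketch; requires a gcc recent enough to know neoverse-n1):

# Compare the code generated for an atomic add with and without LSE.
cat > atomic.c <<'EOF'
int add_one(int *p) {
    return __atomic_add_fetch(p, 1, __ATOMIC_SEQ_CST);
}
EOF
# Baseline Arm v8.0: an ldaxr/stlxr retry loop (several instructions).
gcc -O2 -march=armv8-a -c atomic.c -o v8.o && objdump -d v8.o
# Neoverse N1 (-mcpu implies LSE): a single ldaddal instruction.
gcc -O2 -mcpu=neoverse-n1 -c atomic.c -o n1.o && objdump -d n1.o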

 

In Velox, all flags are set with -mcpu, and the N1 includes the Large System Extension (LSE) instruction ldaddal.

If we build Velox this way on a platform whose CPU does not include the LSE extension, the resulting binaries contain instructions the CPU cannot execute; this is the "Illegal instruction" crash that occurred while fbthrift was generating code during the build.

5. Solution

Read the Arm64 MIDR_EL1 register to detect the CPU model and set the GCC -mcpu flag accordingly.
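A shell sketch of the idea (the sysfs path is standard on arm64 Linux; the part-number table is partial and illustrative):

# Map the CPU part number from MIDR_EL1 to a suitable -mcpu value.
midr=$(cat /sys/devices/system/cpu/cpu0/regs/identification/midr_el1)
part=$(printf '0x%x' $(( (midr >> 4) & 0xfff )))
case "$part" in
  0xd0c) mcpu=neoverse-n1 ;;   # Arm Neoverse N1
  0xd40) mcpu=neoverse-v1 ;;   # Arm Neoverse V1
  *)     mcpu=generic ;;       # safe baseline for unknown parts
esac
# LSE support can also be checked directly via the kernel's hwcaps:
grep -qw atomics /proc/cpuinfo && echo "LSE supported"
export CFLAGS="-mcpu=$mcpu" CXXFLAGS="-mcpu=$mcpu"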