Optimizing Arrow ByteStreamSplitDecode/ByteStreamSplitEncode with Neon

Purpose

ByteStreamSplitDecode andByteStreamSplitEncode has SSE4 optimization, which may also get benefit from Neon.

Implementation code at

https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/byte_stream_split.h

Benchmark code at

https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding_benchmark.cc

SSE implementation

1. load 128bit Data

_mm_loadu_si128

2. Store 128bit Data

 _mm_storeu_si128

3. Shuffle low part of 128bit based on byte

_mm_unpacklo_epi8

4. Shuffle high part of 128bit based on byte

 _mm_unpackhi_epi8

5. Shuffle low part of 128bit based on init

 _mm_unpacklo_epi32

6. Shuffle high part of 128bit based on init

 _mm_unpackhi_epi32

7. Shuffle low part of 128bit based on long

 _mm_unpacklo_epi64

8. Shuffle high part of 128bit based on long

 _mm_unpackhi_epi64

Neon implementation

1. load 128bit Data

vld1q_u8

2. Store 128bit Data

vst1q_u8

3. Shuffle low part of 128bit based on byte

vzip1q_u8

4. Shuffle high part of 128bit based on byte

vzip2q_u8

5. Shuffle low part of 128bit based on init

vzip1q_u32

6. Shuffle high part of 128bit based on init

vzip2q_u32

7. Shuffle low part of 128bit based on long

vzip1q_u64

8. Shuffle high part of 128bit based on long

vzip2q_u64

The patches of Neon implementation:

Decode: https://github.com/apache/arrow/pull/9424

Benchmark Results

Add Neon benchmark:

diff --git a/cpp/src/parquet/encoding_benchmark.cc b/cpp/src/parquet/encoding_benchmark.cc
index 8e409c5e4..541b028b4 100644
--- a/cpp/src/parquet/encoding_benchmark.cc
+++ b/cpp/src/parquet/encoding_benchmark.cc
@@ -371,31 +371,31 @@ BENCHMARK(BM_ByteStreamSplitDecode_Double_Scalar)->Range(MIN_RANGE, MAX_RANGE);
 BENCHMARK(BM_ByteStreamSplitEncode_Float_Scalar)->Range(MIN_RANGE, MAX_RANGE);
 BENCHMARK(BM_ByteStreamSplitEncode_Double_Scalar)->Range(MIN_RANGE, MAX_RANGE);

-#if defined(ARROW_HAVE_SSE4_2)
-static void BM_ByteStreamSplitDecode_Float_Sse2(benchmark::State& state) {
+#if defined(ARROW_HAVE_NEON) || defined(ARROW_HAVE_SSE4_2)
+static void BM_ByteStreamSplitDecode_Float_128bit(benchmark::State& state) {
   BM_ByteStreamSplitDecode<float>(
-      state, ::arrow::util::internal::ByteStreamSplitDecodeSse2<float>);
+      state, ::arrow::util::internal::ByteStreamSplitDecode128bit<float>);
 }

-static void BM_ByteStreamSplitDecode_Double_Sse2(benchmark::State& state) {
+static void BM_ByteStreamSplitDecode_Double_128bit(benchmark::State& state) {
   BM_ByteStreamSplitDecode<double>(
-      state, ::arrow::util::internal::ByteStreamSplitDecodeSse2<double>);
+      state, ::arrow::util::internal::ByteStreamSplitDecode128bit<double>);
 }

-static void BM_ByteStreamSplitEncode_Float_Sse2(benchmark::State& state) {
+static void BM_ByteStreamSplitEncode_Float_128bit(benchmark::State& state) {
   BM_ByteStreamSplitEncode<float>(
-      state, ::arrow::util::internal::ByteStreamSplitEncodeSse2<float>);
+      state, ::arrow::util::internal::ByteStreamSplitEncode128bit<float>);
 }

-static void BM_ByteStreamSplitEncode_Double_Sse2(benchmark::State& state) {
+static void BM_ByteStreamSplitEncode_Double_128bit(benchmark::State& state) {
   BM_ByteStreamSplitEncode<double>(
-      state, ::arrow::util::internal::ByteStreamSplitEncodeSse2<double>);
+      state, ::arrow::util::internal::ByteStreamSplitEncode128bit<double>);
 }

-BENCHMARK(BM_ByteStreamSplitDecode_Float_Sse2)->Range(MIN_RANGE, MAX_RANGE);
-BENCHMARK(BM_ByteStreamSplitDecode_Double_Sse2)->Range(MIN_RANGE, MAX_RANGE);
-BENCHMARK(BM_ByteStreamSplitEncode_Float_Sse2)->Range(MIN_RANGE, MAX_RANGE);
-BENCHMARK(BM_ByteStreamSplitEncode_Double_Sse2)->Range(MIN_RANGE, MAX_RANGE);
+BENCHMARK(BM_ByteStreamSplitDecode_Float_128bit)->Range(MIN_RANGE, MAX_RANGE);
+BENCHMARK(BM_ByteStreamSplitDecode_Double_128bit)->Range(MIN_RANGE, MAX_RANGE);
+BENCHMARK(BM_ByteStreamSplitEncode_Float_128bit)->Range(MIN_RANGE, MAX_RANGE);
+BENCHMARK(BM_ByteStreamSplitEncode_Double_128bit)->Range(MIN_RANGE, MAX_RANGE);
 #endif

Run ByteStreamSplitDecode and ByteStreamSplitEncode on Ampere-Altra

160 cores, 500G memory
CPU Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

No Neon implementation:

Decode

BM_ByteStreamSplitDecode_Float_Scalar/1024                176 ns          176 ns      3974797 bytes_per_second=21.6626G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096                689 ns          688 ns      1016797 bytes_per_second=22.1632G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768              5569 ns         5568 ns       125595 bytes_per_second=21.9223G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536             11164 ns        11161 ns        62682 bytes_per_second=21.8744G/s

Encode

BM_ByteStreamSplitEncode_Double_Scalar/1024              5421 ns         5420 ns       129149 bytes_per_second=1.40764G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096             21690 ns        21687 ns        32287 bytes_per_second=1.40717G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768           135249 ns       135209 ns         5175 bytes_per_second=1.80565G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536           273756 ns       273651 ns         2559 bytes_per_second=1.78432G/s

Neon implementation:

Decode

BM_ByteStreamSplitDecode_Float_128bit/1024                138 ns          138 ns      5085948 bytes_per_second=27.7174G/s
BM_ByteStreamSplitDecode_Float_128bit/4096                541 ns          541 ns      1293778 bytes_per_second=28.1905G/s
BM_ByteStreamSplitDecode_Float_128bit/32768              4649 ns         4647 ns       150647 bytes_per_second=26.2698G/s
BM_ByteStreamSplitDecode_Float_128bit/65536             10079 ns        10076 ns        69483 bytes_per_second=24.2306G/s

Encode

BM_ByteStreamSplitEncode_Double_128bit/1024               387 ns          387 ns      1809221 bytes_per_second=19.71G/s
BM_ByteStreamSplitEncode_Double_128bit/4096              1549 ns         1548 ns       452319 bytes_per_second=19.7186G/s
BM_ByteStreamSplitEncode_Double_128bit/32768            17193 ns        17190 ns        40724 bytes_per_second=14.2024G/s
BM_ByteStreamSplitEncode_Double_128bit/65536            34556 ns        34531 ns        20615 bytes_per_second=14.1402G/s

Benchmark conclusion:

Improve ByteStreamSplitDecode performance for float_1k/4k/32k/64k:

1k: 21.6536G/s -> 27.7227G/s

4K: 22.1561G/s -> 28.2019G/s

32k: 21.9854G/s -> 25.8939G/s

64k: 22.0192G/s -> 26.1399G/s

No obvious performance difference for Double_1k/4k/32k/64k.

Improve ByteStreamSplitEncode performance for Double_1k/4k/32k/64k:

double_1k: 1.40764G/s -> 19.71G/s

double_4k: 1.40717G/s -> 19.7186G/s

double_32k: 1.80565G/s -> 14.2024G/s

double_64k: 1.78432G/s -> 14.1402G/s

And no obvious performance difference for 'Float'.

Big Data Team Space