Optimizing Arrow ByteStreamSplitDecode/ByteStreamSplitEncode with Neon

Optimizing Arrow ByteStreamSplitDecode/ByteStreamSplitEncode with Neon

Purpose

ByteStreamSplitDecode andByteStreamSplitEncode has SSE4 optimization, which may also get benefit from Neon. 

  • Implementation code at

    https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/byte_stream_split.h
  • Benchmark code at

    https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding_benchmark.cc

SSE implementation

1. load 128bit Data

_mm_loadu_si128

 

2.  Store 128bit Data

_mm_storeu_si128

 

3. Shuffle low part of 128bit based on byte

_mm_unpacklo_epi8

 

4. Shuffle high part of 128bit based on byte

_mm_unpackhi_epi8

 

5. Shuffle low part of 128bit based on init

_mm_unpacklo_epi32

 

6. Shuffle high part of 128bit based on init

_mm_unpackhi_epi32

 

7. Shuffle low part of 128bit based on long

_mm_unpacklo_epi64

 

8. Shuffle high part of 128bit based on long

_mm_unpackhi_epi64

 

Neon implementation

1. load 128bit Data

vld1q_u8

 

2.  Store 128bit Data

vst1q_u8

 

3. Shuffle low part of 128bit based on byte

vzip1q_u8

 

4. Shuffle high part of 128bit based on byte

vzip2q_u8

 

5. Shuffle low part of 128bit based on init

vzip1q_u32

 

6. Shuffle high part of 128bit based on init

vzip2q_u32

 

7. Shuffle low part of 128bit based on long

vzip1q_u64

 

8. Shuffle high part of 128bit based on long

vzip2q_u64

 

The patches of Neon implementation:

Decode: https://github.com/apache/arrow/pull/9424

 

Benchmark Results

Add Neon benchmark:

diff --git a/cpp/src/parquet/encoding_benchmark.cc b/cpp/src/parquet/encoding_benchmark.cc index 8e409c5e4..541b028b4 100644 --- a/cpp/src/parquet/encoding_benchmark.cc +++ b/cpp/src/parquet/encoding_benchmark.cc @@ -371,31 +371,31 @@ BENCHMARK(BM_ByteStreamSplitDecode_Double_Scalar)->Range(MIN_RANGE, MAX_RANGE); BENCHMARK(BM_ByteStreamSplitEncode_Float_Scalar)->Range(MIN_RANGE, MAX_RANGE); BENCHMARK(BM_ByteStreamSplitEncode_Double_Scalar)->Range(MIN_RANGE, MAX_RANGE); -#if defined(ARROW_HAVE_SSE4_2) -static void BM_ByteStreamSplitDecode_Float_Sse2(benchmark::State& state) { +#if defined(ARROW_HAVE_NEON) || defined(ARROW_HAVE_SSE4_2) +static void BM_ByteStreamSplitDecode_Float_128bit(benchmark::State& state) { BM_ByteStreamSplitDecode<float>( - state, ::arrow::util::internal::ByteStreamSplitDecodeSse2<float>); + state, ::arrow::util::internal::ByteStreamSplitDecode128bit<float>); } -static void BM_ByteStreamSplitDecode_Double_Sse2(benchmark::State& state) { +static void BM_ByteStreamSplitDecode_Double_128bit(benchmark::State& state) { BM_ByteStreamSplitDecode<double>( - state, ::arrow::util::internal::ByteStreamSplitDecodeSse2<double>); + state, ::arrow::util::internal::ByteStreamSplitDecode128bit<double>); } -static void BM_ByteStreamSplitEncode_Float_Sse2(benchmark::State& state) { +static void BM_ByteStreamSplitEncode_Float_128bit(benchmark::State& state) { BM_ByteStreamSplitEncode<float>( - state, ::arrow::util::internal::ByteStreamSplitEncodeSse2<float>); + state, ::arrow::util::internal::ByteStreamSplitEncode128bit<float>); } -static void BM_ByteStreamSplitEncode_Double_Sse2(benchmark::State& state) { +static void BM_ByteStreamSplitEncode_Double_128bit(benchmark::State& state) { BM_ByteStreamSplitEncode<double>( - state, ::arrow::util::internal::ByteStreamSplitEncodeSse2<double>); + state, ::arrow::util::internal::ByteStreamSplitEncode128bit<double>); } -BENCHMARK(BM_ByteStreamSplitDecode_Float_Sse2)->Range(MIN_RANGE, MAX_RANGE); -BENCHMARK(BM_ByteStreamSplitDecode_Double_Sse2)->Range(MIN_RANGE, MAX_RANGE); -BENCHMARK(BM_ByteStreamSplitEncode_Float_Sse2)->Range(MIN_RANGE, MAX_RANGE); -BENCHMARK(BM_ByteStreamSplitEncode_Double_Sse2)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitDecode_Float_128bit)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitDecode_Double_128bit)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitEncode_Float_128bit)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitEncode_Double_128bit)->Range(MIN_RANGE, MAX_RANGE); #endif

 

Run ByteStreamSplitDecode and ByteStreamSplitEncode on Ampere-Altra

  • 160 cores, 500G memory

  • CPU Features:  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

No Neon implementation:

  • Decode

    BM_ByteStreamSplitDecode_Float_Scalar/1024 176 ns 176 ns 3974797 bytes_per_second=21.6626G/s BM_ByteStreamSplitDecode_Float_Scalar/4096 689 ns 688 ns 1016797 bytes_per_second=22.1632G/s BM_ByteStreamSplitDecode_Float_Scalar/32768 5569 ns 5568 ns 125595 bytes_per_second=21.9223G/s BM_ByteStreamSplitDecode_Float_Scalar/65536 11164 ns 11161 ns 62682 bytes_per_second=21.8744G/s
  • Encode

    BM_ByteStreamSplitEncode_Double_Scalar/1024 5421 ns 5420 ns 129149 bytes_per_second=1.40764G/s BM_ByteStreamSplitEncode_Double_Scalar/4096 21690 ns 21687 ns 32287 bytes_per_second=1.40717G/s BM_ByteStreamSplitEncode_Double_Scalar/32768 135249 ns 135209 ns 5175 bytes_per_second=1.80565G/s BM_ByteStreamSplitEncode_Double_Scalar/65536 273756 ns 273651 ns 2559 bytes_per_second=1.78432G/s

Neon implementation:

  • Decode

    BM_ByteStreamSplitDecode_Float_128bit/1024 138 ns 138 ns 5085948 bytes_per_second=27.7174G/s BM_ByteStreamSplitDecode_Float_128bit/4096 541 ns 541 ns 1293778 bytes_per_second=28.1905G/s BM_ByteStreamSplitDecode_Float_128bit/32768 4649 ns 4647 ns 150647 bytes_per_second=26.2698G/s BM_ByteStreamSplitDecode_Float_128bit/65536 10079 ns 10076 ns 69483 bytes_per_second=24.2306G/s
  • Encode

    BM_ByteStreamSplitEncode_Double_128bit/1024 387 ns 387 ns 1809221 bytes_per_second=19.71G/s BM_ByteStreamSplitEncode_Double_128bit/4096 1549 ns 1548 ns 452319 bytes_per_second=19.7186G/s BM_ByteStreamSplitEncode_Double_128bit/32768 17193 ns 17190 ns 40724 bytes_per_second=14.2024G/s BM_ByteStreamSplitEncode_Double_128bit/65536 34556 ns 34531 ns 20615 bytes_per_second=14.1402G/s

Benchmark conclusion:

  • Improve ByteStreamSplitDecode performance for float_1k/4k/32k/64k:

         1k: 21.6536G/s -> 27.7227G/s

         4K: 22.1561G/s -> 28.2019G/s

         32k: 21.9854G/s -> 25.8939G/s

         64k: 22.0192G/s -> 26.1399G/s

      No obvious performance difference for Double_1k/4k/32k/64k.

 

  • Improve ByteStreamSplitEncode performance for Double_1k/4k/32k/64k:

          double_1k: 1.40764G/s -> 19.71G/s

          double_4k: 1.40717G/s -> 19.7186G/s

          double_32k: 1.80565G/s -> 14.2024G/s

          double_64k: 1.78432G/s -> 14.1402G/s

        And no obvious performance difference for 'Float'.