Optimizing Arrow ByteStreamSplitDecode/ByteStreamSplitEncode with Neon
Purpose
ByteStreamSplitDecode andByteStreamSplitEncode has SSE4 optimization, which may also get benefit from Neon.
Implementation code at
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/byte_stream_split.h
Benchmark code at
https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding_benchmark.cc
SSE implementation
1. load 128bit Data
_mm_loadu_si128
2. Store 128bit Data
_mm_storeu_si128
3. Shuffle low part of 128bit based on byte
_mm_unpacklo_epi8
4. Shuffle high part of 128bit based on byte
_mm_unpackhi_epi8
5. Shuffle low part of 128bit based on init
_mm_unpacklo_epi32
6. Shuffle high part of 128bit based on init
_mm_unpackhi_epi32
7. Shuffle low part of 128bit based on long
_mm_unpacklo_epi64
8. Shuffle high part of 128bit based on long
_mm_unpackhi_epi64
Neon implementation
1. load 128bit Data
vld1q_u8
2. Store 128bit Data
vst1q_u8
3. Shuffle low part of 128bit based on byte
vzip1q_u8
4. Shuffle high part of 128bit based on byte
vzip2q_u8
5. Shuffle low part of 128bit based on init
vzip1q_u32
6. Shuffle high part of 128bit based on init
vzip2q_u32
7. Shuffle low part of 128bit based on long
vzip1q_u64
8. Shuffle high part of 128bit based on long
vzip2q_u64
The patches of Neon implementation:
Decode: https://github.com/apache/arrow/pull/9424
Benchmark Results
Add Neon benchmark:
diff --git a/cpp/src/parquet/encoding_benchmark.cc b/cpp/src/parquet/encoding_benchmark.cc index 8e409c5e4..541b028b4 100644 --- a/cpp/src/parquet/encoding_benchmark.cc +++ b/cpp/src/parquet/encoding_benchmark.cc @@ -371,31 +371,31 @@ BENCHMARK(BM_ByteStreamSplitDecode_Double_Scalar)->Range(MIN_RANGE, MAX_RANGE); BENCHMARK(BM_ByteStreamSplitEncode_Float_Scalar)->Range(MIN_RANGE, MAX_RANGE); BENCHMARK(BM_ByteStreamSplitEncode_Double_Scalar)->Range(MIN_RANGE, MAX_RANGE); -#if defined(ARROW_HAVE_SSE4_2) -static void BM_ByteStreamSplitDecode_Float_Sse2(benchmark::State& state) { +#if defined(ARROW_HAVE_NEON) || defined(ARROW_HAVE_SSE4_2) +static void BM_ByteStreamSplitDecode_Float_128bit(benchmark::State& state) { BM_ByteStreamSplitDecode<float>( - state, ::arrow::util::internal::ByteStreamSplitDecodeSse2<float>); + state, ::arrow::util::internal::ByteStreamSplitDecode128bit<float>); } -static void BM_ByteStreamSplitDecode_Double_Sse2(benchmark::State& state) { +static void BM_ByteStreamSplitDecode_Double_128bit(benchmark::State& state) { BM_ByteStreamSplitDecode<double>( - state, ::arrow::util::internal::ByteStreamSplitDecodeSse2<double>); + state, ::arrow::util::internal::ByteStreamSplitDecode128bit<double>); } -static void BM_ByteStreamSplitEncode_Float_Sse2(benchmark::State& state) { +static void BM_ByteStreamSplitEncode_Float_128bit(benchmark::State& state) { BM_ByteStreamSplitEncode<float>( - state, ::arrow::util::internal::ByteStreamSplitEncodeSse2<float>); + state, ::arrow::util::internal::ByteStreamSplitEncode128bit<float>); } -static void BM_ByteStreamSplitEncode_Double_Sse2(benchmark::State& state) { +static void BM_ByteStreamSplitEncode_Double_128bit(benchmark::State& state) { BM_ByteStreamSplitEncode<double>( - state, ::arrow::util::internal::ByteStreamSplitEncodeSse2<double>); + state, ::arrow::util::internal::ByteStreamSplitEncode128bit<double>); } -BENCHMARK(BM_ByteStreamSplitDecode_Float_Sse2)->Range(MIN_RANGE, MAX_RANGE); -BENCHMARK(BM_ByteStreamSplitDecode_Double_Sse2)->Range(MIN_RANGE, MAX_RANGE); -BENCHMARK(BM_ByteStreamSplitEncode_Float_Sse2)->Range(MIN_RANGE, MAX_RANGE); -BENCHMARK(BM_ByteStreamSplitEncode_Double_Sse2)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitDecode_Float_128bit)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitDecode_Double_128bit)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitEncode_Float_128bit)->Range(MIN_RANGE, MAX_RANGE); +BENCHMARK(BM_ByteStreamSplitEncode_Double_128bit)->Range(MIN_RANGE, MAX_RANGE); #endif
Run ByteStreamSplitDecode and ByteStreamSplitEncode on Ampere-Altra
- 160 cores, 500G memory
- CPU Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
No Neon implementation:
Decode
BM_ByteStreamSplitDecode_Float_Scalar/1024 176 ns 176 ns 3974797 bytes_per_second=21.6626G/s BM_ByteStreamSplitDecode_Float_Scalar/4096 689 ns 688 ns 1016797 bytes_per_second=22.1632G/s BM_ByteStreamSplitDecode_Float_Scalar/32768 5569 ns 5568 ns 125595 bytes_per_second=21.9223G/s BM_ByteStreamSplitDecode_Float_Scalar/65536 11164 ns 11161 ns 62682 bytes_per_second=21.8744G/s
Encode
BM_ByteStreamSplitEncode_Double_Scalar/1024 5421 ns 5420 ns 129149 bytes_per_second=1.40764G/s BM_ByteStreamSplitEncode_Double_Scalar/4096 21690 ns 21687 ns 32287 bytes_per_second=1.40717G/s BM_ByteStreamSplitEncode_Double_Scalar/32768 135249 ns 135209 ns 5175 bytes_per_second=1.80565G/s BM_ByteStreamSplitEncode_Double_Scalar/65536 273756 ns 273651 ns 2559 bytes_per_second=1.78432G/s
Neon implementation:
Decode
BM_ByteStreamSplitDecode_Float_128bit/1024 138 ns 138 ns 5085948 bytes_per_second=27.7174G/s BM_ByteStreamSplitDecode_Float_128bit/4096 541 ns 541 ns 1293778 bytes_per_second=28.1905G/s BM_ByteStreamSplitDecode_Float_128bit/32768 4649 ns 4647 ns 150647 bytes_per_second=26.2698G/s BM_ByteStreamSplitDecode_Float_128bit/65536 10079 ns 10076 ns 69483 bytes_per_second=24.2306G/s
Encode
BM_ByteStreamSplitEncode_Double_128bit/1024 387 ns 387 ns 1809221 bytes_per_second=19.71G/s BM_ByteStreamSplitEncode_Double_128bit/4096 1549 ns 1548 ns 452319 bytes_per_second=19.7186G/s BM_ByteStreamSplitEncode_Double_128bit/32768 17193 ns 17190 ns 40724 bytes_per_second=14.2024G/s BM_ByteStreamSplitEncode_Double_128bit/65536 34556 ns 34531 ns 20615 bytes_per_second=14.1402G/s
Benchmark conclusion:
- Improve ByteStreamSplitDecode performance for float_1k/4k/32k/64k:
1k: 21.6536G/s -> 27.7227G/s
4K: 22.1561G/s -> 28.2019G/s
32k: 21.9854G/s -> 25.8939G/s
64k: 22.0192G/s -> 26.1399G/s
No obvious performance difference for Double_1k/4k/32k/64k.
- Improve ByteStreamSplitEncode performance for Double_1k/4k/32k/64k:
double_1k: 1.40764G/s -> 19.71G/s
double_4k: 1.40717G/s -> 19.7186G/s
double_32k: 1.80565G/s -> 14.2024G/s
double_64k: 1.78432G/s -> 14.1402G/s
And no obvious performance difference for 'Float'.