Arrow Parquet-Column-io benchmark and profiling

Host: Ampere Altra

  • 2 NUMA nodes

  • 80 cores per node

  • 500G memory

Enable Parquet and Benchmarks

Also enable Snappy, LZ4, and Zstd for column I/O:

cmake -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_LZ4=ON -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON ..

Run benchmarks

WriteInt64Column

Benchmark  Time  CPU  Iterations
BM_WriteInt64Column<Repetition::REQUIRED>/1048576  2639596 ns  2638016 ns  303  bytes_per_second=758.146M/s
BM_WriteInt64Column<Repetition::OPTIONAL>/1048576  5345657 ns  5341984 ns  132  bytes_per_second=374.393M/s
BM_WriteInt64Column<Repetition::REPEATED>/1048576  8851535 ns  8845487 ns  79  bytes_per_second=226.104M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1048576  17509566 ns  17492054 ns  40  bytes_per_second=114.338M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1048576  19999655 ns  19987350 ns  35  bytes_per_second=100.063M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1048576  23882360 ns  23868971 ns  30  bytes_per_second=83.7908M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::LZ4>/1048576  18408449 ns  18392312 ns  38  bytes_per_second=108.741M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1048576  20982352 ns  20969377 ns  32  bytes_per_second=95.3772M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::LZ4>/1048576  24883492 ns  24864342 ns  29  bytes_per_second=80.4365M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1048576  43420594 ns  43394430 ns  16  bytes_per_second=46.0889M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1048576  46023927 ns  45981533 ns  15  bytes_per_second=43.4957M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::ZSTD>/1048576  49986686 ns  49950594 ns  14  bytes_per_second=40.0396M/s



ReadInt64Column

Benchmark  Time  CPU  Iterations
BM_ReadInt64Column<Repetition::REQUIRED>/1024/16  3805 ns  3803 ns  182461  bytes_per_second=513.556M/s
BM_ReadInt64Column<Repetition::REQUIRED>/1024/1024  2843 ns  2843 ns  245949  bytes_per_second=687.047M/s
BM_ReadInt64Column<Repetition::REQUIRED>/65536/1024  20008 ns  20001 ns  35386  bytes_per_second=6.10314G/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/16  5906 ns  5905 ns  118519  bytes_per_second=330.736M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/1024  4434 ns  4433 ns  158402  bytes_per_second=440.6M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/65536/1024  113364 ns  113341 ns  6105  bytes_per_second=1102.86M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/16  7859 ns  7857 ns  89492  bytes_per_second=248.599M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/1024  5699 ns  5698 ns  123520  bytes_per_second=342.759M/s
BM_ReadInt64Column<Repetition::REPEATED>/65536/1024  186313 ns  186263 ns  3761  bytes_per_second=671.094M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/16  6063 ns  6058 ns  114592  bytes_per_second=322.406M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/1024  5146 ns  5144 ns  136238  bytes_per_second=379.701M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/65536/1024  125527 ns  125473 ns  5602  bytes_per_second=996.231M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/16  8270 ns  8266 ns  84247  bytes_per_second=236.281M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/1024  6759 ns  6756 ns  103656  bytes_per_second=289.114M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/65536/1024  219198 ns  219107 ns  3177  bytes_per_second=570.499M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/16  10262 ns  10257 ns  68167  bytes_per_second=190.414M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/1024  7936 ns  7931 ns  87868  bytes_per_second=246.26M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/65536/1024  292810 ns  292707 ns  2386  bytes_per_second=427.048M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/16  5955 ns  5952 ns  116976  bytes_per_second=328.172M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/1024  5028 ns  5025 ns  139034  bytes_per_second=388.673M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/65536/1024  121755 ns  121714 ns  5742  bytes_per_second=1027M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/16  8137 ns  8130 ns  86195  bytes_per_second=240.233M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/1024  6624 ns  6622 ns  105572  bytes_per_second=294.96M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/65536/1024  219563 ns  219446 ns  3209  bytes_per_second=569.616M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/16  10164 ns  10159 ns  68763  bytes_per_second=192.264M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/1024  7868 ns  7865 ns  88969  bytes_per_second=248.34M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/65536/1024  290063 ns  289985 ns  2415  bytes_per_second=431.056M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/16  25455 ns  25447 ns  27518  bytes_per_second=76.7541M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/1024  24607 ns  24596 ns  28426  bytes_per_second=79.4086M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/65536/1024  1213728 ns  1213228 ns  578  bytes_per_second=103.031M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/16  28433 ns  28420 ns  24625  bytes_per_second=68.7238M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/1024  26956 ns  26947 ns  26013  bytes_per_second=72.481M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/65536/1024  1319779 ns  1318919 ns  528  bytes_per_second=94.7746M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/16  30563 ns  30550 ns  22888  bytes_per_second=63.9312M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/1024  28309 ns  28290 ns  24754  bytes_per_second=69.0395M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/65536/1024  1391098 ns  1390659 ns  502  bytes_per_second=89.8855M/s



RleEncoding/RleDecoding

Profiling

Column-io-Arm64-3.svg

Hot methods 

ReadInt64Column: 

parquet::internal::FindMinMax → FindMinMaxImpl

                                                → LevelsToBitmap

 



GCC and Clang vectorize this automatically with SSE4/AVX2.
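As a rough illustration of the kind of loop involved, the sketch below scans an array of int16_t levels for its minimum and maximum. The names LevelMinMax and FindMinMaxSketch are hypothetical and this is not Arrow's actual FindMinMaxImpl; it only shows a branch-free min/max reduction, which is a shape that compilers can usually auto-vectorize at higher optimization levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical result type for this sketch (not Arrow's own struct).
struct LevelMinMax {
  int16_t min;
  int16_t max;
};

// Hypothetical stand-in for a level min/max scan.
// The loop body has no branches and no cross-iteration dependency other
// than the min/max reduction, so it is a candidate for auto-vectorization.
inline LevelMinMax FindMinMaxSketch(const int16_t* levels, int64_t num_levels) {
  assert(num_levels >= 0);
  LevelMinMax out{std::numeric_limits<int16_t>::max(),
                  std::numeric_limits<int16_t>::min()};
  for (int64_t i = 0; i < num_levels; ++i) {
    out.min = std::min(out.min, levels[i]);
    out.max = std::max(out.max, levels[i]);
  }
  return out;
}
```

For an empty input the sketch returns the initial sentinels (min above max), so a caller would need to treat that case separately.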



Optimization for Arm64:

Enable GCC auto-vectorization for Arm64 SIMD as well, as is already done for AVX:

Other hot methods are:

 

lz4_decompress, snappy::RawUncompress, ZSTD_decompress



RleDecoding:

parquet::LevelDecoder::Decode → FindMinMaxImpl

                                                   → unpack32

 

AVX bit unpacking is implemented in:

bpacking_avx2.cc, bpacking_avx2_generated.h, bpacking_avx512.cc, bpacking_avx512_generated.h

Arm64: also add a NEON-optimized version of bit unpacking that leverages the generated code for 128-bit SIMD.
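For reference, a scalar version of bit unpacking can be sketched as follows. This is a minimal illustration with a hypothetical name (UnpackBitsSketch), not Arrow's unpack32 or its generated SIMD kernels. It assumes the LSB-first packing used by Parquet's RLE/bit-packed hybrid encoding and a bit width of at most 32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference for bit unpacking.
// Reads `num_values` values of `bit_width` bits each from `in`
// (packed LSB-first) and writes them to `out` as 32-bit integers.
// The caller must provide at least ceil(num_values * bit_width / 8) input bytes.
inline void UnpackBitsSketch(const uint8_t* in, int bit_width,
                             size_t num_values, uint32_t* out) {
  assert(bit_width >= 0 && bit_width <= 32);
  const uint64_t mask = (uint64_t{1} << bit_width) - 1;
  uint64_t buffer = 0;     // bits read from `in` but not yet consumed
  int bits_in_buffer = 0;  // number of valid low bits in `buffer`
  for (size_t i = 0; i < num_values; ++i) {
    // Refill one byte at a time until a full value is available.
    while (bits_in_buffer < bit_width) {
      buffer |= static_cast<uint64_t>(*in++) << bits_in_buffer;
      bits_in_buffer += 8;
    }
    out[i] = static_cast<uint32_t>(buffer & mask);
    buffer >>= bit_width;
    bits_in_buffer -= bit_width;
  }
}
```

With a bit width of 3, the three bytes 0x88 0xC6 0xFA unpack to the values 0 through 7. A SIMD version processes many values per instruction instead of one value per loop iteration, which is where the AVX2/AVX512 and NEON variants are expected to gain.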



RleEncoding:

LevelEncoder::Encode → Flush() → arrow::util::RleEncoder::FlushLiteralRun:

 

FlushLiteralRun → PutValue → BitWriter::PutValue → ByteSwap (int64_t, int32_t, int16_t, int8_t):

This could be optimized with Arm64 NEON (vext/vextq).
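To make the ByteSwap step concrete, here is a minimal portable sketch of byte reversal for 16-, 32-, and 64-bit integers. The function names are hypothetical and this is not Arrow's ByteSwap implementation. Compilers often recognize this shift-and-mask pattern and lower it to a single byte-reverse instruction, so whether a hand-written NEON variant is faster would need to be measured.

```cpp
#include <cstdint>

// Hypothetical portable byte-swap helpers (not Arrow's implementation).
constexpr uint16_t ByteSwap16Sketch(uint16_t v) {
  return static_cast<uint16_t>((v << 8) | (v >> 8));
}

constexpr uint32_t ByteSwap32Sketch(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Swap each 32-bit half, then exchange the halves.
constexpr uint64_t ByteSwap64Sketch(uint64_t v) {
  return (static_cast<uint64_t>(ByteSwap32Sketch(static_cast<uint32_t>(v))) << 32) |
         ByteSwap32Sketch(static_cast<uint32_t>(v >> 32));
}

// Compile-time checks of the expected byte order.
static_assert(ByteSwap16Sketch(0x1234) == 0x3412, "16-bit swap");
static_assert(ByteSwap32Sketch(0x12345678u) == 0x78563412u, "32-bit swap");
```

A single-byte value such as int8_t needs no swap, so only the wider types are covered.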