Arrow Parquet-Column-io benchmark and profiling
Host: Ampere Altra
2 NUMA nodes
80 cores per node
500G memory
Enable Parquet and Benchmarks
Also enable Snappy, LZ4 and ZSTD for the column-io benchmark
cmake -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_LZ4=ON -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON ..
Run benchmarks
WriteInt64Column
BM_WriteInt64Column<Repetition::REQUIRED>/1048576 2639596 ns 2638016 ns 303 bytes_per_second=758.146M/s
BM_WriteInt64Column<Repetition::OPTIONAL>/1048576 5345657 ns 5341984 ns 132 bytes_per_second=374.393M/s
BM_WriteInt64Column<Repetition::REPEATED>/1048576 8851535 ns 8845487 ns 79 bytes_per_second=226.104M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1048576 17509566 ns 17492054 ns 40 bytes_per_second=114.338M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1048576 19999655 ns 19987350 ns 35 bytes_per_second=100.063M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1048576 23882360 ns 23868971 ns 30 bytes_per_second=83.7908M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::LZ4>/1048576 18408449 ns 18392312 ns 38 bytes_per_second=108.741M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1048576 20982352 ns 20969377 ns 32 bytes_per_second=95.3772M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::LZ4>/1048576 24883492 ns 24864342 ns 29 bytes_per_second=80.4365M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1048576 43420594 ns 43394430 ns 16 bytes_per_second=46.0889M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1048576 46023927 ns 45981533 ns 15 bytes_per_second=43.4957M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::ZSTD>/1048576 49986686 ns 49950594 ns 14 bytes_per_second=40.0396M/s
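To put the codec cost in the write listing above into one number, the uncompressed rate can be divided by the compressed rate for the same repetition type. The helper below is only an illustrative sketch; the name Slowdown is hypothetical and is not part of Arrow or Google Benchmark.

```cpp
#include <cassert>

// Hypothetical helper: ratio of an uncompressed rate to a compressed rate.
// Both arguments must use the same unit (here the M/s figures from the log).
// A result of 6.6 means the compressed path moves data 6.6x more slowly.
inline double Slowdown(double uncompressed_rate, double compressed_rate) {
  assert(compressed_rate > 0.0);
  return uncompressed_rate / compressed_rate;
}
```

With the REQUIRED rows above, Slowdown(758.146, 114.338) is roughly 6.6 for Snappy, Slowdown(758.146, 108.741) roughly 7.0 for LZ4, and Slowdown(758.146, 46.0889) roughly 16.4 for ZSTD.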
ReadInt64Column
BM_ReadInt64Column<Repetition::REQUIRED>/1024/16 3805 ns 3803 ns 182461 bytes_per_second=513.556M/s
BM_ReadInt64Column<Repetition::REQUIRED>/1024/1024 2843 ns 2843 ns 245949 bytes_per_second=687.047M/s
BM_ReadInt64Column<Repetition::REQUIRED>/65536/1024 20008 ns 20001 ns 35386 bytes_per_second=6.10314G/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/16 5906 ns 5905 ns 118519 bytes_per_second=330.736M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/1024 4434 ns 4433 ns 158402 bytes_per_second=440.6M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/65536/1024 113364 ns 113341 ns 6105 bytes_per_second=1102.86M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/16 7859 ns 7857 ns 89492 bytes_per_second=248.599M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/1024 5699 ns 5698 ns 123520 bytes_per_second=342.759M/s
BM_ReadInt64Column<Repetition::REPEATED>/65536/1024 186313 ns 186263 ns 3761 bytes_per_second=671.094M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/16 6063 ns 6058 ns 114592 bytes_per_second=322.406M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/1024 5146 ns 5144 ns 136238 bytes_per_second=379.701M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/65536/1024 125527 ns 125473 ns 5602 bytes_per_second=996.231M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/16 8270 ns 8266 ns 84247 bytes_per_second=236.281M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/1024 6759 ns 6756 ns 103656 bytes_per_second=289.114M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/65536/1024 219198 ns 219107 ns 3177 bytes_per_second=570.499M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/16 10262 ns 10257 ns 68167 bytes_per_second=190.414M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/1024 7936 ns 7931 ns 87868 bytes_per_second=246.26M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/65536/1024 292810 ns 292707 ns 2386 bytes_per_second=427.048M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/16 5955 ns 5952 ns 116976 bytes_per_second=328.172M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/1024 5028 ns 5025 ns 139034 bytes_per_second=388.673M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/65536/1024 121755 ns 121714 ns 5742 bytes_per_second=1027M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/16 8137 ns 8130 ns 86195 bytes_per_second=240.233M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/1024 6624 ns 6622 ns 105572 bytes_per_second=294.96M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/65536/1024 219563 ns 219446 ns 3209 bytes_per_second=569.616M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/16 10164 ns 10159 ns 68763 bytes_per_second=192.264M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/1024 7868 ns 7865 ns 88969 bytes_per_second=248.34M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/65536/1024 290063 ns 289985 ns 2415 bytes_per_second=431.056M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/16 25455 ns 25447 ns 27518 bytes_per_second=76.7541M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/1024 24607 ns 24596 ns 28426 bytes_per_second=79.4086M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/65536/1024 1213728 ns 1213228 ns 578 bytes_per_second=103.031M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/16 28433 ns 28420 ns 24625 bytes_per_second=68.7238M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/1024 26956 ns 26947 ns 26013 bytes_per_second=72.481M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/65536/1024 1319779 ns 1318919 ns 528 bytes_per_second=94.7746M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/16 30563 ns 30550 ns 22888 bytes_per_second=63.9312M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/1024 28309 ns 28290 ns 24754 bytes_per_second=69.0395M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/65536/1024 1391098 ns 1390659 ns 502 bytes_per_second=89.8855M/s
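The bytes_per_second figures in the listings above can be reproduced from the value count (the first number after the slash) and the CPU-time column if each value is counted as 2 bytes and M is read as 1024 * 1024. The 2-byte figure is an inference from the numbers themselves, not something confirmed from the benchmark source, so treat it as an assumption. The helper name MiBPerSecond is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: throughput in MiB/s for one benchmark iteration.
//   values          - number of values processed per iteration
//   bytes_per_value - ASSUMED byte count per value (2 reproduces the log)
//   cpu_ns          - CPU time per iteration in nanoseconds
inline double MiBPerSecond(int64_t values, int bytes_per_value, double cpu_ns) {
  assert(cpu_ns > 0.0);
  const double bytes = static_cast<double>(values) * bytes_per_value;
  const double seconds = cpu_ns * 1e-9;
  return bytes / seconds / (1024.0 * 1024.0);
}
```

For example, MiBPerSecond(1024, 2, 3803) comes out at about 513.6, close to the 513.556M/s reported for BM_ReadInt64Column<Repetition::REQUIRED>/1024/16; the small gap is because the log prints the time rounded to whole nanoseconds. If the assumption holds, the rate in terms of 8-byte int64 payload would be four times the printed figure.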
RleEncoding/RleDecoding
Profiling
Hot methods
ReadInt64Column:
parquet::internal::FindMinMax → FindMinMaxImpl
→ LevelsToBitmap
GCC and Clang vectorize this automatically with SSE4/AVX2.
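As a rough illustration of the kind of loop that compilers can auto-vectorize, here is a minimal min/max reduction over 16-bit levels. It is a sketch under assumptions, not Arrow's actual FindMinMaxImpl; the names MinMax and FindMinMaxSketch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical result type for the sketch.
struct MinMax {
  int16_t min;
  int16_t max;
};

// Branch-free min/max reduction. The conditional expressions map directly to
// SIMD min/max instructions, which is what lets an optimizing compiler
// vectorize the loop when built with optimization enabled.
inline MinMax FindMinMaxSketch(const int16_t* levels, std::size_t n) {
  assert(levels != nullptr || n == 0);
  int16_t lo = std::numeric_limits<int16_t>::max();
  int16_t hi = std::numeric_limits<int16_t>::min();
  for (std::size_t i = 0; i < n; ++i) {
    lo = levels[i] < lo ? levels[i] : lo;
    hi = levels[i] > hi ? levels[i] : hi;
  }
  return {lo, hi};
}
```

For an empty input the sketch returns the two sentinel limits, so a caller would need to check n before trusting the result.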
Optimization for Arm64:
Also enable GCC auto-vectorization for Arm64 SIMD (Neon), in the same way as for AVX.
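The LevelsToBitmap step named in the call chain above can be pictured as turning up to 64 levels into a 64-bit mask, one bit per level. The sketch below assumes a simple "greater than a threshold" predicate; that predicate and the name LevelsToBitmapSketch are assumptions for illustration and may differ from what Arrow does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: bit i of the result is set when levels[i] > threshold.
// The loop body has no branches, only a compare, a shift and an OR, which is
// the shape that auto-vectorizers handle well.
inline uint64_t LevelsToBitmapSketch(const int16_t* levels, int num_levels,
                                     int16_t threshold) {
  assert(num_levels >= 0 && num_levels <= 64);
  uint64_t mask = 0;
  for (int i = 0; i < num_levels; ++i) {
    mask |= static_cast<uint64_t>(levels[i] > threshold ? 1 : 0) << i;
  }
  return mask;
}
```

With levels {1, 0, 1, 1, 0} and threshold 0, bits 0, 2 and 3 are set, so the mask is 13.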
Other hot methods are:
lz4_decompress, snappy::RawUncompress, ZSTD_decompress
RleDecoding:
parquet::LevelDecoder::Decode → FindMinMaxImpl
→ unpack32
The AVX bit-unpack implementations are in:
bpacking_avx2.cc, bpacking_avx2_generated.h, bpacking_avx512.cc, bpacking_avx512_generated.h.
Arm64: also add a Neon-optimized version of bit unpacking that leverages the generated code for 128-bit SIMD.
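For readers unfamiliar with bit unpacking, the scalar sketch below shows what the operation computes: fixed-width values stored back to back, least significant bit first, are expanded to one 32-bit integer each. It is deliberately bit-at-a-time and unoptimized; it is not the unpack32 routine or any of the generated SIMD kernels, and PackBits and UnpackBits are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack each value into bit_width bits, LSB first, with no padding between values.
inline std::vector<uint8_t> PackBits(const std::vector<uint32_t>& values,
                                     int bit_width) {
  assert(bit_width >= 1 && bit_width <= 32);
  std::vector<uint8_t> out((values.size() * bit_width + 7) / 8, 0);
  std::size_t bit_pos = 0;
  for (uint32_t v : values) {
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      if ((v >> b) & 1u) {
        out[bit_pos / 8] |= static_cast<uint8_t>(1u << (bit_pos % 8));
      }
    }
  }
  return out;
}

// Inverse of PackBits: read count values of bit_width bits each.
inline std::vector<uint32_t> UnpackBits(const std::vector<uint8_t>& packed,
                                        int bit_width, std::size_t count) {
  assert(bit_width >= 1 && bit_width <= 32);
  assert(packed.size() * 8 >= count * bit_width);
  std::vector<uint32_t> out(count, 0);
  std::size_t bit_pos = 0;
  for (std::size_t i = 0; i < count; ++i) {
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      const uint32_t bit = (packed[bit_pos / 8] >> (bit_pos % 8)) & 1u;
      out[i] |= bit << b;
    }
  }
  return out;
}
```

Packing the values 0 through 7 at a width of 3 bits gives the three bytes 0x88, 0xC6, 0xFA, and unpacking them returns the original values. A SIMD version does the same work on many values per instruction with shifts and masks instead of single bits.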
RleEncoding:
LevelEncoder::Encode → Flush() → arrow::util::RleEncoder::FlushLiteralRun:
FlushLiteralRun → PutValue → BitWriter::PutValue → ByteSwap (int64_t, int32_t, int16_t, int8_t):
ByteSwap could be optimized with the Arm64 Neon vext/vextq intrinsics.
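For reference, a portable scalar byte swap for the 16-, 32- and 64-bit widths mentioned above can be sketched as follows (an 8-bit value needs no swap). This is only a baseline to compare against and does not attempt the Neon variant; the function names are hypothetical and are not Arrow's ByteSwap overloads.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the two bytes of a 16-bit value.
inline uint16_t ByteSwap16(uint16_t v) {
  return static_cast<uint16_t>((v << 8) | (v >> 8));
}

// Reverse the four bytes of a 32-bit value.
inline uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Reverse the eight bytes of a 64-bit value by swapping each half
// and exchanging the halves.
inline uint64_t ByteSwap64(uint64_t v) {
  return (static_cast<uint64_t>(ByteSwap32(static_cast<uint32_t>(v))) << 32) |
         ByteSwap32(static_cast<uint32_t>(v >> 32));
}

// Round-trip check: swapping twice must give back the input.
inline void CheckRoundTrip(uint64_t v) {
  assert(ByteSwap64(ByteSwap64(v)) == v);
}
```

ByteSwap64(0x0102030405060708) yields 0x0807060504030201. Measuring this baseline first shows whether a hand-written SIMD version is actually faster for single values, since optimizing compilers often already recognize this shift-and-mask pattern.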