Arrow Parquet-Column-io benchmark and profiling
Host: Ampere Altra
2 numa nodes,
80 cores per node
500G memory
Enable Parquet and Benchmarks
Also enable snappy, lz4 and zstd for Column io
cmake -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_LZ4=ON -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON .. |
Run benchmarks
WriteInt64Column
BM_WriteInt64Column<Repetition::REQUIRED>/1048576 2639596 ns 2638016 ns 303 bytes_per_second=758.146M/s
BM_WriteInt64Column<Repetition::OPTIONAL>/1048576 5345657 ns 5341984 ns 132 bytes_per_second=374.393M/s
BM_WriteInt64Column<Repetition::REPEATED>/1048576 8851535 ns 8845487 ns 79 bytes_per_second=226.104M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1048576 17509566 ns 17492054 ns 40 bytes_per_second=114.338M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1048576 19999655 ns 19987350 ns 35 bytes_per_second=100.063M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1048576 23882360 ns 23868971 ns 30 bytes_per_second=83.7908M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::LZ4>/1048576 18408449 ns 18392312 ns 38 bytes_per_second=108.741M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1048576 20982352 ns 20969377 ns 32 bytes_per_second=95.3772M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::LZ4>/1048576 24883492 ns 24864342 ns 29 bytes_per_second=80.4365M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1048576 43420594 ns 43394430 ns 16 bytes_per_second=46.0889M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1048576 46023927 ns 45981533 ns 15 bytes_per_second=43.4957M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::ZSTD>/1048576 49986686 ns 49950594 ns 14 bytes_per_second=40.0396M/s |
ReadInt64Column
BM_ReadInt64Column<Repetition::REQUIRED>/1024/16 3805 ns 3803 ns 182461 bytes_per_second=513.556M/s
BM_ReadInt64Column<Repetition::REQUIRED>/1024/1024 2843 ns 2843 ns 245949 bytes_per_second=687.047M/s
BM_ReadInt64Column<Repetition::REQUIRED>/65536/1024 20008 ns 20001 ns 35386 bytes_per_second=6.10314G/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/16 5906 ns 5905 ns 118519 bytes_per_second=330.736M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/1024 4434 ns 4433 ns 158402 bytes_per_second=440.6M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/65536/1024 113364 ns 113341 ns 6105 bytes_per_second=1102.86M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/16 7859 ns 7857 ns 89492 bytes_per_second=248.599M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/1024 5699 ns 5698 ns 123520 bytes_per_second=342.759M/s
BM_ReadInt64Column<Repetition::REPEATED>/65536/1024 186313 ns 186263 ns 3761 bytes_per_second=671.094M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/16 6063 ns 6058 ns 114592 bytes_per_second=322.406M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/1024 5146 ns 5144 ns 136238 bytes_per_second=379.701M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/65536/1024 125527 ns 125473 ns 5602 bytes_per_second=996.231M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/16 8270 ns 8266 ns 84247 bytes_per_second=236.281M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/1024 6759 ns 6756 ns 103656 bytes_per_second=289.114M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/65536/1024 219198 ns 219107 ns 3177 bytes_per_second=570.499M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/16 10262 ns 10257 ns 68167 bytes_per_second=190.414M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/1024 7936 ns 7931 ns 87868 bytes_per_second=246.26M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/65536/1024 292810 ns 292707 ns 2386 bytes_per_second=427.048M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/16 5955 ns 5952 ns 116976 bytes_per_second=328.172M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/1024 5028 ns 5025 ns 139034 bytes_per_second=388.673M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/65536/1024 121755 ns 121714 ns 5742 bytes_per_second=1027M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/16 8137 ns 8130 ns 86195 bytes_per_second=240.233M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/1024 6624 ns 6622 ns 105572 bytes_per_second=294.96M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/65536/1024 219563 ns 219446 ns 3209 bytes_per_second=569.616M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/16 10164 ns 10159 ns 68763 bytes_per_second=192.264M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/1024 7868 ns 7865 ns 88969 bytes_per_second=248.34M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/65536/1024 290063 ns 289985 ns 2415 bytes_per_second=431.056M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/16 25455 ns 25447 ns 27518 bytes_per_second=76.7541M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/1024 24607 ns 24596 ns 28426 bytes_per_second=79.4086M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/65536/1024 1213728 ns 1213228 ns 578 bytes_per_second=103.031M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/16 28433 ns 28420 ns 24625 bytes_per_second=68.7238M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/1024 26956 ns 26947 ns 26013 bytes_per_second=72.481M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/65536/1024 1319779 ns 1318919 ns 528 bytes_per_second=94.7746M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/16 30563 ns 30550 ns 22888 bytes_per_second=63.9312M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/1024 28309 ns 28290 ns 24754 bytes_per_second=69.0395M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/65536/1024 1391098 ns 1390659 ns 502 bytes_per_second=89.8855M/s |
RleEncoding/RleDecoding
BM_RleEncoding/1024/1 1778 ns 1777 ns 391801 bytes_per_second=1098.86M/s items_per_second=576.12M/s
BM_RleEncoding/4096/1 6921 ns 6920 ns 100943 bytes_per_second=1.10252G/s items_per_second=591.913M/s
BM_RleEncoding/32768/1 54279 ns 54272 ns 12892 bytes_per_second=1.12462G/s items_per_second=603.774M/s
BM_RleEncoding/65536/1 108481 ns 108444 ns 6455 bytes_per_second=1.12566G/s items_per_second=604.332M/s
BM_RleEncoding/1024/8 4684 ns 4683 ns 146161 bytes_per_second=417.038M/s items_per_second=218.648M/s
BM_RleEncoding/4096/8 18622 ns 18618 ns 37582 bytes_per_second=419.63M/s items_per_second=220.007M/s
BM_RleEncoding/32768/8 148805 ns 148786 ns 4696 bytes_per_second=420.067M/s items_per_second=220.236M/s
BM_RleEncoding/65536/8 297708 ns 297613 ns 2349 bytes_per_second=420.009M/s items_per_second=220.206M/s
BM_RleEncoding/1024/16 5034 ns 5033 ns 139270 bytes_per_second=388.037M/s items_per_second=203.443M/s
BM_RleEncoding/4096/16 19970 ns 19964 ns 35012 bytes_per_second=391.337M/s items_per_second=205.173M/s
BM_RleEncoding/32768/16 159882 ns 159859 ns 4377 bytes_per_second=390.968M/s items_per_second=204.98M/s
BM_RleEncoding/65536/16 319724 ns 319654 ns 2192 bytes_per_second=391.047M/s items_per_second=205.021M/s
BM_RleDecoding/1024/1 1200 ns 1200 ns 583177 bytes_per_second=1.58909G/s items_per_second=853.134M/s
BM_RleDecoding/4096/1 4541 ns 4540 ns 154092 bytes_per_second=1.68058G/s items_per_second=902.253M/s
BM_RleDecoding/32768/1 35654 ns 35647 ns 19626 bytes_per_second=1.71223G/s items_per_second=919.247M/s
BM_RleDecoding/65536/1 71170 ns 71153 ns 9838 bytes_per_second=1.7156G/s items_per_second=921.055M/s
BM_RleDecoding/1024/8 1617 ns 1617 ns 432308 bytes_per_second=1.17978G/s items_per_second=633.387M/s
BM_RleDecoding/4096/8 6172 ns 6171 ns 113454 bytes_per_second=1.23639G/s items_per_second=663.784M/s
BM_RleDecoding/32768/8 49135 ns 49128 ns 14250 bytes_per_second=1.24237G/s items_per_second=666.993M/s
BM_RleDecoding/65536/8 98000 ns 97985 ns 7142 bytes_per_second=1.2458G/s items_per_second=668.836M/s
BM_RleDecoding/1024/16 4111 ns 4111 ns 170113 bytes_per_second=475.13M/s items_per_second=249.105M/s
BM_RleDecoding/4096/16 16079 ns 16076 ns 43550 bytes_per_second=485.967M/s items_per_second=254.787M/s
BM_RleDecoding/32768/16 128261 ns 128232 ns 5462 bytes_per_second=487.399M/s items_per_second=255.537M/s
BM_RleDecoding/65536/16 256631 ns 256595 ns 2729 bytes_per_second=487.148M/s items_per_second=255.406M/s |
Profiling
Hot methods
ReadInt64Column:
parquet::internal:FindMinMax → FindMinMaxImpl
→ LevelsToBitmap
template <typename Predicate>
inline uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels,
Predicate predicate) {
uint64_t mask = 0;
for (int x = 0; x < num_levels; x++) {
mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
}
return ::arrow::BitUtil::ToLittleEndian(mask);
}
inline MinMax FindMinMaxImpl(const int16_t* levels, int64_t num_levels) {
MinMax out{std::numeric_limits<int16_t>::max(), std::numeric_limits<int16_t>::min()};
for (int x = 0; x < num_levels; x++) {
out.min = std::min(levels[x], out.min);
out.max = std::max(levels[x], out.max);
}
return out;
} |
The GCC and Clang would vectorize this automatically with SSE4/AVX2:
for (int x = 0; x < num_levels; x++) {
mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
}
return ::arrow::BitUtil::ToLittleEndian(mask);
} |
Optimization for Arm64:
Also enable GCC vectorization for Arm64 Simd like AVX:
if(ARROW_HAVE_RUNTIME_AVX2)
# AVX2 is used as a proxy for BMI2.
list(APPEND PARQUET_SRCS level_comparison_avx2.cc level_conversion_bmi2.cc)
set_source_files_properties(level_comparison_avx2.cc
PROPERTIES
SKIP_PRECOMPILE_HEADERS
ON
COMPILE_FLAGS
"${ARROW_AVX2_FLAG}")
# WARNING: DO NOT BLINDLY COPY THIS CODE FOR OTHER BMI2 USE CASES.
# This code is always guarded by runtime dispatch which verifies
# BMI2 is present. For a very small number of CPUs AVX2 does not
# imply BMI2.
set_source_files_properties(level_conversion_bmi2.cc
PROPERTIES
SKIP_PRECOMPILE_HEADERS
ON
COMPILE_FLAGS
"${ARROW_AVX2_FLAG} -DARROW_HAVE_BMI2 -mbmi2")
endif() |
Other hot methods are:
lz4_decompress, snappy::RawUncompress, ZSTD_decompress
RleDecoding:
parquet::LevelDecoder::Decode → FindMinMaxImpl
→ unpack32
AVX bit unpack implemented in:
bpacking_avx2.cc, bpacking_avx2_generated.h, bpacking_avx512.cc, bpacking_avx512_generated.h.
Arm64: Also add a Neon optimized version of bit-unpacking that leverages the generated code for 128-bit SIMD.
RleEncoding:
LevelEncoder::Encode → Flush() → arrow::util::RleEncoder::FlushLiteralRun:
// Write all the buffered values as bit packed literals
for (int i = 0; i < num_buffered_values_; ++i) {
bool success = bit_writer_.PutValue(buffered_values_[i], bit_width_);
DCHECK(success) << "There is a bug in using CheckBufferFull()";
} |
FlushLiteralRun → PutValue → BitWriter::PutValue → ByteSwap (int64_t, int32_t, int16_t, int8_t):
It could be optimized by Arm64 Neon: vext/vextq:
vextq_u64 (uint64x2_t __a, uint64x2_t __b, __const int __c)
{
__AARCH64_LANE_CHECK (__a, __c);
#ifdef __AARCH64EB__
return __builtin_shuffle (__b, __a, (uint64x2_t) {2-__c, 3-__c});
#else
return __builtin_shuffle (__a, __b, (uint64x2_t) {__c, __c+1});
#endif
} |
__builtin_shuffle:
ext v0.16b, v0.16b, v0.16b, #8
ret |