Host: Ampere Altra

2 numa nodes,
80 cores per node
500G memory

Enable Parquet and Benchmarks

Also enable snappy, lz4 and zstd for Column io

cmake -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZSTD=ON -DARROW_WITH_LZ4=ON -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_BUILD_BENCHMARKS=ON ..

Run benchmarks

WriteInt64Column

BM_WriteInt64Column<Repetition::REQUIRED>/1048576                           2639596 ns      2638016 ns          303 bytes_per_second=758.146M/s
BM_WriteInt64Column<Repetition::OPTIONAL>/1048576                           5345657 ns      5341984 ns          132 bytes_per_second=374.393M/s
BM_WriteInt64Column<Repetition::REPEATED>/1048576                           8851535 ns      8845487 ns           79 bytes_per_second=226.104M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1048576     17509566 ns     17492054 ns           40 bytes_per_second=114.338M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1048576     19999655 ns     19987350 ns           35 bytes_per_second=100.063M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1048576     23882360 ns     23868971 ns           30 bytes_per_second=83.7908M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::LZ4>/1048576        18408449 ns     18392312 ns           38 bytes_per_second=108.741M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1048576        20982352 ns     20969377 ns           32 bytes_per_second=95.3772M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::LZ4>/1048576        24883492 ns     24864342 ns           29 bytes_per_second=80.4365M/s
BM_WriteInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1048576       43420594 ns     43394430 ns           16 bytes_per_second=46.0889M/s
BM_WriteInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1048576       46023927 ns     45981533 ns           15 bytes_per_second=43.4957M/s
BM_WriteInt64Column<Repetition::REPEATED, Compression::ZSTD>/1048576       49986686 ns     49950594 ns           14 bytes_per_second=40.0396M/s

ReadInt64Column

BM_ReadInt64Column<Repetition::REQUIRED>/1024/16                               3805 ns         3803 ns       182461 bytes_per_second=513.556M/s
BM_ReadInt64Column<Repetition::REQUIRED>/1024/1024                             2843 ns         2843 ns       245949 bytes_per_second=687.047M/s
BM_ReadInt64Column<Repetition::REQUIRED>/65536/1024                           20008 ns        20001 ns        35386 bytes_per_second=6.10314G/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/16                               5906 ns         5905 ns       118519 bytes_per_second=330.736M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/1024/1024                             4434 ns         4433 ns       158402 bytes_per_second=440.6M/s
BM_ReadInt64Column<Repetition::OPTIONAL>/65536/1024                          113364 ns       113341 ns         6105 bytes_per_second=1102.86M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/16                               7859 ns         7857 ns        89492 bytes_per_second=248.599M/s
BM_ReadInt64Column<Repetition::REPEATED>/1024/1024                             5699 ns         5698 ns       123520 bytes_per_second=342.759M/s
BM_ReadInt64Column<Repetition::REPEATED>/65536/1024                          186313 ns       186263 ns         3761 bytes_per_second=671.094M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/16          6063 ns         6058 ns       114592 bytes_per_second=322.406M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/1024/1024        5146 ns         5144 ns       136238 bytes_per_second=379.701M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::SNAPPY>/65536/1024     125527 ns       125473 ns         5602 bytes_per_second=996.231M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/16          8270 ns         8266 ns        84247 bytes_per_second=236.281M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/1024/1024        6759 ns         6756 ns       103656 bytes_per_second=289.114M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::SNAPPY>/65536/1024     219198 ns       219107 ns         3177 bytes_per_second=570.499M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/16         10262 ns        10257 ns        68167 bytes_per_second=190.414M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/1024/1024        7936 ns         7931 ns        87868 bytes_per_second=246.26M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::SNAPPY>/65536/1024     292810 ns       292707 ns         2386 bytes_per_second=427.048M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/16             5955 ns         5952 ns       116976 bytes_per_second=328.172M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/1024/1024           5028 ns         5025 ns       139034 bytes_per_second=388.673M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::LZ4>/65536/1024        121755 ns       121714 ns         5742 bytes_per_second=1027M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/16             8137 ns         8130 ns        86195 bytes_per_second=240.233M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/1024/1024           6624 ns         6622 ns       105572 bytes_per_second=294.96M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::LZ4>/65536/1024        219563 ns       219446 ns         3209 bytes_per_second=569.616M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/16            10164 ns        10159 ns        68763 bytes_per_second=192.264M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/1024/1024           7868 ns         7865 ns        88969 bytes_per_second=248.34M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::LZ4>/65536/1024        290063 ns       289985 ns         2415 bytes_per_second=431.056M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/16           25455 ns        25447 ns        27518 bytes_per_second=76.7541M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/1024/1024         24607 ns        24596 ns        28426 bytes_per_second=79.4086M/s
BM_ReadInt64Column<Repetition::REQUIRED, Compression::ZSTD>/65536/1024      1213728 ns      1213228 ns          578 bytes_per_second=103.031M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/16           28433 ns        28420 ns        24625 bytes_per_second=68.7238M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/1024/1024         26956 ns        26947 ns        26013 bytes_per_second=72.481M/s
BM_ReadInt64Column<Repetition::OPTIONAL, Compression::ZSTD>/65536/1024      1319779 ns      1318919 ns          528 bytes_per_second=94.7746M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/16           30563 ns        30550 ns        22888 bytes_per_second=63.9312M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/1024/1024         28309 ns        28290 ns        24754 bytes_per_second=69.0395M/s
BM_ReadInt64Column<Repetition::REPEATED, Compression::ZSTD>/65536/1024      1391098 ns      1390659 ns          502 bytes_per_second=89.8855M/s

RleEncoding/RleDecoding

BM_RleEncoding/1024/1 1778 ns 1777 ns 391801 bytes_per_second=1098.86M/s items_per_second=576.12M/s
BM_RleEncoding/4096/1 6921 ns 6920 ns 100943 bytes_per_second=1.10252G/s items_per_second=591.913M/s
BM_RleEncoding/32768/1 54279 ns 54272 ns 12892 bytes_per_second=1.12462G/s items_per_second=603.774M/s
BM_RleEncoding/65536/1 108481 ns 108444 ns 6455 bytes_per_second=1.12566G/s items_per_second=604.332M/s
BM_RleEncoding/1024/8 4684 ns 4683 ns 146161 bytes_per_second=417.038M/s items_per_second=218.648M/s
BM_RleEncoding/4096/8 18622 ns 18618 ns 37582 bytes_per_second=419.63M/s items_per_second=220.007M/s
BM_RleEncoding/32768/8 148805 ns 148786 ns 4696 bytes_per_second=420.067M/s items_per_second=220.236M/s
BM_RleEncoding/65536/8 297708 ns 297613 ns 2349 bytes_per_second=420.009M/s items_per_second=220.206M/s
BM_RleEncoding/1024/16 5034 ns 5033 ns 139270 bytes_per_second=388.037M/s items_per_second=203.443M/s
BM_RleEncoding/4096/16 19970 ns 19964 ns 35012 bytes_per_second=391.337M/s items_per_second=205.173M/s
BM_RleEncoding/32768/16 159882 ns 159859 ns 4377 bytes_per_second=390.968M/s items_per_second=204.98M/s
BM_RleEncoding/65536/16 319724 ns 319654 ns 2192 bytes_per_second=391.047M/s items_per_second=205.021M/s
BM_RleDecoding/1024/1 1200 ns 1200 ns 583177 bytes_per_second=1.58909G/s items_per_second=853.134M/s
BM_RleDecoding/4096/1 4541 ns 4540 ns 154092 bytes_per_second=1.68058G/s items_per_second=902.253M/s
BM_RleDecoding/32768/1 35654 ns 35647 ns 19626 bytes_per_second=1.71223G/s items_per_second=919.247M/s
BM_RleDecoding/65536/1 71170 ns 71153 ns 9838 bytes_per_second=1.7156G/s items_per_second=921.055M/s
BM_RleDecoding/1024/8 1617 ns 1617 ns 432308 bytes_per_second=1.17978G/s items_per_second=633.387M/s
BM_RleDecoding/4096/8 6172 ns 6171 ns 113454 bytes_per_second=1.23639G/s items_per_second=663.784M/s
BM_RleDecoding/32768/8 49135 ns 49128 ns 14250 bytes_per_second=1.24237G/s items_per_second=666.993M/s
BM_RleDecoding/65536/8 98000 ns 97985 ns 7142 bytes_per_second=1.2458G/s items_per_second=668.836M/s
BM_RleDecoding/1024/16 4111 ns 4111 ns 170113 bytes_per_second=475.13M/s items_per_second=249.105M/s
BM_RleDecoding/4096/16 16079 ns 16076 ns 43550 bytes_per_second=485.967M/s items_per_second=254.787M/s
BM_RleDecoding/32768/16 128261 ns 128232 ns 5462 bytes_per_second=487.399M/s items_per_second=255.537M/s
BM_RleDecoding/65536/16 256631 ns 256595 ns 2729 bytes_per_second=487.148M/s items_per_second=255.406M/s

Profiling

Column-io-Arm64-3.svg

Hot methods

ReadInt64Column:

parquet::internal:FindMinMax → FindMinMaxImpl

→ LevelsToBitmap

template <typename Predicate>
inline uint64_t LevelsToBitmap(const int16_t* levels, int64_t num_levels,
                               Predicate predicate) {
  uint64_t mask = 0;
  for (int x = 0; x < num_levels; x++) {
    mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
  }
  return ::arrow::BitUtil::ToLittleEndian(mask);
}

inline MinMax FindMinMaxImpl(const int16_t* levels, int64_t num_levels) {
  MinMax out{std::numeric_limits<int16_t>::max(), std::numeric_limits<int16_t>::min()};
  for (int x = 0; x < num_levels; x++) {
    out.min = std::min(levels[x], out.min);
    out.max = std::max(levels[x], out.max);
  }
  return out;
}

The GCC and Clang would vectorize this automatically with SSE4/AVX2:

  for (int x = 0; x < num_levels; x++) {
    mask |= static_cast<uint64_t>(predicate(levels[x]) ? 1 : 0) << x;
  }
  return ::arrow::BitUtil::ToLittleEndian(mask);
}

Optimization for Arm64:

Also enable GCC vectorization for Arm64 Simd like AVX:

 if(ARROW_HAVE_RUNTIME_AVX2)
  # AVX2 is used as a proxy for BMI2.
  list(APPEND PARQUET_SRCS level_comparison_avx2.cc level_conversion_bmi2.cc)
  set_source_files_properties(level_comparison_avx2.cc
                              PROPERTIES
                              SKIP_PRECOMPILE_HEADERS
                              ON
                              COMPILE_FLAGS
                              "${ARROW_AVX2_FLAG}")
  # WARNING: DO NOT BLINDLY COPY THIS CODE FOR OTHER BMI2 USE CASES.
  # This code is always guarded by runtime dispatch which verifies
  # BMI2 is present.  For a very small number of CPUs AVX2 does not
  # imply BMI2.
  set_source_files_properties(level_conversion_bmi2.cc
                              PROPERTIES
                              SKIP_PRECOMPILE_HEADERS
                              ON
                              COMPILE_FLAGS
                              "${ARROW_AVX2_FLAG} -DARROW_HAVE_BMI2 -mbmi2")
endif()

Other hot methods are:

lz4_decompress, snappy::RawUncompress, ZSTD_decompress

RleDecoding:

parquet::LevelDecoder::Decode → FindMinMaxImpl

→ unpack32

AVX bit unpack implemented in:

bpacking_avx2.cc, bpacking_avx2_generated.h, bpacking_avx512.cc, bpacking_avx512_generated.h.

Arm64: Also add a Neon optimized version of bit-unpacking that leverages the generated code for 128-bit SIMD.

RleEncoding:

LevelEncoder::Encode → Flush() → arrow::util::RleEncoder::FlushLiteralRun:

  // Write all the buffered values as bit packed literals
  for (int i = 0; i < num_buffered_values_; ++i) {
    bool success = bit_writer_.PutValue(buffered_values_[i], bit_width_);
    DCHECK(success) << "There is a bug in using CheckBufferFull()";
  }

FlushLiteralRun → PutValue → BitWriter::PutValue → ByteSwap (int64_t, int32_t, int16_t, int8_t):

It could be optimized by Arm64 Neon: vext/vextq:

vextq_u64 (uint64x2_t __a, uint64x2_t __b, __const int __c)
{
  __AARCH64_LANE_CHECK (__a, __c);
#ifdef __AARCH64EB__
  return __builtin_shuffle (__b, __a, (uint64x2_t) {2-__c, 3-__c});
#else
  return __builtin_shuffle (__a, __b, (uint64x2_t) {__c, __c+1});
#endif
}

__builtin_shuffle:
        ext     v0.16b, v0.16b, v0.16b, #8
        ret

Big Data Team Space

Arrow Parquet-Column-io benchmark and profiling

Host: Ampere Altra

Enable Parquet and Benchmarks

Run benchmarks

Profiling

Hot methods

Related content