Profiling Arrow on Arm64

Benchmark Case


1. Compute-benchmark

Run on (224 X 400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x56)
  L1 Instruction 32K (x56)
  L2 Unified 256K (x56)
  L3 Unified 32768K (x2)
---------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time           CPU Iterations
---------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000                                       3600 us       3599 us        387   1110.31MB/s
BM_BuildStringDictionary/min_time:1.000                                 8378 us       8376 us        167   36.0429MB/s
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time            110089 us     110069 us         13   1.13544GB/s
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time          109357 us     109341 us         13   1.14304GB/s
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time         145468 us     145448 us         10   879.917MB/s
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time       976512 us     975581 us          2   131.079MB/s
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time          134020 us     134004 us         10   955.078MB/s
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time        132638 us     132619 us         10   965.035MB/s
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time       187459 us     187428 us          7   682.815MB/s
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time    1143785 us    1143577 us          1   111.909MB/s
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time           492816 us     492750 us          3   324.665MB/s
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time         505816 us     505739 us          3   316.321MB/s
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time        930209 us     930075 us          2   172.004MB/s
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time     3684450 us    3683884 us          1   43.4258MB/s
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time         1679638 us    1679391 us          1   952.586MB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time       2064762 us    2064444 us          1   774.908MB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      2806435 us    2806041 us          1   570.118MB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    9844250 us    9841576 us          1   162.531MB/s
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time            27662 us      27658 us         51   578.409MB/s
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time          54364 us      54357 us         26   294.312MB/s

2. Hashing-benchmark

Run on (224 X 400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x56)
  L1 Instruction 32K (x56)
  L2 Unified 256K (x56)
  L3 Unified 32768K (x2)
----------------------------------------------------------------------
Benchmark                               Time           CPU Iterations
----------------------------------------------------------------------
BM_HashIntegers/repeats:1              20 us         20 us      34253    7.1921GB/s
BM_HashSmallStrings/repeats:1         186 us        186 us       3768   1103.54MB/s
BM_HashMediumStrings/repeats:1        710 us        710 us        986    1.8067GB/s
BM_HashLargeStrings/repeats:1         709 us        709 us       1004   2.66291GB/s


Perf Report


1. Compute-benchmark

perf record -e cpu-clock -g -o compute-bench.data ./compute-benchmark
perf report -i compute-bench.data
             


main                                                                                                                                                                      
         - 99.57% _ZN9benchmark22RunSpecifiedBenchmarksEPNS_17BenchmarkReporterES1_                                                                                                       
            + 30.77% _ZN5arrow7computeL23BM_UniqueString100bytesERN9benchmark5StateE                                                                                                     
            + 23.67% _ZN5arrow7computeL22BM_UniqueString10bytesERN9benchmark5StateE                                                                                                       
            + 15.57% _ZN5arrow7computeL21BM_UniqueInt64NoNullsERN9benchmark5StateE                                                                                                        
            + 15.31% _ZN5arrow7computeL23BM_UniqueInt64WithNullsERN9benchmark5StateE                                                                                                      
            + 5.63% _ZN5arrow7computeL23BM_UniqueUInt8WithNullsERN9benchmark5StateE                                                                                                       
             ......
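
The symbols in the report above are Itanium-ABI mangled names. As a reading aid, here is a minimal sketch of demangling them in C++, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>. The helper name Demangle is hypothetical and is not part of Arrow or perf.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)

#include <cstdlib>  // std::free
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if it cannot be demangled.
std::string Demangle(const std::string& mangled) {
  int status = 0;
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr) {
    return mangled;
  }
  std::string result(out);
  std::free(out);  // the buffer returned by __cxa_demangle is malloc-allocated
  return result;
}
```

Passing the top-level frame _ZN9benchmark22RunSpecifiedBenchmarksEPNS_17BenchmarkReporterES1_ to this helper should yield a readable benchmark::RunSpecifiedBenchmarks(...) signature. On the command line, piping symbols through c++filt does the same job.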
             

Hot methods

Arrow::BufferBuilder::Append
Arrow::ComputeStringHash
Arrow::BinaryMemoTable::Lookup
Arrow::compute::RegularHashKernelImpl::Append
Arrow::ArrayFromVector
Arrow::HashParams::randint

The profiling results are shown in the following flame graph: compute-bench.svg

2. Hashing-benchmark

perf record -e cpu-clock -g -o hashing-bench.data ./hashing-benchmark
perf report -i hashing-bench.data
             


main                                                                                                                              
         - _ZN9benchmark22RunSpecifiedBenchmarksEPNS_17BenchmarkReporterES1_                                                                                                       
              26.65% _ZN5arrow8internalL15BM_HashIntegersERN9benchmark5StateE                                                                                                            
            + 26.23% _ZN5arrow8internalL19BM_HashSmallStringsERN9benchmark5StateE                                                                                                        
            + 23.63% _ZN5arrow8internalL19BM_HashLargeStringsERN9benchmark5StateE                                                                                                        
            + 23.20% _ZN5arrow8internalL20BM_HashMediumStringsERN9benchmark5StateE   

Hot methods

Arrow::ComputeHash
Arrow::ComputeStringHash

The profiling results are shown in the following flame graph: hashing-bench.svg


Hot methods optimization


ComputeStringHash is one of the hot methods in Arrow. On x86, the hash utility uses the SSE4.2 CRC32 instructions to accelerate the hash computation.

Correspondingly, on AArch64 we can use the Armv8 CRC32 extension instructions to accelerate the hash computation.

The patch "Leverage Armv8 crc32 extension instructions to accelerate the hash computation for Arm64" has been merged upstream.
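
To illustrate the idea, here is a minimal sketch of a CRC32C routine that uses the Armv8 CRC32 intrinsics from <arm_acle.h> when the compiler enables them (for example with -march=armv8-a+crc, which defines __ARM_FEATURE_CRC32) and falls back to a portable bitwise loop otherwise. It is not the upstream patch. The function name Crc32c and the seed parameter are assumptions made for this example, and the 8-byte path assumes a little-endian target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  // __crc32cb, __crc32cd (Armv8 CRC32 extension)
#endif

// Hypothetical helper, not Arrow's API: CRC32C (Castagnoli) over a byte range.
inline uint32_t Crc32c(const void* data, size_t length, uint32_t seed = 0) {
  const unsigned char* p = static_cast<const unsigned char*>(data);
  uint32_t crc = ~seed;
#if defined(__ARM_FEATURE_CRC32)
  // Hardware path: consume 8 bytes per instruction (little-endian assumed),
  // then finish the remaining tail one byte at a time.
  while (length >= 8) {
    uint64_t word;
    std::memcpy(&word, p, sizeof(word));  // safe for unaligned input
    crc = __crc32cd(crc, word);
    p += 8;
    length -= 8;
  }
  while (length-- > 0) {
    crc = __crc32cb(crc, *p++);
  }
#else
  // Portable fallback: bitwise update with the reflected CRC32C polynomial.
  while (length-- > 0) {
    crc ^= *p++;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
#endif
  return ~crc;
}
```

Both paths compute the same checksum, so the fallback can serve as a reference when validating the hardware path. A real hash function would typically mix the CRC result further, which this sketch leaves out.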

Performance boost on Arm64

Hashing Benchmark

Before:

BM_HashSmallStrings/repeats:1         186 us        186 us       3768   1103.54MB/s
BM_HashMediumStrings/repeats:1        710 us        710 us        986    1.8067GB/s
BM_HashLargeStrings/repeats:1         709 us        709 us       1004   2.66291GB/s

After:

BM_HashSmallStrings/repeats:1         182 us        182 us       3854   1.21026GB/s
BM_HashMediumStrings/repeats:1        550 us        550 us       1273   2.35532GB/s
BM_HashLargeStrings/repeats:1         323 us        323 us       2145   5.93187GB/s

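SmallStrings hashing improves only slightly (186 us to 182 us per iteration), and MediumStrings hashing improves by about 30% in throughput (1.8067GB/s to 2.35532GB/s).
LargeStrings hashing, however, gets roughly a 2.2x speedup (709 us to 323 us per iteration).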
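
The speedups can be recomputed from the Time column of the Before and After listings. The following is a small sketch of that arithmetic. The function names are made up for this example, and the timings are copied from the hashing results above.

```cpp
#include <cstdio>

// Speedup as the ratio of per-iteration times: before / after.
constexpr double Speedup(double before_us, double after_us) {
  return before_us / after_us;
}

// Prints the speedup for each hashing benchmark, using the Time column
// (in microseconds) from the Before/After listings.
void PrintHashingSpeedups() {
  std::printf("BM_HashSmallStrings:  %.2fx\n", Speedup(186, 182));
  std::printf("BM_HashMediumStrings: %.2fx\n", Speedup(710, 550));
  std::printf("BM_HashLargeStrings:  %.2fx\n", Speedup(709, 323));
}
```

Ratios based on the throughput column can differ slightly from these time-based ratios because of rounding and run-to-run noise.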

Compute Benchmark

Before:

BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time         1679638 us    1679391 us          1   952.586MB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time       2064762 us    2064444 us          1   774.908MB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      2806435 us    2806041 us          1   570.118MB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    9844250 us    9841576 us          1   162.531MB/s

After:

BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time         1411850 us    1411664 us          1    1.1067GB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time       1617926 us    1617703 us          1    988.92MB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time      2410494 us    2410174 us          1   663.764MB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time    8404810 us    8403749 us          1   190.367MB/s


The performance of BM_UniqueString100bytes improves by roughly 20% (between about 16% and 28% across the four cases, by time per iteration).



Future work


  • Focus on further optimization of SmallStrings/MediumStrings hashing
  • Try to optimize the other hot methods