Profiling Arrow on Arm64
Benchmark Case
1. Compute-benchmark
Run on (224 X 400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x56)
  L1 Instruction 32K (x56)
  L2 Unified 256K (x56)
  L3 Unified 32768K (x2)
---------------------------------------------------------------------------------------------------------
Benchmark  Time  CPU  Iterations
---------------------------------------------------------------------------------------------------------
BM_BuildDictionary/min_time:1.000  3600 us  3599 us  387  1110.31MB/s
BM_BuildStringDictionary/min_time:1.000  8378 us  8376 us  167  36.0429MB/s
BM_UniqueInt64NoNulls/16777216/50/min_time:1.000/real_time  110089 us  110069 us  13  1.13544GB/s
BM_UniqueInt64NoNulls/16777216/1024/min_time:1.000/real_time  109357 us  109341 us  13  1.14304GB/s
BM_UniqueInt64NoNulls/16777216/10240/min_time:1.000/real_time  145468 us  145448 us  10  879.917MB/s
BM_UniqueInt64NoNulls/16777216/1048576/min_time:1.000/real_time  976512 us  975581 us  2  131.079MB/s
BM_UniqueInt64WithNulls/16777216/50/min_time:1.000/real_time  134020 us  134004 us  10  955.078MB/s
BM_UniqueInt64WithNulls/16777216/1024/min_time:1.000/real_time  132638 us  132619 us  10  965.035MB/s
BM_UniqueInt64WithNulls/16777216/10240/min_time:1.000/real_time  187459 us  187428 us  7  682.815MB/s
BM_UniqueInt64WithNulls/16777216/1048576/min_time:1.000/real_time  1143785 us  1143577 us  1  111.909MB/s
BM_UniqueString10bytes/16777216/50/min_time:1.000/real_time  492816 us  492750 us  3  324.665MB/s
BM_UniqueString10bytes/16777216/1024/min_time:1.000/real_time  505816 us  505739 us  3  316.321MB/s
BM_UniqueString10bytes/16777216/10240/min_time:1.000/real_time  930209 us  930075 us  2  172.004MB/s
BM_UniqueString10bytes/16777216/1048576/min_time:1.000/real_time  3684450 us  3683884 us  1  43.4258MB/s
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time  1679638 us  1679391 us  1  952.586MB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time  2064762 us  2064444 us  1  774.908MB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time  2806435 us  2806041 us  1  570.118MB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time  9844250 us  9841576 us  1  162.531MB/s
BM_UniqueUInt8NoNulls/16777216/200/min_time:1.000/real_time  27662 us  27658 us  51  578.409MB/s
BM_UniqueUInt8WithNulls/16777216/200/min_time:1.000/real_time  54364 us  54357 us  26  294.312MB/s
2. Hashing-benchmark
Run on (224 X 400 MHz CPU s)
CPU Caches:
  L1 Data 32K (x56)
  L1 Instruction 32K (x56)
  L2 Unified 256K (x56)
  L3 Unified 32768K (x2)
----------------------------------------------------------------------
Benchmark  Time  CPU  Iterations
----------------------------------------------------------------------
BM_HashIntegers/repeats:1  20 us  20 us  34253  7.1921GB/s
BM_HashSmallStrings/repeats:1  186 us  186 us  3768  1103.54MB/s
BM_HashMediumStrings/repeats:1  710 us  710 us  986  1.8067GB/s
BM_HashLargeStrings/repeats:1  709 us  709 us  1004  2.66291GB/s
Perf Report
1. Compute-benchmark
perf record -e cpu-clock -g -o compute-bench.data ./compute-benchmark
perf report -i compute-bench.data
main
- 99.57% _ZN9benchmark22RunSpecifiedBenchmarksEPNS_17BenchmarkReporterES1_
   + 30.77% _ZN5arrow7computeL23BM_UniqueString100bytesERN9benchmark5StateE
   + 23.67% _ZN5arrow7computeL22BM_UniqueString10bytesERN9benchmark5StateE
   + 15.57% _ZN5arrow7computeL21BM_UniqueInt64NoNullsERN9benchmark5StateE
   + 15.31% _ZN5arrow7computeL23BM_UniqueInt64WithNullsERN9benchmark5StateE
   + 5.63% _ZN5arrow7computeL23BM_UniqueUInt8WithNullsERN9benchmark5StateE
   ......
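The symbols in the perf report are Itanium-ABI mangled C++ names. As a reading aid, the following is a minimal sketch of how such a name can be demangled programmatically with `abi::__cxa_demangle` from `<cxxabi.h>` (available with GCC and Clang). The helper name `DemangleSketch` is hypothetical and is not part of Arrow or perf.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper (not part of Arrow or perf): returns the demangled
// form of an Itanium-ABI symbol, or the input unchanged if it cannot be
// demangled.
inline std::string DemangleSketch(const char* mangled) {
  int status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with
  // malloc, so it must be released with free.
  char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  std::string result = (status == 0 && out != nullptr) ? out : mangled;
  std::free(out);
  return result;
}
```

For example, passing `_ZN5arrow7computeL23BM_UniqueString100bytesERN9benchmark5StateE` should yield `arrow::compute::BM_UniqueString100bytes(benchmark::State&)`, which shows that the top entries are the benchmark functions themselves (the `L` marks a function with internal linkage).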
Hot methods filtration
- Arrow::BufferBuilder::Append
- Arrow::ComputeStringHash
- Arrow::BinaryMemoTable::Lookup
- Arrow::compute::RegularHashKernelImpl::Append
- Arrow::ArrayFromVector
- Arrow::HashParams::randint
The profiling results are shown in the following flame graph: compute-bench.svg
2. Hashing-benchmark
perf record -e cpu-clock -g -o hashing-bench.data ./hashing-benchmark
perf report -i hashing-bench.data
main
- _ZN9benchmark22RunSpecifiedBenchmarksEPNS_17BenchmarkReporterES1_
   26.65% _ZN5arrow8internalL15BM_HashIntegersERN9benchmark5StateE
   + 26.23% _ZN5arrow8internalL19BM_HashSmallStringsERN9benchmark5StateE
   + 23.63% _ZN5arrow8internalL19BM_HashLargeStringsERN9benchmark5StateE
   + 23.20% _ZN5arrow8internalL20BM_HashMediumStringsERN9benchmark5StateE
Hot methods filtration
- Arrow::ComputeHash
- Arrow::ComputeStringHash
The profiling results are shown in the following flame graph: hashing-bench.svg
Hot methods optimization
ComputeStringHash is one of the hot methods in Arrow. On x86, the hash utility uses SSE4 instructions to accelerate the CRC32-based hash computation.
Correspondingly, the Arm CRC32 extension instructions can be used to accelerate the hash computation on AArch64.
The patch "Leverage Armv8 crc32 extension instructions to accelerate the hash computation for Arm64" has been merged upstream.
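To illustrate the idea, here is a minimal sketch of a CRC32C routine that uses the Armv8 CRC32 intrinsics from `<arm_acle.h>` (`__crc32cd`, `__crc32cb`) when the compiler advertises them via `__ARM_FEATURE_CRC32`, and falls back to a portable bitwise loop otherwise. This is not the code from the merged patch: the names `Crc32cSketch`, `Crc32cByteSoftware`, and `SKETCH_HAVE_ARM_CRC32` are hypothetical, and the choice of the CRC32C (Castagnoli) variant is an assumption made because that is the polynomial the SSE4.2 `crc32` instruction implements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#define SKETCH_HAVE_ARM_CRC32 1
#else
#define SKETCH_HAVE_ARM_CRC32 0
#endif

// Portable fallback: update a CRC32C state with one byte, bit by bit,
// using the reflected Castagnoli polynomial 0x82F63B78.
inline std::uint32_t Crc32cByteSoftware(std::uint32_t crc, std::uint8_t byte) {
  crc ^= byte;
  for (int i = 0; i < 8; ++i) {
    crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
}

// Hypothetical helper (not Arrow's API): CRC32C over a byte buffer.
inline std::uint32_t Crc32cSketch(const void* data, std::size_t len) {
  assert(data != nullptr || len == 0);
  const std::uint8_t* p = static_cast<const std::uint8_t*>(data);
  std::uint32_t crc = 0xFFFFFFFFu;  // conventional initial value
#if SKETCH_HAVE_ARM_CRC32
  // Hardware path: consume 8 bytes per instruction, then the tail bytes.
  while (len >= 8) {
    std::uint64_t word;
    std::memcpy(&word, p, sizeof(word));  // avoids unaligned access issues
    crc = __crc32cd(crc, word);
    p += 8;
    len -= 8;
  }
  while (len > 0) {
    crc = __crc32cb(crc, *p++);
    --len;
  }
#else
  // Software path for targets without the CRC32 extension.
  while (len > 0) {
    crc = Crc32cByteSoftware(crc, *p++);
    --len;
  }
#endif
  return ~crc;  // conventional final inversion
}
```

With GCC or Clang, the hardware path is typically enabled by building with something like `-march=armv8-a+crc`. Both paths should return the standard CRC32C check value 0xE3069283 for the nine-byte input "123456789". The 8-bytes-per-instruction loop also suggests why long inputs would benefit most from the hardware path.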
The performance boost on Arm64
Hashing Benchmark
Before:
BM_HashSmallStrings/repeats:1  186 us  186 us  3768  1103.54MB/s
BM_HashMediumStrings/repeats:1  710 us  710 us  986  1.8067GB/s
BM_HashLargeStrings/repeats:1  709 us  709 us  1004  2.66291GB/s
After:
BM_HashSmallStrings/repeats:1  182 us  182 us  3854  1.21026GB/s
BM_HashMediumStrings/repeats:1  550 us  550 us  1273  2.35532GB/s
BM_HashLargeStrings/repeats:1  323 us  323 us  2145  5.93187GB/s
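SmallStrings hashing improves only slightly (186 us to 182 us per iteration), and MediumStrings hashing improves by about 30% (1.81 GB/s to 2.36 GB/s).
LargeStrings hashing, however, gets roughly a 2.2x speedup (709 us to 323 us, 2.66 GB/s to 5.93 GB/s).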
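The ratios can be recomputed directly from the Time column of the Before/After output. The sketch below shows the arithmetic; the function names `Speedup` and `TimeReductionPercent` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>

// Speedup factor: how many times faster the "after" run is than the
// "before" run, given the per-iteration times in microseconds.
inline double Speedup(double before_us, double after_us) {
  assert(before_us > 0 && after_us > 0);
  return before_us / after_us;
}

// Relative reduction in run time, as a percentage of the "before" time.
inline double TimeReductionPercent(double before_us, double after_us) {
  assert(before_us > 0);
  return 100.0 * (before_us - after_us) / before_us;
}
```

Using the times reported above, `Speedup(709, 323)` is about 2.2 for LargeStrings, `Speedup(710, 550)` is about 1.29 for MediumStrings, and `Speedup(186, 182)` is about 1.02 for SmallStrings.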
Compute Benchmark
Before:
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time  1679638 us  1679391 us  1  952.586MB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time  2064762 us  2064444 us  1  774.908MB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time  2806435 us  2806041 us  1  570.118MB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time  9844250 us  9841576 us  1  162.531MB/s
After:
BM_UniqueString100bytes/16777216/50/min_time:1.000/real_time  1411850 us  1411664 us  1  1.1067GB/s
BM_UniqueString100bytes/16777216/1024/min_time:1.000/real_time  1617926 us  1617703 us  1  988.92MB/s
BM_UniqueString100bytes/16777216/10240/min_time:1.000/real_time  2410494 us  2410174 us  1  663.764MB/s
BM_UniqueString100bytes/16777216/1048576/min_time:1.000/real_time  8404810 us  8403749 us  1  190.367MB/s
The performance of the String100bytes unique computation improves by roughly 20% (between about 16% and 28% across the four cases).
Future work
- Focus on further optimization of SmallStrings/MediumStrings hashing
- Try to optimize the other hot methods