Following Run Android using QEMU, we performed an analysis of android booting time running with QEMU (tcg acceleration).
Setup
Build QEMU
To allow profiling QEMU, we compile it with frame pointers:
pushd qemu export CFLAGS="-O2 -g -fno-omit-frame-pointer" ./configure ninja -C build popd qemu
Run QEMU
We use this wrapper script (https://linaro.atlassian.net/wiki/spaces/QEMU/pages/29464068097/Run+Android+using+QEMU#Run-Android-using-Cuttlefish-and-a-custom-build-of-QEMU ) to run Android Cuttlefish with a custom QEMU, and profiling using perf
:
$ cat ../qemu-system-aarch64 #!/usr/bin/env bash set -euo pipefail log() { echo "$@" 1>&2 } qemu=/home/user/.work/qemu/build/qemu-system-aarch64 sleep 1 log ------------------------------------------------------- $qemu --version log ------------------------------------------------------- log "Connect to qemu monitor using: socat -,echo=0,icanon=0 unix-connect:qemu-monitor-socket" log ------------------------------------------------------- args="$(echo $@)" args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/-cpu host/-cpu max/g') -cpu max -monitor unix:$HOME/qemu-monitor-socket,server,nowait" log "$qemu $(echo $args | tr ' ' '\n' | sed -e 's/$/ \\/')" log ------------------------------------------------------- perf record -F 300 --call-graph fp -o ~/perf.data $qemu $args log ------------------------------------------------------- $ cat ../qemu-system-x86_64 ... args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/,debug-threads=on//' -e 's/-cpu host/-cpu max/g') -monitor unix:$HOME/qemu-monitor-socket,server,nowait" ... # Run cuttlefish $ HOME=$(pwd) ./bin/launch_cvd -vm_manager qemu_cli -report_anonymous_usage_stats=n --start_webrtc=false -qemu_binary_dir=$(pwd)/..
Versions
QEMU: 9.1.0 (with this patch to fix performance issue we found: https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/- it will be merged in next stable and qemu 9.2.0)
Android aarch64 image: aosp_cf_arm64_only_phone-img-12425990.zip
Android x64 image: aosp_cf_x86_64_phone-img-12426306.zip
Hardware
CPU (x64): Intel Core i7-11850H (https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-11850H+%40+2.50GHz&id=4342 ), Cores: 8 Threads: 16
Analysis
boot time:
grep VIRTUAL_DEVICE_BOOT_STARTED kernel.log
(waiting for _COMPLETED event is unreliable, sometimes Bluetooth connection takes too long and boot fails).perf profile analysis: https://github.com/KDAB/hotspot (available in debian packages).
flamegraphs svg generation: https://github.com/brendangregg/FlameGraph
perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg
Results
You need to download interactive svg flamegraphs and open them with your browser, as Confluence does not allow to render them inline.
Performance regression
While performing this investigation, we discovered a recent regression, which resulted in bad performance when booting aarch64 android, making any execution with -smp > 1 slower than -smp 1. In more, an overhead was present when booting x64 android.
In short, QEMU must be built with -mcx16
to ensure we use cmpxchg16 on x64 hosts. Else, any atomic instruction will be serialized, blocking all other vcpus.
Commit introducing the regression: https://gitlab.com/qemu-project/qemu/-/commit/c2bf2ccb266dc9ae4a6da75b845f54535417e109
Series fixing the regression: https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/
This series will be merged in QEMU soon, and will be available in next-stable and QEMU 9.2.0. Meanwhile, ensure your QEMU is built with this patch.
Following results are presented with this fix.
qemu-system-x86_64 -accel kvm -smp 4 -cpu max
boot time: 30s (reference)
qemu-system-x86_64 -accel tcg - cpu max
-smp 1: 1036s (x34)
-smp 2: 410s (x13)
-smp 4: 280s (x9)
-smp 6: 260s (x8)
-smp 8: 260s (x8)
We can see that the speedup compared to -smp 2
is not linear. While booting, we can see that the QEMU process barely reaches 500% of cpu time in top. This is a limitation of Android boot sequence that does not seem able to use more than 4 cores.
qemu-system-aarch64 -accel tcg -cpu max -smp 2 (default to pauth=on)
boot time: 1280s (x42)
We can see that pointer authentication takes up to 40% of the execution time, so it’s better to disable it.
qemu-system-aarch64 -accel tcg -cpu max,pauth=off
-smp 1: 1034s (x34)
-smp 2: 512s (x17)
-smp 4: 380s (x12)
-smp 6: 360s (x12)
-smp 8: 375s (x12)
We can see that disabling pointer execution results in much faster execution, as expected.
Performance is close from what we observe when booting x64 version, with a small overhead for aarch64.
Perf report
x64 -smp 4 -cpu max
22.68% 10.55% qemu-system-x86 qemu-system-x86_64 [.] helper_lookup_tb_ptr 15.48% 0.00% qemu-system-x86 [unknown] [k] 0000000000000000 11.83% 1.33% qemu-system-x86 qemu-system-x86_64 [.] tb_htable_lookup 8.76% 0.39% qemu-system-x86 qemu-system-x86_64 [.] x86_cpu_tlb_fill 8.49% 0.00% qemu-system-x86 libc.so.6 [.] start_thread 8.49% 0.00% qemu-system-x86 qemu-system-x86_64 [.] qemu_thread_start 8.29% 0.02% qemu-system-x86 qemu-system-x86_64 [.] mttcg_cpu_thread_fn 8.04% 0.30% qemu-system-x86 qemu-system-x86_64 [.] mmu_lookup 7.86% 0.01% qemu-system-x86 qemu-system-x86_64 [.] tcg_cpu_exec 7.82% 0.05% qemu-system-x86 qemu-system-x86_64 [.] cpu_exec 7.72% 0.01% qemu-system-x86 qemu-system-x86_64 [.] cpu_exec_setjmp 7.66% 0.45% qemu-system-x86 qemu-system-x86_64 [.] cpu_exec_loop 7.52% 0.55% qemu-system-x86 qemu-system-x86_64 [.] mmu_lookup1 5.83% 0.08% qemu-system-x86 qemu-system-x86_64 [.] tb_gen_code 5.05% 0.00% qemu-system-x86 [unknown] [.] 0x0000000000000001 4.79% 0.00% qemu-system-x86 qemu-system-x86_64 [.] setjmp_gen_code.constprop.0 4.79% 0.12% qemu-system-x86 qemu-system-x86_64 [.] tlb_set_page_with_attrs 4.79% 2.05% qemu-system-x86 qemu-system-x86_64 [.] probe_access_internal 4.54% 4.22% qemu-system-x86 qemu-system-x86_64 [.] qht_lookup_custom 4.05% 0.50% qemu-system-x86 qemu-system-x86_64 [.] get_page_addr_code_hostp 4.01% 1.51% qemu-system-x86 qemu-system-x86_64 [.] tlb_set_page_full 3.95% 0.13% qemu-system-x86 qemu-system-x86_64 [.] do_ld8_mmu 3.91% 1.47% qemu-system-x86 qemu-system-x86_64 [.] mmu_translate 3.35% 0.76% qemu-system-x86 qemu-system-x86_64 [.] tcg_gen_code 3.09% 0.07% qemu-system-x86 qemu-system-x86_64 [.] helper_ldul_mmu 3.08% 0.37% qemu-system-x86 qemu-system-x86_64 [.] do_ld4_mmu 2.47% 0.50% qemu-system-x86 qemu-system-x86_64 [.] ptw_translate 1.94% 0.44% qemu-system-x86 qemu-system-x86_64 [.] probe_access_full 1.91% 0.01% qemu-system-x86 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 1.90% 0.01% qemu-system-x86 [kernel.kallsyms] [k] do_syscall_64
aarch64 -smp 4 -cpu max,pauth=off
90.71% 0.00% qemu-system-aar [unknown] [k] 0000000000000000 22.93% 8.97% qemu-system-aar qemu-system-aarch64 [.] helper_lookup_tb_ptr 12.87% 0.39% qemu-system-aar qemu-system-aarch64 [.] arm_cpu_tlb_fill 10.73% 0.55% qemu-system-aar qemu-system-aarch64 [.] mmu_lookup 10.63% 1.09% qemu-system-aar qemu-system-aarch64 [.] tb_htable_lookup 10.14% 0.82% qemu-system-aar qemu-system-aarch64 [.] mmu_lookup1 8.51% 0.32% qemu-system-aar qemu-system-aarch64 [.] get_phys_addr 7.87% 4.00% qemu-system-aar qemu-system-aarch64 [.] get_phys_addr_nogpc 7.20% 2.74% qemu-system-aar qemu-system-aarch64 [.] probe_access_internal 6.24% 0.00% qemu-system-aar libc.so.6 [.] start_thread 6.24% 0.00% qemu-system-aar qemu-system-aarch64 [.] qemu_thread_start 6.01% 0.01% qemu-system-aar qemu-system-aarch64 [.] mttcg_cpu_thread_fn 5.57% 0.20% qemu-system-aar qemu-system-aarch64 [.] do_ld8_mmu 5.36% 0.01% qemu-system-aar qemu-system-aarch64 [.] tcg_cpu_exec 5.24% 0.02% qemu-system-aar qemu-system-aarch64 [.] cpu_exec 5.20% 0.01% qemu-system-aar qemu-system-aarch64 [.] cpu_exec_setjmp 5.16% 0.25% qemu-system-aar qemu-system-aarch64 [.] cpu_exec_loop 4.40% 0.44% qemu-system-aar qemu-system-aarch64 [.] get_page_addr_code_hostp 4.23% 2.23% qemu-system-aar qemu-system-aarch64 [.] float32_minmax 3.68% 0.06% qemu-system-aar qemu-system-aarch64 [.] tb_gen_code 3.68% 3.51% qemu-system-aar qemu-system-aarch64 [.] qht_lookup_custom 3.67% 3.03% qemu-system-aar qemu-system-aarch64 [.] cpu_get_tb_cpu_state 3.45% 1.32% qemu-system-aar qemu-system-aarch64 [.] tlb_set_page_full 3.13% 0.26% qemu-system-aar qemu-system-aarch64 [.] helper_gvec_fmaxnum_s 3.01% 0.93% qemu-system-aar qemu-system-aarch64 [.] S1_ptw_translate 3.01% 0.00% qemu-system-aar qemu-system-aarch64 [.] setjmp_gen_code.constprop.0 2.98% 0.07% qemu-system-aar qemu-system-aarch64 [.] helper_ldul_mmu 2.89% 0.32% qemu-system-aar qemu-system-aarch64 [.] do_ld4_mmu 2.87% 0.23% qemu-system-aar qemu-system-aarch64 [.] helper_gvec_fminnum_s 2.73% 0.02% qemu-system-aar [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
Conclusion
We can boot android using QEMU with an overhead of x10 when using tcg compared to native execution.
We recommend to run android with:
4 cores
-cpu max (,pauth=off on aarch64)
ensure that cmpxchg is used on x64 (massive difference with smp > 1). This series https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/ will be merged in QEMU and will be available in qemu next stable and 9.2.0.