Android/QEMU boot time analysis

Following , we analyzed Android boot time when running under QEMU with TCG acceleration.

Setup

Build QEMU

To allow profiling QEMU, we compile it with frame pointers:

    pushd qemu
    export CFLAGS="-O2 -g -fno-omit-frame-pointer"
    ./configure
    ninja -C build
    popd

Run QEMU

We use this wrapper script to run Android Cuttlefish with a custom QEMU and to profile it using perf:

    $ cat ../qemu-system-aarch64
    #!/usr/bin/env bash

    set -euo pipefail

    log()
    {
        echo "$@" 1>&2
    }

    qemu=/home/user/.work/qemu/build/qemu-system-aarch64
    sleep 1
    log -------------------------------------------------------
    $qemu --version
    log -------------------------------------------------------
    log "Connect to qemu monitor using: socat -,echo=0,icanon=0 unix-connect:qemu-monitor-socket"
    log -------------------------------------------------------
    args="$(echo $@)"
    args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/-cpu host/-cpu max/g') -cpu max -monitor unix:$HOME/qemu-monitor-socket,server,nowait"
    log "$qemu $(echo $args | tr ' ' '\n' | sed -e 's/$/ \\/')"
    log -------------------------------------------------------
    perf record -F 300 --call-graph fp -o ~/perf.data $qemu $args
    log -------------------------------------------------------

    $ cat ../qemu-system-x86_64
    ...
    args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/,debug-threads=on//' -e 's/-cpu host/-cpu max/g') -monitor unix:$HOME/qemu-monitor-socket,server,nowait"
    ...

    # Run cuttlefish
    $ HOME=$(pwd) ./bin/launch_cvd -vm_manager qemu_cli -report_anonymous_usage_stats=n --start_webrtc=false -qemu_binary_dir=$(pwd)/..

Versions

Hardware

CPU (x64): Intel Core i7-11850H (https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-11850H+%40+2.50GHz&id=4342), 8 cores, 16 threads

Analysis

  • boot time: grep VIRTUAL_DEVICE_BOOT_STARTED kernel.log (waiting for the _COMPLETED event is unreliable: the Bluetooth connection sometimes takes too long and the boot fails).

  • perf profile analysis: https://github.com/KDAB/hotspot (available as a Debian package).

  • flamegraphs svg generation: https://github.com/brendangregg/FlameGraph

    • perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg
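
The boot-time measurement from the first bullet can be scripted. The sketch below assumes that each line of kernel.log starts with a dmesg-style "[  seconds.micros]" timestamp, which is not stated above; boot_time_from_log is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the timestamp (in seconds) of the first
# VIRTUAL_DEVICE_BOOT_STARTED line found in a kernel log.
# Assumption: lines start with a dmesg-style "[  123.456789]" prefix.
boot_time_from_log() {
    local log_file=$1
    grep -m1 'VIRTUAL_DEVICE_BOOT_STARTED' "$log_file" |
        sed -n -E 's/^\[ *([0-9]+\.[0-9]+)\].*/\1/p'
}

# Usage: boot_time_from_log kernel.log
```

If the log uses a different prefix, only the sed expression needs to change.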

Results

The interactive SVG flamegraphs must be downloaded and opened in a browser, as Confluence cannot render them inline.

Performance regression

During this investigation, we discovered a recent regression that caused poor performance when booting aarch64 Android, making any run with -smp > 1 slower than -smp 1. In addition, an overhead was present when booting x64 Android.

In short, QEMU must be built with -mcx16 so that cmpxchg16b is used on x64 hosts. Otherwise, any atomic instruction is serialized, blocking all other vCPUs.
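
As a quick sanity check before building, one can verify that the host CPU advertises the cx16 flag and then pass -mcx16 at configure time. This is a minimal sketch: host_has_cx16 is a hypothetical helper, and passing the flag through --extra-cflags is one possible way to do it, not necessarily what the fix series does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given cpuinfo file lists the cx16 flag
# (cmpxchg16b support). Defaults to /proc/cpuinfo on a Linux x64 host.
host_has_cx16() {
    local cpuinfo=${1:-/proc/cpuinfo}
    grep -m1 '^flags' "$cpuinfo" | grep -qw 'cx16'
}

# Usage (commented out so the snippet has no side effects):
# if host_has_cx16; then
#     ./configure --extra-cflags="-mcx16"
#     ninja -C build
# fi
```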

Commit introducing the regression: https://gitlab.com/qemu-project/qemu/-/commit/c2bf2ccb266dc9ae4a6da75b845f54535417e109

Series fixing the regression: https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/

This series was merged and will be available in the next stable release and in QEMU 9.2.0. Meanwhile, build QEMU from the master branch.

The following results were obtained with this fix applied.

qemu-system-x86_64 -accel kvm -smp 4 -cpu max

boot time: 30s (reference)

qemu-system-x86_64 -accel tcg -cpu max

  • -smp 1: 1036s (x34) x64_perf.data.smp_1.run0.svg

  • -smp 2: 410s (x13) x64_perf.data.smp_2.run0.svg

  • -smp 4: 280s (x9) x64_perf.data.smp_4.run1.svg

  • -smp 6: 260s (x8) x64_perf.data.smp_6.run0.svg

  • -smp 8: 260s (x8) x64_perf.data.smp_8.run0.svg

Beyond -smp 2, the speedup is no longer linear. During boot, the QEMU process barely reaches 500% CPU usage in top. This appears to be a limitation of the Android boot sequence, which does not seem able to use more than 4 cores.
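
The factors in parentheses above are the boot time divided by the 30s KVM reference, truncated to an integer. The sketch below reproduces that arithmetic and also computes the speedup relative to the -smp 1 run, which makes the plateau visible. The helper names slowdown and speedup are hypothetical.

```shell
#!/usr/bin/env bash
# Slowdown versus the KVM reference, truncated like the figures above.
slowdown() {
    local ref=$1 t=$2
    echo "x$(( t / ref ))"
}

# Speedup versus a baseline run, with one decimal.
speedup() {
    local base=$1 t=$2
    awk -v b="$base" -v t="$t" 'BEGIN { printf "%.1f\n", b / t }'
}

# x64 TCG boot times in seconds, as smp:seconds pairs taken from the list above.
for entry in 1:1036 2:410 4:280 6:260 8:260; do
    smp=${entry%%:*}
    t=${entry##*:}
    echo "-smp $smp: $(slowdown 30 "$t") vs KVM, $(speedup 1036 "$t")x vs -smp 1"
done
```

With these numbers the speedup over -smp 1 is about 2.5 at -smp 2 and 3.7 at -smp 4, then stays at roughly 4.0 for -smp 6 and -smp 8.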

qemu-system-aarch64 -accel tcg -cpu max -smp 2 (defaults to pauth=on)

boot time: 1280s (x42) arm64_pauth_on_smp2.data.svg

We can see that pointer authentication takes up to 40% of the execution time, so it’s better to disable it.
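
One way to apply this with the wrapper script shown in the Setup section is to extend its sed rewrite so that the CPU model carries pauth=off. The sketch below is a hypothetical adaptation, not the script used for the measurements; rewrite_args is an illustrative name and the sample arguments are made up.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the wrapper's argument rewrite:
# switch KVM to TCG and use "-cpu max,pauth=off" instead of "-cpu host".
rewrite_args() {
    printf '%s\n' "$*" | sed -e 's/accel=kvm/accel=tcg/g' \
                             -e 's/-cpu host/-cpu max,pauth=off/g'
}

# Example with made-up arguments; prints the rewritten command line.
rewrite_args -machine virt,accel=kvm -cpu host -smp 2
```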

qemu-system-aarch64 -accel tcg -cpu max,pauth=off

  • -smp 1: 1034s (x34) arm64_perf.data.smp_1.run1.svg

  • -smp 2: 512s (x17) arm64_perf.data.smp_2.run0.svg

  • -smp 4: 380s (x12) arm64_perf.data.smp_4.run0.svg

  • -smp 6: 360s (x12) arm64_perf.data.smp_6.run0.svg

  • -smp 8: 375s (x12) arm64_perf.data.smp_8.run0.svg

We can see that disabling pointer authentication results in much faster execution, as expected.
Performance is close to what we observe when booting the x64 version, with a small overhead for aarch64.

Perf report

x64 -smp 4 -cpu max

aarch64 -smp 4 -cpu max,pauth=off

Conclusion

We can boot Android using QEMU with an overhead of about x10 under TCG compared to native execution with KVM (x9 on x64 and x12 on aarch64 with -smp 4).

We recommend running Android with:

  • 4 cores

  • -cpu max (,pauth=off on aarch64)

  • cmpxchg16b enabled on x64 hosts (massive difference with -smp > 1). The series https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/ was merged in QEMU and will be available in the next stable release and in 9.2.0. Meanwhile, use a QEMU built from the master branch.

  • The remaining performance difference between aarch64 and x64 can be explained by TLB management on aarch64 and by some helpers.