Android/QEMU boot time analysis
Following Run Android using QEMU, we analyzed the boot time of Android running under QEMU (TCG acceleration).
Setup
Build QEMU
To allow profiling QEMU, we compile it with frame pointers:
pushd qemu
export CFLAGS="-O2 -g -fno-omit-frame-pointer"
./configure
ninja -C build
popd
Run QEMU
We use this wrapper script (see Run Android using QEMU | Run Android using Cuttlefish and a custom build of QEMU) to run Android Cuttlefish with a custom QEMU and profile it using perf:
$ cat ../qemu-system-aarch64
#!/usr/bin/env bash
set -euo pipefail
log()
{
echo "$@" 1>&2
}
qemu=/home/user/.work/qemu/build/qemu-system-aarch64
sleep 1
log -------------------------------------------------------
$qemu --version
log -------------------------------------------------------
log "Connect to qemu monitor using: socat -,echo=0,icanon=0 unix-connect:qemu-monitor-socket"
log -------------------------------------------------------
args="$(echo $@)"
args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/-cpu host/-cpu max/g') -cpu max -monitor unix:$HOME/qemu-monitor-socket,server,nowait"
log "$qemu $(echo $args | tr ' ' '\n' | sed -e 's/$/ \\/')"
log -------------------------------------------------------
perf record -F 300 --call-graph fp -o ~/perf.data $qemu $args
log -------------------------------------------------------
$ cat ../qemu-system-x86_64
...
args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/,debug-threads=on//' -e 's/-cpu host/-cpu max/g') -monitor unix:$HOME/qemu-monitor-socket,server,nowait"
...
# Run cuttlefish
$ HOME=$(pwd) ./bin/launch_cvd -vm_manager qemu_cli -report_anonymous_usage_stats=n --start_webrtc=false -qemu_binary_dir=$(pwd)/..
Versions
QEMU: 9.1.0 (with this patch to fix a performance issue we found: https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/ ; it is already merged on master and will be available in the next stable release and in QEMU 9.2.0)
Android aarch64 image: aosp_cf_arm64_only_phone-img-12425990.zip
Android x64 image: aosp_cf_x86_64_phone-img-12426306.zip
Hardware
CPU (x64): Intel Core i7-11850H (https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-11850H+%40+2.50GHz&id=4342 ), Cores: 8 Threads: 16
Analysis
boot time:
grep VIRTUAL_DEVICE_BOOT_STARTED kernel.log
(waiting for the _COMPLETED event is unreliable: sometimes the Bluetooth connection takes too long and the boot fails).
perf profile analysis: GitHub - KDAB/hotspot: The Linux perf GUI for performance analysis (available in Debian packages).
flamegraphs svg generation: GitHub - brendangregg/FlameGraph: Stack trace visualizer
perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg
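Beyond the SVG, the folded stacks produced by stackcollapse-perf.pl (one "frame;frame;... count" line per unique stack) can also be summarized numerically. The sketch below is not part of the original workflow: stack_share is a hypothetical helper that prints the percentage of samples whose stack matches a pattern, which is one way to quantify how much time a family of functions takes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# stack_share PATTERN (hypothetical helper, not from the original page):
# reads folded stacks ("frame;frame;... count") on stdin and prints the
# percentage of samples whose stack matches PATTERN (an awk regex).
stack_share() {
    awk -v pat="$1" '
        {
            count = $NF                 # last field is the sample count
            total += count
            stack = $0
            sub(/ [0-9]+$/, "", stack)  # drop the trailing count
            if (stack ~ pat) hit += count
        }
        END { if (total > 0) printf "%.1f\n", 100 * hit / total }
    '
}
```

For example, `perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl | stack_share pauth` would report the share of samples with a frame matching "pauth". The pattern is an assumption and should be adjusted to the symbol names visible in the flamegraph.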
Results
Download the interactive SVG flamegraphs and open them in your browser, as Confluence cannot render them inline.
Performance regression
While performing this investigation, we discovered a recent regression that caused poor performance when booting aarch64 Android, making any execution with -smp > 1 slower than -smp 1. In addition, an overhead was present when booting x64 Android.
In short, QEMU must be built with -mcx16 to ensure cmpxchg16b is used on x64 hosts. Otherwise, any atomic instruction will be serialized, blocking all other vcpus.
Commit introducing the regression: configure: move -mcx16 flag out of CPU_CFLAGS (c2bf2ccb) · Commits · QEMU / QEMU · GitLab
Series fixing the regression: https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/
This series was merged, and will be available in next-stable and QEMU 9.2.0. Meanwhile, build QEMU from master branch.
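One possible way to check whether a given build tree was compiled with -mcx16 is to search the compile_commands.json that Meson writes into the build directory. This is only a sketch: has_mcx16 is a hypothetical helper, the default path is an assumption based on the build steps above, and it assumes the flag appears in the recorded compiler command lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# has_mcx16 FILE (hypothetical helper): succeeds if any compiler command
# recorded in the given compile_commands.json contains -mcx16.
has_mcx16() {
    local db="$1"
    grep -q -e '-mcx16' "$db"
}

# Assumed location of the build tree; pass another path as first argument.
db="${1:-qemu/build/compile_commands.json}"
if [ -f "$db" ]; then
    if has_mcx16 "$db"; then
        echo "-mcx16 found in $db"
    else
        echo "-mcx16 NOT found in $db: the build may be affected by the regression" >&2
    fi
fi
```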
Following results are presented with this fix.
qemu-system-x86_64 -accel kvm -smp 4 -cpu max
boot time: 30s (reference)
qemu-system-x86_64 -accel tcg -cpu max
-smp 1: 1036s (x34)
-smp 2: 410s (x13)
-smp 4: 280s (x9)
-smp 6: 260s (x8)
-smp 8: 260s (x8)
We can see that, beyond -smp 2, the speedup is not linear. While booting, the QEMU process barely reaches 500% of CPU time in top. This appears to be a limitation of the Android boot sequence, which does not seem able to use more than 4 cores.
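The factors above can be recomputed from the measured boot times. The following sketch (the scaling_table helper is hypothetical) prints, for each -smp value, the slowdown relative to the 30s KVM reference and the speedup relative to -smp 1. It rounds to one decimal, whereas the factors listed above appear to be truncated to integers.

```shell
#!/usr/bin/env bash
set -euo pipefail

kvm_reference=30   # seconds, qemu-system-x86_64 -accel kvm -smp 4

# scaling_table (hypothetical helper): reads "<smp> <boot_seconds>" rows and
# prints "<smp> <boot_seconds> <slowdown_vs_kvm> <speedup_vs_first_row>".
scaling_table() {
    awk -v ref="$kvm_reference" '
        NR == 1 { base = $2 }   # first row is taken as the -smp 1 baseline
        { printf "%d %d %.1f %.2f\n", $1, $2, $2 / ref, base / $2 }
    '
}

# x86_64 TCG boot times reported above
scaling_table <<'EOF'
1 1036
2 410
4 280
6 260
8 260
EOF
```

With these numbers, going from -smp 1 to -smp 4 gives a speedup of about 3.7, and adding more cores beyond that brings almost nothing.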
qemu-system-aarch64 -accel tcg -cpu max -smp 2 (defaults to pauth=on)
boot time: 1280s (x42)
We can see that pointer authentication takes up to 40% of the execution time, so it’s better to disable it.
qemu-system-aarch64 -accel tcg -cpu max,pauth=off
-smp 1: 1034s (x34)
-smp 2: 512s (x17)
-smp 4: 380s (x12)
-smp 6: 360s (x12)
-smp 8: 375s (x12)
We can see that disabling pointer authentication results in much faster execution, as expected.
Performance is close to what we observe when booting the x64 version, with a small overhead for aarch64.
Perf report
x64 -smp 4 -cpu max
aarch64 -smp 4 -cpu max,pauth=off
Conclusion
We can boot Android using QEMU with an overhead of roughly x10 when using TCG, compared to native execution.
We recommend running Android with:
4 cores
-cpu max (,pauth=off on aarch64)
QEMU built with -mcx16, so that cmpxchg16b is used on x64 hosts (massive difference with -smp > 1). This series https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/ was merged in QEMU and will be available in the next stable release and in 9.2.0. Meanwhile, use a QEMU compiled from the master branch.
Performance difference between aarch64 and x64 can be explained by TLB management on aarch64, and some helpers.
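The CPU recommendation above could be applied in the wrapper script by extending its sed rewriting. This is a minimal sketch, not the script actually used for the measurements: rewrite_args is a hypothetical helper, and the command line passed to it at the end is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rewrite_args ARCH ARGS... (hypothetical helper mirroring the wrapper's sed
# calls): switches KVM to TCG and replaces "-cpu host" with the recommended
# CPU model for the given architecture.
rewrite_args() {
    local arch="$1"; shift
    local cpu="max"
    if [ "$arch" = "aarch64" ]; then
        cpu="max,pauth=off"   # pointer authentication is costly under TCG
    fi
    printf '%s\n' "$*" | sed -e 's/accel=kvm/accel=tcg/g' -e "s/-cpu host/-cpu ${cpu}/g"
}

# Illustrative command line only, not what launch_cvd really passes.
rewrite_args aarch64 -machine virt,accel=kvm -cpu host -smp 4
# prints: -machine virt,accel=tcg -cpu max,pauth=off -smp 4
```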