Skip to end of banner
Go to start of banner

Android/QEMU boot time analysis

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 35 Current »

Following Run Android using QEMU, we performed an analysis of android booting time running with QEMU (tcg acceleration).

Setup

Build QEMU

To allow profiling QEMU, we compile it with frame pointers:

pushd qemu
export CFLAGS="-O2 -g -fno-omit-frame-pointer"
./configure
ninja -C build
popd qemu

Run QEMU

We use this wrapper script (https://linaro.atlassian.net/wiki/spaces/QEMU/pages/29464068097/Run+Android+using+QEMU#Run-Android-using-Cuttlefish-and-a-custom-build-of-QEMU ) to run Android Cuttlefish with a custom QEMU, and profiling using perf:

$ cat ../qemu-system-aarch64
#!/usr/bin/env bash

set -euo pipefail

log()
{
    echo "$@" 1>&2
}

qemu=/home/user/.work/qemu/build/qemu-system-aarch64
sleep 1
log -------------------------------------------------------
$qemu --version
log -------------------------------------------------------
log "Connect to qemu monitor using: socat -,echo=0,icanon=0 unix-connect:qemu-monitor-socket"
log -------------------------------------------------------
args="$(echo $@)"
args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/-cpu host/-cpu max/g') -cpu max -monitor unix:$HOME/qemu-monitor-socket,server,nowait"
log "$qemu $(echo $args | tr ' ' '\n' | sed -e 's/$/ \\/')"
log -------------------------------------------------------
perf record -F 300 --call-graph fp -o ~/perf.data $qemu $args
log -------------------------------------------------------

$ cat ../qemu-system-x86_64
...
args="$(echo "$args" | sed -e 's/accel=kvm/accel=tcg/g' -e 's/,debug-threads=on//' -e 's/-cpu host/-cpu max/g') -monitor unix:$HOME/qemu-monitor-socket,server,nowait"
...

# Run cuttlefish
$ HOME=$(pwd) ./bin/launch_cvd -vm_manager qemu_cli -report_anonymous_usage_stats=n --start_webrtc=false -qemu_binary_dir=$(pwd)/..

Versions

Hardware

CPU (x64): Intel Core i7-11850H (https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-11850H+%40+2.50GHz&id=4342 ), Cores: 8 Threads: 16

Analysis

  • boot time: grep VIRTUAL_DEVICE_BOOT_STARTED kernel.log (waiting for _COMPLETED event is unreliable, sometimes Bluetooth connection takes too long and boot fails).

  • perf profile analysis: https://github.com/KDAB/hotspot (available in debian packages).

  • flamegraphs svg generation: https://github.com/brendangregg/FlameGraph

    • perf script -i perf.data  | ./FlameGraph/stackcollapse-perf.pl  | ./FlameGraph/flamegraph.pl > out.svg 

Results

You need to download interactive svg flamegraphs and open them with your browser, as Confluence does not allow to render them inline.

Performance regression

While performing this investigation, we discovered a recent regression, which resulted in bad performance when booting aarch64 android, making any execution with -smp > 1 slower than -smp 1. In more, an overhead was present when booting x64 android.

In short, QEMU must be built with -mcx16 to ensure we use cmpxchg16 on x64 hosts. Else, any atomic instruction will be serialized, blocking all other vcpus.

Commit introducing the regression: https://gitlab.com/qemu-project/qemu/-/commit/c2bf2ccb266dc9ae4a6da75b845f54535417e109

Series fixing the regression: https://lore.kernel.org/qemu-devel/20241007172317.1439564-2-pbonzini@redhat.com/

This series will be merged in QEMU soon, and will be available in next-stable and QEMU 9.2.0. Meanwhile, ensure your QEMU is built with this patch.

Following results are presented with this fix.

qemu-system-x86_64 -accel kvm -smp 4 -cpu max

boot time: 30s (reference)

qemu-system-x86_64 -accel tcg - cpu max

  • -smp 1: 1036s (x34) x64_perf.data.smp_1.run0.svg

  • -smp 2: 410s (x13) x64_perf.data.smp_2.run0.svg

  • -smp 4: 280s (x9) x64_perf.data.smp_4.run1.svg

  • -smp 6: 260s (x8) x64_perf.data.smp_6.run0.svg

  • -smp 8: 260s (x8) x64_perf.data.smp_8.run0.svg

We can see that the speedup compared to -smp 2 is not linear. While booting, we can see that the QEMU process barely reaches 500% of cpu time in top. This is a limitation of Android boot sequence that does not seem able to use more than 4 cores.

qemu-system-aarch64 -accel tcg -cpu max -smp 2 (default to pauth=on)

boot time: 1280s (x42) arm64_pauth_on_smp2.data.svg

We can see that pointer authentication takes up to 40% of the execution time, so it’s better to disable it.

qemu-system-aarch64 -accel tcg -cpu max,pauth=off

  • -smp 1: 1034s (x34) arm64_perf.data.smp_1.run1.svg

  • -smp 2: 512s (x17) arm64_perf.data.smp_2.run0.svg

  • -smp 4: 380s (x12) arm64_perf.data.smp_4.run0.svg

  • -smp 6: 360s (x12) arm64_perf.data.smp_6.run0.svg

  • -smp 8: 375s (x12) arm64_perf.data.smp_8.run0.svg

We can see that disabling pointer execution results in much faster execution, as expected.
Performance is close from what we observe when booting x64 version, with a small overhead for aarch64.

Perf report

x64 -smp 4 -cpu max

    22.68%    10.55%  qemu-system-x86  qemu-system-x86_64       [.] helper_lookup_tb_ptr
    15.48%     0.00%  qemu-system-x86  [unknown]                [k] 0000000000000000
    11.83%     1.33%  qemu-system-x86  qemu-system-x86_64       [.] tb_htable_lookup
     8.76%     0.39%  qemu-system-x86  qemu-system-x86_64       [.] x86_cpu_tlb_fill
     8.49%     0.00%  qemu-system-x86  libc.so.6                [.] start_thread
     8.49%     0.00%  qemu-system-x86  qemu-system-x86_64       [.] qemu_thread_start
     8.29%     0.02%  qemu-system-x86  qemu-system-x86_64       [.] mttcg_cpu_thread_fn
     8.04%     0.30%  qemu-system-x86  qemu-system-x86_64       [.] mmu_lookup
     7.86%     0.01%  qemu-system-x86  qemu-system-x86_64       [.] tcg_cpu_exec
     7.82%     0.05%  qemu-system-x86  qemu-system-x86_64       [.] cpu_exec
     7.72%     0.01%  qemu-system-x86  qemu-system-x86_64       [.] cpu_exec_setjmp
     7.66%     0.45%  qemu-system-x86  qemu-system-x86_64       [.] cpu_exec_loop
     7.52%     0.55%  qemu-system-x86  qemu-system-x86_64       [.] mmu_lookup1
     5.83%     0.08%  qemu-system-x86  qemu-system-x86_64       [.] tb_gen_code
     5.05%     0.00%  qemu-system-x86  [unknown]                [.] 0x0000000000000001
     4.79%     0.00%  qemu-system-x86  qemu-system-x86_64       [.] setjmp_gen_code.constprop.0
     4.79%     0.12%  qemu-system-x86  qemu-system-x86_64       [.] tlb_set_page_with_attrs
     4.79%     2.05%  qemu-system-x86  qemu-system-x86_64       [.] probe_access_internal
     4.54%     4.22%  qemu-system-x86  qemu-system-x86_64       [.] qht_lookup_custom
     4.05%     0.50%  qemu-system-x86  qemu-system-x86_64       [.] get_page_addr_code_hostp
     4.01%     1.51%  qemu-system-x86  qemu-system-x86_64       [.] tlb_set_page_full
     3.95%     0.13%  qemu-system-x86  qemu-system-x86_64       [.] do_ld8_mmu
     3.91%     1.47%  qemu-system-x86  qemu-system-x86_64       [.] mmu_translate
     3.35%     0.76%  qemu-system-x86  qemu-system-x86_64       [.] tcg_gen_code
     3.09%     0.07%  qemu-system-x86  qemu-system-x86_64       [.] helper_ldul_mmu
     3.08%     0.37%  qemu-system-x86  qemu-system-x86_64       [.] do_ld4_mmu
     2.47%     0.50%  qemu-system-x86  qemu-system-x86_64       [.] ptw_translate
     1.94%     0.44%  qemu-system-x86  qemu-system-x86_64       [.] probe_access_full
     1.91%     0.01%  qemu-system-x86  [kernel.kallsyms]        [k] entry_SYSCALL_64_after_hwframe
     1.90%     0.01%  qemu-system-x86  [kernel.kallsyms]        [k] do_syscall_64

aarch64 -smp 4 -cpu max,pauth=off

    90.71%     0.00%  qemu-system-aar  [unknown]                [k] 0000000000000000
    22.93%     8.97%  qemu-system-aar  qemu-system-aarch64      [.] helper_lookup_tb_ptr
    12.87%     0.39%  qemu-system-aar  qemu-system-aarch64      [.] arm_cpu_tlb_fill
    10.73%     0.55%  qemu-system-aar  qemu-system-aarch64      [.] mmu_lookup
    10.63%     1.09%  qemu-system-aar  qemu-system-aarch64      [.] tb_htable_lookup
    10.14%     0.82%  qemu-system-aar  qemu-system-aarch64      [.] mmu_lookup1
     8.51%     0.32%  qemu-system-aar  qemu-system-aarch64      [.] get_phys_addr
     7.87%     4.00%  qemu-system-aar  qemu-system-aarch64      [.] get_phys_addr_nogpc
     7.20%     2.74%  qemu-system-aar  qemu-system-aarch64      [.] probe_access_internal
     6.24%     0.00%  qemu-system-aar  libc.so.6                [.] start_thread
     6.24%     0.00%  qemu-system-aar  qemu-system-aarch64      [.] qemu_thread_start
     6.01%     0.01%  qemu-system-aar  qemu-system-aarch64      [.] mttcg_cpu_thread_fn
     5.57%     0.20%  qemu-system-aar  qemu-system-aarch64      [.] do_ld8_mmu
     5.36%     0.01%  qemu-system-aar  qemu-system-aarch64      [.] tcg_cpu_exec
     5.24%     0.02%  qemu-system-aar  qemu-system-aarch64      [.] cpu_exec
     5.20%     0.01%  qemu-system-aar  qemu-system-aarch64      [.] cpu_exec_setjmp
     5.16%     0.25%  qemu-system-aar  qemu-system-aarch64      [.] cpu_exec_loop
     4.40%     0.44%  qemu-system-aar  qemu-system-aarch64      [.] get_page_addr_code_hostp
     4.23%     2.23%  qemu-system-aar  qemu-system-aarch64      [.] float32_minmax
     3.68%     0.06%  qemu-system-aar  qemu-system-aarch64      [.] tb_gen_code
     3.68%     3.51%  qemu-system-aar  qemu-system-aarch64      [.] qht_lookup_custom
     3.67%     3.03%  qemu-system-aar  qemu-system-aarch64      [.] cpu_get_tb_cpu_state
     3.45%     1.32%  qemu-system-aar  qemu-system-aarch64      [.] tlb_set_page_full
     3.13%     0.26%  qemu-system-aar  qemu-system-aarch64      [.] helper_gvec_fmaxnum_s
     3.01%     0.93%  qemu-system-aar  qemu-system-aarch64      [.] S1_ptw_translate
     3.01%     0.00%  qemu-system-aar  qemu-system-aarch64      [.] setjmp_gen_code.constprop.0
     2.98%     0.07%  qemu-system-aar  qemu-system-aarch64      [.] helper_ldul_mmu
     2.89%     0.32%  qemu-system-aar  qemu-system-aarch64      [.] do_ld4_mmu
     2.87%     0.23%  qemu-system-aar  qemu-system-aarch64      [.] helper_gvec_fminnum_s
     2.73%     0.02%  qemu-system-aar  [kernel.kallsyms]        [k] entry_SYSCALL_64_after_hwframe

Conclusion

We can boot android using QEMU with an overhead of x10 when using tcg compared to native execution.

We recommend to run android with:

  • No labels