Analysis by Adhemerval Zanella, Maxim Kuvyrkov, & Ryan Arnold

Go Language

The Go programming language is "an open source programming language that makes it easy to build simple, reliable, and efficient software."

The container technology Docker is currently the most prominent open-source application written in Go, and further adoption of the language is expected.

There are two implementations of Go: the Google-led project 'golang' and a GNU implementation, part of GCC, called 'GCCGo'.

The golang implementation has recently been rewritten in Go, whereas the GCCGo implementation leverages the existing GCC compiler.

Stack Usage In The Go Language

If a stack starts with a conservatively small allocation, it is necessary to handle the case where the growing stack exceeds the memory that was allocated for it.

The Go language does not require the developer to think about or manage stack requirements or stack growth in multi-threaded programs. If the compiler does not manage stack growth automatically, the implementation must take the naive approach of pre-allocating a sufficiently large stack for each thread. Ultimately this severely limits the number of threads that can be executed in a 32-bit address space.

The ideal situation is to grow the stack for each thread as needed, thereby consuming the minimum amount of memory. This allows the initial stack allocation to be small enough to support a very large number of small threads. In practice there are a few ways to do this:

Use contiguous stacks and copy the smaller stack into a new, larger stack, adjusting pointers as necessary (optimal approach).
Use stack fragmentation (split-stacks) whereby as the stack requirements grow, a new chunk of stack is allocated and new frames are put on the new stack fragment. These split stacks are chained together to give the functionality of a single stack.

Without support for an auto-managed stack, full use of the Go Language routines is not supported.

Golang Contiguous Stack Support

When the golang compiler was rewritten in native Go, it gained the ability to garbage-collect and reference-count pointers. It therefore knows which entries on the stack are pointers and which are simply data. This knowledge allows golang to use contiguous stacks (in version 1.4): as stack requirements grow, a new, larger stack is allocated, the entries from the previous stack are copied to it, and all pointer references to stack slots are adjusted to the new locations.

Ideally any compiler implementation would be able to use the contiguous-stack method. Unfortunately this is a hard problem for compilers written in non-reference-counting languages (like C), as it is hard to know what is data and what is a pointer on the stack. It is not impossible, but it is not a solved problem in GCC.

GCC Go Split-Stack (stack fragmentation) Support

GCC is implemented in the C programming language and lacks garbage collection and reference counting, so implementing contiguous stacks is an unsolved problem: it is currently not possible to know what is data and what is a pointer on the stack. As a result, GCC must implement split-stack support in order to enable auto-managed stacks.

Split-Stack Support is defined by GCC as the following:

The goal of split stacks is to permit a discontiguous stack which is grown automatically as needed. This means that you can run multiple threads, each starting with a small stack, and have the stack grow and shrink as required by the program. It is then no longer necessary to think about stack requirements when writing a multi-threaded program. The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size. It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space.

As stated by Ian Lance Taylor in a GCC mailing list thread, split-stack support is currently used in GCC only for GCCGo. The ideal would be contiguous-stack support, which is what the golang compiler now uses, but implementing this scheme in GCCGo would be problematic (the compiler needs to keep track of stack usage at various points). There are no known ongoing or planned projects to add stack copying to GCC for any architecture, so the best way forward for the AArch64 GCC Go port is to implement split-stack support as it is implemented on x86, x86_64, MIPS, and PowerPC.

Fully implementing split-stack support on AArch64 requires patches to three projects: the GLIBC runtime, to extend the TCB with the split-stack field; GCC, to change function prologue generation and the stack handling functions in libgcc; and finally the gold linker from binutils. A high-level description of the implementation steps can be found on the GCC wiki.

The split-stack support involves minor additions to the ABI, which need to be coordinated with ABI stakeholders.

Based on split-stack implementations for other architectures, the whole project is expected to take several months:

~1 month for GCC code-gen changes
~1 month for libgcc routine changes
~1 month for gold changes
~1 week for glibc changes
~2 months for upstream review

However, for someone familiar with the GCC split-stack implementation it can take ~1.5 months for implementation and ~1.5 months for upstream review. E.g., the s390x split-stack port was done via a bounty on bountysource: https://www.bountysource.com/issues/28094543-s390x-linux-split-stacks-support .

GLIBC

GLIBC support is the most straightforward part. However, unlike the approach taken for previous architectures, the current approach requires adding a new symbol when the TCB structure is extended. This allows the compiler to bind the use of TCB fields to a versioned symbol from libc, which prevents programs that require a new TCB field from running on an older glibc version (which could lead to memory corruption).

This powerpc TCB extension shows what is required for the enablement:

Add the __private_ss field to struct tcbhead_t in sysdeps/powerpc/nptl/tls.h
Add a new version and symbol in sysdeps/aarch64/Versions
Add tests (if required)

The GCC split-stack wiki presents 6 different strategies for holding the per-thread data required by split-stack support:

Reserve a register;
Use a TLS variable;
Have the stack always end at an N-bit boundary;
Introduce a new function call which handles the comparison of the stack pointer and the stack expansion;
Reuse the stack protector support field;
Arrange to allocate a new field in the TCB header.

Each strategy has its pros and cons, and the currently implemented split-stack ports (x86, powerpc, and s390) selected the TCB field. For AArch64 the most efficient way would also be a new field in the TCB, since exporting a TLS variable from glibc (even with the initial-exec access model) would incur the creation of, and an access through, a GOT entry.

GCC

The GCC work will require code generation support in the function prologue and epilogue. Based on the powerpc split-stack patch submission:

Implement TARGET_INTERNAL_ARG_POINTER and TARGET_SUPPORTS_SPLIT_STACK;
Extend md file if necessary;
Implement the morestack.S routine to actually allocate the new stack scheme;
Add new tests if required.

Binutils

Split-stack support in binutils is provided only by the gold linker. Based on the powerpc patch submission, the linker will need to create stubs that correct the stack frame for calls from split-stack code to external modules built without split-stack support.
