14.2 Stack Allocation and Caching
14.1 located the execution stack as a contiguous range of memory [lo, hi): it is managed by the runtime and ultimately lives on the heap, so the moment a Goroutine starts, someone has to carve out this range of addresses for it. The questions follow immediately: who carves it? how large? and once Goroutines are created and destroyed over and over, if these two- or three-kilobyte chunks of memory had to be requested directly from the operating system or the global heap every time, the cost of locks and system calls would crush the scheduler at once.
The answer is that the stack has its own allocator. It is not the same code as the chapter 12 memory allocator, yet it follows almost the same blueprint: per-P lock-free caches up front, a lock-guarded global warehouse behind them, and the heap and operating system at the bottom. Readers who have gone through 12.2 will find that this layer is little more than transplanting the moves of mcache / mcentral / mheap onto the stack. This section first makes clear what this parallel structure looks like, then explains why it has to be kept separate from object allocation, and how it connects to garbage collection.
14.2.1 Why the Stack Needs Its Own Allocator
Handing the stack to the general-purpose object allocator looks at first like the easy path, but the stack has three properties that differ from ordinary objects, forcing the runtime to build a separate one.
First, sizes are highly regular. Go’s stack has a minimum of 2KB (stackMin = 2048), and each resize doubles by a power of two (14.3), so the vast majority of stacks come in only a few sizes: 2K / 4K / 8K / 16K. Regular sizes naturally suit a free list (12.2.1) bucketed by size, and even the lookup into a size-class table is unnecessary: a single shift computes which bucket to go to.
Second, the lifecycle and reclamation method are special. Unlike ordinary objects, the stack is not referenced through a pointer graph and kept alive by marking; its liveness is bound to the corresponding Goroutine. Stack memory is marked with the mSpanManual state (manually managed), takes no part in GC mark-and-sweep, and is instead returned explicitly by the runtime when the stack shrinks or the Goroutine dies (14.5). Only by keeping this kind of “manually managed” memory apart from “automatically reclaimed” object memory does the reclamation logic stay clean.
Third, allocation is extremely frequent and happens on the system stack. stackalloc must run on the scheduling stack (g0), because it absolutely cannot itself trigger stack growth, or it would deadlock (runtime issue 1547). This constraint demands that the allocation path be short, non-reentrant, and certainly not given to taking locks on the hot path.
Taken together, these three lead to one conclusion: give the stack a dedicated allocator with fixed sizes, bypassing GC, lock-free on the hot path. Its shape is isomorphic to the object allocator.
14.2.2 Three Layers: stackcache, stackpool, stackLarge
Small stacks (2K / 4K / 8K / 16K) travel a three-layer restocking chain, and each layer can be pointed out in the code below.
The first layer is the per-P stackcache, the lock-free fast path. It hangs right off the mcache, sharing the same per-P cache with object allocation (12.2.3):
| |
_NumStackOrders is 4 on 64-bit non-Windows platforms, corresponding to orders 0 through 3, that is, the four tiers 2K / 4K / 8K / 16K. Each tier has one free list, the block header is reused as a pointer to the next block, and both taking and returning are pointer operations. Because at any instant each P is held by only one M, accessing the stackcache requires no lock, and this is exactly where the vast majority of stack allocations finish.
The second layer is the global stackpool, the lock-guarded central warehouse. When some tier’s local cache runs empty, it wholesales from the stackpool:
| |
It corresponds to the mcentral in object allocation: globally shared, locked per bucket. Note the cache-line padding on the stackpool array, the same consideration as the padding of the central array in mheap, aligning the locks of different orders to different cache lines to avoid false sharing (12.2.5) when multiple cores operate on different tiers separately.
The third layer is mheap and the large-stack pool stackLarge, the heap at the bottom. When some tier of the stackpool runs empty, it requests a whole span of pages from mheap and cuts it into small stacks; large stacks bigger than 16K bypass the first two layers and go straight to stackLarge:
| |
Placing the three layers side by side with their counterparts in the object allocator makes the structural isomorphism obvious at a glance:
| Stack allocator | Object allocator (12.2) | Synchronization cost | Hit frequency |
|---|---|---|---|
stackcache (per P, hung off mcache) | mcache (per P) | lock-free | highest |
stackpool (global, locked per order bucket) | mcentral (global, locked per size-class bucket) | locked | medium |
mheap / stackLarge | mheap | lock plus possible system call | lowest |
flowchart TD
G["malg / copystack requests a stack"] --> SIZE{"stack size?"}
SIZE -->|"2K~16K small stack"| ORDER["pick a tier by order<br/>order = log2(n / fixedStack)"]
SIZE -->|"> 16K large stack"| LARGE["stackLarge<br/>bucketed by log2(npages)"]
ORDER --> CACHE["stackcache[order]<br/>per P, lock-free"]
CACHE -->|"local chain has a free block"| FAST["take the chain head and return (fastest)"]
CACHE -->|"local chain is empty"| POOL["stackcacherefill<br/>wholesale half a cache from stackpool"]
POOL --> SPOOL["stackpool[order]<br/>global, locked"]
SPOOL -->|"no free span in the bucket"| MHEAP["mheap.allocManual<br/>request a span by pages and cut small stacks"]
LARGE -->|"cache miss"| MHEAP
MHEAP -->|"heap is short"| OS["request from the operating system"]14.2.3 stackalloc: Choosing Among the Three Layers
What strings the three layers above together is stackalloc. The only thing it really has to answer is: from which layer should this allocation be drawn? A trimmed sketch follows; the full definition is in runtime/stack.go:
| |
Two trade-offs are worth pointing out. The first is the branch thisg.m.p == 0 || thisg.m.preemptoff != "": when the M temporarily has no P (in the middle of exiting a system call, or procresize adjusting the number of Ps), or is in the middle of preemption, the local cache is unavailable or untouchable (during GC it is concurrently emptied), so it falls back to locking the global pool directly to allocate. The lock-free fast path holds only in the normal state of “holding a P and not being preempted”, exactly the same availability condition as the mcache. The second is that the order is computed with a single loop of shifts, rather than looking up a size-class table the way object allocation does; stack sizes are powers of two to begin with, so this step can be cheaper.
14.2.4 Restocking: stackcacherefill and stackpoolalloc
When the local cache runs empty, it does not restock just one block. stackcacherefill wholesales half a cache’s worth of stack blocks from the global pool in one go, so as to avoid repeatedly locking and thrashing on the boundary of “just used up, just returned”:
| |
_StackCacheSize is 32KB, so half full is 16KB. Buying in bulk and leaving half a tank of margin is the general technique by which a cache layer reduces the number of interactions with the layer below; object allocation’s refill, and the scheduler stealing half the tasks from another P’s local queue (9.2), all follow this idea. Correspondingly, on return stackcacherelease only sends the excess back to the global pool once the local exceeds half full, and the two thresholds, one above and one below, keep the water level of the local cache stable around half full.
The end of the wholesale path is stackpoolalloc. It takes a block from the head of some order’s span list; when the list is empty, it requests a whole span of _StackCacheSize bytes from mheap, cuts it into equal-sized small stacks of fixedStack << order, strings them into a free list, and then hands them out:
| |
Note allocManual(..., spanAllocStack): stack spans are always requested with mheap.allocManual, with the state set to mSpanManual. This is the “manual management” mentioned in 14.2.1; such spans do not enter the GC’s object-scanning view (spanOfHeap returns nil for mSpanManual), and their reclamation is the responsibility of the stack-management code itself rather than of mark-and-sweep. The stack and objects share the same mheap and the same span machinery, yet on the matter of “who reclaims” they part ways entirely.
14.2.5 A Consistent Lineage of “Layering to Reduce Contention”
Looking back, this layer invented no new principle. It applies the restocking chain of 12.2, the per-P local queue of 9.2, and the per-P sharding of sync.Pool (11.6) to this particular kind of memory, the stack. The same move recurs throughout the Go runtime, and its core is always one sentence: make the hottest path a lock-free operation, and hold locks and system calls back in the increasingly cold rear.
Placed in the lineage, the archetype of this structure is still tcmalloc’s three levels of thread-cache / central-list / page-heap (12.1). The stack allocator is a specialization of it under the set of constraints “regular sizes, manual reclamation, must run on the system stack”: it drops the size-class table lookup (stack sizes are powers of two) and drops GC marking (the stack is manually managed), in exchange for a shorter hot path. It shares the foundation of mheap with the object allocator, while each grows on top of it a cache layer suited to its own needs.
The cost, as usual, is there too. One stackcache per P means stack memory cannot be reused directly between Ps; it must be relayed through the global stackpool. The _StackCacheSize/2 water line is an empirical threshold: set it high and memory is wasted, set it low and locking grows more frequent. Performance gains never come for free; they always come with a relocation of memory footprint and complexity. The next section (14.3) turns to what happens after this stretch of memory is used up: how the contiguous stack detects overflow, doubles in size, and then moves the whole stack to a new address (14.4).
Further Reading
- The Go Authors. runtime/stack.go. go1.26.4. (The authoritative definitions of
stackalloc,stackpool,stackLarge,stackcacherefill,stackpoolalloc) https://github.com/golang/go/blob/go1.26.4/src/runtime/stack.go - The Go Authors. runtime/mcache.go. go1.26.4. (The
stackcachefield, thestackfreelistdefinition) https://github.com/golang/go/blob/go1.26.4/src/runtime/mcache.go - The Go Authors. runtime/mheap.go. go1.26.4. (
allocManual,mSpanManual,spanAllocStack) https://github.com/golang/go/blob/go1.26.4/src/runtime/mheap.go - Sanjay Ghemawat, Paul Menage. TCMalloc: Thread-Caching Malloc. https://google.github.io/tcmalloc/design.html (The conceptual prototype of the three levels thread-cache / central-list / page-heap)
- This book, 12.2 Memory Allocation Components, 12.1 Memory Allocation Design Principles.
- This book, 9.2 Work-Stealing Scheduling, 11.6 The Cache Pool sync.Pool.
- This book, 14.1 The Design of Contiguous Stacks, 14.3 Stack Growth, 14.4 Stack Copying and Pointer Adjustment, 14.5 Stack Shrinking and Evolution.