12.2 Components
12.1 described the allocator as a layered structure that is “lock-free on the fast path, locked on the slow path.” This section names and locates the few core components of that structure, and grounds them in their actual form in go1.26: what state each component carries, why it is designed the way it is, and how they chain together into a restocking pipeline. Once you have understood these few things, the later allocation paths (12.4–12.6) are just “a walk across this same picture.”
To avoid falling back into translating the source field by field, the structs given below are all trimmed sketches: they keep only the fields relevant to the design, with comments explaining why each one exists. For the full definitions you can refer to runtime/mheap.go, mcache.go, and mcentral.go.
12.2.1 The free list: the underlying trick behind everything
Before getting to know the individual components, let us first meet a low-level trick they all share, the free list. The runtime manages a great many “fixed-size objects” (mcache, mspan, various kinds of metadata) through a fixed-size allocator called fixalloc, whose idea is extremely simple: string a batch of unallocated memory blocks into a chain using pointers, where the head of each block of memory serves precisely as the pointer to the next block. Allocation is taking off the head of the chain, and reclamation is inserting the block back at the head; both are pointer operations:
| |
“Storing the pointer to the next block in the head of the free block itself” is the essence of the free list. It needs no extra metadata array to record where the free slots are, and its space overhead is zero. This trick will recur repeatedly below: mspan uses it internally to string together free object slots, and mcache uses it to cache stacks. Once you understand it, many corners of the allocator fall into place.
12.2.2 mspan: the basic unit of allocation
mspan is the basic unit of the allocator, a run of contiguous memory pages cut into a number of equal-sized slots of the same size class (12.1). It is at once the “cargo” that flows between mcache and mcentral, and the unit at which garbage collection (13) scans and sweeps. The trimmed sketch:
| |
A few fields are worth pointing out. freeindex plus allocCache is the key to “quickly finding the next empty slot within a span”: allocCache caches the free bitmap near freeindex into a single uint64, so finding an empty slot degenerates into a single bit scan (find the lowest set bit), with no traversal required. The pair of bitmaps allocBits and gcmarkBits is the interface where the allocator and the GC are symbiotic (12.1): allocation looks at the former, and during sweep the GC overwrites the former with the latter, so that the slots of dead objects become allocatable again “without sweeping” (13.5). We can see that an mspan is not just “a run of memory”; it encodes both allocation state and reclamation state in the same piece of metadata.
12.2.3 mcache: the lock-free fast path, one per P
mcache is the local cache held one per P (9.3), the foundation of the allocator’s high performance. Because each P is held by only one M at any instant, accessing the mcache requires no lock, and this is precisely the fast path on which the vast majority of allocations finish and stop.
| |
The alloc array is the core: each size class (distinguishing the two spanClasses, one with pointers and one without) caches one mspan, and allocating a small object is just “take out the corresponding mspan by size class, and pick an empty slot from it.” tiny/tinyoffset serve tiny-object merging (12.6), while stackcache lets stack allocation (14.6) reuse the same per-P caching idea. The mcache itself is allocated from non-GC memory (via fixalloc) and lives permanently in the runtime: a P obtains its own mcache when it is created in procresize, and returns it when destroyed, so the mcache’s lifetime is bound to the P, and an M holds it together with the P.
12.2.4 mcentral: the central warehouse shared per size class
When the span of some size class in a P’s mcache is exhausted, the P must trade with mcentral for a span that has empty slots. There is one mcentral per size class, shared by all Ps, so accessing it requires a lock. Its structure, since Go 1.9, looks like this:
| |
Why are partial and full each two sets ([2]spanSet)? This is a clever trick coordinated with the sweep (13.5): the two sets correspond respectively to “already swept this GC round” and “not yet swept,” distinguished by sweepgen (the sweep generation). When taking a span, it preferentially draws from the “already swept and has empty slots” set; if it gets an unswept one, it sweeps it on the spot before using it. This “bucketing by sweep generation” replaces the early design of a single linked list under one lock, easing the lock contention between sweeping and allocation, another example of “restructuring a data structure for concurrency” (compare the evolution in 11.7). The spanSet itself is a chunked, lock-free set optimized for concurrency, which further lowers the cost of multiple Ps accessing mcentral at the same time.
12.2.5 mheap and arena: the global heap
At the end of the restocking pipeline is mheap, the single global heap that manages all pages. When mcentral lacks a span it asks mheap; large objects (12.4) request pages directly from mheap. Trimmed, we look at only a few things:
| |
Beneath mheap lie two layers of infrastructure. The page allocator pages (12.7) answers “which run of contiguous pages is free”; the arena is the organizational unit of the address space (12.3). The heap requests memory from the OS at the granularity of 64MB arenas, and each arena is equipped with a copy of metadata (pointer bitmap, span index), so that the runtime can look up, from any heap address, “which span it belongs to, whether it is a pointer, and whether it is live.” Note the cache-line padding on the central array: aligning the mcentrals of different size classes to different cache lines avoids false sharing when multiple cores operate on different size classes at the same time. This fussiness over cache lines is a common technique in highly concurrent runtime code.
12.2.6 How a single allocation passes through this hierarchy
Stringing the four together, the path of a single small-object allocation (12.5) is exactly the performance of that restocking pipeline from 12.1:
flowchart TD
G["goroutine requests memory"] --> SIZE{object size?}
SIZE -->|< 16B, no pointers| TINY["tiny object: merge into mcache's tiny block"]
SIZE -->|16B ~ 32KB| SMALL["small object: take from mcache by size class"]
SIZE -->|> 32KB| LARGE["large object: request pages directly from mheap"]
TINY --> MC["mcache (one per P, lock-free)"]
SMALL --> MC
MC -->|local span has empty slots| FAST["nextFreeFast: bit scan to take a slot, return (fastest)"]
MC -->|local span exhausted| MCENT["mcentral (one per size class, locked) trade for a span"]
MCENT -->|no span available| MHEAP["mheap (global, page allocator)"]
LARGE --> MHEAP
MHEAP -->|insufficient| OS["request from the OS (arena)"]The fastest step, nextFreeFast, is just a single bit scan over the current mspan’s allocCache:
| |
Only when this step returns 0 (the local span is exhausted) do we fall into the locked slow path: trade with the corresponding mcentral via cacheSpan for a span that has empty slots (sweeping it along the way); if mcentral has none either, ask mheap for new pages to carve a span; and if mheap is insufficient, wholesale an arena from the OS. The further down we go, the greater the synchronization cost and the lower the hit frequency. This is the entire point of layered caching: turn the hottest path into a few lock-free bit operations, and keep the expensive locks and system calls behind ever colder layers.
12.2.7 Design trade-offs, evolution, and lineage
This structure realizes the design principles of 12.1 as concrete parts, and each part corresponds to one trade-off:
- Lock-free per-P cache (mcache) eliminates fast-path contention, at the cost that each P occupies its own copy of cache memory, and objects cannot be reused directly between Ps (they must be relayed through mcentral). This is the same “layered contention reduction” move as the scheduler’s local run queue (9.2) and
sync.Pool’s per-P sharding (11.6); you will encounter it again and again in the Go runtime. - Organization by size class (mspan) lets allocation degenerate into the bit operation of “take one equal-sized slot,” at the cost of internal fragmentation from size mismatch (about 12.5% in the worst case).
- mcentral bucketed by sweep generation (
[2]spanSet) is an evolution made to lower the lock contention between allocation and sweeping. Before Go 1.9, mcentral was a single linked list under one lock, which became a bottleneck under high concurrency; restructuring it into spanSet relieved this markedly.
Placed in lineage, this hierarchy directly inherits from Google’s tcmalloc (12.1): thread cache (corresponding to mcache), central free list (corresponding to mcentral), and page heap (corresponding to mheap). jemalloc’s arena + tcache is an isomorphic idea. What Go grows on top of this, which tcmalloc does not have, is the layer of metadata serving precise garbage collection: mspan’s gcmarkBits, and the arena’s pointer bitmap. In other words, Go’s allocator is “the skeleton of tcmalloc plus the flesh born for GC symbiosis,” and this main thread will merge with it in 13 Garbage Collection.
Further reading
- Sanjay Ghemawat, Paul Menage. TCMalloc: Thread-Caching Malloc. https://google.github.io/tcmalloc/design.html (the conceptual prototype of mcache/mcentral/mheap)
- Jason Evans. A Scalable Concurrent malloc(3) Implementation for FreeBSD (jemalloc). 2006. (the isomorphic design of arena + tcache)
- The Go Authors. runtime/mcache.go, mcentral.go, mheap.go, mfixalloc.go. https://github.com/golang/go/tree/master/src/runtime
- Go 1.9 mcentral restructuring (spanSet), related discussion and commits. https://go-review.googlesource.com/c/go/+/38150
- This book’s 12.1 Design Principles, 12.5 Small-Object Allocation, 13.5 Sweeping and Bitmaps.