11.2 Mutex
sync.Mutex is the most basic synchronization primitive: it lets only one goroutine enter the
critical section at a time. Beneath that plain interface, it keeps weighing two goals that pull
against each other, throughput (hand the lock to someone as quickly as possible, do not let the
CPU sit idle) and fairness (do not let some unlucky waiter wait in line forever). The two cannot
both be maxed out: handing the lock off in strict arrival order is the fairest, but pays a context
switch on every handoff; letting a just-woken waiter compete freely with whoever is running is the
fastest, but can leave some waiter losing out again and again. This section first lays out the theory
and the hardware foundations of the old problem of mutual exclusion, then looks at how Go’s mutex
walks this tightrope.
11.2.1 The Mutual Exclusion Problem and Its Hardware Foundation
Mutual exclusion is one of the earliest topics in concurrency theory. Dijkstra formalized the “mutual exclusion problem” in 1965 and gave the first software solution that lets several processes take turns entering the critical section using only ordinary reads and writes, with no special hardware instructions. Later, Lamport’s 1974 bakery algorithm carried this road to a beautiful conclusion: each thread that wants to enter the critical section first “takes a number,” then enters in ascending order of numbers. This guarantees both mutual exclusion and first-come-first-served fair queuing, all using only ordinary reads and writes, with no atomic read-modify-write instructions needed.
A pure software solution is self-sufficient in theory but expensive in practice: the bakery algorithm has to scan the numbers of all threads, so its cost grows with the thread count, and it relies heavily on a sequentially consistent memory model (11.9), needing extra barriers on modern weakly-ordered hardware. Modern locks therefore no longer take the pure software road. Instead they stand directly on the atomic read-modify-write instructions the hardware provides: compare-and-swap (CAS), fetch-and-add, and so on (11.3). A single CAS atomically does what the bakery algorithm needs several steps to simulate; once the foundation changes, the upper design changes with it.
Around these primitives, locks have grown into a family, and understanding it helps make clear where Go’s mutex stands.
- The spin lock is the simplest: it retries a CAS over and over, entering once it wins and busy spinning in place to retry when it loses. It is nearly free under low contention but a disaster under high contention: several cores fight over the same lock variable, and every CAS makes that cache line repeatedly invalidate and re-fetch across cores, called cache-line bouncing, and the more contenders the worse it gets.
- The ticket lock uses fetch-and-add to issue numbers, then has each thread spin until its number is called, which restores FIFO fairness. But all waiters still spin on the same “now serving” variable, so the old cache-line bouncing problem is not eliminated.
- The MCS lock (Mellor-Crummey and Scott, 1991) is a classic among highly scalable locks. It has each waiter link itself into an explicit queue and spin on its own local variable, and when the previous lock holder releases, it only writes to that one local variable of its successor. As a result, no matter how many contenders there are, each handoff touches just one cache line, and the cache traffic under contention drops to a constant.
All of the above belong to the “busy wait” lineage, where waiters hold the CPU spinning. Another line
makes “sleep wait” cheap. In early days a thread could only block by trapping into the kernel, paying
the cost of a system call even when the lock had no contention at all. Linux’s futex (fast
userspace mutex, Franke, Russell, and Kirkwood, 2002) solved this pain point: with no contention,
locking and unlocking happen entirely in userspace with a single atomic operation, and only when a
waiter really must be blocked or woken does it trap into the kernel with that userspace address to
queue up. Almost all modern userspace locks, including Go’s mutex, stand on this “fast in userspace,
kernel as backstop” foundation of futex (and its equivalents on each platform). Go does not call
futex directly. Instead the runtime’s semaphore (runtime_SemacquireMutex / runtime_Semrelease)
wraps the low-level blocking primitive of each platform and presents a uniform interface upward.
11.2.2 The State Word and the Fast Path: Nearly Zero Cost Without Contention
Go’s mutex compresses a lock into two fields: a state word plus a semaphore.
| |
Packing several pieces of information into the same int32 lets a single atomic operation read and
modify all the key state of the lock at once: whether it is locked, whether there is a woken waiter,
whether it is starving, and how many waiters there are. The low three bits are flags, the high bits
are the waiter count, and adding or removing one waiter is just adding or subtracting
1<<mutexWaiterShift to state. This bit-field encoding lets the fast path of locking degenerate
into a single CAS on one integer.
When no one is contending, locking just atomically changes state from 0 (unlocked, no waiters) to
mutexLocked:
| |
This fast path does not enter the kernel and does not sleep. It is the key to why the mutex stays
light even though it is used at high frequency, and the vast majority of locks and unlocks end right
here. Only when the CAS fails, meaning the lock is held or there is already a waiter, do we fall into
the slow path lockSlow. Unlocking takes the symmetric fast path: clear the mutexLocked bit, and if
state ends up exactly 0 (no waiters, no other flags), everything is done, without even entering
unlockSlow.
11.2.3 The Slow Path: Spin First, Then Sleep
When contention happens, the goroutine does not go to sleep right away. Sleeping and waking both have
to deal with the runtime and are far from cheap, while a lock is often held only briefly and released
soon. So lockSlow first spins a few rounds, betting that the lock will free up soon, and if the
bet wins it saves a whole round of sleep and wake.
Spinning has strict admission conditions, and failing any one of them means giving up spinning and going to sleep:
| |
runtime_canSpin keeps “worth spinning” tight: it must be a multi-core machine, the spin count must
not exceed the cap (4 by default), and there must be no other goroutine waiting to run in the local
run queue. In other words, we spin only when “the lock is very likely to be acquired soon, and
spinning will not starve others”; once the lock is in starvation mode, or the spin count is exhausted,
we honestly hang ourselves on the semaphore sema and sleep, to be woken when the lock holder
unlocks. This “spin on short waits, sleep on long waits” hybrid strategy is not unique to Go.
pthread’s adaptive mutex (PTHREAD_MUTEX_ADAPTIVE_NP), Java’s biased / lightweight locks,
parking_lot, and others all take the same approach, differing only in the thresholds for how long to
spin and when to give up.
11.2.4 Fairness: Normal Mode and Starvation Mode
The most skillful piece of the mutex’s design is the two modes introduced in Go 1.9. They correspond to the two ends of the tightrope from 11.2.1: one end pursues throughput, the other holds onto fairness.
stateDiagram-v2
[*] --> Normal
Normal --> Normal: "newcomers race woken waiters (barging, throughput first)"
Normal --> Starving: "a waiter has queued for over 1ms"
Starving --> Starving: "on unlock, FIFO handoff straight to the head waiter"
Starving --> Normal: "queue drained, or head waited < 1ms"Normal mode pursues throughput. Waiters queue in FIFO, but a just-woken waiter does not directly get the lock. It has to compete for the same lock with the newcomer that is currently running and also wants to lock. The newcomer has a natural advantage: it is already running on the CPU and need not go through a wake-up, while the woken waiter has just climbed out of sleep and has not really been scheduled yet. So the newcomer often “barges” in and grabs the lock. Barging reduces context switches and significantly improves throughput, at the cost that the woken waiter who lost the race may be pushed back to the head of the queue over and over, losing out time after time, and fall into starvation.
To backstop the tail latency of this worst case, the mutex introduces starvation mode. When some
waiter’s wait time from enqueue to acquiring the lock exceeds 1ms (starvationThresholdNs) and it
still has not succeeded, it switches the mutex into starvation mode the moment it gets the lock. From
then on the rules invert: on unlock, no one is allowed to barge, and instead the lock’s ownership is
handed off straight to the head waiter in FIFO order (on unlock the mutexLocked bit is not even
set, and the waiter who receives the handoff sets it after waking); a newly arrived goroutine, even if
it sees the lock “looking empty,” does not try to acquire it and does not spin, but honestly lines up
at the tail. Once the queue drains, or the head waiter this time waited less than 1ms, the mutex
switches back to normal mode.
This mechanism incidentally resolves a question the reader often has: if unlocking wakes the FIFO
head waiter, why can waiters still
starve? The key is that in normal mode,
“waking” is not the same as “handing off the lock.” After unlockSlow earns the right to “wake one
person,” the one it wakes is indeed the head waiter, but after waking it still has to compete anew
with newcomers. Being woken is only a ticket to the race, not the lock itself:
| |
The two modes together let the mutex enjoy the high throughput of barging most of the time, while using the 1ms threshold as a backstop for the worst case: any waiter is barged past for only a while, and after 1ms it is guaranteed to get the lock in FIFO order. Normal mode is efficient because a goroutine can grab the same lock many times in a row even with a pile of blocked waiters in front of it; starvation mode is there specifically to suppress pathological tail latency.
11.2.5 How Others Do It: A Fairness Spectrum
“Throughput first, bounded backstop” is not unique to Go but a compromise the industry has repeatedly converged on. Lining up the various locks by fairness gives a spectrum, with “fully fair” at one end (strict FIFO direct handoff, no starvation but low throughput and large handoff cost) and “fully unfair” at the other (free barging, high throughput but waiters may starve).
- Java’s
ReentrantLocksimply leaves the choice to the user: a fair or unfair lock can be specified at construction. The default is the unfair lock, for the same reason Go chose barging, higher throughput; the fair lock grants strictly in FIFO order, suited to latency-sensitive scenarios that cannot tolerate starvation, but with markedly lower throughput. - Rust’s
parking_lottakes “eventual fairness”: normally it allows barging to grab throughput, but every short interval (on the order of 1ms) it forces one fair handoff, ensuring waiters do not starve indefinitely. The idea echoes Go’s starvation mode, except the trigger is decided by a “time interval” rather than “a single waiter’s wait time.”
Go’s 1ms threshold is one concrete and precise tradeoff point on this spectrum: by default it eats the throughput dividend of barging, and uses a fixed time upper bound to backstop fairness into “bounded waiting.” This threshold is an engineering empirical value rather than a theoretical optimum, and it strikes a compromise between “backing off too late, causing perceptible stalls” and “backing off too often, dragging down throughput.”
11.2.6 Read-Write Lock and TryLock
sync.RWMutex distinguishes readers from writers on top of mutual exclusion: multiple readers can
hold the lock at the same time, while a writer holds it exclusively. It suits read-heavy,
write-light scenarios, but watch out for the tradeoff in it: if readers keep arriving, a writer may go
a long time without getting the lock (writer starvation). Go’s implementation therefore makes a reader
that arrives later also block when a writer is already waiting, so that the writer is not delayed
indefinitely by readers. For RWMutex’s happens-before guarantees (one Unlock synchronizes before a
later Lock, and RUnlock synchronizes before a later Lock), see 11.9.
TryLock (and RWMutex’s TryLock / TryRLock, all added in Go 1.18) tries to lock but never
blocks: it returns true if it acquires the lock and false immediately if it does not. Its use is
narrow, and the official docs specifically warn: occasions where using TryLock is correct do exist
but are rare, and frequent reliance on TryLock is often a sign that some lock usage itself has a
design problem. From the memory model’s view, a successful TryLock is equivalent to a Lock, while
a failed TryLock establishes no synchronizes-before relationship.
A small note on implementation location: since Go 1.24, the core implementations of
Mutex,RWMutex, and others have moved down into theinternal/syncpackage, and the standard library’ssync.Mutexdegenerates into a thin wrapper (embedding aninternal/sync.Mutexand anoCopymarker, with methods forwarding directly). This move is so that internal packages such as the runtime can reuse the same implementation without forming a circular dependency onsync, and the mechanisms described in this section, the state word, the fast and slow paths, and the two modes, are unchanged.
11.2.7 Engineering Tradeoffs
The mutex’s design is tradeoffs everywhere, and each one confirms that old saying: a gain in
performance never comes for free, it always comes with a relocation of complexity. Packing several
pieces of information into one int32 with a state word bit field saves atomic operations on the lock
path at the price of obscure bit manipulation; betting on a spin for a short wait saves a sleep and
wake if the bet wins, or spins idle for a few rounds for nothing if it loses; trading throughput with
barging then forces in a whole starvation mode plus 1ms threshold to backstop fairness, and most of
the complexity of the entire lockSlow comes from this backstop layer.
Putting the mutex back into the panorama of Go concurrency, it and the channel represent two styles: the mutex plainly expresses “mutual exclusion,” the channel expresses “communication.” Go’s maxim “do not communicate by sharing memory; instead, share memory by communicating” recommends the latter, but this is a leaning, not a ban. Which one to use depends on whether you mean to “protect a piece of shared state” or to “pass ownership of data between goroutines,” not on which is “more advanced.” The next section turns to the atomic read-modify-write foundation under the mutex’s feet itself (11.3).
Further Reading
- Edsger W. Dijkstra. “Solution of a Problem in Concurrent Programming Control.” Communications of the ACM, 8(9), 1965. https://doi.org/10.1145/365559.365617 (formalization of the mutual exclusion problem and the first software solution)
- Leslie Lamport. “A New Solution of Dijkstra’s Concurrent Programming Problem.” Communications of the ACM, 17(8), 1974. https://doi.org/10.1145/361082.361093 (the bakery algorithm)
- John M. Mellor-Crummey, Michael L. Scott. “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.” ACM TOCS, 9(1), 1991. https://doi.org/10.1145/103727.103729 (the MCS lock and the foundation of scalable locks)
- Hubertus Franke, Rusty Russell, Matthew Kirkwood. “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux.” Proceedings of the Ottawa Linux Symposium, 2002. (futex, the foundation of fast userspace locking)
- Dmitry Vyukov. sync: make Mutex more fair (Go 1.9 starvation mode), 2016. https://go-review.googlesource.com/c/go/+/34310 ; for related discussion see issue #13086.
- The Go Authors. The Mutex implementation in runtime/internal.
src/internal/sync/mutex.go,src/sync/mutex.go. https://github.com/golang/go/tree/master/src/internal/sync - The Go Authors. The Go Memory Model: Locks. https://go.dev/ref/mem
- This book, 11.3 Atomic Operations, 11.9 The Memory Consistency Model.