Chapter 18 GPU and Heterogeneous Compute on Go: Under the Hood

18.1 Crossing the FFI Boundary

Section 15.6 already took the cgo bridge apart: a call from Go into C must switch to the g0 system stack, lay the arguments out again according to C’s ABI, call entersyscall so that the P can be handed to other work, make the call, and then call exitsyscall to reacquire a P. The whole round trip comes out one or two orders of magnitude more expensive than a Go call. The conclusion of that section was blunt: cgo suits a small number of coarse-grained calls, and its worst enemy is a hot loop that crosses the boundary again and again.
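
The difference between fine-grained and coarse-grained crossings can be made concrete with a small sketch. It is pure Go and does not use cgo: the `boundary` type and its `scale` method are hypothetical stand-ins that only count how often the bridge would be crossed, so the numbers show call counts, not measured cost.

```go
package main

import "fmt"

// boundary is a hypothetical stand-in for the cgo bridge. It only counts
// crossings; a real crossing would also pay for the stack switch and the
// scheduler bookkeeping described in the text.
type boundary struct {
	crossings int
}

// scale stands in for a foreign function that multiplies a buffer in place.
// Each invocation represents exactly one crossing.
func (b *boundary) scale(buf []float32, k float32) {
	b.crossings++
	for i := range buf {
		buf[i] *= k
	}
}

// scaleFine crosses once per element: the hot-loop pattern to avoid.
func scaleFine(b *boundary, data []float32, k float32) {
	for i := range data {
		b.scale(data[i:i+1], k)
	}
}

// scaleCoarse hands the whole buffer over in a single crossing.
func scaleCoarse(b *boundary, data []float32, k float32) {
	b.scale(data, k)
}

func main() {
	fine, coarse := &boundary{}, &boundary{}
	x := make([]float32, 1024)
	y := make([]float32, 1024)
	scaleFine(fine, x, 2)
	scaleCoarse(coarse, y, 2)
	fmt.Println("fine-grained crossings:", fine.crossings)
	fmt.Println("coarse-grained crossings:", coarse.crossings)
}
```

Both functions produce the same result; only the number of crossings differs, and with a real cgo call that number multiplies the per-call overhead.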

18.2 The Scheduler and Blocking Foreign Calls

The prescription 18.1 gave was “asynchronous, synchronize seldom”: push commands into the stream, return immediately, and wait only once at the end. But that one wait at the end still has to happen. cudaStreamSynchronize blocks until the GPU has drained the whole stream; a synchronous cudaMemcpy, or any driver call that does not take the asynchronous path, likewise leaves the Go-side thread genuinely stopped inside C. This section asks: when a crossing really does block for a long time, how does the scheduling machinery of Chapter 9 react? Is it dragged down by a single call stuck on the GPU?

18.3 The Divide Between Device Memory and the Garbage Collector

Section 15.6 covered cgo’s pointer rules: Go’s objects do not belong to C, the GC may move or reclaim them at any time, and so C must not hold an unpinned Go pointer after the call returns. Those rules were deduced from a two-sided world of “two memories, Go’s and C’s.” The GPU complicates the map: there are now at least four kinds of memory, under different jurisdictions and obeying different rules. This section first draws that map clearly, then shows where the dividing line between the garbage collector and each kind of memory falls, and which of those lines is most easily crossed by mistake in asynchronous transfers.

18.4 The Asynchronous Programming Model

The previous three sections all dealt with costs at the FFI boundary: the price of each crossing (18.1), the thread a crossing occupies (18.2), and the ownership of memory on the bridge (18.3). This section changes the angle and returns to the concurrency model itself. Go’s concurrency is goroutines and channels, the GPU’s concurrency is something else, and the CPU hides a third kind. Laying these three kinds of parallelism out clearly and seeing how they connect closes the chapter, and it is the key to understanding “when to push the work across the boundary, and when there is no need at all.”