<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Chapter 18 GPU and Heterogeneous Compute on Go: Under the Hood</title><link>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/</link><description>Recent content in Chapter 18 GPU and Heterogeneous Compute on Go: Under the Hood</description><generator>Hugo</generator><language>en</language><atom:link href="https://golang.design/under-the-hood/en/part6hetero/ch18gpu/index.xml" rel="self" type="application/rss+xml"/><item><title>18.1 Crossing the FFI Boundary</title><link>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/boundary/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/boundary/</guid><description>&lt;h1 id="181-crossing-the-ffi-boundary"&gt;18.1 Crossing the FFI Boundary&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://golang.design/under-the-hood/en/part5toolchain/ch15compile/cgo/"&gt;15.6&lt;/a&gt; already took the cgo bridge apart: a call
from Go into C must switch to the &lt;code&gt;g0&lt;/code&gt; system stack, lay the arguments out again according
to C&amp;rsquo;s ABI, &lt;code&gt;entersyscall&lt;/code&gt; to give up the P, make the call, then &lt;code&gt;exitsyscall&lt;/code&gt; to win back
a P, and the whole round trip comes out one or two orders of magnitude more expensive than an
ordinary Go call. The conclusion of that section was blunt: cgo suits a &lt;strong&gt;small number of
coarse-grained&lt;/strong&gt; calls, and its worst enemy is a hot loop that crosses the boundary again
and again.&lt;/p&gt;</description></item><item><title>18.2 The Scheduler and Blocking Foreign Calls</title><link>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/sched/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/sched/</guid><description>&lt;h1 id="182-the-scheduler-and-blocking-foreign-calls"&gt;18.2 The Scheduler and Blocking Foreign Calls&lt;/h1&gt;
&lt;p&gt;The prescription &lt;a href=".././boundary"&gt;18.1&lt;/a&gt; gave was &amp;ldquo;asynchronous, synchronize seldom&amp;rdquo;: push
commands into the stream, return immediately, wait only once at the end. But that one wait at
the end still has to happen. &lt;code&gt;cudaStreamSynchronize&lt;/code&gt; blocks until the GPU has drained
the whole stream, and a synchronous &lt;code&gt;cudaMemcpy&lt;/code&gt;, or any driver call that does not take the
asynchronous path, likewise leaves the Go-side thread genuinely stopped inside C. This
section asks: when a crossing &lt;strong&gt;really does block for a long time&lt;/strong&gt;, how does the
scheduling machinery of &lt;a href="https://golang.design/under-the-hood/en/part3concurrency/ch09sched/"&gt;Chapter 9&lt;/a&gt; react? Will it be
dragged down by a single call stuck on the GPU?&lt;/p&gt;</description></item><item><title>18.3 The Divide Between Device Memory and the Garbage Collector</title><link>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/memory/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/memory/</guid><description>&lt;h1 id="183-the-divide-between-device-memory-and-the-garbage-collector"&gt;18.3 The Divide Between Device Memory and the Garbage Collector&lt;/h1&gt;
&lt;p&gt;&lt;a href="https://golang.design/under-the-hood/en/part5toolchain/ch15compile/cgo/"&gt;15.6&lt;/a&gt; covered cgo&amp;rsquo;s pointer rules: Go&amp;rsquo;s objects do
not belong to C, the GC may move or reclaim them at any time, and so C must not hold an
unpinned Go pointer after the call returns. That rule was derived for a binary world of &amp;ldquo;two
memories, Go&amp;rsquo;s and C&amp;rsquo;s.&amp;rdquo; The GPU complicates this map: now there are at least &lt;strong&gt;four&lt;/strong&gt; kinds
of memory, under different jurisdictions and obeying different rules. This section first
draws that map clearly, then examines where the dividing lines between the garbage collector
and each of them fall, and which line is most easily trampled in asynchronous transfers.&lt;/p&gt;</description></item><item><title>18.4 The Asynchronous Programming Model</title><link>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/model/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch18gpu/model/</guid><description>&lt;h1 id="184-the-asynchronous-programming-model"&gt;18.4 The Asynchronous Programming Model&lt;/h1&gt;
&lt;p&gt;The previous three sections all dealt with the &amp;ldquo;costs&amp;rdquo; at the FFI boundary: how fast a
crossing is (&lt;a href=".././boundary"&gt;18.1&lt;/a&gt;), how a crossing occupies a thread (&lt;a href=".././sched"&gt;18.2&lt;/a&gt;), and who owns the
memory on the bridge (&lt;a href=".././memory"&gt;18.3&lt;/a&gt;). This section changes the angle and returns to
the &lt;strong&gt;concurrency model&lt;/strong&gt; itself. Go&amp;rsquo;s concurrency is goroutines and channels, the GPU&amp;rsquo;s
concurrency is something else, and the CPU hides a third kind. Laying these three
parallelisms out clearly and seeing how they connect closes this chapter and is the
key to understanding &amp;ldquo;when to push the work across the boundary, and when there is no need
at all.&amp;rdquo;&lt;/p&gt;</description></item></channel></rss>