<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Chapter 20 AI Inference and Serving on Go: Under the Hood</title><link>https://golang.design/under-the-hood/en/part6hetero/ch20inference/</link><description>Recent content in Chapter 20 AI Inference and Serving on Go: Under the Hood</description><generator>Hugo</generator><language>en</language><atom:link href="https://golang.design/under-the-hood/en/part6hetero/ch20inference/index.xml" rel="self" type="application/rss+xml"/><item><title>20.1 The Inference Runtime and FFI</title><link>https://golang.design/under-the-hood/en/part6hetero/ch20inference/runtime/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch20inference/runtime/</guid><description>&lt;h1 id="201-the-inference-runtime-and-ffi"&gt;20.1 The Inference Runtime and FFI&lt;/h1&gt;
&lt;p&gt;The FFI boundary that recurred through Chapters 18 and 19 takes on, with large models, its most
current form. Training large models is almost exclusively the domain of Python and CUDA, but once a model
is trained and must be deployed to &lt;strong&gt;serve&lt;/strong&gt; tens of millions of requests, the lead role
changes, and Go holds its ground at this layer. This chapter is about how Go takes on AI
inference and serving, and the first section must settle the lowest-level problem: Go does
not do matrix multiplication itself; it has to wire in a native inference runtime, and this wiring
is, once again, the boundary of Chapter 18.&lt;/p&gt;</description></item><item><title>20.2 Tokenization and Tensors</title><link>https://golang.design/under-the-hood/en/part6hetero/ch20inference/tokenize/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch20inference/tokenize/</guid><description>&lt;h1 id="202-tokenization-and-tensors"&gt;20.2 Tokenization and Tensors&lt;/h1&gt;
&lt;p&gt;&lt;a href=".././runtime"&gt;20.1&lt;/a&gt; placed weights and tensors on the native runtime side, with
Go only passing handles and moving small data across the boundary. This section looks at that
&amp;ldquo;small data&amp;rdquo; itself: how text becomes the numbers a model can consume, and how those numbers turn
back into text. This seemingly trivial step hides a detail that will make a Go programmer
smile in recognition, for it is almost a direct application of Chapter 5&amp;rsquo;s &amp;ldquo;a string is a
stretch of immutable bytes,&amp;rdquo; and the moment it is neglected, the result is garbled text in
streaming output.&lt;/p&gt;</description></item><item><title>20.3 Serving, Batching, and Streaming</title><link>https://golang.design/under-the-hood/en/part6hetero/ch20inference/serving/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://golang.design/under-the-hood/en/part6hetero/ch20inference/serving/</guid><description>&lt;h1 id="203-serving-batching-and-streaming"&gt;20.3 Serving, Batching, and Streaming&lt;/h1&gt;
&lt;p&gt;The previous two sections laid the groundwork for a single inference: &lt;a href=".././runtime"&gt;20.1&lt;/a&gt;
wired in the runtime through cgo and settled where the weights and tensors live, and
&lt;a href=".././tokenize"&gt;20.2&lt;/a&gt; made clear how tokens go in and come out. But a real service must handle
thousands upon thousands of such requests at once, each continuously emitting tokens. Organizing
them efficiently and stably is fundamentally a &lt;strong&gt;concurrency and scheduling&lt;/strong&gt; problem,
and this is Go&amp;rsquo;s home ground. This section brings Chapter 10&amp;rsquo;s channels and Chapter 7&amp;rsquo;s
context down to large-model serving.&lt;/p&gt;</description></item></channel></rss>