Chapter 20 AI Inference and Serving on Go: Under the Hood

20.1 The Inference Runtime and FFI

The FFI boundary that recurred through Chapters 18 and 19 takes its most current form with large models. Training large models is almost exclusively the domain of Python and CUDA, but once a model is trained and must be deployed to serve tens of millions of requests, the lead role changes, and Go holds its ground at this layer. This chapter is about how Go takes on AI inference and serving, and the first section must settle the lowest-level problem: Go does not perform the matrix multiplication itself; it has to wire in a native inference runtime, and this wiring is, once again, the boundary of Chapter 18.

20.2 Tokenization and Tensors

Section 20.1 placed weights and tensors on the native runtime side, with Go only passing handles and moving small data across the boundary. This section steps into that “small data” itself: how text becomes the numbers a model can consume, and how those numbers turn back into text. This seemingly trivial step hides a detail that will make a Go programmer smile in recognition, for it is almost a direct application of Chapter 5’s “a string is a stretch of immutable bytes,” and the moment it is neglected, streaming output emits garbled text.
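
The garbled-output hazard can be made concrete with a small sketch. It assumes a byte-level tokenizer, where one token's bytes may end in the middle of a multi-byte UTF-8 character; `StreamDecoder`, `Push`, and `completePrefix` are hypothetical names chosen for illustration, not the API of any tokenizer library. The idea is to hold back a truncated trailing sequence until the bytes that complete it arrive.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// StreamDecoder holds back bytes that end in the middle of a UTF-8 sequence.
// Hypothetical helper for illustration only.
type StreamDecoder struct {
	pending []byte
}

// Push appends one token's raw bytes and returns the longest prefix that
// does not end in a truncated rune. The remainder stays buffered.
func (d *StreamDecoder) Push(tokenBytes []byte) string {
	d.pending = append(d.pending, tokenBytes...)
	n := completePrefix(d.pending)
	out := string(d.pending[:n]) // copies the bytes into an immutable string
	d.pending = append(d.pending[:0], d.pending[n:]...)
	return out
}

// completePrefix returns the length of the prefix of b that is safe to emit.
func completePrefix(b []byte) int {
	// A rune is at most utf8.UTFMax bytes, so only the tail needs checking.
	for i := len(b) - 1; i >= 0 && i >= len(b)-utf8.UTFMax; i-- {
		if utf8.RuneStart(b[i]) {
			if utf8.FullRune(b[i:]) {
				return len(b) // the last rune is complete
			}
			return i // the last rune is truncated: stop before it
		}
	}
	return len(b) // no rune start in the tail: invalid input, pass it through
}

func main() {
	d := &StreamDecoder{}
	// "é" is 0xC3 0xA9 and "你" is 0xE4 0xBD 0xA0; both are split across pieces.
	pieces := [][]byte{{'h', 0xC3}, {0xA9, 'i', ' ', 0xE4}, {0xBD}, {0xA0}}
	for _, p := range pieces {
		fmt.Printf("%q\n", d.Push(p))
	}
}
```

Converting each piece with `string(p)` directly would instead hand the client fragments that are not valid UTF-8. A production decoder would also need a flush step for whatever is left in the buffer when the stream ends, which this sketch omits.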

20.3 Serving, Batching, and Streaming

The previous two sections laid the underpinnings of a single inference: Section 20.1 wired in the runtime through cgo and settled where the weights and tensors live, and Section 20.2 made clear how tokens go in and come out. But a real service must serve thousands of such requests at once, each continuously emitting tokens. Organizing them efficiently and stably is a concurrency and scheduling problem through and through, and this is Go’s home ground. This section applies Chapter 10’s channels and Chapter 7’s context to large-model serving.