
Pointers Might Not be Ideal as Arguments

#Performance #Parameter #Pointer

Author(s): Changkun Ou

Permalink: https://golang.design/research/pointer-params

We are aware that passing parameters by pointer avoids copying the data, which should benefit performance. Nevertheless, there are edge cases where this intuition fails, and we need to be careful.

Introduction

Let's take this as an example:

// vec.go
type vec struct {
	x, y, z, w float64
}

func (v vec) addv(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

func (v *vec) addp(u *vec) *vec {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}
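
For scale: vec holds four float64 fields, so each copy moves 32 bytes. A quick sanity check (size.go is a hypothetical throwaway file, not part of the example above):

// size.go
package main

import (
	"fmt"
	"unsafe"
)

type vec struct{ x, y, z, w float64 }

func main() {
	// Four float64 fields, 8 bytes each: 32 bytes per copy.
	fmt.Println(unsafe.Sizeof(vec{})) // 32
}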

Which vector addition runs faster? Intuitively, we might assume that vec.addp is faster than vec.addv because its parameter u is a pointer: there should be no copying of the data, whereas vec.addv copies the 32-byte vec both when passing the argument and when returning the result.

However, if we do a micro-benchmark (the i%2 alternation below keeps both vectors live across iterations, so the compiler cannot simply discard the loop body):

func BenchmarkVec(b *testing.B) {
	b.Run("addv", func(b *testing.B) {
		v1 := vec{1, 2, 3, 4}
		v2 := vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addv(v2)
			} else {
				v2 = v2.addv(v1)
			}
		}
	})
	b.Run("addp", func(b *testing.B) {
		v1 := &vec{1, 2, 3, 4}
		v2 := &vec{4, 5, 6, 7}
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addp(v2)
			} else {
				v2 = v2.addp(v1)
			}
		}
	})
}

And run it as follows (perflock keeps the CPU frequency stable so that the measurements are comparable across runs):

$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | \
	tee new.txt
$ benchstat new.txt

The benchstat tool will give you the following result:

name         time/op
Vec/addv-16  0.25ns ± 2%
Vec/addp-16  2.20ns ± 0%

name         alloc/op
Vec/addv-16   0.00B
Vec/addp-16   0.00B

name         allocs/op
Vec/addv-16    0.00
Vec/addp-16    0.00

How is this happening? Note that addv takes only 0.25ns per operation, far too fast for a real function call to be taking place.

Inlining Optimization

This is all because of compiler optimizations, and mostly because of inlining.
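
Before changing any code, we can ask the compiler to report its inlining decisions with the -m flag. A rough check (the exact messages differ between Go versions, so this output is approximate):

$ go tool compile -m vec.go
vec.go:6:6: can inline vec.addv
vec.go:10:6: can inline (*vec).addp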

If we disable inlining [1] [2] for addv and addp:

//go:noinline
func (v vec) addv(u vec) vec {
	return vec{v.x + u.x, v.y + u.y, v.z + u.z, v.w + u.w}
}

//go:noinline
func (v *vec) addp(u *vec) *vec {
	v.x, v.y, v.z, v.w = v.x+u.x, v.y+u.y, v.z+u.z, v.w+u.w
	return v
}

Then run the benchmark again and compare the performance with the previous results:

$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | \
	tee old.txt
$ benchstat old.txt new.txt
name         old time/op    new time/op    delta
Vec/addv-16    4.99ns ± 1%    0.25ns ± 2%  -95.05%  (p=0.000 n=9+10)
Vec/addp-16    3.35ns ± 1%    2.20ns ± 0%  -34.37%  (p=0.000 n=10+8)

The inlining optimization transforms the vec.addv call:

v1 := vec{1, 2, 3, 4}
v2 := vec{4, 5, 6, 7}
v1 = v1.addv(v2)

into a direct assignment:

v1 := vec{1, 2, 3, 4}
v2 := vec{4, 5, 6, 7}
v1 = vec{1+4, 2+5, 3+6, 4+7}

And for the vec.addp case, it transforms:

v1 := &vec{1, 2, 3, 4}
v2 := &vec{4, 5, 6, 7}
v1 = v1.addp(v2)

into direct field manipulation, with the pointers optimized away entirely:

v1 := vec{1, 2, 3, 4}
v2 := vec{4, 5, 6, 7}
v1.x, v1.y, v1.z, v1.w = v1.x+v2.x, v1.y+v2.y, v1.z+v2.z, v1.w+v2.w
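
This rewrite is only possible because escape analysis proves that v1 and v2 never escape the benchmark function; once a pointer does escape, the object must live on the heap and the indirection stays. A minimal sketch of the distinction (newVec and sumLocal are hypothetical helpers reusing the vec type from above):

// newVec returns a pointer that outlives the call, so escape
// analysis moves the vec to the heap
// (go build -gcflags=-m reports roughly: "moved to heap: v").
func newVec() *vec {
	v := vec{1, 2, 3, 4}
	return &v
}

// sumLocal never lets anything escape, so after inlining the
// values stay on the stack or in registers.
func sumLocal() float64 {
	v := vec{1, 2, 3, 4}
	u := vec{4, 5, 6, 7}
	r := v.addv(u)
	return r.x + r.y + r.z + r.w
}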

Addressing Modes

If we check the compiled assembly, the reason reveals itself quickly:

$ mkdir asm && go tool compile -S vec.go > asm/vec.s

The dumped assembly code is as follows:

"".vec.addv STEXT nosplit size=89 args=0x60 locals=0x0 funcid=0x0
	0x0000 00000 (vec.go:7)	TEXT	"".vec.addv(SB), NOSPLIT|ABIInternal, $0-96
	0x0000 00000 (vec.go:7)	FUNCDATA	$0, gclocals·...(SB)
	0x0000 00000 (vec.go:7)	FUNCDATA	$1, gclocals·...(SB)
	0x0000 00000 (vec.go:8)	MOVSD	"".u+40(SP), X0
	0x0006 00006 (vec.go:8)	MOVSD	"".v+8(SP), X1
	0x000c 00012 (vec.go:8)	ADDSD	X1, X0
	0x0010 00016 (vec.go:8)	MOVSD	X0, "".~r1+72(SP)
	0x0016 00022 (vec.go:8)	MOVSD	"".u+48(SP), X0
	0x001c 00028 (vec.go:8)	MOVSD	"".v+16(SP), X1
	0x0022 00034 (vec.go:8)	ADDSD	X1, X0
	0x0026 00038 (vec.go:8)	MOVSD	X0, "".~r1+80(SP)
	0x002c 00044 (vec.go:8)	MOVSD	"".u+56(SP), X0
	0x0032 00050 (vec.go:8)	MOVSD	"".v+24(SP), X1
	0x0038 00056 (vec.go:8)	ADDSD	X1, X0
	0x003c 00060 (vec.go:8)	MOVSD	X0, "".~r1+88(SP)
	0x0042 00066 (vec.go:8)	MOVSD	"".u+64(SP), X0
	0x0048 00072 (vec.go:8)	MOVSD	"".v+32(SP), X1
	0x004e 00078 (vec.go:8)	ADDSD	X1, X0
	0x0052 00082 (vec.go:8)	MOVSD	X0, "".~r1+96(SP)
	0x0058 00088 (vec.go:8)	RET
"".(*vec).addp STEXT nosplit size=73 args=0x18 locals=0x0 funcid=0x0
	0x0000 00000 (vec.go:11)	TEXT	"".(*vec).addp(SB), NOSPLIT|ABIInternal, $0-24
	0x0000 00000 (vec.go:11)	FUNCDATA	$0, gclocals·...(SB)
	0x0000 00000 (vec.go:11)	FUNCDATA	$1, gclocals·...(SB)
	0x0000 00000 (vec.go:12)	MOVQ	"".u+16(SP), AX
	0x0005 00005 (vec.go:12)	MOVSD	(AX), X0
	0x0009 00009 (vec.go:12)	MOVQ	"".v+8(SP), CX
	0x000e 00014 (vec.go:12)	ADDSD	(CX), X0
	0x0012 00018 (vec.go:12)	MOVSD	8(AX), X1
	0x0017 00023 (vec.go:12)	ADDSD	8(CX), X1
	0x001c 00028 (vec.go:12)	MOVSD	16(CX), X2
	0x0021 00033 (vec.go:12)	ADDSD	16(AX), X2
	0x0026 00038 (vec.go:12)	MOVSD	24(AX), X3
	0x002b 00043 (vec.go:12)	ADDSD	24(CX), X3
	0x0030 00048 (vec.go:12)	MOVSD	X0, (CX)
	0x0034 00052 (vec.go:12)	MOVSD	X1, 8(CX)
	0x0039 00057 (vec.go:12)	MOVSD	X2, 16(CX)
	0x003e 00062 (vec.go:12)	MOVSD	X3, 24(CX)
	0x0043 00067 (vec.go:13)	MOVQ	CX, "".~r1+24(SP)
	0x0048 00072 (vec.go:13)	RET

The addv implementation reads its arguments directly from the caller's stack frame and writes the result straight into the return slot, using only SP-relative MOVSD [3] and ADDSD [4] instructions. The addp implementation instead needs MOVQ [5] to copy the pointer parameters into registers (AX and CX), accesses every field indirectly through those registers, and writes the pointer back when returning. Even so, with inlining disabled, addv is the slower of the two: each call moves 96 bytes of arguments and results across the stack (args=0x60 in the header above), whereas addp moves only three 8-byte pointers (args=0x18). The difference comes down to the memory access patterns.

Conclusion

Is passing by value always faster than passing by pointer? We can test further, but this time we need a generator that produces structs with a growing number of fields, together with the corresponding benchmarks. Here is how we could do it:

// gen.go

// +build ignore

package main

import (
	"bytes"
	"fmt"
	"go/format"
	"io/ioutil"
	"strings"
	"text/template"
)

var (
	head = `// Code generated by go run gen.go; DO NOT EDIT.
package fields_test

import "testing"
`
	structTmpl = template.Must(template.New("ss").Parse(`
type {{.Name}} struct {
	{{.Properties}}
}

func (s {{.Name}}) addv(ss {{.Name}}) {{.Name}} {
	return {{.Name}}{
		{{.Addv}}
	}
}

func (s *{{.Name}}) addp(ss *{{.Name}}) *{{.Name}} {
	{{.Addp}}
	return s
}
`))
	benchHead = `func BenchmarkVec(b *testing.B) {`
	benchTail = `}`
	benchBody = template.Must(template.New("bench").Parse(`
	b.Run("addv-{{.Name}}", func(b *testing.B) {
		{{.InitV}}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addv(v2)
			} else {
				v2 = v2.addv(v1)
			}
		}
	})
	b.Run("addp-{{.Name}}", func(b *testing.B) {
		{{.InitP}}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			if i%2 == 0 {
				v1 = v1.addp(v2)
			} else {
				v2 = v2.addp(v1)
			}
		}
	})
`))
)

type structFields struct {
	Name       string
	Properties string
	Addv       string
	Addp       string
}
type benchFields struct {
	Name  string
	InitV string
	InitP string
}

func main() {
	w := new(bytes.Buffer)
	w.WriteString(head)

	N := 10

	for i := 0; i < N; i++ {
		var (
			ps   = []string{}
			adv  = []string{}
			adpl = []string{}
			adpr = []string{}
		)
		for j := 0; j <= i; j++ {
			ps = append(ps, fmt.Sprintf("x%d\tfloat64", j))
			adv = append(adv, fmt.Sprintf("s.x%d + ss.x%d,", j, j))
			adpl = append(adpl, fmt.Sprintf("s.x%d", j))
			adpr = append(adpr, fmt.Sprintf("s.x%d + ss.x%d", j, j))
		}
		err := structTmpl.Execute(w, structFields{
			Name:       fmt.Sprintf("s%d", i),
			Properties: strings.Join(ps, "\n"),
			Addv:       strings.Join(adv, "\n"),
			Addp: strings.Join(adpl, ",") + " = " +
				strings.Join(adpr, ","),
		})
		if err != nil {
			panic(err)
		}
	}

	w.WriteString(benchHead)
	for i := 0; i < N; i++ {
		nums1, nums2 := []string{}, []string{}
		for j := 0; j <= i; j++ {
			nums1 = append(nums1, fmt.Sprintf("%d", j))
			nums2 = append(nums2, fmt.Sprintf("%d", j+i))
		}
		numstr1 := strings.Join(nums1, ", ")
		numstr2 := strings.Join(nums2, ", ")

		err := benchBody.Execute(w, benchFields{
			Name: fmt.Sprintf("s%d", i),
			InitV: fmt.Sprintf(`v1 := s%d{%s}
v2 := s%d{%s}`, i, numstr1, i, numstr2),
			InitP: fmt.Sprintf(`v1 := &s%d{%s}
v2 := &s%d{%s}`, i, numstr1, i, numstr2),
		})
		if err != nil {
			panic(err)
		}
	}
	w.WriteString(benchTail)

	out, err := format.Source(w.Bytes())
	if err != nil {
		panic(err)
	}
	err = ioutil.WriteFile("impl_test.go", out, 0660)
	if err != nil {
		panic(err)
	}
}
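
A side note: for the go generate step below to run gen.go, a //go:generate directive has to live in a regular Go file of the package, because gen.go itself is excluded from the build by its ignore tag. A minimal sketch, assuming a hypothetical companion file in the same directory:

// generate.go (hypothetical companion file)
//go:generate go run gen.go
package fields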

If we generate our test code and perform the same benchmark procedure again:

$ go generate
$ perflock -governor 80% go test -v -run=none -bench=. -count=10 | \
	tee inline.txt
$ benchstat inline.txt
name            time/op
Vec/addv-s0-16  0.25ns ± 0%
Vec/addp-s0-16  2.20ns ± 0%
Vec/addv-s1-16  0.49ns ± 1%
Vec/addp-s1-16  2.20ns ± 0%
Vec/addv-s2-16  0.25ns ± 1%
Vec/addp-s2-16  2.20ns ± 0%
Vec/addv-s3-16  0.49ns ± 2%
Vec/addp-s3-16  2.21ns ± 1%
Vec/addv-s4-16  8.29ns ± 0%
Vec/addp-s4-16  2.37ns ± 1%
Vec/addv-s5-16  9.06ns ± 1%
Vec/addp-s5-16  2.74ns ± 1%
Vec/addv-s6-16   9.9ns ± 0%
Vec/addp-s6-16  3.17ns ± 0%
Vec/addv-s7-16  10.9ns ± 1%
Vec/addp-s7-16  3.27ns ± 1%
Vec/addv-s8-16  11.4ns ± 0%
Vec/addp-s8-16  3.29ns ± 0%
Vec/addv-s9-16  13.4ns ± 1%
Vec/addp-s9-16  3.37ns ± 0%

Notice the crossover: addv wins up to s3 (four float64 fields, 32 bytes), but from s4 onward the cost of copying dominates and addp takes the lead. We could even further try a version that disables inlining:

structTmpl = template.Must(template.New("ss").Parse(`
type {{.Name}} struct {
	{{.Properties}}
}
+//go:noinline
func (s {{.Name}}) addv(ss {{.Name}}) {{.Name}} {
	return {{.Name}}{
		{{.Addv}}
	}
}
+//go:noinline
func (s *{{.Name}}) addp(ss *{{.Name}}) *{{.Name}} {
	{{.Addp}}
	return s
}
`))

Eventually, we will end up with the following results:

[Figure: time/op of addv versus addp for struct sizes s0 through s9, with and without inlining]

TL;DR: The figure above demonstrates when you should pass by value and when by pointer. If you are certain that your code won't produce any escaping variables, and the size of your argument is smaller than 4*8 = 32 bytes, then you should go for pass-by-value; otherwise, you should keep using pointers.
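
To make that rule of thumb concrete, here is a hedged sketch (quat and mat4 are hypothetical types invented for illustration, not from the benchmarks above):

// Small value types (at most 32 bytes): prefer value parameters;
// with inlining, the compiler can keep everything in registers.
type quat struct{ x, y, z, w float64 } // 4*8 = 32 bytes

func (q quat) add(r quat) quat {
	return quat{q.x + r.x, q.y + r.y, q.z + r.z, q.w + r.w}
}

// Larger types: prefer pointer parameters to avoid the copy cost.
type mat4 struct{ m [16]float64 } // 16*8 = 128 bytes

func (a *mat4) add(b *mat4) *mat4 {
	for i := range a.m {
		a.m[i] += b.m[i]
	}
	return a
}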

References


  1. Dave Cheney. Mid-stack inlining in Go. May 2, 2020. https://dave.cheney.net/2020/05/02/mid-stack-inlining-in-go

  2. Dave Cheney. Inlining optimisations in Go. April 25, 2020. https://dave.cheney.net/2020/04/25/inlining-optimisations-in-go

  3. MOVSD. Move or Merge Scalar Double-Precision Floating-Point Value. Last access: 2020-10-27. https://www.felixcloutier.com/x86/movsd

  4. ADDSD. Add Scalar Double-Precision Floating-Point Values. Last access: 2020-10-27. https://www.felixcloutier.com/x86/addsd

  5. MOVQ. Move Quadword. Last access: 2020-10-27. https://www.felixcloutier.com/x86/movq