🙋🏾 💩 🤰 Language mechanics of memory profiling 🥃 👩🏾‍🤝‍👨🏿 🏬

Prelude

This is the third of four articles in a series that will provide insight into the mechanics and design of pointers, stacks, heaps, escape analysis, and value / pointer semantics in Go. This post is about memory profiling.

Table of contents of the cycle of articles:

Watch this video to see a demo of this code:

DGopherCon Singapore (2017) - Escape Analysis

Introduction

In a previous post, I taught the basics of escape analysis using an example that splits a value on a goroutine stack. I have not shown you any other scenarios that can lead to heap values. To help you with this, I'm going to debug a program that makes allocations in unexpected ways.

Program

I wanted to learn more about the io package, so I came up with a little task for myself. Given a stream of bytes, write a function that can find the string elvis and replace it with the capitalized string Elvis. We are talking about a king, so his name should always be capitalized.

Here is a link to the solution: play.golang.org/p/n_SzF4Cer4

Here is a link to benchmarks: play.golang.org/p/TnXrxJVfLV

The listing shows two different functions that accomplish this task. This post will focus on the algOne function as it uses the io package. Use the algTwo function to experiment with memory and processor profiles yourself.

Here is the input we are going to use and the expected output from the algOne function.

Listing 1

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis

Below is a listing of the algOne function.

Listing 2

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

I want to know how well this function works and how much pressure it puts on the heap. To find out, let's run a benchmark.

Benchmarking

I wrote a benchmark that calls the algOne function to perform processing on the data stream.

Listing 3

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }

We can run this benchmark using go test with the -bench, -benchtime and -benchmem switches.

Listing 4

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8        2000000          2522 ns/op       117 B/op            2 allocs/op

After running the benchmark, we see that the algOne function allocates 2 values with a total cost of 117 bytes per operation. This is great, but we need to know which lines of code in the function are causing these allocations. To find out, we need to generate profiling data for this test.

Profiling

To generate the profiling data, run the benchmark again, but this time we will query the memory profile using the -memprofile switch.

Listing 5

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8        2000000          2570 ns/op       117 B/op            2 allocs/op

After completing the benchmark, the testing tool created two new files.

Listing 6

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go

The source code is located in the memcpu folder in the algOne function of the stream.go file and the benchmark function in the stream_test.go file. The two new files created are named mem.out and memcpu.test. The mem.out file contains the profile data, and the memcpu.test file, named after the folder, contains the test binary that we need to access the symbols when looking at the profile data.

With the profile data and the test binary, we can run the pprof tool to examine the profile data.

Listing 7

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _

When profiling memory and looking for low hanging fruit, you can use the -alloc_space option instead of the default -inuse_space. This will show you where each allocation is happening, whether it's in memory or not when you grab the profile.

In the input box (pprof) we can check the algOne function using the list command. This command takes a regular expression as an argument to find the function (s) you want to view.

Listing 8

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _

Based on this profile, we now know that input and buf are allocated on the heap. Since input is a pointer variable, the profile does indeed indicate that the bytes.Buffer value pointed to by the input pointer is allocated. So let's focus on input allocation first and understand why it happens.

We might assume that the allocation is happening because the call to bytes.NewBuffer shares the bytes.Buffer value that creates the call stack. However, having a value in the flat column (the first column in pprof's output) tells me that the value is being allocated because the algOne function splits it up in a way that makes it pile up.

I know that the flat column represents allocations in the function, so take a look at what the list command shows for the Benchmark function that calls algOne.

Listing 9

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _

Since there is only a value in the cum column (second column), this tells me that Benchmark is not allocating anything directly. All allocations come from function calls that are executed within this loop. You can see that all allocation numbers from these two calls to list are all the same.

We still don't know why the bytes.Buffer value is allocated. This is where the -gcflags "-m -m" switch of the go build command comes in handy. The profiler can only tell you which values are being moved to the heap, while build can tell you why.

Compiler reporting

Let's ask the compiler what decisions it makes for escape analysis in code.

Listing 10

$ go build -gcflags "-m -m"

This command produces a lot of output. We just need to search the output for whatever stream.go: 83 has, because stream.go is the name of the file that contains this code, and line 83 contains the bytes.buffer value construct. After searching, we find 6 lines.

Listing 11

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

we are interested in the first line we found by searching for stream.go: 83.

Listing 12

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

This confirms that the bytes.Buffer value has not disappeared as it was pushed onto the call stack. This happened because the bytes.NewBuffer call never happened, the code inside the function was inline.

Here is the line of code in question:

Listing 13

83     input := bytes.NewBuffer(data)

Because the compiler decided to inline the bytes.NewBuffer function call, the code I wrote converts to this:

Listing 14

input := &bytes.Buffer{buf: data}

This means that the algOne function creates the bytes.Buffer value directly. So now the question is, what causes the value to pop out of the algOne stack frame? This answer is in the other 5 lines that we found in the report.

Listing 15

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

These lines tell us that heap escape occurs in line 93 of code. The input variable is assigned to the interface value.

Interfaces

I don't remember doing interface value assignment at all in the code. However, if you look at line 93, it becomes clear what is going on.

Listing 16

 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }

The io.ReadFull call invokes the assignment of the interface. If you look at the definition of the io.ReadFull function, you can see that it accepts an input variable through an interface type.

Listing 17

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}

It looks like passing the bytes.Buffer address down the call stack and storing it in the value of the Reader interface causes an escape. We now know that the cost of using an interface is high: allocation and indirection. So if it's not clear exactly how an interface makes your code better, you probably don't need to use it. Here are a few guidelines I follow to test the use of interfaces in my code.

Use the interface when:

API users must provide implementation details.
The API has several implementations that they must support internally.
Parts of the API have been identified that may change and require separation.

Don't use the interface:

for the sake of using the interface.
to generalize the algorithm.
when users can declare their own interfaces.

Now we can ask ourselves, does this algorithm really need the io.ReadFull function? The answer is no, because the bytes.Buffer type has a set of methods that we can use. Using methods against a function's value can prevent allocations.

Let's change the code to remove the io package and use the Read method directly on the input variable.

This code change removes the need to import the io package, so to keep all line numbers the same, I use an empty identifier to import the io package. This will keep the imports on the list.

Listing 18

 12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

When we benchmark this code change, we can see that there is no more allocation for the bytes.Buffer value.

Listing 19

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8        2000000          1814 ns/op         5 B/op            1 allocs/op

We also see a performance improvement of around 29%. Time changed from 2570 ns / op to 1814 ns / op. Now that this is resolved, we can focus on allocating an auxiliary slice for buf. If we use the profiler again for the new profile data we just created, we can determine what exactly is causing the remaining allocations.

Listing 20

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])

The only remaining allocation is on line 89, which is for creating an auxiliary slice.

Stack frames

We want to know why is the allocation happening for the auxiliary slice for buf? Let's run the build again with the -gcflags "-m -m" option and search for stream.go: 89.

Listing 21

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89

The report states that the auxiliary array is "too large for the stack." This message is misleading. The point is not that the array is too large, but that the compiler does not know what size the auxiliary array is at compile time.

Values can only be pushed onto the stack if the compiler knows the size of the value at compile time. This is because the size of each stack frame for each function is calculated at compile time. If the compiler doesn't know the size of a value, it is heaped.

To demonstrate this, let's temporarily hardcode the slice size to 5 and run the benchmark again.

Listing 22

89     buf := make([]byte, 5)

There are no more allocations this time.

Listing 23

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8        3000000          1720 ns/op         0 B/op            0 allocs/op

If you take another look at the compiler report, you will see that nothing is being moved to the heap.

Listing 24

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape

Obviously, we cannot hardcode the slice size, so we will have to live with 1 allocation for this algorithm.

Allocations and performance

Compare the performance gains we've achieved with each refactoring.

Listing 25

Before any optimization
BenchmarkAlgorithmOne-8        2000000          2570 ns/op       117 B/op            2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8        2000000          1814 ns/op         5 B/op            1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8        3000000          1720 ns/op         0 B/op            0 allocs/op

We got a performance increase of about 29% due to the fact that we removed the bytes.Buffer allocation and a ~ 33% acceleration after removing all allocations. Allocations are where application performance can suffer.

Conclusion

Go has some amazing tools to help you understand the decisions the compiler makes about escape analysis. Based on this information, you can refactor your code to help keep values on the stack that shouldn't be on the heap. You shouldn't write a program with zero allocations, but you should strive to minimize allocations whenever possible.

Don't make performance a top priority when writing code because you don't want to guess what should be performing. Write the code and optimize it to achieve performance for the first priority task. This means focusing primarily on integrity, readability, and simplicity. Once you have a working program, determine if it is fast enough. If not, use the tools provided by the language to find and fix performance issues.

Language mechanics of memory profiling