Flame Graphs for Go With pprof
This week, I was working on a Go program and I wanted to understand which part was taking the most time. I had seen some people use flame graphs for this, but had never made one myself, so I decided to try it out. It took a little time to figure out the right tools to use, but once I did it was pretty easy. Here’s what one looks like (full size):
On the X axis, a wider bar means more time spent, and on the Y axis you can see the call stack (functions lower down are calling functions higher up). There’s no particular meaning to the colors.
Unfortunately I can’t share the actual program I was working on, but I’ll show you the steps on a little program called whoami, which starts a web server that responds to each request by simply writing back the IP address of the requester:
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func handle(w http.ResponseWriter, r *http.Request) {
	log.Printf("handling request from: %s", r.RemoteAddr)
	if _, err := w.Write([]byte(r.RemoteAddr)); err != nil {
		log.Printf("could not write IP: %s", err)
	}
}
Go’s standard library includes some tools for profiling the running program through its various pprof packages and utilities. Here, I’m importing net/http/pprof, which exposes /debug/pprof endpoints on the DefaultServeMux.
Profiling our web server
I want to make sure whoami is actually serving requests when I profile it, so I’m using wrk to generate lots of requests by running wrk -d 30s 'http://localhost:8080'. With that running, I can fetch a 20 second CPU profile from the pprof server:
$ go tool pprof \
-raw -output=cpu.txt \
'http://localhost:8080/debug/pprof/profile?seconds=20'
This creates a file called cpu.txt containing the decoded pprof samples, which are what I need to build my flame graph.
Brendan Gregg, the inventor of flame graphs, has published some scripts to turn pprof output into an interactive flame graph. Since whoami is a Go program, I can use stackcollapse-go.pl to convert the samples to the right format for flamegraph.pl (both from this repository):
$ ./stackcollapse-go.pl cpu.txt | ./flamegraph.pl > flame.svg
Make it faster!
One thing I noticed though is that about 30% of the time it takes to serve each request seems to be spent in log.Printf, which needs to make a write system call to print the message to the terminal:
Maybe we can make our server faster by removing the logging? But to know if we can make it “faster,” we need to know how fast it is right now.
One interesting thing about flame graphs is that they don’t measure time in seconds, they measure it in samples. When you run pprof, it checks what your program is doing 100 times per second, so the flame graph is just an aggregation of “how many times was each unique stack trace sampled.” (This means that if you have a function that’s called rarely, it might not even appear in the flame graph at all!)
So to tell how fast whoami is in absolute terms, we can use wrk to gather some initial statistics. I want to run wrk again (rather than use the results from when I was profiling) because profiling your program will slow it down.
$ wrk -d 10s 'http://localhost:8080'
Running 10s test @ http://localhost:8080
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 152.41us 51.27us 2.81ms 86.24%
Req/Sec 31.30k 2.27k 39.31k 72.77%
628837 requests in 10.10s, 76.76MB read
Requests/sec: 62261.70
Transfer/sec: 7.60MB
When I remove the call to log.Printf and re-run the server, wrk now reports:
$ wrk -d 10s 'http://localhost:8080'
Running 10s test @ http://localhost:8080
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 117.23us 30.28us 1.02ms 78.11%
Req/Sec 40.28k 3.09k 47.51k 69.80%
809605 requests in 10.10s, 98.83MB read
Requests/sec: 80160.78
Transfer/sec: 9.79MB
Sure enough, it looks like the average latency decreased by around 23% (from 152 to 117 µs), and the requests per second correspondingly increased by around 29%, from 62k to 80k!
I don’t have super high confidence in this measurement though. I’m doing this all on my laptop, so I’m not sure if the system calls that wrk is making to send the requests are slowing down the system calls whoami is making to read the requests and write the responses at all.
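One way to reduce that interference a little is to measure in-process with httptest instead of over a real socket. This rough sketch (the request count and measure helper are my own, and it’s no substitute for wrk) times a batch of serial requests against the handler:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func handle(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte(r.RemoteAddr))
}

// measure issues n serial requests against an in-process test server
// and returns the observed requests per second.
func measure(n int) float64 {
	srv := httptest.NewServer(http.HandlerFunc(handle))
	defer srv.Close()

	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return float64(n) / time.Since(start).Seconds()
}

func main() {
	fmt.Printf("%.0f req/s over 2000 serial requests\n", measure(2000))
}
```

Because the requests are serial, this really measures per-request latency rather than peak throughput, so treat the number as a rough sanity check, not a benchmark.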
tl;dr: How to Make a Flame Graph from a pprof source
Download the scripts from Brendan Gregg’s FlameGraph repo and then, assuming <source> is either a pprof file or URL, run these commands:
$ go tool pprof -raw -output=cpu.txt <source>
$ ./stackcollapse-go.pl cpu.txt | ./flamegraph.pl > cpu.svg
You can also use pprof's web UI to do this without needing any external scripts:
$ go tool pprof -http=: <source>
Then navigate to View > Flame Graph in the pprof UI that opens in your browser.