# Go Performance Reference ## Table of Contents 1. [pprof](#pprof) 2. [go tool trace](#go-tool-trace) 3. [Benchmarks](#benchmarks) 4. [Escape Analysis](#escape-analysis) 5. [Memory Optimization](#memory-optimization) 6. [String Performance](#string-performance) 7. [Struct Alignment](#struct-alignment) 8. [Map Performance](#map-performance) 9. [I/O Performance](#io-performance) 10. [Inlining](#inlining) 11. [Common Performance Anti-Patterns](#common-performance-anti-patterns) --- ## pprof ### Enable pprof in a Production Server ```go import ( "net/http" _ "net/http/pprof" // Side-effect import registers handlers on DefaultServeMux ) func main() { // Serve pprof on a separate port — never expose this publicly go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }() // ... start your actual server } ``` ### Collect and Analyze CPU Profiles ```bash # 30-second CPU profile from a running server go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 # Inside pprof interactive shell (pprof) top10 # Top 10 functions by CPU (pprof) list myFunc # Annotated source for a function (pprof) web # Open flame graph in browser (requires graphviz) (pprof) png > cpu.png # Export to image ``` ### Collect and Analyze Memory Profiles ```bash # Heap profile (in-use allocations) go tool pprof http://localhost:6060/debug/pprof/heap # Allocation profile (all allocations since start) go tool pprof http://localhost:6060/debug/pprof/allocs # Inside pprof (pprof) top # Top allocators (pprof) inuse_space # Sort by in-use bytes (pprof) alloc_objects # Sort by allocation count ``` ### Profile Goroutines ```bash # Goroutine profile — shows all running goroutines with stack traces go tool pprof http://localhost:6060/debug/pprof/goroutine # Or view in browser for a quick human-readable dump curl http://localhost:6060/debug/pprof/goroutine?debug=2 ``` ### Write Profiles Programmatically ```go import "runtime/pprof" // CPU profile f, _ := os.Create("cpu.prof") pprof.StartCPUProfile(f) defer pprof.StopCPUProfile() // Memory profile (write at end of program or specific checkpoint) f, _ := os.Create("mem.prof") runtime.GC() // Force GC for accurate snapshot pprof.WriteHeapProfile(f) f.Close() ``` ### Compare Two Profiles ```bash # Capture baseline and after a change, then diff them go tool pprof -base baseline.prof current.prof ``` --- ## go tool trace The tracer records goroutine scheduling, GC pauses, and syscalls at microsecond resolution. ### Record a Trace ```bash # From a live server curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out # Or programmatically import "runtime/trace" f, _ := os.Create("trace.out") trace.Start(f) defer trace.Stop() ``` ### Analyze a Trace ```bash go tool trace trace.out # Opens browser-based UI ``` Key views in the UI: - **Goroutine analysis**: Which goroutines ran, for how long, what blocked them - **View trace**: Timeline of all goroutines across P (processor) threads - **Minimum mutator utilization (MMU)**: Percentage of time your code ran vs GC ### Identify Scheduling Latency Look for goroutines spending time in "Runnable" state — this means they are ready to run but waiting for a P. Signs of over-subscription: too many goroutines competing for `GOMAXPROCS` slots. ```go // Instrument specific regions in the trace import "runtime/trace" ctx, task := trace.NewTask(ctx, "processOrder") defer task.End() trace.WithRegion(ctx, "validateInput", func() { validate(input) }) ``` --- ## Benchmarks ### Write Effective Benchmarks ```go func BenchmarkProcess(b *testing.B) { data := generateLargeInput() // Setup before timer b.ResetTimer() // Exclude setup from measurement for i := 0; i < b.N; i++ { Process(data) } } ``` ### Use b.StopTimer / b.StartTimer for Per-Iteration Setup ```go func BenchmarkSort(b *testing.B) { for i := 0; i < b.N; i++ { b.StopTimer() data := generateUnsortedSlice(1000) // Re-create per iteration b.StartTimer() sort.Ints(data) } } ``` ### Allocations Matter — Report Them ```go func BenchmarkParse(b *testing.B) { b.ReportAllocs() // Show allocs/op and B/op in output for i := 0; i < b.N; i++ { Parse(input) } } ``` ### Run Benchmarks ```bash go test -bench=. -benchmem -count=5 ./... # Run only matching benchmarks go test -bench=BenchmarkProcess -benchmem -run=^$ ./pkg/processor # -run=^$ suppresses tests, runs only benchmarks ``` ### Compare Results with benchstat ```bash go install golang.org/x/perf/cmd/benchstat@latest # Capture two runs go test -bench=. -count=10 ./... > before.txt # Make your change go test -bench=. -count=10 ./... > after.txt benchstat before.txt after.txt ``` Output shows statistical significance: `p < 0.05` means the difference is likely real, not noise. Use `-count=10` or more for reliable statistics. --- ## Escape Analysis ### Inspect Escape Decisions ```bash go build -gcflags='-m' ./... # Basic escape analysis go build -gcflags='-m=2' ./... # Verbose (shows escape reason) go test -gcflags='-m' ./... # On test files ``` ### Understand Heap vs Stack Values escape to the heap when: - Their address is returned or stored in a longer-lived structure - They are assigned to an interface - They are too large for the stack (default stack starts at 8KB, goroutines grow as needed but large locals still escape) - The compiler cannot prove the lifetime is bounded ```go // Stack allocated — does NOT escape func sumSquares(nums []int) int { total := 0 // total stays on stack for _, n := range nums { total += n * n } return total } // Heap allocated — escapes because address is returned func newCounter() *int { n := 0 return &n // n escapes: "moved to heap: n" } // Interface assignment causes escape func logValue(v interface{}) { // Passing int here allocates on heap fmt.Println(v) } ``` ### Reduce Allocations with Value Receivers ```go // BAD: Pointer causes allocation when assigned to interface type Point struct{ X, Y float64 } func (p *Point) String() string { return fmt.Sprintf("(%f, %f)", p.X, p.Y) } // GOOD: Value receiver, may stay on stack func (p Point) String() string { return fmt.Sprintf("(%f, %f)", p.X, p.Y) } ``` --- ## Memory Optimization ### Use sync.Pool for Frequently Allocated Short-Lived Objects ```go var bufPool = sync.Pool{ New: func() interface{} { return new(bytes.Buffer) }, } func formatMessage(data []byte) string { buf := bufPool.Get().(*bytes.Buffer) defer func() { buf.Reset() bufPool.Put(buf) }() buf.Write(data) // ... format into buf return buf.String() } ``` Pool objects may be collected by GC at any time. Never store state that must survive across GC cycles in a pool. ### Pre-Allocate Slices ```go // BAD: O(n) reallocations as slice grows var results []Result for _, item := range items { results = append(results, process(item)) } // GOOD: Single allocation results := make([]Result, 0, len(items)) for _, item := range items { results = append(results, process(item)) } ``` ### Reuse Slices Across Calls ```go type Processor struct { buf []byte // Reused across calls } func (p *Processor) Process(input []byte) []byte { p.buf = p.buf[:0] // Reset length, keep capacity p.buf = append(p.buf, input...) // ... transform p.buf return p.buf } ``` ### Avoid Large Value Copies ```go type LargeStruct struct { Data [4096]byte // ... } // BAD: Copies 4KB on every call func processLarge(s LargeStruct) { ... } // GOOD: Pass pointer func processLarge(s *LargeStruct) { ... } ``` --- ## String Performance ### Use strings.Builder for Concatenation ```go // BAD: Creates a new string on every iteration var result string for _, s := range parts { result += s + ", " } // GOOD: Single allocation var sb strings.Builder sb.Grow(estimatedSize) // Pre-grow if you know the size for _, s := range parts { sb.WriteString(s) sb.WriteString(", ") } result := sb.String() ``` ### Convert Between []byte and string Without Allocation The standard `string(b)` and `[]byte(s)` conversions always allocate. For read-only access within a single goroutine, use `unsafe`: ```go import "unsafe" // []byte to string — zero copy, safe only if you don't modify b afterward func bytesToString(b []byte) string { return unsafe.String(unsafe.SliceData(b), len(b)) } // string to []byte — zero copy, safe only for reads func stringToBytes(s string) []byte { return unsafe.Slice(unsafe.StringData(s), len(s)) } ``` These are valid as of Go 1.20. Do not use the older `*(*string)(unsafe.Pointer(&b))` pattern. ### Avoid fmt.Sprintf for Simple Concatenation ```go // BAD: Heap allocation, format parsing overhead key := fmt.Sprintf("%s:%d", prefix, id) // GOOD: strconv is faster for basic conversions key := prefix + ":" + strconv.Itoa(id) // GOOD: For multiple parts, strings.Join or Builder key := strings.Join([]string{prefix, strconv.Itoa(id)}, ":") ``` --- ## Struct Alignment The CPU reads memory in aligned chunks. Padding bytes are inserted to satisfy alignment requirements. Reordering fields from largest to smallest eliminates wasted bytes. ```go // BAD: 24 bytes due to padding type BadLayout struct { Active bool // 1 byte + 7 bytes padding Count int64 // 8 bytes Flag bool // 1 byte + 7 bytes padding } // GOOD: 16 bytes, no padding type GoodLayout struct { Count int64 // 8 bytes Active bool // 1 byte Flag bool // 1 byte + 6 bytes padding (to align to 8) } ``` ### Check Sizes and Padding ```go import "unsafe" fmt.Println(unsafe.Sizeof(BadLayout{})) // 24 fmt.Println(unsafe.Sizeof(GoodLayout{})) // 16 ``` ### Use fieldalignment to Find Problems Automatically ```bash go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest fieldalignment ./... # Report structs with inefficient layout fieldalignment -fix ./... # Rewrite fields automatically ``` ### Cache Line Considerations for Concurrent Structs Fields accessed by different goroutines should be on separate cache lines (64 bytes) to prevent false sharing: ```go type Counters struct { reads int64 _ [56]byte // Pad to fill cache line writes int64 } ``` --- ## Map Performance ### Pre-Size Maps ```go // BAD: Map grows incrementally, triggering multiple rehashes m := make(map[string]int) for _, item := range items { m[item.Key] = item.Value } // GOOD: Single allocation m := make(map[string]int, len(items)) for _, item := range items { m[item.Key] = item.Value } ``` ### Use Switch for Small Key Sets For fewer than ~8 fixed string keys, a switch statement is faster than a map due to branch prediction and no hashing overhead: ```go // Faster for small, known sets func httpMethodCode(method string) int { switch method { case "GET": return 0 case "POST": return 1 case "PUT": return 2 case "DELETE": return 3 default: return -1 } } ``` ### Choose sync.Map vs Sharded Map `sync.Map` is optimized for two specific cases: 1. Write-once, read-many (mostly reads after initial population) 2. Many goroutines reading/writing disjoint keys For general concurrent access with frequent writes, a sharded map with per-shard mutexes outperforms `sync.Map`: ```go const numShards = 256 type ShardedMap struct { shards [numShards]struct { sync.RWMutex m map[string]interface{} } } func (sm *ShardedMap) shard(key string) int { h := fnv.New32() h.Write([]byte(key)) return int(h.Sum32()) % numShards } func (sm *ShardedMap) Get(key string) (interface{}, bool) { s := &sm.shards[sm.shard(key)] s.RLock() v, ok := s.m[key] s.RUnlock() return v, ok } ``` --- ## I/O Performance ### Always Wrap with bufio Unbuffered reads and writes issue a syscall for every call. Buffering batches them: ```go // BAD: Syscall per line f, _ := os.Open("data.txt") scanner := bufio.NewScanner(f) // This is already buffered — correct // BAD: Syscall per Write call f, _ := os.Create("out.txt") fmt.Fprintln(f, line) // Goes through direct write // GOOD: Buffered writes f, _ := os.Create("out.txt") bw := bufio.NewWriterSize(f, 64*1024) // 64KB buffer defer bw.Flush() fmt.Fprintln(bw, line) ``` ### Use io.Copy for Efficient Transfers `io.Copy` uses a 32KB internal buffer and delegates to `sendfile(2)` or `splice(2)` when both sides support it (e.g., `*os.File` to `*net.TCPConn`): ```go // Efficient file download with no intermediate allocation func serveFile(w http.ResponseWriter, path string) error { f, err := os.Open(path) if err != nil { return err } defer f.Close() _, err = io.Copy(w, f) return err } ``` ### Use io.Pipe for Producer-Consumer Pipelines ```go pr, pw := io.Pipe() go func() { defer pw.Close() json.NewEncoder(pw).Encode(largeStruct) // Streams without buffering whole JSON }() http.Post(url, "application/json", pr) ``` ### Limit Reads to Avoid Memory Exhaustion ```go const maxBodySize = 1 << 20 // 1MB r.Body = http.MaxBytesReader(w, r.Body, maxBodySize) if err := json.NewDecoder(r.Body).Decode(&req); err != nil { http.Error(w, "request too large or invalid", http.StatusBadRequest) return } ``` --- ## Inlining The compiler inlines small functions to eliminate call overhead. A function is inlined when its "cost" (an internal AST node count) stays below a threshold (~80 nodes). ### Check What Gets Inlined ```bash go build -gcflags='-m' ./... # Output includes: # ./pkg/math.go:12:6: can inline Add # ./pkg/handler.go:45:12: inlining call to Add # ./pkg/handler.go:60:5: cannot inline processLarge: function too complex ``` ### Write Inlineable Functions ```go // Inlineable: small, no closures, no defer func clamp(v, min, max int) int { if v < min { return min } if v > max { return max } return v } // NOT inlineable: contains a closure func makeAdder(n int) func(int) int { return func(x int) int { return x + n } } ``` ### Prevent Inlining ```go //go:noinline // Force a function to never be inlined (useful for benchmarking) func expensiveOperation(data []byte) Result { // ... } ``` Use `//go:noinline` in benchmarks when you want to measure the cost of a function call itself, or to prevent the compiler from optimizing away a call you want to measure. --- ## Common Performance Anti-Patterns ### Reflection in Hot Paths Reflection bypasses type-system optimizations, performs map lookups, and allocates. Avoid in code called frequently: ```go // BAD: reflect.ValueOf allocates, method lookup is slow func setField(obj interface{}, name string, value interface{}) { v := reflect.ValueOf(obj).Elem() v.FieldByName(name).Set(reflect.ValueOf(value)) } // GOOD: Generated code or type switch func applyUpdate(u *User, field string, value interface{}) { switch field { case "Name": u.Name = value.(string) case "Email": u.Email = value.(string) } } ``` ### fmt.Sprintf in Hot Paths `fmt.Sprintf` parses a format string, uses reflection, and typically allocates: ```go // BAD in hot path key := fmt.Sprintf("user:%d:session:%s", userID, sessionID) // GOOD: strconv + concatenation key := "user:" + strconv.FormatInt(userID, 10) + ":session:" + sessionID // GOOD for complex formatting: pre-build a template or use strings.Builder ``` ### Unnecessary Allocations in Loops ```go // BAD: Allocates a new map every iteration for _, item := range items { m := map[string]int{"count": item.Count} process(m) } // GOOD: Allocate once, reuse m := make(map[string]int, 1) for _, item := range items { m["count"] = item.Count process(m) // Clear before next iteration if needed for k := range m { delete(m, k) } } ``` ### Goroutine Leak from Unclosed Channels ```go // BAD: Goroutine blocked forever if consumer exits early func generate(nums ...int) <-chan int { out := make(chan int) go func() { for _, n := range nums { out <- n // Blocks forever if nobody reads } close(out) }() return out } // GOOD: Use context for cancellation func generate(ctx context.Context, nums ...int) <-chan int { out := make(chan int, len(nums)) go func() { defer close(out) for _, n := range nums { select { case out <- n: case <-ctx.Done(): return } } }() return out } ``` ### Copying a Mutex Mutexes must not be copied after first use. Copying a locked mutex will deadlock; copying an unlocked mutex silently creates a new, independent lock: ```go // BAD: Copies the mutex type Cache struct{ mu sync.Mutex; data map[string]int } func copyCache(c Cache) Cache { return c } // Copies mu — wrong // GOOD: Always pass and return pointers for types containing mutexes func processCache(c *Cache) { ... } ``` ### Defer in a Tight Loop Defers execute at function return, not loop iteration. Inside a loop, defers pile up and all run together at the end: ```go // BAD: All files stay open until the function returns for _, path := range paths { f, _ := os.Open(path) defer f.Close() // Runs at function exit, not loop end process(f) } // GOOD: Wrap in a closure or extract to a helper for _, path := range paths { func() { f, _ := os.Open(path) defer f.Close() // Now runs at end of this closure process(f) }() } ```