Why Goroutines Don’t Scale the Way You Think
Goroutines feel almost free. You can spawn tens of thousands with minimal memory and clean syntax, and everything seems to “just work.” That illusion breaks in production. Go doesn’t scale with the number of goroutines — it scales with how well you understand the scheduler, CPU limits, and the hidden costs behind concurrency.
Goroutines Are Cheap — Execution Is Not
A goroutine starts small (~2 KB stack), but running it is not free. Go uses an M:N scheduler: many goroutines (G) are multiplexed onto OS threads (M) through logical processors (P), which hold the run queues. The key constraint is P, not the number of goroutines.
If goroutines block or compete heavily, the runtime compensates by spawning more OS threads. At that point, you’re no longer paying “goroutine costs” — you’re paying kernel-level scheduling, thread overhead, and CPU contention. Benchmarks with sleeping goroutines don’t reflect real CPU-bound workloads.
GOMAXPROCS Is the Real Limit
GOMAXPROCS controls the number of logical processors (P), and therefore how many goroutines can execute Go code in parallel. No matter how many goroutines you create, only GOMAXPROCS of them can run at once.
Each P has a local run queue (up to 256 goroutines). If one CPU-heavy task occupies a P, everything behind it waits. Even with idle cores elsewhere, those goroutines are effectively stalled until preemption kicks in (~10ms). This creates latency spikes in mixed workloads.
Containers Lie About CPU
In Docker or Kubernetes, runtime.NumCPU() reads the host CPU count, not container limits. A container with 2 vCPUs on a 64-core machine will still see 64.
Go sets GOMAXPROCS to 64 → creates excessive parallelism → Linux enforces CPU quotas via CFS throttling → your app gets paused unpredictably.
This shows up as:
- latency spikes
- no clear CPU saturation
- confusing performance behavior
Fix: automatically align GOMAXPROCS with cgroup limits (e.g., automaxprocs).
Work-Stealing Has a Cost
Go balances load using work-stealing: idle processors take goroutines from others. While this improves utilization, it destroys CPU cache locality.
When a goroutine moves between cores, its data becomes “cold,” forcing reloads from slower memory. For compute-heavy workloads (crypto, matrices), this can significantly reduce performance.
Short-lived, high-churn goroutines amplify this problem.
Blocking Syscalls Break the Model
GOMAXPROCS only limits goroutines executing Go code. Blocking system calls (disk I/O, cgo, some network ops) detach threads from the scheduler.
The runtime spawns new OS threads to keep Ps busy. A burst of blocking operations can create hundreds or thousands of threads.
Consequences:
- high memory usage (thread stacks)
- scheduler pressure
- risk of hitting thread limits or OOM
Practical Takeaways
- Goroutine count is not a scaling strategy
- GOMAXPROCS defines real parallelism
- Container CPU limits must be respected
- CPU-bound tasks block execution queues
- Work-stealing trades balance for cache loss
- Blocking I/O can explode thread count
What Actually Works
- Use bounded worker pools to control concurrency
- Separate CPU-bound and I/O-bound workloads
- Align GOMAXPROCS with real CPU limits
- Profile the scheduler before optimizing
Final Thought
Goroutines are lightweight, but the system behind them is not. Performance issues in Go are rarely about code correctness — they come from mismatched assumptions about how concurrency maps to real hardware.
Learn more at https://krun.pro/gomaxprocs-trap/
