CPU Cache Coherence and False Sharing for Cloud Backend Engineers
Quick summary: Two goroutines updating adjacent counters can stall each other on cache-coherence traffic on a c7g.8xlarge. This guide covers memory barriers, cache lines, and false sharing, and explains why placement groups do not fix application-level contention.
Key Takeaways
- Two goroutines updating adjacent counters can stall each other on cache-coherence traffic on a c7g.8xlarge.
- Graviton3 (June 2026) offers strong price/performance for Java and Go services, but false sharing on hot counters can still collapse scalability long before network limits.
- Benchmark pattern (hypothetical workload): a Java 21 virtual-thread counter array with false sharing on adjacent AtomicLongs drops throughput 8x (1.2M→150K ops/sec); padding to 64-byte cache lines restores 1.1M ops/sec on c7g.4xlarge Graviton3.
Graviton3 (June 2026) offers strong price/performance for Java and Go services, but false sharing on hot counters can still collapse scalability long before network limits.
Benchmark pattern (hypothetical workload): in a Java 21 virtual-thread counter array, false sharing on adjacent AtomicLongs drops throughput 8x (1.2M→150K ops/sec). Padding each counter to a 64-byte cache line restores 1.1M ops/sec on c7g.4xlarge Graviton3.
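The benchmark pattern above is described for Java. The sketch below reproduces the same shape in Go so it can be run on any instance. It is a minimal, hypothetical harness, not the workload behind the quoted numbers: the names plainSlot, paddedSlot, hammer, runPlain, and runPadded are invented for illustration, and the 64-byte line size is an assumption taken from the text.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// plainSlot is 8 bytes wide, so up to eight neighbouring slots in a slice
// can land in the same 64-byte cache line.
type plainSlot struct {
	n atomic.Int64
}

// paddedSlot is 64 bytes wide (8-byte counter + 56 bytes of padding), so
// neighbouring slots in a slice start 64 bytes apart.
// Assumption: the cache line is 64 bytes.
type paddedSlot struct {
	n atomic.Int64
	_ [56]byte
}

// hammer starts one goroutine per worker. Each goroutine increments only its
// own counter iters times. It returns the elapsed wall time and the sum of
// all counters.
func hammer(workers, iters int, counter func(i int) *atomic.Int64) (time.Duration, int64) {
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(c *atomic.Int64) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				c.Add(1)
			}
		}(counter(w))
	}
	wg.Wait()
	elapsed := time.Since(start)

	var total int64
	for w := 0; w < workers; w++ {
		total += counter(w).Load()
	}
	return elapsed, total
}

// runPlain uses tightly packed counters (false sharing is possible).
func runPlain(workers, iters int) (time.Duration, int64) {
	slots := make([]plainSlot, workers)
	return hammer(workers, iters, func(i int) *atomic.Int64 { return &slots[i].n })
}

// runPadded uses one counter per 64-byte slot.
func runPadded(workers, iters int) (time.Duration, int64) {
	slots := make([]paddedSlot, workers)
	return hammer(workers, iters, func(i int) *atomic.Int64 { return &slots[i].n })
}

func main() {
	const workers, iters = 4, 2_000_000
	d1, n1 := runPlain(workers, iters)
	d2, n2 := runPadded(workers, iters)
	fmt.Printf("adjacent: %v (%d increments)\n", d1, n1)
	fmt.Printf("padded:   %v (%d increments)\n", d2, n2)
}
```

Both variants do the same amount of work and report the same increment count, so any gap in elapsed time comes from memory layout. The size of that gap depends on core count, scheduler placement, and instance type, which is why it is worth running on the same instance class as production rather than trusting a number from elsewhere.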
Symptom → mechanism → AWS control
| Production symptom | Mechanism | AWS control |
|---|---|---|
| Scaling cores doesn’t scale throughput | False sharing invalidates cache lines | @Contended (Java), pad hot counters to 64 bytes |
| Noisy neighbor CPU spikes | Cache coherence traffic on shared memory | Pin workloads to dedicated instances, Graviton for price/perf |
| Latency jitter on lock-free code | MESI protocol coherence misses | Per-thread local accumulators, merge on flush |
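The last row of the table, per-thread local accumulators merged on flush, can be sketched in Go as follows. This is an illustrative outline under assumptions: countEvents and its parameters are hypothetical names, and the flush interval would need tuning against how stale the shared total is allowed to be.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countEvents runs workers goroutines that each handle perWorker events.
// Each goroutine counts in a local variable, which lives in a register or on
// its own stack, and only touches the shared total every flushEvery events
// and once more when it finishes.
func countEvents(workers, perWorker, flushEvery int) int64 {
	var total atomic.Int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64
			for i := 0; i < perWorker; i++ {
				local++
				if local == int64(flushEvery) {
					total.Add(local) // one shared write per batch
					local = 0
				}
			}
			total.Add(local) // final flush of the remainder
		}()
	}
	wg.Wait()
	return total.Load()
}

func main() {
	fmt.Println(countEvents(8, 100_000, 1024)) // prints 800000
}
```

The shared counter still exists, but it is written roughly once per flushEvery events instead of once per event, so coherence traffic on its cache line drops by about the same factor. The trade-off is that readers of the total see a value that lags by up to one batch per worker.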
Opinionated take: When horizontal scaling stops helping, check false sharing before buying bigger instances—it’s the silent killer on Graviton and x86 alike.
Mechanism
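CPUs cache data in fixed-size lines, 64 bytes on Graviton3 and on current x86 server parts. When two threads on different cores write to different variables that sit in the same line, the coherence protocol invalidates the other core's copy on every write, so the line bounces between cores. Memory barriers are a separate concern: they order memory operations and do not flush caches.
Distributed systems add their own coherence layer over the network (for example, DynamoDB conditional writes). Do not confuse it with the CPU's MESI protocol.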
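To make the CPU-level mechanism concrete, the following Go sketch checks whether two fields of a struct can fall in the same cache line. The struct names, the sameLine helper, and the 64-byte constant are assumptions for illustration; the check works on field offsets only and does not measure anything at runtime.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// cacheLineSize is an assumption: 64 bytes, as stated in the text.
const cacheLineSize = 64

// adjacent keeps two independently updated counters side by side
// (offsets 0 and 8).
type adjacent struct {
	a atomic.Int64
	b atomic.Int64
}

// padded inserts filler so the two counters start 64 bytes apart
// (offsets 0 and 64). Because the distance is a full line, they cannot
// share a line whatever the struct's base address is.
type padded struct {
	a atomic.Int64
	_ [cacheLineSize - 8]byte
	b atomic.Int64
}

// sameLine reports whether two byte offsets fall in the same cache line,
// assuming the struct itself starts on a line boundary.
func sameLine(off1, off2 uintptr) bool {
	return off1/cacheLineSize == off2/cacheLineSize
}

// adjacentShareLine checks the unpadded layout.
func adjacentShareLine() bool {
	var s adjacent
	return sameLine(unsafe.Offsetof(s.a), unsafe.Offsetof(s.b))
}

// paddedShareLine checks the padded layout.
func paddedShareLine() bool {
	var s padded
	return sameLine(unsafe.Offsetof(s.a), unsafe.Offsetof(s.b))
}

func main() {
	fmt.Println("adjacent fields share a line:", adjacentShareLine())
	fmt.Println("padded fields share a line:", paddedShareLine())
}
```

A check like this can live in a unit test next to a hot struct, so a later field reorder that puts two hot atomics back in one line fails the build instead of showing up as a scaling regression.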
AWS services map
| Need | Service | Skip when |
|---|---|---|
| CPU profiling | CloudWatch Agent + perf or JFR on ECS/EKS | Fully managed Lambda with no profiling access |
| Graviton price-performance | c7g/m7g instances | x86-only dependencies without ARM builds |
| Dedicated tenancy | EC2 dedicated hosts | Shared tenancy with low CPU sensitivity |
| Scenario | Mitigation |
|---|---|
| Per-request metrics arrays | Pad structs to cache line; use per-core aggregators |
| Lock-free queues on EC2 | Align atomic slots; benchmark on same instance class as prod |
| NUMA on large instances | Pin threads; use c7g size matched to actual parallelism |
Placement groups reduce network latency—they do not fix false sharing in code.
When this advice breaks
- I/O-bound Lambda — CPU cache irrelevant; optimize cold start and downstream calls.
- Managed services — You do not tune RDS CPU cache; tune queries.
What to do this week
- Run perf c2c or VTune on the hottest lock-free path under load.
- Separate frequently updated atomics by 64 bytes in hot structs.
- Load test on the production instance family, not a laptop.
More in This Track
Part of the Engineering Guides library (June 2026).
What this guide doesn’t cover
JVM GC and object layout—see concurrency runtime track.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.