CPU Cache Coherence and False Sharing for Cloud Backend Engineers

Quick summary: Two goroutines updating adjacent counters can stall each other on a single contended cache line, even on a c7g.8xlarge. This guide covers memory barriers, cache lines, and false sharing, and explains why placement groups do not fix application-level contention.

Key Takeaways

  • Two goroutines updating adjacent counters can stall each other on a single contended cache line, even on a c7g.8xlarge.
  • Graviton3 (June 2026) offers strong price/performance for Java and Go services, but false sharing on hot counters still collapses scalability long before network limits.
  • Benchmark pattern (hypothetical workload): in a Java 21 virtual-thread counter array with false sharing on adjacent AtomicLongs, throughput drops 8x (1.2M→150K ops/sec); padding to 64-byte cache lines restores 1.1M ops/sec on c7g.4xlarge Graviton3.

Graviton3 (June 2026) offers strong price/performance for Java and Go services—but false sharing on hot counters still collapses scalability long before network limits.

Benchmark pattern (hypothetical workload): in a Java 21 virtual-thread counter array with false sharing on adjacent AtomicLongs, throughput drops 8x (1.2M→150K ops/sec). Padding each counter to its own 64-byte cache line restores 1.1M ops/sec on a c7g.4xlarge (Graviton3).

Symptom → mechanism → AWS control

| Production symptom | Mechanism | AWS control |
|---|---|---|
| Scaling cores doesn’t scale throughput | False sharing invalidates cache lines | @Contended (Java), pad hot counters to 64 bytes |
| Noisy neighbor CPU spikes | Cache coherence traffic on shared memory | Pin workloads to dedicated instances, Graviton for price/perf |
| Latency jitter on lock-free code | MESI protocol coherence misses | Per-thread local accumulators, merge on flush |

Opinionated take: When horizontal scaling stops helping, check false sharing before buying bigger instances—it’s the silent killer on Graviton and x86 alike.

Mechanism

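CPUs cache data in fixed-size lines, 64 bytes on Graviton3 and on most x86 parts. When two threads mutate different variables that sit in the same line, each write must first take exclusive ownership of the whole line, which invalidates the copy held by the other core. The line then bounces between cores on every update (false sharing), even though the threads never touch the same variable. Memory barriers are not the cause: they order memory operations and do not flush caches. The invalidation comes from the coherence protocol.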

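Distributed systems have their own consistency mechanisms that run over the network, such as DynamoDB conditional writes. Do not confuse these with the CPU's MESI coherence protocol, which operates on cache lines inside a single machine.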

AWS services map

| Need | Service | Skip when |
|---|---|---|
| CPU profiling | CloudWatch Agent + perf or JFR on ECS/EKS | Fully managed Lambda with no profiling access |
| Graviton price-performance | c7g/m7g instances | x86-only dependencies without ARM builds |
| Dedicated tenancy | EC2 Dedicated Hosts | Shared tenancy with low CPU sensitivity |

| Scenario | Mitigation |
|---|---|
| Per-request metrics arrays | Pad structs to cache line; use per-core aggregators |
| Lock-free queues on EC2 | Align atomic slots; benchmark on same instance class as prod |
| NUMA on large instances | Pin threads; use c7g size matched to actual parallelism |

Placement groups reduce network latency—they do not fix false sharing in code.

When this advice breaks

  • I/O-bound Lambda — CPU cache irrelevant; optimize cold start and downstream calls.
  • Managed services — You do not tune RDS CPU cache; tune queries.

What to do this week

  1. Run perf c2c (or VTune on Intel x86 hosts) on the hottest lock-free path under load.
  2. Separate frequently updated atomics by at least 64 bytes in hot structs.
  3. Load test on the production instance family, not on a laptop.
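For the load-test step, a small timing harness is often enough to see whether padding matters on a given instance type. The sketch below is a Go analogue of the benchmark pattern described earlier, not the Java benchmark itself. The names `adjacent`, `padded`, and `hammer` are hypothetical, the 56-byte filler assumes a 64-byte line, and the timings depend entirely on the machine, so none are shown.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// adjacent keeps both counters within 16 bytes of each other.
type adjacent struct{ a, b atomic.Uint64 }

// padded separates the counters by an assumed 64-byte cache line.
type padded struct {
	a atomic.Uint64
	_ [56]byte
	b atomic.Uint64
	_ [56]byte
}

// hammer runs one goroutine per counter, each doing n atomic increments,
// and returns the wall-clock time for both to finish.
func hammer(a, b *atomic.Uint64, n int) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for _, c := range []*atomic.Uint64{a, b} {
		wg.Add(1)
		go func(c *atomic.Uint64) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				c.Add(1)
			}
		}(c)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 5_000_000
	var x adjacent
	var y padded
	fmt.Println("adjacent:", hammer(&x.a, &x.b, n))
	fmt.Println("padded:  ", hammer(&y.a, &y.b, n))
}
```

Run it several times on the production instance class and compare the two durations. A single run on a shared or busy host is too noisy to draw conclusions from.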

More in This Track

Part of the Engineering Guides library (June 2026).

What this guide doesn’t cover

JVM GC and object layout—see concurrency runtime track.

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.
