Engineering Guides
Systems engineering for AWS architects who need the why before the console click
Nine learning tracks from TCP and transaction isolation to Kafka and Kubernetes—each guide maps a production symptom to the underlying mechanism and the AWS control that fixes it. Updated June 2026 for VPC Lattice, HTTP/3 on CloudFront, AMP, and Java 21 virtual threads.
Last hub review: June 2026 · Reviewed by AWS-certified architects (Solutions Architect – Professional)
Start here
High-traffic guides architects bookmark first: CAP on multi-Region AWS, MSK exactly-once, HTTP/3 on CloudFront, connection pool exhaustion, Prometheus cardinality, and service mesh traffic shifting.
Modern Web Transport on AWS: TCP Congestion, HTTP/2, HTTP/3, and QUIC
Packet loss on mobile networks still causes head-of-line blocking for HTTP/1.1 and HTTP/2 over TCP, but HTTP/3 only helps if CloudFront terminates QUIC and your origin connection pools are sized for multiplexed streams. This guide connects Reno, Cubic, BBR, HPACK, and QUIC to ALB and CloudFront decisions.
Database Deadlocks, Connection Pool Exhaustion, and Prepared Statements on RDS
Too many "too many connections" pages are fixed by raising max_connections—which trades one outage for OOM on the writer. This guide traces deadlocks, pool sizing, RDS Proxy, and prepared statement caching on Aurora.
CAP Theorem in Practice on AWS: What Architects Actually Need for Multi-Region
CAP is not a trivia question. It is the reason your global DynamoDB table shows stale inventory, or why Aurora Global Database reads lag 80 ms behind the writer. This guide maps partition tolerance, consistency, and availability trade-offs to concrete AWS controls.
Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics
Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK, and explains when SQS FIFO is the simpler choice.
Prometheus Cardinality Explosion on AWS: AMP, EMF, and Cost-Aware Metrics
That `user_id` label on every HTTP metric turns Amazon Managed Service for Prometheus (AMP) into a five-figure line item. This guide explains cardinality mechanics, EMF vs remote write, and Application Signals defaults worth disabling.
Service Mesh Traffic Shifting: VPC Lattice, Istio on EKS, and App Mesh EOL
App Mesh is a legacy path. New meshes should start with VPC Lattice for AWS-native east-west traffic, or Istio on EKS when you need full L7 policy. This guide covers traffic shifting without duplicating load balancers per service.
How we structure every guide
Symptom → mechanism → AWS control—not a glossary dump.
Symptom
Start with what broke in production—stale reads, partition rebalance stalls, cardinality bills, or TLS handshake latency—not the textbook definition.
Mechanism
Name the systems primitive underneath: isolation levels, cooperative assignors, cache lines, or QUIC loss recovery. That is what your design review should debate.
AWS control
Map the mechanism to a concrete knob—RDS Proxy max connections, MSK rebalance protocol, AMP workspace limits, CloudFront HTTP/3 viewer policy.
Ship
Each guide ends with a one-week checklist and honest gaps. Pair with How-To Guides when you are ready to paste Terraform or CDK.
Networking & Protocol Engineering
TCP through QUIC, TLS termination, and server I/O primitives—mapped to ALB, CloudFront, and EC2 tuning decisions.
4 guides · ~9 min total read
Modern Web Transport on AWS: TCP Congestion, HTTP/2, HTTP/3, and QUIC
Packet loss on mobile networks still causes head-of-line blocking for HTTP/1.1 and HTTP/2 over TCP, but HTTP/3 only helps if CloudFront terminates QUIC and your origin connection pools are sized for multiplexed streams. This guide connects Reno, Cubic, BBR, HPACK, and QUIC to ALB and CloudFront decisions.
TLS 1.3 Handshake Internals on AWS: ALB, CloudFront, and ACM
A full TLS handshake on every API call adds RTTs your p99 cannot afford. This guide walks through the TLS 1.3 1-RTT handshake, session resumption, ACM certificate rotation, and security policies on ALB and CloudFront.
High-Concurrency Server I/O: epoll, Syscalls, and Zero-Copy on AWS EC2
C10k is solved until syscall overhead and context switches eat your Graviton cores. epoll, sendfile, and SO_REUSEPORT behaviors on EC2—and why Lambda caps concurrency differently.
CPU Cache Coherence and False Sharing for Cloud Backend Engineers
Two goroutines updating adjacent counters can saturate the memory bus on a c7g.8xlarge. This guide covers memory barriers, cache lines, and false sharing, and explains why placement groups do not fix application-level contention.
Database Internals & Performance
Isolation levels, storage engines, connection pools, and sharding—wired to RDS, Aurora, and DynamoDB.
6 guides · ~32 min total read
PostgreSQL Transaction Isolation and ACID vs BASE on AWS RDS and Aurora
Serializable sounds safest until your checkout times out under row locks. This guide maps READ COMMITTED, REPEATABLE READ, and SERIALIZABLE to RDS/Aurora defaults—and when DynamoDB conditional writes are the BASE alternative.
B-Tree vs LSM and Query Planner Internals on AWS Databases
Why Aurora PostgreSQL loves B-tree indexes on OLTP but DynamoDB feels like an LSM—and how cost-based optimization surprises you when statistics go stale on RDS.
Database Deadlocks, Connection Pool Exhaustion, and Prepared Statements on RDS
Too many "too many connections" pages are fixed by raising max_connections—which trades one outage for OOM on the writer. This guide traces deadlocks, pool sizing, RDS Proxy, and prepared statement caching on Aurora.
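The pool-sizing arithmetic behind that trade-off can be sketched in a few lines. This is an illustrative calculation only: the function names, the task counts, and the `reserved` slots are assumptions made for the example, not values taken from the guide or from RDS defaults.

```python
def total_client_connections(app_instances: int, pool_size: int) -> int:
    """Worst case: every application instance fills its whole pool."""
    return app_instances * pool_size


def headroom(max_connections: int, app_instances: int, pool_size: int,
             reserved: int = 0) -> int:
    """Connections left on the writer after all pools fill up.

    `reserved` is a hypothetical allowance for admin and monitoring sessions.
    A negative result means some clients will be refused.
    """
    return max_connections - reserved - total_client_connections(
        app_instances, pool_size
    )


# Hypothetical fleet: 40 tasks, 20 connections each, writer capped at 500.
print(headroom(max_connections=500, app_instances=40, pool_size=20, reserved=10))
# Halving the task count (or pooling through a proxy) restores headroom.
print(headroom(max_connections=500, app_instances=20, pool_size=20, reserved=10))
```

Running the numbers before an autoscaling change shows whether the fix belongs in pool size, instance count, or a connection multiplexer, rather than in `max_connections`.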
PostgreSQL Vacuum, Index Bloat, and Sharding Hot Partitions on AWS
Autovacuum cannot keep up after Black Friday bulk deletes—and your BRIN index is not helping point lookups. Vacuum strategy on Aurora, plus Aurora Limitless and DynamoDB hot key mitigation.
RDS vs Aurora: Read Replicas, Failover, and When to Switch
When to Use RDS vs Aurora (Production Decision Guide)
Distributed Systems Architecture
CAP, coordination, consensus, and event-sourced patterns on AWS multi-Region workloads.
7 guides · ~35 min total read
CAP Theorem in Practice on AWS: What Architects Actually Need for Multi-Region
CAP is not a trivia question. It is the reason your global DynamoDB table shows stale inventory, or why Aurora Global Database reads lag 80 ms behind the writer. This guide maps partition tolerance, consistency, and availability trade-offs to concrete AWS controls.
CRDTs and Eventual Consistency Anti-Patterns on AWS
Last-write-wins is not a CRDT—it is how Global Tables lose cart merges. When to use counters, OR-Sets, and conflict-free merges vs when to keep a single Aurora writer.
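To make the contrast with last-write-wins concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class name and the Region-style replica IDs are illustrative assumptions; this is not how Global Tables resolve conflicts.

```python
from __future__ import annotations


class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: GCounter) -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Two hypothetical replicas accept writes concurrently, then exchange state.
east = GCounter("us-east-1")
west = GCounter("eu-west-1")
east.increment(2)
west.increment(3)
east.merge(west)
west.merge(east)
print(east.value(), west.value())  # both replicas report 5
```

Under last-write-wins, one of the two concurrent updates would simply be discarded; the merge function is what preserves both.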
Distributed Locking, Redlock, and Consistent Hashing on AWS
Redlock debates matter because ElastiCache is not a consensus system. Consistent hashing for sharding workers and ALB target stickiness—with DynamoDB conditional writes as the boring alternative.
Paxos, Raft, and Byzantine Fault Tolerance: What Cloud Architects Need
You rarely implement Raft on EC2—you buy it in Aurora, DynamoDB, and EKS etcd. This guide explains quorum math so you trust managed services and avoid rolling your own coordinator.
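The quorum math mentioned above is small enough to sketch directly. This assumes a simple majority quorum as used by Raft-style protocols; the function names are illustrative.

```python
def majority_quorum(nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return nodes - majority_quorum(nodes)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum={majority_quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that a fourth node raises the quorum to three without tolerating any more failures than a three-node cluster, which is why consensus clusters are usually sized with odd node counts.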
Exactly-Once, CQRS, and Event Sourcing Replay on AWS
Exactly-once is a myth end-to-end—but idempotent consumers plus event stores get you close. CQRS read models on DynamoDB streams, Kinesis, and EventBridge replay semantics.
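A minimal sketch of the idempotent-consumer idea follows. The in-memory set is a stand-in assumption for a durable deduplication store (in production this would typically be something like a conditional write keyed on the event ID), and the class and method names are hypothetical.

```python
from typing import Any, Callable


class IdempotentConsumer:
    """Applies each event at most once, keyed by a producer-assigned event ID."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # assumption: replaces a durable store

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self._handler(payload)
        # Recording after the handler means a crash between the two steps
        # causes a retry, so the handler itself should tolerate re-execution
        # or share a transaction with the dedup record.
        self._seen.add(event_id)
        return True


ledger: list[int] = []
consumer = IdempotentConsumer(ledger.append)
consumer.handle("evt-1", 100)
consumer.handle("evt-1", 100)  # redelivery after an at-least-once retry
consumer.handle("evt-2", 50)
print(sum(ledger))  # the duplicate delivery is not applied twice
```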
Microservices Design Patterns on AWS (2026 Production Guide)
Event-Driven Microservices Reference Pattern
Messaging, Streaming & Event-Driven Systems
Kafka, ordering, backpressure, and the AWS async stack (SQS, SNS, EventBridge).
5 guides · ~40 min total read
Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics
Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK, and explains when SQS FIFO is the simpler choice.
Message Ordering, Backpressure, and RabbitMQ DLQs on AWS
FIFO guarantees shrink throughput—and unbounded queues only move backpressure to your AWS bill. Ordering, flow control, and Amazon MQ dead-letter patterns vs Kinesis resharding.
Reliable Queue Systems: SQS, Kafka, and Redis on AWS
SQS Reliable Messaging Patterns for Production
EventBridge Event-Driven Architecture Patterns
API & Application Architecture
Auth, rate limiting, and modern API protocols on API Gateway, Cognito, and AppSync.
4 guides · ~18 min total read
OAuth2 Token Introspection vs JWT Validation on Cognito and API Gateway
Local JWT validation is fast until revocation lags bite you. When to introspect at Cognito, use API Gateway JWT authorizers, and add Verified Permissions for fine-grained authz.
Rate Limiting: Token Bucket vs Leaky Bucket on AWS WAF and API Gateway
Token buckets allow bursts; leaky buckets smooth traffic—WAF rate rules and API Gateway usage plans implement neither perfectly but both matter for layered defense.
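The burst behavior of a token bucket can be sketched as follows. This is a generic illustration of the algorithm with an injectable clock for testing; it makes no claim about how AWS WAF or API Gateway implement their limits, and the class and parameter names are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Credit tokens for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# A fake clock makes the behavior deterministic for the demonstration.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=3.0, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(4)])  # burst of 3 passes, 4th is rejected
fake_now[0] = 2.0                          # two seconds later: 2 tokens refilled
print([bucket.allow() for _ in range(3)])
```

A leaky bucket would instead drain requests at a fixed rate, so the initial burst of three would be spread out rather than admitted at once.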
gRPC, GraphQL, Protobuf, and API Contracts on AWS
Protobuf on the wire saves bytes; GraphQL saves round trips until resolvers N+1 your Aurora cluster. ALB gRPC, AppSync, and consumer-driven contracts with Pact.
API Gateway Patterns: REST, HTTP, and WebSocket on AWS
Reliability Engineering & Observability
Tracing, metrics cardinality, logs, SLOs, and chaos—beyond default CloudWatch.
6 guides · ~52 min total read
Observability Beyond CloudWatch: OTel, Prometheus, and Grafana on AWS
Prometheus Cardinality Explosion on AWS: AMP, EMF, and Cost-Aware Metrics
That `user_id` label on every HTTP metric turns Amazon Managed Service for Prometheus (AMP) into a five-figure line item. This guide explains cardinality mechanics, EMF vs remote write, and Application Signals defaults worth disabling.
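The cardinality mechanics reduce to multiplication: the upper bound on active series for one metric is the product of the distinct values of each label. The label names and counts below are hypothetical, chosen only to show the scale of the jump.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical HTTP request metric.
base = {"method": 5, "route": 40, "status": 8}
with_user = {**base, "user_id": 100_000}

print(f"without user_id: {max_series(base):,} series")
print(f"with user_id:    {max_series(with_user):,} series")
```

Real workloads rarely hit every combination, so this is a ceiling rather than a forecast, but it shows why a single unbounded label dominates the total.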
Log Aggregation and Intelligent Sampling with CloudWatch and OpenTelemetry
Ingesting every debug log to CloudWatch is how observability becomes a FinOps incident. Tail sampling with ADOT, Logs Insights, and Firehose to S3 for the long tail.
Resilience: Retries, Circuit Breakers, and Graceful Shutdown
Customer-Facing SLA and SLO Design on AWS
Chaos Engineering and Resilience Program with FIS (2026)
Kubernetes, Cloud Native & AWS
Deployments, PDBs, service mesh, container security, and multi-Region EKS patterns.
7 guides · ~53 min total read
Blue-Green vs Canary Deployment Decision Guide (2026)
Kubernetes Pod Disruption Budgets on EKS: Zero-Downtime Upgrades
Cluster upgrades and Karpenter consolidation look healthy in the console while PDB-blocked evictions freeze your node drain for 45 minutes. This guide wires minAvailable, maxUnavailable, and EKS managed node group semantics.
Service Mesh Traffic Shifting: VPC Lattice, Istio on EKS, and App Mesh EOL
App Mesh is a legacy path. New meshes should start with VPC Lattice for AWS-native east-west traffic, or Istio on EKS when you need full L7 policy. This guide covers traffic shifting without duplicating load balancers per service.
Container Runtime Security: seccomp, AppArmor, and EKS Pod Security
Default Docker seccomp is not enough for regulated workloads. EKS Pod Security Standards, seccomp profiles, and Fargate platform version constraints.
EKS + Karpenter Cost-Optimized Autoscaling (How-To)
Serverless Cold Starts and Ingress Scale on AWS
Multi-Region AWS Without Doubling Costs
Concurrency, Runtime & Performance Engineering
JVM GC, virtual threads, and low-level concurrency choices for AWS runtimes.
2 guides · ~4 min total read
JVM G1 and ZGC Tuning on AWS Corretto for ECS and EKS
A heap that is too small triggers G1 humongous allocations; one that is too large balloons pause times on Graviton. This guide covers Corretto on ECS, EKS, and Lambda Java, and when generational ZGC beats G1 for API heaps.
Virtual Threads, Lock-Free Structures, and High-Throughput Runtimes on AWS
Project Loom virtual threads help I/O-bound Java on ECS—not CPU-bound aggregation. Compare actor models, lock-free queues, and when Lambda concurrency beats pinning threads on EC2.
Caching & Performance Optimization
Cache layers, invalidation, and probabilistic structures on ElastiCache and CloudFront.
3 guides · ~16 min total read
ElastiCache Redis Caching Strategies for Production
Distributed Cache Invalidation and Multi-Level Caching on AWS
Cache-aside without an invalidation story ships stale pricing to 2% of users—the hardest 2% to debug. This guide layers CloudFront, ElastiCache, and DAX with TTL, event-driven purge, and when write-through beats cache-aside.
Bloom Filters and HyperLogLog in Production on ElastiCache Redis
Bloom filters shave 90% of negative lookups; HyperLogLog estimates cardinality without storing every user ID. Redis modules on ElastiCache for abuse detection and feed deduplication.
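For readers who want to see the mechanism, here is a small pure-Python Bloom filter. It is a teaching sketch under stated assumptions (SHA-256 with double hashing, the standard sizing formulas), not the data structure a Redis or ElastiCache deployment runs, and all names are illustrative.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit hash halves.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(1000):
    seen.add(f"post-{i}")
print(seen.might_contain("post-42"))    # True: added items are never missed
print(seen.size // 8, "bytes of bits")  # far smaller than storing every ID
```

A negative answer lets the caller skip the backing lookup entirely, which is where the savings on negative lookups come from; a positive answer still needs confirmation against the source of truth.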
Engineering Guides FAQ
Who are these engineering guides for?
How is this different from How-To Guides?
Do I need to read the tracks in order?
Are these guides kept current with AWS changes?
Stuck Between Theory and Production?
Our AWS architects run design reviews where these trade-offs become concrete—RDS Proxy sizing, MSK consumer groups, mesh vs Lattice, cardinality budgets.