Engineering Guides

Systems engineering for AWS architects who need the why before the console click

Nine learning tracks from TCP and transaction isolation to Kafka and Kubernetes—each guide maps a production symptom to the underlying mechanism and the AWS control that fixes it. Updated June 2026 for VPC Lattice, HTTP/3 on CloudFront, AMP, and Java 21 virtual threads.

Last hub review: June 2026 · Reviewed by AWS-certified architects (Solutions Architect – Professional)

9 learning tracks · 44 guides in library · AWS service mappings · Select Partner reviewed

Start here

High-traffic guides architects bookmark first—CAP on multi-Region AWS, MSK exactly-once, HTTP/3 on CloudFront, connection pool exhaustion, and Prometheus cardinality.

Featured Part 1

Modern Web Transport on AWS: TCP Congestion, HTTP/2, HTTP/3, and QUIC

Packet loss on mobile networks still punishes HTTP/1.1 head-of-line blocking—but HTTP/3 only helps if CloudFront terminates QUIC and your origin connection pools are sized for multiplexed streams. This guide connects Reno, Cubic, BBR, HPACK, and QUIC to ALB and CloudFront decisions.

Networking & Protocol Engineering 3 min · Read Guide →
Featured Part 3

Database Deadlocks, Connection Pool Exhaustion, and Prepared Statements on RDS

Teams often answer "too many connections" pages by raising max_connections, which trades one outage for OOM on the writer. This guide traces deadlocks, pool sizing, RDS Proxy, and prepared statement caching on Aurora.

Database Internals & Performance 2 min · Read Guide →
Featured Part 1

CAP Theorem in Practice on AWS: What Architects Actually Need for Multi-Region

CAP is not a trivia question: it is the reason your DynamoDB global table shows stale inventory or your Aurora Global Database reads lag 80 ms behind the writer. This guide maps partition tolerance, consistency, and availability trade-offs to concrete AWS controls.

Distributed Systems Architecture 3 min · Read Guide →
Featured Part 1

Kafka on MSK: Partition Rebalancing and Exactly-Once Semantics

Consumer group rebalance storms stall processing longer than broker outages. This guide covers cooperative rebalancing, idempotent producers, and transactional reads on Amazon MSK—with when SQS FIFO is simpler.

Messaging, Streaming & Event-Driven Systems 2 min · Read Guide →
Featured Part 2

Prometheus Cardinality Explosion on AWS: AMP, EMF, and Cost-Aware Metrics

That `user_id` label on every HTTP metric can turn Amazon Managed Service for Prometheus (AMP) into a five-figure line item. This guide explains cardinality mechanics, EMF vs remote write, and Application Signals defaults worth disabling.

Reliability Engineering & Observability 2 min · Read Guide →
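
As a rough illustration of the cardinality mechanics mentioned above: the number of time series a single metric can produce is bounded by the product of the distinct values of each label. The label names and counts below are hypothetical, chosen only to show the arithmetic.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for one HTTP request metric.
base = {"method": 5, "status": 8, "route": 40}
with_user = {**base, "user_id": 100_000}

print(max_series(base))       # 1600
print(max_series(with_user))  # 160000000
```

Not every combination occurs in practice, so real series counts sit below this bound, but the multiplication explains why one unbounded label dominates everything else.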
Featured Part 3

Service Mesh Traffic Shifting: VPC Lattice, Istio on EKS, and App Mesh EOL

App Mesh is a legacy path: new meshes should start with VPC Lattice for AWS-native east-west traffic, or Istio on EKS when you need full L7 policy. The guide covers traffic shifting without duplicating load balancers per service.

Kubernetes, Cloud Native & AWS 2 min · Read Guide →

How we structure every guide

Symptom → mechanism → AWS control—not a glossary dump.

Symptom

Start with what broke in production—stale reads, partition rebalance stalls, cardinality bills, or TLS handshake latency—not the textbook definition.

Mechanism

Name the systems primitive underneath: isolation levels, cooperative assignors, cache lines, or QUIC loss recovery. That is what your design review should debate.

AWS control

Map the mechanism to a concrete knob—RDS Proxy max connections, MSK rebalance protocol, AMP workspace limits, CloudFront HTTP/3 viewer policy.

Ship

Each guide ends with a one-week checklist and honest gaps. Pair with How-To Guides when you are ready to paste Terraform or CDK.

Networking & Protocol Engineering

TCP through QUIC, TLS termination, and server I/O primitives—mapped to ALB, CloudFront, and EC2 tuning decisions.

4 guides · ~9 min total read

Database Internals & Performance

Isolation levels, storage engines, connection pools, and sharding—wired to RDS, Aurora, and DynamoDB.

6 guides · ~32 min total read

Part 1

PostgreSQL Transaction Isolation and ACID vs BASE on AWS RDS and Aurora

Serializable sounds safest until your checkout times out under row locks. This guide maps READ COMMITTED, REPEATABLE READ, and SERIALIZABLE to RDS/Aurora defaults—and when DynamoDB conditional writes are the BASE alternative.

Database Internals & Performance 2 min · Read Guide →
Part 2

B-Tree vs LSM and Query Planner Internals on AWS Databases

Why Aurora PostgreSQL loves B-tree indexes on OLTP but DynamoDB feels like an LSM—and how cost-based optimization surprises you when statistics go stale on RDS.

Database Internals & Performance 2 min · Read Guide →
Part 3

Database Deadlocks, Connection Pool Exhaustion, and Prepared Statements on RDS

Teams often answer "too many connections" pages by raising max_connections, which trades one outage for OOM on the writer. This guide traces deadlocks, pool sizing, RDS Proxy, and prepared statement caching on Aurora.

Database Internals & Performance 2 min · Read Guide →
Part 4

PostgreSQL Vacuum, Index Bloat, and Sharding Hot Partitions on AWS

Autovacuum cannot keep up after Black Friday bulk deletes—and your BRIN index is not helping point lookups. Vacuum strategy on Aurora, plus Aurora Limitless and DynamoDB hot key mitigation.

Database Internals & Performance 2 min · Read Guide →
Part 5

RDS vs Aurora: Read Replicas, Failover, and When to Switch

Database Internals & Performance Read compare →
Part 6

When to Use RDS vs Aurora (Production Decision Guide)

Database Internals & Performance Read Guide →

Distributed Systems Architecture

CAP, coordination, consensus, and event-sourced patterns on AWS multi-Region workloads.

7 guides · ~35 min total read

Part 1

CAP Theorem in Practice on AWS: What Architects Actually Need for Multi-Region

CAP is not a trivia question: it is the reason your DynamoDB global table shows stale inventory or your Aurora Global Database reads lag 80 ms behind the writer. This guide maps partition tolerance, consistency, and availability trade-offs to concrete AWS controls.

Distributed Systems Architecture 3 min · Read Guide →
Part 2

CRDTs and Eventual Consistency Anti-Patterns on AWS

Last-write-wins is not a CRDT—it is how Global Tables lose cart merges. When to use counters, OR-Sets, and conflict-free merges vs when to keep a single Aurora writer.

Distributed Systems Architecture 2 min · Read Guide →
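
A minimal sketch of the cart-merge problem described above, assuming two replicas that each add an item concurrently. Whole-value last-write-wins keeps one cart and discards the other, while a grow-only set (a simpler relative of the OR-Set that does not support removal) merges by union. The cart contents and timestamps are made up for illustration.

```python
def lww_merge(a: tuple[int, frozenset], b: tuple[int, frozenset]) -> tuple[int, frozenset]:
    """Whole-value last-write-wins: keep whichever cart has the later timestamp."""
    return a if a[0] >= b[0] else b


def gset_merge(a: frozenset, b: frozenset) -> frozenset:
    """Grow-only set merge: union is commutative, associative, and idempotent."""
    return a | b


# (timestamp, cart) as written concurrently in two Regions (hypothetical data).
replica_a = (1001, frozenset({"book", "lamp"}))
replica_b = (1002, frozenset({"book", "mug"}))

lww_cart = lww_merge(replica_a, replica_b)[1]
merged_cart = gset_merge(replica_a[1], replica_b[1])

print(sorted(lww_cart))     # ['book', 'mug']
print(sorted(merged_cart))  # ['book', 'lamp', 'mug']
```

The lamp disappears under last-write-wins even though neither user removed it; the union merge keeps both additions regardless of the order in which replicas exchange state.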
Part 3

Distributed Locking, Redlock, and Consistent Hashing on AWS

Redlock debates matter because ElastiCache is not a consensus system. Consistent hashing for sharding workers and ALB target stickiness—with DynamoDB conditional writes as the boring alternative.

Distributed Systems Architecture 2 min · Read Guide →
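
One way to picture consistent hashing for sharding workers is a hash ring with virtual nodes. The sketch below is an assumption-laden illustration, not the guide's implementation: the `HashRing` class, the worker names, and the choice of 64 virtual nodes are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the owner of the first ring point clockwise from the key."""
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"job-{i}" for i in range(1000)]
before = HashRing(["w1", "w2", "w3"])
after = HashRing(["w1", "w2"])  # w3 removed

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Only keys that lived on the removed worker change owner.
print(all(before.node_for(k) == "w3" for k in moved))  # True
```

With plain `hash(key) % n` sharding, removing a worker would remap most keys; on the ring, only the departed worker's keys move.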
Part 4

Paxos, Raft, and Byzantine Fault Tolerance: What Cloud Architects Need

You rarely implement Raft on EC2—you buy it in Aurora, DynamoDB, and EKS etcd. This guide explains quorum math so you trust managed services and avoid rolling your own coordinator.

Distributed Systems Architecture 2 min · Read Guide →
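
The quorum math referred to above can be sketched in a few lines, assuming a simple majority quorum as used by Raft-style systems. The function names are illustrative only.

```python
def majority_quorum(n: int) -> int:
    """Smallest number of members that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Members that can fail while a majority can still be formed."""
    return n - majority_quorum(n)


for n in (3, 4, 5):
    print(n, majority_quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that a four-member cluster tolerates no more failures than a three-member one, which is why odd cluster sizes are the usual choice.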
Part 5

Exactly-Once, CQRS, and Event Sourcing Replay on AWS

Exactly-once is a myth end-to-end—but idempotent consumers plus event stores get you close. CQRS read models on DynamoDB streams, Kinesis, and EventBridge replay semantics.

Distributed Systems Architecture 2 min · Read Guide →
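
A minimal sketch of the idempotent-consumer idea above, assuming every event carries a unique ID and that redelivery is possible. The in-memory set stands in for a durable deduplication store; the class and field names are hypothetical.

```python
class IdempotentConsumer:
    """Applies each event at most once by remembering processed event IDs."""

    def __init__(self) -> None:
        self.seen: set[str] = set()  # stand-in for a durable dedup store
        self.balance = 0

    def handle(self, event_id: str, amount: int) -> bool:
        """Apply the event; return False if it was a duplicate delivery."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        self.balance += amount
        return True


consumer = IdempotentConsumer()
deliveries = [("e1", 10), ("e2", 20), ("e1", 10)]  # e1 is redelivered
results = [consumer.handle(eid, amt) for eid, amt in deliveries]

print(results)           # [True, True, False]
print(consumer.balance)  # 30
```

In a real system the ID check and the state change would need to commit atomically, otherwise a crash between the two reintroduces duplicates or losses.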
Part 6

Microservices Design Patterns on AWS (2026 Production Guide)

Distributed Systems Architecture Read Guide →
Part 7

Event-Driven Microservices Reference Pattern

Distributed Systems Architecture Read pattern →

Messaging, Streaming & Event-Driven Systems

Kafka, ordering, backpressure, and the AWS async stack (SQS, SNS, EventBridge).

5 guides · ~40 min total read

API & Application Architecture

Auth, rate limiting, and modern API protocols on API Gateway, Cognito, and AppSync.

4 guides · ~18 min total read

Reliability Engineering & Observability

Tracing, metrics cardinality, logs, SLOs, and chaos—beyond default CloudWatch.

6 guides · ~52 min total read

Kubernetes, Cloud Native & AWS

Deployments, PDBs, service mesh, container security, and multi-Region EKS patterns.

7 guides · ~53 min total read

Concurrency, Runtime & Performance Engineering

JVM GC, virtual threads, and low-level concurrency choices for AWS runtimes.

2 guides · ~4 min total read

Caching & Performance Optimization

Cache layers, invalidation, and probabilistic structures on ElastiCache and CloudFront.

3 guides · ~16 min total read

Engineering Guides FAQ

Who are these engineering guides for?
Senior engineers, platform teams, SREs, and architects who already know AWS service names but want the systems fundamentals behind design reviews—connection pooling, consensus, cardinality, transport protocols, and how they map to RDS, EKS, MSK, and API Gateway.
How is this different from How-To Guides?
How-To Guides are step-by-step implementation walkthroughs (Bedrock, Karpenter, compliance setup). Engineering Guides explain mechanisms first—then show which AWS control implements them. Start here for theory; jump to How-To Guides when you are ready to ship configuration.
Do I need to read the tracks in order?
Each track is ordered from foundations to production trade-offs. You can enter at any guide that matches your current incident or design question, but reading a track top-to-bottom builds a coherent mental model.
Are these guides kept current with AWS changes?
Yes. Each guide pins AWS features and versions in the opening section and carries a last-updated date. Service mesh content reflects App Mesh deprecation in favor of VPC Lattice; observability guides reference Amazon Managed Service for Prometheus and Application Signals as of June 2026.

Stuck Between Theory and Production?

Our AWS architects run design reviews where these trade-offs become concrete—RDS Proxy sizing, MSK consumer groups, mesh vs Lattice, cardinality budgets.