Log Aggregation and Intelligent Sampling with CloudWatch and OpenTelemetry
Quick summary: Ingesting every debug log into CloudWatch is how observability becomes a FinOps incident. The fix combines tail sampling in the ADOT collector, Logs Insights for incident search, and Firehose to S3 for the long tail.
CloudWatch Logs ingestion is billed per GB (pricing as of June 2026). For a mid-market SaaS we benchmarked, correlating 100% of traces and logs without sampling destroyed margins on a $40k/mo observability line item.
Symptom → mechanism → AWS control
| Production symptom | Mechanism | AWS control |
|---|---|---|
| Log bill dwarfs compute | 100% log ingest at INFO | OTel probabilistic + tail sampling, CloudWatch Logs retention tiers |
| Can’t find error in log flood | No trace correlation | OTel trace_id in log attributes, CloudWatch Logs Insights |
| Hot partition on log group | Single log group per service | Per-environment log groups, S3 export for archive |
Opinionated take: Sample success logs at 1–5% and keep 100% of errors—wire trace_id into every log line before you centralize aggregation.
Benchmark pattern (hypothetical workload): with 200 GB/day of application logs, OTel tail sampling (1% of successes, 100% of errors) reduces ingest to 22 GB/day and the CloudWatch Logs bill from $1,840 to $202 per month. X-Ray trace-linked logs preserve full context on errors.
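As a rough check on these figures, the sketch below assumes the bill scales linearly with ingested GB (ignoring storage and query charges) and back-solves the share of raw volume that must be always-kept traffic for 1% success sampling to land at 22 GB/day. The function names are illustrative and not part of any AWS or OTel API.

```python
def kept_fraction(always_kept_share: float, success_rate: float) -> float:
    """Fraction of volume kept when some traffic is kept at 100% and the rest is sampled."""
    return always_kept_share + success_rate * (1.0 - always_kept_share)


def implied_always_kept_share(kept: float, success_rate: float) -> float:
    """Invert kept_fraction: the always-kept share that yields the observed kept fraction."""
    return (kept - success_rate) / (1.0 - success_rate)


raw_gb_per_day = 200.0
sampled_gb_per_day = 22.0
raw_bill_per_month = 1840.0

kept = sampled_gb_per_day / raw_gb_per_day
# Assumption: the bill is proportional to ingested volume.
projected_bill = raw_bill_per_month * kept
always_kept = implied_always_kept_share(kept, success_rate=0.01)

print(f"kept fraction: {kept:.1%}")
print(f"projected bill: ${projected_bill:,.0f}/month")
print(f"implied always-kept share of raw volume: {always_kept:.1%}")
```

Under these assumptions about a tenth of the raw volume is error or otherwise always-kept traffic, which is why the reduction is roughly 9x rather than 100x.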
Aggregation architecture
- App → structured JSON (correlation ID)
- ADOT collector → tail sampling (keep errors + slow)
- CloudWatch Logs hot path + Firehose → S3/Glue for audit
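The first hop in this pipeline, structured JSON with a correlation ID, could look like the minimal sketch below. It uses only the Python standard library; the field names and the `build_logger` helper are assumptions for illustration, and the `uuid4` value is a stand-in for a real OTel trace ID.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # trace_id is attached via `extra=`; None if the caller omitted it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
trace_id = uuid.uuid4().hex  # stand-in: 32 hex chars, same width as an OTel trace ID
logger.info("checkout completed", extra={"trace_id": trace_id})
```

Emitting `trace_id` as a top-level JSON field is what lets a downstream collector or query tool group all lines belonging to one request.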
Sampling rules
- Always keep: level=ERROR, http.status>=500, latency > SLO
- Sample info: 1–5% baseline
- Never sample security audit events
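The rules above can be sketched as a single decision function. This is an illustration of the logic only, not the ADOT `tail_sampling` processor itself: real tail sampling decides per trace after the trace completes, while this sketch decides per event and uses a hash of `trace_id` so that all lines of one trace get the same baseline decision. The `LogEvent` type, the 500 ms SLO, and the 1% default rate are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class LogEvent:
    """Hypothetical event shape; field names are illustrative."""
    trace_id: str
    level: str = "INFO"
    http_status: int = 200
    latency_ms: float = 0.0
    is_security_audit: bool = False


def should_keep(event: LogEvent, slo_ms: float = 500.0, info_rate: float = 0.01) -> bool:
    # Security audit events are never sampled away.
    if event.is_security_audit:
        return True
    # Always keep errors, 5xx responses, and requests slower than the SLO.
    if event.level == "ERROR" or event.http_status >= 500 or event.latency_ms > slo_ms:
        return True
    # Baseline sampling: map trace_id to a stable bucket in [0, 1).
    digest = hashlib.sha256(event.trace_id.encode("utf-8")).digest()
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < info_rate


events = [
    LogEvent("a1", level="ERROR"),
    LogEvent("b2", http_status=503),
    LogEvent("c3", latency_ms=900.0),
    LogEvent("d4", is_security_audit=True),
]
print([should_keep(e) for e in events])
```

Hashing the trace ID rather than calling a random number generator keeps the decision reproducible across services, so a trace is either sampled everywhere or nowhere.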
Logs Insights
Use Logs Insights for incident search, not as the primary metrics store. Pair it with the cardinality guide.
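A minimal sketch of an incident search by trace ID follows. It assumes logs are JSON with a top-level `trace_id` field and that `logs_client` is a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) passed in by the caller; the helper names, the log group, and the polling limits are hypothetical.

```python
import time


def build_trace_query(trace_id: str, limit: int = 100) -> str:
    """Logs Insights query that returns every line for one trace, oldest first."""
    if not trace_id.isalnum():
        # Guard against query injection through the trace ID.
        raise ValueError("trace_id must be alphanumeric")
    return (
        "fields @timestamp, level, @message "
        f"| filter trace_id = '{trace_id}' "
        "| sort @timestamp asc "
        f"| limit {limit}"
    )


def run_trace_query(logs_client, log_group: str, trace_id: str,
                    lookback_s: int = 3600, poll_s: float = 1.0, max_polls: int = 60):
    """Start the query and poll until it reaches a terminal status."""
    now = int(time.time())
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=now - lookback_s,
        endTime=now,
        queryString=build_trace_query(trace_id),
    )
    for _ in range(max_polls):
        result = logs_client.get_query_results(queryId=started["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return result
        time.sleep(poll_s)
    raise TimeoutError("Logs Insights query did not finish in time")


print(build_trace_query("4bf92f3577b34da6a3ce929d0e0e4736"))
```

Logs Insights charges per GB scanned, so keeping the lookback window narrow and filtering on `trace_id` first is cheaper than scanning a full day of a busy log group.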
AWS services map
| Need | Service | Skip when |
|---|---|---|
| Intelligent sampling | ADOT collector tail_sampling | Compliance requires 100% audit retention |
| Log storage + query | CloudWatch Logs + Insights | Long-term archive → S3 + Athena |
| Trace-log correlation | OTel + X-Ray / Application Signals | Batch jobs with no request context |
What to do this week
- Enable the ADOT tail sampling processor in the collector config.
- Set log retention tiers (7d hot, 90d S3).
- Build a dashboard of ingestion GB/day with anomaly detection.
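The retention item can be sketched with two boto3 calls, `put_retention_policy` on a CloudWatch Logs client and `put_bucket_lifecycle_configuration` on an S3 client. The clients are passed in by the caller; the function name, rule ID, and the reading of "90d S3" as "expire archived objects after 90 days" are assumptions, so adjust them if the archive must live longer.

```python
def apply_retention_tiers(logs_client, s3_client, log_group: str, archive_bucket: str,
                          hot_days: int = 7, archive_days: int = 90) -> None:
    """Hypothetical helper: 7-day hot tier in CloudWatch Logs, 90-day expiry in S3."""
    # CloudWatch Logs only accepts specific retention values; 7 is one of them.
    logs_client.put_retention_policy(logGroupName=log_group, retentionInDays=hot_days)

    # Note: this call replaces any lifecycle configuration already on the bucket.
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=archive_bucket,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-archived-logs",  # illustrative rule name
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},      # apply to every object
                    "Expiration": {"Days": archive_days},
                }
            ]
        },
    )
```

A typical call would look like `apply_retention_tiers(boto3.client("logs"), boto3.client("s3"), "/app/prod", "my-log-archive")`, where both the log group and the bucket name are placeholders.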
More in This Track
Part of the Engineering Guides library (June 2026).
What this guide doesn’t cover
Full OTel stack setup. See Part 1, the canonical post in this track.
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.