
Log Aggregation and Intelligent Sampling with CloudWatch and OpenTelemetry

Quick summary: Ingesting every debug log to CloudWatch is how observability becomes a FinOps incident. The fix is tail sampling with ADOT, Logs Insights for incident search, and Firehose to S3 for the long tail.

Key Takeaways

CloudWatch Logs ingestion (June 2026) is billed per GB. For a mid-market SaaS we benchmarked, 100% trace/log correlation without sampling produced a $40k/mo observability line item that destroyed margins.


Symptom → mechanism → AWS control

Production symptom	Mechanism	AWS control
Log bill dwarfs compute	100% log ingest at INFO	OTel probabilistic + tail sampling, CloudWatch Logs retention tiers
Can’t find error in log flood	No trace correlation	OTel trace_id in log attributes, CloudWatch Logs Insights
Hot partition on log group	Single log group per service	Per-environment log groups, S3 export for archive

Opinionated take: sample success logs at 1–5% and keep 100% of errors. Wire trace_id into every log line before you centralize aggregation.

Benchmark pattern (hypothetical workload): 200GB/day of application logs with OTel tail sampling (keep 1% of successes, 100% of errors) reduces ingest to 22GB/day. That cuts the CloudWatch Logs bill from $1,840 to $202/month, while X-Ray trace-linked logs preserve full context on errors.
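The benchmark numbers hang together if roughly 10% of the 200GB/day is error-path traffic that is always kept. Below is a minimal Python sketch of that arithmetic. It assumes a flat per-GB price, so the bill scales linearly with ingested volume. The function names are illustrative and are not part of any AWS or OTel API.

```python
def kept_gb_per_day(total_gb: float, error_share: float, success_rate: float) -> float:
    """Daily volume kept when every error-path log is kept and the rest are sampled."""
    error_gb = total_gb * error_share
    success_gb = total_gb - error_gb
    return error_gb + success_gb * success_rate


def implied_error_share(total_gb: float, kept_gb: float, success_rate: float) -> float:
    """Solve kept = E + r * (T - E) for the error share E / T."""
    return (kept_gb - success_rate * total_gb) / ((1.0 - success_rate) * total_gb)


def scaled_bill(bill_before: float, gb_before: float, gb_after: float) -> float:
    """Assumes a flat per-GB price, so the bill scales linearly with ingested GB."""
    return bill_before * gb_after / gb_before


if __name__ == "__main__":
    share = implied_error_share(total_gb=200, kept_gb=22, success_rate=0.01)
    print(f"implied error share: {share:.1%}")
    print(f"kept: {kept_gb_per_day(200, share, 0.01):.1f} GB/day")
    print(f"bill after: ${scaled_bill(1840, 200, 22):,.0f}/month")
```

The second function explains the 22GB/day figure. At a 1% success rate, about 2GB/day comes from sampled successes, so the remaining ~20GB/day must be errors that are kept in full. In other words, at this sampling rate the error rate, not the sample rate, dominates the bill.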

Aggregation architecture

App → structured JSON (correlation ID)
ADOT collector → tail sampling (keep errors + slow)
CloudWatch Logs hot path + Firehose → S3/Glue for audit
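Below is a minimal Python sketch of the first hop: the app emits one JSON object per line with the correlation ID attached. It uses only the standard library. Several names are assumptions:

- `current_trace_id` is a hypothetical context variable standing in for whatever your request middleware or the OTel SDK exposes.
- The field names (`trace_id`, `level`, `message`) are assumptions; align them with your collector and Logs Insights queries.

```python
import contextvars
import json
import logging

# Hypothetical: request middleware (or the OTel SDK) sets the active trace ID here.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, carrying the correlation ID on every record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        }
        # Structured fields passed as extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Attach the JSON formatter to a stream handler (stderr by default)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.info("order placed", extra={"fields": {"http.status": 200}})
```

Keeping `http.status` and `level` as top-level JSON keys lets the collector and Logs Insights filter on them without parsing the message text.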

Sampling rules

Always keep: level=ERROR, http.status>=500, latency > SLO
Sample info: 1–5% baseline
Never sample security audit events
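The rules above can be expressed as a per-trace keep/drop decision. This is a plain-Python illustration of the policy, not ADOT configuration. In the collector, the same intent maps loosely onto the tail_sampling processor's `status_code`, `latency`, and `probabilistic` policy types. The `TraceSummary` fields, the 300 ms SLO, and the 2% baseline are hypothetical. Hashing the trace ID keeps the verdict consistent for every log line in a trace.

```python
import hashlib
from dataclasses import dataclass

SLO_LATENCY_MS = 300.0  # hypothetical latency SLO
BASELINE_RATE = 0.02    # inside the 1-5% band for success traffic


@dataclass(frozen=True)
class TraceSummary:
    """What a tail sampler knows once a trace has completed (illustrative fields)."""
    trace_id: str
    has_error_log: bool  # any record with level=ERROR
    max_http_status: int
    duration_ms: float
    is_security_audit: bool


def _bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1).

    Keeps 53 bits so the result is exactly representable and strictly below 1.0.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def keep_trace(t: TraceSummary, baseline_rate: float = BASELINE_RATE) -> bool:
    if t.is_security_audit:  # never sample security audit events
        return True
    if t.has_error_log or t.max_http_status >= 500:
        return True  # always keep errors
    if t.duration_ms > SLO_LATENCY_MS:
        return True  # always keep requests slower than the SLO
    return _bucket(t.trace_id) < baseline_rate  # sample the rest at baseline
```

A random sampler would work for individual records but could keep half of a trace's lines and drop the rest. Deriving the decision from the trace ID avoids that.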

Logs Insights

Use Logs Insights for incident search, not as the primary metrics store; for metrics, pair it with the cardinality guide.
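Below is a sketch of a trace-scoped incident search. It assumes JSON logs that carry `trace_id`, `level`, and `message` fields; Logs Insights discovers JSON fields automatically.

- `run_query` expects a boto3 CloudWatch Logs client, which exposes `start_query` and `get_query_results`.
- The helper names, polling interval, and limit are illustrative.
- The trace ID is validated before interpolation, so input cannot alter the query.

```python
import re
import time

# OTel trace IDs are 32 lowercase hex characters; validating keeps the query string safe.
TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def trace_query(trace_id: str, limit: int = 200) -> str:
    """Logs Insights query returning every line of one trace, oldest first."""
    if not TRACE_ID_RE.fullmatch(trace_id):
        raise ValueError(f"not a 32-char hex trace ID: {trace_id!r}")
    return (
        "fields @timestamp, level, message, trace_id"
        f' | filter trace_id = "{trace_id}"'
        " | sort @timestamp asc"
        f" | limit {limit}"
    )


def run_query(logs_client, log_group: str, query: str, start_s: int, end_s: int,
              poll_s: float = 1.0, max_polls: int = 60) -> list[dict[str, str]]:
    """Start a Logs Insights query and poll until it reaches a terminal status."""
    query_id = logs_client.start_query(
        logGroupName=log_group, startTime=start_s, endTime=end_s, queryString=query
    )["queryId"]
    for _ in range(max_polls):
        resp = logs_client.get_query_results(queryId=query_id)
        status = resp["status"]
        if status == "Complete":
            # Each result row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row} for row in resp["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_s)
    raise TimeoutError(f"query {query_id} still running after {max_polls} polls")
```

Usage: `run_query(boto3.client("logs"), "/app/prod", trace_query(tid), start_s, end_s)`, with epoch-second time bounds. Narrow time windows keep Logs Insights scan costs down during an incident.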

AWS services map

Need	Service	Skip when
Intelligent sampling	ADOT collector tail_sampling	Compliance requires 100% audit retention
Log storage + query	CloudWatch Logs + Insights	Long-term archive → S3 + Athena
Trace-log correlation	OTel + X-Ray / Application Signals	Batch jobs with no request context

What to do this week

Enable ADOT tail sampling processor in collector config.
Set log retention tiers (7d hot, 90d S3).
Build a dashboard of ingestion GB/day with anomaly detection.
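The retention and dashboard steps can be sketched in Python against boto3 clients:

- `put_retention_policy` on a CloudWatch Logs client caps the hot copy.
- The `AWS/Logs` `IncomingBytes` metric (daily `Sum`) gives ingestion per log group.

The helper names are hypothetical, and decimal GB is assumed. `spike_days` is a crude trailing-mean stand-in, not CloudWatch's anomaly detection model; for the real dashboard, enable CloudWatch's anomaly detection on the same metric. The 90-day S3 tier is configured on the bucket side and is not shown.

```python
from datetime import datetime, timedelta, timezone

# A subset of the retention values PutRetentionPolicy accepts.
SUPPORTED_RETENTION_DAYS = {1, 3, 5, 7, 14, 30, 60, 90, 180, 365}


def set_hot_retention(logs_client, log_groups: list[str], days: int = 7) -> None:
    """Cap how long CloudWatch (the hot path) keeps each log group's events."""
    if days not in SUPPORTED_RETENTION_DAYS:
        raise ValueError(f"unsupported retention: {days}")
    for name in log_groups:
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)


def daily_ingest_gb(cw_client, log_group: str, days: int = 14) -> list[float]:
    """GB/day from the AWS/Logs IncomingBytes metric (daily Sum), oldest first."""
    end = datetime.now(timezone.utc)
    resp = cw_client.get_metric_statistics(
        Namespace="AWS/Logs",
        MetricName="IncomingBytes",
        Dimensions=[{"Name": "LogGroupName", "Value": log_group}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,
        Statistics=["Sum"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return [p["Sum"] / 1e9 for p in points]  # decimal GB


def spike_days(series: list[float], window: int = 7, factor: float = 1.5) -> list[int]:
    """Crude stand-in for anomaly detection: flag days above factor x trailing mean."""
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if series[i] > factor * baseline:
            flagged.append(i)
    return flagged
```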

What this guide doesn’t cover

Full OTel stack setup; that is covered in the part 1 canonical post in this track.

Recommended Reading

Prometheus Cardinality Explosion on AWS: AMP, EMF, and Cost-Aware Metrics

Observability Beyond CloudWatch (2026): When to Add Application Signals, ADOT, Managed Prometheus, and Grafana — and When Not To

From One FIS Experiment to a Resilience Program (2026): AWS Fault Injection Service, Stop Conditions, and GameDays That Actually Change Behavior

Production Resilience on AWS: Timeouts, Retries With Jitter, Circuit Limits, and Graceful Shutdown
