Log Aggregation and Intelligent Sampling with CloudWatch and OpenTelemetry

DevOps & CI/CD Palaniappan P 2 min read

Quick summary: Ingesting every debug log into CloudWatch is how observability becomes a FinOps incident. The alternative: tail sampling with ADOT, Logs Insights for search, and Firehose to S3 for the long tail.

Key Takeaways

  • Ingesting every debug log into CloudWatch is how observability becomes a FinOps incident.
  • Combine tail sampling in ADOT, Logs Insights for search, and Firehose to S3 for the long tail.
  • CloudWatch Logs ingestion bills per GB (as of June 2026). At a mid-market SaaS we benchmarked, 100% trace/log correlation without sampling destroyed margins on a $40k/mo observability line item.

CloudWatch Logs ingestion bills per GB (as of June 2026). At a mid-market SaaS we benchmarked, 100% trace/log correlation without sampling destroyed margins on a $40k/mo observability line item.

Symptom → mechanism → AWS control

Production symptom | Mechanism | AWS control
Log bill dwarfs compute | 100% log ingest at INFO | OTel probabilistic + tail sampling, CloudWatch Logs retention tiers
Can’t find error in log flood | No trace correlation | OTel trace_id in log attributes, CloudWatch Logs Insights
Hot partition on log group | Single log group per service | Per-environment log groups, S3 export for archive

Opinionated take: sample success logs at 1–5% and keep 100% of errors. Wire trace_id into every log line before you centralize aggregation.
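As a minimal sketch of what "trace_id in every log line" can look like, the formatter below emits one JSON object per record using only the Python standard library. The names `current_trace_id`, `JsonTraceFormatter`, and `build_logger` are hypothetical, and the context variable stands in for whatever your tracing SDK exposes as the active trace ID.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request context. In a real service the tracing SDK
# would supply the active trace ID instead of this variable.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace_id."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        }
        return json.dumps(payload)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: set the trace ID once per request, then log as usual.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("payment failed")
```

Because the ID is a top-level JSON field, any downstream sampler or query can filter on it without parsing the message text.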

Benchmark pattern (hypothetical workload): with 200 GB/day of application logs, OTel tail sampling (1% of success logs, 100% of errors) reduces ingest to 22 GB/day and the CloudWatch Logs bill from $1,840 to $202/month. X-Ray trace-linked logs preserve full context on errors.
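The arithmetic behind those figures can be checked with a short sketch. It assumes the bill scales linearly with ingested volume and that all unsampled volume belongs to an "always keep" class; the function names are illustrative and not from the original benchmark.

```python
def sampled_gb_per_day(total_gb, always_keep_fraction, sample_rate):
    """Daily ingest after sampling: keep the always-keep share, sample the rest."""
    kept = always_keep_fraction + sample_rate * (1.0 - always_keep_fraction)
    return total_gb * kept


def implied_always_keep_fraction(total_gb, sampled_gb, sample_rate):
    """Solve the formula above for the always-keep share."""
    return (sampled_gb / total_gb - sample_rate) / (1.0 - sample_rate)


total_gb, sampled_gb, sample_rate = 200.0, 22.0, 0.01
keep_share = implied_always_keep_fraction(total_gb, sampled_gb, sample_rate)
scaled_bill = 1840 * sampled_gb / total_gb  # assumes cost is linear in GB

print(f"implied always-keep share: {keep_share:.1%}")   # 10.1%
print(f"bill scaled by volume: ${scaled_bill:.0f}/month")  # $202/month
```

Under these assumptions, landing at 22 GB/day with a 1% success rate implies that roughly a tenth of the raw volume is errors or other always-keep traffic. If your error share is lower, the savings would be larger.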

Aggregation architecture

  1. App → structured JSON (correlation ID)
  2. ADOT collector → tail sampling (keep errors + slow)
  3. CloudWatch Logs hot path + Firehose → S3/Glue for audit

Sampling rules

  • Always keep: level=ERROR, http.status>=500, latency > SLO
  • Sample info: 1–5% baseline
  • Never sample security audit events
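The rules above can be sketched as a per-record filter. This is not the collector's tail sampling processor, which buffers whole traces before deciding; it is a simpler hash-based stand-in that gives every line of one trace the same keep/drop decision. The field names (`category`, `latency_ms`), the `security_audit` marker, and the 500 ms SLO are assumptions for illustration.

```python
import hashlib

AUDIT_CATEGORY = "security_audit"  # hypothetical marker for audit events


def should_keep(record, slo_ms=500, sample_rate=0.05):
    """Decide whether a structured log record goes to the hot path."""
    if record.get("category") == AUDIT_CATEGORY:
        return True  # never sample security audit events
    if record.get("level") == "ERROR":
        return True
    if record.get("http.status", 0) >= 500:
        return True
    if record.get("latency_ms", 0) > slo_ms:
        return True
    # Baseline sampling: hash the trace_id so every line of one trace
    # gets the same decision, then compare against the sample rate.
    trace_id = str(record.get("trace_id") or "")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # 0 .. 2**32 - 1
    return bucket < sample_rate * 2**32
```

Checking the always-keep rules before the hash means raising or lowering `sample_rate` never affects errors, slow requests, or audit events.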

Logs Insights

Use Logs Insights for incident search, not as a primary metrics store. Pair it with the cardinality guide.
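For incident search, a typical query pulls every line for one trace in time order. The helper below is a sketch that only builds the query string; it assumes logs are JSON with top-level `trace_id` and `level` fields, and `trace_query` is a hypothetical name.

```python
import re

# W3C trace IDs are 32 lowercase hex characters.
TRACE_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def trace_query(trace_id, limit=200):
    """Build a Logs Insights query that returns one trace's lines in order."""
    if not TRACE_ID_PATTERN.fullmatch(trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    return (
        "fields @timestamp, level, @message"
        f' | filter trace_id = "{trace_id}"'
        " | sort @timestamp asc"
        f" | limit {int(limit)}"
    )


print(trace_query("4bf92f3577b34da6a3ce929d0e0e4736"))
```

The resulting string can be pasted into the Logs Insights console or passed as `queryString` to the boto3 `start_query` call on the CloudWatch Logs client. Validating the ID first keeps user input from altering the query.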

AWS services map

Need | Service | Skip when
Intelligent sampling | ADOT collector tail_sampling | Compliance requires 100% audit retention
Log storage + query | CloudWatch Logs + Insights | Long-term archive → S3 + Athena
Trace-log correlation | OTel + X-Ray / Application Signals | Batch jobs with no request context

What to do this week

  1. Enable the tail sampling processor in the ADOT collector config.
  2. Set log retention tiers (7d hot, 90d S3).
  3. Build a dashboard for ingestion GB/day with anomaly detection.
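The 7-day hot tier from the list above can be scripted. This sketch takes the CloudWatch Logs client as a parameter so it can be exercised without AWS credentials; `apply_hot_retention` and the log group names are hypothetical, while `put_retention_policy` is the boto3 call that sets retention on a log group.

```python
def apply_hot_retention(logs_client, log_group_names, days=7):
    """Set the same CloudWatch Logs retention on each named log group."""
    names = list(log_group_names)
    for name in names:
        logs_client.put_retention_policy(logGroupName=name, retentionInDays=days)
    return len(names)


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   apply_hot_retention(boto3.client("logs"), ["/app/prod/checkout"])
```

For the ingestion dashboard, the `IncomingBytes` metric in the `AWS/Logs` namespace is a reasonable starting point for a GB/day graph.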

More in This Track

Part of the Engineering Guides library (June 2026).

What this guide doesn’t cover

Full OTel stack setup. That is covered in part 1, the canonical post in this track.

Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.
