Tag: opentelemetry

DevOps & CI/CD Part 3

Jun 12, 2026 palaniappan p 2 min

Log Aggregation and Intelligent Sampling with CloudWatch and OpenTelemetry

Ingesting every debug log to CloudWatch is how observability becomes a FinOps incident. Tail sampling with ADOT, Logs Insights, and Firehose to S3 for the long tail.
engineering-guide
observability
cloudwatch
opentelemetry
aws
DevOps & CI/CD Part 1

Jun 9, 2026 palaniappan p 7 min

Observability Beyond CloudWatch (2026): When to Add Application Signals, ADOT, Managed Prometheus, and Grafana — and When Not To

The reflex to bolt Amazon Managed Prometheus + Grafana onto every workload is how observability bills quietly double. CloudWatch Application Signals now gives you an auto-discovered service map, SLOs, and traces with near-zero setup; AMP only earns its keep when you are PromQL-native or drowning in high-cardinality metrics — where ingestion (not retention) is the cost driver. Here is the decision matrix, an ADOT dual-export config, and the three levers that actually cut the AMP bill.
aws
observability
opentelemetry
devops
cost-optimization
engineering-guide
Cloud Architecture

May 8, 2026 palaniappan p 4 min

AWS Observability Costs: Cardinality Budgets & FinOps Limits

CloudWatch Logs Insights bills $0.005 per GB scanned, and high-cardinality custom metrics multiply costs. Cardinality budgets, sampling rules, and FinOps fixes.
amazon-cloudwatch
observability
opentelemetry
aws-xray
finops
cost-optimization
DevOps & CI/CD

Apr 14, 2026 palaniappan p 12 min

Learn Observability by Breaking Things: Inside OTel Demo: The Game

The AWS observability team built a chaos engineering game on top of the official OTel Demo. 44 injected failures. Three signals. One LLM judge. Here's everything inside it.
opentelemetry
observability
chaos-engineering
cloudwatch
x-ray
bedrock
eks
sre
DevOps & CI/CD

Mar 29, 2026 palaniappan p 16 min

How to Debug Production Issues Across Distributed AWS Systems

A 500ms latency spike in a distributed system could be a slow RDS query, a Lambda cold start, a downstream API timeout, or a CloudWatch Logs ingestion delay. Finding the cause requires correlated logs, traces, and metrics — not grep.
how-to-guide
observability
debugging
aws-performance-optimization
cloudwatch
xray
opentelemetry
distributed-tracing
aws
production
logs
metrics