
AWS CloudWatch Observability: Metrics, Logs, and Alarms Best Practices

DevOps & CI/CD · Palaniappan P · 8 min read

Quick summary: A practical guide to AWS CloudWatch for production observability — custom metrics, structured logging, alarm strategies, dashboards, and cost-effective monitoring patterns.


CloudWatch is the observability foundation for every AWS workload. It collects metrics from every AWS service, stores and queries logs, fires alarms, and renders dashboards — all without deploying any monitoring infrastructure. Yet most teams use only a fraction of CloudWatch’s capabilities, missing the monitoring practices that prevent outages and accelerate debugging.

This guide covers the production observability patterns we implement for clients through our managed services and DevOps engagements.

Metrics: What to Monitor

The Four Golden Signals

For every service, monitor these four signals (from Google’s SRE book, applicable to any production system):

| Signal | What It Measures | CloudWatch Metric Example |
| --- | --- | --- |
| Latency | How long requests take | ALB TargetResponseTime p50, p95, p99 |
| Traffic | Request volume | ALB RequestCount, API Gateway Count |
| Errors | Failure rate | ALB HTTPCode_Target_5XX_Count, Lambda Errors |
| Saturation | How full your resources are | EC2 CPUUtilization, RDS FreeStorageSpace |

If you monitor nothing else, monitor these four signals for every service that receives traffic. They tell you whether your service is working (errors), how fast (latency), how busy (traffic), and whether it is running out of capacity (saturation).
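As a sketch of pulling the latency signal programmatically with boto3: the parameter builder below targets ALB TargetResponseTime with extended statistics; the load balancer dimension value (`app/my-alb/abc123`) is a placeholder you would replace with your own.

```python
from datetime import datetime, timedelta, timezone

def latency_query_params(lb_dimension: str) -> dict:
    """Build GetMetricStatistics parameters for ALB latency percentiles."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension}],
        "StartTime": now - timedelta(hours=1),  # last hour of data
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "ExtendedStatistics": ["p50", "p95", "p99"],
    }

params = latency_query_params("app/my-alb/abc123")  # placeholder ALB dimension
# With credentials configured, you would run:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```

The same shape works for the other three signals by swapping namespace, metric name, and statistic (for example, Stat "Sum" for RequestCount).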

Custom Metrics

AWS built-in metrics cover infrastructure. Custom metrics from your application’s business logic are often more valuable:

Application metrics:

  • Orders processed per minute
  • Payment success/failure rate
  • User registration rate
  • API response times by endpoint
  • Queue depth and processing lag

Using CloudWatch Embedded Metric Format (EMF):

{
  "_aws": {
    "Timestamp": 1648657200000,
    "CloudWatchMetrics": [
      {
        "Namespace": "MyApp",
        "Dimensions": [["Service", "Environment"]],
        "Metrics": [{ "Name": "OrdersProcessed", "Unit": "Count" }]
      }
    ]
  },
  "Service": "OrderService",
  "Environment": "production",
  "OrdersProcessed": 42
}

EMF lets you emit custom metrics as structured JSON log lines. CloudWatch extracts metrics automatically — no API calls needed, no SDK dependency, and each metric costs the same as a standard CloudWatch custom metric.

Metric Math

Combine metrics for more meaningful signals:

  • Error rate: (Errors / Invocations) * 100 — More useful than raw error count because it normalizes for traffic volume
  • Availability: ((TotalRequests - 5xxErrors) / TotalRequests) * 100
  • Cache hit ratio: CacheHits / (CacheHits + CacheMisses) * 100

Metric Math computes derived metrics without additional cost beyond the source metrics.
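The error-rate expression above can also be evaluated programmatically. As a boto3 sketch, the builder below assembles GetMetricData queries for a Lambda error rate; the function name is a hypothetical placeholder.

```python
def error_rate_queries(function_name: str) -> list:
    """MetricDataQueries computing Lambda error rate via Metric Math."""
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName",
                                    "Value": function_name}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input to the expression, not returned itself
        }
    return [
        {"Id": "errorRate",
         "Expression": "(errors / invocations) * 100",
         "Label": "Error rate (%)",
         "ReturnData": True},
        stat("errors", "Errors"),
        stat("invocations", "Invocations"),
    ]

queries = error_rate_queries("checkout-handler")  # hypothetical function name
# With credentials: boto3.client("cloudwatch").get_metric_data(
#     MetricDataQueries=queries, StartTime=..., EndTime=...)
```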

Logs: Structured Logging

Why Structured Logging Matters

Unstructured logs (free-text strings) are human-readable but machine-hostile:

[2026-06-05 14:30:22] ERROR: Payment failed for user 12345, amount $99.99, reason: card_declined

Structured logs (JSON) are both human-readable and queryable:

{
  "timestamp": "2026-06-05T14:30:22Z",
  "level": "ERROR",
  "message": "Payment failed",
  "userId": "12345",
  "amount": 99.99,
  "currency": "USD",
  "reason": "card_declined",
  "requestId": "abc-123",
  "traceId": "1-abc-def"
}

With structured logs, you can query: “Show me all payment failures over $100 in the last hour” using CloudWatch Logs Insights:

fields @timestamp, userId, amount, reason
| filter level = "ERROR" and message = "Payment failed" and amount > 100
| sort @timestamp desc
| limit 50
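A minimal structured logger that emits lines like the payment-failure example, as a stdlib-only Python sketch (a real service would typically add traceId propagation and a logging-framework formatter):

```python
import json
import sys
from datetime import datetime, timezone

def log_event(level: str, message: str, **fields) -> str:
    """Emit one structured JSON log line to stdout and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        **fields,  # arbitrary key/value context, queryable in Logs Insights
    }
    line = json.dumps(record)
    print(line, file=sys.stdout)
    return line

line = log_event("ERROR", "Payment failed",
                 userId="12345", amount=99.99, currency="USD",
                 reason="card_declined", requestId="abc-123")
```

Because every field is a top-level JSON key, Logs Insights can filter and aggregate on it without any parse step.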

Logs Insights Query Patterns

Error investigation:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as errorCount by bin(5m)
| sort errorCount desc

Latency analysis:

fields @timestamp, duration, endpoint
| filter ispresent(duration)
| stats avg(duration) as avgDuration, pct(duration, 95) as p95, pct(duration, 99) as p99 by endpoint
| sort p99 desc

Request tracing:

fields @timestamp, @message
| filter requestId = "abc-123"
| sort @timestamp asc

Log Cost Optimization

CloudWatch Logs can become expensive. At $0.50/GB ingested and $0.03/GB stored, a verbose application logging 100 GB/day costs $1,500/month in ingestion alone.

Cost reduction strategies:

  • Log levels — Use DEBUG only in development, INFO for normal operations, WARN/ERROR for problems. Never log full request/response bodies in production.
  • Sampling — Log 1 in 10 successful requests but log every error. This reduces volume by 90% while retaining all failure data.
  • Retention policies — Set log group retention to 30 days for application logs, 90 days for security logs, and 1 year for audit logs. Default retention is forever.
  • Log class — Use CloudWatch Logs Infrequent Access class for logs that are rarely queried but must be retained for compliance. 50% cheaper for ingestion.
  • Lambda log destinations (available since May 2025) — Lambda supports S3 and Kinesis Firehose as direct log destinations in addition to CloudWatch Logs. Route high-volume Lambda logs directly to S3 at $0.023/GB storage cost — a 95% reduction compared to CloudWatch Logs ingestion at $0.50/GB. Use this for analytics logs, access logs, and request/response logging. Reserve CloudWatch for error logs and metrics that need real-time alerting.
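The sampling strategy above can be sketched in a few lines. This version uses a deterministic counter (1 in N successes) rather than random sampling so behavior is predictable; both approaches are common.

```python
import itertools

_counter = itertools.count(1)

def should_log(is_error: bool, sample_rate: int = 10) -> bool:
    """Keep every error; keep 1 in `sample_rate` successful requests."""
    if is_error:
        return True  # never drop failure data
    return next(_counter) % sample_rate == 0

# 100 successes -> 10 kept; 5 errors -> all 5 kept.
kept = sum(should_log(False) for _ in range(100)) \
     + sum(should_log(True) for _ in range(5))
```

This cuts success-path volume by 90% while preserving every error line for investigation.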

CloudWatch Live Tail (CLI support added June 2024): Live Tail streams log events in real time as they arrive, like tail -f for CloudWatch Logs. CLI support enables aws logs tail --follow for local terminal log streaming without opening the AWS Console. Useful for watching Lambda executions, following deployment logs, or debugging production issues in real time.

Alarms: Alert on What Matters

Alarm Design Principles

Alert on symptoms, not causes. An alarm on “API error rate > 5%” is more useful than “EC2 CPU > 80%.” High CPU might not cause user impact; high error rate definitely does.

Alert on rates, not counts. “50 errors in 5 minutes” means different things at different traffic levels. “Error rate > 2%” is meaningful regardless of scale.

Set thresholds based on data, not guesses. Use CloudWatch Anomaly Detection for metrics with variable baselines (traffic volume, latency during peak hours). Anomaly Detection learns your metric patterns and alerts on deviations.
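As a boto3 sketch of the Anomaly Detection advice, the builder below assembles PutMetricAlarm parameters that alarm when p99 Lambda duration leaves a learned band; the alarm and function names are placeholders.

```python
def anomaly_alarm_params(function_name: str) -> dict:
    """PutMetricAlarm parameters using a CloudWatch anomaly detection band."""
    return {
        "AlarmName": f"{function_name}-duration-anomaly",  # placeholder name
        "EvaluationPeriods": 3,  # require 3 consecutive breaching periods
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "ThresholdMetricId": "band",  # alarm when m1 exceeds the band
        "Metrics": [
            {"Id": "m1",
             "MetricStat": {
                 "Metric": {"Namespace": "AWS/Lambda",
                            "MetricName": "Duration",
                            "Dimensions": [{"Name": "FunctionName",
                                            "Value": function_name}]},
                 "Period": 300,
                 "Stat": "p99"},
             "ReturnData": True},
            {"Id": "band",
             "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",  # 2 std-dev band
             "Label": "Expected p99",
             "ReturnData": True},
        ],
    }

params = anomaly_alarm_params("checkout-handler")  # hypothetical function
# boto3.client("cloudwatch").put_metric_alarm(**params) would create it.
```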

Alarm Tiers

| Tier | Severity | Response | Example |
| --- | --- | --- | --- |
| P1 — Critical | Service down or data loss | Immediate (PagerDuty, phone call) | API error rate > 10%, database unreachable |
| P2 — High | Degraded performance | Within 1 hour (Slack, email) | p99 latency > 2s, disk > 85% |
| P3 — Warning | Potential issue | Next business day | Memory trending up, cost anomaly |
| P4 — Info | Informational | Review in weekly ops meeting | New deployment, scaling event |

Composite Alarms

Reduce alert noise by combining multiple alarms:

Composite Alarm: "Service Degraded"
  = (ErrorRateAlarm IN ALARM) AND (LatencyAlarm IN ALARM)

A composite alarm fires only when both error rate AND latency are problematic — reducing false positives from temporary latency spikes that do not affect error rate.
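A boto3 sketch of creating the composite alarm above (the child alarm names match the rule in the example; the SNS topic ARN is a placeholder):

```python
def composite_alarm_params() -> dict:
    """PutCompositeAlarm parameters combining two child alarms with AND."""
    return {
        "AlarmName": "ServiceDegraded",
        # Fires only when BOTH children are in ALARM state,
        # suppressing pages for isolated latency blips.
        "AlarmRule": "ALARM(ErrorRateAlarm) AND ALARM(LatencyAlarm)",
        "AlarmActions": [
            "arn:aws:sns:us-east-1:123456789012:oncall",  # placeholder ARN
        ],
    }

params = composite_alarm_params()
# boto3.client("cloudwatch").put_composite_alarm(**params) would create it.
```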

Alarm Actions

| Action | AWS Service | Use Case |
| --- | --- | --- |
| SNS notification | SNS → Email/Slack/PagerDuty | Alert humans |
| Auto Scaling | Auto Scaling policy | Scale up/down based on metric |
| Lambda function | Lambda | Automated remediation |
| Systems Manager | SSM Automation | Run remediation runbook |
| EventBridge | EventBridge rule | Trigger complex workflows |

Dashboards

Dashboard Design

One dashboard per service/team. A dashboard that shows everything shows nothing. Create focused dashboards:

  • Executive dashboard — Availability, error rates, costs (updated daily)
  • Service dashboard — Golden signals for each service (real-time)
  • Infrastructure dashboard — EC2, RDS, ElastiCache resource utilization
  • Cost dashboard — Daily spend by service, anomaly indicators

Dashboard Best Practices

  • Time-align all widgets — Use the dashboard time picker, not per-widget time ranges
  • Red/yellow/green indicators — Use CloudWatch alarm status widgets that show health at a glance
  • Include context — Add text widgets explaining what each metric means and what “normal” looks like
  • Auto-refresh — Set dashboards to refresh every 1-5 minutes for operational views
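Dashboards are defined as JSON and can be created via PutDashboard. As a minimal sketch combining two of the practices above (a text widget for context plus one metric widget), with a hypothetical service name:

```python
import json

def dashboard_body(service: str, region: str = "us-east-1") -> str:
    """Minimal CloudWatch dashboard JSON: one text widget, one metric widget."""
    body = {
        "widgets": [
            {"type": "text", "x": 0, "y": 0, "width": 24, "height": 2,
             "properties": {
                 # Context widget: explain what "normal" looks like.
                 "markdown": f"## {service} golden signals. "
                             "Normal p99 latency is under 500 ms."}},
            {"type": "metric", "x": 0, "y": 2, "width": 12, "height": 6,
             "properties": {
                 "title": "Errors",
                 "region": region,
                 "metrics": [["AWS/Lambda", "Errors",
                              "FunctionName", service]],
                 "period": 300,
                 "stat": "Sum"}},
        ]
    }
    return json.dumps(body)

body = dashboard_body("checkout-handler")  # hypothetical service
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="checkout", DashboardBody=body)
```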

CloudWatch Application Signals: Native APM

CloudWatch Application Signals became generally available in June 2024 and is the most significant observability addition to CloudWatch in years. Application Signals provides application performance monitoring (APM) capabilities natively within CloudWatch, without requiring third-party APM tools:

What Application Signals provides:

  • Auto-instrumentation — Automatic trace collection for Java, Python, and Node.js applications using AWS Distro for OpenTelemetry (no code changes required beyond enabling the agent)
  • Service map — Visual representation of service-to-service dependencies with latency and error rates per edge
  • SLO management — Define and track Service Level Objectives (latency and availability targets) with built-in breach alerting
  • Standard metrics — Automatically calculates availability, latency (p50/p90/p99), fault rate, and error rate per service

Application Signals AI-powered debugging (November 2025): Application Signals now includes AI-assisted root cause analysis. When an SLO breach or error spike is detected, Application Signals automatically correlates traces, metrics, and logs to propose probable root causes — reducing mean time to diagnosis from minutes to seconds.

Cost: Application Signals is priced per monitored host/function per month, with a 30-day free trial. For most teams, it replaces or supplements third-party APM tools at a lower cost with better AWS service correlation.

X-Ray and OpenTelemetry

For serverless applications and microservices, distributed tracing is essential for debugging latency and failures across services.

Client → API Gateway → Lambda A → DynamoDB
                     → Lambda B → SQS → Lambda C → S3

X-Ray SDK deprecation: AWS has deprecated the X-Ray SDK in favor of AWS Distro for OpenTelemetry (ADOT), which implements the OpenTelemetry standard and provides the same AWS service correlation that X-Ray SDK provided. If you are instrumenting new services, use ADOT directly. Existing X-Ray SDK instrumentation continues to work, but new feature development is focused on the OpenTelemetry path.

Benefits of migrating to ADOT/OpenTelemetry:

  • Industry-standard instrumentation portable to any observability backend
  • Richer auto-instrumentation libraries with broader framework support
  • Compatible with Application Signals for automatic SLO tracking
  • No vendor lock-in — export traces to CloudWatch X-Ray, Jaeger, Zipkin, or any OTLP endpoint

CloudWatch vs Third-Party Tools

| Factor | CloudWatch | Datadog/New Relic/Grafana Cloud |
| --- | --- | --- |
| Cost (small env) | $50-$200/month | $200-$500/month |
| Cost (large env) | $500-$2,000/month | $5,000-$20,000/month |
| AWS integration | Native (zero config) | Agent/integration required |
| Custom dashboards | Good | Excellent |
| APM depth | X-Ray (good) | Excellent |
| Log analytics | Logs Insights (good) | Excellent (more intuitive) |
| Multi-cloud | AWS only | Multi-cloud |

Our recommendation: Start with CloudWatch. It provides 80% of the observability most teams need at a fraction of the cost. Add a third-party tool when you need deeper APM, more intuitive dashboards, or multi-cloud visibility.

For cost optimization, CloudWatch’s native integration and lower cost make it the default choice for AWS-only environments.

Getting Started

Observability is not something you add after launching — it is something you build alongside your application. Start with the four golden signals, add structured logging, and build dashboards before you need them. The time to set up monitoring is not during an outage.

When observability becomes the problem: For the cost failure patterns that observability generates — high-cardinality metric explosion, debug logging in production, X-Ray sampling at 100%, and retention misconfigurations that quietly accumulate terabytes of stored history — see Logging Yourself Into Bankruptcy from The AWS Cost Trap series.

For CloudWatch setup and monitoring strategy as part of our managed services, or for observability architecture design, talk to our team.

Contact us to implement production observability →
