
How to Use AWS Cost Anomaly Detection to Catch Surprise Bills

Quick summary: AWS Cost Anomaly Detection uses machine learning to flag unusual spending patterns — runaway EC2 instances, unexpected Lambda spikes, or compromised credentials. This guide covers setup, alerting, and automation to prevent bill shock.


AWS Cost Anomaly Detection is an ML service that watches your spending and alerts you when costs spike unexpectedly. Instead of discovering a $50K surprise bill at month-end, Anomaly Detection flags the issue within hours.

This guide covers setting up Anomaly Detection, configuring alerts, and automating remediation to prevent bill shock.

Optimizing AWS Costs? FactualMinds helps teams implement FinOps practices and cost governance. See our cost optimization services or talk to our team.

Step 1: Understand Anomaly Detection

Anomaly Detection learns your normal spending pattern and flags deviations:

Baseline Period (1-3 months)
  → EC2: $500/day average
  → Lambda: $50/day average
  → S3: $100/day average

Day 1 (Normal)
  → EC2: $520/day (4% variance, expected)
  → Lambda: $48/day (4% variance, normal)
  ✓ No alert

Day 2 (Anomaly)
  → EC2: $2,500/day (400% spike!)
  → Lambda: $50/day (normal)
  ⚠ ALERT: EC2 spending 5x above baseline

Key concepts:

  • Baseline: Average spending over 1-3 months
  • Threshold: How much variance before alerting (default 80% increase)
  • Frequency: Near-real-time detection (alerts typically within 24 hours)
  • Scope: Monitor all AWS or specific services/accounts
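The threshold logic above can be sketched in a few lines of Python (the `is_anomalous` helper and the numbers are illustrative; the real service uses an ML baseline, not a fixed percentage):

```python
def is_anomalous(baseline_daily, actual_daily, threshold_pct=80):
    """Return True if actual spend exceeds the baseline by more than threshold_pct."""
    return actual_daily > baseline_daily * (1 + threshold_pct / 100)

# Day 1: EC2 at $520 vs. a $500 baseline -- small variance, no alert
print(is_anomalous(500, 520))    # False

# Day 2: EC2 at $2,500 vs. a $500 baseline -- 400% spike, alert
print(is_anomalous(500, 2500))   # True
```

With an 80% threshold and a $1,000/day baseline, the alert fires above $1,800/day, matching the rule of thumb in Step 3.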

Step 2: Enable Cost Anomaly Detection

Go to AWS Billing → Cost Management → Anomaly Detection:

  1. Click Create monitor
  2. Name: production-spending-monitor
  3. Monitoring scope:
    • Option A: All AWS spending (broadest)
    • Option B: Specific services (EC2, Lambda, RDS, etc.)
    • Option C: Specific linked accounts (if using Organizations)
  4. Select option A (monitor all spending) for now
  5. Click Create
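The console steps above can also be done programmatically. A minimal sketch, assuming boto3 and Cost Explorer permissions; the API call is shown commented out so the payload shape is the focus:

```python
def build_monitor_payload(name="production-spending-monitor"):
    # DIMENSIONAL scope on the SERVICE dimension is the API equivalent of
    # option A above (monitor all AWS spending, broken out by service)
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }

# import boto3
# ce = boto3.client("ce")  # Anomaly Detection lives in the Cost Explorer API
# response = ce.create_anomaly_monitor(AnomalyMonitor=build_monitor_payload())
# print(response["MonitorArn"])
```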

Step 3: Set Alert Threshold

  1. In the monitor, click Edit
  2. Anomaly threshold: Set to 80% (default)
    • Alerts when spending increases >80% from baseline
    • If your daily spend is $1,000, alerts when it hits $1,800+
  3. Frequency: Daily report (default)
  4. Baseline period: 1 month minimum (use 3 months for accuracy)
  5. Click Save

Step 4: Configure Alert Notifications

Email Alerts

  1. Go to monitor → Alerts → Add alert
  2. Type: Email
  3. Recipients: ops-team@company.com
  4. Click Create

You’ll receive a daily email if anomalies are detected.

SNS Alerts (For Automation)

  1. Click Add alert
  2. Type: SNS
  3. SNS Topic: Create or select SNS topic
    aws sns create-topic --name cost-anomaly-alerts
  4. Click Create

SNS allows downstream automation (Lambda, Slack, etc.).
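The SNS alert can likewise be attached via the Cost Explorer `CreateAnomalySubscription` API. A sketch, assuming boto3; the subscription name and the $100 impact floor are illustrative. Note that SNS subscribers require the IMMEDIATE frequency, while EMAIL subscribers use DAILY or WEEKLY summaries:

```python
def build_subscription_payload(monitor_arn, topic_arn):
    return {
        "SubscriptionName": "cost-anomaly-to-sns",   # illustrative name
        "Frequency": "IMMEDIATE",                    # required for SNS subscribers
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "SNS", "Address": topic_arn}],
        # Only alert when the anomaly's absolute dollar impact is at least $100
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": ["100"],
            }
        },
    }

# import boto3
# ce = boto3.client("ce")
# ce.create_anomaly_subscription(AnomalySubscription=build_subscription_payload(
#     monitor_arn="arn:aws:ce::123456789012:anomalymonitor/example",
#     topic_arn="arn:aws:sns:us-east-1:123456789012:cost-anomaly-alerts",
# ))
```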

Step 5: Create Per-Service Monitors

Create separate monitors to avoid cross-service false positives:

Monitor 1: EC2 Spending

  1. Create monitor → EC2 only
  2. Threshold: 80%
  3. Alerts only if EC2 spikes (ignores Lambda/S3 changes)

Monitor 2: Lambda Spending

  1. Create monitor → Lambda only
  2. Threshold: 100% (Lambda is variable, higher threshold)
  3. Alerts only if Lambda costs double

Monitor 3: Data Transfer

  1. Create monitor → Data Transfer only
  2. Threshold: 150% (Data transfer is often bursty)

This way, a spike in one service doesn’t trigger noise from others.
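The three monitors and their thresholds can be captured as a plain config table for your own tooling (this is not an AWS API payload; the names and percentages follow the monitors above):

```python
MONITORS = [
    {"name": "ec2-spending",    "scope": "EC2",           "threshold_pct": 80},
    {"name": "lambda-spending", "scope": "Lambda",        "threshold_pct": 100},
    {"name": "data-transfer",   "scope": "Data Transfer", "threshold_pct": 150},
]

def should_alert(monitor, baseline_daily, actual_daily):
    """Apply one monitor's threshold to that service's daily spend."""
    return actual_daily > baseline_daily * (1 + monitor["threshold_pct"] / 100)

# Lambda doubles from $50 to $110/day -> alert (threshold is 100%)
print(should_alert(MONITORS[1], 50, 110))   # True
# EC2 rises from $500 to $600/day (20%) -> no alert at 80%
print(should_alert(MONITORS[0], 500, 600))  # False
```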

Step 6: Integrate with SNS for Notifications

Set up Slack or custom alerts via SNS:

Slack Integration

  1. Create a Slack app and get webhook URL
  2. Create Lambda to forward SNS to Slack:
import json
import os
import urllib3

def lambda_handler(event, context):
    # Parse SNS message
    message = json.loads(event['Records'][0]['Sns']['Message'])

    # Extract anomaly info
    monitor_name = message['anomalyName']
    anomaly_severity = message['anomalySeverity']
    cost_increase = message.get('costImpact', 'Unknown')

    # Create Slack message
    slack_message = {
        'text': f':warning: Cost Anomaly Detected!',
        'blocks': [
            {
                'type': 'section',
                'text': {
                    'type': 'mrkdwn',
                    'text': f'*Monitor:* {monitor_name}\n*Severity:* {anomaly_severity}\n*Cost Increase:* {cost_increase}'
                }
            },
            {
                'type': 'section',
                'text': {
                    'type': 'mrkdwn',
                    'text': '<https://console.aws.amazon.com/cost-management|View in AWS Console>'
                }
            }
        ]
    }

    # Post to Slack
    http = urllib3.PoolManager()
    http.request(
        'POST',
        os.environ['SLACK_WEBHOOK'],
        body=json.dumps(slack_message),
        headers={'Content-Type': 'application/json'}
    )

    return {'statusCode': 200}

Deploy Lambda:

# --role is required; substitute your Lambda execution role ARN
aws lambda create-function \
  --function-name cost-anomaly-to-slack \
  --runtime python3.11 \
  --role arn:aws:iam::123456789012:role/cost-anomaly-lambda-role \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://function.zip \
  --environment 'Variables={SLACK_WEBHOOK=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX}'

# Allow SNS to invoke the Lambda
aws lambda add-permission \
  --function-name cost-anomaly-to-slack \
  --statement-id sns-invoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:us-east-1:123456789012:cost-anomaly-alerts

# Subscribe Lambda to SNS topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:cost-anomaly-alerts \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:cost-anomaly-to-slack

Step 7: Investigate Anomalies

When alerted, check the anomaly:

  1. Go to Billing → Anomaly Detection → Anomalies
  2. Click anomaly to view details:
    • Service: Which service spiked (EC2, Lambda, etc.)
    • Date: When spike occurred
    • Estimated cost: Impact ($500, $5K, etc.)
    • Baseline vs. Actual: Comparison chart
  3. Click View details to investigate

Example investigation:

Anomaly: EC2 spending spiked from $500 to $2,000 on 2026-04-02
Action: Check EC2 console for new instances
Found: 16x c5.24xlarge instances running (combined cost: ~$1,500/day)
Root cause: Auto Scaling group scaled out due to a traffic spike (legitimate)
Resolution: Cap the Auto Scaling group's maximum size, or accept the cost if the traffic is expected
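Anomalies can also be pulled programmatically for triage via the Cost Explorer `GetAnomalies` API. A sketch, assuming boto3; only the date-window helper runs here, and the API call is shown commented out:

```python
from datetime import date, timedelta

def last_n_days(n=7, today=None):
    """Build the DateInterval payload that GetAnomalies expects."""
    today = today or date.today()
    return {
        "StartDate": (today - timedelta(days=n)).isoformat(),
        "EndDate": today.isoformat(),
    }

# import boto3
# ce = boto3.client("ce")
# for anomaly in ce.get_anomalies(DateInterval=last_n_days())["Anomalies"]:
#     print(anomaly["AnomalyId"],
#           anomaly["DimensionValue"],          # e.g. the service that spiked
#           anomaly["Impact"]["TotalImpact"])   # estimated dollar impact
```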

Step 8: Automate Remediation (Non-Production)

For staging/dev environments, automatically shut down resources:

import json
import boto3

def lambda_handler(event, context):
    # Parse anomaly from SNS
    message = json.loads(event['Records'][0]['Sns']['Message'])
    service = message['service']

    if service == 'EC2':
        # Stop all untagged instances in staging
        ec2 = boto3.client('ec2')
        instances = ec2.describe_instances(
            Filters=[
                {'Name': 'tag:Environment', 'Values': ['staging']},
                {'Name': 'instance-state-name', 'Values': ['running']}
            ]
        )

        stopped = []
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                print(f"Stopping {instance['InstanceId']}")
                stopped.append(instance['InstanceId'])

        if stopped:
            ec2.stop_instances(InstanceIds=stopped)

        # Alert team
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:ops-alerts',
            Subject='Cost Anomaly: Stopped Staging Instances',
            Message=f'Stopped {len(stopped)} staging instances due to cost anomaly'
        )

    return {'statusCode': 200}

Step 9: Common Anomaly Patterns and Responses

Pattern 1: Runaway Lambda (Infinite Loop)

Alert: Lambda costs increase 10x
Investigation: Check Lambda logs and CloudWatch metrics
Action: (1) Temporarily disable the function trigger, (2) fix the code, (3) redeploy

Pattern 2: Crypto Mining (Compromised Credentials)

Alert: EC2 CPU usage at 100%, spending spikes 20x
Investigation: Check EC2 instance SSH logs and running processes
Action: (1) Terminate the instances immediately, (2) rotate credentials, (3) review IAM access logs

Pattern 3: Forgotten Dev Environment

Alert: RDS spending increases 5x (new database created)
Investigation: Check RDS instances; find a dev instance left running
Action: (1) Stop or delete the non-production database, (2) set up automation to stop dev instances after hours

Pattern 4: Data Transfer Spike

Alert: Data Transfer cost increases 30x
Investigation: Check CloudFront, NAT Gateway, or inter-region transfer
Action: (1) Review the distribution, (2) optimize caching, (3) consider edge locations

Step 10: Cost Anomaly Prevention

Pattern 1: Tagging Policy

Tag all resources with Environment, Owner, CostCenter:

aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=Environment,Value=production Key=Owner,Value=team-a Key=CostCenter,Value=engineering

Use tags to:

  • Create monitors per environment (production alert at higher threshold)
  • Alert cost center owner (not general ops)
  • Audit untagged resources (likely abandoned)
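An untagged-resource audit can be sketched as pure logic over records shaped like `describe_instances` output (the instance records here are hypothetical):

```python
REQUIRED_TAGS = {"Environment", "Owner", "CostCenter"}

def missing_tags(instance):
    """Return the required tag keys absent from one instance record."""
    present = {t["Key"] for t in instance.get("Tags", [])}
    return REQUIRED_TAGS - present

# Hypothetical records in the shape EC2's describe_instances returns
tagged = {"InstanceId": "i-1234567890abcdef0",
          "Tags": [{"Key": "Environment", "Value": "production"},
                   {"Key": "Owner", "Value": "team-a"},
                   {"Key": "CostCenter", "Value": "engineering"}]}
orphan = {"InstanceId": "i-0fedcba0987654321", "Tags": []}

print(missing_tags(tagged))  # empty set: fully tagged
print(missing_tags(orphan))  # all three required keys: likely abandoned
```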

Pattern 2: Budget Alerts (In Addition to Anomaly Detection)

AWS Budgets set hard thresholds:

aws budgets create-budget \
  --account-id 123456789012 \
  --budget 'BudgetName=monthly-budget,BudgetLimit={Amount=10000,Unit=USD},TimeUnit=MONTHLY,BudgetType=COST' \
  --notifications-with-subscribers 'NotificationWithSubscribers={Notification={ComparisonOperator=GREATER_THAN,NotificationType=FORECASTED,Threshold=80},Subscribers=[{SubscriptionType=EMAIL,Address=ops@company.com}]}'

This alerts if you’re forecasted to hit 80% of monthly budget.

Pattern 3: Service Quotas

Keep quotas low to limit the damage of a bug. Note that this API can only raise a quota, so the guardrail is leaving defaults low or requesting only a modest increase:

aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 10
# L-1216C47A is the "Running On-Demand Standard instances" quota, measured
# in vCPUs; a cap of 10 vCPUs prevents a 1,000-instance runaway

Common Mistakes

  1. Not checking baseline period

    • Anomaly Detection needs 1+ month of data
    • If enabled on day 1, won’t alert for first month
  2. Too-low threshold

    • Threshold 30%: alerts on every traffic spike (noise)
    • Better: 80% for prod, 100% for services with variance
  3. Not investigating root cause

    • Alert comes in, you panic and shut everything down
    • Usually, the spike is legitimate (traffic spike, promo day, etc.)
    • Investigate first, remediate second
  4. Ignoring early warnings

    • Anomaly says EC2 is increasing gradually (not a spike)
    • Ignore it, bill ends up $20K overrun
    • Gradual increases are harder to catch; set budget alerts too

Next Steps

  1. Enable Anomaly Detection (5 mins)
  2. Create monitors by service (15 mins)
  3. Configure SNS alerts (10 mins)
  4. Integrate with Slack (30 mins)
  5. Test with a known cost increase (run expensive query)
  6. Investigate and respond to first anomaly
  7. Talk to FactualMinds if you need help setting up FinOps practices or building cost governance
Palaniappan P

AWS Cloud Architect & AI Expert

AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.

AWS Architecture · Cloud Migration · GenAI on AWS · Cost Optimization · DevOps

Ready to discuss your AWS strategy?

Our certified architects can help you implement these solutions.
