How to Use AWS Cost Anomaly Detection to Catch Surprise Bills
Quick summary: AWS Cost Anomaly Detection uses machine learning to flag unusual spending patterns — runaway EC2 instances, unexpected Lambda spikes, or compromised credentials. This guide covers setup, alerting, and automation to prevent bill shock.
Key Takeaways
- AWS Cost Anomaly Detection uses machine learning to flag unusual spending patterns — runaway EC2 instances, unexpected Lambda spikes, or compromised credentials
- Setup involves creating monitors (ideally one per service), tuning alert thresholds, and wiring SNS notifications into Slack or automated remediation
AWS Cost Anomaly Detection is an ML service that watches your spending and alerts you when costs spike unexpectedly. Instead of discovering a $50K surprise bill at month-end, Anomaly Detection flags the issue within hours.
This guide covers setting up Anomaly Detection, configuring alerts, and automating remediation to prevent bill shock.
Optimizing AWS Costs? FactualMinds helps teams implement FinOps practices and cost governance. See our cost optimization services or talk to our team.
Step 1: Understand Anomaly Detection
Anomaly Detection learns your normal spending pattern and flags deviations:
Baseline Period (1-3 months)
→ EC2: $500/day average
→ Lambda: $50/day average
→ S3: $100/day average
Day 1 (Normal)
→ EC2: $520/day (5% variance, expected)
→ Lambda: $48/day (4% variance, normal)
✓ No alert
Day 2 (Anomaly)
→ EC2: $2,500/day (400% spike!)
→ Lambda: $50/day (normal)
⚠ ALERT: EC2 spending 5x above baseline
Key concepts:
- Baseline: Average spending over 1-3 months
- Threshold: How much variance before alerting (default 80% increase)
- Frequency: Detection runs several times per day; alerts typically arrive within 24 hours of the spike
- Scope: Monitor all AWS or specific services/accounts
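The threshold math above can be sketched as a couple of small helpers (hypothetical function names, matching the 80% default described in Step 3):

```python
def alert_price(baseline_daily: float, threshold_pct: float = 80.0) -> float:
    """Daily spend at which an anomaly alert would fire."""
    return baseline_daily * (1 + threshold_pct / 100)

def is_anomalous(actual_daily: float, baseline_daily: float,
                 threshold_pct: float = 80.0) -> bool:
    """True when spend exceeds the baseline by more than the threshold."""
    return actual_daily > alert_price(baseline_daily, threshold_pct)

# With a $1,000/day baseline and the 80% default, alerts start at $1,800
print(alert_price(1000))          # 1800.0
print(is_anomalous(2500, 500))    # True  (the Day 2 EC2 spike above)
print(is_anomalous(520, 500))     # False (normal 4-5% variance)
```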
Step 2: Enable Cost Anomaly Detection
Go to AWS Billing → Cost Management → Anomaly Detection:
- Click Create monitor
- Name: production-spending-monitor
- Monitoring scope:
- Option A: All AWS spending (broadest)
- Option B: Specific services (EC2, Lambda, RDS, etc.)
- Option C: Specific linked accounts (if using Organizations)
- Select option A (monitor all spending) for now
- Click Create
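The same monitor can also be created programmatically through the Cost Explorer API. The sketch below only builds the request payload; field names follow the CreateAnomalyMonitor API as I understand it, so verify against the boto3 docs, and the actual call (commented out) requires AWS credentials:

```python
def monitor_payload(name: str, dimension: str = "SERVICE") -> dict:
    """Request body for Cost Explorer's CreateAnomalyMonitor.

    A DIMENSIONAL monitor on SERVICE corresponds to 'Option A'
    above: all spending, segmented by AWS service.
    """
    return {
        "AnomalyMonitor": {
            "MonitorName": name,
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": dimension,
        }
    }

payload = monitor_payload("production-spending-monitor")
# To actually create it (needs credentials and ce:CreateAnomalyMonitor):
# import boto3
# boto3.client("ce").create_anomaly_monitor(**payload)
print(payload["AnomalyMonitor"]["MonitorName"])  # production-spending-monitor
```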
Step 3: Set Alert Threshold
- In the monitor, click Edit
- Anomaly threshold: Set to 80% (default)
- Alerts when spending increases >80% from baseline
- If your daily spend is $1,000, alerts when it hits $1,800+
- Frequency: Daily report (default)
- Baseline period: 1 month minimum (use 3 months for accuracy)
- Click Save
Step 4: Configure Alert Notifications
Email Alerts
- Go to monitor → Alerts → Add alert
- Type: Email
- Recipients: ops-team@company.com
- Click Create
You’ll receive a daily email when anomalies are detected.
SNS Alerts (For Automation)
- Click Add alert
- Type: SNS
- SNS Topic: Create or select SNS topic
aws sns create-topic --name cost-anomaly-alerts
- Click Create
SNS allows downstream automation (Lambda, Slack, etc.).
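Alert subscriptions can likewise be defined via the Cost Explorer API. This sketch builds the request payload only (field names per the CreateAnomalySubscription API as I understand it; both ARNs are placeholders; verify against the boto3 docs before relying on it):

```python
def subscription_payload(name: str, monitor_arn: str,
                         sns_topic_arn: str,
                         threshold_usd: float = 100.0) -> dict:
    """Request body for Cost Explorer's CreateAnomalySubscription.

    Threshold is the dollar impact above which an alert fires.
    IMMEDIATE frequency is intended for SNS subscribers; email
    subscribers use DAILY or WEEKLY instead.
    """
    return {
        "AnomalySubscription": {
            "SubscriptionName": name,
            "Threshold": threshold_usd,
            "Frequency": "IMMEDIATE",
            "MonitorArnList": [monitor_arn],
            "Subscribers": [{"Type": "SNS", "Address": sns_topic_arn}],
        }
    }

payload = subscription_payload(
    "cost-anomaly-sns",
    "arn:aws:ce::123456789012:anomalymonitor/example",      # placeholder
    "arn:aws:sns:us-east-1:123456789012:cost-anomaly-alerts",
)
# boto3.client("ce").create_anomaly_subscription(**payload)
```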
Step 5: Create Monitor by Service (Optional but Recommended)
Create separate monitors to avoid cross-service false positives:
Monitor 1: EC2 Spending
- Create monitor → EC2 only
- Threshold: 80%
- Alerts only if EC2 spikes (ignores Lambda/S3 changes)
Monitor 2: Lambda Spending
- Create monitor → Lambda only
- Threshold: 100% (Lambda is variable, higher threshold)
- Alerts only if Lambda costs double
Monitor 3: Data Transfer
- Create monitor → Data Transfer only
- Threshold: 150% (Data transfer is often bursty)
This way, a spike in one service doesn’t trigger noise from others.
Step 6: Integrate with SNS for Notifications
Set up Slack or custom alerts via SNS:
Slack Integration
- Create a Slack app and get webhook URL
- Create Lambda to forward SNS to Slack:
import json
import os
import urllib3

def lambda_handler(event, context):
    # Parse the SNS message forwarded by Cost Anomaly Detection
    message = json.loads(event['Records'][0]['Sns']['Message'])

    # Extract anomaly info
    monitor_name = message['anomalyName']
    anomaly_severity = message['anomalySeverity']
    cost_increase = message.get('costImpact', 'Unknown')

    # Create Slack message
    slack_message = {
        'text': ':warning: Cost Anomaly Detected!',
        'blocks': [
            {
                'type': 'section',
                'text': {
                    'type': 'mrkdwn',
                    'text': f'*Monitor:* {monitor_name}\n*Severity:* {anomaly_severity}\n*Cost Increase:* {cost_increase}'
                }
            },
            {
                'type': 'section',
                'text': {
                    'type': 'mrkdwn',
                    'text': '<https://console.aws.amazon.com/cost-management|View in AWS Console>'
                }
            }
        ]
    }

    # Post to Slack (webhook URL comes from the Lambda environment)
    http = urllib3.PoolManager()
    http.request(
        'POST',
        os.environ['SLACK_WEBHOOK'],
        body=json.dumps(slack_message),
        headers={'Content-Type': 'application/json'}
    )
    return {'statusCode': 200}

Deploy Lambda:
aws lambda create-function \
  --function-name cost-anomaly-to-slack \
  --runtime python3.11 \
  --role arn:aws:iam::123456789012:role/lambda-basic-execution \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://function.zip \
  --environment "Variables={SLACK_WEBHOOK=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX}"

# Subscribe Lambda to the SNS topic (SNS also needs permission to invoke
# the function, granted via `aws lambda add-permission`)
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:cost-anomaly-alerts \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:cost-anomaly-to-slack

Step 7: Investigate Anomalies
When alerted, check the anomaly:
- Go to Billing → Anomaly Detection → Anomalies
- Click anomaly to view details:
- Service: Which service spiked (EC2, Lambda, etc.)
- Date: When spike occurred
- Estimated cost: Impact ($500, $5K, etc.)
- Baseline vs. Actual: Comparison chart
- Click View details to investigate
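Anomalies can also be pulled programmatically for triage, e.g. via Cost Explorer's GetAnomalies API. The summarizer below works on a response-shaped dict (the field names `Anomalies`, `DimensionValue`, and `Impact.TotalImpact` are my reading of that API; verify before use):

```python
def summarize_anomalies(response: dict, min_impact_usd: float = 100.0) -> list:
    """Flatten a GetAnomalies-style response into (service, impact) pairs,
    keeping only anomalies above a dollar threshold, largest first."""
    hits = []
    for a in response.get("Anomalies", []):
        impact = a.get("Impact", {}).get("TotalImpact", 0.0)
        if impact >= min_impact_usd:
            hits.append((a.get("DimensionValue", "unknown"), impact))
    return sorted(hits, key=lambda pair: -pair[1])

# Sample response shape; in practice: boto3.client("ce").get_anomalies(...)
sample = {"Anomalies": [
    {"DimensionValue": "Amazon EC2", "Impact": {"TotalImpact": 1500.0}},
    {"DimensionValue": "AWS Lambda", "Impact": {"TotalImpact": 12.0}},
]}
print(summarize_anomalies(sample))  # [('Amazon EC2', 1500.0)]
```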
Example investigation:
Anomaly: EC2 spending spiked from $500 to $2,000 on 2026-04-02
Action: Check EC2 console for new instances
Found: 16x c5.24xlarge instances running (~$4.08/hour ≈ $98/day each, ~$1,570/day total)
Root cause: Auto Scaling group scaled up due to traffic spike (legitimate)
Resolution: Tune the Auto Scaling policy's maximum capacity, or accept the cost if the traffic justifies it

Step 8: Automate Remediation (Non-Production)
For staging/dev environments, automatically shut down resources:
import json
import boto3

def lambda_handler(event, context):
    # Parse anomaly from SNS
    message = json.loads(event['Records'][0]['Sns']['Message'])
    service = message['service']

    if service == 'EC2':
        # Find all running instances tagged Environment=staging
        ec2 = boto3.client('ec2')
        instances = ec2.describe_instances(
            Filters=[
                {'Name': 'tag:Environment', 'Values': ['staging']},
                {'Name': 'instance-state-name', 'Values': ['running']}
            ]
        )
        stopped = []
        for reservation in instances['Reservations']:
            for instance in reservation['Instances']:
                print(f"Stopping {instance['InstanceId']}")
                stopped.append(instance['InstanceId'])
        if stopped:
            ec2.stop_instances(InstanceIds=stopped)

        # Alert team with an accurate count of stopped instances
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:ops-alerts',
            Subject='Cost Anomaly: Stopped Staging Instances',
            Message=f'Stopped {len(stopped)} staging instances due to cost anomaly'
        )
    return {'statusCode': 200}

Step 9: Common Anomaly Patterns and Responses
Pattern 1: Runaway Lambda (Infinite Loop)
Alert: Lambda costs increase 10x
Investigation: Check Lambda logs, CloudWatch metrics
Action: (1) Temporarily disable the function trigger, (2) Fix code, (3) Redeploy
Pattern 2: Crypto Mining (Compromised Credentials)
Alert: EC2 CPU usage 100%, spending spikes 20x
Investigation: Check EC2 instance SSH logs, running processes
Action: (1) Terminate instances immediately, (2) Rotate credentials, (3) Review IAM access logs
Pattern 3: Forgotten Dev Environment
Alert: RDS spending increases 5x (new database created)
Investigation: Check RDS instances, find dev instance left running
Action: (1) Stop or delete non-production database, (2) Set up automation to stop dev instances after hours
Pattern 4: Data Transfer Spike
Alert: Data Transfer cost increases 30x
Investigation: Check CloudFront, NAT Gateway, or inter-region transfer
Action: (1) Review distribution, (2) Optimize caching, (3) Consider edge locations
Step 10: Cost Anomaly Prevention
Pattern 1: Tagging Policy
Tag all resources with Environment, Owner, CostCenter:
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=Environment,Value=production Key=Owner,Value=team-a Key=CostCenter,Value=engineering

Use tags to:
- Create monitors per environment (production alert at higher threshold)
- Alert cost center owner (not general ops)
- Audit untagged resources (likely abandoned)
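Auditing untagged resources is a pure-data problem once you have the `describe_instances` output. A minimal sketch (the tag set and function name are illustrative, not part of any AWS API):

```python
REQUIRED_TAGS = {"Environment", "Owner", "CostCenter"}

def untagged_instances(reservations: list) -> list:
    """Return IDs of instances missing any required tag.

    `reservations` has the shape of ec2.describe_instances()['Reservations'].
    Untagged resources are prime suspects when an anomaly fires."""
    missing = []
    for r in reservations:
        for inst in r.get("Instances", []):
            tags = {t["Key"] for t in inst.get("Tags", [])}
            if not REQUIRED_TAGS <= tags:
                missing.append(inst["InstanceId"])
    return missing

# Sample data in describe_instances shape
sample = [{"Instances": [
    {"InstanceId": "i-aaa", "Tags": [{"Key": "Environment", "Value": "prod"},
                                     {"Key": "Owner", "Value": "team-a"},
                                     {"Key": "CostCenter", "Value": "eng"}]},
    {"InstanceId": "i-bbb", "Tags": [{"Key": "Name", "Value": "scratch"}]},
]}]
print(untagged_instances(sample))  # ['i-bbb']
```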
Pattern 2: Budget Alerts (In Addition to Anomaly Detection)
AWS Budgets set hard thresholds:
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName": "monthly-budget", "BudgetLimit": {"Amount": "10000", "Unit": "USD"}, "TimeUnit": "MONTHLY", "BudgetType": "COST"}' \
  --notifications-with-subscribers '[{"Notification": {"ComparisonOperator": "GREATER_THAN", "NotificationType": "FORECASTED", "Threshold": 80}, "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ops@company.com"}]}]'

This alerts if you’re forecasted to hit 80% of your monthly budget.
Pattern 3: Service Quotas
Limit the damage of a bug:
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 64
# L-1216C47A is the Running On-Demand Standard instances quota,
# measured in vCPUs, not instance count. Note this API only requests
# increases; to keep a quota deliberately low, leave it at its default.
# A low vCPU quota caps how far a runaway process can scale.

Common Mistakes
Not checking baseline period
- Anomaly Detection needs 1+ month of data
- If enabled on day 1, won’t alert for first month
Too-low threshold
- Threshold 30%: alerts on every traffic spike (noise)
- Better: 80% for prod, 100% for services with variance
Not investigating root cause
- Alert comes in, you panic and shut everything down
- Usually, the spike is legitimate (traffic spike, promo day, etc.)
- Investigate first, remediate second
Ignoring early warnings
- Anomaly says EC2 is increasing gradually (not a spike)
- Ignore it, bill ends up $20K overrun
- Gradual increases are harder to catch; set budget alerts too
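The gradual-creep case above can be caught with a simple rolling comparison alongside budget alerts. A sketch (the function name and 20% threshold are illustrative choices, not an AWS feature):

```python
def gradual_drift(daily_costs: list, window: int = 7, pct: float = 20.0) -> bool:
    """Flag a slow creep: is the last `window` days' average more than
    `pct` percent above the previous `window` days' average?

    Complements Anomaly Detection, which is tuned for sharp spikes."""
    if len(daily_costs) < 2 * window:
        return False
    prev = sum(daily_costs[-2 * window:-window]) / window
    recent = sum(daily_costs[-window:]) / window
    return prev > 0 and recent > prev * (1 + pct / 100)

# A week at ~$500/day followed by a week drifting toward $660/day
costs = [500, 505, 498, 502, 500, 501, 499,
         560, 580, 600, 620, 640, 650, 660]
print(gradual_drift(costs))  # True
```

Daily cost figures for such a check can be pulled from Cost Explorer's GetCostAndUsage API.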
Next Steps
- Enable Anomaly Detection (5 mins)
- Create monitors by service (15 mins)
- Configure SNS alerts (10 mins)
- Integrate with Slack (30 mins)
- Test with a known cost increase (run expensive query)
- Investigate and respond to first anomaly
- Talk to FactualMinds if you need help setting up FinOps practices or building cost governance
AWS Cloud Architect & AI Expert
AWS-certified cloud architect and AI expert with deep expertise in cloud migrations, cost optimization, and generative AI on AWS.