How to Build a Safe Terraform Apply Workflow on AWS: Approval Gates, Plan Review, and Rollback
Quick summary: One bad `terraform apply` can delete your database, destroy your application load balancer, or lock your team out of AWS. This guide covers the approval gates, plan review processes, and safety tools that prevent infrastructure disasters.
Key Takeaways
- One bad `terraform apply` can delete your database, destroy your application load balancer, or lock your team out of AWS
- This guide covers the approval gates, plan review processes, and safety tools that prevent infrastructure disasters
Somewhere, right now, someone ran terraform apply -auto-approve in a production Terraform configuration and didn’t realize it would destroy a database with customer data.
It happens. And it happens because teams optimize for speed without considering the cost of a mistake.
Terraform makes infrastructure changes easy—maybe too easy. A developer can run terraform apply locally and reshape your entire production environment in seconds, without review, without approval, without anyone knowing it happened.
This guide covers how to build safe apply workflows that are fast enough for real work while being careful enough that you sleep at night.
The Cost of a Bad Apply
Let’s quantify what happens when Terraform goes wrong:
Real scenario 1: A developer refactors a resource name. Terraform doesn’t see a rename; it sees the old resource disappearing and a new one appearing. Without care, terraform apply destroys the old RDS database and creates a new one. Data loss. Recovery from backup takes 6 hours. The incident costs $200k+ in business impact.
Real scenario 2: A new engineer on the team runs terraform apply on a production branch without realizing they’re logged into the wrong AWS account. Resources are destroyed in the wrong environment. Time to recovery: 3 hours. Customer impact: 2 hours of downtime.
Real scenario 3: A team member makes a CLI typo in a variable value. The typo deploys to production. A security group rule is opened to the world. You don’t find out until the next day’s security audit.
The cost of prevention—adding an approval step, having someone review the plan, blocking -auto-approve in production—is measured in minutes. The cost of failure is measured in hours and thousands of dollars.
The 3-Gate Model: Plan → Review → Apply
A safe workflow has three gates:
Gate 1: Plan (What Will Change?)
    terraform plan -out=tfplan

Output the plan to a file. Never rely on console-only output (which scrolls away and is hard to review).
The plan shows:
    # aws_db_instance.main will be destroyed
    - resource "aws_db_instance" "main" {

    # aws_security_group.app will be updated in-place
    ~ resource "aws_security_group" "app" {
        ~ ingress {
            + cidr_blocks = ["0.0.0.0/0"]
              from_port   = 443
              to_port     = 443
          }
      }

A reviewer should read this and say “yes, this is what I expected” or “wait, why is the database being destroyed?”
Plan safety tips:
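- Always output to a file: `terraform apply tfplan` then performs exactly the actions recorded in that plan, and Terraform rejects the file as stale if the state has changed since it was created
- Store the plan as a CI/CD artifact so there’s an audit trail, and restrict access to it, because plan files can contain sensitive values in plain text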
- If the plan is larger than 100 lines, display it in a tool that’s designed for reading (not a text scroll)
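
A minimal sketch of the plan gate as a CI step, assuming Terraform is on the PATH and the working directory is already initialized. The function names `describe_plan_exit` and `plan_to_file` and the file name `tfplan.txt` are hypothetical; the `-detailed-exitcode` flag makes `terraform plan` exit with 0 (no changes), 1 (error), or 2 (changes present).

```shell
#!/usr/bin/env bash
# Sketch only: describe_plan_exit and plan_to_file are hypothetical helper names.

# Translate the exit status of `terraform plan -detailed-exitcode` into a label.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;  # plan succeeded, nothing to apply
    1) echo "error" ;;       # plan failed
    2) echo "changes" ;;     # plan succeeded and contains changes
    *) echo "unknown" ;;
  esac
}

# Save the binary plan as tfplan and a readable rendering as tfplan.txt.
plan_to_file() {
  local status=0
  terraform plan -input=false -detailed-exitcode -out=tfplan || status=$?
  if [[ "$status" -eq 0 || "$status" -eq 2 ]]; then
    terraform show -no-color tfplan > tfplan.txt
  fi
  describe_plan_exit "$status"
  # Succeed only when the plan itself succeeded (with or without changes).
  [[ "$status" -eq 0 || "$status" -eq 2 ]]
}
```

Calling `plan_to_file` in the pipeline leaves two artifacts: `tfplan` for the apply job and `tfplan.txt` for the reviewer.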
Gate 2: Review (Is This Actually Safe?)
A human reads the plan. Not the person who wrote the code, but someone else. Ideally someone senior.
A reviewer should ask:
- “Are any critical resources being destroyed?” (databases, load balancers, security groups)
- “Are any IAM permissions being changed?” (could break applications)
- “Are any resource replacements happening?” (which means downtime)
- “Does this match the ticket/PR description?”
The review happens before apply. The review blocks apply if something looks wrong.
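
Part of the reviewer checklist can be pre-computed so the dangerous lines are not buried in a long plan. The sketch below reads the text produced by `terraform show -no-color tfplan` on stdin and pulls out the destroy and replace headers. The function names and the list of "critical" resource types are assumptions for illustration; adjust both to your stack.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the critical-type list are hypothetical.

# Print every resource header that announces a destroy or a replacement.
list_destructive_actions() {
  grep -E '^[[:space:]]*# .+ (will be destroyed|must be replaced)' \
    | sed -E 's/^[[:space:]]*# //'
}

# Narrow the list to resource types this team treats as critical.
flag_critical() {
  list_destructive_actions \
    | grep -E '(^|\.)(aws_db_instance|aws_lb|aws_security_group|aws_iam_[a-z_]+)\.'
}
```

For example, `terraform show -no-color tfplan | flag_critical` could be posted as the first comment on the pull request. Text matching is a convenience for humans, not a security control.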
Gate 3: Apply (Make It Happen)
Only after review approval does the apply happen. And it should happen:
- In CI/CD, not on a developer’s laptop
- With audit logging (who applied it, when, what changed)
- With the exact plan that was reviewed (not a fresh plan that could be different)
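Terraform supports this with `terraform apply tfplan`, which performs only the actions recorded in the saved plan and refuses to run if the state has changed since the plan was created. The plan file is not cryptographically signed, so treat it as a protected CI artifact rather than relying on Terraform to detect tampering.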
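
One way to make sure the apply job uses the same plan bytes the reviewer saw is to record a checksum at plan time and compare it just before apply. This is a sketch that assumes a Linux runner with `sha256sum` available; `plan_checksum` and `verify_plan_checksum` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; assumes GNU coreutils sha256sum.

# Print the SHA-256 digest of a plan file.
plan_checksum() {
  sha256sum "$1" | awk '{print $1}'
}

# Fail unless the file's digest matches the one recorded at review time.
verify_plan_checksum() {
  local file="$1" expected="$2" actual
  [[ -n "$expected" ]] || { echo "no expected checksum given" >&2; return 1; }
  actual="$(plan_checksum "$file")"
  if [[ "$actual" != "$expected" ]]; then
    echo "plan checksum mismatch: refusing to apply" >&2
    return 1
  fi
}
```

In the apply job this could read `verify_plan_checksum tfplan "$REVIEWED_SHA" && terraform apply tfplan`, where `REVIEWED_SHA` is an assumed variable carrying the digest recorded by the plan job.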
What to Audit in a Terraform Plan
Not everything in a plan is dangerous, but some things are red flags.
Red Flag 1: Resource Destruction
    # aws_db_instance.main will be destroyed

Databases should never be destroyed by accident. If you see a database destruction, pause and understand why:
- Is it a resource rename? (In which case, use `terraform state mv` or a `moved` block)
- Is it a legitimate decommissioning? (In which case, require extra approvals)
- Is it a mistake in the code change? (Fix and re-plan)
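
A destroy and a create of the same resource type in one plan is a common signature of a rename. The sketch below is a heuristic, not a Terraform feature: it reads plan text on stdin and prints a candidate `terraform state mv` command for each type that appears on both sides. The name `suspect_renames` is hypothetical, and it assumes simple addresses without dots inside index keys.

```shell
#!/usr/bin/env bash
# Sketch only: suspect_renames is a hypothetical helper based on a heuristic.

# Pair up "will be destroyed" and "will be created" headers by resource type.
suspect_renames() {
  awk '
    # The resource type is the second-to-last dot-separated component.
    function rtype(addr,   parts, n) {
      n = split(addr, parts, ".")
      return (n >= 2) ? parts[n-1] : addr
    }
    /^[ \t]*# .+ will be destroyed/ { gone[rtype($2)] = $2 }
    /^[ \t]*# .+ will be created/   { made[rtype($2)] = $2 }
    END {
      for (t in gone)
        if (t in made)
          printf "terraform state mv %s %s\n", gone[t], made[t]
    }
  '
}
```

The printed commands are suggestions for a human to check. If several resources of one type change in the same plan, only one pair per type is reported.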
Red Flag 2: Resource Replacement
    # aws_db_instance.main must be replaced
    -/+ resource "aws_db_instance" "main" {

This is dangerous because it means downtime (the resource is gone during the recreation). For databases, it usually means data loss unless you restore from a snapshot.
Red Flag 3: Large Security Group Changes
    ~ resource "aws_security_group" "app" {
        ~ ingress {
            + cidr_blocks = ["0.0.0.0/0"]
          }
      }

Opening access to 0.0.0.0/0 (the entire internet) should be questioned. Is this intentional?
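
A small check along these lines can surface world-open CIDRs before a human reads the plan. It is a sketch over plan text on stdin, and `find_open_ingress` is a hypothetical name; it only looks at added lines (those starting with `+`) and also matches the IPv6 equivalent `::/0`.

```shell
#!/usr/bin/env bash
# Sketch only: find_open_ingress is a hypothetical helper.

# Print (with line numbers) every added plan line that opens a range to everyone.
# Exits non-zero when nothing matches, following grep's convention.
find_open_ingress() {
  grep -nE '^[[:space:]]*\+.*("0\.0\.0\.0/0"|"::/0")'
}
```

A CI step could run `terraform show -no-color tfplan | find_open_ingress` and require an explicit justification in the PR whenever it prints anything.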
Red Flag 4: IAM Policy Changes
    ~ resource "aws_iam_role_policy" "app_role" {
        + "s3:*"
        - "s3:GetObject"
        - "s3:PutObject"
      }

Adding broad permissions (like s3:* instead of specific actions) is a security issue.
Red Flag 5: Encryption or Backup Settings Disabled
    ~ resource "aws_db_instance" "main" {
        ~ storage_encrypted       = true -> false
        ~ backup_retention_period = 30 -> 0
      }

Disabling encryption or backups is almost never intentional. Question this.
Green Flag: Additive Changes Only
    + resource "aws_s3_bucket" "backup" { ... }
    + resource "aws_iam_role" "service" { ... }

Creating new resources with no changes to existing ones is low risk. These plans can be approved quickly.
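
Whether a plan is additive-only can be read off the summary line Terraform prints at the end of a plan (`Plan: N to add, N to change, N to destroy.`). The sketch below parses that line from plan text on stdin; `is_additive_only` is a hypothetical name. It returns 0 for additive-only plans, 1 when anything is changed or destroyed, and 2 when no summary line is found (for example, when there are no changes at all).

```shell
#!/usr/bin/env bash
# Sketch only: is_additive_only is a hypothetical helper.

is_additive_only() {
  local summary
  local re='([0-9]+) to add, ([0-9]+) to change, ([0-9]+) to destroy'
  # Use the last summary line in case the input contains more than one.
  summary="$(grep -E '^Plan: ' | tail -n 1)"
  [[ "$summary" =~ $re ]] || return 2
  [[ "${BASH_REMATCH[2]}" -eq 0 && "${BASH_REMATCH[3]}" -eq 0 ]]
}
```

Because a replacement counts as one add plus one destroy in the summary, replacements are never classified as additive here.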
Blocking Dangerous Commands in CI/CD
Some commands should never run in production. Set up guards:
Block -auto-approve in Production
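The -auto-approve flag skips the interactive approval prompt entirely. It should only exist in dev. Applying a saved plan file (terraform apply tfplan) never prompts either, so in a plan-file workflow the approval has to be enforced by the pipeline, not by Terraform.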
In your CI/CD pipeline:
    if [[ "$ENVIRONMENT" == "production" ]] && [[ "$TERRAFORM_ARGS" == *"-auto-approve"* ]]; then
      echo "❌ -auto-approve is forbidden in production"
      exit 1
    fi

Block terraform destroy in Production
    if [[ "$ENVIRONMENT" == "production" ]] && [[ "$COMMAND" == "destroy" ]]; then
      echo "❌ terraform destroy is forbidden in production. Use the separate decommissioning approval process."
      exit 1
    fi

If you need to destroy resources in production, require a separate approval process or don’t allow it through normal CI/CD.
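
The two guards can be folded into one wrapper that inspects the actual argument list instead of a pre-built string. This is a sketch; `tf_guard` is a hypothetical name, and it also rejects `-destroy`, since `terraform apply -destroy` is equivalent to `terraform destroy`.

```shell
#!/usr/bin/env bash
# Sketch only: tf_guard is a hypothetical helper.
# Usage: tf_guard <environment> <terraform-subcommand> [args...]

tf_guard() {
  local environment="$1" command="$2" arg
  shift 2
  # Only production is restricted in this sketch.
  [[ "$environment" == "production" ]] || return 0
  if [[ "$command" == "destroy" ]]; then
    echo "terraform destroy is forbidden in production" >&2
    return 1
  fi
  for arg in "$@"; do
    case "$arg" in
      -auto-approve|--auto-approve|-auto-approve=*|--auto-approve=*|-destroy|--destroy)
        echo "$arg is forbidden in production" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

A pipeline step might then run `tf_guard "$ENVIRONMENT" apply tfplan && terraform apply tfplan`.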
Cap -parallelism in Production
Terraform’s -parallelism flag limits how many resource operations run concurrently (the default is 10). High parallelism can trigger AWS API throttling and lets a bad change spread before anyone notices:

    if [[ "$ENVIRONMENT" == "production" ]]; then
      terraform apply -parallelism=5 tfplan
    else
      terraform apply -parallelism=10 tfplan
    fi

Limiting parallelism means changes happen more slowly, giving you time to notice problems.
Per-Environment Policies: Auto-Approve for Dev, Manual Gate for Prod
Different environments have different risk profiles.
| Environment | Approval Required | Auto-Approve OK | Parallelism | Policy |
|---|---|---|---|---|
| Dev | No | Yes | 10+ | Speed matters; we accept risk |
| Staging | Maybe | No | 5 | Simulate production, but still safe to experiment |
| Production | Always | No | 3-5 | Every change is reviewed; destructive ops are blocked |
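
The table can be encoded as two small lookup functions so every pipeline reads the same policy. This is a sketch with hypothetical names. Two choices here are assumptions, not part of the table: production uses the lower bound of its range (3), and staging and any unrecognized environment name are treated as requiring approval so that a typo fails closed.

```shell
#!/usr/bin/env bash
# Sketch only: parallelism_for and requires_approval are hypothetical helpers.

# Print the -parallelism value for an environment; fail on unknown names.
parallelism_for() {
  case "$1" in
    dev)        echo 10 ;;
    staging)    echo 5 ;;
    production) echo 3 ;;
    *)          echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Succeed when the environment needs a manual approval gate.
# Everything except dev requires approval (assumption: fail closed).
requires_approval() {
  [[ "$1" != "dev" ]]
}
```

Usage could look like `terraform apply -parallelism="$(parallelism_for "$ENVIRONMENT")" tfplan`.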
Example CI/CD configuration:
    # .github/workflows/terraform.yml
    on: [push, pull_request]
    env:
      TF_VAR_environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
    jobs:
      terraform:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: hashicorp/setup-terraform@v2
          - name: Terraform Plan
            run: |
              terraform init
              terraform plan -out=tfplan
          - name: Require Approval (Production Only)
            if: env.TF_VAR_environment == 'production'
            uses: actions/github-script@v6
            with:
              script: |
                // Note: context.issue.number is only populated for pull_request events
                await github.rest.pulls.requestReviewers({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  pull_number: context.issue.number,
                  reviewers: ['senior-infra-engineer']
                })
          - name: Wait for Approval (Production Only)
            if: env.TF_VAR_environment == 'production'
            run: |
              # Block until PR is approved
              # (Implementation depends on your approval strategy)
          - name: Terraform Apply (Auto for Dev, Conditional for Prod)
            run: |
              if [[ "$ENVIRONMENT" == "production" ]]; then
                terraform apply tfplan # Requires prior approval
              else
                terraform apply -auto-approve tfplan
              fi
            env:
              ENVIRONMENT: ${{ env.TF_VAR_environment }}

AWS-Specific Risks and How to Mitigate Them
Some Terraform operations are particularly risky on AWS.
Risk 1: RDS Resource Replacement
Some RDS changes force the instance to be replaced, and others are applied in place but can still cause downtime:

    resource "aws_db_instance" "main" {
      allocated_storage   = 100   # Changed from 50
      skip_final_snapshot = false # Safe
      apply_immediately   = true  # Dangerous! Causes immediate downtime
    }

If apply_immediately = true, the change happens now, not during your maintenance window, and the database can be unavailable while the modification runs.
Mitigation: Review RDS changes extra carefully. Use apply_immediately = false in production.
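
As an extra safety net for risky RDS changes, a manual snapshot can be taken right before the apply. This goes beyond the mitigation described above and is only a sketch: it assumes the AWS CLI is configured for the target account, and the function names and the `pre-apply-` naming scheme are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the snapshot naming scheme are hypothetical.

# Build a snapshot identifier from the DB identifier and a UTC timestamp.
snapshot_id_for() {
  local db="$1" stamp="${2:-$(date -u +%Y%m%d-%H%M%S)}"
  echo "pre-apply-${db}-${stamp}"
}

# Create a manual snapshot and wait until it is available.
snapshot_before_apply() {
  local db="$1" id
  id="$(snapshot_id_for "$db")"
  aws rds create-db-snapshot \
    --db-instance-identifier "$db" \
    --db-snapshot-identifier "$id" || return 1
  aws rds wait db-snapshot-available --db-snapshot-identifier "$id"
}
```

Running `snapshot_before_apply orders-db && terraform apply tfplan` (with `orders-db` as an example identifier) means a bad RDS change can be recovered from a snapshot taken minutes earlier rather than from the nightly backup.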
Risk 2: ElastiCache Node Replacement
Changing node types in ElastiCache causes the cache to be recreated, flushing all cached data.
    resource "aws_elasticache_cluster" "main" {
      node_type = "cache.t3.micro" # Changed from cache.t3.small
    }

This is a cache replacement. Plan for cache misses and increased load on your database.
Risk 3: Security Group Rule Changes During Active Traffic
Removing a security group rule during active traffic can drop connections mid-stream.
    resource "aws_security_group_rule" "app_ingress" {
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"] # Removing this rule breaks connections
    }

Mitigation: Make security group changes during maintenance windows, or apply them gradually (update code, apply change, verify, then roll forward).
Rollback Options When Apply Goes Wrong
If terraform apply causes problems, you have options.
Option 1: Terraform State Rollback
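If the apply was bad and you have a backup of the previous state, you can use terraform state push to restore it. This only rewrites Terraform’s record of the infrastructure; it does not change any AWS resources or bring back deleted data:

    # Save current state
    terraform state pull > current-state.json
    # Restore previous state (from backup). Terraform may refuse if the backup's
    # serial is older than the remote state, in which case -force is required.
    terraform state push previous-state.json
    # Re-plan (should show how to recreate the destroyed resources)
    terraform plan

This is a last resort. It’s not clean. But it works when you need to undo a disaster quickly.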
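
A state rollback is only possible if a previous copy exists, so it helps to pull one before every apply. This sketch assumes an initialized working directory; the function names and the `state-backups/` directory are hypothetical. State files can contain secrets, so store them somewhere access-controlled.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the backup directory are hypothetical.

# Build the backup path for an environment and timestamp.
state_backup_name() {
  echo "state-backups/${1}-${2}.tfstate.json"
}

# Pull the current state into a timestamped file and print its path.
backup_state() {
  local env="$1" stamp path
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  path="$(state_backup_name "$env" "$stamp")"
  mkdir -p "$(dirname "$path")"
  terraform state pull > "$path" && echo "$path"
}
```

For example, `backup_state production && terraform apply tfplan` leaves a file that can later be passed to `terraform state push`.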
Option 2: Destroy and Rebuild
For some resources, it’s faster to destroy and recreate:
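    terraform destroy -target=aws_instance.web
    terraform apply -target=aws_instance.web

This removes the corrupted resource and rebuilds it cleanly. Recent Terraform versions also support terraform apply -replace=aws_instance.web, which plans the destroy and the recreate together.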
Option 3: Manual AWS Console Changes
If Terraform is causing problems, make changes directly in the AWS console to stabilize, then fix Terraform code and re-apply:
- Manually fix the problem in AWS console
- Update Terraform code to match
- Run `terraform import` if necessary to bring it under Terraform management
- Run `terraform plan` to verify zero changes
Tools for Safe Workflow Automation
Several tools specialize in safe Terraform workflows.
Atlantis
Atlantis is a self-hosted tool that runs terraform plan on pull requests and manages terraform apply approvals.
Workflow:
- Developer opens PR with infrastructure changes
- Atlantis runs `terraform plan` and posts the plan in the PR
- Reviewers comment `atlantis apply` to approve
- Atlantis runs `terraform apply` with full audit logging
Benefits:
- Plan output is visible in the PR
- No developer access needed to run apply
- Full audit trail of who approved what
Spacelift
Spacelift is a SaaS platform (like Terraform Cloud) that adds approval workflows, policy enforcement, and drift detection.
Features:
- Require approval before apply
- Block dangerous operations (destroy, auto-approve)
- Policy as Code (enforce naming conventions, required tags, etc.)
- Drift detection and remediation
GitHub Actions with Required Approvals
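If you’re using GitHub, you can build the gate from GitHub’s own features. The step below opens an issue that requests approval; to actually block the apply, pair it with branch protection rules or required reviewers on a deployment environment: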
    - name: Create Approval Issue
      if: github.event_name == 'pull_request'
      uses: actions/github-script@v6
      with:
        script: |
          github.rest.issues.create({
            owner: context.repo.owner,
            repo: context.repo.repo,
            title: 'Approval Required: Infrastructure Changes',
            body: 'This PR modifies production infrastructure. Requires approval from @senior-infra-engineer'
          })

Testing Your Safe Workflow
Before deploying to production, test your approval workflow in staging:
- Create a change in staging that would be dangerous (like increasing instance size)
- Verify the plan is created correctly
- Verify the approval requirement blocks apply
- Verify approval enables apply
- Verify the change applies correctly
If this process works in staging, you can trust it in production.
Conclusion: Safety Doesn’t Slow You Down
Teams often think safety and speed are opposites. In practice, they’re the same thing.
A team that adds 2 minutes of review time to each Terraform apply is slower per-change. But a team that loses 6 hours to a data deletion is much slower overall.
Start with the 3-gate model: plan, review, apply. Add approval requirements. Block dangerous commands. Test your rollback procedures. Measure cycle time and improve gradually.
Your goal: “We have never lost production data to a bad Terraform apply, and we never will.”
If building safe infrastructure practices feels like too much to tackle alone, FactualMinds helps teams implement governance frameworks that balance safety with speed. We’ve helped dozens of teams move from manual, error-prone infrastructure management to automated, auditable processes. Let’s talk about how to build safe Terraform workflows that your team can trust.
