AWS - freeCodeCamp.org

My Team's Experience Moving from AWS to a PaaS

Manish Shivanandhan — Tue, 30 Jun 2026 20:49:58 +0000

Most product teams assume infrastructure ownership is simply part of building software. We did too. It wasn’t until we measured how much engineering time was disappearing into operational work that we realised how expensive that assumption had become.

During a quarterly planning session, one of our engineers asked a question nobody on the team had thought to ask directly before: “How much of our time is actually going into infrastructure, versus building things people use?”

It wasn’t a rhetorical question. We pulled up our sprint history, our incident logs, and our calendars, and tried to answer it honestly.

We were a 7-person internal tooling team inside a larger enterprise organisation. Our mandate was straightforward: make other teams across the company faster through workflow automation, internal dashboards, and integrations between internal systems.

Our Amazon Web Services (AWS) environment wasn’t poorly built. It was, by most standards, mature infrastructure. Containerised services on ECS, automated deployments through GitHub Actions, CloudWatch observability, and properly scoped IAM roles across environments. Nothing about it would have raised concerns in an architecture review.

What it cost us wasn’t visible on an invoice. It was visible in calendars, in context-switching, and in how often “infrastructure work” quietly displaced the backlog we were actually accountable for.

That conversation eventually led us to evaluate and migrate to Sevalla, a Platform-as-a-Service that trades infrastructure control for operational simplicity. The migration took three weeks. The effects were measurable within a month.

In this article, we'll walk through what our AWS setup looked like before migrating, what the migration process actually involved, the specific metrics that changed afterwards, and the trade-offs we accepted along the way.

What We'll Cover:

Before the Migration
The Number That Started the Conversation
The Deployment Process: What “Reasonably Automated” Actually Meant
What the Migration Actually Involved
What Changed After the Migration
What We Gave Up
The Actual Lesson

Before the Migration

Our AWS setup was respectable. We weren’t running something embarrassingly manual. We had:

ECS for container orchestration
RDS for databases
CloudWatch for logs and metrics
A CI/CD pipeline through GitHub Actions
IAM roles managed across environments
CloudFormation templates maintained by one senior engineer

It worked. Deployments were automated. The system was stable.

The problem wasn’t that anything was broken. The problem was what it cost us to keep it running smoothly.

The Number That Started the Conversation

During a quarterly planning session, we tried to honestly account for where engineering time was going.

We estimated that across the team, roughly 12–15 hours per week were being spent on infrastructure-related work that wasn’t directly delivering value to internal users. This included:

Deployment pipeline maintenance and debugging (~4 hrs/week)
CloudWatch log investigation and alert tuning (~3 hrs/week)
IAM permissions management and access reviews (~2 hrs/week)
Dependency updates, security patches for infrastructure components (~2 hrs/week)
Ad hoc incidents, environment drift, cost anomaly investigations (~3–4 hrs/week)

At a fully-loaded engineer cost, 12–15 hours per week is the equivalent of roughly one-third of a full-time engineer, every week, spent on keeping the lights on rather than building anything.
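As a rough check on that figure, the conversion from weekly hours to a fraction of one engineer can be sketched in a few lines. The 40-hour working week is an assumption (the article does not state one), and `fte_fraction` is a hypothetical helper name.

```python
# Sketch: convert weekly infrastructure hours into a fraction of one engineer.
HOURS_PER_WEEK = 40  # assumption: a standard 40-hour working week


def fte_fraction(hours_per_week: float) -> float:
    """Return the share of one full-time engineer that the hours represent."""
    return hours_per_week / HOURS_PER_WEEK


# Itemised estimates from the list above; the last item is a 3-4 hour range.
itemised_low = 4 + 3 + 2 + 2 + 3
itemised_high = 4 + 3 + 2 + 2 + 4

for hours in (12, 15):
    print(f"{hours} hrs/week is {fte_fraction(hours):.1%} of one engineer")
print(f"Itemised categories sum to {itemised_low}-{itemised_high} hrs/week")
```

Under that assumption, the stated 12–15 hours works out to between 30% and 37.5% of one engineer, which brackets the one-third figure. The itemised categories sum to the upper end of the range.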

For a team whose backlog was already longer than we could realistically tackle, that number was hard to justify.

The Deployment Process: What “Reasonably Automated” Actually Meant

Our deployment pipeline was good by most standards. Push to main, GitHub Actions triggered a build, pushed an image to ECR, and updated the ECS service. On a good day, a deployment took about 12 minutes from merge to live.

But “reasonably automated” came with caveats.

Only one engineer fully understood the pipeline. If something failed mid-deployment, like a task definition mismatch, an IAM permission error, or a CloudFormation stack conflict, most of the team would either wait for that engineer or spend significant time reading AWS documentation to diagnose it themselves.

Rollbacks were manual. There was no clean one-click rollback. Rolling back meant redeploying the previous image tag, which required knowing what that tag was, triggering the pipeline again, and waiting another 12 minutes. In an incident, those 12 minutes mattered.

Environment parity was fragile. We had staging and production environments. Keeping them consistent required discipline and periodic reconciliation. Configuration drift happened more than we’d like to admit, and it occasionally caused releases to behave differently in production than they had in staging.

New team members couldn’t deploy confidently. Onboarding a new engineer to the deployment process took the better part of a day, and most new hires remained hesitant to trigger deployments independently for weeks. The pipeline was automated, but the knowledge wasn’t.

What the Migration Actually Involved

We moved over the course of about three weeks, migrating services incrementally rather than cutting over all at once.

The largest time investment was translating our environment variable configuration and secrets management from AWS Parameter Store into Sevalla’s environment configuration. That took roughly half a day.

The CI/CD migration was straightforward. We replaced our ECS deployment step with Sevalla’s Git-connected deployment. The GitHub integration picked up our repository directly.

Database migration was the most careful part. We ran both databases in parallel for two weeks, verified data consistency, then cut over DNS. There was no data loss, and no downtime.

Total migration effort across the team: approximately 40 hours spread over three weeks, mostly concentrated in two engineers.

What Changed After the Migration

Deployment time dropped from ~12 minutes to ~3 minutes

This wasn’t the most important change, but it was the most immediately visible one. Faster deployments meant faster feedback loops. A fix could be in production and verified within minutes rather than waiting out a build cycle.

Over a typical week with 8–10 deployments, that’s roughly 70–90 minutes of cumulative waiting time recovered per week.

Any engineer could deploy confidently on day one

This was the change that mattered most operationally. The deployment process became visible, documented by the interface itself, and required no specialist knowledge to operate. A new engineer joining the team could deploy their first change independently on their first day.

The informal “deployment gatekeeper” role that had quietly formed around our most AWS-experienced engineer effectively dissolved.

Rollbacks went from a 12-minute manual process to a 30-second action

Every deployment in Sevalla retains a one-click rollback to the previous build. During the first month after migration, we used this twice: once for a regression we caught quickly, and once during a failed database migration we immediately needed to reverse.

Both incidents, which previously would have required hours of manual intervention, were resolved in under a minute.

Infrastructure maintenance time dropped to approximately 2–3 hours per week

We no longer maintain IAM roles, CloudWatch alerts, CloudFormation templates, or ECS task definitions. The infrastructure surface area we own shrank dramatically.

Our estimate of 12–15 hours per week of infrastructure work fell to roughly 2–3 hours per week. That time now goes mainly to monitoring application behaviour and reviewing build logs. That’s a recovery of approximately 10 hours per week of engineering time redirected toward the actual backlog.

Over a quarter, that’s roughly 130 hours, or about three full working weeks, returned to product work.
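The quarterly figure follows from simple multiplication. This is a minimal sketch, assuming a 13-week quarter and a 40-hour working week; neither is stated in the article.

```python
# Sketch: scale the weekly recovery up to a quarter.
HOURS_RECOVERED_PER_WEEK = 10  # from the estimate above
WEEKS_PER_QUARTER = 13         # assumption: 52 weeks / 4
HOURS_PER_WORKING_WEEK = 40    # assumption: one engineer's working week

quarterly_hours = HOURS_RECOVERED_PER_WEEK * WEEKS_PER_QUARTER
working_weeks = quarterly_hours / HOURS_PER_WORKING_WEEK

print(f"{quarterly_hours} hours per quarter, about {working_weeks:.2f} working weeks")
```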

Looking back, we had quietly become a platform team. Not because we intended to, but because every infrastructure decision created more infrastructure to own.

Log visibility improved without any additional tooling

One outcome we didn’t anticipate: production visibility got better even though we invested less in it.

On AWS, meaningful log analysis required CloudWatch Insights queries, proper log group configuration, and knowing where to look. Useful observability required deliberate setup effort.

On Sevalla, build logs, runtime logs, and deployment history are accessible directly from the dashboard without configuration. When something went wrong in production, the time from “something is broken” to “here is what happened” dropped from 10–20 minutes of searching across tools to under 2 minutes in most cases.

What We Gave Up

Intellectual honesty requires listing the trade-offs.

First, we have less infrastructure flexibility. If we needed custom networking topology, specialised compute instances, or fine-grained storage configuration, Sevalla wouldn't cover those requirements. For an internal tooling team, none of those needs has materialised. But they could.

Also, some AWS-native integrations required reworking. We used a few Lambda functions that had to be refactored into services. That added some migration complexity we hadn’t fully anticipated.

The Actual Lesson

The migration confirmed something that’s easy to miss when you’re inside it: the cost of infrastructure ownership for a product team isn’t primarily the cloud bill. It’s the engineering attention.

For our team, 10 hours per week of recovered time is the equivalent of about a quarter of a full-time engineer redirected toward work that users actually care about. That’s not a marginal improvement. It’s a meaningful change in what the team can realistically ship.

That outcome isn’t specific to Sevalla. Any infrastructure simplification that genuinely reduces operational burden would produce a similar result.

The question worth asking isn’t whether your team can manage infrastructure. It’s whether managing infrastructure is the best use of the engineering capacity you have.

For an internal tooling team whose value is measured entirely by what it ships, not by how it deploys, the answer, for us, was clearly no.

The EKS Cost Optimization Handbook: Reduce Your AWS Bill by 60% Using Karpenter and Rightsizing

Ayobami Adejumo — Mon, 22 Jun 2026 16:38:45 +0000

This handbook is a complete guide to the 7-step playbook that took one EKS bill from $85,000/month to $34,000/month — without touching a single line of product code.

I've audited EKS clusters at more than 10 companies. The same waste patterns appear every time: over-provisioned nodes, cross-AZ data transfer, idle EBS volumes, and so on. And the most expensive mistake of all: buying compute commitments before rightsizing.

This handbook is the fix. I've used this 7-step playbook to reduce EKS costs by 50–60% at every company where I've implemented it. There are no product code changes, and no downtime. Just infrastructure optimization executed in the right order.

By the end of this guide, you'll know how to right-size pod resource requests, implement Karpenter for intelligent bin-packing and Spot diversification, migrate compatible workloads to Graviton for 20% cheaper compute, and use VPC endpoints to eliminate NAT Gateway data processing charges on traffic to AWS services.

All Terraform modules, NodePool templates, and automation scripts referenced in this guide are available in the companion repository at github.com/aayostem/eks-cost-optimization. The repo includes ready-to-deploy configurations for every step so you can move from reading to implementing in the same afternoon.

What You'll Learn
Prerequisites
Part 1: The Baseline — Where Your EKS Money Is Going
Part 2: Right-Sizing Pod Resource Requests
Part 3: Karpenter for Bin-Packing and Spot Diversification
Part 4: Graviton Migration
Part 5: VPC Endpoints for Data Transfer
Part 6: EBS Volume Optimisation
Part 7: Load Balancer Consolidation
The Complete 7-Step Sequence
Best Practices Summary
Resources

What You'll Learn

How to right-size pod resource requests using VPA recommendations
The complete Karpenter setup with Spot diversification and automatic consolidation
Graviton3 migration for all non-GPU workloads
VPC endpoints to eliminate NAT Gateway data transfer charges
EBS gp2 to gp3 migration — 20% cheaper with zero performance loss
Load balancer consolidation with shared Ingress
The 7-step sequence that maximises ROI — and why the order isn't optional

Let's dive in.

Prerequisites

Before following along, you should have:

Knowledge:

Working familiarity with Kubernetes — you can deploy an application and inspect pods
Basic AWS knowledge — you understand EC2 instance types, VPCs, and EBS volumes
Comfort reading Terraform HCL and Kubernetes YAML

Tools and access:

An existing EKS cluster running Kubernetes 1.27 or later
kubectl configured and pointing at your cluster
AWS CLI v2 installed and authenticated with appropriate permissions
Helm 3 installed (for Karpenter and Kubecost)
Metrics Server installed in your cluster

Companion repository: Clone the repo before starting. It contains all YAML, Terraform, and shell scripts referenced in this guide:

git clone https://github.com/aayostem/eks-cost-optimization
cd eks-cost-optimization

Estimated savings: For a cluster running at $85,000/month with typical over-provisioning, expect $40,000–55,000/month in savings after completing all 7 steps. Smaller clusters under $10,000/month typically see 40–50% reduction.

Part 1: The Baseline — Where Your EKS Money Is Going

1.1 The Typical EKS Cost Breakdown

Before touching anything, you need to know exactly where the money is going. Optimising the wrong category first is how teams waste weeks of engineering time and see no meaningful reduction.

Here's what a typical $85,000/month EKS cluster looks like when you break it down:

Category	Monthly Cost	Percentage	Waste Potential
Compute (EC2 nodes)	$52,000	61%	High — over-provisioning, wrong instance types
Data Transfer	$15,300	18%	Very High — cross-AZ and NAT Gateway charges
Storage (EBS volumes)	$10,200	12%	Medium — unattached volumes and gp2 vs gp3
Load Balancers	$4,250	5%	Low to Medium — single-service ALBs
EKS Control Plane	$72	<1%	None — this is a fixed cost
Other	$3,178	4%	Low

Compute and Data Transfer together represent 79% of the bill and account for 90% of the correctable waste. Those are the targets.
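The percentages in the table can be reproduced from the dollar amounts. The following is a small sketch using only the figures above; `share` is a hypothetical helper name.

```python
# Monthly cost per category in USD, copied from the table above.
monthly_cost = {
    "Compute (EC2 nodes)": 52_000,
    "Data Transfer": 15_300,
    "Storage (EBS volumes)": 10_200,
    "Load Balancers": 4_250,
    "EKS Control Plane": 72,
    "Other": 3_178,
}

total = sum(monthly_cost.values())


def share(category: str) -> float:
    """Return a category's share of the total bill as a percentage."""
    return 100 * monthly_cost[category] / total


# Print categories from most to least expensive.
for name, cost in sorted(monthly_cost.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} ${cost:>7,}  {share(name):5.1f}%")

top_two = share("Compute (EC2 nodes)") + share("Data Transfer")
print(f"Compute + Data Transfer: {top_two:.0f}% of ${total:,}")
```

Swapping in the numbers from your own Cost Explorer output gives the same ranking for your cluster.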

Run this command to see your own breakdown before starting anything:

# Pull last month's cost breakdown by service
# Save this output — it becomes your before number
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].{Service:Keys[0],Cost:Metrics.UnblendedCost.Amount}' \
  --output text | sort -rn   # Cost is the first text column, so this lists the most expensive services first

Screenshot the output and save it. You'll compare against it after each step to verify actual savings before moving to the next one.

1.2 The Most Expensive Mistake: Wrong Optimisation Order

Here's what most teams do when they get a large AWS bill:

Buy Savings Plans immediately, locking in waste at a 30% discount
Then implement Karpenter, discovering they've over-committed the wrong instance family
Then migrate to Graviton, discovering their instance-family-specific Savings Plan doesn't cover ARM instances

The result: a 12- or 36-month commitment paying for waste they could have eliminated in three weeks.

The correct sequence is:

Step 1: Right-size pod requests        ← Always first
Step 2: Implement Karpenter            ← Dynamic provisioning on rightsized requests
Step 3: Enable Spot for non-prod       ← Karpenter handles fallback automatically
Step 4: Migrate to Graviton            ← Karpenter makes this seamless
Step 5: Add VPC endpoints              ← Eliminate data transfer charges
Step 6: Optimise EBS volumes           ← Quick win, run alongside other steps
Step 7: Consolidate load balancers     ← Final structural cleanup

Then, and only then, buy Savings Plans — against the optimised baseline you've just established.

The one rule: optimise first, then commit. Every step before the Savings Plan purchase reduces what you're locking in for 1–3 years.

Part 2: Right-Sizing Pod Resource Requests

2.1 Why Over-Provisioned Requests Are So Expensive

Kubernetes schedules pods based on resource requests — not actual usage. A pod that requests 2 vCPUs and 4GB of memory requires a node with that capacity available, regardless of whether the pod is actually using it.

Here's the incorrect approach with the requests set to worst-case peak estimates:

# Bad: Resource requests set during initial deployment, never revisited
# This pod actually uses 250m CPU and 512Mi memory on average
resources:
  requests:
    cpu: "2"        # 8x more than actual usage
    memory: "4Gi"   # 8x more than actual usage
  limits:
    cpu: "4"
    memory: "8Gi"

When every pod is over-requested by 8x, your cluster needs 8x more nodes than your workloads actually require. That's where the 61% compute line in your bill comes from.
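To see why request size drives node count, here is a toy bin-packing sketch. It is not how the Kubernetes scheduler or any autoscaler is implemented: it packs CPU requests only, using a first-fit-decreasing heuristic, and ignores memory, DaemonSets, and system-reserved capacity. The 64-pod workload and the 8-vCPU node size are assumptions chosen for illustration.

```python
def nodes_needed(pod_requests_m: list[int], node_capacity_m: int) -> int:
    """Count nodes needed to place all pods using first-fit decreasing.

    Requests and capacity are CPU millicores. Each pod goes onto the first
    node that still has room; a new node is added when none does.
    """
    free = []  # remaining millicores on each node provisioned so far
    for request in sorted(pod_requests_m, reverse=True):
        if request > node_capacity_m:
            raise ValueError(f"request {request}m exceeds node capacity")
        for i, remaining in enumerate(free):
            if remaining >= request:
                free[i] -= request
                break
        else:
            free.append(node_capacity_m - request)
    return len(free)


NODE_CAPACITY_M = 8000  # assumption: an 8-vCPU node with nothing reserved
POD_COUNT = 64          # assumption: illustrative workload size

over_requested = nodes_needed([2000] * POD_COUNT, NODE_CAPACITY_M)
right_sized = nodes_needed([250] * POD_COUNT, NODE_CAPACITY_M)
print(f"2000m requests: {over_requested} nodes, 250m requests: {right_sized} nodes")
```

In this simplified model the 8x gap in requests translates directly into an 8x gap in node count. Real clusters land somewhere below that because of memory limits and per-node overhead.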

First, verify actual usage before changing anything:

# Install Metrics Server if not already running
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check actual CPU and memory usage per pod
# Compare these numbers against your current resource requests
kubectl top pods --all-namespaces --sort-by=cpu

Expected output showing the typical gap:

NAMESPACE     NAME                    CPU(cores)   MEMORY(bytes)
production    payment-api-xxx         25m          128Mi
production    user-api-xxx            15m          96Mi
production    notification-svc-xxx    5m           64Mi
staging       worker-xxx              10m          256Mi

If your pods are requesting 2 CPU cores each but using only 15m–25m in practice, you have an 80–130x over-request ratio. Every node in your cluster is mostly empty space you're paying for.
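The ratio can be computed directly from Kubernetes CPU quantities. This is a minimal sketch, assuming each pod requests 2 CPU cores; the helper names are hypothetical, and the parser only handles plain core counts and the `m` suffix.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "2", "0.5" or "250m" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def over_request_ratio(requested: str, used: str) -> float:
    """Return how many times larger the request is than the observed usage."""
    return parse_cpu_millicores(requested) / parse_cpu_millicores(used)


# Usage figures from the sample `kubectl top pods` output above,
# compared against an assumed request of 2 CPU cores per pod.
for pod, used in [("payment-api", "25m"), ("user-api", "15m")]:
    print(f"{pod}: {over_request_ratio('2', used):.0f}x over-requested")
```

A single `kubectl top` reading is a point-in-time snapshot, so treat the ratio as a signal to investigate rather than a value to apply directly.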

2.2 Using the Vertical Pod Autoscaler for Recommendations

The Vertical Pod Autoscaler (VPA) is a Kubernetes component that analyses historical CPU and memory usage for each deployment and recommends optimal resource requests. You use it in recommendation-only mode first — it tells you what to set without changing anything automatically, so you can review and apply the changes yourself with full control.

Here's the correct implementation:

# Good: VPA in recommendation-only mode
# Watches your pod's actual usage for 24+ hours, then recommends right-sized requests
# updateMode: "Off" means it only recommends — it never restarts your pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"   # Recommendation only — you apply manually after review
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "100m"     # VPA will never recommend below this floor
        memory: "256Mi"
      maxAllowed:
        cpu: "2"        # VPA will never recommend above this ceiling
        memory: "4Gi"

Install VPA and retrieve recommendations:

# Install VPA components using the install script in the upstream autoscaler repo
git clone https://github.com/kubernetes/autoscaler.git
(cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh)

# Apply the VPA manifest for each deployment you want to right-size
kubectl apply -f vpa/payment-api-vpa.yaml

# Wait 24 hours for VPA to collect usage data, then check recommendations
kubectl describe vpa payment-api-vpa -n production

What a VPA recommendation looks like:

Recommendation:
  Container Recommendations:
    Container Name: payment-api
    Lower Bound:
      cpu:     50m
      memory:  128Mi
    Target:
      cpu:     250m      ← Set your requests to this value
      memory:  512Mi     ← Set your requests to this value
    Upper Bound:
      cpu:     500m
      memory:  1Gi

Apply the recommendation to your deployment:

# Good: Right-sized requests based on VPA Target recommendation
resources:
  requests:
    cpu: "250m"     # Down from 2000m — an 8x reduction
    memory: "512Mi" # Down from 4096Mi — an 8x reduction
  limits:
    cpu: "500m"     # 2x the request — headroom for genuine spikes
    memory: "1Gi"   # 2x the request

All VPA manifests for common deployment types are in vpa/ in the companion repo.

2.3 The ROI of Right-Sizing

Metric	Before	After	Improvement
Average CPU utilisation	18%	65%	+47 percentage points
Node count required	42	28	-33%
Monthly compute cost	$52,000	$36,400	-$15,600/month
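The improvement column can be cross-checked with a short calculation. This sketch uses only the before and after values from the table; `percent_change` is a hypothetical helper name.

```python
def percent_change(before: float, after: float) -> float:
    """Return the relative change from before to after, as a percentage."""
    return 100 * (after - before) / before


node_change = percent_change(42, 28)          # node count
cost_change = percent_change(52_000, 36_400)  # monthly compute cost in USD
monthly_saving = 52_000 - 36_400
utilisation_gain = 65 - 18                    # percentage points, not percent

print(f"Nodes: {node_change:.0f}%  Cost: {cost_change:.0f}%")
print(f"Saving: ${monthly_saving:,}/month  Utilisation: +{utilisation_gain} points")
```

Note that the cost falls by 30% while the node count falls by 33%. The table does not explain the gap.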

Verify the improvement after applying recommendations:

# Check cluster-level utilisation after right-sizing
# Target: 60–75% CPU and memory utilisation across nodes
kubectl top nodes

Part 3: Karpenter for Bin-Packing and Spot Diversification

Karpenter is an open-source Kubernetes node provisioner built by AWS and donated to the CNCF.

Where the default Kubernetes Cluster Autoscaler scales pre-configured node groups up and down, Karpenter watches the actual resource requests of pending pods and provisions an EC2 instance type sized to satisfy them, selecting dynamically from hundreds of available instance types rather than the two or three you pre-configured. It also continuously monitors running nodes for underutilisation and consolidates workloads onto fewer nodes, terminating the empty ones automatically.

The result is a cluster that is always sized to what your workloads actually need right now, not what you anticipated at setup time.

3.1 The Ceiling with Cluster Autoscaler

Cluster Autoscaler works with pre-defined node groups. You configure which instance types are available and it scales those groups up and down.

The limitation is that it can only provision instances from the types you pre-configured. It can't dynamically select the right instance type based on what the workload actually needs right now.

Here's the incorrect approach using static node groups:

# Bad: Two static node groups, each over-provisioning against worst-case scenarios
# CPU-optimised group runs even when workloads are memory-bound
# Memory-optimised group runs even when workloads are CPU-bound
eksctl create nodegroup \
  --cluster my-cluster \
  --name cpu-optimized \
  --instance-types c5.2xlarge \
  --nodes-min 5 --nodes-max 20

eksctl create nodegroup \
  --cluster my-cluster \
  --name memory-optimized \
  --instance-types r5.2xlarge \
  --nodes-min 3 --nodes-max 10

You're provisioning for the worst case in each family simultaneously. At any given moment, one group is underutilised while the other is scaling. Neither is right.

3.2 How Karpenter Solves This

Karpenter watches the actual resource requests of pending pods and provisions an instance type sized to fit them. It selects from hundreds of available instance types, not just the two you pre-configured. It also consolidates running workloads onto fewer nodes when utilisation drops, automatically terminating underutilised nodes.

Here's the correct implementation:

# Good: Karpenter NodePool
# Karpenter selects the optimal instance type based on pending pod requirements
# Tries Spot first, falls back to On-Demand automatically when Spot isn't available
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
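      # Karpenter v1beta1 requires a nodeClassRef; "default" is an assumed name
      # that must match the EC2NodeClass you apply from karpenter/nodeclass.yaml
      nodeClassRef:
        name: default
      requirements: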
        # Allow both x86 and ARM (Graviton) — Karpenter picks the cheaper option
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Try Spot first, fall back to On-Demand if unavailable
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Exclude families with poor price-to-performance ratio
        - key: karpenter.k8s.aws/instance-family
          operator: NotIn
          values: ["t2", "t3a"]
  limits:
    cpu: "1000"
    memory: "4000Gi"
  disruption:
    # Remove underutilised nodes and reschedule their pods automatically
    consolidationPolicy: WhenUnderutilized
    # Recycle nodes after 30 days to ensure fresh, patched AMIs
    expireAfter: 720h

What each setting does:

consolidationPolicy: WhenUnderutilized: Karpenter continuously monitors node utilisation and removes underused nodes, moving their pods elsewhere. Your node count decreases automatically as load drops without any manual intervention.
expireAfter: 720h: Nodes older than 30 days are gracefully replaced, ensuring your infrastructure always runs the latest EKS-optimised AMI with current security patches.
values: ["spot", "on-demand"]: Karpenter attempts Spot capacity first. If Spot is unavailable for the requested instance type, it falls back to On-Demand with no alerts and no manual action required.

Migrating from Cluster Autoscaler safely:

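# Step 1: Install Karpenter alongside Cluster Autoscaler; do not remove CAS yet
# Karpenter charts are published as OCI artifacts. Set KARPENTER_VERSION to a release
# that serves the karpenter.sh/v1beta1 API used in this guide
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace karpenter \
  --create-namespace \
  --set settings.clusterName=your-cluster-name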

# Step 2: Apply NodePool and NodeClass configuration
kubectl apply -f karpenter/nodepool.yaml
kubectl apply -f karpenter/nodeclass.yaml

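# Step 3: Taint existing legacy nodes so new pods schedule on Karpenter nodes
# This migrates workloads gradually as pods are rescheduled
# Repeat for each legacy node group (here: cpu-optimized and memory-optimized)
kubectl taint nodes -l eks.amazonaws.com/nodegroup=cpu-optimized \
  group=legacy:NoSchedule
kubectl taint nodes -l eks.amazonaws.com/nodegroup=memory-optimized \
  group=legacy:NoSchedule

# Step 4: NoSchedule only affects newly scheduled pods, so running pods move
# as they are restarted or as you drain the legacy nodes
# List Karpenter-provisioned nodes, then check which node each pod landed on
kubectl get nodes -l karpenter.sh/nodepool
kubectl get pods -o wide --all-namespaces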

# Step 5: After 30 days of stable operation, remove the old node groups
eksctl delete nodegroup --cluster my-cluster --name cpu-optimized
eksctl delete nodegroup --cluster my-cluster --name memory-optimized

Ready-to-deploy NodePool and NodeClass templates are in karpenter/ in the companion repo.

3.3 Spot Instances for Non-Production Workloads

Staging and development workloads don't need the reliability guarantees of On-Demand instances. Moving them to Spot saves 60–90% on those node costs. Karpenter handles Spot interruptions by rescheduling pods automatically. For stateless workloads, interruptions are invisible to users.

# Good: Spot-only NodePool for staging environments
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: staging-spot
spec:
  template:
    metadata:
      labels:
        billing/environment: staging
    spec:
      taints:
        - key: environment
          value: staging
          effect: NoSchedule  # Only pods that tolerate this taint schedule here
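      # Karpenter v1beta1 requires a nodeClassRef; "default" is an assumed name
      # that must match the EC2NodeClass you apply from karpenter/nodeclass.yaml
      nodeClassRef:
        name: default
      requirements: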
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]   # Spot only for non-production
  disruption:
    consolidationPolicy: WhenUnderutilized

3.4 The ROI of Karpenter and Spot

Metric	Before (Cluster Autoscaler)	After (Karpenter + Spot)	Improvement
Average node count	28	18	-36%
Average CPU utilisation	65%	82%	+17 percentage points
Staging environment cost	$8,000/month	$2,400/month	-70%
Scale-up time for new pods	3–5 minutes	30–60 seconds	-80%

Part 4: Graviton Migration

AWS Graviton is Amazon's own ARM-based processor family, available in EC2 instance families with a g after the generation number, such as m7g, c7g, and r7g.

Graviton instances are priced approximately 20% lower than equivalent Intel or AMD x86 instances. For most server-side workloads — Node.js, Python, Go, Java — they also deliver 20–40% better performance per dollar because the processor architecture is optimised specifically for these workload types.

You don't change your application code to use Graviton. You change the architecture flag in your container image build and the node selector in your Kubernetes deployment.

4.1 Why Graviton Reduces Cost Without Reducing Performance

The first question to answer before migrating is whether your container images support ARM64. Most official images from Docker Hub ship as multi-architecture images. Your own application images need to be built for both architectures explicitly.

Check whether your images support ARM64:

# Check if an image has an ARM64 manifest
docker manifest inspect your-registry/your-app:latest | jq '.manifests[].platform'

Expected output for a multi-arch image:

{"architecture": "amd64", "os": "linux"},
{"architecture": "arm64", "os": "linux", "variant": "v8"}

If arm64 appears, the image is ready. If not, you need to build and push a multi-arch image first.
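If you have many images to check, the same test can be scripted. This is a minimal sketch that parses the JSON printed by `docker manifest inspect`; the function name and the sample document are illustrative assumptions, not real registry output.

```python
import json


def supported_architectures(manifest_json: str) -> set[str]:
    """Return the architectures listed in `docker manifest inspect` JSON output."""
    manifest = json.loads(manifest_json)
    # A multi-arch image (manifest list) has a top-level "manifests" array.
    # A single-arch image does not, so fall back to an empty list.
    return {
        entry["platform"]["architecture"]
        for entry in manifest.get("manifests", [])
        if "platform" in entry
    }


# Trimmed, illustrative sample of a manifest list.
sample = """
{
  "schemaVersion": 2,
  "manifests": [
    {"platform": {"architecture": "amd64", "os": "linux"}},
    {"platform": {"architecture": "arm64", "os": "linux", "variant": "v8"}}
  ]
}
"""

archs = supported_architectures(sample)
print("arm64 ready" if "arm64" in archs else "needs a multi-arch build")
```

An empty result means the image is single-architecture, so it needs the multi-arch build described next before it can run on Graviton.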

Build and push a multi-architecture image:

# Build for both x86 and ARM in a single command using Docker Buildx
docker buildx create --use --name multi-arch-builder

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag your-registry/your-app:latest \
  --push \
  .

4.2 Migrating Workloads to Graviton

With Karpenter already installed, Graviton migration is a single label change on your deployment. Karpenter provisions the appropriate ARM64 node automatically.

Here's the correct implementation:

# Good: nodeSelector directs the pod to Graviton nodes
# Karpenter provisions an arm64 node if one isn't already available
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # Schedule exclusively on Graviton nodes
      containers:
        - name: api
          image: your-registry/payment-api:latest  # Must be multi-arch

Migrate gradually, starting with stateless services:

# Step 1: Migrate one stateless service and monitor for 48 hours
kubectl patch deployment payment-api \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/arch":"arm64"}}}}}'

# Step 2: Watch for errors in the first 30 minutes
kubectl logs -l app=payment-api --tail=100 -f

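# Step 3: Verify the pod is running on a Graviton node
# -o wide shows the node name; the second command shows each node's
# architecture and instance type (expect arm64 and m7g, c7g, or r7g)
kubectl get pods -l app=payment-api -o wide
kubectl get nodes -L kubernetes.io/arch,node.kubernetes.io/instance-type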

# Step 4: After 48 hours of stable operation, migrate the next service

There are some situations where you shouldn't migrate to Graviton: GPU workloads, applications with native x86 binary dependencies, or any workload where you haven't yet built multi-arch images.

4.3 The ROI of Graviton

Workload Type	x86 Monthly Cost	Graviton Monthly Cost	Saving
Web services (Node.js, Python)	$18,000	$14,400	$3,600/month
Data processing	$12,000	$9,600	$2,400/month
API services (Go, Java)	$8,000	$6,400	$1,600/month
Total	$38,000	$30,400	$7,600/month
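The table applies a flat 20% discount to every workload. The sketch below reproduces it under that assumption, which also implies that every listed workload can move to Graviton; `graviton_cost` is a hypothetical helper name.

```python
GRAVITON_DISCOUNT_PCT = 20  # approximate price difference quoted earlier

# x86 monthly cost per workload type in USD, copied from the table above.
x86_monthly_cost = {
    "Web services (Node.js, Python)": 18_000,
    "Data processing": 12_000,
    "API services (Go, Java)": 8_000,
}


def graviton_cost(x86_cost: int) -> int:
    """Apply the discount with integer arithmetic to avoid float rounding."""
    return x86_cost * (100 - GRAVITON_DISCOUNT_PCT) // 100


total_x86 = sum(x86_monthly_cost.values())
total_graviton = sum(graviton_cost(cost) for cost in x86_monthly_cost.values())

print(f"x86: ${total_x86:,}  Graviton: ${total_graviton:,}")
print(f"Saving: ${total_x86 - total_graviton:,}/month")
```

If only part of a workload can migrate, scale its entry down before summing to get a more realistic estimate.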

Part 5: VPC Endpoints for Data Transfer

5.1 The NAT Gateway Tax

Traffic from EKS pods in private subnets to an AWS service such as S3, DynamoDB, ECR, or SQS goes through a NAT Gateway unless you have configured VPC endpoints. NAT Gateway charges $0.045 per GB of data processed.

A busy EKS cluster pulling container images from ECR, writing to S3, and polling SQS queues can process hundreds of terabytes per month through NAT Gateway — generating thousands of dollars in charges for traffic that never actually left the AWS network.
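To put numbers on that, the processing charge scales linearly with volume. This is a rough sketch using the $0.045 per GB rate above; treating 1 TB as 1,024 GB is an assumption, and the hourly NAT Gateway charge is not included.

```python
from decimal import Decimal

NAT_PROCESSING_USD_PER_GB = Decimal("0.045")  # rate quoted above
GB_PER_TB = 1024  # assumption: binary terabytes


def nat_processing_cost(terabytes: int) -> Decimal:
    """Return the monthly NAT Gateway data processing charge in USD."""
    return terabytes * GB_PER_TB * NAT_PROCESSING_USD_PER_GB


for tb in (10, 100, 300):
    print(f"{tb:>4} TB/month -> ${nat_processing_cost(tb):,.2f}")
```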

Measure your current NAT Gateway cost before adding endpoints:

# Get last month's NAT Gateway data processing charges
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity DAILY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["NatGateway-Bytes"]
    }
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

5.2 VPC Endpoints — The Fix That Takes 30 Minutes

A VPC endpoint creates a private connection between your VPC and an AWS service, routing traffic over the AWS network without touching the NAT Gateway. Gateway endpoints (S3 and DynamoDB) are free. Interface endpoints (such as ECR) cost approximately $0.01/hour per Availability Zone, roughly $7.20/month each, plus a per-GB processing charge of about $0.01, well below the $0.045 NAT Gateway rate.
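Whether an interface endpoint pays for itself depends on traffic volume. The sketch below estimates the break-even point. It assumes an interface endpoint data processing rate of $0.01 per GB and a 720-hour month, neither of which comes from the article; the function names are hypothetical, and `endpoint_count` should count one per endpoint per Availability Zone.

```python
from decimal import Decimal

NAT_USD_PER_GB = Decimal("0.045")         # NAT Gateway processing rate quoted earlier
ENDPOINT_USD_PER_GB = Decimal("0.01")     # assumption: interface endpoint processing rate
ENDPOINT_USD_PER_MONTH = Decimal("7.20")  # $0.01/hour x 720 hours, per endpoint per AZ


def break_even_gb(endpoint_count: int = 1) -> Decimal:
    """Monthly GB above which interface endpoints cost less than NAT processing."""
    fixed_cost = ENDPOINT_USD_PER_MONTH * endpoint_count
    saving_per_gb = NAT_USD_PER_GB - ENDPOINT_USD_PER_GB
    return fixed_cost / saving_per_gb


def monthly_saving(gb: int, endpoint_count: int = 1) -> Decimal:
    """Net monthly saving in USD from routing `gb` through interface endpoints."""
    saving_per_gb = NAT_USD_PER_GB - ENDPOINT_USD_PER_GB
    return gb * saving_per_gb - ENDPOINT_USD_PER_MONTH * endpoint_count


print(f"Break-even: {break_even_gb():.0f} GB/month per endpoint")
print(f"Saving at 1,000 GB/month: ${monthly_saving(1000):.2f}")
```

Gateway endpoints for S3 and DynamoDB have no hourly or per-GB charge, so there is no break-even to calculate for them.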

Here's the complete implementation for the four most common EKS traffic destinations:

# Get your VPC ID and primary route table ID first
VPC_ID=$(aws eks describe-cluster --name your-cluster \
  --query 'cluster.resourcesVpcConfig.vpcId' --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=$VPC_ID Name=association.main,Values=true \
  --query 'RouteTables[0].RouteTableId' --output text)

echo "VPC: $VPC_ID | Route Table: $ROUTE_TABLE_ID"

# S3 gateway endpoint — free to create, eliminates all S3 traffic through NAT
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids $ROUTE_TABLE_ID

# DynamoDB gateway endpoint — also free, same mechanism as S3
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids $ROUTE_TABLE_ID

# ECR API interface endpoint — eliminates NAT charges on image pulls
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=$VPC_ID Name=tag:Tier,Values=private \
    --query 'Subnets[*].SubnetId' --output text)

# ECR Docker endpoint — required alongside ECR API for complete image pull coverage
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=$VPC_ID Name=tag:Tier,Values=private \
    --query 'Subnets[*].SubnetId' --output text)

The Terraform module that creates all four endpoints in a single apply is in terraform/vpc-endpoints/ in the companion repo.

Verify that the endpoints are routing traffic correctly:

aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=$VPC_ID \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State,Type:VpcEndpointType}' \
  --output table
# Expected: all endpoints showing State=available

5.3 The ROI of VPC Endpoints

Service	Before (Through NAT)	After (VPC Endpoint)	Monthly Saving
S3 data transfer	$4,500	$0	$4,500
ECR image pulls	$800	$0	$800
DynamoDB queries	$1,200	$0	$1,200
Endpoint cost	—	$29 (4 endpoints)	-$29
Net saving			$6,471/month

Part 6: EBS Volume Optimisation

6.1 The gp2 to gp3 Migration

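EBS gp2 volumes tie IOPS to storage size: 3 IOPS per GB, with a 100 IOPS minimum and a 16,000 IOPS cap. EBS gp3 volumes provide a 3,000 IOPS and 125 MiB/s baseline regardless of size, and cost 20% less per GB. The migration runs online with no downtime. One caveat: gp2 volumes larger than 1,000 GB already exceed 3,000 IOPS, so provision additional gp3 IOPS for those volumes rather than accepting the default.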
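The IOPS and price relationship is easy to check per volume before you migrate. Here's a minimal sketch, assuming list prices of $0.10/GB for gp2 and $0.08/GB for gp3 (consistent with the 20% difference above) and the standard 16,000 IOPS ceiling on gp2. The function names are hypothetical.

```python
# Compare a gp2 volume's baseline with the gp3 default before migrating.
GP2_USD_PER_GB = 0.10     # assumed list price
GP3_USD_PER_GB = 0.08     # 20% lower, as stated above
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gb: int) -> int:
    """3 IOPS per GB, floored at 100 and capped at 16,000."""
    return min(max(100, 3 * size_gb), 16000)


def migration_summary(size_gb: int) -> dict:
    """Summarise the IOPS change and storage saving for one volume."""
    gp2_iops = gp2_baseline_iops(size_gb)
    return {
        "gp2_iops": gp2_iops,
        "gp3_iops": GP3_BASELINE_IOPS,
        "monthly_saving_usd": round(size_gb * (GP2_USD_PER_GB - GP3_USD_PER_GB), 2),
        # True when the gp3 default would be slower than the current gp2 baseline
        "needs_extra_gp3_iops": gp2_iops > GP3_BASELINE_IOPS,
    }


if __name__ == "__main__":
    for size in (20, 500, 2000):
        print(size, migration_summary(size))
```

Any volume flagged with needs_extra_gp3_iops is one where you'd want to set an explicit IOPS value during the migration instead of relying on the gp3 default.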

Find and migrate all gp2 volumes:

# Step 1: List all gp2 volumes and their sizes
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,State:State}' \
  --output table

# Step 2: Migrate each gp2 volume to gp3 — no instance stop required
# The modify operation runs online while the volume stays attached and in use
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text | tr '\t' '\n' | while read vol; do
    echo "Migrating $vol from gp2 to gp3..."
    aws ec2 modify-volume \
      --volume-id $vol \
      --volume-type gp3
done

# Step 3: Verify all volumes are now gp3
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text
# Expected: empty output — zero gp2 volumes remaining

6.2 Finding and Removing Orphaned Volumes and Snapshots

When Kubernetes PersistentVolumeClaims are deleted, the underlying EBS volumes sometimes aren't cleaned up, for example when the PersistentVolume's reclaim policy is Retain. They stay provisioned, and billed, indefinitely.

# Find unattached EBS volumes — status=available means not attached to any instance
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

# Find EBS snapshots older than 90 days
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601=seconds)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table

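Before deleting any snapshot, confirm that it isn't backing a registered AMI and that it isn't the only backup of a production volume. RDS backups are managed separately and don't appear in this list, so review those through your RDS backup schedule.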

6.3 The ROI of EBS Optimisation

Resource	Before	After	Monthly Saving
gp2 → gp3 migration (1TB total)	$102	$72	$30
Unattached volumes removed (50 × 100GB)	$500	$0	$500
Old snapshots cleaned (500GB)	$25	$0	$25
Total	$627	$72	$555/month

Part 7: Load Balancer Consolidation

7.1 The Problem — One Load Balancer Per Service

Many teams create a separate LoadBalancer Service for every microservice. On AWS, each of those Services provisions its own load balancer: a Classic Load Balancer by default, or a Network Load Balancer when the AWS Load Balancer Controller manages the Service. Each one carries a base charge of roughly $16 to $18/month plus usage-based charges for traffic processed. At 20 microservices, that's at least $324/month before a single request is processed.

Here's the incorrect approach:

# Bad: This creates a dedicated AWS load balancer every time it's applied
# 20 microservices = 20 load balancers = $324+/month before any traffic charges
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: LoadBalancer   # Creates a dedicated load balancer for this one Service
  ports:
  - port: 80
    targetPort: 8080

7.2 The Fix — Shared Ingress Controller

An Ingress controller is a Kubernetes component that runs as a pod inside your cluster and programs a single external load balancer to route traffic to multiple services based on hostname and URL path. Instead of one AWS load balancer per microservice, you get one ALB total, with path-based routing directing each request to the right backend service. The result is the same routing behaviour at a fraction of the cost.

Here's the correct implementation:

# Good: One Ingress resource routes all external traffic
# The AWS Load Balancer Controller creates one ALB for all services listed here
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shared-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
spec:
  rules:
  - host: api.company.com
    http:
      paths:
      - path: /payments
        pathType: Prefix
        backend:
          service:
            name: payment-service
            port:
              number: 8080
      - path: /users
        pathType: Prefix
        backend:
          service:
            name: user-service
            port:
              number: 8080
  - host: dashboard.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: dashboard-service
            port:
              number: 3000
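  tls:
  # The AWS Load Balancer Controller does not read TLS Secrets. It uses these hosts
  # to discover a matching ACM certificate (or set alb.ingress.kubernetes.io/certificate-arn).
  - hosts:
    - api.company.com
    - dashboard.company.com
    secretName: tls-wildcard-cert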

Verify the Ingress is provisioned and the ALB DNS name is assigned:

# Watch until the ADDRESS column shows the ALB DNS name (typically 2–3 minutes)
kubectl get ingress shared-ingress -n production -w
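If the routing rules in the manifest feel abstract, it can help to model them in a few lines. The sketch below is a conceptual illustration only, not how the ALB or the controller is implemented: it copies the three host and path rules from the manifest into a hypothetical RULES table and resolves a request using Kubernetes Prefix semantics (whole path segments, longest match wins).

```python
from typing import Optional

# Mirrors the host and path rules in the manifest: (host, path prefix, backend)
RULES = [
    ("api.company.com", "/payments", "payment-service:8080"),
    ("api.company.com", "/users", "user-service:8080"),
    ("dashboard.company.com", "/", "dashboard-service:3000"),
]


def prefix_matches(prefix: str, path: str) -> bool:
    """Prefix semantics: match whole path segments, not raw substrings."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def route(host: str, path: str) -> Optional[str]:
    """Return the backend for a request, or None if no rule matches."""
    candidates = [
        (prefix, backend)
        for rule_host, prefix, backend in RULES
        if rule_host == host and prefix_matches(prefix, path)
    ]
    if not candidates:
        return None
    # The longest matching prefix wins
    return max(candidates, key=lambda c: len(c[0]))[1]


if __name__ == "__main__":
    print(route("api.company.com", "/payments/123"))
    print(route("dashboard.company.com", "/reports"))
```

A request for /paymentsx on api.company.com matches nothing here, because Prefix matching works on path segments rather than on the raw string.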

The cost difference:

Approach	Load balancers	Monthly cost
LoadBalancer Service per microservice (20 services)	20 load balancers	~$400/month
Single Ingress controller	1 ALB	~$27/month
Monthly saving		~$373/month

The shared Ingress manifest is in k8s/ingress/ in the companion repo.

The Complete 7-Step Sequence

Step	Action	Time to Implement	Expected Monthly Saving
1	Right-size pod resource requests (VPA)	1 week	$15,600
2	Install Karpenter with consolidation	1 week	$8,400
3	Move staging and dev to Spot	1 week	$11,200
4	Migrate compatible workloads to Graviton	2 weeks	$7,600
5	Add VPC endpoints for S3, ECR, DynamoDB	1 day	$6,471
6	Migrate gp2 to gp3 and delete orphaned volumes	1 day	$555
7	Consolidate load balancers with shared Ingress	1 day	$373
Total		3–4 weeks	$50,199/month

Annual saving at this rate: $602,388. Engineering time required: one engineer, one sprint per step.

Best Practices for EKS Cost Optimisation

✅ Do: Right-size pod resource requests before any other optimisation. Every subsequent step depends on accurate requests.

✅ Do: Implement Karpenter with consolidationPolicy: WhenUnderutilized. Let it continuously optimise your node count automatically.

✅ Do: Move staging and development workloads to Spot. 60–90% savings for workloads that tolerate interruption.

✅ Do: Migrate compatible workloads to Graviton. Most web services and APIs run without code changes.

✅ Do: Add VPC endpoints for S3, DynamoDB, and ECR before reviewing data transfer costs.

✅ Do: Migrate gp2 volumes to gp3. It's online, zero downtime, and immediately 20% cheaper.

✅ Do: Use a single shared Ingress controller for all external traffic instead of per-service load balancers.

❌ Don't: Buy Savings Plans before completing steps 1–6. You'll lock in waste for 1–3 years.

❌ Don't: Use static node groups with Cluster Autoscaler when your workload mix changes. Karpenter handles this dynamically.

❌ Don't: Run staging and development environments on On-Demand instances. Spot interruptions are manageable, but the cost difference is not.

Resources

Karpenter Documentation — Official NodePool configuration reference and installation guide
AWS Graviton Getting Started Guide — Language-specific compatibility notes and migration guidance from AWS
Vertical Pod Autoscaler GitHub — VPA installation and configuration documentation
AWS VPC Endpoints Documentation — Complete list of available VPC endpoints and configuration options
EBS Volume Modification Documentation — AWS guide for online volume type migration with zero downtime
AWS Load Balancer Controller — Official documentation for the Ingress controller that provisions AWS ALBs
AWS Cost Explorer API Reference — Full reference for the cost breakdown commands used throughout this guide
EKS Best Practices Guide — Cost Optimisation — AWS's official EKS cost optimisation framework
Companion Repository — All Terraform modules, NodePool templates, VPA manifests, and automation scripts from this guide

The 2026 FinOps Roadmap: From Cost-Blind Engineer to Cloud Financial Manager

Ayobami Adejumo — Mon, 15 Jun 2026 23:22:50 +0000

My first AWS bill was $23,000. I had been working at the company for three weeks.

Nobody told me. The bill just grew quietly in the background while I was proud of the feature I shipped. A Lambda function that called an external enrichment API on every user event. Clean code. Solid tests. Thirty-two million events that month. At $0.0007 per API call.

My engineering manager forwarded the invoice with two words: "Please explain."

That was the moment I discovered FinOps — not from a conference talk or a certification course, but from the specific shame of having written expensive code and not knowing it until the damage was done.

This roadmap is what I needed that day. A complete, honest guide to transforming from an engineer who builds things that work into an engineer who builds things that work and cost what they should. By the end of this guide, you'll have the skills, the scripts, and the vocabulary to talk about cloud spend the way a CFO and a CTO both want to hear.

What You'll Learn
Prerequisites
The Four Stages Overview
Stage 1: The Cost-Aware Engineer — Months 1 to 3
Stage 2: The Optimisation Specialist — Months 4 to 8
Stage 3: The Automation Architect — Months 9 to 15
Stage 4: The Cloud Financial Manager — Months 16 to 24
Essential Tools and Certifications
Your 90-Day Action Plan
Best Practices Summary
Resources

What You'll Learn

How to read your AWS bill as an engineer, not as a passive observer
The exact tagging strategy that makes cost attribution possible
How to right-size EC2 and RDS instances using CloudWatch data you already have
The correct sequence for purchasing Savings Plans — and why sequence matters more than the discount percentage
How to build automated cleanup systems for orphaned resources
How to present cloud cost findings to engineering leadership with data that drives decisions
The chargeback and showback models that make cost accountability stick

Let's begin.

Prerequisites

Before following this roadmap, you should have some skills and tools ready to go.

Knowledge:

You can deploy an application to AWS (EC2, Lambda, or containers)
You understand basic AWS services: S3, RDS, EC2, VPC, IAM
You're comfortable reading Python and writing simple bash scripts
You know what a pull request is and have gone through at least one code review

Access:

Read-only access to your AWS billing console and Cost Explorer
AWS CLI v2 configured with at least ReadOnlyAccess policy attached
Python 3.9 or later for running the audit scripts in this guide

Mindset: You don't need to be a finance expert. But you do need to be willing to look at numbers that might be uncomfortable. Every engineer I've worked with who became excellent at FinOps had one thing in common: they were willing to be the person who asked "but what does this cost?" in a room where nobody else wanted to.

Estimated time: This roadmap covers 24 months of deliberate skill-building. You can absorb the reading in a few evenings. The practice is the 24 months.

The Four Stages Overview

Before going deep, here's the complete picture of where you're going:

Stage 1 — Cost-Aware Engineer (Months 1–3)
├── Read your cloud bill and understand it
├── Tag every resource with meaningful metadata
├── Identify your top 5 cost drivers
└── Block your first expensive PR with cost justification

Stage 2 — Optimisation Specialist (Months 4–8)
├── Right-size every over-provisioned resource
├── Implement storage lifecycle policies
├── Move non-production to Spot instances
└── Purchase your first Savings Plan in the right order

Stage 3 — Automation Architect (Months 9–15)
├── Build automated cleanup for orphaned resources
├── Add cost estimation to your CI/CD pipeline
├── Create cost-aware auto-scaling triggers
└── Deploy a self-service FinOps dashboard

Stage 4 — Cloud Financial Manager (Months 16–24)
├── Lead monthly FinOps reviews with engineering leadership
├── Build chargeback models for departments
├── Negotiate enterprise agreements with AWS
└── Forecast cloud spend within 5% variance

The reason this is a 24-month journey and not a weekend project: each stage builds on the previous one. Engineers who jump straight to Savings Plans without rightsizing first end up paying discounted prices for waste. Engineers who build dashboards before tagging get beautiful charts with no actionable data. The sequence isn't arbitrary.

Stage 1: The Cost-Aware Engineer — Months 1 to 3

1.1 Reading the Bill Like an Engineer, Not an Accountant

The default AWS Cost Explorer view shows you service-level totals. That's accounting. What you need is engineering-level decomposition: which specific resources cost money, what business function they serve, and whether each dollar is justified.

Start by pulling a proper breakdown:

# Pull last month's cost breakdown grouped by service
# Run this before touching any optimisation — this is your baseline
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].[Metrics.UnblendedCost.Amount,Keys[0]]' \
  --output text | sort -rn

Save the output. Name the file aws-baseline-YYYY-MM.txt. You'll compare every future month against this number. Without a baseline, you can't measure progress — and without measurable progress, you can't make the case to leadership that the work is worth engineering time.

Three questions for every service in your top 5:

Most engineers stop at "what is this service?" and never reach the useful question. Here's the framework I use when I first audit an account:

The first question is whether you know what specific business function this service is performing. Not the product name, the function. "S3" isn't an answer. "Storing unprocessed video uploads that sit for 90 days before anyone watches them" is an answer.

The second question is whether the cost is growing, stable, or shrinking when you look at the past three months. A stable $12,000/month is a different problem from a $12,000/month line that was $4,000 six months ago.

The third question is what percentage of your total bill this service represents. Optimising a 1% line item while a 40% line item runs unchecked is a common time-wasting trap.
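The second and third questions are simple arithmetic once you have a few months of per-service totals. Here's a minimal sketch of how I'd compute them. The function names and the 10% tolerance for calling a line "stable" are my own assumptions, not an AWS convention.

```python
def classify_trend(monthly_costs: list, tolerance: float = 0.10) -> str:
    """Label a service as growing, stable, or shrinking.

    monthly_costs is ordered oldest to newest. The first and last months are
    compared, and changes within the tolerance count as stable.
    """
    if not monthly_costs:
        raise ValueError("monthly_costs must not be empty")
    first, last = monthly_costs[0], monthly_costs[-1]
    if first == 0:
        return "growing" if last > 0 else "stable"
    change = (last - first) / first
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "shrinking"
    return "stable"


def bill_share(service_cost: float, total_bill: float) -> float:
    """Percentage of the total bill attributable to one service."""
    if total_bill <= 0:
        raise ValueError("total_bill must be positive")
    return round(service_cost / total_bill * 100, 1)


if __name__ == "__main__":
    # Hypothetical figures for illustration only
    print(classify_trend([4000, 8000, 12000]), bill_share(12000, 30000))
```

A line that is both "growing" and a large share of the bill goes to the top of the list. A small, stable line can wait.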

1.2 The Tagging Strategy That Actually Survives

Here's the honest truth about tagging: most tagging strategies die within six months because they're designed for reporting rather than for engineers. Engineers don't tag things well when they're moving fast. The solution isn't to demand more discipline. Instead, it's to enforce tagging at the infrastructure layer.

Here's the minimal viable tag set (the six tags that cover 90% of attribution needs):

# These six tags enable cost attribution, accountability, and automated remediation
# Add these to every resource in your AWS account — EC2, RDS, S3, Lambda, everything

Environment: "production" | "staging" | "dev"
Team: "platform" | "backend" | "data" | "ml"
Service: "payment-api" | "fraud-detection" | "user-service"
Owner: "ayo@cloudfrugal.com"     # Person responsible for this resource
CostCenter: "engineering"         # For chargeback reporting
AutoShutdown: "true" | "false"    # Enables automated remediation

Enforce tags at the Terraform level so they can't be skipped:

# variables.tf
# Add this to your Terraform root module
# Any plan whose required_tags value is missing one of these keys will fail validation

variable "required_tags" {
  description = "Tags required on every resource in this account"
  type = map(string)
  
  validation {
    condition = (
      contains(keys(var.required_tags), "Environment") &&
      contains(keys(var.required_tags), "Team") &&
      contains(keys(var.required_tags), "Owner")
    )
    error_message = "required_tags must include Environment, Team, and Owner."
  }
}

# Apply in every resource
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = merge(var.required_tags, {
    Name    = "app-server-${var.environment}"
    Service = "payment-api"
  })
}

Find everything that's currently untagged:

# List EC2 instances missing the Team tag
# Run this weekly until you hit zero results
aws ec2 describe-instances \
  --query "Reservations[].Instances[?!not_null(Tags[?Key=='Team'].Value | [0])].[InstanceId, InstanceType, State.Name]" \
  --output table

Once you start finding untagged resources, you'll discover a pattern: the oldest resources in the account are the least tagged, and they're often the most expensive. An EC2 instance from 2021 that predates your tagging policy is exactly the kind of thing that generates a $3,000/month line item nobody can explain.
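The same six-tag set can be checked in code, which is handy when you're auditing resources that predate the Terraform validation. This is a small sketch under my own assumptions: the function name is hypothetical, and it only validates the two tags whose allowed values are fully enumerated above (Environment and AutoShutdown).

```python
REQUIRED_TAGS = ("Environment", "Team", "Service", "Owner", "CostCenter", "AutoShutdown")

# Only tags with a closed set of values are validated here
ALLOWED_VALUES = {
    "Environment": {"production", "staging", "dev"},
    "AutoShutdown": {"true", "false"},
}


def tag_problems(tags: dict) -> list:
    """Return human-readable problems for one resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    for key, allowed in ALLOWED_VALUES.items():
        value = tags.get(key)
        if value and value not in allowed:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    legacy_instance = {"Environment": "prod", "Team": "backend"}
    for problem in tag_problems(legacy_instance):
        print(problem)
```

An empty list means the resource is fully attributable. Anything else is a line for the weekly untagged-resource report.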

1.3 The Cost-Aware Code Review

The most underused FinOps practice in engineering teams is reviewing code changes for cost implications before they merge. It takes thirty seconds per PR once you build the habit, and it prevents the kind of problem that opened this guide: the expensive feature that nobody priced before shipping.

Add this section to your PR template:

## Cost Impact (required for infrastructure and data changes)

- [ ] This change does not affect cloud resource usage
- [ ] New API calls introduced: estimated cost per call $______, calls/month ______
- [ ] New data storage: estimated monthly delta $______
- [ ] Cross-region data transfer introduced: yes / no
- [ ] New external service dependency with per-call pricing: yes / no

If any box other than the first is checked, add a cost estimate before requesting review.

The discipline is in making cost estimation a first-class review concern, not an afterthought that gets caught by the finance team on the 15th of the month.
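The estimate the template asks for is usually one multiplication. As a sketch, here's the calculation applied to the figures from the opening story (32 million events at $0.0007 per call). The function names are hypothetical, and the $500 review threshold is an assumption that mirrors the pipeline check later in this guide.

```python
def monthly_call_cost(calls_per_month: int, cost_per_call: float) -> float:
    """Estimated monthly spend for a per-call priced dependency, in USD."""
    if calls_per_month < 0 or cost_per_call < 0:
        raise ValueError("inputs must be non-negative")
    return round(calls_per_month * cost_per_call, 2)


def needs_cost_review(monthly_estimate: float, threshold: float = 500.0) -> bool:
    """Flag estimates above the review threshold (an assumed team policy)."""
    return monthly_estimate > threshold


if __name__ == "__main__":
    estimate = monthly_call_cost(32_000_000, 0.0007)
    print(f"Estimated cost: ${estimate:,.2f}/month")
    print("Needs cost review:", needs_cost_review(estimate))
```

Thirty seconds with those two numbers in the PR description would have surfaced the problem before it shipped.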

Stage 1 Outcomes

By the end of month 3, you should have a baseline cost breakdown on file, 100% tag coverage on active resources, identified your top 5 cost drivers with specific reduction targets, and blocked at least one expensive PR with a cost justification that held up in review.

Stage 2: The Optimisation Specialist — Months 4 to 8

2.1 Right-Sizing: The 80/20 of Cloud Savings

The single most reliable source of cloud waste I find in every account I audit is over-provisioned compute.

The pattern is consistent: an engineer provisions an instance at a size that handles their anticipated peak load, the peak never quite materialises at the expected scale, and nobody revisits the instance size because there's no automatic signal that says "this machine is 75% empty."

Make sure you verify actual utilisation before changing anything:

# rightsize_analyzer.py
# Finds EC2 instances running below 20% average CPU for 14 days
# These are right-sizing candidates — not automatic deletions

import boto3
from datetime import datetime, timedelta, timezone

def find_oversized_instances(region='us-east-1'):
    """
    Returns instances with average CPU below 20% for the last 14 days.
    Low CPU alone doesn't mean right-size; check memory too if the CW agent is installed.
    """
    ec2 = boto3.client('ec2', region_name=region)
    cw  = boto3.client('cloudwatch', region_name=region)

    # Paginate so accounts with many instances are fully covered
    reservations = []
    for page in ec2.get_paginator('describe_instances').paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    ):
        reservations.extend(page['Reservations'])

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=14)
    candidates = []

    for r in reservations:
        for inst in r['Instances']:
            iid  = inst['InstanceId']
            itype = inst['InstanceType']
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}

            # Pull daily average CPU for the last 14 days from CloudWatch
            datapoints = cw.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': iid}],
                StartTime=start_time,
                EndTime=end_time,
                Period=86400,   # One datapoint per day
                Statistics=['Average']
            )['Datapoints']

            # No datapoints means no evidence (for example, a just-launched instance)
            if not datapoints:
                continue

            avg_cpu = sum(d['Average'] for d in datapoints) / len(datapoints)

            if avg_cpu < 20.0:
                candidates.append({
                    'instance_id':  iid,
                    'instance_type': itype,
                    'avg_cpu_pct':  round(avg_cpu, 1),
                    'environment':  tags.get('Environment', 'unknown'),
                    'owner':        tags.get('Owner', 'unknown'),
                    'team':         tags.get('Team', 'unknown'),
                })

    return sorted(candidates, key=lambda x: x['avg_cpu_pct'])

if __name__ == '__main__':
    results = find_oversized_instances()
    print(f"\nFound {len(results)} right-sizing candidates:\n")
    for r in results:
        print(f"  {r['instance_id']} ({r['instance_type']}) — "
              f"{r['avg_cpu_pct']}% avg CPU — "
              f"owner: {r['owner']}")

A word of caution: CPU utilisation below 20% is a signal, not a verdict. Some workloads are memory-intensive or I/O-bound and will show low CPU while being correctly sized. Before acting on any right-sizing recommendation, check memory utilisation (requires the CloudWatch agent) and network I/O patterns alongside CPU.
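One way to encode that caution is to treat CPU as a first filter and memory as the tie-breaker. The sketch below is illustrative, not a rule: the function name is hypothetical, the 20% CPU threshold comes from the script above, and the 40% memory threshold is an assumption you should tune for your workloads.

```python
from typing import Optional


def rightsizing_verdict(avg_cpu_pct: float,
                        avg_mem_pct: Optional[float] = None,
                        cpu_threshold: float = 20.0,
                        mem_threshold: float = 40.0) -> str:
    """Combine CPU and memory signals into a cautious recommendation.

    Returns 'keep', 'investigate', or 'candidate'.
    """
    if avg_cpu_pct >= cpu_threshold:
        return "keep"
    if avg_mem_pct is None:
        # Low CPU but no memory data: not enough evidence to act
        return "investigate"
    if avg_mem_pct >= mem_threshold:
        # Low CPU with busy memory suggests a memory-bound workload
        return "keep"
    return "candidate"


if __name__ == "__main__":
    print(rightsizing_verdict(12.0))         # no CloudWatch agent installed
    print(rightsizing_verdict(12.0, 75.0))   # memory-bound
    print(rightsizing_verdict(12.0, 18.0))   # genuinely idle
```

Network and disk I/O aren't modelled here, so an "investigate" or "candidate" result is still a prompt for a human look, not an instruction to resize.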

2.2 Storage Tiering: Stop Paying Retail for Cold Data

S3 Standard costs $0.023 per GB per month. S3 Glacier Deep Archive costs $0.00099 per GB per month. The difference is a factor of 23. If you have data that you last accessed six months ago and you're keeping it in S3 Standard because nobody set up lifecycle policies, you're paying 23x more than necessary.
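To see what that ratio means for a real bucket, multiply it out. Here's a minimal sketch using only the two per-GB prices quoted above. The function names are hypothetical, and retrieval fees, request charges, and minimum storage durations are deliberately left out.

```python
# Per-GB monthly storage prices quoted in the text (USD)
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(gigabytes: float, storage_class: str) -> float:
    """Storage-only monthly cost in USD for the given class."""
    return round(gigabytes * PRICE_PER_GB_MONTH[storage_class], 2)


def price_ratio(expensive: str = "STANDARD", cheap: str = "DEEP_ARCHIVE") -> float:
    """How many times more the expensive class costs per GB."""
    return round(PRICE_PER_GB_MONTH[expensive] / PRICE_PER_GB_MONTH[cheap], 1)


if __name__ == "__main__":
    cold_data_gb = 10_000  # hypothetical 10 TB of cold data
    for storage_class in PRICE_PER_GB_MONTH:
        print(storage_class, monthly_storage_cost(cold_data_gb, storage_class))
```

Because retrieval from Deep Archive is slow and billed separately, this comparison only holds for data you genuinely don't expect to read.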

The complete S3 lifecycle policy for engineering teams:

{
  "Rules": [
    {
      "ID": "application-logs-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 90,  "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    },
    {
      "ID": "training-checkpoints-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "ml-checkpoints/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"}
      ],
      "Expiration": {"Days": 90}
    }
  ]
}

# Apply the lifecycle policy to a bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-logs-bucket \
  --lifecycle-configuration file://lifecycle.json

# Verify it applied correctly
aws s3api get-bucket-lifecycle-configuration \
  --bucket your-logs-bucket

2.3 Savings Plans: The Sequence Is Everything

A Savings Plan is a commitment to spend a minimum dollar amount per hour on AWS compute for one or three years, in exchange for discounts of 30–70% off On-Demand rates. The discount is real. The trap is buying before optimising.

The wrong order: You have a $50,000/month EC2 bill. You buy a one-year Savings Plan covering $35,000/month of usage (about $48/hour). Then you implement right-sizing and Spot instances, and your actual spend drops to $22,000/month. You've committed to paying $35,000/month for 12 months against a need of $22,000. You're paying $13,000/month for compute you don't use, at a 30% discount. Congratulations on your discounted waste.
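The arithmetic behind that scenario is worth making explicit, because it's the number leadership will ask about. This is a simplified sketch with hypothetical function names: real Savings Plan commitments are set per hour and measured in discounted dollars, so treat the monthly figures as an approximation.

```python
def unused_commitment(monthly_commitment: float, monthly_eligible_usage: float) -> float:
    """Dollars per month paid for commitment that no workload consumes."""
    return round(max(0.0, monthly_commitment - monthly_eligible_usage), 2)


def utilisation_pct(monthly_commitment: float, monthly_eligible_usage: float) -> float:
    """Share of the commitment actually used, capped at 100%."""
    if monthly_commitment <= 0:
        raise ValueError("monthly_commitment must be positive")
    used = min(monthly_commitment, monthly_eligible_usage)
    return round(used / monthly_commitment * 100, 1)


if __name__ == "__main__":
    commitment, usage = 35_000, 22_000  # figures from the scenario above
    print(f"Unused commitment: ${unused_commitment(commitment, usage):,.2f}/month")
    print(f"Utilisation: {utilisation_pct(commitment, usage)}%")
```

Running the same check against your optimised baseline before purchasing tells you how much headroom a given commitment level leaves.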

The right order:

Month 1-2: Right-size all instances using VPA and CloudWatch data
Month 3:   Move staging and development to Spot instances
Month 4:   Migrate compatible workloads to Graviton (20% cheaper)
Month 5:   Add VPC endpoints to eliminate NAT Gateway charges
Month 6:   THEN look at your steady-state On-Demand spend
Month 6+:  Purchase Savings Plans covering 70% of that optimised baseline

Calculate what to commit to:

# Get your On-Demand EC2 spend for the last 30 days
# This is your rightsized baseline — the number to commit against
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE",       "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On Demand Instances"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own recommendation for what to commit
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

Stage 3: The Automation Architect — Months 9 to 15

3.1 The Orphaned Resource Problem — And Why It Never Fixes Itself

Orphaned resources are the cloud equivalent of a gym membership you forgot to cancel. They exist, they charge you, but nobody notices until the annual audit.

The root cause isn't laziness. It's the absence of lifecycle management at the infrastructure layer. When an engineer spins up an EC2 instance for a one-week experiment and then leaves the company, there's no automatic signal that the instance is now orphaned. It sits there, billing $140/month, until someone hunts it down.

The fix is a weekly automated audit that surfaces candidates for deletion and notifies the registered owner, not a process change that depends on engineers remembering to clean up.

# orphan_reporter.py
# Runs every Sunday via EventBridge → Lambda
# Posts a Slack report of orphaned resources for human review
# DOES NOT auto-delete — deletion requires a human decision

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
UNATTACHED_VOLUME_AGE_DAYS = 14
SNAPSHOT_AGE_DAYS = 90


def find_orphaned_resources():
    ec2 = boto3.client('ec2')
    report = {'monthly_waste_usd': 0, 'items': []}

    # Unattached EBS volumes
    for vol in ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age >= UNATTACHED_VOLUME_AGE_DAYS:
            cost = round(vol['Size'] * 0.08, 2)  # gp3 rate
            tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}
            report['items'].append({
                'type':  'Unattached EBS Volume',
                'id':    vol['VolumeId'],
                'detail': f"{vol['Size']}GB {vol['VolumeType']} — {age} days old",
                'owner': tags.get('Owner', 'unknown'),
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    # Unassociated Elastic IPs
    for addr in ec2.describe_addresses()['Addresses']:
        if 'AssociationId' not in addr:
            report['items'].append({
                'type':  'Unassociated Elastic IP',
                'id':    addr.get('AllocationId', addr['PublicIp']),
                'detail': addr['PublicIp'],
                'owner': 'unknown',
                'monthly_cost_usd': 3.60,
            })
            report['monthly_waste_usd'] += 3.60

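    # Old snapshots (paginated; boto3 returns timezone-aware datetimes, so compare them directly)
    cutoff = datetime.now(timezone.utc) - timedelta(days=SNAPSHOT_AGE_DAYS)
    snapshot_pages = ec2.get_paginator('describe_snapshots').paginate(OwnerIds=['self'])
    for snap in (s for page in snapshot_pages for s in page['Snapshots']):
        if snap['StartTime'] < cutoff: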
            cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
            report['items'].append({
                'type':  f'Snapshot ({SNAPSHOT_AGE_DAYS}+ days old)',
                'id':    snap['SnapshotId'],
                'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
                'owner': 'unknown',
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    return report


def post_to_slack(report):
    lines = [
        f":money_with_wings: *Weekly Orphaned Resource Report*",
        f"Found *{len(report['items'])} orphaned resources* "
        f"costing *${report['monthly_waste_usd']:.2f}/month*\n",
    ]
    for item in report['items'][:20]:  # Cap at 20 lines to stay readable
        lines.append(
            f"• `{item['type']}` {item['id']} — {item['detail']} "
            f"— *${item['monthly_cost_usd']:.2f}/mo* — owner: {item['owner']}"
        )
    lines.append("\nReview and delete anything no longer needed.")

    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({'text': '\n'.join(lines)}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)


def lambda_handler(event, context):
    report = find_orphaned_resources()
    post_to_slack(report)
    return {
        'items_found': len(report['items']),
        'monthly_waste': report['monthly_waste_usd'],
    }

3.2 Cost Estimation in Your CI/CD Pipeline

The goal is to catch expensive infrastructure changes at the PR stage — before they deploy and before they generate a billing surprise.

# .github/workflows/cost-check.yml
# Runs on any PR that touches infrastructure files
# Uses Infracost to estimate the monthly cost delta

name: Infrastructure Cost Check

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'infrastructure/**'
      - '*.tf'

jobs:
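  cost-estimate:
    name: Estimate monthly cost change
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # Needed to post the cost comment

    steps:
      # Check out the base branch first to build a cost baseline
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.base.ref }}

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate baseline estimate
        run: |
          infracost breakdown \
            --path terraform/ \
            --format json \
            --out-file /tmp/infracost-base.json

      # Check out the PR branch and diff it against the baseline
      - uses: actions/checkout@v4

      - name: Generate cost diff
        run: |
          infracost diff \
            --path terraform/ \
            --compare-to /tmp/infracost-base.json \
            --format json \
            --out-file /tmp/infracost.json

      - name: Post cost diff to PR
        run: |
          infracost comment github \
            --path /tmp/infracost.json \
            --repo "$GITHUB_REPOSITORY" \
            --pull-request ${{ github.event.pull_request.number }} \
            --github-token ${{ github.token }} \
            --behavior update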

      - name: Block if monthly increase exceeds threshold
        run: |
          MONTHLY_DELTA=$(cat /tmp/infracost.json | \
            jq '.projects[0].diff.totalMonthlyCost' | tr -d '"')

          echo "Estimated monthly cost change: \$$MONTHLY_DELTA"

          # Fail the PR if this change adds more than $500/month
          python3 -c "
          import sys
          delta = float('$MONTHLY_DELTA')
          if delta > 500:
              print(f'PR blocked: estimated +\${delta:.2f}/month exceeds \$500 threshold')
              sys.exit(1)
          else:
              print(f'Cost check passed: estimated +\${delta:.2f}/month')
          "

Stage 4: The Cloud Financial Manager — Months 16 to 24

4.1 Leading FinOps Reviews with Executives

By month 16, you have the data. What changes at Stage 4 is the audience. You're no longer presenting to engineers who understand instance types and NAT Gateway pricing. You're presenting to a CTO who wants to know if the infrastructure investment is proportional to the business value it produces, and a CFO who wants to know when the line will stop going up.

The vocabulary shift is simple but important. You stop saying "we right-sized our EC2 instances" and start saying "we reduced our infrastructure unit cost by 28% while maintaining the same request throughput." You stop saying "we eliminated NAT Gateway charges" and start saying "we closed a $6,400/month gap between what we were paying and what was necessary."

The metric that anchors every executive FinOps conversation is cost per business unit: not the total bill, but cost per API call, cost per user, cost per transaction, or cost per model inference. That ratio tells the story of whether your infrastructure efficiency is improving as the business scales.

# unit_economics.py
# Calculate cost per transaction — the metric that matters to leadership

import boto3
from datetime import datetime, timedelta

def calculate_cost_per_transaction(service_name, transaction_count, days_back=30):
    """
    Returns cost per transaction for a given service over the last N days.
    transaction_count: total transactions for the same period (from your metrics)
    """
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        Filter={
            'Tags': {
                'Key':    'Service',
                'Values': [service_name]
            }
        }
    )

    total_cost = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in response['ResultsByTime']
    )

    cost_per_txn = total_cost / transaction_count if transaction_count > 0 else 0

    return {
        'service':           service_name,
        'period_days':       days_back,
        'total_cost_usd':    round(total_cost, 2),
        'transactions':      transaction_count,
        'cost_per_txn_usd':  round(cost_per_txn, 6),
    }


# Example: payment service processed 4.2M transactions this month
result = calculate_cost_per_transaction('payment-api', 4_200_000)
print(f"Cost per transaction: ${result['cost_per_txn_usd']:.6f}")
print(f"Total infrastructure cost: ${result['total_cost_usd']:,.2f}")

4.2 The Chargeback and Showback Models

Chargeback means actually billing departments for their cloud usage. Showback means showing departments their usage costs without the internal billing transfer. Both create the same outcome: engineers start caring about what they consume because someone they work with is paying attention to it.

# showback_report.py
# Generates monthly cost-by-team report for distribution to engineering leads

import boto3
from datetime import datetime

def generate_team_showback():
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG',       'Key': 'Team'},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
        ]
    )

    by_team = {}
    for group in response['ResultsByTime'][0].get('Groups', []):
        team    = group['Keys'][0].replace('Team$', '') or 'untagged'
        service = group['Keys'][1]
        cost    = float(group['Metrics']['UnblendedCost']['Amount'])

        if team not in by_team:
            by_team[team] = {'total': 0, 'services': {}}
        by_team[team]['total'] += cost
        by_team[team]['services'][service] = round(cost, 2)

    # Print sorted by total cost descending
    print(f"\n{'='*52}")
    print(f"  Month-to-Date Cloud Spend by Team")
    print(f"  Generated: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"{'='*52}\n")

    for team, data in sorted(by_team.items(), key=lambda x: x[1]['total'], reverse=True):
        print(f"  {team:<20} ${data['total']:>10,.2f}/month")
        top_services = sorted(data['services'].items(), key=lambda x: x[1], reverse=True)[:3]
        for svc, cost in top_services:
            print(f"    └─ {svc:<30} ${cost:>8,.2f}")
    print()

generate_team_showback()

Essential Tools and Certifications

The tools that matter at each stage of this roadmap:

Stage	Tool	Why It Matters
1	AWS Cost Explorer	Free, built-in, the starting point for all cost analysis
1	AWS CLI `ce` commands	Scriptable cost queries — dashboards can't be automated
2	AWS Compute Optimizer	ML-powered rightsizing recommendations for EC2 and RDS
2	VPA (Kubernetes)	Pod-level rightsizing recommendations using actual usage
3	Infracost	PR-level cost estimation for Terraform changes
3	AWS Budgets	Proactive alerts — catches problems before the monthly invoice
4	AWS Cost and Usage Report + Athena	SQL-level billing analysis at any granularity
4	CloudHealth or Vantage	Multi-account, multi-cloud cost management

The one certification worth your time: FinOps Certified Practitioner from the FinOps Foundation. It takes 20 hours to prepare and $300 to sit. It signals to hiring managers and clients that you understand the discipline formally — which matters when you're the person leading FinOps conversations at the executive level.

Your 90-Day Action Plan

Month 1 — Foundation:

Enable Cost Explorer if it isn't already on. Pull the baseline command from Section 1.1 and save the output. Run the untagged resource query from Section 1.2 and document how many resources are missing tags. Find your top three cost drivers. Present the findings to your engineering manager — not as a problem, but as an opportunity with a dollar figure attached.

Month 2 — Quick Wins:

Run the rightsizing analyser from Section 2.1 on your EC2 fleet. Downsize the three highest-confidence candidates. Apply S3 lifecycle policies to your two largest buckets. Create VPC endpoints for S3, ECR, and DynamoDB. Estimate the savings from each action and document them against your baseline.

Month 3 — Automation and Habits:

Deploy the orphan reporter Lambda on a Sunday schedule. Add the cost check GitHub Action to your infrastructure repository. Start a monthly FinOps review meeting — even if it's just you and one other engineer. Build the habit before you need the audience.

Best Practices Summary

✅ Do: Establish a cost baseline before any optimisation. The number is meaningless without a comparison point.

✅ Do: Right-size before buying Savings Plans. Always. The sequence changes the outcome.

✅ Do: Enforce tagging at the infrastructure layer — Terraform or CloudFormation — not as a process reminder.

✅ Do: Move staging and development to Spot instances. The interruption rate is manageable, while the 70% cost difference is not.

✅ Do: Add VPC endpoints for S3, ECR, and DynamoDB before reviewing data transfer costs. It's a 30-minute fix for a multi-thousand-dollar line item.

✅ Do: Present cost findings as cost-per-business-metric, not as total bill. "We reduced cost per transaction from $0.0021 to $0.0013" is a business result. "$38,000/month reduction" is an accounting result.

❌ Don't: Buy Savings Plans on an unoptimised baseline. You'll lock in discounted waste.

❌ Don't: Build FinOps dashboards before tagging is complete. Beautiful charts with no attribution data answer no questions.

❌ Don't: Run orphaned resource cleanup without human review first. Run in report-only mode for two weeks, verify the candidates are genuinely orphaned, then add deletion logic.

Resources

FinOps Foundation Framework — The practitioner framework that defines the Inform, Optimise, and Operate cycle this roadmap is built on
AWS Cost Explorer API Reference — Full reference for the cost query commands used throughout this guide
AWS Compute Optimizer — AWS's own rightsizing recommendation service; complements the manual analysis in Stage 2
Infracost Documentation — Setup guide for the PR-level cost estimation tool in Stage 3
FinOps Certified Practitioner Exam — The certification referenced in the tools section
AWS Savings Plans Documentation — The authoritative reference on commitment types, coverage rules, and purchase strategy
Companion Repository — All scripts from this guide, including the rightsizing analyser, orphan reporter, and showback report generator

Ayobami Adejumo is a senior platform engineer and FinOps consultant. He has audited AWS infrastructure for 20+ Series A and Series B companies. He is an active FinOps Foundation Supporter.

The AWS FinOps Guide for Series A Startups: The 8 Cost Patterns That Appear After Product-Market Fit

Ayobami Adejumo — Tue, 02 Jun 2026 16:27:27 +0000

You raised your Series A. Engineering hired fast. Features shipped faster. And somewhere between month six and month twelve, someone forwarded you an AWS Cost Explorer screenshot with a line that only goes up.

That line isn't random. It follows a pattern. The same eight patterns, at the same growth stage, at almost every company I've audited.

This guide names all eight, shows you exactly where to look, and gives you the fix for each one. By the time you finish reading, you'll know which leaks are draining your runway — and what to do about them this week.

Who This Guide Is For
Before You Start: Establish Your Baseline
Pattern 1: The New Hire Experiment Tax
Pattern 2: Staging Environment Proliferation
Pattern 3: The NAT Gateway Tax
Pattern 4: The Savings Plan Timing Mistake
Pattern 5: Cross-AZ Data Transfer
Pattern 6: The gp2 Volume Trap
Pattern 7: The Infinite Log Trap
Pattern 8: The Orphaned Resource Collector
The Full Savings Summary
What to Do This Week
Resources

Who This Guide Is For

This guide is written for engineers, CTOs, and technical co-founders at Series A companies — typically 15 to 80 engineers, AWS bills between $20,000 and $150,000 per month, and a finance team that has recently started paying attention to the infrastructure line.

You don't need a dedicated FinOps team. You need one engineer, one afternoon per week, and the eight patterns in this guide.

What you should have before starting:

AWS account access with Cost Explorer enabled
AWS CLI v2 configured (aws configure)
Basic familiarity with EC2, RDS, EBS, and S3
A Cost Explorer bookmark — you will use it constantly

Estimated time to complete all fixes: 8–20 engineering hours spread across two sprints. The reading takes around 20 minutes. The highest-ROI fix (Pattern 3) takes about 30 minutes.

Before You Start: Establish Your Baseline

Don't skip this step. Optimisation without a baseline is just guessing. Run this command before touching anything:

# Pull last month's AWS cost breakdown by service
# This becomes your before number — save it somewhere
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].{Service:Keys[0],Cost:Metrics.UnblendedCost.Amount}' \
  --output table | sort -k3 -rn

Then screenshot the output. Name the file aws-baseline-YYYY-MM.png. You'll compare against this after each fix to verify actual savings.
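
A screenshot is hard to diff, so it can help to keep the per-service numbers as data too. The sketch below assumes you have copied the baseline and the post-fix figures into two plain dictionaries (service name to monthly USD). The function name `compare_baselines` and the sample figures are hypothetical, chosen only for illustration.

```python
def compare_baselines(before, after):
    """Return per-service savings (before minus after), largest first.

    Both arguments map an AWS service name to its monthly cost in USD.
    A service missing from one side is treated as costing 0 there.
    """
    services = set(before) | set(after)
    deltas = {
        name: round(before.get(name, 0.0) - after.get(name, 0.0), 2)
        for name in services
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical figures, not taken from a real account
    baseline = {"EC2": 48000.0, "Data Transfer": 17000.0, "EBS": 9000.0}
    current = {"EC2": 45500.0, "Data Transfer": 11000.0, "EBS": 7200.0}
    for service, saved in compare_baselines(baseline, current):
        print(f"{service:<15} saved ${saved:,.2f}/month")
```

A negative value means the service costs more than it did at baseline, which is worth checking before you credit a fix with the overall saving.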

The typical breakdown at Series A looks like this:

AWS Service	% of Bill	Waste Potential
EC2 (compute)	45–55%	High
Data Transfer	15–20%	Very High
RDS	10–15%	Medium
EBS	8–12%	Medium
CloudWatch	3–6%	Medium
Load Balancers	3–5%	Low

Now let's go through each pattern.

Pattern 1: The New Hire Experiment Tax

Every engineering hire needs a development environment. This is expected. What's not expected is what happens after the feature ships: nothing.

The environment keeps running. At $0.192/hour for an m5.xlarge, a forgotten dev environment costs $138/month. Ten engineers who each forgot one environment is $1,380/month — for infrastructure that's doing precisely nothing.
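
The arithmetic behind those figures is simple enough to script for your own instance types. This is a minimal sketch that assumes a 720-hour month (30 days), which is the approximation the numbers above imply. The function name `monthly_cost` is illustrative.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the approximation used in the text


def monthly_cost(hourly_rate_usd, instance_count=1):
    """Monthly On-Demand cost of instances left running around the clock."""
    return round(hourly_rate_usd * HOURS_PER_MONTH * instance_count, 2)


if __name__ == "__main__":
    print(f"One forgotten m5.xlarge: ${monthly_cost(0.192):,.2f}/month")
    print(f"Ten forgotten m5.xlarge: ${monthly_cost(0.192, 10):,.2f}/month")
```

Swap in the hourly rate for whatever instance type your team defaults to. The rate varies by region and is an input here, not something the sketch looks up.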

This pattern accelerates after a Series A because hiring moves fast. A new engineer joins on Monday, spins up an EC2, an RDS, and a namespace in the dev cluster, ships the feature by Friday, and moves to the next ticket. The environment isn't on anyone's radar. There's no off-boarding process for dev resources.

What the waste looks like:

Dev environment for Alice (feature/payment-flow):
  EC2 m5.xlarge — last CPU activity: 23 days ago
  RDS db.t3.medium — last connection: 19 days ago
  EKS namespace — last pod scheduled: 15 days ago
  Monthly cost: $187
  Status: running

Finding it:

# Get the 14-day average CPU for one instance (replace YOUR_INSTANCE_ID)
# An average below 5% marks it as idle: a candidate for shutdown or termination
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --period 1209600 \
  --statistics Average \
  --start-time $(date -d '14 days ago' --iso-8601=seconds) \
  --end-time $(date --iso-8601=seconds) \
  --dimensions Name=InstanceId,Value=YOUR_INSTANCE_ID \
  --query 'Datapoints[*].{Average:Average}' \
  --output table

The Fix — an Automatic Idle Instance Stopper:

The Lambda below runs every night at 22:00 (EventBridge cron schedules are evaluated in UTC). It checks every running EC2 instance tagged Environment=dev or Environment=development for CPU utilisation over the past seven days. Any instance averaging below 5% gets stopped automatically. An SNS notification naming the instance and its Owner tag is published just before the stop, and it includes the restart command. Engineers can exempt an instance from future runs by adding a KeepAlive=true tag.

# idle_environment_stopper.py
# Deploy as a Lambda function triggered by EventBridge on schedule: cron(0 22 * * ? *)
# This stops idle dev environments before they run through the night and weekend

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')

IDLE_CPU_THRESHOLD = 5.0      # Stop instances below this average CPU %
IDLE_DAYS = 7                  # Look back 7 days of CloudWatch data
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT:dev-environment-alerts'

def get_average_cpu(instance_id):
    """Return the 7-day average CPU utilisation for an EC2 instance."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=now - timedelta(days=IDLE_DAYS),
        EndTime=now,
        Period=3600,  # Hourly datapoints, averaged below
        Statistics=['Average']
    )
    datapoints = response.get('Datapoints', [])
    if not datapoints:
        # No metrics yet (for example, a just-launched instance): report it
        # as busy so the caller never stops an instance it has no data for
        return 100.0
    return sum(dp['Average'] for dp in datapoints) / len(datapoints)

def lambda_handler(event, context):
    """Stop idle dev instances and notify their owners."""
    
    # Find all running dev instances
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag:Environment', 'Values': ['dev', 'development']},
        ]
    )

    stopped = []
    skipped = []

    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}

            # Skip instances explicitly marked to keep alive
            if tags.get('KeepAlive', '').lower() == 'true':
                skipped.append(instance_id)
                continue

            avg_cpu = get_average_cpu(instance_id)

            if avg_cpu < IDLE_CPU_THRESHOLD:
                # Notify the owner before stopping
                owner = tags.get('Owner', 'unknown')
                sns.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Subject=f'Dev environment stopped: {instance_id}',
                    Message=(
                        f'Instance {instance_id} (Owner: {owner}) had {avg_cpu:.1f}% average CPU '
                        f'over {IDLE_DAYS} days and has been stopped.\n\n'
                        f'To prevent this, add the tag: KeepAlive=true\n'
                        f'To restart: aws ec2 start-instances --instance-ids {instance_id}'
                    )
                )
                ec2.stop_instances(InstanceIds=[instance_id])
                stopped.append({'id': instance_id, 'owner': owner, 'avg_cpu': avg_cpu})

    print(f"Stopped {len(stopped)} idle instances. Skipped {len(skipped)} keep-alive instances.")
    return {'stopped': stopped, 'skipped': skipped}

Monthly savings: $1,000–$2,000 depending on team size and how long the pattern has been running.

Pattern 2: Staging Environment Proliferation

Staging starts as one environment. Then the frontend team needs their own because the backend team keeps breaking theirs. Then the ML team needs isolated compute. Then QA needs a stable environment for integration tests.

Before anyone notices, you have four staging environments running 24/7, each one idle for roughly 16 hours of every day.

The waste isn't in the existence of the environments. It's in the schedule. Staging environments don't need to run at 3am.

What the waste looks like:

staging-frontend:   $250/month   Used: Mon-Fri 09:00-18:00
staging-backend:    $250/month   Used: Mon-Fri 09:00-18:00
staging-ml:         $250/month   Used: Mon-Fri 10:00-17:00
staging-qa:         $250/month   Used: Mon-Fri 09:00-17:00
Total:            $1,000/month   Running: 24 hours/day, 7 days/week
Actual usage:        ~35%        You are paying 100%

Finding it:

# Find EKS node groups tagged as staging with their current status
aws eks list-nodegroups --cluster-name your-cluster-name --output table

# Check EC2 instances tagged staging and their launch time
# Any instance running > 30 days with no weekend stop schedule is a candidate
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=staging" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,Launch:LaunchTime}' \
  --output table

The Fix — Scheduled Start and Stop with AWS Instance Scheduler:

# Option 1: Tag-based scheduling with AWS Instance Scheduler (CloudFormation solution)
# Add these tags to your staging EC2 instances and RDS clusters:
# Schedule: office-hours
# This starts instances at 08:00 and stops them at 20:00 Mon-Fri
# Weekend: completely off

# Option 2: Quick Lambda-based solution — stop all staging at 20:00 weekdays
aws events put-rule \
  --schedule-expression "cron(0 20 ? * MON-FRI *)" \
  --name stop-staging-environments \
  --state ENABLED

# The stop Lambda — same pattern as Pattern 1 but targets staging tag
# Add a corresponding start rule at 07:30 Mon-Fri

Consolidation in Addition to Scheduling

If frontend and backend share a database schema, consolidate them into one shared staging environment with namespace-level isolation. The combined cost is lower than two separate environments:

# One shared staging cluster with namespace isolation
# frontend-staging and backend-staging share nodes via Karpenter
# but are isolated by namespace-level network policies
apiVersion: v1
kind: Namespace
metadata:
  name: staging-frontend
  labels:
    environment: staging
    team: frontend
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging-backend
  labels:
    environment: staging
    team: backend

The math:

Scenario	Monthly cost
Before: 4 environments, always on	$1,000
After: 2 consolidated environments, office hours only	$290
Monthly savings	$710

Pattern 3: The NAT Gateway Tax

NAT Gateway is the most consistently underestimated line item on every AWS bill I've audited. It charges $0.045 per GB of data processed — and in EKS clusters, a staggering amount of traffic flows through it by default.

Every pod that pulls a container image from ECR goes through NAT Gateway. Every VPC-attached Lambda that writes to S3 goes through NAT Gateway. Every service in a private subnet that polls SQS, queries DynamoDB, or calls the Secrets Manager API goes through NAT Gateway, unless you have configured VPC endpoints.

VPC endpoints create a private connection between your VPC and the AWS service. Traffic routes through the AWS backbone instead of NAT Gateway. Gateway endpoints (S3 and DynamoDB) carry no hourly or per-GB charge. Interface endpoints (ECR and most other services) are billed per hour and per GB, at a per-GB rate well below NAT Gateway's $0.045.

What the waste looks like:

# Run this to see your current NAT Gateway data processing bill
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["NatGateway-Bytes", "NatGateway-Hours"]
    }
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

If this number is above $200, you have a NAT Gateway problem. At most Series A companies running EKS, it is between $800 and $6,000.

The Fix — VPC Endpoints for the Four Highest-traffic AWS Services:

# Get your VPC ID and route table ID first
VPC_ID=$(aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=your-vpc-name" \
  --query 'Vpcs[0].VpcId' --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=$VPC_ID" "Name=association.main,Values=true" \
  --query 'RouteTables[0].RouteTableId' --output text)

# S3 gateway endpoint — free to create, eliminates all S3 NAT charges
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids $ROUTE_TABLE_ID

# DynamoDB gateway endpoint — also free
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids $ROUTE_TABLE_ID

# ECR API endpoint — eliminates NAT charges on every container pull
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Tier,Values=private" \
    --query 'Subnets[*].SubnetId' --output text)

# ECR Docker endpoint — required alongside ECR API for image pulls
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Tier,Values=private" \
    --query 'Subnets[*].SubnetId' --output text)

When explaining this to your CFO, call it the NAT tax. They understand taxes. "We're paying a $0.045/GB tax on internal network traffic that we can eliminate in 30 minutes" lands better than "data processing bytes."

Monthly savings: $2,000–$8,000 depending on your container pull frequency and S3 usage.
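
To estimate your own number rather than rely on a range, you can compare the NAT data processing charge against what the endpoints would cost. This is a rough sketch: the $0.045/GB NAT rate comes from the text above, while the interface endpoint rates ($0.01 per GB and $0.01 per endpoint per AZ per hour) are assumptions based on us-east-1 list prices and should be checked against the current pricing page. Gateway endpoints are modelled as free.

```python
NAT_PER_GB = 0.045          # From the text: NAT Gateway data processing, USD/GB
INTERFACE_PER_GB = 0.01     # Assumption: interface endpoint data processing, USD/GB
INTERFACE_PER_HOUR = 0.01   # Assumption: per interface endpoint, per AZ, USD/hour
HOURS_PER_MONTH = 720


def nat_processing_cost(gb_per_month):
    """NAT Gateway data processing charge for the given monthly volume."""
    return round(gb_per_month * NAT_PER_GB, 2)


def interface_endpoint_cost(gb_per_month, endpoints=1, azs=1):
    """Hourly fees plus data processing for a set of interface endpoints."""
    fixed = endpoints * azs * INTERFACE_PER_HOUR * HOURS_PER_MONTH
    return round(fixed + gb_per_month * INTERFACE_PER_GB, 2)


def monthly_saving(gateway_gb, interface_gb, endpoints, azs):
    """Saving from moving traffic off NAT.

    gateway_gb:   monthly GB that would use free gateway endpoints (S3, DynamoDB)
    interface_gb: monthly GB that would use interface endpoints
    """
    before = nat_processing_cost(gateway_gb + interface_gb)
    after = interface_endpoint_cost(interface_gb, endpoints, azs)
    return round(before - after, 2)


if __name__ == "__main__":
    # Hypothetical traffic: 40 TB to S3, 10 TB via two interface endpoints in 3 AZs
    print(f"Estimated saving: ${monthly_saving(40_000, 10_000, 2, 3):,.2f}/month")
```

The NAT Gateway hourly charge is left out on purpose, since you usually keep the gateway for outbound internet traffic.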

Pattern 4: The Savings Plan Timing Mistake

A Savings Plan is a commitment to spend a fixed dollar amount per hour on AWS compute for one or three years in exchange for a 30–70% discount. The math is attractive. The timing is where teams go wrong.

When the bill gets large, the instinct is to commit. Buy the Savings Plan, reduce the bill, show the CFO. The problem: if you haven't rightsized first, you're committing to pay for waste at a discount. When you rightsize later, your actual spend drops below your commitment — and you pay for compute you're not using.

What wrong order looks like:

Step 1: AWS bill is $100,000/month
Step 2: Buy $70,000/month Savings Plan commitment
Step 3: Rightsize instances — actual spend drops to $60,000
Step 4: Savings Plan covers $70,000 but you only use $60,000
Step 5: You pay $28,000/month for compute you do not use
         (Savings Plan discount applied to the overage)
         
Net result: You locked in waste for 12 months

What right order looks like:

Step 1: Rightsize instances — spend drops from $100,000 to $60,000
Step 2: Add Spot for staging — spend drops from $60,000 to $45,000
Step 3: Migrate compatible workloads to Graviton — spend drops to $36,000
Step 4: NOW buy a Savings Plan covering $25,000/month (70% of steady-state)
Step 5: Effective monthly cost: $12,500 for committed + $11,000 on-demand = $23,500

Net result: $76,500/month saved versus the original bill
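
The right-order arithmetic can be reproduced in a few lines, which makes it easy to rerun with your own figures. The sketch below is a simplification: it treats coverage as an On-Demand-equivalent monthly amount and applies one flat discount. The 50% discount is inferred from Step 5 ($25,000 of coverage costing $12,500), not a quoted AWS rate, and the function name is illustrative.

```python
def savings_plan_outcome(steady_state_usd, covered_usd, discount):
    """Monthly bill for a given level of Savings Plan coverage.

    steady_state_usd: monthly compute usage priced at On-Demand rates
    covered_usd:      On-Demand-equivalent usage the plan is sized to cover
    discount:         plan discount as a fraction (0.5 means 50% off)
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    committed = covered_usd * (1 - discount)            # Owed whether used or not
    on_demand = max(0.0, steady_state_usd - covered_usd)
    unused = max(0.0, covered_usd - steady_state_usd)   # Coverage with nothing to apply to
    return {
        "committed_usd": round(committed, 2),
        "on_demand_usd": round(on_demand, 2),
        "unused_coverage_usd": round(unused, 2),
        "total_usd": round(committed + on_demand, 2),
    }


if __name__ == "__main__":
    right_order = savings_plan_outcome(36_000, 25_000, 0.5)
    print(f"Effective monthly cost: ${right_order['total_usd']:,.2f}")
    print(f"Saved versus $100,000: ${100_000 - right_order['total_usd']:,.2f}")
```

A non-zero `unused_coverage_usd` is the signal that a commitment was sized against a baseline you later optimised away.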

How to check what you should commit to:

# View your last 30 days of EC2 On-Demand spend
# This is your rightsized baseline — what you actually use after optimisation
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On-Demand"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own Savings Plan recommendation based on your usage
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

As a rule, commit to 60–70% of your steady-state On-Demand spend after optimisation. Leave 30–40% flexible. Never commit on the unoptimised baseline.

Monthly savings: $5,000–$15,000 depending on compute spend. This is the pattern with the highest single-action ROI when sequenced correctly.

Pattern 5: Cross-AZ Data Transfer

AWS charges $0.01 per GB in each direction when data crosses an Availability Zone boundary. $0.01 sounds negligible. It's not — because AZ boundaries are crossed constantly in distributed systems, and the charge is bidirectional.

The most common scenario: your application pods are scheduled across multiple AZs (as they should be for resilience), but your database is pinned to one AZ. Every database query from a pod in a different AZ costs $0.01/GB going to the database and $0.01/GB coming back. At 100GB of database traffic per day, that's $60/month. At 1TB per day, it is $600/month.
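
A small helper makes it easy to plug in your own traffic volume. This sketch uses the $0.01/GB rate in each direction and a 30-day month, matching the examples above. The function name is illustrative.

```python
CROSS_AZ_PER_GB_EACH_WAY = 0.01  # From the text: USD per GB, charged in both directions


def cross_az_monthly_cost(gb_per_day, days=30):
    """Monthly cross-AZ charge for traffic billed once in each direction."""
    return round(gb_per_day * days * CROSS_AZ_PER_GB_EACH_WAY * 2, 2)


if __name__ == "__main__":
    for daily_gb in (100, 1000):
        print(f"{daily_gb:>5} GB/day -> ${cross_az_monthly_cost(daily_gb):,.2f}/month")
```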

What the waste looks like:

# Check current cross-AZ data transfer charges
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --filter '{"Dimensions": {"Key": "USAGE_TYPE", "Values": ["DataTransfer-Regional-Bytes"]}}'  \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

How to find which pods are causing the cross-AZ traffic:

# Check which AZ your database RDS instance is in
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,AZ:AvailabilityZone}' \
  --output table

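# Count application pods per node (column 7 of the wide output is NODE, not the AZ)
kubectl get pods -o wide -n production --no-headers | awk '{print $7}' | sort | uniq -c

# Map each node to its AZ using the standard zone label
kubectl get nodes -L topology.kubernetes.io/zone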

If your RDS is in us-east-1a and 60% of your pods are in us-east-1b and us-east-1c, you have a cross-AZ traffic problem.
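
Once you have pod counts per AZ, the share of pods that sit outside the database's AZ is a one-line calculation. The sketch below assumes you have tallied the counts by hand into a dictionary. The function name and the sample counts are hypothetical.

```python
def cross_az_pod_share(pods_by_az, database_az):
    """Fraction of pods running in a different AZ from the database."""
    total = sum(pods_by_az.values())
    if total == 0:
        return 0.0
    remote = total - pods_by_az.get(database_az, 0)
    return remote / total


if __name__ == "__main__":
    # Hypothetical distribution for illustration
    counts = {"us-east-1a": 8, "us-east-1b": 6, "us-east-1c": 6}
    share = cross_az_pod_share(counts, "us-east-1a")
    print(f"{share:.0%} of pods query the database across an AZ boundary")
```

This treats every pod as generating the same database traffic, so it is a first approximation rather than a bill forecast.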

The Fix — Topology-aware Routing:

# topology-aware-routing.yaml
# This tells Kubernetes to prefer routing Service traffic to endpoints
# in the same AZ as the calling pod, keeping traffic local

apiVersion: v1
kind: Service
metadata:
  name: payment-api
  namespace: production
  annotations:
    # Route traffic to pods in the same AZ as the caller when possible
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: payment-api
  ports:
  - port: 8080
    targetPort: 8080

# For pods themselves — spread across AZs but prefer local
# topologySpreadConstraints ensures even distribution
# while topology-aware routing keeps traffic within AZs

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payment-api

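For database traffic specifically, consider migrating from single-AZ RDS to Aurora. Your application connects to one cluster endpoint, and replication between Aurora instances in different AZs is not billed as data transfer. Traffic between an application pod and a database instance in a different AZ is still charged at the regional rate, so compare the Cost Explorer figure before and after the change.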

Monthly savings: $500–$6,000 depending on database query volume and AZ distribution of your pods.

Pattern 6: The gp2 Volume Trap

In 2014, AWS launched gp2 EBS volumes. In 2020, they launched gp3 — cheaper, faster, and with better baseline performance. In 2026, most Series A companies are still running gp2.

The difference: gp2 costs $0.10/GB/month and provides 3 IOPS per GB (100 IOPS minimum). gp3 costs $0.08/GB/month and provides 3,000 IOPS baseline regardless of size. gp3 is 20% cheaper and 10x faster on IOPS for most volume sizes. The migration is online — it runs while the volume is attached and in use.
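
The comparison depends on volume size, so it is worth computing per volume before migrating. The sketch below uses the prices and IOPS rules quoted above, plus one assumption not stated in the text: that gp2 baseline IOPS cap at 16,000. Volumes larger than 1,000 GB already exceed 3,000 baseline IOPS on gp2, so they may need extra provisioned IOPS on gp3, which is billed separately.

```python
GP2_PER_GB = 0.10          # From the text: USD per GB-month
GP3_PER_GB = 0.08          # From the text: USD per GB-month
GP3_BASELINE_IOPS = 3000   # From the text: flat baseline regardless of size
GP2_MAX_IOPS = 16000       # Assumption: documented gp2 ceiling


def gp2_baseline_iops(size_gb):
    """gp2 baseline: 3 IOPS per GB, with a floor of 100 and a ceiling."""
    return min(max(100, 3 * size_gb), GP2_MAX_IOPS)


def compare_volume(size_gb):
    """Side-by-side monthly cost and baseline IOPS for one volume."""
    gp2_iops = gp2_baseline_iops(size_gb)
    return {
        "gp2_monthly_usd": round(size_gb * GP2_PER_GB, 2),
        "gp3_monthly_usd": round(size_gb * GP3_PER_GB, 2),
        "gp2_baseline_iops": gp2_iops,
        "gp3_baseline_iops": GP3_BASELINE_IOPS,
        # True when a plain gp3 conversion would lower baseline IOPS
        "needs_extra_gp3_iops": gp2_iops > GP3_BASELINE_IOPS,
    }


if __name__ == "__main__":
    for size in (100, 500, 2000):
        print(size, compare_volume(size))
```

Any volume flagged with `needs_extra_gp3_iops` deserves a look at its actual IOPS usage in CloudWatch before you convert it.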

Finding all your gp2 volumes:

# List every gp2 volume in your account with its size in GB
# JMESPath cannot multiply, so estimate monthly cost yourself: SizeGB x $0.10
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].{
    ID:VolumeId,
    SizeGB:Size,
    State:State
  }' \
  --output table

# Count the total: number of volumes and combined GB
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'length(Volumes)' --output text

aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'sum(Volumes[*].Size)' --output text

The Fix — Migrate All gp2 to gp3 in One Script:

#!/bin/bash
# migrate_gp2_to_gp3.sh
# Migrates all gp2 volumes to gp3. Online operation — no downtime.
# Each modification runs asynchronously; the volume stays available throughout.

echo "Starting gp2 to gp3 migration..."

# Get all gp2 volume IDs
VOLUMES=$(aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text)

COUNT=0
for VOL_ID in $VOLUMES; do
  echo "Migrating $VOL_ID to gp3..."
  aws ec2 modify-volume \
    --volume-id $VOL_ID \
    --volume-type gp3 \
    --no-cli-pager
  COUNT=$((COUNT + 1))
done

echo "Migration initiated for $COUNT volumes."
echo "Modifications run online — no downtime. Monitor progress:"
echo "aws ec2 describe-volumes-modifications --query 'VolumesModifications[*].{ID:VolumeId,State:ModificationState}'"

Verify completion:

# Check that no gp2 volumes remain
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'length(Volumes)' \
  --output text
# Expected: 0

Monthly savings: 20% of your gp2 storage spend. At $10,000/month in gp2 volumes, that's $2,000 saved for 30 minutes of work.

Pattern 7: The Infinite Log Trap

CloudWatch log groups have a default retention policy of "Never expire." Every log group created without an explicit retention setting accumulates logs indefinitely. For a busy Series A company, this means you're storing debug logs from 2022 that nobody has opened since the sprint review they were created for.

The cost compounds quietly. CloudWatch charges $0.03/GB/month for log storage and $0.50/GB for log ingestion. A cluster generating 50GB of logs per day ingests $25/day — $750/month — and then stores those logs forever at an increasing monthly cost.
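
The two charges behave differently over time: ingestion is flat, while storage grows every month until a retention policy caps it. The sketch below uses the rates quoted above and assumes stored size equals ingested size. That is an upper bound, because CloudWatch compresses stored logs. The function names are illustrative.

```python
INGEST_PER_GB = 0.50          # From the text: USD per GB ingested
STORAGE_PER_GB_MONTH = 0.03   # From the text: USD per GB-month stored


def monthly_ingestion_cost(gb_per_day, days=30):
    """Flat monthly charge for ingesting logs."""
    return round(gb_per_day * days * INGEST_PER_GB, 2)


def monthly_storage_cost(gb_per_day, months_elapsed, retention_days=None, days=30):
    """Storage bill for one month, after logs have accumulated for months_elapsed.

    With retention_days=None nothing expires, so storage keeps growing.
    """
    accumulated_days = months_elapsed * days
    if retention_days is not None:
        accumulated_days = min(accumulated_days, retention_days)
    return round(gb_per_day * accumulated_days * STORAGE_PER_GB_MONTH, 2)


if __name__ == "__main__":
    print(f"Ingestion: ${monthly_ingestion_cost(50):,.2f}/month")
    print(f"Storage after 24 months, no expiry: ${monthly_storage_cost(50, 24):,.2f}/month")
    print(f"Storage with 30-day retention: ${monthly_storage_cost(50, 24, 30):,.2f}/month")
```

Retention only caps the storage line. Cutting the ingestion charge means logging less in the first place.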

Finding log groups with no retention policy:

# List all log groups with their retention settings
# Any group showing "retentionInDays: null" is infinite — it never expires
aws logs describe-log-groups \
  --query 'logGroups[*].{Name:logGroupName,RetentionDays:retentionInDays,StoredBytes:storedBytes}' \
  --output table | grep -E "(None|null)"

# Count how many log groups have no retention set
aws logs describe-log-groups \
  --query 'length(logGroups[?retentionInDays==`null`])' \
  --output text

The Fix — Set Retention Policies in Bulk:

Different log types have different compliance requirements. Debug logs don't need to be kept. Audit logs might need 365 days. The table below gives sensible defaults:

Log Type	Recommended Retention	Reason
Application debug logs	14 days	Only useful for active debugging
Application error logs	90 days	Post-incident investigation window
Access logs	30 days	Security review window
CloudTrail audit logs	365 days	SOC2 evidence requirement
VPC Flow Logs	90 days	Security investigation window

#!/bin/bash
# set_log_retention.sh
# Sets 30-day retention on all log groups that have no policy set
# (365 days for CloudTrail log groups)
# Adjust the retention period per log group type as needed

echo "Setting retention policies on log groups with no expiry..."

# Get all log groups with no retention
aws logs describe-log-groups \
  --query 'logGroups[?retentionInDays==`null`].logGroupName' \
  --output text | tr '\t' '\n' | while read -r LOG_GROUP; do

  # CloudTrail logs need longer retention for SOC2, so give them 365 days
  if echo "$LOG_GROUP" | grep -qi "cloudtrail"; then
    echo "Setting 365-day retention on CloudTrail log group: $LOG_GROUP"
    aws logs put-retention-policy \
      --log-group-name "$LOG_GROUP" \
      --retention-in-days 365
    continue
  fi

  # Set 30-day retention on all other log groups
  echo "Setting 30-day retention on: $LOG_GROUP"
  aws logs put-retention-policy \
    --log-group-name "$LOG_GROUP" \
    --retention-in-days 30
done

echo "Done. Logs older than their retention period will be deleted automatically by CloudWatch."
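The script applies one period to almost everything. If your log group names follow a convention, the retention table can be encoded as a lookup instead. This is a minimal sketch: the keyword matching is a hypothetical naming convention, not something CloudWatch enforces, and the function name is illustrative.

```python
# Keyword -> retention days, taken from the table above.
# Matching on substrings of the log group name is an assumed convention.
RETENTION_RULES = [
    ("cloudtrail", 365),
    ("flow", 90),
    ("error", 90),
    ("access", 30),
    ("debug", 14),
]
DEFAULT_RETENTION_DAYS = 30


def retention_days_for(log_group_name: str) -> int:
    """Pick a retention period for a log group based on its name."""
    name = log_group_name.lower()
    for keyword, days in RETENTION_RULES:
        if keyword in name:
            return days
    return DEFAULT_RETENTION_DAYS


if __name__ == "__main__":
    for group in ("/aws/cloudtrail/management", "/app/api/debug", "/aws/lambda/billing"):
        print(group, retention_days_for(group))
```

The first matching rule wins, so the longer compliance-driven periods are listed first.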

Monthly savings: $500–$2,000 on storage costs. The ingestion cost reduction kicks in immediately when noisy debug logging is reduced. The storage cost reduction compounds over 30–90 days as old logs expire.

Pattern 8: The Orphaned Resource Collector

Every departed engineer leaves a trail. An EBS volume attached to a terminated instance. An Elastic IP allocated but not associated. A load balancer fronting a service that was deprecated in Q3. Old snapshots from an RDS instance that was replaced. None of these are intentional, but all of them are billed.

The fix is a weekly audit. Not a manual investigation — an automated script that runs every Sunday night, finds orphaned resources, and sends a Slack message with a list of candidates for deletion.

Finding the orphans:

# Unattached EBS volumes: you are paying for storage that nothing is using
# Multiply Size (GiB) by the per-GB price of the volume type to estimate monthly cost
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{
    ID:VolumeId,
    SizeGiB:Size,
    Type:VolumeType,
    Created:CreateTime
  }' \
  --output table

# Unassociated Elastic IPs — $3.60/month each when not attached to a running instance
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
  --output table

# Old snapshots — created more than 90 days ago, no longer needed
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601=seconds)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table

# Load balancers: this lists every load balancer; confirm which ones are idle by checking RequestCount in CloudWatch
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[*].{ARN:LoadBalancerArn,DNS:DNSName,State:State.Code}' \
  --output table
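Listing load balancers does not show which ones are idle; that needs traffic data. The sketch below is one way to check, assuming Application Load Balancers (which publish RequestCount in the AWS/ApplicationELB namespace) and a 14-day lookback chosen for illustration. The helper names are hypothetical, and boto3 is imported lazily so the pure helpers can be used without AWS access.

```python
from datetime import datetime, timedelta, timezone


def alb_dimension_from_arn(load_balancer_arn: str) -> str:
    """Extract the CloudWatch LoadBalancer dimension value from an ALB ARN."""
    # arn:aws:elasticloadbalancing:...:loadbalancer/app/name/id -> app/name/id
    return load_balancer_arn.split(":loadbalancer/", 1)[1]


def is_idle(datapoints: list) -> bool:
    """True when no datapoint in the window reports any requests."""
    return sum(dp.get("Sum", 0) for dp in datapoints) == 0


def request_datapoints(load_balancer_arn: str, days: int = 14) -> list:
    """Fetch daily RequestCount sums for an ALB over the lookback window."""
    import boto3  # lazy import: only needed when actually calling AWS

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName="RequestCount",
        Dimensions=[
            {"Name": "LoadBalancer", "Value": alb_dimension_from_arn(load_balancer_arn)}
        ],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    return response["Datapoints"]
```

A load balancer for which is_idle(request_datapoints(arn)) is true is a candidate for review, not automatic deletion.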

The weekly reporting Lambda:

# orphan_resource_reporter.py
# Runs every Sunday at 20:00 via EventBridge
# Reports orphaned resources to Slack — does NOT auto-delete
# Deletion requires a human decision. The Lambda surfaces the candidates.

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

def get_orphaned_resources():
    """Collect all orphaned AWS resources and their estimated monthly costs."""
    ec2 = boto3.client('ec2')
    elbv2 = boto3.client('elbv2')
    report = {'total_monthly_waste': 0, 'resources': []}

    # Unattached EBS volumes ($0.08/GB/month for gp3)
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']
    for vol in volumes:
        monthly_cost = round(vol['Size'] * 0.08, 2)
        report['resources'].append({
            'type': 'Unattached EBS Volume',
            'id': vol['VolumeId'],
            'detail': f"{vol['Size']}GB {vol['VolumeType']}",
            'monthly_cost': monthly_cost
        })
        report['total_monthly_waste'] += monthly_cost

    # Unassociated Elastic IPs ($3.60/month each)
    addresses = ec2.describe_addresses()['Addresses']
    for addr in addresses:
        if 'AssociationId' not in addr:
            report['resources'].append({
                'type': 'Unassociated Elastic IP',
                'id': addr['AllocationId'],
                'detail': addr['PublicIp'],
                'monthly_cost': 3.60
            })
            report['total_monthly_waste'] += 3.60

    # Snapshots older than 90 days
    # Compare datetimes directly; boto3 returns StartTime as a timezone-aware UTC datetime
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    snapshots = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    old_snapshots = [s for s in snapshots if s['StartTime'] < cutoff]
    for snap in old_snapshots:
        monthly_cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
        report['resources'].append({
            'type': 'Old Snapshot (90+ days)',
            'id': snap['SnapshotId'],
            'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
            'monthly_cost': monthly_cost
        })
        report['total_monthly_waste'] += monthly_cost

    return report

def post_to_slack(report):
    """Send the orphaned resource report to Slack."""
    resource_lines = '\n'.join([
        f"• {r['type']} `{r['id']}` — {r['detail']} — *${r['monthly_cost']}/month*"
        for r in report['resources']
    ])

    message = {
        'text': (
            f":money_with_wings: *Weekly Orphaned Resource Report*\n\n"
            f"Found *{len(report['resources'])} orphaned resources* "
            f"costing *${report['total_monthly_waste']:.2f}/month*\n\n"
            f"{resource_lines}\n\n"
            f"Review and delete resources that are no longer needed."
        )
    }
    
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)

def lambda_handler(event, context):
    report = get_orphaned_resources()
    post_to_slack(report)
    return {
        'resources_found': len(report['resources']),
        'monthly_waste': report['total_monthly_waste']
    }

Monthly savings: $500–$2,000. Every departed engineer typically leaves $50–$200 in orphaned resources. At a team of 30 with 30% annual turnover, that compounds quickly.

The Full Savings Summary

Pattern	Monthly Saving	Time to Fix	Difficulty
1. New hire experiment tax	$1,000–$2,000	2 hours (Lambda)	Medium
2. Staging proliferation	$600–$800	3 hours (scheduling)	Low
3. NAT Gateway tax	$2,000–$8,000	30 minutes	Low
4. Savings Plan timing	$5,000–$15,000	One decision	Low
5. Cross-AZ data transfer	$500–$6,000	2 hours	Medium
6. gp2 volume trap	$1,000–$5,000	30 minutes (script)	Low
7. Infinite log trap	$500–$2,000	1 hour (script)	Low
8. Orphaned resources	$500–$2,000	2 hours (Lambda)	Low
Total potential	$11,100–$40,800/month

What to Do This Week

Don't fix all eight this week. Prioritise by ROI per hour of engineering time:

Day 1 (30 minutes): Pattern 3 — NAT Gateway endpoints. Highest ROI per minute of any fix in this guide. One command creates the S3 endpoint. Done.

Day 2 (30 minutes): Pattern 6 — gp2 to gp3 migration. Run the script. Check the output. Done.

Day 3 (1 hour): Pattern 7 — log retention policies. Run the bulk retention script. Done.

Day 4 (2 hours): Patterns 1 and 8 — deploy both Lambdas. They run automatically from here.

Next sprint: Pattern 2 (staging schedule), Pattern 5 (topology-aware routing), and Pattern 4 (run the rightsizing cycle first, then evaluate Savings Plans).

Open Cost Explorer after each fix. Compare against your baseline screenshot from the start of this guide. The line should start going down.
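The ordering above can be reproduced from the summary table. This sketch ranks the fixes by low-end monthly saving per hour of effort; Pattern 4 is left out because its effort is listed as "one decision" rather than a number of hours, and using the low end of each range is an assumption chosen to stay conservative.

```python
# (pattern, low-end monthly saving in dollars, hours to fix), from the summary table.
# Pattern 4 is omitted: its effort is "one decision", not a number of hours.
FIXES = [
    ("1. New hire experiment tax", 1000, 2.0),
    ("2. Staging proliferation", 600, 3.0),
    ("3. NAT Gateway tax", 2000, 0.5),
    ("5. Cross-AZ data transfer", 500, 2.0),
    ("6. gp2 volume trap", 1000, 0.5),
    ("7. Infinite log trap", 500, 1.0),
    ("8. Orphaned resources", 500, 2.0),
]


def rank_by_roi(fixes):
    """Sort fixes by dollars saved per month for each hour of engineering time."""
    return sorted(fixes, key=lambda fix: fix[1] / fix[2], reverse=True)


if __name__ == "__main__":
    for name, saving, hours in rank_by_roi(FIXES):
        print(f"{name}: ${saving / hours:,.0f}/month per hour")
```

Swap in your own estimates once you have measured the actual waste in your account.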

Resources

FinOps Foundation Framework — The practitioner framework this guide contributes to, covering Inform, Optimize, and Operate phases of cloud cost management
AWS Cost Explorer API Reference — Full reference for the get-cost-and-usage command used throughout this guide
AWS Compute Optimizer — AWS's own rightsizing recommendation service, used alongside the patterns in this guide for EC2 and EBS recommendations
AWS VPC Endpoints Documentation — Complete list of available VPC endpoints for Pattern 3
AWS Instance Scheduler Solution — The AWS-maintained CloudFormation solution for Pattern 2 environment scheduling
Karpenter Documentation — For teams ready to go beyond these 8 patterns into dynamic node provisioning and Spot diversification
FinOps Foundation Asset Library — The community asset library where practical scripts like the ones in this guide are contributed and maintained by practitioners

Ayobami Adejumo is a senior platform engineer and FinOps specialist. He has audited AWS infrastructure for 30+ Series A companies and contributes practical tooling to the FinOps Foundation Asset Library.

Common DevOps Mistakes and How to Avoid Them — Tips for Startups

Tolani Akintayo — Thu, 14 May 2026 17:53:38 +0000

Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what not to do before they got into production.

Startup environments make this worse. The pressure to ship fast, the small team sizes, and the absence of senior engineers to review your decisions means mistakes happen quietly until they become outages, data loss events, or security incidents that cost the company thousands of dollars and weeks of recovery time.

This article is a direct breakdown of the ten most costly DevOps mistakes engineers make early in their careers at startups. For each mistake, you will get the real-world scenario, the business impact, and the concrete fix you can apply immediately.

Whether you are setting up your first production environment or auditing an existing one, this guide will help you build systems that are reliable, secure, and aligned with what the business actually needs.

Who This Article Is For
Why Startups Are a Different Environment
Mistake 1: Deploying Without Understanding What You're Deploying
Mistake 2: Using Production as a Development Environment
Mistake 3: Hardcoding Secrets and Credentials
Mistake 4: Overengineering for Problems You Don't Have Yet
Mistake 5: No Observability Before Launch
Mistake 6: Treating Security as a Final Step
Mistake 7: Manual Deployments in Production
Mistake 8: No Disaster Recovery Plan
Mistake 9: No Documentation or Runbooks
Mistake 10: Solving Technical Problems Without Understanding the Business
The System Thinking Framework Every DevOps Engineer Needs
Your Production Readiness Checklist
Conclusion

Who This Article Is For

Early-career DevOps and cloud engineers who are building or maintaining production infrastructure at a startup.
Backend developers who have recently taken on DevOps responsibilities.
Engineers joining a startup who want to understand what operational discipline actually looks like in a fast-moving environment.

You do not need to be an expert in any specific tool to follow this article. The focus is on decision-making patterns and operational discipline, not tool configuration.

Why Startups Are a Different Environment

Before getting into the mistakes, you have to understand why startups produce them in the first place.

In a large company, you typically have dedicated security engineers, an SRE team, a platform team, and multiple reviewers for every infrastructure change. In a startup, you most likely have one engineer responsible for all of that simultaneously.

This creates four specific pressure points:

Speed pressure. The business needs features shipped now. Operational discipline gets treated as optional because nobody is watching closely yet.
Budget constraints. Every infrastructure decision has a direct impact on company runway. Engineers optimize for the cheapest option rather than the most reliable one.
Absent guardrails. There is no senior engineer reviewing your Terraform plans. There is no security audit before launch. The absence of immediate consequences can make bad decisions feel like good ones.
Constantly changing requirements. The architecture you design today may need to support a completely different product in six months.

None of these pressures are excuses for poor decisions. But understanding them helps you see why the following mistakes happen so consistently.

Mistake 1: Deploying Without Understanding What You're Deploying

The Scenario

A junior engineer is asked to deploy the company's Node.js API to AWS. They find a tutorial for Elastic Beanstalk, follow it, and it works. Two weeks later, traffic increases. They try to scale "the same way as in the tutorial." The application goes down. They cannot debug it because they never understood what the deployment was actually doing.

The Business Impact

When production breaks and the person who deployed the system cannot explain how it works, diagnosis takes hours instead of minutes. The longer the incident runs, the higher the cost in customer trust, team morale, and potentially direct revenue loss.

The Fix

Before you deploy anything to production, you should be able to answer these five questions in writing:

What compute type is running my code? (EC2, Lambda, Fargate, container?)
How does a new version replace the old one? (Rolling? Blue/green? All-at-once?)
Where do configuration and secrets come from? (SSM? Secrets Manager? Environment file?)
What downstream services depend on this? (Database connections? Other APIs? Cache?)
How do I roll back in under five minutes if this breaks?

If you cannot answer all five, do not deploy until you can. The tutorial that got it running is not the documentation for how it operates.

"It is better to spend two hours understanding a system before deploying it than two days debugging it after something breaks."

Personally, when learning a new technology, tool, or implementing something I have not worked with before, I usually focus on three core questions: What, Why, and How.

The first question is: What is this technology or concept about?
This helps me build a solid foundation by doing deep research, studying the official documentation, understanding the core principles, and sometimes even learning the history behind the tool or technology. I believe having a well-grounded understanding before implementation is very important.
The second question is: Why do we need it?
I try to understand the value the technology brings, why it should be implemented, what problem it solves, and how it benefits the team or organization. This helps me make informed technical decisions instead of just implementing tools without understanding their purpose.
The third question is: How should it be implemented?
There are usually multiple approaches to solving a problem or implementing a technology, so I focus on understanding the best and most practical approach based on the use case and expected outcome.

This structured approach has helped me learn new technologies quickly, adapt fast, and implement solutions effectively in real-world environments.

Mistake 2: Using Production as a Development Environment

The Scenario

To save time, an engineer tests a new deployment script directly in the production AWS account. They accidentally run a command that terminates the production database instance. Automated backups exist but were misconfigured. Six hours of customer data is unrecoverable.

This scenario happens more often than you would expect. The reasoning is always the same: "It will only take a minute."

The Business Impact

A single test-in-production incident can result in data loss, hours of downtime, and a customer communication crisis. In a startup, that can permanently damage the company's reputation before it has had the chance to build one.

The Fix

You need at minimum three separate environments and ideally three separate AWS accounts:

Environment	Purpose	Access Level
dev	Break things freely. No real data.	Engineers have broad access
staging	Mirror of production. Final verification.	Controlled access
production	Real customers. Real data.	MFA required. No manual deployments.

Using separate AWS accounts (not just separate VPCs) gives you account-level isolation. A permission error in the dev account cannot accidentally touch production infrastructure at the API level.

Infrastructure as Code (Terraform or CloudFormation) makes this affordable: you write the configuration once and apply it three times with different variable files.

# terraform/environments/prod/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "production"
  instance_type     = "t3.medium"
  db_instance_class = "db.t3.medium"
  multi_az          = true
}

# terraform/environments/staging/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "staging"
  instance_type     = "t3.small"
  db_instance_class = "db.t3.small"
  multi_az          = false
}

The module is the same. The environment-specific variables are different. Separate environments are not a luxury; they are the minimum operating standard for any team running real software.

Mistake 3: Hardcoding Secrets and Credentials

The Scenario

A new engineer joins a startup and clones the repository. Inside they find a .env file committed to Git containing the production database password, the Stripe secret key, and an AWS access key with admin permissions. The repository has been public for six months.

GitHub's automated secret scanning never triggered because the secrets were inside a .env file rather than raw in the code. The credentials had been valid and actively used for over six months.

The Business Impact

Automated scanners run by attackers find exposed credentials within minutes of them being pushed to a public repository. A single exposed AWS access key with admin permissions can result in:

Crypto-mining workloads generating thousands of dollars in cloud bills overnight
Complete exfiltration of customer data from every S3 bucket
Privilege escalation: the attacker creates new admin users and locks you out of your own account
AWS account suspension while the investigation runs

According to GitHub's annual security report, millions of secrets are exposed in public repositories every year. The average time to detect a compromised cloud credential is 197 days.

The Fix

Step 1: Never commit secrets to Git. Not temporarily. Not in a branch. Not in a private repository.

Step 2: Add .gitignore before you create the first file. Check in the .gitignore with the first line of code before any .env files exist.

# .gitignore
.env
.env.*
*.pem
*.key
secrets/

Step 3: Use AWS Secrets Manager or SSM Parameter Store for all production secrets. Your application reads secrets at runtime:

# Python example — fetch secret at runtime, never at build time
import boto3
import json
 
def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
 
# Usage
db_config = get_secret("prod/myapp/database")
DATABASE_URL = db_config["connection_string"]

Step 4: Scan your existing repositories immediately. You may already have a problem:

# Install trufflehog v3 to scan for exposed secrets in your repo history
# (the commands below use the v3 CLI; the pip package is the older v2 with different syntax)
brew install trufflehog
 
# Scan the entire commit history of your repository
trufflehog git file://.
 
# Or scan a remote GitHub repo
trufflehog github --repo https://github.com/your-org/your-repo

Step 5: Add a pre-commit hook to prevent future accidents:

pip install pre-commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/git-secrets
    rev: master
    hooks:
      - id: git-secrets
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets

pre-commit install
# Now the hook runs before every commit and blocks detected secrets

Once a database password has been publicly exposed, rotating it cannot undo whatever access already happened. The fix takes ten minutes upfront. The incident takes weeks.

Mistake 4: Overengineering for Problems You Don't Have Yet

The Scenario

A five-person startup with 200 users decides to build a microservices architecture on Kubernetes because "Netflix uses it." They spend three months setting up Kubernetes, Istio service mesh, ArgoCD, Vault, Prometheus, and Grafana. Their product has not shipped a new feature in three months. A competitor with a monolith on a single EC2 instance shipped twelve new features in the same period.

The Business Impact

Every layer of infrastructure you add is a layer that can break, a layer that requires expertise to operate, and a layer that slows down every future change. Kubernetes is the right answer for organizations with the scale and team size to operate it. For a five-person startup, it is an expensive distraction.

Premature complexity does not just cost engineering time. It costs the competitive advantage that speed provides in the early stage.

The Fix

Match your infrastructure to your actual stage:

Scale	Right Infrastructure	Cost Range
1–1,000 users	Single EC2 + RDS + Nginx reverse proxy	$20–50/month
1K–50K users	Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD	$200–500/month
50K–500K users	ECS Fargate, RDS read replicas, ElastiCache, full observability	$1K–5K/month
500K+ users	Multi-region, managed Kubernetes, dedicated SRE	$10K+/month

The question to ask before every infrastructure decision is: "What specific, measurable problem does this solve today that my current setup cannot solve?"

Amazon, Netflix, and Uber did not start with microservices. They started with monoliths and extracted services only when the monolith became the actual bottleneck. You are not Netflix. You are solving the problems in front of you today.

Use managed services wherever possible: RDS instead of self-hosted Postgres, Fargate instead of self-managed Kubernetes, ElastiCache instead of self-hosted Redis. Managed services let your team focus on the product instead of the infrastructure.

Mistake 5: No Observability Before Launch

The Scenario

A startup's checkout flow breaks on a Friday evening. Users are abandoning their carts and the company is losing revenue. The DevOps engineer finds out 45 minutes later because a customer sent a direct message to the CEO on Twitter.

The engineer has no dashboards, no log aggregation, and no alerting. They SSH into the production server and scroll through raw log files. Two hours later, they find the issue: a database connection pool was exhausted by a memory leak introduced in that morning's deployment.

The Business Impact

Without observability:

You find out about production problems from users, not from your systems
Incidents take 10x longer to resolve because diagnosis is guesswork
You cannot tell whether a deployment improved or degraded performance
You have no data for making better architecture decisions

The Fix

Implement the four golden signals before any service goes to production. These come from Google's Site Reliability Engineering book:

Latency: How long requests take to complete (p50, p95, p99)
Traffic: How many requests per second the system is handling
Errors: The rate of failed requests (5xx responses per minute)
Saturation: How close the system is to its limits (CPU, memory, connection pool)
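To make the four signals concrete, here is a minimal sketch that computes them from a window of request samples. The input shape (a list of latency and status code pairs), the use of a connection pool for saturation, and the function names are assumptions for illustration; in practice your metrics library or load balancer produces these numbers for you.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list of values."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Compute the four golden signals for one observation window.

    requests: list of (latency_ms, status_code) tuples seen in the window.
    """
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "latency_ms": {p: percentile(latencies, p) for p in (50, 95, 99)},
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": pool_in_use / pool_size,
    }


if __name__ == "__main__":
    sample = [(i, 200) for i in range(1, 100)] + [(100, 500)]
    print(golden_signals(sample, window_seconds=50, pool_in_use=8, pool_size=10))
```

Percentiles matter more than averages here: a healthy p50 can hide a p99 that is timing out for one user in a hundred.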

Here is a minimal CloudWatch alarm setup using the AWS CLI:

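# Alert when targets return 10 or more 5xx responses per minute for 5 consecutive minutes
# An ALB publishes 5xx counts, not a rate; the threshold of 10 is a placeholder to tune for your traffic
# A true percentage alarm needs metric math that divides this count by RequestCount

aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate-production" \
  --alarm-description "Target 5xx count at or above 10 per minute for 5 minutes" \
  --metric-name "HTTPCode_Target_5XX_Count" \
  --namespace "AWS/ApplicationELB" \
  --statistic "Sum" \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 10 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --treat-missing-data "notBreaching" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-production" \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef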
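An Application Load Balancer publishes counts (HTTPCode_Target_5XX_Count and RequestCount) rather than a ratio, so an alarm on a percentage error rate can be expressed with CloudWatch metric math. The sketch below only builds the arguments for boto3's put_metric_alarm; the alarm name, metric IDs, and 1% default are illustrative assumptions.

```python
def error_rate_alarm_kwargs(load_balancer: str, sns_topic_arn: str,
                            threshold_pct: float = 1.0) -> dict:
    """Build put_metric_alarm arguments for a 5xx percentage alarm on an ALB."""
    dimensions = [{"Name": "LoadBalancer", "Value": load_balancer}]

    def stat(metric_id, metric_name):
        # Input metrics feed the expression and are not alarmed on directly
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "high-error-rate-production",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": 5,
        "Threshold": threshold_pct,
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            stat("errors", "HTTPCode_Target_5XX_Count"),
            stat("requests", "RequestCount"),
            {
                "Id": "error_pct",
                "Expression": "100 * errors / requests",
                "Label": "5xx error rate (%)",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
    }
```

To create the alarm, pass the result to boto3.client("cloudwatch").put_metric_alarm(**kwargs). Treating missing data as not breaching keeps the alarm quiet during minutes with no traffic.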

Every application should also expose a /health endpoint that returns 200 OK when healthy:

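# FastAPI example

import os

from fastapi import FastAPI, Response, status
from sqlalchemy import create_engine, text
 
app = FastAPI()

# DATABASE_URL is assumed to be set in the environment
engine = create_engine(os.environ["DATABASE_URL"])
 
@app.get("/health")
def health_check(response: Response):
    # Check database connectivity
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        db_status = "healthy"
    except Exception:
        db_status = "unhealthy"
        # Return 503 so load balancers and uptime monitors see the failure
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
 
    return {
        "status": "healthy" if db_status == "healthy" else "degraded",
        "database": db_status,
        "version": os.getenv("APP_VERSION", "unknown")
    }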

Your load balancer checks this endpoint. Your uptime monitor checks it. You check it after every deployment.

You do not get to say a system is working unless you have data to prove it. "Nobody complained" is not the same as "nothing is broken."

Mistake 6: Treating Security as a Final Step

The Scenario

A startup rushes to launch their MVP. Security reviews are "planned for after launch." Six months later, a potential enterprise customer requires a security audit before signing a contract. The audit reveals:

S3 buckets publicly accessible by default
EC2 instances with port 22 open to 0.0.0.0/0
IAM users with AdministratorAccess for the entire team
No encryption on the database at rest
JWT secrets hardcoded in environment variables

The audit fails. The enterprise deal worth $120,000 annually is lost. Remediation takes four weeks of engineering time.

The Business Impact

Security debt is the most expensive technical debt you can accumulate. Unlike performance debt that degrades gradually, security vulnerabilities cause sudden, catastrophic events: data breaches, ransomware, account takeovers, and regulatory fines. At a startup, any one of these can end the company.

The Fix

Apply these six security controls before the first line of production code ships:

1. Apply the principle of least privilege so every IAM role gets only what it needs:

One of the most common security mistakes in AWS is granting roles more permissions than they need, either out of convenience (s3:*) or uncertainty about what the service actually requires. This creates unnecessary risk: if a role is compromised, the attacker inherits every permission you granted.

The fix is simple: look at what your service actually does, then write a policy that allows exactly that.

If your app uploads and reads files from a specific S3 bucket, the policy should say exactly that:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}

Notice the Resource is scoped to my-app-uploads/*, not all S3 buckets. And the Action list covers only GetObject and PutObject, not DeleteObject and not s3:*. If the service gets compromised, the attacker can read and write to that one bucket. That is it. The rest of your account is untouched.
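A quick way to keep policies honest is to scan them for wildcards before they are applied. The following is a minimal sketch, not a full policy linter: it only looks at Allow statements, only flags "*" and "service:*" patterns, and the function name is illustrative.

```python
def find_wildcards(policy: dict) -> list:
    """Return a description of every wildcard action or resource in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]

    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources

        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"Statement {index}: wildcard action {action}")
        for resource in resources:
            if resource == "*":
                findings.append(f"Statement {index}: wildcard resource")
    return findings


if __name__ == "__main__":
    broad = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
    for finding in find_wildcards(broad):
        print(finding)
```

Run against the scoped policy above, it reports nothing; run against a policy that grants s3:* on every resource, it reports both problems.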

2. Block all S3 public access by default:

AWS S3 buckets are private by default when created, but that can be overridden at the bucket level, the object level, or through a bucket policy. Misconfigured S3 buckets are one of the most common causes of data breaches, and they are almost always accidental.

The safest approach is to enable the "Block Public Access" settings, which override ACLs and bucket policies and prevent a bucket from being made public even if someone tries. The command below applies them to a single bucket:

aws s3api put-public-access-block \
  --bucket my-app-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Run this for every bucket you create. Better yet, enable it at the AWS account level so it applies automatically to all future buckets by default.
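For the account-level setting, a minimal sketch using boto3's s3control client could look like the following. The account ID is something you supply, the function name is illustrative, and boto3 is imported lazily so the configuration can be inspected without AWS credentials.

```python
# The same four settings as the bucket-level command, applied account-wide
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def block_public_access_for_account(account_id: str) -> None:
    """Enable S3 Block Public Access for every bucket in the given account."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")

    import boto3  # lazy import: only needed when actually calling AWS

    s3control = boto3.client("s3control")
    s3control.put_public_access_block(
        AccountId=account_id,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK,
    )
```

Before enabling this on an existing account, check whether any bucket is intentionally public (for example, static website hosting), because the account-level block will override it.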

3. Never open SSH to the internet; use AWS Systems Manager Session Manager instead:

Port 22 open to 0.0.0.0/0 is an attack surface that exists on thousands of AWS instances right now. Brute-force bots scan the internet continuously looking for open SSH ports. Even with a strong key, the exposure is unnecessary because AWS provides a better alternative.

AWS Systems Manager Session Manager gives you full shell access to any EC2 instance without opening a single inbound port on the security group. There is no port to scan, no port to attack, and every session start is recorded in CloudTrail (full session transcripts can additionally be sent to S3 or CloudWatch Logs):

# Start a session on an EC2 instance without port 22 open
aws ssm start-session --target i-0123456789abcdef0

To use Session Manager, the EC2 instance needs the SSM Agent installed (included by default on Amazon Linux 2 and Ubuntu 20.04+) and an IAM instance profile with the AmazonSSMManagedInstanceCore policy attached. Once that is set up, you can close port 22 on the security group entirely.

4. Enable MFA for all IAM users and enforce it via policy:

A leaked IAM username and password with no MFA is a fully compromised account. Multi-factor authentication is the single most effective control against credential theft, and it costs nothing to enable.

Enforce it through an IAM policy that denies all actions when MFA is not present, except the actions needed to set up MFA in the first place. This means even if a set of credentials is stolen, the attacker cannot do anything without the second factor.

The AWS documentation provides the Complete Deny Without MFA Policy. Attach it to every IAM user or group in your account. This is a one-time setup that permanently raises your account's security baseline.
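The core of that policy is a single Deny statement. The sketch below builds an abbreviated version of it for illustration: it contains only the deny statement, the list of exempted actions is an assumption based on what a user needs to enrol an MFA device, and the full policy in the AWS documentation also includes Allow statements for managing your own credentials. Use the documented policy in production.

```python
import json

# Actions a user must still be able to call in order to set up MFA (assumed list)
MFA_SETUP_ACTIONS = [
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:GetUser",
    "iam:ListMFADevices",
    "iam:ListVirtualMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def deny_without_mfa_policy() -> dict:
    """Build a policy that denies everything except MFA setup when MFA is absent."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": MFA_SETUP_ACTIONS,
                "Resource": "*",
                # BoolIfExists also matches requests where the key is missing entirely
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_without_mfa_policy(), indent=2))
```

Because an explicit Deny overrides any Allow, this statement takes effect regardless of what other policies the user has attached.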

5. Enable CloudTrail in all regions:

Without CloudTrail, you have no record of who did what in your AWS account. If a credential is compromised, you cannot investigate what the attacker accessed. If an engineer accidentally deletes a resource, you cannot trace it. You are operating blind.

CloudTrail records AWS API activity: who made each call, from which IP, at what time, and what the response was. Enable it across all regions so activity in regions you do not actively use is also captured:

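aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation

# create-trail does not start recording on its own; turn logging on explicitly
aws cloudtrail start-logging --name production-audit-trail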

The --enable-log-file-validation flag generates digest files that let you verify the logs have not been tampered with. This matters if you ever need to use these logs in a security investigation or compliance audit. Once the trail is logging, every AssumeRole, every DeleteBucket, and every RunInstances call in your account is recorded.

6. Run AWS Security Hub from day one:

Most teams only discover security misconfigurations after a breach or a compliance audit. Security Hub inverts this: it continuously scans your AWS environment against industry-standard frameworks (CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices) and surfaces findings before they become incidents.

Enabling it takes a single command:

aws securityhub enable-security-hub

Within minutes, Security Hub gives your account a compliance score and a prioritized list of findings. A finding might tell you that a security group has port 22 open to the world, that an S3 bucket has logging disabled, or that root account credentials were recently used. Each finding includes the affected resource and a remediation guide.

Treat every Security Hub finding the same way you treat a production bug: assign it a priority, assign an owner, and close it. A finding sitting unaddressed for 30 days is a known vulnerability you chose to leave open.
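To make that triage routine concrete, here is a rough sketch that lists findings nobody has picked up yet. It assumes Security Hub is already enabled in the current region. The function name is made up for this example, and the three filters (severity label, workflow status, record state) are one reasonable choice rather than the only one:

```shell
# list_open_findings SEVERITY
# Lists active findings that nobody has triaged yet (workflow status NEW)
# at the given severity label: CRITICAL, HIGH, MEDIUM, or LOW.
list_open_findings() {
  severity="$1"
  aws securityhub get-findings \
    --filters "{
      \"SeverityLabel\":  [{\"Value\": \"$severity\", \"Comparison\": \"EQUALS\"}],
      \"WorkflowStatus\": [{\"Value\": \"NEW\", \"Comparison\": \"EQUALS\"}],
      \"RecordState\":    [{\"Value\": \"ACTIVE\", \"Comparison\": \"EQUALS\"}]
    }" \
    --query 'Findings[].[Severity.Label, Title, Resources[0].Id]' \
    --output table
}

# Usage:
# list_open_findings CRITICAL
```

Starting the weekly review with the CRITICAL and HIGH lists keeps the backlog of known vulnerabilities short.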

Mistake 7: Manual Deployments in Production

The Scenario

A startup's deployment process is documented in a Notion page that is four months out of date. It involves SSH-ing into the server, running git pull, running npm install, and restarting the PM2 process. Different engineers do it slightly differently. One engineer, rushing a late-night release, skips npm install. The application starts crashing because a new dependency is missing.

The Business Impact

Manual deployment processes are inherently unreliable. Humans under pressure skip steps, perform steps in the wrong order, and remember procedures differently. Every manual step in a production deployment process is a scheduled incident waiting for the right moment of stress.

The Fix

If a deployment step is performed manually more than twice, it needs to be automated. Here is a minimal but complete GitHub Actions deployment workflow for an ECS Fargate service:

# .github/workflows/deploy.yml
name: Deploy to Production
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write   # Required for OIDC authentication with AWS
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
 
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
 
      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/my-app:$IMAGE_TAG .
          docker push $ECR_REGISTRY/my-app:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/my-app:$IMAGE_TAG" >> $GITHUB_OUTPUT
 
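      - name: Render task definition with the new image
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-app   # assumed name; must match the container in task-definition.json
          image: ${{ steps.build.outputs.image }}

      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: my-app-service
          cluster: production
          wait-for-service-stability: true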

Notice wait-for-service-stability: true. Without this, the workflow reports success the moment ECS accepts the new task definition, before the containers are actually healthy. With it, the workflow fails if the new containers crash. You want to know immediately, not discover it from user reports thirty minutes later.

Mistake 8: No Disaster Recovery Plan

The Scenario

A startup's production database runs on a single RDS instance with no Multi-AZ configuration. Automated backups are enabled but have never been tested. The EBS volume backing the instance fails. AWS provisions a new instance from the last snapshot, which is 18 hours old. 18 hours of customer data is permanently lost.

The startup had no disaster recovery plan, no tested recovery procedure, and no communication template ready for customers.

The Business Impact

The question is not whether your infrastructure will fail. It will fail. Every database, every server, every availability zone experiences failures. The question is whether you have a tested plan for when it does.

Data loss of any magnitude is serious. For startups that handle financial data, healthcare data, or anything under GDPR, even partial data loss can trigger regulatory consequences.

The Fix

Define your RTO and RPO before you design anything:

RTO (Recovery Time Objective): How long can the business survive without this system? A payment API might have an RTO of 15 minutes. An internal analytics dashboard might have an RTO of 4 hours.
RPO (Recovery Point Objective): How much data loss is acceptable? Zero means real-time replication. One hour means hourly snapshots are sufficient. This directly determines your backup frequency and architecture.
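The link between RPO and backup frequency comes down to simple arithmetic: in the worst case, a failure happens just before the next backup, so the data at risk equals the full interval between backups. Here is a small sketch of that check. The function name and the numbers in the usage comment are illustrative only:

```shell
# meets_rpo BACKUP_INTERVAL_MIN RPO_MIN
# Compares the worst-case data loss (one full backup interval) with the RPO.
meets_rpo() {
  interval_min="$1"
  rpo_min="$2"
  if [ "$interval_min" -le "$rpo_min" ]; then
    echo "ok: worst-case loss ${interval_min}m is within the ${rpo_min}m RPO"
  else
    echo "violation: worst-case loss ${interval_min}m exceeds the ${rpo_min}m RPO"
    return 1
  fi
}

# Usage: daily snapshots (1440 minutes) against a one-hour RPO
# meets_rpo 1440 60
```

In the scenario above, a snapshot that is 18 hours old only satisfies an RPO of 18 hours or more. If the business cannot accept that, the backup architecture has to change.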

Enable RDS Multi-AZ for all production databases:

# Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
 
  # Multi-AZ: automatic failover to standby in a different AZ
  # No data loss. Automatic failover in ~60-120 seconds.
  multi_az = true
 
  # Encryption at rest — non-negotiable
  storage_encrypted = true
 
  # Automated backups with 7-day retention
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable deletion protection in production
  deletion_protection = true
 
  tags = {
    Environment = "production"
  }
}

Test your backups on a schedule. Create a monthly calendar event: "Restore production backup to staging and verify data integrity." An untested backup is not a backup; it is a hope.

# Restore a snapshot to a test instance and verify
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier recovery-test \
  --db-snapshot-identifier rds:prod-postgres-2025-01-15 \
  --db-instance-class db.t3.medium \
  --no-multi-az
 
# Wait until the restored instance is ready to accept connections
aws rds wait db-instance-available \
  --db-instance-identifier recovery-test
 
# Connect and verify row counts (one -c per query so each result is printed)
psql -h recovery-test.xxxx.rds.amazonaws.com -U admin -d mydb \
  -c "SELECT COUNT(*) FROM users;" \
  -c "SELECT COUNT(*) FROM orders;"

For official guidance on RDS backup and restore, refer to the AWS RDS Backup and Restore documentation.

Mistake 9: No Documentation or Runbooks

The Scenario

The startup's most experienced DevOps engineer takes two weeks of vacation. On day three of their holiday, the staging environment goes down. Nobody else knows how it was built: the engineer set it up manually over six months with no documentation, no Terraform, and no notes. The team spends four days trying to reconstruct the environment from memory and guesswork. The engineer gets messages on their vacation every day. When they return, they rebuild the environment in four hours.

The Business Impact

Undocumented infrastructure creates single points of failure, not in your systems but in your team. It makes onboarding new engineers take weeks instead of hours. It makes incident response depend on specific people being available. When one of those people leaves the company, the knowledge walks out with them.

The Fix

Documentation for an engineering team means three specific things:

Infrastructure as Code is the highest form of documentation. The Terraform that defines your infrastructure IS the documentation for what exists and how it is configured. If something is not in code, it should not exist in production.
A runbook for every operational task. A runbook is a step-by-step procedure written well enough that someone in their first week at the company can follow it during an incident:

# Runbook: Production Database Connection Exhaustion
 
## Symptoms
- Application logs: "too many connections" errors
- 500 error rate spike on database-dependent endpoints
- pg_stat_activity shows max connections reached
 
## Diagnosis
# Check current connection count
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT COUNT(*) FROM pg_stat_activity;"
 
# See connections by application
psql -h "$DB_HOST" -U "$DB_USER" \
  -c "SELECT application_name, COUNT(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

## Resolution
1. Identify and restart the service causing the connection leak
2. If immediate relief needed: kill idle connections older than 10 minutes
3. Long-term: review connection pool settings in application config

## Escalation
If unresolved in 30 minutes: page the on-call backend engineer.

An architecture README in every repository. Every engineer who clones your repository should be able to understand what it does, how to run it locally, how to deploy it, and what it depends on without asking anyone.
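A runbook is most useful when each resolution step is something you can paste. As an example, step 2 of the runbook above ("kill idle connections older than 10 minutes") could be written out roughly like this. The function name is made up for this sketch, and it assumes the connecting role is allowed to terminate other sessions:

```shell
# idle_kill_sql MINUTES
# Prints a statement that terminates sessions idle for longer than MINUTES,
# skipping the session that runs the statement itself.
idle_kill_sql() {
  minutes="$1"
  cat <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '${minutes} minutes'
  AND pid <> pg_backend_pid();
SQL
}

# Usage, with the same variables as the runbook's diagnosis commands:
# idle_kill_sql 10 | psql -h "$DB_HOST" -U "$DB_USER"
```

Printing the SQL before piping it to psql gives the on-call engineer a chance to read exactly what will run, which matters at 3 a.m. during an incident.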

Mistake 10: Solving Technical Problems Without Understanding the Business

The Scenario

A startup is experiencing slow page loads. A DevOps engineer decides to solve it by migrating to Kubernetes with horizontal pod auto-scaling. The migration takes six weeks. Page loads improve slightly. But 80% of the slowness was caused by unoptimized database queries that had nothing to do with the infrastructure layer. The six-week migration solved 20% of the problem.

The Business Impact

Technical solutions to misdiagnosed problems are extraordinarily expensive. Every hour spent building the wrong solution is an hour not spent on the right one. Infrastructure is a tool for delivering business outcomes, not an end in itself.

The Fix

Before making any infrastructure decision, answer these four questions:

What is the actual, measured bottleneck? Instrument before you act. The bottleneck is almost never where you assumed it was.
What does success look like, and how will you measure it? "Pages are faster" is not measurable. "p95 page load time drops below 1.2 seconds" is measurable.
What is the full cost of this solution? Time to implement, ongoing operational burden, team learning curve. Is this cost justified by the measured impact?
Can a simpler solution solve 80% of the problem in 20% of the time?
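A target like "p95 below 1.2 seconds" only helps if you can compute the number from real measurements. Here is a minimal sketch, assuming you have exported page load times to a text file with one value per line. The file name in the usage comment is hypothetical, and the function uses the nearest-rank definition of a percentile:

```shell
# p95 FILE
# Prints the 95th percentile (nearest-rank method) of the numbers in FILE,
# one measurement per line.
p95() {
  sort -n "$1" | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(NR * 0.95) using whole numbers
      print v[rank]
    }'
}

# Usage (hypothetical file of load times in seconds):
# p95 page_load_times.txt
```

Record this number before the change and again after it. If it does not move, the change did not address the bottleneck.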

Always profile and measure before you rebuild:

# Check slow queries in PostgreSQL before any infrastructure changes
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 10;
"

Nine times out of ten, slow applications have slow queries, missing indexes, or an N+1 query problem, none of which require a new infrastructure layer to fix.

The System Thinking Framework Every DevOps Engineer Needs

Most of the mistakes above share a common root cause: the engineer was thinking about one component in isolation instead of the full system.

A system thinker asks six questions before making any change in production:

Question	Why You Ask It
What does this change?	List every configuration, file, or service that will be different.
What does this depend on?	What must be true upstream for this component to work correctly?
What depends on this?	What downstream systems are affected if this changes or fails?
What is the failure mode?	Does this fail loudly (500 errors) or silently (wrong data)?
What is the rollback path?	How do you reverse this in under five minutes?
What does healthy look like after the change?	What metrics confirm everything is working correctly?

This is not a checklist you run through slowly. It is a thinking habit that becomes automatic with practice. Senior engineers do not spend more time on deployments than junior engineers do. They spend their time on different things, and this is one of them.

Your Production Readiness Checklist

Use this checklist before any production system goes live. Mark each item as done, in progress, or not yet started.

Infrastructure

Infrastructure is defined as code (Terraform or CloudFormation) and version-controlled in Git
Separate dev, staging, and production environments exist with separate credentials
All production changes go through an automated CI/CD pipeline, no manual SSH deployments
You can rebuild the entire production environment from code in under two hours

Security

No secrets, credentials, or API keys exist in any Git repository
All production secrets are in Secrets Manager or SSM Parameter Store
All IAM roles follow the principle of least privilege
S3 buckets have public access blocked by default
Port 22 is not open to 0.0.0.0/0 on any security group
CloudTrail is enabled in all regions
All IAM users have MFA enabled
AWS Security Hub is enabled and findings are reviewed weekly

Observability

Every service has a /health endpoint that monitoring checks continuously
Alerts fire within five minutes of a production error rate spike
Dashboards exist showing latency, error rate, and resource utilization
Logs are centralized and searchable, not scattered across individual servers

Reliability

Production database has Multi-AZ enabled
Backup restoration has been tested in the last 30 days
Written runbooks exist for the three most likely failure scenarios
RTO and RPO requirements are documented and the architecture meets them

Documentation

Every repository has a README explaining what it does and how to deploy it
A new engineer could understand the production architecture from documentation alone
No single engineer holds critical knowledge that lives only in their head
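Several items on this checklist can be checked from the command line instead of by eye. As one example, here is a rough sketch for the "Port 22 is not open to 0.0.0.0/0" item. The function name is made up, and the filters are approximate: they can miss rules that cover port 22 as part of a wider range, and they can over-report, so treat every hit as something to review rather than as proof:

```shell
# open_ssh_groups
# Lists security groups in the current region that appear to allow
# port 22 from anywhere (0.0.0.0/0). Empty output means no matches.
open_ssh_groups() {
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.from-port,Values=22 \
              Name=ip-permission.to-port,Values=22 \
              Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output text
}

# Usage:
# open_ssh_groups
```

Because the command only covers one region, you would need to repeat it for every region you use.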

Conclusion

None of the mistakes in this article require rare misfortune to experience. They are the predictable result of decisions that feel reasonable under startup pressure but accumulate into real operational risk over time.

The good news is that every single one of them is preventable with the right awareness and the right habits applied early.

You do not need a perfect infrastructure from day one. You need a correct one: version-controlled, automated, observable, secure, and documented. Start with that foundation. Add complexity only when a specific, measured problem requires it. Always connect technical decisions to business outcomes.

The goal of DevOps in a startup is not to build impressive infrastructure. It is to build reliable systems that support product growth safely, efficiently, and sustainably, and to make sure that when something does break, you can recover faster than anyone notices.

Want to Go Deeper?

If this article resonated with you, The Startup DevOps Field Guide covers these principles in full depth with complete infrastructure blueprints, security frameworks, CI/CD pipeline templates, and the end-to-end decision-making playbook for engineers building DevOps practices in startup environments from scratch.

It is written specifically for the engineer who wants to do this right from the beginning, not the one rebuilding everything after the first major incident.

AWS Certified Cloud Practitioner Study Course – Pass the Exam With This Free 14-Hour Course

Beau Carnes — Thu, 14 May 2026 13:00:00 +0000

Passing the AWS Certified Cloud Practitioner Exam is one of the first steps to a career in cloud development. And freeCodeCamp just published a free 14-hour course that will help you prepare for the exam.

This course has been updated for 2026.

This exam mostly deals with cloud computing concepts. Even if you are new to coding, you should be able to prepare for this exam and earn the AWS certification. Andrew Brown created this course. He is a popular instructor and the CEO of ExamPro.

What is the AWS Certified Cloud Practitioner?

The Certified Cloud Practitioner is the entry-level AWS certification that goes through:

The cloud fundamentals, for example Cloud Concepts, Cloud Architecture, and Cloud Deployment Models
A close look at the AWS Core Services
A quick look at the vast amount of AWS services
Identity, Security, and Governance of the Cloud
Billing, Pricing, and Support of AWS Services

The course code is CLF-C02, but it's commonly referred to as the CCP.

Amazon Web Services is the leading Cloud Service Provider (CSP) in the world and the AWS Certified Cloud Practitioner is the most common starting point for people breaking into the cloud industry.

Consider the AWS Certified Cloud Practitioner if:

You are new to cloud and need to learn the fundamentals
You are in the executive, management, or sales level and need to acquire strategic information about cloud for adoption or migration
You are a Senior Cloud Engineer or Solutions Architect who needs to reset or refresh your AWS knowledge after working for multiple years

No matter your path towards a cloud role, the AWS Certified Cloud Practitioner provides fundamental knowledge that you shouldn't skip.

Here are all the sections in this comprehensive course:

Introduction

🎤 Is Certified Cloud Practitioner right for me?
🎤 Exam Guide
🎤 Practice Exam Sample
🎤 Case Study Question Type
🎤 Validators

Cloud Concepts

🎤 What is Cloud Computing
🎤 Evolution of Cloud Hosting
🎤 What is Amazon
🎤 What is AWS
🎤 What is a Cloud Service Provider
🎤 Landscape of CSPs
🎤 Gartner Magic Quadrant for Cloud
🎤 Common Cloud Services
🎤 AWS Technology Overview
🎤 AWS Services Preview
🎤 Evolution of Computing
🎤 Types of Cloud Computing
🎤 Cloud Computing Deployment Models
🎤 Deployment Model Use Cases

Getting Started

🎤 Create an AWS Account
🎤 Create IAM User
🎤 AWS Region Selector
🎤 Overbilling Story
🎤 AWS Budgets
🎤 AWS Free Tier
🎤 Billing Alarm
🎤 Turning on MFA

Digital Transformation

🎤 Innovation Waves
🎤 Burning Platform
🎤 Digital Transformation Checklist
🎤 Evolution of Computing Power
🎤 Amazon Braket

The Benefits of Cloud

🎤 The Benefits of the Cloud
🎤 The Six Advantages of Cloud
🎤 The Six Advantages of Cloud Doc Reference
🎤 The Seven Advantages of Cloud

Global Infrastructure

🎤 AWS Global Infrastructure Overview
🎤 AWS Global Infrastructure Follow Along
🎤 Regions
🎤 Regional vs Global Services
🎤 Availability Zones AZs
🎤 Regions and AZ Visualized
🎤 Selecting Regions and AZs Follow Along
🎤 Fault Tolerance
🎤 AWS Global Network
🎤 Points of Presence PoP
🎤 Tier 1
🎤 AWS Services using PoPs
🎤 AWS Direct Connect
🎤 Direct Connect Locations
🎤 AWS Local Zones
🎤 Wavelength Zones
🎤 Data Residency
🎤 AWS for Government
🎤 GovCloud
🎤 AWS in China
🎤 AWS in China Follow Along
🎤 Sustainability
🎤 Sustainability Follow Along
🎤 AWS Ground Station
🎤 AWS Outposts

Cloud Architecture

🎤 Cloud Architecture Terminologies
🎤 High Availability
🎤 High Scalability
🎤 High Elasticity
🎤 Fault Tolerance
🎤 High Durability
🎤 Business Continuity Plan
🎤 Disaster Recovery Options
🎤 RTO Visualized
🎤 RPO Visualized
🎤 Architectural diagram example
🎤 HA Follow Along

Management and Developer Tools

🎤 AWS API
🎤 AWS API Follow Along
🎤 AWS Management Console
🎤 AWS Management Console Follow Along
🎤 Service Console
🎤 Service Console Follow Along
🎤 AWS Account ID
🎤 AWS Account ID Follow Along
🎤 AWS Tools for PowerShell
🎤 AWS Tools for PowerShell Follow Along
🎤 Amazon Resource Names
🎤 ARN Follow Along
🎤 AWS CLI
🎤 AWS CLI Follow Along
🎤 AWS SDK
🎤 AWS SDK Follow Along
🎤 AWS CloudShell
🎤 Infrastructure as Code
🎤 CloudFormation
🎤 CloudFormation Follow Along
🎤 CDK
🎤 CDK Follow Along
🎤 AWS Toolkit for VSCode
🎤 Access Keys
🎤 Access Keys Follow Along
🎤 AWS Documentation
🎤 AWS Documentation Follow Along

Shared Responsibility Model

🎤 Introduction to Shared Responsibility Model
🎤 AWS Shared Responsibility Model
🎤 Types of Cloud Responsibilities
🎤 Shared Responsibility for Compute
🎤 Shared Responsibility Model Alternate
🎤 Shared Responsibility Model Architecture

Compute

🎤 EC2 Overview
🎤 VMs Containers and Serverless
🎤 Compute Follow Along
🎤 High Performance Computing HPC
🎤 HPC Follow Along
🎤 Edge and Hybrid
🎤 Edge Computing Follow Along
🎤 Cost Capacity Management

Storage Services

🎤 Types of Storage Services
🎤 Introduction to S3
🎤 S3 Storage Classes
🎤 AWS Snow Family
🎤 Storage Services
🎤 S3 Follow Along
🎤 EBS Follow Along
🎤 EFS Follow Along
🎤 Snow Family Follow Along

Databases

🎤 What is a database
🎤 What is a data warehouse
🎤 What is a key value store
🎤 What is a document database
🎤 NoSQL Database Services
🎤 Relational Database Services
🎤 Other Database Services
🎤 DynamoDB Follow Along
🎤 RDS Follow Along
🎤 Redshift Follow Along

Networking

🎤 Cloud Native Networking Services
🎤 Enterprise Hybrid Networking Services
🎤 Virtual Private Cloud VPC Subnets
🎤 Security Groups vs NACLs
🎤 Security Groups vs NACLs Follow Along
🎤 AWS CloudFront

EC2

🎤 Introduction to EC2
🎤 EC2 Instance Families
🎤 EC2 Instance Types
🎤 Dedicated Host vs Dedicated Instances
🎤 EC2 Tenancy
🎤 Launch an EC2 SSH and Sessions Manager
🎤 Elastic IP
🎤 AMI and Launch Template
🎤 Launch an ASG
🎤 Launch an ALB
🎤 Cleanup

EC2 Pricing Models

🎤 EC2 Pricing Models
🎤 On Demand
🎤 Reserved Instances
🎤 RI Attributes
🎤 Regional and Zonal RI
🎤 RI Limits
🎤 Capacity Reservations
🎤 Standard vs Convertible RI
🎤 RI Marketplace
🎤 Spot Instances
🎤 Dedicated Instances
🎤 Savings Plan

Identity

🎤 Zero Trust Model
🎤 Zero Trust on AWS
🎤 Zero Trust on AWS with Third Parties
🎤 Directory Service
🎤 Active Directory
🎤 Identity Providers
🎤 Single Sign On
🎤 LDAP
🎤 Multi Factor Authentication
🎤 Security Keys
🎤 AWS IAM
🎤 Anatomy of an IAM Policy
🎤 IAM Policies Follow Along
🎤 Principle of Least Privilege
🎤 AWS Account Root User
🎤 AWS SSO

Application Integration

🎤 Introduction to Application Integration
🎤 Queueing and SQS
🎤 Streaming and Kinesis
🎤 Pub Sub and SNS
🎤 API Gateway and Amazon API Gateway
🎤 State Machines and AWS Step Functions
🎤 Event Bus and Amazon Event Bridge
🎤 Application Integration Services

Containers

🎤 VMs vs Containers
🎤 What are Microservices
🎤 Kubernetes
🎤 Docker
🎤 Podman
🎤 Container Services

Governance

🎤 Organizations and Accounts
🎤 AWS Control Tower
🎤 AWS Config
🎤 AWS Config Follow Along
🎤 AWS Quick Starts
🎤 AWS QuickStarts Follow Along
🎤 Tagging
🎤 Tag Name Follow Along
🎤 Resource Groups
🎤 Resource Groups Follow Along
🎤 Business Centric Services

Provisioning

🎤 Provisioning Services
🎤 AWS Elastic Beanstalk
🎤 AWS Elastic Beanstalk Follow Along

Serverless

🎤 What is Serverless
🎤 Serverless Services

Windows on AWS

🎤 Windows on AWS
🎤 EC2 Windows Follow Along
🎤 AWS License Manager

Logging

🎤 Logging Services
🎤 AWS Cloud Trail
🎤 CloudWatch Alarm
🎤 Anatomy of an Alarm
🎤 Log Streams and Events
🎤 Log Insights
🎤 CloudWatch Metrics
🎤 AWS CloudTrail Follow Along

ML AI BigData

🎤 Introduction to ML and AI
🎤 AI and ML Services
🎤 BigData and Analytics Services
🎤 Amazon QuickSight
🎤 QuickSight Follow Along
🎤 Machine Learning and AI Services Extended
🎤 Generative AI
🎤 ML and DL Frameworks and Tools
🎤 Apache MXNet
🎤 What is Intel
🎤 Intel Xeon Scalable and Intel Gaudi
🎤 What is a GPU
🎤 What is CUDA

AWS Well Architected Framework

🎤 AWS Well Architected Framework
🎤 General Definitions
🎤 On Architecture
🎤 Amazon Leadership Principles
🎤 General Design Principles
🎤 Anatomy of a Pillar
🎤 Operational Excellence
🎤 Security
🎤 Reliability
🎤 Performance Efficiency
🎤 Cost Optimization
🎤 AWS Well Architected Tool
🎤 Well Architected Framework and Tool Follow Along
🎤 AWS Architecture Center

TCO and Migration

🎤 Total Cost of Ownership TCO
🎤 CAPEX vs OPEX
🎤 Shifting IT Personnel
🎤 AWS Pricing Calculator
🎤 AWS Pricing Calculator Follow Along
🎤 Migration Evaluator
🎤 VM Import Export
🎤 Database Migration Service
🎤 Cloud Adoption Framework

Billing and Pricing

🎤 AWS Free Services
🎤 AWS Support Plans
🎤 Technical Account Manager
🎤 AWS Support Follow Along
🎤 AWS Marketplace
🎤 AWS Marketplace Follow Along
🎤 Consolidated Billing
🎤 Consolidated Billing Volume Discounts
🎤 AWS Trusted Advisor
🎤 AWS Trusted Advisor Follow Along
🎤 SLAs
🎤 AWS SLA Examples
🎤 AWS SLA Follow Along
🎤 Service Health Dashboard
🎤 AWS Personal Health Dashboard
🎤 AWS Abuse
🎤 AWS Abuse Report Follow Along
🎤 AWS Free Tier
🎤 AWS Credits
🎤 AWS Partner Network
🎤 AWS Budgets
🎤 AWS Budget Reports
🎤 AWS Cost and Usage Reports
🎤 Cost Allocation Tags
🎤 Billing Alarms
🎤 AWS Cost Explorer
🎤 AWS Cost Explorer Follow Along
🎤 Programmatic Pricing APIs
🎤 AWS Savings Plan Follow Along

Security

🎤 Defense In Depth
🎤 CIA Triad
🎤 Vulnerabilities
🎤 Encryption
🎤 Cyphers
🎤 Cryptographic Keys
🎤 Hashing and Salting
🎤 Digital Signatures and Signing
🎤 In Transit vs At Rest Encryption
🎤 Compliance Programs
🎤 AWS Compliance Programs Follow Along
🎤 Pen Testing
🎤 Pen Testing Follow Along
🎤 AWS Artifact
🎤 AWS Artifact Follow Along
🎤 AWS Inspector
🎤 DDoS
🎤 AWS Shield
🎤 AWS Guard Duty
🎤 AWS Guard Duty Follow Along
🎤 Amazon Macie
🎤 AWS VPN
🎤 AWS WAF
🎤 AWS WAF Follow Along
🎤 Hardware Security Module
🎤 AWS KMS
🎤 AWS KMS Follow Along
🎤 CloudHSM

Variation Study

🎤 Know Your Initialisms
🎤 AWS Config AWS AppConfig
🎤 SNS vs SQS
🎤 SNS vs SES vs PinPoint vs Workmail
🎤 Amazon Inspector vs AWS Trusted Advisor
🎤 Connect Named Services
🎤 Elastic Transcoder vs MediaConvert
🎤 AWS Artifact vs Amazon Inspector
🎤 ELB variants

You can watch the entire course on the freeCodeCamp.org YouTube channel (14-hour course).

How to Migrate to S3 Native State Locking in Terraform

Tolani Akintayo — Thu, 07 May 2026 22:58:43 +0000

If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.

But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.

The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.

In August 2024, Amazon S3 added support for conditional writes, which gave Terraform a way to create a lock file atomically inside S3 itself. Terraform 1.10, released in November 2024, added S3 native state locking on top of this as an opt-in feature, and it became generally available in Terraform 1.11. DynamoDB is no longer required for state locking.

In this tutorial, you'll learn:

What S3 native locking is and how it works
How to set it up from scratch if you're starting a new project
How to migrate an existing S3 + DynamoDB setup to S3 native locking safely
How to verify locking is working and handle edge cases

By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.

What Is Terraform State Locking?
What Is S3 Native State Locking?
How S3 Native Locking Compares to the S3 + DynamoDB Approach
Prerequisites
Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch
Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking
How to Verify That Locking Is Working
How to Handle a Stuck Lock
Rollback Plan: If Something Goes Wrong
Security Best Practices for Your State Bucket
Conclusion
References

What is Terraform State Locking?

Before looking at the new approach, it helps to understand what state locking is solving.

Terraform stores everything it knows about your infrastructure in a state file – a JSON document that maps your configuration to real AWS resources. When you run terraform apply, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.

The problem arises when two engineers or two CI/CD pipelines run and try to apply changes at the same time. If both read the state file simultaneously, calculate changes independently, and both try to write back, you get a race condition. The second write overwrites changes from the first, and your state is now out of sync with reality. This is a serious problem that can cause resources to be untracked, doubled, or destroyed unexpectedly.

State locking solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.

Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ---------->|                            |
   |                                   |                            |
   |<------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |<------ 3. Acquire Lock ----|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |------- 4. Lock Denied ---->|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ---------->|                            |
   |                                   |                            |
   |           [COMPLETED]             |------- 6. Lock Granted --->|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |
   |                                   |                            |

What Is S3 Native State Locking?

Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a LockID primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.

S3 native locking keeps the lock in S3 instead. Despite the similar name, it isn't the same thing as S3 Object Lock, which is an S3 feature designed to enforce WORM (Write Once, Read Many) retention for regulatory requirements. Terraform's lock is an ordinary object, a lock file stored next to the state file, that is created with a conditional write.

When S3 native locking is enabled in your Terraform backend:

Terraform writes your state to a .tfstate object in S3 (as before)
To acquire a lock, Terraform uses S3's conditional write operations, specifically the If-None-Match conditional header, to create a lock file atomically
If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held
When the operation completes, Terraform deletes the lock file to release the lock
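The steps above follow a "create only if absent" pattern that you can try locally. The sketch below is an analogy on the local filesystem, not the S3 API. It uses the shell's noclobber option, which makes a redirection fail when the target file already exists, in the same way the conditional write fails when the lock object already exists. The function names and the lock file path are made up for this example:

```shell
# acquire_lock LOCKFILE OWNER
# Creates LOCKFILE only if it does not already exist and records the owner.
# With noclobber set, the redirection fails when the file is already there.
acquire_lock() {
  ( set -o noclobber; echo "$2" > "$1" ) 2>/dev/null
}

# release_lock LOCKFILE
# Deleting the file releases the lock, as Terraform does with its lock file.
release_lock() {
  rm -f "$1"
}

# Usage (hypothetical path):
# acquire_lock /tmp/demo.tflock run-a && echo "run A holds the lock"
# acquire_lock /tmp/demo.tflock run-b || echo "run B is refused"
# release_lock /tmp/demo.tflock
```

The important property is that checking for the lock and creating it happen in one step. If they were two separate steps, two runs could both see "no lock" and both proceed.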

The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.

Note: This feature requires Terraform version 1.10.0 or later. Terraform's lock file doesn't depend on S3 Object Lock, but this tutorial enables Object Lock on the state bucket anyway. That is easiest to do at bucket creation time. For existing buckets there is a supported path, which we'll cover in Part 2.

How S3 Native Locking Compares to the S3 + DynamoDB Approach

Aspect	S3 + DynamoDB (Old)	S3 Native Locking (New)
AWS services required	S3 + DynamoDB	S3 only
IAM permissions needed	S3 + DynamoDB permissions	S3 permissions only
Terraform version	Any	1.10.0 or later
Setup complexity	Two resources, two IAM scopes	One resource
Stuck lock resolution	Delete DynamoDB record	Delete S3 lock file
Cost	S3 storage + DynamoDB on-demand	S3 storage only
Object Lock requirement	Not required	Not required by Terraform (enabled in this tutorial)
Locking mechanism	DynamoDB conditional writes	S3 conditional writes (`if-none-match`)
State versioning	S3 Versioning (recommended)	S3 Versioning (required for full safety)

The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.

Prerequisites

Before you start, make sure you have the following in place:

Terraform 1.10.0 or later installed. Check your version:

terraform version

If you need to upgrade, follow the official upgrade guide.

AWS CLI installed and configured with credentials that have permission to create and manage S3 buckets.

aws --version
aws sts get-caller-identity   # confirm you're authenticated

IAM permissions to perform the following S3 actions:
- s3:CreateBucket
- s3:PutBucketVersioning
- s3:PutBucketEncryption
- s3:PutObjectLegalHold
- s3:PutObjectRetention
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket
For the migration path: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.
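If you want to enforce the version prerequisite in a script or CI job, compare the version fields as numbers. A plain string comparison gets this wrong, because "1.9.8" sorts after "1.10.0" alphabetically. Here is a small sketch. The function name is made up, and it ignores pre-release suffixes such as -rc1:

```shell
# version_at_least CURRENT MINIMUM
# Succeeds when CURRENT >= MINIMUM, comparing major, minor, and patch
# as numbers. Anything after a "-" (pre-release tag) is ignored.
version_at_least() {
  awk -v cur="${1%%-*}" -v min="${2%%-*}" 'BEGIN {
    split(cur, c, "."); split(min, m, ".")
    for (i = 1; i <= 3; i++) {
      if (c[i] + 0 > m[i] + 0) exit 0
      if (c[i] + 0 < m[i] + 0) exit 1
    }
    exit 0
  }'
}

# Usage: the first line of "terraform version" looks like "Terraform v1.10.0"
# current="$(terraform version | head -n 1 | sed 's/^Terraform v//')"
# version_at_least "$current" 1.10.0 || echo "Terraform $current is too old"
```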

Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch

Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.

Step 1: Create the S3 Bucket with Versioning and Encryption

For a new bucket, the simplest approach is to enable Object Lock at creation time. Create the bucket using the AWS CLI with Object Lock enabled:

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

Note: For regions other than us-east-1, add the --create-bucket-configuration flag.

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket

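Now make sure versioning is enabled on the bucket. Object Lock depends on versioning (S3 turns it on automatically when a bucket is created with Object Lock, so this command is an explicit safeguard), and versioning lets you recover a previous state version if something goes wrong: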

aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled

Enable server-side encryption so your state files are encrypted at rest:

aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:

aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Verify the bucket configuration:

# Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state

Expected output for the Object Lock check:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}

Step 2: Configure the Terraform Backend with Native Locking

In your Terraform project, create or update your backend.tf file:

terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0+ and a bucket with Object Lock enabled
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}

The critical difference from the old configuration is the use_lockfile = true parameter. Notice what is absent: there's no dynamodb_table argument. No DynamoDB table. No second service.

Here's a direct comparison of the old and new configurations:

Old configuration (S3 + DynamoDB):

terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}

New configuration (S3 native locking):

terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}

Step 3: Initialize and Verify

Run terraform init to initialize the backend:

terraform init

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!

Run a plan to confirm everything is working end-to-end:

terraform plan

If locking is working, you'll see a brief pause while Terraform acquires the lock before the plan output appears. You'll also see the lock information if you look at the S3 bucket – a .tflock file will appear temporarily alongside your state file during the operation and disappear when it completes.

Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking

Follow this section if you have an existing Terraform setup using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.

Important: Migration requires a maintenance window or at minimum a period where no Terraform operations are running. You're changing the backend configuration, which means all team members and CI/CD pipelines must stop running terraform plan or terraform apply during the migration. The migration itself takes under 10 minutes.

Step 1: Verify Your Current Setup

Before making any changes, document your existing backend configuration and confirm the state file is accessible:

# Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'

Check your current backend.tf and note the exact values:

# Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}

Run one final plan to confirm the current state is clean and there are no unexpected changes pending:

terraform plan

If the plan shows no changes, you're in a safe state to proceed.

Step 2: Enable Object Lock on the Existing S3 Bucket

This is the most important step in the migration. Object Lock was originally a setting that could only be configured at bucket creation time.

AWS now supports enabling Object Lock on an existing bucket with the put-object-lock-configuration API call, provided versioning is already enabled on that bucket. If the command below fails because versioning is off, run the put-bucket-versioning command shown later in this step and try again.

Run the following AWS CLI command to enable Object Lock on your existing bucket:

aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'

Note: This command enables the Object Lock capability on the bucket without setting a default retention mode or retention period, so no object is locked by default. This is exactly what Terraform's native locking needs: the ability to create and delete lock files, not permanent object retention.

Verify Object Lock is now enabled:

aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket

Expected output:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}

Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):

aws s3api get-bucket-versioning \
  --bucket your-existing-bucket

Expected output:

{
    "Status": "Enabled"
}

If versioning isn't enabled, enable it before proceeding:

aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled

Step 3: Update the Terraform Backend Configuration

Update your backend.tf to remove the dynamodb_table argument and add use_lockfile = true:

terraform {
  backend "s3" {
    bucket = "your-existing-bucket"
    key    = "path/to/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
 
    # Add this:
    use_lockfile = true
 
    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}

Your updated backend.tf should look like this:

terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}

Step 4: Reinitialize Terraform

Run terraform init with the -reconfigure flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):

terraform init -reconfigure

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!

If you see an error here: The most common cause is that Object Lock wasn't successfully enabled on the bucket. Re-run the verification from Step 2 before proceeding.

Step 5: Verify the Migration

Run a plan to confirm Terraform is working correctly with the new backend configuration:

terraform plan

The plan should:

Complete successfully
Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)
NOT mention DynamoDB anywhere in its output

To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:

╷
│ Error: Error acquiring the state lock
│
│ Error message: operation error S3: PutObject, https response error StatusCode: 409,
│ RequestID: ..., api error Conflict: Object lock already exists for this key.
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵

The Path field shows .tfstate.tflock, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.

Step 6: Clean Up the DynamoDB Table

Once you've confirmed the migration is working correctly and your team has run at least one successful plan and apply cycle using the new backend, you can remove the DynamoDB table.

Wait at least 24-48 hours before deleting the DynamoDB table if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.

When you're ready, delete the DynamoDB table:

aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table

Confirm the deletion:

aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table

Expected output:

An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found

This error confirms that the table is gone. The migration is complete.

If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run terraform apply to destroy it via Terraform rather than the CLI directly. This keeps your state clean:

# Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}

After removing the block, run:

terraform apply

Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.

How to Verify That Locking Is Working

After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.

Method 1: Observe the lock file during an operation

In one terminal, start a long-running plan against a configuration with many resources:

terraform plan

While it's running, in a second terminal, check for the lock file in S3:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

You should see a file like:

2026-05-06 14:22:01        512 terraform.tfstate.tflock

After the plan completes, run the same command again. The .tflock file should be gone.

Method 2: Read the lock file contents

While a plan is running, download and read the lock file to see its contents:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock && cat /tmp/current.lock

Expected output (formatted for readability):

{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}

This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.
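
Because the lock file is plain JSON, it is easy to inspect in a script. The Python sketch below parses the example contents shown above and reports how old the lock is. The one-hour threshold and the names parse_created and describe_lock are assumptions for illustration. Age alone does not prove a lock is stuck, so treat the result as a hint to investigate, not as permission to delete.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_created(value):
    """Parse a timestamp like 2026-05-06T14:22:01.123456789Z (nanosecond precision)."""
    match = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    # datetime stores microseconds only, so drop any digits beyond the sixth.
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def describe_lock(lock_json, now, stale_after=timedelta(hours=1)):
    """Summarise who holds the lock and whether it is older than stale_after."""
    lock = json.loads(lock_json)
    age = now - parse_created(lock["Created"])
    return {
        "id": lock["ID"],
        "who": lock["Who"],
        "operation": lock["Operation"],
        "age": age,
        "possibly_stale": age > stale_after,
    }


lock_json = """{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}"""
now = datetime(2026, 5, 6, 16, 22, 1, tzinfo=timezone.utc)  # fixed time for the example
info = describe_lock(lock_json, now)
print(info["who"], info["operation"], info["age"], info["possibly_stale"])
```

In practice you would pass `datetime.now(timezone.utc)` as `now` and feed in the contents downloaded with `aws s3 cp`.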

How to Handle a Stuck Lock

With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the .tflock file from S3.

A lock can get stuck if:

A terraform apply or plan process was killed mid-execution
A CI/CD pipeline runner crashed during a Terraform operation
A network interruption prevented the lock release from completing

Here's how you can check for a stuck lock:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

If a .tflock file exists and no Terraform operation is currently running, it is a stuck lock.

You can also read the lock to understand who held it:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock && cat /tmp/stuck.lock

This tells you who (Who field) was running the operation, what operation it was (Operation field), and when it was acquired (Created field).

And you can force-unlock using Terraform like this:

terraform force-unlock LOCK-ID

Replace LOCK-ID with the ID value from the lock file contents. For example:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Terraform will confirm:

Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!

If terraform force-unlock isn't an option (for example, because you are working from a CI environment without Terraform available), you can delete the lock file directly with the AWS CLI:

aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock

Only delete the lock file if you are certain no Terraform operation is currently running. Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.

Rollback Plan: If Something Goes Wrong

If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.

Step 1: Stop all Terraform operations in your team and CI/CD pipelines.

Step 2: Recreate the DynamoDB table if you already deleted it:

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Step 3: Revert backend.tf to the previous configuration:

terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}

Step 4: Reinitialize:

terraform init -reconfigure

Step 5: Verify:

terraform plan

The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.

Note: Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock and DynamoDB locking can coexist; Object Lock simply adds a capability to the bucket. Using dynamodb_table in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled on the bucket.

Security Best Practices for Your State Bucket

Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:

Enable Versioning (Required)

Versioning is a hard requirement for S3 native locking to work safely. It ensures that if a state file is accidentally overwritten or corrupted, you can restore a previous version.

aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled

Block All Public Access (Non-Negotiable)

Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.

aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Enable Server-Side Encryption

Always encrypt state files at rest. SSE-S3 (AES256) is the minimum. If your organization requires keys managed in KMS, configure SSE-KMS instead:

aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Apply Least-Privilege IAM Permissions

The role or user that Terraform uses to access the state bucket should have only the permissions it needs. Here's a minimal IAM policy for S3 native locking:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-state-bucket",
        "arn:aws:s3:::your-state-bucket/*"
      ]
    },
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "s3:GetObjectLegalHold",
        "s3:PutObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:PutObjectRetention"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*.tflock"
    }
  ]
}

Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.
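
If you manage several state buckets, you may prefer to generate this policy rather than copy and edit it by hand. The Python sketch below rebuilds the same two statements for a given bucket name and checks which AWS services the result touches. state_bucket_policy, services_used, and the bucket name team-a-terraform-state are hypothetical, and the actions simply mirror the policy shown above.

```python
import json


def state_bucket_policy(bucket):
    """Build the two-statement policy shown above for one state bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TerraformStateAccess",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
            {
                "Sid": "TerraformStateLocking",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObjectLegalHold",
                    "s3:PutObjectLegalHold",
                    "s3:GetObjectRetention",
                    "s3:PutObjectRetention",
                ],
                "Resource": f"{bucket_arn}/*.tflock",
            },
        ],
    }


def services_used(policy):
    """Return the set of AWS service prefixes (such as 's3') that the policy grants."""
    return {
        action.split(":", 1)[0]
        for statement in policy["Statement"]
        for action in statement["Action"]
    }


policy = state_bucket_policy("team-a-terraform-state")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
print(services_used(policy))
```

A check like services_used can double as a regression test: if a DynamoDB action reappears in a generated policy, the set will no longer contain only "s3".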

Enable Access Logging

Log all access to your state bucket in CloudTrail or S3 server access logs. This gives you an audit trail of every time state was read, written, or locked:

aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'

Conclusion

AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.

Here's a summary of what you accomplished:

Understood what state locking is and why it's required for safe Terraform operations
Compared S3 native locking to the existing S3 + DynamoDB approach
Set up a fresh Terraform backend using S3 native locking with correct bucket configuration
Migrated an existing backend from S3 + DynamoDB to S3 native locking safely
Learned how to verify locking, handle stuck locks, and roll back if needed
Applied security best practices to the state bucket

This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.

If you are building or optimizing cloud infrastructure for a startup and want a complete reference for production-ready Terraform modules, CI/CD pipeline patterns, and infrastructure runbooks, check out The Startup DevOps Field Guide. It covers the full lifecycle of AWS infrastructure from initial setup to production reliability.

The Complete SOC 2 Type II Implementation Handbook for Engineers: A Month-by-Month Roadmap with Real Commands

Ayobami Adejumo — Tue, 05 May 2026 18:26:21 +0000

If your team is preparing for a SOC 2 Type II review, this handbook is for you. It's a self-contained guide to the exact 90-day timeline, 14 critical controls, and evidence collection infrastructure that auditors actually check.

Everyone publishes the controls list. But nobody publishes the week-by-week engineering calendar you'll need to follow to make sure your ducks are in a row.

Here is the exact 90-day timeline — including the mistakes that add 60 days (and how to avoid them).

What You'll Learn
Prerequisites
Weeks 1–2: The Scope Decision
Weeks 3–6: The 14 Controls That Must Be Active on Day 1
Weeks 7–10: The Evidence Collection Infrastructure
Weeks 11–14: Auditor Selection and Readiness Assessment
Weeks 15–18: The Observation Period
The 90-Day SOC2 Timeline at a Glance
What's Next
Resources

What You'll Learn

By the end of this guide, you'll know:

How to scope your SOC2 boundary correctly — the decision that determines everything else
The 14 controls that must be active on day 1 of your observation period
How to build evidence collection infrastructure that runs automatically
How to choose an auditor and run a readiness assessment
What happens during the observation period and how to close gaps without restarting the clock

Let's dive in.

Prerequisites

Before following along, you should have:

Knowledge:

Basic understanding of AWS services (EC2, RDS, S3, IAM, VPC)
Familiarity with Terraform or another infrastructure as code tool
Comfort reading GitHub Actions YAML workflows
A general understanding of what SOC2 is — if you are starting from scratch, read the AICPA's SOC2 overview first

Tools and access:

An AWS account with administrator access
A GitHub organisation with admin rights
Terraform installed (v1.0 or later)
Python 3.8 or later (for the evidence collector Lambda)
A compliance automation platform — Vanta or Drata — connected to your AWS account and GitHub organisation

Estimated time: 90 days end-to-end, with active engineering work of approximately 8–12 hours per week in the first six weeks, tapering to 2–4 hours per week during the observation period.

Weeks 1–2: The Scope Decision — What Is In and Out of Your SOC2 Boundary

What Most Teams Get Wrong

Most teams scope their SOC2 boundary too broadly. They include every AWS account, every service, every environment. This is a mistake — and here is exactly why.

A broader scope means more controls to implement, more evidence to collect, and more systems the auditor will examine.

Every system inside your boundary must satisfy all 14 controls. Including your development sandbox means your engineers' experimental environments must have GuardDuty enabled, CloudTrail logging, and branch-protected deployments. That adds weeks of work and months of evidence collection for systems that pose no risk to your customers.

A correctly bounded scope means you include only the systems that store, process, or transmit customer data — and you prove that everything else cannot reach those systems.

Bad scope (over-inclusive):

Entire AWS Organization
├── Production (in scope)
├── Staging (in scope)
├── Development (in scope)
├── Sandbox (in scope)
└── CI/CD (in scope)

Good scope (correctly bounded):

SOC2 Boundary
├── Production AWS Account (in scope)
├── Production EKS Cluster (in scope)
├── Production RDS (in scope)
└── Everything else (OUT of scope — proven by network segmentation)

The correctly bounded scope works because it draws the tightest defensible line around the systems that actually handle customer data. Everything outside that line is excluded — not by assumption, but by technical controls that prevent those systems from reaching anything inside the boundary.

The Scope Decision Framework

For every system in your infrastructure, ask these four questions:

Question	If YES	If NO
Does this system store, process, or transmit customer data?	✅ In scope	❌ Out of scope
Does this system affect the availability of customer-facing services?	✅ In scope	❌ Out of scope
Does this system have access to production credentials?	✅ In scope	❌ Out of scope
Can a compromise of this system lead to a customer data breach?	✅ In scope	❌ Out of scope

Any system where the answer to even one question is yes belongs inside your boundary.
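
The rule above is a simple "any of four" test, which you can apply consistently across an inventory. Here is a minimal Python sketch. The System class, its field names, and the three example systems are hypothetical and only illustrate the decision logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    handles_customer_data: bool
    affects_customer_availability: bool
    has_production_credentials: bool
    compromise_could_expose_customer_data: bool


def in_scope(system):
    """A single 'yes' to any of the four questions puts the system in scope."""
    return any(
        (
            system.handles_customer_data,
            system.affects_customer_availability,
            system.has_production_credentials,
            system.compromise_could_expose_customer_data,
        )
    )


systems = [
    System("production-rds", True, True, True, True),
    System("dev-sandbox", False, False, False, False),
    # Stores no customer data itself, but holds production credentials.
    System("deploy-pipeline", False, False, True, True),
]
scope = {system.name: in_scope(system) for system in systems}
print(scope)
```

The deploy-pipeline entry shows why the questions are evaluated independently: a system can hold no customer data and still land inside the boundary because of the credentials it carries.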

Network Segmentation — The Technical Proof That Your Boundary Holds

Network segmentation is the practice of dividing your infrastructure into isolated zones so that systems in one zone can't communicate with systems in another unless you explicitly allow it.

In the context of SOC2, it's the technical control that proves your out-of-scope systems genuinely can't reach your in-scope systems — not just by policy, but by infrastructure enforcement.

Without network segmentation, the SOC2 auditor can't trust that your boundary is real. A developer in your sandbox environment who can query your production database means the sandbox is effectively in scope, regardless of what your diagram says.

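Here's the Terraform that implements network segmentation between your production and non-production environments. The network access control list (NACL) blocks all inbound traffic from the broader private IP range (10.0.0.0/8) into your in-scope production VPC, while the comment at the top documents the deliberate decision not to create any aws_vpc_peering_connection between environments. NACL rules are evaluated in ascending rule-number order, so the deny at rule 100 takes precedence over the HTTPS allow at rule 200. This assumes the production VPC's own CIDR sits outside 10.0.0.0/8; otherwise the deny rule would also block traffic between production subnets: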

# This account has NO VPC peering to non-production environments.
# The absence of peering is itself the segmentation control.
# Do NOT add peering connections to this account without SOC2 scope review.

resource "aws_network_acl" "deny_non_production" {
  vpc_id = aws_vpc.production.id

  # Block all inbound traffic from non-production IP ranges
  ingress {
    rule_no    = 100
    action     = "deny"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "10.0.0.0/8"
  }

  # Allow legitimate inbound traffic (HTTPS from internet)
  ingress {
    rule_no    = 200
    action     = "allow"
    from_port  = 443
    to_port    = 443
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
  }

  # Allow all outbound (tighten this per your architecture)
  egress {
    rule_no    = 100
    action     = "allow"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "0.0.0.0/0"
  }

  tags = {
    Name        = "production-nacl"
    Environment = "production"
    Purpose     = "SOC2 network segmentation"
  }
}

Verify the segmentation with this command after applying the Terraform:

# Confirm no VPC peering connections exist from production to non-production
aws ec2 describe-vpc-peering-connections \
  --filters Name=status-code,Values=active \
  --query 'VpcPeeringConnections[*].{ID:VpcPeeringConnectionId,Requester:RequesterVpcInfo.VpcId,Accepter:AccepterVpcInfo.VpcId}' \
  --output table
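
To reason about what the NACL above permits before you apply it, it can help to model the rule evaluation. The Python sketch below mirrors the two ingress rules and applies the documented NACL behaviour: rules are checked in ascending rule-number order, the first match wins, and anything unmatched is denied. The evaluate function and the sample addresses are hypothetical. This is a simplified model, not an AWS API.

```python
import ipaddress

# (rule_no, action, protocol, from_port, to_port, cidr), mirroring the ingress rules above.
INGRESS_RULES = [
    (100, "deny", "-1", 0, 0, "10.0.0.0/8"),
    (200, "allow", "tcp", 443, 443, "0.0.0.0/0"),
]


def evaluate(rules, source_ip, protocol, port):
    """Return the action of the lowest-numbered matching rule, else the implicit deny."""
    address = ipaddress.ip_address(source_ip)
    for _rule_no, action, rule_protocol, from_port, to_port, cidr in sorted(rules):
        if address not in ipaddress.ip_network(cidr):
            continue
        # Protocol "-1" means all traffic, so the port range is ignored.
        if rule_protocol != "-1":
            if rule_protocol != protocol or not (from_port <= port <= to_port):
                continue
        return action
    return "deny"


print(evaluate(INGRESS_RULES, "10.20.0.5", "tcp", 443))       # private range
print(evaluate(INGRESS_RULES, "203.0.113.10", "tcp", 443))    # internet HTTPS
print(evaluate(INGRESS_RULES, "203.0.113.10", "tcp", 22))     # internet SSH
print(evaluate(INGRESS_RULES, "203.0.113.10", "tcp", 50000))  # ephemeral port
```

The last case is worth noticing. NACLs are stateless, so replies to outbound connections arrive on ephemeral ports and are dropped unless an ingress rule allows that port range. Whether you need such a rule depends on your architecture.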

The Deliverable: Your SOC2 Boundary Diagram

At the end of weeks 1–2, you need a boundary diagram — a visual document that shows every in-scope system, every out-of-scope system, and the segmentation controls between them.

The diagram should include every AWS service, every data flow arrow, and a label on each segmentation control. It becomes your primary scope evidence and is typically the first thing an auditor asks for.

Weeks 3–6: The 14 Controls That Must Be Active on Day 1

These 14 controls must be implemented and actively collecting evidence from day 1 of your observation period. If you add any of them late, the observation period clock for that control restarts from the implementation date — not from day 1 of the audit period.

Think of the observation period as a surveillance camera recording your infrastructure. The auditor watches the footage later. If the camera was not on when a specific event occurred, that event has no record — and the SOC2 control for it has a gap.
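
To see which controls would leave a hole in the "footage", you can compare each control's activation date with the start of the observation period. The Python sketch below does that. The late_controls function and all dates are made up for illustration.

```python
from datetime import date


def late_controls(observation_start, activation_dates):
    """Map each control activated after the start to its number of uncovered days."""
    return {
        name: (activated - observation_start).days
        for name, activated in activation_dates.items()
        if activated > observation_start
    }


gaps = late_controls(
    date(2026, 6, 1),  # hypothetical first day of the observation period
    {
        "MFA": date(2026, 5, 20),        # active before day 1: no gap
        "CloudTrail": date(2026, 6, 1),  # active on day 1: no gap
        "GuardDuty": date(2026, 6, 15),  # two weeks late
    },
)
print(gaps)
```

An empty result is the goal: every control was already running when the observation period began.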

Control 1: MFA Enforcement (CC6.6)

Multi-Factor Authentication (MFA) requires a user to verify their identity using two independent factors — something they know (a password) and something they have (a phone or hardware key). Without MFA, a stolen password is sufficient to access your production systems.

SOC2 CC6.6 requires that access to systems is restricted to authorized users. MFA is the technical control that makes "authorized" meaningful. Without it, any password compromise is a production access event.

To implement MFA, you can use AWS IAM Identity Center (formerly SSO) connected to your identity provider (Okta, Google Workspace, or Azure AD). MFA is then enforced at the identity provider level — any user without MFA enrolled can't authenticate, regardless of which AWS service they're trying to reach.

# IAM Identity Center configuration: MFA is enforced at the IdP level.
# No IAM user has direct console or CLI access.
# All access goes through SSO sessions (8-hour expiry by default).

# Look up the existing IAM Identity Center instance referenced below.
data "aws_ssoadmin_instances" "this" {}

# Pass the user's email from the identity provider as an access control attribute.
resource "aws_ssoadmin_instance_access_control_attributes" "mfa" {
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]

  attribute {
    key = "email"
    value {
      source = ["$${path:email}"]
    }
  }
}

You can verify that no IAM users retain direct console access (which would bypass MFA):

# Any user listed here has direct console access bypassing SSO — investigate immediately
aws iam list-users \
  --query 'Users[?PasswordLastUsed!=`null`].[UserName,PasswordLastUsed]' \
  --output table
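
If you would rather run this check on a schedule and keep the result as evidence, you can apply the same filter to the JSON output. The Python sketch below assumes input shaped like `aws iam list-users --output json`, where PasswordLastUsed is absent for users who have never signed in with a password. The function name and the sample users are hypothetical.

```python
import json


def users_with_password_logins(list_users_json):
    """Return (UserName, PasswordLastUsed) for users that have signed in with a password."""
    users = json.loads(list_users_json)["Users"]
    return [
        (user["UserName"], user["PasswordLastUsed"])
        for user in users
        if user.get("PasswordLastUsed")
    ]


# Hypothetical sample shaped like `aws iam list-users --output json`.
sample = json.dumps(
    {
        "Users": [
            {"UserName": "legacy-admin", "PasswordLastUsed": "2026-04-30T09:15:00+00:00"},
            {"UserName": "ci-deployer"},
        ]
    }
)
flagged = users_with_password_logins(sample)
print(flagged)
```

An empty list is the outcome you want to record: no IAM user has signed in with a password.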

Control 2: Infrastructure as Code (CC8.1)

Infrastructure as Code (IaC) means defining your cloud infrastructure in version-controlled code files (Terraform, Pulumi, or AWS CDK) rather than creating resources manually through the AWS console. Every infrastructure change is proposed in a pull request, reviewed by a colleague, and applied through an automated pipeline.

SOC2 CC8.1 covers change management — the requirement that every change to your production environment is documented, reviewed, and approved. Manual console changes produce no audit trail. If an engineer opens the AWS console and creates a security group without going through Terraform, that change is invisible to your SOC2 auditor. IaC makes every change reviewable and traceable.

Now let's see how to implement IaC here. This GitHub Actions workflow applies Terraform only from the main branch, after a pull request has been reviewed and approved. The workflow creates an immutable record of every infrastructure change:

# .github/workflows/terraform-apply.yml
name: Terraform Apply (Production)
on:
  push:
    branches: [main]
    paths: ['terraform/**']

permissions:
  id-token: write   # Required for AWS OIDC authentication
  contents: read

jobs:
  apply:
    name: Apply Infrastructure Changes
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval for production

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/terraform-apply
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: "1.6.0"

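      - name: Terraform Plan
        working-directory: terraform   # Matches the paths filter above
        run: |
          terraform init -input=false
          terraform plan -out=tfplan -input=false

      - name: Terraform Apply
        working-directory: terraform
        run: terraform apply -input=false tfplan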

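SOC2 evidence this produces: A GitHub Actions run log for every infrastructure change, showing who triggered it (the person who merged the pull request), when it was applied, and what changed. GitHub keeps run logs for a limited retention period (90 days by default), so export them to long-term storage if your observation period is longer.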

Control 3: CloudTrail Enabled (CC7.1)

AWS CloudTrail is a service that records every API call made in your AWS account — who called it, when, from which IP address, and whether it succeeded. Think of it as the complete audit log of everything that has ever happened in your AWS environment.

SOC2 CC7.1 requires monitoring for security events. CloudTrail is the foundational logging layer — without it, you can't detect unauthorized access, investigate incidents, or prove to an auditor that your controls were operating as intended. An auditor who can't see historical AWS API activity can't verify that your access controls were enforced during the observation period.

To implement it, enable a multi-region trail so that activity in every AWS region is captured, including global services like IAM. Ship the logs to an S3 bucket with Object Lock enabled (the evidence bucket section later in this article shows an Object Lock configuration) so logs can't be modified or deleted:

# Enable CloudTrail with log file validation and multi-region coverage
aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name your-cloudtrail-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events

# Start the trail (creation alone does not start logging)
aws cloudtrail start-logging --name production-audit-trail

# Verify the trail is active and logging
aws cloudtrail get-trail-status --name production-audit-trail \
  --query '{IsLogging:IsLogging,LatestDeliveryTime:LatestDeliveryTime}'
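
If you want to script the verification rather than read the CLI output by eye, a check along these lines may help. It is a minimal sketch: the function name `evaluate_trail` is hypothetical, and it assumes you pass in one entry from `describe_trails` and the matching `get_trail_status` response as plain dictionaries.

```python
def evaluate_trail(trail: dict, status: dict) -> dict:
    """Return a PASS/FAIL summary for one CloudTrail trail.

    trail  -- one item from describe_trails()['trailList']
    status -- the get_trail_status() response for the same trail
    """
    problems = []
    if not trail.get("IsMultiRegionTrail"):
        problems.append("trail is not multi-region")
    if not trail.get("IncludeGlobalServiceEvents"):
        problems.append("global service events are not included")
    if not trail.get("LogFileValidationEnabled"):
        problems.append("log file validation is disabled")
    if not status.get("IsLogging"):
        problems.append("trail exists but is not logging")
    return {
        "trail": trail.get("Name"),
        "status": "FAIL" if problems else "PASS",
        "problems": problems,
    }


# Example input shaped like the API responses (values are illustrative)
example_trail = {
    "Name": "production-audit-trail",
    "IsMultiRegionTrail": True,
    "IncludeGlobalServiceEvents": True,
    "LogFileValidationEnabled": True,
}
print(evaluate_trail(example_trail, {"IsLogging": True}))
```

Checking `IsLogging` separately matters because a trail can exist without recording anything, which is the case the `start-logging` step above guards against.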

Control 4: GuardDuty Enabled (CC7.2)

AWS GuardDuty is a threat detection service that analyses your CloudTrail logs, VPC Flow Logs, and DNS logs. It uses machine learning to identify suspicious behaviour — things like an EC2 instance communicating with a known malware server, an IAM user logging in from an unusual country, or unusual API call patterns that indicate credential theft.

SOC2 CC7.2 requires the use of detection tools to identify potential security events. GuardDuty is the monitoring layer that tells you when something anomalous is happening, not just what happened after the fact. Without it, you would only discover a compromise when the damage is done.

Here's the implementation:

# Enable GuardDuty — findings published every 15 minutes for active threats
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Verify GuardDuty is active
aws guardduty list-detectors --query 'DetectorIds' --output table

You can set up an EventBridge rule to route CRITICAL and HIGH severity GuardDuty findings to your incident response channel immediately. A finding sitting unreviewed for 90 days is a qualified SOC2 finding.
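
One way to express that routing rule is an EventBridge event pattern that matches GuardDuty findings with a severity of 7.0 or higher, which covers GuardDuty's HIGH and CRITICAL ranges. The sketch below only builds the pattern. The function name and the rule name in the comment are hypothetical, and the target (for example an SNS topic wired to your incident channel) is left to you.

```python
import json


def guardduty_event_pattern(min_severity: float = 7.0) -> dict:
    """Build an EventBridge pattern for GuardDuty findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        # EventBridge numeric matching: severity >= min_severity
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


pattern = guardduty_event_pattern()
print(json.dumps(pattern, indent=2))

# To create the rule with boto3 (not executed here):
#   events = boto3.client("events")
#   events.put_rule(Name="guardduty-high-critical",   # hypothetical rule name
#                   EventPattern=json.dumps(pattern))
#   ...then attach your alert destination with events.put_targets(...)
```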

Control 5: VPC Flow Logs (CC6.1)

VPC Flow Logs capture information about the IP traffic flowing through your Virtual Private Cloud — every accepted and rejected connection, including source IP, destination IP, port, protocol, and whether the traffic was allowed or denied. They are the network-level audit trail that CloudTrail doesn't provide.

SOC2 CC6.1 requires logical access controls and monitoring. VPC Flow Logs let you verify that your network segmentation is actually working (traffic you denied is showing as rejected in the logs), detect unexpected communication between services, and investigate security events at the network layer.

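# Create an IAM role that VPC Flow Logs can assume to deliver to CloudWatch
aws iam create-role \
  --role-name vpc-flow-logs-role \
  --assume-role-policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Principal":{"Service":"vpc-flow-logs.amazonaws.com"},
      "Action":"sts:AssumeRole"
    }]
  }'

# Grant the role permission to write to CloudWatch Logs
# (without this, the flow log is created but delivery fails)
aws iam put-role-policy \
  --role-name vpc-flow-logs-role \
  --policy-name vpc-flow-logs-delivery \
  --policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Action":["logs:CreateLogGroup","logs:CreateLogStream","logs:PutLogEvents","logs:DescribeLogGroups","logs:DescribeLogStreams"],
      "Resource":"*"
    }]
  }'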

# Enable VPC Flow Logs for all traffic (ACCEPT and REJECT)
aws ec2 create-flow-logs \
  --resource-ids vpc-YOUR_PRODUCTION_VPC_ID \
  --resource-type VPC \
  --traffic-type ALL \
  --log-group-name /aws/vpc/flow-logs/production \
  --deliver-log-permission-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/vpc-flow-logs-role

# Verify flow logs are active
aws ec2 describe-flow-logs \
  --filter Name=resource-id,Values=vpc-YOUR_PRODUCTION_VPC_ID \
  --query 'FlowLogs[*].{Status:FlowLogStatus,LogGroup:LogGroupName}'
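
To check that denied traffic really shows up as rejected, you can parse the records once they arrive. The following is a minimal sketch that assumes the default (version 2) flow log format with its 14 space-separated fields. The function names are hypothetical, and the sample records are illustrative rather than taken from a real account.

```python
# Field order of the default VPC Flow Logs record format (version 2)
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_flow_log_record(line: str) -> dict:
    """Split one default-format flow log record into a field dictionary."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_connections(lines) -> list:
    """Return the parsed records whose action is REJECT."""
    records = (parse_flow_log_record(line) for line in lines if line.strip())
    return [record for record in records if record["action"] == "REJECT"]


sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
for record in rejected_connections(sample):
    print(record["srcaddr"], "->", record["dstaddr"], "port", record["dstport"])
```

If you configure a custom log format, the field order changes and `FIELDS` would need to match it.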

Control 6: Secrets Manager (CC6.7)

Secrets management means storing credentials (database passwords, API keys, certificates, and other sensitive configuration values) in a dedicated, access-controlled service (like AWS Secrets Manager or HashiCorp Vault) rather than in .env files, GitHub repository secrets, or hardcoded in application code.

SOC2 CC6.7 requires protecting sensitive system components from unauthorized access. A secret stored in an .env file committed to a repository is accessible to every developer with repo access, every CI/CD runner, and every engineer who has ever cloned the repo — including those who have since left the company.

AWS Secrets Manager provides centralised storage, access logging, automatic rotation, and fine-grained IAM permissions so only specific services can retrieve specific secrets.

Let's look at the implementation — storing and rotating a secret:

# Store a database credential with automatic 90-day rotation
aws secretsmanager create-secret \
  --name production/postgresql/credentials \
  --description "Production PostgreSQL credentials — rotated every 90 days" \
  --secret-string '{
    "username": "app_user",
    "password": "REPLACE_WITH_STRONG_PASSWORD",
    "host": "your-rds-endpoint.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "dbname": "production"
  }'

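# Enable automatic rotation every 90 days
# (rotation needs a rotation Lambda function; the ARN below is a placeholder)
aws secretsmanager rotate-secret \
  --secret-id production/postgresql/credentials \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:YOUR_ROTATION_FUNCTION \
  --rotation-rules AutomaticallyAfterDays=90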

How your application retrieves the secret at runtime (no hardcoded credentials):

# Good: secret retrieved at runtime from Secrets Manager
import boto3
import json

def get_db_credentials():
    client = boto3.client('secretsmanager', region_name='us-east-1')
    response = client.get_secret_value(SecretId='production/postgresql/credentials')
    return json.loads(response['SecretString'])

# Bad: secret hardcoded in application code or .env file
DB_PASSWORD = "my_database_password_123"  # Never do this

The access log in CloudTrail records every time a secret is retrieved, by which IAM role, at what time. That log is your SOC2 evidence that secrets access is controlled and auditable.

Control 7: EBS Encryption (CC6.1)

EBS (Elastic Block Store) encryption ensures that the persistent disks attached to your EC2 instances are encrypted at rest using AES-256. If an AWS employee or an attacker gained physical access to the storage hardware, the data would be unreadable without the encryption key.

SOC2 CC6.1 requires protecting information assets from unauthorised access. Encryption at rest is the control that protects data in the event of physical storage compromise or an improperly decommissioned disk. Enabling encryption by default means every new EBS volume in that region is encrypted automatically, including EC2 instance root volumes and EKS node volumes. The setting is per region, so repeat it in every region you use. RDS storage is not covered by this setting: RDS encryption is enabled separately when the database instance is created.

# Enable EBS encryption by default for all new volumes in this region
aws ec2 enable-ebs-encryption-by-default

# Verify it is enabled
aws ec2 get-ebs-encryption-by-default \
  --query 'EbsEncryptionByDefault'
# Expected output: true

# Check existing volumes — any showing false need to be migrated
aws ec2 describe-volumes \
  --query 'Volumes[?Encrypted==`false`].[VolumeId,Size,VolumeType]' \
  --output table

Any existing unencrypted volumes must be snapshot-and-replaced. The process: create a snapshot of the unencrypted volume, create a new encrypted volume from the snapshot, and swap it into the instance.
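
The first two steps of that process can be sketched with boto3. This is a minimal sketch under some assumptions: the function name `create_encrypted_copy` is hypothetical, `ec2` is a client you create yourself with `boto3.client('ec2')`, and the final swap (stop the instance, detach the old volume, attach the new one) is deliberately left as a manual step.

```python
def create_encrypted_copy(ec2, volume_id: str) -> str:
    """Snapshot an unencrypted EBS volume and create an encrypted volume from it.

    ec2 is a boto3 EC2 client. Returns the ID of the new encrypted volume,
    or the original ID if the volume is already encrypted.
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["Encrypted"]:
        return volume_id  # Nothing to migrate

    # Step 1: snapshot the unencrypted volume and wait for it to finish
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Pre-encryption snapshot of {volume_id}",
    )
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Step 2: create an encrypted volume from the snapshot in the same AZ.
    # Provisioned-IOPS volume types (io1, io2) would also need an Iops value.
    new_volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=volume["AvailabilityZone"],
        VolumeType=volume["VolumeType"],
        Encrypted=True,
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume["VolumeId"]])
    return new_volume["VolumeId"]
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before you point it at a real account.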

Control 8: S3 Block Public Access (CC6.1)

Amazon S3 buckets can be configured to allow public access — meaning anyone on the internet can read their contents without authentication. Block Public Access is an account-level and bucket-level setting that prevents any bucket from being made public, regardless of the bucket's own policy.

A misconfigured S3 bucket is one of the most common causes of data breaches in cloud environments. Block Public Access at the account level means a developer can't accidentally expose a bucket containing customer data, even if they set the wrong bucket policy. It's a guardrail, not just a policy.

# Block public access at the AWS account level — applies to all buckets
aws s3control put-public-access-block \
  --account-id YOUR_ACCOUNT_ID \
  --public-access-block-configuration \
    BlockPublicAcls=true,\
    IgnorePublicAcls=true,\
    BlockPublicPolicy=true,\
    RestrictPublicBuckets=true

# Verify account-level setting is active
aws s3control get-public-access-block \
  --account-id YOUR_ACCOUNT_ID

# Scan for any buckets that have public access enabled (should be zero)
aws s3api list-buckets --query 'Buckets[*].Name' --output text | \
  tr '\t' '\n' | while read bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "WARNING: $bucket has public access not fully blocked"
    fi
  done

Control 9: Branch Protection (CC8.1)

Branch protection is a GitHub setting that prevents engineers from pushing code directly to your main branch without going through a pull request that has been reviewed and approved by at least one other team member. It also requires your CI pipeline to pass before any code can be merged.

SOC2 CC8.1 requires change management — the requirement that every change to production systems is documented, reviewed, and approved. Without branch protection, an engineer can push directly to main, which deploys directly to production through your CI/CD pipeline, with no review and no audit trail. Branch protection is the technical enforcement of your change management policy.

The critical setting that most teams miss: the "Do not allow bypassing the above settings" option must be enabled. Without it, administrators can bypass branch protection — and a SOC2 auditor will flag this as a gap because it means your change management control can be circumvented.

# .github/settings.yml — enforces branch protection via code
# Requires the settings GitHub App: https://github.com/apps/settings

branches:
  - name: main
    protection:
      required_pull_request_reviews:
        required_approving_review_count: 1
        dismiss_stale_reviews: true
        require_code_owner_reviews: false
      required_status_checks:
        strict: true
        contexts:
          - "CI / test"
          - "Security / trivy-scan"
      enforce_admins: true         # Admins cannot bypass — this is critical
      restrictions: null           # No push restriction beyond the above
      allow_force_pushes: false
      allow_deletions: false

Here's how you can verify that branch protection is enforced and admins can't bypass it:

# Returns the branch protection rules including enforce_admins status
curl -H "Authorization: token YOUR_GITHUB_TOKEN" \
  https://api.github.com/repos/YOUR_ORG/YOUR_REPO/branches/main/protection \
  | jq '{enforce_admins: .enforce_admins.enabled, required_reviews: .required_pull_request_reviews.required_approving_review_count}'
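
If you would rather turn that response into a pass/fail result, a small check like the one below could work. It is a sketch, assuming you have already parsed the JSON returned by the branch protection endpoint into a dictionary; the function name `check_branch_protection` is hypothetical.

```python
def check_branch_protection(protection: dict, min_reviews: int = 1) -> list:
    """Return a list of gaps in a branch protection response; empty means it passes."""
    gaps = []

    enforce_admins = protection.get("enforce_admins") or {}
    if not enforce_admins.get("enabled"):
        gaps.append("admins can bypass branch protection")

    reviews = protection.get("required_pull_request_reviews") or {}
    if reviews.get("required_approving_review_count", 0) < min_reviews:
        gaps.append(f"fewer than {min_reviews} approving review(s) required")

    if not protection.get("required_status_checks"):
        gaps.append("no required status checks")

    force_pushes = protection.get("allow_force_pushes") or {}
    if force_pushes.get("enabled"):
        gaps.append("force pushes are allowed")

    return gaps


# Example response fragment (illustrative values)
example = {
    "enforce_admins": {"enabled": True},
    "required_pull_request_reviews": {"required_approving_review_count": 1},
    "required_status_checks": {"strict": True, "contexts": ["CI / test"]},
    "allow_force_pushes": {"enabled": False},
}
print(check_branch_protection(example) or "branch protection looks complete")
```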

Control 10: Container Image Scanning (CC7.4)

Container image scanning analyses your Docker images before deployment to identify known security vulnerabilities (CVEs) in the operating system packages and application dependencies they contain.

Trivy is an open-source scanner that checks the base image (Ubuntu, Alpine, and so on), all installed OS packages, and language-specific dependencies (npm, pip, Go modules) against known-vulnerability data drawn from sources such as the National Vulnerability Database and vendor security advisories.

SOC2 CC7.4 requires monitoring and identifying vulnerabilities. Every container you deploy contains a base image with OS packages — and those packages regularly receive CVE disclosures. A critical CVE left unpatched for 90 days in a production container is a SOC2 finding. Automated scanning in CI means every image is checked before it can deploy.

# .github/workflows/security-scan.yml
name: Security Scan
on: [push, pull_request]

jobs:
  trivy-scan:
    name: Container Vulnerability Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Build container image
        run: docker build -t app:${{ github.sha }} .

      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1          # Fail the pipeline on CRITICAL or HIGH findings

      - name: Upload results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        if: always()            # Upload even if scan found issues
        with:
          sarif_file: trivy-results.sarif

The scanner looks for:

CVEs in base image OS packages (for example, a critical OpenSSL vulnerability in your Ubuntu base)
Vulnerable versions of application dependencies (a known RCE in an npm package your app uses)
Misconfigurations in the Dockerfile itself (running as root, using latest tags), if you also run Trivy's configuration scan against the repository

Results appear in the GitHub Security tab for your repository, giving you a historical record of every scan — which is your SOC2 evidence.

Control 11: Incident Response Plan (CC9.2)

An incident response plan is a written, tested procedure that defines exactly what your team does when a security event occurs — from the moment an alert fires through to customer notification and post-incident review.

SOC2 CC9.2 requires that you have a documented process for responding to security events and that you've tested it. The auditor will ask for the written runbook and evidence that a tabletop exercise (a simulated incident walkthrough) has been conducted within the observation period.

Your incident response runbook must include:

Severity classification: Definitions of P1 (production down, customer data at risk), P2 (degraded service, potential risk), and P3 (minor issue, no customer impact) — and the response SLA for each.
Escalation path: Exactly who gets paged at each severity level, with contact details. Not "the on-call engineer" — specific names and a backup if the first person doesn't respond within 10 minutes.
First 15 minutes: The specific steps to take immediately — isolate the affected system, assess the scope, notify the incident channel, begin the timeline log.
Communication templates: Pre-written Slack messages, customer email templates, and regulatory notification templates (GDPR requires notification within 72 hours, HIPAA within 60 days).
Post-incident review: The blameless postmortem process, the 5-why root cause analysis template, and the action item tracking process.
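
The notification windows in the communication templates item are easy to miscount under pressure, so it can help to compute them when the incident timeline starts. A minimal sketch follows, using only the 72-hour and 60-day figures quoted above; the function name is hypothetical, and which deadlines actually apply to you depends on your obligations.

```python
from datetime import datetime, timedelta, timezone

# Notification windows as quoted in the runbook list above
NOTIFICATION_WINDOWS = {
    "GDPR": timedelta(hours=72),
    "HIPAA": timedelta(days=60),
}


def notification_deadlines(discovered_at: datetime) -> dict:
    """Return the latest notification time per regulation for one incident."""
    return {
        name: discovered_at + window
        for name, window in NOTIFICATION_WINDOWS.items()
    }


discovered = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
for name, deadline in notification_deadlines(discovered).items():
    print(f"{name}: notify by {deadline.isoformat()}")
```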

Conduct a tabletop exercise at least once during your observation period: gather your engineering team for 45 minutes, simulate a realistic scenario (for example, "an AWS access key was committed to a public GitHub repo"), and walk through the runbook together. Document the meeting date, attendees, scenario, gaps found, and remediation actions. This document is your evidence.

Control 12: Access Reviews (CC6.3)

An access review is a quarterly audit of who has access to what in your production systems — AWS accounts, GitHub repositories, production databases, and every SaaS tool that touches customer data. You verify that every person on the list still works at the company and still needs the access their role grants them.

SOC2 CC6.3 requires that access is revoked when it's no longer needed. Former employees who retain access to production AWS accounts represent a genuine security risk and a definitive SOC2 finding.

In every access review I've conducted, at least 3–5 former employees or contractors still had active access they should no longer have had.

The quarterly access review checklist:

# 1. IAM users — list all with their last login date
aws iam generate-credential-report
aws iam get-credential-report --output text --query Content \
  | base64 --decode | cut -d',' -f1,5 | column -t -s ','

# 2. IAM roles — find roles that have not been used in 90+ days
aws iam get-account-authorization-details \
  --query 'RoleDetailList[*].{Role:RoleName,LastUsed:RoleLastUsed.LastUsedDate}' \
  --output table

# 3. Verify AWS SSO user list matches your current employee list
aws identitystore list-users \
  --identity-store-id YOUR_IDENTITY_STORE_ID \
  --query 'Users[*].{Name:DisplayName,Email:Emails[0].Value}' \
  --output table

Cross-reference the output against your current employee list in your HR system. Document every change made — access removed, permissions reduced, accounts disabled. The documented changes are the evidence that the review was conducted meaningfully, not just as a checkbox exercise.
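
The cross-referencing step can be sketched as a simple set comparison. This assumes you have the identity store output as a list of `{"Name", "Email"}` dictionaries (the shape produced by the query above) and an export of current employee emails from your HR system; the function name `find_access_to_revoke` is hypothetical.

```python
def find_access_to_revoke(sso_users: list, current_employee_emails) -> list:
    """Return the SSO users whose email is not on the current employee list.

    Users with no email on record are flagged too, since they can't be matched.
    """
    current = {email.strip().lower() for email in current_employee_emails}
    flagged = []
    for user in sso_users:
        email = (user.get("Email") or "").strip().lower()
        if email not in current:
            flagged.append(user)
    return flagged


# Illustrative data only
sso_users = [
    {"Name": "Ada Active", "Email": "ada@example.com"},
    {"Name": "Former Contractor", "Email": "contractor@example.com"},
    {"Name": "No Email On File", "Email": None},
]
employees = ["Ada@Example.com"]

for user in find_access_to_revoke(sso_users, employees):
    print("Review and revoke:", user["Name"])
```

Comparing on lowercased email keeps the match from failing on capitalisation differences between the two systems.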

Control 13: Backup Verification (CC9.5)

Backup verification is the process of actually restoring your backups to confirm they work — not just confirming that backups are being created. A backup that has never been tested doesn't exist from a recovery perspective.

SOC2 CC9.5 requires that recovery procedures are tested. If your production database is corrupted and you discover for the first time during the incident that your automated RDS snapshots can't be restored, you have both a disaster recovery failure and a SOC2 finding.

How to test your RDS backup:

# Step 1: Find your most recent production snapshot
aws rds describe-db-snapshots \
  --db-instance-identifier your-production-db \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text

# Step 2: Restore the snapshot to a test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier backup-verification-test \
  --db-snapshot-identifier YOUR_SNAPSHOT_ID \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible \
  --tags Key=Purpose,Value=backup-verification Key=Environment,Value=test

# Step 3: Wait for the restore to complete (typically 5–15 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier backup-verification-test

# Step 4: Connect and verify data integrity (spot check key tables)
# Run this against the restored instance
psql -h RESTORED_INSTANCE_ENDPOINT -U your_user -d your_database \
  -c "SELECT COUNT(*) FROM users; SELECT MAX(created_at) FROM orders;"

# Step 5: Document the test result and delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier backup-verification-test \
  --skip-final-snapshot

Document the test date, the snapshot used, the restore time, the data verification query results, and who conducted the test. Run this quarterly at minimum. This documentation is your SOC2 evidence for CC9.5.
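
A structured record makes that documentation consistent from quarter to quarter. The sketch below is one possible shape, not a required format: the class name, field names, and sample values are all hypothetical, chosen to mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BackupVerificationRecord:
    """One restore test, with the fields the paragraph above asks you to document."""
    test_date: str              # ISO date of the test
    snapshot_id: str            # Snapshot that was restored
    restore_minutes: float      # Time from restore request to instance available
    verification_results: dict  # Output of the spot-check queries
    conducted_by: str           # Who ran the test
    passed: bool                # Whether the restored data looked correct


def to_evidence_json(record: BackupVerificationRecord) -> str:
    """Serialise the record as stable, human-readable JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Illustrative values only
record = BackupVerificationRecord(
    test_date="2026-03-31",
    snapshot_id="YOUR_SNAPSHOT_ID",
    restore_minutes=11.5,
    verification_results={"users_count": 1842, "latest_order": "2026-03-30T23:58:12Z"},
    conducted_by="jane.doe@example.com",
    passed=True,
)
print(to_evidence_json(record))
```

The resulting JSON file can be uploaded to the same evidence location you use for your other SOC2 artefacts.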

Control 14: Change Management Log (CC8.1)

A change management log is the auditable record of every change made to your production environment — what changed, who approved it, and when it was applied.

SOC2 CC8.1 requires that changes to your production environment are authorized and documented. With IaC and GitOps in place, you already have two separate sources of immutable change history that together satisfy this control.

GitHub Pull Request history provides the record of every code and infrastructure change: who opened the PR, who reviewed and approved it, what the CI status was, and when it was merged. This is your change management log for application and infrastructure changes.

ArgoCD sync history provides the record of every deployment to your Kubernetes cluster: which application was synced, from which Git commit, at what time, and whether the sync succeeded.

To export the ArgoCD sync history as evidence:

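# Export ArgoCD application sync history as JSON evidence
# (argocd app history prints a table, so read the history from the app's status instead)
argocd app get YOUR_APP_NAME -o json | jq '.status.history' > argocd-sync-history-$(date +%Y%m).json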

# Upload to your SOC2 evidence bucket
aws s3 cp argocd-sync-history-$(date +%Y%m).json \
  s3://your-soc2-evidence-bucket/change-management/$(date +%Y/%m)/

# For each deployment, the evidence contains:
# - App name, deployed revision (Git commit SHA)
# - Deployment timestamp
# - Initiating user or automated sync
# - Success/failure status

Together, the GitHub PR history and the ArgoCD sync history give the auditor a complete, tamper-evident record of every change to your production environment during the observation period.

Weeks 7–10: The Evidence Collection Infrastructure

Evidence is the difference between passing and failing SOC2.

You might be wondering: what exactly is evidence? In SOC2 terms, evidence is the documentation that proves a specific control was operating correctly during a specific point in time within the observation period. A policy document says you will do something. Evidence proves you did it — and that you did it continuously, not just the week before the audit.

For example:

For MFA enforcement (Control 1), evidence is a screenshot of your IAM Identity Center MFA settings taken at a specific date during the observation period, combined with an IAM credential report showing zero IAM users with console access.
For GuardDuty (Control 4), evidence is the GuardDuty console screenshot showing active detectors, plus your documented response to any findings during the period.
For access reviews (Control 12), evidence is the completed access review document with dates, names, and specific access changes made.

The challenge is collecting this evidence continuously across 3–12 months without spending hundreds of hours on manual work. The solution is automated evidence collection infrastructure.

The Evidence Bucket — Tamper-Proof Storage for Your Audit Evidence

The evidence bucket is an S3 bucket with Object Lock enabled in GOVERNANCE mode. Object Lock prevents any object from being deleted or modified for the retention period you specify — in this case, 365 days. This means once a piece of evidence is uploaded, it can't be altered, even by a user with administrator access (without explicitly overriding the lock, which itself creates an audit trail).

This tamper-evident property is what gives the auditor confidence that the evidence was not created or modified after the fact.

# terraform/soc2-evidence-bucket.tf

resource "aws_s3_bucket" "soc2_evidence" {
  bucket = "${var.company_name}-soc2-evidence-${var.environment}"
}

# Block all public access to the evidence bucket
resource "aws_s3_bucket_public_access_block" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning so overwrites create new versions, not replacements
resource "aws_s3_bucket_versioning" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Object Lock in GOVERNANCE mode — objects cannot be deleted for 365 days
resource "aws_s3_bucket_object_lock_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    default_retention {
      mode = "GOVERNANCE"
      days = 365
    }
  }
}

# Encrypt all evidence at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

The Daily Evidence Collector Lambda

This Lambda function runs automatically every day and exports the status of each critical control to a time-stamped JSON file in the evidence bucket. Over your 3–12 month observation period, it creates a daily record proving that your controls were active and operating.

The function as shown checks five controls automatically: CloudTrail, GuardDuty (including the count of unresolved HIGH and CRITICAL findings), VPC Flow Logs, EBS encryption by default, and S3 Block Public Access. Each daily snapshot is uploaded with an Object Lock retention date so it can't be modified.

# lambda/evidence-collector/handler.py

import boto3
import json
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    """
    Daily SOC2 evidence collector.
    Runs at 00:00 UTC every day via EventBridge scheduler.
    Exports control status to S3 evidence bucket with Object Lock.
    """
    evidence = {
        'collection_timestamp': datetime.now(timezone.utc).isoformat(),
        'collection_date': datetime.now(timezone.utc).strftime('%Y-%m-%d'),
        'account_id': boto3.client('sts').get_caller_identity()['Account'],
        'controls': {}
    }

    # Control 3: CloudTrail status
    cloudtrail = boto3.client('cloudtrail')
    trails = cloudtrail.describe_trails(includeShadowTrails=False)['trailList']
    multi_region_trails = [t for t in trails if t.get('IsMultiRegionTrail')]
    evidence['controls']['cloudtrail'] = {
        'status': 'PASS' if multi_region_trails else 'FAIL',
        'detail': f"{len(multi_region_trails)} multi-region trail(s) active",
        'trails': [t['Name'] for t in multi_region_trails]
    }

    # Control 4: GuardDuty status
    guardduty = boto3.client('guardduty')
    detectors = guardduty.list_detectors()['DetectorIds']
    unresolved_critical = 0
    for detector_id in detectors:
        findings = guardduty.list_findings(
            DetectorId=detector_id,
            FindingCriteria={
                'Criterion': {
                    'severity': {'Gte': 7},  # HIGH and CRITICAL only
                    'service.archived': {'Eq': ['false']}
                }
            }
        )
        unresolved_critical += len(findings['FindingIds'])

    evidence['controls']['guardduty'] = {
        'status': 'PASS' if detectors else 'FAIL',
        'detail': f"{len(detectors)} detector(s) active, {unresolved_critical} unresolved HIGH/CRITICAL findings",
        'unresolved_high_critical': unresolved_critical
    }

    # Control 5: VPC Flow Logs
    ec2 = boto3.client('ec2')
    flow_logs = ec2.describe_flow_logs(
        Filters=[{'Name': 'resource-type', 'Values': ['VPC']},
                 {'Name': 'flow-log-status', 'Values': ['ACTIVE']}]
    )['FlowLogs']
    evidence['controls']['vpc_flow_logs'] = {
        'status': 'PASS' if flow_logs else 'FAIL',
        'detail': f"{len(flow_logs)} active VPC flow log(s)",
        'active_flow_logs': len(flow_logs)
    }

    # Control 7: EBS encryption by default
    ebs_encryption = ec2.get_ebs_encryption_by_default()['EbsEncryptionByDefault']
    evidence['controls']['ebs_encryption_by_default'] = {
        'status': 'PASS' if ebs_encryption else 'FAIL',
        'detail': 'EBS encryption by default is enabled' if ebs_encryption else 'EBS encryption by default is NOT enabled'
    }

    # Control 8: S3 Block Public Access (account level)
    s3control = boto3.client('s3control')
    account_id = boto3.client('sts').get_caller_identity()['Account']
    try:
        pab = s3control.get_public_access_block(AccountId=account_id)['PublicAccessBlockConfiguration']
        all_blocked = all([pab['BlockPublicAcls'], pab['IgnorePublicAcls'],
                           pab['BlockPublicPolicy'], pab['RestrictPublicBuckets']])
        evidence['controls']['s3_block_public_access'] = {
            'status': 'PASS' if all_blocked else 'FAIL',
            'detail': 'All four S3 Block Public Access settings enabled' if all_blocked else 'One or more S3 Block Public Access settings not enabled',
            'configuration': pab
        }
    except Exception as e:
        evidence['controls']['s3_block_public_access'] = {'status': 'FAIL', 'detail': str(e)}

    # Upload evidence to S3 with Object Lock
    s3 = boto3.client('s3')
    evidence_key = f"daily/{evidence['collection_date']}/control-status.json"
    lock_until = datetime.now(timezone.utc) + timedelta(days=365)

    s3.put_object(
        Bucket='YOUR_EVIDENCE_BUCKET_NAME',
        Key=evidence_key,
        Body=json.dumps(evidence, indent=2),
        ContentType='application/json',
        ObjectLockMode='GOVERNANCE',
        ObjectLockRetainUntilDate=lock_until
    )

    # Alert if any control fails
    failed_controls = [k for k, v in evidence['controls'].items() if v['status'] == 'FAIL']
    if failed_controls:
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='YOUR_ALERT_TOPIC_ARN',
            Subject=f'SOC2 Control Failure Detected — {evidence["collection_date"]}',
            Message=f'The following controls failed their daily check:\n\n{json.dumps(failed_controls, indent=2)}'
        )

    return {
        'statusCode': 200,
        'controls_checked': len(evidence['controls']),
        'controls_failed': len(failed_controls),
        'evidence_location': f"s3://YOUR_EVIDENCE_BUCKET_NAME/{evidence_key}"
    }
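
The handler as listed doesn't include an MFA check. One way to add it, sketched here under assumptions, is to parse the IAM credential report and list users who have a console password but no MFA device. The function name `users_without_mfa` is hypothetical; `user`, `password_enabled`, and `mfa_active` are columns of the credential report CSV.

```python
import csv
import io


def users_without_mfa(credential_report_csv: str) -> list:
    """Return IAM user names that have a console password but no active MFA."""
    reader = csv.DictReader(io.StringIO(credential_report_csv))
    return [
        row["user"]
        for row in reader
        if row["password_enabled"] == "true" and row["mfa_active"] != "true"
    ]


# Inside the Lambda, the CSV text could come from:
#   iam = boto3.client("iam")
#   iam.generate_credential_report()   # may need a retry until the report is ready
#   csv_text = iam.get_credential_report()["Content"].decode("utf-8")

# Trimmed, illustrative report with only the columns used here
sample_report = (
    "user,arn,password_enabled,mfa_active\n"
    "alice,arn:aws:iam::111122223333:user/alice,true,false\n"
    "bob,arn:aws:iam::111122223333:user/bob,true,true\n"
    "ci-bot,arn:aws:iam::111122223333:user/ci-bot,false,false\n"
)
print(users_without_mfa(sample_report))
```

An empty list maps naturally to a `PASS` entry in the `evidence['controls']` dictionary, and anything else to `FAIL` with the user names as detail.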

The GitHub Actions Evidence Workflow

This workflow runs daily and captures evidence that can't be automated through AWS APIs — GitHub-level controls like branch protection status, recent pull request activity, and CI pipeline results. It exports these as JSON files to the same evidence bucket.

# .github/workflows/soc2-evidence.yml
name: SOC2 Evidence Collection
on:
  schedule:
    - cron: '0 1 * * *'   # 01:00 UTC daily (after the Lambda runs at 00:00)
  workflow_dispatch:        # Allow manual trigger when needed

permissions:
  id-token: write   # Required for AWS OIDC authentication
  contents: read

jobs:
  collect-github-evidence:
    name: Collect GitHub Control Evidence
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/evidence-collector
          aws-region: us-east-1

      - name: Collect branch protection status
        run: |
          DATE=$(date +%Y-%m-%d)
          mkdir -p evidence/github

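          # Export branch protection rules for main
          # Note: this endpoint needs a token with read access to repository administration.
          # If the default GITHUB_TOKEN is rejected, substitute a fine-grained token stored as a secret.
          curl -s -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \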
            "https://api.github.com/repos/${{ github.repository }}/branches/main/protection" \
            | jq '{
                date: "'$DATE'",
                enforce_admins: .enforce_admins.enabled,
                required_reviews: .required_pull_request_reviews.required_approving_review_count,
                required_status_checks: .required_status_checks.contexts,
                allow_force_pushes: .allow_force_pushes.enabled
              }' > evidence/github/branch-protection-$DATE.json

          echo "Branch protection evidence collected"
          cat evidence/github/branch-protection-$DATE.json

      - name: Upload evidence to S3
        run: |
          DATE=$(date +%Y-%m-%d)
          aws s3 sync evidence/ \
            s3://${{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/$DATE/github/ \
            --no-progress
          echo "Evidence uploaded: s3://${{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/$DATE/github/"

Weeks 11–14: Auditor Selection and Readiness Assessment

How to Choose a SOC2 Auditor

Selecting the right auditor is more consequential than most teams realise. SOC2 audits are conducted by CPA firms, specifically firms licensed to issue SOC reports. The right firm has experience with cloud-native SaaS companies of your size. The wrong firm could apply enterprise audit frameworks to a seed-stage startup and generate findings based on controls that aren't appropriate to your context.

Here is what to look for and what to watch out for:

Experience matters more than brand

A large Big Four firm isn't necessarily better than a specialist boutique auditor for a 20-person SaaS company.

Ask specifically: "How many SOC2 audits have you completed in the last 12 months for SaaS companies between 10 and 50 employees?" You want a firm where this is common, not exceptional.

Verify familiarity with your compliance tool

If you're using Vanta or Drata, confirm that the auditor has experience with evidence produced by those platforms. Some auditors prefer to collect evidence directly and are unfamiliar with automated evidence exports. An auditor who doesn't trust your Vanta evidence will ask you to re-collect everything manually.

Understand what Type II actually costs

For a Series A SaaS company, expect $15,000–$30,000 for a SOC2 Type II audit with a 3-month observation period. A quote below $10,000 often means the auditor is cutting corners on the review depth. A quote above $50,000 for a small company typically means the firm is applying enterprise pricing to a startup engagement.

Get references from similar companies

Ask the auditor for two or three references from SaaS companies they've audited in the last year. Call those references and ask: did the auditor understand cloud infrastructure? Were the findings reasonable? How was the communication during the review?

Here's a summary table of some things to watch out for:

Criteria	What to Look For	Red Flag
Experience	5+ years, 20+ SaaS audits annually	"We have completed several SOC2 audits" (vague)
Tool familiarity	Has reviewed Vanta/Drata evidence before	Requires manual re-collection of automated evidence
Company size fit	Has audited companies your size	Only lists enterprise clients as references
Cost (Type II)	$15K–$30K for a 20-person company	Under $10K or over $50K without clear justification
References	Can provide SaaS company contacts to call	Cannot provide references

How to Run a Readiness Assessment (Mock Audit)

A readiness assessment is a self-conducted simulation of the real audit, run 2–4 weeks before you engage the auditor. Its purpose is to find and close gaps before the auditor finds them, because gaps found in a mock audit cost you a week of remediation time, while gaps found in the real audit cost you a conditional report and a re-review.

You can run the readiness assessment yourself or hire a consultant to run it. The consultant approach is more valuable because an independent reviewer will find gaps you have rationalised away.

The process:

Step 1: Work through every control in the checklist below and attempt to produce the evidence that an auditor would request.
Step 2: Any control for which you can't produce clear, timestamped evidence is a gap. Document it.
Step 3: Prioritise gaps by type. Evidence gaps (missing evidence for an active control) require evidence collection infrastructure fixes. Control gaps (a control that isn't implemented) require engineering work.
Step 4: Close all gaps before engaging the real auditor.

Control	Evidence Required	How to Verify	Ready?
MFA enforced	IAM credential report + SSO MFA policy screenshot	`aws iam get-credential-report`	⬜
CloudTrail active	Trail status + S3 delivery confirmation	`aws cloudtrail get-trail-status`	⬜
GuardDuty active	Detector list + finding review log	`aws guardduty list-detectors`	⬜
VPC Flow Logs	Active flow log list + sample log entries	`aws ec2 describe-flow-logs`	⬜
Secrets in Secrets Manager	Secret list + rotation policy confirmation	`aws secretsmanager list-secrets`	⬜
EBS encryption by default	Account-level encryption setting	`aws ec2 get-ebs-encryption-by-default`	⬜
S3 Block Public Access	Account-level PAB configuration	`aws s3control get-public-access-block`	⬜
Branch protection (no admin bypass)	GitHub branch protection API response	GitHub API or Settings UI	⬜
Trivy scanning in CI	GitHub Actions run history showing scans	GitHub Actions logs	⬜
Incident response runbook	Written runbook + tabletop exercise notes with date	Document review	⬜
Access review	Quarterly review document with specific changes made	Document review	⬜
Backup test	RDS restore log + data verification results	Document review	⬜
Change management log	GitHub PR history + ArgoCD sync history	GitHub and ArgoCD	⬜
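
Steps 2 and 3 of the process can be tracked alongside the checklist with a small script. The following is a minimal sketch, not part of the original tooling: the `ChecklistItem` and `Gap` names are hypothetical, and the two boolean fields are assumptions about how you might record each row's status.

```python
from dataclasses import dataclass
from enum import Enum


class Gap(Enum):
    NONE = "ready"
    EVIDENCE = "evidence gap"  # control is active, but proof is missing
    CONTROL = "control gap"    # control is not implemented at all


@dataclass
class ChecklistItem:
    control: str
    implemented: bool
    has_timestamped_evidence: bool

    def gap(self) -> Gap:
        """Classify the row the way Step 3 describes."""
        if not self.implemented:
            return Gap.CONTROL  # needs engineering work
        if not self.has_timestamped_evidence:
            return Gap.EVIDENCE  # needs evidence collection fixes
        return Gap.NONE


items = [
    ChecklistItem("MFA enforced", implemented=True, has_timestamped_evidence=True),
    ChecklistItem("VPC Flow Logs", implemented=True, has_timestamped_evidence=False),
    ChecklistItem("Backup test", implemented=False, has_timestamped_evidence=False),
]

for item in items:
    print(f"{item.control}: {item.gap().value}")
```

Sorting the output by gap type gives a remediation list that separates engineering work from evidence collection fixes.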

The one thing most teams skip: Running the readiness assessment against their own evidence bucket. Pull a random day's evidence from the daily Lambda export and verify that it's complete, timestamped, and accurately reflects the control status on that day.

If the evidence file for December 14th shows GuardDuty as PASS but GuardDuty was actually disabled that day, the auditor will find the discrepancy in the AWS account history — and that's a qualified finding.
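
A spot check like this can be sketched in Python. The record layout below (a `date` field plus one entry per control, each with a `status` and a `collected_at` timestamp) is an assumption for illustration, not the actual format of the Lambda export, and the control names are placeholders.

```python
from datetime import date, datetime

# Hypothetical control names; replace with the controls your collector exports.
REQUIRED_CONTROLS = {"cloudtrail", "guardduty", "vpc_flow_logs"}


def check_evidence_record(record: dict, expected_day: date) -> list[str]:
    """Return a list of problems found in one day's evidence record."""
    problems = []

    # The record must be stamped with the day it claims to cover.
    if record.get("date") != expected_day.isoformat():
        problems.append(
            f"date field is {record.get('date')!r}, expected {expected_day.isoformat()}"
        )

    controls = record.get("controls", {})
    for name in sorted(REQUIRED_CONTROLS - set(controls)):
        problems.append(f"missing control: {name}")

    for name, entry in sorted(controls.items()):
        stamp = entry.get("collected_at")
        try:
            collected = datetime.fromisoformat(stamp)
        except (TypeError, ValueError):
            problems.append(f"{name}: missing or invalid collected_at")
            continue
        if collected.date() != expected_day:
            problems.append(f"{name}: collected on {collected.date()}, not {expected_day}")

    return problems
```

This only checks completeness and timestamps. Confirming that a recorded PASS matches what actually happened in the account still requires comparing against the AWS account history.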

Weeks 15–18: The Observation Period

How the Auditor Observes Your Controls

The SOC2 auditor doesn't physically visit your office or sit inside your AWS console watching your infrastructure in real time. The audit is a remote, documentation-based process conducted entirely through evidence review.

Here is how it actually works:

First, the auditor provides a list of evidence requests — typically 80–150 items for a Type II audit. You upload the evidence to a shared portal (the auditor provides this — it is usually a secure document sharing platform). The auditor reviews the evidence, asks follow-up questions, and identifies gaps where evidence is missing or a control wasn't operating as described.

For automated controls like CloudTrail and GuardDuty, the evidence is your daily Lambda exports — the auditor spot-checks a sample of daily snapshots across the observation period to verify the controls were consistently active.

For manual controls like access reviews and backup tests, the evidence is the documents you produced when you ran those processes.

The practical implication: the auditor is trusting your evidence. This is why the Object Lock on your evidence bucket matters. It proves to the auditor that the evidence was generated at the time it claims to have been generated and hasn't been modified since.

What the Auditor Reviews Over the Observation Period

What They Check	How Often	What They Are Looking For
CloudTrail logs	Spot check monthly	Manual console changes that bypassed IaC, gaps in log delivery
GuardDuty findings	Review quarterly summary	HIGH or CRITICAL findings not remediated within your documented SLA
Access review completion	Verify each quarterly cycle	Reviews skipped, reviews with no access changes despite employee turnover
Incident response tests	Verify annually	No tabletop exercise conducted during the observation period
Evidence collection	Verify continuous coverage	Gaps in daily evidence exports, missing evidence for specific dates
Change management log	Sample PR/sync history	Deployments with no associated pull request or review

What Triggers a Finding

A SOC2 finding is the auditor's documented conclusion that a control wasn't operating effectively during the observation period. Findings range from observations (minor issues that don't affect the audit opinion) to qualified opinions (material failures that result in a qualified rather than unqualified report).

Understanding what triggers findings — and which ones restart the observation period — is critical for managing your audit timeline.

Control gaps occur when a required control isn't implemented or was disabled during the observation period. If you discover in month 2 that MFA wasn't enforced on one IAM user for the first three weeks, you must document the remediation and demonstrate the gap was closed.

Whether this restarts your observation period depends on how long the gap lasted and how the auditor assesses the risk — but a gap of less than 30 days that's immediately remediated and documented typically doesn't restart the clock.

Evidence gaps are more serious. If your daily Lambda evidence collector failed for two weeks and produced no evidence exports, you have a two-week window with no documented proof that your controls were operating. The auditor can't verify controls they can't see evidence for.

Evidence gaps almost always require extending the observation period because there's no way to retroactively produce evidence for a period that wasn't recorded.

Process failures occur when a manual control wasn't executed as documented. The most common is an access review that was skipped. Like control gaps, these can typically be remediated without restarting the clock if they're documented promptly and the remediation is clear.

Unpatched critical CVEs are a special case. If Trivy identifies a CRITICAL vulnerability in a production container and it remains unpatched for more than your documented remediation SLA (typically 30 days for critical, 90 days for high), this is a qualified finding that the auditor will note in the report.
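
The SLA arithmetic is simple enough to automate. Here is a minimal sketch that uses the typical 30-day and 90-day values quoted above; the function names are hypothetical, and your documented SLA may use different numbers.

```python
from datetime import date, timedelta

# Remediation SLAs in days (typical values; use your own documented policy).
REMEDIATION_SLA_DAYS = {"CRITICAL": 30, "HIGH": 90}


def sla_deadline(severity: str, detected_on: date) -> date:
    """Date by which a finding of this severity must be patched."""
    return detected_on + timedelta(days=REMEDIATION_SLA_DAYS[severity])


def is_sla_breached(severity: str, detected_on: date, today: date) -> bool:
    """True if a still-unpatched finding has outlived its remediation SLA."""
    return today > sla_deadline(severity, detected_on)


# A CRITICAL finding detected on 1 March must be patched by 31 March.
print(sla_deadline("CRITICAL", date(2024, 3, 1)))
```

Running a check like this against open scanner findings each day surfaces vulnerabilities that are approaching their deadline before they become a finding in the report.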

How to Close Gaps Without Restarting the Clock

When you discover a gap during the observation period:

For control gaps:

1. Fix the control immediately — don't wait
2. Document the fix: screenshot, PR link, or CLI command output with timestamp
3. Note the gap date range in your audit log: "Control gap: 2024-03-10 to 2024-03-14 (4 days). Root cause: [X]. Remediated: [Y]. No customer data accessed during gap period."
4. Notify your auditor proactively — they will find it anyway; proactive disclosure is better than defensive explanation
5. The observation period doesn't restart if the gap was short-lived and promptly remediated

For evidence gaps:

1. Fix the evidence collection infrastructure immediately
2. Understand that you can't retroactively generate evidence for the gap period
3. The observation period for affected controls effectively restarts from the date evidence collection resumed
4. If the gap is early in your observation period, you may be able to extend the period rather than restart — discuss with your auditor

Pro tip: Set up a CloudWatch alarm that triggers if the evidence Lambda fails to deliver to S3 on schedule. That way, a missing daily evidence file is caught within 24 hours rather than discovered during the audit review.
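
The alarm configuration depends on how your collector is deployed, so it isn't shown here. As a complementary check, the sketch below scans a list of object keys for days with no evidence. It assumes the `daily/<YYYY-MM-DD>/...` key layout used by the upload step earlier; `missing_evidence_days` is a hypothetical helper, and in practice the key list would come from an S3 listing.

```python
from datetime import date, timedelta


def missing_evidence_days(keys: list[str], start: date, end: date) -> list[date]:
    """Return every day in [start, end] with no object under daily/<YYYY-MM-DD>/."""
    covered = set()
    for key in keys:
        parts = key.split("/")
        # Expect keys shaped like daily/2024-03-10/github/branch-protection.json
        if len(parts) >= 3 and parts[0] == "daily":
            try:
                covered.add(date.fromisoformat(parts[1]))
            except ValueError:
                continue  # Not a date-named prefix; ignore it.

    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in all_days if day not in covered]


keys = [
    "daily/2024-03-10/github/branch-protection-2024-03-10.json",
    "daily/2024-03-11/github/branch-protection-2024-03-11.json",
    "daily/2024-03-13/github/branch-protection-2024-03-13.json",
]
print(missing_evidence_days(keys, date(2024, 3, 10), date(2024, 3, 13)))
```

Running this over the whole observation period before the audit gives you the same view of coverage that the auditor will have.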

The 90-Day SOC2 Timeline at a Glance

Weeks	Focus	Key Deliverables	Common Mistake
1–2	Scope	Boundary diagram, network segmentation Terraform	Over-scoping to include dev and staging
3–6	Controls	14 controls implemented and collecting evidence	Starting controls after the observation period begins
7–10	Evidence	S3 evidence bucket, Lambda daily collector, GitHub Actions workflow	Manual evidence collection with inevitable gaps
11–14	Readiness	Mock audit, gap remediation, auditor selected	Skipping the mock audit
15–18	Observation	Daily evidence, quarterly reviews, incident response test	Discovering evidence gaps during the audit rather than before

What's Next?

Start with Week 1. Define your SOC2 boundary. Apply the four-question framework to every system in your infrastructure. Draw the diagram in Excalidraw. Document the network segmentation controls.

Then implement the 14 controls in order, starting with MFA and CloudTrail — the two that most commonly fail audits when they're missing.

Then build your evidence collection infrastructure before the observation period starts. The automated Lambda and GitHub Actions workflow are the difference between a smooth audit and a 60-day extension.

One thing to remember: SOC2 is 20% controls, 30% evidence, and 50% continuous operation. Start early. Automate everything. Run a mock audit before you call the real one.

Resources

The following resources are referenced throughout this guide:

AICPA SOC2 Overview — The official SOC2 documentation from the American Institute of CPAs, including the Trust Service Criteria
Vanta — Compliance automation platform that connects to AWS and GitHub to automate evidence collection and track control status
Drata — Alternative compliance automation platform with similar capabilities to Vanta
Trivy by Aqua Security — Open-source container and filesystem vulnerability scanner used in Control 10
Excalidraw — Free, open-source diagram tool for creating the SOC2 boundary diagram
AWS IAM Identity Center documentation — Official AWS documentation for setting up SSO and MFA enforcement
GitHub branch protection documentation — Official GitHub documentation for configuring branch protection rules
ArgoCD documentation — Official ArgoCD documentation for GitOps deployment and sync history

Ayobami Adejumo is a senior platform engineer and FinOps specialist. He writes about SOC2 compliance engineering, Kubernetes cost optimization, and platform engineering.

How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway

Rakshath Naik — Thu, 30 Apr 2026 05:06:15 +0000

In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.

While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.

In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.

The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.

Prerequisites
Building the Brain: The Model
Deploying the Model to AWS
How to Run The Project Locally
Our Project Architecture
Conclusion: The Power of Serverless AI
Acknowledgment / References

1. Prerequisites

Fundamental skills: Basic proficiency in Python and understanding of Machine Learning concepts like classification.
AWS account: Access to an AWS account with permissions for Lambda, S3, and API Gateway.
Environment: Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.
AWS CLI: Configured on your local machine for file uploads.
HuggingFace account: You can directly download the model from my account.

2. Building the Brain: The Model

Photo by Steve A Johnson on Unsplash

At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.

1. Vectorization: Turning Text into Math

Machine Learning models can't read text. They require numerical input. To solve this, we used the TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

Here's the mathematical formula:

$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$

TF-IDF term definitions:

wᵢ,ⱼ (Weight): The final importance score of a specific word in a document.
tfᵢ,ⱼ (Term Frequency): How often a word appears in a single email.
N (Total Documents): The total count of all emails in your dataset.
dfᵢ (Document Frequency): The number of different emails that contain this specific word.
log(N/dfᵢ) (IDF): A penalty that lowers the score of common words like the or is that appear everywhere.

The vectorizer cleans the data by removing common English stop words, converts all text to lowercase for consistency, and assigns more importance to rare and meaningful words while giving less importance to frequently used words.
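
To make the formula concrete, here is a small worked example in plain Python. It implements the textbook formula above on a toy corpus invented for illustration. Note that scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization by default, so its numbers will differ from this sketch.

```python
import math


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """w = tf * log(N / df), following the formula above."""
    tf = document.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # Term never appears, so it carries no weight.
    return tf * math.log(len(corpus) / df)


# Toy corpus of four tokenized messages (made up for this example).
corpus = [
    ["win", "a", "free", "prize"],
    ["are", "we", "free", "for", "lunch"],
    ["lunch", "at", "noon"],
    ["see", "you", "at", "noon"],
]

print(tf_idf("win", corpus[0], corpus))   # appears in 1 of 4 documents: log(4)
print(tf_idf("free", corpus[0], corpus))  # appears in 2 of 4 documents: log(2)
```

The rarer word "win" gets a higher weight than "free", which is exactly the behaviour the IDF term is meant to produce.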

2. Training: The Logistic Regression Engine

We'll use Logistic Regression here, a classification algorithm that predicts the probability of an outcome.

In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the Spam or Ham label.

During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like winner or free correlate highly with spam, while conversational language correlates with legitimate messages.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)

In our case, it calculates the probability that an email is spam or ham.

The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.

$$P(y=1|x) = \frac{1}{1 + e^{-(z)}}$$

where z = β₀ + β₁x₁ + … + βₙxₙ.
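
The sigmoid step can be reproduced in a few lines of plain Python. This is an illustrative sketch with made-up weights, not the trained model's parameters, and `positive_class_probability` is a hypothetical helper name.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def positive_class_probability(features: list[float], weights: list[float], bias: float) -> float:
    """P(y=1 | x) for the linear score z = bias + sum(weight * feature)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Made-up numbers: two TF-IDF features, two learned weights, and a bias term.
print(positive_class_probability([1.0, 2.0], [0.5, -0.25], 0.0))  # z = 0, so 0.5
```

A score of z = 0 sits exactly on the decision boundary. Large positive scores push the probability toward 1 and large negative scores push it toward 0. This simple form can overflow for extremely negative z, which library implementations guard against.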

3. Evaluation: Testing the Intelligence

After training, we need to verify if the brain actually works on data it hasn't seen before.

from sklearn.metrics import accuracy_score
X_test_features = feature_extraction.transform(X_test)
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

By comparing the model’s predictions against the actual labels in our test set, we calculate an Accuracy Score. This gives us the confidence that the model is ready for the real world (achieving ~94% accuracy in our tests).

4. Exporting the Logic (Serialization)

To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).

import joblib

joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')

We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.

We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.

The trained Logistic Regression model and TF-IDF vectorizer are openly available for the community on Hugging Face here: Get the model on HuggingFace.

3. Deploying the Model to AWS

Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and incurs almost no maintenance cost.

1. Model Storage: Amazon S3

First, we'll upload our .pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. This makes the system highly maintainable.

2. The Production Backend: AWS Lambda

To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.

The deployment environment is AWS Lambda (Python 3.11). The Lambda Python runtime doesn't include Scikit-Learn or Joblib, so we'll package them as a ZIP file, store it in our S3 bucket, and attach it to the function as a Lambda layer.

Commands in AWS CLI:


# 1. Create a workspace
mkdir ml_layer && cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
    --platform manylinux2014_x86_64 \
    --target=python/lib/python3.11/site-packages \
    --implementation cp \
    --python-version 3.11 \
    --only-binary=:all: \
    scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (Using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/

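We store the Scikit-Learn library as a ZIP in S3 because it is too large to upload directly: Lambda caps direct uploads at 50 MB (zipped), so larger archives have to be referenced from S3. The unzipped function code plus all of its layers must still fit within Lambda's 250 MB limit. Keeping the libraries in a layer also keeps heavy dependencies out of the function's own deployment package.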

The Lambda Function:


import json
import boto3
import os
import sys
from io import BytesIO

# Ensure the custom Lambda layer (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME' 
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping model in RAM)
model = None
vectorizer = None

def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))
            
            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")

def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()
        
        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)
            
        text = body.get('text', '')
            
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])
        
        # 2. Predict using the Logistic Regression Model 
        prediction = int(model.predict(data_vec)[0])
        
        # 3. Map numeric result to human-readable label (1 = HAM, anything else = SPAM)
        result_label = "HAM" if prediction == 1 else "SPAM"
        
        # RESPONSE WITH CORS
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*' # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
        
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }

Key features of the Lambda function:

Warm start caching: By defining the model and vectorizer variables outside the lambda_handler, we keep them in the container's memory between invocations. Only the first (cold) request in a container pays the cost of loading from S3, while subsequent warm requests reuse the cached objects.
Dynamic dependency loading: The sys.path.append('/opt/python') line makes sure the layer's directory is on the import path, so heavy libraries can live in a layer instead of the function's own upload package.
Bimodal input handling: The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.
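
The bimodal input handling can be isolated into a small helper to make it easier to test. This sketch mirrors the logic in the handler above; `extract_text` is a hypothetical name, and the extra guard for a non-dict body is an addition that the original function doesn't have.

```python
import json


def extract_text(event: dict) -> str:
    """Pull the 'text' field from either a direct test event or a proxy event."""
    # API Gateway proxy integration delivers the payload as a JSON string under
    # 'body'; a console test event can pass the payload dict directly.
    body = event.get("body", event)
    if isinstance(body, str):
        body = json.loads(body)
    if not isinstance(body, dict):
        return ""  # e.g. a proxy event that arrived with an empty body
    return body.get("text", "")


# Direct console test event
print(extract_text({"text": "WIN free iPhone now"}))
# API Gateway proxy event
print(extract_text({"body": json.dumps({"text": "WIN free iPhone now"})}))
```

Both calls return the same string, which is what lets the same handler serve console tests and real HTTP requests.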

3. The API Gateway - The Bridge to the Web

Photo by Growtika on Unsplash

Creating the REST API

Next, we'll create a REST API with a single POST method. Why POST? Because we need to send a JSON payload containing the user’s text message in the request body.

First, navigate to the Amazon API Gateway console and select Create API -> REST API.
Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.
Then, in the left sidebar, click Resources and enter a resource name (for example, /predict, which is what I used).
Next, click Create method, select POST, and choose Lambda Function as the integration type.
Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).

The CORS Configuration (The Troubleshooting Hub)

This is where many developers encounter the dreaded Connection Error. Our API is hosted on AWS, so if your front-end is served from a different origin, the browser’s Same-Origin Policy will block the request by default.

To fix this, we'll enable CORS:

Access-Control-Allow-Origin: Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.
The OPTIONS method: When you enable CORS in the console, API Gateway creates an OPTIONS method automatically. This handles the preflight request, where the browser asks, “Are you allowed to receive data from me?” before sending the actual text.
Access-Control-Allow-Headers: In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the API Gateway doesn't reject it.

Image illustrates the CORS configuration for our project. (Image by author)
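
With Lambda proxy integration, the function's own response also has to carry the Access-Control-Allow-Origin header, including on error paths. Otherwise the browser reports a CORS failure instead of the real error. One way to keep this consistent is a small response builder. This is a sketch, not part of the original function: `build_response` and `CORS_HEADERS` are hypothetical names.

```python
import json

# '*' mirrors the setting used in this project; a specific origin is tighter.
CORS_HEADERS = {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*",
}


def build_response(status_code: int, payload: dict) -> dict:
    """Build a Lambda proxy response that always carries the CORS headers."""
    return {
        "statusCode": status_code,
        "headers": dict(CORS_HEADERS),  # copy so callers can't mutate the shared dict
        "body": json.dumps(payload),
    }


print(build_response(400, {"error": "No text provided."}))
```

Using one builder for the 200, 400, and 500 paths means the front-end can always read the JSON body, whatever the outcome.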

Deployment Stages

Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure (the stage name followed by the resource path): https://[api-id].execute-api.[region].amazonaws.com/prod/predict.

Connecting the Frontend (The JavaScript Layer)

With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the Analyze button on your site.


async function checkSpam() {
    const message = document.getElementById("userInput").value;
    const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

    try {
        const response = await fetch(apiUrl, {
            method: "POST",
            headers: {
                "Content-Type": "application/json"
            },
            body: JSON.stringify({ "text": message })
        });

        const data = await response.json();
        
        // Display result on the webpage
        const resultElement = document.getElementById("result");
        resultElement.innerText = `Prediction: ${data.classification}`;
        resultElement.style.color = data.classification === "SPAM" ? "red" : "green";

    } catch (error) {
        console.error("Error:", error);
        alert("Could not connect to the Spam Detector API.");
    }
}

4. How to Run The Project Locally

You can store the front-end as an HTML file. Once it's ready, don't just double-click the .html file: pages opened directly from disk can run into browser security restrictions. Instead, serve it from a simple local server.

Step 1: Open the terminal or Command Prompt.

Step 2: Navigate to your project folder

cd [PATH_TO_YOUR_FOLDER]

Step 3: Start a local Python web server.

python -m http.server 8000

Step 4: Access the application.

Open your browser and navigate to:
http://localhost:8000/your-file-name.html

5. Our Project Architecture

The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)

Client Front-End Interaction: The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like WIN free iPhone now and trigger a request.
The Entry Point: API Gateway: The request hits the Amazon API Gateway, which acts as the security guard and translator.
(a) CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.
(b) Classification Request (POST) routes the actual message data to your backend logic.
The Engine: AWS Lambda (Python 3.11): The central “lightbulb” represents your Lambda function. This is where the code you wrote lives. It doesn’t run 24/7 – it only wakes up when a request arrives.
Storage & Retrieval: S3 Bucket: Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.
Dependency and Model Download: The function reaches out to the S3 Bucket to pull in the sklearn_lib.zip (the engine) and the .pkl files (the intelligence).
Required Dependency and Model: These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.
The Inference Pipeline: Inside the Lambda, a three-step mathematical cycle occurs:
(a) Text Vectorizer: Translates the words into numbers.
(b) Logistic Regression: Calculates the probability of spam based on those numbers.
(c) Label: Assigns a final result (Spam or Ham).
The Result Delivery: The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “Result: SPAM” with a visual indicator.

6. Conclusion: The Power of Serverless AI

By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.

This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.

Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.

7. Acknowledgment / References

Pre-trained spam classification model: View on Hugging Face (rakshath1/mail-spam-detector · Hugging Face)
Scikit-learn Documentation
AWS Lambda Documentation
Amazon S3 Documentation
Amazon API Gateway Documentation


How to Set Up OpenID Connect (OIDC) in GitHub Actions for AWS

Tolani Akintayo — Mon, 27 Apr 2026 15:07:43 +0000

If you've been storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks in a CI/CD pipeline.

Here's why: static credentials don't expire on their own. If they get leaked through a misconfigured workflow, a public fork, or a compromised repository, an attacker has persistent access to your AWS environment until you manually rotate them. And most teams don't rotate them often enough.

OpenID Connect (OIDC) solves this entirely. Instead of storing long-lived credentials, GitHub Actions obtains a short-lived signed token from GitHub and exchanges it with AWS for temporary credentials every time your workflow runs. No secrets to rotate. No credentials to leak. No manual key management.

In this tutorial, you'll learn how to set up OIDC authentication between GitHub Actions and AWS from scratch. By the end, your workflows will authenticate to AWS securely without storing a single access key.

What Is OpenID Connect (OIDC)?
How OIDC Works Between GitHub Actions and AWS
Prerequisites
Step 1: Create an IAM OIDC Identity Provider in AWS
Step 2: Create an IAM Role with a Trust Policy
Step 3: Attach Permissions to the IAM Role
Step 4: Store the Role ARN as a GitHub Actions Variable
Step 5: Configure Your GitHub Actions Workflow
Step 6: Run and Verify Your Workflow
Security Best Practices
Troubleshooting Common Errors
Conclusion
References

What Is OpenID Connect (OIDC)?

OpenID Connect is an identity protocol built on top of OAuth 2.0. It allows systems to verify identity through tokens rather than shared secrets.

In the context of GitHub Actions and AWS:

GitHub acts as the identity provider (IdP). It issues a signed JWT (JSON Web Token) for each workflow run.
AWS acts as the service provider. It validates that token against GitHub's public keys and exchanges it for temporary AWS credentials.

The credentials AWS returns are short-lived (valid for 1 hour by default) and scoped to exactly the IAM role you define. They expire on their own, so nothing long-lived is left behind when the workflow ends.

This model is called federated identity. It's the same concept used when you "Sign in with Google" on a third-party website. The difference is that instead of a user signing in, your workflow is the one authenticating.

How OIDC Works Between GitHub Actions and AWS

Before writing a single line of YAML, it's worth understanding the flow. This is how I approach any new technology or concept.

The diagram illustrates a secure authentication flow between GitHub Actions and AWS using OpenID Connect (OIDC), eliminating the need to store long-lived AWS credentials in GitHub. Here's what happens step-by-step:

1. Initial Authentication Request

When your GitHub Actions workflow starts, the runner (the virtual machine executing your workflow) requests a JSON Web Token (JWT) from GitHub's OIDC provider located at https://token.actions.githubusercontent.com.

2. Token Issuance

GitHub's OIDC provider generates and signs a JWT containing important claims (metadata) about your workflow. These claims include details like which repository the workflow is running from, which branch triggered it, what environment it's running in, and other contextual information that proves the workflow's identity.

3. Token Validation

The GitHub Actions runner presents this signed JWT to AWS Security Token Service (STS). AWS STS validates the JWT's signature by checking it against GitHub's publicly available cryptographic keys, ensuring the token is authentic and hasn't been tampered with.

4. Trust Policy Verification

AWS STS checks the trust policy configured on your IAM Role. This trust policy specifies which GitHub repositories, branches, or environments are allowed to assume this role. If the claims in the JWT match your trust policy conditions, authentication succeeds.

5. Temporary Credentials Issued

Once validated, AWS STS returns temporary security credentials to the GitHub Actions runner. These credentials include an Access Key ID, Secret Access Key, and Session Token that are valid for a limited time (typically 1 hour by default, configurable up to 12 hours).

6. AWS API Access

The GitHub Actions runner uses these temporary credentials to authenticate API calls to your AWS resources such as pushing Docker images to ECR, updating ECS services, writing to S3 buckets, or invoking Lambda functions.

The key point: AWS never sees your GitHub credentials, and GitHub never sees your AWS credentials. The JWT is the only thing exchanged and it's signed, scoped, and short-lived.

Prerequisites

Before you start, make sure you have the following in place:

An AWS account with IAM permissions to create identity providers and roles
A GitHub repository (public or private) where your workflows will run
Basic familiarity with GitHub Actions, knowing how to write a .yml workflow file
Basic familiarity with AWS IAM roles, policies, and permissions
The AWS CLI installed and configured (optional, but useful for verification)

You don't need to be an AWS expert. Each step includes the exact console path and the configuration values you need.

Step 1: Create an IAM OIDC Identity Provider in AWS

The first thing you need to do is tell AWS to trust GitHub as an identity provider. This is a one-time setup per AWS account.

How to Do It in the AWS Console

1. Open the AWS IAM Console

2. In the left sidebar, click Identity providers

3. Click Add provider

4. For Provider type, select OpenID Connect

5. For Provider URL, enter:

https://token.actions.githubusercontent.com

6. For Audience, enter:

sts.amazonaws.com

7. Click Add provider

How to Do It with the AWS CLI

If you prefer the terminal, run this command:

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com

Once created, you'll see token.actions.githubusercontent.com listed under Identity providers in your IAM console. This provider will be referenced in your IAM role's trust policy in the next step.
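If you'd rather confirm the provider from the terminal, something like the following sketch should work. The account ID is a placeholder and the function name is hypothetical. The AWS call is wrapped in a function, so nothing contacts AWS until you invoke it.

```shell
# Placeholder account ID; substitute your own 12-digit ID.
ACCOUNT_ID="123456789012"
PROVIDER_ARN="arn:aws:iam::${ACCOUNT_ID}:oidc-provider/token.actions.githubusercontent.com"

# Fetch the provider's details; requires the AWS CLI and valid credentials.
show_github_oidc_provider() {
  aws iam get-open-id-connect-provider \
    --open-id-connect-provider-arn "$PROVIDER_ARN"
}

# This ARN is the value the trust policy's Federated principal must match.
echo "$PROVIDER_ARN"
```

Running show_github_oidc_provider should return the provider URL along with sts.amazonaws.com in its client ID list.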

Step 2: Create an IAM Role with a Trust Policy

Now you need an IAM role that your GitHub Actions workflow will assume. The trust policy on this role controls which repositories and branches are allowed to request credentials.

How to Create the IAM Role in the AWS Console

1. Open the AWS IAM Console

2. In the left sidebar, click Roles

3. Click Create role

4. For Trusted entity type, select Web identity

5. For Identity provider, choose token.actions.githubusercontent.com, which you created in Step 1

6. For Audience, choose sts.amazonaws.com as well

7. For GitHub organisation, enter your GitHub username or organization name

8. For GitHub repository, enter your GitHub repository

9. For GitHub branch, enter your branch name (for example, main)

10. Click Next, then Next again, give the role a name, and click Create role

Note: Creating the IAM role this way automatically generates the trust policy from the values you entered in steps 4-9 above. You can verify this by clicking on the created role and navigating to Trust relationships.

How to Create the IAM Role with the AWS CLI

First, create a trust policy document on your local machine. You can call it trust-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_GITHUB_ORG/YOUR_REPO_NAME:*"
        }
      }
    }
  ]
}

Replace the following placeholders before saving:

Placeholder	Replace With
`YOUR_ACCOUNT_ID`	Your 12-digit AWS account ID
`YOUR_GITHUB_ORG`	Your GitHub username or organization name
`YOUR_REPO_NAME`	The name of your GitHub repository

How to Understand the `sub` Condition

The sub (subject) claim in the JWT tells AWS exactly where the request is coming from. The value repo:your-org/your-repo:* means any workflow run in that repository can assume this role, whatever branch, tag, pull request, or environment triggered it.

You can tighten this further depending on your needs:

# Only the main branch
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
 
# Only a specific GitHub Environment
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"

Scoping this correctly is one of the most important security decisions in this setup. Here's how to decide:

Use ref:refs/heads/main if only your main/production branch should deploy to AWS. This is the most restrictive and secure option: feature branches can't accidentally (or maliciously) trigger deployments or modify production resources.
Use environment:production if you're using GitHub Environments with protection rules (required reviewers, deployment gates). This lets you control deployments through GitHub's approval workflow while still restricting which workflows can access AWS.
Use repo:your-org/your-repo:* (wildcard) only if you need any branch to deploy, for example in development environments where every feature branch deploys to its own isolated stack. Never use this for production roles.
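To get a feel for how these patterns behave, you can approximate the matching locally. The sketch below uses shell glob matching as a stand-in for IAM's StringLike operator. That's an assumption made for illustration: both treat * as any run of characters and ? as a single character, but the shell also honors [...] classes, which IAM doesn't. The function names are hypothetical.

```shell
# Return 0 if the sub claim ($1) matches the trust-policy pattern ($2).
# The unquoted $2 is deliberately treated as a glob pattern by 'case'.
sub_matches() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

# Print ALLOW or DENY for a given claim and pattern.
check() {
  if sub_matches "$1" "$2"; then echo "ALLOW  $1"; else echo "DENY   $1"; fi
}

main_only='repo:your-org/your-repo:ref:refs/heads/main'
any_ref='repo:your-org/your-repo:*'

check 'repo:your-org/your-repo:ref:refs/heads/main'      "$main_only"
check 'repo:your-org/your-repo:ref:refs/heads/feature-x' "$main_only"
check 'repo:your-org/your-repo:ref:refs/heads/feature-x' "$any_ref"
check 'repo:your-org/other-repo:ref:refs/heads/main'     "$any_ref"
```

The first and third checks are allowed and the other two are denied: the main-only pattern rejects the feature branch, and neither pattern lets a different repository through.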

Run this command to create the role using your trust policy:

aws iam create-role \
  --role-name GitHubActionsOIDCRole \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role assumed by GitHub Actions via OIDC"

Take note of the Role ARN in the output. It will look like this:

arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

You'll need this ARN in Step 4, where you store it as a GitHub Actions variable for your workflow to use.

Step 3: Attach Permissions to the IAM Role

The IAM role can now authenticate, but it has no permissions yet. You need to attach a policy that defines what your workflow is actually allowed to do in AWS.

How to Apply the Principle of Least Privilege

Only grant the permissions your workflow genuinely needs. If your workflow deploys to S3, give it S3 permissions. If it pushes images to ECR, give it ECR permissions. Never attach AdministratorAccess to a CI/CD role.

Option 1: Attach an AWS managed policy (quick start):

aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

Option 2: Create a custom policy scoped to a specific S3 bucket (recommended for production):

This approach is recommended for production because it limits the blast radius of a security incident. If your workflow credentials are ever compromised, a custom policy scoped to a specific bucket means an attacker can only affect that single bucket, not every S3 bucket in your AWS account. It also prevents accidental misconfigurations in your workflow from impacting unrelated resources.

Create a file called s3-deploy-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Then create and attach it:

aws iam create-policy \
  --policy-name GitHubActionsS3DeployPolicy \
  --policy-document file://s3-deploy-policy.json
 
aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/GitHubActionsS3DeployPolicy

Note: You can also complete Step 3 in the AWS console.

Reference: For a full list of available AWS IAM actions, see the AWS IAM actions reference.

Step 4: Store the Role ARN as a GitHub Actions Variable

Before you configure your workflow, you need to make the Role ARN available to it. You'll store it as a repository variable in GitHub, not a secret, because the ARN itself isn't sensitive data.

How to Add the Variable in Your Repository

First, open your GitHub repository and click Settings.

In the left sidebar, scroll down to Secrets and variables, then click Actions.

Then click the Variables tab (not Secrets). Click New repository variable – you can set the Name to:

AWS_ROLE_ARN

Set the Value to your Role ARN from Step 2, for example:

arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

Click Add variable:

You'll reference this variable in your workflow in the next step using ${{ vars.AWS_ROLE_ARN }}.
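A mistyped ARN (an extra colon, for instance) is easy to miss and only surfaces when the workflow fails. As a rough sanity check, you could validate the value before saving it. The pattern below is a simplified assumption that covers the standard aws partition only, and the account ID is a placeholder.

```shell
# Check that a value looks like an IAM role ARN: exactly one colon between
# the 12-digit account ID and "role/".
is_role_arn() {
  printf '%s\n' "$1" | grep -Eq '^arn:aws:iam::[0-9]{12}:role/[A-Za-z0-9+=,.@_/-]+$'
}

ROLE_ARN="arn:aws:iam::123456789012:role/GitHubActionsOIDCRole"  # placeholder account ID

if is_role_arn "$ROLE_ARN"; then
  echo "ARN looks well-formed"
  # With the GitHub CLI installed and authenticated, you could then run:
  #   gh variable set AWS_ROLE_ARN --body "$ROLE_ARN"
else
  echo "ARN is malformed: $ROLE_ARN" >&2
fi
```

The commented gh variable set command is an alternative to the console steps above, if you prefer to stay in the terminal.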

Step 5: Configure Your GitHub Actions Workflow

With AWS and GitHub fully configured, you now need to update your workflow to request an OIDC token and use it to authenticate.

How to Set the Required Workflow Permissions

Your workflow must declare id-token: write. Without this, GitHub won't issue an OIDC token to the runner.

permissions:
  id-token: write   # Required to request the OIDC JWT
  contents: read    # Required to checkout the repository

Important: If you set permissions at the job level, they override any top-level permissions. Make sure id-token: write is present at whichever level your AWS authentication step runs.

Full Workflow Example

Here's a complete workflow that authenticates to AWS using OIDC and deploys a static site to S3:

name: Deploy to AWS S3
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write
  contents: read
 
jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-2
 
      - name: Verify AWS identity
        run: aws sts get-caller-identity
 
      - name: Deploy to S3
        run: |
          aws s3 sync ./code s3://your-bucket-name

Replace the following before committing:

Placeholder	Replace With
`AWS_ROLE_ARN`	The variable name for your IAM role ARN in GitHub
`us-east-2`	Your target AWS region
`your-bucket-name`	Your S3 bucket name
`./code`	The local directory where the file you want to sync to S3 is located

You can see the code sample in my GitHub Repo here.

Note: The aws-actions/configure-aws-credentials action handles the entire OIDC token exchange automatically. It requests the JWT from GitHub, calls sts:AssumeRoleWithWebIdentity, and exports the temporary credentials as environment variables for the rest of the job.

See the action's official documentation for all available options.
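For intuition, the exchange the action performs is roughly equivalent to a single STS call. The sketch below is not the action's actual implementation. It only prints the equivalent AWS CLI command (dry-run by default), with a placeholder ARN and token. The helper names run and exchange_token are hypothetical.

```shell
# Print commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# $1 = role ARN, $2 = OIDC JWT obtained from GitHub
exchange_token() {
  run aws sts assume-role-with-web-identity \
    --role-arn "$1" \
    --role-session-name GitHubActions \
    --web-identity-token "$2" \
    --duration-seconds 3600
}

exchange_token "arn:aws:iam::123456789012:role/GitHubActionsOIDCRole" "<jwt-from-github>"
```

When run for real, the response contains the temporary Access Key ID, Secret Access Key, and Session Token described earlier. In practice you should let the action handle this rather than scripting it yourself.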

Step 6: Run and Verify Your Workflow

Push your workflow to the main branch and open the Actions tab in your repository to watch it run.

What a Successful Run Looks Like

The Configure AWS credentials via OIDC step should show:

Assuming role with OIDC: arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

The Verify AWS identity step (aws sts get-caller-identity) should return:

{
    "UserId": "AROA...:GitHubActions",
    "Account": "YOUR_ACCOUNT_ID",
    "Arn": "arn:aws:sts::YOUR_ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions"
}

If you see an assumed-role ARN in the output, OIDC is working correctly. Your workflow is now authenticating to AWS without a single stored credential.
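If you want the workflow itself to fail when the identity isn't what you expect, a small check along these lines could follow the verify step. The sample ARN is a placeholder, and the function name is hypothetical.

```shell
# Succeed when the ARN ($1) is an STS assumed-role session for role name ($2).
is_assumed_role() {
  case "$1" in
    arn:aws:sts::*:assumed-role/"$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# In the workflow you would capture the real value with:
#   arn=$(aws sts get-caller-identity --query Arn --output text)
arn="arn:aws:sts::123456789012:assumed-role/GitHubActionsOIDCRole/GitHubActions"  # sample value

if is_assumed_role "$arn" "GitHubActionsOIDCRole"; then
  echo "OIDC role assumed"
else
  echo "Unexpected identity: $arn" >&2
  exit 1
fi
```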

Security Best Practices

Getting OIDC working is step one. Locking it down properly is step two.

Scope the `sub` Condition as Tightly as Possible

Don't use a wildcard like repo:your-org/*:* that allows any repository in your organization to assume the role. Scope it to the exact repository and branch that needs access.

"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"

Use GitHub Environments for Production Deployments

GitHub Environments let you add manual approval gates and restrict which branches can deploy. When combined with OIDC, you can scope your trust policy to only allow the production environment:

"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"

Apply Least-Privilege Permissions to Every IAM Role

Never attach AdministratorAccess or PowerUserAccess to a role used by CI/CD. Define a custom policy with only the actions your workflow actually needs.

Create Separate IAM Roles Per Environment

A staging role and a production role should have different permission scopes. Your staging deployment role should never have write access to production resources.

Enable AWS CloudTrail

Every call made using the temporary credentials is logged in CloudTrail under the assumed role ARN. This gives you a full audit trail of exactly what your workflow did in AWS.
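As a starting point for auditing, you could look up recent role assumptions in CloudTrail's event history from the CLI. This is a sketch under a few assumptions: the function name is hypothetical, your default region is the one where the STS call was logged (otherwise add --region), and your own credentials allow cloudtrail:LookupEvents. The call is wrapped in a function, so nothing contacts AWS until you invoke it.

```shell
# List recent AssumeRoleWithWebIdentity events from CloudTrail event history.
# $1 = maximum number of events to return (defaults to 10).
recent_oidc_role_assumptions() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleWithWebIdentity \
    --max-results "${1:-10}"
}

echo "Run: recent_oidc_role_assumptions 5"
```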

Reference: GitHub's official security hardening guide for OIDC: About security hardening with OpenID Connect

Troubleshooting Common Errors

Error: `Not authorized to perform sts:AssumeRoleWithWebIdentity`

This usually means the trust policy on your IAM role doesn't match the sub claim in the JWT.

Check the following:

The sub condition exactly matches your repository path (it is case-sensitive)
The aud condition is set to sts.amazonaws.com
The Federated principal uses the correct AWS account ID

To inspect the actual token claims your workflow is receiving, add this debug step temporarily:

- name: Print OIDC token claims
  run: |
    TOKEN=$(curl -s -H "Authorization: Bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
    echo "$TOKEN" | cut -d '.' -f2 | base64 -d 2>/dev/null | jq .

Error: `Could not load credentials from any providers`

This almost always means id-token: write is missing from your workflow permissions. Double-check that you have:

permissions:
  id-token: write
  contents: read

Error: `AccessDenied` When Calling an AWS Service

Authentication succeeded but the IAM role doesn't have permission to perform the action your workflow is attempting. Check the permissions policy attached to your role and compare it against the specific action in the error message.

Conclusion

You've gone from storing static, long-lived AWS credentials in GitHub Secrets to a fully keyless authentication setup using OIDC. Here's what you accomplished:

Registered GitHub as a trusted OIDC identity provider in AWS.
Created an IAM role with a scoped trust policy tied to a specific repository.
Attached least-privilege permissions to that role.
Configured your GitHub Actions workflow to request and use short-lived AWS credentials.
Verified the authentication flow end-to-end.

This pattern works across AWS services such as S3, ECS, Lambda, ECR, Secrets Manager, and more. The workflow example here uses S3, but you only need to swap out the permissions policy and the deployment commands to adapt it for any service.

If you want to go further, explore:

Configuring OIDC for multiple cloud providers: Azure, GCP, and HashiCorp Vault.
GitHub Environments and deployment protection rules: for multi-stage pipelines with approval gates.
AWS IAM Access Analyzer: to validate and tighten your role policies automatically.

If you're building out your DevOps practice and want a complete, production-ready reference for infrastructure automation, CI/CD, and platform engineering, check out The Startup DevOps Field Guide. It covers the patterns, templates, and runbooks I've used across real AWS environments.

You can also connect with me on LinkedIn

References

Why Chrome OS Is the Operating System the AI Era Was Built For

Christopher Galliart — Fri, 17 Apr 2026 18:05:16 +0000

Chrome OS runs on a read-only filesystem. You can't install executables on the host. There's no traditional desktop environment. Everything that interacts with the underlying system does so through a sandboxed browser, a containerized Linux terminal, or a cloud connection.

For years, that list of constraints was the reason people dismissed it. But in 2026, it's the reason Chrome OS might be the most correctly designed operating system for what's coming.

The security architecture treats the endpoint as untrusted by default. The containerized Linux environment gives developers a full headless stack without compromising the host. And an upcoming OS-level rewrite, Aluminium, puts Google's on-device AI models directly into the kernel.

This article covers security architecture, the container-based developer environment, cloud-streamed creative tools via AWS NICE DCV, cloud gaming, and what Aluminium OS means for on-device AI.

Here's what we'll cover:

Security-First Architecture in an Era of AI-Powered Threats
A Headless Linux Stack That's More Flexible Than It Looks
AWS NICE DCV Changes the Creative Tools Conversation
Cloud Gaming Works
Aluminium OS: On-Device Models on Google's Own Architecture
Where This Lands

Security-First Architecture in an Era of AI-Powered Threats

Threat actors are getting better tools. Models like Mythos are lowering the barrier for generating convincing phishing campaigns, crafting polymorphic malware, and automating social engineering at scale.

Traditional operating systems present exactly the attack surface these tools target: writable system files, user-installable executables, patches that sit uninstalled for weeks because someone clicked "remind me later."

Chrome OS sidesteps most of this by design. The root filesystem is read-only and cryptographically verified on every boot through a process called Verified Boot.

If anything has modified the OS files since the last verified state, whether that's malware, a compromised package, or a rogue AI agent that decided to start deleting system files, the device detects it at startup and either self-corrects or refuses to boot.

Persistence across reboots isn't just difficult. It's architecturally impossible through software alone.

Updates happen silently. While you're working, the system downloads the next OS version to an inactive partition. On your next reboot, it pivots to the updated version. No prompts, no deferred patches, no exposure window.

Major updates ship every four to six weeks. Security patches land every two to three weeks. The gap between vulnerability discovery and remediation is measured in days.

Chrome OS consistently stays out of the top 50 products by CVE count in the NIST vulnerability database. Windows and the Linux kernel sit near the top every year. When AI is actively being weaponized to find and exploit vulnerabilities faster than humans can patch them, a read-only, verified, automatically updated endpoint is a different category of security posture.

The tradeoff is trust. Chrome OS's security model means trusting Google as the root authority for your entire computing stack: updates, certificate trust, telemetry. Organizations with strict data sovereignty requirements should weigh that dependency carefully.

A Headless Linux Stack That's More Flexible Than It Looks

Chrome OS is a text-based operating system. There's no native GUI layer. Stop and sit with that for a second, because it's the thing that makes people dismiss Chrome OS and also the thing that makes it work.

The entire graphical interface you interact with IS the Chrome browser. The Ash shell, Chrome's window manager, is the desktop. You don't install applications onto it the way you install .exe files on Windows or drag .app bundles into a macOS Applications folder. If it isn't running in a browser tab, an Android VM, or a Linux container, it doesn't run. That restriction is what keeps the host locked down, and it's what makes everything else possible.

Under the hood, Chrome OS runs a minimal virtual machine called Termina through crosvm, Google's Rust-based VM monitor.

Inside Termina, LXD manages Linux containers. The default container, penguin, is a Debian environment with a special trick: it bridges GUI-based Linux applications directly into the Chrome OS desktop through a Wayland proxy called Sommelier. Install VS Code, GIMP, or LibreOffice in penguin and they show up in your Chrome OS app launcher, running in windows alongside your browser tabs. For a lot of developers, penguin alone covers the daily workflow.

But Termina gives you more than penguin. Through the LXD layer you can spin up independent containers that are fully isolated operating systems: Arch, Alpine, Ubuntu, whatever you need.

These aren't attached to the GUI bridge. They run headless, natively, with their own systemd, their own package managers, their own persistent state. Need a clean Ubuntu environment to test a deployment script without touching your main setup? lxc launch and you're there. Need to blow it away? lxc delete and it's gone. No orphaned files on the host, no cross-contamination between environments.

The key distinction from Docker is that LXD runs system containers (full OS emulation) rather than application containers. You get background services, persistent daemons, the works. You can also run Docker inside any of these LXD containers if you need application-level containerization on top of that.

Snapshot your entire environment with lxc snapshot before a risky dependency install and roll back instantly if something breaks. That kind of safety net is broader than version control alone: it captures your full OS configuration, not just code.
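As a rough illustration, the throwaway-container and snapshot workflows look something like this. The sketch only prints the commands (dry-run by default). It assumes you're in a shell where lxc is available, and the image alias ubuntu:24.04 and the container name scratch are assumptions that may differ on your setup.

```shell
# Print commands instead of running them unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Throwaway container: create it, open a shell in it, then destroy it.
run lxc launch ubuntu:24.04 scratch
run lxc exec scratch -- bash
run lxc delete scratch --force

# Snapshot the default container before a risky change, then roll back.
run lxc snapshot penguin before-install
run lxc restore penguin before-install
```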

Pair this with browser-native tools like GitHub Codespaces, Google Colab, AWS CloudShell, or vscode.dev, and the terminal handles your local tooling while the browser handles everything else.

AI coding assistants like Claude and Gemini already operate natively in the browser. The distance between "cloud IDE" and "local IDE" keeps shrinking.

There are friction points: no custom kernel modules inside Crostini. Nested KVM requires Intel Gen 10+ processors. VPN routing into the Linux container from the Chrome OS host can be a headache, with WireGuard requiring userspace workarounds inside the container.

But none of these break the core architecture for cloud-native work. They're just worth knowing about before you commit.

AWS NICE DCV Changes the Creative Tools Conversation

One of the longest-standing arguments against Chrome OS has been the absence of professional creative software. There's no Premiere, no DaVinci Resolve, no Blender, no Ableton. For years, this was a dead-end conversation.

AWS NICE DCV (Desktop Cloud Visualization) reopens it. DCV is a high-performance remote display protocol that streams GPU-accelerated desktop sessions from EC2 instances to any device, including a Chromebook running the browser-based DCV client. It supports OpenGL, Vulkan, and DirectX rendering, with adaptive encoding that adjusts to network conditions. On AWS, the DCV license is free. You pay only for the EC2 compute time.

Netflix engineers use DCV to stream content creation applications to remote artists. Volkswagen runs 3D CAD simulations across their engineering division through it. A VFX studio called RVX used it to deliver visual effects for HBO's The Last of Us, streaming Nuke, Maya, Houdini, and Blender to artists distributed across Europe from servers in Iceland. Their team said it was the best remote experience they'd ever worked with.

So: a Chromebook connected to a g5.xlarge EC2 instance (one A10G GPU) can run Blender, DaVinci Resolve, or any other GPU-accelerated creative application with full hardware acceleration. The rendering happens in the data center. DCV streams the pixels. The creative professional gets a responsive, high-fidelity workspace on a $400 machine that couldn't locally render a single frame.

The constraints are connectivity and cost. You need sustained bandwidth (25+ Mbps for 1080p work, more for 4K multi-monitor setups) and leaving a GPU instance running around the clock adds up. But for studios and professionals who already budget for high-end workstations, the math often pencils out, especially when you factor in zero local hardware maintenance and the ability to scale GPU power on demand.

Cloud Gaming Works

GeForce NOW survived where Stadia failed because it made a better business decision: bring your own games. Connect your existing Steam, Epic, or Ubisoft library and stream from NVIDIA's server-side hardware. The Ultimate tier now runs on RTX 5080-class infrastructure. 4K at 120fps with ray tracing, on a fanless Chromebook.

Chrome OS has a structural advantage as a cloud gaming client. GeForce NOW runs natively in the Chromium browser via WebRTC, and users consistently report less micro-stuttering and tighter input handling than the standalone Windows desktop app. Under good network conditions, measured total latency runs 13 to 14ms, with sub-3ms ping documented near datacenter proximity. That's below the human perceptual threshold for most game types.

Anti-cheat systems like Easy Anti-Cheat and Riot Vanguard are a non-issue in this model. They run on the server where the game executes, not on your local endpoint. On-device gaming isn't viable on Chrome OS and likely never will be. The architecture isn't designed for it, and even projects attempting to bridge local GPUs hit bottlenecks in the container layers. Cloud gaming is the path, and it works.

The limiting factors are network-dependent. Latency spikes above 500ms on bad connections make fast-twitch games unplayable, and NVIDIA's 100-hour monthly cap on the Ultimate tier has drawn criticism. But cloud gaming on Chrome OS has crossed the line from novelty to daily-driver viable for most use cases.

Aluminium OS: On-Device Models on Google's Own Architecture

The most consequential near-term development for Chrome OS is Project Aluminium, a ground-up rewrite that replaces the current Chrome OS foundation with a native Android kernel. Not another bolted-on compatibility layer: a new operating system built on Android 16, designed to run Android applications natively with direct hardware acceleration instead of routing them through the resource-heavy ARCVM virtual machine that currently eats CPU cycles on even basic app launches.

The AI story is the real story. Aluminium is being built with Gemini models integrated directly into the OS: the file system, the application launcher, the window manager.

Google serving their own proprietary models on their own devices, using an architecture optimized specifically to run them, is a level of vertical integration that no other OS vendor has in the pipeline. Apple has the silicon advantage for local inference. Google has the model-to-OS integration advantage. Those are competing theses about where AI compute should live, and both are worth taking seriously.

The rollout timeline from court documents and leaked roadmaps puts a trusted tester program on select hardware in late 2026, premium tablets by early 2027, and general consumer availability in 2028. Chrome OS Classic gets maintained through existing support obligations until 2033 or 2034.

The launch won't be perfect. Google's track record on platform transitions gives the community earned skepticism. But the ability to iterate a natively AI-integrated OS on hardware they control is the kind of capability that compounds over time.

Where This Lands

Two years ago, calling Chrome OS a serious platform for development or creative work would have been a stretch. Today you can run a full Debian environment with systemd daemons, snapshot your workspace, stream Blender from a GPU-backed data center, play AAA games at 4K on hardware you don't own, and do all of it from a verified, read-only endpoint that patches itself while you sleep.

The remaining gaps are real. But they're concentrated in workflows that are themselves moving to the cloud. Chrome OS was designed around assumptions about computing that used to be premature. They're not premature anymore.

How to Build a Full-Stack CRUD App with React, AWS Lambda, DynamoDB, and Cognito Auth

Benedicta Onyebuchi — Tue, 17 Mar 2026 15:13:02 +0000

Building a web application that works only on your local machine is one thing. Building one that is secure, connected to a real database, and accessible to anyone on the internet is another challenge entirely. And it requires a different set of tools.

Most production web applications share a common set of needs: they store and retrieve data, they expose that data through an API, they require users to authenticate before accessing sensitive operations, and they need to be deployed somewhere reliable and fast.

Meeting all of those needs used to require managing servers, configuring databases, handling authentication infrastructure, and provisioning hosting environments – often as separate, manual processes.

AWS changes that model significantly. With the combination of services you'll use in this tutorial (Lambda, DynamoDB, API Gateway, Cognito, and CloudFront), you can build and deploy a fully functional, secured, globally distributed application without managing a single server.

Each service handles one specific responsibility:

DynamoDB stores your data
Lambda runs your business logic on demand
API Gateway exposes your functions as a REST API
Cognito manages user authentication
CloudFront delivers your frontend worldwide over HTTPS.

The AWS CDK (Cloud Development Kit) ties all of this together by letting you define every one of those services as TypeScript code. Instead of clicking through the AWS Console to configure each resource manually, you describe your entire infrastructure in a single file and deploy it with one command.

By the end of this tutorial, you will have a fully deployed vendor management dashboard. Users can sign up, log in, and then create, read, and delete vendors, with all data securely stored in AWS DynamoDB and all routes protected by Amazon Cognito authentication.

What You'll Build

In this handbook, you'll build a two-panel web app where authenticated users can:

Add a new vendor (name, category, contact email)
View all saved vendors in real time
Delete a vendor from the list
Sign in and sign out securely

The frontend is built with Next.js. The backend runs entirely on AWS: DynamoDB stores the data, Lambda functions handle the logic, API Gateway exposes a REST API, Cognito manages authentication, and CloudFront serves the app globally over HTTPS.

Who This Is For
Prerequisites
Architecture Overview
Part 1: Set Up Your AWS Account and Tools
Part 2: Set Up the Project Structure
Part 3: Define the Database (DynamoDB)
Part 4: Write the Lambda Functions
Part 5: Build the API with API Gateway
Part 6: Deploy the Backend to AWS
Part 7: Build the React Frontend
Part 8: Add Authentication with Amazon Cognito
Part 9: Deploy the Frontend with S3 and CloudFront
What You Built
Conclusion

Who This Is For

This tutorial is for developers who know basic JavaScript and React but have never used AWS. You don't need any prior backend, cloud, or DevOps experience. I'll explain every AWS concept before we use it.

Prerequisites

Before starting, make sure you have the following installed and available:

Node.js 18 or higher: Download here
npm: Included with Node.js
A code editor: I recommend VS Code
A terminal: Any terminal on macOS, Linux, or Windows (WSL recommended on Windows)
An AWS account: You will create one in Part 1. A credit card is required, but the Free Tier covers everything in this tutorial.
Basic familiarity with React and TypeScript: You should understand components, useState, and useEffect.

Architecture Overview

Before writing any code, here's a plain-English description of how the pieces fit together.

When a user clicks "Add Vendor" in the React app:

The frontend reads the user's JWT auth token from the browser session
It sends a POST request to API Gateway, including the token in the request header
API Gateway checks the token against Cognito. If the token is invalid or missing, it rejects the request with a 401 error immediately
If the token is valid, API Gateway passes the request to the createVendor Lambda function
The Lambda function writes the new vendor to DynamoDB
DynamoDB confirms the write, and the Lambda returns a success response
The frontend re-fetches the vendor list and updates the UI

The same flow applies to reading and deleting vendors, with different Lambda functions and HTTP methods.
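The steps above can be sketched in memory, without calling AWS at all. This is only an illustration of the order of events: `validTokens`, `table`, `handlePost`, and the sample vendor are hypothetical stand-ins for Cognito, DynamoDB, API Gateway, and real data.

```typescript
// In-memory sketch of the request flow. Nothing here calls AWS.
type Vendor = { vendorId: string; name: string; category: string; contactEmail: string };
type ApiResponse = { statusCode: number; body: string };

// Hypothetical stand-ins for Cognito's token check and the DynamoDB table.
const validTokens = new Set<string>(["demo-token"]);
const table = new Map<string, Vendor>();

// Stand-in for the createVendor Lambda: write the item, then report success.
function createVendor(vendor: Vendor): ApiResponse {
  table.set(vendor.vendorId, vendor);
  return { statusCode: 201, body: JSON.stringify({ vendorId: vendor.vendorId }) };
}

// Stand-in for API Gateway: reject bad tokens before the Lambda ever runs.
function handlePost(token: string | undefined, vendor: Vendor): ApiResponse {
  if (token === undefined || !validTokens.has(token)) {
    return { statusCode: 401, body: JSON.stringify({ message: "Unauthorized" }) };
  }
  return createVendor(vendor);
}

const acme: Vendor = {
  vendorId: "v-1",
  name: "Acme",
  category: "Supplies",
  contactEmail: "ops@acme.example",
};

const rejected = handlePost(undefined, acme);    // no token: stops at the gateway
const accepted = handlePost("demo-token", acme); // valid token: reaches the Lambda
```

The point to notice is that the rejected request never reaches `createVendor`, which mirrors how a request with a missing or invalid token never reaches your real Lambda.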

How the app is deployed: Your React app is exported as a static site, uploaded to an S3 bucket, and served globally through CloudFront. Your backend infrastructure (Lambda functions, API Gateway, DynamoDB, Cognito) is defined in TypeScript using AWS CDK and deployed with a single command.

Part 1: Set Up Your AWS Account and Tools

Before writing any application code, you need three things in place: an AWS account, the right tools on your machine, and credentials that let those tools communicate with AWS on your behalf.

1.1 Create Your AWS Account

If you don't have an AWS account:

Go to https://aws.amazon.com
Click Create an AWS Account
Follow the sign-up prompts and add a payment method
Once registered, log in to the AWS Management Console

AWS has a Free Tier that covers all the services used in this tutorial. You won't be charged for normal use while following along.

1.2 Install the AWS CLI and CDK

The AWS CLI is a command-line tool that lets you interact with AWS from your terminal: checking resources, configuring credentials, and more.

The AWS CDK (Cloud Development Kit) is the tool you will use to define your entire backend (database, Lambda functions, API) using TypeScript code. Instead of clicking through the AWS Console to create each resource, you describe what you want in a TypeScript file and CDK builds it for you.

Install both:

# Install AWS CLI (macOS)
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /

# For Linux, see: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html
# For Windows, see: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html

# Install AWS CDK globally
npm install -g aws-cdk

Verify both are installed:

aws --version
cdk --version

Both commands should print a version number. If they do, you are ready to move on.

1.3 Configure Your AWS Credentials (IAM)

This step is critical. Your terminal needs a set of credentials – like a username and password – to act on your behalf inside AWS.

Think of your root account (the one you signed up with) as the master key to your entire AWS account. You should never use it for day-to-day development. Instead, you will create a separate IAM user with its own set of keys. If those keys are ever exposed, you can delete them without compromising your root account.

Phase 1: Create an IAM User

Log in to the AWS Console and search for IAM in the top search bar
In the left sidebar, click Users, then click Create user
Name the user cdk-dev. Leave "Provide user access to the AWS Management Console" unchecked – you only need terminal access, not console access
On the permissions screen, choose Attach policies directly
Search for AdministratorAccess and check the box next to it
Click through to the end and click Create user

Note on permissions: In a production project you would use a more restricted policy. For this tutorial, AdministratorAccess is needed because CDK creates many different types of AWS resources.

Phase 2: Generate Access Keys

Click on your newly created cdk-dev user from the Users list
Go to the Security credentials tab
Scroll down to Access keys and click Create access key
Select Command Line Interface (CLI), check the acknowledgment box, and click Next
Click Create access key

Important: Copy both the Access Key ID and the Secret Access Key right now. You will never be able to see the Secret Access Key again after closing this screen. Save both values in a password manager or secure note.

Phase 3: Connect Your Terminal to AWS

Run the following command in your terminal:

aws configure

You will be prompted for four values:

AWS Access Key ID:     [paste your Access Key ID]
AWS Secret Access Key: [paste your Secret Access Key]
Default region name:   us-east-1
Default output format: json

Use us-east-1 as your region for this tutorial. After this step, every CDK and AWS CLI command you run will use these credentials automatically.

Part 2: Set Up the Project Structure

You will use a monorepo layout – one top-level folder with two sub-projects inside: frontend for your React app and backend for your AWS infrastructure code. They are deployed independently but live side by side.

2.1 Create the Workspace

mkdir vendor-tracker && cd vendor-tracker
mkdir backend frontend

2.2 Initialize the Frontend (Next.js)

Navigate into the frontend folder and run:

cd frontend
npx create-next-app@latest .

When prompted, choose the following options:

TypeScript --> Yes
ESLint --> Yes
Tailwind CSS --> Yes
src/ directory --> No
App Router --> Yes
Customize import alias --> No (this keeps the default @/* alias, which the imports later in this tutorial rely on)

2.3 Initialize the Backend (CDK)

Navigate into the backend folder and run:

cd ../backend
cdk init app --language typescript

This generates a boilerplate CDK project. The most important file it creates is backend/lib/backend-stack.ts. This is where you will define all of your AWS infrastructure as TypeScript code.

Also install esbuild, which CDK uses to bundle your Lambda functions:

npm install --save-dev esbuild

2.4 Understanding CDK Before You Write Any Code

CDK is likely different from most tools you have used. Here is how it works:

Normally, you would create AWS resources by clicking through the AWS Console: create a table here, configure a Lambda function there. CDK lets you do all of that using TypeScript code instead.

When you run cdk deploy, CDK reads your TypeScript file, converts it into an AWS CloudFormation template (an internal AWS format for describing infrastructure), and submits it to AWS. AWS then creates all the resources you described.

A few terms you will see throughout this tutorial:

Stack: The collection of all AWS resources you define together. Your BackendStack class is your stack.
Construct: Each individual AWS resource you create inside a stack (a table, a Lambda function, an API) is called a construct.
Deploy: Running cdk deploy sends your TypeScript definition to AWS and creates or updates the real resources.

The main file you'll work in is backend/lib/backend-stack.ts. Think of it as the blueprint for your entire backend.
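To make the stack, construct, and deploy terms more concrete, here is a toy model of what synthesis does. `ToyStack` and `ToyConstruct` are hypothetical classes written for this sketch, not part of aws-cdk-lib, and a real template contains far more detail (including generated suffixes on the resource names).

```typescript
// Toy model of CDK synthesis. These classes are hypothetical, not the real aws-cdk-lib API.
interface ResourceDefinition {
  Type: string;
  Properties: Record<string, unknown>;
}

// A construct: one resource you want AWS to create.
class ToyConstruct {
  constructor(
    readonly logicalId: string,
    readonly type: string,
    readonly properties: Record<string, unknown>,
  ) {}
}

// A stack: the collection of constructs that are deployed together.
class ToyStack {
  private readonly constructs: ToyConstruct[] = [];

  add(construct: ToyConstruct): this {
    this.constructs.push(construct);
    return this;
  }

  // "Synthesis": turn the constructs into one CloudFormation-style template object.
  synth(): { Resources: Record<string, ResourceDefinition> } {
    const Resources: Record<string, ResourceDefinition> = {};
    for (const c of this.constructs) {
      Resources[c.logicalId] = { Type: c.type, Properties: c.properties };
    }
    return { Resources };
  }
}

const stack = new ToyStack()
  .add(new ToyConstruct("VendorTable", "AWS::DynamoDB::Table", { BillingMode: "PAY_PER_REQUEST" }))
  .add(new ToyConstruct("CreateVendorHandler", "AWS::Lambda::Function", { Handler: "index.handler" }));

// One plain object describing every resource: this is what gets handed to AWS.
const template = stack.synth();
```

The real CDK works on the same idea at a much larger scale: your TypeScript runs locally, produces a declarative template, and CloudFormation does the actual creating and updating.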

Your final project structure will look like this:

vendor-tracker/
├── backend/
│   ├── lambda/
│   │   ├── createVendor.ts
│   │   ├── getVendors.ts
│   │   └── deleteVendor.ts
│   ├── lib/
│   │   └── backend-stack.ts
│   └── package.json
└── frontend/
    ├── app/
    │   ├── layout.tsx
    │   ├── page.tsx
    │   └── providers.tsx
    ├── lib/
    │   └── api.ts
    ├── types/
    │   └── vendor.ts
    └── .env.local

Part 3: Define the Database (DynamoDB)

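DynamoDB is AWS's NoSQL database. Think of it as a fast, scalable key-value store in the cloud. Every item in a DynamoDB table is identified by its primary key. In a table with no sort key, like the one you'll create here, the primary key is a single attribute called the partition key, and it must be unique for every item. For your vendor table, that key will be vendorId.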

Open backend/lib/backend-stack.ts. Replace the entire file contents with the following:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. DynamoDB Table
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: {
        name: 'vendorId',
        type: dynamodb.AttributeType.STRING,
      },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY, // For development only
    });
  }
}

What each line does:

partitionKey tells DynamoDB that vendorId is the unique identifier for every record. No two vendors can share the same vendorId.
PAY_PER_REQUEST means you only pay when data is actually read or written. There is no charge when the table is idle, which makes it cost-effective for learning.
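RemovalPolicy.DESTROY means the table, along with all the data in it, will be deleted when you run cdk destroy. For production apps you would keep the default policy (RETAIN), so the data survives even if the stack is removed.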
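One consequence of the partition key rule is easy to miss: writing an item whose vendorId already exists doesn't create a second item, it replaces the first one. The sketch below models that with a plain Map. The `vendorTable` map and `putItem` function here are hypothetical stand-ins, not DynamoDB calls.

```typescript
type VendorItem = { vendorId: string; name: string };

// Hypothetical in-memory table keyed by the partition key (no sort key).
const vendorTable = new Map<string, VendorItem>();

// Mirrors the default behaviour of a DynamoDB put: same key means replace, not duplicate.
function putItem(item: VendorItem): void {
  vendorTable.set(item.vendorId, item);
}

putItem({ vendorId: "abc", name: "Acme" });
putItem({ vendorId: "abc", name: "Acme Corp" }); // same vendorId: overwrites the first item
putItem({ vendorId: "xyz", name: "Globex" });

// The table now holds two items, and "abc" carries the newer name.
```

This is why the create function you write next generates a fresh ID for each vendor instead of letting the caller choose one.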

Part 4: Write the Lambda Functions

A Lambda function is your server, but unlike a traditional server, it only runs when it's called. AWS spins it up on demand, runs your code, and shuts it down. You're only charged for the time your code is actually running.

You'll write three Lambda functions:

createVendor.ts: Adds a new vendor to DynamoDB
getVendors.ts: Returns all vendors from DynamoDB
deleteVendor.ts: Removes a vendor from DynamoDB by ID

Create a new folder named lambda inside backend. From the vendor-tracker root folder, run:

mkdir backend/lambda

A Note on the AWS SDK

All three Lambda functions use AWS SDK v3 (@aws-sdk/client-dynamodb and @aws-sdk/lib-dynamodb). This is the current standard. An older version of the SDK (aws-sdk) exists, but it's deprecated and isn't bundled in the Node.js 18 and later Lambda runtimes. Stick to v3 throughout.

4.1 Create Vendor Lambda

Create backend/lambda/createVendor.ts:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { randomUUID } from "crypto";

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler = async (event: any) => {
  try {
    const body = JSON.parse(event.body);

    const item = {
      vendorId: randomUUID(), // Generates a collision-safe unique ID
      name: body.name,
      category: body.category,
      contactEmail: body.contactEmail,
      createdAt: new Date().toISOString(),
    };

    await docClient.send(
      new PutCommand({
        TableName: process.env.TABLE_NAME!,
        Item: item,
      })
    );

    return {
      statusCode: 201,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type,Authorization",
        "Access-Control-Allow-Methods": "OPTIONS,POST,GET,DELETE",
      },
      body: JSON.stringify({ message: "Vendor created", vendorId: item.vendorId }),
    };
  } catch (error) {
    console.error("Error creating vendor:", error);
    return {
      statusCode: 500,
      headers: { "Access-Control-Allow-Origin": "*" },
      body: JSON.stringify({ error: "Failed to create vendor" }),
    };
  }
};

What each part does:

randomUUID() generates a universally unique ID using Node's built-in crypto module. No extra package is needed. This is more reliable than Date.now(), which can produce duplicate IDs if two requests arrive within the same millisecond.
process.env.TABLE_NAME reads the DynamoDB table name from an environment variable. You'll set this value in the CDK stack. This avoids hardcoding the table name inside your Lambda code.
The headers block is required for CORS (Cross-Origin Resource Sharing). Without Access-Control-Allow-Origin, your browser will block responses that come from a different domain than your frontend. Access-Control-Allow-Headers lists Authorization so that it matches the header you add later for Cognito. The browser's preflight (OPTIONS) check itself is answered by API Gateway, which you'll configure in Part 5.
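As written, the handler sends any missing or malformed body into the catch block, so the caller sees a 500 even though the mistake was theirs. If you want to return a 400 instead, one option is to validate the body first. This is a minimal sketch: `parseVendorInput` and its error messages are hypothetical and not part of the original handler.

```typescript
type VendorInput = { name: string; category: string; contactEmail: string };
type ParseResult =
  | { ok: true; value: VendorInput }
  | { ok: false; error: string };

// Hypothetical helper: validate the raw request body before touching DynamoDB.
function parseVendorInput(rawBody: string | null | undefined): ParseResult {
  if (!rawBody) return { ok: false, error: "Request body is required" };

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, error: "Request body must be valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, error: "Request body must be a JSON object" };
  }

  // Every field must be a non-empty string.
  const record = parsed as Record<string, unknown>;
  const values: string[] = [];
  for (const field of ["name", "category", "contactEmail"]) {
    const value = record[field];
    if (typeof value !== "string" || value.trim() === "") {
      return { ok: false, error: `${field} is required` };
    }
    values.push(value.trim());
  }

  const [name = "", category = "", contactEmail = ""] = values;
  return { ok: true, value: { name, category, contactEmail } };
}
```

Inside the handler you could call `parseVendorInput(event.body)` first and return a 400 with the error message when `ok` is false, keeping the 500 path for genuine server failures.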

4.2 Get Vendors Lambda

Create backend/lambda/getVendors.ts:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler = async () => {
  try {
    const response = await docClient.send(
      new ScanCommand({
        TableName: process.env.TABLE_NAME!,
      })
    );

    return {
      statusCode: 200,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type,Authorization",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(response.Items ?? []),
    };
  } catch (error) {
    console.error("Error fetching vendors:", error);
    return {
      statusCode: 500,
      headers: { "Access-Control-Allow-Origin": "*" },
      body: JSON.stringify({ error: "Failed to fetch vendors" }),
    };
  }
};

What each part does:

ScanCommand reads the items in the table and returns them as an array. A single Scan call returns at most 1 MB of data, which is far more than this learning project will store. In a production app with millions of rows, you would use a more targeted QueryCommand to avoid reading the entire table on every request.
response.Items ?? [] returns an empty array if the table is empty, preventing the frontend from crashing when there are no vendors yet.
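Once a table grows past what one Scan call returns, the response carries a LastEvaluatedKey, and you pass it back as ExclusiveStartKey to fetch the next page. The loop below sketches that pattern against a fake data source. `scanPage` and `allItems` are hypothetical stand-ins for sending a ScanCommand, so the sketch runs without AWS.

```typescript
type Item = { vendorId: string };
type Key = { vendorId: string };
type Page = { Items: Item[]; LastEvaluatedKey?: Key };

// Hypothetical data set and page fetcher standing in for a real ScanCommand.
const allItems: Item[] = ["a", "b", "c", "d", "e"].map((vendorId) => ({ vendorId }));

async function scanPage(startKey?: Key, pageSize = 2): Promise<Page> {
  const start = startKey
    ? allItems.findIndex((item) => item.vendorId === startKey.vendorId) + 1
    : 0;
  const Items = allItems.slice(start, start + pageSize);
  const last = Items[Items.length - 1];
  const hasMore = start + pageSize < allItems.length;
  // Only include LastEvaluatedKey when there is another page to read.
  return hasMore && last ? { Items, LastEvaluatedKey: { vendorId: last.vendorId } } : { Items };
}

// Keep requesting pages until the response no longer carries a LastEvaluatedKey.
async function scanAll(): Promise<Item[]> {
  const results: Item[] = [];
  let startKey: Key | undefined;
  do {
    const page = await scanPage(startKey);
    results.push(...page.Items);
    startKey = page.LastEvaluatedKey;
  } while (startKey !== undefined);
  return results;
}
```

With the real SDK, the body of the loop would send a ScanCommand with `ExclusiveStartKey: startKey` and read `LastEvaluatedKey` from the response in the same way.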

4.3 Delete Vendor Lambda

Create backend/lambda/deleteVendor.ts:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, DeleteCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler = async (event: any) => {
  try {
    const body = JSON.parse(event.body);
    const { vendorId } = body;

    if (!vendorId) {
      return {
        statusCode: 400,
        headers: { "Access-Control-Allow-Origin": "*" },
        body: JSON.stringify({ error: "vendorId is required" }),
      };
    }

    await docClient.send(
      new DeleteCommand({
        TableName: process.env.TABLE_NAME!,
        Key: { vendorId },
      })
    );

    return {
      statusCode: 200,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type,Authorization",
        "Access-Control-Allow-Methods": "OPTIONS,POST,GET,DELETE",
      },
      body: JSON.stringify({ message: "Vendor deleted" }),
    };
  } catch (error) {
    console.error("Error deleting vendor:", error);
    return {
      statusCode: 500,
      headers: { "Access-Control-Allow-Origin": "*" },
      body: JSON.stringify({ error: "Failed to delete vendor" }),
    };
  }
};

What each part does:

DeleteCommand removes the item whose vendorId matches the key you provide. DynamoDB doesn't return an error if the item doesn't exist. It simply does nothing.
The 400 guard at the top returns a clear error if the caller forgets to send a vendorId, rather than letting DynamoDB throw a confusing internal error.
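All three handlers repeat the same CORS headers in every return statement. If that repetition bothers you, a small helper can build the response in one place. This is an optional refactor, not something the tutorial's handlers require, and `jsonResponse` and `CORS_HEADERS` are hypothetical names.

```typescript
type LambdaResponse = {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
};

// The same CORS headers the three handlers repeat, collected in one place.
const CORS_HEADERS: Record<string, string> = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Headers": "Content-Type,Authorization",
  "Access-Control-Allow-Methods": "OPTIONS,POST,GET,DELETE",
};

// Hypothetical helper: build a Lambda proxy response with a JSON body.
function jsonResponse(statusCode: number, payload: unknown): LambdaResponse {
  return {
    statusCode,
    headers: { ...CORS_HEADERS, "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

const created = jsonResponse(201, { message: "Vendor created" });
const badRequest = jsonResponse(400, { error: "vendorId is required" });
```

The trade-off is one more shared file to bundle with each function in exchange for error responses that can't accidentally drift out of sync with the success responses.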

Part 5: Build the API with API Gateway

API Gateway is what gives your Lambda functions a public URL. Without it, there's no way for your browser to trigger a Lambda function. Think of it as the front door of your backend: it receives HTTP requests, checks whether the caller is authorized, routes the request to the correct Lambda, and returns the Lambda's response to the caller.

Now you'll wire everything together in backend/lib/backend-stack.ts.

5.1 Add Lambda Functions and API Gateway to the Stack

Replace the entire contents of backend/lib/backend-stack.ts with this complete, assembled file:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. DynamoDB Table 
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: {
        name: 'vendorId',
        type: dynamodb.AttributeType.STRING,
      },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // 2. Lambda Functions
    const lambdaEnv = { TABLE_NAME: vendorTable.tableName };

    const createVendorLambda = new NodejsFunction(this, 'CreateVendorHandler', {
      entry: 'lambda/createVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const getVendorsLambda = new NodejsFunction(this, 'GetVendorsHandler', {
      entry: 'lambda/getVendors.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const deleteVendorLambda = new NodejsFunction(this, 'DeleteVendorHandler', {
      entry: 'lambda/deleteVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    // 3. Permissions (Least Privilege)
    vendorTable.grantWriteData(createVendorLambda);
    vendorTable.grantReadData(getVendorsLambda);
    vendorTable.grantWriteData(deleteVendorLambda);

    // 4. API Gateway
    const api = new apigateway.RestApi(this, 'VendorApi', {
      restApiName: 'Vendor Service',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    const vendors = api.root.addResource('vendors');
    vendors.addMethod('POST', new apigateway.LambdaIntegration(createVendorLambda));
    vendors.addMethod('GET', new apigateway.LambdaIntegration(getVendorsLambda));
    vendors.addMethod('DELETE', new apigateway.LambdaIntegration(deleteVendorLambda));

    // 5. Outputs
    new cdk.CfnOutput(this, 'ApiEndpoint', {
      value: api.url,
    });
  }
}

What each section does:

NodejsFunction is a special CDK construct that automatically bundles your Lambda code and all its dependencies into a single file using esbuild before uploading it to AWS. This is why you installed esbuild in Part 2.

For this tutorial, use NodejsFunction instead of the basic lambda.Function construct. The basic version requires you to manage bundling yourself, and a missed dependency shows up as a "Module not found" error at runtime.

Permissions (Least Privilege): In AWS, no resource can communicate with any other resource by default. A Lambda function has no access to DynamoDB, S3, or anything else unless you explicitly grant it.

This is called the Least Privilege principle: each piece of your system gets exactly the permissions it needs, and nothing more. grantWriteData lets a Lambda write and delete items. grantReadData lets a Lambda read items. Using separate grants for each function means the getVendors Lambda can never accidentally delete data.
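The default-deny behaviour can be modelled in a few lines. The sketch below is a simplified picture, not how IAM is implemented: the `grants` map and the `grant` and `isAllowed` functions are hypothetical, and the action lists are only a subset of what the real grant methods include.

```typescript
type Action = "GetItem" | "Scan" | "Query" | "PutItem" | "UpdateItem" | "DeleteItem";

// Subset of the actions each grant covers; the real grants include a few more.
const READ_ACTIONS: Action[] = ["GetItem", "Scan", "Query"];
const WRITE_ACTIONS: Action[] = ["PutItem", "UpdateItem", "DeleteItem"];

// Hypothetical policy store: function name -> actions it may perform on the table.
const grants = new Map<string, Set<Action>>();

function grant(principal: string, actions: Action[]): void {
  const existing = grants.get(principal) ?? new Set<Action>();
  actions.forEach((action) => existing.add(action));
  grants.set(principal, existing);
}

// Default deny: anything not explicitly granted is refused.
function isAllowed(principal: string, action: Action): boolean {
  return grants.get(principal)?.has(action) ?? false;
}

grant("createVendor", WRITE_ACTIONS);
grant("getVendors", READ_ACTIONS);
grant("deleteVendor", WRITE_ACTIONS);

const readerCanScan = isAllowed("getVendors", "Scan");         // granted
const readerCanDelete = isAllowed("getVendors", "DeleteItem"); // never granted
```

Note that a write grant is still broader than createVendor strictly needs, since it also covers deletes. If you wanted to tighten this further, you would grant individual actions instead of the read and write bundles.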

CfnOutput prints a value to your terminal after cdk deploy completes. You'll use the ApiEndpoint URL to configure your frontend.

Part 6: Deploy the Backend to AWS

Your infrastructure is fully defined in code. Now you'll deploy it to AWS and get a live API URL.

6.1 Bootstrap Your AWS Environment

Before your first CDK deployment, AWS needs a small landing zone in your account – an S3 bucket where CDK can upload your Lambda bundles and other assets. This setup step is called bootstrapping and only needs to be done once per AWS account per region.

From inside your backend folder, run:

cdk bootstrap

Important: Bootstrapping is region-specific. If you ever switch to a different AWS region, you will need to run cdk bootstrap again in that region.

6.2 Deploy

Run:

cdk deploy

CDK will display a summary of everything it is about to create and ask for your confirmation. Type y and press Enter.

When the deployment finishes, you'll see an Outputs section in your terminal:

Outputs:
BackendStack.ApiEndpoint = https://abcdef123.execute-api.us-east-1.amazonaws.com/prod/

Copy that URL. You'll need it when building the frontend.

6.3 Troubleshooting: How to Read AWS Error Logs

Real deployments rarely go perfectly the first time. If something goes wrong after deploying, here is how to find the actual error message.

Error: 502 Bad Gateway

A 502 means API Gateway received your request but your Lambda crashed or returned a response that API Gateway couldn't interpret. The most common cause is a missing environment variable – for example, if TABLE_NAME was not passed correctly and the Lambda cannot find the table.

To find the actual error message, use CloudWatch Logs:

Log in to the AWS Console and search for CloudWatch
In the left sidebar, click Logs --> Log groups

Find the group named /aws/lambda/BackendStack-CreateVendorHandler...
Click the most recent Log stream
Read the error message. It will tell you exactly what went wrong

Two common messages and their fixes:

Runtime.ImportModuleError : Your Lambda cannot find a module. Make sure you're using NodejsFunction (not lambda.Function) in your CDK stack. NodejsFunction automatically bundles dependencies; lambda.Function does not.
AccessDeniedException: Your Lambda tried to access DynamoDB but doesn't have permission. Check that you have the correct grantWriteData or grantReadData call in your stack for that Lambda.

Part 7: Build the React Frontend

Your backend is live. Now you'll build the React UI that talks to it.

7.1 Define the Vendor Type

Before writing any API or component code, define what a "vendor" looks like in TypeScript. This gives you type safety throughout your frontend code.

Create frontend/types/vendor.ts:

export interface Vendor {
  vendorId?: string; // Optional when creating — the Lambda generates it
  name: string;
  category: string;
  contactEmail: string;
  createdAt?: string;
}

vendorId is marked optional with ? because when you are creating a new vendor, you don't have an ID yet. The createVendor Lambda generates one. When you read vendors back from the API, vendorId will always be present.

7.2 Create the API Service Layer

Rather than writing fetch calls directly inside your React components, you'll centralize all your API logic in one file. This pattern is called a service layer. It keeps your components clean and makes it easy to update API calls in one place.

First, create a .env.local file inside your frontend folder to store your API URL:

# frontend/.env.local
NEXT_PUBLIC_API_URL=https://abcdef123.execute-api.us-east-1.amazonaws.com/prod

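Replace the URL with the ApiEndpoint value from your cdk deploy output. Copy it without the trailing slash, because the API functions you write next append /vendors themselves. The NEXT_PUBLIC_ prefix is required by Next.js to make an environment variable accessible in the browser.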
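The URL that cdk deploy prints ends with a slash, while the example value above doesn't. If you would rather not depend on how the value was pasted, a small helper can normalize it. This is an optional sketch, and `joinUrl` is a hypothetical name that doesn't appear elsewhere in this tutorial.

```typescript
// Hypothetical helper: join a base URL and a path with exactly one slash between them.
function joinUrl(baseUrl: string, path: string): string {
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const trimmedPath = path.replace(/^\/+/, "");    // drop leading slashes
  return `${trimmedBase}/${trimmedPath}`;
}

// Both spellings of the base URL produce the same request URL.
const fromCdkOutput = joinUrl("https://abcdef123.execute-api.us-east-1.amazonaws.com/prod/", "/vendors");
const fromEnvFile = joinUrl("https://abcdef123.execute-api.us-east-1.amazonaws.com/prod", "vendors");
```

Without something like this, a base URL that ends in a slash produces a path with a double slash (`/prod//vendors`), which API Gateway is likely to treat as a different, undefined route.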

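You might be wondering: why not hardcode the URL? Keeping it in an environment variable lets you point the same code at a different backend without editing source files, and it keeps environment-specific values out of source control. Keep in mind that NEXT_PUBLIC_ variables are embedded in the browser bundle, so the URL itself is not a secret. An API URL alone does not expose your data (Cognito will protect that), but it's still good practice to keep URLs and secrets out of your repository. Always use .env.local and add it to your .gitignore.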

Make sure .env.local is in your .gitignore:

echo ".env.local" >> frontend/.gitignore

Now create frontend/lib/api.ts:

import { Vendor } from '@/types/vendor';

const BASE_URL = process.env.NEXT_PUBLIC_API_URL!;

export const getVendors = async (): Promise<Vendor[]> => {
  const response = await fetch(`${BASE_URL}/vendors`);
  if (!response.ok) throw new Error('Failed to fetch vendors');
  return response.json();
};

export const createVendor = async (vendor: Omit<Vendor, 'vendorId' | 'createdAt'>): Promise<void> => {
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(vendor),
  });
  if (!response.ok) throw new Error('Failed to create vendor');
};

export const deleteVendor = async (vendorId: string): Promise<void> => {
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'DELETE',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ vendorId }),
  });
  if (!response.ok) throw new Error('Failed to delete vendor');
};

What each part does:

Omit<Vendor, 'vendorId' | 'createdAt'> means the createVendor function accepts a vendor without an ID or timestamp (those are generated server-side).
if (!response.ok) throw new Error(...) ensures that any HTTP error (4xx or 5xx) surfaces as a JavaScript error in your component, where you can show the user a meaningful message instead of silently failing.

You'll update these functions later in Part 8 to include the Cognito auth token.
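The response.ok check matters because fetch only rejects on network failures. A 404 or 500 still resolves normally, so without the check your component would treat an error page as data. The sketch below isolates that behaviour using fake fetchers so it runs without a network. `getJson`, `Fetcher`, and the two fake fetchers are hypothetical names for illustration.

```typescript
// Minimal shape of the parts of a fetch response this sketch relies on.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type Fetcher = (url: string) => Promise<HttpResponse>;

// Hypothetical helper: throw on any non-2xx status, otherwise return the parsed body.
async function getJson(fetcher: Fetcher, url: string): Promise<unknown> {
  const response = await fetcher(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Fake fetchers standing in for the real network call.
const okFetcher: Fetcher = async () => ({ ok: true, status: 200, json: async () => [] });
const failingFetcher: Fetcher = async () => ({ ok: false, status: 500, json: async () => ({}) });

// getJson(okFetcher, "/vendors") resolves with the parsed body.
// getJson(failingFetcher, "/vendors") rejects, so a try/catch in the component can show a message.
```

Passing the fetcher in as a parameter is only there to make the sketch testable. In the app itself you would keep calling the global fetch directly, as the functions above do.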

7.3 Build the Main Page

Now create the main page component. It includes a form for adding vendors and a live list that displays all current vendors.

Replace the contents of frontend/app/page.tsx with:

'use client';

import { useState, useEffect } from 'react';
import { createVendor, getVendors, deleteVendor } from '@/lib/api';
import { Vendor } from '@/types/vendor';

export default function Home() {
  const [vendors, setVendors] = useState<Vendor[]>([]);
  const [form, setForm] = useState({ name: '', category: '', contactEmail: '' });
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  const loadVendors = async () => {
    try {
      const data = await getVendors();
      setVendors(data);
    } catch {
      setError('Failed to load vendors.');
    }
  };

  // Load vendors once when the page first renders
  useEffect(() => {
    loadVendors();
  }, []);
  // The empty [] means this runs only once. Without it, the effect would
  // run after every render, causing an infinite loop of fetch requests.

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault(); // Prevent the browser from reloading the page on submit
    setLoading(true);
    setError('');
    try {
      await createVendor(form);
      setForm({ name: '', category: '', contactEmail: '' }); // Reset the form
      await loadVendors(); // Refresh the list from DynamoDB
    } catch {
      setError('Failed to add vendor. Please try again.');
    } finally {
      setLoading(false);
    }
  };

  const handleDelete = async (vendorId: string) => {
    try {
      await deleteVendor(vendorId);
      await loadVendors(); // Refresh after deleting
    } catch {
      setError('Failed to delete vendor.');
    }
  };

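  return (
    <main className="mx-auto max-w-4xl p-8">
      {/* Tailwind class names and input placeholders are illustrative */}
      <h1 className="text-2xl font-bold">Vendor Tracker</h1>
      <p className="mb-6 text-gray-600">Manage your vendors, stored in AWS DynamoDB.</p>

      {error && (
        <p className="mb-4 text-red-600">{error}</p>
      )}

      <div className="grid gap-8 md:grid-cols-2">

        {/* ── Add Vendor Form ── */}
        <section>
          <h2 className="mb-4 font-semibold">Add New Vendor</h2>
          <form onSubmit={handleSubmit} className="space-y-3">
            <input className="w-full border p-2" placeholder="Vendor name" value={form.name}
              onChange={e => setForm({ ...form, name: e.target.value })}
              required
            />
            <input className="w-full border p-2" placeholder="Category" value={form.category}
              onChange={e => setForm({ ...form, category: e.target.value })}
              required
            />
            <input className="w-full border p-2" type="email" placeholder="Contact email" value={form.contactEmail}
              onChange={e => setForm({ ...form, contactEmail: e.target.value })}
              required
            />
            <button type="submit" disabled={loading} className="border px-4 py-2">
              {loading ? 'Adding...' : 'Add Vendor'}
            </button>
          </form>
        </section>

        {/* ── Vendor List ── */}
        <section>
          <h2 className="mb-4 font-semibold">
            Current Vendors ({vendors.length})
          </h2>
          <div className="space-y-2">
            {vendors.length === 0 ? (
              <p>No vendors yet. Add one using the form.</p>
            ) : (
              vendors.map(v => (
                <div key={v.vendorId} className="flex items-center justify-between border p-3">
                  <div>
                    <p className="font-medium">{v.name}</p>
                    <p className="text-sm text-gray-600">{v.category} · {v.contactEmail}</p>
                  </div>
                  <button onClick={() => v.vendorId && handleDelete(v.vendorId)}>Delete</button>
                </div>
              ))
            )}
          </div>
        </section>

      </div>
    </main>
  );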
}

Key points in this component:

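'use client' at the top is a Next.js directive. It marks this file as a Client Component, which is required for anything that uses state, effects, or event handlers (useState, useEffect, onClick, and so on). Without it, Next.js treats the component as a Server Component, where those hooks aren't available.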
e.preventDefault() inside handleSubmit stops the browser's default form submission behavior, which would cause a full page reload and wipe your React state.
After every createVendor or deleteVendor call, loadVendors() is called again. This re-fetches the latest data from DynamoDB so the UI always matches what is actually stored in the database.

7.4 Test the App Locally

Start your Next.js development server:

cd frontend
npm run dev

Open http://localhost:3000 in your browser. You should see the two-panel layout. Try adding a vendor and confirm it appears in the list.

Verifying the connection to AWS:

Open Chrome DevTools (F12) and click the Network tab. When you add a vendor, you should see:

A POST request to your AWS API URL returning a 201 status code
A GET request returning 200 with the updated vendor list

You can also verify the data was saved by opening the AWS Console, navigating to DynamoDB --> Tables --> VendorTable --> Explore table items. Your vendor should appear there.

Part 8: Add Authentication with Amazon Cognito

Right now your API is completely open. Anyone who finds your API URL can add or delete vendors. You'll fix that with Amazon Cognito.

Cognito is AWS's authentication service. It manages a User Pool – a database of registered users with usernames and passwords. When a user logs in, Cognito issues a JWT (JSON Web Token): a cryptographically signed string that proves who the user is. Your API Gateway will check for this token on every request. No valid token means no access.

What is a JWT? A JSON Web Token is a string that looks like eyJhbGci.... It contains encoded information about the user and is signed by Cognito using a private key that only Cognito holds.

API Gateway can verify the signature without contacting Cognito on every request, which makes token checking fast. Think of it as a tamper-proof badge: anyone can read the name on it, but only Cognito's signature makes it valid.
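To see why "anyone can read the name on it", you can decode the middle part of a token yourself. The sketch below does that by hand so it has no dependencies. It assumes ASCII-only content and deliberately skips signature verification, and the sample token is a generic example with a placeholder signature, not one issued by Cognito. In real code you would more likely reach for a built-in decoder or a JWT library.

```typescript
const BASE64URL_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Decode a base64url string into text. Simplification: assumes ASCII-only content.
function base64UrlDecode(input: string): string {
  let buffer = 0;   // bits read so far that have not been emitted as a byte
  let bitCount = 0; // number of valid bits in `buffer`
  let output = "";
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (ch === "=") break; // optional padding
    const value = BASE64URL_ALPHABET.indexOf(ch);
    if (value === -1) throw new Error(`Invalid base64url character: ${ch}`);
    buffer = (buffer << 6) | value;
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      output += String.fromCharCode((buffer >> bitCount) & 0xff);
      buffer &= (1 << bitCount) - 1; // keep only the leftover bits
    }
  }
  return output;
}

// Read the claims from a JWT without verifying its signature.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const parts = token.split(".");
  const payload = parts[1];
  if (parts.length !== 3 || payload === undefined) {
    throw new Error("Expected a token with three dot-separated parts");
  }
  return JSON.parse(base64UrlDecode(payload)) as Record<string, unknown>;
}

// Sample token with a placeholder signature; it was not issued by Cognito.
const sampleToken =
  "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9." +
  "eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ." +
  "placeholder-signature";

const claims = decodeJwtPayload(sampleToken); // claims.name is "John Doe" for this sample
```

Being able to read the claims proves nothing about who issued them. That is the job of the signature check, which API Gateway performs for you, so never trust a decoded payload on its own.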

8.1 Add Cognito to the CDK Stack

Open backend/lib/backend-stack.ts and update it to include Cognito. Here is the complete updated file:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // ─── 1. DynamoDB Table ────────────────────────────────────────────────────
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: { name: 'vendorId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // ─── 2. Lambda Functions ──────────────────────────────────────────────────
    const lambdaEnv = { TABLE_NAME: vendorTable.tableName };

    const createVendorLambda = new NodejsFunction(this, 'CreateVendorHandler', {
      entry: 'lambda/createVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const getVendorsLambda = new NodejsFunction(this, 'GetVendorsHandler', {
      entry: 'lambda/getVendors.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const deleteVendorLambda = new NodejsFunction(this, 'DeleteVendorHandler', {
      entry: 'lambda/deleteVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    // ─── 3. Permissions ───────────────────────────────────────────────────────
    vendorTable.grantWriteData(createVendorLambda);
    vendorTable.grantReadData(getVendorsLambda);
    vendorTable.grantWriteData(deleteVendorLambda);

    // ─── 4. Cognito User Pool ─────────────────────────────────────────────────
    const userPool = new cognito.UserPool(this, 'VendorUserPool', {
      selfSignUpEnabled: true,
      signInAliases: { email: true },
      autoVerify: { email: true },
      userVerification: {
        emailStyle: cognito.VerificationEmailStyle.CODE,
      },
    });

    // Required to host Cognito's internal auth endpoints
    userPool.addDomain('VendorUserPoolDomain', {
      cognitoDomain: {
        domainPrefix: `vendor-tracker-${this.account}`,
      },
    });

    const userPoolClient = userPool.addClient('VendorAppClient');

    // ─── 5. API Gateway + Authorizer ──────────────────────────────────────────
    const api = new apigateway.RestApi(this, 'VendorApi', {
      restApiName: 'Vendor Service',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    const authorizer = new apigateway.CognitoUserPoolsAuthorizer(
      this,
      'VendorAuthorizer',
      { cognitoUserPools: [userPool] }
    );

    const authOptions = {
      authorizer,
      authorizationType: apigateway.AuthorizationType.COGNITO,
    };

    const vendors = api.root.addResource('vendors');
    vendors.addMethod('GET', new apigateway.LambdaIntegration(getVendorsLambda), authOptions);
    vendors.addMethod('POST', new apigateway.LambdaIntegration(createVendorLambda), authOptions);
    vendors.addMethod('DELETE', new apigateway.LambdaIntegration(deleteVendorLambda), authOptions);

    // ─── 6. Outputs ───────────────────────────────────────────────────────────
    new cdk.CfnOutput(this, 'ApiEndpoint', { value: api.url });
    new cdk.CfnOutput(this, 'UserPoolId', { value: userPool.userPoolId });
    new cdk.CfnOutput(this, 'UserPoolClientId', { value: userPoolClient.userPoolClientId });
  }
}

What changed:

CognitoUserPoolsAuthorizer tells API Gateway to check every request for a valid Cognito JWT before passing it to any Lambda. If the token is missing or invalid, API Gateway rejects the request with a 401 Unauthorized response without ever touching your Lambda.
authOptions is applied to all three API methods: GET, POST, and DELETE. All routes are now protected.
autoVerify: { email: true } tells Cognito to mark the email attribute as verified after a user confirms via the verification code email. It doesn't skip the verification email: users still receive a code. If you want to skip verification during development, you can manually confirm users in the Cognito console (covered in section 8.6).
Two new CfnOutput values (UserPoolId and UserPoolClientId) will appear in your terminal after the next deployment. Your frontend needs them to connect to Cognito.
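
To make the authorizer's behaviour concrete, here is a small conceptual model. It is not API Gateway's implementation: the real authorizer verifies the token's signature and expiry against the user pool, whereas this sketch takes that check as an injected function. The names gateRequest, GateResult, and demoValidator are hypothetical and exist only for illustration.

```typescript
// Conceptual model of a Cognito authorizer sitting in front of a Lambda.
// Token validation is injected so the control flow stays easy to follow.
type RequestHeaders = Record<string, string | undefined>;

interface GateResult {
  status: number;         // HTTP status returned to the caller
  lambdaInvoked: boolean; // whether the request ever reached the Lambda
}

function gateRequest(
  headers: RequestHeaders,
  isValidToken: (token: string) => boolean,
): GateResult {
  const token = headers['Authorization'];
  // Missing or invalid token: rejected at the gateway, the Lambda never runs.
  if (!token || !isValidToken(token)) {
    return { status: 401, lambdaInvoked: false };
  }
  // Valid token: the request is forwarded to the Lambda integration.
  return { status: 200, lambdaInvoked: true };
}

// Stand-in validator for the demo: accepts one fixed placeholder token.
const demoValidator = (token: string): boolean => token === 'valid-token';

console.log(gateRequest({}, demoValidator));
console.log(gateRequest({ Authorization: 'valid-token' }, demoValidator));
```

The point to notice is that the rejection path returns before any Lambda code runs, so unauthenticated traffic costs you no Lambda invocations.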

Deploy the updated stack:

cd backend
cdk deploy

After deployment, your terminal output will include three values:

Outputs:
BackendStack.ApiEndpoint     = https://abc123.execute-api.us-east-1.amazonaws.com/prod/
BackendStack.UserPoolId      = us-east-1_xxxxxxxx
BackendStack.UserPoolClientId = xxxxxxxxxxxxxxxxxxxx

Save all three values. You'll use them in the next step.

8.2 Install and Configure AWS Amplify

AWS Amplify is a frontend library that handles all the complex authentication logic for you: it manages the login UI, stores tokens in the browser, refreshes expired tokens automatically, and exposes a simple API to read the current user's session.

Install the Amplify libraries inside your frontend folder:

cd frontend
npm install aws-amplify @aws-amplify/ui-react

Create frontend/app/providers.tsx. This file initializes Amplify with your Cognito configuration. It runs once when the app loads:

'use client';

import { Amplify } from 'aws-amplify';

Amplify.configure(
  {
    Auth: {
      Cognito: {
        userPoolId: process.env.NEXT_PUBLIC_USER_POOL_ID!,
        userPoolClientId: process.env.NEXT_PUBLIC_USER_POOL_CLIENT_ID!,
      },
    },
  },
  { ssr: true }
);

export function Providers({ children }: { children: React.ReactNode }) {
  return <>{children}</>;
}

Add the Cognito IDs to your frontend/.env.local file:

NEXT_PUBLIC_API_URL=https://abc123.execute-api.us-east-1.amazonaws.com/prod
NEXT_PUBLIC_USER_POOL_ID=us-east-1_xxxxxxxx
NEXT_PUBLIC_USER_POOL_CLIENT_ID=xxxxxxxxxxxxxxxxxxxx

Replace the values with the outputs from your cdk deploy.
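
The non-null assertions (!) in providers.tsx silence TypeScript but do not catch a missing value at runtime. As an optional safeguard, a sketch like the following fails fast with a readable message. readCognitoConfig and CognitoConfig are hypothetical names, not part of Amplify, and the environment is passed in as a plain object so the helper can be tried without a real .env.local file.

```typescript
// Hypothetical helper: validates the Cognito settings before they are
// handed to Amplify.configure.
interface CognitoConfig {
  userPoolId: string;
  userPoolClientId: string;
}

function readCognitoConfig(env: Record<string, string | undefined>): CognitoConfig {
  const userPoolId = env['NEXT_PUBLIC_USER_POOL_ID'];
  const userPoolClientId = env['NEXT_PUBLIC_USER_POOL_CLIENT_ID'];

  // Collect every missing name so the error lists them all at once.
  const missing: string[] = [];
  if (!userPoolId) missing.push('NEXT_PUBLIC_USER_POOL_ID');
  if (!userPoolClientId) missing.push('NEXT_PUBLIC_USER_POOL_CLIENT_ID');

  if (!userPoolId || !userPoolClientId) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return { userPoolId, userPoolClientId };
}

// Demo with placeholder values in the same shape as the cdk deploy outputs.
const demoConfig = readCognitoConfig({
  NEXT_PUBLIC_USER_POOL_ID: 'us-east-1_xxxxxxxx',
  NEXT_PUBLIC_USER_POOL_CLIENT_ID: 'xxxxxxxxxxxxxxxxxxxx',
});
console.log(demoConfig);
```

In a Next.js app, pass each variable explicitly (for example, { NEXT_PUBLIC_USER_POOL_ID: process.env.NEXT_PUBLIC_USER_POOL_ID, ... }) rather than handing over process.env as a whole, because NEXT_PUBLIC_ values are inlined at build time only where they are referenced by their full name.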

8.3 Wire Providers into the App Layout

This step is critical. Amplify must be initialized before any component tries to use authentication. If you skip this step, fetchAuthSession() will throw an "Amplify not configured" error and nothing will work.

Open frontend/app/layout.tsx and update it to wrap the app in the Providers component:

import type { Metadata } from 'next';
import './globals.css';
import { Providers } from './providers';

export const metadata: Metadata = {
  title: 'Vendor Tracker',
  description: 'Manage your vendors with AWS',
};

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en">
      <body>
        <Providers>{children}</Providers>
      </body>
    </html>
  );
}

By wrapping {children} in <Providers>, you ensure that Amplify is configured once at the root of the app, before any child page or component renders.

8.4 Protect the UI with withAuthenticator

Now wrap your Home component so that unauthenticated users see a login screen instead of the dashboard.

Replace the contents of frontend/app/page.tsx with this updated version:

'use client';

import { useState, useEffect } from 'react';
import { withAuthenticator } from '@aws-amplify/ui-react';
import '@aws-amplify/ui-react/styles.css';
import { getVendors, createVendor, deleteVendor } from '@/lib/api';
import { Vendor } from '@/types/vendor';

// withAuthenticator injects `signOut` and `user` as props automatically
function Home({ signOut, user }: { signOut?: () => void; user?: any }) {
  const [vendors, setVendors] = useState<Vendor[]>([]);
  const [form, setForm] = useState({ name: '', category: '', contactEmail: '' });
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  const loadVendors = async () => {
    try {
      const data = await getVendors();
      setVendors(data);
    } catch {
      setError('Failed to load vendors.');
    }
  };

  useEffect(() => {
    loadVendors();
  }, []);

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault();
    setLoading(true);
    setError('');
    try {
      await createVendor(form);
      setForm({ name: '', category: '', contactEmail: '' });
      await loadVendors();
    } catch {
      setError('Failed to add vendor.');
    } finally {
      setLoading(false);
    }
  };

  const handleDelete = async (vendorId: string) => {
    try {
      await deleteVendor(vendorId);
      await loadVendors();
    } catch {
      setError('Failed to delete vendor.');
    }
  };

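  // The markup below is minimal and unstyled; element choices, labels, and
  // placeholders are illustrative, so adapt them to your own design.
  return (
    <main>
      {/* ── Header ── */}
      <header>
        <div>
          <h1>Vendor Tracker</h1>
          <p>Signed in as: {user?.signInDetails?.loginId}</p>
        </div>
        <button onClick={signOut}>Sign out</button>
      </header>

      {error && (
        <p role="alert">{error}</p>
      )}

      <div>

        {/* ── Add Vendor Form ── */}
        <section>
          <h2>Add New Vendor</h2>
          <form onSubmit={handleSubmit}>
            <input
              placeholder="Name"
              value={form.name}
              onChange={e => setForm({ ...form, name: e.target.value })}
              required
            />
            <input
              placeholder="Category"
              value={form.category}
              onChange={e => setForm({ ...form, category: e.target.value })}
              required
            />
            <input
              type="email"
              placeholder="Contact email"
              value={form.contactEmail}
              onChange={e => setForm({ ...form, contactEmail: e.target.value })}
              required
            />
            <button type="submit" disabled={loading}>
              {loading ? 'Adding...' : 'Add Vendor'}
            </button>
          </form>
        </section>

        {/* ── Vendor List ── */}
        <section>
          <h2>
            Current Vendors ({vendors.length})
          </h2>
          <div>
            {vendors.length === 0 ? (
              <p>No vendors yet.</p>
            ) : (
              vendors.map(v => (
                <div key={v.vendorId}>
                  <div>
                    <p>{v.name}</p>
                    <p>{v.category} · {v.contactEmail}</p>
                  </div>
                  <button onClick={() => handleDelete(v.vendorId)}>Delete</button>
                </div>
              ))
            )}
          </div>
        </section>

      </div>
    </main>
  );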
}

// Wrapping Home with withAuthenticator means any user who is not logged in
// will see Amplify's built-in login/signup screen instead of this component.
export default withAuthenticator(Home);

8.5 Pass the Auth Token to API Calls

Now that API Gateway requires a JWT on every request, your fetch calls need to include the token in the Authorization header. Without it, every request will return a 401 Unauthorized error.

Update frontend/lib/api.ts with a token helper and updated fetch calls:

import { fetchAuthSession } from 'aws-amplify/auth';
import { Vendor } from '@/types/vendor';

const BASE_URL = process.env.NEXT_PUBLIC_API_URL!;

// Retrieves the current user's JWT token from the active Amplify session
const getAuthToken = async (): Promise<string> => {
  const session = await fetchAuthSession();
  const token = session.tokens?.idToken?.toString();
  if (!token) throw new Error('No active session. Please sign in.');
  return token;
};

export const getVendors = async (): Promise<Vendor[]> => {
  const token = await getAuthToken();
  const response = await fetch(`${BASE_URL}/vendors`, {
    headers: { Authorization: token },
  });
  if (!response.ok) throw new Error('Failed to fetch vendors');
  return response.json();
};

export const createVendor = async (
  vendor: Omit<Vendor, 'vendorId'>
): Promise<void> => {
  const token = await getAuthToken();
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: token,
    },
    body: JSON.stringify(vendor),
  });
  if (!response.ok) throw new Error('Failed to create vendor');
};

export const deleteVendor = async (vendorId: string): Promise<void> => {
  const token = await getAuthToken();
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'DELETE',
    headers: {
      'Content-Type': 'application/json',
      Authorization: token,
    },
    body: JSON.stringify({ vendorId }),
  });
  if (!response.ok) throw new Error('Failed to delete vendor');
};

What getAuthToken does:

fetchAuthSession() reads the currently logged-in user's session from the browser. Amplify stores the session in memory and localStorage after the user signs in.

session.tokens?.idToken is the ID token issued by Cognito, and calling .toString() on it yields the JWT string that API Gateway's Cognito Authorizer is looking for. Passing it as the Authorization header tells API Gateway: "This request is from an authenticated user."
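
The three functions in api.ts repeat the same header-building steps. One possible way to factor that out is sketched below. buildAuthHeaders and TokenProvider are hypothetical names, and the token source is injected so the helper can run without Amplify; in the app, the provider would wrap fetchAuthSession() the way getAuthToken does.

```typescript
// Hypothetical refactor of the header-building logic from api.ts.
type TokenProvider = () => Promise<string | undefined>;

async function buildAuthHeaders(
  getToken: TokenProvider,
  hasJsonBody = false,
): Promise<Record<string, string>> {
  const token = await getToken();
  // Same guard as getAuthToken: no token means no signed-in session.
  if (!token) throw new Error('No active session. Please sign in.');

  const headers: Record<string, string> = { Authorization: token };
  // Only requests that send a JSON body need a Content-Type header.
  if (hasJsonBody) headers['Content-Type'] = 'application/json';
  return headers;
}

// Demo with a stand-in provider that returns a fixed placeholder token.
const fakeProvider: TokenProvider = async () => 'header.payload.signature';
buildAuthHeaders(fakeProvider, true).then((headers) => console.log(headers));
```

With a helper like this, each fetch call would pass headers: await buildAuthHeaders(provider, true) instead of assembling the object inline.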

8.6 Troubleshooting Cognito

When a new user signs up through the Amplify UI, Cognito marks the account as Unconfirmed until the user verifies their email address. A verification code is sent to the user's email. After entering the code, the account becomes confirmed and the user can log in.

If you are testing locally and want to skip the email step, you can manually confirm any account in the AWS Console:

Open the AWS Console and navigate to Cognito
Click on your User Pool (VendorUserPool...)
Click the Users tab
Click on the user's email address
Open the Actions dropdown and click Confirm account

401 Unauthorized errors after deployment

If you are getting 401 errors, check two things:

Open Chrome DevTools, switch to the Network tab, click the failing request, and look at the Request Headers. You should see an Authorization header with a long string of characters. If it is missing, getAuthToken is failing. Check that Amplify is configured correctly in providers.tsx and wired in via layout.tsx.
In your CDK stack, confirm that authorizationType: apigateway.AuthorizationType.COGNITO is present on every protected method definition. If it is missing, API Gateway may not be checking tokens even though the authorizer is defined.

Part 9: Deploy the Frontend with S3 and CloudFront

Your app works locally. Now you'll deploy it to a real HTTPS URL that anyone in the world can visit.

The strategy: Next.js will export your React app as a set of static HTML, CSS, and JavaScript files. Those files will be uploaded to an S3 bucket (AWS's file storage service). CloudFront sits in front of the bucket as a Content Delivery Network (CDN), distributing your files to servers around the world and serving them over HTTPS.

9.1 Configure Next.js for Static Export

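Open frontend/next.config.mjs (or next.config.js) and add the output: 'export' setting. The export default syntax shown here applies to next.config.mjs; in a CommonJS next.config.js, end the file with module.exports = nextConfig instead: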

/** @type {import('next').NextConfig} */
const nextConfig = {
  output: 'export', // Generates a static /out folder instead of a Node.js server
};

export default nextConfig;

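Note on 'use client' and static export: When output: 'export' is set, Next.js builds every page at compile time. Any component that uses hooks, event handlers, or browser-only APIs, like withAuthenticator from Amplify, must have 'use client' at the top of the file. This marks it as a Client Component whose interactive code runs in the browser. Client Components are still pre-rendered to HTML during the build, so code that reads window or document should run inside useEffect or an event handler rather than at module load or during render.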

You already have 'use client' in page.tsx. If you ever see a build error mentioning window is not defined or similar, check that the relevant component has 'use client' at the top.
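
If a build does fail with window is not defined, a common remedy is to guard the browser-only access so it is skipped when the code runs under Node.js. The sketch below shows one way to do that; isBrowser and readBrowserValue are hypothetical helper names, not Next.js APIs.

```typescript
// Returns true only when a browser `window` object exists. During
// `next build`, code runs in Node.js, where this is false.
function isBrowser(): boolean {
  return typeof (globalThis as { window?: unknown }).window !== 'undefined';
}

// Runs `read` only in the browser; otherwise returns the fallback, so the
// browser-only code is never touched while pages are being pre-rendered.
function readBrowserValue<T>(read: () => T, fallback: T): T {
  return isBrowser() ? read() : fallback;
}

// Under Node.js this prints the fallback; in a browser it would call `read`.
console.log(readBrowserValue(() => 'read in the browser', 'fallback during build'));
```

In a component you would pass something like () => window.innerWidth as the read callback, which keeps the window reference inside a function that only executes in the browser.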

Build the frontend:

cd frontend
npm run build

This generates an /out folder containing your complete website as static files. Verify the folder was created:

ls out
# You should see: index.html, _next/, etc.

9.2 Add S3 and CloudFront to the CDK Stack

Open backend/lib/backend-stack.ts and add the hosting infrastructure. Here's the complete final version of the file:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. DynamoDB Table 
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: { name: 'vendorId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // 2. Lambda Functions
    const lambdaEnv = { TABLE_NAME: vendorTable.tableName };

    const createVendorLambda = new NodejsFunction(this, 'CreateVendorHandler', {
      entry: 'lambda/createVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const getVendorsLambda = new NodejsFunction(this, 'GetVendorsHandler', {
      entry: 'lambda/getVendors.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const deleteVendorLambda = new NodejsFunction(this, 'DeleteVendorHandler', {
      entry: 'lambda/deleteVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    // 3. Permissions
    vendorTable.grantWriteData(createVendorLambda);
    vendorTable.grantReadData(getVendorsLambda);
    vendorTable.grantWriteData(deleteVendorLambda);

    // 4. Cognito User Pool
    const userPool = new cognito.UserPool(this, 'VendorUserPool', {
      selfSignUpEnabled: true,
      signInAliases: { email: true },
      autoVerify: { email: true },
      userVerification: {
        emailStyle: cognito.VerificationEmailStyle.CODE,
      },
    });

    userPool.addDomain('VendorUserPoolDomain', {
      cognitoDomain: { domainPrefix: `vendor-tracker-${this.account}` },
    });

    const userPoolClient = userPool.addClient('VendorAppClient');

    // 5. API Gateway + Authorizer
    const api = new apigateway.RestApi(this, 'VendorApi', {
      restApiName: 'Vendor Service',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    const authorizer = new apigateway.CognitoUserPoolsAuthorizer(
      this,
      'VendorAuthorizer',
      { cognitoUserPools: [userPool] }
    );

    const authOptions = {
      authorizer,
      authorizationType: apigateway.AuthorizationType.COGNITO,
    };

    const vendors = api.root.addResource('vendors');
    vendors.addMethod('GET', new apigateway.LambdaIntegration(getVendorsLambda), authOptions);
    vendors.addMethod('POST', new apigateway.LambdaIntegration(createVendorLambda), authOptions);
    vendors.addMethod('DELETE', new apigateway.LambdaIntegration(deleteVendorLambda), authOptions);

    // 6. S3 Bucket (Frontend Files) 
    const siteBucket = new s3.Bucket(this, 'VendorSiteBucket', {
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    // 7. CloudFront Distribution (HTTPS + CDN)
    const distribution = new cloudfront.Distribution(this, 'SiteDistribution', {
      defaultBehavior: {
        origin: new origins.S3Origin(siteBucket),
        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
      },
      defaultRootObject: 'index.html',
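      errorResponses: [
        // Send missing paths back to index.html so React can handle routing.
        // A private bucket reports a missing object as 403 (the origin
        // identity cannot list the bucket), so both status codes are mapped.
        {
          httpStatus: 403,
          responseHttpStatus: 200,
          responsePagePath: '/index.html',
        },
        {
          httpStatus: 404,
          responseHttpStatus: 200,
          responsePagePath: '/index.html',
        },
      ],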
    });

    // 8. Deploy Frontend Files to S3 
    new s3deploy.BucketDeployment(this, 'DeployWebsite', {
      sources: [s3deploy.Source.asset('../frontend/out')],
      destinationBucket: siteBucket,
      distribution,
      distributionPaths: ['/*'], // Clears CloudFront cache on every deploy
    });

    // 9. Outputs
    new cdk.CfnOutput(this, 'ApiEndpoint', { value: api.url });
    new cdk.CfnOutput(this, 'UserPoolId', { value: userPool.userPoolId });
    new cdk.CfnOutput(this, 'UserPoolClientId', { value: userPoolClient.userPoolClientId });
    new cdk.CfnOutput(this, 'CloudFrontURL', {
      value: `https://${distribution.distributionDomainName}`,
    });
  }
}

What the hosting infrastructure does:

The S3 bucket stores your static HTML, CSS, and JavaScript files. It is private – users cannot access it directly.
CloudFront is the CDN that sits in front of S3. It gives you an HTTPS URL and caches your files at edge locations worldwide, so the app loads fast no matter where users are located. REDIRECT_TO_HTTPS automatically upgrades any HTTP request to HTTPS.
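The error response returns index.html instead of an error page. This is necessary for single-page apps: if a user navigates directly to a route like /vendors/123, CloudFront cannot find a file at that path, but sending back index.html lets the React app handle the routing correctly. Because the bucket is private and CloudFront's origin identity can read objects but not list them, S3 typically reports a missing object as 403 rather than 404, so check that the status code you map matches what S3 actually returns.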
distributionPaths: ['/*'] tells CloudFront to invalidate its entire cache after every deployment. This ensures users always see the latest version of your app immediately.
BucketDeployment is a CDK construct that automatically uploads the contents of your frontend/out folder to the S3 bucket every time you run cdk deploy.
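
The request flow described above can be sketched as a small model. This is a conceptual illustration of defaultRootObject and the error-response fallback, not CloudFront's implementation: caching and status-code details are ignored, and resolveRequest, EdgeResponse, and siteKeys are hypothetical names.

```typescript
// Conceptual model of how the distribution answers a request.
// `bucketKeys` stands in for the objects uploaded from frontend/out.
interface EdgeResponse {
  status: number;
  servedKey: string;
}

function resolveRequest(path: string, bucketKeys: Set<string>): EdgeResponse {
  // defaultRootObject: a request for "/" is served index.html.
  const key = path === '/' ? 'index.html' : path.replace(/^\//, '');
  if (bucketKeys.has(key)) {
    return { status: 200, servedKey: key };
  }
  // errorResponses: a missing object is answered with /index.html and a 200,
  // so the client-side app can decide what to render for that path.
  return { status: 200, servedKey: 'index.html' };
}

const siteKeys = new Set(['index.html', '_next/static/app.js']);
console.log(resolveRequest('/', siteKeys));
console.log(resolveRequest('/_next/static/app.js', siteKeys));
console.log(resolveRequest('/vendors/123', siteKeys));
```

The third call is the single-page-app case: no object exists at vendors/123, so the model falls back to index.html.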

9.3 Run the Final Deployment

First, build the frontend with the latest environment variables:

cd frontend
npm run build

Then deploy everything from the backend folder:

cd ../backend
cdk deploy

After deployment finishes, copy the CloudFrontURL from the terminal output:

Outputs:
BackendStack.CloudFrontURL = https://d1234abcd.cloudfront.net

Open that URL in your browser. Your app is now live on the internet, served over HTTPS, globally distributed.

What You Built

You now have a fully deployed, production-style full-stack application. Here is a summary of every piece you built and what it does:

Layer	Service	What it does
Frontend	Next.js + CloudFront	React UI served globally over HTTPS
Auth	Amazon Cognito + Amplify	User sign-up, login, and JWT token management
API	API Gateway	Routes HTTP requests, validates auth tokens
Logic	AWS Lambda (×3)	Creates, reads, and deletes vendors on demand
Database	DynamoDB	Stores vendor records with no idle cost
Storage	S3	Holds your built frontend files
Infrastructure	AWS CDK	Defines and deploys all of the above as code

Conclusion

You have built and deployed the foundational pattern of almost every cloud application: a secured API backed by a database, deployed with infrastructure as code. Here is everything you accomplished:

You set up a professional AWS development environment with scoped IAM credentials. You defined your entire backend infrastructure as TypeScript code using AWS CDK, which means your database, API, Lambda functions, and authentication system are all version-controlled, repeatable, and deployable with a single command.

You wrote three Lambda functions that handle create, read, and delete operations, each with proper error handling and the correct AWS SDK v3 patterns. You connected them to a REST API through API Gateway and protected every route with Amazon Cognito authentication, so only registered, verified users can interact with your data.

On the frontend, you built a Next.js application with a service layer that cleanly separates API logic from UI components, manages JWTs automatically through AWS Amplify, and gives users a complete sign-up and sign-in flow without you writing a single line of authentication UI code.

Finally, you deployed the entire system: your backend to AWS Lambda and DynamoDB, and your frontend as a static site served globally through CloudFront over HTTPS.

The full source code for this tutorial is available on GitHub. Clone it, modify it, and use it as a reference for your own projects.

How to Build a Serverless RAG Pipeline on AWS That Scales to Zero

Christopher Galliart — Wed, 11 Mar 2026 18:19:40 +0000

Most RAG tutorials end the same way: you've got a working prototype and a bill for a vector database that runs whether anyone's querying it or not. Add an always-on embedding service, a hosted LLM endpoint, and the usual AWS infrastructure, and you're looking at real money before a single user shows up.

But it doesn't have to work that way. In this tutorial, you'll deploy a fully serverless RAG pipeline that processes documents, images, video, and audio, then scales to zero when nobody's using it.

Everything runs in your AWS account, your data never leaves your infrastructure, and your ongoing monthly cost for a modest knowledge base will be closer to 2-3 USD than 300 USD.

We'll use RAGStack-Lambda, an open-source project I built on AWS. By the end, you'll have a deployed pipeline with a dashboard, an AI chat interface with source citations, a drop-in web component you can embed in any app, and an MCP server you can use to feed your assistant context.

What This Actually Costs

Before we build anything, let's talk money, because the cost story is the whole point.

RAG pipelines have two cost phases: ingestion (processing your documents once) and operation (querying them over time).

Most platforms charge you a flat monthly rate regardless of which phase you're in. A serverless architecture flips that: ingestion costs something, and then everything scales to zero.

Ingestion: The One-Time Hit

When you upload documents, several things happen: text extraction (OCR for PDFs and images), embedding generation, metadata extraction, and storage. Here's what that actually costs per service:

Textract (OCR): This is the most expensive part of ingestion, and it only applies to scanned PDFs and images that need text extraction. Plain text, HTML, CSV, and other text-based formats skip this entirely.

Textract charges about 1.50 USD per 1,000 pages for standard text detection. If you're uploading 500 pages of scanned PDFs, that's about 0.75 USD. A heavy initial load of several thousand scanned pages might run 5-10 USD. But once your documents are processed, you never pay this again unless you add new ones.

Bedrock Embeddings (Nova Multimodal): This is where your content gets converted into vectors for semantic search. The pricing is almost comically cheap:

Text: 0.00002 USD per 1,000 input tokens
Images: 0.00115 USD per image
Video/Audio: 0.00200 USD per minute

To put that in perspective: if you have 1,500 text documents averaging 2,500 tokens each after chunking, your total embedding cost is about 0.08 USD. A knowledge base with 500 images runs 0.58 USD. Even a mixed corpus of text, images, and a few hours of video stays well under 2 USD for the entire embedding pass. This is a one-time cost – you only re-embed if you add or update documents.

Bedrock LLM (Metadata Extraction): RAGStack uses an LLM to analyze each document and extract structured metadata automatically. This is a few inference calls per document using Nova Lite or a similar model. At 0.06 USD/0.24 USD per million input/output tokens, processing 1,500 documents costs well under 1 USD.

S3 Vectors (Storage): Storing your embeddings. At 0.06 USD per GB/month, a knowledge base of 1,500 documents with 1,024-dimension vectors takes up a trivially small amount of space. We're talking pennies per month.

S3 (Document Storage): Your source documents in standard S3. Even cheaper, 0.023 USD per GB/month.

DynamoDB: Stores document metadata and processing state. The on-demand pricing model means you pay per request during ingestion, then essentially nothing at rest. A few cents for the initial load.

To put real numbers on it: if you upload 200 text documents (PDFs, HTML, markdown), your total ingestion cost is likely under 1 USD. If you upload 1,000 scanned PDFs that need OCR, you might see 5-8 USD as a one-time hit. That 7-10 USD figure you might see referenced? That's the upper end for a heavy initial load with lots of OCR work.
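
If you want to run these numbers for your own corpus, the arithmetic is easy to script. The sketch below uses only the per-unit rates quoted in this section (check current AWS pricing before relying on them) and covers OCR and embeddings only; metadata extraction, storage, and DynamoDB are left out. estimateIngestionCost and IngestionLoad are hypothetical names.

```typescript
// Rates quoted in this article, in USD.
const RATES = {
  textractPerPage: 1.5 / 1000,            // 1.50 USD per 1,000 pages
  textEmbeddingPerToken: 0.00002 / 1000,  // 0.00002 USD per 1,000 tokens
  imageEmbedding: 0.00115,                // per image
  mediaEmbeddingPerMinute: 0.002,         // per minute of video or audio
};

interface IngestionLoad {
  scannedPages: number;
  textTokens: number;
  images: number;
  mediaMinutes: number;
}

// One-time OCR and embedding cost for a given load.
function estimateIngestionCost(load: IngestionLoad): number {
  return (
    load.scannedPages * RATES.textractPerPage +
    load.textTokens * RATES.textEmbeddingPerToken +
    load.images * RATES.imageEmbedding +
    load.mediaMinutes * RATES.mediaEmbeddingPerMinute
  );
}

// 1,500 text documents at 2,500 tokens each, as in the embedding example.
const textOnly = estimateIngestionCost({
  scannedPages: 0,
  textTokens: 1500 * 2500,
  images: 0,
  mediaMinutes: 0,
});
console.log(`${textOnly.toFixed(3)} USD`);
```

For the text-only load this works out to roughly 0.075 USD, which matches the "about 0.08 USD" figure quoted for embeddings.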

Operation: Where Scale-to-Zero Shines

Once your documents are ingested, the pipeline is waiting. Not running. Waiting. Here's what each query costs:

Lambda: Invocations are billed per request and duration. The free tier covers 1 million requests/month. For a personal or small-team knowledge base, you may never leave the free tier.

S3 Vectors (Queries): 2.50 USD per million query API calls, plus a per-TB data processing charge. For a small index queried a few hundred times a month, this rounds to effectively zero.

Bedrock (Chat Inference): This is your main operating cost. Each chat response requires an LLM call. Using Nova Lite at 0.06 USD per million input tokens and 0.24 USD per million output tokens, a typical RAG query (retrieval context + user question + response) might cost 0.001-0.003 USD per query. A hundred queries a month is 0.10-0.30 USD.

Step Functions: Orchestrates the document processing pipeline. Standard workflows charge 0.025 USD per 1,000 state transitions. Minimal during operation since it's only active during ingestion.

Cognito: User authentication. Free for the first 10,000 monthly active users.

CloudFront: Serves the dashboard UI. Free tier covers 1 TB of data transfer per month.

API Gateway: Handles GraphQL API requests. Free tier covers 1 million API calls per month.

Add it all up for a knowledge base with 500 documents getting a few hundred queries per month, and your monthly operating cost is somewhere between 0.50 USD and 3.00 USD. Most of that is the LLM inference for chat responses.
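
As a rough cross-check of that range, the per-query arithmetic can be sketched as follows. The rates are the ones quoted above; the token counts per query are assumptions chosen for illustration, and the sketch ignores the S3 Vectors data processing charge and any services that stay inside their free tiers. estimateMonthlyQueryCost and QueryProfile are hypothetical names.

```typescript
// Rates quoted in this article, in USD.
const USD_PER_INPUT_TOKEN = 0.06 / 1_000_000;   // Nova Lite input
const USD_PER_OUTPUT_TOKEN = 0.24 / 1_000_000;  // Nova Lite output
const USD_PER_VECTOR_QUERY = 2.5 / 1_000_000;   // S3 Vectors query API call

interface QueryProfile {
  queriesPerMonth: number;
  inputTokensPerQuery: number;  // retrieved context + question (assumed)
  outputTokensPerQuery: number; // generated answer (assumed)
}

function estimateMonthlyQueryCost(profile: QueryProfile): number {
  const perQuery =
    profile.inputTokensPerQuery * USD_PER_INPUT_TOKEN +
    profile.outputTokensPerQuery * USD_PER_OUTPUT_TOKEN +
    USD_PER_VECTOR_QUERY;
  return profile.queriesPerMonth * perQuery;
}

const monthly = estimateMonthlyQueryCost({
  queriesPerMonth: 300,
  inputTokensPerQuery: 15_000,
  outputTokensPerQuery: 500,
});
console.log(`${monthly.toFixed(2)} USD per month`);
```

With those assumed token counts, 300 queries come to roughly 0.31 USD, almost all of it LLM inference; the vector query charge is a rounding error at this volume.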

The Comparison That Matters

Here's the same pipeline on a traditional always-on stack:

Service	RAGStack-Lambda	Traditional Stack
Vector Database	S3 Vectors: pennies/mo	Pinecone Starter: 70 USD/mo
Vector Database (alt)	S3 Vectors: pennies/mo	OpenSearch Serverless: about 350 USD/mo min
Compute	Lambda: free tier	EC2 or ECS: 50-150 USD/mo
LLM Inference	Same per-query cost	Same per-query cost
Total (idle)	about 0.50-3.00 USD/mo	120-500 USD/mo

The LLM inference cost per query is roughly the same everywhere – that's Bedrock's on-demand pricing regardless of your architecture. The difference is everything else. Traditional stacks pay a floor cost whether anyone's using them or not. A serverless stack pays for what it uses, and idle costs essentially nothing.

What About Transcribe?

If you're uploading video or audio, Amazon Transcribe adds cost for speech-to-text conversion. Standard transcription runs about 0.024 USD per minute of audio. A 10-minute video costs 0.24 USD to transcribe. This is a one-time ingestion cost: once transcribed and embedded, the resulting text chunks are queried like any other document.
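
The pipeline splits transcripts into 30-second searchable chunks with speaker identification. A minimal sketch of how such windowing can work is shown below. It is not RAGStack's actual code: the TranscriptSegment shape, the speaker labels, and chunkTranscript are assumptions made for illustration.

```typescript
// A piece of transcribed speech with its start time. The shape is assumed.
interface TranscriptSegment {
  startSeconds: number;
  speaker: string;
  text: string;
}

interface TranscriptChunk {
  windowStartSeconds: number; // usable as a "jump to" timestamp
  speakers: string[];
  text: string;
}

// Groups segments into fixed-size time windows based on their start time.
function chunkTranscript(
  segments: TranscriptSegment[],
  windowSeconds = 30,
): TranscriptChunk[] {
  const chunksByWindow = new Map<number, TranscriptChunk>();
  for (const segment of segments) {
    const windowStartSeconds =
      Math.floor(segment.startSeconds / windowSeconds) * windowSeconds;
    let chunk = chunksByWindow.get(windowStartSeconds);
    if (!chunk) {
      chunk = { windowStartSeconds, speakers: [], text: '' };
      chunksByWindow.set(windowStartSeconds, chunk);
    }
    if (!chunk.speakers.includes(segment.speaker)) {
      chunk.speakers.push(segment.speaker);
    }
    chunk.text = chunk.text ? `${chunk.text} ${segment.text}` : segment.text;
  }
  return Array.from(chunksByWindow.values()).sort(
    (a, b) => a.windowStartSeconds - b.windowStartSeconds,
  );
}

const sampleChunks = chunkTranscript([
  { startSeconds: 2, speaker: 'spk_0', text: 'Welcome to the demo.' },
  { startSeconds: 14, speaker: 'spk_1', text: 'Thanks for having me.' },
  { startSeconds: 33, speaker: 'spk_0', text: 'Let us look at pricing.' },
]);
console.log(sampleChunks);
```

Each chunk keeps its window start time, which is the kind of information a citation needs in order to link back to a position in the recording.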

What You're Building

By the end of this tutorial, you'll have a deployed pipeline that does the following:

You upload a document (PDF, image, video, audio, HTML, CSV; the full list is extensive) through a web dashboard.
The pipeline detects the file type and routes it to the right processor. Scanned PDFs go through OCR via Textract. Video and audio go through Transcribe for speech-to-text, split into 30-second searchable chunks with speaker identification. Images get visual embeddings and any caption text you provide.
An LLM analyzes each document and extracts structured metadata: topic, document type, date range, people mentioned, whatever's relevant. This happens automatically.
Everything gets embedded using Amazon Nova Multimodal Embeddings and stored in a Bedrock Knowledge Base backed by S3 Vectors.
You (or your users) ask questions through an AI chat interface. The pipeline retrieves relevant documents, passes them as context to a Bedrock LLM, and returns an answer with collapsible source citations, including timestamp links for video and audio that jump to the exact position.

All of this runs in your AWS account. No external control plane, no third-party services beyond AWS itself.
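
The routing step in that flow can be pictured as a lookup from file type to processor. The sketch below is illustrative only: the extension list is a small hypothetical subset, the processor names are made up, and routeDocument is not a function from the project.

```typescript
type Processor = 'ocr' | 'transcribe' | 'image' | 'text';

// Hypothetical extension map; the project's real list of formats is longer.
const PROCESSOR_BY_EXTENSION = new Map<string, Processor>([
  ['png', 'image'], ['jpg', 'image'], ['jpeg', 'image'],
  ['mp4', 'transcribe'], ['mp3', 'transcribe'], ['wav', 'transcribe'],
  ['html', 'text'], ['csv', 'text'], ['md', 'text'], ['txt', 'text'],
]);

// Picks a processor from the file extension. PDFs need a second signal,
// because only scanned PDFs require OCR.
function routeDocument(fileName: string, isScanned = false): Processor {
  const extension = fileName.split('.').pop()?.toLowerCase() ?? '';
  if (extension === 'pdf') return isScanned ? 'ocr' : 'text';
  const processor = PROCESSOR_BY_EXTENSION.get(extension);
  if (!processor) throw new Error(`Unsupported file type: ${fileName}`);
  return processor;
}

console.log(routeDocument('contract-scan.pdf', true));
console.log(routeDocument('all-hands.MP4'));
console.log(routeDocument('notes.md'));
```

Keeping the mapping in data rather than in a chain of conditionals makes it straightforward to add formats without touching the routing logic.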

The Architecture

A few things to note about this architecture:

Step Functions orchestrate everything. When a document is uploaded, a state machine manages the entire processing flow: detecting the file type, routing to the right processor, waiting for async operations like Transcribe jobs, then triggering embedding and metadata extraction.

This is what makes the pipeline reliable without a running server. If a step fails, it retries. You can see exactly where every document is in the processing pipeline.
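
To illustrate the retry behaviour, here is a synchronous simulation of a retry policy. The field names mirror the IntervalSeconds, MaxAttempts, and BackoffRate settings of a Step Functions Retry block, but the values used and the runWithRetry helper are assumptions for illustration; the sketch records the waits instead of actually sleeping.

```typescript
interface RetryPolicy {
  intervalSeconds: number; // wait before the first retry
  maxAttempts: number;     // number of retries after the initial attempt
  backoffRate: number;     // multiplier applied to the wait after each retry
}

// Runs `step`, retrying on failure according to the policy. Returns the
// result together with the waits a real workflow would have observed.
function runWithRetry<T>(
  step: () => T,
  policy: RetryPolicy,
): { result: T; waits: number[] } {
  const waits: number[] = [];
  let wait = policy.intervalSeconds;
  for (;;) {
    try {
      return { result: step(), waits };
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (waits.length >= policy.maxAttempts) throw err;
      waits.push(wait); // a real workflow would pause here
      wait *= policy.backoffRate;
    }
  }
}

// A step that fails twice with a transient error, then succeeds.
let remainingFailures = 2;
const flakyStep = (): string => {
  if (remainingFailures > 0) {
    remainingFailures -= 1;
    throw new Error('transient failure');
  }
  return 'processed';
};

const outcome = runWithRetry(flakyStep, {
  intervalSeconds: 1,
  maxAttempts: 3,
  backoffRate: 2,
});
console.log(outcome);
```

The step succeeds on its third attempt after waits of 1 and 2 seconds, which is the exponential backoff pattern a state machine applies on your behalf.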

Lambda does the compute. Every processing step is a Lambda function. They spin up when needed, run for a few seconds to a few minutes, and shut down. There's no EC2 instance idling at 3 AM.

S3 Vectors is the vector store. Your embeddings live in S3's purpose-built vector storage rather than in a dedicated vector database like Pinecone or OpenSearch.

This is what makes the "scale to zero" cost possible: you're paying object storage rates for vector data instead of keeping a database cluster warm. It also means your vectors are sitting in your own S3 bucket, not in a third-party managed service that holds your data on their terms.
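
For readers new to vector search, the core idea behind a vector query is ranking stored embeddings by similarity to the query embedding. The brute-force sketch below shows that idea with cosine similarity. It is a conceptual illustration only: a managed vector store does not scan every vector like this, and cosineSimilarity, topK, and the tiny two-dimensional vectors are assumptions for the example.

```typescript
interface StoredVector {
  id: string;
  values: number[];
}

// Cosine similarity of two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the ids of the k stored vectors most similar to the query.
function topK(query: number[], vectors: StoredVector[], k: number): string[] {
  return vectors
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.id);
}

const vectorIndex: StoredVector[] = [
  { id: 'doc-a', values: [1, 0] },
  { id: 'doc-b', values: [0, 1] },
  { id: 'doc-c', values: [0.9, 0.1] },
];
console.log(topK([1, 0], vectorIndex, 2));
```

Real embeddings have many more dimensions (1,024 in the configuration described earlier), but the ranking principle is the same.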

Cognito handles auth. The dashboard and API are protected with Cognito user pools. When you deploy, you get a temporary password via email. The web component uses IAM-based authentication, and server-side integrations use API key auth.

CloudFront serves the UI. The dashboard is a static React app served through CloudFront, so there's no web server to maintain.

Two Ways to Deploy

You have two deployment paths depending on what you want:

AWS Marketplace (the fast path): Click deploy, fill in two fields (stack name and email), and wait about 10 minutes. No local tooling required. This is the path we'll walk through first.

From Source (the developer path): Clone the repo, run publish.py, and deploy via SAM CLI. This is the path for when you want to customize the processing pipeline, modify the UI, or contribute to the project. We'll cover this after the Marketplace walkthrough.

Both paths produce the same stack. The Marketplace version just wraps the CloudFormation template in a one-click deployment.

Prerequisites

Before you deploy, you'll need:

An AWS account with permissions to create CloudFormation stacks, Lambda functions, S3 buckets, DynamoDB tables, and Cognito user pools. If you're using an admin account, you're covered.
Bedrock model access: RAGStack defaults to us-east-1 because that's where Nova Multimodal Embeddings is available. Amazon's own models (including Nova) are available by default in Bedrock, with no manual enablement required. Just make sure your IAM role has the necessary bedrock:InvokeModel permissions.
For the Marketplace path: just a web browser.
For the source path: Python 3.13+, Node.js 24+, AWS CLI and SAM CLI configured, and Docker (for building Lambda layers).

Deploying from AWS Marketplace

This is the fastest path – no local tools, no CLI, no Docker. You'll launch a CloudFormation stack and have a working pipeline in about 10 minutes.

Step 1: Launch the Stack

Click the direct deploy link to open CloudFormation's "Quick create stack" page with the template pre-loaded.

Step 2: Fill In Two Fields

The page has a lot of options, but you only need two:

Stack name: Must be lowercase. This becomes the prefix for all your AWS resources (for example, my-docs, team-kb, project-notes). Keep it short.
Admin Email: Under Required Settings. Cognito will send your temporary login credentials here. Use an email you can access right now.

Everything else – Build Options, Advanced Settings, OCR Backend, model selections – can stay at the defaults. They're there for customization later, but the defaults work out of the box.

Step 3: Deploy

Scroll to the bottom, check the three acknowledgment boxes under "Capabilities and transforms," and click Create stack.

Deployment takes roughly 10 minutes. You can watch the progress in the CloudFormation Events tab if you're curious, but there's nothing to do until the stack status flips to CREATE_COMPLETE.
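If you'd rather wait from a terminal than refresh the Events tab, the polling loop can be sketched in a few lines of Python. This is a minimal sketch, not part of RAGStack: wait_for_stack and TERMINAL_STATUSES are hypothetical names, and the only AWS call it relies on is CloudFormation's describe_stacks, which reports StackStatus for a named stack. The client is passed in, so nothing here talks to AWS until you hand it a real one.

```python
import time
from typing import Callable

# Statuses after which a stack launch will not change on its own.
TERMINAL_STATUSES = {
    "CREATE_COMPLETE",
    "CREATE_FAILED",
    "ROLLBACK_COMPLETE",
    "ROLLBACK_FAILED",
}


def wait_for_stack(
    cfn_client,
    stack_name: str,
    poll_seconds: float = 15.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe_stacks until the stack reaches a terminal status."""
    for _ in range(max_polls):
        response = cfn_client.describe_stacks(StackName=stack_name)
        status = response["Stacks"][0]["StackStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} did not finish within {max_polls} polls")
```

With boto3 installed and credentials configured, you would call it as wait_for_stack(boto3.client("cloudformation", region_name="us-east-1"), "my-docs"). boto3 also ships a built-in stack_create_complete waiter if you don't need the intermediate statuses.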

Step 4: Log In

Once the stack finishes, check your email. Cognito sends you the dashboard URL and a temporary password. Log in, set a new password, and you're looking at an empty dashboard ready for documents.

Deploying from Source

If you want to customize the pipeline, modify the UI, or contribute to the project, deploy from source instead.

Step 1: Clone and Set Up

git clone https://github.com/HatmanStack/RAGStack-Lambda.git
cd RAGStack-Lambda

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Step 2: Deploy

The publish.py script handles everything: building the frontend, packaging Lambda functions, and deploying via SAM CLI.

python publish.py \
  --project-name my-docs \
  --admin-email admin@example.com

This defaults to us-east-1 for Nova Multimodal Embeddings. The script will build the React dashboard, build the web component, package all Lambda layers with Docker, and deploy the CloudFormation stack through SAM.

First deploy takes longer (15-20 minutes) because it's building everything from scratch. Subsequent deploys are faster since SAM caches unchanged resources.

If you only want to iterate on the backend and skip UI builds:

# Skip dashboard build (still builds web component)
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui

# Skip ALL UI builds
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui-all

Once it finishes, you'll get the same Cognito email and dashboard URL as the Marketplace path.

Uploading Your First Documents

The dashboard has tabs for different content types. We'll start with the Documents tab since that's the most common use case.

Documents

Click the Documents tab and upload a file. RAGStack accepts a wide range of formats: PDF, DOCX, XLSX, HTML, CSV, JSON, XML, EML, EPUB, TXT, and Markdown. Drag and drop or use the file picker.

Once uploaded, the document enters the processing pipeline. You'll see the status update in real time:

UPLOADED: File received and stored in S3.
PROCESSING: Step Functions has picked it up and routed it to the right processor. Text-based files (HTML, CSV, Markdown) go through direct extraction. Scanned PDFs and images go through Textract OCR. The LLM analyzes the content and extracts structured metadata: topic, document type, people mentioned, date ranges, and whatever else is relevant to the content.
INDEXED: Embeddings generated, vectors stored, document is searchable.

Text documents typically process in 1-5 minutes. OCR-heavy documents (scanned PDFs, images with text) can take 2-15 minutes depending on page count.
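The routing decision in the PROCESSING step can be pictured as a small lookup on file type. The sketch below is illustrative only and is not taken from the RAGStack source: choose_processor, the extension sets, and the has_text_layer flag are all assumptions, and formats such as DOCX or EPUB are deliberately left to a catch-all because the article doesn't say which route they take.

```python
from pathlib import Path

# Hypothetical groupings based on the examples given in the text above.
DIRECT_TEXT = {".html", ".csv", ".md", ".txt"}
OCR_IMAGES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def choose_processor(filename: str, has_text_layer: bool = True) -> str:
    """Return the name of the processing route for an uploaded file."""
    suffix = Path(filename).suffix.lower()
    if suffix in DIRECT_TEXT:
        return "direct_extraction"
    if suffix in OCR_IMAGES:
        return "textract_ocr"
    if suffix == ".pdf":
        # Assumption: a PDF with embedded text skips OCR; a scanned one does not.
        return "direct_extraction" if has_text_layer else "textract_ocr"
    # Other accepted formats are handled by routes not modelled in this sketch.
    return "other"
```

A scanned contract would be routed with choose_processor("contract.pdf", has_text_layer=False), which is the slower OCR path the timing estimates above refer to.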

Images

The Images tab works differently. Upload a JPG, PNG, GIF, or WebP and you can add a caption. Both the visual content and caption text get embedded using Nova Multimodal Embeddings, so you can search by what's in the image or by your description of it.

This is where multimodal embeddings earn their keep. A traditional text-only RAG pipeline would need you to describe every image manually. Here, the image itself becomes searchable, and since everything stays in your AWS account, you're not sending personal photos or sensitive visual content to an external service to get there.

What About Video and Audio?

Upload video or audio files and RAGStack routes them through Amazon Transcribe for speech-to-text conversion. The transcript gets split into 30-second chunks with speaker identification, then embedded like any other document. When chat results reference a video source, you get timestamp links that jump to the exact position in the recording.
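To see what 30-second chunking with speaker labels might look like, here is a minimal sketch. It is an assumption about the general technique, not RAGStack's code: Segment, chunk_transcript, and timestamp_link are hypothetical names, and the input is a simplified list of timed segments rather than real transcription output. The link helper uses the standard #t= media fragment, which browsers understand for audio and video URLs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    speaker: str
    text: str


def chunk_transcript(segments: List[Segment], window_seconds: float = 30.0) -> List[dict]:
    """Group segments into fixed windows based on each segment's start time."""
    chunks: Dict[int, dict] = {}
    for seg in segments:
        index = int(seg.start // window_seconds)
        chunk = chunks.setdefault(index, {"start": index * window_seconds, "lines": []})
        chunk["lines"].append(f"{seg.speaker}: {seg.text}")
    return [
        {"start": chunk["start"], "text": "\n".join(chunk["lines"])}
        for _, chunk in sorted(chunks.items())
    ]


def timestamp_link(media_url: str, start: float) -> str:
    """Build a link that opens the media at the chunk's start time."""
    return f"{media_url}#t={int(start)}"
```

Each chunk keeps its start offset, which is the piece of information a citation needs in order to jump to the right moment.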

Web Scraping

The Scrape tab lets you pull websites directly into your knowledge base. Enter a URL and RAGStack crawls the page, extracts the content, and processes it through the same pipeline as uploaded documents: metadata extraction, embedding, and indexing.

This is useful for building a knowledge base from existing web content without manually saving and uploading pages. Documentation sites, blog archives, reference material, anything publicly accessible.

Chatting With Your Knowledge Base

This is the payoff. Go to the Chat tab, type a question, and RAGStack retrieves relevant documents from your knowledge base, passes them as context to a Bedrock LLM, and returns an answer with source citations.

The citations are collapsible, so click to expand and see which documents informed the answer, with the option to download the source file. For video and audio sources, you get clickable timestamps that jump to the relevant moment.

Metadata Filtering

If you've uploaded enough documents to have meaningful metadata categories, the chat interface lets you filter search results by metadata before querying. RAGStack auto-discovers the metadata structure from your documents, so you don't configure this manually. It just appears as your knowledge base grows.

This is useful when you have a large mixed corpus. Instead of hoping the vector search picks the right context from thousands of documents, you can narrow it down: "only search documents about project X" or "only search content from Q4 2024."
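To make the idea concrete, here is a minimal in-memory sketch of filter-then-rank. It is not RAGStack's implementation or the S3 Vectors API: the search function, the record layout, and the two-dimensional vectors are all made up for illustration. The point is only that the metadata filter shrinks the candidate set before similarity ranking runs.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: Sequence[float],
    docs: List[dict],
    filters: Optional[Dict[str, str]] = None,
    top_k: int = 3,
) -> List[str]:
    """Keep only docs whose metadata matches every filter, then rank by similarity."""
    filters = filters or {}
    candidates = [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda doc: cosine(query_vec, doc["vector"]), reverse=True)
    return [doc["id"] for doc in ranked[:top_k]]
```

Calling search(query, docs, filters={"project": "X"}) never scores documents from other projects, which is why a near-duplicate from an unrelated project can't crowd out the context you wanted.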

Embedding the Web Component in Your App

The dashboard is useful for managing your knowledge base, but the real power is embedding RAGStack's chat in your own application. The web component works with any framework: React, Vue, Angular, Svelte, or plain HTML.

Load the script once from your CloudFront distribution:

Then drop the component wherever you want a chat interface:

That's it. The component handles authentication (via IAM), manages conversation state, and renders source citations, all self-contained. Your CloudFront URL is in the stack outputs.

For server-side integrations that don't need a UI, the GraphQL API is available with API key authentication. You can find your endpoint and API key in the dashboard under Settings.

Using the MCP Server

RAGStack includes an MCP server that connects your knowledge base to AI assistants like Claude Desktop, Cursor, VS Code, and Amazon Q CLI. Instead of switching to the dashboard to search your documents, you ask your assistant directly.

Install it:

pip install ragstack-mcp

Then add it to your AI assistant's MCP configuration:

{
  "ragstack": {
    "command": "uvx",
    "args": ["ragstack-mcp"],
    "env": {
      "RAGSTACK_GRAPHQL_ENDPOINT": "YOUR_ENDPOINT",
      "RAGSTACK_API_KEY": "YOUR_API_KEY"
    }
  }
}

Your endpoint and API key are in the dashboard under Settings. Once configured, type @ragstack in your assistant's chat to invoke the MCP server, then ask things like "search my knowledge base for authentication docs" and it queries RAGStack directly.

See the MCP Server docs for the full list of available tools and setup details.

What You Can Build From Here

You've got a deployed RAG pipeline that costs almost nothing to run and handles text, images, video, and audio. A few directions you might take it:

A searchable personal archive. Every conference talk you've saved, every PDF textbook, every tutorial video that's sitting in a folder somewhere. Upload it all, and now you have one search interface across years of accumulated material. The multimodal embeddings mean your screenshots and diagrams are searchable too, not just the text.

I built a family archive app this way (scanned letters, old photos, home videos), with RAGStack deployed as a nested CloudFormation stack so the whole family can search across decades of memories using the chat widget.

A second brain for a client project. Scrape the client's existing docs, upload the SOW and meeting notes, drop in the codebase documentation. Now you've got a searchable knowledge base scoped to that engagement. Spin it up at the start, tear it down when the contract ends. At these costs, it's disposable infrastructure.

AI chat over a niche dataset. Recipe collections, legal filings, research papers, local government meeting minutes: any corpus that's too specialized for general-purpose LLMs to know well. The web component means you can ship it as a standalone tool without building a frontend from scratch.

RAG for your MCP workflow. If you're already using Claude Desktop or Cursor, the MCP server turns your knowledge base into another tool your assistant can reach for. Upload your team's runbooks and architecture docs, and now @ragstack in your editor gives you instant context without tab-switching.

Wrapping Up

The serverless RAG pipeline you just deployed handles document processing, multimodal embeddings, metadata extraction, and AI chat with source citations, all scaling to zero when idle, all running in your AWS account. Your documents, your vectors, your infrastructure. The traditional approach to this stack costs 120-500 USD/month in baseline infrastructure. This one costs pocket change.

The full source is at github.com/HatmanStack/RAGStack-Lambda. File issues, open PRs, or just poke around the architecture. If you want to go deeper on the technical tradeoffs, particularly how filtered vector search behaves on cost-optimized backends like S3 Vectors, that's a story for the next post.

How to Deploy a MERN Stack Notes App on AWS

Umair Mirza — Sat, 17 Jan 2026 02:25:39 +0000

Platforms like Vercel, Netlify, and Render simplify deployment by handling infrastructure for you. In this tutorial, we’ll step one layer deeper and work directly with AWS to understand the building blocks behind these platforms.

You'll take a small React and Express notes app and ship it straight to AWS. We'll use EC2 for the API, RDS Postgres for the database, and S3 (optionally CloudFront) for the frontend. If you're new to AWS, you can turn on the Free Tier first: https://aws.amazon.com/free.

If you’ve mostly used one-click deployments before, this guide will help you understand what’s happening behind the scenes. You’ll work directly with the core AWS services involved, focusing only on the pieces that matter so you can see how everything fits together. This will also enable you to have more control over cost, security, and scaling.

If you just want to grab the finished code, it's all in this public repo: umair-mirza/mern-notes-aws. You can clone or fork it and follow along without creating a new project from scratch.

What You’ll Build
Prerequisites
Mental Map
Free Tier Basics
Environment Variables
Step #1 - Run It Locally First
Step #2 - Push to GitHub (So EC2 Can Pull)
Step #3 - Create AWS Resources (Quick Path)
Step #4 - Configure the EC2 Box
Step #5 - Build and Upload the Frontend
Step #6 - Quick Troubleshooting
Step #7 - Secure and Save
Step #8 - Verify End-to-End
Next Steps

What You’ll Build

Before touching any buttons in AWS, it's helpful to know the exact pieces you're trying to build. At the end of this guide, you'll have a classic three-tier web app: a browser-based frontend, a backend API, and a database, all talking to each other over a network.

API (Express/Node) on EC2
Postgres on RDS (Free Tier eligible)
React/Vite frontend on S3 (CloudFront optional for CDN/HTTPS)
Health check at /api/health and CRUD at /api/notes

Prerequisites

You don't need to be a DevOps expert to follow along, but you should be comfortable running basic commands in a terminal and editing some config files. If you've ever used npm install before, then you're in the right place.

AWS account + AWS CLI configured (aws configure) – see AWS account setup and AWS CLI install.
Node.js 18+ and npm – get it from nodejs.org.
Git + GitHub repo – see GitHub getting started.
(Optional) Route 53 domain for a clean URL – Route 53 domains.

Mental Map

AWS throws a lot of jargon at you (VPCs, security groups, subnets). This section is the story version of what happens when someone opens your app in the browser, without any buzzwords. If you can picture this flow, the later AWS screens will feel less scary.

Browser loads the built React app from S3 (or CloudFront -> S3)
Browser calls the API on EC2 over HTTP/HTTPS
EC2 talks to RDS Postgres on port 5432 inside your VPC
Security groups: allow 80/443 to EC2; allow 5432 only from the EC2 SG to RDS

Free Tier Basics

AWS can be cheap if you use the free tier, but it can also surprise you with bills if you accidentally overprovision or leave things running. Here are the main knobs that affect cost for this tutorial and what to watch out for.

EC2: t2.micro or t3.micro ~750 hours/month
RDS: db.t3.micro Postgres/MySQL with ~20 GB storage
S3/CloudFront: Small sites cost pennies - free tier includes some egress
Save money: Stop EC2 when idle. Delete unused buckets/DBs
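The 750-hour figure is easier to reason about with a little arithmetic: one instance running all month fits, two do not. The helper below is a rough sketch under two assumptions, namely that the allowance is 750 hours per month as listed above and that hours are pooled across all your eligible instances. The function names are hypothetical, and this is not a billing calculator.

```python
FREE_TIER_HOURS = 750  # monthly EC2 allowance quoted in the list above


def instance_hours(instances: int, days: int, hours_per_day: float = 24) -> float:
    """Total instance-hours consumed in a month."""
    return instances * days * hours_per_day


def hours_over_allowance(instances: int, days: int, hours_per_day: float = 24) -> float:
    """Instance-hours that exceed the monthly allowance (0 if within it)."""
    used = instance_hours(instances, days, hours_per_day)
    return max(0, used - FREE_TIER_HOURS)
```

One instance for a 31-day month uses 31 x 24 = 744 hours, just under the allowance. A second always-on instance pushes the total to 1,488 hours, which is why stopping idle instances matters.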

Environment Variables

Environment variables are just configuration values that live outside your code: ports, database URLs, and allowed origins. They keep secrets (like DB passwords) out of your Git repo and let the same code run in different places (local, staging, production) with different settings.

Backend: PORT, DATABASE_URL (your RDS endpoint), DATABASE_SSL (true on RDS), CORS_ORIGIN
Frontend: VITE_API_URL (API base, for example, https://api.example.com/api)

Step #1 - Run It Locally First

Before touching AWS, you want to prove the app actually works on your own machine. This removes a whole category of "Is it AWS or my code?" debugging later. In this step you just install dependencies and run both backend and frontend in dev mode.

cd mern-notes-aws

# Backend
cd backend
npm install
cp .env.example .env   # set DATABASE_URL to RDS (or local Postgres), DATABASE_SSL=true for RDS
npm run dev            # API on http://localhost:4000

# Frontend (new terminal)
cd frontend
npm install
cp .env.example .env   # keep API URL at http://localhost:4000/api for local dev
npm run dev            # SPA on http://localhost:5173

Open http://localhost:5173, add a note, and check if it persists. /api/health should return { status: 'ok' }. If something is broken here, pause and fix it before moving on. AWS will only make debugging harder.
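If you'd like a repeatable check instead of eyeballing the browser, the health endpoint is easy to script. This is a small sketch using only the Python standard library. check_health and is_healthy are hypothetical helper names, and the expected body is assumed to be the JSON object {"status": "ok"} described above.

```python
import json
import urllib.request


def is_healthy(body: bytes) -> bool:
    """Return True if the response body is JSON with status == 'ok'."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Request <base_url>/api/health and report whether the API looks alive."""
    url = base_url.rstrip("/") + "/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and is_healthy(response.read())
    except OSError:
        # Covers connection failures, timeouts, and HTTP error responses.
        return False
```

Locally you would run check_health("http://localhost:4000"). The same function works later against the EC2 address, which makes it handy for the end-to-end verification step.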

Step #2 - Push to GitHub (So EC2 Can Pull)

Your EC2 server in AWS needs a place to pull your code from. Using GitHub is the simplest option: you push your code once, then the EC2 instance clones that repo. You can also reuse this repo later with CI/CD if you decide to automate deployments.

cd mern-notes-aws
git init
git add .
git commit -m "feat: mern notes app"
git branch -M main
git remote add origin https://github.com/YOUR_GITHUB_USERNAME/mern-notes-aws.git
git push -u origin main

If you're following along with my example repo instead of creating your own, you can simply fork umair-mirza/mern-notes-aws and use that as your remote.

Before pushing, make sure your .env file is not committed to GitHub. Add it to your .gitignore so secrets like database passwords never end up in version control:

echo ".env" >> .gitignore

If you’ve already created a .env file locally, double-check it doesn’t appear in git status before committing.

Step #3 - Create AWS Resources (Quick Path)

RDS (Postgres, Free Tier template)

RDS (Relational Database Service) is AWS's way of running managed databases for you. Instead of installing Postgres manually on a VM, you click a few options and AWS handles backups, patching, and high availability. For this app we only need a small, free tier–eligible Postgres instance.

For more background, you can skim the official Amazon RDS for PostgreSQL docs.

We’ll start by creating the database layer. The settings below are the minimum you need for a small, production-style Postgres setup that stays within the AWS Free Tier while still following basic best practices.

RDS -> Create database -> Postgres -> Free Tier.
Class db.t3.micro, storage 20 GB gp2/gp3.
Set master user/pass. You'll need them for DATABASE_URL.
Public access: No.
Security group: allow 5432 only from the EC2 security group.
Enable backups and Require SSL. Download the RDS CA if you want strict cert validation.

S3 Bucket for the Frontend

S3 is AWS's "infinite hard drive" for files. A React/Vite app builds down to plain HTML, CSS, and JavaScript files, which are perfect to host from S3. Think of S3 as a very simple web server that just serves static files.

If you want to see more options, check the Hosting a static website on Amazon S3 guide.

Now, we’ll create an S3 bucket to host the React frontend. These options configure the bucket for static website hosting while keeping it simple and inexpensive.

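Create a bucket named mern-notes-aws-frontend-YOUR_SUFFIX (replace YOUR_SUFFIX with something lowercase and unique, since bucket names are global).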
For simple hosting, enable static website hosting and allow public reads, or keep private and use CloudFront + OAC.
Turn on versioning if you want rollback safety.

EC2 for the API

EC2 is "a computer in the cloud" that you control. You'll install Node.js on it, pull your code, and run server.js so that your backend API is always on. The security group attached to this instance works like a firewall.

If you've never launched an instance before, the Getting started with Amazon EC2 guide walks through the console screens you'll see.

Finally, we’ll provision a small EC2 instance to run the Express API. The configuration below focuses on a free tier–eligible setup that’s secure enough for learning and easy to extend later.

Launch Amazon Linux 2023, size t3.micro.
Inbound SG: 22 (your IP), 80 (world), 443 if you add HTTPS on the instance/ALB.
Attach this SG as the allowed source to RDS.

Optional: CloudFront + Route 53

CloudFront is AWS's CDN (content delivery network), and Route 53 is their DNS service. You don't strictly need them to get your app working, but they make it faster and nicer: your app loads from edge locations close to users and can live behind a friendly domain like app.example.com.

For more details, see Getting started with Amazon CloudFront and the Route 53 DNS developer guide.

Origin: the S3 bucket. Default root index.html. Add OAC if bucket is private.
Request an ACM cert in us-east-1, then create a Route 53 A/AAAA alias to the distribution.

Step #4 - Configure the EC2 Box

Once your EC2 instance is running, you treat it like a clean Linux machine. The commands below install the tools your API needs, pull your code from GitHub, configure environment variables, and run the server in a production-safe way.

Install basics:

sudo dnf update -y

This command updates all system packages to the latest versions. It's a good first step on any new Linux server.

sudo dnf install -y git

Installs Git so the EC2 instance can clone your repository from GitHub.

curl -fsSL https://rpm.nodesource.com/setup_20.x | sudo bash -

Adds the official NodeSource repository so you can install a modern version of Node.js (v20). Amazon Linux doesn’t ship with recent Node versions by default.

sudo dnf install -y nodejs

Installs Node.js and npm, which are required to run your Express API.

sudo npm install -g pm2

Installs PM2, a lightweight process manager that keeps your Node app running in the background and restarts it if it crashes or the server reboots.

Pull code and set environment variables:

git clone https://github.com/YOUR_GITHUB_USERNAME/mern-notes-aws.git
cd mern-notes-aws/backend
npm install

cat <<'EOF' > .env
PORT=80
DATABASE_URL=postgres://DB_USER:DB_PASSWORD@RDS_ENDPOINT:5432/DB_NAME
DATABASE_SSL=true
CORS_ORIGIN=https://YOUR_FRONTEND_DOMAIN
EOF
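One thing that commonly breaks DATABASE_URL is a password containing characters such as @, :, or /, which have special meaning inside a URL. A safe way to assemble the value is to percent-encode each part first. The sketch below is a suggestion rather than something from the example repo: build_database_url is a hypothetical helper, and the host and credentials shown are placeholders.

```python
from urllib.parse import quote


def build_database_url(user: str, password: str, host: str, database: str, port: int = 5432) -> str:
    """Assemble a postgres:// URL with the credentials percent-encoded."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own RDS endpoint and credentials.
    print(build_database_url("notes_admin", "p@ss:word/1", "db.example.internal", "notes"))
```

Run it on your laptop, then paste the printed value into the .env file on the instance.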

Start the API with PM2:

pm2 start server.js --name mern-notes-api
pm2 save
pm2 startup systemd -u ec2-user --hp /home/ec2-user

PM2 is a small process manager that makes sure your Node server keeps running if the machine reboots or the process crashes. Test on the box: curl http://localhost/api/health. From your laptop: http://EC2_PUBLIC_IP/api/health (make sure the SG allows 80/443).

Step #5 - Build and Upload the Frontend

In development, Vite serves your React app from memory, but in production you want a set of static files that any web server (or S3) can host. npm run build creates an optimized dist/ folder that you sync to S3 so the browser can load it.

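cd frontend
# Windows: setx only takes effect in new terminals, so open a fresh one before building
setx VITE_API_URL "https://YOUR_API_HOST/api"
# macOS/Linux equivalent: export VITE_API_URL="https://YOUR_API_HOST/api"
npm run build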

This sets an environment variable called VITE_API_URL on your local machine. Vite only exposes environment variables to the frontend if they start with the VITE_ prefix.

Upload:

aws s3 sync dist/ s3://mern-notes-aws-frontend-YOUR_SUFFIX/ --delete

This uploads your compiled frontend (dist/) to S3 and removes old files that no longer exist locally, ensuring the bucket reflects the current version of the app.

Open the S3 website URL or your CloudFront URL.

Step #6 - Quick Troubleshooting

If something doesn't work the first time, that's normal, especially with networking and AWS permissions. This section gives you a few quick places to look before you start randomly changing settings in the console.

API 500s: pm2 logs mern-notes-api. This is often a bad DATABASE_URL or SSL flag.
DB connect issues: The RDS SG must allow inbound 5432 from the EC2 SG, and DATABASE_URL must point at the RDS endpoint.
CORS errors: CORS_ORIGIN must match your frontend origin exactly.
403 from S3: If you’re using static website hosting, allow public reads. With CloudFront, keep bucket private and use OAC.
Blank page: Confirm that you’ve uploaded dist/ to the right bucket.
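The CORS item trips people up because "exactly" really does mean exactly: an origin is scheme, host, and port, with no path and no trailing slash. The sketch below shows one way to derive the value you should put in CORS_ORIGIN from the URL in your address bar. origin_of and cors_origin_matches are hypothetical helpers, the strict string comparison is an assumption about how the backend checks origins, and IPv6 literal hosts are not handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> str:
    """Reduce a full URL to its origin: scheme://host[:port]."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return f"{scheme}://{host}"
    return f"{scheme}://{host}:{port}"


def cors_origin_matches(configured: str, frontend_url: str) -> bool:
    """Compare the configured value against the page's origin as an exact string."""
    return configured == origin_of(frontend_url)
```

For example, a CORS_ORIGIN of https://app.example.com/ (with the trailing slash) fails the comparison against a page served from https://app.example.com/notes, while https://app.example.com passes.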

Step #7 - Secure and Save

Once everything works, you don't want to accidentally expose your database to the internet or burn through free tier hours. These are simple, beginner-friendly hardening steps that make your setup safer and cheaper without turning you into a full-time security engineer.

Turn off SSH after setup or switch to SSM Session Manager.
Use HTTPS (CloudFront + ACM or ALB + ACM).
Keep RDS private and use SSM port forwarding if needed.
Ship PM2 logs with CloudWatch Agent and add alarms for CPU/status checks.
Snapshot RDS daily and stop EC2 when idle to save hours.

Step #8 - Verify End-to-End

Before you celebrate, run through the app like a real user: open it in the browser, create notes, refresh, and make sure everything behaves as expected. This confirms your frontend, API, and database are all wired together correctly.

Load the frontend (S3 or CloudFront).
Create and delete notes. They should persist in RDS.
Hit /api/health for a quick liveness check.

Next Steps

Once you're comfortable with this manual setup, you can start layering on more advanced tools. The ideas are the same (frontend, API, and database), but you get more automation, safety, and scalability.

Add Prisma + migrations for stronger schemas.
Add auth (Cognito/Auth0) and per-user notes.
Containerize and run on ECS/Fargate or add an ALB in front of EC2.
Use Terraform/CDK to recreate this stack with one command.

How to Manage Blue-Green Deployments on AWS ECS with Database Migrations: Complete Implementation Guide

Destiny Erhabor — Thu, 15 Jan 2026 18:25:13 +0000

Blue-green deployments are celebrated for enabling zero-downtime releases and instant rollbacks. You deploy your new version (green) alongside the current one (blue), switch traffic over, and if something goes wrong, you switch back. Simple, right?

Not quite. While blue-green deployments work beautifully for stateless applications, they become significantly more complex when you introduce databases and stateful services into the equation. The moment your blue and green environments need to share a database, you're facing a fundamental challenge: how do you evolve your schema and data without breaking either version?

In this article, we'll tackle the real-world complexities of implementing blue-green deployments on Amazon ECS when your application depends on shared state. You'll learn practical strategies for handling database migrations, managing sessions, and maintaining data consistency across application versions.

💡 Complete Working Example: All code examples in this article are available in the bluegreen-deployment-ecs repository on GitHub. You can clone it and deploy the entire infrastructure to your AWS account.

The Problem with State in Blue-Green Deployments
Database Migration Strategies for Blue-Green
Handling Stateful Services in ECS
Complete Implementation: End-to-End Example
Rollback Strategies
Monitoring During Deployments
Best Practices
When NOT to Use Blue-Green
Alternative Deployment Strategies
Cleanup
Conclusion
Further Resources

The Problem with State in Blue-Green Deployments

The elegance of blue-green deployments starts to crumble when you consider databases. Here's why: your blue environment runs application version 1, your green environment runs version 2, but they both connect to the same RDS instance.

Consider this scenario: you're adding a new feature that requires a new database column. Version 2 of your application expects this column to exist. You deploy green, run your migration to add the column, and switch traffic.

Everything works great until you need to roll back. Now version 1 is receiving traffic, but it doesn't know what to do with that new column. Worse, if your migration removed or renamed a column that version 1 depends on, your rollback will fail catastrophically.

Here are the specific challenges you'll face:

Schema versioning conflicts: Your blue environment expects schema version N, while green expects version N+1. Any breaking schema change will cause one environment to fail.
Data inconsistencies: If version 2 writes data in a new format that version 1 can't read, switching back to blue will result in errors or data corruption.
Irreversible migrations: Some database changes are inherently destructive. Dropping a column, changing data types, or restructuring tables can't be easily undone.
Failed rollbacks: The promise of instant rollback becomes hollow when your database has evolved beyond what the blue environment can handle.

Let's explore the strategies that solve these problems.

Database Migration Strategies for Blue-Green

Strategy 1: The Expand-Contract Pattern (Recommended)

The expand-contract pattern is the most practical approach for blue-green deployments with shared databases. It works by breaking schema changes into three phases, ensuring backwards compatibility throughout.

Phase 1: Expand

In this phase, you add new schema elements while keeping old ones intact. If you're renaming a column, add the new column without removing the old one. If you're changing table structure, create new tables alongside existing ones.

-- Example: Renaming 'user_name' to 'username'
-- Phase 1: Expand - Add new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Populate new column from old column
UPDATE users SET username = user_name WHERE username IS NULL;

At this point, your database supports both the old schema (used by blue) and the new schema (used by green). Your application code needs to handle both as well.
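On a large table, a single UPDATE that touches every row can hold locks for a long time, so the backfill is often run in batches. Here is a runnable sketch of the expand phase with a batched, idempotent backfill. It uses Python's built-in sqlite3 purely as a stand-in so you can try it locally; the function name and batch size are assumptions, and on RDS you would run the equivalent statements through your own database driver.

```python
import sqlite3


def expand_users_table(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Add the username column if missing, then backfill it in small batches.

    Returns the number of rows backfilled. Safe to run more than once.
    """
    columns = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    if "username" not in columns:
        conn.execute("ALTER TABLE users ADD COLUMN username VARCHAR(255)")

    total = 0
    while True:
        cursor = conn.execute(
            "UPDATE users SET username = user_name "
            "WHERE id IN (SELECT id FROM users "
            "WHERE username IS NULL AND user_name IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # release locks between batches
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
```

The user_name IS NOT NULL condition matters: without it, rows where the old column is empty would be selected on every pass and the loop would never finish.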

Phase 2: Deploy

Now, deploy your green environment with code that uses the new schema. But this code should still write to both old and new columns to maintain compatibility.

# Version 2 code - writes to both columns
def update_user(user_id, username):
    db.execute(
        "UPDATE users SET username = %s, user_name = %s WHERE id = %s",
        (username, username, user_id)
    )

Traffic shifts from blue to green. Both environments work because the database supports both schemas. If you need to roll back, blue still functions perfectly because the old columns are intact.

Phase 3: Contract

After you're confident green is stable and you've decommissioned blue, remove the old schema elements in a separate deployment.

-- Phase 3: Contract - Remove old column
ALTER TABLE users DROP COLUMN user_name;

Update your application code to stop writing to the old columns. This is now version 3, deployed as a standard release.
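Because the contract step is the one you can't undo, it's worth checking that the two columns agree before dropping the old one. A minimal sketch of such a guard follows, again using sqlite3 as a local stand-in. safe_to_contract is a hypothetical name, and the null-safe comparison is written with SQLite's IS NOT operator; in PostgreSQL the equivalent is IS DISTINCT FROM.

```python
import sqlite3


def safe_to_contract(conn: sqlite3.Connection) -> bool:
    """Return True only if username and user_name hold the same value in every row."""
    mismatched = conn.execute(
        # IS NOT treats NULL as a comparable value, so a missing username counts as a mismatch.
        "SELECT COUNT(*) FROM users WHERE username IS NOT user_name"
    ).fetchone()[0]
    return mismatched == 0
```

If this returns False, some writer is still updating only one column, or the backfill missed rows, and the DROP COLUMN should wait.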

When to use: This should be your default approach for most schema changes including adding/removing columns, renaming fields, changing constraints, and restructuring tables.

Strategy 2: Parallel Schemas or Databases

For major breaking changes where backwards compatibility is impractical, you might maintain entirely separate database versions. Version 1 connects to database A, version 2 connects to database B. This approach requires data synchronization between databases. AWS Database Migration Service (DMS) can replicate data in near real-time, or you can build custom replication logic using change data capture.

# Configuration for version-specific database connections
DATABASE_CONFIG = {
    'v1': {
        'host': 'blue-db.cluster-xxxxx.us-east-1.rds.amazonaws.com',
        'database': 'app_v1'
    },
    'v2': {
        'host': 'green-db.cluster-yyyyy.us-east-1.rds.amazonaws.com',
        'database': 'app_v2'
    }
}

During the transition period, you run DMS to keep both databases synchronized, with the understanding that writes go to the active version's database.

The challenge is that you're now managing data synchronization, dealing with replication lag, and paying for two databases. Eventually, you need to consolidate back to one database, which requires another migration. This is expensive and complex, which is why it's the "nuclear option."

When to use: Only for major architectural changes, complete data model redesigns, or when migrating between database types (for example, MySQL to PostgreSQL). If expand-contract can possibly work, use that instead.

Strategy 3: Feature Flags for Gradual Rollout

Feature flags allow you to decouple deployment from release. Both blue and green run the same codebase, but features are toggled on or off via configuration. This shifts the problem from schema compatibility to code-level compatibility.

def create_user(user_data):
    config = get_feature_config()
    if config['use_new_user_schema']:
        return create_user_v2(user_data)
    else:
        return create_user_v1(user_data)

Instead of having two separate deployments (blue and green), you have ONE deployment with conditional logic. The "switch" from old to new behavior happens via configuration change, not infrastructure change. This is technically not pure blue-green, but it's a powerful hybrid approach.

How it works

Your application checks AWS AppConfig (or similar service) for feature flags before executing code paths. When a flag is off, it uses the old schema/logic. When on, it uses the new schema/logic. You can even enable features for a percentage of users (5% get new behavior, 95% get old behavior) for gradual rollout.
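Percentage rollouts are usually implemented by hashing a stable identifier into a bucket, so the same user gets the same answer on every request. The sketch below shows that general technique. It is not AppConfig's own evaluation logic; rollout_bucket and is_enabled are hypothetical helpers, and in practice the percentage would come from your flag configuration.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, percentage: int) -> bool:
    """Enable the flag for roughly `percentage` percent of users."""
    return rollout_bucket(flag_name, user_id) < percentage
```

Including the flag name in the hash means different flags select different slices of users, and raising the percentage from 5 to 50 only ever adds users: anyone who already had the new behavior keeps it.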

The tradeoff is that your codebase temporarily contains both old and new logic with conditional branches everywhere. This increases complexity and requires disciplined cleanup after the feature is fully released. However, you gain fine-grained control and can toggle features on/off instantly without deploying new infrastructure.

When to use: For large features with uncertain stability, gradual rollouts to monitor impact, or when you want instant rollback capability without touching infrastructure. Also useful when combined with expand-contract for extra safety.

Handling Stateful Services in ECS

Beyond databases, several other stateful components require careful consideration during blue-green deployments.

Session Management

It’s a good idea to store sessions in ElastiCache or DynamoDB rather than application memory:

app.config['SESSION_TYPE'] = 'dynamodb'
app.config['SESSION_DYNAMODB'] = boto3.client('dynamodb')

Shared Resources

Beyond database sessions, your application likely depends on other stateful components that need coordination during blue-green deployments:

S3 buckets

If your application stores files or data in S3, schema changes to object metadata or file formats can cause compatibility issues between versions. To address this, you can enable S3 versioning to maintain multiple format versions simultaneously.

For example, if version 2 writes JSON files with a new structure, version 1 should still be able to read the old format. You can include a version prefix in object keys (like v1/user-data.json and v2/user-data.json) or embed version metadata in the objects themselves.

Message queues (SQS/SNS)

Messages sent by one version must be readable by the other during the transition. You can use versioned message schemas with a schema_version field in your message payload. Both blue and green should be able to parse messages from either version, even if they only produce messages in their preferred format. Consider using a schema registry or validation library to ensure compatibility.

Cache layers (ElastiCache/Redis)

Cached data structure changes can cause deserialization errors when switching between versions. Try versioning your cache keys by including the schema version: CACHE_VERSION = 'v2' and then cache_key = f"user:{CACHE_VERSION}:{user_id}". This ensures blue and green maintain separate cache namespaces, preventing cross-contamination. When you fully migrate to green, you can flush the old cache keys or let them expire naturally.

CACHE_VERSION = 'v2'
cache_key = f"user:{CACHE_VERSION}:{user_id}"
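For the message queue case above, here is a minimal sketch of a consumer that accepts both schema versions. The payload fields (customer_id, address) and the function name parse_customer_message are hypothetical; the only part taken from the text is the schema_version field. Messages without a version are treated as version 1, which is an assumption you may want to change.

```python
import json


def parse_customer_message(raw: str) -> dict:
    """Normalize a customer message produced by either blue (v1) or green (v2)."""
    message = json.loads(raw)
    version = message.get("schema_version", 1)  # assumption: unversioned means v1

    if version == 1:
        # v1 carries the address as a single string
        address = {"raw": message["address"]}
    elif version == 2:
        # v2 carries the address as structured fields
        address = dict(message["address"])
    else:
        # Fail loudly instead of guessing at an unknown format
        raise ValueError(f"unsupported schema_version: {version}")

    return {"customer_id": message["customer_id"], "address": address}
```

Both blue and green can share a parser like this while each keeps producing messages in its own preferred format.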

Implementation: End-to-End Example

Let's walk through a complete blue-green deployment with ECS, handling a database schema change using the expand-contract pattern. We'll migrate from a single address text field to structured street_address, city, state, and zip_code fields.

Here’s the scenario: You're running an e-commerce application on ECS. The current version (blue) stores customer addresses in a single address text field. Version 2 (green) splits this into structured fields: street_address, city, state, and zip_code.

Architecture Setup

Your infrastructure includes:

ECS cluster running Fargate tasks
Application Load Balancer with two target groups (blue and green)
RDS PostgreSQL database (shared between environments)
CodeDeploy for managing traffic shifts
Parameter Store for database connection strings

💡 Implementation Note: The complete Terraform code for this architecture is available in the companion GitHub repository.

Prerequisites

Before starting, make sure that you have the following tools installed and your AWS credentials properly configured:

# Required tools
aws --version      # AWS CLI
terraform --version # Terraform >= 1.0
docker --version   # Docker
psql --version     # PostgreSQL client

# Configure AWS credentials
aws configure
aws sts get-caller-identity  # Verify your identity

Step 1: Deploy Infrastructure and Blue Environment

We’ll start by setting up the entire AWS infrastructure from scratch using Terraform, then deploying the initial version of our application (blue environment).

First, clone the repository and set up your environment:

# Clone the repository
git clone https://github.com/Caesarsage/bluegreen-deployment-ecs.git
cd bluegreen-deployment-ecs

# Create terraform variables
cd terraform
cat > terraform.tfvars <<EOF
aws_region         = "us-east-1"  # variable name assumed; confirm against variables.tf
project_name       = "ecommerce-bluegreen"
environment        = "production"
vpc_cidr           = "10.0.0.0/16"

# Database credentials (CHANGE THESE!)
db_username = "dbadmin"
db_password = "ChangeThisPassword123!"

# Container configuration
container_image = "PLACEHOLDER"  # Will update after building image
container_port  = 8080

# Scaling configuration
desired_count = 2
cpu           = "256"
memory        = "512"

# Notifications
notification_email = "your-email@example.com"
EOF

Security Note: Never commit terraform.tfvars to Git. It's already in .gitignore.

Next, initialize Terraform and create the ECR repository:

# Initialize Terraform
terraform init
terraform validate

# Create ECR repository
terraform apply -target=aws_ecr_repository.app

# Get ECR repository URL
export ECR_REPO=$(terraform output -raw ecr_repository_url)
echo "ECR Repository: $ECR_REPO"

We create the ECR repository first because we need somewhere to push our Docker image. Then we'll build the image, push it, and finally deploy the rest of the infrastructure that depends on that image existing.

Build and push the initial application like this:


cd ..  # Back to project root

# Set variables
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export ECR_REPOSITORY=ecommerce-bluegreen
export IMAGE_TAG=v1.0.0

# Login to ECR
aws ecr get-login-password --region $AWS_REGION | \
    docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# Build the image
docker build --platform linux/amd64 -t $ECR_REPOSITORY:$IMAGE_TAG -f docker/Dockerfile .

# Tag and push to ECR
docker tag $ECR_REPOSITORY:$IMAGE_TAG \
    $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

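# Update terraform.tfvars with the image URL
# (replace the PLACEHOLDER line in place; appending a second container_image line would define it twice)
sed -i.bak "s|^container_image .*|container_image = \"$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG\"|" terraform/terraform.tfvars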

The application code is a Flask application that handles both old and new schema formats based on the APP_VERSION environment variable.

Now deploy the complete infrastructure:

cd terraform
terraform apply  # Takes ~15-20 minutes

# Get outputs
export ALB_URL=$(terraform output -raw alb_url)
export TEST_URL=$(terraform output -raw test_url)
export DB_ENDPOINT=$(terraform output -raw db_endpoint)
export ECR_URL=$(terraform output -raw ecr_repository_url)
export BASTION_IP=$(terraform output -raw bastion_public_ip)

echo "Application URL: $ALB_URL"
echo "Test URL: $TEST_URL"
echo "Database Endpoint: $DB_ENDPOINT"

The production listener (port 80) is what your users hit. The test listener (port 8080) lets you test the green environment before shifting production traffic to it. This is crucial for validation.

You can see the complete Terraform configuration in the terraform directory of the repository.

Step 2: Initialize Database Schema

Now you’ll need to initialize the database with the schema for version 1 (blue). We'll use the bastion host for secure access:

# Copy the migration files to the bastion host from your local machine
scp -i ~/.ssh/id_rsa docker/init.sql ec2-user@$BASTION_IP:/tmp/
scp -i ~/.ssh/id_rsa migrations/*.sql ec2-user@$BASTION_IP:/tmp/

# Print the database endpoint so you can set it on the bastion
echo $DB_ENDPOINT

# Then SSH into the bastion host
ssh -i ~/.ssh/id_rsa ec2-user@$BASTION_IP

# Inside the bastion:
export DB_ENDPOINT=""  # paste the db_endpoint value from terraform output

psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f /tmp/init.sql

# Verify
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

# Exit the bastion
exit

Step 3: Verify Blue Environment

We’ll want to test that everything works before we start the migration. This is your baseline: you want to confirm that the current system is healthy before introducing changes.

# Check health
curl $ALB_URL/health | jq

# Expected response:
# {
#   "status": "healthy",
#   "version": "blue",
#   "environment": "production",
#   "database": "connected",
#   "schema": "compatible"
# }

# Create a customer with the old schema (single address field)
curl -X POST $ALB_URL/api/customers \
    -H "Content-Type: application/json" \
    -d '{
      "name": "John Doe",
      "email": "john@example.com",
      "address": "123 Main St, New York, NY, 10001"
    }' | jq

# List customers
curl $ALB_URL/api/customers | jq

Step 4: Expand Phase – Add New Columns

This is the first phase of expand-contract. We're adding the new columns WITHOUT removing the old one, creating a database schema that supports both blue and green simultaneously.

Run the expand migration (migrations/001_expand_address.sql):

-- Migration: 001_expand_address.sql
BEGIN;

ALTER TABLE customers 
  ADD COLUMN street_address VARCHAR(255),
  ADD COLUMN city VARCHAR(100),
  ADD COLUMN state VARCHAR(2),
  ADD COLUMN zip_code VARCHAR(10);

-- Populate new columns from existing data
-- This uses a simple parsing strategy; yours might be more sophisticated

UPDATE customers 
SET 
  street_address = SPLIT_PART(address, ',', 1),
  city = TRIM(SPLIT_PART(address, ',', 2)),
  state = TRIM(SPLIT_PART(address, ',', 3)),
  zip_code = TRIM(SPLIT_PART(address, ',', 4))
WHERE address IS NOT NULL;

COMMIT;

Critical observation: We're NOT dropping the address column. It's still there. Blue continues reading and writing to it, completely unaware that new columns exist. This is what makes the migration safe – nothing breaks.
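The parsing in the expand migration assumes every address has exactly four comma-separated parts. Before running it on real data, it can help to check which rows break that assumption. The sketch below mirrors the SPLIT_PART and TRIM logic in Python for positive field numbers; the helper names (split_part, parse_address, find_problems) are hypothetical and not part of the companion repository.

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Mimic PostgreSQL SPLIT_PART for positive n: 1-based, '' if the field is missing."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def parse_address(address: str) -> dict:
    """Apply the same split-and-trim rules as the expand migration."""
    return {
        "street_address": split_part(address, ",", 1),
        "city": split_part(address, ",", 2).strip(" "),
        "state": split_part(address, ",", 3).strip(" "),
        "zip_code": split_part(address, ",", 4).strip(" "),
    }


def find_problems(address: str) -> list:
    """List the ways a parsed address would not fit the new columns."""
    parsed = parse_address(address)
    problems = []
    if len(parsed["state"]) > 2:
        # The state column is VARCHAR(2), so a longer value makes the UPDATE fail
        problems.append("state exceeds VARCHAR(2)")
    if not parsed["zip_code"]:
        problems.append("zip_code is empty")
    return problems


if __name__ == "__main__":
    for sample in [
        "123 Main St, New York, NY, 10001",
        "Apt 4, 123 Main St, New York, NY, 10001",
        "123 Main St",
    ]:
        print(sample, "->", find_problems(sample))
```

An address with an extra comma shifts every field by one, so "New York" lands in the state column. Because the migration runs inside a transaction, one such row would roll back the whole expand step, which is a good reason to find these rows first.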

# SSH into the bastion host to run the migration
ssh -i ~/.ssh/id_rsa ec2-user@$BASTION_IP

# Inside the bastion:
export DB_ENDPOINT=""  # paste the db_endpoint value from terraform output

psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f /tmp/001_expand_address.sql

# Verify new columns exist
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

exit

Verification: The \d customers command shows the table structure. You should see BOTH the old address column AND the new street_address, city, state, zip_code columns. This confirms the expand phase worked.

The database now supports both old (blue) and new (green) schemas. Blue is still running and working perfectly, and nothing has changed from its perspective.

Step 5: Build and Deploy Green Environment

Now we’ll build version 2 of our application that knows how to work with the new structured address fields, while maintaining backwards compatibility with the old schema.

Start by building version 2 with structured address support:

cd ..  # Back to project root

# Build new version
export IMAGE_TAG=v2.0.0

docker build --platform linux/amd64 -t $ECR_REPOSITORY:$IMAGE_TAG -f docker/Dockerfile .

docker tag $ECR_REPOSITORY:$IMAGE_TAG \
    $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

What’s different is that the v2 application code now has logic that:

Reads from the new structured columns (street_address, city, and so on)
Writes to BOTH new columns AND the old address column
Accepts API requests with structured address format

Why write to both: This is crucial. Even though green prefers the new format, it maintains the old format, too. If you need to rollback to blue, all the data blue needs is there and up-to-date. Without this, rollback would be impossible: blue would see empty or stale address fields.
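To make the dual-write idea concrete, here is a minimal sketch of how green might build a row that fills both the new columns and the legacy address column. The function name build_customer_row is hypothetical, and the actual application code in the repository may be organized differently. The column names and the request shape follow the ones used in this walkthrough.

```python
def build_customer_row(name: str, email: str, address: dict) -> dict:
    """Build column values for an INSERT that keeps blue's address column current."""
    # Rebuild the single-string format that blue reads and writes
    legacy_address = ", ".join(
        [address["street"], address["city"], address["state"], address["zip"]]
    )
    return {
        "name": name,
        "email": email,
        # New structured columns used by green
        "street_address": address["street"],
        "city": address["city"],
        "state": address["state"],
        "zip_code": address["zip"],
        # Old column kept in sync so a rollback to blue sees fresh data
        "address": legacy_address,
    }
```

The legacy string uses the same comma-separated layout the expand migration parses, so data written by green stays readable by blue.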

Now create and register the green task definition:

cd terraform

# Get necessary ARNs
EXECUTION_ROLE_ARN=$(terraform output -raw ecs_task_execution_role_arn)
TASK_ROLE_ARN=$(terraform output -raw ecs_task_role_arn)
DB_SECRET_ARN=$(terraform output -raw db_secret_arn)

# Create task definition
cat > task-def-green.json <<EOF
{
  "family": "ecommerce-bluegreen",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "${EXECUTION_ROLE_ARN}",
  "taskRoleArn": "${TASK_ROLE_ARN}",
  "containerDefinitions": [{
    "name": "app",
    "image": "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPOSITORY}:${IMAGE_TAG}",
    "essential": true,
    "portMappings": [{
      "containerPort": 8080,
      "protocol": "tcp"
    }],
    "environment": [
      {"name": "APP_VERSION", "value": "green"},
      {"name": "ENVIRONMENT", "value": "production"},
      {"name": "AWS_REGION", "value": "${AWS_REGION}"},
      {"name": "DB_HOST", "value": "${DB_ENDPOINT}"},
      {"name": "DB_PORT", "value": "5432"},
      {"name": "DB_NAME", "value": "ecommerce"}
    ],
    "secrets": [
      {
        "name": "DB_USER",
        "valueFrom": "${DB_SECRET_ARN}:username::"
      },
      {
        "name": "DB_PASSWORD",
        "valueFrom": "${DB_SECRET_ARN}:password::"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/ecommerce-bluegreen",
        "awslogs-region": "${AWS_REGION}",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3,
      "startPeriod": 60
    }
  }]
}
EOF

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-def-green.json

This JSON tells ECS everything about how to run your container:

Which Docker image to use (the v2.0.0 we just built)
How much CPU/memory to allocate (256 CPU units = 0.25 vCPU)
Environment variables (notice APP_VERSION is set to "green")
Secrets (database credentials pulled from AWS Secrets Manager)
Health check configuration (curl the /health endpoint every 30 seconds)
Logging configuration (send logs to CloudWatch)

Key detail: The APP_VERSION environment variable is how the application knows whether to behave as blue or green. Same codebase, different behavior based on configuration.

Step 6: Execute Blue-Green Deployment

Alright, now it’s time to create the AppSpec file and trigger the deployment:

TASK_DEF_ARN=$(aws ecs describe-task-definition \
  --task-definition ecommerce-bluegreen \
  --query 'taskDefinition.taskDefinitionArn' \
  --output text)

cat > appspec.json <<EOF
{
  "version": 0.0,
  "Resources": [{
    "TargetService": {
      "Type": "AWS::ECS::Service",
      "Properties": {
        "TaskDefinition": "${TASK_DEF_ARN}",
        "LoadBalancerInfo": {
          "ContainerName": "app",
          "ContainerPort": 8080
        }
      }
    }
  }]
}
EOF

# Deploy
APPSPEC=$(cat appspec.json | jq -c .)
aws deploy create-deployment \
  --application-name ecommerce-bluegreen \
  --deployment-group-name ecommerce-bluegreen-deployment-group \
  --deployment-config-name CodeDeployDefault.ECSLinear10PercentEvery3Minutes \
  --description "Blue-green deployment to structured address schema" \
  --cli-input-json "{
    \"revision\": {
      \"revisionType\": \"AppSpecContent\",
      \"appSpecContent\": {
        \"content\": $(printf '%s' "$APPSPEC" | jq -Rs .)
      }
    }
  }"

DEPLOYMENT_ID=$(aws deploy list-deployments \
    --application-name ecommerce-bluegreen \
    --deployment-group-name ecommerce-bluegreen-deployment-group \
    --query 'deployments[0]' --output text)

Monitor the deployment:

# Watch status
watch -n 10 "aws deploy get-deployment --deployment-id $DEPLOYMENT_ID \
    --query 'deploymentInfo.status' --output text"

# Monitor traffic distribution
while true; do
    echo "Production: $(curl -s $ALB_URL/health | jq -r '.version')"
    echo "Test: $(curl -s $TEST_URL/health | jq -r '.version')"
    sleep 30
done

The deployment shifts 10% of traffic every 3 minutes, so the full shift takes roughly 30 minutes.
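If you want to know how much traffic green should be carrying at a given moment, you can work out the schedule for a linear configuration. This is a small sketch under one assumption: the first increment is shifted as soon as the green tasks are ready, and task provisioning time comes on top of the minutes shown. The function name linear_schedule is hypothetical.

```python
def linear_schedule(step_percent: int, interval_minutes: int) -> list:
    """Return (minute, green_percent) pairs for a linear traffic shift."""
    schedule = []
    percent = step_percent
    minute = 0
    while percent < 100:
        schedule.append((minute, percent))
        percent += step_percent
        minute += interval_minutes
    # The final step moves all remaining traffic to green
    schedule.append((minute, 100))
    return schedule


if __name__ == "__main__":
    for minute, percent in linear_schedule(10, 3):
        print(f"t+{minute:>2} min: green {percent}%")
```

Comparing this schedule with the RequestCount metric for each target group is a quick way to confirm the shift is progressing as expected.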

Step 7: Validate Green Environment

After the deployment begins, you need to validate that the green environment is functioning correctly with the new structured address format before allowing production traffic to reach it.

The CodeDeploy console shows the traffic migration and deployment status as the shift progresses.

We can also test through the test listener (port 8080), which provides isolated access to green tasks:

# Test new structured address API
curl -X POST $TEST_URL/api/customers \
    -H "Content-Type: application/json" \
    -d '{
      "name": "Jane Smith",
      "email": "jane@example.com",
      "address": {
        "street": "456 Oak Ave",
        "city": "Los Angeles",
        "state": "CA",
        "zip": "90001"
      }
    }' | jq

curl $ALB_URL/api/customers | jq

What you're validating:

The green environment accepts the new structured address format
Data is correctly written to both new columns (street_address, city, state, zip_code) and the old address column for backwards compatibility
The API response matches expectations for the new schema
Existing data from blue environment is still accessible and readable

If any of these tests fail, you can stop the deployment before production traffic reaches green, preventing customer impact.

Step 8: Post-Deployment Validation

Once CodeDeploy completes the traffic shift, all production requests route to green. This is your opportunity to verify that the deployment was successful and that the new version is handling real production traffic correctly.

# Verify all production traffic goes to green
# Running this multiple times confirms consistent routing
for i in {1..10}; do
    curl -s $ALB_URL/health | jq -r '.version'
done
# Expected output: "green" for all 10 requests

# Test complete CRUD operations with the new API
# Create a customer with structured address
CUSTOMER_ID=$(curl -s -X POST $ALB_URL/api/customers \
    -H "Content-Type: application/json" \
    -d '{"name": "Test User", "email": "test@example.com",
         "address": {"street": "789 Test St", "city": "Test City", 
         "state": "TX", "zip": "75001"}}' | jq -r '.id')

# Read the customer back to verify data persistence
curl $ALB_URL/api/customers/$CUSTOMER_ID | jq

# Update the customer to test modification
curl -X PUT $ALB_URL/api/customers/$CUSTOMER_ID \
    -H "Content-Type: application/json" \
    -d '{"address": {"street": "999 Updated Ave", "city": "Test City", 
         "state": "TX", "zip": "75001"}}' | jq

# Delete the test customer for cleanup
curl -X DELETE $ALB_URL/api/customers/$CUSTOMER_ID

What you're validating:

Traffic routing is 100% to green with no requests reaching blue
Create operations work with the new structured address format
Read operations return correct data with proper address structure
Update operations successfully modify existing records
Delete operations work without errors
The application correctly writes to both new columns and old address column (enabling potential rollback)

Check your CloudWatch logs and metrics during this validation period for any unexpected errors, increased latency, or database connection issues.

Step 9: Contract Phase (After 24-72 Hours)

This is the final phase of expand-contract. We're removing the old address column now that we're confident green is stable. This is the point of no return.

CRITICAL: Only proceed after green has been stable for your confidence period!

# Backup database first (store the identifier so the wait below uses the same name)
SNAPSHOT_ID=pre-contract-$(date +%Y%m%d-%H%M%S)
aws rds create-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db \
    --db-snapshot-identifier $SNAPSHOT_ID

# Wait for snapshot
aws rds wait db-snapshot-completed \
    --db-snapshot-identifier $SNAPSHOT_ID

# Run contract migration
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f /tmp/002_contract_address.sql

# Verify old column is gone
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

The contract migration (migrations/002_contract_address.sql) removes the old address column.

Why wait 24-72 hours: You want to be absolutely certain green is stable before making irreversible changes. During this waiting period:

All your monitoring should show green performing normally
You've seen the system handle multiple daily traffic patterns (morning peak, evening peak, overnight)
Weekly batch jobs have run successfully
You've verified third-party integrations work
No unusual errors or performance degradation

It’s important to snapshot first because once you drop that column, there's no undo button. The snapshot is your safety net. If you discover a critical issue after contracting, you can restore this snapshot and get back to a state where rollback is possible. Without it, you're gambling.

What the contract migration does:

-- migrations/002_contract_address.sql
BEGIN;
ALTER TABLE customers DROP COLUMN address;
COMMIT;

It's simple but permanent. The old address column is gone. The blue environment will no longer work with this database, as it expects that column to exist. This is fine because blue has been decommissioned (no traffic, tasks terminated).

What to update: Version 2 (green) is still writing to both the new columns and the old address column, and in PostgreSQL a write that names a dropped column fails. Deploy version 3 of your application, which removes the dual-write logic, before you run the contract migration, so that nothing references the address column when it disappears.

Your migration is now complete!

Rollback Strategies

During Deployment (Safe Window)

Use this strategy when you detect issues during the traffic shift, before all traffic has moved to green. CodeDeploy is still managing the deployment, which means it can automatically revert traffic distribution to the previous state.

# Immediate rollback
aws deploy stop-deployment \
    --deployment-id $DEPLOYMENT_ID \
    --auto-rollback-enabled

You should use this strategy when you notice increased error rates, degraded performance, or functional issues during the canary or linear traffic shift. CodeDeploy automatically shifts all traffic back to blue, and green tasks are terminated. This is the safest and fastest rollback option.

This works because the database still contains the old address column (expand phase), so blue can function normally. No data has been lost or made incompatible.

After Deployment (Before Contract)

Use this when the deployment completed successfully, but you discover issues hours or days later during the monitoring period, before you've run the contract migration. Both blue and green environments still exist, and the database supports both schemas.

# Manual listener update
aws elbv2 modify-listener \
    --listener-arn $(terraform output -raw alb_listener_arn) \
    --default-actions Type=forward,TargetGroupArn=$(terraform output -raw blue_target_group_arn)

Or use the provided script:

cd scripts
./rollback.sh

Use this when you discover bugs in green that weren't caught during initial testing, business metrics show unexpected changes (conversion rates drop, customer complaints increase), or third-party integration issues emerge.

This works because the database still has both old and new schema elements. Blue tasks still exist and can serve traffic immediately. Because green was writing to both old and new columns, blue sees all the latest data.

With this, the traffic immediately shifts from green back to blue. Green continues running for observability, but serves no traffic. You can debug green in place without customer impact.

After Contract Phase

Use this as a last resort when you've already removed the old address column, and blue can no longer function with the current database schema. This is significantly more complex and time-consuming than the previous two strategies.

# Restore from snapshot
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db-restored \
    --db-snapshot-identifier pre-contract-YYYYMMDD-HHMMSS

Only use this strategy when you discover a critical, production-breaking issue after the contract phase, and you have no other option but to return to the previous version.

Why it's painful:

Database restore takes 10-30 minutes depending on size
You lose all data written after the snapshot was taken
Requires updating connection strings to point to the restored instance
Need to re-deploy blue environment
Must communicate downtime to users

This is why you wait 24-72 hours before contracting, and take a snapshot immediately before the contract migration. The lengthy waiting period allows you to catch most issues while the safer rollback strategies are still available.
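The three rollback strategies map directly onto where you are in the migration. As a compact summary, here is a sketch of that decision in code. The enum and function names are hypothetical; they only encode the rules described in this section.

```python
from enum import Enum


class Rollback(Enum):
    STOP_DEPLOYMENT = "stop the CodeDeploy deployment with auto-rollback"
    SWITCH_LISTENER = "point the ALB listener back at the blue target group"
    RESTORE_SNAPSHOT = "restore the pre-contract RDS snapshot and redeploy blue"


def choose_rollback(traffic_shift_complete: bool, contract_applied: bool) -> Rollback:
    """Pick the least disruptive rollback that is still available."""
    if contract_applied:
        # The old column is gone, so blue cannot run against the current schema
        return Rollback.RESTORE_SNAPSHOT
    if traffic_shift_complete:
        # CodeDeploy is finished, but the schema still supports blue
        return Rollback.SWITCH_LISTENER
    # CodeDeploy is still managing the shift and can revert it
    return Rollback.STOP_DEPLOYMENT


if __name__ == "__main__":
    print(choose_rollback(traffic_shift_complete=True, contract_applied=False).value)
```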

Monitoring During Deployments

Essential Metrics

During a blue-green deployment, you need to monitor both environments simultaneously to detect issues early and make informed decisions about proceeding or rolling back.

For each target group (blue and green), track these CloudWatch metrics:

1. TargetResponseTime

Measures latency from when the load balancer sends a request to when it receives a response. You're looking for sudden spikes or gradual degradation. Green should have similar response times to blue (within 10-20%). If green's latency is significantly higher, you may have performance regressions, inefficient queries with the new schema, or resource constraints.

2. RequestCount

Shows traffic volume hitting each target group. During the deployment, you should see blue's count decreasing while green's increases proportionally. If the numbers don't add up (total requests drop significantly), users might be experiencing errors and not retrying. If green receives traffic but shows zero requests, health checks might be failing.

3. HTTPCode_Target_5XX_Count

Server errors indicate application problems. Even a single 5XX error during deployment warrants investigation. Green should have zero 5XX errors during the initial traffic shift. Any errors could indicate incompatibility issues with the new schema, missing environment variables, or database connection problems.

4. DatabaseConnections (from RDS metrics)

Shows active database connections from both environments. Watch for connection pool exhaustion, which manifests as a sudden spike or plateau at your max connections limit. If green uses more connections than blue did, you might have connection leaks or inefficient connection handling in the new code.

5. CPUUtilization

Monitor both ECS task CPU and RDS CPU. Green tasks should use similar CPU to blue tasks for the same request volume. Higher CPU might indicate less efficient code or more complex queries. RDS CPU spikes during deployment often indicate poorly optimized new queries or missing indexes for the new schema.

What to expect:

First 5-10 minutes: Green receives 10% traffic, metrics should closely match blue's baseline
15-20 minutes: Green at 30-50% traffic, both environments should show stable metrics
25-30 minutes: Green at 100% traffic, metrics should stabilize at historical levels
Any divergence from these patterns warrants stopping the deployment and investigating

Custom application metrics: Beyond infrastructure metrics, monitor business-critical metrics like checkout completion rates, API success rates, and user sign-up flows. Sometimes technical metrics look fine but user-facing functionality is broken.

Best Practices

Test Migrations in Staging

Always run your database migrations against a staging environment that mirrors production scale and complexity before touching production. Copy a recent production snapshot to staging and execute your expand migration there first.

Why this matters: Migrations that work fine on small datasets can timeout or lock tables on production-scale data. You might discover that adding an index to a 50-million-row table takes 2 hours, or that your column population query needs optimization.

What to test:

Migration execution time (should complete in seconds/minutes, not hours)
Table locks and their impact (can reads/writes continue during migration?)
Query performance with new schema (are your indexes still effective?)
Rollback procedures (can you undo the migration if needed?)

Use Migration Tools

Don't write raw SQL migrations manually. Use Flyway, Liquibase, Alembic (for Python), or your framework's built-in migration tools (Rails migrations, Django migrations, Entity Framework migrations).

Why this matters: Migration tools provide version tracking, rollback capabilities, checksums to prevent tampering, and a standardized way to manage schema changes across environments.

Configure Health Checks Properly

Your health check endpoint should verify that the application can actually function, not just that the process is running. A comprehensive health check validates database connectivity, schema compatibility, and dependent service availability.

@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'schema': check_schema_compatibility(),
        'cache': check_cache_connection()
    }

    if all(checks.values()):
        return jsonify(checks), 200
    else:
        return jsonify(checks), 503

def check_schema_compatibility():
    """Verify expected schema elements exist"""
    try:
        result = db.query("""
            SELECT column_name 
            FROM information_schema.columns 
            WHERE table_name = 'customers'
            AND column_name IN ('street_address', 'city', 'state', 'zip_code')
        """)
        return len(result) == 4
    except Exception:
        # Treat any query failure as an incompatible schema
        return False

For ALB health checks specifically, make sure you configure appropriate thresholds in your target group settings. A healthy threshold of 2 means the target must pass 2 consecutive health checks before receiving traffic. An unhealthy threshold of 3 means it must fail 3 consecutive checks before being removed. Set your interval to 30 seconds and timeout to 5 seconds to balance responsiveness with stability.

# Terraform configuration for ALB health checks
resource "aws_lb_target_group" "green" {
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }
}

This configuration ensures that ECS tasks aren't marked healthy prematurely (preventing traffic to broken tasks) while also not being overly sensitive to transient issues (preventing unnecessary task replacements).
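To see what these thresholds mean in practice, you can multiply the interval by the threshold. This is a rough estimate rather than an exact figure, since the timing of the first check and the check timeout also play a part. The function name is hypothetical.

```python
def seconds_to_change_state(interval_seconds: int, threshold: int) -> int:
    """Approximate time for a target to change state after consecutive checks."""
    return interval_seconds * threshold


if __name__ == "__main__":
    interval = 30
    print("Time to become healthy:  ", seconds_to_change_state(interval, 2), "seconds")
    print("Time to become unhealthy:", seconds_to_change_state(interval, 3), "seconds")
```

With the settings above, a new green task waits about a minute before it receives traffic, and a failing task keeps receiving traffic for about a minute and a half before it is removed. Lower the interval if that second number is longer than you are comfortable with.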

Plan the Contract Phase

The contract phase is irreversible, so treat it with appropriate caution. Wait at least 24 hours, and preferably up to 72, after the green deployment before removing old schema elements. This waiting period isn't arbitrary: it ensures you've observed the system under various conditions.

What to verify before contracting:

Green has handled multiple daily traffic patterns (morning rush, evening peak, overnight batch jobs)
All scheduled jobs and cron tasks have run successfully with the new schema
Weekly reports or analytics pipelines have completed
Third-party integrations (payment processors, shipping APIs, analytics tools) are working
No unusual error patterns in logs
Business metrics (conversions, sign-ups, purchases) remain stable
Customer support hasn't reported related issues

The pre-contract checklist:

# 1. Create a final snapshot
aws rds create-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db \
    --db-snapshot-identifier pre-contract-$(date +%Y%m%d-%H%M%S)

# 2. Document current state
echo "Green tasks: $(aws ecs describe-services --cluster ecommerce --services ecommerce-green | jq '.services[0].runningCount')"
echo "Error rate: $(aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Sum)"

# 3. Notify team
echo "Running contract migration at $(date)"

# 4. Run migration
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f migrations/002_contract_address.sql

# 5. Verify
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

Version Your APIs

When changing data formats, maintain backward compatibility by supporting both old and new API versions simultaneously. This allows API consumers (mobile apps, third-party integrations, other services) to migrate at their own pace without coordinating releases.

# Support both API versions during transition
@app.route('/api/v1/customers/<id>')
def get_customer_v1(id):
    customer = Customer.find(id)
    return jsonify({
        'id': customer.id,
        'name': customer.name,
        'address': customer.address  # Old format
    })

@app.route('/api/v2/customers/<id>')
def get_customer_v2(id):
    customer = Customer.find(id)
    return jsonify({
        'id': customer.id,
        'name': customer.name,
        'address': {  # New structured format
            'street': customer.street_address,
            'city': customer.city,
            'state': customer.state,
            'zip': customer.zip_code
        }
    })

To implement this, deploy both endpoints together in the same blue-green release, then monitor usage of the v1 endpoint over time. Once v1 traffic drops below 1% (meaning clients have migrated), deprecate it formally. Remove the v1 endpoint in a subsequent release, not during the blue-green deployment itself.

Announce the new API version to consumers with a migration timeline. Give them 2-3 months to update their integrations. Send reminder emails at the halfway point and 2 weeks before v1 shutdown.
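
The 1% threshold mentioned above can be checked with a small helper. This is a minimal sketch, not part of the original implementation: the function names and the idea of feeding it request counts from your access logs or metrics are assumptions made for illustration.

```python
def v1_traffic_share(v1_requests: int, v2_requests: int) -> float:
    """Return the fraction of total API traffic still hitting v1."""
    total = v1_requests + v2_requests
    if total == 0:
        raise ValueError("no traffic recorded for either version")
    return v1_requests / total


def ready_to_deprecate_v1(v1_requests: int, v2_requests: int,
                          threshold: float = 0.01) -> bool:
    """True when v1 traffic is below the threshold (1% by default).

    With no recorded traffic there is no evidence either way,
    so the function errs on the side of keeping v1.
    """
    if v1_requests + v2_requests == 0:
        return False
    return v1_traffic_share(v1_requests, v2_requests) < threshold


if __name__ == "__main__":
    # Hypothetical counts for one week of traffic
    print(ready_to_deprecate_v1(v1_requests=5, v2_requests=995))
```

Evaluating this over a window of at least a week is a sensible precaution, since some clients (batch jobs, for example) may only call the API occasionally.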

Monitor Both Environments

During the transition period, both blue and green are production environments serving real traffic. Monitor them separately to detect version-specific issues.

Set up separate CloudWatch dashboards for blue and green target groups with the same metrics arranged identically. This makes it easy to spot differences at a glance. If green's response time is 200ms while blue's is 50ms, that's a red flag.

Alert on metric divergence

Create alarms that trigger when green's metrics deviate significantly from blue's baseline. For example, if green's error rate is more than 2x blue's historical average, trigger an alert. If green's database query time is 50% higher, investigate before shifting more traffic.
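
The two rules above (error rate more than 2x blue's baseline, query time 50% higher) can be sketched as a simple comparison. This is illustrative only: `EnvMetrics` and `divergence_alerts` are hypothetical names, and in practice the inputs would come from your CloudWatch metrics rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class EnvMetrics:
    error_rate: float     # fraction of failed requests, e.g. 0.002
    query_time_ms: float  # average database query time in milliseconds


def divergence_alerts(blue: EnvMetrics, green: EnvMetrics,
                      error_ratio_limit: float = 2.0,
                      query_increase_limit: float = 0.5) -> list:
    """Return the names of metrics where green diverges from blue's baseline."""
    alerts = []
    # Green's error rate is more than 2x blue's historical average
    if green.error_rate > blue.error_rate * error_ratio_limit:
        alerts.append("error_rate")
    # Green's database query time is at least 50% higher than blue's
    if green.query_time_ms >= blue.query_time_ms * (1 + query_increase_limit):
        alerts.append("query_time")
    return alerts


if __name__ == "__main__":
    blue = EnvMetrics(error_rate=0.01, query_time_ms=10.0)
    green = EnvMetrics(error_rate=0.025, query_time_ms=12.0)
    print(divergence_alerts(blue, green))
```

An empty list means green is within tolerance and it is reasonable to keep shifting traffic; any entry is a reason to pause and investigate first.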

Log aggregation

Ensure logs from both environments are tagged with their version (environment: blue or environment: green) so you can filter and compare them. Use CloudWatch Logs Insights queries to spot patterns.
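
One way to tag every log line is a JSON formatter that stamps the environment onto each record. This is a minimal sketch using only the Python standard library; the class name, the `build_logger` helper, and the choice of JSON fields are assumptions, not part of the original project.

```python
import json
import logging


class EnvironmentJsonFormatter(logging.Formatter):
    """Format each record as one JSON object that includes the environment."""

    def __init__(self, environment: str):
        super().__init__()
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "environment": self.environment,  # "blue" or "green"
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(environment: str, stream=None) -> logging.Logger:
    """Create a logger whose output is tagged with the given environment."""
    logger = logging.getLogger(f"app.{environment}")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(EnvironmentJsonFormatter(environment))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("green")
    log.info("order created")
```

Because each line is a single JSON object, the environment becomes a field you can filter on when comparing blue and green logs side by side.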

When NOT to Use Blue-Green

Blue-green isn't always the right choice. Avoid it when you have:

Very large database migrations: If your migration takes hours or requires significant locks, use a traditional maintenance window.
Highly stateful applications: Real-time collaboration tools or WebSocket applications with complex in-memory state may need rolling deployments instead.
Cost constraints: Running two environments side by side roughly doubles compute costs for as long as both are up. Consider canary deployments for cost-sensitive applications.
Complex data model redesigns: Use the strangler fig pattern to gradually migrate functionality to a new service.

Alternative Deployment Strategies

Canary Deployments

Route a small percentage of traffic (5-10%) to the new version first:

{
  "trafficRouting": {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 10,
      "canaryInterval": 5
    }
  }
}
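
To make the two settings concrete: a time-based canary shifts traffic in two steps, the canary percentage first and the remainder once the interval has elapsed. The sketch below only models that schedule. `canary_schedule` is a hypothetical helper, and it assumes the interval is expressed in minutes.

```python
def canary_schedule(canary_percentage: int, canary_interval_minutes: int) -> list:
    """Return (minutes_elapsed, percent_of_traffic_on_new_version) steps.

    Models a two-step canary: an initial slice of traffic, then
    everything else after the interval has passed.
    """
    if not 0 < canary_percentage < 100:
        raise ValueError("canary_percentage must be between 1 and 99")
    if canary_interval_minutes <= 0:
        raise ValueError("canary_interval_minutes must be positive")
    return [
        (0, canary_percentage),          # first shift: the canary slice
        (canary_interval_minutes, 100),  # second shift: all remaining traffic
    ]


if __name__ == "__main__":
    for minute, percent in canary_schedule(10, 5):
        print(f"t+{minute}min: {percent}% on new version")
```

The interval is the window in which your alarms need to catch a bad release, so it should be long enough for meaningful traffic to reach the canary.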

Rolling Deployments

Gradually replace old tasks with new ones:

{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}
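
The two percentages bound how many tasks may run during the rollout, relative to the service's desired count. The helper below is a sketch of that arithmetic; the function name is hypothetical, and the rounding (lower bound rounded up, upper bound rounded down) reflects my understanding of how ECS applies these settings, so verify it against the ECS documentation for your setup.

```python
import math


def rolling_task_bounds(desired_count: int,
                        minimum_healthy_percent: int,
                        maximum_percent: int) -> tuple:
    """Return (min_running_tasks, max_running_tasks) during a rolling update."""
    # Lower bound: tasks that must stay running, rounded up
    min_running = math.ceil(desired_count * minimum_healthy_percent / 100)
    # Upper bound: total tasks allowed at once, rounded down
    max_running = math.floor(desired_count * maximum_percent / 100)
    return min_running, max_running


if __name__ == "__main__":
    # With the configuration above and 4 desired tasks
    print(rolling_task_bounds(4, minimum_healthy_percent=100, maximum_percent=200))
```

With `minimumHealthyPercent` at 100 and `maximumPercent` at 200, a service with 4 desired tasks never drops below 4 running tasks and may temporarily run up to 8, so capacity is preserved at the cost of briefly doubling task count.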

Cleanup

After you've successfully completed your blue-green deployment, validated the green environment, and run the contract phase, you need to clean up the AWS resources to avoid unnecessary costs and resource sprawl.

What you're removing:

The entire infrastructure stack (VPC, subnets, NAT gateways, load balancer, ECS cluster, RDS database, and all associated resources)
This is appropriate for a tutorial/testing scenario where you deployed everything from scratch

Important considerations before cleanup:

Ensure you have backups if you need to reference any data later
Export any logs or metrics you want to retain
Document lessons learned from the deployment
Verify no production traffic is still using these resources

cd terraform

# Terraform will prompt you to confirm with "yes"
# Review the destruction plan carefully before confirming
terraform destroy  # Takes ~10-15 minutes

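Partial cleanup: If you want to keep certain resources (like the RDS instance, for reference), you can remove them from Terraform state before destroying. Terraform then stops managing them, so they keep running and keep incurring charges until you delete them yourself: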

# Remove RDS from Terraform management before destroying
terraform state rm aws_db_instance.main
terraform destroy  # Now destroys everything except RDS

For production environments, you would NOT destroy everything. Instead, you'd decommission the blue environment specifically after confirming green is stable:

# Production scenario - remove only blue environment
terraform destroy -target=aws_ecs_service.blue
terraform destroy -target=aws_lb_target_group.blue

Conclusion

Blue-green deployments with databases require careful planning, but the expand-contract pattern makes it manageable.

Here are some key takeaways:

Use expand-contract as default – Maintains backwards compatibility and safe rollbacks.
Externalize state – Sessions, caches, and storage should use external services.
Plan for three phases – Don't rush to the contract phase.
Test everything in staging – Mirror production scale and complexity.
Monitor aggressively – Track technical and business metrics for both environments.
Know when to use alternatives – Blue-green isn't always the answer.
Document rollback procedures – Everyone should know the rollback process before deployment.

The expand-contract pattern requires more work upfront, but this investment pays dividends in reduced risk and maintained uptime. With the strategies and complete implementation provided here, you can successfully deploy even complex, stateful applications with confidence.

As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on LinkedIn.

For more practical hands-on Cloud/DevOps projects like this one, follow and star this repository: Learn-DevOps-by-building.

Further Resources

Complete Code: github.com/Caesarsage/bluegreen-deployment-ecs
Learn DevOps by Building: GitHub repo
AWS ECS Blue/Green Documentation: AWS Docs
AWS CodeDeploy for ECS: AWS Docs

Pattern	Monthly Saving	Time to Fix	Difficulty
1. New hire experiment tax	$1,000–$2,000	2 hours (Lambda)	Medium
2. Staging proliferation	$600–$800	3 hours (scheduling)	Low
3. NAT Gateway tax	$2,000–$8,000	30 minutes	Low
4. Savings Plan timing	$5,000–$15,000	One decision	Low
5. Cross-AZ data transfer	$500–$6,000	2 hours	Medium
6. gp2 volume trap	$1,000–$5,000	30 minutes (script)	Low
7. Infinite log trap	$500–$2,000	1 hour (script)	Low
8. Orphaned resources	$500–$2,000	2 hours (Lambda)	Low
Total potential	$11,100–$40,800/month

