Ayobami Adejumo - freeCodeCamp.org

The 2026 FinOps Roadmap: From Cost-Blind Engineer to Cloud Financial Manager

Ayobami Adejumo — Mon, 15 Jun 2026 23:22:50 +0000

My first AWS bill was $23,000. I had been working at the company for three weeks.

Nobody told me. The bill just grew quietly in the background while I was proud of the feature I shipped. A Lambda function that called an external enrichment API on every user event. Clean code. Solid tests. Thirty-two million events that month. At $0.0007 per API call.

My engineering manager forwarded the invoice with two words: "Please explain."

That was the moment I discovered FinOps — not from a conference talk or a certification course, but from the specific shame of having written expensive code and not knowing it until the damage was done.

This roadmap is what I needed that day. A complete, honest guide to transforming from an engineer who builds things that work into an engineer who builds things that work and cost what they should. By the end of this guide, you'll have the skills, the scripts, and the vocabulary to talk about cloud spend the way a CFO and a CTO both want to hear.

What You'll Learn
Prerequisites
The Four Stages Overview
Stage 1: The Cost-Aware Engineer — Months 1 to 3
Stage 2: The Optimisation Specialist — Months 4 to 8
Stage 3: The Automation Architect — Months 9 to 15
Stage 4: The Cloud Financial Manager — Months 16 to 24
Essential Tools and Certifications
Your 90-Day Action Plan
Best Practices Summary
Resources

What You'll Learn

How to read your AWS bill as an engineer, not as a passive observer
The exact tagging strategy that makes cost attribution possible
How to right-size EC2 and RDS instances using CloudWatch data you already have
The correct sequence for purchasing Savings Plans — and why sequence matters more than the discount percentage
How to build automated cleanup systems for orphaned resources
How to present cloud cost findings to engineering leadership with data that drives decisions
The chargeback and showback models that make cost accountability stick

Let's begin.

Prerequisites

Before following this roadmap, you should have some skills and tools ready to go.

Knowledge:

You can deploy an application to AWS (EC2, Lambda, or containers)
You understand basic AWS services: S3, RDS, EC2, VPC, IAM
You're comfortable reading Python and writing simple bash scripts
You know what a pull request is and have gone through at least one code review

Access:

Read-only access to your AWS billing console and Cost Explorer
AWS CLI v2 configured with at least ReadOnlyAccess policy attached
Python 3.9 or later for running the audit scripts in this guide

Mindset: You don't need to be a finance expert. But you do need to be willing to look at numbers that might be uncomfortable. Every engineer I've worked with who became excellent at FinOps had one thing in common: they were willing to be the person who asked "but what does this cost?" in a room where nobody else wanted to.

Estimated time: This roadmap covers 24 months of deliberate skill-building. You can absorb the reading in a few evenings. The practice is the 24 months.

The Four Stages Overview

Before going deep, here's the complete picture of where you're going:

Stage 1 — Cost-Aware Engineer (Months 1–3)
├── Read your cloud bill and understand it
├── Tag every resource with meaningful metadata
├── Identify your top 5 cost drivers
└── Block your first expensive PR with cost justification

Stage 2 — Optimisation Specialist (Months 4–8)
├── Right-size every over-provisioned resource
├── Implement storage lifecycle policies
├── Move non-production to Spot instances
└── Purchase your first Savings Plan in the right order

Stage 3 — Automation Architect (Months 9–15)
├── Build automated cleanup for orphaned resources
├── Add cost estimation to your CI/CD pipeline
├── Create cost-aware auto-scaling triggers
└── Deploy a self-service FinOps dashboard

Stage 4 — Cloud Financial Manager (Months 16–24)
├── Lead monthly FinOps reviews with engineering leadership
├── Build chargeback models for departments
├── Negotiate enterprise agreements with AWS
└── Forecast cloud spend within 5% variance

The reason this is a 24-month journey and not a weekend project: each stage builds on the previous one. Engineers who jump straight to Savings Plans without rightsizing first end up paying discounted prices for waste. Engineers who build dashboards before tagging get beautiful charts with no actionable data. The sequence isn't arbitrary.

Stage 1: The Cost-Aware Engineer — Months 1 to 3

1.1 Reading the Bill Like an Engineer, Not an Accountant

The default AWS Cost Explorer view shows you service-level totals. That's accounting. What you need is engineering-level decomposition: which specific resources cost money, what business function they serve, and whether each dollar is justified.

Start by pulling a proper breakdown:

# Pull last month's cost breakdown grouped by service
# Run this before touching any optimisation — this is your baseline
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].[Metrics.UnblendedCost.Amount, Keys[0]]' \
  --output text | sort -rn

Save the output. Name the file aws-baseline-YYYY-MM.txt. You'll compare every future month against this number. Without a baseline, you can't measure progress — and without measurable progress, you can't make the case to leadership that the work is worth engineering time.

Three questions for every service in your top 5:

Most engineers stop at "what is this service?" and never reach the useful question. Here's the framework I use when I first audit an account:

The first question is whether you know what specific business function this service is performing. Not the product name, the function. "S3" isn't an answer. "Storing unprocessed video uploads that sit for 90 days before anyone watches them" is an answer.

The second question is whether the cost is growing, stable, or shrinking when you look at the past three months. A stable $12,000/month is a different problem from a $12,000/month line that was $4,000 six months ago.

The third question is what percentage of your total bill this service represents. Optimising a 1% line item while a 40% line item runs unchecked is a common time-wasting trap.
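The second and third questions reduce to simple arithmetic once you have a few months of per-service totals. Here is a minimal sketch of that calculation. The `summarise_service` helper, the sample figures, and the 10% band used to call a line "stable" are all illustrative assumptions, not values from a real account.

```python
def summarise_service(name, monthly_costs, total_bill, stable_band=0.10):
    """Classify a service's trend and its share of the latest bill.

    monthly_costs: oldest-to-newest monthly totals in USD (hypothetical data).
    total_bill:    the account total for the most recent month in USD.
    stable_band:   assumed threshold; changes within +/-10% count as stable.
    """
    first, latest = monthly_costs[0], monthly_costs[-1]
    change = (latest - first) / first if first else 0.0
    if change > stable_band:
        trend = "growing"
    elif change < -stable_band:
        trend = "shrinking"
    else:
        trend = "stable"
    return {
        "service": name,
        "trend": trend,
        "change_pct": round(change * 100, 1),
        "share_pct": round(latest / total_bill * 100, 1),
    }


# Hypothetical figures: one line that tripled, one that barely moved
ec2 = summarise_service("EC2", [4000, 8000, 12000], total_bill=30000)
s3 = summarise_service("S3", [290, 300, 300], total_bill=30000)
print(ec2)
print(s3)
```

In this made-up account, EC2 is both growing and 40% of the bill, so it gets attention first. The 1% S3 line can wait.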

1.2 The Tagging Strategy That Actually Survives

Here's the honest truth about tagging: most tagging strategies die within six months because they're designed for reporting rather than for engineers. Engineers don't tag things well when they're moving fast. The solution isn't to demand more discipline. It's to enforce tagging at the infrastructure layer.

Here's the minimal viable tag set (the six tags that cover 90% of attribution needs):

# These six tags enable cost attribution, accountability, and automated remediation
# Add these to every resource in your AWS account — EC2, RDS, S3, Lambda, everything

Environment: "production" | "staging" | "dev"
Team: "platform" | "backend" | "data" | "ml"
Service: "payment-api" | "fraud-detection" | "user-service"
Owner: "ayo@cloudfrugal.com"     # Person responsible for this resource
CostCenter: "engineering"         # For chargeback reporting
AutoShutdown: "true" | "false"    # Enables automated remediation
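To make the tag set concrete, here is a small sketch of how an audit script might check one resource's tags against the six keys above. The `REQUIRED_TAGS` table and the `missing_or_invalid_tags` helper are hypothetical names introduced for illustration. The sketch assumes that only `Environment` and `AutoShutdown` have a closed list of values, and treats the other four as free text.

```python
# Allowed values per tag; None means any non-empty value is accepted (assumption)
REQUIRED_TAGS = {
    "Environment": {"production", "staging", "dev"},
    "Team": None,
    "Service": None,
    "Owner": None,
    "CostCenter": None,
    "AutoShutdown": {"true", "false"},
}


def missing_or_invalid_tags(tags):
    """Return a sorted list of problems for one resource's tag dict."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key, "")
        if not value:
            problems.append(f"{key}: missing")
        elif allowed is not None and value not in allowed:
            problems.append(f"{key}: invalid value '{value}'")
    return sorted(problems)


# Example: a resource tagged before the policy existed
legacy = {"Environment": "prod", "Team": "platform", "Owner": "ayo@cloudfrugal.com"}
for problem in missing_or_invalid_tags(legacy):
    print(problem)
```

A check like this is useful for reporting on resources that already exist. It complements enforcement at provisioning time and does not replace it.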

Enforce tags at the Terraform level so they can't be skipped:

# variables.tf
# Add this to your Terraform root module
# Any plan whose required_tags value omits Environment, Team, or Owner will fail validation

variable "required_tags" {
  description = "Tags required on every resource in this account"
  type        = map(string)

  validation {
    # Parentheses let the condition span multiple lines
    condition = (
      contains(keys(var.required_tags), "Environment") &&
      contains(keys(var.required_tags), "Team") &&
      contains(keys(var.required_tags), "Owner")
    )
    error_message = "required_tags must include Environment, Team, and Owner."
  }
}

# Apply in every resource
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = merge(var.required_tags, {
    Name    = "app-server-${var.environment}"
    Service = "payment-api"
  })
}

Find everything that's currently untagged:

# List EC2 instances missing the Team tag
# Run this weekly until you hit zero results
aws ec2 describe-instances \
  --query "Reservations[].Instances[?!not_null(Tags[?Key=='Team'].Value | [0])].[InstanceId, InstanceType, State.Name]" \
  --output table

Once you start finding untagged resources, you'll discover a pattern: the oldest resources in the account are the least tagged, and they're often the most expensive. An EC2 instance from 2021 that predates your tagging policy is exactly the kind of thing that generates a $3,000/month line item nobody can explain.

1.3 The Cost-Aware Code Review

The most underused FinOps practice in engineering teams is reviewing code changes for cost implications before they merge. It takes thirty seconds per PR once you build the habit, and it prevents the kind of problem that opened this guide: the expensive feature that nobody priced before shipping.

Add this section to your PR template:

## Cost Impact (required for infrastructure and data changes)

- [ ] This change does not affect cloud resource usage
- [ ] New API calls introduced: estimated cost per call $______, calls/month ______
- [ ] New data storage: estimated monthly delta $______
- [ ] Cross-region data transfer introduced: yes / no
- [ ] New external service dependency with per-call pricing: yes / no

If any box other than the first is checked, add a cost estimate before requesting review.

The discipline is in making cost estimation a first-class review concern, not an afterthought that gets caught by the finance team on the 15th of the month.
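Filling in the "estimated cost per call" and "calls/month" fields is a one-line multiplication. The sketch below uses the numbers from the opening of this guide. The `estimate_monthly_api_cost` helper is a hypothetical name, and the calculation assumes flat per-call pricing with no free tier or volume discount.

```python
def estimate_monthly_api_cost(calls_per_month, cost_per_call_usd):
    """Monthly cost of a per-call priced dependency, assuming flat pricing."""
    return round(calls_per_month * cost_per_call_usd, 2)


# The enrichment API from the introduction: 32 million events at $0.0007 each
estimate = estimate_monthly_api_cost(32_000_000, 0.0007)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

That works out to $22,400 per month, which is most of the $23,000 bill described at the start. Thirty seconds of arithmetic in the PR would have surfaced it before the merge.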

Stage 1 Outcomes

By the end of month 3, you should have a baseline cost breakdown on file, 100% tag coverage on active resources, identified your top 5 cost drivers with specific reduction targets, and blocked at least one expensive PR with a cost justification that held up in review.

Stage 2: The Optimisation Specialist — Months 4 to 8

2.1 Right-Sizing: The 80/20 of Cloud Savings

The single most reliable source of cloud waste I find in every account I audit is over-provisioned compute.

The pattern is consistent: an engineer provisions an instance at a size that handles their anticipated peak load, the peak never quite materialises at the expected scale, and nobody revisits the instance size because there's no automatic signal that says "this machine is 75% empty."

Make sure you verify actual utilisation before changing anything:

# rightsize_analyzer.py
# Finds EC2 instances running below 20% average CPU for 14 days
# These are right-sizing candidates — not automatic deletions

import boto3
from datetime import datetime, timedelta

def find_oversized_instances(region='us-east-1'):
    """
    Returns instances with average CPU below 20% for the last 14 days.
    Low CPU alone doesn't mean right-size — check memory too if CW agent installed.
    """
    ec2 = boto3.client('ec2', region_name=region)
    cw  = boto3.client('cloudwatch', region_name=region)

    reservations = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )['Reservations']

    candidates = []

    for r in reservations:
        for inst in r['Instances']:
            iid  = inst['InstanceId']
            itype = inst['InstanceType']
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}

            # Pull 14-day average CPU from CloudWatch
            stats = cw.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': iid}],
                StartTime=datetime.utcnow() - timedelta(days=14),
                EndTime=datetime.utcnow(),
                Period=1209600,   # One 14-day period
                Statistics=['Average']
            )['Datapoints']

            # No datapoints means "no data", not 0% CPU, so skip the instance
            if not stats:
                continue
            avg_cpu = stats[0]['Average']

            if avg_cpu < 20.0:
                candidates.append({
                    'instance_id':  iid,
                    'instance_type': itype,
                    'avg_cpu_pct':  round(avg_cpu, 1),
                    'environment':  tags.get('Environment', 'unknown'),
                    'owner':        tags.get('Owner', 'unknown'),
                    'team':         tags.get('Team', 'unknown'),
                })

    return sorted(candidates, key=lambda x: x['avg_cpu_pct'])

if __name__ == '__main__':
    results = find_oversized_instances()
    print(f"\nFound {len(results)} right-sizing candidates:\n")
    for r in results:
        print(f"  {r['instance_id']} ({r['instance_type']}) — "
              f"{r['avg_cpu_pct']}% avg CPU — "
              f"owner: {r['owner']}")

A word of caution: CPU utilisation below 20% is a signal, not a verdict. Some workloads are memory-intensive or I/O-bound and will show low CPU while being correctly sized. Before acting on any right-sizing recommendation, check memory utilisation (requires the CloudWatch agent) and network I/O patterns alongside CPU.
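One way to encode that caution is to treat low CPU as the entry condition and let memory and network data veto the recommendation. The sketch below is illustrative only: the `rightsizing_verdict` helper and the 40% memory and 100 Mbps network thresholds are assumptions that you should tune to your own workloads.

```python
def rightsizing_verdict(avg_cpu_pct, avg_mem_pct=None, peak_network_mbps=None,
                        cpu_threshold=20.0, mem_threshold=40.0,
                        network_threshold_mbps=100.0):
    """Combine CPU, memory, and network signals into a single verdict.

    avg_mem_pct is None when the CloudWatch agent is not installed.
    Thresholds other than the 20% CPU figure are assumed values.
    """
    if avg_cpu_pct >= cpu_threshold:
        return "keep"
    if avg_mem_pct is None:
        return "investigate: no memory data"
    if avg_mem_pct >= mem_threshold:
        return "keep: memory-bound"
    if peak_network_mbps is not None and peak_network_mbps >= network_threshold_mbps:
        return "keep: network-bound"
    return "downsize candidate"


print(rightsizing_verdict(8.0))                  # low CPU, no agent installed
print(rightsizing_verdict(8.0, avg_mem_pct=72))  # low CPU, busy memory
print(rightsizing_verdict(8.0, 15.0, 5.0))       # idle on every axis
```

Only the last instance comes out as a downsize candidate. The first needs the CloudWatch agent installed before anyone touches it.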

2.2 Storage Tiering: Stop Paying Retail for Cold Data

S3 Standard costs $0.023 per GB per month. S3 Glacier Deep Archive costs $0.00099 per GB per month. The difference is a factor of 23. If you have data that you last accessed six months ago and you're keeping it in S3 Standard because nobody set up lifecycle policies, you're paying 23x more than necessary.
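To see what that ratio means in dollars, here is a short worked example using the two per-GB prices quoted above. The 50 TB bucket size is a hypothetical figure. The calculation covers storage charges only and leaves out retrieval fees, request fees, and minimum storage durations, all of which matter for data you might need back.

```python
STANDARD_USD_PER_GB = 0.023        # S3 Standard, per GB-month (from the text)
DEEP_ARCHIVE_USD_PER_GB = 0.00099  # Glacier Deep Archive, per GB-month (from the text)


def monthly_storage_cost(size_gb, usd_per_gb):
    """Storage-only monthly cost; excludes retrieval and request charges."""
    return round(size_gb * usd_per_gb, 2)


size_gb = 50_000  # hypothetical: 50 TB of logs nobody has read in six months
standard = monthly_storage_cost(size_gb, STANDARD_USD_PER_GB)
deep = monthly_storage_cost(size_gb, DEEP_ARCHIVE_USD_PER_GB)

print(f"S3 Standard:  ${standard:,.2f}/month")
print(f"Deep Archive: ${deep:,.2f}/month")
print(f"Ratio:        {standard / deep:.1f}x")
```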

The complete S3 lifecycle policy for engineering teams:

{
  "Rules": [
    {
      "ID": "application-logs-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 90,  "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    },
    {
      "ID": "training-checkpoints-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "ml-checkpoints/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"}
      ],
      "Expiration": {"Days": 90}
    }
  ]
}

# Apply the lifecycle policy to a bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-logs-bucket \
  --lifecycle-configuration file://lifecycle.json

# Verify it applied correctly
aws s3api get-bucket-lifecycle-configuration \
  --bucket your-logs-bucket

2.3 Savings Plans: The Sequence Is Everything

A Savings Plan is a commitment to spend a minimum dollar amount per hour on AWS compute for one or three years, in exchange for discounts of 30–70% off On-Demand rates. The discount is real. The trap is buying before optimising.

The wrong order: You have a $50,000/month EC2 bill. You buy a Savings Plan covering $35,000/month (about $48/hour). Then you implement right-sizing and Spot instances, and your actual spend drops to $22,000/month. You've committed to paying $35,000/month for 12 months against a need of $22,000. You're paying $13,000/month for compute you don't use, at a 30% discount. Congratulations on your discounted waste.

The right order:

Month 1-2: Right-size all instances using VPA and CloudWatch data
Month 3:   Move staging and development to Spot instances
Month 4:   Migrate compatible workloads to Graviton (20% cheaper)
Month 5:   Add VPC endpoints to eliminate NAT Gateway charges
Month 6:   THEN look at your steady-state On-Demand spend
Month 6+:  Purchase Savings Plans covering 70% of that optimised baseline

Calculate what to commit to:

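# Get your On-Demand EC2 spend for the last 30 days
# This is your rightsized baseline: the number to commit against
# If the result is empty, check the exact PURCHASE_TYPE values for your
# account with `aws ce get-dimension-values` and adjust the filter
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \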
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE",       "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On-Demand"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own recommendation for what to commit
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
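The wrong-order and right-order scenarios above can be compared with a few lines of arithmetic. This is a simplified sketch: the `commitment_plan` and `unused_commitment` helpers are hypothetical, the 730 hours per month conversion is an averaging assumption, and commitment and spend are treated as the same kind of dollars, just as the example in this section does.

```python
HOURS_PER_MONTH = 730  # average hours in a month, used to convert to an hourly commit


def commitment_plan(optimised_monthly_usd, coverage=0.70):
    """Commit to a fraction of the optimised steady-state spend."""
    monthly = optimised_monthly_usd * coverage
    return {
        "monthly_commit_usd": round(monthly, 2),
        "hourly_commit_usd": round(monthly / HOURS_PER_MONTH, 2),
    }


def unused_commitment(monthly_commit_usd, actual_monthly_usd):
    """Dollars per month paid for commitment that no workload consumes."""
    return max(0.0, round(monthly_commit_usd - actual_monthly_usd, 2))


# Wrong order: commit $35,000/month, then optimise down to $22,000/month
print("Unused commitment:", unused_commitment(35_000, 22_000))

# Right order: optimise to $22,000/month first, then cover 70% of it
plan = commitment_plan(22_000)
print("Commit:", plan)
print("Unused commitment:", unused_commitment(plan["monthly_commit_usd"], 22_000))
```

Covering 70% rather than 100% leaves headroom: if usage drops further after the purchase, the commitment is still fully consumed.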

Stage 3: The Automation Architect — Months 9 to 15

3.1 The Orphaned Resource Problem — And Why It Never Fixes Itself

Orphaned resources are the cloud equivalent of a gym membership you forgot to cancel. They exist, they charge you, but nobody notices until the annual audit.

The root cause isn't laziness. It's the absence of lifecycle management at the infrastructure layer. When an engineer spins up an EC2 instance for a one-week experiment and then leaves the company, there's no automatic signal that the instance is now orphaned. It sits there, billing $140/month, until someone hunts it down.

The fix is a weekly automated audit that surfaces candidates for deletion and notifies the registered owner, not a process change that depends on engineers remembering to clean up.

# orphan_reporter.py
# Runs every Sunday via EventBridge → Lambda
# Posts a Slack report of orphaned resources for human review
# DOES NOT auto-delete — deletion requires a human decision

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
UNATTACHED_VOLUME_AGE_DAYS = 14
SNAPSHOT_AGE_DAYS = 90


def find_orphaned_resources():
    ec2 = boto3.client('ec2')
    report = {'monthly_waste_usd': 0, 'items': []}

    # Unattached EBS volumes
    for vol in ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age >= UNATTACHED_VOLUME_AGE_DAYS:
            cost = round(vol['Size'] * 0.08, 2)  # gp3 rate
            tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}
            report['items'].append({
                'type':  'Unattached EBS Volume',
                'id':    vol['VolumeId'],
                'detail': f"{vol['Size']}GB {vol['VolumeType']} — {age} days old",
                'owner': tags.get('Owner', 'unknown'),
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    # Unassociated Elastic IPs
    for addr in ec2.describe_addresses()['Addresses']:
        if 'AssociationId' not in addr:
            report['items'].append({
                'type':  'Unassociated Elastic IP',
                'id':    addr.get('AllocationId', addr['PublicIp']),
                'detail': addr['PublicIp'],
                'owner': 'unknown',
                'monthly_cost_usd': 3.60,
            })
            report['monthly_waste_usd'] += 3.60

    # Old snapshots
    cutoff = datetime.now(timezone.utc) - timedelta(days=SNAPSHOT_AGE_DAYS)
    for snap in ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']:
        # Compare timezone-aware datetimes directly instead of ISO strings
        if snap['StartTime'] < cutoff:
            cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
            report['items'].append({
                'type':  f'Snapshot ({SNAPSHOT_AGE_DAYS}+ days old)',
                'id':    snap['SnapshotId'],
                'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
                'owner': 'unknown',
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    return report


def post_to_slack(report):
    lines = [
        f":money_with_wings: *Weekly Orphaned Resource Report*",
        f"Found *{len(report['items'])} orphaned resources* "
        f"costing *${report['monthly_waste_usd']:.2f}/month*\n",
    ]
    for item in report['items'][:20]:  # Cap at 20 lines to stay readable
        lines.append(
            f"• `{item['type']}` {item['id']} — {item['detail']} "
            f"— *${item['monthly_cost_usd']:.2f}/mo* — owner: {item['owner']}"
        )
    lines.append("\nReview and delete anything no longer needed.")

    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({'text': '\n'.join(lines)}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)


def lambda_handler(event, context):
    report = find_orphaned_resources()
    post_to_slack(report)
    return {
        'items_found': len(report['items']),
        'monthly_waste': report['monthly_waste_usd'],
    }

3.2 Cost Estimation in Your CI/CD Pipeline

The goal is to catch expensive infrastructure changes at the PR stage — before they deploy and before they generate a billing surprise.

# .github/workflows/cost-check.yml
# Runs on any PR that touches infrastructure files
# Uses Infracost to estimate the monthly cost delta

name: Infrastructure Cost Check

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'infrastructure/**'
      - '*.tf'

jobs:
  cost-estimate:
    name: Estimate monthly cost change
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate cost estimate
        run: |
          infracost breakdown \
            --path terraform/ \
            --format json \
            --out-file /tmp/infracost.json

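      - name: Post cost diff to PR
        # Requires the job's GITHUB_TOKEN to have pull-requests: write
        run: |
          infracost comment github \
            --path /tmp/infracost.json \
            --repo $GITHUB_REPOSITORY \
            --github-token ${{ github.token }} \
            --pull-request ${{ github.event.pull_request.number }} \
            --behavior update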

      - name: Block if monthly increase exceeds threshold
        run: |
          MONTHLY_DELTA=$(cat /tmp/infracost.json | \
            jq '.projects[0].diff.totalMonthlyCost' | tr -d '"')

          echo "Estimated monthly cost change: \$$MONTHLY_DELTA"

          # Fail the PR if this change adds more than $500/month
          python3 -c "
          import sys
          delta = float('$MONTHLY_DELTA')
          if delta > 500:
              print(f'PR blocked: estimated +\${delta:.2f}/month exceeds \$500 threshold')
              sys.exit(1)
          else:
              print(f'Cost check passed: estimated +\${delta:.2f}/month')
          "

Stage 4: The Cloud Financial Manager — Months 16 to 24

4.1 Leading FinOps Reviews with Executives

By month 16, you have the data. What changes at Stage 4 is the audience. You're no longer presenting to engineers who understand instance types and NAT Gateway pricing. You're presenting to a CTO who wants to know if the infrastructure investment is proportional to the business value it produces, and a CFO who wants to know when the line will stop going up.

The vocabulary shift is simple but important. You stop saying "we right-sized our EC2 instances" and start saying "we reduced our infrastructure unit cost by 28% while maintaining the same request throughput." You stop saying "we eliminated NAT Gateway charges" and start saying "we closed a $6,400/month gap between what we were paying and what was necessary."

The metric that anchors every executive FinOps conversation is cost per business unit: not the total bill, but cost per API call, cost per user, cost per transaction, or cost per model inference. That ratio tells the story of whether your infrastructure efficiency is improving as the business scales.

# unit_economics.py
# Calculate cost per transaction — the metric that matters to leadership

import boto3
from datetime import datetime, timedelta

def calculate_cost_per_transaction(service_name, transaction_count, days_back=30):
    """
    Returns cost per transaction for a given service over the last N days.
    transaction_count: total transactions for the same period (from your metrics)
    """
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        Filter={
            'Tags': {
                'Key':    'Service',
                'Values': [service_name]
            }
        }
    )

    total_cost = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in response['ResultsByTime']
    )

    cost_per_txn = total_cost / transaction_count if transaction_count > 0 else 0

    return {
        'service':           service_name,
        'period_days':       days_back,
        'total_cost_usd':    round(total_cost, 2),
        'transactions':      transaction_count,
        'cost_per_txn_usd':  round(cost_per_txn, 6),
    }


# Example: payment service processed 4.2M transactions this month
result = calculate_cost_per_transaction('payment-api', 4_200_000)
print(f"Cost per transaction: ${result['cost_per_txn_usd']:.6f}")
print(f"Total infrastructure cost: ${result['total_cost_usd']:,.2f}")
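A single month's cost per transaction is a data point. The story leadership cares about is how it moves as volume grows. The sketch below compares two periods, assuming you already have the cost and transaction count for each. The `unit_cost_trend` helper and the sample figures are hypothetical, chosen so the unit costs match the $0.0021 and $0.0013 example used in the best practices later in this guide.

```python
def unit_cost_trend(prev_cost_usd, prev_txns, curr_cost_usd, curr_txns):
    """Compare total bill and cost per transaction across two periods."""
    prev_unit = prev_cost_usd / prev_txns
    curr_unit = curr_cost_usd / curr_txns
    return {
        "prev_unit_usd": round(prev_unit, 6),
        "curr_unit_usd": round(curr_unit, 6),
        "bill_change_pct": round((curr_cost_usd - prev_cost_usd) / prev_cost_usd * 100, 1),
        "unit_change_pct": round((curr_unit - prev_unit) / prev_unit * 100, 1),
    }


# Hypothetical: the bill went up slightly while traffic grew from 4.2M to 7.0M
trend = unit_cost_trend(8_820, 4_200_000, 9_100, 7_000_000)
print(trend)
```

In this made-up case the total bill rose by about 3% while cost per transaction fell by about 38%. A rising bill alongside a falling unit cost is the pattern to present as healthy growth.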

4.2 The Chargeback and Showback Models

Chargeback means actually billing departments for their cloud usage. Showback means showing departments their usage costs without the internal billing transfer. Both create the same outcome: engineers start caring about what they consume because someone they work with is paying attention to it.

# showback_report.py
# Generates monthly cost-by-team report for distribution to engineering leads

import boto3
from datetime import datetime

def generate_team_showback():
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG',       'Key': 'Team'},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
        ]
    )

    by_team = {}
    for group in response['ResultsByTime'][0].get('Groups', []):
        team    = group['Keys'][0].replace('Team$', '') or 'untagged'
        service = group['Keys'][1]
        cost    = float(group['Metrics']['UnblendedCost']['Amount'])

        if team not in by_team:
            by_team[team] = {'total': 0, 'services': {}}
        by_team[team]['total'] += cost
        by_team[team]['services'][service] = round(cost, 2)

    # Print sorted by total cost descending
    print(f"\n{'='*52}")
    print(f"  Month-to-Date Cloud Spend by Team")
    print(f"  Generated: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"{'='*52}\n")

    for team, data in sorted(by_team.items(), key=lambda x: x[1]['total'], reverse=True):
        print(f"  {team:<20} ${data['total']:>10,.2f} MTD")
        top_services = sorted(data['services'].items(), key=lambda x: x[1], reverse=True)[:3]
        for svc, cost in top_services:
            print(f"    └─ {svc:<30} ${cost:>8,.2f}")
    print()

generate_team_showback()

Essential Tools and Certifications

The tools that matter at each stage of this roadmap:

Stage	Tool	Why It Matters
1	AWS Cost Explorer	Free, built-in, the starting point for all cost analysis
1	AWS CLI `ce` commands	Scriptable cost queries — dashboards can't be automated
2	AWS Compute Optimizer	ML-powered rightsizing recommendations for EC2 and RDS
2	VPA (Kubernetes)	Pod-level rightsizing recommendations using actual usage
3	Infracost	PR-level cost estimation for Terraform changes
3	AWS Budgets	Proactive alerts — catches problems before the monthly invoice
4	AWS Cost and Usage Report + Athena	SQL-level billing analysis at any granularity
4	CloudHealth or Vantage	Multi-account, multi-cloud cost management

The one certification worth your time: FinOps Certified Practitioner from the FinOps Foundation. It takes 20 hours to prepare and $300 to sit. It signals to hiring managers and clients that you understand the discipline formally — which matters when you're the person leading FinOps conversations at the executive level.

Your 90-Day Action Plan

Month 1 — Foundation:

Enable Cost Explorer if it isn't already on. Run the baseline command from Section 1.1 and save the output. Run the untagged resource query from Section 1.2 and document how many resources are missing tags. Find your top three cost drivers. Present the findings to your engineering manager, not as a problem but as an opportunity with a dollar figure attached.

Month 2 — Quick Wins:

Run the rightsizing analyser from Section 2.1 on your EC2 fleet. Downsize the three highest-confidence candidates. Apply S3 lifecycle policies to your two largest buckets. Create VPC endpoints for S3, ECR, and DynamoDB. Estimate the savings from each action and document them against your baseline.

Month 3 — Automation and Habits:

Deploy the orphan reporter Lambda on a Sunday schedule. Add the cost check GitHub Action to your infrastructure repository. Start a monthly FinOps review meeting — even if it's just you and one other engineer. Build the habit before you need the audience.

Best Practices Summary

✅ Do: Establish a cost baseline before any optimisation. The number is meaningless without a comparison point.

✅ Do: Right-size before buying Savings Plans. Always. The sequence changes the outcome.

✅ Do: Enforce tagging at the infrastructure layer — Terraform or CloudFormation — not as a process reminder.

✅ Do: Move staging and development to Spot instances. The interruption rate is manageable; a 70% cost difference is too large to ignore.

✅ Do: Add VPC endpoints for S3, ECR, and DynamoDB before reviewing data transfer costs. It's a 30-minute fix for a multi-thousand-dollar line item.

✅ Do: Present cost findings as cost-per-business-metric, not as total bill. "We reduced cost per transaction from $0.0021 to $0.0013" is a business result. "$38,000/month reduction" is an accounting result.

❌ Don't: Buy Savings Plans on an unoptimised baseline. You'll lock in discounted waste.

❌ Don't: Build FinOps dashboards before tagging is complete. Beautiful charts with no attribution data answer no questions.

❌ Don't: Run orphaned resource cleanup without human review first. Run in report-only mode for two weeks, verify the candidates are genuinely orphaned, then add deletion logic.

Resources

FinOps Foundation Framework — The practitioner framework that defines the Inform, Optimise, and Operate cycle this roadmap is built on
AWS Cost Explorer API Reference — Full reference for the cost query commands used throughout this guide
AWS Compute Optimizer — AWS's own rightsizing recommendation service; complements the manual analysis in Stage 2
Infracost Documentation — Setup guide for the PR-level cost estimation tool in Stage 3
FinOps Certified Practitioner Exam — The certification referenced in the tools section
AWS Savings Plans Documentation — The authoritative reference on commitment types, coverage rules, and purchase strategy
Companion Repository — All scripts from this guide, including the rightsizing analyser, orphan reporter, and showback report generator

Ayobami Adejumo is a senior platform engineer and FinOps consultant. He has audited AWS infrastructure for 20+ Series A and Series B companies. He is an active FinOps Foundation Supporter.

The AWS FinOps Guide for Series A Startups: The 8 Cost Patterns That Appear After Product-Market Fit

Ayobami Adejumo — Tue, 02 Jun 2026 16:27:27 +0000

You raised your Series A. Engineering hired fast. Features shipped faster. And somewhere between month six and month twelve, someone forwarded you an AWS Cost Explorer screenshot with a line that only goes up.

That line isn't random. It follows a pattern. The same eight patterns, at the same growth stage, at almost every company I've audited.

This guide names all eight, shows you exactly where to look, and gives you the fix for each one. By the time you finish reading, you'll know which leaks are draining your runway — and what to do about them this week.

Who This Guide Is For
Before You Start: Establish Your Baseline
Pattern 1: The New Hire Experiment Tax
Pattern 2: Staging Environment Proliferation
Pattern 3: The NAT Gateway Tax
Pattern 4: The Savings Plan Timing Mistake
Pattern 5: Cross-AZ Data Transfer
Pattern 6: The gp2 Volume Trap
Pattern 7: The Infinite Log Trap
Pattern 8: The Orphaned Resource Collector
The Full Savings Summary
What to Do This Week
Resources

Who This Guide Is For

This guide is written for engineers, CTOs, and technical co-founders at Series A companies — typically 15 to 80 engineers, AWS bills between $20,000 and $150,000 per month, and a finance team that has recently started paying attention to the infrastructure line.

You don't need a dedicated FinOps team. You need one engineer, one afternoon per week, and the eight patterns in this guide.

What you should have before starting:

AWS account access with Cost Explorer enabled
AWS CLI v2 configured (aws configure)
Basic familiarity with EC2, RDS, EBS, and S3
A Cost Explorer bookmark — you will use it constantly

Estimated time to complete all fixes: 8–20 engineering hours spread across two sprints. The reading takes around 20 minutes. The highest-ROI fix (Pattern 3) takes about 30 minutes.

Before You Start: Establish Your Baseline

Don't skip this step. Optimization without a baseline is just guessing. Run this command before touching anything:

# Pull last month's AWS cost breakdown by service
# This becomes your before number — save it somewhere
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].[Metrics.UnblendedCost.Amount,Keys[0]]' \
  --output text | sort -rn

Then screenshot the output. Name the file aws-baseline-YYYY-MM.png. You'll compare against this after each fix to verify actual savings.

The typical breakdown at Series A looks like this:

AWS Service	% of Bill	Waste Potential
EC2 (compute)	45–55%	High
Data Transfer	15–20%	Very High
RDS	10–15%	Medium
EBS	8–12%	Medium
CloudWatch	3–6%	Medium
Load Balancers	3–5%	Low

Now let's go through each pattern.

Pattern 1: The New Hire Experiment Tax

Every engineering hire needs a development environment. This is expected. What's not expected is what happens after the feature ships: nothing.

The environment keeps running. At $0.192/hour for an m5.xlarge, a forgotten dev environment costs $138/month. Ten engineers who each forgot one environment is $1,380/month — for infrastructure that's doing precisely nothing.
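The arithmetic behind those figures is easy to reproduce for your own instance types. This is a minimal sketch that assumes a 720-hour (30-day) month. The function name is illustrative, and the hourly rate is the one quoted above.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month

def idle_monthly_cost(hourly_rate_usd: float, instances: int = 1) -> float:
    """Monthly cost of instances that run around the clock doing nothing."""
    return round(hourly_rate_usd * HOURS_PER_MONTH * instances, 2)

one_environment = idle_monthly_cost(0.192)
ten_environments = idle_monthly_cost(0.192, instances=10)

print(f"One forgotten m5.xlarge: ${one_environment:.2f}/month")
print(f"Ten forgotten m5.xlarge: ${ten_environments:.2f}/month")
```

Swap in the hourly rate of whatever instance type your dev environments actually use. Attached RDS instances and EBS volumes add to this figure.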

This pattern accelerates after a Series A because hiring moves fast. A new engineer joins on Monday, spins up an EC2, an RDS, and a namespace in the dev cluster, ships the feature by Friday, and moves to the next ticket. The environment isn't on anyone's radar. There's no off-boarding process for dev resources.

What the waste looks like:

Dev environment for Alice (feature/payment-flow):
  EC2 m5.xlarge — last CPU activity: 23 days ago
  RDS db.t3.medium — last connection: 19 days ago
  EKS namespace — last pod scheduled: 15 days ago
  Monthly cost: $187
  Status: running

Finding it:

# Find EC2 instances with average CPU below 5% for the last 14 days
# These are idle instances — candidates for shutdown or termination
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --period 1209600 \
  --statistics Average \
  --start-time $(date -d '14 days ago' --iso-8601=seconds) \
  --end-time $(date --iso-8601=seconds) \
  --dimensions Name=InstanceId,Value=YOUR_INSTANCE_ID \
  --query 'Datapoints[*].{Average:Average}' \
  --output table

The Fix — an Automatic Idle Instance Stopper:

The Lambda below runs every night at 22:00 UTC. It checks every running EC2 instance tagged Environment=dev or Environment=development for CPU utilisation over the past seven days. Any instance averaging below 5% gets stopped automatically. An SNS notification goes to the dev-environment-alerts topic at the moment of the stop, naming the instance and its Owner tag and including the restart command. To exempt an instance from future runs, add a KeepAlive=true tag.

# idle_environment_stopper.py
# Deploy as a Lambda function triggered by EventBridge on schedule: cron(0 22 * * ? *)
# This stops idle dev environments before they run through the night and weekend

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')

IDLE_CPU_THRESHOLD = 5.0      # Stop instances below this average CPU %
IDLE_DAYS = 7                  # Look back 7 days of CloudWatch data
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT:dev-environment-alerts'

def get_average_cpu(instance_id):
    """Return the 7-day average CPU utilisation for an EC2 instance."""
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=IDLE_DAYS),
        EndTime=datetime.now(timezone.utc),
        Period=604800,  # One 7-day period
        Statistics=['Average']
    )
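    datapoints = response.get('Datapoints', [])
    if not datapoints:
        # No metrics yet (for example a freshly launched instance): report as busy
        # so the instance is never stopped on missing data
        return float('inf')
    # The 7-day window can straddle a period boundary and return two datapoints
    return sum(d['Average'] for d in datapoints) / len(datapoints)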

def lambda_handler(event, context):
    """Stop idle dev instances and notify their owners."""
    
    # Find all running dev instances
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag:Environment', 'Values': ['dev', 'development']},
        ]
    )

    stopped = []
    skipped = []

    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}

            # Skip instances explicitly marked to keep alive
            if tags.get('KeepAlive', '').lower() == 'true':
                skipped.append(instance_id)
                continue

            avg_cpu = get_average_cpu(instance_id)

            if avg_cpu < IDLE_CPU_THRESHOLD:
                # Notify the owner before stopping
                owner = tags.get('Owner', 'unknown')
                sns.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Subject=f'Dev environment stopped: {instance_id}',
                    Message=(
                        f'Instance {instance_id} (Owner: {owner}) had {avg_cpu:.1f}% average CPU '
                        f'over {IDLE_DAYS} days and has been stopped.\n\n'
                        f'To prevent this, add the tag: KeepAlive=true\n'
                        f'To restart: aws ec2 start-instances --instance-ids {instance_id}'
                    )
                )
                ec2.stop_instances(InstanceIds=[instance_id])
                stopped.append({'id': instance_id, 'owner': owner, 'avg_cpu': avg_cpu})

    print(f"Stopped {len(stopped)} idle instances. Skipped {len(skipped)} keep-alive instances.")
    return {'stopped': stopped, 'skipped': skipped}
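The stop decision in the Lambda above can be pulled out into a pure function, so it can be unit tested without AWS credentials. This is a sketch, not part of the original script: should_stop is a hypothetical helper name, and treating missing metric data as "do not stop" is an assumption.

```python
from typing import Dict, Optional

IDLE_CPU_THRESHOLD = 5.0  # same threshold as the Lambda above

def should_stop(tags: Dict[str, str], avg_cpu: Optional[float],
                threshold: float = IDLE_CPU_THRESHOLD) -> bool:
    """Decide whether a dev instance should be stopped (hypothetical helper)."""
    # An explicit KeepAlive=true tag always wins
    if tags.get('KeepAlive', '').lower() == 'true':
        return False
    # Assumption: no metric data means "unknown", which is treated as busy
    if avg_cpu is None:
        return False
    return avg_cpu < threshold

print(should_stop({'Environment': 'dev'}, 1.2))
print(should_stop({'Environment': 'dev', 'KeepAlive': 'True'}, 1.2))
print(should_stop({'Environment': 'dev'}, None))
```

Keeping the decision separate from the boto3 calls also makes it cheap to add a dry-run mode that logs what would be stopped before the Lambda is allowed to stop anything.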

Monthly savings: $1,000–$2,000 depending on team size and how long the pattern has been running.

Pattern 2: Staging Environment Proliferation

Staging starts as one environment. Then the frontend team needs their own because the backend team keeps breaking theirs. Then the ML team needs isolated compute. Then QA needs a stable environment for integration tests.

Before anyone notices, you have four staging environments running 24/7, each one idle for 16 hours of every day.

The waste isn't in the existence of the environments. It's in the schedule. Staging environments don't need to run at 3am.

What the waste looks like:

staging-frontend:   $250/month   Used: Mon-Fri 09:00-18:00
staging-backend:    $250/month   Used: Mon-Fri 09:00-18:00
staging-ml:         $250/month   Used: Mon-Fri 10:00-17:00
staging-qa:         $250/month   Used: Mon-Fri 09:00-17:00
Total:            $1,000/month   Running: 24 hours/day, 7 days/week
Actual usage:        ~35%        You are paying 100%

Finding it:

# List the EKS node groups in your cluster, then check which ones serve staging
aws eks list-nodegroups --cluster-name your-cluster-name --output table

# Check EC2 instances tagged staging and their launch time
# Any instance running > 30 days with no weekend stop schedule is a candidate
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=staging" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,Launch:LaunchTime}' \
  --output table

The Fix — Scheduled Start and Stop with AWS Instance Scheduler:

# Option 1: Tag-based scheduling with AWS Instance Scheduler (CloudFormation solution)
# Add these tags to your staging EC2 instances and RDS clusters:
# Schedule: office-hours
# This starts instances at 08:00 and stops them at 20:00 Mon-Fri
# Weekend: completely off

# Option 2: Quick Lambda-based solution — stop all staging at 20:00 weekdays
aws events put-rule \
  --schedule-expression "cron(0 20 ? * MON-FRI *)" \
  --name stop-staging-environments \
  --state ENABLED

# The stop Lambda — same pattern as Pattern 1 but targets staging tag
# Add a corresponding start rule at 07:30 Mon-Fri
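To estimate what a schedule is worth before deploying it, compute the fraction of the week the instances will actually run. This is a minimal sketch with illustrative function names. It covers only the hourly compute charge. EBS volumes and the storage of stopped RDS instances keep billing, so treat the result as a lower bound on the scheduled cost.

```python
HOURS_PER_WEEK = 24 * 7

def schedule_fraction(start_hour: int, stop_hour: int, days_per_week: int) -> float:
    """Fraction of the week an environment runs under a daily start/stop schedule."""
    if not (0 <= start_hour < stop_hour <= 24):
        raise ValueError("expected 0 <= start_hour < stop_hour <= 24")
    return (stop_hour - start_hour) * days_per_week / HOURS_PER_WEEK

def scheduled_compute_cost(always_on_monthly_usd: float, fraction: float) -> float:
    """Compute-only monthly cost once the schedule is in place."""
    return round(always_on_monthly_usd * fraction, 2)

# Office-hours schedule from Option 1: 08:00 to 20:00, Monday to Friday
office_hours = schedule_fraction(8, 20, 5)

print(f"Running {office_hours:.1%} of the week")
print(f"Compute cost for $1,000 of always-on staging: ${scheduled_compute_cost(1000, office_hours)}")
```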

Consolidation in Addition to Scheduling

If frontend and backend share a database schema, consolidate them into one shared staging environment with namespace-level isolation. The combined cost is lower than two separate environments:

# One shared staging cluster with namespace isolation
# frontend-staging and backend-staging share nodes via Karpenter
# but are isolated by namespace-level network policies
apiVersion: v1
kind: Namespace
metadata:
  name: staging-frontend
  labels:
    environment: staging
    team: frontend
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging-backend
  labels:
    environment: staging
    team: backend

The math:

Scenario	Monthly cost
Before: 4 environments, always on	$1,000
After: 2 consolidated environments, office hours only	$290
Monthly savings	$710

Pattern 3: The NAT Gateway Tax

NAT Gateway is the most consistently underestimated line item on every AWS bill I've audited. It charges $0.045 per GB of data processed — and in EKS clusters, a staggering amount of traffic flows through it by default.

Every pod that pulls a container image from ECR goes through NAT Gateway. Every VPC-attached Lambda that writes to S3 goes through NAT Gateway. Every service that polls SQS, queries DynamoDB, or calls the Secrets Manager API goes through NAT Gateway, unless you have configured VPC endpoints.

VPC endpoints create a private connection between your VPC and the AWS service. Traffic routes through the AWS backbone instead of NAT Gateway. For the S3 and DynamoDB gateway endpoints, the data transfer becomes free. Interface endpoints (ECR, SQS, Secrets Manager) carry an hourly charge and a per-GB charge of their own, but the per-GB rate is a fraction of the NAT Gateway rate.

What the waste looks like:

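# Run this to see last month's NAT Gateway charges (data processing plus hourly)
# Usage types carry a region prefix outside us-east-1 (for example USW2-NatGateway-Bytes)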
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["NatGateway-Bytes", "NatGateway-Hours"]
    }
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

If this number is above $200, you have a NAT Gateway problem. At most Series A companies running EKS, it is between $800 and $6,000.
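To turn that bill into an expected saving, split the NAT traffic by destination and price each part. This is a rough sketch. The $0.045/GB NAT rate comes from the text above. The interface endpoint rates ($0.01/GB and $0.01 per endpoint per AZ per hour) are assumptions to check against current AWS pricing for your region, and the traffic volumes are hypothetical.

```python
NAT_USD_PER_GB = 0.045                   # from the text above
INTERFACE_ENDPOINT_USD_PER_GB = 0.01     # assumption: verify against current pricing
INTERFACE_ENDPOINT_USD_PER_HOUR = 0.01   # assumption: per endpoint, per AZ
HOURS_PER_MONTH = 720                    # assumption: 30-day month

def gateway_endpoint_saving(gb_per_month: float) -> float:
    """S3 and DynamoDB gateway endpoints are free, so the whole NAT charge goes away."""
    return round(gb_per_month * NAT_USD_PER_GB, 2)

def interface_endpoint_saving(gb_per_month: float, endpoints: int, azs: int) -> float:
    """NAT charge avoided minus what the interface endpoints themselves cost."""
    avoided = gb_per_month * NAT_USD_PER_GB
    endpoint_cost = (gb_per_month * INTERFACE_ENDPOINT_USD_PER_GB
                     + endpoints * azs * INTERFACE_ENDPOINT_USD_PER_HOUR * HOURS_PER_MONTH)
    return round(avoided - endpoint_cost, 2)

# Hypothetical monthly traffic through NAT Gateway
s3_saving = gateway_endpoint_saving(12_000)
sqs_saving = interface_endpoint_saving(5_000, endpoints=1, azs=3)

print(f"S3 gateway endpoint saves about ${s3_saving}/month")
print(f"One interface endpoint across 3 AZs saves about ${sqs_saving}/month")
```

An interface endpoint for a low-traffic service can cost more than it saves, which is why the gateway endpoints are the ones to create first.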

The Fix — VPC Endpoints for the Four Highest-traffic AWS Services:

# Get your VPC ID and route table ID first
VPC_ID=$(aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=your-vpc-name" \
  --query 'Vpcs[0].VpcId' --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=$VPC_ID" "Name=association.main,Values=true" \
  --query 'RouteTables[0].RouteTableId' --output text)

# S3 gateway endpoint — free to create, eliminates all S3 NAT charges
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids $ROUTE_TABLE_ID

# DynamoDB gateway endpoint — also free
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids $ROUTE_TABLE_ID

# ECR API endpoint — eliminates NAT charges on every container pull
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Tier,Values=private" \
    --query 'Subnets[*].SubnetId' --output text)

# ECR Docker endpoint — required alongside ECR API for image pulls
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Tier,Values=private" \
    --query 'Subnets[*].SubnetId' --output text)

When explaining this to your CFO, call it the NAT tax. They understand taxes. "We're paying a $0.045/GB tax on internal network traffic that we can eliminate in 30 minutes" lands better than "data processing bytes."

Monthly savings: $2,000–$8,000 depending on your container pull frequency and S3 usage.

Pattern 4: The Savings Plan Timing Mistake

A Savings Plan is a commitment to spend a fixed dollar amount per hour on AWS compute for one or three years in exchange for a 30–70% discount. The math is attractive. The timing is where teams go wrong.

When the bill gets large, the instinct is to commit. Buy the Savings Plan, reduce the bill, show the CFO. The problem: if you haven't rightsized first, you're committing to pay for waste at a discount. When you rightsize later, your actual spend drops below your commitment — and you pay for compute you're not using.

What wrong order looks like:

Step 1: AWS bill is $100,000/month
Step 2: Buy a Savings Plan commitment sized to $70,000/month
Step 3: Rightsize instances, and actual spend drops to $60,000
Step 4: Savings Plan covers $70,000 but you only use $60,000
Step 5: You pay $28,000/month for compute you do not use
         (assuming a 30% discount, $60,000 of usage consumes only $42,000 of the commitment)
         
Net result: You locked in waste for 12 months

What right order looks like:

Step 1: Rightsize instances, and spend drops from $100,000 to $60,000
Step 2: Add Spot for staging, and spend drops from $60,000 to $45,000
Step 3: Migrate compatible workloads to Graviton, and spend drops to $36,000
Step 4: NOW buy a Savings Plan covering $25,000/month (70% of steady-state)
Step 5: Effective monthly cost: $12,500 for committed + $11,000 on-demand = $23,500

Net result: $76,500/month saved versus the original bill
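The figures in both sequences follow from simple arithmetic under two assumptions the text does not state: a 30% Savings Plan discount in the wrong-order case and a 50% discount in the right-order case. Those rates are inferred from the numbers above, not quoted from AWS pricing. Real commitments are hourly, so the monthly view here is a simplification.

```python
def unused_commitment(monthly_commitment: float, on_demand_usage: float,
                      discount: float) -> float:
    """Commitment paid for but not consumed, given usage priced at On-Demand rates."""
    consumed = on_demand_usage * (1 - discount)
    return round(max(0.0, monthly_commitment - consumed), 2)

def blended_monthly_cost(steady_state: float, covered_on_demand: float,
                         discount: float) -> float:
    """Discounted cost of the covered portion plus On-Demand for the rest."""
    committed = covered_on_demand * (1 - discount)
    on_demand = steady_state - covered_on_demand
    return round(committed + on_demand, 2)

# Wrong order: commit first, rightsize later (assumed 30% discount)
wasted = unused_commitment(70_000, 60_000, discount=0.30)

# Right order: optimise to $36,000, then cover $25,000 of it (assumed 50% discount)
final_cost = blended_monthly_cost(36_000, 25_000, discount=0.50)
saved = 100_000 - final_cost

print(f"Wrong order: ${wasted:,.0f}/month of unused commitment")
print(f"Right order: ${final_cost:,.0f}/month, saving ${saved:,.0f}/month")
```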

How to check what you should commit to:

# View your last 30 days of EC2 On-Demand spend
# This is your rightsized baseline — what you actually use after optimisation
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On Demand Instances"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own Savings Plan recommendation based on your usage
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

As a rule, commit to 60–70% of your steady-state On-Demand spend after optimisation. Leave 30–40% flexible. Never commit on the unoptimised baseline.
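The 60–70% rule can be applied directly to the daily figures from the first command above. This is a minimal sketch in which "steady state" is taken to mean the lowest daily On-Demand spend in the window multiplied by 30. That definition is an assumption, and the daily costs are made-up sample data.

```python
from typing import List, Tuple

def steady_state_monthly(daily_costs: List[float], days: int = 30) -> float:
    """Assumed definition: the lowest observed daily spend, projected over a month."""
    if not daily_costs:
        raise ValueError("daily_costs must not be empty")
    return round(min(daily_costs) * days, 2)

def commitment_range(steady_state: float, low: float = 0.60,
                     high: float = 0.70) -> Tuple[float, float]:
    """Range of monthly On-Demand spend to cover with a Savings Plan."""
    return round(steady_state * low, 2), round(steady_state * high, 2)

# Hypothetical daily On-Demand EC2 costs after optimisation
daily = [1250.0, 1190.0, 1300.0, 1200.0, 1420.0, 1210.0, 1205.0]

steady = steady_state_monthly(daily)
low, high = commitment_range(steady)

print(f"Steady state: ${steady:,.0f}/month")
print(f"Cover between ${low:,.0f} and ${high:,.0f} of it with a Savings Plan")
```

Using the minimum rather than the average keeps spiky days out of the commitment. Compare the result with the recommendation command's output before purchasing.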

Monthly savings: $5,000–$15,000 depending on compute spend. This is the pattern with the highest single-action ROI when sequenced correctly.

Pattern 5: Cross-AZ Data Transfer

AWS charges $0.01 per GB in each direction when data crosses an Availability Zone boundary. $0.01 sounds negligible. It's not — because AZ boundaries are crossed constantly in distributed systems, and the charge is bidirectional.

The most common scenario: your application pods are scheduled across multiple AZs (as they should be for resilience), but your database is pinned to one AZ. Every database query from a pod in a different AZ costs $0.01/GB going to the database and $0.01/GB coming back. At 100GB of database traffic per day, that's $60/month. At 1TB per day, it is $600/month.
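The estimate above is straightforward to rerun with your own traffic numbers. This sketch uses the $0.01/GB each-way rate from the text and a 30-day month. The cross_az_fraction parameter is an addition for illustration, covering the case where only some of your pods sit outside the database's AZ.

```python
CROSS_AZ_USD_PER_GB_EACH_WAY = 0.01  # from the text above

def cross_az_monthly_cost(gb_per_day: float, cross_az_fraction: float = 1.0,
                          days: int = 30) -> float:
    """Monthly charge for traffic crossing an AZ boundary, billed in both directions."""
    if not 0.0 <= cross_az_fraction <= 1.0:
        raise ValueError("cross_az_fraction must be between 0 and 1")
    billed_gb = gb_per_day * cross_az_fraction * days
    return round(billed_gb * 2 * CROSS_AZ_USD_PER_GB_EACH_WAY, 2)

print(cross_az_monthly_cost(100))     # 100 GB/day, all of it cross-AZ
print(cross_az_monthly_cost(1000))    # 1 TB/day, all of it cross-AZ
print(cross_az_monthly_cost(1000, cross_az_fraction=0.6))
```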

What the waste looks like:

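# Check last month's cross-AZ data transfer charges
# Usage types carry a region prefix outside us-east-1 (for example USW2-DataTransfer-Regional-Bytes)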
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --filter '{"Dimensions": {"Key": "USAGE_TYPE", "Values": ["DataTransfer-Regional-Bytes"]}}'  \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

How to find which pods are causing the cross-AZ traffic:

# Check which AZ your database RDS instance is in
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,AZ:AvailabilityZone}' \
  --output table

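# Count application pods per node (column 7 of the wide output is NODE, not the AZ)
kubectl get pods -o wide -n production --no-headers | awk '{print $7}' | sort | uniq -c

# Map each node to its AZ using the standard zone label
kubectl get nodes -L topology.kubernetes.io/zone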

If your RDS is in us-east-1a and 60% of your pods are in us-east-1b and us-east-1c, you have a cross-AZ traffic problem.

The Fix — Topology-aware Routing:

# topology-aware-routing.yaml
# This tells kube-proxy to prefer Service endpoints in the same AZ
# as the client making the request, keeping traffic local

apiVersion: v1
kind: Service
metadata:
  name: payment-api
  namespace: production
  annotations:
    # Route traffic to pods in the same AZ as the caller when possible
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: payment-api
  ports:
  - port: 8080
    targetPort: 8080

# For the pods themselves: spread across AZs so every zone has local endpoints
# topologySpreadConstraints ensures even distribution
# while topology-aware routing keeps traffic within AZs
# (this fragment belongs in the pod template spec of the Deployment)

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payment-api

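For database traffic specifically, topology-aware routing does not help, because the database is not a Kubernetes Service. Consider moving from single-AZ RDS to Aurora with a reader instance in each AZ your pods run in: replication between Aurora instances is not billed as data transfer, and read queries sent to the reader in the caller's own AZ stay local. Writes still go to the single writer instance, so traffic from pods in other AZs to the writer remains cross-AZ.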

Monthly savings: $500–$6,000 depending on database query volume and AZ distribution of your pods.

Pattern 6: The gp2 Volume Trap

In 2014, AWS launched gp2 EBS volumes. In 2020, they launched gp3 — cheaper, faster, and with better baseline performance. In 2026, most Series A companies are still running gp2.

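The difference: gp2 costs $0.10/GB/month and provides 3 IOPS per GB (100 IOPS minimum). gp3 costs $0.08/GB/month and provides 3,000 IOPS baseline regardless of size. gp3 is 20% cheaper, and a 100 GB volume goes from 300 IOPS on gp2 to 3,000 on gp3, a 10x increase. The exception is volumes larger than 1,000 GB, where the gp2 baseline already exceeds 3,000 IOPS; provision additional gp3 IOPS for those before migrating. The migration is online: it runs while the volume is attached and in use.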
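The price and IOPS comparison can be checked per volume before migrating. This is a small sketch using the rates quoted above. The 16,000 IOPS cap on gp2 is the documented maximum for that volume type, and the function names are illustrative.

```python
GP2_USD_PER_GB = 0.10
GP3_USD_PER_GB = 0.08
GP3_BASELINE_IOPS = 3000

def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline: 3 IOPS per GB, floor of 100, capped at 16,000."""
    return min(max(100, 3 * size_gb), 16_000)

def monthly_saving(size_gb: int) -> float:
    """Storage saving from moving a volume of this size from gp2 to gp3."""
    return round(size_gb * (GP2_USD_PER_GB - GP3_USD_PER_GB), 2)

def needs_extra_gp3_iops(size_gb: int) -> bool:
    """True when the gp2 baseline is higher than what gp3 gives by default."""
    return gp2_baseline_iops(size_gb) > GP3_BASELINE_IOPS

for size in (20, 100, 500, 2000):
    print(size, gp2_baseline_iops(size), monthly_saving(size), needs_extra_gp3_iops(size))
```

Volumes flagged by needs_extra_gp3_iops are the ones to migrate by hand with an explicit IOPS setting rather than through a bulk script.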

Finding all your gp2 volumes:

# List every gp2 volume in your account with its size in GB
# Monthly cost per volume is roughly Size x $0.10 (JMESPath cannot do the multiplication)
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].{ID:VolumeId,SizeGB:Size,State:State}' \
  --output table

# Count the total: number of volumes and combined GB
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'length(Volumes)' --output text

aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'sum(Volumes[*].Size)' --output text

The Fix — Migrate All gp2 to gp3 in One Script:

#!/bin/bash
# migrate_gp2_to_gp3.sh
# Migrates all gp2 volumes to gp3. Online operation — no downtime.
# Each modification runs asynchronously; the volume stays available throughout.

echo "Starting gp2 to gp3 migration..."

# Get all gp2 volume IDs
VOLUMES=$(aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text)

COUNT=0
for VOL_ID in $VOLUMES; do
  echo "Migrating $VOL_ID to gp3..."
  aws ec2 modify-volume \
    --volume-id $VOL_ID \
    --volume-type gp3 \
    --no-cli-pager
  COUNT=$((COUNT + 1))
done

echo "Migration initiated for $COUNT volumes."
echo "Modifications run online — no downtime. Monitor progress:"
echo "aws ec2 describe-volumes-modifications --query 'VolumesModifications[*].{ID:VolumeId,State:ModificationState}'"

Verify completion:

# Check that no gp2 volumes remain
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'length(Volumes)' \
  --output text
# Expected: 0

Monthly savings: 20% of your gp2 storage spend. At $10,000/month in gp2 volumes, that's $2,000 saved for 30 minutes of work.

Pattern 7: The Infinite Log Trap

CloudWatch log groups have a default retention policy of "Never expire." Every log group created without an explicit retention setting accumulates logs indefinitely. For a busy Series A company, this means you're storing debug logs from 2022 that nobody has opened since the sprint review they were created for.

The cost compounds quietly. CloudWatch charges $0.03/GB/month for log storage and $0.50/GB for log ingestion. A cluster generating 50GB of logs per day ingests $25/day — $750/month — and then stores those logs forever at an increasing monthly cost.
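To see how the storage side compounds, compare the monthly storage charge with and without a retention policy. This is a sketch using the rates quoted above. It treats stored volume as equal to ingested volume, which overstates storage because CloudWatch compresses archived logs, so read the storage figures as an upper bound.

```python
from typing import Optional

INGESTION_USD_PER_GB = 0.50
STORAGE_USD_PER_GB_MONTH = 0.03

def monthly_ingestion_cost(gb_per_day: float, days: int = 30) -> float:
    """Ingestion charge for one month of logging."""
    return round(gb_per_day * days * INGESTION_USD_PER_GB, 2)

def monthly_storage_cost(gb_per_day: float, months_elapsed: int,
                         retention_days: Optional[int] = None,
                         days_per_month: int = 30) -> float:
    """Storage charge in a given month; retention_days=None means 'Never expire'."""
    days_accumulated = months_elapsed * days_per_month
    if retention_days is not None:
        days_accumulated = min(days_accumulated, retention_days)
    return round(gb_per_day * days_accumulated * STORAGE_USD_PER_GB_MONTH, 2)

print(monthly_ingestion_cost(50))                         # ingestion, every month
print(monthly_storage_cost(50, months_elapsed=12))        # storage after a year, no policy
print(monthly_storage_cost(50, 12, retention_days=30))    # storage after a year, 30-day policy
```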

Finding log groups with no retention policy:

# List all log groups with their retention settings
# Any group showing "retentionInDays: null" is infinite — it never expires
aws logs describe-log-groups \
  --query 'logGroups[*].{Name:logGroupName,RetentionDays:retentionInDays,StoredBytes:storedBytes}' \
  --output table | grep -E "(None|null)"

# Count how many log groups have no retention set
aws logs describe-log-groups \
  --query 'length(logGroups[?retentionInDays==`null`])' \
  --output text

The Fix — Set Retention Policies in Bulk:

Different log types have different compliance requirements. Debug logs don't need to be kept. Audit logs might need 365 days. The table below gives sensible defaults:

Log Type	Recommended Retention	Reason
Application debug logs	14 days	Only useful for active debugging
Application error logs	90 days	Post-incident investigation window
Access logs	30 days	Security review window
CloudTrail audit logs	365 days	SOC2 evidence requirement
VPC Flow Logs	90 days	Security investigation window

#!/bin/bash
# set_log_retention.sh
# Sets 30-day retention on all log groups that have no policy set
# Adjust the retention period per log group type as needed

echo "Setting retention policies on log groups with no expiry..."

# Get all log groups with no retention
aws logs describe-log-groups \
  --query 'logGroups[?retentionInDays==`null`].logGroupName' \
  --output text | tr '\t' '\n' | while read LOG_GROUP; do

  # CloudTrail logs get 365 days instead of 30: longer retention is needed for SOC2
  if echo "$LOG_GROUP" | grep -qi "cloudtrail"; then
    echo "Setting 365-day retention on CloudTrail log group: $LOG_GROUP"
    aws logs put-retention-policy \
      --log-group-name "$LOG_GROUP" \
      --retention-in-days 365
    continue
  fi

  # Set 30-day retention on all other log groups
  echo "Setting 30-day retention on: $LOG_GROUP"
  aws logs put-retention-policy \
    --log-group-name "$LOG_GROUP" \
    --retention-in-days 30
done

echo "Done. Logs older than their retention period will be deleted automatically by CloudWatch."
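The script above applies a flat 30 days to everything except CloudTrail. A per-type mapping that follows the retention table could be sketched as below. The keyword matching assumes your log group names contain words like "debug" or "access", which is a naming convention you may not have, so treat the rules as placeholders.

```python
# (keyword expected in the log group name, retention in days), checked in order
RETENTION_RULES = [
    ("cloudtrail", 365),  # audit logs: SOC2 evidence
    ("flow", 90),         # VPC Flow Logs
    ("error", 90),        # application error logs
    ("access", 30),       # access logs
    ("debug", 14),        # application debug logs
]
DEFAULT_RETENTION_DAYS = 30

def retention_days_for(log_group_name: str) -> int:
    """Pick a retention period from the log group name (hypothetical naming scheme)."""
    lowered = log_group_name.lower()
    for keyword, days in RETENTION_RULES:
        if keyword in lowered:
            return days
    return DEFAULT_RETENTION_DAYS

for name in ("/aws/cloudtrail/org-trail", "/vpc/flow-logs/prod",
             "/app/payment-api/debug", "/aws/lambda/thumbnailer"):
    print(name, retention_days_for(name))
```

The returned value can be passed to put-retention-policy in place of the hard-coded 30. Every value in the table is one that CloudWatch accepts as a retention period.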

Monthly savings: $500–$2,000 on storage costs. Retention policies do not touch ingestion charges; those fall only when noisy debug logging is turned down, and that reduction is immediate. The storage cost reduction compounds over 30–90 days as old logs expire.

Pattern 8: The Orphaned Resource Collector

Every departed engineer leaves a trail. An EBS volume attached to a terminated instance. An Elastic IP allocated but not associated. A load balancer fronting a service that was deprecated in Q3. Old snapshots from an RDS instance that was replaced. None of these are intentional, but all of them are billed.

The fix is a weekly audit. Not a manual investigation — an automated script that runs every Sunday night, finds orphaned resources, and sends a Slack message with a list of candidates for deletion.

Finding the orphans:

# Unattached EBS volumes: you are paying for storage that nothing is using
# Monthly cost is roughly Size x $0.08 for gp3 or Size x $0.10 for gp2
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,SizeGB:Size,Type:VolumeType,Created:CreateTime}' \
  --output table

# Unassociated Elastic IPs — $3.60/month each when not attached to a running instance
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
  --output table

# Old snapshots: created more than 90 days ago, candidates for review
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601=seconds)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table

# Load balancers: this lists every one; check their CloudWatch request metrics to find those routing zero traffic
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[*].{ARN:LoadBalancerArn,DNS:DNSName,State:State.Code}' \
  --output table

The weekly report Lambda:

# orphan_resource_reporter.py
# Runs every Sunday at 20:00 via EventBridge
# Reports orphaned resources to Slack — does NOT auto-delete
# Deletion requires a human decision. The Lambda surfaces the candidates.

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

def get_orphaned_resources():
    """Collect all orphaned AWS resources and their estimated monthly costs."""
    ec2 = boto3.client('ec2')
    elbv2 = boto3.client('elbv2')
    report = {'total_monthly_waste': 0, 'resources': []}

    # Unattached EBS volumes ($0.08/GB/month for gp3)
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']
    for vol in volumes:
        monthly_cost = round(vol['Size'] * 0.08, 2)
        report['resources'].append({
            'type': 'Unattached EBS Volume',
            'id': vol['VolumeId'],
            'detail': f"{vol['Size']}GB {vol['VolumeType']}",
            'monthly_cost': monthly_cost
        })
        report['total_monthly_waste'] += monthly_cost

    # Unassociated Elastic IPs ($3.60/month each)
    addresses = ec2.describe_addresses()['Addresses']
    for addr in addresses:
        if 'AssociationId' not in addr:
            report['resources'].append({
                'type': 'Unassociated Elastic IP',
                'id': addr['AllocationId'],
                'detail': addr['PublicIp'],
                'monthly_cost': 3.60
            })
            report['total_monthly_waste'] += 3.60

    # Snapshots older than 90 days
    # Compare timezone-aware datetimes directly rather than ISO strings
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    snapshots = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    old_snapshots = [s for s in snapshots if s['StartTime'] < cutoff]
    for snap in old_snapshots:
        # Upper-bound estimate: snapshots are incremental, so the billed size is usually smaller
        monthly_cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
        report['resources'].append({
            'type': 'Old Snapshot (90+ days)',
            'id': snap['SnapshotId'],
            'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
            'monthly_cost': monthly_cost
        })
        report['total_monthly_waste'] += monthly_cost

    return report

def post_to_slack(report):
    """Send the orphaned resource report to Slack."""
    resource_lines = '\n'.join([
        f"• {r['type']} `{r['id']}` — {r['detail']} — *${r['monthly_cost']}/month*"
        for r in report['resources']
    ])

    message = {
        'text': (
            f":money_with_wings: *Weekly Orphaned Resource Report*\n\n"
            f"Found *{len(report['resources'])} orphaned resources* "
            f"costing *${report['total_monthly_waste']:.2f}/month*\n\n"
            f"{resource_lines}\n\n"
            f"Review and delete resources that are no longer needed."
        )
    }
    
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)

def lambda_handler(event, context):
    report = get_orphaned_resources()
    post_to_slack(report)
    return {
        'resources_found': len(report['resources']),
        'monthly_waste': report['total_monthly_waste']
    }

Monthly savings: $500–$2,000. Every departed engineer typically leaves $50–$200 in orphaned resources. At a team of 30 with 30% annual turnover, that compounds quickly.

The Full Savings Summary

Pattern	Monthly Saving	Time to Fix	Difficulty
1. New hire experiment tax	$1,000–$2,000	2 hours (Lambda)	Medium
2. Staging proliferation	$600–$800	3 hours (scheduling)	Low
3. NAT Gateway tax	$2,000–$8,000	30 minutes	Low
4. Savings Plan timing	$5,000–$15,000	One decision	Low
5. Cross-AZ data transfer	$500–$6,000	2 hours	Medium
6. gp2 volume trap	$1,000–$5,000	30 minutes (script)	Low
7. Infinite log trap	$500–$2,000	1 hour (script)	Low
8. Orphaned resources	$500–$2,000	2 hours (Lambda)	Low
Total potential	$11,100–$40,800/month
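The totals in the table, and a ranking by savings per engineering hour, can be reproduced from its rows. This sketch ranks on the low end of each savings range, which is a choice made for illustration. Pattern 4 is left out of the ranking because the table gives its effort as "One decision" rather than in hours.

```python
# (pattern, low saving USD, high saving USD, hours to fix or None if not given in hours)
PATTERNS = [
    ("1. New hire experiment tax", 1000, 2000, 2.0),
    ("2. Staging proliferation", 600, 800, 3.0),
    ("3. NAT Gateway tax", 2000, 8000, 0.5),
    ("4. Savings Plan timing", 5000, 15000, None),
    ("5. Cross-AZ data transfer", 500, 6000, 2.0),
    ("6. gp2 volume trap", 1000, 5000, 0.5),
    ("7. Infinite log trap", 500, 2000, 1.0),
    ("8. Orphaned resources", 500, 2000, 2.0),
]

def totals(patterns):
    """Sum the low and high ends of every savings range."""
    return sum(p[1] for p in patterns), sum(p[2] for p in patterns)

def rank_by_saving_per_hour(patterns):
    """Order patterns by low-end monthly saving per hour of engineering time."""
    timed = [p for p in patterns if p[3] is not None]
    return sorted(timed, key=lambda p: p[1] / p[3], reverse=True)

low_total, high_total = totals(PATTERNS)
print(f"Total potential: ${low_total:,} to ${high_total:,} per month")

for name, low, _high, hours in rank_by_saving_per_hour(PATTERNS):
    print(f"{name}: ${low / hours:,.0f} per hour")
```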

What to Do This Week

Don't fix all eight this week. Prioritise by ROI per hour of engineering time:

Day 1 (30 minutes): Pattern 3 — NAT Gateway endpoints. Highest ROI per minute of any fix in this guide. One command creates the S3 endpoint. Done.

Day 2 (30 minutes): Pattern 6 — gp2 to gp3 migration. Run the script. Check the output. Done.

Day 3 (1 hour): Pattern 7 — log retention policies. Run the bulk retention script. Done.

Day 4 (2 hours): Patterns 1 and 8 — deploy both Lambdas. They run automatically from here.

Next sprint: Pattern 2 (staging schedule), Pattern 5 (topology-aware routing), and Pattern 4 (run the rightsizing cycle first, then evaluate Savings Plans).

Open Cost Explorer after each fix. Compare against your baseline screenshot from the start of this guide. The line should start going down.

Resources

FinOps Foundation Framework — The practitioner framework this guide contributes to, covering Inform, Optimize, and Operate phases of cloud cost management
AWS Cost Explorer API Reference — Full reference for the get-cost-and-usage command used throughout this guide
AWS Compute Optimizer — AWS's own rightsizing recommendation service, used alongside the patterns in this guide for EC2 and EBS recommendations
AWS VPC Endpoints Documentation — Complete list of available VPC endpoints for Pattern 3
AWS Instance Scheduler Solution — The AWS-maintained CloudFormation solution for Pattern 2 environment scheduling
Karpenter Documentation — For teams ready to go beyond these 8 patterns into dynamic node provisioning and Spot diversification
FinOps Foundation Asset Library — The community asset library where practical scripts like the ones in this guide are contributed and maintained by practitioners

Ayobami Adejumo is a senior platform engineer and FinOps specialist. He has audited AWS infrastructure for 30+ Series A companies and contributes practical tooling to the FinOps Foundation Asset Library.

GDPR Article 32 for Software Engineers: Technical Controls, Implementations, and Auditor Questions

Ayobami Adejumo — Thu, 28 May 2026 16:20:25 +0000

When I first read GDPR Article 32, I made a mistake. I thought it was a legal document.

But it's not. It's an infrastructure specification.

The regulation says you need "appropriate technical measures" to protect personal data. That phrase is terrifying because it's vague. What does "appropriate" mean? What counts as a "technical measure"? Who decides whether you've done enough?

The compliance consultant will give you a 50-page policy document. The auditor will ignore it and ask for your database schema.

This guide is the middle ground. I've implemented Article 32 controls for 12 SaaS companies. The same nine controls appear every time, and so do the same auditor questions.

This is a complete guide to the 9 technical controls you must implement, the exact code and commands for each, and the questions your GDPR auditor will ask.

What You'll Learn
Prerequisites
Part 1: Understanding Article 32
Part 2: Article 32(1)(a) — Pseudonymisation and Encryption
Part 3: Article 32(1)(b) — Confidentiality and Integrity
Part 4: Article 32(1)(c) — Availability and Resilience
Part 5: Article 32(1)(d) — Regular Testing
Part 6: Penetration Testing
Best Practices Summary
What's Next
Resources

What You'll Learn

The 9 technical controls required by GDPR Article 32(1)(a) through (d)
Exact PostgreSQL commands for pseudonymisation and field-level encryption
How to implement automatic logoff and unique user identification
Application-level audit logging that goes beyond CloudTrail
Integrity controls that prove data has not been altered
mTLS and TLS 1.3 for transmission security
The auditor questions you must answer with evidence, one for each control

Let's dive in.

Prerequisites

Before following along, you should have:

Knowledge:

Familiarity with PostgreSQL and basic SQL
Basic understanding of AWS services (KMS, RDS, CloudTrail)
Comfort reading Python and JavaScript/Node.js code
A working knowledge of what GDPR is — if you are starting from scratch, read the ICO's GDPR overview first

Tools and access:

PostgreSQL 14 or later
An AWS account with IAM administrator access
Python 3.8 or later with cryptography library (pip install cryptography)
Node.js 16 or later
A compliance automation tool — Vanta or OneTrust — is optional but recommended for evidence collection

Estimated time: The controls in this guide take 2–4 weeks to implement fully, depending on your existing infrastructure. Individual controls range from 30 minutes (KMS key setup) to 5 days (full application-layer encryption rollout).

Part 1: Understanding Article 32 — The Technical Requirements

1.1. What Article 32 Actually Requires

Article 32 of the GDPR is titled "Security of processing." It requires controllers and processors to implement "appropriate technical and organisational measures" to ensure a level of security appropriate to the risk.

Here is the important distinction most teams miss: Article 32 is not a checklist of policies. A policy says "we encrypt personal data." Evidence says "here is the KMS key with automatic rotation, here is the application-layer encryption code, and here are the CloudTrail logs showing every decryption attempt." The auditor wants evidence, not documentation.

The four main requirements:

Section	Requirement	What It Means for Engineers
32(1)(a)	Pseudonymisation and encryption	Personal data must be stored so it cannot be attributed to a specific data subject without additional information held separately
32(1)(b)	Confidentiality, integrity, availability, and resilience	Systems must protect data from unauthorised access, alteration, loss, and be able to recover from incidents
32(1)(c)	Restoring availability and access	You must be able to restore data and regain system access after a physical or technical incident
32(1)(d)	Regular testing and risk assessment	You must have a process for regularly testing and evaluating your security measures

1.2. The Scope Question: What Data Is Covered?

Before implementing any controls, you must know what data falls under Article 32. The regulation applies to personal data — any information that can identify a living individual directly or indirectly.

Data types and their protection levels:

Category	Examples	Protection Level
Personal data	Name, email, phone, IP address	Standard
Special category data (often called sensitive personal data)	Health data, biometric data, political opinions, religious beliefs	Enhanced
Pseudonymised data	Data where direct identifiers are replaced with a code	Standard
Anonymised data	Data that cannot be re-identified under any reasonable circumstances	Out of scope

The data mapping question your auditor will ask:

"Can you provide a data flow diagram showing where personal data enters your system, where it is stored, where it is processed, and how it is deleted?"

Before the auditor asks, run this command to list every RDS instance in your current AWS region along with its encryption status. Repeat it for each region you use:

# List all RDS instances with their encryption status
# Any StorageEncrypted: false is a finding
aws rds describe-db-instances \
  --query 'DBInstances[*].{
    ID:DBInstanceIdentifier,
    Engine:Engine,
    StorageEncrypted:StorageEncrypted,
    AZ:AvailabilityZone
  }' \
  --output table

Any instance showing StorageEncrypted: false must be addressed before your Article 32 audit.
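
The same check can be scripted so the result can be filed as evidence. The sketch below is a minimal example, assuming you have the JSON form of the describe-db-instances output (run with --output json). The function name and the sample instance names are illustrative and not part of any AWS tooling.

```python
from typing import Dict, List


def unencrypted_instances(describe_output: Dict) -> List[str]:
    """Return identifiers of RDS instances whose storage is not encrypted."""
    return [
        db["DBInstanceIdentifier"]
        for db in describe_output.get("DBInstances", [])
        # A missing StorageEncrypted field is treated as unencrypted
        if not db.get("StorageEncrypted", False)
    ]


# Illustrative data in the shape returned by `describe-db-instances --output json`
sample_output = {
    "DBInstances": [
        {"DBInstanceIdentifier": "production-db", "StorageEncrypted": True},
        {"DBInstanceIdentifier": "legacy-reporting-db", "StorageEncrypted": False},
    ]
}

for identifier in unencrypted_instances(sample_output):
    print(f"FINDING: {identifier} has StorageEncrypted=false")
```

Treating a missing field as a finding is a deliberate choice here: for an audit it is safer to over-report than to skip an instance silently.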

Part 2: Article 32(1)(a) — Pseudonymisation and Encryption

2.1. How to Implement Pseudonymisation at the Database Layer

Pseudonymisation replaces direct identifiers — names, email addresses, passport numbers — with a pseudonym or code. The goal is that the main working dataset cannot identify a data subject without access to a separately stored, separately protected lookup table.

Here is the incorrect approach — direct identifiers in plaintext:

-- Bad: Direct identifiers stored in the main working table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    full_name VARCHAR(255),       -- Direct identifier — should not be here
    email VARCHAR(255),           -- Direct identifier — should not be here
    passport_number VARCHAR(50)   -- Direct identifier — should not be here
);

This approach means any engineer, analyst, or attacker with SELECT access to the users table can immediately read and identify individuals. There is no separation between working data and identifying data.

Here is the correct implementation with a separate identifiers table:

-- Good: Pseudonymised main table with a separate, restricted lookup table

-- Step 1: Main working table uses only the pseudonym
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    pseudonym UUID NOT NULL UNIQUE DEFAULT gen_random_uuid(),  -- Non-guessable pseudonym; UNIQUE is required by the foreign key in Step 2
    created_at TIMESTAMP DEFAULT NOW(),
    account_status VARCHAR(50)
    -- No direct identifiers here
);

-- Step 2: Identifier lookup table — kept separate, access restricted
CREATE TABLE user_identifiers (
    pseudonym UUID PRIMARY KEY,
    full_name VARCHAR(255),
    email VARCHAR(255),
    passport_number VARCHAR(50),
    FOREIGN KEY (pseudonym) REFERENCES users(pseudonym)
);

-- Step 3: Grant minimal, role-based access
GRANT SELECT ON users TO app_role;                              -- Application uses pseudonym only
GRANT SELECT, INSERT, UPDATE ON user_identifiers TO identity_service_role;  -- Only the identity service sees names

What each part does:

gen_random_uuid() creates a version-4 UUID pseudonym for each user — unpredictable and not reversible without the lookup table
The main users table is safe for analytics, reporting, and general application use without exposing any identifying information
Only the identity_service_role can join the two tables — this role is assigned only to the specific service that handles identity operations

The auditor question you will receive:

"How do you ensure that pseudonymised data cannot be re-identified by an unauthorised party?"

Your evidence:

-- Show that only the identity service role has access to the identifiers table
SELECT grantee, privilege_type, table_name
FROM information_schema.role_table_grants
WHERE table_name = 'user_identifiers';

-- Expected output: only identity_service_role listed

2.2. How to Implement Encryption at Rest with Customer-Managed Keys

Storage-layer encryption protects data if someone physically steals the disk. But it does not protect against a privileged AWS employee, a compromised cloud administrator, or an authorised user with direct database access. Article 32 auditors know this distinction — and they will ask about it.

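Here is the incorrect approach: relying on the default AWS managed key:

# Bad: encryption is on but no key was specified, so RDS uses the AWS managed key (alias aws/rds)
# KeyManager returns "AWS": you cannot edit the key policy, disable the key, or change its rotation schedule
aws kms describe-key \
  --key-id alias/aws/rds \
  --query 'KeyMetadata.KeyManager' \
  --output text

The problem: when the auditor asks "who controls access to the key that protects your customer data?", the answer is AWS. With an AWS managed key you cannot restrict the key policy, disable the key, or choose the rotation schedule. Note that aws kms create-key always creates a customer-managed key. AWS managed keys are created by the service itself the first time you enable encryption without naming a key.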

Here is the correct implementation — customer-managed key with automatic rotation:

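# Step 1: Create a customer-managed KMS key
KEY_ID=$(aws kms create-key \
  --origin AWS_KMS \
  --description "Customer-managed key for production PII, Article 32 compliant" \
  --tags TagKey=Purpose,TagValue=GDPR TagKey=Environment,TagValue=production \
  --query 'KeyMetadata.KeyId' \
  --output text)

echo "Created KMS key: $KEY_ID"

# Step 2: Enable automatic rotation every 90 days
# Without --rotation-period-in-days the default period is 365 days
aws kms enable-key-rotation \
  --key-id "$KEY_ID" \
  --rotation-period-in-days 90

# Step 3: Move your production RDS instance onto the new key
# The KMS key of an existing instance cannot be changed in place:
# snapshot it, copy the snapshot under the new key, then restore from the copy
aws rds create-db-snapshot \
  --db-instance-identifier production-db \
  --db-snapshot-identifier production-db-pre-cmk
aws rds wait db-snapshot-available \
  --db-snapshot-identifier production-db-pre-cmk

aws rds copy-db-snapshot \
  --source-db-snapshot-identifier production-db-pre-cmk \
  --target-db-snapshot-identifier production-db-cmk \
  --kms-key-id "$KEY_ID"
aws rds wait db-snapshot-available \
  --db-snapshot-identifier production-db-cmk

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier production-db-cmk \
  --db-snapshot-identifier production-db-cmk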

The auditor question:

"Show me that your encryption keys are rotated automatically and that you can prove who has accessed them."

Your evidence:

# Verify rotation is enabled — expected output: true
aws kms get-key-rotation-status --key-id $KEY_ID \
  --query 'KeyRotationEnabled'

# Show the CloudTrail audit trail of every key usage event
aws logs filter-log-events \
  --log-group-name cloudtrail-logs \
  --filter-pattern '{ $.eventSource = "kms.amazonaws.com" }' \
  --query 'events[*].{Time:timestamp,Event:message}' \
  --output table
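
Raw CloudTrail output is hard to hand to an auditor. The sketch below is one possible way to condense it, assuming each log message is a CloudTrail record in JSON form with the standard eventSource, eventName, and userIdentity.arn fields. The function name and the sample ARNs are illustrative.

```python
import json
from collections import Counter
from typing import Iterable


def summarise_key_usage(messages: Iterable[str]) -> Counter:
    """Count KMS events per (caller ARN, event name) from CloudTrail JSON records."""
    usage = Counter()
    for raw in messages:
        record = json.loads(raw)
        if record.get("eventSource") != "kms.amazonaws.com":
            continue  # Ignore events from other services
        caller = record.get("userIdentity", {}).get("arn", "unknown")
        usage[(caller, record.get("eventName", "unknown"))] += 1
    return usage


APP_ROLE = "arn:aws:iam::123456789012:role/app-role"  # Illustrative ARN
sample_messages = [
    json.dumps({"eventSource": "kms.amazonaws.com", "eventName": "Decrypt",
                "userIdentity": {"arn": APP_ROLE}}),
    json.dumps({"eventSource": "kms.amazonaws.com", "eventName": "Decrypt",
                "userIdentity": {"arn": APP_ROLE}}),
    json.dumps({"eventSource": "kms.amazonaws.com", "eventName": "GenerateDataKey",
                "userIdentity": {"arn": APP_ROLE}}),
    json.dumps({"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
                "userIdentity": {"arn": APP_ROLE}}),
]

for (caller, event), count in summarise_key_usage(sample_messages).most_common():
    print(f"{caller}  {event}  {count}")
```

A per-caller count makes unexpected principals stand out: any ARN in the summary that is not a known application role is worth investigating before the audit.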

2.3. How to Implement Application-Layer Encryption for Sensitive Fields

Storage encryption is the floor. Application-layer encryption is the ceiling that Article 32 auditors are increasingly expecting for health data, financial records, and other sensitive personal data.

Here is the difference: with storage encryption only, a database administrator who runs SELECT email FROM users sees the plaintext email address. With application-layer encryption, they see gAAAAABm... — an encrypted byte string that only the application (with access to the Vault key) can decrypt.

# application_encryption.py
from typing import Optional

from cryptography.fernet import Fernet

class FieldEncryption:
    """
    Encrypts sensitive personal data fields before they are stored in the database.
    The encryption key is stored in HashiCorp Vault or AWS Secrets Manager, never in code.
    A database administrator with direct SQL access sees only encrypted bytes.
    """

    def __init__(self, key: str):
        # key must be a URL-safe base64-encoded 32-byte Fernet key
        # (the format produced by Fernet.generate_key()); retrieve it from Vault
        self.cipher = Fernet(key.encode())

    def encrypt_field(self, plaintext: Optional[str]) -> Optional[str]:
        """Encrypt a sensitive field before writing to the database."""
        if not plaintext:
            return None  # Empty and missing values are stored as NULL
        encrypted_bytes = self.cipher.encrypt(plaintext.encode())
        return encrypted_bytes.decode()

    def decrypt_field(self, ciphertext: Optional[str]) -> Optional[str]:
        """
        Decrypt a field when legitimately needed by the application.
        This method requires the Vault key, which database admins do not hold.
        """
        if not ciphertext:
            return None
        decrypted_bytes = self.cipher.decrypt(ciphertext.encode())
        return decrypted_bytes.decode()


# Usage in your application:
from vault_client import get_secret  # Your Vault or Secrets Manager client

# Retrieve the encryption key at application startup — never hardcode it
encryption_key = get_secret("gdpr/field-encryption-key")
encryptor = FieldEncryption(encryption_key)

# Before storing a user's health record
user.health_data_encrypted = encryptor.encrypt_field(user.health_data_plaintext)

# Before reading for a legitimate purpose (subject access request, etc.)
health_data = encryptor.decrypt_field(user.health_data_encrypted)

The auditor question:

"If a database administrator queries the users table directly, can they read customer health data in plaintext?"

Your evidence: Run a direct database query and show the auditor the encrypted output. Then demonstrate that the decryption key is not accessible to database administrators — it is retrieved only by the application through Vault.

Part 3: Article 32(1)(b) — Confidentiality and Integrity

3.1. How to Implement Automatic Logoff

Article 32(1)(b) requires protection against "unauthorised access to personal data." A session that never expires — or expires after 24 hours — is an access control gap. A user who logs in on a shared machine and walks away has left an open door.

Here is the incorrect approach — a 24-hour JWT session:

// Bad: 24-hour access token with no inactivity check
const token = jwt.sign(
  { userId: user.id, role: user.role },
  process.env.JWT_SECRET,
  { expiresIn: '24h' }  // Too long — violates Article 32 intent
);

The problem: if a user logs in on a shared computer and closes the laptop without logging out, the session remains valid for up to 24 hours. Anyone who opens that laptop can access personal data.

Here is the correct implementation — a 15-minute access token with a rolling refresh:

// Good: Short-lived access token with rolling refresh via HTTP-only cookie

// Access token — valid for 15 minutes of activity
const accessToken = jwt.sign(
  { userId: user.id, role: user.role, type: 'access' },
  process.env.JWT_ACCESS_SECRET,
  { expiresIn: '15m' }
);

// Refresh token — valid for 8 hours total session duration
const refreshToken = jwt.sign(
  { userId: user.id, type: 'refresh' },
  process.env.JWT_REFRESH_SECRET,
  { expiresIn: '8h' }
);

// Set refresh token as HTTP-only cookie — not accessible to JavaScript
res.cookie('refreshToken', refreshToken, {
  httpOnly: true,    // Prevents XSS access
  secure: true,      // HTTPS only
  sameSite: 'strict', // Prevents CSRF
  maxAge: 8 * 60 * 60 * 1000  // 8 hours in milliseconds
});

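// Session middleware that enforces the idle and absolute timeouts
const MAX_IDLE_MS = 15 * 60 * 1000;               // 15 minutes without a request
const MAX_TOTAL_SESSION_MS = 8 * 60 * 60 * 1000;  // 8 hours since login

app.use((req, res, next) => {
  if (!req.session?.createdAt) return next();

  const now = Date.now();
  const sessionAge = now - req.session.createdAt;
  const idleFor = now - (req.session.lastActivity ?? req.session.createdAt);

  if (sessionAge > MAX_TOTAL_SESSION_MS || idleFor > MAX_IDLE_MS) {
    req.session.destroy();
    return res.status(401).json({
      error: 'Session expired. Please log in again.'
    });
  }

  req.session.lastActivity = now; // Any authenticated request counts as activity
  next();
});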

The auditor question:

"Show me that your application terminates inactive sessions after a reasonable period."

Your evidence: A browser developer tools screenshot showing the cookie expiration time, plus a test recording showing that after 15 minutes of inactivity the user is presented with a re-authentication prompt.
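
A token's expiry time measures time since it was issued, not time since the user last did something, so the inactivity limit and the absolute limit are two separate checks. The sketch below shows that decision on its own, assuming the server records when the session was created and when it last saw a request. The function name and the state labels are illustrative.

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(minutes=15)   # Inactivity limit from this section
ABSOLUTE_TIMEOUT = timedelta(hours=8)  # Total session limit from this section


def session_state(created_at: datetime, last_activity: datetime, now: datetime) -> str:
    """Classify a session as active, idle-expired, or absolutely expired."""
    if now - created_at > ABSOLUTE_TIMEOUT:
        return "expired_absolute"
    if now - last_activity > IDLE_TIMEOUT:
        return "expired_idle"
    return "active"


login = datetime(2026, 1, 1, 9, 0)
# Last request at 09:10, checked at 09:30: 20 minutes idle
print(session_state(login, login + timedelta(minutes=10), login + timedelta(minutes=30)))
```

Checking the absolute limit first means a session that is both old and idle reports the stricter reason, which keeps the audit log unambiguous.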

3.2. How to Implement Unique User Identification with IRSA

Article 32(1)(b) requires that you can identify who accessed personal data. Shared service accounts make this impossible — the audit log shows data-export-service but you cannot tell which engineer triggered the export.

Here is the incorrect approach — a shared service account:

# Bad: One shared Kubernetes service account used by multiple engineers and pipelines
apiVersion: v1
kind: ServiceAccount
metadata:
  name: data-export           # Three engineers and two pipelines share this identity
  namespace: production

When an audit log shows data-export performed a bulk user export at 03:17 UTC, you cannot answer the auditor's question: "who authorised this?"

Here is the correct implementation — IAM Roles for Service Accounts (IRSA):

# Step 1: Create a separate IAM role for each service identity
# This command creates a role that can only be assumed by the 'payment-service'
# Kubernetes service account in the 'production' namespace

aws iam create-role \
  --role-name eks-payment-service-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/YOUR_OIDC_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/YOUR_OIDC_ID:sub":
            "system:serviceaccount:production:payment-service"
        }
      }
    }]
  }'

# Step 2: Annotate the Kubernetes service account with its unique IAM role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service          # One service account, one service, one role
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/eks-payment-service-role

Every AWS API call from payment-service now appears in CloudTrail as eks-payment-service-role — a unique, traceable identity. No shared accounts. No ambiguous audit logs.

The auditor question:

"How do you ensure that every action on personal data can be attributed to a specific individual or service?"

Your evidence:

# Verify no shared service accounts exist — every account should have a unique role annotation
kubectl get serviceaccounts --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}{end}'
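
The jsonpath output still has to be read by eye. As a rough sketch, the same data can be checked automatically, assuming you export the accounts with kubectl get serviceaccounts --all-namespaces -o json. The function name and the sample accounts are hypothetical. It flags accounts with no role annotation and accounts that share a role with another account.

```python
from collections import defaultdict
from typing import Dict, List

ROLE_ANNOTATION = "eks.amazonaws.com/role-arn"


def audit_service_accounts(sa_list: Dict) -> Dict[str, List[str]]:
    """Find service accounts with no IAM role annotation or with a shared role."""
    missing: List[str] = []
    by_role = defaultdict(list)
    for item in sa_list.get("items", []):
        meta = item["metadata"]
        name = f'{meta["namespace"]}/{meta["name"]}'
        role = (meta.get("annotations") or {}).get(ROLE_ANNOTATION)
        if role is None:
            missing.append(name)
        else:
            by_role[role].append(name)
    shared = sorted(n for names in by_role.values() if len(names) > 1 for n in names)
    return {"missing_role": sorted(missing), "shared_role": shared}


ROLE = "arn:aws:iam::123456789012:role/eks-payment-service-role"
sample = {"items": [
    {"metadata": {"namespace": "production", "name": "payment-service",
                  "annotations": {ROLE_ANNOTATION: ROLE}}},
    {"metadata": {"namespace": "production", "name": "data-export",
                  "annotations": {ROLE_ANNOTATION: ROLE}}},
    {"metadata": {"namespace": "production", "name": "default"}},
]}

print(audit_service_accounts(sample))
```

Accounts that never call AWS, such as each namespace's default account, will appear under missing_role. Whether that is a finding depends on what runs under them.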

Part 4: Article 32(1)(c) — Availability and Resilience

4.1. How to Implement Multi-AZ and Backup Requirements

Article 32(1)(c) requires "the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident." This is a legal requirement, not a suggestion. If your database runs in a single Availability Zone and that AZ has a networking event, personal data is unavailable until the AZ recovers, and you have nothing to show the auditor that you could have restored access in a timely manner.

Here is the incorrect approach — single-AZ RDS with no automated backups:

# Bad: Single-AZ RDS — one networking event makes personal data unavailable
resource "aws_db_instance" "production" {
  identifier              = "production-database"
  multi_az                = false   # No automatic failover
  backup_retention_period = 0       # No automated backups — Article 32 violation
}

If the Availability Zone has a networking issue, the database is unreachable. If the instance is corrupted, there are no backups to restore. Both scenarios violate Article 32(1)(c).

Here is the correct implementation — Multi-AZ with tested automated backups:

# Good: Multi-AZ RDS with 30-day backup retention
resource "aws_db_instance" "production" {
  identifier = "production-database"

  # Multi-AZ creates a synchronous standby replica in a different AZ
  # Automatic failover completes in 60-120 seconds with no data loss
  multi_az = true

  # 30-day backup retention — gives you recovery point flexibility
  backup_retention_period = 30
  backup_window           = "03:00-04:00"  # Low-traffic window for backup

  # Copy all tags to snapshots for compliance tracking
  copy_tags_to_snapshot = true

  # Performance Insights for monitoring query health
  performance_insights_enabled          = true
  performance_insights_retention_period = 7

  tags = {
    Environment       = "production"
    DataClassification = "personal-data"
    GDPRScope         = "article32"
  }
}

How to test your RTO and RPO monthly:

# Step 1: Find your most recent automated snapshot
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier production-database \
  --snapshot-type automated \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

echo "Testing restore of snapshot: $SNAPSHOT_ID"

# Step 2: Start the restore — measure the time
START_TIME=$(date +%s)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier gdpr-restore-test \
  --db-snapshot-identifier $SNAPSHOT_ID \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible \
  --tags Key=Purpose,Value=gdpr-rto-test Key=DeleteAfter,Value=$(date -d '+1 day' +%Y-%m-%d)

# Step 3: Wait for restore to complete
aws rds wait db-instance-available \
  --db-instance-identifier gdpr-restore-test

END_TIME=$(date +%s)
RTO_SECONDS=$((END_TIME - START_TIME))
echo "Restore completed in $((RTO_SECONDS / 60)) minutes"

# Step 4: Verify data integrity with a spot check
# Connect to the restored instance and verify record counts match production
# psql -h RESTORED_ENDPOINT -U admin -d production \
#   -c "SELECT COUNT(*) FROM users; SELECT MAX(created_at) FROM orders;"

# Step 5: Delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier gdpr-restore-test \
  --skip-final-snapshot

The auditor question:

"What is your Recovery Time Objective and Recovery Point Objective for personal data? When did you last test it?"

Your evidence: A documented monthly DR test log showing: snapshot used, restore start time, restore completion time, data verification query results, and the engineer who conducted the test.
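
The numbers in that log can be derived from three timestamps. The sketch below is a minimal example. The targets of 60 minutes and 24 hours are placeholders rather than recommendations, and the function name is illustrative. It treats the age of the snapshot at the start of the restore as the worst-case data loss for a snapshot-only restore; point-in-time recovery, where enabled, would give a smaller figure.

```python
from datetime import datetime, timezone
from typing import Dict


def dr_test_record(snapshot_time: datetime, restore_start: datetime,
                   restore_end: datetime, rto_target_min: float,
                   rpo_target_min: float) -> Dict[str, object]:
    """Compute measured RTO/RPO in minutes and compare them with the targets."""
    rto_minutes = (restore_end - restore_start).total_seconds() / 60
    rpo_minutes = (restore_start - snapshot_time).total_seconds() / 60
    return {
        "rto_minutes": rto_minutes,
        "rpo_minutes": rpo_minutes,
        "rto_met": rto_minutes <= rto_target_min,
        "rpo_met": rpo_minutes <= rpo_target_min,
    }


record = dr_test_record(
    snapshot_time=datetime(2026, 5, 1, 3, 0, tzinfo=timezone.utc),
    restore_start=datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc),
    restore_end=datetime(2026, 5, 1, 9, 42, tzinfo=timezone.utc),
    rto_target_min=60,        # Placeholder objective
    rpo_target_min=24 * 60,   # Placeholder objective
)
print(record)
```

Storing the computed record alongside the snapshot identifier and the engineer's name gives the auditor one line per monthly test.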

Part 5: Article 32(1)(d) — Regular Testing

5.1. How to Implement Automated Vulnerability Scanning

Article 32(1)(d) requires "a process for regularly testing, assessing and evaluating the effectiveness of technical and organisational measures." This includes automated vulnerability scanning of every container image before it reaches production.

Here is the incorrect approach — no scanning in the deployment pipeline:

# Bad: No vulnerability scanning — a critical CVE in the base image deploys undetected
name: Deploy
on: [push]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: docker build -t myapp .
      - run: docker push myapp  # Deploys without any security check

If a critical CVE is present in the base image (such as a remote code execution vulnerability in OpenSSL), it goes straight to production. Under Article 32(1)(d), this is a finding.

Here is the correct implementation — Trivy scanning with pipeline enforcement:

# Good: Trivy scans every image; CRITICAL/HIGH CVEs block the deployment
name: Security Scan and Deploy
on: [push, pull_request]

jobs:
  trivy-scan:
    name: Container Vulnerability Scan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write   # Required for upload-sarif to write to the Security tab
    steps:
      - uses: actions/checkout@v4

      - name: Build container image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Scan for vulnerabilities with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: 'myapp:${{ github.sha }}'
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'         # Fail the pipeline: image cannot deploy with CRITICAL/HIGH CVEs

      - name: Upload scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v3
        if: always()             # Upload results even if scan failed, for review
        with:
          sarif_file: 'trivy-results.sarif'

Trivy scans for:

CVEs in the base image OS packages (for example, a critical OpenSSL vulnerability in your Ubuntu base)
Vulnerable versions of application dependencies (a known exploit in an npm or pip package your application uses)
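Misconfigurations in the Dockerfile (running as root, using latest tag instead of a pinned SHA), when you also run Trivy's misconfiguration scanner (trivy config) against the repository; the image scan above covers OS packages and application dependencies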

Results appear in the GitHub Security tab, creating a timestamped, searchable history of every scan. That history is your Article 32(1)(d) evidence.

How to run a weekly Amazon Inspector assessment for running workloads:

# List all active CRITICAL findings across your AWS account
aws inspector2 list-findings \
  --filter-criteria '{
    "severity": [{"comparison": "EQUALS", "value": "CRITICAL"}],
    "findingStatus": [{"comparison": "EQUALS", "value": "ACTIVE"}]
  }' \
  --query 'findings[*].{
    Title:title,
    Resource:resources[0].id,
    Severity:severity,
    CVE:packageVulnerabilityDetails.vulnerabilityId
  }' \
  --output table

The auditor question:

"Show me your vulnerability management programme, including how you prioritise and remediate findings."

Your evidence: A weekly vulnerability report — generated automatically from the above command — showing active findings, severity, the GitHub issue created for each finding, and the closure date once remediated.
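
One way to turn the findings into that weekly report is sketched below. It assumes the findings were exported with --output json and that each finding carries the severity and packageVulnerabilityDetails.vulnerabilityId fields queried above. The function name and the sample identifiers are placeholders, not real CVEs.

```python
from collections import Counter
from typing import Dict, List


def weekly_summary(findings: List[Dict]) -> Dict[str, object]:
    """Summarise findings by severity and list the distinct vulnerability IDs."""
    by_severity = Counter(f.get("severity", "UNTRIAGED") for f in findings)
    vulnerability_ids = sorted({
        f["packageVulnerabilityDetails"]["vulnerabilityId"]
        for f in findings
        if "packageVulnerabilityDetails" in f
    })
    return {
        "total": len(findings),
        "by_severity": dict(by_severity),
        "vulnerability_ids": vulnerability_ids,
    }


sample_findings = [
    {"severity": "CRITICAL",
     "packageVulnerabilityDetails": {"vulnerabilityId": "CVE-EXAMPLE-0001"}},
    {"severity": "CRITICAL",
     "packageVulnerabilityDetails": {"vulnerabilityId": "CVE-EXAMPLE-0002"}},
    {"severity": "HIGH",
     "packageVulnerabilityDetails": {"vulnerabilityId": "CVE-EXAMPLE-0001"}},
]

print(weekly_summary(sample_findings))
```

Deduplicating the vulnerability IDs matters because one vulnerable package often appears on many resources, and the remediation ticket is per vulnerability, not per resource.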

Part 6: Article 32(1)(d) — Penetration Testing

6.1. Why Automated Scanning Is Not Enough

Article 32(1)(d) requires evaluating the effectiveness of security measures. Automated vulnerability scanners find known CVEs in libraries and OS packages. They cannot find:

Business logic vulnerabilities (an API endpoint that returns another user's data when given a specific parameter)
Authentication bypasses (a JWT implementation that accepts unsigned tokens)
Privilege escalation paths (an attacker can move from a low-privilege role to admin through a sequence of legitimate API calls)
Insecure direct object references (accessing /api/users/124 instead of /api/users/123 returns data for a different customer)

The ICO (UK Information Commissioner's Office) and the CNIL (France's data protection authority) both state in their guidance that annual manual penetration testing is expected for organisations processing significant volumes of personal data.

What an acceptable pen test scope looks like:

# Annual Penetration Test Scope — Article 32 Compliance

## Testing Period
Start: 2025-04-01  
End: 2025-04-14  
Testing firm: [Accredited firm — CREST or CHECK certified]

## In Scope
- Production web application: https://app.yourcompany.com
- Production API: https://api.yourcompany.com/v1/*
- Authentication flows: OAuth2, JWT, session management
- Data stores: PostgreSQL (via application access only, not direct DB access)
- AWS account: External reconnaissance of public-facing services only

## Testing Types
- External infrastructure testing (all public IP ranges)
- Web application testing (OWASP Top 10 2021)
- API security testing (all authenticated and unauthenticated endpoints)
- Authentication and session management testing
- GDPR-specific test cases (data subject rights endpoints, consent flows)

## Remediation SLAs
- CRITICAL: 24 hours from report delivery
- HIGH: 7 calendar days
- MEDIUM: 30 calendar days
- LOW: 90 calendar days

How to track and evidence remediation:

# Create a GitHub issue for each finding on receipt of the pen test report
# This creates a traceable record of every finding and its resolution
# Assumed input format: one "finding_id,severity,sla" row per line,
# for example "4.2.1,HIGH,7 days"
# Replace security-lead with the GitHub login of the person responsible

while IFS=, read -r finding_id severity sla; do
  gh issue create \
    --title "Pen test finding: $finding_id" \
    --body "See pentest-report-2025-04.pdf, section $finding_id. Severity: $severity. SLA: $sla." \
    --label "security,pentest" \
    --assignee "security-lead"
done < pentest-report-findings.txt

The auditor question:

"When was your last penetration test? Show me the report and your remediation evidence."

Your evidence:

The penetration test report from a CREST or CHECK certified firm, dated within the last 12 months
A remediation tracker (GitHub issues or Jira) showing every CRITICAL and HIGH finding with a closure date
Evidence that all CRITICAL findings were closed within 24 hours (the git commit or deployment log)
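
The remediation SLAs from the scope document can be checked mechanically. The sketch below is a minimal example that assumes you know when the report was delivered and when each finding was closed. The function names are illustrative, and the SLA values are the ones listed in the scope above.

```python
from datetime import datetime, timedelta

# Remediation SLAs from the pen test scope, measured from report delivery
REMEDIATION_SLA = {
    "CRITICAL": timedelta(hours=24),
    "HIGH": timedelta(days=7),
    "MEDIUM": timedelta(days=30),
    "LOW": timedelta(days=90),
}


def due_date(severity: str, reported_at: datetime) -> datetime:
    """Return the latest acceptable closure time for a finding."""
    return reported_at + REMEDIATION_SLA[severity]


def closed_within_sla(severity: str, reported_at: datetime, closed_at: datetime) -> bool:
    """Return True when the finding was closed on or before its due date."""
    return closed_at <= due_date(severity, reported_at)


delivered = datetime(2025, 4, 15, 9, 0)
print(due_date("CRITICAL", delivered))
print(closed_within_sla("HIGH", delivered, datetime(2025, 4, 25, 9, 0)))
```

Running this over the remediation tracker before the audit shows which findings missed their SLA, so you can document the reason rather than have the auditor discover it.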

Best Practices Summary

Here are the key takeaways from this guide:

✅ Do: Implement application-layer encryption for sensitive fields. Storage encryption alone is not enough — a DBA with direct database access can still read plaintext.

✅ Do: Use customer-managed KMS keys with automatic rotation. You need to prove control over the key material.

✅ Do: Store pseudonymised data separately from identifiers, with restricted role-based access to the lookup table.

✅ Do: Enforce automatic logoff after 15 minutes of inactivity with an 8-hour absolute session limit.

✅ Do: Use unique service accounts with IRSA. Every action on personal data must be attributable to a specific identity.

✅ Do: Test your backups monthly. Document RTO and RPO with actual restore test results.

✅ Do: Run Trivy in CI to block CRITICAL and HIGH CVEs before deployment.

✅ Do: Conduct an annual manual penetration test from a CREST or CHECK certified firm.

❌ Don't: Use 24-hour JWT sessions or sessions with no inactivity timeout.

❌ Don't: Store secrets in environment variables, .env files, or hardcoded in source code.

❌ Don't: Skip the annual penetration test. An auditor from the ICO or CNIL will not accept "we run automated scans" as a substitute.

❌ Don't: Use AWS-managed KMS keys if you need to prove key material control to your auditor.

Resources

ICO Guide to GDPR Article 32 — The UK Information Commissioner's Office official guidance on Article 32 security obligations
ENISA Guidelines on Article 32 — The EU Agency for Cybersecurity's SME guidelines on personal data security
Trivy by Aqua Security — Open-source container vulnerability scanner used in Part 5
OWASP Top 10 2021 — The standard reference for web application security risks, used in pen test scoping
AWS KMS Key Rotation Documentation — Official AWS documentation for automatic key rotation
PostgreSQL Row Security Policies — How to implement row-level security for granular access control on pseudonymised data
EKS IAM Roles for Service Accounts (IRSA) — Official AWS documentation for unique service account identity on EKS
CREST Certified Testing Firms — Directory of CREST-certified penetration testing firms for your annual Article 32 assessment

Ayobami Adejumo is a senior platform engineer and compliance infrastructure specialist. He writes about GDPR engineering controls, SOC2 implementation, and FinOps (cloud cost optimization).

The Complete SOC 2 Type II Implementation Handbook for Engineers: A Month-by-Month Roadmap with Real Commands

Ayobami Adejumo — Tue, 05 May 2026 18:26:21 +0000

If your team is preparing for a SOC 2 Type II review, this handbook is for you. It's a self-contained guide to the exact 90-day timeline, 14 critical controls, and evidence collection infrastructure that auditors actually check.

Everyone publishes the controls list. But nobody publishes the week-by-week engineering calendar you'll need to follow to make sure your ducks are in a row.

Here is the exact 90-day timeline — including the mistakes that add 60 days (and how to avoid them).

What You'll Learn
Prerequisites
Weeks 1–2: The Scope Decision
Weeks 3–6: The 14 Controls That Must Be Active on Day 1
Weeks 7–10: The Evidence Collection Infrastructure
Weeks 11–14: Auditor Selection and Readiness Assessment
Weeks 15–18: The Observation Period
The 90-Day SOC2 Timeline at a Glance
What's Next
Resources

What You'll Learn

By the end of this guide, you'll know:

How to scope your SOC2 boundary correctly — the decision that determines everything else
The 14 controls that must be active on day 1 of your observation period
How to build evidence collection infrastructure that runs automatically
How to choose an auditor and run a readiness assessment
What happens during the observation period and how to close gaps without restarting the clock

Let's dive in.

Prerequisites

Before following along, you should have:

Knowledge:

Basic understanding of AWS services (EC2, RDS, S3, IAM, VPC)
Familiarity with Terraform or another infrastructure as code tool
Comfort reading GitHub Actions YAML workflows
A general understanding of what SOC2 is — if you are starting from scratch, read the AICPA's SOC2 overview first

Tools and access:

An AWS account with administrator access
A GitHub organisation with admin rights
Terraform installed (v1.0 or later)
Python 3.8 or later (for the evidence collector Lambda)
A compliance automation platform — Vanta or Drata — connected to your AWS account and GitHub organisation

Estimated time: 90 days end-to-end, with active engineering work of approximately 8–12 hours per week in the first six weeks, tapering to 2–4 hours per week during the observation period.

Weeks 1–2: The Scope Decision — What Is In and Out of Your SOC2 Boundary

What Most Teams Get Wrong

Most teams scope their SOC2 boundary too broadly. They include every AWS account, every service, every environment. This is a mistake — and here is exactly why.

A broader scope means more controls to implement, more evidence to collect, and more systems the auditor will examine.

Every system inside your boundary must satisfy all 14 controls. Including your development sandbox means your engineers' experimental environments must have GuardDuty enabled, CloudTrail logging, and branch-protected deployments. That adds weeks of work and months of evidence collection for systems that pose no risk to your customers.

A correctly bounded scope means you include only the systems that store, process, or transmit customer data — and you prove that everything else cannot reach those systems.

Bad scope (over-inclusive):

Entire AWS Organization
├── Production (in scope)
├── Staging (in scope)
├── Development (in scope)
├── Sandbox (in scope)
└── CI/CD (in scope)

Good scope (correctly bounded):

SOC2 Boundary
├── Production AWS Account (in scope)
├── Production EKS Cluster (in scope)
├── Production RDS (in scope)
└── Everything else (OUT of scope — proven by network segmentation)

The correctly bounded scope works because it draws the tightest defensible line around the systems that actually handle customer data. Everything outside that line is excluded — not by assumption, but by technical controls that prevent those systems from reaching anything inside the boundary.

The Scope Decision Framework

For every system in your infrastructure, ask these four questions:

Question	If YES	If NO
Does this system store, process, or transmit customer data?	✅ In scope	❌ Out of scope
Does this system affect the availability of customer-facing services?	✅ In scope	❌ Out of scope
Does this system have access to production credentials?	✅ In scope	❌ Out of scope
Can a compromise of this system lead to a customer data breach?	✅ In scope	❌ Out of scope

Any system where the answer to even one question is yes belongs inside your boundary.
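
The four questions can also be run as a small check against a system inventory. The following is a minimal Python sketch; the `System` dataclass, its field names, and the sample inventory are hypothetical and only illustrate the "one yes means in scope" rule.

```python
from dataclasses import dataclass


@dataclass
class System:
    """One entry in a hypothetical infrastructure inventory."""
    name: str
    handles_customer_data: bool        # stores, processes, or transmits customer data
    affects_availability: bool         # affects availability of customer-facing services
    has_production_credentials: bool   # has access to production credentials
    compromise_can_breach_data: bool   # a compromise could lead to a customer data breach


def in_scope(system: System) -> bool:
    """A single 'yes' to any of the four questions puts the system in scope."""
    return any([
        system.handles_customer_data,
        system.affects_availability,
        system.has_production_credentials,
        system.compromise_can_breach_data,
    ])


# Example inventory (made-up systems and answers)
inventory = [
    System("production-rds", True, True, True, True),
    System("bastion-host", False, False, True, True),
    System("dev-sandbox", False, False, False, False),
]

for system in inventory:
    label = "IN scope" if in_scope(system) else "OUT of scope"
    print(f"{system.name}: {label}")
```

Keeping the answers in a reviewed file gives you a written rationale for each line on the boundary diagram.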

Network Segmentation — The Technical Proof That Your Boundary Holds

Network segmentation is the practice of dividing your infrastructure into isolated zones so that systems in one zone can't communicate with systems in another unless you explicitly allow it.

In the context of SOC2, it's the technical control that proves your out-of-scope systems genuinely can't reach your in-scope systems — not just by policy, but by infrastructure enforcement.

Without network segmentation, the SOC2 auditor can't trust that your boundary is real. A developer in your sandbox environment who can query your production database means the sandbox is effectively in scope, regardless of what your diagram says.

Here's the Terraform that implements network segmentation between your production and non-production environments. The network access control list (NACL) blocks all inbound traffic from the broader private IP range (10.0.0.0/8) into your in-scope production VPC, while the comment at the top documents the deliberate decision not to create any aws_vpc_peering_connection between environments:

# This account has NO VPC peering to non-production environments.
# The absence of peering is itself the segmentation control.
# Do NOT add peering connections to this account without SOC2 scope review.

resource "aws_network_acl" "deny_non_production" {
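  vpc_id = aws_vpc.production.id
  # Note: attach this NACL to your production subnets with subnet_ids;
  # an unassociated NACL filters nothing.

  # Block all inbound traffic from non-production IP ranges.
  # Rules are evaluated in ascending rule_no order, so if the production VPC's
  # own CIDR sits inside 10.0.0.0/8, add a lower-numbered allow rule for it.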
  ingress {
    rule_no    = 100
    action     = "deny"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "10.0.0.0/8"
  }

  # Allow legitimate inbound traffic (HTTPS from internet)
  ingress {
    rule_no    = 200
    action     = "allow"
    from_port  = 443
    to_port    = 443
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
  }

  # Allow all outbound (tighten this per your architecture)
  egress {
    rule_no    = 100
    action     = "allow"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "0.0.0.0/0"
  }

  tags = {
    Name        = "production-nacl"
    Environment = "production"
    Purpose     = "SOC2 network segmentation"
  }
}

Verify the segmentation with this command after applying the Terraform:

# Confirm no VPC peering connections exist from production to non-production
aws ec2 describe-vpc-peering-connections \
  --filters Name=status-code,Values=active \
  --query 'VpcPeeringConnections[*].{ID:VpcPeeringConnectionId,Requester:RequesterVpcInfo.VpcId,Accepter:AccepterVpcInfo.VpcId}' \
  --output table

The Deliverable: Your SOC2 Boundary Diagram

At the end of weeks 1–2, you need a boundary diagram — a visual document that shows every in-scope system, every out-of-scope system, and the segmentation controls between them.

The diagram should show every AWS service, every data flow arrow, and a label on the segmentation control. It becomes your primary scope evidence and is typically the first thing an auditor asks for.

Weeks 3–6: The 14 Controls That Must Be Active on Day 1

These 14 controls must be implemented and actively collecting evidence from day 1 of your observation period. If you add any of them late, the observation period clock for that control restarts from the implementation date — not from day 1 of the audit period.

Think of the observation period as a surveillance camera recording your infrastructure. The auditor watches the footage later. If the camera was not on when a specific event occurred, that event has no record — and the SOC2 control for it has a gap.
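
To make the restart rule concrete, here is a small Python sketch that computes how many days of the observation period a control has no evidence for. The dates and control names are made-up examples, and the function names are hypothetical.

```python
from datetime import date


def covered_days(observation_start: date, observation_end: date, control_active_from: date) -> int:
    """Days of the observation period during which the control was collecting evidence."""
    effective_start = max(observation_start, control_active_from)
    if effective_start > observation_end:
        return 0
    return (observation_end - effective_start).days + 1


def coverage_gap_days(observation_start: date, observation_end: date, control_active_from: date) -> int:
    """Days at the start of the period with no evidence for this control."""
    total = (observation_end - observation_start).days + 1
    return total - covered_days(observation_start, observation_end, control_active_from)


start, end = date(2026, 1, 1), date(2026, 3, 31)  # example 90-day period
controls = {
    "cloudtrail": date(2025, 12, 15),  # active before day 1: no gap
    "guardduty": date(2026, 1, 22),    # enabled three weeks late
}

for name, active_from in controls.items():
    print(name, "gap days:", coverage_gap_days(start, end, active_from))
```

Any control with a non-zero gap is one the auditor can't see footage for during those days.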

Control 1: MFA Enforcement (CC6.6)

Multi-Factor Authentication (MFA) requires a user to verify their identity using two independent factors — something they know (a password) and something they have (a phone or hardware key). Without MFA, a stolen password is sufficient to access your production systems.

SOC2 CC6.6 requires that access to systems is restricted to authorized users. MFA is the technical control that makes "authorized" meaningful. Without it, any password compromise is a production access event.

To implement MFA, you can use AWS IAM Identity Center (formerly SSO) connected to your identity provider (Okta, Google Workspace, or Azure AD). MFA is then enforced at the identity provider level — any user without MFA enrolled can't authenticate, regardless of which AWS service they're trying to reach.

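# IAM Identity Center configuration. MFA itself is enforced at the IdP level;
# the resource below only passes the user's email to AWS as an access control attribute.
# No IAM user has direct console or CLI access.
# All access goes through SSO sessions (8-hour expiry by default).

# Looks up the IAM Identity Center instance referenced below
data "aws_ssoadmin_instances" "this" {}

resource "aws_ssoadmin_instance_access_control_attributes" "mfa" {
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]

  attribute {
    key = "email"
    value {
      source = ["$${path:email}"]
    }
  }
}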

You can verify that no IAM users retain direct console access (which would bypass MFA):

# Any user listed here has direct console access bypassing SSO — investigate immediately
aws iam list-users \
  --query 'Users[?PasswordLastUsed!=`null`].[UserName,PasswordLastUsed]' \
  --output table
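
The query above only lists users who have actually signed in with a password. A user with a console password that has never been used would not appear. As a complementary check, the IAM credential report has a `password_enabled` column. The sketch below parses a report that has already been downloaded and decoded to CSV text; the sample rows are invented and trimmed to the columns used here.

```python
import csv
import io
from typing import Dict, List


def console_users(credential_report_csv: str) -> List[Dict[str, object]]:
    """Return IAM users that have a console password, with their MFA status."""
    rows = csv.DictReader(io.StringIO(credential_report_csv))
    return [
        {"user": row["user"], "mfa_active": row["mfa_active"] == "true"}
        for row in rows
        if row["password_enabled"] == "true"
    ]


# Trimmed, invented sample; a real report has many more columns.
sample_report = (
    "user,arn,password_enabled,mfa_active\n"
    "<root_account>,arn:aws:iam::111122223333:root,not_supported,true\n"
    "legacy-admin,arn:aws:iam::111122223333:user/legacy-admin,true,false\n"
    "ci-bot,arn:aws:iam::111122223333:user/ci-bot,false,false\n"
)

for user in console_users(sample_report):
    print(user)
```

Every user this returns can reach the console without going through SSO, so the expected result for a compliant account is an empty list.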

Control 2: Infrastructure as Code (CC8.1)

Infrastructure as Code (IaC) means defining your cloud infrastructure in version-controlled code files (Terraform, Pulumi, or AWS CDK) rather than creating resources manually through the AWS console. Every infrastructure change is proposed in a pull request, reviewed by a colleague, and applied through an automated pipeline.

SOC2 CC8.1 covers change management — the requirement that every change to your production environment is documented, reviewed, and approved. Manual console changes produce no audit trail. If an engineer opens the AWS console and creates a security group without going through Terraform, that change is invisible to your SOC2 auditor. IaC makes every change reviewable and traceable.

Now let's see how to implement IaC here. This GitHub Actions workflow applies Terraform only from the main branch, after a pull request has been reviewed and approved. The workflow creates an immutable record of every infrastructure change:

# .github/workflows/terraform-apply.yml
name: Terraform Apply (Production)
on:
  push:
    branches: [main]
    paths: ['terraform/**']

permissions:
  id-token: write   # Required for AWS OIDC authentication
  contents: read

jobs:
  apply:
    name: Apply Infrastructure Changes
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval for production

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/terraform-apply
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: "1.6.0"

      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -out=tfplan -input=false

      - name: Terraform Apply
        run: terraform apply -input=false tfplan

SOC2 evidence this produces: A GitHub Actions run log for every infrastructure change, showing who triggered it (the pull request author), when it was applied, and what changed.

Control 3: CloudTrail Enabled (CC7.1)

AWS CloudTrail is a service that records every API call made in your AWS account — who called it, when, from which IP address, and whether it succeeded. Think of it as the complete audit log of everything that has ever happened in your AWS environment.

SOC2 CC7.1 requires monitoring for security events. CloudTrail is the foundational logging layer — without it, you can't detect unauthorized access, investigate incidents, or prove to an auditor that your controls were operating as intended. An auditor who can't see historical AWS API activity can't verify that your access controls were enforced during the observation period.

To implement it, you'll want to enable multi-region CloudTrail so that activity in every AWS region is captured, including global services like IAM. You can ship logs to an S3 bucket with Object Lock enabled (the evidence bucket section later in this guide covers this) so logs can't be modified or deleted:

# Enable CloudTrail with log file validation and multi-region coverage
aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name your-cloudtrail-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events

# Start the trail (creation alone does not start logging)
aws cloudtrail start-logging --name production-audit-trail

# Verify the trail is active and logging
aws cloudtrail get-trail-status --name production-audit-trail \
  --query '{IsLogging:IsLogging,LatestDeliveryTime:LatestDeliveryTime}'
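
A trail that is "logging" but has not delivered a file recently still leaves a gap. The sketch below evaluates a status dictionary shaped like the `get-trail-status` response (as boto3 returns it, with `LatestDeliveryTime` as a timezone-aware datetime). The 24-hour freshness threshold and the function name are assumptions for illustration, not AWS guidance.

```python
from datetime import datetime, timedelta, timezone


def trail_is_healthy(status: dict, now: datetime,
                     max_delivery_age: timedelta = timedelta(hours=24)) -> bool:
    """True if the trail is logging and delivered a log file recently."""
    if not status.get("IsLogging"):
        return False
    latest = status.get("LatestDeliveryTime")
    if latest is None:
        return False  # logging was started but nothing has been delivered yet
    return now - latest <= max_delivery_age


# Example with fixed timestamps so the result is reproducible
now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
status = {"IsLogging": True, "LatestDeliveryTime": now - timedelta(minutes=20)}
print("healthy" if trail_is_healthy(status, now) else "needs attention")
```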

Control 4: GuardDuty Enabled (CC7.2)

AWS GuardDuty is a threat detection service that analyses your CloudTrail logs, VPC Flow Logs, and DNS logs. It uses machine learning to identify suspicious behaviour — things like an EC2 instance communicating with a known malware server, an IAM user logging in from an unusual country, or unusual API call patterns that indicate credential theft.

SOC2 CC7.2 requires the use of detection tools to identify potential security events. GuardDuty is the monitoring layer that tells you when something anomalous is happening, not just what happened after the fact. Without it, you would only discover a compromise when the damage is done.

Here's the implementation:

# Enable GuardDuty — findings published every 15 minutes for active threats
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Verify GuardDuty is active
aws guardduty list-detectors --query 'DetectorIds' --output table

You can set up an EventBridge rule to route CRITICAL and HIGH severity GuardDuty findings to your incident response channel immediately. A finding sitting unreviewed for 90 days is a qualified SOC2 finding.
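
As a rough sketch of that routing rule, the event pattern below matches GuardDuty findings whose numeric severity is 7.0 or higher, which covers the HIGH and CRITICAL bands. The pattern can be supplied to `aws events put-rule --event-pattern`. The `should_page` helper is hypothetical and exists only so the routing logic can be unit-tested locally.

```python
import json

# GuardDuty publishes numeric severities; 7.0 and above covers HIGH and CRITICAL.
HIGH_SEVERITY_THRESHOLD = 7

guardduty_high_severity_pattern = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", HIGH_SEVERITY_THRESHOLD]}]},
}


def should_page(event: dict) -> bool:
    """Local mirror of the pattern above, for testing alert routing."""
    return (
        event.get("source") == "aws.guardduty"
        and event.get("detail-type") == "GuardDuty Finding"
        and event.get("detail", {}).get("severity", 0) >= HIGH_SEVERITY_THRESHOLD
    )


# Print the pattern as JSON, ready to paste into a rule definition
print(json.dumps(guardduty_high_severity_pattern, indent=2))
```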

Control 5: VPC Flow Logs (CC6.1)

VPC Flow Logs capture information about the IP traffic flowing through your Virtual Private Cloud — every accepted and rejected connection, including source IP, destination IP, port, protocol, and whether the traffic was allowed or denied. They are the network-level audit trail that CloudTrail doesn't provide.

SOC2 CC6.1 requires logical access controls and monitoring. VPC Flow Logs let you verify that your network segmentation is actually working (traffic you denied is showing as rejected in the logs), detect unexpected communication between services, and investigate security events at the network layer.

# Create an IAM role for VPC Flow Logs to deliver to CloudWatch
aws iam create-role \
  --role-name vpc-flow-logs-role \
  --assume-role-policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Principal":{"Service":"vpc-flow-logs.amazonaws.com"},
      "Action":"sts:AssumeRole"
    }]
  }'

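# Grant the role permission to write to CloudWatch Logs
aws iam put-role-policy \
  --role-name vpc-flow-logs-role \
  --policy-name vpc-flow-logs-delivery \
  --policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Action":[
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource":"*"
    }]
  }'

# Enable VPC Flow Logs for all traffic (ACCEPT and REJECT)
aws ec2 create-flow-logs \
  --resource-ids vpc-YOUR_PRODUCTION_VPC_ID \
  --resource-type VPC \
  --traffic-type ALL \
  --log-group-name /aws/vpc/flow-logs/production \
  --deliver-logs-permission-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/vpc-flow-logs-role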

# Verify flow logs are active
aws ec2 describe-flow-logs \
  --filter Name=resource-id,Values=vpc-YOUR_PRODUCTION_VPC_ID \
  --query 'FlowLogs[*].{Status:FlowLogStatus,LogGroup:LogGroupName}'
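
Once flow logs are arriving, you can use them to confirm that segmentation holds: traffic from a non-production range should only ever appear as REJECT. The sketch below parses records in the default (version 2) flow log format and returns any ACCEPTed traffic from a given CIDR. The non-production range `10.20.0.0/16` and the sample records are assumptions for illustration.

```python
import ipaddress
from typing import Dict, List

# Field order of the default (version 2) flow log format
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr", "srcport",
          "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log_status"]


def parse_record(line: str) -> Dict[str, str]:
    """Split one space-separated flow log record into named fields."""
    return dict(zip(FIELDS, line.split()))


def accepted_from_range(lines: List[str], cidr: str) -> List[Dict[str, str]]:
    """Records where traffic from `cidr` was ACCEPTed; each one is a segmentation leak."""
    network = ipaddress.ip_network(cidr)
    leaks = []
    for line in lines:
        record = parse_record(line)
        if record.get("action") != "ACCEPT":
            continue
        try:
            src = ipaddress.ip_address(record["srcaddr"])
        except (KeyError, ValueError):
            continue  # NODATA / SKIPDATA records carry "-" instead of addresses
        if src in network:
            leaks.append(record)
    return leaks


# Invented sample records
sample_lines = [
    "2 111122223333 eni-0a1b2c3d 10.20.4.15 10.0.1.8 49152 5432 6 10 840 1700000000 1700000060 REJECT OK",
    "2 111122223333 eni-0a1b2c3d 10.20.4.15 10.0.1.8 49153 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d 203.0.113.7 10.0.1.8 50000 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 111122223333 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
]

for leak in accepted_from_range(sample_lines, "10.20.0.0/16"):
    print("Unexpected ACCEPT:", leak["srcaddr"], "->", leak["dstaddr"], "port", leak["dstport"])
```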

Control 6: Secrets Manager (CC6.7)

Secrets management means storing credentials (database passwords, API keys, certificates, and other sensitive configuration values) in a dedicated, access-controlled service (like AWS Secrets Manager or HashiCorp Vault) rather than in .env files, GitHub repository secrets, or hardcoded in application code.

SOC2 CC6.7 requires protecting sensitive system components from unauthorized access. A secret stored in an .env file committed to a repository is accessible to every developer with repo access, every CI/CD runner, and every engineer who has ever cloned the repo — including those who have since left the company.

A secrets manager provides centralised storage, access logging, automatic rotation, and fine-grained IAM permissions so only specific services can retrieve specific secrets.

Let's look at the implementation — storing and rotating a secret:

# Store a database credential with automatic 90-day rotation
aws secretsmanager create-secret \
  --name production/postgresql/credentials \
  --description "Production PostgreSQL credentials — rotated every 90 days" \
  --secret-string '{
    "username": "app_user",
    "password": "REPLACE_WITH_STRONG_PASSWORD",
    "host": "your-rds-endpoint.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "dbname": "production"
  }'

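# Enable automatic rotation every 90 days.
# The first call must name a rotation Lambda; replace the placeholder ARN with your own.
aws secretsmanager rotate-secret \
  --secret-id production/postgresql/credentials \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:YOUR_ROTATION_FUNCTION \
  --rotation-rules AutomaticallyAfterDays=90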

How your application retrieves the secret at runtime (no hardcoded credentials):

# Good: secret retrieved at runtime from Secrets Manager
import boto3
import json

def get_db_credentials():
    client = boto3.client('secretsmanager', region_name='us-east-1')
    response = client.get_secret_value(SecretId='production/postgresql/credentials')
    return json.loads(response['SecretString'])

# Bad: secret hardcoded in application code or .env file
DB_PASSWORD = "my_database_password_123"  # Never do this

The access log in CloudTrail records every time a secret is retrieved, by which IAM role, at what time. That log is your SOC2 evidence that secrets access is controlled and auditable.

Control 7: EBS Encryption (CC6.1)

EBS (Elastic Block Store) encryption ensures that the persistent disks attached to your EC2 instances are encrypted at rest using AES-256. If an AWS employee or an attacker gained physical access to the storage hardware, the data would be unreadable without the encryption key.

SOC2 CC6.1 requires protecting information assets from unauthorised access. Encryption at rest is the control that protects data in the event of physical storage compromise or an improperly decommissioned disk. Enabling encryption by default means every new EBS volume in that region is encrypted automatically, including EKS node volumes and EC2 instance root volumes. The setting is per region, so repeat it in every region you use. RDS storage isn't covered by this setting: RDS encryption is chosen separately when each database instance is created.

# Enable EBS encryption by default for all new volumes in this region
aws ec2 enable-ebs-encryption-by-default

# Verify it is enabled
aws ec2 get-ebs-encryption-by-default \
  --query 'EbsEncryptionByDefault'
# Expected output: true

# Check existing volumes — any showing false need to be migrated
aws ec2 describe-volumes \
  --query 'Volumes[?Encrypted==`false`].[VolumeId,Size,VolumeType]' \
  --output table

Any existing unencrypted volumes must be snapshot-and-replaced. The process: create a snapshot of the unencrypted volume, create a new encrypted volume from the snapshot, and swap it into the instance.

Control 8: S3 Block Public Access (CC6.1)

Amazon S3 buckets can be configured to allow public access — meaning anyone on the internet can read their contents without authentication. Block Public Access is an account-level and bucket-level setting that prevents any bucket from being made public, regardless of the bucket's own policy.

A misconfigured S3 bucket is one of the most common causes of data breaches in cloud environments. Block Public Access at the account level means a developer can't accidentally expose a bucket containing customer data, even if they set the wrong bucket policy. It's a guardrail, not just a policy.

# Block public access at the AWS account level — applies to all buckets
aws s3control put-public-access-block \
  --account-id YOUR_ACCOUNT_ID \
  --public-access-block-configuration \
    BlockPublicAcls=true,\
    IgnorePublicAcls=true,\
    BlockPublicPolicy=true,\
    RestrictPublicBuckets=true

# Verify account-level setting is active
aws s3control get-public-access-block \
  --account-id YOUR_ACCOUNT_ID

# Scan for any buckets that have public access enabled (should be zero)
aws s3api list-buckets --query 'Buckets[*].Name' --output text | \
  tr '\t' '\n' | while read bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "WARNING: $bucket has public access not fully blocked"
    fi
  done
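
Note that the loop above only inspects `BlockPublicAcls`, and a bucket with no public access block configuration at all produces an empty result and no warning. A stricter check looks at all four flags and treats a missing configuration as a failure. Here is a minimal Python sketch that evaluates the JSON returned by `get-public-access-block`; the function name is hypothetical.

```python
from typing import List, Optional

REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_protections(response: Optional[dict]) -> List[str]:
    """Return the flags that are not set to true.

    `response` is the parsed output of get-public-access-block, or None
    when the bucket has no configuration (all four flags count as missing).
    """
    config = (response or {}).get("PublicAccessBlockConfiguration", {})
    return [flag for flag in REQUIRED_FLAGS if config.get(flag) is not True]


# Example: one flag left off
partial = {"PublicAccessBlockConfiguration": {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": False,
    "RestrictPublicBuckets": True,
}}
print(missing_protections(partial))
```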

Control 9: Branch Protection (CC8.1)

Branch protection is a GitHub setting that prevents engineers from pushing code directly to your main branch without going through a pull request that has been reviewed and approved by at least one other team member. It also requires your CI pipeline to pass before any code can be merged.

SOC2 CC8.1 requires change management — the requirement that every change to production systems is documented, reviewed, and approved. Without branch protection, an engineer can push directly to main, which deploys directly to production through your CI/CD pipeline, with no review and no audit trail. Branch protection is the technical enforcement of your change management policy.

The critical setting that most teams miss: the "Do not allow bypassing the above settings" option must be enabled. Without it, administrators can bypass branch protection — and a SOC2 auditor will flag this as a gap because it means your change management control can be circumvented.

# .github/settings.yml — enforces branch protection via code
# Requires the settings GitHub App: https://github.com/apps/settings

branches:
  - name: main
    protection:
      required_pull_request_reviews:
        required_approving_review_count: 1
        dismiss_stale_reviews: true
        require_code_owner_reviews: false
      required_status_checks:
        strict: true
        contexts:
          - "CI / test"
          - "Security / trivy-scan"
      enforce_admins: true         # Admins cannot bypass — this is critical
      restrictions: null           # No push restriction beyond the above
      allow_force_pushes: false
      allow_deletions: false

Here's how you can verify that branch protection is enforced and admins can't bypass it:

# Returns the branch protection rules including enforce_admins status
curl -H "Authorization: token YOUR_GITHUB_TOKEN" \
  https://api.github.com/repos/YOUR_ORG/YOUR_REPO/branches/main/protection \
  | jq '{enforce_admins: .enforce_admins.enabled, required_reviews: .required_pull_request_reviews.required_approving_review_count}'
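
If you want the same verification to run on a schedule, the response from that endpoint can be checked programmatically. The sketch below takes the parsed JSON from the branch protection API and lists any gaps. The function name and gap descriptions are illustrative, and it assumes required status checks are reported under `contexts`.

```python
from typing import List


def branch_protection_gaps(protection: dict) -> List[str]:
    """List the ways a branch protection response falls short of the policy above."""
    gaps = []
    if not (protection.get("enforce_admins") or {}).get("enabled", False):
        gaps.append("admins can bypass protection")
    reviews = protection.get("required_pull_request_reviews") or {}
    if reviews.get("required_approving_review_count", 0) < 1:
        gaps.append("no approving review required")
    if not (protection.get("required_status_checks") or {}).get("contexts"):
        gaps.append("no required status checks")
    if (protection.get("allow_force_pushes") or {}).get("enabled", False):
        gaps.append("force pushes allowed")
    if (protection.get("allow_deletions") or {}).get("enabled", False):
        gaps.append("branch deletion allowed")
    return gaps


# Example response trimmed to the fields that are checked
example = {
    "enforce_admins": {"enabled": False},
    "required_pull_request_reviews": {"required_approving_review_count": 1},
    "required_status_checks": {"strict": True, "contexts": ["CI / test"]},
    "allow_force_pushes": {"enabled": False},
    "allow_deletions": {"enabled": False},
}
print(branch_protection_gaps(example))
```

An empty list means the branch matches the settings file. Saving the output with a date gives you a recurring record for the auditor.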

Control 10: Container Image Scanning (CC7.4)

Container image scanning analyses your Docker images before deployment to identify known security vulnerabilities (CVEs) in the operating system packages and application dependencies they contain.

Trivy is an open-source scanner that checks the base image (Ubuntu, Alpine, and so on), all installed OS packages, and language-specific dependencies (npm, pip, Go modules) against the National Vulnerability Database.

SOC2 CC7.4 requires monitoring and identifying vulnerabilities. Every container you deploy contains a base image with OS packages — and those packages regularly receive CVE disclosures. A critical CVE left unpatched for 90 days in a production container is a SOC2 finding. Automated scanning in CI means every image is checked before it can deploy.

# .github/workflows/security-scan.yml
name: Security Scan
on: [push, pull_request]

jobs:
  trivy-scan:
    name: Container Vulnerability Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Build container image
        run: docker build -t app:${{ github.sha }} .

      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1          # Fail the pipeline on CRITICAL or HIGH findings

      - name: Upload results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        if: always()            # Upload even if scan found issues
        with:
          sarif_file: trivy-results.sarif

The scanner looks for:

CVEs in base image OS packages (for example, a critical OpenSSL vulnerability in your Ubuntu base)
Vulnerable versions of application dependencies (a known RCE in an npm package your app uses)
Misconfigurations in the Dockerfile itself (running as root, using latest tags), which require Trivy's separate config scan rather than the image scan shown above

Results appear in the GitHub Security tab for your repository, giving you a historical record of every scan — which is your SOC2 evidence.

Control 11: Incident Response Plan (CC9.2)

An incident response plan is a written, tested procedure that defines exactly what your team does when a security event occurs — from the moment an alert fires through to customer notification and post-incident review.

SOC2 CC9.2 requires that you have a documented process for responding to security events and that you've tested it. The auditor will ask for the written runbook and evidence that a tabletop exercise (a simulated incident walkthrough) has been conducted within the observation period.

Your incident response runbook must include:

Severity classification: Definitions of P1 (production down, customer data at risk), P2 (degraded service, potential risk), and P3 (minor issue, no customer impact) — and the response SLA for each.
Escalation path: Exactly who gets paged at each severity level, with contact details. Not "the on-call engineer" — specific names and a backup if the first person doesn't respond within 10 minutes.
First 15 minutes: The specific steps to take immediately — isolate the affected system, assess the scope, notify the incident channel, begin the timeline log.
Communication templates: Pre-written Slack messages, customer email templates, and regulatory notification templates (GDPR requires notification within 72 hours, HIPAA within 60 days).
Post-incident review: The blameless postmortem process, the 5-why root cause analysis template, and the action item tracking process.

Conduct a tabletop exercise at least once during your observation period: gather your engineering team for 45 minutes, simulate a realistic scenario (for example, "an AWS access key was committed to a public GitHub repo"), and walk through the runbook together. Document the meeting date, attendees, scenario, gaps found, and remediation actions. This document is your evidence.

Control 12: Access Reviews (CC6.3)

An access review is a quarterly audit of who has access to what in your production systems — AWS accounts, GitHub repositories, production databases, and every SaaS tool that touches customer data. You verify that every person on the list still works at the company and still needs the access their role grants them.

SOC2 CC6.3 requires that access is revoked when it's no longer needed. Former employees who retain access to production AWS accounts represent a genuine security risk and a definitive SOC2 finding.

In every access review I've conducted, at least 3–5 former employees or contractors still had active access they should not have had.

The quarterly access review checklist:

# 1. IAM users — list all with their last login date
aws iam generate-credential-report
aws iam get-credential-report --output text --query Content \
  | base64 --decode | cut -d',' -f1,5 | column -t -s ','

# 2. IAM roles: list each role's last-used date and flag any unused for 90+ days
aws iam get-account-authorization-details \
  --query 'RoleDetailList[*].{Role:RoleName,LastUsed:RoleLastUsed.LastUsedDate}' \
  --output table

# 3. Verify AWS SSO user list matches your current employee list
aws identitystore list-users \
  --identity-store-id YOUR_IDENTITY_STORE_ID \
  --query 'Users[*].{Name:DisplayName,Email:Emails[0].Value}' \
  --output table

Cross-reference the output against your current employee list in your HR system. Document every change made — access removed, permissions reduced, accounts disabled. The documented changes are the evidence that the review was conducted meaningfully, not just as a checkbox exercise.
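
The cross-reference step can be sketched in a few lines of Python. The example assumes the SSO users are in the `Name` / `Email` shape produced by the `list-users` query above and that your HR system can export a list of current employee emails. The function name and sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def stale_access(sso_users: List[Dict[str, Optional[str]]],
                 current_employee_emails: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Return SSO users whose email is not on the current employee list."""
    active = {email.strip().lower() for email in current_employee_emails}
    return [
        user for user in sso_users
        if (user.get("Email") or "").strip().lower() not in active
    ]


# Invented sample data
sso_users = [
    {"Name": "Ada Example", "Email": "ada@example.com"},
    {"Name": "Former Contractor", "Email": "Contractor@Example.com"},
    {"Name": "No Email", "Email": None},
]
employees = ["ADA@example.com "]

for user in stale_access(sso_users, employees):
    print("Review access for:", user["Name"])
```

Comparison is case-insensitive, and users with no email on record are flagged too, because they can't be matched to anyone.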

Control 13: Backup Verification (CC9.5)

Backup verification is the process of actually restoring your backups to confirm they work — not just confirming that backups are being created. A backup that has never been tested doesn't exist from a recovery perspective.

SOC2 CC9.5 requires that recovery procedures are tested. If your production database is corrupted and you discover for the first time during the incident that your automated RDS snapshots can't be restored, you have both a disaster recovery failure and a SOC2 finding.

How to test your RDS backup:

# Step 1: Find your most recent production snapshot
aws rds describe-db-snapshots \
  --db-instance-identifier your-production-db \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text

# Step 2: Restore the snapshot to a test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier backup-verification-test \
  --db-snapshot-identifier YOUR_SNAPSHOT_ID \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible \
  --tags Key=Purpose,Value=backup-verification Key=Environment,Value=test

# Step 3: Wait for the restore to complete (typically 5–15 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier backup-verification-test

# Step 4: Connect and verify data integrity (spot check key tables)
# Run this against the restored instance
psql -h RESTORED_INSTANCE_ENDPOINT -U your_user -d your_database \
  -c "SELECT COUNT(*) FROM users; SELECT MAX(created_at) FROM orders;"

# Step 5: Document the test result and delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier backup-verification-test \
  --skip-final-snapshot

Document the test date, the snapshot used, the restore time, the data verification query results, and who conducted the test. Run this quarterly at minimum. This documentation is your SOC2 evidence for CC9.5.
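
One way to keep that documentation consistent between quarters is to write each test as a structured record. The sketch below builds such a record as a dictionary that can be saved as JSON. The field names, snapshot identifier, query results, and tester name are all made up for illustration.

```python
import json
from datetime import datetime, timezone


def build_backup_test_record(snapshot_id: str, restore_started: datetime,
                             restore_finished: datetime, verification_results: dict,
                             tested_by: str) -> dict:
    """Assemble the facts an auditor will ask for into one record."""
    restore_minutes = (restore_finished - restore_started).total_seconds() / 60
    return {
        "control": "backup-verification",
        "test_date": restore_started.date().isoformat(),
        "snapshot_id": snapshot_id,
        "restore_minutes": round(restore_minutes, 1),
        "verification_results": verification_results,
        "tested_by": tested_by,
    }


record = build_backup_test_record(
    snapshot_id="rds:example-snapshot-id",                      # placeholder identifier
    restore_started=datetime(2026, 2, 3, 10, 0, tzinfo=timezone.utc),
    restore_finished=datetime(2026, 2, 3, 10, 12, 30, tzinfo=timezone.utc),
    verification_results={"users_count": 18234, "latest_order": "2026-02-03T01:59:41Z"},
    tested_by="example.engineer",
)
print(json.dumps(record, indent=2))
```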

Control 14: Change Management Log (CC8.1)

A change management log is the auditable record of every change made to your production environment — what changed, who approved it, and when it was applied.

SOC2 CC8.1 requires that changes to your production environment are authorized and documented. With IaC and GitOps in place, you already have two separate sources of immutable change history that together satisfy this control.

GitHub Pull Request history provides the record of every code and infrastructure change: who opened the PR, who reviewed and approved it, what the CI status was, and when it was merged. This is your change management log for application and infrastructure changes.

ArgoCD sync history provides the record of every deployment to your Kubernetes cluster: which application was synced, from which Git commit, at what time, and whether the sync succeeded.

To export the ArgoCD sync history as evidence:

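# Export ArgoCD application sync history as JSON evidence
# (argocd app history prints a table; app get -o json includes the history under .status.history)
argocd app get YOUR_APP_NAME -o json | jq '.status.history' > argocd-sync-history-$(date +%Y%m).json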

# Upload to your SOC2 evidence bucket
aws s3 cp argocd-sync-history-$(date +%Y%m).json \
  s3://your-soc2-evidence-bucket/change-management/$(date +%Y/%m)/

# For each deployment, the evidence contains:
# - App name, deployed revision (Git commit SHA)
# - Deployment timestamp
# - Initiating user or automated sync
# - Success/failure status

Together, the GitHub PR history and the ArgoCD sync history give the auditor a complete, tamper-evident record of every change to your production environment during the observation period.

Weeks 7–10: The Evidence Collection Infrastructure

Evidence is the difference between passing and failing SOC2.

You might be wondering: what exactly is evidence? In SOC2 terms, evidence is the documentation that proves a specific control was operating correctly during a specific point in time within the observation period. A policy document says you will do something. Evidence proves you did it — and that you did it continuously, not just the week before the audit.

For example:

For MFA enforcement (Control 1), evidence is a screenshot of your IAM Identity Center MFA settings taken at a specific date during the observation period, combined with an IAM credential report showing zero IAM users with console access.
For GuardDuty (Control 4), evidence is the GuardDuty console screenshot showing active detectors, plus your documented response to any findings during the period.
For access reviews (Control 12), evidence is the completed access review document with dates, names, and specific access changes made.

The challenge is collecting this evidence continuously across 3–12 months without spending hundreds of hours on manual work. The solution is automated evidence collection infrastructure.

The Evidence Bucket — Tamper-Proof Storage for Your Audit Evidence

The evidence bucket is an S3 bucket with Object Lock enabled in GOVERNANCE mode. Object Lock prevents any object from being deleted or modified for the retention period you specify — in this case, 365 days. This means once a piece of evidence is uploaded, it can't be altered, even by a user with administrator access (without explicitly overriding the lock, which itself creates an audit trail).

This tamper-evident property is what gives the auditor confidence that the evidence was not created or modified after the fact.

# terraform/soc2-evidence-bucket.tf

resource "aws_s3_bucket" "soc2_evidence" {
  bucket = "${var.company_name}-soc2-evidence-${var.environment}"
}

# Block all public access to the evidence bucket
resource "aws_s3_bucket_public_access_block" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning so overwrites create new versions, not replacements
resource "aws_s3_bucket_versioning" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Object Lock in GOVERNANCE mode — objects cannot be deleted for 365 days
resource "aws_s3_bucket_object_lock_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    default_retention {
      mode = "GOVERNANCE"
      days = 365
    }
  }
}

# Encrypt all evidence at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
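
Once the bucket exists, it's worth confirming that uploaded objects really carry a retention setting. Here is a minimal sketch, assuming you pass in the dictionary returned by boto3's `s3.head_object()` and that the caller is allowed to read retention settings; `retention_ok` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def retention_ok(head: dict, now: datetime, min_days_left: int = 0) -> bool:
    """True if the object is under Object Lock retention that has not expired.

    `head` is a head_object() response; the two keys read below are only
    present when the object version has a retention setting.
    """
    mode = head.get("ObjectLockMode")
    until = head.get("ObjectLockRetainUntilDate")
    if mode not in ("GOVERNANCE", "COMPLIANCE") or until is None:
        return False
    return until - now >= timedelta(days=min_days_left)


# Hand-built responses standing in for real head_object() output
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
locked = {
    "ObjectLockMode": "GOVERNANCE",
    "ObjectLockRetainUntilDate": now + timedelta(days=200),
}
print(retention_ok(locked, now))  # True
print(retention_ok({}, now))      # False
```

Running this against a handful of recent evidence keys catches the case where the bucket was created without a default retention rule.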

The Daily Evidence Collector Lambda

This Lambda function runs automatically every day and exports the status of each critical control to a time-stamped JSON file in the evidence bucket. Over your 3–12 month observation period, it creates a daily record proving that your controls were active and operating.

The function checks these controls automatically: CloudTrail status, GuardDuty status (including the count of unresolved high-severity findings), VPC Flow Logs, EBS encryption, and the account-level S3 public access block. MFA compliance belongs on the same list, but the listing below doesn't include that check. Each daily snapshot is uploaded with Object Lock enabled so it can't be modified.

# lambda/evidence-collector/handler.py

import boto3
import json
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    """
    Daily SOC2 evidence collector.
    Runs at 00:00 UTC every day via EventBridge scheduler.
    Exports control status to S3 evidence bucket with Object Lock.
    """
    evidence = {
        'collection_timestamp': datetime.now(timezone.utc).isoformat(),
        'collection_date': datetime.now(timezone.utc).strftime('%Y-%m-%d'),
        'account_id': boto3.client('sts').get_caller_identity()['Account'],
        'controls': {}
    }

    # Control 3: CloudTrail status
    cloudtrail = boto3.client('cloudtrail')
    trails = cloudtrail.describe_trails(includeShadowTrails=False)['trailList']
    # A trail can exist with logging stopped, so confirm IsLogging for each one
    multi_region_trails = [
        t for t in trails
        if t.get('IsMultiRegionTrail')
        and cloudtrail.get_trail_status(Name=t['TrailARN'])['IsLogging']
    ]
    evidence['controls']['cloudtrail'] = {
        'status': 'PASS' if multi_region_trails else 'FAIL',
        'detail': f"{len(multi_region_trails)} multi-region trail(s) active",
        'trails': [t['Name'] for t in multi_region_trails]
    }

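    # Control 4: GuardDuty status
    guardduty = boto3.client('guardduty')
    detectors = guardduty.list_detectors()['DetectorIds']
    # A detector can exist but be suspended, so only count ENABLED ones
    enabled_detectors = [
        d for d in detectors
        if guardduty.get_detector(DetectorId=d)['Status'] == 'ENABLED'
    ]
    unresolved_critical = 0
    finding_pages = guardduty.get_paginator('list_findings')
    for detector_id in enabled_detectors:
        # Paginate: a single list_findings call returns at most 50 finding IDs
        for page in finding_pages.paginate(
            DetectorId=detector_id,
            FindingCriteria={
                'Criterion': {
                    'severity': {'Gte': 7},  # HIGH and CRITICAL only
                    'service.archived': {'Eq': ['false']}
                }
            }
        ):
            unresolved_critical += len(page['FindingIds'])

    evidence['controls']['guardduty'] = {
        'status': 'PASS' if enabled_detectors else 'FAIL',
        'detail': f"{len(enabled_detectors)} detector(s) active, {unresolved_critical} unresolved HIGH/CRITICAL findings",
        'unresolved_high_critical': unresolved_critical
    }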

    # Control 5: VPC Flow Logs
    ec2 = boto3.client('ec2')
    # 'resource-type' and 'flow-log-status' are not among the documented
    # describe_flow_logs filters, so filter on the response fields instead
    all_flow_logs = ec2.describe_flow_logs()['FlowLogs']
    flow_logs = [
        fl for fl in all_flow_logs
        if fl.get('FlowLogStatus') == 'ACTIVE'
        and fl.get('ResourceId', '').startswith('vpc-')
    ]
    evidence['controls']['vpc_flow_logs'] = {
        'status': 'PASS' if flow_logs else 'FAIL',
        'detail': f"{len(flow_logs)} active VPC flow log(s)",
        'active_flow_logs': len(flow_logs)
    }

    # Control 7: EBS encryption by default
    ebs_encryption = ec2.get_ebs_encryption_by_default()['EbsEncryptionByDefault']
    evidence['controls']['ebs_encryption_by_default'] = {
        'status': 'PASS' if ebs_encryption else 'FAIL',
        'detail': 'EBS encryption by default is enabled' if ebs_encryption else 'EBS encryption by default is NOT enabled'
    }

    # Control 8: S3 Block Public Access (account level)
    s3control = boto3.client('s3control')
    account_id = boto3.client('sts').get_caller_identity()['Account']
    try:
        pab = s3control.get_public_access_block(AccountId=account_id)['PublicAccessBlockConfiguration']
        all_blocked = all([pab['BlockPublicAcls'], pab['IgnorePublicAcls'],
                           pab['BlockPublicPolicy'], pab['RestrictPublicBuckets']])
        evidence['controls']['s3_block_public_access'] = {
            'status': 'PASS' if all_blocked else 'FAIL',
            'detail': 'All four S3 Block Public Access settings enabled' if all_blocked else 'One or more S3 Block Public Access settings not enabled',
            'configuration': pab
        }
    except Exception as e:
        evidence['controls']['s3_block_public_access'] = {'status': 'FAIL', 'detail': str(e)}

    # Upload evidence to S3 with Object Lock
    s3 = boto3.client('s3')
    evidence_key = f"daily/{evidence['collection_date']}/control-status.json"
    lock_until = datetime.now(timezone.utc) + timedelta(days=365)

    s3.put_object(
        Bucket='YOUR_EVIDENCE_BUCKET_NAME',
        Key=evidence_key,
        Body=json.dumps(evidence, indent=2),
        ContentType='application/json',
        ObjectLockMode='GOVERNANCE',
        ObjectLockRetainUntilDate=lock_until
    )

    # Alert if any control fails
    failed_controls = [k for k, v in evidence['controls'].items() if v['status'] == 'FAIL']
    if failed_controls:
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='YOUR_ALERT_TOPIC_ARN',
            Subject=f'SOC2 Control Failure Detected — {evidence["collection_date"]}',
            Message=f'The following controls failed their daily check:\n\n{json.dumps(failed_controls, indent=2)}'
        )

    return {
        'statusCode': 200,
        'controls_checked': len(evidence['controls']),
        'controls_failed': len(failed_controls),
        'evidence_location': f"s3://YOUR_EVIDENCE_BUCKET_NAME/{evidence_key}"
    }

The GitHub Actions Evidence Workflow

This workflow runs daily and captures evidence that can't be automated through AWS APIs — GitHub-level controls like branch protection status, recent pull request activity, and CI pipeline results. It exports these as JSON files to the same evidence bucket.

# .github/workflows/soc2-evidence.yml
name: SOC2 Evidence Collection
on:
  schedule:
    - cron: '0 1 * * *'   # 01:00 UTC daily (after the Lambda runs at 00:00)
  workflow_dispatch:        # Allow manual trigger when needed

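permissions:
  contents: read
  id-token: write   # Required so configure-aws-credentials can request an OIDC token for role-to-assume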

jobs:
  collect-github-evidence:
    name: Collect GitHub Control Evidence
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/evidence-collector
          aws-region: us-east-1

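      - name: Collect branch protection status
        env:
          # Assumption: ADMIN_READ_TOKEN is a secret you create that holds a token with
          # read access to repository administration settings. The default GITHUB_TOKEN
          # cannot read branch protection rules.
          GH_ADMIN_TOKEN: ${{ secrets.ADMIN_READ_TOKEN }}
        run: |
          # Fail the step if curl or jq fails instead of writing an empty evidence file
          set -euo pipefail
          DATE=$(date +%Y-%m-%d)
          mkdir -p evidence/github

          # Export branch protection rules for main
          curl -s --fail -H "Authorization: token $GH_ADMIN_TOKEN" \
            "https://api.github.com/repos/${{ github.repository }}/branches/main/protection" \
            | jq --arg date "$DATE" '{
                date: $date,
                enforce_admins: .enforce_admins.enabled,
                required_reviews: .required_pull_request_reviews.required_approving_review_count,
                required_status_checks: .required_status_checks.contexts,
                allow_force_pushes: .allow_force_pushes.enabled
              }' > evidence/github/branch-protection-$DATE.json

          echo "Branch protection evidence collected"
          cat evidence/github/branch-protection-$DATE.json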

      - name: Upload evidence to S3
        run: |
          DATE=$(date +%Y-%m-%d)
          # Sync the github/ folder itself so keys don't end up under github/github/
          aws s3 sync evidence/github/ \
            s3://${{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/$DATE/github/ \
            --no-progress
          echo "Evidence uploaded: s3://${{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/$DATE/github/"
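
The exported branch protection file is only useful if someone checks what it says. Here is a minimal sketch of a pass/fail evaluation over the JSON shape the workflow writes; the function name and the `min_reviews` threshold are assumptions, so set the threshold to whatever your documented policy requires.

```python
def evaluate_branch_protection(snapshot: dict, min_reviews: int = 1) -> dict:
    """Turn one branch-protection snapshot into a PASS/FAIL result."""
    problems = []
    if snapshot.get("enforce_admins") is not True:
        problems.append("admins can bypass branch protection")
    if (snapshot.get("required_reviews") or 0) < min_reviews:
        problems.append(f"fewer than {min_reviews} required approving review(s)")
    if snapshot.get("allow_force_pushes") is not False:
        problems.append("force pushes are not explicitly disabled")
    return {"status": "FAIL" if problems else "PASS", "problems": problems}


# Hypothetical snapshot in the shape produced by the jq filter
good = {
    "date": "2025-12-14",
    "enforce_admins": True,
    "required_reviews": 2,
    "required_status_checks": ["ci/test"],
    "allow_force_pushes": False,
}
print(evaluate_branch_protection(good)["status"])  # PASS
```

A snapshot full of null values, which is what you get when the API call is rejected, fails all three checks rather than passing silently.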

Weeks 11–14: Auditor Selection and Readiness Assessment

How to Choose a SOC2 Auditor

Selecting the right auditor is more consequential than most teams realize. SOC2 audits are conducted by CPA firms, specifically firms licensed to issue SOC reports. The right firm has experience with cloud-native SaaS companies your size. The wrong firm could apply enterprise audit frameworks to a seed-stage startup and generate findings based on controls that aren't appropriate to your context.

Here is what to look for and what to watch out for:

Experience matters more than brand

A large Big Four firm isn't necessarily better than a specialist boutique auditor for a 20-person SaaS company.

Ask specifically: "How many SOC2 audits have you completed in the last 12 months for SaaS companies between 10 and 50 employees?" You want a firm where this is common, not exceptional.

Verify familiarity with your compliance tool

If you're using Vanta or Drata, confirm that the auditor has experience with evidence produced by those platforms. Some auditors prefer to collect evidence directly and are unfamiliar with automated evidence exports. An auditor who doesn't trust your Vanta evidence will ask you to re-collect everything manually.

Understand what Type II actually costs

For a Series A SaaS company, expect $15,000–$30,000 for a SOC2 Type II audit with a 3-month observation period. A quote below $10,000 often means the auditor is cutting corners on the review depth. A quote above $50,000 for a small company typically means the firm is applying enterprise pricing to a startup engagement.

Get references from similar companies

Ask the auditor for two or three references from SaaS companies they've audited in the last year. Call those references and ask: did the auditor understand cloud infrastructure? Were the findings reasonable? How was the communication during the review?

Here's a summary table of some things to watch out for:

Criteria	What to Look For	Red Flag
Experience	5+ years, 20+ SaaS audits annually	"We have completed several SOC2 audits" (vague)
Tool familiarity	Has reviewed Vanta/Drata evidence before	Requires manual re-collection of automated evidence
Company size fit	Has audited companies your size	Only lists enterprise clients as references
Cost (Type II)	$15K–$30K for a 20-person company	Under $10K or over $50K without clear justification
References	Can provide SaaS company contacts to call	Cannot provide references

How to Run a Readiness Assessment (Mock Audit)

A readiness assessment is a self-conducted simulation of the real audit, run 2–4 weeks before you engage the auditor. Its purpose is to find and close gaps before the auditor finds them, because gaps found in a mock audit cost you a week of remediation time, while gaps found in the real audit cost you a qualified report and a re-review.

You can run the readiness assessment yourself or hire a consultant to run it. The consultant approach is more valuable because an independent reviewer will find gaps you have rationalised away.

The process:

Step 1: Work through every control in the checklist below and attempt to produce the evidence that an auditor would request.
Step 2: Treat every control for which you can't produce clear, timestamped evidence as a gap, and document it.
Step 3: Prioritise gaps by type. Evidence gaps (missing evidence for an active control) require evidence collection infrastructure fixes. Control gaps (a control that isn't implemented) require engineering work.
Step 4: Close all gaps before engaging the real auditor.
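
The distinction in Step 3 can be sketched as a small classification routine. This is an illustration only: the `ControlCheck` type and the example statuses are hypothetical, and in practice you would fill them in by hand as you work through the checklist.

```python
from dataclasses import dataclass


@dataclass
class ControlCheck:
    name: str
    implemented: bool   # is the control actually in place?
    has_evidence: bool  # can you produce clear, timestamped evidence?


def classify(check: ControlCheck) -> str:
    """Label a control as ready, an evidence gap, or a control gap."""
    if not check.implemented:
        return "control gap"   # requires engineering work
    if not check.has_evidence:
        return "evidence gap"  # requires evidence collection fixes
    return "ready"


# Hypothetical results from one pass through the checklist
checks = [
    ControlCheck("MFA enforced", implemented=True, has_evidence=True),
    ControlCheck("Backup test", implemented=True, has_evidence=False),
    ControlCheck("Trivy scanning in CI", implemented=False, has_evidence=False),
]
for check in checks:
    print(f"{check.name}: {classify(check)}")
```

A control that isn't implemented is always a control gap, even if some stale evidence exists for it, which is why the implementation test comes first.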

Control	Evidence Required	How to Verify	Ready?
MFA enforced	IAM credential report + SSO MFA policy screenshot	`aws iam get-credential-report`	⬜
CloudTrail active	Trail status + S3 delivery confirmation	`aws cloudtrail get-trail-status`	⬜
GuardDuty active	Detector list + finding review log	`aws guardduty list-detectors`	⬜
VPC Flow Logs	Active flow log list + sample log entries	`aws ec2 describe-flow-logs`	⬜
Secrets in Secrets Manager	Secret list + rotation policy confirmation	`aws secretsmanager list-secrets`	⬜
EBS encryption by default	Account-level encryption setting	`aws ec2 get-ebs-encryption-by-default`	⬜
S3 Block Public Access	Account-level PAB configuration	`aws s3control get-public-access-block`	⬜
Branch protection (no admin bypass)	GitHub branch protection API response	GitHub API or Settings UI	⬜
Trivy scanning in CI	GitHub Actions run history showing scans	GitHub Actions logs	⬜
Incident response runbook	Written runbook + tabletop exercise notes with date	Document review	⬜
Access review	Quarterly review document with specific changes made	Document review	⬜
Backup test	RDS restore log + data verification results	Document review	⬜
Change management log	GitHub PR history + ArgoCD sync history	GitHub and ArgoCD	⬜

The one thing most teams skip: Running the readiness assessment against their own evidence bucket. Pull a random day's evidence from the daily Lambda export and verify that it's complete, timestamped, and accurately reflects the control status on that day.

If the evidence file for December 14th shows GuardDuty as PASS but GuardDuty was actually disabled that day, the auditor will find the discrepancy in the AWS account history — and that's a qualified finding.
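
Part of that spot check can be scripted. Here is a minimal sketch that checks one daily file for completeness and a consistent timestamp, assuming the JSON layout written by the collector listing in this guide. It can't tell you whether a PASS was true on that day; that still means comparing against the AWS account history.

```python
import json
from datetime import datetime

# Control names written by the collector listing in this guide
EXPECTED_CONTROLS = {
    "cloudtrail", "guardduty", "vpc_flow_logs",
    "ebs_encryption_by_default", "s3_block_public_access",
}


def validate_snapshot(raw: str, expected_date: str) -> list[str]:
    """Return a list of problems found in one daily evidence file."""
    doc = json.loads(raw)
    problems = []

    if doc.get("collection_date") != expected_date:
        problems.append("collection_date does not match the folder date")

    try:
        stamp = datetime.fromisoformat(doc.get("collection_timestamp"))
        if stamp.date().isoformat() != expected_date:
            problems.append("collection_timestamp falls on a different day")
    except (TypeError, ValueError):
        problems.append("collection_timestamp is missing or not ISO 8601")

    controls = doc.get("controls") or {}
    for name in sorted(EXPECTED_CONTROLS - set(controls)):
        problems.append(f"missing control: {name}")
    for name, result in controls.items():
        if result.get("status") not in ("PASS", "FAIL"):
            problems.append(f"{name}: status is not PASS or FAIL")
    return problems


# Hypothetical snapshot with the GuardDuty entry left out
sample = json.dumps({
    "collection_timestamp": "2025-12-14T00:00:04.512000+00:00",
    "collection_date": "2025-12-14",
    "controls": {n: {"status": "PASS"} for n in EXPECTED_CONTROLS if n != "guardduty"},
})
print(validate_snapshot(sample, "2025-12-14"))  # ['missing control: guardduty']
```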

Weeks 15–18: The Observation Period

How the Auditor Observes Your Controls

The SOC2 auditor doesn't physically visit your office or sit inside your AWS console watching your infrastructure in real time. The audit is a remote, documentation-based process conducted entirely through evidence review.

Here is how it actually works:

First, the auditor provides a list of evidence requests — typically 80–150 items for a Type II audit. You upload the evidence to a shared portal (the auditor provides this — it is usually a secure document sharing platform). The auditor reviews the evidence, asks follow-up questions, and identifies gaps where evidence is missing or a control wasn't operating as described.

For automated controls like CloudTrail and GuardDuty, the evidence is your daily Lambda exports — the auditor spot-checks a sample of daily snapshots across the observation period to verify the controls were consistently active.

For manual controls like access reviews and backup tests, the evidence is the documents you produced when you ran those processes.

The practical implication: the auditor is trusting your evidence. This is why the Object Lock on your evidence bucket matters. It shows the auditor that each evidence file hasn't been modified or deleted since it was uploaded, and the upload timestamp shows when it was generated.

What the Auditor Reviews Over the Observation Period

What They Check	How Often	What They Are Looking For
CloudTrail logs	Spot check monthly	Manual console changes that bypassed IaC, gaps in log delivery
GuardDuty findings	Review quarterly summary	HIGH or CRITICAL findings not remediated within your documented SLA
Access review completion	Verify each quarterly cycle	Reviews skipped, reviews with no access changes despite employee turnover
Incident response tests	Verify annually	No tabletop exercise conducted during the observation period
Evidence collection	Verify continuous coverage	Gaps in daily evidence exports, missing evidence for specific dates
Change management log	Sample PR/sync history	Deployments with no associated pull request or review

What Triggers a Finding

A SOC2 finding is the auditor's documented conclusion that a control wasn't operating effectively during the observation period. Findings range from observations (minor issues that don't affect the audit opinion) to qualified opinions (material failures that result in a qualified rather than unqualified report).

Understanding what triggers findings — and which ones restart the observation period — is critical for managing your audit timeline.

Control gaps occur when a required control isn't implemented or was disabled during the observation period. If you discover in month 2 that MFA wasn't enforced on one IAM user for the first three weeks, you must document the remediation and demonstrate the gap was closed.

Whether this restarts your observation period depends on how long the gap lasted and how the auditor assesses the risk — but a gap of less than 30 days that's immediately remediated and documented typically doesn't restart the clock.

Evidence gaps are more serious. If your daily Lambda evidence collector failed for two weeks and produced no evidence exports, you have a two-week window with no documented proof that your controls were operating. The auditor can't verify controls they can't see evidence for.

Evidence gaps almost always require extending the observation period because there's no way to retroactively produce evidence for a period that wasn't recorded.
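
Because these gaps can't be repaired after the fact, it helps to look for them yourself on a regular basis. Here is a minimal sketch, assuming you have already listed the object keys under the `daily/` prefix (for example with a paginated S3 listing) and that they follow the key layout used by the collector in this guide.

```python
from datetime import date, timedelta


def missing_evidence_dates(keys, start: date, end: date) -> list[date]:
    """Return every date in [start, end] with no control-status.json file."""
    present = set()
    for key in keys:
        parts = key.split("/")
        # Expected layout: daily/YYYY-MM-DD/control-status.json
        if len(parts) == 3 and parts[0] == "daily" and parts[2] == "control-status.json":
            present.add(parts[1])
    total_days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(total_days)]
    return [d for d in expected if d.isoformat() not in present]


# Hypothetical listing with two days missing
keys = [
    "daily/2025-03-01/control-status.json",
    "daily/2025-03-02/control-status.json",
    "daily/2025-03-04/control-status.json",
    "daily/2025-03-04/github/branch-protection-2025-03-04.json",
]
for gap in missing_evidence_dates(keys, date(2025, 3, 1), date(2025, 3, 5)):
    print(gap.isoformat())
```

For the sample listing this prints 2025-03-03 and 2025-03-05, the two days with no Lambda export.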

Process failures occur when a manual control wasn't executed as documented. The most common is an access review that was skipped. Like control gaps, these can typically be remediated without restarting the clock if they're documented promptly and the remediation is clear.

Unpatched critical CVEs are a special case. If Trivy identifies a CRITICAL vulnerability in a production container and it remains unpatched for more than your documented remediation SLA (typically 30 days for critical, 90 days for high), this is a qualified finding that the auditor will note in the report.
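
The SLA comparison is simple date arithmetic. Here is a minimal sketch using the typical limits quoted above (30 days for critical, 90 days for high); the table and function names are hypothetical, and your own documented SLA should replace these numbers.

```python
from datetime import date

# Typical limits from the text; replace with your documented SLA
REMEDIATION_SLA_DAYS = {"CRITICAL": 30, "HIGH": 90}


def sla_breached(severity: str, first_seen: date, today: date) -> bool:
    """True if a vulnerability has been open longer than its SLA allows."""
    limit = REMEDIATION_SLA_DAYS.get(severity.upper())
    if limit is None:
        return False  # no SLA documented for this severity
    return (today - first_seen).days > limit


print(sla_breached("CRITICAL", date(2024, 1, 1), date(2024, 2, 5)))  # True (35 days)
print(sla_breached("HIGH", date(2024, 1, 1), date(2024, 2, 5)))      # False
```

The `first_seen` date has to come from your own records, such as the date of the first CI run that reported the vulnerability.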

How to Close Gaps Without Restarting the Clock

When you discover a gap during the observation period:

For control gaps:

1. Fix the control immediately — don't wait
2. Document the fix: screenshot, PR link, or CLI command output with timestamp
3. Note the gap date range in your audit log: "Control gap: 2024-03-10 to 2024-03-14 (4 days). Root cause: [X]. Remediated: [Y]. No customer data accessed during gap period."
4. Notify your auditor proactively — they will find it anyway; proactive disclosure is better than defensive explanation
5. The observation period doesn't restart if the gap was short-lived and promptly remediated
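
If you log gaps often enough to want a consistent format, the note in step 3 can be generated. This is a minimal sketch with a hypothetical helper name and example values. It deliberately leaves out the statement about customer data, which should only be added after you have verified it.

```python
from datetime import date


def gap_log_entry(start: date, end: date, root_cause: str, remediation: str) -> str:
    """Format a control-gap note in the style shown in step 3."""
    if end < start:
        raise ValueError("gap end date is before its start date")
    days = (end - start).days
    unit = "day" if days == 1 else "days"
    return (
        f"Control gap: {start.isoformat()} to {end.isoformat()} ({days} {unit}). "
        f"Root cause: {root_cause}. Remediated: {remediation}."
    )


print(gap_log_entry(
    date(2024, 3, 10),
    date(2024, 3, 14),
    "MFA policy detached from one IAM user",  # hypothetical root cause
    "policy re-attached and verified",        # hypothetical remediation
))
```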

For evidence gaps:

1. Fix the evidence collection infrastructure immediately
2. Understand that you can't retroactively generate evidence for the gap period
3. The observation period for affected controls effectively restarts from the date evidence collection resumed
4. If the gap is early in your observation period, you may be able to extend the period rather than restart — discuss with your auditor

Pro tip: Set up a CloudWatch alarm that triggers if the evidence Lambda fails to deliver to S3 on schedule, so that a missing daily evidence file is caught within 24 hours instead of being discovered during the audit review.
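
One way to approximate that alarm is to watch the Lambda's own metrics: one alarm for any error in a day, and one for a day with no invocation at all. Here is a minimal sketch. The function name and topic ARN in the usage comment are placeholders, and the helper names are hypothetical. These alarms watch the function rather than the S3 object itself, so they assume a failed upload surfaces as a Lambda error.

```python
def evidence_alarm_definitions(function_name: str, topic_arn: str) -> list[dict]:
    """Build keyword arguments for two CloudWatch put_metric_alarm calls."""
    common = {
        "Namespace": "AWS/Lambda",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 86400,          # evaluate one full day at a time
        "EvaluationPeriods": 1,
        "AlarmActions": [topic_arn],
    }
    return [
        {   # The collector ran but raised an exception
            **common,
            "AlarmName": f"{function_name}-errors",
            "MetricName": "Errors",
            "Threshold": 0.0,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",
        },
        {   # The collector did not run at all (schedule disabled or deleted)
            **common,
            "AlarmName": f"{function_name}-not-invoked",
            "MetricName": "Invocations",
            "Threshold": 1.0,
            "ComparisonOperator": "LessThanThreshold",
            "TreatMissingData": "breaching",
        },
    ]


def create_alarms(cloudwatch, definitions: list[dict]) -> None:
    """Create each alarm using a boto3 CloudWatch client passed in by the caller."""
    for definition in definitions:
        cloudwatch.put_metric_alarm(**definition)


# Usage (placeholders, not executed here):
#   import boto3
#   create_alarms(
#       boto3.client("cloudwatch"),
#       evidence_alarm_definitions("evidence-collector", "YOUR_ALERT_TOPIC_ARN"),
#   )
```

Treating missing data as breaching on the invocation alarm is the important design choice: a function that never runs emits no metrics, and without that setting the alarm would stay silent.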

The 90-Day SOC2 Timeline at a Glance

Weeks	Focus	Key Deliverables	Common Mistake
1–2	Scope	Boundary diagram, network segmentation Terraform	Over-scoping to include dev and staging
3–6	Controls	14 controls implemented and collecting evidence	Starting controls after the observation period begins
7–10	Evidence	S3 evidence bucket, Lambda daily collector, GitHub Actions workflow	Manual evidence collection with inevitable gaps
11–14	Readiness	Mock audit, gap remediation, auditor selected	Skipping the mock audit
15–18	Observation	Daily evidence, quarterly reviews, incident response test	Discovering evidence gaps during the audit rather than before

What's Next?

Start with Week 1. Define your SOC2 boundary. Apply the four-question framework to every system in your infrastructure. Draw the diagram in Excalidraw. Document the network segmentation controls.

Then implement the 14 controls in order, starting with MFA and CloudTrail — the two that most commonly fail audits when they're missing.

Then build your evidence collection infrastructure before the observation period starts. The automated Lambda and GitHub Actions workflow are the difference between a smooth audit and a 60-day extension.

One thing to remember: SOC2 is 20% controls, 30% evidence, and 50% continuous operation. Start early. Automate everything. Run a mock audit before you call the real one.

Resources

The following resources are referenced throughout this guide:

AICPA SOC2 Overview — The official SOC2 documentation from the American Institute of CPAs, including the Trust Service Criteria
Vanta — Compliance automation platform that connects to AWS and GitHub to automate evidence collection and track control status
Drata — Alternative compliance automation platform with similar capabilities to Vanta
Trivy by Aqua Security — Open-source container and filesystem vulnerability scanner used in Control 10
Excalidraw — Free, open-source diagram tool for creating the SOC2 boundary diagram
AWS IAM Identity Center documentation — Official AWS documentation for setting up SSO and MFA enforcement
GitHub branch protection documentation — Official GitHub documentation for configuring branch protection rules
ArgoCD documentation — Official ArgoCD documentation for GitOps deployment and sync history

Ayobami Adejumo is a senior platform engineer and FinOps specialist. He writes about SOC2 compliance engineering, Kubernetes cost optimization, and platform engineering.

Pattern	Monthly Saving	Time to Fix	Difficulty
1. New hire experiment tax	$1,000–$2,000	2 hours (Lambda)	Medium
2. Staging proliferation	$600–$800	3 hours (scheduling)	Low
3. NAT Gateway tax	$2,000–$8,000	30 minutes	Low
4. Savings Plan timing	$5,000–$15,000	One decision	Low
5. Cross-AZ data transfer	$500–$6,000	2 hours	Medium
6. gp2 volume trap	$1,000–$5,000	30 minutes (script)	Low
7. Infinite log trap	$500–$2,000	1 hour (script)	Low
8. Orphaned resources	$500–$2,000	2 hours (Lambda)	Low
Total potential	$11,100–$40,800/month
