finops - freeCodeCamp.org

The 2026 FinOps Roadmap: From Cost-Blind Engineer to Cloud Financial Manager

Ayobami Adejumo — Mon, 15 Jun 2026 23:22:50 +0000

My first AWS bill was $23,000. I had been working at the company for three weeks.

Nobody told me. The bill just grew quietly in the background while I was proud of the feature I shipped. A Lambda function that called an external enrichment API on every user event. Clean code. Solid tests. Thirty-two million events that month. At $0.0007 per API call.

My engineering manager forwarded the invoice with two words: "Please explain."

That was the moment I discovered FinOps — not from a conference talk or a certification course, but from the specific shame of having written expensive code and not knowing it until the damage was done.

This roadmap is what I needed that day. A complete, honest guide to transforming from an engineer who builds things that work into an engineer who builds things that work and cost what they should. By the end of this guide, you'll have the skills, the scripts, and the vocabulary to talk about cloud spend the way a CFO and a CTO both want to hear.

What You'll Learn
Prerequisites
The Four Stages Overview
Stage 1: The Cost-Aware Engineer — Months 1 to 3
Stage 2: The Optimisation Specialist — Months 4 to 8
Stage 3: The Automation Architect — Months 9 to 15
Stage 4: The Cloud Financial Manager — Months 16 to 24
Essential Tools and Certifications
Your 90-Day Action Plan
Best Practices Summary
Resources

What You'll Learn

How to read your AWS bill as an engineer, not as a passive observer
The exact tagging strategy that makes cost attribution possible
How to right-size EC2 and RDS instances using CloudWatch data you already have
The correct sequence for purchasing Savings Plans — and why sequence matters more than the discount percentage
How to build automated cleanup systems for orphaned resources
How to present cloud cost findings to engineering leadership with data that drives decisions
The chargeback and showback models that make cost accountability stick

Let's begin.

Prerequisites

Before following this roadmap, you should have some skills and tools ready to go.

Knowledge:

You can deploy an application to AWS (EC2, Lambda, or containers)
You understand basic AWS services: S3, RDS, EC2, VPC, IAM
You're comfortable reading Python and writing simple bash scripts
You know what a pull request is and have gone through at least one code review

Access:

Read-only access to your AWS billing console and Cost Explorer
AWS CLI v2 configured with at least ReadOnlyAccess policy attached
Python 3.9 or later for running the audit scripts in this guide

Mindset: You don't need to be a finance expert. But you do need to be willing to look at numbers that might be uncomfortable. Every engineer I've worked with who became excellent at FinOps had one thing in common: they were willing to be the person who asked "but what does this cost?" in a room where nobody else wanted to.

Estimated time: This roadmap covers 24 months of deliberate skill-building. You can absorb the reading in a few evenings. The practice is the 24 months.

The Four Stages Overview

Before going deep, here's the complete picture of where you're going:

Stage 1 — Cost-Aware Engineer (Months 1–3)
├── Read your cloud bill and understand it
├── Tag every resource with meaningful metadata
├── Identify your top 5 cost drivers
└── Block your first expensive PR with cost justification

Stage 2 — Optimisation Specialist (Months 4–8)
├── Right-size every over-provisioned resource
├── Implement storage lifecycle policies
├── Move non-production to Spot instances
└── Purchase your first Savings Plan in the right order

Stage 3 — Automation Architect (Months 9–15)
├── Build automated cleanup for orphaned resources
├── Add cost estimation to your CI/CD pipeline
├── Create cost-aware auto-scaling triggers
└── Deploy a self-service FinOps dashboard

Stage 4 — Cloud Financial Manager (Months 16–24)
├── Lead monthly FinOps reviews with engineering leadership
├── Build chargeback models for departments
├── Negotiate enterprise agreements with AWS
└── Forecast cloud spend within 5% variance

The reason this is a 24-month journey and not a weekend project: each stage builds on the previous one. Engineers who jump straight to Savings Plans without rightsizing first end up paying discounted prices for waste. Engineers who build dashboards before tagging get beautiful charts with no actionable data. The sequence isn't arbitrary.
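
The Stage 4 target of forecasting within 5% variance is easier to hold yourself to once it is written down as a calculation. The sketch below is one way to measure it; the function name and the sample figures are hypothetical, and measuring the gap relative to the forecast (rather than the actual bill) is an assumption, so use whichever convention your finance team already follows.

```python
def forecast_variance_pct(forecast_usd: float, actual_usd: float) -> float:
    """Absolute gap between actual and forecast spend, as a % of the forecast."""
    if forecast_usd <= 0:
        raise ValueError("forecast_usd must be positive")
    return abs(actual_usd - forecast_usd) / forecast_usd * 100


# Hypothetical month: forecast $48,000, actual bill $50,100
variance = forecast_variance_pct(48_000, 50_100)
print(f"Variance: {variance:.2f}%")
print("Within 5% target" if variance <= 5 else "Outside 5% target")
```

Tracking this number every month from Stage 1 onward gives you a trend to show leadership long before you reach Stage 4.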

Stage 1: The Cost-Aware Engineer — Months 1 to 3

1.1 Reading the Bill Like an Engineer, Not an Accountant

The default AWS Cost Explorer view shows you service-level totals. That's accounting. What you need is engineering-level decomposition: which specific resources cost money, what business function they serve, and whether each dollar is justified.

Start by pulling a proper breakdown:

# Pull last month's cost breakdown grouped by service
# Run this before touching any optimisation — this is your baseline
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].[Metrics.UnblendedCost.Amount,Keys[0]]' \
  --output text | sort -rn

Save the output. Name the file aws-baseline-YYYY-MM.txt. You'll compare every future month against this number. Without a baseline, you can't measure progress — and without measurable progress, you can't make the case to leadership that the work is worth engineering time.

Three questions for every service in your top 5:

Most engineers stop at "what is this service?" and never reach the useful question. Here's the framework I use when I first audit an account:

The first question is whether you know what specific business function this service is performing. Not the product name, the function. "S3" isn't an answer. "Storing unprocessed video uploads that sit for 90 days before anyone watches them" is an answer.

The second question is whether the cost is growing, stable, or shrinking when you look at the past three months. A stable $12,000/month is a different problem from a $12,000/month line that was $4,000 six months ago.

The third question is what percentage of your total bill this service represents. Optimising a 1% line item while a 40% line item runs unchecked is a common time-wasting trap.
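
The second and third questions are simple arithmetic once you have three months of numbers, so they are worth scripting. This is a minimal sketch: the function name, the sample figures, and the 10% band that separates "stable" from "growing" or "shrinking" are all assumptions, not part of any AWS API.

```python
def summarise_service(name, monthly_costs, total_bill_latest):
    """Answer 'is it growing?' and 'what share of the bill is it?' for one service.

    monthly_costs: costs in USD, oldest first (for example, the last three months).
    total_bill_latest: the whole account's bill for the most recent month.
    """
    first, latest = monthly_costs[0], monthly_costs[-1]
    change_pct = (latest - first) / first * 100 if first else float("inf")

    # Anything within +/-10% counts as stable; the band is an arbitrary assumption
    if change_pct > 10:
        trend = "growing"
    elif change_pct < -10:
        trend = "shrinking"
    else:
        trend = "stable"

    return {
        "service": name,
        "trend": trend,
        "change_pct": round(change_pct, 1),
        "share_of_bill_pct": round(latest / total_bill_latest * 100, 1),
    }


# Hypothetical figures: S3 grew from $4,000 to $12,000 on a $30,000 bill
print(summarise_service("Amazon S3", [4_000, 8_000, 12_000], 30_000))
```

Sorting the results by share of bill first, then by trend, keeps you away from the 1% line items.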

1.2 The Tagging Strategy That Actually Survives

Here's the honest truth about tagging: most tagging strategies die within six months because they're designed for reporting rather than for engineers. Engineers don't tag things well when they're moving fast. The solution isn't to demand more discipline; it's to enforce tagging at the infrastructure layer.

Here's the minimal viable tag set (the six tags that cover 90% of attribution needs):

# These six tags enable cost attribution, accountability, and automated remediation
# Add these to every resource in your AWS account — EC2, RDS, S3, Lambda, everything

Environment: "production" | "staging" | "dev"
Team: "platform" | "backend" | "data" | "ml"
Service: "payment-api" | "fraud-detection" | "user-service"
Owner: "ayo@cloudfrugal.com"     # Person responsible for this resource
CostCenter: "engineering"         # For chargeback reporting
AutoShutdown: "true" | "false"    # Enables automated remediation

Enforce tags at the Terraform level so they can't be skipped:

# variables.tf
# Add this to your Terraform root module
# Validation fails if required_tags omits any of these keys; each resource must still merge var.required_tags

variable "required_tags" {
  description = "Tags required on every resource in this account"
  type = map(string)
  
  validation {
    condition = contains(keys(var.required_tags), "Environment") &&
                contains(keys(var.required_tags), "Team") &&
                contains(keys(var.required_tags), "Owner")
    error_message = "required_tags must include Environment, Team, and Owner."
  }
}

# Apply in every resource
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = merge(var.required_tags, {
    Name    = "app-server-${var.environment}"
    Service = "payment-api"
  })
}

Find everything that's currently untagged:

# List EC2 instances missing the Team tag
# Run this weekly until you hit zero results
aws ec2 describe-instances \
  --query "Reservations[].Instances[?!not_null(Tags[?Key=='Team'].Value | [0])].[InstanceId, InstanceType, State.Name]" \
  --output table

Once you start finding untagged resources, you'll discover a pattern: the oldest resources in the account are the least tagged, and they're often the most expensive. An EC2 instance from 2021 that predates your tagging policy is exactly the kind of thing that generates a $3,000/month line item nobody can explain.
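
The CLI query above checks one tag at a time. If you want to audit all six tags in one pass, a small pure function is easy to reuse across EC2, EBS, and RDS results. This is a sketch under assumptions: the helper name is hypothetical, and it expects tags in the list-of-dicts shape that the EC2 API returns.

```python
REQUIRED_TAGS = ("Environment", "Team", "Service", "Owner", "CostCenter", "AutoShutdown")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank.

    resource_tags uses the EC2 API shape, e.g. [{"Key": "Team", "Value": "platform"}],
    or None when the resource has no tags at all.
    """
    present = {
        t["Key"]
        for t in (resource_tags or [])
        if t.get("Value", "").strip()  # a blank value is treated as missing
    }
    return [key for key in required if key not in present]


# Hypothetical instance with only two usable tags
tags = [{"Key": "Team", "Value": "platform"}, {"Key": "Environment", "Value": "dev"}]
print(missing_tags(tags))
```

Feeding it `inst.get('Tags')` from a `describe_instances` response gives you a per-resource gap list you can group by age or by owner.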

1.3 The Cost-Aware Code Review

The most underused FinOps practice in engineering teams is reviewing code changes for cost implications before they merge. It takes thirty seconds per PR once you build the habit, and it prevents the kind of problem that opened this guide: the expensive feature that nobody priced before shipping.

Add this section to your PR template:

## Cost Impact (required for infrastructure and data changes)

- [ ] This change does not affect cloud resource usage
- [ ] New API calls introduced: estimated cost per call $______, calls/month ______
- [ ] New data storage: estimated monthly delta $______
- [ ] Cross-region data transfer introduced: yes / no
- [ ] New external service dependency with per-call pricing: yes / no

If any box other than the first is checked, add a cost estimate before requesting review.

The discipline is in making cost estimation a first-class review concern, not an afterthought that gets caught by the finance team on the 15th of the month.
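
The second checkbox in the template asks for two numbers, and multiplying them is the whole estimate. As a worked example, here is the arithmetic for the Lambda function from the introduction, using the figures quoted there. The function name is hypothetical; the price is passed as a string so that `Decimal` keeps it exact.

```python
from decimal import Decimal


def monthly_call_cost(calls_per_month: int, cost_per_call: str) -> Decimal:
    """Estimated monthly spend for a dependency with per-call pricing."""
    return Decimal(cost_per_call) * calls_per_month


# 32 million events per month at $0.0007 per enrichment call
estimate = monthly_call_cost(32_000_000, "0.0007")
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

That comes to $22,400 for the enrichment calls alone, which is most of the bill that opened this guide and would have been visible at review time.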

Stage 1 Outcomes

By the end of month 3, you should have a baseline cost breakdown on file, 100% tag coverage on active resources, identified your top 5 cost drivers with specific reduction targets, and blocked at least one expensive PR with a cost justification that held up in review.

Stage 2: The Optimisation Specialist — Months 4 to 8

2.1 Right-Sizing: The 80/20 of Cloud Savings

The single most reliable source of cloud waste I find in every account I audit is over-provisioned compute.

The pattern is consistent: an engineer provisions an instance at a size that handles their anticipated peak load, the peak never quite materialises at the expected scale, and nobody revisits the instance size because there's no automatic signal that says "this machine is 75% empty."

Make sure you verify actual utilisation before changing anything:

# rightsize_analyzer.py
# Finds EC2 instances running below 20% average CPU for 14 days
# These are right-sizing candidates — not automatic deletions

import boto3
from datetime import datetime, timedelta

def find_oversized_instances(region='us-east-1'):
    """
    Returns instances with average CPU below 20% for the last 14 days.
    Low CPU alone doesn't mean right-size — check memory too if CW agent installed.
    """
    ec2 = boto3.client('ec2', region_name=region)
    cw  = boto3.client('cloudwatch', region_name=region)

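    # Paginate so fleets larger than one API page are fully covered
    paginator = ec2.get_paginator('describe_instances')
    reservations = [
        r
        for page in paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        for r in page['Reservations']
    ]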

    candidates = []

    for r in reservations:
        for inst in r['Instances']:
            iid  = inst['InstanceId']
            itype = inst['InstanceType']
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}

            # Pull 14 days of daily CPU averages from CloudWatch
            end = datetime.utcnow()
            stats = cw.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': iid}],
                StartTime=end - timedelta(days=14),
                EndTime=end,
                Period=86400,     # One datapoint per day
                Statistics=['Average']
            )['Datapoints']

            # No datapoints means no evidence (for example, a just-launched
            # instance), so skip it rather than reporting it as 0% CPU
            if not stats:
                continue
            avg_cpu = sum(p['Average'] for p in stats) / len(stats)

            if avg_cpu < 20.0:
                candidates.append({
                    'instance_id':  iid,
                    'instance_type': itype,
                    'avg_cpu_pct':  round(avg_cpu, 1),
                    'environment':  tags.get('Environment', 'unknown'),
                    'owner':        tags.get('Owner', 'unknown'),
                    'team':         tags.get('Team', 'unknown'),
                })

    return sorted(candidates, key=lambda x: x['avg_cpu_pct'])

if __name__ == '__main__':
    results = find_oversized_instances()
    print(f"\nFound {len(results)} right-sizing candidates:\n")
    for r in results:
        print(f"  {r['instance_id']} ({r['instance_type']}) — "
              f"{r['avg_cpu_pct']}% avg CPU — "
              f"owner: {r['owner']}")

A word of caution: CPU utilisation below 20% is a signal, not a verdict. Some workloads are memory-intensive or I/O-bound and will show low CPU while being correctly sized. Before acting on any right-sizing recommendation, check memory utilisation (requires the CloudWatch agent) and network I/O patterns alongside CPU.
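
One way to keep that caution from being forgotten is to encode it, so that low CPU on its own never produces a downsize recommendation. The sketch below is an illustration, not a sizing rule: the function name and the 40% memory threshold are assumptions, and network I/O is left out for brevity.

```python
def rightsizing_verdict(avg_cpu_pct, avg_mem_pct=None, cpu_limit=20.0, mem_limit=40.0):
    """Classify an instance from its 14-day averages.

    avg_mem_pct is None when the CloudWatch agent is not installed,
    in which case no recommendation is made.
    """
    if avg_cpu_pct >= cpu_limit:
        return "keep"
    if avg_mem_pct is None:
        return "needs-memory-data"
    if avg_mem_pct >= mem_limit:
        return "keep"  # low CPU but memory is in use: likely sized for RAM
    return "downsize-candidate"


print(rightsizing_verdict(8.5))                    # no memory metrics available
print(rightsizing_verdict(8.5, avg_mem_pct=72.0))  # memory-bound workload
print(rightsizing_verdict(8.5, avg_mem_pct=15.0))  # idle on both dimensions
```

Only instances in the last category should go on the list you take to the owning team.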

2.2 Storage Tiering: Stop Paying Retail for Cold Data

S3 Standard costs $0.023 per GB per month. S3 Glacier Deep Archive costs $0.00099 per GB per month. The difference is a factor of 23. If you have data that you last accessed six months ago and you're keeping it in S3 Standard because nobody set up lifecycle policies, you're paying 23x more than necessary.

The complete S3 lifecycle policy for engineering teams:

{
  "Rules": [
    {
      "ID": "application-logs-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 90,  "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    },
    {
      "ID": "training-checkpoints-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "ml-checkpoints/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"}
      ],
      "Expiration": {"Days": 90}
    }
  ]
}

# Apply the lifecycle policy to a bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-logs-bucket \
  --lifecycle-configuration file://lifecycle.json

# Verify it applied correctly
aws s3api get-bucket-lifecycle-configuration \
  --bucket your-logs-bucket
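
To decide which buckets deserve a lifecycle policy first, it helps to put a dollar figure on the gap. The sketch below uses only the two per-GB prices quoted at the start of this section; the function name and the 10 TB example are hypothetical, and the estimate deliberately ignores request fees, retrieval fees, and minimum storage durations.

```python
from decimal import Decimal

# Per-GB monthly prices as quoted in this section
PRICE_PER_GB_MONTH = {
    "STANDARD": Decimal("0.023"),
    "DEEP_ARCHIVE": Decimal("0.00099"),
}


def monthly_storage_cost(size_gb: int, storage_class: str) -> Decimal:
    """Storage-only monthly cost for size_gb in the given class."""
    return PRICE_PER_GB_MONTH[storage_class] * size_gb


# Hypothetical bucket: 10 TB (10,000 GB) of logs nobody has read in six months
standard = monthly_storage_cost(10_000, "STANDARD")
archive = monthly_storage_cost(10_000, "DEEP_ARCHIVE")
print(f"S3 Standard:  ${standard:,.2f}/month")
print(f"Deep Archive: ${archive:,.2f}/month")
print(f"Difference:   ${standard - archive:,.2f}/month")
```

Run it against the size of each large bucket and start with the one where the difference is biggest.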

2.3 Savings Plans: The Sequence Is Everything

A Savings Plan is a commitment to spend a minimum dollar amount per hour on AWS compute for one or three years, in exchange for discounts of 30–70% off On-Demand rates. The discount is real. The trap is buying before optimising.

The wrong order: You have a $50,000/month EC2 bill. You buy a Savings Plan covering $35,000/month (commitments are set per hour, so roughly $48/hour). Then you implement right-sizing and Spot instances, and your actual spend drops to $22,000/month. You've committed to paying $35,000/month for 12 months against a need of $22,000. You're paying $13,000/month for compute you don't use, at a 30% discount. Congratulations on your discounted waste.

The right order:

Month 1-2: Right-size all instances using VPA and CloudWatch data
Month 3:   Move staging and development to Spot instances
Month 4:   Migrate compatible workloads to Graviton (20% cheaper)
Month 5:   Add VPC endpoints to eliminate NAT Gateway charges
Month 6:   THEN look at your steady-state On-Demand spend
Month 6+:  Purchase Savings Plans covering 70% of that optimised baseline
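
The arithmetic behind both orders is short enough to check before you sign anything. This is a rough sketch with hypothetical function names: it assumes 730 hours in an average month and ignores the Savings Plan discount itself, so treat the hourly figure as a starting point to compare against AWS's own recommendation rather than a number to purchase.

```python
HOURS_PER_MONTH = 730  # average hours in a month; an approximation


def hourly_commitment(optimised_monthly_usd: float, coverage: float = 0.70) -> float:
    """Hourly commitment that covers `coverage` of an optimised monthly baseline."""
    return round(optimised_monthly_usd * coverage / HOURS_PER_MONTH, 2)


def monthly_overcommit(committed_monthly_usd: float, actual_monthly_usd: float) -> float:
    """Monthly spend committed to but not needed; zero if usage exceeds the commitment."""
    return max(0, committed_monthly_usd - actual_monthly_usd)


# Wrong order, using the figures above: $35,000 committed, $22,000 needed
print(f"Discounted waste: ${monthly_overcommit(35_000, 22_000):,.0f}/month")

# Right order: commit to 70% of the optimised $22,000 baseline
print(f"Commit to about ${hourly_commitment(22_000):.2f}/hour")
```

Covering only 70% leaves headroom for further optimisation and for workloads that shrink during the commitment term.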

Calculate what to commit to:

# Get your On-Demand EC2 spend for the last 30 days
# This is your rightsized baseline — the number to commit against
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE",       "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On Demand Instances"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own recommendation for what to commit
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

Stage 3: The Automation Architect — Months 9 to 15

3.1 The Orphaned Resource Problem — And Why It Never Fixes Itself

Orphaned resources are the cloud equivalent of a gym membership you forgot to cancel. They exist, they charge you, but nobody notices until the annual audit.

The root cause isn't laziness. It's the absence of lifecycle management at the infrastructure layer. When an engineer spins up an EC2 instance for a one-week experiment and then leaves the company, there's no automatic signal that the instance is now orphaned. It sits there, billing $140/month, until someone hunts it down.

The fix is a weekly automated audit that surfaces candidates for deletion and notifies the registered owner, not a process change that depends on engineers remembering to clean up.

# orphan_reporter.py
# Runs every Sunday via EventBridge → Lambda
# Posts a Slack report of orphaned resources for human review
# DOES NOT auto-delete — deletion requires a human decision

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
UNATTACHED_VOLUME_AGE_DAYS = 14
SNAPSHOT_AGE_DAYS = 90


def find_orphaned_resources():
    ec2 = boto3.client('ec2')
    report = {'monthly_waste_usd': 0, 'items': []}

    # Unattached EBS volumes
    for vol in ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age >= UNATTACHED_VOLUME_AGE_DAYS:
            cost = round(vol['Size'] * 0.08, 2)  # gp3 rate
            tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}
            report['items'].append({
                'type':  'Unattached EBS Volume',
                'id':    vol['VolumeId'],
                'detail': f"{vol['Size']}GB {vol['VolumeType']} — {age} days old",
                'owner': tags.get('Owner', 'unknown'),
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    # Unassociated Elastic IPs
    for addr in ec2.describe_addresses()['Addresses']:
        if 'AssociationId' not in addr:
            report['items'].append({
                'type':  'Unassociated Elastic IP',
                'id':    addr.get('AllocationId', addr['PublicIp']),
                'detail': addr['PublicIp'],
                'owner': 'unknown',
                'monthly_cost_usd': 3.60,
            })
            report['monthly_waste_usd'] += 3.60

    # Old snapshots (compare timezone-aware datetimes directly, not ISO strings)
    cutoff = datetime.now(timezone.utc) - timedelta(days=SNAPSHOT_AGE_DAYS)
    for snap in ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']:
        if snap['StartTime'] < cutoff:
            cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
            report['items'].append({
                'type':  f'Snapshot ({SNAPSHOT_AGE_DAYS}+ days old)',
                'id':    snap['SnapshotId'],
                'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
                'owner': 'unknown',
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    return report


def post_to_slack(report):
    lines = [
        f":money_with_wings: *Weekly Orphaned Resource Report*",
        f"Found *{len(report['items'])} orphaned resources* "
        f"costing *${report['monthly_waste_usd']:.2f}/month*\n",
    ]
    for item in report['items'][:20]:  # Cap at 20 lines to stay readable
        lines.append(
            f"• `{item['type']}` {item['id']} — {item['detail']} "
            f"— *${item['monthly_cost_usd']:.2f}/mo* — owner: {item['owner']}"
        )
    lines.append("\nReview and delete anything no longer needed.")

    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({'text': '\n'.join(lines)}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)


def lambda_handler(event, context):
    report = find_orphaned_resources()
    post_to_slack(report)
    return {
        'items_found': len(report['items']),
        'monthly_waste': report['monthly_waste_usd'],
    }

3.2 Cost Estimation in Your CI/CD Pipeline

The goal is to catch expensive infrastructure changes at the PR stage — before they deploy and before they generate a billing surprise.

# .github/workflows/cost-check.yml
# Runs on any PR that touches infrastructure files
# Uses Infracost to estimate the monthly cost delta

name: Infrastructure Cost Check

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'infrastructure/**'
      - '*.tf'

jobs:
  cost-estimate:
    name: Estimate monthly cost change
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate cost estimate
        run: |
          infracost breakdown \
            --path terraform/ \
            --format json \
            --out-file /tmp/infracost.json

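      - name: Post cost diff to PR
        run: |
          infracost comment github \
            --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --github-token=${{ github.token }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --behavior=update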

      - name: Block if monthly increase exceeds threshold
        run: |
          MONTHLY_DELTA=$(jq -r '.projects[0].diff.totalMonthlyCost // "0"' \
            /tmp/infracost.json)

          echo "Estimated monthly cost change: \$$MONTHLY_DELTA"

          # Fail the PR if this change adds more than $500/month
          python3 -c "
          import sys
          delta = float('$MONTHLY_DELTA')
          if delta > 500:
              print(f'PR blocked: estimated +\${delta:.2f}/month exceeds \$500 threshold')
              sys.exit(1)
          else:
              print(f'Cost check passed: estimated +\${delta:.2f}/month')
          "

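The threshold logic in the last workflow step is easier to maintain if it lives in a small script that can be unit-tested outside the pipeline. The sketch below is one way to factor it out; the function name is hypothetical, and failing the check when no delta is found is a design assumption (the alternative is to let such PRs pass).

```python
def check_cost_delta(raw_delta, threshold_usd=500.0):
    """Return (passed, message) for the monthly cost delta string printed by jq."""
    text = (raw_delta or "").strip().strip('"')

    # Fail closed: a missing value usually means the estimate step broke
    if text in ("", "null"):
        return False, "No cost delta found in the Infracost output"

    delta = float(text)
    if delta > threshold_usd:
        return False, (
            f"PR blocked: estimated +${delta:.2f}/month "
            f"exceeds ${threshold_usd:.0f} threshold"
        )
    return True, f"Cost check passed: estimated change {delta:+.2f} USD/month"


for sample in ("742.10", "-120", "null"):
    passed, message = check_cost_delta(sample)
    print("PASS" if passed else "FAIL", "-", message)
```

In the workflow, the step would call this script and exit non-zero when `passed` is false.
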
Stage 4: The Cloud Financial Manager — Months 16 to 24

4.1 Leading FinOps Reviews with Executives

By month 16, you have the data. What changes at Stage 4 is the audience. You're no longer presenting to engineers who understand instance types and NAT Gateway pricing. You're presenting to a CTO who wants to know if the infrastructure investment is proportional to the business value it produces, and a CFO who wants to know when the line will stop going up.

The vocabulary shift is simple but important. You stop saying "we right-sized our EC2 instances" and start saying "we reduced our infrastructure unit cost by 28% while maintaining the same request throughput." You stop saying "we eliminated NAT Gateway charges" and start saying "we closed a $6,400/month gap between what we were paying and what was necessary."

The metric that anchors every executive FinOps conversation is cost per business unit: not the total bill, but cost per API call, cost per user, cost per transaction, or cost per model inference. That ratio tells the story of whether your infrastructure efficiency is improving as the business scales.

# unit_economics.py
# Calculate cost per transaction — the metric that matters to leadership

import boto3
from datetime import datetime, timedelta

def calculate_cost_per_transaction(service_name, transaction_count, days_back=30):
    """
    Returns cost per transaction for a given service over the last N days.
    transaction_count: total transactions for the same period (from your metrics)
    """
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        Filter={
            'Tags': {
                'Key':    'Service',
                'Values': [service_name]
            }
        }
    )

    total_cost = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in response['ResultsByTime']
    )

    cost_per_txn = total_cost / transaction_count if transaction_count > 0 else 0

    return {
        'service':           service_name,
        'period_days':       days_back,
        'total_cost_usd':    round(total_cost, 2),
        'transactions':      transaction_count,
        'cost_per_txn_usd':  round(cost_per_txn, 6),
    }


# Example: payment service processed 4.2M transactions this month
result = calculate_cost_per_transaction('payment-api', 4_200_000)
print(f"Cost per transaction: ${result['cost_per_txn_usd']:.6f}")
print(f"Total infrastructure cost: ${result['total_cost_usd']:,.2f}")
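
A single cost-per-transaction figure means little to leadership until it is compared with the previous period. The helper below is a minimal sketch of that comparison; the function name is hypothetical, and the sample values are the before-and-after figures this guide uses in its best practices.

```python
def unit_cost_change_pct(previous_cost_per_txn: float, current_cost_per_txn: float) -> float:
    """Percentage change in unit cost; negative means the service got cheaper per transaction."""
    if previous_cost_per_txn <= 0:
        raise ValueError("previous_cost_per_txn must be positive")
    return (current_cost_per_txn - previous_cost_per_txn) / previous_cost_per_txn * 100


# Cost per transaction moved from $0.0021 to $0.0013
change = unit_cost_change_pct(0.0021, 0.0013)
print(f"Unit cost change: {change:+.1f}%")
```

A falling unit cost alongside a rising total bill is a healthy pattern: the business is growing faster than its infrastructure spend.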

4.2 The Chargeback and Showback Models

Chargeback means actually billing departments for their cloud usage. Showback means showing departments their usage costs without the internal billing transfer. Both create the same outcome: engineers start caring about what they consume because someone they work with is paying attention to it.

# showback_report.py
# Generates monthly cost-by-team report for distribution to engineering leads

import boto3
from datetime import datetime

def generate_team_showback():
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG',       'Key': 'Team'},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
        ]
    )

    by_team = {}
    for group in response['ResultsByTime'][0].get('Groups', []):
        team    = group['Keys'][0].replace('Team$', '') or 'untagged'
        service = group['Keys'][1]
        cost    = float(group['Metrics']['UnblendedCost']['Amount'])

        if team not in by_team:
            by_team[team] = {'total': 0, 'services': {}}
        by_team[team]['total'] += cost
        by_team[team]['services'][service] = round(cost, 2)

    # Print sorted by total cost descending
    print(f"\n{'='*52}")
    print(f"  Month-to-Date Cloud Spend by Team")
    print(f"  Generated: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"{'='*52}\n")

    for team, data in sorted(by_team.items(), key=lambda x: x[1]['total'], reverse=True):
        print(f"  {team:<20} ${data['total']:>10,.2f}/month")
        top_services = sorted(data['services'].items(), key=lambda x: x[1], reverse=True)[:3]
        for svc, cost in top_services:
            print(f"    └─ {svc:<30} ${cost:>8,.2f}")
    print()

generate_team_showback()

Essential Tools and Certifications

The tools that matter at each stage of this roadmap:

Stage	Tool	Why It Matters
1	AWS Cost Explorer	Free, built-in, the starting point for all cost analysis
1	AWS CLI `ce` commands	Scriptable cost queries you can automate, unlike console dashboards
2	AWS Compute Optimizer	ML-powered rightsizing recommendations for EC2 and RDS
2	VPA (Kubernetes)	Pod-level rightsizing recommendations using actual usage
3	Infracost	PR-level cost estimation for Terraform changes
3	AWS Budgets	Proactive alerts — catches problems before the monthly invoice
4	AWS Cost and Usage Report + Athena	SQL-level billing analysis at any granularity
4	CloudHealth or Vantage	Multi-account, multi-cloud cost management

The one certification worth your time: FinOps Certified Practitioner from the FinOps Foundation. It takes 20 hours to prepare and $300 to sit. It signals to hiring managers and clients that you understand the discipline formally — which matters when you're the person leading FinOps conversations at the executive level.

Your 90-Day Action Plan

Month 1 — Foundation:

Enable Cost Explorer if it isn't already on. Pull the baseline command from Section 1.1 and save the output. Run the untagged resource query from Section 1.2 and document how many resources are missing tags. Find your top three cost drivers. Present the findings to your engineering manager — not as a problem, but as an opportunity with a dollar figure attached.

Month 2 — Quick Wins:

Run the rightsizing analyser from Section 2.1 on your EC2 fleet. Downsize the three highest-confidence candidates. Apply S3 lifecycle policies to your two largest buckets. Create VPC endpoints for S3, ECR, and DynamoDB. Estimate the savings from each action and document them against your baseline.

Month 3 — Automation and Habits:

Deploy the orphan reporter Lambda on a Sunday schedule. Add the cost check GitHub Action to your infrastructure repository. Start a monthly FinOps review meeting — even if it's just you and one other engineer. Build the habit before you need the audience.

Best Practices Summary

✅ Do: Establish a cost baseline before any optimisation. The number is meaningless without a comparison point.

✅ Do: Right-size before buying Savings Plans. Always. The sequence changes the outcome.

✅ Do: Enforce tagging at the infrastructure layer — Terraform or CloudFormation — not as a process reminder.

✅ Do: Move staging and development to Spot instances. The interruption rate is manageable; a cost difference of up to 70% is too large to ignore.

✅ Do: Add VPC endpoints for S3, ECR, and DynamoDB before reviewing data transfer costs. It's a 30-minute fix for a multi-thousand-dollar line item.

✅ Do: Present cost findings as cost-per-business-metric, not as total bill. "We reduced cost per transaction from $0.0021 to $0.0013" is a business result. "$38,000/month reduction" is an accounting result.

❌ Don't: Buy Savings Plans on an unoptimised baseline. You'll lock in discounted waste.

❌ Don't: Build FinOps dashboards before tagging is complete. Beautiful charts with no attribution data answer no questions.

❌ Don't: Run orphaned resource cleanup without human review first. Run in report-only mode for two weeks, verify the candidates are genuinely orphaned, then add deletion logic.

Resources

FinOps Foundation Framework — The practitioner framework that defines the Inform, Optimise, and Operate cycle this roadmap is built on
AWS Cost Explorer API Reference — Full reference for the cost query commands used throughout this guide
AWS Compute Optimizer — AWS's own rightsizing recommendation service; complements the manual analysis in Stage 2
Infracost Documentation — Setup guide for the PR-level cost estimation tool in Stage 3
FinOps Certified Practitioner Exam — The certification referenced in the tools section
AWS Savings Plans Documentation — The authoritative reference on commitment types, coverage rules, and purchase strategy
Companion Repository — All scripts from this guide, including the rightsizing analyser, orphan reporter, and showback report generator

Ayobami Adejumo is a senior platform engineer and FinOps consultant. He has audited AWS infrastructure for 20+ Series A and Series B companies. He is an active FinOps Foundation Supporter.

The AWS FinOps Guide for Series A Startups: The 8 Cost Patterns That Appear After Product-Market Fit

Ayobami Adejumo — Tue, 02 Jun 2026 16:27:27 +0000

You raised your Series A. Engineering hired fast. Features shipped faster. And somewhere between month six and month twelve, someone forwarded you an AWS Cost Explorer screenshot with a line that only goes up.

That line isn't random. It follows a pattern. The same eight patterns, at the same growth stage, at almost every company I've audited.

This guide names all eight, shows you exactly where to look, and gives you the fix for each one. By the time you finish reading, you'll know which leaks are draining your runway — and what to do about them this week.

Who This Guide Is For
Before You Start: Establish Your Baseline
Pattern 1: The New Hire Experiment Tax
Pattern 2: Staging Environment Proliferation
Pattern 3: The NAT Gateway Tax
Pattern 4: The Savings Plan Timing Mistake
Pattern 5: Cross-AZ Data Transfer
Pattern 6: The gp2 Volume Trap
Pattern 7: The Infinite Log Trap
Pattern 8: The Orphaned Resource Collector
The Full Savings Summary
What to Do This Week
Resources

Who This Guide Is For

This guide is written for engineers, CTOs, and technical co-founders at Series A companies — typically 15 to 80 engineers, AWS bills between $20,000 and $150,000 per month, and a finance team that has recently started paying attention to the infrastructure line.

You don't need a dedicated FinOps team. You need one engineer, one afternoon per week, and the eight patterns in this guide.

What you should have before starting:

AWS account access with Cost Explorer enabled
AWS CLI v2 configured (aws configure)
Basic familiarity with EC2, RDS, EBS, and S3
A Cost Explorer bookmark — you will use it constantly

Estimated time to complete all fixes: 8–20 engineering hours spread across two sprints. The reading takes around 20 minutes. The highest-ROI fix (Pattern 3) takes about 30 minutes.

Before You Start: Establish Your Baseline

Don't skip this step. Optimization without a baseline is just guessing. Run this command before touching anything:

# Pull last month's AWS cost breakdown by service
# This becomes your before number — save it somewhere
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].{Service:Keys[0],Cost:Metrics.UnblendedCost.Amount}' \
  --output text | sort -rn

Then screenshot the output. Name the file aws-baseline-YYYY-MM.png. You'll compare against this after each fix to verify actual savings.
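If you would rather keep the baseline as data than as a screenshot, the same query can be sketched in Python. This is a minimal sketch, assuming boto3 is installed and credentials are configured. The helper names last_month_period, rank_services, and fetch_last_month are hypothetical, chosen for illustration only.

```python
from datetime import date


def last_month_period(today: date) -> dict:
    """Return the Cost Explorer TimePeriod covering the previous calendar month."""
    end = today.replace(day=1)  # first day of this month (End is exclusive)
    if end.month == 1:
        start = date(end.year - 1, 12, 1)
    else:
        start = date(end.year, end.month - 1, 1)
    return {"Start": start.isoformat(), "End": end.isoformat()}


def rank_services(response: dict) -> list:
    """Turn a get_cost_and_usage response into (service, cost) pairs, highest cost first."""
    groups = response["ResultsByTime"][0]["Groups"]
    rows = [(g["Keys"][0], float(g["Metrics"]["UnblendedCost"]["Amount"])) for g in groups]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_last_month(today: date) -> dict:
    """Call Cost Explorer. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the pure helpers above run without it

    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod=last_month_period(today),
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
```

Calling rank_services(fetch_last_month(date.today())) gives you a sorted list you can write to a file and diff after each fix.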

The typical breakdown at Series A looks like this:

AWS Service	% of Bill	Waste Potential
EC2 (compute)	45–55%	High
Data Transfer	15–20%	Very High
RDS	10–15%	Medium
EBS	8–12%	Medium
CloudWatch	3–6%	Medium
Load Balancers	3–5%	Low

Now let's go through each pattern.

Pattern 1: The New Hire Experiment Tax

Every engineering hire needs a development environment. This is expected. What's not expected is what happens after the feature ships: nothing.

The environment keeps running. At $0.192/hour for an m5.xlarge, a forgotten dev environment costs $138/month. Ten engineers who each forgot one environment is $1,380/month — for infrastructure that's doing precisely nothing.
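The arithmetic behind those figures is worth keeping as a reusable helper. A minimal sketch, assuming a 30-day month (720 hours) and the On-Demand rate quoted above; idle_cost_per_month is a hypothetical name.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, the approximation used in the text


def idle_cost_per_month(hourly_rate: float, instance_count: int = 1) -> float:
    """Monthly cost of instances left running around the clock."""
    return round(hourly_rate * HOURS_PER_MONTH * instance_count, 2)


print(idle_cost_per_month(0.192))      # one forgotten m5.xlarge
print(idle_cost_per_month(0.192, 10))  # ten of them
```

Swap in the hourly rate for your own instance types to size the problem before building the fix.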

This pattern accelerates after a Series A because hiring moves fast. A new engineer joins on Monday, spins up an EC2, an RDS, and a namespace in the dev cluster, ships the feature by Friday, and moves to the next ticket. The environment isn't on anyone's radar. There's no off-boarding process for dev resources.

What the waste looks like:

Dev environment for Alice (feature/payment-flow):
  EC2 m5.xlarge — last CPU activity: 23 days ago
  RDS db.t3.medium — last connection: 19 days ago
  EKS namespace — last pod scheduled: 15 days ago
  Monthly cost: $187
  Status: running

Finding it:

# Get the 14-day average CPU for one instance (replace YOUR_INSTANCE_ID)
# An average below 5% marks the instance as idle: a candidate for shutdown or termination
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --period 1209600 \
  --statistics Average \
  --start-time $(date -d '14 days ago' --iso-8601=seconds) \
  --end-time $(date --iso-8601=seconds) \
  --dimensions Name=InstanceId,Value=YOUR_INSTANCE_ID \
  --query 'Datapoints[*].{Average:Average}' \
  --output table

The Fix — an Automatic Idle Instance Stopper:

The Lambda below runs every night at 22:00 UTC (EventBridge cron rules run in UTC). It checks every running EC2 instance tagged Environment=dev or Environment=development for average CPU utilisation over the past seven days. Any instance averaging below 5% is stopped automatically, and an SNS notification naming the instance and its owner is published just before the stop. Engineers who need an instance left running can exempt it from future runs by adding a KeepAlive=true tag.

# idle_environment_stopper.py
# Deploy as a Lambda function triggered by EventBridge on schedule: cron(0 22 * * ? *)
# This stops idle dev environments before they run through the night and weekend

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')

IDLE_CPU_THRESHOLD = 5.0      # Stop instances below this average CPU %
IDLE_DAYS = 7                  # Look back 7 days of CloudWatch data
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:YOUR_ACCOUNT:dev-environment-alerts'

def get_average_cpu(instance_id):
    """Return the 7-day average CPU utilisation for an EC2 instance."""
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=IDLE_DAYS),
        EndTime=datetime.now(timezone.utc),
        Period=604800,  # One 7-day period
        Statistics=['Average']
    )
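    datapoints = response.get('Datapoints', [])
    # No datapoints (for example, an instance launched minutes ago): report 100%
    # so the caller never treats missing data as idle
    return datapoints[0]['Average'] if datapoints else 100.0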

def lambda_handler(event, context):
    """Stop idle dev instances and notify their owners."""
    
    # Find all running dev instances
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag:Environment', 'Values': ['dev', 'development']},
        ]
    )

    stopped = []
    skipped = []

    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}

            # Skip instances explicitly marked to keep alive
            if tags.get('KeepAlive', '').lower() == 'true':
                skipped.append(instance_id)
                continue

            avg_cpu = get_average_cpu(instance_id)

            if avg_cpu < IDLE_CPU_THRESHOLD:
                # Notify the owner before stopping
                owner = tags.get('Owner', 'unknown')
                sns.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Subject=f'Dev environment stopped: {instance_id}',
                    Message=(
                        f'Instance {instance_id} (Owner: {owner}) had {avg_cpu:.1f}% average CPU '
                        f'over {IDLE_DAYS} days and has been stopped.\n\n'
                        f'To prevent this, add the tag: KeepAlive=true\n'
                        f'To restart: aws ec2 start-instances --instance-ids {instance_id}'
                    )
                )
                ec2.stop_instances(InstanceIds=[instance_id])
                stopped.append({'id': instance_id, 'owner': owner, 'avg_cpu': avg_cpu})

    print(f"Stopped {len(stopped)} idle instances. Skipped {len(skipped)} keep-alive instances.")
    return {'stopped': stopped, 'skipped': skipped}

Monthly savings: $1,000–$2,000 depending on team size and how long the pattern has been running.

Pattern 2: Staging Environment Proliferation

Staging starts as one environment. Then the frontend team needs their own because the backend team keeps breaking theirs. Then the ML team needs isolated compute. Then QA needs a stable environment for integration tests.

Before anyone notices, you have four staging environments running 24/7, each one idle for roughly 16 hours of every day.

The waste isn't in the existence of the environments. It's in the schedule. Staging environments don't need to run at 3am.

What the waste looks like:

staging-frontend:   $250/month   Used: Mon-Fri 09:00-18:00
staging-backend:    $250/month   Used: Mon-Fri 09:00-18:00
staging-ml:         $250/month   Used: Mon-Fri 10:00-17:00
staging-qa:         $250/month   Used: Mon-Fri 09:00-17:00
Total:            $1,000/month   Running: 24 hours/day, 7 days/week
Actual usage:        ~25%        You are paying 100%

Finding it:

# Find EKS node groups tagged as staging with their current status
aws eks list-nodegroups --cluster-name your-cluster-name --output table

# Check EC2 instances tagged staging and their launch time
# Any instance running > 30 days with no weekend stop schedule is a candidate
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=staging" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{ID:InstanceId,Type:InstanceType,Launch:LaunchTime}' \
  --output table

The Fix — Scheduled Start and Stop with AWS Instance Scheduler:

# Option 1: Tag-based scheduling with AWS Instance Scheduler (CloudFormation solution)
# Add these tags to your staging EC2 instances and RDS clusters:
# Schedule: office-hours
# This starts instances at 08:00 and stops them at 20:00 Mon-Fri
# Weekend: completely off

# Option 2: Quick Lambda-based solution — stop all staging at 20:00 weekdays
aws events put-rule \
  --schedule-expression "cron(0 20 ? * MON-FRI *)" \
  --name stop-staging-environments \
  --state ENABLED

# The stop Lambda — same pattern as Pattern 1 but targets staging tag
# Add a corresponding start rule at 07:30 Mon-Fri

Consolidation in Addition to Scheduling

If frontend and backend share a database schema, consolidate them into one shared staging environment with namespace-level isolation. The combined cost is lower than two separate environments:

# One shared staging cluster with namespace isolation
# frontend-staging and backend-staging share nodes via Karpenter
# but are isolated by namespace-level network policies
apiVersion: v1
kind: Namespace
metadata:
  name: staging-frontend
  labels:
    environment: staging
    team: frontend
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging-backend
  labels:
    environment: staging
    team: backend

The math:

Scenario	Monthly cost
Before: 4 environments, always on	$1,000
After: 2 consolidated environments, office hours only	$290
Monthly savings	$710
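To sanity-check a schedule before you roll it out, compute the fraction of the week it keeps an environment running. A minimal sketch; running_fraction is a hypothetical helper, and the 08:00 to 20:00 window is the office-hours schedule described in Option 1.

```python
HOURS_PER_WEEK = 7 * 24


def running_fraction(start_hour: float, stop_hour: float, days_per_week: int) -> float:
    """Fraction of the week an environment is up under a daily start/stop schedule."""
    return (stop_hour - start_hour) * days_per_week / HOURS_PER_WEEK


# The office-hours schedule from the fix: 08:00 to 20:00, Monday to Friday
office = running_fraction(8, 20, 5)
print(f"{office:.1%} of the week")
```

Hourly compute charges scale with this fraction. Storage such as EBS volumes and RDS storage is billed while instances are stopped, so the real bill lands somewhat above this fraction of the always-on cost.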

Pattern 3: The NAT Gateway Tax

NAT Gateway is the most consistently underestimated line item on every AWS bill I've audited. It charges $0.045 per GB of data processed — and in EKS clusters, a staggering amount of traffic flows through it by default.

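Every pod in a private subnet that pulls a container image from ECR goes through NAT Gateway. Every VPC-attached Lambda that writes to S3 goes through NAT Gateway. Every service that polls SQS, queries DynamoDB, or calls the Secrets Manager API goes through NAT Gateway, unless you have configured VPC endpoints.

VPC endpoints create a private connection between your VPC and the AWS service. Traffic routes through the AWS backbone instead of NAT Gateway. Gateway endpoints (S3 and DynamoDB) carry no hourly or data processing charge, so that traffic becomes free. Interface endpoints (ECR and most other services) bill per hour and per GB, but at a lower per-GB rate than NAT Gateway.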

What the waste looks like:

# Run this to see your current NAT Gateway data processing bill
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["NatGateway-Bytes", "NatGateway-Hours"]
    }
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

If this number is above $200, you have a NAT Gateway problem. At most Series A companies running EKS, it is between $800 and $6,000.

The Fix — VPC Endpoints for the Four Highest-traffic AWS Services:

# Get your VPC ID and route table ID first
VPC_ID=$(aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=your-vpc-name" \
  --query 'Vpcs[0].VpcId' --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=$VPC_ID" "Name=association.main,Values=true" \
  --query 'RouteTables[0].RouteTableId' --output text)

# S3 gateway endpoint — free to create, eliminates all S3 NAT charges
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids $ROUTE_TABLE_ID

# DynamoDB gateway endpoint — also free
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids $ROUTE_TABLE_ID

# ECR API endpoint — eliminates NAT charges on every container pull
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Tier,Values=private" \
    --query 'Subnets[*].SubnetId' --output text)

# ECR Docker endpoint — required alongside ECR API for image pulls
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Tier,Values=private" \
    --query 'Subnets[*].SubnetId' --output text)

When explaining this to your CFO, call it the NAT tax. They understand taxes. "We're paying a $0.045/GB tax on internal network traffic that we can eliminate in 30 minutes" lands better than "data processing bytes."

Monthly savings: $2,000–$8,000 depending on your container pull frequency and S3 usage.
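To estimate your own number, multiply the traffic you can move off NAT Gateway by the processing rate. A minimal sketch, assuming the $0.045/GB rate quoted above. The traffic figures are hypothetical placeholders, and the function names are mine. Interface endpoint savings are left out because those endpoints carry their own charges.

```python
NAT_USD_PER_GB = 0.045  # NAT Gateway data processing rate quoted in the text

# Services reachable through free gateway endpoints
GATEWAY_ENDPOINT_SERVICES = {"s3", "dynamodb"}


def nat_processing_cost(gb: float) -> float:
    """NAT Gateway data processing charge for a given volume of traffic."""
    return round(gb * NAT_USD_PER_GB, 2)


def gateway_endpoint_savings(traffic_gb_by_destination: dict) -> float:
    """Processing charges removed by adding S3 and DynamoDB gateway endpoints."""
    movable = sum(
        gb for dest, gb in traffic_gb_by_destination.items()
        if dest in GATEWAY_ENDPOINT_SERVICES
    )
    return nat_processing_cost(movable)


# Hypothetical monthly traffic through NAT Gateway, in GB
traffic = {"s3": 30_000, "dynamodb": 5_000, "ecr": 8_000, "internet": 2_000}
print(nat_processing_cost(sum(traffic.values())))  # total processing charge
print(gateway_endpoint_savings(traffic))           # portion the two gateway endpoints remove
```

Replace the traffic dictionary with figures from VPC Flow Logs or Cost Explorer before quoting a number to anyone.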

Pattern 4: The Savings Plan Timing Mistake

A Savings Plan is a commitment to spend a fixed dollar amount per hour on AWS compute for one or three years in exchange for a 30–70% discount. The math is attractive. The timing is where teams go wrong.

When the bill gets large, the instinct is to commit. Buy the Savings Plan, reduce the bill, show the CFO. The problem: if you haven't rightsized first, you're committing to pay for waste at a discount. When you rightsize later, your actual spend drops below your commitment — and you pay for compute you're not using.

What wrong order looks like:

Step 1: AWS bill is $100,000/month
Step 2: Buy a Savings Plan commitment worth $70,000/month
Step 3: Rightsize instances — actual spend drops to $60,000
Step 4: Savings Plan covers $70,000 but you only use $60,000
Step 5: You pay $28,000/month for compute you do not use
         (assuming a 30% discount, $60,000 of On-Demand usage consumes
          only $42,000 of the commitment; the other $28,000 is unused)
         
Net result: You locked in waste for 12 months

What right order looks like:

Step 1: Rightsize instances — spend drops from $100,000 to $60,000
Step 2: Add Spot for staging — spend drops from $60,000 to $45,000
Step 3: Migrate compatible workloads to Graviton — spend drops to $36,000
Step 4: NOW buy a Savings Plan covering $25,000/month (70% of steady-state)
Step 5: Effective monthly cost: $12,500 for committed + $11,000 on-demand = $23,500

Net result: $76,500/month saved versus the original bill

How to check what you should commit to:

# View your last 30 days of EC2 On-Demand spend
# This is your rightsized baseline — what you actually use after optimisation
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On Demand Instances"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own Savings Plan recommendation based on your usage
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS

As a rule, commit to 60–70% of your steady-state On-Demand spend after optimisation. Leave 30–40% flexible. Never commit on the unoptimised baseline.
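The two scenarios above reduce to a few lines of arithmetic. A minimal sketch with hypothetical function names. The 30% and 50% discount rates are assumptions inferred from the dollar figures in the examples, not published AWS rates.

```python
def unused_commitment(commitment: float, on_demand_equivalent: float, discount: float) -> float:
    """Monthly commitment that is paid for but not consumed.

    Usage draws down the commitment at the discounted rate, so $1 of
    On-Demand-equivalent usage consumes only (1 - discount) dollars.
    """
    consumed = on_demand_equivalent * (1 - discount)
    return round(max(commitment - consumed, 0.0), 2)


def blended_monthly_cost(steady_state: float, covered: float, discount: float) -> float:
    """Cost when `covered` dollars of On-Demand-equivalent usage get the discount."""
    return round(covered * (1 - discount) + (steady_state - covered), 2)


# Wrong order: $70,000 commitment, usage rightsized to $60,000, 30% discount assumed
print(unused_commitment(70_000, 60_000, 0.30))

# Right order: $36,000 steady state, $25,000 of it covered, 50% discount assumed
print(blended_monthly_cost(36_000, 25_000, 0.50))
```

Run both functions with your own numbers before signing a commitment. If unused_commitment returns anything above zero, the commitment is too large for your current usage.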

Monthly savings: $5,000–$15,000 depending on compute spend. This is the pattern with the highest single-action ROI when sequenced correctly.

Pattern 5: Cross-AZ Data Transfer

AWS charges $0.01 per GB in each direction when data crosses an Availability Zone boundary. $0.01 sounds negligible. It's not — because AZ boundaries are crossed constantly in distributed systems, and the charge is bidirectional.

The most common scenario: your application pods are scheduled across multiple AZs (as they should be for resilience), but your database is pinned to one AZ. Every database query from a pod in a different AZ costs $0.01/GB going to the database and $0.01/GB coming back. At 100GB of database traffic per day, that's $60/month. At 1TB per day, it is $600/month.
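A quick way to size this for your own workload, as a minimal sketch: cross_az_monthly_cost is a hypothetical helper, and cross_az_share is the fraction of traffic that actually crosses a zone boundary, which you would estimate from your pod distribution.

```python
CROSS_AZ_USD_PER_GB_EACH_WAY = 0.01  # rate quoted in the text


def cross_az_monthly_cost(gb_per_day: float, cross_az_share: float = 1.0, days: int = 30) -> float:
    """Both directions are billed, so each GB that crosses costs 2 x $0.01."""
    billable_gb = gb_per_day * cross_az_share * days
    return round(billable_gb * 2 * CROSS_AZ_USD_PER_GB_EACH_WAY, 2)


print(cross_az_monthly_cost(100))           # 100 GB/day, all of it crossing AZs
print(cross_az_monthly_cost(1000))          # 1 TB/day
print(cross_az_monthly_cost(100, 2 / 3))    # pods spread evenly over three AZs, database in one
```

The third call reflects the common case where only pods outside the database's AZ generate the charge.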

What the waste looks like:

# Check current cross-AZ data transfer charges
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --filter '{"Dimensions": {"Key": "USAGE_TYPE", "Values": ["DataTransfer-Regional-Bytes"]}}'  \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
  --output text

How to find which pods are causing the cross-AZ traffic:

# Check which AZ your database RDS instance is in
aws rds describe-db-instances \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,AZ:AvailabilityZone}' \
  --output table

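# Count application pods per node (column 7 of the wide output is the node name)
kubectl get pods -o wide -n production --no-headers | awk '{print $7}' | sort | uniq -c

# Map each node to its AZ using the standard zone label
kubectl get nodes -L topology.kubernetes.io/zone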

If your RDS is in us-east-1a and 60% of your pods are in us-east-1b and us-east-1c, you have a cross-AZ traffic problem.

The Fix — Topology-aware Routing:

# topology-aware-routing.yaml
# This tells Kubernetes to prefer routing Service traffic to endpoints
# in the same AZ as the calling pod, keeping traffic local

apiVersion: v1
kind: Service
metadata:
  name: payment-api
  namespace: production
  annotations:
    # Route traffic to pods in the same AZ as the caller when possible
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: payment-api
  ports:
  - port: 8080
    targetPort: 8080

# For the pods themselves: this fragment goes under spec.template.spec
# in the payment-api Deployment. topologySpreadConstraints keeps replicas
# evenly spread, so every AZ has local endpoints for the routing above to use

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payment-api

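For database traffic specifically, consider migrating from single-AZ RDS to Aurora. Aurora replicates data across AZs without a transfer charge, and your application connects to cluster endpoints rather than a single pinned instance. Traffic between a pod and a database instance in a different AZ is still billed as regional data transfer, so keep the heaviest clients in the writer's AZ where you can.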

Monthly savings: $500–$6,000 depending on database query volume and AZ distribution of your pods.

Pattern 6: The gp2 Volume Trap

In 2014, AWS launched gp2 EBS volumes. In 2020, they launched gp3 — cheaper, faster, and with better baseline performance. In 2026, most Series A companies are still running gp2.

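The difference: gp2 costs $0.10/GB/month and provides 3 IOPS per GB (100 IOPS minimum). gp3 costs $0.08/GB/month and provides 3,000 IOPS baseline regardless of size. gp3 is 20% cheaper, and for a typical 100 GB volume it delivers 10x the baseline IOPS (3,000 versus 300). Only gp2 volumes larger than 1,000 GB have a higher baseline than gp3's default; for those, provision extra gp3 IOPS when you migrate. The migration is online: it runs while the volume is attached and in use.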
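Before migrating, it helps to see per volume what you save and whether the gp3 default matches the IOPS you have today. A minimal sketch using the prices quoted above; the function names are hypothetical, and the saving covers storage only, not any extra IOPS you choose to provision.

```python
GP2_USD_PER_GB = 0.10
GP3_USD_PER_GB = 0.08
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline: 3 IOPS per GB, minimum 100, capped at 16,000."""
    return min(max(3 * size_gb, 100), 16000)


def migration_summary(size_gb: int) -> dict:
    """Storage saving and IOPS check for moving one volume from gp2 to gp3."""
    gp2_iops = gp2_baseline_iops(size_gb)
    return {
        "monthly_saving_usd": round(size_gb * (GP2_USD_PER_GB - GP3_USD_PER_GB), 2),
        "gp2_iops": gp2_iops,
        "needs_extra_gp3_iops": gp2_iops > GP3_BASELINE_IOPS,
    }


print(migration_summary(100))   # small volume: gp3 default is already faster
print(migration_summary(2000))  # large volume: review IOPS before migrating
```

Any volume flagged needs_extra_gp3_iops deserves a look at its actual CloudWatch IOPS before you accept the default.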

Finding all your gp2 volumes:

# List every gp2 volume in your account with its size in GB
# Estimated monthly cost per volume = Size x $0.10 (the CLI query language cannot multiply)
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].{
    ID:VolumeId,
    SizeGB:Size,
    State:State
  }' \
  --output table

# Count the total: number of volumes and combined GB
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'length(Volumes)' --output text

aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'sum(Volumes[*].Size)' --output text

The Fix — Migrate All gp2 to gp3 in One Script:

#!/bin/bash
# migrate_gp2_to_gp3.sh
# Migrates all gp2 volumes to gp3. Online operation — no downtime.
# Each modification runs asynchronously; the volume stays available throughout.
# Note: gp2 volumes over 1,000 GB have a baseline above 3,000 IOPS. Review those
# first and pass --iops explicitly if they need more than the gp3 default.

echo "Starting gp2 to gp3 migration..."

# Get all gp2 volume IDs
VOLUMES=$(aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text)

COUNT=0
for VOL_ID in $VOLUMES; do
  echo "Migrating $VOL_ID to gp3..."
  aws ec2 modify-volume \
    --volume-id $VOL_ID \
    --volume-type gp3 \
    --no-cli-pager
  COUNT=$((COUNT + 1))
done

echo "Migration initiated for $COUNT volumes."
echo "Modifications run online — no downtime. Monitor progress:"
echo "aws ec2 describe-volumes-modifications --query 'VolumesModifications[*].{ID:VolumeId,State:ModificationState}'"

Verify completion:

# Check that no gp2 volumes remain
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'length(Volumes)' \
  --output text
# Expected: 0

Monthly savings: 20% of your gp2 storage spend. At $10,000/month in gp2 volumes, that's $2,000 saved for 30 minutes of work.

Pattern 7: The Infinite Log Trap

CloudWatch log groups have a default retention policy of "Never expire." Every log group created without an explicit retention setting accumulates logs indefinitely. For a busy Series A company, this means you're storing debug logs from 2022 that nobody has opened since the sprint review they were created for.

The cost compounds quietly. CloudWatch charges $0.03/GB/month for log storage and $0.50/GB for log ingestion. A cluster generating 50GB of logs per day ingests $25/day — $750/month — and then stores those logs forever at an increasing monthly cost.
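To see how the storage side compounds, model both charges together. A minimal sketch using the rates quoted above; monthly_log_cost is a hypothetical helper. It ignores compression, so treat the storage figure as an upper bound.

```python
INGEST_USD_PER_GB = 0.50
STORAGE_USD_PER_GB_MONTH = 0.03


def monthly_log_cost(gb_per_day: float, months_elapsed: int, retention_days=None) -> dict:
    """Ingestion is flat each month; storage grows until a retention policy caps it."""
    ingestion = gb_per_day * 30 * INGEST_USD_PER_GB
    days_stored = months_elapsed * 30
    if retention_days is not None:
        days_stored = min(days_stored, retention_days)
    storage = gb_per_day * days_stored * STORAGE_USD_PER_GB_MONTH
    return {"ingestion": round(ingestion, 2), "storage": round(storage, 2)}


print(monthly_log_cost(50, 12))                     # one year in, never expiring
print(monthly_log_cost(50, 12, retention_days=30))  # same cluster with 30-day retention
```

Note that retention only touches the storage line. The ingestion line falls only when you log less.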

Finding log groups with no retention policy:

# List all log groups with their retention settings
# Any group showing "retentionInDays: null" is infinite — it never expires
aws logs describe-log-groups \
  --query 'logGroups[*].{Name:logGroupName,RetentionDays:retentionInDays,StoredBytes:storedBytes}' \
  --output table | grep -E "(None|null)"

# Count how many log groups have no retention set
aws logs describe-log-groups \
  --query 'length(logGroups[?retentionInDays==`null`])' \
  --output text

The Fix — Set Retention Policies in Bulk:

Different log types have different compliance requirements. Debug logs don't need to be kept. Audit logs might need 365 days. The table below gives sensible defaults:

Log Type	Recommended Retention	Reason
Application debug logs	14 days	Only useful for active debugging
Application error logs	90 days	Post-incident investigation window
Access logs	30 days	Security review window
CloudTrail audit logs	365 days	SOC2 evidence requirement
VPC Flow Logs	90 days	Security investigation window

#!/bin/bash
# set_log_retention.sh
# Sets 30-day retention on all log groups that have no policy set,
# except CloudTrail groups, which get 365 days
# Adjust the retention period per log group type as needed

echo "Setting retention policies on log groups with no expiry..."

# Get all log groups with no retention
aws logs describe-log-groups \
  --query 'logGroups[?retentionInDays==`null`].logGroupName' \
  --output text | tr '\t' '\n' | while read LOG_GROUP; do

  # CloudTrail logs need longer retention for SOC2, so they get 365 days
  if echo "$LOG_GROUP" | grep -qi "cloudtrail"; then
    echo "Setting 365-day retention on CloudTrail log group: $LOG_GROUP"
    aws logs put-retention-policy \
      --log-group-name "$LOG_GROUP" \
      --retention-in-days 365
    continue
  fi

  # Set 30-day retention on all other log groups
  echo "Setting 30-day retention on: $LOG_GROUP"
  aws logs put-retention-policy \
    --log-group-name "$LOG_GROUP" \
    --retention-in-days 30
done

echo "Done. Logs older than their retention period will be deleted automatically by CloudWatch."

Monthly savings: $500–$2,000 on storage costs. The ingestion cost reduction kicks in immediately when noisy debug logging is reduced. The storage cost reduction compounds over 30–90 days as old logs expire.

Pattern 8: The Orphaned Resource Collector

Every departed engineer leaves a trail. An EBS volume attached to a terminated instance. An Elastic IP allocated but not associated. A load balancer fronting a service that was deprecated in Q3. Old snapshots from an RDS instance that was replaced. None of these are intentional, but all of them are billed.

The fix is a weekly audit. Not a manual investigation — an automated script that runs every Sunday night, finds orphaned resources, and sends a Slack message with a list of candidates for deletion.

Finding the orphans:

# Unattached EBS volumes: you are paying for storage that nothing is using
# Estimated monthly cost per volume = Size x the per-GB rate for its volume type
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{
    ID:VolumeId,
    SizeGB:Size,
    Type:VolumeType,
    Created:CreateTime
  }' \
  --output table

# Unassociated Elastic IPs — $3.60/month each when not attached to a running instance
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
  --output table

# Old snapshots — created more than 90 days ago, no longer needed
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' --iso-8601=seconds)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table

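# Load balancers: this lists them all. Check each one's RequestCount (ALB) or
# ActiveFlowCount (NLB) metric in CloudWatch to find the ones routing zero traffic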
aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[*].{ARN:LoadBalancerArn,DNS:DNSName,State:State.Code}' \
  --output table

The weekly report Lambda:

# orphan_resource_reporter.py
# Runs every Sunday at 20:00 via EventBridge
# Reports orphaned resources to Slack — does NOT auto-delete
# Deletion requires a human decision. The Lambda surfaces the candidates.

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

def get_orphaned_resources():
    """Collect all orphaned AWS resources and their estimated monthly costs."""
    ec2 = boto3.client('ec2')
    elbv2 = boto3.client('elbv2')
    report = {'total_monthly_waste': 0, 'resources': []}

    # Unattached EBS volumes ($0.08/GB/month for gp3)
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']
    for vol in volumes:
        monthly_cost = round(vol['Size'] * 0.08, 2)
        report['resources'].append({
            'type': 'Unattached EBS Volume',
            'id': vol['VolumeId'],
            'detail': f"{vol['Size']}GB {vol['VolumeType']}",
            'monthly_cost': monthly_cost
        })
        report['total_monthly_waste'] += monthly_cost

    # Unassociated Elastic IPs ($3.60/month each)
    addresses = ec2.describe_addresses()['Addresses']
    for addr in addresses:
        if 'AssociationId' not in addr:
            report['resources'].append({
                'type': 'Unassociated Elastic IP',
                'id': addr['AllocationId'],
                'detail': addr['PublicIp'],
                'monthly_cost': 3.60
            })
            report['total_monthly_waste'] += 3.60

    # Snapshots older than 90 days (compare datetimes directly, not ISO strings)
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    snapshots = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    old_snapshots = [s for s in snapshots if s['StartTime'] < cutoff]
    for snap in old_snapshots:
        monthly_cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
        report['resources'].append({
            'type': 'Old Snapshot (90+ days)',
            'id': snap['SnapshotId'],
            'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
            'monthly_cost': monthly_cost
        })
        report['total_monthly_waste'] += monthly_cost

    return report

def post_to_slack(report):
    """Send the orphaned resource report to Slack."""
    resource_lines = '\n'.join([
        f"• {r['type']} `{r['id']}` — {r['detail']} — *${r['monthly_cost']}/month*"
        for r in report['resources']
    ])

    message = {
        'text': (
            f":money_with_wings: *Weekly Orphaned Resource Report*\n\n"
            f"Found *{len(report['resources'])} orphaned resources* "
            f"costing *${report['total_monthly_waste']:.2f}/month*\n\n"
            f"{resource_lines}\n\n"
            f"Review and delete resources that are no longer needed."
        )
    }
    
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)

def lambda_handler(event, context):
    report = get_orphaned_resources()
    post_to_slack(report)
    return {
        'resources_found': len(report['resources']),
        'monthly_waste': report['total_monthly_waste']
    }

Monthly savings: $500–$2,000. Every departed engineer typically leaves $50–$200 in orphaned resources. At a team of 30 with 30% annual turnover, that compounds quickly.
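The turnover claim is easy to check for your own team. A minimal sketch; orphan_accrual is a hypothetical helper, and it assumes the $50 to $200 figure is a monthly cost per departed engineer that keeps accruing until someone cleans up.

```python
def orphan_accrual(team_size: int, annual_turnover: float,
                   per_leaver_low: float, per_leaver_high: float,
                   years: float = 1.0) -> tuple:
    """Monthly orphaned-resource cost (low, high) accumulated after `years` with no cleanup."""
    leavers = team_size * annual_turnover * years
    return (round(leavers * per_leaver_low, 2), round(leavers * per_leaver_high, 2))


# Team of 30, 30% annual turnover, $50 to $200 per departed engineer
print(orphan_accrual(30, 0.30, 50, 200))
```

Under those assumptions, one year without cleanup lands inside the monthly savings range quoted above.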

The Full Savings Summary

Pattern	Monthly Saving	Time to Fix	Difficulty
1. New hire experiment tax	$1,000–$2,000	2 hours (Lambda)	Medium
2. Staging proliferation	$600–$800	3 hours (scheduling)	Low
3. NAT Gateway tax	$2,000–$8,000	30 minutes	Low
4. Savings Plan timing	$5,000–$15,000	One decision	Low
5. Cross-AZ data transfer	$500–$6,000	2 hours	Medium
6. gp2 volume trap	$1,000–$5,000	30 minutes (script)	Low
7. Infinite log trap	$500–$2,000	1 hour (script)	Low
8. Orphaned resources	$500–$2,000	2 hours (Lambda)	Low
Total potential	$11,100–$40,800/month

What to Do This Week

Don't fix all eight this week. Prioritise by ROI per hour of engineering time:

Day 1 (30 minutes): Pattern 3 — NAT Gateway endpoints. Highest ROI per minute of any fix in this guide. One command creates the S3 endpoint. Done.

Day 2 (30 minutes): Pattern 6 — gp2 to gp3 migration. Run the script. Check the output. Done.

Day 3 (1 hour): Pattern 7 — log retention policies. Run the bulk retention script. Done.

Day 4 (2 hours): Patterns 1 and 8 — deploy both Lambdas. They run automatically from here.

Next sprint: Pattern 2 (staging schedule), Pattern 5 (topology-aware routing), and Pattern 4 (run the rightsizing cycle first, then evaluate Savings Plans).

Open Cost Explorer after each fix. Compare against your baseline screenshot from the start of this guide. The line should start going down.
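If you want to re-rank the fixes with your own estimates, the ROI-per-hour ordering can be sketched as below. The figures are copied from the summary table, using the midpoint of each savings range. Pattern 4 is left out because the table gives "One decision" rather than hours. The rank_by_roi name is hypothetical.

```python
# (pattern, low monthly saving, high monthly saving, hours to fix) from the summary table
FIXES = [
    ("1. New hire experiment tax", 1000, 2000, 2.0),
    ("2. Staging proliferation", 600, 800, 3.0),
    ("3. NAT Gateway tax", 2000, 8000, 0.5),
    ("5. Cross-AZ data transfer", 500, 6000, 2.0),
    ("6. gp2 volume trap", 1000, 5000, 0.5),
    ("7. Infinite log trap", 500, 2000, 1.0),
    ("8. Orphaned resources", 500, 2000, 2.0),
]


def rank_by_roi(fixes):
    """Sort by midpoint monthly saving per engineering hour, best first."""
    scored = [(name, (low + high) / 2 / hours) for name, low, high, hours in fixes]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for name, per_hour in rank_by_roi(FIXES):
    print(f"{name}: ${per_hour:,.0f}/month per hour")
```

Midpoints put Pattern 5 ahead of Pattern 7, but its range is wide and its difficulty is Medium, which is a reasonable argument for the day-by-day order above.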

Resources

FinOps Foundation Framework — The practitioner framework this guide contributes to, covering Inform, Optimize, and Operate phases of cloud cost management
AWS Cost Explorer API Reference — Full reference for the get-cost-and-usage command used throughout this guide
AWS Compute Optimizer — AWS's own rightsizing recommendation service, used alongside the patterns in this guide for EC2 and EBS recommendations
AWS VPC Endpoints Documentation — Complete list of available VPC endpoints for Pattern 3
AWS Instance Scheduler Solution — The AWS-maintained CloudFormation solution for Pattern 2 environment scheduling
Karpenter Documentation — For teams ready to go beyond these 8 patterns into dynamic node provisioning and Spot diversification
FinOps Foundation Asset Library — The community asset library where practical scripts like the ones in this guide are contributed and maintained by practitioners

Ayobami Adejumo is a senior platform engineer and FinOps specialist. He has audited AWS infrastructure for 30+ Series A companies and contributes practical tooling to the FinOps Foundation Asset Library.

Pattern	Monthly Saving	Time to Fix	Difficulty
1. New hire experiment tax	$1,000–$2,000	2 hours (Lambda)	Medium
2. Staging proliferation	$600–$800	3 hours (scheduling)	Low
3. NAT Gateway tax	$2,000–$8,000	30 minutes	Low
4. Savings Plan timing	$5,000–$15,000	One decision	Low
5. Cross-AZ data transfer	$500–$6,000	2 hours	Medium
6. gp2 volume trap	$1,000–$5,000	30 minutes (script)	Low
7. Infinite log trap	$500–$2,000	1 hour (script)	Low
8. Orphaned resources	$500–$2,000	2 hours (Lambda)	Low
Total potential	$11,100–$40,800/month
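
The total row is just the sum of the low ends and the sum of the high ends of each range. The sketch below reproduces that arithmetic so you can swap in estimates for your own account. The dictionary and function name are illustrative, and the figures are copied from the table above.

```python
# (low, high) estimated monthly saving in USD per pattern, from the summary table.
SAVINGS = {
    "1. New hire experiment tax": (1_000, 2_000),
    "2. Staging proliferation": (600, 800),
    "3. NAT Gateway tax": (2_000, 8_000),
    "4. Savings Plan timing": (5_000, 15_000),
    "5. Cross-AZ data transfer": (500, 6_000),
    "6. gp2 volume trap": (1_000, 5_000),
    "7. Infinite log trap": (500, 2_000),
    "8. Orphaned resources": (500, 2_000),
}


def total_range(savings: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Add up the low ends and the high ends of every per-pattern range."""
    low = sum(lo for lo, _ in savings.values())
    high = sum(hi for _, hi in savings.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(SAVINGS)
    print(f"Total potential: ${low:,} to ${high:,} per month")
```

Delete the patterns that do not apply to your infrastructure before summing, since the headline range assumes all eight are present.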

