Devops - freeCodeCamp.org

Common DevOps Mistakes and How to Avoid Them — Tips for Startups

Tolani Akintayo — Thu, 14 May 2026 17:53:38 +0000

Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what not to do before they got into production.

Startup environments make this worse. The pressure to ship fast, the small team sizes, and the absence of senior engineers to review your decisions mean mistakes happen quietly until they become outages, data loss events, or security incidents that cost the company thousands of dollars and weeks of recovery time.

This article is a direct breakdown of the ten most costly DevOps mistakes engineers make early in their careers at startups. For each mistake, you will get the real-world scenario, the business impact, and the concrete fix you can apply immediately.

Whether you are setting up your first production environment or auditing an existing one, this guide will help you build systems that are reliable, secure, and aligned with what the business actually needs.

Who This Article Is For
Why Startups Are a Different Environment
Mistake 1: Deploying Without Understanding What You're Deploying
Mistake 2: Using Production as a Development Environment
Mistake 3: Hardcoding Secrets and Credentials
Mistake 4: Overengineering for Problems You Don't Have Yet
Mistake 5: No Observability Before Launch
Mistake 6: Treating Security as a Final Step
Mistake 7: Manual Deployments in Production
Mistake 8: No Disaster Recovery Plan
Mistake 9: No Documentation or Runbooks
Mistake 10: Solving Technical Problems Without Understanding the Business
The System Thinking Framework Every DevOps Engineer Needs
Your Production Readiness Checklist
Conclusion

Who This Article Is For

Early-career DevOps and cloud engineers who are building or maintaining production infrastructure at a startup.
Backend developers who have recently taken on DevOps responsibilities.
Engineers joining a startup who want to understand what operational discipline actually looks like in a fast-moving environment.

You do not need to be an expert in any specific tool to follow this article. The focus is on decision-making patterns and operational discipline, not tool configuration.

Why Startups Are a Different Environment

Before getting into the mistakes, you have to understand why startups produce them in the first place.

In a large company, you typically have dedicated security engineers, an SRE team, a platform team, and multiple reviewers for every infrastructure change. In a startup, you most likely have one engineer responsible for all of that simultaneously.

This creates four specific pressure points:

Speed pressure. The business needs features shipped now. Operational discipline gets treated as optional because nobody is watching closely yet.
Budget constraints. Every infrastructure decision has a direct impact on company runway. Engineers optimize for the cheapest option rather than the most reliable one.
Absent guardrails. There is no senior engineer reviewing your Terraform plans. There is no security audit before launch. The absence of immediate consequences can make bad decisions feel like good ones.
Constantly changing requirements. The architecture you design today may need to support a completely different product in six months.

None of these pressures are excuses for poor decisions. But understanding them helps you see why the following mistakes happen so consistently.

Mistake 1: Deploying Without Understanding What You're Deploying

The Scenario

A junior engineer is asked to deploy the company's Node.js API to AWS. They find a tutorial for Elastic Beanstalk, follow it, and it works. Two weeks later, traffic increases. They try to scale "the same way as in the tutorial." The application goes down. They cannot debug it because they never understood what the deployment was actually doing.

The Business Impact

When production breaks and the person who deployed the system cannot explain how it works, diagnosis takes hours instead of minutes. The longer the incident runs, the higher the cost in customer trust, team morale, and potentially direct revenue loss.

The Fix

Before you deploy anything to production, you should be able to answer these five questions in writing:

What compute type is running my code? (EC2, Lambda, Fargate, container?)
How does a new version replace the old one? (Rolling? Blue/green? All-at-once?)
Where do configuration and secrets come from? (SSM? Secrets Manager? Environment file?)
What downstream services depend on this? (Database connections? Other APIs? Cache?)
How do I roll back in under five minutes if this breaks?

If you cannot answer all five, do not deploy until you can. The tutorial that got it running is not the documentation for how it operates.

"It is better to spend two hours understanding a system before deploying it than two days debugging it after something breaks."

Personally, when learning a new technology, tool, or implementing something I have not worked with before, I usually focus on three core questions: What, Why, and How.

The first question is: What is this technology or concept about?
This helps me build a solid foundation by doing deep research, studying the official documentation, understanding the core principles, and sometimes even learning the history behind the tool or technology. I believe having a well-grounded understanding before implementation is very important.
The second question is: Why do we need it?
I try to understand the value the technology brings, why it should be implemented, what problem it solves, and how it benefits the team or organization. This helps me make informed technical decisions instead of just implementing tools without understanding their purpose.
The third question is: How should it be implemented?
There are usually multiple approaches to solving a problem or implementing a technology, so I focus on understanding the best and most practical approach based on the use case and expected outcome.

This structured approach has helped me learn new technologies quickly, adapt fast, and implement solutions effectively in real-world environments.

Mistake 2: Using Production as a Development Environment

The Scenario

To save time, an engineer tests a new deployment script directly in the production AWS account. They accidentally run a command that terminates the production database instance. Automated backups exist but were misconfigured. Six hours of customer data is unrecoverable.

This scenario happens more often than you would expect. The reasoning is always the same: "It will only take a minute."

The Business Impact

A single test-in-production incident can result in data loss, hours of downtime, and a customer communication crisis. In a startup, that can permanently damage the company's reputation before it has had the chance to build one.

The Fix

You need at minimum three separate environments and ideally three separate AWS accounts:

Environment	Purpose	Access Level
dev	Break things freely. No real data.	Engineers have broad access
staging	Mirror of production. Final verification.	Controlled access
production	Real customers. Real data.	MFA required. No manual deployments.

Using separate AWS accounts (not just separate VPCs) gives you account-level isolation. A permission error in the dev account cannot accidentally touch production infrastructure at the API level.

Infrastructure as Code (Terraform or CloudFormation) makes this affordable: you write the configuration once and apply it three times with different variable files.

# terraform/environments/prod/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "production"
  instance_type     = "t3.medium"
  db_instance_class = "db.t3.medium"
  multi_az          = true
}

# terraform/environments/staging/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "staging"
  instance_type     = "t3.small"
  db_instance_class = "db.t3.small"
  multi_az          = false
}

The module is the same. The environment-specific variables are different. Separate environments are not a luxury; they are the minimum operating standard for any team running real software.

Mistake 3: Hardcoding Secrets and Credentials

The Scenario

A new engineer joins a startup and clones the repository. Inside they find a .env file committed to Git containing the production database password, the Stripe secret key, and an AWS access key with admin permissions. The repository has been public for six months.

GitHub's automated secret scanning never triggered because the secrets were inside a .env file rather than raw in the code. The credentials had been valid and actively used for over six months.

The Business Impact

Automated scanners run by attackers find exposed credentials within minutes of them being pushed to a public repository. A single exposed AWS access key with admin permissions can result in:

Crypto-mining workloads generating thousands of dollars in cloud bills overnight
Complete exfiltration of customer data from every S3 bucket
Privilege escalation: the attacker creates new admin users and locks you out of your own account
AWS account suspension while the investigation runs

According to GitHub's annual security report, millions of secrets are exposed in public repositories every year. The average time to detect a compromised cloud credential is 197 days.

The Fix

Step 1: Never commit secrets to Git. Not temporarily. Not in a branch. Not in a private repository.

Step 2: Add .gitignore before you create the first file. Check in the .gitignore with the first line of code before any .env files exist.

# .gitignore
.env
.env.*
*.pem
*.key
secrets/
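
Keep in mind that a .gitignore entry only affects files Git is not already tracking. A .env that was committed earlier stays tracked until you remove it with git rm --cached. The sketch below is one way to check for that, assuming you run it inside the repository; the helper names find_tracked_secret_files and list_tracked_files and the pattern list are illustrative, not part of any tool.

```python
import fnmatch
import subprocess

# Patterns mirroring the .gitignore above (illustrative; extend as needed).
SECRET_PATTERNS = [".env", ".env.*", "*.pem", "*.key"]


def find_tracked_secret_files(paths, patterns=SECRET_PATTERNS):
    """Return paths whose file name matches a secret-like pattern
    or that live under a secrets/ directory."""
    flagged = []
    for path in paths:
        parts = path.split("/")
        name = parts[-1]
        in_secrets_dir = "secrets" in parts[:-1]
        if in_secrets_dir or any(fnmatch.fnmatch(name, p) for p in patterns):
            flagged.append(path)
    return flagged


def list_tracked_files():
    """Ask Git for every file it currently tracks (must run inside a repository)."""
    result = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling find_tracked_secret_files(list_tracked_files()) in a repository lists the files that need to be untracked. Any secret found this way should also be rotated, because it remains in the commit history.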

Step 3: Use AWS Secrets Manager or SSM Parameter Store for all production secrets. Your application reads secrets at runtime:

# Python example — fetch secret at runtime, never at build time
import boto3
import json
 
def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
 
# Usage
db_config = get_secret("prod/myapp/database")
DATABASE_URL = db_config["connection_string"]

Step 4: Scan your existing repositories immediately. You may already have a problem:

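# Install TruffleHog v3 to scan for exposed secrets in your repo history.
# The commands below use the v3 CLI; the "trufflehog" package on PyPI is the older v2 tool with different syntax.
brew install trufflehog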
 
# Scan the entire commit history of your repository
trufflehog git file://.
 
# Or scan a remote GitHub repo
trufflehog github --repo https://github.com/your-org/your-repo

Step 5: Add a pre-commit hook to prevent future accidents:

pip install pre-commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/git-secrets
    rev: master
    hooks:
      - id: git-secrets
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets

pre-commit install
# Now the hook runs before every commit and blocks detected secrets

You cannot un-leak a publicly exposed database password. You can only rotate it and investigate what was accessed while it was valid. The fix takes ten minutes upfront. The incident takes weeks.

Mistake 4: Overengineering for Problems You Don't Have Yet

The Scenario

A five-person startup with 200 users decides to build a microservices architecture on Kubernetes because "Netflix uses it." They spend three months setting up Kubernetes, Istio service mesh, ArgoCD, Vault, Prometheus, and Grafana. Their product has not shipped a new feature in three months. A competitor with a monolith on a single EC2 instance shipped twelve new features in the same period.

The Business Impact

Every layer of infrastructure you add is a layer that can break, a layer that requires expertise to operate, and a layer that slows down every future change. Kubernetes is the right answer for organizations with the scale and team size to operate it. For a five-person startup, it is an expensive distraction.

Premature complexity does not just cost engineering time. It costs the competitive advantage that speed provides in the early stage.

The Fix

Match your infrastructure to your actual stage:

Scale	Right Infrastructure	Cost Range
1–1,000 users	Single EC2 + RDS + Nginx reverse proxy	$20–50/month
1K–50K users	Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD	$200–500/month
50K–500K users	ECS Fargate, RDS read replicas, ElastiCache, full observability	$1K–5K/month
500K+ users	Multi-region, managed Kubernetes, dedicated SRE	$10K+/month

The question to ask before every infrastructure decision is: "What specific, measurable problem does this solve today that my current setup cannot solve?"

Amazon, Netflix, and Uber did not start with microservices. They started with monoliths and extracted services only when the monolith became the actual bottleneck. You are not Netflix. You are solving the problems in front of you today.

Use managed services wherever possible: RDS instead of self-hosted Postgres, Fargate instead of self-managed Kubernetes, ElastiCache instead of self-hosted Redis. Managed services let your team focus on the product instead of the infrastructure.

Mistake 5: No Observability Before Launch

The Scenario

A startup's checkout flow breaks on a Friday evening. Users are abandoning their carts and the company is losing revenue. The DevOps engineer finds out 45 minutes later because a customer sent a direct message to the CEO on Twitter.

The engineer has no dashboards, no log aggregation, and no alerting. They SSH into the production server and scroll through raw log files. Two hours later, they find the issue: a database connection pool was exhausted by a memory leak introduced in that morning's deployment.

The Business Impact

Without observability:

You find out about production problems from users, not from your systems
Incidents take 10x longer to resolve because diagnosis is guesswork
You cannot tell whether a deployment improved or degraded performance
You have no data for making better architecture decisions

The Fix

Implement the four golden signals before any service goes to production. These come from Google's Site Reliability Engineering book:

Latency: How long requests take to complete (p50, p95, p99)
Traffic: How many requests per second the system is handling
Errors: The rate of failed requests (5xx responses per minute)
Saturation: How close the system is to its limits (CPU, memory, connection pool)
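
To make the four signals concrete, here is a minimal sketch that computes them from one window of request records. The Request type, the golden_signals function, and the choice of connection-pool usage as the saturation measure are assumptions made for illustration. In production these numbers normally come from your metrics system rather than from hand-rolled code.

```python
import math
from typing import NamedTuple


class Request(NamedTuple):
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """Summarise latency, traffic, errors, and saturation for one time window."""
    durations = [r.duration_ms for r in requests]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": pool_in_use / pool_size,
    }
```

Reporting p95 and p99 alongside p50 matters because an average hides the slow tail, which is where users feel the problem first.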

Here is a minimal CloudWatch alarm setup using the AWS CLI:

# Alert when targets return 10 or more 5xx responses per minute for 5 consecutive minutes.
# The threshold of 10 is a placeholder; tune it to your traffic. A true percentage
# error rate needs CloudWatch metric math (5xx count divided by RequestCount).

aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate-production" \
  --alarm-description "Targets returned 10+ 5xx responses per minute for 5 minutes" \
  --metric-name "HTTPCode_Target_5XX_Count" \
  --namespace "AWS/ApplicationELB" \
  --statistic "Sum" \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 10 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --treat-missing-data "notBreaching" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-production" \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef

Every application should also expose a /health endpoint that returns 200 OK when healthy:

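# FastAPI example
import os

from fastapi import FastAPI, Response
from sqlalchemy import create_engine, text

app = FastAPI()
# Assumes DATABASE_URL is set in the environment
engine = create_engine(os.environ["DATABASE_URL"])

@app.get("/health")
def health_check(response: Response):
    # Check database connectivity
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        db_status = "healthy"
    except Exception:
        db_status = "unhealthy"
        # Return 503 so load balancers and uptime monitors see the failure
        response.status_code = 503

    return {
        "status": "healthy" if db_status == "healthy" else "degraded",
        "database": db_status,
        "version": os.getenv("APP_VERSION", "unknown")
    }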

Your load balancer checks this endpoint. Your uptime monitor checks it. You check it after every deployment.
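
Checking the endpoint after a deployment is easy to script. The following is a rough sketch of a post-deploy poll, assuming the endpoint returns a JSON body with a "status" field like the example above. The function names, the retry counts, and the URL in the usage note are hypothetical.

```python
import json
import time
import urllib.request


def fetch_health(url, timeout=5):
    """Return (status_code, parsed_json_body). urlopen raises on non-2xx responses."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(fetch, attempts=10, delay_seconds=3.0, sleep=time.sleep):
    """Poll until fetch() reports HTTP 200 with status "healthy"; return False if it never does."""
    for attempt in range(attempts):
        try:
            status_code, body = fetch()
            if status_code == 200 and body.get("status") == "healthy":
                return True
        except Exception:
            pass  # connection refused, timeout, or non-2xx: treat as not ready yet
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A deploy script could call wait_until_healthy(lambda: fetch_health("https://api.example.com/health")) and trigger a rollback when it returns False.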

You do not get to say a system is working unless you have data to prove it. "Nobody complained" is not the same as "nothing is broken."

Mistake 6: Treating Security as a Final Step

The Scenario

A startup rushes to launch their MVP. Security reviews are "planned for after launch." Six months later, a potential enterprise customer requires a security audit before signing a contract. The audit reveals:

S3 buckets publicly accessible by default
EC2 instances with port 22 open to 0.0.0.0/0
IAM users with AdministratorAccess for the entire team
No encryption on the database at rest
JWT secrets hardcoded in environment variables

The audit fails. The enterprise deal worth $120,000 annually is lost. Remediation takes four weeks of engineering time.

The Business Impact

Security debt is the most expensive technical debt you can accumulate. Unlike performance debt that degrades gradually, security vulnerabilities cause sudden, catastrophic events: data breaches, ransomware, account takeovers, and regulatory fines. At a startup, any one of these can end the company.

The Fix

Apply these six security controls before the first line of production code ships:

1. Principle of Least Privilege (every IAM role gets only what it needs):

One of the most common security mistakes in AWS is granting roles more permissions than they need, either out of convenience (s3:*) or uncertainty about what the service actually requires. This creates unnecessary risk: if a role is compromised, the attacker inherits every permission you granted.

The fix is simple: look at what your service actually does, then write a policy that allows exactly that.

If your app uploads and reads files from a specific S3 bucket, the policy should say exactly that:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}

Notice the Resource is scoped to my-app-uploads/*, not all S3 buckets. And the Action list covers only GetObject and PutObject: not DeleteObject, not s3:*. If the service gets compromised, the attacker can read and write to that one bucket. That is it. The rest of your account is untouched.
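
You can also catch over-broad policies mechanically before they ship. Below is a minimal sketch of such a check, assuming the policy has already been loaded as a Python dict. The function name find_overbroad_statements is made up for illustration, and the check only looks for wildcard actions and a bare "*" resource, so it is no substitute for a real policy analyzer.

```python
def find_overbroad_statements(policy):
    """Return Allow statements that use a wildcard action (such as "s3:*") or Resource "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    def as_list(value):
        return value if isinstance(value, list) else [value]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = any(r == "*" for r in resources)
        if wildcard_action or wildcard_resource:
            flagged.append(stmt)
    return flagged
```

Run against the policy above, it returns an empty list. Run against a policy that allows s3:* on every resource, it returns that statement.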

2. Block all S3 public access by default:

AWS S3 buckets are private by default when created, but that can be overridden at the bucket level, the object level, or through a bucket policy. Misconfigured S3 buckets are one of the most common causes of data breaches, and they are almost always accidental.

The safest approach is to enable the "Block Public Access" settings, which override ACLs and bucket policies and prevent a bucket from being made public even if someone tries. The following command applies them to a single bucket:

aws s3api put-public-access-block \
  --bucket my-app-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Run this for every bucket you create. Better yet, enable it at the AWS account level so it applies automatically to all future buckets by default.
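
For the account-level setting, here is a small sketch using the S3 Control API through boto3. The function name is hypothetical and the account ID in the usage comment is a placeholder. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
def block_public_access_for_account(s3control_client, account_id):
    """Turn on all four Block Public Access settings for the whole AWS account."""
    config = {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }
    s3control_client.put_public_access_block(
        AccountId=account_id,
        PublicAccessBlockConfiguration=config,
    )
    return config


# Usage (needs boto3 and credentials allowed to call s3:PutAccountPublicAccessBlock):
#   import boto3
#   block_public_access_for_account(boto3.client("s3control"), "123456789012")
```

Once this is set, new and existing buckets in the account inherit the restriction, so a single forgotten bucket no longer matters.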

3. Never open SSH to the internet; use AWS Systems Manager Session Manager instead:

Port 22 open to 0.0.0.0/0 is an attack surface that exists on thousands of AWS instances right now. Brute-force bots scan the internet continuously looking for open SSH ports. Even with a strong key, the exposure is unnecessary because AWS provides a better alternative.

AWS Systems Manager Session Manager gives you full shell access to any EC2 instance without opening a single inbound port on the security group. There is no port to scan, no port to attack, and the start of every session is recorded in CloudTrail (full session transcripts can additionally be sent to S3 or CloudWatch Logs):

# Start a session on an EC2 instance without port 22 open
aws ssm start-session --target i-0123456789abcdef0

To use Session Manager, the EC2 instance needs the SSM Agent installed (included by default on Amazon Linux 2 and Ubuntu 20.04+) and an IAM instance profile with the AmazonSSMManagedInstanceCore policy attached. Once that is set up, you can close port 22 on the security group entirely.

4. Enable MFA for all IAM users and enforce it via policy:

A leaked IAM username and password with no MFA is a fully compromised account. Multi-factor authentication is the single most effective control against credential theft, and it costs nothing to enable.

Enforce it through an IAM policy that denies all actions when MFA is not present, except the actions needed to set up MFA in the first place. This means even if a set of credentials is stolen, the attacker cannot do anything without the second factor.

The AWS documentation provides the Complete Deny Without MFA Policy. Attach it to every IAM user or group in your account. This is a one-time setup that permanently raises your account's security baseline.
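
To show the shape of that policy, here is a sketch of its central deny statement built as a Python dict. The list of exempt actions is abbreviated and should be treated as an assumption. The published AWS policy also contains Allow statements for managing your own MFA device, so copy the authoritative version from the AWS documentation rather than from this sketch.

```python
import json

# Actions a user still needs without MFA in order to enrol an MFA device.
# Abbreviated, illustrative list: take the authoritative one from the AWS docs.
ALLOWED_WITHOUT_MFA = [
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:GetUser",
    "iam:ListMFADevices",
    "iam:ListVirtualMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def deny_without_mfa_policy(allowed_without_mfa=ALLOWED_WITHOUT_MFA):
    """Build a policy that denies everything except MFA setup when no MFA is present."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": list(allowed_without_mfa),
                "Resource": "*",
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


print(json.dumps(deny_without_mfa_policy(), indent=2))
```

The BoolIfExists operator is the important detail: requests made with long-term access keys carry no MFA key at all, and BoolIfExists treats that missing key as a match, so those requests are denied too.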

5. Enable CloudTrail in all regions:

Without CloudTrail, you have no record of who did what in your AWS account. If a credential is compromised, you cannot investigate what the attacker accessed. If an engineer accidentally deletes a resource, you cannot trace it. You are operating blind.

CloudTrail logs AWS API calls (management events by default): who made each call, from which IP, at what time, and what the response was. Enable it across all regions so activity in regions you do not actively use is also captured:

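# The S3 bucket must already exist with a bucket policy that lets CloudTrail write to it
aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation

# A trail created from the CLI does not record anything until logging is started
aws cloudtrail start-logging --name production-audit-trail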

The --enable-log-file-validation flag generates signed digest files that let you verify the logs have not been tampered with. This is important if you ever need to use these logs in a security investigation or compliance audit. Once this is running, every AssumeRole, every DeleteBucket, and every RunInstances call in your account is recorded.

6. Run AWS Security Hub from day one:

Most teams only discover security misconfigurations after a breach or a compliance audit. Security Hub inverts this: it continuously checks your AWS environment against industry-standard frameworks (CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices) and surfaces findings before they become incidents.

Enabling it takes a single command (most of its checks rely on AWS Config recording, so turn that on as well):

aws securityhub enable-security-hub

Within minutes, Security Hub gives your account a compliance score and a prioritized list of findings. A finding might tell you that a security group has port 22 open to the world, that an S3 bucket has logging disabled, or that root account credentials were recently used. Each finding includes the affected resource and a remediation guide.

Treat every Security Hub finding the same way you treat a production bug: assign it a priority, assign an owner, and close it. A finding sitting unaddressed for 30 days is a known vulnerability you chose to leave open.

Mistake 7: Manual Deployments in Production

The Scenario

A startup's deployment process is documented in a Notion page that is four months out of date. It involves SSH-ing into the server, running git pull, running npm install, and restarting the PM2 process. Different engineers do it slightly differently. One engineer, rushing a late-night release, skips npm install. The application starts crashing because a new dependency is missing.

The Business Impact

Manual deployment processes are inherently unreliable. Humans under pressure skip steps, perform steps in the wrong order, and remember procedures differently. Every manual step in a production deployment process is a scheduled incident waiting for the right moment of stress.

The Fix

If a deployment step is performed manually more than twice, it needs to be automated. Here is a minimal but complete GitHub Actions deployment workflow for an ECS Fargate service:

# .github/workflows/deploy.yml
name: Deploy to Production
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write   # Required for OIDC authentication with AWS
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
 
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
 
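      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/my-app:$IMAGE_TAG .
          docker push $ECR_REGISTRY/my-app:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/my-app:$IMAGE_TAG" >> $GITHUB_OUTPUT

      # Insert the freshly built image into the task definition.
      # "my-app" is assumed to be the container name in task-definition.json.
      - name: Render task definition with the new image
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-app
          image: ${{ steps.build.outputs.image }}

      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: my-app-service
          cluster: production
          wait-for-service-stability: true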

Notice wait-for-service-stability: true. Without this, the workflow reports success the moment ECS accepts the new task definition, before the containers are actually healthy. With it, the workflow fails if the new containers crash. You want to know immediately, not discover it from user reports thirty minutes later.

Mistake 8: No Disaster Recovery Plan

The Scenario

A startup's production database runs on a single RDS instance with no Multi-AZ configuration. Automated backups are enabled but have never been tested. The EBS volume backing the instance fails. AWS provisions a new instance from the last snapshot, which is 18 hours old. 18 hours of customer data is permanently lost.

The startup had no disaster recovery plan, no tested recovery procedure, and no communication template ready for customers.

The Business Impact

The question is not whether your infrastructure will fail. It will fail. Every database, every server, every availability zone experiences failures. The question is whether you have a tested plan for when it does.

Data loss of any magnitude is serious. For startups that handle financial data, healthcare data, or anything under GDPR, even partial data loss can trigger regulatory consequences.

The Fix

Define your RTO and RPO before you design anything:

RTO (Recovery Time Objective): How long can the business survive without this system? A payment API might have an RTO of 15 minutes. An internal analytics dashboard might have an RTO of 4 hours.
RPO (Recovery Point Objective): How much data loss is acceptable? Zero means real-time replication. One hour means hourly snapshots are sufficient. This directly determines your backup frequency and architecture.
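
The relationship between backup frequency and these two targets is simple arithmetic. Here is a minimal sketch, assuming a snapshot-only backup strategy with no continuous replication. The function name and the example durations are illustrative.

```python
from datetime import timedelta


def evaluate_recovery_plan(backup_interval, restore_duration, rpo, rto):
    """Compare a snapshot-only backup plan against RPO and RTO targets.

    Worst-case data loss equals the backup interval, because a failure can
    happen just before the next snapshot would have been taken.
    """
    return {
        "worst_case_data_loss": backup_interval,
        "meets_rpo": backup_interval <= rpo,
        "meets_rto": restore_duration <= rto,
    }


# Daily snapshots and a 45-minute restore, measured against a 1-hour RPO and a 15-minute RTO
plan = evaluate_recovery_plan(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(minutes=45),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=15),
)
print(plan["meets_rpo"], plan["meets_rto"])
```

The restore duration should come from an actual restore test, not from an estimate. That is the number teams most often get wrong.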

Enable RDS Multi-AZ for all production databases:

# Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
 
  # Multi-AZ: automatic failover to standby in a different AZ
  # No data loss. Automatic failover in ~60-120 seconds.
  multi_az = true
 
  # Encryption at rest — non-negotiable
  storage_encrypted = true
 
  # Automated backups with 7-day retention
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable deletion protection in production
  deletion_protection = true
 
  tags = {
    Environment = "production"
  }
}

Test your backups on a schedule. Create a monthly calendar event: "Restore production backup to staging and verify data integrity." An untested backup is not a backup; it is a hope.

# Restore a snapshot to a test instance and verify
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier recovery-test \
  --db-snapshot-identifier rds:prod-postgres-2025-01-15 \
  --db-instance-class db.t3.medium \
  --no-multi-az
 
# Connect and verify row counts
psql -h recovery-test.xxxx.rds.amazonaws.com -U admin -d mydb \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"

For official guidance on RDS backup and restore, refer to the AWS RDS Backup and Restore documentation.
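
The row-count verification can be turned into a repeatable check instead of eyeballing numbers. This is a rough sketch, assuming you have already collected per-table counts from both databases into dicts. The function name and the 1% tolerance are assumptions.

```python
def compare_row_counts(production, restored, max_shortfall=0.01):
    """Return a list of problems found when comparing per-table row counts.

    production and restored map table name -> row count.
    max_shortfall is the tolerated fraction of rows missing from the restore
    (the snapshot is older than production, so a small gap is expected).
    """
    problems = []
    for table, prod_rows in production.items():
        if table not in restored:
            problems.append(f"{table}: missing from restored database")
            continue
        missing = prod_rows - restored[table]
        if prod_rows > 0 and missing / prod_rows > max_shortfall:
            problems.append(f"{table}: {missing} of {prod_rows} rows missing")
    return problems
```

An empty list means the restore is within tolerance. Anything else should fail the monthly test and be investigated before you need the backup for real.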

Mistake 9: No Documentation or Runbooks

The Scenario

The startup's most experienced DevOps engineer takes two weeks of vacation. On day three of their holiday, the staging environment goes down. Nobody else knows how it was built: the engineer set it up manually over six months with no documentation, no Terraform, no notes. The team spends four days trying to reconstruct the environment from memory and guesswork. The engineer gets messages on their vacation every day. When they return, they rebuild the environment in four hours.

The Business Impact

Undocumented infrastructure creates single points of failure not in your systems, but in your team. It makes onboarding new engineers take weeks instead of hours. It makes incident response depend on specific people being available. When that person leaves the company, the knowledge walks out with them.

The Fix

Documentation for an engineering team means three specific things:

Infrastructure as Code is the highest form of documentation. The Terraform that defines your infrastructure IS the documentation for what exists and how it is configured. If something is not in code, it should not exist in production.
A runbook for every operational task. A runbook is a step-by-step procedure written well enough that someone in their first week at the company can follow it during an incident:

# Runbook: Production Database Connection Exhaustion
 
## Symptoms
- Application logs: "too many connections" errors
- 500 error rate spike on database-dependent endpoints
- pg_stat_activity shows max connections reached
 
## Diagnosis
# Check current connection count
psql -h $DB_HOST -U $DB_USER -c "SELECT COUNT(*) FROM pg_stat_activity;"

# See connections by application
psql -h $DB_HOST -U $DB_USER \
  -c "SELECT application_name, COUNT(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

## Resolution
1. Identify and restart the service causing the connection leak
2. If immediate relief needed: kill idle connections older than 10 minutes
3. Long-term: review connection pool settings in application config

## Escalation
If unresolved in 30 minutes: page the on-call backend engineer.

An architecture README in every repository. Every engineer who clones your repository should be able to understand what it does, how to run it locally, how to deploy it, and what it depends on without asking anyone.

Mistake 10: Solving Technical Problems Without Understanding the Business

The Scenario

A startup is experiencing slow page loads. A DevOps engineer decides to solve it by migrating to Kubernetes with horizontal pod auto-scaling. The migration takes six weeks. Page loads improve slightly. But 80% of the slowness was caused by unoptimized database queries that had nothing to do with the infrastructure layer. The six-week migration solved 20% of the problem.

The Business Impact

Technical solutions to misdiagnosed problems are extraordinarily expensive. Every hour spent building the wrong solution is an hour not spent on the right one. Infrastructure is a tool for delivering business outcomes not an end in itself.

The Fix

Before making any infrastructure decision, answer these four questions:

What is the actual, measured bottleneck? Instrument before you act. The bottleneck is almost never where you assumed it was.
What does success look like, and how will you measure it? "Pages are faster" is not measurable. "p95 page load time drops below 1.2 seconds" is measurable.
What is the full cost of this solution? Time to implement, ongoing operational burden, team learning curve. Is this cost justified by the measured impact?
Can a simpler solution solve 80% of the problem in 20% of the time?

Always profile and measure before you rebuild:

# Check slow queries in PostgreSQL before any infrastructure changes
# (requires the pg_stat_statements extension; total_exec_time exists in PostgreSQL 13+)
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 10;
"

More often than not, slow applications have slow queries, missing indexes, or an N+1 query problem, none of which require a new infrastructure layer to fix.
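Average time per call can hide an N+1 pattern, where a cheap query runs thousands of times. The following Python sketch ranks queries by their share of total execution time instead. It assumes you have already exported rows with the query, calls, and total_exec_time columns from pg_stat_statements into a list of dictionaries; the function name and sample rows are hypothetical:

```python
def rank_queries(stats):
    """Rank queries by share of total execution time.

    stats is a list of dicts with 'query', 'calls', and 'total_exec_time' (milliseconds).
    """
    total = sum(row["total_exec_time"] for row in stats)
    ranked = []
    for row in stats:
        if row["calls"] == 0:
            continue  # avoid dividing by zero for entries that were never executed
        ranked.append({
            "query": row["query"],
            "avg_ms": row["total_exec_time"] / row["calls"],
            "share": row["total_exec_time"] / total if total else 0.0,
        })
    return sorted(ranked, key=lambda r: r["share"], reverse=True)

# Hypothetical sample: a fast query called very often, and a slow report run rarely
sample = [
    {"query": "SELECT * FROM orders WHERE user_id = $1", "calls": 50000, "total_exec_time": 150000.0},
    {"query": "SELECT monthly_report()", "calls": 10, "total_exec_time": 20000.0},
    {"query": "UPDATE sessions SET seen_at = now()", "calls": 1000, "total_exec_time": 30000.0},
]
for row in rank_queries(sample):
    print(f"{row['share']:.0%}  avg {row['avg_ms']:.0f} ms  {row['query']}")
```

In this made-up sample the 3 ms query accounts for 75% of database time, even though it would sit at the bottom of a list sorted by average duration.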

The System Thinking Framework Every DevOps Engineer Needs

Most of the mistakes above share a common root cause: the engineer was thinking about one component in isolation instead of the full system.

A system thinker asks six questions before making any change in production:

Question	Why You Ask It
What does this change?	List every configuration, file, or service that will be different.
What does this depend on?	What must be true upstream for this component to work correctly?
What depends on this?	What downstream systems are affected if this changes or fails?
What is the failure mode?	Does this fail loudly (500 errors) or silently (wrong data)?
What is the rollback path?	How do you reverse this in under five minutes?
What does healthy look like after the change?	What metrics confirm everything is working correctly?

This is not a checklist you run through slowly. It is a thinking habit that becomes automatic with practice. Senior engineers do not spend more time on deployments than junior engineers do. They spend their time on different things, and this is one of them.
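While the habit is still forming, some teams find it useful to write the six answers down next to the change. A minimal Python sketch of such a record follows; the ChangePlan class and its field names are invented for this example and simply mirror the six questions above:

```python
from dataclasses import dataclass, fields

@dataclass
class ChangePlan:
    """One field per question in the table above; empty means not yet answered."""
    what_changes: str = ""
    depends_on: str = ""
    dependents: str = ""
    failure_mode: str = ""
    rollback_path: str = ""
    healthy_signal: str = ""

    def unanswered(self):
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

plan = ChangePlan(
    what_changes="Raise the API connection pool size from 10 to 25",
    rollback_path="Revert the config commit and redeploy",
)
print(plan.unanswered())
```

A non-empty result is a prompt to keep thinking, not a gate to automate around.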

Your Production Readiness Checklist

Use this checklist before any production system goes live. Mark each item as done, in progress, or not yet started.

Infrastructure

Infrastructure is defined as code (Terraform or CloudFormation) and version-controlled in Git
Separate dev, staging, and production environments exist with separate credentials
All production changes go through an automated CI/CD pipeline, with no manual SSH deployments
You can rebuild the entire production environment from code in under two hours

Security

No secrets, credentials, or API keys exist in any Git repository
All production secrets are in Secrets Manager or SSM Parameter Store
All IAM roles follow the principle of least privilege
S3 buckets have public access blocked by default
Port 22 is not open to 0.0.0.0/0 on any security group
CloudTrail is enabled in all regions
All IAM users have MFA enabled
AWS Security Hub is enabled and findings are reviewed weekly
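Some of these items can be checked mechanically. As one example, the Python sketch below flags security groups that expose port 22 to the whole internet. It assumes the input is the SecurityGroups list from the JSON output of aws ec2 describe-security-groups; the function name and sample groups are hypothetical:

```python
def world_open_ssh(security_groups):
    """Return the IDs of groups whose ingress rules expose port 22 to 0.0.0.0/0 or ::/0."""
    findings = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            protocol = perm.get("IpProtocol")
            if protocol == "-1":          # "-1" means all protocols and all ports
                covers_22 = True
            elif protocol == "tcp":
                covers_22 = perm.get("FromPort", 0) <= 22 <= perm.get("ToPort", 65535)
            else:
                covers_22 = False
            if not covers_22:
                continue
            cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
            cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
            if "0.0.0.0/0" in cidrs or "::/0" in cidrs:
                findings.append(group["GroupId"])
                break  # one finding per group is enough
    return findings

# Hypothetical groups shaped like the CLI output
groups = [
    {"GroupId": "sg-open", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-office", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "203.0.113.0/24"}]}]},
    {"GroupId": "sg-web", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
]
print(world_open_ssh(groups))
```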

Observability

Every service has a /health endpoint that monitoring checks continuously
Alerts fire within five minutes of a production error rate spike
Dashboards exist showing latency, error rate, and resource utilization
Logs are centralized and searchable, not scattered across individual servers

Reliability

Production database has Multi-AZ enabled
Backup restoration has been tested in the last 30 days
Written runbooks exist for the three most likely failure scenarios
RTO and RPO requirements are documented and the architecture meets them

Documentation

Every repository has a README explaining what it does and how to deploy it
A new engineer could understand the production architecture from documentation alone
No single engineer holds critical knowledge that lives only in their head

Conclusion

None of the mistakes in this article require rare misfortune to experience. They are the predictable result of decisions that feel reasonable under startup pressure but accumulate into real operational risk over time.

The good news is that every single one of them is preventable with the right awareness and the right habits applied early.

You do not need a perfect infrastructure from day one. You need a correct one: version-controlled, automated, observable, secure, and documented. Start with that foundation. Add complexity only when a specific, measured problem requires it. Always connect technical decisions to business outcomes.

The goal of DevOps in a startup is not to build impressive infrastructure. It is to build reliable systems that support product growth safely, efficiently, and sustainably, and to make sure that when something does break, you can recover faster than anyone notices.

Want to Go Deeper?

If this article resonated with you, The Startup DevOps Field Guide covers these principles in full depth with complete infrastructure blueprints, security frameworks, CI/CD pipeline templates, and the end-to-end decision-making playbook for engineers building DevOps practices in startup environments from scratch.

It is written specifically for the engineer who wants to do this right from the beginning, not the one rebuilding everything after the first major incident.

How to Migrate to S3 Native State Locking in Terraform

Tolani Akintayo — Thu, 07 May 2026 22:58:43 +0000

If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.

But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.

The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.

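In August 2024, Amazon S3 added support for conditional writes, which let a client create an object only if it doesn't already exist. Terraform 1.10, released in November 2024, built S3-native state locking on top of that capability, meaning DynamoDB is no longer required for state locking. The feature was experimental in 1.10 and has been generally available since Terraform 1.11.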

In this tutorial, you'll learn:

What S3 native locking is and how it works
How to set it up from scratch if you're starting a new project
How to migrate an existing S3 + DynamoDB setup to S3 native locking safely
How to verify locking is working and handle edge cases

By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.

What Is Terraform State Locking?
What Is S3 Native State Locking?
How S3 Native Locking Compares to the S3 + DynamoDB Approach
Prerequisites
Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch
Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking
How to Verify That Locking Is Working
How to Handle a Stuck Lock
Rollback Plan: If Something Goes Wrong
Security Best Practices for Your State Bucket
Conclusion
References

What Is Terraform State Locking?

Before looking at the new approach, it helps to understand what state locking is solving.

Terraform stores everything it knows about your infrastructure in a state file – a JSON document that maps your configuration to real AWS resources. When you run terraform apply, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.

The problem arises when two engineers or two CI/CD pipelines run and try to apply changes at the same time. If both read the state file simultaneously, calculate changes independently, and both try to write back, you get a race condition. The second write overwrites changes from the first, and your state is now out of sync with reality. This is a serious problem that can cause resources to be untracked, doubled, or destroyed unexpectedly.
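To make the race concrete, here is a minimal Python sketch that replays the sequence with plain dictionaries. The apply_change helper and the state layout are simplifications invented for this illustration and are not Terraform's real state format:

```python
import copy

def apply_change(snapshot, new_resource):
    """Compute the state a run would write back, based only on the snapshot it read."""
    new_state = copy.deepcopy(snapshot)
    new_state["serial"] += 1
    new_state["resources"].append(new_resource)
    return new_state

remote = {"serial": 1, "resources": ["vpc"]}

# Both runs read the same snapshot before either one writes back.
snapshot_a = copy.deepcopy(remote)
snapshot_b = copy.deepcopy(remote)

remote = apply_change(snapshot_a, "database")  # run A writes first
remote = apply_change(snapshot_b, "cache")     # run B overwrites A's write

print(remote)  # {'serial': 2, 'resources': ['vpc', 'cache']}
```

The database that run A created still exists in the cloud account, but the state no longer records it. That is the untracked-resource outcome described above.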

State locking solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.

Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ---------->|                            |
   |                                   |                            |
   |<------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |<------ 3. Acquire Lock ----|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |------- 4. Lock Denied ---->|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ---------->|                            |
   |                                   |                            |
   |           [COMPLETED]             |------- 6. Lock Granted --->|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |
   |                                   |                            |

What Is S3 Native State Locking?

Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a LockID primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.

S3 native locking moves the lock into the state bucket itself. Despite the similar name, it isn't built on S3 Object Lock, the feature designed to enforce WORM (Write Once, Read Many) retention for regulatory compliance. Instead, Terraform stores the lock as an ordinary object and relies on S3 conditional writes to make sure only one process can create it.

When S3 native locking is enabled in your Terraform backend:

Terraform writes your state to a .tfstate object in S3 (as before)
To acquire a lock, Terraform uses S3's conditional write operations (specifically the If-None-Match request header) to create a lock file atomically
If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held
When the operation completes, Terraform deletes the lock file to release the lock
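The steps above can be sketched in a few lines of Python. FakeBucket is an in-memory stand-in invented for this example, not a real S3 client, and its put_if_absent method only imitates what a PUT with If-None-Match: * does. In real S3 the create-if-absent check is enforced on the server side, which is what makes it safe across machines:

```python
class FakeBucket:
    """In-memory stand-in for an S3 bucket (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key, body):
        """Imitates a PUT with 'If-None-Match: *': refuse if the key already exists."""
        if key in self.objects:
            return False
        self.objects[key] = body
        return True

    def delete(self, key):
        self.objects.pop(key, None)

def acquire_lock(bucket, state_key, who):
    """Try to create the lock file next to the state object."""
    return bucket.put_if_absent(state_key + ".tflock", {"Who": who})

def release_lock(bucket, state_key):
    """Delete the lock file so the next run can acquire it."""
    bucket.delete(state_key + ".tflock")

bucket = FakeBucket()
key = "production/terraform.tfstate"
print(acquire_lock(bucket, key, "run-a"))  # True: the lock file did not exist
print(acquire_lock(bucket, key, "run-b"))  # False: run A still holds the lock
release_lock(bucket, key)
print(acquire_lock(bucket, key, "run-b"))  # True: the lock was released
```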

The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.

Note: This feature requires Terraform version 1.10.0 or later. Terraform's S3 backend documentation doesn't list S3 Object Lock as a requirement for use_lockfile, because the lock depends on conditional writes rather than on object retention. This tutorial still shows how to turn Object Lock on, both at bucket creation and on an existing bucket in Part 2, but you can treat those steps as optional.

How S3 Native Locking Compares to the S3 + DynamoDB Approach

Aspect	S3 + DynamoDB (Old)	S3 Native Locking (New)
AWS services required	S3 + DynamoDB	S3 only
IAM permissions needed	S3 + DynamoDB permissions	S3 permissions only
Terraform version	Any	1.10.0 or later
Setup complexity	Two resources, two IAM scopes	One resource
Stuck lock resolution	Delete DynamoDB record	Delete S3 lock file
Cost	S3 storage + DynamoDB on-demand	S3 storage only
Object Lock requirement	Not required	Not required (the lock file is a regular S3 object)
Locking mechanism	DynamoDB conditional writes	S3 conditional writes (`if-none-match`)
State versioning	S3 Versioning (recommended)	S3 Versioning (required for full safety)

The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.

Prerequisites

Before you start, make sure you have the following in place:

Terraform 1.10.0 or later installed. Check your version:

terraform version

If you need to upgrade, follow the official upgrade guide.

AWS CLI installed and configured with credentials that have permission to create and manage S3 buckets.

aws --version
aws sts get-caller-identity   # confirm you're authenticated

IAM permissions to perform the following S3 actions:
- s3:CreateBucket
- s3:PutBucketVersioning
- s3:PutEncryptionConfiguration
- s3:PutBucketPublicAccessBlock
- s3:PutBucketObjectLockConfiguration (only if you choose to enable Object Lock)
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket

For the migration path: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.
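If you want to enforce the Terraform version prerequisite in a script or CI job, a minimal Python sketch could look like this. It assumes the first line of terraform version output has the form "Terraform v1.10.0"; the function names are hypothetical, and pre-release suffixes such as -rc1 are ignored:

```python
import re

def parse_terraform_version(output):
    """Extract (major, minor, patch) from the text printed by 'terraform version'."""
    match = re.search(r"Terraform v(\d+)\.(\d+)\.(\d+)", output)
    if not match:
        raise ValueError("could not find a Terraform version in the output")
    return tuple(int(part) for part in match.groups())

def supports_native_locking(output, minimum=(1, 10, 0)):
    """Compare as integer tuples, so 1.10.0 correctly sorts after 1.9.8."""
    return parse_terraform_version(output) >= minimum

print(supports_native_locking("Terraform v1.9.8\non linux_amd64"))
print(supports_native_locking("Terraform v1.10.0\non darwin_arm64"))
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would rank "1.9.8" above "1.10.0".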

Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch

Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.

Step 1: Create the S3 Bucket with Versioning and Encryption

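Terraform's native locking doesn't depend on S3 Object Lock, so the --object-lock-enabled-for-bucket flag below is optional. Keep it only if you also want the option of WORM retention on this bucket, and bear in mind that Object Lock can't be turned off once it's enabled. If you omit the flag, skip the Object Lock check later in this step. Create the bucket using the AWS CLI: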

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

Note: For regions other than us-east-1, add the --create-bucket-configuration flag.

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket

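Now enable versioning on the bucket. S3 turns versioning on automatically for buckets created with Object Lock, but running the command explicitly is harmless and covers buckets created without it. Versioning lets you restore a previous state version if something goes wrong: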

aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled

Enable server-side encryption so your state files are encrypted at rest:

aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:

aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Verify the bucket configuration:

# Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state

Expected output for the Object Lock check:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
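The verification can also be scripted. The Python sketch below takes the parsed JSON output of aws s3api get-bucket-versioning, get-bucket-encryption, and get-public-access-block and returns a list of problems. The function name is hypothetical, and it deliberately leaves Object Lock out because locking doesn't depend on it:

```python
def bucket_findings(versioning, encryption, public_access_block):
    """Return human-readable problems found in the three parsed CLI outputs."""
    findings = []

    if versioning.get("Status") != "Enabled":
        findings.append("versioning is not enabled")

    rules = encryption.get("ServerSideEncryptionConfiguration", {}).get("Rules", [])
    algorithms = {
        rule.get("ApplyServerSideEncryptionByDefault", {}).get("SSEAlgorithm")
        for rule in rules
    }
    if not algorithms & {"AES256", "aws:kms", "aws:kms:dsse"}:
        findings.append("no default encryption rule")

    block = public_access_block.get("PublicAccessBlockConfiguration", {})
    for flag in ("BlockPublicAcls", "IgnorePublicAcls", "BlockPublicPolicy", "RestrictPublicBuckets"):
        if block.get(flag) is not True:
            findings.append(f"{flag} is not true")

    return findings

# Hypothetical outputs for a correctly configured bucket
ok = bucket_findings(
    {"Status": "Enabled"},
    {"ServerSideEncryptionConfiguration": {"Rules": [
        {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}, "BucketKeyEnabled": True}]}},
    {"PublicAccessBlockConfiguration": {
        "BlockPublicAcls": True, "IgnorePublicAcls": True,
        "BlockPublicPolicy": True, "RestrictPublicBuckets": True}},
)
print(ok)  # []
```

An empty list means all three checks passed. Pass an empty dictionary for any command that printed nothing, which is what get-bucket-versioning does on a bucket that has never had versioning configured.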

Step 2: Configure the Terraform Backend with Native Locking

In your Terraform project, create or update your backend.tf file:

terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0+
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}

The critical difference from the old configuration is the use_lockfile = true parameter. Notice what is absent: there's no dynamodb_table argument. No DynamoDB table. No second service.

Here's a direct comparison of the old and new configurations:

Old configuration (S3 + DynamoDB):

terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}

New configuration (S3 native locking):

terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}

Step 3: Initialize and Verify

Run terraform init to initialize the backend:

terraform init

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!

Run a plan to confirm everything is working end-to-end:

terraform plan

If locking is working, you'll see a brief pause while Terraform acquires the lock before the plan output appears. You'll also see the lock information if you look at the S3 bucket – a .tflock file will appear temporarily alongside your state file during the operation and disappear when it completes.

Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking

Follow this section if you have an existing Terraform setup using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.

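Important: Plan for a maintenance window, or at minimum a period where no Terraform operations are running. You're changing the backend configuration, so all team members and CI/CD pipelines should stop running terraform plan or terraform apply during the migration. The migration itself takes under 10 minutes. If pausing everything isn't practical, Terraform 1.10 and later also accept use_lockfile = true and dynamodb_table together and acquire both locks, which allows a gradual cutover.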

Step 1: Verify Your Current Setup

Before making any changes, document your existing backend configuration and confirm the state file is accessible:

# Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'

Check your current backend.tf and note the exact values:

# Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}

Run one final plan to confirm the current state is clean and there are no unexpected changes pending:

terraform plan

If the plan shows no changes, you're in a safe state to proceed.

Step 2 (Optional): Enable Object Lock on the Existing S3 Bucket

You can skip this step if all you need is state locking. Terraform's native locking relies on S3 conditional writes, not on Object Lock, so use_lockfile = true works on a bucket that has never had Object Lock enabled.

If you do want Object Lock on the state bucket for other reasons, AWS supports enabling it on an existing bucket, provided versioning is already enabled on that bucket. Once enabled, Object Lock can't be disabled.

Run the following AWS CLI command to enable Object Lock on your existing bucket:

aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'

Note: This command turns on the Object Lock capability without setting a retention mode or a default retention period, so existing and new objects can still be overwritten and deleted as before. The call is rejected if versioning isn't enabled on the bucket, so run the versioning check below first if you're unsure.

Verify Object Lock is now enabled:

aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket

Expected output:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}

Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):

aws s3api get-bucket-versioning \
  --bucket your-existing-bucket

Expected output:

{
    "Status": "Enabled"
}

If versioning isn't enabled, enable it before proceeding:

aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled

Step 3: Update the Terraform Backend Configuration

Update your backend.tf to remove the dynamodb_table argument and add use_lockfile = true:

terraform {
  backend "s3" {
    bucket  = "your-existing-bucket"
    key     = "path/to/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # Add this:
    use_lockfile = true

    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}

Your updated backend.tf should look like this:

terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}
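If you have many backends to update, the same edit can be scripted. The following Python sketch replaces an uncommented dynamodb_table line with use_lockfile = true. It is a line-based text rewrite under simple assumptions (one argument per line, no heredocs), not an HCL parser, and the function name is hypothetical. Review the resulting diff before committing it:

```python
import re

def migrate_backend(hcl_text):
    """Swap an uncommented dynamodb_table line for use_lockfile = true.

    Files without a dynamodb_table line are returned unchanged.
    """
    lines = hcl_text.splitlines()
    has_lockfile = any(re.match(r"\s*use_lockfile\s*=", line) for line in lines)
    out = []
    for line in lines:
        match = re.match(r"(\s*)dynamodb_table\s*=", line)
        if match:
            if not has_lockfile:
                out.append(f"{match.group(1)}use_lockfile = true")  # keep the indentation
                has_lockfile = True
            continue  # drop the dynamodb_table line
        out.append(line)
    return "\n".join(out) + ("\n" if hcl_text.endswith("\n") else "")

old = '''terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"
  }
}
'''
print(migrate_backend(old))
```

Running the function a second time on its own output changes nothing, so it is safe to apply across a directory tree more than once.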

Step 4: Reinitialize Terraform

Run terraform init with the -reconfigure flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):

terraform init -reconfigure

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!

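If you see an error here: Common causes are a Terraform version older than 1.10.0, which doesn't recognize the use_lockfile argument, or credentials that can't read the state bucket. Confirm your version with terraform version and your identity with aws sts get-caller-identity before proceeding.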

Step 5: Verify the Migration

Run a plan to confirm Terraform is working correctly with the new backend configuration:

terraform plan

The plan should:

Complete successfully
Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)
NOT mention DynamoDB anywhere in its output

To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:

╷
│ Error: Error acquiring the state lock
│
│ Error message: operation error S3: PutObject, https response error StatusCode: 412,
│ RequestID: ..., api error PreconditionFailed: At least one of the pre-conditions
│ you specified did not hold
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵

The Path field shows .tfstate.tflock, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.

Step 6: Clean Up the DynamoDB Table

Once you've confirmed the migration is working correctly and your team has run at least one successful plan and apply cycle using the new backend, you can remove the DynamoDB table.

Wait at least 24-48 hours before deleting the DynamoDB table if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.

When you're ready, delete the DynamoDB table:

aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table

Confirm the deletion:

aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table

Expected output:

An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found

This error confirms that the table is gone. The migration is complete.

If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run terraform apply to destroy it via Terraform rather than the CLI directly. This keeps your state clean:

# Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}

After removing the block, run:

terraform apply

Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.

How to Verify That Locking Is Working

After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.

Method 1: Observe the lock file during an operation

In one terminal, start a long-running plan against a configuration with many resources:

terraform plan

While it's running, in a second terminal, check for the lock file in S3:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

You should see a file like:

2026-05-06 14:22:01        512 terraform.tfstate.tflock

After the plan completes, run the same command again. The .tflock file should be gone.

Method 2: Read the lock file contents

While a plan is running, download and read the lock file to see its contents:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock && cat /tmp/current.lock

Expected output (formatted for readability):

{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}

This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.
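Because the lock is plain JSON, it is easy to inspect from a script. The Python sketch below parses a downloaded lock file and reports how long the lock has been held, which helps when deciding whether a lock looks stale. It assumes the Created field is a UTC timestamp ending in Z, as in the sample above; the function names are hypothetical:

```python
import json
import re
from datetime import datetime, timedelta, timezone

def parse_created(value):
    """Parse a UTC timestamp like 2026-05-06T14:22:01.123456789Z.

    Fractional seconds are truncated to microseconds, the finest unit datetime supports.
    """
    match = re.fullmatch(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z", value)
    if not match:
        raise ValueError(f"unexpected timestamp: {value!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    micros = int((match.group(2) or "0")[:6].ljust(6, "0"))
    return base + timedelta(microseconds=micros)

def lock_summary(lock_json, now):
    """Summarize who holds the lock and for how many minutes."""
    lock = json.loads(lock_json)
    age = now - parse_created(lock["Created"])
    return {
        "id": lock["ID"],
        "who": lock["Who"],
        "operation": lock["Operation"],
        "age_minutes": age.total_seconds() / 60,
    }

sample_lock = """{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}"""
# In real use, pass datetime.now(timezone.utc) instead of a fixed time.
fixed_now = datetime(2026, 5, 6, 15, 22, 1, 123456, tzinfo=timezone.utc)
print(lock_summary(sample_lock, fixed_now))
```

A large age is only a hint. Confirm that no run is actually in progress before treating the lock as stuck.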

How to Handle a Stuck Lock

With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the .tflock file from S3.

A lock can get stuck if:

A terraform apply or plan process was killed mid-execution
A CI/CD pipeline runner crashed during a Terraform operation
A network interruption prevented the lock release from completing

Here's how you can check for a stuck lock:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

If a .tflock file exists and no Terraform operation is currently running, it is a stuck lock.

You can also read the lock to understand who held it:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock && cat /tmp/stuck.lock

This tells you who (Who field) was running the operation, what operation it was (Operation field), and when it was acquired (Created field).

And you can force-unlock using Terraform like this:

terraform force-unlock LOCK-ID

Replace LOCK-ID with the ID value from the lock file contents. For example:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Terraform will confirm:

Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!

An alternative is to delete the lock file directly via CLI. If terraform force-unlock doesn't work (for example, because you are running in a CI environment without Terraform available), delete the lock file directly:

aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock

Only delete the lock file if you are certain no Terraform operation is currently running. Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.

Rollback Plan: If Something Goes Wrong

If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.

Step 1: Stop all Terraform operations in your team and CI/CD pipelines.

Step 2: Recreate the DynamoDB table if you already deleted it:

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Step 3: Revert backend.tf to the previous configuration:

terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}

Step 4: Reinitialize:

terraform init -reconfigure

Step 5: Verify:

terraform plan

The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.

Note: Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock is a bucket capability that's independent of Terraform's locking, so it can coexist with DynamoDB locking. Setting dynamodb_table in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled on the bucket.

Security Best Practices for Your State Bucket

Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:

Enable Versioning (Required)

Versioning isn't strictly required for the lock itself, but treat it as mandatory for any production state bucket. It ensures that if a state file is accidentally overwritten or corrupted, you can restore a previous version.

aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled

Block All Public Access (Non-Negotiable)

Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.

aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Enable Server-Side Encryption

Always encrypt state files at rest. AES256 is the minimum. If your organization requires KMS key management:

aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Apply Least-Privilege IAM Permissions

The role or user that Terraform uses to access the state bucket should have only the permissions it needs. Here's a minimal IAM policy for S3 native locking:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateList",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-state-bucket"
    },
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*"
    },
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*.tflock"
    }
  ]
}

Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.
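When reviewing existing Terraform roles after the migration, a small script can point out statements that are broader than they need to be. This Python sketch flags wildcard actions, wildcard resources, and leftover DynamoDB actions in a policy document. The function name and the checks themselves are choices made for this example, not an official linting rule set:

```python
def broad_statements(policy):
    """Return findings for Allow statements that look broader than state access requires."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    findings = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        label = stmt.get("Sid", "<no Sid>")
        if any(action in ("*", "s3:*") for action in actions):
            findings.append(f"{label}: wildcard action")
        if "*" in resources:
            findings.append(f"{label}: wildcard resource")
        if any(action.lower().startswith("dynamodb:") for action in actions):
            findings.append(f"{label}: DynamoDB action no longer needed for locking")
    return findings

# Hypothetical policy left over from the old setup
legacy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "Everything", "Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {"Sid": "OldLock", "Effect": "Allow",
         "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"],
         "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/terraform-state-lock"},
    ],
}
for finding in broad_statements(legacy):
    print(finding)
```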

Enable Access Logging

Log all access to your state bucket in CloudTrail or S3 server access logs. This gives you an audit trail of every time state was read, written, or locked:

aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'

Conclusion

AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.

Here's a summary of what you accomplished:

Understood what state locking is and why it's required for safe Terraform operations
Compared S3 native locking to the existing S3 + DynamoDB approach
Set up a fresh Terraform backend using S3 native locking with correct bucket configuration
Migrated an existing backend from S3 + DynamoDB to S3 native locking safely
Learned how to verify locking, handle stuck locks, and roll back if needed
Applied security best practices to the state bucket

This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.

If you are building or optimizing cloud infrastructure for a startup and want a complete reference for production-ready Terraform modules, CI/CD pipeline patterns, and infrastructure runbooks, check out The Startup DevOps Field Guide. It covers the full lifecycle of AWS infrastructure from initial setup to production reliability.


The Complete SOC 2 Type II Implementation Handbook for Engineers: A Month-by-Month Roadmap with Real Commands

Ayobami Adejumo — Tue, 05 May 2026 18:26:21 +0000

If your team is preparing for a SOC 2 Type II review, this handbook is for you. It's a self-contained guide to the exact 90-day timeline, 14 critical controls, and evidence collection infrastructure that auditors actually check.

Everyone publishes the controls list. But nobody publishes the week-by-week engineering calendar you'll need to follow to make sure your ducks are in a row.

Here is the exact 90-day timeline — including the mistakes that add 60 days (and how to avoid them).

What You'll Learn
Prerequisites
Weeks 1–2: The Scope Decision
Weeks 3–6: The 14 Controls That Must Be Active on Day 1
Weeks 7–10: The Evidence Collection Infrastructure
Weeks 11–14: Auditor Selection and Readiness Assessment
Weeks 15–18: The Observation Period
The 90-Day SOC2 Timeline at a Glance
What's Next
Resources

What You'll Learn

By the end of this guide, you'll know:

How to scope your SOC2 boundary correctly — the decision that determines everything else
The 14 controls that must be active on day 1 of your observation period
How to build evidence collection infrastructure that runs automatically
How to choose an auditor and run a readiness assessment
What happens during the observation period and how to close gaps without restarting the clock

Let's dive in.

Prerequisites

Before following along, you should have:

Knowledge:

Basic understanding of AWS services (EC2, RDS, S3, IAM, VPC)
Familiarity with Terraform or another infrastructure as code tool
Comfort reading GitHub Actions YAML workflows
A general understanding of what SOC2 is — if you are starting from scratch, read the AICPA's SOC2 overview first

Tools and access:

An AWS account with administrator access
A GitHub organisation with admin rights
Terraform installed (v1.0 or later)
Python 3.8 or later (for the evidence collector Lambda)
A compliance automation platform — Vanta or Drata — connected to your AWS account and GitHub organisation

Estimated time: 90 days end-to-end, with active engineering work of approximately 8–12 hours per week in the first six weeks, tapering to 2–4 hours per week during the observation period.

Weeks 1–2: The Scope Decision — What Is In and Out of Your SOC2 Boundary

What Most Teams Get Wrong

Most teams scope their SOC2 boundary too broadly. They include every AWS account, every service, every environment. This is a mistake — and here is exactly why.

A broader scope means more controls to implement, more evidence to collect, and more systems the auditor will examine.

Every system inside your boundary must satisfy all 14 controls. Including your development sandbox means your engineers' experimental environments must have GuardDuty enabled, CloudTrail logging, and branch-protected deployments. That adds weeks of work and months of evidence collection for systems that pose no risk to your customers.

A correctly bounded scope means you include only the systems that store, process, or transmit customer data — and you prove that everything else cannot reach those systems.

Bad scope (over-inclusive):

Entire AWS Organization
├── Production (in scope)
├── Staging (in scope)
├── Development (in scope)
├── Sandbox (in scope)
└── CI/CD (in scope)

Good scope (correctly bounded):

SOC2 Boundary
├── Production AWS Account (in scope)
├── Production EKS Cluster (in scope)
├── Production RDS (in scope)
└── Everything else (OUT of scope — proven by network segmentation)

The correctly bounded scope works because it draws the tightest defensible line around the systems that actually handle customer data. Everything outside that line is excluded — not by assumption, but by technical controls that prevent those systems from reaching anything inside the boundary.

The Scope Decision Framework

For every system in your infrastructure, ask these four questions:

Question	If YES	If NO
Does this system store, process, or transmit customer data?	✅ In scope	❌ Out of scope
Does this system affect the availability of customer-facing services?	✅ In scope	❌ Out of scope
Does this system have access to production credentials?	✅ In scope	❌ Out of scope
Can a compromise of this system lead to a customer data breach?	✅ In scope	❌ Out of scope

Any system where the answer to even one question is yes belongs inside your boundary.
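The four questions can be encoded as a small check so that every system in your inventory is classified the same way. This is a minimal sketch: the `SystemProfile` structure and the example systems are hypothetical names chosen for illustration, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Answers to the four scope questions for one system (hypothetical structure)."""
    name: str
    handles_customer_data: bool
    affects_customer_availability: bool
    has_production_credentials: bool
    compromise_can_expose_customer_data: bool


def in_scope(system: SystemProfile) -> bool:
    """A single 'yes' to any of the four questions puts the system in scope."""
    return any((
        system.handles_customer_data,
        system.affects_customer_availability,
        system.has_production_credentials,
        system.compromise_can_expose_customer_data,
    ))


def classify(systems):
    """Split an inventory into (in_scope_names, out_of_scope_names)."""
    inside = [s.name for s in systems if in_scope(s)]
    outside = [s.name for s in systems if not in_scope(s)]
    return inside, outside


# Example inventory (made-up systems for illustration)
inventory = [
    SystemProfile("production-rds", True, True, True, True),
    SystemProfile("ci-runner-with-deploy-keys", False, False, True, True),
    SystemProfile("dev-sandbox", False, False, False, False),
]
print(classify(inventory))
```

Note how the CI runner lands in scope purely because it holds production credentials, even though it never touches customer data directly.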

Network Segmentation — The Technical Proof That Your Boundary Holds

Network segmentation is the practice of dividing your infrastructure into isolated zones so that systems in one zone can't communicate with systems in another unless you explicitly allow it.

In the context of SOC2, it's the technical control that proves your out-of-scope systems genuinely can't reach your in-scope systems — not just by policy, but by infrastructure enforcement.

Without network segmentation, the SOC2 auditor can't trust that your boundary is real. A developer in your sandbox environment who can query your production database means the sandbox is effectively in scope, regardless of what your diagram says.

Here's the Terraform that implements network segmentation between your production and non-production environments. The network access control list (NACL) blocks all inbound traffic from the broader private IP range (10.0.0.0/8) into your in-scope production VPC, while the explicit aws_vpc_peering_connection comment documents the deliberate decision not to peer environments:

# This account has NO VPC peering to non-production environments.
# The absence of peering is itself the segmentation control.
# Do NOT add peering connections to this account without SOC2 scope review.

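resource "aws_network_acl" "deny_non_production" {
  vpc_id = aws_vpc.production.id
  # Associate this NACL with your production subnets via subnet_ids;
  # a NACL that is not associated with any subnet filters nothing.

  # Block all inbound traffic from non-production IP ranges.
  # If the production VPC's own CIDR sits inside 10.0.0.0/8, narrow this to the
  # specific non-production CIDRs, or traffic between production subnets is denied too.
  ingress {
    rule_no    = 100
    action     = "deny"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "10.0.0.0/8"
  }

  # Allow legitimate inbound traffic (HTTPS from internet)
  ingress {
    rule_no    = 200
    action     = "allow"
    from_port  = 443
    to_port    = 443
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
  }

  # NACLs are stateless: allow return traffic for connections opened from inside the VPC
  ingress {
    rule_no    = 300
    action     = "allow"
    from_port  = 1024
    to_port    = 65535
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
  }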

  # Allow all outbound (tighten this per your architecture)
  egress {
    rule_no    = 100
    action     = "allow"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "0.0.0.0/0"
  }

  tags = {
    Name        = "production-nacl"
    Environment = "production"
    Purpose     = "SOC2 network segmentation"
  }
}

Verify the segmentation with this command after applying the Terraform:

# Confirm no VPC peering connections exist from production to non-production
aws ec2 describe-vpc-peering-connections \
  --filters Name=status-code,Values=active \
  --query 'VpcPeeringConnections[*].{ID:VpcPeeringConnectionId,Requester:RequesterVpcInfo.VpcId,Accepter:AccepterVpcInfo.VpcId}' \
  --output table
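If you want this check to run on a schedule rather than by hand, the same logic can be applied to the JSON that `describe-vpc-peering-connections` returns. The sketch below assumes you have already loaded the `VpcPeeringConnections` list (for example from boto3 or from the CLI with `--output json`); the VPC IDs are placeholders.

```python
def active_peerings_for(vpc_id: str, connections: list) -> list:
    """Return IDs of active peering connections that involve the given VPC."""
    hits = []
    for conn in connections:
        if conn.get("Status", {}).get("Code") != "active":
            continue
        endpoints = {
            conn.get("RequesterVpcInfo", {}).get("VpcId"),
            conn.get("AccepterVpcInfo", {}).get("VpcId"),
        }
        if vpc_id in endpoints:
            hits.append(conn["VpcPeeringConnectionId"])
    return hits


# Fabricated sample shaped like the API response
sample_connections = [
    {
        "VpcPeeringConnectionId": "pcx-11111111",
        "Status": {"Code": "active"},
        "RequesterVpcInfo": {"VpcId": "vpc-staging"},
        "AccepterVpcInfo": {"VpcId": "vpc-production"},
    },
    {
        "VpcPeeringConnectionId": "pcx-22222222",
        "Status": {"Code": "deleted"},
        "RequesterVpcInfo": {"VpcId": "vpc-dev"},
        "AccepterVpcInfo": {"VpcId": "vpc-production"},
    },
]
print(active_peerings_for("vpc-production", sample_connections))
```

An empty list for your production VPC is the result you want to capture as evidence.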

The Deliverable: Your SOC2 Boundary Diagram

At the end of weeks 1–2, you need a boundary diagram — a visual document that shows every in-scope system, every out-of-scope system, and the segmentation controls between them.

Here is what the diagram should contain:

Include every AWS service, every data flow arrow, and a label on the segmentation control. This diagram becomes your primary scope evidence and is typically the first thing an auditor asks for.

Weeks 3–6: The 14 Controls That Must Be Active on Day 1

These 14 controls must be implemented and actively collecting evidence from day 1 of your observation period. If you add any of them late, the observation period clock for that control restarts from the implementation date — not from day 1 of the audit period.

Think of the observation period as a surveillance camera recording your infrastructure. The auditor watches the footage later. If the camera was not on when a specific event occurred, that event has no record — and the SOC2 control for it has a gap.
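To make the "camera was off" problem concrete, you can compute how many days of evidence each control is missing relative to day 1. This is an illustrative sketch; the control names and dates are made up.

```python
from datetime import date


def coverage_gaps(observation_start: date, implemented_on: dict) -> dict:
    """Return {control: days_without_evidence} for controls enabled after day 1."""
    gaps = {}
    for control, enabled in implemented_on.items():
        missing_days = (enabled - observation_start).days
        if missing_days > 0:
            gaps[control] = missing_days
    return gaps


# Hypothetical dates for illustration
start = date(2026, 1, 1)
controls = {
    "cloudtrail": date(2025, 12, 15),    # active before the period began
    "guardduty": date(2026, 1, 1),       # active on day 1
    "vpc-flow-logs": date(2026, 1, 20),  # enabled late
}
print(coverage_gaps(start, controls))
```

Any control that appears in the result has a window with no evidence, which is the gap the auditor will ask about.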

Control 1: MFA Enforcement (CC6.6)

Multi-Factor Authentication (MFA) requires a user to verify their identity using two independent factors — something they know (a password) and something they have (a phone or hardware key). Without MFA, a stolen password is sufficient to access your production systems.

SOC2 CC6.6 requires that access to systems is restricted to authorized users. MFA is the technical control that makes "authorized" meaningful. Without it, any password compromise is a production access event.

To implement MFA, you can use AWS IAM Identity Center (formerly SSO) connected to your identity provider (Okta, Google Workspace, or Azure AD). MFA is then enforced at the identity provider level — any user without MFA enrolled can't authenticate, regardless of which AWS service they're trying to reach.

# IAM Identity Center configuration. MFA itself is enforced at the IdP level;
# the resource below only passes the user's email through as a session
# attribute for attribute-based access control (ABAC).
# No IAM user has direct console or CLI access.
# All access goes through SSO sessions (8-hour expiry by default).

data "aws_ssoadmin_instances" "this" {}

resource "aws_ssoadmin_instance_access_control_attributes" "mfa" {
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]

  attribute {
    key = "email"
    value {
      source = ["$${path:email}"]
    }
  }
}

You can verify that no IAM users retain direct console access (which would bypass MFA):

# Any user listed here has signed in to the console with a password, bypassing SSO. Investigate immediately.
aws iam list-users \
  --query 'Users[?PasswordLastUsed!=`null`].[UserName,PasswordLastUsed]' \
  --output table
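The query above only surfaces users who have actually signed in with a password. A stricter check, sketched here, reads the IAM credential report and flags every user who has a console password enabled but no MFA device. It assumes you pass in the decoded CSV text from `aws iam get-credential-report`; the sample rows are fabricated and trimmed to the columns the function uses.

```python
import csv
import io


def console_users_without_mfa(report_csv: str) -> list:
    """Return IAM user names that have a console password but no MFA device."""
    flagged = []
    for row in csv.DictReader(io.StringIO(report_csv)):
        if row["password_enabled"] == "true" and row["mfa_active"] != "true":
            flagged.append(row["user"])
    return flagged


# Trimmed sample report (fabricated rows, only the columns used here)
sample = """user,password_enabled,mfa_active
<root_account>,not_supported,true
alice,true,false
deploy-bot,false,false
bob,true,true
"""
print(console_users_without_mfa(sample))
```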

Control 2: Infrastructure as Code (CC8.1)

Infrastructure as Code (IaC) means defining your cloud infrastructure in version-controlled code files (Terraform, Pulumi, or AWS CDK) rather than creating resources manually through the AWS console. Every infrastructure change is proposed in a pull request, reviewed by a colleague, and applied through an automated pipeline.

SOC2 CC8.1 covers change management — the requirement that every change to your production environment is documented, reviewed, and approved. Manual console changes produce no audit trail. If an engineer opens the AWS console and creates a security group without going through Terraform, that change is invisible to your SOC2 auditor. IaC makes every change reviewable and traceable.

Now let's see how to implement IaC here. This GitHub Actions workflow applies Terraform only from the main branch, after a pull request has been reviewed and approved. The workflow creates an immutable record of every infrastructure change:

# .github/workflows/terraform-apply.yml
name: Terraform Apply (Production)
on:
  push:
    branches: [main]
    paths: ['terraform/**']

permissions:
  id-token: write   # Required for AWS OIDC authentication
  contents: read

jobs:
  apply:
    name: Apply Infrastructure Changes
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval for production
    defaults:
      run:
        working-directory: terraform  # Matches the paths filter above

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/terraform-apply
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: "1.6.0"

      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -out=tfplan -input=false

      - name: Terraform Apply
        run: terraform apply -input=false tfplan

SOC2 evidence this produces: A GitHub Actions run log for every infrastructure change, showing who triggered it (the pull request author), when it was applied, and what changed.

Control 3: CloudTrail Enabled (CC7.1)

AWS CloudTrail is a service that records every API call made in your AWS account — who called it, when, from which IP address, and whether it succeeded. Think of it as the complete audit log of everything that has ever happened in your AWS environment.

SOC2 CC7.1 requires monitoring for security events. CloudTrail is the foundational logging layer — without it, you can't detect unauthorized access, investigate incidents, or prove to an auditor that your controls were operating as intended. An auditor who can't see historical AWS API activity can't verify that your access controls were enforced during the observation period.

To implement it, you'll want to enable multi-region CloudTrail so that activity in every AWS region is captured, including global services like IAM. You can ship logs to an S3 bucket with Object Lock enabled (Control 3 in the evidence collection section covers this) so logs can't be modified or deleted:

# Enable CloudTrail with log file validation and multi-region coverage
aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name your-cloudtrail-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events

# Start the trail (creation alone does not start logging)
aws cloudtrail start-logging --name production-audit-trail

# Verify the trail is active and logging
aws cloudtrail get-trail-status --name production-audit-trail \
  --query '{IsLogging:IsLogging,LatestDeliveryTime:LatestDeliveryTime}'
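A trail that is "on" but has stopped delivering logs is still an evidence gap, so it helps to check delivery freshness as well as the logging flag. The sketch below assumes you pass in the dictionary returned by boto3's `get_trail_status` (where `LatestDeliveryTime` is a timezone-aware `datetime`); the 24-hour threshold is an arbitrary illustration, not an AWS or SOC2 requirement.

```python
from datetime import datetime, timedelta, timezone


def trail_is_healthy(status: dict, now: datetime, max_delay_hours: int = 24) -> bool:
    """True if the trail is logging and delivered a log file recently."""
    if not status.get("IsLogging"):
        return False
    latest = status.get("LatestDeliveryTime")
    if latest is None:
        return False  # logging enabled but nothing has ever been delivered
    return now - latest <= timedelta(hours=max_delay_hours)


# Fabricated status for illustration
now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
status = {
    "IsLogging": True,
    "LatestDeliveryTime": datetime(2026, 1, 2, 11, 30, tzinfo=timezone.utc),
}
print(trail_is_healthy(status, now))
```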

Control 4: GuardDuty Enabled (CC7.2)

AWS GuardDuty is a threat detection service that analyses your CloudTrail logs, VPC Flow Logs, and DNS logs. It uses machine learning to identify suspicious behaviour — things like an EC2 instance communicating with a known malware server, an IAM user logging in from an unusual country, or unusual API call patterns that indicate credential theft.

SOC2 CC7.2 requires the use of detection tools to identify potential security events. GuardDuty is the monitoring layer that tells you when something anomalous is happening, not just what happened after the fact. Without it, you would only discover a compromise when the damage is done.

Here's the implementation:

# Enable GuardDuty — findings published every 15 minutes for active threats
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Verify GuardDuty is active
aws guardduty list-detectors --query 'DetectorIds' --output table

You can set up an EventBridge rule to route CRITICAL and HIGH severity GuardDuty findings to your incident response channel immediately. A finding sitting unreviewed for 90 days is a qualified SOC2 finding.
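One way to express that routing rule is an EventBridge event pattern that matches GuardDuty findings by numeric severity. The sketch below builds such a pattern and mirrors it in a local function so the routing logic can be unit-tested. It assumes a threshold of 7.0, which corresponds to the High severity band and above in GuardDuty's scale; confirm the bands against the current GuardDuty documentation before relying on it.

```python
import json


def guardduty_event_pattern(min_severity: float = 7.0) -> dict:
    """EventBridge pattern matching GuardDuty findings at or above min_severity."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def should_page(finding_event: dict, min_severity: float = 7.0) -> bool:
    """Local mirror of the pattern, useful for testing routing logic offline."""
    return (
        finding_event.get("source") == "aws.guardduty"
        and finding_event.get("detail-type") == "GuardDuty Finding"
        and finding_event.get("detail", {}).get("severity", 0) >= min_severity
    )


print(json.dumps(guardduty_event_pattern(), indent=2))
```

The printed JSON can be supplied as the `--event-pattern` value to `aws events put-rule`, with your incident channel's SNS topic or webhook target attached to the rule.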

Control 5: VPC Flow Logs (CC6.1)

VPC Flow Logs capture information about the IP traffic flowing through your Virtual Private Cloud — every accepted and rejected connection, including source IP, destination IP, port, protocol, and whether the traffic was allowed or denied. They are the network-level audit trail that CloudTrail doesn't provide.

SOC2 CC6.1 requires logical access controls and monitoring. VPC Flow Logs let you verify that your network segmentation is actually working (traffic you denied is showing as rejected in the logs), detect unexpected communication between services, and investigate security events at the network layer.

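# Create an IAM role for VPC Flow Logs to deliver to CloudWatch
aws iam create-role \
  --role-name vpc-flow-logs-role \
  --assume-role-policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Principal":{"Service":"vpc-flow-logs.amazonaws.com"},
      "Action":"sts:AssumeRole"
    }]
  }'

# Grant the role permission to write to CloudWatch Logs
# (without this, the flow log is created but delivery fails)
aws iam put-role-policy \
  --role-name vpc-flow-logs-role \
  --policy-name vpc-flow-logs-delivery \
  --policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Action":[
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource":"*"
    }]
  }'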

# Enable VPC Flow Logs for all traffic (ACCEPT and REJECT)
aws ec2 create-flow-logs \
  --resource-ids vpc-YOUR_PRODUCTION_VPC_ID \
  --resource-type VPC \
  --traffic-type ALL \
  --log-group-name /aws/vpc/flow-logs/production \
  --deliver-log-permission-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/vpc-flow-logs-role

# Verify flow logs are active
aws ec2 describe-flow-logs \
  --filter Name=resource-id,Values=vpc-YOUR_PRODUCTION_VPC_ID \
  --query 'FlowLogs[*].{Status:FlowLogStatus,LogGroup:LogGroupName}'
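Once flow logs are flowing, you can use them to confirm that segmentation is doing its job by looking for REJECT records from non-production address space. The sketch below parses records in the default flow log format and filters them by source CIDR. The sample records are fabricated, and the CIDR is whatever range your non-production environments actually use.

```python
import ipaddress

# Field order of the default flow log record format
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line: str) -> dict:
    """Parse one space-separated flow log record into a dict."""
    return dict(zip(FIELDS, line.split()))


def rejected_from(lines, cidr: str) -> list:
    """Return REJECT records whose source address falls inside cidr."""
    network = ipaddress.ip_network(cidr)
    hits = []
    for line in lines:
        record = parse_record(line)
        if record.get("action") != "REJECT":
            continue
        try:
            source = ipaddress.ip_address(record["srcaddr"])
        except (KeyError, ValueError):
            continue  # skip records without a usable source address
        if source in network:
            hits.append(record)
    return hits


# Fabricated sample records
sample_lines = [
    "2 123456789012 eni-0a1b2c3d 10.20.1.15 172.16.0.10 49152 5432 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.7 172.16.0.10 50000 443 6 10 5200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
]
for hit in rejected_from(sample_lines, "10.0.0.0/8"):
    print(hit["srcaddr"], "->", hit["dstaddr"], "port", hit["dstport"])
```

Rejected attempts from non-production ranges show the control working; accepted traffic from those ranges would show that the boundary does not hold.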

Control 6: Secrets Manager (CC6.7)

Secrets management means storing credentials (database passwords, API keys, certificates, and other sensitive configuration values) in a dedicated, access-controlled service (like AWS Secrets Manager or HashiCorp Vault) rather than in .env files, GitHub repository secrets, or hardcoded in application code.

SOC2 CC6.7 requires protecting sensitive system components from unauthorized access. A secret stored in an .env file committed to a repository is accessible to every developer with repo access, every CI/CD runner, and every engineer who has ever cloned the repo — including those who have since left the company.

Secrets Manager provides centralised storage, access logging, automatic rotation, and fine-grained IAM permissions so only specific services can retrieve specific secrets.

Let's look at the implementation — storing and rotating a secret:

# Store a database credential with automatic 90-day rotation
aws secretsmanager create-secret \
  --name production/postgresql/credentials \
  --description "Production PostgreSQL credentials — rotated every 90 days" \
  --secret-string '{
    "username": "app_user",
    "password": "REPLACE_WITH_STRONG_PASSWORD",
    "host": "your-rds-endpoint.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "dbname": "production"
  }'

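# Enable automatic rotation every 90 days
# (a rotation Lambda function is required; replace the placeholder ARN with yours)
aws secretsmanager rotate-secret \
  --secret-id production/postgresql/credentials \
  --rotation-lambda-arn arn:aws:lambda:us-east-1:YOUR_ACCOUNT_ID:function:YOUR_ROTATION_FUNCTION \
  --rotation-rules AutomaticallyAfterDays=90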

How your application retrieves the secret at runtime (no hardcoded credentials):

# Good: secret retrieved at runtime from Secrets Manager
import boto3
import json

def get_db_credentials():
    client = boto3.client('secretsmanager', region_name='us-east-1')
    response = client.get_secret_value(SecretId='production/postgresql/credentials')
    return json.loads(response['SecretString'])

# Bad: secret hardcoded in application code or .env file
DB_PASSWORD = "my_database_password_123"  # Never do this

The access log in CloudTrail records every time a secret is retrieved, by which IAM role, at what time. That log is your SOC2 evidence that secrets access is controlled and auditable.

Control 7: EBS Encryption (CC6.1)

EBS (Elastic Block Store) encryption ensures that the persistent disks attached to your EC2 instances are encrypted at rest using AES-256. If an AWS employee or an attacker gained physical access to the storage hardware, the data would be unreadable without the encryption key.

SOC2 CC6.1 requires protecting information assets from unauthorised access. Encryption at rest is the control that protects data in the event of physical storage compromise or an improperly decommissioned disk. Enabling encryption by default means every new EBS volume in that Region is encrypted automatically, including EKS node volumes and EC2 instance root volumes. The setting is per Region, so repeat it in every Region you use. RDS storage encryption is a separate, per-instance setting that you choose when the database is created, so verify it independently.

# Enable EBS encryption by default for all new volumes in this region
aws ec2 enable-ebs-encryption-by-default

# Verify it is enabled
aws ec2 get-ebs-encryption-by-default \
  --query 'EbsEncryptionByDefault'
# Expected output: true

# Check existing volumes — any showing false need to be migrated
aws ec2 describe-volumes \
  --query 'Volumes[?Encrypted==`false`].[VolumeId,Size,VolumeType]' \
  --output table

Any existing unencrypted volumes must be snapshot-and-replaced. The process: create a snapshot of the unencrypted volume, create a new encrypted volume from the snapshot, and swap it into the instance.

Control 8: S3 Block Public Access (CC6.1)

Amazon S3 buckets can be configured to allow public access — meaning anyone on the internet can read their contents without authentication. Block Public Access is an account-level and bucket-level setting that prevents any bucket from being made public, regardless of the bucket's own policy.

A misconfigured S3 bucket is one of the most common causes of data breaches in cloud environments. Block Public Access at the account level means a developer can't accidentally expose a bucket containing customer data, even if they set the wrong bucket policy. It's a guardrail, not just a policy.

# Block public access at the AWS account level — applies to all buckets
aws s3control put-public-access-block \
  --account-id YOUR_ACCOUNT_ID \
  --public-access-block-configuration \
    BlockPublicAcls=true,\
    IgnorePublicAcls=true,\
    BlockPublicPolicy=true,\
    RestrictPublicBuckets=true

# Verify account-level setting is active
aws s3control get-public-access-block \
  --account-id YOUR_ACCOUNT_ID

# Scan for any buckets that have public access enabled (should be zero)
aws s3api list-buckets --query 'Buckets[*].Name' --output text | \
  tr '\t' '\n' | while read bucket; do
    result=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "WARNING: $bucket has public access not fully blocked"
    fi
  done
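The shell loop above checks a single flag and prints nothing for buckets that have no Block Public Access configuration at all. A stricter evaluation, sketched here, requires all four flags and treats a missing configuration as a failure. It assumes you have already collected each bucket's `PublicAccessBlockConfiguration` (or `None` when the call returned no configuration); the bucket names are placeholders.

```python
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def fully_blocked(config) -> bool:
    """True only when all four Block Public Access flags are enabled."""
    if not config:
        return False  # no configuration at all counts as not blocked
    return all(config.get(flag) is True for flag in REQUIRED_FLAGS)


def buckets_needing_attention(configs: dict) -> list:
    """configs maps bucket name -> PublicAccessBlockConfiguration dict (or None)."""
    return sorted(name for name, cfg in configs.items() if not fully_blocked(cfg))


# Fabricated sample data
configs = {
    "customer-uploads": dict.fromkeys(REQUIRED_FLAGS, True),
    "legacy-assets": {**dict.fromkeys(REQUIRED_FLAGS, True), "BlockPublicPolicy": False},
    "scratch-bucket": None,
}
print(buckets_needing_attention(configs))
```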

Control 9: Branch Protection (CC8.1)

Branch protection is a GitHub setting that prevents engineers from pushing code directly to your main branch without going through a pull request that has been reviewed and approved by at least one other team member. It also requires your CI pipeline to pass before any code can be merged.

SOC2 CC8.1 requires change management — the requirement that every change to production systems is documented, reviewed, and approved. Without branch protection, an engineer can push directly to main, which deploys directly to production through your CI/CD pipeline, with no review and no audit trail. Branch protection is the technical enforcement of your change management policy.

The critical setting that most teams miss: the "Do not allow bypassing the above settings" option must be enabled. Without it, administrators can bypass branch protection — and a SOC2 auditor will flag this as a gap because it means your change management control can be circumvented.

# .github/settings.yml — enforces branch protection via code
# Requires the settings GitHub App: https://github.com/apps/settings

branches:
  - name: main
    protection:
      required_pull_request_reviews:
        required_approving_review_count: 1
        dismiss_stale_reviews: true
        require_code_owner_reviews: false
      required_status_checks:
        strict: true
        contexts:
          - "CI / test"
          - "Security / trivy-scan"
      enforce_admins: true         # Admins cannot bypass — this is critical
      restrictions: null           # No push restriction beyond the above
      allow_force_pushes: false
      allow_deletions: false

Here's how you can verify that branch protection is enforced and admins can't bypass it:

# Returns the branch protection rules including enforce_admins status
curl -H "Authorization: token YOUR_GITHUB_TOKEN" \
  https://api.github.com/repos/YOUR_ORG/YOUR_REPO/branches/main/protection \
  | jq '{enforce_admins: .enforce_admins.enabled, required_reviews: .required_pull_request_reviews.required_approving_review_count}'
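If you collect this response on a schedule, a small check can turn it into a pass/fail result for your evidence folder. The sketch below assumes the dictionary is the parsed JSON body from the branch protection endpoint shown above; the gap descriptions are my own wording.

```python
def branch_protection_gaps(protection: dict, min_reviews: int = 1) -> list:
    """Return a list of human-readable gaps; an empty list means the rule passes."""
    gaps = []
    if not (protection.get("enforce_admins") or {}).get("enabled", False):
        gaps.append("admins can bypass protection")
    reviews = protection.get("required_pull_request_reviews") or {}
    if reviews.get("required_approving_review_count", 0) < min_reviews:
        gaps.append("insufficient required approving reviews")
    if (protection.get("allow_force_pushes") or {}).get("enabled", False):
        gaps.append("force pushes allowed")
    if (protection.get("allow_deletions") or {}).get("enabled", False):
        gaps.append("branch deletion allowed")
    return gaps


# Fabricated responses for illustration
compliant = {
    "enforce_admins": {"enabled": True},
    "required_pull_request_reviews": {"required_approving_review_count": 1},
    "allow_force_pushes": {"enabled": False},
    "allow_deletions": {"enabled": False},
}
weak = {
    "enforce_admins": {"enabled": False},
    "required_pull_request_reviews": {"required_approving_review_count": 0},
    "allow_force_pushes": {"enabled": True},
    "allow_deletions": {"enabled": False},
}
print(branch_protection_gaps(compliant))
print(branch_protection_gaps(weak))
```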

Control 10: Container Image Scanning (CC7.4)

Container image scanning analyses your Docker images before deployment to identify known security vulnerabilities (CVEs) in the operating system packages and application dependencies they contain.

Trivy is an open-source scanner that checks the base image (Ubuntu, Alpine, and so on), all installed OS packages, and language-specific dependencies (npm, pip, Go modules) against the National Vulnerability Database.

SOC2 CC7.4 requires monitoring and identifying vulnerabilities. Every container you deploy contains a base image with OS packages — and those packages regularly receive CVE disclosures. A critical CVE left unpatched for 90 days in a production container is a SOC2 finding. Automated scanning in CI means every image is checked before it can deploy.

# .github/workflows/security-scan.yml
name: Security Scan
on: [push, pull_request]

jobs:
  trivy-scan:
    name: Container Vulnerability Scan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write   # Required to upload SARIF results
    steps:
      - uses: actions/checkout@v3

      - name: Build container image
        run: docker build -t app:${{ github.sha }} .

      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1          # Fail the pipeline on CRITICAL or HIGH findings

      - name: Upload results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        if: always()            # Upload even if scan found issues
        with:
          sarif_file: trivy-results.sarif

The scanner looks for:

CVEs in base image OS packages (for example, a critical OpenSSL vulnerability in your Ubuntu base)
Vulnerable versions of application dependencies (a known RCE in an npm package your app uses)
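Misconfigurations in the Dockerfile itself (running as root, using latest tags), provided you also run Trivy's misconfiguration scan against the repository, since the image scan above covers vulnerabilities only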

Results appear in the GitHub Security tab for your repository, giving you a historical record of every scan — which is your SOC2 evidence.

Control 11: Incident Response Plan (CC9.2)

An incident response plan is a written, tested procedure that defines exactly what your team does when a security event occurs — from the moment an alert fires through to customer notification and post-incident review.

SOC2 CC9.2 requires that you have a documented process for responding to security events and that you've tested it. The auditor will ask for the written runbook and evidence that a tabletop exercise (a simulated incident walkthrough) has been conducted within the observation period.

Your incident response runbook must include:

Severity classification: Definitions of P1 (production down, customer data at risk), P2 (degraded service, potential risk), and P3 (minor issue, no customer impact) — and the response SLA for each.
Escalation path: Exactly who gets paged at each severity level, with contact details. Not "the on-call engineer" — specific names and a backup if the first person doesn't respond within 10 minutes.
First 15 minutes: The specific steps to take immediately — isolate the affected system, assess the scope, notify the incident channel, begin the timeline log.
Communication templates: Pre-written Slack messages, customer email templates, and regulatory notification templates (GDPR requires notification within 72 hours, HIPAA within 60 days).
Post-incident review: The blameless postmortem process, the 5-why root cause analysis template, and the action item tracking process.

Conduct a tabletop exercise at least once during your observation period: gather your engineering team for 45 minutes, simulate a realistic scenario (for example, "an AWS access key was committed to a public GitHub repo"), and walk through the runbook together. Document the meeting date, attendees, scenario, gaps found, and remediation actions. This document is your evidence.

Control 12: Access Reviews (CC6.3)

An access review is a quarterly audit of who has access to what in your production systems — AWS accounts, GitHub repositories, production databases, and every SaaS tool that touches customer data. You verify that every person on the list still works at the company and still needs the access their role grants them.

SOC2 CC6.3 requires that access is revoked when it's no longer needed. Former employees who retain access to production AWS accounts represent a genuine security risk and a definitive SOC2 finding.

In every access review I've conducted, at least 3–5 former employees or contractors still had active access that should have been revoked.

The quarterly access review checklist:

# 1. IAM users — list all with their last login date
aws iam generate-credential-report
aws iam get-credential-report --output text --query Content \
  | base64 --decode | cut -d',' -f1,5 | column -t -s ','

# 2. IAM roles — find roles that have not been used in 90+ days
aws iam get-account-authorization-details \
  --query 'RoleDetailList[*].{Role:RoleName,LastUsed:RoleLastUsed.LastUsedDate}' \
  --output table

# 3. Verify AWS SSO user list matches your current employee list
aws identitystore list-users \
  --identity-store-id YOUR_IDENTITY_STORE_ID \
  --query 'Users[*].{Name:DisplayName,Email:Emails[0].Value}' \
  --output table

Cross-reference the output against your current employee list in your HR system. Document every change made — access removed, permissions reduced, accounts disabled. The documented changes are the evidence that the review was conducted meaningfully, not just as a checkbox exercise.
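The cross-reference itself is a set comparison, which is easy to script so the result is reproducible each quarter. A minimal sketch, assuming you have exported both lists as plain email addresses (the addresses below are fabricated):

```python
def access_review(identity_emails, employee_emails) -> dict:
    """Compare who has access against who is currently employed."""
    active = {email.strip().lower() for email in employee_emails}
    granted = {email.strip().lower() for email in identity_emails}
    return {
        "revoke": sorted(granted - active),         # access without employment
        "no_access_yet": sorted(active - granted),  # employed, not provisioned
    }


# Fabricated sample data
identity_store = ["Alice@example.com", "bob@example.com", "contractor@example.com"]
hr_system = ["alice@example.com", "bob@example.com", "dana@example.com"]
print(access_review(identity_store, hr_system))
```

Save the output alongside the record of what you removed; the pair shows both the finding and the remediation.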

Control 13: Backup Verification (CC9.5)

Backup verification is the process of actually restoring your backups to confirm they work — not just confirming that backups are being created. A backup that has never been tested doesn't exist from a recovery perspective.

SOC2 CC9.5 requires that recovery procedures are tested. If your production database is corrupted and you discover for the first time during the incident that your automated RDS snapshots can't be restored, you have both a disaster recovery failure and a SOC2 finding.

How to test your RDS backup:

# Step 1: Find your most recent production snapshot
aws rds describe-db-snapshots \
  --db-instance-identifier your-production-db \
  --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text

# Step 2: Restore the snapshot to a test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier backup-verification-test \
  --db-snapshot-identifier YOUR_SNAPSHOT_ID \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible \
  --tags Key=Purpose,Value=backup-verification Key=Environment,Value=test

# Step 3: Wait for the restore to complete (typically 5–15 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier backup-verification-test

# Step 4: Connect and verify data integrity (spot check key tables)
# Run this against the restored instance
psql -h RESTORED_INSTANCE_ENDPOINT -U your_user -d your_database \
  -c "SELECT COUNT(*) FROM users; SELECT MAX(created_at) FROM orders;"

# Step 5: Document the test result and delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier backup-verification-test \
  --skip-final-snapshot

Document the test date, the snapshot used, the restore time, the data verification query results, and who conducted the test. Run this quarterly at minimum. This documentation is your SOC2 evidence for CC9.5.
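Capturing those fields in a consistent structure makes the evidence easier to file and compare across quarters. The sketch below is one possible shape; the `BackupTestRecord` class, its field names, and the sample values are hypothetical, not a format any auditor mandates.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class BackupTestRecord:
    """One backup restore test, with the fields listed above."""
    test_date: date
    snapshot_id: str
    restore_minutes: int
    verification_results: dict
    tested_by: str

    def to_json(self) -> str:
        payload = asdict(self)
        payload["test_date"] = self.test_date.isoformat()
        return json.dumps(payload, indent=2, sort_keys=True)


# Fabricated example values
record = BackupTestRecord(
    test_date=date(2026, 3, 2),
    snapshot_id="rds:your-production-db-2026-03-02-04-00",
    restore_minutes=12,
    verification_results={"users_count": 48211, "latest_order": "2026-03-02T03:58:41Z"},
    tested_by="jane.doe@example.com",
)
print(record.to_json())
```

The JSON output can be uploaded to the same evidence bucket you use for other controls.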

Control 14: Change Management Log (CC8.1)

A change management log is the auditable record of every change made to your production environment — what changed, who approved it, and when it was applied.

SOC2 CC8.1 requires that changes to your production environment are authorized and documented. With IaC and GitOps in place, you already have two separate sources of immutable change history that together satisfy this control.

GitHub Pull Request history provides the record of every code and infrastructure change: who opened the PR, who reviewed and approved it, what the CI status was, and when it was merged. This is your change management log for application and infrastructure changes.

ArgoCD sync history provides the record of every deployment to your Kubernetes cluster: which application was synced, from which Git commit, at what time, and whether the sync succeeded.

To export the ArgoCD sync history as evidence:

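# Export ArgoCD application sync history as JSON evidence
# (`argocd app history` prints a table, so read the history from the app's JSON status; requires jq)
argocd app get YOUR_APP_NAME --output json | jq '.status.history' > argocd-sync-history-$(date +%Y%m).json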

# Upload to your SOC2 evidence bucket
aws s3 cp argocd-sync-history-$(date +%Y%m).json \
  s3://your-soc2-evidence-bucket/change-management/$(date +%Y/%m)/

# For each deployment, the evidence contains:
# - App name, deployed revision (Git commit SHA)
# - Deployment timestamp
# - Initiating user or automated sync
# - Success/failure status

Together, the GitHub PR history and the ArgoCD sync history give the auditor a complete, tamper-evident record of every change to your production environment during the observation period.
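
Because the two histories are separate exports, it can help to cross-check them yourself before the auditor does. Here is a minimal sketch of that check. It assumes you have already flattened both exports into lists of dictionaries, and the field names (`revision`, `merge_commit_sha`, `approvals`) are hypothetical, not the exact keys either tool emits.

```python
def find_unreviewed_deployments(deployments, merged_prs):
    """Return deployments whose revision has no approved, merged pull request."""
    approved_shas = {
        pr["merge_commit_sha"] for pr in merged_prs if pr["approvals"] > 0
    }
    return [d for d in deployments if d["revision"] not in approved_shas]


# Illustrative data only.
deployments = [
    {"app": "api", "revision": "a1b2c3", "deployed_at": "2024-03-01T10:00:00Z"},
    {"app": "api", "revision": "d4e5f6", "deployed_at": "2024-03-02T15:30:00Z"},
]
merged_prs = [
    {"number": 101, "merge_commit_sha": "a1b2c3", "approvals": 1},
]

unreviewed = find_unreviewed_deployments(deployments, merged_prs)
for deployment in unreviewed:
    print(f"No approved PR found for revision {deployment['revision']}")
```

Any revision this reports is a deployment you should be able to explain, or a change that bypassed review.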

Weeks 7–10: The Evidence Collection Infrastructure

Evidence is the difference between passing and failing SOC2.

You might be wondering: what exactly is evidence? In SOC2 terms, evidence is the documentation that proves a specific control was operating correctly during a specific point in time within the observation period. A policy document says you will do something. Evidence proves you did it — and that you did it continuously, not just the week before the audit.

For example:

For MFA enforcement (Control 1), evidence is a screenshot of your IAM Identity Center MFA settings taken at a specific date during the observation period, combined with an IAM credential report showing zero IAM users with console access.
For GuardDuty (Control 4), evidence is the GuardDuty console screenshot showing active detectors, plus your documented response to any findings during the period.
For access reviews (Control 12), evidence is the completed access review document with dates, names, and specific access changes made.

The challenge is collecting this evidence continuously across 3–12 months without spending hundreds of hours on manual work. The solution is automated evidence collection infrastructure.

The Evidence Bucket — Tamper-Proof Storage for Your Audit Evidence

The evidence bucket is an S3 bucket with Object Lock enabled in GOVERNANCE mode. Object Lock prevents any object from being deleted or modified for the retention period you specify — in this case, 365 days. This means once a piece of evidence is uploaded, it can't be altered, even by a user with administrator access (without explicitly overriding the lock, which itself creates an audit trail).

This tamper-evident property is what gives the auditor confidence that the evidence was not created or modified after the fact.

# terraform/soc2-evidence-bucket.tf

resource "aws_s3_bucket" "soc2_evidence" {
  bucket = "${var.company_name}-soc2-evidence-${var.environment}"
}

# Block all public access to the evidence bucket
resource "aws_s3_bucket_public_access_block" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning so overwrites create new versions, not replacements
resource "aws_s3_bucket_versioning" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Object Lock in GOVERNANCE mode — objects cannot be deleted for 365 days
resource "aws_s3_bucket_object_lock_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    default_retention {
      mode = "GOVERNANCE"
      days = 365
    }
  }
}

# Encrypt all evidence at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

The Daily Evidence Collector Lambda

This Lambda function runs automatically every day and exports the status of each critical control to a time-stamped JSON file in the evidence bucket. Over your 3–12 month observation period, it creates a daily record proving that your controls were active and operating.

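The function checks its controls automatically. The version shown here covers five of them: CloudTrail status, GuardDuty status (including the count of unresolved HIGH and CRITICAL findings), VPC Flow Logs, EBS encryption by default, and the account-level S3 public access block. An MFA compliance check would follow the same pattern but is not part of this listing. Each daily snapshot is uploaded with an Object Lock retention date so it can't be modified.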

# lambda/evidence-collector/handler.py

import boto3
import json
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    """
    Daily SOC2 evidence collector.
    Runs at 00:00 UTC every day via EventBridge scheduler.
    Exports control status to S3 evidence bucket with Object Lock.
    """
    evidence = {
        'collection_timestamp': datetime.now(timezone.utc).isoformat(),
        'collection_date': datetime.now(timezone.utc).strftime('%Y-%m-%d'),
        'account_id': boto3.client('sts').get_caller_identity()['Account'],
        'controls': {}
    }

    # Control 3: CloudTrail status
    cloudtrail = boto3.client('cloudtrail')
    trails = cloudtrail.describe_trails(includeShadowTrails=False)['trailList']
    multi_region_trails = [t for t in trails if t.get('IsMultiRegionTrail')]
    evidence['controls']['cloudtrail'] = {
        'status': 'PASS' if multi_region_trails else 'FAIL',
        'detail': f"{len(multi_region_trails)} multi-region trail(s) active",
        'trails': [t['Name'] for t in multi_region_trails]
    }

    # Control 4: GuardDuty status
    guardduty = boto3.client('guardduty')
    detectors = guardduty.list_detectors()['DetectorIds']
    unresolved_critical = 0
    for detector_id in detectors:
        findings = guardduty.list_findings(
            DetectorId=detector_id,
            FindingCriteria={
                'Criterion': {
                    'severity': {'Gte': 7},  # HIGH and CRITICAL only
                    'service.archived': {'Eq': ['false']}
                }
            }
        )
        unresolved_critical += len(findings['FindingIds'])

    evidence['controls']['guardduty'] = {
        'status': 'PASS' if detectors else 'FAIL',
        'detail': f"{len(detectors)} detector(s) active, {unresolved_critical} unresolved HIGH/CRITICAL findings",
        'unresolved_high_critical': unresolved_critical
    }

    # Control 5: VPC Flow Logs
    ec2 = boto3.client('ec2')
    flow_logs = ec2.describe_flow_logs(
        Filters=[{'Name': 'resource-type', 'Values': ['VPC']},
                 {'Name': 'flow-log-status', 'Values': ['ACTIVE']}]
    )['FlowLogs']
    evidence['controls']['vpc_flow_logs'] = {
        'status': 'PASS' if flow_logs else 'FAIL',
        'detail': f"{len(flow_logs)} active VPC flow log(s)",
        'active_flow_logs': len(flow_logs)
    }

    # Control 7: EBS encryption by default
    ebs_encryption = ec2.get_ebs_encryption_by_default()['EbsEncryptionByDefault']
    evidence['controls']['ebs_encryption_by_default'] = {
        'status': 'PASS' if ebs_encryption else 'FAIL',
        'detail': 'EBS encryption by default is enabled' if ebs_encryption else 'EBS encryption by default is NOT enabled'
    }

    # Control 8: S3 Block Public Access (account level)
    s3control = boto3.client('s3control')
    account_id = evidence['account_id']  # Reuse the account ID fetched above instead of calling STS again
    try:
        pab = s3control.get_public_access_block(AccountId=account_id)['PublicAccessBlockConfiguration']
        all_blocked = all([pab['BlockPublicAcls'], pab['IgnorePublicAcls'],
                           pab['BlockPublicPolicy'], pab['RestrictPublicBuckets']])
        evidence['controls']['s3_block_public_access'] = {
            'status': 'PASS' if all_blocked else 'FAIL',
            'detail': 'All four S3 Block Public Access settings enabled' if all_blocked else 'One or more S3 Block Public Access settings not enabled',
            'configuration': pab
        }
    except Exception as e:
        evidence['controls']['s3_block_public_access'] = {'status': 'FAIL', 'detail': str(e)}

    # Upload evidence to S3 with Object Lock
    s3 = boto3.client('s3')
    evidence_key = f"daily/{evidence['collection_date']}/control-status.json"
    lock_until = datetime.now(timezone.utc) + timedelta(days=365)

    s3.put_object(
        Bucket='YOUR_EVIDENCE_BUCKET_NAME',
        Key=evidence_key,
        Body=json.dumps(evidence, indent=2),
        ContentType='application/json',
        ObjectLockMode='GOVERNANCE',
        ObjectLockRetainUntilDate=lock_until
    )

    # Alert if any control fails
    failed_controls = [k for k, v in evidence['controls'].items() if v['status'] == 'FAIL']
    if failed_controls:
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='YOUR_ALERT_TOPIC_ARN',
            Subject=f'SOC2 Control Failure Detected — {evidence["collection_date"]}',
            Message=f'The following controls failed their daily check:\n\n{json.dumps(failed_controls, indent=2)}'
        )

    return {
        'statusCode': 200,
        'controls_checked': len(evidence['controls']),
        'controls_failed': len(failed_controls),
        'evidence_location': f"s3://YOUR_EVIDENCE_BUCKET_NAME/{evidence_key}"
    }
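
One detail worth checking in the handler above is that `list_findings` returns results in pages, so a single call may undercount when there are many findings. The sketch below shows one way to follow `NextToken` until the results are exhausted. The helper name `count_findings` is an assumption, and `FakeGuardDuty` is only a stand-in so the loop can be exercised without AWS credentials; in the Lambda you would pass `boto3.client('guardduty')` instead.

```python
# Same filter as the handler: unarchived findings with severity 7 or higher.
HIGH_UNARCHIVED = {
    "Criterion": {
        "severity": {"Gte": 7},
        "service.archived": {"Eq": ["false"]},
    }
}


def count_findings(guardduty_client, detector_id):
    """Count matching findings across every page of list_findings results."""
    total = 0
    next_token = None
    while True:
        kwargs = {"DetectorId": detector_id, "FindingCriteria": HIGH_UNARCHIVED}
        if next_token:
            kwargs["NextToken"] = next_token
        page = guardduty_client.list_findings(**kwargs)
        total += len(page["FindingIds"])
        next_token = page.get("NextToken")
        if not next_token:
            return total


class FakeGuardDuty:
    """Stand-in for the real client, used only to exercise the loop locally."""

    def __init__(self, pages):
        self._pages = pages

    def list_findings(self, **kwargs):
        index = int(kwargs.get("NextToken", 0))
        page = {"FindingIds": self._pages[index]}
        if index + 1 < len(self._pages):
            page["NextToken"] = str(index + 1)
        return page


fake = FakeGuardDuty([["f1", "f2"], ["f3"]])
total = count_findings(fake, "detector-123")
print(f"{total} unresolved HIGH/CRITICAL findings")
```

Passing the client in as a parameter also makes the check easy to unit test, which matters for code whose output you will hand to an auditor.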

The GitHub Actions Evidence Workflow

This workflow runs daily and captures evidence that can't be automated through AWS APIs — GitHub-level controls like branch protection status, recent pull request activity, and CI pipeline results. It exports these as JSON files to the same evidence bucket.

# .github/workflows/soc2-evidence.yml
name: SOC2 Evidence Collection
on:
  schedule:
    - cron: '0 1 * * *'   # 01:00 UTC daily (after the Lambda runs at 00:00)
  workflow_dispatch:        # Allow manual trigger when needed

permissions:
  contents: read
  id-token: write   # Required so configure-aws-credentials can assume the role via OIDC

jobs:
  collect-github-evidence:
    name: Collect GitHub Control Evidence
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/evidence-collector
          aws-region: us-east-1

      - name: Collect branch protection status
        run: |
          DATE=$(date +%Y-%m-%d)
          mkdir -p evidence/github

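          # Export branch protection rules for main
          # Note: this endpoint needs admin-level read access to the repository. If the default
          # GITHUB_TOKEN is rejected, use a token with that access stored as a repository secret.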
          curl -s -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
            "https://api.github.com/repos/${{ github.repository }}/branches/main/protection" \
            | jq '{
                date: "'$DATE'",
                enforce_admins: .enforce_admins.enabled,
                required_reviews: .required_pull_request_reviews.required_approving_review_count,
                required_status_checks: .required_status_checks.contexts,
                allow_force_pushes: .allow_force_pushes.enabled
              }' > evidence/github/branch-protection-$DATE.json

          echo "Branch protection evidence collected"
          cat evidence/github/branch-protection-$DATE.json

      - name: Upload evidence to S3
        run: |
          DATE=$(date +%Y-%m-%d)
          aws s3 sync evidence/ \
            s3://${{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/$DATE/github/ \
            --no-progress
          echo "Evidence uploaded: s3://${{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/$DATE/github/"
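
Collecting the branch protection snapshot is only half the job. You also want to know when a snapshot shows a weakened rule. Here is a minimal sketch of a check you could run over each exported file. It reads the same keys the `jq` filter above writes, while the function name and the `min_reviews` threshold are assumptions.

```python
def evaluate_branch_protection(snapshot, min_reviews=1):
    """Return a list of problems found in one branch protection snapshot."""
    problems = []
    if not snapshot.get("enforce_admins"):
        problems.append("admins can bypass branch protection")
    if (snapshot.get("required_reviews") or 0) < min_reviews:
        problems.append(f"fewer than {min_reviews} approving review(s) required")
    if not snapshot.get("required_status_checks"):
        problems.append("no required status checks configured")
    if snapshot.get("allow_force_pushes"):
        problems.append("force pushes are allowed")
    return problems


# Illustrative snapshot in the shape produced by the workflow step.
snapshot = {
    "date": "2024-03-01",
    "enforce_admins": True,
    "required_reviews": 1,
    "required_status_checks": ["build", "trivy-scan"],
    "allow_force_pushes": False,
}
problems = evaluate_branch_protection(snapshot)
print("PASS" if not problems else f"FAIL: {problems}")
```

An empty list means the snapshot matches the policy. Anything else is worth an alert on the day it happens, rather than a surprise during the audit.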

Weeks 11–14: Auditor Selection and Readiness Assessment

How to Choose a SOC2 Auditor

Selecting the right auditor is more consequential than most teams realize. SOC2 audits are conducted by CPA firms, specifically firms licensed to issue SOC reports. The right firm has experience with cloud-native SaaS companies of your size. The wrong firm could apply enterprise audit frameworks to a seed-stage startup and generate findings based on controls that aren't appropriate to your context.

Here is what to look for and what to watch out for:

Experience matters more than brand

A large Big Four firm isn't necessarily better than a specialist boutique auditor for a 20-person SaaS company.

Ask specifically: "How many SOC2 audits have you completed in the last 12 months for SaaS companies between 10 and 50 employees?" You want a firm where this is common, not exceptional.

Verify familiarity with your compliance tool

If you're using Vanta or Drata, confirm that the auditor has experience with evidence produced by those platforms. Some auditors prefer to collect evidence directly and are unfamiliar with automated evidence exports. An auditor who doesn't trust your Vanta evidence will ask you to re-collect everything manually.

Understand what Type II actually costs

For a Series A SaaS company, expect $15,000–$30,000 for a SOC2 Type II audit with a 3-month observation period. A quote below $10,000 often means the auditor is cutting corners on the review depth. A quote above $50,000 for a small company typically means the firm is applying enterprise pricing to a startup engagement.

Get references from similar companies

Ask the auditor for two or three references from SaaS companies they've audited in the last year. Call those references and ask: did the auditor understand cloud infrastructure? Were the findings reasonable? How was the communication during the review?

Here's a summary table of some things to watch out for:

Criteria	What to Look For	Red Flag
Experience	5+ years, 20+ SaaS audits annually	"We have completed several SOC2 audits" (vague)
Tool familiarity	Has reviewed Vanta/Drata evidence before	Requires manual re-collection of automated evidence
Company size fit	Has audited companies your size	Only lists enterprise clients as references
Cost (Type II)	$15K–$30K for a 20-person company	Under $10K or over $50K without clear justification
References	Can provide SaaS company contacts to call	Cannot provide references

How to Run a Readiness Assessment (Mock Audit)

A readiness assessment is a self-conducted simulation of the real audit, run 2–4 weeks before you engage the auditor. Its purpose is to find and close gaps before the auditor finds them, because gaps found in a mock audit cost you a week of remediation time, while gaps found in the real audit cost you a conditional report and a re-review.

You can run the readiness assessment yourself or hire a consultant to run it. The consultant approach is more valuable because an independent reviewer will find gaps you have rationalised away.

The process:

Step 1: Work through every control in the checklist below and attempt to produce the evidence that an auditor would request.
Step 2: For every control where you can't produce clear, timestamped evidence: that's a gap. Document it.
Step 3: Prioritise gaps by type. Evidence gaps (missing evidence for an active control) require evidence collection infrastructure fixes. Control gaps (a control that isn't implemented) require engineering work.
Step 4: Close all gaps before engaging the real auditor.

Control	Evidence Required	How to Verify	Ready?
MFA enforced	IAM credential report + SSO MFA policy screenshot	`aws iam get-credential-report`	⬜
CloudTrail active	Trail status + S3 delivery confirmation	`aws cloudtrail get-trail-status`	⬜
GuardDuty active	Detector list + finding review log	`aws guardduty list-detectors`	⬜
VPC Flow Logs	Active flow log list + sample log entries	`aws ec2 describe-flow-logs`	⬜
Secrets in Secrets Manager	Secret list + rotation policy confirmation	`aws secretsmanager list-secrets`	⬜
EBS encryption by default	Account-level encryption setting	`aws ec2 get-ebs-encryption-by-default`	⬜
S3 Block Public Access	Account-level PAB configuration	`aws s3control get-public-access-block`	⬜
Branch protection (no admin bypass)	GitHub branch protection API response	GitHub API or Settings UI	⬜
Trivy scanning in CI	GitHub Actions run history showing scans	GitHub Actions logs	⬜
Incident response runbook	Written runbook + tabletop exercise notes with date	Document review	⬜
Access review	Quarterly review document with specific changes made	Document review	⬜
Backup test	RDS restore log + data verification results	Document review	⬜
Change management log	GitHub PR history + ArgoCD sync history	GitHub and ArgoCD	⬜
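
If you track the checklist in a structured form, sorting gaps by type (Step 3) becomes mechanical. The following is a minimal sketch under that assumption. The `implemented` and `evidence_available` fields are hypothetical names for the two questions you answer for each control.

```python
def classify_controls(checklist):
    """Group controls into control gaps, evidence gaps, and ready items."""
    groups = {"control_gap": [], "evidence_gap": [], "ready": []}
    for item in checklist:
        if not item["implemented"]:
            # The control itself is missing: engineering work is needed.
            groups["control_gap"].append(item["control"])
        elif not item["evidence_available"]:
            # The control runs, but nothing proves it: fix evidence collection.
            groups["evidence_gap"].append(item["control"])
        else:
            groups["ready"].append(item["control"])
    return groups


# Illustrative answers only.
checklist = [
    {"control": "MFA enforced", "implemented": True, "evidence_available": True},
    {"control": "Backup test", "implemented": True, "evidence_available": False},
    {"control": "Access review", "implemented": False, "evidence_available": False},
]
groups = classify_controls(checklist)
print(groups)
```

A control that is not implemented is always reported as a control gap, even if no evidence exists either, because the engineering work has to come first.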

The one thing most teams skip: Running the readiness assessment against their own evidence bucket. Pull a random day's evidence from the daily Lambda export and verify that it's complete, timestamped, and accurately reflects the control status on that day.

If the evidence file for December 14th shows GuardDuty as PASS but GuardDuty was actually disabled that day, the auditor will find the discrepancy in the AWS account history — and that's a qualified finding.
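
Part of that spot check can be scripted. The sketch below validates the structure of one daily file against the keys the Lambda shown earlier writes. The function name and the list of expected controls are assumptions based on that listing. It can't tell you whether a PASS was true on that day; that comparison against your AWS account history is still a manual step.

```python
import json
from datetime import datetime

# Control names as written by the daily collector listing.
EXPECTED_CONTROLS = {
    "cloudtrail", "guardduty", "vpc_flow_logs",
    "ebs_encryption_by_default", "s3_block_public_access",
}


def validate_evidence(raw_json, expected_date):
    """Return a list of structural problems in one daily evidence file."""
    problems = []
    evidence = json.loads(raw_json)

    if evidence.get("collection_date") != expected_date:
        problems.append("collection_date does not match the requested day")

    try:
        stamp = datetime.fromisoformat(evidence.get("collection_timestamp") or "")
        if stamp.strftime("%Y-%m-%d") != expected_date:
            problems.append("collection_timestamp falls on a different day")
    except ValueError:
        problems.append("collection_timestamp is missing or not ISO 8601")

    controls = evidence.get("controls", {})
    for name in sorted(EXPECTED_CONTROLS - set(controls)):
        problems.append(f"missing control: {name}")
    for name, result in controls.items():
        if result.get("status") not in ("PASS", "FAIL"):
            problems.append(f"{name}: status is not PASS or FAIL")
    return problems


# Illustrative file content for a single day.
sample = json.dumps({
    "collection_timestamp": "2024-12-14T00:00:05+00:00",
    "collection_date": "2024-12-14",
    "account_id": "123456789012",
    "controls": {name: {"status": "PASS", "detail": "ok"} for name in EXPECTED_CONTROLS},
})
problems = validate_evidence(sample, "2024-12-14")
print(problems or "evidence file is structurally complete")
```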

Weeks 15–18: The Observation Period

How the Auditor Observes Your Controls

The SOC2 auditor doesn't physically visit your office or sit inside your AWS console watching your infrastructure in real time. The audit is a remote, documentation-based process conducted entirely through evidence review.

Here is how it actually works:

First, the auditor provides a list of evidence requests — typically 80–150 items for a Type II audit. You upload the evidence to a shared portal (the auditor provides this — it is usually a secure document sharing platform). The auditor reviews the evidence, asks follow-up questions, and identifies gaps where evidence is missing or a control wasn't operating as described.

For automated controls like CloudTrail and GuardDuty, the evidence is your daily Lambda exports — the auditor spot-checks a sample of daily snapshots across the observation period to verify the controls were consistently active.

For manual controls like access reviews and backup tests, the evidence is the documents you produced when you ran those processes.

The practical implication: the auditor is trusting your evidence. This is why the Object Lock on your evidence bucket matters. It proves to the auditor that the evidence was generated at the time it claims to have been generated and hasn't been modified since.

What the Auditor Reviews Over the Observation Period

What They Check	How Often	What They Are Looking For
CloudTrail logs	Spot check monthly	Manual console changes that bypassed IaC, gaps in log delivery
GuardDuty findings	Review quarterly summary	HIGH or CRITICAL findings not remediated within your documented SLA
Access review completion	Verify each quarterly cycle	Reviews skipped, reviews with no access changes despite employee turnover
Incident response tests	Verify annually	No tabletop exercise conducted during the observation period
Evidence collection	Verify continuous coverage	Gaps in daily evidence exports, missing evidence for specific dates
Change management log	Sample PR/sync history	Deployments with no associated pull request or review

What Triggers a Finding

A SOC2 finding is the auditor's documented conclusion that a control wasn't operating effectively during the observation period. Findings range from observations (minor issues that don't affect the audit opinion) to qualified opinions (material failures that result in a qualified rather than unqualified report).

Understanding what triggers findings — and which ones restart the observation period — is critical for managing your audit timeline.

Control gaps occur when a required control isn't implemented or was disabled during the observation period. If you discover in month 2 that MFA wasn't enforced on one IAM user for the first three weeks, you must document the remediation and demonstrate the gap was closed.

Whether this restarts your observation period depends on how long the gap lasted and how the auditor assesses the risk — but a gap of less than 30 days that's immediately remediated and documented typically doesn't restart the clock.

Evidence gaps are more serious. If your daily Lambda evidence collector failed for two weeks and produced no evidence exports, you have a two-week window with no documented proof that your controls were operating. The auditor can't verify controls they can't see evidence for.

Evidence gaps almost always require extending the observation period because there's no way to retroactively produce evidence for a period that wasn't recorded.
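
Since these gaps can't be repaired after the fact, it's worth checking for them regularly. Here is a minimal sketch that finds days with no evidence file. It assumes you pass in the object keys from a bucket listing and that they follow the `daily/YYYY-MM-DD/...` layout used by the collector above. The function name is hypothetical.

```python
from datetime import date, timedelta


def find_missing_evidence_dates(object_keys, start, end):
    """Return ISO dates in [start, end] that have no daily evidence object."""
    present = set()
    for key in object_keys:
        parts = key.split("/")
        if len(parts) >= 3 and parts[0] == "daily":
            present.add(parts[1])

    missing = []
    day = start
    while day <= end:
        if day.isoformat() not in present:
            missing.append(day.isoformat())
        day += timedelta(days=1)
    return missing


# Illustrative keys: the export for 2024-03-03 never arrived.
keys = [
    "daily/2024-03-01/control-status.json",
    "daily/2024-03-02/control-status.json",
    "daily/2024-03-04/control-status.json",
]
missing = find_missing_evidence_dates(keys, date(2024, 3, 1), date(2024, 3, 4))
print(missing)
```

Running a check like this weekly means a failed collector costs you days of evidence, not weeks.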

Process failures occur when a manual control wasn't executed as documented. The most common is an access review that was skipped. Like control gaps, these can typically be remediated without restarting the clock if they're documented promptly and the remediation is clear.

Unpatched critical CVEs are a special case. If Trivy identifies a CRITICAL vulnerability in a production container and it remains unpatched for more than your documented remediation SLA (typically 30 days for critical, 90 days for high), this is a qualified finding that the auditor will note in the report.
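
You can track this deadline with a few lines of code. The sketch below flags vulnerabilities that have exceeded the example SLA of 30 days for critical and 90 days for high. The input shape (`id`, `severity`, `first_detected`) is an assumption: you would need to map your scanner's report into it and record when each vulnerability was first seen.

```python
from datetime import date

# Remediation SLA in days, using the typical values mentioned above.
SLA_DAYS = {"CRITICAL": 30, "HIGH": 90}


def overdue_vulnerabilities(vulnerabilities, today):
    """Return vulnerabilities that have been open longer than their SLA."""
    overdue = []
    for vuln in vulnerabilities:
        limit = SLA_DAYS.get(vuln["severity"])
        if limit is None:
            continue  # Severities without a documented SLA are ignored here.
        age = (today - vuln["first_detected"]).days
        if age > limit:
            overdue.append({**vuln, "age_days": age, "sla_days": limit})
    return overdue


# Illustrative identifiers, not real CVEs.
vulnerabilities = [
    {"id": "EXAMPLE-0001", "severity": "CRITICAL", "first_detected": date(2024, 1, 1)},
    {"id": "EXAMPLE-0002", "severity": "HIGH", "first_detected": date(2024, 1, 1)},
]
overdue = overdue_vulnerabilities(vulnerabilities, today=date(2024, 2, 15))
for vuln in overdue:
    print(f"{vuln['id']} is {vuln['age_days']} days old (SLA {vuln['sla_days']} days)")
```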

How to Close Gaps Without Restarting the Clock

When you discover a gap during the observation period:

For control gaps:

1. Fix the control immediately — don't wait
2. Document the fix: screenshot, PR link, or CLI command output with timestamp
3. Note the gap date range in your audit log: "Control gap: 2024-03-10 to 2024-03-14 (4 days). Root cause: [X]. Remediated: [Y]. No customer data accessed during gap period."
4. Notify your auditor proactively — they will find it anyway; proactive disclosure is better than defensive explanation
5. The observation period doesn't restart if the gap was short-lived and promptly remediated
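
To keep these audit log entries uniform, you could generate them with a small helper. This is a minimal sketch that follows the wording of the example in step 3. The function name and its parameters are assumptions.

```python
from datetime import date


def format_control_gap(start, end, root_cause, remediation, data_accessed=False):
    """Format one control gap entry in the style of the example above."""
    days = (end - start).days
    impact = (
        "Customer data WAS accessed during gap period."
        if data_accessed
        else "No customer data accessed during gap period."
    )
    return (
        f"Control gap: {start.isoformat()} to {end.isoformat()} ({days} days). "
        f"Root cause: {root_cause}. Remediated: {remediation}. {impact}"
    )


# Illustrative root cause and remediation text.
entry = format_control_gap(
    start=date(2024, 3, 10),
    end=date(2024, 3, 14),
    root_cause="MFA policy not applied to a new IAM user",
    remediation="policy attached and verified",
)
print(entry)
```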

For evidence gaps:

1. Fix the evidence collection infrastructure immediately
2. Understand that you can't retroactively generate evidence for the gap period
3. The observation period for affected controls effectively restarts from the date evidence collection resumed
4. If the gap is early in your observation period, you may be able to extend the period rather than restart — discuss with your auditor

The pro tip: Set up a CloudWatch alarm that triggers if the evidence Lambda fails to deliver to S3 on schedule. A missing daily evidence file is caught within 24 hours, not discovered during the audit review.

The 90-Day SOC2 Timeline at a Glance

Weeks	Focus	Key Deliverables	Common Mistake
1–2	Scope	Boundary diagram, network segmentation Terraform	Over-scoping to include dev and staging
3–6	Controls	14 controls implemented and collecting evidence	Starting controls after the observation period begins
7–10	Evidence	S3 evidence bucket, Lambda daily collector, GitHub Actions workflow	Manual evidence collection with inevitable gaps
11–14	Readiness	Mock audit, gap remediation, auditor selected	Skipping the mock audit
15–18	Observation	Daily evidence, quarterly reviews, incident response test	Discovering evidence gaps during the audit rather than before

What's Next?

Start with Week 1. Define your SOC2 boundary. Apply the four-question framework to every system in your infrastructure. Draw the diagram in Excalidraw. Document the network segmentation controls.

Then implement the 14 controls in order, starting with MFA and CloudTrail — the two that most commonly fail audits when they're missing.

Then build your evidence collection infrastructure before the observation period starts. The automated Lambda and GitHub Actions workflow are the difference between a smooth audit and a 60-day extension.

One thing to remember: SOC2 is 20% controls, 30% evidence, and 50% continuous operation. Start early. Automate everything. Run a mock audit before you call the real one.

Resources

The following resources are referenced throughout this guide:

AICPA SOC2 Overview — The official SOC2 documentation from the American Institute of CPAs, including the Trust Service Criteria
Vanta — Compliance automation platform that connects to AWS and GitHub to automate evidence collection and track control status
Drata — Alternative compliance automation platform with similar capabilities to Vanta
Trivy by Aqua Security — Open-source container and filesystem vulnerability scanner used in Control 10
Excalidraw — Free, open-source diagram tool for creating the SOC2 boundary diagram
AWS IAM Identity Center documentation — Official AWS documentation for setting up SSO and MFA enforcement
GitHub branch protection documentation — Official GitHub documentation for configuring branch protection rules
ArgoCD documentation — Official ArgoCD documentation for GitOps deployment and sync history

Ayobami Adejumo is a senior platform engineer and FinOps specialist. He writes about SOC2 compliance engineering, Kubernetes cost optimization, and platform engineering.

How to Land Your First Cloud or DevOps Role: What Hiring Managers Actually Look For

Tolani Akintayo — Thu, 30 Apr 2026 14:33:32 +0000

You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating.

And yet the applications go out, and nothing comes back.

This is one of the most frustrating experiences in tech. You're genuinely learning, genuinely putting in the time, and you have nothing to show for it in terms of results. You start to wonder if the market is too competitive, if you need one more certification, or if there's some hidden door everyone else found that you're missing.

The truth is simpler and more actionable than any of that: hiring managers can't see your YouTube watch history. They can see your GitHub. Most beginners optimize for learning. Hired candidates optimize for proof.

In this guide, you'll get an honest breakdown of the nine factors hiring managers actually evaluate when they look at a junior cloud or DevOps candidate, along with a concrete 90-day plan to address each one. By the end, you'll know exactly where you stand and exactly what to do next.

The Three Patterns That Keep Beginners Stuck
What Hiring Managers Are Actually Evaluating
Factor 1: Proof of Work (The Non-Negotiable)
- The Three Projects That Cover Everything
Factor 2: System-Level Thinking
Factor 3: Software Engineering Fundamentals
Factor 4: Communication Skills
Factor 5: Consistency Over Intensity
Factor 6: Networking and Visibility
Factor 7: Ownership Mindset
Factor 8: Business Awareness
Factor 9: Learning Agility
Your 90-Day Action Plan
Honest Self-Assessment: Where Do You Stand?
Conclusion
References and Recommended Resources

The Three Patterns That Keep Beginners Stuck

Pattern 1: The Tutorial Loop

Week 1: You watch eight hours of Docker content. Week 2: You start an AWS course and get 70% through. Week 3: A Kubernetes series looks interesting, so you start that instead. Week 4: You open LinkedIn and wonder why you're not getting callbacks.

Watching tutorials feels like progress. It's comfortable, passive, and has no failure state. Nothing breaks. Nothing goes wrong.

The problem is that it produces nothing a hiring manager can evaluate. Courses and certifications tell an employer what you've been exposed to. Your GitHub tells them what you can actually do.

Pattern 2: The Theory-Practice Gap

You can explain CI/CD fluently. You've read the Kubernetes documentation. You understand the conceptual difference between a container and a virtual machine.

But you've never taken a simple application, containerized it, connected it to a pipeline, and deployed it to a cloud server with a real URL that someone can visit.

In an interview, "I understand how it works" and "I have built this and here is the link" are not equivalent answers. Hiring managers hear the first version from hundreds of candidates. The second version gets callbacks.

Pattern 3: Silent Learning

This one is perhaps the most painful pattern because the learning is real. You're putting in the work every day, but nobody knows. No GitHub activity. No LinkedIn posts. No community presence. Just cold applications sent from job boards to applicant tracking systems (ATS) that filter you out before a human ever sees your name.

The hard truth: people get hired through people. A hiring manager who has seen your LinkedIn post about a problem you solved is significantly more likely to give your résumé serious attention than a stranger who applied through a portal.

What Hiring Managers Are Actually Evaluating

I've grouped the nine factors that follow into three buckets: Mindset, Execution, and Visibility. The order matters: mindset shapes how you execute, and execution is what powers visibility.

Bucket	Covers	Factors
Mindset	How you think about problems and your career	Factors 2, 7, 8, 9
Execution	What you actually build and demonstrate	Factors 1, 3
Visibility	Whether the right people know you exist	Factors 4, 5, 6

Let's go through each one.

Factor 1: Proof of Work (The Non-Negotiable)

If there's one thing to take from this entire article, it's this: no portfolio means no serious consideration. The most technically capable candidate in the applicant pool is invisible without proof of work.

This isn't about impressing anyone with complexity. It's about demonstrating that you can take a system from zero to deployed, documented, and working.

Here's the checklist every portfolio project should meet before you consider it done:

It's deployed: there's a real URL you can share, not "it works on my machine"
It has a CI/CD pipeline: code changes are automatically tested and deployed
Infrastructure is defined as code: not manually clicked together in the AWS console
It has monitoring and alerting: you know when it breaks before users tell you
It's documented: a README explains what it does, how to run it, and how it works
It's on GitHub publicly: with real commit history showing iterative work

If your project meets all six criteria, you have proof of work. If it meets four of six, you have a project in progress. Finish it before you start applying.

The Three Projects That Cover Everything

You don't need ten projects. You need two to three projects that together demonstrate the full range of DevOps skills.

Project 1: The Full-Stack Deploy Pipeline

This is the foundational DevOps project every beginner should build first.

Take any simple web application – a Python Flask app, a Node.js API, or even a static site. Containerize it with Docker. Write a CI/CD pipeline that runs tests, builds the Docker image, and deploys to a cloud server automatically on every push to the main branch. You can also set up Nginx as a reverse proxy and add an uptime monitor (UptimeRobot has a free tier).

Tools: GitHub Actions, Docker, AWS EC2 or Render.com, Nginx.

Why it matters to a hiring manager: it proves you can automate a full deployment workflow end-to-end. The hiring manager can visit your URL, see it running, and inspect your pipeline history.

This single project puts you ahead of most applicants who only have course completion screenshots.

Project 2: Infrastructure as Code with Terraform

Write Terraform code that provisions a complete environment: a VPC, public and private subnets, an EC2 instance with properly scoped security group rules, and an S3 bucket for remote state. Destroy it and recreate it from scratch to prove the code actually works. Add a GitHub Actions workflow that runs terraform plan on pull requests and terraform apply on merge to main.

Tools: Terraform, AWS (or Azure/GCP), GitHub Actions.

Why it matters: Infrastructure as Code with Terraform is a required skill at almost every company running cloud infrastructure. Showing you can write, version-control, and automate Terraform demonstrates a core professional competency.

Project 3: Monitoring and Observability Stack

Deploy a monitoring stack using Docker Compose: Prometheus scraping metrics from your application and the host, Grafana dashboards showing CPU, memory, request rates, and error rates, and Alertmanager configured to send alerts to Slack or email when thresholds are crossed. Connect this to your Project 1 application so the pipeline deploys and the monitoring watches it.

Tools: Prometheus, Grafana, Alertmanager, Node Exporter, Docker Compose.

Why it matters: most beginner portfolios have zero observability work. This project immediately signals that you understand production engineering, not just deployment. Any senior DevOps engineer or SRE reviewing your application will notice it and it will set you apart.

Factor 2: System-Level Thinking

This is the mindset that separates a DevOps engineer from someone who just knows a collection of tools. System-level thinking means you can see the whole picture, not just the part you happen to be working on at any given moment.

Here's the mental test hiring managers are running throughout your interview: can you trace a user request from the moment they click a button to the moment they see a response, and explain what happens at every layer in between?

Here's the full journey of a web request, the map of modern infrastructure every DevOps engineer needs to understand:

Step	Layer	What's happening and what can go wrong
1	User's Browser	The user types a URL. The browser needs to find the server.
2	DNS Resolution	The domain is translated into an IP address. DNS misconfigurations mean users can't reach you at all.
3	CDN / Edge Network	Traffic hits a CDN (Cloudflare, CloudFront) first. Static assets are served from the nearest edge. SSL terminates here.
4	Load Balancer	Routes the request to an available application server. If all targets are unhealthy, users get 502/503 errors.
5	Compute / Application Servers	The application code runs here in containers, on VMs, or in serverless functions. Business logic executes.
6	Database Layer	The application reads from or writes to a database. Slow queries or a full disk cause slow responses or outages.
7	Cache Layer	Redis or Memcached caches frequently-read data. Cache misses cause extra database load.
8	Response Returns	The response travels back through the stack and the user sees the result.
9	Logging and Monitoring	Every step above should emit logs and metrics. Good monitoring alerts you before users notice a problem.

Why does this matter in an interview? Consider two candidates answering the question: "Tell me about a time something broke in production."

Candidate A: "The website was down."

Candidate B: "The load balancer health checks were failing because the app containers were running out of memory due to a memory leak introduced in the previous deploy. We identified it via memory metrics in Grafana, rolled back, and added a memory limit to the container spec."

Same incident. Completely different answer. System-level thinking is what makes the difference.

Factor 3: Software Engineering Fundamentals

Many beginners rush to learn Kubernetes and Terraform before mastering the foundations that make those tools make sense. This creates a knowledge structure that looks impressive but has no solid base underneath it.

Here are the fundamentals that actually matter and what to do if you have a gap in any of them:

1. Linux and the Command Line

DevOps tools run on Linux. CI/CD jobs run in Linux containers. SSH is the front door to every server. If the terminal makes you uncomfortable, you're not ready for a production environment. This is not a preference, it's a prerequisite.

Start with daily Linux practice. The Linux Foundation's free introductory materials are a solid starting point. And here's a solid freeCodeCamp course on Linux basics.

2. Networking Fundamentals

DNS, TCP/IP, HTTP/HTTPS, load balancing, firewalls, VPCs, subnets: these concepts appear in every cloud architecture. Without them, Terraform and Kubernetes are magic boxes. Study the request flow in Factor 2 above until you can draw it from memory.

Here's a computer networking fundamentals course to get you started.

3. Scripting: Bash and Python

CI/CD pipelines are scripts. Automation is scripting. If you cannot write a Bash script that reads a config file, calls an API, and handles errors gracefully, your automation ceiling is very low. Fix this by writing one small, useful script every week. Solve real problems with code.

Here's a helpful tutorial on shell scripting in Linux for beginners.
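As a rough illustration of the "reads a config file, calls an API, and handles errors gracefully" bar, here is a minimal sketch in Python rather than Bash. The `config.json` file name, the `health_url` key, and the function names are assumptions made for this example.

```python
import json
import sys
import urllib.request


def load_config(path: str) -> dict:
    """Read a JSON config file and check that the required key is present."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    if "health_url" not in config:
        raise KeyError("config is missing 'health_url'")
    return config


def check_health(url: str, timeout: float = 5.0) -> int:
    """Call the endpoint and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def main(config_path: str) -> int:
    """Return a process exit code: 0 on success, 1 on any handled failure."""
    try:
        config = load_config(config_path)
        status = check_health(config["health_url"])
    except (OSError, KeyError, json.JSONDecodeError) as error:
        # Missing files and network failures (urllib.error.URLError) are
        # both OSError subclasses, so they are handled here as well.
        print(f"health check failed: {error}", file=sys.stderr)
        return 1
    print(f"{config['health_url']} responded with HTTP {status}")
    return 0
```

Calling `sys.exit(main("config.json"))` at the bottom of a script turns the return value into an exit code, so a CI job fails when the check fails.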

4. Git and Version Control

Not just git commit and git push. Branching strategies, pull requests, merge conflicts, rebasing, and tagging releases are all standard practice in professional DevOps teams. Use Git for everything including your personal learning notes. Practice branching workflows intentionally.

Here's a full book on all the Git basics (and some more advanced topics, too) you need to know.

5. Docker and Containers

Docker is the universal packaging format for modern software. Understanding layers, multi-stage builds, volumes, networking, and container security is the floor, not the ceiling. Every project you build should be containerized. Write your Dockerfiles by hand instead of copying them.

Here's a course on Docker and Kubernetes to get you started.

Factor 4: Communication Skills

Technical skills set your ceiling. Communication skills determine how fast you reach it. This is the most consistently underestimated factor among beginner DevOps candidates.

Two candidates with identical technical ability will have very different career outcomes based on how clearly they communicate. Here's what that looks like in practice:

Architecture explanation: Can you describe how your project works to someone who has never seen it? Can you draw the architecture on a whiteboard and walk someone through your design decisions and the trade-offs you made?

Trade-off articulation: "I chose X over Y because..." is one of the most powerful phrases in a technical interview. It shows you understand that every decision has pros and cons and you made a conscious, reasoned choice rather than just copying a tutorial.

Written documentation: A README is your project's cover letter. A well-written README with clear setup instructions, an architecture diagram, and documented decisions demonstrates engineering maturity that most beginners don't show.

Here's a quick test: open your most recent project on GitHub and read the README as if you're a hiring manager seeing it for the first time. Does it answer these questions?

What does this project do, and why did you build it?
What does the architecture look like?
How do I run this locally, and how do I deploy it?
What decisions did you make, and why?
What would you improve if you continued working on it?

If you answered "no" to more than two of those, rewrite the README before applying anywhere. This single action will meaningfully improve your response rate.

Interview communication: Hiring managers assess communication throughout the entire interview, not just in your answers. Thinking out loud, structuring your responses, and admitting uncertainty honestly are all evaluated.

Factor 5: Consistency Over Intensity

Hiring managers are pattern recognition machines. They look at your GitHub contribution graph, your LinkedIn activity, and your learning trajectory and form an impression before reading a single word on your résumé.

A binge-learning approach (10-hour weekends followed by weeks of nothing) produces a GitHub graph that tells the wrong story. Thirty minutes of focused daily practice for six months beats a monthly 10-hour binge. At the six-month mark, the daily practitioner has about 90 hours of focused work. The binge learner has 60, with significantly worse retention.

Here's how to build consistency in practice:

Pick a time slot in your day that you will protect. Thirty minutes is enough to make progress.
Define a four-week learning sprint with a specific goal, not "learn Terraform" but "build and deploy a VPC with Terraform and write the README."
Keep a private learning journal: date, what you studied, what you built, what confused you.
When the sprint ends, evaluate what you built and plan the next one.

What to avoid: declaring publicly on LinkedIn that you're "grinding DevOps full time" and then disappearing for six weeks. The absence is noticed. Only commit publicly to what you will actually sustain.

Factor 6: Networking and Visibility

This is the factor most beginners resist most, and the one that makes the biggest practical difference in time-to-hire.

Most DevOps jobs are filled through people: referrals, community connections, and LinkedIn conversations. A warm introduction from someone who has seen your work outweighs fifty cold applications every time.

Here are three ways to build visibility without it feeling performative:

Community Engagement

Join communities where DevOps engineers actually talk: AWS User Groups, local DevOps meetups, DevOps Discord servers, Reddit communities like r/devops and r/kubernetes. You don't need to be the expert. Ask specific questions, answer what you genuinely know, and show up consistently. After three to six months, people will recognize your name.

LinkedIn Content

Post once per week about something you learned, built, or got stuck on. Not marketing – documentation. A post that says "This week I configured Prometheus alerting for a Docker Compose stack. Here's what tripped me up and how I solved it" attracts recruiters, leads to conversations, and builds a searchable record of your growth over time.

Asking Good Questions in Public

When you get stuck and figure it out, write it up. Post the solution in the same community where you asked the question. Answer someone else's version of the same question later. You position yourself as a helpful, engaged learner, exactly who hiring managers want to hire.

Here's a concrete three-month visibility sprint to follow:

Timeframe	Action
Week 1-2	Update your LinkedIn headline: "Cloud / DevOps Engineer in Training │ Building with AWS, Docker, Terraform". Connect with 20 people in DevOps: engineers, recruiters, hiring managers. Add a short personal note when connecting.
Week 3-4	Write your first LinkedIn post. Document something you built or learned this week. Keep it honest and specific. 150–200 words is enough.
Month 2	Join one community. Introduce yourself. Answer one question per week.
Month 3	Post consistently once per week. Engage with others' posts. Start appearing in recruiter searches.

By month three, recruiters searching for "DevOps" in your location will encounter your activity. Some of the best entry-level DevOps opportunities come from exactly this kind of low-pressure visibility.

Factor 7: Ownership Mindset

This factor is less about personality type and more about observable behavior. Hiring managers are looking for evidence that you finish what you start, not just that you start things.

Here's what the contrast looks like:

What hiring managers frequently see	What hiring managers want to see
"I started a Kubernetes project and encountered a lot of issues"	"Here is a complete project. It deploys to AWS, has a CI/CD pipeline, is monitored, and you can access it at this URL right now."
"I was working through a Terraform course, learnt a lot about XYZ."	"I finished it, documented it, and wrote a post about what I learned."

Ownership mindset has three components. First, finish things: a complete, simple project is worth ten times more than ten incomplete complex ones. Second, take responsibility without blame when something breaks: ownership means identifying the cause, fixing it, and adding monitoring so it doesn't happen again. Third, self-direct your learning: you don't wait for someone to tell you what to learn next. You see a gap, identify how to close it, and close it. This is what "junior who can work independently" actually means in job descriptions.

Factor 8: Business Awareness

Technical skill gets you in the door. Business awareness keeps you there and accelerates your career.

The core question hiring managers are testing is: can you connect your technical decisions to cost, uptime, and user impact? Infrastructure decisions are business decisions. Cloud costs are typically the second-largest engineering expense at most companies after salaries. A misconfigured auto-scaling group or a forgotten large EC2 instance can burn thousands of dollars overnight.

Here are a few benchmark questions worth being able to answer comfortably:

If your company has a 99.9% SLA, how many minutes of downtime per month is that? (About 43 minutes.)
If you move workloads from on-demand EC2 instances to Reserved Instances, what's the approximate cost saving? (Around 40–60%.)
If your CI/CD pipeline takes 45 minutes per build and you run 20 builds per day, how much developer wait time does that represent weekly?
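The arithmetic behind the first and third questions is simple enough to sketch. The version below assumes a 30-day month, a five-day work week, and that every build blocks someone for its full duration. The function names are made up for this example.

```python
def downtime_minutes_per_month(sla_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def weekly_pipeline_wait_hours(minutes_per_build: float, builds_per_day: int,
                               workdays: int = 5) -> float:
    """Total build time per week in hours, assuming each build blocks someone."""
    return minutes_per_build * builds_per_day * workdays / 60


print(f"99.9% SLA: {downtime_minutes_per_month(99.9):.1f} minutes of downtime per month")
print(f"45-minute builds, 20 per day: {weekly_pipeline_wait_hours(45, 20):.0f} hours per week")
```

Under those assumptions, the pipeline question works out to 75 hours of waiting per week, which is the kind of number that makes pipeline optimization easy to justify.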

Most junior candidates can't answer these fluently in an interview. Candidates who can stand out immediately, not because the questions are hard, but because so few people bother to connect infrastructure and business.

The simple habit to build: whenever you describe a technical decision in your project documentation or in an interview, add the business dimension. "I configured auto-scaling" becomes "I configured auto-scaling to handle traffic spikes, which eliminated the cost of over-provisioning and reduced our estimated monthly cloud spend by approximately $X."

Factor 9: Learning Agility

Everyone claims to be a fast learner. It's the most overused phrase in technology job applications. Here's how to make it actually mean something.

Saying "I'm a fast learner" in an interview is table stakes. The question is whether you can prove it. Proof sounds like this: "I had never used GitHub Actions before. I needed a CI/CD pipeline for a project I was building. In 48 hours, I had a working pipeline that runs tests, builds a Docker image, and deploys to AWS."

What makes that credible: it names a specific tool, a specific timeframe, and a specific outcome. There is a GitHub repository with a commit history and a working pipeline that a hiring manager can actually look at.

Learning agility is not about knowing many tools shallowly. It's about picking up new tools quickly because you deeply understand the underlying concepts. Tool names change every few years. The concepts (networking, automation, observability, reliability) do not.

To build a concrete track record of learning agility: once a month, pick one tool you haven't used. Follow its quick-start guide. Build something small. Document what was difficult. Post about it. This is your learning agility portfolio: visible, dated, and specific.

Your 90-Day Action Plan

Here is a concrete, sequential plan that takes you from where you are now to your first DevOps interview-ready state.

Month 1: Build Your Foundation

Focus entirely on Project 1 from the Proof of Work section. Build it completely. Deploy it. Get the live URL. Don't start Project 2 until Project 1 meets all six checklist criteria.

Alongside the build: 30 minutes of Linux and Bash scripting practice daily. This isn't optional, it's the foundation everything else runs on.

Month 2: Expand Your Execution and Start Your Visibility

Begin Project 2 (Terraform IaC). Write your first LinkedIn post. It doesn't need to be polished; it needs to be specific. Join one community and introduce yourself.

Month 3: Complete the Portfolio and Document Everything

Finish all three projects to full checklist standard. Polish every README. Add architecture diagrams. Optimize your GitHub profile: pin your three best repos, write a profile README that describes who you are and what you build, and add links to your live project URLs.

Month 4 Onward: Apply with Strategy

Don't start applying before month four. Apply with real proof of work in hand. Target five to ten quality applications per week rather than spraying a hundred. Include your GitHub and your best project's live URL in every application. For roles at companies where you have a community connection, reach out to that person before applying.

Track every application in a spreadsheet: company, role, date applied, status, outcome, notes. After thirty applications, you'll have enough data to see what's working and what isn't.

Here's the full 90-day breakdown:

Timeframe	Focus	Milestone
Week 1-2	Linux fundamentals. Set up GitHub profile. Start Project 1.	Foundation
Week 3-4	Complete Project 1 CI/CD pipeline. Deploy. Get live URL. Write README.	First Proof of Work
Month 2	Begin Project 2. First LinkedIn post. Join one community.	Visibility begins
Month 2-3	Complete Project 2. Scaffold monitoring (Project 3). Post weekly on LinkedIn.	Building momentum
Month 3	Finish all 3 projects to checklist standard. Polish READMEs and GitHub profile.	Portfolio complete
Month 4+	Apply strategically. Continue posting and community engagement.	Active job search

Honest Self-Assessment: Where Do You Stand?

Go through each statement below. Be completely honest: this is for you, not anyone else.

Statement	Action if the answer is No
I can explain a web request end-to-end (DNS → load balancer → compute → database → logs)	Study Factor 2 until you can draw this from memory
I have at least one deployed project with a live URL	This is Priority 1. Nothing else matters more right now.
My best project has a CI/CD pipeline that auto-deploys on push	Add this to your existing project this week
I have written infrastructure as code (Terraform or CloudFormation)	Project 2 is your next build target
My projects have READMEs that explain architecture and decisions	Spend one hour today rewriting your README
I have posted about my learning on LinkedIn in the last 30 days	Post something today: document what you built last week
I am part of at least one DevOps community	Join r/devops or an AWS Discord server this week
I can write a Bash script that solves a real automation problem	30 minutes of daily scripting practice for the next 30 days
I can explain what I built, why I made each decision, and what I'd change	Practice saying this out loud about each project until it's fluent

Count your "no" answers. Each one is a specific, actionable gap, not a vague sense of being behind. That's the difference between this self-assessment and the anxious feeling of "I'm not ready yet." You're not behind. You just have a prioritized list of what to build next.

Conclusion

Here's what you know now that most beginners still don't:

The gap between you and a DevOps job isn't a gap in certifications, a gap in courses completed, or a gap in the number of tools you've heard about. It's a gap in proof of work, visibility, and the consistency with which you execute.

Hiring managers aren't looking for someone who has watched everything. They're looking for someone who has built something, documented it, deployed it, monitored it, and can clearly explain every decision they made along the way.

The path isn't secret. It's just work. Build two to three complete projects that meet the full checklist. Document everything. Show up consistently in communities and on LinkedIn. Apply with strategy. Iterate based on feedback.

If you want a production-grade reference to support your DevOps journey, complete with real Terraform modules, CI/CD workflow templates, infrastructure runbooks, and platform engineering patterns used in real startup environments, The Startup DevOps Field Guide was built for exactly this stage of your career.

The information gap between you and your first DevOps role is smaller than you think. The execution gap is where the work is. Start today.

References and Recommended Resources

roadmap.sh/devops: The community-maintained DevOps learning roadmap. Use this to sequence what you learn next and avoid random jumps between topics.
DORA State of DevOps Report: Free annual report on what DevOps practices actually improve software delivery performance. Gives you the vocabulary hiring managers speak.
Linux Foundation - Introduction to Linux: Free introductory Linux course. If the terminal still makes you nervous, start here.
The Phoenix Project: A business novel about DevOps transformation. Teaches core concepts through story. Gives you vocabulary for business-aware conversations.
ExplainShell.com: Paste any command you find online and see exactly what every part does. Use this constantly while building your projects.
GitHub - How to Write a Good README: Official GitHub guidance on repository documentation.
Prometheus Documentation: Official docs for the monitoring tool used in Project 3.
Terraform Getting Started - AWS: Official step-by-step guide for Project 2.
GitHub Actions Documentation: Complete reference for building CI/CD pipelines in Project 1.
freeCodeCamp - Learn Linux for Beginners: Comprehensive Linux guide available on freeCodeCamp.

How to Set Up OpenID Connect (OIDC) in GitHub Actions for AWS

Tolani Akintayo — Mon, 27 Apr 2026 15:07:43 +0000

If you've been storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks in a CI/CD pipeline.

Here's why: static credentials don't expire on their own. If they get leaked through a misconfigured workflow, a public fork, or a compromised repository, an attacker has persistent access to your AWS environment until you manually rotate them. And most teams don't rotate them often enough.

OpenID Connect (OIDC) solves this entirely. Instead of storing long-lived credentials, GitHub Actions requests a short-lived token directly from AWS every time your workflow runs. No secrets to rotate. No credentials to leak. No manual key management.

In this tutorial, you'll learn how to set up OIDC authentication between GitHub Actions and AWS from scratch. By the end, your workflows will authenticate to AWS securely without storing a single access key.

What Is OpenID Connect (OIDC)?
How OIDC Works Between GitHub Actions and AWS
Prerequisites
Step 1: Create an IAM OIDC Identity Provider in AWS
Step 2: Create an IAM Role with a Trust Policy
Step 3: Attach Permissions to the IAM Role
Step 4: Store the Role ARN as a GitHub Actions Variable
Step 5: Configure Your GitHub Actions Workflow
Step 6: Run and Verify Your Workflow
Security Best Practices
Troubleshooting Common Errors
Conclusion
References

What Is OpenID Connect (OIDC)?

OpenID Connect is an identity protocol built on top of OAuth 2.0. It allows systems to verify identity through tokens rather than shared secrets.

In the context of GitHub Actions and AWS:

GitHub acts as the identity provider (IdP). It issues a signed JWT (JSON Web Token) for each workflow run.
AWS acts as the service provider. It validates that token against GitHub's public keys and exchanges it for temporary AWS credentials.

The credentials AWS returns are short-lived (valid for 1 hour by default) and scoped to exactly the IAM role you define. When the workflow ends, the runner discards them, and they expire shortly afterward regardless.

This model is called federated identity. It's the same concept used when you "Sign in with Google" on a third-party website. The difference is that instead of a user signing in, your workflow is the one authenticating.

How OIDC Works Between GitHub Actions and AWS

Before writing a single line of YAML, it is beneficial to understand the flow. This is my personal approach when implementing new technologies or concepts. Here's what happens every time your workflow runs:

The diagram illustrates a secure authentication flow between GitHub Actions and AWS using OpenID Connect (OIDC), eliminating the need to store long-lived AWS credentials in GitHub. Here's what happens step-by-step:

1. Initial Authentication Request

When your GitHub Actions workflow starts, the runner (the virtual machine executing your workflow) requests a JSON Web Token (JWT) from GitHub's OIDC provider located at https://token.actions.githubusercontent.com.

2. Token Issuance

GitHub's OIDC provider generates and signs a JWT containing important claims (metadata) about your workflow. These claims include details like which repository the workflow is running from, which branch triggered it, what environment it's running in, and other contextual information that proves the workflow's identity.

3. Token Validation

The GitHub Actions runner presents this signed JWT to AWS Security Token Service (STS). AWS STS validates the JWT's signature by checking it against GitHub's publicly available cryptographic keys, ensuring the token is authentic and hasn't been tampered with.

4. Trust Policy Verification

AWS STS checks the trust policy configured on your IAM Role. This trust policy specifies which GitHub repositories, branches, or environments are allowed to assume this role. If the claims in the JWT match your trust policy conditions, authentication succeeds.

5. Temporary Credentials Issued

Once validated, AWS STS returns temporary security credentials to the GitHub Actions runner. These credentials include an Access Key ID, Secret Access Key, and Session Token that are valid for a limited time (typically 1 hour by default, configurable up to 12 hours).

6. AWS API Access

The GitHub Actions runner uses these temporary credentials to authenticate API calls to your AWS resources such as pushing Docker images to ECR, updating ECS services, writing to S3 buckets, or invoking Lambda functions.

The key point: AWS never sees your GitHub credentials, and GitHub never sees your AWS credentials. The JWT is the only thing exchanged and it's signed, scoped, and short-lived.
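To see what the claims from step 2 look like, you can decode the payload section of a JWT. The sketch below builds a stand-in token locally instead of using a real one. The claim values, the `signature-placeholder` string, and the helper names are illustrative assumptions. It deliberately does not verify the signature, because that is the job AWS STS performs in step 3.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_jwt_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    _header_b64, payload_b64, _signature = token.split(".")
    padding = "=" * (-len(payload_b64) % 4)   # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))


# Hypothetical claims shaped like the ones described above; not a real token.
sample_claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:your-org/your-repo:ref:refs/heads/main",
}
sample_token = ".".join([
    b64url_encode(b'{"alg":"RS256","typ":"JWT"}'),
    b64url_encode(json.dumps(sample_claims).encode("utf-8")),
    "signature-placeholder",
])

claims = decode_jwt_claims(sample_token)
print(claims["sub"])
```

The `aud` and `sub` values are the two claims that the trust policy conditions later in this tutorial are matched against.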

Prerequisites

Before you start, make sure you have the following in place:

An AWS account with IAM permissions to create identity providers and roles
A GitHub repository (public or private) where your workflows will run
Basic familiarity with GitHub Actions, knowing how to write a .yml workflow file
Basic familiarity with AWS IAM roles, policies, and permissions
The AWS CLI installed and configured (optional, but useful for verification)

You don't need to be an AWS expert. Each step includes the exact console path and the configuration values you need.

Step 1: Create an IAM OIDC Identity Provider in AWS

The first thing you need to do is tell AWS to trust GitHub as an identity provider. This is a one-time setup per AWS account.

How to Do It in the AWS Console

1. Open the AWS IAM Console

2. In the left sidebar, click Identity providers

3. Click Add provider

4. For Provider type, select OpenID Connect

5. For Provider URL, enter:

https://token.actions.githubusercontent.com

6. For Audience, enter:

sts.amazonaws.com

7. Click Add provider

How to Do It with the AWS CLI

If you prefer the terminal, run this command:

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com

Once created, you'll see token.actions.githubusercontent.com listed under Identity providers in your IAM console. This provider will be referenced in your IAM role's trust policy in the next step.

Step 2: Create an IAM Role with a Trust Policy

Now you need an IAM role that your GitHub Actions workflow will assume. The trust policy on this role controls which repositories and branches are allowed to request credentials.

How to Create the IAM Role in the AWS Console

1. Open the AWS IAM Console

2. In the left sidebar, click Roles

3. Click Create role

4. For Trusted entity type, select Web identity

5. For Identity Provider, choose token.actions.githubusercontent.com, which you created earlier.

6. For Audience, choose sts.amazonaws.com as well

7. For GitHub organisation, enter your GitHub username or organization name

8. For GitHub repository, enter your GitHub repository

9. For GitHub branch, enter your branch name (for example, main)

10. Click Next, then Next again, give the role a name, and click Create role

Note: Creating the IAM role this way already sets up its trusted entities, using a trust policy built from steps 4-9 above. You can verify this by opening the created role and navigating to Trust relationships.

How to Create the IAM Role with the AWS CLI

First, create a trust policy document on your local machine. You can call it trust-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_GITHUB_ORG/YOUR_REPO_NAME:*"
        }
      }
    }
  ]
}

Replace the following placeholders before saving:

Placeholder	Replace With
`YOUR_ACCOUNT_ID`	Your 12-digit AWS account ID
`YOUR_GITHUB_ORG`	Your GitHub username or organization name
`YOUR_REPO_NAME`	The name of your GitHub repository
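If you would rather generate the file than edit placeholders by hand, a small script can fill them in. This is a sketch under the assumption that you want the same policy as above. The `build_trust_policy` helper and the example account ID, organization, and repository are placeholders, not real values.

```python
import json


def build_trust_policy(account_id: str, org: str, repo: str, ref_pattern: str = "*") -> dict:
    """Build the trust policy shown above with the placeholders filled in."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {f"{provider}:aud": "sts.amazonaws.com"},
                "StringLike": {f"{provider}:sub": f"repo:{org}/{repo}:{ref_pattern}"},
            },
        }],
    }


# Example values are placeholders, not a real account or repository.
policy = build_trust_policy("123456789012", "your-org", "your-repo")
print(json.dumps(policy, indent=2))
```

Redirecting the output to trust-policy.json produces the file the CLI command expects. Passing a narrower `ref_pattern`, such as `ref:refs/heads/main`, tightens the `sub` condition.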

How to Understand the `sub` Condition

The sub (subject) claim in the JWT tells AWS exactly where the request is coming from. The value repo:your-org/your-repo:* matches any sub claim from that repository, so a workflow running on any branch, tag, pull request, or environment can assume this role.

You can tighten this further depending on your needs:

# Only the main branch
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
 
# Only a specific GitHub Environment
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"

Scoping this correctly is one of the most important security decisions in this setup. Here's how to decide:

Use ref:refs/heads/main if only your main/production branch should deploy to AWS. This is the most restrictive and secure option: feature branches can't accidentally (or maliciously) trigger deployments or modify production resources.
Use environment:production if you're using GitHub Environments with protection rules (required reviewers, deployment gates). This lets you control deployments through GitHub's approval workflow while still restricting which workflows can access AWS.
Use repo:your-org/your-repo:* (wildcard) only if you need any branch to deploy, for example in development environments where every feature branch deploys to its own isolated stack. Never use this for production roles.
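The difference between these three options is easier to see when you test them against sample `sub` claims. The sketch below approximates IAM's `StringLike` operator with Python's `fnmatchcase`. That is an assumption made for illustration: `fnmatchcase` also gives square brackets a special meaning, which IAM does not. The claim values and the `allowed` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical sub claims a workflow run might present.
main_push = "repo:your-org/your-repo:ref:refs/heads/main"
feature_push = "repo:your-org/your-repo:ref:refs/heads/feature/login"
prod_env = "repo:your-org/your-repo:environment:production"

patterns = {
    "main only": "repo:your-org/your-repo:ref:refs/heads/main",
    "production environment": "repo:your-org/your-repo:environment:production",
    "wildcard": "repo:your-org/your-repo:*",
}


def allowed(pattern: str, sub: str) -> bool:
    """Approximate IAM StringLike matching, where * matches any run of characters."""
    return fnmatchcase(sub, pattern)


for label, pattern in patterns.items():
    results = [allowed(pattern, sub) for sub in (main_push, feature_push, prod_env)]
    print(f"{label}: {results}")
```

Only the wildcard pattern accepts all three claims, including the feature branch, which is why it has no place on a production role.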

Run this command to create the role using your trust policy:

aws iam create-role \
  --role-name GitHubActionsOIDCRole \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role assumed by GitHub Actions via OIDC"

Take note of the Role ARN in the output. It will look like this:

arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

You'll need this ARN in Step 4, where you store it as a GitHub Actions variable for your workflow.

Step 3: Attach Permissions to the IAM Role

The IAM role can now authenticate, but it has no permissions yet. You need to attach a policy that defines what your workflow is actually allowed to do in AWS.

How to Apply the Principle of Least Privilege

Only grant the permissions your workflow genuinely needs. If your workflow deploys to S3, give it S3 permissions. If it pushes images to ECR, give it ECR permissions. Never attach AdministratorAccess to a CI/CD role.

Option 1: Attach an AWS managed policy (quick start):

aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

Option 2: Create a custom policy scoped to a specific S3 bucket (recommended for production):

This approach is recommended for production because it limits the blast radius of a security incident. If your workflow credentials are ever compromised, a custom policy scoped to a specific bucket means an attacker can only affect that single bucket, not every S3 bucket in your AWS account. It also prevents accidental misconfigurations in your workflow from impacting unrelated resources.

Create a file called s3-deploy-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Then create and attach it:

aws iam create-policy \
  --policy-name GitHubActionsS3DeployPolicy \
  --policy-document file://s3-deploy-policy.json
 
aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/GitHubActionsS3DeployPolicy
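
If you manage more than one bucket, you may prefer to generate the policy document rather than copy it by hand. The following is a minimal Python sketch of that idea, not part of the original setup: the helper name build_s3_deploy_policy is hypothetical, and it simply reproduces the statement shown above for whatever bucket name you pass in.

```python
import json


def build_s3_deploy_policy(bucket: str) -> dict:
    """Return a deploy policy limited to one bucket (hypothetical helper)."""
    # Refuse empty names and wildcards so the policy stays scoped to one bucket.
    if not bucket or "*" in bucket:
        raise ValueError("bucket name must be explicit, without wildcards")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (ListBucket)
                    f"arn:aws:s3:::{bucket}/*",  # objects inside it (Put/Delete)
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that could be saved as s3-deploy-policy.json.
    print(json.dumps(build_s3_deploy_policy("your-bucket-name"), indent=2))
```

Redirecting the printed output to s3-deploy-policy.json gives you the same file as above, ready for aws iam create-policy.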

Note: You can also complete Step 3 in the AWS console.

Reference: For a full list of available AWS IAM actions, see the AWS IAM actions reference.

Step 4: Store the Role ARN as a GitHub Actions Variable

Before you configure your workflow, you need to make the Role ARN available to it. You'll store it as a repository variable in GitHub, not a secret, because the ARN itself isn't sensitive data.

How to Add the Variable in Your Repository

First, open your GitHub repository and click Settings.

In the left sidebar, scroll down to Secrets and variables, then click Actions.

Then click the Variables tab (not Secrets). Click New repository variable and set the Name to:

AWS_ROLE_ARN

Set the Value to your Role ARN from Step 2, for example:

arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

Click Add variable.

You'll reference this variable in your workflow in the next step using ${{ vars.AWS_ROLE_ARN }}.
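
A mistyped ARN (an extra colon, a missing account ID) only shows up later as a failed role assumption, so it can be worth checking the value before you save it. Here is a small Python sketch of such a check. It assumes the standard aws partition and a 12-digit account ID, and the function name parse_role_arn is made up for illustration.

```python
import re
from typing import Tuple

# Assumption: standard "aws" partition and a 12-digit account ID.
ROLE_ARN_PATTERN = re.compile(r"arn:aws:iam::(\d{12}):role/([\w+=,.@/-]+)")


def parse_role_arn(arn: str) -> Tuple[str, str]:
    """Return (account_id, role_name) or raise ValueError for a malformed ARN."""
    match = ROLE_ARN_PATTERN.fullmatch(arn)
    if match is None:
        raise ValueError(f"not a valid IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    print(parse_role_arn("arn:aws:iam::123456789012:role/GitHubActionsOIDCRole"))
```

An ARN with a doubled colon before role/ is rejected, which is exactly the kind of typo that is easy to miss by eye.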

Step 5: Configure Your GitHub Actions Workflow

With AWS and GitHub fully configured, you now need to update your workflow to request an OIDC token and use it to authenticate.

How to Set the Required Workflow Permissions

Your workflow must declare id-token: write. Without this, GitHub won't issue an OIDC token to the runner.

permissions:
  id-token: write   # Required to request the OIDC JWT
  contents: read    # Required to checkout the repository

Important: If you set permissions at the job level, they override any top-level permissions. Make sure id-token: write is present at whichever level your AWS authentication step runs.
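
The override rule is easy to get wrong, so here is a short Python sketch that models it. It assumes GitHub's documented behavior that a job-level permissions block replaces the workflow-level block entirely, and that any permission not listed in the chosen block is treated as none. The function names are illustrative only.

```python
from typing import Dict, Optional

Permissions = Dict[str, str]


def effective_permissions(
    workflow: Optional[Permissions], job: Optional[Permissions]
) -> Permissions:
    """Model which permissions block applies to a job."""
    # A job-level block, when present, replaces the workflow-level block entirely.
    chosen = job if job is not None else (workflow or {})
    return dict(chosen)


def can_request_oidc_token(
    workflow: Optional[Permissions], job: Optional[Permissions]
) -> bool:
    """True only if the effective block grants id-token: write."""
    return effective_permissions(workflow, job).get("id-token", "none") == "write"


if __name__ == "__main__":
    top_level = {"id-token": "write", "contents": "read"}
    # The job block omits id-token, so the top-level grant is lost.
    print(can_request_oidc_token(top_level, {"contents": "read"}))
```

In other words, adding a job-level block that lists only contents: read silently removes id-token: write for that job, even though the top of the file still declares it.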

Full Workflow Example

Here's a complete workflow that authenticates to AWS using OIDC and deploys a static site to S3:

name: Deploy to AWS S3
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write
  contents: read
 
jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-2
 
      - name: Verify AWS identity
        run: aws sts get-caller-identity
 
      - name: Deploy to S3
        run: |
          aws s3 sync ./code s3://your-bucket-name

Replace the following before committing:

Placeholder	Replace With
`AWS_ROLE_ARN`	The variable name for your IAM role ARN in GitHub
`us-east-2`	Your target AWS region
`your-bucket-name`	Your S3 bucket name
`./code`	The local directory containing the files you want to sync to S3

You can see the code sample in my GitHub Repo here.

Note: The aws-actions/configure-aws-credentials action handles the entire OIDC token exchange automatically. It requests the JWT from GitHub, calls sts:AssumeRoleWithWebIdentity, and exports the temporary credentials as environment variables for the rest of the job.

See the action's official documentation for all available options.

Step 6: Run and Verify Your Workflow

Push your workflow to the main branch and open the Actions tab in your repository to watch it run.

What a Successful Run Looks Like

The Configure AWS credentials via OIDC step should show:

Assuming role with OIDC: arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

The Verify AWS identity step (aws sts get-caller-identity) should return:

{
    "UserId": "AROA...:GitHubActions",
    "Account": "YOUR_ACCOUNT_ID",
    "Arn": "arn:aws:sts::YOUR_ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions"
}

If you see an assumed-role ARN in the output, OIDC is working correctly. Your workflow is now authenticating to AWS without a single stored credential.
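
If you want to make this check automatic, for example in a later pipeline step, the JSON output can be parsed. The Python sketch below is one possible way to do it. The function name assumed_role_name is hypothetical, and it assumes the ARN has the shape shown above.

```python
import json


def assumed_role_name(identity_json: str) -> str:
    """Return the role name if the caller ARN is an assumed-role ARN."""
    arn = json.loads(identity_json)["Arn"]
    # Expected shape: arn:aws:sts::<account>:assumed-role/<role>/<session>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sts":
        raise ValueError(f"not an STS ARN: {arn!r}")
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "assumed-role":
        raise ValueError(f"not an assumed-role ARN: {arn!r}")
    return resource[1]


if __name__ == "__main__":
    # Sample output with a placeholder account ID.
    sample = json.dumps(
        {
            "UserId": "AROA...:GitHubActions",
            "Account": "123456789012",
            "Arn": "arn:aws:sts::123456789012:assumed-role/GitHubActionsOIDCRole/GitHubActions",
        }
    )
    print(assumed_role_name(sample))
```

An IAM user ARN (which would indicate stored access keys are still in use) makes the function raise instead of returning a role name.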

Security Best Practices

Getting OIDC working is step one. Locking it down properly is step two.

Scope the `sub` Condition as Tightly as Possible

Don't use a wildcard like repo:your-org/*:* that allows any repository in your organization to assume the role. Scope it to the exact repository and branch that needs access.

"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"

Use GitHub Environments for Production Deployments

GitHub Environments let you add manual approval gates and restrict which branches can deploy. When combined with OIDC, you can scope your trust policy to only allow the production environment:

"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"

Apply Least-Privilege Permissions to Every IAM Role

Never attach AdministratorAccess or PowerUserAccess to a role used by CI/CD. Define a custom policy with only the actions your workflow actually needs.

Create Separate IAM Roles Per Environment

A staging role and a production role should have different permission scopes. Your staging deployment role should never have write access to production resources.
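
To see how the different sub patterns in this section behave, you can test them offline. The Python sketch below approximates IAM's StringLike operator, where * matches any run of characters and ? matches a single character. AWS evaluates the real condition on its side, so treat this as a rough model; the function name string_like and the sample claims are assumptions for illustration.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' = any run of characters, '?' = one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    # fullmatch keeps the comparison anchored and case-sensitive.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


if __name__ == "__main__":
    main_only = "repo:your-org/your-repo:ref:refs/heads/main"
    org_wide = "repo:your-org/*:*"
    production_env = "repo:your-org/your-repo:environment:production"

    feature_claim = "repo:your-org/your-repo:ref:refs/heads/feature-x"
    other_repo_claim = "repo:your-org/other-repo:ref:refs/heads/main"
    staging_claim = "repo:your-org/your-repo:environment:staging"

    print(string_like(main_only, feature_claim))      # False
    print(string_like(org_wide, other_repo_claim))    # True
    print(string_like(production_env, staging_claim)) # False
```

The org-wide wildcard accepts a token from a completely different repository, while the main-only and production-environment patterns reject everything except the one subject they name.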

Enable AWS CloudTrail

Every call made using the temporary credentials is logged in CloudTrail under the assumed role ARN. This gives you a full audit trail of exactly what your workflow did in AWS.

Reference: GitHub's official security hardening guide for OIDC: About security hardening with OpenID Connect

Troubleshooting Common Errors

Error: `Not authorized to perform sts:AssumeRoleWithWebIdentity`

This usually means the trust policy on your IAM role doesn't match the sub claim in the JWT.

Check the following:

The sub condition exactly matches your repository path (it is case-sensitive)
The aud condition is set to sts.amazonaws.com
The Federated principal uses the correct AWS account ID

To inspect the actual token claims your workflow is receiving, add this debug step temporarily:

- name: Print OIDC token claims
  run: |
    TOKEN=$(curl -s -H "Authorization: Bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
    echo $TOKEN | cut -d '.' -f2 | base64 -d 2>/dev/null | jq .
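
JWT segments are base64url-encoded with the padding stripped, so a plain base64 decode can fail or truncate depending on the token length. If the shell step prints nothing useful, a small Python sketch like the one below decodes the payload more reliably. It does not verify the signature, so use it for debugging only; the function name decode_jwt_claims is an assumption for illustration.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature (debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    payload = parts[1]
    # Restore the '=' padding that base64url encoding strips from JWT segments.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    # Build a throwaway token locally so the example runs without GitHub.
    claims = {"sub": "repo:your-org/your-repo:ref:refs/heads/main", "aud": "sts.amazonaws.com"}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
    print(decode_jwt_claims(f"e30.{body}.signature"))
```

Compare the sub and aud values it prints against the conditions in your trust policy, then remove the debug step from your workflow.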

Error: `Could not load credentials from any providers`

This almost always means id-token: write is missing from your workflow permissions. Double-check that you have:

permissions:
  id-token: write
  contents: read

Error: `AccessDenied` When Calling an AWS Service

Authentication succeeded but the IAM role doesn't have permission to perform the action your workflow is attempting. Check the permissions policy attached to your role and compare it against the specific action in the error message.

Conclusion

You've gone from storing static, long-lived AWS credentials in GitHub Secrets to a fully keyless authentication setup using OIDC. Here's what you accomplished:

Registered GitHub as a trusted OIDC identity provider in AWS.
Created an IAM role with a scoped trust policy tied to a specific repository.
Attached least-privilege permissions to that role.
Configured your GitHub Actions workflow to request and use short-lived AWS credentials.
Verified the authentication flow end-to-end.

This pattern works across AWS services, including S3, ECS, Lambda, ECR, Secrets Manager, and more. The workflow example here uses S3, but you only need to swap out the permissions policy and the deployment commands to adapt it for any other service.

If you want to go further, explore:

Configuring OIDC for multiple cloud providers: Azure, GCP, and HashiCorp Vault.
GitHub Environments and deployment protection rules: for multi-stage pipelines with approval gates.
AWS IAM Access Analyzer: to validate and tighten your role policies automatically.

If you're building out your DevOps practice and want a complete, production-ready reference for infrastructure automation, CI/CD, and platform engineering, check out The Startup DevOps Field Guide. It covers the patterns, templates, and runbooks I've used across real AWS environments.

You can also connect with me on LinkedIn.

How I Built a Production-Ready CI/CD Pipeline for a Monorepo-Based Microservices System with Jenkins, Docker Compose, and Traefik

Md Tarikul Islam — Thu, 23 Apr 2026 18:11:20 +0000

This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server.

You’ll learn how to expose services on a custom domain with auto-renewing HTTPS, and implement a smart deployment strategy that detects changes and redeploys only the affected microservices. This helps avoid unnecessary full-stack redeploys. We'll also cover real production issues and the exact fixes for each one.

1. What you'll build
2. Architecture
3. Server prerequisites
4. Traefik — the reverse proxy
5. Run Jenkins in Docker
6. Expose Jenkins on a domain via Traefik
7. First-time Jenkins setup
8. Add the GitHub credential
9. Create the pipeline job
10. The Jenkinsfile (deploy only what changed)
11. End-to-end test
12. Troubleshooting — every error we hit
13. Mental model: host vs. container
14. Daily operations cheat sheet
15. What I'd do differently next time
Closing thoughts

1. What You'll Build

In this tutorial, you'll build a Jenkins instance running inside Docker on the same Linux server as your application stack.

Traefik will act as a reverse proxy in front of Jenkins, exposing it via a clean URL (https://jenkins.example.com) with auto-renewing Let's Encrypt certificates.

You'll also create a Jenkinsfile in your application repository that:

Automatically triggers on every push to the staging branch,
Detects which microservices changed in each commit,
Pulls the latest code on the host machine,
Rebuilds and restarts only the affected services.

On every push, only the relevant services are redeployed.

Prerequisites

Before jumping in, this guide assumes you’re already comfortable with a few core concepts and tools.

This isn't a beginner-level tutorial — we’ll be working directly with infrastructure, containers, and CI/CD pipelines.

You should be familiar with:

Basic Linux commands (SSH, file system navigation, permissions)
Docker fundamentals (images, containers, volumes, networks)
Git workflows (clone, pull, branches)
General idea of CI/CD pipelines

Tools and environment required:

A Linux server (Ubuntu recommended)
Docker Engine + Docker Compose (v2)
A domain name (for Traefik + HTTPS)
GitHub repository (for your backend project)
Basic understanding of microservices architecture

If you’re comfortable with the above, you’re ready to follow along.

2. Architecture

Here's an overview of the architecture:

┌──────────────────────────── Linux server (Ubuntu) ────────────────────────────┐
│                                                                               │
│   /home/developer/projects/                                                  │
│       └── projects-prod-configs/            ← infra repo (compose, Traefik) │
│              ├── docker-compose.staging.yml                                   │
│              ├── traefik.staging.yml                                          │
│              └── projects-backend/         ← app repo (services, gateways) │
│                     ├── Jenkinsfile                                           │
│                     ├── docker-compose.staging.yml                            │
│                     └── apps/                                                 │
│                            ├── services//                               │
│                            ├── gateways//                               │
│                            └── core//                                   │
│                                                                               │
│   ┌─────────────────────── Docker network: proxy ──────────────────────┐      │
│   │  traefik (80, 443)                                                 │      │
│   │     │                                                              │      │
│   │     ├──► jenkins  (projects-jenkins-staging)                     │      │
│   │     │      ↳ /projects  ← bind-mount of the host project tree     │      │
│   │     │      ↳ /var/run/docker.sock ← controls host Docker           │      │
│   │     │                                                              │      │
│   │     └──► your services & gateways (built by the pipeline)          │      │
│   └────────────────────────────────────────────────────────────────────┘      │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
            ▲
            │  webhook on push
            │
   GitHub: /projects-backend (branch: staging)

There are two key ideas here:

Jenkins runs in a container, but it controls the host's Docker by mounting /var/run/docker.sock. It also bind-mounts the project folder as /projects/..., so it can cd into the real code on the host and run docker compose there.
The Jenkinsfile lives inside the app repo, so the pipeline definition is versioned with the code. Jenkins simply points at it.

3. Server Prerequisites

Before we start configuring Jenkins or Traefik, we need to prepare the server properly.

In this step, we’ll:

Create a dedicated Linux user for managing the project
Install Docker and Docker Compose
Set up the folder structure for our repositories

This ensures our CI/CD pipeline runs in a clean and predictable environment.

# Linux user that owns the project tree
sudo adduser developer

# Docker engine + Compose plugin
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker developer

# Sanity check Compose v2
docker compose version
# -> Docker Compose version v2.x.y

# Find where the Compose plugin binary lives — write it down, you'll need it
ls /usr/libexec/docker/cli-plugins/docker-compose
# (some distros use /usr/lib/docker/cli-plugins/docker-compose)

# Project layout
sudo mkdir -p /home/developer/projects
sudo chown -R developer:developer /home/developer/projects

# Clone both repos in the right place
cd /home/developer/projects
git clone https://github.com//projects-prod-configs.git
cd projects-prod-configs
git clone -b staging https://github.com//projects-backend.git

You should now have:

/home/developer/projects/projects-prod-configs/projects-backend

Memorize this path — your Jenkinsfile references it.

DNS

Point an A-record for your Jenkins subdomain to the server's public IP before the next steps so Let's Encrypt can validate via HTTP challenge:

jenkins.example.com   A

4. Traefik — the Reverse Proxy

Traefik acts as the entry point to your entire system. Instead of exposing each service manually with ports, Traefik automatically:

Routes traffic based on domain names
Generates and renews HTTPS certificates using Let’s Encrypt
Connects to Docker and detects services dynamically

In simple terms, Traefik lets you access services like:

https://jenkins.example.com
https://api.example.com

…without manually configuring NGINX or managing SSL certificates.

In this setup, Traefik watches Docker containers and routes traffic using labels we'll define later.

Traefik gives every container a real domain and a real cert with zero per-service config — you just add a few labels.

`traefik.staging.yml` (static config)

Put this at the root of your infra repo:

api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: admin@example.com           # ← change me
      storage: /etc/traefik/acme.json

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false              # only containers with traefik.enable=true
    network: proxy
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: INFO

accessLog: {}

The Traefik service in `docker-compose.staging.yml`

networks:
  proxy:
    name: proxy
    driver: bridge
  internal:
    name: internal
    driver: bridge

volumes:
  acme-data:
  traefik-logs:
  jenkins-data:

services:
  traefik:
    image: traefik:v2.11
    container_name: projects-traefik-staging
    restart: unless-stopped
    ports:
      - "80:80"        # HTTP (auto-redirects to HTTPS)
      - "443:443"      # HTTPS
      - "8080:8080"    # Traefik dashboard (internal only — protect via firewall)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.staging.yml:/etc/traefik/traefik.yml:ro
      - ./dynamic:/etc/traefik/dynamic:ro
      - acme-data:/etc/traefik           # persists Let's Encrypt certs
      - traefik-logs:/var/log/traefik
    networks:
      - proxy
    command:
      - '--api.insecure=false'
      - '--api.dashboard=true'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
      - '--providers.docker.network=proxy'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
      - '--entrypoints.web.http.redirections.entryPoint.to=websecure'
      - '--entrypoints.web.http.redirections.entryPoint.scheme=https'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
      - '--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL:-admin@example.com}'
      - '--certificatesresolvers.letsencrypt.acme.storage=/etc/traefik/acme.json'
      - '--log.level=INFO'
      - '--accesslog=true'
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      # Traefik's own dashboard
      - "traefik.http.routers.traefik-dash.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-dash.entrypoints=websecure"
      - "traefik.http.routers.traefik-dash.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-dash.service=api@internal"

Bring it up:

cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d traefik

Watch the logs the first time — Traefik will request a cert for the dashboard host as soon as DNS resolves.

docker logs -f projects-traefik-staging

Tip. While testing, switch ACME to the Let's Encrypt staging endpoint (acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory) so a DNS misconfiguration doesn't burn through Let's Encrypt's production rate limits. Remove that setting before going live.

5. Run Jenkins in Docker

Add this Jenkins service to the same docker-compose.staging.yml. Every line matters (and the comments explain why).

  jenkins:
    image: jenkins/jenkins:lts
    container_name: projects-jenkins-staging
    restart: unless-stopped
    user: root                           # to use host docker.sock without UID juggling
    environment:
      - JAVA_OPTS=-Xmx1g -Xms512m -Duser.timezone=Asia/Dhaka
      - TZ=Asia/Dhaka                    # OS-level timezone inside container
      - JENKINS_OPTS=--prefix=/
    ports:
      - "3095:8080"                      # web UI (also reachable directly if needed)
      - "50000:50000"                    # inbound agent port
    volumes:
      - jenkins-data:/var/jenkins_home   # Jenkins config/jobs/secrets persistence
      - /var/run/docker.sock:/var/run/docker.sock                          # control host Docker
      - /usr/bin/docker:/usr/bin/docker                                     # docker CLI from host
      - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro  # docker compose plugin
      - /home/developer/projects:/projects                                # project tree
      - /etc/localtime:/etc/localtime:ro                                    # match host clock
      - /etc/timezone:/etc/timezone:ro
    networks:
      - proxy
      - internal
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080/login']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    deploy:
      resources:
        limits:
          memory: 1024M

Why user: root? It's the simplest way to share docker.sock and the project bind-mount without UID/GID gymnastics. If you prefer an unprivileged user, you'll need to add the host's docker group ID with group_add and align UIDs and permissions on the host folders. That's possible, but out of scope here.

6. Expose Jenkins on a Domain via Traefik

This is the section many guides skip. We'll add labels to the Jenkins service so Traefik picks it up automatically. No editing of Traefik config required.

  jenkins:
    # ... everything above ...
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"

      # 1) Router — match incoming Host
      - "traefik.http.routers.jenkins.rule=Host(`jenkins.example.com`)"
      - "traefik.http.routers.jenkins.entrypoints=websecure"
      - "traefik.http.routers.jenkins.tls.certresolver=letsencrypt"
      - "traefik.http.routers.jenkins.service=jenkins"

      # 2) Service — tell Traefik which container port is the app
      - "traefik.http.services.jenkins.loadbalancer.server.port=8080"

      # 3) Middleware — Jenkins needs X-Forwarded-Proto so it knows it's behind HTTPS
      - "traefik.http.middlewares.jenkins-headers.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.routers.jenkins.middlewares=jenkins-headers"

What each line does:

Label	Purpose
`traefik.enable=true`	Opts this container in (we set `exposedByDefault=false`).
`traefik.docker.network=proxy`	Tells Traefik which network to talk to Jenkins on (Jenkins is on both `proxy` and `internal`).
`routers.jenkins.rule=Host(...)`	Forwards only this hostname to Jenkins.
`routers.jenkins.entrypoints=websecure`	Listens only on 443. (HTTP redirect was set up in section 4.)
`routers.jenkins.tls.certresolver=letsencrypt`	Auto-issues + renews the cert.
`services.jenkins.loadbalancer.server.port=8080`	Jenkins listens on 8080 inside the container.
`customrequestheaders.X-Forwarded-Proto=https`	Without this, Jenkins generates `http://` URLs in webhooks/links and breaks.
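
Because every routed service needs the same handful of labels with only the name, hostname, and port changing, the pattern can be expressed as a small generator. This Python sketch is only an illustration of that pattern: the function traefik_labels is hypothetical, it assumes the proxy network and letsencrypt resolver names used in this guide, and it leaves out the Jenkins-specific header middleware.

```python
from typing import List


def traefik_labels(
    name: str, host: str, port: int, resolver: str = "letsencrypt"
) -> List[str]:
    """Build the router and service labels for one container (illustrative helper)."""
    router = f"traefik.http.routers.{name}"
    return [
        "traefik.enable=true",                       # opt in (exposedByDefault=false)
        "traefik.docker.network=proxy",              # network Traefik uses to reach it
        f"{router}.rule=Host(`{host}`)",             # match the incoming hostname
        f"{router}.entrypoints=websecure",           # HTTPS entry point only
        f"{router}.tls.certresolver={resolver}",     # issue and renew the certificate
        f"{router}.service={name}",
        f"traefik.http.services.{name}.loadbalancer.server.port={port}",
    ]


if __name__ == "__main__":
    for label in traefik_labels("jenkins", "jenkins.example.com", 8080):
        print(f'      - "{label}"')
```

Running it for jenkins prints lines you can compare against the labels block above, and calling it with another name, host, and port shows what a second service would need.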

Bring Jenkins up:

cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d jenkins

# Watch Traefik issue the certificate
docker logs -f projects-traefik-staging | grep -i acme

After 10–60 seconds you should be able to open https://jenkins.example.com and see Jenkins's setup wizard with a valid lock icon.

Inside Jenkins (after first login):

Manage Jenkins → System → Jenkins URL → set this to: https://jenkins.example.com/

This is important because Jenkins uses this base URL to generate:

Webhook endpoints (for GitHub triggers)
Links inside emails and build logs

If this isn't set correctly, GitHub webhooks may fail, and any links Jenkins generates will point to the wrong address (often localhost or internal IPs).

7. First-Time Jenkins Setup

If you're running Jenkins for the first time on this server, follow this section to complete the initial setup.

If you already have Jenkins configured, you can skip this section — but make sure the required plugins and settings match what we use later in this guide.

Open https://jenkins.example.com. Get the initial admin password:

docker exec projects-jenkins-staging cat /var/jenkins_home/secrets/initialAdminPassword

Paste it, choose Install suggested plugins.
Create your admin user.
Manage Jenkins → Plugins → Available and install:
- GitHub (and GitHub Branch Source)
- Pipeline: GitHub
- Credentials Binding (usually preinstalled)

That's all the plugins you need for the rest of this guide.

8. Add the GitHub Credential

Jenkins needs permission to access your GitHub repository.

This is done using a GitHub Personal Access Token (PAT), which acts like a password for secure API and Git operations.

We’ll store this token inside Jenkins as a credential so it can pull code during pipeline execution and authenticate securely without exposing secrets in code.

This single credential is used both for the SCM checkout and for the deploy-time git pull.

Create a Personal Access Token (classic) on GitHub with repo scope.
In Jenkins: Manage Jenkins → Credentials → System → Global → Add Credentials.
Fill in:
- Kind: Username with password
- Username: your GitHub username
- Password: the token
- ID: github_classic_token (the Jenkinsfile references this exact ID)

9. Create the Pipeline Job

Now that Jenkins has access to your repository, the next step is to define how deployments should run.

A pipeline job tells Jenkins:

where your code lives,
which branch to monitor,
and how to execute your deployment process.

In Jenkins, create a new Pipeline job and connect it to your GitHub repository. Once this is set up, Jenkins will automatically trigger deployments whenever you push to the staging branch.

Start by creating a new job:

New Item → Pipeline → name it projects-staging → OK

Then configure the job:

Under Build Triggers, enable:
GitHub hook trigger for GITScm polling
Under Pipeline:
- Definition: Pipeline script from SCM
- SCM: Git
- Repository URL: https://github.com//projects-backend.git
- Credentials: github_classic_token
- Branch: */staging
- Script Path: Jenkinsfile

Save the configuration.

At this point, Jenkins is fully connected to your repository and ready to run your deployment pipeline automatically.

10. The Jenkinsfile (Deploy Only What Changed)

Place this at the root of the app repo (projects-backend/Jenkinsfile), branch staging.

pipeline {
  agent any

  environment {
    PROJECT_PATH = "/projects/projects-prod-configs/projects-backend"
    COMPOSE_FILE = "docker-compose.staging.yml"
  }

  stages {

    stage('Checkout') {
      steps {
        checkout scm
        echo "Checkout completed for branch: ${env.BRANCH_NAME ?: 'staging'}"
      }
    }

    stage('Detect Changes') {
      steps {
        script {
          def changedFiles = sh(
            script: "git diff --name-only HEAD~1 HEAD",
            returnStdout: true
          ).trim()

          echo "Changed files:\n${changedFiles}"

          def services = [] as Set
          changedFiles.split('\n').each { file ->
            def svc  = file =~ /^apps\/services\/([a-z0-9-]+)\//
            def gw   = file =~ /^apps\/gateways\/([a-z0-9-]+)\//
            def core = file =~ /^apps\/core\/([a-z0-9-]+)\//
            if (svc)  { services << svc[0][1]  }
            if (gw)   { services << gw[0][1]   }
            if (core) { services << core[0][1] }
          }
          services = services.findAll { !it.endsWith('-e2e') }
          env.CHANGED_SERVICES = services.join(' ')

          echo "Services to deploy: ${env.CHANGED_SERVICES ?: '(none)'}"
        }
      }
    }

    stage('Deploy') {
      when { expression { return env.CHANGED_SERVICES?.trim() } }
      steps {
        withCredentials([usernamePassword(
          credentialsId: 'github_classic_token',
          usernameVariable: 'GIT_USER',
          passwordVariable: 'GIT_TOKEN'
        )]) {
          sh '''
            set -eu
            git config --global --add safe.directory "${PROJECT_PATH}"
            cd "${PROJECT_PATH}"
            git remote set-url origin "https://github.com//projects-backend.git"
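            # Single quotes: the helper reads GIT_USER and GIT_TOKEN from the environment
            # when git runs it, so the token never appears in the argument list of git.
            git -c credential.helper= \
                -c 'credential.helper=!f() { echo "username=${GIT_USER}"; echo "password=${GIT_TOKEN}"; }; f' \
                pull origin staging
            docker compose -f "${COMPOSE_FILE}" up -d --build ${CHANGED_SERVICES}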
          '''
        }
        echo "Deployed: ${env.CHANGED_SERVICES}"
      }
    }

    stage('Skip Deployment') {
      when { expression { return !env.CHANGED_SERVICES?.trim() } }
      steps { echo "No service changes detected — nothing to deploy." }
    }
  }
}

Why each tricky line is there:

git config --global --add safe.directory ... — git refuses to operate on a repo whose owner UID differs from the current user's. The repo on disk is owned by developer, but Git inside the container runs as root. This whitelists the path.
git remote set-url origin "https://..." — flips the on-disk remote to HTTPS so the token can be used. (A PAT can't authenticate git@github.com: URLs — those use SSH.) Idempotent — safe to re-run.
git -c credential.helper="!f() { echo username=...; echo password=...; }; f" — feeds the username/token to git for that one command without writing the token to disk and without exposing it on the process command line.
${CHANGED_SERVICES} is unquoted on purpose so multiple service names expand as separate args.
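
The path-matching logic in the Detect Changes stage can be tried outside Jenkins before you rely on it. The Python sketch below mirrors the same three directory prefixes and the -e2e exclusion. It is a re-implementation for experimentation, not the pipeline's actual code; the function name and the billing service are made up, and the result is sorted only to make the output deterministic.

```python
import re
from typing import Iterable, List

# Same layout the Jenkinsfile expects: apps/{services,gateways,core}/<name>/...
SERVICE_PATH = re.compile(r"^apps/(?:services|gateways|core)/([a-z0-9-]+)/")


def changed_services(changed_files: Iterable[str]) -> List[str]:
    """Map changed file paths to the unique service names that need a redeploy."""
    names = set()
    for path in changed_files:
        match = SERVICE_PATH.match(path)
        # Skip files outside the service folders and end-to-end test projects.
        if match and not match.group(1).endswith("-e2e"):
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    files = [
        "apps/gateways/student-apigw/src/main.ts",
        "apps/gateways/student-apigw/package.json",
        "apps/services/billing/Dockerfile",
        "apps/services/billing-e2e/test.ts",
        "README.md",
    ]
    print(" ".join(changed_services(files)))  # billing student-apigw
```

Note that files outside apps/, such as a root-level README.md or a shared lockfile, map to no service at all, so a commit that touches only those files deploys nothing.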

11. End-to-End Test

Before considering the setup complete, we need to verify that the entire pipeline works as expected.

This end-to-end test ensures that:

GitHub webhooks are triggering Jenkins correctly,
Jenkins can detect which services changed,
and only the affected services are rebuilt and deployed.

In other words, this simulates a real production deployment.

Start by making a small change in your repository. For example, modify a file inside:

apps/gateways/student-apigw/

Then push the change to the staging branch.

Once pushed, Jenkins should automatically trigger via the webhook. If not, you can manually click Build Now.

Now open the build’s Console Output and verify the flow. You should see something like:

Checkout completed for branch: staging
Services to deploy: student-apigw
git pull origin staging (successful)
docker compose ... up -d --build student-apigw
Deployed: student-apigw

If you see this sequence, your pipeline is working correctly.

If anything fails, don’t worry — jump to Section 12 where every common issue and its fix is documented.

12. Troubleshooting — Every Error We Hit

This section covers real issues we faced while setting up this pipeline — and more importantly, why each fix works. Understanding the “why” will help you debug similar problems in your own setup.

cd: can't cd to /projects/projects-prod-configs/projects-backend

Cause:
The Jenkinsfile runs cd $PROJECT_PATH, but inside the container that path doesn’t exist. This usually happens when:

the project wasn’t cloned on the host, or
the bind mount isn’t configured correctly.

Fix:

ls /home/developer/projects/projects-prod-configs/projects-backend
# If missing: git clone -b staging  there.

Confirm the bind mount:

docker inspect projects-jenkins-staging --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}'

If missing, recreate the container:

docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

Why this works:

Jenkins runs inside a container, but your code lives on the host. The bind mount connects them. Without it, Jenkins cannot access your project directory.
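
One way to keep the two views of the same folder straight is to translate paths explicitly. The following Python sketch assumes the bind mount from this guide (/home/developer/projects on the host, /projects in the container); the function name to_container_path is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: the bind mount /home/developer/projects:/projects from the Jenkins service.
HOST_ROOT = PurePosixPath("/home/developer/projects")
CONTAINER_ROOT = PurePosixPath("/projects")


def to_container_path(host_path: str) -> str:
    """Translate a host path under the mount into the path Jenkins sees."""
    # relative_to raises ValueError when the path is outside the mounted tree.
    relative = PurePosixPath(host_path).relative_to(HOST_ROOT)
    return str(CONTAINER_ROOT / relative)


if __name__ == "__main__":
    print(to_container_path(
        "/home/developer/projects/projects-prod-configs/projects-backend"
    ))
```

A path that falls outside the mounted tree, such as a clone made under /home/developer/project by mistake, raises an error here, which corresponds to the "can't cd" failure inside the container.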

fatal: detected dubious ownership in repository

Cause:
Git blocks access when the repository owner differs from the current user.

Repo owner: developer (host)
Git runs as: root (inside container)

Fix:

git config --global --add safe.directory "${PROJECT_PATH}"

Why this works:

This explicitly tells Git that the directory is trusted, bypassing ownership mismatch security restrictions.

`Host key verification failed` / `Could not read from remote repository`

Cause:

The repository uses SSH (git@github.com:...), but:

the container has no SSH keys
no known_hosts file exists

Also, GitHub tokens cannot authenticate over SSH.

Fix (recommended):

git remote set-url origin "https://github.com//projects-backend.git"

Why this works:

HTTPS uses token-based authentication (PAT), which works inside containers without SSH configuration.

`unknown shorthand flag: 'f' in -f` (`docker compose`)

Cause:
The Docker CLI exists, but the Docker Compose plugin is missing inside the container.

Fix:

volumes:
  - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro

Find your path if needed:

find /usr -name docker-compose -type f 2>/dev/null

Verify:

docker exec projects-jenkins-staging docker compose version

Why this works:

Docker Compose v2 is a CLI plugin. Mounting this directory makes the docker compose command available inside the container.

Wrong timezone in build timestamps and Jenkins UI

Fix: Set both env var and JVM flag, and bind-mount the host's clock files:

environment:
  - TZ=Asia/Dhaka
  - JAVA_OPTS=... -Duser.timezone=Asia/Dhaka
volumes:
  - /etc/localtime:/etc/localtime:ro
  - /etc/timezone:/etc/timezone:ro

You must recreate the container for env-var changes to take effect:

docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

Why this works:
Jenkins runs on Java, which uses its own timezone separate from the OS.
By aligning OS timezone, JVM timezone, and host clock, you ensure consistent timestamps everywhere.

ERR_SOCKET_TIMEOUT (pnpm install fails)

Cause:

If you have multiple services building in parallel and each runs pnpm install with ~1500 packages, the network gets saturated and a timeout occurs.

Fixes:

a) Increase timeout + control concurrency

RUN pnpm install --frozen-lockfile --ignore-scripts \
    --network-timeout 600000 \
    --network-concurrency 8

Why: Gives pnpm more time and reduces network overload.

b) Enable pnpm cache (BuildKit)

RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store \
    pnpm install --frozen-lockfile --ignore-scripts

Why: Dependencies are cached and reused instead of downloading every time.

c) Avoid unnecessary rebuilds

docker compose -f $COMPOSE_FILE build $CHANGED_SERVICES
docker compose -f $COMPOSE_FILE up -d --no-build $CHANGED_SERVICES

Why: Only changed services are rebuilt → less network load → fewer failures.
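
One way CHANGED_SERVICES could be computed is from the list of files touched by the last commit. The sketch below assumes each service lives in a top-level directory named after its compose service; that layout is an assumption, not something the pipeline requires. changed_services is a hypothetical helper name.

```shell
# changed_services: reads file paths on stdin (one per line) and prints the
# unique top-level directory names, space-separated, in first-seen order.
# Files at the repository root (no "/") are ignored.
changed_services() {
  awk -F/ 'NF > 1 && !seen[$1]++ { names = names (names ? " " : "") $1 }
           END { print names }'
}
```

Usage: CHANGED_SERVICES=$(git diff --name-only HEAD~1 HEAD | changed_services)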

Container changes don’t apply after editing docker-compose.yml

Cause:

Running docker compose up -d again does not always recreate a container that is already running, so some edits never reach it.

Fix:

docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

Why this works:

This forces Docker to recreate the container with updated configuration (env, volumes, labels).

Traefik shows default certificate (no HTTPS)

Common causes:

DNS not pointing to server
Port 80 blocked
Wrong Docker network

Check:

dig +short jenkins.example.com
docker logs projects-traefik-staging 2>&1 | grep -i acme

Why this works:

Let's Encrypt uses the HTTP-01 challenge, so it must reach your server on port 80. If DNS or networking is wrong, certificate issuance fails and Traefik keeps serving its default certificate.

Jenkins: "Reverse proxy setup is broken"

Fix:

Set the Jenkins URL to https://jenkins.example.com/
Ensure header:

X-Forwarded-Proto: https

Why this works:

Jenkins needs to know it's behind HTTPS. Without this, it generates incorrect URLs (http instead of https), breaking redirects and webhooks.

13. Mental Model: Host vs. Container

Many setup mistakes come from confusing the host filesystem with the container filesystem. This table makes it explicit:

Inside the Jenkins container	Comes from on the host
`/var/jenkins_home`	docker volume `jenkins-data` (Jenkins config, jobs, secrets)
`/projects/...`	`/home/developer/projects/...` (your project tree)
`/usr/bin/docker`	host's `/usr/bin/docker`
`/usr/libexec/docker/cli-plugins/docker-compose`	host plugin (lets `docker compose` work)
`/var/run/docker.sock`	host Docker daemon (so builds happen on the host's engine)
`/etc/localtime`, `/etc/timezone`	host clock
`~/.ssh`	nothing — that's why SSH-to-GitHub doesn't work without extra setup

When debugging, always ask: "Inside which filesystem is this command running, and does the file/folder it's looking for exist there?"
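
The project-tree row of the table can be expressed as a path translation. This is a minimal sketch that assumes the bind mount maps exactly /home/developer/projects on the host to /projects in the container, as the table suggests; host_to_container is a hypothetical helper name.

```shell
HOST_ROOT="/home/developer/projects"   # assumed bind-mount source on the host
CONTAINER_ROOT="/projects"             # assumed mount point inside the container

# host_to_container PATH: prints where a host path appears inside the
# container, or fails if the path is outside the bind mount.
host_to_container() {
  case "$1" in
    "$HOST_ROOT")   printf '%s\n' "$CONTAINER_ROOT" ;;
    "$HOST_ROOT"/*) printf '%s%s\n' "$CONTAINER_ROOT" "${1#"$HOST_ROOT"}" ;;
    *)              echo "not under $HOST_ROOT: $1" >&2; return 1 ;;
  esac
}
```

A path that fails this translation does not exist inside the container at all, which is usually the answer to the debugging question above.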

14. Daily Operations Cheat Sheet

# Recreate Jenkins after changing compose
cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

# Tail Jenkins logs
docker logs -f projects-jenkins-staging

# Open a shell inside the Jenkins container
docker exec -it projects-jenkins-staging bash

# From inside the container — sanity checks
docker compose version
ls /projects/projects-prod-configs/projects-backend
git -C /projects/projects-prod-configs/projects-backend remote -v

# Manually trigger the same deploy the pipeline does
cd /projects/projects-prod-configs/projects-backend
git pull origin staging
docker compose -f docker-compose.staging.yml up -d --build student-apigw

# Inspect Traefik routing decisions
docker logs projects-traefik-staging 2>&1 | grep -i jenkins

# Check renewed certs
docker exec projects-traefik-staging cat /etc/traefik/acme.json | head -50

15. What I'd Do Differently Next Time

Pre-build a base image with all node_modules baked in. With ~1500 packages × 15 services, every clean build re-downloads ~22k tarballs. A shared base cuts that by 90%.
Run a private npm proxy (Verdaccio / Nexus / GitHub Packages) on the same Docker network — eliminates flaky npmjs.org timeouts entirely.
Per-service Jenkinsfile if your services drift apart in tooling. With one Jenkinsfile, every team contends for the same pipeline definition.
Replace git diff HEAD~1 HEAD with git diff $(git merge-base HEAD origin/staging~1) HEAD so squash-merges and force-pushes don't accidentally skip services.
Move secrets to a vault (HashiCorp Vault / AWS Secrets Manager / Doppler). PATs in Jenkins work, but rotation across many jobs is painful.
Use Jenkins' Configuration-as-Code (JCasC) so the entire Jenkins setup (jobs, credentials definitions, plugins) is in git. Then a server rebuild is a one-command operation.

Closing Thoughts

The pipeline itself is just three stages: Checkout → Detect Changes → Deploy — but a real production setup is mostly about plumbing: reverse proxy, certificates, bind-mounts, credentials, timezones, build caches. None of these are exotic. Together they decide whether your Friday-afternoon deploy goes silently green or eats your weekend.

Follow sections 1–11 to get a working pipeline. Bookmark section 12 to keep it working.

Happy shipping.

How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP

Rasheedat Atinuke Jamiu — Wed, 22 Apr 2026 20:30:00 +0000

Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers and DCGM, apply OS‑level GPU tuning, and fight dependency issues. It's the same ritual every single time, wasting expensive cloud credits and causing frustration before the actual work begins.

In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.

Prerequisites
Project Setup
Step 1: Install Packer
Step 2: Set Up Project Directory
Step 3: Install Packer's Plugins
Step 4: Define Your Source
Step 5: Writing the Build Template
Step 6: Writing the GPU Provisioning Script
Step 7: Assembling and Running the Build
Step 8: Test the Image and Verify the GPU Stack
Conclusion
References

Prerequisites

HashiCorp Packer >= 1.9
Google Compute Packer plugin (installed via packer init)
Optionally, the AWS Packer plugin can be used for EC2 builds by adding an amazon-ebs source to source.pkr.hcl
GCP project with Compute Engine API enabled (or AWS account with EC2 access)
GCP authentication (gcloud auth application-default login) or AWS credentials
Access to an NVIDIA GPU instance type (for example, A100, H100, L4 on GCP; p4d, p5, G6 on AWS)

Project Setup

Step 1: Install Packer

To get started, you'll install Packer with the steps below if you're on macOS (or you can follow the official documentation for Linux and Windows installation guides).

First, you'll install the official Packer formula from the terminal.

Install the HashiCorp tap, a repository of all HashiCorp packages.

$ brew tap hashicorp/tap

Now, install Packer with hashicorp/tap/packer.

$ brew install hashicorp/tap/packer

Step 2: Set Up Project Directory

With Packer installed, you'll create your project directory. For clean code and separation of concerns, your project directory should look like the below. Go ahead and create these files in your packer_demo folder using the command below:

mkdir -p packer_demo/script && touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh

Your file directory should look like this:

packer_demo
├── build.pkr.hcl          # Build pipeline: provisioner ordering
├── source.pkr.hcl         # GCP source definition (googlecompute)
├── variable.pkr.hcl       # Variable definitions with defaults
├── local.pkr.hcl          # Local values
├── plugins.pkr.hcl        # Packer plugin requirements
├── values.pkrvars.hcl     # Variable values (copy and customize)
└── script/
    └── base.sh            # GPU provisioning script

Step 3: Install Packer's Plugins

In your plugins.pkr.hcl file, define your plugins in the packer block. The packer {} block contains Packer settings, including required plugin versions. Inside it, the required_plugins block lists every plugin the template needs to build your image. If you're on Azure or AWS, you can check the Packer documentation for the latest plugin.

packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~> 1"
    }
  }
}

Then, initialize your Packer plugin with the command below:

packer init .

Step 4: Define Your Source

With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. Source blocks contain your project ID, the zone where your machine will be created, the source_image_family (think of this as your base image, such as Debian, Ubuntu, and so on), and your source_image_project_id.

In GCP, each public image family belongs to an image project, such as "ubuntu-os-cloud" for Ubuntu. You'll set the machine type to a GPU machine type because you're building a base image for GPU machines, so the temporary build VM needs a GPU to run your provisioning commands.

source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}

Setting on_host_maintenance = "TERMINATE" tells Compute Engine to stop the VM instead of live-migrating it during host maintenance. VMs with attached GPUs can't be live-migrated, so this setting is required for GPU machine types.

You'll define all your variables in the variable.pkr.hcl file, and set the values in values.pkrvars.hcl. Remember to always add your values.pkrvars.hcl file to .gitignore.

variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The source image family used as the base for the build"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}

variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}

values.pkrvars.hcl

image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size         = 50
driver_version    = "590.48.01"
cuda_version      = "13.1"

Step 5: Writing the Build Template

Create build.pkr.hcl. The build block creates a temporary instance, runs provisioners, and produces an image.

Provisioners in this template are organized as follows:

First provisioner runs system updates and upgrades.
Second provisioner reboots the instance (expect_disconnect = true).
Third provisioner waits for the instance to come back (pause_before), then runs script/base.sh. It sets max_retries to handle transient SSH timeouts and passes DRIVER_VERSION and CUDA_VERSION as environment variables.

Lastly, you have the post-processor to tell you the image ID and completion status:

build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}

Step 6: Writing the GPU Provisioning Script

Now we'll go through the base script, and break down some parts of it.

Section 1: Pre-Installation (Kernel Headers)

Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.

log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

Section 2: Installing NVIDIA's Apt Repository

This snippet downloads and installs NVIDIA's official keyring package for your Linux distribution, which adds the trusted signing keys the system needs to verify CUDA packages.

log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
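
The snippet above interpolates DISTRO and ARCH, which the full script sets near its top. As a rough illustration of how the DISTRO value is derived, the sketch below reads an os-release file and builds the repository directory name. distro_slug is a hypothetical helper name, and the optional file argument exists only so the function can be tried against a sample file.

```shell
# distro_slug [FILE]: prints ID followed by VERSION_ID from an os-release
# file with the dots removed, for example "ubuntu2404".
distro_slug() {
  local file="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak into the caller.
  ( . "$file" && printf '%s%s\n' "$ID" "$VERSION_ID" | tr -d '.' )
}
```

On Ubuntu 24.04 this yields ubuntu2404, which matches the directory layout of NVIDIA's CUDA repository.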

Section 3: Pinning NVIDIA Drivers Version

Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.

NVIDIA drivers are tightly coupled with CUDA toolkit versions, kernel versions, and container tooling such as Docker or the NVIDIA Container Toolkit.

A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.

log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

Section 4: Installing the Driver

The libnvidia-compute package installs only the compute‑related user‑space libraries (CUDA driver components), while nvidia-dkms-open installs the open‑source NVIDIA kernel module, built locally via DKMS.

Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.

Here, we're using NVIDIA's compute‑only driver stack with the open‑source kernel modules because it deliberately avoids installing display-related components, which you don't need on a headless GPU node.

This DKMS-based installation is lightweight, compute-focused, and well aligned with how Linux distributions package kernel modules.

log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

Section 5: CUDA Toolkit Installation

This part of the script installs the CUDA Toolkit for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.

It adds CUDA binaries to PATH, so commands like nvcc, cuda-gdb, and compute-sanitizer work without specifying full paths. It also adds CUDA libraries to LD_LIBRARY_PATH, so applications can find CUDA's shared libraries at runtime.

log "Installing CUDA Toolkit ${CUDA_VERSION}..."
# NVIDIA's apt packages use hyphens in the version (cuda-toolkit-13-1), so convert the dot
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION//./-}"

# Persist CUDA paths for all users and sessions
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

Section 6: NVIDIA Container Toolkit

This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.

log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

Section 7: Installing DCGM (Data Center GPU Manager)

This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.

It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.

The script extracts the installed version and checks that it meets the minimum required for NVIDIA driver 590+, failing the build if it doesn't. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables Fabric Manager for NVLink/NVSwitch if you're on a multi‑GPU topology such as A100/H100 DGX systems or other multi‑GPU servers.

log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=$(dpkg -s datacenter-gpu-manager 2>/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] && [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
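
The major/minor comparison above could also be written as a single version-order check. The sketch below is an alternative, not what the script uses: it assumes GNU sort with the -V (version sort) option, version_ge is a hypothetical helper name, and the version string is expected to have its epoch prefix already stripped, as the sed call above does.

```shell
# version_ge ACTUAL MINIMUM: succeeds when ACTUAL is greater than or equal
# to MINIMUM in version order (so 4.10 counts as newer than 4.3).
version_ge() {
  # If MINIMUM sorts first (or ties), ACTUAL is at least MINIMUM.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

Usage: version_ge "${DCGM_VER}" 4.3 || error "DCGM ${DCGM_VER} is below the 4.3 minimum"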

Section 8: Enabling Persistence Mode

The NVIDIA driver normally unloads itself when the GPU is idle. When a new workload starts, the driver must reload, reinitialize the GPU, and set up memory mappings. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.

Enabling nvidia‑persistenced keeps the NVIDIA driver loaded in memory even when no GPU workloads are running.

log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

Section 9: System Tuning for GPU Compute Workloads

This block applies a set of system‑level performance and stability tunings that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.

Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.

Swap and memory behavior: Disabling swap and setting vm.swappiness=0 prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.
Hugepages for large memory allocations: Setting vm.nr_hugepages=2048 allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations. CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.
CPU frequency governor: Installing cpupower and forcing the CPU governor to performance ensures the CPU stays at maximum frequency instead of scaling down. GPU workloads often become CPU‑bound during data preprocessing, kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.
NUMA and topology tools: Installing numactl, libnuma-dev, and hwloc provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.
Disabling irqbalance: Stopping and disabling irqbalance lets the NVIDIA driver manage interrupt affinity. On GPU servers, irqbalance can move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.

log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
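
Before baking vm.nr_hugepages=2048 into an image, it helps to know how much memory that pool reserves, because reserved hugepages are no longer available for ordinary allocations. The sketch below does the arithmetic, assuming the common x86_64 default hugepage size of 2048 kB (check Hugepagesize in /proc/meminfo on your machine type). hugepages_mib is a hypothetical helper name.

```shell
# hugepages_mib PAGES [PAGE_KB]: memory reserved by a hugepage pool, in MiB.
# PAGE_KB defaults to 2048 (2 MiB pages), which is an assumption about the host.
hugepages_mib() {
  local pages="$1" page_kb="${2:-2048}"
  echo $(( pages * page_kb / 1024 ))
}
```

With the defaults, hugepages_mib 2048 prints 4096, so the setting reserves 4 GiB. On smaller machine types that can be a sizeable share of RAM, so you may want to scale the page count to the instance size.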

Full base.sh script here:

#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" >&2; exit 1; }

###############################################################
# 0. Validate required inputs
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] && error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] && error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=$(. /etc/os-release && echo "${ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
# NVIDIA's apt packages use hyphens in the version (cuda-toolkit-13-1), so convert the dot
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION//./-}"

# Persist CUDA paths for all users and sessions
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=$(dpkg -s datacenter-gpu-manager 2>/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] && [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"

Step 7: Assembling and Running the Build

Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.

packer validate -var-file=values.pkrvars.hcl .

If validation succeeds, you'll see a short confirmation like "The configuration is valid." After that, start the build. Expect the process to create a temporary VM, run your provisioners, and produce an image:

packer build -var-file=values.pkrvars.hcl .
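
If you run these two commands often, a small wrapper can enforce the order so a build never starts on a template that fails validation. This is a minimal sketch; build_image is a hypothetical function name, and the default variable file is the values.pkrvars.hcl used in this article.

```shell
# build_image [VAR_FILE]: validate the template in the current directory
# and run the build only if validation passes.
build_image() {
  local var_file="${1:-values.pkrvars.hcl}"
  if ! packer validate -var-file="$var_file" .; then
    echo "Validation failed; skipping build." >&2
    return 1
  fi
  packer build -var-file="$var_file" .
}
```

Run it from the packer_demo directory as build_image, or pass a different variable file as the first argument.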

The build typically takes 15–20 minutes, depending on network speed and package installs. Watch the Packer log for three key checkpoints:

Instance creation — confirms the temporary VM was provisioned.
Provisioner output — shows each script step (updates, reboot, script/base.sh) and any errors.
Image creation — indicates the build finished and an image artifact was written.

If the build fails, copy the failing provisioner’s log lines and re-run the build after fixing the script or variables. For quick troubleshooting, re-run the failing provisioner locally on a matching test VM to iterate faster.

googlecompute.gpu-node: output will be in this color.

==> googlecompute.gpu-node: Checking image does not exist...
==> googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==> googlecompute.gpu-node: no persistent disk to create
==> googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==> googlecompute.gpu-node: Creating instance...
==> googlecompute.gpu-node: Loading zone: us-central1-a
==> googlecompute.gpu-node: Loading machine type: g2-standard-4
==> googlecompute.gpu-node: Requesting instance creation...
==> googlecompute.gpu-node: Waiting for creation operation to complete...
==> googlecompute.gpu-node: Instance has been created!
==> googlecompute.gpu-node: Waiting for the instance to become running...
==> googlecompute.gpu-node: IP: 34.58.58.214
==> googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==> googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==> googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==> googlecompute.gpu-node:
==> googlecompute.gpu-node: No containers need to be restarted.
==> googlecompute.gpu-node:
==> googlecompute.gpu-node: User sessions running outdated binaries:
==> googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==> googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==> googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==> googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==> googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==> googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==> googlecompute.gpu-node: [BASE] Updating system packages...
==> googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==> googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==> googlecompute.gpu-node: [BASE] Installing DCGM...
==> googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==> googlecompute.gpu-node: [BASE] Applying system tuning...
==> googlecompute.gpu-node: vm.swappiness=0
==> googlecompute.gpu-node: vm.nr_hugepages=2048
==> googlecompute.gpu-node: Setting cpu: 0
==> googlecompute.gpu-node: Error setting new values. Common errors:
==> googlecompute.gpu-node: [BASE] ============================================
==> googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==> googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==> googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==> googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==> googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==> googlecompute.gpu-node: [BASE] ============================================
==> googlecompute.gpu-node: Deleting instance...
==> googlecompute.gpu-node: Instance has been deleted!
==> googlecompute.gpu-node: Creating image...
==> googlecompute.gpu-node: Deleting disk...
==> googlecompute.gpu-node: Disk has been deleted!
==> googlecompute.gpu-node: Running post-processor:  (type shell-local)
==> googlecompute.gpu-node (shell-local): Running local shell script: 
==> googlecompute.gpu-node (shell-local): === Image Build Complete ===
==> googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==> googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==> Wait completed after 17 minutes 55 seconds

==> Builds finished. The artifacts of successful builds are:
--> googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134
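As a rough sketch, you can also check for the three checkpoints mechanically in a saved build log. The function name check_packer_log and the idea of saving the log to a file (for example, by piping the build through tee) are assumptions for illustration. The marker strings are copied from the sample output above and may differ between Packer or plugin versions.

```shell
# check_packer_log FILE: report which of the three checkpoints appear in a
# saved Packer log. Marker strings are taken from the sample output in this
# article, so treat them as assumptions for other Packer versions.
check_packer_log() {
  log_file=$1
  missing=0
  for marker in \
    "Instance has been created!" \
    "Provisioning with shell script" \
    "Creating image..."
  do
    # -F matches the marker literally, so the dots are not wildcards.
    if grep -F -q -- "$marker" "$log_file"; then
      echo "found:   $marker"
    else
      echo "missing: $marker"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (build.log is a hypothetical file name):
#   packer build -var-file=values.pkrvars.hcl . | tee build.log
#   check_packer_log build.log
```

A non-zero exit status means at least one checkpoint never appeared, which tells you roughly how far the build got before it stopped.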

Step 8: Test the Image and Verify the GPU Stack

Confirm the image exists in the GCP Console: Compute → Storage → Images and locate your newly created OS image.

Create a test VM from the image. Replace the image name and project ID with the values printed at the end of your own build:

gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING

Once the instance is RUNNING, run nvidia-smi on it to verify that the NVIDIA driver and GPU are visible.

In this build, the nvidia-smi output confirms:

Driver 590.48.01 loaded
CUDA 13.1 available
Persistence Mode is On
The L4 GPU is detected with 23GB VRAM
Zero ECC errors
No running processes (clean idle state).

This is exactly what a healthy base image should look like. Notice Disp.A: Off? That confirms our compute-only driver choice is working — no display adapter is active.

Confirm the installed CUDA toolkit by running nvcc --version. The output should report version 13.1, as specified in the build.

Let's confirm DCGM installation by running dcgmi discovery -l. Successful output indicates DCGM is running and communicating with the driver.
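If you want to repeat these three checks on every new VM, a small wrapper can run them together. This is a minimal sketch: run_check and verify_gpu_stack are hypothetical helper names, and it assumes nvidia-smi, nvcc, and dcgmi are on the PATH of the user running it. It only checks exit codes, so it doesn't replace reading the actual output.

```shell
# run_check LABEL COMMAND...: run a command quietly and report PASS or FAIL.
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# verify_gpu_stack: run the three verification commands from this step and
# return non-zero if any of them fails (or is not installed).
verify_gpu_stack() {
  failed=0
  run_check "NVIDIA driver (nvidia-smi)" nvidia-smi || failed=1
  run_check "CUDA toolkit (nvcc --version)" nvcc --version || failed=1
  run_check "DCGM (dcgmi discovery -l)" dcgmi discovery -l || failed=1
  return "$failed"
}

# Example usage on the test VM:
#   verify_gpu_stack
```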

Conclusion

You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.

From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.

The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.

References

NVIDIA Driver Installation Guide (Ubuntu): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/
NVIDIA CUDA Toolkit Documentation: https://docs.nvidia.com/cuda/
NVIDIA Container Toolkit Installation Guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
NVIDIA DCGM Documentation: https://docs.nvidia.com/datacenter/dcgm/latest/index.html
NVIDIA Persistence Daemon: https://docs.nvidia.com/deploy/driver-persistence/index.html
HashiCorp Packer Documentation: https://developer.hashicorp.com/packer/docs
Packer Google Compute Builder: https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute

How to Fix a Failing GitHub PR: Debugging CI, Lint Errors, and Build Errors Step by Step

qacheampong — Wed, 22 Apr 2026 17:19:57 +0000

While many guides explain how to set up Continuous Integration (CI) pipelines, few show you how to debug them when things go wrong across multiple layers.

This is a common experience when contributing to open source: you make a small change, open a pull request, and suddenly everything fails.

Not just one check, but multiple:

Lint errors
YAML validation issues
Build failures
Deployment failures

Even more confusing, you may see errors in parts of the codebase you didn’t modify.

In this article, you'll learn how to debug these issues step by step. The goal is not just to fix one pull request, but to understand how CI systems validate your changes.

This guide is based on a real debugging experience from contributing to an open source documentation project.

While this example comes from a documentation project, the debugging workflow applies to many repositories that use CI pipelines, linting tools, and automated builds.

Prerequisites

To follow this guide, you should have:

Basic familiarity with Git and pull requests
A GitHub account
Some exposure to CI/CD concepts (helpful but not required)

Understanding the CI Pipeline (What’s Actually Happening)

In many projects, you will see the term CI/CD, which stands for Continuous Integration and Continuous Deployment (or Delivery).

In this guide, we'll focus specifically on the CI part – that is, Continuous Integration. This refers to the automated checks that run when you push code or open a pull request. These checks validate your changes before they're merged into the main codebase.

CD (Continuous Deployment/Delivery), on the other hand, typically handles what happens after those checks pass, such as deploying the application.

Understanding this distinction is important because most of the issues we debug in this guide happen during the CI stage.

Most repositories run multiple automated checks when you open a pull request:

Linting tools (for example, markdownlint, yamllint) enforce formatting rules
Build systems (for example, mdBook) validate structure and generate output
Deployment checks (for example, Netlify) ensure that the site can be built and served
Merge controllers (for example, Tide) enforce approval policies

A key point to remember: CI systems validate the entire set of files in your commit, not just the lines you changed.

How a CI Pipeline Processes Your Pull Request

When you push code or open a pull request, the CI pipeline runs several checks in sequence.

Let’s visualize how these checks are connected in a typical CI pipeline.

Figure: A simplified CI pipeline showing how linting, build, and deployment checks are executed sequentially.

The above diagram shows a sequential CI pipeline with feedback loops, where failures at any stage return you to fix the issue before continuing.

Let’s break down what this diagram shows:

You start by pushing code or opening a pull request.
The CI pipeline begins running automated checks.
The first set of checks typically includes linting tools like markdownlint or yamllint.
- If linting fails, the pipeline stops, and you must fix formatting issues before continuing.
- If linting passes, the pipeline moves to the build step (for example, mdBook in documentation projects).
- If the build fails, it usually means there is a structural issue, such as duplicate entries or invalid references.
After a successful build, deployment checks (such as Netlify previews) run.
- If deployment fails, the issue is often related to configuration or build output.
If all steps pass, the pull request becomes ready for review.
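To make the stop-at-first-failure behavior concrete, here's a minimal sketch of the same flow as a shell script. The three stage functions are placeholders (assumptions, not any real project's commands), and real CI systems add much more, such as parallel jobs and caching.

```shell
# Placeholder stages: swap each body for your project's real command.
lint_stage()   { true; }
build_stage()  { true; }
deploy_stage() { true; }

# run_pipeline: run the stages in order and stop at the first failure,
# mirroring the sequential pipeline described above.
run_pipeline() {
  for stage in lint build deploy; do
    echo "running: $stage"
    if ! "${stage}_stage"; then
      echo "stopped: $stage failed, later stages skipped"
      return 1
    fi
  done
  echo "all stages passed: ready for review"
}

# Example usage:
#   run_pipeline
```

If you replace the body of build_stage with a failing command, the deploy stage never runs. That's the same behavior you see when a failed build blocks a deployment preview.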

A Practical Debugging Workflow

Step 1: Fix Authentication and Permission Issues

Before CI runs, your push can fail due to authentication errors.

Example error:

refusing to allow a Personal Access Token to create or update workflow

This happens because GitHub requires special permissions when your commit includes files under:

.github/workflows/

The solution is to regenerate your Personal Access Token (PAT) with these scopes:

repo
workflow
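Before you push, you can check whether your changes touch workflow files at all. The sketch below assumes you feed it a list of changed paths, for example from git diff --name-only. The function name touches_workflows and the origin/main base branch in the usage comment are assumptions for illustration.

```shell
# touches_workflows: read file paths on stdin, one per line, and succeed if
# any of them lives under .github/workflows/.
touches_workflows() {
  grep -q '^\.github/workflows/'
}

# Example usage (origin/main is an assumed base branch):
#   git diff --name-only origin/main...HEAD | touches_workflows \
#     && echo "This push needs a token with the workflow scope"
```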

Step 2: Run Lint Checks Locally

Relying only on CI feedback slows you down because you have to push changes and wait for the pipeline to run before seeing errors.

Running checks locally allows you to catch issues immediately before pushing your code.

In practice, you should do both:

Run checks locally to catch errors early and reduce iteration time
Use CI as the final validation to ensure your changes meet the repository’s standards

Think of local checks as your first line of defense, and CI as the final gate before your code is accepted.

Here's an example (Markdown linting):

npm install -g markdownlint-cli2
markdownlint-cli2 "docs/**/*.md"

Step 3: Fix Common Markdown Lint Errors

Here are some common issues you may encounter:

1. Non-descriptive links

Non-descriptive links like "here" don't give readers any context about where the link leads. This makes documentation harder to understand and less accessible, especially for users relying on screen readers.

Instead of writing:

[here](https://example.com)

Use descriptive text like:

[command help documentation](https://example.com)

2. Line length violations

Many projects enforce a maximum line length (often around 80 characters) to improve readability across different devices and editors.

If a line is too long, you can split it into multiple lines without changing the meaning.

To do this, break the line at natural points such as spaces between words or after punctuation. Avoid breaking words or disrupting the sentence structure.

For example:

This is a long sentence that should be split across multiple
lines to satisfy lint rules.
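To find over-long lines before the linter does, a one-line awk filter is often enough. This is a sketch, not a replacement for markdownlint: long_lines is a hypothetical helper name, the default of 80 is an assumption you should match to your project's configuration, and unlike the linter it doesn't exempt code blocks or tables.

```shell
# long_lines FILE [MAX]: print "line_number:length" for every line in FILE
# that is longer than MAX characters (default 80).
long_lines() {
  awk -v max="${2:-80}" 'length($0) > max { print NR ":" length($0) }' "$1"
}

# Example usage (docs/page.md is a hypothetical path):
#   long_lines docs/page.md
#   long_lines docs/page.md 100
```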

3. List indentation issues

List indentation errors occur when nested list items aren't aligned consistently. This can break formatting and cause linting errors.

To avoid this, just make sure you use consistent spacing (usually 2 spaces per level).

Example (incorrect):

- Item 1
 - Subitem

Correct version:

- Item 1
  - Subitem

Step 4: Fix YAML Inside Markdown Code Blocks

YAML has strict formatting rules, including proper indentation, key-value structure, and consistent spacing.

Even when YAML appears inside a Markdown code block, a project's CI may still check its structure with tools like yamllint.

Example (incorrect):

metadata:
annotations:

Correct version:

metadata:
  annotations:
    capi.metal3.io/unhealthy: "true"

In the incorrect example, annotations is not properly nested under metadata, and no key-value pair is defined.

In the corrected version:

annotations is properly indented under metadata
a valid key-value pair is added (capi.metal3.io/unhealthy: "true")

This structure satisfies YAML’s requirement for proper hierarchy and formatting.
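To check YAML snippets locally, one approach is to pull the fenced YAML blocks out of a Markdown file and pass them to a YAML linter. This is a minimal sketch of that idea, not how any particular project's CI does it. The helper name extract_yaml_blocks is hypothetical, and it only recognizes unindented fences that are labeled yaml or yml.

```shell
# extract_yaml_blocks FILE: print the contents of every fenced yaml/yml code
# block in a Markdown file, with a "---" document separator after each block.
extract_yaml_blocks() {
  # Build the three-backtick fence marker from octal escapes.
  fence=$(printf '\140\140\140')
  awk -v fence="$fence" '
    !inside && ($0 == (fence "yaml") || $0 == (fence "yml")) { inside = 1; next }
    inside && $0 == fence { inside = 0; print "---"; next }
    inside { print }
  ' "$1"
}

# Example usage (docs/page.md is a hypothetical path):
#   extract_yaml_blocks docs/page.md
```

If yamllint is installed, piping the result into it (for example, extract_yaml_blocks docs/page.md | yamllint -) should let you lint the snippets without pushing. Check your yamllint version's documentation for how it reads standard input.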

Step 5: Fix Build Errors After Lint Passes

Passing lint checks doesn't guarantee that your build will succeed.

This is because linting focuses on syntax and formatting, while the build process validates the structure and integrity of the entire project.

Build failures often occur due to issues such as:

Duplicate entries in navigation files
Missing or incorrectly referenced files
Invalid configuration settings

Even if your syntax is correct, the build system ensures everything connects properly.

For example, in documentation projects using tools like mdBook, a duplicate entry in SUMMARY.md can cause the build to fail even when all files pass lint checks.
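A quick way to look for this kind of problem is to list link targets that appear more than once in the navigation file. The sketch below assumes Markdown-style links to .md files, as used in an mdBook SUMMARY.md. The helper name duplicate_links is hypothetical.

```shell
# duplicate_links FILE: print every Markdown link target ending in .md that
# appears more than once in FILE.
duplicate_links() {
  grep -o '([^)]*\.md)' "$1" | tr -d '()' | sort | uniq -d
}

# Example usage (the path is an assumption about your repository layout):
#   duplicate_links src/SUMMARY.md
```

Empty output means no target is repeated. Any path it prints is worth checking before you push.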

Step 6: Debug Cascading CI Failures

CI pipelines are layered. One failure can trigger multiple downstream failures.

For example, imagine a YAML indentation error:

YAML error → build fails → deploy fails → multiple checks fail

To fix this:

Identify the first failing step in the CI logs
Fix that issue
Re-run the pipeline

In this example, the YAML indentation error is the root cause. Once you fix the YAML formatting, the lint check passes, which allows the build to proceed and the deployment step to succeed.

This is why it is important to always fix the first failure in the pipeline rather than trying to address all errors at once.
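As a small illustration of "find the first failure," suppose you've written down the check results in pipeline order, one per line, as "name: status". That format is a made-up convention for this sketch, not something GitHub produces. The helper below prints only the first failing check.

```shell
# first_failure: read "check: status" lines on stdin, in pipeline order, and
# print the name of the first check whose status is "fail".
first_failure() {
  awk -F': *' '$2 == "fail" { print $1; exit }'
}

# Example usage:
#   printf 'lint: pass\nyaml: fail\nbuild: fail\ndeploy: fail\n' | first_failure
```

With the cascading example above, this points you at the YAML check, even though the build and deploy checks are red as well.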

Step 7: Handle Git Issues During CI Debugging

When working with updated branches, you may encounter:

Diverged branches
Rebase conflicts
Push rejections

To resolve these issues, you typically need to update your branch using one of two approaches:

Option 1: Rebase (clean history)

git pull --rebase

Rebasing rewrites your commit history so your changes appear on top of the latest version of the branch.

Use carefully:

Only rebase your own branches
Avoid rebasing shared branches

Option 2: Merge (safer)

git pull --no-rebase

Merging preserves the full commit history and is safer when working with others, but it may introduce additional merge commits.

Pushing your changes safely

After a rebase, your local history no longer matches the remote branch, so a regular push is rejected. In that case, force-push with a safety check:

git push --force-with-lease

Avoid using:

git push --force

The --force option can overwrite other contributors’ work. The --force-with-lease option is safer because it refuses to push if the remote branch has changed since you last fetched it.

Key Takeaways

CI validates your entire commit, not just the specific lines you changed
Linting and build systems enforce different rules
YAML inside markdown must be structurally correct
Documentation builds can fail due to structural issues
Running checks locally significantly reduces debugging time

Conclusion

Debugging a failing pull request isn't just about fixing syntax errors.

You also need to understand how different systems interact:

Version control
CI pipelines
Linting tools
Build processes

Once you understand how these systems work together, you can debug issues systematically instead of guessing.

The next time your pull request fails, you will know exactly where to start and how to fix it.

Debugging CI issues may feel overwhelming at first, but with a structured approach, you can turn failures into a clear path for improvement.

How to Build a Local DevOps HomeLab with Docker, Kubernetes, and Ansible

Osomudeya Zudonu — Mon, 13 Apr 2026 21:56:12 +0000

The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS.

I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $34 bill for a machine running nothing.

That was the last time I practiced on someone else's infrastructure.

Everything in this guide runs on your laptop. No cloud account, no credit card, no bill at the end of the month. By the end, you'll be able to spin up a multi-server environment from scratch, configure it automatically with Ansible, serve a site you wrote yourself, and diagnose what breaks when you intentionally destroy it.

That last part is where the actual learning happens.

Prerequisites

Before you start, make sure you have:

A laptop with at least 8GB of RAM (16GB is better)
At least 20GB of free disk space
Windows, macOS, or Linux operating system
Administrator access to your computer
Virtualization enabled in your BIOS/UEFI settings
A stable internet connection for the initial downloads

Knowledge and comfort level:

You should be comfortable using a terminal (running commands, changing directories, and editing small text files with whatever editor you like).
Basic familiarity with concepts like “a server,” “SSH,” and “a port” helps, but you don't need prior experience with Docker, Kubernetes, Vagrant, or Ansible. This guide introduces them as you go.

If you can follow step-by-step instructions and read error output without panicking, you're ready.

What is DevOps?
Why Build a Local Lab?
How to Set Up Docker
How to Set Up Kubernetes
How to Install kubectl
How to Set Up Vagrant
How to Install Ansible
How to Build Your First DevOps Project
How to Break Your Lab on Purpose
What You Can Now Do

What is DevOps?

DevOps is the practice of breaking down the wall between software development and IT operations teams.

Traditionally, developers write code and hand it off to operations teams to deploy and maintain. That handoff causes delays, misunderstandings, and outages. DevOps is what happens when both teams work together from the start.

The tools you'll install in this guide each solve a specific part of that process:

Docker packages your application and everything it needs into a portable container that runs the same way on any machine.
Kubernetes manages multiple containers at scale, handling restarts, networking, and load balancing automatically.
Vagrant creates and manages virtual machine environments so your whole team always works on identical setups.
Ansible automates repetitive configuration tasks across many servers without writing a script for each one.

Why Build a Local Lab?

A local lab gives you a safe place to break things, fix them, and learn from that process without any cost or risk.

Here's what you get with a local setup:

Zero cost. No cloud bills, no surprise charges, and no credit card required.
Works offline. Practice anywhere, even without internet after the initial setup.
Full control. You manage every layer from the OS up to the application.
Safe experimentation. Break things freely. Nothing here affects production.
Fast feedback. No waiting for cloud resources to spin up. Everything runs on your machine.

The tradeoff is resource limits. Your laptop's CPU and RAM are the ceiling. You can't simulate large-scale deployments, and some cloud-native services like AWS Lambda or S3 have no direct local equivalent. But for learning core DevOps workflows, none of that matters.

How to Set Up Docker

Docker is the foundation of this lab. Every other tool in this guide either runs inside Docker containers or works alongside them.

How to Install Docker on Windows

First, enable virtualization in your BIOS:

Restart your computer and enter BIOS/UEFI setup. The key is usually F2, F10, Del, or Esc during boot.
Find the virtualization setting. It's usually listed as Intel VT-x, AMD-V, SVM, or Virtualization Technology.
Enable it, save your changes, and exit.

Then install Docker Desktop:

Download Docker Desktop from Docker's official website.
Run the installer and follow the prompts.
Enable WSL 2 (Windows Subsystem for Linux) when asked.
Restart your computer.
Open Docker Desktop from the Start menu and wait for the whale icon in the taskbar to stop animating.

Troubleshooting: If Docker fails to start, run this in PowerShell as Administrator to verify virtualization is active:

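systeminfo | Select-String "Hyper-V Requirements" -Context 0,3

All four items should show "Yes". If they don't, revisit your BIOS settings. If the output instead says that a hypervisor has been detected, virtualization is already active.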

How to Install Docker on Mac

Download Docker Desktop for Mac from Docker's website.
Open the downloaded .dmg file and drag Docker to your Applications folder.
Open Docker from Applications.
Enter your password when prompted.
Wait for the whale icon in the menu bar to stop animating.

How to Install Docker on Linux

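On Ubuntu, run these commands in order:

# Update your package lists
sudo apt-get update

# Install prerequisites
sudo apt-get install ca-certificates curl

# Add Docker's official GPG key to a dedicated keyring
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the Docker repository for your CPU architecture and Ubuntu release
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Update and install Docker
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io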

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add your user to the docker group
sudo usermod -aG docker $USER

Log out and back in for the group change to take effect.

How to Test Docker

Run this command:

docker run hello-world

If you see "Hello from Docker!" then Docker is working correctly.

Docker is set up. Next, you'll install Kubernetes to manage containers at scale.

How to Set Up Kubernetes

Kubernetes manages containers at scale. For a local lab, you have four options. Here's how to choose:

Tool	Best for	RAM needed
Minikube	Beginners. Easiest setup, built-in dashboard	2GB+
Kind	Faster startup, works well inside CI pipelines	1GB+
k3s	Low-resource machines. Lightweight but production-like	512MB+
kubeadm	Learning how clusters are actually bootstrapped in production	2GB+ per node

If you're just starting out, use Minikube. It has the simplest setup and a visual dashboard that helps you understand what's happening inside the cluster.

If your laptop has 8GB RAM or less, use k3s. It runs lean and behaves closer to a real cluster than Minikube does.

Use kubeadm only if you want to understand how Kubernetes nodes join a cluster — it requires more manual steps and isn't beginner-friendly.

How to Install Minikube (Recommended for Beginners)

Minikube creates a single-node Kubernetes cluster on your laptop.

On Windows:

Download the Minikube installer from Minikube's GitHub releases page.
Run the .exe installer.
Open Command Prompt as Administrator and start Minikube:

minikube start --driver=docker

On Mac:

brew install minikube
minikube start --driver=docker

On Linux:

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube-linux-amd64
sudo mv minikube-linux-amd64 /usr/local/bin/minikube
minikube start --driver=docker

Test your cluster:

minikube status
minikube dashboard

How to Install k3s (Recommended for Low-RAM Machines)

k3s is a lightweight version of Kubernetes that installs in under a minute. It runs lean and behaves like a real cluster — not a simplified demo version.

On Linux:

curl -sfL https://get.k3s.io | sh -

That single command installs k3s and runs it automatically in the background. Check that it is running:

sudo k3s kubectl get nodes

You should see one node with status Ready.

On Mac: k3s doesn't run natively on macOS. Use Multipass to spin up a lightweight Ubuntu VM first, then run the install command inside it.

On Windows: use WSL2 (Ubuntu), then run the install command inside your WSL2 terminal.

How to Install Kind (Kubernetes IN Docker)

Kind runs a full Kubernetes cluster inside Docker containers. It starts faster than Minikube and is useful if you want to run multiple clusters simultaneously.

# Mac or Linux
brew install kind

# Windows
choco install kind

Create a cluster:

kind create cluster --name my-local-lab

How to Install kubeadm (For Understanding Cluster Bootstrap)

kubeadm is the tool Kubernetes uses to initialize and join nodes in a real cluster. Use this when you want to understand what happens under the hood — not as your daily driver.

It requires at least two machines (or VMs). The setup is more involved than the options above. Follow the official kubeadm installation guide for your OS, then initialize your cluster:

sudo kubeadm init --pod-network-cidr=10.244.0.0/16

After init, join worker nodes using the command kubeadm prints at the end of the output.

How to Install kubectl

kubectl is the command-line tool you use to interact with any Kubernetes cluster.

On Windows:

Download kubectl.exe from Kubernetes' website and place it in a directory that is in your PATH. Or install with Chocolatey:

choco install kubernetes-cli

On Mac:

brew install kubectl

On Linux:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl

Test it:

kubectl get pods --all-namespaces

On a fresh cluster, you'll see system pods running in the kube-system namespace — things like coredns and storage-provisioner. That's the expected output. It means your cluster is up and kubectl can talk to it.

Kubernetes is running. Next is Vagrant. But before that, there's one important distinction worth making.

Docker vs Vagrant — they aren't the same thing

Docker creates containers: lightweight processes that share your operating system's kernel. Vagrant creates full virtual machines: isolated computers with their own OS running inside your laptop.

Containers are fast and small. VMs are heavier but behave exactly like real servers. You'll use both in this lab for different reasons.

How to Set Up Vagrant

Vagrant lets you create and manage reproducible virtual machine environments. It is ideal for simulating multi-server setups on a single laptop.

How to Install Vagrant on Windows

Download and install VirtualBox with default options.
Download and install Vagrant.
Restart your computer if prompted.

Note: VirtualBox and Hyper-V can't run at the same time on Windows. Check if Hyper-V is active:

systeminfo | findstr "Hyper-V"

If it's enabled, you have two options: switch to the Hyper-V Vagrant provider, or disable Hyper-V with:

Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All

Restart after disabling.

How to Install Vagrant on Mac and Linux

On Mac:

Download and install VirtualBox.
After installation, open System Preferences > Security & Privacy > General. You will see a message saying system software from Oracle was blocked. Click Allow and restart your Mac. Without this step, VirtualBox will not run.
Download and install Vagrant.

Note for Apple Silicon (M1/M2/M3) Macs: VirtualBox support on Apple Silicon is still limited. If you're on an M-series Mac, use UTM as your VM provider instead, or use Multipass which works natively on Apple Silicon.

On Linux:

Download and install VirtualBox.
Download and install Vagrant.

Verify both are installed:

vboxmanage --version
vagrant --version

How to Create Your First Vagrant Environment

Create a new directory for your project. Inside it, create a file named Vagrantfile with this content:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  # Create a private network between VMs
  config.vm.network "private_network", type: "dhcp"

  # Forward port 8080 on your laptop to port 80 on the VM
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Install Nginx the first time the VM is provisioned
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y nginx
    echo "Hello from Vagrant!" > /var/www/html/index.html
  SHELL
end

Start the VM:

vagrant up

Visit http://localhost:8080 in your browser. You should see "Hello from Vagrant!"

Troubleshooting SSH on Windows

If vagrant ssh fails, try:

vagrant ssh -- -v

Or connect manually:

ssh -i .vagrant/machines/default/virtualbox/private_key vagrant@127.0.0.1 -p 2222

How to Create a Local Vagrant Box Without Internet

Note: Most readers can skip this. Only do this if you want to work fully offline after the initial setup.

Download Ubuntu 20.04 LTS and save the .iso file locally.
Open VirtualBox and create a new VM: Name it ubuntu-devops, Type: Linux, Version: Ubuntu (64-bit).
Assign 2048MB RAM and a 20GB VDI disk.
Attach the .iso under Storage > Optical Drive.
Start the VM and complete the Ubuntu installation.
Once installed, shut down the VM and run:

VBoxManage list vms
vagrant package --base "ubuntu-devops" --output ubuntu2004.box
vagrant box add ubuntu2004 ubuntu2004.box

You now have a reusable local box that works without internet.

You can spin up virtual machines. Next is Ansible, which automates what goes inside them.

How to Install Ansible

Ansible automates configuration and software installation across multiple servers. Instead of SSH-ing into ten machines and running the same commands manually, you write a playbook once and Ansible handles the rest.

How to Install Ansible on Windows

Ansible doesn't run natively on Windows. You need to use it through WSL (Windows Subsystem for Linux).

Open PowerShell as Administrator and enable WSL:

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Restart your computer.
Install Ubuntu from the Microsoft Store.
Open Ubuntu and install Ansible:

sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

How to Install Ansible on Mac

brew install ansible

How to Install Ansible on Linux

# Ubuntu/Debian
sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

# Red Hat/CentOS
sudo yum install ansible

How to Test Ansible

Create a file called hosts in your current directory:

[local]
localhost ansible_connection=local

Create a file called playbook.yml in the same directory:

---
- name: Test playbook
  hosts: local
  tasks:
    - name: Print a message
      debug:
        msg: "Ansible is working!"

Run the playbook, passing the local hosts file with -i:

ansible-playbook -i hosts playbook.yml

You should see the message "Ansible is working!" in the output.

Alright, all your tools are installed. Now you'll use them together to build something real.
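Before you move on, it can help to confirm in one go that every tool is actually on your PATH. Here's a minimal sketch. The helper names tool_status and check_lab_tools are made up for this example, and the check only confirms that each command exists, not that its daemon or cluster is running.

```shell
# tool_status NAME: print whether NAME is available on your PATH.
tool_status() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# check_lab_tools: report on the four tools this guide installs.
check_lab_tools() {
  for tool in docker kubectl vagrant ansible; do
    tool_status "$tool"
  done
}

# Example usage:
#   check_lab_tools
```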

How to Build Your First DevOps Project

You can find the entire code for this lab in this repo: https://github.com/Osomudeya/homelab-demo-article

Now you'll put these tools together in one project. Each tool will perform its actual job, and nothing is forced.

Before you start, create a fresh directory for this project. Don't run it inside the directory you used to test Vagrant earlier, as the Vagrantfile here is different and will conflict.

You'll be building a two-VM environment: one machine serves a web page you write yourself inside a Docker container, and the other runs a MariaDB database. Vagrant creates the machines and Ansible configures them. The page you see at the end is yours.

Step 1: Create the Project Directory

mkdir devops-lab-project && cd devops-lab-project

Step 2: Write Your Site Content

Create a file called index.html in the project directory. Write whatever you want on this page — it's what you'll see in your browser at the end:



<!DOCTYPE html>
<html>
<head>
  <title>My DevOps Lab</title>
</head>
<body>
  <h1>My DevOps Lab</h1>
  <p>Provisioned by Vagrant. Configured by Ansible. Served by Docker.</p>
  <p>Built on a laptop. No cloud account needed.</p>
</body>
</html>

Change the text to whatever you like. This is your page.

Step 3: Write the Vagrantfile

Create a file called Vagrantfile in the same directory:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.33.10"
    web.vm.network "forwarded_port", guest: 80, host: 8080
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.33.11"
  end
end

Step 4: Start the Virtual Machines

vagrant up

The first run downloads the ubuntu/focal64 box, which is around 500MB.

Expect this to take 10–30 minutes depending on your connection. Subsequent runs will be much faster since the box is cached locally.

Step 5: Create the Ansible Inventory

Create a file called inventory in the same directory:

[webservers]
192.168.33.10 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

[dbservers]
192.168.33.11 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/db/virtualbox/private_key

Ansible uses the Vagrant-generated private keys so it can SSH in as the vagrant user. Host key checking for this lab is turned off in ansible.cfg (next step), not in the inventory.

Step 6: Create the Ansible Config File

Before running the playbook, create a file called ansible.cfg in the same directory:

[defaults]
inventory = inventory
host_key_checking = False

The inventory line tells Ansible to use the inventory file in this folder by default. host_key_checking = False tells Ansible not to verify SSH host keys when connecting to your Vagrant VMs. Without it, Ansible will fail with a Host key verification failed error on first connection because the VM's key is not yet in your known_hosts file.

These settings are for a local lab only. Do not use host_key_checking = False for production systems.

Step 7: Create the Ansible Playbook

Create a file called playbook.yml:

---
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:

    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes

    - name: Start Docker service
      service:
        name: docker
        state: started
        enabled: yes

    # Create the directory that will hold your site content
    - name: Create web content directory
      file:
        path: /var/www/html
        state: directory
        mode: '0755'

    # This copies your index.html from your laptop into the VM
    - name: Copy site content to web server
      copy:
        src: index.html
        dest: /var/www/html/index.html

    # This mounts that file into the Nginx container so it serves your page
    # The -v flag connects /var/www/html on the VM to /usr/share/nginx/html inside the container
    - name: Run Nginx serving your content
      shell: |
        docker rm -f webapp 2>/dev/null || true
        docker run -d --name webapp --restart always -p 80:80 \
          -v /var/www/html:/usr/share/nginx/html:ro nginx

- name: Configure database server
  hosts: dbservers
  become: yes
  tasks:

    # Hash sum mismatch on .deb downloads is often stale lists, a flaky mirror, or apt pipelining
    # behind NAT; fresh indices + Pipeline-Depth 0 usually fixes it on lab VMs.
    - name: Disable apt HTTP pipelining (mirror/proxy hash mismatch workaround)
      copy:
        dest: /etc/apt/apt.conf.d/99disable-pipelining
        content: 'Acquire::http::Pipeline-Depth "0";'
        mode: "0644"

    - name: Clear apt package index cache
      shell: apt-get clean && rm -rf /var/lib/apt/lists/* /var/lib/apt/lists/auxfiles/*
      changed_when: true

    - name: Update apt cache after reset
      apt:
        update_cache: yes

    - name: Install MariaDB
      apt:
        name: mariadb-server
        state: present
        update_cache: no

    - name: Start MariaDB service
      service:
        name: mariadb
        state: started
        enabled: yes

Two lines worth paying attention to:

src: index.html — Ansible looks for this file in the same directory as the playbook. That is the file you wrote in Step 2.
-v /var/www/html:/usr/share/nginx/html:ro — this mounts the directory from the VM into the Nginx container. The :ro means read-only. Nginx serves whatever is in that folder.

Step 8: Run the Playbook

ansible-playbook -i inventory playbook.yml

You'll see task-by-task output as Ansible connects to each VM over SSH and configures it. A green ok or yellow changed next to each task means it worked. Red fatal means something failed.

Step 9: Verify the Setup

Open http://localhost:8080 in your browser. You should see the page you wrote in Step 2 served from inside a Docker container, running on a Vagrant VM, configured automatically by Ansible.

If you see the page, every tool in this lab is working together.

Step 10: Clean Up (Optional)

When you're done:

vagrant destroy -f

This shuts down and deletes both VMs. Your Vagrantfile, inventory, playbook.yml, and index.html stay on disk — run vagrant up followed by ansible-playbook -i inventory playbook.yml any time to bring it all back.

Now that you have a working lab, let's use it properly.

How to Break Your Lab on Purpose

Following these steps has gotten you a running lab. Breaking things teaches you how everything actually works.

Here are five things to break and what to look for when you do.

Break 1: Crash the Main Process Inside the Container (and Watch It Come Back)

This break demonstrates three things: a process inside the container can die (as it would from a real bug or an out-of-memory kill), Docker restarts the container because of --restart always, and your site comes back without re-running Ansible.

After vagrant ssh web, every docker command below runs on the web VM. So keep your browser on your laptop at http://localhost:8080 (Vagrant forwards your host port to the VM’s port 80).

Troubleshooting: If Your Lab Isn't Ready

From your project folder on the host (your laptop) – unless the step says to run it on the VM:

You ran vagrant destroy -f. Run vagrant up, then ansible-playbook -i inventory playbook.yml.
docker ps shows webapp but status is Exited. On the web VM, run sudo docker start webapp, then sudo docker ps again.
There's no webapp row in docker ps -a. Re-run ansible-playbook -i inventory playbook.yml on the host.

If the playbook is already applied and webapp is Up, skip this section and start at step 1 under Steps (happy path) below. (Don't skip SSH or docker ps. You need the VM shell and a quick check before you run docker exec.)

Steps (happy path)

SSH into the web VM:

vagrant ssh web

Confirm webapp is Up:
```
sudo docker ps
```
Break it on purpose: kill the container’s main process from inside (PID 1). That ends the container the same way a crashing app would, not the same as docker stop on the host:

sudo docker exec webapp sh -c 'sleep 5 && kill 1'

The sleep 5 gives you a moment to switch to the browser. Right after you run the command, open or refresh http://localhost:8080. You may catch a brief error or blank page while nothing is listening on port 80.

Watch Docker restart the container:

watch sudo docker ps -a

Within a few seconds you should see Exited (137) become Up again. (Press Ctrl+C to exit watch.)

Refresh the browser. You should see the same HTML as before, because the files live on the VM under /var/www/html and are bind-mounted into the container; restarting only replaced the Nginx process, not those files.

Why not `docker stop` or `docker kill` on the host for this demo?

Those commands go through Docker’s API. On many setups (including recent Docker), Docker records them as you choosing to stop the container (hasBeenManuallyStopped), so --restart always does not bring the container back until you start it again yourself (for example with docker start) or the Docker daemon restarts.

Killing PID 1 from inside the container is treated more like an internal crash, so the restart policy you set in the playbook is the one you actually get to observe here.

Kubernetes analogy: A pod whose containers exit can be restarted by the kubelet; a pod you delete does not come back by itself.

What to observe (three separate checks):

Exit code: After kill 1, docker ps -a should show the container exited with code 137, meaning the main process was killed by a signal. That confirms the container really died, not that you ran docker stop on the host.
Restart delay vs browser: Watch how many seconds pass between Exited and Up in docker ps -a; that interval is Docker applying --restart always. That's separate from what you see in the browser: the browser only shows whether something is accepting connections on port 80 on the VM, so it may show an error or blank page during the gap even while Docker is about to restart the container.
Content after recovery: After status is Up again, refresh the page. You should see the same HTML as before. That shows your content lives on the VM disk (mounted into the container with -v), not inside a file that vanishes when the container process restarts. The process was replaced, not your index.html on the host path.
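The exit code in the first check follows the common convention that a process killed by signal N reports 128 + N, so 137 corresponds to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM). The sketch below decodes that arithmetic. The function name signalFromExitCode is made up for this illustration, and the exact code you see in docker ps -a depends on how the main process handles the signal: a process that catches the signal and shuts down cleanly can report 0 instead.

```go
package main

import "fmt"

// signalFromExitCode returns the signal number encoded in an exit code
// under the 128+N convention, or 0 when the code is outside that range.
// The name is hypothetical; it is not part of Docker or the Go standard library.
func signalFromExitCode(code int) int {
	if code > 128 && code < 256 {
		return code - 128
	}
	return 0
}

func main() {
	// 0 = clean exit, 137 = 128+9 (SIGKILL), 143 = 128+15 (SIGTERM)
	for _, code := range []int{0, 137, 143} {
		fmt.Printf("exit %d -> signal %d\n", code, signalFromExitCode(code))
	}
}
```

If the code you observe decodes to 0, the process exited on its own terms rather than being killed outright; --restart always restarts the container in both cases.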

Break 2: Cause a Container Name Conflict

On a single Docker daemon (here, on your web VM), a container name is a unique label. Two containers can't share the same name, whether they are running or stopped. Scripts and playbooks that always use docker run --name webapp without cleaning up first hit this error constantly, and recognizing it saves time in real work.

Before you start: Ansible already created one container named webapp.
Stay on the web VM (for example still inside vagrant ssh web) so the commands below run where that container lives.

So now, try to start a second container and also call it webapp. The image is plain nginx here on purpose – the point is the name clash, not matching your site’s ports or volume mounts.

sudo docker run -d --name webapp nginx

What actually happens here is that Docker doesn't create a second container. It returns an error immediately. Your original webapp is unchanged.

This is because the name webapp is already registered to the existing container (the error shows that container’s ID). Docker refuses to reuse the name until the old container is removed or renamed.

Example error (your ID will differ):

docker: Error response from daemon: Conflict. The container name "/webapp" is already in use by container "2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.

To fix it, free the name, then create webapp again the same way the playbook does (publish port 80, mount your HTML, restart policy):

sudo docker rm -f webapp
sudo docker run -d --name webapp --restart always -p 80:80 \
  -v /var/www/html:/usr/share/nginx/html:ro nginx

After that, your site should behave as before (refresh http://localhost:8080 from your laptop).

What to observe:

Read Docker’s Conflict message end to end. You should see that the name /webapp is already in use, along with a container ID pointing at the existing container. In production, that pattern means something has already claimed this name: remove it, rename it, or pick a different name before you run docker run again.

Break 3: Make Ansible Fail to Reach a VM

Ansible separates “could not connect” from “connected, but a task broke.” The first is UNREACHABLE, the second is FAILED. Knowing which one you have tells you whether to fix network / SSH or playbook / packages / permissions.

On your laptop, in the project folder, edit inventory and change the web server address from 192.168.33.10 to an IP no VM uses, for example 192.168.33.99. Save the file.

[webservers]
192.168.33.99 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

What you run (from the same project folder on the host):

ansible-playbook -i inventory playbook.yml

After this, Ansible tries to SSH to 192.168.33.99. Nothing on your lab network answers as that host (or SSH never succeeds), so Ansible never runs tasks on the web server. It stops that host with UNREACHABLE:

fatal: [192.168.33.99]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh"}

This is realistic because the same message shape appears when the IP is wrong, the VM isn't running, a firewall blocks port 22, or the network is misconfigured. The common thread is no working SSH session.

Now it's time to put it back: restore 192.168.33.10 in inventory and run ansible-playbook -i inventory playbook.yml again. The web play should reach the VM and complete (assuming your lab is up).

UNREACHABLE vs FAILED – what to observe:

If Ansible prints UNREACHABLE, you should assume it never opened SSH on that host and never ran tasks there. Go ahead and fix the connection (IP, VM up, firewall, key path) before you debug playbook logic.
If Ansible prints FAILED, you should assume SSH worked and a task returned an error. Read the task output for the real cause (package name, permissions, syntax), not the network first.

When you debug later, you should look at the keyword Ansible prints: UNREACHABLE points to reachability while FAILED points to task output and the first failed task under that host.
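The keyword check described above can be sketched as a small triage helper. This is only an illustration: the function name classify and its return labels are invented here, the first sample line is the one shown earlier in this break, and the second is a made-up example of a task failure.

```go
package main

import (
	"fmt"
	"strings"
)

// classify reports which kind of problem an Ansible "fatal:" line points to.
// The labels "connection", "task", and "other" are hypothetical names.
func classify(line string) string {
	switch {
	case strings.Contains(line, "UNREACHABLE!"):
		return "connection" // fix IP, VM state, firewall, or key path first
	case strings.Contains(line, "FAILED!"):
		return "task" // SSH worked; read the task output for the cause
	default:
		return "other"
	}
}

func main() {
	lines := []string{
		`fatal: [192.168.33.99]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh"}`,
		`fatal: [192.168.33.10]: FAILED! => {"msg": "illustrative task error"}`, // made-up sample
	}
	for _, l := range lines {
		fmt.Println(classify(l))
	}
}
```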

Break 4: Fill the VM's Disk

Databases and other services need free disk for logs, temp files, and data. When the filesystem is full or nearly full, a service may fail to start or fail at runtime. This break walks through the same diagnosis habit you would use on a real server: check space, then read systemd and journal output for the service.

All commands below run on the db VM after vagrant ssh db. MariaDB was installed there by your playbook.

What you do:

Open a shell on the db VM:
```
vagrant ssh db
```
Allocate a large file full of zeros (here 1GB) to simulate something eating disk space:
```
sudo dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

df -h
```
Use df -h to see how full the root filesystem (or relevant mount) is. Your Vagrant disk may be large enough that 1GB only raises usage. If MariaDB still starts, you still practiced the checks. To see a stronger effect, you can repeat with a larger count= only in a lab (never fill production disks on purpose without a plan).
Ask systemd to restart MariaDB and show status:
```
sudo systemctl restart mariadb
sudo systemctl status mariadb
```
If the disk is critically full, restart may fail or the service may show failed or not running.
If something looks wrong, read recent logs for the MariaDB unit:
```
sudo journalctl -u mariadb --no-pager | tail -20
```
Errors often mention disk, space, read-only filesystem, or InnoDB being unable to write.
Clean up so your VM stays usable:
```
sudo rm /tmp/bigfile
```
Optionally run sudo systemctl restart mariadb again and confirm it is active (running).

What to observe:

You should use df -h first to confirm whether the filesystem is actually tight. That avoids blaming the database when disk space is fine.
You should read systemctl status mariadb to see whether systemd thinks the service is active, failed, or flapping.
You should read journalctl -u mariadb when status is bad, so you can tie the failure to concrete errors from MariaDB or the kernel (often mentioning disk, space, or read-only filesystem). Space + status + logs is the same order you would use on a production server.
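If you wanted to automate the first check, one possible approach is to read the Use% column from a df -h row. The sketch below does that for a single line of text. The helper name usePercent, the sample row, and the 90% threshold are all assumptions made for this example, not values from the lab.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// usePercent extracts the usage percentage from one data row of `df -h`.
// It returns an error for rows without a numeric percentage, such as the header.
func usePercent(dfLine string) (int, error) {
	for _, field := range strings.Fields(dfLine) {
		if !strings.HasSuffix(field, "%") {
			continue
		}
		if n, err := strconv.Atoi(strings.TrimSuffix(field, "%")); err == nil {
			return n, nil
		}
	}
	return 0, fmt.Errorf("no usage percentage found in %q", dfLine)
}

func main() {
	line := "/dev/sda1        39G   37G  2.0G  95% /" // made-up sample row
	pct, err := usePercent(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// 90 is an arbitrary threshold chosen for illustration.
	fmt.Printf("root filesystem is %d%% full, tight=%v\n", pct, pct >= 90)
}
```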

Break 5: Run Minikube Out of Resources

Kubernetes schedules pods onto nodes that have enough CPU and memory. If you ask for more than the cluster can place, some pods stay Pending and Events explain why (for example Insufficient cpu). That is not the same as a pod that starts and then crashes.

To do this, you'll need a local cluster (we're using Minikube in this guide) and kubectl on your laptop. This break doesn't use the Vagrant VMs. If you haven't installed Minikube yet, complete the "How to Set Up Kubernetes" section first, or skip this break until you do.

You'll run this on your Mac, Linux, or Windows terminal (host), not inside vagrant ssh. If you're still inside a VM, type exit until your prompt is back on the host.

What you do:

Check Minikube:
```
minikube status
```
If it's stopped, start it (Docker driver matches earlier sections):
```
minikube start --driver=docker
```
Create a deployment with many replicas so your single Minikube node can't run them all at once:
```
kubectl create deployment stress --image=nginx --replicas=20

# Watch pods start
kubectl get pods -w
```
Press Ctrl+C when you're done watching. Some pods may stay Pending while others are Running.
Pick one Pending pod name from kubectl get pods and inspect it:
```
kubectl describe pod <pod-name>
```
Under Events, look for FailedScheduling and a line similar to:
```
Warning  FailedScheduling  0/1 nodes are available: 1 Insufficient cpu.
```
You might see Insufficient memory instead, depending on your machine.
Fix the lab by scaling back so the cluster can catch up:
```
kubectl scale deployment stress --replicas=2
```
You can delete the deployment entirely when finished: kubectl delete deployment stress.

What to observe:

You should see Pending pods stay unscheduled until capacity frees up. That means the scheduler hasn't placed them on any node yet, usually because the node is out of CPU or memory for that workload.
You should read kubectl describe pod and scroll to Events. Messages like Insufficient cpu or Insufficient memory mean the cluster ran out of schedulable capacity, not that the container image is corrupt.
You should contrast that with a pod that reaches Running and then CrashLoopBackOff, which usually means the process inside the container keeps exiting. That is an application or config problem, not a “nowhere to run” problem.
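The distinction in these three checks can be written down as a small decision function. This is a sketch for illustration only: diagnose and its category names are invented here, and the inputs are assumed to be the STATUS column from kubectl get pods and one message from the Events section.

```go
package main

import (
	"fmt"
	"strings"
)

// diagnose maps a pod status and one Events message to the kind of problem
// this section distinguishes. The category names are hypothetical.
func diagnose(status, event string) string {
	switch {
	case status == "Pending" && strings.Contains(event, "Insufficient"):
		return "capacity" // nowhere to run: scale down or add resources
	case status == "CrashLoopBackOff":
		return "application" // the process keeps exiting: check logs and config
	case status == "Running":
		return "healthy"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(diagnose("Pending", "0/1 nodes are available: 1 Insufficient cpu."))
	fmt.Println(diagnose("CrashLoopBackOff", ""))
}
```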

What You Can Now Do

You didn't just install tools in this tutorial. You also used them.

You can now spin up two servers from a single file. You can write a playbook that installs software and deploys a container without touching either machine manually.

You can serve a page you wrote from inside a Docker container running on a Vagrant VM, and bring the whole thing back from scratch in one command.

You also broke it. You saw what a container conflict looks like, what Ansible prints when it can't reach a machine, what disk pressure does to a running service, and what a Kubernetes scheduler says when it runs out of resources. Those error messages aren't unfamiliar anymore.

That's the difference between someone who has read about DevOps and someone who has run it.

Here are four free projects you can run in this same lab to go further:

DevOps Home-Lab 2026 — Build a multi-service app (frontend, API, PostgreSQL, Redis) end-to-end with Docker Compose, Kubernetes, Prometheus/Grafana monitoring, GitOps with ArgoCD, and Cloudflare for global exposure.
KubeLab — Trigger real Kubernetes failure scenarios, pod crashes, OOMKills, node drains, cascading failures, and watch how the cluster responds using live metrics.
K8s Secrets Lab — Build a full secret management pipeline from AWS Secrets Manager into your cluster, including rotation behavior and IRSA.
DevOps Troubleshooting Toolkit — Structured debugging guides across Linux, containers, Kubernetes, cloud, databases, and observability with copy-paste commands for real incidents.

All free and open source: github.com/Osomudeya/List-Of-DevOps-Projects.

If you want to go deeper, you can find six full chapters covering Terraform, Ansible, monitoring, CI/CD, and a simulated three-VM production environment at Build Your Own DevOps Lab.

How to Build and Deploy Multi-Architecture Docker Apps on Google Cloud Using ARM Nodes (Without QEMU)

Amina Lawal — Mon, 13 Apr 2026 13:42:27 +0000

If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening inside cloud data centers.

Google Cloud Axion is Google's own custom ARM-based chip, built to handle the demands of modern cloud workloads. The performance and cost numbers are striking: Google claims Axion delivers up to 60% better energy efficiency and up to 65% better price-performance compared to comparable x86 machines.

AWS has Graviton. Azure has Cobalt. ARM is no longer niche. It's the direction the entire cloud industry is moving.

But there's a problem that catches almost every team off guard when they start this transition: container architecture mismatch.

If you build a Docker image on your M-series Mac and push it to an x86 server, it crashes on startup with a cryptic exec format error.

The server isn't broken. It just can't read the compiled instructions inside your image. An ARM binary and an x86 binary are written in fundamentally different languages at the machine level. The CPU literally can't execute instructions it wasn't designed for.

We're going to solve this problem completely in this tutorial. You'll build a single Docker image tag that automatically serves the correct binary on both ARM and x86 machines — no separate pipelines, no separate tags. Then you'll provision Google Cloud ARM nodes in GKE and configure your Kubernetes deployment to route workloads precisely to those cost-efficient nodes.

Here's what you'll build, step by step:

A Go HTTP server that reports the CPU architecture it's running on at runtime
A multi-stage Dockerfile that cross-compiles for both linux/amd64 and linux/arm64 without slow QEMU emulation
A multi-arch image in Google Artifact Registry that acts as a single entry point for any architecture
A GKE cluster with two node pools: a standard x86 pool and an ARM Axion pool
A Kubernetes Deployment that pins your workload exclusively to the ARM nodes

By the end, you'll hit a live endpoint and see the word arm64 staring back at you from a Google Cloud ARM node. Let's get into it.

Prerequisites
Step 1: Set Up Your Google Cloud Project
Step 2: Create the GKE Cluster
Step 3: Write the Application
Step 4: Enable Multi-Arch Builds with Docker Buildx
Step 5: Write the Dockerfile
Step 6: Build and Push the Multi-Arch Image
Step 7: Add the Axion ARM Node Pool
Step 8: Deploy the App to the ARM Node Pool
Step 9: Verify the Deployment
Step 10: Cost Savings and Tradeoffs
Cleanup
Conclusion
Project File Structure

Prerequisites

Before you start, make sure you have the following ready:

A Google Cloud project with billing enabled. If you don't have one, create it at console.cloud.google.com. The total cost to follow this tutorial is around $5–10.
gcloud CLI installed and authenticated. Run gcloud auth login to sign in and gcloud config set project YOUR_PROJECT_ID to point it at your project.
A recent Docker Desktop release (or Docker Engine 19.03 or later on Linux). Docker Buildx (the tool we'll use for multi-arch builds) ships bundled with Docker Desktop.
kubectl installed. This is the CLI for interacting with Kubernetes clusters.
Basic familiarity with Docker (images, layers, Dockerfile) and Kubernetes (pods, deployments, services). You don't need to be an expert, but you should know what these things are.

Step 1: Set Up Your Google Cloud Project

Before writing a single line of application code, let's get the cloud infrastructure side ready. This is the foundation everything else will build on.

Enable the Required APIs

Google Cloud services are off by default in any new project. Run this command to turn on the three APIs we'll need:

gcloud services enable \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  containeranalysis.googleapis.com

Here's what each one does:

artifactregistry.googleapis.com — enables Artifact Registry, where we'll store our Docker images
container.googleapis.com — enables Google Kubernetes Engine (GKE), where our cluster will run
containeranalysis.googleapis.com — enables vulnerability scanning for images stored in Artifact Registry

Create a Docker Repository in Artifact Registry

Artifact Registry is Google Cloud's managed container image store — the place where our built images will live before being deployed to the cluster. Create a dedicated repository for this tutorial:

gcloud artifacts repositories create multi-arch-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Multi-arch tutorial images"

Breaking down the flags:

--repository-format=docker — tells Artifact Registry this repository stores Docker images (as opposed to npm packages, Maven artifacts, and so on)
--location=us-central1 — the Google Cloud region where your images will be stored. Use a region that's close to where your cluster will run to minimize image pull latency. Run gcloud artifacts locations list to see all options.
--description — a human-readable label for the repository, shown in the console.

Authenticate Docker to Push to Artifact Registry

Docker needs credentials before it can push images to Google Cloud. Run this command to wire up authentication automatically:

gcloud auth configure-docker us-central1-docker.pkg.dev

This adds a credential helper entry to your ~/.docker/config.json file. What that means in practice: any time Docker tries to push or pull from a URL under us-central1-docker.pkg.dev, it will automatically call gcloud to get a valid auth token. You won't need to run docker login manually.

Step 2: Create the GKE Cluster

With Artifact Registry ready to receive images, let's create the Kubernetes cluster. We'll start with a standard cluster using x86 nodes and add an ARM node pool later once we have an image to deploy.

gcloud container clusters create axion-tutorial-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-standard-2 \
  --workload-pool=PROJECT_ID.svc.id.goog

Replace PROJECT_ID with your actual Google Cloud project ID.

What each flag does:

--zone=us-central1-a — creates a zonal cluster in a single availability zone. A regional cluster (using --region) would spread nodes across three zones for higher resilience, but for this tutorial a single zone keeps things simple and avoids capacity issues that can affect specific zones. If us-central1-a is unavailable, try us-central1-b.
--num-nodes=2 — two x86 nodes in this zone. We need at least 2 to have enough capacity alongside our ARM node pool later.
--machine-type=e2-standard-2 — the machine type for this default node pool. e2-standard-2 is a cost-effective x86 machine with 2 vCPUs and 8 GB of memory, good for general workloads.
--workload-pool=PROJECT_ID.svc.id.goog — enables Workload Identity, which is Google's recommended way for pods to authenticate with Google Cloud APIs. It avoids the need to download and store service account key files inside your cluster.

This command takes a few minutes. While it runs, you can move on to writing the application. We'll come back to the cluster in Step 7.

Step 3: Write the Application

We need an application to containerize. We'll use Go for three specific reasons:

Go compiles into a single, statically-linked binary. There's no runtime to install, no interpreter — just the binary. This makes for extremely lean container images.
Go has first-class, built-in cross-compilation support. We can compile an ARM64 binary from an x86 Mac, or vice versa, by setting two environment variables. This will matter a lot when we get to the Dockerfile.
Go exposes the architecture the binary was compiled for via runtime.GOARCH. Our server will report this at runtime, giving us hard proof that the correct binary is running on the correct hardware.

Start by creating the project directories:

mkdir -p hello-axion/app hello-axion/k8s
cd hello-axion/app

Initialize the Go module from inside app/. This creates go.mod in the current directory:

go mod init hello-axion

go mod init is Go's built-in command for starting a new module. It writes a go.mod file that declares the module name (hello-axion) and the minimum Go version required. Every modern Go project needs this file — without it, the compiler doesn't know how to resolve packages.

Now create the application at app/main.go:

package main

import (
    "fmt"
    "net/http"
    "os"
    "runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
    hostname, _ := os.Hostname()
    fmt.Fprintf(w, "Hello from freeCodeCamp!\n")
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
    fmt.Fprintf(w, "OS           : %s\n", runtime.GOOS)
    fmt.Fprintf(w, "Pod hostname : %s\n", hostname)
}

func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/healthz", healthz)
    fmt.Println("Server starting on port 8080...")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        fmt.Fprintf(os.Stderr, "server error: %v\n", err)
        os.Exit(1)
    }
}

Verify both files were created:

ls -la

You should see go.mod and main.go listed.

Let's walk through what this code does:

import "runtime" — imports Go's built-in runtime package, which exposes information about the Go runtime environment, including the CPU architecture.
runtime.GOARCH — returns a string like "arm64" or "amd64" representing the architecture this binary was compiled for. When we deploy to an ARM node, this value will be arm64. This is the core of our proof.
os.Hostname() — returns the pod's hostname, which Kubernetes sets to the pod name. This lets us see which specific pod responded when we test the app later.
handler — the main HTTP handler, registered on the root path /. It writes the architecture, OS, and hostname to the response.
healthz — a separate handler registered on /healthz. It returns HTTP 200 with the text ok. Kubernetes will use this endpoint to check whether the container is alive and ready to serve traffic — we'll wire this up in the deployment manifest later.
http.ListenAndServe(":8080", nil) — starts the server on port 8080. If it fails to start (for example, if the port is already in use), it prints the error and exits with a non-zero code so Kubernetes knows something went wrong.
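Before containerizing anything, you can check a handler in-process with Go's net/http/httptest package. The sketch below repeats the healthz handler from main.go and calls it through a recorder. The probe helper is a name introduced for this example and is not part of the tutorial's application.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthz is copied from main.go so this example is self-contained.
func healthz(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}

// probe (hypothetical helper) calls a handler without opening a network port
// and returns the status code and response body it produced.
func probe(h http.HandlerFunc, path string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(healthz, "/healthz")
	fmt.Printf("%d %q\n", code, body) // prints: 200 "ok\n"
}
```

The same probe helper would work for the root handler, where the Architecture line should match the machine you run it on.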

Step 4: Enable Multi-Arch Builds with Docker Buildx

Before we write the Dockerfile, we need to understand a fundamental constraint, because it directly shapes how the Dockerfile must be written.

Why Your Docker Images Are Architecture-Specific By Default

A CPU only understands instructions written for its specific Instruction Set Architecture (ISA). ARM64 and x86_64 are different ISAs — different vocabularies of machine-level operations. When you compile a Go program, the compiler translates your source code into binary instructions for exactly one ISA. That binary can't run on a different ISA.

When you build a Docker image the normal way (docker build), the binary inside that image is compiled for your local machine's ISA. If you're on an Apple Silicon Mac, you get an ARM64 binary. Push that image to an x86 server, and when Docker tries to execute the binary, the kernel rejects it:

standard_init_linux.go:228: exec user process caused: exec format error

That's the operating system saying: "This binary was written for a different processor. I have no idea what to do with it."

The Solution: A Single Image Tag That Serves Any Architecture

Docker solves this with a structure called a Manifest List (also called a multi-arch image index). Instead of one image, a Manifest List is a pointer table. It holds multiple image references — one per architecture — all under the same tag.

When a server pulls hello-axion:v1, here's what actually happens:

Docker contacts the registry and requests the manifest for hello-axion:v1
The registry returns the Manifest List, which looks like this internally:

{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}

Docker checks the current machine's architecture, finds the matching entry, and pulls only that image's layers. The x86 image never downloads onto your ARM server, and vice versa.

One tag, two actual images. Completely transparent to your deployment manifests.
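The lookup described above is simple enough to sketch in a few lines of Go. This is not Docker's actual implementation, only an illustration of the selection step. The struct types and the selectDigest function are hypothetical names, and the JSON is the simplified example shown earlier rather than a complete manifest list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror only the fields shown in the simplified JSON above.
type platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}

type manifestEntry struct {
	Digest   string   `json:"digest"`
	Platform platform `json:"platform"`
}

type manifestList struct {
	Manifests []manifestEntry `json:"manifests"`
}

// sample is the example manifest list from the text, digests still elided.
const sample = `{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}`

// selectDigest returns the digest of the first entry whose platform
// matches the requested OS and architecture.
func selectDigest(raw []byte, goos, goarch string) (string, error) {
	var list manifestList
	if err := json.Unmarshal(raw, &list); err != nil {
		return "", err
	}
	for _, m := range list.Manifests {
		if m.Platform.OS == goos && m.Platform.Architecture == goarch {
			return m.Digest, nil
		}
	}
	return "", fmt.Errorf("no manifest for %s/%s", goos, goarch)
}

func main() {
	digest, err := selectDigest([]byte(sample), "linux", "arm64")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("an arm64 node would pull:", digest)
}
```

The error branch matters in practice: a node whose platform has no entry in the list can't pull the image at all, which is how a missing architecture usually surfaces.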

Set Up Docker Buildx

Docker Buildx is the CLI tool that builds these Manifest Lists. It's powered by the BuildKit engine and ships bundled with Docker Desktop. Run the following to create and activate a new builder instance:

docker buildx create --name multiarch-builder --use

--name multiarch-builder — gives this builder a memorable name. You can have multiple builders. This command creates a new one named multiarch-builder.
--use — immediately sets this new builder as the active one, so all future docker buildx build commands use it.

Now boot the builder and confirm it supports the platforms we need:

docker buildx inspect --bootstrap

--bootstrap — starts the builder container if it isn't already running, and prints its full configuration.

You should see output like this:

Name:          multiarch-builder
Driver:        docker-container
Platforms:     linux/amd64, linux/arm64, linux/arm/v7, linux/386, ...

The Platforms line lists every architecture this builder can produce images for. As long as you see linux/amd64 and linux/arm64 in that list, you're ready to build for both x86 and ARM.

Step 5: Write the Dockerfile

Now we can write the Dockerfile. We'll use two techniques together: a multi-stage build to keep the final image tiny, and a cross-compilation trick to avoid slow CPU emulation.

Create app/Dockerfile with the following content:

# -----------------------------------------------------------
# Stage 1: Build
# -----------------------------------------------------------
# $BUILDPLATFORM = the machine running this build (your laptop)
# $TARGETOS / $TARGETARCH = the platform we are building FOR
# -----------------------------------------------------------
FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /app

COPY go.mod .
RUN go mod download

COPY main.go .

RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go

# -----------------------------------------------------------
# Stage 2: Runtime
# -----------------------------------------------------------

FROM alpine:latest

RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser

WORKDIR /app
COPY --from=builder /app/server .

EXPOSE 8080
CMD ["./server"]

There's a lot happening here. Let's go through it carefully.

Stage 1: The Builder

FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

This is the most important line in the file. $BUILDPLATFORM is a special build argument that Docker Buildx automatically injects — it equals the platform of the machine running the build (your laptop). By pinning the builder stage to $BUILDPLATFORM, the Go compiler always runs natively on your machine, not inside a CPU emulator. This is what makes multi-arch builds fast.

Without --platform=$BUILDPLATFORM, Buildx would have to use QEMU — a full CPU emulator — to run an ARM64 build environment on your x86 machine (or vice versa). QEMU works, but it's typically 5–10 times slower than native execution. For a project with many dependencies, that's the difference between a 2-minute build and a 20-minute build.

ARG TARGETOS and ARG TARGETARCH

These two lines declare that our Dockerfile expects build arguments named TARGETOS and TARGETARCH. Buildx injects these automatically based on the --platform flag you pass at build time. For a linux/arm64 target, TARGETOS will be linux and TARGETARCH will be arm64.

COPY go.mod . and RUN go mod download

We copy go.mod first, before copying the rest of the source code. Docker builds images layer by layer and caches each layer. By copying only the module file first, we create a cached layer for go mod download.

On future builds, as long as go.mod hasn't changed, Docker skips the download step entirely — even if the source code changed. This speeds up iterative development significantly.

RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go

This is the cross-compilation step. GOOS and GOARCH are Go's built-in cross-compilation environment variables. Setting them tells the Go compiler to produce a binary for a different target than the machine it's running on. We set them from the $TARGETOS and $TARGETARCH build args injected by Buildx.

The -ldflags="-w -s" flag strips the debug symbol table and the DWARF debugging information from the binary. This has no effect on runtime behavior but reduces the binary size by roughly 30%.
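To make the mapping between Docker platform strings and Go's cross-compilation variables concrete, here is a minimal sketch. The goEnvForPlatform function is a hypothetical helper, not part of Buildx or the Go toolchain. It also handles the 32-bit ARM case (linux/arm/v7), where the variant maps to GOARM, even though this tutorial only builds for amd64 and arm64.

```go
package main

import (
	"fmt"
	"strings"
)

// goEnvForPlatform converts a Docker platform string such as "linux/arm64"
// into the environment variables the Go compiler reads when cross-compiling.
func goEnvForPlatform(p string) ([]string, error) {
	parts := strings.Split(p, "/")
	if len(parts) < 2 || len(parts) > 3 || parts[0] == "" || parts[1] == "" {
		return nil, fmt.Errorf("invalid platform %q", p)
	}
	env := []string{"GOOS=" + parts[0], "GOARCH=" + parts[1]}
	// 32-bit ARM carries a variant (v6, v7) that Go expects in GOARM.
	if len(parts) == 3 && parts[1] == "arm" {
		env = append(env, "GOARM="+strings.TrimPrefix(parts[2], "v"))
	}
	return env, nil
}

func main() {
	for _, p := range []string{"linux/amd64", "linux/arm64", "linux/arm/v7"} {
		env, err := goEnvForPlatform(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-14s %s\n", p, strings.Join(env, " "))
	}
}
```

In the Dockerfile, Buildx does this split for you: the values arrive already separated as TARGETOS and TARGETARCH.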

Stage 2: The Runtime Image

FROM alpine:latest

This starts a brand-new image from Alpine Linux — a minimal Linux distribution that weighs about 5 MB. Critically, alpine:latest is itself a multi-arch image, so Docker automatically selects the arm64 or amd64 Alpine variant depending on which platform this stage is built for.

Everything from Stage 1 — the Go toolchain, the source files, the intermediate object files — is discarded. The final image contains only Alpine Linux plus our binary. Compared to a naive single-stage Go image (~300 MB), this approach produces an image under 15 MB.

RUN addgroup -S appgroup && adduser -S appuser -G appgroup and USER appuser

These two lines create a non-root user and set it as the active user for the container. Running containers as root is a security risk — if an attacker exploits a vulnerability in your application, they gain root access inside the container. Running as a non-root user limits the blast radius.

COPY --from=builder /app/server .

This is how multi-stage builds work: the --from=builder flag tells Docker to copy files from the builder stage (Stage 1), not from your local disk. Only the compiled binary (server) makes it into the final image.

Step 6: Build and Push the Multi-Arch Image

With the application and Dockerfile in place, we can now build images for both architectures and push them to Artifact Registry — all in a single command.

From inside the app/ directory, run:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 \
  --push \
  .

Replace PROJECT_ID with your actual GCP project ID.

Here's what each part of this command does:

docker buildx build — uses the Buildx CLI instead of the standard docker build. Buildx is required for multi-platform builds.
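--platform linux/amd64,linux/arm64 — instructs Buildx to build the image twice: once targeting x86 Intel/AMD machines, and once targeting ARM64. Both builds run in parallel. Because our Dockerfile uses the $BUILDPLATFORM cross-compilation trick, the Go compile step for both targets runs natively on your machine. Only the short addgroup/adduser RUN step in the runtime stage executes for the target platform, so for the non-native architecture that one step still goes through QEMU emulation.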
-t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 — the full image path in Artifact Registry. The format is always REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG.
--push — multi-arch images can't be loaded into your local Docker daemon (which only understands single-architecture images). This flag tells Buildx to skip local storage and push the completed Manifest List — with both architecture variants — directly to the registry.
. — the build context, the directory Docker scans for the Dockerfile and any files the build needs.
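If you script your builds, it helps to assemble the image path from its parts rather than hand-editing a long string. The sketch below follows the REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG format described above. The artifactRegistryRef function is a hypothetical helper, and PROJECT_ID is left as the same placeholder used in the command.

```go
package main

import "fmt"

// artifactRegistryRef assembles a full Artifact Registry image reference
// in the form REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG.
func artifactRegistryRef(region, project, repo, image, tag string) string {
	return fmt.Sprintf("%s-docker.pkg.dev/%s/%s/%s:%s", region, project, repo, image, tag)
}

func main() {
	// PROJECT_ID is a placeholder; substitute your own project ID.
	ref := artifactRegistryRef("us-central1", "PROJECT_ID", "multi-arch-repo", "hello-axion", "v1")
	fmt.Println(ref)
}
```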

Watch the output as the build runs. You'll see BuildKit working on both platforms simultaneously:

 => [linux/amd64 builder 1/5] FROM golang:1.23-alpine
 => [linux/arm64 builder 1/5] FROM golang:1.23-alpine
 ...
 => pushing manifest for us-central1-docker.pkg.dev/.../hello-axion:v1

Verify the Multi-Arch Image in Artifact Registry

Once the push completes, navigate to GCP Console → Artifact Registry → Repositories → multi-arch-repo and click on hello-axion.

You won't see a single image — you'll see something labelled "Image Index". That's the Manifest List we created. Click into it, and you'll find two child images with separate digests, one for linux/amd64 and one for linux/arm64.

You can also inspect this from the command line:

docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1

The output lists every manifest inside the image index. You'll see entries for linux/amd64 and linux/arm64 — those are our two real images. You'll also see two entries with Platform: unknown/unknown labelled as attestation-manifest. These are build provenance records that Docker Buildx automatically attaches to prove how and where the image was built (a supply chain security feature called SLSA attestation).

The two entries you care about are linux/amd64 and linux/arm64. Note the top-level digest of the image index and the digest of the arm64 entry: we'll come back to both in the verification step.

Step 7: Add the Axion ARM Node Pool

We have a universal image. Now we need somewhere to run it.

Recall the cluster we created in Step 2 — it's running e2-standard-2 x86 machines. We're going to add a second node pool running ARM machines. This is the key architectural move: a mixed-architecture cluster where different workloads can be routed to different hardware.

Choosing Your ARM Machine Type

Google Cloud currently offers two ARM-based machine series in GKE:

Series	Example type	What it is
Tau T2A	`t2a-standard-2`	First-gen Google ARM (Ampere Altra). Broadly available across regions. Great for getting started.
Axion (C4A)	`c4a-standard-2`	Google's custom ARM chip (Arm Neoverse V2 core). Newest generation, best price-performance. Still expanding availability.

This tutorial uses t2a-standard-2 because it's widely available. The commands are identical for c4a-standard-2 — just swap the --machine-type value. If t2a-standard-2 isn't available in your zone, GKE will tell you immediately when you run the node pool creation command below, and you can try a neighbouring zone.

Create the ARM Node Pool

Add the ARM node pool to your existing cluster:

gcloud container node-pools create axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --num-nodes=2 \
  --node-labels=workload-type=arm-optimized

What each flag does:

--cluster=axion-tutorial-cluster — the name of the cluster we created in Step 2. Node pools are always added to an existing cluster.
--zone=us-central1-a — must match the zone you used when creating the cluster.
--machine-type=t2a-standard-2 — GKE detects this is an ARM machine type and automatically provisions the nodes with an ARM-compatible version of Container-Optimized OS (COS). You don't need to configure anything special for ARM at the OS level.
--num-nodes=2 — two ARM nodes in the zone, enough to schedule our 3-replica deployment alongside other cluster overhead.
--node-labels=workload-type=arm-optimized — attaches a custom label to every node in this pool. The deployment manifest in Step 8 selects nodes by the automatic kubernetes.io/arch=arm64 label, so this custom label isn't strictly required for the tutorial. Adding a descriptive custom label is still good practice in real clusters: it communicates the intent of the pool, not just its hardware, and gives you a second, more specific key to select on later.

This command takes a few minutes. Once it completes, let's confirm our cluster now has both node pools:

gcloud container clusters get-credentials axion-tutorial-cluster --zone=us-central1-a

kubectl get nodes --label-columns=kubernetes.io/arch

The get-credentials command configures kubectl to authenticate with your new cluster. The get nodes command then lists all nodes and adds a column showing the kubernetes.io/arch label.

You should see something like:

NAME                                    STATUS   ARCH    AGE
gke-...default-pool-abc...              Ready    amd64   15m
gke-...default-pool-def...              Ready    amd64   15m
gke-...axion-pool-jkl...                Ready    arm64   3m
gke-...axion-pool-mno...                Ready    arm64   3m

amd64 for the default x86 pool, arm64 for our new Axion pool. This kubernetes.io/arch label is applied automatically by GKE — you don't set it, it's derived from the hardware.

Step 8: Deploy the App to the ARM Node Pool

We have a multi-arch image and a mixed-architecture cluster. Here's something important to understand before writing the deployment manifest: Kubernetes doesn't know or care about image architecture by default.

If you applied a standard Deployment right now, the scheduler would look for any available node with enough CPU and memory and place pods there — potentially landing on x86 nodes instead of your ARM Axion nodes. The multi-arch Manifest List would handle this gracefully (the right binary would run regardless), but you'd lose the cost efficiency you provisioned Axion nodes for in the first place.

To guarantee that pods land on ARM nodes and only ARM nodes, we use a nodeSelector.

How nodeSelector Works

A nodeSelector is a set of key-value pairs in your pod spec. Before the Kubernetes scheduler places a pod, it checks every available node's labels and skips any node that doesn't carry all the labels in the nodeSelector. If no node matches, the pod stays in the Pending state rather than landing on the wrong node.

This is a hard constraint, which is exactly what we want for cost optimization. Contrast this with Node Affinity's soft preference mode (preferredDuringSchedulingIgnoredDuringExecution), which says "try to use ARM, but fall back to x86 if needed." Soft preferences are useful for resilience, but they undermine the whole point of dedicated ARM pools. We want the hard constraint.
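The filtering rule is easy to express in code. The sketch below is a simplified model of the label check only, not the real scheduler (which also weighs resources, taints, and affinity). The node type, the node names, and the eligibleNodes function are made up for illustration.

```go
package main

import "fmt"

// node is a simplified stand-in for a Kubernetes node: a name plus labels.
type node struct {
	Name   string
	Labels map[string]string
}

// matchesSelector reports whether the labels contain every key-value pair
// in the selector. A missing key is a mismatch.
func matchesSelector(labels, selector map[string]string) bool {
	for key, want := range selector {
		got, ok := labels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// eligibleNodes returns the names of the nodes that pass the selector.
// An empty result models a pod that would stay Pending.
func eligibleNodes(nodes []node, selector map[string]string) []string {
	var names []string
	for _, n := range nodes {
		if matchesSelector(n.Labels, selector) {
			names = append(names, n.Name)
		}
	}
	return names
}

// clusterNodes is a hypothetical four-node cluster: two x86 and two ARM nodes.
var clusterNodes = []node{
	{Name: "default-pool-1", Labels: map[string]string{"kubernetes.io/arch": "amd64"}},
	{Name: "default-pool-2", Labels: map[string]string{"kubernetes.io/arch": "amd64"}},
	{Name: "axion-pool-1", Labels: map[string]string{"kubernetes.io/arch": "arm64", "workload-type": "arm-optimized"}},
	{Name: "axion-pool-2", Labels: map[string]string{"kubernetes.io/arch": "arm64", "workload-type": "arm-optimized"}},
}

var armSelector = map[string]string{"kubernetes.io/arch": "arm64"}

func main() {
	fmt.Println("eligible:", eligibleNodes(clusterNodes, armSelector))
}
```

Because every pair in the selector must match, adding a second key narrows the candidate set further and never widens it.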

Write the Deployment Manifest

Create k8s/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-axion
  labels:
    app: hello-axion
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-axion
  template:
    metadata:
      labels:
        app: hello-axion
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64

      containers:
      - name: hello-axion
        image: us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        resources:
          requests:
            cpu: "250m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"

Replace PROJECT_ID with your project ID. Here's what the key sections do:

replicas: 3 — tells Kubernetes to keep three instances of this pod running at all times. If one crashes or a node goes down, the scheduler spins up a replacement. With two ARM nodes in a single zone (us-central1-a), the three replicas share those two nodes, so this setup tolerates a pod crash or a single node failure but not a zone outage.

selector.matchLabels and template.metadata.labels — these two blocks must match. The selector tells the Deployment which pods it "owns," and the template.metadata.labels is what those pods will be tagged with. If they don't match, Kubernetes won't be able to manage the pods.

nodeSelector: kubernetes.io/arch: arm64 — this is the pin. The Kubernetes scheduler filters out every node that doesn't carry this label before considering resource availability. Since GKE automatically applies kubernetes.io/arch=arm64 to all ARM nodes, our pods will schedule only onto the axion-pool nodes.

livenessProbe — periodically calls GET /healthz. If this check fails a certain number of times in a row (indicating the container has deadlocked or is otherwise unresponsive), Kubernetes restarts the container. initialDelaySeconds: 5 gives the server 5 seconds to start up before the first check.

readinessProbe — similar to the liveness probe, but with a different purpose. While the readiness probe is failing, Kubernetes removes the pod from the service's load balancer, so no traffic is sent to it. This is important during startup — the pod won't receive traffic until it signals it's ready.

resources.requests — reserves 250m (25% of a CPU core) and 64Mi of memory on the node for this pod. The scheduler uses these numbers to decide whether a node has enough room for the pod. Setting requests is required for sensible bin-packing. Without them, nodes can be silently overcommitted.

resources.limits — caps the container at 500m CPU and 128Mi memory. If the container exceeds these limits, Kubernetes throttles the CPU or kills the container (for memory). This prevents a single misbehaving pod from starving other workloads on the same node.
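Requests translate directly into how many pods the scheduler will place on a node. Here is a rough sketch of that arithmetic. The allocatable figures are hypothetical (real values depend on the machine type and on what GKE reserves for the system, and kubectl describe node shows the actual numbers), and podsThatFit is an illustrative helper that ignores pods already running on the node.

```go
package main

import "fmt"

// podsThatFit returns how many pods with the given requests fit into a
// node's allocatable CPU (in millicores) and memory (in Mi). The tighter
// of the two resources decides.
func podsThatFit(allocCPUMilli, allocMemMi, reqCPUMilli, reqMemMi int) int {
	byCPU := allocCPUMilli / reqCPUMilli
	byMem := allocMemMi / reqMemMi
	if byCPU < byMem {
		return byCPU
	}
	return byMem
}

func main() {
	// Hypothetical allocatable capacity for a 2-vCPU node.
	const allocCPUMilli, allocMemMi = 1930, 5900
	// Requests from the manifest above: 250m CPU and 64Mi memory.
	n := podsThatFit(allocCPUMilli, allocMemMi, 250, 64)
	fmt.Printf("up to %d pods fit by requests alone\n", n)
}
```

With these assumed numbers CPU is the limiting resource, which is typical for small stateless services like this one.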

A Note on Taints and Tolerations

Once you're comfortable with nodeSelector, the next step in production clusters is adding a taint to your ARM node pool. A taint is a repellent — any pod without an explicit toleration for that taint is blocked from landing on the tainted node.

This means other workloads in your cluster can't accidentally consume your ARM capacity. You'd add the taint when creating the pool:

# Add --node-taints to the pool creation command:
--node-taints=workload-type=arm-optimized:NoSchedule

And a matching toleration in the pod spec:

tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "arm-optimized"
  effect: "NoSchedule"

We're not doing this in the tutorial to keep things simple, but it's the pattern production multi-tenant clusters use to enforce hard separation between workload types.
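The matching rule between a taint and a toleration can be sketched as follows. This is a simplified model of the documented Kubernetes semantics (an empty operator means Equal, an empty effect matches any effect, and Exists with an empty key matches every taint), not the scheduler's actual code. The type and function names are illustrative.

```go
package main

import "fmt"

type taint struct {
	Key, Value, Effect string
}

type toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether at least one toleration matches the taint.
func tolerates(t taint, tols []toleration) bool {
	for _, tol := range tols {
		// An empty effect on the toleration matches every effect.
		if tol.Effect != "" && tol.Effect != t.Effect {
			continue
		}
		switch tol.Operator {
		case "Exists":
			// Exists ignores the value; an empty key matches all taints.
			if tol.Key == "" || tol.Key == t.Key {
				return true
			}
		case "Equal", "":
			if tol.Key == t.Key && tol.Value == t.Value {
				return true
			}
		}
	}
	return false
}

// The taint and toleration from the example above.
var poolTaint = taint{Key: "workload-type", Value: "arm-optimized", Effect: "NoSchedule"}

var podTolerations = []toleration{
	{Key: "workload-type", Operator: "Equal", Value: "arm-optimized", Effect: "NoSchedule"},
}

func main() {
	fmt.Println("pod with toleration admitted:", tolerates(poolTaint, podTolerations))
	fmt.Println("pod without toleration admitted:", tolerates(poolTaint, nil))
}
```

Note that a toleration only permits scheduling onto the tainted nodes. It doesn't attract the pod to them, which is why the nodeSelector is still needed alongside it.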

Write the Service Manifest

We also need a Kubernetes Service to expose the pods over the network. Create k8s/service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: hello-axion-svc
spec:
  selector:
    app: hello-axion
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer

selector: app: hello-axion — the Service discovers pods using labels. Any pod with app: hello-axion on it will be added to this Service's load balancer pool.
port: 80 — the port the Service is reachable on from outside the cluster.
targetPort: 8080 — the port on the pod that traffic gets forwarded to. Our Go server listens on port 8080, so this must match.
type: LoadBalancer — tells GKE to provision an external Google Cloud load balancer and assign it a public IP. This is what makes the Service reachable from the internet.

Apply Both Manifests

kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml

kubectl apply reads each manifest file and creates or updates the resources described in it. If the resources don't exist yet, they're created. If they already exist, Kubernetes only applies the diff — it won't restart pods unnecessarily.

Watch the pods come up in real time:

kubectl get pods -w

The -w flag watches for changes and prints updates as they happen. You should see pods transition from Pending → ContainerCreating → Running. Once all three show Running, press Ctrl+C to stop watching.

Step 9: Verify the Deployment

Everything is running. Now we need evidence — not just that pods are up, but that they're on the right nodes and serving the right binary.

Confirm Pod Placement

kubectl get pods -o wide

The -o wide flag adds extra columns to the output, including the name of the node each pod was scheduled on. Look at the NODE column:

NAME                          READY   STATUS    NODE
hello-axion-7b8d9f-abc12      1/1     Running   gke-axion-tutorial-axion-pool-a-...
hello-axion-7b8d9f-def34      1/1     Running   gke-axion-tutorial-axion-pool-b-...
hello-axion-7b8d9f-ghi56      1/1     Running   gke-axion-tutorial-axion-pool-c-...

All three pods should show node names containing axion-pool. None should show default-pool. With only two nodes in the pool, at least two of the pods will share a node.

Confirm the Nodes Are ARM

Take one of those node names and verify its architecture label:

kubectl get node NODE_NAME --show-labels | grep kubernetes.io/arch

Replace NODE_NAME with one of the node names from the previous command. You should see:

kubernetes.io/arch=arm64

That's the automatic label GKE applied when it provisioned the ARM hardware. Our nodeSelector matched on this label to pin the pods here.

Ask the Application Itself

This is the most satisfying verification step. Our Go server reports the architecture of the binary that's running. Let's ask it directly.

Use kubectl port-forward to create a secure tunnel from port 8080 on your local machine to port 8080 on one of the Deployment's pods:

kubectl port-forward deployment/hello-axion 8080:8080

This command stays running in the foreground — open a second terminal window and run:

curl http://localhost:8080

You should see:

Hello from freeCodeCamp!
Architecture : arm64
OS           : linux
Pod hostname : hello-axion-7b8d9f-abc12

Architecture : arm64. That's our Go binary confirming that it was compiled for ARM64 and is executing on an ARM64 CPU. The single image tag we built does the right thing automatically.

The Bonus: See the Manifest List in Action

Want to see the multi-arch image indexing at work? Stop the port-forward, then run:

docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1

Replace PROJECT_ID with your actual Google Cloud project ID.

You'll see four entries in the manifest list. Two are real images — Platform: linux/amd64 and Platform: linux/arm64. The other two show Platform: unknown/unknown with an attestation-manifest annotation. These are build provenance records that Docker Buildx automatically attaches to every image — a supply chain security feature (SLSA attestation) that proves how and where the image was built.

You may notice that if you check the image digest recorded in a running pod:

kubectl get pod POD_NAME \
  -o jsonpath='{.status.containerStatuses[0].imageID}'

Replace POD_NAME with one of the pod names from earlier.

The digest returned matches the top-level manifest list digest, not the arm64-specific one. This is expected behaviour. Modern Kubernetes (using containerd) records the manifest list digest, not the resolved platform digest. The platform resolution already happened when the node pulled the correct image variant.

The definitive proof that the right binary is running is what you already have: the node labeled kubernetes.io/arch=arm64 and the application reporting Architecture: arm64.

Step 10: Cost Savings and Tradeoffs

The hands-on work is done. Let's talk about why any of this is worth the effort.

The Cost Math

At the time of writing, here's how ARM compares to equivalent x86 machines on Google Cloud (prices are approximate and change over time — check the official pricing page before making decisions):

Instance	vCPU	Memory	Approx. $/hour
`n2-standard-4` (x86)	4	16 GB	~$0.19
`t2a-standard-4` (Tau ARM)	4	16 GB	~$0.14
`c4a-standard-4` (Axion)	4	16 GB	~$0.15

Against n2-standard-4, that's a raw reduction in compute cost per node of roughly 26% for t2a-standard-4 and 21% for c4a-standard-4. Factor in Google's published claim of up to 65% better price-performance for Axion on relevant workloads (meaning you may need fewer nodes to handle the same traffic) and the savings compound further.

Here's how that looks at scale, for a service running 20 nodes continuously for a year:

20 × n2-standard-4 × $0.19 × 8,760 hours = $33,288/year
20 × t2a-standard-4 × $0.14 × 8,760 hours = $24,528/year

That's $8,760 saved annually on compute (about 26%), before any committed use discounts are applied.
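The same arithmetic is easy to rerun with your own node counts and current prices. The sketch below works in whole cents to avoid floating-point rounding. The hourly prices are the approximate figures from the table above, and annualCostCents is an illustrative helper.

```go
package main

import "fmt"

// hoursPerYear assumes nodes run continuously for a 365-day year.
const hoursPerYear = 24 * 365 // 8,760

// annualCostCents returns the yearly compute cost in cents for a number
// of nodes at a given hourly price in cents.
func annualCostCents(nodes, centsPerHour int) int {
	return nodes * centsPerHour * hoursPerYear
}

func main() {
	// Approximate prices from the table: $0.19/hour (x86), $0.14/hour (Tau ARM).
	x86 := annualCostCents(20, 19)
	arm := annualCostCents(20, 14)
	saved := x86 - arm

	fmt.Printf("x86:   $%d/year\n", x86/100)
	fmt.Printf("ARM:   $%d/year\n", arm/100)
	fmt.Printf("saved: $%d/year (%.1f%%)\n", saved/100, float64(saved)*100/float64(x86))
}
```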

When ARM Is the Right Choice

ARM works best for:

Stateless API servers and web applications — like the app we built. ARM excels at high-throughput, low-latency network workloads.
Background workers and queue processors — long-running services that don't depend on x86-specific binaries.
Microservices written in Go, Rust, or Python — these languages have excellent ARM64 support. Go and Rust cross-compile to ARM64 out of the box, and Python's interpreter and most popular packages ship ARM64 builds.

When to Proceed Carefully

Native library dependencies — some older C libraries, proprietary SDKs, or compiled ML model-serving runtimes don't have ARM64 builds. Always audit your dependency tree before migrating.
CI pipelines need ARM too — your automated tests should run on ARM, not just x86. An image that silently fails only on ARM is harder to debug than one that never claimed ARM support.
Profile before optimizing — the cost savings are real, but measure your actual workload behavior on ARM before committing. Not every workload benefits equally.

Cleanup

When you're done, clean up to avoid ongoing charges:

# Remove the Kubernetes resources from the cluster
kubectl delete -f k8s/

# Delete the ARM node pool
gcloud container node-pools delete axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the cluster itself
gcloud container clusters delete axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the images from Artifact Registry (optional — storage costs are minimal)
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1

Conclusion

Let's recap what you built and why each part matters.

You started with a Go application, a Dockerfile, and a docker buildx build command that produced two images — one for x86, one for ARM64 — wrapped in a single Manifest List tag. Any server that pulls that tag gets the right binary automatically, without you maintaining separate pipelines or separate tags.

You provisioned a GKE cluster with two node pools running different CPU architectures, then used nodeSelector to make sure your ARM-optimized workload lands only on the ARM Axion nodes — not on x86 by accident. The result is a deployment that's both architecture-correct and cost-efficient.

The patterns you practiced here don't stop at this demo. The same Dockerfile technique works for any language with cross-compilation support. The same nodeSelector approach works for any workload you want to pin to ARM. As more teams migrate services to ARM over the coming years, having these skills will be a real asset.

Where to go from here:

Add a GitHub Actions workflow that runs docker buildx build --platform linux/amd64,linux/arm64 on every push, automating this entire process in CI.
Audit one of your existing stateless services for ARM compatibility and try migrating it.
Explore Node Affinity as a softer alternative to nodeSelector for workloads that can run on either architecture but prefer ARM.
Look into GKE Autopilot, which now supports ARM nodes and handles node pool management automatically.

Happy building.

Project File Structure

hello-axion/
├── app/
│   ├── main.go          — Go HTTP server
│   ├── go.mod           — Go module definition
│   └── Dockerfile       — Multi-stage Dockerfile
└── k8s/
    ├── deployment.yaml  — Deployment with nodeSelector and probes
    └── service.yaml     — LoadBalancer Service

All source files for this tutorial are available in the companion GitHub repository: https://github.com/Amiynarh/multi-arch-docker-gke-arm

How to Self-Host Your Own Server Monitoring Dashboard Using Uptime Kuma and Docker

Abdul Talha — Mon, 06 Apr 2026 20:32:31 +0000

As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains.

And while many SaaS tools can monitor your site, they often charge high monthly fees for simple alerts.

My goal with this article is to help you stop paying those expensive fees by showing you a powerful, free, open-source alternative called Uptime Kuma.

In this guide, you'll learn how to use Docker to deploy Uptime Kuma safely on a local Ubuntu machine.

By the end of this tutorial, you'll have set up your own private server monitoring dashboard in less than 10 minutes and created an automated Discord alert to ping your phone if your website goes offline.

Prerequisites
Step 1: Update Packages and Prepare the Firewall
Step 2: Create the Docker Compose File
Step 3: Start the Application
Step 4: Access the Dashboard
Step 5: Use Case – Monitor a Website and Send Discord Alerts
Conclusion

Prerequisites

Before you start, make sure you have:

An Ubuntu machine (like a local server, VM, or desktop).
Docker and Docker Compose installed.
Basic knowledge of the Linux terminal.

Step 1: Update Packages and Prepare the Firewall

First, you'll want to make sure your system has the newest updates. Then, you'll install the Uncomplicated Firewall (UFW) and open the network "door" (port) that Uptime Kuma uses for the dashboard. You'll also need to allow SSH so you don't lock yourself out.

Run these commands in your terminal:

Update your packages:

sudo apt update && sudo apt upgrade -y

Install the firewall:

sudo apt install ufw -y

Allow SSH and open port 3001:

sudo ufw allow ssh
sudo ufw allow 3001/tcp

Enable the firewall:

sudo ufw enable
sudo ufw reload

Step 2: Create the Docker Compose File

Using a docker-compose.yml file is the professional way to manage Docker containers. It keeps your setup organised in one single place.

To start, create a new folder for your project and enter it:

mkdir uptime-kuma && cd uptime-kuma

Then create the configuration file:

nano docker-compose.yml

Paste the following code into the editor:

services:
  uptime-kuma:
    image: louislam/uptime-kuma:2
    restart: unless-stopped
    volumes:
      - ./data:/app/data
    ports:
      - "3001:3001"

Note: The ./data:/app/data line is very important. It saves your database in a normal folder on your machine, making it easy to back up later.

Finally, save and exit: Press CTRL + X, then Y, then Enter.

Step 3: Start the Application

Now, tell Docker to read your file and start the monitoring service in the background.

docker compose up -d

How to verify: Docker will download the image. When it finishes, your terminal should show the uptime-kuma container with the status Started.

Step 4: Access the Dashboard

To access the dashboard, first open your web browser and go to http://localhost:3001 (or your machine's local IP address).

When asked to choose the database, select SQLite. It's simple, fast, and requires no extra setup.

Then create an account and choose a secure admin username and password.

Step 5: Use Case – Monitor a Website and Send Discord Alerts

Now you'll put Uptime Kuma to work by monitoring a live website and setting up an alert. Just follow these steps:

Click Add New Monitor.
Set the Monitor Type to HTTP(s).
Give it a Friendly Name (e.g., "My Blog") and enter your website's URL.

Pro-Tip: How to Fix "Down" Errors (Bot Protection)

If your site uses strict security, it might block Uptime Kuma and say your site is "Down" with a 403 Forbidden error.

The Fix: Scroll down to Advanced, find the User Agent box, and paste this text to make Uptime Kuma look like a normal Chrome browser:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
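To see why the User Agent matters, here is a small Go sketch of what an HTTP monitor does on each check. It is not Uptime Kuma's code. The local test server stands in for a bot-protected site (its blocking rule is invented for the demo), and example-monitor/1.0 is a placeholder User Agent.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// browserUA is the Chrome User Agent string from the tip above.
const browserUA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

// checkStatus performs one GET request with the given User-Agent header
// and returns the HTTP status code.
func checkStatus(url, userAgent string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("User-Agent", userAgent)
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo starts a local stand-in for a bot-protected site that rejects any
// client whose User-Agent doesn't look like Chrome, then checks it twice.
func demo() (int, int, error) {
	site := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.UserAgent(), "Chrome/") {
			http.Error(w, "Forbidden", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "ok")
	}))
	defer site.Close()

	plain, err := checkStatus(site.URL, "example-monitor/1.0")
	if err != nil {
		return 0, 0, err
	}
	browser, err := checkStatus(site.URL, browserUA)
	if err != nil {
		return 0, 0, err
	}
	return plain, browser, nil
}

func main() {
	plain, browser, err := demo()
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("monitor User Agent:", plain)
	fmt.Println("browser User Agent:", browser)
}
```

The site itself is healthy in both cases. Only the monitor's identity changes, which is why a 403 from a protected site is a false alarm rather than real downtime.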

Add a Discord Alert

To get a message on your phone when your site goes down:

On the right side of the monitor screen, click Setup Notification.
Select Discord from the dropdown list.
Paste a Discord Webhook URL (you can create one in your Discord server settings under Integrations).
Click Test to receive a test ping, then click Save.

Conclusion

Congratulations! You just took control of your server health. By deploying Uptime Kuma, you replaced an expensive SaaS subscription with a powerful, free monitoring tool that alerts you the second a project goes offline.

Let’s connect! I am a developer and technical writer specialising in writing step-by-step guides and workflows. You can find my latest projects on my Technical Writing Portfolio or reach out to me directly on LinkedIn.

Model Packaging Tools Every MLOps Engineer Should Know

Temitope Oyedele — Mon, 06 Apr 2026 15:00:08 +0000

Most machine learning deployments don’t fail because the model is bad. They fail because of packaging.

Teams often spend months fine-tuning models (adjusting hyperparameters and improving architectures) only to hit a wall when it’s time to deploy. Suddenly, the production system can’t even read the model file. Everything breaks at the handoff between research and production.

The good news? If you think about packaging from the start, you can save up to 60% of the time usually spent during deployment. That’s because you avoid the common friction between the experimental environment and the production system.

In this guide, we’ll walk through eleven essential tools every MLOps engineer should know. To keep things clear, we’ll group them into three stages of a model’s lifecycle:

Serialization: how models are stored and transferred
Bundling & Serving: how models are deployed and run
Registry: how models are tracked and versioned

Model Serialization Formats
Model Bundling and Serving Tools
Model Registries
Conclusion

Model Serialization Formats

Serialization is simply the process of turning a trained model into a file that can be stored and moved around. It’s the first step in the pipeline, and it matters more than people think. The format you choose determines how your model will be loaded later in production.

So, you want something that either works across different frameworks or is optimized for the environment where your model will eventually run.

Below are some of the most common tools in this space:

1. ONNX (Open Neural Network Exchange)

ONNX is basically the common language for model serialization. It lets you train a model in one framework, like PyTorch, and then deploy it somewhere else without running into compatibility issues. It also performs well across different types of hardware.

ONNX separates your training framework from your inference runtime and allows hardware-level optimizations like quantization and graph fusion. It’s also widely supported across cloud platforms and edge devices.

Key considerations: This format makes it possible to decouple training from deployment, while still enabling performance optimizations across different hardware setups.

When to use it: Use ONNX when you need portability – especially if different teams or environments are involved.

2. TorchScript

TorchScript lets you compile PyTorch models into a format that can run without Python. That means you can deploy it in environments like C++ or mobile without carrying the full Python runtime.

It supports two approaches: tracing (recording execution with sample inputs) and scripting (capturing full control flow).

Key considerations: Its biggest advantage is removing the Python dependency, which helps reduce latency and makes it suitable for more constrained environments.

When to use it: Best for high-performance systems where Python would be too heavy or introduce security concerns.

3. TensorFlow SavedModel

SavedModel is TensorFlow’s native format. It stores everything – the computation graph, weights, and serving logic – in a single directory.

It’s also the standard input format for TensorFlow Serving, TFLite, and Google Cloud AI Platform.

Key considerations: It keeps everything within the TensorFlow ecosystem intact, so you don’t lose any part of the model when moving to production.

When to use it: If your project is built on TensorFlow, this is the default and safest choice.

4. Pickle and Joblib

Pickle is Python’s built-in way of saving objects, and Joblib builds on top of it to better handle large arrays and models.

These are commonly used for scikit-learn pipelines, XGBoost models, and other traditional ML setups.

Key considerations: They’re simple and convenient, but come with real trade-offs. Pickle can execute arbitrary code when loading, which makes it unsafe in untrusted environments. It’s also tightly coupled to Python versions and library dependencies, so models can break when moved across environments.

When to use it: Best suited for controlled environments where everything runs in the same Python stack, such as internal tools, quick prototypes, or batch jobs.

It’s especially practical when you’re working with classical ML models and don’t need cross-language support or long-term portability. Avoid it for production systems that require security, reproducibility, or deployment across different environments.
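To make the code-execution risk concrete, here is a small, harmless sketch. The `Payload` class is a made-up example: its `__reduce__` method tells Pickle to call `eval` during loading. A real attacker would swap in something destructive.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle form is 'call eval' rather than saved state."""

    def __reduce__(self):
        # pickle records this callable and its arguments, then invokes them on load.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload. It runs eval("6 * 7") and returns the result.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The file's contents decide what runs at load time, which is why you should only unpickle files you produced yourself or fully trust.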

5. Safetensors

Safetensors is a newer format developed by Hugging Face. It’s designed to be safe, fast, and straightforward.

It avoids arbitrary code execution and allows efficient loading directly from disk.

Key considerations: It’s both memory-efficient and secure, which makes it a strong alternative to older formats like Pickle.

When to use it: Ideal for modern workflows where speed and safety are important.
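The safety property comes from the file layout: an 8-byte little-endian header length, a JSON header describing each tensor, then raw bytes. The sketch below reimplements that layout with the standard library for flat float32 lists only. It is an illustration of the format, not the `safetensors` library, and the function names are assumptions.

```python
import json
import struct


def write_safetensors_like(tensors: dict) -> bytes:
    """Serialize {name: [float, ...]} as header length + JSON header + raw data."""
    header, body = {}, b""
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [len(body), len(body) + len(raw)],
        }
        body += raw
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + body


def read_safetensors_like(blob: bytes) -> dict:
    """Parse the layout back. Only JSON parsing and byte slicing are involved."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    data = blob[8 + header_len :]
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        start, end = meta["data_offsets"]
        count = (end - start) // 4  # 4 bytes per float32
        tensors[name] = list(struct.unpack(f"<{count}f", data[start:end]))
    return tensors


blob = write_safetensors_like({"weight": [1.0, 2.5, -3.0], "bias": [0.5]})
print(read_safetensors_like(blob))
```

Nothing in the read path can call a function named in the file, which is the key difference from Pickle.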

Model Bundling and Serving Tools

Once your model is saved, the next step is making it usable in production. That means wrapping it in a way that can handle requests and connect it to the rest of your system.

1. BentoML

BentoML allows you to define your model service in Python – including preprocessing, inference, and postprocessing – and package everything into a single unit called a “Bento.”

This bundle includes the model, code, dependencies, and even Docker configuration.

Key considerations: It simplifies deployment by packaging everything into one consistent artifact that can run anywhere.

When to use it: Great when you want to ship your model and all its logic together as one deployable unit.

2. NVIDIA Triton Inference Server

Triton is NVIDIA’s production-grade inference server. It supports multiple model formats like ONNX, TorchScript, TensorFlow, and more.

It’s built for performance, using features like dynamic batching and concurrent execution to fully utilize GPUs.

Key considerations: It delivers high throughput and efficiently uses hardware, especially GPUs, while supporting models from different frameworks.

When to use it: Best for large-scale deployments where performance, low latency, and GPU usage are critical.
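Dynamic batching is easier to reason about with a toy model. The sketch below is a conceptual simulation, not Triton's scheduler: it groups request arrival times into batches, closing a batch when it is full or when the next request arrives too long after the batch's first request. The function and parameter names are assumptions for illustration.

```python
def form_batches(arrivals_ms, max_batch_size, max_delay_ms):
    """Group request arrival times (in ms) into batches.

    A batch closes when it holds max_batch_size requests, or when a new
    request arrives more than max_delay_ms after the batch's first request.
    """
    batches, current = [], []
    for t in sorted(arrivals_ms):
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and t - current[0] > max_delay_ms
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# A burst of four requests, then two stragglers 50 ms later.
print(form_batches([0, 1, 2, 3, 50, 51], max_batch_size=4, max_delay_ms=5))
```

Larger batches improve GPU utilization, while the delay cap bounds how much latency any single request pays for that gain.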

3. TorchServe

TorchServe is the official serving tool for PyTorch, developed with AWS.

It packages models into a MAR file, which includes weights, code, and dependencies, and provides APIs for managing models in production.

Key considerations: It offers built-in features for versioning, batching, and management without needing to build everything from scratch.

When to use it: A solid choice for deploying PyTorch models in a standard production setup.

Model Registries

A model registry is essentially your source of truth. It stores your models, tracks versions, and manages their lifecycle from experimentation to production.

Without one, things quickly become messy and hard to track.

1. MLflow Model Registry

MLflow is one of the most widely used MLOps platforms. Its registry helps manage model versions and track their progression through stages like Staging and Production.

It also links models back to the experiments that created them.

Key considerations: It provides strong lifecycle management and makes it easier to track and audit models.

When to use it: Ideal for teams that need structured workflows and clear governance.

2. Hugging Face Hub

The Hugging Face Hub is one of the largest platforms for sharing and managing models.

It supports both public and private repositories, along with dataset versioning and interactive demos.

Key considerations: It offers a huge library of models and makes collaboration very easy.

When to use it: Perfect for projects involving transformers, generative AI, or anything that benefits from sharing and discovery.

3. Weights & Biases

Weights & Biases combines experiment tracking with a model registry.

It connects each model directly to the training run that produced it.

Key considerations: It gives you full traceability, so you always know how a model was created.

When to use it: Best when you want a strong link between experimentation and production artifacts.

Conclusion

Machine learning systems rarely fail because the models are bad. They fail because the path to production is fragile.

Packaging is what connects research to production. If that connection is weak, even great models won’t make it into real use.

Choosing the right tools across serialization, serving, and registry layers makes systems easier to deploy and maintain. Formats like ONNX and Safetensors improve portability and safety. Tools like Triton and BentoML help with reliable serving. Registries like MLflow and Hugging Face Hub keep everything organized.

The main idea is simple: don’t leave deployment as something to figure out later.

When packaging is planned early, teams move faster and avoid a lot of unnecessary problems.

In practice, success in MLOps isn’t just about building models. It’s about making sure they actually run in the real world.

How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator

Osomudeya Zudonu — Thu, 26 Mar 2026 14:25:52 +0000

If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?

Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.

In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.

By the end, you'll be able to:

Explain the full architecture from vault to pod
Run the lab locally in about 15 minutes
Prove why environment variables go stale after rotation, while mounted secret files stay fresh
Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD
Troubleshoot the most common failures

Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.

Prerequisites
How to Understand the Secret Flow
How to Run the Local Lab
How to Inspect the ExternalSecret and the Application
How to Test Secret Rotation
How to Choose Between External Secrets Operator and the CSI Driver
How to Deploy the Pattern on Amazon Elastic Kubernetes Service
How to Configure GitHub Actions Without Stored AWS Credentials
How to Troubleshoot the Most Common Failures
Conclusion

Prerequisites

Before you begin, make sure you have the following tools installed and configured.

For the local lab:

An AWS account with access to AWS Secrets Manager
The AWS CLI installed and configured. Run aws configure and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.
kubectl installed. For Microk8s, run microk8s kubectl config view --raw > ~/.kube/config after installation to connect kubectl to your local cluster.
Terraform installed
Helm installed
Docker installed
A local Kubernetes cluster: the lab supports Microk8s and kind. If you do not have either installed, follow the Microk8s install guide before continuing.

For the Amazon Elastic Kubernetes Service sections:

An Amazon Elastic Kubernetes Service cluster you can create or manage
A GitHub repository you can configure for workflows and secrets

The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's docs/DEPLOY-LOCAL.md and docs/DEPLOY-EKS.md.

How to Understand the Secret Flow

Before you run any command, you need to understand how the pieces connect.

The flow has four stages:

A developer or automated system updates a secret in AWS Secrets Manager.
The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.
Your pod reads that Kubernetes Secret.
During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.

How the External Secrets Operator Sync Works

The External Secrets Operator reads a custom Kubernetes resource called ExternalSecret. That resource tells the operator three things:

Which secret store to connect to
Which Kubernetes Secret name to create or update
How often to refresh

In this lab, the ExternalSecret creates a Kubernetes Secret named myapp-database-creds. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.

How the App Consumes Secrets

The sample application exposes three endpoints so you can validate behavior at any time.

/secrets/env shows what environment variables the pod sees
/secrets/volume shows what files in the mounted secret directory look like
/secrets/compare compares both and reports whether rotation has been detected

The app checks four keys: DB_USERNAME, DB_PASSWORD, DB_HOST, and DB_PORT.

How to Run the Local Lab

The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.

Step 1: Clone the Repo

git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab

Step 2: Run the Spin-Up Script

bash spinup.sh

The script will ask you to choose a local cluster type. Pick Microk8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.

If the script fails at any point, check docs/TROUBLESHOOTING.md before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a Microk8s storage add-on that is not enabled.

Important: Run the Lab UI

The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application. It's a React-based checklist in lab-ui/ that walks you through each concept and checkpoint as you work through the lab.

To start it, open a second terminal and run:

cd lab-ui && npm install && npm run dev

Then open http://localhost:5173. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.

Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (localhost:3000) are two separate things: the UI guides you through the steps, and the app shows you the live secrets.

Step 3: Access the Application

Once the lab finishes, port-forward the service.

kubectl port-forward svc/myapp 3000:80 -n default

Open http://localhost:3000. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.

Step 4: Validate That Secrets Match

Run the compare endpoint directly from the terminal.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

When everything is working, the response will include "all_match": true.

How to Inspect the ExternalSecret and the Application

At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.

Step 1: Read the ExternalSecret Manifest

Open k8s/aws/external-secret.yaml. Focus on these four fields:

refreshInterval: how often the operator polls AWS Secrets Manager
secretStoreRef: which store the operator authenticates against
target: the name of the Kubernetes Secret to create
data: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys

Here is what that mapping looks like in this lab:

spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username

The property field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.
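To see what that mapping does, the sketch below mimics the transformation in plain Python: parse the JSON secret, pick out each `property`, and base64-encode the result the way Kubernetes stores Secret data. It is a simulation, not the operator's code. The sample secret values and the `password`, `host`, and `port` property names are assumptions (the manifest excerpt only shows `username`).

```python
import base64
import json


def build_secret_data(aws_secret_string: str, mappings: list) -> dict:
    """Map JSON properties of a remote secret onto Kubernetes Secret keys."""
    remote = json.loads(aws_secret_string)
    data = {}
    for mapping in mappings:
        value = str(remote[mapping["property"]])
        # Kubernetes stores Secret values base64-encoded (encoding, not encryption).
        data[mapping["secretKey"]] = base64.b64encode(value.encode()).decode()
    return data


# Hypothetical secret payload stored under prod/myapp/database.
aws_secret = json.dumps(
    {"username": "appuser", "password": "s3cret", "host": "db.internal", "port": 5432}
)
mappings = [
    {"secretKey": "DB_USERNAME", "property": "username"},
    {"secretKey": "DB_PASSWORD", "property": "password"},
    {"secretKey": "DB_HOST", "property": "host"},
    {"secretKey": "DB_PORT", "property": "port"},
]

print(build_secret_data(aws_secret, mappings))
```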

Two fields here are worth understanding before you move on. creationPolicy: Owner means the operator owns the Kubernetes Secret it creates. If you delete the ExternalSecret, the Secret is deleted too. ClusterSecretStore is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain SecretStore is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.

Step 2: Read the Deployment Manifest

Open k8s/aws/deployment.yaml. You are looking for two sections: envFrom and volumeMounts.

envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true

Both paths read from the same Kubernetes Secret, myapp-database-creds. The envFrom block injects all keys as environment variables at pod start. The volumeMounts block mounts the same secret as files under /etc/secrets.

This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.

Step 3: Read the App Comparison Logic

Open app/server.js. The comparison logic reads environment variables from process.env and reads mounted secret files from /etc/secrets/. Then it computes a per-key match and a global all_match value.

The /secrets/compare endpoint sets rotation_detected: true when any key differs between env and volume.
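The lab's app is written in JavaScript, but the comparison is simple enough to sketch in Python. The version below is an approximation of the behavior described above, not a port of app/server.js. The function name and the temporary directory standing in for /etc/secrets are assumptions.

```python
import tempfile
from pathlib import Path

KEYS = ["DB_USERNAME", "DB_PASSWORD", "DB_HOST", "DB_PORT"]


def compare_secrets(env: dict, secrets_dir: str, keys=KEYS) -> dict:
    """Compare env-style values with the files in a mounted secret directory."""
    comparison = {}
    for key in keys:
        path = Path(secrets_dir) / key
        env_value = env.get(key)
        volume_value = path.read_text().strip() if path.exists() else None
        comparison[key] = {
            "env": env_value,
            "volume": volume_value,
            "match": env_value == volume_value,
        }
    all_match = all(row["match"] for row in comparison.values())
    return {"comparison": comparison, "all_match": all_match, "rotation_detected": not all_match}


# Simulate a pod whose env still holds the old password after rotation.
env = {"DB_USERNAME": "appuser", "DB_PASSWORD": "old", "DB_HOST": "db", "DB_PORT": "5432"}
with tempfile.TemporaryDirectory() as mounted:
    for key, value in {**env, "DB_PASSWORD": "new"}.items():
        (Path(mounted) / key).write_text(value)
    report = compare_secrets(env, mounted)

print(report["all_match"], report["rotation_detected"])
```

In a real pod you would pass `os.environ` and `/etc/secrets` instead of the simulated values.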

How to Test Secret Rotation

Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.

How the Rotation Gap Works

When a pod starts, Kubernetes gives it two ways to read a secret.

The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.

The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.

Same secret, two paths. One goes stale while one stays fresh.

The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.

That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.
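One practical consequence: if your app reads the mounted file each time it needs the value, it picks up rotations without a restart. The sketch below shows the pattern with a temporary directory standing in for the mounted volume. `read_secret` is a hypothetical helper, not part of the lab's code.

```python
import tempfile
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/etc/secrets") -> str:
    """Read a mounted secret file at the moment the value is needed."""
    return (Path(secrets_dir) / name).read_text().strip()


with tempfile.TemporaryDirectory() as secrets_dir:
    secret_file = Path(secrets_dir) / "DB_PASSWORD"
    secret_file.write_text("old-password-value")

    # Captured once, like an environment variable at container start.
    startup_snapshot = read_secret("DB_PASSWORD", secrets_dir)

    # Simulated rotation: the file content changes underneath the process.
    secret_file.write_text("new-password-value")

    # Read again at use time, like an app that opens the file per connection.
    fresh_value = read_secret("DB_PASSWORD", secrets_dir)

print(startup_snapshot, fresh_value)
```

The snapshot keeps the old value for the life of the process, while the per-use read returns whatever the file currently holds.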

Here is what you're about to observe in the lab:

The rotation script updates the secret in AWS
ESO syncs the new value into Kubernetes within seconds, because the script forces a sync instead of waiting for the 1h refreshInterval
The volume file updates automatically
The environment variable stays stale until the pod restarts
The /secrets/compare endpoint shows both values side by side so you can see the gap live

Step 1: Confirm the Lab Is Ready

Make sure your pod and the External Secrets Operator are both running before you start.

kubectl get pods -n external-secrets
kubectl get pods -n default

Both should show Running.

Step 2: Run the Rotation Test Script

bash rotation/test-rotation.sh

The script performs these actions in order:

Reads the current DB_PASSWORD from the volume mount at /etc/secrets/DB_PASSWORD
Reads the current DB_PASSWORD from the environment variable
Updates AWS Secrets Manager with a new password using put-secret-value
Forces an immediate ESO sync by annotating the ExternalSecret with force-sync
Reads the volume value again
Reads the environment variable again

After the script runs, the volume and the env var will show different values.
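The two commands at the heart of those steps can be sketched as follows. This is not the contents of rotation/test-rotation.sh. It builds the `aws secretsmanager put-secret-value` and `kubectl annotate ... force-sync=<timestamp> --overwrite` command lines without running them. The `password` property name is an assumption, and the ExternalSecret name is borrowed from the troubleshooting section.

```python
import json
import time


def rotation_commands(secret_id, current_secret, new_password, external_secret, namespace="default"):
    """Return the (aws, kubectl) argument lists for a manual rotation."""
    # put-secret-value replaces the whole secret string, so keep the other fields.
    updated = dict(current_secret)
    updated["password"] = new_password
    put = [
        "aws", "secretsmanager", "put-secret-value",
        "--secret-id", secret_id,
        "--secret-string", json.dumps(updated),
    ]
    # Changing an annotation on the ExternalSecret prompts ESO to reconcile now.
    sync = [
        "kubectl", "annotate", "externalsecret", external_secret,
        "-n", namespace,
        f"force-sync={int(time.time())}",
        "--overwrite",
    ]
    return put, sync


put_cmd, sync_cmd = rotation_commands(
    "prod/myapp/database",
    {"username": "appuser", "password": "old-password-value"},
    "new-password-value",
    "app-db-secret",
)
print(" ".join(sync_cmd))
# To execute: subprocess.run(put_cmd, check=True); subprocess.run(sync_cmd, check=True)
```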

Step 3: Validate With the Compare Endpoint

Hit the compare endpoint and look at the output.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

You'll see something like this:

{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}

Step 4: Restart the Deployment to Sync Env Vars

Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.

kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default

Then hit /secrets/compare again. Every row should now match, and the response should show "all_match": true.

How to Automate Restarts With Reloader

If you don't want to restart deployments manually after every rotation, you can install Stakater Reloader. It watches an annotation on the Deployment and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.

How to Choose Between External Secrets Operator and the CSI Driver

Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the Secrets Store CSI Driver.

Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:

Feature	External Secrets Operator	Secrets Store CSI Driver
Creates a Kubernetes Secret	Yes	No by default
Supports `envFrom`	Yes	No (workaround only)
Secret stored in etcd	Yes (base64)	No, if you skip sync
Rotation	ESO updates the Secret, Reloader restarts pods	Volume file can update in place
Best for	Most teams. Multi-cloud, env var support	Security policies that prohibit secrets in etcd

This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both envFrom and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.

Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native envFrom model.

How to Deploy the Pattern on Amazon Elastic Kubernetes Service

The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.

Step 1: Prepare Terraform and OpenID Connect Access

The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the terraform/github-oidc folder.

cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn

Copy the role ARN from the output. You'll need it in the next step.

Step 2: Set the Required Environment Variable

The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.

To find your AWS account ID, run:

aws sts get-caller-identity --query Account --output text

Then set the variable, replacing ACCOUNT with the number that command returns.

export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role

Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service

bash spinup.sh --cluster eks

When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing Match ✓.

Step 4: Test Rotation on the Deployed App

After you confirm normal operation, run the rotation test the same way you did locally.

bash rotation/test-rotation.sh

Then use /secrets/compare on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.

⚠️ Cost warning: Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run bash teardown.sh from the repo root to destroy all AWS resources and stop charges.

How to Configure GitHub Actions Without Stored AWS Credentials

The typical CI/CD setup stores AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in GitHub repository secrets. Those keys don't rotate unless someone rotates them by hand, and anyone who can edit a workflow in the repo can use or leak them. When someone leaves the team, you have to revoke keys and update every workflow.

OpenID Connect eliminates that problem entirely.

How OpenID Connect Works for GitHub Actions

GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via AssumeRoleWithWebIdentity. No long-lived keys are ever stored anywhere.
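For reference, a trust policy of that kind typically looks like the document generated below. This is a sketch of the general shape, not the exact policy in terraform/github-oidc. The account ID, repository, and branch are placeholders.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build an IAM trust policy limited to one GitHub repository and branch."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # The token must be intended for AWS STS...
                        f"{provider}:aud": "sts.amazonaws.com",
                        # ...and must come from this repository and branch.
                        f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


policy = github_oidc_trust_policy("123456789012", "YOUR_ORG/YOUR_REPO")
print(json.dumps(policy, indent=2))
```

The `sub` condition is what scopes the role: a workflow in any other repository or branch presents a different subject claim, and the role assumption is refused.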

Step 1: Create the IAM Role With Terraform

The terraform/github-oidc folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.

Step 2: Add the Role ARN to GitHub Repository Secrets

In your GitHub repository:

Go to Settings → Secrets and variables → Actions
Click New repository secret
Name it AWS_ROLE_ARN
Paste the role ARN from the Terraform output

That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.

Step 3: Configure Terraform State

For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.

Step 4: Push to Main and Let Workflows Run

After your first spin-up, every push to the main branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use /secrets/compare to validate rotation behavior on the live environment.

How to Troubleshoot the Most Common Failures

Here's a shortlist of the most common symptoms and their fixes.

Symptom	Most Likely Cause	Fix
`ExternalSecret` is not syncing	Missing credentials or wrong store reference	Confirm the operator can access AWS Secrets Manager and that `secretStoreRef` points to the correct store
Pod is stuck in `Pending`	Missing storage setup for local cluster	For Microk8s, enable the storage add-on
Env and volume still don't match after rotation	Rotation happened but the pod never restarted	Run `kubectl rollout restart` or install Reloader
CRD or API version mismatch	ESO version and manifest `apiVersion` don't match	Verify the `apiVersion` for `ClusterSecretStore` and `ExternalSecret` match your installed ESO version
Amazon Elastic Kubernetes Service node group never joins	Networking or IAM permissions for nodes are wrong	Fix internet routing and review the node IAM policy

How to Inspect the Operator and the ExternalSecret

When something isn't syncing, start with these two commands.

# Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

The status conditions on the ExternalSecret resource will usually tell you exactly what failed.

How to Validate Rotation From the App Side

When you are debugging rotation, don't rely only on Kubernetes resource state. Use the /secrets/compare endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.

Conclusion

You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the ExternalSecret and Deployment manifests, and validated that the application sees the right credentials.

You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.

Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.

The lab repository is at github.com/Osomudeya/k8s-secret-lab. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on Microk8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.

If this helped you, star the repo and share it with someone who is learning Kubernetes.


Docker Container Doctor: How I Built an AI Agent That Monitors and Fixes My Containers

Balajee Asish Brahmandam — Mon, 23 Mar 2026 17:21:11 +0000

Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Google the error, and finally restart it. Twenty minutes of your morning gone. And the worst part? It happens again next week.

I got tired of this cycle. I was running 5 containerized services on a single Linode box – a Flask API, a Postgres database, an Nginx reverse proxy, a Redis cache, and a background worker. Every other week, one of them would crash. The logs were messy. The errors weren't obvious. And I'd waste time debugging something that could've been auto-detected and fixed in seconds.

So I built something better: a Python agent that watches your containers in real-time, spots errors, figures out what went wrong using Claude, and fixes them without waking you up. I call it the Container Doctor. It's not magic. It's Docker API + LLM reasoning + some automation glue. Here's exactly how I built it, what went wrong along the way, and what I'd do differently.

Why Not Just Use Prometheus?
The Architecture
Setting Up the Project
The Monitoring Script — Line by Line
The Claude Diagnosis Prompt (and Why Structure Matters)
Auto-Fix Logic — Being Conservative on Purpose
Adding Slack Notifications
Health Check Endpoint
Rate Limiting Claude Calls
Docker Compose — The Full Setup
Real Errors I Caught in Production
Cost Breakdown — What This Actually Costs
Security Considerations
What I'd Do Differently
What's Next?

Why Not Just Use Prometheus?

Fair question. Prometheus, Grafana, DataDog – they're all great. But for my setup, they were overkill. I had 5 containers on a $20/month Linode. Setting up Prometheus means deploying a metrics server, configuring exporters for each service, building Grafana dashboards, and writing alert rules. That's a whole side project just to monitor a side project.

Even then, those tools tell you what happened. They'll show you a spike in memory or a 500 error rate. But they won't tell you why. You still need a human to look at the logs, figure out the root cause, and decide what to do.

That's the gap I wanted to fill. I didn't need another dashboard. I needed something that could read a stack trace, understand the context, and either fix it or tell me exactly what to do when I wake up. Claude turned out to be surprisingly good at this. It can read a Python traceback and tell you the issue faster than most junior devs (and some senior ones, honestly).

The Architecture

Here's how the pieces fit together:

┌─────────────────────────────────────────────┐
│              Docker Host                      │
│                                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   web    │  │   api    │  │    db    │   │
│  │ (nginx)  │  │ (flask)  │  │(postgres)│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │         │
│       └──────────────┼──────────────┘         │
│                      │                         │
│              Docker Socket                     │
│                      │                         │
│            ┌─────────┴─────────┐              │
│            │ Container Doctor  │              │
│            │  (Python agent)   │              │
│            └─────────┬─────────┘              │
│                      │                         │
└──────────────────────┼─────────────────────────┘
                       │
              ┌────────┴────────┐
              │   Claude API    │
              │  (diagnosis)    │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Slack Webhook  │
              │  (alerts)       │
              └─────────────────┘

The flow works like this:

The Container Doctor runs in its own container with the Docker socket mounted
Every 10 seconds, it pulls the last 50 lines of logs from each target container
It scans for error patterns (keywords like "error", "exception", "traceback", "fatal")
When it finds something, it sends the logs to Claude with a structured prompt
Claude returns a JSON diagnosis: root cause, severity, suggested fix, and whether it's safe to auto-restart
If severity is high and auto-restart is safe, the script restarts the container
Either way, it sends a Slack notification with the full diagnosis
A simple health endpoint lets you check the doctor's own status

The key insight: the script doesn't try to be smart about the diagnosis itself. It outsources all the thinking to Claude. The script's job is just plumbing: collecting logs, routing them to Claude, and executing the response.

Setting Up the Project

Create your project directory:

mkdir container-doctor && cd container-doctor

Here's your requirements.txt:

docker==7.0.0
anthropic>=0.28.0
python-dotenv==1.0.0
flask==3.0.0
requests==2.31.0

Install locally for testing: pip install -r requirements.txt

Create a .env file:

ANTHROPIC_API_KEY=sk-ant-...
TARGET_CONTAINERS=web,api,db
CHECK_INTERVAL=10
LOG_LINES=50
AUTO_FIX=true
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POSTGRES_USER=user
POSTGRES_PASSWORD=changeme
POSTGRES_DB=mydb
MAX_DIAGNOSES_PER_HOUR=20

A quick note on CHECK_INTERVAL: 10 seconds is aggressive. For production, I'd bump this to 30-60 seconds. I kept it low during development so I could see results faster, and honestly forgot to change it. My API bill reminded me.

The Monitoring Script – Line by Line

Here's the full container_doctor.py. I'll walk through the important parts after:

import docker
import json
import time
import logging
import os
import requests
from datetime import datetime, timedelta
from collections import defaultdict
from threading import Thread
from flask import Flask, jsonify
from anthropic import Anthropic

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

client = Anthropic()
docker_client = None

# --- Config ---
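# Drop empty entries so an unset variable yields [] rather than [""]
TARGET_CONTAINERS = [
    name.strip()
    for name in os.getenv("TARGET_CONTAINERS", "").split(",")
    if name.strip()
]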
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
LOG_LINES = int(os.getenv("LOG_LINES", "50"))
AUTO_FIX = os.getenv("AUTO_FIX", "true").lower() == "true"
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
MAX_DIAGNOSES = int(os.getenv("MAX_DIAGNOSES_PER_HOUR", "20"))

# --- State tracking ---
diagnosis_history = []
fix_history = defaultdict(list)
last_error_seen = {}
rate_limit_counter = defaultdict(int)
rate_limit_reset = datetime.now() + timedelta(hours=1)

app = Flask(__name__)


def get_docker_client():
    """Lazily initialize Docker client."""
    global docker_client
    if docker_client is None:
        docker_client = docker.from_env()
    return docker_client


def get_container_logs(container_name):
    """Fetch last N lines from a container."""
    try:
        container = get_docker_client().containers.get(container_name)
        logs = container.logs(
            tail=LOG_LINES,
            timestamps=True
        ).decode("utf-8")
        return logs
    except docker.errors.NotFound:
        logger.warning(f"Container '{container_name}' not found. Skipping.")
        return None
    except docker.errors.APIError as e:
        logger.error(f"Docker API error for {container_name}: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error fetching logs for {container_name}: {e}")
        return None


def detect_errors(logs):
    """Check if logs contain error patterns."""
    error_patterns = [
        "error", "exception", "traceback", "failed", "crash",
        "fatal", "panic", "segmentation fault", "out of memory",
        "killed", "oomkiller", "connection refused", "timeout",
        "permission denied", "no such file", "errno"
    ]
    logs_lower = logs.lower()
    found = []
    for pattern in error_patterns:
        if pattern in logs_lower:
            found.append(pattern)
    return found


def is_new_error(container_name, logs):
    """Check if this is a new error or the same one we already diagnosed."""
    log_hash = hash(logs[-200:])  # Hash last 200 chars
    if last_error_seen.get(container_name) == log_hash:
        return False
    last_error_seen[container_name] = log_hash
    return True


def check_rate_limit():
    """Ensure we don't spam Claude with too many requests."""
    global rate_limit_counter, rate_limit_reset

    now = datetime.now()
    if now > rate_limit_reset:
        rate_limit_counter.clear()
        rate_limit_reset = now + timedelta(hours=1)

    total = sum(rate_limit_counter.values())
    if total >= MAX_DIAGNOSES:
        logger.warning(f"Rate limit reached ({total}/{MAX_DIAGNOSES} per hour). Skipping diagnosis.")
        return False
    return True


def diagnose_with_claude(container_name, logs, error_patterns):
    """Send logs to Claude for diagnosis."""
    if not check_rate_limit():
        return None

    rate_limit_counter[container_name] += 1

    prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

    try:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=600,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    except Exception as e:
        logger.error(f"Claude API error: {e}")
        return None


def parse_diagnosis(diagnosis_text):
    """Extract JSON from Claude's response."""
    if not diagnosis_text:
        return None
    try:
        start = diagnosis_text.find("{")
        end = diagnosis_text.rfind("}") + 1
        if start >= 0 and end > start:
            json_str = diagnosis_text[start:end]
            return json.loads(json_str)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse error: {e}")
        logger.debug(f"Raw response: {diagnosis_text}")
    except Exception as e:
        logger.error(f"Failed to parse diagnosis: {e}")
    return None


def apply_fix(container_name, diagnosis):
    """Apply auto-fixes if safe."""
    if not AUTO_FIX:
        logger.info(f"Auto-fix disabled globally. Skipping {container_name}.")
        return False

    if not diagnosis.get("auto_restart_safe"):
        logger.info(f"Claude says restart is unsafe for {container_name}. Skipping.")
        return False

    # Don't restart the same container more than 3 times per hour
    recent_fixes = [
        t for t in fix_history[container_name]
        if t > datetime.now() - timedelta(hours=1)
    ]
    if len(recent_fixes) >= 3:
        logger.warning(
            f"Container {container_name} already restarted {len(recent_fixes)} "
            f"times this hour. Something deeper is wrong. Skipping."
        )
        send_slack_alert(
            container_name, diagnosis,
            extra="REPEATED FAILURE: This container has been restarted 3+ times "
                  "in the last hour. Manual intervention needed."
        )
        return False

    try:
        container = get_docker_client().containers.get(container_name)
        logger.info(f"Restarting container {container_name}...")
        container.restart(timeout=30)
        fix_history[container_name].append(datetime.now())
        logger.info(f"Container {container_name} restarted successfully")

        # Verify it's actually running after restart
        time.sleep(5)
        container.reload()
        if container.status != "running":
            logger.error(f"Container {container_name} failed to start after restart")
            return False

        return True
    except Exception as e:
        logger.error(f"Failed to restart {container_name}: {e}")
        return False


def send_slack_alert(container_name, diagnosis, extra=""):
    """Send diagnosis to Slack."""
    if not SLACK_WEBHOOK:
        return

    severity_emoji = {
        "low": "🟡",
        "medium": "🟠",
        "high": "🔴"
    }

    severity = diagnosis.get("severity", "unknown")
    emoji = severity_emoji.get(severity, "⚪")

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"{emoji} Container Doctor Alert: {container_name}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {severity}"},
                {"type": "mrkdwn", "text": f"*Container:* `{container_name}`"},
                {"type": "mrkdwn", "text": f"*Root Cause:* {diagnosis.get('root_cause', 'Unknown')}"},
                {"type": "mrkdwn", "text": f"*Fix:* {diagnosis.get('suggested_fix', 'N/A')}"},
            ]
        }
    ]

    if diagnosis.get("config_suggestions"):
        suggestions = "\n".join(
            f"• `{s}`" for s in diagnosis["config_suggestions"]
        )
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Config Suggestions:*\n{suggestions}"
            }
        })

    if extra:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*⚠️ {extra}*"}
        })

    try:
        requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
    except Exception as e:
        logger.error(f"Slack notification failed: {e}")


# --- Health Check Endpoint ---
@app.route("/health")
def health():
    """Health check endpoint for the doctor itself."""
    try:
        get_docker_client().ping()
        docker_ok = True
    except Exception:
        docker_ok = False

    return jsonify({
        "status": "healthy" if docker_ok else "degraded",
        "docker_connected": docker_ok,
        "monitoring": TARGET_CONTAINERS,
        "total_diagnoses": len(diagnosis_history),
        "fixes_applied": {k: len(v) for k, v in fix_history.items()},
        "rate_limit_remaining": MAX_DIAGNOSES - sum(rate_limit_counter.values()),
        "uptime_check": datetime.now().isoformat()
    })


@app.route("/history")
def history():
    """Return recent diagnosis history."""
    return jsonify(diagnosis_history[-50:])


def monitor_containers():
    """Main monitoring loop."""
    logger.info("Container Doctor starting up")
    logger.info(f"Monitoring: {TARGET_CONTAINERS}")
    logger.info(f"Check interval: {CHECK_INTERVAL}s")
    logger.info(f"Auto-fix: {AUTO_FIX}")
    logger.info(f"Rate limit: {MAX_DIAGNOSES}/hour")

    while True:
        for container_name in TARGET_CONTAINERS:
            container_name = container_name.strip()
            if not container_name:
                continue

            logs = get_container_logs(container_name)
            if not logs:
                continue

            error_patterns = detect_errors(logs)
            if not error_patterns:
                continue

            # Skip if we already diagnosed this exact error
            if not is_new_error(container_name, logs):
                continue

            logger.warning(
                f"Errors detected in {container_name}: {error_patterns}"
            )

            diagnosis_text = diagnose_with_claude(
                container_name, logs, error_patterns
            )
            if not diagnosis_text:
                continue

            diagnosis = parse_diagnosis(diagnosis_text)
            if not diagnosis:
                logger.error("Failed to parse Claude's response. Skipping.")
                continue

            # Record it
            diagnosis_history.append({
                "container": container_name,
                "timestamp": datetime.now().isoformat(),
                "diagnosis": diagnosis,
                "patterns": error_patterns
            })

            logger.info(
                f"Diagnosis for {container_name}: "
                f"severity={diagnosis.get('severity')}, "
                f"cause={diagnosis.get('root_cause')}"
            )

            # Auto-fix only on high severity
            fixed = False
            if diagnosis.get("severity") == "high":
                fixed = apply_fix(container_name, diagnosis)

            # Always notify Slack
            send_slack_alert(
                container_name, diagnosis,
                extra="Auto-restarted" if fixed else ""
            )

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Run Flask health endpoint in background
    flask_thread = Thread(
        target=lambda: app.run(host="0.0.0.0", port=8080, debug=False),
        daemon=True
    )
    flask_thread.start()
    logger.info("Health endpoint running on :8080")

    try:
        monitor_containers()
    except KeyboardInterrupt:
        logger.info("Container Doctor shutting down")

That's a lot of code, so let me walk through the parts that matter.

Error deduplication (is_new_error): This was a lesson I learned the hard way. Without this, the script would see the same error every 10 seconds and spam Claude with identical requests. I hash the last 200 characters of the log output and skip if it matches the last error we saw. Simple, but it cut my API costs by about 80%.
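One caveat, offered as a sketch rather than something the script above does: the logs are fetched with timestamps=True, so a crash loop that prints the same error again carries fresh timestamps and hashes differently. A variant that strips the timestamp prefix before hashing might look like this. The names error_fingerprint and is_new_error_normalized are hypothetical, and the regex assumes Docker's RFC 3339 timestamp at the start of each line.

```python
import hashlib
import re

# Assumed line shape: "2026-03-15T14:30:00.123456789Z ERROR: ..."
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+[ \t]?", re.MULTILINE)

_last_fingerprint = {}  # container name -> fingerprint of last diagnosed error


def error_fingerprint(logs, tail_chars=200):
    """Hash the tail of the logs with per-line timestamps removed."""
    stripped = TIMESTAMP_PREFIX.sub("", logs)
    return hashlib.sha256(stripped[-tail_chars:].encode("utf-8")).hexdigest()


def is_new_error_normalized(container_name, logs):
    """Return False when the timestamp-free log tail matches the last one seen."""
    fingerprint = error_fingerprint(logs)
    if _last_fingerprint.get(container_name) == fingerprint:
        return False
    _last_fingerprint[container_name] = fingerprint
    return True
```

The trade-off is that a genuinely new occurrence of an identical error is also suppressed, so this fits best alongside the hourly rate limit rather than instead of it.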

Rate limiting (check_rate_limit): Belt and suspenders. Even with deduplication, I cap it at 20 diagnoses per hour. If something is so broken that it's generating 20+ unique errors per hour, you need a human anyway.

Restart throttling (inside apply_fix): If the same container has been restarted 3 times in an hour, something deeper is wrong. A restart loop won't fix a misconfigured database or a missing volume. The script stops restarting and sends a louder Slack alert instead.

Post-restart verification: After restarting, the script waits 5 seconds and checks if the container is actually running. I've seen cases where a container restarts and immediately crashes again. Without this check, the script would report success while the container is still down.
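A single check after 5 seconds can still miss a container that dies a little later. As a hedged alternative, the sketch below samples the status several times and fails as soon as any sample is not "running". The helper names stays_running and docker_status are hypothetical; container.reload() and container.status are the same Docker SDK calls used in apply_fix.

```python
import time


def docker_status(container):
    """Refresh a Docker SDK container object and return its current status."""
    container.reload()
    return container.status


def stays_running(get_status, checks=5, interval=1.0, sleep=time.sleep):
    """Return True only if every sampled status is "running".

    get_status is any zero-argument callable that returns a status string,
    for example: lambda: docker_status(container)
    """
    for _ in range(checks):
        sleep(interval)
        if get_status() != "running":
            return False
    return True
```

Passing the status getter and the sleep function as arguments keeps the check easy to exercise without a Docker daemon.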

The Claude Diagnosis Prompt (and Why Structure Matters)

Getting Claude to return parseable JSON took some iteration. My first attempt used a casual prompt and I got back paragraphs of explanation with JSON buried somewhere in the middle. Sometimes it'd use markdown code fences, sometimes not.

The version I landed on is explicit about format:

prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

A few things I learned:

Include the detected patterns. Telling Claude "I found 'timeout' and 'connection refused'" helps it focus. Without this, it sometimes fixated on irrelevant warnings in the logs.

Ask for estimated_impact. This field turned out to be the most useful in Slack alerts. When your team sees "Database connections will pile up and crash the API within 15 minutes," they act faster than when they see "connection pool exhausted."

likely_recurring is gold. If Claude says an issue is likely to recur, I know a restart is a band-aid and I need to actually fix the root cause. I flag these in Slack with extra emphasis.

Claude returns something like:

{
    "root_cause": "Connection pool exhausted. Default pool size is 5, but app has 8+ concurrent workers.",
    "severity": "high",
    "suggested_fix": "1. Set POOL_SIZE=20 in environment. 2. Add connection timeout of 30s. 3. Consider a connection pooler like PgBouncer.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "CONNECTION_TIMEOUT=30"],
    "likely_recurring": true,
    "estimated_impact": "API requests will queue and timeout. Users will see 503 errors within 2-3 minutes."
}

I only auto-restart on high severity. Medium and low issues get logged, sent to Slack, and I deal with them during business hours. This distinction matters: you don't want the script restarting containers over every transient warning.
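Since the response is model output, it may be worth checking it before acting on it. For example, if auto_restart_safe ever came back as the string "false", a plain truthiness test would treat it as true. Here is a minimal sketch of such a check. validate_diagnosis is a hypothetical helper, not part of the script above.

```python
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def validate_diagnosis(diagnosis):
    """Return a cleaned copy of the diagnosis, or None if it is unusable."""
    if not isinstance(diagnosis, dict):
        return None

    severity = str(diagnosis.get("severity", "")).strip().lower()
    if severity not in ALLOWED_SEVERITIES:
        return None

    cleaned = dict(diagnosis)
    cleaned["severity"] = severity
    # Only a real JSON boolean true counts; strings like "false" are truthy.
    cleaned["auto_restart_safe"] = diagnosis.get("auto_restart_safe") is True
    cleaned["likely_recurring"] = diagnosis.get("likely_recurring") is True

    suggestions = diagnosis.get("config_suggestions")
    if isinstance(suggestions, list):
        cleaned["config_suggestions"] = [str(s) for s in suggestions]
    else:
        cleaned["config_suggestions"] = []
    return cleaned
```

It could sit between parse_diagnosis and apply_fix, so that anything malformed is logged and skipped instead of triggering a restart.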

Auto-Fix Logic – Being Conservative on Purpose

The auto-fix function is intentionally limited. Right now it only restarts containers. It doesn't modify environment variables, change configs, or scale services. Here's why:

Restarting is usually safe and reversible. If the restart makes things worse, the container just crashes again and I get another alert. But if the script started changing environment variables or modifying docker-compose files, a bad decision could cascade across services.

The three safety checks before any restart:

Global toggle: AUTO_FIX=true in .env. I can kill all auto-fixes instantly by changing one variable.
Claude's assessment: auto_restart_safe must be true. If Claude says "don't restart this, it'll corrupt the database," the script listens.
Restart throttle: No more than 3 restarts per container per hour. After that, it's a human problem.
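The three checks can also be read as one pure decision function, which makes them easy to test in isolation. This is a sketch under that framing. should_restart is a hypothetical name, and the high-severity gate stays in the calling loop, as in the script.

```python
from datetime import datetime, timedelta


def should_restart(diagnosis, restart_times, auto_fix_enabled,
                   now=None, max_per_hour=3):
    """Return (allowed, reason) after applying the three safety checks."""
    now = now or datetime.now()

    # 1. Global toggle
    if not auto_fix_enabled:
        return False, "auto-fix disabled"

    # 2. Claude's assessment
    if diagnosis.get("auto_restart_safe") is not True:
        return False, "diagnosis marks restart as unsafe"

    # 3. Restart throttle: count restarts within the last hour
    recent = [t for t in restart_times if t > now - timedelta(hours=1)]
    if len(recent) >= max_per_hour:
        return False, "restart throttle reached"

    return True, "ok"
```

Returning the reason alongside the decision gives the Slack alert something concrete to say when a restart is skipped.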

If I were building this for a team, I'd add approval flows. Send a Slack message with "Restart?" and two buttons. Wait for a human to click yes. That adds latency but removes the risk of automated chaos.

Adding Slack Notifications

Every diagnosis gets sent to Slack, whether the container was restarted or not. The notification includes color-coded severity, root cause, suggested fix, and config suggestions.

The Slack Block Kit formatting makes these alerts scannable. A red dot for high severity, orange for medium, yellow for low. Your team can glance at the channel and know if they need to drop everything or if it can wait.

To set this up, create a Slack app at api.slack.com/apps, add an incoming webhook, and paste the URL in your .env.

Health Check Endpoint

The doctor needs a doctor. I added a simple Flask endpoint so I can monitor the monitoring script:

curl http://localhost:8080/health

Returns:

{
    "status": "healthy",
    "docker_connected": true,
    "monitoring": ["web", "api", "db"],
    "total_diagnoses": 14,
    "fixes_applied": {"api": 2, "web": 1},
    "rate_limit_remaining": 6,
    "uptime_check": "2026-03-15T14:30:00"
}

And /history returns the last 50 diagnoses:

curl http://localhost:8080/history

I point an uptime checker (UptimeRobot, free tier) at the /health endpoint. If the Container Doctor itself goes down, I get an email. It's monitoring all the way down.

Rate Limiting Claude Calls

This is where I burned money during development. Without rate limiting, the script was sending 100+ requests per hour during a container crash loop. At a few cents per request, that's a few dollars per hour. Not catastrophic, but annoying.

The rate limiter is simple: a counter that resets every hour. Default cap is 20 diagnoses per hour. If you hit the limit, the script logs a warning and skips diagnosis until the window resets. Errors still get detected, they just don't get sent to Claude.

Combined with error deduplication (same error won't trigger a second diagnosis), this keeps my Claude bill under $5/month even with 5 containers monitored.
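The hourly counter can be sketched in isolation as a small class with an injectable clock, which makes the window reset easy to verify. HourlyBudget is a hypothetical name. The logic mirrors check_rate_limit, except that it counts the call at the moment it is allowed.

```python
from datetime import datetime, timedelta


class HourlyBudget:
    """Fixed-window limiter: at most max_per_hour acquisitions per window."""

    def __init__(self, max_per_hour, now=None):
        self.max_per_hour = max_per_hour
        self.count = 0
        self.window_end = (now or datetime.now()) + timedelta(hours=1)

    def try_acquire(self, now=None):
        """Return True and count the call if budget remains, else False."""
        now = now or datetime.now()
        if now > self.window_end:
            # Window expired: start a fresh one from the current time
            self.count = 0
            self.window_end = now + timedelta(hours=1)
        if self.count >= self.max_per_hour:
            return False
        self.count += 1
        return True
```

A fixed window can let through up to twice the cap across a window boundary. For a budget guard like this one, that is usually acceptable.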

Docker Compose – The Full Setup

Here's the complete docker-compose.yml with the Container Doctor, a sample web server, API, and database:

version: '3.8'

services:
  container_doctor:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: container_doctor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - TARGET_CONTAINERS=web,api,db
      - CHECK_INTERVAL=10
      - LOG_LINES=50
      - AUTO_FIX=true
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - MAX_DIAGNOSES_PER_HOUR=20
    ports:
      - "8080:8080"
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  web:
    image: nginx:latest
    container_name: web
    ports:
      - "80:80"
    restart: unless-stopped

  api:
    build: ./api
    container_name: api
    environment:
      - DATABASE_URL=postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
      - POOL_SIZE=20
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  db_data:

And the Dockerfile:

FROM python:3.12-slim

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY container_doctor.py .

EXPOSE 8080

CMD ["python", "-u", "container_doctor.py"]

Start everything: docker compose up -d

Important: The socket mount (/var/run/docker.sock:/var/run/docker.sock) gives the Container Doctor full access to the Docker daemon. Don't copy .env into the Docker image either — it bakes your API key into the image layer. Pass environment variables via the compose file or at runtime.

Real Errors I Caught in Production

I've been running this for about 3 weeks now. Here are the actual incidents it caught:

Incident 1: OOM Kill (Week 1)

Logs showed a single word: Killed. That's Linux's OOMKiller doing its thing.

Claude's diagnosis:

{
    "root_cause": "Process killed by OOMKiller. Container is requesting more memory than the 256MB limit allows under load.",
    "severity": "high",
    "suggested_fix": "Increase memory limit to 512MB in docker-compose. Monitor if the leak continues at higher limits.",
    "auto_restart_safe": true,
    "config_suggestions": ["mem_limit: 512m", "memswap_limit: 1g"],
    "likely_recurring": true,
    "estimated_impact": "API is completely down. All requests return 502 from nginx."
}

The script restarted the container in 3 seconds. I updated the compose file the next morning. Before the Container Doctor, this would've been a 2-hour outage overnight.

Incident 2: Connection Pool Exhausted (Week 2)

ERROR: database connection pool exhausted
ERROR: cannot create new pool entry
ERROR: QueuePool limit of 5 overflow 0 reached

Claude caught that my pool size was too small for the number of workers:

{
    "root_cause": "SQLAlchemy connection pool (size=5) can't keep up with 8 concurrent Gunicorn workers. Each worker holds a connection during request processing.",
    "severity": "high",
    "suggested_fix": "Set POOL_SIZE=20 and add POOL_TIMEOUT=30. Long-term: add PgBouncer as a connection pooler.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "POOL_TIMEOUT=30", "POOL_RECYCLE=3600"],
    "likely_recurring": true,
    "estimated_impact": "New API requests will hang for 30s then timeout. Existing requests may complete but slowly."
}

Incident 3: Transient Timeout (Week 2)

WARN: timeout connecting to upstream service
WARN: retrying request (attempt 2/3)
INFO: request succeeded on retry

Claude correctly identified this as a non-issue:

{
    "root_cause": "Transient network timeout during a DNS resolution hiccup. Retries succeeded.",
    "severity": "low",
    "suggested_fix": "No action needed. This is expected during brief network blips. Only investigate if frequency increases.",
    "auto_restart_safe": false,
    "config_suggestions": [],
    "likely_recurring": false,
    "estimated_impact": "Minimal. Individual requests delayed by ~2s but all completed."
}

No restart. No alert either: I filter low-severity diagnoses out of my Slack pings, although the script as listed above notifies on every diagnosis. This is the right call: restarting on every transient timeout causes more downtime than it prevents.
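A minimal sketch of such a severity filter, assuming it is applied just before send_slack_alert in the monitoring loop. should_notify and the "medium" default threshold are hypothetical choices, not taken from the original script.

```python
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def should_notify(diagnosis, min_severity="medium"):
    """Return True if the diagnosis is severe enough to ping Slack."""
    severity = str(diagnosis.get("severity", "")).lower()
    rank = SEVERITY_RANK.get(severity)
    if rank is None:
        # Unknown severity: notify rather than silently hide it
        return True
    return rank >= SEVERITY_RANK[min_severity]
```

Low-severity diagnoses would still be recorded in diagnosis_history, so they remain visible through the /history endpoint.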

Incident 4: Disk Full (Week 3)

ERROR: could not write to temporary file: No space left on device
FATAL: data directory has no space

{
    "root_cause": "Postgres data volume is full. WAL files and temporary sort files consumed all available space.",
    "severity": "high",
    "suggested_fix": "1. Clean WAL files: SELECT pg_switch_wal(). 2. Increase volume size. 3. Add log rotation. 4. Set max_wal_size=1GB.",
    "auto_restart_safe": false,
    "config_suggestions": ["max_wal_size=1GB", "log_rotation_age=1d"],
    "likely_recurring": true,
    "estimated_impact": "Database is read-only. All writes fail. API returns 500 on any mutation."
}

Notice Claude said auto_restart_safe: false here. Restarting Postgres when the disk is full can corrupt data. The script didn't touch it. It just sent me a detailed Slack alert at 4 AM. I cleaned up the WAL files the next morning. Good call by Claude.

Cost Breakdown – What This Actually Costs

After 3 weeks of running this on 5 containers:

Claude API: ~$3.80/month (with rate limiting and deduplication)
Linode compute: $0 extra (the Container Doctor uses about 50MB RAM)
Slack: Free tier
My time saved: ~2-3 hours/month of 3 AM debugging

Without rate limiting, my first week cost $8 in API calls. The deduplication + rate limiter brought that down dramatically. Most of my containers run fine. The script only calls Claude when something actually breaks.

If you're monitoring more containers or have noisier logs, expect higher costs. The MAX_DIAGNOSES_PER_HOUR setting is your budget knob.
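To see how that knob translates into money, here is a rough estimator. It is only arithmetic under stated assumptions: the token counts and per-million-token prices in the usage lines are illustrative placeholders, so check current pricing for the model you use. estimate_monthly_cost is a hypothetical helper.

```python
def estimate_monthly_cost(diagnoses_per_day, input_tokens, output_tokens,
                          input_price_per_mtok, output_price_per_mtok,
                          days=30):
    """Estimate monthly API spend in dollars for a given diagnosis volume."""
    per_call = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return per_call * diagnoses_per_day * days


# Assumed inputs: 7 diagnoses a day, ~2,500 prompt tokens, ~400 output tokens
typical = estimate_monthly_cost(7, 2500, 400, 3.0, 15.0)

# Ceiling if the 20-per-hour cap were hit every hour of every day
ceiling = estimate_monthly_cost(20 * 24, 2500, 400, 3.0, 15.0)

print(f"typical: ${typical:.2f}/month, ceiling: ${ceiling:.2f}/month")
```

With these assumed inputs the typical figure comes out at a few dollars a month, while the ceiling is roughly two orders of magnitude higher. That gap is the argument for keeping deduplication in front of the rate limiter.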

Security Considerations

Let's talk about the elephant in the room: the Docker socket.

Mounting /var/run/docker.sock gives the Container Doctor root-equivalent access to your Docker daemon. It can start, stop, and remove any container. It can pull images. It can exec into running containers. If someone compromises the Container Doctor, they own your entire Docker host.

Here's how I mitigate this:

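Network isolation: The Container Doctor's health endpoint should only be reachable on localhost. Note that the "8080:8080" port mapping in the compose file above publishes it on all interfaces; binding it as "127.0.0.1:8080:8080" keeps it local. In production, put it behind a reverse proxy with auth.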
Read-mostly access: The script only reads logs and restarts containers. It never execs into containers, pulls images, or modifies volumes.
No external inputs: The script doesn't accept commands from Slack or any external source. It's outbound-only (logs out, alerts out).
API key rotation: I rotate the Anthropic API key monthly. If the container is compromised, the key has limited blast radius.

For a more secure setup, consider a tool like docker-socket-proxy to restrict which API calls the Container Doctor can make. Mounting the socket read-only (:ro) is not enough on its own, because a process can still connect to a read-only socket and issue any API call.

What I'd Do Differently

After 3 weeks in production, here's my honest retrospective:

I'd use structured logging from day one. My keyword-based error detection catches too many false positives. A JSON log format with severity levels would make detection way more accurate.
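As a sketch of what that could look like, assume each service writes one JSON object per line with "level" and "message" keys. That log shape is an assumption, as is the detect_structured_errors helper. Detection then keys off the level field instead of matching words anywhere in the text.

```python
import json

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}


def detect_structured_errors(logs, min_level="ERROR"):
    """Return parsed log records at or above min_level."""
    threshold = LEVELS[min_level]
    hits = []
    for line in logs.splitlines():
        # Skip any prefix (such as Docker's timestamp) before the JSON object
        start = line.find("{")
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).upper()
        if LEVELS.get(level, 0) >= threshold:
            hits.append(record)
    return hits
```

An INFO line such as "0 errors found" no longer triggers a diagnosis, which is exactly the kind of false positive that substring matching produces.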

I'd add per-container policies. Right now, every container gets the same treatment. But you probably want different rules for a database vs a web server. Never auto-restart a database. Always auto-restart a stateless web server.
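One way to express such policies is a per-container override merged over a default. The dictionary keys, the values, and the policy_for helper below are all hypothetical, a sketch of the idea rather than a tested design.

```python
DEFAULT_POLICY = {"auto_restart": True, "max_restarts_per_hour": 3}

# Per-container overrides; anything not listed falls back to the default
CONTAINER_POLICIES = {
    "db": {"auto_restart": False},        # never auto-restart a database
    "web": {"max_restarts_per_hour": 5},  # stateless, so restart more freely
}


def policy_for(container_name):
    """Return the effective policy for a container."""
    return {**DEFAULT_POLICY, **CONTAINER_POLICIES.get(container_name, {})}
```

apply_fix could then consult policy_for(container_name) in place of the global AUTO_FIX flag and the hard-coded limit of 3.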

I'd build a simple web UI. The /history endpoint returns JSON, but a small React dashboard showing a timeline of incidents, fix success rates, and cost tracking would be much more useful.

I'd try local models first. For simple errors (OOM, connection refused), a small local model running on Ollama could handle the diagnosis without any API cost. Reserve Claude for the weird, complex stack traces where you actually need strong reasoning.

I'd add a "learning mode." Run the Container Doctor in observe-only mode for a week. Let it diagnose everything but fix nothing. Review the diagnoses manually. Once you trust its judgment, flip on auto-fix. This builds confidence before you give it restart power.

What's Next?

If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish – Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.

Got questions or built something similar? Drop a comment below or find me on GitHub and LinkedIn.

Happy building.

How to Build a Production-Ready Flutter CI/CD Pipeline with GitHub Actions: Quality Gates, Environments, and Store Deployment

Oluwaseyi Fatunmole — Wed, 18 Mar 2026 22:58:15 +0000

Mobile application development has evolved over the years. The processes, structure, and syntax we use have changed, as well as the quality and flexibility of the apps we build.

One of the major improvements has been properly automated CI/CD pipelines, which give us continuous integration and continuous deployment with far less manual work.

In this article, I'll break down how you can automate and build a production-ready CI/CD pipeline for your Flutter application using GitHub Actions.

Note that there are other ways to do this, like with Codemagic (built specifically for Flutter apps – which I'll cover in a subsequent tutorial), but in this article we'll focus on GitHub Actions instead.

The Typical Workflow
Prerequisites
Pipeline Architecture
Writing the Workflows
Secrets and Configuration Reference
End-to-End Flow
Conclusion

The Typical Workflow

First, let's define the common approach to deploying production-ready Flutter apps.

The development team works locally, pushes to the repository for merge or review, and eventually runs flutter build apk or flutter build appbundle to generate the APK or app bundle. This then gets shared with the QA team manually, or deployed to Firebase App Distribution for testing. If it's a production release, the app bundle is submitted to the Google Play Store for review and then deployed.

This process is often fully manual with no automated checks, validation, or control over quality, speed, and seamlessness. Manually shipping a Flutter app starts out relatively simply, but can quickly and quietly turn into a liability. You run flutter build, switch configs, sign the build, upload it somewhere, and hope you didn’t mix up staging keys with production ones.

As teams grow and release updates more and more quickly, these manual steps become real risks. A skipped quality check, a missing keystore, or an incorrect base URL deployed to production can cost hours of debugging or worse – it can affect your users.

Automating this process fully involves some high-level configuration and predefined scripting. The pipeline then takes control of the deployment process from the moment a developer raises a PR into the common or base branch (for example, the develop branch).

This automated process takes care of everything that needs to be done – provided it has been predefined, properly scripted, and aligns with the use case of the team.

What we'll do here:

In this tutorial, we'll build a production-grade CI/CD pipeline for a Flutter app using GitHub Actions. The pipeline automates the entire lifecycle: pull-request quality checks, environment-specific configuration injection, Android and iOS builds, Firebase App Distribution for testers, Sentry symbol uploads, and final deployment to the Play Store and App Store.

By the end, every release – from a developer opening a PR to the final build landing in users' hands – will be fully automated, with no one touching a terminal.

Prerequisites

Before starting, you should have:

A Flutter app with working Android and iOS builds
Basic familiarity with GitHub Actions (workflows and jobs)
A Firebase project with App Distribution enabled
A Sentry project for error tracking
A Google Play Console app already created
An Apple Developer account with App Store Connect access
Fastlane configured for your iOS project
Basic Bash knowledge (I’ll explain the important parts)

Pipeline Architecture

In this guide, we'll be building a CI/CD pipeline with very precise instructions and use cases. These use cases determine the way your pipeline is built.

For this tutorial, we'll use this use case:

I want to automate the workflow on my development team based on the following criteria:

When a developer on the team raises a PR into the common working branch (develop in most cases), a workflow is triggered to run quality checks on the code. It only allows the merge to happen if all checks (like test coverage, quality checks, and static analysis) pass.
Code that's moving from the develop branch to the staging branch goes through another workflow that injects staging configurations/secret keys, does all the necessary checks, and distributes the application for testing on Firebase App Distribution for Android as well as TestFlight for iOS.
Code that's moving from the staging to the production branch goes through the production-level workflow, which involves secure APK signing, production configuration injection, running tests to ensure nothing breaks, Sentry analysis for monitoring, and submission to App Store Connect as well as Google Play Console.

These are our predefined conditions which help with the construction of our workflows.

Writing the Workflows

We'll split this pipeline into three GitHub Actions workflows.

We'll also be taking it a notch higher by creating three helper .sh scripts for a cleaner and more maintainable workflow.

In your project root, create two folders:

.github/
scripts/

The .github/ folder will hold the workflows we'll be creating for each use case, while the scripts/ folder will hold the helper scripts that we can easily call in our CLI or in the workflows directly.

After this, we'll create three workflow .yml files inside .github/workflows/:

pr_checks.yml
android.yml
ios.yml

Also in the scripts folder, let's create three .sh files:

generate_config.sh
quality_checks.sh
upload_symbols.sh

.github/
  workflows/
    pr_checks.yml
    android.yml
    ios.yml

scripts/
  generate_config.sh
  quality_checks.sh
  upload_symbols.sh

This workflow architecture ensures that a push to develop automatically produces a tester build. Also, merging to production ships directly to the stores without manual commands or config changes.

The scripts live outside the YAML on purpose. This lets you run the same logic locally.

The Helper Scripts

The scripts form the backbone of the pipeline. Each one has a single responsibility and is reused across workflows.

Instead of cramming logic into YAML, we'll move it into reusable scripts, which also keeps the workflows short and readable. Let's go through each one now.

Script #1: `generate_config.sh`

Injecting secrets safely is one of the hardest CI/CD problems in mobile apps.

The strategy:

Commit a Dart template file with placeholders
Replace placeholders at build time using secrets from GitHub Actions
Never commit real credentials

#!/usr/bin/env bash
set -euo pipefail


ENV_NAME=${1:-}
BASE_URL=${2:-}
ENCRYPTION_KEY=${3:-}

TEMPLATE="lib/core/env/env_ci.dart"
OUT="lib/core/env/env_ci.g.dart"

if [ -z "$ENV_NAME" ] || [ -z "$BASE_URL" ] || [ -z "$ENCRYPTION_KEY" ]; then
  echo "Usage: $0   "
  exit 2
fi

sed -e "s|<>|$BASE_URL|g" \
    -e "s|<>|$ENCRYPTION_KEY|g" \
    -e "s|<>|$ENV_NAME|g" \
    "$TEMPLATE" > "$OUT"

echo "Generated config for $ENV_NAME"

This script is responsible for injecting environment-specific configuration into the Flutter app at build time, without ever committing secrets to source control.

Let’s walk through it carefully.

1. Shebang: Choosing the Shell

#!/usr/bin/env bash

This line tells the system to execute the script using Bash, regardless of where Bash is installed on the machine.

Using /usr/bin/env bash instead of /bin/bash makes the script more portable across local machines, GitHub Actions runners, and Docker containers.

2. Fail Fast, Fail Loud

set -euo pipefail

This is one of the most important lines in the script.

It enables three strict Bash modes:

-e: Exit immediately if any command fails
-u: Exit if an undefined variable is used
-o pipefail: Fail if any command in a pipeline fails, not just the last one
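
To see what each flag changes, here's a small sketch you can run locally. It isn't part of the pipeline: each case runs in a child bash process so that a failure can't abort the demo itself, and NOT_DEFINED_ANYWHERE is a made-up variable name that's assumed to be unset.

```shell
#!/usr/bin/env bash
# Each case runs in a child shell so a failure cannot abort this demo.

# Without pipefail, a pipeline reports only the status of its last command.
status_default=$(bash -c 'false | true; echo $?')

# With pipefail, the failing first command makes the whole pipeline fail.
status_pipefail=$(bash -c 'set -o pipefail; false | true; echo $?')

# With -e, the child stops at the first failing command, so "after" never prints.
output_errexit=$(bash -c 'set -e; false; echo after' || true)

# With -u, expanding an unset variable aborts instead of yielding "".
if bash -c 'set -u; echo "$NOT_DEFINED_ANYWHERE"' 2>/dev/null; then
  status_nounset="ran"
else
  status_nounset="aborted"
fi

echo "default=$status_default pipefail=$status_pipefail errexit='$output_errexit' nounset=$status_nounset"
```

The pipefail case is the one that most often surprises people: without it, a failing command at the start of a pipeline goes unnoticed as long as the last command succeeds.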

This matters in CI because silent failures are dangerous, partial config generation can break production builds, and CI should stop immediately when something is wrong.

This line ensures that no broken config ever makes it into a build.

3. Reading Input Arguments


ENV_NAME=${1:-}
BASE_URL=${2:-}
ENCRYPTION_KEY=${3:-}

These lines read positional arguments passed to the script:

$1: Environment name (dev, staging, production)
$2: API base URL
$3: Encryption or API key

The ${1:-} syntax means:

“If the argument is missing, default to an empty string instead of crashing.”

This works hand-in-hand with set -u: we control the failure explicitly instead of letting Bash abort with an “unbound variable” error.

4. Defining Input and Output Files

TEMPLATE="lib/core/env/env_ci.dart"
OUT="lib/core/env/env_ci.g.dart"

Here we define two files:

Template file (env_ci.dart)
- Contains placeholder values like <>
- Safe to commit to Git
Generated file (env_ci.g.dart)
- Contains real environment values
- Must be ignored by Git (.gitignore)

At the heart of this approach are two Dart files with very different responsibilities. They may look similar, but they play completely different roles in the system.

`env_ci.dart`:

// lib/core/env/env_ci.dart

class EnvConfig {
  static const String baseUrl = '<>';
  static const String encryptionKey = '<>';
  static const String environment = '<>';
}

This file is safe, static, and version-controlled. It contains placeholders, not real values.

Some of its key characteristics are:

Contains no real secrets
Uses obvious placeholders (<>, etc.)
Safe to commit to Git
Reviewed like normal source code
Serves as the single source of truth for required config fields

Think of this file as a contract:

“These are the configuration values the app expects at runtime.”

`env_ci.g.dart`:

This file is created at build time by generate_config.sh. After substitution, it looks like this:

// lib/core/env/env_ci.g.dart
// GENERATED FILE — DO NOT COMMIT

class EnvConfig {
  static const String baseUrl = 'https://staging.api.example.com';
  static const String encryptionKey = 'sk_live_xxxxx';
  static const String environment = 'staging';
}

Key characteristics:

Contains real environment values
Generated dynamically in CI
Differs per environment (dev / staging / production)
Must never be committed to source control

This file exists only on a developer’s machine (if generated locally) or inside the CI runner during a build. On a GitHub-hosted runner, it disappears once the job finishes.

`.gitignore`:

To guarantee the generated file never leaks, it must be ignored:

lib/core/env/env_ci.g.dart

Why This Separation Is Critical

This design solves several hard problems at once.

Security:

Secrets live only in GitHub Actions secrets
They never appear in the repository
They never appear in PRs
They never appear in Git history

Environment Isolation:

Each environment gets its own generated config:

develop: dev API
staging: staging API
production: production API

The same codebase behaves differently without branching logic in Dart.

Deterministic Builds:

Every build is fully reproducible, fully automated, and explicit about which environment it targets.

There are no “it worked locally” scenarios.

5. Validating Required Arguments

if [ -z "$ENV_NAME" ] || [ -z "$BASE_URL" ] || [ -z "$ENCRYPTION_KEY" ]; then
  echo "Usage: $0   "
  exit 2
fi

This block enforces correct usage.

-z checks whether a variable is empty
If any required argument is missing:
- A helpful usage message is printed
- The script exits with a non-zero status code
0: success
1+: failure
2 conventionally means incorrect usage

In CI, this immediately fails the job and prevents an invalid build.
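
As a rough illustration of how the ${N:-} defaults and the -z check work together under set -u, here's the same validation wrapped in a function so you can try it without exiting your shell. The function name read_config_args and the wording of the usage message are assumptions made for this sketch, not part of the script above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# read_config_args ENV_NAME BASE_URL ENCRYPTION_KEY
# Hypothetical helper: returns 2 (incorrect usage) when an argument is missing.
read_config_args() {
  # ${N:-} yields "" for a missing argument, so set -u does not abort first.
  local env_name="${1:-}"
  local base_url="${2:-}"
  local encryption_key="${3:-}"

  if [ -z "$env_name" ] || [ -z "$base_url" ] || [ -z "$encryption_key" ]; then
    echo "Usage: generate_config.sh ENV_NAME BASE_URL ENCRYPTION_KEY" >&2
    return 2
  fi

  # Only the environment name is echoed; the key is never printed.
  echo "ok: $env_name"
}
```

Calling read_config_args staging "https://staging.api.example.com" with the third argument missing prints the usage line to stderr and returns 2, which is the same outcome CI sees from the real script.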

6. Injecting Environment Values

sed -e "s|<>|$BASE_URL|g" \
    -e "s|<>|$ENCRYPTION_KEY|g" \
    -e "s|<>|$ENV_NAME|g" \
    "$TEMPLATE" > "$OUT"

This is the heart of the script.

What’s happening here:

sed performs stream editing: it reads text, transforms it, and outputs the result
Each -e flag defines a replacement rule:
- Replace <> with the actual API URL
- Replace <> with the real key
- Replace <> with the environment label
The transformed output is written to env_ci.g.dart

This entire operation happens at build time:

No secrets are committed
No secrets are logged
No secrets persist beyond the CI run
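
One caveat with sed-based substitution: characters such as &, |, and \ have special meaning on the replacement side, so a base URL with a query string can come out mangled. Here's a minimal sketch of one way to guard against that. The __BASE_URL__-style tokens and both function names are hypothetical, so swap in whatever placeholders your template actually uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Escape the characters that are special in the replacement part of s|||:
# backslash, ampersand, and the | delimiter itself.
escape_sed_replacement() {
  printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'
}

# render_config TEMPLATE OUT BASE_URL ENCRYPTION_KEY ENV_NAME
# The __TOKEN__ placeholder names are assumptions made for this sketch.
render_config() {
  local template=$1 out=$2
  local base_url key env_name
  base_url=$(escape_sed_replacement "$3")
  key=$(escape_sed_replacement "$4")
  env_name=$(escape_sed_replacement "$5")

  sed -e "s|__BASE_URL__|$base_url|g" \
      -e "s|__ENCRYPTION_KEY__|$key|g" \
      -e "s|__ENV_NAME__|$env_name|g" \
      "$template" > "$out"
}
```

With the escaping in place, a value like https://api.example.com/v1?a=1&b=2 lands in the generated file unchanged. Values containing newlines are not handled by this sketch.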

7. Success Feedback

echo "Generated config for $ENV_NAME"

This line provides a clear success signal in CI logs.

It answers three important questions instantly:

Did the script run?
Did it finish successfully?
Which environment was generated?

In long CI logs, these small confirmations matter.

Alright, now let's move on to the second script.

Script #2: `quality_checks.sh`

This script defines what “good code” means for your team.

#!/usr/bin/env bash
set -euo pipefail

echo "Running quality checks"

dart format --output=none --set-exit-if-changed .
flutter analyze
flutter test --no-pub --coverage

if command -v dart_code_metrics >/dev/null 2>&1; then
  dart_code_metrics analyze lib --reporter=console || true
fi

echo "Quality checks passed"

Let's break down this script bit by bit.

1. Start & End Log Markers

echo "Running quality checks"
...
echo "Quality checks passed"

These two lines act as visual boundaries in CI logs.

In large pipelines (especially when Android and iOS jobs run in parallel), logs can be very noisy. Clear markers:

Help developers quickly find the quality phase
Make debugging faster
Confirm that the script completed successfully

The final success message only prints if everything above it passed, because set -e would have terminated the script earlier on failure.

So this line effectively means: All quality gates passed. Safe to proceed.
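
If you want the markers to do a little more work, GitHub Actions can fold log output between ::group:: and ::endgroup:: workflow commands. The wrapper below is a small sketch of that idea; run_grouped is a hypothetical helper name, and outside GitHub Actions the two marker lines are simply printed as plain text.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_grouped TITLE COMMAND [ARGS...]
# Wraps a command's output in a collapsible log group and preserves its status.
run_grouped() {
  local title=$1
  shift
  echo "::group::$title"
  local status=0
  "$@" || status=$?
  # Always close the group, even when the command failed.
  echo "::endgroup::"
  return "$status"
}
```

For example, run_grouped "Static analysis" flutter analyze would keep the analyzer output folded under one heading while still failing the script if the analyzer fails.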

2. Running the Test Suite

flutter test --no-pub --coverage

This line executes your entire Flutter test suite.

Let’s break it down carefully.

1. flutter test

This runs unit tests, widget tests, and any test under the test/ directory. If any test fails, the command exits with a non-zero status code.

Because we enabled set -e earlier, that immediately stops the script and fails the CI job.

2. --coverage

This flag generates a coverage report at:

coverage/lcov.info

This file can later be uploaded to Codecov, used to enforce minimum coverage thresholds, and tracked over time for quality improvement.

Even if you’re not enforcing coverage yet, generating it now future-proofs your pipeline.

3. Optional Code Metrics

if command -v dart_code_metrics >/dev/null 2>&1; then
  dart_code_metrics analyze lib --reporter=console || true
fi

This block is intentionally designed to be optional and non-blocking.

Step 1 – Check If the Tool Exists:

command -v dart_code_metrics >/dev/null 2>&1

This checks whether dart_code_metrics is installed.

If installed, proceed
If not installed, skip silently

The redirection:

>/dev/null hides normal output
2>&1 hides errors

This makes the script portable:

Developers without the tool can still run the script
CI can enforce it if configured

Step 2 – Run Metrics (Soft Enforcement):

dart_code_metrics analyze lib --reporter=console || true

This analyzes the lib/ directory and prints results in the console.

The important part is:

|| true

Because we enabled set -e, any failing command would normally stop the script.

Adding || true overrides that behavior:

If metrics report issues, the script continues and CI does not fail.

Why design it this way? Because metrics are often gradual improvements, technical debt indicators, or advisory rather than blocking.

You can later remove || true to make metrics mandatory.
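
One downside of a bare || true is that a failure leaves no trace in the logs. As a sketch of a slightly louder alternative, the hypothetical run_advisory helper below still never fails the script, but it prints a warning with the exit code when the wrapped command fails.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_advisory COMMAND [ARGS...]
# Runs a non-blocking check: failures are reported on stderr but never propagate.
run_advisory() {
  if "$@"; then
    echo "advisory check passed: $*"
  else
    # $? still holds the exit status of the command tested by the if.
    echo "WARNING: advisory check failed (exit $?): $*" >&2
  fi
}
```

In the quality script, this would be used as run_advisory dart_code_metrics analyze lib --reporter=console in place of the || true form.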

4. Final Success Message

echo "Quality checks passed"

This line only executes if formatting passed, static analysis passed, and tests passed.

If you see this in CI logs, it means the branch has successfully cleared the quality gate. It’s your automated approval before deployment steps begin.

What This Script Guarantees

With this in place, every branch must satisfy:

Clean formatting
No analyzer errors
Passing tests
(Optional) Healthy metrics

That’s how you move from “We try to maintain quality” to “Quality is enforced automatically.”

Alright, on to the third script.

Script #3: `upload_symbols.sh` (Sentry)

This script is responsible for uploading obfuscation debug symbols to Sentry so production crashes remain readable.

#!/usr/bin/env bash
set -euo pipefail

RELEASE=${1:-}

[ -z "$RELEASE" ] && exit 2

if ! command -v sentry-cli >/dev/null 2>&1; then
  exit 0
fi

sentry-cli releases new "$RELEASE" || true

sentry-cli upload-dif build/symbols || true

sentry-cli releases finalize "$RELEASE" || true

echo "✅ Symbols uploaded for release $RELEASE"

Let's go through it step by step.

1. Reading the Release Identifier

RELEASE=${1:-}

This reads the first positional argument passed to the script.

When you call the script in CI, it typically looks like:

./scripts/upload_symbols.sh "$(git rev-parse --short HEAD)"

So $1 becomes the short Git commit SHA.

Using ${1:-} ensures:

If no argument is passed, the variable becomes an empty string
The script does not crash due to set -u

This release value ties the uploaded symbols, deployed build, and crash reports all to the exact same commit. This linkage is critical for production debugging.

2. Validating the Release Argument

[ -z "$RELEASE" ] && exit 2

This is a compact validation check.

-z checks whether the string is empty
If it is empty → exit with status code 2

Conventionally:

0 = success
1+ = failure
2 = incorrect usage

This prevents symbol uploads from running without a release identifier, which would break traceability in Sentry.

3. Checking If `sentry-cli` Exists

if ! command -v sentry-cli >/dev/null 2>&1; then
  exit 0
fi

This block checks whether the sentry-cli tool is available in the environment.

What’s happening:

command -v sentry-cli checks if it exists
>/dev/null 2>&1 suppresses all output
! negates the condition

So this reads as: "If sentry-cli is NOT installed, exit successfully."

Why exit with 0 instead of failing?

Because not every environment needs symbol uploads. Also, dev builds may not install Sentry, and you don’t want CI to fail just because Sentry isn’t configured.

This makes symbol uploading environment-aware and optional.

Production environments can install sentry-cli, while dev environments skip it cleanly.

4. Creating a New Release in Sentry

sentry-cli releases new "$RELEASE" || true

This tells Sentry: “A new release exists with this version identifier.”

Even if the release already exists, the script continues because of:

|| true

This prevents the build from failing if:

The release was already created
The command returns a non-critical error

The goal is resilience, not strict enforcement.

5. Uploading Debug Information Files (DIFs)

sentry-cli upload-dif build/symbols || true

This is the core step.

build/symbols is generated when you build Flutter with:

--obfuscate --split-debug-info=build/symbols

When you obfuscate Flutter builds:

Method names are renamed
Stack traces become unreadable

The symbol files allow Sentry to reverse-map obfuscated stack traces and show readable crash reports.

Without this step, production crashes look like:

a.b.c.d (Unknown Source)

With this step, you get:

AuthRepository.login()

Again, || true ensures the build doesn’t fail if:

The directory doesn’t exist
No symbols were generated
Upload encounters a transient issue

Symbol uploads should not block deployment.
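
For the transient-failure case specifically, a few retries before giving up can recover more uploads than a single soft-failed attempt. The retry helper below is a hypothetical sketch of that pattern, not something sentry-cli provides; the attempt count and delay are arbitrary values you'd tune yourself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs the command until it succeeds or the attempt budget is used up.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

In the upload script, this could look like retry 3 5 sentry-cli upload-dif build/symbols || true, which keeps the step non-blocking while giving a flaky network a second and third chance.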

6. Finalizing the Release

sentry-cli releases finalize "$RELEASE" || true

This marks the release as complete in Sentry.

Finalizing signals:

The release is deployed
It can begin aggregating crash reports
It’s ready for production monitoring

Like the previous steps, this is soft-failed with || true to keep CI robust.

What This Script Guarantees

When everything is configured correctly:

Production build is obfuscated
Debug symbols are generated
Symbols are uploaded to Sentry
Crashes map back to real source code
Release version matches commit SHA

That’s production-grade crash observability.

Now that we've gone through the three helper scripts we've created to optimize and enhance this process, let's dive into the three workflow .yml files we're going to create.

Workflow #1: `pr_checks.yml`

This workflow will be designed to help ensure a certain standard is met once a PR is raised into a certain common or base branch. This will ensure that all quality checks in the incoming code pass before allowing any merge into the base branch.

This is basically a gate that verifies the quality of the code that's about to be merged into the base branch. If your pipeline allows unverified code into your base branch, then your CI becomes decorative, not protective.

Let's break down what's actually needed during every PR check.

1. Dependency Integrity

For Flutter apps, where we manage dependencies with the pub get command, it's important to confirm the integrity of all dependencies: that they resolve, are up to date, and are compatible with each other.

Every PR should begin with:

flutter pub get

This ensures:

pubspec.yaml is valid
Dependency constraints are consistent
Lockfiles are not broken
The project is buildable in a clean environment

If this fails, the branch is not deployable.

2. Static Analysis

This ensures code quality and architecture integrity. Static analysis helps prevent common issues like forgotten await, dead code, null safety violations, async misuse, and so on.

Most production bugs aren't business logic errors – they're structural carelessness. Static analysis helps enforce consistency automatically, so code reviews focus on intent, not linting.

flutter analyze --fatal-infos --fatal-warnings

3. Formatting

This command ensures that your code is properly formatted based on your organization's coding standard and policies.

dart format --output=none --set-exit-if-changed .

4. Tests

This runs the unit, widget and business logic tests to ensure quality and avoid regression leaks, silent behavior changes and feature drift.

flutter test --coverage

5. Test Coverage Enforcement

Running tests alone is not enough. Your workflow should also enforce a minimum coverage threshold:

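COVERAGE=$(lcov --summary coverage/lcov.info 2>&1 | grep lines | awk '{print $2}' | sed 's/%//')
# lcov prints a decimal such as 85.7, which the integer-only -lt test rejects, so awk compares
if awk -v c="$COVERAGE" 'BEGIN { exit !(c < 70) }'; then
  echo "Coverage too low"
  exit 1
fi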

The command above ensures that a minimum test coverage of 70% is met. With this in place, quality becomes measurable.
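
If lcov isn't installed on the runner, the same threshold can be computed straight from coverage/lcov.info, because the file records LF (lines found) and LH (lines hit) for every source file. The two function names below are hypothetical, and this is a sketch of the idea rather than a replacement for a full coverage tool.

```shell
#!/usr/bin/env bash
set -euo pipefail

# coverage_percent LCOV_FILE
# Sums the LF (lines found) and LH (lines hit) records and prints a percentage.
coverage_percent() {
  awk -F: '
    $1 == "LF" { found += $2 }
    $1 == "LH" { hit += $2 }
    END {
      if (found == 0) { print "0.0" }
      else { printf "%.1f\n", 100 * hit / found }
    }
  ' "$1"
}

# coverage_meets_threshold LCOV_FILE MIN_PERCENT
# Succeeds when coverage is at or above MIN_PERCENT. awk does the comparison
# because the shell's -lt and -ge operators only accept integers.
coverage_meets_threshold() {
  local percent
  percent=$(coverage_percent "$1")
  echo "Line coverage: ${percent}%"
  awk -v p="$percent" -v min="$2" 'BEGIN { exit !(p >= min) }'
}
```

A workflow step could then run coverage_meets_threshold coverage/lcov.info 70 || { echo "Coverage too low"; exit 1; } right after flutter test --coverage.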

At a minimum, a quality gate should run the five checks above to protect code quality, security, and integrity.

Now here is the full pr_checks.yml file:

name: PR Quality Gate

on:
  pull_request:
    branches: develop
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  pr-checks:
    name: Run quality checks on this pull request
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Setup Java
        uses: actions/setup-java@v1
        with:
          java-version: "12.x"

      - name: Setup Flutter
        uses: subosito/flutter-action@v1
        with:
          channel: "stable"

      - name: Install dependencies
        run: flutter pub get

      - name: Run quality checks
        run: ./scripts/quality_checks.sh

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "PR Quality Checks PASSED"
          echo "PR: ${{ github.event.pull_request.html_url }}"
          echo "Branch: ${{ github.head_ref }} → ${{ github.base_ref }}"
          echo "By: @${{ github.actor }}"
          echo "Team notification: @foluwaseyi-dev @olabodegbolu"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "PR Quality Checks FAILED"
          echo "PR: ${{ github.event.pull_request.html_url }}"
          echo "Branch: ${{ github.head_ref }} → ${{ github.base_ref }}"
          echo "By: @${{ github.actor }}"
          echo "Please fix the issues before requesting review 🔧"
          echo "Team notification: @foluwaseyi-dev @olabodegbolu"

Every time a developer opens (or updates) a pull request targeting the develop branch, this workflow kicks in automatically. Think of it as a bouncer at the door: no code gets through without passing inspection first.

What Triggers it?

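The workflow fires on four events: when a PR is opened, synchronized (new commits pushed), reopened, or marked ready_for_review. Note that opened and synchronize also fire for draft PRs, so drafts are not skipped automatically. If you only want to check PRs that are ready to be looked at, add a job-level condition such as if: github.event.pull_request.draft == false.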

What Does it Actually Do?

It spins up a fresh Ubuntu machine and runs five steps in sequence:

Checkout: pulls down the branch's code
Setup Java 12: installs the JDK (likely a dependency for some tooling or build process)
Setup Flutter (stable channel): this is a Flutter project, so it grabs the stable Flutter SDK
Install dependencies: runs flutter pub get to pull all Dart/Flutter packages
Run quality checks: executes the helper shell script (./scripts/quality_checks.sh) that we created, which runs the formatting check, static analysis, the test suite, and the optional code metrics

The Notification Layer

After the checks run, the workflow reports the verdict and it's context-aware:

If everything passes, it logs a success message with the PR URL, branch info, and the person who opened it
If something fails, it logs a failure message and nudges the author to fix issues before requesting a review

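Both outcomes mention @foluwaseyi-dev and @olabodegbolu, the two team members responsible for staying in the loop. Because these are plain echo statements, the mentions appear in the job log rather than as GitHub notifications.

This workflow enforces a "fix it before you merge it" culture. To make the gate binding, mark this check as required in the branch protection rules for develop. Without that rule, a failing run is only advisory and the PR can still be merged.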

Workflow #2: `android.yml`

It's a better practice to split your workflows based on platform. This helps you properly manage the instructions regarding each platform. This is the reason behind keeping the Android workflow separate.

Unlike pr_checks.yml, this workflow presumes that all checks for quality and standards have been done and the code that runs this workflow already meets the required standards.

Based on our predefined use case, let's create a workflow to handle test deployments when merged to develop or staging, and production level activities when merged to production.

name: Android Build & Release

on:
  push:
    branches:
      - develop
      - staging
      - production

jobs:
  android:
    runs-on: ubuntu-latest
    env:
      FLUTTER_VERSION: 'stable'

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Java
        uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '11'

      - name: Setup Flutter
        uses: subosito/flutter-action@v2
        with:
          flutter-version: ${{ env.FLUTTER_VERSION }}

      - name: Install dependencies
        run: flutter pub get

      - name: Determine environment
        id: env
        run: |
          echo "branch=${GITHUB_REF##*/}" >> $GITHUB_OUTPUT
          if [ "${GITHUB_REF##*/}" = "develop" ]; then
            echo "ENV=dev" >> $GITHUB_OUTPUT
          elif [ "${GITHUB_REF##*/}" = "staging" ]; then
            echo "ENV=staging" >> $GITHUB_OUTPUT
          else
            echo "ENV=production" >> $GITHUB_OUTPUT
          fi

      # Dev uses hardcoded values, so no secrets are needed
      - name: Generate config (dev)
        if: steps.env.outputs.ENV == 'dev'
        run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key"

      # Staging and production inject real secrets
      - name: Generate config (staging/production)
        if: steps.env.outputs.ENV != 'dev'
        run: |
          if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then
            ./scripts/generate_config.sh staging \
              "${{ secrets.STAGING_BASE_URL }}" \
              "${{ secrets.STAGING_API_KEY }}"
          else
            ./scripts/generate_config.sh production \
              "${{ secrets.PROD_BASE_URL }}" \
              "${{ secrets.PROD_API_KEY }}"
          fi

      # Keystore is only needed for signed builds (staging & production)
      - name: Restore Keystore
        if: steps.env.outputs.ENV != 'dev'
        run: |
          echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode > android/app/upload-keystore.jks

      # Production builds are obfuscated + split debug info for Play Store
      - name: Build artifact
        run: |
          if [ "${{ steps.env.outputs.ENV }}" = "production" ]; then
            flutter build appbundle --release \
              --obfuscate \
              --split-debug-info=build/symbols
          else
            flutter build appbundle --release
          fi

      # Dev and staging go to Firebase App Distribution for internal testing
      - name: Upload to Firebase App Distribution
        if: steps.env.outputs.ENV == 'dev' || steps.env.outputs.ENV == 'staging'
        env:
          FIREBASE_TOKEN: ${{ secrets.FIREBASE_TOKEN }}
          FIREBASE_ANDROID_APP_ID: ${{ secrets.FIREBASE_ANDROID_APP_ID }}
          FIREBASE_GROUPS: ${{ secrets.FIREBASE_GROUPS }}
        run: |
          firebase appdistribution:distribute \
            build/app/outputs/bundle/release/app-release.aab \
            --app "$FIREBASE_ANDROID_APP_ID" \
            --groups "$FIREBASE_GROUPS" \
            --token "$FIREBASE_TOKEN"

      # Only production goes to the Play Store
      - name: Upload to Play Store
        if: steps.env.outputs.ENV == 'production'
        uses: r0adkll/upload-google-play@v1
        with:
          serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT_JSON }}
          packageName: com.your.package
          releaseFiles: build/app/outputs/bundle/release/app-release.aab
          track: production

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "Android Build & Release PASSED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "Android Build & Release FAILED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"
          echo "Check the logs and fix the issue before retrying"

This workflow runs on a fresh Ubuntu machine whenever code lands on the develop, staging, or production branch.

It's triggered by a simple push to any of the tracked branches, with no manual intervention needed.

Let's walk through it piece by piece.

1. The Setup Phase

Before any Flutter-specific work happens, the workflow lays the foundation:

Checkout: grabs the latest code from the branch that triggered the run (using the more modern actions/checkout@v3).
Java 11 via Temurin: this is an upgrade from the first workflow we created. Instead of the older setup-java@v1, this uses setup-java@v3 with the temurin distribution, which is Eclipse's open-source JDK build and a widely used choice for Android toolchains.
Flutter (stable): this pulls the Flutter SDK using the job-level FLUTTER_VERSION environment variable (FLUTTER_VERSION: 'stable'). Note that 'stable' names a release channel rather than an exact release, so set a specific version number here if you need every run to use the same SDK.
Install dependencies: this ensures we run flutter pub get to pull all packages

2. Environment Detection

This is where it gets interesting. This workflow also checks and determines the environment which will help us define the next set of instructions to run.

This step reads the branch name from GITHUB_REF and maps it to the environment label that generate_config.sh expects as its first argument.

develop → ENV=dev
staging → ENV=staging
production → ENV=production

It strips the branch name from the full ref path using ${GITHUB_REF##*/}, then writes both the branch name and the resolved ENV value to $GITHUB_OUTPUT, making them available as named outputs (steps.env.outputs.ENV) for every subsequent step.

This means the rest of the pipeline can branch its behaviour based on which environment it's building for: different API keys, different signing configs, different targets, or whatever else the app needs.
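
Because the mapping is plain Bash, you can pull it into a function and try it locally before trusting it in CI. The sketch below is one possible variation, with a hypothetical resolve_env function: it strips only the refs/heads/ prefix, so a branch like feature/production is not mistaken for production, and it fails on unmapped branches instead of defaulting to production.

```shell
#!/usr/bin/env bash
set -euo pipefail

# resolve_env GIT_REF
# Maps a full ref such as refs/heads/develop to an environment label.
resolve_env() {
  # Remove only the leading refs/heads/ so slashes in branch names survive.
  local branch=${1#refs/heads/}
  case "$branch" in
    develop)    echo "dev" ;;
    staging)    echo "staging" ;;
    production) echo "production" ;;
    *)
      echo "No environment mapped to branch '$branch'" >&2
      return 1
      ;;
  esac
}
```

Inside the workflow step, the result could be written out with echo "ENV=$(resolve_env "$GITHUB_REF")" >> "$GITHUB_OUTPUT". Failing on unknown branches is a design choice: it costs a red build, but it means a typo in the branch filter can never trigger a production release.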

3. Config Injection

With the environment resolved, the next step is injecting the right configuration into the app. This is where the generate_config.sh script we built earlier gets called directly from the workflow.

For the dev environment, hardcoded placeholder values are used. No real secrets are needed, since this build is only meant for internal developer testing:

- name: Generate config (dev)
  if: steps.env.outputs.ENV == 'dev'
  run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key"

For staging and production, however, real secrets are pulled from GitHub Actions secrets and passed directly into the script:

- name: Generate config (staging/production)
  if: steps.env.outputs.ENV != 'dev'
  run: |
    if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then
      ./scripts/generate_config.sh staging \
        "${{ secrets.STAGING_BASE_URL }}" \
        "${{ secrets.STAGING_API_KEY }}"
    else
      ./scripts/generate_config.sh production \
        "${{ secrets.PROD_BASE_URL }}" \
        "${{ secrets.PROD_API_KEY }}"
    fi

Notice that these two steps use an if condition to make them mutually exclusive. Only one will ever run per job. This keeps the pipeline clean: no complicated branching logic inside the script itself, just a clear decision at the workflow level.

4. Keystore Restoration

Android requires signed builds for distribution. The signing keystore file cannot be committed to the repository for obvious security reasons, so it's stored as a Base64-encoded GitHub secret and decoded at build time.

- name: Restore Keystore
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode > android/app/upload-keystore.jks

This step is skipped entirely for the dev environment because dev builds are unsigned debug artifacts meant purely for internal testing on Firebase App Distribution. Only staging and production builds need to be properly signed.

To encode your keystore file as a Base64 string for storing in GitHub secrets, run this locally (the command is for macOS, where pbcopy is available):

base64 -i upload-keystore.jks | pbcopy

This copies the encoded string to your clipboard, which you can then paste directly into your GitHub repository secrets.
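
Before pasting the encoded string into GitHub, it can be worth confirming that it decodes back to an identical file. The sketch below does that round trip. The function name verify_base64_roundtrip is hypothetical, and the sample file is a stand-in rather than a real keystore.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks that a file survives a Base64 encode/decode
# round trip byte for byte. Not part of the article's scripts.
verify_base64_roundtrip() {
  local file="$1"
  local tmp
  tmp="$(mktemp)"
  # Reading from stdin sidesteps differences in how the macOS and GNU
  # versions of base64 take file arguments. tr removes any line wrapping
  # so the encoded value is a single line, as it would be in a secret.
  base64 < "$file" | tr -d '\n' | base64 --decode > "$tmp"
  if cmp -s "$file" "$tmp"; then
    echo "round trip OK"
    rm -f "$tmp"
    return 0
  fi
  echo "round trip FAILED" >&2
  rm -f "$tmp"
  return 1
}

# Demo with a throwaway file instead of a real keystore.
sample="$(mktemp)"
printf 'not a real keystore' > "$sample"
verify_base64_roundtrip "$sample"   # prints: round trip OK
rm -f "$sample"
```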

5. Building the Artifact

With the environment configured and the keystore in place, the workflow builds the app bundle:

- name: Build artifact
  run: |
    if [ "${{ steps.env.outputs.ENV }}" = "production" ]; then
      flutter build appbundle --release \
        --obfuscate \
        --split-debug-info=build/symbols
    else
      flutter build appbundle --release
    fi

There's a deliberate difference between how production and non-production builds are compiled.

For production:

--obfuscate renames method and class names in the compiled output, making it significantly harder to reverse engineer the app
--split-debug-info=build/symbols extracts the debug symbols into a separate directory at build/symbols

These symbols are what upload_symbols.sh later ships to Sentry, so obfuscated crash reports remain readable in your monitoring dashboard.

For dev and staging, neither flag is used. This keeps build times faster and makes local debugging easier since stack traces remain human-readable.
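
The same decision can be expressed as a small Bash function, which makes it easier to reproduce a CI build from a terminal. This is only a sketch: flutter_build_flags is a hypothetical name, and the function prints the flags rather than running Flutter.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the `flutter build appbundle` flags
# the workflow uses for a given environment, one per line.
flutter_build_flags() {
  local env_name="$1"
  printf '%s\n' "--release"
  if [ "$env_name" = "production" ]; then
    # Production only: obfuscate symbol names and write the debug info
    # to build/symbols so crash reports can be symbolicated later.
    printf '%s\n' "--obfuscate"
    printf '%s\n' "--split-debug-info=build/symbols"
  fi
}

flutter_build_flags production
# prints:
# --release
# --obfuscate
# --split-debug-info=build/symbols
```

Locally, the output could be passed straight to Flutter, for example `flutter build appbundle $(flutter_build_flags staging)`.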

6. Distributing to Firebase App Distribution

Once the app bundle is built, dev and staging builds are uploaded to Firebase App Distribution so testers can install them immediately:

- name: Upload to Firebase App Distribution
  if: steps.env.outputs.ENV == 'dev' || steps.env.outputs.ENV == 'staging'
  env:
    FIREBASE_TOKEN: ${{ secrets.FIREBASE_TOKEN }}
    FIREBASE_ANDROID_APP_ID: ${{ secrets.FIREBASE_ANDROID_APP_ID }}
    FIREBASE_GROUPS: ${{ secrets.FIREBASE_GROUPS }}
  run: |
    firebase appdistribution:distribute \
      build/app/outputs/bundle/release/app-release.aab \
      --app "$FIREBASE_ANDROID_APP_ID" \
      --groups "$FIREBASE_GROUPS" \
      --token "$FIREBASE_TOKEN"

Three secrets power this step:

FIREBASE_TOKEN: the authentication token generated from firebase login:ci
FIREBASE_ANDROID_APP_ID: the app identifier from the Firebase console
FIREBASE_GROUPS: the tester group(s) that should receive the build notification

Once this step completes, every tester in the specified groups receives an email with a direct download link. No one needs to manually share an APK file over Slack or email.

7. Deploying to the Play Store

Production builds skip Firebase entirely and go straight to the Google Play Store:

- name: Upload to Play Store
  if: steps.env.outputs.ENV == 'production'
  uses: r0adkll/upload-google-play@v1
  with:
    serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT_JSON }}
    packageName: com.your.package
    releaseFiles: build/app/outputs/bundle/release/app-release.aab
    track: production

This uses the r0adkll/upload-google-play GitHub Action, which handles the Google Play API interaction under the hood. The only requirements are:

A Google Play service account with the correct permissions, stored as a JSON secret
The correct package name matching what is registered in your Play Console
The track set to production (you can also use internal, alpha, or beta depending on your release strategy)

Replace com.your.package with your actual application ID (the same one defined in your build.gradle file).

8. The Notification Layer

Just like the PR checks workflow, this workflow reports its outcome clearly:

- name: Notify Team (Success)
  if: success()
  run: |
    echo "Android Build & Release PASSED"
    echo "Environment: ${{ steps.env.outputs.ENV }}"
    echo "Branch: ${{ steps.env.outputs.branch }}"
    echo "By: @${{ github.actor }}"
    echo "Commit: ${{ github.sha }}"

- name: Notify Team (Failure)
  if: failure()
  run: |
    echo "Android Build & Release FAILED"
    echo "Environment: ${{ steps.env.outputs.ENV }}"
    echo "Branch: ${{ steps.env.outputs.branch }}"
    echo "By: @${{ github.actor }}"
    echo "Commit: ${{ github.sha }}"
    echo "Check the logs and fix the issue before retrying 🔧"

The success notification includes the environment, branch, actor, and commit SHA, which is everything needed to trace exactly what was deployed and who triggered it.

The failure notification includes the same context, with a clear call to action.
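
Because the two steps differ only in their status line and final hint, the message could be produced by one shared helper. The sketch below assumes a hypothetical function named format_notification; the actor and commit values in the example call are placeholders. In a real workflow its output could be echoed to the log as above, or appended to the job summary file that GitHub Actions exposes as $GITHUB_STEP_SUMMARY.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the notification text used by both steps.
# Arguments: status (PASSED or FAILED), environment, branch, actor, commit SHA.
format_notification() {
  local status="$1"
  local env_name="$2"
  local branch="$3"
  local actor="$4"
  local sha="$5"
  printf 'Android Build & Release %s\n' "$status"
  printf 'Environment: %s\n' "$env_name"
  printf 'Branch: %s\n' "$branch"
  printf 'By: @%s\n' "$actor"
  printf 'Commit: %s\n' "$sha"
  # Only failures carry the extra call to action.
  if [ "$status" = "FAILED" ]; then
    printf 'Check the logs and fix the issue before retrying\n'
  fi
}

# Placeholder values for illustration.
format_notification PASSED staging staging octocat abc1234
```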

Workflow #3: ios.yml

iOS CI/CD is more complex than Android by nature. This is because Apple's signing requirements involve certificates, provisioning profiles, and entitlements that all need to be in the right place before Xcode will produce a valid archive.

Fastlane handles all of that complexity, and the workflow simply calls into it.

Here is the full ios.yml:

name: iOS Build & Release

on:
  push:
    branches:
      - develop
      - staging
      - production

jobs:
  ios:
    runs-on: macos-latest
    env:
      FLUTTER_VERSION: 'stable'

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Flutter
        uses: subosito/flutter-action@v2
        with:
          flutter-version: ${{ env.FLUTTER_VERSION }}

      - name: Install dependencies
        run: flutter pub get

      - name: Determine environment
        id: env
        run: |
          echo "branch=${GITHUB_REF##*/}" >> $GITHUB_OUTPUT
          if [ "${GITHUB_REF##*/}" = "develop" ]; then
            echo "ENV=dev" >> $GITHUB_OUTPUT
          elif [ "${GITHUB_REF##*/}" = "staging" ]; then
            echo "ENV=staging" >> $GITHUB_OUTPUT
          else
            echo "ENV=production" >> $GITHUB_OUTPUT
          fi

      - name: Generate config (dev)
        if: steps.env.outputs.ENV == 'dev'
        run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key"

      - name: Generate config (staging/production)
        if: steps.env.outputs.ENV != 'dev'
        run: |
          if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then
            ./scripts/generate_config.sh staging \
              "${{ secrets.STAGING_BASE_URL }}" \
              "${{ secrets.STAGING_API_KEY }}"
          else
            ./scripts/generate_config.sh production \
              "${{ secrets.PROD_BASE_URL }}" \
              "${{ secrets.PROD_API_KEY }}"
          fi

      - name: Install Fastlane
        run: |
          cd ios
          gem install bundler
          bundle install

      - name: Import signing certificate
        if: steps.env.outputs.ENV != 'dev'
        run: |
          echo "${{ secrets.IOS_CERTIFICATE_BASE64 }}" | base64 --decode > ios/cert.p12
          security create-keychain -p "" build.keychain
          security import ios/cert.p12 -k build.keychain -P "${{ secrets.IOS_CERTIFICATE_PASSWORD }}" -T /usr/bin/codesign
          security list-keychains -s build.keychain
          security default-keychain -s build.keychain
          security unlock-keychain -p "" build.keychain
          security set-key-partition-list -S apple-tool:,apple: -s -k "" build.keychain

      - name: Install provisioning profile
        if: steps.env.outputs.ENV != 'dev'
        run: |
          echo "${{ secrets.IOS_PROVISIONING_PROFILE_BASE64 }}" | base64 --decode > profile.mobileprovision
          mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
          cp profile.mobileprovision ~/Library/MobileDevice/Provisioning\ Profiles/

      - name: Build iOS (dev)
        if: steps.env.outputs.ENV == 'dev'
        run: flutter build ios --release --no-codesign

      - name: Build & distribute to TestFlight (staging)
        if: steps.env.outputs.ENV == 'staging'
        env:
          APP_STORE_CONNECT_API_KEY_ID: ${{ secrets.APP_STORE_CONNECT_API_KEY_ID }}
          APP_STORE_CONNECT_API_ISSUER_ID: ${{ secrets.APP_STORE_CONNECT_API_ISSUER_ID }}
          APP_STORE_CONNECT_API_KEY_CONTENT: ${{ secrets.APP_STORE_CONNECT_API_KEY_CONTENT }}
        run: |
          cd ios
          bundle exec fastlane beta

      - name: Build & release to App Store (production)
        if: steps.env.outputs.ENV == 'production'
        env:
          APP_STORE_CONNECT_API_KEY_ID: ${{ secrets.APP_STORE_CONNECT_API_KEY_ID }}
          APP_STORE_CONNECT_API_ISSUER_ID: ${{ secrets.APP_STORE_CONNECT_API_ISSUER_ID }}
          APP_STORE_CONNECT_API_KEY_CONTENT: ${{ secrets.APP_STORE_CONNECT_API_KEY_CONTENT }}
        run: |
          cd ios
          bundle exec fastlane release

      - name: Upload Sentry symbols (production only)
        if: steps.env.outputs.ENV == 'production'
        env:
          SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
          SENTRY_ORG: ${{ secrets.SENTRY_ORG }}
          SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }}
        run: ./scripts/upload_symbols.sh $(git rev-parse --short HEAD)

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "iOS Build & Release PASSED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "iOS Build & Release FAILED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"
          echo "Check the logs and fix the issue before retrying 🔧"

Let's walk through what is different about this workflow compared to the Android one.

1. macOS Runner

runs-on: macos-latest

This is the major difference.

iOS builds require Xcode, which only runs on macOS. GitHub Actions provides hosted macOS runners, but they are significantly more expensive in terms of compute minutes than Ubuntu runners. Just keep that in mind when thinking about build frequency.

No Java setup is needed here. Flutter on iOS compiles through Xcode directly, so the toolchain requirements are different.

2. Installing Fastlane

- name: Install Fastlane
  run: |
    cd ios
    gem install bundler
    bundle install

Fastlane is a Ruby-based automation tool that handles certificate management, building, and uploading to TestFlight and the App Store.

This step navigates into the ios/ directory and installs Fastlane along with all its dependencies as defined in the project's Gemfile.

Your ios/Gemfile should look something like this:

source "https://rubygems.org"

gem "fastlane"

And your ios/fastlane/Fastfile should define at minimum two lanes: one for staging (TestFlight) and one for production (App Store):

default_platform(:ios)

platform :ios do
  lane :beta do
    build_app(scheme: "Runner", export_method: "app-store")
    upload_to_testflight(skip_waiting_for_build_processing: true)
  end

  lane :release do
    build_app(scheme: "Runner", export_method: "app-store")
    upload_to_app_store(force: true, skip_screenshots: true, skip_metadata: true)
  end
end

3. Certificate and Provisioning Profile Setup

This is the step that trips most teams up the first time. Apple's code signing requires two things to be present on the machine:

The signing certificate (a .p12 file)
The provisioning profile

Both are stored as Base64-encoded GitHub secrets and restored at build time.

- name: Import signing certificate
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.IOS_CERTIFICATE_BASE64 }}" | base64 --decode > ios/cert.p12
    security create-keychain -p "" build.keychain
    security import ios/cert.p12 -k build.keychain -P "${{ secrets.IOS_CERTIFICATE_PASSWORD }}" -T /usr/bin/codesign
    security list-keychains -s build.keychain
    security default-keychain -s build.keychain
    security unlock-keychain -p "" build.keychain
    security set-key-partition-list -S apple-tool:,apple: -s -k "" build.keychain

Breaking this down step by step:

Decodes the Base64 certificate and writes it to ios/cert.p12
Creates a temporary keychain called build.keychain with an empty password
Imports the certificate into that keychain, granting codesign access
Sets the keychain search list to build.keychain and makes it the default keychain so Xcode finds it automatically
Unlocks the keychain so it can be used non-interactively
Sets the key partition list so codesign can use the key without repeated prompts

The provisioning profile step is simpler:

- name: Install provisioning profile
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.IOS_PROVISIONING_PROFILE_BASE64 }}" | base64 --decode > profile.mobileprovision
    mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
    cp profile.mobileprovision ~/Library/MobileDevice/Provisioning\ Profiles/

It decodes the profile and copies it into the exact directory where Xcode expects to find provisioning profiles on any macOS system.

To encode your certificate and profile locally, you can run these:

base64 -i Certificates.p12 | pbcopy   # for the certificate
base64 -i YourApp.mobileprovision | pbcopy   # for the provisioning profile

4. Building for Each Environment

Dev builds skip signing entirely. They're built without code signing just to verify the project compiles correctly on a clean machine:

- name: Build iOS (dev)
  if: steps.env.outputs.ENV == 'dev'
  run: flutter build ios --release --no-codesign

Staging builds go through Fastlane's beta lane, which builds and uploads to TestFlight. Production builds go through Fastlane's release lane, which submits directly to App Store Connect.

Both staging and production steps consume the same three App Store Connect API key secrets: the key ID, the issuer ID, and the key content itself.

Fastlane uses these to authenticate with Apple's API without requiring a manual Apple ID login.

5. Sentry Symbol Upload

On production iOS builds, the upload_symbols.sh script runs after the build completes, passing the current short commit SHA as the release identifier:

- name: Upload Sentry symbols (production only)
  if: steps.env.outputs.ENV == 'production'
  env:
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
    SENTRY_ORG: ${{ secrets.SENTRY_ORG }}
    SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }}
  run: ./scripts/upload_symbols.sh $(git rev-parse --short HEAD)

This is the same script explained earlier in the helper scripts section. It creates a Sentry release, uploads the debug information files, and finalizes the release. Any production crash from this point forward will map back to real, readable source code in your Sentry dashboard.

Secrets and Configuration Reference

For this entire pipeline to work, you need to configure the following secrets in your GitHub repository. Go to Settings → Secrets and variables → Actions → New repository secret to add each one.

Shared (used across environments):

Secret	Description
`FIREBASE_TOKEN`	Generated via `firebase login:ci` on your local machine
`FIREBASE_ANDROID_APP_ID`	Android app ID from your Firebase console
`FIREBASE_GROUPS`	Comma-separated tester group names in Firebase
`SENTRY_AUTH_TOKEN`	Auth token from your Sentry account settings
`SENTRY_ORG`	Your Sentry organization slug
`SENTRY_PROJECT`	Your Sentry project slug

Staging:

Secret	Description
`STAGING_BASE_URL`	Your staging API base URL
`STAGING_API_KEY`	Your staging API or encryption key

Production:

Secret	Description
`PROD_BASE_URL`	Your production API base URL
`PROD_API_KEY`	Your production API or encryption key

Android:

Secret	Description
`ANDROID_KEYSTORE_BASE64`	Base64-encoded `.jks` keystore file
`GOOGLE_PLAY_SERVICE_ACCOUNT_JSON`	Full JSON content of your Play Console service account

iOS:

Secret	Description
`IOS_CERTIFICATE_BASE64`	Base64-encoded `.p12` signing certificate
`IOS_CERTIFICATE_PASSWORD`	Password protecting the `.p12` file
`IOS_PROVISIONING_PROFILE_BASE64`	Base64-encoded `.mobileprovision` file
`APP_STORE_CONNECT_API_KEY_ID`	Key ID from App Store Connect → Users & Access → Keys
`APP_STORE_CONNECT_API_ISSUER_ID`	Issuer ID from the same App Store Connect page
`APP_STORE_CONNECT_API_KEY_CONTENT`	The full content of the downloaded `.p8` key file

None of these values should ever appear in your codebase. If any secret is accidentally committed, rotate it immediately.
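
A secret that has not been configured expands to an empty string in a workflow, so a missing value can go unnoticed until a build fails much later. A preflight check along the following lines could fail fast instead. This is a sketch under assumptions: the function name require_env_vars is hypothetical, the example values are placeholders, and it expects the secrets to have been mapped to environment variables in the step's env block.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check: fails if any named environment variable
# is unset or empty. Only names are printed, never values.
require_env_vars() {
  local missing=0
  local name
  for name in "$@"; do
    # printenv prints the value of an exported variable, or nothing if unset.
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing required secret: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example with placeholder values (not real secrets).
export STAGING_BASE_URL="https://staging.api.example.com"
export STAGING_API_KEY="placeholder"
require_env_vars STAGING_BASE_URL STAGING_API_KEY && echo "all secrets present"
```

Run as the first step of a job, a check like this turns a silent empty value into an immediate, clearly labelled failure.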

End-to-End Flow

With all three workflows in place, here is exactly what happens from the moment a developer opens a pull request to the moment a user receives an update:

1. Developer Opens a PR into `develop`

The pr_checks.yml workflow fires. It runs formatting checks, static analysis, and the full test suite. If anything fails, the PR cannot be merged and the team is notified immediately. The developer fixes the issues and pushes again, which triggers a fresh run.

2. PR is Approved and Merged into `develop`

The android.yml and ios.yml workflows both fire on the push event. They detect the environment as dev, inject placeholder config, and build unsigned artifacts. The Android build is uploaded to Firebase App Distribution, so testers receive an email and can install it on their devices within minutes without anyone sharing a file manually. The iOS dev build only verifies that the project compiles on a clean machine.

3. `develop` is Merged into `staging`

Both platform workflows fire again. This time the environment resolves to staging. Real secrets are injected, builds are properly signed, and the artifacts go to Firebase App Distribution (Android) and TestFlight (iOS). QA begins testing the staging build against the staging API.

4. `staging` is Merged into `production`

Both workflows fire one final time. Production secrets are injected, builds are obfuscated and signed, debug symbols are uploaded to Sentry, and the final artifacts are submitted to the Google Play Store and App Store Connect. The release goes live on Apple and Google's review timelines with no further human intervention required.

From that first PR to a production submission, not a single command was run manually.

Conclusion

Building this pipeline is an upfront investment that pays off from the very first release cycle. What used to be a sequence of error-prone manual steps (building locally, signing, uploading, switching configs, and hoping nothing was mixed up) is now a fully automated, auditable, and repeatable process that runs the moment code moves between branches.

The architecture we built here does more than just automate builds. The PR quality gate enforces team standards consistently, so code review becomes a conversation about intent rather than a hunt for formatting issues. The environment-aware config injection eliminates an entire class of production incidents where staging keys made it into a live release. The Sentry symbol upload means your team can debug production crashes with full source visibility even from an obfuscated binary.

Every piece of this pipeline also runs locally. The helper scripts in the scripts/ folder are plain Bash so you can call them from your terminal the same way CI calls them. This eliminates the frustrating cycle of pushing a commit just to test a pipeline change.

As your team grows, this foundation scales with you. You can extend the pr_checks.yml to enforce code coverage thresholds, add a performance benchmarking job, or introduce a dedicated security scanning step. You can extend the platform workflows to support multiple flavors, multiple Firebase projects, or staged rollouts on the Play Store. The architecture stays the same – you're just adding new steps to an already working system.

The result is a pipeline in which standards are enforced, code quality stays high, the team's structure and release process are clear, and post-development activities are automated, leaving your team free to focus on building the product.

Common DevOps Mistakes and How to Avoid Them — Tips for Startups

Table of Contents

Who This Article Is For

Why Startups Are a Different Environment

Mistake 1: Deploying Without Understanding What You're Deploying

The Scenario

The Business Impact

The Fix

Mistake 2: Using Production as a Development Environment

The Scenario

The Business Impact

The Fix

Mistake 3: Hardcoding Secrets and Credentials

The Scenario

The Business Impact

The Fix

Mistake 4: Overengineering for Problems You Don't Have Yet

The Scenario

The Business Impact

The Fix

Mistake 5: No Observability Before Launch

The Scenario

Business Impact

The Fix

Mistake 6: Treating Security as a Final Step

The Scenario

The Business Impact

The Fix

Mistake 7: Manual Deployments in Production

The Scenario

The Business Impact

The Fix

Mistake 8: No Disaster Recovery Plan

The Scenario

The Business Impact

The Fix

Mistake 9: No Documentation or Runbooks

The Scenario

The Business Impact

The Fix

Mistake 10: Solving Technical Problems Without Understanding the Business

The Scenario

The Business Impact

The Fix

The System Thinking Framework Every DevOps Engineer Needs

Your Production Readiness Checklist

Infrastructure

Security

Observability

Reliability

Documentation

Conclusion

Want to Go Deeper?

How to Migrate to S3 Native State Locking in Terraform

Table of Contents

What is Terraform State Locking?

What Is S3 Native State Locking?

How S3 Native Locking Compares to the S3 + DynamoDB Approach

Prerequisites

Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch

Step 1: Create the S3 Bucket with Versioning and Encryption

Step 2: Configure the Terraform Backend with Native Locking

Step 3: Initialize and Verify

Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking

Step 1: Verify Your Current Setup

Step 2: Enable Object Lock on the Existing S3 Bucket

Step 3: Update the Terraform Backend Configuration

Step 4: Reinitialize Terraform

Step 5: Verify the Migration

Step 6: Clean Up the DynamoDB Table

How to Verify That Locking Is Working

Method 1: Observe the lock file during an operation

Method 2: Read the lock file contents

How to Handle a Stuck Lock

Rollback Plan: If Something Goes Wrong

Security Best Practices for Your State Bucket

Enable Versioning (Required)

Block All Public Access (Non-Negotiable)

Enable Server-Side Encryption