<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Devops - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Devops - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 21 May 2026 22:47:18 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/devops/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Common DevOps Mistakes and How to Avoid Them — Tips for Startups ]]>
                </title>
                <description>
                    <![CDATA[ Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what not to do before they got into production. Startup environments make this worse. The p ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-avoid-devops-mistakes/</link>
                <guid isPermaLink="false">6a060c22baf09db7a6253878</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ startup ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tips ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 14 May 2026 17:53:38 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/6fcabd5e-272f-4f1d-b035-8241896e8296.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what <em>not</em> to do before they got into production.</p>
<p>Startup environments make this worse. The pressure to ship fast, the small team sizes, and the absence of senior engineers to review your decisions means mistakes happen quietly until they become outages, data loss events, or security incidents that cost the company thousands of dollars and weeks of recovery time.</p>
<p>This article is a direct breakdown of the ten most costly DevOps mistakes engineers make early in their careers at startups. For each mistake, you will get the real-world scenario, the business impact, and the concrete fix you can apply immediately.</p>
<p>Whether you are setting up your first production environment or auditing an existing one, this guide will help you build systems that are reliable, secure, and aligned with what the business actually needs.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-who-this-article-is-for">Who This Article Is For</a></p>
</li>
<li><p><a href="#heading-why-startups-are-a-different-environment">Why Startups Are a Different Environment</a></p>
</li>
<li><p><a href="#heading-mistake-1-deploying-without-understanding-what-youre-deploying">Mistake 1: Deploying Without Understanding What You're Deploying</a></p>
</li>
<li><p><a href="#heading-mistake-2-using-production-as-a-development-environment">Mistake 2: Using Production as a Development Environment</a></p>
</li>
<li><p><a href="#heading-mistake-3-hardcoding-secrets-and-credentials">Mistake 3: Hardcoding Secrets and Credentials</a></p>
</li>
<li><p><a href="#heading-mistake-4-overengineering-for-problems-you-dont-have-yet">Mistake 4: Overengineering for Problems You Don't Have Yet</a></p>
</li>
<li><p><a href="#heading-mistake-5-no-observability-before-launch">Mistake 5: No Observability Before Launch</a></p>
</li>
<li><p><a href="#heading-mistake-6-treating-security-as-a-final-step">Mistake 6: Treating Security as a Final Step</a></p>
</li>
<li><p><a href="#heading-mistake-7-manual-deployments-in-production">Mistake 7: Manual Deployments in Production</a></p>
</li>
<li><p><a href="#heading-mistake-8-no-disaster-recovery-plan">Mistake 8: No Disaster Recovery Plan</a></p>
</li>
<li><p><a href="#heading-mistake-9-no-documentation-or-runbooks">Mistake 9: No Documentation or Runbooks</a></p>
</li>
<li><p><a href="#heading-mistake-10-solving-technical-problems-without-understanding-the-business">Mistake 10: Solving Technical Problems Without Understanding the Business</a></p>
</li>
<li><p><a href="#heading-the-system-thinking-framework-every-devops-engineer-needs">The System Thinking Framework Every DevOps Engineer Needs</a></p>
</li>
<li><p><a href="#heading-your-production-readiness-checklist">Your Production Readiness Checklist</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-who-this-article-is-for">Who This Article Is For</h2>
<ul>
<li><p><strong>Early-career DevOps and cloud engineers</strong> who are building or maintaining production infrastructure at a startup.</p>
</li>
<li><p><strong>Backend developers</strong> who have recently taken on DevOps responsibilities.</p>
</li>
<li><p><strong>Engineers joining a startup</strong> who want to understand what operational discipline actually looks like in a fast-moving environment.</p>
</li>
</ul>
<p>You do not need to be an expert in any specific tool to follow this article. The focus is on decision-making patterns and operational discipline, not tool configuration.</p>
<h2 id="heading-why-startups-are-a-different-environment">Why Startups Are a Different Environment</h2>
<p>Before getting into the mistakes, you have to understand why startups produce them in the first place.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/f9bec1fa-8938-4144-b934-9e5af4edf4ad.svg" alt="diagram showing the startup DevOps reality, a single engineer handling infra, CI/CD, security, monitoring, and deployment pipelines simultaneously" style="display:block;margin:0 auto" width="680" height="506" loading="lazy">

<p>In a large company, you typically have dedicated security engineers, an SRE team, a platform team, and multiple reviewers for every infrastructure change. In a startup, you mostly likely have one engineer responsible for all of that simultaneously.</p>
<p>This creates four specific pressure points:</p>
<ol>
<li><p><strong>Speed pressure.</strong> The business needs features shipped now. Operational discipline gets treated as optional because nobody is watching closely yet.</p>
</li>
<li><p><strong>Budget constraints.</strong> Every infrastructure decision has a direct impact on company runway. Engineers optimize for the cheapest option rather than the most reliable one.</p>
</li>
<li><p><strong>Absent guardrails.</strong> There is no senior engineer reviewing your Terraform plans. There is no security audit before launch. The absence of immediate consequences can make bad decisions feel like good ones.</p>
</li>
<li><p><strong>Constantly changing requirements.</strong> The architecture you design today may need to support a completely different product in six months. None of these pressures are excuses for poor decisions. But understanding them helps you see why the following mistakes happen so consistently.</p>
</li>
</ol>
<h2 id="heading-mistake-1-deploying-without-understanding-what-youre-deploying">Mistake 1: Deploying Without Understanding What You're Deploying</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A junior engineer is asked to deploy the company's Node.js API to AWS. They find a tutorial for Elastic Beanstalk, follow it, and it works. Two weeks later, traffic increases. They try to scale "the same way as in the tutorial." The application goes down. They cannot debug it because they never understood what the deployment was actually doing.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>When production breaks and the person who deployed the system cannot explain how it works, diagnosis takes hours instead of minutes. The longer the incident runs, the higher the cost in customer trust, team morale, and potentially direct revenue loss.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Before you deploy anything to production, you should be able to answer these five questions in writing:</p>
<ol>
<li><p><strong>What compute type is running my code?</strong> (EC2, Lambda, Fargate, container?)</p>
</li>
<li><p><strong>How does a new version replace the old one?</strong> (Rolling? Blue/green? All-at-once?)</p>
</li>
<li><p><strong>Where does configuration and secrets come from?</strong> (SSM? Secrets Manager? Environment file?)</p>
</li>
<li><p><strong>What downstream services depend on this?</strong> (Database connections? Other APIs? Cache?)</p>
</li>
<li><p><strong>How do I roll back in under five minutes if this breaks?</strong></p>
</li>
</ol>
<p>If you cannot answer all five, do not deploy until you can. The tutorial that got it running is not the documentation for how it operates.</p>
<blockquote>
<p>"It is better to spend two hours understanding a system before deploying it than two days debugging it after something breaks."</p>
</blockquote>
<p>Personally, when learning a new technology, tool, or implementing something I have not worked with before, I usually focus on three core questions: What, Why, and How.</p>
<ul>
<li><p><strong>The first question is: What is this technology or concept about?</strong><br>This helps me build a solid foundation by doing deep research, studying the official documentation, understanding the core principles, and sometimes even learning the history behind the tool or technology. I believe having a well-grounded understanding before implementation is very important.</p>
</li>
<li><p><strong>The second question is: Why do we need it?</strong><br>I try to understand the value the technology brings, why it should be implemented, what problem it solves, and how it benefits the team or organization. This helps me make informed technical decisions instead of just implementing tools without understanding their purpose.</p>
</li>
<li><p><strong>The third question is: How should it be implemented?</strong><br>There are usually multiple approaches to solving a problem or implementing a technology, so I focus on understanding the best and most practical approach based on the use case and expected outcome.</p>
</li>
</ul>
<p>This structured approach has helped me learn new technologies quickly, adapt fast, and implement solutions effectively in real-world environments.</p>
<h2 id="heading-mistake-2-using-production-as-a-development-environment">Mistake 2: Using Production as a Development Environment</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>To save time, an engineer tests a new deployment script directly in the production AWS account. They accidentally run a command that terminates the production database instance. Automated backups exist but were misconfigured. Six hours of customer data is unrecoverable.</p>
<p>This scenario happens more often than you would expect. The reasoning is always the same: "It will only take a minute."</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>A single test-in-production incident can result in data loss, hours of downtime, and a customer communication crisis. In a startup, that can permanently damage the company's reputation before it has had the chance to build one.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>You need at minimum three separate environments and ideally three separate AWS accounts:</p>
<table>
<thead>
<tr>
<th>Environment</th>
<th>Purpose</th>
<th>Access Level</th>
</tr>
</thead>
<tbody><tr>
<td><strong>dev</strong></td>
<td>Break things freely. No real data.</td>
<td>Engineers have broad access</td>
</tr>
<tr>
<td><strong>staging</strong></td>
<td>Mirror of production. Final verification.</td>
<td>Controlled access</td>
</tr>
<tr>
<td><strong>production</strong></td>
<td>Real customers. Real data.</td>
<td>MFA required. No manual deployments.</td>
</tr>
</tbody></table>
<p>Using separate AWS accounts (not just separate VPCs) gives you account-level isolation. A permission error in the dev account cannot accidentally touch production infrastructure at the API level.</p>
<p>Infrastructure as Code (Terraform or CloudFormation) makes this affordable, you write the configuration once and apply it three times with different variable files.</p>
<pre><code class="language-hcl"># terraform/environments/prod/main.tf
module "app" {
  source      = "../../modules/app"
  environment = "production"
  instance_type = "t3.medium"
  db_instance_class = "db.t3.medium"
  multi_az          = true
}
</code></pre>
<pre><code class="language-hcl"># terraform/environments/staging/main.tf
module "app" {
  source      = "../../modules/app"
  environment = "staging"
  instance_type = "t3.small"
  db_instance_class = "db.t3.small"
  multi_az          = false
}
</code></pre>
<p>The module is the same. The environment-specific variables are different. Separate environments are not a luxury, they are the minimum operating standard for any team running real software.</p>
<h2 id="heading-mistake-3-hardcoding-secrets-and-credentials">Mistake 3: Hardcoding Secrets and Credentials</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A new engineer joins a startup and clones the repository. Inside they find a <code>.env</code> file committed to Git containing the production database password, the Stripe secret key, and an AWS access key with admin permissions. The repository has been public for six months.</p>
<p>GitHub's automated secret scanning never triggered because the secrets were inside a <code>.env</code> file rather than raw in the code. The credentials had been valid and actively used for over six months.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Automated scanners run by attackers find exposed credentials within minutes of them being pushed to a public repository. A single exposed AWS access key with admin permissions can result in:</p>
<ul>
<li><p>Crypto-mining workloads generating thousands of dollars in cloud bills overnight</p>
</li>
<li><p>Complete exfiltration of customer data from every S3 bucket</p>
</li>
<li><p>Privilege escalation: the attacker creates new admin users and locks you out of your own account</p>
</li>
<li><p>AWS account suspension while the investigation runs</p>
</li>
</ul>
<p>According to <a href="https://github.blog/security/vulnerability-research/securing-millions-of-developers-together/">GitHub's annual security report</a>, millions of secrets are exposed in public repositories every year. The average time to detect a compromised cloud credential is 197 days.</p>
<h2 id="heading-the-fix">The Fix</h2>
<p><strong>Step 1: Never commit secrets to Git.</strong> Not temporarily. Not in a branch. Not in a private repository.</p>
<p><strong>Step 2: Add</strong> <code>.gitignore</code> <strong>before you create the first file.</strong> Check in the <code>.gitignore</code> with the first line of code before any <code>.env</code> files exist.</p>
<pre><code class="language-gitignore"># .gitignore
.env
.env.*
*.pem
*.key
secrets/
</code></pre>
<p><strong>Step 3: Use AWS Secrets Manager or SSM Parameter Store for all production secrets.</strong> Your application reads secrets at runtime:</p>
<pre><code class="language-python"># Python example — fetch secret at runtime, never at build time
import boto3
import json
 
def get_secret(secret_name: str, region: str = "us-east-1") -&gt; dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
 
# Usage
db_config = get_secret("prod/myapp/database")
DATABASE_URL = db_config["connection_string"]
</code></pre>
<p><strong>Step 4: Scan your existing repositories immediately.</strong> You may already have a problem:</p>
<pre><code class="language-bash"># Install trufflehog to scan for exposed secrets in your repo history
pip install trufflehog
 
# Scan the entire commit history of your repository
trufflehog git file://.
 
# Or scan a remote GitHub repo
trufflehog github --repo https://github.com/your-org/your-repo
</code></pre>
<p><strong>Step 5: Add a pre-commit hook to prevent future accidents:</strong></p>
<pre><code class="language-bash">pip install pre-commit
</code></pre>
<pre><code class="language-yaml"># .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/git-secrets
    rev: master
    hooks:
      - id: git-secrets
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
</code></pre>
<pre><code class="language-bash">pre-commit install
# Now the hook runs before every commit and blocks detected secrets
</code></pre>
<p>There is no recovery from a publicly exposed database password. The fix takes ten minutes upfront. The incident takes weeks.</p>
<h2 id="heading-mistake-4-overengineering-for-problems-you-dont-have-yet">Mistake 4: Overengineering for Problems You Don't Have Yet</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A five-person startup with 200 users decides to build a microservices architecture on Kubernetes because "Netflix uses it." They spend three months setting up Kubernetes, Istio service mesh, ArgoCD, Vault, Prometheus, and Grafana. Their product has not shipped a new feature in three months. A competitor with a monolith on a single EC2 instance shipped twelve new features in the same period.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Every layer of infrastructure you add is a layer that can break, a layer that requires expertise to operate, and a layer that slows down every future change. Kubernetes is the right answer for organizations with the scale and team size to operate it. For a five-person startup, it is an expensive distraction.</p>
<p>Premature complexity does not just cost engineering time. It costs the competitive advantage that speed provides in the early stage.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Match your infrastructure to your actual stage:</p>
<table>
<thead>
<tr>
<th>Scale</th>
<th>Right Infrastructure</th>
<th>Cost Range</th>
</tr>
</thead>
<tbody><tr>
<td><strong>1–1,000 users</strong></td>
<td>Single EC2 + RDS + Nginx reverse proxy</td>
<td>$20–50/month</td>
</tr>
<tr>
<td><strong>1K–50K users</strong></td>
<td>Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD</td>
<td>$200-500/month</td>
</tr>
<tr>
<td><strong>50K–500K users</strong></td>
<td>ECS Fargate, RDS read replicas, ElastiCache, full observability</td>
<td>$1K-5K/month</td>
</tr>
<tr>
<td><strong>500K+ users</strong></td>
<td>Multi-region, managed Kubernetes, dedicated SRE</td>
<td>$10K+/month</td>
</tr>
</tbody></table>
<p>The question to ask before every infrastructure decision is: <strong>"What specific, measurable problem does this solve today that my current setup cannot solve?"</strong></p>
<p>Amazon, Netflix, and Uber did not start with microservices. They started with monoliths and extracted services only when the monolith became the actual bottleneck. You are not Netflix. You are solving the problems in front of you today.</p>
<p>Use managed services wherever possible, RDS instead of self-hosted Postgres, Fargate instead of self-managed Kubernetes, ElastiCache instead of self-hosted Redis. Managed services let your team focus on the product instead of the infrastructure.</p>
<h2 id="heading-mistake-5-no-observability-before-launch">Mistake 5: No Observability Before Launch</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's checkout flow breaks on a Friday evening. Users are abandoning their carts and the company is losing revenue. The DevOps engineer finds out 45 minutes later because a customer sent a direct message to the CEO on Twitter.</p>
<p>The engineer has no dashboards, no log aggregation, and no alerting. They SSH into the production server and scroll through raw log files. Two hours later, they find the issue: a database connection pool was exhausted by a memory leak introduced in that morning's deployment.</p>
<h3 id="heading-business-impact">Business Impact</h3>
<p>Without observability:</p>
<ul>
<li><p>You find out about production problems from users, not from your systems</p>
</li>
<li><p>Incidents take 10x longer to resolve because diagnosis is guesswork</p>
</li>
<li><p>You cannot tell whether a deployment improved or degraded performance</p>
</li>
<li><p>You have no data for making better architecture decisions</p>
</li>
</ul>
<h3 id="heading-the-fix">The Fix</h3>
<p>Implement the four golden signals before any service goes to production. These come from <a href="https://sre.google/sre-book/monitoring-distributed-systems/">Google's Site Reliability Engineering book</a>:</p>
<ol>
<li><p><strong>Latency</strong>: How long requests take to complete (p50, p95, p99)</p>
</li>
<li><p><strong>Traffic</strong>: How many requests per second the system is handling</p>
</li>
<li><p><strong>Errors</strong>: The rate of failed requests (5xx responses per minute)</p>
</li>
<li><p><strong>Saturation</strong>: How close the system is to its limits (CPU, memory, connection pool)</p>
</li>
</ol>
<p>Here is a minimal CloudWatch alarm setup using the AWS CLI:</p>
<pre><code class="language-shell"># Alert when error rate exceeds 1% for 5 consecutive minutes

aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate-production" \
  --alarm-description "Error rate exceeded 1% for 5 minutes" \
  --metric-name "5XXError" \
  --namespace "AWS/ApplicationELB" \
  --statistic "Average" \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 0.01 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:pagerduty-production" \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef
</code></pre>
<p>Every application should also expose a <code>/health</code> endpoint that returns <code>200 OK</code> when healthy:</p>
<pre><code class="language-python"># FastAPI example

from fastapi import FastAPI
from sqlalchemy import text
 
app = FastAPI()
 
@app.get("/health")
async def health_check():
    # Check database connectivity
    try:
        db.execute(text("SELECT 1"))
        db_status = "healthy"
    except Exception:
        db_status = "unhealthy"
 
    return {
        "status": "healthy" if db_status == "healthy" else "degraded",
        "database": db_status,
        "version": os.getenv("APP_VERSION", "unknown")
    }
</code></pre>
<p>Your load balancer checks this endpoint. Your uptime monitor checks it. You check it after every deployment.</p>
<blockquote>
<p>You do not get to say a system is working unless you have data to prove it. "Nobody complained" is not the same as "nothing is broken."</p>
</blockquote>
<h2 id="heading-mistake-6-treating-security-as-a-final-step">Mistake 6: Treating Security as a Final Step</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup rushes to launch their MVP. Security reviews are "planned for after launch." Six months later, a potential enterprise customer requires a security audit before signing a contract. The audit reveals:</p>
<ul>
<li><p>S3 buckets publicly accessible by default</p>
</li>
<li><p>EC2 instances with port 22 open to <code>0.0.0.0/0</code></p>
</li>
<li><p>IAM users with <code>AdministratorAccess</code> for the entire team</p>
</li>
<li><p>No encryption on the database at rest</p>
</li>
<li><p>JWT secrets hardcoded in environment variables The audit fails. The enterprise deal worth $120,000 annually is lost. Remediation takes four weeks of engineering time.</p>
</li>
</ul>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Security debt is the most expensive technical debt you can accumulate. Unlike performance debt that degrades gradually, security vulnerabilities cause sudden, catastrophic events: data breaches, ransomware, account takeovers, and regulatory fines. At a startup, any one of these can end the company.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Apply these six security controls before the first line of production code ships:</p>
<p><strong>1. Principle of Least Privilege every IAM role gets only what it needs:</strong></p>
<p>One of the most common security mistakes in AWS is granting roles more permissions than they need either out of convenience (<code>s3:*</code>) or uncertainty about what the service actually requires. This creates unnecessary risk: if a role is compromised, the attacker inherits every permission you granted.</p>
<p>The fix is simple: look at what your service actually does, then write a policy that allows exactly that.</p>
<p>If your app uploads and reads files from a specific S3 bucket, the policy should say exactly that:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}
</code></pre>
<p>Notice the <code>Resource</code> is scoped to <code>my-app-uploads/*</code> not all S3 buckets. And the <code>Action</code> list covers only <code>GetObject</code> and <code>PutObject</code> not <code>DeleteObject</code>, not <code>s3:*</code>. If the service gets compromised, the attacker can read and write to that one bucket. That is it. The rest of your account is untouched.</p>
<p><strong>2. Block all S3 public access by default:</strong></p>
<p>AWS S3 buckets are private by default when created but that can be overridden at the bucket level, the object level, or through a bucket policy. Misconfigured S3 buckets are one of the most common causes of data breaches, and they are almost always accidental.</p>
<p>The safest approach is to enable the "Block Public Access" setting at the account level, which overrides all other settings and prevents any bucket from being made public even if someone tries:</p>
<pre><code class="language-bash">aws s3api put-public-access-block \
  --bucket my-app-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<p>Run this for every bucket you create. Better yet, enable it at the AWS account level so it applies automatically to all future buckets by default.</p>
<p><strong>3. Never open SSH to the internet, use AWS Systems Manager Session Manager instead:</strong></p>
<p>Port 22 open to <code>0.0.0.0/0</code> is an attack surface that exists on thousands of AWS instances right now. Brute-force bots scan the internet continuously looking for open SSH ports. Even with a strong key, the exposure is unnecessary because AWS provides a better alternative.</p>
<p>AWS Systems Manager Session Manager gives you full shell access to any EC2 instance without opening a single inbound port on the security group. There is no port to scan, no port to attack, and every session is logged automatically to CloudTrail:</p>
<pre><code class="language-bash"># Start a session on an EC2 instance without port 22 open
aws ssm start-session --target i-0123456789abcdef0
</code></pre>
<p>To use Session Manager, the EC2 instance needs the SSM Agent installed (included by default on Amazon Linux 2 and Ubuntu 20.04+) and an IAM instance profile with the <code>AmazonSSMManagedInstanceCore</code> policy attached. Once that is set up, you can close port 22 on the security group entirely.</p>
<p><strong>4. Enable MFA for all IAM users and enforce it via policy:</strong></p>
<p>A leaked IAM username and password with no MFA is a fully compromised account. Multi-factor authentication is the single most effective control against credential theft, and it costs nothing to enable.</p>
<p>Enforce it through an IAM policy that denies all actions when MFA is not present, except the actions needed to set up MFA in the first place. This means even if a set of credentials is stolen, the attacker cannot do anything without the second factor.</p>
<p>The AWS documentation provides the <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_users-self-manage-mfa-and-creds.html">Complete Deny Without MFA Policy</a>, attach it to every IAM user or group in your account. This is a one-time setup that permanently raises your account's security baseline.</p>
<p><strong>5. Enable CloudTrail in all regions:</strong></p>
<p>Without CloudTrail, you have no record of who did what in your AWS account. If a credential is compromised, you cannot investigate what the attacker accessed. If an engineer accidentally deletes a resource, you cannot trace it. You are operating blind.</p>
<p>CloudTrail logs every AWS API call who made it, from which IP, at what time, and what the response was. Enable it across all regions so activity in regions you do not actively use is also captured:</p>
<pre><code class="language-bash">aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation
</code></pre>
<p>The <code>--enable-log-file-validation</code> flag generates a digest file for each log that lets you verify the log has not been tampered with, this is important if you ever need to use these logs in a security investigation or compliance audit. Once this is running, every <code>AssumeRole</code>, every <code>DeleteBucket</code>, and every <code>RunInstances</code> call in your account is permanently recorded.</p>
<p><strong>6. Run AWS Security Hub from day one:</strong></p>
<p>Most teams only discover security misconfigurations after a breach or a compliance audit. Security Hub inverts this, it continuously scans your AWS environment against industry-standard frameworks (CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices) and surfaces findings before they become incidents.</p>
<p>Enabling it takes a single command:</p>
<pre><code class="language-bash">aws securityhub enable-security-hub
</code></pre>
<p>Within minutes, Security Hub gives your account a compliance score and a prioritized list of findings. A finding might tell you that a security group has port 22 open to the world, that an S3 bucket has logging disabled, or that root account credentials were recently used. Each finding includes the affected resource and a remediation guide.</p>
<p>Treat every Security Hub finding the same way you treat a production bug: assign it a priority, assign an owner, and close it. A finding sitting unaddressed for 30 days is a known vulnerability you chose to leave open.</p>
<h2 id="heading-mistake-7-manual-deployments-in-production">Mistake 7: Manual Deployments in Production</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's deployment process is documented in a Notion page that is four months out of date. It involves SSH-ing into the server, running <code>git pull</code>, running <code>npm install</code>, and restarting the PM2 process. Different engineers do it slightly differently. One engineer, rushing a late-night release, skips <code>npm install</code>. The application starts crashing because a new dependency is missing.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Manual deployment processes are inherently unreliable. Humans under pressure skip steps, perform steps in the wrong order, and remember procedures differently. Every manual step in a production deployment process is a scheduled incident waiting for the right moment of stress.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>If a deployment step is performed manually more than twice, it needs to be automated. Here is a minimal but complete GitHub Actions deployment workflow for an ECS Fargate service:</p>
<pre><code class="language-yaml"># .github/workflows/deploy.yml
name: Deploy to Production
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write   # Required for OIDC authentication with AWS
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
 
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
 
      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t \(ECR_REGISTRY/my-app:\)IMAGE_TAG .
          docker push \(ECR_REGISTRY/my-app:\)IMAGE_TAG
          echo "image=\(ECR_REGISTRY/my-app:\)IMAGE_TAG" &gt;&gt; $GITHUB_OUTPUT
 
      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: task-definition.json
          service: my-app-service
          cluster: production
          wait-for-service-stability: true
</code></pre>
<p>Notice <code>wait-for-service-stability: true</code>. Without this, the workflow reports success the moment ECS accepts the new task definition before the containers are actually healthy. With it, the workflow fails if the new containers crash. You want to know immediately, not discover it from user reports thirty minutes later.</p>
<h2 id="heading-mistake-8-no-disaster-recovery-plan">Mistake 8: No Disaster Recovery Plan</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's production database runs on a single RDS instance with no Multi-AZ configuration. Automated backups are enabled but have never been tested. The EBS volume backing the instance fails. AWS provisions a new instance from the last snapshot, which is 18 hours old. 18 hours of customer data is permanently lost.</p>
<p>The startup had no disaster recovery plan, no tested recovery procedure, and no communication template ready for customers.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>The question is not whether your infrastructure will fail. It will fail. Every database, every server, every availability zone experiences failures. The question is whether you have a tested plan for when it does.</p>
<p>Data loss of any magnitude is serious. For startups that handle financial data, healthcare data, or anything under GDPR, even partial data loss can trigger regulatory consequences.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p><strong>Define your RTO and RPO before you design anything:</strong></p>
<ul>
<li><p><strong>RTO (Recovery Time Objective):</strong> How long can the business survive without this system? A payment API might have an RTO of 15 minutes. An internal analytics dashboard might have an RTO of 4 hours.</p>
</li>
<li><p><strong>RPO (Recovery Point Objective):</strong> How much data loss is acceptable? Zero means real-time replication. One hour means hourly snapshots are sufficient. This directly determines your backup frequency and architecture.</p>
</li>
</ul>
<p><strong>Enable RDS Multi-AZ for all production databases:</strong></p>
<pre><code class="language-hcl"># Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
 
  # Multi-AZ: automatic failover to standby in a different AZ
  # No data loss. Automatic failover in ~60-120 seconds.
  multi_az = true
 
  # Encryption at rest — non-negotiable
  storage_encrypted = true
 
  # Automated backups with 7-day retention
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable deletion protection in production
  deletion_protection = true
 
  tags = {
    Environment = "production"
  }
}
</code></pre>
<p><strong>Test your backups on a schedule.</strong> Create a monthly calendar event: "Restore production backup to staging and verify data integrity." An untested backup is not a backup, it is a hope.</p>
<pre><code class="language-bash"># Restore a snapshot to a test instance and verify
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier recovery-test \
  --db-snapshot-identifier rds:prod-postgres-2025-01-15 \
  --db-instance-class db.t3.medium \
  --no-multi-az
 
# Connect and verify row counts
psql -h recovery-test.xxxx.rds.amazonaws.com -U admin -d mydb \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"
</code></pre>
<p>For official guidance on RDS backup and restore, refer to the <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html">AWS RDS Backup and Restore documentation</a>.</p>
<h2 id="heading-mistake-9-no-documentation-or-runbooks">Mistake 9: No Documentation or Runbooks</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>The startup's most experienced DevOps engineer takes two weeks of vacation. On day three of their holiday, the staging environment goes down. Nobody else knows how it was built, the engineer set it up manually over six months with no documentation, no Terraform, no notes. The team spends four days trying to reconstruct the environment from memory and guesswork. The engineer gets messages on their vacation every day. When they return, they rebuild the environment in four hours.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Undocumented infrastructure creates single points of failure not in your systems, but in your team. It makes onboarding new engineers take weeks instead of hours. It makes incident response depend on specific people being available. When that person leaves the company, the knowledge walks out with them.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Documentation for an engineering team means three specific things:</p>
<ol>
<li><p><strong>Infrastructure as Code is the highest form of documentation.</strong> The Terraform that defines your infrastructure IS the documentation for what exists and how it is configured. If something is not in code, it should not exist in production.</p>
</li>
<li><p><strong>A runbook for every operational task.</strong> A runbook is a step-by-step procedure written well enough that someone in their first week at the company can follow it during an incident:</p>
</li>
</ol>
<pre><code class="language-markdown"># Runbook: Production Database Connection Exhaustion
 
## Symptoms
- Application logs: "too many connections" errors
- 500 error rate spike on database-dependent endpoints
- pg_stat_activity shows max connections reached
 
## Diagnosis
# Check current connection count
psql -h \(DB_HOST -U \)DB_USER -c "SELECT COUNT(*) FROM pg_stat_activity;"
 
# See connections by application
psql -h \(DB_HOST -U \)DB_USER \
  -c "SELECT application_name, COUNT(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

## Resolution
1. Identify and restart the service causing the connection leak
2. If immediate relief needed: kill idle connections older than 10 minutes
3. Long-term: review connection pool settings in application config

## Escalation
If unresolved in 30 minutes: page the on-call backend engineer.
</code></pre>
<ol>
<li><strong>An architecture README in every repository.</strong> Every engineer who clones your repository should be able to understand what it does, how to run it locally, how to deploy it, and what it depends on without asking anyone.</li>
</ol>
<h2 id="heading-mistake-10-solving-technical-problems-without-understanding-the-business">Mistake 10: Solving Technical Problems Without Understanding the Business</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup is experiencing slow page loads. A DevOps engineer decides to solve it by migrating to Kubernetes with horizontal pod auto-scaling. The migration takes six weeks. Page loads improve slightly. But 80% of the slowness was caused by unoptimized database queries that had nothing to do with the infrastructure layer. The six-week migration solved 20% of the problem.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Technical solutions to misdiagnosed problems are extraordinarily expensive. Every hour spent building the wrong solution is an hour not spent on the right one. Infrastructure is a tool for delivering business outcomes not an end in itself.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Before making any infrastructure decision, answer these four questions:</p>
<ol>
<li><p><strong>What is the actual, measured bottleneck?</strong> Instrument before you act. The bottleneck is almost never where you assumed it was.</p>
</li>
<li><p><strong>What does success look like, and how will you measure it?</strong> "Pages are faster" is not measurable. "p95 page load time drops below 1.2 seconds" is measurable.</p>
</li>
<li><p><strong>What is the full cost of this solution?</strong> Time to implement, ongoing operational burden, team learning curve. Is this cost justified by the measured impact?</p>
</li>
<li><p><strong>Can a simpler solution solve 80% of the problem in 20% of the time?</strong></p>
</li>
</ol>
<p>Always profile and measure before you rebuild:</p>
<pre><code class="language-bash"># Check slow queries in PostgreSQL before any infrastructure changes
psql -h \(DB_HOST -U \)DB_USER -d $DB_NAME -c "
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 10;
"
</code></pre>
<p>Nine times out of ten, slow applications have slow queries, missing indexes, or an N+1 query problem, none of which require a new infrastructure layer to fix.</p>
<h2 id="heading-the-system-thinking-framework-every-devops-engineer-needs">The System Thinking Framework Every DevOps Engineer Needs</h2>
<p>Most of the mistakes above share a common root cause: the engineer was thinking about one component in isolation instead of the full system.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/b33035a6-448f-419b-b293-206b7b775594.jpg" alt="A diagram showing a request flowing through a full system: user → CDN → load balancer → application servers → cache → database → logs/monitoring" style="display:block;margin:0 auto" width="544" height="650" loading="lazy">

<p>A system thinker asks six questions before making any change in production:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Why You Ask It</th>
</tr>
</thead>
<tbody><tr>
<td><strong>What does this change?</strong></td>
<td>List every configuration, file, or service that will be different.</td>
</tr>
<tr>
<td><strong>What does this depend on?</strong></td>
<td>What must be true upstream for this component to work correctly?</td>
</tr>
<tr>
<td><strong>What depends on this?</strong></td>
<td>What downstream systems are affected if this changes or fails?</td>
</tr>
<tr>
<td><strong>What is the failure mode?</strong></td>
<td>Does this fail loudly (500 errors) or silently (wrong data)?</td>
</tr>
<tr>
<td><strong>What is the rollback path?</strong></td>
<td>How do you reverse this in under five minutes?</td>
</tr>
<tr>
<td><strong>What does healthy look like after the change?</strong></td>
<td>What metrics confirm everything is working correctly?</td>
</tr>
</tbody></table>
<p>This is not a checklist you run through slowly. It is a thinking habit that becomes automatic with practice. Senior engineers do not spend more time on deployments than junior engineers do, they spend their time on different things, and this is one of them.</p>
<h2 id="heading-your-production-readiness-checklist">Your Production Readiness Checklist</h2>
<p>Use this checklist before any production system goes live. Mark each item as done, in progress, or not yet started.</p>
<h3 id="heading-infrastructure">Infrastructure</h3>
<ul>
<li><p>Infrastructure is defined as code (Terraform or CloudFormation) and version-controlled in Git</p>
</li>
<li><p>Separate dev, staging, and production environments exist with separate credentials</p>
</li>
<li><p>All production changes go through an automated CI/CD pipeline, no manual SSH deployments</p>
</li>
<li><p>You can rebuild the entire production environment from code in under two hours</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><p>No secrets, credentials, or API keys exist in any Git repository</p>
</li>
<li><p>All production secrets are in Secrets Manager or SSM Parameter Store</p>
</li>
<li><p>All IAM roles follow the principle of least privilege</p>
</li>
<li><p>S3 buckets have public access blocked by default</p>
</li>
<li><p>Port 22 is not open to <code>0.0.0.0/0</code> on any security group</p>
</li>
<li><p>CloudTrail is enabled in all regions</p>
</li>
<li><p>All IAM users have MFA enabled</p>
</li>
<li><p>AWS Security Hub is enabled and findings are reviewed weekly</p>
</li>
</ul>
<h3 id="heading-observability">Observability</h3>
<ul>
<li><p>Every service has a <code>/health</code> endpoint that monitoring checks continuously</p>
</li>
<li><p>Alerts fire within five minutes of a production error rate spike</p>
</li>
<li><p>Dashboards exist showing latency, error rate, and resource utilization</p>
</li>
<li><p>Logs are centralized and searchable, not scattered across individual servers</p>
</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><p>Production database has Multi-AZ enabled</p>
</li>
<li><p>Backup restoration has been tested in the last 30 days</p>
</li>
<li><p>Written runbooks exist for the three most likely failure scenarios</p>
</li>
<li><p>RTO and RPO requirements are documented and the architecture meets them</p>
</li>
</ul>
<h3 id="heading-documentation">Documentation</h3>
<ul>
<li><p>Every repository has a README explaining what it does and how to deploy it</p>
</li>
<li><p>A new engineer could understand the production architecture from documentation alone</p>
</li>
<li><p>No single engineer holds critical knowledge that lives only in their head</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>None of the mistakes in this article require rare misfortune to experience. They are the predictable result of decisions that feel reasonable under startup pressure but accumulate into real operational risk over time.</p>
<p>The good news is that every single one of them is preventable with the right awareness and the right habits applied early.</p>
<p>You do not need a perfect infrastructure from day one. You need a correct one: version-controlled, automated, observable, secure, and documented. Start with that foundation. Add complexity only when a specific, measured problem requires it. Always connect technical decisions to business outcomes.</p>
<p>The goal of DevOps in a startup is not to build impressive infrastructure. It is to build reliable systems that support product growth safely, efficiently, and sustainably and to make sure that when something does break, you can recover faster than anyone notices.</p>
<h2 id="heading-want-to-go-deeper">Want to Go Deeper?</h2>
<p>If this article resonated with you, <a href="https://coachli.co/tolani-akintayo/PR-H4oQS"><strong>The Startup DevOps Field Guide</strong></a> covers these principles in full depth with complete infrastructure blueprints, security frameworks, CI/CD pipeline templates, and the end-to-end decision-making playbook for engineers building DevOps practices in startup environments from scratch.</p>
<p>It is written specifically for the engineer who wants to do this right from the beginning not the one rebuilding everything after the first major incident.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Migrate to S3 Native State Locking in Terraform ]]>
                </title>
                <description>
                    <![CDATA[ If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them togethe ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-migrate-to-s3-native-state-locking-in-terraform/</link>
                <guid isPermaLink="false">69fd19239f93a850a430069b</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Infrastructure as code ]]>
                    </category>
                
                    <category>
                        <![CDATA[ S3 ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 07 May 2026 22:58:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/9619ad45-15c5-4be7-9221-ed4b76bc2b24.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.</p>
<p>But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.</p>
<p>The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.</p>
<p>In November 2024, AWS announced that S3 now supports native object locking for Terraform state files, meaning <strong>DynamoDB is no longer required for state locking</strong>. Terraform 1.10 added support for this feature, and it's now generally available.</p>
<p>In this tutorial, you'll learn:</p>
<ul>
<li><p>What S3 native locking is and how it works</p>
</li>
<li><p>How to set it up from scratch if you're starting a new project</p>
</li>
<li><p>How to migrate an existing S3 + DynamoDB setup to S3 native locking safely</p>
</li>
<li><p>How to verify locking is working and handle edge cases</p>
</li>
</ul>
<p>By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-terraform-state-locking">What Is Terraform State Locking?</a></p>
</li>
<li><p><a href="#heading-what-is-s3-native-state-locking">What Is S3 Native State Locking?</a></p>
</li>
<li><p><a href="#heading-how-s3-native-locking-compares-to-the-s3-dynamodb-approach">How S3 Native Locking Compares to the S3 + DynamoDB Approach</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-part-1-fresh-setup-how-to-configure-s3-native-locking-from-scratch">Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch</a></p>
<ul>
<li><p><a href="#heading-step-1-create-the-s3-bucket-with-versioning-and-encryption">Step 1: Create the S3 Bucket with Versioning and Encryption</a></p>
</li>
<li><p><a href="#heading-step-2-configure-the-terraform-backend-with-native-locking">Step 2: Configure the Terraform Backend with Native Locking</a></p>
</li>
<li><p><a href="#heading-step-3-initialize-and-verify">Step 3: Initialize and Verify</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-part-2-migration-how-to-move-from-s3-dynamodb-to-s3-native-locking">Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking</a></p>
<ul>
<li><p><a href="#heading-step-1-verify-your-current-setup">Step 1: Verify Your Current Setup</a></p>
</li>
<li><p><a href="#heading-step-2-enable-object-lock-on-the-existing-s3-bucket">Step 2: Enable Object Lock on the Existing S3 Bucket</a></p>
</li>
<li><p><a href="#heading-step-3-update-the-terraform-backend-configuration">Step 3: Update the Terraform Backend Configuration</a></p>
</li>
<li><p><a href="#heading-step-4-reinitialize-terraform">Step 4: Reinitialize Terraform</a></p>
</li>
<li><p><a href="#heading-step-5-verify-the-migration">Step 5: Verify the Migration</a></p>
</li>
<li><p><a href="#heading-step-6-clean-up-the-dynamodb-table">Step 6: Clean Up the DynamoDB Table</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-verify-that-locking-is-working">How to Verify That Locking Is Working</a></p>
</li>
<li><p><a href="#heading-how-to-handle-a-stuck-lock">How to Handle a Stuck Lock</a></p>
</li>
<li><p><a href="#heading-rollback-plan-if-something-goes-wrong">Rollback Plan: If Something Goes Wrong</a></p>
</li>
<li><p><a href="#heading-security-best-practices-for-your-state-bucket">Security Best Practices for Your State Bucket</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-what-is-terraform-state-locking">What is Terraform State Locking?</h2>
<p>Before looking at the new approach, it helps to understand what state locking is solving.</p>
<p>Terraform stores everything it knows about your infrastructure in a <strong>state file</strong> – a JSON document that maps your configuration to real AWS resources. When you run <code>terraform apply</code>, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.</p>
<p>The problem arises when two engineers or two CI/CD pipelines run and try to apply changes at the same time. If both read the state file simultaneously, calculate changes independently, and both try to write back, you get a <strong>race condition</strong>. The second write overwrites changes from the first, and your state is now out of sync with reality. This is a serious problem that can cause resources to be untracked, doubled, or destroyed unexpectedly.</p>
<p><strong>State locking</strong> solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.</p>
<pre><code class="language-plaintext">Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ----------&gt;|                            |
   |                                   |                            |
   |&lt;------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |------- 3. Acquire Lock ---&gt;|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |&lt;------ 4. Lock Denied -----|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ----------&gt;|                            |
   |                                   |                            |
   |           [COMPLETED]             |&lt;------ 6. Lock Granted ----|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |              
   |                                   |                            |
</code></pre>
<h2 id="heading-what-is-s3-native-state-locking">What Is S3 Native State Locking?</h2>
<p>Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a <code>LockID</code> primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.</p>
<p>S3 native locking uses <strong>S3 Object Lock</strong> instead. S3 Object Lock is an S3 feature originally designed to enforce WORM (Write Once, Read Many) compliance for regulatory requirements. AWS extended this capability to support Terraform's state locking workflow.</p>
<p>When S3 native locking is enabled in your Terraform backend:</p>
<ol>
<li><p>Terraform writes your state to an <code>.tfstate</code> object in S3 (as before)</p>
</li>
<li><p>To acquire a lock, Terraform uses <strong>S3's conditional write operations</strong> – specifically the <code>if-none-match</code> conditional header to create a lock file atomically</p>
</li>
<li><p>If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held</p>
</li>
<li><p>When the operation completes, Terraform deletes the lock file to release the lock.</p>
</li>
</ol>
<p>The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.</p>
<p><strong>Note:</strong> This feature requires Terraform version <strong>1.10.0 or later</strong> and an S3 bucket with <strong>Object Lock enabled</strong>. Object Lock must be enabled at bucket creation time. You can't enable it on an existing bucket through the console or CLI. But there is a supported workaround for existing buckets, which we'll cover in Part 2.</p>
<h2 id="heading-how-s3-native-locking-compares-to-the-s3-dynamodb-approach">How S3 Native Locking Compares to the S3 + DynamoDB Approach</h2>
<table>
<thead>
<tr>
<th><strong>Aspect</strong></th>
<th><strong>S3 + DynamoDB (Old)</strong></th>
<th><strong>S3 Native Locking (New)</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>AWS services required</strong></td>
<td>S3 + DynamoDB</td>
<td>S3 only</td>
</tr>
<tr>
<td><strong>IAM permissions needed</strong></td>
<td>S3 + DynamoDB permissions</td>
<td>S3 permissions only</td>
</tr>
<tr>
<td><strong>Terraform version</strong></td>
<td>Any</td>
<td>1.10.0 or later</td>
</tr>
<tr>
<td><strong>Setup complexity</strong></td>
<td>Two resources, two IAM scopes</td>
<td>One resource</td>
</tr>
<tr>
<td><strong>Stuck lock resolution</strong></td>
<td>Delete DynamoDB record</td>
<td>Delete S3 lock file</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>S3 storage + DynamoDB on-demand</td>
<td>S3 storage only</td>
</tr>
<tr>
<td><strong>Object Lock requirement</strong></td>
<td>Not required</td>
<td>Required on S3 bucket</td>
</tr>
<tr>
<td><strong>Locking mechanism</strong></td>
<td>DynamoDB conditional writes</td>
<td>S3 conditional writes (<code>if-none-match</code>)</td>
</tr>
<tr>
<td><strong>State versioning</strong></td>
<td>S3 Versioning (recommended)</td>
<td>S3 Versioning (required for full safety)</td>
</tr>
</tbody></table>
<p>The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following in place:</p>
<ul>
<li><strong>Terraform 1.10.0 or later</strong> installed. Check your version:</li>
</ul>
<pre><code class="language-shell">terraform version
</code></pre>
<p>If you need to upgrade, follow the <a href="https://developer.hashicorp.com/terraform/install">official upgrade guide</a>.</p>
<ul>
<li><strong>AWS CLI</strong> installed and configured with credentials that have permission to create and manage S3 buckets.</li>
</ul>
<pre><code class="language-shell">aws --version
aws sts get-caller-identity   # confirm you're authenticated
</code></pre>
<ul>
<li><p><strong>IAM permissions</strong> to perform the following S3 actions:</p>
<ul>
<li><p><code>s3:CreateBucket</code></p>
</li>
<li><p><code>s3:PutBucketVersioning</code></p>
</li>
<li><p><code>s3:PutBucketEncryption</code></p>
</li>
<li><p><code>s3:PutObjectLegalHold</code></p>
</li>
<li><p><code>s3:PutObjectRetention</code></p>
</li>
<li><p><code>s3:GetObject</code></p>
</li>
<li><p><code>s3:PutObject</code></p>
</li>
<li><p><code>s3:DeleteObject</code></p>
</li>
<li><p><code>s3:ListBucket</code></p>
</li>
</ul>
</li>
<li><p>For the <strong>migration path</strong>: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.</p>
</li>
</ul>
<h2 id="heading-part-1-fresh-setup-how-to-configure-s3-native-locking-from-scratch">Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch</h2>
<p>Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.</p>
<h3 id="heading-step-1-create-the-s3-bucket-with-versioning-and-encryption">Step 1: Create the S3 Bucket with Versioning and Encryption</h3>
<p>Object Lock <strong>must be enabled at bucket creation time</strong>. You can't add it afterward through the standard console flow. Create the bucket using the AWS CLI with Object Lock enabled:</p>
<pre><code class="language-shell">aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket
</code></pre>
<p><strong>Note:</strong> For regions other than <code>us-east-1</code>, add the <code>--create-bucket-configuration</code> flag.</p>
<pre><code class="language-shell">aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
</code></pre>
<p>Now enable versioning on the bucket. Versioning is required alongside Object Lock and allows Terraform to recover previous state versions if something goes wrong:</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled
</code></pre>
<p>Enable server-side encryption so your state files are encrypted at rest:</p>
<pre><code class="language-shell">aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'
</code></pre>
<p>Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:</p>
<pre><code class="language-shell">aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<p>Verify the bucket configuration:</p>
<pre><code class="language-shell"># Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state
</code></pre>
<p>Expected output for the Object Lock check:</p>
<pre><code class="language-json">{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/2b2e56cf-687f-4932-a61e-ed7cc33ea6f1.png" alt="Terminal showing AWS CLI verification commands confirming S3 bucket is configured correctly with Object Lock, versioning, and encryption enabled" style="display:block;margin:0 auto" width="1120" height="616" loading="lazy">

<h3 id="heading-step-2-configure-the-terraform-backend-with-native-locking">Step 2: Configure the Terraform Backend with Native Locking</h3>
<p>In your Terraform project, create or update your <code>backend.tf</code> file:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0+ and a bucket with Object Lock enabled
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}
</code></pre>
<p>The critical difference from the old configuration is the <code>use_lockfile = true</code> parameter. Notice what is <strong>absent</strong>: there's no <code>dynamodb_table</code> argument. No DynamoDB table. No second service.</p>
<p>Here's a direct comparison of the old and new configurations:</p>
<p><strong>Old configuration (S3 + DynamoDB):</strong></p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}
</code></pre>
<p><strong>New configuration (S3 native locking):</strong></p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}
</code></pre>
<h3 id="heading-step-3-initialize-and-verify">Step 3: Initialize and Verify</h3>
<p>Run <code>terraform init</code> to initialize the backend:</p>
<pre><code class="language-shell">terraform init
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!
</code></pre>
<p>Run a plan to confirm everything is working end-to-end:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>If locking is working, you'll see a brief pause while Terraform acquires the lock before the plan output appears. You'll also see the lock information if you look at the S3 bucket&nbsp;– a <code>.tflock</code> file will appear temporarily alongside your state file during the operation and disappear when it completes.</p>
<h2 id="heading-part-2-migration-how-to-move-from-s3-dynamodb-to-s3-native-locking">Part 2: Migration&nbsp;– How to Move from S3 + DynamoDB to S3 Native Locking</h2>
<p>Follow this section if you have an <strong>existing Terraform setup</strong> using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.</p>
<p><strong>Important:</strong> Migration requires a maintenance window or at minimum a period where no Terraform operations are running. You're changing the backend configuration, which means <strong>all team members and CI/CD pipelines must stop running</strong> <code>terraform plan</code> <strong>or</strong> <code>terraform apply</code> <strong>during the migration</strong>. The migration itself takes under 10 minutes.</p>
<h3 id="heading-step-1-verify-your-current-setup">Step 1: Verify Your Current Setup</h3>
<p>Before making any changes, document your existing backend configuration and confirm the state file is accessible:</p>
<pre><code class="language-shell"># Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'
</code></pre>
<p>Check your current <code>backend.tf</code> and note the exact values:</p>
<pre><code class="language-shell"># Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}
</code></pre>
<p>Run one final plan to confirm the current state is clean and there are no unexpected changes pending:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>If the plan shows no changes, you're in a safe state to proceed.</p>
<h3 id="heading-step-2-enable-object-lock-on-the-existing-s3-bucket">Step 2: Enable Object Lock on the Existing S3 Bucket</h3>
<p>This is the most important step in the migration. Object Lock can't normally be enabled on an existing bucket. It's a setting that must be configured at creation time.</p>
<p>But AWS provides a way to enable Object Lock on an existing bucket through a support request or through a direct API call that's not exposed in the standard console UI. AWS has officially documented this path for the Terraform migration use case.</p>
<p>Run the following AWS CLI command to enable Object Lock on your <strong>existing</strong> bucket:</p>
<pre><code class="language-bash">aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'
</code></pre>
<p><strong>Note:</strong> This command enables Object Lock in <strong>governance mode with no default retention</strong>, meaning it enables the locking capability without setting a default retention period on all objects. This is exactly what Terraform's native locking needs: the ability to create and delete lock files, not permanent object retention.</p>
<p>Verify Object Lock is now enabled:</p>
<pre><code class="language-shell">aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket
</code></pre>
<p>Expected output:</p>
<pre><code class="language-json">{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
</code></pre>
<p>Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):</p>
<pre><code class="language-shell">aws s3api get-bucket-versioning \
  --bucket your-existing-bucket
</code></pre>
<p>Expected output:</p>
<pre><code class="language-json">{
    "Status": "Enabled"
}
</code></pre>
<p>If versioning isn't enabled, enable it before proceeding:</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/cd17df01-3d0a-4f93-9250-3f51627e91c8.png" alt="Terminal output showing successful Object Lock enablement on an existing S3 bucket using the AWS CLI" style="display:block;margin:0 auto" width="1204" height="320" loading="lazy">

<h3 id="heading-step-3-update-the-terraform-backend-configuration">Step 3: Update the Terraform Backend Configuration</h3>
<p>Update your <code>backend.tf</code> to remove the <code>dynamodb_table</code> argument and add <code>use_lockfile = true</code>:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket = "your-existing-bucket"
    key    = "path/to/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
 
    # Add this:
    use_lockfile = true
 
    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}
</code></pre>
<p>Your updated <code>backend.tf</code> should look like this:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}
</code></pre>
<h3 id="heading-step-4-reinitialize-terraform">Step 4: Reinitialize Terraform</h3>
<p>Run <code>terraform init</code> with the <code>-reconfigure</code> flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):</p>
<pre><code class="language-shell">terraform init -reconfigure
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!
</code></pre>
<p><strong>If you see an error here:</strong> The most common cause is that Object Lock wasn't successfully enabled on the bucket. Re-run the verification from Step 2 before proceeding.</p>
<h3 id="heading-step-5-verify-the-migration">Step 5: Verify the Migration</h3>
<p>Run a plan to confirm Terraform is working correctly with the new backend configuration:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>The plan should:</p>
<ul>
<li><p>Complete successfully</p>
</li>
<li><p>Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)</p>
</li>
<li><p>NOT mention DynamoDB anywhere in its output</p>
</li>
</ul>
<p>To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:</p>
<pre><code class="language-plaintext">╷
│ Error: Error acquiring the state lock
│
│Error message: operation error S3: PutObject, https response       error StatusCode: 409,
│ RequestID: ..., api error Conflict: Object lock already exists for this key.
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵
</code></pre>
<p>The <code>Path</code> field shows <code>.tfstate.tflock</code>, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/e9abb703-af6e-429c-83bb-2ea2dac43a3a.png" alt="Two terminals showing concurrent terraform plan commands, the second one displays a lock error confirming S3 native locking is working" style="display:block;margin:0 auto" width="1264" height="539" loading="lazy">

<h3 id="heading-step-6-clean-up-the-dynamodb-table">Step 6: Clean Up the DynamoDB Table</h3>
<p>Once you've confirmed the migration is working correctly and your team has run at least one successful <code>plan</code> and <code>apply</code> cycle using the new backend, you can remove the DynamoDB table.</p>
<p><strong>Wait at least 24-48 hours before deleting the DynamoDB table</strong> if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.</p>
<p>When you're ready, delete the DynamoDB table:</p>
<pre><code class="language-shell">aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table
</code></pre>
<p>Confirm the deletion:</p>
<pre><code class="language-shell">aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found
</code></pre>
<p>This error confirms that the table is gone. The migration is complete.</p>
<p>If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run <code>terraform apply</code> to destroy it via Terraform rather than the CLI directly. This keeps your state clean:</p>
<pre><code class="language-hcl"># Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}
</code></pre>
<p>After removing the block, run:</p>
<pre><code class="language-bash">terraform apply
</code></pre>
<p>Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.</p>
<h2 id="heading-how-to-verify-that-locking-is-working">How to Verify That Locking Is Working</h2>
<p>After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.</p>
<h3 id="heading-method-1-observe-the-lock-file-during-an-operation">Method 1: Observe the lock file during an operation</h3>
<p>In one terminal, start a long-running plan against a configuration with many resources:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>While it's running, in a second terminal, check for the lock file in S3:</p>
<pre><code class="language-shell">aws s3 ls s3://your-bucket/path/to/ | grep tflock
</code></pre>
<p>You should see a file like:</p>
<pre><code class="language-plaintext">2026-05-06 14:22:01        512 terraform.tfstate.tflock
</code></pre>
<p>After the plan completes, run the same command again. The <code>.tflock</code> file should be gone.</p>
<h3 id="heading-method-2-read-the-lock-file-contents">Method 2: Read the lock file contents</h3>
<p>While a plan is running, download and read the lock file to see its contents:</p>
<pre><code class="language-shell">aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock &amp;&amp; cat /tmp/current.lock
</code></pre>
<p>Expected output (formatted for readability):</p>
<pre><code class="language-json">{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}
</code></pre>
<p>This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.</p>
<h2 id="heading-how-to-handle-a-stuck-lock">How to Handle a Stuck Lock</h2>
<p>With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the <code>.tflock</code> file from S3.</p>
<p>A lock can get stuck if:</p>
<ul>
<li><p>A <code>terraform apply</code> or <code>plan</code> process was killed mid-execution</p>
</li>
<li><p>A CI/CD pipeline runner crashed during a Terraform operation</p>
</li>
<li><p>A network interruption prevented the lock release from completing</p>
</li>
</ul>
<p>Here's how you can check for a stuck lock:</p>
<pre><code class="language-shell">aws s3 ls s3://your-bucket/path/to/ | grep tflock
</code></pre>
<p>If a <code>.tflock</code> file exists and no Terraform operation is currently running, it is a stuck lock.</p>
<p>You can also read the lock to understand who held it:</p>
<pre><code class="language-shell">aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock &amp;&amp; cat /tmp/stuck.lock
</code></pre>
<p>This tells you who (<code>Who</code> field) was running the operation, what operation it was (<code>Operation</code> field), and when it was acquired (<code>Created</code> field).</p>
<p>And you can force-unlock using Terraform like this:</p>
<pre><code class="language-shell">terraform force-unlock LOCK-ID
</code></pre>
<p>Replace <code>LOCK-ID</code> with the <code>ID</code> value from the lock file contents. For example:</p>
<pre><code class="language-shell">terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890
</code></pre>
<p>Terraform will confirm:</p>
<pre><code class="language-plaintext">Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!
</code></pre>
<p>An alternative is to delete the lock file directly via CLI. If <code>terraform force-unlock</code> doesn't work (for example, because you are running in a CI environment without Terraform available), delete the lock file directly:</p>
<pre><code class="language-shell">aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock
</code></pre>
<p><strong>Only delete the lock file if you are certain no Terraform operation is currently running.</strong> Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.</p>
<h2 id="heading-rollback-plan-if-something-goes-wrong">Rollback Plan: If Something Goes Wrong</h2>
<p>If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.</p>
<p><strong>Step 1: Stop all Terraform operations</strong> in your team and CI/CD pipelines.</p>
<p><strong>Step 2: Recreate the DynamoDB table</strong> if you already deleted it:</p>
<pre><code class="language-shell">aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
</code></pre>
<p><strong>Step 3: Revert</strong> <code>backend.tf</code> to the previous configuration:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}
</code></pre>
<p><strong>Step 4: Reinitialize:</strong></p>
<pre><code class="language-shell">terraform init -reconfigure
</code></pre>
<p><strong>Step 5: Verify:</strong></p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.</p>
<p><strong>Note:</strong> Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock and DynamoDB locking can coexist, Object Lock simply adds a capability to the bucket. Using <code>dynamodb_table</code> in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled on the bucket.</p>
<h2 id="heading-security-best-practices-for-your-state-bucket">Security Best Practices for Your State Bucket</h2>
<p>Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:</p>
<h3 id="heading-enable-versioning-required">Enable Versioning (Required)</h3>
<p>Versioning is a hard requirement for S3 native locking to work safely. It ensures that if a state file is accidentally overwritten or corrupted, you can restore a previous version.</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled
</code></pre>
<h3 id="heading-block-all-public-access-non-negotiable">Block All Public Access (Non-Negotiable)</h3>
<p>Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.</p>
<pre><code class="language-shell">aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<h3 id="heading-enable-server-side-encryption">Enable Server-Side Encryption</h3>
<p>Always encrypt state files at rest. AES256 is the minimum. If your organization requires KMS key management:</p>
<pre><code class="language-shell">aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'
</code></pre>
<h3 id="heading-apply-least-privilege-iam-permissions">Apply Least-Privilege IAM Permissions</h3>
<p>The role or user that Terraform uses to access the state bucket should have only the permissions it needs. Here's a minimal IAM policy for S3 native locking:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-state-bucket",
        "arn:aws:s3:::your-state-bucket/*"
      ]
    },
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "s3:GetObjectLegalHold",
        "s3:PutObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:PutObjectRetention"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*.tflock"
    }
  ]
}
</code></pre>
<p>Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.</p>
<h3 id="heading-enable-access-logging">Enable Access Logging</h3>
<p>Log all access to your state bucket in CloudTrail or S3 server access logs. This gives you an audit trail of every time state was read, written, or locked:</p>
<pre><code class="language-shell">aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.</p>
<p>Here's a summary of what you accomplished:</p>
<ul>
<li><p>Understood what state locking is and why it's required for safe Terraform operations</p>
</li>
<li><p>Compared S3 native locking to the existing S3 + DynamoDB approach</p>
</li>
<li><p>Set up a fresh Terraform backend using S3 native locking with correct bucket configuration</p>
</li>
<li><p>Migrated an existing backend from S3 + DynamoDB to S3 native locking safely</p>
</li>
<li><p>Learned how to verify locking, handle stuck locks, and roll back if needed</p>
</li>
<li><p>Applied security best practices to the state bucket</p>
</li>
</ul>
<p>This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.</p>
<p><em>If you are building or optimizing cloud infrastructure for a startup and want a complete reference for production-ready Terraform modules, CI/CD pipeline patterns, and infrastructure runbooks, check out</em> <a href="https://coachli.co/tolani-akintayo/PR-H4oQS">The Startup DevOps Field Guide</a><em>. It covers the full lifecycle of AWS infrastructure from initial setup to production reliability.</em></p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a href="https://developer.hashicorp.com/terraform/language/backend/s3#use_lockfile">HashiCorp - S3 Backend Configuration: use_lockfile</a></p>
</li>
<li><p><a href="https://github.com/hashicorp/terraform/releases/tag/v1.10.0">HashiCorp: Terraform 1.10 Release Notes</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html">AWS Docs: S3 Object Lock Overview</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectLockConfiguration.html">AWS Docs: PutObjectLockConfiguration API</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html">AWS Docs: S3 Conditional Writes</a></p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/language/state/locking">HashiCorp: Backend State Locking</a></p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/cli/commands/force-unlock">HashiCorp: terraform force-unlock Command</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html">AWS Docs: Enabling S3 Versioning</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html">AWS Docs: S3 Server-Side Encryption</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Complete SOC 2 Type II Implementation Handbook for Engineers: A Month-by-Month Roadmap with Real Commands ]]>
                </title>
                <description>
                    <![CDATA[ If your team is preparing for a SOC 2 Type II review, this handbook is for you. It's a self-contained guide to the exact 90-day timeline, 14 critical controls, and evidence collection infrastructure t ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-complete-soc-2-type-ii-implementation-guide-for-engineers/</link>
                <guid isPermaLink="false">69fa364da386d7f121c468af</guid>
                
                    <category>
                        <![CDATA[ SOC ]]>
                    </category>
                
                    <category>
                        <![CDATA[ compliance  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloud security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ayobami Adejumo ]]>
                </dc:creator>
                <pubDate>Tue, 05 May 2026 18:26:21 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/83d83215-5d73-49f6-a745-d9c6cd0c33f8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If your team is preparing for a SOC 2 Type II review, this handbook is for you. It's a self-contained guide to the exact 90-day timeline, 14 critical controls, and evidence collection infrastructure that auditors actually check.</p>
<p>Everyone publishes the controls list. But nobody publishes the week-by-week engineering calendar you'll need to follow to make sure your ducks are in a row.</p>
<p>Here is the exact 90-day timeline — including the mistakes that add 60 days (and how to avoid them).</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-weeks-1-2-the-scope-decision">Weeks 1–2: The Scope Decision</a></p>
</li>
<li><p><a href="#heading-weeks-3-6-the-14-controls-that-must-be-active-on-day-1">Weeks 3–6: The 14 Controls That Must Be Active on Day 1</a></p>
</li>
<li><p><a href="#heading-weeks-7-10-the-evidence-collection-infrastructure">Weeks 7–10: The Evidence Collection Infrastructure</a></p>
</li>
<li><p><a href="#heading-weeks-11-14-auditor-selection-and-readiness-assessment">Weeks 11–14: Auditor Selection and Readiness Assessment</a></p>
</li>
<li><p><a href="#heading-weeks-15-18-the-observation-period">Weeks 15–18: The Observation Period</a></p>
</li>
<li><p><a href="#heading-the-90-day-soc2-timeline-at-a-glance">The 90-Day SOC2 Timeline at a Glance</a></p>
</li>
<li><p><a href="#heading-whats-next">What's Next</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ol>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>By the end of this guide, you'll know:</p>
<ul>
<li><p>How to scope your SOC2 boundary correctly — the decision that determines everything else</p>
</li>
<li><p>The 14 controls that must be active on day 1 of your observation period</p>
</li>
<li><p>How to build evidence collection infrastructure that runs automatically</p>
</li>
<li><p>How to choose an auditor and run a readiness assessment</p>
</li>
<li><p>What happens during the observation period and how to close gaps without restarting the clock</p>
</li>
</ul>
<p>Let's dive in.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following along, you should have:</p>
<p><strong>Knowledge:</strong></p>
<ul>
<li><p>Basic understanding of AWS services (EC2, RDS, S3, IAM, VPC)</p>
</li>
<li><p>Familiarity with Terraform or another infrastructure as code tool</p>
</li>
<li><p>Comfort reading GitHub Actions YAML workflows</p>
</li>
<li><p>A general understanding of what SOC2 is — if you are starting from scratch, read the <a href="https://www.aicpa-cima.com/resources/landing/system-and-organization-controls-soc-suite-of-services">AICPA's SOC2 overview</a> first</p>
</li>
</ul>
<p><strong>Tools and access:</strong></p>
<ul>
<li><p>An AWS account with administrator access</p>
</li>
<li><p>A GitHub organisation with admin rights</p>
</li>
<li><p>Terraform installed (v1.0 or later)</p>
</li>
<li><p>Python 3.8 or later (for the evidence collector Lambda)</p>
</li>
<li><p>A compliance automation platform — <a href="https://www.vanta.com/">Vanta</a> or <a href="https://drata.com/">Drata</a> — connected to your AWS account and GitHub organisation</p>
</li>
</ul>
<p><strong>Estimated time:</strong> 90 days end-to-end, with active engineering work of approximately 8–12 hours per week in the first six weeks, tapering to 2–4 hours per week during the observation period.</p>
<h2 id="heading-weeks-12-the-scope-decision-what-is-in-and-out-of-your-soc2-boundary">Weeks 1–2: The Scope Decision — What Is In and Out of Your SOC2 Boundary</h2>
<h3 id="heading-what-most-teams-get-wrong">What Most Teams Get Wrong</h3>
<p>Most teams scope their SOC2 boundary too broadly. They include every AWS account, every service, every environment. This is a mistake — and here is exactly why.</p>
<p>A broader scope means more controls to implement, more evidence to collect, and more systems the auditor will examine.</p>
<p>Every system inside your boundary must satisfy all 14 controls. Including your development sandbox means your engineers' experimental environments must have GuardDuty enabled, CloudTrail logging, and branch-protected deployments. That adds weeks of work and months of evidence collection for systems that pose no risk to your customers.</p>
<p>A correctly bounded scope means you include only the systems that store, process, or transmit customer data — and you prove that everything else cannot reach those systems.</p>
<p><strong>Bad scope (over-inclusive):</strong></p>
<pre><code class="language-plaintext">Entire AWS Organization
├── Production (in scope)
├── Staging (in scope)
├── Development (in scope)
├── Sandbox (in scope)
└── CI/CD (in scope)
</code></pre>
<p><strong>Good scope (correctly bounded):</strong></p>
<pre><code class="language-plaintext">SOC2 Boundary
├── Production AWS Account (in scope)
├── Production EKS Cluster (in scope)
├── Production RDS (in scope)
└── Everything else (OUT of scope — proven by network segmentation)
</code></pre>
<p>The correctly bounded scope works because it draws the tightest defensible line around the systems that actually handle customer data. Everything outside that line is excluded — not by assumption, but by technical controls that prevent those systems from reaching anything inside the boundary.</p>
<h3 id="heading-the-scope-decision-framework">The Scope Decision Framework</h3>
<p>For every system in your infrastructure, ask these four questions:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>If YES</th>
<th>If NO</th>
</tr>
</thead>
<tbody><tr>
<td>Does this system store, process, or transmit customer data?</td>
<td>✅ In scope</td>
<td>❌ Out of scope</td>
</tr>
<tr>
<td>Does this system affect the availability of customer-facing services?</td>
<td>✅ In scope</td>
<td>❌ Out of scope</td>
</tr>
<tr>
<td>Does this system have access to production credentials?</td>
<td>✅ In scope</td>
<td>❌ Out of scope</td>
</tr>
<tr>
<td>Can a compromise of this system lead to a customer data breach?</td>
<td>✅ In scope</td>
<td>❌ Out of scope</td>
</tr>
</tbody></table>
<p>Any system where the answer to even one question is yes belongs inside your boundary.</p>
<h3 id="heading-network-segmentation-the-technical-proof-that-your-boundary-holds">Network Segmentation — The Technical Proof That Your Boundary Holds</h3>
<p>Network segmentation is the practice of dividing your infrastructure into isolated zones so that systems in one zone can't communicate with systems in another unless you explicitly allow it.</p>
<p>In the context of SOC2, it's the technical control that proves your out-of-scope systems genuinely can't reach your in-scope systems — not just by policy, but by infrastructure enforcement.</p>
<p>Without network segmentation, the SOC2 auditor can't trust that your boundary is real. A developer in your sandbox environment who can query your production database means the sandbox is effectively in scope, regardless of what your diagram says.</p>
<p>Here's the Terraform that implements network segmentation between your production and non-production environments. The network access control list (NACL) blocks all inbound traffic from the broader private IP range (10.0.0.0/8) into your in-scope production VPC, while the explicit <code>aws_vpc_peering_connection</code> comment documents the deliberate decision not to peer environments:</p>
<pre><code class="language-hcl"># This account has NO VPC peering to non-production environments.
# The absence of peering is itself the segmentation control.
# Do NOT add peering connections to this account without SOC2 scope review.

resource "aws_network_acl" "deny_non_production" {
  vpc_id = aws_vpc.production.id

  # Block all inbound traffic from non-production IP ranges
  ingress {
    rule_no    = 100
    action     = "deny"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "10.0.0.0/8"
  }

  # Allow legitimate inbound traffic (HTTPS from internet)
  ingress {
    rule_no    = 200
    action     = "allow"
    from_port  = 443
    to_port    = 443
    protocol   = "tcp"
    cidr_block = "0.0.0.0/0"
  }

  # Allow all outbound (tighten this per your architecture)
  egress {
    rule_no    = 100
    action     = "allow"
    from_port  = 0
    to_port    = 0
    protocol   = "-1"
    cidr_block = "0.0.0.0/0"
  }

  tags = {
    Name        = "production-nacl"
    Environment = "production"
    Purpose     = "SOC2 network segmentation"
  }
}
</code></pre>
<p>Verify the segmentation with this command after applying the Terraform:</p>
<pre><code class="language-bash"># Confirm no VPC peering connections exist from production to non-production
aws ec2 describe-vpc-peering-connections \
  --filters Name=status-code,Values=active \
  --query 'VpcPeeringConnections[*].{ID:VpcPeeringConnectionId,Requester:RequesterVpcInfo.VpcId,Accepter:AccepterVpcInfo.VpcId}' \
  --output table
</code></pre>
<h3 id="heading-the-deliverable-your-soc2-boundary-diagram">The Deliverable: Your SOC2 Boundary Diagram</h3>
<p>At the end of weeks 1–2, you need a boundary diagram — a visual document that shows every in-scope system, every out-of-scope system, and the segmentation controls between them.</p>
<p>Here is what the diagram should contain:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69d00d5be466e2b76263a583/29dfe0c8-f455-44af-8562-8d088f8a111a.png" alt="29dfe0c8-f455-44af-8562-8d088f8a111a" style="display:block;margin:0 auto" width="611" height="686" loading="lazy">

<p>Include every AWS service, every data flow arrow, and a label on the segmentation control. This diagram becomes your primary scope evidence and is typically the first thing an auditor asks for.</p>
<h2 id="heading-weeks-36-the-14-controls-that-must-be-active-on-day-1">Weeks 3–6: The 14 Controls That Must Be Active on Day 1</h2>
<p>These 14 controls must be implemented and actively collecting evidence from day 1 of your observation period. If you add any of them late, the observation period clock for that control restarts from the implementation date — not from day 1 of the audit period.</p>
<p>Think of the observation period as a surveillance camera recording your infrastructure. The auditor watches the footage later. If the camera was not on when a specific event occurred, that event has no record — and the SOC2 control for it has a gap.</p>
<h3 id="heading-control-1-mfa-enforcement-cc66">Control 1: MFA Enforcement (CC6.6)</h3>
<p>Multi-Factor Authentication (MFA) requires a user to verify their identity using two independent factors — something they know (a password) and something they have (a phone or hardware key). Without MFA, a stolen password is sufficient to access your production systems.</p>
<p>SOC2 CC6.6 requires that access to systems is restricted to authorized users. MFA is the technical control that makes "authorized" meaningful. Without it, any password compromise is a production access event.</p>
<p>To implement MFA, you can use AWS IAM Identity Center (formerly SSO) connected to your identity provider (Okta, Google Workspace, or Azure AD). MFA is then enforced at the identity provider level — any user without MFA enrolled can't authenticate, regardless of which AWS service they're trying to reach.</p>
<pre><code class="language-hcl"># IAM Identity Center configuration — MFA is enforced at the IdP level.
# No IAM user has direct console or CLI access.
# All access goes through SSO sessions (8-hour expiry by default).

resource "aws_ssoadmin_instance_access_control_attributes" "mfa" {
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]

  attribute {
    key = "email"
    value {
      source = ["$${path:email}"]
    }
  }
}
</code></pre>
<p>You can verify that no IAM users retain direct console access (which would bypass MFA):</p>
<pre><code class="language-bash"># Any user listed here has direct console access bypassing SSO — investigate immediately
aws iam list-users \
  --query 'Users[?PasswordLastUsed!=`null`].[UserName,PasswordLastUsed]' \
  --output table
</code></pre>
<h3 id="heading-control-2-infrastructure-as-code-cc81">Control 2: Infrastructure as Code (CC8.1)</h3>
<p>Infrastructure as Code (IaC) means defining your cloud infrastructure in version-controlled code files (Terraform, Pulumi, or AWS CDK) rather than creating resources manually through the AWS console. Every infrastructure change is proposed in a pull request, reviewed by a colleague, and applied through an automated pipeline.</p>
<p>SOC2 CC8.1 covers change management — the requirement that every change to your production environment is documented, reviewed, and approved. Manual console changes produce no audit trail. If an engineer opens the AWS console and creates a security group without going through Terraform, that change is invisible to your SOC2 auditor. IaC makes every change reviewable and traceable.</p>
<p>Now let's see how to implement IaC here. This GitHub Actions workflow applies Terraform only from the main branch, after a pull request has been reviewed and approved. The workflow creates an immutable record of every infrastructure change:</p>
<pre><code class="language-yaml"># .github/workflows/terraform-apply.yml
name: Terraform Apply (Production)
on:
  push:
    branches: [main]
    paths: ['terraform/**']

permissions:
  id-token: write   # Required for AWS OIDC authentication
  contents: read

jobs:
  apply:
    name: Apply Infrastructure Changes
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval for production

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials (OIDC — no long-lived keys)
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/terraform-apply
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: "1.6.0"

      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -out=tfplan -input=false

      - name: Terraform Apply
        run: terraform apply -input=false tfplan
</code></pre>
<p>SOC2 evidence this produces: A GitHub Actions run log for every infrastructure change, showing who triggered it (the pull request author), when it was applied, and what changed.</p>
<h3 id="heading-control-3-cloudtrail-enabled-cc71">Control 3: CloudTrail Enabled (CC7.1)</h3>
<p>AWS CloudTrail is a service that records every API call made in your AWS account — who called it, when, from which IP address, and whether it succeeded. Think of it as the complete audit log of everything that has ever happened in your AWS environment.</p>
<p>SOC2 CC7.1 requires monitoring for security events. CloudTrail is the foundational logging layer — without it, you can't detect unauthorized access, investigate incidents, or prove to an auditor that your controls were operating as intended. An auditor who can't see historical AWS API activity can't verify that your access controls were enforced during the observation period.</p>
<p>To implement it, you'll want to enable multi-region CloudTrail so that activity in every AWS region is captured, including global services like IAM. You can ship logs to an S3 bucket with Object Lock enabled (Control 3 in the evidence collection section covers this) so logs can't be modified or deleted:</p>
<pre><code class="language-bash"># Enable CloudTrail with log file validation and multi-region coverage
aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name your-cloudtrail-logs-bucket \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events

# Start the trail (creation alone does not start logging)
aws cloudtrail start-logging --name production-audit-trail

# Verify the trail is active and logging
aws cloudtrail get-trail-status --name production-audit-trail \
  --query '{IsLogging:IsLogging,LatestDeliveryTime:LatestDeliveryTime}'
</code></pre>
<h3 id="heading-control-4-guardduty-enabled-cc72">Control 4: GuardDuty Enabled (CC7.2)</h3>
<p>AWS GuardDuty is a threat detection service that analyses your CloudTrail logs, VPC Flow Logs, and DNS logs. It uses machine learning to identify suspicious behaviour — things like an EC2 instance communicating with a known malware server, an IAM user logging in from an unusual country, or unusual API call patterns that indicate credential theft.</p>
<p>SOC2 CC7.2 requires the use of detection tools to identify potential security events. GuardDuty is the monitoring layer that tells you when something anomalous is happening, not just what happened after the fact. Without it, you would only discover a compromise when the damage is done.</p>
<p>Here's the implementation:</p>
<pre><code class="language-bash"># Enable GuardDuty — findings published every 15 minutes for active threats
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency FIFTEEN_MINUTES

# Verify GuardDuty is active
aws guardduty list-detectors --query 'DetectorIds' --output table
</code></pre>
<p>You can set up an EventBridge rule to route CRITICAL and HIGH severity GuardDuty findings to your incident response channel immediately. A finding sitting unreviewed for 90 days is a qualified SOC2 finding.</p>
<h3 id="heading-control-5-vpc-flow-logs-cc61">Control 5: VPC Flow Logs (CC6.1)</h3>
<p>VPC Flow Logs capture information about the IP traffic flowing through your Virtual Private Cloud — every accepted and rejected connection, including source IP, destination IP, port, protocol, and whether the traffic was allowed or denied. They are the network-level audit trail that CloudTrail doesn't provide.</p>
<p>SOC2 CC6.1 requires logical access controls and monitoring. VPC Flow Logs let you verify that your network segmentation is actually working (traffic you denied is showing as rejected in the logs), detect unexpected communication between services, and investigate security events at the network layer.</p>
<pre><code class="language-bash"># Create an IAM role for VPC Flow Logs to deliver to CloudWatch
aws iam create-role \
  --role-name vpc-flow-logs-role \
  --assume-role-policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Principal":{"Service":"vpc-flow-logs.amazonaws.com"},
      "Action":"sts:AssumeRole"
    }]
  }'

# Enable VPC Flow Logs for all traffic (ACCEPT and REJECT)
aws ec2 create-flow-logs \
  --resource-ids vpc-YOUR_PRODUCTION_VPC_ID \
  --resource-type VPC \
  --traffic-type ALL \
  --log-group-name /aws/vpc/flow-logs/production \
  --deliver-log-permission-arn arn:aws:iam::YOUR_ACCOUNT_ID:role/vpc-flow-logs-role

# Verify flow logs are active
aws ec2 describe-flow-logs \
  --filter Name=resource-id,Values=vpc-YOUR_PRODUCTION_VPC_ID \
  --query 'FlowLogs[*].{Status:FlowLogStatus,LogGroup:LogGroupName}'
</code></pre>
<h3 id="heading-control-6-secrets-manager-cc67">Control 6: Secrets Manager (CC6.7)</h3>
<p>Secrets management means storing credentials (database passwords, API keys, certificates, and other sensitive configuration values) in a dedicated, access-controlled service (like AWS Secrets Manager or HashiCorp Vault) rather than in <code>.env</code> files, GitHub repository secrets, or hardcoded in application code.</p>
<p>SOC2 CC6.7 requires protecting sensitive system components from unauthorized access. A secret stored in an <code>.env</code> file committed to a repository is accessible to every developer with repo access, every CI/CD runner, and every engineer who has ever cloned the repo — including those who have since left the company.</p>
<p>A Secrets Manager provides centralised storage, access logging, automatic rotation, and fine-grained IAM permissions so only specific services can retrieve specific secrets.</p>
<p>Let's look at the implementation — storing and rotating a secret:</p>
<pre><code class="language-bash"># Store a database credential with automatic 90-day rotation
aws secretsmanager create-secret \
  --name production/postgresql/credentials \
  --description "Production PostgreSQL credentials — rotated every 90 days" \
  --secret-string '{
    "username": "app_user",
    "password": "REPLACE_WITH_STRONG_PASSWORD",
    "host": "your-rds-endpoint.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "dbname": "production"
  }'

# Enable automatic rotation every 90 days
aws secretsmanager rotate-secret \
  --secret-id production/postgresql/credentials \
  --rotation-rules AutomaticallyAfterDays=90
</code></pre>
<p>How your application retrieves the secret at runtime (no hardcoded credentials):</p>
<pre><code class="language-python"># Good: secret retrieved at runtime from Secrets Manager
import boto3
import json

def get_db_credentials():
    client = boto3.client('secretsmanager', region_name='us-east-1')
    response = client.get_secret_value(SecretId='production/postgresql/credentials')
    return json.loads(response['SecretString'])

# Bad: secret hardcoded in application code or .env file
DB_PASSWORD = "my_database_password_123"  # Never do this
</code></pre>
<p>The access log in CloudTrail records every time a secret is retrieved, by which IAM role, at what time. That log is your SOC2 evidence that secrets access is controlled and auditable.</p>
<h3 id="heading-control-7-ebs-encryption-cc61">Control 7: EBS Encryption (CC6.1)</h3>
<p>EBS (Elastic Block Store) encryption ensures that the persistent disks attached to your EC2 instances and used by your RDS databases are encrypted at rest using AES-256. If an AWS employee or an attacker gained physical access to the storage hardware, the data would be unreadable without the encryption key.</p>
<p>SOC2 CC6.1 requires protecting information assets from unauthorised access. Encryption at rest is the control that protects data in the event of physical storage compromise or an improperly decommissioned disk. Enabling it account-wide means every new EBS volume is encrypted automatically, including RDS storage, EKS node volumes, and EC2 instance root volumes.</p>
<pre><code class="language-bash"># Enable EBS encryption by default for all new volumes in this region
aws ec2 enable-ebs-encryption-by-default

# Verify it is enabled
aws ec2 get-ebs-encryption-by-default \
  --query 'EbsEncryptionByDefault'
# Expected output: true

# Check existing volumes — any showing false need to be migrated
aws ec2 describe-volumes \
  --query 'Volumes[?Encrypted==`false`].[VolumeId,Size,VolumeType]' \
  --output table
</code></pre>
<p>Any existing unencrypted volumes must be snapshot-and-replaced. The process: create a snapshot of the unencrypted volume, create a new encrypted volume from the snapshot, and swap it into the instance.</p>
<h3 id="heading-control-8-s3-block-public-access-cc61">Control 8: S3 Block Public Access (CC6.1)</h3>
<p>Amazon S3 buckets can be configured to allow public access — meaning anyone on the internet can read their contents without authentication. Block Public Access is an account-level and bucket-level setting that prevents any bucket from being made public, regardless of the bucket's own policy.</p>
<p>A misconfigured S3 bucket is one of the most common causes of data breaches in cloud environments. Block Public Access at the account level means a developer can't accidentally expose a bucket containing customer data, even if they set the wrong bucket policy. It's a guardrail, not just a policy.</p>
<pre><code class="language-bash"># Block public access at the AWS account level — applies to all buckets
aws s3control put-public-access-block \
  --account-id YOUR_ACCOUNT_ID \
  --public-access-block-configuration \
    BlockPublicAcls=true,\
    IgnorePublicAcls=true,\
    BlockPublicPolicy=true,\
    RestrictPublicBuckets=true

# Verify account-level setting is active
aws s3control get-public-access-block \
  --account-id YOUR_ACCOUNT_ID

# Scan for any buckets that have public access enabled (should be zero)
aws s3api list-buckets --query 'Buckets[*].Name' --output text | \
  tr '\t' '\n' | while read bucket; do
    result=\((aws s3api get-public-access-block --bucket "\)bucket" 2&gt;/dev/null)
    if echo "$result" | grep -q '"BlockPublicAcls": false'; then
      echo "WARNING: $bucket has public access not fully blocked"
    fi
  done
</code></pre>
<h3 id="heading-control-9-branch-protection-cc81">Control 9: Branch Protection (CC8.1)</h3>
<p>Branch protection is a GitHub setting that prevents engineers from pushing code directly to your main branch without going through a pull request that has been reviewed and approved by at least one other team member. It also requires your CI pipeline to pass before any code can be merged.</p>
<p>SOC2 CC8.1 requires change management — the requirement that every change to production systems is documented, reviewed, and approved. Without branch protection, an engineer can push directly to main, which deploys directly to production through your CI/CD pipeline, with no review and no audit trail. Branch protection is the technical enforcement of your change management policy.</p>
<p>The critical setting that most teams miss: the "Do not allow bypassing the above settings" option must be enabled. Without it, administrators can bypass branch protection — and a SOC2 auditor will flag this as a gap because it means your change management control can be circumvented.</p>
<pre><code class="language-yaml"># .github/settings.yml — enforces branch protection via code
# Requires the settings GitHub App: https://github.com/apps/settings

branches:
  - name: main
    protection:
      required_pull_request_reviews:
        required_approving_review_count: 1
        dismiss_stale_reviews: true
        require_code_owner_reviews: false
      required_status_checks:
        strict: true
        contexts:
          - "CI / test"
          - "Security / trivy-scan"
      enforce_admins: true         # Admins cannot bypass — this is critical
      restrictions: null           # No push restriction beyond the above
      allow_force_pushes: false
      allow_deletions: false
</code></pre>
<p>Here's how you can verify that branch protection is enforced and admins can't bypass it:</p>
<pre><code class="language-bash"># Returns the branch protection rules including enforce_admins status
curl -H "Authorization: token YOUR_GITHUB_TOKEN" \
  https://api.github.com/repos/YOUR_ORG/YOUR_REPO/branches/main/protection \
  | jq '{enforce_admins: .enforce_admins.enabled, required_reviews: .required_pull_request_reviews.required_approving_review_count}'
</code></pre>
<h3 id="heading-control-10-container-image-scanning-cc74">Control 10: Container Image Scanning (CC7.4)</h3>
<p>Container image scanning analyses your Docker images before deployment to identify known security vulnerabilities (CVEs) in the operating system packages and application dependencies they contain.</p>
<p>Trivy is an open-source scanner that checks the base image (Ubuntu, Alpine, and so on), all installed OS packages, and language-specific dependencies (npm, pip, Go modules) against the National Vulnerability Database.</p>
<p>SOC2 CC7.4 requires monitoring and identifying vulnerabilities. Every container you deploy contains a base image with OS packages — and those packages regularly receive CVE disclosures. A critical CVE left unpatched for 90 days in a production container is a SOC2 finding. Automated scanning in CI means every image is checked before it can deploy.</p>
<pre><code class="language-yaml"># .github/workflows/security-scan.yml
name: Security Scan
on: [push, pull_request]

jobs:
  trivy-scan:
    name: Container Vulnerability Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Build container image
        run: docker build -t app:${{ github.sha }} .

      - name: Scan image for vulnerabilities
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: app:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          severity: CRITICAL,HIGH
          exit-code: 1          # Fail the pipeline on CRITICAL or HIGH findings

      - name: Upload results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v2
        if: always()            # Upload even if scan found issues
        with:
          sarif_file: trivy-results.sarif
</code></pre>
<p>The scanner looks for:</p>
<ul>
<li><p>CVEs in base image OS packages (for example, a critical OpenSSL vulnerability in your Ubuntu base)</p>
</li>
<li><p>Vulnerable versions of application dependencies (a known RCE in an npm package your app uses)</p>
</li>
<li><p>Misconfigurations in the Dockerfile itself (running as root, using <code>latest</code> tags)</p>
</li>
</ul>
<p>Results appear in the GitHub Security tab for your repository, giving you a historical record of every scan — which is your SOC2 evidence.</p>
<h3 id="heading-control-11-incident-response-plan-cc92">Control 11: Incident Response Plan (CC9.2)</h3>
<p>An incident response plan is a written, tested procedure that defines exactly what your team does when a security event occurs — from the moment an alert fires through to customer notification and post-incident review.</p>
<p>SOC2 CC9.2 requires that you have a documented process for responding to security events and that you've tested it. The auditor will ask for the written runbook and evidence that a tabletop exercise (a simulated incident walkthrough) has been conducted within the observation period.</p>
<p>Your incident response runbook must include:</p>
<ol>
<li><p><strong>Severity classification:</strong> Definitions of P1 (production down, customer data at risk), P2 (degraded service, potential risk), and P3 (minor issue, no customer impact) — and the response SLA for each.</p>
</li>
<li><p><strong>Escalation path:</strong> Exactly who gets paged at each severity level, with contact details. Not "the on-call engineer" — specific names and a backup if the first person doesn't respond within 10 minutes.</p>
</li>
<li><p><strong>First 15 minutes:</strong> The specific steps to take immediately — isolate the affected system, assess the scope, notify the incident channel, begin the timeline log.</p>
</li>
<li><p><strong>Communication templates:</strong> Pre-written Slack messages, customer email templates, and regulatory notification templates (GDPR requires notification within 72 hours, HIPAA within 60 days).</p>
</li>
<li><p><strong>Post-incident review:</strong> The blameless postmortem process, the <a href="https://www.freecodecamp.org/news/from-symptoms-to-root-cause-how-to-use-the-5-whys-technique/">5-why</a> root cause analysis template, and the action item tracking process.</p>
</li>
</ol>
<p>Conduct a tabletop exercise at least once during your observation period: gather your engineering team for 45 minutes, simulate a realistic scenario (for example, "an AWS access key was committed to a public GitHub repo"), and walk through the runbook together. Document the meeting date, attendees, scenario, gaps found, and remediation actions. This document is your evidence.</p>
<h3 id="heading-control-12-access-reviews-cc63">Control 12: Access Reviews (CC6.3)</h3>
<p>An access review is a quarterly audit of who has access to what in your production systems — AWS accounts, GitHub repositories, production databases, and every SaaS tool that touches customer data. You verify that every person on the list still works at the company and still needs the access their role grants them.</p>
<p>SOC2 CC6.3 requires that access is revoked when it's no longer needed. Former employees who retain access to production AWS accounts represent a genuine security risk and a definitive SOC2 finding.</p>
<p>In every access review I've conducted, at least 3–5 former employees or contractors still had active access they should not.</p>
<p>The quarterly access review checklist:</p>
<pre><code class="language-bash"># 1. IAM users — list all with their last login date
aws iam generate-credential-report
aws iam get-credential-report --output text --query Content \
  | base64 --decode | cut -d',' -f1,5 | column -t -s ','

# 2. IAM roles — find roles that have not been used in 90+ days
aws iam get-account-authorization-details \
  --query 'RoleDetailList[*].{Role:RoleName,LastUsed:RoleLastUsed.LastUsedDate}' \
  --output table

# 3. Verify AWS SSO user list matches your current employee list
aws identitystore list-users \
  --identity-store-id YOUR_IDENTITY_STORE_ID \
  --query 'Users[*].{Name:DisplayName,Email:Emails[0].Value}' \
  --output table
</code></pre>
<p>Cross-reference the output against your current employee list in your HR system. Document every change made — access removed, permissions reduced, accounts disabled. The documented changes are the evidence that the review was conducted meaningfully, not just as a checkbox exercise.</p>
<h3 id="heading-control-13-backup-verification-cc95">Control 13: Backup Verification (CC9.5)</h3>
<p>Backup verification is the process of actually restoring your backups to confirm they work — not just confirming that backups are being created. A backup that has never been tested doesn't exist from a recovery perspective.</p>
<p>SOC2 CC9.5 requires that recovery procedures are tested. If your production database is corrupted and you discover for the first time during the incident that your automated RDS snapshots can't be restored, you have both a disaster recovery failure and a SOC2 finding.</p>
<p>How to test your RDS backup:</p>
<pre><code class="language-bash"># Step 1: Find your most recent production snapshot
aws rds describe-db-snapshots \
  --db-instance-identifier your-production-db \
  --query 'sort_by(DBSnapshots, &amp;SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text

# Step 2: Restore the snapshot to a test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier backup-verification-test \
  --db-snapshot-identifier YOUR_SNAPSHOT_ID \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible \
  --tags Key=Purpose,Value=backup-verification Key=Environment,Value=test

# Step 3: Wait for the restore to complete (typically 5–15 minutes)
aws rds wait db-instance-available \
  --db-instance-identifier backup-verification-test

# Step 4: Connect and verify data integrity (spot check key tables)
# Run this against the restored instance
psql -h RESTORED_INSTANCE_ENDPOINT -U your_user -d your_database \
  -c "SELECT COUNT(*) FROM users; SELECT MAX(created_at) FROM orders;"

# Step 5: Document the test result and delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier backup-verification-test \
  --skip-final-snapshot
</code></pre>
<p>Document the test date, the snapshot used, the restore time, the data verification query results, and who conducted the test. Run this quarterly at minimum. This documentation is your SOC2 evidence for CC9.5.</p>
<h3 id="heading-control-14-change-management-log-cc81">Control 14: Change Management Log (CC8.1)</h3>
<p>A change management log is the auditable record of every change made to your production environment — what changed, who approved it, and when it was applied.</p>
<p>SOC2 CC8.1 requires that changes to your production environment are authorized and documented. With IaC and GitOps in place, you already have two separate sources of immutable change history that together satisfy this control.</p>
<p><strong>GitHub Pull Request history</strong> provides the record of every code and infrastructure change: who opened the PR, who reviewed and approved it, what the CI status was, and when it was merged. This is your change management log for application and infrastructure changes.</p>
<p><strong>ArgoCD sync history</strong> provides the record of every deployment to your Kubernetes cluster: which application was synced, from which Git commit, at what time, and whether the sync succeeded.</p>
<p>To export the ArgoCD sync history as evidence:</p>
<pre><code class="language-bash"># Export ArgoCD application sync history as JSON evidence
argocd app history YOUR_APP_NAME --output json &gt; argocd-sync-history-$(date +%Y%m).json

# Upload to your SOC2 evidence bucket
aws s3 cp argocd-sync-history-$(date +%Y%m).json \
  s3://your-soc2-evidence-bucket/change-management/$(date +%Y/%m)/

# For each deployment, the evidence contains:
# - App name, deployed revision (Git commit SHA)
# - Deployment timestamp
# - Initiating user or automated sync
# - Success/failure status
</code></pre>
<p>Together, the GitHub PR history and the ArgoCD sync history give the auditor a complete, tamper-evident record of every change to your production environment during the observation period.</p>
<h2 id="heading-weeks-710-the-evidence-collection-infrastructure">Weeks 7–10: The Evidence Collection Infrastructure</h2>
<p>Evidence is the difference between passing and failing SOC2.</p>
<p>You might be wondering: what exactly is evidence? In SOC2 terms, evidence is the documentation that proves a specific control was operating correctly during a specific point in time within the observation period. A policy document says you will do something. Evidence proves you did it — and that you did it continuously, not just the week before the audit.</p>
<p>For example:</p>
<ul>
<li><p>For MFA enforcement (Control 1), evidence is a screenshot of your IAM Identity Center MFA settings taken at a specific date during the observation period, combined with an IAM credential report showing zero IAM users with console access.</p>
</li>
<li><p>For GuardDuty (Control 4), evidence is the GuardDuty console screenshot showing active detectors, plus your documented response to any findings during the period.</p>
</li>
<li><p>For access reviews (Control 12), evidence is the completed access review document with dates, names, and specific access changes made.</p>
</li>
</ul>
<p>The challenge is collecting this evidence continuously across 3–12 months without spending hundreds of hours on manual work. The solution is automated evidence collection infrastructure.</p>
<h3 id="heading-the-evidence-bucket-tamper-proof-storage-for-your-audit-evidence">The Evidence Bucket — Tamper-Proof Storage for Your Audit Evidence</h3>
<p>The evidence bucket is an S3 bucket with Object Lock enabled in GOVERNANCE mode. Object Lock prevents any object from being deleted or modified for the retention period you specify — in this case, 365 days. This means once a piece of evidence is uploaded, it can't be altered, even by a user with administrator access (without explicitly overriding the lock, which itself creates an audit trail).</p>
<p>This tamper-evident property is what gives the auditor confidence that the evidence was not created or modified after the fact.</p>
<pre><code class="language-hcl"># terraform/soc2-evidence-bucket.tf

resource "aws_s3_bucket" "soc2_evidence" {
  bucket = "\({var.company_name}-soc2-evidence-\){var.environment}"
}

# Block all public access to the evidence bucket
resource "aws_s3_bucket_public_access_block" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning so overwrites create new versions, not replacements
resource "aws_s3_bucket_versioning" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Object Lock in GOVERNANCE mode — objects cannot be deleted for 365 days
resource "aws_s3_bucket_object_lock_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    default_retention {
      mode = "GOVERNANCE"
      days = 365
    }
  }
}

# Encrypt all evidence at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "soc2_evidence" {
  bucket = aws_s3_bucket.soc2_evidence.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
</code></pre>
<h3 id="heading-the-daily-evidence-collector-lambda">The Daily Evidence Collector Lambda</h3>
<p>This Lambda function runs automatically every day and exports the status of each critical control to a time-stamped JSON file in the evidence bucket. Over your 3–12 month observation period, it creates a daily record proving that your controls were active and operating.</p>
<p>The function checks seven controls automatically: CloudTrail status, GuardDuty status, VPC Flow Logs, S3 public access block, EBS encryption, MFA compliance, and GuardDuty finding count. Each daily snapshot is uploaded with Object Lock enabled so it can't be modified.</p>
<pre><code class="language-python"># lambda/evidence-collector/handler.py

import boto3
import json
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    """
    Daily SOC2 evidence collector.
    Runs at 00:00 UTC every day via EventBridge scheduler.
    Exports control status to S3 evidence bucket with Object Lock.
    """
    evidence = {
        'collection_timestamp': datetime.now(timezone.utc).isoformat(),
        'collection_date': datetime.now(timezone.utc).strftime('%Y-%m-%d'),
        'account_id': boto3.client('sts').get_caller_identity()['Account'],
        'controls': {}
    }

    # Control 3: CloudTrail status
    cloudtrail = boto3.client('cloudtrail')
    trails = cloudtrail.describe_trails(includeShadowTrails=False)['trailList']
    multi_region_trails = [t for t in trails if t.get('IsMultiRegionTrail')]
    evidence['controls']['cloudtrail'] = {
        'status': 'PASS' if multi_region_trails else 'FAIL',
        'detail': f"{len(multi_region_trails)} multi-region trail(s) active",
        'trails': [t['Name'] for t in multi_region_trails]
    }

    # Control 4: GuardDuty status
    guardduty = boto3.client('guardduty')
    detectors = guardduty.list_detectors()['DetectorIds']
    unresolved_critical = 0
    for detector_id in detectors:
        findings = guardduty.list_findings(
            DetectorId=detector_id,
            FindingCriteria={
                'Criterion': {
                    'severity': {'Gte': 7},  # HIGH and CRITICAL only
                    'service.archived': {'Eq': ['false']}
                }
            }
        )
        unresolved_critical += len(findings['FindingIds'])

    evidence['controls']['guardduty'] = {
        'status': 'PASS' if detectors else 'FAIL',
        'detail': f"{len(detectors)} detector(s) active, {unresolved_critical} unresolved HIGH/CRITICAL findings",
        'unresolved_high_critical': unresolved_critical
    }

    # Control 5: VPC Flow Logs
    ec2 = boto3.client('ec2')
    flow_logs = ec2.describe_flow_logs(
        Filters=[{'Name': 'resource-type', 'Values': ['VPC']},
                 {'Name': 'flow-log-status', 'Values': ['ACTIVE']}]
    )['FlowLogs']
    evidence['controls']['vpc_flow_logs'] = {
        'status': 'PASS' if flow_logs else 'FAIL',
        'detail': f"{len(flow_logs)} active VPC flow log(s)",
        'active_flow_logs': len(flow_logs)
    }

    # Control 7: EBS encryption by default
    ebs_encryption = ec2.get_ebs_encryption_by_default()['EbsEncryptionByDefault']
    evidence['controls']['ebs_encryption_by_default'] = {
        'status': 'PASS' if ebs_encryption else 'FAIL',
        'detail': 'EBS encryption by default is enabled' if ebs_encryption else 'EBS encryption by default is NOT enabled'
    }

    # Control 8: S3 Block Public Access (account level)
    s3control = boto3.client('s3control')
    account_id = boto3.client('sts').get_caller_identity()['Account']
    try:
        pab = s3control.get_public_access_block(AccountId=account_id)['PublicAccessBlockConfiguration']
        all_blocked = all([pab['BlockPublicAcls'], pab['IgnorePublicAcls'],
                           pab['BlockPublicPolicy'], pab['RestrictPublicBuckets']])
        evidence['controls']['s3_block_public_access'] = {
            'status': 'PASS' if all_blocked else 'FAIL',
            'detail': 'All four S3 Block Public Access settings enabled' if all_blocked else 'One or more S3 Block Public Access settings not enabled',
            'configuration': pab
        }
    except Exception as e:
        evidence['controls']['s3_block_public_access'] = {'status': 'FAIL', 'detail': str(e)}

    # Upload evidence to S3 with Object Lock
    s3 = boto3.client('s3')
    evidence_key = f"daily/{evidence['collection_date']}/control-status.json"
    lock_until = datetime.now(timezone.utc) + timedelta(days=365)

    s3.put_object(
        Bucket='YOUR_EVIDENCE_BUCKET_NAME',
        Key=evidence_key,
        Body=json.dumps(evidence, indent=2),
        ContentType='application/json',
        ObjectLockMode='GOVERNANCE',
        ObjectLockRetainUntilDate=lock_until
    )

    # Alert if any control fails
    failed_controls = [k for k, v in evidence['controls'].items() if v['status'] == 'FAIL']
    if failed_controls:
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='YOUR_ALERT_TOPIC_ARN',
            Subject=f'SOC2 Control Failure Detected — {evidence["collection_date"]}',
            Message=f'The following controls failed their daily check:\n\n{json.dumps(failed_controls, indent=2)}'
        )

    return {
        'statusCode': 200,
        'controls_checked': len(evidence['controls']),
        'controls_failed': len(failed_controls),
        'evidence_location': f"s3://YOUR_EVIDENCE_BUCKET_NAME/{evidence_key}"
    }
</code></pre>
<h3 id="heading-the-github-actions-evidence-workflow">The GitHub Actions Evidence Workflow</h3>
<p>This workflow runs daily and captures evidence that can't be automated through AWS APIs — GitHub-level controls like branch protection status, recent pull request activity, and CI pipeline results. It exports these as JSON files to the same evidence bucket.</p>
<pre><code class="language-yaml"># .github/workflows/soc2-evidence.yml
name: SOC2 Evidence Collection
on:
  schedule:
    - cron: '0 1 * * *'   # 01:00 UTC daily (after the Lambda runs at 00:00)
  workflow_dispatch:        # Allow manual trigger when needed

permissions:
  contents: read

jobs:
  collect-github-evidence:
    name: Collect GitHub Control Evidence
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/evidence-collector
          aws-region: us-east-1

      - name: Collect branch protection status
        run: |
          DATE=$(date +%Y-%m-%d)
          mkdir -p evidence/github

          # Export branch protection rules for main
          curl -s -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
            "https://api.github.com/repos/${{ github.repository }}/branches/main/protection" \
            | jq '{
                date: "'$DATE'",
                enforce_admins: .enforce_admins.enabled,
                required_reviews: .required_pull_request_reviews.required_approving_review_count,
                required_status_checks: .required_status_checks.contexts,
                allow_force_pushes: .allow_force_pushes.enabled
              }' &gt; evidence/github/branch-protection-$DATE.json

          echo "Branch protection evidence collected"
          cat evidence/github/branch-protection-$DATE.json

      - name: Upload evidence to S3
        run: |
          DATE=$(date +%Y-%m-%d)
          aws s3 sync evidence/ \
            s3://\({{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/\)DATE/github/ \
            --no-progress
          echo "Evidence uploaded: s3://\({{ secrets.SOC2_EVIDENCE_BUCKET }}/daily/\)DATE/github/"
</code></pre>
<h2 id="heading-weeks-1114-auditor-selection-and-readiness-assessment">Weeks 11–14: Auditor Selection and Readiness Assessment</h2>
<h3 id="heading-how-to-choose-a-soc2-auditor">How to Choose a SOC2 Auditor</h3>
<p>Selecting the right auditor is more consequential than most teams realize. SOC2 audits are conducted by CPA firms — specifically, firms licensed to issue SOC reports. The right firm has experience with cloud-native, SaaS companies your size. The wrong firm could apply enterprise audit frameworks to a seed-stage startup and generate findings based on controls that aren't appropriate to your context.</p>
<p>Here is what to look for and what to watch out for:</p>
<h4 id="heading-experience-matters-more-than-brand">Experience matters more than brand</h4>
<p>A large Big Four firm isn't necessarily better than a specialist boutique auditor for a 20-person SaaS company.</p>
<p>Ask specifically: "How many SOC2 audits have you completed in the last 12 months for SaaS companies between 10 and 50 employees?" You want a firm where this is common, not exceptional.</p>
<h4 id="heading-verify-familiarity-with-your-compliance-tool">Verify familiarity with your compliance tool</h4>
<p>If you're using Vanta or Drata, confirm that the auditor has experience with evidence produced by those platforms. Some auditors prefer to collect evidence directly and are unfamiliar with automated evidence exports. An auditor who doesn't trust your Vanta evidence will ask you to re-collect everything manually.</p>
<h4 id="heading-understand-what-type-ii-actually-costs">Understand what Type II actually costs</h4>
<p>For a Series A SaaS company, expect \(15,000–\)30,000 for a SOC2 Type II audit with a 3-month observation period. A quote below \(10,000 often means the auditor is cutting corners on the review depth. A quote above \)50,000 for a small company typically means the firm is applying enterprise pricing to a startup engagement.</p>
<h4 id="heading-get-references-from-similar-companies">Get references from similar companies</h4>
<p>Ask the auditor for two or three references from SaaS companies they've audited in the last year. Call those references and ask: did the auditor understand cloud infrastructure? Were the findings reasonable? How was the communication during the review?</p>
<p>Here's a summary table of some things to watch out for:</p>
<table>
<thead>
<tr>
<th>Criteria</th>
<th>What to Look For</th>
<th>Red Flag</th>
</tr>
</thead>
<tbody><tr>
<td>Experience</td>
<td>5+ years, 20+ SaaS audits annually</td>
<td>"We have completed several SOC2 audits" (vague)</td>
</tr>
<tr>
<td>Tool familiarity</td>
<td>Has reviewed Vanta/Drata evidence before</td>
<td>Requires manual re-collection of automated evidence</td>
</tr>
<tr>
<td>Company size fit</td>
<td>Has audited companies your size</td>
<td>Only lists enterprise clients as references</td>
</tr>
<tr>
<td>Cost (Type II)</td>
<td>\(15K–\)30K for a 20-person company</td>
<td>Under \(10K or over \)50K without clear justification</td>
</tr>
<tr>
<td>References</td>
<td>Can provide SaaS company contacts to call</td>
<td>Cannot provide references</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-run-a-readiness-assessment-mock-audit">How to Run a Readiness Assessment (Mock Audit)</h3>
<p>A readiness assessment is a self-conducted simulation of the real audit, run 2–4 weeks before you engage the auditor. Its purpose is to find and close gaps before the auditor finds them, because gaps found in a mock audit cost you a week of remediation time, while gaps found in the real audit cost you a conditional report and a re-review.</p>
<p>You can run the readiness assessment yourself or hire a consultant to run it. The consultant approach is more valuable because an independent reviewer will find gaps you have rationalised away.</p>
<p>The process:</p>
<ol>
<li><p><strong>Step 1:</strong> Work through every control in the checklist below and attempt to produce the evidence that an auditor would request.</p>
</li>
<li><p><strong>Step 2:</strong> For every control where you can't produce clear, timestamped evidence: that's a gap. Document it.</p>
</li>
<li><p><strong>Step 3:</strong> Prioritise gaps by type. Evidence gaps (missing evidence for an active control) require evidence collection infrastructure fixes. Control gaps (a control that isn't implemented) require engineering work.</p>
</li>
<li><p><strong>Step 4:</strong> Close all gaps before engaging the real auditor.</p>
</li>
</ol>
<table>
<thead>
<tr>
<th>Control</th>
<th>Evidence Required</th>
<th>How to Verify</th>
<th>Ready?</th>
</tr>
</thead>
<tbody><tr>
<td>MFA enforced</td>
<td>IAM credential report + SSO MFA policy screenshot</td>
<td><code>aws iam get-credential-report</code></td>
<td>⬜</td>
</tr>
<tr>
<td>CloudTrail active</td>
<td>Trail status + S3 delivery confirmation</td>
<td><code>aws cloudtrail get-trail-status</code></td>
<td>⬜</td>
</tr>
<tr>
<td>GuardDuty active</td>
<td>Detector list + finding review log</td>
<td><code>aws guardduty list-detectors</code></td>
<td>⬜</td>
</tr>
<tr>
<td>VPC Flow Logs</td>
<td>Active flow log list + sample log entries</td>
<td><code>aws ec2 describe-flow-logs</code></td>
<td>⬜</td>
</tr>
<tr>
<td>Secrets in Secrets Manager</td>
<td>Secret list + rotation policy confirmation</td>
<td><code>aws secretsmanager list-secrets</code></td>
<td>⬜</td>
</tr>
<tr>
<td>EBS encryption by default</td>
<td>Account-level encryption setting</td>
<td><code>aws ec2 get-ebs-encryption-by-default</code></td>
<td>⬜</td>
</tr>
<tr>
<td>S3 Block Public Access</td>
<td>Account-level PAB configuration</td>
<td><code>aws s3control get-public-access-block</code></td>
<td>⬜</td>
</tr>
<tr>
<td>Branch protection (no admin bypass)</td>
<td>GitHub branch protection API response</td>
<td>GitHub API or Settings UI</td>
<td>⬜</td>
</tr>
<tr>
<td>Trivy scanning in CI</td>
<td>GitHub Actions run history showing scans</td>
<td>GitHub Actions logs</td>
<td>⬜</td>
</tr>
<tr>
<td>Incident response runbook</td>
<td>Written runbook + tabletop exercise notes with date</td>
<td>Document review</td>
<td>⬜</td>
</tr>
<tr>
<td>Access review</td>
<td>Quarterly review document with specific changes made</td>
<td>Document review</td>
<td>⬜</td>
</tr>
<tr>
<td>Backup test</td>
<td>RDS restore log + data verification results</td>
<td>Document review</td>
<td>⬜</td>
</tr>
<tr>
<td>Change management log</td>
<td>GitHub PR history + ArgoCD sync history</td>
<td>GitHub and ArgoCD</td>
<td>⬜</td>
</tr>
</tbody></table>
<p><strong>The one thing most teams skip:</strong> Running the readiness assessment against their own evidence bucket. Pull a random day's evidence from the daily Lambda export and verify that it's complete, timestamped, and accurately reflects the control status on that day.</p>
<p>If the evidence file for December 14th shows GuardDuty as PASS but GuardDuty was actually disabled that day, the auditor will find the discrepancy in the AWS account history — and that's a qualified finding.</p>
<h2 id="heading-weeks-1518-the-observation-period">Weeks 15–18: The Observation Period</h2>
<h3 id="heading-how-the-auditor-observes-your-controls">How the Auditor Observes Your Controls</h3>
<p>The SOC2 auditor doesn't physically visit your office or sit inside your AWS console watching your infrastructure in real time. The audit is a remote, documentation-based process conducted entirely through evidence review.</p>
<p>Here is how it actually works:</p>
<p>First, the auditor provides a list of evidence requests — typically 80–150 items for a Type II audit. You upload the evidence to a shared portal (the auditor provides this — it is usually a secure document sharing platform). The auditor reviews the evidence, asks follow-up questions, and identifies gaps where evidence is missing or a control wasn't operating as described.</p>
<p>For automated controls like CloudTrail and GuardDuty, the evidence is your daily Lambda exports — the auditor spot-checks a sample of daily snapshots across the observation period to verify the controls were consistently active.</p>
<p>For manual controls like access reviews and backup tests, the evidence is the documents you produced when you ran those processes.</p>
<p>The practical implication: the auditor is trusting your evidence. This is why the Object Lock on your evidence bucket matters. It proves to the auditor that the evidence was generated at the time it claims to have been generated and hasn't been modified since.</p>
<h3 id="heading-what-the-auditor-reviews-over-the-observation-period">What the Auditor Reviews Over the Observation Period</h3>
<table>
<thead>
<tr>
<th>What They Check</th>
<th>How Often</th>
<th>What They Are Looking For</th>
</tr>
</thead>
<tbody><tr>
<td>CloudTrail logs</td>
<td>Spot check monthly</td>
<td>Manual console changes that bypassed IaC, gaps in log delivery</td>
</tr>
<tr>
<td>GuardDuty findings</td>
<td>Review quarterly summary</td>
<td>HIGH or CRITICAL findings not remediated within your documented SLA</td>
</tr>
<tr>
<td>Access review completion</td>
<td>Verify each quarterly cycle</td>
<td>Reviews skipped, reviews with no access changes despite employee turnover</td>
</tr>
<tr>
<td>Incident response tests</td>
<td>Verify annually</td>
<td>No tabletop exercise conducted during the observation period</td>
</tr>
<tr>
<td>Evidence collection</td>
<td>Verify continuous coverage</td>
<td>Gaps in daily evidence exports, missing evidence for specific dates</td>
</tr>
<tr>
<td>Change management log</td>
<td>Sample PR/sync history</td>
<td>Deployments with no associated pull request or review</td>
</tr>
</tbody></table>
<h3 id="heading-what-triggers-a-finding">What Triggers a Finding</h3>
<p>A SOC2 finding is the auditor's documented conclusion that a control wasn't operating effectively during the observation period. Findings range from observations (minor issues that don't affect the audit opinion) to qualified opinions (material failures that result in a qualified rather than unqualified report).</p>
<p>Understanding what triggers findings — and which ones restart the observation period — is critical for managing your audit timeline.</p>
<p><strong>Control gaps</strong> occur when a required control isn't implemented or was disabled during the observation period. If you discover in month 2 that MFA wasn't enforced on one IAM user for the first three weeks, you must document the remediation and demonstrate the gap was closed.</p>
<p>Whether this restarts your observation period depends on how long the gap lasted and how the auditor assesses the risk — but a gap of less than 30 days that's immediately remediated and documented typically doesn't restart the clock.</p>
<p><strong>Evidence gaps</strong> are more serious. If your daily Lambda evidence collector failed for two weeks and produced no evidence exports, you have a two-week window with no documented proof that your controls were operating. The auditor can't verify controls they can't see evidence for.</p>
<p>Evidence gaps almost always require extending the observation period because there's no way to retroactively produce evidence for a period that wasn't recorded.</p>
<p><strong>Process failures</strong> occur when a manual control wasn't executed as documented. The most common is an access review that was skipped. Like control gaps, these can typically be remediated without restarting the clock if they're documented promptly and the remediation is clear.</p>
<p><strong>Unpatched critical CVEs</strong> are a special case. If Trivy identifies a CRITICAL vulnerability in a production container and it remains unpatched for more than your documented remediation SLA (typically 30 days for critical, 90 days for high), this is a qualified finding that the auditor will note in the report.</p>
<h3 id="heading-how-to-close-gaps-without-restarting-the-clock">How to Close Gaps Without Restarting the Clock</h3>
<p>When you discover a gap during the observation period:</p>
<p><strong>For control gaps:</strong></p>
<pre><code class="language-plaintext">1. Fix the control immediately — don't wait
2. Document the fix: screenshot, PR link, or CLI command output with timestamp
3. Note the gap date range in your audit log: "Control gap: 2024-03-10 to 2024-03-14 (4 days). Root cause: [X]. Remediated: [Y]. No customer data accessed during gap period."
4. Notify your auditor proactively — they will find it anyway; proactive disclosure is better than defensive explanation
5. The observation period doesn't restart if the gap was short-lived and promptly remediated
</code></pre>
<p><strong>For evidence gaps:</strong></p>
<pre><code class="language-plaintext">1. Fix the evidence collection infrastructure immediately
2. Understand that you can't retroactively generate evidence for the gap period
3. The observation period for affected controls effectively restarts from the date evidence collection resumed
4. If the gap is early in your observation period, you may be able to extend the period rather than restart — discuss with your auditor
</code></pre>
<p><strong>The pro tip:</strong> Set up a CloudWatch alarm that triggers if the evidence Lambda fails to deliver to S3 on schedule. A missing daily evidence file is caught within 24 hours, not discovered during the audit review.</p>
<h2 id="heading-the-90-day-soc2-timeline-at-a-glance">The 90-Day SOC2 Timeline at a Glance</h2>
<table>
<thead>
<tr>
<th>Weeks</th>
<th>Focus</th>
<th>Key Deliverables</th>
<th>Common Mistake</th>
</tr>
</thead>
<tbody><tr>
<td>1–2</td>
<td>Scope</td>
<td>Boundary diagram, network segmentation Terraform</td>
<td>Over-scoping to include dev and staging</td>
</tr>
<tr>
<td>3–6</td>
<td>Controls</td>
<td>14 controls implemented and collecting evidence</td>
<td>Starting controls after the observation period begins</td>
</tr>
<tr>
<td>7–10</td>
<td>Evidence</td>
<td>S3 evidence bucket, Lambda daily collector, GitHub Actions workflow</td>
<td>Manual evidence collection with inevitable gaps</td>
</tr>
<tr>
<td>11–14</td>
<td>Readiness</td>
<td>Mock audit, gap remediation, auditor selected</td>
<td>Skipping the mock audit</td>
</tr>
<tr>
<td>15–18</td>
<td>Observation</td>
<td>Daily evidence, quarterly reviews, incident response test</td>
<td>Discovering evidence gaps during the audit rather than before</td>
</tr>
</tbody></table>
<h2 id="heading-whats-next">What's Next?</h2>
<p>Start with Week 1. Define your SOC2 boundary. Apply the four-question framework to every system in your infrastructure. Draw the diagram in Excalidraw. Document the network segmentation controls.</p>
<p>Then implement the 14 controls in order, starting with MFA and CloudTrail — the two that most commonly fail audits when they're missing.</p>
<p>Then build your evidence collection infrastructure before the observation period starts. The automated Lambda and GitHub Actions workflow are the difference between a smooth audit and a 60-day extension.</p>
<p>One thing to remember: SOC2 is 20% controls, 30% evidence, and 50% continuous operation. Start early. Automate everything. Run a mock audit before you call the real one.</p>
<h2 id="heading-resources">Resources</h2>
<p>The following resources are referenced throughout this guide:</p>
<ul>
<li><p><a href="https://www.aicpa-cima.com/resources/landing/system-and-organization-controls-soc-suite-of-services"><strong>AICPA SOC2 Overview</strong></a> — The official SOC2 documentation from the American Institute of CPAs, including the Trust Service Criteria</p>
</li>
<li><p><a href="https://www.vanta.com/"><strong>Vanta</strong></a> — Compliance automation platform that connects to AWS and GitHub to automate evidence collection and track control status</p>
</li>
<li><p><a href="https://drata.com/"><strong>Drata</strong></a> — Alternative compliance automation platform with similar capabilities to Vanta</p>
</li>
<li><p><a href="https://github.com/aquasecurity/trivy"><strong>Trivy by Aqua Security</strong></a> — Open-source container and filesystem vulnerability scanner used in Control 10</p>
</li>
<li><p><a href="https://excalidraw.com/"><strong>Excalidraw</strong></a> — Free, open-source diagram tool for creating the SOC2 boundary diagram</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html"><strong>AWS IAM Identity Center documentation</strong></a> — Official AWS documentation for setting up SSO and MFA enforcement</p>
</li>
<li><p><a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches"><strong>GitHub branch protection documentation</strong></a> — Official GitHub documentation for configuring branch protection rules</p>
</li>
<li><p><a href="https://argo-cd.readthedocs.io/"><strong>ArgoCD documentation</strong></a> — Official ArgoCD documentation for GitOps deployment and sync history</p>
</li>
</ul>
<p><a href="https://github.com/aayostem">Ayobami Adejumo</a> <em>is a senior platform engineer and FinOps specialist. He writes about SOC2 compliance engineering, Kubernetes cost optimization, and platform engineering.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Land Your First Cloud or DevOps Role: What Hiring Managers Actually Look For ]]>
                </title>
                <description>
                    <![CDATA[ You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating. And yet  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-land-your-first-cloud-or-devops-role-what-hiring-managers-actually-look-for/</link>
                <guid isPermaLink="false">69f3683c909e64ad07e3b0fc</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Career ]]>
                    </category>
                
                    <category>
                        <![CDATA[ jobs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 14:33:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/374e807b-a67f-4f04-a639-dfa230b0ba5f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating.</p>
<p>And yet the applications go out, and nothing comes back.</p>
<p>This is one of the most frustrating experiences in tech. You're genuinely learning, genuinely putting in the time, and you have nothing to show for it in terms of results. You start to wonder if the market is too competitive, if you need one more certification, or if there's some hidden door everyone else found that you're missing.</p>
<p>The truth is simpler and more actionable than any of that: <strong>hiring managers can't see your YouTube watch history. They can see your GitHub.</strong> Most beginners optimize for learning. Hired candidates optimize for proof.</p>
<p>In this guide, you'll get an honest breakdown of the nine factors hiring managers actually evaluate when they look at a junior cloud or DevOps candidate and a concrete 90-day plan to address each one. By the end, you'll know exactly where you stand and exactly what to do next.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-three-patterns-that-keep-beginners-stuck">The Three Patterns That Keep Beginners Stuck</a></p>
<ul>
<li><p><a href="#heading-pattern-1-the-tutorial-loop">Pattern 1: The Tutorial Loop</a></p>
</li>
<li><p><a href="#heading-pattern-2--the-theorypractice-gap">Pattern 2: The Theory-Practice Gap</a></p>
</li>
<li><p><a href="#pattern-3-silent-learning">Pattern 3: Silent Learning</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-hiring-managers-are-actually-evaluating">What Hiring Managers Are Actually Evaluating</a></p>
</li>
<li><p><a href="#heading-factor-1-proof-of-work-the-non-negotiable">Factor 1: Proof of Work (The Non-Negotiable)</a></p>
<ul>
<li><a href="#heading-the-three-projects-that-cover-everything">The Three Projects That Cover Everything</a></li>
</ul>
</li>
<li><p><a href="#heading-factor-2-system-level-thinking">Factor 2: System-Level Thinking</a></p>
</li>
<li><p><a href="#heading-factor-3-software-engineering-fundamentals">Factor 3: Software Engineering Fundamentals</a></p>
</li>
<li><p><a href="#heading-factor-4-communication-skills">Factor 4: Communication Skills</a></p>
</li>
<li><p><a href="#heading-factor-5-consistency-over-intensity">Factor 5: Consistency Over Intensity</a></p>
</li>
<li><p><a href="#heading-factor-6-networking-and-visibility">Factor 6: Networking and Visibility</a></p>
</li>
<li><p><a href="#heading-factor-7-ownership-mindset">Factor 7: Ownership Mindset</a></p>
</li>
<li><p><a href="#heading-factor-8--business-awareness">Factor 8: Business Awareness</a></p>
</li>
<li><p><a href="#heading-factor-9-learning-agility">Factor 9: Learning Agility</a></p>
</li>
<li><p><a href="#heading-your-90-day-action-plan">Your 90-Day Action Plan</a></p>
</li>
<li><p><a href="#heading-honest-self-assessment-where-do-you-stand">Honest Self-Assessment: Where Do You Stand?</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references-and-recommended-resources">References and Recommended Resources</a></p>
</li>
</ul>
<h2 id="heading-the-three-patterns-that-keep-beginners-stuck">The Three Patterns That Keep Beginners Stuck</h2>
<h3 id="heading-pattern-1-the-tutorial-loop">Pattern 1: The Tutorial Loop</h3>
<p>Week 1: You watch eight hours of Docker content. Week 2: You start an AWS course and get 70% through. Week 3: A Kubernetes series looks interesting, so you start that instead. Week 4: You open LinkedIn and wonder why you're not getting callbacks.</p>
<p>Watching tutorials feels like progress. It's comfortable, passive, and has no failure state. Nothing breaks. Nothing goes wrong.</p>
<p>The problem is that it produces nothing a hiring manager can evaluate. Courses and certifications tell an employer what you've been exposed to. Your GitHub tells them what you can actually do.</p>
<h3 id="heading-pattern-2-the-theory-practice-gap">Pattern 2: The Theory-Practice Gap</h3>
<p>You can explain CI/CD fluently. You've read the Kubernetes documentation. You understand the conceptual difference between a container and a virtual machine.</p>
<p>But you've never taken a simple application, containerized it, connected it to a pipeline, and deployed it to a cloud server with a real URL that someone can visit.</p>
<p>In an interview, "I understand how it works" and "I have built this and here is the link" are not equivalent answers. Hiring managers hear the first version from hundreds of candidates. The second version gets callbacks.</p>
<h3 id="heading-pattern-3-silent-learning">Pattern 3: Silent Learning</h3>
<p>This one is perhaps the most painful pattern because the learning is real. You're putting in the work every day but nobody knows. No GitHub activity. No LinkedIn posts. No community presence. Just cold applications sent from job boards to ATS systems that filter you out before a human ever sees your name.</p>
<p>The hard truth: people get hired through people. A hiring manager who has seen your LinkedIn post about a problem you solved is significantly more likely to give your résumé serious attention than a stranger who applied through a portal.</p>
<h2 id="heading-what-hiring-managers-are-actually-evaluating">What Hiring Managers Are Actually Evaluating</h2>
<p>I've grouped the nine factors that follow into three buckets: <strong>Mindset</strong>, <strong>Execution</strong>, and <strong>Visibility</strong>. The order matters: mindset shapes how you execute, and execution is what powers visibility.</p>
<table>
<thead>
<tr>
<th>Bucket</th>
<th>Covers</th>
<th>Factors</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Mindset</strong></td>
<td>How you think about problems and your career</td>
<td>Factors 2, 7, 8, 9</td>
</tr>
<tr>
<td><strong>Execution</strong></td>
<td>What you actually build and demonstrate</td>
<td>Factors 1, 3</td>
</tr>
<tr>
<td><strong>Visibility</strong></td>
<td>Whether the right people know you exist</td>
<td>Factors 4, 5, 6</td>
</tr>
</tbody></table>
<p>Let's go through each one.</p>
<h2 id="heading-factor-1-proof-of-work-the-non-negotiable">Factor 1: Proof of Work (The Non-Negotiable)</h2>
<p>If there's one thing to take from this entire article, it's this: <strong>no portfolio means no serious consideration.</strong> The most technically capable candidate in the applicant pool is invisible without proof of work.</p>
<p>This isn't about impressing anyone with complexity. It's about demonstrating that you can take a system from zero to deployed, documented, and working.</p>
<p>Here's the checklist every portfolio project should meet before you consider it done:</p>
<ul>
<li><p><strong>It's deployed</strong>: there's a real URL you can share, not "it works on my machine"</p>
</li>
<li><p><strong>It has a CI/CD pipeline</strong>: code changes are automatically tested and deployed</p>
</li>
<li><p><strong>Infrastructure is defined as code</strong>: not manually clicked together in the AWS console</p>
</li>
<li><p><strong>It has monitoring and alerting</strong>: you know when it breaks before users tell you</p>
</li>
<li><p><strong>It's documented</strong>: a README explains what it does, how to run it, and how it works</p>
</li>
<li><p><strong>It's on GitHub publicly</strong>: with real commit history showing iterative work</p>
</li>
</ul>
<p>If your project meets all six criteria, you have proof of work. If it meets four of six, you have a project in progress. Finish it before you start applying.</p>
<h3 id="heading-the-three-projects-that-cover-everything">The Three Projects That Cover Everything</h3>
<p>You don't need ten projects. You need two to three projects that together demonstrate the full range of DevOps skills.</p>
<h4 id="heading-project-1-the-full-stack-deploy-pipeline">Project 1 : The Full-Stack Deploy Pipeline</h4>
<p>This is the foundational DevOps project every beginner should build first.</p>
<p>Take any simple web application – a Python Flask app, a Node.js API, or even a static site. Containerize it with Docker. Write a CI/CD pipeline that runs tests, builds the Docker image, and deploys to a cloud server automatically on every push to the main branch. You can also set up Nginx as a reverse proxy and add an uptime monitor (UptimeRobot has a free tier).</p>
<p>Tools: GitHub Actions, Docker, AWS EC2 or <a href="http://Render.com">Render.com</a>, Nginx.</p>
<p>Why it matters to a hiring manager: it proves you can automate a full deployment workflow end-to-end. The hiring manager can visit your URL, see it running, and inspect your pipeline history.</p>
<p>This single project puts you ahead of most applicants who only have course completion screenshots.</p>
<h4 id="heading-project-2-infrastructure-as-code-with-terraform">Project 2: Infrastructure as Code with Terraform</h4>
<p>Write Terraform code that provisions a complete environment: a VPC, public and private subnets, an EC2 instance with properly scoped security group rules, and an S3 bucket for remote state. Destroy it and recreate it from scratch to prove the code actually works. Add a GitHub Actions workflow that runs <code>terraform plan</code> on pull requests and <code>terraform apply</code> on merge to main.</p>
<p>Tools: Terraform, AWS (or Azure/GCP), GitHub Actions.</p>
<p>Why it matters: Infrastructure as Code with Terraform is a required skill at almost every company running cloud infrastructure. Showing you can write, version-control, and automate Terraform demonstrates a core professional competency.</p>
<h4 id="heading-project-3-monitoring-and-observability-stack">Project 3: Monitoring and Observability Stack</h4>
<p>Deploy a monitoring stack using Docker Compose: Prometheus scraping metrics from your application and the host, Grafana dashboards showing CPU, memory, request rates, and error rates, and Alertmanager configured to send alerts to Slack or email when thresholds are crossed. Connect this to your Project 1 application so the pipeline deploys and the monitoring watches it.</p>
<p>Tools: Prometheus, Grafana, Alertmanager, Node Exporter, Docker Compose.</p>
<p>Why it matters: most beginner portfolios have zero observability work. This project immediately signals that you understand production engineering, not just deployment. Any senior DevOps engineer or SRE reviewing your application will notice it and it will set you apart.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/da9e25be-9b59-48c8-9cf0-9cfdb050c277.png" alt="GitHub profile showing three pinned DevOps portfolio repositories with descriptive names " style="display:block;margin:0 auto" width="1353" height="584" loading="lazy">

<h2 id="heading-factor-2-system-level-thinking">Factor 2: System-Level Thinking</h2>
<p>This is the mindset that separates a DevOps engineer from someone who just knows a collection of tools. System-level thinking means you can see the whole picture, not just the part you happen to be working on at any given moment.</p>
<p>Here's the mental test hiring managers are running throughout your interview: <em>can you trace a user request from the moment they click a button to the moment they see a response, and explain what happens at every layer in between?</em></p>
<p>Here's the full journey of a web request, the map of modern infrastructure every DevOps engineer needs to understand:</p>
<table>
<thead>
<tr>
<th>Step</th>
<th>Layer</th>
<th>What's happening and what can go wrong</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td>User's Browser</td>
<td>The user types a URL. The browser needs to find the server.</td>
</tr>
<tr>
<td>2</td>
<td>DNS Resolution</td>
<td>The domain is translated into an IP address. DNS misconfigurations mean users can't reach you at all.</td>
</tr>
<tr>
<td>3</td>
<td>CDN / Edge Network</td>
<td>Traffic hits a CDN (Cloudflare, CloudFront) first. Static assets are served from the nearest edge. SSL terminates here.</td>
</tr>
<tr>
<td>4</td>
<td>Load Balancer</td>
<td>Routes the request to an available application server. If all targets are unhealthy, users get 502/503 errors.</td>
</tr>
<tr>
<td>5</td>
<td>Compute / Application Servers</td>
<td>The application code runs here in containers, on VMs, or in server-less functions. Business logic executes.</td>
</tr>
<tr>
<td>6</td>
<td>Database Layer</td>
<td>The application reads from or writes to a database. Slow queries or a full disk causes slow responses or outages.</td>
</tr>
<tr>
<td>7</td>
<td>Cache Layer</td>
<td>Redis or Memcached caches frequently-read data. Cache misses cause extra database load.</td>
</tr>
<tr>
<td>8</td>
<td>Response Returns</td>
<td>The response travels back through the stack and the user sees the result.</td>
</tr>
<tr>
<td>9</td>
<td>Logging and Monitoring</td>
<td>Every step above should emit logs and metrics. Good monitoring alerts you before users notice a problem.</td>
</tr>
</tbody></table>
<p>Why does this matter in an interview? Consider two candidates answering the question: <em>"Tell me about a time something broke in production."</em></p>
<p>Candidate A: "The website was down."</p>
<p>Candidate B: "The load balancer health checks were failing because the app containers were running out of memory due to a memory leak introduced in the previous deploy. We identified it via memory metrics in Grafana, rolled back, and added a memory limit to the container spec."</p>
<p>Same incident. Completely different answer. System-level thinking is what makes the difference.</p>
<h2 id="heading-factor-3-software-engineering-fundamentals">Factor 3: Software Engineering Fundamentals</h2>
<p>Many beginners rush to learn Kubernetes and Terraform before mastering the foundations that make those tools make sense. This creates a knowledge structure that looks impressive but has no solid base underneath it.</p>
<p>Here are the fundamentals that actually matter and what to do if you have a gap in any of them:</p>
<h3 id="heading-1-linux-and-the-command-line">1. Linux and the Command Line</h3>
<p>DevOps tools run on Linux. CI/CD jobs run in Linux containers. SSH is the front door to every server. If the terminal makes you uncomfortable, you're not ready for a production environment. This is not a preference, it's a prerequisite.</p>
<p>Start with daily Linux practice. The <a href="https://training.linuxfoundation.org/training/introduction-to-linux/">Linux Foundation's free introductory materials</a> are a solid starting point. And here's a <a href="https://www.freecodecamp.org/news/learn-the-basics-of-the-linux-operating-system/">solid freeCodeCamp course on Linux basics.</a></p>
<h3 id="heading-2-networking-fundamentals">2. Networking Fundamentals</h3>
<p>DNS, TCP/IP, HTTP/HTTPS, load balancing, firewalls, VPCs, subnets these concepts appear in every cloud architecture. Without them, Terraform and Kubernetes are magic boxes. Study the request flow in Factor 2 above until you can draw it from memory without looking.</p>
<p>Here's a <a href="https://www.freecodecamp.org/news/computer-networking-fundamentals/">computer networking fundamentals course</a> to get you started.</p>
<h3 id="heading-3-scripting-bash-and-python">3. Scripting: Bash and Python</h3>
<p>CI/CD pipelines are scripts. Automation is scripting. If you cannot write a Bash script that reads a config file, calls an API, and handles errors gracefully your automation ceiling is very low. Fix this by writing one small, useful script every week. Solve real problems with code.</p>
<p>Here's a helpful tutorial on <a href="https://www.freecodecamp.org/news/shell-scripting-crash-course-how-to-write-bash-scripts-in-linux/">shell scripting in Linux for beginners</a>.</p>
<h3 id="heading-4-git-and-version-control">4. Git and Version Control</h3>
<p>Not just <code>git commit</code> and <code>git push</code>. Branching strategies, pull requests, merge conflicts, rebasing, and tagging releases are all standard practice in professional DevOps teams. Use Git for everything including your personal learning notes. Practice branching workflows intentionally.</p>
<p>Here's a <a href="https://www.freecodecamp.org/news/gitting-things-done-book/">full book on all the Git basics</a> (and some more advanced topics, too) you need to know.</p>
<h3 id="heading-5-docker-and-containers">5. Docker and Containers</h3>
<p>Docker is the universal packaging format for modern software. Understanding layers, multi-stage builds, volumes, networking, and container security is the floor not the ceiling. Every project you build should be containerized. Write your Dockerfiles by hand instead of copying them.</p>
<p>Here's a course on <a href="https://www.freecodecamp.org/news/learn-docker-and-kubernetes-hands-on-course/">Docker and Kubernetes</a> to get you started,</p>
<h2 id="heading-factor-4-communication-skills">Factor 4: Communication Skills</h2>
<p>Technical skills set your ceiling. Communication skills determine how fast you reach it. This is the most consistently underestimated factor among beginner DevOps candidates.</p>
<p>Two candidates with identical technical ability will have very different career outcomes based on how clearly they communicate. Here's what that looks like in practice:</p>
<p><strong>Architecture explanation</strong>: Can you describe how your project works to someone who has never seen it? Can you draw the architecture on a whiteboard and walk someone through your design decisions and the trade-offs you made?</p>
<p><strong>Trade-off articulation</strong>: <em>"I chose X over Y because..."</em> is one of the most powerful phrases in a technical interview. It shows you understand that every decision has pros and cons and you made a conscious, reasoned choice rather than just copying a tutorial.</p>
<p><strong>Written documentation</strong>: A README is your project's cover letter. A well-written README with clear setup instructions, an architecture diagram, and documented decisions demonstrates engineering maturity that most beginners don't show.</p>
<p>Here's a quick test: open your most recent project on GitHub and read the README as if you're a hiring manager seeing it for the first time. Does it answer these questions?</p>
<ul>
<li><p>What does this project do, and why did you build it?</p>
</li>
<li><p>What does the architecture look like?</p>
</li>
<li><p>How do I run this locally, and how do I deploy it?</p>
</li>
<li><p>What decisions did you make, and why?</p>
</li>
<li><p>What would you improve if you continued working on it?</p>
</li>
</ul>
<p>If you answered "no" to more than two of those rewrite the README before applying anywhere. This single action will meaningfully improve your response rate.</p>
<p><strong>Interview communication</strong>: Hiring managers assess communication throughout the entire interview not just your answers. Thinking out loud, structuring your responses, and admitting uncertainty honestly are all evaluated.</p>
<h2 id="heading-factor-5-consistency-over-intensity">Factor 5: Consistency Over Intensity</h2>
<p>Hiring managers are pattern recognition machines. They look at your GitHub contribution graph, your LinkedIn activity, and your learning trajectory and form an impression before reading a single word on your résumé.</p>
<p>A binge-learning approach, 10-hour weekends followed by weeks of nothing produces a GitHub graph that tells the wrong story. Thirty minutes of focused daily practice for six months beats a monthly 10-hour binge. At the six-month mark, the daily practitioner has 90 hours of focused work. The binge learner has 60 with significantly worse retention.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/1315bb8d-9e4e-4f84-836f-4e02b83c75ce.webp" alt="GitHub contribution graph showing 12 months of consistent activity with regular commits across the year" style="display:block;margin:0 auto" width="1080" height="273" loading="lazy">

<p>Here's how to build consistency in practice:</p>
<ul>
<li><p>Pick a time slot in your day that you will protect. Thirty minutes is enough to make progress.</p>
</li>
<li><p>Define a four-week learning sprint with a specific goal, not "learn Terraform" but "build and deploy a VPC with Terraform and write the README."</p>
</li>
<li><p>Keep a private learning journal: date, what you studied, what you built, what confused you.</p>
</li>
<li><p>When the sprint ends, evaluate what you built and plan the next one.</p>
</li>
</ul>
<p>What to avoid: declaring publicly on LinkedIn that you're "grinding DevOps full time" and then disappearing for six weeks. The absence is noticed. Only commit publicly to what you will actually sustain.</p>
<h2 id="heading-factor-6-networking-and-visibility">Factor 6: Networking and Visibility</h2>
<p>This is the factor most beginners resist most, and the one that makes the biggest practical difference in time-to-hire.</p>
<p>Most DevOps jobs are filled through people referrals, community connections, LinkedIn conversations. A warm introduction from someone who has seen your work outweighs fifty cold applications every time.</p>
<p>Here are three ways to build visibility without it feeling performative:</p>
<h3 id="heading-community-engagement">Community Engagement</h3>
<p>Join communities where DevOps engineers actually talk: AWS User Groups, local DevOps meetups, DevOps Discord servers, Reddit communities like r/devops and r/kubernetes. You don't need to be the expert. Ask specific questions, answer what you genuinely know, and show up consistently. After three to six months, people will recognize your name.</p>
<h3 id="heading-linkedin-content">LinkedIn Content</h3>
<p>Post once per week about something you learned, built, or got stuck on. Not marketing – documentation. A post that says <em>"This week I configured Prometheus alerting for a Docker Compose stack. Here's what tripped me up and how I solved it"</em> attracts recruiters, leads to conversations, and builds a searchable record of your growth over time.</p>
<h3 id="heading-asking-good-questions-in-public">Asking Good Questions in Public</h3>
<p>When you get stuck and figure it out, write it up. Post the solution in the same community where you asked the question. Answer someone else's version of the same question later. You position yourself as a helpful, engaged learner, exactly who hiring managers want to hire.</p>
<p>Here's a concrete three-month visibility sprint to follow:</p>
<table>
<thead>
<tr>
<th>Timeframe</th>
<th>Action</th>
</tr>
</thead>
<tbody><tr>
<td>Week 1-2</td>
<td>Update your LinkedIn headline: "Cloud / DevOps Engineer in Training │ Building with AWS, Docker, Terraform". Connect with 20 people in DevOps engineers, recruiters, hiring managers. Add a short personal note when connecting.</td>
</tr>
<tr>
<td>Week 3-4</td>
<td>Write your first LinkedIn post. Document something you built or learned this week. Keep it honest and specific. 150–200 words is enough.</td>
</tr>
<tr>
<td>Month 2</td>
<td>Join one community. Introduce yourself. Answer one question per week.</td>
</tr>
<tr>
<td>Month 3</td>
<td>Post consistently once per week. Engage with others' posts. Start appearing in recruiter searches.</td>
</tr>
</tbody></table>
<p>By month three, recruiters searching for "DevOps" in your location will encounter your activity. Some of the best entry-level DevOps opportunities come from exactly this kind of low-pressure visibility.</p>
<h2 id="heading-factor-7-ownership-mindset">Factor 7: Ownership Mindset</h2>
<p>This factor is less about personality type and more about observable behavior. Hiring managers are looking for evidence that you finish what you start not just that you start things.</p>
<p>Here's what the contrast looks like:</p>
<table>
<thead>
<tr>
<th>What hiring managers frequently see</th>
<th>What hiring managers want to see</th>
</tr>
</thead>
<tbody><tr>
<td>"I started a Kubernetes project and encountered a lot of issues"</td>
<td>"Here is a complete project. It deploys to AWS, has a CI/CD pipeline, is monitored, and you can access it at this URL right now."</td>
</tr>
<tr>
<td>"I was working through a Terraform course, learnt a lot about XYZ."</td>
<td>"I finished it, documented it, and wrote a post about what I learned."</td>
</tr>
</tbody></table>
<p>Ownership mindset has three components. First, finish things: a complete, simple project is worth ten times more than ten incomplete complex ones. Second, take responsibility without blame when something breaks: ownership means identifying the cause, fixing it, and adding monitoring so it doesn't happen again. Third, self-direct your learning you don't wait for someone to tell you what to learn next. You see a gap, identify how to close it, and close it. This is what "junior who can work independently" actually means in job descriptions.</p>
<h2 id="heading-factor-8-business-awareness">Factor 8: Business Awareness</h2>
<p>Technical skill gets you in the door. Business awareness keeps you there and accelerates your career.</p>
<p>The core question hiring managers are testing is: <em>can you connect your technical decisions to cost, uptime, and user impact?</em> Infrastructure decisions are business decisions. Cloud costs are typically the second-largest engineering expense at most companies after salaries. A misconfigured auto-scaling group or a forgotten large EC2 instance can burn thousands of dollars overnight.</p>
<p>Here are a few benchmark questions worth being able to answer comfortably:</p>
<ul>
<li><p>If your company has a 99.9% SLA, how many minutes of downtime per month is that? (About 43 minutes.)</p>
</li>
<li><p>If you move workloads from on-demand EC2 instances to Reserved Instances, what's the approximate cost saving? (Around 40–60%.)</p>
</li>
<li><p>If your CI/CD pipeline takes 45 minutes per build and you run 20 builds per day, how much developer wait time does that represent weekly?</p>
</li>
</ul>
<p>Most junior candidates can't answer these fluently in an interview. Candidates who can stand out immediately not because the questions are hard, but because so few people bother to connect infrastructure and business.</p>
<p>The simple habit to build: whenever you describe a technical decision in your project documentation or in an interview, add the business dimension. "I configured auto-scaling" becomes "I configured auto-scaling to handle traffic spikes, which eliminated the cost of over-provisioning and reduced our estimated monthly cloud spend by approximately $X."</p>
<h2 id="heading-factor-9-learning-agility">Factor 9: Learning Agility</h2>
<p>Everyone claims to be a fast learner. It's the most overused phrase in technology job applications. Here's how to make it actually mean something.</p>
<p>Saying "I'm a fast learner" in an interview is table stakes. The question is whether you can prove it. Proof sounds like this: <em>"I had never used GitHub Actions before. I needed a CI/CD pipeline for a project I was building. In 48 hours, I had a working pipeline that runs tests, builds a Docker image, and deploys to AWS."</em></p>
<p>What makes that credible: it names a specific tool, a specific timeframe, and a specific outcome. There is a GitHub repository with a commit history and a working pipeline that a hiring manager can actually look at.</p>
<p>Learning agility is not about knowing many tools shallowly. It's about picking up new tools quickly because you deeply understand the underlying concepts. Tool names change every few years. Concepts networking, automation, observability, reliability do not.</p>
<p>To build a concrete track record of learning agility: once a month, pick one tool you haven't used. Follow its quick-start guide. Build something small. Document what was difficult. Post about it. This is your learning agility portfolio visible, dated, and specific.</p>
<h2 id="heading-your-90-day-action-plan">Your 90-Day Action Plan</h2>
<p>Here is a concrete, sequential plan that takes you from where you are now to your first DevOps interview-ready state.</p>
<h3 id="heading-month-1-build-your-foundation">Month 1: Build Your Foundation</h3>
<p>Focus entirely on Project 1 from the Proof of Work section. Build it completely. Deploy it. Get the live URL. Don't start Project 2 until Project 1 meets all six checklist criteria.</p>
<p>Alongside the build: 30 minutes of Linux and Bash scripting practice daily. This isn't optional, it's the foundation everything else runs on.</p>
<h3 id="heading-month-2-expand-your-execution-and-start-your-visibility">Month 2: Expand Your Execution and Start Your Visibility</h3>
<p>Begin Project 2 (Terraform IaC). Write your first LinkedIn post, it doesn't need to be polished, it needs to be specific. Join one community and introduce yourself.</p>
<h3 id="heading-month-3-complete-the-portfolio-and-document-everything">Month 3: Complete the Portfolio and Document Everything</h3>
<p>Finish all three projects to full checklist standard. Polish every README. Add architecture diagrams. Optimize your GitHub profile, pin your three best repos, write a profile README that describes who you are and what you build, and add links to your live project URLs.</p>
<h3 id="heading-month-4-onward-apply-with-strategy">Month 4 Onward: Apply with Strategy</h3>
<p>Don't start applying before month four. Apply with real proof of work in hand. Target five to ten quality applications per week rather than spraying a hundred. Include your GitHub and your best project's live URL in every application. For roles at companies where you have a community connection, reach out to that person before applying.</p>
<p>Track every application in a spreadsheet: company, role, date applied, status, outcome, notes. After thirty applications, you'll have enough data to see what's working and what isn't.</p>
<p>Here's the full 90-day breakdown:</p>
<table>
<thead>
<tr>
<th>Timeframe</th>
<th>Focus</th>
<th>Milestone</th>
</tr>
</thead>
<tbody><tr>
<td>Week 1-2</td>
<td>Linux fundamentals. Set up GitHub profile. Start Project 1.</td>
<td>Foundation</td>
</tr>
<tr>
<td>Week 3-4</td>
<td>Complete Project 1 CI/CD pipeline. Deploy. Get live URL. Write README.</td>
<td>First Proof of Work</td>
</tr>
<tr>
<td>Month 2</td>
<td>Begin Project 2. First LinkedIn post. Join one community.</td>
<td>Visibility begins</td>
</tr>
<tr>
<td>Month 2-3</td>
<td>Complete Project 2. Scaffold monitoring (Project 3). Post weekly on LinkedIn.</td>
<td>Building momentum</td>
</tr>
<tr>
<td>Month 3</td>
<td>Finish all 3 projects to checklist standard. Polish READMEs and GitHub profile.</td>
<td>Portfolio complete</td>
</tr>
<tr>
<td>Month 4+</td>
<td>Apply strategically. Continue posting and community engagement.</td>
<td>Active job search</td>
</tr>
</tbody></table>
<h2 id="heading-honest-self-assessment-where-do-you-stand">Honest Self-Assessment: Where Do You Stand?</h2>
<p>Go through each statement below. Be completely honest: this is for you, not anyone else.</p>
<table>
<thead>
<tr>
<th>Statement</th>
<th>Action if the answer is No</th>
</tr>
</thead>
<tbody><tr>
<td>I can explain a web request end-to-end (DNS → load balancer → compute → database → logs)</td>
<td>Study Factor 2 until you can draw this from memory</td>
</tr>
<tr>
<td>I have at least one deployed project with a live URL</td>
<td>This is Priority 1. Nothing else matters more right now.</td>
</tr>
<tr>
<td>My best project has a CI/CD pipeline that auto-deploys on push</td>
<td>Add this to your existing project this week</td>
</tr>
<tr>
<td>I have written infrastructure as code (Terraform or CloudFormation)</td>
<td>Project 2 is your next build target</td>
</tr>
<tr>
<td>My projects have READMEs that explain architecture and decisions</td>
<td>Spend one hour today rewriting your README</td>
</tr>
<tr>
<td>I have posted about my learning on LinkedIn in the last 30 days</td>
<td>Post something today, document what you built last week</td>
</tr>
<tr>
<td>I am part of at least one DevOps community</td>
<td>Join r/devops or an AWS Discord server this week</td>
</tr>
<tr>
<td>I can write a Bash script that solves a real automation problem</td>
<td>30 minutes of daily scripting practice for the next 30 days</td>
</tr>
<tr>
<td>I can explain what I built, why I made each decision, and what I'd change</td>
<td>Practice saying this out loud about each project until it's fluent</td>
</tr>
</tbody></table>
<p>Count your "no" answers. Each one is a specific, actionable gap, not a vague sense of being behind. That's the difference between this self-assessment and the anxious feeling of "I'm not ready yet." You're not behind. You just have a prioritized list of what to build next.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Here's what you know now that most beginners still don't:</p>
<p>The gap between you and a DevOps job isn't a gap in certifications, a gap in courses completed, or a gap in the number of tools you've heard about. It's a gap in proof of work, visibility, and the consistency with which you execute.</p>
<p>Hiring managers aren't looking for someone who has watched everything. They're looking for someone who has built something, documented it, deployed it, monitored it, and can clearly explain every decision they made along the way.</p>
<p>The path isn't secret. It's just work. Build two to three complete projects that meet the full checklist. Document everything. Show up consistently in communities and on LinkedIn. Apply with strategy. Iterate based on feedback.</p>
<p>If you want a production-grade reference to support your DevOps journey complete with real Terraform modules, CI/CD workflow templates, infrastructure runbooks, and platform engineering patterns used in real startup environments <a href="https://coachli.co/tolani-akintayo/PR-H4oQS">The Startup DevOps Field Guide</a> was built for exactly this stage of your career.</p>
<p>The information gap between you and your first DevOps role is smaller than you think. The execution gap is where the work is. Start today.</p>
<h2 id="heading-references-and-recommended-resources">References and Recommended Resources</h2>
<ul>
<li><p><a href="https://roadmap.sh/devops">roadmap.sh/devops</a>: The community-maintained DevOps learning roadmap. Use this to sequence what you learn next and avoid random jumps between topics.</p>
</li>
<li><p><a href="https://dora.dev">DORA State of DevOps Report</a>: Free annual report on what DevOps practices actually improve software delivery performance. Gives you the vocabulary hiring managers speak.</p>
</li>
<li><p><a href="https://training.linuxfoundation.org/training/introduction-to-linux/">Linux Foundation - Introduction to Linux</a>: Free introductory Linux course. If the terminal still makes you nervous, start here.</p>
</li>
<li><p><a href="https://itrevolution.com/product/the-phoenix-project/">The Phoenix Project</a>: A business novel about DevOps transformation. Teaches core concepts through story. Gives you vocabulary for business-aware conversations.</p>
</li>
<li><p><a href="http://ExplainShell.com">ExplainShell.com</a>: Paste any command you find online and see exactly what every part does. Use this constantly while building your projects.</p>
</li>
<li><p><a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-readmes">GitHub - How to Write a Good README</a>: Official GitHub guidance on repository documentation.</p>
</li>
<li><p><a href="https://prometheus.io/docs/introduction/overview/">Prometheus Documentation</a>: Official docs for the monitoring tool used in Project 3.</p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/tutorials/aws-get-started">Terraform Getting Started - AWS</a>: Official step-by-step guide for Project 2.</p>
</li>
<li><p><a href="https://docs.github.com/en/actions">GitHub Actions Documentation</a>: Complete reference for building CI/CD pipelines in Project 1.</p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/learn-linux-for-beginners-book-basic-to-advanced/">freeCodeCamp - Learn Linux for Beginners</a>: Comprehensive Linux guide available on freeCodeCamp.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up OpenID Connect (OIDC) in GitHub Actions for AWS
 ]]>
                </title>
                <description>
                    <![CDATA[ If you've been storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks i ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-set-up-openid-connect-oidc-in-github-actions-for-aws/</link>
                <guid isPermaLink="false">69ef7bbf330a1ad7f7f2d579</guid>
                
                    <category>
                        <![CDATA[ OpenID Connect ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OIDC ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GitHub Actions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Mon, 27 Apr 2026 15:07:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/83b71e24-b63b-42a4-ac1c-d59e226da6c3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been storing <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks in a CI/CD pipeline.</p>
<p>Here's why: static credentials don't expire on their own. If they get leaked through a misconfigured workflow, a public fork, or a compromised repository, an attacker has persistent access to your AWS environment until you manually rotate them. And most teams don't rotate them often enough.</p>
<p>OpenID Connect (OIDC) solves this entirely. Instead of storing long-lived credentials, GitHub Actions requests a <strong>short-lived token</strong> directly from AWS every time your workflow runs. No secrets to rotate. No credentials to leak. No manual key management.</p>
<p>In this tutorial, you'll learn how to set up OIDC authentication between GitHub Actions and AWS from scratch. By the end, your workflows will authenticate to AWS securely without storing a single access key.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-openid-connect-oidc">What Is OpenID Connect (OIDC)?</a></p>
</li>
<li><p><a href="#heading-how-oidc-works-between-github-actions-and-aws">How OIDC Works Between GitHub Actions and AWS</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-create-an-iam-oidc-identity-provider-in-aws">Step 1: Create an IAM OIDC Identity Provider in AWS</a></p>
<p><a href="#heading-step-2-create-an-iam-role-with-a-trust-policy">Step 2: Create an IAM Role with a Trust Policy</a></p>
<p><a href="#heading-step-3-attach-permissions-to-the-iam-role">Step 3: Attach Permissions to the IAM Role</a></p>
<p><a href="#heading-step-4-store-the-role-arn-as-a-github-actions-variable">Step 4: Store the Role ARN as a GitHub Actions Variable</a></p>
<p><a href="#heading-step-5-configure-your-github-actions-workflow">Step 5: Configure Your GitHub Actions Workflow</a></p>
<p><a href="#heading-step-6-run-and-verify-your-workflow">Step 6: Run and Verify Your Workflow</a></p>
</li>
<li><p><a href="#heading-security-best-practices">Security Best Practices</a></p>
</li>
<li><p><a href="#heading-troubleshooting-common-errors">Troubleshooting Common Errors</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-what-is-openid-connect-oidc">What Is OpenID Connect (OIDC)?</h2>
<p>OpenID Connect is an identity protocol built on top of OAuth 2.0. It allows systems to verify identity through tokens rather than shared secrets.</p>
<p>In the context of GitHub Actions and AWS:</p>
<ul>
<li><p><strong>GitHub</strong> acts as the <strong>identity provider (IdP)</strong>. It issues a signed JWT (JSON Web Token) for each workflow run.</p>
</li>
<li><p><strong>AWS</strong> acts as the <strong>service provider</strong>. It validates that token against GitHub's public keys and exchanges it for temporary AWS credentials. The credentials AWS returns are short-lived (valid for up to 1 hour by default) and scoped to exactly the IAM role you define. When the workflow ends, those credentials are gone.</p>
</li>
</ul>
<p>This model is called <strong>federated identity</strong>. It's the same concept used when you "Sign in with Google" on a third-party website. The difference is that instead of a user signing in, your workflow is the one authenticating.</p>
<h2 id="heading-how-oidc-works-between-github-actions-and-aws">How OIDC Works Between GitHub Actions and AWS</h2>
<p>Before writing a single line of YAML, it beneficial to understand the flow. This is my personal approach when implementing new technologies or concepts. Here's what happens every time your workflow runs:</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/8b5b39de-f671-4ffe-a2db-96d10ade69b3.jpg" alt="Diagram showing the OIDC authentication flow between GitHub Actions and AWS" style="display:block;margin:0 auto" width="449" height="544" loading="lazy">

<p>The diagram illustrates a secure authentication flow between GitHub Actions and AWS using OpenID Connect (OIDC), eliminating the need to store long-lived AWS credentials in GitHub. Here's what happens step-by-step:</p>
<p><strong>1. Initial Authentication Request</strong></p>
<p>When your GitHub Actions workflow starts, the runner (the virtual machine executing your workflow) requests a JSON Web Token (JWT) from GitHub's OIDC provider located at <code>https://token.actions.githubusercontent.com</code>.</p>
<p><strong>2. Token Issuance</strong></p>
<p>GitHub's OIDC provider generates and signs a JWT containing important claims (metadata) about your workflow. These claims include details like which repository the workflow is running from, which branch triggered it, what environment it's running in, and other contextual information that proves the workflow's identity.</p>
<p><strong>3. Token Validation</strong></p>
<p>The GitHub Actions runner presents this signed JWT to AWS Security Token Service (STS). AWS STS validates the JWT's signature by checking it against GitHub's publicly available cryptographic keys, ensuring the token is authentic and hasn't been tampered with.</p>
<p><strong>4. Trust Policy Verification</strong></p>
<p>AWS STS checks the trust policy configured on your IAM Role. This trust policy specifies which GitHub repositories, branches, or environments are allowed to assume this role. If the claims in the JWT match your trust policy conditions, authentication succeeds.</p>
<p><strong>5. Temporary Credentials Issued</strong></p>
<p>Once validated, AWS STS returns temporary security credentials to the GitHub Actions runner. These credentials include an Access Key ID, Secret Access Key, and Session Token that are valid for a limited time (typically 1 hour by default, configurable up to 12 hours).</p>
<p><strong>6. AWS API Access</strong></p>
<p>The GitHub Actions runner uses these temporary credentials to authenticate API calls to your AWS resources such as pushing Docker images to ECR, updating ECS services, writing to S3 buckets, or invoking Lambda functions.</p>
<p>The key point: <strong>AWS never sees your GitHub credentials, and GitHub never sees your AWS credentials.</strong> The JWT is the only thing exchanged and it's signed, scoped, and short-lived.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following in place:</p>
<ul>
<li><p>An <strong>AWS account</strong> with IAM permissions to create identity providers and roles</p>
</li>
<li><p>A <strong>GitHub repository</strong> (public or private) where your workflows will run</p>
</li>
<li><p>Basic familiarity with <strong>GitHub Actions</strong>, knowing how to write a <code>.yml</code> workflow file</p>
</li>
<li><p>Basic familiarity with <strong>AWS IAM</strong> roles, policies, and permissions</p>
</li>
<li><p>The <strong>AWS CLI</strong> installed and configured (optional, but useful for verification). You don't need to be an AWS expert. Each step includes the exact console path and the configuration values you need.</p>
</li>
</ul>
<h2 id="heading-step-1-create-an-iam-oidc-identity-provider-in-aws">Step 1: Create an IAM OIDC Identity Provider in AWS</h2>
<p>The first thing you need to do is tell AWS to trust GitHub as an identity provider. This is a one-time setup per AWS account.</p>
<h3 id="heading-how-to-do-it-in-the-aws-console">How to Do It in the AWS Console</h3>
<p>1. Open the <a href="https://console.aws.amazon.com/iam/">AWS IAM Console</a></p>
<p>2. In the left sidebar, click Identity providers</p>
<p>3. Click Add provider</p>
<p>4. For Provider type, select OpenID Connect</p>
<p>5. For Provider URL, enter:</p>
<pre><code class="language-plaintext">https://token.actions.githubusercontent.com
</code></pre>
<p>6. For Audience, enter:</p>
<pre><code class="language-plaintext">sts.amazonaws.com
</code></pre>
<p>7. Click Add provider</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/66f1de9d-36f9-462e-ad0c-090b152be6e5.png" alt="AWS IAM console showing the Add Identity Provider form configured for GitHub Actions OIDC" style="display:block;margin:0 auto" width="1349" height="609" loading="lazy">

<h3 id="heading-how-to-do-it-with-the-aws-cli">How to Do It with the AWS CLI</h3>
<p>If you prefer the terminal, run this command:</p>
<pre><code class="language-shell">aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/4b779fa0-0df2-4bc3-bbf4-9839ef8ce5e6.png" alt="terminal-oidc-connect-created" style="display:block;margin:0 auto" width="966" height="114" loading="lazy">

<p>Once created, you'll see <code>token.actions.githubusercontent.com</code> listed under <strong>Identity providers</strong> in your IAM console. This provider will be referenced in your IAM role's trust policy in the next step.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/eb820487-6553-43d2-b6b7-4e7b08d039ef.png" alt="verify oidc connect in AWS" style="display:block;margin:0 auto" width="1132" height="284" loading="lazy">

<h2 id="heading-step-2-create-an-iam-role-with-a-trust-policy">Step 2: Create an IAM Role with a Trust Policy</h2>
<p>Now you need an IAM role that your GitHub Actions workflow will assume. The trust policy on this role controls which repositories and branches are allowed to request credentials.</p>
<h3 id="heading-how-to-create-the-iam-role-in-the-aws-console">How to Create the IAM Role in the AWS Console</h3>
<p>1. Open the <a href="https://console.aws.amazon.com/iam/">AWS IAM Console</a></p>
<p>2. In the left sidebar, click <strong>Roles</strong></p>
<p>3. Click <strong>Create role</strong></p>
<p>4. For <strong>Trusted entity type</strong>, select <strong>Web identity</strong></p>
<p>5. For <strong>Identity Provider</strong>, choose: <code>token.actions.githubusercontent.com</code> which you created earlier.</p>
<p>6. For Audience, choose <code>sts.amazonaws.com</code> as well</p>
<p>7. For GitHub organisation, enter your GitHub username or organization name</p>
<p>8. For GitHub repository, enter your GitHub repository</p>
<p>9. For GitHub branch, enter your branch name (for example, main)</p>
<p>10. Click Next, then Next, give a name to the role and click create role</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/dca12969-db8a-4ec4-885e-e953f4808f6c.png" alt="create-iam-role-for-github-action-via-the-console" style="display:block;margin:0 auto" width="1351" height="620" loading="lazy">

<p>Note: Creating the IAM role using this approach already establishes the <strong>Trusted Entities</strong> using a trusted policy based on the step 4-9 above. You can verify this by clicking on the created role and navigating to Trust relationships.</p>
<h3 id="heading-how-to-create-the-iam-role-with-the-aws-cli">How to Create the IAM Role with the AWS CLI</h3>
<p>First, you'll need to create a trust policy document on your local machine: You can call it <code>trust-policy.json</code>:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_GITHUB_ORG/YOUR_REPO_NAME:*"
        }
      }
    }
  ]
}
</code></pre>
<p>Replace the following placeholders before saving:</p>
<table>
<thead>
<tr>
<th>Placeholder</th>
<th>Replace With</th>
</tr>
</thead>
<tbody><tr>
<td><code>YOUR_ACCOUNT_ID</code></td>
<td>Your 12-digit AWS account ID</td>
</tr>
<tr>
<td><code>YOUR_GITHUB_ORG</code></td>
<td>Your GitHub username or organization name</td>
</tr>
<tr>
<td><code>YOUR_REPO_NAME</code></td>
<td>The name of your GitHub repository</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-understand-the-sub-condition">How to Understand the <code>sub</code> Condition</h3>
<p>The <code>sub (subject)</code> claim in the JWT tells AWS exactly where the request is coming from. The value <code>repo:your-org/your-repo:*</code> means any branch in that repository can assume this role.</p>
<p>You can tighten this further depending on your needs:</p>
<pre><code class="language-shell"># Only the main branch
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
 
# Only a specific GitHub Environment
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"
</code></pre>
<p>Scoping this correctly is one of the most important security decisions in this setup. Here's how to decide:</p>
<ul>
<li><p>Use <code>ref:refs/heads/main</code> if only your main/production branch should deploy to AWS. This is the most restrictive and secure option: feature branches can't accidentally (or maliciously) trigger deployments or modify production resources.</p>
</li>
<li><p>Use <code>environment:production</code> if you're using GitHub Environments with protection rules (required reviewers, deployment gates). This lets you control deployments through GitHub's approval workflow while still restricting which workflows can access AWS.</p>
</li>
<li><p>Use <code>repo:your-org/your-repo:*</code> (wildcard) only if you need any branch to deploy. for example, in development environments where every feature branch deploys to its own isolated stack. Never use this for production roles.</p>
</li>
</ul>
<p>Run this command to create the role using your trust policy:</p>
<pre><code class="language-shell">aws iam create-role \
  --role-name GitHubActionsOIDCRole \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role assumed by GitHub Actions via OIDC"
</code></pre>
<p>Take note of the <strong>Role ARN</strong> in the output. It will look like this:</p>
<pre><code class="language-plaintext">arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole
</code></pre>
<p>You'll need this ARN in your workflow YAML in Step 4.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/6bb154e7-0fb3-4c58-94e1-90116eaea95a.png" alt="terminal output of the AWS CLI create-role command showing the returned Role ARN" style="display:block;margin:0 auto" width="1123" height="615" loading="lazy">

<h2 id="heading-step-3-attach-permissions-to-the-iam-role">Step 3: Attach Permissions to the IAM Role</h2>
<p>The IAM role can now authenticate, but it has no permissions yet. You need to attach a policy that defines what your workflow is actually allowed to do in AWS.</p>
<h3 id="heading-how-to-apply-the-principle-of-least-privilege">How to Apply the Principle of Least Privilege</h3>
<p>Only grant the permissions your workflow genuinely needs. If your workflow deploys to S3, give it S3 permissions. If it pushes images to ECR, give it ECR permissions. Never attach <code>AdministratorAccess</code> to a CI/CD role.</p>
<h4 id="heading-option-1-attach-an-aws-managed-policy-quick-start">Option 1: Attach an AWS managed policy (quick start):</h4>
<pre><code class="language-shell">aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
</code></pre>
<h4 id="heading-option-2-create-a-custom-policy-scoped-to-a-specific-s3-bucket-recommended-for-production">Option 2: Create a custom policy scoped to a specific S3 bucket (recommended for production):</h4>
<p>This approach is recommended for production because it limits the blast radius of a security incident. If your workflow credentials are ever compromised, a custom policy scoped to a specific bucket means an attacker can only affect that single bucket not every S3 bucket in your AWS account. It also prevents accidental misconfigurations in your workflow from impacting unrelated resources.</p>
<p>Create a file called <code>s3-deploy-policy.json</code>:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
</code></pre>
<p>Then create and attach it:</p>
<pre><code class="language-shell">aws iam create-policy \
  --policy-name GitHubActionsS3DeployPolicy \
  --policy-document file://s3-deploy-policy.json
 
aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/GitHubActionsS3DeployPolicy
</code></pre>
<p>Note: You can as well implement <strong>Step 3</strong> via the console.</p>
<p><strong>Reference:</strong> For a full list of available AWS IAM actions, see the <a href="https://docs.aws.amazon.com/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html">AWS IAM actions reference</a>.</p>
<h2 id="heading-step-4-store-the-role-arn-as-a-github-actions-variable">Step 4: Store the Role ARN as a GitHub Actions Variable</h2>
<p>Before you configure your workflow, you need to make the Role ARN available to it. You'll store it as a repository variable in GitHub, not a secret, because the ARN itself isn't sensitive data.</p>
<h3 id="heading-how-to-add-the-variable-in-your-repository">How to Add the Variable in Your Repository</h3>
<p>First, open your GitHub repository and click <strong>Settings:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/b2dd526a-00ca-44eb-8d22-b78dfd220a14.png" alt="GitHub repository top navigation bar with the Settings tab highlighted" style="display:block;margin:0 auto" width="1310" height="307" loading="lazy">

<p>In the left sidebar, scroll down to <strong>Secrets and variables</strong>, then click <strong>Actions:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/61d67c83-7bbc-4570-93ec-f2ee4207ad6e.png" alt="GitHub repository settings sidebar showing Secrets and variables expanded with Actions selected" style="display:block;margin:0 auto" width="1266" height="325" loading="lazy">

<p>Then click the <strong>Variables</strong> tab (not Secrets). Click New repository variable – you can set the <strong>Name</strong> to:</p>
<pre><code class="language-plaintext">AWS_ROLE_ARN
</code></pre>
<p>Set the <strong>Value</strong> to your Role ARN from Step 2, for example:</p>
<pre><code class="language-plaintext">arn:aws:iam::YOUR_ACCOUNT_ID::role/GitHubActionsOIDCRole
</code></pre>
<p>Click <strong>Add variable:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/71f5468d-d4ab-45c1-aecd-8509f575237a.png" alt="GitHub repository Actions variables tab showing AWS_ROLE_ARN variable added successfully" style="display:block;margin:0 auto" width="1083" height="377" loading="lazy">

<p>You'll reference this variable in your workflow in the next step using <code>${{</code> <code>vars.AWS_ROLE_ARN }}</code>.</p>
<h2 id="heading-step-5-configure-your-github-actions-workflow">Step 5: Configure Your GitHub Actions Workflow</h2>
<p>With AWS and GitHub fully configured, you now need to update your workflow to request an OIDC token and use it to authenticate.</p>
<h3 id="heading-how-to-set-the-required-workflow-permissions">How to Set the Required Workflow Permissions</h3>
<p>Your workflow <strong>must</strong> declare <code>id-token: write</code>. Without this, GitHub won't issue an OIDC token to the runner.</p>
<pre><code class="language-yaml">permissions:
  id-token: write   # Required to request the OIDC JWT
  contents: read    # Required to checkout the repository
</code></pre>
<p><strong>Important:</strong> If you set permissions at the job level, they override any top-level permissions. Make sure <code>id-token: write</code> is present at whichever level your AWS authentication step runs.</p>
<h3 id="heading-full-workflow-example">Full Workflow Example</h3>
<p>Here's a complete workflow that authenticates to AWS using OIDC and deploys a static site to S3:</p>
<pre><code class="language-yaml">name: Deploy to AWS S3
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write
  contents: read
 
jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-2
 
      - name: Verify AWS identity
        run: aws sts get-caller-identity
 
      - name: Deploy to S3
        run: |
          aws s3 sync ./code s3://your-bucket-name
</code></pre>
<p>Replace the following before committing:</p>
<table>
<thead>
<tr>
<th>Placeholder</th>
<th>Replace With</th>
</tr>
</thead>
<tbody><tr>
<td><code>AWS_ROLE_ARN</code></td>
<td>The variable name for your IAM role ARN in GitHub</td>
</tr>
<tr>
<td><code>us-east-2</code></td>
<td>Your target AWS region</td>
</tr>
<tr>
<td><code>your-bucket-name</code></td>
<td>Your S3 bucket name</td>
</tr>
<tr>
<td><code>./code</code></td>
<td>The local directory where the file you want to sync to S3 is located</td>
</tr>
</tbody></table>
<p>You can see the code sample in my GitHub Repo <a href="https://github.com/tolani-akintayo/OpenID-Connect-in-GitHub-Actions-for-AWS">here</a>.</p>
<p><strong>Note:</strong> The <code>aws-actions/configure-aws-credentials</code> action handles the entire OIDC token exchange automatically. It requests the JWT from GitHub, calls <code>sts:AssumeRoleWithWebIdentity</code>, and exports the temporary credentials as environment variables for the rest of the job.</p>
<p>See the <a href="https://github.com/aws-actions/configure-aws-credentials">action's official documentation</a> for all available options.</p>
<h2 id="heading-step-6-run-and-verify-your-workflow">Step 6: Run and Verify Your Workflow</h2>
<p>Push your workflow to the <code>main</code> branch and open the <strong>Actions</strong> tab in your repository to watch it run.</p>
<h3 id="heading-what-a-successful-run-looks-like">What a Successful Run Looks Like</h3>
<p>The Configure AWS credentials via OIDC step should show:</p>
<pre><code class="language-plaintext">Assuming role with OIDC: arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole
</code></pre>
<p>The Verify AWS identity step (<code>aws sts get-caller-identity</code>) should return:</p>
<pre><code class="language-json">{
    "UserId": "AROA...:GitHubActions",
    "Account": "YOUR_ACCOUNT_ID",
    "Arn": "arn:aws:sts::YOUR_ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions"
}
</code></pre>
<p>If you see an <code>assumed-role</code> ARN in the output, OIDC is working correctly. Your workflow is now authenticating to AWS without a single stored credential.</p>
<h2 id="heading-security-best-practices">Security Best Practices</h2>
<p>Getting OIDC working is step one. Locking it down properly is step two.</p>
<h3 id="heading-scope-the-sub-condition-as-tightly-as-possible">Scope the <code>sub</code> Condition as Tightly as Possible</h3>
<p>Don't use a wildcard like <code>repo:your-org/*:*</code> that allows any repository in your organization to assume the role. Scope it to the exact repository and branch that needs access.</p>
<pre><code class="language-json">"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
</code></pre>
<h3 id="heading-use-github-environments-for-production-deployments">Use GitHub Environments for Production Deployments</h3>
<p>GitHub Environments let you add manual approval gates and restrict which branches can deploy. When combined with OIDC, you can scope your trust policy to only allow the <code>production</code> environment:</p>
<pre><code class="language-json">"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"
</code></pre>
<h3 id="heading-apply-least-privilege-permissions-to-every-iam-role">Apply Least-Privilege Permissions to Every IAM Role</h3>
<p>Never attach <code>AdministratorAccess</code> or <code>PowerUserAccess</code> to a role used by CI/CD. Define a custom policy with only the actions your workflow actually needs.</p>
<h3 id="heading-create-separate-iam-roles-per-environment">Create Separate IAM Roles Per Environment</h3>
<p>A staging role and a production role should have different permission scopes. Your staging deployment role should never have write access to production resources.</p>
<h3 id="heading-enable-aws-cloudtrail">Enable AWS CloudTrail</h3>
<p>Every call made using the temporary credentials is logged in CloudTrail under the assumed role ARN. This gives you a full audit trail of exactly what your workflow did in AWS.</p>
<p><strong>Reference:</strong> GitHub's official security hardening guide for OIDC: <a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect">About security hardening with OpenID Connect</a></p>
<h2 id="heading-troubleshooting-common-errors">Troubleshooting Common Errors</h2>
<h3 id="heading-error-not-authorized-to-perform-stsassumerolewithwebidentity">Error: <code>Not authorized to perform sts:AssumeRoleWithWebIdentity</code></h3>
<p>This usually means the trust policy on your IAM role doesn't match the <code>sub</code> claim in the JWT.</p>
<p>Check the following:</p>
<ul>
<li><p>The <code>sub</code> condition exactly matches your repository path (it is case-sensitive)</p>
</li>
<li><p>The <code>aud</code> condition is set to <code>sts.amazonaws.com</code></p>
</li>
<li><p>The <code>Federated</code> principal uses the correct AWS account ID</p>
</li>
</ul>
<p>To inspect the actual token claims your workflow is receiving, add this debug step temporarily:</p>
<pre><code class="language-yaml">- name: Print OIDC token claims
  run: |
    TOKEN=\((curl -s -H "Authorization: Bearer \)ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&amp;audience=sts.amazonaws.com" | jq -r '.value')
    echo $TOKEN | cut -d '.' -f2 | base64 -d 2&gt;/dev/null | jq .
</code></pre>
<h3 id="heading-error-could-not-load-credentials-from-any-providers">Error: <code>Could not load credentials from any providers</code></h3>
<p>This almost always means <code>id-token: write</code> is missing from your workflow permissions. Double-check that you have:</p>
<pre><code class="language-yaml">permissions:
  id-token: write
  contents: read
</code></pre>
<h3 id="heading-error-accessdenied-when-calling-an-aws-service">Error: <code>AccessDenied</code> When Calling an AWS Service</h3>
<p>Authentication succeeded but the IAM role doesn't have permission to perform the action your workflow is attempting. Check the permissions policy attached to your role and compare it against the specific action in the error message.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've gone from storing static, long-lived AWS credentials in GitHub Secrets to a fully keyless authentication setup using OIDC. Here's what you accomplished:</p>
<ul>
<li><p>Registered GitHub as a trusted OIDC identity provider in AWS.</p>
</li>
<li><p>Created an IAM role with a scoped trust policy tied to a specific repository.</p>
</li>
<li><p>Attached least-privilege permissions to that role.</p>
</li>
<li><p>Configured your GitHub Actions workflow to request and use short-lived AWS credentials.</p>
</li>
<li><p>Verified the authentication flow end-to-end.</p>
</li>
</ul>
<p>This pattern works across every AWS service from S3, ECS, Lambda, ECR, Secrets Manager, and more. The workflow example here uses S3, but you only need to swap out the permissions policy and the deployment commands to adapt it for any service.</p>
<p>If you want to go further, explore:</p>
<ul>
<li><p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect#supported-cloud-providers">Configuring OIDC for multiple cloud providers</a>: Azure, GCP, and HashiCorp Vault.</p>
</li>
<li><p><a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment">GitHub Environments and deployment protection rules</a>: for multi-stage pipelines with approval gates.</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html">AWS IAM Access Analyzer</a>: to validate and tighten your role policies automatically.</p>
</li>
</ul>
<p><em>If you're building out your DevOps practice and want a complete, production-ready reference for infrastructure automation, CI/CD, and platform engineering, check out</em> <a href="https://coachli.co/tolani-akintayo/PR-H4oQS"><em><strong>The Startup DevOps Field Guide</strong></em></a><em>. It covers the patterns, templates, and runbooks I've used across real AWS environments.</em></p>
<p><em>You can also connect with me on</em> <a href="https://www.linkedin.com/in/tolani-akintayo"><em>LinkedIn</em></a></p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect">GitHub Docs: About security hardening with OpenID Connect</a></p>
</li>
<li><p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services">GitHub Docs: Configuring OpenID Connect in Amazon Web Services</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html">AWS Docs: Creating OpenID Connect (OIDC) identity providers</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html">AWS Docs: AssumeRoleWithWebIdentity API Reference</a></p>
</li>
<li><p><a href="https://github.com/aws-actions/configure-aws-credentials">aws-actions/configure-aws-credentials - GitHub</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html">AWS IAM Actions Reference</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html">AWS CloudTrail User Guide</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How I Built a Production-Ready CI/CD Pipeline for a Monorepo-Based Microservices System with Jenkins, Docker Compose, and Traefik ]]>
                </title>
                <description>
                    <![CDATA[ This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server. You’ll learn how to expose services on  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-production-ready-ci-cd-pipeline-for-monorepo-based-microservices-system/</link>
                <guid isPermaLink="false">69ea60c8904b915438a58ca2</guid>
                
                    <category>
                        <![CDATA[ Jenkins ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Traefik ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md Tarikul Islam ]]>
                </dc:creator>
                <pubDate>Thu, 23 Apr 2026 18:11:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/66cb39fcaa2a09f9a8d691c1/d59c62f5-e376-4f09-851f-83e437f9960a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server.</p>
<p>You’ll learn how to expose services on a custom domain with auto-renewing HTTPS, and implement a smart deployment strategy that detects changes and redeploys only the affected microservices. This helps avoid unnecessary full-stack redeploys. We'll also cover real production issues and the exact fixes for each one.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a href="#heading-1-what-youll-build">1. What you'll build</a></p>
</li>
<li><p><a href="#heading-2-architecture">2. Architecture</a></p>
</li>
<li><p><a href="#heading-3-server-prerequisites">3. Server prerequisites</a></p>
</li>
<li><p><a href="#heading-4-traefik-the-reverse-proxy">4. Traefik — the reverse proxy</a></p>
</li>
<li><p><a href="#heading-5-run-jenkins-in-docker">5. Run Jenkins in Docker</a></p>
</li>
<li><p><a href="#heading-6-expose-jenkins-on-a-domain-via-traefik">6. Expose Jenkins on a domain via Traefik</a></p>
</li>
<li><p><a href="#heading-7-first-time-jenkins-setup">7. First-time Jenkins setup</a></p>
</li>
<li><p><a href="#heading-8-add-the-github-credential">8. Add the GitHub credential</a></p>
</li>
<li><p><a href="#heading-9-create-the-pipeline-job">9. Create the pipeline job</a></p>
</li>
<li><p><a href="#heading-10-the-jenkinsfile-deploy-only-what-changed">10. The Jenkinsfile (deploy only what changed)</a></p>
</li>
<li><p><a href="#heading-11-end-to-end-test">11. End-to-end test</a></p>
</li>
<li><p><a href="#heading-12-troubleshooting-every-error-we-hit">12. Troubleshooting — every error we hit</a></p>
</li>
<li><p><a href="#heading-13-mental-model-host-vs-container">13. Mental model: host vs. container</a></p>
</li>
<li><p><a href="#heading-14-daily-operations-cheat-sheet">14. Daily operations cheat sheet</a></p>
</li>
<li><p><a href="#heading-15-what-id-do-differently-next-time">15. What I'd do differently next time</a></p>
</li>
<li><p><a href="#heading-closing-thoughts">Closing thoughts</a></p>
</li>
</ul>
<h2 id="heading-1-what-youll-build">1. What You'll Build</h2>
<p>In this tutorial, you'll build a Jenkins instance running inside Docker on the same Linux server as your application stack.</p>
<p>Traefik will act as a reverse proxy in front of Jenkins, exposing it via a clean URL (<a href="https://jenkins.example.com"><code>https://jenkins.example.com</code></a>) with <strong>auto-renewing Let's Encrypt certificates</strong>.</p>
<p>You'll also create a Jenkinsfile in your application repository that:</p>
<ul>
<li><p>Automatically triggers on every push to the <code>staging</code> branch,</p>
</li>
<li><p>Detects which microservices changed in each commit,</p>
</li>
<li><p>Pulls the latest code on the host machine,</p>
</li>
<li><p>Rebuilds and restarts <strong>only the affected services</strong>.</p>
</li>
</ul>
<p>On every push, only the relevant services are redeployed.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>Before jumping in, this guide assumes you’re already comfortable with a few core concepts and tools.</p>
<p>This isn't a beginner-level tutorial — we’ll be working directly with infrastructure, containers, and CI/CD pipelines.</p>
<p>You should be familiar with:</p>
<ul>
<li><p>Basic Linux commands (SSH, file system navigation, permissions)</p>
</li>
<li><p>Docker fundamentals (images, containers, volumes, networks)</p>
</li>
<li><p>Git workflows (clone, pull, branches)</p>
</li>
<li><p>General idea of CI/CD pipelines</p>
</li>
</ul>
<p>Tools and environment required:</p>
<ul>
<li><p>A Linux server (Ubuntu recommended)</p>
</li>
<li><p>Docker Engine + Docker Compose (v2)</p>
</li>
<li><p>A domain name (for Traefik + HTTPS)</p>
</li>
<li><p>GitHub repository (for your backend project)</p>
</li>
<li><p>Basic understanding of microservices architecture</p>
</li>
</ul>
<p>If you’re comfortable with the above, you’re ready to follow along.</p>
<h2 id="heading-2-architecture">2. Architecture</h2>
<p>Here's an overview of the architecture:</p>
<pre><code class="language-plaintext">┌──────────────────────────── Linux server (Ubuntu) ────────────────────────────┐
│                                                                               │
│   /home/developer/projects/                                                  │
│       └── project-prod-configs/             ← infra repo (compose, Traefik) │
│              ├── docker-compose.staging.yml                                   │
│              ├── traefik.staging.yml                                          │
│              └── project-backend/          ← app repo (services, gateways) │
│                     ├── Jenkinsfile                                           │
│                     ├── docker-compose.staging.yml                            │
│                     └── apps/                                                 │
│                            ├── services/&lt;name&gt;/                               │
│                            ├── gateways/&lt;name&gt;/                               │
│                            └── core/&lt;name&gt;/                                   │
│                                                                               │
│   ┌─────────────────────── Docker network: proxy ──────────────────────┐      │
│   │  traefik (80, 443)                                                 │      │
│   │     │                                                              │      │
│   │     ├──► jenkins  (projects-jenkins-staging)                     │      │
│   │     │      ↳ /projects  ← bind-mount of the host project tree     │      │
│   │     │      ↳ /var/run/docker.sock ← controls host Docker           │      │
│   │     │                                                              │      │
│   │     └──► your services &amp; gateways (built by the pipeline)          │      │
│   └────────────────────────────────────────────────────────────────────┘      │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
            ▲
            │  webhook on push
            │
   GitHub: &lt;org&gt;/project-backend (branch: staging)
</code></pre>
<p>There are two key ideas here:</p>
<ol>
<li><p><strong>Jenkins runs in a container</strong>, but it controls the <strong>host's</strong> Docker by mounting <code>/var/run/docker.sock</code>. It also bind-mounts the project folder as <code>/projects/...</code>, so it can <code>cd</code> into the real code on the host and run <code>docker compose</code> there.</p>
</li>
<li><p>The <strong>Jenkinsfile lives inside the app repo</strong>, so the pipeline definition is versioned with the code. Jenkins simply points at it.</p>
</li>
</ol>
<h3 id="heading-3-server-prerequisites">3. Server Prerequisites</h3>
<p>Before we start configuring Jenkins or Traefik, we need to prepare the server properly.</p>
<p>In this step, we’ll:</p>
<ul>
<li><p>Create a dedicated Linux user for managing the project</p>
</li>
<li><p>Install Docker and Docker Compose</p>
</li>
<li><p>Set up the folder structure for our repositories</p>
</li>
</ul>
<p>This ensures our CI/CD pipeline runs in a clean and predictable environment.</p>
<pre><code class="language-bash"># Linux user that owns the project tree
sudo adduser developer

# Docker engine + Compose plugin
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker developer

# Sanity check Compose v2
docker compose version
# -&gt; Docker Compose version v2.x.y

# Find where the Compose plugin binary lives — write it down, you'll need it
ls /usr/libexec/docker/cli-plugins/docker-compose
# (some distros use /usr/lib/docker/cli-plugins/docker-compose)

# Project layout
sudo mkdir -p /home/developer/project
sudo chown -R developer:developer /home/developer/project

# Clone both repos in the right place
cd /home/developer/projects
git clone https://github.com/&lt;org&gt;/projects-prod-configs.git
cd projects-prod-configs
git clone -b staging https://github.com/&lt;org&gt;/projects-backend.git
</code></pre>
<p>You should now have:</p>
<pre><code class="language-plaintext">/home/developer/projects/projects-prod-configs/projects-backend
</code></pre>
<p>Memorize this path — your Jenkinsfile references it.</p>
<h3 id="heading-dns">DNS</h3>
<p>Point an A-record for your Jenkins subdomain to the server's public IP <strong>before</strong> the next steps so Let's Encrypt can validate via HTTP challenge:</p>
<pre><code class="language-plaintext">jenkins.example.com   A   &lt;server-public-ip&gt;
</code></pre>
<h2 id="heading-4-traefik-the-reverse-proxy">4. Traefik — the Reverse Proxy</h2>
<p>Traefik acts as the entry point to your entire system. Instead of exposing each service manually with ports, Traefik automatically:</p>
<ul>
<li><p>Routes traffic based on domain names</p>
</li>
<li><p>Generates and renews HTTPS certificates using Let’s Encrypt</p>
</li>
<li><p>Connects to Docker and detects services dynamically</p>
</li>
</ul>
<p>In simple terms, Traefik lets you access services like:</p>
<p><a href="https://jenkins.example.com">https://jenkins.example.com</a><br><a href="https://api.example.com">https://api.example.com</a></p>
<p>…without manually configuring NGINX or managing SSL certificates.</p>
<p>In this setup, Traefik watches Docker containers and routes traffic using labels we'll define later.</p>
<p>Traefik gives every container a real domain and a real cert with <strong>zero per-service config</strong> — you just add a few labels.</p>
<h3 id="heading-traefikstagingyml-static-config"><code>traefik.staging.yml</code> (static config)</h3>
<p>Put this at the root of your infra repo:</p>
<pre><code class="language-yaml">api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: admin@example.com           # ← change me
      storage: /etc/traefik/acme.json

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false              # only containers with traefik.enable=true
    network: proxy
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: INFO

accessLog: {}
</code></pre>
<h3 id="heading-the-traefik-service-in-docker-composestagingyml">The Traefik service in <code>docker-compose.staging.yml</code></h3>
<pre><code class="language-yaml">networks:
  proxy:
    name: proxy
    driver: bridge
  internal:
    name: internal
    driver: bridge

volumes:
  acme-data:
  traefik-logs:
  jenkins-data:

services:
  traefik:
    image: traefik:v2.11
    container_name: projects-traefik-staging
    restart: unless-stopped
    ports:
      - "80:80"        # HTTP (auto-redirects to HTTPS)
      - "443:443"      # HTTPS
      - "8080:8080"    # Traefik dashboard (internal only — protect via firewall)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.staging.yml:/etc/traefik/traefik.yml:ro
      - ./dynamic:/etc/traefik/dynamic:ro
      - acme-data:/etc/traefik           # persists Let's Encrypt certs
      - traefik-logs:/var/log/traefik
    networks:
      - proxy
    command:
      - '--api.insecure=false'
      - '--api.dashboard=true'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
      - '--providers.docker.network=proxy'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
      - '--entrypoints.web.http.redirections.entryPoint.to=websecure'
      - '--entrypoints.web.http.redirections.entryPoint.scheme=https'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
      - '--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL:-admin@example.com}'
      - '--certificatesresolvers.letsencrypt.acme.storage=/etc/traefik/acme.json'
      - '--log.level=INFO'
      - '--accesslog=true'
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      # Traefik's own dashboard
      - "traefik.http.routers.traefik-dash.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-dash.entrypoints=websecure"
      - "traefik.http.routers.traefik-dash.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-dash.service=api@internal"
</code></pre>
<p>Bring it up:</p>
<pre><code class="language-bash">cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d traefik
</code></pre>
<p>Watch the logs the first time — Traefik will request a cert for the dashboard host as soon as DNS resolves.</p>
<pre><code class="language-bash">docker logs -f projects-traefik-staging
</code></pre>
<p><strong>Tip.</strong> While testing, switch ACME to staging endpoint (<code>acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory</code>) so you don't burn through Let's Encrypt's rate limits if you misconfigure DNS. Remove that flag before going live.</p>
<h2 id="heading-5-run-jenkins-in-docker">5. Run Jenkins in Docker</h2>
<p>Add this Jenkins service to the same <code>docker-compose.staging.yml</code>. Every line matters (and the comments explain why).</p>
<pre><code class="language-yaml">  jenkins:
    image: jenkins/jenkins:lts
    container_name: projects-jenkins-staging
    restart: unless-stopped
    user: root                           # to use host docker.sock without UID juggling
    environment:
      - JAVA_OPTS=-Xmx1g -Xms512m -Duser.timezone=Asia/Dhaka
      - TZ=Asia/Dhaka                    # OS-level timezone inside container
      - JENKINS_OPTS=--prefix=/
    ports:
      - "3095:8080"                      # web UI (also reachable directly if needed)
      - "50000:50000"                    # inbound agent port
    volumes:
      - jenkins-data:/var/jenkins_home   # Jenkins config/jobs/secrets persistence
      - /var/run/docker.sock:/var/run/docker.sock                          # control host Docker
      - /usr/bin/docker:/usr/bin/docker                                     # docker CLI from host
      - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro  # docker compose plugin
      - /home/developer/projects:/projects                                # project tree
      - /etc/localtime:/etc/localtime:ro                                    # match host clock
      - /etc/timezone:/etc/timezone:ro
    networks:
      - proxy
      - internal
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080/login']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    deploy:
      resources:
        limits:
          memory: 1024M
</code></pre>
<p><strong>Why</strong> <code>user: root</code><strong>?</strong> It's the simplest way to share <code>docker.sock</code> and the project bind-mount without UID/GID gymnastics. If you prefer an unprivileged user, you'll need to set <code>group: docker</code> and align UIDs/perms on host folders — possible but out of scope here.</p>
<h2 id="heading-6-expose-jenkins-on-a-domain-via-traefik">6. Expose Jenkins on a Domain via Traefik</h2>
<p>This is the section many guides skip. We'll add <strong>labels</strong> to the Jenkins service so Traefik picks it up automatically. No editing of Traefik config required.</p>
<pre><code class="language-yaml">  jenkins:
    # ... everything above ...
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"

      # 1) Router — match incoming Host
      - "traefik.http.routers.jenkins.rule=Host(`jenkins.example.com`)"
      - "traefik.http.routers.jenkins.entrypoints=websecure"
      - "traefik.http.routers.jenkins.tls.certresolver=letsencrypt"
      - "traefik.http.routers.jenkins.service=jenkins"

      # 2) Service — tell Traefik which container port is the app
      - "traefik.http.services.jenkins.loadbalancer.server.port=8080"

      # 3) Middleware — Jenkins needs X-Forwarded-Proto so it knows it's behind HTTPS
      - "traefik.http.middlewares.jenkins-headers.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.routers.jenkins.middlewares=jenkins-headers"
</code></pre>
<p>What each line does:</p>
<table>
<thead>
<tr>
<th>Label</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>traefik.enable=true</code></td>
<td>Opts this container in (we set <code>exposedByDefault=false</code>).</td>
</tr>
<tr>
<td><code>traefik.docker.network=proxy</code></td>
<td>Tells Traefik which network to talk to Jenkins on (Jenkins is on both <code>proxy</code> and <code>internal</code>).</td>
</tr>
<tr>
<td><code>routers.jenkins.rule=Host(...)</code></td>
<td>Forwards only this hostname to Jenkins.</td>
</tr>
<tr>
<td><code>routers.jenkins.entrypoints=websecure</code></td>
<td>Listens only on 443. (HTTP redirect was set up in section 4.)</td>
</tr>
<tr>
<td><code>routers.jenkins.tls.certresolver=letsencrypt</code></td>
<td>Auto-issues + renews the cert.</td>
</tr>
<tr>
<td><code>services.jenkins.loadbalancer.server.port=8080</code></td>
<td>Jenkins listens on 8080 inside the container.</td>
</tr>
<tr>
<td><code>customrequestheaders.X-Forwarded-Proto=https</code></td>
<td>Without this, Jenkins generates <code>http://</code> URLs in webhooks/links and breaks.</td>
</tr>
</tbody></table>
<p>Bring Jenkins up:</p>
<pre><code class="language-bash">cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d jenkins

# Watch Traefik issue the certificate
docker logs -f projects-traefik-staging | grep -i acme
</code></pre>
<p>After 10–60 seconds you should be able to open <code>https://jenkins.example.com</code> and see Jenkins's setup wizard with a valid lock icon.</p>
<p>Inside Jenkins (after first login):</p>
<p>Manage Jenkins → System → Jenkins URL → set this to: <a href="https://jenkins.example.com/">https://jenkins.example.com/</a></p>
<p>This is important because Jenkins uses this base URL to generate:</p>
<ul>
<li><p>Webhook endpoints (for GitHub triggers)</p>
</li>
<li><p>Links inside emails and build logs</p>
</li>
</ul>
<p>If this isn't set correctly, GitHub webhooks may fail, and any links Jenkins generates will point to the wrong address (often localhost or internal IPs).</p>
<h2 id="heading-7-first-time-jenkins-setup">7. First-Time Jenkins Setup</h2>
<p>If you're running Jenkins for the first time on this server, follow this section to complete the initial setup.</p>
<p>If you already have Jenkins configured, you can skip this section — but make sure the required plugins and settings match what we use later in this guide.</p>
<ol>
<li><p>Open <code>https://jenkins.example.com</code>. Get the initial admin password:</p>
<pre><code class="language-bash">docker exec projects-jenkins-staging cat /var/jenkins_home/secrets/initialAdminPassword
</code></pre>
</li>
<li><p>Paste it, choose Install suggested plugins.</p>
</li>
<li><p>Create your admin user.</p>
</li>
<li><p>Manage Jenkins → Plugins → Available and install:</p>
<ul>
<li><p>GitHub (and GitHub Branch Source)</p>
</li>
<li><p>Pipeline: GitHub</p>
</li>
<li><p>Credentials Binding (usually preinstalled)</p>
</li>
</ul>
</li>
</ol>
<p>That's all the plugins you need for the rest of this guide.</p>
<h2 id="heading-8-add-the-github-credential">8. Add the GitHub Credential</h2>
<p>Jenkins needs permission to access your GitHub repository.</p>
<p>This is done using a GitHub Personal Access Token (PAT), which acts like a password for secure API and Git operations.</p>
<p>We’ll store this token inside Jenkins as a credential so it can pull code during pipeline execution and authenticate securely without exposing secrets in code.</p>
<p>This single credential is used both for the SCM checkout and for the deploy-time <code>git pull</code>.</p>
<ol>
<li><p>Create a Personal Access Token (classic) on GitHub with <code>repo</code> scope.</p>
</li>
<li><p>In Jenkins: Manage Jenkins → Credentials → System → Global → Add Credentials.</p>
</li>
<li><p>Fill in:</p>
<ul>
<li><p>Kind: Username with password</p>
</li>
<li><p>Username: your GitHub username</p>
</li>
<li><p>Password: the token</p>
</li>
<li><p><strong>ID:</strong> <code>github_classic_token</code> <em>(the Jenkinsfile references this exact ID)</em></p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-9-create-the-pipeline-job">9. Create the Pipeline Job</h2>
<p>Now that Jenkins has access to your repository, the next step is to define how deployments should run.</p>
<p>A pipeline job tells Jenkins:</p>
<ul>
<li><p>where your code lives,</p>
</li>
<li><p>which branch to monitor,</p>
</li>
<li><p>and how to execute your deployment process.</p>
</li>
</ul>
<p>In Jenkins, create a new Pipeline job and connect it to your GitHub repository. Once this is set up, Jenkins will automatically trigger deployments whenever you push to the <code>staging</code> branch.</p>
<p>Start by creating a new job:</p>
<p>New Item → Pipeline → name it <code>projects-staging</code> → OK</p>
<p>Then configure the job:</p>
<ul>
<li><p>Under <strong>Build Triggers</strong>, enable:<br><strong>GitHub hook trigger for GITScm polling</strong></p>
</li>
<li><p>Under <strong>Pipeline</strong>:</p>
<ul>
<li><p>Definition: Pipeline script from SCM</p>
</li>
<li><p>SCM: Git</p>
</li>
<li><p>Repository URL: <code>https://github.com/&lt;org&gt;/projects-backend.git</code></p>
</li>
<li><p>Credentials: <code>github_classic_token</code></p>
</li>
<li><p>Branch: <code>*/staging</code></p>
</li>
<li><p>Script Path: <code>Jenkinsfile</code></p>
</li>
</ul>
</li>
</ul>
<p>Save the configuration.</p>
<p>At this point, Jenkins is fully connected to your repository and ready to run your deployment pipeline automatically.</p>
<h2 id="heading-10-the-jenkinsfile-deploy-only-what-changed">10. The Jenkinsfile (Deploy Only What Changed)</h2>
<p>Place this at the root of the <strong>app</strong> repo (<code>projects-backend/Jenkinsfile</code>), branch <code>staging</code>.</p>
<pre><code class="language-groovy">pipeline {
  agent any

  environment {
    PROJECT_PATH = "/projects/projects-prod-configs/projects-backend"
    COMPOSE_FILE = "docker-compose.staging.yml"
  }

  stages {

    stage('Checkout') {
      steps {
        checkout scm
        echo "Checkout completed for branch: ${env.BRANCH_NAME ?: 'staging'}"
      }
    }

    stage('Detect Changes') {
      steps {
        script {
          def changedFiles = sh(
            script: "git diff --name-only HEAD~1 HEAD",
            returnStdout: true
          ).trim()

          echo "Changed files:\n${changedFiles}"

          def services = [] as Set
          changedFiles.split('\n').each { file -&gt;
            def svc  = file =~ /^apps\/services\/([a-z0-9-]+)\//
            def gw   = file =~ /^apps\/gateways\/([a-z0-9-]+)\//
            def core = file =~ /^apps\/core\/([a-z0-9-]+)\//
            if (svc)  { services &lt;&lt; svc[0][1]  }
            if (gw)   { services &lt;&lt; gw[0][1]   }
            if (core) { services &lt;&lt; core[0][1] }
          }
          services = services.findAll { !it.endsWith('-e2e') }
          env.CHANGED_SERVICES = services.join(' ')

          echo "Services to deploy: ${env.CHANGED_SERVICES ?: '(none)'}"
        }
      }
    }

    stage('Deploy') {
      when { expression { return env.CHANGED_SERVICES?.trim() } }
      steps {
        withCredentials([usernamePassword(
          credentialsId: 'github_classic_token',
          usernameVariable: 'GIT_USER',
          passwordVariable: 'GIT_TOKEN'
        )]) {
          sh '''
            set -eu
            git config --global --add safe.directory "${PROJECT_PATH}"
            cd "${PROJECT_PATH}"
            git remote set-url origin "https://github.com/&lt;org&gt;/projects-backend.git"
            git -c credential.helper= \
                -c "credential.helper=!f() { echo username=\({GIT_USER}; echo password=\){GIT_TOKEN}; }; f" \
                pull origin staging
            docker compose -f "\({COMPOSE_FILE}" up -d --build \){CHANGED_SERVICES}
          '''
        }
        echo "Deployed: ${env.CHANGED_SERVICES}"
      }
    }

    stage('Skip Deployment') {
      when { expression { return !env.CHANGED_SERVICES?.trim() } }
      steps { echo "No service changes detected — nothing to deploy." }
    }
  }
}
</code></pre>
<p>Why each tricky line is there:</p>
<ul>
<li><p><code>git config --global --add safe.directory ...</code> — git refuses to operate on a repo whose owner UID differs from the current user's. The repo on disk is owned by <code>developer</code>, but Git inside the container runs as <code>root</code>. This whitelists the path.</p>
</li>
<li><p><code>git remote set-url origin "https://..."</code> — flips the on-disk remote to HTTPS so the <strong>token can be used</strong>. (A PAT can't authenticate <code>git@github.com:</code> URLs — those use SSH.) Idempotent — safe to re-run.</p>
</li>
<li><p><code>git -c credential.helper="!f() { echo username=...; echo password=...; }; f"</code> — feeds the username/token to git for that one command without writing the token to disk and without exposing it on the process command line.</p>
</li>
<li><p><code>${CHANGED_SERVICES}</code> is unquoted on purpose so multiple service names expand as separate args.</p>
</li>
</ul>
<h2 id="heading-11-end-to-end-test">11. End-to-End Test</h2>
<p>Before considering the setup complete, we need to verify that the entire pipeline works as expected.</p>
<p>This end-to-end test ensures that:</p>
<ul>
<li><p>GitHub webhooks are triggering Jenkins correctly,</p>
</li>
<li><p>Jenkins can detect which services changed,</p>
</li>
<li><p>and only the affected services are rebuilt and deployed.</p>
</li>
</ul>
<p>In other words, this simulates a real production deployment.</p>
<p>Start by making a small change in your repository. For example, modify a file inside:</p>
<p>apps/gateways/student-apigw/</p>
<p>Then push the change to the <code>staging</code> branch.</p>
<p>Once pushed, Jenkins should automatically trigger via the webhook. If not, you can manually click <strong>Build Now</strong>.</p>
<p>Now open the build’s <strong>Console Output</strong> and verify the flow. You should see something like:</p>
<ul>
<li><p>Checkout completed for branch: staging</p>
</li>
<li><p>Services to deploy: student-apigw</p>
</li>
<li><p>git pull origin staging (successful)</p>
</li>
<li><p>docker compose ... up -d --build student-apigw</p>
</li>
<li><p>Deployed: student-apigw</p>
</li>
</ul>
<p>If you see this sequence, your pipeline is working correctly.</p>
<p>If anything fails, don’t worry — jump to Section 12 where every common issue and its fix is documented.</p>
<h2 id="heading-12-troubleshooting-every-error-we-hit">12. Troubleshooting — Every Error We Hit</h2>
<p>This section covers real issues we faced while setting up this pipeline — and more importantly, <em>why each fix works</em>. Understanding the “why” will help you debug similar problems in your own setup.</p>
<h3 id="heading-cd-cant-cd-to-projectsprojects-prod-configsprojects-backend">cd: can't cd to /projects/projects-prod-configs/projects-backend</h3>
<p><strong>Cause:</strong><br>The Jenkinsfile runs <code>cd $PROJECT_PATH</code>, but inside the container that path doesn’t exist. This usually happens when:</p>
<ul>
<li><p>the project wasn’t cloned on the host, or</p>
</li>
<li><p>the bind mount isn’t configured correctly.</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="language-bash">ls /home/developer/projects/projects-prod-configs/projects-backend
# If missing: git clone -b staging &lt;url&gt; there.
</code></pre>
<p>Confirm the bind mount:</p>
<pre><code class="language-plaintext">docker inspect projects-jenkins-staging --format '{{range .Mounts}}{{.Source}} -&gt; {{.Destination}}{{println}}{{end}}'
</code></pre>
<p>If missing, recreate the container:</p>
<pre><code class="language-plaintext">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Jenkins runs inside a container, but your code lives on the host. The bind mount connects them. Without it, Jenkins cannot access your project directory.</p>
<h3 id="heading-fatal-detected-dubious-ownership-in-repository">fatal: detected dubious ownership in repository</h3>
<p><strong>Cause:</strong><br>Git blocks access when the repository owner differs from the current user.</p>
<ul>
<li><p>Repo owner: <code>developer</code> (host)</p>
</li>
<li><p>Git runs as: <code>root</code> (inside container)</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="language-plaintext">git config --global --add safe.directory "${PROJECT_PATH}"
</code></pre>
<p><strong>Why this works:</strong></p>
<p>This explicitly tells Git that the directory is trusted, bypassing ownership mismatch security restrictions.</p>
<h3 id="heading-host-key-verification-failed-could-not-read-from-remote-repository"><code>Host key verification failed</code> / <code>Could not read from remote repository</code></h3>
<h4 id="heading-cause">Cause:</h4>
<p>The repository uses SSH (<code>git@github.com:...</code>), but:</p>
<ul>
<li><p>the container has no SSH keys</p>
</li>
<li><p>no known_hosts file exists</p>
</li>
</ul>
<p>Also, GitHub tokens cannot authenticate over SSH.</p>
<p><strong>Fix (recommended):</strong></p>
<pre><code class="language-plaintext">git remote set-url origin "https://github.com/&lt;org&gt;/projects-backend.git"
</code></pre>
<p><strong>Why this works:</strong></p>
<p>HTTPS uses token-based authentication (PAT), which works inside containers without SSH configuration.</p>
<h3 id="heading-unknown-shorthand-flag-f-in-f-docker-compose"><code>unknown shorthand flag: 'f' in -f</code> ( <code>docker compose</code>)</h3>
<p><strong>Cause:</strong><br>The Docker CLI exists, but the Docker Compose plugin is missing inside the container.</p>
<p><strong>Fix:</strong></p>
<pre><code class="language-plaintext">volumes:
  - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro
</code></pre>
<p>Find your path if needed:</p>
<pre><code class="language-plaintext">find /usr -name docker-compose -type f 2&gt;/dev/null
</code></pre>
<p>Verify:</p>
<pre><code class="language-plaintext">docker exec projects-jenkins-staging docker compose version
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Docker Compose v2 is a CLI plugin. Mounting this directory makes the <code>docker compose</code> command available inside the container.</p>
<h3 id="heading-wrong-timezone-in-build-timestamps-and-jenkins-ui">Wrong timezone in build timestamps and Jenkins UI</h3>
<p><strong>Fix:</strong> Set both env var and JVM flag, and bind-mount the host's clock files:</p>
<pre><code class="language-yaml">environment:
  - TZ=Asia/Dhaka
  - JAVA_OPTS=... -Duser.timezone=Asia/Dhaka
volumes:
  - /etc/localtime:/etc/localtime:ro
  - /etc/timezone:/etc/timezone:ro
</code></pre>
<p>You <strong>must</strong> recreate the container for env-var changes to take effect:</p>
<pre><code class="language-bash">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong><br>Jenkins runs on Java, which uses its own timezone separate from the OS.<br>By aligning OS timezone, JVM timezone, and host clock, you ensure consistent timestamps everywhere.</p>
<h3 id="heading-errsockettimeout-pnpm-install-fails">ERR_SOCKET_TIMEOUT (pnpm install fails)</h3>
<h4 id="heading-cause">Cause:</h4>
<p>If you have multiple services building in parallel and each runs pnpm install with ~1500 packages, the network gets saturated and a timeout occurs.</p>
<h4 id="heading-fixes">Fixes:</h4>
<p>a) Increase timeout + control concurrency</p>
<pre><code class="language-xml">RUN pnpm install --frozen-lockfile --ignore-scripts 
--network-timeout 600000 
--network-concurrency 8
</code></pre>
<p>Why: Gives pnpm more time and reduces network overload.</p>
<p>b) Enable pnpm cache (BuildKit)</p>
<pre><code class="language-xml">RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store 
pnpm install --frozen-lockfile --ignore-scripts
</code></pre>
<p>Why: Dependencies are cached and reused instead of downloading every time.</p>
<p>c) Avoid unnecessary rebuilds</p>
<pre><code class="language-xml">docker compose -f \(COMPOSE_FILE build \)CHANGED_SERVICES docker compose -f \(COMPOSE_FILE up -d --no-build \)CHANGED_SERVICES
</code></pre>
<p>Why: Only changed services are rebuilt → less network load → fewer failures.</p>
<h3 id="heading-container-changes-dont-apply-after-editing-docker-composeyml">Container changes don’t apply after editing docker-compose.yml</h3>
<h4 id="heading-cause">Cause:</h4>
<p>Docker compose up -d does not update running containers.</p>
<h4 id="heading-fix">Fix:</h4>
<pre><code class="language-xml">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong></p>
<p>This forces Docker to recreate the container with updated configuration (env, volumes, labels).</p>
<h3 id="heading-traefik-shows-default-certificate-no-https">Traefik shows default certificate (no HTTPS)</h3>
<h4 id="heading-common-causes">Common causes:</h4>
<p>DNS not pointing to server Port 80 blocked Wrong Docker network</p>
<h4 id="heading-check">Check:</h4>
<pre><code class="language-xml">dig +short jenkins.example.com docker logs projects-traefik-staging 2&gt;&amp;1 | grep -i acme
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Let’s Encrypt uses HTTP-01 challenge, so it must reach your server via port 80. If DNS or networking is wrong, certificate issuance fails.</p>
<h3 id="heading-jenkins-reverse-proxy-setup-is-broken">Jenkins: "Reverse proxy setup is broken"</h3>
<h4 id="heading-fix">Fix:</h4>
<p>Set the Jenkins URL to <a href="https://jenkins.example.com/">https://jenkins.example.com/</a><br>Ensure header:</p>
<pre><code class="language-xml">X-Forwarded-Proto: https
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Jenkins needs to know it's behind HTTPS. Without this, it generates incorrect URLs (http instead of https), breaking redirects and webhooks.</p>
<h2 id="heading-13-mental-model-host-vs-container">13. Mental Model: Host vs. Container</h2>
<p>Many setup mistakes come from confusing the <strong>host</strong> filesystem with the <strong>container</strong> filesystem. This table makes it explicit:</p>
<table>
<thead>
<tr>
<th>Inside the Jenkins container</th>
<th>Comes from on the host</th>
</tr>
</thead>
<tbody><tr>
<td><code>/var/jenkins_home</code></td>
<td>docker volume <code>jenkins-data</code> (Jenkins config, jobs, secrets)</td>
</tr>
<tr>
<td><code>/projects/...</code></td>
<td><code>/home/developer/projects/...</code> (your project tree)</td>
</tr>
<tr>
<td><code>/usr/bin/docker</code></td>
<td>host's <code>/usr/bin/docker</code></td>
</tr>
<tr>
<td><code>/usr/libexec/docker/cli-plugins/docker-compose</code></td>
<td>host plugin (lets <code>docker compose</code> work)</td>
</tr>
<tr>
<td><code>/var/run/docker.sock</code></td>
<td>host Docker daemon (so builds happen on the host's engine)</td>
</tr>
<tr>
<td><code>/etc/localtime</code>, <code>/etc/timezone</code></td>
<td>host clock</td>
</tr>
<tr>
<td><code>~/.ssh</code></td>
<td><strong>nothing</strong> — that's why SSH-to-GitHub doesn't work without extra setup</td>
</tr>
</tbody></table>
<p>When debugging, always ask: <em>"Inside which filesystem is this command running, and does the file/folder it's looking for exist there?"</em></p>
<h2 id="heading-14-daily-operations-cheat-sheet">14. Daily Operations Cheat Sheet</h2>
<pre><code class="language-bash"># Recreate Jenkins after changing compose
cd /home/developer/Projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

# Tail Jenkins logs
docker logs -f projects-jenkins-staging

# Open a shell inside the Jenkins container
docker exec -it projects-jenkins-staging bash

# From inside the container — sanity checks
docker compose version
ls /projects/projects-prod-configs/projects-backend
git -C /projects/projects-prod-configs/projects-backend remote -v

# Manually trigger the same deploy the pipeline does
cd /projects/projects-configs/projects-backend
git pull origin staging
docker compose -f docker-compose.staging.yml up -d --build student-apigw

# Inspect Traefik routing decisions
docker logs projects-traefik-staging 2&gt;&amp;1 | grep -i jenkins

# Check renewed certs
docker exec projects-traefik-staging cat /etc/traefik/acme.json | head -50
</code></pre>
<h2 id="heading-15-what-id-do-differently-next-time">15. What I'd Do Differently Next Time</h2>
<ul>
<li><p><strong>Pre-build a base image</strong> with all node_modules baked in. With ~1500 packages × 15 services, every clean build re-downloads ~22k tarballs. A shared base cuts that 90%.</p>
</li>
<li><p><strong>Run a private npm proxy</strong> (Verdaccio / Nexus / GitHub Packages) on the same Docker network — eliminates flaky <code>npmjs.org</code> timeouts entirely.</p>
</li>
<li><p><strong>Per-service Jenkinsfile</strong> if your services drift apart in tooling. With one Jenkinsfile, every team contends for the same pipeline definition.</p>
</li>
<li><p><strong>Replace</strong> <code>git diff HEAD~1 HEAD</code> with <code>git diff $(git merge-base HEAD origin/staging~1) HEAD</code> so squash-merges and force-pushes don't accidentally skip services.</p>
</li>
<li><p><strong>Move secrets to a vault</strong> (HashiCorp Vault / AWS Secrets Manager / Doppler). PATs in Jenkins work, but rotation across many jobs is painful.</p>
</li>
<li><p><strong>Use Jenkins' Configuration-as-Code (JCasC)</strong> so the entire Jenkins setup (jobs, credentials definitions, plugins) is in git. Then a server rebuild is a one-command operation.</p>
</li>
</ul>
<h2 id="heading-closing-thoughts">Closing Thoughts</h2>
<p>The pipeline itself is just three stages: <strong>Checkout → Detect Changes → Deploy</strong> — but a real production setup is mostly about <strong>plumbing</strong>: reverse proxy, certificates, bind-mounts, credentials, timezones, build caches. None of these are exotic. Together they decide whether your Friday-afternoon deploy goes silently green or eats your weekend.</p>
<p>Follow sections 1–11 to get a working pipeline. Bookmark section 12 to keep it working.</p>
<p>Happy shipping.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP ]]>
                </title>
                <description>
                    <![CDATA[ Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-gpu-optimized-machine-image-with-hashicorp-packer-on-gcp/</link>
                <guid isPermaLink="false">69e93606d5f8830e7d9fbad6</guid>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ VM Image ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hashicorp packer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rasheedat Atinuke Jamiu ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 20:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/fd393878-fe7c-458a-addf-7cd22d8280ac.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensive cloud credits and getting frustrated before actual work begins.</p>
<p>In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a href="#heading-step-1-install-packer">Step 1: Install Packer</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</a></p>
</li>
<li><p><a href="#heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</a></p>
</li>
<li><p><a href="#heading-step-4-define-your-source">Step 4: Define Your Source</a></p>
</li>
<li><p><a href="#heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</a></p>
</li>
<li><p><a href="#heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</a></p>
<ul>
<li><p><a href="#heading-section-1-pre-installation-kernel-headers">section 1: Pre-Installation (Kernel Headers)</a></p>
</li>
<li><p><a href="#heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</a></p>
</li>
<li><p><a href="#heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</a></p>
</li>
<li><p><a href="#heading-section-4-installing-the-driver">Section 4: Installing the Driver</a></p>
</li>
<li><p><a href="#heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</a></p>
</li>
<li><p><a href="#heading-section-6-nvidia-container-toolkit">Section 6: Nvidia Container Toolkit</a></p>
</li>
<li><p><a href="#heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM — Data Center GPU Manager</a></p>
</li>
<li><p><a href="#heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</a></p>
</li>
<li><p><a href="#heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-7assembling-and-running-the-build">Step 7:Assembling and Running the Build</a></p>
</li>
<li><p><a href="#heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><a href="https://www.packer.io/">HashiCorp Packer</a> &gt;= 1.9</p>
</li>
<li><p><a href="https://github.com/hashicorp/packer-plugin-googlecompute">Google Compute Packer plugin</a> (installed via <code>packer init</code>)</p>
</li>
<li><p>Optionally, the <a href="https://github.com/hashicorp/packer-plugin-amazon">AWS Packer plugin</a> can be used for EC2 builds by adding an <code>amazon-ebs</code> source to <code>node.pkr.hcl</code></p>
</li>
<li><p>GCP project with Compute Engine API enabled (or AWS account with EC2 access)</p>
</li>
<li><p>GCP authentication (<code>gcloud auth application-default login</code>) or AWS credentials</p>
</li>
<li><p>Access to an NVIDIA GPU instance type (For example, A100, H100, L4 on GCP; p4d, p5, G6 on AWS)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-step-1-install-packer">Step 1: Install Packer</h3>
<p>To get started, you'll install Packer with the steps below if you're on macOS (or you can follow the official documentation for Linux and Windows installation <a href="https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli#:~:text=Chocolatey%20on%20Windows-,Linux,-HashiCorp%20officially%20maintains">guides</a>).</p>
<p>First, you'll install the official Packer formula from the terminal.</p>
<p>Install the HashiCorp tap, a repository of all Hashicorp packages.</p>
<pre><code class="language-plaintext">$ brew tap hashicorp/tap
</code></pre>
<p>Now, install Packer with <code>hashicorp/tap/packer</code>.</p>
<pre><code class="language-plaintext">$ brew install hashicorp/tap/packer
</code></pre>
<h3 id="heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</h3>
<p>With Packer installed, you'll create your project directory. For clean code and separation of concerns, your project directory should look like the below. Go ahead and create these files in your <code>packer_demo</code> folder using the command below:</p>
<pre><code class="language-plaintext">mkdir -p packer_demo/script &amp;&amp; touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh
</code></pre>
<p>Your file directory should look like this:</p>
<pre><code class="language-plaintext">packer_demo
├── build.pkr.hcl                 # Build pipeline — provisioner ordering
├── source.pkr.hcl                # GCP source definition (googlecompute)
├── variable.pkr.hcl              # Variable definitions with defaults
├── local.pkr.hcl                 # Local values
├── plugins.pkr.hcl                # Packer plugin requirements
├── values.pkrvars.hcl             # variable values (copy and customize)
├── script/
│   ├── base.sh                  # requirement script 
</code></pre>
<h3 id="heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</h3>
<p>In your <code>plugins.pkr.hcl file,</code>, define your plugins in the <code>packer block.</code> The <code>packer {}</code> block contains Packer settings, including specifying a required plugin version. You'll find the <code>required_plugins</code> block in the Packer block, which specifies all the plugins required by the template to build your image. If you're on Azure or AWS, you can check for the latest plugin <a href="https://developer.hashicorp.com/packer/integrations">here</a>.</p>
<pre><code class="language-hcl">packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~&gt; 1"
    }
  }
}
</code></pre>
<p>Then, initialize your Packer plugin with the command below:</p>
<pre><code class="language-plaintext">packer init .
</code></pre>
<h3 id="heading-step-4-define-your-source">Step 4: Define Your Source</h3>
<p>With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. Source blocks contain your <code>project ID</code>, the zone where your machine will be created, the <code>source_image_family</code> (think of this as your base image, such as Debian, Ubuntu, and so on), and your <code>source_image_project_id</code>.</p>
<p>In GCP, each has an image project ID, such as "ubuntu-os-cloud" for Ubuntu. You'll set the <code>machine type</code> to a GPU machine type because you're building your base image for a GPU machine, so the machine on which it will be created needs to be able to run your commands.</p>
<pre><code class="language-hcl">source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}
</code></pre>
<p>Setting <code>on_host_maintenance = "TERMINATE"</code> on Google Cloud Compute Engine ensures that a VM instance stops instead of live-migrating during infrastructure maintenance. This is important when using GPUs or specialized hardware that can't migrate, preventing data corruption.</p>
<p>You'll define all your variables in the <code>variable.pkr.hcl</code> file, and set the values in the <code>values.pkrvars.hcl</code>. Remember to always add your <code>values.pkrvars.hcl</code> file to Gitignore.</p>
<pre><code class="language-hcl">variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The image family to which the resulting image belongs"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}
variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}
</code></pre>
<p><code>values.pkrvars.hcl</code></p>
<pre><code class="language-hcl">image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size        = 50
driver_version   = "590.48.01"
cuda_version      = "13.1" 
</code></pre>
<h3 id="heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</h3>
<p>Create <code>build.pkr.hcl</code>. The <code>build</code> block creates a temporary instance, runs provisioners, and produces an image.</p>
<p>Provisioners in this template are organized as follows:</p>
<ul>
<li><p><strong>First provisioner</strong> runs system updates and upgrades.</p>
</li>
<li><p><strong>Second provisioner</strong> reboots the instance (<code>expect_disconnect = true</code>).</p>
</li>
<li><p><strong>Third provisioner</strong> waits for the instance to come back (<code>pause_before</code>), then runs <code>script/base.sh</code>. This provisioner sets <code>max_retries</code> to handle transient SSH timeouts and pass environment variables for <code>DRIVER_VERSION</code> and <code>CUDA_VERSION</code>.</p>
</li>
</ul>
<p>Lastly, you have the post-processor to tell you the image ID and completion status:</p>
<pre><code class="language-hcl">build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}
</code></pre>
<h3 id="heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</h3>
<p>Now we'll go through the base script, and break down some parts of it.</p>
<h3 id="heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</h3>
<p>Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.</p>
<pre><code class="language-shellscript">log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget
</code></pre>
<h3 id="heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</h3>
<p>This snippet downloads and installs NVIDIA’s official keyring package based on your OS Linux distribution, which adds the trusted signing keys needed for the system to verify CUDA packages.</p>
<pre><code class="language-shellscript">log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/\({DISTRO}/\){ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
</code></pre>
<h3 id="heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</h3>
<p>Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.</p>
<p>NVIDIA drivers are tightly coupled with CUDA toolkit versions, Kernel versions, and container runtimes like Docker or NVIDIA Container Toolkit</p>
<p>A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.</p>
<pre><code class="language-shellscript">log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"
</code></pre>
<h3 id="heading-section-4-installing-the-driver">Section 4: Installing the Driver</h3>
<p>The <code>libnvidia-compute</code> installs only the compute‑related user‑space libraries (CUDA driver components), while the <code>nvidia-dkms-open;</code> installs the <strong>open‑source NVIDIA kernel module</strong>, built locally via DKMS.</p>
<p>Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.</p>
<p>Here, we're using <strong>NVIDIA’s compute‑only driver stack using the open‑source kernel modules</strong>, as it deliberately avoids installing any display-related components, which you don't need.</p>
<p>This method provides an installation module based on DKMS that's better aligned with Linux distros, as it's lightweight, and compute-focused.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open
</code></pre>
<h3 id="heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</h3>
<p>This part of the script installs the <strong>CUDA Toolkit</strong> for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.</p>
<p>It adds CUDA binaries to PATH, so commands like <code>nvcc</code>, <code>cuda-gdb</code>, and <code>cuda-memcheck</code> work without specifying full paths. It also adds CUDA libraries to LD_LIBRARY_PATH, so applications can find CUDA’s shared libraries at runtime.</p>
<pre><code class="language-shellscript">log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
</code></pre>
<h3 id="heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</h3>
<p>This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi
</code></pre>
<h3 id="heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</h3>
<p>This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.</p>
<p>It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.</p>
<p>The script extracts the installed version and checks that it meets the <strong>minimum required version</strong> for NVIDIA driver 590+. Then it enforces the version requirement. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables fabric manager for NVLink/NVswitches, if you're on a Multi‑GPU topologies like A100/H100 DGX or multi‑GPU servers.</p>
<pre><code class="language-shellscript">log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=\((dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print \)2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=\((echo "\){DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=\((echo "\){DCGM_VER}" | cut -d. -f2)
if [[ "\({DCGM_MAJOR}" -lt 4 ]] || { [[ "\){DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
</code></pre>
<h3 id="heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</h3>
<p>The NVIDIA driver normally unloads itself when the GPU is idle. When a new workload starts, the driver must reload, reinitialize the GPU, and set up memory mappings. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.</p>
<p>Enabling nvidia‑persistenced keeps the NVIDIA driver loaded in memory even when no GPU workloads are running.</p>
<pre><code class="language-shellscript">log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced
</code></pre>
<h3 id="heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</h3>
<p>This block applies a set of <strong>system‑level performance and stability tunings</strong> that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.</p>
<p>Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.</p>
<ul>
<li><p>Swap and memory behavior: Disabling swap and setting <code>vm.swappiness=0</code> prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.</p>
</li>
<li><p>Hugepages for large memory allocations: Setting <code>vm.nr_hugepages=2048</code> allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations.</p>
<p>CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.</p>
</li>
<li><p>CPU frequency governor: Installing <code>cpupower</code> and forcing the CPU governor to <code>performance</code> ensures the CPU stays at maximum frequency instead of scaling down.</p>
<p>GPU workloads often become CPU‑bound during Data preprocessing, Kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.</p>
</li>
<li><p>NUMA and topology tools: Installing <code>numactl</code>, <code>libnuma-dev</code>, and <code>hwloc</code> provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.</p>
</li>
<li><p>Disabling irqbalance: Stopping and disabling <code>irqbalance</code> it lets the NVIDIA driver manage interrupt affinity. For GPU servers, irqbalance can incorrectly move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.</p>
</li>
</ul>
<pre><code class="language-shell">log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
</code></pre>
<p>Full base.sh script here:</p>
<pre><code class="language-shell">#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" &gt;&amp;2; exit 1; }

###############################################################
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] &amp;&amp; error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] &amp;&amp; error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=\((. /etc/os-release &amp;&amp; echo "\){ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/\({DISTRO}/\){ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=\((dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print \)2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=\((echo "\){DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=\((echo "\){DCGM_VER}" | cut -d. -f2)
if [[ "\({DCGM_MAJOR}" -lt 4 ]] || { [[ "\){DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"
</code></pre>
<h2 id="heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</h2>
<p>Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.</p>
<pre><code class="language-shellscript">packer validate -var-file=values.pkrvars.hcl .
</code></pre>
<p>If validation succeeds, you’ll see a short confirmation like <code>The configuration is valid.</code>. After that, start the build. You should expect the process to create a temporary VM, run your provisioners, and produce an image:</p>
<pre><code class="language-plaintext">packer build -var-file=values.pkrvars.hcl .
</code></pre>
<p>The build typically takes <strong>15–20 minutes,</strong> depending on network speed and package installs. Watch the Packer log for three key checkpoints:</p>
<ul>
<li><p><strong>Instance creation</strong> — confirms the temporary VM was provisioned.</p>
</li>
<li><p><strong>Provisioner output</strong> — shows each script step (updates, reboot, <code>script/base.sh</code>) and any errors.</p>
</li>
<li><p><strong>Image creation</strong> — indicates the build finished and an image artifact was written.</p>
</li>
</ul>
<p>If the build fails, copy the failing provisioner’s log lines and re-run the build after fixing the script or variables. For quick troubleshooting, re-run the failing provisioner locally on a matching test VM to iterate faster.</p>
<pre><code class="language-plaintext">googlecompute.gpu-node: output will be in this color.

==&gt; googlecompute.gpu-node: Checking image does not exist...
==&gt; googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==&gt; googlecompute.gpu-node: no persistent disk to create
==&gt; googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==&gt; googlecompute.gpu-node: Creating instance...
==&gt; googlecompute.gpu-node: Loading zone: us-central1-a
==&gt; googlecompute.gpu-node: Loading machine type: g2-standard-4
==&gt; googlecompute.gpu-node: Requesting instance creation...
==&gt; googlecompute.gpu-node: Waiting for creation operation to complete...
==&gt; googlecompute.gpu-node: Instance has been created!
==&gt; googlecompute.gpu-node: Waiting for the instance to become running...
==&gt; googlecompute.gpu-node: IP: 34.58.58.214
==&gt; googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==&gt; googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==&gt; googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: No containers need to be restarted.
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: User sessions running outdated binaries:
==&gt; googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==&gt; googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==&gt; googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==&gt; googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==&gt; googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==&gt; googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==&gt; googlecompute.gpu-node: [BASE] Updating system packages...
==&gt; googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==&gt; googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==&gt; googlecompute.gpu-node: [BASE] Installing DCGM...
==&gt; googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==&gt; googlecompute.gpu-node: [BASE] Applying system tuning...
==&gt; googlecompute.gpu-node: vm.swappiness=0
==&gt; googlecompute.gpu-node: vm.nr_hugepages=2048
==&gt; googlecompute.gpu-node: Setting cpu: 0
==&gt; googlecompute.gpu-node: Error setting new values. Common errors:
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==&gt; googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==&gt; googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==&gt; googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==&gt; googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: Deleting instance...
==&gt; googlecompute.gpu-node: Instance has been deleted!
==&gt; googlecompute.gpu-node: Creating image...
==&gt; googlecompute.gpu-node: Deleting disk...
==&gt; googlecompute.gpu-node: Disk has been deleted!
==&gt; googlecompute.gpu-node: Running post-processor:  (type shell-local)
==&gt; googlecompute.gpu-node (shell-local): Running local shell script: 
==&gt; googlecompute.gpu-node (shell-local): === Image Build Complete ===
==&gt; googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==&gt; googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==&gt; Wait completed after 17 minutes 55 seconds

==&gt; Builds finished. The artifacts of successful builds are:
--&gt; googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134
</code></pre>
<h3 id="heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</h3>
<p>Confirm the image exists in the GCP Console: <strong>Compute → Storage → Images</strong> and locate your newly created OS image.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/90f304eb-3fe7-4304-b2ad-d86701dde607.png" alt="Your Image information on GCP" style="display:block;margin:0 auto" width="1686" height="692" loading="lazy">

<p>Create a test VM from the image:</p>
<pre><code class="language-plaintext">gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING
</code></pre>
<p>Once the instance is <code>RUNNING</code>, verify the NVIDIA driver and GPU are visible:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/364df8fc-7584-40df-8ab7-b3fe349d5065.png" alt="Output from the Nvidia-SMI command showing Driver and CUDA Version" style="display:block;margin:0 auto" width="1508" height="630" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/0912c303-3bb0-47fa-aa34-1c91ff26874f.png" alt="Image verifying the persistence mode is enabled" style="display:block;margin:0 auto" width="1508" height="80" loading="lazy">

<p><strong>The</strong> <code>nvidia-smi</code> <strong>output confirms:</strong></p>
<ul>
<li><p>Driver 590.48.01 loaded</p>
</li>
<li><p>CUDA 13.1 available</p>
</li>
<li><p>Persistence Mode is <code>On</code></p>
</li>
<li><p>The L4 GPU is detected with 23GB VRAM</p>
</li>
<li><p>Zero ECC errors</p>
</li>
<li><p>No running processes (clean idle state).</p>
</li>
</ul>
<p>This is exactly what a healthy base image should look like. Notice <code>Disp.A: Off</code>? That confirms our compute-only driver choice is working — no display adapter is active.</p>
<p>Confirm the installed CUDA toolkit by running. <code>nvcc --version</code>. You can see that version 13.1 was installed as specified.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/cc744624-9408-4348-88d7-61da04b5e1d0.png" alt="Output from the NVCC -Version command" style="display:block;margin:0 auto" width="1508" height="202" loading="lazy">

<p>Let's confirm DCGM installation by running <code>dcgmi discovery -l</code>. Successful output indicates DCGM is running and communicating with the driver.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/114996c6-1f28-43d4-a3fa-13aa7ccd2c82.png" alt="Output from the DCGMI dicovery -l command showing device information" style="display:block;margin:0 auto" width="1508" height="714" loading="lazy">

<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.</p>
<p>From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.</p>
<p>The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.</p>
<h2 id="heading-references"><strong>References</strong></h2>
<ul>
<li><p>NVIDIA Driver Installation Guide (Ubuntu): <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/</a></p>
</li>
<li><p>NVIDIA CUDA Toolkit Documentation: <a href="https://docs.nvidia.com/cuda/">https://docs.nvidia.com/cuda/</a></p>
</li>
<li><p>NVIDIA Container Toolkit Installation Guide: <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html</a></p>
</li>
<li><p>NVIDIA DCGM Documentation: <a href="https://docs.nvidia.com/datacenter/dcgm/latest/index.html">https://docs.nvidia.com/datacenter/dcgm/latest/index.html</a></p>
</li>
<li><p>NVIDIA Persistence Daemon: <a href="https://docs.nvidia.com/deploy/driver-persistence/index.html">https://docs.nvidia.com/deploy/driver-persistence/index.html</a></p>
</li>
<li><p>HashiCorp Packer Documentation: <a href="https://developer.hashicorp.com/packer/docs">https://developer.hashicorp.com/packer/docs</a></p>
</li>
<li><p>Packer Google Compute Builder: <a href="https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute">https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Fix a Failing GitHub PR: Debugging CI, Lint Errors, and Build Errors Step by Step ]]>
                </title>
                <description>
                    <![CDATA[ While many guides explain how to set up Continuous Integration pipelines, not very many show you how to debug them when things go wrong across multiple layers. This is a common experience when contrib ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-fix-failing-github-pr-ci-lint-build-errors/</link>
                <guid isPermaLink="false">69e9033dbca83cce6c5f0209</guid>
                
                    <category>
                        <![CDATA[ GitHub ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                    <category>
                        <![CDATA[ debugging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ markdown ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ qacheampong ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 17:19:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/29733bad-98af-4d6e-9fb6-93d55e8f87fd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>While many guides explain how to set up Continuous Integration pipelines, not very many show you how to debug them when things go wrong across multiple layers.</p>
<p>This is a common experience when contributing to open source: you make a small change, open a pull request, and suddenly everything fails.</p>
<p>Not just one check, but multiple:</p>
<ul>
<li><p>Lint errors</p>
</li>
<li><p>YAML validation issues</p>
</li>
<li><p>Build failures</p>
</li>
<li><p>Deployment failures</p>
</li>
</ul>
<p>Even more confusing, you may see errors in parts of the codebase you didn’t modify.</p>
<p>In this article, you'll learn how to debug these issues step by step. The goal is not just to fix one pull request, but to understand how CI systems validate your changes.</p>
<p>This guide is based on a real debugging experience from contributing to an open source documentation project.</p>
<p>While this example comes from a documentation project, the debugging workflow applies to many repositories that use CI pipelines, linting tools, and automated builds.</p>
<h3 id="heading-table-of-contents">Table of Contents:</h3>
<ul>
<li><p><a href="#heading-understanding-the-ci-pipeline-whats-actually-happening">Understanding the CI Pipeline (What’s Actually Happening)</a></p>
</li>
<li><p><a href="#heading-how-a-ci-pipeline-processes-your-pull-request">How a CI Pipeline Processes Your Pull Request</a></p>
</li>
<li><p><a href="#heading-a-practical-debugging-workflow">A Practical Debugging Workflow</a></p>
<ul>
<li><p><a href="#heading-step-1-fix-authentication-and-permission-issues">Step 1: Fix Authentication and Permission Issues</a></p>
</li>
<li><p><a href="#heading-step-2-run-lint-checks-locally">Step 2: Run Lint Checks Locally</a></p>
</li>
<li><p><a href="#heading-step-3-fix-common-markdown-lint-errors">Step 3: Fix Common Markdown Lint Errors</a></p>
</li>
<li><p><a href="#heading-step-4-fix-yaml-inside-markdown-code-blocks">Step 4: Fix YAML Inside Markdown Code Blocks</a></p>
</li>
<li><p><a href="#heading-step-5-fix-build-errors-after-lint-passes">Step 5: Fix Build Errors After Lint Passes</a></p>
</li>
<li><p><a href="#heading-step-6-debug-cascading-ci-failures">Step 6: Debug Cascading CI Failures</a></p>
</li>
<li><p><a href="#heading-step-7-handle-git-issues-during-ci-debugging">Step 7: Handle Git Issues During CI Debugging</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-key-takeaways">Key Takeaways</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<p>To follow this guide, you should have:</p>
<ul>
<li><p>Basic familiarity with Git and pull requests</p>
</li>
<li><p>A GitHub account</p>
</li>
<li><p>Some exposure to CI/CD concepts (helpful but not required)</p>
</li>
</ul>
<h2 id="heading-understanding-the-ci-pipeline-whats-actually-happening"><strong>Understanding the CI Pipeline (What’s Actually Happening)</strong></h2>
<p>In many projects, you will see the term CI/CD, which stands for Continuous Integration and Continuous Deployment (or Delivery).</p>
<p>In this guide, we'll focus specifically on the CI part – that is, Continuous Integration. This refers to the automated checks that run when you push code or open a pull request. These checks validate your changes before they're merged into the main codebase.</p>
<p>CD (Continuous Deployment/Delivery), on the other hand, typically handles what happens after those checks pass, such as deploying the application.</p>
<p>Understanding this distinction is important because most of the issues we debug in this guide happen during the CI stage.</p>
<p>Most repositories run multiple automated checks when you open a pull request:</p>
<ul>
<li><p><strong>Linting tools</strong> (for example, markdownlint, yamllint) enforce formatting rules</p>
</li>
<li><p><strong>Build systems</strong> (for example, mdBook) validate structure and generate output</p>
</li>
<li><p><strong>Deployment checks</strong> (for example, Netlify) ensure that the site can be built and served</p>
</li>
<li><p><strong>Merge controllers</strong> (for example, Tide) enforce approval policies</p>
</li>
</ul>
<p>A key point to remember: CI systems validate the <strong>entire set of files in your commit,</strong> not just the lines you changed.</p>
<h2 id="heading-how-a-ci-pipeline-processes-your-pull-request"><strong>How a CI Pipeline Processes Your Pull Request</strong></h2>
<p>When you push code or open a pull request, the CI pipeline runs several checks in sequence.</p>
<p>Let’s visualize how these checks are connected in a typical CI pipeline.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69d09527e466e2b762fdff59/9cecca6e-e000-46e3-a40e-cb353fc89ff8.png" alt="A CI pipeline diagram showing lint, build, and deployment steps with failure loops returning to code fixes." style="display:block;margin:0 auto" width="2348" height="1516" loading="lazy">

<p>Figure: A simplified CI pipeline showing how linting, build, and deployment checks are executed sequentially.</p>
<p>The above diagram shows a sequential CI pipeline with feedback loops, where failures at any stage return you to fix the issue before continuing.</p>
<p>Let’s break down what this diagram shows:</p>
<ol>
<li><p>You start by pushing code or opening a pull request.</p>
</li>
<li><p>The CI pipeline begins running automated checks.</p>
</li>
<li><p>The first set of checks typically includes linting tools like markdownlint or yamllint.</p>
<ul>
<li><p>If linting fails, the pipeline stops, and you must fix formatting issues before continuing.</p>
</li>
<li><p>If linting passes, the pipeline moves to the build step (for example, mdBook in documentation projects).</p>
</li>
<li><p>If the build fails, it usually means there is a structural issue, such as duplicate entries or invalid references.</p>
</li>
</ul>
</li>
<li><p>After a successful build, deployment checks (such as Netlify previews) run.</p>
<ul>
<li>If deployment fails, the issue is often related to configuration or build output.</li>
</ul>
</li>
<li><p>If all steps pass, the pull request becomes ready for review.</p>
</li>
</ol>
<h2 id="heading-a-practical-debugging-workflow"><strong>A Practical Debugging Workflow</strong></h2>
<h3 id="heading-step-1-fix-authentication-and-permission-issues">Step 1: Fix Authentication and Permission Issues</h3>
<p>Before CI runs, your push can fail due to authentication errors.</p>
<p>Example error:</p>
<pre><code class="language-shell">refusing to allow a Personal Access Token to create or update workflow
</code></pre>
<p>This happens because GitHub requires special permissions when your commit includes files under:</p>
<pre><code class="language-shell">.github/workflows/
</code></pre>
<p>The solution is to regenerate your Personal Access Token (PAT) with:</p>
<ul>
<li><p><code>repo</code> access</p>
</li>
<li><p><code>workflow</code> permission</p>
</li>
</ul>
<h3 id="heading-step-2-run-lint-checks-locally">Step 2: Run Lint Checks Locally</h3>
<p>Relying only on CI feedback slows you down because you have to push changes and wait for the pipeline to run before seeing errors.</p>
<p>Running checks locally allows you to catch issues immediately before pushing your code.</p>
<p>In practice, you should do both:</p>
<ul>
<li><p>Run checks locally to catch errors early and reduce iteration time</p>
</li>
<li><p>Use CI as the final validation to ensure your changes meet the repository’s standards</p>
</li>
</ul>
<p>Think of local checks as your first line of defense, and CI as the final gate before your code is accepted.</p>
<p>Here's an example (Markdown linting):</p>
<pre><code class="language-shell">npm install -g markdownlint-cli2
markdownlint-cli2 docs/**/*.md
</code></pre>
<h3 id="heading-step-3-fix-common-markdown-lint-errors">Step 3: Fix Common Markdown Lint Errors</h3>
<p>Here are some common issues you may encounter:</p>
<h4 id="heading-1-non-descriptive-links">1. Non-descriptive links</h4>
<p>Non-descriptive links like "here" don't give readers any context about where the link leads. This makes documentation harder to understand and less accessible, especially for users relying on screen readers.</p>
<p>Instead of writing:</p>
<pre><code class="language-shell">[here](https://example.com)
</code></pre>
<p>Use descriptive text like:</p>
<pre><code class="language-shell">[command help documentation](https://example.com)
</code></pre>
<h4 id="heading-2-line-length-violations">2. Line length violations</h4>
<p>Many projects enforce a maximum line length (often around 80 characters) to improve readability across different devices and editors.</p>
<p>If a line is too long, you can split it into multiple lines without changing the meaning.</p>
<p>To do this, break the line at natural points such as spaces between words or after punctuation. Avoid breaking words or disrupting the sentence structure.<br>For example:</p>
<pre><code class="language-shell">This is a long sentence that should be split across multiple
lines to satisfy lint rules.
</code></pre>
<h4 id="heading-3-list-indentation-issues">3. List indentation issues</h4>
<p>List indentation errors occur when nested list items aren't aligned consistently. This can break formatting and cause linting errors.</p>
<p>To avoid this, just make sure you use consistent spacing (usually 2 spaces per level).</p>
<p>Example (incorrect):</p>
<pre><code class="language-shell">- Item 1
 - Subitem
</code></pre>
<p>Correct version:</p>
<pre><code class="language-shell">- Item 1
  - Subitem
</code></pre>
<h3 id="heading-step-4-fix-yaml-inside-markdown-code-blocks">Step 4: Fix YAML Inside Markdown Code Blocks</h3>
<p>YAML has strict formatting rules, including proper indentation, key-value structure, and consistent spacing.</p>
<p>Even when YAML appears inside a markdown code block, tools like yamllint still validate its structure.</p>
<p>Example (incorrect):</p>
<pre><code class="language-yaml">metadata:
annotations:
</code></pre>
<p>Correct version:</p>
<pre><code class="language-yaml">metadata:
  annotations:
    capi.metal3.io/unhealthy: "true"
</code></pre>
<p>In the incorrect example, <code>annotations</code> is not properly nested under <code>metadata</code>, and no key-value pair is defined.</p>
<p>In the corrected version:</p>
<ul>
<li><p><code>annotations</code> is properly indented under <code>metadata</code></p>
</li>
<li><p>a valid key-value pair is added (<code>capi.metal3.io/unhealthy: "true"</code>)</p>
</li>
</ul>
<p>This structure satisfies YAML’s requirement for proper hierarchy and formatting.</p>
<h3 id="heading-step-5-fix-build-errors-after-lint-passes">Step 5: Fix Build Errors After Lint Passes</h3>
<p>Passing lint checks doesn't guarantee that your build will succeed.</p>
<p>This is because linting focuses on syntax and formatting, while the build process validates the structure and integrity of the entire project.</p>
<p>Build failures often occur due to issues such as:</p>
<ul>
<li><p>Duplicate entries in navigation files</p>
</li>
<li><p>Missing or incorrectly referenced files</p>
</li>
<li><p>Invalid configuration settings</p>
</li>
</ul>
<p>Even if your syntax is correct, the build system ensures everything connects properly.</p>
<p>For example, in documentation projects using tools like mdBook, a duplicate entry in <code>SUMMARY.md</code> can cause the build to fail even when all files pass lint checks.</p>
<h3 id="heading-step-6-debug-cascading-ci-failures">Step 6: Debug Cascading CI Failures</h3>
<p>CI pipelines are layered. One failure can trigger multiple downstream failures.</p>
<p>For example, imagine a YAML indentation error:</p>
<pre><code class="language-shell">YAML error → build fails → deploy fails → multiple checks fail
</code></pre>
<p>To fix this:</p>
<ol>
<li><p>Identify the first failing step in the CI logs</p>
</li>
<li><p>Fix that issue</p>
</li>
<li><p>Re-run the pipeline</p>
</li>
</ol>
<p>In this example, the YAML indentation error is the root cause. Once you fix the YAML formatting, the lint check passes, which allows the build to proceed and the deployment step to succeed.</p>
<p>This is why it is important to always fix the first failure in the pipeline rather than trying to address all errors at once.</p>
<h3 id="heading-step-7-handle-git-issues-during-ci-debugging">Step 7: Handle Git Issues During CI Debugging</h3>
<p>When working with updated branches, you may encounter:</p>
<ul>
<li><p>Diverged branches</p>
</li>
<li><p>Rebase conflicts</p>
</li>
<li><p>Push rejections</p>
</li>
</ul>
<p>To resolve these issues, you typically need to update your branch using one of two approaches:</p>
<h4 id="heading-option-1-rebase-clean-history">Option 1: Rebase (clean history)</h4>
<pre><code class="language-shell">git pull --rebase
</code></pre>
<p>Rebasing rewrites your commit history so your changes appear on top of the latest version of the branch.</p>
<p>Use carefully:</p>
<ul>
<li><p>Only rebase your own branches</p>
</li>
<li><p>Avoid rebasing shared branches</p>
</li>
</ul>
<h4 id="heading-option-2-merge-safer">Option 2: Merge (safer)</h4>
<pre><code class="language-shell">git pull --no-rebase
</code></pre>
<p>Merging preserves the full commit history and is safer when working with others, but it may introduce additional merge commits.</p>
<h4 id="heading-pushing-your-changes-safely">Pushing your changes safely</h4>
<p>After updating your branch, you may need to push changes:</p>
<pre><code class="language-shell">git push --force-with-lease
</code></pre>
<p>Avoid using:</p>
<pre><code class="language-shell">git push --force
</code></pre>
<p>The <code>--force</code> option can overwrite the other contributors’ work. The <code>--force-with-lease</code> option is safer because it only pushes if the remote branch has not changed unexpectedly.</p>
<h2 id="heading-key-takeaways"><strong>Key Takeaways</strong></h2>
<ul>
<li><p>CI validates your entire commit, not just the specific lines you changed</p>
</li>
<li><p>Linting and build systems enforce different rules</p>
</li>
<li><p>YAML inside markdown must be structurally correct</p>
</li>
<li><p>Documentation builds can fail due to structural issues</p>
</li>
<li><p>Running checks locally significantly reduces debugging time</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Debugging a failing pull request isn't just about fixing syntax errors.</p>
<p>You also need to understand how different systems interact:</p>
<ul>
<li><p>Version control</p>
</li>
<li><p>CI pipelines</p>
</li>
<li><p>Linting tools</p>
</li>
<li><p>Build processes</p>
</li>
</ul>
<p>Once you understand how these systems work together, you can debug issues systematically instead of guessing.</p>
<p>The next time your pull request fails, you will know exactly where to start and how to fix it.</p>
<p>Debugging CI issues may feel overwhelming at first, but with a structured approach, you can turn failures into a clear path for improvement.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Local DevOps HomeLab with Docker, Kubernetes, and Ansible ]]>
                </title>
                <description>
                    <![CDATA[ The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS. I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $ ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-local-devops-homelab-with-docker-kubernetes-and-ansible/</link>
                <guid isPermaLink="false">69dd667c217f5dfcbd55b7b4</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Homelab ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops articles ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 21:56:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/1e970f8b-eb52-4582-9c98-13cbce867c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS.</p>
<p>I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $34 bill for a machine running nothing.</p>
<p>That was the last time I practiced on someone else's infrastructure.</p>
<p>Everything in this guide runs on your laptop. No cloud account, no credit card, no bill at the end of the month. By the end, you'll be able to spin up a multi-server environment from scratch, configure it automatically with Ansible, serve a site you wrote yourself, and diagnose what breaks when you intentionally destroy it.</p>
<p>That last part is where the actual learning happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>A laptop with at least 8GB of RAM (16GB is better)</p>
</li>
<li><p>At least 20GB of free disk space</p>
</li>
<li><p>Windows, macOS, or Linux operating system</p>
</li>
<li><p>Administrator access to your computer</p>
</li>
<li><p>Virtualization enabled in your BIOS/UEFI settings</p>
</li>
<li><p>A stable internet connection for the initial downloads</p>
</li>
</ul>
<p>Knowledge and comfort level:</p>
<ul>
<li><p>You should be comfortable using a terminal (running commands, changing directories, and editing small text files with whatever editor you like).</p>
</li>
<li><p>Basic familiarity with concepts like “a server,” “SSH,” and “a port” helps, but you don't need prior experience with Docker, Kubernetes, Vagrant, or Ansible. This guide introduces them as you go.</p>
</li>
</ul>
<p>If you can follow step-by-step instructions and read error output without panicking, you're ready.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-is-devops">What is DevOps?</a></p>
</li>
<li><p><a href="#heading-why-build-a-local-lab">Why Build a Local Lab?</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-docker">How to Set Up Docker</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</a></p>
</li>
<li><p><a href="#heading-how-to-install-kubectl">How to Install kubectl</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-vagrant">How to Set Up Vagrant</a></p>
</li>
<li><p><a href="#heading-how-to-install-ansible">How to Install Ansible</a></p>
</li>
<li><p><a href="#heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</a></p>
</li>
<li><p><a href="#heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</a></p>
</li>
<li><p><a href="#heading-what-you-can-now-do">What You Can Now Do</a></p>
</li>
</ol>
<h2 id="heading-what-is-devops">What is DevOps?</h2>
<p>DevOps is the practice of breaking down the wall between software development and IT operations teams.</p>
<p>Traditionally, developers write code and hand it off to operations teams to deploy and maintain. That handoff causes delays, misunderstandings, and outages. DevOps is what happens when both teams work together from the start.</p>
<p>The tools you'll install in this guide each solve a specific part of that process:</p>
<ul>
<li><p><strong>Docker</strong> packages your application and everything it needs into a portable container that runs the same way on any machine.</p>
</li>
<li><p><strong>Kubernetes</strong> manages multiple containers at scale, handling restarts, networking, and load balancing automatically.</p>
</li>
<li><p><strong>Vagrant</strong> creates and manages virtual machine environments so your whole team always works on identical setups.</p>
</li>
<li><p><strong>Ansible</strong> automates repetitive configuration tasks across many servers without writing a script for each one.</p>
</li>
</ul>
<h2 id="heading-why-build-a-local-lab">Why Build a Local Lab?</h2>
<p>A local lab gives you a safe place to break things, fix them, and learn from that process without any cost or risk.</p>
<p>Here's what you get with a local setup:</p>
<ul>
<li><p><strong>Zero cost.</strong> No cloud bills, no surprise charges, and no credit card required.</p>
</li>
<li><p><strong>Works offline.</strong> Practice anywhere, even without internet after the initial setup.</p>
</li>
<li><p><strong>Full control.</strong> You manage every layer from the OS up to the application.</p>
</li>
<li><p><strong>Safe experimentation.</strong> Break things freely. Nothing here affects production.</p>
</li>
<li><p><strong>Fast feedback.</strong> No waiting for cloud resources to spin up. Everything runs on your machine.</p>
</li>
</ul>
<p>The tradeoff is resource limits. Your laptop's CPU and RAM are the ceiling. You can't simulate large-scale deployments, and some cloud-native services like AWS Lambda or S3 have no direct local equivalent. But for learning core DevOps workflows, none of that matters.</p>
<h2 id="heading-how-to-set-up-docker">How to Set Up Docker</h2>
<p>Docker is the foundation of this lab. Every other tool in this guide either runs inside Docker containers or works alongside them.</p>
<h3 id="heading-how-to-install-docker-on-windows">How to Install Docker on Windows</h3>
<p>First, enable virtualization in your BIOS:</p>
<ol>
<li><p>Restart your computer and enter BIOS/UEFI setup. The key is usually F2, F10, Del, or Esc during boot.</p>
</li>
<li><p>Find the virtualization setting. It's usually listed as Intel VT-x, AMD-V, SVM, or Virtualization Technology.</p>
</li>
<li><p>Enable it, save your changes, and exit.</p>
</li>
</ol>
<p>Then install Docker Desktop:</p>
<ol>
<li><p>Download Docker Desktop from <a href="https://www.docker.com/products/docker-desktop/">Docker's official website</a>.</p>
</li>
<li><p>Run the installer and follow the prompts.</p>
</li>
<li><p>Enable WSL 2 (Windows Subsystem for Linux) when asked.</p>
</li>
<li><p>Restart your computer.</p>
</li>
<li><p>Open Docker Desktop from the Start menu and wait for the whale icon in the taskbar to stop animating.</p>
</li>
</ol>
<p><strong>Troubleshooting:</strong> If Docker fails to start, run this in PowerShell as Administrator to verify virtualization is active:</p>
<pre><code class="language-powershell">systeminfo | findstr "Hyper-V Requirements"
</code></pre>
<p>All items should show "Yes". If they don't, revisit your BIOS settings.</p>
<h3 id="heading-how-to-install-docker-on-mac">How to Install Docker on Mac</h3>
<ol>
<li><p>Download Docker Desktop for Mac from <a href="https://www.docker.com/products/docker-desktop/">Docker's website</a>.</p>
</li>
<li><p>Open the downloaded <code>.dmg</code> file and drag Docker to your Applications folder.</p>
</li>
<li><p>Open Docker from Applications.</p>
</li>
<li><p>Enter your password when prompted.</p>
</li>
<li><p>Wait for the whale icon in the menu bar to stop animating.</p>
</li>
</ol>
<h3 id="heading-how-to-install-docker-on-linux">How to Install Docker on Linux</h3>
<p>Run these commands in order:</p>
<pre><code class="language-bash"># Update your package lists
sudo apt-get update

# Install prerequisites
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

# Add the Docker repository
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# Update and install Docker
sudo apt-get update
sudo apt-get install docker-ce

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add your user to the docker group
sudo usermod -aG docker $USER
</code></pre>
<p>Log out and back in for the group change to take effect.</p>
<h3 id="heading-how-to-test-docker">How to Test Docker</h3>
<p>Run this command:</p>
<pre><code class="language-bash">docker run hello-world
</code></pre>
<p>If you see "Hello from Docker!" then Docker is working correctly.</p>
<p>Docker is set up. Next, you'll install Kubernetes to manage containers at scale.</p>
<h2 id="heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</h2>
<p>Kubernetes manages containers at scale. For a local lab, you have four options. Here's how to choose:</p>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Best for</th>
<th>RAM needed</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Minikube</strong></td>
<td>Beginners. Easiest setup, built-in dashboard</td>
<td>2GB+</td>
</tr>
<tr>
<td><strong>Kind</strong></td>
<td>Faster startup, works well inside CI pipelines</td>
<td>1GB+</td>
</tr>
<tr>
<td><strong>k3s</strong></td>
<td>Low-resource machines. Lightweight but production-like</td>
<td>512MB+</td>
</tr>
<tr>
<td><strong>kubeadm</strong></td>
<td>Learning how clusters are actually bootstrapped in production</td>
<td>2GB+ per node</td>
</tr>
</tbody></table>
<p>If you're just starting out, use Minikube. It has the simplest setup and a visual dashboard that helps you understand what's happening inside the cluster.</p>
<p>If your laptop has 8GB RAM or less, use k3s. It runs lean and behaves closer to a real cluster than Minikube does.</p>
<p>Use kubeadm only if you want to understand how Kubernetes nodes join a cluster — it requires more manual steps and isn't beginner-friendly.</p>
<h3 id="heading-how-to-install-minikube-recommended-for-beginners">How to Install Minikube (Recommended for Beginners)</h3>
<p>Minikube creates a single-node Kubernetes cluster on your laptop.</p>
<p>On Windows:</p>
<ol>
<li><p>Download the Minikube installer from <a href="https://github.com/kubernetes/minikube/releases">Minikube's GitHub releases page</a>.</p>
</li>
<li><p>Run the <code>.exe</code> installer.</p>
</li>
<li><p>Open Command Prompt as Administrator and start Minikube:</p>
</li>
</ol>
<pre><code class="language-cmd">minikube start --driver=docker
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install minikube
minikube start --driver=docker
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube-linux-amd64
sudo mv minikube-linux-amd64 /usr/local/bin/minikube
minikube start --driver=docker
</code></pre>
<p>Test your cluster:</p>
<pre><code class="language-bash">minikube status
minikube dashboard
</code></pre>
<h3 id="heading-how-to-install-k3s-recommended-for-low-ram-machines">How to Install k3s (Recommended for Low-RAM Machines)</h3>
<p>k3s is a lightweight version of Kubernetes that installs in under a minute. It runs lean and behaves like a real cluster — not a simplified demo version.</p>
<p>On Linux (and Mac via Multipass):</p>
<pre><code class="language-bash">curl -sfL https://get.k3s.io | sh -
</code></pre>
<p>That single command installs k3s and runs it automatically in the background. Check that it is running:</p>
<pre><code class="language-bash">sudo k3s kubectl get nodes
</code></pre>
<p>You should see one node with status <code>Ready</code>.</p>
<p>On Mac directly — k3s doesn't run natively on macOS. Use <a href="https://multipass.run">Multipass</a> to spin up a lightweight Ubuntu VM first, then run the install command inside it.</p>
<p>On Windows — use WSL2 (Ubuntu), then run the install command inside your WSL2 terminal.</p>
<h3 id="heading-how-to-install-kind-kubernetes-in-docker">How to Install Kind (Kubernetes IN Docker)</h3>
<p>Kind runs a full Kubernetes cluster inside Docker containers. It starts faster than Minikube and is useful if you want to run multiple clusters simultaneously.</p>
<pre><code class="language-bash"># Mac or Linux
brew install kind

# Windows
choco install kind
</code></pre>
<p>Create a cluster:</p>
<pre><code class="language-bash">kind create cluster --name my-local-lab
</code></pre>
<h3 id="heading-how-to-install-kubeadm-for-understanding-cluster-bootstrap">How to Install kubeadm (For Understanding Cluster Bootstrap)</h3>
<p>kubeadm is the tool Kubernetes uses to initialize and join nodes in a real cluster. Use this when you want to understand what happens under the hood — not as your daily driver.</p>
<p>It requires at least two machines (or VMs). The setup is more involved than the options above. Follow the <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/">official kubeadm installation guide</a> for your OS, then initialize your cluster:</p>
<pre><code class="language-bash">sudo kubeadm init --pod-network-cidr=10.244.0.0/16
</code></pre>
<p>After init, join worker nodes using the command kubeadm prints at the end of the output.</p>
<h3 id="heading-how-to-install-kubectl">How to Install kubectl</h3>
<p>kubectl is the command-line tool you use to interact with any Kubernetes cluster.</p>
<p>On Windows:</p>
<p>Download <code>kubectl.exe</code> from <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/">Kubernetes' website</a> and place it in a directory that is in your PATH. Or install with Chocolatey:</p>
<pre><code class="language-cmd">choco install kubernetes-cli
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install kubectl
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl
</code></pre>
<p>Test it:</p>
<pre><code class="language-bash">kubectl get pods --all-namespaces
</code></pre>
<p>On a fresh cluster, you'll see system pods running in the <code>kube-system</code> namespace — things like <code>coredns</code> and <code>storage-provisioner</code>. That's the expected output. It means your cluster is up and kubectl can talk to it.</p>
<p>Kubernetes is running. Next is Vagrant. But before that, there's one important distinction worth making.</p>
<h4 id="heading-docker-vs-vagrant-they-arent-the-same-thing">Docker vs Vagrant — they aren't the same thing</h4>
<p>Docker creates containers: lightweight processes that share your operating system's kernel. Vagrant creates full virtual machines: isolated computers with their own OS running inside your laptop.</p>
<p>Containers are fast and small. VMs are heavier but behave exactly like real servers. You'll use both in this lab for different reasons.</p>
<h2 id="heading-how-to-set-up-vagrant">How to Set Up Vagrant</h2>
<p>Vagrant lets you create and manage reproducible virtual machine environments. It is ideal for simulating multi-server setups on a single laptop.</p>
<h3 id="heading-how-to-install-vagrant-on-windows">How to Install Vagrant on Windows</h3>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a> with default options.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
<li><p>Restart your computer if prompted.</p>
</li>
</ol>
<p><strong>Note:</strong> VirtualBox and Hyper-V can't run at the same time on Windows. Check if Hyper-V is active:</p>
<pre><code class="language-cmd">systeminfo | findstr "Hyper-V"
</code></pre>
<p>If it's enabled, you have two options: switch to the Hyper-V Vagrant provider, or disable Hyper-V with:</p>
<pre><code class="language-powershell">Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All
</code></pre>
<p>Restart after disabling.</p>
<h3 id="heading-how-to-install-vagrant-on-mac-and-linux">How to Install Vagrant on Mac and Linux</h3>
<p>On Mac:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>After installation, open <strong>System Preferences &gt; Security &amp; Privacy &gt; General</strong>. You will see a message saying system software from Oracle was blocked. Click <strong>Allow</strong> and restart your Mac. Without this step, VirtualBox will not run.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p><strong>Note for Apple Silicon (M1/M2/M3) Macs:</strong> VirtualBox support on Apple Silicon is still limited. If you're on an M-series Mac, use <a href="https://mac.getutm.app/">UTM</a> as your VM provider instead, or use Multipass which works natively on Apple Silicon.</p>
<p>On Linux:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p>Verify both are installed:</p>
<pre><code class="language-bash">vboxmanage --version
vagrant --version
</code></pre>
<h3 id="heading-how-to-create-your-first-vagrant-environment">How to Create Your First Vagrant Environment</h3>
<p>Create a new directory for your project. Inside it, create a file named <code>Vagrantfile</code> with this content:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  # Create a private network between VMs
  config.vm.network "private_network", type: "dhcp"

  # Forward port 8080 on your laptop to port 80 on the VM
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Install Nginx when the VM starts
  config.vm.provision "shell", inline: &lt;&lt;-SHELL
    apt-get update
    apt-get install -y nginx
    echo "Hello from Vagrant!" &gt; /var/www/html/index.html
  SHELL
end
</code></pre>
<p>Start the VM:</p>
<pre><code class="language-bash">vagrant up
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/342f11ad-7c7d-40d2-a810-113b8c71edac.png" alt="screnshot showing VB server and terminal installation processes" style="display:block;margin:0 auto" width="1848" height="323" loading="lazy">

<p>Visit <code>http://localhost:8080</code> in your browser. You should see "Hello from Vagrant!"</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/bcd66a76-4a5b-4f26-bb7e-e203672968d8.png" alt="screenshot showing &quot;Hello from Vagrant!&quot; in browser" style="display:block;margin:0 auto" width="643" height="483" loading="lazy">

<h4 id="heading-troubleshooting-ssh-on-windows">Troubleshooting SSH on Windows</h4>
<p>If <code>vagrant ssh</code> fails, try:</p>
<pre><code class="language-bash">vagrant ssh -- -v
</code></pre>
<p>Or connect manually:</p>
<pre><code class="language-bash">ssh -i .vagrant/machines/default/virtualbox/private_key vagrant@127.0.0.1 -p 2222
</code></pre>
<h3 id="heading-how-to-create-a-local-vagrant-box-without-internet">How to Create a Local Vagrant Box Without Internet</h3>
<p><strong>Note:</strong> Most readers can skip this. Only do this if you want to work fully offline after the initial setup.</p>
<ol>
<li><p>Download <a href="https://ubuntu.com/download/server">Ubuntu 20.04 LTS</a> and save the <code>.iso</code> file locally.</p>
</li>
<li><p>Open VirtualBox and create a new VM: Name it <code>ubuntu-devops</code>, Type: Linux, Version: Ubuntu (64-bit).</p>
</li>
<li><p>Assign 2048MB RAM and a 20GB VDI disk.</p>
</li>
<li><p>Attach the <code>.iso</code> under Storage &gt; Optical Drive.</p>
</li>
<li><p>Start the VM and complete the Ubuntu installation.</p>
</li>
<li><p>Once installed, shut down the VM and run:</p>
</li>
</ol>
<pre><code class="language-bash">VBoxManage list vms
vagrant package --base "ubuntu-devops" --output ubuntu2004.box
vagrant box add ubuntu2004 ubuntu2004.box
</code></pre>
<p>You now have a reusable local box that works without internet.</p>
<p>You can spin up virtual machines. Next is Ansible, which automates what goes inside them.</p>
<h2 id="heading-how-to-install-ansible">How to Install Ansible</h2>
<p>Ansible automates configuration and software installation across multiple servers. Instead of SSH-ing into ten machines and running the same commands manually, you write a playbook once and Ansible handles the rest.</p>
<h3 id="heading-how-to-install-ansible-on-windows">How to Install Ansible on Windows</h3>
<p>Ansible doesn't run natively on Windows. You need to use it through WSL (Windows Subsystem for Linux).</p>
<ol>
<li>Open PowerShell as Administrator and enable WSL:</li>
</ol>
<pre><code class="language-powershell">dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
</code></pre>
<ol>
<li><p>Restart your computer.</p>
</li>
<li><p>Install Ubuntu from the Microsoft Store.</p>
</li>
<li><p>Open Ubuntu and install Ansible:</p>
</li>
</ol>
<pre><code class="language-bash">sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-mac">How to Install Ansible on Mac</h3>
<pre><code class="language-bash">brew install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-linux">How to Install Ansible on Linux</h3>
<pre><code class="language-bash"># Ubuntu/Debian
sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

# Red Hat/CentOS
sudo yum install ansible
</code></pre>
<h3 id="heading-how-to-test-ansible">How to Test Ansible</h3>
<p>Create a file called <code>hosts</code> in your current directory:</p>
<pre><code class="language-ini">[local]
localhost ansible_connection=local
</code></pre>
<p>Create a file called <code>playbook.yml</code> in the same directory:</p>
<pre><code class="language-yaml">---
- name: Test playbook
  hosts: local
  tasks:
    - name: Print a message
      debug:
        msg: "Ansible is working!"
</code></pre>
<p>Run the playbook, passing the local <code>hosts</code> file with <code>-i</code>:</p>
<pre><code class="language-bash">ansible-playbook -i hosts playbook.yml
</code></pre>
<p>You should see the message "Ansible is working!" in the output.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/081e6ff3-b983-42a0-960e-5340bbd24e3b.png" alt="screenshot showing ansible playbook complete terminal installation" style="display:block;margin:0 auto" width="849" height="287" loading="lazy">

<p>Alright, all your tools are installed. Now you'll use them together to build something real.</p>
<h2 id="heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</h2>
<p>You can find the entire code for this lab in this repo: <a href="https://github.com/Osomudeya/homelab-demo-article">https://github.com/Osomudeya/homelab-demo-article</a></p>
<p>Now you'll put these tools together in one project. Each tool will perform its actual job, and nothing is forced.</p>
<p><strong>Before you start,</strong> create a fresh directory for this project. Don't run it inside the directory you used to test Vagrant earlier, as the Vagrantfile here is different and will conflict.</p>
<p>You'll be building a two-VM environment: one machine serves a web page you write yourself inside a Docker container, and the other runs a MariaDB database. Vagrant creates the machines and Ansible configures them. The page you see at the end is yours.</p>
<h3 id="heading-step-1-create-the-project-directory">Step 1: Create the Project Directory</h3>
<pre><code class="language-bash">mkdir devops-lab-project &amp;&amp; cd devops-lab-project
</code></pre>
<h3 id="heading-step-2-write-your-site-content">Step 2: Write Your Site Content</h3>
<p>Create a file called <code>index.html</code> in the project directory. Write whatever you want on this page — it's what you'll see in your browser at the end:</p>
<pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;&lt;title&gt;My DevOps Lab&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;My DevOps Lab&lt;/h1&gt;
    &lt;p&gt;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&lt;/p&gt;
    &lt;p&gt;Built on a laptop. No cloud account needed.&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>Change the text to whatever you like. This is your page.</p>
<h3 id="heading-step-3-write-the-vagrantfile">Step 3: Write the Vagrantfile</h3>
<p>Create a file called <code>Vagrantfile</code> in the same directory:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.33.10"
    web.vm.network "forwarded_port", guest: 80, host: 8080
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.33.11"
  end
end
</code></pre>
<h3 id="heading-step-4-start-the-virtual-machines">Step 4: Start the Virtual Machines</h3>
<pre><code class="language-bash">vagrant up
</code></pre>
<p>The first run downloads the <code>ubuntu/focal64</code> box, which is around 500MB.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/264866b0-9977-490e-96a3-69b3070be589.png" alt="screenshot showing virtualbox installation processes in terminal" style="display:block;margin:0 auto" width="867" height="377" loading="lazy">

<p>Expect this to take 10–30 minutes depending on your connection. Subsequent runs will be much faster since the box is cached locally.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/118d2fb2-70f6-41e8-afb2-6f45fb895e98.png" alt="screenshot showing 2 virtualbox servers &quot;running&quot; in VB manager" style="display:block;margin:0 auto" width="926" height="396" loading="lazy">

<h3 id="heading-step-5-create-the-ansible-inventory">Step 5: Create the Ansible Inventory</h3>
<p>Create a file called <code>inventory</code> in the same directory:</p>
<pre><code class="language-ini">[webservers]
192.168.33.10 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

[dbservers]
192.168.33.11 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/db/virtualbox/private_key
</code></pre>
<p>Ansible uses the Vagrant-generated private keys so it can SSH in as the <code>vagrant</code> user. Host key checking for this lab is turned off in <code>ansible.cfg</code> (next step), not in the inventory.</p>
<h3 id="heading-step-6-create-the-ansible-config-file">Step 6: Create the Ansible Config File</h3>
<p>Before running the playbook, create a file called <code>ansible.cfg</code> in the same directory:</p>
<pre><code class="language-ini">[defaults]
inventory = inventory
host_key_checking = False
</code></pre>
<p>The inventory line tells Ansible to use the inventory file in this folder by default. host_key_checking = False tells Ansible not to verify SSH host keys when connecting to your Vagrant VMs. Without it, Ansible will fail with a Host key verification failed error on first connection because the VM's key is not yet in your known_hosts file.</p>
<p>These settings are for a local lab only. Do not use host_key_checking = False for production systems.</p>
<h3 id="heading-step-7-create-the-ansible-playbook">Step 7: Create the Ansible Playbook</h3>
<p>Create a file called <code>playbook.yml</code>:</p>
<pre><code class="language-yaml">---
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:

    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes

    - name: Start Docker service
      service:
        name: docker
        state: started
        enabled: yes

    # Create the directory that will hold your site content
    - name: Create web content directory
      file:
        path: /var/www/html
        state: directory
        mode: '0755'

    # This copies your index.html from your laptop into the VM
    - name: Copy site content to web server
      copy:
        src: index.html
        dest: /var/www/html/index.html

    # This mounts that file into the Nginx container so it serves your page
    # The -v flag connects /var/www/html on the VM to /usr/share/nginx/html inside the container
    - name: Run Nginx serving your content
      shell: |
        docker rm -f webapp 2&gt;/dev/null || true
        docker run -d --name webapp --restart always -p 80:80 \
          -v /var/www/html:/usr/share/nginx/html:ro nginx

- name: Configure database server
  hosts: dbservers
  become: yes
  tasks:

    # Hash sum mismatch on .deb downloads is often stale lists, a flaky mirror, or apt pipelining
    # behind NAT; fresh indices + Pipeline-Depth 0 usually fixes it on lab VMs.
    - name: Disable apt HTTP pipelining (mirror/proxy hash mismatch workaround)
      copy:
        dest: /etc/apt/apt.conf.d/99disable-pipelining
        content: 'Acquire::http::Pipeline-Depth "0";'
        mode: "0644"

    - name: Clear apt package index cache
      shell: apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/* /var/lib/apt/lists/auxfiles/*
      changed_when: true

    - name: Update apt cache after reset
      apt:
        update_cache: yes

    - name: Install MariaDB
      apt:
        name: mariadb-server
        state: present
        update_cache: no

    - name: Start MariaDB service
      service:
        name: mariadb
        state: started
        enabled: yes
</code></pre>
<p>Two lines worth paying attention to:</p>
<ul>
<li><p><code>src: index.html</code> — Ansible looks for this file in the same directory as the playbook. That is the file you wrote in Step 2.</p>
</li>
<li><p><code>-v /var/www/html:/usr/share/nginx/html:ro</code> — this mounts the directory from the VM into the Nginx container. The <code>:ro</code> means read-only. Nginx serves whatever is in that folder.</p>
</li>
</ul>
<h3 id="heading-step-8-run-the-playbook">Step 8: Run the Playbook</h3>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>You'll see task-by-task output as Ansible connects to each VM over SSH and configures it. A green <code>ok</code> or yellow <code>changed</code> next to each task means it worked. Red <code>fatal</code> means something failed.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/91241b41-981c-4e23-9dc4-8531e551c39e.png" alt="terminal screenshot of A green ok or yellow changed next to each task means it worked. Red fatal means something failed." style="display:block;margin:0 auto" width="875" height="267" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c02db252-8aff-42e5-b937-d812d070a75b.png" alt="terminal screenshot of playbook run completion" style="display:block;margin:0 auto" width="867" height="425" loading="lazy">

<h3 id="heading-step-9-verify-the-setup">Step 9: Verify the Setup</h3>
<p>Open <code>http://localhost:8080</code> in your browser. You should see the page you wrote in Step 2 served from inside a Docker container, running on a Vagrant VM, configured automatically by Ansible.</p>
<p>If you see the page, every tool in this lab is working together.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0d3d897b-3f51-46fb-b548-832cc5ec3272.png" alt="Browser showing localhost:8082 with the heading &quot;My DevOps Lab&quot; and the text &quot;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&quot;" style="display:block;margin:0 auto" width="746" height="418" loading="lazy">

<h3 id="heading-step-9-clean-up-optional">Step 9: Clean Up (Optional)</h3>
<p>When you're done:</p>
<pre><code class="language-bash">vagrant destroy -f
</code></pre>
<p>This shuts down and deletes both VMs. Your <code>Vagrantfile</code>, <code>inventory</code>, <code>playbook.yml</code>, and <code>index.html</code> stay on disk — run <code>vagrant up</code> followed by <code>ansible-playbook -i inventory playbook.yml</code> any time to bring it all back.</p>
<p>Now that you have a working lab, let's use it properly.</p>
<h2 id="heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</h2>
<p>Following these steps has gotten you a running lab. Breaking things teaches you how everything actually works.</p>
<p>Here are five things to break and what to look for when you do.</p>
<h3 id="heading-break-1-crash-the-main-process-inside-the-container-and-watch-it-come-back">Break 1: Crash the Main Process Inside the Container (and Watch It Come Back)</h3>
<p>Doing this just proves that something inside the container can die (like a real bug or OOM), Docker can restart the container because of <code>--restart always</code>, and your site can come back without re-running Ansible.</p>
<p>After <code>vagrant ssh web</code>, every <code>docker</code> command below runs <strong>on the web VM</strong>. So keep your browser on your laptop at <a href="http://localhost:8080"><code>http://localhost:8080</code></a> (Vagrant forwards your host port to the VM’s port 80).</p>
<h4 id="heading-troubleshooting-if-your-lab-isnt-ready">Troubleshooting: If Your Lab Isn't Ready</h4>
<p>From your project folder on the host (your laptop) – unless the step says to run it on the VM:</p>
<ul>
<li><p>You ran <code>vagrant destroy -f</code>. Run <code>vagrant up</code>, then <code>ansible-playbook -i inventory playbook.yml</code>.</p>
</li>
<li><p><code>docker ps</code> shows <code>webapp</code> but status is Exited. On the web VM, run <code>sudo docker start webapp</code>, then <code>sudo docker ps</code> again.</p>
</li>
<li><p>There's no <code>webapp</code> row in <code>docker ps -a</code><strong>.</strong> Re-run <code>ansible-playbook -i inventory playbook.yml</code> on the host.</p>
</li>
</ul>
<p>If the playbook is already applied and <code>webapp</code> is Up, skip this section and start at step 1 under Steps (happy path) below. (Don't skip SSH or <code>docker ps</code>. You need the VM shell and a quick check before you run <code>docker exec</code>.)</p>
<h4 id="heading-steps-happy-path">Steps (happy path)</h4>
<ol>
<li>SSH into the web VM:</li>
</ol>
<pre><code class="language-plaintext">vagrant ssh web
</code></pre>
<ol>
<li><p>Confirm <code>webapp</code> is <strong>Up</strong>:</p>
<pre><code class="language-plaintext">sudo docker ps
</code></pre>
</li>
<li><p><strong>Break it on purpose:</strong> kill the container’s main process <strong>from inside</strong> (PID 1). That ends the container the same way a crashing app would, not the same as <code>docker stop</code> on the host:</p>
</li>
</ol>
<pre><code class="language-bash">sudo docker exec webapp sh -c 'sleep 5 &amp;&amp; kill 1'
</code></pre>
<p>The <code>sleep</code> 5 gives you a moment to switch to the browser. Right after you run the command, open or refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a>. You may catch a brief error or blank page while nothing is listening on port 80.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/3ac89703-63f3-45d8-954f-35adbd2c7dec.png" alt="Browser showing ERR_CONNECTION_RESET on localhost:8082 after the Nginx container process was killed" style="display:block;margin:0 auto" width="1242" height="1057" loading="lazy">

<ol>
<li>Watch Docker restart the container:</li>
</ol>
<pre><code class="language-bash">watch sudo docker ps -a
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5c61d90d-61d6-4023-b3f5-e3eb427e8492.png" alt="Terminal running watch docker ps showing webapp container status as Up 10 seconds after automatic restart" style="display:block;margin:0 auto" width="1011" height="393" loading="lazy">

<p>Within a few seconds you should see <strong>Exited (137)</strong> become <strong>Up</strong> again. (Press Ctrl+C to exit <code>watch</code>.)</p>
<p>5. Refresh the browser. You should see the same HTML as before, because the files live on the VM under <code>/var/www/html</code> and are bind-mounted into the container; restarting only replaced the Nginx process, not those files.</p>
<h4 id="heading-why-not-docker-stop-or-docker-kill-on-the-host-for-this-demo"><strong>Why not</strong> <code>docker stop</code> <strong>or</strong> <code>docker kill</code> <strong>on the host for this demo?</strong></h4>
<p>Those commands go through Docker’s API. On many setups (including recent Docker), Docker treats them as you choosing to stop the container (<code>hasBeenManuallyStopped</code>), and <code>--restart always</code> may not bring the container back until you <code>docker start</code> it or similar.</p>
<p>Killing PID 1 from inside the container is treated more like an internal crash, so the restart policy you set in the playbook is the one you actually get to observe here.</p>
<p><strong>Kubernetes analogy:</strong> A pod whose containers exit can be restarted by the kubelet; a pod you delete does not come back by itself.</p>
<p><strong>What to observe (three separate checks):</strong></p>
<ol>
<li><p><strong>Exit code:</strong> After <code>kill 1</code>, <code>docker ps -a</code> should show the container exited with code 137, meaning the main process was killed by a signal. That confirms the container really died, not that you ran <code>docker stop</code> on the host.</p>
</li>
<li><p><strong>Restart delay vs browser:</strong> Watch how many seconds pass between Exited and Up in <code>docker ps -a</code>; that interval is Docker applying <code>--restart always</code>. That's separate from what you see in the browser: the browser only shows whether something is accepting connections on port 80 on the VM, so it may show an error or blank page during the gap even while Docker is about to restart the container.</p>
</li>
<li><p><strong>Content after recovery:</strong> After status is Up again, refresh the page. You should see the same HTML as before. That shows your content lives on the VM disk (mounted into the container with <code>-v</code>), not inside a file that vanishes when the container process restarts. The process was replaced, not your <code>index.html</code> on the host path.</p>
</li>
</ol>
<h3 id="heading-break-2-cause-a-container-name-conflict">Break 2: Cause a Container Name Conflict</h3>
<p>On a single Docker daemon (here, on your web VM), a container name is a <strong>unique label</strong>. Two running (or stopped) containers can't share the same name. Scripts and playbooks that always use <code>docker run --name webapp</code> without cleaning up first hit this error constantly and recognizing it saves time in real work.</p>
<p><strong>Before you start:</strong> Ansible already created one container named <code>webapp</code>.<br>Stay on the web VM (for example still inside <code>vagrant ssh web</code>) so the commands below run where that container lives.</p>
<p>So now, try to start a second container and also call it <code>webapp</code>. The image is plain <code>nginx</code> here on purpose – the point is the <strong>name clash</strong>, not matching your site’s ports or volume mounts.</p>
<pre><code class="language-plaintext">sudo docker run -d --name webapp nginx
</code></pre>
<p>What actually happens here is that Docker <strong>doesn't</strong> create a second container. It returns an error immediately. Your original <code>webapp</code> is unchanged.</p>
<p>This is because the name <code>webapp</code> is already registered to the existing container (the error shows that container’s ID). Docker refuses to reuse the name until the old container is removed or renamed.</p>
<p>Example error (your ID will differ):</p>
<pre><code class="language-plaintext">docker: Error response from daemon: Conflict. The container name "/webapp" is already in use by container "2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/1fd42c16-c28e-4539-9290-3583206eb8ff.png" alt="container name conflict terminal error screenshot" style="display:block;margin:0 auto" width="914" height="252" loading="lazy">

<p>To fix it, free the name, then create <code>webapp</code> again the same way the playbook does (publish port 80, mount your HTML, restart policy):</p>
<pre><code class="language-plaintext">sudo docker rm -f webapp
sudo docker run -d --name webapp --restart always -p 80:80 \
  -v /var/www/html:/usr/share/nginx/html:ro nginx
</code></pre>
<p>After that, your site should behave as before (refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a> from your laptop).</p>
<h4 id="heading-what-to-observe">What to observe:</h4>
<p>Read Docker’s Conflict message end to end. You should see that the name <code>/webapp</code> is already in use and a container ID pointing at the existing box. In production, that pattern means “something already claimed this name. Just remove it, rename it, or pick a different name before you run <code>docker run</code> again.”</p>
<h3 id="heading-break-3-make-ansible-fail-to-reach-a-vm">Break 3: Make Ansible Fail to Reach a VM</h3>
<p>Ansible separates “could not connect” from “connected, but a task broke.” The first is <strong>UNREACHABLE</strong>, the second is <strong>FAILED</strong>. Knowing which one you have tells you whether to fix network / SSH or playbook / packages / permissions.</p>
<p>On your laptop, in the project folder, edit <code>inventory</code> and change the web server address from <code>192.168.33.10</code> to an IP <strong>no VM uses</strong>, for example <code>192.168.33.99</code>. Save the file.</p>
<pre><code class="language-ini">[webservers]
192.168.33.99 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key
</code></pre>
<p>What you run (from the same project folder on the host):</p>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>After this, Ansible tries to SSH to <code>192.168.33.99</code>. Nothing on your lab network answers as that host (or SSH never succeeds), so Ansible <strong>never runs tasks</strong> on the web server. It stops that host with UNREACHABLE:</p>
<pre><code class="language-plaintext">fatal: [192.168.33.99]: UNREACHABLE! =&gt; {"msg": "Failed to connect to the host via ssh"}
</code></pre>
<p>This is realistic because the same message shape appears when the IP is wrong, the VM isn't running, a firewall blocks port 22, or the network is misconfigured. The common thread is <strong>no working SSH session</strong>.</p>
<p>Now it's time to put it back: restore <code>192.168.33.10</code> in <code>inventory</code> and run <code>ansible-playbook -i inventory playbook.yml</code> again. The web play should reach the VM and complete (assuming your lab is up).</p>
<p><strong>UNREACHABLE vs FAILED – what to observe:</strong></p>
<ul>
<li><p>If Ansible prints UNREACHABLE, you should assume it never opened SSH on that host and never ran tasks there. Go ahead and fix the connection (IP, VM up, firewall, key path) before you debug playbook logic.</p>
</li>
<li><p>If Ansible prints FAILED, you should assume SSH worked and a task returned an error. Read the task output for the real cause (package name, permissions, syntax), not the network first.</p>
</li>
</ul>
<p>When you debug later, you should look at the keyword Ansible prints: <strong>UNREACHABLE</strong> points to reachability while <strong>FAILED</strong> points to task output and the first failed task under that host.</p>
<h3 id="heading-break-4-fill-the-vms-disk">Break 4: Fill the VM's Disk</h3>
<p>Databases and other services need free disk for logs, temp files, and data. When the filesystem is full or nearly full, a service may fail to start or fail at runtime. This break walks through the same diagnosis habit you would use on a real server: check space, then read systemd and journal output for the service.</p>
<p>All commands below run <strong>on the db VM</strong> after <code>vagrant ssh db</code>. MariaDB was installed there by your playbook.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Open a shell on the db VM:</p>
<pre><code class="language-plaintext">vagrant ssh db
</code></pre>
</li>
<li><p>Allocate a large file full of zeros (here 1GB) to simulate something eating disk space:</p>
<pre><code class="language-plaintext">sudo dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

df -h
</code></pre>
<p>Use <code>df -h</code> to see how full the root filesystem (or relevant mount) is. Your Vagrant disk may be large enough that 1GB only raises usage. If MariaDB still starts, you still practiced the checks. To see a stronger effect, you can repeat with a larger <code>count=</code> <strong>only in a lab</strong> (never fill production disks on purpose without a plan).</p>
</li>
<li><p>Ask systemd to restart MariaDB and show status:</p>
<pre><code class="language-plaintext">sudo systemctl restart mariadb
sudo systemctl status mariadb
</code></pre>
<p>If the disk is critically full, restart may fail or the service may show failed or not running.</p>
</li>
<li><p>If something looks wrong, read recent logs for the MariaDB unit:</p>
<pre><code class="language-plaintext">sudo journalctl -u mariadb --no-pager | tail -20
</code></pre>
<p>Errors often mention disk, space, read-only filesystem, or InnoDB being unable to write.</p>
</li>
<li><p>Clean up so your VM stays usable:</p>
<pre><code class="language-plaintext">sudo rm /tmp/bigfile
</code></pre>
<p>Optionally run <code>sudo systemctl restart mariadb</code> again and confirm it is active (running).</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should use <code>df -h</code> first to confirm whether the filesystem is actually tight. That avoids blaming the database when disk space is fine.</p>
</li>
<li><p>You should read <code>systemctl status mariadb</code> to see whether systemd thinks the service is active, failed, or flapping.</p>
</li>
<li><p>You should read <code>journalctl -u mariadb</code> when status is bad, so you can tie the failure to concrete errors from MariaDB or the kernel (often mentioning disk, space, or read-only filesystem). <strong>Space + status + logs</strong> is the same order you would use on a production server.</p>
</li>
</ul>
<h3 id="heading-break-5-run-minikube-out-of-resources">Break 5: Run Minikube Out of Resources</h3>
<p>Kubernetes schedules pods onto nodes that have enough CPU and memory. If you ask for more than the cluster can place, some pods stay <strong>Pending</strong> and <strong>Events</strong> explain why (for example <em>Insufficient cpu</em>). That is not the same as a pod that starts and then crashes.</p>
<p>To do this, you'll need a local cluster (we're using <a href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fmacos%2Fx86-64%2Fstable%2Fbinary+download"><strong>Minikube</strong></a> in this guide) and <code>kubectl</code> on your laptop. This break doesn't use the Vagrant VMs. If you haven't installed Minikube yet, complete the "How to Set Up Kubernetes" section first, or skip this break until you do.</p>
<p>You'll run this on your <strong>Mac, Linux, or Windows terminal</strong> (host), not inside <code>vagrant ssh</code>. If you're still inside a VM, type <code>exit</code> until your prompt is back on the host.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Check Minikube:</p>
<pre><code class="language-plaintext">minikube status
</code></pre>
<p>If it's stopped, start it (Docker driver matches earlier sections):</p>
<pre><code class="language-plaintext">minikube start --driver=docker
</code></pre>
</li>
<li><p>Create a deployment with many replicas so your single Minikube node can't run them all at once:</p>
<pre><code class="language-plaintext">kubectl create deployment stress --image=nginx --replicas=20

#watch pods start
kubectl get pods -w
</code></pre>
<p>Press Ctrl+C when you're done watching. Some pods may stay <strong>Pending</strong> while others are <strong>Running</strong>.</p>
</li>
<li><p>Pick one Pending pod name from <code>kubectl get pods</code> and inspect it:</p>
<pre><code class="language-plaintext">kubectl describe pod &lt;pod-name&gt;
</code></pre>
<p>Under Events, look for FailedScheduling and a line similar to:</p>
<pre><code class="language-plaintext">Warning  FailedScheduling  0/1 nodes are available: 1 Insufficient cpu.
</code></pre>
<p>You might see <strong>Insufficient memory</strong> instead, depending on your machine.</p>
</li>
<li><p>Fix the lab by scaling back so the cluster can catch up:</p>
<pre><code class="language-plaintext">kubectl scale deployment stress --replicas=2
</code></pre>
<p>You can delete the deployment entirely when finished: <code>kubectl delete deployment stress</code>.</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should see Pending pods stay unscheduled until capacity frees up. That means the scheduler hasn't placed them on any <strong>node</strong> yet, usually because the node is out of CPU or memory for that workload.</p>
</li>
<li><p>You should read <code>kubectl describe pod &lt;pod-name&gt;</code> and scroll to <strong>Events</strong>. Messages like Insufficient cpu or Insufficient memory mean the cluster ran out of schedulable capacity, not that the container image image is corrupt.</p>
</li>
<li><p>You should contrast that with a pod that reaches Running and then CrashLoopBackOff, which usually means the process inside the container keeps exiting. that is an application or config problem, not a “nowhere to run” problem.</p>
</li>
</ul>
<h2 id="heading-what-you-can-now-do">What You Can Now Do</h2>
<p>You didn't just install tools in this tutorial. You also used them.</p>
<p>You can now spin up two servers from a single file. You can write a playbook that installs software and deploys a container without touching either machine manually.</p>
<p>You can serve a page you wrote from inside a Docker container running on a Vagrant VM, and bring the whole thing back from scratch in one command.</p>
<p>You also broke it. You saw what a container conflict looks like, what Ansible prints when it can't reach a machine, what disk pressure does to a running service, and what a Kubernetes scheduler says when it runs out of resources. Those error messages aren't unfamiliar anymore.</p>
<p>That's the difference between someone who has read about DevOps and someone who has run it.</p>
<p><strong>Here are four free projects you can run in this same lab to go further:</strong></p>
<ul>
<li><p><strong>DevOps Home-Lab 2026</strong> — Build a multi-service app (frontend, API, PostgreSQL, Redis) end-to-end with Docker Compose, Kubernetes, Prometheus/Grafana monitoring, GitOps with ArgoCD, and Cloudflare for global exposure.</p>
</li>
<li><p><strong>KubeLab</strong> — Trigger real Kubernetes failure scenarios, pod crashes, OOMKills, node drains, cascading failures, and watch how the cluster responds using live metrics.</p>
</li>
<li><p><strong>K8s Secrets Lab</strong> — Build a full secret management pipeline from AWS Secrets Manager into your cluster, including rotation behavior and IRSA.</p>
</li>
<li><p><strong>DevOps Troubleshooting Toolkit</strong> — Structured debugging guides across Linux, containers, Kubernetes, cloud, databases, and observability with copy-paste commands for real incidents.</p>
</li>
</ul>
<p>All free and open source: <a href="https://github.com/Osomudeya/List-Of-DevOps-Projects">github.com/Osomudeya/List-Of-DevOps-Projects</a>.</p>
<p>If you want to go deeper, you can find six full chapters covering Terraform, Ansible, monitoring, CI/CD, and a simulated three-VM production environment at <a href="https://osomudeya.gumroad.com/l/BuildYourOwnDevOpsLab">Build Your Own DevOps Lab</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build and Deploy Multi-Architecture Docker Apps on Google Cloud Using ARM Nodes (Without QEMU)
 ]]>
                </title>
                <description>
                    <![CDATA[ If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening ins ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-and-deploy-multi-architecture-docker-apps-on-google-cloud-using-arm-nodes/</link>
                <guid isPermaLink="false">69dcf2c3f57346bc1e05a01d</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ARM ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Amina Lawal ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:42:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e89ae65a-4b3a-44b7-94d8-d0638f017bf6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening inside cloud data centers.</p>
<p>Google Cloud Axion is Google's own custom ARM-based chip, built to handle the demands of modern cloud workloads. The performance and cost numbers are striking: Google claims Axion delivers up to 60% better energy efficiency and up to 65% better price-performance compared to comparable x86 machines.</p>
<p>AWS has Graviton. Azure has Cobalt. ARM is no longer niche. It's the direction the entire cloud industry is moving.</p>
<p>But there's a problem that catches almost every team off guard when they start this transition: <strong>container architecture mismatch</strong>.</p>
<p>If you build a Docker image on your M-series Mac and push it to an x86 server, it crashes on startup with a cryptic <code>exec format error</code>.</p>
<p>The server isn't broken. It just can't read the compiled instructions inside your image. An ARM binary and an x86 binary are written in fundamentally different languages at the machine level. The CPU literally can't execute instructions it wasn't designed for.</p>
<p>We're going to solve this problem completely in this tutorial. You'll build a single Docker image tag that automatically serves the correct binary on both ARM and x86 machines — no separate pipelines, no separate tags. Then you'll provision Google Cloud ARM nodes in GKE and configure your Kubernetes deployment to route workloads precisely to those cost-efficient nodes.</p>
<p><strong>Here's what you'll build, step by step:</strong></p>
<ul>
<li><p>A Go HTTP server that reports the CPU architecture it's running on at runtime</p>
</li>
<li><p>A multi-stage Dockerfile that cross-compiles for both <code>linux/amd64</code> and <code>linux/arm64</code> without slow QEMU emulation</p>
</li>
<li><p>A multi-arch image in Google Artifact Registry that acts as a single entry point for any architecture</p>
</li>
<li><p>A GKE cluster with two node pools: a standard x86 pool and an ARM Axion pool</p>
</li>
<li><p>A Kubernetes Deployment that pins your workload exclusively to the ARM nodes</p>
</li>
</ul>
<p>By the end, you'll hit a live endpoint and see the word <code>arm64</code> staring back at you from a Google Cloud ARM node. Let's get into it.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</a></p>
</li>
<li><p><a href="#heading-step-3-write-the-application">Step 3: Write the Application</a></p>
</li>
<li><p><a href="#heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</a></p>
</li>
<li><p><a href="#heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</a></p>
</li>
<li><p><a href="#heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</a></p>
</li>
<li><p><a href="#heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</a></p>
</li>
<li><p><a href="#heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-project-file-structure">Project File Structure</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following ready:</p>
<ul>
<li><p><strong>A Google Cloud project</strong> with billing enabled. If you don't have one, create it at <a href="https://console.cloud.google.com">console.cloud.google.com</a>. The total cost to follow this tutorial is around $5–10.</p>
</li>
<li><p><code>gcloud</code> <strong>CLI</strong> installed and authenticated. Run <code>gcloud auth login</code> to sign in and <code>gcloud config set project YOUR_PROJECT_ID</code> to point it at your project.</p>
</li>
<li><p><strong>Docker Desktop</strong> version 19.03 or later. Docker Buildx (the tool we'll use for multi-arch builds) ships bundled with it.</p>
</li>
<li><p><code>kubectl</code> installed. This is the CLI for interacting with Kubernetes clusters.</p>
</li>
<li><p>Basic familiarity with <strong>Docker</strong> (images, layers, Dockerfile) and <strong>Kubernetes</strong> (pods, deployments, services). You don't need to be an expert, but you should know what these things are.</p>
</li>
</ul>
<h2 id="heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</h2>
<p>Before writing a single line of application code, let's get the cloud infrastructure side ready. This is the foundation everything else will build on.</p>
<h3 id="heading-enable-the-required-apis">Enable the Required APIs</h3>
<p>Google Cloud services are off by default in any new project. Run this command to turn on the three APIs we'll need:</p>
<pre><code class="language-bash">gcloud services enable \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  containeranalysis.googleapis.com
</code></pre>
<p>Here's what each one does:</p>
<ul>
<li><p><code>artifactregistry.googleapis.com</code> — enables <strong>Artifact Registry</strong>, where we'll store our Docker images</p>
</li>
<li><p><code>container.googleapis.com</code> — enables <strong>Google Kubernetes Engine (GKE)</strong>, where our cluster will run</p>
</li>
<li><p><code>containeranalysis.googleapis.com</code> — enables vulnerability scanning for images stored in Artifact Registry</p>
</li>
</ul>
<h3 id="heading-create-a-docker-repository-in-artifact-registry">Create a Docker Repository in Artifact Registry</h3>
<p>Artifact Registry is Google Cloud's managed container image store — the place where our built images will live before being deployed to the cluster. Create a dedicated repository for this tutorial:</p>
<pre><code class="language-bash">gcloud artifacts repositories create multi-arch-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Multi-arch tutorial images"
</code></pre>
<p>Breaking down the flags:</p>
<ul>
<li><p><code>--repository-format=docker</code> — tells Artifact Registry this repository stores Docker images (as opposed to npm packages, Maven artifacts, and so on)</p>
</li>
<li><p><code>--location=us-central1</code> — the Google Cloud region where your images will be stored. Use a region that's close to where your cluster will run to minimize image pull latency. Run <code>gcloud artifacts locations list</code> to see all options.</p>
</li>
<li><p><code>--description</code> — a human-readable label for the repository, shown in the console.</p>
</li>
</ul>
<h3 id="heading-authenticate-docker-to-push-to-artifact-registry">Authenticate Docker to Push to Artifact Registry</h3>
<p>Docker needs credentials before it can push images to Google Cloud. Run this command to wire up authentication automatically:</p>
<pre><code class="language-bash">gcloud auth configure-docker us-central1-docker.pkg.dev
</code></pre>
<p>This adds a credential helper entry to your <code>~/.docker/config.json</code> file. What that means in practice: any time Docker tries to push or pull from a URL under <code>us-central1-docker.pkg.dev</code>, it will automatically call <code>gcloud</code> to get a valid auth token. You won't need to run <code>docker login</code> manually.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/31fd020f-ffa2-40bd-9057-57b16a61b325.png" alt="Terminal output of the gcloud artifacts repositories list command, showing a row for multi-arch-repo with format DOCKER, location us-central1" style="display:block;margin:0 auto" width="2870" height="1512" loading="lazy">

<h2 id="heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</h2>
<p>With Artifact Registry ready to receive images, let's create the Kubernetes cluster. We'll start with a standard cluster using x86 nodes and add an ARM node pool later once we have an image to deploy.</p>
<pre><code class="language-bash">gcloud container clusters create axion-tutorial-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-standard-2 \
  --workload-pool=PROJECT_ID.svc.id.goog
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>What each flag does:</p>
<ul>
<li><p><code>--zone=us-central1-a</code> — creates a zonal cluster in a single availability zone. A regional cluster (using <code>--region</code>) would spread nodes across three zones for higher resilience, but for this tutorial a single zone keeps things simple and avoids capacity issues that can affect specific zones. If <code>us-central1-a</code> is unavailable, try <code>us-central1-b</code>.</p>
</li>
<li><p><code>--num-nodes=2</code> — two x86 nodes in this zone. We need at least 2 to have enough capacity alongside our ARM node pool later.</p>
</li>
<li><p><code>--machine-type=e2-standard-2</code> — the machine type for this default node pool. <code>e2-standard-2</code> is a cost-effective x86 machine with 2 vCPUs and 8 GB of memory, good for general workloads.</p>
</li>
<li><p><code>--workload-pool=PROJECT_ID.svc.id.goog</code> — enables <strong>Workload Identity</strong>, which is Google's recommended way for pods to authenticate with Google Cloud APIs. It avoids the need to download and store service account key files inside your cluster.</p>
</li>
</ul>
<p>This command takes a few minutes. While it runs, you can move on to writing the application. We'll come back to the cluster in Step 6.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/332250a8-3f99-4eb1-849f-51ab054c9567.png" alt="GCP Console Kubernetes Engine Clusters page showing axion-tutorial-cluster with a green checkmark status, the zone us-central1-a, and Kubernetes version in the table." style="display:block;margin:0 auto" width="1457" height="720" loading="lazy">

<h2 id="heading-step-3-write-the-application">Step 3: Write the Application</h2>
<p>We need an application to containerize. We'll use <strong>Go</strong> for three specific reasons:</p>
<ol>
<li><p>Go compiles into a single, statically-linked binary. There's no runtime to install, no interpreter — just the binary. This makes for extremely lean container images.</p>
</li>
<li><p>Go has first-class, built-in cross-compilation support. We can compile an ARM64 binary from an x86 Mac, or vice versa, by setting two environment variables. This will matter a lot when we get to the Dockerfile.</p>
</li>
<li><p>Go exposes the architecture the binary was compiled for via <code>runtime.GOARCH</code>. Our server will report this at runtime, giving us hard proof that the correct binary is running on the correct hardware.</p>
</li>
</ol>
<p>Start by creating the project directories:</p>
<pre><code class="language-bash">mkdir -p hello-axion/app hello-axion/k8s
cd hello-axion/app
</code></pre>
<p>Initialize the Go module from inside <code>app/</code>. This creates <code>go.mod</code> in the current directory:</p>
<pre><code class="language-bash">go mod init hello-axion
</code></pre>
<p><code>go mod init</code> is Go's built-in command for starting a new module. It writes a <code>go.mod</code> file that declares the module name (<code>hello-axion</code>) and the minimum Go version required. Every modern Go project needs this file — without it, the compiler doesn't know how to resolve packages.</p>
<p>Now create the application at <code>app/main.go</code>:</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
    "os"
    "runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
    hostname, _ := os.Hostname()
    fmt.Fprintf(w, "Hello from freeCodeCamp!\n")
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
    fmt.Fprintf(w, "OS           : %s\n", runtime.GOOS)
    fmt.Fprintf(w, "Pod hostname : %s\n", hostname)
}

func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/healthz", healthz)
    fmt.Println("Server starting on port 8080...")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        fmt.Fprintf(os.Stderr, "server error: %v\n", err)
        os.Exit(1)
    }
}
</code></pre>
<p>Verify both files were created:</p>
<pre><code class="language-bash">ls -la
</code></pre>
<p>You should see <code>go.mod</code> and <code>main.go</code> listed.</p>
<p>Let's walk through what this code does:</p>
<ul>
<li><p><code>import "runtime"</code> — imports Go's built-in <code>runtime</code> package, which exposes information about the Go runtime environment, including the CPU architecture.</p>
</li>
<li><p><code>runtime.GOARCH</code> — returns a string like <code>"arm64"</code> or <code>"amd64"</code> representing the architecture this binary was compiled for. When we deploy to an ARM node, this value will be <code>arm64</code>. This is the core of our proof.</p>
</li>
<li><p><code>os.Hostname()</code> — returns the pod's hostname, which Kubernetes sets to the pod name. This lets us see which specific pod responded when we test the app later.</p>
</li>
<li><p><code>handler</code> — the main HTTP handler, registered on the root path <code>/</code>. It writes the architecture, OS, and hostname to the response.</p>
</li>
<li><p><code>healthz</code> — a separate handler registered on <code>/healthz</code>. It returns HTTP 200 with the text <code>ok</code>. Kubernetes will use this endpoint to check whether the container is alive and ready to serve traffic — we'll wire this up in the deployment manifest later.</p>
</li>
<li><p><code>http.ListenAndServe(":8080", nil)</code> — starts the server on port 8080. If it fails to start (for example, if the port is already in use), it prints the error and exits with a non-zero code so Kubernetes knows something went wrong.</p>
</li>
</ul>
<h2 id="heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</h2>
<p>Before we write the Dockerfile, we need to understand a fundamental constraint, because it directly shapes how the Dockerfile must be written.</p>
<h3 id="heading-why-your-docker-images-are-architecture-specific-by-default">Why Your Docker Images Are Architecture-Specific By Default</h3>
<p>A CPU only understands instructions written for its specific <strong>Instruction Set Architecture (ISA)</strong>. ARM64 and x86_64 are different ISAs — different vocabularies of machine-level operations. When you compile a Go program, the compiler translates your source code into binary instructions for exactly one ISA. That binary can't run on a different ISA.</p>
<p>When you build a Docker image the normal way (<code>docker build</code>), the binary inside that image is compiled for your local machine's ISA. If you're on an Apple Silicon Mac, you get an ARM64 binary. Push that image to an x86 server, and when Docker tries to execute the binary, the kernel rejects it:</p>
<pre><code class="language-shell">standard_init_linux.go:228: exec user process caused: exec format error
</code></pre>
<p>That's the operating system saying: "This binary was written for a different processor. I have no idea what to do with it."</p>
<h3 id="heading-the-solution-a-single-image-tag-that-serves-any-architecture">The Solution: A Single Image Tag That Serves Any Architecture</h3>
<p>Docker solves this with a structure called a <strong>Manifest List</strong> (also called a multi-arch image index). Instead of one image, a Manifest List is a pointer table. It holds multiple image references — one per architecture — all under the same tag.</p>
<p>When a server pulls <code>hello-axion:v1</code>, here's what actually happens:</p>
<ol>
<li><p>Docker contacts the registry and requests the manifest for <code>hello-axion:v1</code></p>
</li>
<li><p>The registry returns the Manifest List, which looks like this internally:</p>
</li>
</ol>
<pre><code class="language-json">{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}
</code></pre>
<ol>
<li>Docker checks the current machine's architecture, finds the matching entry, and pulls only that specific image layer. The x86 image never downloads onto your ARM server, and vice versa.</li>
</ol>
<p>One tag, two actual images. Completely transparent to your deployment manifests.</p>
<h3 id="heading-set-up-docker-buildx">Set Up Docker Buildx</h3>
<p><strong>Docker Buildx</strong> is the CLI tool that builds these Manifest Lists. It's powered by the <strong>BuildKit</strong> engine and ships bundled with Docker Desktop. Run the following to create and activate a new builder instance:</p>
<pre><code class="language-bash">docker buildx create --name multiarch-builder --use
</code></pre>
<ul>
<li><p><code>--name multiarch-builder</code> — gives this builder a memorable name. You can have multiple builders. This command creates a new one named <code>multiarch-builder</code>.</p>
</li>
<li><p><code>--use</code> — immediately sets this new builder as the active one, so all future <code>docker buildx build</code> commands use it.</p>
</li>
</ul>
<p>Now boot the builder and confirm it supports the platforms we need:</p>
<pre><code class="language-bash">docker buildx inspect --bootstrap
</code></pre>
<ul>
<li><code>--bootstrap</code> — starts the builder container if it isn't already running, and prints its full configuration.</li>
</ul>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">Name:          multiarch-builder
Driver:        docker-container
Platforms:     linux/amd64, linux/arm64, linux/arm/v7, linux/386, ...
</code></pre>
<p>The <code>Platforms</code> line lists every architecture this builder can produce images for. As long as you see <code>linux/amd64</code> and <code>linux/arm64</code> in that list, you're ready to build for both x86 and ARM.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/1c19aca1-30c4-406d-9c37-679ee4f2928f.png" alt="Terminal output showing the multiarch-builder details with Name, Driver set to docker-container, and a Platforms list that includes linux/amd64 and linux/arm64 highlighted." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<h2 id="heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</h2>
<p>Now we can write the Dockerfile. We'll use two techniques together: a <strong>multi-stage build</strong> to keep the final image tiny, and a <strong>cross-compilation trick</strong> to avoid slow CPU emulation.</p>
<p>Create <code>app/Dockerfile</code> with the following content:</p>
<pre><code class="language-dockerfile"># -----------------------------------------------------------
# Stage 1: Build
# -----------------------------------------------------------
# $BUILDPLATFORM = the machine running this build (your laptop)
# \(TARGETOS / \)TARGETARCH = the platform we are building FOR
# -----------------------------------------------------------
FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /app

COPY go.mod .
RUN go mod download

COPY main.go .

RUN GOOS=\(TARGETOS GOARCH=\)TARGETARCH go build -ldflags="-w -s" -o server main.go

# -----------------------------------------------------------
# Stage 2: Runtime
# -----------------------------------------------------------

FROM alpine:latest

RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup
USER appuser

WORKDIR /app
COPY --from=builder /app/server .

EXPOSE 8080
CMD ["./server"]
</code></pre>
<p>There's a lot happening here. Let's go through it carefully.</p>
<h3 id="heading-stage-1-the-builder">Stage 1: The Builder</h3>
<p><code>FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder</code></p>
<p>This is the most important line in the file. <code>\(BUILDPLATFORM</code> is a special build argument that Docker Buildx automatically injects — it equals the platform of the machine <em>running the build</em> (your laptop). By pinning the builder stage to <code>\)BUILDPLATFORM</code>, the Go compiler always runs natively on your machine, not inside a CPU emulator. This is what makes multi-arch builds fast.</p>
<p>Without <code>--platform=$BUILDPLATFORM</code>, Buildx would have to use <strong>QEMU</strong> — a full CPU emulator — to run an ARM64 build environment on your x86 machine (or vice versa). QEMU works, but it's typically 5–10 times slower than native execution. For a project with many dependencies, that's the difference between a 2-minute build and a 20-minute build.</p>
<p><code>ARG TARGETOS</code> <strong>and</strong> <code>ARG TARGETARCH</code></p>
<p>These two lines declare that our Dockerfile expects build arguments named <code>TARGETOS</code> and <code>TARGETARCH</code>. Buildx injects these automatically based on the <code>--platform</code> flag you pass at build time. For a <code>linux/arm64</code> target, <code>TARGETOS</code> will be <code>linux</code> and <code>TARGETARCH</code> will be <code>arm64</code>.</p>
<p><code>COPY go.mod .</code> <strong>and</strong> <code>RUN go mod download</code></p>
<p>We copy <code>go.mod</code> first, before copying the rest of the source code. Docker builds images layer by layer and caches each layer. By copying only the module file first, we create a cached layer for <code>go mod download</code>.</p>
<p>On future builds, as long as <code>go.mod</code> hasn't changed, Docker skips the download step entirely — even if the source code changed. This speeds up iterative development significantly.</p>
<p><code>RUN GOOS=\(TARGETOS GOARCH=\)TARGETARCH go build -ldflags="-w -s" -o server main.go</code></p>
<p>This is the cross-compilation step. <code>GOOS</code> and <code>GOARCH</code> are Go's built-in cross-compilation environment variables. Setting them tells the Go compiler to produce a binary for a different target than the machine it's running on. We set them from the <code>\(TARGETOS</code> and <code>\)TARGETARCH</code> build args injected by Buildx.</p>
<p>The <code>-ldflags="-w -s"</code> flag strips the debug symbol table and the DWARF debugging information from the binary. This has no effect on runtime behavior but reduces the binary size by roughly 30%.</p>
<h3 id="heading-stage-2-the-runtime-image">Stage 2: The Runtime Image</h3>
<p><code>FROM alpine:latest</code></p>
<p>This starts a brand-new image from Alpine Linux — a minimal Linux distribution that weighs about 5 MB. Critically, <code>alpine:latest</code> is itself a multi-arch image, so Docker automatically selects the <code>arm64</code> or <code>amd64</code> Alpine variant depending on which platform this stage is built for.</p>
<p>Everything from Stage 1 — the Go toolchain, the source files, the intermediate object files — is discarded. The final image contains <em>only</em> Alpine Linux plus our binary. Compared to a naive single-stage Go image (~300 MB), this approach produces an image under 15 MB.</p>
<p><code>RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup</code> and <code>USER appuser</code></p>
<p>These two lines create a non-root user and set it as the active user for the container. Running containers as root is a security risk — if an attacker exploits a vulnerability in your application, they gain root access inside the container. Running as a non-root user limits the blast radius.</p>
<p><code>COPY --from=builder /app/server .</code></p>
<p>This is how multi-stage builds work: the <code>--from=builder</code> flag tells Docker to copy files from the <code>builder</code> stage (Stage 1), not from your local disk. Only the compiled binary (<code>server</code>) makes it into the final image.</p>
<h2 id="heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</h2>
<p>With the application and Dockerfile in place, we can now build images for both architectures and push them to Artifact Registry — all in a single command.</p>
<p>From inside the <code>app/</code> directory, run:</p>
<pre><code class="language-bash">docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 \
  --push \
  .
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual GCP project ID.</p>
<p>Here's what each part of this command does:</p>
<ul>
<li><p><code>docker buildx build</code> — uses the Buildx CLI instead of the standard <code>docker build</code>. Buildx is required for multi-platform builds.</p>
</li>
<li><p><code>--platform linux/amd64,linux/arm64</code> — instructs Buildx to build the image twice: once targeting x86 Intel/AMD machines, and once targeting ARM64. Both builds run in parallel. Because our Dockerfile uses the <code>$BUILDPLATFORM</code> cross-compilation trick, both builds run natively on your machine without QEMU emulation.</p>
</li>
<li><p><code>-t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1</code> — the full image path in Artifact Registry. The format is always <code>REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG</code>.</p>
</li>
<li><p><code>--push</code> — multi-arch images can't be loaded into your local Docker daemon (which only understands single-architecture images). This flag tells Buildx to skip local storage and push the completed Manifest List — with both architecture variants — directly to the registry.</p>
</li>
<li><p><code>.</code> — the build context, the directory Docker scans for the Dockerfile and any files the build needs.</p>
</li>
</ul>
<p>Watch the output as the build runs. You'll see BuildKit working on both platforms simultaneously:</p>
<pre><code class="language-plaintext"> =&gt; [linux/amd64 builder 1/5] FROM golang:1.23-alpine
 =&gt; [linux/arm64 builder 1/5] FROM golang:1.23-alpine
 ...
 =&gt; pushing manifest for us-central1-docker.pkg.dev/.../hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/dc88f558-b4ee-4100-bfe1-eaa943bec9bc.png" alt="Terminal showing docker buildx build output with two parallel build tracks labeled linux/amd64 and linux/arm64, and a final line reading pushing manifest for the Artifact Registry image path." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<h3 id="heading-verify-the-multi-arch-image-in-artifact-registry">Verify the Multi-Arch Image in Artifact Registry</h3>
<p>Once the push completes, navigate to <strong>GCP Console → Artifact Registry → Repositories → multi-arch-repo</strong> and click on <code>hello-axion</code>.</p>
<p>You won't see a single image — you'll see something labelled <strong>"Image Index"</strong>. That's the Manifest List we created. Click into it, and you'll find two child images with separate digests, one for <code>linux/amd64</code> and one for <code>linux/arm64</code>.</p>
<p>You can also inspect this from the command line:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/28d0e4a4-1d45-4c0b-ac47-34dc3b72c11d.png" alt="Google Cloud Artifact Registry console showing hello-axion as an Image Index with two child images: one labeled linux/amd64 and one labeled linux/arm64, each with its own digest and size." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<p>The output lists every manifest inside the image index. You'll see entries for <code>linux/amd64</code> and <code>linux/arm64</code> — those are our two real images. You'll also see two entries with <code>Platform: unknown/unknown</code> labelled as <code>attestation-manifest</code>. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to prove how and where the image was built (a supply chain security feature called SLSA attestation).</p>
<p>The two entries you care about are <code>linux/amd64</code> and <code>linux/arm64</code>. Note the digest for the <code>arm64</code> entry — we'll use it in the verification step to confirm the cluster pulled the right variant.</p>
<h2 id="heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</h2>
<p>We have a universal image. Now we need somewhere to run it.</p>
<p>Recall the cluster we created in Step 2 — it's running <code>e2-standard-2</code> x86 machines. We're going to add a second node pool running ARM machines. This is the key architectural move: a <strong>mixed-architecture cluster</strong> where different workloads can be routed to different hardware.</p>
<h3 id="heading-choosing-your-arm-machine-type">Choosing Your ARM Machine Type</h3>
<p>Google Cloud currently offers two ARM-based machine series in GKE:</p>
<table>
<thead>
<tr>
<th>Series</th>
<th>Example type</th>
<th>What it is</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tau T2A</strong></td>
<td><code>t2a-standard-2</code></td>
<td>First-gen Google ARM (Ampere Altra). Broadly available across regions. Great for getting started.</td>
</tr>
<tr>
<td><strong>Axion (C4A)</strong></td>
<td><code>c4a-standard-2</code></td>
<td>Google's custom ARM chip (Arm Neoverse V2 core). Newest generation, best price-performance. Still expanding availability.</td>
</tr>
</tbody></table>
<p>This tutorial uses <code>t2a-standard-2</code> because it's widely available. The commands are identical for <code>c4a-standard-2</code> — just swap the <code>--machine-type</code> value. If <code>t2a-standard-2</code> isn't available in your zone, GKE will tell you immediately when you run the node pool creation command below, and you can try a neighbouring zone.</p>
<h3 id="heading-create-the-arm-node-pool">Create the ARM Node Pool</h3>
<p>Add the ARM node pool to your existing cluster:</p>
<pre><code class="language-bash">gcloud container node-pools create axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --num-nodes=2 \
  --node-labels=workload-type=arm-optimized
</code></pre>
<p>What each flag does:</p>
<ul>
<li><p><code>--cluster=axion-tutorial-cluster</code> — the name of the cluster we created in Step 2. Node pools are always added to an existing cluster.</p>
</li>
<li><p><code>--zone=us-central1-a</code> — must match the zone you used when creating the cluster.</p>
</li>
<li><p><code>--machine-type=t2a-standard-2</code> — GKE detects this is an ARM machine type and automatically provisions the nodes with an ARM-compatible version of Container-Optimized OS (COS). You don't need to configure anything special for ARM at the OS level.</p>
</li>
<li><p><code>--num-nodes=2</code> — two ARM nodes in the zone, enough to schedule our 3-replica deployment alongside other cluster overhead.</p>
</li>
<li><p><code>--node-labels=workload-type=arm-optimized</code> — attaches a custom label to every node in this pool. We'll use this label in our deployment manifest to target these specific nodes. Using a descriptive custom label (rather than just relying on the automatic <code>kubernetes.io/arch=arm64</code> label) is good practice in real clusters — it communicates the <em>intent</em> of the pool, not just its hardware.</p>
</li>
</ul>
<p>This command takes a few minutes. Once it completes, let's confirm our cluster now has both node pools:</p>
<pre><code class="language-bash">gcloud container clusters get-credentials axion-tutorial-cluster --zone=us-central1-a

kubectl get nodes --label-columns=kubernetes.io/arch
</code></pre>
<p>The <code>get-credentials</code> command configures <code>kubectl</code> to authenticate with your new cluster. The <code>get nodes</code> command then lists all nodes and adds a column showing the <code>kubernetes.io/arch</code> label.</p>
<p>You should see something like:</p>
<pre><code class="language-plaintext">NAME                                    STATUS   ARCH    AGE
gke-...default-pool-abc...              Ready    amd64   15m
gke-...default-pool-def...              Ready    amd64   15m
gke-...axion-pool-jkl...                Ready    arm64   3m
gke-...axion-pool-mno...                Ready    arm64   3m
</code></pre>
<p><code>amd64</code> for the default x86 pool, <code>arm64</code> for our new Axion pool. This <code>kubernetes.io/arch</code> label is applied automatically by GKE — you don't set it, it's derived from the hardware.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/6389f4c6-17fe-4086-982f-39d94dbfa252.png" alt="Terminal output of kubectl get nodes with a ARCH column showing amd64 for two default-pool nodes and arm64 for two axion-pool nodes." style="display:block;margin:0 auto" width="2330" height="646" loading="lazy">

<h2 id="heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</h2>
<p>We have a multi-arch image and a mixed-architecture cluster. Here's something important to understand before writing the deployment manifest: <strong>Kubernetes doesn't know or care about image architecture by default</strong>.</p>
<p>If you applied a standard Deployment right now, the scheduler would look for any available node with enough CPU and memory and place pods there — potentially landing on x86 nodes instead of your ARM Axion nodes. The multi-arch Manifest List would handle this gracefully (the right binary would run regardless), but you'd lose the cost efficiency you provisioned Axion nodes for in the first place.</p>
<p>To guarantee that pods land on ARM nodes and only ARM nodes, we use a <code>nodeSelector</code>.</p>
<h3 id="heading-how-nodeselector-works">How nodeSelector Works</h3>
<p>A <code>nodeSelector</code> is a set of key-value pairs in your pod spec. Before the Kubernetes scheduler places a pod, it checks every available node's labels. If a node doesn't have all the labels in the <code>nodeSelector</code>, the scheduler skips it — the pod will remain in <code>Pending</code> state rather than land on the wrong node.</p>
<p>This is a hard constraint, which is exactly what we want for cost optimization. Contrast this with Node Affinity's soft preference mode (<code>preferredDuringSchedulingIgnoredDuringExecution</code>), which says "try to use ARM, but fall back to x86 if needed." Soft preferences are useful for resilience, but they undermine the whole point of dedicated ARM pools. We want the hard constraint.</p>
<h3 id="heading-write-the-deployment-manifest">Write the Deployment Manifest</h3>
<p>Create <code>k8s/deployment.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-axion
  labels:
    app: hello-axion
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-axion
  template:
    metadata:
      labels:
        app: hello-axion
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64

      containers:
      - name: hello-axion
        image: us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        resources:
          requests:
            cpu: "250m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your project ID. Here's what the key sections do:</p>
<p><code>replicas: 3</code> — tells Kubernetes to keep three instances of this pod running at all times. If one crashes or a node goes down, the scheduler spins up a replacement. Three replicas also means one pod per ARM node in <code>us-central1</code>, which distributes load across availability zones.</p>
<p><code>selector.matchLabels</code> and <code>template.metadata.labels</code> — these two blocks must match. The <code>selector</code> tells the Deployment which pods it "owns," and the <code>template.metadata.labels</code> is what those pods will be tagged with. If they don't match, Kubernetes won't be able to manage the pods.</p>
<p><code>nodeSelector: kubernetes.io/arch: arm64</code> — this is the pin. The Kubernetes scheduler filters out every node that doesn't carry this label before considering resource availability. Since GKE automatically applies <code>kubernetes.io/arch=arm64</code> to all ARM nodes, our pods will schedule only onto the <code>axion-pool</code> nodes.</p>
<p><code>livenessProbe</code> — periodically calls <code>GET /healthz</code>. If this check fails a certain number of times in a row (indicating the container has deadlocked or is otherwise unresponsive), Kubernetes restarts the container. <code>initialDelaySeconds: 5</code> gives the server 5 seconds to start up before the first check.</p>
<p><code>readinessProbe</code> — similar to the liveness probe, but with a different purpose. While the readiness probe is failing, Kubernetes removes the pod from the service's load balancer, so no traffic is sent to it. This is important during startup — the pod won't receive traffic until it signals it's ready.</p>
<p><code>resources.requests</code> — reserves <code>250m</code> (25% of a CPU core) and <code>64Mi</code> of memory on the node for this pod. The scheduler uses these numbers to decide whether a node has enough room for the pod. Setting requests is required for sensible bin-packing. Without them, nodes can be silently overcommitted.</p>
<p><code>resources.limits</code> — caps the container at <code>500m</code> CPU and <code>128Mi</code> memory. If the container exceeds these limits, Kubernetes throttles the CPU or kills the container (for memory). This prevents a single misbehaving pod from starving other workloads on the same node.</p>
<h3 id="heading-a-note-on-taints-and-tolerations">A Note on Taints and Tolerations</h3>
<p>Once you're comfortable with <code>nodeSelector</code>, the next step in production clusters is adding a <strong>taint</strong> to your ARM node pool. A taint is a repellent — any pod without an explicit <strong>toleration</strong> for that taint is blocked from landing on the tainted node.</p>
<p>This means other workloads in your cluster can't accidentally consume your ARM capacity. You'd add the taint when creating the pool:</p>
<pre><code class="language-bash"># Add --node-taints to the pool creation command:
--node-taints=workload-type=arm-optimized:NoSchedule
</code></pre>
<p>And a matching toleration in the pod spec:</p>
<pre><code class="language-yaml">tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "arm-optimized"
  effect: "NoSchedule"
</code></pre>
<p>We're not doing this in the tutorial to keep things simple, but it's the pattern production multi-tenant clusters use to enforce hard separation between workload types.</p>
<h3 id="heading-write-the-service-manifest">Write the Service Manifest</h3>
<p>We also need a Kubernetes Service to expose the pods over the network. Create <code>k8s/service.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: hello-axion-svc
spec:
  selector:
    app: hello-axion
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
</code></pre>
<ul>
<li><p><code>selector: app: hello-axion</code> — the Service discovers pods using labels. Any pod with <code>app: hello-axion</code> on it will be added to this Service's load balancer pool.</p>
</li>
<li><p><code>port: 80</code> — the port the Service is reachable on from outside the cluster.</p>
</li>
<li><p><code>targetPort: 8080</code> — the port on the pod that traffic gets forwarded to. Our Go server listens on port 8080, so this must match.</p>
</li>
<li><p><code>type: LoadBalancer</code> — tells GKE to provision an external Google Cloud load balancer and assign it a public IP. This is what makes the Service reachable from the internet.</p>
</li>
</ul>
<h3 id="heading-apply-both-manifests">Apply Both Manifests</h3>
<pre><code class="language-bash">kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
</code></pre>
<p><code>kubectl apply</code> reads each manifest file and creates or updates the resources described in it. If the resources don't exist yet, they're created. If they already exist, Kubernetes only applies the diff — it won't restart pods unnecessarily.</p>
<p>Watch the pods come up in real time:</p>
<pre><code class="language-bash">kubectl get pods -w
</code></pre>
<p>The <code>-w</code> flag watches for changes and prints updates as they happen. You should see pods transition from <code>Pending</code> → <code>ContainerCreating</code> → <code>Running</code>. Once all three show <code>Running</code>, press <code>Ctrl+C</code> to stop watching.</p>
<h2 id="heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</h2>
<p>Everything is running. Now we need evidence — not just that pods are up, but that they're on the right nodes and serving the right binary.</p>
<h3 id="heading-confirm-pod-placement">Confirm Pod Placement</h3>
<pre><code class="language-bash">kubectl get pods -o wide
</code></pre>
<p>The <code>-o wide</code> flag adds extra columns to the output, including the name of the node each pod was scheduled on. Look at the <code>NODE</code> column:</p>
<pre><code class="language-plaintext">NAME                          READY   STATUS    NODE
hello-axion-7b8d9f-abc12      1/1     Running   gke-axion-tutorial-axion-pool-a-...
hello-axion-7b8d9f-def34      1/1     Running   gke-axion-tutorial-axion-pool-b-...
hello-axion-7b8d9f-ghi56      1/1     Running   gke-axion-tutorial-axion-pool-c-...
</code></pre>
<p>All three pods should show node names containing <code>axion-pool</code>. None should show <code>default-pool</code>.</p>
<h3 id="heading-confirm-the-nodes-are-arm">Confirm the Nodes Are ARM</h3>
<p>Take one of those node names and verify its architecture label:</p>
<pre><code class="language-bash">kubectl get node NODE_NAME --show-labels | grep kubernetes.io/arch
</code></pre>
<p>Replace <code>NODE_NAME</code> with one of the node names from the previous command. You should see:</p>
<pre><code class="language-plaintext">kubernetes.io/arch=arm64
</code></pre>
<p>That's the automatic label GKE applied when it provisioned the ARM hardware. Our <code>nodeSelector</code> matched on this label to pin the pods here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/815312ea-e2bf-4106-863e-55cd0bdad5f7.png" alt="Terminal split into two sections: the top showing kubectl get pods -o wide with all pods scheduled on nodes containing axion-pool in the name, and the bottom showing kubectl get node with kubernetes.io/arch=arm64 in the labels output." style="display:block;margin:0 auto" width="2848" height="1500" loading="lazy">

<h3 id="heading-ask-the-application-itself">Ask the Application Itself</h3>
<p>This is the most satisfying verification step. Our Go server reports the architecture of the binary that's running. Let's ask it directly.</p>
<p>Use <code>kubectl port-forward</code> to create a secure tunnel from port 8080 on your local machine to port 8080 on the Deployment:</p>
<pre><code class="language-bash">kubectl port-forward deployment/hello-axion 8080:8080
</code></pre>
<p>This command stays running in the foreground — open a <strong>second terminal window</strong> and run:</p>
<pre><code class="language-bash">curl http://localhost:8080
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">Hello from freeCodeCamp!
Architecture : arm64
OS           : linux
Pod hostname : hello-axion-7b8d9f-abc12
</code></pre>
<p><code>Architecture : arm64</code>. That's our Go binary confirming that it was compiled for ARM64 and is executing on an ARM64 CPU. The single image tag we built does the right thing automatically.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/114ff82d-950f-4059-a1fa-89baffb90b6c.png" alt="Terminal output of curl http://localhost:8080 showing the four-line response: Hello from freeCodeCamp, Architecture: arm64, OS: linux, and the pod hostname." style="display:block;margin:0 auto" width="1042" height="292" loading="lazy">

<h3 id="heading-the-bonus-see-the-manifest-list-in-action">The Bonus: See the Manifest List in Action</h3>
<p>Want to see the multi-arch image indexing at work? Stop the port-forward, then run:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>You'll see four entries in the manifest list. Two are real images — <code>Platform: linux/amd64</code> and <code>Platform: linux/arm64</code>. The other two show <code>Platform: unknown/unknown</code> with an <code>attestation-manifest</code> annotation. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to every image — a supply chain security feature (SLSA attestation) that proves how and where the image was built.</p>
<p>You may notice that if you check the image digest recorded in a running pod:</p>
<pre><code class="language-bash">kubectl get pod POD_NAME \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
</code></pre>
<p>Replace <code>POD_NAME</code> with one of the pod names from earlier.</p>
<p>The digest returned matches the <strong>top-level manifest list digest</strong>, not the <code>arm64</code>-specific one. This is expected behaviour. Modern Kubernetes (using containerd) records the manifest list digest, not the resolved platform digest. The platform resolution already happened when the node pulled the correct image variant.</p>
<p>The definitive proof that the right binary is running is what you already have: the node labeled <code>kubernetes.io/arch=arm64</code> and the application reporting <code>Architecture: arm64</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/7dffe0c8-28cf-4a5d-8459-1e8db3da7dc0.png" alt="top-level manifest list digest" style="display:block;margin:0 auto" width="2302" height="1000" loading="lazy">

<h2 id="heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</h2>
<p>The hands-on work is done. Let's talk about why any of this is worth the effort.</p>
<h3 id="heading-the-cost-math">The Cost Math</h3>
<p>At the time of writing, here's how ARM compares to equivalent x86 machines on Google Cloud (prices are approximate and change over time — check the <a href="https://cloud.google.com/compute/vm-instance-pricing">official pricing page</a> before making decisions):</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>vCPU</th>
<th>Memory</th>
<th>Approx. $/hour</th>
</tr>
</thead>
<tbody><tr>
<td><code>n2-standard-4</code> (x86)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.19</td>
</tr>
<tr>
<td><code>t2a-standard-4</code> (Tau ARM)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.14</td>
</tr>
<tr>
<td><code>c4a-standard-4</code> (Axion)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.15</td>
</tr>
</tbody></table>
<p>That's a raw 25–30% reduction in compute cost per node. Factor in Google's published claim of up to 65% better price-performance for Axion on relevant workloads — meaning you may need fewer nodes to handle the same traffic — and the savings compound further.</p>
<p>Here's how that looks at scale, for a service running 20 nodes continuously for a year:</p>
<ul>
<li><p>20 × <code>n2-standard-4</code> × \(0.19 × 8,760 hours = <strong>\)33,288/year</strong></p>
</li>
<li><p>20 × <code>t2a-standard-4</code> × \(0.14 × 8,760 hours = <strong>\)24,528/year</strong></p>
</li>
</ul>
<p>That's roughly <strong>$8,760 saved annually</strong> on compute, before committed use discounts (which further widen the gap).</p>
<h3 id="heading-when-arm-is-the-right-choice">When ARM Is the Right Choice</h3>
<p>ARM works best for:</p>
<ul>
<li><p><strong>Stateless API servers and web applications</strong> — like the app we built. ARM excels at high-throughput, low-latency network workloads.</p>
</li>
<li><p><strong>Background workers and queue processors</strong> — long-running services that don't depend on x86-specific binaries.</p>
</li>
<li><p><strong>Microservices written in Go, Rust, or Python</strong> — these languages have excellent ARM64 support and are built cross-platform by default.</p>
</li>
</ul>
<h3 id="heading-when-to-proceed-carefully">When to Proceed Carefully</h3>
<ul>
<li><p><strong>Native library dependencies</strong> — some older C libraries, proprietary SDKs, or compiled ML model-serving runtimes don't have ARM64 builds. Always audit your dependency tree before migrating.</p>
</li>
<li><p><strong>CI pipelines need ARM too</strong> — your automated tests should run on ARM, not just x86. An image that silently fails only on ARM is harder to debug than one that never claimed ARM support.</p>
</li>
<li><p><strong>Profile before optimizing</strong> — the cost savings are real, but measure your actual workload behavior on ARM before committing. Not every workload benefits equally.</p>
</li>
</ul>
<h2 id="heading-cleanup">Cleanup</h2>
<p>When you're done, clean up to avoid ongoing charges:</p>
<pre><code class="language-bash"># Remove the Kubernetes resources from the cluster
kubectl delete -f k8s/

# Delete the ARM node pool
gcloud container node-pools delete axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the cluster itself
gcloud container clusters delete axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the images from Artifact Registry (optional — storage costs are minimal)
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Let's recap what you built and why each part matters.</p>
<p>You started with a Go application, a Dockerfile, and a <code>docker buildx build</code> command that produced two images — one for x86, one for ARM64 — wrapped in a single Manifest List tag. Any server that pulls that tag gets the right binary automatically, without you maintaining separate pipelines or separate tags.</p>
<p>You provisioned a GKE cluster with two node pools running different CPU architectures, then used <code>nodeSelector</code> to make sure your ARM-optimized workload lands only on the ARM Axion nodes — not on x86 by accident. The result is a deployment that's both architecture-correct and cost-efficient.</p>
<p>The patterns you practiced here don't stop at this demo. The same Dockerfile technique works for any language with cross-compilation support. The same <code>nodeSelector</code> approach works for any workload you want to pin to ARM. As more teams migrate services to ARM over the coming years, having these skills will be a real asset.</p>
<p><strong>Where to go from here:</strong></p>
<ul>
<li><p>Add a GitHub Actions workflow that runs <code>docker buildx build --platform linux/amd64,linux/arm64</code> on every push, automating this entire process in CI.</p>
</li>
<li><p>Audit one of your existing stateless services for ARM compatibility and try migrating it.</p>
</li>
<li><p>Explore <strong>Node Affinity</strong> as a softer alternative to <code>nodeSelector</code> for workloads that can run on either architecture but prefer ARM.</p>
</li>
<li><p>Look into <strong>GKE Autopilot</strong>, which now supports ARM nodes and handles node pool management automatically.</p>
</li>
</ul>
<p>Happy building.</p>
<h2 id="heading-project-file-structure">Project File Structure</h2>
<pre><code class="language-plaintext">hello-axion/
├── app/
│   ├── main.go          — Go HTTP server
│   ├── go.mod           — Go module definition
│   └── Dockerfile       — Multi-stage Dockerfile
└── k8s/
    ├── deployment.yaml  — Deployment with nodeSelector and probes
    └── service.yaml     — LoadBalancer Service
</code></pre>
<p>All source files for this tutorial are available in the companion GitHub repository: <a href="https://github.com/Amiynarh/multi-arch-docker-gke-arm">https://github.com/Amiynarh/multi-arch-docker-gke-arm</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Self-Host Your Own Server Monitoring Dashboard Using Uptime Kuma and Docker ]]>
                </title>
                <description>
                    <![CDATA[ As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains. And while many SaaS tools can  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/self-host-uptime-kuma-docker/</link>
                <guid isPermaLink="false">69d4185f40c9cabf44851652</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ monitoring ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Ubuntu ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:32:31 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ea068a20-bc19-400a-a42e-1bbb7e492da8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains.</p>
<p>And while many SaaS tools can monitor your site, they often charge high monthly fees for simple alerts.</p>
<p>My goal with this article is to help you stop paying those expensive fees by showing you a powerful, free, open-source alternative called Uptime Kuma.</p>
<p>In this guide, you'll learn how to use Docker to deploy Uptime Kuma safely on a local Ubuntu machine.</p>
<p>By the end of this tutorial, you'll have set up your own private server monitoring dashboard in less than 10 minutes and created an automated Discord alert to ping your phone if your website goes offline.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-update-packages-and-prepare-the-firewall">Step 1: Update Packages and Prepare the Firewall</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-docker-compose-file">Step 2: Create the Docker Compose File</a></p>
</li>
<li><p><a href="#heading-step-3-start-the-application">Step 3: Start the Application</a></p>
</li>
<li><p><a href="#heading-step-4-access-the-dashboard">Step 4: Access the Dashboard</a></p>
</li>
<li><p><a href="#heading-step-5-use-case-monitor-a-website-and-send-discord-alerts">Step 5: Use Case – Monitor a Website and Send Discord Alerts</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>An Ubuntu machine (like a local server, VM, or desktop).</p>
</li>
<li><p>Docker and Docker Compose installed.</p>
</li>
<li><p>Basic knowledge of the Linux terminal.</p>
</li>
</ul>
<h2 id="heading-step-1-update-packages-and-prepare-the-firewall">Step 1: Update Packages and Prepare the Firewall</h2>
<p>First, you'll want to make sure your system has the newest updates. Then, you'll install the Uncomplicated Firewall (UFW) and open the network "door" (port) that Uptime Kuma uses for the dashboard. You'll also need to allow SSH so you don't lock yourself out.</p>
<p>Run these commands in your terminal:</p>
<ol>
<li>Update your packages:</li>
</ol>
<pre><code class="language-shell">sudo apt update &amp;&amp; sudo apt upgrade -y
</code></pre>
<ol>
<li>Install the firewall:</li>
</ol>
<pre><code class="language-shell">sudo apt install ufw -y
</code></pre>
<ol>
<li>Allow SSH and open port 3001:</li>
</ol>
<pre><code class="language-shell">sudo ufw allow ssh
sudo ufw allow 3001/tcp
</code></pre>
<ol>
<li>Enable the firewall:</li>
</ol>
<pre><code class="language-shell">sudo ufw enable
sudo ufw reload
</code></pre>
<h2 id="heading-step-2-create-the-docker-compose-file">Step 2: Create the Docker Compose File</h2>
<p>Using a <code>docker-compose.yml</code> file is the professional way to manage Docker containers. It keeps your setup organised in one single place.</p>
<p>To start, create a new folder for your project and enter it:</p>
<pre><code class="language-shell">mkdir uptime-kuma &amp;&amp; cd uptime-kuma
</code></pre>
<p>Then create the configuration file:</p>
<pre><code class="language-shell">nano docker-compose.yml
</code></pre>
<p>Paste the following code into the editor:</p>
<pre><code class="language-yaml">services:
  uptime-kuma:
    image: louislam/uptime-kuma:2
    restart: unless-stopped
    volumes:
      - ./data:/app/data
    ports:
      - "3001:3001"
</code></pre>
<p><strong>Note</strong>: The <code>./data:/app/data</code> line is very important. It saves your database in a normal folder on your machine, making it easy to back up later.</p>
<p>Finally, save and exit: Press <code>CTRL + X</code>, then <code>Y</code>, then <code>Enter</code>.</p>
<h2 id="heading-step-3-start-the-application">Step 3: Start the Application</h2>
<p>Now, tell Docker to read your file and start the monitoring service in the background.</p>
<pre><code class="language-shell">docker compose up -d
</code></pre>
<p><strong>How to verify:</strong> Docker will download the files. When it finishes, your terminal should print <code>Started uptime-kuma</code>.</p>
<h2 id="heading-step-4-access-the-dashboard">Step 4: Access the Dashboard</h2>
<p>To access the dashboard, first open your web browser and go to <code>http://localhost:3001</code> (or your machine's local IP address).</p>
<p>When asked to choose the database, select <strong>SQLite</strong>. It's simple, fast, and requires no extra setup.</p>
<p>Then create an account and choose a secure admin username and password.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/02913589-020e-4a8a-aa7a-1bf70a9244c6.png" alt="02913589-020e-4a8a-aa7a-1bf70a9244c6" style="display:block;margin:0 auto" width="908" height="851" loading="lazy">

<h2 id="heading-step-5-use-case-monitor-a-website-and-send-discord-alerts">Step 5: Use Case – Monitor a Website and Send Discord Alerts</h2>
<p>Now you'll put Uptime Kuma to work by monitoring a live website and setting up an alert. Just follow these steps:</p>
<ol>
<li><p>Click Add New Monitor.</p>
</li>
<li><p>Set the Monitor Type to <code>HTTP(s)</code>.</p>
</li>
<li><p>Give it a Friendly Name (e.g., "My Blog") and enter your website's URL.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/74567f1e-acc4-480f-b969-7883e01aa459.png" alt="74567f1e-acc4-480f-b969-7883e01aa459" style="display:block;margin:0 auto" width="1918" height="867" loading="lazy">

<h3 id="heading-pro-tip-how-to-fix-down-errors-bot-protection">Pro-Tip: How to Fix "Down" Errors (Bot Protection)</h3>
<p>If your site uses strict security, it might block Uptime Kuma and say your site is "Down" with a 403 Forbidden error.</p>
<p><strong>The Fix:</strong> Scroll down to Advanced, find the User Agent box, and paste this text to make Uptime Kuma look like a normal Chrome browser:</p>
<p><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36</code></p>
<h3 id="heading-add-a-discord-alert">Add a Discord Alert</h3>
<p>To get a message on your phone when your site goes down:</p>
<ol>
<li><p>On the right side of the monitor screen, click Setup Notification.</p>
</li>
<li><p>Select Discord from the dropdown list.</p>
</li>
<li><p>Paste a Discord Webhook URL (you can create one in your Discord server settings under Integrations).</p>
</li>
<li><p>Click Test to receive a test ping, then click Save.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You just took control of your server health. By deploying Uptime Kuma, you replaced an expensive SaaS subscription with a powerful, free monitoring tool that alerts you the second a project goes offline.</p>
<p><strong>Let’s connect!</strong> I am a developer and technical writer specialising in writing step-by-step guides and workflows. You can find my latest projects on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me directly on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Model Packaging Tools Every MLOps Engineer Should Know ]]>
                </title>
                <description>
                    <![CDATA[ Most machine learning deployments don’t fail because the model is bad. They fail because of packaging. Teams often spend months fine-tuning models (adjusting hyperparameters and improving architecture ]]>
                </description>
                <link>https://www.freecodecamp.org/news/model-packaging-tools-every-mlops-engineer-should-know/</link>
                <guid isPermaLink="false">69d3ca7840c9cabf443c9ce3</guid>
                
                    <category>
                        <![CDATA[ ML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Temitope Oyedele ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 15:00:08 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4fa02714-2cea-4592-813e-a5d5ebaf0842.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most machine learning deployments don’t fail because the model is bad. They fail because of packaging.</p>
<p>Teams often spend months fine-tuning models (adjusting hyperparameters and improving architectures) only to hit a wall when it’s time to deploy. Suddenly, the production system can’t even read the model file. Everything breaks at the handoff between research and production.</p>
<p>The good news? If you think about packaging from the start, you can save up to 60% of the time usually spent during deployment. That’s because you avoid the common friction between the experimental environment and the production system.</p>
<p>In this guide, we’ll walk through eleven essential tools every MLOps engineer should know. To keep things clear, we’ll group them into three stages of a model’s lifecycle:</p>
<ul>
<li><p><strong>Serialization</strong>: how models are stored and transferred</p>
</li>
<li><p><strong>Bundling &amp; Serving</strong>: how models are deployed and run</p>
</li>
<li><p><strong>Registry</strong>: how models are tracked and versioned</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table Of Contents</h2>
<ul>
<li><p><a href="#heading-model-serialization-formats">Model Serialization Formats</a></p>
<ul>
<li><p><a href="#heading-1-onnx-open-neural-network-exchangehttpsonnxai">1. ONNX (Open Neural Network Exchange)</a></p>
</li>
<li><p><a href="#heading-2-torchscripthttpsdocspytorchorgdocsstabletorchcompilerapihtml">2. TorchScript</a></p>
</li>
<li><p><a href="#heading-3-tensorflow-savedmodelhttpswwwtensorfloworgguidesavedmodel">3. TensorFlow SavedModel</a></p>
</li>
<li><p><a href="#heading-4-picklehttpsdocspythonorg3librarypicklehtmlle-joblibhttpsjoblibreadthedocsioenstable">4. Picklele / Joblib</a></p>
</li>
<li><p><a href="#heading-5-safetensorshttpsgithubcomhuggingfacesafetensors">5. Safetensors</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-model-bundling-and-serving-tools">Model Bundling and Serving Tools</a></p>
<ul>
<li><p><a href="#heading-1-bentomlhttpsdocsbentomlcomenlatest">1. BentoML</a></p>
</li>
<li><p><a href="#heading-2-nvidia-triton-inference-serverhttpsgithubcomtriton-inference-serverserver">2. NVIDIA Triton Inference Server</a></p>
</li>
<li><p><a href="#heading-3-torchservehttpsdocspytorchorgserverve">3. TorchServerve</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-model-registries">Model Registries</a></p>
<ul>
<li><p><a href="#heading-1-mlflow-model-registryhttpsmlfloworgdocslatestmlmodel-registry">1. MLflow Model Registry</a></p>
</li>
<li><p><a href="#heading-2-hugging-face-hubhttpshuggingfacecodocshubindex">2. Hugging Face Hub</a></p>
</li>
<li><p><a href="#heading-3-weights-amp-biaseshttpsdocswandbaimodels">3. Weights &amp; Biases</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-model-serialization-formats">Model Serialization Formats</h2>
<p>Serialization is simply the process of turning a trained model into a file that can be stored and moved around. It’s the first step in the pipeline, and it matters more than people think. The format you choose determines how your model will be loaded later in production.</p>
<p>So, you want something that either works across different frameworks or is optimized for the environment where your model will eventually run.</p>
<p>Below are some of the most common tools in this space:</p>
<h3 id="heading-1-onnx-open-neural-network-exchange"><a href="https://onnx.ai/">1. ONNX (Open Neural Network Exchange)</a></h3>
<p>ONNX is basically the common language for model serialization. It lets you train a model in one framework, like PyTorch, and then deploy it somewhere else without running into compatibility issues. It also performs well across different types of hardware.</p>
<p>ONNX separates your training framework from your inference runtime and allows hardware-level optimizations like quantization and graph fusion. It’s also widely supported across cloud platforms and edge devices.</p>
<p><strong>Key considerations:</strong> This format makes it possible to decouple training from deployment, while still enabling performance optimizations across different hardware setups.</p>
<p><strong>When to use it:</strong> Use ONNX when you need portability –&nbsp;especially if different teams or environments are involved.</p>
<h3 id="heading-2-torchscript"><a href="https://docs.pytorch.org/docs/stable/torch.compiler_api.html">2. TorchScript</a></h3>
<p>TorchScript lets you compile PyTorch models into a format that can run without Python. That means you can deploy it in environments like C++ or mobile without carrying the full Python runtime.</p>
<p>It supports two approaches: tracing (recording execution with sample inputs) and scripting (capturing full control flow).</p>
<p><strong>Key considerations:</strong> Its biggest advantage is removing the Python dependency, which helps reduce latency and makes it suitable for more constrained environments.</p>
<p><strong>When to use it:</strong> Best for high-performance systems where Python would be too heavy or introduce security concerns.</p>
<h3 id="heading-3-tensorflow-savedmodel"><a href="https://www.tensorflow.org/guide/saved_model">3. TensorFlow SavedModel</a></h3>
<p>SavedModel is TensorFlow’s native format. It stores everything –&nbsp;the computation graph, weights, and serving logic – in a single directory.</p>
<p>It’s also the standard input format for TensorFlow Serving, TFLite, and Google Cloud AI Platform.</p>
<p><strong>Key considerations:</strong> It keeps everything within the TensorFlow ecosystem intact, so you don’t lose any part of the model when moving to production.</p>
<p><strong>When to use it:</strong> If your project is built on TensorFlow, this is the default and safest choice.</p>
<h3 id="heading-4-pickle-and-joblib">4. &nbsp;<a href="https://docs.python.org/3/library/pickle.html">Pickle</a> and <a href="https://joblib.readthedocs.io/en/stable/">Joblib</a></h3>
<p>Pickle is Python’s built-in way of saving objects, and Joblib builds on top of it to better handle large arrays and models.</p>
<p>These are commonly used for scikit-learn pipelines, XGBoost models, and other traditional ML setups.</p>
<p><strong>Key considerations:</strong> They’re simple and convenient, but come with real trade-offs. Pickle can execute arbitrary code when loading, which makes it unsafe in untrusted environments. It’s also tightly coupled to Python versions and library dependencies, so models can break when moved across environments.</p>
<p><strong>When to use it:</strong> Best suited for controlled environments where everything runs in the same Python stack, such as internal tools, quick prototypes, or batch jobs.</p>
<p>It’s especially practical when you’re working with classical ML models and don’t need cross-language support or long-term portability. Avoid it for production systems that require security, reproducibility, or deployment across different environments.</p>
<h3 id="heading-5-safetensors"><a href="https://github.com/huggingface/safetensors">5. Safetensors</a></h3>
<p>Safetensors is a newer format developed by Hugging Face. It’s designed to be safe, fast, and straightforward.</p>
<p>It avoids arbitrary code execution and allows efficient loading directly from disk.</p>
<p><strong>Key considerations:</strong> It’s both memory-efficient and secure, which makes it a strong alternative to older formats like Pickle.</p>
<p><strong>When to use it:</strong> Ideal for modern workflows where speed and safety are important.</p>
<h2 id="heading-model-bundling-and-serving-tools">Model Bundling and Serving Tools</h2>
<p>Once your model is saved, the next step is making it usable in production. That means wrapping it in a way that can handle requests and connect it to the rest of your system.</p>
<h3 id="heading-1-bentoml"><a href="https://docs.bentoml.com/en/latest/">1. BentoML</a></h3>
<p>BentoML allows you to define your model service in Python – including preprocessing, inference, and postprocessing – and package everything into a single unit called a “Bento.”</p>
<p>This bundle includes the model, code, dependencies, and even Docker configuration.</p>
<p><strong>Key considerations</strong>: It simplifies deployment by packaging everything into one consistent artifact that can run anywhere.</p>
<p><strong>When to use it</strong>: Great when you want to ship your model and all its logic together as one deployable unit.</p>
<h3 id="heading-2-nvidia-triton-inference-server"><a href="https://github.com/triton-inference-server/server">2. NVIDIA Triton Inference Server</a></h3>
<p>Triton is NVIDIA’s production-grade inference server. It supports multiple model formats like ONNX, TorchScript, TensorFlow, and more.</p>
<p>It’s built for performance, using features like dynamic batching and concurrent execution to fully utilize GPUs.</p>
<p><strong>Key considerations:</strong> It delivers high throughput and efficiently uses hardware, especially GPUs, while supporting models from different frameworks.</p>
<p><strong>When to use it:</strong> Best for large-scale deployments where performance, low latency, and GPU usage are critical.</p>
<h3 id="heading-3-torchserve"><a href="https://docs.pytorch.org/serve/">3. TorchServe</a></h3>
<p>TorchServe is the official serving tool for PyTorch, developed with AWS.</p>
<p>It packages models into a MAR file, which includes weights, code, and dependencies, and provides APIs for managing models in production.</p>
<p><strong>Key considerations:</strong> It offers built-in features for versioning, batching, and management without needing to build everything from scratch.</p>
<p><strong>When to use it:</strong> A solid choice for deploying PyTorch models in a standard production setup.</p>
<h2 id="heading-model-registries">Model Registries</h2>
<p>A model registry is essentially your source of truth. It stores your models, tracks versions, and manages their lifecycle from experimentation to production.</p>
<p>Without one, things quickly become messy and hard to track.</p>
<h3 id="heading-1-mlflow-model-registry"><a href="https://mlflow.org/docs/latest/ml/model-registry/">1. MLflow Model Registry</a></h3>
<p>MLflow is one of the most widely used MLOps platforms. Its registry helps manage model versions and track their progression through stages like Staging and Production.</p>
<p>It also links models back to the experiments that created them.</p>
<p><strong>Key considerations:</strong> It provides strong lifecycle management and makes it easier to track and audit models.</p>
<p><strong>When to use it:</strong> Ideal for teams that need structured workflows and clear governance.</p>
<h3 id="heading-2-hugging-face-hub"><a href="https://huggingface.co/docs/hub/index">2. Hugging Face Hub</a></h3>
<p>The Hugging Face Hub is one of the largest platforms for sharing and managing models.</p>
<p>It supports both public and private repositories, along with dataset versioning and interactive demos.</p>
<p><strong>Key considerations:</strong> It offers a huge library of models and makes collaboration very easy.</p>
<p><strong>When to use it:</strong> Perfect for projects involving transformers, generative AI, or anything that benefits from sharing and discovery.</p>
<h3 id="heading-3-weights-and-biases"><a href="https://docs.wandb.ai/models">3. Weights and Biases</a></h3>
<p>Weights &amp; Biases combines experiment tracking with a model registry.</p>
<p>It connects each model directly to the training run that produced it.</p>
<p><strong>Key considerations:</strong> It gives you full traceability, so you always know how a model was created.</p>
<p><strong>When to use it:</strong> Best when you want a strong link between experimentation and production artifacts.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Machine learning systems rarely fail because the models are bad. They fail because the path to production is fragile.</p>
<p>Packaging is what connects research to production. If that connection is weak, even great models won’t make it into real use.</p>
<p>Choosing the right tools across serialization, serving, and registry layers makes systems easier to deploy and maintain. Formats like ONNX and Safetensors improve portability and safety. Tools like Triton and BentoML help with reliable serving. Registries like MLflow and Hugging Face Hub keep everything organized.</p>
<p>The main idea is simple: don’t leave deployment as something to figure out later.</p>
<p>When packaging is planned early, teams move faster and avoid a lot of unnecessary problems.</p>
<p>In practice, success in MLOps isn’t just about building models. It’s about making sure they actually run in the real world.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator ]]>
                </title>
                <description>
                    <![CDATA[ If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently? Storing them is straightforward. But handling rotation, stale env vars, and the gap ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-sync-aws-secrets-manager-secrets-into-kubernetes-with-the-external-secrets-operator/</link>
                <guid isPermaLink="false">69c541f010e664c5dadc877e</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cybersecurity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ secrets management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Thu, 26 Mar 2026 14:25:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6cca126e-dd50-4400-ae9d-65449581345b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?</p>
<p>Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.</p>
<p>In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.</p>
<p>By the end, you'll be able to:</p>
<ul>
<li><p>Explain the full architecture from vault to pod</p>
</li>
<li><p>Run the lab locally in about 15 minutes</p>
</li>
<li><p>Prove why environment variables go stale after rotation, while mounted secret files stay fresh</p>
</li>
<li><p>Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD</p>
</li>
<li><p>Troubleshoot the most common failures</p>
</li>
</ul>
<p>Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/ac8bfc9e-304e-41b8-b6a3-7ce1795b29a9.png" alt="Architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds." style="display:block;margin:0 auto" width="1024" height="1536" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-local-lab">How to Run the Local Lab</a></p>
</li>
<li><p><a href="#heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</a></p>
</li>
<li><p><a href="#heading-how-to-test-secret-rotation">How to Test Secret Rotation</a></p>
</li>
<li><p><a href="#heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</a></p>
</li>
<li><p><a href="#heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</a></p>
</li>
<li><p><a href="#heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following tools installed and configured.</p>
<p><strong>For the local lab:</strong></p>
<ul>
<li><p>An AWS account with access to AWS Secrets Manager</p>
</li>
<li><p>The AWS CLI installed and configured. Run <code>aws configure</code> and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.</p>
</li>
<li><p><code>kubectl</code> installed. For Microk8s, run <code>microk8s kubectl config view --raw &gt; ~/.kube/config</code> after installation to connect kubectl to your local cluster.</p>
</li>
<li><p>Terraform installed</p>
</li>
<li><p>Helm installed</p>
</li>
<li><p>Docker installed</p>
</li>
<li><p>A local Kubernetes cluster: the lab supports Microk8s and kind. If you do not have either installed, follow the <a href="https://microk8s.io/">Microk8s install guide</a> before continuing.</p>
</li>
</ul>
<p><strong>For the Amazon Elastic Kubernetes Service sections:</strong></p>
<ul>
<li><p>An Amazon Elastic Kubernetes Service cluster you can create or manage</p>
</li>
<li><p>A GitHub repository you can configure for workflows and secrets</p>
</li>
</ul>
<p>The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-LOCAL.md"><code>docs/DEPLOY-LOCAL.md</code></a> and <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-EKS.md"><code>docs/DEPLOY-EKS.md</code></a>.</p>
<h2 id="heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</h2>
<p>Before you run any command, you need to understand how the pieces connect.</p>
<p>The flow has four stages:</p>
<ol>
<li><p>A developer or automated system updates a secret in AWS Secrets Manager.</p>
</li>
<li><p>The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.</p>
</li>
<li><p>Your pod reads that Kubernetes Secret.</p>
</li>
<li><p>During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/9dc52f99-add4-490a-ad86-25a30d0ae306.png" alt="A step-by-step flow diagram showing the four stages of secret flow above" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-how-the-external-secrets-operator-sync-works">How the External Secrets Operator Sync Works</h3>
<p>The External Secrets Operator reads a custom Kubernetes resource called <code>ExternalSecret</code>. That resource tells the operator three things:</p>
<ul>
<li><p>Which secret store to connect to</p>
</li>
<li><p>Which Kubernetes Secret name to create or update</p>
</li>
<li><p>How often to refresh</p>
</li>
</ul>
<p>In this lab, the <code>ExternalSecret</code> creates a Kubernetes Secret named <code>myapp-database-creds</code>. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.</p>
<h3 id="heading-how-the-app-consumes-secrets">How the App Consumes Secrets</h3>
<p>The sample application exposes three endpoints so you can validate behavior at any time.</p>
<ul>
<li><p><code>/secrets/env</code> shows what environment variables the pod sees</p>
</li>
<li><p><code>/secrets/volume</code> shows what files in the mounted secret directory look like</p>
</li>
<li><p><code>/secrets/compare</code> compares both and reports whether rotation has been detected</p>
</li>
</ul>
<p>The app checks four keys: <code>DB_USERNAME</code>, <code>DB_PASSWORD</code>, <code>DB_HOST</code>, and <code>DB_PORT</code>.</p>
<h2 id="heading-how-to-run-the-local-lab">How to Run the Local Lab</h2>
<p>The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.</p>
<h3 id="heading-step-1-clone-the-repo">Step 1: Clone the Repo</h3>
<pre><code class="language-bash">git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab
</code></pre>
<h3 id="heading-step-2-run-the-spin-up-script">Step 2: Run the Spin-Up Script</h3>
<pre><code class="language-bash">bash spinup.sh
</code></pre>
<p>The script will ask you to choose a local cluster type. Pick Microk8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.</p>
<p>If the script fails at any point, check <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/TROUBLESHOOTING.md"><code>docs/TROUBLESHOOTING.md</code></a> before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a Microk8s storage add-on that is not enabled.</p>
<h3 id="heading-important-run-the-lab-ui">Important: Run the Lab UI</h3>
<p>The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application, it's a React-based checklist at <code>lab-ui/</code> that walks you through each concept and checkpoint as you work through the lab.</p>
<p>To start it, open a second terminal and run:</p>
<pre><code class="language-bash">cd lab-ui &amp;&amp; npm install &amp;&amp; npm run dev
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/873166e9-6bff-4e56-a18d-e58b9e9a5af9.png" alt="Screenshot of npm run dev lab ui terminal" style="display:block;margin:0 auto" width="849" height="435" loading="lazy">

<p>Then open <a href="http://localhost:5173"><code>http://localhost:5173</code></a>. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5a5b220b-3f23-4c7c-8388-f2e23d122e2c.png" alt="Screenshot of The Lab UI, a guided tutorial interface that runs alongside the lab and walks you through each concept and checkpoint." style="display:block;margin:0 auto" width="1399" height="1287" loading="lazy">

<p>Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (<code>localhost:3000</code>) are two separate things, the UI guides you through the steps, the app shows you the live secrets.</p>
<h3 id="heading-step-3-access-the-application">Step 3: Access the Application</h3>
<p>Once the lab finishes, port-forward the service.</p>
<pre><code class="language-bash">kubectl port-forward svc/myapp 3000:80 -n default
</code></pre>
<p>Open <code>http://localhost:3000</code>. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/dbe122ac-b787-40d0-96f4-4b1276bab017.png" alt="Screenshot of the running application at localhost:3000. Every row in the table should show &quot;Match ✓" style="display:block;margin:0 auto" width="1213" height="902" loading="lazy">

<h3 id="heading-step-4-validate-that-secrets-match">Step 4: Validate That Secrets Match</h3>
<p>Run the compare endpoint directly from the terminal.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>When everything is working, the response will include <code>"all_match": true</code>.</p>
<h2 id="heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</h2>
<p>At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.</p>
<h3 id="heading-step-1-read-the-externalsecret-manifest">Step 1: Read the ExternalSecret Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/external-secret.yaml"><code>k8s/aws/external-secret.yaml</code></a>. Focus on these four fields:</p>
<ul>
<li><p><code>refreshInterval</code>: how often the operator polls AWS Secrets Manager</p>
</li>
<li><p><code>secretStoreRef</code>: which store the operator authenticates against</p>
</li>
<li><p><code>target</code>: the name of the Kubernetes Secret to create</p>
</li>
<li><p><code>data</code>: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys</p>
</li>
</ul>
<p>Here is what that mapping looks like in this lab:</p>
<pre><code class="language-yaml">spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username
</code></pre>
<p>The <code>property</code> field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.</p>
<p>Two fields here are worth understanding before you move on. <code>creationPolicy: Owner</code> means the operator owns the Kubernetes Secret it creates. If you delete the <code>ExternalSecret</code>, the Secret is deleted too. <code>ClusterSecretStore</code> is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain <code>SecretStore</code> is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.</p>
<h3 id="heading-step-2-read-the-deployment-manifest">Step 2: Read the Deployment Manifest</h3>
<p>Open <a href="http://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/deployment.yaml"><code>k8s/aws/deployment.yaml</code></a>. You are looking for two sections: <code>envFrom</code> and <code>volumeMounts</code>.</p>
<pre><code class="language-yaml">envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true
</code></pre>
<p>Both paths read from the same Kubernetes Secret, <code>myapp-database-creds</code>. The <code>envFrom</code> block injects all keys as environment variables at pod start.<br>The <code>volumeMounts</code> block mounts the same secret as files under <code>/etc/secrets</code>.</p>
<p>This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.</p>
<h3 id="heading-step-3-read-the-app-comparison-logic">Step 3: Read the App Comparison Logic</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/app/server.js"><code>app/server.js</code></a>. The comparison logic reads environment variables from <code>process.env</code> and reads mounted secret files from <code>/etc/secrets/&lt;key&gt;</code>. Then it computes a per-key match and a global <code>all_match</code> value.</p>
<p>The <code>/secrets/compare</code> endpoint sets <code>rotation_detected: true</code> when any key differs between env and volume.</p>
<h2 id="heading-how-to-test-secret-rotation">How to Test Secret Rotation</h2>
<p>Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.</p>
<h3 id="heading-how-the-rotation-gap-works"><strong>How the Rotation Gap Works</strong></h3>
<p>When a pod starts, Kubernetes gives it two ways to read a secret.</p>
<p>The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.</p>
<p>The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.</p>
<p>Same secret, two paths. One goes stale while one stays fresh.</p>
<p>The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.</p>
<p>That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.</p>
<p>Here is what you're about to observe in the lab:</p>
<ul>
<li><p>The rotation script updates the secret in AWS</p>
</li>
<li><p>ESO syncs the new value into Kubernetes within seconds</p>
</li>
<li><p>The volume file updates automatically</p>
</li>
<li><p>The environment variable stays stale until the pod restarts</p>
</li>
<li><p>The <code>/secrets/compare</code> endpoint shows both values side by side so you can see the gap live</p>
</li>
</ul>
<h3 id="heading-step-1-confirm-the-lab-is-ready">Step 1: Confirm the Lab Is Ready</h3>
<p>Make sure your pod and the External Secrets Operator are both running before you start.</p>
<pre><code class="language-bash">kubectl get pods -n external-secrets
kubectl get pods -n default
</code></pre>
<p>Both should show <code>Running</code>.</p>
<h3 id="heading-step-2-run-the-rotation-test-script">Step 2: Run the Rotation Test Script</h3>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>The script performs these actions in order:</p>
<ol>
<li><p>Reads the current <code>DB_PASSWORD</code> from the volume mount at <code>/etc/secrets/DB_PASSWORD</code></p>
</li>
<li><p>Reads the current <code>DB_PASSWORD</code> from the environment variable</p>
</li>
<li><p>Updates AWS Secrets Manager with a new password using <code>put-secret-value</code></p>
</li>
<li><p>Forces an immediate ESO sync by annotating the <code>ExternalSecret</code> with <code>force-sync</code></p>
</li>
<li><p>Reads the volume value again</p>
</li>
<li><p>Reads the environment variable again</p>
</li>
</ol>
<p>After the script runs, the volume and the env var will show different values.</p>
<h3 id="heading-step-3-validate-with-the-compare-endpoint">Step 3: Validate With the Compare Endpoint</h3>
<p>Hit the compare endpoint and look at the output.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>You'll see something like this:</p>
<pre><code class="language-json">{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c4ebb09f-e605-4f68-8e12-1361d94199b2.png" alt="Rotation mismatch, the volume file updated with the new password but the env var still holds the old value from pod startup." style="display:block;margin:0 auto" width="832" height="290" loading="lazy">

<h3 id="heading-step-4-restart-the-deployment-to-sync-env-vars">Step 4: Restart the Deployment to Sync Env Vars</h3>
<p>Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.</p>
<pre><code class="language-bash">kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default
</code></pre>
<p>Then hit <code>/secrets/compare</code> again. All rows should now show <code>"all_match": true</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0040274d-a398-408c-9486-ce0a9e527479.png" alt="After a rolling restart, new pods pick up fresh env vars and all keys match." style="display:block;margin:0 auto" width="821" height="436" loading="lazy">

<h3 id="heading-how-to-automate-restarts-with-reloader">How to Automate Restarts With Reloader</h3>
<p>If you don't want to restart deployments manually after every rotation, you can install <a href="https://github.com/stakater/reloader"><strong>Stakater Reloader</strong></a>. It watches an annotation on the <code>Deployment</code> and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.</p>
<h2 id="heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</h2>
<p>Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the <a href="https://secrets-store-csi-driver.sigs.k8s.io/">Secrets Store CSI Driver</a>.</p>
<p>Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>External Secrets Operator</th>
<th>Secrets Store CSI Driver</th>
</tr>
</thead>
<tbody><tr>
<td>Creates a Kubernetes Secret</td>
<td>Yes</td>
<td>No by default</td>
</tr>
<tr>
<td>Supports <code>envFrom</code></td>
<td>Yes</td>
<td>No (workaround only)</td>
</tr>
<tr>
<td>Secret stored in etcd</td>
<td>Yes (base64)</td>
<td>No, if you skip sync</td>
</tr>
<tr>
<td>Rotation</td>
<td>ESO updates the Secret, Reloader restarts pods</td>
<td>Volume file can update in place</td>
</tr>
<tr>
<td>Best for</td>
<td>Most teams. Multi-cloud, env var support</td>
<td>Security policies that prohibit secrets in etcd</td>
</tr>
</tbody></table>
<p>This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both <code>envFrom</code> and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.</p>
<p>Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native <code>envFrom</code> model.</p>
<h2 id="heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</h2>
<p>The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.</p>
<h3 id="heading-step-1-prepare-terraform-and-openid-connect-access">Step 1: Prepare Terraform and OpenID Connect Access</h3>
<p>The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder.</p>
<pre><code class="language-bash">cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn
</code></pre>
<p>Copy the role ARN from the output. You'll need it in the next step.</p>
<h3 id="heading-step-2-set-the-required-environment-variable">Step 2: Set the Required Environment Variable</h3>
<p>The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.</p>
<p>To find your AWS account ID, run:</p>
<pre><code class="language-bash">aws sts get-caller-identity --query Account --output text
</code></pre>
<p>Then set the variable, replacing <code>ACCOUNT</code> with the number that command returns.</p>
<pre><code class="language-bash">export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role
</code></pre>
<h3 id="heading-step-3-run-the-spin-up-script-for-amazon-elastic-kubernetes-service">Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service</h3>
<pre><code class="language-bash">bash spinup.sh --cluster eks
</code></pre>
<p>When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing <code>Match ✓</code>.</p>
<h3 id="heading-step-4-test-rotation-on-the-deployed-app">Step 4: Test Rotation on the Deployed App</h3>
<p>After you confirm normal operation, run the rotation test the same way you did locally.</p>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>Then use <code>/secrets/compare</code> on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.</p>
<p>⚠️ <strong>Cost warning:</strong> Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/teardown.sh"><code>bash teardown.sh</code></a> from the repo root to destroy all AWS resources and stop charges.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/56f05ace-9ab6-4b67-ade6-a0bd1fa3962c.png" alt="Screenshot of the app running on the ALB URL, showing all keys matched" style="display:block;margin:0 auto" width="912" height="891" loading="lazy">

<h2 id="heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</h2>
<p>The typical CI/CD setup stores <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> in GitHub repository secrets. Those keys never rotate. Anyone with repo access can read them. When someone leaves the team, you have to revoke keys and update every workflow.</p>
<p>OpenID Connect eliminates that problem entirely.</p>
<h3 id="heading-how-openid-connect-works-for-github-actions">How OpenID Connect Works for GitHub Actions</h3>
<p>GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via <code>AssumeRoleWithWebIdentity</code>. No long-lived keys are ever stored anywhere.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/48e72210-a669-440e-b42e-81b0c15746ec.png" alt="The full OIDC authentication flow for GitHub Actions deploying to EKS — from minting the JWT token through AssumeRoleWithWebIdentity to temporary credentials, kubeconfig retrieval, and final kubectl apply steps." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-step-1-create-the-iam-role-with-terraform">Step 1: Create the IAM Role With Terraform</h3>
<p>The <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.</p>
<h3 id="heading-step-2-add-the-role-arn-to-github-repository-secrets">Step 2: Add the Role ARN to GitHub Repository Secrets</h3>
<p>In your GitHub repository:</p>
<ol>
<li><p>Go to Settings → Secrets and variables → Actions</p>
</li>
<li><p>Click New repository secret</p>
</li>
<li><p>Name it <code>AWS_ROLE_ARN</code></p>
</li>
<li><p>Paste the role ARN from the Terraform output</p>
</li>
</ol>
<p>That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.</p>
<h3 id="heading-step-3-configure-terraform-state">Step 3: Configure Terraform State</h3>
<p>For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.</p>
<h3 id="heading-step-4-push-to-main-and-let-workflows-run">Step 4: Push to Main and Let Workflows Run</h3>
<p>After your first spin-up, every push to the <code>main</code> branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use <code>/secrets/compare</code> to validate rotation behavior on the live environment.</p>
<h2 id="heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</h2>
<p>Here's a shortlist of the most common symptoms and their fixes.</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Most Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td><code>ExternalSecret</code> is not syncing</td>
<td>Missing credentials or wrong store reference</td>
<td>Confirm the operator can access AWS Secrets Manager and that <code>secretStoreRef</code> points to the correct store</td>
</tr>
<tr>
<td>Pod is stuck in <code>Pending</code></td>
<td>Missing storage setup for local cluster</td>
<td>For Microk8s, enable the storage add-on</td>
</tr>
<tr>
<td>Env and volume still match after rotation</td>
<td>Rotation happened but the pod never restarted</td>
<td>Run <code>kubectl rollout restart</code> or install Reloader</td>
</tr>
<tr>
<td>CRD or API version mismatch</td>
<td>ESO version and manifest <code>apiVersion</code> don't match</td>
<td>Verify the <code>apiVersion</code> for <code>ClusterSecretStore</code> and <code>ExternalSecret</code> match your installed ESO version</td>
</tr>
<tr>
<td>Amazon Elastic Kubernetes Service node group never joins</td>
<td>Networking or IAM permissions for nodes are wrong</td>
<td>Fix internet routing and review the node IAM policy</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-inspect-the-operator-and-the-externalsecret">How to Inspect the Operator and the ExternalSecret</h3>
<p>When something isn't syncing, start with these two commands.</p>
<pre><code class="language-bash"># Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets
</code></pre>
<p>The status conditions on the <code>ExternalSecret</code> resource will usually tell you exactly what failed.</p>
<h3 id="heading-how-to-validate-rotation-from-the-app-side">How to Validate Rotation From the App Side</h3>
<p>When you are debugging rotation, don't rely only on Kubernetes resource state. Use the <code>/secrets/compare</code> endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the <code>ExternalSecret</code> and <code>Deployment</code> manifests, and validated that the application sees the right credentials.</p>
<p>You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.</p>
<p>Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.</p>
<p>The lab repository is at <a href="https://github.com/Osomudeya/k8s-secret-lab">github.com/Osomudeya/k8s-secret-lab</a>. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on Microk8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.</p>
<p>If this helped you, star the repo and share it with someone who is learning Kubernetes.</p>
<p><em>I send weekly breakdowns of real production incidents and how engineers actually fix them, not tutorials but real failures<br>→</em> <a href="https://osomudeya.gumroad.com/subscribe"><em>Join the newsletter</em></a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Docker Container Doctor: How I Built an AI Agent That Monitors and Fixes My Containers ]]>
                </title>
                <description>
                    <![CDATA[ Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Go ]]>
                </description>
                <link>https://www.freecodecamp.org/news/docker-container-doctor-how-i-built-an-ai-agent-that-monitors-and-fixes-my-containers/</link>
                <guid isPermaLink="false">69c1768730a9b81e3a833f20</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 17:21:11 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8bb7701d-e519-407f-92ba-59639e13729d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Google the error, and finally restart it. Twenty minutes of your morning gone. And the worst part? It happens again next week.</p>
<p>I got tired of this cycle. I was running 5 containerized services on a single Linode box – a Flask API, a Postgres database, an Nginx reverse proxy, a Redis cache, and a background worker. Every other week, one of them would crash. The logs were messy. The errors weren't obvious. And I'd waste time debugging something that could've been auto-detected and fixed in seconds.</p>
<p>So I built something better: a Python agent that watches your containers in real-time, spots errors, figures out what went wrong using Claude, and fixes them without waking you up. I call it the Container Doctor. It's not magic. It's Docker API + LLM reasoning + some automation glue. Here's exactly how I built it, what went wrong along the way, and what I'd do differently.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</a></p>
</li>
<li><p><a href="#heading-the-architecture">The Architecture</a></p>
</li>
<li><p><a href="#heading-setting-up-the-project">Setting Up the Project</a></p>
</li>
<li><p><a href="#heading-the-monitoring-script--line-by-line">The Monitoring Script — Line by Line</a></p>
</li>
<li><p><a href="#heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</a></p>
</li>
<li><p><a href="#heading-auto-fix-logic--being-conservative-on-purpose">Auto-Fix Logic — Being Conservative on Purpose</a></p>
</li>
<li><p><a href="#heading-adding-slack-notifications">Adding Slack Notifications</a></p>
</li>
<li><p><a href="#heading-health-check-endpoint">Health Check Endpoint</a></p>
</li>
<li><p><a href="#heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</a></p>
</li>
<li><p><a href="#heading-docker-compose--the-full-setup">Docker Compose — The Full Setup</a></p>
</li>
<li><p><a href="#heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</a></p>
</li>
<li><p><a href="#heading-cost-breakdown--what-this-actually-costs">Cost Breakdown — What This Actually Costs</a></p>
</li>
<li><p><a href="#heading-security-considerations">Security Considerations</a></p>
</li>
<li><p><a href="#heading-what-id-do-differently">What I'd Do Differently</a></p>
</li>
<li><p><a href="#heading-whats-next">What's Next?</a></p>
</li>
</ol>
<h2 id="heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</h2>
<p>Fair question. Prometheus, Grafana, DataDog – they're all great. But for my setup, they were overkill. I had 5 containers on a $20/month Linode. Setting up Prometheus means deploying a metrics server, configuring exporters for each service, building Grafana dashboards, and writing alert rules. That's a whole side project just to monitor a side project.</p>
<p>Even then, those tools tell you <em>what</em> happened. They'll show you a spike in memory or a 500 error rate. But they won't tell you <em>why</em>. You still need a human to look at the logs, figure out the root cause, and decide what to do.</p>
<p>That's the gap I wanted to fill. I didn't need another dashboard. I needed something that could read a stack trace, understand the context, and either fix it or tell me exactly what to do when I wake up. Claude turned out to be surprisingly good at this. It can read a Python traceback and tell you the issue faster than most junior devs (and some senior ones, honestly).</p>
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Here's how the pieces fit together:</p>
<pre><code class="language-plaintext">┌─────────────────────────────────────────────┐
│              Docker Host                      │
│                                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   web    │  │   api    │  │    db    │   │
│  │ (nginx)  │  │ (flask)  │  │(postgres)│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │         │
│       └──────────────┼──────────────┘         │
│                      │                         │
│              Docker Socket                     │
│                      │                         │
│            ┌─────────┴─────────┐              │
│            │ Container Doctor  │              │
│            │  (Python agent)   │              │
│            └─────────┬─────────┘              │
│                      │                         │
└──────────────────────┼─────────────────────────┘
                       │
              ┌────────┴────────┐
              │   Claude API    │
              │  (diagnosis)    │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Slack Webhook  │
              │  (alerts)       │
              └─────────────────┘
</code></pre>
<p>The flow works like this:</p>
<ol>
<li><p>The Container Doctor runs in its own container with the Docker socket mounted</p>
</li>
<li><p>Every 10 seconds, it pulls the last 50 lines of logs from each target container</p>
</li>
<li><p>It scans for error patterns (keywords like "error", "exception", "traceback", "fatal")</p>
</li>
<li><p>When it finds something, it sends the logs to Claude with a structured prompt</p>
</li>
<li><p>Claude returns a JSON diagnosis: root cause, severity, suggested fix, and whether it's safe to auto-restart</p>
</li>
<li><p>If severity is high and auto-restart is safe, the script restarts the container</p>
</li>
<li><p>Either way, it sends a Slack notification with the full diagnosis</p>
</li>
<li><p>A simple health endpoint lets you check the doctor's own status</p>
</li>
</ol>
<p>The key insight: the script doesn't try to be smart about the diagnosis itself. It outsources all the thinking to Claude. The script's job is just plumbing: collecting logs, routing them to Claude, and executing the response.</p>
<h2 id="heading-setting-up-the-project">Setting Up the Project</h2>
<p>Create your project directory:</p>
<pre><code class="language-bash">mkdir container-doctor &amp;&amp; cd container-doctor
</code></pre>
<p>Here's your <code>requirements.txt</code>:</p>
<pre><code class="language-plaintext">docker==7.0.0
anthropic&gt;=0.28.0
python-dotenv==1.0.0
flask==3.0.0
requests==2.31.0
</code></pre>
<p>Install locally for testing: <code>pip install -r requirements.txt</code></p>
<p>Create a <code>.env</code> file:</p>
<pre><code class="language-bash">ANTHROPIC_API_KEY=sk-ant-...
TARGET_CONTAINERS=web,api,db
CHECK_INTERVAL=10
LOG_LINES=50
AUTO_FIX=true
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POSTGRES_USER=user
POSTGRES_PASSWORD=changeme
POSTGRES_DB=mydb
MAX_DIAGNOSES_PER_HOUR=20
</code></pre>
<p>A quick note on <code>CHECK_INTERVAL</code>: 10 seconds is aggressive. For production, I'd bump this to 30-60 seconds. I kept it low during development so I could see results faster, and honestly forgot to change it. My API bill reminded me.</p>
<h2 id="heading-the-monitoring-script-line-by-line">The Monitoring Script – Line by Line</h2>
<p>Here's the full <code>container_doctor.py</code>. I'll walk through the important parts after:</p>
<pre><code class="language-python">import docker
import json
import time
import logging
import os
import requests
from datetime import datetime, timedelta
from collections import defaultdict
from threading import Thread
from flask import Flask, jsonify
from anthropic import Anthropic

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

client = Anthropic()
docker_client = None

# --- Config ---
TARGET_CONTAINERS = os.getenv("TARGET_CONTAINERS", "").split(",")
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
LOG_LINES = int(os.getenv("LOG_LINES", "50"))
AUTO_FIX = os.getenv("AUTO_FIX", "true").lower() == "true"
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
MAX_DIAGNOSES = int(os.getenv("MAX_DIAGNOSES_PER_HOUR", "20"))

# --- State tracking ---
diagnosis_history = []
fix_history = defaultdict(list)
last_error_seen = {}
rate_limit_counter = defaultdict(int)
rate_limit_reset = datetime.now() + timedelta(hours=1)

app = Flask(__name__)


def get_docker_client():
    """Lazily initialize Docker client."""
    global docker_client
    if docker_client is None:
        docker_client = docker.from_env()
    return docker_client


def get_container_logs(container_name):
    """Fetch last N lines from a container."""
    try:
        container = get_docker_client().containers.get(container_name)
        logs = container.logs(
            tail=LOG_LINES,
            timestamps=True
        ).decode("utf-8")
        return logs
    except docker.errors.NotFound:
        logger.warning(f"Container '{container_name}' not found. Skipping.")
        return None
    except docker.errors.APIError as e:
        logger.error(f"Docker API error for {container_name}: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error fetching logs for {container_name}: {e}")
        return None


def detect_errors(logs):
    """Check if logs contain error patterns."""
    error_patterns = [
        "error", "exception", "traceback", "failed", "crash",
        "fatal", "panic", "segmentation fault", "out of memory",
        "killed", "oomkiller", "connection refused", "timeout",
        "permission denied", "no such file", "errno"
    ]
    logs_lower = logs.lower()
    found = []
    for pattern in error_patterns:
        if pattern in logs_lower:
            found.append(pattern)
    return found


def is_new_error(container_name, logs):
    """Check if this is a new error or the same one we already diagnosed."""
    log_hash = hash(logs[-200:])  # Hash last 200 chars
    if last_error_seen.get(container_name) == log_hash:
        return False
    last_error_seen[container_name] = log_hash
    return True


def check_rate_limit():
    """Ensure we don't spam Claude with too many requests."""
    global rate_limit_counter, rate_limit_reset

    now = datetime.now()
    if now &gt; rate_limit_reset:
        rate_limit_counter.clear()
        rate_limit_reset = now + timedelta(hours=1)

    total = sum(rate_limit_counter.values())
    if total &gt;= MAX_DIAGNOSES:
        logger.warning(f"Rate limit reached ({total}/{MAX_DIAGNOSES} per hour). Skipping diagnosis.")
        return False
    return True


def diagnose_with_claude(container_name, logs, error_patterns):
    """Send logs to Claude for diagnosis."""
    if not check_rate_limit():
        return None

    rate_limit_counter[container_name] += 1

    prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

    try:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=600,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    except Exception as e:
        logger.error(f"Claude API error: {e}")
        return None


def parse_diagnosis(diagnosis_text):
    """Extract JSON from Claude's response."""
    if not diagnosis_text:
        return None
    try:
        start = diagnosis_text.find("{")
        end = diagnosis_text.rfind("}") + 1
        if start &gt;= 0 and end &gt; start:
            json_str = diagnosis_text[start:end]
            return json.loads(json_str)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse error: {e}")
        logger.debug(f"Raw response: {diagnosis_text}")
    except Exception as e:
        logger.error(f"Failed to parse diagnosis: {e}")
    return None


def apply_fix(container_name, diagnosis):
    """Apply auto-fixes if safe."""
    if not AUTO_FIX:
        logger.info(f"Auto-fix disabled globally. Skipping {container_name}.")
        return False

    if not diagnosis.get("auto_restart_safe"):
        logger.info(f"Claude says restart is unsafe for {container_name}. Skipping.")
        return False

    # Don't restart the same container more than 3 times per hour
    recent_fixes = [
        t for t in fix_history[container_name]
        if t &gt; datetime.now() - timedelta(hours=1)
    ]
    if len(recent_fixes) &gt;= 3:
        logger.warning(
            f"Container {container_name} already restarted {len(recent_fixes)} "
            f"times this hour. Something deeper is wrong. Skipping."
        )
        send_slack_alert(
            container_name, diagnosis,
            extra="REPEATED FAILURE: This container has been restarted 3+ times "
                  "in the last hour. Manual intervention needed."
        )
        return False

    try:
        container = get_docker_client().containers.get(container_name)
        logger.info(f"Restarting container {container_name}...")
        container.restart(timeout=30)
        fix_history[container_name].append(datetime.now())
        logger.info(f"Container {container_name} restarted successfully")

        # Verify it's actually running after restart
        time.sleep(5)
        container.reload()
        if container.status != "running":
            logger.error(f"Container {container_name} failed to start after restart")
            return False

        return True
    except Exception as e:
        logger.error(f"Failed to restart {container_name}: {e}")
        return False


def send_slack_alert(container_name, diagnosis, extra=""):
    """Send diagnosis to Slack."""
    if not SLACK_WEBHOOK:
        return

    severity_emoji = {
        "low": "🟡",
        "medium": "🟠",
        "high": "🔴"
    }

    severity = diagnosis.get("severity", "unknown")
    emoji = severity_emoji.get(severity, "⚪")

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"{emoji} Container Doctor Alert: {container_name}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {severity}"},
                {"type": "mrkdwn", "text": f"*Container:* `{container_name}`"},
                {"type": "mrkdwn", "text": f"*Root Cause:* {diagnosis.get('root_cause', 'Unknown')}"},
                {"type": "mrkdwn", "text": f"*Fix:* {diagnosis.get('suggested_fix', 'N/A')}"},
            ]
        }
    ]

    if diagnosis.get("config_suggestions"):
        suggestions = "\n".join(
            f"• `{s}`" for s in diagnosis["config_suggestions"]
        )
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Config Suggestions:*\n{suggestions}"
            }
        })

    if extra:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*⚠️ {extra}*"}
        })

    try:
        requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
    except Exception as e:
        logger.error(f"Slack notification failed: {e}")


# --- Health Check Endpoint ---
@app.route("/health")
def health():
    """Health check endpoint for the doctor itself."""
    try:
        get_docker_client().ping()
        docker_ok = True
    except:
        docker_ok = False

    return jsonify({
        "status": "healthy" if docker_ok else "degraded",
        "docker_connected": docker_ok,
        "monitoring": TARGET_CONTAINERS,
        "total_diagnoses": len(diagnosis_history),
        "fixes_applied": {k: len(v) for k, v in fix_history.items()},
        "rate_limit_remaining": MAX_DIAGNOSES - sum(rate_limit_counter.values()),
        "uptime_check": datetime.now().isoformat()
    })


@app.route("/history")
def history():
    """Return recent diagnosis history."""
    return jsonify(diagnosis_history[-50:])


def monitor_containers():
    """Main monitoring loop."""
    logger.info(f"Container Doctor starting up")
    logger.info(f"Monitoring: {TARGET_CONTAINERS}")
    logger.info(f"Check interval: {CHECK_INTERVAL}s")
    logger.info(f"Auto-fix: {AUTO_FIX}")
    logger.info(f"Rate limit: {MAX_DIAGNOSES}/hour")

    while True:
        for container_name in TARGET_CONTAINERS:
            container_name = container_name.strip()
            if not container_name:
                continue

            logs = get_container_logs(container_name)
            if not logs:
                continue

            error_patterns = detect_errors(logs)
            if not error_patterns:
                continue

            # Skip if we already diagnosed this exact error
            if not is_new_error(container_name, logs):
                continue

            logger.warning(
                f"Errors detected in {container_name}: {error_patterns}"
            )

            diagnosis_text = diagnose_with_claude(
                container_name, logs, error_patterns
            )
            if not diagnosis_text:
                continue

            diagnosis = parse_diagnosis(diagnosis_text)
            if not diagnosis:
                logger.error("Failed to parse Claude's response. Skipping.")
                continue

            # Record it
            diagnosis_history.append({
                "container": container_name,
                "timestamp": datetime.now().isoformat(),
                "diagnosis": diagnosis,
                "patterns": error_patterns
            })

            logger.info(
                f"Diagnosis for {container_name}: "
                f"severity={diagnosis.get('severity')}, "
                f"cause={diagnosis.get('root_cause')}"
            )

            # Auto-fix only on high severity
            fixed = False
            if diagnosis.get("severity") == "high":
                fixed = apply_fix(container_name, diagnosis)

            # Always notify Slack
            send_slack_alert(
                container_name, diagnosis,
                extra="Auto-restarted" if fixed else ""
            )

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Run Flask health endpoint in background
    flask_thread = Thread(
        target=lambda: app.run(host="0.0.0.0", port=8080, debug=False),
        daemon=True
    )
    flask_thread.start()
    logger.info("Health endpoint running on :8080")

    try:
        monitor_containers()
    except KeyboardInterrupt:
        logger.info("Container Doctor shutting down")
</code></pre>
<p>That's a lot of code, so let me walk through the parts that matter.</p>
<p><strong>Error deduplication (</strong><code>is_new_error</code><strong>)</strong>: This was a lesson I learned the hard way. Without this, the script would see the same error every 10 seconds and spam Claude with identical requests. I hash the last 200 characters of the log output and skip if it matches the last error we saw. Simple, but it cut my API costs by about 80%.</p>
<p><strong>Rate limiting (</strong><code>check_rate_limit</code><strong>)</strong>: Belt and suspenders. Even with deduplication, I cap it at 20 diagnoses per hour. If something is so broken that it's generating 20+ unique errors per hour, you need a human anyway.</p>
<p><strong>Restart throttling (inside</strong> <code>apply_fix</code><strong>)</strong>: If the same container has been restarted 3 times in an hour, something deeper is wrong. A restart loop won't fix a misconfigured database or a missing volume. The script stops restarting and sends a louder Slack alert instead.</p>
<p><strong>Post-restart verification</strong>: After restarting, the script waits 5 seconds and checks if the container is actually running. I've seen cases where a container restarts and immediately crashes again. Without this check, the script would report success while the container is still down.</p>
<h2 id="heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</h2>
<p>Getting Claude to return parseable JSON took some iteration. My first attempt used a casual prompt and I got back paragraphs of explanation with JSON buried somewhere in the middle. Sometimes it'd use markdown code fences, sometimes not.</p>
<p>The version I landed on is explicit about format:</p>
<pre><code class="language-python">prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""
</code></pre>
<p>A few things I learned:</p>
<p><strong>Include the detected patterns.</strong> Telling Claude "I found 'timeout' and 'connection refused'" helps it focus. Without this, it sometimes fixated on irrelevant warnings in the logs.</p>
<p><strong>Ask for</strong> <code>estimated_impact</code><strong>.</strong> This field turned out to be the most useful in Slack alerts. When your team sees "Database connections will pile up and crash the API within 15 minutes," they act faster than when they see "connection pool exhausted."</p>
<p><code>likely_recurring</code> <strong>is gold.</strong> If Claude says an issue is likely to recur, I know a restart is a band-aid and I need to actually fix the root cause. I flag these in Slack with extra emphasis.</p>
<p>Claude returns something like:</p>
<pre><code class="language-json">{
    "root_cause": "Connection pool exhausted. Default pool size is 5, but app has 8+ concurrent workers.",
    "severity": "high",
    "suggested_fix": "1. Set POOL_SIZE=20 in environment. 2. Add connection timeout of 30s. 3. Consider a connection pooler like PgBouncer.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "CONNECTION_TIMEOUT=30"],
    "likely_recurring": true,
    "estimated_impact": "API requests will queue and timeout. Users will see 503 errors within 2-3 minutes."
}
</code></pre>
<p>I only auto-restart on <code>high</code> severity. Medium and low issues get logged, sent to Slack, and I deal with them during business hours. This distinction matters: you don't want the script restarting containers over every transient warning.</p>
<h2 id="heading-auto-fix-logic-being-conservative-on-purpose">Auto-Fix Logic – Being Conservative on Purpose</h2>
<p>The auto-fix function is intentionally limited. Right now it only restarts containers. It doesn't modify environment variables, change configs, or scale services. Here's why:</p>
<p>Restarting is safe and reversible. If the restart makes things worse, the container just crashes again and I get another alert. But if the script started changing environment variables or modifying docker-compose files, a bad decision could cascade across services.</p>
<p>The three safety checks before any restart:</p>
<ol>
<li><p><strong>Global toggle</strong>: <code>AUTO_FIX=true</code> in .env. I can kill all auto-fixes instantly by changing one variable.</p>
</li>
<li><p><strong>Claude's assessment</strong>: <code>auto_restart_safe</code> must be true. If Claude says "don't restart this, it'll corrupt the database," the script listens.</p>
</li>
<li><p><strong>Restart throttle</strong>: No more than 3 restarts per container per hour. After that, it's a human problem.</p>
</li>
</ol>
<p>If I were building this for a team, I'd add approval flows. Send a Slack message with "Restart?" and two buttons. Wait for a human to click yes. That adds latency but removes the risk of automated chaos.</p>
<h2 id="heading-adding-slack-notifications">Adding Slack Notifications</h2>
<p>Every diagnosis gets sent to Slack, whether the container was restarted or not. The notification includes color-coded severity, root cause, suggested fix, and config suggestions.</p>
<p>The Slack Block Kit formatting makes these alerts scannable. A red dot for high severity, orange for medium, yellow for low. Your team can glance at the channel and know if they need to drop everything or if it can wait.</p>
<p>To set this up, create a Slack app at <a href="https://api.slack.com/apps">api.slack.com/apps</a>, add an incoming webhook, and paste the URL in your <code>.env</code>.</p>
<h2 id="heading-health-check-endpoint">Health Check Endpoint</h2>
<p>The doctor needs a doctor. I added a simple Flask endpoint so I can monitor the monitoring script:</p>
<pre><code class="language-bash">curl http://localhost:8080/health
</code></pre>
<p>Returns:</p>
<pre><code class="language-json">{
    "status": "healthy",
    "docker_connected": true,
    "monitoring": ["web", "api", "db"],
    "total_diagnoses": 14,
    "fixes_applied": {"api": 2, "web": 1},
    "rate_limit_remaining": 6,
    "uptime_check": "2026-03-15T14:30:00"
}
</code></pre>
<p>And <code>/history</code> returns the last 50 diagnoses:</p>
<pre><code class="language-bash">curl http://localhost:8080/history
</code></pre>
<p>I point an uptime checker (UptimeRobot, free tier) at the <code>/health</code> endpoint. If the Container Doctor itself goes down, I get an email. It's monitoring all the way down.</p>
<h2 id="heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</h2>
<p>This is where I burned money during development. Without rate limiting, the script was sending 100+ requests per hour during a container crash loop. At a few cents per request, that's a few dollars per hour. Not catastrophic, but annoying.</p>
<p>The rate limiter is simple: a counter that resets every hour. Default cap is 20 diagnoses per hour. If you hit the limit, the script logs a warning and skips diagnosis until the window resets. Errors still get detected, they just don't get sent to Claude.</p>
<p>Combined with error deduplication (same error won't trigger a second diagnosis), this keeps my Claude bill under $5/month even with 5 containers monitored.</p>
<h2 id="heading-docker-compose-the-full-setup">Docker Compose – The Full Setup</h2>
<p>Here's the complete <code>docker-compose.yml</code> with the Container Doctor, a sample web server, API, and database:</p>
<pre><code class="language-yaml">version: '3.8'

services:
  container_doctor:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: container_doctor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - TARGET_CONTAINERS=web,api,db
      - CHECK_INTERVAL=10
      - LOG_LINES=50
      - AUTO_FIX=true
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - MAX_DIAGNOSES_PER_HOUR=20
    ports:
      - "8080:8080"
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  web:
    image: nginx:latest
    container_name: web
    ports:
      - "80:80"
    restart: unless-stopped

  api:
    build: ./api
    container_name: api
    environment:
      - DATABASE_URL=postgres://\({POSTGRES_USER}:\){POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
      - POOL_SIZE=20
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  db_data:
</code></pre>
<p>And the <code>Dockerfile</code>:</p>
<pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY container_doctor.py .

EXPOSE 8080

CMD ["python", "-u", "container_doctor.py"]
</code></pre>
<p>Start everything: <code>docker compose up -d</code></p>
<p><strong>Important:</strong> The socket mount (<code>/var/run/docker.sock:/var/run/docker.sock</code>) gives the Container Doctor full access to the Docker daemon. Don't copy <code>.env</code> into the Docker image either — it bakes your API key into the image layer. Pass environment variables via the compose file or at runtime.</p>
<h2 id="heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</h2>
<p>I've been running this for about 3 weeks now. Here are the actual incidents it caught:</p>
<h3 id="heading-incident-1-oom-kill-week-1">Incident 1: OOM Kill (Week 1)</h3>
<p>Logs showed a single word: <code>Killed</code>. That's Linux's OOMKiller doing its thing.</p>
<p>Claude's diagnosis:</p>
<pre><code class="language-json">{
    "root_cause": "Process killed by OOMKiller. Container is requesting more memory than the 256MB limit allows under load.",
    "severity": "high",
    "suggested_fix": "Increase memory limit to 512MB in docker-compose. Monitor if the leak continues at higher limits.",
    "auto_restart_safe": true,
    "config_suggestions": ["mem_limit: 512m", "memswap_limit: 1g"],
    "likely_recurring": true,
    "estimated_impact": "API is completely down. All requests return 502 from nginx."
}
</code></pre>
<p>The script restarted the container in 3 seconds. I updated the compose file the next morning. Before the Container Doctor, this would've been a 2-hour outage overnight.</p>
<h3 id="heading-incident-2-connection-pool-exhausted-week-2">Incident 2: Connection Pool Exhausted (Week 2)</h3>
<pre><code class="language-plaintext">ERROR: database connection pool exhausted
ERROR: cannot create new pool entry
ERROR: QueuePool limit of 5 overflow 0 reached
</code></pre>
<p>Claude caught that my pool size was too small for the number of workers:</p>
<pre><code class="language-json">{
    "root_cause": "SQLAlchemy connection pool (size=5) can't keep up with 8 concurrent Gunicorn workers. Each worker holds a connection during request processing.",
    "severity": "high",
    "suggested_fix": "Set POOL_SIZE=20 and add POOL_TIMEOUT=30. Long-term: add PgBouncer as a connection pooler.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "POOL_TIMEOUT=30", "POOL_RECYCLE=3600"],
    "likely_recurring": true,
    "estimated_impact": "New API requests will hang for 30s then timeout. Existing requests may complete but slowly."
}
</code></pre>
<h3 id="heading-incident-3-transient-timeout-week-2">Incident 3: Transient Timeout (Week 2)</h3>
<pre><code class="language-plaintext">WARN: timeout connecting to upstream service
WARN: retrying request (attempt 2/3)
INFO: request succeeded on retry
</code></pre>
<p>Claude correctly identified this as a non-issue:</p>
<pre><code class="language-json">{
    "root_cause": "Transient network timeout during a DNS resolution hiccup. Retries succeeded.",
    "severity": "low",
    "suggested_fix": "No action needed. This is expected during brief network blips. Only investigate if frequency increases.",
    "auto_restart_safe": false,
    "config_suggestions": [],
    "likely_recurring": false,
    "estimated_impact": "Minimal. Individual requests delayed by ~2s but all completed."
}
</code></pre>
<p>No restart. No alert (I filter low-severity from Slack pings). This is the right call: restarting on every transient timeout causes more downtime than it prevents.</p>
<h3 id="heading-incident-4-disk-full-week-3">Incident 4: Disk Full (Week 3)</h3>
<pre><code class="language-plaintext">ERROR: could not write to temporary file: No space left on device
FATAL: data directory has no space
</code></pre>
<pre><code class="language-json">{
    "root_cause": "Postgres data volume is full. WAL files and temporary sort files consumed all available space.",
    "severity": "high",
    "suggested_fix": "1. Clean WAL files: SELECT pg_switch_wal(). 2. Increase volume size. 3. Add log rotation. 4. Set max_wal_size=1GB.",
    "auto_restart_safe": false,
    "config_suggestions": ["max_wal_size=1GB", "log_rotation_age=1d"],
    "likely_recurring": true,
    "estimated_impact": "Database is read-only. All writes fail. API returns 500 on any mutation."
}
</code></pre>
<p>Notice Claude said <code>auto_restart_safe: false</code> here. Restarting Postgres when the disk is full can corrupt data. The script didn't touch it. It just sent me a detailed Slack alert at 4 AM. I cleaned up the WAL files the next morning. Good call by Claude.</p>
<h2 id="heading-cost-breakdown-what-this-actually-costs">Cost Breakdown – What This Actually Costs</h2>
<p>After 3 weeks of running this on 5 containers:</p>
<ul>
<li><p><strong>Claude API</strong>: ~$3.80/month (with rate limiting and deduplication)</p>
</li>
<li><p><strong>Linode compute</strong>: $0 extra (the Container Doctor uses about 50MB RAM)</p>
</li>
<li><p><strong>Slack</strong>: Free tier</p>
</li>
<li><p><strong>My time saved</strong>: ~2-3 hours/month of 3 AM debugging</p>
</li>
</ul>
<p>Without rate limiting, my first week cost $8 in API calls. The deduplication + rate limiter brought that down dramatically. Most of my containers run fine. The script only calls Claude when something actually breaks.</p>
<p>If you're monitoring more containers or have noisier logs, expect higher costs. The <code>MAX_DIAGNOSES_PER_HOUR</code> setting is your budget knob.</p>
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>Let's talk about the elephant in the room: the Docker socket.</p>
<p>Mounting <code>/var/run/docker.sock</code> gives the Container Doctor <strong>root-equivalent access</strong> to your Docker daemon. It can start, stop, and remove any container. It can pull images. It can exec into running containers. If someone compromises the Container Doctor, they own your entire Docker host.</p>
<p>Here's how I mitigate this:</p>
<ol>
<li><p><strong>Network isolation</strong>: The Container Doctor's health endpoint is only exposed on localhost. In production, put it behind a reverse proxy with auth.</p>
</li>
<li><p><strong>Read-mostly access</strong>: The script only <em>reads</em> logs and <em>restarts</em> containers. It never execs into containers, pulls images, or modifies volumes.</p>
</li>
<li><p><strong>No external inputs</strong>: The script doesn't accept commands from Slack or any external source. It's outbound-only (logs out, alerts out).</p>
</li>
<li><p><strong>API key rotation</strong>: I rotate the Anthropic API key monthly. If the container is compromised, the key has limited blast radius.</p>
</li>
</ol>
<p>For a more secure setup, consider Docker's <code>--read-only</code> flag on the socket mount and a tool like <a href="https://github.com/Tecnativa/docker-socket-proxy">docker-socket-proxy</a> to restrict which API calls the Container Doctor can make.</p>
<h2 id="heading-what-id-do-differently">What I'd Do Differently</h2>
<p>After 3 weeks in production, here's my honest retrospective:</p>
<p><strong>I'd use structured logging from day one.</strong> My regex-based error detection catches too many false positives. A JSON log format with severity levels would make detection way more accurate.</p>
<p><strong>I'd add per-container policies.</strong> Right now, every container gets the same treatment. But you probably want different rules for a database vs a web server. Never auto-restart a database. Always auto-restart a stateless web server.</p>
<p><strong>I'd build a simple web UI.</strong> The <code>/history</code> endpoint returns JSON, but a small React dashboard showing a timeline of incidents, fix success rates, and cost tracking would be much more useful.</p>
<p><strong>I'd try local models first.</strong> For simple errors (OOM, connection refused), a small local model running on Ollama could handle the diagnosis without any API cost. Reserve Claude for the weird, complex stack traces where you actually need strong reasoning.</p>
<p><strong>I'd add a "learning mode."</strong> Run the Container Doctor in observe-only mode for a week. Let it diagnose everything but fix nothing. Review the diagnoses manually. Once you trust its judgment, flip on auto-fix. This builds confidence before you give it restart power.</p>
<h2 id="heading-whats-next">What's Next?</h2>
<p>If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish – Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.</p>
<p>Got questions or built something similar? Drop a comment below or find me on <a href="https://github.com/balajee-asish">GitHub</a> and <a href="https://linkedin.com/in/balajee-asish">LinkedIn</a>.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Production-Ready Flutter CI/CD Pipeline with GitHub Actions: Quality Gates, Environments, and Store Deployment ]]>
                </title>
                <description>
                    <![CDATA[ Mobile application development has evolved over the years. The processes, structure, and syntax we use has changed, as well as the quality and flexibility of the apps we build. One of the major improv ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-production-ready-flutter-ci-cd-pipeline-with-github-actions-quality-gates-environments-and-store-deployment/</link>
                <guid isPermaLink="false">69bb2e078c55d6eefb6c2e8d</guid>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Flutter ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Mobile Development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ github-actions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ github copilot ]]>
                    </category>
                
                    <category>
                        <![CDATA[ CI/CD pipelines ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Dart ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Productivity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oluwaseyi Fatunmole ]]>
                </dc:creator>
                <pubDate>Wed, 18 Mar 2026 22:58:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8c9d9384-ff02-47d7-aa69-42db2ebae247.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Mobile application development has evolved over the years. The processes, structure, and syntax we use has changed, as well as the quality and flexibility of the apps we build.</p>
<p>One of the major improvements has been a properly automated CI/CD pipeline flow that gives us seamless automation, continuous integration, and continuous deployment.</p>
<p>In this article, I'll break down how you can automate and build a production ready CI/CD pipeline for your Flutter application using GitHub Actions.</p>
<p>Note that there are other ways to do this, like with Codemagic (built specifically for Flutter apps – which I'll cover in a subsequent tutorial), but in this article we'll focus on GitHub Actions instead.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-the-typical-workflow">The Typical Workflow</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-pipeline-architecture">Pipeline Architecture</a></p>
</li>
<li><p><a href="#heading-writing-the-workflows">Writing the Workflows</a></p>
<ul>
<li><p><a href="#heading-the-helper-scripts">The Helper Scripts</a></p>
<ul>
<li><p><a href="#heading-script-1-generateconfigsh">generate_config.sh</a></p>
</li>
<li><p><a href="#heading-script-2-qualitygatesh">quality_gate.sh</a></p>
</li>
<li><p><a href="#heading-script-3-uploadsymbolssh-sentry">upload_symbols.sh</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-workflow-1-prchecksyml">PR Quality Gate (pr_checks.yml)</a></p>
</li>
<li><p><a href="#heading-workflow-2-androidyml">Android CI/CD Pipeline (android.yml)</a></p>
</li>
<li><p><a href="#heading-workflow-3-iosyml">iOS CI/CD Pipeline (ios.yml)</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-secrets-and-configuration-reference">Secrets and Configuration Reference</a></p>
</li>
<li><p><a href="#heading-end-to-end-flow">End-to-End Flow</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-the-typical-workflow">The Typical Workflow</h2>
<p>First, let's define the common approach to deploying production-ready Flutter apps.</p>
<p>The development team does their work on local, pushes to the repository for merge or review, and eventually runs <code>flutter build apk</code> or <code>flutter build appbundle</code> to generate the apk file. This then gets shared with the QA team manually, or deployed to Firebase app distribution for testing. If it's a production move, the app bundle is submitted to the Google Play store for review and then deployed.</p>
<p>This process is often fully manual with no automated checks, validation, or control over quality, speed, and seamlessness. Manually shipping a Flutter app starts out relatively simply, but can quickly and quietly turn into a liability. You run <code>flutter build</code>, switch configs, sign the build, upload it somewhere, and hope you didn’t mix up staging keys with production ones.</p>
<p>As teams grow and release updates more and more quickly, these manual steps become real risks. A skipped quality check, a missing keystore, or an incorrect base URL deployed to production can cost hours of debugging or worse – it can affect your users.</p>
<p>Automating this process fully involves some high level configuration and predefined scripting. It completely takes control of the deployment process from the moment the developer raised a PR into the common or base branch (for example, the <code>develop</code> branch).</p>
<p>This automated process takes care of everything that needs to be done – provided it has been predefined, properly scripted, and aligns with the use case of the team.</p>
<h3 id="heading-what-well-do-here">What we'll do here:</h3>
<p>In this tutorial, we'll build a production-grade CI/CD pipeline for a Flutter app using GitHub Actions. The pipeline automates the entire lifecycle: pull-request quality checks, environment-specific configuration injection, Android and iOS builds, Firebase App Distribution for testers, Sentry symbol uploads, and final deployment to the Play Store and App Store.</p>
<p>By the end, every release – from a developer opening a PR to the final build landing in users' hands – will be fully automated, with no one touching a terminal.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before starting, you should have:</p>
<ol>
<li><p>A Flutter app with working Android and iOS builds</p>
</li>
<li><p>Basic familiarity with <a href="https://www.freecodecamp.org/news/automate-cicd-with-github-actions-streamline-workflow/">GitHub Actions</a> (workflows and jobs)</p>
</li>
<li><p>A Firebase project with App Distribution enabled</p>
</li>
<li><p>A Sentry project for error tracking</p>
</li>
<li><p>A Google Play Console app already created</p>
</li>
<li><p>An Apple Developer account with App Store Connect access</p>
</li>
<li><p>Fastlane configured for your iOS project</p>
</li>
<li><p>Basic Bash knowledge (I’ll explain the important parts)</p>
</li>
</ol>
<h2 id="heading-pipeline-architecture">Pipeline Architecture</h2>
<p>In this guide, we'll be building a CI/CD pipeline with very precise instructions and use cases. These use cases determine the way your pipeline is built.</p>
<p>For this tutorial, we'll use this use case:</p>
<p>I want to automate the workflow on my development team based on the following criteria:</p>
<ol>
<li><p>When a developer on the team raises a PR into the common working branch <code>develop</code> in most cases), a workflow is triggered to run quality checks on the code. It only allows the merge to happen if all checks (like tests coverage, quality checks, and static analysis) pass.</p>
</li>
<li><p>Code that's moving from the develop branch to the staging branch goes through another workflow that injects staging configurations/secret keys, does all the necessary checks, and distributes the application for testing on Firebase App Distribution for android as well as Testflight for iOS.</p>
</li>
<li><p>Code that's moving from the staging to the production branch goes through the production level workflow which involves apk secured signing, production configuration injection, running tests to ensure nothing breaks, Sentry analysis for monitoring, and submission to App Store Connect as well as Google Play Console.</p>
</li>
</ol>
<p>These are our predefined conditions which help with the construction of our workflows.</p>
<h2 id="heading-writing-the-workflows">Writing the Workflows</h2>
<p>We'll split this pipeline into three GitHub Actions workflows.</p>
<p>We'll also be taking it a notch higher by creating three helper .sh scripts for a cleaner and more maintainable workflow.</p>
<p>In your project root, create two folders:</p>
<ol>
<li><p>.github/</p>
</li>
<li><p>scripts.</p>
</li>
</ol>
<p>The <strong>.github/</strong> folder will hold the workflows we'll be creating for each use case, while the <strong>scripts/</strong> folder will hold the helper scripts that we can easily call in our CLI or in the workflows directly.</p>
<p>After this, we'll create three workflow .yaml files:</p>
<ol>
<li><p>pr_checks.yaml</p>
</li>
<li><p>android.yaml</p>
</li>
<li><p>ios.yaml</p>
</li>
</ol>
<p>Also in the scripts folder, let's create three .sh files:</p>
<ol>
<li><p>generate_config.sh</p>
</li>
<li><p>quality_checks.sh</p>
</li>
<li><p>upload_symbols.sh</p>
</li>
</ol>
<pre><code class="language-yaml">.github/
  workflows/
    pr_checks.yml
    android.yml
    ios.yml

scripts/
  generate_config.sh
  quality_checks.sh
  upload_symbols.sh
</code></pre>
<p>This workflow architecture ensures that a push to <code>develop</code> automatically produces a tester build. Also, merging to <code>production</code> ships directly to the stores without manual commands or config changes.</p>
<p>The scripts live outside the YAML on purpose. This lets you run the same logic locally.</p>
<h3 id="heading-the-helper-scripts">The Helper Scripts</h3>
<p>The scripts form the backbone of the pipeline. Each one has a single responsibility and is reused across workflows.</p>
<p>Instead of cramming logic into YAML, we'll move it into <strong>reusable scripts</strong>. This keeps workflows clean and lets you run the same logic locally. Let's go through each one now.</p>
<h3 id="heading-script-1-generateconfigsh">Script #1: <code>generate_config.sh</code></h3>
<p>Injecting secrets safely is one of the hardest CI/CD problems in mobile apps.</p>
<p>The strategy:</p>
<ul>
<li><p>Commit a Dart template file with placeholders</p>
</li>
<li><p>Replace placeholders at build time using secrets from GitHub Actions</p>
</li>
<li><p>Never commit real credentials</p>
</li>
</ul>
<pre><code class="language-yaml">#!/usr/bin/env bash
set -euo pipefail


ENV_NAME=${1:-}
BASE_URL=${2:-}
ENCRYPTION_KEY=${3:-}

TEMPLATE="lib/core/env/env_ci.dart"
OUT="lib/core/env/env_ci.g.dart"

if [ -z "\(ENV_NAME" ] || [ -z "\)BASE_URL" ] || [ -z "$ENCRYPTION_KEY" ]; then
  echo "Usage: $0 &lt;env-name&gt; &lt;base-url&gt; &lt;encryption-key&gt;"
  exit 2
fi

sed -e "s|&lt;&lt;BASE_URL&gt;&gt;|$BASE_URL|g" \
    -e "s|&lt;&lt;ENCRYPTION_KEY&gt;&gt;|$ENCRYPTION_KEY|g" \
    -e "s|&lt;&lt;ENV_NAME&gt;&gt;|$ENV_NAME|g" \
    "\(TEMPLATE" &gt; "\)OUT"

echo "Generated config for $ENV_NAME"
</code></pre>
<p>This script is responsible for injecting environment-specific configuration into the Flutter app at build time, without ever committing secrets to source control.</p>
<p>Let’s walk through it carefully.</p>
<h4 id="heading-1-shebang-choosing-the-shell">1. Shebang: Choosing the Shell</h4>
<pre><code class="language-yaml">#!/usr/bin/env bash
</code></pre>
<p>This line tells the system to execute the script using <strong>Bash</strong>, regardless of where Bash is installed on the machine.</p>
<p>Using <code>/usr/bin/env bash</code> instead of <code>/bin/bash</code> makes the script more portable across local machines, GitHub Actions runners, and Docker containers.</p>
<h4 id="heading-2-fail-fast-fail-loud">2. Fail Fast, Fail Loud</h4>
<pre><code class="language-yaml">set -euo pipefail
</code></pre>
<p>This is one of the most important lines in the script.</p>
<p>It enables three strict Bash modes:</p>
<ul>
<li><p><code>-e</code>: Exit immediately if any command fails</p>
</li>
<li><p><code>-u</code>: Exit if an undefined variable is used</p>
</li>
<li><p><code>-o pipefail</code>: Fail if any command in a pipeline fails, not just the last one</p>
</li>
</ul>
<p>This matters in CI because silent failures are dangerous, partial config generation can break production builds, and CI should stop immediately when something is wrong.</p>
<p>This line ensures that no broken config ever makes it into a build.</p>
<h4 id="heading-3-reading-input-arguments">3. Reading Input Arguments</h4>
<pre><code class="language-yaml">
ENV_NAME=${1:-}
BASE_URL=${2:-}
ENCRYPTION_KEY=${3:-}
</code></pre>
<p>These lines read <strong>positional arguments</strong> passed to the script:</p>
<ul>
<li><p><code>$1</code>: Environment name (<code>dev</code>, <code>staging</code>, <code>production</code>)</p>
</li>
<li><p><code>$2</code>: API base URL</p>
</li>
<li><p><code>$3</code>: Encryption or API key</p>
</li>
</ul>
<p>The <code>${1:-}</code> syntax means:</p>
<p><em>“If the argument is missing, default to an empty string instead of crashing.”</em></p>
<p>This works hand-in-hand with <code>set -u</code> , we control the failure explicitly instead of letting Bash explode unexpectedly.</p>
<h4 id="heading-4-defining-input-and-output-files">4. Defining Input and Output Files</h4>
<pre><code class="language-yaml">TEMPLATE="lib/core/env/env_ci.dart"
OUT="lib/core/env/env_ci.g.dart"
</code></pre>
<p>Here we define two files:</p>
<ul>
<li><p><strong>Template file (</strong><code>env_ci.dart</code><strong>)</strong></p>
<ul>
<li><p>Contains placeholder values like <code>&lt;&lt;BASE_URL&gt;&gt;</code></p>
</li>
<li><p>Safe to commit to Git</p>
</li>
</ul>
</li>
<li><p><strong>Generated file (</strong><code>env_ci.g.dart</code><strong>)</strong></p>
<ul>
<li><p>Contains real environment values</p>
</li>
<li><p>Must be ignored by Git (<code>.gitignore</code>)</p>
</li>
</ul>
</li>
</ul>
<p>At the heart of this approach are two Dart files with very different responsibilities. They may look similar, but they play completely different roles in the system.</p>
<h4 id="heading-envcidart"><code>env.ci.dart</code>:</h4>
<pre><code class="language-java">// lib/core/env/env_ci.dart

class EnvConfig {
  static const String baseUrl = '&lt;&lt;BASE_URL&gt;&gt;';
  static const String encryptionKey = '&lt;&lt;ENCRYPTION_KEY&gt;&gt;';
  static const String environment = '&lt;&lt;ENV_NAME&gt;&gt;';
}
</code></pre>
<p>This file is <strong>safe</strong>, <strong>static</strong>, and <strong>version-controlled</strong>. It contains placeholders, not real values.</p>
<p>Some of its key characteristics are:</p>
<ul>
<li><p>Contains no real secrets</p>
</li>
<li><p>Uses obvious placeholders (<code>&lt;&lt;BASE_URL&gt;&gt;</code>, etc.)</p>
</li>
<li><p>Safe to commit to Git</p>
</li>
<li><p>Reviewed like normal source code</p>
</li>
<li><p>Serves as the single source of truth for required config fields</p>
</li>
</ul>
<p>Think of this file as a contract:</p>
<p><em>“These are the configuration values the app expects at runtime.”</em></p>
<h4 id="heading-envcigdart"><code>env.ci.g.dart</code>:</h4>
<p>This file is created at <strong>build time</strong> by <code>generate_config.sh</code>. After substitution, it looks like this:</p>
<pre><code class="language-java">// lib/core/env/env_ci.g.dart
// GENERATED FILE — DO NOT COMMIT

class EnvConfig {
  static const String baseUrl = 'https://staging.api.example.com';
  static const String encryptionKey = 'sk_live_xxxxx';
  static const String environment = 'staging';
}
</code></pre>
<p>Key characteristics:</p>
<ul>
<li><p>Contains real environment values</p>
</li>
<li><p>Generated dynamically in CI</p>
</li>
<li><p>Differs per environment (dev / staging / production)</p>
</li>
<li><p>Must <strong>never</strong> be committed to source control</p>
</li>
</ul>
<p>This file exists only on a developer’s machine (if generated locally), inside the CI runner during a build. Once the job finishes, it disappears.</p>
<h4 id="heading-gitignore"><code>.gitignore</code>:</h4>
<p>To guarantee the generated file never leaks, it must be ignored:</p>
<h4 id="heading-why-this-separation-is-critical">Why This Separation Is Critical</h4>
<p>This design solves several hard problems at once.</p>
<p><strong>Security:</strong></p>
<ul>
<li><p>Secrets live <strong>only</strong> in GitHub Actions secrets</p>
</li>
<li><p>They never appear in the repository</p>
</li>
<li><p>They never appear in PRs</p>
</li>
<li><p>They never appear in Git history</p>
</li>
</ul>
<p><strong>Environment Isolation:</strong></p>
<p>Each environment gets its own generated config:</p>
<ul>
<li><p><code>develop</code>: dev API</p>
</li>
<li><p><code>staging</code>: staging API</p>
</li>
<li><p><code>production</code>: production API</p>
</li>
</ul>
<p>The same codebase behaves differently <strong>without branching logic in Dart</strong>.</p>
<p><strong>Deterministic Builds:</strong></p>
<p>Every build is fully reproducible, fully automated, and explicit about which environment it targets.</p>
<p>There are no “it worked locally” scenarios.</p>
<h4 id="heading-5-validating-required-arguments">5. Validating Required Arguments</h4>
<pre><code class="language-java">if [ -z "\(ENV_NAME" ] || [ -z "\)BASE_URL" ] || [ -z "$ENCRYPTION_KEY" ]; then
  echo "Usage: $0 &lt;env-name&gt; &lt;base-url&gt; &lt;encryption-key&gt;"
  exit 2
fi
</code></pre>
<p>This block enforces correct usage.</p>
<ul>
<li><p><code>-z</code> checks whether a variable is empty</p>
</li>
<li><p>If any required argument is missing:</p>
<ul>
<li><p>A helpful usage message is printed</p>
</li>
<li><p>The script exits with a non-zero status code</p>
</li>
</ul>
</li>
<li><p><code>0</code>: success</p>
</li>
<li><p><code>1+</code>: failure</p>
</li>
<li><p><code>2</code> conventionally means incorrect usage</p>
</li>
</ul>
<p>In CI, this immediately fails the job and prevents an invalid build.</p>
<h4 id="heading-6-injecting-environment-values">6. Injecting Environment Values</h4>
<pre><code class="language-java">sed -e "s|&lt;&lt;BASE_URL&gt;&gt;|$BASE_URL|g" \
    -e "s|&lt;&lt;ENCRYPTION_KEY&gt;&gt;|$ENCRYPTION_KEY|g" \
    -e "s|&lt;&lt;ENV_NAME&gt;&gt;|$ENV_NAME|g" \
    "\(TEMPLATE" &gt; "\)OUT"
</code></pre>
<p>This is the heart of the script.</p>
<p>What’s happening here:</p>
<ol>
<li><p><code>sed</code> performs <strong>stream editing</strong>: it reads text, transforms it, and outputs the result</p>
</li>
<li><p>Each <code>-e</code> flag defines a replacement rule:</p>
<ul>
<li><p>Replace <code>&lt;&lt;BASE_URL&gt;&gt;</code> with the actual API URL</p>
</li>
<li><p>Replace <code>&lt;&lt;ENCRYPTION_KEY&gt;&gt;</code> with the real key</p>
</li>
<li><p>Replace <code>&lt;&lt;ENV_NAME&gt;&gt;</code> with the environment label</p>
</li>
</ul>
</li>
<li><p>The transformed output is written to <code>env_ci.g.dart</code></p>
</li>
</ol>
<p>This entire operation happens <strong>at build time</strong>:</p>
<ul>
<li><p>No secrets are committed</p>
</li>
<li><p>No secrets are logged</p>
</li>
<li><p>No secrets persist beyond the CI run</p>
</li>
</ul>
<h4 id="heading-7-success-feedback">7. Success Feedback</h4>
<pre><code class="language-java">echo "Generated config for $ENV_NAME"
</code></pre>
<p>This line provides a clear success signal in CI logs.</p>
<p>It answers three important questions instantly:</p>
<ul>
<li><p>Did the script run?</p>
</li>
<li><p>Did it finish successfully?</p>
</li>
<li><p>Which environment was generated?</p>
</li>
</ul>
<p>In long CI logs, these small confirmations matter.</p>
<p>Alright, now let's move on to the second script.</p>
<h3 id="heading-script-2-qualitygatesh">Script #2: <code>quality_gate.sh</code></h3>
<p>This script defines what <em>“good code”</em> means for your team.</p>
<pre><code class="language-yaml">#!/usr/bin/env bash
set -euo pipefail

echo "Running quality checks"

dart format --output=none --set-exit-if-changed .
flutter analyze
flutter test --no-pub --coverage

if command -v dart_code_metrics &gt;/dev/null 2&gt;&amp;1; then
  dart_code_metrics analyze lib --reporter=console || true
fi

echo "Quality checks passed"
</code></pre>
<p>Lets break down this script bit by bit.</p>
<h4 id="heading-1-start-amp-end-log-markers">1. Start &amp; End Log Markers</h4>
<pre><code class="language-yaml">echo "Running quality checks"
...
echo "Quality checks passed"
</code></pre>
<p>These two lines act as <strong>visual boundaries</strong> in CI logs.</p>
<p>In large pipelines (especially when Android and iOS jobs run in parallel), logs can be very noisy. Clear markers:</p>
<ul>
<li><p>Help developers quickly find the quality phase</p>
</li>
<li><p>Make debugging faster</p>
</li>
<li><p>Confirm that the script completed successfully</p>
</li>
</ul>
<p>The final success message only prints if <strong>everything above it passed</strong>, because <code>set -e</code> would have terminated the script earlier on failure.</p>
<p>So this line effectively means: All quality gates passed. Safe to proceed.</p>
<h4 id="heading-2-running-the-test-suite">2. Running the Test Suite</h4>
<pre><code class="language-yaml">flutter test --no-pub --coverage
</code></pre>
<p>This line executes your entire Flutter test suite.</p>
<p>Let’s break it down carefully.</p>
<p>1. <code>flutter test</code></p>
<p>This runs unit tests, widget tests, and any test under the <code>test/</code> directory. If <strong>any test fails</strong>, the command exits with a non-zero status code.</p>
<p>Because we enabled <code>set -e</code> earlier, that immediately stops the script and fails the CI job.</p>
<p>2. <code>--coverage</code></p>
<p>This flag generates a coverage report at:</p>
<pre><code class="language-yaml">coverage/lcov.info
</code></pre>
<p>This file can later be uploaded to Codecov, used to enforce minimum coverage thresholds, and tracked over time for quality improvement.</p>
<p>Even if you’re not enforcing coverage yet, generating it now future-proofs your pipeline.</p>
<h4 id="heading-3-optional-code-metrics">3. Optional Code Metrics</h4>
<pre><code class="language-yaml">if command -v dart_code_metrics &gt;/dev/null 2&gt;&amp;1; then
  dart_code_metrics analyze lib --reporter=console || true
fi
</code></pre>
<p>This block is intentionally designed to be optional and non-blocking.</p>
<p><strong>Step 1 – Check If the Tool Exists:</strong></p>
<pre><code class="language-yaml">command -v dart_code_metrics &gt;/dev/null 2&gt;&amp;1
</code></pre>
<p>This checks whether <code>dart_code_metrics</code> is installed.</p>
<ul>
<li><p>If installed, proceed</p>
</li>
<li><p>If not installed, skip silently</p>
</li>
</ul>
<p>The redirection:</p>
<ul>
<li><p><code>&gt;/dev/null</code> hides normal output</p>
</li>
<li><p><code>2&gt;&amp;1</code> hides errors</p>
</li>
</ul>
<p>This makes the script portable:</p>
<ul>
<li><p>Developers without the tool can still run the script</p>
</li>
<li><p>CI can enforce it if configured</p>
</li>
</ul>
<p><strong>Step 2 – Run Metrics (Soft Enforcement):</strong></p>
<pre><code class="language-yaml">dart_code_metrics analyze lib --reporter=console || true
</code></pre>
<p>This analyzes the <code>lib/</code> directory and prints results in the console.</p>
<p>The important part is:</p>
<pre><code class="language-yaml">|| true
</code></pre>
<p>Because we enabled <code>set -e</code>, any failing command would normally stop the script.</p>
<p>Adding <code>|| true</code> overrides that behavior:</p>
<ul>
<li><p>If metrics report issues,</p>
</li>
<li><p>The script continues,</p>
</li>
<li><p>CI does not fail.</p>
</li>
</ul>
<p>Why design it this way? Because metrics are often gradual improvements, technical debt indicators, or advisory rather than blocking.</p>
<p>You can later remove <code>|| true</code> to make metrics mandatory.</p>
<h4 id="heading-4-final-success-message"><strong>4. Final Success Message</strong></h4>
<pre><code class="language-yaml">echo "✅ Quality checks passed"
</code></pre>
<p>This line only executes if formatting passed, static analysis passed, and tests passed.</p>
<p>If you see this in CI logs, it means the branch has successfully cleared the quality gate. It’s your automated approval before deployment steps begin.</p>
<h4 id="heading-what-this-script-guarantees">What This Script Guarantees</h4>
<p>With this in place, every branch must satisfy:</p>
<ul>
<li><p>Clean formatting</p>
</li>
<li><p>No analyzer errors</p>
</li>
<li><p>Passing tests</p>
</li>
<li><p>(Optional) Healthy metrics</p>
</li>
</ul>
<p>That’s how you move from <strong>“We try to maintain quality”</strong> to <strong>“Quality is enforced automatically.”</strong></p>
<p>Alright, on to the third script.</p>
<h3 id="heading-script-3-uploadsymbolssh-sentry"><strong>Script #3:</strong> <code>upload_symbols.sh</code> <strong>(Sentry)</strong></h3>
<p>This script is responsible for uploading <strong>obfuscation debug symbols</strong> to Sentry so production crashes remain readable.</p>
<pre><code class="language-yaml">#!/usr/bin/env bash
set -euo pipefail

RELEASE=${1:-}

[ -z "$RELEASE" ] &amp;&amp; exit 2

if ! command -v sentry-cli &gt;/dev/null 2&gt;&amp;1; then
  exit 0
fi

sentry-cli releases new "$RELEASE" || true

sentry-cli upload-dif build/symbols || true

sentry-cli releases finalize "$RELEASE" || true

echo "✅ Symbols uploaded for release $RELEASE"
</code></pre>
<p>Let's go through it step by step.</p>
<h4 id="heading-1-reading-the-release-identifier">1. Reading the Release Identifier</h4>
<pre><code class="language-yaml">RELEASE=${1:-}
</code></pre>
<p>This reads the first positional argument passed to the script.</p>
<p>When you call the script in CI, it typically looks like:</p>
<pre><code class="language-yaml">./scripts/upload_symbols.sh $(git rev-parse --short HEAD)
</code></pre>
<p>So <code>$1</code> becomes the short Git commit SHA.</p>
<p>Using <code>${1:-}</code> ensures:</p>
<ul>
<li><p>If no argument is passed, the variable becomes an empty string</p>
</li>
<li><p>The script does not crash due to <code>set -u</code></p>
</li>
</ul>
<p>This release value ties the uploaded symbols, deployed build, and crash reports all to the exact same commit. This linkage is critical for production debugging.</p>
<h4 id="heading-2-validating-the-release-argument">2. Validating the Release Argument</h4>
<pre><code class="language-yaml">[ -z "$RELEASE" ] &amp;&amp; exit 2
</code></pre>
<p>This is a compact validation check.</p>
<ul>
<li><p><code>-z</code> checks whether the string is empty</p>
</li>
<li><p>If it is empty → exit with status code 2</p>
</li>
</ul>
<p>Conventionally:</p>
<ul>
<li><p><code>0</code> = success</p>
</li>
<li><p><code>1+</code> = failure</p>
</li>
<li><p><code>2</code> = incorrect usage</p>
</li>
</ul>
<p>This prevents symbol uploads from running without a release identifier, which would break traceability in Sentry.</p>
<h4 id="heading-3-checking-if-sentry-cli-exists">3. Checking If <code>sentry-cli</code> Exists</h4>
<pre><code class="language-yaml">if ! command -v sentry-cli &gt;/dev/null 2&gt;&amp;1; then
  exit 0
fi
</code></pre>
<p>This block checks whether the <code>sentry-cli</code> tool is available in the environment.</p>
<p>What’s happening:</p>
<ul>
<li><p><code>command -v sentry-cli</code> checks if it exists</p>
</li>
<li><p><code>&gt;/dev/null 2&gt;&amp;1</code> suppresses all output</p>
</li>
<li><p><code>!</code> negates the condition</p>
</li>
</ul>
<p>So this reads as: <em>"If</em> <code>sentry-cli</code> <em>is NOT installed, exit successfully."</em></p>
<p>Why exit with <code>0</code> instead of failing?</p>
<p>Because not every environment needs symbol uploads. Also, dev builds may not install Sentry, and you don’t want CI to fail just because Sentry isn’t configured.</p>
<p>This makes symbol uploading <strong>environment-aware</strong> and <strong>optional</strong>.</p>
<p>Production environments can install <code>sentry-cli</code>, while dev environments skip it cleanly.</p>
<h4 id="heading-4-creating-a-new-release-in-sentry">4. Creating a New Release in Sentry</h4>
<pre><code class="language-yaml">sentry-cli releases new "$RELEASE" || true
</code></pre>
<p>This tells Sentry: “A new release exists with this version identifier.”</p>
<p>Even if the release already exists, the script continues because of:</p>
<pre><code class="language-yaml">|| true
</code></pre>
<p>This prevents the build from failing if:</p>
<ul>
<li><p>The release was already created</p>
</li>
<li><p>The command returns a non-critical error</p>
</li>
</ul>
<p>The goal is resilience, not strict enforcement.</p>
<h4 id="heading-5-uploading-debug-information-files-difs">5. Uploading Debug Information Files (DIFs)</h4>
<pre><code class="language-yaml">sentry-cli upload-dif build/symbols || true
</code></pre>
<p>This is the core step.</p>
<p><code>build/symbols</code> is generated when you build Flutter with:</p>
<pre><code class="language-yaml">--obfuscate --split-debug-info=build/symbols
</code></pre>
<p>When you obfuscate Flutter builds:</p>
<ul>
<li><p>Method names are renamed</p>
</li>
<li><p>Stack traces become unreadable</p>
</li>
</ul>
<p>The symbol files allow Sentry to reverse-map obfuscated stack traces and show readable crash reports.</p>
<p>Without this step, production crashes look like:</p>
<pre><code class="language-yaml">a.b.c.d (Unknown Source)
</code></pre>
<p>With this step, you get:</p>
<pre><code class="language-yaml">AuthRepository.login()
</code></pre>
<p>Again, <code>|| true</code> ensures the build doesn’t fail if:</p>
<ul>
<li><p>The directory doesn’t exist</p>
</li>
<li><p>No symbols were generated</p>
</li>
<li><p>Upload encounters a transient issue</p>
</li>
</ul>
<p>Symbol uploads should not block deployment.</p>
<h4 id="heading-6-finalizing-the-release">6. Finalizing the Release</h4>
<pre><code class="language-yaml">sentry-cli releases finalize "$RELEASE" || true
</code></pre>
<p>This marks the release as complete in Sentry.</p>
<p>Finalizing signals:</p>
<ul>
<li><p>The release is deployed</p>
</li>
<li><p>It can begin aggregating crash reports</p>
</li>
<li><p>It’s ready for production monitoring</p>
</li>
</ul>
<p>Like the previous steps, this is soft-failed with <code>|| true</code> to keep CI robust.</p>
<h4 id="heading-what-this-script-guarantees">What This Script Guarantees</h4>
<p>When everything is configured correctly:</p>
<ol>
<li><p>Production build is obfuscated</p>
</li>
<li><p>Debug symbols are generated</p>
</li>
<li><p>Symbols are uploaded to Sentry</p>
</li>
<li><p>Crashes map back to real source code</p>
</li>
<li><p>Release version matches commit SHA</p>
</li>
</ol>
<p>That’s production-grade crash observability.</p>
<p>Now that we've gone through the three helper scripts we've created to optimize and enhance this process, lets now dive into the three workflow .yaml files we're going to create.</p>
<h2 id="heading-workflow-1-prchecksyml">Workflow #1: <code>PR_CHECKS.YML</code></h2>
<p>This workflow will be designed to help ensure a certain standard is met once a PR is raised into a certain common or base branch. This will ensure that all quality checks in the incoming code pass before allowing any merge into the base branch.</p>
<p>This is basically a gate that verifies the quality of the code that's about to be merged into the base branch. If your pipeline allows unverified code into your base branch, then your CI becomes decorative, not protective.</p>
<p>Lets break down what's actually needed during every PR Check.</p>
<h3 id="heading-1-dependency-integrity">1. Dependency Integrity</h3>
<p>For Flutter apps, where we manage dependencies with the <strong>pub get</strong> command, it's important to make sure that the integrity of all dependencies are confirmed – up to date as well as compatible.</p>
<p>Every PR should begin with:</p>
<pre><code class="language-yaml">flutter pub get
</code></pre>
<p>This ensures:</p>
<ul>
<li><p><code>pubspec.yaml</code> is valid</p>
</li>
<li><p>Dependency constraints are consistent</p>
</li>
<li><p>Lockfiles are not broken</p>
</li>
<li><p>The project is buildable in a clean environment</p>
</li>
</ul>
<p>If this fails, the branch is not deployable.</p>
<h3 id="heading-2-static-analysis">2. Static Analysis</h3>
<p>This ensures code quality and architecture integrity. Static analysis helps prevent common issues like forgotten await, dead code, null safety violations, async misuse, and so on.</p>
<p>Most production bugs aren't business logic errors – they're structural carelessness. Static analysis helps enforce consistency automatically, so code reviews focus on intent, not linting.</p>
<pre><code class="language-yaml">flutter analyze --fatal-infos --fatal-warnings
</code></pre>
<h3 id="heading-3-formatting">3. Formatting</h3>
<p>This command ensures that your code is properly formatted based on your organization's coding standard and policies.</p>
<pre><code class="language-yaml">dart format --output=none --set-exit-if-changed .
</code></pre>
<h3 id="heading-4-tests">4. Tests</h3>
<p>This runs the unit, widget and business logic tests to ensure quality and avoid regression leaks, silent behavior changes and feature drift.</p>
<pre><code class="language-yaml">flutter test --coverage
</code></pre>
<h3 id="heading-5-test-coverage-enforcement">5. Test Coverage Enforcement</h3>
<p>Ideally, running tests is not enough. Your workflow should also enforce a minimum threshold:</p>
<pre><code class="language-yaml">if [ \((lcov --summary coverage/lcov.info | grep lines | awk '{print \)2}' | sed 's/%//') -lt 70 ]; then
  echo "Coverage too low"
  exit 1
fi
</code></pre>
<p>The command above ensures that a minimum test coverage of 70% is met, with this quality becomes measurable.</p>
<p>The five commands above must be checked (at least) for a <strong>quality gate</strong> to guarantee code quality, security, and integrity.</p>
<p>Now here is the full <strong>pr_checks.yml</strong> file:</p>
<pre><code class="language-yaml">name: PR Quality Gate

on:
  pull_request:
    branches: develop
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  pr-checks:
    name: Run quality checks on this pull request
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Setup Java
        uses: actions/setup-java@v1
        with:
          java-version: "12.x"

      - name: Setup Flutter
        uses: subosito/flutter-action@v1
        with:
          channel: "stable"

      - name: Install dependencies
        run: flutter pub get

      - name: Run quality checks
        run: ./scripts/quality_checks.sh

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "PR Quality Checks PASSED"
          echo "PR: ${{ github.event.pull_request.html_url }}"
          echo "Branch: \({{ github.head_ref }} → \){{ github.base_ref }}"
          echo "By: @${{ github.actor }}"
          echo "Team notification: @foluwaseyi-dev @olabodegbolu"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "PR Quality Checks FAILED"
          echo "PR: ${{ github.event.pull_request.html_url }}"
          echo "Branch: \({{ github.head_ref }} → \){{ github.base_ref }}"
          echo "By: @${{ github.actor }}"
          echo "Please fix the issues before requesting review 🔧"
          echo "Team notification: @foluwaseyi-dev @olabodegbolu"
</code></pre>
<p>Every time a developer opens (or updates) a pull request targeting the <code>develop</code> branch, this workflow kicks in automatically. Think of it as a bouncer at the door: no code gets through without passing inspection first.</p>
<h3 id="heading-what-triggers-it">What Triggers it?</h3>
<p>The workflow fires on four events: when a PR is <code>opened</code>, <code>synchronized</code> (new commits pushed), <code>reopened</code>, or marked <code>ready_for_review</code>. So drafts won't trigger it – only PRs that are actually ready to be looked at.</p>
<h3 id="heading-what-does-it-actually-do">What Does it Actually Do?</h3>
<p>It spins up a fresh Ubuntu machine and runs five steps in sequence:</p>
<ol>
<li><p><strong>Checkout</strong>: pulls down the branch's code</p>
</li>
<li><p><strong>Setup Java 12</strong>: installs the JDK (likely a dependency for some tooling or build process)</p>
</li>
<li><p><strong>Setup Flutter (stable channel)</strong>: this is a Flutter project, so it grabs the stable Flutter SDK</p>
</li>
<li><p><strong>Install dependencies</strong>: runs <code>flutter pub get</code> to pull all Dart/Flutter packages</p>
</li>
<li><p><strong>Run quality checks</strong>: executes the helper shell script (<code>./scripts/quality_checks.sh</code>) that we created which runs linting, tests, formatting checks, or all of the above</p>
</li>
</ol>
<h3 id="heading-the-notification-layer">The Notification Layer</h3>
<p>After the checks run, the workflow reports the verdict and it's context-aware:</p>
<ul>
<li><p><strong>If everything passes</strong>, it logs a success message with the PR URL, branch info, and the person who opened it</p>
</li>
<li><p><strong>If something fails</strong>, it logs a failure message and nudges the author to fix issues before requesting a review</p>
</li>
</ul>
<p>Both outcomes tag <code>@foluwaseyi-dev</code> and <code>@olabodegbolu</code> – the two team members responsible for staying in the loop.</p>
<p>This workflow enforces a "fix it before you merge it" culture. No one can merge broken code into <code>develop</code> without the team knowing about it.</p>
<h2 id="heading-workflow-2-androidyml">Workflow #2: Android.yml</h2>
<p>It's a better practice to split your workflows based on platform. This helps you properly manage the instructions regarding each platform. This is the reason behind keeping the Android workflow separate.</p>
<p>Unlike <code>PR _Checks</code>, this workflow presumes that all checks for quality and standards have been done and the code that runs this workflow already meets the required standards.</p>
<p>Based on our predefined use case, let's create a workflow to handle test deployments when merged to develop or staging, and production level activities when merged to production.</p>
<pre><code class="language-yaml">name: Android Build &amp; Release

on:
  push:
    branches:
      - develop
      - staging
      - production

jobs:
  android:
    runs-on: ubuntu-latest
    env:
      FLUTTER_VERSION: 'stable'

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Java
        uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '11'

      - name: Setup Flutter
        uses: subosito/flutter-action@v2
        with:
          flutter-version: ${{ env.FLUTTER_VERSION }}

      - name: Install dependencies
        run: flutter pub get

      - name: Determine environment
        id: env
        run: |
          echo "branch=\({GITHUB_REF##*/}" &gt;&gt; \)GITHUB_OUTPUT
          if [ "${GITHUB_REF##*/}" = "develop" ]; then
            echo "ENV=dev" &gt;&gt; $GITHUB_OUTPUT
          elif [ "${GITHUB_REF##*/}" = "staging" ]; then
            echo "ENV=staging" &gt;&gt; $GITHUB_OUTPUT
          else
            echo "ENV=production" &gt;&gt; $GITHUB_OUTPUT
          fi

      # Dev uses hardcoded values no secrets needed
      - name: Generate config (dev)
        if: steps.env.outputs.ENV == 'dev'
        run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key"

      # Staging and production inject real secrets
      - name: Generate config (staging/production)
        if: steps.env.outputs.ENV != 'dev'
        run: |
          if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then
            ./scripts/generate_config.sh staging \
              "${{ secrets.STAGING_BASE_URL }}" \
              "${{ secrets.STAGING_API_KEY }}"
          else
            ./scripts/generate_config.sh production \
              "${{ secrets.PROD_BASE_URL }}" \
              "${{ secrets.PROD_API_KEY }}"
          fi

      # Keystore is only needed for signed builds (staging &amp; production)
      - name: Restore Keystore
        if: steps.env.outputs.ENV != 'dev'
        run: |
          echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode &gt; android/app/upload-keystore.jks

      # Production builds are obfuscated + split debug info for Play Store
      - name: Build artifact
        run: |
          if [ "${{ steps.env.outputs.ENV }}" = "production" ]; then
            flutter build appbundle --release \
              --obfuscate \
              --split-debug-info=build/symbols
          else
            flutter build appbundle --release
          fi

      # Dev and staging go to Firebase App Distribution for internal testing
      - name: Upload to Firebase App Distribution
        if: steps.env.outputs.ENV == 'dev' || steps.env.outputs.ENV == 'staging'
        env:
          FIREBASE_TOKEN: ${{ secrets.FIREBASE_TOKEN }}
          FIREBASE_ANDROID_APP_ID: ${{ secrets.FIREBASE_ANDROID_APP_ID }}
          FIREBASE_GROUPS: ${{ secrets.FIREBASE_GROUPS }}
        run: |
          firebase appdistribution:distribute \
            build/app/outputs/bundle/release/app-release.aab \
            --app "$FIREBASE_ANDROID_APP_ID" \
            --groups "$FIREBASE_GROUPS" \
            --token "$FIREBASE_TOKEN"

      # Only production goes to the Play Store
      - name: Upload to Play Store
        if: steps.env.outputs.ENV == 'production'
        uses: r0adkll/upload-google-play@v1
        with:
          serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT_JSON }}
          packageName: com.your.package
          releaseFiles: build/app/outputs/bundle/release/app-release.aab
          track: production

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "Android Build &amp; Release PASSED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "Android Build &amp; Release FAILED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"
          echo "Check the logs and fix the issue before retrying"
</code></pre>
<p>This workflow ensures that whenever code lands on the <strong>develop, staging or production</strong> branch, this action is triggered on a fresh Ubuntu machine.</p>
<p>This is triggered by a simple push to any of the tracked branches, no manual intervention needed.</p>
<p>Let's walk through it piece by piece.</p>
<h3 id="heading-1-the-setup-phase">1. The Setup Phase</h3>
<p>Before any Flutter-specific work happens, the workflow lays the foundation:</p>
<ol>
<li><p><strong>Checkout</strong>: grabs the latest code from the branch that triggered the run (using the more modern <code>actions/checkout@v3</code>).</p>
</li>
<li><p><strong>Java 11 via Temurin</strong>: this is an upgrade from the first workflow we created. Instead of a generic <code>setup-java@v1</code>, this uses the <code>temurin</code> distribution which is the Eclipse's open-source JDK build. It's the current industry standard for Android toolchains.</p>
</li>
<li><p><strong>Flutter (stable)</strong>: this pulls the stable Flutter SDK, version pinned via an environment variable (<code>FLUTTER_VERSION: 'stable'</code>) defined at the job level.</p>
</li>
<li><p><strong>Install dependencies</strong>: this ensures we run <code>flutter pub get</code> to pull all packages</p>
</li>
</ol>
<h3 id="heading-2-environment-detection">2. Environment Detection</h3>
<p>This is where it gets interesting. This workflow also checks and determines the environment which will help us define the next set of instructions to run.</p>
<p>This command reads the branch name from <strong>GITHUB REF</strong> and maps it to its environment label which we already created in one of our helper scripts.</p>
<ul>
<li><p>develop → ENV=dev</p>
</li>
<li><p>staging → ENV=staging</p>
</li>
<li><p>production → ENV=production</p>
</li>
</ul>
<p>It strips the branch name from the full ref path using <code>\({GITHUB_REF##*/}</code>, then writes both the branch name and the resolved <code>ENV</code> value to <code>\)GITHUB_OUTPUT</code>, making them available as named outputs (<code>steps.env.outputs.ENV</code>) for every subsequent step.</p>
<p>This means the rest of the pipeline can branch its behaviour based on which environment it's building for, different API keys, different signing configs, different targets – whatever the app needs.</p>
<h3 id="heading-3-config-injection">3. Config Injection</h3>
<p>With the environment resolved, the next step is injecting the right configuration into the app. This is where the <code>generate_config.sh</code> script we built earlier gets called directly from the workflow.</p>
<p>For the <code>dev</code> environment, hardcoded placeholder values are used. No real secrets are needed, since this build is only meant for internal developer testing:</p>
<pre><code class="language-yaml">- name: Generate config (dev)
  if: steps.env.outputs.ENV == 'dev'
  run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key"
</code></pre>
<p>For staging and production, however, real secrets are pulled from GitHub Actions secrets and passed directly into the script:</p>
<pre><code class="language-yaml">- name: Generate config (staging/production)
  if: steps.env.outputs.ENV != 'dev'
  run: |
    if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then
      ./scripts/generate_config.sh staging \
        "${{ secrets.STAGING_BASE_URL }}" \
        "${{ secrets.STAGING_API_KEY }}"
    else
      ./scripts/generate_config.sh production \
        "${{ secrets.PROD_BASE_URL }}" \
        "${{ secrets.PROD_API_KEY }}"
    fi
</code></pre>
<p>Notice that these two steps use an <code>if</code> condition to make them mutually exclusive. Only one will ever run per job. This keeps the pipeline clean: no complicated branching logic inside the script itself, just a clear decision at the workflow level.</p>
<h3 id="heading-4-keystore-restoration">4. Keystore Restoration</h3>
<p>Android requires signed builds for distribution. The signing keystore file cannot be committed to the repository for obvious security reasons, so it's stored as a Base64-encoded GitHub secret and decoded at build time.</p>
<pre><code class="language-yaml">- name: Restore Keystore
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.ANDROID_KEYSTORE_BASE64 }}" | base64 --decode &gt; android/app/upload-keystore.jks
</code></pre>
<p>This step is skipped entirely for the <code>dev</code> environment because dev builds are unsigned debug artifacts meant purely for internal testing on Firebase App Distribution. Only staging and production builds need to be properly signed.</p>
<p>To encode your keystore file as a Base64 string for storing in GitHub secrets, you have to run this locally:</p>
<pre><code class="language-yaml">base64 -i upload-keystore.jks | pbcopy
</code></pre>
<p>This copies the encoded string to your clipboard, which you can then paste directly into your GitHub repository secrets.</p>
<h3 id="heading-5-building-the-artifact">5. Building the Artifact</h3>
<p>With the environment configured and the keystore in place, the workflow builds the app bundle:</p>
<pre><code class="language-yaml">- name: Build artifact
  run: |
    if [ "${{ steps.env.outputs.ENV }}" = "production" ]; then
      flutter build appbundle --release \
        --obfuscate \
        --split-debug-info=build/symbols
    else
      flutter build appbundle --release
    fi
</code></pre>
<p>There's a deliberate difference between how production and non-production builds are compiled.</p>
<p>For production:</p>
<ul>
<li><p><code>--obfuscate</code> renames method and class names in the compiled output, making it significantly harder to reverse engineer the app</p>
</li>
<li><p><code>--split-debug-info=build/symbols</code> extracts the debug symbols into a separate directory at <code>build/symbols</code></p>
</li>
</ul>
<p>These symbols are what <code>upload_symbols.sh</code> later ships to Sentry, so obfuscated crash reports remain readable in your monitoring dashboard.</p>
<p>For dev and staging, neither flag is used. This keeps build times faster and makes local debugging easier since stack traces remain human-readable.</p>
<h3 id="heading-6-distributing-to-firebase-app-distribution">6. Distributing to Firebase App Distribution</h3>
<p>Once the app bundle is built, dev and staging builds are uploaded to Firebase App Distribution so testers can install them immediately:</p>
<pre><code class="language-yaml">- name: Upload to Firebase App Distribution
  if: steps.env.outputs.ENV == 'dev' || steps.env.outputs.ENV == 'staging'
  env:
    FIREBASE_TOKEN: ${{ secrets.FIREBASE_TOKEN }}
    FIREBASE_ANDROID_APP_ID: ${{ secrets.FIREBASE_ANDROID_APP_ID }}
    FIREBASE_GROUPS: ${{ secrets.FIREBASE_GROUPS }}
  run: |
    firebase appdistribution:distribute \
      build/app/outputs/bundle/release/app-release.aab \
      --app "$FIREBASE_ANDROID_APP_ID" \
      --groups "$FIREBASE_GROUPS" \
      --token "$FIREBASE_TOKEN"
</code></pre>
<p>Three secrets power this step:</p>
<ul>
<li><p><code>FIREBASE_TOKEN</code>: the authentication token generated from <code>firebase login:ci</code></p>
</li>
<li><p><code>FIREBASE_ANDROID_APP_ID</code>: the app identifier from the Firebase console</p>
</li>
<li><p><code>FIREBASE_GROUPS</code>: the tester group(s) that should receive the build notification</p>
</li>
</ul>
<p>Once this step completes, every tester in the specified groups receives an email with a direct download link. No one needs to manually share an APK file over Slack or email.</p>
<h3 id="heading-7-deploying-to-the-play-store">7. Deploying to the Play Store</h3>
<p>Production builds skip Firebase entirely and goes straight to the Google Play Store:</p>
<pre><code class="language-yaml">- name: Upload to Play Store
  if: steps.env.outputs.ENV == 'production'
  uses: r0adkll/upload-google-play@v1
  with:
    serviceAccountJsonPlainText: ${{ secrets.GOOGLE_PLAY_SERVICE_ACCOUNT_JSON }}
    packageName: com.your.package
    releaseFiles: build/app/outputs/bundle/release/app-release.aab
    track: production
</code></pre>
<p>This uses the <code>r0adkll/upload-google-play</code> GitHub Action, which handles the Google Play API interaction under the hood. The only requirements are:</p>
<ul>
<li><p>A Google Play service account with the correct permissions, stored as a JSON secret</p>
</li>
<li><p>The correct package name matching what is registered in your Play Console</p>
</li>
<li><p>The <code>track</code> set to <code>production</code> (you can also use <code>internal</code>, <code>alpha</code>, or <code>beta</code> depending on your release strategy)</p>
</li>
</ul>
<p>Replace <code>com.your.package</code> with your actual application ID (the same one defined in your <code>build.gradle</code> file).</p>
<h3 id="heading-8-the-notification-layer">8. The Notification Layer</h3>
<p>Just like the PR checks workflow, this workflow reports its outcome clearly:</p>
<pre><code class="language-yaml">- name: Notify Team (Success)
  if: success()
  run: |
    echo "Android Build &amp; Release PASSED"
    echo "Environment: ${{ steps.env.outputs.ENV }}"
    echo "Branch: ${{ steps.env.outputs.branch }}"
    echo "By: @${{ github.actor }}"
    echo "Commit: ${{ github.sha }}"

- name: Notify Team (Failure)
  if: failure()
  run: |
    echo "Android Build &amp; Release FAILED"
    echo "Environment: ${{ steps.env.outputs.ENV }}"
    echo "Branch: ${{ steps.env.outputs.branch }}"
    echo "By: @${{ github.actor }}"
    echo "Commit: ${{ github.sha }}"
    echo "Check the logs and fix the issue before retrying 🔧"
</code></pre>
<p>The success notification includes the environment, branch, actor, and shares everything needed to trace exactly what was deployed and who triggered it.</p>
<p>The failure notification includes the same context, with a clear call to action.</p>
<h2 id="heading-workflow-3-iosyml">Workflow #3: iOS.yml</h2>
<p>iOS CI/CD is more complex than Android by nature. This is because Apple's signing requirements involve certificates, provisioning profiles, and entitlements that all need to be in the right place before Xcode will produce a valid archive.</p>
<p>Fastlane helps us handles all of that complexity, and the workflow simply calls into it.</p>
<p>Here is the full <code>ios.yml</code>:</p>
<pre><code class="language-yaml">name: iOS Build &amp; Release

on:
  push:
    branches:
      - develop
      - staging
      - production

jobs:
  ios:
    runs-on: macos-latest
    env:
      FLUTTER_VERSION: 'stable'

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Flutter
        uses: subosito/flutter-action@v2
        with:
          flutter-version: ${{ env.FLUTTER_VERSION }}

      - name: Install dependencies
        run: flutter pub get

      - name: Determine environment
        id: env
        run: |
          echo "branch=\({GITHUB_REF##*/}" &gt;&gt; \)GITHUB_OUTPUT
          if [ "${GITHUB_REF##*/}" = "develop" ]; then
            echo "ENV=dev" &gt;&gt; $GITHUB_OUTPUT
          elif [ "${GITHUB_REF##*/}" = "staging" ]; then
            echo "ENV=staging" &gt;&gt; $GITHUB_OUTPUT
          else
            echo "ENV=production" &gt;&gt; $GITHUB_OUTPUT
          fi

      - name: Generate config (dev)
        if: steps.env.outputs.ENV == 'dev'
        run: ./scripts/generate_config.sh dev "https://dev.api.example.com" "dev_dummy_key"

      - name: Generate config (staging/production)
        if: steps.env.outputs.ENV != 'dev'
        run: |
          if [ "${{ steps.env.outputs.ENV }}" = "staging" ]; then
            ./scripts/generate_config.sh staging \
              "${{ secrets.STAGING_BASE_URL }}" \
              "${{ secrets.STAGING_API_KEY }}"
          else
            ./scripts/generate_config.sh production \
              "${{ secrets.PROD_BASE_URL }}" \
              "${{ secrets.PROD_API_KEY }}"
          fi

      - name: Install Fastlane
        run: |
          cd ios
          gem install bundler
          bundle install

      - name: Import signing certificate
        if: steps.env.outputs.ENV != 'dev'
        run: |
          echo "${{ secrets.IOS_CERTIFICATE_BASE64 }}" | base64 --decode &gt; ios/cert.p12
          security create-keychain -p "" build.keychain
          security import ios/cert.p12 -k build.keychain -P "${{ secrets.IOS_CERTIFICATE_PASSWORD }}" -T /usr/bin/codesign
          security list-keychains -s build.keychain
          security default-keychain -s build.keychain
          security unlock-keychain -p "" build.keychain
          security set-key-partition-list -S apple-tool:,apple: -s -k "" build.keychain

      - name: Install provisioning profile
        if: steps.env.outputs.ENV != 'dev'
        run: |
          echo "${{ secrets.IOS_PROVISIONING_PROFILE_BASE64 }}" | base64 --decode &gt; profile.mobileprovision
          mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
          cp profile.mobileprovision ~/Library/MobileDevice/Provisioning\ Profiles/

      - name: Build iOS (dev)
        if: steps.env.outputs.ENV == 'dev'
        run: flutter build ios --release --no-codesign

      - name: Build &amp; distribute to TestFlight (staging)
        if: steps.env.outputs.ENV == 'staging'
        env:
          APP_STORE_CONNECT_API_KEY_ID: ${{ secrets.APP_STORE_CONNECT_API_KEY_ID }}
          APP_STORE_CONNECT_API_ISSUER_ID: ${{ secrets.APP_STORE_CONNECT_API_ISSUER_ID }}
          APP_STORE_CONNECT_API_KEY_CONTENT: ${{ secrets.APP_STORE_CONNECT_API_KEY_CONTENT }}
        run: |
          cd ios
          bundle exec fastlane beta

      - name: Build &amp; release to App Store (production)
        if: steps.env.outputs.ENV == 'production'
        env:
          APP_STORE_CONNECT_API_KEY_ID: ${{ secrets.APP_STORE_CONNECT_API_KEY_ID }}
          APP_STORE_CONNECT_API_ISSUER_ID: ${{ secrets.APP_STORE_CONNECT_API_ISSUER_ID }}
          APP_STORE_CONNECT_API_KEY_CONTENT: ${{ secrets.APP_STORE_CONNECT_API_KEY_CONTENT }}
        run: |
          cd ios
          bundle exec fastlane release

      - name: Upload Sentry symbols (production only)
        if: steps.env.outputs.ENV == 'production'
        env:
          SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
          SENTRY_ORG: ${{ secrets.SENTRY_ORG }}
          SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }}
        run: ./scripts/upload_symbols.sh $(git rev-parse --short HEAD)

      - name: Notify Team (Success)
        if: success()
        run: |
          echo "iOS Build &amp; Release PASSED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"

      - name: Notify Team (Failure)
        if: failure()
        run: |
          echo "iOS Build &amp; Release FAILED"
          echo "Environment: ${{ steps.env.outputs.ENV }}"
          echo "Branch: ${{ steps.env.outputs.branch }}"
          echo "By: @${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"
          echo "Check the logs and fix the issue before retrying 🔧"
</code></pre>
<p>Let's walk through what is different about this workflow compared to that of android.</p>
<h3 id="heading-1-macos-runner">1. MacOS Runner</h3>
<pre><code class="language-yaml">runs-on: macos-latest
</code></pre>
<p>This is the major difference.</p>
<p>iOS builds require Xcode, which only runs on macOS. GitHub Actions provides hosted macOS runners, but they are significantly more expensive in terms of compute minutes than Ubuntu runners. Just keep that in mind when thinking about build frequency.</p>
<p>No Java setup is needed here. Flutter on iOS compiles through Xcode directly, so the toolchain requirements are different.</p>
<h3 id="heading-2-installing-fastlane">2. Installing Fastlane</h3>
<pre><code class="language-yaml">- name: Install Fastlane
  run: |
    cd ios
    gem install bundler
    bundle install
</code></pre>
<p>Fastlane is a Ruby-based automation tool that handles certificate management, building, and uploading to TestFlight and the App Store.</p>
<p>This step navigates into the <code>ios/</code> directory and installs Fastlane along with all its dependencies as defined in the project's <code>Gemfile</code>.</p>
<p>Your <code>ios/Gemfile</code> should look something like this:</p>
<pre><code class="language-ruby">source "https://rubygems.org"

gem "fastlane"
</code></pre>
<p>And your <code>ios/fastlane/Fastfile</code> should define at minimum two lanes: one for staging (TestFlight) and one for production (App Store):</p>
<pre><code class="language-ruby">default_platform(:ios)

platform :ios do
  lane :beta do
    build_app(scheme: "Runner", export_method: "app-store")
    upload_to_testflight(skip_waiting_for_build_processing: true)
  end

  lane :release do
    build_app(scheme: "Runner", export_method: "app-store")
    upload_to_app_store(force: true, skip_screenshots: true, skip_metadata: true)
  end
end
</code></pre>
<h3 id="heading-3-certificate-and-provisioning-profile-setup">3. Certificate and Provisioning Profile Setup</h3>
<p>This is the step that trips most teams up the first time. Apple's code signing requires two things to be present on the machine:</p>
<ol>
<li><p>The signing certificate (a <code>.p12</code> file)</p>
</li>
<li><p>The provisioning profile</p>
</li>
</ol>
<p>Both are stored as Base64-encoded GitHub secrets and restored at build time.</p>
<pre><code class="language-yaml">- name: Import signing certificate
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.IOS_CERTIFICATE_BASE64 }}" | base64 --decode &gt; ios/cert.p12
    security create-keychain -p "" build.keychain
    security import ios/cert.p12 -k build.keychain -P "${{ secrets.IOS_CERTIFICATE_PASSWORD }}" -T /usr/bin/codesign
    security list-keychains -s build.keychain
    security default-keychain -s build.keychain
    security unlock-keychain -p "" build.keychain
    security set-key-partition-list -S apple-tool:,apple: -s -k "" build.keychain
</code></pre>
<p>Breaking this down step by step:</p>
<ul>
<li><p>Decodes the Base64 certificate and write it to <code>cert.p12</code></p>
</li>
<li><p>Creates a temporary keychain called <code>build.keychain</code> with an empty password</p>
</li>
<li><p>Imports the certificate into that keychain, granting codesign access</p>
</li>
<li><p>Sets it as the default keychain so Xcode finds it automatically</p>
</li>
<li><p>Unlocks the keychain so it can be used non-interactively</p>
</li>
<li><p>Sets partition list to allow access without repeated prompts</p>
</li>
</ul>
<p>The provisioning profile step is simpler:</p>
<pre><code class="language-yaml">- name: Install provisioning profile
  if: steps.env.outputs.ENV != 'dev'
  run: |
    echo "${{ secrets.IOS_PROVISIONING_PROFILE_BASE64 }}" | base64 --decode &gt; profile.mobileprovision
    mkdir -p ~/Library/MobileDevice/Provisioning\ Profiles
    cp profile.mobileprovision ~/Library/MobileDevice/Provisioning\ Profiles/
</code></pre>
<p>It decodes the profile and copies it into the exact directory where Xcode expects to find provisioning profiles on any macOS system.</p>
<p>To encode your certificate and profile locally, you can run these:</p>
<pre><code class="language-bash">base64 -i Certificates.p12 | pbcopy   # for the certificate
base64 -i YourApp.mobileprovision | pbcopy   # for the provisioning profile
</code></pre>
<h3 id="heading-4-building-for-each-environment">4. Building for Each Environment</h3>
<p>Dev builds skip signing entirely. They're built without code signing just to verify the project compiles correctly on a clean machine:</p>
<pre><code class="language-yaml">- name: Build iOS (dev)
  if: steps.env.outputs.ENV == 'dev'
  run: flutter build ios --release --no-codesign
</code></pre>
<p>Staging builds go through Fastlane's <code>beta</code> lane, which builds and uploads to TestFlight. Production builds go through Fastlane's <code>release</code> lane, which submits directly to App Store Connect.</p>
<p>Both staging and production steps consume the same three App Store Connect API key secrets: the key ID, the issuer ID, and the key content itself.</p>
<p>Fastlane uses these to authenticate with Apple's API without requiring a manual Apple ID login.</p>
<h3 id="heading-5-sentry-symbol-upload">5. Sentry Symbol Upload</h3>
<p>On production iOS builds, the <code>upload_symbols.sh</code> script runs after the build completes, passing the current short commit SHA as the release identifier:</p>
<pre><code class="language-yaml">- name: Upload Sentry symbols (production only)
  if: steps.env.outputs.ENV == 'production'
  env:
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
    SENTRY_ORG: ${{ secrets.SENTRY_ORG }}
    SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }}
  run: ./scripts/upload_symbols.sh $(git rev-parse --short HEAD)
</code></pre>
<p>This is the same script explained earlier in the helper scripts section. It creates a Sentry release, uploads the debug information files, and finalizes the release. Any production crash from this point forward will map back to real, readable source code in your Sentry dashboard.</p>
<h2 id="heading-secrets-and-configuration-reference">Secrets and Configuration Reference</h2>
<p>For this entire pipeline to work, you need to configure the following secrets in your GitHub repository. Go to <strong>Settings → Secrets and variables → Actions → New repository secret</strong> to add each one.</p>
<p><strong>Shared (used across environments):</strong></p>
<table>
<thead>
<tr>
<th>Secret</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>FIREBASE_TOKEN</code></td>
<td>Generated via <code>firebase login:ci</code> on your local machine</td>
</tr>
<tr>
<td><code>FIREBASE_ANDROID_APP_ID</code></td>
<td>Android app ID from your Firebase console</td>
</tr>
<tr>
<td><code>FIREBASE_GROUPS</code></td>
<td>Comma-separated tester group names in Firebase</td>
</tr>
<tr>
<td><code>SENTRY_AUTH_TOKEN</code></td>
<td>Auth token from your Sentry account settings</td>
</tr>
<tr>
<td><code>SENTRY_ORG</code></td>
<td>Your Sentry organization slug</td>
</tr>
<tr>
<td><code>SENTRY_PROJECT</code></td>
<td>Your Sentry project slug</td>
</tr>
</tbody></table>
<p><strong>Staging:</strong></p>
<table>
<thead>
<tr>
<th>Secret</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>STAGING_BASE_URL</code></td>
<td>Your staging API base URL</td>
</tr>
<tr>
<td><code>STAGING_API_KEY</code></td>
<td>Your staging API or encryption key</td>
</tr>
</tbody></table>
<p><strong>Production:</strong></p>
<table>
<thead>
<tr>
<th>Secret</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>PROD_BASE_URL</code></td>
<td>Your production API base URL</td>
</tr>
<tr>
<td><code>PROD_API_KEY</code></td>
<td>Your production API or encryption key</td>
</tr>
</tbody></table>
<p><strong>Android:</strong></p>
<table>
<thead>
<tr>
<th>Secret</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>ANDROID_KEYSTORE_BASE64</code></td>
<td>Base64-encoded <code>.jks</code> keystore file</td>
</tr>
<tr>
<td><code>GOOGLE_PLAY_SERVICE_ACCOUNT_JSON</code></td>
<td>Full JSON content of your Play Console service account</td>
</tr>
</tbody></table>
<p><strong>iOS:</strong></p>
<table>
<thead>
<tr>
<th>Secret</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>IOS_CERTIFICATE_BASE64</code></td>
<td>Base64-encoded <code>.p12</code> signing certificate</td>
</tr>
<tr>
<td><code>IOS_CERTIFICATE_PASSWORD</code></td>
<td>Password protecting the <code>.p12</code> file</td>
</tr>
<tr>
<td><code>IOS_PROVISIONING_PROFILE_BASE64</code></td>
<td>Base64-encoded <code>.mobileprovision</code> file</td>
</tr>
<tr>
<td><code>APP_STORE_CONNECT_API_KEY_ID</code></td>
<td>Key ID from App Store Connect → Users &amp; Access → Keys</td>
</tr>
<tr>
<td><code>APP_STORE_CONNECT_API_ISSUER_ID</code></td>
<td>Issuer ID from the same App Store Connect page</td>
</tr>
<tr>
<td><code>APP_STORE_CONNECT_API_KEY_CONTENT</code></td>
<td>The full content of the downloaded <code>.p8</code> key file</td>
</tr>
</tbody></table>
<p>None of these values should ever appear in your codebase. If any secret is accidentally committed, rotate it immediately.</p>
<h2 id="heading-end-to-end-flow">End-to-End Flow</h2>
<p>With all three workflows in place, here is exactly what happens from the moment a developer opens a pull request to the moment a user receives an update:</p>
<h3 id="heading-1-developer-opens-a-pr-into-develop">1. Developer Opens a PR into <code>develop</code></h3>
<p>The <code>pr_checks.yml</code> workflow fires. It runs formatting checks, static analysis, and the full test suite. If anything fails, the PR cannot be merged and the team is notified immediately. The developer fixes the issues and pushes again, which triggers a fresh run.</p>
<h3 id="heading-2-pr-is-approved-and-merged-into-develop">2. PR is Approved and Merged into <code>develop</code></h3>
<p>The <code>android.yml</code> and <code>ios.yml</code> workflows both fire on the push event. They detect the environment as <code>dev</code>, inject placeholder config, build unsigned artifacts, and upload them to Firebase App Distribution. Testers receive an email and can install the build on their devices within minutes – no one shared a file manually.</p>
<h3 id="heading-3-develop-is-merged-into-staging">3. <code>develop</code> is Merged into <code>staging</code></h3>
<p>Both platform workflows fire again. This time the environment resolves to <code>staging</code>. Real secrets are injected, builds are properly signed, and the artifacts go to Firebase App Distribution (Android) and TestFlight (iOS). QA begins testing the staging build against the staging API.</p>
<h3 id="heading-4-staging-is-merged-into-production">4. <code>staging</code> is merged into <code>production</code></h3>
<p>Both workflows fire one final time. Production secrets are injected, builds are obfuscated and signed, debug symbols are uploaded to Sentry, and the final artifacts are submitted to the Google Play Store and App Store Connect. The release goes live on Apple and Google's review timelines with no further human intervention required.</p>
<p>From that first PR to a production submission, not a single command was run manually.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building this pipeline is an upfront investment that pays off from the very first release cycle. What used to be a sequence of error-prone manual steps building locally, signing, uploading, switching configs, and hoping nothing was mixed up is now a fully automated, auditable, and repeatable process that runs the moment code moves between branches.</p>
<p>The architecture we built here does more than just automate builds. The PR quality gate enforces team standards consistently, so code review becomes a conversation about intent rather than a hunt for formatting issues. The environment-aware config injection eliminates an entire class of production incidents where staging keys made it into a live release. The Sentry symbol upload means your team can debug production crashes with full source visibility even from an obfuscated binary.</p>
<p>Every piece of this pipeline also runs locally. The helper scripts in the <code>scripts/</code> folder are plain Bash so you can call them from your terminal the same way CI calls them. This eliminates the frustrating cycle of pushing a commit just to test a pipeline change.</p>
<p>As your team grows, this foundation scales with you. You can extend the <code>pr_checks.yml</code> to enforce code coverage thresholds, add a performance benchmarking job, or introduce a dedicated security scanning step. You can extend the platform workflows to support multiple flavors, multiple Firebase projects, or staged rollouts on the Play Store. The architecture stays the same – you're just adding new steps to an already working system.</p>
<p>This ensures that standards are met, code quality remains high, you have a proper team structure, clear process and automated post development activities are in place – and at the end of the day, you'll have an optimized engineering approach that will help your team in so many ways.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
