<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Osomudeya Zudonu - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Osomudeya Zudonu - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Fri, 05 Jun 2026 20:26:24 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/Osomudeya/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Use Bash & Python for Real DevOps Automation – Full Handbook with 5 Production Use Cases ]]>
                </title>
                <description>
                    <![CDATA[ Automation scripts often validate process completion instead of system health. A Kubernetes pod can be running while the application inside it can't authenticate to the database. A Terraform deploymen ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-bash-python-for-real-devops-automation-handbook-with-production-use-cases/</link>
                <guid isPermaLink="false">6a171310badcd8afcb060460</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Bash ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Wed, 27 May 2026 15:51:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/73f2a745-c1b5-4cbb-8f97-2ba6c5230592.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Automation scripts often validate process completion instead of system health.</p>
<p>A Kubernetes pod can be running while the application inside it can't authenticate to the database. A Terraform deployment can return clean while someone has manually changed infrastructure in the cloud console. A canary rollout can show zero errors while users wait five seconds for every request.</p>
<p>The problem isn't the tooling. The problem is that the system can look healthy when it really is not.</p>
<p>This handbook walks through five production-style automation scenarios using Bash and Python for:</p>
<ul>
<li><p>Detecting abnormal AWS spend before the monthly invoice arrives</p>
</li>
<li><p>Correlating logs across multiple services using trace IDs</p>
</li>
<li><p>Finding infrastructure drift outside Terraform</p>
</li>
<li><p>Validating secret rotation at the application level</p>
</li>
<li><p>Automatically rolling back slow deployments before users complain</p>
</li>
</ul>
<p>By the end of this handbook, you'll be able to build small scripts that help you notice when something is wrong in a system, even when the tools say everything is fine.</p>
<p>The scripts are intentionally small. The important part is the operational thinking behind them: what signal the script measures, what failure mode it can detect, and what assumptions the platform is making underneath.</p>
<p>Each use case includes a runnable demo environment, the complete script, a breakdown of the system behaviour involved, and an intentional failure you can trigger yourself.</p>
<p>If you're new to this workflow, start with use case 1 and work forward. The later sections build on the same pattern: automation is useful when it verifies reality, not just process completion.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, set up the following:</p>
<ul>
<li><p><strong>Python 3.8 or higher</strong> – check with <code>python3 --version</code></p>
</li>
<li><p><strong>A Python virtual environment</strong> – create one before installing anything:</p>
</li>
</ul>
<pre><code class="language-plaintext">python3 -m venv venv
source venv/bin/activate

# On Windows, run this instead of the line above:
venv\Scripts\activate
</code></pre>
<p>This keeps your installed packages isolated from your system Python and prevents permission errors on shared machines.</p>
<ul>
<li><p><strong>pip</strong> – Python's package installer, included with Python</p>
</li>
<li><p><strong>AWS CLI</strong> configured with a working profile – a free-tier AWS account is enough for use cases 1, 3, and 4. Verify it's working with:</p>
<pre><code class="language-plaintext">aws sts get-caller-identity
</code></pre>
</li>
<li><p><strong>Docker and Docker Compose</strong> – needed for use cases 2, 4, and 5</p>
</li>
<li><p><strong>Kind</strong> (Kubernetes in Docker) – a way to run Kubernetes locally for use cases 4 and 5. Install with <code>brew install kind</code> on macOS, or follow the <a href="https://kind.sigs.k8s.io/docs/user/quick-start/">Kind quick start guide</a></p>
</li>
<li><p><strong>kubectl</strong> – the command-line tool for talking to a Kubernetes cluster. After installing Kind, run <code>kind create cluster</code> and kubectl is configured automatically</p>
</li>
<li><p><strong>Helm</strong> – a package manager for Kubernetes, needed for use case 5. Install with <code>brew install helm</code> or the <a href="https://helm.sh/docs/intro/install/">Helm install guide</a></p>
</li>
<li><p><strong>Terraform</strong> – needed for use case 3. Install with <code>brew install terraform</code> on macOS or follow the <a href="https://developer.hashicorp.com/terraform/install">Terraform install guide</a>. Check with <code>terraform version</code>.</p>
</li>
<li><p><strong>bc</strong> – a calculator utility used by the canary watch scripts for floating-point comparison. Install with <code>brew install bc</code> on macOS or <code>apt install bc</code> on Ubuntu. Run <code>bc --version</code> to confirm it is available before starting use case 5.</p>
</li>
</ul>
<h3 id="heading-knowledge-and-skills">Knowledge and Skills</h3>
<ul>
<li><p>You should be comfortable reading Python and Bash scripts without needing to write them from scratch.</p>
</li>
<li><p>You should have basic Linux terminal comfort – navigating directories, running scripts, reading output, and so on.</p>
</li>
<li><p>You should know what Kubernetes pods and deployments are at a basic level – you don't need deep Kubernetes expertise, as use cases 4 and 5 will introduce the Kubernetes concepts they rely on as they go.</p>
</li>
<li><p>Familiarity with AWS basics, such as what EC2, IAM, and Secrets Manager are, will help with use cases 1, 3, and 4. Use case 2 runs entirely on your local machine and requires no AWS knowledge at all.</p>
</li>
<li><p>For use case 3, knowing what Terraform is and what a state file does will help. You don't need to write any Terraform, but understanding that Terraform tracks what it created is the foundation of the whole use case.</p>
</li>
</ul>
<h3 id="heading-aws-iam-permissions-required">AWS IAM Permissions Required</h3>
<p>The scripts in this article make real AWS API calls. Your IAM user or role needs the following minimum permissions. If you see an <code>AccessDenied</code> error, this is the first place to look.</p>
<table>
<thead>
<tr>
<th>Use Case</th>
<th>Required IAM Permission</th>
</tr>
</thead>
<tbody><tr>
<td>1 - Cost Anomaly Detection</td>
<td><code>ce:GetCostAndUsage</code></td>
</tr>
<tr>
<td>3 - Drift Detection</td>
<td><code>ec2:DescribeSecurityGroups</code></td>
</tr>
<tr>
<td>4 - Secrets Rotation</td>
<td><code>secretsmanager:GetSecretValue</code>, <code>secretsmanager:PutSecretValue</code></td>
</tr>
</tbody></table>
<p>If you're using a fresh AWS free-tier account with <code>AdministratorAccess</code> attached, these permissions are already included and you can skip this step.</p>
<p>If you're on a restricted IAM user, here's how to add them. In the AWS Console, go to IAM, click Users, then click your username. Under the Permissions tab, click Add permissions, then Create inline policy.</p>
<p>Switch to the JSON tab and paste a policy document granting the permissions in the table above, then save it.</p>
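<p>If it helps to see the shape of that policy, here's a minimal sketch that builds one from the actions in the table and prints the JSON to paste. The <code>build_policy</code> helper is made up for this illustration, and <code>"Resource": "*"</code> is an assumption that keeps the lab simple. Scope it down for anything beyond a sandbox account.</p>
<pre><code class="language-python">import json

# Actions copied from the permissions table above.
ACTIONS = [
    "ce:GetCostAndUsage",
    "ec2:DescribeSecurityGroups",
    "secretsmanager:GetSecretValue",
    "secretsmanager:PutSecretValue",
]


def build_policy(actions):
    """Return an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # Assumption: "*" keeps the lab simple; narrow it for real use.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(ACTIONS), indent=2))
</code></pre>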
<p>If your company manages AWS through an organization and you don't have permission to edit your own IAM policies, ask your administrator to add these permissions to your role.</p>
<h3 id="heading-companion-github-repository">Companion GitHub Repository</h3>
<p>All demo projects live at: <a href="https://github.com/Osomudeya/devops-scripting-labs"><strong>https://github.com/Osomudeya/devops-scripting-labs</strong></a></p>
<p>Each use case has its own numbered folder with the complete script, supporting files, a <code>setup.sh</code> to prepare the environment, and a <code>break_it.sh</code> that injects the specific failure each use case is built around.</p>
<p>Clone the repo before starting:</p>
<pre><code class="language-plaintext">git clone https://github.com/Osomudeya/devops-scripting-labs
cd devops-scripting-labs
</code></pre>
<p>Before running any use case, check that you have everything installed:</p>
<pre><code class="language-plaintext">./preflight.sh
</code></pre>
<p>This checks for every tool the lab needs (Python, AWS CLI, Docker, Kind, Helm, Terraform, and <code>bc</code>) and tells you exactly what's missing, with the install command for each one.</p>
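<p>If you're curious what a check like this involves, here's a rough Python sketch of the same idea. It isn't the repo's <code>preflight.sh</code>. The <code>find_missing</code> helper and the list of command names are assumptions based on the prerequisites above, and it only reports what's missing without suggesting install commands.</p>
<pre><code class="language-python">import shutil

# Assumed command names for the tools listed in the prerequisites.
REQUIRED_TOOLS = [
    "python3", "aws", "docker", "kind", "kubectl", "helm", "terraform", "bc",
]


def find_missing(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
</code></pre>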
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-use-case-1-cost-anomaly-detection">Use Case 1 - Cost Anomaly Detection</a></p>
</li>
<li><p><a href="#heading-use-case-2-log-correlation-across-services">Use Case 2 - Log Correlation Across Services</a></p>
</li>
<li><p><a href="#heading-use-case-3-infrastructure-drift-detection">Use Case 3 - Infrastructure Drift Detection</a></p>
</li>
<li><p><a href="#heading-use-case-4-secrets-rotation-with-zero-downtime">Use Case 4 - Secrets Rotation with Zero Downtime</a></p>
</li>
<li><p><a href="#heading-use-case-5-automated-canary-rollback-trigger">Use Case 5 - Automated Canary Rollback Trigger</a></p>
</li>
<li><p><a href="#heading-what-you-can-do-now">What You Can Do Now</a></p>
</li>
</ul>
<h2 id="heading-use-case-1-cost-anomaly-detection">Use Case 1 - Cost Anomaly Detection</h2>
<p><strong>Environment:</strong> AWS Cost Explorer API (read-only, available in all accounts)<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p>A junior engineer is testing a Kubernetes configuration. They spin up a managed node group in AWS (a set of EC2 virtual machines that the Kubernetes cluster uses to run workloads) and configure the cluster autoscaler, which is the Kubernetes component responsible for adding more machines when the cluster needs more capacity. The test goes well, and on Friday afternoon, they forget to tear the environment down.</p>
<p>Over the weekend, the autoscaler keeps provisioning new nodes because the test workloads are still running and requesting resources. By Monday morning you have a node group that has been quietly growing for two and a half days, and nobody notices until the invoice lands three weeks later.</p>
<p>The script in this use case exists because your AWS bill isn't just a monthly number. It's a time series, and you can monitor it the same way you monitor application metrics. Check it daily, know your baseline, and you catch this kind of event in hours instead of weeks.</p>
<h3 id="heading-whats-actually-happening-at-the-system-level">What's Actually Happening at the System Level</h3>
<p><strong>What this is not:</strong> This isn't a finance dashboard. It's an operational anomaly detector. The signal it monitors is cost, but what it actually detects is unexpected infrastructure behavior such as resources left running, autoscaler events, and forgotten environments.</p>
<p>AWS Cost Explorer is a service that stores your billing data and exposes it through an API. When you call it, you're running a query against your account's billing records, specifying the time range, the granularity, and how you want results grouped.</p>
<p>One thing to know before you start investigating any flagged cost is that AWS decides which service category to put a charge under, not you. An EBS snapshot copy running across regions might appear under the EC2 line item rather than data transfer, which means a spike in EC2 spend doesn't necessarily mean something went wrong with your EC2 instances. The script flags the spike correctly, but investigating it means asking <em>"what changed in my infrastructure on this date"</em> rather than <em>"what is running in EC2 right now."</em></p>
<p>The billing label is a starting point, not a diagnosis.</p>
<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>01-cost-anomaly/</code> in the <a href="https://github.com/Osomudeya/devops-scripting-labs">companion repo</a>. No cluster setup is needed for this use case because the script runs against your AWS account directly, and the only dependency is boto3:</p>
<pre><code class="language-plaintext">cd 01-cost-anomaly
pip install boto3
</code></pre>
<p>Before running against your real account, make sure your AWS credentials are configured. The script uses whatever credentials the AWS CLI is set up with. If you haven't done this yet:</p>
<pre><code class="language-plaintext">aws configure
</code></pre>
<p>This will ask for your AWS Access Key ID, Secret Access Key, default region (use <code>us-east-1</code> if unsure), and output format (type <code>json</code>). You can find your access keys in the AWS Console under IAM → Users → your username → Security credentials → Create access key.</p>
<p>Your account also needs the <code>ce:GetCostAndUsage</code> permission. If you're on a fresh account with AdministratorAccess, that's already included.</p>
<p>If you have an AWS account with a few weeks of billing history, you can run the script directly against your real data:</p>
<pre><code class="language-plaintext">python detect_cost_anomaly.py
</code></pre>
<p>Two things to know before running against a real account. First, Cost Explorer data has a 24-hour lag. This means spend from today won't appear until tomorrow, so the script automatically excludes the most recent day to avoid incomplete results.</p>
<p>Second, the script uses unblended costs, which is what you actually pay on a single-account setup. Blended costs are a weighted average used in multi-account organisations sharing reserved capacity and will give different numbers.</p>
<p>If you have a new account or prefer not to use real billing data, the script includes a <code>--sample</code> flag that uses built-in data and calls no AWS APIs at all.<br>Run this first to see what the output looks like before reading the code:</p>
<pre><code class="language-plaintext">python detect_cost_anomaly.py --sample
</code></pre>
<h3 id="heading-the-script">The Script</h3>
<pre><code class="language-python">#!/usr/bin/env python3
# detect_cost_anomaly.py — Use Case 1: Cost Anomaly Detection
# Full explanation of every function is in the article.

import statistics
import sys
from datetime import datetime, timedelta

import boto3

def build_sample_data(days=30):
    """Synthetic Cost Explorer rows for the last `days` (ending yesterday).

    The EC2 spike is placed on yesterday (device local date) so sample output
    always matches the same window as live Cost Explorer mode.
    """
    last_day = datetime.today().date() - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    anomaly_day_index = days - 1
    results = []
    for i in range(days):
        day = first_day + timedelta(days=i)
        d = i + 1
        results.append(
            {
                "TimePeriod": {
                    "Start": str(day),
                    "End": str(day + timedelta(days=1)),
                },
                "Groups": [
                    {
                        "Keys": ["Amazon EC2"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(
                                    round(
                                        18.50
                                        if i == anomaly_day_index
                                        else 1.10 + (d % 3) * 0.10,
                                        2,
                                    )
                                )
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon S3"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.04 + (d % 5) * 0.01, 2))
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon RDS"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.85 + (d % 4) * 0.05, 2))
                            }
                        },
                    },
                ],
            }
        )
    return results, str(last_day)


def get_daily_costs(days=30):
    ce = boto3.client("ce", region_name="us-east-1")
    end = datetime.today().date() - timedelta(days=1)
    start = end - timedelta(days=days)
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": str(start), "End": str(end)},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return response["ResultsByTime"]


def build_service_timeseries(results):
    services = {}
    for day in results:
        date_str = day["TimePeriod"]["Start"]
        for group in day["Groups"]:
            service = group["Keys"][0]
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            if service not in services:
                services[service] = []
            services[service].append({"date": date_str, "cost": cost})
    return services


def detect_anomalies(services, baseline_days=7, multiplier=2.0, recent_days=None):
    """Flag days where cost exceeds prior `baseline_days` average + 2σ.

    Uses a rolling baseline (each day vs the previous week). If `recent_days`
    is set, only returns anomalies on or after today - recent_days.
    """
    cutoff = None
    if recent_days is not None:
        cutoff = datetime.today().date() - timedelta(days=recent_days)

    anomalies = []
    for service, daily in services.items():
        if len(daily) &lt; baseline_days + 1:
            continue
        for i in range(baseline_days, len(daily)):
            day = daily[i]
            day_date = datetime.strptime(day["date"], "%Y-%m-%d").date()
            if cutoff is not None and day_date &lt; cutoff:
                continue
            baseline_costs = [d["cost"] for d in daily[i - baseline_days : i]]
            avg = statistics.mean(baseline_costs)
            if avg &lt; 0.01:
                continue
            try:
                std = statistics.stdev(baseline_costs)
            except statistics.StatisticsError:
                continue
            threshold = avg + (multiplier * std)
            if day["cost"] &gt; threshold:
                anomalies.append(
                    {
                        "service": service,
                        "date": day["date"],
                        "actual": round(day["cost"], 4),
                        "baseline_avg": round(avg, 4),
                        "threshold": round(threshold, 4),
                        "pct_above": round(((day["cost"] - avg) / avg) * 100, 1),
                    }
                )
    return sorted(anomalies, key=lambda x: x["date"])


def parse_args(argv):
    use_sample = "--sample" in argv
    recent_days = None
    for arg in argv[1:]:
        if arg.startswith("--recent-days="):
            recent_days = int(arg.split("=", 1)[1])
    return use_sample, recent_days


def run(use_sample=False, recent_days=None):
    if use_sample:
        results, anomaly_date = build_sample_data()
        print("Running against sample data (--sample mode).")
        print(
            f"This data represents 30 days of billing ending yesterday, "
            f"with a realistic EC2 anomaly on {anomaly_date}.\n"
        )
    else:
        print("Fetching 30 days of daily AWS costs by service...")
        print("Note: today is excluded — Cost Explorer has a 24-hour billing lag.\n")
        results = get_daily_costs(days=30)

    if recent_days is not None:
        since = datetime.today().date() - timedelta(days=recent_days)
        print(
            f"Checking for spikes in the last {recent_days} days only "
            f"(on or after {since}), each vs its prior 7-day average.\n"
        )

    services = build_service_timeseries(results)
    anomalies = detect_anomalies(services, recent_days=recent_days)

    if not anomalies:
        print("No anomalies detected.")
        print("\nNote: this script flags statistical outliers against your own baseline.")
        print("A consistently elevated spend level will not trigger — only sudden increases.")
        return

    print(f"{'=' * 60}")
    print(f"ANOMALIES DETECTED: {len(anomalies)}")
    print(f"{'=' * 60}\n")

    for a in anomalies:
        print(f"Service:      {a['service']}")
        print(f"Date:         {a['date']}")
        print(f"Actual cost:  ${a['actual']}")
        print(f"Baseline avg: ${a['baseline_avg']} (prior 7-day average)")
        print(f"Threshold:    ${a['threshold']}")
        print(f"Overage:      {a['pct_above']}% above baseline")
        print()

    print("=" * 60)
    print("A note on AWS cost attribution:")
    print("The service label in Cost Explorer is assigned by AWS, not by the resource")
    print("that caused the cost. An EC2 spike may be caused by EBS snapshot copies,")
    print("cross-region data transfer, or autoscaling events that AWS categorizes under")
    print("EC2 in billing — not a running EC2 instance you can find in the console.")
    print()
    print("Before investigating the flagged service directly, ask:")
    print("What changed in my infrastructure on or before the flagged date?")
    print("Work backward from the operational change, not forward from the billing label.")


if __name__ == "__main__":
    use_sample, recent_days = parse_args(sys.argv)
    run(use_sample=use_sample, recent_days=recent_days)
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>get_daily_costs</code> pulls your AWS billing data for the last 30 days, one entry per day, with the costs grouped by service.</p>
<p><code>build_service_timeseries</code> takes the raw data from AWS and reorganises it. AWS groups the data by day first, then by service. This function flips that around so each service has its own list of daily costs, which is what the detection step needs to work with.</p>
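<p>To make the reshaping concrete, here's a small sketch that runs the same pivot on two made-up days of data. <code>pivot_by_service</code> is a compact stand-in for <code>build_service_timeseries</code>, and the dates and amounts are invented for illustration.</p>
<pre><code class="language-python">def pivot_by_service(results):
    """Turn day-first Cost Explorer rows into one list of daily costs per service."""
    services = {}
    for day in results:
        date_str = day["TimePeriod"]["Start"]
        for group in day["Groups"]:
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            services.setdefault(group["Keys"][0], []).append(
                {"date": date_str, "cost": cost}
            )
    return services


# Two hypothetical days in the same shape as the ResultsByTime entries.
rows = [
    {"TimePeriod": {"Start": "2026-05-01"}, "Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "1.20"}}},
        {"Keys": ["Amazon S3"], "Metrics": {"UnblendedCost": {"Amount": "0.05"}}},
    ]},
    {"TimePeriod": {"Start": "2026-05-02"}, "Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"UnblendedCost": {"Amount": "1.30"}}},
        {"Keys": ["Amazon S3"], "Metrics": {"UnblendedCost": {"Amount": "0.06"}}},
    ]},
]

by_service = pivot_by_service(rows)
for name, daily in by_service.items():
    # Each service now has its own date-ordered list of costs.
    print(name, [entry["cost"] for entry in daily])
</code></pre>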
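<p><code>detect_anomalies</code> is where the actual check happens. For each service, it compares each day's spend to the 7 days right before it. If a day's cost exceeds the average of those 7 days plus two standard deviations, the script flags it. Services whose baseline average is under $0.01 are skipped, so near-zero noise doesn't trigger alerts. That's all it does.</p>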
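<p>The arithmetic behind the check is small enough to run by hand. This sketch assumes the seven EC2 values that the built-in sample data produces for the week before the spike, and applies the same average-plus-two-standard-deviations rule that <code>detect_anomalies</code> uses by default.</p>
<pre><code class="language-python">import statistics

# The seven sample EC2 costs that precede the spike (from build_sample_data).
baseline = [1.30, 1.10, 1.20, 1.30, 1.10, 1.20, 1.30]
spike = 18.50

avg = statistics.mean(baseline)
std = statistics.stdev(baseline)   # sample standard deviation
threshold = avg + 2.0 * std        # multiplier=2.0 is the script's default

print(f"average:   {avg:.4f}")
print(f"std dev:   {std:.4f}")
print(f"threshold: {threshold:.4f}")
print(f"excess:    {spike - threshold:.4f}")
</code></pre>
<p>With a baseline this flat, the threshold sits well under a dollar above the average, which is why a jump to $18.50 clears it by a wide margin. Raising the multiplier makes the detector less sensitive, and lowering it makes it noisier.</p>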
<p><code>--recent-days=7</code> means <em>"only show me anomalies from the last 7 days."</em> The script still fetches 30 days of data because it needs that history to calculate the comparison, but the results are filtered to the window you care about. This is good for a quick Monday morning check.</p>
<p><code>--sample</code> runs without touching your AWS account at all. It uses built-in fake billing data with a spike baked into yesterday's date so the detection always fires. Use this first to see what the output looks like before connecting it to real data.</p>
<h3 id="heading-what-the-output-looks-like">What the Output Looks Like</h3>
<p>Running <code>--sample</code> (the spike date will show as yesterday's actual date, not a fixed value):</p>
<pre><code class="language-plaintext">Running against sample data (--sample mode).
This data represents 30 days of billing ending yesterday, with a realistic EC2 anomaly on 2026-05-14.

============================================================
ANOMALIES DETECTED: 1
============================================================

Service:      Amazon EC2
Date:         2026-05-14
Actual cost:  $18.5
Baseline avg: $1.2143 (prior 7-day average)
Threshold:    $1.3942
Overage:      1423.5% above baseline

============================================================
A note on AWS cost attribution:
The service label in Cost Explorer is assigned by AWS, not by the resource
that caused the cost. An EC2 spike may be caused by EBS snapshot copies,
cross-region data transfer, or autoscaling events that AWS categorizes under
EC2 in billing - not a running EC2 instance you can find in the console.

Before investigating the flagged service directly, ask:
What changed in my infrastructure on or before the flagged date?
Work backward from the operational change, not forward from the billing label.
</code></pre>
<p>The dates in your output will differ from the above because the sample data is generated relative to today. The spike always lands on yesterday's date.</p>
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make for You</h3>
<p>The anomaly is on the EC2 line, and the instinct is to go look at running EC2 instances. But as the output warns, the attribution is AWS's choice, not yours.</p>
<p>Before opening the EC2 console, check your deployment history for that date. What was deployed? Was a new environment created? Did an autoscaler event run? Start from the operational change and follow the thread to the billing data, because starting from the billing label and working backward is slower and frequently misleading.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<pre><code class="language-bash"># See the spike immediately with no AWS account needed
python detect_cost_anomaly.py --sample

# Run against your real account
python detect_cost_anomaly.py

# Only show anomalies from the last 7 days, good for a quick this-week check
python detect_cost_anomaly.py --recent-days=7

# Combine both flags - sample data filtered to the last 7 days
python detect_cost_anomaly.py --sample --recent-days=7
</code></pre>
<p><strong>If your real account returns "No anomalies detected", that's not a failure.</strong> It means your spend has been consistent. A clean account returns clean output. The script is doing exactly what it should.</p>
<p>When a real event happens on your account, such as an autoscaler left running, a forgotten environment, or an unexpected data transfer, this is what catches it before the invoice does.</p>
<h2 id="heading-use-case-2-log-correlation-across-services">Use Case 2 – Log Correlation Across Services</h2>
<p><strong>Environment:</strong> Fully local – Docker Compose, three Python services<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p>A user reports that their payment failed. You open your logging tool and search. The auth service logged a successful authentication. The ledger service logged a successful transaction. But the notification service, which should have sent a payment confirmation email, has logged nothing at all.</p>
<p>Two services reported success while one service stayed silent. The payment still failed, and you have three logs and no clear answer about where the chain broke.</p>
<h3 id="heading-whats-actually-happening-at-the-system-level">What's Actually Happening at the System Level</h3>
<p><strong>What this is not:</strong> This isn't a guide to installing a log aggregation tool. It's about the data structure that makes log correlation possible in the first place and what happens when that structure breaks on one service's error path.</p>
<p>In a system with a single service, debugging is simple: one service, one log file, one timeline. But when a user request passes through multiple services, you need a way to link all the logs together. That link is called a trace ID.</p>
<p>Think of it like a ticket number at a government office. When you walk in, you get a number, say, A247. Every desk that handles your case writes A247 on your file. If something goes wrong, the manager pulls every record with A247 and sees exactly what happened, in order, across every desk. That is a trace ID. One number, shared across every service that touched the request.</p>
<p>In the demo, when a payment comes in, the auth service creates a unique ID for it. Every log line that auth, ledger, and notification write for that payment includes the same ID. When something breaks, you run <code>correlate.py</code> with that ID and it finds every related log line across all three services and sorts them by time:</p>
<pre><code class="language-plaintext">python correlate.py pay-abc123
</code></pre>
<p>Here's what those logs look like. Notice that every line has the same <code>trace_id</code>:</p>
<pre><code class="language-json">{"timestamp": "2026-05-01T14:23:01.234Z", "trace_id": "pay-abc123", "service": "auth", "event": "user_authenticated", "level": "INFO", "user_id": "u_789", "duration_ms": 12}
{"timestamp": "2026-05-01T14:23:01.891Z", "trace_id": "pay-abc123", "service": "ledger", "event": "transaction_recorded", "level": "INFO", "amount": 50.0, "currency": "USD"}
{"timestamp": "2026-05-01T14:23:02.103Z", "trace_id": "pay-abc123", "service": "notification", "event": "email_queued", "level": "INFO", "recipient": "user@example.com"}
</code></pre>
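<p>The core of that lookup can be sketched in a few lines. This is not the repo's <code>correlate.py</code>, just a minimal illustration of the idea: parse each line as JSON, keep the ones whose <code>trace_id</code> matches, and sort them by timestamp. The <code>correlate</code> function and the second trace ID (<code>pay-zzz999</code>) are made up for the example.</p>
<pre><code class="language-python">import json


def correlate(lines, trace_id):
    """Return the parsed events for one trace ID, oldest first."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not structured JSON, so it cannot be matched
        if isinstance(event, dict) and event.get("trace_id") == trace_id:
            events.append(event)
    # ISO 8601 timestamps in one format and timezone sort correctly as strings.
    return sorted(events, key=lambda e: e.get("timestamp", ""))


# Log lines from three services, deliberately out of order.
logs = [
    '{"timestamp": "2026-05-01T14:23:02.103Z", "trace_id": "pay-abc123", "service": "notification", "event": "email_queued"}',
    '{"timestamp": "2026-05-01T14:23:01.234Z", "trace_id": "pay-abc123", "service": "auth", "event": "user_authenticated"}',
    '{"timestamp": "2026-05-01T14:23:01.500Z", "trace_id": "pay-zzz999", "service": "auth", "event": "user_authenticated"}',
    '{"timestamp": "2026-05-01T14:23:01.891Z", "trace_id": "pay-abc123", "service": "ledger", "event": "transaction_recorded"}',
]

for event in correlate(logs, "pay-abc123"):
    print(event["timestamp"], event["service"], event["event"])
</code></pre>
<p>Note that a line has to parse as JSON and carry a <code>trace_id</code> to be matched at all. Anything else is skipped.</p>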
<p>Now here's what breaks it. The notification service hits a timeout connecting to the email provider. The developer who wrote the error handler forgot to include the trace ID, so instead of a proper log line, it writes this:</p>
<pre><code class="language-plaintext">2026-05-01T14:23:02.415Z ERROR Connection timeout to email provider smtp.example.com:587
</code></pre>
<p>The error happened and the log line exists. But because it has no <code>trace_id</code>, <code>correlate.py</code> can't find it.</p>
<p>The notification service still appears in the timeline, and you can see <code>email_send_attempt</code> – but <code>email_queued</code> never follows it.</p>
<pre><code class="language-plaintext">Timeline — 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete
</code></pre>
<p>The attempt is there but the failure is not. The developer just forgot one field.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/22b7d7b0-8ae5-4573-bcb0-faaf5d807e8a.png" alt="log correlation attempt terminal output - ERROR Connection timeout" style="display:block;margin:0 auto" width="1041" height="82" loading="lazy">
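<p>One way to guard against this kind of gap is to route every log call, including the ones in error handlers, through a single helper that requires the trace ID. Here's a minimal sketch of that idea. The <code>log_event</code> and <code>notify</code> functions, the <code>send_email</code> stub, and the <code>email_send_failed</code> event name are all hypothetical and not taken from the demo services.</p>
<pre><code class="language-python">import json
from datetime import datetime, timezone


def log_event(service, trace_id, event, level="INFO", **fields):
    """Build one JSON log line that always carries the trace ID."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "service": service,
        "event": event,
        "level": level,
    }
    record.update(fields)
    return json.dumps(record)


def send_email(recipient):
    """Hypothetical stand-in for the provider call; it always times out here."""
    raise TimeoutError("Connection timeout to email provider")


def notify(trace_id, recipient):
    """Attempt to send the email and return the log lines that were produced."""
    lines = [log_event("notification", trace_id, "email_send_attempt", recipient=recipient)]
    try:
        send_email(recipient)
        lines.append(log_event("notification", trace_id, "email_queued", recipient=recipient))
    except TimeoutError as exc:
        # The error path goes through the same helper, so the trace ID stays attached.
        lines.append(
            log_event("notification", trace_id, "email_send_failed",
                      level="ERROR", error=str(exc))
        )
    return lines


for line in notify("pay-abc123", "user@example.com"):
    print(line)
</code></pre>
<p>Because the helper takes <code>trace_id</code> as a required argument, a handler can't write a log line without one.</p>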

<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>02-log-correlation/</code> and start the three services:</p>
<pre><code class="language-plaintext">cd 02-log-correlation
docker compose up -d
</code></pre>
<p>This starts the auth, ledger, and notification services. Trigger a payment request to generate some logs:</p>
<pre><code class="language-plaintext">./trigger_request.sh
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/06977757-e7fd-43c3-aaca-dbc6d17c951a.png" alt="trigger_request.sh terminal output - also showing the traceid" style="display:block;margin:0 auto" width="630" height="131" loading="lazy">

<p>The script prints the trace ID it used. Copy that ID and run the correlation script against it now, before we break anything, to see the full working path:</p>
<pre><code class="language-plaintext">python correlate.py pay-5831e1bf
</code></pre>
<p>You should see something like this (your trace ID will be different but the structure is the same):</p>
<pre><code class="language-plaintext">Loading logs from ./logs/...
Loaded 6 structured log lines.

============================================================
Trace ID: pay-5831e1bf
============================================================

Timeline - 6 events across 3 service(s):

  [2026-05-15T21:42:28.079046+00:00] [AUTH] [INFO] payment_request_received
    service: auth
    user_id: u_789
    amount: 50.0
  [2026-05-15T21:42:28.080718+00:00] [AUTH] [INFO] user_authenticated
    service: auth
    user_id: u_789
    duration_ms: 12
  [2026-05-15T21:42:28.145528+00:00] [LEDGER] [INFO] transaction_recorded
    service: ledger
    user_id: u_789
    amount: 50.0
    currency: USD
  [2026-05-15T21:42:28.210088+00:00] [NOTIFICATION] [INFO] email_send_attempt
    service: notification
    recipient: user@example.com
  [2026-05-15T21:42:28.347893+00:00] [NOTIFICATION] [INFO] email_queued
    service: notification
    recipient: user@example.com
    amount: 50.0
  [2026-05-15T21:42:28.378402+00:00] [AUTH] [INFO] payment_complete
    service: auth
    user_id: u_789
    amount: 50.0
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/26a0b226-4e3d-4967-a7c6-6d367409fb1d.png" alt="terminal output showing the full payment journey" style="display:block;margin:0 auto" width="1100" height="559" loading="lazy">

<p>That's the full payment journey across auth, ledger, and notification, in the exact order it happened. Now let's look at how the script works.</p>
<h3 id="heading-the-script">The Script</h3>
<pre><code class="language-python"># correlate.py
import json
import os
import sys

SERVICES = ["auth", "ledger", "notification"]
LOG_DIR = "./logs"


def load_logs(log_dir):
    """
    Read each service's log file and parse every line as JSON.
    Lines that fail JSON parsing are printed as warnings.
    They are not silently dropped - a plain-text error line in a service
    that should emit structured logs is itself evidence worth seeing.
    """
    all_lines = []

    for service in SERVICES:
        log_file = os.path.join(log_dir, f"{service}.log")

        if not os.path.exists(log_file):
            print(f"  WARNING: No log file for '{service}' at {log_file}")
            continue

        with open(log_file) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    parsed = json.loads(line)
                    parsed["_source"] = service
                    all_lines.append(parsed)
                except json.JSONDecodeError:
                    # This line exists in the log but cannot be correlated.
                    print(f"  WARNING: {service}.log line {line_num} is not structured JSON:")
                    print(f"           {line[:100]}")
                    print(f"           This line will NOT appear in any trace-based search.")

    return all_lines


def correlate(trace_id, all_lines):
    """
    Find every log line with this trace_id and sort by timestamp.
    The sorted result is the reconstructed timeline of the request.
    """
    matched = [line for line in all_lines if line.get("trace_id") == trace_id]
    matched.sort(key=lambda x: x.get("timestamp", ""))
    return matched


def find_missing_services(matched):
    """
    Check which services produced zero trace-tagged lines for this request.
    A missing service is not just an absence - it is a signal.
    Either the request never reached that service, or an error path swallowed
    the trace ID. Both are worth investigating.
    """
    services_seen = {line["_source"] for line in matched}
    return [s for s in SERVICES if s not in services_seen]


def print_timeline(trace_id, matched, missing):
    print(f"\n{'=' * 60}")
    print(f"Trace ID: {trace_id}")
    print(f"{'=' * 60}")

    if not matched:
        print("\nNo structured log lines found with this trace ID.")
        print("Either the trace ID is wrong, or no service emitted")
        print("a structured log line for this request.")
        return

    services_count = len({line["_source"] for line in matched})
    print(f"\nTimeline - {len(matched)} events across {services_count} service(s):\n")

    for line in matched:
        ts = line.get("timestamp", "unknown")
        service = line.get("_source", "unknown").upper()
        event = line.get("event", "unknown event")
        level = line.get("level", "INFO")
        extras = {k: v for k, v in line.items()
                  if k not in ("timestamp", "trace_id", "event", "level", "_source")}

        print(f"  [{ts}] [{service}] [{level}] {event}")
        for k, v in extras.items():
            print(f"    {k}: {v}")

    if missing:
        print(f"\n{'=' * 60}")
        print("MISSING TELEMETRY")
        print(f"{'=' * 60}")
        print(f"These services produced no trace-tagged events for trace {trace_id}:\n")
        for s in missing:
            print(f"  - {s}")
        print()
        print("This means one of three things:")
        print("  1. The request never reached this service.")
        print("  2. The service received it but an error path swallowed the trace ID,")
        print("     leaving a plain-text log line that trace correlation cannot find.")
        print("  3. This service's log file was not included in this run.")
        print()
        print("Check the raw log file for a plain-text error line around the same timestamp.")
        print("If one exists, that is your root cause - and a structured logging gap to fix.")


def run(trace_id):
    print(f"Loading logs from {LOG_DIR}/...")
    all_lines = load_logs(LOG_DIR)
    print(f"Loaded {len(all_lines)} structured log lines.\n")

    matched = correlate(trace_id, all_lines)
    missing = find_missing_services(matched)
    print_timeline(trace_id, matched, missing)


if __name__ == "__main__":
    if len(sys.argv) &lt; 2:
        print("Usage: python correlate.py &lt;trace_id&gt;")
        print("Example: python correlate.py pay-abc123")
        sys.exit(1)
    run(sys.argv[1])
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>load_logs</code> reads the log file for each service. Each line should be JSON. If a line isn't JSON, the function prints a warning instead of dropping it silently. That usually means an error path wrote a plain-text line without a trace ID, so the line can't be tied to any request.</p>
<p><code>correlate</code> finds all logs that match the given trace ID and sorts them by time. This rebuilds the full request flow across services.</p>
<p><code>find_missing_services</code> checks which services have no logs for that trace ID. This tells you where the request stopped or where the trace ID was lost.</p>
<p><code>print_timeline</code> displays the full request timeline in order. It also shows which services are missing if something didn't log correctly.</p>
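<p>To see the parse, filter, and sort steps in isolation, here is a minimal sketch that applies the same logic to a few in-memory lines instead of log files. The sample lines and the <code>parse_lines</code> helper are made up for illustration. Only the matching and sorting mirror the script above.</p>
<pre><code class="language-python">import json

# Hypothetical raw log lines, deliberately out of order.
RAW_LINES = [
    '{"timestamp": "2026-05-01T14:23:01.891Z", "trace_id": "pay-abc123", "event": "transaction_recorded"}',
    '2026-05-01T14:23:02.415Z ERROR Connection timeout to email provider',
    '{"timestamp": "2026-05-01T14:23:01.234Z", "trace_id": "pay-abc123", "event": "user_authenticated"}',
    '{"timestamp": "2026-05-01T14:23:01.500Z", "trace_id": "pay-zzz999", "event": "user_authenticated"}',
]


def parse_lines(raw_lines):
    """Split raw lines into parsed JSON objects and unparseable plain text."""
    structured, plain = [], []
    for line in raw_lines:
        try:
            structured.append(json.loads(line))
        except json.JSONDecodeError:
            plain.append(line)
    return structured, plain


def correlate(trace_id, lines):
    """Keep the lines with this trace_id, ordered by timestamp."""
    matched = [entry for entry in lines if entry.get("trace_id") == trace_id]
    # ISO 8601 timestamps in one consistent format sort correctly as strings.
    return sorted(matched, key=lambda entry: entry.get("timestamp", ""))


structured, plain = parse_lines(RAW_LINES)
timeline = correlate("pay-abc123", structured)
print([entry["event"] for entry in timeline])
print(len(plain), "line(s) could not be correlated")
</code></pre>
<p>The string sort only works because every service writes timestamps in the same format. If one service wrote <code>Z</code> and another wrote <code>+00:00</code>, or they logged in different time zones, you would need to parse the timestamps into <code>datetime</code> objects before sorting.</p>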
<p>One thing worth knowing before you use this in a real Kubernetes environment: <code>kubectl logs</code> only shows output from the currently running container.<br>If a container has restarted, you can read the previous instance's logs with:</p>
<pre><code class="language-plaintext">kubectl logs &lt;pod-name&gt; --previous
</code></pre>
<p>But this only reaches back one restart. Older logs are gone unless you ship them to a logging system like Loki or CloudWatch.</p>
<h3 id="heading-what-the-output-looks-like-after-breaking-it">What the Output Looks Like After Breaking it</h3>
<p>The point of this section is to show you what happens when a service fails silently: the error exists in the logs, but the script can't find it because the developer forgot one field.</p>
<p><code>break_it.sh</code> forces the notification service to fail when it tries to send an email, and because the error handler was written without a trace ID, the failure gets logged as plain text with no way to tie it back to the original request.</p>
<p>Run it:</p>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>Then trigger a new request:</p>
<pre><code class="language-plaintext">./trigger_request.sh
</code></pre>
<p>Copy the trace ID it prints, then correlate it:</p>
<pre><code class="language-plaintext">python correlate.py pay-xxxxxxxx
</code></pre>
<p>Here is what you'll see:</p>
<pre><code class="language-plaintext">Loading logs from ./logs/...
  WARNING: notification.log line 10 is not structured JSON:
           2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email
           provider http://mock-email:80/ after 0.001s - failed to send
           confirmation to user@example.com
           This line will NOT appear in any trace-based search.
Loaded 29 structured log lines.

============================================================
Trace ID: pay-6cf69a8c
============================================================

Timeline - 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete
</code></pre>
<p>Look at this carefully. The notification service is in the timeline, and it logged <code>email_send_attempt</code>. But <code>email_queued</code> is missing, which means the email was never actually sent, and the error that explains why isn't in the timeline at all. It's hiding in the WARNING at the very top, where the script told you it found a line it couldn't parse.</p>
<p>That's the problem: the attempt is visible but the failure is invisible.</p>
<p>Run <code>cat logs/notification.log</code> and scroll to the bottom:</p>
<pre><code class="language-plaintext">{"timestamp": "2026-05-15T21:59:00.630313+00:00", "trace_id": "pay-6cf69a8c",
 "service": "notification", "event": "email_send_attempt", ...}
2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email provider
http://mock-email:80/ after 0.001s - failed to send confirmation to user@example.com
</code></pre>
<p>Two lines to note: the first has a trace ID, which the script found and showed in the timeline. The second doesn't, so the script flagged it as a warning and skipped it. The error happened about 0.051 seconds after the attempt (21:59:00.681583 versus 21:59:00.630313). The log file has both lines. The timeline only has one.</p>
<p>That is what <em>"invisible failure"</em> looks like in production. The payment went through. The confirmation email was never sent. The error is sitting right there in the log file, <code>Connection timeout to email provider after 0.001s</code>, but in the correlation output above, the timeline shows <code>email_send_attempt</code> and then jumps straight to <code>payment_complete</code> with nothing in between: no error, no failure, no visible gap. It looks like everything worked.</p>
<p>The fix is in <code>02-log-correlation/services/notification/main.py</code>. Here's the broken error handler:</p>
<pre><code class="language-python">except httpx.TimeoutException:
    emit_plain(f"Connection timeout to email provider {EMAIL_PROVIDER_URL}")
    return {"status": "ok"}
</code></pre>
<p>And here's the fixed version. The only change is passing <code>req.trace_id</code> into <code>emit</code> instead of calling <code>emit_plain</code>:</p>
<pre><code class="language-python">except httpx.TimeoutException:
    emit(req.trace_id, "email_timeout", level="ERROR",
         provider=EMAIL_PROVIDER_URL)
    return {"status": "ok"}
</code></pre>
<p>Once that change is made, the timeout error shows up in the timeline like everything else:</p>
<pre><code class="language-plaintext">  [2026-05-15T21:59:00.681583+00:00] [NOTIFICATION] [ERROR] email_timeout
    provider: http://mock-email:80/
</code></pre>
<p>One command, one trace ID, the full picture.</p>
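<p>The repo ships its own <code>emit</code> helper. As a rough sketch of why the fix works, here is what a structured emitter with that call shape could look like. The function body, the hard-coded service name, and the field layout are assumptions written for illustration, not the repo's implementation. Only the call signature is taken from the fixed handler above.</p>
<pre><code class="language-python">import json
from datetime import datetime, timezone


def emit(trace_id, event, level="INFO", **fields):
    """Print one JSON log line that always carries the trace ID.

    Hypothetical helper: trace_id is a required positional argument,
    so an error path cannot log through it without supplying one.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "service": "notification",  # assumed service name for this sketch
        "event": event,
        "level": level,
    }
    record.update(fields)
    print(json.dumps(record))
    return record


# Same call shape as the fixed error handler.
emit("pay-6cf69a8c", "email_timeout", level="ERROR",
     provider="http://mock-email:80/")
</code></pre>
<p>The design point is that the structured path should be the easy one. If the only logging helper available requires a trace ID, forgetting the field becomes a crash during development instead of a silent gap in production.</p>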
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>The correlation output points to notification as the gap: <code>email_send_attempt</code> appears, but no outcome follows it. When you check the raw <code>notification.log</code>, you find the plain-text timeout error. Together they tell you that the request reached the service, that authentication and transaction recording both succeeded, and that the email failed.</p>
<p>Whether a notification failure is a payment failure depends entirely on how your system was designed. If notification is a soft dependency, this error shouldn't have surfaced to the user as a payment failure, and something else in your system design is wrong. If it's a hard dependency, the transaction itself should have rolled back. The script found where things broke, but the right response depends on the design.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<ol>
<li><p>Run <code>./break_it.sh</code> – this switches the notification service to a mode where its error handler drops the trace ID</p>
</li>
<li><p>Run <code>./trigger_request.sh</code> to generate a new payment request and get a new trace ID</p>
</li>
<li><p>Run <code>python correlate.py &lt;new trace ID&gt;</code> – <code>email_queued</code> will be missing from the timeline, and a WARNING about an unparseable line appears at the top</p>
</li>
<li><p>Run <code>cat logs/notification.log</code> – the timeout error is right there, without a trace ID, invisible to the script</p>
</li>
</ol>
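<p>A gap like this can also be flagged mechanically. The sketch below assumes a convention that the original script does not have: every <code>email_send_attempt</code> should be followed by an outcome event such as <code>email_queued</code> or <code>email_timeout</code> from the same service. The <code>EXPECTED_OUTCOMES</code> mapping and the <code>find_unresolved_attempts</code> function are hypothetical names.</p>
<pre><code class="language-python"># Hypothetical convention: each "attempt" event must be followed by one
# of these outcome events from the same service.
EXPECTED_OUTCOMES = {
    "email_send_attempt": {"email_queued", "email_timeout"},
}


def find_unresolved_attempts(timeline):
    """Return attempt events with no matching outcome later in the timeline.

    `timeline` is a list of dicts already sorted by timestamp, in the
    shape that correlate() returns.
    """
    unresolved = []
    for index, line in enumerate(timeline):
        outcomes = EXPECTED_OUTCOMES.get(line.get("event"))
        if outcomes is None:
            continue  # not an attempt event, nothing to check
        resolved = any(
            other.get("event") in outcomes
            and other.get("_source") == line.get("_source")
            for other in timeline[index + 1:]
        )
        if not resolved:
            unresolved.append(line)
    return unresolved


# The broken timeline from this section, reduced to the relevant fields.
broken_timeline = [
    {"_source": "auth", "event": "payment_request_received"},
    {"_source": "auth", "event": "user_authenticated"},
    {"_source": "ledger", "event": "transaction_recorded"},
    {"_source": "notification", "event": "email_send_attempt"},
    {"_source": "auth", "event": "payment_complete"},
]

for line in find_unresolved_attempts(broken_timeline):
    print(f"UNRESOLVED: {line['_source']} logged {line['event']} with no outcome")
</code></pre>
<p>This does not replace fixing the error handler. It only turns "the timeline looks fine" into "the timeline is missing an outcome", which is a prompt to open the raw log file.</p>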
<h2 id="heading-use-case-3-infrastructure-drift-detection">Use Case 3 - Infrastructure Drift Detection</h2>
<p><strong>Environment:</strong> AWS free tier (one security group) + Terraform<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p>Your Terraform plan shows no changes. Your deployment is behaving differently than it did yesterday, and when you ask around, someone eventually remembers: a colleague made a quick manual change to a security group in the AWS console last week to unblock a staging test. They meant to go back and apply it through Terraform but they forgot.</p>
<p>Your Terraform state file and your actual AWS infrastructure have been quietly disagreeing ever since. Nothing broke loudly and no alert fired. Terraform wouldn't even know unless someone ran <code>terraform plan</code> to check, and in this scenario, nobody did.</p>
<p>This is called infrastructure drift, and it's far more common than most teams want to admit.</p>
<h3 id="heading-whats-actually-happening-at-the-system-level">What's Actually Happening at the System Level</h3>
<p><strong>What this is not:</strong> This isn't the same as running <code>terraform plan</code>. A plan shows you what Terraform <em>would</em> change. This script shows you what has <em>already</em> changed in AWS without Terraform knowing.</p>
<p>The script itself doesn't run any Terraform commands. It reads the state file Terraform already produced. In the demo, Terraform creates that file. In a real environment, it already exists from your normal workflow.</p>
<p>Think of Terraform's state file as a receipt. When Terraform creates a security group, it writes down exactly what it created: the rules, the ports, the CIDRs. That receipt is the state file.</p>
<p>The script compares that receipt against what AWS actually has right now. If someone went into the AWS console and added a rule that isn't on the receipt, the script flags it as drift.</p>
<p>The blind spot is this: if someone creates a completely new security group in the console and never uses Terraform at all, there's no receipt for it. The script can't compare something it has never seen. It returns clean, and that group sits in your account undetected.</p>
<p>The demo shows both. First you break a known resource. Then the <code>--invisible</code> scenario creates a new one outside Terraform entirely, and the script returns clean even though your account now has an extra security group.</p>
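<p>One way to narrow that blind spot, sketched here as an assumption rather than something the demo script does, is to compare the security group IDs in the state file against the IDs that actually exist in the account. The <code>find_untracked_groups</code> function and the sample IDs are hypothetical. In practice the live IDs would come from the EC2 <code>describe_security_groups</code> call.</p>
<pre><code class="language-python">def find_untracked_groups(state_ids, live_ids):
    """Return security group IDs that exist in AWS but not in the state file."""
    return sorted(set(live_ids) - set(state_ids))


# Hypothetical IDs for illustration only.
state_ids = ["sg-0a1b2c3d4e5f6a7b8"]
live_ids = ["sg-0a1b2c3d4e5f6a7b8", "sg-0ffffffffffffffff"]

for sg_id in find_untracked_groups(state_ids, live_ids):
    print(f"UNTRACKED: {sg_id} exists in AWS but not in the state file")
</code></pre>
<p>An untracked group is not automatically a problem. Every VPC has a default security group, and other teams may manage their groups from a different state file, so treat this list as a starting point for questions.</p>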
<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>03-drift-detection/</code> in the companion repo:</p>
<pre><code class="language-plaintext">cd 03-drift-detection
pip install -r requirements.txt
</code></pre>
<p>Run setup. This uses real Terraform, not a mock:</p>
<pre><code class="language-plaintext">./setup.sh
</code></pre>
<p>This runs <code>terraform init</code> and <code>terraform apply</code>, which creates a real AWS security group:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/d5617704-d04e-40cc-8ca9-aa9d6b806e5e.png" alt="screenshot of AWS dashboard showing security group created" style="display:block;margin:0 auto" width="1189" height="260" loading="lazy">

<p>It also writes a genuine <code>terraform.tfstate</code> file. Open it in any text editor if you want to see what Terraform actually produces. It's JSON, it's readable, and it's the real thing.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/3ea8e237-a1d7-43bc-be3d-50dcb6ff4b76.png" alt="screenshot of IDE folder structure showing terraform.tfstate file being created" style="display:block;margin:0 auto" width="215" height="230" loading="lazy">

<p>Once setup completes, run the script:</p>
<pre><code class="language-plaintext">python detect_drift.py terraform.tfstate
</code></pre>
<p>You should see something like this, but your actual security group ID will be different:</p>
<pre><code class="language-plaintext">Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.
</code></pre>
<p>The lab is alive and both sides of the contract match. Now let's look at what the script is doing.</p>
<h3 id="heading-the-script-code-files">The Script (<a href="https://github.com/Osomudeya/devops-scripting-labs">Code Files</a>)</h3>
<pre><code class="language-python"># detect_drift.py
import boto3
import json
import sys


def load_tfstate(path):
    """
    The Terraform state file is plain JSON - open it in any text editor
    and you will see a 'resources' array listing everything Terraform knows about.
    This function reads that file and returns the parsed contents.
    """
    with open(path) as f:
        return json.load(f)


def get_security_groups_from_state(tfstate):
    """
    Walk through the resources array and collect every security group entry.
    Each resource has a 'type', a 'name', and an 'instances' array holding
    the attribute values Terraform recorded when it last ran.
    We extract the resource ID and the ingress (inbound) rules.
    """
    resources = {}
    for resource in tfstate.get("resources", []):
        if resource["type"] == "aws_security_group":
            for instance in resource.get("instances", []):
                sg_id = instance["attributes"]["id"]
                resources[sg_id] = {
                    "ingress": instance["attributes"].get("ingress", [])
                }
    return resources


def get_security_group_from_aws(sg_id):
    """
    Call the AWS EC2 API to fetch the live current state of this security group.
    Under the hood, boto3 constructs an authenticated HTTPS request, signs it with
    your AWS credentials, sends it to the EC2 API endpoint in your configured region,
    and parses the response. The response contains far more data than we need -
    we extract only the inbound rules.
    """
    ec2 = boto3.client("ec2")
    response = ec2.describe_security_groups(GroupIds=[sg_id])
    sg = response["SecurityGroups"][0]
    return {"ingress": sg.get("IpPermissions", [])}


def normalize_state_rules(rules):
    """
    Terraform stores ingress rules in its own format.
    We normalize them into a set of tuples for easy comparison.
    Each tuple is: (from_port, to_port, protocol, cidr_block)
    """
    normalized = set()
    for rule in rules:
        # "or []" also covers a cidr_blocks key that is present but null.
        for cidr in rule.get("cidr_blocks") or []:
            normalized.add((
                rule.get("from_port", 0),
                rule.get("to_port", 0),
                rule.get("protocol", "-1"),
                cidr
            ))
    return normalized


def normalize_aws_rules(rules):
    """
    AWS returns ingress rules in a different format from Terraform's.
    We normalize them into the same tuple shape so the comparison works.
    """
    normalized = set()
    for rule in rules:
        from_port = rule.get("FromPort", 0)
        to_port = rule.get("ToPort", 0)
        protocol = rule.get("IpProtocol", "-1")
        for ip_range in rule.get("IpRanges", []):
            normalized.add((from_port, to_port, protocol, ip_range["CidrIp"]))
    return normalized


def detect_drift(tfstate_path):
    print(f"Loading Terraform state from: {tfstate_path}")
    tfstate = load_tfstate(tfstate_path)
    state_sgs = get_security_groups_from_state(tfstate)

    if not state_sgs:
        print("No security groups found in state file. Nothing to compare.")
        return

    drift_found = False

    for sg_id, state_data in state_sgs.items():
        print(f"\nChecking: {sg_id}")

        try:
            aws_data = get_security_group_from_aws(sg_id)
        except Exception as e:
            print(f"  ERROR: Could not fetch {sg_id} from AWS - {e}")
            print(f"  Check your IAM permissions: ec2:DescribeSecurityGroups is required.")
            continue

        state_rules = normalize_state_rules(state_data["ingress"])
        aws_rules = normalize_aws_rules(aws_data["ingress"])

        # Rules in AWS that Terraform does not know about (manual additions)
        added_in_aws = aws_rules - state_rules
        # Rules Terraform expects that no longer exist in AWS (manual deletions)
        removed_from_aws = state_rules - aws_rules

        if added_in_aws:
            drift_found = True
            print("  DRIFT - Rules present in AWS but missing from state file:")
            for rule in added_in_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if removed_from_aws:
            drift_found = True
            print("  DRIFT - Rules in state file but removed from AWS:")
            for rule in removed_from_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if not added_in_aws and not removed_from_aws:
            print("  OK - No drift detected.")

    print("\n" + "=" * 60)
    if drift_found:
        print("Drift detected. See above for details.")
    else:
        print("No drift detected in monitored resources.")

    print("\nIMPORTANT: This script only checks resources tracked in your state file.")
    print("Resources created manually in AWS without Terraform are invisible to this check.")
    print("A clean output here does not mean your AWS account is clean - it means")
    print("the resources you are watching match what Terraform last recorded.")


if __name__ == "__main__":
    tfstate_path = sys.argv[1] if len(sys.argv) &gt; 1 else "terraform.tfstate"
    detect_drift(tfstate_path)
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>load_tfstate</code> opens <code>terraform.tfstate</code> and reads it. Run <code>cat terraform.tfstate</code> after setup and you'll see that it's just a text file. Everything Terraform knows about your infrastructure is stored in there.</p>
<p><code>get_security_groups_from_state</code> pulls every security group out of that file, along with the ID AWS assigned to it and the inbound rules Terraform last recorded. These are the expected values.</p>
<p><code>get_security_group_from_aws</code> calls the AWS API and fetches the same security group's current inbound rules. These are the actual values. The script now has two versions of the same thing.</p>
<p><code>normalize_state_rules</code> and <code>normalize_aws_rules</code> exist because Terraform and AWS store the same rule in slightly different formats. These two functions convert both into the same format so the comparison works.</p>
<p>The comparison is the last step. Rules in AWS but not in the state file were added manually. Rules in the state file but not in AWS were deleted manually. The script prints both.</p>
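<p>To make the normalization and comparison concrete, here is a small offline sketch. The two sample rule lists are hand-written approximations of how a state file and the EC2 API each describe the same port 443 rule, plus one extra SSH rule on the AWS side. Real payloads carry more fields than shown, and the function names here are shortened for the example.</p>
<pre><code class="language-python"># Hand-written sample: one ingress rule as it might appear in terraform.tfstate.
state_rules = [
    {"from_port": 443, "to_port": 443, "protocol": "tcp",
     "cidr_blocks": ["10.0.0.0/16"]},
]

# Hand-written sample: the same rule, plus a manual SSH rule, in the shape
# describe_security_groups uses for IpPermissions.
aws_rules = [
    {"FromPort": 443, "ToPort": 443, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    {"FromPort": 22, "ToPort": 22, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]


def normalize_state(rules):
    """Flatten Terraform-style rules into (from, to, protocol, cidr) tuples."""
    return {
        (r["from_port"], r["to_port"], r["protocol"], cidr)
        for r in rules
        for cidr in r.get("cidr_blocks") or []
    }


def normalize_aws(rules):
    """Flatten EC2-style rules into the same tuple shape."""
    return {
        (r.get("FromPort", 0), r.get("ToPort", 0), r["IpProtocol"], ip["CidrIp"])
        for r in rules
        for ip in r.get("IpRanges", [])
    }


expected = normalize_state(state_rules)
actual = normalize_aws(aws_rules)

# Set difference in each direction gives the two kinds of drift.
print("Added in AWS:    ", sorted(actual - expected))
print("Removed from AWS:", sorted(expected - actual))
</code></pre>
<p>Like the script above, this only compares IPv4 CIDR rules. A rule that references an IPv6 range or another security group produces no tuple on either side, so a manual change of that kind would not show up as drift.</p>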
<h3 id="heading-what-the-output-looks-like">What the Output Looks Like</h3>
<p>A clean run with no drift:</p>
<pre><code class="language-plaintext">Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.

============================================================
No drift detected in monitored resources.

IMPORTANT: This script only checks resources tracked in your state file.
Resources created manually in AWS without Terraform are invisible to this check.
A clean output here does not mean your AWS account is clean - it means
the resources you are watching match what Terraform last recorded.
</code></pre>
<p>After injecting drift:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/61ae6f66-36e4-4e03-8e76-f250e2489dab.png" alt="screenshot of AWS dashboard showing security group inbound rule created" style="display:block;margin:0 auto" width="1170" height="220" loading="lazy">

<pre><code class="language-plaintext">Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  DRIFT - Rules present in AWS but missing from state file:
    Port 22-22 | Protocol: tcp | CIDR: 0.0.0.0/0

============================================================
Drift detected. See above for details.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5744ac0c-2e4f-4015-9b72-a1fa7084587e.png" alt="screenshot of terminal output after injecting drift showing &quot;drift detected&quot;" style="display:block;margin:0 auto" width="952" height="183" loading="lazy">

<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>The script finds drift: an inbound rule that Terraform doesn't know about. The instinct is to revert it immediately by running <code>terraform apply</code>, but before doing that, ask one question: was this change an emergency hotfix? Someone may have manually opened a port at 2am to restore a broken service while a proper fix was being prepared. If you revert it automatically, you might undo something that was deliberately placed there to keep a service running.</p>
<p>Drift detection tells you that things are different. It doesn't tell you which version is correct, and investigating that is the work that comes after the script runs.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<ol>
<li><p>Run <code>./break_it.sh</code>. This adds an SSH inbound rule (port 22) directly via the AWS CLI, simulating a manual console change.</p>
</li>
<li><p>Run <code>python detect_drift.py terraform.tfstate</code>. The drift appears in the output.</p>
</li>
<li><p>Run <code>./break_it.sh --invisible</code> to create a brand new security group that's not in the state file at all, then run the script again. It returns clean even though a new resource exists in your account, making the coverage gap visible.</p>
</li>
<li><p>When you're finished, run <code>./teardown.sh</code>. This runs <code>terraform destroy</code> to delete the security group and clean up all AWS resources. No charges will remain after this.</p>
</li>
</ol>
<h2 id="heading-use-case-4-secrets-rotation-with-zero-downtime">Use Case 4 - Secrets Rotation with Zero Downtime</h2>
<p><strong>Environment:</strong> AWS Secrets Manager + local Kind cluster<br><strong>Language:</strong> Python</p>
<h3 id="heading-the-production-problem">The Production Problem</h3>
<p><strong>The goal of this use case:</strong> Kubernetes says a pod is healthy, but your users are getting database errors. By running one extra check that Kubernetes never runs, the script catches that gap before users are affected.</p>
<p>You rotate your database credentials. The pod restarts. <code>kubectl get pods</code> shows Running. Ten minutes later, users can't log in.</p>
<p>The rotation worked, but the problem is that Kubernetes checked whether the HTTP server was alive, not whether it could authenticate with the database. Those are two different things.</p>
<h3 id="heading-whats-actually-happening">What's Actually Happening</h3>
<p><strong>What this is not:</strong> This isn't about how to store secrets in Kubernetes. It's about what happens after the secret is rotated.</p>
<p>When a pod is already running, it holds a pool of open database connections that were authenticated before the rotation happened. Those connections stay alive after the password changes because the database does not kick out sessions that are already authenticated. But when the pool needs to open a new connection, it uses the credentials in the pod's environment, which still hold the old password. That new connection fails immediately.</p>
<p>Meanwhile, Kubernetes sees the pod responding to HTTP and marks it Running, so your users are hitting the failures with no indication from the cluster that anything is wrong.</p>
<h3 id="heading-what-the-healthzdb-endpoint-does">What the <code>/healthz/db</code> Endpoint Does</h3>
<p><code>/healthz</code> returns 200 if the HTTP server is alive. That is all Kubernetes checks.</p>
<p><code>/healthz/db</code> opens a fresh database connection using the current credentials and runs <code>SELECT 1</code>. If that fails after a rotation, the pod is Running but can't serve database requests. The rotation script calls this endpoint as its final step – the check Kubernetes never runs.</p>
<p>Here's what that looks like in the demo FastAPI application (<a href="https://github.com/Osomudeya/devops-scripting-labs">code files</a>):</p>
<pre><code class="language-python"># app.py (relevant section)
import os
import asyncpg
from fastapi import FastAPI, HTTPException

app = FastAPI()

DB_HOST = os.environ.get("DB_HOST", "postgres")
DB_PORT = int(os.environ.get("DB_PORT", "5432"))
DB_NAME = os.environ.get("DB_NAME", "appdb")
DB_USERNAME = os.environ.get("DB_USERNAME", "appuser")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")

@app.get("/healthz")
async def healthz():
    # Always returns 200 if the HTTP server is alive.
    # This is all the Kubernetes readiness probe checks.
    return {"status": "ok"}

@app.get("/healthz/db")
async def healthz_db():
    # Opens a fresh connection using the current environment credentials.
    # If the password was rotated and this pod has not restarted yet,
    # the environment still has the old password - this connection fails.
    # /healthz above would still return 200. Your users would see errors.
    try:
        conn = await asyncpg.connect(
            host=DB_HOST, port=DB_PORT,
            database=DB_NAME, user=DB_USERNAME, password=DB_PASSWORD,
        )
        try:
            await conn.execute("SELECT 1")
        finally:
            # Close the connection even if the query itself fails.
            await conn.close()
        return {"status": "ok", "db": "authenticated"}

    except asyncpg.InvalidPasswordError:
        raise HTTPException(
            status_code=503,
            detail=(
                f"Authentication failed for '{DB_USERNAME}'. "
                "Password may have been rotated. "
                "Readiness probe does not check this."
            )
        )
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Database error: {str(e)}")
</code></pre>
<p>The difference between these two endpoints is the entire lesson of this use case.</p>
<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>04-secrets-rotation/</code> and run the setup script:</p>
<pre><code class="language-plaintext">cd 04-secrets-rotation
./setup.sh
</code></pre>
<p>This starts a Kind cluster, deploys a real PostgreSQL instance with the <code>appuser</code> account already created, deploys the demo FastAPI app connected to it, and creates an initial secret in AWS Secrets Manager.</p>
<p>Once setup completes, install the dependencies:</p>
<pre><code class="language-plaintext">pip install boto3 kubernetes
</code></pre>
<p>Before running the rotation, confirm everything is running:</p>
<pre><code class="language-plaintext">kubectl get pods
</code></pre>
<p>You should see <code>myapp</code> and <code>postgres</code> pods both in the Running state. If any pod shows Pending or Error, wait 30 seconds and check again. PostgreSQL takes a moment to finish initialising.</p>
<p>You can also verify that the secret was created in AWS. In the console, go to AWS Secrets Manager and look for <code>myapp/db-credentials</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c5d481f7-c938-43f8-91ec-09640c137897.png" alt="screenshot showing AWS secret created" style="display:block;margin:0 auto" width="1414" height="337" loading="lazy">

<p>If you prefer the CLI:</p>
<pre><code class="language-plaintext">aws secretsmanager get-secret-value --secret-id myapp/db-credentials
</code></pre>
<p>Once both pods are Running and the secret exists, run the rotation to see the full path:</p>
<pre><code class="language-plaintext">python rotate_secret.py
</code></pre>
<p><strong>If Step 6 shows FAILED on this first clean run</strong>, it's almost always a timing issue: the app pod restarted successfully but <code>/healthz/db</code> ran before the new pod finished establishing its first database connection. Wait 20 seconds and run <code>python rotate_secret.py</code> again. If it fails repeatedly, run <code>kubectl logs deployment/myapp</code> to see what the app is reporting.</p>
<p>You should see all six steps complete cleanly, ending with:</p>
<pre><code class="language-plaintext">Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated
</code></pre>
<p>The lab is alive and the full rotation chain works end to end. Now let's look at what the script is doing.</p>
<h3 id="heading-the-script-code-files">The Script (<a href="https://github.com/Osomudeya/devops-scripting-labs">Code Files</a>)</h3>
<pre><code class="language-python"># rotate_secret.py
import boto3
import base64
import json
import subprocess
import sys
from kubernetes import client, config


def get_current_secret(secret_name):
    """
    Fetch the current credential from AWS Secrets Manager.
    The secret is stored as a JSON string with 'username' and 'password' fields.
    """
    sm = boto3.client("secretsmanager")
    response = sm.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def rotate_in_aws(secret_name, username, new_password):
    """
    Write the new credential to AWS Secrets Manager.
    put_secret_value creates a new version - the previous version is
    not deleted immediately, giving you a short rollback window.
    """
    sm = boto3.client("secretsmanager")
    new_value = json.dumps({"username": username, "password": new_password})
    sm.put_secret_value(SecretId=secret_name, SecretString=new_value)
    print("  [AWS] Secret updated in Secrets Manager.")


def update_kubernetes_secret(namespace, k8s_secret_name, username, new_password):
    """
    Patch the Kubernetes Secret object with the new credential values.
    Kubernetes requires secret data to be base64-encoded - this is encoding,
    not encryption. Anyone with access to the Secret object can decode the values.
    Real encryption at rest requires separate etcd encryption configuration.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()

    secret_data = {
        "username": base64.b64encode(username.encode()).decode(),
        "password": base64.b64encode(new_password.encode()).decode()
    }

    v1.patch_namespaced_secret(
        name=k8s_secret_name,
        namespace=namespace,
        body={"data": secret_data}
    )
    print(f"  [K8s] Kubernetes Secret '{k8s_secret_name}' updated.")


def rolling_restart(namespace, deployment_name):
    """
    Trigger a rolling restart of the deployment.
    Rolling restart means Kubernetes creates one new pod, waits for it to pass
    its readiness probe, then terminates one old pod - and repeats until all
    pods have been replaced. Availability is preserved throughout.
    This is very different from deleting all pods at once.
    """
    result = subprocess.run(
        ["kubectl", "rollout", "restart",
         f"deployment/{deployment_name}", "-n", namespace],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rolling restart failed: {result.stderr}")
    print(f"  [K8s] Rolling restart triggered for '{deployment_name}'.")


def wait_for_rollout(namespace, deployment_name, timeout=120):
    """
    Block until the rolling restart finishes or times out.
    'Finished' means all new pods are Running and their readiness probes passed.
    This does NOT mean the application can authenticate with the new credential.
    That is what verify_credential checks next.
    """
    print(f"  [K8s] Waiting for rollout (timeout: {timeout}s)...")
    result = subprocess.run(
        ["kubectl", "rollout", "status",
         f"deployment/{deployment_name}",
         "-n", namespace,
         f"--timeout={timeout}s"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rollout did not complete: {result.stderr}")
    print("  [K8s] Rollout complete. All pods report Ready.")


def verify_credential(namespace, deployment_name):
    """
    This is the check the readiness probe does not make.
    We exec into the running pod and call /healthz/db - an endpoint that
    makes an actual authenticated query to the database.
    If this passes: the credential is working at the application level.
    If this fails after the readiness probe passed: the contract mismatch is confirmed.
    The pod is Running. The application cannot serve database requests.
    """
    print("  [Verify] Running post-rotation credential check...")

    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", f"app={deployment_name}",
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True
    )
    pod_name = result.stdout.strip()

    if not pod_name:
        print("  [Verify] ERROR: No running pod found for this deployment.")
        return False

    verify = subprocess.run(
        ["kubectl", "exec", pod_name, "-n", namespace,
         "--", "curl", "-sf", "http://localhost:8000/healthz/db"],
        capture_output=True, text=True
    )

    if verify.returncode != 0:
        print("  [Verify] FAILED - Pod is Running but database authentication failed.")
        print("           The readiness probe validated HTTP reachability.")
        print("           The application cannot authenticate with the new credential.")
        print("           These are two different contracts. Only one was checked automatically.")
        return False

    print("  [Verify] PASSED - Application confirmed it can authenticate with the new credential.")
    return True


def rotate(secret_name, new_password, namespace, k8s_secret_name, deployment_name):
    print("\n[Step 1/6] Reading current secret from AWS Secrets Manager...")
    current = get_current_secret(secret_name)
    username = current["username"]

    print("[Step 2/6] Updating AWS Secrets Manager...")
    rotate_in_aws(secret_name, username, new_password)

    print("[Step 3/6] Rotating password at the database level (ALTER USER)...")
    rotate_postgres_password(namespace, new_password)

    print("[Step 4/6] Updating Kubernetes Secret object...")
    update_kubernetes_secret(namespace, k8s_secret_name, username, new_password)

    print("[Step 5/6] Triggering rolling restart...")
    rolling_restart(namespace, deployment_name)
    wait_for_rollout(namespace, deployment_name)

    print("[Step 6/6] Verifying the new credential works at the application level...")
    success = verify_credential(namespace, deployment_name)

    print("\n" + "=" * 60)
    if success:
        print("Rotation complete. Credential verified at the application level.")
    else:
        print("Rotation incomplete. Readiness probe passed but credential verification failed.")
        print("Recommended action: force-restart all pods to flush the connection pool,")
        print("or investigate the database session timeout configuration.")
        sys.exit(1)


if __name__ == "__main__":
    import secrets as _secrets
    rotate(
        secret_name="myapp/db-credentials",
        new_password=_secrets.token_urlsafe(16),
        namespace="default",
        k8s_secret_name="db-credentials",
        deployment_name="myapp"
    )
</code></pre>
<h3 id="heading-how-the-script-works">How the Script Works</h3>
<p><code>get_current_secret</code> reads the current credential from AWS Secrets Manager so the script knows the username before it generates a new password.</p>
<p><code>rotate_in_aws</code> writes the new credential to Secrets Manager. It creates a new version rather than overwriting the old one, so you have a short window to roll back if something goes wrong.</p>
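<p>As a rough illustration of that rollback window, the sketch below moves the <code>AWSCURRENT</code> staging label back to the version Secrets Manager labelled <code>AWSPREVIOUS</code>. The helper name <code>rollback_secret</code> is hypothetical and not part of the repo. It assumes <code>sm</code> is a boto3 Secrets Manager client and that the secret still has a previous version.</p>

```python
def rollback_secret(sm, secret_name):
    """Move the AWSCURRENT label back to the version labelled AWSPREVIOUS.

    sm is a boto3 Secrets Manager client, e.g. boto3.client("secretsmanager").
    rollback_secret is a hypothetical helper, not part of rotate_secret.py.
    """
    # describe_secret returns a mapping of version ID -> list of staging labels.
    stages = sm.describe_secret(SecretId=secret_name)["VersionIdsToStages"]

    current_id = next((v for v, s in stages.items() if "AWSCURRENT" in s), None)
    previous_id = next((v for v, s in stages.items() if "AWSPREVIOUS" in s), None)
    if current_id is None or previous_id is None:
        raise RuntimeError("No previous version available to roll back to.")

    # Point AWSCURRENT at the previous version and detach it from the new one.
    sm.update_secret_version_stage(
        SecretId=secret_name,
        VersionStage="AWSCURRENT",
        MoveToVersionId=previous_id,
        RemoveFromVersionId=current_id,
    )
    return previous_id
```

<p>This only restores the value stored in Secrets Manager. The database password and the Kubernetes Secret would need to be set back separately.</p>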
<p><code>_pg_password_literal</code> and <code>rotate_postgres_password</code> handle the step that most rotation scripts skip, which is actually changing the password inside PostgreSQL. This is done by running <code>ALTER USER appuser PASSWORD '...'</code> directly on the live PostgreSQL pod. Before this step, the database still accepts the old password. After this step, it does not.</p>
<p><code>update_kubernetes_secret</code> writes the new password into the Kubernetes Secret so that any new pod that starts will get the new credential from the beginning.</p>
<p><code>rolling_restart</code> and <code>wait_for_rollout</code> restart the application pods one at a time so the deployment stays available throughout. When this step completes, all pods are Running and their readiness probes have passed – but keep in mind that "Ready" only means <code>/healthz</code> returned 200, which is exactly the problem this use case is about.</p>
<p><code>verify_credential</code> is the extra step Kubernetes never runs. It reaches inside the new pod and calls <code>/healthz/db</code>, which opens a real database connection with the credentials in the pod's current environment. If this passes, the rotation is genuinely complete. If this fails after the readiness probe passed, you have confirmed the gap: the pod looks healthy but can't serve database requests.</p>
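<p>A freshly restarted pod can need a few seconds before its first database connection succeeds, so one option is to retry the final check a few times before declaring failure. The sketch below shows that idea. <code>verify_with_retry</code> and its default values are hypothetical and not part of <code>rotate_secret.py</code>.</p>

```python
import time


def verify_with_retry(check, attempts=5, delay_seconds=5.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    check is any zero-argument callable that returns True on success.
    sleep is injectable so the loop can be exercised without real waiting.
    verify_with_retry is a hypothetical helper, not part of the repo.
    """
    for attempt in range(1, attempts + 1):
        if check():
            print(f"  [Verify] Passed on attempt {attempt}/{attempts}.")
            return True
        if attempt < attempts:
            print(f"  [Verify] Attempt {attempt}/{attempts} failed; "
                  f"retrying in {delay_seconds}s...")
            sleep(delay_seconds)
    return False
```

<p>Inside <code>rotate</code>, the final step could then be written as <code>verify_with_retry(lambda: verify_credential(namespace, deployment_name))</code>. This retries only the check, not the rotation itself.</p>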
<h3 id="heading-what-the-output-looks-like">What the Output Looks Like</h3>
<p>Successful rotation:</p>
<pre><code class="language-plaintext">[Step 1/6] Reading current secret from AWS Secrets Manager...
[Step 2/6] Updating AWS Secrets Manager...
  [AWS] Secret updated in Secrets Manager.
[Step 3/6] Rotating password at the database level (ALTER USER)...
  [DB]  Running ALTER USER on PostgreSQL...
  [DB]  Password changed at the database level.
        New connections now require the new password.
        Existing pool connections remain valid until they close.
[Step 4/6] Updating Kubernetes Secret object...
  [K8s] Kubernetes Secret 'db-credentials' updated.
[Step 5/6] Triggering rolling restart...
  [K8s] Rolling restart triggered for 'myapp'.
  [K8s] Waiting for rollout (timeout: 120s)...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] PASSED - Application confirmed it can authenticate with the new credential.

============================================================
Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated
</code></pre>
<p>Before you break anything, confirm the pod is healthy:</p>
<pre><code class="language-plaintext">kubectl get pods
</code></pre>
<p>You should see <code>myapp</code> in Running state. That is the baseline: everything working as expected. Now let's break it.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/caabd892-562a-40f4-81c1-2c59fd15240b.png" alt="terminal screenshot showing output of 'kubectl get pods'" style="display:block;margin:0 auto" width="753" height="187" loading="lazy">

<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<h4 id="heading-step-1-desync-the-db">Step 1: Desync the DB</h4>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>This runs <code>ALTER USER</code> directly on PostgreSQL and sets a password that nothing else in the chain knows about. The K8s Secret still holds the old password, so the pod's environment and the database are now out of sync.</p>
<h4 id="heading-step-2-check-what-kubernetes-sees">Step 2: Check what Kubernetes sees</h4>
<pre><code class="language-plaintext">kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz
</code></pre>
<p>You will see <code>{"status":"ok"}</code>. The pod is still showing Ready in <code>kubectl get pods</code>. Kubernetes has no idea anything is wrong – that's the contract gap made visible in your terminal.</p>
<h4 id="heading-step-3-check-what-your-users-experience">Step 3: Check what your users experience</h4>
<pre><code class="language-plaintext">kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz/db
</code></pre>
<p>You'll see a <code>503</code> error. Fresh database connections are failing. Your users are already seeing this.</p>
<h4 id="heading-step-4-see-the-mixed-pattern-optional">Step 4: See the mixed pattern (optional)</h4>
<pre><code class="language-plaintext">./load_test.sh
</code></pre>
<p>Some requests succeed because they hit old pool connections that were authenticated before the break. Some fail because they need a fresh connection. The pod looks healthy, but part of your traffic is failing.</p>
<h4 id="heading-step-5-run-the-rotation-script">Step 5: Run the rotation script</h4>
<pre><code class="language-plaintext">python rotate_secret.py
</code></pre>
<p>This time, Step 6 catches the failure. Here's what you'll see:</p>
<pre><code class="language-plaintext">[Step 5/6] Triggering rolling restart...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] FAILED - Pod is Running but database authentication failed.
           The readiness probe validated HTTP reachability.
           The application cannot authenticate with the new credential.
           These are two different contracts. Only one was checked automatically.

============================================================
Rotation incomplete. Readiness probe passed but credential verification failed.
</code></pre>
<p>The pod is Running and shows Ready in <code>kubectl get pods</code>. The rotation script says the credential is broken. That's the contract gap visible in your terminal, caught before your users hit it.</p>
<p><strong>The lesson:</strong> <code>/healthz</code> tells you the HTTP server is alive. <code>/healthz/db</code> tells you the application can actually connect to the database. Kubernetes only checks the first one unless you add a database probe. The rotation script adds that check at the end of every rotation so you catch the failure before your users do.</p>
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>The verification failed, the pod is Running, and requests to the database are failing. You have two options:</p>
<ol>
<li><p>force-restart all pods at once to flush the connection pool (which is faster but causes a brief capacity reduction), or</p>
</li>
<li><p>wait for old sessions to expire naturally (which avoids downtime but leaves requests failing intermittently until the pool cycles).</p>
</li>
</ol>
<p>The script found the problem, but deciding what to do next belongs to an engineer who knows the system.</p>
<h3 id="heading-teardown">Teardown</h3>
<pre><code class="language-plaintext">./teardown.sh
</code></pre>
<h2 id="heading-use-case-5-automated-canary-rollback-trigger">Use Case 5 - Automated Canary Rollback Trigger</h2>
<p><strong>Environment:</strong> Fully local – Kind, Prometheus via Helm<br><strong>Language:</strong> Bash</p>
<h3 id="heading-what-this-use-case-does-and-why-it-matters">What This Use Case Does and Why it Matters</h3>
<p>This use case runs a script that watches your new deployment and automatically rolls it back if something goes wrong, before your users flood your support queue.</p>
<p>This matters in production because, when you ship a new version, you don't send all traffic to it immediately. You send a small slice, say 20%, to the new version while 80% still goes to the old one. If the new version is broken, only 20% of users are affected and you can roll back before the damage spreads. But the rollback only works if you're watching the right things.</p>
<p><strong>The takeaway:</strong> Two scripts watch the same failing canary. One reports everything is fine. The other fires the rollback. The only difference is what they measure. Your automation is only as good as what it watches.</p>
<p><strong>What to watch for:</strong> <code>canary_watch_v1.sh</code> watches errors only and stays silent while the canary is slow. <code>canary_watch_v2.sh</code> watches errors AND response time and fires the rollback. The difference between them is the lesson.</p>
<p><strong>What this is not:</strong> This isn't a guide to canary deployments. It's about what your monitoring misses when it only watches one signal.</p>
<h3 id="heading-how-it-works">How it Works</h3>
<p>Three things run in the cluster: the stable app (three pods, handles most traffic), the canary app (one pod, handles a small slice), and Prometheus (collects response times and error counts from both every 15 seconds).</p>
<p>The watch script asks Prometheus every 15 seconds: <em>"Is the canary behaving normally?"</em> If the answer is no for three checks in a row, it rolls back the canary automatically.</p>
<p>The question is: what does <em>"behaving normally"</em> mean? That is the entire use case.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/204f992e-44c2-4404-a6ff-f2279ea23aeb.png" alt="terminal screenshot showing output result of 'kubectl get pods'" style="display:block;margin:0 auto" width="753" height="187" loading="lazy">

<h3 id="heading-set-up-the-demo-environment">Set Up the Demo Environment</h3>
<p>Navigate to <code>05-canary-rollback/</code> and run:</p>
<pre><code class="language-plaintext">cd 05-canary-rollback
./setup.sh
</code></pre>
<p>Setup takes a few minutes. It installs Prometheus, deploys both versions of the demo app, and starts a load generator pod that sends continuous traffic to both so Prometheus always has data.</p>
<p>When setup finishes, confirm everything is running:</p>
<pre><code class="language-plaintext">kubectl get pods
</code></pre>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">NAME                                                   READY   STATUS    RESTARTS   AGE
load-generator-68c59698b7-kws2l                        1/1     Running   0          4m54s
myapp-canary-6d6979c66f-g9lgw                          1/1     Running   0          32s
myapp-stable-6bcf994fc4-b4k9l                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-ndhxc                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-z97kx                          1/1     Running   0          4m55s
prometheus-kube-prometheus-operator-59b847d96c-mp72s   1/1     Running   0          5m58s
prometheus-prometheus-kube-prometheus-prometheus-0     2/2     Running   0          5m1s
</code></pre>
<p>Three stable pods, one canary pod, one load generator, Prometheus running. The lab is alive.</p>
<p><strong>Wait 60 seconds before running anything else.</strong> Prometheus needs time to scrape the first metrics from the pods. If you skip this, the watch scripts return empty data with no explanation.</p>
<h3 id="heading-three-terminal-windows">Three Terminal Windows</h3>
<p>You need three separate command prompts running at the same time.</p>
<p><strong>On macOS:</strong> open Terminal and press <code>Cmd+T</code> twice. You now have three tabs, each an independent terminal.<br><strong>On Linux:</strong> press <code>Ctrl+Shift+T</code> in most terminal apps, or right-click and choose "Open new tab."</p>
<p>Label them Terminal 1 for the watch script, Terminal 2 for injecting failures, Terminal 3 for watching latency.</p>
<h3 id="heading-the-scripts">The Scripts</h3>
<h4 id="heading-version-1-watches-errors-only-code-here">Version 1: watches errors only (<a href="https://github.com/Osomudeya/devops-scripting-labs.git">code here</a>)</h4>
<pre><code class="language-bash">#!/usr/bin/env bash
# canary_watch_v1.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v1 - error rate only)."
echo "Rollback triggers if error rate exceeds ${ERROR_THRESHOLD} for ${STRIKE_LIMIT} checks."
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'

    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2&gt;/dev/null)

    error_rate=${error_rate:-0}
    above=$(echo "$error_rate &gt; $ERROR_THRESHOLD" | bc -l)

    echo "[$ts] error_rate=${error_rate} | threshold=${ERROR_THRESHOLD} | breach=$([ "$above" = "1" ] &amp;&amp; echo YES || echo NO)"

    if [ "$above" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo "  ROLLBACK TRIGGERED"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done
</code></pre>
<h4 id="heading-version-2-watches-error-rate-and-response-time">Version 2: watches error rate AND response time</h4>
<pre><code class="language-bash">#!/usr/bin/env bash
# canary_watch_v2.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
LATENCY_THRESHOLD="2.0"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v2 - error rate + P99 latency)."
echo "Error threshold: ${ERROR_THRESHOLD} | Latency P99 threshold: ${LATENCY_THRESHOLD}s"
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'
    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2&gt;/dev/null)

    latency_query='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="myapp-canary"}[1m])) by (le))'
    latency=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${latency_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2&gt;/dev/null)

    error_rate=${error_rate:-0}
    latency=${latency:-0}

    error_breach=$(echo "$error_rate &gt; $ERROR_THRESHOLD" | bc -l)
    latency_breach=$(echo "$latency &gt; $LATENCY_THRESHOLD" | bc -l)

    triggered_by=""
    [ "$error_breach" = "1" ] &amp;&amp; triggered_by="error_rate(${error_rate})"
    [ "$latency_breach" = "1" ] &amp;&amp; triggered_by="${triggered_by:+${triggered_by}, }latency_p99(${latency}s)"

    echo "[$ts] error_rate=${error_rate} | latency_p99=${latency}s | breach=${triggered_by:-none}"

    if [ "$error_breach" = "1" ] || [ "$latency_breach" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT} | Triggered by: ${triggered_by}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo ""
            echo "  ROLLBACK TRIGGERED"
            echo "  Signal: ${triggered_by}"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done
</code></pre>
<h3 id="heading-how-the-scripts-work">How the Scripts Work</h3>
<p>The <code>error rate query</code> asks Prometheus: <em>"What fraction of requests to the canary returned an error in the last minute?"</em> A result of <code>0.0</code> means no errors. A result of <code>0.06</code> means 6% of requests are failing, above the 5% threshold. You see this in the output as:</p>
<pre><code class="language-plaintext">error_rate=0.06 | threshold=0.05 | breach=YES
</code></pre>
<p>The <code>latency query</code> asks: <em>"How slow is the slowest 1% of requests to the canary right now?"</em> A result of <code>5.234</code> means 1 in every 100 requests is taking over 5 seconds. You see this as:</p>
<pre><code class="language-plaintext">latency_p99=5.234s | breach=latency_p99(5.234s)
</code></pre>
<p>V1 only runs the first query. V2 runs both. Same canary, same problem, different answers.</p>
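<p>Both scripts parse the Prometheus response with the same inline Python snippet. As a standalone sketch, that parsing step could look like the following. The function name <code>parse_instant_value</code> is hypothetical; the response shape it expects is the one returned by the <code>/api/v1/query</code> endpoint the scripts call.</p>

```python
import json
import math


def parse_instant_value(payload, default=0.0):
    """Extract the first sample value from a Prometheus instant-query response.

    Returns default when the result is empty or the value is NaN, which the
    error-rate query can produce when the canary served no requests (0/0).
    parse_instant_value is a hypothetical helper, not part of the repo.
    """
    body = json.loads(payload)
    result = body.get("data", {}).get("result", [])
    if not result:
        return default
    # Each sample is [unix_timestamp, "value-as-string"].
    value = float(result[0]["value"][1])
    return default if math.isnan(value) else value
```

<p>Like the inline snippet, this treats missing data as 0, so a canary with no metrics reads as healthy. That is one reason the 60-second wait after setup matters.</p>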
<p>The three-strike rule means a single bad check doesn't trigger a rollback – three in a row does. The tradeoff is 45 seconds (three checks at 15 seconds each) of exposure before the rollback fires.</p>
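<p>The strike logic is small enough to sketch on its own. The function below is a hypothetical Python rendering of the counter in the Bash scripts, written only to show how consecutive breaches accumulate and how a single healthy check resets them.</p>

```python
def checks_until_rollback(breaches, strike_limit=3):
    """Return the 1-based index of the check that triggers rollback, or None.

    breaches is a sequence of booleans, one per check (True = threshold breached).
    A healthy check resets the strike counter, as in the watch scripts.
    """
    strikes = 0
    for index, breached in enumerate(breaches, start=1):
        strikes = strikes + 1 if breached else 0
        if strikes >= strike_limit:
            return index
    return None
```

<p>For example, <code>checks_until_rollback([False, True, True, True])</code> returns <code>4</code>, while <code>[True, True, False, True, True]</code> never triggers because the healthy third check resets the count.</p>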
<p>When three strikes hit, the watch script itself runs:</p>
<pre><code class="language-plaintext">kubectl rollout undo deployment/myapp-canary -n default
</code></pre>
<p>That one line is what triggers the rollback. It lives inside <code>canary_watch_v2.sh</code> and runs automatically – you don't have to do anything. The script detects, decides, and acts.</p>
<h3 id="heading-break-it-on-purpose">Break it On Purpose</h3>
<p><strong>In Terminal 1</strong>, start the v1 monitor:</p>
<pre><code class="language-plaintext">./canary_watch_v1.sh
</code></pre>
<p>You will see this repeating every 15 seconds:</p>
<pre><code class="language-plaintext">Canary monitor running (v1 - error rate only).
Rollback triggers if error rate exceeds 0.05 for 3 checks.

[2026-05-17T11:53:12] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:27] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:42] error_rate=0 | threshold=0.05 | breach=NO
</code></pre>
<p><code>breach=NO</code> means the canary looks healthy. Leave this running and move to Terminal 2.</p>
<p><strong>In Terminal 2</strong>, inject latency into the canary:</p>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>This makes every request to the canary take 5 seconds. Requests still return 200 – no errors, just slowness. You will see:</p>
<pre><code class="language-plaintext">Injecting latency into the canary deployment...
deployment "myapp-canary" successfully rolled out
Latency injection is active.

The canary pod is Running and passing its readiness probe.
Every request to the canary now takes 5 seconds.
Error rate: 0%   |   P99 latency: ~5s
</code></pre>
<p>Now look back at Terminal 1. The v1 monitor keeps printing <code>breach=NO</code>. The canary is taking 5 seconds per request and your monitoring says everything is fine. That's the failure.</p>
<p><strong>In Terminal 3</strong>, see what your users are actually experiencing:</p>
<pre><code class="language-plaintext">./check_latency.sh
</code></pre>
<pre><code class="language-plaintext">TIMESTAMP                   STABLE (ms)   CANARY (ms)   STATUS
---------                   -----------   -----------   ------
2026-05-17T11:55:14         18ms          5008ms        CANARY DEGRADED
2026-05-17T11:55:20         7ms           5008ms        CANARY DEGRADED
2026-05-17T11:55:27         6ms           5008ms        CANARY DEGRADED
</code></pre>
<p>Stable is responding in 6–18 milliseconds. Canary is taking over 5 seconds. Users on the canary are waiting 5 seconds for every page load. The v1 monitor in Terminal 1 still says <code>breach=NO</code>.</p>
<p>This is the lesson: the monitoring and the user experience are completely disconnected. The script isn't broken. It's watching the wrong thing.</p>
<p>Now let's see the fix. Press <code>Ctrl+C</code> in Terminal 1 to stop v1. Start v2 in the same terminal:</p>
<pre><code class="language-plaintext">./canary_watch_v2.sh
</code></pre>
<p>In Terminal 2, re-inject the latency:</p>
<pre><code class="language-plaintext">./break_it.sh
</code></pre>
<p>Watch Terminal 1. V2 catches the latency and fires the rollback after three strikes:</p>
<pre><code class="language-plaintext">Canary monitor running (v2 - error rate + P99 latency).
Error threshold: 0.05 | Latency P99 threshold: 2.0s

[2026-05-15T14:30:00] error_rate=0.0 | latency_p99=0.082s | breach=none
[2026-05-15T14:30:15] error_rate=0.0 | latency_p99=5.234s | breach=latency_p99(5.234s)
  Strike 1/3 | Triggered by: latency_p99(5.234s)
[2026-05-15T14:30:30] error_rate=0.0 | latency_p99=5.891s | breach=latency_p99(5.891s)
  Strike 2/3 | Triggered by: latency_p99(5.891s)
[2026-05-15T14:30:45] error_rate=0.0 | latency_p99=6.102s | breach=latency_p99(6.102s)
  Strike 3/3 | Triggered by: latency_p99(6.102s)

  ROLLBACK TRIGGERED
  Signal: latency_p99(6.102s)

deployment.apps/myapp-canary rolled back
</code></pre>
<p>The error rate never moved from 0. V2 rolled back anyway because latency crossed the threshold. That's the difference one extra measurement makes.</p>
<p>After the rollback, confirm the canary is dormant but not deleted:</p>
<pre><code class="language-plaintext">kubectl rollout history deployment/myapp-canary -n default
</code></pre>
<pre><code class="language-plaintext">REVISION  CHANGE-CAUSE
1         &lt;none&gt;
2         &lt;none&gt;
</code></pre>
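<p>Two revisions. The rollback scaled the slow revision down to zero and restored the previous one (<code>kubectl</code> records the restored template under a new revision number, so the numbers in your output may differ). Nothing was deleted, and you can re-deploy if you decide the rollback was a false alarm.</p>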
<h3 id="heading-the-decision-the-script-cant-make-for-you">The Decision the Script Can't Make For You</h3>
<p>V2 rolled back based on latency with zero errors. Before re-deploying, ask whether the latency was a real regression in the new code or a temporary spike, such as a database cache warming up on first use. Both produce the same signal. Only you know which is more likely given what changed.</p>
<p>False-positive rollbacks slow down deployments and erode confidence in automation. The right thresholds depend on your users and your system. The script enforces whatever you configure.</p>
<h3 id="heading-teardown">Teardown</h3>
<pre><code class="language-plaintext">./teardown.sh
</code></pre>
<h2 id="heading-what-you-can-do-now">What You Can Do Now</h2>
<p>Each use case in this handbook was a script solving a specific problem the standard tooling wasn't catching. Here's where you land:</p>
<p>You can catch AWS cost spikes before the invoice arrives, and you know that the service label is AWS's attribution, not a pointer to what actually caused the cost. Start from what changed operationally, not from the billing label.</p>
<p>You can reconstruct the full timeline of any failed request across multiple services from a single trace ID, and you know that a missing service in that timeline is evidence, not just an absence.</p>
<p>You can detect infrastructure drift by comparing what Terraform believes against what AWS actually contains, and you know that a clean result means the resources Terraform manages are in sync, not that your entire AWS account is clean.</p>
<p>You can validate a secret rotation at the application level, not just at the infrastructure level, and you know the difference between a readiness probe passing and the application actually being able to connect to the database.</p>
<p>You can build a canary rollback trigger that watches the right signals, and you know why watching only error rates can leave a slow, broken deployment running while users wait.</p>
<p>The pattern across all five use cases is the same: the standard tooling reported everything as fine while something was actually broken. The cost script returned clean, the pod showed Running, and the canary showed zero errors – not because the tools were wrong but because they were only checking what was easy to check. These scripts check what the standard tooling skips.</p>
<p><strong>GitHub repo:</strong> <a href="https://github.com/Osomudeya/devops-scripting-labs.git">https://github.com/Osomudeya/devops-scripting-labs</a></p>
<p>I write about DevOps weekly, covering real systems, real incidents, and interview and CV tips – <a href="https://osomudeya.kit.com/23db7ca59f"><strong>Join the newsletter</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Local DevOps HomeLab with Docker, Kubernetes, and Ansible ]]>
                </title>
                <description>
                    <![CDATA[ The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS. I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $ ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-local-devops-homelab-with-docker-kubernetes-and-ansible/</link>
                <guid isPermaLink="false">69dd667c217f5dfcbd55b7b4</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Homelab ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops articles ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 21:56:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/1e970f8b-eb52-4582-9c98-13cbce867c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS.</p>
<p>I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $34 bill for a machine running nothing.</p>
<p>That was the last time I practiced on someone else's infrastructure.</p>
<p>Everything in this guide runs on your laptop. No cloud account, no credit card, no bill at the end of the month. By the end, you'll be able to spin up a multi-server environment from scratch, configure it automatically with Ansible, serve a site you wrote yourself, and diagnose what breaks when you intentionally destroy it.</p>
<p>That last part is where the actual learning happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>A laptop with at least 8GB of RAM (16GB is better)</p>
</li>
<li><p>At least 20GB of free disk space</p>
</li>
<li><p>Windows, macOS, or Linux operating system</p>
</li>
<li><p>Administrator access to your computer</p>
</li>
<li><p>Virtualization enabled in your BIOS/UEFI settings</p>
</li>
<li><p>A stable internet connection for the initial downloads</p>
</li>
</ul>
<p>Knowledge and comfort level:</p>
<ul>
<li><p>You should be comfortable using a terminal (running commands, changing directories, and editing small text files with whatever editor you like).</p>
</li>
<li><p>Basic familiarity with concepts like “a server,” “SSH,” and “a port” helps, but you don't need prior experience with Docker, Kubernetes, Vagrant, or Ansible. This guide introduces them as you go.</p>
</li>
</ul>
<p>If you can follow step-by-step instructions and read error output without panicking, you're ready.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-is-devops">What is DevOps?</a></p>
</li>
<li><p><a href="#heading-why-build-a-local-lab">Why Build a Local Lab?</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-docker">How to Set Up Docker</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</a></p>
</li>
<li><p><a href="#heading-how-to-install-kubectl">How to Install kubectl</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-vagrant">How to Set Up Vagrant</a></p>
</li>
<li><p><a href="#heading-how-to-install-ansible">How to Install Ansible</a></p>
</li>
<li><p><a href="#heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</a></p>
</li>
<li><p><a href="#heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</a></p>
</li>
<li><p><a href="#heading-what-you-can-now-do">What You Can Now Do</a></p>
</li>
</ol>
<h2 id="heading-what-is-devops">What is DevOps?</h2>
<p>DevOps is the practice of breaking down the wall between software development and IT operations teams.</p>
<p>Traditionally, developers write code and hand it off to operations teams to deploy and maintain. That handoff causes delays, misunderstandings, and outages. DevOps is what happens when both teams work together from the start.</p>
<p>The tools you'll install in this guide each solve a specific part of that process:</p>
<ul>
<li><p><strong>Docker</strong> packages your application and everything it needs into a portable container that runs the same way on any machine.</p>
</li>
<li><p><strong>Kubernetes</strong> manages multiple containers at scale, handling restarts, networking, and load balancing automatically.</p>
</li>
<li><p><strong>Vagrant</strong> creates and manages virtual machine environments so your whole team always works on identical setups.</p>
</li>
<li><p><strong>Ansible</strong> automates repetitive configuration tasks across many servers without writing a script for each one.</p>
</li>
</ul>
<h2 id="heading-why-build-a-local-lab">Why Build a Local Lab?</h2>
<p>A local lab gives you a safe place to break things, fix them, and learn from that process without any cost or risk.</p>
<p>Here's what you get with a local setup:</p>
<ul>
<li><p><strong>Zero cost.</strong> No cloud bills, no surprise charges, and no credit card required.</p>
</li>
<li><p><strong>Works offline.</strong> Practice anywhere, even without internet after the initial setup.</p>
</li>
<li><p><strong>Full control.</strong> You manage every layer from the OS up to the application.</p>
</li>
<li><p><strong>Safe experimentation.</strong> Break things freely. Nothing here affects production.</p>
</li>
<li><p><strong>Fast feedback.</strong> No waiting for cloud resources to spin up. Everything runs on your machine.</p>
</li>
</ul>
<p>The tradeoff is resource limits. Your laptop's CPU and RAM are the ceiling. You can't simulate large-scale deployments, and some cloud-native services like AWS Lambda or S3 have no direct local equivalent. But for learning core DevOps workflows, none of that matters.</p>
<h2 id="heading-how-to-set-up-docker">How to Set Up Docker</h2>
<p>Docker is the foundation of this lab. Every other tool in this guide either runs inside Docker containers or works alongside them.</p>
<h3 id="heading-how-to-install-docker-on-windows">How to Install Docker on Windows</h3>
<p>First, enable virtualization in your BIOS:</p>
<ol>
<li><p>Restart your computer and enter BIOS/UEFI setup. The key is usually F2, F10, Del, or Esc during boot.</p>
</li>
<li><p>Find the virtualization setting. It's usually listed as Intel VT-x, AMD-V, SVM, or Virtualization Technology.</p>
</li>
<li><p>Enable it, save your changes, and exit.</p>
</li>
</ol>
<p>Then install Docker Desktop:</p>
<ol>
<li><p>Download Docker Desktop from <a href="https://www.docker.com/products/docker-desktop/">Docker's official website</a>.</p>
</li>
<li><p>Run the installer and follow the prompts.</p>
</li>
<li><p>Enable WSL 2 (Windows Subsystem for Linux) when asked.</p>
</li>
<li><p>Restart your computer.</p>
</li>
<li><p>Open Docker Desktop from the Start menu and wait for the whale icon in the taskbar to stop animating.</p>
</li>
</ol>
<p><strong>Troubleshooting:</strong> If Docker fails to start, run this in PowerShell as Administrator to verify virtualization is active:</p>
<pre><code class="language-powershell">systeminfo | Select-String -Pattern "Hyper-V Requirements" -Context 0,3
</code></pre>
<p>All four items should show "Yes". If they don't, revisit your BIOS settings. If the output instead reports that a hypervisor has been detected, virtualization is already enabled and in use.</p>
<h3 id="heading-how-to-install-docker-on-mac">How to Install Docker on Mac</h3>
<ol>
<li><p>Download Docker Desktop for Mac from <a href="https://www.docker.com/products/docker-desktop/">Docker's website</a>.</p>
</li>
<li><p>Open the downloaded <code>.dmg</code> file and drag Docker to your Applications folder.</p>
</li>
<li><p>Open Docker from Applications.</p>
</li>
<li><p>Enter your password when prompted.</p>
</li>
<li><p>Wait for the whale icon in the menu bar to stop animating.</p>
</li>
</ol>
<h3 id="heading-how-to-install-docker-on-linux">How to Install Docker on Linux</h3>
<p>Run these commands in order (they assume Ubuntu):</p>
<pre><code class="language-bash"># Update your package lists
sudo apt-get update

# Install prerequisites
sudo apt-get install -y ca-certificates curl

# Add Docker's official GPG key to a dedicated keyring (apt-key is deprecated)
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the Docker repository for your CPU architecture and Ubuntu release
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release &amp;&amp; echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null

# Update and install Docker
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add your user to the docker group
sudo usermod -aG docker "$USER"
</code></pre>
<p>Log out and back in for the group change to take effect.</p>
<h3 id="heading-how-to-test-docker">How to Test Docker</h3>
<p>Run this command:</p>
<pre><code class="language-bash">docker run hello-world
</code></pre>
<p>If you see "Hello from Docker!" then Docker is working correctly.</p>
<p>Docker is set up. Next, you'll install Kubernetes to manage containers at scale.</p>
<h2 id="heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</h2>
<p>Kubernetes manages containers at scale. For a local lab, you have four options. Here's how to choose:</p>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Best for</th>
<th>RAM needed</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Minikube</strong></td>
<td>Beginners. Easiest setup, built-in dashboard</td>
<td>2GB+</td>
</tr>
<tr>
<td><strong>Kind</strong></td>
<td>Faster startup, works well inside CI pipelines</td>
<td>1GB+</td>
</tr>
<tr>
<td><strong>k3s</strong></td>
<td>Low-resource machines. Lightweight but production-like</td>
<td>512MB+</td>
</tr>
<tr>
<td><strong>kubeadm</strong></td>
<td>Learning how clusters are actually bootstrapped in production</td>
<td>2GB+ per node</td>
</tr>
</tbody></table>
<p>If you're just starting out, use Minikube. It has the simplest setup and a visual dashboard that helps you understand what's happening inside the cluster.</p>
<p>If your laptop has 8GB RAM or less, use k3s. It runs lean and behaves closer to a real cluster than Minikube does.</p>
<p>Use kubeadm only if you want to understand how Kubernetes nodes join a cluster — it requires more manual steps and isn't beginner-friendly.</p>
<h3 id="heading-how-to-install-minikube-recommended-for-beginners">How to Install Minikube (Recommended for Beginners)</h3>
<p>Minikube creates a single-node Kubernetes cluster on your laptop.</p>
<p>On Windows:</p>
<ol>
<li><p>Download the Minikube installer from <a href="https://github.com/kubernetes/minikube/releases">Minikube's GitHub releases page</a>.</p>
</li>
<li><p>Run the <code>.exe</code> installer.</p>
</li>
<li><p>Open Command Prompt as Administrator and start Minikube:</p>
</li>
</ol>
<pre><code class="language-cmd">minikube start --driver=docker
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install minikube
minikube start --driver=docker
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube-linux-amd64
sudo mv minikube-linux-amd64 /usr/local/bin/minikube
minikube start --driver=docker
</code></pre>
<p>Test your cluster:</p>
<pre><code class="language-bash">minikube status
minikube dashboard
</code></pre>
<h3 id="heading-how-to-install-k3s-recommended-for-low-ram-machines">How to Install k3s (Recommended for Low-RAM Machines)</h3>
<p>k3s is a lightweight version of Kubernetes that installs in under a minute. It runs lean and behaves like a real cluster — not a simplified demo version.</p>
<p>On Linux (and Mac via Multipass):</p>
<pre><code class="language-bash">curl -sfL https://get.k3s.io | sh -
</code></pre>
<p>That single command installs k3s and runs it automatically in the background. Check that it is running:</p>
<pre><code class="language-bash">sudo k3s kubectl get nodes
</code></pre>
<p>You should see one node with status <code>Ready</code>.</p>
<p>On Mac directly — k3s doesn't run natively on macOS. Use <a href="https://multipass.run">Multipass</a> to spin up a lightweight Ubuntu VM first, then run the install command inside it.</p>
<p>On Windows — use WSL2 (Ubuntu), then run the install command inside your WSL2 terminal.</p>
<h3 id="heading-how-to-install-kind-kubernetes-in-docker">How to Install Kind (Kubernetes IN Docker)</h3>
<p>Kind runs a full Kubernetes cluster inside Docker containers. It starts faster than Minikube and is useful if you want to run multiple clusters simultaneously.</p>
<pre><code class="language-bash"># Mac or Linux
brew install kind

# Windows
choco install kind
</code></pre>
<p>Create a cluster:</p>
<pre><code class="language-bash">kind create cluster --name my-local-lab
</code></pre>
<h3 id="heading-how-to-install-kubeadm-for-understanding-cluster-bootstrap">How to Install kubeadm (For Understanding Cluster Bootstrap)</h3>
<p>kubeadm is the tool Kubernetes uses to initialize and join nodes in a real cluster. Use this when you want to understand what happens under the hood — not as your daily driver.</p>
<p>A multi-node cluster needs at least two machines (or VMs): one control plane and one worker. The setup is more involved than the options above. Follow the <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/">official kubeadm installation guide</a> for your OS, then initialize your cluster:</p>
<pre><code class="language-bash">sudo kubeadm init --pod-network-cidr=10.244.0.0/16
</code></pre>
<p>After init, join worker nodes using the command kubeadm prints at the end of the output.</p>
<h3 id="heading-how-to-install-kubectl">How to Install kubectl</h3>
<p>kubectl is the command-line tool you use to interact with any Kubernetes cluster.</p>
<p>On Windows:</p>
<p>Download <code>kubectl.exe</code> from <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/">Kubernetes' website</a> and place it in a directory that is in your PATH. Or install with Chocolatey:</p>
<pre><code class="language-cmd">choco install kubernetes-cli
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install kubectl
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl
</code></pre>
<p>Test it:</p>
<pre><code class="language-bash">kubectl get pods --all-namespaces
</code></pre>
<p>On a fresh cluster, you'll see system pods running in the <code>kube-system</code> namespace — things like <code>coredns</code> and <code>storage-provisioner</code>. That's the expected output. It means your cluster is up and kubectl can talk to it.</p>
<p>Kubernetes is running. Next is Vagrant. But before that, there's one important distinction worth making.</p>
<h4 id="heading-docker-vs-vagrant-they-arent-the-same-thing">Docker vs Vagrant — they aren't the same thing</h4>
<p>Docker creates containers: lightweight processes that share your operating system's kernel. Vagrant creates full virtual machines: isolated computers with their own OS running inside your laptop.</p>
<p>Containers are fast and small. VMs are heavier but behave exactly like real servers. You'll use both in this lab for different reasons.</p>
<h2 id="heading-how-to-set-up-vagrant">How to Set Up Vagrant</h2>
<p>Vagrant lets you create and manage reproducible virtual machine environments. It is ideal for simulating multi-server setups on a single laptop.</p>
<h3 id="heading-how-to-install-vagrant-on-windows">How to Install Vagrant on Windows</h3>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a> with default options.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
<li><p>Restart your computer if prompted.</p>
</li>
</ol>
<p><strong>Note:</strong> VirtualBox and Hyper-V can't run at the same time on Windows. Check if Hyper-V is active:</p>
<pre><code class="language-cmd">systeminfo | findstr "Hyper-V"
</code></pre>
<p>If it's enabled, you have two options: switch to the Hyper-V Vagrant provider, or disable Hyper-V with:</p>
<pre><code class="language-powershell">Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All
</code></pre>
<p>Restart after disabling.</p>
<h3 id="heading-how-to-install-vagrant-on-mac-and-linux">How to Install Vagrant on Mac and Linux</h3>
<p>On Mac:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>After installation, open <strong>System Preferences &gt; Security &amp; Privacy &gt; General</strong>. You will see a message saying system software from Oracle was blocked. Click <strong>Allow</strong> and restart your Mac. Without this step, VirtualBox will not run.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p><strong>Note for Apple Silicon (M1/M2/M3) Macs:</strong> VirtualBox support on Apple Silicon is still limited. If you're on an M-series Mac, use <a href="https://mac.getutm.app/">UTM</a> as your VM provider instead, or use Multipass, which works natively on Apple Silicon.</p>
<p>On Linux:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p>Verify both are installed:</p>
<pre><code class="language-bash">vboxmanage --version
vagrant --version
</code></pre>
<h3 id="heading-how-to-create-your-first-vagrant-environment">How to Create Your First Vagrant Environment</h3>
<p>Create a new directory for your project. Inside it, create a file named <code>Vagrantfile</code> with this content:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  # Create a private network between VMs
  config.vm.network "private_network", type: "dhcp"

  # Forward port 8080 on your laptop to port 80 on the VM
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Install Nginx when the VM starts
  config.vm.provision "shell", inline: &lt;&lt;-SHELL
    apt-get update
    apt-get install -y nginx
    echo "Hello from Vagrant!" &gt; /var/www/html/index.html
  SHELL
end
</code></pre>
<p>Start the VM:</p>
<pre><code class="language-bash">vagrant up
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/342f11ad-7c7d-40d2-a810-113b8c71edac.png" alt="screenshot showing VB server and terminal installation processes" style="display:block;margin:0 auto" width="1848" height="323" loading="lazy">

<p>Visit <code>http://localhost:8080</code> in your browser. You should see "Hello from Vagrant!"</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/bcd66a76-4a5b-4f26-bb7e-e203672968d8.png" alt="screenshot showing &quot;Hello from Vagrant!&quot; in browser" style="display:block;margin:0 auto" width="643" height="483" loading="lazy">

<h4 id="heading-troubleshooting-ssh-on-windows">Troubleshooting SSH on Windows</h4>
<p>If <code>vagrant ssh</code> fails, try:</p>
<pre><code class="language-bash">vagrant ssh -- -v
</code></pre>
<p>Or connect manually:</p>
<pre><code class="language-bash">ssh -i .vagrant/machines/default/virtualbox/private_key vagrant@127.0.0.1 -p 2222
</code></pre>
<h3 id="heading-how-to-create-a-local-vagrant-box-without-internet">How to Create a Local Vagrant Box Without Internet</h3>
<p><strong>Note:</strong> Most readers can skip this. Only do this if you want to work fully offline after the initial setup.</p>
<ol>
<li><p>Download <a href="https://ubuntu.com/download/server">Ubuntu 20.04 LTS</a> and save the <code>.iso</code> file locally.</p>
</li>
<li><p>Open VirtualBox and create a new VM: Name it <code>ubuntu-devops</code>, Type: Linux, Version: Ubuntu (64-bit).</p>
</li>
<li><p>Assign 2048MB RAM and a 20GB VDI disk.</p>
</li>
<li><p>Attach the <code>.iso</code> under Storage &gt; Optical Drive.</p>
</li>
<li><p>Start the VM and complete the Ubuntu installation.</p>
</li>
<li><p>Once installed, shut down the VM and run:</p>
</li>
</ol>
<pre><code class="language-bash">VBoxManage list vms
vagrant package --base "ubuntu-devops" --output ubuntu2004.box
vagrant box add ubuntu2004 ubuntu2004.box
</code></pre>
<p>You now have a reusable local box that works without internet.</p>
<p>You can spin up virtual machines. Next is Ansible, which automates what goes inside them.</p>
<h2 id="heading-how-to-install-ansible">How to Install Ansible</h2>
<p>Ansible automates configuration and software installation across multiple servers. Instead of SSH-ing into ten machines and running the same commands manually, you write a playbook once and Ansible handles the rest.</p>
<h3 id="heading-how-to-install-ansible-on-windows">How to Install Ansible on Windows</h3>
<p>Ansible doesn't run natively on Windows. You need to use it through WSL (Windows Subsystem for Linux).</p>
<ol>
<li>Open PowerShell as Administrator and enable WSL:</li>
</ol>
<pre><code class="language-powershell">dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
</code></pre>
<ol start="2">
<li><p>Restart your computer.</p>
</li>
<li><p>Install Ubuntu from the Microsoft Store.</p>
</li>
<li><p>Open Ubuntu and install Ansible:</p>
</li>
</ol>
<pre><code class="language-bash">sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-mac">How to Install Ansible on Mac</h3>
<pre><code class="language-bash">brew install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-linux">How to Install Ansible on Linux</h3>
<pre><code class="language-bash"># Ubuntu/Debian
sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

# Red Hat/CentOS (if the package is not found, enable the EPEL repository first)
sudo yum install ansible
</code></pre>
<h3 id="heading-how-to-test-ansible">How to Test Ansible</h3>
<p>Create a file called <code>hosts</code> in your current directory:</p>
<pre><code class="language-ini">[local]
localhost ansible_connection=local
</code></pre>
<p>Create a file called <code>playbook.yml</code> in the same directory:</p>
<pre><code class="language-yaml">---
- name: Test playbook
  hosts: local
  tasks:
    - name: Print a message
      debug:
        msg: "Ansible is working!"
</code></pre>
<p>Run the playbook, passing the local <code>hosts</code> file with <code>-i</code>:</p>
<pre><code class="language-bash">ansible-playbook -i hosts playbook.yml
</code></pre>
<p>You should see the message "Ansible is working!" in the output.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/081e6ff3-b983-42a0-960e-5340bbd24e3b.png" alt="screenshot showing ansible playbook complete terminal installation" style="display:block;margin:0 auto" width="849" height="287" loading="lazy">

<p>Alright, all your tools are installed. Now you'll use them together to build something real.</p>
<h2 id="heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</h2>
<p>You can find the entire code for this lab in this repo: <a href="https://github.com/Osomudeya/homelab-demo-article">https://github.com/Osomudeya/homelab-demo-article</a></p>
<p>Now you'll put these tools together in one project. Each tool does the job it was designed for, and none is included just for show.</p>
<p><strong>Before you start,</strong> create a fresh directory for this project. Don't run it inside the directory you used to test Vagrant earlier, as the Vagrantfile here is different and will conflict.</p>
<p>You'll be building a two-VM environment: one machine serves a web page you write yourself inside a Docker container, and the other runs a MariaDB database. Vagrant creates the machines and Ansible configures them. The page you see at the end is yours.</p>
<h3 id="heading-step-1-create-the-project-directory">Step 1: Create the Project Directory</h3>
<pre><code class="language-bash">mkdir devops-lab-project &amp;&amp; cd devops-lab-project
</code></pre>
<h3 id="heading-step-2-write-your-site-content">Step 2: Write Your Site Content</h3>
<p>Create a file called <code>index.html</code> in the project directory. Write whatever you want on this page — it's what you'll see in your browser at the end:</p>
<pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;&lt;title&gt;My DevOps Lab&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;My DevOps Lab&lt;/h1&gt;
    &lt;p&gt;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&lt;/p&gt;
    &lt;p&gt;Built on a laptop. No cloud account needed.&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>Change the text to whatever you like. This is your page.</p>
<h3 id="heading-step-3-write-the-vagrantfile">Step 3: Write the Vagrantfile</h3>
<p>Create a file called <code>Vagrantfile</code> in the same directory:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.33.10"
    web.vm.network "forwarded_port", guest: 80, host: 8080
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.33.11"
  end
end
</code></pre>
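<p>Both VMs need to land on the same private network for the web machine to reach the database machine. A quick way to sanity-check the addresses before running <code>vagrant up</code> is sketched below with Python's standard <code>ipaddress</code> module. The <code>/24</code> mask is an assumption about the host-only network, and <code>shared_network</code> is a hypothetical helper name.</p>
<pre><code class="language-python">import ipaddress

# Private IPs taken from the Vagrantfile above.
VM_IPS = {"web": "192.168.33.10", "db": "192.168.33.11"}

# Assumption: the private network uses a 255.255.255.0 (/24) mask.
ASSUMED_PREFIX = 24


def shared_network(vm_ips, prefix=ASSUMED_PREFIX):
    """Return the common network if every VM IP falls in it, else None."""
    networks = {
        ipaddress.ip_interface(f"{ip}/{prefix}").network
        for ip in vm_ips.values()
    }
    return networks.pop() if len(networks) == 1 else None


network = shared_network(VM_IPS)
print(network)             # 192.168.33.0/24
print(network.is_private)  # True
</code></pre>
<p>If you change one of the addresses in the Vagrantfile and the function returns <code>None</code>, the two machines would end up on different subnets under that mask.</p>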
<h3 id="heading-step-4-start-the-virtual-machines">Step 4: Start the Virtual Machines</h3>
<pre><code class="language-bash">vagrant up
</code></pre>
<p>The first run downloads the <code>ubuntu/focal64</code> box, which is around 500MB.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/264866b0-9977-490e-96a3-69b3070be589.png" alt="screenshot showing virtualbox installation processes in terminal" style="display:block;margin:0 auto" width="867" height="377" loading="lazy">

<p>Expect this to take 10–30 minutes depending on your connection. Subsequent runs will be much faster since the box is cached locally.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/118d2fb2-70f6-41e8-afb2-6f45fb895e98.png" alt="screenshot showing 2 virtualbox servers &quot;running&quot; in VB manager" style="display:block;margin:0 auto" width="926" height="396" loading="lazy">

<h3 id="heading-step-5-create-the-ansible-inventory">Step 5: Create the Ansible Inventory</h3>
<p>Create a file called <code>inventory</code> in the same directory:</p>
<pre><code class="language-ini">[webservers]
192.168.33.10 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

[dbservers]
192.168.33.11 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/db/virtualbox/private_key
</code></pre>
<p>Ansible uses the Vagrant-generated private keys so it can SSH in as the <code>vagrant</code> user. Host key checking for this lab is turned off in <code>ansible.cfg</code> (next step), not in the inventory.</p>
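<p>If you add more machines later, typing these lines by hand gets error-prone. The sketch below generates the same inventory text from a small list. The key path layout is copied from the inventory above, and <code>render_inventory</code> and <code>HOSTS</code> are hypothetical names used only for this illustration.</p>
<pre><code class="language-python"># Key path layout copied from the inventory above.
KEY_PATH = ".vagrant/machines/{machine}/virtualbox/private_key"

# (group name, Vagrant machine name, private IP), all from this project
HOSTS = [
    ("webservers", "web", "192.168.33.10"),
    ("dbservers", "db", "192.168.33.11"),
]


def render_inventory(hosts, user="vagrant"):
    """Build INI-style inventory text, one group per host entry."""
    sections = []
    for group, machine, ip in hosts:
        key = KEY_PATH.format(machine=machine)
        sections.append(
            f"[{group}]\n{ip} ansible_user={user} ansible_ssh_private_key_file={key}"
        )
    return "\n\n".join(sections) + "\n"


print(render_inventory(HOSTS), end="")
</code></pre>
<p>The printed text matches the <code>inventory</code> file above, so you could redirect it into that file instead of editing it manually.</p>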
<h3 id="heading-step-6-create-the-ansible-config-file">Step 6: Create the Ansible Config File</h3>
<p>Before running the playbook, create a file called <code>ansible.cfg</code> in the same directory:</p>
<pre><code class="language-ini">[defaults]
inventory = inventory
host_key_checking = False
</code></pre>
<p>The <code>inventory</code> line tells Ansible to use the inventory file in this folder by default. <code>host_key_checking = False</code> tells Ansible not to verify SSH host keys when connecting to your Vagrant VMs. Without it, Ansible will fail with a <code>Host key verification failed</code> error on first connection because the VM's key is not yet in your <code>known_hosts</code> file.</p>
<p>These settings are for a local lab only. Do not use <code>host_key_checking = False</code> for production systems.</p>
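<p>One way to keep that warning from being forgotten is a small check that flags lab-only settings before a config file is reused elsewhere. The sketch below reads the same <code>ansible.cfg</code> content with Python's standard <code>configparser</code>. The function name <code>lab_only_settings</code> is hypothetical, and the check covers only the one setting discussed here.</p>
<pre><code class="language-python">import configparser

# Same content as the ansible.cfg in this step.
ANSIBLE_CFG = """\
[defaults]
inventory = inventory
host_key_checking = False
"""


def lab_only_settings(cfg_text):
    """Return names of settings that are fine in a lab but risky elsewhere."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    risky = []
    # getboolean understands True/False, yes/no, on/off and 1/0.
    # When the option is absent, fall back to True (checking stays on).
    if not parser.getboolean("defaults", "host_key_checking", fallback=True):
        risky.append("host_key_checking")
    return risky


print(lab_only_settings(ANSIBLE_CFG))  # prints ['host_key_checking']
</code></pre>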
<h3 id="heading-step-7-create-the-ansible-playbook">Step 7: Create the Ansible Playbook</h3>
<p>Create a file called <code>playbook.yml</code>:</p>
<pre><code class="language-yaml">---
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:

    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes

    - name: Start Docker service
      service:
        name: docker
        state: started
        enabled: yes

    # Create the directory that will hold your site content
    - name: Create web content directory
      file:
        path: /var/www/html
        state: directory
        mode: '0755'

    # This copies your index.html from your laptop into the VM
    - name: Copy site content to web server
      copy:
        src: index.html
        dest: /var/www/html/index.html

    # This mounts that file into the Nginx container so it serves your page
    # The -v flag connects /var/www/html on the VM to /usr/share/nginx/html inside the container
    - name: Run Nginx serving your content
      shell: |
        docker rm -f webapp 2&gt;/dev/null || true
        docker run -d --name webapp --restart always -p 80:80 \
          -v /var/www/html:/usr/share/nginx/html:ro nginx

- name: Configure database server
  hosts: dbservers
  become: yes
  tasks:

    # Hash sum mismatch on .deb downloads is often stale lists, a flaky mirror, or apt pipelining
    # behind NAT; fresh indices + Pipeline-Depth 0 usually fixes it on lab VMs.
    - name: Disable apt HTTP pipelining (mirror/proxy hash mismatch workaround)
      copy:
        dest: /etc/apt/apt.conf.d/99disable-pipelining
        content: 'Acquire::http::Pipeline-Depth "0";'
        mode: "0644"

    - name: Clear apt package index cache
      shell: apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/* /var/lib/apt/lists/auxfiles/*
      changed_when: true

    - name: Update apt cache after reset
      apt:
        update_cache: yes

    - name: Install MariaDB
      apt:
        name: mariadb-server
        state: present
        update_cache: no

    - name: Start MariaDB service
      service:
        name: mariadb
        state: started
        enabled: yes
</code></pre>
<p>Two lines are worth paying attention to:</p>
<ul>
<li><p><code>src: index.html</code> — Ansible looks for this file in the same directory as the playbook. That is the file you wrote in Step 2.</p>
</li>
<li><p><code>-v /var/www/html:/usr/share/nginx/html:ro</code> — this mounts the directory from the VM into the Nginx container. The <code>:ro</code> means read-only. Nginx serves whatever is in that folder.</p>
</li>
</ul>
<h3 id="heading-step-8-run-the-playbook">Step 8: Run the Playbook</h3>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>You'll see task-by-task output as Ansible connects to each VM over SSH and configures it. A green <code>ok</code> or yellow <code>changed</code> next to each task means it worked. Red <code>fatal</code> means something failed.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/91241b41-981c-4e23-9dc4-8531e551c39e.png" alt="terminal screenshot of A green ok or yellow changed next to each task means it worked. Red fatal means something failed." style="display:block;margin:0 auto" width="875" height="267" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c02db252-8aff-42e5-b937-d812d070a75b.png" alt="terminal screenshot of playbook run completion" style="display:block;margin:0 auto" width="867" height="425" loading="lazy">

<h3 id="heading-step-9-verify-the-setup">Step 9: Verify the Setup</h3>
<p>Open <code>http://localhost:8080</code> in your browser. You should see the page you wrote in Step 2 served from inside a Docker container, running on a Vagrant VM, configured automatically by Ansible.</p>
<p>If you see the page, every tool in this lab is working together.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0d3d897b-3f51-46fb-b548-832cc5ec3272.png" alt="Browser showing localhost:8082 with the heading &quot;My DevOps Lab&quot; and the text &quot;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&quot;" style="display:block;margin:0 auto" width="746" height="418" loading="lazy">
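<p>You can run the same check from a terminal, which is handy right after the playbook finishes and the container may still be starting. This is a minimal sketch, assuming <code>curl</code> is installed on your laptop. The <code>retry</code> function name is illustrative and not part of the lab files.</p>
<pre><code class="language-bash"># Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds after each failed try. Returns 0 on the first success and 1 if
# every try fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the lab forwards host port 8080 to the web VM):
# retry 10 2 curl -fsS -o /dev/null http://localhost:8080
</code></pre>
<p>With <code>-f</code>, <code>curl</code> returns a non-zero exit code on HTTP errors, so the loop keeps trying until the page is actually served or the attempts run out.</p>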

<h3 id="heading-step-9-clean-up-optional">Step 10: Clean Up (Optional)</h3>
<p>When you're done:</p>
<pre><code class="language-bash">vagrant destroy -f
</code></pre>
<p>This shuts down and deletes both VMs. Your <code>Vagrantfile</code>, <code>inventory</code>, <code>playbook.yml</code>, and <code>index.html</code> stay on disk — run <code>vagrant up</code> followed by <code>ansible-playbook -i inventory playbook.yml</code> any time to bring it all back.</p>
<p>Now that you have a working lab, let's use it properly.</p>
<h2 id="heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</h2>
<p>Following these steps has gotten you a running lab. Breaking things teaches you how everything actually works.</p>
<p>Here are five things to break and what to look for when you do.</p>
<h3 id="heading-break-1-crash-the-main-process-inside-the-container-and-watch-it-come-back">Break 1: Crash the Main Process Inside the Container (and Watch It Come Back)</h3>
<p>This break shows three things: a process inside the container can die (as it would from a real bug or an out-of-memory kill), Docker restarts the container because of <code>--restart always</code>, and your site comes back without re-running Ansible.</p>
<p>After <code>vagrant ssh web</code>, every <code>docker</code> command below runs <strong>on the web VM</strong>. Keep your browser open on your laptop at <a href="http://localhost:8080"><code>http://localhost:8080</code></a> (Vagrant forwards that host port to port 80 on the VM).</p>
<h4 id="heading-troubleshooting-if-your-lab-isnt-ready">Troubleshooting: If Your Lab Isn't Ready</h4>
<p>Run these from your project folder on the host (your laptop) unless the step says to run it on the VM:</p>
<ul>
<li><p>You ran <code>vagrant destroy -f</code>. Run <code>vagrant up</code>, then <code>ansible-playbook -i inventory playbook.yml</code>.</p>
</li>
<li><p><code>docker ps</code> shows <code>webapp</code> but status is Exited. On the web VM, run <code>sudo docker start webapp</code>, then <code>sudo docker ps</code> again.</p>
</li>
<li><p>There's no <code>webapp</code> row in <code>docker ps -a</code>. Re-run <code>ansible-playbook -i inventory playbook.yml</code> on the host.</p>
</li>
</ul>
<p>If the playbook is already applied and <code>webapp</code> is Up, skip this section and start at step 1 under Steps (happy path) below. (Don't skip SSH or <code>docker ps</code>. You need the VM shell and a quick check before you run <code>docker exec</code>.)</p>
<h4 id="heading-steps-happy-path">Steps (happy path)</h4>
<ol>
<li>SSH into the web VM:</li>
</ol>
<pre><code class="language-plaintext">vagrant ssh web
</code></pre>
<ol start="2">
<li><p>Confirm <code>webapp</code> is <strong>Up</strong>:</p>
<pre><code class="language-plaintext">sudo docker ps
</code></pre>
</li>
<li><p><strong>Break it on purpose:</strong> kill the container's main process (PID 1) <strong>from inside</strong> the container. That ends the container the way a crashing app would, which Docker treats differently from <code>docker stop</code> run on the host:</p>
</li>
</ol>
<pre><code class="language-bash">sudo docker exec webapp sh -c 'sleep 5 &amp;&amp; kill 1'
</code></pre>
<p>The <code>sleep 5</code> gives you a moment to switch to the browser. Right after you run the command, open or refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a>. You may catch a brief error or blank page while nothing is listening on port 80.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/3ac89703-63f3-45d8-954f-35adbd2c7dec.png" alt="Browser showing ERR_CONNECTION_RESET on localhost:8082 after the Nginx container process was killed" style="display:block;margin:0 auto" width="1242" height="1057" loading="lazy">

<ol start="4">
<li>Watch Docker restart the container:</li>
</ol>
<pre><code class="language-bash">watch sudo docker ps -a
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5c61d90d-61d6-4023-b3f5-e3eb427e8492.png" alt="Terminal running watch docker ps showing webapp container status as Up 10 seconds after automatic restart" style="display:block;margin:0 auto" width="1011" height="393" loading="lazy">

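<p>Within a few seconds you should see <strong>Exited (137)</strong> become <strong>Up</strong> again. The restart can be quick enough that <code>watch</code>, which refreshes every two seconds by default, skips the Exited state and you only notice the uptime reset. (Press Ctrl+C to exit <code>watch</code>.)</p>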
<ol start="5">
<li>Refresh the browser. You should see the same HTML as before, because the files live on the VM under <code>/var/www/html</code> and are bind-mounted into the container; restarting only replaced the Nginx process, not those files.</li>
</ol>
<h4 id="heading-why-not-docker-stop-or-docker-kill-on-the-host-for-this-demo"><strong>Why not</strong> <code>docker stop</code> <strong>or</strong> <code>docker kill</code> <strong>on the host for this demo?</strong></h4>
<p>Those commands go through Docker's API. On many setups (including recent Docker), Docker treats them as you choosing to stop the container (<code>hasBeenManuallyStopped</code>), and <code>--restart always</code> may not bring the container back until you run <code>docker start</code> yourself or the Docker daemon restarts.</p>
<p>Killing PID 1 from inside the container is treated more like an internal crash, so the restart policy you set in the playbook is the one you actually get to observe here.</p>
<p><strong>Kubernetes analogy:</strong> A pod whose containers exit can be restarted by the kubelet; a pod you delete does not come back by itself.</p>
<p><strong>What to observe (three separate checks):</strong></p>
<ol>
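<li><p><strong>Exit code:</strong> After <code>kill 1</code>, <code>docker ps -a</code> should show that the container exited. A code of 137 is 128 + 9, meaning the main process was ended by SIGKILL. Plain <code>kill 1</code> sends SIGTERM, so if Nginx shuts down gracefully you may see a different code, such as 0. Either way, the exit confirms the container really died from the inside, not that you ran <code>docker stop</code> on the host.</p>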
</li>
<li><p><strong>Restart delay vs browser:</strong> Watch how many seconds pass between Exited and Up in <code>docker ps -a</code>; that interval is Docker applying <code>--restart always</code>. That's separate from what you see in the browser: the browser only shows whether something is accepting connections on port 80 on the VM, so it may show an error or blank page during the gap even while Docker is about to restart the container.</p>
</li>
<li><p><strong>Content after recovery:</strong> After status is Up again, refresh the page. You should see the same HTML as before. That shows your content lives on the VM disk (mounted into the container with <code>-v</code>), not inside a file that vanishes when the container process restarts. The process was replaced, not your <code>index.html</code> on the host path.</p>
</li>
</ol>
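<p>The exit-code arithmetic above is easy to script. The sketch below turns a numeric exit code into a short description. It relies on the common convention that codes above 128 mean "128 plus the signal number". The function name <code>explain_exit_code</code> is illustrative and not part of the lab.</p>
<pre><code class="language-bash"># Hypothetical helper: describe a container exit code.
# Codes above 128 follow the "128 + signal number" convention,
# so 137 means signal 9 (SIGKILL) and 143 means signal 15 (SIGTERM).
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "exited on its own with code $code"
  fi
}

# Example on the web VM (reads the last exit code Docker recorded):
# explain_exit_code "$(sudo docker inspect -f '{{.State.ExitCode}}' webapp)"
</code></pre>
<p>Docker keeps the last exit code in the container's state, so you can still read it after the restart policy has brought the container back up.</p>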
<h3 id="heading-break-2-cause-a-container-name-conflict">Break 2: Cause a Container Name Conflict</h3>
<p>On a single Docker daemon (here, on your web VM), a container name is a <strong>unique label</strong>. Two containers can't share the same name, whether they are running or stopped. Scripts and playbooks that always use <code>docker run --name webapp</code> without cleaning up first hit this error constantly, and recognizing it saves time in real work.</p>
<p><strong>Before you start:</strong> Ansible already created one container named <code>webapp</code>.<br>Stay on the web VM (for example still inside <code>vagrant ssh web</code>) so the commands below run where that container lives.</p>
<p>So now, try to start a second container and also call it <code>webapp</code>. The image is plain <code>nginx</code> here on purpose – the point is the <strong>name clash</strong>, not matching your site’s ports or volume mounts.</p>
<pre><code class="language-plaintext">sudo docker run -d --name webapp nginx
</code></pre>
<p>Docker <strong>doesn't</strong> create a second container. It returns an error immediately, and your original <code>webapp</code> is unchanged.</p>
<p>This is because the name <code>webapp</code> is already registered to the existing container (the error shows that container’s ID). Docker refuses to reuse the name until the old container is removed or renamed.</p>
<p>Example error (your ID will differ):</p>
<pre><code class="language-plaintext">docker: Error response from daemon: Conflict. The container name "/webapp" is already in use by container "2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/1fd42c16-c28e-4539-9290-3583206eb8ff.png" alt="container name conflict terminal error screenshot" style="display:block;margin:0 auto" width="914" height="252" loading="lazy">

<p>To fix it, free the name, then create <code>webapp</code> again the same way the playbook does (publish port 80, mount your HTML, restart policy):</p>
<pre><code class="language-plaintext">sudo docker rm -f webapp
sudo docker run -d --name webapp --restart always -p 80:80 \
  -v /var/www/html:/usr/share/nginx/html:ro nginx
</code></pre>
<p>After that, your site should behave as before (refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a> from your laptop).</p>
<h4 id="heading-what-to-observe">What to observe:</h4>
<p>Read Docker's Conflict message end to end. You should see that the name <code>/webapp</code> is already in use, along with the ID of the container that holds it. In production, that pattern means "something already claimed this name": remove it, rename it, or pick a different name before you run <code>docker run</code> again.</p>
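<p>Scripts can avoid the conflict by checking for the name before calling <code>docker run</code>. The sketch below assumes you feed it the output of <code>docker ps -a --format '{{.Names}}'</code>, which prints one container name per line. The function name <code>name_in_use</code> is made up for this example.</p>
<pre><code class="language-bash"># Hypothetical helper: succeeds (exit 0) when the exact name given as $1
# appears in the list of container names on stdin, one name per line.
name_in_use() {
  grep -Fxq -- "$1"
}

# Example on the web VM:
# if sudo docker ps -a --format '{{.Names}}' | name_in_use webapp; then
#   echo "webapp already exists: remove or rename it before docker run"
# fi
</code></pre>
<p>The <code>-x</code> flag makes <code>grep</code> match whole lines only, so a container called <code>webapp2</code> does not count as a clash with <code>webapp</code>.</p>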
<h3 id="heading-break-3-make-ansible-fail-to-reach-a-vm">Break 3: Make Ansible Fail to Reach a VM</h3>
<p>Ansible separates "could not connect" from "connected, but a task broke." The first is <strong>UNREACHABLE</strong>, the second is <strong>FAILED</strong>. Knowing which one you have tells you whether to fix the network and SSH, or the playbook, packages, and permissions.</p>
<p>On your laptop, in the project folder, edit <code>inventory</code> and change the web server address from <code>192.168.33.10</code> to an IP <strong>no VM uses</strong>, for example <code>192.168.33.99</code>. Save the file.</p>
<pre><code class="language-ini">[webservers]
192.168.33.99 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key
</code></pre>
<p>What you run (from the same project folder on the host):</p>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>After this, Ansible tries to SSH to <code>192.168.33.99</code>. Nothing on your lab network answers as that host (or SSH never succeeds), so Ansible <strong>never runs tasks</strong> on the web server. It stops that host with UNREACHABLE:</p>
<pre><code class="language-plaintext">fatal: [192.168.33.99]: UNREACHABLE! =&gt; {"msg": "Failed to connect to the host via ssh"}
</code></pre>
<p>This is realistic because the same message shape appears when the IP is wrong, the VM isn't running, a firewall blocks port 22, or the network is misconfigured. The common thread is <strong>no working SSH session</strong>.</p>
<p>Now it's time to put it back: restore <code>192.168.33.10</code> in <code>inventory</code> and run <code>ansible-playbook -i inventory playbook.yml</code> again. The web play should reach the VM and complete (assuming your lab is up).</p>
<p><strong>UNREACHABLE vs FAILED – what to observe:</strong></p>
<ul>
<li><p>If Ansible prints UNREACHABLE, you should assume it never opened SSH on that host and never ran tasks there. Go ahead and fix the connection (IP, VM up, firewall, key path) before you debug playbook logic.</p>
</li>
<li><p>If Ansible prints FAILED, you should assume SSH worked and a task returned an error. Read the task output for the real cause (package name, permissions, syntax), not the network first.</p>
</li>
</ul>
<p>When you debug later, you should look at the keyword Ansible prints: <strong>UNREACHABLE</strong> points to reachability while <strong>FAILED</strong> points to task output and the first failed task under that host.</p>
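<p>The same triage can be automated against the <code>PLAY RECAP</code> that Ansible prints at the end of a run, where each host line carries counters such as <code>unreachable=1</code> and <code>failed=0</code>. The sketch below is a rough illustration under that assumption. The function name <code>diagnose_recap</code> and the log file name in the example are hypothetical.</p>
<pre><code class="language-bash"># Hypothetical helper: read ansible-playbook output on stdin and say
# which kind of problem the recap counters point to.
diagnose_recap() {
  recap=$(cat)
  if printf '%s\n' "$recap" | grep -Eq 'unreachable=[1-9]'; then
    echo "connection problem: check IP, VM state, firewall, key path"
  elif printf '%s\n' "$recap" | grep -Eq 'failed=[1-9]'; then
    echo "task problem: read the output of the first failed task"
  else
    echo "no unreachable or failed hosts"
  fi
}

# Example on the host (the full output is kept in playbook.log):
# ansible-playbook -i inventory playbook.yml | tee playbook.log | diagnose_recap
</code></pre>
<p>Unreachable hosts are checked first on purpose: if SSH never worked, there is no task output for that host worth reading yet.</p>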
<h3 id="heading-break-4-fill-the-vms-disk">Break 4: Fill the VM's Disk</h3>
<p>Databases and other services need free disk for logs, temp files, and data. When the filesystem is full or nearly full, a service may fail to start or fail at runtime. This break walks through the same diagnosis habit you would use on a real server: check space, then read systemd and journal output for the service.</p>
<p>All commands below run <strong>on the db VM</strong> after <code>vagrant ssh db</code>. MariaDB was installed there by your playbook.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Open a shell on the db VM:</p>
<pre><code class="language-plaintext">vagrant ssh db
</code></pre>
</li>
<li><p>Allocate a large file full of zeros (here 1GB) to simulate something eating disk space:</p>
<pre><code class="language-plaintext">sudo dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

df -h
</code></pre>
<p>Use <code>df -h</code> to see how full the root filesystem (or relevant mount) is. Your Vagrant disk may be large enough that 1GB only raises usage. If MariaDB still starts, you still practiced the checks. To see a stronger effect, you can repeat with a larger <code>count=</code> <strong>only in a lab</strong> (never fill production disks on purpose without a plan).</p>
</li>
<li><p>Ask systemd to restart MariaDB and show status:</p>
<pre><code class="language-plaintext">sudo systemctl restart mariadb
sudo systemctl status mariadb
</code></pre>
<p>If the disk is critically full, restart may fail or the service may show failed or not running.</p>
</li>
<li><p>If something looks wrong, read recent logs for the MariaDB unit:</p>
<pre><code class="language-plaintext">sudo journalctl -u mariadb --no-pager | tail -20
</code></pre>
<p>Errors often mention disk, space, read-only filesystem, or InnoDB being unable to write.</p>
</li>
<li><p>Clean up so your VM stays usable:</p>
<pre><code class="language-plaintext">sudo rm /tmp/bigfile
</code></pre>
<p>Optionally run <code>sudo systemctl restart mariadb</code> again and confirm it is active (running).</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should use <code>df -h</code> first to confirm whether the filesystem is actually tight. That avoids blaming the database when disk space is fine.</p>
</li>
<li><p>You should read <code>systemctl status mariadb</code> to see whether systemd thinks the service is active, failed, or flapping.</p>
</li>
<li><p>You should read <code>journalctl -u mariadb</code> when status is bad, so you can tie the failure to concrete errors from MariaDB or the kernel (often mentioning disk, space, or read-only filesystem). <strong>Space + status + logs</strong> is the same order you would use on a production server.</p>
</li>
</ul>
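<p>The "check space first" habit can be turned into a quick threshold check. This is a minimal sketch, assuming the input is the output of <code>df -P</code> for a single filesystem (a header line plus one data line, with the usage percentage in the fifth column). The function names and the 90 percent threshold are illustrative.</p>
<pre><code class="language-bash"># Hypothetical helper: print the usage percentage (without the % sign)
# from "df -P" output for a single filesystem on stdin.
usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: compare that percentage with a threshold.
# Returns 0 when below the threshold, 1 when at or above it,
# and 2 when the input could not be parsed.
check_disk() {
  threshold=$1
  pct=$(usage_percent)
  case $pct in
    ''|*[!0-9]*) echo "could not parse df output"; return 2 ;;
  esac
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: filesystem is ${pct}% full"
    return 1
  fi
  echo "OK: filesystem is ${pct}% full"
}

# Example on the db VM:
# df -P / | check_disk 90
</code></pre>
<p>The <code>-P</code> flag asks <code>df</code> for the portable output format, which keeps each filesystem on one line so the column positions stay predictable.</p>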
<h3 id="heading-break-5-run-minikube-out-of-resources">Break 5: Run Minikube Out of Resources</h3>
<p>Kubernetes schedules pods onto nodes that have enough CPU and memory. If you ask for more than the cluster can place, some pods stay <strong>Pending</strong> and <strong>Events</strong> explain why (for example <em>Insufficient cpu</em>). That is not the same as a pod that starts and then crashes.</p>
<p>To do this, you'll need a local cluster (we're using <a href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fmacos%2Fx86-64%2Fstable%2Fbinary+download"><strong>Minikube</strong></a> in this guide) and <code>kubectl</code> on your laptop. This break doesn't use the Vagrant VMs. If you haven't installed Minikube yet, complete the "How to Set Up Kubernetes" section first, or skip this break until you do.</p>
<p>You'll run this on your <strong>Mac, Linux, or Windows terminal</strong> (host), not inside <code>vagrant ssh</code>. If you're still inside a VM, type <code>exit</code> until your prompt is back on the host.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Check Minikube:</p>
<pre><code class="language-plaintext">minikube status
</code></pre>
<p>If it's stopped, start it (Docker driver matches earlier sections):</p>
<pre><code class="language-plaintext">minikube start --driver=docker
</code></pre>
</li>
<li><p>Create a deployment with many replicas so your single Minikube node can't run them all at once:</p>
<pre><code class="language-plaintext">kubectl create deployment stress --image=nginx --replicas=20

# Watch the pods start
kubectl get pods -w
</code></pre>
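<p>Press Ctrl+C when you're done watching. Some pods may stay <strong>Pending</strong> while others are <strong>Running</strong>. The scheduler places pods according to their declared resource requests, and this deployment declares none, so on a roomy machine all 20 may reach Running. If that happens, give each pod a CPU request, for example <code>kubectl set resources deployment stress --requests=cpu=500m</code>, and watch again.</p>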
</li>
<li><p>Pick one Pending pod name from <code>kubectl get pods</code> and inspect it:</p>
<pre><code class="language-plaintext">kubectl describe pod &lt;pod-name&gt;
</code></pre>
<p>Under Events, look for FailedScheduling and a line similar to:</p>
<pre><code class="language-plaintext">Warning  FailedScheduling  0/1 nodes are available: 1 Insufficient cpu.
</code></pre>
<p>You might see <strong>Insufficient memory</strong> instead, depending on your machine.</p>
</li>
<li><p>Fix the lab by scaling back so the cluster can catch up:</p>
<pre><code class="language-plaintext">kubectl scale deployment stress --replicas=2
</code></pre>
<p>You can delete the deployment entirely when finished: <code>kubectl delete deployment stress</code>.</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should see Pending pods stay unscheduled until capacity frees up. That means the scheduler hasn't placed them on any <strong>node</strong> yet, usually because the node is out of CPU or memory for that workload.</p>
</li>
<li><p>You should read <code>kubectl describe pod &lt;pod-name&gt;</code> and scroll to <strong>Events</strong>. Messages like Insufficient cpu or Insufficient memory mean the cluster ran out of schedulable capacity, not that the container image is corrupt.</p>
</li>
<li><p>You should contrast that with a pod that reaches Running and then CrashLoopBackOff, which usually means the process inside the container keeps exiting. That is an application or config problem, not a "nowhere to run" problem.</p>
</li>
</ul>
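<p>With 20 replicas, a tally by status is easier to read than the raw pod list. The sketch below assumes the default column layout of <code>kubectl get pods --no-headers</code> (NAME, READY, STATUS, RESTARTS, AGE), so the status is the third field. The function name <code>count_by_status</code> is made up for this example.</p>
<pre><code class="language-bash"># Hypothetical helper: count pods per STATUS from "kubectl get pods
# --no-headers" output on stdin. Prints "STATUS COUNT" lines, sorted.
count_by_status() {
  awk '{ count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Example on the host:
# kubectl get pods --no-headers | count_by_status
</code></pre>
<p>A growing Pending count with a steady Running count points to scheduling capacity, while pods that flip between Running and CrashLoopBackOff point to the application itself.</p>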
<h2 id="heading-what-you-can-now-do">What You Can Now Do</h2>
<p>You didn't just install tools in this tutorial. You also used them.</p>
<p>You can now spin up two servers from a single file. You can write a playbook that installs software and deploys a container without touching either machine manually.</p>
<p>You can serve a page you wrote from inside a Docker container running on a Vagrant VM, and bring the whole thing back from scratch in one command.</p>
<p>You also broke it. You saw what a container conflict looks like, what Ansible prints when it can't reach a machine, what disk pressure does to a running service, and what a Kubernetes scheduler says when it runs out of resources. Those error messages aren't unfamiliar anymore.</p>
<p>That's the difference between someone who has read about DevOps and someone who has run it.</p>
<p><strong>Here are four free projects you can run in this same lab to go further:</strong></p>
<ul>
<li><p><strong>DevOps Home-Lab 2026</strong> — Build a multi-service app (frontend, API, PostgreSQL, Redis) end-to-end with Docker Compose, Kubernetes, Prometheus/Grafana monitoring, GitOps with ArgoCD, and Cloudflare for global exposure.</p>
</li>
<li><p><strong>KubeLab</strong> — Trigger real Kubernetes failure scenarios, pod crashes, OOMKills, node drains, cascading failures, and watch how the cluster responds using live metrics.</p>
</li>
<li><p><strong>K8s Secrets Lab</strong> — Build a full secret management pipeline from AWS Secrets Manager into your cluster, including rotation behavior and IRSA.</p>
</li>
<li><p><strong>DevOps Troubleshooting Toolkit</strong> — Structured debugging guides across Linux, containers, Kubernetes, cloud, databases, and observability with copy-paste commands for real incidents.</p>
</li>
</ul>
<p>All free and open source: <a href="https://github.com/Osomudeya/List-Of-DevOps-Projects">github.com/Osomudeya/List-Of-DevOps-Projects</a>.</p>
<p>If you want to go deeper, you can find six full chapters covering Terraform, Ansible, monitoring, CI/CD, and a simulated three-VM production environment at <a href="https://osomudeya.gumroad.com/l/BuildYourOwnDevOpsLab">Build Your Own DevOps Lab</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator ]]>
                </title>
                <description>
                    <![CDATA[ If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently? Storing them is straightforward. But handling rotation, stale env vars, and the gap ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-sync-aws-secrets-manager-secrets-into-kubernetes-with-the-external-secrets-operator/</link>
                <guid isPermaLink="false">69c541f010e664c5dadc877e</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cybersecurity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ secrets management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Thu, 26 Mar 2026 14:25:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6cca126e-dd50-4400-ae9d-65449581345b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?</p>
<p>Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.</p>
<p>In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.</p>
<p>By the end, you'll be able to:</p>
<ul>
<li><p>Explain the full architecture from vault to pod</p>
</li>
<li><p>Run the lab locally in about 15 minutes</p>
</li>
<li><p>Prove why environment variables go stale after rotation, while mounted secret files stay fresh</p>
</li>
<li><p>Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD</p>
</li>
<li><p>Troubleshoot the most common failures</p>
</li>
</ul>
<p>Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/ac8bfc9e-304e-41b8-b6a3-7ce1795b29a9.png" alt="Architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-local-lab">How to Run the Local Lab</a></p>
</li>
<li><p><a href="#heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</a></p>
</li>
<li><p><a href="#heading-how-to-test-secret-rotation">How to Test Secret Rotation</a></p>
</li>
<li><p><a href="#heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</a></p>
</li>
<li><p><a href="#heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</a></p>
</li>
<li><p><a href="#heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following tools installed and configured.</p>
<p><strong>For the local lab:</strong></p>
<ul>
<li><p>An AWS account with access to AWS Secrets Manager</p>
</li>
<li><p>The AWS CLI installed and configured. Run <code>aws configure</code> and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.</p>
</li>
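<li><p><code>kubectl</code> installed. For MicroK8s, run <code>microk8s kubectl config view --raw &gt; ~/.kube/config</code> after installation to connect kubectl to your local cluster. This overwrites any existing <code>~/.kube/config</code>, so back that file up first if you already use other clusters.</p>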
</li>
<li><p>Terraform installed</p>
</li>
<li><p>Helm installed</p>
</li>
<li><p>Docker installed</p>
</li>
<li><p>A local Kubernetes cluster: the lab supports MicroK8s and kind. If you do not have either installed, follow the <a href="https://microk8s.io/">MicroK8s install guide</a> before continuing.</p>
</li>
</ul>
<p><strong>For the Amazon Elastic Kubernetes Service sections:</strong></p>
<ul>
<li><p>An Amazon Elastic Kubernetes Service cluster you can create or manage</p>
</li>
<li><p>A GitHub repository you can configure for workflows and secrets</p>
</li>
</ul>
<p>The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-LOCAL.md"><code>docs/DEPLOY-LOCAL.md</code></a> and <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-EKS.md"><code>docs/DEPLOY-EKS.md</code></a>.</p>
<h2 id="heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</h2>
<p>Before you run any command, you need to understand how the pieces connect.</p>
<p>The flow has four stages:</p>
<ol>
<li><p>A developer or automated system updates a secret in AWS Secrets Manager.</p>
</li>
<li><p>The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.</p>
</li>
<li><p>Your pod reads that Kubernetes Secret.</p>
</li>
<li><p>During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/9dc52f99-add4-490a-ad86-25a30d0ae306.png" alt="A step-by-step flow diagram showing the four stages of secret flow above" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-how-the-external-secrets-operator-sync-works">How the External Secrets Operator Sync Works</h3>
<p>The External Secrets Operator reads a custom Kubernetes resource called <code>ExternalSecret</code>. That resource tells the operator three things:</p>
<ul>
<li><p>Which secret store to connect to</p>
</li>
<li><p>Which Kubernetes Secret name to create or update</p>
</li>
<li><p>How often to refresh</p>
</li>
</ul>
<p>In this lab, the <code>ExternalSecret</code> creates a Kubernetes Secret named <code>myapp-database-creds</code>. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.</p>
<h3 id="heading-how-the-app-consumes-secrets">How the App Consumes Secrets</h3>
<p>The sample application exposes three endpoints so you can validate behavior at any time.</p>
<ul>
<li><p><code>/secrets/env</code> shows what environment variables the pod sees</p>
</li>
<li><p><code>/secrets/volume</code> shows what files in the mounted secret directory look like</p>
</li>
<li><p><code>/secrets/compare</code> compares both and reports whether rotation has been detected</p>
</li>
</ul>
<p>The app checks four keys: <code>DB_USERNAME</code>, <code>DB_PASSWORD</code>, <code>DB_HOST</code>, and <code>DB_PORT</code>.</p>
<h2 id="heading-how-to-run-the-local-lab">How to Run the Local Lab</h2>
<p>The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.</p>
<h3 id="heading-step-1-clone-the-repo">Step 1: Clone the Repo</h3>
<pre><code class="language-bash">git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab
</code></pre>
<h3 id="heading-step-2-run-the-spin-up-script">Step 2: Run the Spin-Up Script</h3>
<pre><code class="language-bash">bash spinup.sh
</code></pre>
<p>The script will ask you to choose a local cluster type. Pick MicroK8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.</p>
<p>If the script fails at any point, check <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/TROUBLESHOOTING.md"><code>docs/TROUBLESHOOTING.md</code></a> before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a MicroK8s storage add-on that is not enabled.</p>
<h3 id="heading-important-run-the-lab-ui">Important: Run the Lab UI</h3>
<p>The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application. It's a React-based checklist in <code>lab-ui/</code> that walks you through each concept and checkpoint as you work through the lab.</p>
<p>To start it, open a second terminal and run:</p>
<pre><code class="language-bash">cd lab-ui &amp;&amp; npm install &amp;&amp; npm run dev
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/873166e9-6bff-4e56-a18d-e58b9e9a5af9.png" alt="Screenshot of npm run dev lab ui terminal" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Then open <a href="http://localhost:5173"><code>http://localhost:5173</code></a>. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5a5b220b-3f23-4c7c-8388-f2e23d122e2c.png" alt="Screenshot of The Lab UI, a guided tutorial interface that runs alongside the lab and walks you through each concept and checkpoint." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (<code>localhost:3000</code>) are two separate things: the UI guides you through the steps, and the app shows you the live secrets.</p>
<h3 id="heading-step-3-access-the-application">Step 3: Access the Application</h3>
<p>Once the lab finishes, port-forward the service.</p>
<pre><code class="language-bash">kubectl port-forward svc/myapp 3000:80 -n default
</code></pre>
<p>Open <code>http://localhost:3000</code>. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/dbe122ac-b787-40d0-96f4-4b1276bab017.png" alt="Screenshot of the running application at localhost:3000. Every row in the table should show &quot;Match ✓&quot;" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-step-4-validate-that-secrets-match">Step 4: Validate That Secrets Match</h3>
<p>Run the compare endpoint directly from the terminal.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>When everything is working, the response will include <code>"all_match": true</code>.</p>
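<p>If you want to script this check, for example as a smoke test after a deploy, a small helper can turn the response into an exit code. This is a sketch: the function name <code>all_match</code> is hypothetical, and a plain <code>grep</code> on the JSON text is used only to keep it dependency-free. A real JSON parser would be more robust.</p>
<pre><code class="language-bash"># Hypothetical helper: succeed only when the compare response reports all_match: true.
# Reads the JSON response on stdin and tolerates optional whitespace around the colon.
all_match() {
  grep -Eq '"all_match"[[:space:]]*:[[:space:]]*true'
}

# Usage (needs the port-forward from Step 3):
#   if curl -s http://localhost:3000/secrets/compare | all_match; then echo "in sync"; fi
</code></pre>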
<h2 id="heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</h2>
<p>At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.</p>
<h3 id="heading-step-1-read-the-externalsecret-manifest">Step 1: Read the ExternalSecret Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/external-secret.yaml"><code>k8s/aws/external-secret.yaml</code></a>. Focus on these four fields:</p>
<ul>
<li><p><code>refreshInterval</code>: how often the operator polls AWS Secrets Manager</p>
</li>
<li><p><code>secretStoreRef</code>: which store the operator authenticates against</p>
</li>
<li><p><code>target</code>: the name of the Kubernetes Secret to create</p>
</li>
<li><p><code>data</code>: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys</p>
</li>
</ul>
<p>Here is what that mapping looks like in this lab:</p>
<pre><code class="language-yaml">spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username
</code></pre>
<p>The <code>property</code> field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.</p>
<p>Two fields here are worth understanding before you move on. <code>creationPolicy: Owner</code> means the operator owns the Kubernetes Secret it creates. If you delete the <code>ExternalSecret</code>, the Secret is deleted too. <code>ClusterSecretStore</code> is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain <code>SecretStore</code> is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.</p>
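<p>The snippet above shows only the <code>DB_USERNAME</code> entry. To see how the full <code>data</code> list might look for all four keys, the loop below prints one entry per key. Treat it as a sketch: only the <code>username</code> property appears in the excerpt, so the <code>password</code>, <code>host</code>, and <code>port</code> property names are assumptions, and the repo's manifest is the source of truth.</p>
<pre><code class="language-bash"># Print one ExternalSecret data entry per key.
# Only DB_USERNAME -> username appears in the lab excerpt above; the other
# three property names are assumptions for illustration.
render_data_entries() {
  for pair in DB_USERNAME:username DB_PASSWORD:password DB_HOST:host DB_PORT:port; do
    secret_key=${pair%%:*}   # part before the colon: Kubernetes Secret key
    property=${pair##*:}     # part after the colon: JSON key in the AWS secret
    printf '    - secretKey: %s\n' "$secret_key"
    printf '      remoteRef:\n'
    printf '        key: prod/myapp/database\n'
    printf '        property: %s\n' "$property"
  done
}

render_data_entries
</code></pre>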
<h3 id="heading-step-2-read-the-deployment-manifest">Step 2: Read the Deployment Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/deployment.yaml"><code>k8s/aws/deployment.yaml</code></a>. You are looking for two sections: <code>envFrom</code> and <code>volumeMounts</code>.</p>
<pre><code class="language-yaml">envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true
</code></pre>
<p>Both paths read from the same Kubernetes Secret, <code>myapp-database-creds</code>. The <code>envFrom</code> block injects all keys as environment variables at pod start.<br>The <code>volumeMounts</code> block mounts the same secret as files under <code>/etc/secrets</code>.</p>
<p>This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.</p>
<h3 id="heading-step-3-read-the-app-comparison-logic">Step 3: Read the App Comparison Logic</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/app/server.js"><code>app/server.js</code></a>. The comparison logic reads environment variables from <code>process.env</code> and reads mounted secret files from <code>/etc/secrets/&lt;key&gt;</code>. Then it computes a per-key match and a global <code>all_match</code> value.</p>
<p>The <code>/secrets/compare</code> endpoint sets <code>rotation_detected: true</code> when any key differs between env and volume.</p>
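<p>The same comparison can be expressed in a few lines of shell. This is not the app's code, which is JavaScript in <code>app/server.js</code>. It is a sketch of the logic described above, and the function name <code>compare_secrets</code> and its output format are made up for illustration.</p>
<pre><code class="language-bash"># Sketch of the per-key comparison, written in shell rather than the app's JavaScript.
# secret_dir defaults to the lab's mount path; pass another directory to try it locally.
compare_secrets() {
  secret_dir=${1:-/etc/secrets}
  all_match=true
  for key in DB_USERNAME DB_PASSWORD DB_HOST DB_PORT; do
    env_value=$(printenv "$key")          # value captured at process start
    file_value=$(cat "$secret_dir/$key")  # value currently on disk
    if [ "$env_value" = "$file_value" ]; then
      echo "$key match=true"
    else
      echo "$key match=false"
      all_match=false
    fi
  done
  echo "all_match=$all_match"
}
</code></pre>
<p>Any key that reports <code>match=false</code> corresponds to the case where the endpoint would set <code>rotation_detected: true</code>.</p>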
<h2 id="heading-how-to-test-secret-rotation">How to Test Secret Rotation</h2>
<p>Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.</p>
<h3 id="heading-how-the-rotation-gap-works"><strong>How the Rotation Gap Works</strong></h3>
<p>When a pod starts, Kubernetes gives it two ways to read a secret.</p>
<p>The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.</p>
<p>The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.</p>
<p>Same secret, two paths. One goes stale while one stays fresh.</p>
<p>The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.</p>
<p>That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.</p>
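<p>You can reproduce the sticky note effect on your laptop without a cluster. The sketch below uses a temporary directory as a stand-in for <code>/etc/secrets</code> and two placeholder passwords. It simulates the behavior and is not how the kubelet is implemented.</p>
<pre><code class="language-bash"># Local simulation of the rotation gap; no cluster involved.
secret_dir=$(mktemp -d)                               # stands in for /etc/secrets
printf 'old-password' > "$secret_dir/DB_PASSWORD"     # secret as first synced

# "Pod start": the env var is copied once and never touched again.
DB_PASSWORD=$(cat "$secret_dir/DB_PASSWORD")
export DB_PASSWORD

# "Rotation": only the file changes, as the kubelet would update the mount.
printf 'new-password' > "$secret_dir/DB_PASSWORD"

env_value=$(printenv DB_PASSWORD)
volume_value=$(cat "$secret_dir/DB_PASSWORD")
echo "env=$env_value volume=$volume_value"
# prints: env=old-password volume=new-password
</code></pre>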
<p>Here is what you're about to observe in the lab:</p>
<ul>
<li><p>The rotation script updates the secret in AWS</p>
</li>
<li><p>ESO syncs the new value into Kubernetes within seconds, because the script forces a sync instead of waiting for <code>refreshInterval</code></p>
</li>
<li><p>The volume file updates automatically</p>
</li>
<li><p>The environment variable stays stale until the pod restarts</p>
</li>
<li><p>The <code>/secrets/compare</code> endpoint shows both values side by side so you can see the gap live</p>
</li>
</ul>
<h3 id="heading-step-1-confirm-the-lab-is-ready">Step 1: Confirm the Lab Is Ready</h3>
<p>Make sure your pod and the External Secrets Operator are both running before you start.</p>
<pre><code class="language-bash">kubectl get pods -n external-secrets
kubectl get pods -n default
</code></pre>
<p>Both should show <code>Running</code>.</p>
<h3 id="heading-step-2-run-the-rotation-test-script">Step 2: Run the Rotation Test Script</h3>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>The script performs these actions in order:</p>
<ol>
<li><p>Reads the current <code>DB_PASSWORD</code> from the volume mount at <code>/etc/secrets/DB_PASSWORD</code></p>
</li>
<li><p>Reads the current <code>DB_PASSWORD</code> from the environment variable</p>
</li>
<li><p>Updates AWS Secrets Manager with a new password using <code>put-secret-value</code></p>
</li>
<li><p>Forces an immediate ESO sync by annotating the <code>ExternalSecret</code> with <code>force-sync</code></p>
</li>
<li><p>Reads the volume value again</p>
</li>
<li><p>Reads the environment variable again</p>
</li>
</ol>
<p>After the script runs, the volume and the env var will show different values.</p>
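<p>The repo's script is the source of truth for these steps. As a rough sketch of what they involve, the helpers below use standard <code>kubectl</code> and AWS CLI calls. Several names are assumptions: the Deployment is addressed as <code>deploy/myapp</code>, the <code>ExternalSecret</code> is called <code>app-db-secret</code> (the name used in this article's troubleshooting command), and the JSON keys in the usage comment are placeholders.</p>
<pre><code class="language-bash"># Sketch only: the real steps live in rotation/test-rotation.sh.
# Assumed names: Deployment "myapp", ExternalSecret "app-db-secret", namespace "default".

read_volume() { kubectl exec deploy/myapp -n default -- cat /etc/secrets/DB_PASSWORD; }
read_env()    { kubectl exec deploy/myapp -n default -- printenv DB_PASSWORD; }

# $1 is the complete new secret JSON; put-secret-value stores it as the new current value.
rotate_secret() {
  aws secretsmanager put-secret-value \
    --secret-id prod/myapp/database \
    --secret-string "$1"
}

# Changing an annotation makes ESO reconcile now instead of waiting for refreshInterval.
force_sync() {
  kubectl annotate externalsecret app-db-secret -n default \
    "force-sync=$(date +%s)" --overwrite
}

# Usage (needs the running lab and AWS credentials):
#   read_volume; read_env
#   rotate_secret '{"username":"app","password":"new-password","host":"db","port":"5432"}'
#   force_sync
#   read_volume; read_env
</code></pre>
<p>If the second <code>read_volume</code> still prints the old value, give the kubelet up to a minute to refresh the mount and read it again.</p>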
<h3 id="heading-step-3-validate-with-the-compare-endpoint">Step 3: Validate With the Compare Endpoint</h3>
<p>Hit the compare endpoint and look at the output.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>You'll see something like this:</p>
<pre><code class="language-json">{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c4ebb09f-e605-4f68-8e12-1361d94199b2.png" alt="Rotation mismatch, the volume file updated with the new password but the env var still holds the old value from pod startup." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-step-4-restart-the-deployment-to-sync-env-vars">Step 4: Restart the Deployment to Sync Env Vars</h3>
<p>Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.</p>
<pre><code class="language-bash">kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default
</code></pre>
<p>Then hit <code>/secrets/compare</code> again. The response should now include <code>"all_match": true</code>, with every key matching.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0040274d-a398-408c-9486-ce0a9e527479.png" alt="After a rolling restart, new pods pick up fresh env vars and all keys match." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-how-to-automate-restarts-with-reloader">How to Automate Restarts With Reloader</h3>
<p>If you don't want to restart deployments manually after every rotation, you can install <a href="https://github.com/stakater/reloader"><strong>Stakater Reloader</strong></a>. It watches an annotation on the <code>Deployment</code> and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.</p>
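<p>Once Reloader is installed, opting a workload in usually comes down to one annotation. The sketch below uses Reloader's documented <code>reloader.stakater.com/auto</code> annotation and assumes the Deployment is named <code>myapp</code> in the <code>default</code> namespace. The function name is hypothetical, and the repo's guide may set the annotation in the manifest instead.</p>
<pre><code class="language-bash"># Opt the Deployment in to Reloader's automatic rolling restarts.
# Assumes Reloader is already installed and the Deployment is named "myapp".
enable_reloader() {
  kubectl annotate deployment myapp -n default \
    reloader.stakater.com/auto=true --overwrite
}

# Usage (needs a running lab cluster):
#   enable_reloader
</code></pre>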
<h2 id="heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</h2>
<p>Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the <a href="https://secrets-store-csi-driver.sigs.k8s.io/">Secrets Store CSI Driver</a>.</p>
<p>Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>External Secrets Operator</th>
<th>Secrets Store CSI Driver</th>
</tr>
</thead>
<tbody><tr>
<td>Creates a Kubernetes Secret</td>
<td>Yes</td>
<td>No by default</td>
</tr>
<tr>
<td>Supports <code>envFrom</code></td>
<td>Yes</td>
<td>No (workaround only)</td>
</tr>
<tr>
<td>Secret stored in etcd</td>
<td>Yes (base64-encoded, which is not encryption)</td>
<td>No, if you skip sync</td>
</tr>
<tr>
<td>Rotation</td>
<td>ESO updates the Secret, Reloader restarts pods</td>
<td>Volume file can update in place</td>
</tr>
<tr>
<td>Best for</td>
<td>Most teams. Multi-cloud, env var support</td>
<td>Security policies that prohibit secrets in etcd</td>
</tr>
</tbody></table>
<p>This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both <code>envFrom</code> and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.</p>
<p>Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native <code>envFrom</code> model.</p>
<h2 id="heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</h2>
<p>The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.</p>
<h3 id="heading-step-1-prepare-terraform-and-openid-connect-access">Step 1: Prepare Terraform and OpenID Connect Access</h3>
<p>The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder.</p>
<pre><code class="language-bash">cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn
</code></pre>
<p>Copy the role ARN from the output. You'll need it in the next step.</p>
<h3 id="heading-step-2-set-the-required-environment-variable">Step 2: Set the Required Environment Variable</h3>
<p>The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.</p>
<p>To find your AWS account ID, run:</p>
<pre><code class="language-bash">aws sts get-caller-identity --query Account --output text
</code></pre>
<p>Then set the variable, replacing <code>ACCOUNT</code> with the number that command returns.</p>
<pre><code class="language-bash">export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role
</code></pre>
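<p>Instead of assembling the ARN by hand, you can read it straight from the Terraform output of Step 1 and sanity-check its shape. This is a sketch: <code>is_role_arn</code> is a hypothetical helper, and the usage lines assume you run them from the repo root with a Terraform version that supports <code>-chdir</code> and <code>output -raw</code>.</p>
<pre><code class="language-bash"># Hypothetical check: does the value look like an IAM role ARN with a numeric account ID?
# Catches the common mistake of leaving the ACCOUNT placeholder in place.
is_role_arn() {
  case "$1" in
    arn:aws:iam::[0-9]*:role/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, from the repo root after Step 1:
#   GITHUB_ACTIONS_ROLE_ARN=$(terraform -chdir=terraform/github-oidc output -raw role_arn)
#   export GITHUB_ACTIONS_ROLE_ARN
#   is_role_arn "$GITHUB_ACTIONS_ROLE_ARN" || echo "unexpected value: $GITHUB_ACTIONS_ROLE_ARN"
</code></pre>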
<h3 id="heading-step-3-run-the-spin-up-script-for-amazon-elastic-kubernetes-service">Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service</h3>
<pre><code class="language-bash">bash spinup.sh --cluster eks
</code></pre>
<p>When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing <code>Match ✓</code>.</p>
<h3 id="heading-step-4-test-rotation-on-the-deployed-app">Step 4: Test Rotation on the Deployed App</h3>
<p>After you confirm normal operation, run the rotation test the same way you did locally.</p>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>Then use <code>/secrets/compare</code> on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.</p>
<p>⚠️ <strong>Cost warning:</strong> Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/teardown.sh"><code>bash teardown.sh</code></a> from the repo root to destroy all AWS resources and stop charges.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/56f05ace-9ab6-4b67-ade6-a0bd1fa3962c.png" alt="Screenshot of the app running on the ALB URL, showing all keys matched" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</h2>
<p>The typical CI/CD setup stores <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> in GitHub repository secrets. Those keys never rotate unless someone rotates them by hand. Anyone who can change a workflow in the repo can use them or leak them. When someone leaves the team, you have to revoke keys and update every workflow.</p>
<p>OpenID Connect eliminates that problem entirely.</p>
<h3 id="heading-how-openid-connect-works-for-github-actions">How OpenID Connect Works for GitHub Actions</h3>
<p>GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via <code>AssumeRoleWithWebIdentity</code>. No long-lived keys are ever stored anywhere.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/48e72210-a669-440e-b42e-81b0c15746ec.png" alt="The full OIDC authentication flow for GitHub Actions deploying to EKS — from minting the JWT token through AssumeRoleWithWebIdentity to temporary credentials, kubeconfig retrieval, and final kubectl apply steps." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">
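<p>The "specific repository and branch" restriction is typically enforced by matching the token's subject claim in the role's trust policy. The sketch below builds the subject string GitHub issues for a branch-triggered run. The function name is hypothetical, and the repo's Terraform decides which condition the lab actually uses.</p>
<pre><code class="language-bash"># Build the subject claim GitHub puts in the token for a branch-triggered run.
# A trust policy condition on token.actions.githubusercontent.com:sub would
# need to match this string. The function name is hypothetical.
github_oidc_subject() {
  printf 'repo:%s:ref:refs/heads/%s\n' "$1" "$2"
}

github_oidc_subject YOUR_ORG/YOUR_REPO main
# prints: repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main
</code></pre>
<p>Runs triggered by pull requests or tied to a GitHub environment carry a different subject format, so a policy pinned to a branch will reject them.</p>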

<h3 id="heading-step-1-create-the-iam-role-with-terraform">Step 1: Create the IAM Role With Terraform</h3>
<p>The <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.</p>
<h3 id="heading-step-2-add-the-role-arn-to-github-repository-secrets">Step 2: Add the Role ARN to GitHub Repository Secrets</h3>
<p>In your GitHub repository:</p>
<ol>
<li><p>Go to Settings → Secrets and variables → Actions</p>
</li>
<li><p>Click New repository secret</p>
</li>
<li><p>Name it <code>AWS_ROLE_ARN</code></p>
</li>
<li><p>Paste the role ARN from the Terraform output</p>
</li>
</ol>
<p>That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.</p>
<h3 id="heading-step-3-configure-terraform-state">Step 3: Configure Terraform State</h3>
<p>For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.</p>
<h3 id="heading-step-4-push-to-main-and-let-workflows-run">Step 4: Push to Main and Let Workflows Run</h3>
<p>After your first spin-up, every push to the <code>main</code> branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use <code>/secrets/compare</code> to validate rotation behavior on the live environment.</p>
<h2 id="heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</h2>
<p>Here's a shortlist of the most common symptoms and their fixes.</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Most Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td><code>ExternalSecret</code> is not syncing</td>
<td>Missing credentials or wrong store reference</td>
<td>Confirm the operator can access AWS Secrets Manager and that <code>secretStoreRef</code> points to the correct store</td>
</tr>
<tr>
<td>Pod is stuck in <code>Pending</code></td>
<td>Missing storage setup for local cluster</td>
<td>For MicroK8s, enable the storage add-on</td>
</tr>
<tr>
<td>Env and volume still differ after rotation</td>
<td>Rotation happened but the pod never restarted</td>
<td>Run <code>kubectl rollout restart</code> or install Reloader</td>
</tr>
<tr>
<td>CRD or API version mismatch</td>
<td>ESO version and manifest <code>apiVersion</code> don't match</td>
<td>Verify the <code>apiVersion</code> for <code>ClusterSecretStore</code> and <code>ExternalSecret</code> match your installed ESO version</td>
</tr>
<tr>
<td>Amazon Elastic Kubernetes Service node group never joins</td>
<td>Networking or IAM permissions for nodes are wrong</td>
<td>Fix internet routing and review the node IAM policy</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-inspect-the-operator-and-the-externalsecret">How to Inspect the Operator and the ExternalSecret</h3>
<p>When something isn't syncing, start with these two commands.</p>
<pre><code class="language-bash"># Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets
</code></pre>
<p>The status conditions on the <code>ExternalSecret</code> resource will usually tell you exactly what failed.</p>
<h3 id="heading-how-to-validate-rotation-from-the-app-side">How to Validate Rotation From the App Side</h3>
<p>When you are debugging rotation, don't rely only on Kubernetes resource state. Use the <code>/secrets/compare</code> endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the <code>ExternalSecret</code> and <code>Deployment</code> manifests, and validated that the application sees the right credentials.</p>
<p>You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.</p>
<p>Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.</p>
<p>The lab repository is at <a href="https://github.com/Osomudeya/k8s-secret-lab">github.com/Osomudeya/k8s-secret-lab</a>. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on MicroK8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.</p>
<p>If this helped you, star the repo and share it with someone who is learning Kubernetes.</p>
<p><em>I send weekly breakdowns of real production incidents and how engineers actually fix them, not tutorials but real failures<br>→</em> <a href="https://osomudeya.gumroad.com/subscribe"><em>Join the newsletter</em></a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster ]]>
                </title>
                <description>
                    <![CDATA[ I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet contro ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kubernetes-self-healing-explained/</link>
                <guid isPermaLink="false">69aae80e78c5adcd0e1c63bc</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Testing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Fri, 06 Mar 2026 14:43:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/ef1ba178-622f-4a28-b58a-7fb8a58be964.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet controller fire, read an OOMKill in <code>kubectl describe</code> output, or watched pod endpoints go empty during a cascading failure. That's where 3 am incidents find you. This tutorial puts you on the other side of it.</p>
<p>You will clone one repo, spin up a real 3-node cluster, break it seven different ways, and watch it fix itself each time. No simulated output or fake clusters. Real Kubernetes, real failures, and real recovery. By the end, you will recognize these failure patterns when they show up in your production environment.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-kubelab">What is KubeLab?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-get-the-lab-running">How to Get the Lab Running</a></p>
</li>
<li><p><a href="#heading-simulation-1-kill-random-pod">Simulation 1 — Kill Random Pod</a></p>
</li>
<li><p><a href="#heading-simulation-2-drain-a-worker-node">Simulation 2 — Drain a Worker Node</a></p>
</li>
<li><p><a href="#heading-simulation-3-cpu-stress-and-throttling">Simulation 3 — CPU Stress and Throttling</a></p>
</li>
<li><p><a href="#heading-simulation-4-memory-stress-and-oomkill">Simulation 4 — Memory Stress and OOMKill</a></p>
</li>
<li><p><a href="#heading-simulation-5-database-failure">Simulation 5 — Database Failure</a></p>
</li>
<li><p><a href="#heading-simulation-6-cascading-pod-failure">Simulation 6 — Cascading Pod Failure</a></p>
</li>
<li><p><a href="#heading-simulation-7-readiness-probe-failure">Simulation 7 — Readiness Probe Failure</a></p>
</li>
<li><p><a href="#heading-how-to-read-the-signals-in-grafana">How to Read the Signals in Grafana</a></p>
</li>
<li><p><a href="#heading-how-to-use-this-for-production-debugging">How to Use This for Production Debugging</a></p>
</li>
</ul>
<h2 id="heading-what-is-kubelab"><strong>What is KubeLab?</strong></h2>
<p>KubeLab is an open-source Kubernetes failure simulation lab. It runs a real Node.js backend, a PostgreSQL database, Prometheus and Grafana, all inside a real cluster. When you click "Kill Pod", the backend calls the Kubernetes API and deletes an actual running pod. Nothing is fake.</p>
<table>
<thead>
<tr>
<th>Simulation</th>
<th>What it teaches</th>
</tr>
</thead>
<tbody><tr>
<td>Kill Random Pod</td>
<td>ReplicaSet self-healing, pod immutability</td>
</tr>
<tr>
<td>Drain Worker Node</td>
<td>Zero-downtime maintenance, PodDisruptionBudgets</td>
</tr>
<tr>
<td>CPU Stress</td>
<td>Throttling vs crashing, invisible latency</td>
</tr>
<tr>
<td>Memory Stress</td>
<td>OOMKill, exit code 137, silent restart loops</td>
</tr>
<tr>
<td>Database Failure</td>
<td>StatefulSets, PVC persistence</td>
</tr>
<tr>
<td>Cascading Pod Failure</td>
<td>Why replicas: 2 isn't enough</td>
</tr>
<tr>
<td>Readiness Probe Failure</td>
<td>Liveness vs readiness, traffic control</td>
</tr>
</tbody></table>
<p>Plan about 90 minutes for the full path. Or jump directly to any simulation if you have a specific production problem you want to reproduce.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/1cd2a06d-7a7a-4250-ab5d-8a78d24af7b5.png" alt="KubeLab cluster map — pods grouped by node, color-coded by status. During simulations, chips change color and move between nodes in real time." style="display:block;margin:0 auto" width="920" height="505" loading="lazy">

<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>You need basic familiarity with Docker and comfort with the command line, but no prior Kubernetes experience is required.</p>
<p><strong>Hardware:</strong> 8GB RAM minimum, 16GB recommended. The lab can run on Mac, Linux, or Windows with WSL2. You'll need to install three tools. Multipass spins up Ubuntu VMs for the cluster. kubectl is the Kubernetes CLI you will use for every simulation. Git clones the repo. If you cannot run three VMs, the repo includes a Docker Compose preview at <a href="https://github.com/Osomudeya/kubelab/blob/main/setup/docker-compose-preview.md">setup/docker-compose-preview.md</a>: a full UI with mock data, no real cluster needed.</p>
<h2 id="heading-how-to-get-the-lab-running"><strong>How to Get the Lab Running</strong></h2>
<p>Full cluster setup lives at <a href="https://github.com/Osomudeya/kubelab/blob/main/setup/k8s-cluster-setup.md">setup/k8s-cluster-setup.md</a> in the repo. It walks through creating three VMs with Multipass, installing MicroK8s, joining the worker nodes, and deploying KubeLab. Follow it until all eleven pods show Running:</p>
<pre><code class="language-bash">kubectl get pods -n kubelab
# All 11 pods should show STATUS: Running
</code></pre>
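<p>If you would rather script that check than read the list by eye, a small filter like the one below can print any pod that is not yet Running. This is only a sketch and not part of the KubeLab repo: <code>not_running_pods</code> is a hypothetical helper name, and it assumes the default column order of <code>kubectl get pods --no-headers</code> (NAME, READY, STATUS, ...).</p>
<pre><code class="language-bash"># Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name and status of every pod whose STATUS is not "Running".
not_running_pods() {
  awk '$3 != "Running" { print $1, $3 }'
}

# Example usage against the lab namespace (requires a running cluster):
#   kubectl get pods -n kubelab --no-headers | not_running_pods
</code></pre>
<p>Empty output means every pod reports Running. If you prefer to block until the pods are ready, <code>kubectl wait --for=condition=Ready pods --all -n kubelab --timeout=300s</code> is a built-in alternative.</p>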
<p>Then open two port-forwards in separate terminal tabs and keep them running for the entire tutorial:</p>
<pre><code class="language-bash"># Tab 1 — KubeLab UI at http://localhost:8080
kubectl port-forward -n kubelab svc/frontend 8080:80

# Tab 2 — Grafana at http://localhost:3000
kubectl port-forward -n kubelab svc/grafana 3000:3000
</code></pre>
<p>Grafana login: <code>admin</code> / <code>kubelab-grafana-2026</code>.</p>
<blockquote>
<p>Position the KubeLab UI and Grafana side by side. Left half of the screen is the app. Right half is Grafana. You will watch both simultaneously from Simulation 3 onward.</p>
</blockquote>
<h2 id="heading-simulation-1-kill-random-pod"><strong>Simulation 1: Kill Random Pod</strong></h2>
<p>This simulation deletes a running backend pod via the Kubernetes API. Without Kubernetes, you would SSH to the server, find the crashed process, and restart it manually, usually after a user report wakes you at 3 am.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -w</code>. Watch for a pod to go Terminating then a new one to appear.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/3d3cb733-407a-482f-82e7-cbeea496157b.png" alt="Terminals running side by side before clicking Run, events streaming, pod watch, frontend and grafana port forwarding." style="display:block;margin:0 auto" width="706" height="1250" loading="lazy">

<pre><code class="language-bash">kubectl get pods -n kubelab -w
# backend-abc123  1/1   Terminating   0   2m
# backend-xyz789  1/1   Running       0   0s   ← ReplicaSet created a replacement
</code></pre>
<p><strong>What happened:</strong> The ReplicaSet controller noticed actual(1) did not match desired(2) and created a replacement in parallel with the shutdown. The Endpoints controller removed the dying pod from the Service before SIGTERM fired, so zero traffic hit a dying pod.</p>
<p><strong>The production trap:</strong> A missing readiness probe means the new pod receives traffic before it has opened a DB connection. You get 500s on every deployment for 2–3 seconds.</p>
<p><strong>The fix:</strong> Set <code>replicas: 2</code>, add a readiness probe, and set <code>terminationGracePeriodSeconds</code> to match your longest request timeout.</p>
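<p>As a rough sketch of that fix, the snippet below writes a strategic merge patch to a file. Everything specific in it is an assumption rather than KubeLab's actual manifest: the Deployment and container name <code>backend</code>, port 3000, the probe timings, and the 30-second grace period. The <code>/ready</code> path is the readiness endpoint this lab exposes.</p>
<pre><code class="language-bash"># Write a strategic merge patch (names, port, and timings are illustrative).
cat &gt; backend-readiness-patch.yaml &lt;&lt;'EOF'
spec:
  replicas: 2
  template:
    spec:
      # Should be at least as long as your longest request timeout.
      terminationGracePeriodSeconds: 30
      containers:
        - name: backend            # assumed container name
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000           # assumed container port
            periodSeconds: 5
            failureThreshold: 3
EOF

# Apply it to the Deployment (requires cluster access):
#   kubectl patch deployment backend -n kubelab --patch-file backend-readiness-patch.yaml
</code></pre>
<p>The patch merges into the existing container entry by name, so fields you do not list (image, env, resources) stay as they are.</p>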
<h2 id="heading-simulation-2-drain-a-worker-node"><strong>Simulation 2: Drain a Worker Node</strong></h2>
<p>This simulation cordons a worker node, then evicts all its pods to the remaining node.</p>
<p>To <em><strong>"cordon"</strong></em> a worker node means to mark it as unschedulable. When you run <code>kubectl cordon &lt;node-name&gt;</code>, the Kubernetes control plane adds the <code>node.kubernetes.io/unschedulable:NoSchedule</code> taint to the node. (A <strong>taint</strong> is a marker that tells the scheduler to avoid placing pods on that node unless they have a matching "toleration.") This tells the scheduler to stop placing any new pods onto that node. It does <strong>not</strong> affect the pods that are already running there.</p>
<p>Cordoning is the first, safe step in preparing a node for maintenance. It ensures that while you are draining the node, the scheduler isn't simultaneously trying to schedule new workloads onto it, which would defeat the purpose of the drain.</p>
<p>Without Kubernetes you would drain the server manually, guess when in-flight requests finish, patch it, and bring it back. The downtime window is unpredictable.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -o wide -w</code>. Watch which node each pod runs on.</p>
<pre><code class="language-bash">kubectl get pods -n kubelab -o wide -w
</code></pre>
<pre><code class="language-plaintext">NAME                     NODE               STATUS
backend-abc123-xk2qp    kubelab-worker-1   Terminating   ← evicted
backend-abc123-n7mw3    kubelab-worker-2   Running       ← rescheduled
</code></pre>
<p>In <code>kubectl get nodes</code> the node shows <code>Ready,SchedulingDisabled</code> until you run <code>kubectl uncordon</code>.</p>
<p><strong>What happened:</strong> The node spec got <code>spec.unschedulable=true</code>. The Eviction API then ran once per pod. Unlike a raw <code>kubectl delete pod</code>, that path checks PodDisruptionBudget policy before proceeding, which is why <code>kubectl drain</code> is the safer choice over deleting pods manually during maintenance.</p>
<p><strong>The production trap:</strong> Two replicas with no pod anti-affinity often land on the same node. Drain that node and both pods are evicted at once. The result is complete downtime despite <code>replicas: 2</code>.</p>
<p><strong>The fix:</strong> Use pod anti-affinity with <code>topologyKey: kubernetes.io/hostname</code> and a PodDisruptionBudget with <code>minAvailable: 1</code>.</p>
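<p>A minimal sketch of both pieces, written to files so you can review them before applying. The <code>app: backend</code> label matches the selector used elsewhere in this lab. The object name <code>backend-pdb</code> and the Deployment name <code>backend</code> are assumptions, so adjust them to your manifests.</p>
<pre><code class="language-bash"># PodDisruptionBudget: keep at least one backend pod during voluntary disruptions.
cat &gt; backend-pdb.yaml &lt;&lt;'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb          # hypothetical name
  namespace: kubelab
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: backend
EOF

# Anti-affinity patch: never schedule two app=backend pods on the same node.
cat &gt; backend-anti-affinity-patch.yaml &lt;&lt;'EOF'
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: backend
              topologyKey: kubernetes.io/hostname
EOF

# Apply both (requires cluster access):
#   kubectl apply -f backend-pdb.yaml
#   kubectl patch deployment backend -n kubelab --patch-file backend-anti-affinity-patch.yaml
</code></pre>
<p>One trade-off to be aware of: with a <em>required</em> rule and only two worker nodes, draining one node leaves the evicted replica Pending until the node is uncordoned. <code>preferredDuringSchedulingIgnoredDuringExecution</code> is the softer variant if you would rather have both replicas running on one node than one replica waiting.</p>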
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/1161cbf9-2482-41c7-9b5c-751762d3baaa.png" alt="Node drain CLI output: cordoned node shows Ready,SchedulingDisabled; pods reschedule to the other node." style="display:block;margin:0 auto" width="729" height="128" loading="lazy">

<h2 id="heading-simulation-3-cpu-stress-and-throttling"><strong>Simulation 3: CPU Stress and Throttling</strong></h2>
<p>This simulation burns CPU inside a backend pod for 60 seconds, hitting the 200m limit. Without Kubernetes, one runaway process can consume all CPU on the host and starve every other service.</p>
<p><strong>Before you click:</strong> Run <code>watch -n 2 kubectl top pods -n kubelab</code> and open the Grafana CPU Usage panel.</p>
<pre><code class="language-bash">kubectl top pods -n kubelab
# backend-abc123   200m   ← pegged at limit for 60s; the other pod stays ~15m
</code></pre>
<p><strong>What happened:</strong> The Linux CFS scheduler enforces the cgroup limit by granting 20ms of CPU per 100ms period then freezing all processes in the cgroup for 80ms. The pod is not slow because it is broken. It is slow because it is frozen 80% of the time.</p>
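<p>The arithmetic behind those numbers is simple enough to sketch. The hypothetical helper below converts a CPU limit in millicores into run time and throttled time per CFS period. It assumes the default 100ms period and a single busy thread on a limit below one full core, which is the situation in this simulation.</p>
<pre><code class="language-bash"># Hypothetical helper: CPU limit (millicores) -&gt; run/throttled time per CFS period.
# Assumes one always-busy thread and a limit below 1000m.
cfs_budget() {
  limit_millicores="$1"
  period_ms="${2:-100}"     # default CFS period is 100ms
  run_ms=$(( limit_millicores * period_ms / 1000 ))
  throttled_ms=$(( period_ms - run_ms ))
  throttled_pct=$(( throttled_ms * 100 / period_ms ))
  echo "run=${run_ms}ms throttled=${throttled_ms}ms (${throttled_pct}%)"
}

cfs_budget 200   # prints: run=20ms throttled=80ms (80%)
</code></pre>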
<p><strong>The production trap:</strong> <code>kubectl top</code> shows the pod using 95-150m, which looks normal. The metric shows usage at the ceiling, not the throttle rate. Teams spend hours checking application code for a latency bug that is actually a CPU limit set too low.</p>
<p><strong>The fix:</strong> For latency-sensitive workloads, set CPU requests but remove CPU limits. Requests tell the scheduler where to place the pod without throttling at runtime. Confirm throttling with <code>rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])</code>.</p>
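<p>One way to express that change is a merge patch that keeps a CPU request and removes only the CPU limit. Treat this as a sketch under assumptions: the container name <code>backend</code> and the <code>100m</code> request are illustrative, and it relies on the merge-patch convention that setting a field to <code>null</code> deletes it. Verify the result with <code>kubectl get deployment backend -n kubelab -o yaml</code> afterwards.</p>
<pre><code class="language-bash"># Strategic merge patch: keep a CPU request, drop only the CPU limit.
# The memory limit is not mentioned here, so it stays unchanged.
cat &gt; backend-cpu-patch.yaml &lt;&lt;'EOF'
spec:
  template:
    spec:
      containers:
        - name: backend          # assumed container name
          resources:
            requests:
              cpu: 100m          # illustrative value
            limits:
              cpu: null          # null removes the existing CPU limit
EOF

# Apply it (requires cluster access):
#   kubectl patch deployment backend -n kubelab --patch-file backend-cpu-patch.yaml
</code></pre>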
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5e3fd49b-c9a0-4271-9be7-b7fec3122c1a.png" alt="One backend pod flatlined at exactly 95-150m for 60 seconds. A healthy pod's CPU fluctuates, this flat ceiling is the throttle." style="display:block;margin:0 auto" width="1476" height="788" loading="lazy">

<h2 id="heading-simulation-4-memory-stress-and-oomkill"><strong>Simulation 4: Memory Stress and OOMKill</strong></h2>
<p>This simulation allocates memory in 50MB chunks inside a backend pod until the kernel kills it. Without Kubernetes the process dies, the server goes down, and someone gets paged.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -l app=backend -w</code> and open the Grafana Memory Usage panel.</p>
<pre><code class="language-bash">kubectl get pods -n kubelab -l app=backend -w
# backend-abc123   0/1   OOMKilled   3   5m   ← no Terminating phase; SIGKILL bypasses graceful shutdown
</code></pre>
<p><strong>What happened:</strong> The container's memory usage crossed its 256Mi cgroup limit. The Linux kernel OOM killer scored processes in the container's cgroup and sent SIGKILL (exit code 137) to the top consumer. Not Kubernetes, the kernel. SIGKILL cannot be caught or handled, so no preStop hook runs and in-memory data or open transactions can be lost. Kubernetes only observed the exit, labeled it OOMKilled, and started a fresh container.</p>
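<p>Exit code 137 is not arbitrary: by shell convention, an exit code above 128 means the process was killed by signal number (code minus 128), and 137 minus 128 is 9, which is SIGKILL. The hypothetical helper below decodes any exit code the same way. It is only a convenience sketch for reading <code>kubectl describe</code> output.</p>
<pre><code class="language-bash"># Hypothetical helper: decode a container exit code.
# Codes above 128 mean "killed by signal (code - 128)".
describe_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))
    echo "killed by signal ${sig} (SIG$(kill -l "$sig"))"
  else
    echo "exited with status ${code}"
  fi
}

describe_exit_code 137   # prints: killed by signal 9 (SIGKILL)
</code></pre>
<p>The same decoding explains exit code 143, which is 128 plus 15 (SIGTERM): a graceful shutdown request rather than a kernel kill.</p>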
<p><strong>The production trap:</strong> The pod runs fine for 8 hours, OOMKills, and restarts. Memory resets to zero and everything looks healthy again. This repeats every 8 hours. The restart count climbs to 7, then 15, then 30, but no alert fires because the metrics look normal between crashes. You find out when a user emails saying the app has been "a bit glitchy lately."</p>
<p><strong>The fix:</strong> Alert on <code>increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) &gt; 3</code> before users notice.<br>The Prometheus expression means: look at how many times each container in the <code>kubelab</code> namespace has restarted over the last hour, and fire an alert if that count exceeds 3. (<code>rate()</code> would not work with this threshold, because it returns a per-second value; 3 restarts per hour is a rate of only about 0.0008.) A healthy pod rarely restarts. Several restarts in an hour usually means the container is hitting its memory limit, dying, and coming back in a loop, so this alert catches the silent OOMKill pattern before users do.</p>
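<p>To turn that idea into an actual alert, you need a rule file. The sketch below writes one in the plain Prometheus rule format. Because <code>rate()</code> yields a per-second value, it uses <code>increase()</code> to count restarts over the hour directly. The group name, alert name, <code>for</code> duration, and severity label are all assumptions, and how you load the file depends on how Prometheus is deployed in your cluster.</p>
<pre><code class="language-bash"># Write a Prometheus alerting rule (names, duration, and labels are illustrative).
cat &gt; kubelab-restart-alert.yaml &lt;&lt;'EOF'
groups:
  - name: kubelab-restarts            # hypothetical group name
    rules:
      - alert: ContainerRestartLoop   # hypothetical alert name
        expr: increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) &gt; 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} restarted more than 3 times in the last hour"
EOF

# Validate the file if promtool is installed:
#   promtool check rules kubelab-restart-alert.yaml
</code></pre>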
<p>Confirm it happened:</p>
<pre><code class="language-bash">kubectl describe pod -n kubelab &lt;pod-name&gt; | grep -A 5 "Last State:"
# Reason: OOMKilled
# Exit Code: 137
</code></pre>
<p>To see the last output before the kernel killed the process, run <code>kubectl logs -n kubelab &lt;pod-name&gt; --previous</code>. The log stream stops abruptly with no shutdown message, because SIGKILL leaves no time for cleanup or final logs.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/8ced107b-9d14-4d40-b6d6-7ae0fe35b1b7.png" alt="One backend pod's memory climbs, then the line drops at the OOMKill and reappears as the container restarts. The other pod's line stays flat the whole time" style="display:block;margin:0 auto" width="735" height="298" loading="lazy">

<h2 id="heading-simulation-5-database-failure"><strong>Simulation 5: Database Failure</strong></h2>
<p>This simulation scales the PostgreSQL StatefulSet to 0 replicas. The pod terminates completely. Without Kubernetes, the database server crashes and data recovery depends on whether backups exist and when they ran.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods,pvc -n kubelab</code>. Note that the PVC exists before you start.</p>
<pre><code class="language-bash">kubectl get pods,pvc -n kubelab
# postgres-0   (gone)
# postgres-data-postgres-0   Bound   ← PVC stays; data lives on the volume
</code></pre>
<p>A PVC, or PersistentVolumeClaim, is a request for storage by a user. Think of it as a pod's way of saying, "I need a certain amount of durable, persistent storage." In the context of a stateful application like PostgreSQL, the PVC is critical. When the database pod is deleted, the PVC (and the underlying PersistentVolume it is bound to) remains. This is where the actual database files are stored. When a new <code>postgres-0</code> pod is created, the StatefulSet knows to re-attach the same PVC, ensuring the new pod has access to all the old data, preventing data loss.</p>
<p><strong>What happened:</strong> The StatefulSet controller deleted the pod but left the PersistentVolumeClaim untouched. StatefulSets guarantee stable names and stable PVC binding. <code>postgres-0</code> always mounts <code>postgres-data-postgres-0</code>. When you restore, the same pod name comes back and reattaches the same volume. PostgreSQL replays WAL to reach a consistent state.</p>
<p><strong>The production trap:</strong> Apps without connection retry logic return 500s and stay broken even after PostgreSQL restores. Connection pools that do not validate on acquire hold dead connections forever.</p>
<p><strong>The fix:</strong> Add connection retry with exponential backoff in your app. Use network-attached storage (EBS, GCE PD) in production so the pod can reschedule to any node.</p>
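<p>The retry logic belongs in your application's database client, but the timing pattern is easy to see in a few lines of shell. The sketch below is a hypothetical <code>retry_with_backoff</code> helper that could sit in an entrypoint script. The <code>BACKOFF_START</code> variable and the <code>postgres</code> host name in the usage comment are assumptions, not something the lab defines.</p>
<pre><code class="language-bash"># Hypothetical helper: retry a command with exponential backoff (1s, 2s, 4s, ...).
# Usage: retry_with_backoff MAX_ATTEMPTS command [args...]
retry_with_backoff() {
  max_attempts="$1"; shift
  delay="${BACKOFF_START:-1}"      # first delay in seconds (assumed env override)
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" &gt;&amp;2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" &gt;&amp;2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))         # double the wait after every failure
  done
}

# Example: wait for PostgreSQL before starting the app (host name is an assumption):
#   retry_with_backoff 6 pg_isready -h postgres -p 5432
</code></pre>
<p>Doubling the delay keeps a recovering database from being hit by every client at once. In real application code you would usually also add jitter and a cap on the maximum delay.</p>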
<h2 id="heading-simulation-6-cascading-pod-failure"><strong>Simulation 6: Cascading Pod Failure</strong></h2>
<p>This simulation deletes both backend replicas at the same time. Without Kubernetes, if everything went down you would have to restart every service manually and hope they come up in the right order.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get endpoints -n kubelab backend-service -w</code>. Watch the IP list.</p>
<pre><code class="language-bash">kubectl get endpoints -n kubelab backend-service -w
# ENDPOINTS   &lt;none&gt;   ← every request in this window gets Connection refused
</code></pre>
<p><strong>What happened:</strong> Both pods were deleted. The Service had zero endpoints. The ReplicaSet created two replacements in parallel, but traffic stayed broken until at least one of them passed its readiness probe. The endpoint list went empty and came back. You can see the exact downtime window in Grafana's HTTP Request Rate panel.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/6cae14e0-faf2-4d42-90f4-32d00a1b4119.png" alt="The 5xx spike during Cascading Failure, 5 to 15 seconds of real downtime with the exact window timestamped" style="display:block;margin:0 auto" width="746" height="291" loading="lazy">

<p><strong>The production trap:</strong> <code>replicas: 2</code> protects you from one pod dying at a time, nothing more.<br>If both replicas land on the same node and that node goes down, you have zero replicas and full downtime.<br>Check right now with <code>kubectl get pods -n kubelab -o wide | grep backend</code>, and if both pods show the same NODE, you are one node failure away from an outage.</p>
<p><strong>The fix:</strong> Use pod anti-affinity to force replicas onto different nodes and a PodDisruptionBudget with <code>minAvailable: 1</code> to block any voluntary action that would leave zero replicas.</p>
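<p>To check replica placement without reading the table by eye, something like the sketch below could work. <code>check_replica_spread</code> is a hypothetical helper that expects "POD NODE" pairs on stdin. The usage comment produces them with <code>custom-columns</code>, so the result does not depend on the column layout of <code>-o wide</code>.</p>
<pre><code class="language-bash"># Hypothetical helper: reads "POD NODE" pairs on stdin and warns when
# every replica sits on the same node.
check_replica_spread() {
  nodes="$(awk '{ print $2 }' | sort -u | wc -l | tr -d ' ')"
  if [ "$nodes" -le 1 ]; then
    echo "WARNING: all replicas share one node"
    return 1
  fi
  echo "OK: replicas spread across ${nodes} nodes"
}

# Example usage (requires cluster access):
#   kubectl get pods -n kubelab -l app=backend --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName | check_replica_spread
</code></pre>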
<h2 id="heading-simulation-7-readiness-probe-failure"><strong>Simulation 7: Readiness Probe Failure</strong></h2>
<p>This simulation makes one backend pod fail its readiness probe for 120 seconds without restarting it. Without Kubernetes, you'd have no built-in way to take an instance out of traffic rotation without killing it. This is what happens in production when your app connects to a database on startup but the DB is slow. The pod is alive, but it's not ready. Kubernetes holds it out of rotation until it is.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -w</code> in one tab and <code>kubectl get endpoints -n kubelab backend-service -w</code> in another.</p>
<pre><code class="language-bash"># Pods tab: STATUS Running, RESTARTS 0 — almost nothing changes
# Endpoints tab: one IP disappears — the pod is alive but not receiving traffic
</code></pre>
<p><strong>What happened:</strong> <code>/ready</code> returned 503. The kubelet marked the pod <code>Ready=False</code>. The Endpoints controller removed its IP from the Service. The liveness probe (<code>/health</code>) still returned 200, so no restart. After 120 seconds <code>/ready</code> recovered and the pod rejoined. Run <code>kubectl logs -n kubelab &lt;failing-pod&gt; -f</code> to see the app log 503s for the readiness endpoint while the pod stays Running and receives no traffic.</p>
<p><strong>The production trap:</strong> Readiness probes that check external dependencies (database, cache, downstream API) will remove all pods from rotation when that dependency goes down. Instead of degrading gracefully, your entire app goes offline.</p>
<p><strong>The fix:</strong> Readiness probes should test only what the pod itself controls. Use a separate deep health endpoint for dependency checks and never tie readiness to external service availability.</p>
<h2 id="heading-4-how-to-read-the-signals-in-grafana"><strong>4. How to Read the Signals in Grafana</strong></h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/e6709c25-2d80-489c-b7fb-418ef303b7e2.png" alt="A screenshot showing my grafana dashboards" style="display:block;margin:0 auto" width="1110" height="1201" loading="lazy">

<p><code>kubectl</code> shows current state. Grafana shows what happened over time. That history is essential when you are debugging something that started 4 hours ago.</p>
<h3 id="heading-the-four-panels-that-matter"><strong>The Four Panels that Matter</strong></h3>
<p><strong>Pod Restarts:</strong> A flat line is good. A step up every few hours is a silent OOMKill loop — the most common invisible production failure.</p>
<p><strong>CPU Usage:</strong> A healthy pod's CPU fluctuates. A throttled pod's CPU is unnaturally flat at its limit. That flat ceiling is the signal, not the number.</p>
<p><strong>Memory Usage:</strong> Watch for a line that climbs steadily then disappears. That disappearance is an OOMKill. The line reappearing from zero is the restart.</p>
<p><strong>HTTP Request Rate:</strong> During Cascading Failure you see a spike of 5xx for 5–15 seconds, the exact downtime window, timestamped.</p>
<h2 id="heading-5-how-to-read-the-terminal-signals"><strong>5. How to Read the Terminal Signals</strong></h2>
<p>What you see in the terminal during and after each simulation tells you things Grafana cannot. Five commands matter.</p>
<p>The <code>-w</code> flag on <code>kubectl get pods -n kubelab -w</code> streams changes in real time. The columns that matter are READY, STATUS, and RESTARTS. READY shows containers ready vs total: <code>1/2</code> means only one of the pod's two containers is ready, typically because the other is not passing its readiness probe. STATUS shows the pod's phase or the reason for its latest state change: Running, Pending, Terminating, OOMKilled. RESTARTS is the most important column in production. A number climbing silently over days is a memory leak or a crash loop nobody has noticed yet.</p>
<p><code>kubectl get events -n kubelab --sort-by=.lastTimestamp</code> is the control plane's diary. Every action the cluster took is here: Killing, SuccessfulCreate, Scheduled, Pulled, Started, OOMKilling, BackOff. When something breaks and you do not know why, read the events. The timestamp gap between a Killing event and the next Started event is your actual downtime window — not an estimate, the exact number.</p>
<p><code>kubectl describe pod -n kubelab &lt;pod-name&gt;</code> is the deepest single-pod view. Three sections matter: Conditions (Ready: True/False tells you if the pod is in the Service endpoints), Last State (shows the previous container's exit reason — OOMKilled, exit code 137, or a crash), and Events at the bottom (the scheduler's reasoning for every placement decision). This is the first command to run when a pod is misbehaving.</p>
<p><code>kubectl get endpoints -n kubelab backend-service</code> shows which pod IPs are actually receiving traffic right now. A pod can show Running in <code>kubectl get pods</code> and be completely absent from this list. That is a readiness probe failure. If this list is empty, no request to that Service will succeed regardless of how many pods show Running. Check this whenever users report errors but pods look healthy.</p>
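<p>That comparison between pod IPs and endpoint IPs can be scripted. The sketch below uses a hypothetical helper, <code>ips_out_of_rotation</code>, that takes two whitespace-separated IP lists and prints the pod IPs missing from the endpoints. The jsonpath queries in the usage comment show one way to collect the two lists.</p>
<pre><code class="language-bash"># Hypothetical helper: given pod IPs and endpoint IPs (whitespace-separated),
# print every pod IP that is absent from the endpoints list.
ips_out_of_rotation() {
  pod_ips="$1"
  endpoint_ips="$2"
  for ip in $pod_ips; do
    case " $endpoint_ips " in
      *" $ip "*) ;;            # present: receiving traffic
      *) echo "$ip" ;;         # pod exists but is out of rotation
    esac
  done
}

# Example usage (requires cluster access; jsonpath output is space-separated):
#   pods="$(kubectl get pods -n kubelab -l app=backend -o jsonpath='{.items[*].status.podIP}')"
#   eps="$(kubectl get endpoints -n kubelab backend-service -o jsonpath='{.subsets[*].addresses[*].ip}')"
#   ips_out_of_rotation "$pods" "$eps"
</code></pre>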
<p><code>kubectl logs -n kubelab &lt;pod-name&gt;</code> shows the container's stdout and stderr. Use <code>-f</code> to follow the stream. After a pod restarts, use <code>--previous</code> to see the logs from the container that just exited, essential when you need to know what the app was doing right before an OOMKill or crash. Logs are per container and are gone once the pod is replaced, so grab them before the ReplicaSet creates a new pod with a new name.</p>
<p>A full event sequence during Kill Pod recovery looks like this:</p>
<pre><code class="language-bash">kubectl get events -n kubelab --sort-by=.lastTimestamp | tail -10
</code></pre>
<pre><code class="language-plaintext">REASON            MESSAGE
Killing           Stopping container backend          ← SIGTERM sent
SuccessfulCreate  Created pod backend-xyz789          ← ReplicaSet fired
Scheduled         Successfully assigned to worker-2   ← Scheduler placed it
Pulled            Container image already present     ← no pull delay
Started           Started container backend           ← running
</code></pre>
<p>The gap between the Killing and Started timestamps is your actual recovery time. In a healthy cluster with a cached image it is 3–8 seconds. If it takes longer, check the Scheduled line: the scheduler may have struggled to find a node.</p>
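<p>To put a number on that recovery time, subtract the two event timestamps. The sketch below assumes GNU <code>date</code> (the <code>-d</code> flag behaves differently on macOS/BSD), and the timestamps are made-up sample values, not output from the lab.</p>
<pre><code class="language-bash"># Hypothetical helper: seconds between two RFC 3339 timestamps (GNU date assumed).
seconds_between() {
  start="$(date -u -d "$1" +%s)"
  end="$(date -u -d "$2" +%s)"
  echo $(( end - start ))
}

# Sample values standing in for the Killing and Started event times:
seconds_between "2026-01-15T03:00:02Z" "2026-01-15T03:00:08Z"   # prints 6

# To list event timestamps next to their reasons (requires cluster access):
#   kubectl get events -n kubelab --sort-by=.lastTimestamp \
#     -o custom-columns=TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message
</code></pre>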
<h3 id="heading-two-prometheus-queries-worth-memorizing"><strong>Two Prometheus Queries Worth Memorizing</strong></h3>
<p><strong>First query: silent restart loop.</strong> <code>rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])</code> measures how fast the restart counter for containers in that namespace grew over the last hour, expressed as restarts per second. A healthy workload rarely restarts. Because the value is per second, 3 restarts per hour corresponds to roughly 0.0008, so compare against <code>3/3600</code>, or use <code>increase(...[1h]) &gt; 3</code> to express the threshold as a count. If the value is above that, something is killing the container repeatedly, often an OOMKill or a crash. Alert when it exceeds the threshold so you see the pattern before users report errors.</p>
<p><strong>Second query: invisible CPU throttling.</strong> <code>rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])</code> measures how much time, per second, the Linux scheduler spent throttling containers in that namespace over the last 5 minutes. A result of 0.25 means the container was frozen 25% of the time. High latency with no restarts and "normal" CPU usage in <code>kubectl top</code> often means the CPU limit is too low and the kernel is throttling the process. Alert when this rate exceeds about 0.25 (25% throttled).</p>
<pre><code class="language-plaintext"># Silent restart loop: rate() is per second, so "3 per hour" is a threshold of 3/3600
rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])

# Invisible throttling — alert when this exceeds 25%
rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])
</code></pre>
<p>Run these against your own cluster. Not just KubeLab. These are production queries.</p>
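<p>Both queries return per-second values, which are awkward to compare against thresholds stated per hour or as percentages. The two hypothetical helpers below do the conversion with <code>awk</code>. They are only a sketch for sanity-checking numbers you read off Prometheus.</p>
<pre><code class="language-bash"># Hypothetical helper: convert a per-second rate() result to restarts per hour.
per_second_to_per_hour() {
  awk -v r="$1" 'BEGIN { printf "%.2f\n", r * 3600 }'
}

# Hypothetical helper: convert throttled-seconds-per-second to a percentage.
throttle_percent() {
  awk -v r="$1" 'BEGIN { printf "%.0f%%\n", r * 100 }'
}

per_second_to_per_hour 0.000833   # prints 3.00
throttle_percent 0.25             # prints 25%
</code></pre>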
<h2 id="heading-6-how-to-use-this-for-production-debugging"><strong>6. How to Use This for Production Debugging</strong></h2>
<p>The repo includes <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/diagnose.md">docs/diagnose.md</a>, a symptom-to-simulation map. Find the simulation that reproduces your issue, run it in KubeLab, and understand the mechanics before you touch production.</p>
<p><strong>Exit code 137, pods restarting.</strong> Run the Memory Stress simulation. Confirm with <code>kubectl describe pod &lt;pod-name&gt; | grep -A 5 "Last State:"</code> and look for <code>Reason: OOMKilled</code>. Raise limits or find the leak. The simulation shows both.</p>
<p><strong>High latency, pods look healthy, zero restarts.</strong> Run the CPU Stress simulation. Check <code>container_cpu_cfs_throttled_seconds_total</code> in Prometheus. If it climbs, your CPU limit is too low and the pod is frozen by CFS.</p>
<p><strong>503 on some requests, pods show Running.</strong> Run the Readiness Probe Failure simulation. Check <code>kubectl get endpoints</code> — one pod IP is missing despite Running. The pod gets zero traffic.</p>
<p><strong>Pods stuck Pending after a node went down.</strong> Run the Drain Node simulation. Run <code>kubectl describe pod &lt;pending-pod&gt;</code> and read Events. The scheduler will state why it cannot place the pod, often insufficient capacity or a PVC on the failed node.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>You just broke a real Kubernetes cluster seven ways and watched it fix itself each time. You have seen the ReplicaSet controller fire, read an OOMKill from <code>kubectl describe</code>, watched endpoints go empty during a cascading failure, and understood why a pod can be Running and receiving zero traffic at the same time.</p>
<p>What you practiced here applies to other clusters, staging or production you can read but not safely break. That muscle memory (events, endpoints, restart counter) is what you reach for at 3 am when something is wrong. KubeLab is the safe place to build that reflex.</p>
<p>The repo holds more than this article covered. Explore mode lets you run simulations without the guided flow. The full interview prep doc at <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/interview-prep.md">docs/interview-prep.md</a> has answers to the 13 most common Kubernetes interview questions. The observability guide at <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/observability.md">docs/observability.md</a> covers Prometheus and Grafana setup in detail.</p>
<p>If this helped you, star the repo at <a href="https://github.com/Osomudeya/kubelab">https://github.com/Osomudeya/kubelab</a> and share it with someone who is learning Kubernetes the hard way.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
