Osomudeya Zudonu - freeCodeCamp.org

How Enterprise Teams Manage Infrastructure at Scale with Terraform

Osomudeya Zudonu — Tue, 23 Jun 2026 15:56:59 +0000

Tutorials teach you how to write Terraform, but don't teach you what happens when 60 engineers start writing it together.

When you learn Terraform, you work with a single repository, a single state file, and a single environment. You run terraform apply from your laptop, and your infrastructure is provisioned.

That model works fine until the day you join a company and realize engineers rarely apply to production from a laptop.
A lot of what you see will not match what you practiced.

This article explains how large engineering teams actually run Terraform: the repositories, the workflows, the ownership rules, and what goes wrong without them.

You'll learn how enterprise teams structure repositories and state files, how they store and version reusable modules through GitHub, why infrastructure changes move to production through pipelines, how they catch changes that happen outside of Terraform, and how they recover when things go wrong.

Every practice here exists because a team hit a specific wall and built something to get past it.

Prerequisites

You should be comfortable with Terraform before reading this.
You should also know how Git pull requests and branch merging work.

This is not a Terraform introduction. It is about what happens after you have learned the basics and start sharing infrastructure with other engineers.

How State Corruption Happens
Why State File Gets Treated Like a Production Database
How Enterprise Teams Structure Their Terraform Repositories
How Teams Split State Files to Protect Each Other
Why Some Teams Prefer Directories Over Workspaces for Production
How Teams Share Infrastructure Through Modules on GitHub
How Teams Version and Release Terraform Modules
How Teams Maintain Terraform Modules at Scale
How Teams Share Data Between State Files
How Infrastructure Changes Actually Move to Production
How Teams Detect Infrastructure Drift
How Teams Recover When State Goes Wrong
Conclusion

How State Corruption Happens

The state file is how Terraform tracks what it has built. It remembers every resource, every ID, and every configuration value. When it gets out of sync with what actually exists in the cloud, that's state corruption.

It gets blamed for a lot of things. But engineers who have dealt with it in production know it usually traces back to one of a handful of situations, each with a different cause and a different fix.

Two Engineers Run `terraform apply` at the Same Time

Before understanding this one, you need to understand something about how Terraform works.

When you run terraform apply, two things happen separately:

First, Terraform talks to AWS, and the resource gets created in the cloud. Second, Terraform updates the state file to record what was just built.

These are two different systems. AWS holds the real infrastructure, and the state file is Terraform's notebook about it. If anything interrupts the process between step one and step two, they fall out of sync.

Now here's what happens when two engineers apply at the same time without locking:

Sarah opens the state file and starts adding a subnet. Marcus opens the same state file at the same moment and starts updating a NAT gateway. Both are working from the same starting copy.

Sarah finishes first. Her apply creates the subnet in AWS and updates the state file to record it.

Marcus finishes second. His apply updates the NAT gateway in AWS. Terraform then updates the state file using the version of state Marcus read when his apply started.

That version didn't include Sarah's subnet, so the updated state no longer contains a record of it.

The subnet exists in AWS. But Terraform's notebook no longer has a record of it. The next terraform plan thinks the subnet was never created and proposes building it again.

State locking prevents this. Sarah's apply acquires a lock before it starts. When Marcus tries to apply, Terraform makes him wait.

After Sarah finishes, Terraform updates the state file and releases the lock. Marcus then runs against the updated state, so both the subnet and NAT gateway changes are recorded correctly.
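As a minimal sketch of how locking gets switched on, assuming the S3 backend and the mycompany-terraform-state bucket used later in this article: the key, the region, and the choice of native S3 locking are assumptions for illustration, not a description of any particular team's setup.

```hcl
terraform {
  backend "s3" {
    bucket  = "mycompany-terraform-state"
    key     = "production/networking/terraform.tfstate" # assumed path
    region  = "us-east-1"                               # assumed region
    encrypt = true

    # Native S3 locking, available in recent Terraform releases (1.10+).
    # Older versions lock through a DynamoDB table via dynamodb_table.
    use_lockfile = true
  }
}
```

With a locking backend configured, the second apply in the Sarah and Marcus scenario waits or fails fast instead of overwriting the first one's state.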

An Apply Gets Interrupted

A GitHub Actions pipeline is applying changes to the payments infrastructure, adding three new security group rules and a database parameter group. Halfway through, the pipeline runner hits its 60-minute timeout limit, and the job gets killed.

Here's what the apply actually managed to do before dying:

The terminal image above shows three security group rules completing successfully before the pipeline hits its 60-minute runtime limit. The runner is then terminated. The database parameter group never finishes creating, and the state file update never runs because the job died first.

Security group rule 1  → created ✓
Security group rule 2  → created ✓
Security group rule 3  → created ✓
Database parameter     → not created ✗
State file update      → never wrote (job died first)

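The three security group rules now exist in AWS. The problem is that the pipeline died before Terraform could finish updating the state file. AWS knows the rules exist. Terraform's state file may not.

At this point, reality and the state file no longer match.

Fortunately, this is usually easy to recover from. Terraform records each resource in state as it finishes and tries to save that state when it is stopped gracefully, so after most interrupted applies the completed resources are already recorded. When the pipeline runs again, Terraform refreshes the resources it has in state, sees the three security group rules, and doesn't try to create them again. It then creates the database parameter group that never got built.

The second run completes successfully and the state file catches up.

This works because Terraform is idempotent: running the same configuration again moves infrastructure toward the desired state rather than blindly creating everything from scratch.

The harder case is a runner killed so abruptly that the completed resources never reached the stored state. Terraform only knows about what is in state, so the next plan proposes creating the three rules again, and the apply usually fails on a duplicate. Those resources have to be imported back into state, which the recovery section at the end of this article covers.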

One small complication remains: the state lock.

If the pipeline was interrupted while holding a lock, Terraform may still think another apply is running. The next pipeline run fails immediately with an error like this:

The terminal above shows terraform apply failing because the previous job left a state lock behind. The error includes the lock ID, the path to the state file, and the name of the process that acquired it. Terraform refuses to proceed until the lock is released or manually cleared.

Before clearing the lock, make sure no Terraform apply is still running.

Open your CI/CD system (GitHub Actions, GitLab CI, Jenkins, or whatever your team uses) and check the pipeline history for that environment:

The GitHub Actions pipeline history image above shows four recent runs. terraform-plan completed successfully. Two terraform-apply jobs show as cancelled and timed out, both flagged as lock may be stale. A fourth terraform-apply job is currently in progress, and this one shouldn't be unlocked until it finishes.

If the previous apply was cancelled or timed out, the lock is stale. Clear it with terraform force-unlock plus the lock ID from the error. The pipeline then runs normally.

Only force-unlock when you're certain nothing is actively running. Clearing a live lock lets two applies write to the same state at the same time, which is exactly the problem locking was built to prevent.

Someone Runs a Terraform State Command in the Wrong Environment

A database engineer is cleaning up an old test database in the staging environment.

The database still exists in AWS, but Terraform should stop managing it. To do that, the engineer uses terraform state rm.

This command doesn't delete anything in AWS. It only removes Terraform's record of the resource from the state file. Think of it as telling Terraform: "forget this resource exists, but leave it running."

The engineer intends to run it against staging:

Intended:  staging state       → forget the old test database

But they're working in the wrong directory. They run it against production instead.

Actual:    production state    → forget the live payments database

Nothing gets deleted. The production database is still running in AWS. But Terraform has now forgotten it exists.

Now Terraform and reality disagree. The next terraform plan sees a database defined in the code but missing from the state file, so it assumes the database doesn't exist and proposes creating a new one.

If nobody catches it in the plan output, Terraform creates a second production database alongside the original: two databases running in production, neither fully managed, and a very expensive mess to untangle.

terraform state rm, terraform import, and terraform state mv make immediate changes to the state file with no confirmation prompt. Run them from the wrong directory, the wrong workspace, or with the wrong resource address and you change the wrong state in seconds.
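One way some teams reduce this risk is to express the same intent in configuration, so it goes through a plan and a pull request instead of an immediate CLI command. The sketch below uses the removed block available in Terraform 1.7 and later. The resource address aws_db_instance.old_test is hypothetical.

```hcl
# Replaces a manual `terraform state rm aws_db_instance.old_test`.
# The matching resource block must be deleted from the configuration
# in the same change.
removed {
  from = aws_db_instance.old_test # hypothetical resource address

  lifecycle {
    # Forget the resource in state, but leave the database running in AWS.
    destroy = false
  }
}
```

Because this lives in the staging or production directory alongside the rest of the code, the plan output shows which state is about to forget which resource before anything changes.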

Two Teams Manage the Same Resource

The networking team owns a security group that controls access to the payments database. When a new microservice needs database access, a payments engineer has two options: ask the networking team to add a new rule, or manage the security group themselves.

They choose the second option. The engineer imports the existing security group into the payments state file and adds a rule for Microservice C.
From that moment, both teams think they own the same security group.

The problem is that Terraform does exactly what each state file tells it to do. The networking state says the security group should allow A and B. The payments state says it should allow A, B, and Microservice C. Both can't be true at the same time.

When the payments team applies their state, Microservice C gets access. But later that night, the networking pipeline runs. Terraform reads the networking state, sees only A and B, and updates the security group to match. Microservice C's rule disappears silently.

No errors are seen and both pipelines pass, which is exactly what makes this so hard to debug. Terraform isn't broken. It's receiving conflicting instructions from two different state files and doing exactly what each one says.

This isn't something to be fixed with Terraform commands. It's an ownership decision that should have been made before anyone ran an import. If the payments team had submitted a pull request to the networking repository asking them to add the rule, one team would own the security group, one state file would manage it, and the conflict could never have happened.

Why State File Gets Treated Like a Production Database

The state file looks like bookkeeping: a record of what Terraform created. The reason teams treat it differently is that it often contains secrets.

The state file stores sensitive values in plaintext. Database passwords, API keys, connection strings – if those values were passed to a Terraform resource during an apply, they're now sitting in the state file. Even if you marked the variable as sensitive in your Terraform code, the value still lands in the state file. Terraform needs it there to compute diffs on future plans.

That means: whoever can read the state file can read your database password.

In large organizations, engineers typically don't have direct access to the production state bucket. Instead, Terraform runs through a CI/CD pipeline that assumes a dedicated IAM role with permission to read and write the state bucket and perform applies. Engineers interact with infrastructure through pull requests and plan output, not by touching the state bucket directly.

This separation reduces risk and creates an audit trail. Every state change is performed by the pipeline and logged, making it straightforward to trace what changed and when.
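A minimal sketch of what "treated like a production database" can look like for the bucket itself, assuming state lives in the mycompany-terraform-state S3 bucket used later in this article. The resource names and the choice of KMS encryption are assumptions, and the IAM policy that limits access to the pipeline role is left out.

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "mycompany-terraform-state"
}

# Keep every previous version of each state file so it can be restored.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest, since it can contain plaintext secrets.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State should never be publicly readable.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```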

How Enterprise Teams Structure Their Terraform Repositories

When you join a large engineering organization, the first thing you notice is the number of repositories. You might expect one repository for all infrastructure, but what you find is dozens.

The structure maps directly to ownership. Each repository belongs to one team, and that team is responsible for everything in it. A typical layout looks like this:

The diagram shows two types of repositories. The first type belongs to the platform team and contains reusable modules: things like VPC configurations, database templates, and security group patterns. These repositories don't create production resources directly.

The second type belongs to individual product teams, such as the payments team or the auth team. These repositories call the platform modules and use them to build their actual infrastructure. A mistake in a product team repository affects only that team. A mistake in a shared platform module can affect every team that depends on it.

That distinction matters because some repositories are used by one team, while others are shared by everyone.

The diagram illustrates why shared repositories carry more risk than product-specific ones. A bug in the payments-infra repository affects only the payments team. A bug in the terraform-aws-postgres module affects every team that uses it to provision databases. A bug in the terraform-policies repository affects every pipeline in the company. The wider the module is shared, the larger the blast radius when something goes wrong.

This is why experienced engineers pay close attention to shared modules and policy repositories.

If the payments team's infrastructure breaks, the problem is probably in the payments repository.

If five different teams start seeing the same issue at the same time, the shared modules and policy repositories become the first place to investigate.

How Teams Split State Files to Protect Each Other

A single state file managing everything (VPC, Kubernetes cluster, databases, monitoring) is fine when one person is running things, but it quickly becomes a problem when multiple teams share it.

Three specific problems emerge.

Blast radius: If the networking configuration and the database configuration live in the same state file, a bad networking apply can accidentally affect database resources that had nothing to do with the change. Separate state files keep failures contained.
Deployment speed: Networking infrastructure might change a few times a year. Applications might deploy dozens of times a day. If they share a state file, teams end up waiting on each other's locks.
Ownership conflicts: When multiple teams share a state file, one team can change something the other team depends on without realizing it.

The solution is to split state along ownership boundaries. A structure that addresses all three problems looks like this:

production/
  networking/terraform.tfstate   → VPC, subnets, routing, NAT gateways
  identity/terraform.tfstate     → IAM roles, policies, service accounts
  platform/terraform.tfstate     → Kubernetes cluster, node pools, add-ons
  database/terraform.tfstate     → RDS instances, Redis clusters, backups
  security/terraform.tfstate     → Security groups, WAF rules, certificates
  monitoring/terraform.tfstate   → Prometheus, Grafana, alerting pipelines
  payments/terraform.tfstate     → Payment service infrastructure

This is one example, not a universal standard. Larger organizations often split further. The principle is the same: one owning team per state file, one pipeline, one blast radius.

The rule is simple: every resource belongs to one state file. If the networking team owns a security group, it stays in the networking state. Other teams can reference it as a data source, but they don't import it into their own state.
That is what prevents the ownership collision described in the first section.
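As a rough sketch of "reference it as a data source", a consuming team might look the security group up by tag and use only its ID. The tag value payments-db and the output name are hypothetical.

```hcl
# Read-only lookup: the security group stays in the networking state.
data "aws_security_group" "payments_db" {
  tags = {
    Name = "payments-db" # hypothetical tag set by the networking team
  }
}

# The consuming team can pass the ID to its own resources,
# but cannot change the group's rules from here.
output "payments_db_security_group_id" {
  value = data.aws_security_group.payments_db.id
}
```

A data source never writes to the resource, so there is still exactly one state file that decides what the security group allows.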

Why Some Teams Prefer Directories Over Workspaces for Production

Terraform CLI workspaces let you manage multiple environments like dev, staging, and production from a single directory. Each workspace gets its own state file, but they all share the same .tf configuration files.

infra/
  main.tf          ← same code runs for ALL environments
  variables.tf

  terraform.tfstate.d/
    dev/
    staging/
    production/    ← separate state, same code

You switch environments with terraform workspace select production, then apply.

The risk is that switching workspaces is a manual step. If the wrong workspace is active, changes meant for staging can end up in production.

Many teams prefer separate directories for long-lived environments:

environments/
  dev/
    main.tf      ← its own code path
    backend.tf   ← points to the dev state bucket
  staging/
    main.tf      ← its own code path
    backend.tf   ← points to the staging state bucket
  production/
    main.tf      ← its own code path
    backend.tf   ← points to the production state bucket

To apply against production, you have to be in the production directory. Each environment has its own state, backend, and execution path.

The tradeoff is duplication. Teams usually solve that with shared modules, so each environment directory contains only environment-specific configuration.
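A minimal sketch of what one environment directory might contain once the shared logic lives in a module. The bucket name, region, module path, and input names are all assumptions for illustration.

```hcl
# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state-production" # hypothetical bucket
    key    = "payments/terraform.tfstate"
    region = "us-east-1"                            # assumed region
  }
}

# environments/production/main.tf
module "payments" {
  source = "../../modules/payments" # hypothetical shared module

  # Only the values that differ per environment live here.
  environment    = "production"
  instance_class = "db.m5.large"
}
```

The dev and staging directories would look the same apart from the backend and the input values, which keeps the duplication down to a few lines per environment.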

Workspaces are still useful for short-lived environments such as feature branches, preview deployments, and temporary test infrastructure.

How Teams Share Infrastructure Through Modules on GitHub

When 30 teams each need a PostgreSQL database, two things happen.

Without a shared standard, every team writes their own database configuration. Six months later, a security audit runs across all environments and finds that:

The diagram shows what a security audit found when four teams each wrote their own database configuration independently.

Team A set backup_retention_period = 0, meaning their database was never backed up. Team B set storage_encrypted = false, leaving data in plaintext. Team C passed an empty tags = {}, so there was no cost tracking. Team D set deletion_protection = false, leaving the database one accident away from permanent data loss.

Nobody skipped those things on purpose. There was just no shared standard.

With a shared module, the platform team writes a postgres module once. They encode every organizational requirement into it: encryption on, 7-day backups, monitoring alarms, required tags, deletion protection enabled. They publish it to a GitHub repository called terraform-aws-postgres.
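A simplified sketch of what the inside of such a module could look like. The variable names match the inputs teams pass in, but the storage size, the username, and the use of an RDS-managed password are assumptions, and the monitoring alarms are omitted for brevity.

```hcl
variable "name" { type = string }
variable "environment" { type = string }
variable "instance_class" { type = string }

resource "aws_db_instance" "this" {
  identifier        = "${var.name}-${var.environment}"
  engine            = "postgres"
  instance_class    = var.instance_class
  allocated_storage = 100 # assumed default size in GiB

  username                    = "app" # assumed
  manage_master_user_password = true  # RDS stores the password in Secrets Manager

  # Organizational requirements are fixed here, not left to each caller.
  storage_encrypted       = true
  backup_retention_period = 7
  deletion_protection     = true

  tags = {
    Name        = var.name
    Environment = var.environment
  }
}

output "db_endpoint" {
  value = aws_db_instance.this.endpoint
}
```

Callers cannot turn encryption or backups off because those settings are not exposed as variables.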

Every team that needs a database now writes this:

module "payments_db" {
  source         = "git::ssh://git@github.company.com/platform/terraform-aws-postgres.git?ref=v2.1.0"
  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}

Four arguments. Everything else is handled by the module.

Large organizations usually expose approved modules through an internal registry so engineers can discover and version them without browsing GitHub repositories. Instead of the full Git URL, the reference becomes:

module "payments_db" {
  source  = "app.terraform.io/mycompany/postgres/aws"
  version = "~> 2.1"
}

HCP Terraform and Terraform Enterprise both include a private registry that connects to GitHub, watches for version tags on module repositories, and publishes new versions automatically.

How Teams Version and Release Terraform Modules

The ?ref=v2.1.0 in a module source URL isn't decoration. At the scale of 40 teams sharing one module, it's the thing that prevents a well-intentioned change from becoming a company-wide incident.

Without version pinning, the payments team references the Postgres module from main, meaning whatever the latest code is at any given moment. The module owners rename an output from db_endpoint to database_endpoint to match a new naming convention. The next time any team's pipeline runs terraform init in a fresh checkout, it pulls that change. Their configuration still references db_endpoint.

Plans break:

payments-infra                        → plan fails
analytics-infra                       → plan fails
auth-infra                            → plan fails
reporting-infra                       → plan fails

Version pinning prevents this. The payments team stays on v2.1.0. Because renaming an output breaks every caller, the module owners release it as a new major version, v3.0.0, and write a changelog. Teams upgrade when they're ready, after testing in staging. Nobody's pipeline breaks without warning.

The versioning convention is called semantic versioning:

v2.1.1  → patch:  bug fix. Safe to upgrade. Nothing to change in your code.
v2.2.0  → minor:  new optional feature. Safe to upgrade. Nothing to change.
v3.0.0  → major:  breaking change. Read the changelog. Update your code first.
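With a registry source, those rules can be expressed as version constraints. The sketch below reuses the registry address and inputs from the earlier examples. The module labels are hypothetical, and a real configuration would pick one of the two styles.

```hcl
# Accept patch and minor releases automatically, never a major one.
module "payments_db" {
  source  = "app.terraform.io/mycompany/postgres/aws"
  version = "~> 2.1" # any 2.x release from 2.1 upward, but not 3.0.0

  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}

# Stricter: accept only patch releases of 2.1.
module "reporting_db" {
  source  = "app.terraform.io/mycompany/postgres/aws"
  version = "~> 2.1.0" # 2.1.0, 2.1.1, ... but not 2.2.0

  name           = "reporting"
  environment    = "production"
  instance_class = "db.m5.large"
}
```

The version argument only applies to registry sources. Modules referenced by Git URL are pinned through the ?ref= tag instead.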

How Teams Maintain Terraform Modules at Scale

Building a Terraform module takes an afternoon, but maintaining it for two years is a different job entirely.

A networking engineer needs a VPC module. The platform team has one, but their backlog is full. So the engineer creates a slightly different version. Three months later, another team does the same. Then another. Now this exists:

terraform-aws-vpc           ← original, maintained by platform team
terraform-aws-vpc-v2        ← created by the app team, author unknown
terraform-aws-vpc-shared    ← no idea which environments use this
terraform-aws-vpc-prod      ← unclear if this was ever different from the original

No one created a module graveyard on purpose. It grew one "I'll just make a quick variation" at a time. Each variant has slightly different security settings, different tagging, different defaults. When a compliance audit requires all VPCs to enable flow logging, the team has to investigate four different modules to figure out which environments are compliant.

Teams that avoid this treat their modules like shared services: named owner, contributions through pull requests, breaking changes in major versions with a migration guide, and deprecated modules with a retirement date. A CODEOWNERS file routes every pull request to the right reviewer automatically.

Organizations that skip this end up with modules that nobody owns, nobody wants to touch, and nobody is sure can be safely removed.

How Teams Share Data Between State Files

Once infrastructure is split into separate state files, a practical problem surfaces: teams need information from each other's infrastructure. The platform team's Kubernetes state needs the VPC ID from the networking team's state. The database state needs subnet IDs. The payments state needs the database endpoint.

Two patterns exist for solving this.

Reading Another Team's State Outputs

The terraform_remote_state data source lets one state read the outputs of another. The networking team marks their VPC ID and subnet IDs as outputs. The database team reads those outputs and uses them to place databases in the right subnets.

Networking state
  └── outputs: vpc_id, private_subnet_ids
                          ↓
               Database state reads them
               └── places RDS in the right subnets

This works, but there's a limitation. Reading another team's state requires full read access to their entire state file, not just the outputs you want. State files contain database passwords and API keys in plaintext. More dependencies means more teams reading each other's secrets.
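A sketch of both sides of that flow, assuming an S3 backend. The bucket, key, and region are assumptions, and aws_vpc.main and aws_subnet.private stand in for resources defined elsewhere in the networking configuration.

```hcl
# --- In the networking configuration ---
output "vpc_id" {
  value = aws_vpc.main.id # defined elsewhere in the networking code
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id # defined elsewhere in the networking code
}

# --- In the database configuration ---
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1" # assumed region
  }
}

# Place the database subnet group in the networking team's private subnets.
resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Only root-level outputs are exposed this way, but the credentials doing the read still need access to the whole networking state object.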

Looking Up Resources Directly From the Cloud

The alternative, and the one HashiCorp now recommends, is to look up resources through the cloud provider's API instead of reading another team's state:

data "aws_vpc" "main" {
  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

No cross-team state access needed, and each team's state stays isolated. The tradeoff is consistent tagging: the networking team has to tag their VPC in a way the database team can reliably search for, which forces teams to agree on naming conventions early.
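Extending that lookup, a database team could find the subnets the same way. The Tier tag below is a hypothetical convention the two teams would need to agree on.

```hcl
data "aws_vpc" "main" {
  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

# Find the private subnets inside that VPC by tag.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private" # hypothetical tagging convention
  }
}

resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.aws_subnets.private.ids
}
```

If the tags are missing or inconsistent, the lookup fails at plan time, which is usually a better failure mode than silently reading stale values.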

Many teams use both. Remote state for a small number of trusted, tightly coupled dependencies. Cloud data sources for everything broader.

How Infrastructure Changes Actually Move to Production

In large organizations managing production Terraform at scale, changes don't come from someone's laptop. Applying directly from a local machine requires production cloud credentials sitting on that machine, which is a security risk, and it leaves no audit trail if something breaks.

Instead, production changes move through a pipeline. Every change goes through a pull request in GitHub, and the pipeline does the work:

Engineer opens a pull request
        ↓
Pipeline: terraform validate + fmt check
        ↓
Pipeline: security scan (Checkov, tfsec, or similar)
        ↓
Pipeline: terraform plan → posts the full output as a comment on the PR
        ↓
Reviewer reads the plan output (not just the code)
        ↓
Required reviewers approve (enforced by CODEOWNERS + branch protection)
        ↓
Merge triggers the apply pipeline
        ↓
Pipeline: acquires state lock → applies → releases lock → logs result

The part that surprises engineers when they first encounter this is that the reviewer isn't approving the code. They're approving the plan output: the list of exactly what will be created, changed, or destroyed in the cloud.

A code change can look completely harmless and produce a destructive plan. Changing one database parameter might force a resource replacement, meaning Terraform destroys the current database and creates a new one. This is what that looks like in the plan output before the PR merges:

# aws_db_instance.payments must be replaced
-/+ resource "aws_db_instance" "payments" {

In this plan output, "must be replaced" and the -/+ marker mean Terraform will destroy the existing database and create a new one, not update it in place.

Catching that before merge is the entire point of reviewing the plan. Not the code.
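Plan review is the main safeguard, but some teams add a second one in the configuration itself. The sketch below uses Terraform's prevent_destroy lifecycle setting on a database resource. The argument values are assumptions for illustration.

```hcl
resource "aws_db_instance" "payments" {
  identifier        = "payments-production" # assumed identifier
  engine            = "postgres"
  instance_class    = "db.m5.large"
  allocated_storage = 100                   # assumed size in GiB

  username                    = "app" # assumed
  manage_master_user_password = true

  lifecycle {
    # Any plan that would destroy this resource, including a replacement,
    # fails with an error instead of being offered for approval.
    prevent_destroy = true
  }
}
```

This does not replace reading the plan. It only turns one class of destructive change into a hard failure that someone has to remove deliberately.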

How CODEOWNERS Enforces Who Reviews What

Earlier, we talked about module ownership. A VPC module might belong to the platform team, while database infrastructure belongs to the database team.

The challenge is making sure changes are actually reviewed by the people who own them.

GitHub solves this with a feature called CODEOWNERS. It lets a repository define which team is responsible for which directories. When someone opens a pull request that touches those files, GitHub automatically requests reviews from the correct team.

For example, if an engineer modifies the PostgreSQL module, GitHub can automatically require approval from the platform team before the change can be merged.

Without CODEOWNERS, engineers have to remember who owns which parts of the infrastructure.

CODEOWNERS makes ownership explicit and automatically requests reviews from the right team.

How Teams Detect Infrastructure Drift

Drift is the diff between what Terraform says should exist and what actually exists in the cloud.

Here's the scenario that produces drift more reliably than anything else:

Monday 3:00 AM  Production database CPU spikes. Outage.
Monday 3:15 AM  Engineer resizes database in AWS console: db.m5.large → db.m5.4xlarge
Monday 3:20 AM  Incident resolved. Engineer goes to sleep.
Monday 3:21 AM  Terraform state file: still says db.m5.large

The incident is forgotten, the ticket is closed, and life moves on.

Three months later, a routine Terraform apply runs. Terraform sees db.m5.large in the configuration but finds db.m5.4xlarge running in AWS. From Terraform's perspective, the database is larger than it should be, so the plan proposes changing it back.

Nobody notices the change in the plan output. The apply goes through, the database is downsized, and users begin reporting slow queries. The team spends hours investigating before eventually tracing the issue back to a Terraform change that reverted the emergency fix from months earlier.

Teams that handle this well run scheduled terraform plan -detailed-exitcode jobs against every production state. With that flag, an exit code of 2 means differences were found, and an alert fires. The team then decides whether to apply to restore declared state or update the configuration to match reality. Either way, the change is visible and deliberate. Invisible drift always gets worse.

How Teams Recover When State Goes Wrong

State is recoverable in almost every situation, as long as the team set things up correctly before the incident happened.

The teams that recover in twenty minutes instead of three days aren't the ones with the deepest Terraform expertise. They're the ones who prepared.

Step 1: Pull a Backup Before Touching Anything.

terraform state pull > backup-$(date +%Y%m%d-%H%M%S).json

This saves the current state to a local file. Whatever you try next, you have a starting point to return to.

Step 2: Run `terraform plan` and Look at What it Proposes.

If Terraform proposes creating resources that already exist in the cloud, the state is missing records it should have. If it proposes destroying resources you still need, the state contains records the configuration no longer declares. Either way, the plan output tells you which direction the mismatch runs.

Step 3: Restore from S3 Versioning if the State is Corrupted.

Every write to a versioned S3 bucket saves a new version automatically. If the state file is corrupted or wrong, list the previous versions, download the last known good one, and push it back:

# List previous versions
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix production/database/terraform.tfstate

# Download a specific version
aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key production/database/terraform.tfstate \
  --version-id "the-version-id-here" \
  recovered-state.json

# Push it back
terraform state push recovered-state.json

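Run terraform plan after restoring to confirm it looks correct before running any apply. Note that terraform state push compares serial numbers and refuses to overwrite a newer state with an older one, so restoring an earlier version may require raising the serial in the recovered file or, once you are certain, passing -force.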

Step 4: Clear a Stale Lock if the Pipeline is Blocked.

If a lock was never released after a failed apply, clear it:

terraform force-unlock LOCK_ID

Only do this after confirming no apply is actively running. Clearing a live lock can corrupt the state.

Step 5: Re-import Resources That Fell Out of State.

If a resource exists in the cloud but Terraform no longer knows about it — because of an accidental terraform state rm — bring it back without recreating it:

terraform import aws_db_instance.payments db-ABCD1234EFGH5678

Run terraform plan after importing to confirm no unexpected changes are proposed.

Conclusion

Every practice in this article traces back to a specific problem teams ran into as Terraform usage grew.

State locking prevents engineers from overwriting each other's changes.
State splitting reduces blast radius. Module versioning prevents shared infrastructure from breaking unexpectedly. Drift detection catches changes made outside Terraform. CODEOWNERS ensures the right people review the right changes.

These are different problems with different solutions, but they all point to the same underlying theme: ownership.

As teams grow, many Terraform problems have less to do with infrastructure and more to do with ownership.

State collisions happen when multiple people can modify the same state.
Module sprawl happens when nobody is responsible for maintaining a shared standard.

Drift becomes dangerous when changes are made without anyone taking ownership of bringing Terraform and reality back into alignment. Even review bottlenecks often trace back to uncertainty about who should approve what.

Understanding this changes how you read an unfamiliar Terraform repository.

Dozens of small state files aren't necessarily over-engineering. They're often ownership boundaries. A CODEOWNERS file is not bureaucracy. It's an ownership map. A pipeline that posts plan output on a pull request isn't just automation. It's a review process built around infrastructure consequences rather than code.

The infrastructure matters. But as teams grow, ownership is what keeps the system understandable.


How to Use Bash & Python for Real DevOps Automation – Full Handbook with 5 Production Use Cases

Osomudeya Zudonu — Wed, 27 May 2026 15:51:44 +0000

Automation scripts often validate process completion instead of system health.

A Kubernetes pod can be running while the application inside it can't authenticate to the database. A Terraform deployment can return clean while someone has manually changed infrastructure in the cloud console. A canary rollout can show zero errors while users wait five seconds for every request.

The problem isn't the tooling. The problem is that the system can look healthy when it really is not.

This handbook walks through five production-style automation scenarios using Bash and Python for:

Detecting abnormal AWS spend before the monthly invoice arrives
Correlating logs across multiple services using trace IDs
Finding infrastructure drift outside Terraform
Validating secret rotation at the application level
Automatically rolling back slow deployments before users complain

By the end of this handbook, you'll be able to build small scripts that help you notice when something is wrong in a system, even when the tools say everything is fine.

The scripts are intentionally small. The important part is the operational thinking behind them: what signal the script measures, what failure mode it can detect, and what assumptions the platform is making underneath.

Each use case includes a runnable demo environment, the complete script, a breakdown of the system behaviour involved, and an intentional failure you can trigger yourself.

If you're new to this workflow, start with use case 1 and work forward. The later sections build on the same pattern: automation is useful when it verifies reality, not just process completion.

Prerequisites

Before you start, set up the following:

Python 3.8 or higher – check with python3 --version
A Python virtual environment – create one before installing anything:

python3 -m venv venv
source venv/bin/activate

# On Windows, activate with this command instead:
venv\Scripts\activate

This keeps your installed packages isolated from your system Python and prevents permission errors on shared machines.

pip – Python's package installer, included with Python
AWS CLI configured with a working profile – a free-tier AWS account is enough for use cases 1, 3, and 4. Verify it's working with:
```
aws sts get-caller-identity
```
Docker and Docker Compose – needed for use cases 2, 4, and 5
Kind (Kubernetes in Docker) – a way to run Kubernetes locally for use cases 4 and 5. Install with brew install kind on macOS, or follow the Kind quick start guide
kubectl – the command-line tool for talking to a Kubernetes cluster. Once kubectl and Kind are both installed, run kind create cluster and Kind points kubectl at the new cluster automatically
Helm – a package manager for Kubernetes, needed for use case 5. Install with brew install helm or the Helm install guide
Terraform – needed for use case 3. Install with brew install terraform on macOS or follow the Terraform install guide. Check with terraform version.
bc – a calculator utility used by the canary watch scripts for floating-point comparison. Install with brew install bc on macOS or apt install bc on Ubuntu. Run bc --version to confirm it is available before starting use case 5.

Knowledge and Skills

You should be comfortable reading Python and Bash scripts without needing to write them from scratch.
You should have basic Linux terminal comfort – navigating directories, running scripts, reading output, and so on.
You should know what Kubernetes pods and deployments are at a basic level – you don't need deep Kubernetes expertise, as use cases 4 and 5 will introduce the Kubernetes concepts they rely on as they go.
Familiarity with AWS basics, such as what EC2, IAM, and Secrets Manager are, will help with use cases 1, 3, and 4. Use case 2 runs entirely on your local machine and requires no AWS knowledge at all.
For use case 3, knowing what Terraform is and what a state file does will help. You don't need to write any Terraform, but understanding that Terraform tracks what it created is the foundation of the whole use case.

AWS IAM Permissions Required

The scripts in this article make real AWS API calls. Your IAM user or role needs the following minimum permissions. If you see an AccessDenied error, this is the first place to look:

Use Case	Required IAM Permission
1 - Cost Anomaly Detection	`ce:GetCostAndUsage`
3 - Drift Detection	`ec2:DescribeSecurityGroups`
4 - Secrets Rotation	`secretsmanager:GetSecretValue`, `secretsmanager:PutSecretValue`

If you're using a fresh AWS free-tier account with AdministratorAccess attached, these permissions are already included and you can skip this step.

If you're on a restricted IAM user, here's how to add them. In the AWS Console, go to IAM, click Users, then click your username. Under the Permissions tab, click Add permissions, then Create inline policy.

Switch to the JSON tab and paste a policy document granting the permissions in the table above, then save it.
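If you'd rather generate that policy document than type it by hand, here is a minimal sketch that builds it from the table above and prints JSON you can paste into the editor. The `Sid` value and the use of `"Resource": "*"` are assumptions made for the demo. In a shared account you would likely narrow the Secrets Manager actions to specific secret ARNs.

```python
import json

# Actions copied from the permissions table, keyed by use case.
REQUIRED_ACTIONS = {
    "1 - Cost Anomaly Detection": ["ce:GetCostAndUsage"],
    "3 - Drift Detection": ["ec2:DescribeSecurityGroups"],
    "4 - Secrets Rotation": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue",
    ],
}


def build_policy(required_actions):
    """Build a single-statement IAM policy allowing every listed action."""
    actions = sorted({a for group in required_actions.values() for a in group})
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DevopsScriptingLabs",  # assumed name, any alphanumeric Sid works
                "Effect": "Allow",
                "Action": actions,
                "Resource": "*",  # assumption: broad scope is acceptable for a lab account
            }
        ],
    }


policy = build_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```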

If your company manages AWS through an organization and you don't have permission to edit your own IAM policies, ask your administrator to add these permissions to your role.

Companion GitHub Repository

All demo projects live at: https://github.com/irvingtalks/devops-scripting-labs

Each use case has its own numbered folder with the complete script, supporting files, a setup.sh to prepare the environment, and a break_it.sh that injects the specific failure each use case is built around.

Clone the repo before starting:

git clone https://github.com/irvingtalks/devops-scripting-labs
cd devops-scripting-labs

Before running any use case, check that you have everything installed:

./preflight.sh

This checks for every tool the lab needs (Python, AWS CLI, Docker, Kind, Helm, Terraform, and bc) and tells you exactly what's missing, with the install command for each one.

Use Case 1 - Cost Anomaly Detection
Use Case 2 - Log Correlation Across Services
Use Case 3 - Infrastructure Drift Detection
Use Case 4 - Secrets Rotation with Zero Downtime
Use Case 5 - Automated Canary Rollback Trigger
What You Can Do Now

Use Case 1 - Cost Anomaly Detection

Environment: AWS Cost Explorer API (read-only, available in all accounts)
Language: Python

The Production Problem

A junior engineer is testing a Kubernetes configuration. They spin up a managed node group in AWS (a set of EC2 virtual machines that the Kubernetes cluster uses to run workloads) and configure the cluster autoscaler, which is the Kubernetes component responsible for adding more machines when the cluster needs more capacity. The test goes well, and on Friday afternoon, they forget to tear the environment down.

Over the weekend, the autoscaler keeps provisioning new nodes because the test workloads are still running and requesting resources. By Monday morning you have a node group that has been quietly growing for two and a half days, and nobody noticed until the invoice landed three weeks later.

The script in this use case exists because your AWS bill isn't just a monthly number. It's a time series, and you can monitor it the same way you monitor application metrics. Check it daily, know your baseline, and you catch this kind of event in hours instead of weeks.

What's Actually Happening at the System Level

What this is not: This isn't a finance dashboard. It's an operational anomaly detector and the signal it monitors is cost. But the thing it's actually detecting is unexpected infrastructure behavior such as resources left running, autoscaler events, and forgotten environments.

AWS Cost Explorer is a service that stores your billing data and exposes it through an API. When you call it, you're running a query against your account's billing records: you specify the time range, the granularity, and how you want the results grouped.

One thing to know before you start investigating any flagged cost is that AWS decides which service category to put a charge under, not you. An EBS snapshot copy running across regions might appear under the EC2 line item rather than data transfer, which means a spike in EC2 spend doesn't necessarily mean something went wrong with your EC2 instances. The script flags the spike correctly, but investigating it means asking "what changed in my infrastructure on this date" rather than "what is running in EC2 right now."

The billing label is a starting point, not a diagnosis.

Set Up the Demo Environment

Navigate to 01-cost-anomaly/ in the companion repo. No cluster setup is needed for this use case because the script runs against your AWS account directly, and the only dependency is boto3:

cd 01-cost-anomaly
pip install boto3

Before running against your real account, make sure your AWS credentials are configured. The script uses whatever credentials the AWS CLI is set up with. If you haven't done this yet:

aws configure

This will ask for your AWS Access Key ID, Secret Access Key, default region (use us-east-1 if unsure), and output format (type json). You can find your access keys in the AWS Console under IAM → Users → your username → Security credentials → Create access key.

Your account also needs the ce:GetCostAndUsage permission. If you're on a fresh account with AdministratorAccess, it's already included.

If you have an AWS account with a few weeks of billing history, you can run the script directly against your real data:

python detect_cost_anomaly.py

Two things to know before running against a real account. First, Cost Explorer data has a 24-hour lag. This means spend from today won't appear until tomorrow, so the script automatically excludes the most recent day to avoid incomplete results.

Second, the script uses unblended costs, which is what you actually pay on a single-account setup. Blended costs are a weighted average used in multi-account organisations sharing reserved capacity and will give different numbers.

If you have a new account or prefer not to use real billing data, the script includes a --sample flag that uses built-in data and calls no AWS APIs at all.
Run this first to see what the output looks like before reading the code:

python detect_cost_anomaly.py --sample

The Script

#!/usr/bin/env python3
# detect_cost_anomaly.py — Use Case 1: Cost Anomaly Detection
# Full explanation of every function is in the article.

import statistics
import sys
from datetime import datetime, timedelta

import boto3

def build_sample_data(days=30):
    """Synthetic Cost Explorer rows for the last `days` (ending yesterday).

    The EC2 spike is placed on yesterday (device local date) so sample output
    always matches the same window as live Cost Explorer mode.
    """
    last_day = datetime.today().date() - timedelta(days=1)
    first_day = last_day - timedelta(days=days - 1)
    anomaly_day_index = days - 1
    results = []
    for i in range(days):
        day = first_day + timedelta(days=i)
        d = i + 1
        results.append(
            {
                "TimePeriod": {
                    "Start": str(day),
                    "End": str(day + timedelta(days=1)),
                },
                "Groups": [
                    {
                        "Keys": ["Amazon EC2"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(
                                    round(
                                        18.50
                                        if i == anomaly_day_index
                                        else 1.10 + (d % 3) * 0.10,
                                        2,
                                    )
                                )
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon S3"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.04 + (d % 5) * 0.01, 2))
                            }
                        },
                    },
                    {
                        "Keys": ["Amazon RDS"],
                        "Metrics": {
                            "UnblendedCost": {
                                "Amount": str(round(0.85 + (d % 4) * 0.05, 2))
                            }
                        },
                    },
                ],
            }
        )
    return results, str(last_day)


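def get_daily_costs(days=30):
    # Cost Explorer is served from a single endpoint in us-east-1.
    ce = boto3.client("ce", region_name="us-east-1")
    # TimePeriod End is exclusive, so ending at today returns data through
    # yesterday, the same window build_sample_data() produces.
    end = datetime.today().date()
    start = end - timedelta(days=days)
    params = {
        "TimePeriod": {"Start": str(start), "End": str(end)},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }
    results = []
    while True:
        response = ce.get_cost_and_usage(**params)
        results.extend(response["ResultsByTime"])
        # Grouped results can span several pages; follow the token until it runs out.
        token = response.get("NextPageToken")
        if not token:
            return results
        params["NextPageToken"] = token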


def build_service_timeseries(results):
    services = {}
    for day in results:
        date_str = day["TimePeriod"]["Start"]
        for group in day["Groups"]:
            service = group["Keys"][0]
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            if service not in services:
                services[service] = []
            services[service].append({"date": date_str, "cost": cost})
    return services


def detect_anomalies(services, baseline_days=7, multiplier=2.0, recent_days=None):
    """Flag days where cost exceeds prior `baseline_days` average + 2σ.

    Uses a rolling baseline (each day vs the previous week). If `recent_days`
    is set, only returns anomalies on or after today - recent_days.
    """
    cutoff = None
    if recent_days is not None:
        cutoff = datetime.today().date() - timedelta(days=recent_days)

    anomalies = []
    for service, daily in services.items():
        if len(daily) < baseline_days + 1:
            continue
        for i in range(baseline_days, len(daily)):
            day = daily[i]
            day_date = datetime.strptime(day["date"], "%Y-%m-%d").date()
            if cutoff is not None and day_date < cutoff:
                continue
            baseline_costs = [d["cost"] for d in daily[i - baseline_days : i]]
            avg = statistics.mean(baseline_costs)
            if avg < 0.01:
                continue
            try:
                std = statistics.stdev(baseline_costs)
            except statistics.StatisticsError:
                continue
            threshold = avg + (multiplier * std)
            if day["cost"] > threshold:
                anomalies.append(
                    {
                        "service": service,
                        "date": day["date"],
                        "actual": round(day["cost"], 4),
                        "baseline_avg": round(avg, 4),
                        "threshold": round(threshold, 4),
                        "pct_above": round(((day["cost"] - avg) / avg) * 100, 1),
                    }
                )
    return sorted(anomalies, key=lambda x: x["date"])


def parse_args(argv):
    use_sample = "--sample" in argv
    recent_days = None
    for arg in argv[1:]:
        if arg.startswith("--recent-days="):
            recent_days = int(arg.split("=", 1)[1])
    return use_sample, recent_days


def run(use_sample=False, recent_days=None):
    if use_sample:
        results, anomaly_date = build_sample_data()
        print("Running against sample data (--sample mode).")
        print(
            f"This data represents 30 days of billing ending yesterday, "
            f"with a realistic EC2 anomaly on {anomaly_date}.\n"
        )
    else:
        print("Fetching 30 days of daily AWS costs by service...")
        print("Note: today is excluded — Cost Explorer has a 24-hour billing lag.\n")
        results = get_daily_costs(days=30)

    if recent_days is not None:
        since = datetime.today().date() - timedelta(days=recent_days)
        print(
            f"Checking for spikes in the last {recent_days} days only "
            f"(on or after {since}), each vs its prior 7-day average.\n"
        )

    services = build_service_timeseries(results)
    anomalies = detect_anomalies(services, recent_days=recent_days)

    if not anomalies:
        print("No anomalies detected.")
        print("\nNote: this script flags statistical outliers against your own baseline.")
        print("A consistently elevated spend level will not trigger — only sudden increases.")
        return

    print(f"{'=' * 60}")
    print(f"ANOMALIES DETECTED: {len(anomalies)}")
    print(f"{'=' * 60}\n")

    for a in anomalies:
        print(f"Service:      {a['service']}")
        print(f"Date:         {a['date']}")
        print(f"Actual cost:  ${a['actual']}")
        print(f"Baseline avg: ${a['baseline_avg']} (prior 7-day average)")
        print(f"Threshold:    ${a['threshold']}")
        print(f"Overage:      {a['pct_above']}% above baseline")
        print()

    print("=" * 60)
    print("A note on AWS cost attribution:")
    print("The service label in Cost Explorer is assigned by AWS, not by the resource")
    print("that caused the cost. An EC2 spike may be caused by EBS snapshot copies,")
    print("cross-region data transfer, or autoscaling events that AWS categorizes under")
    print("EC2 in billing — not a running EC2 instance you can find in the console.")
    print()
    print("Before investigating the flagged service directly, ask:")
    print("What changed in my infrastructure on or before the flagged date?")
    print("Work backward from the operational change, not forward from the billing label.")


if __name__ == "__main__":
    use_sample, recent_days = parse_args(sys.argv)
    run(use_sample=use_sample, recent_days=recent_days)

How the Script Works

get_daily_costs pulls your AWS billing data for the last 30 days.

build_service_timeseries takes the raw data from AWS and reorganises it. AWS groups the data by day first, then by service. This function flips that around so each service has its own list of daily costs, which is what the detection step needs to work with.

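detect_anomalies is where the actual check happens. For each service, it compares each day's spend to the 7 days right before it. It computes the mean and standard deviation of those 7 days and flags the day if its cost exceeds the mean plus 2 standard deviations. Services with fewer than 8 days of data, or with a baseline average under $0.01, are skipped. That's all it does.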
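To make the threshold concrete, here is the same check applied to a single day in isolation. This is a sketch of the rule, not a replacement for detect_anomalies. `check_day` is a hypothetical helper, and the baseline values are the kind of EC2 costs the sample data produces.

```python
import statistics


def check_day(baseline_costs, cost, multiplier=2.0):
    """Compare one day's cost against mean + multiplier * stdev of the baseline."""
    avg = statistics.mean(baseline_costs)
    std = statistics.stdev(baseline_costs)  # sample standard deviation
    threshold = avg + multiplier * std
    return {"baseline_avg": avg, "threshold": threshold, "flagged": cost > threshold}


# Seven quiet days that hover between $1.10 and $1.30.
quiet_week = [1.30, 1.10, 1.20, 1.30, 1.10, 1.20, 1.30]

normal = check_day(quiet_week, 1.30)   # a day at the top of the usual range
spike = check_day(quiet_week, 18.50)   # the forgotten node group

print(f"threshold: {spike['threshold']:.4f}")
print(f"normal day flagged: {normal['flagged']}")
print(f"spike day flagged:  {spike['flagged']}")
```

Because the threshold is built from the baseline's own spread, a service with noisy daily costs needs a bigger jump to be flagged than a service whose costs barely move.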

--recent-days=7 means "only show me anomalies from the last 7 days." The script still fetches 30 days of data because it needs that history to calculate the comparison, but the results are filtered to the window you care about. This is good for a quick Monday morning check.

--sample runs without touching your AWS account at all. It uses built-in fake billing data with a spike baked into yesterday's date so the detection always fires. Use this first to see what the output looks like before connecting it to real data.

What the Output Looks Like

Running --sample (the spike date will show as yesterday's actual date, not a fixed value):

Running against sample data (--sample mode).
This data represents 30 days of billing ending yesterday, with a realistic EC2 anomaly on 2026-05-14.

============================================================
ANOMALIES DETECTED: 1
============================================================

Service:      Amazon EC2
Date:         2026-05-14
Actual cost:  $18.5
Baseline avg: $1.2143 (prior 7-day average)
Threshold:    $1.3942
Overage:      1423.5% above baseline

============================================================
A note on AWS cost attribution:
The service label in Cost Explorer is assigned by AWS, not by the resource
that caused the cost. An EC2 spike may be caused by EBS snapshot copies,
cross-region data transfer, or autoscaling events that AWS categorizes under
EC2 in billing - not a running EC2 instance you can find in the console.

Before investigating the flagged service directly, ask:
What changed in my infrastructure on or before the flagged date?
Work backward from the operational change, not forward from the billing label.

Your dates will differ from the above because the sample data generates dates relative to today. The spike always lands on yesterday. The cost figures are fixed by each day's position in the 30-day window, so they stay the same from run to run.

The Decision the Script Can't Make for You

The anomaly is on the EC2 line, and the instinct is to go look at running EC2 instances. But as the output warns, the attribution is AWS's choice, not yours.

Before opening the EC2 console, check your deployment history for that date. What was deployed? Was a new environment created? Did an autoscaler event run? Start from the operational change and follow the thread to the billing data, because starting from the billing label and working backward is slower and frequently misleading.

Break it On Purpose

# See the spike immediately with no AWS account needed
python detect_cost_anomaly.py --sample

# Run against your real account
python detect_cost_anomaly.py

# Only show anomalies from the last 7 days, good for a quick this-week check
python detect_cost_anomaly.py --recent-days=7

# Combine both flags - sample data filtered to the last 7 days
python detect_cost_anomaly.py --sample --recent-days=7

If your real account returns "No anomalies detected" that's not a failure. It means your spend has been consistent. A clean account returns clean output. The script is doing exactly what it should.

When a real event happens on your account such as an autoscaler left running, a forgotten environment or an unexpected data transfer, this is what catches it before the invoice does.

Use Case 2 – Log Correlation Across Services

Environment: Fully local – Docker Compose, three Python services
Language: Python

The Production Problem

A user reports that their payment failed. You open your logging tool and search. The auth service logged a successful authentication. The ledger service logged a successful transaction. But the notification service, which should have sent a payment confirmation email, logged nothing at all.

Two services reported success while one stayed silent. The payment still failed, and you have three logs and no clear answer about where the chain broke.

What's Actually Happening at the System Level

What this is not: This isn't a guide to installing a log aggregation tool. It's about the data structure that makes log correlation possible in the first place and what happens when that structure breaks on one service's error path.

In a system with a single service, debugging is simple: one service, one log file, one timeline. But when a user request passes through multiple services, you need a way to link all the logs together. That link is called a trace ID.

Think of it like a ticket number at a government office. When you walk in, you get a number, say, A247. Every desk that handles your case writes A247 on your file. If something goes wrong, the manager pulls every record with A247 and sees exactly what happened, in order, across every desk. That is a trace ID. One number, shared across every service that touched the request.

In the demo, when a payment comes in, the auth service creates a unique ID for it. Every log line that auth, ledger, and notification write for that payment includes the same ID. When something breaks, you run correlate.py with that ID and it finds every related log line across all three services and sorts them by time:

python correlate.py pay-abc123

Here's what those logs look like. Notice that every line has the same trace_id:

{"timestamp": "2026-05-01T14:23:01.234Z", "trace_id": "pay-abc123", "service": "auth", "event": "user_authenticated", "level": "INFO", "user_id": "u_789", "duration_ms": 12}
{"timestamp": "2026-05-01T14:23:01.891Z", "trace_id": "pay-abc123", "service": "ledger", "event": "transaction_recorded", "level": "INFO", "amount": 50.0, "currency": "USD"}
{"timestamp": "2026-05-01T14:23:02.103Z", "trace_id": "pay-abc123", "service": "notification", "event": "email_queued", "level": "INFO", "recipient": "user@example.com"}

Now here's what breaks it. The notification service hits a timeout connecting to the email provider. The developer who wrote the error handler forgot to include the trace ID, so instead of a proper log line, it writes this:

2026-05-01T14:23:02.415Z ERROR Connection timeout to email provider smtp.example.com:587

The error happened, the log line exists. But because it has no trace_id, correlate.py can't find it.

The notification still appears in the timeline, and you can see email_send_attempt – but email_queued never follows it.

Timeline - 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete

The attempt is there but the failure is not. The developer just forgot one field.
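One way to make that field hard to forget is to bind the trace ID to the logger once, so every call site, including the error handler, gets it for free. The sketch below is an illustration of that idea, not the demo services' actual code. `make_logger`, `send_email`, and the `email_send_failed` event name are hypothetical.

```python
import json
from datetime import datetime, timezone


def make_logger(service, trace_id, sink=print):
    """Return a log function that stamps every line with the same trace_id."""
    def log(event, level="INFO", **fields):
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "trace_id": trace_id,
            "service": service,
            "event": event,
            "level": level,
        }
        record.update(fields)
        sink(json.dumps(record))
    return log


def send_email(log, deliver):
    """Attempt delivery and log the outcome through the bound logger."""
    log("email_send_attempt")
    try:
        deliver()
    except TimeoutError as exc:
        # The error path uses the same logger, so the trace ID comes along.
        log("email_send_failed", level="ERROR", error=str(exc))
        return False
    log("email_queued")
    return True


def failing_deliver():
    raise TimeoutError("Connection timeout to email provider smtp.example.com:587")


lines = []
log = make_logger("notification", "pay-abc123", sink=lines.append)
send_email(log, failing_deliver)
for line in lines:
    print(line)
```

With this shape, the timeout still gets logged as an error, but it is a structured line that a trace-based search can find.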

Set Up the Demo Environment

Navigate to 02-log-correlation/ and start the three services:

cd 02-log-correlation
docker compose up -d

This starts the auth, ledger, and notification services. Trigger a payment request to generate some logs:

./trigger_request.sh

The script prints the trace ID it used. Copy the ID and run the correlation script against it now, before we break anything, to see the full working path:

python correlate.py pay-5831e1bf

You should see something like this (your trace ID will be different but the structure is the same):

Loading logs from ./logs/...
Loaded 6 structured log lines.

============================================================
Trace ID: pay-5831e1bf
============================================================

Timeline - 6 events across 3 service(s):

  [2026-05-15T21:42:28.079046+00:00] [AUTH] [INFO] payment_request_received
    service: auth
    user_id: u_789
    amount: 50.0
  [2026-05-15T21:42:28.080718+00:00] [AUTH] [INFO] user_authenticated
    service: auth
    user_id: u_789
    duration_ms: 12
  [2026-05-15T21:42:28.145528+00:00] [LEDGER] [INFO] transaction_recorded
    service: ledger
    user_id: u_789
    amount: 50.0
    currency: USD
  [2026-05-15T21:42:28.210088+00:00] [NOTIFICATION] [INFO] email_send_attempt
    service: notification
    recipient: user@example.com
  [2026-05-15T21:42:28.347893+00:00] [NOTIFICATION] [INFO] email_queued
    service: notification
    recipient: user@example.com
    amount: 50.0
  [2026-05-15T21:42:28.378402+00:00] [AUTH] [INFO] payment_complete
    service: auth
    user_id: u_789
    amount: 50.0

That's the full payment journey through auth, ledger, and notification, in the exact order it happened. Now let's look at how the script works.

The Script

# correlate.py
import json
import os
import sys

SERVICES = ["auth", "ledger", "notification"]
LOG_DIR = "./logs"


def load_logs(log_dir):
    """
    Read each service's log file and parse every line as JSON.
    Lines that fail JSON parsing are printed as warnings.
    They are not silently dropped - a plain-text error line in a service
    that should emit structured logs is itself evidence worth seeing.
    """
    all_lines = []

    for service in SERVICES:
        log_file = os.path.join(log_dir, f"{service}.log")

        if not os.path.exists(log_file):
            print(f"  WARNING: No log file for '{service}' at {log_file}")
            continue

        with open(log_file) as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    parsed = json.loads(line)
                    parsed["_source"] = service
                    all_lines.append(parsed)
                except json.JSONDecodeError:
                    # This line exists in the log but cannot be correlated.
                    print(f"  WARNING: {service}.log line {line_num} is not structured JSON:")
                    print(f"           {line[:100]}")
                    print(f"           This line will NOT appear in any trace-based search.")

    return all_lines


def correlate(trace_id, all_lines):
    """
    Find every log line with this trace_id and sort by timestamp.
    The sorted result is the reconstructed timeline of the request.
    """
    matched = [line for line in all_lines if line.get("trace_id") == trace_id]
    matched.sort(key=lambda x: x.get("timestamp", ""))
    return matched


def find_missing_services(matched):
    """
    Check which services produced zero trace-tagged lines for this request.
    A missing service is not just an absence - it is a signal.
    Either the request never reached that service, or an error path swallowed
    the trace ID. Both are worth investigating.
    """
    services_seen = {line["_source"] for line in matched}
    return [s for s in SERVICES if s not in services_seen]


def print_timeline(trace_id, matched, missing):
    print(f"\n{'=' * 60}")
    print(f"Trace ID: {trace_id}")
    print(f"{'=' * 60}")

    if not matched:
        print("\nNo structured log lines found with this trace ID.")
        print("Either the trace ID is wrong, or no service emitted")
        print("a structured log line for this request.")
        return

    services_count = len({line["_source"] for line in matched})
    print(f"\nTimeline - {len(matched)} events across {services_count} service(s):\n")

    for line in matched:
        ts = line.get("timestamp", "unknown")
        service = line.get("_source", "unknown").upper()
        event = line.get("event", "unknown event")
        level = line.get("level", "INFO")
        extras = {k: v for k, v in line.items()
                  if k not in ("timestamp", "trace_id", "event", "level", "_source")}

        print(f"  [{ts}] [{service}] [{level}] {event}")
        for k, v in extras.items():
            print(f"    {k}: {v}")

    if missing:
        print(f"\n{'=' * 60}")
        print("MISSING TELEMETRY")
        print(f"{'=' * 60}")
        print(f"These services produced no trace-tagged events for trace {trace_id}:\n")
        for s in missing:
            print(f"  - {s}")
        print()
        print("This means one of three things:")
        print("  1. The request never reached this service.")
        print("  2. The service received it but an error path swallowed the trace ID,")
        print("     leaving a plain-text log line that trace correlation cannot find.")
        print("  3. This service's log file was not included in this run.")
        print()
        print("Check the raw log file for a plain-text error line around the same timestamp.")
        print("If one exists, that is your root cause - and a structured logging gap to fix.")


def run(trace_id):
    print(f"Loading logs from {LOG_DIR}/...")
    all_lines = load_logs(LOG_DIR)
    print(f"Loaded {len(all_lines)} structured log lines.\n")

    matched = correlate(trace_id, all_lines)
    missing = find_missing_services(matched)
    print_timeline(trace_id, matched, missing)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python correlate.py <trace_id>")
        print("Example: python correlate.py pay-abc123")
        sys.exit(1)
    run(sys.argv[1])

How the Script Works

load_logs reads the log files from each service and expects every line to be JSON. When a line isn't JSON, it prints a warning. That usually means an error path logged plain text without a trace ID, so the line can't be correlated.

correlate finds all logs that match the given trace ID and sorts them by time. This rebuilds the full request flow across services.

find_missing_services checks which services have no logs for that trace ID. This tells you where the request stopped or where the trace ID was lost.

print_timeline displays the full request timeline in order. It also shows which services are missing if something didn't log correctly.
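To see the correlate and find_missing_services steps on something small, here is a minimal sketch that runs the same logic against three hand-written records instead of log files. The SERVICES list, the trace IDs, and the sample records are made up for illustration. The two functions mirror the ones in the script above.

```python
# Hypothetical service list and sample records, for illustration only.
SERVICES = ["auth", "ledger", "notification"]

all_lines = [
    {"timestamp": "2026-05-15T21:59:00.617331+00:00", "trace_id": "pay-demo01",
     "event": "transaction_recorded", "_source": "ledger"},
    {"timestamp": "2026-05-15T21:59:00.605307+00:00", "trace_id": "pay-demo01",
     "event": "payment_request_received", "_source": "auth"},
    {"timestamp": "2026-05-15T21:59:00.700000+00:00", "trace_id": "pay-other",
     "event": "payment_request_received", "_source": "auth"},
]


def correlate(trace_id, lines):
    # Keep only the records for this trace, ordered by timestamp.
    matched = [line for line in lines if line.get("trace_id") == trace_id]
    matched.sort(key=lambda x: x.get("timestamp", ""))
    return matched


def find_missing_services(matched):
    # Any service with zero matching records is a gap worth investigating.
    seen = {line["_source"] for line in matched}
    return [s for s in SERVICES if s not in seen]


timeline = correlate("pay-demo01", all_lines)
missing = find_missing_services(timeline)

for line in timeline:
    print(line["timestamp"], line["_source"], line["event"])
print("missing:", missing)
```

Sorting the timestamps as plain strings works here only because every record uses the same ISO 8601 format and the same UTC offset. If your services log in mixed formats or time zones, parse the timestamps before sorting.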

One thing worth knowing before you use this in a real Kubernetes environment: kubectl logs only shows output from the currently running container. If a pod has restarted, you can read the previous container's logs with this:

kubectl logs <pod-name> --previous

But this only goes back one restart. Older logs are gone unless you ship them to a logging system like Loki or CloudWatch.

What the Output Looks Like After Breaking it

The point of this section is to show you what happens when a service fails silently: the error exists in the logs, but the script can't find it because the developer forgot one field.

break_it.sh forces the notification service to fail when it tries to send an email, and because the error handler was written without a trace ID, the failure gets logged as plain text with no way to tie it back to the original request.

Run it:

./break_it.sh

Then trigger a new request:

./trigger_request.sh

Copy the trace ID it prints, then correlate it:

python correlate.py pay-xxxxxxxx

Here is what you'll see:

Loading logs from ./logs/...
  WARNING: notification.log line 10 is not structured JSON:
           2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email
           provider http://mock-email:80/ after 0.001s - failed to send
           confirmation to user@example.com
           This line will NOT appear in any trace-based search.
Loaded 29 structured log lines.

============================================================
Trace ID: pay-6cf69a8c
============================================================

Timeline - 5 events across 3 service(s):

  [2026-05-15T21:59:00.605307+00:00] [AUTH] [INFO] payment_request_received
  [2026-05-15T21:59:00.606008+00:00] [AUTH] [INFO] user_authenticated
  [2026-05-15T21:59:00.617331+00:00] [LEDGER] [INFO] transaction_recorded
  [2026-05-15T21:59:00.630313+00:00] [NOTIFICATION] [INFO] email_send_attempt
  [2026-05-15T21:59:00.685182+00:00] [AUTH] [INFO] payment_complete

Look at this carefully. The notification service is in the timeline, and it logged email_send_attempt. But email_queued is missing, which means the email was never actually sent, and the error that explains why isn't in the timeline at all. It's hiding in the WARNING at the very top, where the script told you it found a line it couldn't parse.

That's the problem: the attempt is visible, but the failure is invisible.

Run cat logs/notification.log and scroll to the bottom:

{"timestamp": "2026-05-15T21:59:00.630313+00:00", "trace_id": "pay-6cf69a8c",
 "service": "notification", "event": "email_send_attempt", ...}
2026-05-15T21:59:00.681583+00:00 ERROR Connection timeout to email provider
http://mock-email:80/ after 0.001s - failed to send confirmation to user@example.com

Two lines to note: the first has a trace ID, which the script found and showed in the timeline. The second doesn't – the script flagged it as a warning and skipped it. The error happened about 0.051 seconds after the attempt (21:59:00.630313 to 21:59:00.681583). The log file has both lines. The timeline only has one.

That is what "invisible failure" looks like in production. The payment went through. The confirmation email was never sent. The error is sitting right there in the log file (Connection timeout to email provider after 0.001s), but in the correlation output above, the timeline shows email_send_attempt and then jumps straight to payment_complete with nothing in between: no error, no failure, no visible gap. It looks like everything worked.

The fix is in 02-log-correlation/services/notification/main.py. Here's the broken error handler:

except httpx.TimeoutException:
    emit_plain(f"Connection timeout to email provider {EMAIL_PROVIDER_URL}")
    return {"status": "ok"}

And here's the fixed version. The only change is passing req.trace_id into emit instead of calling emit_plain:

except httpx.TimeoutException:
    emit(req.trace_id, "email_timeout", level="ERROR",
         provider=EMAIL_PROVIDER_URL)
    return {"status": "ok"}

Once that change is made, the timeout error shows up in the timeline like everything else:

  [2026-05-15T21:59:00.681583+00:00] [NOTIFICATION] [ERROR] email_timeout
    provider: http://mock-email:80/

One command, one trace ID, the full picture.
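The emit and emit_plain helpers are not shown in this article, so here is a minimal sketch of what such a pair could look like. Everything in it is an assumption made for illustration: the SERVICE_NAME constant, the stream parameter, and the exact field names are not taken from the companion repo. What it shows is why one helper produces a line the correlation script can find and the other does not.

```python
import json
import sys
from datetime import datetime, timezone

SERVICE_NAME = "notification"  # assumed; a real service would set its own name


def emit(trace_id, event, level="INFO", stream=sys.stdout, **fields):
    """Write one JSON log line that always carries the trace ID."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "service": SERVICE_NAME,
        "event": event,
        "level": level,
        **fields,  # extra context such as provider=...
    }
    stream.write(json.dumps(record) + "\n")
    return record


def emit_plain(message, stream=sys.stdout):
    """Write a plain-text line with no trace ID - the pattern to avoid."""
    ts = datetime.now(timezone.utc).isoformat()
    stream.write(f"{ts} ERROR {message}\n")


# The structured line parses as JSON and can be matched on trace_id.
emit("pay-demo01", "email_timeout", level="ERROR",
     provider="http://mock-email:80/")

# The plain line cannot be parsed as JSON, so a trace-based search never sees it.
emit_plain("Connection timeout to email provider http://mock-email:80/")
```

Making the trace ID a required positional argument is a deliberate choice in this sketch: an error handler cannot call emit without deciding which request the line belongs to.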

The Decision the Script Can't Make For You

The correlation script points you to notification: the timeline shows email_send_attempt with nothing after it, and the loader warned about a line in notification.log that it couldn't parse. When you check the raw notification.log, you find the plain-text timeout error. So you know the request reached the service, authentication and transaction recording both succeeded, and the email failed.

Whether a notification failure is a payment failure depends entirely on how your system was designed. If notification is a soft dependency, this error shouldn't have surfaced to the user as a payment failure, and something else in your system design is wrong. If it's a hard dependency, the transaction itself should have rolled back. The script found where things broke, but the right response depends on the design.

Break it On Purpose

Run ./break_it.sh – this switches the notification service to a mode where its error handler drops the trace ID
Run ./trigger_request.sh to generate a new payment request and get a new trace ID
Run python correlate.py <trace_id> – the timeout error will be missing from the timeline
Run cat logs/notification.log – the timeout error is right there, without a trace ID, invisible to the script

Use Case 3 - Infrastructure Drift Detection

Environment: AWS free tier (one security group) + Terraform
Language: Python

The Production Problem

Your Terraform plan shows no changes. Your deployment is behaving differently than it did yesterday, and when you ask around, someone eventually remembers: a colleague made a quick manual change to a security group in the AWS console last week to unblock a staging test. They meant to go back and apply it through Terraform but they forgot.

Your Terraform state file and your actual AWS infrastructure have been quietly disagreeing ever since. Nothing broke loudly and no alert fired. Terraform wouldn't even know unless someone ran terraform plan to check, and in this scenario, nobody did.

This is called infrastructure drift, and it's far more common than most teams want to admit.

What's Actually Happening at the System Level

What this is not: This isn't the same as running terraform plan. A plan shows you what Terraform would change. This script shows you what has already changed in AWS without Terraform knowing.

The script itself doesn't run any Terraform commands. It reads the state file Terraform already produced. In the demo, Terraform creates that file. In a real environment, it already exists from your normal workflow.

Think of Terraform's state file as a receipt. When Terraform creates a security group, it writes down exactly what it created: the rules, the ports, the CIDRs. That receipt is the state file.

The script compares that receipt against what AWS actually has right now. If someone went into the AWS console and added a rule that isn't on the receipt, the script flags it as drift.

The blind spot is that, if someone creates a completely new security group in the console and never uses Terraform at all, there's no receipt for it. The script can't compare something it has never seen. It returns clean, and that group sits in your account undetected.

The demo shows both. First you break a known resource. Then the --invisible scenario creates a new one outside Terraform entirely, and the script returns clean even though your account now has an extra security group.
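One way to narrow that blind spot is to turn the comparison around: list every security group AWS reports and subtract the IDs the state file knows about. The sketch below is not part of detect_drift.py. The function names are hypothetical, and the sample state and IDs are invented so the example runs without AWS credentials.

```python
def state_security_group_ids(tfstate):
    """Collect the IDs of every security group recorded in the state file."""
    ids = set()
    for resource in tfstate.get("resources", []):
        if resource.get("type") != "aws_security_group":
            continue
        for instance in resource.get("instances", []):
            ids.add(instance["attributes"]["id"])
    return ids


def find_untracked(state_ids, live_ids):
    """Security groups that exist in AWS but have no entry in the state file."""
    return sorted(set(live_ids) - set(state_ids))


def live_security_group_ids():
    """List every security group ID in the current account and region."""
    import boto3  # imported here so the pure functions above work without it

    ec2 = boto3.client("ec2")
    ids = set()
    for page in ec2.get_paginator("describe_security_groups").paginate():
        for sg in page["SecurityGroups"]:
            ids.add(sg["GroupId"])
    return ids


# Invented sample data standing in for terraform.tfstate and the AWS response.
sample_state = {
    "resources": [
        {"type": "aws_security_group",
         "instances": [{"attributes": {"id": "sg-aaa"}}]},
    ]
}
tracked = state_security_group_ids(sample_state)

# In a real run you would pass live_security_group_ids() instead of this list.
untracked = find_untracked(tracked, ["sg-aaa", "sg-bbb"])
print(untracked)
```

Expect noise from a check like this. Every VPC has a default security group that Terraform usually doesn't manage, and groups owned by other state files will also show up as untracked, so the result is a list to review rather than a list of violations.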

Set Up the Demo Environment

Navigate to 03-drift-detection/ in the companion repo:

cd 03-drift-detection
pip install -r requirements.txt

Run setup. This uses real Terraform, not a mock:

./setup.sh

This runs terraform init and terraform apply, which creates a real AWS security group.

It also writes a genuine terraform.tfstate file. Open it in any text editor if you want to see what Terraform actually produces. It's JSON, it's readable, and it's the real thing.

Once setup completes, run the script:

python detect_drift.py terraform.tfstate

You should see something like this, but your actual security group ID will be different:

Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.

The lab is alive and both sides of the contract match. Now let's look at what the script is doing.

The Script (Code Files)

# detect_drift.py
import boto3
import json
import sys


def load_tfstate(path):
    """
    The Terraform state file is plain JSON - open it in any text editor
    and you will see a 'resources' array listing everything Terraform knows about.
    This function reads that file and returns the parsed contents.
    """
    with open(path) as f:
        return json.load(f)


def get_security_groups_from_state(tfstate):
    """
    Walk through the resources array and collect every security group entry.
    Each resource has a 'type', a 'name', and an 'instances' array holding
    the attribute values Terraform recorded when it last ran.
    We extract the resource ID and the ingress (inbound) rules.
    """
    resources = {}
    for resource in tfstate.get("resources", []):
        if resource["type"] == "aws_security_group":
            for instance in resource.get("instances", []):
                sg_id = instance["attributes"]["id"]
                resources[sg_id] = {
                    "ingress": instance["attributes"].get("ingress", [])
                }
    return resources


def get_security_group_from_aws(sg_id):
    """
    Call the AWS EC2 API to fetch the live current state of this security group.
    Under the hood, boto3 constructs an authenticated HTTPS request, signs it with
    your AWS credentials, sends it to the EC2 API endpoint in your configured region,
    and parses the response. The response contains far more data than we need -
    we extract only the inbound rules.
    """
    ec2 = boto3.client("ec2")
    response = ec2.describe_security_groups(GroupIds=[sg_id])
    sg = response["SecurityGroups"][0]
    return {"ingress": sg.get("IpPermissions", [])}


def normalize_state_rules(rules):
    """
    Terraform stores ingress rules in its own format.
    We normalize them into a set of tuples for easy comparison.
    Each tuple is: (from_port, to_port, protocol, cidr_block)
    """
    normalized = set()
    for rule in rules:
        for cidr in rule.get("cidr_blocks", []):
            normalized.add((
                rule.get("from_port", 0),
                rule.get("to_port", 0),
                rule.get("protocol", "-1"),
                cidr
            ))
    return normalized


def normalize_aws_rules(rules):
    """
    AWS returns ingress rules in a different format from Terraform's.
    We normalize them into the same tuple shape so the comparison works.
    """
    normalized = set()
    for rule in rules:
        from_port = rule.get("FromPort", 0)
        to_port = rule.get("ToPort", 0)
        protocol = rule.get("IpProtocol", "-1")
        for ip_range in rule.get("IpRanges", []):
            normalized.add((from_port, to_port, protocol, ip_range["CidrIp"]))
    return normalized


def detect_drift(tfstate_path):
    print(f"Loading Terraform state from: {tfstate_path}")
    tfstate = load_tfstate(tfstate_path)
    state_sgs = get_security_groups_from_state(tfstate)

    if not state_sgs:
        print("No security groups found in state file. Nothing to compare.")
        return

    drift_found = False

    for sg_id, state_data in state_sgs.items():
        print(f"\nChecking: {sg_id}")

        try:
            aws_data = get_security_group_from_aws(sg_id)
        except Exception as e:
            print(f"  ERROR: Could not fetch {sg_id} from AWS - {e}")
            print(f"  Check your IAM permissions: ec2:DescribeSecurityGroups is required.")
            continue

        state_rules = normalize_state_rules(state_data["ingress"])
        aws_rules = normalize_aws_rules(aws_data["ingress"])

        # Rules in AWS that Terraform does not know about (manual additions)
        added_in_aws = aws_rules - state_rules
        # Rules Terraform expects that no longer exist in AWS (manual deletions)
        removed_from_aws = state_rules - aws_rules

        if added_in_aws:
            drift_found = True
            print("  DRIFT - Rules present in AWS but missing from state file:")
            for rule in added_in_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if removed_from_aws:
            drift_found = True
            print("  DRIFT - Rules in state file but removed from AWS:")
            for rule in removed_from_aws:
                print(f"    Port {rule[0]}-{rule[1]} | Protocol: {rule[2]} | CIDR: {rule[3]}")

        if not added_in_aws and not removed_from_aws:
            print("  OK - No drift detected.")

    print("\n" + "=" * 60)
    if drift_found:
        print("Drift detected. See above for details.")
    else:
        print("No drift detected in monitored resources.")

    print("\nIMPORTANT: This script only checks resources tracked in your state file.")
    print("Resources created manually in AWS without Terraform are invisible to this check.")
    print("A clean output here does not mean your AWS account is clean - it means")
    print("the resources you are watching match what Terraform last recorded.")


if __name__ == "__main__":
    tfstate_path = sys.argv[1] if len(sys.argv) > 1 else "terraform.tfstate"
    detect_drift(tfstate_path)

How the Script Works

load_tfstate opens terraform.tfstate and reads it. Run cat terraform.tfstate after setup and you'll see that it's just a text file and everything Terraform knows about your infrastructure is stored in there.

get_security_groups_from_state pulls every security group out of that file, keeping two things for each one: the ID AWS assigned to it and the inbound rules Terraform last recorded. These are the expected values.

get_security_group_from_aws calls the AWS API and fetches the same security group's current inbound rules. These are the actual values. The script now has two versions of the same thing.

normalize_state_rules and normalize_aws_rules exist because Terraform and AWS store the same rule in slightly different formats. These two functions convert both into the same format so the comparison works.

The comparison is the last step. Rules in AWS but not in the state file were added manually. Rules in the state file but not in AWS were deleted manually. The script prints both.
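A small worked example makes the normalization and the set comparison concrete. The two functions below are copied from the script. The sample rules are invented: one HTTPS rule that exists on both sides, and one SSH rule that exists only in the AWS response.

```python
def normalize_state_rules(rules):
    # Terraform format: snake_case keys, CIDRs as a list of strings.
    normalized = set()
    for rule in rules:
        for cidr in rule.get("cidr_blocks", []):
            normalized.add((
                rule.get("from_port", 0),
                rule.get("to_port", 0),
                rule.get("protocol", "-1"),
                cidr,
            ))
    return normalized


def normalize_aws_rules(rules):
    # AWS format: PascalCase keys, CIDRs as a list of {"CidrIp": ...} dicts.
    normalized = set()
    for rule in rules:
        for ip_range in rule.get("IpRanges", []):
            normalized.add((
                rule.get("FromPort", 0),
                rule.get("ToPort", 0),
                rule.get("IpProtocol", "-1"),
                ip_range["CidrIp"],
            ))
    return normalized


# Invented sample data: what the state file recorded...
state_ingress = [
    {"from_port": 443, "to_port": 443, "protocol": "tcp",
     "cidr_blocks": ["10.0.0.0/16"]},
]
# ...and what AWS reports now, including a manually added SSH rule.
aws_ingress = [
    {"FromPort": 443, "ToPort": 443, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    {"FromPort": 22, "ToPort": 22, "IpProtocol": "tcp",
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

state_rules = normalize_state_rules(state_ingress)
aws_rules = normalize_aws_rules(aws_ingress)

added_in_aws = aws_rules - state_rules      # manual additions
removed_from_aws = state_rules - aws_rules  # manual deletions

print("added:", added_in_aws)
print("removed:", removed_from_aws)
```

Note what both functions skip: they only loop over IPv4 CIDR entries. A rule that references an IPv6 range, a prefix list, or another security group produces no tuple on either side, so drift in those rules would not be reported by this comparison.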

What the Output Looks Like

A clean run with no drift:

Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  OK - No drift detected.

============================================================
No drift detected in monitored resources.

IMPORTANT: This script only checks resources tracked in your state file.
Resources created manually in AWS without Terraform are invisible to this check.
A clean output here does not mean your AWS account is clean - it means
the resources you are watching match what Terraform last recorded.

After injecting drift:

Loading Terraform state from: terraform.tfstate

Checking: sg-0a1b2c3d4e5f6a7b8

  DRIFT - Rules present in AWS but missing from state file:
    Port 22-22 | Protocol: tcp | CIDR: 0.0.0.0/0

============================================================
Drift detected. See above for details.

The Decision the Script Can't Make For You

The script finds drift: an inbound rule that Terraform doesn't know about. The instinct is to revert it immediately by running terraform apply, but before doing that, ask one question: was this change an emergency hotfix? Someone may have manually opened a port at 2am to restore a broken service while a proper fix was being prepared. If you revert it automatically, you might undo something that was deliberately placed there to keep a service running.

Drift detection tells you that things are different. It doesn't tell you which version is correct, and investigating that is the work that comes after the script runs.

Break it On Purpose

Run ./break_it.sh. This adds an SSH inbound rule (port 22) directly via the AWS CLI, simulating a manual console change.
Run python detect_drift.py terraform.tfstate. The drift appears in the output.
Run ./break_it.sh --invisible to create a brand new security group that's not in the state file at all, then run the script again. It returns clean even though a new resource exists in your account, making the coverage gap visible.
When you're finished, run ./teardown.sh. This runs terraform destroy to delete the security group and clean up all AWS resources. No charges will remain after this.

Use Case 4 - Secrets Rotation with Zero Downtime

Environment: AWS Secrets Manager + local Kind cluster
Language: Python

The Production Problem

The goal of this use case: Kubernetes says a pod is healthy, but your users are getting database errors. The script catches that gap before users are affected. It does so by running one extra check that Kubernetes never runs.

You rotate your database credentials. The pod restarts. kubectl get pods shows Running. Ten minutes later, users can't log in.

The rotation worked, but the problem is that Kubernetes checked whether the HTTP server was alive, not whether it could authenticate with the database. Those are two different things.

What's Actually Happening

What this is not: This isn't about how to store secrets in Kubernetes. It's about what happens after the secret is rotated.

When a pod is already running, it holds a pool of open database connections that were authenticated before the rotation. The database does not kick those connections out when the password changes, so they keep working. But when the pool needs to open a new connection, it uses the credentials in the pod's environment, which still hold the old password. That new connection fails immediately.

Meanwhile, Kubernetes sees the pod responding to HTTP and marks it Running, so your users are hitting the failures with no indication from the cluster that anything is wrong.

What the `/healthz/db` Endpoint Does

/healthz returns 200 if the HTTP server is alive. That is all Kubernetes checks.

/healthz/db opens a fresh database connection using the current credentials and runs SELECT 1. If that fails after a rotation, the pod is Running but can't serve database requests. The rotation script calls this endpoint as its final step – the check Kubernetes never runs.

Here's what that looks like in the demo FastAPI application (code files):

# app.py (relevant section)
import os
import asyncpg
from fastapi import FastAPI, HTTPException

app = FastAPI()

DB_HOST = os.environ.get("DB_HOST", "postgres")
DB_PORT = int(os.environ.get("DB_PORT", "5432"))
DB_NAME = os.environ.get("DB_NAME", "appdb")
DB_USERNAME = os.environ.get("DB_USERNAME", "appuser")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")

@app.get("/healthz")
async def healthz():
    # Always returns 200 if the HTTP server is alive.
    # This is all the Kubernetes readiness probe checks.
    return {"status": "ok"}

@app.get("/healthz/db")
async def healthz_db():
    # Opens a fresh connection using the current environment credentials.
    # If the password was rotated and this pod has not restarted yet,
    # the environment still has the old password - this connection fails.
    # /healthz above would still return 200. Your users would see errors.
    try:
        conn = await asyncpg.connect(
            host=DB_HOST, port=DB_PORT,
            database=DB_NAME, user=DB_USERNAME, password=DB_PASSWORD,
        )
        await conn.execute("SELECT 1")
        await conn.close()
        return {"status": "ok", "db": "authenticated"}

    except asyncpg.InvalidPasswordError:
        raise HTTPException(
            status_code=503,
            detail=(
                f"Authentication failed for '{DB_USERNAME}'. "
                "Password may have been rotated. "
                "Readiness probe does not check this."
            )
        )
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Database error: {str(e)}")

The difference between these two endpoints is the entire lesson of this use case.

Set Up the Demo Environment

Navigate to 04-secrets-rotation/ and run the setup script:

cd 04-secrets-rotation
./setup.sh

This starts a Kind cluster, deploys real PostgreSQL with the appuser account already created, deploys the demo FastAPI app connected to it, and creates an initial secret in AWS Secrets Manager.

Once setup completes, install the dependencies:

pip install boto3 kubernetes

Before running the rotation, confirm everything is running:

kubectl get pods

You should see myapp and postgres pods both in the Running state. If any pod shows Pending or Error, wait 30 seconds and check again. PostgreSQL takes a moment to finish initialising.

You can also verify that the secret was created in AWS. In the console, go to AWS Secrets Manager and look for myapp/db-credentials.

If you prefer the CLI:

aws secretsmanager get-secret-value --secret-id myapp/db-credentials

Once both pods are Running and the secret exists, run the rotation to see the full path:

python rotate_secret.py

If Step 6 shows FAILED on this first clean run, it's almost always a timing issue: the app pod restarted successfully but /healthz/db ran before the new pod finished establishing its first database connection. Wait 20 seconds and run python rotate_secret.py again. If it fails repeatedly, run kubectl logs deployment/myapp to see what the app is reporting.

You should see all six steps complete cleanly, ending with:

Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated

The lab is alive and the full rotation chain works end to end. Now let's look at what the script is doing.

The Script (Code Files)

# rotate_secret.py
import boto3
import base64
import json
import subprocess
import sys
from kubernetes import client, config


def get_current_secret(secret_name):
    """
    Fetch the current credential from AWS Secrets Manager.
    The secret is stored as a JSON string with 'username' and 'password' fields.
    """
    sm = boto3.client("secretsmanager")
    response = sm.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def rotate_in_aws(secret_name, username, new_password):
    """
    Write the new credential to AWS Secrets Manager.
    put_secret_value creates a new version - the previous version is
    not deleted immediately, giving you a short rollback window.
    """
    sm = boto3.client("secretsmanager")
    new_value = json.dumps({"username": username, "password": new_password})
    sm.put_secret_value(SecretId=secret_name, SecretString=new_value)
    print("  [AWS] Secret updated in Secrets Manager.")


def update_kubernetes_secret(namespace, k8s_secret_name, username, new_password):
    """
    Patch the Kubernetes Secret object with the new credential values.
    Kubernetes requires secret data to be base64-encoded - this is encoding,
    not encryption. Anyone with access to the Secret object can decode the values.
    Real encryption at rest requires separate etcd encryption configuration.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()

    secret_data = {
        "username": base64.b64encode(username.encode()).decode(),
        "password": base64.b64encode(new_password.encode()).decode()
    }

    v1.patch_namespaced_secret(
        name=k8s_secret_name,
        namespace=namespace,
        body={"data": secret_data}
    )
    print(f"  [K8s] Kubernetes Secret '{k8s_secret_name}' updated.")


def rolling_restart(namespace, deployment_name):
    """
    Trigger a rolling restart of the deployment.
    Rolling restart means Kubernetes creates one new pod, waits for it to pass
    its readiness probe, then terminates one old pod - and repeats until all
    pods have been replaced. Availability is preserved throughout.
    This is very different from deleting all pods at once.
    """
    result = subprocess.run(
        ["kubectl", "rollout", "restart",
         f"deployment/{deployment_name}", "-n", namespace],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rolling restart failed: {result.stderr}")
    print(f"  [K8s] Rolling restart triggered for '{deployment_name}'.")


def wait_for_rollout(namespace, deployment_name, timeout=120):
    """
    Block until the rolling restart finishes or times out.
    'Finished' means all new pods are Running and their readiness probes passed.
    This does NOT mean the application can authenticate with the new credential.
    That is what verify_credential checks next.
    """
    print(f"  [K8s] Waiting for rollout (timeout: {timeout}s)...")
    result = subprocess.run(
        ["kubectl", "rollout", "status",
         f"deployment/{deployment_name}",
         "-n", namespace,
         f"--timeout={timeout}s"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"Rollout did not complete: {result.stderr}")
    print("  [K8s] Rollout complete. All pods report Ready.")


def verify_credential(namespace, deployment_name):
    """
    This is the check the readiness probe does not make.
    We exec into the running pod and call /healthz/db - an endpoint that
    makes an actual authenticated query to the database.
    If this passes: the credential is working at the application level.
    If this fails after the readiness probe passed: the contract mismatch is confirmed.
    The pod is Running. The application cannot serve database requests.
    """
    print("  [Verify] Running post-rotation credential check...")

    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", f"app={deployment_name}",
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True
    )
    pod_name = result.stdout.strip()

    if not pod_name:
        print("  [Verify] ERROR: No running pod found for this deployment.")
        return False

    verify = subprocess.run(
        ["kubectl", "exec", pod_name, "-n", namespace,
         "--", "curl", "-sf", "http://localhost:8000/healthz/db"],
        capture_output=True, text=True
    )

    if verify.returncode != 0:
        print("  [Verify] FAILED - Pod is Running but database authentication failed.")
        print("           The readiness probe validated HTTP reachability.")
        print("           The application cannot authenticate with the new credential.")
        print("           These are two different contracts. Only one was checked automatically.")
        return False

    print("  [Verify] PASSED - Application confirmed it can authenticate with the new credential.")
    return True


def rotate(secret_name, new_password, namespace, k8s_secret_name, deployment_name):
    print("\n[Step 1/6] Reading current secret from AWS Secrets Manager...")
    current = get_current_secret(secret_name)
    username = current["username"]

    print("[Step 2/6] Updating AWS Secrets Manager...")
    rotate_in_aws(secret_name, username, new_password)

    print("[Step 3/6] Rotating password at the database level (ALTER USER)...")
    rotate_postgres_password(namespace, new_password)

    print("[Step 4/6] Updating Kubernetes Secret object...")
    update_kubernetes_secret(namespace, k8s_secret_name, username, new_password)

    print("[Step 5/6] Triggering rolling restart...")
    rolling_restart(namespace, deployment_name)
    wait_for_rollout(namespace, deployment_name)

    print("[Step 6/6] Verifying the new credential works at the application level...")
    success = verify_credential(namespace, deployment_name)

    print("\n" + "=" * 60)
    if success:
        print("Rotation complete. Credential verified at the application level.")
    else:
        print("Rotation incomplete. Readiness probe passed but credential verification failed.")
        print("Recommended action: force-restart all pods to flush the connection pool,")
        print("or investigate the database session timeout configuration.")
        sys.exit(1)


if __name__ == "__main__":
    import secrets as _secrets
    rotate(
        secret_name="myapp/db-credentials",
        new_password=_secrets.token_urlsafe(16),
        namespace="default",
        k8s_secret_name="db-credentials",
        deployment_name="myapp"
    )

How the Script Works

get_current_secret reads the current credential from AWS Secrets Manager so the script knows which username to keep when it writes the new password.

rotate_in_aws writes the new credential to Secrets Manager. It creates a new version rather than overwriting the old one, so you have a short window to roll back if something goes wrong.

_pg_password_literal and rotate_postgres_password handle the step that most rotation scripts skip, which is actually changing the password inside PostgreSQL. This is done by running ALTER USER appuser PASSWORD '...' directly on the live PostgreSQL pod. Before this step, the database still accepts the old password. After this step, it does not.

update_kubernetes_secret writes the new password into the Kubernetes Secret so that any new pod that starts will get the new credential from the beginning.
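The docstring on update_kubernetes_secret points out that base64 is encoding, not encryption. A few lines are enough to confirm it. The password here is a made-up value.

```python
import base64

password = "s3cr3t-demo-value"  # made-up value for illustration

# This is the same transformation update_kubernetes_secret applies.
encoded = base64.b64encode(password.encode()).decode()

# Anyone who can read the Secret object can reverse it without any key.
decoded = base64.b64decode(encoded).decode()

print(encoded)
print(decoded)
```

No key is involved in either direction, which is why access to Secret objects needs to be restricted with RBAC regardless of how the values look in YAML.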

rolling_restart and wait_for_rollout restart the application pods one at a time so the deployment stays available throughout. When this step completes, all pods are Running and their readiness probes have passed – but keep in mind that "Ready" only means /healthz returned 200, which is exactly the problem this use case is about.

verify_credential is the extra step Kubernetes never runs. It reaches inside the new pod and calls /healthz/db, which opens a real database connection with the credentials in the pod's current environment. If this passes, the rotation is genuinely complete. If this fails after the readiness probe passed, you have confirmed the gap: the pod looks healthy but can't serve database requests.
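The setup section noted that verification can fail on timing alone, when /healthz/db is called before the new pod has opened its first database connection. One way to absorb that is a small retry wrapper around the check. This is a sketch, not part of rotate_secret.py: verify_with_retry and its parameters are hypothetical names, and the demo uses a fake check so it runs without a cluster.

```python
import time


def verify_with_retry(check, attempts=3, delay_seconds=10, sleep=time.sleep):
    """
    Call check() until it returns True or the attempts run out.
    check is any zero-argument callable that returns True or False.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            print(f"  [Verify] Attempt {attempt}/{attempts} failed; "
                  f"retrying in {delay_seconds}s...")
            sleep(delay_seconds)
    return False


# Demo with a fake check that fails twice and then passes.
results = iter([False, False, True])
ok = verify_with_retry(lambda: next(results), attempts=3, delay_seconds=0)
print(ok)
```

In the rotation script, the call would look something like verify_with_retry(lambda: verify_credential(namespace, deployment_name)). Keep the attempt count low. A retry is meant to cover a few seconds of startup delay, and a credential that is genuinely wrong should still end in a failed rotation.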

What the Output Looks Like

Successful rotation:

[Step 1/6] Reading current secret from AWS Secrets Manager...
[Step 2/6] Updating AWS Secrets Manager...
  [AWS] Secrets Manager updated.
[Step 3/6] Rotating password at the database level (ALTER USER)...
  [DB]  Running ALTER USER on PostgreSQL...
  [DB]  Password changed at the database level.
        New connections now require the new password.
        Existing pool connections remain valid until they close.
[Step 4/6] Updating Kubernetes Secret object...
  [K8s] Kubernetes Secret 'db-credentials' updated.
[Step 5/6] Triggering rolling restart...
  [K8s] Rolling restart triggered for 'myapp'.
  [K8s] Waiting for rollout (timeout: 120s)...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] PASSED - Application confirmed it can authenticate with the new credential.

============================================================
Rotation complete. Credential verified at the application level.
  AWS Secrets Manager: updated
  PostgreSQL:          updated (ALTER USER)
  Kubernetes Secret:   updated
  Application pod:     restarted, authenticated

The lab is alive and the full rotation chain works end to end.

Before you break anything, confirm the pod is healthy:

kubectl get pods

You should see myapp in Running state. That is the baseline: everything working as expected. Now let's break it.

Break it On Purpose

Step 1: Desync the DB

./break_it.sh

This runs ALTER USER directly on PostgreSQL with a wrong password. The K8s Secret still has the old password, so the pod's environment and the database are now out of sync.

Step 2: Check what Kubernetes sees

kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz

You will see {"status":"ok"}. The pod is still showing Ready in kubectl get pods. Kubernetes has no idea anything is wrong – that's the contract gap made visible in your terminal.

Step 3: Check what your users experience

kubectl exec deployment/myapp -- curl -s http://localhost:8000/healthz/db

You'll see a 503 error. Fresh database connections are failing. Your users are already seeing this.

Step 4: See the mixed pattern (optional)

./load_test.sh

Some requests succeed because they hit old pool connections that were authenticated before the break. Some fail because they need a fresh connection. The pod looks healthy, but a share of your traffic is failing.
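The mixed pattern follows from one fact: the database checks the password when a session opens, not on every query. Here is a small simulation of that behaviour. FakeDatabase, Session, and handle_request are hypothetical stand-ins, not part of the lab code, and each pooled session is used only once to keep the sketch short (a real pool returns sessions and recycles them over time).

```python
class Session:
    """An already-authenticated session; queries do not re-check the password."""

    def query(self) -> str:
        return "ok"


class FakeDatabase:
    """Stand-in for PostgreSQL: checks the password only when a session opens."""

    def __init__(self, password: str) -> None:
        self.password = password

    def connect(self, password: str) -> Session:
        if password != self.password:
            raise PermissionError("password authentication failed")
        return Session()


def handle_request(db: FakeDatabase, pool: list, app_password: str) -> str:
    """Serve one request, using a pooled session when one is available."""
    try:
        session = pool.pop() if pool else db.connect(app_password)
        return session.query()
    except PermissionError:
        return "503"


if __name__ == "__main__":
    db = FakeDatabase("old-password")
    pool = [db.connect("old-password") for _ in range(2)]  # opened before the break
    db.password = "something-else"   # the effect of ALTER USER in break_it.sh
    print([handle_request(db, pool, "old-password") for _ in range(4)])
    # ['ok', 'ok', '503', '503']
```

The first two requests ride on sessions opened before the password changed. The next two need a fresh connection and fail, even though nothing about the pod itself changed.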

Step 5: Run the rotation script

python rotate_secret.py

This time, Step 6 catches the failure. Here's what you'll see:

[Step 5/6] Triggering rolling restart...
  [K8s] Rollout complete. All pods report Ready.
[Step 6/6] Verifying the new credential works at the application level...
  [Verify] Running post-rotation credential check...
  [Verify] FAILED - Pod is Running but database authentication failed.
           The readiness probe validated HTTP reachability.
           The application cannot authenticate with the new credential.
           These are two different contracts. Only one was checked automatically.

============================================================
Rotation incomplete. Readiness probe passed but credential verification failed.

The pod is Running and shows Ready in kubectl get pods. The rotation script says the credential is broken. That's the contract gap visible in your terminal, caught before your users hit it.

The lesson: /healthz tells you the HTTP server is alive. /healthz/db tells you the application can actually connect to the database. Kubernetes only checks the first one unless you add a database probe. The rotation script adds that check at the end of every rotation so you catch the failure before your users do.

The Decision the Script Can't Make For You

The verification failed, the pod is Running, and requests to the database are failing. You have two options:

force-restart all pods at once to flush the connection pool (which is faster but causes a brief capacity reduction), or
wait for old sessions to expire naturally (which avoids downtime but leaves requests failing intermittently until the pool cycles).

The script found the problem, but deciding what to do next belongs to an engineer who knows the system.

Teardown

./teardown.sh

Use Case 5 - Automated Canary Rollback Trigger

Environment: Fully local – Kind, Prometheus via Helm
Language: Bash

What This Use Case Does and Why it Matters

This use case runs a script that watches your new deployment and automatically rolls it back if something goes wrong, before your users flood your support queue.

This matters in production because when you ship a new version, you don't send all traffic to it immediately. You send a small slice, say 20%, to the new version while 80% still goes to the old one. If the new version is broken, only 20% of users are affected and you can roll back before the damage spreads. But the rollback only works if you're watching the right things.

The takeaway: Two scripts watch the same failing canary. One reports everything is fine. The other fires the rollback. The only difference is what they measure. Your automation is only as good as what it watches.

What to watch for: canary_watch_v1.sh watches errors only and stays silent while the canary is slow. canary_watch_v2.sh watches errors AND response time and fires the rollback. The difference between them is the lesson.

What this is not: This isn't a guide to canary deployments. It's about what your monitoring misses when it only watches one signal.

How it Works

Three things run in the cluster: the stable app (three pods, handles most traffic), the canary app (one pod, handles a small slice), and Prometheus (collects response times and error counts from both every 15 seconds).

The watch script asks Prometheus every 15 seconds: "Is the canary behaving normally?" If the answer is no for three checks in a row, it rolls back the canary automatically.

The question is: what does "behaving normally" mean? That is the entire use case.

Set Up the Demo Environment

Navigate to 05-canary-rollback/ and run:

cd 05-canary-rollback
./setup.sh

Setup takes a few minutes. It installs Prometheus, deploys both versions of the demo app, and starts a load generator pod that sends continuous traffic to both so Prometheus always has data.

When setup finishes, confirm everything is running:

kubectl get pods

You should see output like this:

NAME                                                   READY   STATUS    RESTARTS   AGE
load-generator-68c59698b7-kws2l                        1/1     Running   0          4m54s
myapp-canary-6d6979c66f-g9lgw                          1/1     Running   0          32s
myapp-stable-6bcf994fc4-b4k9l                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-ndhxc                          1/1     Running   0          4m55s
myapp-stable-6bcf994fc4-z97kx                          1/1     Running   0          4m55s
prometheus-kube-prometheus-operator-59b847d96c-mp72s   1/1     Running   0          5m58s
prometheus-prometheus-kube-prometheus-prometheus-0     2/2     Running   0          5m1s

Three stable pods, one canary pod, one load generator, Prometheus running. The lab is alive.

Wait 60 seconds before running anything else. Prometheus needs time to scrape the first metrics from the pods. If you skip this, the watch scripts return empty data with no explanation.

Three Terminal Windows

You need three separate command prompts running at the same time.

On macOS: open Terminal and press Cmd+T twice. You now have three tabs, each an independent terminal.
On Linux: press Ctrl+Shift+T in most terminal apps, or right-click and choose "Open new tab."

Label them Terminal 1 for the watch script, Terminal 2 for injecting failures, Terminal 3 for watching latency.

The Scripts

Version 1: watches errors only

#!/usr/bin/env bash
# canary_watch_v1.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v1 - error rate only)."
echo "Rollback triggers if error rate exceeds ${ERROR_THRESHOLD} for ${STRIKE_LIMIT} checks."
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'

    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2>/dev/null)

    error_rate=${error_rate:-0}
    above=$(echo "$error_rate > $ERROR_THRESHOLD" | bc -l)

    echo "[$ts] error_rate=${error_rate} | threshold=${ERROR_THRESHOLD} | breach=$([ "$above" = "1" ] && echo YES || echo NO)"

    if [ "$above" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo "  ROLLBACK TRIGGERED"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done

Version 2: watches error rate AND response time

#!/usr/bin/env bash
# canary_watch_v2.sh

PROMETHEUS="http://localhost:9090"
DEPLOYMENT="myapp-canary"
NAMESPACE="default"
ERROR_THRESHOLD="0.05"
LATENCY_THRESHOLD="2.0"
CHECK_INTERVAL=15
STRIKE_LIMIT=3

strikes=0

echo "Canary monitor running (v2 - error rate + P99 latency)."
echo "Error threshold: ${ERROR_THRESHOLD} | Latency P99 threshold: ${LATENCY_THRESHOLD}s"
echo ""

while true; do
    ts=$(date '+%Y-%m-%dT%H:%M:%S')

    error_query='sum(rate(http_requests_total{app="myapp-canary",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="myapp-canary"}[1m]))'
    error_rate=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${error_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2>/dev/null)

    latency_query='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="myapp-canary"}[1m])) by (le))'
    latency=$(curl -sf "${PROMETHEUS}/api/v1/query" \
        --data-urlencode "query=${latency_query}" | \
        python3 -c "
import sys, json
d = json.load(sys.stdin)
result = d['data']['result']
print(result[0]['value'][1] if result else '0')
" 2>/dev/null)

    error_rate=${error_rate:-0}
    latency=${latency:-0}

    error_breach=$(echo "$error_rate > $ERROR_THRESHOLD" | bc -l)
    latency_breach=$(echo "$latency > $LATENCY_THRESHOLD" | bc -l)

    triggered_by=""
    [ "$error_breach" = "1" ] && triggered_by="error_rate(${error_rate})"
    [ "$latency_breach" = "1" ] && triggered_by="${triggered_by:+${triggered_by}, }latency_p99(${latency}s)"

    echo "[$ts] error_rate=${error_rate} | latency_p99=${latency}s | breach=${triggered_by:-none}"

    if [ "$error_breach" = "1" ] || [ "$latency_breach" = "1" ]; then
        strikes=$((strikes + 1))
        echo "  Strike ${strikes}/${STRIKE_LIMIT} | Triggered by: ${triggered_by}"
        if [ "$strikes" -ge "$STRIKE_LIMIT" ]; then
            echo ""
            echo "  ROLLBACK TRIGGERED"
            echo "  Signal: ${triggered_by}"
            kubectl rollout undo deployment/"${DEPLOYMENT}" -n "${NAMESPACE}"
            exit 0
        fi
    else
        strikes=0
    fi

    sleep "${CHECK_INTERVAL}"
done

How the Scripts Work

The error rate query asks Prometheus: "What fraction of requests to the canary returned an error in the last minute?" A result of 0.0 means no errors. A result of 0.06 means 6% of requests are failing, above the 5% threshold. You see this in the output as:

error_rate=0.06 | threshold=0.05 | breach=YES
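The watch scripts pull that number out of the JSON that Prometheus returns for an instant query, and they print 0 when the result list is empty. The sketch below does the same parsing in a standalone function, but returns None for "no usable data" so that it isn't confused with a real zero. The function name instant_value is hypothetical, and the response shape is the standard instant-query format (a list of series, each with a [timestamp, "value"] pair).

```python
import json
import math
from typing import Optional


def instant_value(body: str) -> Optional[float]:
    """Return the first sample of an instant-query response, or None if unusable."""
    payload = json.loads(body)
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None                        # no series matched the query
    value = float(result[0]["value"][1])   # samples arrive as [timestamp, "string"]
    return None if math.isnan(value) else value


if __name__ == "__main__":
    sample = ('{"status":"success","data":{"resultType":"vector",'
              '"result":[{"metric":{},"value":[1747482792.0,"0.06"]}]}}')
    print(instant_value(sample))
```

Treating "no data" separately matters right after setup, when Prometheus has not scraped anything yet and an empty result would otherwise look like a healthy canary.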

The latency query asks: "How slow is the slowest 1% of requests to the canary right now?" A result of 5.234 means 1 in every 100 requests is taking over 5 seconds. You see this as:

latency_p99=5.234s | breach=latency_p99(5.234s)
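It helps to know that this P99 is an estimate. Prometheus does not store individual request times. It stores counts per latency bucket, and histogram_quantile interpolates inside the bucket where the 99th percentile falls. The sketch below is a simplified version of that interpolation. The function name estimate_quantile and the bucket boundaries in the example are hypothetical, not the ones the demo app exports.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs sorted
    by upper bound and ending with the +Inf bucket (math.inf).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan                      # no observations in the window
    rank = q * total                         # observations at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound           # beyond the last finite bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    # 100 requests: 90 under 0.1s, 98 under 0.5s, the last 2 between 5s and 10s.
    buckets = [(0.1, 90), (0.5, 98), (1.0, 98), (5.0, 98), (10.0, 100), (math.inf, 100)]
    print(round(estimate_quantile(0.99, buckets), 3))  # 7.5
```

Two slow requests out of 100 are enough to push the estimate to 7.5 seconds here, because the 99th percentile lands in the 5s to 10s bucket. The result can only be as precise as the bucket boundaries allow.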

V1 only runs the first query. V2 runs both. Same canary, same problem, different answers.

The three-strike rule means a single bad check doesn't trigger a rollback – three in a row does. The tradeoff is 45 seconds (three checks at 15 seconds each) of exposure before the rollback fires.
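The strike logic is small enough to test on its own. This sketch replays a sequence of check results and reports which check would fire the rollback. The function name first_rollback_check is hypothetical. The behaviour mirrors the strikes counter in the watch scripts: a breach adds a strike, a healthy check resets the count to zero.

```python
from typing import Iterable, Optional


def first_rollback_check(breaches: Iterable[bool], strike_limit: int = 3) -> Optional[int]:
    """Return the 1-based index of the check that triggers rollback, or None."""
    strikes = 0
    for index, breached in enumerate(breaches, start=1):
        strikes = strikes + 1 if breached else 0   # a healthy check resets the count
        if strikes >= strike_limit:
            return index
    return None


if __name__ == "__main__":
    # A two-check spike followed by recovery never triggers a rollback.
    print(first_rollback_check([True, True, False, False]))             # None
    # A sustained breach triggers on the third consecutive bad check.
    print(first_rollback_check([True, True, False, True, True, True]))  # 6
```

Raising strike_limit filters out longer transient spikes at the cost of a longer exposure window. Lowering it does the opposite.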

When three strikes hit, the watch script itself runs:

kubectl rollout undo deployment/myapp-canary -n default

That one line is what triggers the rollback. It lives inside canary_watch_v2.sh and runs automatically – you don't have to do anything. The script detects, decides, and acts.

Break it On Purpose

In Terminal 1, start the v1 monitor:

./canary_watch_v1.sh

You will see this repeating every 15 seconds:

Canary monitor running (v1 - error rate only).
Rollback triggers if error rate exceeds 0.05 for 3 checks.

[2026-05-17T11:53:12] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:27] error_rate=0 | threshold=0.05 | breach=NO
[2026-05-17T11:53:42] error_rate=0 | threshold=0.05 | breach=NO

breach=NO means the canary looks healthy. Leave this running and move to Terminal 2.

In Terminal 2, inject latency into the canary:

./break_it.sh

This makes every request to the canary take 5 seconds. Requests still return 200 – no errors, just slowness. You will see:

Injecting latency into the canary deployment...
deployment "myapp-canary" successfully rolled out
Latency injection is active.

The canary pod is Running and passing its readiness probe.
Every request to the canary now takes 5 seconds.
Error rate: 0%   |   P99 latency: ~5s

Now look back at Terminal 1. The v1 monitor keeps printing breach=NO. The canary is taking 5 seconds per request and your monitoring says everything is fine. That's the failure.

In Terminal 3, see what your users are actually experiencing:

./check_latency.sh

TIMESTAMP                   STABLE (ms)   CANARY (ms)   STATUS
---------                   -----------   -----------   ------
2026-05-17T11:55:14         18ms          5008ms        CANARY DEGRADED
2026-05-17T11:55:20         7ms           5008ms        CANARY DEGRADED
2026-05-17T11:55:27         6ms           5008ms        CANARY DEGRADED

Stable is responding in 6–18 milliseconds. Canary is taking over 5 seconds. Users on the canary are waiting 5 seconds for every page load. The v1 monitor in Terminal 1 still says breach=NO.

This is the lesson: the monitoring and the user experience are completely disconnected. The script isn't broken. It's watching the wrong thing.

Now let's see the fix. Press Ctrl+C in Terminal 1 to stop v1. Start v2 in the same terminal:

./canary_watch_v2.sh

In Terminal 2, re-inject the latency:

./break_it.sh

Watch Terminal 1. V2 catches the latency and fires the rollback after three strikes:

Canary monitor running (v2 - error rate + P99 latency).
Error threshold: 0.05 | Latency P99 threshold: 2.0s

[2026-05-15T14:30:00] error_rate=0.0 | latency_p99=0.082s | breach=none
[2026-05-15T14:30:15] error_rate=0.0 | latency_p99=5.234s | breach=latency_p99(5.234s)
  Strike 1/3 | Triggered by: latency_p99(5.234s)
[2026-05-15T14:30:30] error_rate=0.0 | latency_p99=5.891s | breach=latency_p99(5.891s)
  Strike 2/3 | Triggered by: latency_p99(5.891s)
[2026-05-15T14:30:45] error_rate=0.0 | latency_p99=6.102s | breach=latency_p99(6.102s)
  Strike 3/3 | Triggered by: latency_p99(6.102s)

  ROLLBACK TRIGGERED
  Signal: latency_p99(6.102s)

deployment.apps/myapp-canary rolled back

The error rate never moved from 0. V2 rolled back anyway because latency crossed the threshold. That's the difference one extra measurement makes.

After the rollback, confirm that the broken revision is still in the rollout history rather than deleted:

kubectl rollout history deployment/myapp-canary -n default

REVISION  CHANGE-CAUSE
1         
2

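Two revisions. The rollback scaled the broken revision down to zero and restored the previous pod template. Nothing was deleted, and you can re-deploy if you decide the rollback was a false alarm. Note that kubectl records a restored template under a new revision number, so the numbers in your output may differ from the ones shown here.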

The Decision the Script Can't Make For You

V2 rolled back based on latency with zero errors. Before re-deploying, ask whether the latency was a real regression in the new code or a temporary spike, like a database cache warming up on first use. Both produce the same signal. Only you know which is more likely given what changed.

False positive rollbacks slow down deployments and erode confidence in automation. The right thresholds depend on your users and your system.
What the script enforces is whatever you configure.

Teardown

./teardown.sh

What You Can Do Now

Each use case in this handbook was a script solving a specific problem the standard tooling wasn't catching. Here's where you land:

You can catch AWS cost spikes before the invoice and you know that the service label is AWS's attribution, not a pointer to what actually caused the cost. Start from what changed operationally, not from the billing label.

You can reconstruct the full timeline of any failed request across multiple services from a single trace ID, and you know that a missing service in that timeline is evidence, not just an absence.

You can detect infrastructure drift by comparing what Terraform believes against what AWS actually contains, and you know that a clean result means the resources Terraform manages are in sync, not that your entire AWS account is clean.

You can validate a secret rotation at the application level, not just at the infrastructure level, and you know the difference between a readiness probe passing and the application actually being able to connect to the database.

You can build a canary rollback trigger that watches the right signals, and you know why watching only error rates can leave a slow, broken deployment running while users wait.

The pattern across all five use cases is the same: the standard tooling reported everything as fine while something was actually broken. The cost script returned clean, the pod showed Running, and the canary showed zero errors – not because the tools were wrong but because they were only checking what was easy to check. These scripts check what the standard tooling skips.

GitHub repo: https://github.com/Osomudeya/devops-scripting-labs

I write about DevOps weekly, covering real systems, interview and CV tips, and real incidents. Join the newsletter.

How to Build a Local DevOps HomeLab with Docker, Kubernetes, and Ansible

Osomudeya Zudonu — Mon, 13 Apr 2026 21:56:12 +0000

The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS.

I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $34 bill for a machine running nothing.

That was the last time I practiced on someone else's infrastructure.

Everything in this guide runs on your laptop. No cloud account, no credit card, no bill at the end of the month. By the end, you'll be able to spin up a multi-server environment from scratch, configure it automatically with Ansible, serve a site you wrote yourself, and diagnose what breaks when you intentionally destroy it.

That last part is where the actual learning happens.

Prerequisites

Before you start, make sure you have:

A laptop with at least 8GB of RAM (16GB is better)
At least 20GB of free disk space
Windows, macOS, or Linux operating system
Administrator access to your computer
Virtualization enabled in your BIOS/UEFI settings
A stable internet connection for the initial downloads

Knowledge and comfort level:

You should be comfortable using a terminal (running commands, changing directories, and editing small text files with whatever editor you like).
Basic familiarity with concepts like “a server,” “SSH,” and “a port” helps, but you don't need prior experience with Docker, Kubernetes, Vagrant, or Ansible. This guide introduces them as you go.

If you can follow step-by-step instructions and read error output without panicking, you're ready.

What is DevOps?
Why Build a Local Lab?
How to Set Up Docker
How to Set Up Kubernetes
How to Install kubectl
How to Set Up Vagrant
How to Install Ansible
How to Build Your First DevOps Project
How to Break Your Lab on Purpose
What You Can Now Do

What is DevOps?

DevOps is the practice of breaking down the wall between software development and IT operations teams.

Traditionally, developers write code and hand it off to operations teams to deploy and maintain. That handoff causes delays, misunderstandings, and outages. DevOps is what happens when both teams work together from the start.

The tools you'll install in this guide each solve a specific part of that process:

Docker packages your application and everything it needs into a portable container that runs the same way on any machine.
Kubernetes manages multiple containers at scale, handling restarts, networking, and load balancing automatically.
Vagrant creates and manages virtual machine environments so your whole team always works on identical setups.
Ansible automates repetitive configuration tasks across many servers without writing a script for each one.

Why Build a Local Lab?

A local lab gives you a safe place to break things, fix them, and learn from that process without any cost or risk.

Here's what you get with a local setup:

Zero cost. No cloud bills, no surprise charges, and no credit card required.
Works offline. Practice anywhere, even without internet after the initial setup.
Full control. You manage every layer from the OS up to the application.
Safe experimentation. Break things freely. Nothing here affects production.
Fast feedback. No waiting for cloud resources to spin up. Everything runs on your machine.

The tradeoff is resource limits. Your laptop's CPU and RAM are the ceiling. You can't simulate large-scale deployments, and some cloud-native services like AWS Lambda or S3 have no direct local equivalent. But for learning core DevOps workflows, none of that matters.

How to Set Up Docker

Docker is the foundation of this lab. Every other tool in this guide either runs inside Docker containers or works alongside them.

How to Install Docker on Windows

First, enable virtualization in your BIOS:

Restart your computer and enter BIOS/UEFI setup. The key is usually F2, F10, Del, or Esc during boot.
Find the virtualization setting. It's usually listed as Intel VT-x, AMD-V, SVM, or Virtualization Technology.
Enable it, save your changes, and exit.

Then install Docker Desktop:

Download Docker Desktop from Docker's official website.
Run the installer and follow the prompts.
Enable WSL 2 (Windows Subsystem for Linux) when asked.
Restart your computer.
Open Docker Desktop from the Start menu and wait for the whale icon in the taskbar to stop animating.

Troubleshooting: If Docker fails to start, run this in PowerShell as Administrator to verify virtualization is active:

systeminfo | findstr "Hyper-V Requirements"

All items should show "Yes". If they don't, revisit your BIOS settings.

How to Install Docker on Mac

Download Docker Desktop for Mac from Docker's website.
Open the downloaded .dmg file and drag Docker to your Applications folder.
Open Docker from Applications.
Enter your password when prompted.
Wait for the whale icon in the menu bar to stop animating.

How to Install Docker on Linux

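These commands are for Ubuntu. Run them in order:

# Update your package lists
sudo apt-get update

# Install prerequisites
sudo apt-get install -y ca-certificates curl

# Add Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the Docker repository (apt-key is deprecated, so the key is referenced with signed-by)
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null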

# Update and install Docker
sudo apt-get update
sudo apt-get install docker-ce

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add your user to the docker group
sudo usermod -aG docker $USER

Log out and back in for the group change to take effect.

How to Test Docker

Run this command:

docker run hello-world

If you see "Hello from Docker!" then Docker is working correctly.

Docker is set up. Next, you'll install Kubernetes to manage containers at scale.

How to Set Up Kubernetes

Kubernetes manages containers at scale. For a local lab, you have four options. Here's how to choose:

Tool	Best for	RAM needed
Minikube	Beginners. Easiest setup, built-in dashboard	2GB+
Kind	Faster startup, works well inside CI pipelines	1GB+
k3s	Low-resource machines. Lightweight but production-like	512MB+
kubeadm	Learning how clusters are actually bootstrapped in production	2GB+ per node

If you're just starting out, use Minikube. It has the simplest setup and a visual dashboard that helps you understand what's happening inside the cluster.

If your laptop has 8GB RAM or less, use k3s. It runs lean and behaves closer to a real cluster than Minikube does.

Use kubeadm only if you want to understand how Kubernetes nodes join a cluster — it requires more manual steps and isn't beginner-friendly.

How to Install Minikube (Recommended for Beginners)

Minikube creates a single-node Kubernetes cluster on your laptop.

On Windows:

Download the Minikube installer from Minikube's GitHub releases page.
Run the .exe installer.
Open Command Prompt as Administrator and start Minikube:

minikube start --driver=docker

On Mac:

brew install minikube
minikube start --driver=docker

On Linux:

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube-linux-amd64
sudo mv minikube-linux-amd64 /usr/local/bin/minikube
minikube start --driver=docker

Test your cluster:

minikube status
minikube dashboard

How to Install k3s (Recommended for Low-RAM Machines)

k3s is a lightweight version of Kubernetes that installs in under a minute. It runs lean and behaves like a real cluster — not a simplified demo version.

On Linux (and Mac via Multipass):

curl -sfL https://get.k3s.io | sh -

That single command installs k3s and runs it automatically in the background. Check that it is running:

sudo k3s kubectl get nodes

You should see one node with status Ready.

On Mac directly — k3s doesn't run natively on macOS. Use Multipass to spin up a lightweight Ubuntu VM first, then run the install command inside it.

On Windows — use WSL2 (Ubuntu), then run the install command inside your WSL2 terminal.

How to Install Kind (Kubernetes IN Docker)

Kind runs a full Kubernetes cluster inside Docker containers. It starts faster than Minikube and is useful if you want to run multiple clusters simultaneously.

# Mac or Linux
brew install kind

# Windows
choco install kind

Create a cluster:

kind create cluster --name my-local-lab

How to Install kubeadm (For Understanding Cluster Bootstrap)

kubeadm is the tool Kubernetes uses to initialize and join nodes in a real cluster. Use this when you want to understand what happens under the hood — not as your daily driver.

It requires at least two machines (or VMs). The setup is more involved than the options above. Follow the official kubeadm installation guide for your OS, then initialize your cluster:

sudo kubeadm init --pod-network-cidr=10.244.0.0/16

After init, join worker nodes using the command kubeadm prints at the end of the output.

How to Install kubectl

kubectl is the command-line tool you use to interact with any Kubernetes cluster.

On Windows:

Download kubectl.exe from Kubernetes' website and place it in a directory that is in your PATH. Or install with Chocolatey:

choco install kubernetes-cli

On Mac:

brew install kubectl

On Linux:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl

Test it:

kubectl get pods --all-namespaces

On a fresh cluster, you'll see system pods running in the kube-system namespace — things like coredns and storage-provisioner. That's the expected output. It means your cluster is up and kubectl can talk to it.

Kubernetes is running. Next is Vagrant. But before that, there's one important distinction worth making.

Docker vs Vagrant — they aren't the same thing

Docker creates containers: lightweight processes that share your operating system's kernel. Vagrant creates full virtual machines: isolated computers with their own OS running inside your laptop.

Containers are fast and small. VMs are heavier but behave exactly like real servers. You'll use both in this lab for different reasons.

How to Set Up Vagrant

Vagrant lets you create and manage reproducible virtual machine environments. It is ideal for simulating multi-server setups on a single laptop.

How to Install Vagrant on Windows

Download and install VirtualBox with default options.
Download and install Vagrant.
Restart your computer if prompted.

Note: VirtualBox and Hyper-V can't run at the same time on Windows. Check if Hyper-V is active:

systeminfo | findstr "Hyper-V"

If it's enabled, you have two options: switch to the Hyper-V Vagrant provider, or disable Hyper-V with:

Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All

Restart after disabling.

How to Install Vagrant on Mac and Linux

On Mac:

Download and install VirtualBox.
After installation, open System Preferences > Security & Privacy > General. You will see a message saying system software from Oracle was blocked. Click Allow and restart your Mac. Without this step, VirtualBox will not run.
Download and install Vagrant.

Note for Apple Silicon (M1/M2/M3) Macs: VirtualBox support on Apple Silicon is still limited. If you're on an M-series Mac, use UTM as your VM provider instead, or use Multipass which works natively on Apple Silicon.

On Linux:

Download and install VirtualBox.
Download and install Vagrant.

Verify both are installed:

vboxmanage --version
vagrant --version

How to Create Your First Vagrant Environment

Create a new directory for your project. Inside it, create a file named Vagrantfile with this content:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  # Create a private network between VMs
  config.vm.network "private_network", type: "dhcp"

  # Forward port 8080 on your laptop to port 80 on the VM
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Install Nginx when the VM starts
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y nginx
    echo "Hello from Vagrant!" > /var/www/html/index.html
  SHELL
end

Start the VM:

vagrant up

Visit http://localhost:8080 in your browser. You should see "Hello from Vagrant!"
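If you prefer to check from the terminal, or want something you can re-run after each change, a short script can fetch the page through the forwarded port and look for the expected text. This is a minimal sketch using only the Python standard library. The function name page_contains is hypothetical, and the URL and text come from the Vagrantfile above.

```python
import http.client
import urllib.request


def page_contains(url: str, expected: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET of url succeeds and the body contains expected."""
    # Ignore any proxy settings: the target is a port on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        return False    # connection refused, timeout, HTTP error, or bad reply
    return expected in body


if __name__ == "__main__":
    ok = page_contains("http://localhost:8080", "Hello from Vagrant!")
    print("page served through the forwarded port:", ok)
```

A False result does not tell you which layer failed. It could be the VM, the port forward, or Nginx inside the VM, so treat it as a prompt to start checking each one.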

Troubleshooting SSH on Windows

If vagrant ssh fails, try:

vagrant ssh -- -v

Or connect manually:

ssh -i .vagrant/machines/default/virtualbox/private_key vagrant@127.0.0.1 -p 2222

How to Create a Local Vagrant Box Without Internet

Note: Most readers can skip this. Only do this if you want to work fully offline after the initial setup.

Download Ubuntu 20.04 LTS and save the .iso file locally.
Open VirtualBox and create a new VM: Name it ubuntu-devops, Type: Linux, Version: Ubuntu (64-bit).
Assign 2048MB RAM and a 20GB VDI disk.
Attach the .iso under Storage > Optical Drive.
Start the VM and complete the Ubuntu installation.
Once installed, shut down the VM and run:

VBoxManage list vms
vagrant package --base "ubuntu-devops" --output ubuntu2004.box
vagrant box add ubuntu2004 ubuntu2004.box

You now have a reusable local box that works without internet.

You can spin up virtual machines. Next is Ansible, which automates what goes inside them.

How to Install Ansible

Ansible automates configuration and software installation across multiple servers. Instead of SSH-ing into ten machines and running the same commands manually, you write a playbook once and Ansible handles the rest.

How to Install Ansible on Windows

Ansible doesn't run natively on Windows. You need to use it through WSL (Windows Subsystem for Linux).

Open PowerShell as Administrator and enable WSL:

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Restart your computer.
Install Ubuntu from the Microsoft Store.
Open Ubuntu and install Ansible:

sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

How to Install Ansible on Mac

brew install ansible

How to Install Ansible on Linux

# Ubuntu/Debian
sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

# Red Hat/CentOS
sudo yum install ansible

How to Test Ansible

Create a file called hosts in your current directory:

[local]
localhost ansible_connection=local

Create a file called playbook.yml in the same directory:

---
- name: Test playbook
  hosts: local
  tasks:
    - name: Print a message
      debug:
        msg: "Ansible is working!"

Run the playbook, passing the local hosts file with -i:

ansible-playbook -i hosts playbook.yml

You should see the message "Ansible is working!" in the output.

Alright, all your tools are installed. Now you'll use them together to build something real.

How to Build Your First DevOps Project

You can find the entire code for this lab in this repo: https://github.com/Osomudeya/homelab-demo-article

Now you'll put these tools together in one project. Each tool will perform its actual job, and nothing is forced.

Before you start, create a fresh directory for this project. Don't run it inside the directory you used to test Vagrant earlier, as the Vagrantfile here is different and will conflict.

You'll be building a two-VM environment: one machine serves a web page you write yourself inside a Docker container, and the other runs a MariaDB database. Vagrant creates the machines and Ansible configures them. The page you see at the end is yours.

Step 1: Create the Project Directory

mkdir devops-lab-project && cd devops-lab-project

Step 2: Write Your Site Content

Create a file called index.html in the project directory. Write whatever you want on this page — it's what you'll see in your browser at the end:



<!DOCTYPE html>
<html>
<head><title>My DevOps Lab</title></head>
<body>
  <h1>My DevOps Lab</h1>
  <p>Provisioned by Vagrant. Configured by Ansible. Served by Docker.</p>
  <p>Built on a laptop. No cloud account needed.</p>
</body>
</html>

Change the text to whatever you like. This is your page.

Step 3: Write the Vagrantfile

Create a file called Vagrantfile in the same directory:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.33.10"
    web.vm.network "forwarded_port", guest: 80, host: 8080
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.33.11"
  end
end

Step 4: Start the Virtual Machines

vagrant up

The first run downloads the ubuntu/focal64 box, which is around 500MB.

Expect this to take 10–30 minutes depending on your connection. Subsequent runs will be much faster since the box is cached locally.

Step 5: Create the Ansible Inventory

Create a file called inventory in the same directory:

[webservers]
192.168.33.10 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

[dbservers]
192.168.33.11 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/db/virtualbox/private_key

Ansible uses the Vagrant-generated private keys so it can SSH in as the vagrant user. Host key checking for this lab is turned off in ansible.cfg (next step), not in the inventory.
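
If you want to double-check the inventory before running a playbook, a small script can confirm that every host entry points at a key file that actually exists. The sketch below is only an illustration: parse_inventory and missing_keys are hypothetical helper names, and the parser handles just the simple layout used in this lab, not Ansible's full inventory syntax.

```python
from pathlib import Path


def parse_inventory(text):
    """Parse a simple INI-style inventory into {group: {host: vars}}.

    Only handles the layout used in this lab: [group] headers followed by
    lines of the form "host key=value key=value".
    """
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups[current] = {}
            continue
        if current is None:
            continue  # ignore hosts listed before any group header
        host, *pairs = line.split()
        groups[current][host] = dict(p.split("=", 1) for p in pairs if "=" in p)
    return groups


def missing_keys(groups, base_dir="."):
    """Return (host, path) pairs whose private key file does not exist."""
    missing = []
    for hosts in groups.values():
        for host, host_vars in hosts.items():
            key = host_vars.get("ansible_ssh_private_key_file")
            if key and not (Path(base_dir) / key).exists():
                missing.append((host, key))
    return missing


inventory_file = Path("inventory")
if inventory_file.exists():
    for host, key in missing_keys(parse_inventory(inventory_file.read_text())):
        print(f"{host}: key file not found at {key} (has vagrant up been run?)")
```

Run it from the project folder. No output means every key path in the inventory resolves to a file on disk.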

Step 6: Create the Ansible Config File

Before running the playbook, create a file called ansible.cfg in the same directory:

[defaults]
inventory = inventory
host_key_checking = False

The inventory line tells Ansible to use the inventory file in this folder by default. host_key_checking = False tells Ansible not to verify SSH host keys when connecting to your Vagrant VMs. Without it, Ansible will fail with a Host key verification failed error on first connection because the VM's key is not yet in your known_hosts file.

These settings are for a local lab only. Do not use host_key_checking = False for production systems.

Step 7: Create the Ansible Playbook

Create a file called playbook.yml:

---
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:

    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes

    - name: Start Docker service
      service:
        name: docker
        state: started
        enabled: yes

    # Create the directory that will hold your site content
    - name: Create web content directory
      file:
        path: /var/www/html
        state: directory
        mode: '0755'

    # This copies your index.html from your laptop into the VM
    - name: Copy site content to web server
      copy:
        src: index.html
        dest: /var/www/html/index.html

    # This mounts that file into the Nginx container so it serves your page
    # The -v flag connects /var/www/html on the VM to /usr/share/nginx/html inside the container
    - name: Run Nginx serving your content
      shell: |
        docker rm -f webapp 2>/dev/null || true
        docker run -d --name webapp --restart always -p 80:80 \
          -v /var/www/html:/usr/share/nginx/html:ro nginx

- name: Configure database server
  hosts: dbservers
  become: yes
  tasks:

    # Hash sum mismatch on .deb downloads is often stale lists, a flaky mirror, or apt pipelining
    # behind NAT; fresh indices + Pipeline-Depth 0 usually fixes it on lab VMs.
    - name: Disable apt HTTP pipelining (mirror/proxy hash mismatch workaround)
      copy:
        dest: /etc/apt/apt.conf.d/99disable-pipelining
        content: 'Acquire::http::Pipeline-Depth "0";'
        mode: "0644"

    - name: Clear apt package index cache
      shell: apt-get clean && rm -rf /var/lib/apt/lists/* /var/lib/apt/lists/auxfiles/*
      changed_when: true

    - name: Update apt cache after reset
      apt:
        update_cache: yes

    - name: Install MariaDB
      apt:
        name: mariadb-server
        state: present
        update_cache: no

    - name: Start MariaDB service
      service:
        name: mariadb
        state: started
        enabled: yes

Two lines worth paying attention to:

src: index.html — Ansible looks for this file in the same directory as the playbook. That is the file you wrote in Step 2.
-v /var/www/html:/usr/share/nginx/html:ro — this mounts the directory from the VM into the Nginx container. The :ro means read-only. Nginx serves whatever is in that folder.

Step 8: Run the Playbook

ansible-playbook -i inventory playbook.yml

You'll see task-by-task output as Ansible connects to each VM over SSH and configures it. A green ok or yellow changed next to each task means it worked. Red fatal means something failed.

Step 9: Verify the Setup

Open http://localhost:8080 in your browser. You should see the page you wrote in Step 2 served from inside a Docker container, running on a Vagrant VM, configured automatically by Ansible.

If you see the page, every tool in this lab is working together.
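
The browser check can also be scripted, which is handy once you start breaking things on purpose. This is a minimal sketch using only Python's standard library. The names fetch and site_is_up are hypothetical, and the marker text assumes you kept the "My DevOps Lab" heading from Step 2.

```python
import urllib.request


def fetch(url, timeout=5):
    """Return (status_code, body_text) for a URL using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def site_is_up(url, marker, fetcher=fetch):
    """True when the URL answers 200 and the body contains the marker text."""
    try:
        status, body = fetcher(url)
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses
        return False
    return status == 200 and marker in body
```

Calling site_is_up("http://localhost:8080", "My DevOps Lab") on your laptop should return True while the lab is running and False while the container is down. The fetcher parameter exists so the check can be exercised without a live server.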

Step 10: Clean Up (Optional)

When you're done:

vagrant destroy -f

This shuts down and deletes both VMs. Your Vagrantfile, inventory, playbook.yml, and index.html stay on disk — run vagrant up followed by ansible-playbook -i inventory playbook.yml any time to bring it all back.

Now that you have a working lab, let's use it properly.

How to Break Your Lab on Purpose

Following these steps has gotten you a running lab. Breaking things teaches you how everything actually works.

Here are five things to break and what to look for when you do.

Break 1: Crash the Main Process Inside the Container (and Watch It Come Back)

This break shows three things: a process inside the container can die (as with a real bug or an out-of-memory kill), Docker restarts the container because of --restart always, and your site comes back without re-running Ansible.

After vagrant ssh web, every docker command below runs on the web VM. So keep your browser on your laptop at http://localhost:8080 (Vagrant forwards your host port to the VM’s port 80).

Troubleshooting: If Your Lab Isn't Ready

Run these from your project folder on the host (your laptop) unless the step says to run it on the VM:

If you ran vagrant destroy -f: run vagrant up, then ansible-playbook -i inventory playbook.yml.
If docker ps -a shows webapp with status Exited: on the web VM, run sudo docker start webapp, then sudo docker ps again.
If there's no webapp row in docker ps -a: re-run ansible-playbook -i inventory playbook.yml on the host.

If the playbook is already applied and webapp is Up, skip this section and start at step 1 under Steps (happy path) below. (Don't skip SSH or docker ps. You need the VM shell and a quick check before you run docker exec.)

Steps (happy path)

SSH into the web VM:

vagrant ssh web

Confirm webapp is Up:
```
sudo docker ps
```
Break it on purpose: kill the container’s main process from inside (PID 1). That ends the container the same way a crashing app would, not the same as docker stop on the host:

sudo docker exec webapp sh -c 'sleep 5 && kill 1'

The sleep 5 gives you a moment to switch to the browser. Right after you run the command, open or refresh http://localhost:8080. You may catch a brief error or blank page while nothing is listening on port 80.

Watch Docker restart the container:

watch sudo docker ps -a

Within a few seconds you should see Exited (137) become Up again. (Press Ctrl+C to exit watch.)

Refresh the browser. You should see the same HTML as before, because the files live on the VM under /var/www/html and are bind-mounted into the container; restarting only replaced the Nginx process, not those files.

Why not `docker stop` or `docker kill` on the host for this demo?

Those commands go through Docker’s API. On many setups (including recent Docker versions), Docker records that you chose to stop the container (hasBeenManuallyStopped), and --restart always may not bring the container back until you run docker start or the Docker daemon restarts.

Killing PID 1 from inside the container is treated more like an internal crash, so the restart policy you set in the playbook is the one you actually get to observe here.

Kubernetes analogy: A pod whose containers exit can be restarted by the kubelet; a pod you delete does not come back by itself.

What to observe (three separate checks):

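Exit code: After kill 1, docker ps -a shows the container's exit code in its Exited (...) status. A code of 137 (128 + 9) means the main process was killed with SIGKILL. Because kill sends SIGTERM by default and Nginx handles that signal, you may see Exited (0) instead. Either way, it confirms the main process ended from inside the container, not that you ran docker stop on the host.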
Restart delay vs browser: Watch how many seconds pass between Exited and Up in docker ps -a; that interval is Docker applying --restart always. That's separate from what you see in the browser: the browser only shows whether something is accepting connections on port 80 on the VM, so it may show an error or blank page during the gap even while Docker is about to restart the container.
Content after recovery: After status is Up again, refresh the page. You should see the same HTML as before. That shows your content lives on the VM disk (mounted into the container with -v), not inside a file that vanishes when the container process restarts. The process was replaced, not your index.html on the host path.
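
Exit codes above 128 follow a common convention: the code is 128 plus the number of the signal that ended the process, so 137 is 128 + 9 (SIGKILL). Here is a minimal sketch of that arithmetic. The helper name describe_exit_code is hypothetical, and the signal names come from Python's signal module on a Unix-like host.

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Not a signal number this platform knows about
            name = f"signal {number}"
        return f"terminated by {name}"
    return f"exited with application error code {code}"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```

You can read the raw number for a container with sudo docker inspect webapp --format '{{.State.ExitCode}}' on the web VM and pass it to the function.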

Break 2: Cause a Container Name Conflict

On a single Docker daemon (here, on your web VM), a container name is a unique label. Two containers can't share the same name, whether they are running or stopped. Scripts and playbooks that always use docker run --name webapp without cleaning up first hit this error constantly, and recognizing it saves time in real work.

Before you start: Ansible already created one container named webapp.
Stay on the web VM (for example still inside vagrant ssh web) so the commands below run where that container lives.

So now, try to start a second container and also call it webapp. The image is plain nginx here on purpose – the point is the name clash, not matching your site’s ports or volume mounts.

sudo docker run -d --name webapp nginx

What actually happens here is that Docker doesn't create a second container. It returns an error immediately. Your original webapp is unchanged.

This is because the name webapp is already registered to the existing container (the error shows that container’s ID). Docker refuses to reuse the name until the old container is removed or renamed.

Example error (your ID will differ):

docker: Error response from daemon: Conflict. The container name "/webapp" is already in use by container "2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.

To fix it, free the name, then create webapp again the same way the playbook does (publish port 80, mount your HTML, restart policy):

sudo docker rm -f webapp
sudo docker run -d --name webapp --restart always -p 80:80 \
  -v /var/www/html:/usr/share/nginx/html:ro nginx

After that, your site should behave as before (refresh http://localhost:8080 from your laptop).

What to observe:

Read Docker’s Conflict message end to end. You should see that the name /webapp is already in use, along with the ID of the container that holds it. In production, that pattern means something already claimed this name: remove it, rename it, or pick a different name before you run docker run again.
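
If a script wraps docker run, it can recognize this specific failure instead of treating every error the same way. The sketch below pulls the name and container ID out of the message shown above. The pattern is written against that example text only, so treat it as an assumption: the wording may differ between Docker versions, and parse_name_conflict is a hypothetical helper name.

```python
import re

# Pattern based on the example error text in this section
CONFLICT_PATTERN = re.compile(
    r'The container name "/(?P<name>[^"]+)" is already in use '
    r'by container "(?P<id>[0-9a-f]+)"'
)


def parse_name_conflict(stderr_text):
    """Return (name, container_id) for a Docker name conflict, else None."""
    match = CONFLICT_PATTERN.search(stderr_text)
    if match is None:
        return None
    return match.group("name"), match.group("id")


message = (
    'docker: Error response from daemon: Conflict. The container name "/webapp" '
    "is already in use by container "
    '"2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". '
    "You have to remove (or rename) that container to be able to reuse that name."
)
conflict = parse_name_conflict(message)
if conflict:
    name, container_id = conflict
    print(f"name {name!r} is held by container {container_id[:12]}")
```

The short 12-character ID printed at the end is the same form docker ps shows in its CONTAINER ID column.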

Break 3: Make Ansible Fail to Reach a VM

Ansible separates “could not connect” from “connected, but a task broke.” The first is UNREACHABLE, the second is FAILED. Knowing which one you have tells you whether to fix network / SSH or playbook / packages / permissions.

On your laptop, in the project folder, edit inventory and change the web server address from 192.168.33.10 to an IP no VM uses, for example 192.168.33.99. Save the file.

[webservers]
192.168.33.99 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

What you run (from the same project folder on the host):

ansible-playbook -i inventory playbook.yml

After this, Ansible tries to SSH to 192.168.33.99. Nothing on your lab network answers as that host (or SSH never succeeds), so Ansible never runs tasks on the web server. It stops that host with UNREACHABLE:

fatal: [192.168.33.99]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh"}

This is realistic because the same message shape appears when the IP is wrong, the VM isn't running, a firewall blocks port 22, or the network is misconfigured. The common thread is no working SSH session.

Now it's time to put it back: restore 192.168.33.10 in inventory and run ansible-playbook -i inventory playbook.yml again. The web play should reach the VM and complete (assuming your lab is up).

UNREACHABLE vs FAILED – what to observe:

If Ansible prints UNREACHABLE, you should assume it never opened SSH on that host and never ran tasks there. Go ahead and fix the connection (IP, VM up, firewall, key path) before you debug playbook logic.
If Ansible prints FAILED, you should assume SSH worked and a task returned an error. Read the task output for the real cause (package name, permissions, syntax), not the network first.

When you debug later, you should look at the keyword Ansible prints: UNREACHABLE points to reachability while FAILED points to task output and the first failed task under that host.
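
The same triage can be applied to saved playbook output, for example in a CI log. This is a rough sketch that looks only for the fatal lines in the shape shown above. The function names classify_failures and next_step are hypothetical, and the suggestions simply restate the guidance in this section.

```python
import re

# Matches lines such as: fatal: [192.168.33.99]: UNREACHABLE! => {...}
FATAL_LINE = re.compile(r"^fatal: \[(?P<host>[^\]]+)\]: (?P<kind>UNREACHABLE|FAILED)!")


def classify_failures(output):
    """Map each host to 'UNREACHABLE' or 'FAILED' from ansible-playbook output."""
    results = {}
    for line in output.splitlines():
        match = FATAL_LINE.match(line.strip())
        if match:
            results[match.group("host")] = match.group("kind")
    return results


def next_step(kind):
    """Suggest where to look first for each kind of failure."""
    if kind == "UNREACHABLE":
        return "check IP, VM state, firewall, and SSH key path"
    return "read the failed task's output (package name, permissions, syntax)"


sample = 'fatal: [192.168.33.99]: UNREACHABLE! => {"msg": "Failed to connect to the host via ssh"}'
for host, kind in classify_failures(sample).items():
    print(f"{host}: {kind} -> {next_step(kind)}")
```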

Break 4: Fill the VM's Disk

Databases and other services need free disk for logs, temp files, and data. When the filesystem is full or nearly full, a service may fail to start or fail at runtime. This break walks through the same diagnosis habit you would use on a real server: check space, then read systemd and journal output for the service.

All commands below run on the db VM after vagrant ssh db. MariaDB was installed there by your playbook.

What you do:

Open a shell on the db VM:
```
vagrant ssh db
```
Allocate a large file full of zeros (here 1GB) to simulate something eating disk space:
```
sudo dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

df -h
```
Use df -h to see how full the root filesystem (or relevant mount) is. Your Vagrant disk may be large enough that 1GB only raises usage. If MariaDB still starts, you still practiced the checks. To see a stronger effect, you can repeat with a larger count= only in a lab (never fill production disks on purpose without a plan).
Ask systemd to restart MariaDB and show status:
```
sudo systemctl restart mariadb
sudo systemctl status mariadb
```
If the disk is critically full, restart may fail or the service may show failed or not running.
If something looks wrong, read recent logs for the MariaDB unit:
```
sudo journalctl -u mariadb --no-pager | tail -20
```
Errors often mention disk, space, read-only filesystem, or InnoDB being unable to write.
Clean up so your VM stays usable:
```
sudo rm /tmp/bigfile
```
Optionally run sudo systemctl restart mariadb again and confirm it is active (running).

What to observe:

You should use df -h first to confirm whether the filesystem is actually tight. That avoids blaming the database when disk space is fine.
You should read systemctl status mariadb to see whether systemd thinks the service is active, failed, or flapping.
You should read journalctl -u mariadb when status is bad, so you can tie the failure to concrete errors from MariaDB or the kernel (often mentioning disk, space, or read-only filesystem). Space + status + logs is the same order you would use on a production server.
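
The first check, disk space, is easy to automate. Here is a minimal sketch using Python's shutil.disk_usage. The helper names and the 90% threshold are arbitrary choices for illustration, not values from this lab.

```python
import shutil


def percent_used(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def disk_is_tight(path="/", threshold=90.0):
    """True when usage is at or above the threshold (90% is only an example)."""
    return percent_used(path) >= threshold


print(f"root filesystem: {percent_used('/'):.1f}% used, tight={disk_is_tight('/')}")
```

The percentage may differ slightly from the Use% column in df -h, because df bases its figure on the space available to unprivileged users rather than on the raw total.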

Break 5: Run Minikube Out of Resources

Kubernetes schedules pods onto nodes that have enough CPU and memory. If you ask for more than the cluster can place, some pods stay Pending and Events explain why (for example Insufficient cpu). That is not the same as a pod that starts and then crashes.

To do this, you'll need a local cluster (we're using Minikube in this guide) and kubectl on your laptop. This break doesn't use the Vagrant VMs. If you haven't installed Minikube yet, complete the "How to Set Up Kubernetes" section first, or skip this break until you do.

You'll run this on your Mac, Linux, or Windows terminal (host), not inside vagrant ssh. If you're still inside a VM, type exit until your prompt is back on the host.

What you do:

Check Minikube:
```
minikube status
```
If it's stopped, start it (Docker driver matches earlier sections):
```
minikube start --driver=docker
```
Create a deployment with many replicas so your single Minikube node can't run them all at once:
```
kubectl create deployment stress --image=nginx --replicas=20

# Watch pods start
kubectl get pods -w
```
Press Ctrl+C when you're done watching. Some pods may stay Pending while others are Running.
Pick one Pending pod name from kubectl get pods and inspect it:
```
kubectl describe pod <pod-name>
```
Under Events, look for FailedScheduling and a line similar to:
```
Warning  FailedScheduling  0/1 nodes are available: 1 Insufficient cpu.
```
You might see Insufficient memory instead, depending on your machine.
Fix the lab by scaling back so the cluster can catch up:
```
kubectl scale deployment stress --replicas=2
```
You can delete the deployment entirely when finished: kubectl delete deployment stress.

What to observe:

You should see Pending pods stay unscheduled until capacity frees up. That means the scheduler hasn't placed them on any node yet, usually because the node is out of CPU or memory for that workload.
You should read kubectl describe pod and scroll to Events. Messages like Insufficient cpu or Insufficient memory mean the cluster ran out of schedulable capacity, not that the container image is corrupt.
You should contrast that with a pod that reaches Running and then CrashLoopBackOff, which usually means the process inside the container keeps exiting. That is an application or config problem, not a “nowhere to run” problem.
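
To make the distinction concrete, here is a small sketch that extracts the scheduling reasons from an event message like the one above and separates "nowhere to run" from "keeps crashing". The function names are hypothetical, and the parsing assumes the message shape shown in this section.

```python
import re


def scheduling_reasons(event_message):
    """Extract {reason: node_count} from a FailedScheduling event message."""
    # Matches fragments such as "1 Insufficient cpu" or "2 Insufficient memory"
    found = re.findall(r"(\d+) (Insufficient [a-z\-]+)", event_message)
    return {reason: int(count) for count, reason in found}


def diagnose(pod_status, event_message=""):
    """Rough triage: capacity problem vs. application problem."""
    if pod_status == "Pending" and scheduling_reasons(event_message):
        return "nowhere to run: scale down or add capacity"
    if pod_status == "CrashLoopBackOff":
        return "process keeps exiting: check logs and config"
    return "no diagnosis from status alone"


message = "0/1 nodes are available: 1 Insufficient cpu."
print(scheduling_reasons(message))
print(diagnose("Pending", message))
print(diagnose("CrashLoopBackOff"))
```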

What You Can Now Do

You didn't just install tools in this tutorial. You also used them.

You can now spin up two servers from a single file. You can write a playbook that installs software and deploys a container without touching either machine manually.

You can serve a page you wrote from inside a Docker container running on a Vagrant VM, and bring the whole thing back from scratch in one command.

You also broke it. You saw what a container conflict looks like, what Ansible prints when it can't reach a machine, what disk pressure does to a running service, and what a Kubernetes scheduler says when it runs out of resources. Those error messages aren't unfamiliar anymore.

That's the difference between someone who has read about DevOps and someone who has run it.

Here are four free projects you can run in this same lab to go further:

DevOps Home-Lab 2026 — Build a multi-service app (frontend, API, PostgreSQL, Redis) end-to-end with Docker Compose, Kubernetes, Prometheus/Grafana monitoring, GitOps with ArgoCD, and Cloudflare for global exposure.
KubeLab — Trigger real Kubernetes failure scenarios, pod crashes, OOMKills, node drains, cascading failures, and watch how the cluster responds using live metrics.
K8s Secrets Lab — Build a full secret management pipeline from AWS Secrets Manager into your cluster, including rotation behavior and IRSA.
DevOps Troubleshooting Toolkit — Structured debugging guides across Linux, containers, Kubernetes, cloud, databases, and observability with copy-paste commands for real incidents.

All free and open source: github.com/Osomudeya/List-Of-DevOps-Projects.

If you want to go deeper, you can find six full chapters covering Terraform, Ansible, monitoring, CI/CD, and a simulated three-VM production environment at Build Your Own DevOps Lab.

How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator

Osomudeya Zudonu — Thu, 26 Mar 2026 14:25:52 +0000

If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?

Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.

In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.

By the end, you'll be able to:

Explain the full architecture from vault to pod
Run the lab locally in about 15 minutes
Prove why environment variables go stale after rotation, while mounted secret files stay fresh
Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD
Troubleshoot the most common failures

Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.

Prerequisites
How to Understand the Secret Flow
How to Run the Local Lab
How to Inspect the ExternalSecret and the Application
How to Test Secret Rotation
How to Choose Between External Secrets Operator and the CSI Driver
How to Deploy the Pattern on Amazon Elastic Kubernetes Service
How to Configure GitHub Actions Without Stored AWS Credentials
How to Troubleshoot the Most Common Failures
Conclusion

Prerequisites

Before you begin, make sure you have the following tools installed and configured.

For the local lab:

An AWS account with access to AWS Secrets Manager
The AWS CLI installed and configured. Run aws configure and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.
kubectl installed. For MicroK8s, run microk8s kubectl config view --raw > ~/.kube/config after installation to connect kubectl to your local cluster.
Terraform installed
Helm installed
Docker installed
A local Kubernetes cluster: the lab supports MicroK8s and kind. If you do not have either installed, follow the MicroK8s install guide before continuing.

For the Amazon Elastic Kubernetes Service sections:

An Amazon Elastic Kubernetes Service cluster you can create or manage
A GitHub repository you can configure for workflows and secrets

The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's docs/DEPLOY-LOCAL.md and docs/DEPLOY-EKS.md.

How to Understand the Secret Flow

Before you run any command, you need to understand how the pieces connect.

The flow has four stages:

A developer or automated system updates a secret in AWS Secrets Manager.
The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.
Your pod reads that Kubernetes Secret.
During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.

How the External Secrets Operator Sync Works

The External Secrets Operator reads a custom Kubernetes resource called ExternalSecret. That resource tells the operator three things:

Which secret store to connect to
Which Kubernetes Secret name to create or update
How often to refresh

In this lab, the ExternalSecret creates a Kubernetes Secret named myapp-database-creds. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.

How the App Consumes Secrets

The sample application exposes three endpoints so you can validate behavior at any time.

/secrets/env shows what environment variables the pod sees
/secrets/volume shows what files in the mounted secret directory look like
/secrets/compare compares both and reports whether rotation has been detected

The app checks four keys: DB_USERNAME, DB_PASSWORD, DB_HOST, and DB_PORT.

How to Run the Local Lab

The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.

Step 1: Clone the Repo

git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab

Step 2: Run the Spin-Up Script

bash spinup.sh

The script will ask you to choose a local cluster type. Pick MicroK8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.

If the script fails at any point, check docs/TROUBLESHOOTING.md before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a MicroK8s storage add-on that is not enabled.

Important: Run the Lab UI

The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application. It's a React-based checklist at lab-ui/ that walks you through each concept and checkpoint as you work through the lab.

To start it, open a second terminal and run:

cd lab-ui && npm install && npm run dev

Then open http://localhost:5173. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.

Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (localhost:3000) are two separate things: the UI guides you through the steps, and the app shows you the live secrets.

Step 3: Access the Application

Once the lab finishes, port-forward the service.

kubectl port-forward svc/myapp 3000:80 -n default

Open http://localhost:3000. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.

Step 4: Validate That Secrets Match

Run the compare endpoint directly from the terminal.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

When everything is working, the response will include "all_match": true.

How to Inspect the ExternalSecret and the Application

At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.

Step 1: Read the ExternalSecret Manifest

Open k8s/aws/external-secret.yaml. Focus on these four fields:

refreshInterval: how often the operator polls AWS Secrets Manager
secretStoreRef: which store the operator authenticates against
target: the name of the Kubernetes Secret to create
data: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys

Here is what that mapping looks like in this lab:

spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username

The property field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.

Two fields here are worth understanding before you move on. creationPolicy: Owner means the operator owns the Kubernetes Secret it creates. If you delete the ExternalSecret, the Secret is deleted too. ClusterSecretStore is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain SecretStore is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.
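
One way to picture the data mapping is as a lookup into a JSON document. The sketch below is a conceptual model only, not how the operator is implemented: build_secret_data is a hypothetical function, the secret contents are invented, and the DB_PORT entry is an assumed second mapping added for illustration.

```python
import json


def build_secret_data(remote_secrets, data_entries):
    """Resolve ExternalSecret-style data entries against JSON secret strings.

    remote_secrets: {remote key: JSON string}, standing in for AWS Secrets Manager.
    data_entries:   list of {"secretKey", "remoteRef": {"key", "property"}} dicts.
    """
    result = {}
    for entry in data_entries:
        ref = entry["remoteRef"]
        payload = json.loads(remote_secrets[ref["key"]])
        # Kubernetes Secret values are strings, so numbers are converted
        result[entry["secretKey"]] = str(payload[ref["property"]])
    return result


# Hypothetical secret contents; the real values live in AWS Secrets Manager.
remote = {"prod/myapp/database": json.dumps({"username": "app_user", "port": 5432})}
entries = [
    {"secretKey": "DB_USERNAME",
     "remoteRef": {"key": "prod/myapp/database", "property": "username"}},
    {"secretKey": "DB_PORT",
     "remoteRef": {"key": "prod/myapp/database", "property": "port"}},
]
print(build_secret_data(remote, entries))
```

Each data entry reads one property from the remote JSON and writes it under secretKey, which is why a JSON secret with four fields needs four entries.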

Step 2: Read the Deployment Manifest

Open k8s/aws/deployment.yaml. You are looking for two sections: envFrom and volumeMounts.

envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true

Both paths read from the same Kubernetes Secret, myapp-database-creds. The envFrom block injects all keys as environment variables at pod start.
The volumeMounts block mounts the same secret as files under /etc/secrets.

This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.

Step 3: Read the App Comparison Logic

Open app/server.js. The comparison logic reads environment variables from process.env and reads mounted secret files from /etc/secrets/. Then it computes a per-key match and a global all_match value.

The /secrets/compare endpoint sets rotation_detected: true when any key differs between env and volume.
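
The lab's implementation lives in app/server.js. To show the idea without reproducing that file, here is a Python sketch of the same comparison under a few assumptions: each key is stored as a file named after the key, values are compared as exact strings, and compare_secrets is a hypothetical function name.

```python
from pathlib import Path

KEYS = ("DB_USERNAME", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def compare_secrets(env, secrets_dir, keys=KEYS):
    """Compare env-var values with same-named files in a mounted secret directory."""
    comparison = {}
    for key in keys:
        env_value = env.get(key)
        file_path = Path(secrets_dir) / key
        volume_value = file_path.read_text() if file_path.exists() else None
        comparison[key] = {
            "env": env_value,
            "volume": volume_value,
            "match": env_value == volume_value,
        }
    all_match = all(item["match"] for item in comparison.values())
    return {
        "comparison": comparison,
        "all_match": all_match,
        # Any difference between env and volume is reported as a rotation
        "rotation_detected": not all_match,
    }
```

Inside a pod you would call compare_secrets(os.environ, "/etc/secrets") after importing os. Passing the environment in as an argument keeps the function easy to test with plain dictionaries.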

How to Test Secret Rotation

Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.

How the Rotation Gap Works

When a pod starts, Kubernetes gives it two ways to read a secret.

The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.

The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and the External Secrets Operator (ESO) syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. As long as the app reads the file each time it needs the value, it sees the new password automatically.

Same secret, two paths. One goes stale while one stays fresh.

The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.

That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.
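
The sticky note versus shared folder behavior can be reproduced without a cluster. A minimal sketch, assuming a temporary file stands in for /etc/secrets/DB_PASSWORD and that the "rotation" is just a rewrite of that file; the helper names are invented for the example.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-in for the mounted secret file.
secret_file = Path(tempfile.mkdtemp()) / "DB_PASSWORD"
secret_file.write_text("old-password")

# At "pod start" the value is copied into the process environment once.
os.environ["DB_PASSWORD"] = secret_file.read_text()


def password_from_env():
    # The sticky note: whatever was written at startup.
    return os.environ["DB_PASSWORD"]


def password_from_volume():
    # The shared folder: read from disk on every call.
    return secret_file.read_text()


# "Rotation": the file is rewritten, as the kubelet would do after ESO syncs.
secret_file.write_text("new-password")

print(password_from_env())     # old-password
print(password_from_volume())  # new-password
```

Nothing in the process ever rewrites os.environ, which is why only a restart brings the two values back in line.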

Here is what you're about to observe in the lab:

The rotation script updates the secret in AWS
ESO syncs the new value into Kubernetes within seconds
The volume file updates automatically
The environment variable stays stale until the pod restarts
The /secrets/compare endpoint shows both values side by side so you can see the gap live

Step 1: Confirm the Lab Is Ready

Make sure your pod and the External Secrets Operator are both running before you start.

kubectl get pods -n external-secrets
kubectl get pods -n default

Both should show Running.

Step 2: Run the Rotation Test Script

bash rotation/test-rotation.sh

The script performs these actions in order:

Reads the current DB_PASSWORD from the volume mount at /etc/secrets/DB_PASSWORD
Reads the current DB_PASSWORD from the environment variable
Updates AWS Secrets Manager with a new password using put-secret-value
Forces an immediate ESO sync by annotating the ExternalSecret with force-sync
Reads the volume value again
Reads the environment variable again

After the script runs, the volume and the env var will show different values.

Step 3: Validate With the Compare Endpoint

Hit the compare endpoint and look at the output.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

You'll see something like this:

{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}

Step 4: Restart the Deployment to Sync Env Vars

Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.

kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default

Then hit /secrets/compare again. Every key should now show "match": true, and the response should report "all_match": true.

How to Automate Restarts With Reloader

If you don't want to restart deployments manually after every rotation, you can install Stakater Reloader. It watches Secrets and ConfigMaps, and when a Secret referenced by an annotated Deployment changes, it triggers a rolling restart automatically. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.

How to Choose Between External Secrets Operator and the CSI Driver

Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the Secrets Store CSI Driver.

Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:

Feature	External Secrets Operator	Secrets Store CSI Driver
Creates a Kubernetes Secret	Yes	No by default
Supports `envFrom`	Yes	No (workaround only)
Secret stored in etcd	Yes (base64)	No, if you skip sync
Rotation	ESO updates the Secret, Reloader restarts pods	Volume file can update in place
Best for	Most teams. Multi-cloud, env var support	Security policies that prohibit secrets in etcd

This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both envFrom and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.

Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native envFrom model.

How to Deploy the Pattern on Amazon Elastic Kubernetes Service

The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.

Step 1: Prepare Terraform and OpenID Connect Access

The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the terraform/github-oidc folder.

cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn

Copy the role ARN from the output. You'll need it in the next step.

Step 2: Set the Required Environment Variable

The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.

To find your AWS account ID, run:

aws sts get-caller-identity --query Account --output text

Then set the variable, replacing ACCOUNT with the number that command returns.

export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role

Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service

bash spinup.sh --cluster eks

When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing Match ✓.

Step 4: Test Rotation on the Deployed App

After you confirm normal operation, run the rotation test the same way you did locally.

bash rotation/test-rotation.sh

Then use /secrets/compare on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.

⚠️ Cost warning: Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run bash teardown.sh from the repo root to destroy all AWS resources and stop charges.

How to Configure GitHub Actions Without Stored AWS Credentials

The typical CI/CD setup stores AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in GitHub repository secrets. Those keys never rotate unless someone rotates them by hand. Anyone who can push a workflow to the repo can use them, and a careless workflow can leak them. When someone leaves the team, you have to revoke keys and update every workflow.

OpenID Connect eliminates that problem entirely.

How OpenID Connect Works for GitHub Actions

GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via AssumeRoleWithWebIdentity. No long-lived keys are ever stored anywhere.
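
A trust policy of that shape can be sketched as follows. This is an illustration, not the policy from the repo's terraform/github-oidc folder, which may differ: the function name, the placeholder account ID, and the choice to pin the main branch are assumptions.

```python
import json

# GitHub's OIDC issuer host, used as the provider name in IAM.
PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, repo, branch="main"):
    """Build an IAM trust policy that only accepts tokens from one repo and branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{PROVIDER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience set by the official AWS credentials action.
                        f"{PROVIDER}:aud": "sts.amazonaws.com",
                        # Subject claim: which repository and ref the run came from.
                        f"{PROVIDER}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder account ID and repo name, for illustration only.
policy = github_oidc_trust_policy("123456789012", "YOUR_ORG/YOUR_REPO")
print(json.dumps(policy, indent=2))
```

The sub condition is what stops a workflow in any other repository or branch from assuming the role.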

Step 1: Create the IAM Role With Terraform

The terraform/github-oidc folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.

Step 2: Add the Role ARN to GitHub Repository Secrets

In your GitHub repository:

Go to Settings → Secrets and variables → Actions
Click New repository secret
Name it AWS_ROLE_ARN
Paste the role ARN from the Terraform output

That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.

Step 3: Configure Terraform State

For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.

Step 4: Push to Main and Let Workflows Run

After your first spin-up, every push to the main branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use /secrets/compare to validate rotation behavior on the live environment.

How to Troubleshoot the Most Common Failures

Here's a shortlist of the most common symptoms and their fixes.

Symptom	Most Likely Cause	Fix
`ExternalSecret` is not syncing	Missing credentials or wrong store reference	Confirm the operator can access AWS Secrets Manager and that `secretStoreRef` points to the correct store
Pod is stuck in `Pending`	Missing storage setup for local cluster	For MicroK8s, enable the storage add-on
Env still shows the old value after rotation	Rotation happened but the pod never restarted	Run `kubectl rollout restart` or install Reloader
CRD or API version mismatch	ESO version and manifest `apiVersion` don't match	Verify the `apiVersion` for `ClusterSecretStore` and `ExternalSecret` match your installed ESO version
Amazon Elastic Kubernetes Service node group never joins	Networking or IAM permissions for nodes are wrong	Fix internet routing and review the node IAM policy

How to Inspect the Operator and the ExternalSecret

When something isn't syncing, start with these two commands.

# Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

The status conditions on the ExternalSecret resource will usually tell you exactly what failed.

How to Validate Rotation From the App Side

When you are debugging rotation, don't rely only on Kubernetes resource state. Use the /secrets/compare endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.

Conclusion

You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the ExternalSecret and Deployment manifests, and validated that the application sees the right credentials.

You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.

Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.

The lab repository is at github.com/Osomudeya/k8s-secret-lab. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on MicroK8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.

If this helped you, star the repo and share it with someone who is learning Kubernetes.


How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster

Osomudeya Zudonu — Fri, 06 Mar 2026 14:43:26 +0000

I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet controller fire, an OOMKill from kubectl describe, or watched pod endpoints go empty during a cascading failure. That's where 3 am incidents find you. This tutorial puts you on the other side of it.

You will clone one repo, spin up a real 3-node cluster, break it seven different ways, and watch it fix itself each time. No simulated output or fake clusters. Real Kubernetes, real failures, and real recovery. By the end, you will recognize these failure patterns when they show up in your production environment.

What is KubeLab?
Prerequisites
How to Get the Lab Running
Simulation 1: Kill Random Pod
Simulation 2: Drain a Worker Node
Simulation 3: CPU Stress and Throttling
Simulation 4: Memory Stress and OOMKill
Simulation 5: Database Failure
Simulation 6: Cascading Pod Failure
Simulation 7: Readiness Probe Failure
How to Read the Signals in Grafana
How to Read the Terminal Signals
How to Use This for Production Debugging

What is KubeLab?

KubeLab is an open-source Kubernetes failure simulation lab. It runs a real Node.js backend, a PostgreSQL database, Prometheus and Grafana, all inside a real cluster. When you click "Kill Pod", the backend calls the Kubernetes API and deletes an actual running pod. Nothing is fake.

Simulation	What it teaches
Kill Random Pod	ReplicaSet self-healing, pod immutability
Drain Worker Node	Zero-downtime maintenance, PodDisruptionBudgets
CPU Stress	Throttling vs crashing, invisible latency
Memory Stress	OOMKill, exit code 137, silent restart loops
Database Failure	StatefulSets, PVC persistence
Cascading Pod Failure	Why replicas: 2 isn't enough
Readiness Probe Failure	Liveness vs readiness, traffic control

Plan about 90 minutes for the full path. Or jump directly to any simulation if you have a specific production problem you want to reproduce.

Prerequisites

You need basic familiarity with Docker and comfort with the command line, but no prior Kubernetes experience is required.

Hardware: 8GB RAM minimum, 16GB recommended. The lab can run on Mac, Linux, or Windows with WSL2. You'll need to install three tools: Multipass spins up Ubuntu VMs for the cluster, kubectl is the Kubernetes CLI you will use for every simulation, and Git clones the repo. If you cannot run three VMs, the repo includes a Docker Compose preview at setup/docker-compose-preview.md: full UI with mock data, no real cluster needed.

How to Get the Lab Running

Full cluster setup lives at setup/k8s-cluster-setup.md in the repo. It walks through creating three VMs with Multipass, installing MicroK8s, joining the worker nodes, and deploying KubeLab. Follow it until all eleven pods show Running:

kubectl get pods -n kubelab
# All 11 pods should show STATUS: Running

Then open two port-forwards in separate terminal tabs and keep them running for the entire tutorial:

# Tab 1 — KubeLab UI at http://localhost:8080
kubectl port-forward -n kubelab svc/frontend 8080:80

# Tab 2 — Grafana at http://localhost:3000
kubectl port-forward -n kubelab svc/grafana 3000:3000

Grafana login: admin / kubelab-grafana-2026.

Position the KubeLab UI and Grafana side by side. Left half of the screen is the app. Right half is Grafana. You will watch both simultaneously from Simulation 3 onward.

Simulation 1: Kill Random Pod

This simulation deletes a running backend pod via the Kubernetes API. Without Kubernetes, you would SSH to the server, find the crashed process, and restart it manually, usually after a user alert at 3 am tells you it crashed.

Before you click: Run kubectl get pods -n kubelab -w. Watch for a pod to go Terminating then a new one to appear.

kubectl get pods -n kubelab -w
# backend-abc123  1/1   Terminating   0   2m
# backend-xyz789  1/1   Running       0   0s   ← ReplicaSet created a replacement

What happened: The ReplicaSet controller noticed actual(1) did not match desired(2) and created a replacement in parallel with the shutdown. The Endpoints controller removed the terminating pod from the Service as soon as it was marked for deletion, while the kubelet sent SIGTERM, so little or no traffic hit a dying pod.
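
The desired-versus-actual comparison is the heart of every Kubernetes controller. A toy sketch of one reconciliation pass, assuming pods are just names in a list; the reconcile function and the generated pod names are hypothetical and far simpler than the real ReplicaSet controller.

```python
import itertools

# Counter used to invent names for replacement pods.
_suffix = itertools.count(1)


def reconcile(desired_replicas, running_pods):
    """Return the pod list after one reconciliation pass."""
    pods = list(running_pods)
    # Too few pods: create replacements until actual matches desired.
    while len(pods) < desired_replicas:
        pods.append(f"backend-new{next(_suffix)}")
    # Too many pods: trim back down to the desired count.
    return pods[:desired_replicas]


pods = ["backend-abc123", "backend-def456"]
pods.remove("backend-abc123")   # the simulation deletes one pod
pods = reconcile(2, pods)
print(pods)  # ['backend-def456', 'backend-new1']
```

The controller never "restarts" the deleted pod. It only notices the count is wrong and creates a new one, which is why the replacement has a different name.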

The production trap: A missing readiness probe means the new pod receives traffic before it has opened a DB connection. You get 500s on every deployment for 2–3 seconds.

The fix: Set replicas: 2, add a readiness probe, and set terminationGracePeriodSeconds to match your longest request timeout.

Simulation 2: Drain a Worker Node

This simulation cordons a worker node, then evicts all its pods to the remaining node.

To "cordon" a worker node means to mark it as unschedulable. When you run kubectl cordon <node-name>, the Kubernetes control plane adds the node.kubernetes.io/unschedulable:NoSchedule taint to the node. (A taint is a marker that tells the scheduler to avoid placing pods on that node unless they have a matching "toleration.") This tells the scheduler to stop placing any new pods onto that node. It does not affect the pods that are already running there.

Cordoning is the first, safe step in preparing a node for maintenance. It ensures that while you are draining the node, the scheduler isn't simultaneously trying to schedule new workloads onto it, which would defeat the purpose of the drain.

Without Kubernetes you would drain the server manually, guess when in-flight requests finish, patch it, and bring it back. The window of downtime is unpredictable.

Before you click: Run kubectl get pods -n kubelab -o wide -w. Watch which node each pod runs on.

kubectl get pods -n kubelab -o wide -w

NAME                     NODE               STATUS
backend-abc123-xk2qp    kubelab-worker-1   Terminating   ← evicted
backend-abc123-n7mw3    kubelab-worker-2   Running       ← rescheduled

In kubectl get nodes the node shows Ready,SchedulingDisabled until you run kubectl uncordon.

What happened: The node spec got spec.unschedulable=true. The Eviction API ran per pod. That path goes through PodDisruptionBudget policy checks before proceeding, unlike a raw delete. A raw kubectl delete pod bypasses this check entirely — which is why draining with kubectl drain is always safer than deleting pods manually during maintenance.

The production trap: Two replicas with no pod anti-affinity often land on the same node. Drain that node and both pods evict at once. Complete downtime despite replicas: 2.

The fix: Use pod anti-affinity with topologyKey: kubernetes.io/hostname and a PodDisruptionBudget with minAvailable: 1.
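
The PodDisruptionBudget check that the Eviction API performs boils down to simple arithmetic. A simplified sketch, assuming a budget expressed as minAvailable and ignoring percentages and unhealthy-pod policies; the function name is made up for the example.

```python
def eviction_allowed(healthy_pods, min_available):
    """Would evicting one pod still leave at least min_available healthy pods?"""
    disruptions_allowed = max(healthy_pods - min_available, 0)
    return disruptions_allowed > 0


# Two healthy replicas, minAvailable: 1 -> the first eviction goes through.
print(eviction_allowed(healthy_pods=2, min_available=1))  # True
# One healthy replica left -> the drain waits until the replacement is Ready.
print(eviction_allowed(healthy_pods=1, min_available=1))  # False
```

With minAvailable: 1, a drain can evict one replica but has to wait for its replacement to become Ready before evicting the second.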

Simulation 3: CPU Stress and Throttling

This simulation burns CPU inside a backend pod for 60 seconds, hitting the 200m limit. Without Kubernetes, one runaway process can consume all CPU on the host and starve every other service.

Before you click: Run watch -n 2 kubectl top pods -n kubelab and open the Grafana CPU Usage panel.

kubectl top pods -n kubelab
# backend-abc123   200m   ← pegged at limit for 60s; the other pod stays ~15m

What happened: The Linux CFS scheduler enforces the cgroup limit by granting 20ms of CPU per 100ms period then freezing all processes in the cgroup for 80ms. The pod is not slow because it is broken. It is slow because it is frozen 80% of the time.
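
The arithmetic behind those numbers is easy to check. A minimal sketch, assuming the default 100ms CFS period and a single busy thread; the function names are invented for the example.

```python
CFS_PERIOD_MS = 100  # default CFS period (cpu.cfs_period_us = 100000)


def cfs_quota_ms(cpu_limit_millicores, period_ms=CFS_PERIOD_MS):
    """CPU time the cgroup may use in each period, in milliseconds."""
    return cpu_limit_millicores * period_ms / 1000


def max_frozen_fraction(cpu_limit_millicores, period_ms=CFS_PERIOD_MS):
    """Worst case for one thread that wants the CPU all the time."""
    quota = min(cfs_quota_ms(cpu_limit_millicores, period_ms), period_ms)
    return (period_ms - quota) / period_ms


quota = cfs_quota_ms(200)          # the 200m limit from this simulation
frozen = max_frozen_fraction(200)
print(quota)   # 20.0
print(frozen)  # 0.8
```

Doubling the limit to 400m only raises the quota to 40ms per period, so a busy single-threaded process would still be frozen for more than half of every period.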

The production trap: kubectl top shows the pod using 95-150m, which looks normal. The metric reports CPU usage, which the limit caps, and says nothing about how often the pod was throttled. Teams spend hours checking application code for a latency bug that is actually a CPU limit set too low.

The fix: For latency-sensitive workloads, set CPU requests but remove CPU limits. Requests tell the scheduler where to place the pod without throttling at runtime. Confirm throttling with rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]).

Simulation 4: Memory Stress and OOMKill

This simulation allocates memory in 50MB chunks inside a backend pod until the kernel kills it. Without Kubernetes the process dies, the server goes down, and someone gets paged.

Before you click: Run kubectl get pods -n kubelab -l app=backend -w and open the Grafana Memory Usage panel.

kubectl get pods -n kubelab -l app=backend -w
# backend-abc123   0/1   OOMKilled   3   5m   ← no Terminating phase; SIGKILL bypasses graceful shutdown

What happened: The container's memory usage crossed the 256Mi cgroup limit. The Linux kernel OOM killer scored processes in the container's cgroup and sent SIGKILL (exit code 137) to the top consumer. Not Kubernetes, the kernel. SIGKILL cannot be caught or handled, so no preStop hook runs and in-memory data or open transactions can be lost. Kubernetes only observed the exit, labeled it OOMKilled, and started a fresh container.

The production trap: The pod runs fine for 8 hours, OOMKills, and restarts. Memory resets to zero and everything looks healthy again. This repeats every 8 hours. The restart count climbs to 7, then 15, then 30, but no alert fires because the metrics look normal between crashes. You find out when a user emails saying the app has been "a bit glitchy lately."

The fix: Alert on increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) > 3 before users notice.
The Prometheus expression means: look at how many times each container in the kubelab namespace has restarted over the last hour, and fire an alert if that count exceeds 3. (rate() does not fit this threshold because it returns restarts per second, so 3 restarts per hour is a rate of roughly 0.0008.) A healthy pod rarely restarts. Several restarts in an hour usually means the container is hitting its memory limit, dying, and coming back in a loop, so this alert catches the silent OOMKill pattern before users do.
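
The difference between a per-second rate and a count per hour is worth spelling out, because the two are easy to mix up when choosing a threshold. A rough sketch with hypothetical counter values; it ignores the counter-reset handling and extrapolation that Prometheus applies.

```python
WINDOW_SECONDS = 3600  # the [1h] range in the query


def restart_stats(first_count, last_count, window_seconds=WINDOW_SECONDS):
    """Return (restarts in the window, restarts per second)."""
    increase = last_count - first_count
    return increase, increase / window_seconds


# Hypothetical counter values: 7 restarts an hour ago, 11 now.
increase, per_second = restart_stats(7, 11)

# "3 restarts per hour" expressed as a per-second rate.
per_second_threshold = 3 / WINDOW_SECONDS

print(increase)                           # 4
print(per_second > per_second_threshold)  # True
```

Four restarts in an hour is a per-second rate of about 0.0011, so a threshold written for restarts per hour has to be divided by 3600 before it is compared with a per-second value.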

Confirm it happened:

kubectl describe pod -n kubelab <pod-name> | grep -A 5 "Last State:"
# Reason: OOMKilled
# Exit Code: 137

To see the last output before the kernel killed the process, run kubectl logs -n kubelab <pod-name> --previous. The log stream stops abruptly with no shutdown message because SIGKILL leaves no time for cleanup or final logs.

Simulation 5: Database Failure

This simulation scales the PostgreSQL StatefulSet to 0 replicas. The pod terminates completely. Without Kubernetes, the database server crashes and data recovery depends on whether backups exist and when they ran.

Before you click: Run kubectl get pods,pvc -n kubelab. Note that the PVC exists before you start.

kubectl get pods,pvc -n kubelab
# postgres-0   (gone)
# postgres-data-postgres-0   Bound   ← PVC stays; data lives on the volume

A PVC, or PersistentVolumeClaim, is a request for storage by a user. Think of it as a pod's way of saying, "I need a certain amount of durable, persistent storage." In the context of a stateful application like PostgreSQL, the PVC is critical. When the database pod is deleted, the PVC (and the underlying PersistentVolume it is bound to) remains. This is where the actual database files are stored. When a new postgres-0 pod is created, the StatefulSet knows to re-attach the same PVC, ensuring the new pod has access to all the old data, preventing data loss.

What happened: The StatefulSet controller deleted the pod but left the PersistentVolumeClaim untouched. StatefulSets guarantee stable names and stable PVC binding. postgres-0 always mounts postgres-data-postgres-0. When you restore, the same pod name comes back and reattaches the same volume. PostgreSQL replays WAL to reach a consistent state.

The production trap: Apps without connection retry logic return 500s and stay broken even after PostgreSQL restores. Connection pools that do not validate on acquire hold dead connections forever.

The fix: Add connection retry with exponential backoff in your app. Use network-attached storage (EBS, GCE PD) in production so the pod can reschedule to any node.
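
Retry with exponential backoff can be sketched independently of any database driver. In this minimal example, connect_with_retry, fake_connect, and the delay values are assumptions for illustration; a real client would also add jitter and validate pooled connections before reuse.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(min(base_delay * 2 ** attempt, max_delay))


# Demo with a fake connect() that fails twice, as if PostgreSQL were still starting.
attempts = {"count": 0}
delays = []


def fake_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database not ready")
    return "connected"


# Record the delays instead of sleeping so the demo runs instantly.
result = connect_with_retry(fake_connect, sleep=delays.append)
print(result, delays)  # connected [0.5, 1.0]
```

The cap on the delay matters in long outages: without it, the wait between attempts keeps doubling and the app takes far longer than necessary to notice that the database is back.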

Simulation 6: Cascading Pod Failure

This simulation deletes both backend replicas at the same time. Without Kubernetes, if everything went down, you'd have to restart every service manually and hope they come up in the right order.

Before you click: Run kubectl get endpoints -n kubelab backend-service -w. Watch the IP list.

kubectl get endpoints -n kubelab backend-service -w
# ENDPOINTS   <none>   ← every request in this window gets Connection refused

What happened: Both pods were deleted. The Service had zero endpoints. The ReplicaSet created two replacements in parallel, but traffic stayed broken until at least one of them passed its readiness probe. The endpoint list went empty and came back. You can see the exact downtime window in Grafana's HTTP Request Rate panel.

The production trap: replicas: 2 protects you from one pod dying at a time, nothing more.
If both replicas land on the same node and that node goes down, you have zero replicas and full downtime.
Check right now with kubectl get pods -n kubelab -o wide | grep backend, and if both pods show the same NODE, you are one node failure away from an outage.

The fix: Use pod anti-affinity to force replicas onto different nodes and a PodDisruptionBudget with minAvailable: 1 to block any voluntary action that would leave zero replicas.
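
The same-node check can also be automated. A small sketch that assumes you have already parsed the output of kubectl get pods -n kubelab -l app=backend -o json into a dictionary; the function name and the sample data are hypothetical.

```python
from collections import Counter


def nodes_with_multiple_replicas(pod_list):
    """Return the nodes that host more than one pod from the given pod list."""
    counts = Counter(item["spec"]["nodeName"] for item in pod_list["items"])
    return sorted(node for node, count in counts.items() if count > 1)


# Hypothetical sample: both backend replicas landed on the same worker.
sample = {
    "items": [
        {"metadata": {"name": "backend-abc123"}, "spec": {"nodeName": "kubelab-worker-1"}},
        {"metadata": {"name": "backend-def456"}, "spec": {"nodeName": "kubelab-worker-1"}},
    ]
}

crowded = nodes_with_multiple_replicas(sample)
print(crowded)  # ['kubelab-worker-1']
```

An empty list means every replica sits on its own node, which is the outcome the anti-affinity rule is meant to enforce.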

Simulation 7: Readiness Probe Failure

This simulation makes one backend pod fail its readiness probe for 120 seconds without restarting it. Without Kubernetes, you'd have no way to take a pod out of traffic rotation without killing it. This is what happens in production when your app connects to a database on startup but the DB is slow. The pod is alive, but it's not ready. Kubernetes holds it out of rotation until it is.

Before you click: Run kubectl get pods -n kubelab -w in one tab and kubectl get endpoints -n kubelab backend-service -w in another.

# Pods tab: STATUS Running, RESTARTS 0 — almost nothing changes
# Endpoints tab: one IP disappears — the pod is alive but not receiving traffic

What happened: /ready returned 503. The kubelet marked the pod Ready=False. The Endpoints controller removed its IP from the Service. The liveness probe (/health) still returned 200, so no restart. After 120 seconds /ready recovered and the pod rejoined. Run kubectl logs -n kubelab <pod-name> -f to see the app log 503s for the readiness endpoint while the pod stays Running and receives no traffic.

The production trap: Readiness probes that check external dependencies (database, cache, downstream API) will remove all pods from rotation when that dependency goes down. Instead of degrading gracefully, your entire app goes offline.

The fix: Readiness probes should test only what the pod itself controls. Use a separate deep health endpoint for dependency checks and never tie readiness to external service availability.
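
The split between a shallow readiness check and a deep health check might look like this. A sketch only: the state dictionary and both handler names are assumptions, and a real app would wire these functions to HTTP routes.

```python
def readiness(state):
    """Readiness: only what the pod itself controls (has it finished starting?)."""
    return 200 if state["started"] else 503


def deep_health(state):
    """Deep health: reports dependencies for dashboards and alerts, not for the kubelet."""
    deps = {"database": state["db_reachable"], "cache": state["cache_reachable"]}
    status = 200 if all(deps.values()) else 503
    return status, deps


# The database is down, but the pod itself is fine.
state = {"started": True, "db_reachable": False, "cache_reachable": True}

print(readiness(state))    # 200
print(deep_health(state))  # (503, {'database': False, 'cache': True})
```

With this split, a database outage shows up in monitoring through the deep endpoint while the pods stay in rotation and can still serve whatever does not need the database.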

How to Read the Signals in Grafana

kubectl shows current state. Grafana shows what happened over time. That history is essential when you are debugging something that started 4 hours ago.

The Four Panels that Matter

Pod Restarts: A flat line is good. A step up every few hours is a silent OOMKill loop — the most common invisible production failure.

CPU Usage: A healthy pod's CPU fluctuates. A throttled pod's CPU is unnaturally flat at its limit. That flat ceiling is the signal, not the number.

Memory Usage: Watch for a line that climbs steadily then disappears. That disappearance is an OOMKill. The line reappearing from zero is the restart.

HTTP Request Rate: During Cascading Failure you see a spike of 5xx for 5–15 seconds, the exact downtime window, timestamped.

How to Read the Terminal Signals

What you see in the terminal during and after each simulation tells you things Grafana cannot. Five commands matter.

The -w flag on kubectl get pods -n kubelab -w streams changes in real time. The columns that matter are READY, STATUS, and RESTARTS. READY shows containers ready vs total: 1/2 means only one of the pod's two containers is ready, and the other is not passing its readiness probe. STATUS shows where the pod is in its lifecycle: Running, Pending, Terminating, OOMKilled. RESTARTS is the most important column in production. A number climbing silently over days is a memory leak or a crash loop nobody has noticed yet.

kubectl get events -n kubelab --sort-by=.lastTimestamp is the control plane's diary. Every action the cluster took is here: Killing, SuccessfulCreate, Scheduled, Pulled, Started, OOMKilling, BackOff. When something breaks and you do not know why, read the events. The timestamp gap between a Killing event and the next Started event is your actual downtime window — not an estimate, the exact number.

kubectl describe pod -n kubelab <pod-name> is the deepest single-pod view. Three sections matter: Conditions (Ready: True/False tells you if the pod is in the Service endpoints), Last State (shows the previous container's exit reason: OOMKilled, exit code 137, or a crash), and Events at the bottom (the scheduler's reasoning for every placement decision). This is the first command to run when a pod is misbehaving.

kubectl get endpoints -n kubelab backend-service shows which pod IPs are actually receiving traffic right now. A pod can show Running in kubectl get pods and be completely absent from this list. That is a readiness probe failure. If this list is empty, no request to that Service will succeed regardless of how many pods show Running. Check this whenever users report errors but pods look healthy.

kubectl logs -n kubelab <pod-name> shows the container's stdout and stderr. Use -f to follow the stream. After a pod restarts, use --previous to see the logs from the container that just exited, which is essential when you need to know what the app was doing right before an OOMKill or crash. Logs are per container and are gone once the pod is replaced, so grab them before the ReplicaSet creates a new pod with a new name.

A full event sequence during Kill Pod recovery looks like this:

kubectl get events -n kubelab --sort-by=.lastTimestamp | tail -10

REASON            MESSAGE
Killing           Stopping container backend          ← SIGTERM sent
SuccessfulCreate  Created pod backend-xyz789          ← ReplicaSet fired
Scheduled         Successfully assigned to worker-2   ← Scheduler placed it
Pulled            Container image already present     ← no pull delay
Started           Started container backend           ← running

The time between the Killing and Started events is your actual recovery time. In a healthy cluster with a cached image it is 3–8 seconds. If it takes longer, check the Scheduled line: the scheduler may have struggled to find a node.
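
Measuring that window from event timestamps is a one-function job. A sketch with invented timestamps, assuming the events have already been reduced to (timestamp, reason) pairs sorted oldest first; recovery_seconds is a hypothetical helper, not part of the repo.

```python
from datetime import datetime


def recovery_seconds(events):
    """Seconds between the first Killing event and the next Started event."""
    killing = next(
        datetime.fromisoformat(ts) for ts, reason in events if reason == "Killing"
    )
    started = next(
        datetime.fromisoformat(ts)
        for ts, reason in events
        if reason == "Started" and datetime.fromisoformat(ts) >= killing
    )
    return (started - killing).total_seconds()


# Hypothetical timestamps for the sequence shown above.
events = [
    ("2026-03-06T14:00:00", "Killing"),
    ("2026-03-06T14:00:01", "SuccessfulCreate"),
    ("2026-03-06T14:00:01", "Scheduled"),
    ("2026-03-06T14:00:03", "Pulled"),
    ("2026-03-06T14:00:05", "Started"),
]

print(recovery_seconds(events))  # 5.0
```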

Two Prometheus Queries Worth Memorizing

First query: silent restart loop. increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) returns how many times each container in that namespace restarted over the last hour. (rate() over the same window gives the same information as restarts per second, which is harder to read for slow loops.) A healthy workload rarely restarts. If this count is high (for example more than 3 restarts per hour), something is killing the container repeatedly, often an OOMKill or a crash. Alert when it exceeds a threshold so you see the pattern before users report errors.

Second query: invisible CPU throttling. rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]) measures how much time, per second, the Linux scheduler spent throttling containers in that namespace over the last 5 minutes. A result of 0.25 means the container was frozen 25% of the time. High latency with no restarts and "normal" CPU usage in kubectl top often means the CPU limit is too low and the kernel is throttling the process. Alert when this rate exceeds about 0.25 (25% throttled).

# Silent restart loop: alert when this exceeds 3 per hour
increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])

# Invisible throttling: alert when this exceeds 0.25 (25% throttled)
rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])

Run these against your own cluster. Not just KubeLab. These are production queries.

How to Use This for Production Debugging

The repo includes docs/diagnose.md, a symptom-to-simulation map. Find the simulation that reproduces your issue, run it in KubeLab, and understand the mechanics before you touch production.

Exit code 137, pods restarting. Run the Memory Stress simulation. Confirm with kubectl describe pod <pod-name> | grep -A 5 "Last State:" and look for Reason: OOMKilled. Raise limits or find the leak. The simulation shows both.

High latency, pods look healthy, zero restarts. Run the CPU Stress simulation. Check container_cpu_cfs_throttled_seconds_total in Prometheus. If it climbs, your CPU limit is too low and the pod is frozen by CFS.

503 on some requests, pods show Running. Run the Readiness Probe Failure simulation. Check kubectl get endpoints — one pod IP is missing despite Running. The pod gets zero traffic.

Pods stuck Pending after a node went down. Run the Drain Node simulation. Run kubectl describe pod and read Events. The scheduler will state why it cannot place the pod, often insufficient capacity or a PVC on the failed node.

Conclusion

You just broke a real Kubernetes cluster seven ways and watched it fix itself each time. You have seen the ReplicaSet controller fire, read an OOMKill from kubectl describe, watched endpoints go empty during a cascading failure, and understood why a pod can be Running and receiving zero traffic at the same time.

What you practiced here applies to other clusters, whether staging or production, that you can read but not safely break. That muscle memory (events, endpoints, restart counter) is what you reach for at 3 am when something is wrong. KubeLab is the safe place to build that reflex.

The repo holds more than this article covered. Explore mode lets you run simulations without the guided flow. The full interview prep doc at docs/interview-prep.md has answers to the 13 most common Kubernetes interview questions. The observability guide at docs/observability.md covers Prometheus and Grafana setup in detail.

If this helped you, star the repo at https://github.com/Osomudeya/kube-lab and share it with someone who is learning Kubernetes the hard way.

How Enterprise Teams Manage Infrastructure at Scale with Terraform

Prerequisites

Table of Contents

How State Corruption Happens

Two Engineers Run terraform apply at the Same Time

An Apply Gets Interrupted

Someone Runs a Terraform State Command in the Wrong Environment

Two Teams Manage the Same Resource

Why State File Gets Treated Like a Production Database

How Enterprise Teams Structure Their Terraform Repositories

How Teams Split State Files to Protect Each Other

Why Some Teams Prefer Directories Over Workspaces for Production

How Teams Share Infrastructure Through Modules on GitHub

How Teams Version and Release Terraform Modules

How Teams Maintain Terraform Modules at Scale

How Teams Share Data Between State Files

Reading Another Team's State Outputs

Looking Up Resources Directly From the Cloud

How Infrastructure Changes Actually Move to Production

How CODEOWNERS Enforces Who Reviews What

How Teams Detect Infrastructure Drift

How Teams Recover When State Goes Wrong

Step 1: Pull a Backup Before Touching Anything.

Step 2: Run terraform plan and Look at What it Proposes.

Step 3: Restore from S3 Versioning if the State is Corrupted.

Step 4: Clear a Stale Lock if the Pipeline is Blocked.

Step 5: Re-import Resources That Fell Out of State.

Conclusion

How to Use Bash & Python for Real DevOps Automation – Full Handbook with 5 Production Use Cases

Prerequisites

Knowledge and Skills

AWS IAM Permissions Required

Companion GitHub Repository

Table of Contents

Use Case 1 - Cost Anomaly Detection

The Production Problem

What's Actually Happening at the System Level

Set Up the Demo Environment

The Script

How the Script Works

What the Output Looks Like

The Decision the Script Can't Make for You

Break it On Purpose

Use Case 2 – Log Correlation Across Services

The Production Problem

What's Actually Happening at the System Level

Set Up the Demo Environment

The Script

How the Script Works

What the Output Looks Like After Breaking it

The Decision the Script Can't Make For You

Break it On Purpose

Use Case 3 - Infrastructure Drift Detection

The Production Problem

What's Actually Happening at the System Level

Set Up the Demo Environment

The Script (Code Files)

How the Script Works

What the Output Looks Like

The Decision the Script Can't Make For You

Break it On Purpose

Use Case 4 - Secrets Rotation with Zero Downtime

The Production Problem

What's Actually Happening

What the /healthz/db Endpoint Does

Set Up the Demo Environment

The Script (Code Files)

How the Script Works

What the Output Looks Like

Break it On Purpose

Step 1: Desync the DB

Step 2: Check what Kubernetes sees

Step 3: Check what your users experience

Step 4: See the mixed pattern (optional)

Step 5: Run the rotation script

The Decision the Script Can't Make For You

Teardown

Use Case 5 - Automated Canary Rollback Trigger

What This Use Case Does and Why it Matters

Why not `docker stop` or `docker kill` on the host for this demo?