Terraform - freeCodeCamp.org

How Enterprise Teams Manage Infrastructure at Scale with Terraform

Osomudeya Zudonu — Tue, 23 Jun 2026 15:56:59 +0000

Tutorials teach you how to write Terraform, but they don't teach you what happens when 60 engineers start writing it together.

When you learn Terraform, you work with a single repository, a single state file, and a single environment. You run terraform apply from your laptop, and your infrastructure is provisioned.

That model works fine until the day you join a company and realize engineers rarely apply to production from a laptop.
A lot of what you see will not match what you practiced.

This article explains how large engineering teams actually run Terraform: the repositories, workflows, and ownership rules they rely on, and what goes wrong without them.

You'll learn how enterprise teams structure repositories and state files, how they store and version reusable modules through GitHub, why infrastructure changes move to production through pipelines, how they catch changes that happen outside of Terraform, and how they recover when things go wrong.

Every practice here exists because a team hit a specific wall and built something to get past it.

Prerequisites

You should be comfortable with Terraform before reading this.
You should also know how Git pull requests and branch merging work.

This is not a Terraform introduction. It is about what happens after you have learned the basics and start sharing infrastructure with other engineers.

How State Corruption Happens
Why State File Gets Treated Like a Production Database
How Enterprise Teams Structure Their Terraform Repositories
How Teams Split State Files to Protect Each Other
Why Some Teams Prefer Directories Over Workspaces for Production
How Teams Share Infrastructure Through Modules on GitHub
How Teams Version and Release Terraform Modules
How Teams Maintain Terraform Modules at Scale
How Teams Share Data Between State Files
How Infrastructure Changes Actually Move to Production
How Teams Detect Infrastructure Drift
How Teams Recover When State Goes Wrong
Conclusion

How State Corruption Happens

The state file is how Terraform tracks what it has built. It remembers every resource, every ID, and every configuration value. When it gets out of sync with what actually exists in the cloud, that's state corruption.

It gets blamed for a lot of things. But engineers who have dealt with it in production know it usually traces back to one of a handful of situations, each with a different cause and a different fix.

Two Engineers Run `terraform apply` at the Same Time

Before understanding this one, you need to understand something about how Terraform works.

When you run terraform apply, two things happen separately:

First, Terraform talks to AWS, and the resource gets created in the cloud. Second, Terraform updates the state file to record what was just built.

These are two different systems. AWS holds the real infrastructure, and the state file is Terraform's notebook about it. If anything interrupts the process between step one and step two, they fall out of sync.

Now here's what happens when two engineers apply at the same time without locking:

Sarah opens the state file and starts adding a subnet. Marcus opens the same state file at the same moment and starts updating a NAT gateway. Both are working from the same starting copy.

Sarah finishes first. Her apply creates the subnet in AWS and updates the state file to record it.

Marcus finishes second. His apply updates the NAT gateway in AWS. Terraform then updates the state file using the version of state Marcus read when his apply started.

That version didn't include Sarah's subnet, so the updated state no longer contains a record of it.

The subnet exists in AWS. But Terraform's notebook no longer has a record of it. The next terraform plan thinks the subnet was never created and proposes building it again.

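State locking prevents this. Sarah's apply acquires a lock before it starts. When Marcus tries to apply, Terraform refuses to start until the lock is released: by default his run fails right away with a lock error, and with the -lock-timeout option it keeps retrying for the given duration.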

After Sarah finishes, Terraform updates the state file and releases the lock. Marcus then runs against the updated state, so both the subnet and NAT gateway changes are recorded correctly.
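
Locking is turned on in the backend configuration rather than in day-to-day commands. The following is a minimal sketch for an S3 backend. The bucket name is the one used in the recovery section of this article, while the key and region are assumptions. The use_lockfile argument needs Terraform 1.10 or newer; older setups point the dynamodb_table argument at a DynamoDB lock table instead.

```hcl
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state"               # state bucket
    key    = "production/networking/terraform.tfstate" # assumed path for this state
    region = "us-east-1"                               # assumed region

    encrypt      = true # encrypt the state object at rest
    use_lockfile = true # S3-native locking: a second apply cannot start while the lock is held
  }
}
```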

An Apply Gets Interrupted

A GitHub Actions pipeline is applying changes to the payments infrastructure, adding three new security group rules and a database parameter group. Halfway through, the pipeline runner hits its 60-minute timeout limit, and the job gets killed.

Here's what the apply actually managed to do before dying:

Security group rule 1  → created ✓
Security group rule 2  → created ✓
Security group rule 3  → created ✓
Database parameter     → not created ✗
State file update      → never wrote (job died first)

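The three security group rules now exist in AWS. Terraform normally records each resource as it completes and saves state when a run is cancelled gracefully, but here the runner was killed before the latest state could be written. AWS knows the rules exist. Terraform's state file does not.

At this point, reality and the state file no longer match.

How easy this is to recover from depends on what made it into state. For any resource that was recorded before the job died, the next run refreshes it against AWS, sees that it already exists, and moves on to what is still missing, such as the database parameter group. A resource that was created but never recorded is different: Terraform has no record of it, so the next apply tries to create it again, which typically either fails with a duplicate or "already exists" error or creates a second copy. Those resources need to be imported back into state first, as described in the recovery section.

Once state and reality agree, the run completes successfully and the state file catches up.

This works because Terraform is declarative and its applies are idempotent: running the same configuration again moves infrastructure toward the desired state rather than blindly creating everything from scratch.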

One small complication remains: the state lock.

If the pipeline was interrupted while holding a lock, Terraform may still think another apply is running. The next pipeline run fails immediately with a state lock error. The error includes the lock ID, the path to the state file, and the identity of the process that acquired the lock. Terraform refuses to proceed until the lock is released or manually cleared.

Before clearing the lock, make sure no Terraform apply is still running.

Open your CI/CD system (GitHub Actions, GitLab CI, Jenkins, or whatever your team uses) and check the pipeline history for that environment.

Suppose the history shows four recent runs: a terraform-plan job that completed successfully, two terraform-apply jobs that were cancelled or timed out, and a fourth terraform-apply job that is still in progress. The cancelled and timed-out jobs may have left a stale lock behind. The job in progress holds a live lock, and it shouldn't be unlocked until it finishes.

If the previous apply was cancelled or timed out, the lock is stale. Clear it with terraform force-unlock plus the lock ID from the error. The pipeline then runs normally.

Only force-unlock when you're certain nothing is actively running. Clearing a live lock lets two applies write to the same state at the same time, which is exactly the problem locking was built to prevent.

Someone Runs a Terraform State Command in the Wrong Environment

A database engineer is cleaning up an old test database in the staging environment.

The database still exists in AWS, but Terraform should stop managing it. To do that, the engineer uses terraform state rm.

This command doesn't delete anything in AWS. It only removes Terraform's record of the resource from the state file. Think of it as telling Terraform: "forget this resource exists, but leave it running."

The engineer intends to run it against staging:

Intended:  staging state       → forget the old test database

But they're working in the wrong directory. They run it against production instead.

Actual:    production state    → forget the live payments database

Nothing gets deleted. The production database is still running in AWS. But Terraform has now forgotten it exists.

Now Terraform and reality disagree. The next terraform plan sees a database defined in the code but missing from the state file, so it assumes the database doesn't exist and proposes creating a new one.

If nobody catches it in the plan output, Terraform creates a second production database alongside the original: two databases running in production, neither fully managed, and a very expensive mess to untangle.

terraform state rm, terraform import, and terraform state mv make immediate changes to the state file with no confirmation prompt. Run them from the wrong directory, the wrong workspace, or with the wrong resource address and you change the wrong state in seconds.
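
Terraform 1.7 and later offers a declarative alternative to terraform state rm: a removed block in the configuration. Because it lives in code, it goes through the same plan and review as any other change, and it only affects the state behind the directory it is committed to. This is a sketch, and the resource address is hypothetical.

```hcl
# Stop managing the old test database without destroying it.
# aws_db_instance.old_test is a hypothetical address. The matching
# resource block must be deleted from the configuration in the same change.
removed {
  from = aws_db_instance.old_test

  lifecycle {
    destroy = false # forget the resource, but leave it running in AWS
  }
}
```

The next plan then reports that the resource will be removed from state but not destroyed, which gives a reviewer the chance to notice a wrong environment before anything changes.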

Two Teams Manage the Same Resource

The networking team owns a security group that controls access to the payments database. When a new microservice needs database access, a payments engineer has two options: ask the networking team to add a new rule, or manage the security group themselves.

They choose the second option. The engineer imports the existing security group into the payments state file and adds a rule for Microservice C.
From that moment, both teams think they own the same security group.

The problem is that Terraform does exactly what each state file tells it to do. The networking state says the security group should allow A and B. The payments state says it should allow A, B, and Microservice C. Both can't be true at the same time.

When the payments team applies their state, Microservice C gets access. But later that night, the networking pipeline runs. Terraform reads the networking state, sees only A and B, and updates the security group to match. Microservice C's rule disappears silently.

No errors appear and both pipelines pass, which is exactly what makes this so hard to debug. Terraform isn't broken. It's receiving conflicting instructions from two different state files and doing exactly what each one says.

This isn't something to be fixed with Terraform commands. It's an ownership decision that should have been made before anyone ran an import. If the payments team had submitted a pull request to the networking repository asking them to add the rule, one team would own the security group, one state file would manage it, and the conflict could never have happened.

Why State File Gets Treated Like a Production Database

The state file looks like bookkeeping: a record of what Terraform created. The reason teams treat it differently is that it often contains secrets.

The state file stores sensitive values in plaintext. Database passwords, API keys, connection strings – if those values were passed to a Terraform resource during an apply, they're now sitting in the state file. Even if you marked the variable as sensitive in your Terraform code, the value still lands in the state file. Terraform needs it there to compute diffs on future plans.

That means: whoever can read the state file can read your database password.
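
To make that concrete, here is a small sketch. The variable and parameter names are hypothetical. Marking the variable as sensitive hides the value in plan and apply output, but the resource's value attribute is still written to the state file as plaintext.

```hcl
variable "db_password" {
  type      = string
  sensitive = true # redacts the value in CLI output only
}

# Hypothetical parameter name. After apply, the state entry for this
# resource contains the password in its "value" attribute, unredacted.
resource "aws_ssm_parameter" "db_password" {
  name  = "/payments/db_password"
  type  = "SecureString"
  value = var.db_password
}
```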

In large organizations, engineers typically don't have direct access to the production state bucket. Instead, Terraform runs through a CI/CD pipeline that assumes a dedicated IAM role with permission to read and write the state bucket and perform applies. Engineers interact with infrastructure through pull requests and plan output, not by touching the state bucket directly.

This separation reduces risk and creates an audit trail. Every state change is performed by the pipeline and logged, making it straightforward to trace what changed and when.

How Enterprise Teams Structure Their Terraform Repositories

When you join a large engineering organization, the first thing you notice is the number of repositories. You might expect one repository for all infrastructure, but what you find is dozens.

The structure maps directly to ownership. Each repository belongs to one team, and that team is responsible for everything in it. A typical layout has two types of repositories.

The first type belongs to the platform team and contains reusable modules: things like VPC configurations, database templates, and security group patterns. These repositories don't create production resources directly. They hold the modules that product teams call.

The second type belongs to individual product teams, such as the payments team or the auth team. These repositories call the platform modules and use them to build their actual infrastructure.

That distinction matters because some repositories are used by one team, while others are shared by everyone, and shared repositories carry more risk. A bug in the payments-infra repository affects only the payments team. A bug in the terraform-aws-postgres module affects every team that uses it to provision databases. A bug in the terraform-policies repository affects every pipeline in the company. The wider the module is shared, the larger the blast radius when something goes wrong.

This is why experienced engineers pay close attention to shared modules and policy repositories.

If the payments team's infrastructure breaks, the problem is probably in the payments repository.

If five different teams start seeing the same issue at the same time, the shared modules and policy repositories become the first place to investigate.

How Teams Split State Files to Protect Each Other

A single state file managing everything (VPC, Kubernetes cluster, databases, monitoring) is fine when one person is running things, but it quickly becomes a problem when multiple teams share it.

Three specific problems emerge.

Blast radius: If the networking configuration and the database configuration live in the same state file, a bad networking apply can accidentally affect database resources that had nothing to do with the change. Separate state files keep failures contained.
Deployment speed: Networking infrastructure might change a few times a year. Applications might deploy dozens of times a day. If they share a state file, teams end up waiting on each other's locks.
Ownership conflicts: When multiple teams share a state file, one team can change something the other team depends on without realizing it.

The solution is to split state along ownership boundaries. A structure that addresses all three problems looks like this:

production/
  networking/terraform.tfstate   → VPC, subnets, routing, NAT gateways
  identity/terraform.tfstate     → IAM roles, policies, service accounts
  platform/terraform.tfstate     → Kubernetes cluster, node pools, add-ons
  database/terraform.tfstate     → RDS instances, Redis clusters, backups
  security/terraform.tfstate     → Security groups, WAF rules, certificates
  monitoring/terraform.tfstate   → Prometheus, Grafana, alerting pipelines
  payments/terraform.tfstate     → Payment service infrastructure

This is one example, not a universal standard. Larger organizations often split further. The principle is the same: one owning team per state file, one pipeline, one blast radius.

The rule is simple: every resource belongs to one state file. If the networking team owns a security group, it stays in the networking state. Other teams can reference it as a data source, but they don't import it into their own state.
That is what prevents the ownership collision described in the first section.
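
A minimal sketch of the "reference, don't import" rule, assuming the networking team tags the security group with a Name the other teams can search for (the tag value here is hypothetical):

```hcl
# Look up the networking team's security group by tag.
data "aws_security_group" "payments_db" {
  tags = {
    Name = "payments-db" # assumed tag value
  }
}

# Use the ID wherever it is needed. This state only reads the group;
# it never owns it, so it can never overwrite the networking team's rules.
output "payments_db_security_group_id" {
  value = data.aws_security_group.payments_db.id
}
```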

Why Some Teams Prefer Directories Over Workspaces for Production

Terraform CLI workspaces let you manage multiple environments like dev, staging, and production from a single directory. Each workspace gets its own state file, but they all share the same .tf configuration files.

infra/
  main.tf          ← same code runs for ALL environments
  variables.tf

  terraform.tfstate.d/
    dev/
    staging/
    production/    ← separate state, same code

You switch environments with terraform workspace select production, then apply.

The risk is that switching workspaces is a manual step. If the wrong workspace is active, changes meant for staging can end up in production.

Many teams prefer separate directories for long-lived environments:

environments/
  dev/
    main.tf      ← its own code path
    backend.tf   ← points to the dev state bucket
  staging/
    main.tf      ← its own code path
    backend.tf   ← points to the staging state bucket
  production/
    main.tf      ← its own code path
    backend.tf   ← points to the production state bucket

To apply against production, you have to be in the production directory. Each environment has its own state, backend, and execution path.

The tradeoff is duplication. Teams usually solve that with shared modules, so each environment directory contains only environment-specific configuration.
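
As a sketch of what that looks like in practice, the production directory might hold little more than a backend definition and a few module calls. The bucket name, key, and region are assumptions, and the module call mirrors the one shown later in this article.

```hcl
# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state-production" # assumed bucket name
    key    = "payments/terraform.tfstate"           # assumed key
    region = "us-east-1"                            # assumed region
  }
}

# environments/production/main.tf
# Only environment-specific values live here; the logic is in the shared module.
module "payments_db" {
  source         = "git::ssh://github.company.com/platform/terraform-aws-postgres.git?ref=v2.1.0"
  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}
```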

Workspaces are still useful for short-lived environments such as feature branches, preview deployments, and temporary test infrastructure.

How Teams Share Infrastructure Through Modules on GitHub

When 30 teams each need a PostgreSQL database, two things happen.

Without a shared standard, every team writes their own database configuration. Six months later, a security audit runs across all environments and finds that four teams each got something different wrong:

Team A set backup_retention_period = 0, meaning their database was never backed up.
Team B set storage_encrypted = false, leaving data unencrypted at rest.
Team C passed an empty tags = {}, so there was no cost tracking.
Team D set deletion_protection = false, leaving the database one accident away from permanent data loss.

Nobody skipped those things on purpose. There was just no shared standard.

With a shared module, the platform team writes a postgres module once. They encode every organizational requirement into it: encryption on, 7-day backups, monitoring alarms, required tags, deletion protection enabled. They publish it to a GitHub repository called terraform-aws-postgres.
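
The inside of such a module might look roughly like the sketch below. This is an illustration under stated assumptions, not the actual contents of any real terraform-aws-postgres repository: the storage size, admin username, and tag keys are invented, and the monitoring alarms are left out for brevity.

```hcl
# Hypothetical main.tf inside the terraform-aws-postgres module.
variable "name" { type = string }
variable "environment" { type = string }
variable "instance_class" { type = string }

resource "aws_db_instance" "this" {
  identifier        = "${var.name}-${var.environment}"
  engine            = "postgres"
  instance_class    = var.instance_class
  allocated_storage = 100 # assumed default size in GiB

  username                    = "app_admin" # assumed admin user name
  manage_master_user_password = true        # RDS keeps the password in Secrets Manager

  # Organizational requirements, set once for every team.
  storage_encrypted         = true
  backup_retention_period   = 7
  deletion_protection       = true
  final_snapshot_identifier = "${var.name}-${var.environment}-final"

  tags = {
    Name        = "${var.name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

output "db_endpoint" {
  value = aws_db_instance.this.endpoint
}
```

Callers cannot turn encryption or backups off because those arguments are not exposed as variables, which is the point of the design.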

Every team that needs a database now writes this:

module "payments_db" {
  source         = "git::ssh://github.company.com/platform/terraform-aws-postgres.git?ref=v2.1.0"
  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}

Four inputs. Everything else is handled by the module.

Large organizations usually expose approved modules through an internal registry so engineers can discover and version them without browsing GitHub repositories. Instead of the full Git URL, the reference becomes:

module "payments_db" {
  source  = "app.terraform.io/mycompany/postgres/aws"
  version = "~> 2.1"
}

HCP Terraform and Terraform Enterprise both include a private registry that connects to GitHub, watches for version tags on module repositories, and publishes new versions automatically.

How Teams Version and Release Terraform Modules

The ?ref=v2.1.0 in a module source URL isn't decoration. At the scale of 40 teams sharing one module, it's the thing that prevents a well-intentioned change from becoming a company-wide incident.

Without version pinning, the payments team references the Postgres module from main, meaning whatever the latest code is at any given moment. The module owners rename an output from db_endpoint to database_endpoint to match a new naming convention. The next time any team runs terraform init on a clean checkout, as a CI pipeline does on every run, they pull that change. Their configuration still references db_endpoint.

Plans break:

payments-infra                        → plan fails
analytics-infra                       → plan fails
auth-infra                            → plan fails
reporting-infra                       → plan fails

Version pinning prevents this. The payments team stays on v2.1.0. Because renaming an output is a breaking change, the module owners release it as v3.0.0 and write a changelog. Teams upgrade when they're ready, after testing in staging. Nobody's pipeline breaks without warning.

The versioning convention is called semantic versioning:

v2.1.1  → patch:  bug fix. Safe to upgrade. Nothing to change in your code.
v2.2.0  → minor:  new optional feature. Safe to upgrade. Nothing to change.
v3.0.0  → major:  breaking change. Read the changelog. Update your code first.
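
For modules pulled from a registry, the version argument lets a team choose how much of that range to accept automatically. A sketch using the registry address from earlier in this article (the input values are the same illustrative ones):

```hcl
module "payments_db" {
  source = "app.terraform.io/mycompany/postgres/aws"

  # Pick one constraint style:
  version = "~> 2.1.0" # patch releases only: >= 2.1.0 and < 2.2.0
  # version = "~> 2.1" # minor and patch releases: >= 2.1.0 and < 3.0.0
  # version = "2.1.0"  # exact pin: nothing changes until someone edits this line

  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}
```

None of these constraints crosses a major version, so a breaking release is only adopted when someone changes the constraint deliberately. The version argument applies to registry sources only; Git sources are pinned with ?ref= instead.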

How Teams Maintain Terraform Modules at Scale

Building a Terraform module takes an afternoon, but maintaining it for two years is a different job entirely.

A networking engineer needs a VPC module. The platform team has one, but their backlog is full. So the engineer creates a slightly different version. Three months later, another team does the same. Then another. Now this exists:

terraform-aws-vpc           ← original, maintained by platform team
terraform-aws-vpc-v2        ← created by the app team, author unknown
terraform-aws-vpc-shared    ← no idea which environments use this
terraform-aws-vpc-prod      ← unclear if this was ever different from the original

No one created a module graveyard on purpose. It grew one "I'll just make a quick variation" at a time. Each variant has slightly different security settings, different tagging, different defaults. When a compliance audit requires all VPCs to enable flow logging, the team has to investigate four different modules to figure out which environments are compliant.

Teams that avoid this treat their modules like shared services: named owner, contributions through pull requests, breaking changes in major versions with a migration guide, and deprecated modules with a retirement date. A CODEOWNERS file routes every pull request to the right reviewer automatically.

Organizations that skip this end up with modules that nobody owns, nobody wants to touch, and nobody is sure can be safely removed.

How Teams Share Data Between State Files

Once infrastructure is split into separate state files, a practical problem surfaces: teams need information from each other's infrastructure. The platform team's Kubernetes state needs the VPC ID from the networking team's state. The database state needs subnet IDs. The payments state needs the database endpoint.

Two patterns exist for solving this.

Reading Another Team's State Outputs

The terraform_remote_state data source lets one state read the outputs of another. The networking team marks their VPC ID and subnet IDs as outputs. The database team reads those outputs and uses them to place databases in the right subnets.

Networking state
  └── outputs: vpc_id, private_subnet_ids
                          ↓
               Database state reads them
               └── places RDS in the right subnets

This works, but there's a limitation. Reading another team's state requires full read access to their entire state file, not just the outputs you want. State files contain database passwords and API keys in plaintext. More dependencies means more teams reading each other's secrets.
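
A minimal sketch of the pattern, assuming the networking state lives in the S3 bucket used elsewhere in this article and exposes an output named private_subnet_ids (the key, region, and resource names are assumptions):

```hcl
# Read the outputs of the networking team's state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate" # assumed key
    region = "us-east-1"                               # assumed region
  }
}

# Place the database in the subnets the networking team published.
resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

The role running this configuration needs read access to the whole networking state object, which is the limitation described above.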

Looking Up Resources Directly From the Cloud

The alternative, which avoids the state-sharing problem that HashiCorp's own documentation warns about, is to look up resources through the cloud provider's API instead of reading another team's state:

data "aws_vpc" "main" {
  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

No cross-team state access needed, and each team's state stays isolated. The tradeoff is consistent tagging: the networking team has to tag their VPC in a way the database team can reliably search for, which forces teams to agree on naming conventions early.
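
The same approach extends to subnets. The sketch below assumes the networking team tags private subnets with Tier = "private"; that tag is a hypothetical convention, not something from the original setup.

```hcl
data "aws_vpc" "main" {
  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

# Find the private subnets inside that VPC by tag.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private" # assumed tag agreed with the networking team
  }
}

resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.aws_subnets.private.ids
}
```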

Many teams use both. Remote state for a small number of trusted, tightly coupled dependencies. Cloud data sources for everything broader.

How Infrastructure Changes Actually Move to Production

In large organizations managing production Terraform at scale, changes don't come from someone's laptop. Applying directly from a local machine requires production cloud credentials sitting on that machine, which is a security risk, and it leaves no audit trail if something breaks.

Instead, production changes move through a pipeline. Every change goes through a pull request in GitHub, and the pipeline does the work:

Engineer opens a pull request
        ↓
Pipeline: terraform validate + fmt check
        ↓
Pipeline: security scan (Checkov, tfsec, or similar)
        ↓
Pipeline: terraform plan → posts the full output as a comment on the PR
        ↓
Reviewer reads the plan output (not just the code)
        ↓
Required reviewers approve (enforced by CODEOWNERS + branch protection)
        ↓
Merge triggers the apply pipeline
        ↓
Pipeline: acquires state lock → applies → releases lock → logs result

The part that surprises engineers when they first encounter this is that the reviewer isn't approving the code. They're approving the plan output: the list of exactly what will be created, changed, or destroyed in the cloud.

A code change can look completely harmless and produce a destructive plan. Changing one database parameter might force a resource replacement, meaning Terraform destroys the current database and creates a new one. The plan output makes that visible before the PR merges:

# aws_db_instance.payments must be replaced
-/+ resource "aws_db_instance" "payments" {

Catching that before merge is the entire point of reviewing the plan. Not the code.

How CODEOWNERS Enforces Who Reviews What

Earlier, we talked about module ownership. A VPC module might belong to the platform team, while database infrastructure belongs to the database team.

The challenge is making sure changes are actually reviewed by the people who own them.

GitHub solves this with a feature called CODEOWNERS. It lets a repository define which team is responsible for which directories. When someone opens a pull request that touches those files, GitHub automatically requests reviews from the correct team.

For example, if an engineer modifies the PostgreSQL module, GitHub can automatically require approval from the platform team before the change can be merged.

Without CODEOWNERS, engineers have to remember who owns which parts of the infrastructure. With it, ownership is written down in the repository, and review requests go to the right team automatically.

How Teams Detect Infrastructure Drift

Drift is the difference between what Terraform says should exist and what actually exists in the cloud.

Here's the scenario that produces drift more reliably than anything else:

Monday 3:00 AM  Production database CPU spikes. Outage.
Monday 3:15 AM  Engineer resizes database in AWS console: db.m5.large → db.m5.4xlarge
Monday 3:20 AM  Incident resolved. Engineer goes to sleep.
Monday 3:21 AM  Terraform state file: still says db.m5.large

The incident is forgotten, the ticket is closed, and life moves on.

Three months later, a routine Terraform apply runs. Terraform sees db.m5.large in the configuration but finds db.m5.4xlarge running in AWS. From Terraform's perspective, the database is larger than it should be, so the plan proposes changing it back.

Nobody notices the change in the plan output. The apply goes through, the database is downsized, and users begin reporting slow queries. The team spends hours investigating before eventually tracing the issue back to a Terraform change that reverted the emergency fix from months earlier.

Teams that handle this well run scheduled terraform plan -detailed-exitcode jobs against every production state. With that flag, exit code 0 means no changes, 1 means the plan failed, and 2 means differences were found, so the job fires an alert on 2. The team then decides whether to apply to restore declared state or update the configuration to match reality. Either way, the change is visible and deliberate. Invisible drift always gets worse.

How Teams Recover When State Goes Wrong

State is recoverable in almost every situation, as long as the team set things up correctly before the incident happened.

The teams that recover in twenty minutes instead of three days aren't the ones with the deepest Terraform expertise. They're the ones who prepared.

Step 1: Pull a Backup Before Touching Anything.

terraform state pull > backup-$(date +%Y%m%d-%H%M%S).json

This saves the current state to a local file. Whatever you try next, you have a starting point to return to.

Step 2: Run `terraform plan` and Look at What it Proposes.

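If Terraform proposes creating resources that already exist in the cloud, the state is missing records: reality is ahead of the state. If it proposes destroying or changing resources that nobody touched in code, the state holds records that the configuration doesn't match, which often means the wrong state, the wrong workspace, or an outdated state version is loaded. Either way, the plan output tells you which direction the mismatch runs.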

Step 3: Restore from S3 Versioning if the State is Corrupted.

Every write to a versioned S3 bucket saves a new version automatically. If the state file is corrupted or wrong, list the previous versions, download the last known good one, and push it back:

# List previous versions
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix production/database/terraform.tfstate

# Download a specific version
aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key production/database/terraform.tfstate \
  --version-id "the-version-id-here" \
  recovered-state.json

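# Push it back. Terraform refuses a state whose serial is lower than the
# current remote state, which is the case for an older version. After
# confirming you have the right file, add -force to override that check.
terraform state push recovered-state.json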

Run terraform plan after restoring to confirm it looks correct before running any apply.

Step 4: Clear a Stale Lock if the Pipeline is Blocked.

If a lock was never released after a failed apply, clear it:

terraform force-unlock LOCK_ID

Only do this after confirming no apply is actively running. Clearing a live lock lets two applies write at once, which can corrupt the state.

Step 5: Re-import Resources That Fell Out of State.

If a resource exists in the cloud but Terraform no longer knows about it — because of an accidental terraform state rm — bring it back without recreating it:

terraform import aws_db_instance.payments db-ABCD1234EFGH5678

Run terraform plan after importing to confirm no unexpected changes are proposed.
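A quick way to confirm the resource is tracked again is to look for its address in the state listing. The sketch below assumes nothing beyond `terraform state list`; the function name `resource_in_state` is hypothetical.

```shell
# Hypothetical helper: succeeds if the exact resource address is in state.
resource_in_state() {
  # -F: fixed string, -x: whole-line match, -q: no output.
  terraform state list | grep -Fxq -- "$1"
}
```

Usage would look like `resource_in_state aws_db_instance.payments && echo "tracked"`.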

Conclusion

Every practice in this article traces back to a specific problem teams ran into as Terraform usage grew.

State locking prevents engineers from overwriting each other's changes. State splitting reduces blast radius. Module versioning prevents shared infrastructure from breaking unexpectedly. Drift detection catches changes made outside Terraform. CODEOWNERS ensures the right people review the right changes.

Different problems, different solutions. But they all point to the same underlying theme: ownership.

As teams grow, many Terraform problems have less to do with infrastructure and more to do with ownership.

State collisions happen when multiple people can modify the same state.
Module sprawl happens when nobody is responsible for maintaining a shared standard.

Drift becomes dangerous when changes are made without anyone taking ownership of bringing Terraform and reality back into alignment. Even review bottlenecks often trace back to uncertainty about who should approve what.

Understanding this changes how you read an unfamiliar Terraform repository.

Dozens of small state files aren't necessarily over-engineering. They're often ownership boundaries. A CODEOWNERS file is not bureaucracy. It's an ownership map. A pipeline that posts plan output on a pull request isn't just automation. It's a review process built around infrastructure consequences rather than code.

The infrastructure matters. But as teams grow, ownership is what keeps the system understandable.


How to Migrate to S3 Native State Locking in Terraform

Tolani Akintayo — Thu, 07 May 2026 22:58:43 +0000

If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.

But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.

The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.

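In August 2024, AWS added conditional writes to S3, which let a client create an object only if no object with that key already exists. Terraform 1.10, released in November 2024, built native state locking for the S3 backend on top of that capability, meaning DynamoDB is no longer required for state locking. The feature shipped as experimental in 1.10 and became generally available in Terraform 1.11.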

In this tutorial, you'll learn:

What S3 native locking is and how it works
How to set it up from scratch if you're starting a new project
How to migrate an existing S3 + DynamoDB setup to S3 native locking safely
How to verify locking is working and handle edge cases

By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.

What Is Terraform State Locking?
What Is S3 Native State Locking?
How S3 Native Locking Compares to the S3 + DynamoDB Approach
Prerequisites
Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch
Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking
How to Verify That Locking Is Working
How to Handle a Stuck Lock
Rollback Plan: If Something Goes Wrong
Security Best Practices for Your State Bucket
Conclusion
References

What Is Terraform State Locking?

Before looking at the new approach, it helps to understand what state locking is solving.

Terraform stores everything it knows about your infrastructure in a state file – a JSON document that maps your configuration to real AWS resources. When you run terraform apply, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.

The problem arises when two engineers or two CI/CD pipelines try to apply changes at the same time. If both read the state file simultaneously, calculate changes independently, and both try to write back, you get a race condition. The second write overwrites changes from the first, and your state is now out of sync with reality. This is a serious problem that can cause resources to be untracked, duplicated, or destroyed unexpectedly.

State locking solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.

Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ---------->|                            |
   |                                   |                            |
   |<------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |------- 3. Acquire Lock --->|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |<------ 4. Lock Denied -----|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ---------->|                            |
   |                                   |                            |
   |           [COMPLETED]             |<------ 6. Lock Granted ----|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |              
   |                                   |                            |

What Is S3 Native State Locking?

Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a LockID primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.

S3 native locking uses S3 conditional writes instead. Despite the similar name, this is not the same thing as S3 Object Lock, a separate feature designed to enforce WORM (Write Once, Read Many) retention for regulatory requirements. Terraform's lock is an ordinary object that is created atomically and deleted when the operation finishes.

When S3 native locking is enabled in your Terraform backend:

Terraform writes your state to a .tfstate object in S3 (as before).
To acquire a lock, Terraform uses S3's conditional write operations, specifically the If-None-Match header, to create a lock file atomically.
If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held.
When the operation completes, Terraform deletes the lock file to release the lock.

The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.
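The create-only-if-absent behavior described above can be tried locally. The sketch below is only an analogy, not Terraform's implementation: it uses the shell's noclobber option, which makes a redirect fail when the target file already exists, in place of S3's conditional write. The function names are hypothetical.

```shell
# Local analogy for a conditional-write lock (not Terraform's actual code).
acquire_lock() {
  lockfile="$1"
  owner="$2"
  # With noclobber set, ">" fails if the file already exists,
  # so only one caller can create the lock file.
  ( set -o noclobber; printf '%s\n' "$owner" > "$lockfile" ) 2>/dev/null
}

release_lock() {
  # Releasing the lock is just deleting the file.
  rm -f "$1"
}
```

Calling `acquire_lock /tmp/demo.tflock run-a` twice in a row succeeds the first time and fails the second, until `release_lock /tmp/demo.tflock` removes the file.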

Note: This feature requires Terraform version 1.10.0 or later. It doesn't require S3 Object Lock on the bucket, because the lock file is created with a conditional write rather than a retention setting. This tutorial still shows how to turn Object Lock on, both at bucket creation and on an existing bucket in Part 2, but you can treat those steps as optional hardening.

How S3 Native Locking Compares to the S3 + DynamoDB Approach

Aspect	S3 + DynamoDB (Old)	S3 Native Locking (New)
AWS services required	S3 + DynamoDB	S3 only
IAM permissions needed	S3 + DynamoDB permissions	S3 permissions only
Terraform version	Any	1.10.0 or later
Setup complexity	Two resources, two IAM scopes	One resource
Stuck lock resolution	Delete DynamoDB record	Delete S3 lock file
Cost	S3 storage + DynamoDB on-demand	S3 storage only
Object Lock requirement	Not required	Not required
Locking mechanism	DynamoDB conditional writes	S3 conditional writes (`if-none-match`)
State versioning	S3 Versioning (recommended)	S3 Versioning (required for full safety)

The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.

Prerequisites

Before you start, make sure you have the following in place:

Terraform 1.10.0 or later installed. Check your version:

terraform version

If you need to upgrade, follow the official upgrade guide.

AWS CLI installed and configured with credentials that have permission to create and manage S3 buckets.

aws --version
aws sts get-caller-identity   # confirm you're authenticated

IAM permissions to perform the following S3 actions:
- s3:CreateBucket
- s3:PutBucketVersioning
- s3:PutEncryptionConfiguration
- s3:PutBucketPublicAccessBlock
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket

If you also enable Object Lock on the bucket, add s3:PutBucketObjectLockConfiguration.

For the migration path: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.

Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch

Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.

Step 1: Create the S3 Bucket with Versioning and Encryption

Terraform's native locking doesn't need any special bucket setting, so any S3 bucket works. The command below also enables Object Lock as optional extra protection; drop the --object-lock-enabled-for-bucket flag if you don't need it. Create the bucket using the AWS CLI:

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

Note: For regions other than us-east-1, add the --create-bucket-configuration flag.

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket

Now enable versioning on the bucket. Versioning is required alongside Object Lock and allows Terraform to recover previous state versions if something goes wrong:

aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled

Enable server-side encryption so your state files are encrypted at rest:

aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:

aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Verify the bucket configuration:

# Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state

Expected output for the Object Lock check:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}

Step 2: Configure the Terraform Backend with Native Locking

In your Terraform project, create or update your backend.tf file:

terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0 or later
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}

The critical difference from the old configuration is the use_lockfile = true parameter. Notice what is absent: there's no dynamodb_table argument. No DynamoDB table. No second service.

Here's a direct comparison of the old and new configurations:

Old configuration (S3 + DynamoDB):

terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}

New configuration (S3 native locking):

terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}

Step 3: Initialize and Verify

Run terraform init to initialize the backend:

terraform init

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!

Run a plan to confirm everything is working end-to-end:

terraform plan

If locking is working, you'll see a brief pause while Terraform acquires the lock before the plan output appears. You'll also see the lock information if you look at the S3 bucket – a .tflock file will appear temporarily alongside your state file during the operation and disappear when it completes.

Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking

Follow this section if you have an existing Terraform setup using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.

Important: Migration requires a maintenance window or at minimum a period where no Terraform operations are running. You're changing the backend configuration, which means all team members and CI/CD pipelines must stop running terraform plan or terraform apply during the migration. The migration itself takes under 10 minutes.

Step 1: Verify Your Current Setup

Before making any changes, document your existing backend configuration and confirm the state file is accessible:

# Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'

Check your current backend.tf and note the exact values:

# Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}

Run one final plan to confirm the current state is clean and there are no unexpected changes pending:

terraform plan

If the plan shows no changes, you're in a safe state to proceed.
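In a script or pipeline, the "no changes" check can be made explicit with the plan's -detailed-exitcode flag, which exits with 0 when there are no changes, 2 when changes are pending, and 1 on error. The wrapper below is a sketch; `plan_is_clean` is a hypothetical name.

```shell
# Hypothetical wrapper: returns 0 (clean), 2 (changes pending), or 1 (error).
plan_is_clean() {
  rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) return 0 ;;  # no changes: safe to proceed
    2) return 2 ;;  # changes pending: resolve them first
    *) return 1 ;;  # the plan itself failed
  esac
}
```

A migration script might start with `plan_is_clean || { echo "resolve pending changes first"; exit 1; }`.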

Step 2: Enable Object Lock on the Existing S3 Bucket (Optional)

Terraform's native locking works on your existing bucket as it is, so the migration doesn't depend on this step. If you don't want Object Lock, skip ahead to the versioning check at the end of this step.

If you do want Object Lock as extra protection for state objects, S3 allows it to be enabled on an existing bucket, provided versioning is already enabled on that bucket. Run the following AWS CLI command:

aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'

Note: This command turns on the Object Lock capability without setting a default retention mode or period, so no object is retained automatically. Terraform can still create and delete its lock files normally.

Verify Object Lock is now enabled:

aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket

Expected output:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}

Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):

aws s3api get-bucket-versioning \
  --bucket your-existing-bucket

Expected output:

{
    "Status": "Enabled"
}

If versioning isn't enabled, enable it before proceeding:

aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled

Step 3: Update the Terraform Backend Configuration

Update your backend.tf to remove the dynamodb_table argument and add use_lockfile = true:

terraform {
  backend "s3" {
    bucket = "your-existing-bucket"
    key    = "path/to/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
 
    # Add this:
    use_lockfile = true
 
    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}

Your updated backend.tf should look like this:

terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}

Step 4: Reinitialize Terraform

Run terraform init with the -reconfigure flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):

terraform init -reconfigure

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!

If you see an error here: The most common causes are a Terraform version older than 1.10.0, which doesn't recognize the use_lockfile argument, or credentials that can't read the bucket. Check terraform version and your AWS credentials before proceeding.

Step 5: Verify the Migration

Run a plan to confirm Terraform is working correctly with the new backend configuration:

terraform plan

The plan should:

Complete successfully
Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)
NOT mention DynamoDB anywhere in its output

To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:

╷
│ Error: Error acquiring the state lock
│
│ Error message: operation error S3: PutObject, https response error StatusCode: 409,
│ RequestID: ..., api error Conflict: Object lock already exists for this key.
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵

The Path field shows .tfstate.tflock, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.

Step 6: Clean Up the DynamoDB Table

Once you've confirmed the migration is working correctly and your team has run at least one successful plan and apply cycle using the new backend, you can remove the DynamoDB table.

Wait at least 24-48 hours before deleting the DynamoDB table if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.
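During that waiting period, you can search your repositories for configurations that still point at the table. The sketch below assumes the backend settings live in .tf files and that your grep supports --include (GNU and BSD grep both do); `find_dynamodb_backends` is a hypothetical name. Commented-out lines are ignored.

```shell
# Hypothetical helper: lists .tf files that still set dynamodb_table.
find_dynamodb_backends() {
  # The pattern is anchored to leading whitespace, so lines that start
  # with "#" (commented-out settings) do not match.
  grep -rlE --include='*.tf' '^[[:space:]]*dynamodb_table[[:space:]]*=' "$1"
}
```

Running `find_dynamodb_backends .` from the root of a repository prints each file that still needs updating, and prints nothing when the migration is complete there.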

When you're ready, delete the DynamoDB table:

aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table

Confirm the deletion:

aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table

Expected output:

An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found

This error confirms that the table is gone. The migration is complete.

If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run terraform apply to destroy it via Terraform rather than the CLI directly. This keeps your state clean:

# Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}

After removing the block, run:

terraform apply

Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.

How to Verify That Locking Is Working

After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.

Method 1: Observe the lock file during an operation

In one terminal, start a long-running plan against a configuration with many resources:

terraform plan

While it's running, in a second terminal, check for the lock file in S3:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

You should see a file like:

2026-05-06 14:22:01        512 terraform.tfstate.tflock

After the plan completes, run the same command again. The .tflock file should be gone.
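The same check can be scripted against the exact lock key instead of filtering a listing. This is a sketch that assumes the lock object is named after the state key with a .tflock suffix, as shown above; `lock_exists` is a hypothetical name. Note that head-object also fails on permission errors, so a failure here means "not found or not readable".

```shell
# Hypothetical helper: succeeds if <state key>.tflock exists in the bucket.
lock_exists() {
  bucket="$1"
  state_key="$2"
  aws s3api head-object \
    --bucket "$bucket" \
    --key "${state_key}.tflock" >/dev/null 2>&1
}
```

For example: `lock_exists your-bucket path/to/terraform.tfstate && echo "lock is held"`.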

Method 2: Read the lock file contents

While a plan is running, download and read the lock file to see its contents:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock && cat /tmp/current.lock

Expected output (formatted for readability):

{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}

This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.

How to Handle a Stuck Lock

With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the .tflock file from S3.

A lock can get stuck if:

A terraform apply or plan process was killed mid-execution
A CI/CD pipeline runner crashed during a Terraform operation
A network interruption prevented the lock release from completing

Here's how you can check for a stuck lock:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

If a .tflock file exists and no Terraform operation is currently running, it is a stuck lock.

You can also read the lock to understand who held it:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock && cat /tmp/stuck.lock

This tells you who (Who field) was running the operation, what operation it was (Operation field), and when it was acquired (Created field).

And you can force-unlock using Terraform like this:

terraform force-unlock LOCK-ID

Replace LOCK-ID with the ID value from the lock file contents. For example:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Terraform will confirm:

Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!
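If you'd rather not copy the ID by hand, it can be pulled out of the downloaded lock file. The sketch below assumes the lock file is JSON with an "ID" field, as in the example contents shown earlier; `extract_lock_id` is a hypothetical name, and sed is used so that no JSON tool needs to be installed.

```shell
# Hypothetical helper: prints the value of the "ID" field from a lock file.
extract_lock_id() {
  # Capture the quoted value that follows "ID": and print the first match.
  sed -n 's/.*"ID"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```

Usage would look like `terraform force-unlock "$(extract_lock_id /tmp/stuck.lock)"`.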

An alternative is to delete the lock file directly via CLI. If terraform force-unlock doesn't work (for example, because you are running in a CI environment without Terraform available), delete the lock file directly:

aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock

Only delete the lock file if you are certain no Terraform operation is currently running. Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.

Rollback Plan: If Something Goes Wrong

If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.

Step 1: Stop all Terraform operations in your team and CI/CD pipelines.

Step 2: Recreate the DynamoDB table if you already deleted it:

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Step 3: Revert backend.tf to the previous configuration:

terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}

Step 4: Reinitialize:

terraform init -reconfigure

Step 5: Verify:

terraform plan

The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.

Note: Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock and DynamoDB locking can coexist, because Object Lock simply adds a capability to the bucket. Using dynamodb_table in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled on the bucket.

Security Best Practices for Your State Bucket

Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:

Enable Versioning (Required)

Terraform doesn't enforce versioning, but you should treat it as a requirement for any production state bucket. It ensures that if a state file is accidentally overwritten or corrupted, you can restore a previous version.

aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled

Block All Public Access (Non-Negotiable)

Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.

aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Enable Server-Side Encryption

Always encrypt state files at rest. AES256 is the minimum. If your organization requires KMS key management, use SSE-KMS instead:

aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Apply Least-Privilege IAM Permissions

The role or user that Terraform uses to access the state bucket should have only the permissions it needs. Here's a minimal IAM policy for S3 native locking:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-state-bucket",
        "arn:aws:s3:::your-state-bucket/*"
      ]
    }
  ]
}

Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.

Enable Access Logging

Log all access to your state bucket in CloudTrail or S3 server access logs. This gives you an audit trail of every time state was read, written, or locked:

aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'

Conclusion

AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.

Here's a summary of what you accomplished:

Understood what state locking is and why it's required for safe Terraform operations
Compared S3 native locking to the existing S3 + DynamoDB approach
Set up a fresh Terraform backend using S3 native locking with correct bucket configuration
Migrated an existing backend from S3 + DynamoDB to S3 native locking safely
Learned how to verify locking, handle stuck locks, and roll back if needed
Applied security best practices to the state bucket

This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.

How to Get Started with Terraform

Manish Shivanandhan — Wed, 15 Apr 2026 16:25:48 +0000

Infrastructure has undergone a fundamental shift over the past decade.

What was once configured manually through dashboards and shell access is now defined declaratively in code. This shift isn't just about convenience. It's about repeatability, auditability, and control.

Terraform sits at the centre of this transformation. It allows you to define infrastructure using configuration files, apply those configurations consistently across environments, and evolve systems safely over time.

For teams building modern applications, especially on platform abstractions, Terraform becomes the control plane for everything from application deployment to databases and networking.

The open source Terraform provider from Sevalla extends this model by allowing teams to manage the entire application platform as code, not just underlying infrastructure. It enables you to define applications, databases, networking, storage, and deployment workflows in a single, unified configuration.

Instead of stitching together multiple tools or relying on manual setup, everything from code deployment to traffic routing and environment configuration can be expressed declaratively. This creates a consistent, repeatable system where environments can be replicated easily, changes are version-controlled, and production setups can evolve safely over time.

This article walks through how to go from zero to a production-ready setup using Terraform and the Sevalla Terraform Provider, focusing on practical concepts rather than theory.

What We'll Cover:

What Terraform Actually Does
Setting Up Terraform for the First Time
Understanding Providers, Resources, and Data Sources
Building a Real Application Stack
Managing Configuration and Secrets
Scaling and Process Configuration
Adding Networking and Traffic Management
Pipelines and Continuous Deployment
From Configuration to Production
Why Terraform Scales with You

What Terraform Actually Does

Terraform is an infrastructure-as-code tool that translates configuration files into real infrastructure. You describe the desired state of your system, and Terraform figures out how to achieve it.

At a high level, Terraform operates in three phases.

First, it initializes the working directory and downloads required providers. Providers are plugins that allow Terraform to interact with specific platforms.

Next, it creates an execution plan. This plan shows what resources will be created, modified, or destroyed to match your configuration.

Finally, it applies the plan, making the necessary API calls to bring your infrastructure into the desired state.

The key idea is that Terraform is declarative. You define what you want, not how to do it. Terraform handles the orchestration.

This abstraction becomes extremely powerful as systems grow more complex.
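To make the plan phase concrete, here is a minimal Python sketch of the idea behind a declarative plan: compare the desired state with the current state and derive the actions from the difference. The `plan` function and the resource names are hypothetical illustrations, not Terraform internals. The real planner also handles dependencies, providers, and much more.

```python
def plan(desired: dict, current: dict) -> dict:
    """Derive the actions that move `current` towards `desired`."""
    return {
        # In the configuration but not yet in the real world.
        "create": sorted(name for name in desired if name not in current),
        # Present in both, but with different settings.
        "update": sorted(
            name for name in desired
            if name in current and desired[name] != current[name]
        ),
        # Exists in the real world but no longer in the configuration.
        "destroy": sorted(name for name in current if name not in desired),
    }


# Hypothetical resource names and attributes, for illustration only.
desired_state = {
    "application.web": {"repo_url": "https://github.com/example/app"},
    "database.main": {"engine": "postgresql"},
}
current_state = {
    "application.web": {"repo_url": "https://github.com/example/old-app"},
    "cache.legacy": {"engine": "redis"},
}

actions = plan(desired_state, current_state)
print(actions)
```

You only ever edit `desired_state`. The list of actions falls out of the comparison, which is why running the same configuration twice produces an empty plan the second time.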

Setting Up Terraform for the First Time

Getting started with Terraform requires very little setup. You install the CLI, create a working directory, and define a basic configuration.

A Terraform configuration is written in HCL (HashiCorp Configuration Language), a domain-specific language designed to be human-readable. Even a simple configuration introduces the core concepts.

You define the required provider, configure authentication, and declare resources.

Here's a minimal example that provisions an application using a managed platform provider.

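terraform {
  required_providers {
    sevalla = {
      source  = "sevalla-hosting/sevalla"
      version = "~> 1.0"
    }
  }
}

# The API key is read from the SEVALLA_API_KEY environment variable.
provider "sevalla" {
}

# Look up the clusters available to this account.
data "sevalla_clusters" "all" {}

# Deploy an application from a public Git repository to the first cluster.
resource "sevalla_application" "web" {
  display_name = "my-web-app"
  cluster_id   = data.sevalla_clusters.all.clusters[0].id
  source       = "publicGit"
  repo_url     = "https://github.com/example/app"
}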

This configuration does several things.

First, it declares the provider, which tells Terraform how to communicate with the platform. It also fetches available clusters using a data source. It then defines an application resource that points to a Git repository.

Even at this stage, you're already defining infrastructure in a reproducible way.

To execute this configuration, export your API key and then run three commands: initialize the project, generate a plan, and apply it.

export SEVALLA_API_KEY="your-api-key"
terraform init
terraform plan
terraform apply

After applying, your application is deployed without manual steps.

Understanding Providers, Resources, and Data Sources

Terraform revolves around three core constructs.

Providers act as the bridge between Terraform and external systems. They expose APIs in a structured way that Terraform can use.

Resources represent the infrastructure you want to create. These are the building blocks of your system. Applications, databases, load balancers, and storage buckets are all modeled as resources.

Data sources allow you to query existing infrastructure. Instead of creating something new, you retrieve information that can be used elsewhere in your configuration.

The combination of these constructs allows you to build flexible and composable systems.

For example, you can fetch a list of available clusters using a data source and then dynamically assign your application to one of them. This reduces hardcoding and improves portability.

As your configuration grows, these abstractions help you maintain clarity and structure.

Building a Real Application Stack

A production system is rarely just a single application. It typically includes multiple components that need to work together.

With Terraform, you can define the entire stack in one place.

You might start with an application, then add a managed database, connect them internally, and expose the application through a load balancer.

A simplified flow looks like this.

You define the application resource that pulls code from a repository.
You provision a database resource, such as PostgreSQL or Redis.
You establish an internal connection between the application and the database.
You configure environment variables for credentials.
You optionally add a custom domain or routing layer.

Each of these components is a resource, and Terraform ensures they're created in the correct order.

This approach reduces configuration drift. Instead of manually setting up each component, everything is defined in code and version-controlled, so the configuration stays the source of truth as long as changes go through Terraform.

It also makes environments consistent. Your staging and production setups can be identical except for a few variables.
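Terraform works out creation order from the references between resources. The idea can be sketched in Python as a topological sort over a dependency graph, using the standard-library `graphlib` module (Python 3.9+). The resource names below are hypothetical and mirror the flow described in this section. This is an illustration of the concept, not Terraform's implementation.

```python
from graphlib import TopologicalSorter

# Each hypothetical resource maps to the set of resources it depends on.
dependencies = {
    "application": set(),
    "database": set(),
    "internal_connection": {"application", "database"},
    "environment_variables": {"internal_connection"},
    "custom_domain": {"application"},
}

# static_order() yields every resource only after all of its dependencies.
creation_order = list(TopologicalSorter(dependencies).static_order())
print(creation_order)
```

Resources with no dependency between them, such as the application and the database, can appear in either order, which is also why independent resources can be created in parallel.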

Managing Configuration and Secrets

Production systems require configuration. This includes environment variables, API keys, and connection strings.

Terraform provides multiple ways to handle this.

You can define variables in your configuration and pass values at runtime. Sensitive values, such as API keys, are typically injected via environment variables.

For example, authentication is handled through an API key that can be set as an environment variable.

export SEVALLA_API_KEY="your-api-key"

This avoids hardcoding credentials in configuration files.

You can also define environment variables as part of your infrastructure. This allows you to configure applications consistently across environments.

The important principle is separation of concerns. Infrastructure definitions should remain clean, while sensitive data is managed securely.
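Besides provider-specific variables such as SEVALLA_API_KEY, Terraform reads any environment variable named TF_VAR_<name> as the value of the input variable <name>. The Python sketch below mimics that lookup so you can see which values would be picked up. The variable name `db_password` and the function are hypothetical, and the code only illustrates the naming convention.

```python
def terraform_variables_from_env(environ: dict) -> dict:
    """Collect TF_VAR_<name> entries as {name: value}."""
    prefix = "TF_VAR_"
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }


# A hypothetical environment; in a real shell you would pass os.environ.
example_env = {
    "TF_VAR_db_password": "example-only",
    "SEVALLA_API_KEY": "your-api-key",  # read by the provider, not as a variable
    "PATH": "/usr/bin",
}

variables = terraform_variables_from_env(example_env)
print(sorted(variables))
```

Only the prefixed entry becomes a Terraform variable, so the secret value never has to appear in a configuration file.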

Scaling and Process Configuration

Modern applications often consist of multiple processes. A web server handles incoming requests, background workers process jobs, and scheduled tasks run periodically.

Terraform allows you to define these processes explicitly.

You can configure different process types, allocate resources, and scale them independently. This is particularly useful for handling variable workloads.

For example, you might scale web processes based on incoming traffic while keeping background workers at a steady level.

By defining this in code, scaling becomes predictable and repeatable.

You avoid manual intervention and ensure that your system behaves consistently under load.

Adding Networking and Traffic Management

As systems grow, managing traffic becomes more important.

Terraform enables you to define networking components such as load balancers and routing rules. You can map domains to applications, distribute traffic across multiple services, and control access.

This is essential for production readiness.

A load balancer can improve availability by distributing traffic across instances. Domain configuration ensures that users can access your application through a stable endpoint.

You can also define restrictions, such as IP allowlists, to enhance security.

All of this is managed declaratively, which reduces the risk of misconfiguration.

Pipelines and Continuous Deployment

Production systems require reliable deployment workflows.

You can use Terraform to define deployment pipelines and stages. This allows you to model how code moves from development to production.

You can define multiple stages, associate applications with each stage, and control how deployments are triggered.

This brings infrastructure and deployment logic into a single system.

Instead of relying on external scripts or manual processes, everything is defined in a structured and version-controlled way.

It also improves traceability. You can see exactly how a system is configured and how changes are applied over time.

From Configuration to Production

Moving from a simple setup to production involves more than just adding resources. It requires discipline in how you manage infrastructure.

Version control becomes critical. Every change to your infrastructure should go through code review. This reduces the risk of introducing breaking changes.

State management is another key aspect. Terraform keeps track of the current state of your infrastructure. This state must be stored securely and consistently, especially in team environments.

You also need to think about environment separation. Development, staging, and production should be isolated but defined using similar configurations.

Finally, observability should be integrated from the start. While Terraform provisions infrastructure, you need monitoring and logging to understand how it behaves in production.
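The environment separation described above usually comes down to one shared definition plus a small set of per-environment values. A minimal Python sketch of that pattern follows. The setting names and values are hypothetical. In Terraform itself the same idea is typically expressed with input variables and one variable file per environment.

```python
# Shared defaults that every environment starts from (hypothetical settings).
base_settings = {
    "web_instances": 1,
    "database_plan": "small",
    "repo_url": "https://github.com/example/app",
}

# Only the values that differ are listed per environment.
overrides = {
    "staging": {},
    "production": {"web_instances": 3, "database_plan": "large"},
}


def settings_for(environment: str) -> dict:
    """Merge the shared defaults with one environment's overrides."""
    return {**base_settings, **overrides[environment]}


staging = settings_for("staging")
production = settings_for("production")
print(production)
```

Because both environments are built from the same keys, any difference between staging and production is visible in one short dictionary.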

Why Terraform Scales with You

Terraform works well for small projects, but its real value becomes apparent as systems grow.

As you add more services, environments, and dependencies, manual management becomes unsustainable. Terraform provides a structured way to manage this complexity.

It enforces consistency. It enables automation. It creates a single source of truth for your infrastructure.

Most importantly, it allows teams to move faster without sacrificing reliability.

By defining infrastructure as code, you reduce ambiguity. You make systems easier to understand, easier to debug, and easier to evolve.

That is what takes you from zero to production in a way that actually scales.


How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator

Osomudeya Zudonu — Thu, 26 Mar 2026 14:25:52 +0000

If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?

Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.

In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.

By the end, you'll be able to:

Explain the full architecture from vault to pod
Run the lab locally in about 15 minutes
Prove why environment variables go stale after rotation, while mounted secret files stay fresh
Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD
Troubleshoot the most common failures

Architecture diagram: secrets flow from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then split into environment variables set at pod start and a volume mount that updates within about 60 seconds.

Prerequisites
How to Understand the Secret Flow
How to Run the Local Lab
How to Inspect the ExternalSecret and the Application
How to Test Secret Rotation
How to Choose Between External Secrets Operator and the CSI Driver
How to Deploy the Pattern on Amazon Elastic Kubernetes Service
How to Configure GitHub Actions Without Stored AWS Credentials
How to Troubleshoot the Most Common Failures
Conclusion

Prerequisites

Before you begin, make sure you have the following tools installed and configured.

For the local lab:

An AWS account with access to AWS Secrets Manager
The AWS CLI installed and configured. Run aws configure and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.
kubectl installed. For MicroK8s, run microk8s kubectl config view --raw > ~/.kube/config after installation to connect kubectl to your local cluster.
Terraform installed
Helm installed
Docker installed
A local Kubernetes cluster: the lab supports MicroK8s and kind. If you do not have either installed, follow the MicroK8s install guide before continuing.

For the Amazon Elastic Kubernetes Service sections:

An Amazon Elastic Kubernetes Service cluster you can create or manage
A GitHub repository you can configure for workflows and secrets

The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's docs/DEPLOY-LOCAL.md and docs/DEPLOY-EKS.md.

How to Understand the Secret Flow

Before you run any command, you need to understand how the pieces connect.

The flow has four stages:

A developer or automated system updates a secret in AWS Secrets Manager.
The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.
Your pod reads that Kubernetes Secret.
During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.

How the External Secrets Operator Sync Works

The External Secrets Operator reads a custom Kubernetes resource called ExternalSecret. That resource tells the operator three things:

Which secret store to connect to
Which Kubernetes Secret name to create or update
How often to refresh

In this lab, the ExternalSecret creates a Kubernetes Secret named myapp-database-creds. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.

How the App Consumes Secrets

The sample application exposes three endpoints so you can validate behavior at any time.

/secrets/env shows what environment variables the pod sees
/secrets/volume shows what files in the mounted secret directory look like
/secrets/compare compares both and reports whether rotation has been detected

The app checks four keys: DB_USERNAME, DB_PASSWORD, DB_HOST, and DB_PORT.

How to Run the Local Lab

The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.

Step 1: Clone the Repo

git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab

Step 2: Run the Spin-Up Script

bash spinup.sh

The script will ask you to choose a local cluster type. Pick MicroK8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.

If the script fails at any point, check docs/TROUBLESHOOTING.md before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a MicroK8s storage add-on that is not enabled.

Important: Run the Lab UI

The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application. It's a React-based checklist at lab-ui/ that walks you through each concept and checkpoint as you work through the lab.

To start it, open a second terminal and run:

cd lab-ui && npm install && npm run dev

Then open http://localhost:5173. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.

Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (localhost:3000) are two separate things: the UI guides you through the steps, and the app shows you the live secrets.

Step 3: Access the Application

Once the lab finishes, port-forward the service.

kubectl port-forward svc/myapp 3000:80 -n default

Open http://localhost:3000. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.

Step 4: Validate That Secrets Match

Run the compare endpoint directly from the terminal.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

When everything is working, the response will include "all_match": true.

How to Inspect the ExternalSecret and the Application

At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.

Step 1: Read the ExternalSecret Manifest

Open k8s/aws/external-secret.yaml. Focus on these four fields:

refreshInterval: how often the operator polls AWS Secrets Manager
secretStoreRef: which store the operator authenticates against
target: the name of the Kubernetes Secret to create
data: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys

Here is what that mapping looks like in this lab:

spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username

The property field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.

Two fields here are worth understanding before you move on. creationPolicy: Owner means the operator owns the Kubernetes Secret it creates. If you delete the ExternalSecret, the Secret is deleted too. ClusterSecretStore is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain SecretStore is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.
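The data mapping can be pictured as a small transformation: parse the JSON stored in AWS Secrets Manager, then copy each property into the Kubernetes Secret key named by secretKey. The Python sketch below assumes a JSON secret with username, password, host, and port fields. Only the username mapping appears in the manifest excerpt above, so the other property names and all values are assumptions for illustration.

```python
import json


def build_kubernetes_secret(remote_secret_string: str, data_mappings: list) -> dict:
    """Map properties of a JSON secret to Kubernetes Secret keys."""
    remote = json.loads(remote_secret_string)
    return {
        mapping["secretKey"]: str(remote[mapping["property"]])
        for mapping in data_mappings
    }


# Hypothetical secret value as it might be stored under prod/myapp/database.
remote_secret_string = json.dumps(
    {"username": "app_user", "password": "example-only", "host": "db.internal", "port": 5432}
)

# One entry per key; property names other than "username" are assumed.
data_mappings = [
    {"secretKey": "DB_USERNAME", "property": "username"},
    {"secretKey": "DB_PASSWORD", "property": "password"},
    {"secretKey": "DB_HOST", "property": "host"},
    {"secretKey": "DB_PORT", "property": "port"},
]

kubernetes_secret = build_kubernetes_secret(remote_secret_string, data_mappings)
print(sorted(kubernetes_secret))
```

If a property is missing from the remote JSON, this sketch raises a KeyError. In the real operator, the equivalent failure shows up in the ExternalSecret status conditions.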

Step 2: Read the Deployment Manifest

Open k8s/aws/deployment.yaml. You are looking for two sections: envFrom and volumeMounts.

envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true

Both paths read from the same Kubernetes Secret, myapp-database-creds. The envFrom block injects all keys as environment variables at pod start. The volumeMounts block mounts the same secret as files under /etc/secrets.

This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.

Step 3: Read the App Comparison Logic

Open app/server.js. The comparison logic reads environment variables from process.env and reads mounted secret files from /etc/secrets/. Then it computes a per-key match and a global all_match value.

The /secrets/compare endpoint sets rotation_detected: true when any key differs between env and volume.
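The comparison described above can be sketched in a few lines. The lab's actual implementation lives in app/server.js and is written in JavaScript. The Python version below is a re-creation of the described behavior under that assumption, not a copy of the repo's code, and it uses a temporary directory in place of /etc/secrets.

```python
import tempfile
from pathlib import Path

KEYS = ["DB_USERNAME", "DB_PASSWORD", "DB_HOST", "DB_PORT"]


def compare_secrets(env: dict, secrets_dir: Path) -> dict:
    """Compare each key's environment value with its mounted file."""
    comparison = {}
    for key in KEYS:
        env_value = env.get(key)
        path = secrets_dir / key
        volume_value = path.read_text().strip() if path.exists() else None
        comparison[key] = {
            "env": env_value,
            "volume": volume_value,
            "match": env_value == volume_value,
        }
    all_match = all(entry["match"] for entry in comparison.values())
    return {
        "comparison": comparison,
        "all_match": all_match,
        "rotation_detected": not all_match,
    }


with tempfile.TemporaryDirectory() as tmp:
    secrets_dir = Path(tmp)  # stands in for /etc/secrets
    values = {
        "DB_USERNAME": "app_user",
        "DB_PASSWORD": "old-password-value",
        "DB_HOST": "db.internal",
        "DB_PORT": "5432",
    }
    for key, value in values.items():
        (secrets_dir / key).write_text(value)

    before = compare_secrets(values, secrets_dir)

    # Simulate the mounted file changing while the env values stay the same.
    (secrets_dir / "DB_PASSWORD").write_text("new-password-value")
    after = compare_secrets(values, secrets_dir)

print(before["all_match"], after["rotation_detected"])
```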

How to Test Secret Rotation

Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.

How the Rotation Gap Works

When a pod starts, Kubernetes gives it two ways to read a secret.

The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.

The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. As long as the application re-reads the file each time it needs the value, it sees the new password automatically.

Same secret, two paths. One goes stale while one stays fresh.

The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.

That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.
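You can reproduce the process-model half of this without Kubernetes. The Python sketch below starts a child process with DB_PASSWORD in its environment, rewrites a file while the child is running, and then asks the child to report both values. The file path and values are stand-ins for the mounted secret. No kubelet is involved, so this shows only why the environment cannot change after startup.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# The child waits for a newline, then reports its env var and the file content.
CHILD = r"""
import os, sys
sys.stdin.readline()
print(os.environ["DB_PASSWORD"], flush=True)
print(open(sys.argv[1]).read(), flush=True)
"""

with tempfile.TemporaryDirectory() as tmp:
    secret_file = Path(tmp) / "DB_PASSWORD"  # stands in for the mounted file
    secret_file.write_text("old-password-value")

    # The environment is copied into the child once, when it starts.
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(secret_file)],
        env={**os.environ, "DB_PASSWORD": "old-password-value"},
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

    # "Rotate" the secret: only the file can be updated from outside.
    secret_file.write_text("new-password-value")

    output, _ = child.communicate("\n")
    env_value, volume_value = output.splitlines()

print("env:", env_value)
print("volume:", volume_value)
```

The child still reports the old value from its environment and the new value from the file, which is the same gap the lab makes visible inside a pod.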

Here is what you're about to observe in the lab:

The rotation script updates the secret in AWS
ESO syncs the new value into Kubernetes within seconds of the forced sync (without it, you would wait for the 1h refreshInterval)
The volume file updates automatically
The environment variable stays stale until the pod restarts
The /secrets/compare endpoint shows both values side by side so you can see the gap live

Step 1: Confirm the Lab Is Ready

Make sure your pod and the External Secrets Operator are both running before you start.

kubectl get pods -n external-secrets
kubectl get pods -n default

Both should show Running.

Step 2: Run the Rotation Test Script

bash rotation/test-rotation.sh

The script performs these actions in order:

Reads the current DB_PASSWORD from the volume mount at /etc/secrets/DB_PASSWORD
Reads the current DB_PASSWORD from the environment variable
Updates AWS Secrets Manager with a new password using put-secret-value
Forces an immediate ESO sync by annotating the ExternalSecret with force-sync
Reads the volume value again
Reads the environment variable again

After the script runs, the volume and the env var will show different values.

Step 3: Validate With the Compare Endpoint

Hit the compare endpoint and look at the output.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

You'll see something like this:

{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}

Step 4: Restart the Deployment to Sync Env Vars

Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.

kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default

Then hit /secrets/compare again. Every key should now match, and the response should show "all_match": true.

How to Automate Restarts With Reloader

If you don't want to restart deployments manually after every rotation, you can install Stakater Reloader. It watches an annotation on the Deployment and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.
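Conceptually, any tool that restarts workloads on rotation needs a cheap way to notice that a Secret's contents changed. One way to picture it is a fingerprint of the Secret data that is compared on every update. The Python sketch below illustrates that idea only. The function names are hypothetical and this is not Reloader's actual implementation.

```python
import hashlib
import json


def secret_fingerprint(secret_data: dict) -> str:
    """Return a stable hash of the Secret's keys and values."""
    canonical = json.dumps(secret_data, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def needs_restart(last_seen_fingerprint: str, secret_data: dict) -> bool:
    """True when the Secret differs from the version the pods started with."""
    return secret_fingerprint(secret_data) != last_seen_fingerprint


# Hypothetical Secret contents before and after a rotation.
before_rotation = {"DB_USERNAME": "app_user", "DB_PASSWORD": "old-password-value"}
after_rotation = {"DB_USERNAME": "app_user", "DB_PASSWORD": "new-password-value"}

deployed_fingerprint = secret_fingerprint(before_rotation)
print(needs_restart(deployed_fingerprint, before_rotation))
print(needs_restart(deployed_fingerprint, after_rotation))
```

Sorting the keys before hashing keeps the fingerprint stable when the same data arrives in a different key order, so only a real change triggers a restart.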

How to Choose Between External Secrets Operator and the CSI Driver

Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the Secrets Store CSI Driver.

Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:

Feature	External Secrets Operator	Secrets Store CSI Driver
Creates a Kubernetes Secret	Yes	No by default
Supports `envFrom`	Yes	No (workaround only)
Secret stored in etcd	Yes (base64)	No, if you skip sync
Rotation	ESO updates the Secret, Reloader restarts pods	Volume file can update in place
Best for	Most teams. Multi-cloud, env var support	Security policies that prohibit secrets in etcd

This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both envFrom and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.

Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native envFrom model.

How to Deploy the Pattern on Amazon Elastic Kubernetes Service

The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.

Step 1: Prepare Terraform and OpenID Connect Access

The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the terraform/github-oidc folder.

cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn

Copy the role ARN from the output. You'll need it in the next step.

Step 2: Set the Required Environment Variable

The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.

To find your AWS account ID, run:

aws sts get-caller-identity --query Account --output text

Then set the variable, replacing ACCOUNT with the number that command returns.

export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role

Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service

bash spinup.sh --cluster eks

When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing Match ✓.

Step 4: Test Rotation on the Deployed App

After you confirm normal operation, run the rotation test the same way you did locally.

bash rotation/test-rotation.sh

Then use /secrets/compare on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.

⚠️ Cost warning: Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run bash teardown.sh from the repo root to destroy all AWS resources and stop charges.

How to Configure GitHub Actions Without Stored AWS Credentials

The typical CI/CD setup stores AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in GitHub repository secrets. Those keys don't rotate unless someone rotates them by hand. Anyone who can modify a workflow in the repo can exfiltrate them. When someone leaves the team, you have to revoke keys and update every workflow.

OpenID Connect eliminates that problem entirely.

How OpenID Connect Works for GitHub Actions

GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via AssumeRoleWithWebIdentity. No long-lived keys are ever stored anywhere.
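The trust policy is the piece that ties a role to one repository and branch. The Python sketch below builds a policy document of the usual shape for GitHub's OpenID Connect provider. The account ID, repository, and branch are placeholders, and the policy the lab's Terraform generates may differ in detail.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str) -> dict:
    """Build an IAM trust policy limited to one GitHub repo and branch."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # The token must be issued for AWS STS ...
                        f"{provider}:aud": "sts.amazonaws.com",
                        # ... and must come from this repository and branch.
                        f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder values; replace them with your own account and repository.
policy = github_oidc_trust_policy("123456789012", "YOUR_ORG/YOUR_REPO", "main")
print(json.dumps(policy, indent=2))
```

A workflow run from a fork or another branch presents a token with a different sub claim, so the condition fails and AWS refuses to issue credentials.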

Step 1: Create the IAM Role With Terraform

The terraform/github-oidc folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.

Step 2: Add the Role ARN to GitHub Repository Secrets

In your GitHub repository:

Go to Settings → Secrets and variables → Actions
Click New repository secret
Name it AWS_ROLE_ARN
Paste the role ARN from the Terraform output

That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.

Step 3: Configure Terraform State

For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.
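State locking exists so that two pipeline runs cannot write the same state at once. The Python sketch below models the idea with an in-memory table where a lock is written only if no lock exists yet. The class, state path, and run names are hypothetical stand-ins. This is not the DynamoDB API or Terraform's locking code.

```python
class LockTable:
    """In-memory stand-in for a lock table with write-if-absent semantics."""

    def __init__(self):
        self._locks = {}

    def acquire(self, state_path: str, owner: str) -> bool:
        """Take the lock only if nobody holds it for this state file."""
        if state_path in self._locks:
            return False
        self._locks[state_path] = owner
        return True

    def release(self, state_path: str, owner: str) -> bool:
        """Release the lock, but only for the run that holds it."""
        if self._locks.get(state_path) == owner:
            del self._locks[state_path]
            return True
        return False


table = LockTable()
state_path = "eks/terraform.tfstate"  # hypothetical state key

first = table.acquire(state_path, "ci-run-1")   # lock is free
second = table.acquire(state_path, "ci-run-2")  # blocked while run 1 applies
table.release(state_path, "ci-run-1")
third = table.acquire(state_path, "ci-run-2")   # free again after release
print(first, second, third)
```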

Step 4: Push to Main and Let Workflows Run

After your first spin-up, every push to the main branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use /secrets/compare to validate rotation behavior on the live environment.

How to Troubleshoot the Most Common Failures

Here's a shortlist of the most common symptoms and their fixes.

Symptom	Most Likely Cause	Fix
`ExternalSecret` is not syncing	Missing credentials or wrong store reference	Confirm the operator can access AWS Secrets Manager and that `secretStoreRef` points to the correct store
Pod is stuck in `Pending`	Missing storage setup for local cluster	For MicroK8s, enable the storage add-on
Env and volume no longer match after rotation	Rotation happened but the pod never restarted	Run `kubectl rollout restart` or install Reloader
CRD or API version mismatch	ESO version and manifest `apiVersion` don't match	Verify the `apiVersion` for `ClusterSecretStore` and `ExternalSecret` match your installed ESO version
Amazon Elastic Kubernetes Service node group never joins	Networking or IAM permissions for nodes are wrong	Fix internet routing and review the node IAM policy

How to Inspect the Operator and the ExternalSecret

When something isn't syncing, start with these two commands.

# Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

The status conditions on the ExternalSecret resource will usually tell you exactly what failed.

How to Validate Rotation From the App Side

When you are debugging rotation, don't rely only on Kubernetes resource state. Use the /secrets/compare endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.

Conclusion

You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the ExternalSecret and Deployment manifests, and validated that the application sees the right credentials.

You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.

Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.

The lab repository is at github.com/Osomudeya/k8s-secret-lab. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on MicroK8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.

If this helped you, star the repo and share it with someone who is learning Kubernetes.


How to Build a Production-Ready DevOps Pipeline with Free Tools

Opaluwa Emidowojo — Mon, 28 Apr 2025 20:15:34 +0000

A few months ago, I dove into DevOps, expecting it to be an expensive journey requiring costly tools and infrastructure. But I discovered you can build professional-grade pipelines using entirely free resources.

If DevOps feels out of reach because you're concerned about the cost, don't worry. I'll guide you step by step through creating a production-ready pipeline without spending a dime. Let's get started!

Prerequisites
Introduction
How to Set Up Your Source Control and Project Structure
How to Build Your CI Pipeline with GitHub Actions
How to Optimize Docker Builds for CI
Infrastructure as Code Using Terraform and Free Cloud Providers
How to Set Up Container Orchestration on Minimal Resources
How to Create a Free Deployment Pipeline
How to Build a Comprehensive Monitoring System
How to Implement Security Testing and Scanning
Performance Optimization and Scaling
Putting it All Together
Conclusion

🛠 Prerequisites

Basic Git knowledge: Cloning repos, creating branches, committing code, and creating PRs
Familiarity with command line: For Docker, Terraform, and Kubernetes
Basic understanding of CI/CD: Continuous integration/delivery concepts and pipelines

Accounts needed:

GitHub account
At least one cloud provider: AWS Free Tier (recommended), Oracle Cloud Free Tier, or Google Cloud/Azure with free credits
Terraform Cloud (free tier) for infrastructure state management
Grafana Cloud (free tier) for monitoring
UptimeRobot (free tier) for external availability checks

Tools to Install Locally

Tool	Purpose	Installation Link
Git	Version control	Install Git
Docker	Containerization	Install Docker
Node.js & npm	Sample app & builds	Install Node.js
Terraform	Infrastructure as Code	Install Terraform
kubectl	Kubernetes CLI	Install kubectl
k3d	Lightweight Kubernetes	Install k3d
Trivy	Container security scanning	Install Trivy
OWASP ZAP	Web security scanning	Install ZAP

Optional but Helpful:

VS Code or any good code editor
Postman for testing APIs
Understanding of YAML and Dockerfiles

Introduction

When people hear "DevOps," they often picture complex enterprise systems powered by pricey tools and premium cloud services. But the truth is, you don't actually need a massive budget to build a solid, professional-grade DevOps pipeline. The foundations of good DevOps – automation, consistency, security, and visibility – can be built entirely with free tools.

In this guide, you will learn how to build a production-ready DevOps pipeline using zero-cost resources. We will use a simple CRUD (Create, Read, Update, Delete) app with frontend, backend API, and database as our example project to demonstrate every step of the process.

How to Set Up Your Source Control and Project Structure

1. Create a Well-Structured Repository

A clean repo is the foundation of your pipeline. We will set up:

Separate folders for frontend, backend, and infrastructure
A .github folder to hold workflow configurations
Clear naming conventions and a well-written README.md

🛠 Tip: Use semantic commit messages and consider adopting Conventional Commits for clarity in versioning and changelogs.

2. Set Up Branch Protection Without Paid Features

While GitHub's more advanced rules require Pro, you can still:

Require pull requests before merging
Enable status checks to prevent broken code from landing in main
Enforce linear history for cleaner version control

💡 This makes your project safer and more collaborative, without needing GitHub Enterprise.

3. Implement PR Templates and Automated Checks

Make your reviews smoother:

Add a PULL_REQUEST_TEMPLATE.md to guide contributors
Use GitHub Actions (which we'll set up in the next part) for linting, tests, and formatting checks

✨ These tiny improvements add polish and professionalism.

4. Configure GitHub Issue Templates and Project Boards

Even solo developers benefit from issue tracking:

Add issue templates for bugs and features
Use GitHub Projects to manage work with a Kanban board, all free and native to GitHub

📌 Bonus: This setup lays the groundwork for GitOps practices later on.

5. Advanced Technique: Set Up Custom Validation Scripts as Pre-Commit Hooks

Before code ever hits GitHub, you can catch issues locally with Git hooks. Using a tool like Husky or pre-commit, you can:

Lint code before it's committed
Run tests or formatters automatically
Prevent secrets from being accidentally committed

# Initialize Husky and install needed dependencies (Husky v8 syntax; v9 replaced these commands with npx husky init)
# Then add a pre-commit hook that runs tests before allowing the commit
npx husky-init && npm install
npx husky add .husky/pre-commit "npm test"

6. Sample CRUD App Setup:

Our CRUD app manages users (create, read, update, delete). Below is the minimal code with comments to explain each part:

Backend (backend/):

backend/package.json (JSON does not allow comments, so keep this file comment-free). The start script runs the server, test is a placeholder to replace with Jest later, and lint checks code style with ESLint. express is the web framework for the API endpoints, and pg is the PostgreSQL client:

{
  "name": "crud-backend",
  "version": "1.0.0",
  "scripts": {
    "start": "node index.js",
    "test": "echo 'Add tests here'",
    "lint": "eslint ."
  },
  "dependencies": {
    "express": "^4.17.1",
    "pg": "^8.7.3"
  },
  "devDependencies": {
    "eslint": "^8.0.0"
  }
}

// backend/index.js
const express = require('express'); // Import Express for building the API
const { Pool } = require('pg'); // Import PostgreSQL client
const app = express(); // Create an Express app
app.use(express.json()); // Parse JSON request bodies

// Connect to PostgreSQL using DATABASE_URL from environment variables
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Health check endpoint for Kubernetes probes and monitoring
app.get('/healthz', (req, res) => res.json({ status: 'ok' }));

// Get all users from the database
app.get('/users', async (req, res) => {
  try {
    const { rows } = await pool.query('SELECT * FROM users'); // Query the users table
    res.json(rows); // Send users as JSON
  } catch (err) {
    // Express 4 does not catch rejected promises, so handle database errors here
    console.error(err);
    res.status(500).json({ error: 'Failed to fetch users' });
  }
});

// Add a new user to the database
app.post('/users', async (req, res) => {
  const { name } = req.body; // Get name from request body
  try {
    // Insert user and return the new record
    const { rows } = await pool.query('INSERT INTO users(name) VALUES($1) RETURNING *', [name]);
    res.json(rows[0]); // Send the new user as JSON
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'Failed to add user' });
  }
});

// Start the server on port 3000
app.listen(3000, () => console.log('Backend running on port 3000'));

Frontend (frontend/):

frontend/package.json (again without comments, since JSON does not allow them). The scripts run the dev server, build for production, run tests, and lint with ESLint. react and react-dom are the core libraries, react-scripts provides the development tooling, and axios is the HTTP client for API calls:

{
  "name": "crud-frontend",
  "version": "1.0.0",
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test",
    "lint": "eslint ."
  },
  "dependencies": {
    "react": "^17.0.2",
    "react-dom": "^17.0.2",
    "react-scripts": "^4.0.3",
    "axios": "^0.24.0"
  },
  "devDependencies": {
    "eslint": "^8.0.0"
  }
}

// frontend/src/App.js
import React, { useState, useEffect } from 'react'; // Import React and hooks
import axios from 'axios'; // Import Axios for API requests

function App() {
  // State for storing users fetched from the backend
  const [users, setUsers] = useState([]);
  // State for the input field to add a new user
  const [name, setName] = useState('');

  // Fetch users when the component mounts
  useEffect(() => {
    axios.get('http://localhost:3000/users').then(res => setUsers(res.data));
  }, []); // Empty array means run once on mount

  // Add a new user via the API
  const addUser = async () => {
    const res = await axios.post('http://localhost:3000/users', { name }); // Post new user
    setUsers([...users, res.data]); // Update users list
    setName(''); // Clear input field
  };

  return (
    <div>
      <h1>Users</h1>
      {/* Input for new user name */}
      <input value={name} onChange={e => setName(e.target.value)} />
      {/* Button to add user */}
      <button onClick={addUser}>Add User</button>
      {/* List all users */}
      <ul>{users.map(user => <li key={user.id}>{user.name}</li>)}</ul>
    </div>
  );
}

export default App; // Export the component

Database Setup:

-- infra/db.sql
-- Create a table to store users
CREATE TABLE users (
  id SERIAL PRIMARY KEY, -- Auto-incrementing ID
  name VARCHAR(100) NOT NULL -- User name, required
);

Project Structure:

crud-app/
├── backend/
│   ├── package.json
│   └── index.js
├── frontend/
│   ├── package.json
│   └── src/App.js
├── infra/
│   └── db.sql
├── .github/
│   └── workflows/
└── README.md

This app provides a /users endpoint (GET/POST) and a frontend to list/add users, stored in PostgreSQL. The /healthz endpoint supports monitoring. Save this code in your repo to follow the pipeline steps.

How to Build Your CI Pipeline with GitHub Actions

1. Set Up Your First GitHub Actions Workflow

First, let’s create a basic workflow that automatically builds, tests, and lints your app every time you push code or open a pull request. This ensures your app stays healthy and any issues are caught early.

Create a file at .github/workflows/ci.yml and add the following:

# CI workflow to build, test, and lint the CRUD app on push or pull request
name: CI Pipeline
on:
  push:
    branches: [main] # Trigger on pushes to main branch
  pull_request:
    branches: [main] # Trigger on PRs to main branch
jobs:
  build:
    runs-on: ubuntu-latest # Use GitHub's free Linux runner
    steps:
      - uses: actions/checkout@v3 # Check out the repository code
      - name: Set up Node.js # Install Node.js environment
        uses: actions/setup-node@v3
        with:
          node-version: '18' # Use Node.js 18 for consistency
      - name: Cache dependencies # Cache npm's download cache to speed up installs
        uses: actions/cache@v3
        with:
          path: ~/.npm # Cache npm’s global cache
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} # Key based on OS and package-lock.json
      - run: npm ci # Install dependencies reliably using package-lock.json
      - run: npm test # Run tests defined in package.json
      - run: npm run lint # Run ESLint to ensure code quality

This workflow automatically runs on every push and pull request to the main branch. It installs dependencies, runs tests, and performs code linting, with dependency caching to make builds faster over time.
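
To see why the cache key includes a hash of package-lock.json, here is a rough JavaScript sketch of the idea. It is not GitHub's implementation: buildCacheKey is a hypothetical helper, and SHA-256 is used only for illustration. The point is that the key changes exactly when the lockfile contents change.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a cache key from the runner OS and the lockfile contents.
function buildCacheKey(runnerOs, lockfileContents) {
  const digest = crypto.createHash('sha256').update(lockfileContents).digest('hex');
  return `${runnerOs}-node-${digest}`;
}

const before = buildCacheKey('Linux', '{"lockfileVersion": 3, "packages": {}}');
const same = buildCacheKey('Linux', '{"lockfileVersion": 3, "packages": {}}');
const after = buildCacheKey('Linux', '{"lockfileVersion": 3, "packages": {"express": {}}}');

console.log(before === same);  // true: unchanged lockfile, same key, cache hit
console.log(before === after); // false: changed lockfile, new key, cache miss
```

When the key misses, the workflow simply installs from scratch and saves a fresh cache under the new key.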

Common Issues and Fixes:

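“Secret not found”: This appears once a workflow reads secrets, such as the AWS deployment workflow later in this guide. Ensure AWS_ACCESS_KEY_ID is in your repository secrets (Settings → Secrets and variables → Actions).
Tests fail: The sample app ships only a placeholder test script. Once you add real tests (for example, test/users.test.js), check that they can reach the database.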

Understanding GitHub Actions' Free Tier Limits

Before building more workflows, it is important to know what GitHub offers for free.

If you are working on private repositories, the GitHub Free plan includes 2,000 minutes per month. For public repositories, standard GitHub-hosted runners are free to use.

To avoid hitting limits quickly:

Cache your dependencies to cut down install times.
Only trigger workflows on meaningful branches (like main or release).
Skip unnecessary steps when you can.
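
A quick back-of-the-envelope calculation shows how fast those minutes go. The sketch below is only an estimate: the function name and all input numbers are made up for illustration, and it assumes each job is billed in whole minutes on a Linux runner.

```javascript
// Hypothetical estimator for monthly CI minutes on a private repository.
function estimateMonthlyMinutes({ runsPerDay, jobsPerRun, minutesPerJob, workDays }) {
  // Assume every job is rounded up to a whole minute before it is counted
  return runsPerDay * jobsPerRun * Math.ceil(minutesPerJob) * workDays;
}

const FREE_MINUTES = 2000; // Monthly allowance quoted above for private repositories

const used = estimateMonthlyMinutes({
  runsPerDay: 10,     // Pushes and PR updates that trigger the workflow
  jobsPerRun: 3,      // For example: install, test, lint
  minutesPerJob: 2.5, // Average job duration
  workDays: 22,
});

console.log(used);                 // 1980
console.log(used <= FREE_MINUTES); // true, but with very little headroom
```

Shaving even half a minute off each job through caching, or skipping runs on unimportant branches, changes this total noticeably.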

2. Creating a Multi-Stage Build Pipeline

As your app grows, it is better to split your CI pipeline into clear stages like install, test, and lint. This structure makes workflows easier to maintain and speeds things up, because some jobs can run in parallel.

Here’s how you can split the work into multiple jobs for better clarity:

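jobs:
  install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: npm  # Save and restore npm's download cache
      - run: npm ci  # Clean install of dependencies; also warms the cache

  test:
    needs: install  # This job depends on the install job finishing
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: npm
      - run: npm ci  # Each job starts on a fresh runner, so install again (fast with a warm cache)
      - run: npm test  # Run test suite

  lint:
    needs: install  # This job also depends on install but runs in parallel with test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: npm
      - run: npm ci  # Fresh runner here too
      - run: npm run lint  # Run linting checks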

By breaking the pipeline into stages, you can quickly spot which step fails, and your test and lint jobs can run at the same time after dependencies are installed.

3. Implement Matrix Builds for Cross-Environment Testing

When you want your app to work across different Node.js versions or databases, matrix builds are your best bet. They let you test across multiple environments in parallel, without duplicating code.

Here’s how you can set up a matrix strategy, to test across multiple environments simultaneously:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [14.x, 16.x, 18.x]  # Test on multiple Node versions
        database: [postgres, mysql]        # Test against different databases
    steps:
      - uses: actions/checkout@v3
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm install
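      - run: npm test  # Runs once per combination: 3 Node versions × 2 databases = 6 jobs
        env:
          DB_ENGINE: ${{ matrix.database }}  # Hypothetical variable name; your tests must read it to pick the database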

Matrix builds save time and help you catch environment-specific bugs early.
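
To make the "3 × 2 = 6" arithmetic concrete, here is a small JavaScript sketch of how a matrix expands into one job per combination. It only illustrates the Cartesian product; expandMatrix is a hypothetical helper, not part of GitHub Actions.

```javascript
// Hypothetical helper: expand a matrix object into every combination of its values.
function expandMatrix(matrix) {
  return Object.entries(matrix).reduce(
    (combos, [key, values]) =>
      // For each partial combination, add one copy per value of the current key
      combos.flatMap(combo => values.map(value => ({ ...combo, [key]: value }))),
    [{}]
  );
}

const combos = expandMatrix({
  'node-version': ['14.x', '16.x', '18.x'],
  database: ['postgres', 'mysql'],
});

console.log(combos.length); // 6
console.log(combos[0]);     // { 'node-version': '14.x', database: 'postgres' }
```

Every value you add to a matrix dimension multiplies the job count, which matters when you are watching free CI minutes.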

4. Optimize Workflow with Dependency Caching

Every second counts in CI. Dependency caching can help save minutes in your workflow by reusing previously installed packages instead of reinstalling them from scratch every time.

Here’s how to set up smart caching to speed up your builds:

- name: Cache npm downloads
  uses: actions/cache@v3
  with:
    path: ~/.npm  # Cache npm's download cache; npm ci deletes node_modules before installing, so caching that folder does not help
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}  # Cache key based on OS and dependencies
    restore-keys: |  # Fallback keys if exact match isn't found
      ${{ runner.os }}-node-

This cache setup checks if your dependencies have changed. If not, it restores the cache, making builds significantly faster.

How to Optimize Docker Builds for CI

When you're building Docker images in CI, build time can quickly become a bottleneck, especially if your images are large. Optimizing your Docker builds makes your pipelines much faster, saves bandwidth, and produces smaller, more efficient images ready for deployment.

In this section, I’ll walk through creating a basic Dockerfile, using multi-stage builds, caching layers, and enabling BuildKit for even faster builds.

1. Create a Baseline Dockerfile

First, start with a simple Dockerfile that installs your app’s dependencies and runs it. This is what you’ll be optimizing later.

# Simple Dockerfile for a Node.js application
# Use Alpine for a smaller base image
FROM node:18-alpine
# Set working directory
WORKDIR /app
# Copy all files to container
COPY . .
# Install dependencies (clean install)
RUN npm ci
# Start the application
CMD ["npm", "start"]

Using an Alpine-based Node.js image helps keep your image small from the start.

2. Multi-Stage Docker Builds

Next, let's separate the build process from the production image. Multi-stage builds let you compile or build your app in one stage and only copy over the final product to a clean, smaller image. This keeps production images lean:

# Stage 1: Build the application
FROM node:18-alpine AS builder
WORKDIR /app
# Copy package files first for better caching
COPY package*.json ./
# Install all dependencies
RUN npm ci
# Then copy source code
COPY . .
# Build the application
RUN npm run build

# Stage 2: Production image with minimal footprint
FROM node:18-alpine
WORKDIR /app
# Only copy built assets and production dependencies
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./
# Install only production dependencies
RUN npm ci --omit=dev
# Run the built application
CMD ["node", "dist/server.js"]

This approach keeps your production images lightweight and secure by excluding unnecessary build tools and dev dependencies.

3. Optimizing Layer Caching

For even faster builds, order your Dockerfile instructions to maximize layer caching. Copy and install dependencies before copying your full source code.

This way, Docker reuses the cached npm install step if your dependencies haven't changed, even if you edit your app's code:

First: COPY package*.json ./
Then: RUN npm ci
Finally: COPY . .
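
The effect of that ordering can be sketched with a deliberately simplified model of Docker's layer cache: once one step is invalidated, that step and every step after it are rebuilt. The helper names and the invalidatedBy predicates below are assumptions for illustration, not Docker's real cache logic.

```javascript
// Simplified model: return the instructions that must be rebuilt after some files change.
function rebuiltSteps(steps, changedFiles) {
  const first = steps.findIndex(step => changedFiles.some(file => step.invalidatedBy(file)));
  return first === -1 ? [] : steps.slice(first).map(step => step.instruction);
}

const isManifest = file => /^package(-lock)?\.json$/.test(file);

// Dependencies are copied and installed before the source code
const optimized = [
  { instruction: 'COPY package*.json ./', invalidatedBy: isManifest },
  { instruction: 'RUN npm ci', invalidatedBy: () => false },
  { instruction: 'COPY . .', invalidatedBy: () => true },
];

// Everything is copied first, as in the baseline Dockerfile
const naive = [
  { instruction: 'COPY . .', invalidatedBy: () => true },
  { instruction: 'RUN npm ci', invalidatedBy: () => false },
];

console.log(rebuiltSteps(optimized, ['index.js'])); // [ 'COPY . .' ]
console.log(rebuiltSteps(naive, ['index.js']));     // [ 'COPY . .', 'RUN npm ci' ]
```

With the optimized order, a source-only change skips the slow install step; only a change to package.json or package-lock.json triggers it again.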

4. Enable BuildKit for Faster Builds

Docker BuildKit is a newer build engine that enables features like better caching, parallel build steps, and overall faster builds.

To enable BuildKit during your CI, run:

- name: Build Docker image
  run: |
    # Enable BuildKit for parallel and more efficient builds
    DOCKER_BUILDKIT=1 docker build -t myapp:latest .

Turning on BuildKit can significantly speed up complex Docker builds. It is the default builder in Docker Engine 23.0 and later, so you only need to set DOCKER_BUILDKIT=1 on older versions.

Infrastructure as Code Using Terraform and Free Cloud Providers

Why Infrastructure as Code (IaC) Matters

When you manage infrastructure manually – that is, clicking around cloud dashboards or setting things up by hand – it’s easy to lose track of what you did and how to repeat it.

Infrastructure as Code (IaC) solves this by letting you define your infrastructure with code, version it just like application code, and track every change over time. This makes your setups easy to replicate across environments (development, staging, production), ensures changes are declarative and auditable, and reduces human error.

Whether you are spinning up a single server or scaling a complex system, IaC lays the foundation for professional-grade infrastructure from day one, letting you automate, document, and grow your environment systematically.

How to Provision Infrastructure with Terraform

Initialize a Terraform Project

First, define the providers and versions you need. Here, we’re using Render’s free cloud hosting service:

# Define required providers and versions
terraform {
  required_providers {
    render = {
      source  = "renderinc/render"  # Using Render's free tier
      version = "0.1.0"             # Specify provider version for stability
    }
  }
}

# Configure the Render provider with authentication
provider "render" {
  api_key = var.render_api_key  # Store API key as a variable
}

Then, configure the provider by authenticating with your API key. It is best practice to store secrets like API keys in variables instead of hardcoding them. This setup tells Terraform what platform you’re working with (Render) and how to authenticate to manage resources automatically.

Provision a Web App on Render

Next, define the infrastructure you want – in this case, a web service hosted on Render:

# Define a web service on Render's free tier
resource "render_service" "web_app" {
  name = "ci-demo-app"                                 # Service name
  type = "web_service"                                 # Type of service
  repo = "https://github.com/YOUR-USERNAME/YOUR-REPO"  # Source repo
  env = "docker"                                       # Use Docker environment
  plan = "starter"                                     # Free tier plan
  branch = "main"                                      # Deploy from main branch
  build_command = "docker build -t app ."              # Build command
  start_command = "docker run -p 3000:3000 app"        # Start command
  auto_deploy = true                                   # Auto-deploy on commits
}

This resource block describes exactly how your app should be deployed. Whenever you change this file and reapply, Terraform will update the infrastructure to match.

Provision PostgreSQL for Free

Most applications need a database, but you don't have to pay for one when you're getting started. Platforms like Railway offer free tiers that are perfect for development and small projects.

You can quickly create a free PostgreSQL instance by signing up on the platform and clicking "Create New Project". At the end, you'll get a DATABASE_URL, a connection string that your app will use to talk to the database.

Connect App to DB

In Render (or whatever platform you're using), set an environment variable called DATABASE_URL and paste in the connection string from your PostgreSQL provider. This lets your application securely access the database without hardcoding credentials into your codebase.

Make it Reproducible

Once everything is defined, use Terraform to create and apply an infrastructure plan:

# Create execution plan and save it to a file
terraform plan -out=infra.tfplan
# Apply the saved plan exactly as planned
terraform apply infra.tfplan

Saving the plan to a file (infra.tfplan) ensures you’re applying exactly what you reviewed, so there will be no surprises.

Common Issues and Fixes:

Provider not found: Run terraform init.
API key error: Check render_api_key in Terraform Cloud variables.

How to Set Up Container Orchestration on Minimal Resources

When you're working with limited resources like a laptop, a small server, or a lightweight cloud VM, setting up full Kubernetes can be overwhelming. Instead, you can use K3d, a lightweight wrapper that runs k3s (a minimal Kubernetes distribution) inside Docker containers. Here's how to set up a minimal, efficient cluster for local development or testing.

1. Install K3d for Local Kubernetes

First, install K3d. It's a super lightweight way to run Kubernetes clusters inside Docker without needing a heavy setup like Minikube.

# Download and install K3d - a lightweight K8s distribution
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

2. Create a Lightweight K3d Cluster

Once K3d is installed, you can spin up a cluster with minimal nodes to save resources.

# Create a minimal K8s cluster with 1 server and 2 agent nodes
# --servers 1: single server node to minimize resource usage
# --agents 2: two worker nodes for pod distribution
# --volume: mount a local directory for persistence
# --port: map port 8080 locally to port 80 on the cluster load balancer
# --api-port: set the Kubernetes API port
k3d cluster create dev-cluster \
  --servers 1 \
  --agents 2 \
  --volume /tmp/k3dvol:/tmp/k3dvol \
  --port 8080:80@loadbalancer \
  --api-port 6443

This setup gives you a tiny but real Kubernetes cluster that is perfect for experimentation.

3. Deploy with Optimized Kubernetes Manifests

Now that your cluster is running, you can deploy your app. It's important to define resource requests and limits carefully so your pods don’t consume too much memory or CPU.

# Resource-optimized deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp  # Name of the deployment
spec:
  replicas: 1   # Single replica to save resources
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: app
          image: myapp:latest
          resources:
            # Set minimal resource requests
            requests:
              memory: "64Mi"   # Request only 64MB memory
              cpu: "50m"       # Request only 5% of a CPU core
            # Set reasonable limits
            limits:
              memory: "128Mi"  # Limit to 128MB memory
              cpu: "100m"      # Limit to 10% of a CPU core

This tells Kubernetes how much to allocate to each pod and avoids overloading your lightweight environment.
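
If the units in the manifest look unfamiliar, the following JavaScript sketch converts them to plain numbers. It covers only the suffixes used here plus Ki and Gi, not the full Kubernetes quantity syntax, and both function names are hypothetical.

```javascript
// Convert a CPU quantity to cores: "50m" means 50 millicores, or 5% of one core.
function parseCpu(quantity) {
  return quantity.endsWith('m') ? Number(quantity.slice(0, -1)) / 1000 : Number(quantity);
}

const BINARY_UNITS = { Ki: 1024, Mi: 1024 ** 2, Gi: 1024 ** 3 };

// Convert a memory quantity with a binary suffix (Ki, Mi, Gi) to bytes.
function parseMemory(quantity) {
  const match = /^(\d+)(Ki|Mi|Gi)?$/.exec(quantity);
  if (!match) throw new Error(`Unsupported memory quantity: ${quantity}`);
  const [, amount, unit] = match;
  return Number(amount) * (unit ? BINARY_UNITS[unit] : 1);
}

console.log(parseCpu('50m'));      // 0.05
console.log(parseCpu('100m'));     // 0.1
console.log(parseMemory('64Mi'));  // 67108864
console.log(parseMemory('128Mi')); // 134217728
```

Note that Mi is a binary unit (1024 × 1024 bytes), so 64Mi is slightly more than 64 decimal megabytes.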

4. Set up GitOps with Flux

To manage deployments automatically from your GitHub repository, you can set up GitOps using Flux.

# Install Flux CLI
brew install fluxcd/tap/flux

# Bootstrap Flux on your cluster connected to your GitHub repository
# --owner: your GitHub username
# --repository: repository to store Flux manifests
# --branch: branch to use
# --path: path within the repo for cluster configs
# --personal: the owner is a personal account, not an organization
flux bootstrap github \
  --owner=YOUR_GITHUB_USERNAME \
  --repository=YOUR_REPO_NAME \
  --branch=main \
  --path=clusters/dev-cluster \
  --personal

Flux watches your repo and applies updates to your cluster, keeping everything declarative and reproducible.

Common Issues and Fixes:

Pods crash: Run kubectl logs pod-name or increase resources.
Flux sync fails: Check GitHub token permissions.

How to Create a Free Deployment Pipeline

Like I said initially, not every project needs expensive infrastructure. If you're just getting started or building side projects, free tiers from cloud providers can cover a lot of ground.

1. Understanding Free Tier Limitations

Here’s a quick overview of popular cloud free tiers:

Provider	Free Tier Highlights
AWS Free Tier	750 hours/month EC2, 5GB S3, 1M Lambda requests
Oracle Cloud Free Tier	2 always-free compute instances, 30GB storage
Google Cloud Free Tier	1 f1-micro instance, 5GB storage

Knowing these limits helps you stay within budget.

2. Set Up Deployment Workflows

You can automate deployments with GitHub Actions. Here's an example of a deployment workflow to AWS:

# GitHub Action workflow for deploying to AWS
name: AWS Deployment

on:
  push:
    branches:
      - main  # Deploy on push to main branch

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3  # Check out code

      # Set up AWS credentials from GitHub secrets
      - name: Set up AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      # Build the Docker image
      - name: Build Docker Image
        run: docker build -t myapp .

      # Push the image to AWS ECR
      - name: Push Docker Image to ECR
        run: |
          # Create repository if it doesn't exist (ignoring errors if it does)
          aws ecr create-repository --repository-name myapp || true

          # Login to ECR
          aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

          # Tag and push the image (replace YOUR_AWS_ACCOUNT_ID with your 12-digit AWS account ID)
          docker tag myapp:latest YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
          docker push YOUR_AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/myapp:latest

3. Implement Zero-Downtime Deployments

Zero downtime is crucial. Kubernetes makes this easy with rolling updates:

# Kubernetes deployment configured for zero-downtime updates
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crud-app
spec:
  replicas: 3  # Multiple replicas for high availability
  selector:
    matchLabels:
      app: crud-app
  template:
    metadata:
      labels:
        app: crud-app
    spec:
      containers:
      - name: app
        image: YOUR_REGISTRY/crud-app:latest  # Replace YOUR_REGISTRY with your image registry
        ports:
        - containerPort: 3000  # Port the sample backend listens on

By having multiple replicas, you ensure that some pods stay live during updates.
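
The manifest above does not set an update strategy, so Kubernetes applies its RollingUpdate defaults of 25% maxSurge (rounded up) and 25% maxUnavailable (rounded down). This JavaScript sketch works out what those defaults mean for a given replica count; rollingUpdateBounds is a hypothetical helper that only reproduces the arithmetic.

```javascript
// Hypothetical helper: bounds on pod counts during a rolling update.
function rollingUpdateBounds(replicas, maxSurgePercent = 25, maxUnavailablePercent = 25) {
  const maxSurge = Math.ceil((replicas * maxSurgePercent) / 100);             // Rounded up
  const maxUnavailable = Math.floor((replicas * maxUnavailablePercent) / 100); // Rounded down
  return {
    minAvailable: replicas - maxUnavailable, // Fewest pods serving at any time
    maxTotal: replicas + maxSurge,           // Most pods running at any time
  };
}

console.log(rollingUpdateBounds(3));  // { minAvailable: 3, maxTotal: 4 }
console.log(rollingUpdateBounds(10)); // { minAvailable: 8, maxTotal: 13 }
```

With 3 replicas, all three keep serving while one extra pod at a time is started with the new image, which is what makes the update zero-downtime.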

4. Create Cross-Cloud Deployment for Redundancy

If you want better reliability, you can deploy across different clouds in parallel:

# Deploy to multiple cloud providers for redundancy
name: Cross-Cloud Deployment

on:
  push:
    branches:
      - main

jobs:
  # Deploy to AWS
  aws-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: AWS Setup & Deploy
        run: |
          # Configure AWS CLI with credentials
          aws configure set aws_access_key_id ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws configure set aws_secret_access_key ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          # AWS deployment commands...

  # Deploy to Oracle Cloud in parallel
  oracle-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Oracle Setup & Deploy
        run: |
          # Configure Oracle Cloud CLI
          oci setup config
          # Oracle Cloud deployment commands...

Now if one cloud goes down, the other is still up.

5. Implement Automated Rollbacks with Health Checks

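Set up health checks so Kubernetes can detect unhealthy pods. During a rolling update, new pods that fail their readiness probe never receive traffic, so the rollout stalls while the old pods keep serving. Kubernetes does not revert the Deployment on its own. To roll back, run kubectl rollout undo deployment/crud-app or automate that step in your pipeline: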

# Deployment with health checks for automated rollbacks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crud-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crud-app
  template:
    metadata:
      labels:
        app: crud-app
    spec:
      containers:
      - name: crud-app
        image: YOUR_REGISTRY/crud-app:latest  # Replace YOUR_REGISTRY with your image registry
        ports:
        - containerPort: 3000  # Port the sample backend listens on
        # Check if the container is alive
        livenessProbe:
          httpGet:
            path: /healthz  # Health check endpoint
            port: 3000
          initialDelaySeconds: 5  # Wait before first check
          periodSeconds: 10       # Check every 10 seconds
        # Check if the container is ready to receive traffic
        readinessProbe:
          httpGet:
            path: /readiness  # Readiness endpoint; the sample backend does not define it yet, so add it or reuse /healthz
            port: 3000
          initialDelaySeconds: 5  # Wait before first check
          periodSeconds: 10       # Check every 10 seconds
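
The sample backend defines /healthz but no /readiness route, which the readiness probe above requests. Here is one possible shape for that logic in JavaScript. It is a sketch under assumptions: probeResponse and the databaseReachable flag are hypothetical, and you would call the function from an Express route after checking the database (for example with a SELECT 1 query).

```javascript
// Hypothetical helper: decide the HTTP response for a probe request.
function probeResponse(path, { databaseReachable }) {
  if (path === '/healthz') {
    // Liveness: the process is up and able to answer
    return { status: 200, body: { status: 'ok' } };
  }
  if (path === '/readiness') {
    // Readiness: only accept traffic when dependencies are usable
    return databaseReachable
      ? { status: 200, body: { status: 'ready' } }
      : { status: 503, body: { status: 'not ready' } };
  }
  return { status: 404, body: { error: 'not found' } };
}

console.log(probeResponse('/healthz', { databaseReachable: false }).status);   // 200
console.log(probeResponse('/readiness', { databaseReachable: false }).status); // 503
console.log(probeResponse('/readiness', { databaseReachable: true }).status);  // 200
```

Kubernetes treats any status from 200 to 399 as a passing HTTP probe, so the 503 keeps a pod out of rotation without restarting it. Keeping liveness independent of the database avoids restart loops during a database outage.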

How to Build a Comprehensive Monitoring System

Even with a small deployment, monitoring is key to spotting issues early. So now, I’ll walk through setting up a comprehensive monitoring system for your application.

You'll learn how to integrate Grafana Cloud for visualizing your metrics, use Prometheus for collecting data, and configure custom alerts to monitor your app's performance. I’ll also cover tracking Service Level Objectives (SLOs) and setting up external monitoring with UptimeRobot to make sure that your endpoints are always available.

1. Set Up Grafana Cloud's Free Tier

Create a Grafana Cloud account and connect Prometheus as a data source. They offer generous free usage, which is perfect for small teams.

2. Configure Prometheus for Metrics Collection

Prometheus collects metrics from your app.

# prometheus.yml - Basic Prometheus configuration
global:
  scrape_interval: 15s  # Collect metrics every 15 seconds
scrape_configs:
  - job_name: 'crud-app'  # Job name for the crud-app metrics
    static_configs:
      - targets: ['localhost:8080']  # Where to collect metrics from

This scrapes your app every 15 seconds for metrics.

3. Create Monitoring Dashboards

Grafana visualizes Prometheus data. You can create dashboards using queries like:

# Calculate average CPU usage rate per instance over 1 minute
avg(rate(cpu_usage_seconds_total[1m])) by (instance)

This calculates average CPU usage over the last minute per instance.

4. Write Custom PromQL Queries for Alerts

You can create smart alerts to detect increasing error rates, like the below:

# Calculate error rate as a percentage of total requests
# Alert when error rate exceeds 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / 
sum(rate(http_requests_total[5m])) by (service) > 0.05

This alerts if more than 5% of your traffic results in errors.
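
The arithmetic behind that query can be sketched in JavaScript. This is a simplification: Prometheus's real rate() also handles counter resets and extrapolation, and the function names and sample counter values here are invented for illustration.

```javascript
// Simplified per-second rate: counter increase divided by the window length.
function simpleRate(startValue, endValue, windowSeconds) {
  return (endValue - startValue) / windowSeconds;
}

// Ratio of the error rate to the total request rate over the same window.
function errorRatio(errors, total, windowSeconds) {
  const errorRate = simpleRate(errors.start, errors.end, windowSeconds);
  const totalRate = simpleRate(total.start, total.end, windowSeconds);
  return totalRate === 0 ? 0 : errorRate / totalRate;
}

// Over 5 minutes (300 seconds): 30 new 5xx responses out of 400 new requests
const ratio = errorRatio({ start: 100, end: 130 }, { start: 10000, end: 10400 }, 300);

console.log(ratio > 0.05); // true: roughly 7.5% of requests failed, so the alert would fire
```

Because both rates use the same window, the window length cancels out and the ratio is simply new errors divided by new requests.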

5. Implement SLO Tracking on a Budget

You can track Service Level Objectives (SLOs) with Prometheus for free:

# Calculate the fraction of requests completed in under 200ms
# Alert when it drops below 99%
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
< 0.99

This fires when fewer than 99% of requests complete in under 200ms.
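
The query above divides the requests in the le="0.2" histogram bucket by all requests. A related idea is the error budget: how many slow requests the objective still allows. The JavaScript sketch below uses invented numbers, and errorBudget is a hypothetical helper, not a Prometheus feature.

```javascript
// Hypothetical helper: compare slow requests against what the SLO target allows.
function errorBudget(totalRequests, fastRequests, target) {
  const allowedSlow = Math.round(totalRequests * (1 - target)); // Slow requests the SLO tolerates
  const actualSlow = totalRequests - fastRequests;              // Slow requests observed
  return { allowedSlow, actualSlow, remaining: allowedSlow - actualSlow };
}

// 10,000 requests, 9,950 of them under 200ms, against a 99% target
console.log(errorBudget(10000, 9950, 0.99));
// { allowedSlow: 100, actualSlow: 50, remaining: 50 }
```

A negative remaining value means the objective has been missed for that period.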

6. Set Up UptimeRobot for External Monitoring

Finally, you can use UptimeRobot to check if your endpoints are reachable externally, and get alerts if anything goes down.

How to Implement Security Testing and Scanning

Security should be integrated into your development pipeline from the start, not added as an afterthought. In this section, I’ll show you how to implement security testing and scanning at various stages of your workflow.

You’ll use GitHub CodeQL for static code analysis, OWASP ZAP for scanning web vulnerabilities, and Trivy for container image scanning. You’ll also learn how to enforce security thresholds directly in your CI pipeline.

1. Enable GitHub Code Scanning with CodeQL

GitHub has built-in code scanning with CodeQL. Here’s how to set it up:

# GitHub workflow for CodeQL security scanning
name: CodeQL

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  analyze:
    name: Analyze code with CodeQL
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      # Initialize the CodeQL scanning tools
      - name: Set up CodeQL
        uses: github/codeql-action/init@v2

      # Run the analysis and generate results
      - name: Analyze code
        uses: github/codeql-action/analyze@v2

This automatically checks your code for security vulnerabilities.

2. Integrate OWASP ZAP into Your CI Pipeline

You can also scan your deployed app with OWASP ZAP like this:

# Automated security scanning with OWASP ZAP
name: ZAP Scan

on:
  push:
    branches:
      - main

jobs:
  zap-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      # Run the ZAP security scan against deployed application
      - name: Run ZAP security scan
        uses: zaproxy/action-full-scan@v0.3.0
        with:
          target: 'https://yourapp.com'  # URL to scan

This checks for common web vulnerabilities.

3. Set Up Trivy for Container Vulnerability Scanning

You can also check your container images for vulnerabilities with Trivy:

# Scan Docker images for vulnerabilities using Trivy
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'crud-app:latest'   # Image to scan
    format: 'table'             # Output format
    exit-code: '1'              # Fail the build if vulnerabilities found
    ignore-unfixed: true        # Skip vulnerabilities without fixes
    severity: 'CRITICAL,HIGH'   # Only alert on critical and high severity

Your builds will fail if serious issues are found, keeping you safe by default.

4. Create Threshold-Based Pipeline Failures

You can configure your pipelines to fail automatically if vulnerabilities exceed a set threshold, enforcing strong security practices without manual effort. Here’s how that should look:

# Fail the pipeline if critical or high vulnerabilities are found
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'crud-app:latest'   # Image to scan
    format: 'json'              # Output as JSON for parsing
    exit-code: '1'              # Fail the build if vulnerabilities found
    severity: 'CRITICAL,HIGH'   # Check for critical and high severity issues
    ignore-unfixed: true        # Skip vulnerabilities without fixes

This forces a no-compromise security posture – that is, if critical or high vulnerabilities are detected, the build stops immediately.

5. Implement Custom Security Checks

Sometimes you need to go beyond automated scanners. Here's a basic example of a custom security check you can add to your pipeline:

#!/bin/bash

# Custom check for hard-coded API keys in source files.
# Matches assignments of a quoted literal (for example, API_KEY = "abc123"),
# so references such as process.env.API_KEY are not flagged.
if grep -rnE "API_KEY[[:space:]]*[:=][[:space:]]*['\"][^'\"]+['\"]" ./src; then
  echo "Security issue: Found hard-coded API keys."
  exit 1  # Fail the build
else
  echo "No hard-coded API keys found."
fi

You can extend this script to scan for patterns like private keys, passwords, or other sensitive information, helping catch issues before they ever reach production.
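
As a rough sketch of that extension, the function below widens the check to a few more patterns. The function name check_secrets and the three patterns (PEM private key headers, AWS-style access key IDs, and quoted password or secret assignments) are illustrative assumptions rather than an exhaustive rule set, and a dedicated secret scanner will catch far more.

```shell
#!/bin/bash

# check_secrets DIR
# Hypothetical helper: prints matching lines and returns 1 when a pattern
# is found under DIR, or returns 0 when nothing matches.
check_secrets() {
  # -r: recurse, -n: show line numbers, -E: extended regex, -i: ignore case
  if grep -rnEi \
      -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
      -e 'AKIA[0-9A-Z]{16}' \
      -e "(password|passwd|secret)[[:space:]]*[:=][[:space:]]*['\"][^'\"]+['\"]" \
      "$1"; then
    echo "Security issue: possible hard-coded secret found in $1."
    return 1
  fi
  echo "No hard-coded secrets found in $1."
}
```

In a pipeline step, you could call check_secrets ./src || exit 1 so that any match fails the build.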

Performance Optimization and Scaling

Optimizing early saves you pain later. Here’s how to make your pipelines faster, smarter, and more scalable:

1. Measure Pipeline Execution Times

Understanding how long each step takes is the first step to improving it:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Record the start time
      - name: Start timer
        run: echo "Start time: $(date)"

      - uses: actions/checkout@v3
      - run: npm install

      # Record the end time to calculate duration
      - name: End timer
        run: echo "End time: $(date)"

Later, you can automate time tracking for full reports and alerts.

2. Implement Parallelization Strategies

Split your jobs smartly to save time:

jobs:
  # First job to install dependencies
  install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci

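  # Run tests in parallel with linting
  test:
    runs-on: ubuntu-latest
    needs: install  # Depends on install job
    steps:
      - uses: actions/checkout@v3
      - run: npm ci  # Each job starts on a fresh runner, so install again
      - run: npm test

  # Run linting in parallel with tests
  lint:
    runs-on: ubuntu-latest
    needs: install  # Also depends on install job
    steps:
      - uses: actions/checkout@v3
      - run: npm ci  # Each job starts on a fresh runner, so install again
      - run: npm run lint

Result: Testing and linting run in parallel once the install job succeeds, cutting pipeline time significantly. Because each job starts on a fresh runner, both reinstall dependencies before running.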

3. Set Up Distributed Caching

Caching saves your workflow from repeating expensive tasks:

# Cache dependencies to speed up builds
- name: Cache node modules
  uses: actions/cache@v3
  with:
    # Cache the global npm cache and the local dependencies.
    # Comments must stay outside the block scalar, or they become part of the paths.
    path: |
      ~/.npm
      node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}  # Key based on OS and dependency hash
    restore-keys: |    # Fallback keys if exact match isn't found
      ${{ runner.os }}-node-

Tip: Also cache build artifacts, Docker layers, and Terraform plans when possible.

4. Create Performance Benchmarks

Track your build times over time with benchmarks:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Store the start time as an environment variable
      - name: Start timer
        id: start_time
        run: echo "start_time=$(date +%s)" >> $GITHUB_ENV

      - uses: actions/checkout@v3
      - run: npm install

      # Calculate and display the elapsed time
      - name: End timer and calculate elapsed time
        run: |
          end_time=$(date +%s)
          elapsed_time=$((end_time - ${{ env.start_time }}))
          echo "Build time: $elapsed_time seconds"

With benchmarks in place, you can monitor regressions and trigger optimizations automatically.
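
One way to act on those numbers is to compare the elapsed time against a budget and fail the step when it is exceeded. This is a minimal sketch: the function name check_build_time and the idea of a fixed budget in seconds are assumptions made for illustration.

```shell
#!/bin/bash

# check_build_time ELAPSED_SECONDS BUDGET_SECONDS
# Hypothetical helper: returns 1 when the build took longer than the budget.
check_build_time() {
  elapsed="$1"
  budget="$2"
  if [ "$elapsed" -gt "$budget" ]; then
    echo "Build time regression: ${elapsed}s exceeds the ${budget}s budget."
    return 1
  fi
  echo "Build time OK: ${elapsed}s is within the ${budget}s budget."
}
```

In the final workflow step, a call such as check_build_time "$elapsed_time" 300 would turn a slow build into a visible failure.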

5. How to Plan for Growth Beyond Free Tiers

Understand cloud pricing structures: AWS, Azure, GCP all offer generous free tiers, but know the limits to avoid surprise bills. (I have been there and it wasn’t pretty.)
Consider scaling to more advanced CI/CD tools: Jenkins, CircleCI, GitLab can offer better performance or self-hosted control as you grow.
Automate resource provisioning: Use Infrastructure as Code (IaC) with Terraform, Pulumi, or AWS CDK to dynamically scale your infrastructure when your team or traffic grows.

Complete CI/CD Pipeline Example

Here’s a full example tying everything together:

# Complete end-to-end CI/CD pipeline
name: CI/CD Pipeline

on:
  push:
    branches:
      - main

jobs:
  # Initial setup job
  setup:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

  # Build and test job
  build:
    runs-on: ubuntu-latest
    needs: setup  # Depends on setup job
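    steps:
      # Each job runs on a fresh runner, so check out the code again
      - name: Checkout code
        uses: actions/checkout@v3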
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '16'
      - name: Install dependencies
        run: npm install
      - name: Run security scan
        run: npx eslint .  # Run ESLint for security rules

  # Deploy to Kubernetes job
  deploy:
    runs-on: ubuntu-latest
    needs: build  # Depends on successful build
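    steps:
      # Each job runs on a fresh runner, so check out the manifests again
      - name: Checkout code
        uses: actions/checkout@v3
      # Install k3d, which the runner image may not include
      - name: Install k3d
        run: curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash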
      - name: Setup K3d cluster
        run: k3d cluster create dev-cluster --servers 1 --agents 2 --port 8080:80@loadbalancer
      - name: Apply Kubernetes manifests
        run: kubectl apply -f k8s/  # Apply all K8s manifests in the k8s directory
      - name: Deploy app
        run: kubectl rollout restart deployment/webapp  # Restart deployment for zero-downtime update

  # Infrastructure provisioning job
  terraform:
    runs-on: ubuntu-latest
    needs: deploy  # Run after deployment
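    steps:
      # Each job runs on a fresh runner, so check out the Terraform files again
      - name: Checkout code
        uses: actions/checkout@v3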
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init  # Initialize Terraform
      - name: Terraform Apply
        run: terraform apply -auto-approve  # Apply infrastructure changes automatically

Runbook: Failed Deployment:

Issue: Pods fail due to resource limits (for example, OOMKilled, CrashLoopBackOff).
Fix:

  kubectl top pod
  kubectl edit deployment crud-app
  kubectl apply -f deployment.yaml
  kubectl rollout status deployment/crud-app

Tip: Set realistic resource requests and limits early. It'll save you debugging time later.

Conclusion

By following along with this tutorial, you now know how to build a production-ready DevOps pipeline using free tools:

CI/CD: GitHub Actions for testing, linting, and building.
Infrastructure: Terraform for AWS/Render and PostgreSQL setup.
Orchestration: K3d for local Kubernetes.
Monitoring: Grafana, Prometheus, UptimeRobot.
Security: CodeQL, OWASP ZAP, Trivy for vulnerability scanning.

This pipeline is scalable and secure, and it’s perfect for small projects. As your app grows, you might want to consider paid plans for more resources (for example, larger AWS instances or unlimited Grafana metrics). You can check AWS Free Tier, Terraform Docs, and Grafana Docs for more learning.

PS: I’d love to see what you build. Share your pipeline on FreeCodeCamp’s forum or tag me on X @Emidowojo with #DevOpsOnABudget, and tell me about the challenges you faced. You can also connect with me on LinkedIn if you’d like to stay in touch. If you made it to the end of this lengthy article, thanks for reading!

How to Automate Alert Provisioning with the SigNoz Terraform Provider

Gursimar Singh — Mon, 17 Mar 2025 19:09:07 +0000

Modern infrastructure requires continuous monitoring and rapid incident response. However, manually configuring and managing alerts is not only labor-intensive but also susceptible to human error.

Automating alert provisioning allows you to enforce consistency, secure sensitive credentials, and integrate monitoring into your deployment pipelines.

This guide dives deep into how you can use the SigNoz Terraform Provider to define and manage alert configurations as code, making your observability setup resilient and adaptable.

Here’s what we’ll cover:

Why Automate Alert Provisioning?
What are SigNoz and Terraform?
Overview of the Setup
Prerequisites
Steps to Set Up the Project
Best Practices and Security Considerations
Integrating with CI/CD Pipelines
Advanced Customizations and Troubleshooting
Conclusion

Why Automate Alert Provisioning?

It’s a good idea to automate your alert provisioning for various reasons.

First of all, configuring things manually often leads to discrepancies between environments (development, staging, production). Automating alerts ensures that all environments adhere to the same monitoring standards, reducing the likelihood of configuration drift and improving consistency and uniformity.

Also, when alerts are defined as code, every change is tracked in your version control system. This audit trail makes it easier to trace and review changes, collaborate with team members, and roll back configurations if issues arise.

Something else to consider is that as your infrastructure grows, manually managing alerts becomes unsustainable. Automation allows you to quickly and efficiently update your alerting rules across multiple services without the need for repetitive manual intervention.

Automation also helps improve security. Storing sensitive information like API tokens as environment variables or in secret management systems helps maintain security. Automating the process also minimizes human exposure to critical credentials.

And finally, defining alerts as code enables you to integrate monitoring configurations into your CI/CD pipelines. This leads to continuous testing, validation, and deployment of alert rules alongside application updates.

So as you can see, there are many compelling reasons to go the automation route. Now let’s see how you can do this in practice.

What Are SigNoz and Terraform?

SigNoz is an open-source observability platform designed to collect, analyze, and visualize metrics, logs, and traces from your applications. Its most helpful features include:

It has comprehensive monitoring abilities: Provides detailed insights into system performance, error rates, and user behaviors.
It comes equipped with real-time analytics: Enables proactive issue detection and performance optimization.
It’s community-driven: As an open-source solution, it benefits from community contributions, transparency, and customization.
It’s cost-effective: Offers powerful observability capabilities without the hefty licensing fees of proprietary solutions.

Terraform is an Infrastructure as Code (IaC) tool developed by HashiCorp. It allows you to define and provision infrastructure using declarative configuration files. Terraform’s core advantages include:

Declarative syntax: You specify the desired state of your infrastructure, and Terraform handles the implementation.
Version control: Configuration files can be managed in Git repositories, enabling traceability and rollback of changes.
Powerful automation: Facilitates automated provisioning and updates, reducing manual effort and errors.
Multi-cloud support: Manages resources across different cloud providers with a consistent workflow.

So you might be wondering: why should you use Terraform with SigNoz?

First of all, Terraform ensures that your infrastructure is managed consistently across different environments, reducing the risk of configuration drift. It also simplifies managing multiple alerts and resources, making it easier to scale your observability setup.

Beyond this, automating the provisioning process reduces manual setup efforts and minimizes the potential for human error.

And finally, Terraform configurations can be version-controlled, allowing teams to track changes over time and collaborate more effectively.

Overview of the Setup

This setup utilizes the SigNoz Terraform Provider to manage alerts and notification channels within SigNoz Cloud. The configuration includes:

Provider configuration: Establishes the connection to SigNoz using the API endpoint and a securely provided API token.
Notification channels: Defines where alerts are sent (for example, via email) to ensure the right teams are notified.
Alert rules: Specifies the conditions under which alerts are triggered, including thresholds and evaluation windows.
External variables: Enhances flexibility by allowing critical values (like CPU thresholds and email addresses) to be managed externally.

Prerequisites

Before diving into the setup, make sure you have the following:

SigNoz Cloud account: If you don't have one, sign up for SigNoz Cloud to host your observability data and configure alerts.
Terraform installed: Install Terraform on your machine. Terraform is the tool you'll use to manage your infrastructure as code.
SigNoz API token:
- Log in to your SigNoz Cloud dashboard.
- Navigate to Settings > API Tokens.
- Click Generate API Token.
- Copy the token, as you'll need it to authenticate Terraform with SigNoz.
Basic knowledge of Terraform: Familiarity with Terraform's syntax and concepts, including writing configuration files and running Terraform commands, is essential.
Text editor: Use any code editor like Visual Studio Code or Sublime Text to write your Terraform configuration files.

Steps to Set Up the Project

1. Understand the `signoz_alert` Resource

The signoz_alert resource allows you to create and manage alert rules in SigNoz via Terraform. It supports various alert types, conditions, and configurations. Understanding this resource is crucial as it forms the basis of your alert configuration.

2. Set Up Your Terraform Configuration

Create a new directory for your Terraform configuration:

mkdir signoz-terraform
cd signoz-terraform

Create a main.tf file with the following content:

terraform {
  required_providers {
    signoz = {
      source  = "SigNoz/signoz"
      version = "0.1.3" # Use the latest version from the Terraform Registry
    }
  }
}

provider "signoz" {
  endpoint  = "https://api.us.signoz.cloud" # Replace with your SigNoz Cloud API endpoint
  api_token = var.signoz_api_token
}

variable "signoz_api_token" {
  type      = string
  sensitive = true # Keep the token out of plan and apply output
}

The provider block configures the SigNoz provider, where endpoint specifies the API endpoint and api_token is passed through a variable for security.

3. Define a Notification Channel (Optional)

If you plan to send alerts to specific channels, define them using signoz_notification_channel. For example, create a channels.tf file:

resource "signoz_notification_channel" "email_channel" {
  name = "Email Channel"
  type = "email"

  receivers {
    email_config {
      to = ["alerts@example.com"]
    }
  }
}

Defining a notification channel ensures that alerts are sent to the correct recipients, enhancing the utility of your alerting system.

4. Create an Alert Using the `signoz_alert` Resource

Create an alerts.tf file to define your alert:

resource "signoz_alert" "cpu_high_usage" {
  alert            = "High CPU Usage Alert"
  alert_type       = "METRIC_BASED_ALERT"
  severity         = "critical"
  description      = "Alert when CPU usage exceeds 80% over 5 minutes"
  rule_type        = "threshold_rule"
  broadcast_to_all = false
  disabled         = false
  eval_window      = "5m0s"
  frequency        = "1m0s"
  version          = "v4"

  condition = jsonencode({
    compositeQuery = {
      builderQueries = {
        A = {
          aggregateOperator = "avg"
          dataSource        = "metrics"
          metricName        = "cpu_usage_user"
          reduceTo          = "avg"
          filters           = {
            items = []
            op    = "AND"
          }
          groupBy = []
        }
      }
      queryType = "builder"
      panelType = "graph"
      unit      = "%"
    }
    op                = ">"
    target            = 80
    matchType         = "EQUALS"
    selectedQueryName = "A"
    targetUnit        = "%"
  })

  preferred_channels = [signoz_notification_channel.email_channel.name]

  labels = {
    severity = "critical"
    team     = "DevOps"
  }
}

This configuration creates a high CPU usage alert with specific conditions and notifications. The condition parameter is crucial as it defines the alert triggering logic.

5. Provide the API Token

Set the signoz_api_token as an environment variable:

export TF_VAR_signoz_api_token="YOUR_SIGNOZ_API_TOKEN"

This ensures that your API token is securely used by Terraform without hardcoding it in your configuration files.
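
If the variable is missing, Terraform will stop and prompt for it, which is easy to miss in a script. A small preflight check can make that failure explicit. This is only a sketch, and the helper name require_env is an assumption.

```shell
#!/bin/bash

# require_env NAME
# Hypothetical preflight helper: succeeds only if NAME is exported and non-empty.
require_env() {
  if [ -z "$(printenv "$1")" ]; then
    echo "Missing required environment variable: $1" >&2
    return 1
  fi
}
```

For example, require_env TF_VAR_signoz_api_token && terraform plan only runs the plan when the token is present in the environment.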

6. Initialize Terraform

Run:

terraform init

This command initializes your Terraform working directory by downloading the required provider plugins and preparing the environment.

7. Review the Execution Plan

Generate the execution plan:

terraform plan

This step previews the changes Terraform will make, allowing you to verify the configuration before applying it.

8. Apply the Configuration

Apply the changes:

terraform apply

Type yes when prompted. This command applies the configuration, creating or updating resources as specified.

9. Verify the Alert in SigNoz Cloud

To do this, follow these steps:

Log in to your SigNoz Cloud dashboard.
Navigate to Alerts.
Confirm that the "High CPU Usage Alert" is listed.
Click on the alert to view its details and ensure it matches your configuration.

10. Modify the Alert (Optional)

To change the CPU usage threshold to 75%, follow these steps:

Update the target in alerts.tf:
```
  target = 75
```
Apply the changes:
```
  terraform apply
```

11. Destroy the Resources (Optional)

To remove the alert and notification channel:

terraform destroy

Type yes to confirm. This command will delete the resources created by Terraform.

Best Practices and Security Considerations

In modern infrastructure automation, robust best practices and security measures are paramount.

Use Version Pinning

To ensure your alert provisioning remains reliable and maintainable, start with strict version control of the provider itself. Avoid leaving the provider version unconstrained or loosely constrained, since Terraform can then pick up newer releases as they are published. Specify an exact version number instead, so your infrastructure configuration remains consistent and predictable.

By pinning your provider version (for example, version = "0.1.3" instead of version = ">= 0.1.3"), you eliminate unexpected behavior that can arise from upstream changes. This practice is critical for long-term stability, especially when your infrastructure scales across multiple environments.

Externalize Credentials

Security is non-negotiable. Instead of embedding sensitive details like API tokens in your codebase, leverage environment variables or dedicated secret management tools such as HashiCorp Vault or AWS Secrets Manager.

For instance, storing your SigNoz API token as an environment variable (TF_VAR_signoz_api_token) not only mitigates the risk of credential exposure but also simplifies the process of credential rotation. Also, enforce access control policies around your configuration repositories and CI/CD pipelines to further secure these secrets.

Use Version Control

A mature setup also demands rigorous infrastructure version control. Hosting your Terraform configuration in a Git repository with branch protection and code review policies allows you to track changes meticulously, roll back problematic updates, and maintain an audit trail. This traceability is essential when troubleshooting issues or validating compliance during audits.

You should also document your configuration decisions extensively—explain why a particular CPU threshold was chosen or why specific labels (like severity and team) are used. Such documentation becomes invaluable for onboarding new team members or when revisiting configurations months later.

Integrating with CI/CD Pipelines

Integrating Terraform with your CI/CD pipeline is a cornerstone of a modern, automated deployment strategy. A well-architected pipeline not only validates your infrastructure changes but also ensures that your alerting rules remain in sync with your evolving application environment.

Continuous Integration (CI) involves automatically merging code changes into a shared repository and running automated tests on each commit. In practice, embedding Terraform plan into your pull request workflow provides early feedback, catching misconfigurations before they reach production. For instance, a GitHub Actions workflow can automatically check your changes:

name: Terraform CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        run: terraform plan -no-color
        env:
          TF_VAR_signoz_api_token: ${{ secrets.SIGNOZ_API_TOKEN }}

This workflow uses GitHub secrets to securely manage your API tokens while validating the configuration changes. Continuous Delivery (CD) takes this further by automating deployments. Once your plan is approved, an automated Terraform apply step (often scheduled during off-peak hours or coordinated with application deployments) ensures smooth, coordinated rollouts.
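
To decide whether an apply step is needed at all, a pipeline can run terraform plan with the -detailed-exitcode flag, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The helper below is a minimal sketch of branching on that code. The name describe_plan_exit and the labels it prints are assumptions.

```shell
#!/bin/bash

# describe_plan_exit CODE
# Hypothetical helper: maps the exit code of
# `terraform plan -detailed-exitcode` to a label a pipeline can branch on.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;       # Nothing to apply
    2) echo "changes-pending" ;;  # Plan succeeded and found a diff
    *) echo "plan-failed"; return 1 ;;
  esac
}
```

A pipeline step could run the plan, capture $? immediately afterward, pass it to describe_plan_exit, and continue to the apply step only when the result is changes-pending.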

Advanced pipelines can also include automated rollback mechanisms. For example, if a deployment triggers an anomaly, scripts can automatically revert to a previous version using your version control history—minimizing downtime and reinforcing the feedback loop between application performance and infrastructure configuration.

Advanced Customizations and Troubleshooting

As your observability requirements evolve, you may need to implement advanced customizations. One powerful approach is using multi-metric composite alerts. Instead of triggering an alert on a single threshold, you can design rules that combine multiple conditions—for example, firing only when both CPU usage and memory consumption exceed critical levels. This nuanced alerting minimizes false positives and ensures alerts are issued only during genuine performance issues.

Terraform’s modular design is especially useful here. By creating reusable modules that encapsulate your alert configurations, you can parameterize key variables—such as thresholds, evaluation windows, and notification channels—across a microservices architecture. This modularity enforces consistency while simplifying management and scaling.

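Troubleshooting advanced configurations starts with reviewing your terraform plan output to ensure every change aligns with expectations. If an alert isn’t triggering as expected, inspect the JSON structure generated by the jsonencode function. The function always emits syntactically valid JSON, so the usual culprits are misspelled keys, incorrect nesting, or values of the wrong type, and even minor mistakes of this kind can cause significant issues.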

When integrating with incident management tools like PagerDuty or Opsgenie, run comprehensive end-to-end tests in a staging environment. For example, deploy a test alert to a dedicated channel to verify that the complete alerting pipeline—from condition detection to incident escalation—is functioning correctly.

In one real-world scenario, a misconfigured composite query in an alert’s JSON payload led to intermittent failures. By enabling detailed provider logs and iteratively validating the JSON output, the issue was rapidly isolated and resolved. Such experiences underscore the importance of rigorous logging, validation, and testing in production-grade setups.

Conclusion

Automating alert provisioning is a transformative approach to managing observability in modern infrastructures.

By defining alerts and notification channels as code, you make your systems more consistent, scalable, and secure. You can set up uniform alert rules across all environments, quickly update and deploy monitoring configs, handle credentials securely, and automate CI/CD workflows that stay in sync with application changes.

I hope you’ve enjoyed this tutorial and have learned something new. I’m always open to suggestions and discussions on LinkedIn. Hit me up with direct messages.

If you’ve enjoyed my writing and want to keep me motivated, consider leaving stars on GitHub and endorsing me for relevant skills on LinkedIn.

Till the next one, happy coding!

A Beginner's Guide to Terraform – Infrastructure-as-Code in Practice

Oluwatobi — Fri, 03 Jan 2025 18:21:03 +0000

Over the years, cloud development has seen a major paradigm shift. Newer and more complex applications are deployed rapidly to the cloud to minimize downtime. And through all of this, the concept of Infrastructure-as-Code and various tools have emerged to simplify the process of application development.

You might be wondering: what is Infrastructure-as-Code? How does it improve the development process and experience, and where does Terraform come into the picture? Well, we’ll explore all this and more in this guide. But before we start, here are some prerequisites:

Basic knowledge of the cloud and cloud terminologies
Access to a PC to implement code examples
A GCP account

With this, let's get started.

Here’s what we’ll cover:

Overview of Infrastructure as Code
What is Terraform?
Benefits of Terraform
Common Terms Used in Terraform
Demo Project: How to Write a Terraform Configuration
Conclusion

Overview of Infrastructure as Code (IaC)

Infrastructure as code refers to defining cloud infrastructure in a code-based configuration document rather than setting it up by hand. When you run that configuration, the IaC tool automates the creation of resources such as databases, virtual machines, and servers. This reduces the number of manual cloud service deployments, especially when you need multiple identical services.

There are two distinct approaches to infrastructure as code: the Imperative approach and the Declarative approach.

When you’re using the Declarative approach to infrastructure generation, you simply detail your expected/desired outputs for the Infrastructure to be generated, and then the IaC tool you’re using figures out how to produce that output.

On the other hand, the Imperative approach involves specifying the exact steps needed to achieve the desired infrastructure state. While the Imperative approach seems more suited for complex infrastructure setups, the Declarative approach can work just as well.

Some tools are capable of both approaches while others are only suited to one or the other. Examples of some of the popular IaC tools used globally include Terraform, AWS CloudFormation, Ansible, Pulumi, and Chef, among others.

Like the name implies – infrastructure as code – the code creating the infrastructure is written in various template languages within the IaC space. Popular template languages include JSON, YAML, ARM template, HCL, Heat Scripts, and so on.

You can also use scripting tools to provision cloud infrastructure. Some popular ones include Bash and PowerShell, and one or the other usually comes preinstalled on a personal computer.

Out of all these tools, though, Terraform is distinct for various reasons – and it’s the one we’ll be examining in this article.

What is Terraform?

Terraform is an infrastructure-as-code tool that HashiCorp first released in 2014. It was open source until 2023, when HashiCorp moved it to the Business Source License. It has evolved over the years and now serves as a cloud-agnostic infrastructure tool that allows you to create infrastructure across multiple cloud service providers.

HashiCorp also offers Terraform Cloud (renamed HCP Terraform in 2024), a hosted software-as-a-service offering. It runs Terraform remotely and stores state for your team, as an alternative to running the Terraform CLI on a local machine. The CLI itself is still the core of Terraform and remains actively maintained.

Also, like other IaC tools that use template languages, Terraform has its own language for defining infrastructure: the HashiCorp Configuration Language (HCL).

Benefits of Terraform

Now I’ll highlight some of the benefits of using Terraform as a cloud engineer, along with the tool’s key role in the cloud ecosystem.

1. Declarative Approach

This approach to cloud infrastructure automation ensures that all required infrastructure to be deployed (databases, servers, and so on) is stated explicitly and executed accordingly. This helps avoid conflicts.

2. Conflict Handling

In addition to its efficient cloud tool automation capabilities, Terraform has some robust conflict detection and handling properties. One of the ways it handles conflicts is via the terraform plan command. This command highlights potential conflicts in your infrastructure changes before anything is deployed, which allows for easy correction. I’ll discuss this further in subsequent sections.

3. Cloud Agnostic

Terraform is a multi-cloud automation tool with efficient infrastructure automation capabilities across the major cloud service providers (AWS, GCP, and Azure). It also supports hybrid setups and configurations that span several providers.

4. User-friendly

Terraform is one of the most widely used cloud automation tools and has one of the largest user communities out there. It has extensive beginner-friendly tutorials that help you get the hang of the tool quickly, and its official documentation lets you dive in deeper.

5. File Management Capabilities

Terraform records the state of your infrastructure in a local state file by default and keeps a backup copy of the previous state (terraform.tfstate.backup), so you can recover if anything goes wrong. It also supports remote backends that store state with a cloud service provider where necessary.

6. Version Control

Terraform configuration files are plain text, so you can track them in a version control system like Git. This lets you review every change to a Terraform file and go back to previous versions of your code if there are errors in the present version, for example. Terraform doesn't version your files itself. Its state file only records what is currently deployed.

7. Code Reusability

Terraform modules let you package configuration for reuse, and the public Terraform Registry hosts a wide variety of ready-made modules along with usage examples.

Now that we’ve highlighted the benefits of Terraform, let’s learn some common terminologies used in Terraform and what they mean.

Common Terms Used in Terraform

Before you start using Terraform, you should be familiar with some key terms that come up a lot. Here’s what you need to know:

Providers: In Terraform, a provider is a plugin that lets Terraform interact with various APIs and cloud services. For example, you’d use a provider to interface with a cloud service provider like GCP or Azure.
Modules: A module is a collection of Terraform configuration files that serves as a reusable component for orchestrating cloud services. You can define the key settings for a set of resources once in a module, then customize each use of it through module variables.
Resources: Resources in Terraform are the infrastructure components to be created. Examples include cloud networks, virtual machines, storage buckets, and other infrastructure objects.
State: State is how Terraform knows what it manages. It records the resources Terraform has created and their last known attributes. Terraform compares this state with your configuration to work out which resources to create, update, or destroy.
Workspace: A workspace is a named, separate state for the same configuration. Workspaces let you manage multiple instances of a single infrastructure configuration in a clean and isolated way within the same backend. You can use workspaces to separate environments like development, staging, and production while using the same Terraform configuration.
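To make the workspace idea more concrete, here is a small sketch that uses the built-in terraform.workspace value to give each environment its own network name. The resource label and naming scheme are illustrative assumptions, not part of the demo project.

```hcl
# Sketch only: the "app-<workspace>-vpc" naming scheme is a hypothetical convention.
locals {
  env = terraform.workspace # "default", or whatever `terraform workspace new <name>` created
}

resource "google_compute_network" "vpc" {
  name                    = "app-${local.env}-vpc"
  auto_create_subnetworks = true
}
```

Running the same configuration in a dev workspace and a staging workspace would then produce two separately tracked networks.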

Demo Project: How to Write a Terraform Configuration

In this section, we’ll write our first Terraform file to orchestrate a Google Cloud Platform (GCP) virtual machine with just a few lines of code. But before we begin, we’ll discuss the commands that you should understand before we implement the demo project.

Common Terraform Commands

terraform init: This command initializes the working directory. It downloads the provider plugins the configuration needs (in our case, the Google provider for GCP) and sets up the backend where state is stored.
terraform fmt: This command rewrites your configuration files to the standard Terraform formatting and indentation. It does not change what the code does, but it keeps files consistent and easier to review.
terraform plan: This command compares your configuration with the current state and shows the actions Terraform would take to create, change, or destroy resources. It also surfaces errors in the configuration that would prevent execution.
terraform apply: This command carries out the actions shown by terraform plan.
terraform destroy: This command removes all the cloud resources managed by the configuration, and is typically the last command you run.

It’s important to note that you should run init, fmt, plan, and apply in that order to ensure that your infrastructure gets created properly.

Creating an IaC-Powered GCP Virtual Machine

Now that you’ve learned these important commands, let’s test them all out by creating our first-ever IaC-powered GCP virtual machine.

In your code editor, type the following code:

provider "google" {
  project = "your-gcp-project-id"  # Replace with your GCP Project ID
  region  = "us-central1"          
  zone    = "us-central1-a"        
}

This code declares the cloud provider we’re using to create the cloud resources we need. In our case, it’s Google Cloud Platform, and the provider name is “google”. Other cloud providers have their own names: AWS is “aws” and Azure is “azurerm”.

The project line sets the GCP project ID, which is unique to each GCP project (and tells the provider where to create resources). You should use yours in the space provided.

You’ll also need to include a suitable region and zone. These determine where the virtual machine we’ll create physically runs. In this scenario, I chose the us-central1 region and the us-central1-a zone, respectively. You can read more here about cloud regions and availability zones.
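A provider block like this is often paired with a required_providers block that tells terraform init where to download the provider from and which versions are acceptable. The sketch below is one way to write it. The version constraint is an assumption, so pick the one that matches your setup.

```hcl
# Sketch only: the "~> 5.0" constraint is an assumed example, not a requirement of the demo.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google" # registry address of the Google provider
      version = "~> 5.0"           # allow any 5.x release
    }
  }
}
```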

resource "google_compute_instance" "vm_instance" {
  name         = "example-vm"      
  machine_type = "e2-medium"          

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11" 
    }
  }

The code snippet above specifies the compute resource to be created, which in our case is a virtual machine instance that Terraform refers to as “vm_instance”. “example-vm” is the name we want to assign to the virtual machine in GCP. Note that the virtual machine name must be unique within the project and zone. The machine type we opted for is e2-medium, from the E2 general-purpose family. You can get more information on Virtual machine types here.

Going further, the “boot_disk” block specifies the disk the VM boots from. In my case, it is initialized from a Debian 11 Linux image.

  network_interface {
    network = "default"  # Attach to the default VPC network
    access_config {
      # An empty block requests an ephemeral external IP address
    }
  }
}

output "instance_ip" {
  value = google_compute_instance.vm_instance.network_interface[0].access_config[0].nat_ip
}

To complete the creation of our virtual machine, we need to attach it to a network so we can reach it remotely. The network interface block connects the virtual machine to the default VPC (Virtual Private Cloud) network provided by GCP, and the empty access_config block gives it an external IP address. We won’t be able to connect to our virtual machine without them. The output block displays that external IP address in the terminal, which we can use to connect to the virtual machine.

Here is the final expected code:


provider "google" {
  project = "your-gcp-project-id"  # Replace with your GCP Project ID
  region  = "us-central1"          
  zone    = "us-central1-a"       
}

resource "google_compute_instance" "vm_instance" {
  name         = "example-vm"         
  machine_type = "e2-medium"          

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"  
    }
  }

  network_interface {
    network = "default"  # Attach to the default VPC network
    access_config {
      # An empty block requests an ephemeral external IP address
    }
  }
}

output "instance_ip" {
  value = google_compute_instance.vm_instance.network_interface[0].access_config[0].nat_ip
}

Going on from there, we’ll now be executing this code using the commands highlighted in the image below:

The command terraform -v confirms that Terraform has been successfully installed on your machine. The expected output is the version of Terraform installed.

The next command executed is terraform init, which downloads the provider plugins needed to talk to the cloud service provider, which in our case is GCP.

The terraform fmt command is also run to tidy up code formatting and indentation. Then the terraform plan command is executed.

From the image above, you can see the steps Terraform intends to take to create the expected virtual machine.

On successful execution of terraform plan, we then run terraform apply to carry out the steps outlined by the plan.

This will generate a prompt asking for confirmation as shown above. Terraform only accepts “yes” here, so type it exactly to allow the operation to proceed.

On successful execution, a success message will be displayed as shown above. With that, we have created our cloud infrastructure with just code. The terraform destroy command can then be run to remove the virtual machine.

Conclusion

In this article, you’ve learned the basics about infrastructure as code. We discussed Terraform, its benefits, and some of its key features and commands. We also illustrated its use in a demo project.

To further enhance your knowledge, you can consult Terraform’s documentation for more code examples. I would also recommend using your newly gained knowledge to automate a project with real-life uses.

Feel free to message me with any comments or questions. You can also check out my other articles here. Till next time, keep on coding!

How to Simplify AWS Multi-Account Management with Terraform and GitOps

Nitheesh Poojary — Tue, 26 Nov 2024 14:56:18 +0000

In the past, a company's cloud journey often began with just one or two AWS accounts. Development and testing environments coexisted in one account, while the production environment lived in a separate account.

This arrangement might work well in the early days, but as a company grows and its needs become more specialized, the simplicity of this setup starts to show its limitations. The demand for dedicated environments increases, and soon the company may need to create new AWS accounts for specific functions like security, DevOps, and billing.

With each new account, the complexity of managing security policies and logging across the entire infrastructure grows. The company's cloud architects then realize that they need a more centralized and streamlined approach to manage this expanding footprint.

Enter AWS Organizations

AWS Organizations is a service designed to streamline AWS account management. This powerful tool allows you to group multiple AWS accounts under a single umbrella. With AWS Organizations, you can easily create organizational units, apply service control policies, and manage permissions across all accounts. This not only simplifies the process but also enhances security and compliance.

AWS Organizations also simplifies billing by centralizing payments and generating expense reports for each account. This clearer financial picture makes it easier for companies to allocate resources efficiently and plan for future expansion.

AWS Organizations can help your team consistently enforce security policies, enable logging across all accounts, and streamline administrative tasks. The result is a well-organized, secure, and efficient cloud infrastructure, ready to support a company's ambitions for years to come.

In this article, we’ll discuss what it means to have a multi-account setup and how it works. I’ll walk you through everything from the deployment architecture to creating an Organizational Unit and beyond.

Components of Multi-Account Setup
How to Automate a Multi-Account Strategy
AWS Organization Structure
Deployment Architecture
Overview of CI/CD Components
CI/CD Deployment Process Explained
How to Automate Landing Zone Creation
How to Create an Organizational Unit
How to Automate Attaching Control Tower Control to the OU
Conclusion

Components of Multi-Account Setup

First, let's take a detailed look at the various components that make up an AWS multi-account strategy:

AWS Control Tower
Landing zone
AWS OU
AWS IAM Identity Center (formerly AWS SSO)
Control Tower Controls
Service control policies (SCPs)

What is AWS Control Tower?

AWS Control Tower is a comprehensive service that enables you to set up and manage a multi-account AWS environment efficiently. It’s designed based on best practices from AWS experts and adheres to industry standards and requirements.

By using AWS Control Tower, you can ensure that your AWS environment is secure, compliant, and well-organized, facilitating easier management and scalability.

Features of AWS Control Tower:

Cloud IT can be confident that all accounts are in line with company-wide regulations, and distributed teams may create new AWS accounts quickly.
You can enforce best practices, standards, and regulatory requirements with preconfigured controls.
You can automate your AWS environment setup with best-practice blueprints. These blueprints cover various aspects such as multi-account structure, identity and access management, as well as account provisioning workflow.
It lets you govern new or existing account configurations, gain visibility into compliance status, and enforce controls at scale.

What is a Landing Zone in AWS?

A landing zone helps you quickly set up a cloud environment using automation, including preconfigured settings that follow industry best practices for ensuring the security of your AWS accounts.

This starting point serves as a foundation for your company to efficiently launch workloads and applications on a secure and reliable infrastructure environment.

There are two choices for creating a landing zone. First, you can use the AWS Control Tower dashboard. Second, you can build a custom landing zone. If you are new to AWS, I recommend using AWS Control Tower to create a landing zone.

If you opt for creating a landing zone via the Control Tower dashboard, the following will be implemented in your landing zone:

A multi-account environment with AWS Organizations.
Identity management through the default directory in AWS IAM Identity Center.
Federated access to accounts using IAM Identity Center.
Centralized logging from AWS CloudTrail and AWS Config stored in Amazon Simple Storage Service (Amazon S3).
Enabled cross-account security audits using IAM Identity Center.

What is an AWS Organizational Unit?

Using multiple accounts allows you to better support your security goals and company operations.

AWS Organizations enables policy-based management of multiple AWS accounts. When you create new accounts, you can arrange them in organizational units (OUs), which are groupings of accounts that provide the same application or service.

Advantages of Using Multiple Accounts and OUs:

Accounts are units of security protection. Potential hazards and security threats can be contained within one account without affecting others.
Teams have different assignments and resource needs. Setting up different accounts prevents teams from interfering with one another, as they might do if they used the same account.
Isolating data stores to an account reduces the number of people who have access to and can manage the data store.
The multi-account concept allows you to generate separate billable items for business divisions, functional teams, or individual users.
AWS quotas are set up per account. Separating workloads into different accounts gives each account an individual quota.

What is AWS IAM Identity Center?

The AWS IAM Identity Center provides a centralized solution for managing access to multiple AWS accounts and business applications.

It offers a single sign-on feature that allows employees to access all assigned accounts and applications with a single set of credentials.

The personalized web user portal provides a centralized view of the user's assigned roles in AWS accounts.

For a uniform authentication experience, users can sign in using the AWS Command Line Interface, AWS SDKs, or the AWS Console Mobile Application with their directory credentials.

You can also set up and oversee user IDs in IAM Identity Center's identity store, or you can connect to your existing identity provider, such as Microsoft Active Directory, Okta, and so on.

Control Tower Controls (Guardrails)

Controls are predefined governance rules for security, operations, and compliance. You can select and apply them enterprise-wide or to specific groups of accounts.

Controls can be detective, preventive, or proactive and can be either mandatory or optional.

Detective controls detect resources in your accounts that don’t comply with your policies and report them (for example, detecting whether public read access to Amazon S3 buckets is allowed).
Preventive controls establish intent and prevent deployment of resources that don’t conform to your policies (for example, enabling AWS CloudTrail in all accounts).
Proactive controls use AWS CloudFormation Hooks to identify and block the CloudFormation deployment of resources that are not compliant with the controls you have enabled. For example, developers cannot create S3 buckets that are capable of storing data in an unencrypted state at rest.

Service Control Policies (SCP)

SCPs are a feature of AWS Organizations that lets you set the maximum permissions for member accounts within the organization.

An SCP takes precedence over IAM: if an SCP denies an action on an account, no entity in the account can perform that action, even if its IAM permissions allow it.

Common uses of SCPs include:

Preventing stopping or deletion of CloudTrail logging.
Preventing deletion of VPC flow logs.
Prohibiting AWS accounts from leaving the organization.
Preventing Amazon GuardDuty changes.
Preventing resource sharing using AWS Resource Access Manager (RAM) either externally or across environments.
Preventing disabling the default Amazon EBS encryption.
Preventing Amazon S3 unencrypted object uploads.
Preventing IAM users and roles in the affected accounts from creating certain resource types if the request doesn't include the specified tags.
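To show what two of these rules might look like in Terraform, here is a minimal sketch of an SCP that blocks CloudTrail tampering and prevents accounts from leaving the organization. It is not taken from the repositories used later in this article. The resource labels, the policy name, and the target_ou_id variable are all hypothetical.

```hcl
# Sketch only: names and var.target_ou_id are hypothetical, not from the article's repositories.
variable "target_ou_id" {
  type        = string
  description = "ID of the OU (or account) the SCP is attached to"
}

data "aws_iam_policy_document" "protect_baseline" {
  statement {
    sid       = "DenyCloudTrailTampering"
    effect    = "Deny"
    actions   = ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"]
    resources = ["*"]
  }

  statement {
    sid       = "DenyLeavingOrganization"
    effect    = "Deny"
    actions   = ["organizations:LeaveOrganization"]
    resources = ["*"]
  }
}

resource "aws_organizations_policy" "protect_baseline" {
  name        = "protect-baseline"
  description = "Blocks CloudTrail tampering and leaving the organization"
  type        = "SERVICE_CONTROL_POLICY"
  content     = data.aws_iam_policy_document.protect_baseline.json
}

# Attach the policy so it applies to every account under the target OU.
resource "aws_organizations_policy_attachment" "protect_baseline" {
  policy_id = aws_organizations_policy.protect_baseline.id
  target_id = var.target_ou_id
}
```

Because an SCP only limits permissions, accounts under the OU still need IAM policies that grant whatever they are allowed to do.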

How to Automate a Multi-Account Strategy

Now that you’re familiar with the key concepts of a Multi-Account Strategy in AWS, let’s dive deeper into the practical parts.

In the coming subsections, we’ll cover how you can set up AWS Control Tower, create a landing zone, and automatically create organizational units (OUs). I’ll also walk you through how to configure Control Tower controls (often known as guardrails) to uphold security, compliance, and governance over your AWS environment.

Once we finish this deployment, we will have a solution that does the following:

Creates an AWS Organizations OU named Core within the organizational root structure.
Creates and adds two shared accounts to the Core OU: the Log Archive account and the Audit account.
Creates a cloud-native directory in IAM Identity Center, with ready-made groups and single sign-on access.
Applies all required preventive controls to enforce policies.
Applies required detective controls to identify configuration violations.

AWS Organization Structure

We will create and implement the following organizational structure. You can add or modify OUs as per your requirements.

Deployment Architecture

I will be using Terraform Cloud and GitHub Actions to automate the entire process. This architecture applies to all three components: the core accounts, the landing zone, and organizational unit (OU) creation with controls.

Overview of CI/CD Components

1. GitHub Actions

GitHub Actions is a CI/CD platform that lets you automate your build, test, and deployment pipeline. You can create workflows that automatically build and test every pull request to your repository, ensuring code changes are verified before merging.

GitHub Actions also lets you deploy merged pull requests to production, streamlining the release process and reducing errors.

Using GitHub Actions enhances your development workflow, improves code quality, and speeds up the delivery of new features and updates.

2. Terraform Cloud

Terraform Cloud is a platform by HashiCorp for managing and executing your Terraform code. It offers tools and features that enhance collaboration between developers and DevOps engineers, making teamwork more efficient.

With Terraform Cloud, you can simplify and streamline your workflow, making it easier to handle complex infrastructure tasks and deployments. The platform also provides strong security features to protect your code and infrastructure, keeping your product secure throughout its lifecycle.

CI/CD Deployment Process Explained

DevOps engineers are responsible for writing the Terraform code and then creating a pull request. I have added several test cases for my Terraform code in the terraform-plan.yml file, which runs only on the feature branch.

Check environment variables: Ensures all required environment variables are set.
Checkout Code: Uses the actions/checkout action to check out the repository.
Verify Checkout: Verifies that the checkout was successful.
Validation: Verifies the Terraform code for any syntax errors.

Pull requests contain proposed changes in code, allowing team members to review and merge them into the master branch. Once pull requests are merged with the master branch, all test cases are rerun, and the landing zone is created through Terraform Cloud.

What to Know Before Setting up Control Tower

Before you set up AWS Control Tower, it is important to understand its limitations and consider some key points.

When setting up a landing zone, it is important to choose your home region. Once you have made a selection, you won’t be able to change your home region.
If the AWS account you intend to use for Control Tower is already part of an existing organizational unit (OU), you won’t be able to use it. To proceed, you’ll need to create a new AWS account that is not associated with any OU.
As part of the Control Tower creation process, you’ll need to create mandatory accounts such as the Log Archive account and the Audit account. Each account needs its own unique email address.
To set up the landing zone in the management account, you need to be subscribed to the following services in that account: S3, EC2, SNS, VPC, CloudFormation, CloudTrail, CloudWatch, AWS Config, IAM, and AWS Lambda.
The AWS Control Tower baseline covers only a few services with limited customization options: IAM Identity Center, CloudTrail, Config, some configuration rules, and some SCPs in AWS Organizations.
Implementing IAM Identity Center is limited to the management account of an organization.
AWS Control Tower implements concurrency limitations, allowing only one operation to be performed at a time.
Note that certain AWS Regions do not support the operation of some controls in AWS Control Tower. This is because the specified Regions lack the necessary underlying functionality to support the required operations.

How to Create a Control Tower

Creating a Control Tower means setting up a landing zone. An AWS landing zone requires creating two new member accounts: the Audit account and the Log Archive account. You will need two unique email addresses for these accounts.

We will manage this process using Terraform modules. To keep things simple and clear, we will divide the project into several modules. One module will create the two core accounts. Another module will handle the setup of the landing zone. The final module will create Organizational Units (OUs) and apply Control Tower controls to ensure governance and compliance.

How to Automate Landing Zone Creation

When you run this code, the Core OU is created along with two accounts under it. Each component uses two repositories: one that deploys the AWS resources (the landing zone, OUs, and Control Tower controls) and another that holds the Terraform modules.

A Terraform module is a set of standard configuration files in a specific directory. Terraform modules group resources for a specific task, which reduces the amount of code needed for similar infrastructure components.

I have imported both the core account creation and landing zone creation modules into the same main.tf file. This is necessary because the landing zone creation depends on the core account module. Including them together ensures all dependencies are managed properly and the deployment process is efficient.

This method also simplifies the project structure and helps avoid potential issues from managing these components separately.

The AWS Control Tower CreateLandingZone API needs a landing zone version and a manifest file as input parameters. Below is an example LandingZoneManifest.json manifest.

{
   "governedRegions": ["us-west-2","us-west-1"],
   "organizationStructure": {
       "security": {
           "name": "CORE"
       },
       "sandbox": {
           "name": "Sandbox"
       }
   },
   "centralizedLogging": {
        "accountId": "222222222222",
        "configurations": {
            "loggingBucket": {
                "retentionDays": 60
            },
            "accessLoggingBucket": {
                "retentionDays": 60
            },
            "kmsKeyArn": "arn:aws:kms:us-west-1:123456789123:key/e84XXXXX-6bXX-49XX-9eXX-ecfXXXXXXXXX"
        },
        "enabled": true
   },
   "securityRoles": {
        "accountId": "333333333333"
   },
   "accessManagement": {
        "enabled": true
   }
}

This module sets up the AWS landing zone using landingzone_manifest_template. The landing zone version and admin account ID are given through variables. This module also creates several IAM roles required for the landing zone setup.

I defined a local variable, landingzone_manifest_template, which holds the JSON manifest for the landing zone. Its settings are described after the code:

provider "aws" {
  region = var.region
}

locals {
  landingzone_manifest_template = <<EOF
{
    "governedRegions": ${jsonencode(var.governed_regions)},
    "organizationStructure": {
        "security": {
            "name": "Core"
        }
    },
    "centralizedLogging": {
         "accountId": "${module.aws_core_accounts.log_account_id}",
         "configurations": {
             "loggingBucket": {
                 "retentionDays": ${var.retention_days}
             },
             "accessLoggingBucket": {
                 "retentionDays": ${var.retention_days}
             }
         },
         "enabled": true
    },
    "securityRoles": {
         "accountId": "${module.aws_core_accounts.security_account_id}"
    },
    "accessManagement": {
         "enabled": true
    }
}
EOF
}

module "aws_core_accounts" {
  # Git source: the double slash selects a subdirectory of the repository
  source = "git::https://github.com/nitheeshp-irl/terraform_modules.git//aws_core_accounts_module"

  logging_account_email  = var.logging_account_email
  logging_account_name   = var.logging_account_name
  security_account_email = var.security_account_email
  security_account_name  = var.security_account_name
}

module "aws_landingzone" {
  source                   = "git::https://github.com/nitheeshp-irl/blog_terraform_modules.git//aws_landingzone_module"
  manifest_json            = local.landingzone_manifest_template
  landingzone_version      = var.landingzone_version
  administrator_account_id = var.administrator_account_id
}

Governed Regions: Specifies the regions governed by the landing zone.
Organization Structure: Defines the security structure with a dedicated security account.
Centralized Logging: Configures logging, specifying the account ID and retention policies for logs.
Security Roles: Specifies the account ID for security roles.
Access Management: Enables access management.
Core Accounts: The core accounts code, also defined in the same file, is what sets up essential AWS accounts for logging and security.
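The internals of the two modules are not reproduced in this article. As a rough sketch of what they might contain, the block below creates the two core accounts and then the landing zone. It assumes a recent AWS provider that includes the aws_controltower_landing_zone resource. The resource labels are hypothetical, and the two halves would live in separate module directories.

```hcl
# Sketch only: an assumed outline of the two modules, not the author's actual code.

# --- aws_core_accounts module (assumed contents) ---
variable "logging_account_email" { type = string }
variable "logging_account_name" { type = string }
variable "security_account_email" { type = string }
variable "security_account_name" { type = string }

resource "aws_organizations_account" "log_archive" {
  name  = var.logging_account_name
  email = var.logging_account_email # must be unique across AWS
}

resource "aws_organizations_account" "security" {
  name  = var.security_account_name
  email = var.security_account_email
}

output "log_account_id" {
  value = aws_organizations_account.log_archive.id
}

output "security_account_id" {
  value = aws_organizations_account.security.id
}

# --- aws_landingzone module (assumed contents) ---
variable "manifest_json" { type = string }
variable "landingzone_version" { type = string }

resource "aws_controltower_landing_zone" "this" {
  manifest_json = var.manifest_json       # the rendered JSON manifest
  version       = var.landingzone_version # landing zone version to deploy
}
```

Because the manifest references the account IDs exported by the first module, Terraform creates the accounts before it attempts the landing zone.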

You can find the full code here: https://github.com/nitheeshp-irl/aws-landing-zone.

How to Create an Organizational Unit

Once the landing zone setup is finished, we can create OUs as per our business requirements. When you run this code, it reads the OU names from the variable file and creates one OU for each entry.

aws_region = "us-east-2"

organizational_units = [
  {
    unit_name = "apps"
  },
  {
    unit_name = "infra"
  },
  {
    unit_name = "stagingpolicy"
  },
  {
    unit_name = "sandbox"
  },
  {
    unit_name = "security"
  }
]

You can see the code here:
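For readers who want a feel for how a variable like organizational_units can drive OU creation, here is a minimal sketch. It is not the author's module. It assumes every OU is created directly under the organization root, and the resource labels are hypothetical.

```hcl
# Sketch only: assumes all OUs sit directly under the organization root.
variable "organizational_units" {
  type = list(object({ unit_name = string }))
}

# Look up the existing organization to find the root ID.
data "aws_organizations_organization" "org" {}

resource "aws_organizations_organizational_unit" "this" {
  # One OU per entry in the variable file, keyed by its name.
  for_each = { for ou in var.organizational_units : ou.unit_name => ou }

  name      = each.value.unit_name
  parent_id = data.aws_organizations_organization.org.roots[0].id
}

output "ou_ids" {
  value = { for name, ou in aws_organizations_organizational_unit.this : name => ou.id }
}
```

Keying for_each by the OU name means that removing one entry from the variable file only affects that OU.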

How to Automate Attaching Control Tower Control to the OU

Once you have created the OUs using the above repository, this repository applies Control Tower controls to them.

After creating the required objects, you can attach controls to the OU if you need them. Here is the main.tf file:

provider "aws" {
  region = var.region
}

module "aws_controls" {
  # Git source: the double slash selects a subdirectory of the repository
  source = "git::https://github.com/nitheeshp-irl/blog_terraform_modules.git//awscontroltower-controls_module"

  aws_region = var.aws_region
  controls   = var.controls
}

We used Terraform modules to create AWS resources.

Here are the control variables:

aws_region = "us-east-2"


controls = [
  {
    control_names = [
      "AWS-GR_ENCRYPTED_VOLUMES",
      "AWS-GR_EBS_OPTIMIZED_INSTANCE",
      "AWS-GR_EC2_VOLUME_INUSE_CHECK",
      "AWS-GR_RDS_INSTANCE_PUBLIC_ACCESS_CHECK",
      "AWS-GR_RDS_SNAPSHOTS_PUBLIC_PROHIBITED",
      "AWS-GR_RDS_STORAGE_ENCRYPTED",
      "AWS-GR_RESTRICTED_COMMON_PORTS",
      "AWS-GR_RESTRICTED_SSH",
      "AWS-GR_RESTRICT_ROOT_USER",
      "AWS-GR_RESTRICT_ROOT_USER_ACCESS_KEYS",
      "AWS-GR_ROOT_ACCOUNT_MFA_ENABLED",
      "AWS-GR_S3_BUCKET_PUBLIC_READ_PROHIBITED",
      "AWS-GR_S3_BUCKET_PUBLIC_WRITE_PROHIBITED",
    ],
    organizational_unit_names = ["infra", "apps"]
  }
]

You can see the code here:
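The controls module is not shown in this article, so the following is only a sketch of how the controls variable could be expanded into one aws_controltower_control resource per control and OU pair. It assumes the OUs sit directly under the organization root and that each control name maps to an identifier of the form arn:aws:controltower:REGION::control/NAME. The resource and local names are hypothetical.

```hcl
# Sketch only: an assumed implementation, not the author's module.
variable "aws_region" { type = string }

variable "controls" {
  type = list(object({
    control_names             = list(string)
    organizational_unit_names = list(string)
  }))
}

data "aws_organizations_organization" "org" {}

# List the OUs directly under the root so names can be mapped to ARNs.
data "aws_organizations_organizational_units" "root" {
  parent_id = data.aws_organizations_organization.org.roots[0].id
}

locals {
  ou_arn_by_name = {
    for ou in data.aws_organizations_organizational_units.root.children : ou.name => ou.arn
  }

  # Every (control, OU) combination from every entry in var.controls.
  control_pairs = flatten([
    for group in var.controls : [
      for pair in setproduct(group.control_names, group.organizational_unit_names) : {
        control = pair[0]
        ou      = pair[1]
      }
    ]
  ])
}

resource "aws_controltower_control" "this" {
  for_each = { for p in local.control_pairs : "${p.ou}:${p.control}" => p }

  control_identifier = "arn:aws:controltower:${var.aws_region}::control/${each.value.control}"
  target_identifier  = local.ou_arn_by_name[each.value.ou]
}
```

With the variable file above, this would yield 26 control resources: 13 controls applied to each of the infra and apps OUs.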

Conclusion

Navigating a multi-account strategy in AWS can be challenging, but with AWS Control Tower and a structured approach, it becomes manageable.

Using AWS Control Tower, your team can ensure that their AWS environments are secure, compliant, and well-organized. The automated setup, governance at scale, and centralized management through AWS Organizations provide a strong foundation for cloud infrastructure.

Implementing a landing zone through AWS Control Tower offers a secure and standardized starting point, allowing for quicker deployment and better governance. Using organizational units (OUs) segregates accounts based on business needs, improving security and operational efficiency. AWS IAM Identity Center simplifies access management, providing a unified authentication experience across multiple accounts and applications.

Service Control Policies (SCPs) help keep things secure and compliant by making sure all resources follow the organization's rules. Terraform Cloud and GitHub Actions make it easier to deploy resources, offering a smooth CI/CD pipeline for managing infrastructure changes.

Terraform Security Best Practices

freeCodeCamp — Thu, 28 Sep 2023 05:12:03 +0000

By Aaron Katz

Terraform is a popular Infrastructure as Code (IaC) tool that allows users to define and manage cloud infrastructure in a declarative way. However, like any tool, Terraform can introduce security risks if not used properly.

In this article, we will explore the most common security risks when using Terraform, the threat landscape and attack surface, how it can be exploited, and how users can stay secure by following Terraform security best practices.

What is Terraform and why Should I Use it?

Terraform is an open-source tool developed by HashiCorp that enables users to define and provision infrastructure resources across multiple cloud providers and on-premises environments. It allows organizations to treat infrastructure as code, bringing the benefits of version control, collaboration, and automation to infrastructure management.

By adopting Terraform and embracing the principles of Infrastructure as Code, organizations can achieve several benefits:

Consistency and repeatability: Infrastructure deployments become consistent and repeatable, eliminating the risk of manual errors and ensuring that the same configuration can be applied across different environments.
Version control and collaboration: Infrastructure code can be stored in version control systems, enabling teams to collaborate, track changes, and roll back to previous versions if needed.
Automation and scalability: Infrastructure deployments can be automated, allowing organizations to scale their infrastructure quickly and efficiently based on demand.
Auditing and compliance: Infrastructure code can be audited and reviewed for compliance with security and regulatory standards, ensuring that best practices are followed.

While Terraform offers a powerful and flexible solution for managing infrastructure, it is essential to address the security considerations associated with using the tool effectively.

Security considerations

As with any tool or technology, Terraform is not immune to security threats and challenges. It is crucial to understand the threat landscape and the potential risks associated with using Terraform.

By identifying these risks, organizations can implement appropriate security measures to mitigate them effectively.

Configuration errors

One of the primary security risks associated with Terraform is misconfigurations in the infrastructure code.

Misconfigurations can lead to vulnerabilities and expose critical resources to unauthorized access or compromise. Common misconfigurations include weak access controls, open network ports, and incorrect permission settings on cloud resources.

Secrets management

Terraform relies on access keys and secret keys to authenticate with cloud providers and provision resources.

Storing these credentials insecurely can lead to security vulnerabilities such as unauthorized access and data breaches.

Handling access credentials securely is crucial to the overall security posture of Terraform deployments.
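One common way to keep long-lived keys out of the configuration is to leave credentials to the environment and have the provider assume a role. The sketch below assumes AWS. The region, the deploy_role_arn variable, and the session name are hypothetical.

```hcl
# Sketch only: var.deploy_role_arn and the session name are hypothetical.
variable "deploy_role_arn" {
  type        = string
  description = "IAM role Terraform assumes to manage resources"
}

provider "aws" {
  region = "us-east-1"

  # No access_key or secret_key here: base credentials come from the environment
  # (for example, a CI runner's short-lived identity), then the role is assumed.
  assume_role {
    role_arn     = var.deploy_role_arn
    session_name = "terraform"
  }
}
```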

State security

Terraform uses state files to keep track of the resources it has created or modified.

These state files contain sensitive information, such as resource IDs, metadata, and secrets as referenced above. Inadequate management of state files can lead to security risks, including unauthorized access and data exposure.

State files can be stored either locally on a machine (suitable only for solo testing) or on a remote backend, such as a cloud storage resource, which should be encrypted and locked down.

Supply chain security

As with any software development process, the supply chain of Terraform modules and associated dependencies can introduce security risks.

Organizations must assess the trustworthiness and security of the modules they use and ensure they are regularly updated to address any vulnerabilities.

Permissions management

Appropriate permissions and access controls are essential to prevent unauthorized changes to infrastructure resources.

Managing permissions effectively can help reduce the risk of accidental or malicious actions that could compromise the security of the infrastructure.

Recommendations

To stay secure while using Terraform, users should follow these best practices:

Execute Terraform programmatically to minimize human error and enforce security policies.
Use safe Terraform modules and avoid using untrusted or vulnerable third-party components.
Secure the data store when remotely storing state data to prevent unauthorized access (for example, through encryption and restrictive access permissions).
Avoid storing secrets in state files; instead, use secret management solutions like AWS Secrets Manager, Azure Key Vault, or Google Cloud Secret Manager.
Use Terraform security scanners to identify and remediate potential vulnerabilities.
Implement centralized security policy and governance within Terraform code to improve visibility and enforce least privilege.
Require multi-factor authentication for collaborators to improve security posture.
Keep Terraform and all modules up to date.
Regularly audit Terraform configurations for security vulnerabilities and misconfigurations, and build in automated tooling to detect violations prior to deployment.
Conduct regular drift detection to determine whether the resources deployed in your cloud provider match what is recorded in the Terraform state file.

Conclusion

Terraform is a powerful tool for managing cloud infrastructure, but it can introduce security risks if not used properly.

By following Terraform security best practices, users can minimize the risk of security incidents and maintain a secure Terraform environment.

It's essential to keep Terraform configurations and infrastructure up-to-date, monitor for security threats, and adapt to the evolving threat landscape. By doing so, users can ensure that their cloud infrastructure is secure and compliant with industry standards.

Terraform Certified Associate (003) – How to Study for the Exam

freeCodeCamp — Mon, 11 Sep 2023 19:17:49 +0000

By Chris Williams

I've been meaning to get my Terraform Associate certification for some time now, but something always got in the way.

Finally I was able to sit down and work my way through the study materials.

Currently Andrew Brown and I are creating two Terraform Bootcamps: one for beginners and the other one for intermediate practitioners. These bootcamps will be similar to Andrew's AWS Cloud Project Bootcamp (YouTube playlist here).

In this guide, I've compiled my live study notes that I've used for prepping to sit the Terraform Certified Associate Exam to help you know what to study.

Here's what I'll cover:

Preparation Materials
How to Use This Guide
Understand Infrastructure as Code (IaC) concepts
Understand the purpose of Terraform (vs other IaC tools)
Understand Terraform Basics
Use Terraform outside the core workflow
Interact with Terraform modules
Use the core Terraform workflow
Implement and maintain state
Read, generate, and modify configuration
Understand Terraform Cloud capabilities

Preparation Materials

In getting ready for any exam, I like to list out the study materials and reference resources that I'm going to be using ahead of time. This allows me to schedule my study time with a bit more discipline.

Here are the materials I've used:

HashiCorp Cloud Engineer Certifications (Free)

This site has a wealth of information:

https://developer.hashicorp.com/certifications

The Terraform Associate Prep Tutorials (Free)

https://developer.hashicorp.com/terraform/tutorials/certification-003

The newly updated freeCodeCamp course by Andrew Brown (Free) 😁
Jumppad.dev and their Terraform-workshop repository (Free)
The Terraform Hands On Labs Udemy course by Bryan Krausen (Paid)

How to Use This Guide

Each of the below sections will cover one of the nine domains specified in the Terraform Review Guide. Read through the documentation, complete the tutorials, and dig into the additional links I've provided.

The sections below are the large, important bits of information that I've culled for each domain, but this study guide is not comprehensive. Depending on how comfortable you are with the domain-specific knowledge, you will need to dive into the links provided in each section to round out your understanding of the material.

Understand Infrastructure as Code (IaC) concepts

Domain 1 covers the broad concepts of IaC. Why do we want to use it? What is it good for? Are there any areas where you would NOT want to use it? What are the different kinds of languages that can be used for IaC and how are the approaches different?

Explain what IaC is:

Manually configuring your infrastructure is fine for prototyping, but is prone to human error at scale (or when you need to provision the same env repeatedly). IaC is a blueprint of your infra and allows you to share/version/inventory/document your infra.

There are two main types of IaC tools:

Declarative = What you see is what you get. You explicitly describe the end state, which lowers the chance of misconfiguration:

Azure only -> ARM Templates, Azure Blueprints
AWS only -> CloudFormation
GCP only -> Cloud Deployment Manager
All of the above (& many others) -> Terraform

Imperative = Uses existing programming languages like Python, JS or Ruby:

AWS only -> AWS CDK
AWS, Azure, GCP, K8s -> Pulumi

Terraform supports for expressions, dynamic blocks, and complex data structures, so it's declarative with some imperative benefits.
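
Here's a minimal sketch of what those "imperative benefits" can look like in practice. The variable name, security group name, and CIDR range are placeholders I made up for the example.

```hcl
# Hypothetical input: the ports we want to open.
variable "ingress_ports" {
  type    = list(number)
  default = [80, 443]
}

locals {
  # for expression: builds a map such as { "80" = "allow-80", "443" = "allow-443" }
  port_labels = { for p in var.ingress_ports : tostring(p) => "allow-${p}" }
}

resource "aws_security_group" "web" {
  name = "example-web" # hypothetical name

  # dynamic block: generates one ingress block per port in the list
  dynamic "ingress" {
    for_each = var.ingress_ports
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/16"] # assumed internal range
    }
  }
}
```

You still describe the end state, but the loop-like constructs save you from copy/pasting near-identical blocks.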

The Infrastructure Lifecycle means having clearly defined work phases for planning, designing, building, testing, maintaining, and retiring your infrastructure.

Idempotent: a property of some operations such that no matter how many times you execute them, you achieve the same result. Terraform is idempotent because, no matter how many times you run the same configuration file, you will end up with the same expected state.

Configuration Drift: an unexpected configuration change away from what is stated in the config file. Can be due to manual adjustment (console access in prod = BAD 😂), evil h@xx0rs, etc... How do we fix it?

Detect: use a compliance tool like AWS Config, or built-in support e.g. AWS CF Drift Detection, TF statefiles
Correct:
- TF refresh & plan commands
- Manually correct (try not to do this)
- Reprovision (comes with its own risks)
Prevent:
- use immutable infrastructure
- always create & destroy, never reuse
- use GitOps to version control IaC:
  - Create tf file
  - commit
  - Pull Request
  - peer review
  - commit to main
  - GitHub action triggers build

Mutable vs Immutable infrastructure

Think of mutable infrastructure as (1) building a base image (2) Deploying that base image then (3) configuring the software after deploy.

Think of immutable infrastructure as (1) building a fully installed base image, (2) deploying it, then (3) if a change needs to be made, tearing down that infra and rebuilding it with a new fully installed base image.

Mutable = Develop -> Deploy (VM) -> Configure (e.g. cloud-init)
Immutable = Develop -> Configure (Packer) -> Deploy

Describe advantages of IaC patterns:

Why is Infrastructure as Code important? It allows you to:

Build & manage your infra in (relatively 😅) safe, consistent & repeatable ways
Share & reuse your configurations more easily
Manage infra on multiple cloud platforms
Track resource changes
Use version control (Git, GitHub, etc..) to collaborate with team members

Understand the Purpose of Terraform (vs other IaC tools)

Domain 2 spells out the differences between Terraform and the other IaC offerings available in the market. Agnostic and cloud-specific IaC tools each have their own place in the market, and you will choose between them based upon your (and your company's) needs.

Explain multi-cloud and provider agnostic benefits:

Increases fault-tolerance
Allows for more graceful recovery from cloud provider outages
Reduces complexity because each provider has its own interfaces, tools, and workflows that Terraform abstracts for you
Use the same workflow to manage multiple providers and handle cross-cloud dependencies
Unified resource view
A technology agnostic approach/workflow

Explain the benefits of state

State (a statefile) is necessary for Terraform to function
It is a map referencing a resource in the tf file to an actual resource that is deployed
- For example resource "aws_instance" "webserver" {} mapping to known instance "i-0dfcf96cceba9bc77"
Metadata tracking
- resource dependencies
- build/delete order tracking
- Ordering within one provider and across multiple providers -> complexity quickly ramps up
For larger environments use -refresh=false and -target flags
- querying every resource can take too long
- cached state is treated as record of truth
Use remote state when working in teams
- remote locking prevents 2 admins making simultaneous changes

Understand Terraform Basics

Domain 3 gets into the commands and processes that you will need to understand to leverage Terraform. It covers the installation of Terraform itself, the providers, what modules are, and the basic workflow that you will do when building IaC environments.

Some helpful Terraform CLI cheat codes: terraform -help and terraform (command) -help.

Terraform lifecycle:

code - create or edit your terraform config file
terraform init - Initialize workspace, pull providers and modules
terraform plan - see what changes will be made (or generate an execution plan) also known as a "dry-run"
terraform validate - ensure types, values, and required attributes are valid and present
terraform apply - make the things!
terraform destroy - unmake the things! 😱

Diagram showing a basic Terraform workflow

HCL Syntax:

The syntax of the Terraform language consists of a few standard elements:

Standard Elements of HCL

For example, this is a basic resource block that will spin up an EC2 instance:

HCL Example with standard elements highlighted

resource "aws_instance" "terraform_101_server" {
  ami           = "ami-0b5eea76982371e91"
  instance_type = "t2.micro"
}

Blocks are containers for other content and usually represent the configuration of some kind of object (like a resource). Blocks have a block type, can have zero or more labels, and have a body that contains any number of arguments and nested blocks. There are several types of blocks:
- Terraform block - settings for the execution environment of Terraform itself (required terraform version, backend settings, and so on.)
- Provider block - details of the provider(s) being used. Includes information like access mechanisms, regional options, profile to use, and so on...
- Resource block - specifies a single uniquely named resource managed by terraform. Includes resource type, name, and config options
- Data block - data sources that can be queried (cloud provider, local list, etc.)
- Module block - reusable set of resources that can be leveraged across multiple terraform configs
- Output block - Resources managed by Terraform each export attributes whose values can be used elsewhere in configuration. Output values are a way to expose some of that information to the user of your module. (For example, the IP address of an EC2 instance).
- Variable block - Defines variables to be used in the Terraform config. Input variables let you customize aspects of Terraform modules without altering the module's own source code. This functionality allows you to share modules across different Terraform configs, making your module composable and reusable. Variable names have to be unique 😉
  - order of precedence: defaults < env vars < terraform.tfvars file < terraform.tfvars.json file < .auto.tfvars < command line (-var & -var-file)
- Locals block - A local value assigns a name to an expression, so you can use the name multiple times within a module instead of repeating the expression.
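
To see how a few of these block types fit together, here's a small sketch built around the EC2 example above. The variable, local, and output names are ones I invented for illustration.

```hcl
# Variable block: an input with a default value (hypothetical name).
variable "environment" {
  type        = string
  description = "Deployment environment name"
  default     = "dev"
}

# Locals block: name an expression once, reuse it everywhere.
locals {
  name_prefix = "terraform-101-${var.environment}"
}

# Resource block: the EC2 instance from the earlier example, now tagged.
resource "aws_instance" "terraform_101_server" {
  ami           = "ami-0b5eea76982371e91"
  instance_type = "t2.micro"

  tags = {
    Name = "${local.name_prefix}-server"
  }
}

# Output block: expose an attribute of the resource to the user.
output "server_public_ip" {
  value = aws_instance.terraform_101_server.public_ip
}
```

Overriding the default (for example with -var "environment=prod") changes the tag without touching the resource block itself.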

Install and version Terraform providers

Terraform relies on Providers to allow Terraform to interact with remote systems (CSPs, SaaSs, APIs, and so on).

Some providers require additional config info (endpoints, regions used, etc..) to work.

You must declare which providers your Terraform configs need in a required_providers block. Provider configurations (provider blocks) go in the root module, and child modules get their provider configs from the root module (see Requiring Providers for more details)

Use the alias meta-argument to define multiple configs for the same provider (for example, to support multiple regions for a cloud platform)

The required_providers block defines all of the providers needed by the current module

To ensure that multiple users run the same Terraform config (with the same provider versions), you:

Specify provider version constraints
Use the dependency lock file:
- named .terraform.lock.hcl
- updates when you run the terraform init command
- should be included in version control repo!
- if a provider is in the lock file, TF will always use that version unless you terraform init -upgrade
If it does upgrade, review the changes 😉:

Example of a lockfile change of the AWS provider version

Describe plugin-based architecture

Terraform is split into 2 main parts:

Terraform Core: a statically compiled binary (written in Go). When you type terraform in the CLI you are invoking the core functionality:
- reading and interpolating config files and modules
- state mgmt
- building a resource graph
- plan execution
- talking to plugins

Terraform Plugins: executable binaries invoked by Terraform Core over RPC.

Each plugin is geared towards a specific service (like AWS).
All Providers and Provisioners used in Terraform configs are plugins
Provider plugins are responsible for:
- Initializing libraries for making API calls
- Authentication with the infra provider
- Defining resources that map to specific services

Write a Terraform configuration using multiple providers

Sometimes you'll need to reference the same provider for multiple reasons. In the below example we're using multiple regions within AWS, therefore we need a mechanism for distinguishing between the two provider configurations. Enter the alias argument. With it you can assign resources to a specific provider configuration:

Example of using multiple providers with the alias argument

provider "aws" {
  profile = "prod"
  region  = "us-east-1"
}

provider "aws" {
  profile = "prod"
  region  = "us-west-2"
  alias   = "west"
}
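
Declaring the alias is only half of it. Here's a rough sketch of how a resource would pick one of the two provider configurations above. The resource names and the west-coast AMI ID are placeholders (AMI IDs are region-specific, so you'd need a real one for us-west-2).

```hcl
# No provider argument: uses the default (un-aliased) aws provider, us-east-1.
resource "aws_instance" "east_server" {
  ami           = "ami-0b5eea76982371e91"
  instance_type = "t2.micro"
}

# provider = aws.west: uses the aliased configuration, us-west-2.
resource "aws_instance" "west_server" {
  provider      = aws.west
  ami           = "ami-00000000000000000" # placeholder, not a real AMI
  instance_type = "t2.micro"
}
```

The reference format is the provider name, a dot, then the alias, with no quotes around it.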

Describe how Terraform finds and fetches providers

Required providers are specified in the (surprise!) required_providers block nested inside the top-level terraform block:

Example usage of the required providers block

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.8.0"
    }
  }
}

The source value specifies the primary location where Terraform can download it (see link for specifics around syntax).
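
The example above pins one exact provider version. As a sketch of the alternative, you can use a version constraint instead. The specific constraint values below are just ones I picked for illustration.

```hcl
terraform {
  # Constrain the Terraform CLI version itself (assumed minimum for this example).
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # "~> 5.8" allows 5.8 and later 5.x releases, but not 6.0
      version = "~> 5.8"
    }
  }
}
```

Whatever version terraform init actually selects within the constraint is then recorded in .terraform.lock.hcl.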

Use the commands terraform version (what version of core and what plugins are installed) and terraform providers (what providers are required by the configuration) to get more information about config requirements:

Output of the terraform version command

Output of the terraform providers command

Use Terraform Outside the Core Workflow

Now that we've learned the basics of Terraform, Domain 4 gets into more workflows that you will see very often in the real world.

This domain answers questions like "If you built an environment without using Terraform, how would you move those resources over to a Terraform managed state?" and "What happens when everything breaks?"

Describe when to use terraform import to import existing infra into your Terraform state (note: can only import 1 resource at a time)

Write a resource block for it in your configuration
- resource name must be unique (like any regular resource block)
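Run terraform import with the syntax terraform import [options] ADDRESS ID
- ADDRESS is the resource address in your config (for example, aws_instance.webserver)
- ID is the provider-specific ID of the existing resource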
- each remote object must be bound to only one resource block in the terraform config
Run terraform plan & review how the config compares to the imported resource
Make adjustments to the config to reach desired state
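
If you're on Terraform 1.5 or later, the same workflow can also be written in config with an import block instead of the CLI command. Here's a minimal sketch that reuses the instance ID from the state example earlier. The resource arguments are assumptions and would need to match the real instance.

```hcl
# The resource block you write for the existing instance.
resource "aws_instance" "webserver" {
  ami           = "ami-0b5eea76982371e91" # assumed to match the real instance
  instance_type = "t2.micro"              # assumed to match the real instance
}

# Bind the existing remote object to that resource address.
import {
  to = aws_instance.webserver
  id = "i-0dfcf96cceba9bc77"
}
```

With this in place, terraform plan shows the import as part of the plan, so you can review it before anything is written to state.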

Use terraform state to view Terraform state

Use this instead of modifying the statefile directly
- Never modify the statefile directly ever ever ever EVER
Used for advanced state mgmt
Works with both local and remote statefiles
Creates a backup that cannot be disabled
terraform state -help to get started
terraform state list to get a less-cluttered view of resources under mgmt
terraform state show [resource] to get granular resource data (very handy):

Output of the terraform state command

Describe when to enable verbose logging and what the outcome/value is

You can generate logs for Terraform Core and Providers separately.

log levels = TRACE > DEBUG > INFO > WARN > ERROR
To enable core logging, set env var TF_LOG_CORE=(log level)
- linux example export TF_LOG_CORE=TRACE
- powershell example $env:TF_LOG_CORE="TRACE"
To enable provider logging, set env var TF_LOG_PROVIDER=(log level)
- linux example export TF_LOG_PROVIDER=TRACE
- powershell example $env:TF_LOG_PROVIDER="TRACE"
To persist logs, set env var:
- linux export TF_LOG_PATH=logs.txt
- powershell $env:TF_LOG_PATH="logs.txt"
To undo env var, reset values to null:
- export TF_LOG_CORE=""
- export TF_LOG_PROVIDER=""
- export TF_LOG_PATH=""

Output of the terraform refresh command

Interact with Terraform Modules

Domain 5 expands your knowledge of Terraform by introducing the concept of modules. A lot of people are using Terraform and they are creating modules to help make it easier for everyone to provision resources. Don't recreate the wheel, use modules!

Contrast & use different module source options including the public Terraform Registry

Browse module section of the Terraform registry

A Terraform module is just a set of Terraform configuration files inside a folder – nothing to be afraid of. 😊

Every Terraform config has at least one module, the root module. If you have child modules, the root module can make calls to them.

Child modules are just files outside of the working directory. They can be a folder on your system, up in a GitHub repo, S3 bucket, and so on. Check out all the options here

The source arg in the module block tells Terraform where to find the source code for the module. The syntax for a registry module's required source argument is <NAMESPACE>/<NAME>/<PROVIDER>

Example: terraform-aws-modules/vpc/aws

Use the Terraform Registry to find and use 'public' modules

terraform init will download and cache modules ref'ed in a config
By default, only verified modules are shown in the Terraform registry. You can change that with filters.
Modules in the registry are versioned using the version argument
Private Registry Module Sources syntax = <HOSTNAME>/<NAMESPACE>/<NAME>/<PROVIDER>

Example of using a Terraform module

module "ec2_instance" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "3.5.0"
  count   = 2

  name = "my-ec2-cluster"

  instance_type          = "t2.micro"
  vpc_security_group_ids = ["sg-12345678"]
  subnet_id              = module.vpc.public_subnets[0]

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}

Interact with module inputs and outputs

Input variables let you customize aspects of Terraform modules without altering the module's own source code

Each input variable accepted by a module must be declared using a variable block. As always, variable names must be unique w/in the module
If a variable doesn't have a default value assigned to it, it's required

Output blocks provide back information from the resources generated by the module.

To use output information in the 'calling module' (normally the root module), reference the output block of the 'sending module' with the syntax module.<MODULE NAME>.<OUTPUT NAME>

child module output being called by calling module
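
Here's a minimal sketch of a child module output being used by the root module. The module path, module name, and output names are hypothetical, and the two files are shown in one block just to keep it short.

```hcl
# --- modules/webserver/main.tf (hypothetical child module) ---
variable "instance_type" {
  type    = string
  default = "t2.micro"
}

resource "aws_instance" "this" {
  ami           = "ami-0b5eea76982371e91"
  instance_type = var.instance_type
}

# The child module has to expose the value explicitly.
output "public_ip" {
  value = aws_instance.this.public_ip
}

# --- main.tf (root module) ---
module "webserver" {
  source        = "./modules/webserver"
  instance_type = "t2.micro"
}

# module.<MODULE NAME>.<OUTPUT NAME>
output "webserver_public_ip" {
  value = module.webserver.public_ip
}
```

Without the output block in the child module, the root module has no way to reach aws_instance.this.public_ip.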

Describe variable scope w/in modules/child modules

Deciding what is in and out of scope for a module can be challenging. Don't overuse a module by putting too many resources into one.

A good rule is to limit it to one resource type offered by a provider.

You can group infrastructure that is always deployed together. You can also group resources with the same set of privileges where possible to minimize blast radius.

Try to separate long-running resources from short-lived resources: don't put your production database in the same module as your dev lambdas. 😂

Variables are declared in the child module. If you want information from the child module, you must create an output block for that info in the child module & then call it using interpolation syntax. You cannot access child module resource information otherwise.

If you are building a module, it is good practice to create output blocks for all resource info available, even if you don't see an immediate use for it:

Example layout of root and child modules

Set module version

Where to check module versions

Much like other blocks, modules can be versioned. If you change the module version, you will need to run terraform init again.

It is recommended to version constrain your module usage to prevent unexpected changes from occurring
Versioning is supported for the public Terraform Registry & TFC's private module registry
Local file modules do not support versioning

Use the Core Terraform Workflow

Domain 6 gets much more granular into the core Terraform workflow. In the real world, this is the process that you will do again and again and again (and again!), so it's important that you understand all of the little details.

Especially pay attention to how the processes interact with each other and exactly WHAT is happening when you use a particular command.

Describe Terraform workflow (write -> plan -> create)

Write:

Author your IaC in your editor of choice (like VS Code😉)
Written in HashiCorp Configuration Language (HCL)
Store your work in VCS (like Git + GitHub)

Plan:

Initialize the working directory with terraform init
Preview changes before you apply with terraform plan
Do this repeatedly as you are building your config to fix errors and have a tight feedback loop
Can output and save for later with terraform plan -out [plan name]

Create:

terraform apply to deploy the infra that you've written
Can apply a previous saved plan with terraform apply [plan name]
Recommended: push your config to remote repo for redundancy/safekeeping

Initialize a Terraform directory (`terraform init`)

initializes (get it?) a working directory that contains .tf files
This is the first command that should be run after writing/cloning a new config. You can find Command Options here
Root config directory is checked for backend config data
- to update backend, use -reconfigure or -migrate-state
Sources and downloads the providers and modules used in the config
Creates a lock file .terraform.lock.hcl to pin provider versions
Reasons to re-initialize a config:
- adding a new provider
- upgrading/downgrading the version of a provider with terraform init -upgrade
- adding a new module
- upgrading/downgrading the version of a module with terraform init -upgrade
- changing the location of the backend (statefile)

Validate a Terraform config (`terraform validate`)

Know the limitations! It DOES:

Validate local configuration files
Check syntactic validity
Check internal consistency
Check correctness of attribute names, value types, and expected argument types

It DOES NOT:

Access remote services (remote state, provider APIs, upstream dependencies, etc...)
Check with backend provider to ensure external consistency

Generate & review an execution plan (`terraform plan`)

This creates an execution plan to preview changes Terraform wants to make to your env.

Here are the steps:

Reads the current state (if any) of remote objects under mgmt to make sure state is up to date
Compares current config to prior state & notes changes
Proposes changes that will make the remote env match the config

You can use terraform plan -help to see all the options, and terraform plan -out [plan name] to create a file to be reviewed/used later.

Use terraform plan -refresh-only to detect drift between your state and the actual environment
Doesn't change the env, so you can run it multiple times
Symbols in the plan output:
+ resource will be created
- resource will be destroyed
~ resource will update in place
-/+ resource will be destroyed & recreated

creating 30 new resources

Execute changes to infra (`terraform apply`)

Provisions, changes, and destroys (recreates) resources in an environment
Executes the proposed actions of a terraform plan
Will only affect the resources under mgmt
Running it without a plan causes it to automatically run a plan, then execute it
You can save a plan ahead of time with terraform plan -out=[plan name]
terraform apply -auto-approve bypasses the post-plan manual 'yes' check
terraform apply [plan name] to execute a previously saved plan

Destroy Terraform managed infra (`terraform destroy`)

You'll never guess what this command does 😂
Destroys all infrastructure under management by Terraform (nothing else)
Automated cleanup is better because you always forget to delete something when you build it manually
Be careful with terraform destroy -auto-approve
terraform apply -destroy also does the same thing

Are you really REALLY sure?!?!

Apply formatting & style adjustments to a config (terraform fmt)

Formats and styles your code for better readability
Does NOT fix your errors
Lists which files it updates
Very useful 😁

How terraform fmt works

Implement and Maintain State

State is THE MOST IMPORTANT THING in any given Terraform managed environment. Without your statefile, you're going to have a bad day.

Domain 7 walks through the different types of state, how to move it around, and how to protect it.

Describe default `local` backend

Terraform stores and references the state of all terraform managed environments.

The configuration file is what we want the environment to look like, and the statefile is a one-to-one mapping of the provisioned resources to the config file.

The local backend:
- stores state on the local filesystem
- locks that state using system APIs
- performs operations locally

Default statefile is named terraform.tfstate and lives in the working directory. If you don't specify a backend, terraform uses the default local backend.

You CAN explicitly specify local backend if you want to have greater control over statefile location, future state migration considerations, and so on.

Explicitly defined default local state

terraform {
  backend "local" {
    path = "terraform.tfstate"
  }
}

Describe state locking

State data (the statefile) is the source of truth for Terraform and therefore is very important. As such it needs to be protected from several file corruption and data loss scenarios.

Backends are responsible for storing state and providing an API for state locking. State locking is optional but highly recommended in multi-user environments.

You can manually retrieve remote state with terraform state pull
You can manually write state with terraform state push.... but don't ever ever ever do this without proper supervision and guidance and backups.

State locking prevents multiple users from making changes to a managed environment simultaneously (potentially corrupting state). Locking happens automatically on all potential state writing operations.

You can disable state locking with -lock=false but don't 😅
You can also terraform force-unlock [LOCK_ID] if unlocking fails, but this is a break glass emergency use case

Not all backends support locking! Local, TFC, AWS S3 (with some tweaks), and several others do (see docs for which ones do/don't).

Handle backend and cloud integration auth methods

When you have state backend stored somewhere other than local, you'll need to have some form of authentication - this is very sensitive information that needs to be protected!

Each backend has its own auth mechanism (for example, access keys for AWS).

Some other things to keep in mind:

Arguments used in the block body are specific to the chosen backend type
If you want to change backend location you'll need to start with terraform init -reconfigure
terraform login [hostname] is used to obtain and save an API token from TFC, TFE, or other host that offers Terraform services
If you don't explicitly provide a hostname, cmd defaults to TFC at app.terraform.io
By default Terraform pulls & saves API token in plain text to credentials.tfrc.json - this can be modified for other secrets mgmt systems
To configure a backend block, add a nested backend into the top-level terraform block:

Defined remote state

terraform {
  backend "remote" {
    organization = "example_corp"

    workspaces {
      name = "my-app-prod"
    }
  }
}

Differentiate remote state backend options

Terraform has a built-in selection of backends and the configured backend must be available in the version of Terraform you are using. This is why it's important to version everything in your configs!

You don't need to configure a backend when using TFC because it auto-manages state in the workspaces assoc. to the config. If your config includes a cloud block it cannot have a backend block.

Manage resource drift and Terraform state

terraform plan -refresh-only

Creates a plan that updates state to match changes made outside of terraform
Good for drift detection
Does not propose any actions to undo changes

terraform apply -refresh-only

Updates the statefile to accept the changes made manually in the environment
Does NOT change the config file! If you don't update the config with the new changes, then the next terraform apply of that config will revert environment back to the original state

Describe `backend` block and cloud integration configuration

Defines where terraform stores the statefile for a working directory
The statefile is the source of truth for resources under terraform management and therefore is extremely important (I might have said this before 😉)
By default terraform uses local which stores statefile on local disk
Every other backend type (remote for TFC, s3, and so on...) stores the statefile somewhere other than local disk
Limitations:
- 1 config, 1 backend block
- Cannot use interpolation (so we can't use variables)
Partial configurations
- Omit certain arguments to be supplied at runtime
- Useful for automation scripting & CI scenarios
When using Terraform Cloud, you don't need to configure a backend because TFC manages state in the workspace associated with the config
cloud block is nested in terraform block
Limitations:
- 1 config, 1 cloud block
- If the config includes a cloud block it cannot also have a backend block
- Cannot use interpolation
- See cloud settings for more information
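
For comparison with the backend "remote" example earlier, here's a sketch of the same setup written with a cloud block. I've reused the organization and workspace names from that example; they're placeholders, not real values.

```hcl
terraform {
  # The cloud block replaces the backend block for Terraform Cloud.
  # A config can have one or the other, never both.
  cloud {
    organization = "example_corp"

    workspaces {
      name = "my-app-prod"
    }
  }
}
```

Just like a backend block, the values here have to be literals, so no variables or locals.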

Understand secret management in state files

The statefile can contain a lot of secrets!!!
- resource IDs
- DB username/passwords
- private keys
- Andrew Brown's home phone number

So you should treat it like you would your company passwords. By default local state is stored in plaintext as JSON. You can mark sensitive information in your config files as such with the sensitive = true argument.

Values marked as sensitive are redacted from the CLI output, but they are still stored in the statefile as plaintext (that's why it's important to lock down the statefile).

Use a backend that encrypts and protects your statefile from unauthorized access.

TFC encrypts state at rest & in transit
Turn on encryption if you are using S3 (& use state locking!)
Terraform does not persist state to local disk when remote state is being used
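
As a rough sketch of the S3 advice above, an encrypted backend with locking might look like this. The bucket and table names are placeholders, and the DynamoDB table is assumed to already exist with a LockID partition key.

```hcl
# Illustrative S3 backend with encryption and state locking.
# All names below are hypothetical placeholders.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                      # encrypt the state object at rest
    dynamodb_table = "example-terraform-locks" # table used for state locking
  }
}
```

Newer Terraform releases also offer other locking options for S3, so check the S3 backend documentation for the version you are running before copying this.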

Read, Generate, and Modify Configuration

Domain 8 teaches you more about the config files and how to leverage HCL fully. Like any programming language, HCL can be refactored for ease of understanding and keeping your code DRY (Don't Repeat Yourself).

Demonstrate use of variables and outputs

Programming analogies:
- Input variables = function arguments
- Output values = function return values
- Local values = function local variables
Each input variable accepted by a module must be declared using a variable block:
- Variable names can be any valid name except source, version, providers, count, for_each, lifecycle, depends_on, or locals
- Input order of precedence: defaults < env vars < terraform.tfvars file < terraform.tfvars.json file < .auto.tfvars < command line (-var & -var-file)
  - side note: terraform.tfvars is the most popular way for manipulating variables used out in the wild
By the same token, each output value must be declared using an output block
- In the root module, the output is displayed to the user
- In a child module, the output can be used to access a value by the root module (module.<MODULE NAME>.<OUTPUT NAME>)
- Outputs only render on terraform apply, not terraform plan
- terraform output will display your outputs without running an apply
- terraform output <NAME> to pull a specific value
- marking an output value as sensitive suppresses the value in the CLI during a terraform apply, but NOT in the statefile or a terraform output
- if you mark a variable as sensitive but NOT an output for that variable, it will error out
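
Here is a small sketch that ties the three kinds of values together. The variable, local, and output names are invented for illustration, and no provider is needed to try it.

```hcl
# Input variable: like a function argument.
variable "environment" {
  type        = string
  description = "Deployment environment name"
  default     = "dev"
}

# Sensitive input variable with no default.
variable "db_password" {
  type      = string
  sensitive = true
}

# Local value: like a local variable inside the function.
locals {
  name_prefix = "app-${var.environment}"
}

# Output value: like a function return value.
output "name_prefix" {
  value = local.name_prefix
}

# An output derived from a sensitive variable must also be marked sensitive,
# otherwise Terraform reports an error.
output "db_password" {
  value     = var.db_password
  sensitive = true
}
```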

Describe secure secret injection best practices

Mark sensitive values as sensitive
Never put actual secret values into a .tf file as they would be checked into source control
- Passwords, API tokens, access tokens, etc... must be obfuscated
Never check your statefile into source control (same reason as above)
Use environment variables by setting TF_VAR_<name>
If you are using Terraform Cloud, use the environment variables for the appropriate workspace
Use a secrets management solution like Vault
- Run through this tutorial to get a feel for injecting secrets into Terraform using Vault
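
A minimal sketch of the environment variable approach, assuming a hypothetical variable called api_token:

```hcl
# Declared with no default, so the value has to come from outside the .tf files.
# "api_token" is a hypothetical name used only for this example.
variable "api_token" {
  type        = string
  description = "API token injected at runtime"
  sensitive   = true
}
```

Exporting TF_VAR_api_token in the shell (or in the CI job's secret store) before running terraform plan or terraform apply fills the variable without the secret ever appearing in a .tf or .tfvars file. The value can still end up in the statefile if a resource uses it, so the state protections above still apply.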

Understand the use of collection and structural types

Collection types group multiple values of a single type
- All elements of a collection must be of the same type, so list(string) is different from list(number)
- 3 kinds of collection types:
  - list(...) sequence of ordered elements (starting at 0)
  - map(...) collection of key/value pairs separated by a comma or newline. Can confusingly use both = or : as the k/v separator
  - set(...) a collection of unique, unordered values
Structural types allow multiple types of elements to be grouped together
- 2 kinds of structural types:
  - object({<KEY> = <TYPE>, ...}) named attributes where each one has its own type
  - tuple([<TYPE>, <TYPE>, ...]) sequence of ordered elements (starting at 0) where each position has its own type
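
The type constraints above can be sketched as variable declarations. All names and default values here are made up for illustration.

```hcl
# Collection types: every element shares one type.
variable "availability_zones" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b"]
}

variable "instance_tags" {
  type = map(string)
  default = {
    Team = "platform"
    Env  = "dev"
  }
}

variable "allowed_ports" {
  type    = set(number)
  default = [22, 443] # converted to a set, so order and duplicates are dropped
}

# Structural types: each attribute or position has its own type.
variable "server" {
  type = object({
    name = string
    size = number
  })
  default = {
    name = "web"
    size = 2
  }
}

variable "endpoint" {
  type    = tuple([string, number])
  default = ["localhost", 8080]
}
```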

Create and differentiate `resource` and `data` configuration

Providers can expose both Resources and Data Sources (the AWS provider, for example, offers both)
You can query resources you've created in Terraform (via their exported Attribute References)
And also do data lookups for existing resources that haven't been made with Terraform
Do the Query data sources tutorial
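
A short sketch of the difference, assuming the AWS provider. The region, AMI name filter, and instance type are illustrative assumptions rather than recommendations.

```hcl
provider "aws" {
  region = "us-east-1"
}

# Data source: looks up something that already exists outside this config.
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"] # assumed name pattern, adjust as needed
  }
}

# Resource: something Terraform creates and manages.
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id # value read from the data source
  instance_type = "t3.micro"
}

# Exported attribute of a managed resource.
output "web_private_ip" {
  value = aws_instance.web.private_ip
}
```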

Use resource addressing and resource parameters to connect resources together

Resource path syntax
- [module path][resource info]
Module path syntax
- module.<MODULE NAME>[optional module index]
Resource spec syntax
- resource_type.user_defined_name[optional index]
Types of named values:
- Resources <RESOURCE TYPE>.<NAME>
  - if count is used, ref is a list accessed with [N]
  - if for_each is used, ref is a map accessed with ["key"]
- Input Variables var.<NAME>
- Locals local.<NAME>
- Child module outputs module.<MODULE NAME>
  - same count and for_each rules as resources
- Data blocks data.<DATA TYPE>.<NAME>
  - same count and for_each rules as resources
- Filesystem/workspace info
  - path.module location of expression (don't use in write operations)
  - path.root root module location
  - terraform.workspace currently selected workspace
- Block 'local' values
  - count.index
  - each.key/each.value
  - self
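
The addressing rules are easier to see in a config. This is a sketch with hypothetical resource names, assuming the AWS provider is already configured.

```hcl
variable "ami_id" {
  type = string
}

# count: the reference becomes a list, indexed with [N].
resource "aws_instance" "worker" {
  count         = 2
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "worker-${count.index}"
  }
}

# for_each: the reference becomes a map, indexed with ["key"].
resource "aws_s3_bucket" "logs" {
  for_each = toset(["app", "audit"])
  bucket   = "example-${each.key}-logs" # placeholder bucket names
}

output "first_worker_id" {
  value = aws_instance.worker[0].id
}

output "audit_bucket_arn" {
  value = aws_s3_bucket.logs["audit"].arn
}

output "current_workspace" {
  value = terraform.workspace
}
```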

Use HCL and Terraform functions to write configuration

Terraform has a number of built-in functions for manipulating values, strings etc...
- Review them here
- Not necessary for the exam (I don't think?) but they will make your life easier when actually using HCL
Do the dynamic operations with functions and create dynamic expressions tutorials
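
To give a feel for the built-in functions, here is a small sketch using a few of them inside a locals block. The values are arbitrary, and you can paste the expressions into terraform console to experiment.

```hcl
locals {
  environment = "Dev"

  # lower() normalizes case; merge() combines maps.
  # Result: { Project = "demo", Env = "dev" }
  tags = merge({ Project = "demo" }, { Env = lower(local.environment) })

  # range() and cidrsubnet() carve subnets out of a larger block.
  # Result: ["10.0.0.0/24", "10.0.1.0/24"]
  subnets = [for i in range(2) : cidrsubnet("10.0.0.0/16", 8, i)]

  # join() builds a string from a list.
  # Result: "demo-dev-assets"
  bucket_name = join("-", ["demo", lower(local.environment), "assets"])
}
```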

Describe built-in dependency management (order of execution based)

Terraform generates a dependency graph for determining which resources need to be built 1st, 2nd, 3rd, etc...
- depends_on can be used to alter dependencies
- The lifecycle block along with create_before_destroy and prevent_destroy are additional tools in the lifecycle toolbelt
Items with no dependencies are built in parallel to speed up the provisioning process
- By default, up to 10 concurrent operations can be run at the same time
- This can be changed with the -parallelism flag on plan, apply, & destroy commands
You can see this dependency map using the terraform graph command and a viewer like Graphviz (or http://www.webgraphviz.com/ if you are lazy like me)
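
Here is a rough sketch of the dependency tools mentioned above. The resource names are hypothetical and the AWS provider is assumed to be configured.

```hcl
variable "ami_id" {
  type = string
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts-bucket" # placeholder name

  lifecycle {
    prevent_destroy = true # any plan that would destroy this bucket fails
  }
}

resource "aws_instance" "builder" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # No attribute of the bucket is referenced here, so Terraform cannot infer
  # the ordering on its own. depends_on makes it explicit.
  depends_on = [aws_s3_bucket.artifacts]

  lifecycle {
    create_before_destroy = true # build the replacement before removing the old one
  }
}
```

Running terraform apply -parallelism=5 would cap this run at 5 concurrent operations instead of the default 10, and terraform graph | dot -Tsvg > graph.svg renders the dependency graph if Graphviz is installed locally.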

Understand Terraform Cloud Capabilities

Domain 9 (the last domain!) is all about Terraform Cloud. This is the HashiCorp managed remote backend and offers a free tier (up to 500 managed resources when I wrote this article).

Every production-level environment will use a state-locking remote backend, so knowing how Terraform Cloud works is great not only for the exam, but for real world job experience as well.

Explain how Terraform Cloud helps to manage infra

Terraform Cloud - is a SaaS offering that:

Manages Terraform runs in a consistent & reliable environment
Includes easy access to shared state and secret data
Access controls for approving changes to infrastructure
A private registry for sharing Terraform modules
Detailed policy controls for governing the contents of Terraform configurations
Remote state storage
Version control integrations
Custom workspace permissions
Flexible workflows - CLI, UI, VCS, or the API
Collaboration - review/comment on plans prior to executing infra changes
Audit logs - who broke it

Terraform Enterprise is a self-hosted distribution of Terraform Cloud. It's not on the exam, but here are the docs for requirements, ref architectures, and install guides.

Describe how Terraform Cloud enables collaboration and governance

Terraform Cloud uses Teams as its grouping paradigm. Teams are comprised of Users in a given Organization. Each Team can have an API token that is not associated with a specific user.

The Organization grants workspace permissions to Users and Teams. The Owners Team:

Is the 1st team created
Cannot be deleted or left empty
Can create/delete other Teams
Manages Org-level permissions to other Teams
Can view the full list of teams (Visible and Secret)

Terraform Cloud enforces Policies on runs using the Sentinel Policy Language. After you define policies, you add them to policy sets that Terraform Cloud can then enforce.

Do the Enforce a policy with Sentinel tutorial.

Conclusion

That's it! 😂 I feel confident that if you review all of the material here, do the tutorials specified in the exam prep, and attend our Terraform Beginner Bootcamp, you will be well prepared to sit and pass the Terraform Associate Exam.

Good luck!

HashiCorp Terraform Associate Certification Study Course – Pass the Exam With This Free 7 Hour Course

freeCodeCamp — Thu, 17 Aug 2023 14:07:00 +0000

By Andrew Brown

Learn how to pass the HashiCorp Terraform Associate Certification (003) with this free 7-hour course.

What is the HashiCorp Terraform Associate?

HashiCorp is a company specializing in open-source tools for multi-cloud workloads. HashiCorp's most popular tool is Terraform, which allows DevOps Engineers to write code to provision infrastructure for multiple Cloud Service Providers (CSPs), for example AWS, Azure, and GCP.

The HashiCorp Terraform Associate is a certification that proves an engineer has a practical understanding of the Terraform tool as well as the SaaS offering known as Terraform Cloud.

For those seeking a career as a DevOps Engineer, the Terraform Associate is essential since Terraform has become the industry standard for Infrastructure as Code (IaC) and frequently appears as an expected skill in DevOps Job postings.

The Terraform Associate is not a difficult exam but strongly relies on practical knowledge of Terraform. That's why this study course is 7 hours – we've added many follow alongs and common edge cases that you will only experience in practice.

Overview of the Terraform Associate

The Terraform Associate is composed of the following domains:

Understand infrastructure as code (IaC) concepts
Understand Terraform's purpose (vs other IaC)
Understand Terraform basics
Use the Terraform CLI (outside of core workflow)
Interact with Terraform modules
Navigate Terraform workflow
Implement and maintain state
Read, generate, and modify configuration
Understand Terraform Cloud and Enterprise capabilities

While the Terraform certification has many more domains than other cloud certifications, they are finite in the expected requirements of each domain, and nearly everything you see in the exam guide appears on the exam.

Can I simply watch the videos and pass the exam?

Yes, HashiCorp Terraform Associate is easy enough to simply watch the videos and pass the exam.

However, since the exam is always based on the last three minor versions of Terraform, if you don't do the follow alongs on your machine you could miss out on newly emerging Terraform edge cases.

Where are the free practice exam and cheatsheets for this course?

There is a free practice exam of 57 questions for this course and downloadable cheatsheets on ExamPro. No Credit Card required, No Trial Limit.

Head on over to freeCodeCamp's YouTube channel to start working through the full 7 hour course.

How to Use Terraform to Deploy a Site on Google Cloud Platform

Beau Carnes — Thu, 13 Jul 2023 21:46:56 +0000

Modern cloud technologies have revolutionized the way we develop and deploy applications. But managing complex infrastructure can still be a daunting task, especially when working at scale. The solution? Infrastructure as Code (IaC) — an approach that brings programming paradigms to infrastructure management, thereby enhancing efficiency, repeatability, and agility.

With that in mind, we're excited to introduce a new course that combines IaC with the power of Terraform and Google Cloud Platform (GCP). The course is now live on the freeCodeCamp.org YouTube channel. This course is designed to equip you with the skills to efficiently deploy a website to the GCP using Terraform, an industry-leading IaC tool.

Rishab Kumar created this course. He is a Developer Evangelist at Twilio and is an excellent teacher.

Here's a snapshot of the course:

Introduction to Project: Get an overview of the project you'll be working on throughout the course — deploying a website to the GCP.
Setting Up Google Cloud Platform (GCP): Navigate through the GCP setup process, understanding key features of this powerful cloud service provider.
Installing Terraform and Setting Up the Directory: Dive into the Terraform setup, learn how to install it, and set up the directory for your project.
Writing Terraform Code: Master the art of writing effective Terraform code for managing infrastructure.
Deploying Google Storage Bucket to GCP: Implement your first deployment — a Google Storage Bucket — to the GCP using Terraform.
Adding Other Resources in Terraform: Expand your Terraform skills by learning to add more resources, taking your deployment capabilities to the next level.
Custom Domain Configuration: Find out how to configure a custom domain for your website, adding a professional touch.
Deploying Remaining Resources to GCP: Deploy the rest of the resources to the GCP, consolidating all the concepts you've learned so far.
Terraform Destroy and gitignore: Finally, understand how to safely de-provision infrastructure using 'Terraform Destroy' and learn about the role of '.gitignore' in a Terraform project.

This course is an excellent opportunity to tap into the transformative power of IaC. Whether you're a beginner or an experienced developer looking to expand your skillset, this course provides a hands-on, practical approach to mastering Terraform and GCP.

Watch the full course on the freeCodeCamp.org YouTube channel (1-hour watch).

What is Infrastructure as Code? Explained for Beginners

Daniel Adetunji — Thu, 15 Jun 2023 14:32:46 +0000

Infrastructure as Code (IaC) is a way of managing your infrastructure as if it were code. This gives you all the benefits of using code to create your infrastructure, like version control, faster and safer infrastructure deployments across different environments, and having up to date documentation of your infrastructure.

The article will cover how infrastructure as code works using an analogy. We'll cover the different infrastructure as code tools available as well as declarative vs imperative code.

I'll also introduce you to Terraform, which is an open source infrastructure as code tool you can use to create infrastructure across multiple cloud providers like AWS, GCP, Azure and others.

Infrastructure as Code in Practice

Imagine you are trying to create a three-tiered web application on AWS as you can see in the image below:

Three tiered web application example

The presentation tier is responsible for presenting the user interface to the user. It includes the user interface components such as HTML, CSS, and JavaScript running on EC2 instances.

The logic tier is responsible for processing user requests and generating responses, by communicating with the database layer to retrieve or store data. This is also deployed on EC2 instances.

The database tier is responsible for storing and managing the application's data and allows access to its data through the logic tier. The database runs on AWS RDS.

The instances in each tier sit in an autoscaling group with a load balancer in front of it (except for the database tier).

If you want to create this infrastructure through the AWS console, you would have to manually click through various screens to spin up the infrastructure. This is fine if it is a one time activity.

But if you need to repeat this across different environments like development and test, or need to add additional infrastructure like caches, queues, firewall rules, IAM or SSL certificates, then it becomes increasingly more complex to manage through the AWS console.

Managing complex infrastructure through the console also introduces the possibility of human error.

Infrastructure as code expresses your desired infrastructure in the language of code. This brings all the benefits of code to managing your infrastructure like:

Version Control – allows you to store the history of your infrastructure and revert to a previous version if needed.
Faster & safer deployments – can recreate infrastructure in new environments quickly and with fewer errors since every part of the infrastructure is clearly defined in the code.
Documentation – your current infrastructure state is documented and kept up to date automatically whenever you make a change. This keeps your infrastructure documentation detailed and accurate, compared to having the infrastructure written in a document or on a Confluence page that may not be updated whenever there is a change.

How Infrastructure as Code Works – Explained with an Analogy

Infrastructure as code allows you to create a detailed blueprint of your infrastructure. This blueprint gives instructions to your cloud provider about the infrastructure you want created.

This is similar to how an architecture blueprint works. It outlines the layout, dimensions, materials, and various components of the structure. The blueprint serves as a reference for architects and engineers to understand the desired construction.

how an architectural blueprint is analogous to infrastructure as code

The blueprint leaves little room for error. It will be interpreted in the same way by any architect or engineer. If you wanted to build exact copies of this house, all you need is the architecture blueprint.

Infrastructure as code, at a basic level, works in the same way as an architecture blueprint. It details the infrastructure you want to create as code in a number of different possible languages (JSON, YAML, HCL, Python, Ruby, JavaScript, and so on), instructing the cloud provider to create your infrastructure exactly as specified.

Declarative & Imperative Infrastructure as Code Tools

There are many IaC options to choose from, and all the major cloud providers have their own dedicated tools:

AWS has CloudFormation
GCP has Deployment Manager
Azure has Resource Manager

One limitation of these cloud provider-specific tools is that they can only create infrastructure in their respective clouds. So CloudFormation only works in AWS and Deployment Manager only works in GCP. IaC using these providers is usually written in JSON or YAML format.

Terraform, on the other hand, is open source and you can use it to create infrastructure across all the major cloud providers. It uses HCL (HashiCorp Configuration Language).

Infrastructure as code can also be written using popular languages like Python and JavaScript.

These scripting/programming languages lie on a spectrum of declarative and imperative code as shown below.

A spectrum of declarative & imperative languages and where Terraform HCL fits

The main difference between an imperative and declarative language is that imperative languages explicitly define the control flow. This is simply the order in which instructions are executed in a program. Control flow determines the path the program takes and how it responds to different conditions or events.

In imperative languages, control flow is explicitly defined using control structures such as loops, conditionals, and function calls. Imperative languages give you more flexibility in configuring your infrastructure. This is not necessarily a positive, as more flexibility means more opportunity to introduce errors into your infrastructure.

A declarative language focuses on describing the desired result without giving specific instructions on how to achieve it.

An illustration demonstrating the difference between declarative and imperative languages

An example JSON is shown below, used in AWS CloudFormation to create an EC2 instance (the logical ID MyEC2Instance and the resource IDs are illustrative placeholders):

{
  "Resources": {
    "MyEC2Instance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "ImageId": "ami-0123456789",
        "InstanceType": "t2.micro",
        "KeyName": "my-key-pair",
        "SecurityGroupIds": ["sg-0123456789"],
        "SubnetId": "subnet-0123456789",
        "Tags": [
          {
            "Key": "Name",
            "Value": "MyEC2Instance"
          }
        ]
      }
    }
  }
}

A declarative language like JSON abstracts away the underlying complexity that details how the EC2 instance will be created. All it cares about is the end state.

Terraform HCL is closer to the declarative end of the spectrum. Terraform allows you to describe the desired infrastructure's final state without specifying the exact steps to get there. Terraform internally manages the execution order, resource dependencies, and handles the infrastructure changes based on the desired configuration.

But Terraform does have support for some imperative features like variables and expressions, allowing dynamic behaviour based on inputs. So, it is not a completely declarative language like JSON.
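
For comparison, here is a sketch of roughly the same EC2 instance written in Terraform HCL. It reuses the placeholder IDs from the CloudFormation example, and the resource name my_ec2_instance is an arbitrary choice for illustration.

```hcl
# Rough HCL equivalent of the CloudFormation example (placeholder IDs).
resource "aws_instance" "my_ec2_instance" {
  ami                    = "ami-0123456789"
  instance_type          = "t2.micro"
  key_name               = "my-key-pair"
  vpc_security_group_ids = ["sg-0123456789"]
  subnet_id              = "subnet-0123456789"

  tags = {
    Name = "MyEC2Instance"
  }
}
```

As with the JSON version, this only describes the end state. Terraform works out which API calls to make and in what order.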

How Terraform Works

There are two fundamental concepts that serve as a foundation for understanding Terraform:

The configuration file – this describes the desired infrastructure
The state file – this describes the current infrastructure as it exists in the real world

Terraform’s job is to create, modify or delete infrastructure as needed so that the desired infrastructure configuration is met. It does this by executing the necessary API calls to your cloud provider(s) to create, modify, or destroy the resources as specified.

Once the infrastructure has been created/modified/destroyed to match the configuration file, the state file is updated to reflect the current infrastructure.

The terraform plan command creates an execution plan, which lets you preview the changes that Terraform plans to make to your infrastructure.

By default, when Terraform creates a plan, it compares the desired configuration as described in the configuration file, with the current configuration as described in the state file. Terraform then proposes a list of changes that will ensure that the current configuration matches the desired configuration.

If you then run the terraform apply command, terraform will modify the real world infrastructure so that it matches the desired configuration, and update the state file to show the new infrastructure configuration.

At a high level, this is what terraform does:

What happens when you run the terraform apply command

Let’s bring back the architectural blueprint analogy.

The configuration file is like the architectural blueprint. It details the infrastructure that needs to be built, that is the desired construction. The real world infrastructure is the existing construction in the physical world and the state file is a representation of what currently exists – the current blueprint. The engineers work to ensure that the existing construction matches the architecture blueprint.

In this analogy, engineers do the work of Terraform in ensuring that the existing construction matches the architecture blueprint. You don’t need to specify the details of how to build the house, you just need to specify what you want built and the engineers handle the rest.

An architectural analogy to running terraform apply

If you want to learn more about how Terraform works and how you can use it in your projects, you can check out this free course on freeCodeCamp's YouTube channel.

Bringing it Together

Infrastructure as code (IaC) is a great way of managing complex infrastructure configuration in the form of code. This naturally brings all the advantages of code to your infrastructure like version control, faster and safer infrastructure deployments across different environments and up to date documentation of your infrastructure.

Terraform is an open source IaC tool that allows you to work with multiple cloud providers to spin up infrastructure as defined in your configuration files.

Terraform HCL is a declarative language that allows you to describe your desired infrastructure configuration. All you have to do is specify what you want created and terraform handles the creation on your behalf by making API calls to your chosen cloud provider(s).

Learn Terraform by Deploying a Jenkins Server on AWS

Destiny Erhabor — Tue, 26 Jul 2022 18:03:50 +0000

Hello, everyone! Today we're going to learn about Terraform by building a project.

Terraform is more than just a tool to boost the productivity of operations teams. You have the chance to transform your developers into operators by implementing Terraform.

This can help increase the efficiency of your entire engineering team and improve communication between developers and operators.

In this article, I'll show you how to fully automate the deployment of your Jenkins services on the AWS cloud using Terraform with a custom baked image.

What is Terraform?
Why Should You Use Terraform?
How Terraform works
What is a Procedural Language vs a Declarative Language?
Prerequisites and Installation
File/Folder Structure of Our Project
How to First Initialize Terraform State
How to Provision an AWS Virtual Private Cloud
How to Work with Terraform Modules
How to Create a VPC Subnet
How to Setup VPC Route Tables
How to Create a Public Route Table
How to Create a Private Route Table
How to Setup a VPC Bastion Host
How to Provision our Compute Service
Jenkins master instance
How to Create the Load Balancer
Cleaning Up
Summary

What is Terraform?

Terraform by HashiCorp is an infrastructure as code solution. It lets you specify cloud and on-premise resources in human-readable configuration files that you can reuse and share. It is a powerful DevOps provisioning tool.

Why Should You Use Terraform?

Terraform has a number of use-cases, including the capacity to:

Specify infrastructure in config/code and easily rebuild/change and track changes to infrastructure.
Support different cloud platforms
Perform incremental resource modifications
Support software-defined networking

How Terraform works

Let's have a look at how Terraform works at a high level.

Terraform is developed in the Go programming language. The Go code is compiled into terraform, a single binary. You can use this binary to deploy infrastructure from your laptop, a build server, or just about any other computer, and you won't need to run any additional infrastructure to do so.

This is because the Terraform binary makes API calls on your behalf to one or more providers, which include Azure, AWS, Google Cloud, DigitalOcean, and others. This allows Terraform to take advantage of the infrastructure that those providers already have in place for their API servers, as well as the authentication processes they require.

But how does Terraform know which API requests to make? The answer is Terraform configurations: text files written in a declarative language that specify the infrastructure you want to create. These configurations are the "code" in "infrastructure as code".

You have complete control over your infrastructure, including servers, databases, load balancers, network topology, and more. The Terraform binary parses your code and translates it, on your behalf, into the series of API calls needed to build that infrastructure.

What is a Procedural Language vs a Declarative Language?

A procedural language allows you to specify the entire process and list the steps necessary to complete it. You merely give instructions and specify how the process will be carried out. Chef and Ansible encourage this method.

Declarative languages, on the other hand, allow you to simply describe the result you want and leave it up to the system to carry it out. You don't need to go into the process; you just need the result. Examples are Terraform, CloudFormation, and Puppet.

Enough of the theory...

Now it's time to put Terraform into action and build something that is highly available, secure, performant, and dependable.

Here, we're talking about a Terraform-based Jenkins server on Amazon Web Services. We are setting up the networking from the ground-up, so let's get started.

Prerequisites and Installation

There are a few things you'll need to have setup and installed to follow along with this tutorial:

File/Folder Structure of Our Project

We'll use a modular development strategy to separate our Jenkins cluster deployment into numerous template files (rather than developing one large template file).

Each file is in charge of defining a target infrastructure component or AWS resource.

For creating and enforcing infrastructure settings, Terraform leverages the syntax of a JSON-like configuration language called HCL (HashiCorp Configuration Language).

files/folders structures

How to First Initialize Terraform State

To follow best practices, we will be storing our Terraform state files in our cloud storage. This is essential especially for team collaboration.

A Terraform state file records the resources that Terraform manages for a project and maps them to the real infrastructure.

Inside the main.tf file in the backend-state folder, add the following code:

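# Input variables; the credentials are supplied from secrets.tfvars
variable "aws_region" {
    default = "us-east-1"
}
variable "aws_secret_key" {}
variable "aws_access_key" {}

# The AWS provider authenticates with the variables declared above
provider "aws" {
    region     = var.aws_region
    access_key = var.aws_access_key
    secret_key = var.aws_secret_key
}

# S3 bucket that will hold the remote Terraform state
resource "aws_s3_bucket" "terraform_state" {
    bucket = "terraform-state-caesar-tutorial-jenkins"

    # Refuse any plan that would destroy the bucket
    lifecycle {
        prevent_destroy = true
    }

    # Keep a history of state file versions.
    # Note: AWS provider v4 and later deprecate the inline versioning and
    # server_side_encryption_configuration blocks in favor of the
    # aws_s3_bucket_versioning and
    # aws_s3_bucket_server_side_encryption_configuration resources.
    versioning {
        enabled = true
    }

    # Encrypt objects at rest with SSE-S3 (AES-256)
    server_side_encryption_configuration {
        rule {
            apply_server_side_encryption_by_default {
                sse_algorithm = "AES256"
            }
        }
    }
}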

Let's make sure we know what's going on in the above code.

We use variables to store data, and in Terraform you declare a variable with the variable keyword followed by the name. The variable block can either take some properties such as default, description, type, and so on or none. You will be seeing this a lot.

Now we are declaring the variables as variable "variable_name"{} and using them in any resources/data block as var.variable_name. Later you'll see how we will be assigning values to those variables in our secrets.tfvars file.

To use Terraform, you need to tell it the provider it will be communicating with and pass in its required properties for authentication. Here we have the AWS region, access, and secret key (you should have these downloaded on your system from the prerequisites).

In Terraform, each resource we need is defined in a resource block. Resources are the underlying infrastructure components that make up our cloud service. A resource block follows the syntax resource "terraform-resource-name" "custom-name" {}.

Terraform has a lot of resources for particular providers in the terraform docs (always refer to the docs if you have questions).

Next, we are creating the aws_s3_bucket. This will store our remote state. It takes the following properties:

bucket → This has to be globally unique
lifecycle → If you need to destroy your Terraform resources, you might want to prevent destroying the state as it is shared across teams
versioning → Helps provide some version control over the states
server_side_encryption_configuration → Provides encryption.
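
If you are on version 4 or later of the AWS provider, the inline versioning and server_side_encryption_configuration blocks are deprecated. The sketch below shows one way the same bucket could be expressed with the standalone resources instead. It is an alternative layout under that assumption, not the code this tutorial was written with.

```hcl
# Same state bucket, expressed with the standalone resources
# introduced in AWS provider v4.
resource "aws_s3_bucket" "terraform_state" {
    bucket = "terraform-state-caesar-tutorial-jenkins"

    lifecycle {
        prevent_destroy = true
    }
}

# Versioning is configured in its own resource
resource "aws_s3_bucket_versioning" "terraform_state" {
    bucket = aws_s3_bucket.terraform_state.id

    versioning_configuration {
        status = "Enabled"
    }
}

# Default encryption is configured in its own resource
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
    bucket = aws_s3_bucket.terraform_state.id

    rule {
        apply_server_side_encryption_by_default {
            sse_algorithm = "AES256"
        }
    }
}
```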

Our state backend is ready. But before we initialize, plan, and apply it with Terraform, let’s assign values to our variables.

In secrets.tfvars, add the following info from your AWS account:

  aws_region     = "us-east-1"
  aws_secret_key = "enter-your-secret"
  aws_access_key = "enter-your-access"

In your terminal in the same backend-state folder, run terraform init.

terraform state on terminal

Then terraform apply -var-file=secrets.tfvars:

terraform state on terminal

In your AWS console, here's what you'll see:

terraform state on aws s3 bucket

Now that our state is ready, let’s move to the next part.

How to Provision an AWS Virtual Private Cloud

To secure our Jenkins cluster, we will deploy the architecture within a virtual private cloud (VPC) and private subnet. You could deploy the cluster in the AWS default VPC, but to have complete control over the network topology, we will create a VPC from scratch.

variable "cidr_block" {}
variable "aws_access_key" {}
variable "aws_secret_key" {}
variable "aws_region" {}

provider "aws" {
    region     = var.aws_region
    access_key = var.aws_access_key
    secret_key = var.aws_secret_key
}

# Store this configuration's state in the bucket created earlier
terraform {
    backend "s3" {
        bucket  = "terraform-state-caesar-tutorial-jenkins"
        key     = "tutorial-jenkins/development/network/terraform.tfstate"
        region  = "us-east-1"
        encrypt = true
    }
}

# The VPC that will contain the Jenkins infrastructure
resource "aws_vpc" "main_vpc" {
    cidr_block           = var.cidr_block
    enable_dns_support   = true
    enable_dns_hostnames = true

    tags = {
        Name = "jenkins-instance-main_vpc"
    }
}

# Export the VPC ID and CIDR range so other configurations can use them
output "vpc_id" {
    value = aws_vpc.main_vpc.id
}

output "vpc_cidr_block" {
    value = aws_vpc.main_vpc.cidr_block
}

cidr_block            = "172.0.0.0/16" 
aws_region = "us-east-1" 
aws_secret_key = "enter-your-secret" 
aws_access_key = "enter-your-access"

cidr_block → CIDR stands for Classless Inter-Domain Routing. Put simply, a CIDR block is a range of IP addresses. This one defines the address range our VPC works in.
output → An output block exports a value from a module so that the calling module, or another configuration that reads this state, can use it. This matters when you need to pass data about a resource in one module to a resource in another module. (You will learn what modules are soon.) Here's its syntax: output "custom_output_name" { value = resource_type.resource_name.attribute }. The value argument takes the expression you want to expose. Here we output vpc_id and vpc_cidr_block.
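One way a separate configuration could consume these outputs is the `terraform_remote_state` data source, which reads the outputs stored in another state file. This is a sketch rather than part of the tutorial's code: the data source label `network` and the example subnet are hypothetical, while the bucket, key, and region are copied from the backend block above.

```hcl
# Read the outputs written by the network configuration's state.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "terraform-state-caesar-tutorial-jenkins"
    key    = "tutorial-jenkins/development/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Hypothetical consumer living in a different configuration.
resource "aws_subnet" "example" {
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id

  # Carve a /24 out of the VPC range exported by the other state.
  cidr_block = cidrsubnet(data.terraform_remote_state.network.outputs.vpc_cidr_block, 8, 10)
}
```

Only values declared as outputs in the other configuration are visible this way, which is why the vpc_id and vpc_cidr_block outputs matter.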

Now, in the terminal, run terraform init and then terraform apply -var-file=secrets.tfvars to create the resources. You can run terraform plan first to preview what will be created.

You should see your vpc_id and vpc_cidr_block in your AWS Console:

vpc on aws

How to Work with Terraform Modules

A Terraform module is a set of configuration files kept together in one directory. A module groups the resources that serve a single purpose, which cuts down on the code you need to write when you create the same infrastructure components more than once.

Using the syntax below, you can call a module from another configuration and use the resources it defines.

module "custom-module-name" { 
    source     = "path-to-modules-resources" 
}

And to use a module's output in another resource or module, reference it with this expression: module.custom-module-name.resource-output-value.
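To make the two halves of that syntax concrete, here is a minimal sketch. The directory `./modules/network`, the module label `network`, and the `10.0.0.0/16` range are hypothetical names chosen for illustration. They are not the layout this tutorial uses.

```hcl
# --- modules/network/main.tf (hypothetical module directory) ---
variable "cidr_block" {
  type = string
}

resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
}

# Anything the caller should be able to read must be declared as an output.
output "vpc_id" {
  value = aws_vpc.this.id
}

# --- main.tf in the root configuration ---
module "network" {
  source     = "./modules/network"
  cidr_block = "10.0.0.0/16"
}

# module.<module label>.<output name> reads the value exported above.
output "network_vpc_id" {
  value = module.network.vpc_id
}
```

After adding or changing a module block, run terraform init again so Terraform can install the module before you plan or apply.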

How to Create a VPC Subnet

Creating a VPC isn't enough – we also need a subnet to be able to install Jenkins instances on this isolated network. We must pass the VPC ID we output before, since this subnet belongs to a previously constructed VPC.

For resilience, we'll use two public subnets and two private subnets in distinct availability zones. Each subnet has its own CIDR block, which is a subset of the VPC CIDR block, which we got from the VPC resource.

resource "aws_subnet" "public_subnets" {
    vpc_id                  = var.vpc_id
    cidr_block              = cidrsubnet(var.vpc_cidr_block, 8, 2 + count.index)
    availability_zone       = element(var.availability_zones, count.index)
    map_public_ip_on_launch = true
    count                   = var.public_subnets_count

    tags = {
        Name = "jenkins-instance-public-subnet"
    }
}

resource "aws_subnet" "private_subnets" {
    vpc_id                  = var.vpc_id
    cidr_block              = cidrsubnet(var.vpc_cidr_block, 8, count.index)
    availability_zone       = element(var.availability_zones, count.index)
    map_public_ip_on_launch = false
    count                   = var.private_subnets_count

    tags = {
        Name = "jenkins-instance-private-subnet"
    }
}

Alright, what's going on in this code?

count → The count meta-argument accepts a whole number and creates that many instances of the resource or module. Here we set both private_subnets_count and public_subnets_count to 2.
map_public_ip_on_launch → Set this to true so that instances launched into the subnet are assigned a public IP address.
cidrsubnet() → Calculates a subnet address within a given IP network prefix. Here it adds 8 bits to the VPC's /16 prefix, which gives /24 subnets, and the last argument selects which /24 to use.
element() → Retrieves a single element from a list by index. If the index is larger than the list, it wraps around to the start.
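To see what those two functions evaluate to, here is a small standalone sketch you can apply on its own (it creates no AWS resources) or paste piece by piece into terraform console. The VPC range is the one from this tutorial's secrets.tfvars. The three-zone list and the local names are assumptions for the example.

```hcl
locals {
  vpc_cidr_block     = "172.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Same expressions as the subnet resources, with count.index replaced by i.
  private_cidrs = [for i in range(2) : cidrsubnet(local.vpc_cidr_block, 8, i)]
  public_cidrs  = [for i in range(2) : cidrsubnet(local.vpc_cidr_block, 8, 2 + i)]

  # Index 3 is past the end of a 3-item list, so element() wraps to index 0.
  wrapped_zone = element(local.availability_zones, 3)
}

output "private_cidrs" { value = local.private_cidrs } # ["172.0.0.0/24", "172.0.1.0/24"]
output "public_cidrs" { value = local.public_cidrs }   # ["172.0.2.0/24", "172.0.3.0/24"]
output "wrapped_zone" { value = local.wrapped_zone }   # "us-east-1a"
```

The `2 +` offset on the public subnets is what keeps their ranges from colliding with the two private ones.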

Now let’s update our module’s variables:

variable "vpc_id" {} 
variable "vpc_cidr_block" {} 
variable "private_subnets_count" {} 
variable "public_subnets_count" {} 
variable "availability_zones" {}

Update the secrets.tfvars like this:

private_subnets_count = 2 
public_subnets_count  = 2

Subnets need route tables that define how their traffic is routed. Let’s create private and public route tables before we execute terraform apply on our resources.

How to Set Up VPC Route Tables

We will create private and public route tables for fine-grained traffic management. This lets instances deployed in private subnets reach the internet without being exposed to it.

How to create a public route table

First we need to create an internet gateway resource and link it to the VPC we created previously. Then we need to define a public route table with a route that points all traffic (0.0.0.0/0) to the internet gateway. Lastly, we need to create a route table association that links the table to the public subnets in our VPC, so that traffic from those subnets is routed to the internet gateway.

/**
 * Internet Gateway - Provides a connection between the VPC and the public
 * internet, allowing traffic to flow in and out of the VPC and translating IP
 * addresses to public addresses.
 */
resource "aws_internet_gateway" "igw" {
    vpc_id = var.vpc_id

    tags = {
        Name = "igw_jenkins"
    }
}

/**
 * A route from the public route table out to the internet through the internet
 * gateway.
 */
resource "aws_route_table" "public_rt" {
    vpc_id = var.vpc_id

    route {
        cidr_block = "0.0.0.0/0"
        gateway_id = aws_internet_gateway.igw.id
    }

    tags = {
        Name = "public_rt_jenkins"
    }
}

/**
 * Associate the public route table with the public subnets.
 */
resource "aws_route_table_association" "public" {
    count          = var.public_subnets_count
    subnet_id      = element(aws_subnet.public_subnets.*.id, count.index)
    route_table_id = aws_route_table.public_rt.id
}

How to create a private route table

Now that our public route table is finished, let’s create the private route table.

Our Jenkins instances are deployed in the private subnets, so to let them connect to the internet, we will create a NAT gateway resource inside a public subnet.

Next, we allocate an Elastic IP address for the NAT gateway and define a private route table with a route (0.0.0.0/0) that directs all traffic to the NAT gateway. Then we attach the private subnets to the private route table by creating a route table association.

/**
 * An elastic IP address to be used by the NAT Gateway defined below. The NAT
 * gateway acts as a gateway between our private subnets and the public
 * internet, providing access out to the internet from within those subnets,
 * while denying access to them from the public internet. This IP address
 * acts as the IP address from which all the outbound traffic from the private
 * subnets will originate.
 */
resource "aws_eip" "eip_for_the_nat_gateway" {
    vpc = true

    tags = {
        Name = "jenkins-tutorial-eip_for_the_nat_gateway"
    }
}

/**
 * A NAT Gateway that lives in our public subnet and provides an interface
 * between our private subnets and the public internet. It allows traffic to
 * exit our private subnets, but prevents traffic from entering them.
 */
resource "aws_nat_gateway" "nat_gateway" {
    allocation_id = aws_eip.eip_for_the_nat_gateway.id
    subnet_id     = element(aws_subnet.public_subnets.*.id, 0)

    tags = {
        Name = "jenkins-tutorial-nat_gateway"
    }
}

/**
 * A route from the private route table out to the internet through the NAT
 * Gateway.
 */
resource "aws_route_table" "private_rt" {
    vpc_id = var.vpc_id

    route {
        cidr_block     = "0.0.0.0/0"
        nat_gateway_id = aws_nat_gateway.nat_gateway.id
    }

    tags = {
        Name = "private_rt_jenkins"
    }
}

/**
 * Associate the private route table with the private subnets.
 */
resource "aws_route_table_association" "private" {
    count          = var.private_subnets_count
    subnet_id      = element(aws_subnet.private_subnets.*.id, count.index)
    route_table_id = aws_route_table.private_rt.id
}

Now let's run terraform apply. But first we need to update the module's variables, our main.tf file (our entry Terraform file) so that it knows about the subnet module and its variables, and secrets.tfvars (for the variable values).

variable "vpc_id" {} 
variable "vpc_cidr_block" {} 
variable "private_subnets_count" {} 
variable "public_subnets_count" {} 
variable "availability_zones" {} 

variable "private_subnets_count" {} 
variable "public_subnets_count" {} 
variable "availability_zones" {} 

module "subnet_module" { 
    source     = "./modules" 
    vpc_id     = aws_vpc.main_vpc.id 
    vpc_cidr_block = aws_vpc.main_vpc.cidr_block 
    availability_zones = var.availability_zones 
    public_subnets_count = var.public_subnets_count 
    private_subnets_count = var.private_subnets_count 
 }

 availability_zones    = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1e"]

Our subnets and their route tables are ready. Now we can initialize, plan, and apply them with Terraform. Because we added a module, run terraform init again first.

Then, in the terminal, run terraform apply -var-file=secrets.tfvars to create the resources. You can run terraform plan first to see what resources you are actually creating.

Just keep in mind that the number of resources added here might differ from yours.

Here's the AWS Console (subnets, elastic address, route_tables):

subnets

elastic ip

route tables
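If you would rather check the results from the terminal than from the console, one option is to expose the new resources as outputs. The sketch below is an assumption about how that could look: the output names are made up for illustration, while the resource addresses and the subnet_module label come from the code above.

```hcl
# --- Inside the module (next to the subnet and NAT gateway resources) ---
output "public_subnet_ids" {
  value = aws_subnet.public_subnets[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private_subnets[*].id
}

output "nat_gateway_public_ip" {
  value = aws_eip.eip_for_the_nat_gateway.public_ip
}

# --- In the root configuration, re-export what you want printed ---
output "private_subnet_ids" {
  value = module.subnet_module.private_subnet_ids
}

output "nat_gateway_public_ip" {
  value = module.subnet_module.nat_gateway_public_ip
}
```

Module outputs are not printed on their own. Only outputs declared in the root configuration show up after terraform apply or in terraform output.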

How to Set Up a VPC Bastion Host

Our Jenkins cluster lives inside the private subnets. Because its instances have no public IP address, they can't be reached directly from the internet. To take care of this, we'll set up a bastion host so that we can access the Jenkins instances safely.

Add the following resources and security group in the bastion.tf file:

/**
 * A security group to allow SSH access into our bastion instance.
 */
resource "aws_security_group" "bastion" {
    name   = "bastion-security-group"
    vpc_id = var.vpc_id

    ingress {
        protocol    = "tcp"
        from_port   = 22
        to_port     = 22
        cidr_blocks = ["0.0.0.0/0"]
    }

    egress {
        protocol    = "-1"
        from_port   = 0
        to_port     = 0
        cidr_blocks = ["0.0.0.0/0"]
    }

    tags = {
        Name = "aws_security_group.bastion_jenkins"
    }
}

/**
 * The public key for the key pair we'll use to ssh into our bastion instance.
 */
resource "aws_key_pair" "bastion" {
    key_name   = "bastion-key-jenkins"
    public_key = var.public_key
}

/**
 * This parameter contains the AMI ID for the most recent Amazon Linux 2 AMI,
 * managed by AWS.
 */
data "aws_ssm_parameter" "linux2_ami" {
    name = "/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2"
}

/**
 * Launch a bastion instance we can use to gain access to the private subnets of
 * this availability zone.
 */
resource "aws_instance" "bastion" {
    ami                         = data.aws_ssm_parameter.linux2_ami.value
    key_name                    = aws_key_pair.bastion.key_name
    instance_type               = "t2.large"
    associate_public_ip_address = true
    subnet_id                   = element(aws_subnet.public_subnets, 0).id
    vpc_security_group_ids      = [aws_security_group.bastion.id]

    tags = {
        Name = "jenkins-bastion"
    }
}

output "bastion" { value = aws_instance.bastion.public_ip }

Let's see what's going on in the code here:

bastion security group resource – Newly launched EC2 instances do not allow SSH access by default. To enable SSH access to the bastion host, we attach a security group to the instance. The security group permits inbound (ingress) traffic on port 22 (SSH) from anywhere (0.0.0.0/0). To improve security, you can replace that CIDR source block with your own public IP address (as a /32) or your network's address range.
aws_key_pair – To connect to the bastion host over SSH with our private key, we attach an SSH key pair to the EC2 instance when we create it. The key pair uses our public SSH key. You can also generate a new key pair with the ssh-keygen command.
aws_ssm_parameter – The EC2 instance uses the Amazon Linux 2 machine image. The AMI ID is read from a public AWS Systems Manager parameter using the aws_ssm_parameter data source.
aws_instance – Finally, we deploy our EC2 bastion instance with the configuration defined above.
output – By defining an output, we have Terraform show the bastion's public IP address in the terminal session.
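The suggestion to narrow the SSH source range could be sketched as follows. The variable name `allowed_ssh_cidr` is hypothetical, and this version of the security group would replace the one shown earlier rather than sit beside it.

```hcl
# Hypothetical variable: set it in secrets.tfvars to your own address,
# for example "<your-public-ip>/32".
variable "allowed_ssh_cidr" {
  type        = string
  description = "CIDR range allowed to SSH into the bastion host"

  validation {
    # cidrhost() fails on malformed input, so can() turns that into false.
    condition     = can(cidrhost(var.allowed_ssh_cidr, 0))
    error_message = "The allowed_ssh_cidr value must be a valid CIDR block."
  }
}

resource "aws_security_group" "bastion" {
  name   = "bastion-security-group"
  vpc_id = var.vpc_id

  # SSH is accepted only from the configured range, not from 0.0.0.0/0.
  ingress {
    protocol    = "tcp"
    from_port   = 22
    to_port     = 22
    cidr_blocks = [var.allowed_ssh_cidr]
  }

  egress {
    protocol    = "-1"
    from_port   = 0
    to_port     = 0
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Because the variable lives in the module, the root configuration would also need to declare it and pass it through the module block, the same way public_key is passed.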

Now, let’s add the new public_key variable to the module's variables, to main.tf, and to secrets.tfvars:

variable "public_key" {}

variable "public_key" {}
module "subnet_module" {
    source     = "./modules"
    ...
    public_key = var.public_key
}

public_key = "enter-your-public-key"

Now we can create the resources. You can run terraform plan first to see what resources you are actually creating.

In the terminal, run terraform apply -var-file=secrets.tfvars:

terminal resources

Here's the output in the AWS console:

aws-console instances

How to Provision our Compute Service

Jenkins master instance

So far, we have set up our VPC and networking topology. Finally, we will create our Jenkins EC2 instance, which will use a Jenkins master AMI baked by Packer.

You can check out my previous article on how it was baked: Learn Infrastructure as Code by Building a Custom Machine Image in AWS on freecodecamp.org. Regardless, you can use any of your custom images if you have one.

/**
 * The ID of our baked AMI: the most recent image owned by our own AWS account.
 */
data "aws_ami" "jenkins-master" {
    most_recent = true
    owners      = ["self"]
}

resource "aws_security_group" "jenkins_master_sg" {
    name        = "jenkins_master_sg"
    description = "Allow traffic on port 8080 and enable SSH"
    vpc_id      = var.vpc_id

    # SSH, only from the bastion host
    ingress {
        from_port       = 22
        to_port         = 22
        protocol        = "tcp"
        security_groups = [aws_security_group.bastion.id]
    }

    # Jenkins dashboard, from the load balancer
    ingress {
        from_port       = 8080
        to_port         = 8080
        protocol        = "tcp"
        security_groups = [aws_security_group.lb.id]
    }

    # Jenkins dashboard, from any address
    ingress {
        from_port   = 8080
        to_port     = 8080
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
    }

    egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
    }

    tags = {
        Name = "jenkins_master_sg"
    }
}

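Attaching this security group to the instance allows SSH only from the bastion server, and allows inbound traffic on port 8080 (the Jenkins web dashboard) from the load balancer's security group. The last ingress rule also opens port 8080 to any address (0.0.0.0/0). Because the instance sits in a private subnet with no public IP, that traffic can only originate inside the VPC, and you can tighten the rule to the VPC CIDR block (var.vpc_cidr_block) if you prefer.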

resource "aws_key_pair" "jenkins" {
    key_name   = "key-jenkins"
    public_key = var.public_key
}

resource "aws_instance" "jenkins_master" {
    ami                    = data.aws_ami.jenkins-master.id
    instance_type          = "t2.large"
    key_name               = aws_key_pair.jenkins.key_name
    vpc_security_group_ids = [aws_security_group.jenkins_master_sg.id]
    subnet_id              = element(aws_subnet.private_subnets, 0).id

    root_block_device {
        volume_type           = "gp3"
        volume_size           = 30
        delete_on_termination = false
    }

    tags = {
        Name = "jenkins_master"
    }
}

The instance_type argument defines the size of the EC2 instance we deploy. We won't be allocating executors or workers on the master, so for simplicity t2.large (2 vCPUs and 8 GB of memory) should be adequate.

That way, build jobs won't overload the Jenkins master. But Jenkins' memory requirements vary with your project's build requirements and the tools those builds use. Each build node connection takes two to three threads, or at least 2 MB of memory.

Just a note: consider adding Jenkins workers to avoid overworking the master. With the builds offloaded, a general-purpose instance can host the Jenkins master and offers a balance between compute and memory resources. To keep this article simple, we won't do that.
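If you expect to resize the master later, one option is to lift the instance type into a variable instead of hardcoding it. This is a sketch under that assumption: the variable name `jenkins_master_instance_type` is hypothetical, and the resource below would replace the hardcoded version shown earlier.

```hcl
# Hypothetical variable; the default keeps the size used in this tutorial.
variable "jenkins_master_instance_type" {
  type        = string
  description = "EC2 instance type for the Jenkins master"
  default     = "t2.large"
}

resource "aws_instance" "jenkins_master" {
  ami                    = data.aws_ami.jenkins-master.id
  instance_type          = var.jenkins_master_instance_type
  key_name               = aws_key_pair.jenkins.key_name
  vpc_security_group_ids = [aws_security_group.jenkins_master_sg.id]
  subnet_id              = element(aws_subnet.private_subnets, 0).id

  root_block_device {
    volume_type           = "gp3"
    volume_size           = 30
    delete_on_termination = false
  }

  tags = {
    Name = "jenkins_master"
  }
}
```

Because the variable has a default, nothing else has to change until you want a different size.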

How to Create the Load Balancer

To access the Jenkins dashboard, we will create a public load balancer in front of the EC2 instance.

This Elastic Load Balancer will accept HTTP traffic on port 80 and forward it to the EC2 instance on port 8080. It also automatically checks the health of the registered EC2 instance on port 8080. If Elastic Load Balancing (ELB) finds the instance unhealthy, it stops sending traffic to the Jenkins instance.

/**
 * A security group to allow HTTP access into our load balancer.
 */
resource "aws_security_group" "lb" {
    name   = "ecs-alb-security-group"
    vpc_id = var.vpc_id

    ingress {
        protocol    = "tcp"
        from_port   = 80
        to_port     = 80
        cidr_blocks = ["0.0.0.0/0"]
    }

    egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
    }

    tags = {
        Name = "jenkins-lb-sg"
    }
}

/**
 * Load balancer placed in front of the Jenkins master instance. It listens on
 * port 80 and forwards traffic to the instance on port 8080.
 */
resource "aws_elb" "jenkins_elb" {
    subnets                   = [for subnet in aws_subnet.public_subnets : subnet.id]
    cross_zone_load_balancing = true
    security_groups           = [aws_security_group.lb.id]
    instances                 = [aws_instance.jenkins_master.id]

    listener {
        instance_port     = 8080
        instance_protocol = "http"
        lb_port           = 80
        lb_protocol       = "http"
    }

    health_check {
        healthy_threshold   = 2
        unhealthy_threshold = 2
        timeout             = 3
        target              = "TCP:8080"
        interval            = 5
    }

    tags = {
        Name = "jenkins_elb"
    }
}

output "load-balancer-ip" {
    value = aws_elb.jenkins_elb.dns_name
}

Before we run terraform apply, let’s update our development/output.tf file to output the load balancer DNS name:

 output "load-balancer-ip" { 
     value = module.subnet_module.load-balancer-ip
 }

In the terminal, run the following command: terraform apply -var-file="secrets.tfvars". It will give you this:

load balancer output

After you apply the changes with Terraform, the Jenkins master load balancer URL should be displayed in your terminal session.

Point your favorite browser to the URL, and you should have access to the Jenkins web dashboard.

jenkins-instances

Then just follow the on-screen instructions to unlock Jenkins.

unlock jenkins

You can find the full code at this GitHub repo.

Cleaning Up

To avoid the unnecessary cost of running AWS services, run the following command to destroy all created and running resources: terraform destroy -var-file="secrets.tfvars". It should give this output:

destroy resources

How interesting, right? With just a few lines of code, we can spin up and destroy our resources.

Summary

In this tutorial, you have learned how to use Terraform at a high level. You've also learned one of its applications by provisioning a Jenkins server on the AWS cloud platform.

You have also learned about best practices of Terraform backend states and modules.

To learn more about Terraform and its many use-cases, you can check out the official Terraform docs here.

Happy Learning!

Learn Terraform and Azure by Building a Dev Environment

Beau Carnes — Wed, 29 Jun 2022 14:35:30 +0000

Terraform is an open source infrastructure as code tool that makes it easy to programmatically set up infrastructure on a variety of cloud service providers.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to use Terraform and Azure to set up a development environment.

We previously posted a similar course with AWS instead of Azure.

Derek Morgan created this course. He has created a bunch of courses to prepare students for a variety of technical certifications.

This course is designed to help you learn the basics of Terraform fast with a hands-on project that can serve as a framework for your own projects!

This project will guide you through Terraform basics as you utilize Visual Studio Code (on Windows, Mac, or Linux!) to deploy Azure resources and an Azure VM that you can SSH into to have your own redeployable environment that will be perfect for your own future projects!

Here are the sections covered in this course:

Welcome and Setup
Terraform Provider Init
A Resource Group
A Virtual Network and Referencing other Resources
Terraform State
Terraform Destroy
A Subnet
A Security Group
Security Group Associations
A Public IP
A Network Interface
A Key Pair
Custom Data
SSH Config Scripts
The Provisioner
Data Sources
Outputs
Variables
Variable Precedence
Conditionals

Watch the full course below or on the freeCodeCamp.org YouTube channel (2-hour watch).
