<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Cloud Computing - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Cloud Computing - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 28 Jun 2026 09:50:32 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/cloud-computing/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How Enterprise Teams Manage Infrastructure at Scale with Terraform ]]>
                </title>
                <description>
                    <![CDATA[ Tutorials teach you how to write Terraform, but don't teach you what happens when 60 engineers start writing it together. When you learn Terraform, you work with a single repository, state file, and a ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-enterprise-teams-manage-infrastructure-at-scale-with-terraform/</link>
                <guid isPermaLink="false">6a3aaccb0aca21a37c59db4a</guid>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Infrastructure as code ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jun 2026 15:56:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/fb89a3e9-6826-4fc9-bebb-d16ef6a6d31d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Tutorials teach you how to write Terraform, but don't teach you what happens when 60 engineers start writing it together.</p>
<p>When you learn Terraform, you work with a single repository, a single state file, and a single environment. You run <code>terraform apply</code> from your laptop, and your infrastructure is provisioned.</p>
<p>That model works fine until the day you join a company and realize engineers rarely apply to production from a laptop.<br>A lot of what you see will not match what you practiced.</p>
<p>This article explains how large engineering teams actually run Terraform: the repositories, workflows, and ownership rules they rely on, and what goes wrong without them.</p>
<p>You'll learn how enterprise teams structure repositories and state files, how they store and version reusable modules through GitHub, why infrastructure changes move to production through pipelines, how they catch changes that happen outside of Terraform, and how they recover when things go wrong.</p>
<p>Every practice here exists because a team hit a specific wall and built something to get past it.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You should be comfortable with Terraform before reading this.<br>You should also know how Git pull requests and branch merging work.</p>
<p>This is not a Terraform introduction. It is about what happens after you have learned the basics and start sharing infrastructure with other engineers.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-state-corruption-happens">How State Corruption Happens</a></p>
</li>
<li><p><a href="#heading-why-state-file-gets-treated-like-a-production-database">Why the State File Gets Treated Like a Production Database</a></p>
</li>
<li><p><a href="#heading-how-enterprise-teams-structure-their-terraform-repositories">How Enterprise Teams Structure Their Terraform Repositories</a></p>
</li>
<li><p><a href="#heading-how-teams-split-state-files-to-protect-each-other">How Teams Split State Files to Protect Each Other</a></p>
</li>
<li><p><a href="#heading-why-some-teams-prefer-directories-over-workspaces-for-production">Why Some Teams Prefer Directories Over Workspaces for Production</a></p>
</li>
<li><p><a href="#heading-how-teams-share-infrastructure-through-modules-on-github">How Teams Share Infrastructure Through Modules on GitHub</a></p>
</li>
<li><p><a href="#heading-how-teams-version-and-release-terraform-modules">How Teams Version and Release Terraform Modules</a></p>
</li>
<li><p><a href="#heading-how-teams-maintain-terraform-modules-at-scale">How Teams Maintain Terraform Modules at Scale</a></p>
</li>
<li><p><a href="#heading-how-teams-share-data-between-state-files">How Teams Share Data Between State Files</a></p>
</li>
<li><p><a href="#heading-how-infrastructure-changes-actually-move-to-production">How Infrastructure Changes Actually Move to Production</a></p>
</li>
<li><p><a href="#heading-how-teams-detect-infrastructure-drift">How Teams Detect Infrastructure Drift</a></p>
</li>
<li><p><a href="#heading-how-teams-recover-when-state-goes-wrong">How Teams Recover When State Goes Wrong</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-how-state-corruption-happens">How State Corruption Happens</h2>
<p>The state file is how Terraform tracks what it has built. It remembers every resource, every ID, and every configuration value. When it gets out of sync with what actually exists in the cloud, that's state corruption.</p>
<p>It gets blamed for a lot of things. But engineers who have dealt with it in production know it usually traces back to one of a handful of situations, each with a different cause and a different fix.</p>
<h3 id="heading-two-engineers-run-terraform-apply-at-the-same-time">Two Engineers Run <code>terraform apply</code> at the Same Time</h3>
<p>Before understanding this one, you need to understand something about how Terraform works.</p>
<p>When you run <code>terraform apply</code>, two things happen separately:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/1fdf9458-9c60-4b65-8bd7-2126b8d47065.png" alt="When you run terraform apply, two things happen separately. Step 1: Terraform tells AWS to create the subnet and AWS creates it in the cloud. Step 2: Terraform updates the state file to record that the subnet now exists. AWS holds the real infrastructure. The state file is Terraform's notebook about it. They are separate and can get out of sync." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>First, Terraform talks to AWS, and the resource gets created in the cloud. Second, Terraform updates the state file to record what was just built.</p>
<p>These are two different systems. AWS holds the real infrastructure, and the state file is Terraform's notebook about it. If anything interrupts the process between step one and step two, they fall out of sync.</p>
<p>Now here's what happens when two engineers apply at the same time without locking:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/724c562a-ef35-42e5-a9f2-4683f8acef31.png" alt="Diagram showing Sarah and Marcus both opening the same Terraform state file at the same time. Sarah reads the state, adds a subnet, and saves. Marcus reads the same original state, updates the NAT gateway, and saves last. His save overwrites Sarah's. The final state file contains the NAT gateway update but the subnet record is gone, even though the subnet still exists in AWS. Caption: Two people. Same state file. Different changes. Last write wins." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>Sarah opens the state file and starts adding a subnet. Marcus opens the same state file at the same moment and starts updating a NAT gateway. Both are working from the same starting copy.</p>
<p>Sarah finishes first. Her apply creates the subnet in AWS and updates the state file to record it.</p>
<p>Marcus finishes second. His apply updates the NAT gateway in AWS. Terraform then updates the state file using the version of state Marcus read when his apply started.</p>
<p>That version didn't include Sarah's subnet, so the updated state no longer contains a record of it.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/07d237af-96dd-450c-8284-7b2be89b2a41.png" alt="comparison showing AWS contains both the subnet and NAT gateway update, while Terraform's state file is missing the subnet record" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The subnet exists in AWS. But Terraform's notebook no longer has a record of it. The next <code>terraform plan</code> thinks the subnet was never created and proposes building it again.</p>
<p>State locking prevents this. Sarah's apply acquires a lock before it starts. When Marcus tries to apply, Terraform makes him wait.</p>
<p>After Sarah finishes, Terraform updates the state file and releases the lock. Marcus then runs against the updated state, so both the subnet and NAT gateway changes are recorded correctly.</p>
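<p>Locking is configured on the backend, not on the individual apply. Here's a minimal sketch of what that can look like with an S3 backend. The bucket name, key, and region are placeholders, and <code>use_lockfile</code> assumes a recent Terraform release:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket  = "acme-terraform-state"                     # hypothetical bucket name
    key     = "production/networking/terraform.tfstate"  # hypothetical state path
    region  = "us-east-1"                                # assumed region
    encrypt = true

    # Native S3 state locking. Older setups commonly use a DynamoDB
    # lock table (the dynamodb_table argument) for the same purpose.
    use_lockfile = true
  }
}
</code></pre>
<p>With a lock in place, Marcus's apply waits or fails fast instead of quietly overwriting Sarah's state.</p>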
<h3 id="heading-an-apply-gets-interrupted">An Apply Gets Interrupted</h3>
<p>A GitHub Actions pipeline is applying changes to the payments infrastructure, adding three new security group rules and a database parameter group. Halfway through, the pipeline runner hits its 60-minute timeout limit, and the job gets killed.</p>
<p>Here's what the apply actually managed to do before dying:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/40cba2d8-56c4-4b03-bd46-0c33a7b1b7af.png" alt="A terminal showing terraform apply running. Three security group rules are created successfully at 12:00. At 12:00:07, the database parameter group starts creating. At 12:01:30, two errors appear in red: Job exceeded maximum runtime 60m and Runner terminated. A pipeline summary below shows security group rules 1, 2, and 3 as created with green checkmarks, database parameter as not created with a red X, and state file update as never wrote because the job died first, also with a red X." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The terminal image above shows three security group rules completing successfully before the pipeline hits its 60-minute runtime limit. The runner is then terminated. The database parameter group never finishes creating, and the state file update never runs because the job died first.</p>
<pre><code class="language-plaintext">Security group rule 1  → created ✓
Security group rule 2  → created ✓
Security group rule 3  → created ✓
Database parameter     → not created ✗
State file update      → never wrote (job died first)
</code></pre>
<p>The three security group rules now exist in AWS. The problem is that the pipeline died before Terraform could finish updating the state file. AWS knows the rules exist. Terraform's state file does not.</p>
<p>At this point, reality and the state file no longer match.</p>
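<p>How easy this is to recover from depends on how much state Terraform managed to persist before the job died. Terraform records each resource in state as it finishes creating it, and remote backends typically receive those updates during the apply, not only at the end. If the three rules made it into state, the next pipeline run refreshes them against AWS, sees that they already exist, and only creates the database parameter group that never got built.</p>
<p>In that case, the second run completes successfully and the state file catches up.</p>
<p>If the job was killed before those writes landed, Terraform has no record of the rules. It doesn't scan AWS for resources it isn't tracking, so the next apply tries to create them again and typically fails with a duplicate-rule error. The fix is to bring the existing rules into state with <code>terraform import</code> and then run the pipeline again.</p>
<p>Terraform is idempotent relative to its state: running the same configuration again moves infrastructure toward the desired state rather than blindly creating everything from scratch, but only for resources the state file knows about.</p>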
<p>One small complication remains: the state lock.</p>
<p>If the pipeline was interrupted while holding a lock, Terraform may still think another apply is running. The next pipeline run fails immediately with an error like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/cbb56e69-7c40-4e61-8966-4cedbdaf2649.png" alt="terminal image showing terraform apply failing because the previous job left a state lock behind. The error includes the lock ID, the path to the state file, and the name of the process that acquired it. Terraform refuses to proceed until the lock is released or manually cleared." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The terminal above shows terraform apply failing because the previous job left a state lock behind. The error includes the lock ID, the path to the state file, and the name of the process that acquired it. Terraform refuses to proceed until the lock is released or manually cleared.</p>
<p>Before clearing the lock, make sure no Terraform apply is still running.</p>
<p>Open your CI/CD system (GitHub Actions, GitLab CI, Jenkins, or whatever your team uses) and check the pipeline history for that environment:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/716f6d0c-62d7-4bea-b7f4-ce6cfdda1188.png" alt="The GitHub Actions pipeline history shows four recent runs. terraform-plan completed successfully. Two terraform-apply jobs show as cancelled and timed out, both flagged as lock may be stale. A fourth terraform-apply job is currently in progress, this one should not be unlocked until it finishes." style="display:block;margin:0 auto" width="1171" height="1343" loading="lazy">

<p>The GitHub Actions pipeline history image above shows four recent runs. terraform-plan completed successfully. Two terraform-apply jobs show as cancelled and timed out, both flagged as lock may be stale. A fourth terraform-apply job is currently in progress, and this one shouldn't be unlocked until it finishes.</p>
<p>If the previous apply was cancelled or timed out, the lock is stale. Clear it with <code>terraform force-unlock</code> plus the lock ID from the error. The pipeline then runs normally.</p>
<p>Only force-unlock when you're certain nothing is actively running. Clearing a live lock lets two applies write to the same state at the same time, which is exactly the problem locking was built to prevent.</p>
<h3 id="heading-someone-runs-a-terraform-state-command-in-the-wrong-environment">Someone Runs a Terraform State Command in the Wrong Environment</h3>
<p>A database engineer is cleaning up an old test database in the staging environment.</p>
<p>The database still exists in AWS, but Terraform should stop managing it. To do that, the engineer uses <code>terraform state rm</code>.</p>
<p>This command doesn't delete anything in AWS. It only removes Terraform's record of the resource from the state file. Think of it as telling Terraform: <em>"forget this resource exists, but leave it running."</em></p>
<p>The engineer intends to run it against staging:</p>
<pre><code class="language-plaintext">Intended:  staging state       → forget the old test database
</code></pre>
<p>But they're working in the wrong directory. They run it against production instead.</p>
<pre><code class="language-plaintext">Actual:    production state    → forget the live payments database
</code></pre>
<p>Nothing gets deleted. The production database is still running in AWS. But Terraform has now forgotten it exists.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5a1dd566-4f38-4ceb-8b41-70bf6ebc69c3.png" alt="Image showing database exists in AWS but is missing from Terraform state." style="display:block;margin:0 auto" width="1774" height="887" loading="lazy">

<p>Now Terraform and reality disagree. The next <code>terraform plan</code> sees a database defined in the code but missing from the state file, so it assumes the database doesn't exist and proposes creating a new one.</p>
<p>If nobody catches it in the plan output, Terraform creates a second production database alongside the original: two databases running in production, neither fully managed, and a very expensive mess to untangle.</p>
<p><code>terraform state rm</code>, <code>terraform import</code>, and <code>terraform state mv</code> make immediate changes to the state file with no confirmation prompt. Run them from the wrong directory, the wrong workspace, or with the wrong resource address and you change the wrong state in seconds.</p>
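<p>One way to reduce this risk is to declare the same operations in configuration, so they show up in a plan and go through pull request review like any other change. Here's a minimal sketch, assuming a Terraform version that supports <code>removed</code> and <code>import</code> blocks. The resource addresses and the database identifier are hypothetical:</p>
<pre><code class="language-hcl"># Stop managing the old test database without deleting it in AWS.
# This is the reviewed equivalent of `terraform state rm`.
removed {
  from = aws_db_instance.old_test   # hypothetical resource address

  lifecycle {
    destroy = false                 # forget the resource, leave it running
  }
}

# Bring an existing database back under management.
# This is the reviewed equivalent of `terraform import`.
import {
  to = aws_db_instance.payments     # hypothetical resource address
  id = "payments-production"        # hypothetical RDS instance identifier
}
</code></pre>
<p>Because these blocks live in a specific environment's directory and only take effect on apply, running them against the wrong state is much harder to do by accident.</p>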
<h3 id="heading-two-teams-manage-the-same-resource">Two Teams Manage the Same Resource</h3>
<p>The networking team owns a security group that controls access to the payments database. When a new microservice needs database access, a payments engineer has two options: ask the networking team to add a new rule, or manage the security group themselves.</p>
<p>They choose the second option. The engineer imports the existing security group into the payments state file and adds a rule for Microservice C.<br>From that moment, both teams think they own the same security group.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/a7780c50-c6c0-49f5-b161-7b4884bc0394.png" alt="Two Terraform state files managing the same security group with different access rules" style="display:block;margin:0 auto" width="1672" height="941" loading="lazy">

<p>The problem is that Terraform does exactly what each state file tells it to do. The networking state says the security group should allow A and B. The payments state says it should allow A, B, and Microservice C. Both can't be true at the same time.</p>
<p>When the payments team applies their state, Microservice C gets access. But later that night, the networking pipeline runs. Terraform reads the networking state, sees only A and B, and updates the security group to match. Microservice C's rule disappears silently.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/df2c9e46-2d6a-49dc-9f5a-f23a06575452.png" alt="image showing the flow of When the payments team applies their state, Microservice C gets access. But later that night, the networking pipeline runs. Terraform reads the networking state, sees only A and B, and updates the security group to match. Microservice C's rule disappears silently." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>No errors appear and both pipelines pass, which is exactly what makes this so hard to debug. Terraform isn't broken. It's receiving conflicting instructions from two different state files and doing exactly what each one says.</p>
<p>This isn't something to be fixed with Terraform commands. It's an ownership decision that should have been made before anyone ran an import. If the payments team had submitted a pull request to the networking repository asking them to add the rule, one team would own the security group, one state file would manage it, and the conflict could never have happened.</p>
<h2 id="heading-why-state-file-gets-treated-like-a-production-database">Why the State File Gets Treated Like a Production Database</h2>
<p>The state file looks like bookkeeping: a record of what Terraform created. The reason teams treat it differently is that it often contains secrets.</p>
<p>The state file stores sensitive values in plaintext. Database passwords, API keys, connection strings&nbsp;– if those values were passed to a Terraform resource during an apply, they're now sitting in the state file. Even if you marked the variable as <code>sensitive</code> in your Terraform code, the value still lands in the state file. Terraform needs it there to compute diffs on future plans.</p>
<p>That means: <strong>whoever can read the state file can read your database password.</strong></p>
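<p>To make that concrete, here's a minimal sketch of a sensitive variable flowing into a resource. The variable name, identifier, and sizing values are illustrative assumptions:</p>
<pre><code class="language-hcl">variable "db_password" {
  type      = string
  sensitive = true   # redacts the value in plan and apply output only
}

resource "aws_db_instance" "payments" {
  identifier        = "payments"        # hypothetical identifier
  engine            = "postgres"
  instance_class    = "db.m5.large"
  allocated_storage = 100               # assumed size in GiB
  username          = "app_admin"       # hypothetical master username

  # The value is hidden in CLI output, but it is written to the
  # state file in plaintext as an attribute of this resource.
  password = var.db_password
}
</code></pre>
<p>The <code>sensitive</code> flag protects your terminal and your CI logs. It doesn't protect the state file, which is why access to the state bucket has to be locked down separately.</p>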
<p>In large organizations, engineers typically don't have direct access to the production state bucket. Instead, Terraform runs through a CI/CD pipeline that assumes a dedicated IAM role with permission to read and write the state bucket and perform applies. Engineers interact with infrastructure through pull requests and plan output, not by touching the state bucket directly.</p>
<p>This separation reduces risk and creates an audit trail. Every state change is performed by the pipeline and logged, making it straightforward to trace what changed and when.</p>
<h2 id="heading-how-enterprise-teams-structure-their-terraform-repositories">How Enterprise Teams Structure Their Terraform Repositories</h2>
<p>When you join a large engineering organization, the first thing you notice is the number of repositories. You might expect one repository for all infrastructure, but what you find is dozens.</p>
<p>The structure maps directly to ownership. Each repository belongs to one team, and that team is responsible for everything in it. A typical layout looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/ca81c9b1-b310-4321-8001-f59ab258c652.png" alt="diagram showing how platform, security, and product teams organize Terraform repositories and ownership" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The diagram shows two types of repositories. The first type belongs to the platform team and contains reusable modules: things like VPC configurations, database templates, and security group patterns. These repositories don't create production resources directly.</p>
<p>The second type belongs to individual product teams, such as the payments team or the auth team. These repositories call the platform modules and use them to build their actual infrastructure. A mistake in a product team repository affects only that team. A mistake in a shared platform module can affect every team that depends on it.</p>
<p>That distinction determines how far a mistake travels, because some repositories are used by one team while others are shared by everyone.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/7fbefdda-c8bb-4e66-a8bc-adc19ae931e7.png" alt="diagram showing how bugs in shared Terraform modules affect more teams than bugs in product-specific repositories." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The diagram illustrates why shared repositories carry more risk than product-specific ones. A bug in the <code>payments-infra</code> repository affects only the payments team. A bug in the <code>terraform-aws-postgres</code> module affects every team that uses it to provision databases. A bug in the <code>terraform-policies</code> repository affects every pipeline in the company. The more widely a module is shared, the larger the blast radius when something goes wrong.</p>
<p>This is why experienced engineers pay close attention to shared modules and policy repositories.</p>
<p>If the payments team's infrastructure breaks, the problem is probably in the payments repository.</p>
<p>If five different teams start seeing the same issue at the same time, the shared modules and policy repositories become the first place to investigate.</p>
<h2 id="heading-how-teams-split-state-files-to-protect-each-other">How Teams Split State Files to Protect Each Other</h2>
<p>A single state file managing everything (VPC, Kubernetes cluster, databases, monitoring) is fine when one person is running things, but quickly becomes a problem when multiple teams share it.</p>
<p>Three specific problems emerge.</p>
<ol>
<li><p><strong>Blast radius:</strong> If the networking configuration and the database configuration live in the same state file, a bad networking apply can accidentally affect database resources that had nothing to do with the change. Separate state files keep failures contained.</p>
</li>
<li><p><strong>Deployment speed:</strong> Networking infrastructure might change a few times a year. Applications might deploy dozens of times a day. If they share a state file, teams end up waiting on each other's locks.</p>
</li>
<li><p><strong>Ownership conflicts:</strong> When multiple teams share a state file, one team can change something the other team depends on without realizing it.</p>
</li>
</ol>
<p>The solution is to split state along ownership boundaries. A structure that addresses all three problems looks like this:</p>
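<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5abbcdbd-af2b-42b7-8dce-00389dbb91eb.png" alt="Directory structure showing one Terraform state file per domain under a production folder: networking, identity, platform, database, security, monitoring, and payments." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">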

<p>The structure image above shows one state file per domain under a production folder.</p>
<ul>
<li><p>networking handles VPC, subnets, routing, and NAT gateways.</p>
</li>
<li><p>identity handles IAM roles, policies, and service accounts.</p>
</li>
<li><p>platform handles the Kubernetes cluster, node pools, and add-ons.</p>
</li>
<li><p>database handles RDS instances, Redis clusters, and backups.</p>
</li>
<li><p>security handles security groups, WAF rules, and certificates.</p>
</li>
<li><p>monitoring handles Prometheus, Grafana, and alerting pipelines.</p>
</li>
<li><p>payments handles payment service infrastructure.</p>
</li>
</ul>
<pre><code class="language-plaintext">production/
  networking/terraform.tfstate   → VPC, subnets, routing, NAT gateways
  identity/terraform.tfstate     → IAM roles, policies, service accounts
  platform/terraform.tfstate     → Kubernetes cluster, node pools, add-ons
  database/terraform.tfstate     → RDS instances, Redis clusters, backups
  security/terraform.tfstate     → Security groups, WAF rules, certificates
  monitoring/terraform.tfstate   → Prometheus, Grafana, alerting pipelines
  payments/terraform.tfstate     → Payment service infrastructure
</code></pre>
<p>This is one example, not a universal standard. Larger organizations often split further. The principle is the same: one owning team per state file, one pipeline, one blast radius.</p>
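<p>In practice, the split usually comes down to each domain's backend pointing at its own state path. Here's a minimal sketch of two such backends. The file paths, bucket name, and region are assumptions:</p>
<pre><code class="language-hcl"># networking/backend.tf (owned by the networking team)
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"                      # hypothetical bucket
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"                                 # assumed region
  }
}

# payments/backend.tf (a separate directory, owned by the payments team)
terraform {
  backend "s3" {
    bucket = "acme-terraform-state"                      # hypothetical bucket
    key    = "production/payments/terraform.tfstate"
    region = "us-east-1"                                 # assumed region
  }
}
</code></pre>
<p>Each backend block lives in its own root directory, so an apply in <code>payments</code> can only ever lock and write the payments state.</p>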
<p>The rule is simple: every resource belongs to one state file. If the networking team owns a security group, it stays in the networking state. Other teams can reference it as a data source, but they don't import it into their own state.<br>That is what prevents the ownership collision described in the first section.</p>
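<p>Referencing a resource as a data source can look like the sketch below. The security group name, the instance, and the AMI variable are hypothetical and only there to show the read-only lookup:</p>
<pre><code class="language-hcl">variable "worker_ami_id" {
  type = string   # hypothetical input supplied by the payments team
}

# Read-only lookup of a security group that the networking state owns.
data "aws_security_group" "payments_db_access" {
  name = "payments-db-access"   # hypothetical security group name
}

# The payments state attaches the group but never manages its rules.
resource "aws_instance" "worker" {
  ami                    = var.worker_ami_id
  instance_type          = "t3.micro"
  vpc_security_group_ids = [data.aws_security_group.payments_db_access.id]
}
</code></pre>
<p>A data source only reads. Terraform never plans changes to the security group from the payments side, so the networking state remains the single source of truth for its rules.</p>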
<h2 id="heading-why-some-teams-prefer-directories-over-workspaces-for-production">Why Some Teams Prefer Directories Over Workspaces for Production</h2>
<p>Terraform CLI workspaces let you manage multiple environments like dev, staging, and production from a single directory. Each workspace gets its own state file, but they all share the same <code>.tf</code> configuration files.</p>
<pre><code class="language-plaintext">infra/
  main.tf          ← same code runs for ALL environments
  variables.tf

  terraform.tfstate.d/
    dev/
    staging/
    production/    ← separate state, same code
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/4461ccbb-3d64-45d2-af7b-143a778b5649.png" alt="The workspace approach keeps all environments in one directory called infra. It contains a single main.tf file that runs for all environments. State is stored separately under terraform.tfstate.d with folders for dev, staging, and production, but all three share the same code." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The workspace approach keeps all environments in one directory called infra. It contains a single main.tf file that runs for all environments. State is stored separately under terraform.tfstate.d with folders for dev, staging, and production, but all three share the same code.</p>
<p>You switch environments with <code>terraform workspace select production</code>, then apply.</p>
<p>The risk is that switching workspaces is a manual step. If the wrong workspace is active, changes meant for staging can end up in production.</p>
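<p>Teams that stay on workspaces sometimes add a guard so that a mismatch fails the run instead of reaching production. Here's one possible sketch. It assumes the pipeline passes an <code>environment</code> variable for each run, and the resource name is hypothetical:</p>
<pre><code class="language-hcl">variable "environment" {
  type = string   # assumed to be set per run, for example "staging"
}

# A resource that manages nothing and exists only to carry the check.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace == var.environment
      error_message = "Active workspace does not match the environment variable."
    }
  }
}
</code></pre>
<p>If someone applies staging variables while the <code>production</code> workspace is selected, the precondition fails and Terraform stops before changing anything.</p>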
<p>Many teams prefer separate directories for long-lived environments:</p>
<pre><code class="language-plaintext">environments/
  dev/
    main.tf      ← its own code path
    backend.tf   ← points to the dev state bucket
  staging/
    main.tf      ← its own code path
    backend.tf   ← points to the staging state bucket
  production/
    main.tf      ← its own code path
    backend.tf   ← points to the production state bucket
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/83604abf-302c-400e-a322-f53e7d0b7d56.png" alt="project structure showing separate Terraform directories for dev, staging, and production environments." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The directory approach gives each environment its own folder under environments. Dev, staging, and production each have their own main.tf with a separate code path, and their own backend.tf pointing to a different state bucket. The environments are completely separate from each other.</p>
<p>To apply against production, you have to be in the production directory. Each environment has its own state, backend, and execution path.</p>
<p>The tradeoff is duplication. Teams usually solve that with shared modules, so each environment directory contains only environment-specific configuration.</p>
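<p>With shared modules, an environment directory can stay very small. Here's a minimal sketch of a production <code>main.tf</code>. The module path and its inputs are hypothetical:</p>
<pre><code class="language-hcl"># environments/production/main.tf
module "network" {
  source = "../../modules/network"   # hypothetical shared module path

  # Only the values that differ between environments live here.
  environment = "production"
  cidr_block  = "10.0.0.0/16"        # assumed production address range
}
</code></pre>
<p>The dev and staging directories would call the same module with their own values, so the logic is written once and only the inputs are duplicated.</p>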
<p>Workspaces are still useful for short-lived environments such as feature branches, preview deployments, and temporary test infrastructure.</p>
<h2 id="heading-how-teams-share-infrastructure-through-modules-on-github">How Teams Share Infrastructure Through Modules on GitHub</h2>
<p>When 30 teams each need a PostgreSQL database, one of two things happens.</p>
<p><strong>Without a shared standard</strong>, every team writes their own database configuration. Six months later, a security audit runs across all environments and finds that:</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/392e5cee-408e-49d5-9cc5-5a53f3537562.png" alt="Diagram showing four teams and their database misconfigurations: Team A with no backups, Team B with unencrypted storage, Team C with no tags, Team D with deletion protection disabled." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The diagram shows what a security audit found when four teams each wrote their own database configuration independently.</p>
<p>Team A set <code>backup_retention_period = 0</code>, meaning their database was never backed up. Team B set <code>storage_encrypted = false</code>, leaving data in plaintext. Team C passed an empty <code>tags = {}</code>, so there was no cost tracking. Team D set <code>deletion_protection = false</code>, leaving the database one accident away from permanent data loss.</p>
<p>Nobody skipped those things on purpose. There was just no shared standard.</p>
<p><strong>With a shared module</strong>, the platform team writes a <code>postgres</code> module once. They encode every organizational requirement into it: encryption on, 7-day backups, monitoring alarms, required tags, deletion protection enabled. They publish it to a GitHub repository called <code>terraform-aws-postgres</code>.</p>
<p>Every team that needs a database now writes this:</p>
<pre><code class="language-hcl">module "payments_db" {
  source         = "git::ssh://github.company.com/platform/terraform-aws-postgres.git?ref=v2.1.0"
  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}
</code></pre>
<p>Four inputs. Everything else is handled by the module.</p>
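<p>What "everything else" might look like inside the module can be sketched as follows. This is a hypothetical excerpt, not the platform team's actual code: the variable names mirror the inputs above, while the storage size, username, snapshot name, and tag keys are assumptions chosen for illustration.</p>
<pre><code class="language-hcl"># Hypothetical excerpt from terraform-aws-postgres (main.tf)
variable "name" { type = string }
variable "environment" { type = string }
variable "instance_class" { type = string }

resource "aws_db_instance" "this" {
  identifier        = "${var.name}-${var.environment}"
  engine            = "postgres"
  instance_class    = var.instance_class
  allocated_storage = 100 # assumed default size in GiB

  # Let RDS generate and store the master password in Secrets Manager
  username                    = "app_admin" # assumed
  manage_master_user_password = true

  # Organizational requirements, set once for every caller
  storage_encrypted         = true
  backup_retention_period   = 7
  deletion_protection       = true
  final_snapshot_identifier = "${var.name}-${var.environment}-final"

  tags = {
    Name        = var.name
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
</code></pre>
<p>Because the hardened settings live in the module and are not exposed as inputs, a calling team cannot switch off encryption or backups by accident.</p>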
<p>Large organizations usually expose approved modules through an internal registry so engineers can discover and version them without browsing GitHub repositories. Instead of the full Git URL, the reference becomes:</p>
<pre><code class="language-hcl">module "payments_db" {
  source  = "app.terraform.io/mycompany/postgres/aws"
  version = "~&gt; 2.1"
}
</code></pre>
<p>HCP Terraform and Terraform Enterprise both include a private registry that connects to GitHub, watches for version tags on module repositories, and publishes new versions automatically.</p>
<h2 id="heading-how-teams-version-and-release-terraform-modules">How Teams Version and Release Terraform Modules</h2>
<p>The <code>?ref=v2.1.0</code> in a module source URL isn't decoration. At the scale of 40 teams sharing one module, it's the thing that prevents a well-intentioned change from becoming a company-wide incident.</p>
<p>Without version pinning, the payments team references the Postgres module from <code>main</code>, meaning whatever the latest code is at any given moment. The module owners rename an output from <code>db_endpoint</code> to <code>database_endpoint</code> to match a new naming convention. The next time any team runs <code>terraform init</code>, they pull that change. Their configuration still references <code>db_endpoint</code>.</p>
<p>Plans break:</p>
<pre><code class="language-plaintext">payments-infra                        → plan fails
analytics-infra                       → plan fails
auth-infra                            → plan fails
reporting-infra                       → plan fails
</code></pre>
<p>Version pinning prevents this. The payments team stays on <code>v2.1.0</code>. Because renaming an output is a breaking change, the module owners release it as <code>v3.0.0</code> and write a changelog. Teams upgrade when they're ready, after testing in staging. Nobody's pipeline breaks without warning.</p>
<p>The versioning convention is called semantic versioning:</p>
<pre><code class="language-plaintext">v2.1.1  → patch:  bug fix. Safe to upgrade. Nothing to change in your code.
v2.2.0  → minor:  new optional feature. Safe to upgrade. Nothing to change.
v3.0.0  → major:  breaking change. Read the changelog. Update your code first.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/e23a58d7-0f3b-4f12-921e-1128f33d6c40.png" alt="image of module semantic versioning" style="display:block;margin:0 auto" width="1774" height="887" loading="lazy">

<p>The table shows three version types. A patch version like v2.1.1 means a bug fix, safe to upgrade with nothing to change in your code. A minor version like v2.2.0 means a new optional feature, also safe to upgrade with nothing to change. A major version like v3.0.0 means a breaking change, so you need to read the changelog and update your code before upgrading.</p>
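<p>As a rough illustration of how those version types interact with a registry constraint, the sketch below annotates the <code>~&gt; 2.1</code> constraint used earlier. The version numbers in the comments are examples, not real releases of this module.</p>
<pre><code class="language-hcl">module "payments_db" {
  source = "app.terraform.io/mycompany/postgres/aws"

  # "~&gt; 2.1" accepts any version from 2.1.0 up to, but not including, 3.0.0:
  #   2.1.1 (patch) is accepted
  #   2.2.0 (minor) is accepted
  #   3.0.0 (major) is rejected until you raise the constraint yourself
  version = "~&gt; 2.1"

  name           = "payments"
  environment    = "production"
  instance_class = "db.m5.large"
}
</code></pre>
<p>A constraint like this lets patch and minor releases flow in when modules are installed, while a breaking major release waits for a deliberate change to the constraint.</p>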
<h2 id="heading-how-teams-maintain-terraform-modules-at-scale">How Teams Maintain Terraform Modules at Scale</h2>
<p>Building a Terraform module takes an afternoon, but maintaining it for two years is a different job entirely.</p>
<p>A networking engineer needs a VPC module. The platform team has one, but their backlog is full. So the engineer creates a slightly different version. Three months later, another team does the same. Then another. Now this exists:</p>
<pre><code class="language-plaintext">terraform-aws-vpc           ← original, maintained by platform team
terraform-aws-vpc-v2        ← created by the app team, author unknown
terraform-aws-vpc-shared    ← no idea which environments use this
terraform-aws-vpc-prod      ← unclear if this was ever different from the original
</code></pre>
<p>No one created a module graveyard on purpose. It grew one <em>"I'll just make a quick variation"</em> at a time. Each variant has slightly different security settings, different tagging, different defaults. When a compliance audit requires all VPCs to enable flow logging, the team has to investigate four different modules to figure out which environments are compliant.</p>
<p>Teams that avoid this treat their modules like shared services: named owner, contributions through pull requests, breaking changes in major versions with a migration guide, and deprecated modules with a retirement date. A <code>CODEOWNERS</code> file routes every pull request to the right reviewer automatically.</p>
<p>Organizations that skip this end up with modules that nobody owns, nobody wants to touch, and nobody is sure can be safely removed.</p>
<h2 id="heading-how-teams-share-data-between-state-files">How Teams Share Data Between State Files</h2>
<p>Once infrastructure is split into separate state files, a practical problem surfaces: teams need information from each other's infrastructure. The platform team's Kubernetes state needs the VPC ID from the networking team's state. The database state needs subnet IDs. The payments state needs the database endpoint.</p>
<p>Two patterns exist for solving this.</p>
<h3 id="heading-reading-another-teams-state-outputs">Reading Another Team's State Outputs</h3>
<p>The <code>terraform_remote_state</code> data source lets one state read the outputs of another. The networking team marks their VPC ID and subnet IDs as outputs. The database team reads those outputs and uses them to place databases in the right subnets.</p>
<pre><code class="language-plaintext">Networking state
  └── outputs: vpc_id, private_subnet_ids
                          ↓
               Database state reads them
               └── places RDS in the right subnets
</code></pre>
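<p>In HCL, the flow above can be sketched roughly like this. The bucket name matches the one used later in this article, but the state key, region, and resource names are assumptions for illustration.</p>
<pre><code class="language-hcl"># In the networking team's configuration: publish the values
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

# In the database team's configuration: read them
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate" # assumed path
    region = "us-east-1"                               # assumed region
  }
}

resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
</code></pre>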
<p>This works, but there's a limitation. Reading another team's state requires full read access to their entire state file, not just the outputs you want. State files contain database passwords and API keys in plaintext. More dependencies means more teams reading each other's secrets.</p>
<h3 id="heading-looking-up-resources-directly-from-the-cloud">Looking Up Resources Directly From the Cloud</h3>
<p>The alternative, and the one HashiCorp now recommends, is to look up resources through the cloud provider's API instead of reading another team's state:</p>
<pre><code class="language-hcl">data "aws_vpc" "main" {
  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}
</code></pre>
<p>No cross-team state access needed, and each team's state stays isolated. The tradeoff is consistent tagging: the networking team has to tag their VPC in a way the database team can reliably search for, which forces teams to agree on naming conventions early.</p>
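<p>The same approach extends to subnets. The sketch below assumes the networking team tags private subnets with <code>Tier = "private"</code>; that tag key is a hypothetical convention, not something the provider defines.</p>
<pre><code class="language-hcl">data "aws_vpc" "main" {
  tags = {
    Name        = "production-vpc"
    Environment = "production"
  }
}

# Find every subnet in that VPC that carries the agreed tag
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private" # assumed tagging convention
  }
}

resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.aws_subnets.private.ids
}
</code></pre>
<p>If the tag is missing or misspelled, the lookup returns no subnets and the plan fails, which is why the tagging convention has to be agreed on up front.</p>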
<p>Many teams use both. Remote state for a small number of trusted, tightly coupled dependencies. Cloud data sources for everything broader.</p>
<h2 id="heading-how-infrastructure-changes-actually-move-to-production">How Infrastructure Changes Actually Move to Production</h2>
<p>In large organizations managing production Terraform at scale, changes don't come from someone's laptop. Applying directly from a local machine requires production cloud credentials sitting on that machine, which is a security risk, and it leaves no audit trail if something breaks.</p>
<p>Instead, production changes move through a pipeline. Every change goes through a pull request in GitHub, and the pipeline does the work:</p>
<pre><code class="language-plaintext">Engineer opens a pull request
        ↓
Pipeline: terraform validate + fmt check
        ↓
Pipeline: security scan (Checkov, tfsec, or similar)
        ↓
Pipeline: terraform plan → posts the full output as a comment on the PR
        ↓
Reviewer reads the plan output (not just the code)
        ↓
Required reviewers approve (enforced by CODEOWNERS + branch protection)
        ↓
Merge triggers the apply pipeline
        ↓
Pipeline: acquires state lock → applies → releases lock → logs result
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/d154589a-67b6-41e0-bda1-d7243521878f.png" alt="CI pipeline flowchart with Terraform" style="display:block;margin:0 auto" width="1024" height="1536" loading="lazy">

<p>The diagram above shows eight steps in order. An engineer opens a pull request. The pipeline runs terraform validate and a format check. A security scan runs using Checkov, tfsec, or similar. The pipeline runs terraform plan and posts the output as a comment on the pull request. A reviewer reads the full plan output. Required reviewers approve, enforced by CODEOWNERS and branch protection rules. Merging triggers the apply pipeline. The pipeline acquires the state lock, applies the changes, releases the lock, and logs the result.</p>
<p>The part that surprises engineers when they first encounter this is that the reviewer isn't approving the code. They're approving the <strong>plan output</strong>: the list of exactly what will be created, changed, or destroyed in the cloud.</p>
<p>A code change can look completely harmless and produce a destructive plan. Changing one database parameter might force a resource replacement, meaning Terraform destroys the current database and creates a new one. This is what that looks like in the plan output before the PR merges:</p>
<pre><code class="language-plaintext"># aws_db_instance.payments must be replaced
-/+ resource "aws_db_instance" "payments" {
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/4b39ca51-ab6d-4187-b7b7-a68b35d13959.png" alt="Terraform plan output in terminal - aws_db_instance.payments" style="display:block;margin:0 auto" width="1567" height="1004" loading="lazy">

<p>The image above shows a plan output stating that <code>aws_db_instance.payments</code> must be replaced, meaning Terraform will destroy the existing database and create a new one instead of updating it in place.</p>
<p>Catching that before merge is the entire point of reviewing the plan. Not the code.</p>
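<p>Plan review is a human check, so some teams add a guard in the configuration as well. The sketch below assumes the database is declared directly as <code>aws_db_instance.payments</code>, with illustrative argument values. Terraform's <code>prevent_destroy</code> lifecycle setting makes any plan that would destroy the resource, including a forced replacement, fail with an error.</p>
<pre><code class="language-hcl">resource "aws_db_instance" "payments" {
  identifier                  = "payments-production" # illustrative values
  engine                      = "postgres"
  instance_class              = "db.m5.large"
  allocated_storage           = 100
  username                    = "app_admin"
  manage_master_user_password = true
  storage_encrypted           = true

  lifecycle {
    # Reject any plan that would destroy or replace this database
    prevent_destroy = true
  }
}
</code></pre>
<p>This is a backstop, not a substitute for reading the plan: it catches the destroy, but it does not explain why the change forced a replacement in the first place.</p>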
<h3 id="heading-how-codeowners-enforces-who-reviews-what">How CODEOWNERS Enforces Who Reviews What</h3>
<p>Earlier, we talked about module ownership. A VPC module might belong to the platform team, while database infrastructure belongs to the database team.</p>
<p>The challenge is making sure changes are actually reviewed by the people who own them.</p>
<p>GitHub solves this with a feature called <strong>CODEOWNERS</strong>. It lets a repository define which team is responsible for which directories. When someone opens a pull request that touches those files, GitHub automatically requests reviews from the correct team.</p>
<p>For example, if an engineer modifies the PostgreSQL module, GitHub can automatically require approval from the platform team before the change can be merged.</p>
<p>Without CODEOWNERS, engineers have to remember who owns which parts of the infrastructure.</p>
<p>CODEOWNERS makes ownership explicit and automatically requests reviews from the right team.</p>
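<p>GitHub only blocks a merge on code owner approval when branch protection requires it. If the repository settings are themselves managed with Terraform, that could look like the sketch below. It assumes the <code>integrations/github</code> provider is already configured; the repository name, team handles, and directory paths are hypothetical.</p>
<pre><code class="language-hcl"># Keep the CODEOWNERS file itself under version control
resource "github_repository_file" "codeowners" {
  repository          = "infrastructure" # hypothetical repository
  branch              = "main"
  file                = ".github/CODEOWNERS"
  commit_message      = "Define infrastructure code owners"
  overwrite_on_create = true

  content = &lt;&lt;-EOT
    /modules/postgres/          @mycompany/platform-team
    /modules/vpc/               @mycompany/platform-team
    /envs/production/database/  @mycompany/database-team
  EOT
}

# Require an approving review from a code owner before merging to main
resource "github_branch_protection" "main" {
  repository_id = "infrastructure"
  pattern       = "main"

  required_pull_request_reviews {
    require_code_owner_reviews      = true
    required_approving_review_count = 1
  }
}
</code></pre>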
<h2 id="heading-how-teams-detect-infrastructure-drift">How Teams Detect Infrastructure Drift</h2>
<p>Drift is the diff between what Terraform says should exist and what actually exists in the cloud.</p>
<p>Here's the scenario that produces drift more reliably than anything else:</p>
<pre><code class="language-plaintext">Monday 3:00 AM  Production database CPU spikes. Outage.
Monday 3:15 AM  Engineer resizes database in AWS console: db.m5.large → db.m5.4xlarge
Monday 3:20 AM  Incident resolved. Engineer goes to sleep.
Monday 3:21 AM  Terraform state file: still says db.m5.large
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/a1d72ad3-42f2-45d1-8d5a-3fdc8a4cbb99.png" alt="Four panels showing how drift happens: the database CPU spikes at 3:00 AM, an engineer resizes it manually in the AWS console at 3:15 AM, the incident resolves at 3:20 AM, and by 3:21 AM the Terraform state file still says db.m5.large, unaware of the change." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<p>The incident is forgotten, the ticket is closed, and life moves on.</p>
<p>Three months later, a routine Terraform apply runs. Terraform sees <code>db.m5.large</code> in the configuration but finds <code>db.m5.4xlarge</code> running in AWS. From Terraform's perspective, the database is larger than it should be, so the plan proposes changing it back.</p>
<p>Nobody notices the change in the plan output. The apply goes through, the database is downsized, and users begin reporting slow queries. The team spends hours investigating before eventually tracing the issue back to a Terraform change that reverted the emergency fix from months earlier.</p>
<p>Teams that handle this well run scheduled <code>terraform plan -detailed-exitcode</code> jobs against every production state. With that flag, exit code <code>0</code> means no changes, <code>1</code> means the plan failed, and <code>2</code> means differences were found, so an alert fires on <code>2</code>. Without the flag, a successful plan exits with <code>0</code> whether or not it found changes. The team then decides whether to apply to restore declared state or update the configuration to match reality. Either way, the change is visible and deliberate. Invisible drift always gets worse.</p>
<h2 id="heading-how-teams-recover-when-state-goes-wrong">How Teams Recover When State Goes Wrong</h2>
<p>State is recoverable in almost every situation, as long as the team set things up correctly before the incident happened.</p>
<p>The teams that recover in twenty minutes instead of three days aren't the ones with the deepest Terraform expertise. They're the ones who prepared.</p>
<h3 id="heading-step-1-pull-a-backup-before-touching-anything">Step 1: Pull a Backup Before Touching Anything</h3>
<pre><code class="language-bash">terraform state pull &gt; backup-$(date +%Y%m%d-%H%M%S).json
</code></pre>
<p>This saves the current state to a local file. Whatever you try next, you have a starting point to return to.</p>
<h3 id="heading-step-2-run-terraform-plan-and-look-at-what-it-proposes">Step 2: Run <code>terraform plan</code> and Look at What it Proposes</h3>
<p>If Terraform proposes destroying resources that still exist in the cloud, the state is behind reality. If it proposes creating resources that already exist, reality is ahead of the state. Either way, the plan output tells you which direction the mismatch runs.</p>
<h3 id="heading-step-3-restore-from-s3-versioning-if-the-state-is-corrupted">Step 3: Restore from S3 Versioning if the State is Corrupted</h3>
<p>Every write to a versioned S3 bucket saves a new version automatically. If the state file is corrupted or wrong, list the previous versions, download the last known good one, and push it back:</p>
<pre><code class="language-bash"># List previous versions
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix production/database/terraform.tfstate

# Download a specific version
aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key production/database/terraform.tfstate \
  --version-id "the-version-id-here" \
  recovered-state.json

# Push it back
terraform state push recovered-state.json
</code></pre>
<p>Run <code>terraform plan</code> after restoring to confirm it looks correct before running any apply.</p>
<h3 id="heading-step-4-clear-a-stale-lock-if-the-pipeline-is-blocked">Step 4: Clear a Stale Lock if the Pipeline is Blocked</h3>
<p>If a lock was never released after a failed apply, clear it:</p>
<pre><code class="language-bash">terraform force-unlock LOCK_ID
</code></pre>
<p>Only do this after confirming no apply is actively running. Clearing a lock while an apply is still in progress lets two writers modify the state at once, which can corrupt it.</p>
<h3 id="heading-step-5-re-import-resources-that-fell-out-of-state">Step 5: Re-import Resources That Fell Out of State</h3>
<p>If a resource exists in the cloud but Terraform no longer knows about it — because of an accidental <code>terraform state rm</code> — bring it back without recreating it:</p>
<pre><code class="language-bash">terraform import aws_db_instance.payments db-ABCD1234EFGH5678
</code></pre>
<p>Run <code>terraform plan</code> after importing to confirm no unexpected changes are proposed.</p>
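<p>On Terraform 1.5 or later, the same re-import can also be declared in configuration, so it goes through a pull request and plan review like any other change. A minimal sketch, reusing the placeholder ID from the command above:</p>
<pre><code class="language-hcl"># Declarative alternative to the terraform import command (Terraform 1.5+)
import {
  to = aws_db_instance.payments
  id = "db-ABCD1234EFGH5678" # placeholder ID from the command above
}
</code></pre>
<p>The next <code>terraform plan</code> lists the resource as one that will be imported, and the <code>import</code> block can be deleted once the apply has succeeded.</p>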
<h2 id="heading-conclusion">Conclusion</h2>
<p>Every practice in this article traces back to a specific problem teams ran into as Terraform usage grew.</p>
<p>State locking prevents engineers from overwriting each other's changes. State splitting reduces blast radius. Module versioning prevents shared infrastructure from breaking unexpectedly. Drift detection catches changes made outside Terraform. CODEOWNERS ensures the right people review the right changes.</p>
<p>Different problems with different solutions. But they all point to the same underlying theme: ownership.</p>
<p>As teams grow, many Terraform problems have less to do with infrastructure and more to do with ownership.</p>
<p>State collisions happen when multiple people can modify the same state. Module sprawl happens when nobody is responsible for maintaining a shared standard.</p>
<p>Drift becomes dangerous when changes are made without anyone taking ownership of bringing Terraform and reality back into alignment. Even review bottlenecks often trace back to uncertainty about who should approve what.</p>
<p>Understanding this changes how you read an unfamiliar Terraform repository.</p>
<p>Dozens of small state files aren't necessarily over-engineering. They're often ownership boundaries. A CODEOWNERS file is not bureaucracy. It's an ownership map. A pipeline that posts plan output on a pull request isn't just automation, it's a review process built around infrastructure consequences rather than code.</p>
<p>The infrastructure matters. But as teams grow, ownership is what keeps the system understandable.</p>
<p><em>I write weekly about DevOps engineering, production systems, and the things tutorials do not cover. If this was useful,</em> <a href="https://osomudeya.kit.com/23db7ca59f"><em>please join the newsletter.</em></a><br><em>If you enjoyed reading this, we can also connect on</em> <a href="https://www.linkedin.com/in/osomudeya-zudonu-17290b124">LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The EKS Cost Optimization Handbook: Reduce Your AWS Bill by 60% Using Karpenter and Rightsizing ]]>
                </title>
                <description>
                    <![CDATA[ This handbook is a complete guide to the 7-step playbook that took one EKS bill from $85,000/month to $34,000/month, without touching a single line of product code. I've audited EKS clusters at mor ]]>
                </description>
                <link>https://www.freecodecamp.org/news/eks-cost-optimization-reduce-your-aws-bill-using-karpenter-and-rightsizing/</link>
                <guid isPermaLink="false">6a396515fa8e37864960ddb6</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ optimization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ayobami Adejumo ]]>
                </dc:creator>
                <pubDate>Mon, 22 Jun 2026 16:38:45 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/cd12d552-bcf2-466a-a98e-7674c436afaa.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>This handbook is a complete guide to the 7-step playbook that took one EKS bill from $85,000/month to $34,000/month, without touching a single line of product code.</p>
<p>I've audited EKS clusters at more than 10 companies. The same waste patterns appear every time: over-provisioned nodes, cross-AZ data transfer, idle EBS volumes, and so on. And the most expensive mistake of all: buying compute commitments before rightsizing.</p>
<p>This handbook is the fix. I've used this 7-step playbook to reduce EKS costs by 50–60% at every company where I've implemented it. There are no product code changes, and no downtime. Just infrastructure optimization executed in the right order.</p>
<p>By the end of this guide, you'll know how to right-size pod resource requests, implement Karpenter for intelligent bin-packing and Spot diversification, migrate compatible workloads to Graviton for 20% cheaper compute, and eliminate NAT Gateway charges entirely with VPC endpoints.</p>
<p>All Terraform modules, NodePool templates, and automation scripts referenced in this guide are available in the companion repository at <a href="https://github.com/aayostem/eks-cost-optimization">github.com/aayostem/eks-cost-optimization</a>. The repo includes ready-to-deploy configurations for every step so you can move from reading to implementing in the same afternoon.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-part-1-the-baseline-where-your-eks-money-is-going">Part 1: The Baseline — Where Your EKS Money Is Going</a></p>
</li>
<li><p><a href="#heading-part-2-right-sizing-pod-resource-requests">Part 2: Right-Sizing Pod Resource Requests</a></p>
</li>
<li><p><a href="#heading-part-3-karpenter-for-bin-packing-and-spot-diversification">Part 3: Karpenter for Bin-Packing and Spot Diversification</a></p>
</li>
<li><p><a href="#heading-part-4-graviton-migration">Part 4: Graviton Migration</a></p>
</li>
<li><p><a href="#heading-part-5-vpc-endpoints-for-data-transfer">Part 5: VPC Endpoints for Data Transfer</a></p>
</li>
<li><p><a href="#heading-part-6-ebs-volume-optimisation">Part 6: EBS Volume Optimisation</a></p>
</li>
<li><p><a href="#heading-part-7-load-balancer-consolidation">Part 7: Load Balancer Consolidation</a></p>
</li>
<li><p><a href="#heading-the-complete-7-step-sequence">The Complete 7-Step Sequence</a></p>
</li>
<li><p><a href="#heading-best-practices-for-eks-cost-optimisation">Best Practices Summary</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<ul>
<li><p>How to right-size pod resource requests using VPA recommendations</p>
</li>
<li><p>The complete Karpenter setup with Spot diversification and automatic consolidation</p>
</li>
<li><p>Graviton3 migration for all non-GPU workloads</p>
</li>
<li><p>VPC endpoints to eliminate NAT Gateway data transfer charges</p>
</li>
<li><p>EBS gp2 to gp3 migration — 20% cheaper with zero performance loss</p>
</li>
<li><p>Load balancer consolidation with shared Ingress</p>
</li>
<li><p>The 7-step sequence that maximises ROI — and why the order isn't optional</p>
</li>
</ul>
<p>Let's dive in.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following along, you should have:</p>
<p><strong>Knowledge:</strong></p>
<ul>
<li><p>Working familiarity with Kubernetes — you can deploy an application and inspect pods</p>
</li>
<li><p>Basic AWS knowledge — you understand EC2 instance types, VPCs, and EBS volumes</p>
</li>
<li><p>Comfort reading Terraform HCL and Kubernetes YAML</p>
</li>
</ul>
<p><strong>Tools and access:</strong></p>
<ul>
<li><p>An existing EKS cluster running Kubernetes 1.27 or later</p>
</li>
<li><p><code>kubectl</code> configured and pointing at your cluster</p>
</li>
<li><p>AWS CLI v2 installed and authenticated with appropriate permissions</p>
</li>
<li><p>Helm 3 installed (for Karpenter and Kubecost)</p>
</li>
<li><p><a href="https://github.com/kubernetes-sigs/metrics-server">Metrics Server</a> installed in your cluster</p>
</li>
</ul>
<p><strong>Companion repository:</strong> Clone the repo before starting. It contains all YAML, Terraform, and shell scripts referenced in this guide:</p>
<pre><code class="language-bash">git clone https://github.com/aayostem/eks-cost-optimization
cd eks-cost-optimization
</code></pre>
<p><strong>Estimated savings:</strong> For a cluster running at $85,000/month with typical over-provisioning, expect $40,000–55,000/month in savings after completing all 7 steps. Smaller clusters under $10,000/month typically see a 40–50% reduction.</p>
<h2 id="heading-part-1-the-baseline-where-your-eks-money-is-going">Part 1: The Baseline — Where Your EKS Money Is Going</h2>
<h3 id="heading-11-the-typical-eks-cost-breakdown">1.1 The Typical EKS Cost Breakdown</h3>
<p>Before touching anything, you need to know exactly where the money is going. Optimising the wrong category first is how teams waste weeks of engineering time and see no meaningful reduction.</p>
<p>Here's what a typical $85,000/month EKS cluster looks like when you break it down:</p>
<table>
<thead>
<tr>
<th>Category</th>
<th>Monthly Cost</th>
<th>Percentage</th>
<th>Waste Potential</th>
</tr>
</thead>
<tbody><tr>
<td>Compute (EC2 nodes)</td>
<td>$52,000</td>
<td>61%</td>
<td>High — over-provisioning, wrong instance types</td>
</tr>
<tr>
<td>Data Transfer</td>
<td>$15,300</td>
<td>18%</td>
<td>Very High — cross-AZ and NAT Gateway charges</td>
</tr>
<tr>
<td>Storage (EBS volumes)</td>
<td>$10,200</td>
<td>12%</td>
<td>Medium — unattached volumes and gp2 vs gp3</td>
</tr>
<tr>
<td>Load Balancers</td>
<td>$4,250</td>
<td>5%</td>
<td>Low to Medium — single-service ALBs</td>
</tr>
<tr>
<td>EKS Control Plane</td>
<td>$72</td>
<td>&lt;1%</td>
<td>None — this is a fixed cost</td>
</tr>
<tr>
<td>Other</td>
<td>$3,178</td>
<td>4%</td>
<td>Low</td>
</tr>
</tbody></table>
<p>Compute and Data Transfer together represent 79% of the bill and account for 90% of the correctable waste. Those are the targets.</p>
<p>Run this command to see your own breakdown before starting anything:</p>
<pre><code class="language-bash"># Pull last month's cost breakdown by service
# Save this output — it becomes your before number
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].{Service:Keys[0],Cost:Metrics.UnblendedCost.Amount}' \
  --output table | sort -k3 -rn
</code></pre>
<p>Screenshot the output and save it. You'll compare against it after each step to verify actual savings before moving to the next one.</p>
<h3 id="heading-12-the-most-expensive-mistake-wrong-optimisation-order">1.2 The Most Expensive Mistake: Wrong Optimisation Order</h3>
<p>Here's what most teams do when they get a large AWS bill:</p>
<ol>
<li><p>Buy Savings Plans immediately, locking in waste at a 30% discount</p>
</li>
<li><p>Then implement Karpenter, discovering they've over-committed the wrong instance family</p>
</li>
<li><p>Then migrate to Graviton, discovering their Savings Plan doesn't cover ARM instances</p>
</li>
</ol>
<p>The result: a 1- or 3-year commitment paying for waste they could have eliminated in three weeks.</p>
<p>The correct sequence is:</p>
<pre><code class="language-plaintext">Step 1: Right-size pod requests        ← Always first
Step 2: Implement Karpenter            ← Dynamic provisioning on rightsized requests
Step 3: Enable Spot for non-prod       ← Karpenter handles fallback automatically
Step 4: Migrate to Graviton            ← Karpenter makes this seamless
Step 5: Add VPC endpoints              ← Eliminate data transfer charges
Step 6: Optimise EBS volumes           ← Quick win, run alongside other steps
Step 7: Consolidate load balancers     ← Final structural cleanup
</code></pre>
<p>Then, and only then, buy Savings Plans — against the optimised baseline you've just established.</p>
<p>The one rule: optimise first, then commit. Every step before the Savings Plan purchase reduces what you're locking in for 1–3 years.</p>
<h2 id="heading-part-2-right-sizing-pod-resource-requests">Part 2: Right-Sizing Pod Resource Requests</h2>
<h3 id="heading-21-why-over-provisioned-requests-are-so-expensive">2.1 Why Over-Provisioned Requests Are So Expensive</h3>
<p>Kubernetes schedules pods based on resource <em>requests</em> — not actual usage. A pod that requests 2 vCPUs and 4GB of memory requires a node with that capacity available, regardless of whether the pod is actually using it.</p>
<p>Here's the incorrect approach with the requests set to worst-case peak estimates:</p>
<pre><code class="language-yaml"># Bad: Resource requests set during initial deployment, never revisited
# This pod actually uses 250m CPU and 512Mi memory on average
resources:
  requests:
    cpu: "2"        # 8x more than actual usage
    memory: "4Gi"   # 8x more than actual usage
  limits:
    cpu: "4"
    memory: "8Gi"
</code></pre>
<p>When every pod is over-requested by 8x, your cluster needs 8x more nodes than your workloads actually require. That's where the 61% compute line in your bill comes from.</p>
<p>First, verify actual usage before changing anything:</p>
<pre><code class="language-bash"># Install Metrics Server if not already running
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check actual CPU and memory usage per pod
# Compare these numbers against your current resource requests
kubectl top pods --all-namespaces --sort-by=cpu
</code></pre>
<p>Expected output showing the typical gap:</p>
<pre><code class="language-plaintext">NAMESPACE     NAME                    CPU(cores)   MEMORY(bytes)
production    payment-api-xxx         25m          128Mi
production    user-api-xxx            15m          96Mi
production    notification-svc-xxx    5m           64Mi
staging       worker-xxx              10m          256Mi
</code></pre>
<p>If your pods are requesting 2 CPU cores each but using only 15m–25m in practice, you have an over-request ratio of roughly 80–130x. Every node in your cluster is mostly empty space you're paying for.</p>
<h3 id="heading-22-using-the-vertical-pod-autoscaler-for-recommendations">2.2 Using the Vertical Pod Autoscaler for Recommendations</h3>
<p>The Vertical Pod Autoscaler (VPA) is a Kubernetes component that analyses historical CPU and memory usage for each deployment and recommends optimal resource requests. You use it in recommendation-only mode first — it tells you what to set without changing anything automatically, so you can review and apply the changes yourself with full control.</p>
<p>Here's the correct implementation:</p>
<pre><code class="language-yaml"># Good: VPA in recommendation-only mode
# Watches your pod's actual usage for 24+ hours, then recommends right-sized requests
# updateMode: "Off" means it only recommends — it never restarts your pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"   # Recommendation only — you apply manually after review
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "100m"     # VPA will never recommend below this floor
        memory: "256Mi"
      maxAllowed:
        cpu: "2"        # VPA will never recommend above this ceiling
        memory: "4Gi"
</code></pre>
<p>Install VPA and retrieve recommendations:</p>
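<pre><code class="language-bash"># Install VPA components with the install script from the kubernetes/autoscaler repo
# (run in a subshell so the working directory stays in the companion repo)
git clone https://github.com/kubernetes/autoscaler.git
(cd autoscaler/vertical-pod-autoscaler &amp;&amp; ./hack/vpa-up.sh)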

# Apply the VPA manifest for each deployment you want to right-size
kubectl apply -f vpa/payment-api-vpa.yaml

# Wait 24 hours for VPA to collect usage data, then check recommendations
kubectl describe vpa payment-api-vpa -n production
</code></pre>
<p>What a VPA recommendation looks like:</p>
<pre><code class="language-plaintext">Recommendation:
  Container Recommendations:
    Container Name: payment-api
    Lower Bound:
      cpu:     50m
      memory:  128Mi
    Target:
      cpu:     250m      ← Set your requests to this value
      memory:  512Mi     ← Set your requests to this value
    Upper Bound:
      cpu:     500m
      memory:  1Gi
</code></pre>
<p>Apply the recommendation to your deployment:</p>
<pre><code class="language-yaml"># Good: Right-sized requests based on VPA Target recommendation
resources:
  requests:
    cpu: "250m"     # Down from 2000m — an 8x reduction
    memory: "512Mi" # Down from 4096Mi — an 8x reduction
  limits:
    cpu: "500m"     # 2x the request — headroom for genuine spikes
    memory: "1Gi"   # 2x the request
</code></pre>
<p>All VPA manifests for common deployment types are in <code>vpa/</code> in the <a href="https://github.com/aayostem/eks-cost-optimization/tree/main/vpa">companion repo</a>.</p>
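<p>If you would rather apply the change from the command line than edit the manifest, one possible approach is <code>kubectl set resources</code>. The sketch below only builds and prints the command so you can review it before running it. The helper name <code>build_set_resources_cmd</code> is hypothetical, and it assumes limits of 2x the request, as in the example above.</p>
<pre><code class="language-bash"># Hypothetical helper: print a kubectl command that applies a VPA Target value
# Arguments: deployment name, namespace, CPU request (millicores), memory request (Mi)
# Limits are assumed to be 2x the request
build_set_resources_cmd() {
  local deploy=$1 ns=$2 cpu_m=$3 mem_mi=$4
  echo "kubectl set resources deployment $deploy -n $ns" \
       "--requests=cpu=${cpu_m}m,memory=${mem_mi}Mi" \
       "--limits=cpu=$(( cpu_m * 2 ))m,memory=$(( mem_mi * 2 ))Mi"
}

# Print the command for the payment-api recommendation (250m CPU, 512Mi memory)
build_set_resources_cmd payment-api production 250 512
</code></pre>
<p>Printing the command first keeps a human in the loop, which matches the recommendation-only spirit of <code>updateMode: "Off"</code>.</p>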
<h3 id="heading-23-the-roi-of-right-sizing">2.3 The ROI of Right-Sizing</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before</th>
<th>After</th>
<th>Improvement</th>
</tr>
</thead>
<tbody><tr>
<td>Average CPU utilisation</td>
<td>18%</td>
<td>65%</td>
<td>+47 percentage points</td>
</tr>
<tr>
<td>Node count required</td>
<td>42</td>
<td>28</td>
<td>-33%</td>
</tr>
<tr>
<td>Monthly compute cost</td>
<td>$52,000</td>
<td>$36,400</td>
<td>-$15,600/month</td>
</tr>
</tbody></table>
<p>Verify the improvement after applying recommendations:</p>
<pre><code class="language-bash"># Check cluster-level utilisation after right-sizing
# Target: 60–75% CPU and memory utilisation across nodes
kubectl top nodes
</code></pre>
<h2 id="heading-part-3-karpenter-for-bin-packing-and-spot-diversification">Part 3: Karpenter for Bin-Packing and Spot Diversification</h2>
<p>Karpenter is an open-source Kubernetes node provisioner built by AWS and donated to the CNCF.</p>
<p>Where the default Kubernetes Cluster Autoscaler scales pre-configured node groups up and down, Karpenter watches the actual resource requests of pending pods and provisions exactly the right EC2 instance type to satisfy them — selecting dynamically from thousands of available instance families rather than the two or three you pre-configured. It also continuously monitors running nodes for underutilisation and consolidates workloads onto fewer nodes, terminating the empty ones automatically.</p>
<p>The result is a cluster that is always sized to what your workloads actually need right now, not what you anticipated at setup time.</p>
<h3 id="heading-31-the-ceiling-with-cluster-autoscaler">3.1 The Ceiling with Cluster Autoscaler</h3>
<p>Cluster Autoscaler works with pre-defined node groups. You configure which instance types are available and it scales those groups up and down.</p>
<p>The limitation is that it can only provision instances from the types you pre-configured. It can't dynamically select the right instance type based on what the workload actually needs right now.</p>
<p>Here's the incorrect approach using static node groups:</p>
<pre><code class="language-bash"># Bad: Two static node groups, each over-provisioning against worst-case scenarios
# CPU-optimised group runs even when workloads are memory-bound
# Memory-optimised group runs even when workloads are CPU-bound
eksctl create nodegroup \
  --cluster my-cluster \
  --name cpu-optimized \
  --instance-types c5.2xlarge \
  --nodes-min 5 --nodes-max 20

eksctl create nodegroup \
  --cluster my-cluster \
  --name memory-optimized \
  --instance-types r5.2xlarge \
  --nodes-min 3 --nodes-max 10
</code></pre>
<p>You're provisioning for the worst case in each family simultaneously. At any given moment, one group is underutilised while the other is scaling. Neither is right.</p>
<h3 id="heading-32-how-karpenter-solves-this">3.2 How Karpenter Solves This</h3>
<p>Karpenter watches the actual resource requests of pending pods and provisions exactly the right instance type to fit them. It selects from thousands of available instance types, not just the two you pre-configured. It also consolidates running workloads onto fewer nodes when utilisation drops, automatically terminating underutilised nodes.</p>
<p>Here's the correct implementation:</p>
<pre><code class="language-yaml"># Good: Karpenter NodePool
# Karpenter selects the optimal instance type based on pending pod requirements
# Tries Spot first, falls back to On-Demand automatically when Spot isn't available
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM (Graviton) — Karpenter picks the cheaper option
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        # Try Spot first, fall back to On-Demand if unavailable
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Exclude families with poor price-to-performance ratio
        - key: karpenter.k8s.aws/instance-family
          operator: NotIn
          values: ["t2", "t3a"]
  limits:
    cpu: "1000"
    memory: "4000Gi"
  disruption:
    # Remove underutilised nodes and reschedule their pods automatically
    consolidationPolicy: WhenUnderutilized
    # Recycle nodes after 30 days to ensure fresh, patched AMIs
    expireAfter: 720h
</code></pre>
<p>What each setting does:</p>
<ul>
<li><p><code>consolidationPolicy: WhenUnderutilized</code>: Karpenter continuously monitors node utilisation and removes underused nodes, moving their pods elsewhere. Your node count decreases automatically as load drops without any manual intervention.</p>
</li>
<li><p><code>expireAfter: 720h</code>: Nodes older than 30 days are gracefully replaced, ensuring your infrastructure always runs the latest EKS-optimised AMI with current security patches.</p>
</li>
<li><p><code>values: ["spot", "on-demand"]</code>: Karpenter attempts Spot capacity first. If Spot is unavailable for the requested instance type, it falls back to On-Demand with no alerts and no manual action required.</p>
</li>
</ul>
<p>Migrating from Cluster Autoscaler safely:</p>
<pre><code class="language-bash"># Step 1: Install Karpenter alongside Cluster Autoscaler (do not remove CAS yet)
# The chart is published as an OCI artifact; pin --version to a release that serves the NodePool API version used in this guide
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --set settings.clusterName=your-cluster-name

# Step 2: Apply NodePool and NodeClass configuration
kubectl apply -f karpenter/nodepool.yaml
kubectl apply -f karpenter/nodeclass.yaml

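# Step 3: Taint existing legacy nodes so new pods schedule on Karpenter nodes
# Running pods stay where they are until they are rescheduled, so workloads migrate gradually with zero downtime
kubectl taint nodes -l eks.amazonaws.com/nodegroup=cpu-optimized \
  group=legacy:NoSchedule
kubectl taint nodes -l eks.amazonaws.com/nodegroup=memory-optimized \
  group=legacy:NoSchedule

# Step 4: Watch pods reschedule to Karpenter-managed nodes over the next hour
# Karpenter labels the nodes it creates with karpenter.sh/nodepool
kubectl get nodes -l karpenter.sh/nodepool
kubectl get pods -o wide --all-namespaces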

# Step 5: After 30 days of stable operation, remove the old node groups
eksctl delete nodegroup --cluster my-cluster --name cpu-optimized
eksctl delete nodegroup --cluster my-cluster --name memory-optimized
</code></pre>
<p>Ready-to-deploy NodePool and NodeClass templates are in <code>karpenter/</code> in the <a href="https://github.com/aayostem/eks-cost-optimization/tree/main/karpenter">companion repo</a>.</p>
<h3 id="heading-33-spot-instances-for-non-production-workloads">3.3 Spot Instances for Non-Production Workloads</h3>
<p>Staging and development workloads don't need the reliability guarantees of On-Demand instances. Moving them to Spot saves 60–90% on those node costs. Karpenter handles Spot interruptions by rescheduling pods automatically. For stateless workloads, interruptions are invisible to users.</p>
<pre><code class="language-yaml"># Good: Spot-only NodePool for staging environments
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: staging-spot
spec:
  template:
    metadata:
      labels:
        billing/environment: staging
    spec:
      taints:
        - key: environment
          value: staging
          effect: NoSchedule  # Only pods that tolerate this taint schedule here
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]   # Spot only for non-production
  disruption:
    consolidationPolicy: WhenUnderutilized
</code></pre>
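<p>Because this NodePool carries an <code>environment=staging</code> taint, staging pods need a matching toleration before they can land on it. As a rough sketch, the helper below prints a merge patch that adds one. The function name <code>toleration_patch</code>, the deployment name, and the namespace are assumptions for illustration.</p>
<pre><code class="language-bash"># Hypothetical helper: emit a merge patch that adds a toleration for a NoSchedule taint
# Arguments: taint key, taint value
toleration_patch() {
  local key=$1 value=$2
  printf '{"spec":{"template":{"spec":{"tolerations":[{"key":"%s","operator":"Equal","value":"%s","effect":"NoSchedule"}]}}}}\n' "$key" "$value"
}

# Print the patch for the environment=staging taint on the staging-spot NodePool
toleration_patch environment staging

# To apply it (deployment name and namespace are placeholders):
# kubectl patch deployment your-app -n staging --type merge -p "$(toleration_patch environment staging)"
</code></pre>
<p>A toleration only permits scheduling onto the tainted nodes. To pin pods there, you would also add a <code>nodeSelector</code> on the <code>billing/environment: staging</code> label.</p>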
<h3 id="heading-34-the-roi-of-karpenter-and-spot">3.4 The ROI of Karpenter and Spot</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before (Cluster Autoscaler)</th>
<th>After (Karpenter + Spot)</th>
<th>Improvement</th>
</tr>
</thead>
<tbody><tr>
<td>Average node count</td>
<td>28</td>
<td>18</td>
<td>-36%</td>
</tr>
<tr>
<td>Average CPU utilisation</td>
<td>65%</td>
<td>82%</td>
<td>+17 percentage points</td>
</tr>
<tr>
<td>Staging environment cost</td>
<td>$8,000/month</td>
<td>$2,400/month</td>
<td>-70%</td>
</tr>
<tr>
<td>Scale-up time for new pods</td>
<td>3–5 minutes</td>
<td>30–60 seconds</td>
<td>-80%</td>
</tr>
</tbody></table>
<h2 id="heading-part-4-graviton-migration">Part 4: Graviton Migration</h2>
<p>AWS Graviton is Amazon's own ARM-based processor family, available across EC2 instance types with names ending in <code>g</code> — <code>m7g</code>, <code>c7g</code>, <code>r7g</code>, and so on.</p>
<p>Graviton instances are priced approximately 20% lower than equivalent Intel or AMD x86 instances. For most server-side workloads — Node.js, Python, Go, Java — they also deliver 20–40% better performance per dollar because the processor architecture is optimised specifically for these workload types.</p>
<p>You don't change your application code to use Graviton. You change the architecture flag in your container image build and the node selector in your Kubernetes deployment.</p>
<h3 id="heading-41-why-graviton-reduces-cost-without-reducing-performance">4.1 Why Graviton Reduces Cost Without Reducing Performance</h3>
<p>The first question to answer before migrating is whether your container images support ARM64. Most official images from Docker Hub ship as multi-architecture images. Your own application images need to be built for both architectures explicitly.</p>
<p>Check whether your images support ARM64:</p>
<pre><code class="language-bash"># Check if an image has an ARM64 manifest
docker manifest inspect your-registry/your-app:latest | jq '.manifests[].platform'
</code></pre>
<p>Expected output for a multi-arch image:</p>
<pre><code class="language-json">{"architecture": "amd64", "os": "linux"}
{"architecture": "arm64", "os": "linux", "variant": "v8"}
</code></pre>
<p>If <code>arm64</code> appears, the image is ready. If not, you need to build and push a multi-arch image first.</p>
<p>Build and push a multi-architecture image:</p>
<pre><code class="language-bash"># Build for both x86 and ARM in a single command using Docker Buildx
docker buildx create --use --name multi-arch-builder

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag your-registry/your-app:latest \
  --push \
  .
</code></pre>
<h3 id="heading-42-migrating-workloads-to-graviton">4.2 Migrating Workloads to Graviton</h3>
<p>With Karpenter already installed, Graviton migration is a single <code>nodeSelector</code> change on your deployment. Karpenter provisions the appropriate ARM64 node automatically.</p>
<p>Here's the correct implementation:</p>
<pre><code class="language-yaml"># Good: nodeSelector directs the pod to Graviton nodes
# Karpenter provisions an arm64 node if one isn't already available
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api            # Matches the app=payment-api selector used in the commands below
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64   # Schedule exclusively on Graviton nodes
      containers:
        - name: api
          image: your-registry/payment-api:latest  # Must be multi-arch
</code></pre>
<p>Migrate gradually, starting with stateless services:</p>
<pre><code class="language-bash"># Step 1: Migrate one stateless service and monitor for 48 hours
kubectl patch deployment payment-api \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/arch":"arm64"}}}}}'

# Step 2: Watch for errors in the first 30 minutes
kubectl logs -l app=payment-api --tail=100 -f

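# Step 3: Verify the pod is running on a Graviton node
# Note the node name in the NODE column, then confirm that node is arm64
# and has a Graviton instance type (m7g, c7g, r7g)
kubectl get pods -l app=payment-api -o wide
kubectl get nodes -L kubernetes.io/arch,node.kubernetes.io/instance-type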

# Step 4: After 48 hours of stable operation, migrate the next service
</code></pre>
<p>There are some situations where you shouldn't migrate to Graviton: GPU workloads, applications with native x86 binary dependencies, or any workload where you haven't yet built multi-arch images.</p>
<h3 id="heading-43-the-roi-of-graviton">4.3 The ROI of Graviton</h3>
<table>
<thead>
<tr>
<th>Workload Type</th>
<th>x86 Monthly Cost</th>
<th>Graviton Monthly Cost</th>
<th>Saving</th>
</tr>
</thead>
<tbody><tr>
<td>Web services (Node.js, Python)</td>
<td>$18,000</td>
<td>$14,400</td>
<td>$3,600/month</td>
</tr>
<tr>
<td>Data processing</td>
<td>$12,000</td>
<td>$9,600</td>
<td>$2,400/month</td>
</tr>
<tr>
<td>API services (Go, Java)</td>
<td>$8,000</td>
<td>$6,400</td>
<td>$1,600/month</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>$38,000</strong></td>
<td><strong>$30,400</strong></td>
<td><strong>$7,600/month</strong></td>
</tr>
</tbody></table>
<h2 id="heading-part-5-vpc-endpoints-for-data-transfer">Part 5: VPC Endpoints for Data Transfer</h2>
<h3 id="heading-51-the-nat-gateway-tax">5.1 The NAT Gateway Tax</h3>
<p>Every byte that travels from your EKS pods to an AWS service — S3, DynamoDB, ECR, SQS — goes through a NAT Gateway if you haven't configured VPC endpoints. NAT Gateway charges $0.045 per GB of data processed.</p>
<p>A busy EKS cluster pulling container images from ECR, writing to S3, and polling SQS queues can process hundreds of terabytes per month through NAT Gateway — generating thousands of dollars in charges for traffic that never actually left the AWS network.</p>
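<p>As a back-of-the-envelope check, the sketch below estimates the monthly processing charge for a given traffic volume. It assumes the $0.045 per GB rate quoted above and 1 TB = 1,024 GB. The function name <code>nat_processing_cost</code> is hypothetical.</p>
<pre><code class="language-bash"># Hypothetical helper: NAT Gateway data processing charge in USD
# Argument: terabytes processed per month (assumes $0.045 per GB, 1 TB = 1024 GB)
nat_processing_cost() {
  awk -v tb="$1" 'BEGIN { printf "%.2f\n", tb * 1024 * 0.045 }'
}

nat_processing_cost 10    # 10 TB per month
nat_processing_cost 100   # 100 TB per month
</code></pre>
<p>This covers only the per-GB processing fee. The hourly charge for the NAT Gateway itself is billed separately.</p>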
<p>Measure your current NAT Gateway cost before adding endpoints:</p>
<pre><code class="language-bash"># Get last month's NAT Gateway data processing charges
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity DAILY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["NatGateway-Bytes"]
    }
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table
</code></pre>
<h3 id="heading-52-vpc-endpoints-the-fix-that-takes-30-minutes">5.2 VPC Endpoints — The Fix That Takes 30 Minutes</h3>
<p>A VPC endpoint creates a private connection between your VPC and an AWS service, routing traffic through the AWS backbone without touching the NAT Gateway. The data transfer becomes free. Each endpoint costs approximately $0.01/hour (roughly $7.20/month), far less than the NAT Gateway processing charges it replaces.</p>
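<p>To see how quickly an endpoint pays for itself, here is a minimal sketch of the break-even volume. It assumes a $0.01 hourly endpoint charge, a 720-hour month, and the $0.045 per GB NAT Gateway rate, and it ignores any per-GB charge on the endpoint itself. The function name <code>endpoint_break_even_gb</code> is hypothetical.</p>
<pre><code class="language-bash"># Hypothetical helper: monthly GB at which one endpoint's hourly cost equals the NAT charge it avoids
# Arguments: endpoint price per hour (USD), NAT Gateway price per GB (USD)
endpoint_break_even_gb() {
  awk -v hourly="$1" -v per_gb="$2" 'BEGIN { printf "%.0f\n", (hourly * 720) / per_gb }'
}

endpoint_break_even_gb 0.01 0.045
</code></pre>
<p>Under these assumptions, any service that moves more than the printed number of gigabytes per month through NAT is cheaper behind an endpoint.</p>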
<p>Here's the complete implementation for the four most common EKS traffic destinations:</p>
<pre><code class="language-bash"># Get your VPC ID and primary route table ID first
VPC_ID=$(aws eks describe-cluster --name your-cluster \
  --query 'cluster.resourcesVpcConfig.vpcId' --output text)

ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=$VPC_ID Name=association.main,Values=true \
  --query 'RouteTables[0].RouteTableId' --output text)

echo "VPC: $VPC_ID | Route Table: $ROUTE_TABLE_ID"

# S3 gateway endpoint — free to create, eliminates all S3 traffic through NAT
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids $ROUTE_TABLE_ID

# DynamoDB gateway endpoint — also free, same mechanism as S3
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids $ROUTE_TABLE_ID

# ECR API interface endpoint — eliminates NAT charges on image pulls
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=$VPC_ID Name=tag:Tier,Values=private \
    --query 'Subnets[*].SubnetId' --output text)

# ECR Docker endpoint — required alongside ECR API for complete image pull coverage
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids $(aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=$VPC_ID Name=tag:Tier,Values=private \
    --query 'Subnets[*].SubnetId' --output text)
</code></pre>
<p>The Terraform module that creates all four endpoints in a single <code>apply</code> is in <code>terraform/vpc-endpoints/</code> in the <a href="https://github.com/aayostem/eks-cost-optimization/tree/main/terraform/vpc-endpoints">companion repo</a>.</p>
<p>Verify that the endpoints are routing traffic correctly:</p>
<pre><code class="language-bash">aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=$VPC_ID \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State,Type:VpcEndpointType}' \
  --output table
# Expected: all endpoints showing State=available
</code></pre>
<h3 id="heading-53-the-roi-of-vpc-endpoints">5.3 The ROI of VPC Endpoints</h3>
<table>
<thead>
<tr>
<th>Service</th>
<th>Before (Through NAT)</th>
<th>After (VPC Endpoint)</th>
<th>Monthly Saving</th>
</tr>
</thead>
<tbody><tr>
<td>S3 data transfer</td>
<td>$4,500</td>
<td>$0</td>
<td>$4,500</td>
</tr>
<tr>
<td>ECR image pulls</td>
<td>$800</td>
<td>$0</td>
<td>$800</td>
</tr>
<tr>
<td>DynamoDB queries</td>
<td>$1,200</td>
<td>$0</td>
<td>$1,200</td>
</tr>
<tr>
<td>Endpoint cost</td>
<td>—</td>
<td>$29 (4 endpoints)</td>
<td>-$29</td>
</tr>
<tr>
<td><strong>Net saving</strong></td>
<td></td>
<td></td>
<td><strong>$6,471/month</strong></td>
</tr>
</tbody></table>
<h2 id="heading-part-6-ebs-volume-optimisation">Part 6: EBS Volume Optimisation</h2>
<h3 id="heading-61-the-gp2-to-gp3-migration">6.1 The gp2 to gp3 Migration</h3>
<p>EBS gp2 volumes tie their IOPS to storage size: 3 IOPS per GB, with a 100 IOPS minimum. EBS gp3 volumes provide a 3,000 IOPS baseline regardless of size and cost 20% less per GB. The migration runs online with no downtime.</p>
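<p>The size-based gp2 rule is easier to see with numbers. The sketch below computes the gp2 baseline for a few volume sizes and prints it next to the fixed gp3 baseline. The helper name <code>gp2_baseline_iops</code> and the sample sizes are assumptions, and the 16,000 IOPS cap is the gp2 maximum rather than a figure from this guide.</p>
<pre><code class="language-bash"># Hypothetical helper: baseline IOPS of a gp2 volume for a given size in GB
# gp2 provides 3 IOPS per GB, floored at 100 and capped at 16,000
gp2_baseline_iops() {
  local size_gb=$1
  local iops=$(( size_gb * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

GP3_BASELINE_IOPS=3000   # gp3 baseline, independent of volume size

for size in 20 500 1000; do
  echo "${size} GB: gp2=$(gp2_baseline_iops "$size") IOPS, gp3=${GP3_BASELINE_IOPS} IOPS"
done
</code></pre>
<p>Under these rules, a gp2 volume has to reach about 1,000 GB before it matches the gp3 baseline, so small volumes gain performance as well as a lower price.</p>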
<p>Find and migrate all gp2 volumes:</p>
<pre><code class="language-bash"># Step 1: List all gp2 volumes and their sizes
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,State:State}' \
  --output table

# Step 2: Migrate each gp2 volume to gp3 — no instance stop required
# The modify operation runs online while the volume stays attached and in use
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text | tr '\t' '\n' | while read -r vol; do
    echo "Migrating $vol from gp2 to gp3..."
    aws ec2 modify-volume \
      --volume-id "$vol" \
      --volume-type gp3
done

# Step 3: Verify all volumes are now gp3
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].VolumeId' \
  --output text
# Expected: empty output — zero gp2 volumes remaining
</code></pre>
<h3 id="heading-62-finding-and-removing-orphaned-volumes-and-snapshots">6.2 Finding and Removing Orphaned Volumes and Snapshots</h3>
<p>When Kubernetes PersistentVolumeClaims are deleted, the underlying EBS volumes sometimes aren't cleaned up. They stay in your account, unattached, and keep billing indefinitely.</p>
<pre><code class="language-bash"># Find unattached EBS volumes — status=available means not attached to any instance
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

# Find EBS snapshots older than 90 days
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime&lt;='$(date -d '90 days ago' --iso-8601=seconds)'].[SnapshotId,StartTime,VolumeSize]" \
  --output table
</code></pre>
<p>Before deleting any snapshot, cross-reference with your RDS automated backup schedule to confirm it's not the only backup for a production database.</p>
<h3 id="heading-63-the-roi-of-ebs-optimisation">6.3 The ROI of EBS Optimisation</h3>
<table>
<thead>
<tr>
<th>Resource</th>
<th>Before</th>
<th>After</th>
<th>Monthly Saving</th>
</tr>
</thead>
<tbody><tr>
<td>gp2 → gp3 migration (1TB total)</td>
<td>$102</td>
<td>$72</td>
<td>$30</td>
</tr>
<tr>
<td>Unattached volumes removed (50 × 100GB)</td>
<td>$500</td>
<td>$0</td>
<td>$500</td>
</tr>
<tr>
<td>Old snapshots cleaned (500GB)</td>
<td>$25</td>
<td>$0</td>
<td>$25</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>$627</strong></td>
<td><strong>$72</strong></td>
<td><strong>$555/month</strong></td>
</tr>
</tbody></table>
<h2 id="heading-part-7-load-balancer-consolidation">Part 7: Load Balancer Consolidation</h2>
<h3 id="heading-71-the-problem-one-load-balancer-per-service">7.1 The Problem — One Load Balancer Per Service</h3>
<p>Many teams create a separate <code>LoadBalancer</code> Service for every microservice. On AWS, each Application Load Balancer costs approximately $16.20/month base charge plus $0.008/LCU-hour for traffic processed. At 20 microservices, that's $324/month before a single request is processed.</p>
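<p>The base charge scales linearly with the number of load balancers, which the following sketch illustrates. It assumes the $16.20 monthly base charge quoted above and leaves LCU traffic charges out. The function name <code>lb_base_cost</code> is hypothetical.</p>
<pre><code class="language-bash"># Hypothetical helper: monthly base charge in USD for a number of load balancers
# Assumes $16.20 per load balancer per month and excludes LCU traffic charges
lb_base_cost() {
  awk -v count="$1" 'BEGIN { printf "%.2f\n", count * 16.20 }'
}

lb_base_cost 20   # one load balancer per microservice
lb_base_cost 1    # a single shared load balancer
</code></pre>
<p>The gap between the two results is the fixed cost you pay each month purely for the one-per-service pattern.</p>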
<p>Here's the incorrect approach:</p>
<pre><code class="language-yaml"># Bad: This creates a dedicated AWS ALB every time it's applied
# 20 microservices = 20 ALBs = $324+/month before any traffic charges
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: LoadBalancer   # Creates a dedicated ALB
  ports:
  - port: 80
    targetPort: 8080
</code></pre>
<h3 id="heading-72-the-fix-shared-ingress-controller">7.2 The Fix — Shared Ingress Controller</h3>
<p>An Ingress controller is a Kubernetes component that runs as a pod inside your cluster and programs a single external load balancer to route traffic to multiple services based on hostname and URL path. Instead of one AWS Application Load Balancer per microservice, you get one ALB total — with path-based routing directing each request to the right backend service. The result is the same routing behaviour at a fraction of the cost.</p>
<p>Here's the correct implementation:</p>
<pre><code class="language-yaml"># Good: One Ingress resource routes all external traffic
# The AWS Load Balancer Controller creates one ALB for all services listed here
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shared-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
spec:
  rules:
  - host: api.company.com
    http:
      paths:
      - path: /payments
        pathType: Prefix
        backend:
          service:
            name: payment-service
            port:
              number: 8080
      - path: /users
        pathType: Prefix
        backend:
          service:
            name: user-service
            port:
              number: 8080
  - host: dashboard.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: dashboard-service
            port:
              number: 3000
  tls:
  - hosts:
    - api.company.com
    - dashboard.company.com
    secretName: tls-wildcard-cert
</code></pre>
<p>Verify the Ingress is provisioned and the ALB DNS name is assigned:</p>
<pre><code class="language-bash"># Watch until the ADDRESS column shows the ALB DNS name (typically 2–3 minutes)
kubectl get ingress shared-ingress -n production -w
</code></pre>
<p>The cost difference:</p>
<table>
<thead>
<tr>
<th>Approach</th>
<th>Load balancers</th>
<th>Monthly cost</th>
</tr>
</thead>
<tbody><tr>
<td>LoadBalancer Service per microservice (20 services)</td>
<td>20 ALBs</td>
<td>~$400/month</td>
</tr>
<tr>
<td>Single Ingress controller</td>
<td>1 ALB</td>
<td>~$27/month</td>
</tr>
<tr>
<td><strong>Monthly saving</strong></td>
<td></td>
<td><strong>~$373/month</strong></td>
</tr>
</tbody></table>
<p>The shared Ingress manifest is in <code>k8s/ingress/</code> in the <a href="https://github.com/aayostem/eks-cost-optimization/tree/main/k8s/ingress">companion repo</a>.</p>
<h2 id="heading-the-complete-7-step-sequence">The Complete 7-Step Sequence</h2>
<table>
<thead>
<tr>
<th>Step</th>
<th>Action</th>
<th>Time to Implement</th>
<th>Expected Monthly Saving</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td>Right-size pod resource requests (VPA)</td>
<td>1 week</td>
<td>$15,600</td>
</tr>
<tr>
<td>2</td>
<td>Install Karpenter with consolidation</td>
<td>1 week</td>
<td>$8,400</td>
</tr>
<tr>
<td>3</td>
<td>Move staging and dev to Spot</td>
<td>1 week</td>
<td>$11,200</td>
</tr>
<tr>
<td>4</td>
<td>Migrate compatible workloads to Graviton</td>
<td>2 weeks</td>
<td>$7,600</td>
</tr>
<tr>
<td>5</td>
<td>Add VPC endpoints for S3, ECR, DynamoDB</td>
<td>1 day</td>
<td>$6,471</td>
</tr>
<tr>
<td>6</td>
<td>Migrate gp2 to gp3 and delete orphaned volumes</td>
<td>1 day</td>
<td>$555</td>
</tr>
<tr>
<td>7</td>
<td>Consolidate load balancers with shared Ingress</td>
<td>1 day</td>
<td>$373</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td></td>
<td><strong>3–4 weeks</strong></td>
<td><strong>$50,199/month</strong></td>
</tr>
</tbody></table>
<p>Annual saving at this rate: <strong>$602,388</strong>. Engineering time required: one engineer, one sprint per step.</p>
<h2 id="heading-best-practices-for-eks-cost-optimisation">Best Practices for EKS Cost Optimisation</h2>
<p>✅ <strong>Do:</strong> Right-size pod resource requests before any other optimisation. Every subsequent step depends on accurate requests.</p>
<p>✅ <strong>Do:</strong> Implement Karpenter with <code>consolidationPolicy: WhenUnderutilized</code>. Let it continuously optimise your node count automatically.</p>
<p>✅ <strong>Do:</strong> Move staging and development workloads to Spot. 60–90% savings for workloads that tolerate interruption.</p>
<p>✅ <strong>Do:</strong> Migrate compatible workloads to Graviton. Most web services and APIs run without code changes.</p>
<p>✅ <strong>Do:</strong> Add VPC endpoints for S3, DynamoDB, and ECR before reviewing data transfer costs.</p>
<p>✅ <strong>Do:</strong> Migrate gp2 volumes to gp3. It's online, zero downtime, and immediately 20% cheaper.</p>
<p>✅ <strong>Do:</strong> Use a single shared Ingress controller for all external traffic instead of per-service load balancers.</p>
<p>❌ <strong>Don't:</strong> Buy Savings Plans before completing steps 1–6. You'll lock in waste for 1–3 years.</p>
<p>❌ <strong>Don't:</strong> Use static node groups with Cluster Autoscaler when your workload mix changes. Karpenter handles this dynamically.</p>
<p>❌ <strong>Don't:</strong> Run staging and development environments on On-Demand instances. Spot interruptions are manageable, but the cost difference is not.</p>
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a href="https://karpenter.sh/docs/"><strong>Karpenter Documentation</strong></a> — Official NodePool configuration reference and installation guide</p>
</li>
<li><p><a href="https://github.com/aws/aws-graviton-getting-started"><strong>AWS Graviton Getting Started Guide</strong></a> — Language-specific compatibility notes and migration guidance from AWS</p>
</li>
<li><p><a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><strong>Vertical Pod Autoscaler GitHub</strong></a> — VPA installation and configuration documentation</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints.html"><strong>AWS VPC Endpoints Documentation</strong></a> — Complete list of available VPC endpoints and configuration options</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/requesting-ebs-volume-modifications.html"><strong>EBS Volume Modification Documentation</strong></a> — AWS guide for online volume type migration with zero downtime</p>
</li>
<li><p><a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/"><strong>AWS Load Balancer Controller</strong></a> — Official documentation for the Ingress controller that provisions AWS ALBs</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/cost-management/latest/APIReference/API_GetCostAndUsage.html"><strong>AWS Cost Explorer API Reference</strong></a> — Full reference for the cost breakdown commands used throughout this guide</p>
</li>
<li><p><a href="https://aws.github.io/aws-eks-best-practices/cost_optimization/cfm_framework/"><strong>EKS Best Practices Guide — Cost Optimisation</strong></a> — AWS's official EKS cost optimisation framework</p>
</li>
<li><p><a href="https://github.com/aayostem/eks-cost-optimization"><strong>Companion Repository</strong></a> — All Terraform modules, NodePool templates, VPA manifests, and automation scripts from this guide</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The 2026 FinOps Roadmap: From Cost-Blind Engineer to Cloud Financial Manager ]]>
                </title>
                <description>
                    <![CDATA[ My first AWS bill was $23,000. I had been working at the company for three weeks. Nobody told me. The bill just grew quietly in the background while I was proud of the feature I shipped. A Lambda func ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-2026-finops-roadmap-from-cost-blind-engineer-to-cloud-financial-manager/</link>
                <guid isPermaLink="false">6a30894af07f26c8d93079b8</guid>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ finops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Roadmap ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ayobami Adejumo ]]>
                </dc:creator>
                <pubDate>Mon, 15 Jun 2026 23:22:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/365d29dc-738d-4c21-a9a5-8f818c36cc95.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>My first AWS bill was $23,000. I had been working at the company for three weeks.</p>
<p>Nobody told me. The bill just grew quietly in the background while I was proud of the feature I shipped. A Lambda function that called an external enrichment API on every user event. Clean code. Solid tests. Thirty-two million events that month. At $0.0007 per API call.</p>
<p>My engineering manager forwarded the invoice with two words: "Please explain."</p>
<p>That was the moment I discovered FinOps — not from a conference talk or a certification course, but from the specific shame of having written expensive code and not knowing it until the damage was done.</p>
<p>This roadmap is what I needed that day. A complete, honest guide to transforming from an engineer who builds things that work into an engineer who builds things that work <em>and</em> cost what they should. By the end of this guide, you'll have the skills, the scripts, and the vocabulary to talk about cloud spend the way a CFO and a CTO both want to hear.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-four-stages-overview">The Four Stages Overview</a></p>
</li>
<li><p><a href="#heading-stage-1-the-cost-aware-engineer-months-1-to-3">Stage 1: The Cost-Aware Engineer — Months 1 to 3</a></p>
</li>
<li><p><a href="#heading-stage-2-the-optimisation-specialist-months-4-to-8">Stage 2: The Optimisation Specialist — Months 4 to 8</a></p>
</li>
<li><p><a href="#heading-stage-3-the-automation-architect-months-9-to-15">Stage 3: The Automation Architect — Months 9 to 15</a></p>
</li>
<li><p><a href="#heading-stage-4-the-cloud-financial-manager-months-16-to-24">Stage 4: The Cloud Financial Manager — Months 16 to 24</a></p>
</li>
<li><p><a href="#heading-essential-tools-and-certifications">Essential Tools and Certifications</a></p>
</li>
<li><p><a href="#heading-your-90-day-action-plan">Your 90-Day Action Plan</a></p>
</li>
<li><p><a href="#heading-best-practices-summary">Best Practices Summary</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<ul>
<li><p>How to read your AWS bill as an engineer, not as a passive observer</p>
</li>
<li><p>The exact tagging strategy that makes cost attribution possible</p>
</li>
<li><p>How to right-size EC2 and RDS instances using CloudWatch data you already have</p>
</li>
<li><p>The correct sequence for purchasing Savings Plans — and why sequence matters more than the discount percentage</p>
</li>
<li><p>How to build automated cleanup systems for orphaned resources</p>
</li>
<li><p>How to present cloud cost findings to engineering leadership with data that drives decisions</p>
</li>
<li><p>The chargeback and showback models that make cost accountability stick</p>
</li>
</ul>
<p>Let's begin.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following this roadmap, you should have some skills and tools ready to go.</p>
<p><strong>Knowledge:</strong></p>
<ul>
<li><p>You can deploy an application to AWS (EC2, Lambda, or containers)</p>
</li>
<li><p>You understand basic AWS services: S3, RDS, EC2, VPC, IAM</p>
</li>
<li><p>You're comfortable reading Python and writing simple bash scripts</p>
</li>
<li><p>You know what a pull request is and have gone through at least one code review</p>
</li>
</ul>
<p><strong>Access:</strong></p>
<ul>
<li><p>Read-only access to your AWS billing console and Cost Explorer</p>
</li>
<li><p>AWS CLI v2 configured with at least <code>ReadOnlyAccess</code> policy attached</p>
</li>
<li><p>Python 3.9 or later for running the audit scripts in this guide</p>
</li>
</ul>
<p><strong>Mindset:</strong> You don't need to be a finance expert. But you do need to be willing to look at numbers that might be uncomfortable. Every engineer I've worked with who became excellent at FinOps had one thing in common: they were willing to be the person who asked "but what does this cost?" in a room where nobody else wanted to.</p>
<p><strong>Estimated time:</strong> This roadmap covers 24 months of deliberate skill-building. You can absorb the reading in a few evenings. The practice is the 24 months.</p>
<h2 id="heading-the-four-stages-overview">The Four Stages Overview</h2>
<p>Before going deep, here's the complete picture of where you're going:</p>
<pre><code class="language-plaintext">Stage 1 — Cost-Aware Engineer (Months 1–3)
├── Read your cloud bill and understand it
├── Tag every resource with meaningful metadata
├── Identify your top 5 cost drivers
└── Block your first expensive PR with cost justification

Stage 2 — Optimisation Specialist (Months 4–8)
├── Right-size every over-provisioned resource
├── Implement storage lifecycle policies
├── Move non-production to Spot instances
└── Purchase your first Savings Plan in the right order

Stage 3 — Automation Architect (Months 9–15)
├── Build automated cleanup for orphaned resources
├── Add cost estimation to your CI/CD pipeline
├── Create cost-aware auto-scaling triggers
└── Deploy a self-service FinOps dashboard

Stage 4 — Cloud Financial Manager (Months 16–24)
├── Lead monthly FinOps reviews with engineering leadership
├── Build chargeback models for departments
├── Negotiate enterprise agreements with AWS
└── Forecast cloud spend within 5% variance
</code></pre>
<p>The reason this is a 24-month journey and not a weekend project: each stage builds on the previous one. Engineers who jump straight to Savings Plans without rightsizing first end up paying discounted prices for waste. Engineers who build dashboards before tagging get beautiful charts with no actionable data. The sequence isn't arbitrary.</p>
<h2 id="heading-stage-1-the-cost-aware-engineer-months-1-to-3">Stage 1: The Cost-Aware Engineer — Months 1 to 3</h2>
<h3 id="heading-11-reading-the-bill-like-an-engineer-not-an-accountant">1.1 Reading the Bill Like an Engineer, Not an Accountant</h3>
<p>The default AWS Cost Explorer view shows you service-level totals. That's accounting. What you need is engineering-level decomposition: which specific resources cost money, what business function they serve, and whether each dollar is justified.</p>
<p>Start by pulling a proper breakdown:</p>
<pre><code class="language-bash"># Pull last month's cost breakdown grouped by service
# Run this before touching any optimisation — this is your baseline
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'last month' +%Y-%m-01),End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=SERVICE \
  --metrics UnblendedCost \
  --query 'ResultsByTime[0].Groups[*].{Service:Keys[0],Cost:Metrics.UnblendedCost.Amount}' \
  --output text | sort -rn   # text output lists Cost first, so this sorts highest cost first
</code></pre>
<p>Save the output. Name the file <code>aws-baseline-YYYY-MM.txt</code>. You'll compare every future month against this number. Without a baseline, you can't measure progress — and without measurable progress, you can't make the case to leadership that the work is worth engineering time.</p>
<h4 id="heading-three-questions-for-every-service-in-your-top-5">Three questions for every service in your top 5:</h4>
<p>Most engineers stop at "what is this service?" and never reach the useful question. Here's the framework I use when I first audit an account:</p>
<p>The first question is whether you know what specific business function this service is performing. Not the product name, the function. "S3" isn't an answer. "Storing unprocessed video uploads that sit for 90 days before anyone watches them" is an answer.</p>
<p>The second question is whether the cost is growing, stable, or shrinking when you look at the past three months. A stable $12,000/month is a different problem from a $12,000/month line that was $4,000 six months ago.</p>
<p>The third question is what percentage of your total bill this service represents. Optimising a 1% line item while a 40% line item runs unchecked is a common time-wasting trap.</p>
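<p>As a rough illustration of the second and third questions, the sketch below takes per-service monthly totals and reports each service's share of the bill and its trend. The function name, the 10% "stable" band, and the sample figures are assumptions made for this example, not values from a real account.</p>

```python
def summarise_services(monthly_costs, stable_band=0.10):
    """
    monthly_costs: {service: [oldest_month, ..., latest_month]} in USD.
    Returns rows sorted by share of the latest month's total, largest first.
    stable_band is a hypothetical tolerance: changes within it count as stable.
    """
    latest_total = sum(costs[-1] for costs in monthly_costs.values())
    rows = []
    for service, costs in monthly_costs.items():
        first, last = costs[0], costs[-1]
        change = (last - first) / first if first else 0.0
        if change > stable_band:
            trend = 'growing'
        elif -change > stable_band:
            trend = 'shrinking'
        else:
            trend = 'stable'
        rows.append({
            'service': service,
            'share_pct': round(100 * last / latest_total, 1),
            'trend': trend,
        })
    return sorted(rows, key=lambda r: r['share_pct'], reverse=True)


# Made-up three-month history for three services
sample = {
    'Amazon EC2': [12000, 12100, 11900],
    'Amazon S3':  [4000, 8000, 12000],
    'AWS Lambda': [300, 280, 100],
}
for row in summarise_services(sample):
    print(row)
```

<p>In this made-up data, S3 and EC2 each account for roughly half of the bill, but only S3 is growing, so that is where the first investigation would go.</p>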
<h3 id="heading-12-the-tagging-strategy-that-actually-survives">1.2 The Tagging Strategy That Actually Survives</h3>
<p>Here's the honest truth about tagging: most tagging strategies die within six months because they're designed for reporting rather than for engineers. Engineers don't tag things well when they're moving fast. The solution isn't to demand more discipline. Instead, it's to make tagging enforced at the infrastructure layer.</p>
<p>Here's the minimal viable tag set (the six tags that cover 90% of attribution needs):</p>
<pre><code class="language-yaml"># These six tags enable cost attribution, accountability, and automated remediation
# Add these to every resource in your AWS account — EC2, RDS, S3, Lambda, everything

Environment: "production" | "staging" | "dev"
Team: "platform" | "backend" | "data" | "ml"
Service: "payment-api" | "fraud-detection" | "user-service"
Owner: "ayo@cloudfrugal.com"     # Person responsible for this resource
CostCenter: "engineering"         # For chargeback reporting
AutoShutdown: "true" | "false"    # Enables automated remediation
</code></pre>
<p>Enforce tags at the Terraform level so they can't be skipped:</p>
<pre><code class="language-hcl"># variables.tf
# Add this to your Terraform root module
# Any plan whose required_tags value is missing one of these keys will fail validation

variable "required_tags" {
  description = "Tags required on every resource in this account"
  type = map(string)
  
  validation {
    # Parentheses let the condition span several lines
    condition = (
      contains(keys(var.required_tags), "Environment") &amp;&amp;
      contains(keys(var.required_tags), "Team") &amp;&amp;
      contains(keys(var.required_tags), "Owner")
    )
    error_message = "required_tags must include Environment, Team, and Owner."
  }
}

# Apply in every resource
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = merge(var.required_tags, {
    Name    = "app-server-${var.environment}"
    Service = "payment-api"
  })
}
</code></pre>
<p>Find everything that's currently untagged:</p>
<pre><code class="language-bash"># List EC2 instances missing the Team tag
# Run this weekly until you hit zero results
aws ec2 describe-instances \
  --query "Reservations[].Instances[?!not_null(Tags[?Key=='Team'].Value | [0])].[InstanceId, InstanceType, State.Name]" \
  --output table
</code></pre>
<p>Once you start finding untagged resources, you'll discover a pattern: the oldest resources in the account are the least tagged, and they're often the most expensive. An EC2 instance from 2021 that predates your tagging policy is exactly the kind of thing that generates a $3,000/month line item nobody can explain.</p>
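<p>The query above checks a single tag. A minimal sketch of checking all six tags at once might look like the following. It assumes tags arrive in the list-of-dicts shape that boto3 returns for EC2 resources; <code>missing_tags</code> and the sample instance are hypothetical names used only for illustration.</p>

```python
REQUIRED_TAGS = ('Environment', 'Team', 'Service', 'Owner', 'CostCenter', 'AutoShutdown')


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """
    resource_tags: list of {'Key': ..., 'Value': ...} dicts.
    Returns the required keys that are absent or have a blank value.
    """
    present = {t['Key'] for t in resource_tags if t.get('Value', '').strip()}
    return [key for key in required if key not in present]


# Hypothetical instance tagged before the policy existed
legacy_tags = [
    {'Key': 'Name', 'Value': 'old-batch-worker'},
    {'Key': 'Team', 'Value': 'data'},
    {'Key': 'Owner', 'Value': ''},
]
print(missing_tags(legacy_tags))
```

<p>Treating a blank value as missing is a design choice: an <code>Owner</code> tag with no owner in it doesn't help anyone attribute the cost.</p>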
<h3 id="heading-13-the-cost-aware-code-review">1.3 The Cost-Aware Code Review</h3>
<p>The most underused FinOps practice in engineering teams is reviewing code changes for cost implications before they merge. It takes thirty seconds per PR once you build the habit, and it prevents the kind of problem that opened this guide: the expensive feature that nobody priced before shipping.</p>
<p>Add this section to your PR template:</p>
<pre><code class="language-markdown">## Cost Impact (required for infrastructure and data changes)

- [ ] This change does not affect cloud resource usage
- [ ] New API calls introduced: estimated cost per call $______, calls/month ______
- [ ] New data storage: estimated monthly delta $______
- [ ] Cross-region data transfer introduced: yes / no
- [ ] New external service dependency with per-call pricing: yes / no

If any box other than the first is checked, add a cost estimate before requesting review.
</code></pre>
<p>The discipline is in making cost estimation a first-class review concern, not an afterthought that gets caught by the finance team on the 15th of the month.</p>
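<p>The estimate the template asks for is usually one multiplication. Here is a small sketch using the figures from the Lambda story at the top of this guide; the function name is an assumption for illustration.</p>

```python
def estimate_monthly_cost(calls_per_month, cost_per_call_usd):
    """Monthly spend for a dependency with per-call pricing, in USD."""
    return round(calls_per_month * cost_per_call_usd, 2)


# 32 million events at 0.0007 USD per enrichment call
print(estimate_monthly_cost(32_000_000, 0.0007))
```

<p>That single line of arithmetic accounts for most of the $23,000 invoice, and it could have gone in the PR description before the feature shipped.</p>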
<h3 id="heading-stage-1-outcomes">Stage 1 Outcomes</h3>
<p>By the end of month 3, you should have a baseline cost breakdown on file and 100% tag coverage on active resources. You should also have identified your top 5 cost drivers with specific reduction targets, and blocked at least one expensive PR with a cost justification that held up in review.</p>
<h2 id="heading-stage-2-the-optimisation-specialist-months-4-to-8">Stage 2: The Optimisation Specialist — Months 4 to 8</h2>
<h3 id="heading-21-right-sizing-the-8020-of-cloud-savings">2.1 Right-Sizing: The 80/20 of Cloud Savings</h3>
<p>The single most reliable source of cloud waste I find in every account I audit is over-provisioned compute.</p>
<p>The pattern is consistent: an engineer provisions an instance at a size that handles their anticipated peak load, the peak never quite materialises at the expected scale, and nobody revisits the instance size because there's no automatic signal that says "this machine is 75% empty."</p>
<p>Make sure you verify actual utilisation before changing anything:</p>
<pre><code class="language-python"># rightsize_analyzer.py
# Finds EC2 instances running below 20% average CPU for 14 days
# These are right-sizing candidates — not automatic deletions

import boto3
from datetime import datetime, timedelta

def find_oversized_instances(region='us-east-1'):
    """
    Returns instances with average CPU below 20% for the last 14 days.
    Low CPU alone doesn't mean right-size — check memory too if CW agent installed.
    """
    ec2 = boto3.client('ec2', region_name=region)
    cw  = boto3.client('cloudwatch', region_name=region)

    reservations = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )['Reservations']

    candidates = []

    for r in reservations:
        for inst in r['Instances']:
            iid  = inst['InstanceId']
            itype = inst['InstanceType']
            tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}

            # Pull 14-day average CPU from CloudWatch
            stats = cw.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': iid}],
                StartTime=datetime.utcnow() - timedelta(days=14),
                EndTime=datetime.utcnow(),
                Period=1209600,   # One 14-day period
                Statistics=['Average']
            )['Datapoints']

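            # No datapoints means no evidence (for example, a brand-new instance),
            # so skip it rather than report it as 0% CPU
            if not stats:
                continue
            avg_cpu = sum(d['Average'] for d in stats) / len(stats)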

            if avg_cpu &lt; 20.0:
                candidates.append({
                    'instance_id':  iid,
                    'instance_type': itype,
                    'avg_cpu_pct':  round(avg_cpu, 1),
                    'environment':  tags.get('Environment', 'unknown'),
                    'owner':        tags.get('Owner', 'unknown'),
                    'team':         tags.get('Team', 'unknown'),
                })

    return sorted(candidates, key=lambda x: x['avg_cpu_pct'])

if __name__ == '__main__':
    results = find_oversized_instances()
    print(f"\nFound {len(results)} right-sizing candidates:\n")
    for r in results:
        print(f"  {r['instance_id']} ({r['instance_type']}) — "
              f"{r['avg_cpu_pct']}% avg CPU — "
              f"owner: {r['owner']}")
</code></pre>
<p>A word of caution: CPU utilisation below 20% is a signal, not a verdict. Some workloads are memory-intensive or I/O-bound and will show low CPU while being correctly sized. Before acting on any right-sizing recommendation, check memory utilisation (requires the CloudWatch agent) and network I/O patterns alongside CPU.</p>
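<p>One way to encode that caution is to refuse to recommend a downsize until memory data exists. The sketch below is an assumption-heavy illustration: the 40% memory threshold and the function name are hypothetical, and network I/O is left out for brevity.</p>

```python
def classify_candidate(avg_cpu_pct, avg_mem_pct=None, cpu_limit=20.0, mem_limit=40.0):
    """
    Returns 'downsize', 'keep', or 'needs-memory-data'.
    avg_mem_pct is None when the CloudWatch agent is not installed.
    mem_limit is a hypothetical threshold chosen for this example.
    """
    if avg_cpu_pct >= cpu_limit:
        return 'keep'
    if avg_mem_pct is None:
        return 'needs-memory-data'
    return 'keep' if avg_mem_pct >= mem_limit else 'downsize'


print(classify_candidate(8.0, 15.0))   # low CPU and low memory
print(classify_candidate(8.0, 85.0))   # low CPU but memory-bound
print(classify_candidate(8.0))         # low CPU, no memory metrics yet
```

<p>The third case is the important one: a low-CPU instance without memory metrics is a prompt to install the agent, not a downsizing ticket.</p>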
<h3 id="heading-22-storage-tiering-stop-paying-retail-for-cold-data">2.2 Storage Tiering: Stop Paying Retail for Cold Data</h3>
<p>S3 Standard costs $0.023 per GB per month. S3 Glacier Deep Archive costs $0.00099 per GB per month. The difference is a factor of 23. If you have data that you last accessed six months ago and you're keeping it in S3 Standard because nobody set up lifecycle policies, you're paying 23x more than necessary.</p>
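<p>To make that ratio concrete, here is a short sketch that prices the same data in both classes. It uses only the two per-GB rates quoted above; the 50 TB figure is hypothetical, and request, retrieval, and minimum-storage-duration charges are deliberately ignored.</p>

```python
# USD per GB-month, taken from the paragraph above; real prices vary by region
PRICE_PER_GB = {
    'STANDARD': 0.023,
    'DEEP_ARCHIVE': 0.00099,
}


def monthly_storage_cost(size_gb, storage_class):
    """Storage-only cost; ignores request, retrieval, and minimum-duration fees."""
    return round(size_gb * PRICE_PER_GB[storage_class], 2)


cold_data_gb = 50_000  # hypothetical 50 TB of logs nobody reads
standard = monthly_storage_cost(cold_data_gb, 'STANDARD')
archive = monthly_storage_cost(cold_data_gb, 'DEEP_ARCHIVE')
print(f"Standard: {standard} USD/month, Deep Archive: {archive} USD/month")
print(f"Ratio: {standard / archive:.1f}x")
```

<p>Because retrieval from Deep Archive is slow and billed separately, this comparison only holds for data you genuinely don't expect to read.</p>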
<p><strong>The complete S3 lifecycle policy for engineering teams:</strong></p>
<pre><code class="language-json">{
  "Rules": [
    {
      "ID": "application-logs-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {"Days": 30,  "StorageClass": "STANDARD_IA"},
        {"Days": 90,  "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ],
      "Expiration": {"Days": 2555},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    },
    {
      "ID": "training-checkpoints-lifecycle",
      "Status": "Enabled",
      "Filter": {"Prefix": "ml-checkpoints/"},
      "Transitions": [
        {"Days": 7,  "StorageClass": "STANDARD_IA"},
        {"Days": 30, "StorageClass": "GLACIER_IR"}
      ],
      "Expiration": {"Days": 90}
    }
  ]
}
</code></pre>
<pre><code class="language-bash"># Apply the lifecycle policy to a bucket
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-logs-bucket \
  --lifecycle-configuration file://lifecycle.json

# Verify it applied correctly
aws s3api get-bucket-lifecycle-configuration \
  --bucket your-logs-bucket
</code></pre>
<h3 id="heading-23-savings-plans-the-sequence-is-everything">2.3 Savings Plans: The Sequence Is Everything</h3>
<p>A Savings Plan is a commitment to spend a minimum dollar amount per hour on AWS compute for one or three years, in exchange for discounts of 30–70% off On-Demand rates. The discount is real. The trap is buying before optimising.</p>
<p><strong>The wrong order:</strong> You have a $50,000/month EC2 bill. You buy a Savings Plan covering $35,000/month of it. Then you implement right-sizing and Spot instances, and your actual spend drops to $22,000/month. You've committed to paying $35,000/month for 12 months against a need of $22,000. You're paying $13,000/month for compute you don't use, at a 30% discount. Congratulations on your discounted waste.</p>
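<p>The arithmetic behind that scenario, as a small sketch. It simplifies by treating the commitment and the usage as monthly figures in the same dollars, whereas real Savings Plans are committed and consumed hour by hour; the function name is an assumption.</p>

```python
def unused_commitment(committed_monthly_usd, actual_need_monthly_usd):
    """Monthly spend committed to a Savings Plan but not matched by real usage."""
    return max(committed_monthly_usd - actual_need_monthly_usd, 0)


wasted = unused_commitment(35_000, 22_000)
print(f"Unused commitment: {wasted} USD/month, {wasted * 12} USD over a 1-year term")
```

<p>Run the same function with the commitment sized after optimisation and the unused portion drops to zero, which is the whole argument for the ordering that follows.</p>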
<p><strong>The right order:</strong></p>
<pre><code class="language-plaintext">Month 1-2: Right-size all instances using VPA and CloudWatch data
Month 3:   Move staging and development to Spot instances
Month 4:   Migrate compatible workloads to Graviton (20% cheaper)
Month 5:   Add VPC endpoints to eliminate NAT Gateway charges
Month 6:   THEN look at your steady-state On-Demand spend
Month 6+:  Purchase Savings Plans covering 70% of that optimised baseline
</code></pre>
<p><strong>Calculate what to commit to:</strong></p>
<pre><code class="language-bash"># Get your On-Demand EC2 spend for the last 30 days
# This is your rightsized baseline — the number to commit against
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{
    "And": [
      {"Dimensions": {"Key": "SERVICE",       "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
      {"Dimensions": {"Key": "PURCHASE_TYPE", "Values": ["On-Demand"]}}
    ]
  }' \
  --metrics UnblendedCost \
  --query 'ResultsByTime[*].{Date:TimePeriod.Start,Cost:Total.UnblendedCost.Amount}' \
  --output table

# Get AWS's own recommendation for what to commit
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days THIRTY_DAYS
</code></pre>
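<p>If you want a sanity check to hold against AWS's recommendation, one possible approach is to size the commitment from the daily figures returned by the first command. This sketch assumes the lowest day represents your steady state and applies the 70% coverage from the sequence above. Both choices are assumptions, and because Savings Plan commitments are expressed in discounted dollars, an On-Demand-based number like this is best read as an upper bound.</p>

```python
def suggested_hourly_commitment(daily_costs_usd, coverage=0.70):
    """
    daily_costs_usd: On-Demand compute spend per day after optimisation.
    Uses the lowest day as the steady-state floor, then applies the
    coverage ratio. Both choices are assumptions made for this sketch.
    """
    steady_state_per_hour = min(daily_costs_usd) / 24
    return round(steady_state_per_hour * coverage, 2)


# Hypothetical week of optimised daily spend in USD
week = [720.0, 744.0, 768.0, 735.0, 720.0, 810.0, 790.0]
print(suggested_hourly_commitment(week))
```

<p>Sizing from the floor rather than the average is conservative on purpose: spend above the commitment simply falls back to On-Demand rates, while commitment above your usage is paid for regardless.</p>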
<h2 id="heading-stage-3-the-automation-architect-months-9-to-15">Stage 3: The Automation Architect — Months 9 to 15</h2>
<h3 id="heading-31-the-orphaned-resource-problem-and-why-it-never-fixes-itself">3.1 The Orphaned Resource Problem — And Why It Never Fixes Itself</h3>
<p>Orphaned resources are the cloud equivalent of a gym membership you forgot to cancel. They exist, they charge you, but nobody notices until the annual audit.</p>
<p>The root cause isn't laziness. It's the absence of lifecycle management at the infrastructure layer. When an engineer spins up an EC2 instance for a one-week experiment and then leaves the company, there's no automatic signal that the instance is now orphaned. It sits there, billing $140/month, until someone hunts it down.</p>
<p>The fix is a weekly automated audit that surfaces candidates for deletion and notifies the registered owner, not a process change that depends on engineers remembering to clean up.</p>
<pre><code class="language-python"># orphan_reporter.py
# Runs every Sunday via EventBridge → Lambda
# Posts a Slack report of orphaned resources for human review
# DOES NOT auto-delete — deletion requires a human decision

import boto3
import json
import urllib.request
from datetime import datetime, timedelta, timezone

SLACK_WEBHOOK = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
UNATTACHED_VOLUME_AGE_DAYS = 14
SNAPSHOT_AGE_DAYS = 90


def find_orphaned_resources():
    ec2 = boto3.client('ec2')
    report = {'monthly_waste_usd': 0, 'items': []}

    # Unattached EBS volumes
    for vol in ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age &gt;= UNATTACHED_VOLUME_AGE_DAYS:
            cost = round(vol['Size'] * 0.08, 2)  # gp3 rate
            tags = {t['Key']: t['Value'] for t in vol.get('Tags', [])}
            report['items'].append({
                'type':  'Unattached EBS Volume',
                'id':    vol['VolumeId'],
                'detail': f"{vol['Size']}GB {vol['VolumeType']} — {age} days old",
                'owner': tags.get('Owner', 'unknown'),
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    # Unassociated Elastic IPs
    for addr in ec2.describe_addresses()['Addresses']:
        if 'AssociationId' not in addr:
            report['items'].append({
                'type':  'Unassociated Elastic IP',
                'id':    addr.get('AllocationId', addr['PublicIp']),
                'detail': addr['PublicIp'],
                'owner': 'unknown',
                'monthly_cost_usd': 3.60,
            })
            report['monthly_waste_usd'] += 3.60

    # Old snapshots
    # Compare datetimes directly; both sides are timezone-aware
    cutoff = datetime.now(timezone.utc) - timedelta(days=SNAPSHOT_AGE_DAYS)
    for snap in ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']:
        if snap['StartTime'] &lt; cutoff:
            cost = round(snap.get('VolumeSize', 0) * 0.05, 2)
            report['items'].append({
                'type':  f'Snapshot ({SNAPSHOT_AGE_DAYS}+ days old)',
                'id':    snap['SnapshotId'],
                'detail': f"Created {snap['StartTime'].strftime('%Y-%m-%d')}",
                'owner': 'unknown',
                'monthly_cost_usd': cost,
            })
            report['monthly_waste_usd'] += cost

    return report


def post_to_slack(report):
    lines = [
        f":money_with_wings: *Weekly Orphaned Resource Report*",
        f"Found *{len(report['items'])} orphaned resources* "
        f"costing *${report['monthly_waste_usd']:.2f}/month*\n",
    ]
    for item in report['items'][:20]:  # Cap at 20 lines to stay readable
        lines.append(
            f"• `{item['type']}` {item['id']} — {item['detail']} "
            f"— *${item['monthly_cost_usd']:.2f}/mo* — owner: {item['owner']}"
        )
    lines.append("\nReview and delete anything no longer needed.")

    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({'text': '\n'.join(lines)}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req)


def lambda_handler(event, context):
    report = find_orphaned_resources()
    post_to_slack(report)
    return {
        'items_found': len(report['items']),
        'monthly_waste': report['monthly_waste_usd'],
    }
</code></pre>
<h3 id="heading-32-cost-estimation-in-your-cicd-pipeline">3.2 Cost Estimation in Your CI/CD Pipeline</h3>
<p>The goal is to catch expensive infrastructure changes at the PR stage — before they deploy and before they generate a billing surprise.</p>
<pre><code class="language-yaml"># .github/workflows/cost-check.yml
# Runs on any PR that touches infrastructure files
# Uses Infracost to estimate the monthly cost delta

name: Infrastructure Cost Check

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'infrastructure/**'
      - '*.tf'

jobs:
  cost-estimate:
    name: Estimate monthly cost change
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate cost estimate
        run: |
          infracost breakdown \
            --path terraform/ \
            --format json \
            --out-file /tmp/infracost.json

      - name: Post cost diff to PR
        uses: infracost/actions/comment@v3
        with:
          path: /tmp/infracost.json
          behavior: update

      - name: Block if monthly increase exceeds threshold
        run: |
          MONTHLY_DELTA=$(cat /tmp/infracost.json | \
            jq '.projects[0].diff.totalMonthlyCost' | tr -d '"')

          echo "Estimated monthly cost change: \$$MONTHLY_DELTA"

          # Fail the PR if this change adds more than $500/month
          python3 -c "
          import sys
          delta = float('$MONTHLY_DELTA')
          if delta &gt; 500:
              print(f'PR blocked: estimated +\${delta:.2f}/month exceeds \$500 threshold')
              sys.exit(1)
          else:
              print(f'Cost check passed: estimated +\${delta:.2f}/month')
          "
</code></pre>
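<p>The inline Python in the last step can be hard to test. A hedged alternative is to keep the threshold logic in a small function you can run locally. The sketch below assumes the Infracost JSON exposes a top-level <code>diffTotalMonthlyCost</code> field as a numeric string; verify that against the output of your Infracost version before relying on it. The sample payload is hand-written, not real Infracost output.</p>

```python
import json


def check_cost_delta(infracost_json, threshold_usd=500.0):
    """
    infracost_json: JSON text produced by Infracost.
    Reads the top-level diffTotalMonthlyCost field (assumed to be a numeric
    string); a missing or null value is treated as no change.
    Returns (passed, delta).
    """
    data = json.loads(infracost_json)
    delta = float(data.get('diffTotalMonthlyCost') or 0)
    return threshold_usd >= delta, delta


# Hand-written sample, not real Infracost output
sample = json.dumps({'diffTotalMonthlyCost': '612.40'})
passed, delta = check_cost_delta(sample)
print('passed' if passed else 'blocked', delta)
```

<p>Treating a missing value as zero keeps PRs that don't change cost from failing the check, at the price of hiding a malformed report, so you may prefer to raise an error there instead.</p>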
<h2 id="heading-stage-4-the-cloud-financial-manager-months-16-to-24">Stage 4: The Cloud Financial Manager — Months 16 to 24</h2>
<h3 id="heading-41-leading-finops-reviews-with-executives">4.1 Leading FinOps Reviews with Executives</h3>
<p>By month 16, you have the data. What changes at Stage 4 is the audience. You're no longer presenting to engineers who understand instance types and NAT Gateway pricing. You're presenting to a CTO who wants to know if the infrastructure investment is proportional to the business value it produces, and a CFO who wants to know when the line will stop going up.</p>
<p>The vocabulary shift is simple but important. You stop saying "we right-sized our EC2 instances" and start saying "we reduced our infrastructure unit cost by 28% while maintaining the same request throughput." You stop saying "we eliminated NAT Gateway charges" and start saying "we closed a $6,400/month gap between what we were paying and what was necessary."</p>
<p>The metric that anchors every executive FinOps conversation is cost per business unit: not the total bill, but cost per API call, cost per user, cost per transaction, or cost per model inference. That ratio tells the story of whether your infrastructure efficiency is improving as the business scales.</p>
<pre><code class="language-python"># unit_economics.py
# Calculate cost per transaction — the metric that matters to leadership

import boto3
from datetime import datetime, timedelta

def calculate_cost_per_transaction(service_name, transaction_count, days_back=30):
    """
    Returns cost per transaction for a given service over the last N days.
    transaction_count: total transactions for the same period (from your metrics)
    """
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.now() - timedelta(days=days_back)).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        Filter={
            'Tags': {
                'Key':    'Service',
                'Values': [service_name]
            }
        }
    )

    total_cost = sum(
        float(period['Total']['UnblendedCost']['Amount'])
        for period in response['ResultsByTime']
    )

    cost_per_txn = total_cost / transaction_count if transaction_count &gt; 0 else 0

    return {
        'service':           service_name,
        'period_days':       days_back,
        'total_cost_usd':    round(total_cost, 2),
        'transactions':      transaction_count,
        'cost_per_txn_usd':  round(cost_per_txn, 6),
    }


# Example: payment service processed 4.2M transactions this month
result = calculate_cost_per_transaction('payment-api', 4_200_000)
print(f"Cost per transaction: ${result['cost_per_txn_usd']:.6f}")
print(f"Total infrastructure cost: ${result['total_cost_usd']:,.2f}")
</code></pre>
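<p>To show why the ratio matters more than the total, here is a small sketch comparing two periods. The bills and transaction counts are hypothetical, chosen so the unit costs come out at 0.0021 and 0.0013 USD; the helper names are assumptions for illustration.</p>

```python
def cost_per_unit(total_cost_usd, units):
    """Cost per business unit (transaction, user, API call) in USD."""
    return total_cost_usd / units if units else 0.0


def pct_change(before, after):
    """Percentage change from before to after; negative means a reduction."""
    return round(100 * (after - before) / before, 1)


# Hypothetical quarters: the bill grows while the business grows faster
last_bill, last_txns = 8_400, 4_000_000
this_bill, this_txns = 10_400, 8_000_000

bill_change = pct_change(last_bill, this_bill)
unit_change = pct_change(cost_per_unit(last_bill, last_txns),
                         cost_per_unit(this_bill, this_txns))
print(f"Total bill: {bill_change:+}%  Cost per transaction: {unit_change:+}%")
```

<p>A CFO who sees only the first number sees a problem. Shown both, the same data reads as infrastructure getting more efficient as volume doubles.</p>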
<h3 id="heading-42-the-chargeback-and-showback-models">4.2 The Chargeback and Showback Models</h3>
<p>Chargeback means actually billing departments for their cloud usage. Showback means showing departments their usage costs without the internal billing transfer. Both create the same outcome: engineers start caring about what they consume because someone they work with is paying attention to it.</p>
<pre><code class="language-python"># showback_report.py
# Generates monthly cost-by-team report for distribution to engineering leads

import boto3
from datetime import datetime

def generate_team_showback():
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': datetime.now().replace(day=1).strftime('%Y-%m-%d'),
            'End':   datetime.now().strftime('%Y-%m-%d'),
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG',       'Key': 'Team'},
            {'Type': 'DIMENSION', 'Key': 'SERVICE'},
        ]
    )

    by_team = {}
    for group in response['ResultsByTime'][0].get('Groups', []):
        team    = group['Keys'][0].replace('Team$', '') or 'untagged'
        service = group['Keys'][1]
        cost    = float(group['Metrics']['UnblendedCost']['Amount'])

        if team not in by_team:
            by_team[team] = {'total': 0, 'services': {}}
        by_team[team]['total'] += cost
        by_team[team]['services'][service] = round(cost, 2)

    # Print sorted by total cost descending
    print(f"\n{'='*52}")
    print(f"  Month-to-Date Cloud Spend by Team")
    print(f"  Generated: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"{'='*52}\n")

    for team, data in sorted(by_team.items(), key=lambda x: x[1]['total'], reverse=True):
        print(f"  {team:&lt;20} ${data['total']:&gt;10,.2f}/month")
        top_services = sorted(data['services'].items(), key=lambda x: x[1], reverse=True)[:3]
        for svc, cost in top_services:
            print(f"    └─ {svc:&lt;30} ${cost:&gt;8,.2f}")
    print()

generate_team_showback()
</code></pre>
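<p>Moving from showback to chargeback forces a decision about the <code>untagged</code> bucket, because someone has to pay for it. One common convention, assumed here rather than prescribed, is to spread it across teams in proportion to their tagged spend. The function name and the figures are hypothetical.</p>

```python
def allocate_untagged(team_costs, untagged_key='untagged'):
    """
    Spreads the untagged bucket across teams in proportion to their tagged
    spend. Proportional allocation is one convention, assumed for this sketch.
    """
    tagged = {t: c for t, c in team_costs.items() if t != untagged_key}
    untagged = team_costs.get(untagged_key, 0.0)
    tagged_total = sum(tagged.values())
    if not tagged_total:
        return dict(team_costs)
    return {
        team: round(cost + untagged * cost / tagged_total, 2)
        for team, cost in tagged.items()
    }


# Hypothetical month: 2,000 USD could not be attributed to any team
month = {'platform': 6000.0, 'backend': 3000.0, 'data': 1000.0, 'untagged': 2000.0}
print(allocate_untagged(month))
```

<p>Charging teams for spend they can't trace tends to motivate tagging quickly, which is a useful side effect of this model.</p>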
<h2 id="heading-essential-tools-and-certifications">Essential Tools and Certifications</h2>
<p>The tools that matter at each stage of this roadmap:</p>
<table>
<thead>
<tr>
<th>Stage</th>
<th>Tool</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td>AWS Cost Explorer</td>
<td>Free, built-in, the starting point for all cost analysis</td>
</tr>
<tr>
<td>1</td>
<td>AWS CLI <code>ce</code> commands</td>
<td>Scriptable cost queries — dashboards can't be automated</td>
</tr>
<tr>
<td>2</td>
<td>AWS Compute Optimizer</td>
<td>ML-powered rightsizing recommendations for EC2 and RDS</td>
</tr>
<tr>
<td>2</td>
<td>VPA (Kubernetes)</td>
<td>Pod-level rightsizing recommendations using actual usage</td>
</tr>
<tr>
<td>3</td>
<td>Infracost</td>
<td>PR-level cost estimation for Terraform changes</td>
</tr>
<tr>
<td>3</td>
<td>AWS Budgets</td>
<td>Proactive alerts — catches problems before the monthly invoice</td>
</tr>
<tr>
<td>4</td>
<td>AWS Cost and Usage Report + Athena</td>
<td>SQL-level billing analysis at any granularity</td>
</tr>
<tr>
<td>4</td>
<td>CloudHealth or Vantage</td>
<td>Multi-account, multi-cloud cost management</td>
</tr>
</tbody></table>
<p><strong>The one certification worth your time:</strong> FinOps Certified Practitioner from the FinOps Foundation. It takes 20 hours to prepare and $300 to sit. It signals to hiring managers and clients that you understand the discipline formally — which matters when you're the person leading FinOps conversations at the executive level.</p>
<h2 id="heading-your-90-day-action-plan">Your 90-Day Action Plan</h2>
<h3 id="heading-month-1-foundation">Month 1 — Foundation:</h3>
<p>Enable Cost Explorer if it isn't already on. Run the baseline command from Section 1.1 and save the output. Run the untagged resource query from Section 1.2 and document how many resources are missing tags. Find your top three cost drivers. Present the findings to your engineering manager, not as a problem but as an opportunity with a dollar figure attached.</p>
<h3 id="heading-month-2-quick-wins">Month 2 — Quick Wins:</h3>
<p>Run the rightsizing analyser from Section 2.1 on your EC2 fleet. Downsize the three highest-confidence candidates. Apply S3 lifecycle policies to your two largest buckets. Create VPC endpoints for S3, ECR, and DynamoDB. Estimate the savings from each action and document them against your baseline.</p>
<h3 id="heading-month-3-automation-and-habits">Month 3 — Automation and Habits:</h3>
<p>Deploy the orphan reporter Lambda on a Sunday schedule. Add the cost check GitHub Action to your infrastructure repository. Start a monthly FinOps review meeting — even if it's just you and one other engineer. Build the habit before you need the audience.</p>
<h2 id="heading-best-practices-summary">Best Practices Summary</h2>
<p>✅ <strong>Do:</strong> Establish a cost baseline before any optimisation. The number is meaningless without a comparison point.</p>
<p>✅ <strong>Do:</strong> Right-size before buying Savings Plans. Always. The sequence changes the outcome.</p>
<p>✅ <strong>Do:</strong> Enforce tagging at the infrastructure layer — Terraform or CloudFormation — not as a process reminder.</p>
<p>✅ <strong>Do:</strong> Move staging and development to Spot instances. The interruption rate is manageable, while the 70% cost difference is not.</p>
<p>✅ <strong>Do:</strong> Add VPC endpoints for S3, ECR, and DynamoDB before reviewing data transfer costs. It's a 30-minute fix for a multi-thousand-dollar line item.</p>
<p>✅ <strong>Do:</strong> Present cost findings as cost-per-business-metric, not as total bill. "We reduced cost per transaction from $0.0021 to $0.0013" is a business result. "$38,000/month reduction" is an accounting result.</p>
<p>❌ <strong>Don't:</strong> Buy Savings Plans on an unoptimised baseline. You'll lock in discounted waste.</p>
<p>❌ <strong>Don't:</strong> Build FinOps dashboards before tagging is complete. Beautiful charts with no attribution data answer no questions.</p>
<p>❌ <strong>Don't:</strong> Run orphaned resource cleanup without human review first. Run in report-only mode for two weeks, verify the candidates are genuinely orphaned, then add deletion logic.</p>
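<p>To make the cost-per-business-metric practice concrete, here is a minimal sketch of the arithmetic in Java. The monthly spend and transaction volume are hypothetical figures, chosen only so the results line up with the per-transaction and monthly numbers quoted in the list above. Substitute your own Cost Explorer totals and business metrics.</p>
<pre><code class="language-java">import java.util.Locale;

class UnitCost {
    // Unit cost: total spend divided by the number of business events served.
    static double costPerTransaction(double monthlySpendUsd, long transactions) {
        return monthlySpendUsd / transactions;
    }

    public static void main(String[] args) {
        long transactions = 47_500_000L; // hypothetical monthly volume
        double spendBefore = 99_750.0;   // hypothetical monthly bill before optimisation
        double spendAfter = 61_750.0;    // hypothetical monthly bill after optimisation

        System.out.println(String.format(Locale.ROOT,
                "cost per transaction: %.4f then %.4f; monthly reduction: %.0f",
                costPerTransaction(spendBefore, transactions),
                costPerTransaction(spendAfter, transactions),
                spendBefore - spendAfter));
    }
}
</code></pre>
<p>The same division works for any denominator the business cares about, such as active users or API calls, as long as spend and volume cover the same period.</p>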
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a href="https://www.finops.org/framework/"><strong>FinOps Foundation Framework</strong></a> — The practitioner framework that defines the Inform, Optimise, and Operate cycle this roadmap is built on</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/cost-management/latest/APIReference/API_GetCostAndUsage.html"><strong>AWS Cost Explorer API Reference</strong></a> — Full reference for the cost query commands used throughout this guide</p>
</li>
<li><p><a href="https://aws.amazon.com/compute-optimizer/"><strong>AWS Compute Optimizer</strong></a> — AWS's own rightsizing recommendation service; complements the manual analysis in Stage 2</p>
</li>
<li><p><a href="https://www.infracost.io/docs/"><strong>Infracost Documentation</strong></a> — Setup guide for the PR-level cost estimation tool in Stage 3</p>
</li>
<li><p><a href="https://learn.finops.org/path/finops-certified-practitioner"><strong>FinOps Certified Practitioner Exam</strong></a> — The certification referenced in the tools section</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/savingsplans/latest/userguide/what-is-savings-plans.html"><strong>AWS Savings Plans Documentation</strong></a> — The authoritative reference on commitment types, coverage rules, and purchase strategy</p>
</li>
<li><p><a href="https://github.com/aayostem"><strong>Companion Repository</strong></a> — All scripts from this guide, including the rightsizing analyser, orphan reporter, and showback report generator</p>
</li>
</ul>
<p><a href="https://github.com/aayostem"><em>Ayobami Adejumo</em></a> <em>is a senior platform engineer and FinOps consultant. He has audited AWS infrastructure for 20+ Series A and Series B companies. He is an active FinOps Foundation Supporter.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Spring Boot App with MySQL on Amazon EKS ]]>
                </title>
                <description>
                    <![CDATA[ If you've been looking to deploy your Spring Boot app to the cloud but feel a little overwhelmed by all the moving pieces, don't worry, you're not alone. Kubernetes can seem intimidating at first, but ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-deploy-a-spring-boot-app-with-mysql-on-amazon-eks/</link>
                <guid isPermaLink="false">6a20609578a43e3153ae5422</guid>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ EKS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Springboot ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chisom Uma ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 17:12:53 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/5a7cd6a7-7850-4e3c-9a45-b577c2f91598.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been looking to deploy your Spring Boot app to the cloud but feel a little overwhelmed by all the moving pieces, don't worry, you're not alone.</p>
<p>Kubernetes can seem intimidating at first, but Amazon EKS (Elastic Kubernetes Service) makes it much more approachable, especially when you have a step-by-step guide to follow.</p>
<p>In this tutorial, we'll walk through exactly how to get a Spring Boot application with a MySQL database up and running on Amazon EKS. I'll take you from containerizing your app to connecting it to a managed database, all the way to accessing it live in the cloud. Let’s get started.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-application-overview">Application Overview</a></p>
</li>
<li><p><a href="#heading-what-is-amazon-eks">What is Amazon EKS?</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-a-spring-boot-app-with-mysql-on-amazon-eks">How to Deploy a Spring Boot App with MySQL on Amazon EKS</a></p>
<ul>
<li><p><a href="#heading-step-1-create-the-vpc">Step 1: Create the VPC</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-the-mysql-database-in-a-private-subnet">Step 2: Set Up the MySQL Database in a Private Subnet</a></p>
</li>
<li><p><a href="#heading-step-3-deploy-ec2-instance-in-a-public-subnet">Step 3: Deploy EC2 Instance in a Public Subnet</a></p>
</li>
<li><p><a href="#heading-step-4-create-ssh-tunneling-for-the-database">Step 4: Create SSH Tunneling for the Database</a></p>
</li>
<li><p><a href="#heading-step-5-set-up-a-simple-springboot-application-development">Step 5: Set Up a Simple SpringBoot Application Development</a></p>
</li>
<li><p><a href="#heading-step-6-configure-springboot-app-for-database">Step 6: Configure SpringBoot App for Database</a></p>
</li>
<li><p><a href="#heading-step-7-dockerize-the-springboot-application">Step 7: Dockerize the SpringBoot Application</a></p>
</li>
<li><p><a href="#heading-step-8-push-the-image-to-elastic-container-registry-ecr">Step 8: Push the Image to Elastic Container Registry (ECR)</a></p>
</li>
<li><p><a href="#heading-step-9-implement-aws-app-load-balancer">Step 9: Implement AWS App Load Balancer</a></p>
</li>
<li><p><a href="#heading-step-10-create-a-cluster-in-eks">Step 10: Create a Cluster in EKS</a></p>
</li>
<li><p><a href="#heading-step-11-install-aws-load-balancing">Step 11: Install AWS Load Balancing</a></p>
</li>
<li><p><a href="#heading-step-12-create-and-deploy-kubernetes">Step 12: Create and Deploy Kubernetes</a></p>
</li>
<li><p><a href="#heading-step-13-delete-cluster">Step 13: Delete Cluster</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, ensure you have the following:</p>
<ul>
<li><p>Basic knowledge of AWS (AWS Console access).</p>
</li>
<li><p>Basic knowledge of containerization.</p>
</li>
<li><p>Working knowledge of Kubernetes.</p>
</li>
<li><p>Basic knowledge of databases.</p>
</li>
<li><p><a href="https://helm.sh/docs/intro/install/">Helm</a> installed</p>
</li>
<li><p><a href="https://kubernetes.io/docs/tasks/tools/">kubectl</a> installed</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-eksctl.html">eksctl</a> installed</p>
</li>
<li><p>An IDE</p>
</li>
</ul>
<h2 id="heading-application-overview">Application Overview</h2>
<p>The application runs inside an AWS VPC spread across two availability zones for high availability. When a user makes a request, it flows through an Internet Gateway into an AWS Application Load Balancer sitting in the public subnet, which handles incoming traffic via an Ingress rule.</p>
<p>The Load Balancer routes requests to the App Service, which distributes them across multiple App Pods running inside AWS EKS (Elastic Kubernetes Service) in the private subnets.</p>
<p>The Docker images for these pods are pulled from AWS ECR (Elastic Container Registry). For data persistence, the app pods connect to Amazon RDS MySQL databases through a MySQL External Service, with an RDS instance in each availability zone to ensure redundancy.</p>
<p>A NAT Gateway in the public subnet allows the private resources to make outbound internet calls without being directly exposed to the internet.</p>
<h2 id="heading-what-is-amazon-eks">What is Amazon EKS?</h2>
<p>If you've ever tried to manage containers manually, you already know it can get messy pretty quickly: tracking which containers are running, restarting ones that crash, scaling up when traffic spikes... It's a lot.</p>
<p>That's exactly the problem Kubernetes was built to solve. It automates the deployment, scaling, and management of containerized applications. But setting up and maintaining your own Kubernetes cluster from scratch? That's a whole other challenge.</p>
<p>That's where <a href="https://aws.amazon.com/pm/eks/">Amazon EKS</a> comes in. EKS is a fully managed Kubernetes service provided by AWS, which means AWS handles the heavy lifting of setting up, securing, and maintaining the Kubernetes control plane for you. You just focus on deploying your application.</p>
<h2 id="heading-how-to-deploy-a-spring-boot-app-with-mysql-on-amazon-eks">How to Deploy a Spring Boot App with MySQL on Amazon EKS</h2>
<p>In this section, we’ll cover the steps for deploying your Spring Boot application with MySQL on Amazon EKS.</p>
<h3 id="heading-step-1-create-the-vpc">Step 1: Create the VPC</h3>
<p>To create a VPC, log in to the <a href="https://console.aws.amazon.com/">AWS Management Console</a> and search for “VPC,” then click Create VPC.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/9a1f57fd-7665-469f-a0c2-7d548590c20f.png" alt="vpc interface" style="display:block;margin:0 auto" width="714" height="192" loading="lazy">

<p>Select the "VPC and more" option and give your VPC a name for your project, for example, spring-demo. Set the IPv4 CIDR block to 10.4.0.0/16. For the NAT gateway configuration, select Zonal, then In 1 AZ.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/960002c0-9d53-481d-90be-79a7092088ce.png" alt="NAT gateway config" style="display:block;margin:0 auto" width="465" height="228" loading="lazy">

<p>Select None for VPC endpoints configuration. Next, click Create VPC, then click View VPC. This takes you to the VPC resource map.</p>
<h3 id="heading-step-2-set-up-the-mysql-database-in-a-private-subnet">Step 2: Set Up the MySQL Database in a Private Subnet</h3>
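<p>First, you need to create the security group for the MySQL and EC2 instance deployment. To do that, navigate to EC2 &gt; Security Groups and click Create security group. Name it all-access-sg (you'll select it by this name in Step 3) and choose the VPC you just created. For the inbound rule, select Type: All traffic and Source: Anywhere-IPv4. Then click Create security group. Note that this rule is wide open, which keeps the demo simple but isn't something to carry into production.</p>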
<p>Next, we’ll create the subnet group for the database. To do that, navigate to Aurora and RDS &gt; Subnet groups and click Create DB subnet group. Next, configure the DB subnet to include:</p>
<ul>
<li><p><strong>Name</strong>: private-subnet-db</p>
</li>
<li><p><strong>Description</strong>: private-subnet-db</p>
</li>
<li><p><strong>VPC</strong>: Select VPC</p>
</li>
<li><p><strong>Add subnets</strong>: Choose <code>us-east-1a</code> and <code>us-east-1b</code> as the availability zones, then select the private subnets in those zones</p>
</li>
</ul>
<p>Click Create.</p>
<p>Now, navigate to Databases, click Create database, and select Full configuration. Select MySQL as the engine type.</p>
<p>Select the Free tier when choosing a sample template. Next, give your DB a username and a strong password. Choose <code>db.t3.micro</code> as the instance type.</p>
<p>Select your VPC and associated private subnet. Now, uncheck the "Enable auto minor version upgrade" option in the Additional configuration section and click Create database.</p>
<p>While our database initializes, let's create a key pair for the EC2 instance, which will be launched in a public subnet. To do that, navigate to EC2 &gt; Network &amp; Security &gt; Key Pairs and click Create key pair.</p>
<p>Give your key pair a name, for example, ece-db-key-pair. Leave everything else as-is and click Create key pair. This automatically downloads the key pair file to your local machine.</p>
<h3 id="heading-step-3-deploy-ec2-instance-in-a-public-subnet">Step 3: Deploy EC2 Instance in a Public Subnet</h3>
<p>Now it’s time to create an EC2 instance. To do this, navigate to EC2 &gt; Instances and click Launch instances. Select the key pair you just created in the Key pair section.</p>
<p>In the Network section, select the VPC created earlier for the project and one of its public subnets. For Auto-assign public IP, choose Enable. Then choose the Select existing security group option, select the all-access-sg security group created earlier, and click Launch instance.</p>
<h3 id="heading-step-4-create-ssh-tunneling-for-the-database">Step 4: Create SSH Tunneling for the Database</h3>
<p>For this step, go into your terminal and navigate to the folder where your key pair was downloaded. Run the <code>ls</code> command, and you should see your key pair there.</p>
<p>Next, you need to change the permission of the key pair file. Use the command below:</p>
<pre><code class="language-shell">chmod 0400 ece-db-key-pair.pem
</code></pre>
<p>Now, run the SSH tunneling command below:</p>
<pre><code class="language-shell">ssh -i &lt;YOUR-KEY-PAIR&gt;.pem -f -N -L &lt;LOCAL-PORT&gt;:&lt;YOUR-RDS-ENDPOINT&gt;:&lt;RDS-PORT&gt; &lt;EC2-USERNAME&gt;@&lt;YOUR-EC2-PUBLIC-DNS&gt; -v
</code></pre>
<ul>
<li><p><code>&lt;YOUR-KEY-PAIR&gt;.pem</code>: the name of your downloaded key pair file</p>
</li>
<li><p><code>&lt;LOCAL-PORT&gt;</code>: the port on your laptop (3306 for MySQL, 5432 for PostgreSQL)</p>
</li>
<li><p><code>&lt;YOUR-RDS-ENDPOINT&gt;</code>: found in AWS Console &gt; RDS &gt; Your database &gt; Connectivity &amp; Security &gt; Endpoint</p>
</li>
<li><p><code>&lt;RDS-PORT&gt;</code>: the port your RDS instance listens on, usually the same as the local port (3306 for MySQL, 5432 for PostgreSQL)</p>
</li>
<li><p><code>&lt;EC2-USERNAME&gt;</code>: usually ec2-user for Amazon Linux, ubuntu for Ubuntu</p>
</li>
<li><p><code>&lt;YOUR-EC2-PUBLIC-DNS&gt;</code>: found in AWS Console &gt; EC2 &gt; Your instance &gt; Public IPv4 DNS</p>
</li>
</ul>
<p>This command lets your laptop or local machine talk directly to your remote database, as if the database were sitting on your own computer.</p>
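<p>Before opening a database client, you can confirm that the tunnel is actually listening. The sketch below is a plain-Java check and not part of the tutorial's project. It assumes you chose 3306 as the local port, and it only verifies that something accepts TCP connections there, not that your MySQL credentials are valid.</p>
<pre><code class="language-java">import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class TunnelCheck {
    // Returns true when a TCP connection to host:port succeeds within the timeout.
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: the tunnel is probably not up.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: 3306 is the local port passed to ssh -L.
        System.out.println("tunnel open: " + isPortOpen("127.0.0.1", 3306, 2000));
    }
}
</code></pre>
<p>If this prints <code>tunnel open: false</code>, re-run the SSH command and read the verbose output for the reason.</p>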
<p>After running this command, you can open a database tool (like MySQL Workbench, DBeaver, or TablePlus) on your laptop and connect to:</p>
<ul>
<li><p>Host: localhost</p>
</li>
<li><p>Port: 3306</p>
</li>
</ul>
<p>For this tutorial, I’ll be using the community version of DBeaver. You can use other similar tools, but if you prefer to use the same tool for the purpose of this guide, you can install the community version from the official <a href="https://dbeaver.io/download/">DBeaver download page</a>.</p>
<p>After download and installation, open the DBeaver client and click the Connect to a database icon in the top-left corner of the app.</p>
<p>Select MySQL and click Next. On the next window, enter your database username and password, and set Server Host to 127.0.0.1.</p>
<p>Click Test Connection.</p>
<p>You should see a window appear on your screen, indicating that the connection is successful.</p>
<p>Click OK and Finish.</p>
<p>Now, on the left panel, you should see your connection. Expand it to see the database structure.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/5c8115eb-020a-4b8d-9c84-9a1b4c10a071.png" alt="database structure" style="display:block;margin:0 auto" width="730" height="212" loading="lazy">

<p>Now, you have successfully created SSH tunneling for your database.</p>
<h4 id="heading-troubleshooting">Troubleshooting</h4>
<p>While attempting to test the database connection, I initially ran into a “Plugin 'mysql_native_password' is not loaded” error. If you encounter this error, follow the steps below to fix it.</p>
<ol>
<li><p>On the Connection Settings window, navigate to the Driver properties tab.</p>
</li>
<li><p>Look for allowPublicKeyRetrieval and set it to FALSE.</p>
</li>
<li><p>Navigate back to the Main tab and click Test Connection.</p>
</li>
</ol>
<p>Everything should work fine now.</p>
<h3 id="heading-step-5-set-up-a-simple-springboot-application-development">Step 5: Set Up a Simple SpringBoot Application Development</h3>
<p>To get started, head over to the <a href="https://start.spring.io/">Spring Initializr website</a>. Rename Artifact to “springboot-mysql-eks”. Then click ADD DEPENDENCIES… to add dependencies for the REST APIs. Search for the following dependencies:</p>
<ul>
<li><p><strong>Spring Web:</strong> Build web apps, including RESTful applications using Spring MVC. Uses Apache Tomcat as the default embedded container.</p>
</li>
<li><p><strong>Spring Data JPA:</strong> Persist data in SQL stores with the Java Persistence API using Spring Data and Hibernate.</p>
</li>
<li><p><strong>MySQL Driver:</strong> The JDBC driver that lets the app connect to MySQL.</p>
</li>
<li><p><strong>Lombok:</strong> A Java annotation library that helps to reduce boilerplate code.</p>
</li>
</ul>
<p>Next, click GENERATE at the bottom center of the page. This action downloads a zip file to your local machine. Unzip it and open the project in an IDE, such as VS Code or IntelliJ IDEA. For this tutorial, I use VS Code. In the build.gradle file, you can see all the added dependencies:</p>
<pre><code class="language-groovy">dependencies {
   implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
   implementation 'org.springframework.boot:spring-boot-starter-webmvc'
   compileOnly 'org.projectlombok:lombok'
   runtimeOnly 'com.mysql:mysql-connector-j'
   annotationProcessor 'org.projectlombok:lombok'
   testImplementation 'org.springframework.boot:spring-boot-starter-data-jpa-test'
   testImplementation 'org.springframework.boot:spring-boot-starter-webmvc-test'
   testCompileOnly 'org.projectlombok:lombok'
   testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
   testAnnotationProcessor 'org.projectlombok:lombok'
}
</code></pre>
<h4 id="heading-what-were-building">What we're building</h4>
<p>The Spring Boot app is a currency exchange rate and conversion app:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/eef403da-eb1d-47e8-8edd-d0e4d845a3d1.png" alt="image of counter " style="display:block;margin:0 auto" width="750" height="434" loading="lazy">

<p>We'll be inserting the exchange data into the database table.</p>
<p>To continue with this tutorial, you can clone the project repo <a href="https://github.com/ChisomUma/sprint-boot-msql-eks">here</a> to save time.</p>
<p>In main &gt; java &gt; com.. &gt; model &gt; ExchangeRate, you’ll see the code below:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.model;

import jakarta.persistence.*;
import lombok.Getter;
import lombok.Setter;

import java.sql.Date;

@Getter
@Setter
@Entity
@Table(name = "exchange_rate")
public class ExchangeRate {
   @Id
   @GeneratedValue(strategy=GenerationType.AUTO)
   private Integer transactionId;
   private String sourceCurrency;
   private String targetCurrency;
   private double amount;
   private Date lastUpdated;
}
</code></pre>
<p>This class is essentially a blueprint for storing currency exchange rate data in our database. It uses the libraries and dependencies added earlier. Lombok handles all the repetitive getter/setter boilerplate so you don't have to write it yourself, while JPA annotations like <code>@Entity</code> and <code>@Table</code> tell Spring, "hey, this class maps to a database table called <code>exchange_rate</code>." The name uses an underscore rather than a hyphen because a hyphenated table name would have to be quoted in every MySQL query.</p>
<p>Inside the class, there are five fields that become database columns:</p>
<ul>
<li><p><code>transactionId</code>: an auto-generated primary key.</p>
</li>
<li><p><code>sourceCurrency</code> and <code>targetCurrency</code>: the currencies being converted.</p>
</li>
<li><p><code>amount</code>: the actual exchange rate.</p>
</li>
<li><p><code>lastUpdated</code>: a date, so you always know how fresh your data is.</p>
</li>
</ul>
<p>To store the data, create a repository file in main &gt; java &gt; com.. &gt; repository &gt; ExchangeRateRepository:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.repository;

import com.example.springbootmysqleks.model.ExchangeRate;
import org.springframework.data.jpa.repository.JpaRepository;

public interface ExchangeRateRepository extends JpaRepository&lt;ExchangeRate, Integer&gt; {
   ExchangeRate findBySourceCurrencyAndTargetCurrency(String sourceCurrency, String targetCurrency);
}
</code></pre>
<p>This file acts as the middleman between your code and the database. By simply extending JpaRepository, you instantly get a whole suite of built-in database operations (like save, delete, findAll, and so on) completely for free, without writing a single SQL query.</p>
<p>The interface is typed to work with the <code>ExchangeRate</code> model we just looked at, using Integer as the primary key type.</p>
<p>The one custom method, <code>findBySourceCurrencyAndTargetCurrency</code>, is where Spring's magic really shines. Just by following a naming convention, Spring automatically figures out the SQL query it needs to run, so you can look up an exchange rate by simply passing in two currency codes like "USD" and "EUR" without writing any query logic yourself.</p>
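<p>To make the naming convention concrete: for this method name, Spring Data JPA derives a query equivalent to <code>select e from ExchangeRate e where e.sourceCurrency = ?1 and e.targetCurrency = ?2</code>. The plain-Java sketch below imitates that lookup over an in-memory array so it runs without Spring or a database. The <code>Rate</code> record and the sample rows are hypothetical stand-ins for the entity and the table.</p>
<pre><code class="language-java">class DerivedQuerySketch {
    // Hypothetical stand-in for the ExchangeRate entity.
    record Rate(String sourceCurrency, String targetCurrency, double amount) {}

    // Both fields must match, as in the derived "And" query.
    // Returns null when nothing matches, which is also what a finder with a
    // plain entity return type yields. This sketch does not model duplicate rows.
    static Rate findBySourceCurrencyAndTargetCurrency(Rate[] rows, String source, String target) {
        for (Rate row : rows) {
            if (row.sourceCurrency().equals(source)) {
                if (row.targetCurrency().equals(target)) {
                    return row;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Rate[] rows = {
            new Rate("USD", "EUR", 0.93),
            new Rate("USD", "GBP", 0.79)
        };
        System.out.println(findBySourceCurrencyAndTargetCurrency(rows, "USD", "EUR"));
        System.out.println(findBySourceCurrencyAndTargetCurrency(rows, "EUR", "JPY"));
    }
}
</code></pre>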
<p>To use the <code>findBySourceCurrencyAndTargetCurrency</code> method, create a service file in main &gt; java &gt; com.. &gt; service &gt; ExchangeRateService with the code below:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.service;

import com.example.springbootmysqleks.model.ExchangeRate;
import com.example.springbootmysqleks.repository.ExchangeRateRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class ExchangeRateService {

   @Autowired
   private ExchangeRateRepository exchangeRateRepository;

   public ExchangeRate addExchangeRate(ExchangeRate exchangeRate) {
       return exchangeRateRepository.save(exchangeRate);
   }

   public double getAmount(String sourceCurrency, String targetCurrency) {
       ExchangeRate exchangeRate =  exchangeRateRepository.findBySourceCurrencyAndTargetCurrency(sourceCurrency, targetCurrency);
       return exchangeRate == null ? 0 : exchangeRate.getAmount();
   }
}
</code></pre>
<p>Here, we created a <code>@Service</code> class that interacts with the repository.</p>
<p>The class has two methods: <code>addExchangeRate</code>, which takes an <code>ExchangeRate</code> object and saves it to the database, and <code>getAmount</code>, which takes a source and target currency, uses our custom repository method to look up the matching record, and returns either the exchange rate amount or a safe default of 0 if no record is found.</p>
<p>That little ternary check (<code>exchangeRate == null ? 0 : exchangeRate.getAmount()</code>) ensures the app doesn't crash if you query a currency pair that doesn't exist in the database yet.</p>
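<p>One trade-off of the 0 default is that callers can't tell a missing currency pair from a stored rate of 0. As a hedged variation, which is not what the tutorial's service does, the lookup could return an <code>OptionalDouble</code> so the missing case stays explicit. The <code>lookup</code> helper below is a hypothetical stand-in for the repository call.</p>
<pre><code class="language-java">import java.util.OptionalDouble;

class RateLookupSketch {
    // Hypothetical stand-in for the repository: returns null when no row matches.
    static Double lookup(String source, String target) {
        if ("USD/EUR".equals(source + "/" + target)) {
            return 0.93;
        }
        return null;
    }

    // Same behaviour as the service's ternary: a missing pair becomes 0.
    static double getAmountOrZero(String source, String target) {
        Double amount = lookup(source, target);
        return amount == null ? 0 : amount;
    }

    // Variation: keep "not found" distinguishable from a real rate.
    static OptionalDouble findAmount(String source, String target) {
        Double amount = lookup(source, target);
        return amount == null ? OptionalDouble.empty() : OptionalDouble.of(amount);
    }

    public static void main(String[] args) {
        System.out.println(getAmountOrZero("EUR", "JPY"));
        System.out.println(findAmount("EUR", "JPY").isPresent());
        System.out.println(findAmount("USD", "EUR").getAsDouble());
    }
}
</code></pre>
<p>Either approach avoids a <code>NullPointerException</code>. The difference is only in how much the caller learns about a missing pair.</p>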
<p>In main &gt; java &gt; com.. &gt; controller &gt; ExchangeRateController, we have the following code:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.controller;

import com.example.springbootmysqleks.model.ExchangeRate;
import com.example.springbootmysqleks.service.ExchangeRateService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

@RestController
public class ExchangeRateController {

   @Autowired
   ExchangeRateService exchangeRateService;

   @GetMapping("/getAmount")
   public double getAmount(@RequestParam String sourceCurrency, @RequestParam String targetCurrency) {
       return exchangeRateService.getAmount(sourceCurrency, targetCurrency);
   }

   @PostMapping("/addExchangeRate")
   public ExchangeRate addExchangeRate(@RequestBody ExchangeRate exchangeRate) {
       return exchangeRateService.addExchangeRate(exchangeRate);
   }

   @GetMapping("/")
   public String getHealth() {
       return "up";
   }

}
</code></pre>
<p>The <code>@RestController</code> annotation tells Spring this class will be serving up REST API endpoints, and again <code>@Autowired</code> wires in the service layer automatically.</p>
<p>There are three endpoints:</p>
<ol>
<li><p>a GET request to <code>/getAmount</code> that accepts <code>sourceCurrency</code> and <code>targetCurrency</code> as query parameters and returns the exchange rate amount</p>
</li>
<li><p>a POST request to <code>/addExchangeRate</code> that accepts a full <code>ExchangeRate</code> object as a JSON body and saves it to the database</p>
</li>
<li><p>and finally a simple health check endpoint at <code>/</code> that just returns "up", which is a common pattern in cloud deployments to let load balancers and orchestration tools know the app is alive and running.</p>
</li>
</ol>
<h3 id="heading-step-6-configure-springboot-app-for-database">Step 6: Configure SpringBoot App for Database</h3>
<p>Now, it’s time to configure the application for the database. Navigate to src &gt; main &gt; resources &gt; application.properties, and you should see this:</p>
<pre><code class="language-properties">spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://${MYSQL_HOSTNAME}:${MYSQL_PORT}/${MYSQL_DATABASE}?createDatabaseIfNotExist=true
spring.datasource.username=${MYSQL_USERNAME}
spring.datasource.password=${MYSQL_PASSWORD}

spring.jpa.hibernate.ddl-auto=update

spring.jpa.show-sql=true
</code></pre>
<p>These are the configurations that allow your app to connect with the database.</p>
<ul>
<li><p><code>spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver</code>: The driver class for the MySQL database.</p>
</li>
<li><p><code>spring.datasource.url=jdbc:mysql://${MYSQL_HOSTNAME}:${MYSQL_PORT}/${MYSQL_DATABASE}?createDatabaseIfNotExist=true</code>: The data source URL, built from the MySQL hostname (127.0.0.1 while you're using the SSH tunnel), port number, and database name.</p>
</li>
<li><p><code>spring.datasource.username=${MYSQL_USERNAME}</code>: your database user name.</p>
</li>
<li><p><code>spring.datasource.password=${MYSQL_PASSWORD}</code>: your database password.</p>
</li>
</ul>
<p>One thing to note: how you supply these environment variables depends on your IDE. In IntelliJ IDEA, you can enter them in the Environment variables field of the run configuration. In VS Code, you define them in a launch configuration.</p>
<p>To set the environment variables in VS Code, create a <code>.vscode/launch.json</code> file in your project root folder and paste in the following configuration:</p>
<pre><code class="language-json">{
 "version": "0.2.0",
 "configurations": [
   {
     "type": "java",
     "name": "Spring Boot App",
     "request": "launch",
     "mainClass": "com.example.springbootmysqleks.SpringbootMysqlEksApplication",
     "projectName": "springboot-mysql-eks",
     "env": {
       "MYSQL_HOSTNAME": "localhost",
       "MYSQL_PORT": "3306",
       "MYSQL_DATABASE": "exchangedb",
       "MYSQL_USERNAME": "root",
       "MYSQL_PASSWORD": "CHANGE_ME"
     }
   }
 ]
}
</code></pre>
<p>Replace the placeholder values with your actual database credentials.</p>
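<p>To see what the data source URL looks like once the environment variables are substituted, the sketch below builds the same string by hand. It only illustrates the substitution and is not how Spring resolves placeholders internally. The fallback values are an assumption of this sketch and mirror the launch.json above; Spring itself applies no defaults unless you write them into the placeholder.</p>
<pre><code class="language-java">class JdbcUrlSketch {
    // Builds the same URL shape as spring.datasource.url in application.properties.
    static String jdbcUrl(String host, String port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?createDatabaseIfNotExist=true";
    }

    // Reads an environment variable, falling back to a default when it is unset.
    static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return value == null ? fallback : value;
    }

    public static void main(String[] args) {
        String url = jdbcUrl(
                envOrDefault("MYSQL_HOSTNAME", "localhost"),
                envOrDefault("MYSQL_PORT", "3306"),
                envOrDefault("MYSQL_DATABASE", "exchangedb"));
        System.out.println(url);
    }
}
</code></pre>
<p>If the app fails to start with a connection error, comparing the printed URL against your tunnel's local port is a quick first check.</p>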
<p>Now, when you run the app, you should be able to see the newly created <code>exchangedb</code> database in DBeaver:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/9ea1b85c-25a8-4587-a013-4d592d1664eb.png" alt="exchange db image" style="display:block;margin:0 auto" width="724" height="224" loading="lazy">

<p>Use an API testing tool like Postman to send a POST request to the <code>/addExchangeRate</code> endpoint:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/5fa9e950-17ff-4fd6-a650-b6a33b607744.png" alt="postman request image" style="display:block;margin:0 auto" width="446" height="97" loading="lazy">

<p>Next, run the <code>select * from exchange_rate er</code> script in the <code>exchangedb</code> SQL script editor:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/a7a2ffea-f930-4f27-b29a-f5aaf8f8543e.png" alt="sql editor image" style="display:block;margin:0 auto" width="2048" height="1051" loading="lazy">

<p>At the bottom of the editor, you should see the row created by the Postman request.</p>
<p>Now, run a GET request to the endpoint below:</p>
<pre><code class="language-json">http://localhost:8080/getAmount?sourceCurrency=USD&amp;targetCurrency=EUR
</code></pre>
<p>You should get a 200 OK response with the currency exchange value, for example, 0.93.</p>
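<p>If you'd rather script these calls than click through Postman, the same two requests can be sent with Java's built-in <code>java.net.http.HttpClient</code> (Java 11+). This is a sketch under assumptions: the app is running on localhost:8080, the class name is hypothetical, and the JSON body sets only three of the entity's fields, leaving <code>lastUpdated</code> unset.</p>
<pre><code class="language-java">import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ExchangeRateClientSketch {
    // Assumption: the app from this tutorial is running locally on port 8080.
    static final String BASE_URL = "http://localhost:8080";

    // POST /addExchangeRate with a JSON body matching the ExchangeRate fields.
    static HttpRequest addRateRequest(String json) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/addExchangeRate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // GET /getAmount with the two query parameters the controller expects.
    static HttpRequest getAmountRequest(String source, String target) {
        String query = "?sourceCurrency=" + source + "&amp;targetCurrency=" + target;
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/getAmount" + query))
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String json = "{\"sourceCurrency\":\"USD\",\"targetCurrency\":\"EUR\",\"amount\":0.93}";

        // Store a rate, then read it back.
        var created = client.send(addRateRequest(json), HttpResponse.BodyHandlers.ofString());
        System.out.println(created.statusCode() + " " + created.body());

        var amount = client.send(getAmountRequest("USD", "EUR"), HttpResponse.BodyHandlers.ofString());
        System.out.println(amount.statusCode() + " " + amount.body());
    }
}
</code></pre>
<p>Keeping the request builders separate from <code>main</code> makes it easy to point the same calls at the load balancer URL later by changing <code>BASE_URL</code>.</p>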
<h3 id="heading-step-7-dockerize-the-springboot-application">Step 7: Dockerize the SpringBoot Application</h3>
<p>To Dockerize your application, create a file named Dockerfile and paste in the configuration below:</p>
<pre><code class="language-dockerfile">FROM eclipse-temurin:17-jre-jammy
WORKDIR /app
COPY build/libs/springboot-mysql-eks.jar /app
EXPOSE 8080
CMD ["java", "-jar", "springboot-mysql-eks.jar"]
</code></pre>
<p>Our Dockerfile starts by pulling the lightweight <code>eclipse-temurin:17-jre-jammy</code> base image to keep things lean, then sets /app as the working directory inside the container. It copies our compiled Spring Boot JAR file from the local build/libs/ folder into that directory, exposes port 8080 for incoming traffic, and finally runs the app with <code>java -jar</code> when the container starts up.</p>
<p>Next, build the app to create the <code>.jar</code> file. To do that, run the command below:</p>
<pre><code class="language-shell">./gradlew clean assemble
</code></pre>
<p>You should get a successful build output as shown below:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/bde4395b-bc30-46e4-9687-bd03836f574d.png" alt="bde4395b-bc30-46e4-9687-bd03836f574d" style="display:block;margin:0 auto" width="471" height="93" loading="lazy">

<p>Navigate to the build &gt; libs folder. You’ll see the <code>springboot-mysql-eks.jar</code> file created.</p>
<p>If you run into an “operation couldn’t be completed” error, Gradle most likely can't find a Java runtime. If you’re using a Mac, install a JDK with Homebrew:</p>
<pre><code class="language-shell">brew install openjdk@21
</code></pre>
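<p>Next, run the export commands so your shell uses the new JDK (the path assumes Homebrew’s default prefix on Apple Silicon, <code>/opt/homebrew</code>):</p>
<pre><code class="language-shell">export JAVA_HOME=/opt/homebrew/opt/openjdk@21/libexec/openjdk.jdk/Contents/Home

export PATH=$JAVA_HOME/bin:$PATH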
</code></pre>
<p>Then run the <code>./gradlew clean assemble</code> command again.</p>
<h3 id="heading-step-8-push-the-image-to-elastic-container-registry-ecr">Step 8: Push the Image to Elastic Container Registry (ECR)</h3>
<p>In this next step, we’ll create an Amazon ECR repository and push our image to it.</p>
<p>To get started, head back into your AWS Console and search for “ECR”. On the ECR page, click Create. Then, enter a repository name, for example, “springboot-mysql-eks”, and click Create again to confirm.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/21c56eef-c656-49b6-a624-b1da66cf1096.png" alt="ECR image" style="display:block;margin:0 auto" width="1176" height="248" loading="lazy">

<p>Next, select the repo and click View push commands at the top of the page. This presents a window with a bunch of commands you can use to push your image to the registry. Open your terminal and run these commands. You'll need to ensure Docker is running on your local machine before running the commands.</p>
<p>After running the commands, you should see that your image has been successfully pushed to the registry.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/b39d0b9c-ea99-4cf3-bfb0-3496ac79dea9.png" alt="ECR image" style="display:block;margin:0 auto" width="1118" height="196" loading="lazy">

<h3 id="heading-step-9-implement-aws-app-load-balancer">Step 9: Implement the AWS Application Load Balancer</h3>
<p>Before getting started with this step, make sure you check out the installation steps and link to additional AWS documentation in the project README. This will help you follow along.</p>
<p>Now, to get started, create a new folder in your root directory named <code>cluster</code>. This is where you'll download the AWS IAM policy for the load balancer controller. To download the policy, go into your terminal and <code>cd</code> into <code>cluster</code>, then run the command below:</p>
<pre><code class="language-shell">curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.14.1/docs/install/iam_policy.json
</code></pre>
<p>This command comes from the <a href="https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html">AWS documentation</a>. Now, when you go to the folder, you’ll see the downloaded <code>iam_policy.json</code> file.</p>
<p>Next, create the IAM policy from that file using the command below:</p>
<pre><code class="language-shell">aws iam create-policy \
    --policy-name AWSLoadBalancerControllerIAMPolicy \
    --policy-document file://iam_policy.json
</code></pre>
<p>The command prints a JSON description of the new policy in your terminal, including its ARN.</p>

<p>This shows that the IAM policy has been successfully created. To confirm this, head over to the IAM section in your console, navigate to Policies, and search for “AWSLoad…”. You should see the policy created there.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/c022959a-c732-4dfd-bd58-4b80d3011b71.png" alt="load balancer policy image" style="display:block;margin:0 auto" width="577" height="229" loading="lazy">

<p>The next step is creating the Kubernetes service account. But before that, you need to tag your public and private subnets as described in this <a href="https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html">documentation</a>.</p>
<p>Now, head over to the VPC dashboard, navigate to Subnets, click into a subnet, and navigate to Tags. Then, click Manage tags.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/615f3ec2-d266-413e-8936-d0767d03316d.png" alt="tag image" style="display:block;margin:0 auto" width="1180" height="207" loading="lazy">

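<p>Click Add new tag, then enter the key-value pair from the documentation: <code>kubernetes.io/role/elb</code> with the value <code>1</code> for public subnets, and <code>kubernetes.io/role/internal-elb</code> with the value <code>1</code> for private subnets.</p>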
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/253abe00-cb7e-4276-81de-79ba0ffc249a.png" alt="tag image" style="display:block;margin:0 auto" width="1162" height="167" loading="lazy">
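<p>If you have several subnets, tagging them by hand gets tedious. As a rough sketch, the same tags can be applied with boto3's <code>create_tags</code> call. The helper names and the subnet IDs in the comments are placeholders, not values from this project, and the tag keys are the ones the linked AWS documentation lists for subnet discovery: <code>kubernetes.io/role/elb</code> for public subnets and <code>kubernetes.io/role/internal-elb</code> for private ones.</p>
<pre><code class="language-python">def subnet_role_tag(is_public):
    """Return the subnet-discovery tag for a public or private subnet."""
    key = "kubernetes.io/role/elb" if is_public else "kubernetes.io/role/internal-elb"
    return {"Key": key, "Value": "1"}


def tag_subnets(subnet_ids, is_public, region="us-east-1"):
    """Apply the role tag to every subnet in subnet_ids."""
    import boto3  # imported here so subnet_role_tag works without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_tags(Resources=list(subnet_ids), Tags=[subnet_role_tag(is_public)])


# Example with placeholder IDs (replace them with your own subnet IDs):
# tag_subnets(["subnet-aaaa1111", "subnet-bbbb2222"], is_public=True)
# tag_subnets(["subnet-cccc3333", "subnet-dddd4444"], is_public=False)
</code></pre>
<p>Running this requires AWS credentials that are allowed to call <code>ec2:CreateTags</code> on the subnets.</p>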

<h3 id="heading-step-10-create-a-cluster-in-eks">Step 10: Create a Cluster in EKS</h3>
<p>To create a Kubernetes cluster on EKS, you need the eksctl CLI. Follow the instructions in the <a href="https://docs.aws.amazon.com/eks/latest/eksctl/installation.html">AWS eksctl documentation</a> to install the CLI. Next, you need a <a href="https://docs.aws.amazon.com/eks/latest/eksctl/schema.html">config file schema</a> to create the cluster. To use this schema, create a new file called cluster.yaml in the cluster folder.</p>
<p>Next, paste in the following configurations:</p>
<pre><code class="language-yaml">apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: spring-test-cluster
  region: us-east-1
  version: "1.30"

vpc:
  id: "&lt;your-vpc-id&gt;"
  subnets:
    private:
      us-east-1a:
        id: "&lt;your-private-subnet-1a-id&gt;" # spring-demo-subnet-private1-us-east-1a
      us-east-1b:
        id: "&lt;your-private-subnet-1b-id&gt;" # spring-demo-subnet-private2-us-east-1b
    public:
      us-east-1a:
        id: "&lt;your-public-subnet-1a-id&gt;" # spring-demo-subnet-public1-us-east-1a
      us-east-1b:
        id: "&lt;your-public-subnet-1b-id&gt;" # spring-demo-subnet-public2-us-east-1b

nodeGroups:
  - name: ng-1
    labels: { role: backend }
    instanceType: t2.micro
    desiredCapacity: 3
    minSize: 3
    maxSize: 5
    privateNetworking: true
    ssh:
      allow: true
      publicKeyName: &lt;your-ec2-key-name&gt;
    iam:
      withAddonPolicies:
        imageBuilder: true
        awsLoadBalancerController: true
        autoScaler: true
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: aws-load-balancer-controller
        namespace: kube-system
      attachPolicyARNs:
        - arn:aws:iam::&lt;YOUR_AWS_ACCOUNT_ID&gt;:policy/AWSLoadBalancerControllerIAMPolicy
</code></pre>
<p>The <code>ClusterConfig</code> file is used by eksctl to create our EKS cluster called <code>spring-test-cluster</code> in the <code>us-east-1</code> region, running Kubernetes version 1.30. It plugs into our existing VPC, placing the worker nodes across private subnets in two availability zones (<code>us-east-1a</code> and <code>us-east-1b</code>) for high availability, while keeping public subnets available for the load balancer.</p>
<p>The node group spins up t2.micro EC2 instances with a desired count of 3 (scaling up to 5 if needed), all with private networking enabled for security. It also sets up the necessary IAM permissions for the AWS Load Balancer Controller, the Cluster Autoscaler, and ECR image access so our cluster has everything it needs to manage traffic and pull our Docker images automatically.</p>
<p>Now, after replacing the placeholders with your VPC ID, subnet IDs, EC2 key pair name, and AWS account ID, run the command below:</p>
<pre><code class="language-shell">eksctl create cluster -f cluster.yaml
</code></pre>
<p>This creates the cluster. You should see an output like this on your terminal:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/758c6b7d-9cf3-4fa6-a0d5-2276adc82147.png" alt="cluster creation image" style="display:block;margin:0 auto" width="1466" height="514" loading="lazy">

<p>Now, in your AWS console, navigate to CloudFormation, and you’ll see your cluster creation process in progress.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/4aa96465-45c1-4e43-b8ca-d44a8802e02d.png" alt="stack creation image" style="display:block;margin:0 auto" width="1249" height="211" loading="lazy">

<p>Now, when you go into the EC2 instance page, you should see the three nodes created.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/d946d993-0c64-4870-90fb-1132df51544f.png" alt="running cluster image" style="display:block;margin:0 auto" width="950" height="99" loading="lazy">

<h3 id="heading-step-11-install-aws-load-balancing">Step 11: Install the AWS Load Balancer Controller</h3>
<p>The next step is installing the AWS Load Balancer Controller, which provisions the load balancer for our application. To get started, run the command below:</p>
<pre><code class="language-shell">kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller/crds?ref=master"
</code></pre>
<p>This installs <a href="https://www.geeksforgeeks.org/devops/custom-resource-definitions-crds/">custom resource definitions (CRDs)</a> for our controller. Next, run the command below to add the Helm chart repo.</p>
<pre><code class="language-shell">helm repo add eks https://aws.github.io/eks-charts
</code></pre>
<p>Update your local repo to ensure you have the most recent charts:</p>
<pre><code class="language-shell">helm repo update eks
</code></pre>
<p>Next, install the Helm chart:</p>
<pre><code class="language-shell">helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=spring-test-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller \
  --version 1.14.0
</code></pre>
<p>Next, verify that the controller is installed:</p>
<pre><code class="language-shell">kubectl get deployment -n kube-system aws-load-balancer-controller
</code></pre>
<p>You should see this on your terminal:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/61d954f3-b09d-4691-9b1c-02134c2d8bf1.png" alt="61d954f3-b09d-4691-9b1c-02134c2d8bf1" style="display:block;margin:0 auto" width="1226" height="118" loading="lazy">

<p>This indicates that your controller is ready.</p>
<h3 id="heading-step-12-create-and-deploy-kubernetes">Step 12: Create and Deploy the Kubernetes Manifests</h3>
<p>To get started, you'll first need to create the Kubernetes manifest files. For that, we’ll use a <a href="https://www.freecodecamp.org/news/what-is-a-helm-chart-tutorial-for-kubernetes-beginners/">Helm chart</a>.</p>
<pre><code class="language-shell">helm create ytchart
</code></pre>
<p>The command above creates a folder named <code>ytchart</code> with the templates for the components. In this folder, you need to make some configurations for your use case. First, navigate to ytchart &gt; templates and delete the <code>serviceaccount.yaml</code> file, since we already created the service account earlier.</p>
<p>Next, go to values.yaml and make the following changes:</p>
<ul>
<li><p>For <code>repository</code>, navigate to the ECR service page on the AWS Console and copy the image URI.</p>
</li>
<li><p>Set <code>tag</code> to <code>latest</code>.</p>
</li>
<li><p>Set the database name:</p>
</li>
</ul>
<pre><code class="language-yaml">mysql:
  databaseName: exchangedb
</code></pre>
<ul>
<li><p>Set <code>serviceAccount.create</code> to <code>false</code>.</p>
</li>
<li><p>Scroll down a bit more and change the service <code>type</code> to <code>NodePort</code> and <code>port</code> to <code>8080</code>.</p>
</li>
</ul>
<p>You also need to store the database username and password using secrets. Navigate to the <code>templates</code> folder and open the file named <code>secrets.yaml</code>. Here, set your database username and password. Then, in <code>deployment.yaml</code>, comment out the liveness and readiness probes.</p>
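<p>One detail that is easy to miss: values placed under a Kubernetes Secret's <code>data</code> field must be base64-encoded, while values under <code>stringData</code> can be written as plain text. Assuming your <code>secrets.yaml</code> uses <code>data</code>, a small sketch like the one below produces the encoded values. The function name is illustrative, and the credentials shown are placeholders.</p>
<pre><code class="language-python">import base64


def encode_secret_value(plain_text):
    """Return the base64 form expected under a Secret's data field."""
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


# Placeholder credentials; use your own RDS username and password.
print(encode_secret_value("admin"))  # prints YWRtaW4=
print(encode_secret_value("change-me"))
</code></pre>
<p>Keep in mind that base64 is an encoding, not encryption. Anyone who can read the Secret can decode it, so avoid committing real values to your repository.</p>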
<p>Next, we’ll create a service to connect to the database. To do that, open the <code>mysql.yaml</code> file and set <code>externalName</code> to your database endpoint, which you can copy from the RDS service page on the AWS console.</p>
<p>Now, in the <code>deployment.yaml</code> file, add the following <code>env</code> block to the container spec:</p>
<pre><code class="language-yaml">          env:
            - name: SPRING_DATASOURCE_URL
              value: jdbc:mysql://spring-mysql:3306/{{ .Values.mysql.databaseName }}?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useUnicode=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true
            - name: SPRING_DATASOURCE_USERNAME
              valueFrom:
                secretKeyRef:
                  name: mysql-username
                  key: username
            - name: SPRING_DATASOURCE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-root-password
                  key: password
</code></pre>
<p>With this in place, the application reads its database credentials from Kubernetes secrets through environment variables instead of hardcoded values.</p>
<p>In the <code>ingress.yaml</code> file, paste in the following configuration:</p>
<pre><code class="language-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "spring-microservice-ingress"
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/load-balancer-name: spring-alb-test
  labels:
    app: spring-microservice
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ include "ytchart.fullname" . }}
                port:
                  number: 8080
</code></pre>
<p>This is your configuration for the ingress service.</p>
<p>Run the command below to render the chart locally and review the generated manifests before deploying:</p>
<pre><code class="language-shell">helm template ytchart/
</code></pre>
<p>Next, run the command below to deploy the chart:</p>
<pre><code class="language-shell">helm install mychart ytchart
</code></pre>
<p>You should see an output like this on your terminal:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/4178105c-f469-4bd6-94f0-58046d12c080.png" alt="helm chart image" style="display:block;margin:0 auto" width="970" height="398" loading="lazy">

<p>Now, when you run <code>kubectl get all</code>, you should see this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/23e91fff-6368-42a1-adda-1af45616e9ef.png" alt="deployment image" style="display:block;margin:0 auto" width="576" height="127" loading="lazy">

<p>Now, navigate to EC2 &gt; Load balancers, copy the DNS name, and enter it into a browser. You should see the “up” text. This indicates that your application is working properly.</p>
<p>Now, call the API using the load balancer's DNS name, like this:</p>
<pre><code class="language-shell">http://spring-alb-test-260424558.us-east-1.elb.amazonaws.com/addExchangeRate
</code></pre>
<p>You should get a 200 OK response. Congratulations, you have successfully deployed a Spring Boot app on Kubernetes!</p>
<h3 id="heading-step-13-delete-cluster">Step 13: Delete Cluster</h3>
<p>If you’re familiar with AWS and the cloud, you should already be aware of how costly it can be to leave resources running for extended periods, especially when you’re not using them actively.</p>
<p>Now that we've come to the end of this tutorial, it’s time to delete the resources.</p>
<p>These are the resources to delete:</p>
<ul>
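<li><p>The Helm release, using <code>helm uninstall mychart</code>. Do this first so the controller can remove the load balancer it created.</p>
</li>
<li><p>The RDS database.</p>
</li>
<li><p>The EKS cluster, using the command <code>eksctl delete cluster -f cluster.yaml</code>.</p>
</li>
<li><p>The NAT gateway, from the VPC dashboard.</p>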
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Deploying a Spring Boot application with MySQL on Amazon EKS involves a lot of moving parts, but each step builds logically on the last.</p>
<p>In this tutorial, you've gone from setting up a VPC and provisioning a managed database to containerizing your app, pushing it to ECR, and finally orchestrating everything with Kubernetes and an Application Load Balancer.</p>
<p>What you get is a setup with the building blocks of a production deployment: worker nodes spread across two availability zones, private networking, credentials kept in Kubernetes secrets, and a node group that has room to scale. This is the kind of infrastructure that would take significant manual effort to replicate without managed services like EKS and RDS.</p>
<p>As a next step, consider adding HTTPS support via AWS Certificate Manager, setting up horizontal pod autoscaling, or integrating a CI/CD pipeline to automate future deployments. And remember to clean up your AWS resources when you're done experimenting. Your wallet will thank you.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Common DevOps Mistakes and How to Avoid Them — Tips for Startups ]]>
                </title>
                <description>
                    <![CDATA[ Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what not to do before they got into production. Startup environments make this worse. The p ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-avoid-devops-mistakes/</link>
                <guid isPermaLink="false">6a060c22baf09db7a6253878</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ startup ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tips ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 14 May 2026 17:53:38 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/6fcabd5e-272f-4f1d-b035-8241896e8296.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what <em>not</em> to do before they got into production.</p>
<p>Startup environments make this worse. The pressure to ship fast, the small team sizes, and the absence of senior engineers to review your decisions mean that mistakes happen quietly until they become outages, data loss events, or security incidents that cost the company thousands of dollars and weeks of recovery time.</p>
<p>This article is a direct breakdown of the ten most costly DevOps mistakes engineers make early in their careers at startups. For each mistake, you will get the real-world scenario, the business impact, and the concrete fix you can apply immediately.</p>
<p>Whether you are setting up your first production environment or auditing an existing one, this guide will help you build systems that are reliable, secure, and aligned with what the business actually needs.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-who-this-article-is-for">Who This Article Is For</a></p>
</li>
<li><p><a href="#heading-why-startups-are-a-different-environment">Why Startups Are a Different Environment</a></p>
</li>
<li><p><a href="#heading-mistake-1-deploying-without-understanding-what-youre-deploying">Mistake 1: Deploying Without Understanding What You're Deploying</a></p>
</li>
<li><p><a href="#heading-mistake-2-using-production-as-a-development-environment">Mistake 2: Using Production as a Development Environment</a></p>
</li>
<li><p><a href="#heading-mistake-3-hardcoding-secrets-and-credentials">Mistake 3: Hardcoding Secrets and Credentials</a></p>
</li>
<li><p><a href="#heading-mistake-4-overengineering-for-problems-you-dont-have-yet">Mistake 4: Overengineering for Problems You Don't Have Yet</a></p>
</li>
<li><p><a href="#heading-mistake-5-no-observability-before-launch">Mistake 5: No Observability Before Launch</a></p>
</li>
<li><p><a href="#heading-mistake-6-treating-security-as-a-final-step">Mistake 6: Treating Security as a Final Step</a></p>
</li>
<li><p><a href="#heading-mistake-7-manual-deployments-in-production">Mistake 7: Manual Deployments in Production</a></p>
</li>
<li><p><a href="#heading-mistake-8-no-disaster-recovery-plan">Mistake 8: No Disaster Recovery Plan</a></p>
</li>
<li><p><a href="#heading-mistake-9-no-documentation-or-runbooks">Mistake 9: No Documentation or Runbooks</a></p>
</li>
<li><p><a href="#heading-mistake-10-solving-technical-problems-without-understanding-the-business">Mistake 10: Solving Technical Problems Without Understanding the Business</a></p>
</li>
<li><p><a href="#heading-the-system-thinking-framework-every-devops-engineer-needs">The System Thinking Framework Every DevOps Engineer Needs</a></p>
</li>
<li><p><a href="#heading-your-production-readiness-checklist">Your Production Readiness Checklist</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-who-this-article-is-for">Who This Article Is For</h2>
<ul>
<li><p><strong>Early-career DevOps and cloud engineers</strong> who are building or maintaining production infrastructure at a startup.</p>
</li>
<li><p><strong>Backend developers</strong> who have recently taken on DevOps responsibilities.</p>
</li>
<li><p><strong>Engineers joining a startup</strong> who want to understand what operational discipline actually looks like in a fast-moving environment.</p>
</li>
</ul>
<p>You do not need to be an expert in any specific tool to follow this article. The focus is on decision-making patterns and operational discipline, not tool configuration.</p>
<h2 id="heading-why-startups-are-a-different-environment">Why Startups Are a Different Environment</h2>
<p>Before getting into the mistakes, you have to understand why startups produce them in the first place.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/f9bec1fa-8938-4144-b934-9e5af4edf4ad.svg" alt="diagram showing the startup DevOps reality, a single engineer handling infra, CI/CD, security, monitoring, and deployment pipelines simultaneously" style="display:block;margin:0 auto" width="680" height="506" loading="lazy">

<p>In a large company, you typically have dedicated security engineers, an SRE team, a platform team, and multiple reviewers for every infrastructure change. In a startup, you most likely have one engineer responsible for all of that simultaneously.</p>
<p>This creates four specific pressure points:</p>
<ol>
<li><p><strong>Speed pressure.</strong> The business needs features shipped now. Operational discipline gets treated as optional because nobody is watching closely yet.</p>
</li>
<li><p><strong>Budget constraints.</strong> Every infrastructure decision has a direct impact on company runway. Engineers optimize for the cheapest option rather than the most reliable one.</p>
</li>
<li><p><strong>Absent guardrails.</strong> There is no senior engineer reviewing your Terraform plans. There is no security audit before launch. The absence of immediate consequences can make bad decisions feel like good ones.</p>
</li>
<li><p><strong>Constantly changing requirements.</strong> The architecture you design today may need to support a completely different product in six months.</p>
</li>
</ol>
<p>None of these pressures are excuses for poor decisions. But understanding them helps you see why the following mistakes happen so consistently.</p>
<h2 id="heading-mistake-1-deploying-without-understanding-what-youre-deploying">Mistake 1: Deploying Without Understanding What You're Deploying</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A junior engineer is asked to deploy the company's Node.js API to AWS. They find a tutorial for Elastic Beanstalk, follow it, and it works. Two weeks later, traffic increases. They try to scale "the same way as in the tutorial." The application goes down. They cannot debug it because they never understood what the deployment was actually doing.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>When production breaks and the person who deployed the system cannot explain how it works, diagnosis takes hours instead of minutes. The longer the incident runs, the higher the cost in customer trust, team morale, and potentially direct revenue loss.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Before you deploy anything to production, you should be able to answer these five questions in writing:</p>
<ol>
<li><p><strong>What compute type is running my code?</strong> (EC2, Lambda, Fargate, container?)</p>
</li>
<li><p><strong>How does a new version replace the old one?</strong> (Rolling? Blue/green? All-at-once?)</p>
</li>
<li><p><strong>Where do configuration and secrets come from?</strong> (SSM? Secrets Manager? Environment file?)</p>
</li>
<li><p><strong>What downstream services depend on this?</strong> (Database connections? Other APIs? Cache?)</p>
</li>
<li><p><strong>How do I roll back in under five minutes if this breaks?</strong></p>
</li>
</ol>
<p>If you cannot answer all five, do not deploy until you can. The tutorial that got it running is not the documentation for how it operates.</p>
<blockquote>
<p>"It is better to spend two hours understanding a system before deploying it than two days debugging it after something breaks."</p>
</blockquote>
<p>Personally, when learning a new technology, tool, or implementing something I have not worked with before, I usually focus on three core questions: What, Why, and How.</p>
<ul>
<li><p><strong>The first question is: What is this technology or concept about?</strong><br>This helps me build a solid foundation by doing deep research, studying the official documentation, understanding the core principles, and sometimes even learning the history behind the tool or technology. I believe having a well-grounded understanding before implementation is very important.</p>
</li>
<li><p><strong>The second question is: Why do we need it?</strong><br>I try to understand the value the technology brings, why it should be implemented, what problem it solves, and how it benefits the team or organization. This helps me make informed technical decisions instead of just implementing tools without understanding their purpose.</p>
</li>
<li><p><strong>The third question is: How should it be implemented?</strong><br>There are usually multiple approaches to solving a problem or implementing a technology, so I focus on understanding the best and most practical approach based on the use case and expected outcome.</p>
</li>
</ul>
<p>This structured approach has helped me learn new technologies quickly, adapt fast, and implement solutions effectively in real-world environments.</p>
<h2 id="heading-mistake-2-using-production-as-a-development-environment">Mistake 2: Using Production as a Development Environment</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>To save time, an engineer tests a new deployment script directly in the production AWS account. They accidentally run a command that terminates the production database instance. Automated backups exist but were misconfigured. Six hours of customer data is unrecoverable.</p>
<p>This scenario happens more often than you would expect. The reasoning is always the same: "It will only take a minute."</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>A single test-in-production incident can result in data loss, hours of downtime, and a customer communication crisis. In a startup, that can permanently damage the company's reputation before it has had the chance to build one.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>You need at minimum three separate environments and ideally three separate AWS accounts:</p>
<table>
<thead>
<tr>
<th>Environment</th>
<th>Purpose</th>
<th>Access Level</th>
</tr>
</thead>
<tbody><tr>
<td><strong>dev</strong></td>
<td>Break things freely. No real data.</td>
<td>Engineers have broad access</td>
</tr>
<tr>
<td><strong>staging</strong></td>
<td>Mirror of production. Final verification.</td>
<td>Controlled access</td>
</tr>
<tr>
<td><strong>production</strong></td>
<td>Real customers. Real data.</td>
<td>MFA required. No manual deployments.</td>
</tr>
</tbody></table>
<p>Using separate AWS accounts (not just separate VPCs) gives you account-level isolation. A permission error in the dev account cannot accidentally touch production infrastructure at the API level.</p>
<p>Infrastructure as Code (Terraform or CloudFormation) makes this affordable: you write the configuration once and apply it three times with different variable values.</p>
<pre><code class="language-hcl"># terraform/environments/prod/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "production"
  instance_type     = "t3.medium"
  db_instance_class = "db.t3.medium"
  multi_az          = true
}
</code></pre>
<pre><code class="language-hcl"># terraform/environments/staging/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "staging"
  instance_type     = "t3.small"
  db_instance_class = "db.t3.small"
  multi_az          = false
}
</code></pre>
<p>The module is the same. The environment-specific variables are different. Separate environments are not a luxury; they are the minimum operating standard for any team running real software.</p>
<h2 id="heading-mistake-3-hardcoding-secrets-and-credentials">Mistake 3: Hardcoding Secrets and Credentials</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A new engineer joins a startup and clones the repository. Inside they find a <code>.env</code> file committed to Git containing the production database password, the Stripe secret key, and an AWS access key with admin permissions. The repository has been public for six months.</p>
<p>GitHub's automated secret scanning never triggered because the secrets were inside a <code>.env</code> file rather than raw in the code. The credentials had been valid and actively used for over six months.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Automated scanners run by attackers find exposed credentials within minutes of them being pushed to a public repository. A single exposed AWS access key with admin permissions can result in:</p>
<ul>
<li><p>Crypto-mining workloads generating thousands of dollars in cloud bills overnight</p>
</li>
<li><p>Complete exfiltration of customer data from every S3 bucket</p>
</li>
<li><p>Privilege escalation: the attacker creates new admin users and locks you out of your own account</p>
</li>
<li><p>AWS account suspension while the investigation runs</p>
</li>
</ul>
<p>According to <a href="https://github.blog/security/vulnerability-research/securing-millions-of-developers-together/">GitHub's annual security report</a>, millions of secrets are exposed in public repositories every year. The average time to detect a compromised cloud credential is 197 days.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p><strong>Step 1: Never commit secrets to Git.</strong> Not temporarily. Not in a branch. Not in a private repository.</p>
<p><strong>Step 2: Add</strong> <code>.gitignore</code> <strong>before you create the first file.</strong> Check in the <code>.gitignore</code> with the first line of code before any <code>.env</code> files exist.</p>
<pre><code class="language-gitignore"># .gitignore
.env
.env.*
*.pem
*.key
secrets/
</code></pre>
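<p>Keep in mind that <code>.gitignore</code> only affects untracked files: anything that was already committed stays tracked. As a rough sketch, the script below lists tracked files that match the patterns above, so you can remove them from the index with <code>git rm --cached</code> and rotate whatever they contained. The function names are illustrative, and the matching is a simplified approximation of gitignore rules, not a full implementation.</p>
<pre><code class="language-python">import fnmatch
import subprocess

# File-name patterns mirroring the .gitignore entries above.
RISKY_PATTERNS = [".env", ".env.*", "*.pem", "*.key"]


def find_risky_paths(paths):
    """Return the paths that look like secret files (simplified gitignore matching)."""
    risky = []
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        in_secrets_dir = path.startswith("secrets/") or "/secrets/" in path
        if in_secrets_dir or any(fnmatch.fnmatch(file_name, p) for p in RISKY_PATTERNS):
            risky.append(path)
    return risky


def tracked_files():
    """List every file currently tracked by Git in the working repository."""
    result = subprocess.run(["git", "ls-files"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Example (run from inside a Git repository):
# for path in find_risky_paths(tracked_files()):
#     print("tracked secret candidate:", path)
</code></pre>
<p>Removing a file from the index does not remove it from history, so treat any secret found this way as exposed.</p>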
<p><strong>Step 3: Use AWS Secrets Manager or SSM Parameter Store for all production secrets.</strong> Your application reads secrets at runtime:</p>
<pre><code class="language-python"># Python example — fetch secret at runtime, never at build time
import boto3
import json
 
def get_secret(secret_name: str, region: str = "us-east-1") -&gt; dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
 
# Usage
db_config = get_secret("prod/myapp/database")
DATABASE_URL = db_config["connection_string"]
</code></pre>
<p><strong>Step 4: Scan your existing repositories immediately.</strong> You may already have a problem:</p>
<pre><code class="language-bash"># Install trufflehog to scan for exposed secrets in your repo history
# The subcommands below are TruffleHog v3; the PyPI package is the legacy v2 with a different CLI
brew install trufflehog
 
# Scan the entire commit history of your repository
trufflehog git file://.
 
# Or scan a remote GitHub repo
trufflehog github --repo https://github.com/your-org/your-repo
</code></pre>
<p><strong>Step 5: Add a pre-commit hook to prevent future accidents:</strong></p>
<pre><code class="language-bash">pip install pre-commit
</code></pre>
<pre><code class="language-yaml"># .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/git-secrets
    rev: master  # prefer a tag or commit SHA; pre-commit warns about mutable refs
    hooks:
      - id: git-secrets
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
</code></pre>
<pre><code class="language-bash">pre-commit install
# Now the hook runs before every commit and blocks detected secrets
</code></pre>
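<p>To make concrete what these hooks look for, here is a minimal Python sketch of pattern-based secret detection. The function name <code>find_secrets</code> and the two patterns are illustrative assumptions, not the rule set of any real tool. Scanners such as detect-secrets and TruffleHog ship many more detectors and add checks like entropy analysis.</p>

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a complete rule set).
PATTERNS = {
    # Long-term IAM user access key IDs start with "AKIA" followed by
    # 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
    # PEM-encoded private keys begin with a recognizable header line.
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    sample = "DEBUG=true\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n"
    for line_number, name in find_secrets(sample):
        print(f"line {line_number}: possible {name}")
```

<p>A pre-commit hook runs this kind of check against the staged changes and exits with a non-zero status when it finds a match, which is what blocks the commit.</p>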
<p>You cannot un-publish a leaked database password. Once it is public, the only remedy is to rotate it and audit everything it could reach. The fix takes ten minutes upfront. The incident takes weeks.</p>
<h2 id="heading-mistake-4-overengineering-for-problems-you-dont-have-yet">Mistake 4: Overengineering for Problems You Don't Have Yet</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A five-person startup with 200 users decides to build a microservices architecture on Kubernetes because "Netflix uses it." They spend three months setting up Kubernetes, Istio service mesh, ArgoCD, Vault, Prometheus, and Grafana. Their product has not shipped a new feature in three months. A competitor with a monolith on a single EC2 instance shipped twelve new features in the same period.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Every layer of infrastructure you add is a layer that can break, a layer that requires expertise to operate, and a layer that slows down every future change. Kubernetes is the right answer for organizations with the scale and team size to operate it. For a five-person startup, it is an expensive distraction.</p>
<p>Premature complexity does not just cost engineering time. It costs the competitive advantage that speed provides in the early stage.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Match your infrastructure to your actual stage:</p>
<table>
<thead>
<tr>
<th>Scale</th>
<th>Right Infrastructure</th>
<th>Cost Range</th>
</tr>
</thead>
<tbody><tr>
<td><strong>1–1,000 users</strong></td>
<td>Single EC2 + RDS + Nginx reverse proxy</td>
<td>$20–50/month</td>
</tr>
<tr>
<td><strong>1K–50K users</strong></td>
<td>Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD</td>
<td>$200–500/month</td>
</tr>
<tr>
<td><strong>50K–500K users</strong></td>
<td>ECS Fargate, RDS read replicas, ElastiCache, full observability</td>
<td>$1K–5K/month</td>
</tr>
<tr>
<td><strong>500K+ users</strong></td>
<td>Multi-region, managed Kubernetes, dedicated SRE</td>
<td>$10K+/month</td>
</tr>
</tbody></table>
<p>The question to ask before every infrastructure decision is: <strong>"What specific, measurable problem does this solve today that my current setup cannot solve?"</strong></p>
<p>Amazon, Netflix, and Uber did not start with microservices. They started with monoliths and extracted services only when the monolith became the actual bottleneck. You are not Netflix. You are solving the problems in front of you today.</p>
<p>Use managed services wherever possible: RDS instead of self-hosted Postgres, Fargate instead of self-managed Kubernetes, ElastiCache instead of self-hosted Redis. Managed services let your team focus on the product instead of the infrastructure.</p>
<h2 id="heading-mistake-5-no-observability-before-launch">Mistake 5: No Observability Before Launch</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's checkout flow breaks on a Friday evening. Users are abandoning their carts and the company is losing revenue. The DevOps engineer finds out 45 minutes later because a customer sent a direct message to the CEO on Twitter.</p>
<p>The engineer has no dashboards, no log aggregation, and no alerting. They SSH into the production server and scroll through raw log files. Two hours later, they find the issue: a database connection pool was exhausted by a memory leak introduced in that morning's deployment.</p>
<h3 id="heading-business-impact">The Business Impact</h3>
<p>Without observability:</p>
<ul>
<li><p>You find out about production problems from users, not from your systems</p>
</li>
<li><p>Incidents take 10x longer to resolve because diagnosis is guesswork</p>
</li>
<li><p>You cannot tell whether a deployment improved or degraded performance</p>
</li>
<li><p>You have no data for making better architecture decisions</p>
</li>
</ul>
<h3 id="heading-the-fix">The Fix</h3>
<p>Implement the four golden signals before any service goes to production. These come from <a href="https://sre.google/sre-book/monitoring-distributed-systems/">Google's Site Reliability Engineering book</a>:</p>
<ol>
<li><p><strong>Latency</strong>: How long requests take to complete (p50, p95, p99)</p>
</li>
<li><p><strong>Traffic</strong>: How many requests per second the system is handling</p>
</li>
<li><p><strong>Errors</strong>: The rate of failed requests (5xx responses per minute)</p>
</li>
<li><p><strong>Saturation</strong>: How close the system is to its limits (CPU, memory, connection pool)</p>
</li>
</ol>
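<p>As a rough illustration of how the four signals are derived from raw data, the sketch below summarizes one time window of request records. The record shape (<code>latency_ms</code>, <code>status</code>), the function name <code>golden_signals</code>, and the use of a connection pool as the saturation measure are assumptions made for this example. It expects at least two requests in the window.</p>

```python
from statistics import quantiles


def golden_signals(requests: list[dict], window_seconds: float,
                   pool_in_use: int, pool_size: int) -> dict:
    """Summarize latency, traffic, errors, and saturation for one time window.

    Each request is a dict with "latency_ms" and "status" keys (an assumed shape).
    """
    latencies = sorted(r["latency_ms"] for r in requests)
    # quantiles(n=100) returns the 99 cut points p1..p99; index k-1 is the k-th percentile.
    cuts = quantiles(latencies, n=100, method="inclusive")
    failed = sum(1 for r in requests if r["status"] >= 500)
    return {
        "latency_ms": {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]},
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failed / len(requests),
        "saturation": pool_in_use / pool_size,
    }


if __name__ == "__main__":
    window = [{"latency_ms": float(i), "status": 200} for i in range(1, 101)]
    window[0]["status"] = 500
    print(golden_signals(window, window_seconds=50, pool_in_use=8, pool_size=10))
```

<p>In practice a metrics system computes these for you. The point of the sketch is that each signal is a simple aggregate, so there is little excuse for launching without them.</p>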
<p>Here is a minimal CloudWatch alarm setup using the AWS CLI:</p>
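<pre><code class="language-shell"># Alert when the 5xx error rate is at or above 1% for 5 consecutive minutes.
# The rate is computed with metric math: target 5xx responses / total requests.

aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate-production" \
  --alarm-description "Error rate at or above 1% for 5 minutes" \
  --evaluation-periods 5 \
  --threshold 0.01 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --treat-missing-data "notBreaching" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-production" \
  --metrics '[
    {"Id": "errors", "ReturnData": false, "MetricStat": {"Period": 60, "Stat": "Sum", "Metric": {
      "Namespace": "AWS/ApplicationELB", "MetricName": "HTTPCode_Target_5XX_Count",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}]}}},
    {"Id": "requests", "ReturnData": false, "MetricStat": {"Period": 60, "Stat": "Sum", "Metric": {
      "Namespace": "AWS/ApplicationELB", "MetricName": "RequestCount",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}]}}},
    {"Id": "rate", "Expression": "errors / requests", "Label": "5xx error rate", "ReturnData": true}
  ]'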
</code></pre>
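<p>The interaction between the period, the number of evaluation periods, and the threshold is easy to misread, so here is a simplified Python model of it. The function <code>alarm_state</code> is hypothetical and deliberately ignores CloudWatch features such as "M out of N" datapoints and missing-data handling. It only shows that one bad minute does not page anyone, while five bad minutes in a row do.</p>

```python
def alarm_state(datapoints: list[float], threshold: float, evaluation_periods: int) -> str:
    """Return "ALARM" only if each of the most recent periods breaches the threshold.

    datapoints holds one value per period (for example, the error rate per minute),
    oldest first.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = all(value >= threshold for value in recent)
    return "ALARM" if breaching else "OK"


if __name__ == "__main__":
    # Error rate per minute as a fraction (0.01 means 1%).
    print(alarm_state([0.00, 0.02, 0.03, 0.02, 0.05, 0.04], threshold=0.01, evaluation_periods=5))
    print(alarm_state([0.02, 0.03, 0.00, 0.05, 0.04], threshold=0.01, evaluation_periods=5))
```

<p>Requiring several consecutive breaches trades a few minutes of detection time for far fewer false pages from momentary blips.</p>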
<p>Every application should also expose a <code>/health</code> endpoint that returns <code>200 OK</code> when healthy:</p>
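<pre><code class="language-python"># FastAPI example
import os

from fastapi import FastAPI, Response, status
from sqlalchemy import create_engine, text

app = FastAPI()
# Assumes DATABASE_URL holds the connection string, for example loaded via get_secret()
engine = create_engine(os.environ["DATABASE_URL"])

@app.get("/health")
def health_check(response: Response):
    # Check database connectivity
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        db_status = "healthy"
    except Exception:
        db_status = "unhealthy"
        # Return 503 so load balancers and uptime monitors register the failure
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return {
        "status": "healthy" if db_status == "healthy" else "degraded",
        "database": db_status,
        "version": os.getenv("APP_VERSION", "unknown")
    }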
</code></pre>
<p>Your load balancer checks this endpoint. Your uptime monitor checks it. You check it after every deployment.</p>
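<p>The "check it after every deployment" step is worth scripting so that nobody has to remember it. The sketch below is one possible shape, using only the Python standard library. The names <code>wait_until_healthy</code> and <code>http_probe</code>, the retry counts, and the example URL are assumptions for illustration.</p>

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(probe: Callable[[], int], attempts: int = 10,
                       delay_seconds: float = 3.0) -> bool:
    """Call probe() until it returns HTTP 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection refused, timeout, or an HTTP error status:
            # the service may still be starting, so keep trying.
            pass
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False


def http_probe(url: str) -> Callable[[], int]:
    """Build a probe that performs a GET request and returns the status code."""
    def probe() -> int:
        # urlopen raises HTTPError (an OSError subclass) for 4xx and 5xx responses.
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    return probe


if __name__ == "__main__":
    # Hypothetical URL; replace it with your service's health endpoint.
    ok = wait_until_healthy(http_probe("http://localhost:8000/health"), attempts=3)
    raise SystemExit(0 if ok else 1)
```

<p>Because the script exits non-zero when the service never becomes healthy, it can run as the last step of a deployment pipeline and fail the release.</p>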
<blockquote>
<p>You do not get to say a system is working unless you have data to prove it. "Nobody complained" is not the same as "nothing is broken."</p>
</blockquote>
<h2 id="heading-mistake-6-treating-security-as-a-final-step">Mistake 6: Treating Security as a Final Step</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup rushes to launch their MVP. Security reviews are "planned for after launch." Six months later, a potential enterprise customer requires a security audit before signing a contract. The audit reveals:</p>
<ul>
<li><p>S3 buckets publicly accessible by default</p>
</li>
<li><p>EC2 instances with port 22 open to <code>0.0.0.0/0</code></p>
</li>
<li><p>IAM users with <code>AdministratorAccess</code> for the entire team</p>
</li>
<li><p>No encryption on the database at rest</p>
</li>
<li><p>JWT secrets hardcoded in environment variables</p>
</li>
</ul>
<p>The audit fails. The enterprise deal worth $120,000 annually is lost. Remediation takes four weeks of engineering time.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Security debt is the most expensive technical debt you can accumulate. Unlike performance debt that degrades gradually, security vulnerabilities cause sudden, catastrophic events: data breaches, ransomware, account takeovers, and regulatory fines. At a startup, any one of these can end the company.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Apply these six security controls before the first line of production code ships:</p>
<p><strong>1. Principle of Least Privilege (every IAM role gets only what it needs):</strong></p>
<p>One of the most common security mistakes in AWS is granting roles more permissions than they need, either out of convenience (<code>s3:*</code>) or out of uncertainty about what the service actually requires. This creates unnecessary risk: if a role is compromised, the attacker inherits every permission you granted.</p>
<p>The fix is simple: look at what your service actually does, then write a policy that allows exactly that.</p>
<p>If your app uploads and reads files from a specific S3 bucket, the policy should say exactly that:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}
</code></pre>
<p>Notice that the <code>Resource</code> is scoped to <code>my-app-uploads/*</code>, not all S3 buckets. The <code>Action</code> list covers only <code>GetObject</code> and <code>PutObject</code>: not <code>DeleteObject</code>, not <code>s3:*</code>. If the service gets compromised, the attacker can read and write objects in that one bucket. That is it. The rest of your account is untouched.</p>
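<p>A check like this can be automated in code review. The following Python sketch flags <code>Allow</code> statements that use broad wildcards in a policy document. The function name <code>find_wildcards</code> and the warning wording are assumptions for illustration, and the check is intentionally crude. AWS's own IAM Access Analyzer performs a far more thorough analysis.</p>

```python
import json


def find_wildcards(policy_json: str) -> list[str]:
    """Return a warning for each Allow statement that uses a broad wildcard."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]

    warnings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        # Action and Resource may each be a string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                warnings.append(f"statement {index}: wildcard action {action}")
        if "*" in resources:
            warnings.append(f"statement {index}: applies to all resources")
    return warnings


if __name__ == "__main__":
    too_broad = '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}'
    for warning in find_wildcards(too_broad):
        print(warning)
```

<p>The scoped policy shown above produces no warnings, while the <code>s3:*</code> example produces two.</p>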
<p><strong>2. Block all S3 public access by default:</strong></p>
<p>AWS S3 buckets are private by default when created, but that can be overridden at the bucket level, the object level, or through a bucket policy. Misconfigured S3 buckets are one of the most common causes of data breaches, and they are almost always accidental.</p>
<p>The safest approach is to enable the "Block Public Access" setting, which overrides ACLs and bucket policies and prevents a bucket from being made public even if someone tries. For a single bucket:</p>
<pre><code class="language-bash">aws s3api put-public-access-block \
  --bucket my-app-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<p>Run this for every bucket you create. Better yet, enable it at the AWS account level (<code>aws s3control put-public-access-block</code>) so it applies to every existing and future bucket in the account.</p>
<p><strong>3. Never open SSH to the internet; use AWS Systems Manager Session Manager instead:</strong></p>
<p>Port 22 open to <code>0.0.0.0/0</code> is an attack surface that exists on thousands of AWS instances right now. Brute-force bots scan the internet continuously looking for open SSH ports. Even with a strong key, the exposure is unnecessary because AWS provides a better alternative.</p>
<p>AWS Systems Manager Session Manager gives you full shell access to any EC2 instance without opening a single inbound port on the security group. There is no port to scan, no port to attack, and every session start is recorded in CloudTrail (full session transcripts can also be sent to S3 or CloudWatch Logs):</p>
<pre><code class="language-bash"># Start a session on an EC2 instance without port 22 open
aws ssm start-session --target i-0123456789abcdef0
</code></pre>
<p>To use Session Manager, the EC2 instance needs the SSM Agent installed (included by default on Amazon Linux 2 and Ubuntu 20.04+) and an IAM instance profile with the <code>AmazonSSMManagedInstanceCore</code> policy attached. Once that is set up, you can close port 22 on the security group entirely.</p>
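<p>Before closing port 22, it helps to know which security groups still expose it. The Python sketch below scans a list shaped like the <code>SecurityGroups</code> array returned by <code>aws ec2 describe-security-groups</code>. The function name <code>open_ssh_groups</code> and the sample group IDs are assumptions for illustration.</p>

```python
def open_ssh_groups(security_groups: list[dict]) -> list[str]:
    """Return the IDs of security groups that allow port 22 from anywhere."""
    exposed = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            all_traffic = rule.get("IpProtocol") == "-1"  # "-1" means every protocol and port
            covers_ssh = all_traffic or (
                rule.get("IpProtocol") == "tcp"
                and rule.get("FromPort", 0) <= 22 <= rule.get("ToPort", 65535)
            )
            open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
            open_v6 = any(r.get("CidrIpv6") == "::/0" for r in rule.get("Ipv6Ranges", []))
            if covers_ssh and (open_v4 or open_v6):
                exposed.append(group["GroupId"])
                break  # one matching rule is enough to flag the group
    return exposed


if __name__ == "__main__":
    groups = [
        {"GroupId": "sg-open", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
        {"GroupId": "sg-web", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    ]
    print(open_ssh_groups(groups))
```

<p>Note that the check also covers port ranges that include 22 and "all traffic" rules, which are easy to overlook when searching only for an explicit port 22 entry.</p>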
<p><strong>4. Enable MFA for all IAM users and enforce it via policy:</strong></p>
<p>A leaked IAM username and password with no MFA is a fully compromised account. Multi-factor authentication is the single most effective control against credential theft, and it costs nothing to enable.</p>
<p>Enforce it through an IAM policy that denies all actions when MFA is not present, except the actions needed to set up MFA in the first place. This means even if a set of credentials is stolen, the attacker cannot do anything without the second factor.</p>
<p>The AWS documentation provides the <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_users-self-manage-mfa-and-creds.html">Complete Deny Without MFA Policy</a>. Attach it to every IAM user or group in your account. This is a one-time setup that permanently raises your account's security baseline.</p>
<p><strong>5. Enable CloudTrail in all regions:</strong></p>
<p>Without CloudTrail, you have no record of who did what in your AWS account. If a credential is compromised, you cannot investigate what the attacker accessed. If an engineer accidentally deletes a resource, you cannot trace it. You are operating blind.</p>
<p>CloudTrail logs every AWS API call: who made it, from which IP, at what time, and what the response was. Enable it across all regions so activity in regions you do not actively use is also captured:</p>
<pre><code class="language-bash">aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
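  --enable-log-file-validation

# A trail created from the CLI records nothing until logging is started
aws cloudtrail start-logging --name production-audit-trail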
</code></pre>
<p>The <code>--enable-log-file-validation</code> flag makes CloudTrail deliver signed digest files that let you verify the logs have not been tampered with. This matters if you ever need to use these logs in a security investigation or compliance audit. Once the trail is logging, every <code>AssumeRole</code>, every <code>DeleteBucket</code>, and every <code>RunInstances</code> call in your account is permanently recorded.</p>
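<p>To show what "who made it, from which IP, at what time" looks like in practice, here is a small Python sketch that summarizes CloudTrail records. It relies on the standard record fields <code>eventTime</code>, <code>eventName</code>, <code>awsRegion</code>, <code>sourceIPAddress</code>, and <code>userIdentity</code>. The function names and the sample record are made up for this example.</p>

```python
import json


def summarize_event(record: dict) -> str:
    """Format who did what, from where, and when for one CloudTrail record."""
    identity = record.get("userIdentity", {})
    who = identity.get("arn") or identity.get("type", "unknown")
    # errorCode is only present when the call failed.
    outcome = record.get("errorCode", "success")
    return (
        f'{record["eventTime"]} {who} called {record["eventName"]} '
        f'in {record["awsRegion"]} from {record["sourceIPAddress"]} ({outcome})'
    )


def summarize_log_file(log_json: str) -> list[str]:
    """CloudTrail log files wrap their events in a top-level "Records" array."""
    return [summarize_event(record) for record in json.loads(log_json)["Records"]]


if __name__ == "__main__":
    sample = json.dumps({"Records": [{
        "eventTime": "2025-01-15T03:12:45Z",
        "eventName": "DeleteBucket",
        "awsRegion": "us-east-1",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::123456789012:user/alice"},
    }]})
    for line in summarize_log_file(sample):
        print(line)
```

<p>During an investigation you would normally query these logs with a tool such as Athena or CloudTrail Lake, but the underlying records carry exactly these fields.</p>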
<p><strong>6. Run AWS Security Hub from day one:</strong></p>
<p>Most teams only discover security misconfigurations after a breach or a compliance audit. Security Hub inverts this: it continuously scans your AWS environment against industry-standard frameworks (CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices) and surfaces findings before they become incidents.</p>
<p>Enabling it takes a single command:</p>
<pre><code class="language-bash">aws securityhub enable-security-hub
</code></pre>
<p>Within minutes, Security Hub gives your account a compliance score and a prioritized list of findings. A finding might tell you that a security group has port 22 open to the world, that an S3 bucket has logging disabled, or that root account credentials were recently used. Each finding includes the affected resource and a remediation guide.</p>
<p>Treat every Security Hub finding the same way you treat a production bug: assign it a priority, assign an owner, and close it. A finding sitting unaddressed for 30 days is a known vulnerability you chose to leave open.</p>
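<p>The 30-day rule is simple enough to enforce with a script. The sketch below filters findings that are still open past a maximum age and sorts them by severity. It assumes findings in the AWS Security Finding Format, using the <code>CreatedAt</code>, <code>Severity.Label</code>, <code>Workflow.Status</code>, and <code>Title</code> fields. The function name <code>stale_findings</code> and the sample data in the demo are hypothetical.</p>

```python
from datetime import datetime, timedelta, timezone


def stale_findings(findings: list[dict], now: datetime, max_age_days: int = 30) -> list[str]:
    """Return titles of unresolved findings older than max_age_days, most severe first."""
    severity_order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "INFORMATIONAL": 4}
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for finding in findings:
        if finding["Workflow"]["Status"] in ("RESOLVED", "SUPPRESSED"):
            continue
        # Timestamps end in "Z"; normalize so older Python versions can parse them.
        created = datetime.fromisoformat(finding["CreatedAt"].replace("Z", "+00:00"))
        if created < cutoff:
            stale.append(finding)
    stale.sort(key=lambda f: severity_order.get(f["Severity"]["Label"], 5))
    return [f["Title"] for f in stale]


if __name__ == "__main__":
    findings = [
        {"Title": "Port 22 open to the world", "CreatedAt": "2025-01-01T00:00:00Z",
         "Severity": {"Label": "HIGH"}, "Workflow": {"Status": "NEW"}},
        {"Title": "Root credentials used", "CreatedAt": "2025-01-10T00:00:00Z",
         "Severity": {"Label": "CRITICAL"}, "Workflow": {"Status": "NOTIFIED"}},
    ]
    print(stale_findings(findings, now=datetime(2025, 3, 1, tzinfo=timezone.utc)))
```

<p>Posting this list to the team channel once a week keeps overdue findings visible instead of buried in a console.</p>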
<h2 id="heading-mistake-7-manual-deployments-in-production">Mistake 7: Manual Deployments in Production</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's deployment process is documented in a Notion page that is four months out of date. It involves SSH-ing into the server, running <code>git pull</code>, running <code>npm install</code>, and restarting the PM2 process. Different engineers do it slightly differently. One engineer, rushing a late-night release, skips <code>npm install</code>. The application starts crashing because a new dependency is missing.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Manual deployment processes are inherently unreliable. Humans under pressure skip steps, perform steps in the wrong order, and remember procedures differently. Every manual step in a production deployment process is a scheduled incident waiting for the right moment of stress.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>If a deployment step is performed manually more than twice, it needs to be automated. Here is a minimal but complete GitHub Actions deployment workflow for an ECS Fargate service:</p>
<pre><code class="language-yaml"># .github/workflows/deploy.yml
name: Deploy to Production
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write   # Required for OIDC authentication with AWS
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
 
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
 
      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/my-app:$IMAGE_TAG .
          docker push $ECR_REGISTRY/my-app:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/my-app:$IMAGE_TAG" &gt;&gt; $GITHUB_OUTPUT
 
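      - name: Render task definition with the new image
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-app   # assumed name; must match the container in task-definition.json
          image: ${{ steps.build.outputs.image }}

      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}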
          service: my-app-service
          cluster: production
          wait-for-service-stability: true
</code></pre>
<p>Notice <code>wait-for-service-stability: true</code>. Without this, the workflow reports success the moment ECS accepts the new task definition, before the containers are actually healthy. With it, the workflow fails if the new containers crash. You want to know immediately, not discover it from user reports thirty minutes later.</p>
<h2 id="heading-mistake-8-no-disaster-recovery-plan">Mistake 8: No Disaster Recovery Plan</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's production database runs on a single RDS instance with no Multi-AZ configuration. Automated backups are enabled but have never been tested. The EBS volume backing the instance fails. The team restores a new instance from the last snapshot, which is 18 hours old. Eighteen hours of customer data are permanently lost.</p>
<p>The startup had no disaster recovery plan, no tested recovery procedure, and no communication template ready for customers.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>The question is not whether your infrastructure will fail. It will fail. Every database, every server, every availability zone experiences failures. The question is whether you have a tested plan for when it does.</p>
<p>Data loss of any magnitude is serious. For startups that handle financial data, healthcare data, or anything under GDPR, even partial data loss can trigger regulatory consequences.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p><strong>Define your RTO and RPO before you design anything:</strong></p>
<ul>
<li><p><strong>RTO (Recovery Time Objective):</strong> How long can the business survive without this system? A payment API might have an RTO of 15 minutes. An internal analytics dashboard might have an RTO of 4 hours.</p>
</li>
<li><p><strong>RPO (Recovery Point Objective):</strong> How much data loss is acceptable? Zero means real-time replication. One hour means hourly snapshots are sufficient. This directly determines your backup frequency and architecture.</p>
</li>
</ul>
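<p>Both objectives reduce to simple arithmetic that you can check before committing to an architecture. The Python sketch below is one way to express it. The function names and the example durations are assumptions for illustration, and it models periodic snapshots only, not continuous replication or point-in-time recovery.</p>

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup loses one full interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """The backup schedule satisfies the RPO if the worst-case loss fits inside it."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(recovery_steps: dict[str, timedelta], rto: timedelta) -> bool:
    """Recovery time is the sum of every step, including detection and verification."""
    return sum(recovery_steps.values(), timedelta()) <= rto


if __name__ == "__main__":
    # Daily snapshots against a one-hour RPO: the schedule is not sufficient.
    print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=1)))

    steps = {
        "detect": timedelta(minutes=5),
        "restore": timedelta(minutes=40),
        "verify": timedelta(minutes=10),
    }
    # 55 minutes of recovery work cannot satisfy a 15-minute RTO.
    print(meets_rto(steps, rto=timedelta(minutes=15)))
```

<p>The restore duration is the number teams most often guess at, which is one more reason to measure it with a real restore test.</p>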
<p><strong>Enable RDS Multi-AZ for all production databases:</strong></p>
<pre><code class="language-hcl"># Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
 
  # Multi-AZ: automatic failover to standby in a different AZ
  # No data loss. Automatic failover in ~60-120 seconds.
  multi_az = true
 
  # Encryption at rest — non-negotiable
  storage_encrypted = true
 
  # Automated backups with 7-day retention
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable deletion protection in production
  deletion_protection = true
 
  tags = {
    Environment = "production"
  }
}
</code></pre>
<p><strong>Test your backups on a schedule.</strong> Create a monthly calendar event: "Restore production backup to staging and verify data integrity." An untested backup is not a backup; it is a hope.</p>
<pre><code class="language-bash"># Restore a snapshot to a test instance and verify
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier recovery-test \
  --db-snapshot-identifier rds:prod-postgres-2025-01-15 \
  --db-instance-class db.t3.medium \
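  --no-multi-az

# Wait until the restored instance is ready to accept connections
aws rds wait db-instance-available --db-instance-identifier recovery-test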
 
# Connect and verify row counts
psql -h recovery-test.xxxx.rds.amazonaws.com -U admin -d mydb \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"
</code></pre>
<p>For official guidance on RDS backup and restore, refer to the <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html">AWS RDS Backup and Restore documentation</a>.</p>
<h2 id="heading-mistake-9-no-documentation-or-runbooks">Mistake 9: No Documentation or Runbooks</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>The startup's most experienced DevOps engineer takes two weeks of vacation. On day three of their holiday, the staging environment goes down. Nobody else knows how it was built: the engineer set it up manually over six months with no documentation, no Terraform, no notes. The team spends four days trying to reconstruct the environment from memory and guesswork. The engineer gets messages on their vacation every day. When they return, they rebuild the environment in four hours.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Undocumented infrastructure creates single points of failure, not in your systems but in your team. It makes onboarding new engineers take weeks instead of hours. It makes incident response depend on specific people being available. When one of those people leaves the company, the knowledge walks out with them.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Documentation for an engineering team means three specific things:</p>
<ol>
<li><p><strong>Infrastructure as Code is the highest form of documentation.</strong> The Terraform that defines your infrastructure IS the documentation for what exists and how it is configured. If something is not in code, it should not exist in production.</p>
</li>
<li><p><strong>A runbook for every operational task.</strong> A runbook is a step-by-step procedure written well enough that someone in their first week at the company can follow it during an incident:</p>
</li>
</ol>
<pre><code class="language-markdown"># Runbook: Production Database Connection Exhaustion
 
## Symptoms
- Application logs: "too many connections" errors
- 500 error rate spike on database-dependent endpoints
- pg_stat_activity shows max connections reached
 
## Diagnosis
# Check current connection count
psql -h $DB_HOST -U $DB_USER -c "SELECT COUNT(*) FROM pg_stat_activity;"
 
# See connections by application
psql -h $DB_HOST -U $DB_USER \
  -c "SELECT application_name, COUNT(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

## Resolution
1. Identify and restart the service causing the connection leak
2. If immediate relief needed: kill idle connections older than 10 minutes
3. Long-term: review connection pool settings in application config

## Escalation
If unresolved in 30 minutes: page the on-call backend engineer.
</code></pre>
<ol start="3">
<li><strong>An architecture README in every repository.</strong> Every engineer who clones your repository should be able to understand what it does, how to run it locally, how to deploy it, and what it depends on without asking anyone.</li>
</ol>
<h2 id="heading-mistake-10-solving-technical-problems-without-understanding-the-business">Mistake 10: Solving Technical Problems Without Understanding the Business</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup is experiencing slow page loads. A DevOps engineer decides to solve it by migrating to Kubernetes with horizontal pod auto-scaling. The migration takes six weeks. Page loads improve slightly. But 80% of the slowness was caused by unoptimized database queries that had nothing to do with the infrastructure layer. The six-week migration solved 20% of the problem.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Technical solutions to misdiagnosed problems are extraordinarily expensive. Every hour spent building the wrong solution is an hour not spent on the right one. Infrastructure is a tool for delivering business outcomes, not an end in itself.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Before making any infrastructure decision, answer these four questions:</p>
<ol>
<li><p><strong>What is the actual, measured bottleneck?</strong> Instrument before you act. The bottleneck is almost never where you assumed it was.</p>
</li>
<li><p><strong>What does success look like, and how will you measure it?</strong> "Pages are faster" is not measurable. "p95 page load time drops below 1.2 seconds" is measurable.</p>
</li>
<li><p><strong>What is the full cost of this solution?</strong> Time to implement, ongoing operational burden, team learning curve. Is this cost justified by the measured impact?</p>
</li>
<li><p><strong>Can a simpler solution solve 80% of the problem in 20% of the time?</strong></p>
</li>
</ol>
<p>Always profile and measure before you rebuild:</p>
<pre><code class="language-bash"># Check slow queries in PostgreSQL before any infrastructure changes
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 10;
"
</code></pre>
<p>Nine times out of ten, slow applications have slow queries, missing indexes, or an N+1 query problem, none of which require a new infrastructure layer to fix.</p>
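<p>For readers who have not met the N+1 problem, the following self-contained Python sketch reproduces it with an in-memory SQLite database. The schema, the data, and the function names are invented for this example. The first function issues one query per user, while the second gets the same answer with a single <code>JOIN</code>.</p>

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with users and their orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users VALUES (1, 'ada'), (2, 'linus'), (3, 'grace');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """)
    return conn


def totals_n_plus_one(conn: sqlite3.Connection) -> tuple[dict, int]:
    """One query for the users, then one more query per user: N+1 round trips."""
    queries = 1
    totals = {}
    for user_id, name in conn.execute("SELECT id, name FROM users ORDER BY id").fetchall():
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?", (user_id,)
        ).fetchone()
        queries += 1
        totals[name] = row[0]
    return totals, queries


def totals_single_query(conn: sqlite3.Connection) -> tuple[dict, int]:
    """The same result from a single JOIN with GROUP BY."""
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id, u.name
    """).fetchall()
    return dict(rows), 1


if __name__ == "__main__":
    conn = build_db()
    print(totals_n_plus_one(conn))
    print(totals_single_query(conn))
```

<p>With three users the difference is four queries versus one. With ten thousand users it is ten thousand and one network round trips, and no amount of horizontal scaling removes them.</p>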
<h2 id="heading-the-system-thinking-framework-every-devops-engineer-needs">The System Thinking Framework Every DevOps Engineer Needs</h2>
<p>Most of the mistakes above share a common root cause: the engineer was thinking about one component in isolation instead of the full system.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/b33035a6-448f-419b-b293-206b7b775594.jpg" alt="A diagram showing a request flowing through a full system: user → CDN → load balancer → application servers → cache → database → logs/monitoring" style="display:block;margin:0 auto" width="544" height="650" loading="lazy">

<p>A system thinker asks six questions before making any change in production:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Why You Ask It</th>
</tr>
</thead>
<tbody><tr>
<td><strong>What does this change?</strong></td>
<td>List every configuration, file, or service that will be different.</td>
</tr>
<tr>
<td><strong>What does this depend on?</strong></td>
<td>What must be true upstream for this component to work correctly?</td>
</tr>
<tr>
<td><strong>What depends on this?</strong></td>
<td>What downstream systems are affected if this changes or fails?</td>
</tr>
<tr>
<td><strong>What is the failure mode?</strong></td>
<td>Does this fail loudly (500 errors) or silently (wrong data)?</td>
</tr>
<tr>
<td><strong>What is the rollback path?</strong></td>
<td>How do you reverse this in under five minutes?</td>
</tr>
<tr>
<td><strong>What does healthy look like after the change?</strong></td>
<td>What metrics confirm everything is working correctly?</td>
</tr>
</tbody></table>
<p>This is not a checklist you run through slowly. It is a thinking habit that becomes automatic with practice. Senior engineers do not spend more time on deployments than junior engineers do; they spend their time on different things, and this is one of them.</p>
<h2 id="heading-your-production-readiness-checklist">Your Production Readiness Checklist</h2>
<p>Use this checklist before any production system goes live. Mark each item as done, in progress, or not yet started.</p>
<h3 id="heading-infrastructure">Infrastructure</h3>
<ul>
<li><p>Infrastructure is defined as code (Terraform or CloudFormation) and version-controlled in Git</p>
</li>
<li><p>Separate dev, staging, and production environments exist with separate credentials</p>
</li>
<li><p>All production changes go through an automated CI/CD pipeline, with no manual SSH deployments</p>
</li>
<li><p>You can rebuild the entire production environment from code in under two hours</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><p>No secrets, credentials, or API keys exist in any Git repository</p>
</li>
<li><p>All production secrets are in Secrets Manager or SSM Parameter Store</p>
</li>
<li><p>All IAM roles follow the principle of least privilege</p>
</li>
<li><p>S3 buckets have public access blocked by default</p>
</li>
<li><p>Port 22 is not open to <code>0.0.0.0/0</code> on any security group</p>
</li>
<li><p>CloudTrail is enabled in all regions</p>
</li>
<li><p>All IAM users have MFA enabled</p>
</li>
<li><p>AWS Security Hub is enabled and findings are reviewed weekly</p>
</li>
</ul>
<h3 id="heading-observability">Observability</h3>
<ul>
<li><p>Every service has a <code>/health</code> endpoint that monitoring checks continuously</p>
</li>
<li><p>Alerts fire within five minutes of a production error rate spike</p>
</li>
<li><p>Dashboards exist showing latency, error rate, and resource utilization</p>
</li>
<li><p>Logs are centralized and searchable, not scattered across individual servers</p>
</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><p>Production database has Multi-AZ enabled</p>
</li>
<li><p>Backup restoration has been tested in the last 30 days</p>
</li>
<li><p>Written runbooks exist for the three most likely failure scenarios</p>
</li>
<li><p>RTO and RPO requirements are documented and the architecture meets them</p>
</li>
</ul>
<h3 id="heading-documentation">Documentation</h3>
<ul>
<li><p>Every repository has a README explaining what it does and how to deploy it</p>
</li>
<li><p>A new engineer could understand the production architecture from documentation alone</p>
</li>
<li><p>No single engineer holds critical knowledge that lives only in their head</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>None of the mistakes in this article require rare misfortune to experience. They are the predictable result of decisions that feel reasonable under startup pressure but accumulate into real operational risk over time.</p>
<p>The good news is that every single one of them is preventable with the right awareness and the right habits applied early.</p>
<p>You do not need a perfect infrastructure from day one. You need a correct one: version-controlled, automated, observable, secure, and documented. Start with that foundation. Add complexity only when a specific, measured problem requires it. Always connect technical decisions to business outcomes.</p>
<p>The goal of DevOps in a startup is not to build impressive infrastructure. It is to build reliable systems that support product growth safely, efficiently, and sustainably, and to make sure that when something does break, you can recover faster than anyone notices.</p>
<h2 id="heading-want-to-go-deeper">Want to Go Deeper?</h2>
<p>If this article resonated with you, <a href="https://coachli.co/tolani-akintayo/PR-H4oQS"><strong>The Startup DevOps Field Guide</strong></a> covers these principles in full depth with complete infrastructure blueprints, security frameworks, CI/CD pipeline templates, and the end-to-end decision-making playbook for engineers building DevOps practices in startup environments from scratch.</p>
<p>It is written specifically for the engineer who wants to do this right from the beginning, not the one rebuilding everything after the first major incident.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Migrate to S3 Native State Locking in Terraform ]]>
                </title>
                <description>
                    <![CDATA[ If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them togethe ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-migrate-to-s3-native-state-locking-in-terraform/</link>
                <guid isPermaLink="false">69fd19239f93a850a430069b</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Infrastructure as code ]]>
                    </category>
                
                    <category>
                        <![CDATA[ S3 ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 07 May 2026 22:58:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/9619ad45-15c5-4be7-9221-ed4b76bc2b24.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.</p>
<p>But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.</p>
<p>The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.</p>
<p>In November 2024, Terraform 1.10 introduced native state locking in the S3 backend, meaning <strong>DynamoDB is no longer required for state locking</strong>. The feature shipped as experimental in 1.10 and became generally available in Terraform 1.11.</p>
<p>In this tutorial, you'll learn:</p>
<ul>
<li><p>What S3 native locking is and how it works</p>
</li>
<li><p>How to set it up from scratch if you're starting a new project</p>
</li>
<li><p>How to migrate an existing S3 + DynamoDB setup to S3 native locking safely</p>
</li>
<li><p>How to verify locking is working and handle edge cases</p>
</li>
</ul>
<p>By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-terraform-state-locking">What Is Terraform State Locking?</a></p>
</li>
<li><p><a href="#heading-what-is-s3-native-state-locking">What Is S3 Native State Locking?</a></p>
</li>
<li><p><a href="#heading-how-s3-native-locking-compares-to-the-s3-dynamodb-approach">How S3 Native Locking Compares to the S3 + DynamoDB Approach</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-part-1-fresh-setup-how-to-configure-s3-native-locking-from-scratch">Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch</a></p>
<ul>
<li><p><a href="#heading-step-1-create-the-s3-bucket-with-versioning-and-encryption">Step 1: Create the S3 Bucket with Versioning and Encryption</a></p>
</li>
<li><p><a href="#heading-step-2-configure-the-terraform-backend-with-native-locking">Step 2: Configure the Terraform Backend with Native Locking</a></p>
</li>
<li><p><a href="#heading-step-3-initialize-and-verify">Step 3: Initialize and Verify</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-part-2-migration-how-to-move-from-s3-dynamodb-to-s3-native-locking">Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking</a></p>
<ul>
<li><p><a href="#heading-step-1-verify-your-current-setup">Step 1: Verify Your Current Setup</a></p>
</li>
<li><p><a href="#heading-step-2-enable-object-lock-on-the-existing-s3-bucket">Step 2: Enable Object Lock on the Existing S3 Bucket</a></p>
</li>
<li><p><a href="#heading-step-3-update-the-terraform-backend-configuration">Step 3: Update the Terraform Backend Configuration</a></p>
</li>
<li><p><a href="#heading-step-4-reinitialize-terraform">Step 4: Reinitialize Terraform</a></p>
</li>
<li><p><a href="#heading-step-5-verify-the-migration">Step 5: Verify the Migration</a></p>
</li>
<li><p><a href="#heading-step-6-clean-up-the-dynamodb-table">Step 6: Clean Up the DynamoDB Table</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-verify-that-locking-is-working">How to Verify That Locking Is Working</a></p>
</li>
<li><p><a href="#heading-how-to-handle-a-stuck-lock">How to Handle a Stuck Lock</a></p>
</li>
<li><p><a href="#heading-rollback-plan-if-something-goes-wrong">Rollback Plan: If Something Goes Wrong</a></p>
</li>
<li><p><a href="#heading-security-best-practices-for-your-state-bucket">Security Best Practices for Your State Bucket</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-what-is-terraform-state-locking">What Is Terraform State Locking?</h2>
<p>Before looking at the new approach, it helps to understand what state locking is solving.</p>
<p>Terraform stores everything it knows about your infrastructure in a <strong>state file</strong> – a JSON document that maps your configuration to real AWS resources. When you run <code>terraform apply</code>, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.</p>
<p>The problem arises when two engineers or two CI/CD pipelines try to apply changes at the same time. If both read the state file, calculate changes independently, and then write back, you get a <strong>race condition</strong>. The second write overwrites the first, and your state is now out of sync with reality. This can leave resources untracked, duplicated, or destroyed unexpectedly.</p>
<p><strong>State locking</strong> solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.</p>
<pre><code class="language-plaintext">Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ----------&gt;|                            |
   |                                   |                            |
   |&lt;------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |&lt;------ 3. Acquire Lock ----|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |------- 4. Lock Denied ----&gt;|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ----------&gt;|                            |
   |                                   |                            |
   |           [COMPLETED]             |------- 6. Lock Granted ---&gt;|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |
   |                                   |                            |
</code></pre>
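<p>As a purely local analogy (this is not how Terraform talks to S3), the acquire, deny, and release flow in the diagram can be sketched with <code>mkdir</code>, which either creates a directory or fails in a single atomic step. The lock path and function names below are made up for illustration:</p>
<pre><code class="language-shell">#!/usr/bin/env bash
# Local analogy only: mkdir creates the directory or fails, in one atomic step,
# so only one caller can hold the "lock" at a time.
LOCK_DIR="${TMPDIR:-/tmp}/demo-state-$$.lock"   # hypothetical lock path

acquire_lock() {
  # Succeeds for the first caller, fails for everyone else while the lock exists.
  if mkdir "$LOCK_DIR" 2&gt;/dev/null; then
    echo "$1: lock granted"
  else
    echo "$1: lock denied"
    return 1
  fi
}

release_lock() {
  # Removing the directory makes the lock available again.
  rmdir "$LOCK_DIR"
  echo "$1: lock released"
}

acquire_lock "run-a"            # run A takes the lock
acquire_lock "run-b" || true    # run B is refused while A holds it
release_lock "run-a"            # run A finishes
acquire_lock "run-b"            # run B retries and succeeds
release_lock "run-b"
</code></pre>
<p>Run B is denied only while run A holds the lock, and its retry succeeds once A releases it. Terraform's backends follow the same pattern, with the lock stored remotely instead of on your local disk.</p>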
<h2 id="heading-what-is-s3-native-state-locking">What Is S3 Native State Locking?</h2>
<p>Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a <code>LockID</code> primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.</p>
<p>S3 native locking keeps the lock in S3 instead. The lock itself is an ordinary object created with an S3 conditional write, which succeeds only if no object with that key already exists. This tutorial also enables <strong>S3 Object Lock</strong> on the state bucket, an S3 feature originally designed to enforce WORM (Write Once, Read Many) compliance for regulatory requirements.</p>
<p>When S3 native locking is enabled in your Terraform backend:</p>
<ol>
<li><p>Terraform writes your state to a <code>.tfstate</code> object in S3 (as before)</p>
</li>
<li><p>To acquire a lock, Terraform uses <strong>S3's conditional write operations</strong>, specifically the <code>if-none-match</code> conditional header, to create a lock file atomically</p>
</li>
<li><p>If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held</p>
</li>
<li><p>When the operation completes, Terraform deletes the lock file to release the lock</p>
</li>
</ol>
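<p>The conditional-write step in the list above can be tried by hand with the AWS CLI. The sketch below is an illustration, not Terraform's own code. It assumes an AWS CLI v2 release recent enough to support the <code>--if-none-match</code> option on <code>put-object</code>, and the function name, bucket, and key are placeholders:</p>
<pre><code class="language-shell">#!/usr/bin/env bash
# Sketch: create a lock object only if the key does not exist yet.
# try_create_lock is a hypothetical helper; bucket, key, and body are arguments.
try_create_lock() {
  local bucket="$1" key="$2" body="$3"
  # --if-none-match '*' asks S3 to reject the write if the key already exists.
  if aws s3api put-object \
       --bucket "$bucket" \
       --key "$key" \
       --body "$body" \
       --if-none-match '*' &gt;/dev/null 2&gt;&amp;1; then
    echo "lock created: s3://$bucket/$key"
  else
    echo "lock not created (key may already exist): s3://$bucket/$key"
    return 1
  fi
}
</code></pre>
<p>Calling <code>try_create_lock</code> twice with the same bucket and key should succeed the first time and fail the second, because the object already exists. Use a throwaway key for this experiment and delete the test object afterward.</p>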
<p>The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.</p>
<p><strong>Note:</strong> This feature requires Terraform version <strong>1.10.0 or later</strong> and an S3 bucket with <strong>Object Lock enabled</strong>. The simplest path is to enable Object Lock when you create the bucket. Enabling it on an existing bucket takes an extra CLI step, which we'll cover in Part 2.</p>
<h2 id="heading-how-s3-native-locking-compares-to-the-s3-dynamodb-approach">How S3 Native Locking Compares to the S3 + DynamoDB Approach</h2>
<table>
<thead>
<tr>
<th><strong>Aspect</strong></th>
<th><strong>S3 + DynamoDB (Old)</strong></th>
<th><strong>S3 Native Locking (New)</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>AWS services required</strong></td>
<td>S3 + DynamoDB</td>
<td>S3 only</td>
</tr>
<tr>
<td><strong>IAM permissions needed</strong></td>
<td>S3 + DynamoDB permissions</td>
<td>S3 permissions only</td>
</tr>
<tr>
<td><strong>Terraform version</strong></td>
<td>Any</td>
<td>1.10.0 or later</td>
</tr>
<tr>
<td><strong>Setup complexity</strong></td>
<td>Two resources, two IAM scopes</td>
<td>One resource</td>
</tr>
<tr>
<td><strong>Stuck lock resolution</strong></td>
<td>Delete DynamoDB record</td>
<td>Delete S3 lock file</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>S3 storage + DynamoDB on-demand</td>
<td>S3 storage only</td>
</tr>
<tr>
<td><strong>Object Lock requirement</strong></td>
<td>Not required</td>
<td>Required on S3 bucket</td>
</tr>
<tr>
<td><strong>Locking mechanism</strong></td>
<td>DynamoDB conditional writes</td>
<td>S3 conditional writes (<code>if-none-match</code>)</td>
</tr>
<tr>
<td><strong>State versioning</strong></td>
<td>S3 Versioning (recommended)</td>
<td>S3 Versioning (required for full safety)</td>
</tr>
</tbody></table>
<p>The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following in place:</p>
<ul>
<li><strong>Terraform 1.10.0 or later</strong> installed. Check your version:</li>
</ul>
<pre><code class="language-shell">terraform version
</code></pre>
<p>If you need to upgrade, follow the <a href="https://developer.hashicorp.com/terraform/install">official installation guide</a>.</p>
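<p>A script or CI job can make the same version check before it runs. This is a minimal sketch that assumes <code>sort -V</code> is available (GNU coreutils and recent macOS) and reads the version from <code>terraform version -json</code>. The function names are hypothetical:</p>
<pre><code class="language-shell">#!/usr/bin/env bash
# Sketch: fail early when the installed Terraform is older than 1.10.0.

version_at_least() {
  # True when $1 is greater than or equal to $2, compared as version strings.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

installed_terraform_version() {
  # terraform version -json includes a "terraform_version" field.
  terraform version -json | sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p'
}

check_terraform_version() {
  local have
  have="$(installed_terraform_version)"
  if version_at_least "$have" "1.10.0"; then
    echo "ok: $have"
  else
    echo "too old: $have"
    return 1
  fi
}
</code></pre>
<p>Calling <code>check_terraform_version</code> at the top of a pipeline step stops the job before <code>terraform init</code> runs against a backend block the installed version doesn't understand. Pre-release version strings are not handled by this sketch.</p>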
<ul>
<li><strong>AWS CLI</strong> installed and configured with credentials that have permission to create and manage S3 buckets.</li>
</ul>
<pre><code class="language-shell">aws --version
aws sts get-caller-identity   # confirm you're authenticated
</code></pre>
<ul>
<li><p><strong>IAM permissions</strong> to perform the following S3 actions:</p>
<ul>
<li><p><code>s3:CreateBucket</code></p>
</li>
<li><p><code>s3:PutBucketVersioning</code></p>
</li>
<li><p><code>s3:PutBucketEncryption</code></p>
</li>
<li><p><code>s3:PutObjectLegalHold</code></p>
</li>
<li><p><code>s3:PutObjectRetention</code></p>
</li>
<li><p><code>s3:GetObject</code></p>
</li>
<li><p><code>s3:PutObject</code></p>
</li>
<li><p><code>s3:DeleteObject</code></p>
</li>
<li><p><code>s3:ListBucket</code></p>
</li>
</ul>
</li>
<li><p>For the <strong>migration path</strong>: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.</p>
</li>
</ul>
<h2 id="heading-part-1-fresh-setup-how-to-configure-s3-native-locking-from-scratch">Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch</h2>
<p>Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.</p>
<h3 id="heading-step-1-create-the-s3-bucket-with-versioning-and-encryption">Step 1: Create the S3 Bucket with Versioning and Encryption</h3>
<p>The simplest time to enable Object Lock is <strong>at bucket creation</strong>, because it avoids the extra step an existing bucket needs. Create the bucket using the AWS CLI with Object Lock enabled:</p>
<pre><code class="language-shell">aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket
</code></pre>
<p><strong>Note:</strong> For regions other than <code>us-east-1</code>, add the <code>--create-bucket-configuration</code> flag.</p>
<pre><code class="language-shell">aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
</code></pre>
<p>Now enable versioning on the bucket. Versioning is required alongside Object Lock and lets you recover a previous state version if something goes wrong:</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled
</code></pre>
<p>Enable server-side encryption so your state files are encrypted at rest:</p>
<pre><code class="language-shell">aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'
</code></pre>
<p>Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:</p>
<pre><code class="language-shell">aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<p>Verify the bucket configuration:</p>
<pre><code class="language-shell"># Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state
</code></pre>
<p>Expected output for the Object Lock check:</p>
<pre><code class="language-json">{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/2b2e56cf-687f-4932-a61e-ed7cc33ea6f1.png" alt="Terminal showing AWS CLI verification commands confirming S3 bucket is configured correctly with Object Lock, versioning, and encryption enabled" style="display:block;margin:0 auto" width="1120" height="616" loading="lazy">
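<p>If you manage several state buckets, the same checks can be wrapped in a small helper. This sketch reuses the two <code>get-</code> commands above with the AWS CLI's <code>--query</code> and <code>--output text</code> options. The function name is hypothetical:</p>
<pre><code class="language-shell">#!/usr/bin/env bash
# Sketch: report whether a state bucket has versioning and Object Lock enabled.
# check_state_bucket is a hypothetical helper; pass the bucket name as $1.
check_state_bucket() {
  local bucket="$1" versioning object_lock
  versioning="$(aws s3api get-bucket-versioning \
    --bucket "$bucket" --query 'Status' --output text)"
  object_lock="$(aws s3api get-object-lock-configuration \
    --bucket "$bucket" \
    --query 'ObjectLockConfiguration.ObjectLockEnabled' --output text)"
  echo "versioning=$versioning object_lock=$object_lock"
  # Return non-zero unless both settings report Enabled.
  [ "$versioning" = "Enabled" ] || return 1
  [ "$object_lock" = "Enabled" ] || return 1
}
</code></pre>
<p>For example, <code>check_state_bucket your-project-terraform-state</code> prints one summary line and returns a non-zero exit code if either setting is missing, which makes it easy to use in a script.</p>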

<h3 id="heading-step-2-configure-the-terraform-backend-with-native-locking">Step 2: Configure the Terraform Backend with Native Locking</h3>
<p>In your Terraform project, create or update your <code>backend.tf</code> file:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0+ and a bucket with Object Lock enabled
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}
</code></pre>
<p>The critical difference from the old configuration is the <code>use_lockfile = true</code> parameter. Notice what is <strong>absent</strong>: there's no <code>dynamodb_table</code> argument. No DynamoDB table. No second service.</p>
<p>Here's a direct comparison of the old and new configurations:</p>
<p><strong>Old configuration (S3 + DynamoDB):</strong></p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}
</code></pre>
<p><strong>New configuration (S3 native locking):</strong></p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}
</code></pre>
<h3 id="heading-step-3-initialize-and-verify">Step 3: Initialize and Verify</h3>
<p>Run <code>terraform init</code> to initialize the backend:</p>
<pre><code class="language-shell">terraform init
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!
</code></pre>
<p>Run a plan to confirm everything is working end-to-end:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>If locking is working, you'll see a brief pause while Terraform acquires the lock before the plan output appears. You'll also see the lock information if you look at the S3 bucket&nbsp;– a <code>.tflock</code> file will appear temporarily alongside your state file during the operation and disappear when it completes.</p>
<h2 id="heading-part-2-migration-how-to-move-from-s3-dynamodb-to-s3-native-locking">Part 2: Migration&nbsp;– How to Move from S3 + DynamoDB to S3 Native Locking</h2>
<p>Follow this section if you have an <strong>existing Terraform setup</strong> using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.</p>
<p><strong>Important:</strong> Migration requires a maintenance window or at minimum a period where no Terraform operations are running. You're changing the backend configuration, which means <strong>all team members and CI/CD pipelines must stop running</strong> <code>terraform plan</code> <strong>or</strong> <code>terraform apply</code> <strong>during the migration</strong>. The migration itself takes under 10 minutes.</p>
<h3 id="heading-step-1-verify-your-current-setup">Step 1: Verify Your Current Setup</h3>
<p>Before making any changes, document your existing backend configuration and confirm the state file is accessible:</p>
<pre><code class="language-shell"># Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'
</code></pre>
<p>Check your current <code>backend.tf</code> and note the exact values:</p>
<pre><code class="language-hcl"># Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}
</code></pre>
<p>Run one final plan to confirm the current state is clean and there are no unexpected changes pending:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>If the plan shows no changes, you're in a safe state to proceed.</p>
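<p>Before changing anything, it can also be worth keeping a local copy of the current state. This is an optional precaution, not a required migration step. The sketch below relies on <code>terraform state pull</code>, which prints the remote state to standard output. The function name and file naming scheme are hypothetical:</p>
<pre><code class="language-shell">#!/usr/bin/env bash
# Sketch: save a local copy of the current remote state before touching the backend.
# backup_state is a hypothetical helper; pass an output path or accept the default.
backup_state() {
  local out="${1:-terraform-state-backup-$(date +%Y%m%d-%H%M%S).tfstate}"
  # Write the pulled state to the file and make sure the file is not empty.
  if terraform state pull &gt; "$out" &amp;&amp; [ -s "$out" ]; then
    echo "state saved to $out"
  else
    echo "state backup failed" &gt;&amp;2
    return 1
  fi
}
</code></pre>
<p>State files can contain sensitive values, so keep the backup out of version control and delete it once the migration is confirmed.</p>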
<h3 id="heading-step-2-enable-object-lock-on-the-existing-s3-bucket">Step 2: Enable Object Lock on the Existing S3 Bucket</h3>
<p>This is the most important step in the migration. Object Lock was originally a setting that could only be configured at bucket creation time.</p>
<p>AWS now supports enabling Object Lock on an existing bucket with the <code>PutObjectLockConfiguration</code> API call, as long as versioning is already enabled on the bucket. If you're not sure about versioning, run the versioning check shown later in this step first.</p>
<p>Run the following AWS CLI command to enable Object Lock on your <strong>existing</strong> bucket:</p>
<pre><code class="language-bash">aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'
</code></pre>
<p><strong>Note:</strong> This command enables Object Lock <strong>without a default retention rule</strong>, meaning it turns on the locking capability without applying a retention mode or period to any object. This is exactly what Terraform's native locking needs: the ability to create and delete lock files, not permanent object retention.</p>
<p>Verify Object Lock is now enabled:</p>
<pre><code class="language-shell">aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket
</code></pre>
<p>Expected output:</p>
<pre><code class="language-json">{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
</code></pre>
<p>Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):</p>
<pre><code class="language-shell">aws s3api get-bucket-versioning \
  --bucket your-existing-bucket
</code></pre>
<p>Expected output:</p>
<pre><code class="language-json">{
    "Status": "Enabled"
}
</code></pre>
<p>If versioning isn't enabled, enable it before proceeding:</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/cd17df01-3d0a-4f93-9250-3f51627e91c8.png" alt="Terminal output showing successful Object Lock enablement on an existing S3 bucket using the AWS CLI" style="display:block;margin:0 auto" width="1204" height="320" loading="lazy">

<h3 id="heading-step-3-update-the-terraform-backend-configuration">Step 3: Update the Terraform Backend Configuration</h3>
<p>Update your <code>backend.tf</code> to remove the <code>dynamodb_table</code> argument and add <code>use_lockfile = true</code>:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket  = "your-existing-bucket"
    key     = "path/to/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
 
    # Add this:
    use_lockfile = true
 
    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}
</code></pre>
<p>Your updated <code>backend.tf</code> should look like this:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}
</code></pre>
<h3 id="heading-step-4-reinitialize-terraform">Step 4: Reinitialize Terraform</h3>
<p>Run <code>terraform init</code> with the <code>-reconfigure</code> flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):</p>
<pre><code class="language-shell">terraform init -reconfigure
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!
</code></pre>
<p><strong>If you see an error here:</strong> The most common cause is that Object Lock wasn't successfully enabled on the bucket. Re-run the verification from Step 2 before proceeding.</p>
<h3 id="heading-step-5-verify-the-migration">Step 5: Verify the Migration</h3>
<p>Run a plan to confirm Terraform is working correctly with the new backend configuration:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>The plan should:</p>
<ul>
<li><p>Complete successfully</p>
</li>
<li><p>Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)</p>
</li>
<li><p>NOT mention DynamoDB anywhere in its output</p>
</li>
</ul>
<p>To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:</p>
<pre><code class="language-plaintext">╷
│ Error: Error acquiring the state lock
│
│ Error message: operation error S3: PutObject, https response error StatusCode: 409,
│ RequestID: ..., api error Conflict: Object lock already exists for this key.
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵
</code></pre>
<p>The <code>Path</code> field shows <code>.tfstate.tflock</code>, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/e9abb703-af6e-429c-83bb-2ea2dac43a3a.png" alt="Two terminals showing concurrent terraform plan commands, the second one displays a lock error confirming S3 native locking is working" style="display:block;margin:0 auto" width="1264" height="539" loading="lazy">

<h3 id="heading-step-6-clean-up-the-dynamodb-table">Step 6: Clean Up the DynamoDB Table</h3>
<p>Once you've confirmed the migration is working correctly and your team has run at least one successful <code>plan</code> and <code>apply</code> cycle using the new backend, you can remove the DynamoDB table.</p>
<p><strong>Wait at least 24-48 hours before deleting the DynamoDB table</strong> if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.</p>
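<p>During that waiting period, a quick search can help spot configurations that still point at the table. This sketch assumes your Terraform code is checked out locally and that your <code>grep</code> supports <code>--include</code> (GNU and BSD grep both do). The function name is hypothetical:</p>
<pre><code class="language-shell">#!/usr/bin/env bash
# Sketch: list Terraform files that still set dynamodb_table.
# find_dynamodb_backends is a hypothetical helper; pass the directory to scan.
find_dynamodb_backends() {
  local root="${1:-.}"
  # -r recurses, -l prints file names only, -E enables extended regex.
  # The pattern skips commented-out lines such as "# dynamodb_table = ...".
  grep -rlE --include='*.tf' '^[[:space:]]*dynamodb_table[[:space:]]*=' "$root"
}
</code></pre>
<p>Running <code>find_dynamodb_backends .</code> from the root of a repository prints every <code>.tf</code> file with an active <code>dynamodb_table</code> argument and returns a non-zero exit code when there are none. It only covers code you have checked out, so pipelines that pass backend settings on the command line still need a manual check.</p>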
<p>When you're ready, delete the DynamoDB table:</p>
<pre><code class="language-shell">aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table
</code></pre>
<p>Confirm the deletion:</p>
<pre><code class="language-shell">aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found
</code></pre>
<p>This error confirms that the table is gone. The migration is complete.</p>
<p>If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run <code>terraform apply</code> to destroy it via Terraform rather than the CLI directly. This keeps your state clean:</p>
<pre><code class="language-hcl"># Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}
</code></pre>
<p>After removing the block, run:</p>
<pre><code class="language-bash">terraform apply
</code></pre>
<p>Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.</p>
<h2 id="heading-how-to-verify-that-locking-is-working">How to Verify That Locking Is Working</h2>
<p>After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.</p>
<h3 id="heading-method-1-observe-the-lock-file-during-an-operation">Method 1: Observe the lock file during an operation</h3>
<p>In one terminal, start a long-running plan against a configuration with many resources:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>While it's running, in a second terminal, check for the lock file in S3:</p>
<pre><code class="language-shell">aws s3 ls s3://your-bucket/path/to/ | grep tflock
</code></pre>
<p>You should see a file like:</p>
<pre><code class="language-plaintext">2026-05-06 14:22:01        512 terraform.tfstate.tflock
</code></pre>
<p>After the plan completes, run the same command again. The <code>.tflock</code> file should be gone.</p>
<h3 id="heading-method-2-read-the-lock-file-contents">Method 2: Read the lock file contents</h3>
<p>While a plan is running, download and read the lock file to see its contents:</p>
<pre><code class="language-shell">aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock &amp;&amp; cat /tmp/current.lock
</code></pre>
<p>Expected output (formatted for readability):</p>
<pre><code class="language-json">{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}
</code></pre>
<p>This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.</p>
<h2 id="heading-how-to-handle-a-stuck-lock">How to Handle a Stuck Lock</h2>
<p>With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the <code>.tflock</code> file from S3.</p>
<p>A lock can get stuck if:</p>
<ul>
<li><p>A <code>terraform apply</code> or <code>plan</code> process was killed mid-execution</p>
</li>
<li><p>A CI/CD pipeline runner crashed during a Terraform operation</p>
</li>
<li><p>A network interruption prevented the lock release from completing</p>
</li>
</ul>
<p>Here's how you can check for a stuck lock:</p>
<pre><code class="language-shell">aws s3 ls s3://your-bucket/path/to/ | grep tflock
</code></pre>
<p>If a <code>.tflock</code> file exists and no Terraform operation is currently running, it is a stuck lock.</p>
<p>You can also read the lock to understand who held it:</p>
<pre><code class="language-shell">aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock &amp;&amp; cat /tmp/stuck.lock
</code></pre>
<p>This tells you who (<code>Who</code> field) was running the operation, what operation it was (<code>Operation</code> field), and when it was acquired (<code>Created</code> field).</p>
<p>And you can force-unlock using Terraform like this:</p>
<pre><code class="language-shell">terraform force-unlock LOCK-ID
</code></pre>
<p>Replace <code>LOCK-ID</code> with the <code>ID</code> value from the lock file contents. For example:</p>
<pre><code class="language-shell">terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890
</code></pre>
<p>Terraform will confirm:</p>
<pre><code class="language-plaintext">Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!
</code></pre>
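<p>Copying the lock ID by hand is easy to get wrong, so here is a minimal sketch of pulling a field out of the downloaded lock file with <code>sed</code>. The function name <code>lock_field</code> is hypothetical, and the sketch assumes every field is a plain <code>"Key": "value"</code> string pair with no escaped quotes, as in the sample lock file shown earlier:</p>

```shell
# Hypothetical helper (not part of Terraform or the AWS CLI):
# print the value of one top-level string field from a lock file.
# Assumes simple "Key": "value" pairs with no escaped quotes.
lock_field() {
  field="$1"
  file="$2"
  sed -n "s/.*\"$field\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file" | head -n 1
}
```

<p>With the lock file saved to <code>/tmp/stuck.lock</code> as above, you could then run <code>terraform force-unlock "$(lock_field ID /tmp/stuck.lock)"</code>. A JSON-aware tool is a safer choice if your lock files ever contain more complex values.</p>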
<p>Alternatively, you can delete the lock file directly with the AWS CLI. This is useful when <code>terraform force-unlock</code> isn't an option (for example, because you are working in a CI environment without Terraform available):</p>
<pre><code class="language-shell">aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock
</code></pre>
<p><strong>Only delete the lock file if you are certain no Terraform operation is currently running.</strong> Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.</p>
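<p>One rough sanity check before removing a lock is to see how old it is, using the <code>Created</code> field from the lock file. The sketch below is illustrative only: <code>lock_age_seconds</code> is a hypothetical helper, it assumes GNU <code>date</code> (for the <code>-d</code> option), and a lock's age is a hint rather than proof that no operation is running.</p>

```shell
# Hypothetical helper: seconds between a lock's Created timestamp and a
# reference time (defaults to the current UTC time).
# Assumes GNU date, which accepts -d to parse a timestamp.
lock_age_seconds() {
  created="$1"
  reference="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  # Drop fractional seconds so the timestamp is plain ISO 8601.
  created_trimmed=$(printf '%s\n' "$created" | sed 's/\.[0-9]*Z$/Z/')
  created_epoch=$(date -u -d "$created_trimmed" +%s) || return 1
  reference_epoch=$(date -u -d "$reference" +%s) || return 1
  echo $((reference_epoch - created_epoch))
}
```

<p>For example, <code>lock_age_seconds 2026-05-06T14:22:01.123456789Z 2026-05-06T15:22:01Z</code> prints <code>3600</code>. A lock that is hours old while your pipelines are idle is a reasonable candidate for removal. A lock that is seconds old probably belongs to a live run.</p>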
<h2 id="heading-rollback-plan-if-something-goes-wrong">Rollback Plan: If Something Goes Wrong</h2>
<p>If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.</p>
<p><strong>Step 1: Stop all Terraform operations</strong> in your team and CI/CD pipelines.</p>
<p><strong>Step 2: Recreate the DynamoDB table</strong> if you already deleted it:</p>
<pre><code class="language-shell">aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
</code></pre>
<p><strong>Step 3: Revert</strong> <code>backend.tf</code> to the previous configuration:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}
</code></pre>
<p><strong>Step 4: Reinitialize:</strong></p>
<pre><code class="language-shell">terraform init -reconfigure
</code></pre>
<p><strong>Step 5: Verify:</strong></p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.</p>
<p><strong>Note:</strong> Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock and DynamoDB locking can coexist: Object Lock simply adds a capability to the bucket. Setting <code>dynamodb_table</code> in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled on the bucket.</p>
<h2 id="heading-security-best-practices-for-your-state-bucket">Security Best Practices for Your State Bucket</h2>
<p>Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:</p>
<h3 id="heading-enable-versioning-required">Enable Versioning (Required)</h3>
<p>Versioning is a hard requirement for S3 native locking to work safely. It ensures that if a state file is accidentally overwritten or corrupted, you can restore a previous version.</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled
</code></pre>
<h3 id="heading-block-all-public-access-non-negotiable">Block All Public Access (Non-Negotiable)</h3>
<p>Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.</p>
<pre><code class="language-shell">aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<h3 id="heading-enable-server-side-encryption">Enable Server-Side Encryption</h3>
<p>Always encrypt state files at rest. SSE-S3 (AES256) is the minimum. If your organization requires KMS key management, set SSE-KMS as the bucket's default encryption instead:</p>
<pre><code class="language-shell">aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'
</code></pre>
<h3 id="heading-apply-least-privilege-iam-permissions">Apply Least-Privilege IAM Permissions</h3>
<p>The role or user that Terraform uses to access the state bucket should have only the permissions it needs. Here's a minimal IAM policy for S3 native locking:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-state-bucket",
        "arn:aws:s3:::your-state-bucket/*"
      ]
    },
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "s3:GetObjectLegalHold",
        "s3:PutObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:PutObjectRetention"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*.tflock"
    }
  ]
}
</code></pre>
<p>Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.</p>
<h3 id="heading-enable-access-logging">Enable Access Logging</h3>
<p>Record access to your state bucket with CloudTrail or S3 server access logs. This gives you an audit trail of every time state was read, written, or locked. The command below enables S3 server access logging:</p>
<pre><code class="language-shell">aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.</p>
<p>Here's a summary of what you accomplished:</p>
<ul>
<li><p>Understood what state locking is and why it's required for safe Terraform operations</p>
</li>
<li><p>Compared S3 native locking to the existing S3 + DynamoDB approach</p>
</li>
<li><p>Set up a fresh Terraform backend using S3 native locking with correct bucket configuration</p>
</li>
<li><p>Migrated an existing backend from S3 + DynamoDB to S3 native locking safely</p>
</li>
<li><p>Learned how to verify locking, handle stuck locks, and roll back if needed</p>
</li>
<li><p>Applied security best practices to the state bucket</p>
</li>
</ul>
<p>This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.</p>
<p><em>If you are building or optimizing cloud infrastructure for a startup and want a complete reference for production-ready Terraform modules, CI/CD pipeline patterns, and infrastructure runbooks, check out</em> <a href="https://coachli.co/tolani-akintayo/PR-H4oQS">The Startup DevOps Field Guide</a><em>. It covers the full lifecycle of AWS infrastructure from initial setup to production reliability.</em></p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a href="https://developer.hashicorp.com/terraform/language/backend/s3#use_lockfile">HashiCorp: S3 Backend Configuration (use_lockfile)</a></p>
</li>
<li><p><a href="https://github.com/hashicorp/terraform/releases/tag/v1.10.0">HashiCorp: Terraform 1.10 Release Notes</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html">AWS Docs: S3 Object Lock Overview</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectLockConfiguration.html">AWS Docs: PutObjectLockConfiguration API</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html">AWS Docs: S3 Conditional Writes</a></p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/language/state/locking">HashiCorp: Backend State Locking</a></p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/cli/commands/force-unlock">HashiCorp: terraform force-unlock Command</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html">AWS Docs: Enabling S3 Versioning</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html">AWS Docs: S3 Server-Side Encryption</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Land Your First Cloud or DevOps Role: What Hiring Managers Actually Look For ]]>
                </title>
                <description>
                    <![CDATA[ You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating. And yet  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-land-your-first-cloud-or-devops-role-what-hiring-managers-actually-look-for/</link>
                <guid isPermaLink="false">69f3683c909e64ad07e3b0fc</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Career ]]>
                    </category>
                
                    <category>
                        <![CDATA[ jobs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 14:33:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/374e807b-a67f-4f04-a639-dfa230b0ba5f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating.</p>
<p>And yet the applications go out, and nothing comes back.</p>
<p>This is one of the most frustrating experiences in tech. You're genuinely learning, genuinely putting in the time, and you have nothing to show for it in terms of results. You start to wonder if the market is too competitive, if you need one more certification, or if there's some hidden door everyone else found that you're missing.</p>
<p>The truth is simpler and more actionable than any of that: <strong>hiring managers can't see your YouTube watch history. They can see your GitHub.</strong> Most beginners optimize for learning. Hired candidates optimize for proof.</p>
<p>In this guide, you'll get an honest breakdown of the nine factors hiring managers actually evaluate when they look at a junior cloud or DevOps candidate, plus a concrete 90-day plan to address each one. By the end, you'll know exactly where you stand and exactly what to do next.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-three-patterns-that-keep-beginners-stuck">The Three Patterns That Keep Beginners Stuck</a></p>
<ul>
<li><p><a href="#heading-pattern-1-the-tutorial-loop">Pattern 1: The Tutorial Loop</a></p>
</li>
<li><p><a href="#heading-pattern-2-the-theory-practice-gap">Pattern 2: The Theory-Practice Gap</a></p>
</li>
<li><p><a href="#heading-pattern-3-silent-learning">Pattern 3: Silent Learning</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-hiring-managers-are-actually-evaluating">What Hiring Managers Are Actually Evaluating</a></p>
</li>
<li><p><a href="#heading-factor-1-proof-of-work-the-non-negotiable">Factor 1: Proof of Work (The Non-Negotiable)</a></p>
<ul>
<li><a href="#heading-the-three-projects-that-cover-everything">The Three Projects That Cover Everything</a></li>
</ul>
</li>
<li><p><a href="#heading-factor-2-system-level-thinking">Factor 2: System-Level Thinking</a></p>
</li>
<li><p><a href="#heading-factor-3-software-engineering-fundamentals">Factor 3: Software Engineering Fundamentals</a></p>
</li>
<li><p><a href="#heading-factor-4-communication-skills">Factor 4: Communication Skills</a></p>
</li>
<li><p><a href="#heading-factor-5-consistency-over-intensity">Factor 5: Consistency Over Intensity</a></p>
</li>
<li><p><a href="#heading-factor-6-networking-and-visibility">Factor 6: Networking and Visibility</a></p>
</li>
<li><p><a href="#heading-factor-7-ownership-mindset">Factor 7: Ownership Mindset</a></p>
</li>
<li><p><a href="#heading-factor-8--business-awareness">Factor 8: Business Awareness</a></p>
</li>
<li><p><a href="#heading-factor-9-learning-agility">Factor 9: Learning Agility</a></p>
</li>
<li><p><a href="#heading-your-90-day-action-plan">Your 90-Day Action Plan</a></p>
</li>
<li><p><a href="#heading-honest-self-assessment-where-do-you-stand">Honest Self-Assessment: Where Do You Stand?</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references-and-recommended-resources">References and Recommended Resources</a></p>
</li>
</ul>
<h2 id="heading-the-three-patterns-that-keep-beginners-stuck">The Three Patterns That Keep Beginners Stuck</h2>
<h3 id="heading-pattern-1-the-tutorial-loop">Pattern 1: The Tutorial Loop</h3>
<p>Week 1: You watch eight hours of Docker content. Week 2: You start an AWS course and get 70% through. Week 3: A Kubernetes series looks interesting, so you start that instead. Week 4: You open LinkedIn and wonder why you're not getting callbacks.</p>
<p>Watching tutorials feels like progress. It's comfortable, passive, and has no failure state. Nothing breaks. Nothing goes wrong.</p>
<p>The problem is that it produces nothing a hiring manager can evaluate. Courses and certifications tell an employer what you've been exposed to. Your GitHub tells them what you can actually do.</p>
<h3 id="heading-pattern-2-the-theory-practice-gap">Pattern 2: The Theory-Practice Gap</h3>
<p>You can explain CI/CD fluently. You've read the Kubernetes documentation. You understand the conceptual difference between a container and a virtual machine.</p>
<p>But you've never taken a simple application, containerized it, connected it to a pipeline, and deployed it to a cloud server with a real URL that someone can visit.</p>
<p>In an interview, "I understand how it works" and "I have built this and here is the link" are not equivalent answers. Hiring managers hear the first version from hundreds of candidates. The second version gets callbacks.</p>
<h3 id="heading-pattern-3-silent-learning">Pattern 3: Silent Learning</h3>
<p>This one is perhaps the most painful pattern because the learning is real. You're putting in the work every day, but nobody knows. No GitHub activity. No LinkedIn posts. No community presence. Just cold applications sent from job boards to applicant tracking systems that filter you out before a human ever sees your name.</p>
<p>The hard truth: people get hired through people. A hiring manager who has seen your LinkedIn post about a problem you solved is significantly more likely to give your résumé serious attention than a stranger who applied through a portal.</p>
<h2 id="heading-what-hiring-managers-are-actually-evaluating">What Hiring Managers Are Actually Evaluating</h2>
<p>I've grouped the nine factors that follow into three buckets: <strong>Mindset</strong>, <strong>Execution</strong>, and <strong>Visibility</strong>. The order matters: mindset shapes how you execute, and execution is what powers visibility.</p>
<table>
<thead>
<tr>
<th>Bucket</th>
<th>Covers</th>
<th>Factors</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Mindset</strong></td>
<td>How you think about problems and your career</td>
<td>Factors 2, 7, 8, 9</td>
</tr>
<tr>
<td><strong>Execution</strong></td>
<td>What you actually build and demonstrate</td>
<td>Factors 1, 3</td>
</tr>
<tr>
<td><strong>Visibility</strong></td>
<td>Whether the right people know you exist</td>
<td>Factors 4, 5, 6</td>
</tr>
</tbody></table>
<p>Let's go through each one.</p>
<h2 id="heading-factor-1-proof-of-work-the-non-negotiable">Factor 1: Proof of Work (The Non-Negotiable)</h2>
<p>If there's one thing to take from this entire article, it's this: <strong>no portfolio means no serious consideration.</strong> The most technically capable candidate in the applicant pool is invisible without proof of work.</p>
<p>This isn't about impressing anyone with complexity. It's about demonstrating that you can take a system from zero to deployed, documented, and working.</p>
<p>Here's the checklist every portfolio project should meet before you consider it done:</p>
<ul>
<li><p><strong>It's deployed</strong>: there's a real URL you can share, not "it works on my machine"</p>
</li>
<li><p><strong>It has a CI/CD pipeline</strong>: code changes are automatically tested and deployed</p>
</li>
<li><p><strong>Infrastructure is defined as code</strong>: not manually clicked together in the AWS console</p>
</li>
<li><p><strong>It has monitoring and alerting</strong>: you know when it breaks before users tell you</p>
</li>
<li><p><strong>It's documented</strong>: a README explains what it does, how to run it, and how it works</p>
</li>
<li><p><strong>It's on GitHub publicly</strong>: with real commit history showing iterative work</p>
</li>
</ul>
<p>If your project meets all six criteria, you have proof of work. If it meets four of six, you have a project in progress. Finish it before you start applying.</p>
<h3 id="heading-the-three-projects-that-cover-everything">The Three Projects That Cover Everything</h3>
<p>You don't need ten projects. You need two to three projects that together demonstrate the full range of DevOps skills.</p>
<h4 id="heading-project-1-the-full-stack-deploy-pipeline">Project 1: The Full-Stack Deploy Pipeline</h4>
<p>This is the foundational DevOps project every beginner should build first.</p>
<p>Take any simple web application – a Python Flask app, a Node.js API, or even a static site. Containerize it with Docker. Write a CI/CD pipeline that runs tests, builds the Docker image, and deploys to a cloud server automatically on every push to the main branch. You can also set up Nginx as a reverse proxy and add an uptime monitor (UptimeRobot has a free tier).</p>
<p>Tools: GitHub Actions, Docker, AWS EC2 or <a href="http://Render.com">Render.com</a>, Nginx.</p>
<p>Why it matters to a hiring manager: it proves you can automate a full deployment workflow end-to-end. The hiring manager can visit your URL, see it running, and inspect your pipeline history.</p>
<p>This single project puts you ahead of most applicants who only have course completion screenshots.</p>
<h4 id="heading-project-2-infrastructure-as-code-with-terraform">Project 2: Infrastructure as Code with Terraform</h4>
<p>Write Terraform code that provisions a complete environment: a VPC, public and private subnets, an EC2 instance with properly scoped security group rules, and an S3 bucket for remote state. Destroy it and recreate it from scratch to prove the code actually works. Add a GitHub Actions workflow that runs <code>terraform plan</code> on pull requests and <code>terraform apply</code> on merge to main.</p>
<p>Tools: Terraform, AWS (or Azure/GCP), GitHub Actions.</p>
<p>Why it matters: Infrastructure as Code with Terraform is a required skill at almost every company running cloud infrastructure. Showing you can write, version-control, and automate Terraform demonstrates a core professional competency.</p>
<h4 id="heading-project-3-monitoring-and-observability-stack">Project 3: Monitoring and Observability Stack</h4>
<p>Deploy a monitoring stack using Docker Compose: Prometheus scraping metrics from your application and the host, Grafana dashboards showing CPU, memory, request rates, and error rates, and Alertmanager configured to send alerts to Slack or email when thresholds are crossed. Connect this to your Project 1 application so the pipeline deploys and the monitoring watches it.</p>
<p>Tools: Prometheus, Grafana, Alertmanager, Node Exporter, Docker Compose.</p>
<p>Why it matters: most beginner portfolios have zero observability work. This project immediately signals that you understand production engineering, not just deployment. Any senior DevOps engineer or SRE reviewing your application will notice it and it will set you apart.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/da9e25be-9b59-48c8-9cf0-9cfdb050c277.png" alt="GitHub profile showing three pinned DevOps portfolio repositories with descriptive names " style="display:block;margin:0 auto" width="1353" height="584" loading="lazy">

<h2 id="heading-factor-2-system-level-thinking">Factor 2: System-Level Thinking</h2>
<p>This is the mindset that separates a DevOps engineer from someone who just knows a collection of tools. System-level thinking means you can see the whole picture, not just the part you happen to be working on at any given moment.</p>
<p>Here's the mental test hiring managers are running throughout your interview: <em>can you trace a user request from the moment they click a button to the moment they see a response, and explain what happens at every layer in between?</em></p>
<p>Here's the full journey of a web request, the map of modern infrastructure every DevOps engineer needs to understand:</p>
<table>
<thead>
<tr>
<th>Step</th>
<th>Layer</th>
<th>What's happening and what can go wrong</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td>User's Browser</td>
<td>The user types a URL. The browser needs to find the server.</td>
</tr>
<tr>
<td>2</td>
<td>DNS Resolution</td>
<td>The domain is translated into an IP address. DNS misconfigurations mean users can't reach you at all.</td>
</tr>
<tr>
<td>3</td>
<td>CDN / Edge Network</td>
<td>Traffic hits a CDN (Cloudflare, CloudFront) first. Static assets are served from the nearest edge. SSL terminates here.</td>
</tr>
<tr>
<td>4</td>
<td>Load Balancer</td>
<td>Routes the request to an available application server. If all targets are unhealthy, users get 502/503 errors.</td>
</tr>
<tr>
<td>5</td>
<td>Compute / Application Servers</td>
<td>The application code runs here in containers, on VMs, or in serverless functions. Business logic executes.</td>
</tr>
<tr>
<td>6</td>
<td>Database Layer</td>
<td>The application reads from or writes to a database. Slow queries or a full disk cause slow responses or outages.</td>
</tr>
<tr>
<td>7</td>
<td>Cache Layer</td>
<td>Redis or Memcached caches frequently read data. Cache misses cause extra database load.</td>
</tr>
<tr>
<td>8</td>
<td>Response Returns</td>
<td>The response travels back through the stack and the user sees the result.</td>
</tr>
<tr>
<td>9</td>
<td>Logging and Monitoring</td>
<td>Every step above should emit logs and metrics. Good monitoring alerts you before users notice a problem.</td>
</tr>
</tbody></table>
<p>Why does this matter in an interview? Consider two candidates answering the question: <em>"Tell me about a time something broke in production."</em></p>
<p>Candidate A: "The website was down."</p>
<p>Candidate B: "The load balancer health checks were failing because the app containers were running out of memory due to a memory leak introduced in the previous deploy. We identified it via memory metrics in Grafana, rolled back, and added a memory limit to the container spec."</p>
<p>Same incident. Completely different answer. System-level thinking is what makes the difference.</p>
<h2 id="heading-factor-3-software-engineering-fundamentals">Factor 3: Software Engineering Fundamentals</h2>
<p>Many beginners rush to learn Kubernetes and Terraform before mastering the foundations that make those tools make sense. This creates a knowledge structure that looks impressive but has no solid base underneath it.</p>
<p>Here are the fundamentals that actually matter and what to do if you have a gap in any of them:</p>
<h3 id="heading-1-linux-and-the-command-line">1. Linux and the Command Line</h3>
<p>DevOps tools run on Linux. CI/CD jobs run in Linux containers. SSH is the front door to every server. If the terminal makes you uncomfortable, you're not ready for a production environment. This is not a preference; it's a prerequisite.</p>
<p>Start with daily Linux practice. The <a href="https://training.linuxfoundation.org/training/introduction-to-linux/">Linux Foundation's free introductory materials</a> are a solid starting point. And here's a <a href="https://www.freecodecamp.org/news/learn-the-basics-of-the-linux-operating-system/">solid freeCodeCamp course on Linux basics</a>.</p>
<h3 id="heading-2-networking-fundamentals">2. Networking Fundamentals</h3>
<p>DNS, TCP/IP, HTTP/HTTPS, load balancing, firewalls, VPCs, subnets: these concepts appear in every cloud architecture. Without them, Terraform and Kubernetes are magic boxes. Study the request flow in Factor 2 above until you can draw it from memory without looking.</p>
<p>Here's a <a href="https://www.freecodecamp.org/news/computer-networking-fundamentals/">computer networking fundamentals course</a> to get you started.</p>
<h3 id="heading-3-scripting-bash-and-python">3. Scripting: Bash and Python</h3>
<p>CI/CD pipelines are scripts. Automation is scripting. If you cannot write a Bash script that reads a config file, calls an API, and handles errors gracefully, your automation ceiling is very low. Fix this by writing one small, useful script every week. Solve real problems with code.</p>
<p>Here's a helpful tutorial on <a href="https://www.freecodecamp.org/news/shell-scripting-crash-course-how-to-write-bash-scripts-in-linux/">shell scripting in Linux for beginners</a>.</p>
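<p>To make that concrete, here is a minimal sketch of such a script's building blocks. Everything specific in it is an assumption for illustration: the <code>KEY=VALUE</code> config format, the <code>API_URL</code> key, and the function names <code>read_config</code> and <code>check_api</code>. It relies only on standard tools plus <code>curl</code>.</p>

```shell
#!/usr/bin/env bash
# A sketch only: the config format, key names, and function names
# are assumptions made for illustration.

# Print the value for a key from a simple KEY=VALUE config file.
# Returns non-zero if the file is unreadable or the key is missing.
read_config() {
  key="$1"
  file="$2"
  if [ ! -r "$file" ]; then
    echo "error: cannot read config file: $file" >&2
    return 1
  fi
  value=$(grep "^${key}=" "$file" | head -n 1 | cut -d= -f2-)
  if [ -z "$value" ]; then
    echo "error: missing key in config: $key" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Call the API and fail loudly on network errors or HTTP 4xx/5xx.
check_api() {
  url="$1"
  if ! curl --fail --silent --show-error --max-time 10 "$url" > /dev/null; then
    echo "error: request failed: $url" >&2
    return 1
  fi
  echo "ok: $url"
}
```

<p>A caller might run <code>url=$(read_config API_URL app.conf) || exit 1</code> followed by <code>check_api "$url" || exit 1</code>, so that every failure stops the script with a clear message instead of continuing silently.</p>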
<h3 id="heading-4-git-and-version-control">4. Git and Version Control</h3>
<p>Not just <code>git commit</code> and <code>git push</code>. Branching strategies, pull requests, merge conflicts, rebasing, and tagging releases are all standard practice in professional DevOps teams. Use Git for everything, including your personal learning notes. Practice branching workflows intentionally.</p>
<p>Here's a <a href="https://www.freecodecamp.org/news/gitting-things-done-book/">full book on all the Git basics</a> (and some more advanced topics, too) you need to know.</p>
<h3 id="heading-5-docker-and-containers">5. Docker and Containers</h3>
<p>Docker is the universal packaging format for modern software. Understanding layers, multi-stage builds, volumes, networking, and container security is the floor, not the ceiling. Every project you build should be containerized. Write your Dockerfiles by hand instead of copying them.</p>
<p>Here's a course on <a href="https://www.freecodecamp.org/news/learn-docker-and-kubernetes-hands-on-course/">Docker and Kubernetes</a> to get you started.</p>
<h2 id="heading-factor-4-communication-skills">Factor 4: Communication Skills</h2>
<p>Technical skills set your ceiling. Communication skills determine how fast you reach it. This is the most consistently underestimated factor among beginner DevOps candidates.</p>
<p>Two candidates with identical technical ability will have very different career outcomes based on how clearly they communicate. Here's what that looks like in practice:</p>
<p><strong>Architecture explanation</strong>: Can you describe how your project works to someone who has never seen it? Can you draw the architecture on a whiteboard and walk someone through your design decisions and the trade-offs you made?</p>
<p><strong>Trade-off articulation</strong>: <em>"I chose X over Y because..."</em> is one of the most powerful phrases in a technical interview. It shows you understand that every decision has pros and cons and you made a conscious, reasoned choice rather than just copying a tutorial.</p>
<p><strong>Written documentation</strong>: A README is your project's cover letter. A well-written README with clear setup instructions, an architecture diagram, and documented decisions demonstrates engineering maturity that most beginners don't show.</p>
<p>Here's a quick test: open your most recent project on GitHub and read the README as if you're a hiring manager seeing it for the first time. Does it answer these questions?</p>
<ul>
<li><p>What does this project do, and why did you build it?</p>
</li>
<li><p>What does the architecture look like?</p>
</li>
<li><p>How do I run this locally, and how do I deploy it?</p>
</li>
<li><p>What decisions did you make, and why?</p>
</li>
<li><p>What would you improve if you continued working on it?</p>
</li>
</ul>
<p>If you answered "no" to more than two of those, rewrite the README before applying anywhere. This single action will meaningfully improve your response rate.</p>
<p><strong>Interview communication</strong>: Hiring managers assess communication throughout the entire interview, not just in your answers. Thinking out loud, structuring your responses, and admitting uncertainty honestly are all evaluated.</p>
<h2 id="heading-factor-5-consistency-over-intensity">Factor 5: Consistency Over Intensity</h2>
<p>Hiring managers are pattern recognition machines. They look at your GitHub contribution graph, your LinkedIn activity, and your learning trajectory and form an impression before reading a single word on your résumé.</p>
<p>A binge-learning approach (10-hour weekends followed by weeks of nothing) produces a GitHub graph that tells the wrong story. Thirty minutes of focused daily practice for six months beats a monthly 10-hour binge. At the six-month mark, the daily practitioner has about 90 hours of focused work. The binge learner has 60, with significantly worse retention.</p>
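<p>The arithmetic behind those totals is easy to check. This sketch assumes six months is roughly 180 days and that the binge learner manages one 10-hour session per month. The variable names are just for illustration:</p>

```shell
# Assumption: six months is treated as roughly 180 days.
days=180
daily_minutes=30
# Daily practice: 180 days x 30 minutes, converted to hours.
daily_total_hours=$((days * daily_minutes / 60))
# Monthly binge: 6 months x one 10-hour session.
binge_total_hours=$((6 * 10))
echo "daily: ${daily_total_hours}h, binge: ${binge_total_hours}h"
```

<p>Swap in your own numbers to see what a schedule you can actually sustain adds up to.</p>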
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/1315bb8d-9e4e-4f84-836f-4e02b83c75ce.webp" alt="GitHub contribution graph showing 12 months of consistent activity with regular commits across the year" style="display:block;margin:0 auto" width="1080" height="273" loading="lazy">

<p>Here's how to build consistency in practice:</p>
<ul>
<li><p>Pick a time slot in your day that you will protect. Thirty minutes is enough to make progress.</p>
</li>
<li><p>Define a four-week learning sprint with a specific goal: not "learn Terraform" but "build and deploy a VPC with Terraform and write the README."</p>
</li>
<li><p>Keep a private learning journal: date, what you studied, what you built, what confused you.</p>
</li>
<li><p>When the sprint ends, evaluate what you built and plan the next one.</p>
</li>
</ul>
<p>What to avoid: declaring publicly on LinkedIn that you're "grinding DevOps full time" and then disappearing for six weeks. The absence is noticed. Only commit publicly to what you will actually sustain.</p>
<h2 id="heading-factor-6-networking-and-visibility">Factor 6: Networking and Visibility</h2>
<p>This is the factor most beginners resist most, and the one that makes the biggest practical difference in time-to-hire.</p>
<p>Most DevOps jobs are filled through people: referrals, community connections, LinkedIn conversations. A warm introduction from someone who has seen your work outweighs fifty cold applications every time.</p>
<p>Here are three ways to build visibility without it feeling performative:</p>
<h3 id="heading-community-engagement">Community Engagement</h3>
<p>Join communities where DevOps engineers actually talk: AWS User Groups, local DevOps meetups, DevOps Discord servers, Reddit communities like r/devops and r/kubernetes. You don't need to be the expert. Ask specific questions, answer what you genuinely know, and show up consistently. After three to six months, people will recognize your name.</p>
<h3 id="heading-linkedin-content">LinkedIn Content</h3>
<p>Post once per week about something you learned, built, or got stuck on. Not marketing – documentation. A post that says <em>"This week I configured Prometheus alerting for a Docker Compose stack. Here's what tripped me up and how I solved it"</em> attracts recruiters, leads to conversations, and builds a searchable record of your growth over time.</p>
<h3 id="heading-asking-good-questions-in-public">Asking Good Questions in Public</h3>
<p>When you get stuck and figure it out, write it up. Post the solution in the same community where you asked the question. Answer someone else's version of the same question later. You position yourself as a helpful, engaged learner, exactly who hiring managers want to hire.</p>
<p>Here's a concrete three-month visibility sprint to follow:</p>
<table>
<thead>
<tr>
<th>Timeframe</th>
<th>Action</th>
</tr>
</thead>
<tbody><tr>
<td>Week 1-2</td>
<td>Update your LinkedIn headline: "Cloud / DevOps Engineer in Training | Building with AWS, Docker, Terraform". Connect with 20 people in DevOps: engineers, recruiters, hiring managers. Add a short personal note when connecting.</td>
</tr>
<tr>
<td>Week 3-4</td>
<td>Write your first LinkedIn post. Document something you built or learned this week. Keep it honest and specific. 150–200 words is enough.</td>
</tr>
<tr>
<td>Month 2</td>
<td>Join one community. Introduce yourself. Answer one question per week.</td>
</tr>
<tr>
<td>Month 3</td>
<td>Post consistently once per week. Engage with others' posts. Start appearing in recruiter searches.</td>
</tr>
</tbody></table>
<p>By month three, recruiters searching for "DevOps" in your location will encounter your activity. Some of the best entry-level DevOps opportunities come from exactly this kind of low-pressure visibility.</p>
<h2 id="heading-factor-7-ownership-mindset">Factor 7: Ownership Mindset</h2>
<p>This factor is less about personality type and more about observable behavior. Hiring managers are looking for evidence that you finish what you start, not just that you start things.</p>
<p>Here's what the contrast looks like:</p>
<table>
<thead>
<tr>
<th>What hiring managers frequently see</th>
<th>What hiring managers want to see</th>
</tr>
</thead>
<tbody><tr>
<td>"I started a Kubernetes project and encountered a lot of issues"</td>
<td>"Here is a complete project. It deploys to AWS, has a CI/CD pipeline, is monitored, and you can access it at this URL right now."</td>
</tr>
<tr>
<td>"I was working through a Terraform course, learnt a lot about XYZ."</td>
<td>"I finished it, documented it, and wrote a post about what I learned."</td>
</tr>
</tbody></table>
<p>Ownership mindset has three components. First, finish things: a complete, simple project is worth ten times more than ten incomplete complex ones. Second, take responsibility without blame when something breaks: ownership means identifying the cause, fixing it, and adding monitoring so it doesn't happen again. Third, self-direct your learning: you don't wait for someone to tell you what to learn next. You see a gap, identify how to close it, and close it. This is what "junior who can work independently" actually means in job descriptions.</p>
<h2 id="heading-factor-8-business-awareness">Factor 8: Business Awareness</h2>
<p>Technical skill gets you in the door. Business awareness keeps you there and accelerates your career.</p>
<p>The core question hiring managers are testing is: <em>can you connect your technical decisions to cost, uptime, and user impact?</em> Infrastructure decisions are business decisions. Cloud costs are often the second-largest engineering expense after salaries. A misconfigured auto-scaling group or a forgotten large EC2 instance can burn thousands of dollars overnight.</p>
<p>Here are a few benchmark questions worth being able to answer comfortably:</p>
<ul>
<li><p>If your company has a 99.9% SLA, how many minutes of downtime per month is that? (About 43 minutes.)</p>
</li>
<li><p>If you move workloads from on-demand EC2 instances to Reserved Instances, what's the approximate cost saving? (Around 40–60%.)</p>
</li>
<li><p>If your CI/CD pipeline takes 45 minutes per build and you run 20 builds per day, how much developer wait time does that represent weekly?</p>
</li>
</ul>
<p>Most junior candidates can't answer these fluently in an interview. Candidates who can stand out immediately, not because the questions are hard, but because so few people bother to connect infrastructure and business.</p>
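<p>The first and third benchmark questions come down to a few lines of arithmetic. Here's a minimal sketch, assuming a 30-day month, a five-day working week, and that each build keeps one developer waiting for its full duration. The function names are made up for illustration:</p>
<pre><code class="language-js">// Minutes of downtime an availability SLA allows over a period of `days`.
function allowedDowntimeMinutes(slaPercent, days) {
  const totalMinutes = days * 24 * 60;
  return totalMinutes * (1 - slaPercent / 100);
}

// Total hours spent waiting on CI builds per week.
// Assumption: every build blocks one developer for its whole duration.
function weeklyBuildHours(minutesPerBuild, buildsPerDay, workDays = 5) {
  return (minutesPerBuild * buildsPerDay * workDays) / 60;
}

console.log(allowedDowntimeMinutes(99.9, 30).toFixed(1)); // 43.2
console.log(weeklyBuildHours(45, 20)); // 75
</code></pre>
<p>Under those assumptions, a 45-minute pipeline at 20 builds per day adds up to 75 hours of waiting per week. That's the kind of number that turns "our builds are slow" into a business case.</p>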
<p>The simple habit to build: whenever you describe a technical decision in your project documentation or in an interview, add the business dimension. "I configured auto-scaling" becomes "I configured auto-scaling to handle traffic spikes, which eliminated the cost of over-provisioning and reduced our estimated monthly cloud spend by approximately $X."</p>
<h2 id="heading-factor-9-learning-agility">Factor 9: Learning Agility</h2>
<p>Everyone claims to be a fast learner. It's the most overused phrase in technology job applications. Here's how to make it actually mean something.</p>
<p>Saying "I'm a fast learner" in an interview is table stakes. The question is whether you can prove it. Proof sounds like this: <em>"I had never used GitHub Actions before. I needed a CI/CD pipeline for a project I was building. In 48 hours, I had a working pipeline that runs tests, builds a Docker image, and deploys to AWS."</em></p>
<p>What makes that credible: it names a specific tool, a specific timeframe, and a specific outcome. There is a GitHub repository with a commit history and a working pipeline that a hiring manager can actually look at.</p>
<p>Learning agility is not about knowing many tools shallowly. It's about picking up new tools quickly because you deeply understand the underlying concepts. Tool names change every few years. Concepts (networking, automation, observability, reliability) do not.</p>
<p>To build a concrete track record of learning agility: once a month, pick one tool you haven't used. Follow its quick-start guide. Build something small. Document what was difficult. Post about it. This is your learning agility portfolio: visible, dated, and specific.</p>
<h2 id="heading-your-90-day-action-plan">Your 90-Day Action Plan</h2>
<p>Here is a concrete, sequential plan that takes you from where you are now to being ready for your first DevOps interview.</p>
<h3 id="heading-month-1-build-your-foundation">Month 1: Build Your Foundation</h3>
<p>Focus entirely on Project 1 from the Proof of Work section. Build it completely. Deploy it. Get the live URL. Don't start Project 2 until Project 1 meets all six checklist criteria.</p>
<p>Alongside the build: 30 minutes of Linux and Bash scripting practice daily. This isn't optional. It's the foundation everything else runs on.</p>
<h3 id="heading-month-2-expand-your-execution-and-start-your-visibility">Month 2: Expand Your Execution and Start Your Visibility</h3>
<p>Begin Project 2 (Terraform IaC). Write your first LinkedIn post. It doesn't need to be polished, but it does need to be specific. Join one community and introduce yourself.</p>
<h3 id="heading-month-3-complete-the-portfolio-and-document-everything">Month 3: Complete the Portfolio and Document Everything</h3>
<p>Finish all three projects to full checklist standard. Polish every README. Add architecture diagrams. Optimize your GitHub profile: pin your three best repos, write a profile README that describes who you are and what you build, and add links to your live project URLs.</p>
<h3 id="heading-month-4-onward-apply-with-strategy">Month 4 Onward: Apply with Strategy</h3>
<p>Don't start applying before month four. Apply with real proof of work in hand. Target five to ten quality applications per week rather than spraying a hundred. Include your GitHub and your best project's live URL in every application. For roles at companies where you have a community connection, reach out to that person before applying.</p>
<p>Track every application in a spreadsheet: company, role, date applied, status, outcome, notes. After thirty applications, you'll have enough data to see what's working and what isn't.</p>
<p>Here's the full 90-day breakdown:</p>
<table>
<thead>
<tr>
<th>Timeframe</th>
<th>Focus</th>
<th>Milestone</th>
</tr>
</thead>
<tbody><tr>
<td>Week 1-2</td>
<td>Linux fundamentals. Set up GitHub profile. Start Project 1.</td>
<td>Foundation</td>
</tr>
<tr>
<td>Week 3-4</td>
<td>Complete Project 1 CI/CD pipeline. Deploy. Get live URL. Write README.</td>
<td>First Proof of Work</td>
</tr>
<tr>
<td>Month 2</td>
<td>Begin Project 2. First LinkedIn post. Join one community.</td>
<td>Visibility begins</td>
</tr>
<tr>
<td>Month 2-3</td>
<td>Complete Project 2. Scaffold monitoring (Project 3). Post weekly on LinkedIn.</td>
<td>Building momentum</td>
</tr>
<tr>
<td>Month 3</td>
<td>Finish all 3 projects to checklist standard. Polish READMEs and GitHub profile.</td>
<td>Portfolio complete</td>
</tr>
<tr>
<td>Month 4+</td>
<td>Apply strategically. Continue posting and community engagement.</td>
<td>Active job search</td>
</tr>
</tbody></table>
<h2 id="heading-honest-self-assessment-where-do-you-stand">Honest Self-Assessment: Where Do You Stand?</h2>
<p>Go through each statement below. Be completely honest: this is for you, not anyone else.</p>
<table>
<thead>
<tr>
<th>Statement</th>
<th>Action if the answer is No</th>
</tr>
</thead>
<tbody><tr>
<td>I can explain a web request end-to-end (DNS → load balancer → compute → database → logs)</td>
<td>Study Factor 2 until you can draw this from memory</td>
</tr>
<tr>
<td>I have at least one deployed project with a live URL</td>
<td>This is Priority 1. Nothing else matters more right now.</td>
</tr>
<tr>
<td>My best project has a CI/CD pipeline that auto-deploys on push</td>
<td>Add this to your existing project this week</td>
</tr>
<tr>
<td>I have written infrastructure as code (Terraform or CloudFormation)</td>
<td>Project 2 is your next build target</td>
</tr>
<tr>
<td>My projects have READMEs that explain architecture and decisions</td>
<td>Spend one hour today rewriting your README</td>
</tr>
<tr>
<td>I have posted about my learning on LinkedIn in the last 30 days</td>
<td>Post something today: document what you built last week</td>
</tr>
<tr>
<td>I am part of at least one DevOps community</td>
<td>Join r/devops or an AWS Discord server this week</td>
</tr>
<tr>
<td>I can write a Bash script that solves a real automation problem</td>
<td>30 minutes of daily scripting practice for the next 30 days</td>
</tr>
<tr>
<td>I can explain what I built, why I made each decision, and what I'd change</td>
<td>Practice saying this out loud about each project until it's fluent</td>
</tr>
</tbody></table>
<p>Count your "no" answers. Each one is a specific, actionable gap, not a vague sense of being behind. That's the difference between this self-assessment and the anxious feeling of "I'm not ready yet." You're not behind. You just have a prioritized list of what to build next.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Here's what you know now that most beginners still don't:</p>
<p>The gap between you and a DevOps job isn't a gap in certifications, a gap in courses completed, or a gap in the number of tools you've heard about. It's a gap in proof of work, visibility, and the consistency with which you execute.</p>
<p>Hiring managers aren't looking for someone who has watched everything. They're looking for someone who has built something, documented it, deployed it, monitored it, and can clearly explain every decision they made along the way.</p>
<p>The path isn't secret. It's just work. Build two to three complete projects that meet the full checklist. Document everything. Show up consistently in communities and on LinkedIn. Apply with strategy. Iterate based on feedback.</p>
<p>If you want a production-grade reference to support your DevOps journey, complete with real Terraform modules, CI/CD workflow templates, infrastructure runbooks, and platform engineering patterns used in real startup environments, <a href="https://coachli.co/tolani-akintayo/PR-H4oQS">The Startup DevOps Field Guide</a> was built for exactly this stage of your career.</p>
<p>The information gap between you and your first DevOps role is smaller than you think. The execution gap is where the work is. Start today.</p>
<h2 id="heading-references-and-recommended-resources">References and Recommended Resources</h2>
<ul>
<li><p><a href="https://roadmap.sh/devops">roadmap.sh/devops</a>: The community-maintained DevOps learning roadmap. Use this to sequence what you learn next and avoid random jumps between topics.</p>
</li>
<li><p><a href="https://dora.dev">DORA State of DevOps Report</a>: Free annual report on what DevOps practices actually improve software delivery performance. Gives you the vocabulary hiring managers speak.</p>
</li>
<li><p><a href="https://training.linuxfoundation.org/training/introduction-to-linux/">Linux Foundation - Introduction to Linux</a>: Free introductory Linux course. If the terminal still makes you nervous, start here.</p>
</li>
<li><p><a href="https://itrevolution.com/product/the-phoenix-project/">The Phoenix Project</a>: A business novel about DevOps transformation. Teaches core concepts through story. Gives you vocabulary for business-aware conversations.</p>
</li>
<li><p><a href="http://ExplainShell.com">ExplainShell.com</a>: Paste any command you find online and see exactly what every part does. Use this constantly while building your projects.</p>
</li>
<li><p><a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-readmes">GitHub - How to Write a Good README</a>: Official GitHub guidance on repository documentation.</p>
</li>
<li><p><a href="https://prometheus.io/docs/introduction/overview/">Prometheus Documentation</a>: Official docs for the monitoring tool used in Project 3.</p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/tutorials/aws-get-started">Terraform Getting Started - AWS</a>: Official step-by-step guide for Project 2.</p>
</li>
<li><p><a href="https://docs.github.com/en/actions">GitHub Actions Documentation</a>: Complete reference for building CI/CD pipelines in Project 1.</p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/learn-linux-for-beginners-book-basic-to-advanced/">freeCodeCamp - Learn Linux for Beginners</a>: Comprehensive Linux guide available on freeCodeCamp.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Full-Stack Next.js App on Cloudflare Workers with GitHub Actions CI/CD ]]>
                </title>
                <description>
                    <![CDATA[ I typically build my projects using Next.js 14 (App Router) and Supabase for authentication along with Postgres. The default deployment choice for a Next.js app is usually Vercel, and for good reason: ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-deploy-a-full-stack-next-js-app-on-cloudflare-workers-with-github-actions-ci-cd/</link>
                <guid isPermaLink="false">69f2145e6e0124c05e1a5b6e</guid>
                
                    <category>
                        <![CDATA[ Next.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloudflare ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GitHub Actions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md Tarikul Islam ]]>
                </dc:creator>
                <pubDate>Wed, 29 Apr 2026 14:23:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/cbb9e559-baa7-452c-992a-3416041712ad.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I typically build my projects using Next.js 14 (App Router) and Supabase for authentication along with Postgres. The default deployment choice for a Next.js app is usually Vercel, and for good reason: it provides an excellent developer experience.</p>
<p>But I wanted to see how Cloudflare Workers compared as an alternative. After running the same project on both platforms for about a week, I noticed lower latency (TTFB) on Cloudflare and found the free tier to be more flexible for my use case.</p>
<p>Deploying Next.js apps on Cloudflare used to be challenging. Earlier solutions like Cloudflare Pages had limitations with full Next.js features, and tools like <code>next-on-pages</code> often lagged behind the latest releases.</p>
<p>That changed with the introduction of <a href="https://opennext.js.org/cloudflare"><code>@opennextjs/cloudflare</code></a>. It allows you to compile a standard Next.js application into a Cloudflare Worker, supporting features like SSR, ISR, middleware, and the Image component – all without requiring major code changes.</p>
<p>In this guide, I’ll walk you through the exact steps I used to deploy my full-stack Next.js + Supabase application to Cloudflare Workers.</p>
<p>This article is the runbook I wish I had when I started.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-choose-cloudflare-workers-over-vercel">Why Choose Cloudflare Workers Over Vercel?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-stack">The Stack</a></p>
</li>
<li><p><a href="#heading-step-1-install-the-cloudflare-adapter">Step 1 — Install the Cloudflare Adapter</a></p>
</li>
<li><p><a href="#heading-step-2-wire-opennext-into-next-dev">Step 2 — Wire OpenNext into next dev</a></p>
</li>
<li><p><a href="#heading-step-3-local-environment-setup-with-devvars">Step 3 — Local Environment Setup with .dev.vars</a></p>
</li>
<li><p><a href="#heading-step-4-deploy-your-app-from-your-local-machine">Step 4 — Deploy Your App from Your Local Machine</a></p>
</li>
<li><p><a href="#heading-step-5-push-your-secrets-to-the-worker">Step 5 — Push Your Secrets to the Worker</a></p>
</li>
<li><p><a href="#heading-step-6-set-up-continuous-deployment-with-github-actions">Step 6 — Set Up Continuous Deployment with GitHub Actions</a></p>
</li>
<li><p><a href="#heading-step-7-updating-the-project-the-daily-workflow">Step 7 — Updating the project (the daily workflow)</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final thoughts</a></p>
</li>
</ul>
<h2 id="heading-why-choose-cloudflare-workers-over-vercel">Why Choose Cloudflare Workers Over Vercel?</h2>
<p>When deploying a Next.js application, Vercel is often the default choice. It offers a smooth developer experience and tight integration with Next.js.</p>
<p>But Cloudflare Workers provides a compelling alternative, especially when you care about global performance and cost efficiency.</p>
<p>Here’s a high-level comparison (at the time of writing):</p>
<table>
<thead>
<tr>
<th>Concern</th>
<th>Vercel (Hobby)</th>
<th>Cloudflare Workers (Free Tier)</th>
</tr>
</thead>
<tbody><tr>
<td>Requests</td>
<td>Fair usage limits</td>
<td>Millions of requests per day</td>
</tr>
<tr>
<td>Cold starts</td>
<td>~100–300 ms (region-based)</td>
<td>Near-zero (V8 isolates)</td>
</tr>
<tr>
<td>Edge locations</td>
<td>Limited regions for SSR</td>
<td>300+ global edge locations</td>
</tr>
<tr>
<td>Bandwidth</td>
<td>~100 GB/month (soft cap)</td>
<td>Generous / no strict cap on free tier</td>
</tr>
<tr>
<td>Custom domains</td>
<td>Supported</td>
<td>Supported</td>
</tr>
<tr>
<td>Image optimization</td>
<td>Counts toward usage</td>
<td>Available via <code>IMAGES</code> binding</td>
</tr>
<tr>
<td>Pricing beyond free</td>
<td>Starts at ~$20/month</td>
<td>Low-cost, usage-based pricing</td>
</tr>
</tbody></table>
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ul>
<li><p><strong>Lower latency globally</strong>: Cloudflare runs your app across hundreds of edge locations, reducing response time for users worldwide.</p>
</li>
<li><p><strong>Minimal cold starts</strong>: Thanks to V8 isolates, functions start almost instantly.</p>
</li>
<li><p><strong>Cost efficiency</strong>: The free tier is generous enough for portfolios, blogs, and many small-to-medium apps.</p>
</li>
</ul>
<h3 id="heading-trade-offs-to-consider">Trade-offs to Consider</h3>
<p>Cloudflare Workers use a V8 isolate runtime, not a full Node.js environment. That means:</p>
<ul>
<li><p>Some Node.js APIs like <code>fs</code> or <code>child_process</code> aren't available</p>
</li>
<li><p>Native binaries or certain libraries may not work</p>
</li>
</ul>
<p>That said, for most modern stacks – like Next.js + Supabase + Stripe + Resend – this limitation is rarely an issue.</p>
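<p>If you're not sure whether your own code touches those Node-only modules, a rough scan can give you a first signal. This is only a sketch: the function name is hypothetical, the regular expression only catches static <code>import ... from</code> and <code>require()</code> calls, and it doesn't look inside your dependencies.</p>
<pre><code class="language-js">// The two built-in modules named above; extend the list for your project.
const NODE_ONLY = ["fs", "child_process"];

// Return the Node-only module names that `source` imports or requires.
function findNodeOnlyImports(source) {
  const found = [];
  for (const name of NODE_ONLY) {
    // Matches from "fs", require("fs"), the "node:fs" prefix form,
    // and subpaths such as "fs/promises".
    const pattern = new RegExp(
      "(from\\s+|require\\()[\"'](node:)?" + name + "(/[^\"']*)?[\"']"
    );
    if (pattern.test(source)) found.push(name);
  }
  return found;
}

const sample = 'import { readFile } from "node:fs/promises";';
console.log(findNodeOnlyImports(sample)); // logs an array containing "fs"
</code></pre>
<p>In practice you would read each source file and pass its contents to the function. An empty result doesn't prove your app will run on Workers, but a non-empty one tells you where to look first.</p>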
<p>In short, choose <strong>Vercel</strong> if you want the simplest, plug-and-play Next.js deployment. Choose <strong>Cloudflare Workers</strong> if you want better edge performance and more flexible scaling.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before getting started, make sure you have the following set up. Most of these take only a few minutes:</p>
<ul>
<li><p><strong>Node.js 18+</strong> and <strong>pnpm 9+</strong> (you can also use npm or yarn, but this guide uses pnpm.)</p>
</li>
<li><p>A <strong>Cloudflare account</strong> 👉 <a href="https://dash.cloudflare.com/sign-up">https://dash.cloudflare.com/sign-up</a></p>
</li>
<li><p>A <strong>Supabase account</strong> (if your app uses a database) 👉 <a href="https://supabase.com">https://supabase.com</a></p>
</li>
<li><p>A <strong>GitHub repository</strong> for your project (required later for CI/CD setup)</p>
</li>
<li><p>A <strong>domain name</strong> (optional) – You’ll get a free <code>*.workers.dev</code> URL by default.</p>
</li>
</ul>
<h3 id="heading-install-wrangler-cloudflare-cli">Install Wrangler (Cloudflare CLI)</h3>
<p>We’ll use Wrangler to build and deploy the application:</p>
<pre><code class="language-bash">pnpm add -D wrangler
</code></pre>
<h2 id="heading-the-stack">The Stack</h2>
<p>Here’s the tech stack used in this project:</p>
<ul>
<li><p><strong>Next.js (v14.2.x):</strong> Using the App Router with Edge runtime for both public and dashboard routes</p>
</li>
<li><p><strong>Supabase:</strong> Handles authentication, Postgres database, and Row-Level Security (RLS)</p>
</li>
<li><p><strong>Tailwind CSS</strong> + UI utilities: For styling, along with lightweight animation using Framer Motion</p>
</li>
<li><p><strong>Cloudflare Workers:</strong> Deployment powered by <code>@opennextjs/cloudflare</code> and <code>wrangler</code></p>
</li>
<li><p><strong>GitHub Actions:</strong> Used to automate CI/CD and deployments</p>
</li>
</ul>
<p><strong>Note:</strong> If you're using Next.js <strong>15 or later</strong>, you can remove the <code>--dangerouslyUseUnsupportedNextVersion</code> flag from the build script, as it's only required for certain Next.js 14 setups.</p>
<h2 id="heading-step-1-install-the-cloudflare-adapter">Step 1 — Install the Cloudflare Adapter</h2>
<p>From inside your existing Next.js project, install the OpenNext adapter along with Wrangler (Cloudflare’s CLI tool):</p>
<pre><code class="language-bash">pnpm add @opennextjs/cloudflare
pnpm add -D wrangler
</code></pre>
<p>Then add the deploy scripts to <code>package.json</code>:</p>
<pre><code class="language-jsonc">{
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint",

    "cloudflare-build": "opennextjs-cloudflare build --dangerouslyUseUnsupportedNextVersion",
    "preview":          "pnpm cloudflare-build &amp;&amp; opennextjs-cloudflare preview",
    "deploy":           "pnpm cloudflare-build &amp;&amp; wrangler deploy",
    "upload":           "pnpm cloudflare-build &amp;&amp; opennextjs-cloudflare upload",
    "cf-typegen":       "wrangler types --env-interface CloudflareEnv cloudflare-env.d.ts"
  }
}
</code></pre>
<p>What each script does:</p>
<table>
<thead>
<tr>
<th>Script</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>pnpm cloudflare-build</code></td>
<td>Compiles your Next app into <code>.open-next/</code> (the Worker bundle). No upload.</td>
</tr>
<tr>
<td><code>pnpm preview</code></td>
<td>Builds and runs the Worker locally with <code>wrangler dev</code>. Closest thing to prod.</td>
</tr>
<tr>
<td><code>pnpm deploy</code></td>
<td>Builds and uploads to Cloudflare. <strong>This ships to production.</strong></td>
</tr>
<tr>
<td><code>pnpm upload</code></td>
<td>Builds and uploads a <em>new version</em> without promoting it (for staged rollouts).</td>
</tr>
<tr>
<td><code>pnpm cf-typegen</code></td>
<td>Regenerates <code>cloudflare-env.d.ts</code> types after editing <code>wrangler.jsonc</code>.</td>
</tr>
</tbody></table>
<p><strong>Heads up:</strong> the Pages-based <code>@cloudflare/next-on-pages</code> is a different tool. We are <strong>not</strong> using Pages — we're deploying as a real Worker. Don't mix the two.</p>
<h2 id="heading-step-2-wire-opennext-into-next-dev">Step 2 — Wire OpenNext into <code>next dev</code></h2>
<p>So that <code>pnpm dev</code> can read your Cloudflare bindings (env vars, R2, KV, D1, …) the same way production will, edit <code>next.config.mjs</code>:</p>
<pre><code class="language-js">/** @type {import('next').NextConfig} */
const nextConfig = {};

if (process.env.NODE_ENV !== "production") {
  const { initOpenNextCloudflareForDev } = await import(
    "@opennextjs/cloudflare"
  );
  initOpenNextCloudflareForDev();
}

export default nextConfig;
</code></pre>
<p>We only call it in development so <code>next build</code> stays fast and CI doesn't spin up a Miniflare instance for nothing.</p>
<h2 id="heading-step-3-local-environment-setup-with-devvars">Step 3 — Local Environment Setup with <code>.dev.vars</code></h2>
<p>When working with Cloudflare Workers locally, Wrangler uses a file called <code>.dev.vars</code> to store environment variables (instead of <code>.env.local</code> used by Next.js).</p>
<p>A simple and reliable approach is to keep an example file in your repo and ignore the real one.</p>
<h3 id="heading-example-devvarsexample-committed">Example: <code>.dev.vars.example</code> (committed)</h3>
<pre><code class="language-bash">NEXT_PUBLIC_SUPABASE_URL="https://YOUR-PROJECT-ref.supabase.co"
NEXT_PUBLIC_SUPABASE_ANON_KEY="YOUR-ANON-KEY"
NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL="admin@example.com"
</code></pre>
<h3 id="heading-set-up-your-local-environment">Set Up Your Local Environment</h3>
<p>Run the following commands:</p>
<pre><code class="language-plaintext">cp .dev.vars.example .dev.vars
cp .dev.vars .env.local
</code></pre>
<ul>
<li><p><code>.dev.vars</code> is used by Wrangler (<code>wrangler dev</code>)</p>
</li>
<li><p><code>.env.local</code> is used by Next.js (<code>next dev</code>)</p>
</li>
</ul>
<h3 id="heading-why-use-both-files">Why Use Both Files?</h3>
<ul>
<li><p><code>next dev</code> reads from <code>.env.local</code></p>
</li>
<li><p><code>wrangler dev</code> (used in <code>pnpm preview</code>) reads from <code>.dev.vars</code></p>
</li>
</ul>
<p>Keeping both files in sync ensures your app behaves consistently in development and when running in the Cloudflare runtime.</p>
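<p>One way to catch drift between the two files is to compare their variable names. The helper below is a sketch, not something Wrangler or Next.js provides: the function names are hypothetical, and it assumes plain <code>KEY=value</code> lines with <code>#</code> comments.</p>
<pre><code class="language-js">// Collect the variable names from KEY=value text, skipping blanks and comments.
function envKeys(text) {
  const keys = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue;
    keys.push(line.slice(0, eq).trim());
  }
  return keys;
}

// Names defined in `a` that are missing from `b`.
function missingKeys(a, b) {
  const inB = new Set(envKeys(b));
  return envKeys(a).filter(function (key) {
    return !inB.has(key);
  });
}

// Inline strings stand in for the two files here. In a real project you
// would load them with fs.readFileSync(".dev.vars", "utf8") and so on.
const devVars = 'NEXT_PUBLIC_SUPABASE_URL="x"\nNEXT_PUBLIC_SUPABASE_ANON_KEY="y"';
const envLocal = 'NEXT_PUBLIC_SUPABASE_URL="x"';
console.log(missingKeys(devVars, envLocal)); // the one key missing from envLocal
</code></pre>
<p>Running the comparison in both directions shows which file is behind. It only compares names, not values, so the secrets themselves never get printed.</p>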
<h3 id="heading-update-gitignore">Update <code>.gitignore</code></h3>
<p>Make sure these files are ignored:</p>
<pre><code class="language-plaintext">.dev.vars
.env*.local
.open-next
.wrangler
</code></pre>
<h2 id="heading-step-4-deploy-your-app-from-your-local-machine">Step 4 — Deploy Your App from Your Local Machine</h2>
<p>Once <code>pnpm preview</code> is working correctly, you're ready to deploy your application:</p>
<pre><code class="language-bash">pnpm deploy
</code></pre>
<p>Under the hood that runs:</p>
<pre><code class="language-bash">pnpm cloudflare-build &amp;&amp; wrangler deploy
</code></pre>
<p>The first time you run it, this command will:</p>
<ol>
<li><p>Compile your app to <code>.open-next/worker.js</code>.</p>
</li>
<li><p>Upload the script + assets to Cloudflare.</p>
</li>
<li><p>Print your live URL, e.g. <code>https://portfolio.&lt;your-account&gt;.workers.dev</code>.</p>
</li>
</ol>
<p>Open it in a browser. Congratulations: you're now running on Cloudflare's edge in 330+ cities. For most visitors, the page should typically be served with a TTFB under <strong>100 ms</strong>.</p>
<p><a href="https://portfolio.tarikuldev.workers.dev/">Here's the live version of my own portfolio deployed this way</a>.</p>
<h2 id="heading-step-5-push-your-secrets-to-the-worker">Step 5 — Push Your Secrets to the Worker</h2>
<p>Local <code>.dev.vars</code> is <strong>not</strong> uploaded by <code>wrangler deploy</code>. You have to push secrets explicitly:</p>
<pre><code class="language-bash">wrangler secret put NEXT_PUBLIC_SUPABASE_URL
wrangler secret put NEXT_PUBLIC_SUPABASE_ANON_KEY
wrangler secret put NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL
</code></pre>
<p>Each command prompts you for the value and stores it encrypted on Cloudflare. Or do it visually:</p>
<blockquote>
<p>Cloudflare Dashboard → <strong>Workers &amp; Pages</strong> → your worker → <strong>Settings</strong> → <strong>Variables and Secrets</strong> → <strong>Add</strong>.</p>
</blockquote>
<p>Important: <code>NEXT_PUBLIC_*</code> vars are inlined into the client bundle at build time, so they also need to be available when <code>pnpm cloudflare-build</code> runs (locally, that's your <code>.env.local</code>; in CI, see Step 6).</p>
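<p>Because a missing <code>NEXT_PUBLIC_*</code> value only shows up as a broken page after deploy, it can help to fail fast before the build starts. Here's a minimal sketch of such a check, intended for CI where the values arrive through <code>process.env</code>. The function name is hypothetical, and the variable list is simply the one from <code>.dev.vars.example</code> above:</p>
<pre><code class="language-js">// Variable names taken from the .dev.vars.example file in this guide.
const REQUIRED_VARS = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL",
];

// Return the names in `required` that are unset or empty in `env`.
function findMissingVars(env, required) {
  return required.filter(function (name) {
    const value = env[name];
    return value === undefined || value === "";
  });
}

// Example with a partial environment; in CI you would pass process.env.
const partialEnv = { NEXT_PUBLIC_SUPABASE_URL: "https://example.supabase.co" };
const missing = findMissingVars(partialEnv, REQUIRED_VARS);
console.log("Missing build-time variables: " + missing.join(", "));
</code></pre>
<p>In a real pre-build script, you would pass <code>process.env</code> instead of the sample object and set <code>process.exitCode = 1</code> when the list isn't empty, so the job stops before building. Note that a plain Node script doesn't load <code>.env.local</code> on its own, so this is most useful in CI.</p>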
<h2 id="heading-step-6-set-up-continuous-deployment-with-github-actions">Step 6 — Set Up Continuous Deployment with GitHub Actions</h2>
<p>Once your local deployment is working, the next step is automating deployments so every push to the <code>main</code> branch updates production automatically.</p>
<p>With this workflow:</p>
<ul>
<li><p>Pull requests will run validation checks</p>
</li>
<li><p>Production deploys only happen after successful builds</p>
</li>
<li><p>Code that fails linting or the build never reaches your live site</p>
</li>
</ul>
<p>Create the following file inside your project:</p>
<p><code>.github/workflows/deploy.yml</code></p>
<pre><code class="language-yaml">name: CI / Deploy to Cloudflare Workers

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:

concurrency:
  group: cloudflare-deploy-${{ github.ref }}
  cancel-in-progress: true

jobs:
  verify:
    name: Lint and Build
    runs-on: ubuntu-latest
    timeout-minutes: 10

    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 10

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm

      - run: pnpm install --frozen-lockfile
      - run: pnpm lint
      - run: pnpm build
        env:
          NEXT_PUBLIC_SUPABASE_URL: ${{ secrets.NEXT_PUBLIC_SUPABASE_URL }}
          NEXT_PUBLIC_SUPABASE_ANON_KEY: ${{ secrets.NEXT_PUBLIC_SUPABASE_ANON_KEY }}
          NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL: ${{ secrets.NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL }}

  deploy:
    name: Deploy to Cloudflare Workers
    needs: verify
    if: github.event_name == 'push' &amp;&amp; github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 10

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm

      - run: pnpm install --frozen-lockfile

      - name: Build and Deploy
        run: pnpm run deploy
        env:
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          NEXT_PUBLIC_SUPABASE_URL: ${{ secrets.NEXT_PUBLIC_SUPABASE_URL }}
          NEXT_PUBLIC_SUPABASE_ANON_KEY: ${{ secrets.NEXT_PUBLIC_SUPABASE_ANON_KEY }}
          NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL: ${{ secrets.NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL }}
</code></pre>
<h3 id="heading-required-github-repo-secrets">Required GitHub repo secrets</h3>
<p>Go to GitHub repo → Settings → Secrets and variables → Actions → New repository secret and add:</p>
<table>
<thead>
<tr>
<th>Secret</th>
<th>Where to get it</th>
</tr>
</thead>
<tbody><tr>
<td><code>CLOUDFLARE_API_TOKEN</code></td>
<td><a href="https://dash.cloudflare.com/profile/api-tokens">https://dash.cloudflare.com/profile/api-tokens</a> → "Edit Cloudflare Workers" template</td>
</tr>
<tr>
<td><code>CLOUDFLARE_ACCOUNT_ID</code></td>
<td>Cloudflare dashboard → right sidebar, "Account ID"</td>
</tr>
<tr>
<td><code>CLOUDFLARE_ACCOUNT_SUBDOMAIN</code></td>
<td>Your <code>*.workers.dev</code> subdomain (used only for the deployment URL link)</td>
</tr>
<tr>
<td><code>NEXT_PUBLIC_SUPABASE_URL</code></td>
<td>Supabase project settings</td>
</tr>
<tr>
<td><code>NEXT_PUBLIC_SUPABASE_ANON_KEY</code></td>
<td>Supabase project settings</td>
</tr>
<tr>
<td><code>NEXT_PUBLIC_DASHBOARD_DEFAULT_EMAIL</code></td>
<td>Email pre-filled on <code>/dashboard/login</code></td>
</tr>
</tbody></table>
<p>That's it. Push it to <code>main</code> and it'll go live in about 90 seconds. PRs run lint and build only, so broken code never reaches production.</p>
<h2 id="heading-step-7-updating-the-project-the-daily-workflow">Step 7 — Updating the Project (the Daily Workflow)</h2>
<p>After the initial setup, the loop is boringly simple — which is the whole point. Here's what I actually do day-to-day:</p>
<h3 id="heading-code-change">Code Change</h3>
<pre><code class="language-bash">git checkout -b feat/new-section
# ...edit files...
pnpm dev                # iterate locally
pnpm preview            # final smoke test on the Worker runtime
git commit -am "feat: add new section"
git push origin feat/new-section
</code></pre>
<p>Open a PR and the <strong>verify</strong> job runs. Then review and merge, and the <strong>deploy</strong> job ships to Cloudflare automatically.</p>
<h3 id="heading-updating-env-vars-secrets">Updating env Vars / Secrets</h3>
<pre><code class="language-bash"># Local
nano .dev.vars

# Production
wrangler secret put NEXT_PUBLIC_SUPABASE_URL
# ...etc.
</code></pre>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>When I started this migration, I was nervous about leaving Vercel — the Next.js DX there is genuinely excellent. But the moment you push beyond a hobby site, Cloudflare's economics and edge performance are not close.</p>
<p>With <code>@opennextjs/cloudflare</code>, the developer experience has also caught up: my <code>pnpm dev</code> loop is identical, my <code>pnpm preview</code> mimics production, and <code>git push</code> deploys globally in ~90 seconds.</p>
<p>If you've been holding off because the old Cloudflare Pages + Next.js story was rough, that era is over. Try this runbook on a side project this weekend and see for yourself.</p>
<p>If you found this useful, the full repo is <a href="./">here</a> — feel free to clone it as a starter.</p>
<p>Happy shipping.</p>
<p>— <em>Tarikul</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP ]]>
                </title>
                <description>
                    <![CDATA[ Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-gpu-optimized-machine-image-with-hashicorp-packer-on-gcp/</link>
                <guid isPermaLink="false">69e93606d5f8830e7d9fbad6</guid>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ VM Image ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hashicorp packer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rasheedat Atinuke Jamiu ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 20:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/fd393878-fe7c-458a-addf-7cd22d8280ac.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensive cloud credits and getting frustrated before actual work begins.</p>
<p>In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a href="#heading-step-1-install-packer">Step 1: Install Packer</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</a></p>
</li>
<li><p><a href="#heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</a></p>
</li>
<li><p><a href="#heading-step-4-define-your-source">Step 4: Define Your Source</a></p>
</li>
<li><p><a href="#heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</a></p>
</li>
<li><p><a href="#heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</a></p>
<ul>
<li><p><a href="#heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</a></p>
</li>
<li><p><a href="#heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</a></p>
</li>
<li><p><a href="#heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</a></p>
</li>
<li><p><a href="#heading-section-4-installing-the-driver">Section 4: Installing the Driver</a></p>
</li>
<li><p><a href="#heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</a></p>
</li>
<li><p><a href="#heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</a></p>
</li>
<li><p><a href="#heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</a></p>
</li>
<li><p><a href="#heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</a></p>
</li>
<li><p><a href="#heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-7assembling-and-running-the-build">Step 7: Assembling and Running the Build</a></p>
</li>
<li><p><a href="#heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><a href="https://www.packer.io/">HashiCorp Packer</a> &gt;= 1.9</p>
</li>
<li><p><a href="https://github.com/hashicorp/packer-plugin-googlecompute">Google Compute Packer plugin</a> (installed via <code>packer init</code>)</p>
</li>
<li><p>Optionally, the <a href="https://github.com/hashicorp/packer-plugin-amazon">AWS Packer plugin</a> can be used for EC2 builds by adding an <code>amazon-ebs</code> source to <code>node.pkr.hcl</code></p>
</li>
<li><p>GCP project with Compute Engine API enabled (or AWS account with EC2 access)</p>
</li>
<li><p>GCP authentication (<code>gcloud auth application-default login</code>) or AWS credentials</p>
</li>
<li><p>Access to an NVIDIA GPU instance type (for example, A100, H100, L4 on GCP; p4d, p5, g6 on AWS)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-step-1-install-packer">Step 1: Install Packer</h3>
<p>To get started, you'll install Packer with the steps below if you're on macOS (or you can follow the official documentation for Linux and Windows installation <a href="https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli#:~:text=Chocolatey%20on%20Windows-,Linux,-HashiCorp%20officially%20maintains">guides</a>).</p>
<p>First, you'll install the official Packer formula from the terminal.</p>
<p>Install the HashiCorp tap, a repository of all HashiCorp packages.</p>
<pre><code class="language-plaintext">$ brew tap hashicorp/tap
</code></pre>
<p>Now, install Packer with <code>hashicorp/tap/packer</code>.</p>
<pre><code class="language-plaintext">$ brew install hashicorp/tap/packer
</code></pre>
<h3 id="heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</h3>
<p>With Packer installed, you'll create your project directory. For clean code and separation of concerns, your project directory should look like the below. Go ahead and create these files in your <code>packer_demo</code> folder using the command below:</p>
<pre><code class="language-plaintext">mkdir -p packer_demo/script &amp;&amp; touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh
</code></pre>
<p>Your file directory should look like this:</p>
<pre><code class="language-plaintext">packer_demo
├── build.pkr.hcl                 # Build pipeline — provisioner ordering
├── source.pkr.hcl                # GCP source definition (googlecompute)
├── variable.pkr.hcl              # Variable definitions with defaults
├── local.pkr.hcl                 # Local values
├── plugins.pkr.hcl               # Packer plugin requirements
├── values.pkrvars.hcl            # Variable values (copy and customize)
└── script/
    └── base.sh                   # GPU provisioning script
</code></pre>
<h3 id="heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</h3>
<p>In your <code>plugins.pkr.hcl</code> file, define your plugins in the <code>packer</code> block. The <code>packer {}</code> block contains Packer settings, including required plugin versions. Inside it, the <code>required_plugins</code> block lists every plugin the template needs to build your image. If you're on Azure or AWS, you can check for the latest plugin <a href="https://developer.hashicorp.com/packer/integrations">here</a>.</p>
<pre><code class="language-hcl">packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~&gt; 1"
    }
  }
}
</code></pre>
<p>Then, initialize your Packer plugin with the command below:</p>
<pre><code class="language-plaintext">packer init .
</code></pre>
<h3 id="heading-step-4-define-your-source">Step 4: Define Your Source</h3>
<p>With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. This source block contains your <code>project_id</code>, the <code>zone</code> where your machine will be created, the <code>source_image_family</code> (think of this as your base image, such as Debian, Ubuntu, and so on), and your <code>source_image_project_id</code>.</p>
<p>In GCP, each image family belongs to an image project, such as <code>ubuntu-os-cloud</code> for Ubuntu. You'll set <code>machine_type</code> to a GPU machine type because you're building a base image for GPU machines, so the temporary build instance needs to be able to run your GPU commands.</p>
<pre><code class="language-hcl">source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}
</code></pre>
<p>Setting <code>on_host_maintenance = "TERMINATE"</code> on Google Cloud Compute Engine ensures that a VM instance stops instead of live-migrating during infrastructure maintenance. Instances with attached GPUs can't be live-migrated, so this setting is required for GPU builds.</p>
<p>You'll define all your variables in the <code>variable.pkr.hcl</code> file, and set the values in <code>values.pkrvars.hcl</code>. Remember to always add your <code>values.pkrvars.hcl</code> file to <code>.gitignore</code>.</p>
<pre><code class="language-hcl">variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The image family to which the resulting image belongs"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}
variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}
</code></pre>
<p><code>values.pkrvars.hcl</code></p>
<pre><code class="language-hcl">image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size         = 50
driver_version    = "590.48.01"
cuda_version      = "13.1"
</code></pre>
<h3 id="heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</h3>
<p>Create <code>build.pkr.hcl</code>. The <code>build</code> block creates a temporary instance, runs provisioners, and produces an image.</p>
<p>Provisioners in this template are organized as follows:</p>
<ul>
<li><p><strong>First provisioner</strong> runs system updates and upgrades.</p>
</li>
<li><p><strong>Second provisioner</strong> reboots the instance (<code>expect_disconnect = true</code>).</p>
</li>
<li><p><strong>Third provisioner</strong> waits for the instance to come back (<code>pause_before</code>), then runs <code>script/base.sh</code>. This provisioner sets <code>max_retries</code> to handle transient SSH timeouts and passes environment variables for <code>DRIVER_VERSION</code> and <code>CUDA_VERSION</code>.</p>
</li>
</ul>
<p>Lastly, you have the post-processor to tell you the image ID and completion status:</p>
<pre><code class="language-hcl">build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}
</code></pre>
<h3 id="heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</h3>
<p>Now we'll go through the base script, and break down some parts of it.</p>
<h3 id="heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</h3>
<p>Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.</p>
<pre><code class="language-shellscript">log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget
</code></pre>
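<p>Since a missing header tree is easy to overlook, one option is to check for it explicitly before the driver install. This is a rough sketch rather than part of the original script: <code>have_kernel_headers</code> is a hypothetical helper, and it relies on DKMS looking for the kernel build tree under <code>/lib/modules/&lt;release&gt;/build</code> by default.</p>
<pre><code class="language-shellscript"># Hypothetical guard (not in base.sh): succeeds when a kernel build tree
# exists for the given kernel release.
# $1 = kernel release, for example the output of uname -r
# $2 = optional root prefix, included only so the check is easy to test
have_kernel_headers() {
  release="$1"
  root="${2:-}"
  [ -e "${root}/lib/modules/${release}/build" ]
}

# Usage in a provisioning script:
# have_kernel_headers "$(uname -r)" || { echo "kernel headers missing"; exit 1; }
</code></pre>
<p>Failing here gives a clear message at build time instead of an image whose driver module never loads.</p>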
<h3 id="heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</h3>
<p>This snippet downloads and installs NVIDIA’s official keyring package for your Linux distribution, which adds the trusted signing keys needed for the system to verify CUDA packages.</p>
<pre><code class="language-shellscript">log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
</code></pre>
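<p>The repository path depends on the distribution string, which the full script derives from <code>/etc/os-release</code> by joining <code>ID</code> and <code>VERSION_ID</code> and dropping the dot. As an illustration of that derivation, here is a small sketch with the inputs passed in as arguments. The <code>cuda_keyring_url</code> name is hypothetical and exists only for this example.</p>
<pre><code class="language-shellscript"># Hypothetical helper (not in base.sh): builds the keyring URL from
# os-release style fields.
# $1 = ID (for example ubuntu), $2 = VERSION_ID (for example 24.04), $3 = arch
cuda_keyring_url() {
  id="$1"
  version_id="$2"
  arch="$3"
  # "ubuntu" plus "24.04" becomes "ubuntu2404": the repo path has no dot
  distro="${id}$(printf '%s' "${version_id}" | tr -d '.')"
  echo "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/cuda-keyring_1.1-1_all.deb"
}

cuda_keyring_url ubuntu 24.04 x86_64
</code></pre>
<p>For Ubuntu 24.04 on x86_64, the path segment becomes <code>ubuntu2404/x86_64</code>, which matches the values used in this article.</p>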
<h3 id="heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</h3>
<p>Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.</p>
<p>NVIDIA drivers are tightly coupled with CUDA toolkit versions, kernel versions, and container runtimes like Docker or the NVIDIA Container Toolkit.</p>
<p>A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.</p>
<pre><code class="language-shellscript">log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"
</code></pre>
<h3 id="heading-section-4-installing-the-driver">Section 4: Installing the Driver</h3>
<p>The <code>libnvidia-compute</code> package installs only the compute‑related user‑space libraries (CUDA driver components), while <code>nvidia-dkms-open</code> installs the <strong>open‑source NVIDIA kernel module</strong>, built locally via DKMS.</p>
<p>Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.</p>
<p>Here, we're using <strong>NVIDIA’s compute‑only driver stack with the open‑source kernel modules</strong>, because it deliberately avoids installing display-related components that a headless GPU node doesn't need.</p>
<p>This DKMS-based installation method fits well with how Linux distros manage kernel modules, and it keeps the image lightweight and compute-focused.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open
</code></pre>
<h3 id="heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</h3>
<p>This part of the script installs the <strong>CUDA Toolkit</strong> for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.</p>
<p>It adds CUDA binaries to <code>PATH</code>, so commands like <code>nvcc</code>, <code>cuda-gdb</code>, and <code>compute-sanitizer</code> work without specifying full paths. It also adds CUDA libraries to <code>LD_LIBRARY_PATH</code>, so applications can find CUDA’s shared libraries at runtime.</p>
<pre><code class="language-shellscript">log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
</code></pre>
<h3 id="heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</h3>
<p>This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi
</code></pre>
<h3 id="heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</h3>
<p>This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.</p>
<p>It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.</p>
<p>The script extracts the installed version and checks that it meets the <strong>minimum required version</strong> for NVIDIA driver 590+, then fails the build if it doesn't. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables Fabric Manager for NVLink/NVSwitch if you're on a multi‑GPU topology, such as A100/H100 DGX systems or other multi‑GPU servers.</p>
<pre><code class="language-shellscript">log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
</code></pre>
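<p>The major/minor comparison in the script works, but the same minimum-version gate can also be expressed with a version-aware sort. The sketch below is an alternative, not what <code>base.sh</code> does: <code>version_at_least</code> is a hypothetical helper, and it assumes a <code>sort</code> that supports <code>-V</code> (GNU coreutils on Ubuntu does).</p>
<pre><code class="language-shellscript"># Hypothetical helper (an alternative to the major/minor check in base.sh).
# Succeeds when version $1 is greater than or equal to the minimum $2.
version_at_least() {
  # sort -V orders version strings numerically per component,
  # so the first line is the lower of the two versions
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [ "${lowest}" = "$2" ]
}

# Usage:
# version_at_least "${DCGM_VER}" 4.3 || error "DCGM ${DCGM_VER} is below 4.3"
</code></pre>
<p>One benefit of this form is that multi-digit components compare numerically, so 4.10 is correctly treated as newer than 4.3 without splitting the string by hand.</p>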
<h3 id="heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</h3>
<p>Without persistence mode, the NVIDIA driver tears down its GPU state once the last client process exits. When a new workload starts, the driver must reinitialize the GPU and set up memory mappings again. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.</p>
<p>Enabling nvidia‑persistenced keeps the NVIDIA driver loaded in memory even when no GPU workloads are running.</p>
<pre><code class="language-shellscript">log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced
</code></pre>
<h3 id="heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</h3>
<p>This block applies a set of <strong>system‑level performance and stability tunings</strong> that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.</p>
<p>Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.</p>
<ul>
<li><p>Swap and memory behavior: Disabling swap and setting <code>vm.swappiness=0</code> prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.</p>
</li>
<li><p>Hugepages for large memory allocations: Setting <code>vm.nr_hugepages=2048</code> allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations.</p>
<p>CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.</p>
</li>
<li><p>CPU frequency governor: Installing <code>cpupower</code> and forcing the CPU governor to <code>performance</code> ensures the CPU stays at maximum frequency instead of scaling down.</p>
<p>GPU workloads often become CPU‑bound during data preprocessing, kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.</p>
</li>
<li><p>NUMA and topology tools: Installing <code>numactl</code>, <code>libnuma-dev</code>, and <code>hwloc</code> provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.</p>
</li>
<li><p>Disabling irqbalance: Stopping and disabling <code>irqbalance</code> lets the NVIDIA driver manage interrupt affinity. For GPU servers, irqbalance can incorrectly move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.</p>
</li>
</ul>
<pre><code class="language-shell">log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
</code></pre>
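<p>It's worth knowing how much memory the hugepage setting reserves, because that memory is no longer available for normal allocations. As a quick worked example, assuming the common x86_64 default hugepage size of 2048 KiB, the pool size can be computed like this. The <code>hugepage_pool_mib</code> helper is hypothetical and only illustrates the arithmetic.</p>
<pre><code class="language-shellscript"># Hypothetical helper: size of a hugepage pool in MiB.
# $1 = number of pages (vm.nr_hugepages), $2 = size of one page in KiB
hugepage_pool_mib() {
  echo $(( $1 * $2 / 1024 ))
}

# 2048 pages of 2048 KiB each
hugepage_pool_mib 2048 2048
</code></pre>
<p>With these inputs the result is 4096 MiB, so <code>vm.nr_hugepages=2048</code> sets aside about 4 GiB. You can read the actual page size on a given machine with <code>awk '/^Hugepagesize:/ { print $2 }' /proc/meminfo</code> and lower the page count on instances with less RAM.</p>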
<p>Here's the full <code>base.sh</code> script:</p>
<pre><code class="language-shell">#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" &gt;&amp;2; exit 1; }

###############################################################
# 0. Validate required inputs
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] &amp;&amp; error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] &amp;&amp; error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=$(. /etc/os-release &amp;&amp; echo "${ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"
</code></pre>
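<p>The DCGM check in section 8 compares the major and minor fields by hand. As a minimal sketch, the same gate can be written as a small helper built on GNU <code>sort -V</code>. The <code>version_ge</code> name is an assumption for illustration and isn't part of the original script:</p>
<pre><code class="language-shell">#!/bin/bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper. Relies on GNU sort -V (version sort), which Ubuntu ships.
version_ge() {
  local lowest
  # Sort both versions and take the lower one. If the lower one is B,
  # then A is at least B.
  lowest=$(printf '%s\n' "$1" "$2" | sort -V | head -n1)
  [[ "${lowest}" == "$2" ]]
}

# Example: gate on the 4.3 minimum that base.sh enforces
if version_ge "4.3.1" "4.3"; then
  echo "DCGM version OK"
else
  echo "DCGM version too old"
fi
</code></pre>
<p>Version sort also handles multi-digit fields, so <code>4.10</code> correctly ranks above <code>4.3</code>, which a plain string comparison would get wrong.</p>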
<h2 id="heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</h2>
<p>Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.</p>
<pre><code class="language-shellscript">packer validate -var-file=values.pkrvars.hcl .
</code></pre>
<p>If validation succeeds, you’ll see a short confirmation such as <code>The configuration is valid.</code> After that, start the build. Expect Packer to create a temporary VM, run your provisioners, and produce an image:</p>
<pre><code class="language-shellscript">packer build -var-file=values.pkrvars.hcl .
</code></pre>
<p>The build typically takes <strong>15–20 minutes</strong>, depending on network speed and package installs. Watch the Packer log for three key checkpoints:</p>
<ul>
<li><p><strong>Instance creation</strong> — confirms the temporary VM was provisioned.</p>
</li>
<li><p><strong>Provisioner output</strong> — shows each script step (updates, reboot, <code>script/base.sh</code>) and any errors.</p>
</li>
<li><p><strong>Image creation</strong> — indicates the build finished and an image artifact was written.</p>
</li>
</ul>
<p>If the build fails, copy the failing provisioner’s log lines, fix the script or variables, and re-run the build. To iterate faster, run the failing script directly on a matching test VM instead of waiting for a full Packer build each time.</p>
<p>A successful build produces output similar to the following (abridged):</p>
<pre><code class="language-plaintext">googlecompute.gpu-node: output will be in this color.

==&gt; googlecompute.gpu-node: Checking image does not exist...
==&gt; googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==&gt; googlecompute.gpu-node: no persistent disk to create
==&gt; googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==&gt; googlecompute.gpu-node: Creating instance...
==&gt; googlecompute.gpu-node: Loading zone: us-central1-a
==&gt; googlecompute.gpu-node: Loading machine type: g2-standard-4
==&gt; googlecompute.gpu-node: Requesting instance creation...
==&gt; googlecompute.gpu-node: Waiting for creation operation to complete...
==&gt; googlecompute.gpu-node: Instance has been created!
==&gt; googlecompute.gpu-node: Waiting for the instance to become running...
==&gt; googlecompute.gpu-node: IP: 34.58.58.214
==&gt; googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==&gt; googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==&gt; googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: No containers need to be restarted.
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: User sessions running outdated binaries:
==&gt; googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==&gt; googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==&gt; googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==&gt; googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==&gt; googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==&gt; googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==&gt; googlecompute.gpu-node: [BASE] Updating system packages...
==&gt; googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==&gt; googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==&gt; googlecompute.gpu-node: [BASE] Installing DCGM...
==&gt; googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==&gt; googlecompute.gpu-node: [BASE] Applying system tuning...
==&gt; googlecompute.gpu-node: vm.swappiness=0
==&gt; googlecompute.gpu-node: vm.nr_hugepages=2048
==&gt; googlecompute.gpu-node: Setting cpu: 0
==&gt; googlecompute.gpu-node: Error setting new values. Common errors:
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==&gt; googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==&gt; googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==&gt; googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==&gt; googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: Deleting instance...
==&gt; googlecompute.gpu-node: Instance has been deleted!
==&gt; googlecompute.gpu-node: Creating image...
==&gt; googlecompute.gpu-node: Deleting disk...
==&gt; googlecompute.gpu-node: Disk has been deleted!
==&gt; googlecompute.gpu-node: Running post-processor:  (type shell-local)
==&gt; googlecompute.gpu-node (shell-local): Running local shell script: 
==&gt; googlecompute.gpu-node (shell-local): === Image Build Complete ===
==&gt; googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==&gt; googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==&gt; Wait completed after 17 minutes 55 seconds

==&gt; Builds finished. The artifacts of successful builds are:
--&gt; googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134
</code></pre>
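<p>Every message from <code>base.sh</code> carries a <code>[BASE]</code> prefix, and failures raised through its <code>error()</code> helper carry <code>[BASE][ERROR]</code>. If you save the build output to a file, a rough sketch like the following can pull out just those lines. The function names and the log file name are assumptions for illustration:</p>
<pre><code class="language-shell">#!/bin/bash
# Print only the lines that base.sh logged, from a saved Packer build log.
# Pass the path of whatever file you saved the output to.
base_lines() {
  grep -F '[BASE]' "$1"
}

# Print only the failures reported through the script's error() helper.
# grep exits non-zero when nothing matches, so this also works as a check.
base_errors() {
  grep -F '[BASE][ERROR]' "$1"
}
</code></pre>
<p>For example, run <code>packer build -var-file=values.pkrvars.hcl . | tee packer-build.log</code>, then call <code>base_errors packer-build.log</code> to see only the failures the script reported.</p>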
<h2 id="heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</h2>
<p>Confirm the image exists in the GCP Console: go to <strong>Compute Engine → Storage → Images</strong> and locate your newly created image.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/90f304eb-3fe7-4304-b2ad-d86701dde607.png" alt="Your Image information on GCP" style="display:block;margin:0 auto" width="1686" height="692" loading="lazy">

<p>Create a test VM from the image, replacing the image name and project ID with your own values:</p>
<pre><code class="language-plaintext">gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING
</code></pre>
<p>Once the instance is <code>RUNNING</code>, SSH into it and run <code>nvidia-smi</code> to verify that the NVIDIA driver and GPU are visible:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/364df8fc-7584-40df-8ab7-b3fe349d5065.png" alt="Output from the Nvidia-SMI command showing Driver and CUDA Version" style="display:block;margin:0 auto" width="1508" height="630" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/0912c303-3bb0-47fa-aa34-1c91ff26874f.png" alt="Image verifying the persistence mode is enabled" style="display:block;margin:0 auto" width="1508" height="80" loading="lazy">

<p><strong>The</strong> <code>nvidia-smi</code> <strong>output confirms:</strong></p>
<ul>
<li><p>Driver 590.48.01 loaded</p>
</li>
<li><p>CUDA 13.1 available</p>
</li>
<li><p>Persistence Mode is <code>On</code></p>
</li>
<li><p>The L4 GPU is detected with 23GB VRAM</p>
</li>
<li><p>Zero ECC errors</p>
</li>
<li><p>No running processes (clean idle state)</p>
</li>
</ul>
<p>This is exactly what a healthy base image should look like. Notice <code>Disp.A: Off</code>? That confirms our compute-only driver choice is working — no display adapter is active.</p>
<p>Confirm the installed CUDA toolkit by running <code>nvcc --version</code>. The output should show version 13.1, as specified.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/cc744624-9408-4348-88d7-61da04b5e1d0.png" alt="Output from the NVCC -Version command" style="display:block;margin:0 auto" width="1508" height="202" loading="lazy">

<p>Let's confirm the DCGM installation by running <code>dcgmi discovery -l</code>. Successful output indicates that DCGM is running and communicating with the driver.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/114996c6-1f28-43d4-a3fa-13aa7ccd2c82.png" alt="Output from the dcgmi discovery -l command showing device information" style="display:block;margin:0 auto" width="1508" height="714" loading="lazy">
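<p>The same checks can be scripted for CI. The sketch below validates one line of <code>nvidia-smi</code> CSV query output against the driver version you expect. The <code>check_gpu_line</code> name is hypothetical, and the code assumes persistence mode is reported as <code>Enabled</code> in CSV output:</p>
<pre><code class="language-shell">#!/bin/bash
# check_gpu_line "CSV line" "expected driver"
# Validates one line produced by:
#   nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader
# Hypothetical helper. Assumes the persistence field reads "Enabled" when on.
check_gpu_line() {
  local line="$1" expected_driver="$2"
  local driver persistence
  # Fields are comma-separated with padding spaces. Strip the spaces.
  driver=$(printf '%s\n' "${line}" | cut -d, -f2 | tr -d ' ')
  persistence=$(printf '%s\n' "${line}" | cut -d, -f3 | tr -d ' ')
  if [[ "${driver}" != "${expected_driver}" ]]; then
    echo "FAIL: driver ${driver}, expected ${expected_driver}"
    return 1
  fi
  if [[ "${persistence}" != "Enabled" ]]; then
    echo "FAIL: persistence mode is ${persistence}"
    return 1
  fi
  echo "OK: driver ${driver}, persistence ${persistence}"
}
</code></pre>
<p>On the test VM, you could call it as <code>check_gpu_line "$(nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader | head -n1)" "590.48.01"</code>.</p>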

<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.</p>
<p>From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.</p>
<p>The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p>NVIDIA Driver Installation Guide (Ubuntu): <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/</a></p>
</li>
<li><p>NVIDIA CUDA Toolkit Documentation: <a href="https://docs.nvidia.com/cuda/">https://docs.nvidia.com/cuda/</a></p>
</li>
<li><p>NVIDIA Container Toolkit Installation Guide: <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html</a></p>
</li>
<li><p>NVIDIA DCGM Documentation: <a href="https://docs.nvidia.com/datacenter/dcgm/latest/index.html">https://docs.nvidia.com/datacenter/dcgm/latest/index.html</a></p>
</li>
<li><p>NVIDIA Persistence Daemon: <a href="https://docs.nvidia.com/deploy/driver-persistence/index.html">https://docs.nvidia.com/deploy/driver-persistence/index.html</a></p>
</li>
<li><p>HashiCorp Packer Documentation: <a href="https://developer.hashicorp.com/packer/docs">https://developer.hashicorp.com/packer/docs</a></p>
</li>
<li><p>Packer Google Compute Builder: <a href="https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute">https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Why Chrome OS Is the Operating System the AI Era Was Built For ]]>
                </title>
                <description>
                    <![CDATA[ Chrome OS runs on a read-only filesystem. You can't install executables on the host. There's no traditional desktop environment. Everything that interacts with the underlying system does so through a  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/why-chrome-os-is-the-ai-os/</link>
                <guid isPermaLink="false">69e2765cfd22b8ad62611ba8</guid>
                
                    <category>
                        <![CDATA[ Chrome OS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Linux ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cybersecurity ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Christopher Galliart ]]>
                </dc:creator>
                <pubDate>Fri, 17 Apr 2026 18:05:16 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/c4116a06-9e42-4da5-a152-0fe1433e0857.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Chrome OS runs on a read-only filesystem. You can't install executables on the host. There's no traditional desktop environment. Everything that interacts with the underlying system does so through a sandboxed browser, a containerized Linux terminal, or a cloud connection.</p>
<p>For years, that list of constraints was the reason people dismissed it. But in 2026, it's the reason Chrome OS might be the most correctly designed operating system for what's coming.</p>
<p>The security architecture treats the endpoint as untrusted by default. The containerized Linux environment gives developers a full headless stack without compromising the host. And an upcoming OS-level rewrite, Aluminium, puts Google's on-device AI models directly into the kernel.</p>
<p>This article covers security architecture, the container-based developer environment, cloud-streamed creative tools via AWS NICE DCV, cloud gaming, and what Aluminium OS means for on-device AI.</p>
<h3 id="heading-heres-what-well-cover">Here's what we'll cover:</h3>
<ol>
<li><p><a href="#heading-security-first-architecture-in-an-era-of-ai-powered-threats">Security-First Architecture in an Era of AI-Powered Threats</a></p>
</li>
<li><p><a href="#heading-a-headless-linux-stack-thats-more-flexible-than-it-looks">A Headless Linux Stack That's More Flexible Than It Looks</a></p>
</li>
<li><p><a href="#heading-aws-nice-dcv-changes-the-creative-tools-conversation">AWS NICE DCV Changes the Creative Tools Conversation</a></p>
</li>
<li><p><a href="#heading-cloud-gaming-works">Cloud Gaming Works</a></p>
</li>
<li><p><a href="#heading-aluminium-os-on-device-models-on-googles-own-architecture">Aluminium OS: On-Device Models on Google's Own Architecture</a></p>
</li>
<li><p><a href="#heading-where-this-lands">Where This Lands</a></p>
</li>
</ol>
<h2 id="heading-security-first-architecture-in-an-era-of-ai-powered-threats">Security-First Architecture in an Era of AI-Powered Threats</h2>
<p>Threat actors are getting better tools. Models like Mythos are lowering the barrier for generating convincing phishing campaigns, crafting polymorphic malware, and automating social engineering at scale.</p>
<p>Traditional operating systems present exactly the attack surface these tools target: writable system files, user-installable executables, patches that sit uninstalled for weeks because someone clicked "remind me later."</p>
<p>Chrome OS sidesteps most of this by design. The root filesystem is read-only and cryptographically verified on every boot through a process called Verified Boot.</p>
<p>If anything has modified the OS files since the last verified state, whether that's malware, a compromised package, or a rogue AI agent that decided to start deleting system files, the device detects it at startup and either self-corrects or refuses to boot.</p>
<p>Persistence across reboots isn't difficult. It's architecturally impossible through software alone.</p>
<p>Updates happen silently. While you're working, the system downloads the next OS version to an inactive partition. On your next reboot, it pivots to the updated version. No prompts, no deferred patches, no exposure window.</p>
<p>Major updates ship every four to six weeks. Security patches land every two to three weeks. The gap between vulnerability discovery and remediation is measured in days.</p>
<p>Chrome OS consistently stays out of the top 50 products by CVE count in the NIST National Vulnerability Database, while Windows and the Linux kernel sit near the top every year. When AI is actively being weaponized to find and exploit vulnerabilities faster than humans can patch them, a read-only, verified, automatically updated endpoint is a different category of security posture.</p>
<p>The tradeoff is trust. Chrome OS's security model means trusting Google as the root authority for your entire computing stack: updates, certificate trust, telemetry. Organizations with strict data sovereignty requirements should weigh that dependency carefully.</p>
<h2 id="heading-a-headless-linux-stack-thats-more-flexible-than-it-looks">A Headless Linux Stack That's More Flexible Than It Looks</h2>
<p>Chrome OS is a text-based operating system. There's no native GUI layer. Stop and sit with that for a second, because it's the thing that makes people dismiss Chrome OS and also the thing that makes it work.</p>
<p>The entire graphical interface you interact with IS the Chrome browser. The Ash shell, Chrome's window manager, is the desktop. You don't install applications onto it the way you install .exe files on Windows or drag .app bundles into a macOS Applications folder. If it isn't running in a browser tab, an Android VM, or a Linux container, it doesn't run. That restriction is what keeps the host locked down, and it's what makes everything else possible.</p>
<p>Under the hood, Chrome OS runs a minimal virtual machine called Termina through crosvm, Google's Rust-based VM monitor.</p>
<p>Inside Termina, LXD manages Linux containers. The default container, penguin, is a Debian environment with a special trick: it bridges GUI-based Linux applications directly into the Chrome OS desktop through a Wayland proxy called Sommelier. Install VS Code, GIMP, or LibreOffice in penguin and they show up in your Chrome OS app launcher, running in windows alongside your browser tabs. For a lot of developers, penguin alone covers the daily workflow.</p>
<p>But Termina gives you more than penguin. Through the LXD layer you can spin up independent containers that are fully isolated operating systems: Arch, Alpine, Ubuntu, whatever you need.</p>
<p>These aren't attached to the GUI bridge. They run headless, natively, with their own systemd, their own package managers, their own persistent state. Need a clean Ubuntu environment to test a deployment script without touching your main setup? <code>lxc launch</code> and you're there. Need to blow it away? <code>lxc delete</code> and it's gone. No orphaned files on the host, no cross-contamination between environments.</p>
<p>The key distinction from Docker is that LXD runs system containers (a full OS userspace with its own init system, sharing the Termina VM's kernel) rather than application containers. You get background services, persistent daemons, the works. You can also run Docker inside any of these LXD containers if you need application-level containerization on top of that.</p>
<p>Snapshot your entire environment with <code>lxc snapshot</code> before a risky dependency install and roll back instantly if something breaks. That kind of safety net is broader than version control alone: it captures your full OS configuration, not just code.</p>
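<p>As a rough sketch, the launch, snapshot, restore, and delete cycle looks like this from a shell where the <code>lxc</code> client can reach Termina's LXD. The container name, snapshot name, and image alias are assumptions for illustration, and the <code>DRY_RUN</code> switch is a hypothetical convenience that prints the commands instead of running them:</p>
<pre><code class="language-shell">#!/bin/bash
# Sketch of a throwaway-container workflow using the LXD client (lxc).
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# scratch_cycle [name]: create a container, checkpoint it, change it,
# roll back, then remove it. The default name "scratch" is hypothetical.
scratch_cycle() {
  local name="${1:-scratch}"
  run lxc launch ubuntu:24.04 "${name}"      # new isolated system container
  run lxc snapshot "${name}" clean           # checkpoint before risky changes
  run lxc exec "${name}" -- apt-get update   # stand-in for a risky change
  run lxc restore "${name}" clean            # roll back to the checkpoint
  run lxc delete --force "${name}"           # remove the container entirely
}
</code></pre>
<p>Running <code>DRY_RUN=1 scratch_cycle demo</code> prints the five commands without touching any container.</p>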
<p>Pair this with browser-native tools like GitHub Codespaces, Google Colab, AWS CloudShell, or vscode.dev, and the terminal handles your local tooling while the browser handles everything else.</p>
<p>AI coding assistants like Claude and Gemini already operate natively in the browser. The distance between "cloud IDE" and "local IDE" keeps shrinking.</p>
<p>There are friction points. You can't load custom kernel modules inside Crostini, the Chrome OS Linux environment. Nested KVM requires Intel Gen 10+ processors. VPN routing into the Linux container from the Chrome OS host can be a headache, with WireGuard requiring userspace workarounds inside the container.</p>
<p>But none of these break the core architecture for cloud-native work. They're just worth knowing about before you commit.</p>
<h2 id="heading-aws-nice-dcv-changes-the-creative-tools-conversation">AWS NICE DCV Changes the Creative Tools Conversation</h2>
<p>One of the longest-standing arguments against Chrome OS has been the absence of professional creative software. There's no Premiere, no DaVinci Resolve, no Blender, no Ableton. For years, this was a dead-end conversation.</p>
<p>AWS NICE DCV (Desktop Cloud Visualization) reopens it. DCV is a high-performance remote display protocol that streams GPU-accelerated desktop sessions from EC2 instances to any device, including a Chromebook running the browser-based DCV client. It supports OpenGL, Vulkan, and DirectX rendering, with adaptive encoding that adjusts to network conditions. On AWS, the DCV license is free. You pay only for the EC2 compute time.</p>
<p>Netflix engineers use DCV to stream content creation applications to remote artists. Volkswagen runs 3D CAD simulations across their engineering division through it. A VFX studio called RVX used it to deliver visual effects for HBO's The Last of Us, streaming Nuke, Maya, Houdini, and Blender to artists distributed across Europe from servers in Iceland. Their team said it was the best remote experience they'd ever worked with.</p>
<p>So: a Chromebook connected to a g5.xlarge EC2 instance (one A10G GPU) can run Blender, DaVinci Resolve, or any other GPU-accelerated creative application with full hardware acceleration. The rendering happens in the data center. DCV streams the pixels. The creative professional gets a responsive, high-fidelity workspace on a $400 machine that couldn't locally render a single frame.</p>
<p>The constraints are connectivity and cost. You need sustained bandwidth (25+ Mbps for 1080p work, more for 4K multi-monitor setups) and leaving a GPU instance running around the clock adds up. But for studios and professionals who already budget for high-end workstations, the math often pencils out, especially when you factor in zero local hardware maintenance and the ability to scale GPU power on demand.</p>
<h2 id="heading-cloud-gaming-works">Cloud Gaming Works</h2>
<p>GeForce NOW survived where Stadia failed because it made a better business decision: bring your own games. Connect your existing Steam, Epic, or Ubisoft library and stream from NVIDIA's server-side hardware. The Ultimate tier now runs on RTX 5080-class infrastructure. 4K at 120fps with ray tracing, on a fanless Chromebook.</p>
<p>Chrome OS has a structural advantage as a cloud gaming client. GeForce NOW runs natively in the Chromium browser via WebRTC, and users consistently report less micro-stuttering and tighter input handling than the standalone Windows desktop app. Under good network conditions, measured total latency runs 13 to 14ms, with sub-3ms ping documented close to a datacenter. That's below the human perceptual threshold for most game types.</p>
<p>Anti-cheat systems like Easy Anti-Cheat and Riot Vanguard are a non-issue in this model. They run on the server where the game executes, not on your local endpoint. On-device gaming isn't viable on Chrome OS and likely never will be. The architecture isn't designed for it, and even projects attempting to bridge local GPUs hit bottlenecks in the container layers. Cloud gaming is the path, and it works.</p>
<p>The limiting factors are network-dependent. Latency spikes above 500ms on bad connections make fast-twitch games unplayable, and NVIDIA's 100-hour monthly cap on the Ultimate tier has drawn criticism. But cloud gaming on Chrome OS has crossed the line from novelty to daily-driver viable for most use cases.</p>
<h2 id="heading-aluminium-os-on-device-models-on-googles-own-architecture">Aluminium OS: On-Device Models on Google's Own Architecture</h2>
<p>The most consequential near-term development for Chrome OS is Project Aluminium, a ground-up rewrite that replaces the current Chrome OS foundation with a native Android kernel. Not another bolted-on compatibility layer: a new operating system built on Android 16, designed to run Android applications natively with direct hardware acceleration instead of routing them through the resource-heavy ARCVM virtual machine that currently eats CPU cycles on even basic app launches.</p>
<p>The AI story is the real story. Aluminium is being built with Gemini models integrated directly into the OS: the file system, the application launcher, the window manager.</p>
<p>Google serving their own proprietary models on their own devices, using an architecture optimized specifically to run them, is a level of vertical integration that no other OS vendor has in the pipeline. Apple has the silicon advantage for local inference. Google has the model-to-OS integration advantage. Those are competing theses about where AI compute should live, and both are worth taking seriously.</p>
<p>The rollout timeline from court documents and leaked roadmaps puts a trusted tester program on select hardware in late 2026, premium tablets by early 2027, and general consumer availability in 2028. Chrome OS Classic gets maintained through existing support obligations until 2033 or 2034.</p>
<p>The launch won't be perfect. Google's track record on platform transitions gives the community earned skepticism. But the ability to iterate a natively AI-integrated OS on hardware they control is the kind of capability that compounds over time.</p>
<h2 id="heading-where-this-lands">Where This Lands</h2>
<p>Two years ago, calling Chrome OS a serious platform for development or creative work would have been a stretch. Today you can run a full Debian environment with systemd daemons, snapshot your workspace, stream Blender from a GPU-backed data center, play AAA games at 4K on hardware you don't own, and do all of it from a verified, read-only endpoint that patches itself while you sleep.</p>
<p>The remaining gaps are real. But they're concentrated in workflows that are themselves moving to the cloud. Chrome OS was designed around assumptions about computing that used to be premature. They're not premature anymore.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Authenticate Users in Kubernetes: x509 Certificates, OIDC, and Cloud Identity ]]>
                </title>
                <description>
                    <![CDATA[ Kubernetes doesn't know who you are. It has no user database, no built-in login system, no password file. When you run kubectl get pods, Kubernetes receives an HTTP request and asks one question: who  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-authenticate-users-in-kubernetes-x509-certificates-oidc-and-cloud-identity/</link>
                <guid isPermaLink="false">69d4182f40c9cabf4484dbdb</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ authentication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:31:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36356282-0cfb-43a8-8461-84f20e64b041.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Kubernetes doesn't know who you are.</p>
<p>It has no user database, no built-in login system, no password file. When you run <code>kubectl get pods</code>, Kubernetes receives an HTTP request and asks one question: who signed this, and do I trust that signature? Everything else — what you're allowed to do, which namespaces you can access, whether your request goes through at all — comes after that question is answered.</p>
<p>This surprises most engineers who are new to Kubernetes. They expect something like a database of users with passwords. Instead, they find a pluggable chain of authenticators, each one able to vouch for a request in a different way:</p>
<ul>
<li><p>Client certificates</p>
</li>
<li><p>OIDC tokens from an external identity provider</p>
</li>
<li><p>Cloud provider IAM tokens</p>
</li>
<li><p>Service account tokens projected into pods</p>
</li>
</ul>
<p>Any of these can be active at the same time.</p>
<p>Understanding this model is what separates engineers who can debug authentication failures from engineers who copy kubeconfig files and hope for the best.</p>
<p>In this article, you'll work through how the Kubernetes authentication chain works from first principles. You'll see how x509 client certificates are used — and why they're a poor choice for human users in production. You'll configure OIDC authentication with Dex, giving your cluster a real browser-based login flow. And you'll see how AWS, GCP, and Azure each plug into the same underlying model.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A running kind cluster — a fresh one works fine, or reuse an existing one</p>
</li>
<li><p><code>kubectl</code> and <code>helm</code> installed</p>
</li>
<li><p><code>openssl</code> available on your machine (comes pre-installed on macOS and most Linux distros)</p>
</li>
<li><p>Basic familiarity with what a JWT is (a signed JSON object with claims) — you don't need to be able to write one, just recognise one</p>
</li>
</ul>
<p>All demo files are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</a></p>
<ul>
<li><p><a href="#heading-the-authenticator-chain">The Authenticator Chain</a></p>
</li>
<li><p><a href="#heading-users-vs-service-accounts">Users vs Service Accounts</a></p>
</li>
<li><p><a href="#heading-what-happens-after-authentication">What Happens After Authentication</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</a></p>
<ul>
<li><p><a href="#heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</a></p>
</li>
<li><p><a href="#heading-the-cluster-ca">The Cluster CA</a></p>
</li>
<li><p><a href="#heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</a></p>
<ul>
<li><p><a href="#heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</a></p>
</li>
<li><p><a href="#heading-the-api-server-configuration">The API Server Configuration</a></p>
</li>
<li><p><a href="#heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</a></p>
</li>
<li><p><a href="#heading-how-kubelogin-works">How kubelogin Works</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2-configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</a></p>
</li>
<li><p><a href="#heading-cloud-provider-authentication">Cloud Provider Authentication</a></p>
<ul>
<li><p><a href="#heading-aws-eks">AWS EKS</a></p>
</li>
<li><p><a href="#heading-google-gke">Google GKE</a></p>
</li>
<li><p><a href="#heading-azure-aks">Azure AKS</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-webhook-token-authentication">Webhook Token Authentication</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</h2>
<p>Every request that reaches the Kubernetes API server — whether from <code>kubectl</code>, a pod, a controller, or a CI pipeline — carries a credential of some kind.</p>
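<p>The API server passes that credential through a chain of authenticators in sequence. The first authenticator that can verify the credential wins. If a credential is presented and none of them can verify it, the request is rejected with <code>401 Unauthorized</code>. A request that presents no credential at all is treated as anonymous, provided anonymous access is enabled.</p>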
<h3 id="heading-the-authenticator-chain">The Authenticator Chain</h3>
<p>Kubernetes supports several authentication strategies simultaneously. You can have client certificate authentication and OIDC authentication active on the same cluster at the same time, which is common in production: cluster administrators use certificates, regular developers use OIDC. The strategies active on a cluster are determined by flags passed to the <code>kube-apiserver</code> process.</p>
<p>The strategies available are x509 client certificates, bearer tokens (static token files — rarely used in production), bootstrap tokens (used during node join operations), service account tokens, OIDC tokens, authenticating proxies, and webhook token authentication. A cluster doesn't have to use all of them, and most don't. But knowing they all exist helps when you're diagnosing an auth failure.</p>
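<p>To see which of these strategies a cluster has switched on, you can read the flags the API server was started with. The following is a minimal sketch that assumes a kubeadm-style cluster such as kind, where the API server runs as a static pod labelled <code>component=kube-apiserver</code> in <code>kube-system</code>. Managed clusters usually hide the control plane, so this won't work there.</p>
<pre><code class="language-bash"># Print the API server's command line, one flag per line, and keep only
# the flags that relate to authentication.
# Assumption: kubeadm-style static pod with the component=kube-apiserver label.
kubectl get pods -n kube-system -l component=kube-apiserver \
  -o jsonpath='{.items[0].spec.containers[0].command}' \
  | tr ',' '\n' \
  | grep -E 'client-ca-file|authentication-|oidc-|token-auth-file|bootstrap-token|service-account-|requestheader-|anonymous-auth'
</code></pre>
<p>If a flag family is missing from the output, the corresponding strategy is most likely not configured on that cluster.</p>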
<h3 id="heading-users-vs-service-accounts">Users vs Service Accounts</h3>
<p>There is an important distinction in how Kubernetes thinks about identity. Service accounts are Kubernetes objects — they live in a namespace, get created with <code>kubectl create serviceaccount</code>, and have tokens managed by the cluster itself. Every pod runs as a service account. These are machine identities for workloads.</p>
<p>Users, on the other hand, don't exist as Kubernetes objects at all. There is no <code>kubectl create user</code> command. Kubernetes doesn't manage user accounts. Instead, it trusts external systems to assert user identity — a certificate authority, an OIDC provider, or a cloud provider's IAM system. Kubernetes just verifies the assertion and extracts the username and group memberships from it.</p>
<table>
<thead>
<tr>
<th></th>
<th>Service Account</th>
<th>User</th>
</tr>
</thead>
<tbody><tr>
<td>Kubernetes object?</td>
<td>Yes — lives in a namespace</td>
<td>No — managed externally</td>
</tr>
<tr>
<td>Created with</td>
<td><code>kubectl create serviceaccount</code></td>
<td>External system (CA, IdP, cloud IAM)</td>
</tr>
<tr>
<td>Used by</td>
<td>Pods and workloads</td>
<td>Humans and CI systems</td>
</tr>
<tr>
<td>Token managed by</td>
<td>Kubernetes</td>
<td>External system</td>
</tr>
<tr>
<td>Namespaced?</td>
<td>Yes</td>
<td>No</td>
</tr>
</tbody></table>
<h3 id="heading-what-happens-after-authentication">What Happens After Authentication</h3>
<p>Authentication only answers one question: who is this? Once the API server has a verified identity — a username and zero or more group memberships — it passes the request to the authorisation layer. By default that is RBAC, which checks the identity against Role and ClusterRole bindings to determine what the request is allowed to do.</p>
<p>This is why authentication and authorisation are separate concerns in Kubernetes. A valid certificate gets you past the front door. What you can do inside is RBAC's job. An authenticated user with no RBAC bindings can authenticate successfully but will be denied every API call.</p>
<p>If you want a deep dive into how RBAC rules, roles, and bindings work, check out this handbook on <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a>.</p>
<h2 id="heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</h2>
<p>x509 client certificate authentication is the oldest and simplest authentication method in Kubernetes. It's how <code>kubectl</code> works out of the box when you create a cluster — the kubeconfig file that <code>kind</code> or <code>kubeadm</code> generates contains an embedded client certificate signed by the cluster's Certificate Authority.</p>
<h3 id="heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</h3>
<p>When the API server receives a request with a client certificate, it validates the certificate against its trusted CA, then reads two fields from the certificate, the Common Name and the Organization, to construct an identity.</p>
<p>The <strong>Common Name (CN)</strong> field becomes the username. The <strong>Organization (O)</strong> field, which can contain multiple values, becomes the list of groups the user belongs to.</p>
<p>So a certificate with <code>CN=jane</code> and <code>O=engineering</code> authenticates as username <code>jane</code> in group <code>engineering</code>. If you want to give <code>jane</code> permissions, you create a RoleBinding that references either the username <code>jane</code> or the group <code>engineering</code> as a subject.</p>
<p>This is the same mechanism behind <code>system:masters</code>. When <code>kind</code> creates a cluster and writes a kubeconfig for you, it generates a certificate with <code>O=system:masters</code>. Kubernetes has a built-in ClusterRoleBinding that grants <code>cluster-admin</code> to anyone in the <code>system:masters</code> group. That's why your default kubeconfig has full admin access — it's not magic, it's a certificate with the right group.</p>
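<p>To see how the subject fields carry identity, here is a minimal sketch that builds a CSR with two <code>O</code> values and prints its subject. The second group, <code>platform</code>, and the scratch directory are assumptions made for illustration. Nothing here is signed by your cluster CA.</p>
<pre><code class="language-bash"># Scratch directory so the example doesn't touch your real keys
workdir=$(mktemp -d)

# Private key for a hypothetical user "jane"
openssl genrsa -out "$workdir/jane.key" 2048

# CN becomes the username; each O entry becomes one Kubernetes group
openssl req -new \
  -key "$workdir/jane.key" \
  -out "$workdir/jane.csr" \
  -subj "/CN=jane/O=engineering/O=platform"

# Print the subject to confirm the identity the CSR carries
subject=$(openssl req -in "$workdir/jane.csr" -noout -subject -nameopt RFC2253)
echo "$subject"
</code></pre>
<p>Once signed by the cluster CA, a certificate built from this CSR would authenticate as <code>jane</code> in both groups.</p>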
<h3 id="heading-the-cluster-ca">The Cluster CA</h3>
<p>Every Kubernetes cluster has a root Certificate Authority — a private key and a self-signed certificate that the API server trusts. Any client certificate signed by this CA is trusted by the cluster.</p>
<p>The CA certificate and key are typically stored in <code>/etc/kubernetes/pki/</code> on the control plane node, or in the <code>kube-system</code> namespace as a secret, depending on how the cluster was created.</p>
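<p>On kind clusters, you can copy the CA cert and key directly from the control plane container. kind names that container <code>&lt;cluster-name&gt;-control-plane</code>, so the commands below assume a cluster called <code>k8s-security</code>:</p>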
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>Whoever holds the CA key can issue certificates for any username and any group, including <code>system:masters</code>. This makes the CA key the most sensitive secret in a Kubernetes cluster. Guard it accordingly.</p>
<h3 id="heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</h3>
<p>Client certificates work, but they have two fundamental problems that make them a poor choice for human users in production.</p>
<p>The first is that <strong>Kubernetes doesn't check certificate revocation lists (CRLs)</strong>. If a developer's kubeconfig is stolen, the embedded certificate remains valid until it expires, which is typically one year after issuance. There's no way to immediately invalidate it. You can't "log out" a certificate. The only mitigation is to rotate the entire cluster CA, which invalidates every certificate including those belonging to other legitimate users.</p>
<p>The second is <strong>operational overhead</strong>. Certificates must be generated, distributed to users, and rotated before expiry. There's no self-service. In a team of ten engineers, managing certificates is annoying. In a team of a hundred, it's a full-time job.</p>
<p>For human access in production, OIDC is the right answer: short-lived tokens issued by a trusted identity provider, with a central revocation mechanism, and a standard browser-based login flow. Certificates are fine for automation and cluster components, where issuance and rotation can be handled programmatically.</p>
<p>That said, understanding certificates isn't optional. Your kubeconfig uses one. Your CI system probably does too. And cert-based auth is what you fall back to when everything else breaks.</p>
<h2 id="heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</h2>
<p>In this section, you'll generate a user certificate signed by the cluster CA, bind it to an RBAC role, and use it to authenticate to the cluster as a different user.</p>
<p><strong>This guide is for local development and learning only.</strong> Manually signing certificates with the cluster CA and storing keys on disk is done here for simplicity.</p>
<p>In production, you should use the Kubernetes CertificateSigningRequest API or cert-manager for certificate issuance, enforce short-lived certificates with automatic rotation, and store private keys in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or hardware security module (HSM) — never distribute the cluster CA key.</p>
<h3 id="heading-step-1-copy-the-ca-cert-and-key-from-the-kind-control-plane">Step 1: Copy the CA cert and key from the kind control plane</h3>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>This creates two files in your current directory: <code>ca.crt</code> and <code>ca.key</code>.</p>
<h3 id="heading-step-2-generate-a-private-key-and-csr-for-a-new-user">Step 2: Generate a private key and CSR for a new user</h3>
<p>You're creating a certificate for a user named <code>jane</code> in the <code>engineering</code> group:</p>
<pre><code class="language-bash"># Generate the private key
openssl genrsa -out jane.key 2048

# Generate a Certificate Signing Request
# CN = username, O = group
openssl req -new \
  -key jane.key \
  -out jane.csr \
  -subj "/CN=jane/O=engineering"
</code></pre>
<h3 id="heading-step-3-sign-the-csr-with-the-cluster-ca">Step 3: Sign the CSR with the cluster CA</h3>
<pre><code class="language-bash">openssl x509 -req \
  -in jane.csr \
  -CA ca.crt \
  -CAkey ca.key \
  -CAcreateserial \
  -out jane.crt \
  -days 365
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Certificate request self-signature ok
subject=CN=jane, O=engineering
</code></pre>
<h3 id="heading-step-4-inspect-the-certificate">Step 4: Inspect the certificate</h3>
<p>Before using it, confirm the identity it carries:</p>
<pre><code class="language-bash">openssl x509 -in jane.crt -noout -subject -dates
</code></pre>
<pre><code class="language-plaintext">subject=CN=jane, O=engineering
notBefore=Mar 20 10:00:00 2024 GMT
notAfter=Mar 20 10:00:00 2025 GMT
</code></pre>
<p>One year from now, this certificate becomes invalid and must be replaced. There's no way to extend it — you have to issue a new one.</p>
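<p>Because expiry is the only thing that ends a certificate's life, it helps to check how much validity remains before it becomes a problem. The sketch below uses <code>openssl x509 -checkend</code>, which exits non-zero when the certificate will have expired within the given number of seconds. To keep it runnable on its own, it generates a throwaway certificate. The file names, the 10-day lifetime, and the 30-day threshold are all assumptions for illustration. For the demo, you would point it at <code>jane.crt</code> instead.</p>
<pre><code class="language-bash"># Scratch directory; in the demo you would check jane.crt instead
workdir=$(mktemp -d)

# Throwaway self-signed certificate valid for 10 days (illustrative only)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$workdir/demo.key" \
  -out "$workdir/demo.crt" \
  -days 10 \
  -subj "/CN=demo-user/O=engineering"

# -checkend N exits 0 if the certificate is still valid N seconds from now,
# and non-zero if it will have expired by then
thirty_days=$((30 * 24 * 60 * 60))
if openssl x509 -in "$workdir/demo.crt" -noout -checkend "$thirty_days"; then
  status="ok"
else
  status="expiring"
fi
echo "certificate status: $status"
</code></pre>
<p>A check like this fits naturally into a scheduled job that warns you before a user or CI certificate lapses.</p>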
<h3 id="heading-step-5-build-a-kubeconfig-entry-for-jane">Step 5: Build a kubeconfig entry for jane</h3>
<pre><code class="language-bash"># Get the cluster API server address from the current context
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Create a kubeconfig for jane
kubectl config set-cluster k8s-security \
  --server="$APISERVER" \
  --certificate-authority=ca.crt \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-credentials jane \
  --client-certificate=jane.crt \
  --client-key=jane.key \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-context jane@k8s-security \
  --cluster=k8s-security \
  --user=jane \
  --kubeconfig=jane.kubeconfig

kubectl config use-context jane@k8s-security \
  --kubeconfig=jane.kubeconfig
</code></pre>
<h3 id="heading-step-6-test-authentication-before-rbac">Step 6: Test authentication — before RBAC</h3>
<p>Try to list pods using jane's kubeconfig:</p>
<pre><code class="language-bash">kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">Error from server (Forbidden): pods is forbidden: User "jane" cannot list
resource "pods" in API group "" in the namespace "staging"
</code></pre>
<p>This is correct. Jane authenticated successfully — Kubernetes knows who she is. But she has no RBAC bindings, so every API call is denied. Authentication passed, but authorisation failed.</p>
<h3 id="heading-step-7-grant-jane-access-with-rbac">Step 7: Grant jane access with RBAC</h3>
<p>RBAC bindings use the username exactly as it appears in the certificate's CN field. If you need a refresher on how Roles, ClusterRoles, and RoleBindings work, this handbook <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a> covers the full RBAC model. For now, a simple RoleBinding using the built-in <code>view</code> ClusterRole is enough:</p>
<pre><code class="language-yaml"># jane-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-reader
  namespace: staging
subjects:
  - kind: User
    name: jane          # matches the CN in the certificate
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
</code></pre>
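<pre><code class="language-bash"># Create the staging namespace if it doesn't already exist
kubectl create namespace staging --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f jane-rolebinding.yaml
kubectl get pods -n staging --kubeconfig=jane.kubeconfig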
</code></pre>
<pre><code class="language-plaintext">No resources found in staging namespace.
</code></pre>
<p>No error — jane can now list pods in <code>staging</code>. She can't delete them, create them, or access other namespaces. The certificate got her in. RBAC determines what she can do.</p>
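<p>You can confirm those boundaries without actually attempting each action. As a quick sketch, <code>kubectl auth can-i</code> asks the API server whether the current identity may perform a given verb on a resource. This assumes the <code>jane.kubeconfig</code> file and the RoleBinding from the steps above.</p>
<pre><code class="language-bash"># Each command prints "yes" or "no" for jane's identity.
# Allowed by the view ClusterRole bound in staging:
kubectl auth can-i list pods -n staging --kubeconfig=jane.kubeconfig

# Not covered by view, which is read-only:
kubectl auth can-i delete pods -n staging --kubeconfig=jane.kubeconfig

# The RoleBinding only applies to the staging namespace:
kubectl auth can-i list pods -n default --kubeconfig=jane.kubeconfig
</code></pre>
<p>With the binding above in place, the first check should answer <code>yes</code> and the other two <code>no</code>.</p>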
<h2 id="heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</h2>
<p>OpenID Connect is an identity layer on top of OAuth 2.0. It's how Kubernetes integrates with enterprise identity providers — Active Directory, Okta, Google Workspace, Keycloak, and any other provider that speaks OIDC. Understanding how Kubernetes uses it requires following the token from the user's browser to the API server's decision.</p>
<h3 id="heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</h3>
<p>When a developer runs <code>kubectl get pods</code> with OIDC configured, the following happens:</p>
<ol>
<li><p><code>kubectl</code> checks whether the current credential in the kubeconfig is a valid, unexpired OIDC token</p>
</li>
<li><p>If not, it launches <code>kubelogin</code>, a kubectl plugin that opens a browser window</p>
</li>
<li><p>The browser redirects to the OIDC provider (Dex, Okta, your corporate IdP)</p>
</li>
<li><p>The user logs in with their corporate credentials</p>
</li>
<li><p>The OIDC provider issues a signed JWT and returns it to kubelogin</p>
</li>
<li><p>kubelogin caches the token locally (under <code>~/.kube/cache/oidc-login/</code>) and returns it to <code>kubectl</code></p>
</li>
<li><p><code>kubectl</code> sends the token to the API server in an <code>Authorization: Bearer</code> header</p>
</li>
<li><p>The API server fetches the provider's public keys from its JWKS endpoint and verifies the token signature</p>
</li>
<li><p>If valid, the API server extracts the username and group claims from the token</p>
</li>
<li><p>RBAC takes over from there</p>
</li>
</ol>
<p>The Kubernetes API server doesn't contact the OIDC provider on every request. It only fetches the provider's public keys periodically and verifies signatures locally. This makes OIDC authentication stateless and scalable.</p>
<h3 id="heading-the-api-server-configuration">The API Server Configuration</h3>
<p>For OIDC to work, the API server needs to know where to find the identity provider and how to interpret the tokens it issues.</p>
<p>In Kubernetes v1.30+, this is configured through an <code>AuthenticationConfiguration</code> file passed via the <code>--authentication-config</code> flag. (In older versions, individual <code>--oidc-*</code> flags were used instead, but these were removed in v1.35.)</p>
<p>The <code>AuthenticationConfiguration</code> defines OIDC providers under the <code>jwt</code> key:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it does</th>
<th>Example</th>
</tr>
</thead>
<tbody><tr>
<td><code>issuer.url</code></td>
<td>The OIDC provider's base URL — must match the <code>iss</code> claim in the token</td>
<td><code>https://dex.example.com</code></td>
</tr>
<tr>
<td><code>issuer.audiences</code></td>
<td>The client IDs the token was issued for — must match the <code>aud</code> claim</td>
<td><code>["kubernetes"]</code></td>
</tr>
<tr>
<td><code>issuer.certificateAuthority</code></td>
<td>CA certificate to trust when contacting the OIDC provider (inlined PEM)</td>
<td><code>-----BEGIN CERTIFICATE-----...</code></td>
</tr>
<tr>
<td><code>claimMappings.username.claim</code></td>
<td>Which JWT claim to use as the Kubernetes username</td>
<td><code>email</code></td>
</tr>
<tr>
<td><code>claimMappings.groups.claim</code></td>
<td>Which JWT claim to use as the Kubernetes group list</td>
<td><code>groups</code></td>
</tr>
<tr>
<td><code>claimMappings.*.prefix</code></td>
<td>Prefix added to the claim value — set to <code>""</code> for no prefix</td>
<td><code>""</code></td>
</tr>
</tbody></table>
<p>On a kind cluster, the <code>--authentication-config</code> flag is set in the cluster configuration before creation, not after. You'll see this in the next demo.</p>
<h3 id="heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</h3>
<p>A JWT is a signed JSON object with three sections: a header, a payload, and a signature. The payload is a set of claims – key-value pairs that assert facts about the token. Kubernetes reads specific claims from the payload to build an identity.</p>
<p>The required claims are <code>iss</code> (the issuer URL, must match <code>issuer.url</code> in the <code>AuthenticationConfiguration</code>), <code>sub</code> (the subject, a unique identifier for the user), and <code>aud</code> (the audience, must match the <code>issuer.audiences</code> list). The <code>exp</code> claim (expiry time) is also required as the API server rejects expired tokens.</p>
<p>The most useful optional claim is <code>groups</code> (or whatever you configure via <code>claimMappings.groups.claim</code>). When this claim is present, Kubernetes can map OIDC group memberships directly to RBAC group bindings. A user in the <code>platform-engineers</code> group in your identity provider automatically gets the RBAC permissions you've bound to that group in Kubernetes — no manual user management required.</p>
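<p>When a login fails, the quickest check is often to look at the claims the token actually carries. Here is a minimal sketch of a helper that decodes a JWT's payload in plain bash. It doesn't verify the signature, so treat it as an inspection aid only. The function name <code>decode_jwt_payload</code> and the sample token are hypothetical: the token is unsigned and built on the spot so the example runs on its own.</p>
<pre><code class="language-bash"># Decode the payload (middle section) of a JWT without verifying it.
decode_jwt_payload() {
  local payload
  # A JWT is header.payload.signature; take the second dot-separated field
  payload=$(printf '%s' "$1" | cut -d '.' -f 2)
  # Convert base64url characters back to standard base64
  payload=$(printf '%s' "$payload" | tr '_-' '/+')
  # Restore the padding that base64url encoding strips
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 --decode
}

# Hypothetical claims, encoded into an unsigned sample token
sample_claims='{"iss":"https://dex.example.com","sub":"1234","aud":"kubernetes","email":"jane@example.com","groups":["platform-engineers"]}'
encoded_claims=$(printf '%s' "$sample_claims" | base64 | tr -d '=\n' | tr '/+' '_-')
sample_token="e30.${encoded_claims}.not-a-real-signature"

decoded=$(decode_jwt_payload "$sample_token")
echo "$decoded"
</code></pre>
<p>Comparing the decoded <code>iss</code>, <code>aud</code>, and <code>groups</code> values against your <code>AuthenticationConfiguration</code> usually narrows down a mismatch quickly.</p>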
<h3 id="heading-how-kubelogin-works">How kubelogin Works</h3>
<p>kubelogin (also distributed as <code>kubectl oidc-login</code>) is a kubectl credential plugin. Instead of embedding a static certificate or token in your kubeconfig, you configure a credential plugin that runs a helper binary when <code>kubectl</code> needs a token.</p>
<p>When kubelogin is invoked, it checks its local token cache. If the cached token is still valid, it returns it immediately. If the token has expired, it initiates the OIDC authorization code flow — opens a browser, redirects to the identity provider, receives the token after login, caches it locally, and returns it to <code>kubectl</code>. The whole flow takes about five seconds when it triggers.</p>
<p>This means tokens are short-lived (typically an hour) and rotate automatically. If a developer's machine is compromised, the token expires on its own. There is no long-lived credential sitting in a file somewhere.</p>
<h2 id="heading-demo-2-configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</h2>
<p>In this section, you'll deploy Dex as a self-hosted OIDC provider, configure a kind cluster to trust it, and log in with a browser. Dex is a good demo vehicle because it runs inside the cluster and doesn't require a cloud account or an external service.</p>
<p><strong>This guide is for local development and learning only.</strong> Self-signed certificates, static passwords, and certs stored on disk are used here for simplicity.</p>
<p>In production, use a managed identity provider (Azure Entra ID, Google Workspace, Okta), automate certificate lifecycle with cert-manager, and store secrets in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or inject them via CSI driver — never commit or store certs as local files.</p>
<h3 id="heading-step-1-create-a-kind-cluster-with-oidc-authentication">Step 1: Create a kind cluster with OIDC authentication</h3>
<p>OIDC authentication for the API server must be configured at cluster creation time on kind because the API server needs to know which identity provider to trust before it starts accepting requests.</p>
<p><strong>Note:</strong> Kubernetes v1.30+ deprecated the <code>--oidc-*</code> API server flags in favor of the structured <code>AuthenticationConfiguration</code> API (via <code>--authentication-config</code>). In v1.35+ the old flags are removed entirely. This guide uses the new approach.</p>
<p><strong>nip.io</strong> is a wildcard DNS service: <code>dex.127.0.0.1.nip.io</code> resolves to <code>127.0.0.1</code>. This lets you use a real hostname for TLS without editing <code>/etc/hosts</code>.</p>
<p>First, generate a self-signed CA and TLS certificate for Dex:</p>
<pre><code class="language-bash"># Generate a CA for Dex
openssl req -x509 -newkey rsa:4096 -keyout dex-ca.key \
  -out dex-ca.crt -days 365 -nodes \
  -subj "/CN=dex-ca"

# Generate a certificate for Dex signed by that CA
openssl req -newkey rsa:2048 -keyout dex.key \
  -out dex.csr -nodes \
  -subj "/CN=dex.127.0.0.1.nip.io"

openssl x509 -req -in dex.csr \
  -CA dex-ca.crt -CAkey dex-ca.key \
  -CAcreateserial -out dex.crt -days 365 \
  -extfile &lt;(printf "subjectAltName=DNS:dex.127.0.0.1.nip.io")
</code></pre>
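<p>Before moving on, it's worth confirming that the leaf certificate chains to the CA and carries the hostname as a subject alternative name, since Go-based clients such as the API server ignore the Common Name when checking hostnames. The sketch below wraps both checks in a hypothetical helper, <code>check_cert</code>. So that it runs on its own, it builds a throwaway CA and leaf in a scratch directory. For the files from this step, you would run <code>check_cert dex-ca.crt dex.crt</code> instead.</p>
<pre><code class="language-bash"># check_cert CA_FILE CERT_FILE: verify the chain, then print the SAN entries
check_cert() {
  openssl verify -CAfile "$1" "$2"
  openssl x509 -in "$2" -noout -ext subjectAltName
}

# Throwaway CA and leaf certificate (illustrative names and lifetimes)
workdir=$(mktemp -d)
host="dex.127.0.0.1.nip.io"

openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$workdir/ca.key" -out "$workdir/ca.crt" -subj "/CN=scratch-ca"
openssl req -newkey rsa:2048 -nodes \
  -keyout "$workdir/leaf.key" -out "$workdir/leaf.csr" -subj "/CN=$host"

# Extension file that adds the hostname as a subject alternative name
printf 'subjectAltName=DNS:%s\n' "$host" | tee "$workdir/san.ext"

openssl x509 -req -in "$workdir/leaf.csr" -days 1 \
  -CA "$workdir/ca.crt" -CAkey "$workdir/ca.key" -CAcreateserial \
  -extfile "$workdir/san.ext" -out "$workdir/leaf.crt"

report=$(check_cert "$workdir/ca.crt" "$workdir/leaf.crt")
echo "$report"
</code></pre>
<p>If the SAN line is missing from the output, the API server will likely reject the TLS connection to Dex even though the certificate itself is validly signed.</p>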
<p>Next, generate the <code>AuthenticationConfiguration</code> file. This tells the API server how to validate JWTs — which issuer to trust (<code>url</code>), which audience to expect (<code>audiences</code>), and which JWT claims map to Kubernetes usernames and groups (<code>claimMappings</code>). The CA cert is inlined so the API server can verify Dex's TLS certificate when fetching signing keys:</p>
<pre><code class="language-bash">cat &gt; auth-config.yaml &lt;&lt;EOF
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
  - issuer:
      url: https://dex.127.0.0.1.nip.io:32000
      audiences:
        - kubernetes
      certificateAuthority: |
$(sed 's/^/        /' dex-ca.crt)
    claimMappings:
      username:
        claim: email
        prefix: ""
      groups:
        claim: groups
        prefix: ""
EOF
</code></pre>
<p>The <code>kind-oidc.yaml</code> config uses <code>extraPortMappings</code> to expose Dex's port to your browser, <code>extraMounts</code> to mount files into the kind node, and a <code>kubeadmConfigPatches</code> entry to pass <code>--authentication-config</code> to the API server:</p>
<pre><code class="language-yaml"># kind-oidc.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      # Forward port 32000 from the Docker container to localhost,
      # so your browser can reach Dex's login page
      - containerPort: 32000
        hostPort: 32000
        protocol: TCP
    extraMounts:
      # Copy files from your machine into the Kind node's filesystem
      - hostPath: ./dex-ca.crt
        containerPath: /etc/ca-certificates/dex-ca.crt
        readOnly: true
      - hostPath: ./auth-config.yaml
        containerPath: /etc/kubernetes/auth-config.yaml
        readOnly: true
    kubeadmConfigPatches:
      # Patch the API server to enable OIDC authentication
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            # Tell the API server to load our AuthenticationConfiguration
            authentication-config: /etc/kubernetes/auth-config.yaml
          extraVolumes:
            # Mount files into the API server pod (it runs as a static pod,
            # so it needs explicit volume mounts even though files are on the node)
            - name: dex-ca
              hostPath: /etc/ca-certificates/dex-ca.crt
              mountPath: /etc/ca-certificates/dex-ca.crt
              readOnly: true
              pathType: File
            - name: auth-config
              hostPath: /etc/kubernetes/auth-config.yaml
              mountPath: /etc/kubernetes/auth-config.yaml
              readOnly: true
              pathType: File
</code></pre>
<p>Create the cluster:</p>
<pre><code class="language-bash">kind create cluster --name k8s-auth --config kind-oidc.yaml
</code></pre>
<h3 id="heading-step-2-deploy-dex">Step 2: Deploy Dex</h3>
<p>Dex is an OIDC-compliant identity provider that acts as a bridge between Kubernetes and upstream identity sources (LDAP, SAML, GitHub, and so on). In this demo it runs inside the cluster with a static password database — two hardcoded users you can log in as.</p>
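<p>The API server doesn't talk to Dex on every request. It periodically fetches Dex's public signing keys and uses them to verify the JWT signatures on tokens that Dex issues. The CA certificate you inlined in the <code>AuthenticationConfiguration</code> is what lets the API server trust Dex's TLS certificate when it fetches those keys.</p>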
<p>The deployment has four parts: a ConfigMap with Dex's configuration, a Deployment to run Dex, a NodePort Service to expose it on port 32000 (matching the issuer URL), and RBAC resources so Dex can store state using Kubernetes CRDs.</p>
<p>First, create the namespace and load the TLS certificate as a Kubernetes Secret. Dex needs this to serve HTTPS. Without it, your browser and the API server would refuse to connect:</p>
<pre><code class="language-bash">kubectl create namespace dex

kubectl create secret tls dex-tls \
  --cert=dex.crt \
  --key=dex.key \
  -n dex
</code></pre>
<p>Save the following as <code>dex-config.yaml</code>. This configures Dex with a static password connector — two hardcoded users for the demo:</p>
<pre><code class="language-yaml"># dex-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dex-config
  namespace: dex
data:
  config.yaml: |
    # issuer must exactly match the URL in your AuthenticationConfiguration
    issuer: https://dex.127.0.0.1.nip.io:32000

    # Dex stores refresh tokens and auth codes — here it uses Kubernetes CRDs
    storage:
      type: kubernetes
      config:
        inCluster: true

    # Dex's HTTPS listener — serves the login page and token endpoints
    web:
      https: 0.0.0.0:5556
      tlsCert: /etc/dex/tls/tls.crt
      tlsKey: /etc/dex/tls/tls.key

    # staticClients defines which applications can request tokens.
    # "kubernetes" is the client ID that kubelogin uses when authenticating
    staticClients:
      - id: kubernetes
        redirectURIs:
          - http://localhost:8000     # kubelogin listens here to receive the callback
        name: Kubernetes
        secret: kubernetes-secret     # shared secret between kubelogin and Dex

    # Two demo users with the password "password" (bcrypt-hashed).
    # In production, you'd connect Dex to LDAP, SAML, or a social login instead
    enablePasswordDB: true
    staticPasswords:
      - email: "jane@example.com"
        # bcrypt hash of "password" — generate your own with: htpasswd -bnBC 10 "" password
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "jane"
        userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
      - email: "admin@example.com"
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "admin"
        userID: "a8b53e13-7e8c-4f7b-9a33-6c2f4d8c6a1b"
        groups:
          - platform-engineers
</code></pre>
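<p>The "must exactly match" comment above refers to a plain string comparison, so even a trailing slash is enough to break the login. Here is a minimal Python sketch of that check, assuming the three places this demo sets the issuer. The helper name <code>find_issuer_mismatches</code> and the dictionary keys are illustrative, and the third value has a deliberate trailing slash:</p>
<pre><code class="language-python"># Hypothetical consistency check; the three values mirror this demo's settings.
ISSUER_SETTINGS = {
    "dex config.yaml issuer": "https://dex.127.0.0.1.nip.io:32000",
    "AuthenticationConfiguration issuer url": "https://dex.127.0.0.1.nip.io:32000",
    "kubelogin --oidc-issuer-url": "https://dex.127.0.0.1.nip.io:32000/",
}

def find_issuer_mismatches(settings, reference_key):
    """Return the names of settings whose value differs from the reference."""
    reference = settings[reference_key]
    # Exact string comparison: no trimming of slashes, no case folding.
    return [name for name, value in settings.items() if value != reference]

print(find_issuer_mismatches(ISSUER_SETTINGS, "dex config.yaml issuer"))
</code></pre>
<p>The sketch flags the kubelogin entry, because the trailing slash makes it a different string from the issuer Dex advertises.</p>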
<p>Save the following as <code>dex-deployment.yaml</code>. This creates the Deployment, Service, ServiceAccount, and RBAC that Dex needs to run:</p>
<pre><code class="language-yaml"># dex-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dex
  namespace: dex
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dex
  template:
    metadata:
      labels:
        app: dex
    spec:
      serviceAccountName: dex
      containers:
        - name: dex
          # v2.45.0+ required — earlier versions don't include groups from staticPasswords in tokens
          image: ghcr.io/dexidp/dex:v2.45.0
          command: ["dex", "serve", "/etc/dex/cfg/config.yaml"]
          ports:
            - name: https
              containerPort: 5556
          volumeMounts:
            - name: config
              mountPath: /etc/dex/cfg
            - name: tls
              mountPath: /etc/dex/tls
      volumes:
        - name: config
          configMap:
            name: dex-config
        - name: tls
          secret:
            secretName: dex-tls
---
# NodePort Service — exposes Dex on port 32000 on the Kind node.
# Combined with extraPortMappings, this makes Dex reachable from your browser
apiVersion: v1
kind: Service
metadata:
  name: dex
  namespace: dex
spec:
  type: NodePort
  ports:
    - name: https
      port: 5556
      targetPort: 5556
      nodePort: 32000
  selector:
    app: dex
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dex
  namespace: dex
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dex
rules:
  - apiGroups: ["dex.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dex
subjects:
  - kind: ServiceAccount
    name: dex
    namespace: dex
roleRef:
  kind: ClusterRole
  name: dex
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Apply both manifests and wait for the rollout to complete:</p>
<pre><code class="language-bash">kubectl apply -f dex-config.yaml
kubectl apply -f dex-deployment.yaml
kubectl rollout status deployment/dex -n dex
</code></pre>
<h3 id="heading-step-3-install-kubelogin">Step 3: Install kubelogin</h3>
<p>kubelogin is a kubectl plugin that runs the browser-based OIDC login and hands the resulting token to kubectl. Install it for your platform:</p>
<pre><code class="language-bash"># macOS
brew install int128/kubelogin/kubelogin

# Linux
curl -LO https://github.com/int128/kubelogin/releases/latest/download/kubelogin_linux_amd64.zip
unzip -j kubelogin_linux_amd64.zip kubelogin -d /tmp
sudo mv /tmp/kubelogin /usr/local/bin/kubectl-oidc_login
rm kubelogin_linux_amd64.zip
</code></pre>
<p>Confirm it's installed:</p>
<pre><code class="language-bash">kubectl oidc-login --version
</code></pre>
<h3 id="heading-step-4-configure-a-kubeconfig-entry-for-oidc">Step 4: Configure a kubeconfig entry for OIDC</h3>
<p>This creates a new user and context in your kubeconfig. Instead of using a client certificate (like the default Kind admin), it tells kubectl to use kubelogin to get a token from Dex.</p>
<p>The <code>--oidc-extra-scope</code> flags are important: without <code>email</code> and <code>groups</code>, Dex won't include those claims in the JWT, and the API server won't know who you are or what groups you belong to.</p>
<pre><code class="language-bash">kubectl config set-credentials oidc-user \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://dex.127.0.0.1.nip.io:32000 \
  --exec-arg=--oidc-client-id=kubernetes \
  --exec-arg=--oidc-client-secret=kubernetes-secret \
  --exec-arg=--oidc-extra-scope=email \
  --exec-arg=--oidc-extra-scope=groups \
  --exec-arg=--certificate-authority=$(pwd)/dex-ca.crt

kubectl config set-context oidc@k8s-auth \
  --cluster=kind-k8s-auth \
  --user=oidc-user

kubectl config use-context oidc@k8s-auth
</code></pre>
<h3 id="heading-step-5-trigger-the-login-flow">Step 5: Trigger the login flow</h3>
<p>Jane has no RBAC permissions yet, so first grant her read access from the admin context:</p>
<pre><code class="language-bash">kubectl --context kind-k8s-auth create clusterrolebinding jane-view \
  --clusterrole=view --user=jane@example.com
</code></pre>
<p>The OIDC context is already active from Step 4, so the next kubectl command triggers a login:</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Your browser opens and redirects to the Dex login page. Log in as <code>jane@example.com</code> with password <code>password</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/44fe0657-b383-4245-9e43-45daea7a3f4f.png" alt="dexidp login screen" style="display:block;margin:0 auto" width="866" height="549" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/4f77442a-3055-47fc-a141-8d881731a1f4.png" alt="dexidp grant access" style="display:block;margin:0 auto" width="925" height="512" loading="lazy">

<p>After login, the terminal completes:</p>
<pre><code class="language-plaintext">No resources found in default namespace.
</code></pre>
<p>The browser-based authentication worked. <code>kubectl</code> received the token from Dex and sent it to the API server. The API server verified the JWT signature with Dex's public signing keys, which it fetches over HTTPS using the CA certificate from the <code>AuthenticationConfiguration</code> to trust the connection. It then extracted <code>jane@example.com</code> from the <code>email</code> claim, matched it against the RBAC binding, and authorized the request.</p>
<p>Without the <code>clusterrolebinding</code>, you would see <code>Error from server (Forbidden)</code> — authentication succeeds (the API server knows <em>who</em> you are) but authorization fails (jane has no permissions). This is the distinction between 401 Unauthorized and 403 Forbidden.</p>
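<p>The two stages can be sketched as a toy decision function in Python. This is a simplified model and not how the API server is implemented: <code>handle_request</code> and the shape of <code>bindings</code> are assumptions made for illustration, and <code>claims</code> is <code>None</code> when the token failed validation:</p>
<pre><code class="language-python"># Toy model of the API server's two-stage decision; not real Kubernetes code.
def handle_request(claims, bindings, verb):
    """Return an HTTP status: 401 if unauthenticated, 403 if unauthorized, else 200."""
    # Stage 1: authentication. A token that failed validation means the caller is unknown.
    if claims is None:
        return 401
    username = claims["email"]
    # Stage 2: authorization. The caller is known, but may lack permission.
    allowed_verbs = bindings.get(username, set())
    if verb not in allowed_verbs:
        return 403
    return 200

jane = {"email": "jane@example.com"}
view = {"get", "list", "watch"}

print(handle_request(None, {}, "list"))                          # 401
print(handle_request(jane, {}, "list"))                          # 403
print(handle_request(jane, {"jane@example.com": view}, "list"))  # 200
</code></pre>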
<h3 id="heading-step-6-inspect-the-jwt">Step 6: Inspect the JWT</h3>
<p>A JWT (JSON Web Token) is a signed JSON payload that contains claims about the user. kubelogin caches the token locally under <code>~/.kube/cache/oidc-login/</code> so you don't have to log in on every kubectl command.</p>
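<p>To make the format concrete, here is a small Python sketch that builds a demo token and splits it back into its three dot-separated segments (header, payload, signature). The signature is a placeholder string, not a real signature, and the helper names are illustrative:</p>
<pre><code class="language-python">import base64
import json

def b64url_encode(data):
    """Base64url-encode bytes and strip the '=' padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(segment):
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

# Build a demo token. The signature is a placeholder, not a real RS256 signature.
header = {"alg": "RS256", "typ": "JWT"}
payload = {"iss": "https://dex.127.0.0.1.nip.io:32000", "email": "jane@example.com"}
token = ".".join([
    b64url_encode(json.dumps(header).encode()),
    b64url_encode(json.dumps(payload).encode()),
    "placeholder-signature",
])

header_seg, payload_seg, signature_seg = token.split(".")
print(json.loads(b64url_decode(header_seg)))
print(json.loads(b64url_decode(payload_seg))["email"])
</code></pre>
<p>Only the signature segment protects the token. The header and payload are merely encoded, so anyone holding the token can read them.</p>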
<p>List the directory to find the cached file:</p>
<pre><code class="language-bash">ls ~/.kube/cache/oidc-login/
</code></pre>
<p>Decode the JWT payload directly from the cache:</p>
<pre><code class="language-bash">cat ~/.kube/cache/oidc-login/$(ls ~/.kube/cache/oidc-login/ | grep -v lock | head -1) | \
  python3 -c "
import json, sys, base64
token = json.load(sys.stdin)['id_token'].split('.')[1]
token += '=' * (-len(token) % 4)
print(json.dumps(json.loads(base64.urlsafe_b64decode(token)), indent=2))
"
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-json">{
  "iss": "https://dex.127.0.0.1.nip.io:32000",
  "sub": "CiQwOGE4Njg0Yi1kYjg4LTRiNzMtOTBhOS0zY2QxNjYxZjU0NjYSBWxvY2Fs",
  "aud": "kubernetes",
  "exp": 1775307910,
  "iat": 1775221510,
  "email": "jane@example.com",
  "email_verified": true
}
</code></pre>
<p>The <code>email</code> claim becomes jane's Kubernetes username because the <code>AuthenticationConfiguration</code> maps <code>username.claim: email</code>. The <code>aud</code> matches the configured <code>audiences</code>. The <code>iss</code> matches the issuer <code>url</code>. This is how the API server validates the token without contacting Dex on every request: it verifies the signature locally with Dex's cached public keys, and the CA certificate is only needed to trust the HTTPS connection when it fetches those keys.</p>
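<p>The claim checks described above can be sketched in a few lines of Python. This is a simplified illustration, not the API server's code: signature verification is left out entirely, and <code>username_from_claims</code> is a hypothetical name:</p>
<pre><code class="language-python"># Simplified sketch of the claim checks; signature verification is omitted.
EXPECTED_ISSUER = "https://dex.127.0.0.1.nip.io:32000"
EXPECTED_AUDIENCE = "kubernetes"

def username_from_claims(claims, now):
    """Return the Kubernetes username, or raise ValueError if a check fails."""
    if claims["iss"] != EXPECTED_ISSUER:
        raise ValueError("issuer mismatch")
    # In a JWT, aud may be a single string or a list of strings.
    audiences = claims["aud"] if isinstance(claims["aud"], list) else [claims["aud"]]
    if EXPECTED_AUDIENCE not in audiences:
        raise ValueError("audience mismatch")
    # exp is a Unix timestamp; the token is rejected once the clock reaches it.
    if now not in range(claims["exp"]):
        raise ValueError("token expired")
    return claims["email"]  # corresponds to username.claim: email

claims = {
    "iss": "https://dex.127.0.0.1.nip.io:32000",
    "aud": "kubernetes",
    "exp": 1775307910,
    "iat": 1775221510,
    "email": "jane@example.com",
}
print(username_from_claims(claims, now=1775221600))
</code></pre>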
<h3 id="heading-step-7-map-oidc-groups-to-rbac">Step 7: Map OIDC groups to RBAC</h3>
<p>The <code>admin@example.com</code> user has a <code>groups</code> claim in the Dex config containing <code>platform-engineers</code>. Instead of creating individual RBAC bindings per user, you can bind permissions to a group — anyone whose JWT contains that group gets the permissions automatically:</p>
<pre><code class="language-yaml"># platform-engineers-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers     # matches the groups claim in the JWT
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>You're currently logged in as <code>jane@example.com</code> via the OIDC context, but jane only has <code>view</code> permissions — she can't create cluster-wide RBAC bindings. Switch back to the admin context to apply this:</p>
<pre><code class="language-bash">kubectl config use-context kind-k8s-auth
kubectl apply -f platform-engineers-binding.yaml
kubectl config use-context oidc@k8s-auth
</code></pre>
<p>Now clear the cached token to log out of jane's session, then trigger a new login as <code>admin@example.com</code>:</p>
<pre><code class="language-bash"># Clear the cached token — this is how you "log out" with kubelogin
rm -rf ~/.kube/cache/oidc-login/

# This will open the browser again for a fresh login
kubectl get pods -n default
</code></pre>
<p>Log in as <code>admin@example.com</code> with password <code>password</code>. This time the JWT will contain <code>"groups": ["platform-engineers"]</code>, which matches the <code>ClusterRoleBinding</code> you just created. The admin user gets full cluster access — without ever being added to a kubeconfig by name.</p>
<p>You can verify by decoding the new token (Step 6) — the <code>groups</code> claim will be present:</p>
<pre><code class="language-json">{
  "email": "admin@example.com",
  "groups": ["platform-engineers"]
}
</code></pre>
<p>This is the real power of OIDC group claims: you manage group membership in your identity provider, and Kubernetes permissions follow automatically. Add someone to the <code>platform-engineers</code> group in Dex (or any upstream IdP), and they get cluster-admin access on their next login — no kubeconfig or RBAC changes needed.</p>
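<p>Here is a toy Python model of how subjects are matched, assuming only the two bindings created in this article. <code>roles_for</code> and the <code>BINDINGS</code> structure are illustrative, and the third user is invented to show that a new group member needs no new binding:</p>
<pre><code class="language-python"># Toy model of RBAC subject matching; binding contents mirror this article's demo.
BINDINGS = [
    {"role": "view", "subject": ("User", "jane@example.com")},
    {"role": "cluster-admin", "subject": ("Group", "platform-engineers")},
]

def roles_for(username, groups):
    """Collect every role bound to the user directly or through a group."""
    subjects = {("User", username)} | {("Group", g) for g in groups}
    return sorted(b["role"] for b in BINDINGS if b["subject"] in subjects)

print(roles_for("jane@example.com", []))
print(roles_for("admin@example.com", ["platform-engineers"]))
# A hypothetical new hire: same group claim, no changes to BINDINGS.
print(roles_for("new.hire@example.com", ["platform-engineers"]))
</code></pre>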
<h2 id="heading-cloud-provider-authentication">Cloud Provider Authentication</h2>
<p>AWS, GCP, and Azure each give Kubernetes clusters a native authentication mechanism that ties into their IAM systems.</p>
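<p>The implementations differ in API surface, but they all follow the pattern you just saw with Dex: a trusted identity system issues a short-lived signed token, and the cluster verifies it and maps it to a username and groups. Once you understand how Dex works above, these are all variations on the same theme.</p>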
<h3 id="heading-aws-eks">AWS EKS</h3>
<p>EKS uses the <code>aws-iam-authenticator</code> to translate AWS IAM identities into Kubernetes identities. When you run <code>kubectl</code> against an EKS cluster, the AWS CLI generates a short-lived token signed with your IAM credentials. The API server passes this token to the aws-iam-authenticator webhook, which verifies it against AWS STS and returns the corresponding username and groups.</p>
<p>User access is controlled via the <code>aws-auth</code> ConfigMap in <code>kube-system</code>, which maps IAM role ARNs and IAM user ARNs to Kubernetes usernames and groups. A typical entry looks like this:</p>
<pre><code class="language-yaml"># In kube-system/aws-auth ConfigMap
mapRoles:
  - rolearn: arn:aws:iam::123456789012:role/platform-engineers
    username: platform-engineer:{{SessionName}}
    groups:
      - platform-engineers
</code></pre>
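<p>As a rough illustration of what that mapping does, the Python sketch below resolves an assumed-role identity against a <code>mapRoles</code> list and fills in the <code>{{SessionName}}</code> placeholder. It is a simplified model, not the authenticator's actual code. The account ID, the session name <code>alice</code>, and the function <code>map_identity</code> are assumptions:</p>
<pre><code class="language-python"># Simplified model of an aws-auth mapRoles lookup; not the authenticator's real code.
MAP_ROLES = [
    {
        "rolearn": "arn:aws:iam::123456789012:role/platform-engineers",
        "username": "platform-engineer:{{SessionName}}",
        "groups": ["platform-engineers"],
    },
]

def map_identity(caller_arn):
    """Map an STS assumed-role ARN to a Kubernetes username and groups."""
    # Expected shape: arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION
    prefix, role_name, session_name = caller_arn.split("/")
    account_id = prefix.split(":")[4]
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    for entry in MAP_ROLES:
        if entry["rolearn"] == role_arn:
            username = entry["username"].replace("{{SessionName}}", session_name)
            return username, entry["groups"]
    return None  # no mapping: the caller is not recognized

print(map_identity("arn:aws:sts::123456789012:assumed-role/platform-engineers/alice"))
</code></pre>
<p>The resulting group name is what you would then reference in a <code>ClusterRoleBinding</code>, exactly as in the Dex example.</p>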
<p>AWS is migrating from the <code>aws-auth</code> ConfigMap to a newer Access Entries API, which manages the same mapping through the EKS API rather than a ConfigMap. The underlying authentication mechanism is the same.</p>
<h3 id="heading-google-gke">Google GKE</h3>
<p>GKE integrates with Google Cloud IAM using two different mechanisms, depending on whether you're authenticating as a human user or as a workload.</p>
<p>For human users, GKE accepts standard Google OAuth2 tokens. Running <code>gcloud container clusters get-credentials</code> writes a kubeconfig that uses the <code>gcloud</code> CLI as a credential plugin, generating short-lived tokens from your Google account automatically.</p>
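<p>For pod-level identity (letting a pod act as a Google Cloud service account), GKE uses Workload Identity. You annotate a Kubernetes service account to bind it to a Google Service Account, and pods running as that service account can call Google Cloud APIs using the GSA's permissions. The annotation is only one half of the binding: the Google Service Account must also grant the Kubernetes service account the <code>roles/iam.workloadIdentityUser</code> role.</p>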
<pre><code class="language-bash"># Bind a Kubernetes SA to a Google Service Account
kubectl annotate serviceaccount my-app \
  --namespace production \
  iam.gke.io/gcp-service-account=my-app@my-project.iam.gserviceaccount.com
</code></pre>
<h3 id="heading-azure-aks">Azure AKS</h3>
<p>AKS integrates with Microsoft Entra ID (formerly Azure Active Directory, and still widely called Azure AD). When Azure AD integration is enabled, <code>kubectl</code> requests an Azure AD token on behalf of the user via the Azure CLI, and the AKS API server validates it against Azure AD.</p>
<p>For pod-level identity, AKS uses Azure Workload Identity, which follows the same OIDC federation pattern as GKE Workload Identity. A Kubernetes service account is annotated with an Azure Managed Identity client ID, and pods can request Azure AD tokens without storing any credentials:</p>
<pre><code class="language-bash"># Annotate a service account with the Azure Managed Identity client ID
kubectl annotate serviceaccount my-app \
  --namespace production \
  azure.workload.identity/client-id=&lt;MANAGED_IDENTITY_CLIENT_ID&gt;
</code></pre>
<p>The underlying pattern across all three providers is the same: a trusted OIDC token is issued by the cloud provider, verified by the Kubernetes API server, and mapped to an identity through a binding (the <code>aws-auth</code> ConfigMap, a GKE Workload Identity binding, or an AKS federated identity credential). The OIDC section in this article is the conceptual foundation for all of them.</p>
<h2 id="heading-webhook-token-authentication">Webhook Token Authentication</h2>
<p>Webhook token authentication is worth knowing about because it appears in several common Kubernetes setups, even if you never configure it yourself.</p>
<p>When a request arrives with a bearer token that no other authenticator recognises, Kubernetes can send that token to an external HTTP endpoint for validation. The endpoint returns a response indicating who the token belongs to.</p>
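<p>The exchange uses the <code>TokenReview</code> resource from the <code>authentication.k8s.io/v1</code> API: the API server posts the token in <code>spec.token</code>, and the webhook answers with a <code>status</code> block. Here is a minimal Python sketch of the webhook's core logic, assuming a made-up in-memory token table in place of a real identity system:</p>
<pre><code class="language-python"># Sketch of a webhook authenticator's core logic. The token table is invented
# for illustration; a real webhook would validate against an identity system.
KNOWN_TOKENS = {
    "demo-token-123": {"username": "jane@example.com", "uid": "42", "groups": ["developers"]},
}

def review_token(token_review):
    """Answer a TokenReview request with a TokenReview response."""
    token = token_review["spec"]["token"]
    user = KNOWN_TOKENS.get(token)
    status = {"authenticated": user is not None}
    if user is not None:
        status["user"] = user
    return {
        "apiVersion": token_review["apiVersion"],
        "kind": "TokenReview",
        "status": status,
    }

request = {
    "apiVersion": "authentication.k8s.io/v1",
    "kind": "TokenReview",
    "spec": {"token": "demo-token-123"},
}
print(review_token(request)["status"])
</code></pre>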
<p>This is how EKS authentication works: the API server forwards the token to the aws-iam-authenticator, which runs as a webhook on the managed control plane. Bootstrap tokens used during node join operations look similar from the outside, since a token is generated and embedded in the <code>kubeadm join</code> command, but they are validated by a separate built-in bootstrap token authenticator that checks the token against a Secret in <code>kube-system</code>, not by a webhook.</p>
<p>For most clusters, you'll encounter webhook auth as something already running rather than something you configure. The main thing to know is that it exists and what it looks like when it appears in logs or configuration.</p>
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the OIDC demo cluster
kind delete cluster --name k8s-auth

# Remove generated certificate files
rm -f ca.crt ca.key jane.key jane.csr jane.crt jane.kubeconfig
rm -f dex-ca.crt dex-ca.key dex.crt dex.key dex.csr dex-ca.srl auth-config.yaml

# Remove the kubelogin token cache
rm -rf ~/.kube/cache/oidc-login/
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kubernetes authentication is not a single mechanism — it's a chain of pluggable strategies, each one suited to different use cases. In this article you worked through the most important ones.</p>
<p>x509 client certificates are how Kubernetes works out of the box. The CN field becomes the username, the O field becomes the group, and the cluster CA is the trust anchor. You created a certificate for a new user, bound it to RBAC, and saw exactly how authentication and authorisation interact — authentication gets you in, RBAC determines what you can do.</p>
<p>You also saw the fundamental limitation: Kubernetes doesn't check certificate revocation lists, so a compromised certificate remains valid until it expires. This makes certificates a poor fit for human users in production environments.</p>
<p>OIDC is the production-grade answer. Tokens are short-lived, issued by a trusted identity provider, and map directly to Kubernetes groups through JWT claims. You deployed Dex as a self-hosted OIDC provider, configured the API server to trust it, and set up kubelogin for browser-based authentication.</p>
<p>You then decoded a JWT to see exactly what the API server reads from it, and mapped an OIDC group claim to a Kubernetes ClusterRoleBinding.</p>
<p>Cloud provider authentication — EKS, GKE, AKS — uses the same OIDC foundation with provider-specific wrappers. Understanding how Dex works makes each of those systems immediately readable.</p>
<p>All YAML, certificates, and configuration files from this article are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build AI Agents That Can Control Cloud Infrastructure ]]>
                </title>
                <description>
                    <![CDATA[ Cloud infrastructure has become deeply programmable over the past decade. Nearly every platform exposes APIs that allow developers to create applications, provision databases, configure networking, an ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-ai-agents-that-can-control-cloud-infrastructure/</link>
                <guid isPermaLink="false">69cbefa6c1e86567d7576d3e</guid>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Tue, 31 Mar 2026 16:00:38 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/69bdba8c-6915-4d8c-ab35-1f5d06824f50.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Cloud infrastructure has become deeply programmable over the past decade.</p>
<p>Nearly every platform exposes APIs that allow developers to create applications, provision databases, configure networking, and retrieve metrics.</p>
<p>This shift enabled automation via Infrastructure as Code and CI/CD pipelines, allowing teams to manage systems through scripts rather than dashboards.</p>
<p>Now another layer of automation is emerging. AI agents are starting to participate directly in development workflows. These agents can read codebases, generate implementations, run terminal commands, and help debug systems. The next logical step is to allow them to interact with the infrastructure itself.</p>
<p>Instead of manually inspecting dashboards or remembering complex command-line syntax, developers can ask an AI agent to check system state, deploy services, or retrieve metrics. The agent performs these tasks by interacting with cloud APIs on behalf of the user.</p>
<p>This capability opens the door to a new type of workflow where infrastructure becomes conversational, programmable, and deeply integrated into development environments.</p>
<p>In this article, we will explore how AI agents can interact with cloud infrastructure through APIs, the challenges of exposing large APIs to AI systems, and how architectures like MCP make it possible for agents to discover and execute infrastructure operations safely. We will also look at a practical example of connecting an AI agent to a cloud platform like Sevalla using the search-and-execute pattern.</p>
<p>Familiarity with cloud infrastructure concepts such as APIs, Infrastructure as Code, and CI/CD workflows is recommended to follow along effectively. You should also have a basic understanding of how AI agents or developer assistants interact with code and systems to fully understand the architectures discussed in this article.</p>
<h2 id="heading-what-well-cover">What We'll Cover</h2>
<ul>
<li><p><a href="#heading-ai-agents-are-becoming-part-of-the-development-environment">AI Agents Are Becoming Part of the Development Environment</a></p>
</li>
<li><p><a href="#heading-connecting-ai-agents-to-external-systems">Connecting AI Agents to External Systems</a></p>
</li>
<li><p><a href="#heading-the-challenge-of-large-cloud-apis">The Challenge of Large Cloud APIs</a></p>
</li>
<li><p><a href="#heading-a-simpler-pattern-for-api-access">A Simpler Pattern for API Access</a></p>
</li>
<li><p><a href="#heading-why-sandboxed-code-execution-is-important">Why Sandboxed Code Execution Is Important</a></p>
</li>
<li><p><a href="#heading-practical-example-with-sevalla">Practical Example with Sevalla</a></p>
</li>
<li><p><a href="#heading-what-this-means-for-developers">What This Means for Developers</a></p>
</li>
<li><p><a href="#heading-the-next-evolution-of-infrastructure-automation">The Next Evolution of Infrastructure Automation</a></p>
</li>
</ul>
<h2 id="heading-ai-agents-are-becoming-part-of-the-development-environment">AI Agents Are Becoming Part of the Development Environment</h2>
<p>Modern developer tools increasingly embed AI assistants directly inside coding environments. Tools such as Cursor, Windsurf, and Claude Code allow developers to ask questions about their projects, generate new code, and execute commands without leaving their editor or terminal.</p>
<p>Instead of manually navigating documentation or writing boilerplate code, developers can simply describe what they want. The AI interprets the request and produces the necessary actions.</p>
<p>This approach is already common for tasks like writing functions, refactoring code, or debugging errors. However, infrastructure management is still largely handled through dashboards, terminal commands, or external tooling.</p>
<p>If AI agents are going to assist developers effectively, they need access to the same systems developers interact with every day. That means accessing APIs that manage applications, databases, deployments, and other infrastructure resources.</p>
<p>The challenge is providing that access in a structured and scalable way.</p>
<h2 id="heading-connecting-ai-agents-to-external-systems">Connecting AI Agents to External Systems</h2>
<p>AI agents do not inherently know how to interact with external services. They need a framework that allows them to call tools and access data safely.</p>
<p><a href="https://www.freecodecamp.org/news/how-the-model-context-protocol-works/">Model Context Protocol</a>, or MCP, provides one such framework. MCP is designed to let AI assistants connect to external tools in a standardized way.</p>
<p>An MCP server exposes tools that an AI agent can call when it needs information or wants to act. These tools might retrieve data from a database, query logs, interact with APIs, or execute commands on a remote system.</p>
<p>When the AI agent receives a request from the user, it determines which tool to call and executes that tool through the MCP server. The results are returned to the agent, which can then continue reasoning about the problem.</p>
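<p>On the wire, a tool invocation is a JSON-RPC request with the method <code>tools/call</code>, carrying the tool name and its arguments. The Python sketch below shows the dispatch step on the server side in simplified form. The tool <code>get_app_status</code> and its return value are invented for illustration, and a real server would use an MCP SDK in place of a hand-rolled handler:</p>
<pre><code class="language-python"># Hypothetical tool: a real one would query a live system.
def get_app_status(app_name):
    return f"{app_name}: running"

TOOLS = {"get_app_status": get_app_status}

def handle_tool_call(request):
    """Dispatch a JSON-RPC 'tools/call' request to the matching tool function."""
    params = request["params"]
    tool = TOOLS.get(params["name"])
    if tool is None:
        # Unknown tool: answer with a JSON-RPC "invalid params" error.
        error = {"code": -32602, "message": f"Unknown tool: {params['name']}"}
        return {"jsonrpc": "2.0", "id": request["id"], "error": error}
    text = tool(**params["arguments"])
    result = {"content": [{"type": "text", "text": text}], "isError": False}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_app_status", "arguments": {"app_name": "web"}},
}
print(handle_tool_call(call)["result"]["content"][0]["text"])
</code></pre>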
<p>This architecture allows AI assistants to interact with complex systems while maintaining a clear boundary between the agent and the external environment.</p>
<h2 id="heading-the-challenge-of-large-cloud-apis">The Challenge of Large Cloud APIs</h2>
<p>While MCP enables connecting AI agents to infrastructure systems, cloud platforms introduce an additional challenge.</p>
<p>Most cloud platforms expose large APIs with many endpoints. A typical platform might include endpoints for managing applications, databases, storage, networking, domains, metrics, logs, and deployment pipelines.</p>
<p>If an MCP server exposes each endpoint as a separate tool, the number of tools can quickly grow into the hundreds.</p>
<p>This creates several problems. First, the AI agent must understand the purpose and parameters of every available tool before deciding which one to use. This increases the amount of context required for the agent to operate effectively.</p>
<p>Second, maintaining hundreds of tools becomes difficult for developers who build and maintain the MCP server.</p>
<p>Third, the system becomes rigid. Every time a new API endpoint is added, a new tool must also be created and documented.</p>
<p>For large APIs, this approach quickly becomes impractical.</p>
<h2 id="heading-a-simpler-pattern-for-api-access">A Simpler Pattern for API Access</h2>
<p>A different architecture solves this problem by dramatically reducing the number of tools exposed to the AI.</p>
<p>Instead of providing a separate tool for every API endpoint, the MCP server exposes only two capabilities.</p>
<p>The first capability allows the agent to search the API specification. This lets the agent discover available endpoints, understand parameters, and inspect request or response schemas.</p>
<p>The second capability allows the agent to execute code that calls the API.</p>
<p>In this model, the AI agent dynamically generates the code required to call the API. Because the agent can search the specification and write its own API calls, the MCP server does not need to define individual tools for every endpoint.</p>
<p>This pattern drastically reduces the complexity of the integration while still giving the agent full access to the underlying platform.</p>
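<p>The two capabilities can be sketched in Python as follows. Everything here is a stand-in: the three-path <code>SPEC</code> is an invented miniature of an OpenAPI document, <code>fake_call_api</code> replaces a real HTTP call, and <code>execute</code> takes a Python callable where a real server would receive generated source code:</p>
<pre><code class="language-python"># Tiny stand-in for an OpenAPI spec; paths and summaries are invented for illustration.
SPEC = {
    "/applications": {"get": "List all applications", "post": "Create an application"},
    "/databases": {"get": "List all databases"},
    "/applications/{id}/deployments": {"get": "List deployments for an application"},
}

def search(query):
    """Tool 1: return operations whose summary contains every word in the query."""
    words = query.lower().split()
    return [
        {"method": method.upper(), "path": path, "summary": summary}
        for path, operations in SPEC.items()
        for method, summary in operations.items()
        if all(word in summary.lower() for word in words)
    ]

def execute(call_api, script):
    """Tool 2: run agent-written code that receives only the API helper."""
    return script(call_api)

def fake_call_api(method, path):
    # Placeholder for an authenticated HTTP request to the platform.
    return {"method": method, "path": path, "items": ["web", "worker"]}

match = search("list applications")[0]
result = execute(fake_call_api, lambda api: api(match["method"], match["path"]))
print(match["path"], result["items"])
</code></pre>
<p>Adding a new endpoint to <code>SPEC</code> makes it discoverable immediately, with no new tool definition.</p>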
<h2 id="heading-why-sandboxed-code-execution-is-important">Why Sandboxed Code Execution Is Important</h2>
<p>Allowing AI agents to generate and execute code raises important security considerations.</p>
<p>If the generated code runs unrestricted, it could potentially access sensitive parts of the system or perform unintended operations. To prevent this, the execution environment must be carefully controlled.</p>
<p>A common solution is running the generated code inside a sandboxed environment. In this setup, the code runs in an isolated runtime with limited permissions. The environment exposes only specific functions that allow interaction with the platform’s API.</p>
<p>Because the code cannot access the host system directly, the risk of unintended behavior is greatly reduced. At the same time, the AI agent retains the flexibility to generate custom API calls as needed.</p>
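<p>The "expose only specific functions" idea can be illustrated with a small Python sketch. It shows only the narrow interface, not real isolation, which also requires a separate runtime such as a sandboxed process or VM. The base URL, the allowed methods, and <code>make_api_helper</code> are all assumptions made for the example:</p>
<pre><code class="language-python"># Illustration only: this shows the "narrow interface" idea, not real isolation.
# A production sandbox also needs process-level or VM-level isolation.
ALLOWED_METHODS = {"GET", "POST", "PATCH", "DELETE"}

def make_api_helper(base_url, send):
    """Return the single function that generated code is allowed to call."""
    def request(method, path):
        if method not in ALLOWED_METHODS:
            raise PermissionError(f"method not allowed: {method}")
        # Reject absolute URLs and path traversal so calls stay on the platform API.
        if not path.startswith("/") or ".." in path or "//" in path:
            raise PermissionError(f"path not allowed: {path}")
        return send(method, base_url + path)
    return request

sent = []  # records what would have gone over the network

def fake_send(method, url):
    sent.append((method, url))
    return "ok"

helper = make_api_helper("https://api.example.com/v1", fake_send)
print(helper("GET", "/applications"))
</code></pre>
<p>Generated code that tries <code>helper("GET", "https://evil.example/steal")</code> fails with a <code>PermissionError</code> before any request is sent.</p>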
<p>This combination of dynamic code generation and sandboxed execution makes it possible for AI agents to interact with complex APIs safely.</p>
<h2 id="heading-practical-example-with-sevalla">Practical Example with Sevalla</h2>
<p>A practical implementation of this architecture can be seen in the <a href="https://github.com/sevalla-hosting/mcp">Sevalla MCP server</a>, which exposes a cloud platform’s API to AI agents through the search-and-execute pattern.</p>
<p><a href="https://sevalla.com/">Sevalla</a> is a PaaS provider designed for developers shipping production applications. It offers app hosting, databases, object storage, and static site hosting for your projects. Other platforms, such as AWS and Azure, come with their own MCP tools.</p>
<p>Instead of registering hundreds of tools for every API endpoint, the server provides only two tools that allow the AI agent to explore and interact with the entire platform. See the <a href="https://docs.sevalla.com/quick-starts/coding-agents/overview">full documentation</a> for Sevalla’s MCP server.</p>
<p>The first tool, <code>search</code>, allows the agent to query the platform’s OpenAPI specification. Through this interface the agent can discover available endpoints, understand parameters, and inspect response schemas.</p>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/b1030c9d-b944-41f4-b0a0-4cf1f1bc3039.png" alt="MCP client" style="display:block;margin:0 auto" width="480" height="497" loading="lazy">

<p>Because the API specification is searchable, the agent does not need to know the structure of the platform’s API in advance. It can explore the API dynamically based on the task it needs to perform.</p>
<p>For example, if the user asks the agent to list all applications running in their account, the agent can begin by searching the API specification.</p>
<pre><code class="language-javascript">const endpoints = await sevalla.search("list all applications")
</code></pre>
<p>The result returns the relevant API definitions, including the correct path and parameters required for the request. Once the agent understands which endpoint to use, it can generate the necessary API call.</p>
<p>The second tool, <code>execute</code>, runs JavaScript inside a sandboxed V8 environment. Within this environment the agent can call the API using a helper function provided by the platform.</p>
<pre><code class="language-javascript">const apps = await sevalla.request({
  method: "GET",
  path: "/applications"
})
</code></pre>
<p>Because the code runs inside an isolated V8 sandbox, the generated script cannot access the host system. The only permitted interaction is through the API helper function. This ensures that the AI agent can perform infrastructure operations safely while still retaining the flexibility to generate dynamic API calls.</p>
<p>This approach allows an agent to discover and interact with many parts of the platform without requiring predefined tools for each capability. After discovering endpoints through the API specification, the agent can retrieve application data, inspect deployments, query metrics, or manage infrastructure resources through generated API calls.</p>
<p>The design also significantly reduces context usage. Traditional MCP integrations might require hundreds of tools to represent every endpoint of a large API. In contrast, the search-and-execute pattern allows the entire API surface to be accessed through just two tools.</p>
<p>For developers connecting AI assistants to infrastructure platforms, this architecture provides a practical way to expose large APIs while keeping the integration simple and efficient.</p>
<h2 id="heading-what-this-means-for-developers">What This Means for Developers</h2>
<p>Allowing AI agents to interact with infrastructure APIs changes how developers manage systems.</p>
<p>Instead of manually navigating dashboards or writing long sequences of commands, developers can describe what they want in natural language. The AI agent can interpret the request, discover the relevant API endpoints, and execute the required operations.</p>
<p>This approach also improves observability and debugging. When something goes wrong, the agent can query logs, inspect metrics, and retrieve system state without requiring the developer to manually gather information.</p>
<p>Over time, this type of integration could significantly reduce the friction involved in managing complex cloud systems.</p>
<h2 id="heading-the-next-evolution-of-infrastructure-automation">The Next Evolution of Infrastructure Automation</h2>
<p>Infrastructure automation has evolved through several stages. Early cloud systems relied heavily on manual configuration through web interfaces. Infrastructure as Code later allowed teams to define infrastructure using scripts and configuration files.</p>
<p>CI/CD pipelines then automated the process of deploying and updating systems.</p>
<p>AI agents represent the next step in this progression. By combining APIs, MCP integrations, and sandboxed execution environments, developers can allow intelligent systems to reason about infrastructure and interact with it safely.</p>
<p>Instead of static integrations, agents can dynamically discover and call APIs as needed. This makes infrastructure management more flexible and accessible while maintaining the reliability of programmable systems.</p>
<p>As AI tools become more deeply embedded in development environments, the ability for agents to understand and control infrastructure will likely become a standard capability for modern platforms.</p>
<p><em>Hope you enjoyed this article.</em> <a href="https://www.manishmshiva.me/"><em>Visit my blog</em></a> <em>for more practical tutorials.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Infrastructure as Code with APIs: How to Automate Cloud Resources the Developer Way ]]>
                </title>
                <description>
                    <![CDATA[ Modern software development moves fast. Teams deploy code many times a day. New environments appear and disappear constantly. In this world, manual infrastructure setup simply doesn't scale. For years ]]>
                </description>
                <link>https://www.freecodecamp.org/news/iac-with-apis-how-to-automate-cloud-resources/</link>
                <guid isPermaLink="false">69c17f3c30a9b81e3a894704</guid>
                
                    <category>
                        <![CDATA[ #IaC ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ APIs ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 17:58:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/d45b828b-c7b7-4138-a373-6edea786bc65.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Modern software development moves fast. Teams deploy code many times a day. New environments appear and disappear constantly. In this world, manual infrastructure setup simply doesn't scale.</p>
<p>For years, developers logged into dashboards, clicked through forms, and configured servers by hand. This worked for small projects, but it quickly became fragile. Every manual step increased the chance of mistakes. Environments drifted apart. Reproducing the same setup became difficult.</p>
<p><a href="https://www.redhat.com/en/topics/automation/what-is-infrastructure-as-code-iac">Infrastructure as Code (IaC)</a> solves this problem. Instead of clicking through interfaces, developers define infrastructure using code. This approach makes infrastructure predictable, repeatable, and easy to automate.</p>
<p>In recent years, another approach has become popular alongside traditional IaC tools: using cloud APIs directly to create and manage infrastructure. This gives developers full control over how resources are provisioned and integrated into workflows.</p>
<p>This article explains what Infrastructure as Code means, why APIs are a powerful way to implement it, and how developers can automate cloud resources using simple scripts.</p>
<p>A basic understanding of cloud platforms, command-line interfaces, and scripting languages like Python, Bash, or JavaScript will help you follow along effectively. Familiarity with APIs, authentication methods, and CI/CD concepts will also make it easier to implement the automation techniques discussed in this article.</p>
<h2 id="heading-heres-what-well-cover">Here's what we'll cover:</h2>
<ul>
<li><p><a href="#heading-what-is-infrastructure-as-code">What Is Infrastructure as Code?</a></p>
</li>
<li><p><a href="#heading-the-limits-of-manual-infrastructure">The Limits of Manual Infrastructure</a></p>
</li>
<li><p><a href="#heading-why-apis-are-a-powerful-iac-tool">Why APIs Are a Powerful IaC Tool</a></p>
</li>
<li><p><a href="#heading-automating-infrastructure-with-scripts">Automating Infrastructure with Scripts</a></p>
</li>
<li><p><a href="#heading-practical-example-with-sevalla">Practical Example with Sevalla</a></p>
<ul>
<li><p><a href="#heading-installing-cli">Installing CLI</a></p>
</li>
<li><p><a href="#heading-working-with-your-infrastructure-using-cli">Working with Your Infrastructure using CLI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-infrastructure-as-code-improves-developer-productivity">Infrastructure as Code Improves Developer Productivity</a></p>
</li>
<li><p><a href="#heading-the-future-of-infrastructure">The Future of Infrastructure</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-infrastructure-as-code">What Is Infrastructure as Code?</h2>
<p>Infrastructure as Code means managing infrastructure using code instead of manual processes.</p>
<p>Instead of setting up servers, databases, and networks by hand, you define them in scripts or configuration files. These files describe the desired state of your infrastructure. A tool or script then creates and maintains that state automatically.</p>
<p>For example, instead of manually creating a database, you might define it in code like this:</p>
<pre><code class="language-plaintext">database:
  name: app_db
  engine: postgres
  version: 16
</code></pre>
<p>Once the code runs, the database is created automatically.</p>
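<p>To make the idea of a "desired state" more concrete, here is a minimal sketch of how a tool might compare a definition with what currently exists and decide what to do. The function name <code>plan_changes</code> and the sample resources are assumptions made for illustration, not part of any specific IaC tool.</p>
<pre><code class="language-python"># Hypothetical sketch: compare a desired definition with what currently exists.
def plan_changes(desired, actual):
    """Return the actions needed to make `actual` match `desired`.

    Both arguments map a resource name to its settings (plain dicts).
    """
    actions = []
    for name, settings in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != settings:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state mirrors the definition above; the actual state is made up.
desired = {"app_db": {"engine": "postgres", "version": 16}}
actual = {
    "app_db": {"engine": "postgres", "version": 15},
    "old_cache": {"engine": "redis", "version": 7},
}

print(plan_changes(desired, actual))
</code></pre>
<p>With this sample data, the plan is to update <code>app_db</code> and delete <code>old_cache</code>. Real tools add dependency ordering and safety checks on top of this basic comparison.</p>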
<p>This approach provides several key benefits.</p>
<p>First, it improves consistency. Every environment is created from the same definition. Development, staging, and production environments stay aligned.</p>
<p>Second, it improves repeatability. If infrastructure fails, it can be recreated from code in minutes.</p>
<p>Third, it improves version control. Infrastructure definitions live in the same repositories as application code. Teams can review, track, and roll back changes.</p>
<p>Finally, it enables automation. Infrastructure can be created during deployments, tests, or CI/CD pipelines.</p>
<h2 id="heading-the-limits-of-manual-infrastructure">The Limits of Manual Infrastructure</h2>
<p>Before IaC became common, infrastructure management relied heavily on dashboards and manual configuration.</p>
<p>A developer would open a cloud console and perform steps like:</p>
<ul>
<li><p>Create a server</p>
</li>
<li><p>Attach storage</p>
</li>
<li><p>Configure environment variables</p>
</li>
<li><p>Connect a database</p>
</li>
<li><p>Add a domain</p>
</li>
</ul>
<p>These steps worked, but they introduced problems.</p>
<p>First of all, manual configuration is hard to document. Even if teams write guides, small details are often missed. Over time, environments drift apart.</p>
<p>Manual processes also slow down development. Spinning up a new environment may take hours instead of seconds.</p>
<p>Even worse, manual infrastructure cannot easily be tested. If something breaks, reproducing the same conditions becomes difficult.</p>
<p>Infrastructure as Code removes these problems by turning infrastructure into something that can be scripted, tested, and automated.</p>
<h2 id="heading-why-apis-are-a-powerful-iac-tool">Why APIs Are a Powerful IaC Tool</h2>
<p>Many people associate Infrastructure as Code with tools like Terraform or CloudFormation. These tools are powerful, but they're not the only option.</p>
<p>Nearly every modern cloud platform exposes an API. That API allows developers to create, update, and delete resources programmatically.</p>
<p>This means infrastructure can be controlled directly from code using HTTP requests or command-line interfaces.</p>
<p>Using APIs for IaC has several advantages.</p>
<p>First, it offers maximum flexibility. Developers can integrate infrastructure creation directly into applications, deployment scripts, or internal tools.</p>
<p>Second, it reduces tooling complexity. Instead of learning a specialized IaC language, teams can use languages they already know, such as Python, JavaScript, or Bash.</p>
<p>Third, it enables dynamic infrastructure. Scripts can create resources only when needed, scale them automatically, and remove them when work is complete.</p>
<p>For example, a test suite could automatically create a database, run tests, and delete the database afterwards. This keeps environments clean and reduces costs.</p>
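<p>The create, test, and delete pattern can be sketched with a Python context manager. This is a sketch under stated assumptions: <code>FakeCloudClient</code> is a hypothetical in-memory stand-in for a real API client, and its method names are invented for the example.</p>
<pre><code class="language-python">from contextlib import contextmanager


class FakeCloudClient:
    """Hypothetical stand-in for a real API client; it only records calls."""

    def __init__(self):
        self.databases = {}
        self.counter = 0

    def create_database(self, name):
        self.counter += 1
        db_id = "db-" + str(self.counter)
        self.databases[db_id] = name
        return db_id

    def delete_database(self, db_id):
        del self.databases[db_id]


@contextmanager
def temporary_database(client, name):
    """Create a database, hand its ID to the caller, and always delete it."""
    db_id = client.create_database(name)
    try:
        yield db_id
    finally:
        # Runs even if the tests inside the `with` block raise an exception.
        client.delete_database(db_id)


client = FakeCloudClient()
with temporary_database(client, "test_db") as db_id:
    print("running tests against", db_id)
print("databases left over:", len(client.databases))
</code></pre>
<p>The <code>finally</code> clause is the important design choice here: the cleanup runs whether the tests pass or fail, so temporary resources are less likely to be left running and billed.</p>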
<p>APIs essentially turn the cloud into a programmable platform.</p>
<h2 id="heading-automating-infrastructure-with-scripts">Automating Infrastructure with Scripts</h2>
<p>Using APIs for infrastructure automation usually follows a simple workflow.</p>
<ol>
<li><p>First, a script authenticates with the cloud platform using an API token or credentials.</p>
</li>
<li><p>Second, the script sends requests to create or modify resources such as applications, databases, or storage.</p>
</li>
<li><p>Third, the script captures identifiers or configuration values from the response.</p>
</li>
<li><p>Finally, those values are used in later steps, such as deployments or integrations.</p>
</li>
</ol>
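<p>The four steps above can be sketched in Python using only the standard library. Everything platform-specific here is an assumption for illustration: the base URL <code>https://api.example.com/v1</code>, the <code>/databases</code> and <code>/applications</code> paths, the payload fields, and the <code>id</code> field in responses. Check your platform's API reference for the real names.</p>
<pre><code class="language-python">import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # hypothetical base URL


def build_request(path, payload, token):
    """Steps 1 and 2: attach the API token and encode the JSON body."""
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request over the network and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def provision(token, send=send):
    # Step 3: capture the identifier returned by the first call.
    database = send(build_request("/databases", {"name": "app_db"}, token))
    # Step 4: reuse that identifier in a later request.
    payload = {"name": "myapp", "database_id": database["id"]}
    app = send(build_request("/applications", payload, token))
    return app["id"]


def fake_send(request):
    """Offline stand-in that returns a predictable ID for each path."""
    path = request.full_url.replace(API_BASE, "")
    return {"id": path.strip("/") + "-123"}


# Runs without network access because the sender is replaced by a fake.
print(provision("example-token", send=fake_send))
</code></pre>
<p>Passing the sender in as a parameter keeps the workflow testable offline. In a real script you would call <code>provision</code> with a token read from an environment variable and leave the default <code>send</code> in place.</p>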
<p>Because these steps run in code, they can easily be included in CI/CD pipelines.</p>
<p>A typical pipeline might do the following:</p>
<ul>
<li><p>Create infrastructure</p>
</li>
<li><p>Deploy the application</p>
</li>
<li><p>Run tests</p>
</li>
<li><p>Collect metrics</p>
</li>
<li><p>Destroy temporary environments</p>
</li>
</ul>
<p>This approach ensures every deployment follows the same process.</p>
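<p>As a rough illustration of that pipeline shape, the sketch below runs named stages in order, stops at the first failure, and still destroys the temporary environment. The <code>run_pipeline</code> helper and the stage functions are hypothetical placeholders. A real pipeline would call your platform's API or CLI in each stage.</p>
<pre><code class="language-python">def run_pipeline(steps, teardown):
    """Run steps in order; stop at the first failure but always tear down."""
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
        return completed, None
    except Exception as error:
        return completed, error
    finally:
        # Temporary environments are destroyed on success and on failure.
        teardown()


log = []


def failing_tests():
    raise RuntimeError("tests failed")


steps = [
    ("create infrastructure", lambda: log.append("create")),
    ("deploy application", lambda: log.append("deploy")),
    ("run tests", failing_tests),
    ("collect metrics", lambda: log.append("metrics")),
]

completed, error = run_pipeline(steps, teardown=lambda: log.append("destroy"))
print(completed, "-", error)
print(log)
</code></pre>
<p>In this run the test stage fails, so metrics are never collected, but the teardown still executes and the log ends with <code>destroy</code>.</p>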
<h2 id="heading-practical-example-with-sevalla">Practical Example with Sevalla</h2>
<p>A practical way to apply Infrastructure as Code through APIs is to use a command-line interface that directly interacts with a cloud platform’s API. This lets you automate infrastructure creation using scripts rather than dashboards.</p>
<p>One example is the <a href="https://github.com/sevalla-hosting/cli">Sevalla CLI</a>, which exposes infrastructure operations as terminal commands that can be executed manually or inside automation pipelines.</p>
<p><a href="https://sevalla.com/">Sevalla</a> is a developer-centric PaaS designed to simplify your workflow. It provides high-performance application hosting, managed databases, object storage, and static sites in one unified platform.</p>
<p>Larger providers such as AWS and Azure offer the same kind of automation through their own CLIs, but they typically involve more setup and DevOps overhead. Sevalla aims for simplicity and ease of use, similar to Heroku.</p>
<h3 id="heading-installing-cli">Installing CLI</h3>
<p>You can install the CLI using the following shell command:</p>
<pre><code class="language-plaintext">bash &lt;(curl -fsSL https://raw.githubusercontent.com/sevalla-hosting/cli/main/install.sh)
</code></pre>
<p>Once installed, you can view the list of all available commands using the <code>help</code> command:</p>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/3e6209cd-6c1e-420d-9d60-7376d54849d0.png" alt="Sevalla CLI Help" style="display:block;margin:0 auto" width="1000" height="730" loading="lazy">

<p>The first step is authentication. Make sure you have an account on Sevalla before using the CLI.</p>
<pre><code class="language-plaintext">sevalla login
</code></pre>
<p>For automated environments such as CI/CD pipelines, authentication can be done with an API token. The token is stored in an environment variable so scripts can run without user interaction.</p>
<pre><code class="language-plaintext">export SEVALLA_API_TOKEN="your-api-token"
</code></pre>
<p>Once authenticated, you can quickly view a list of your apps using <code>sevalla apps list</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/e3590cfa-5a95-4c5a-b0e9-8615070932da.png" alt="Sevalla Apps List" style="display:block;margin:0 auto" width="958" height="228" loading="lazy">

<h3 id="heading-working-with-your-infrastructure-using-cli">Working with Your Infrastructure using CLI</h3>
<p>Your infrastructure can now be created directly from the command line. For example, you might start by creating an application service that will run the backend code:</p>
<pre><code class="language-plaintext">sevalla apps create --name myapp --source privateGit --cluster &lt;id&gt;
</code></pre>
<p>This command provisions a new application resource on the platform. Instead of navigating through a web interface and filling out forms, the entire setup is performed through a single command.</p>
<p>Because the command can be stored in scripts or configuration files, it becomes part of the project’s infrastructure definition.</p>
<p>After creating the application, you'll often need a database. You can also provision this programmatically:</p>
<pre><code class="language-plaintext">sevalla databases create \
  --name mydb \
  --type postgresql \
  --db-version 16 \
  --cluster &lt;id&gt; \
  --resource-type &lt;id&gt; \
  --db-name mydb \
  --db-password secret
</code></pre>
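<p>This creates a PostgreSQL database with a defined version and credentials. The literal <code>secret</code> above is a placeholder: in a real script, read the password from an environment variable or a secrets manager rather than hard-coding it. In an automated workflow, the database creation step could run during environment setup for staging or testing.</p>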
<p>Once the application and database exist, the next step might be configuring environment variables so the application can connect to the database:</p>
<pre><code class="language-plaintext">sevalla apps env-vars create &lt;app-id&gt; --key DATABASE_URL --value "postgres://..."
</code></pre>
<p>These configuration values can be injected during deployments, ensuring the application always receives the correct settings.</p>
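<p>When a script assembles a connection string like <code>DATABASE_URL</code>, special characters in the username or password need to be percent-encoded, or the URL may be parsed incorrectly. The sketch below shows one way to do this in Python. The host, port, user, and password are made-up values, and the helper name <code>database_url</code> is an assumption for this example.</p>
<pre><code class="language-python">from urllib.parse import quote


def database_url(user, password, host, port, name):
    """Build a postgres:// URL, percent-encoding the credentials."""
    return "postgres://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        name,
    )


# Hypothetical values; in practice the password comes from a secret store.
print(database_url("app", "p@ss/word", "db.internal.example", 5432, "mydb"))
</code></pre>
<p>Here the <code>@</code> and <code>/</code> in the password are encoded as <code>%40</code> and <code>%2F</code>, so they cannot be mistaken for URL delimiters.</p>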
<p>Deployment automation is another key part of Infrastructure as Code. Instead of manually triggering deployments, a script can deploy new code whenever a repository is updated.</p>
<pre><code class="language-plaintext">sevalla apps deployments trigger &lt;app-id&gt; --branch main
</code></pre>
<p>This allows CI/CD systems to deploy new versions of the application automatically after tests pass.</p>
<p>Infrastructure automation also includes scaling and monitoring. For example, if an application needs more instances to handle traffic, you can update the number of running processes programmatically.</p>
<pre><code class="language-plaintext">sevalla apps processes update &lt;process-id&gt; --app-id &lt;app-id&gt; --instances 3
</code></pre>
<p>You can also retrieve metrics through the CLI. This allows monitoring tools or scripts to analyze system performance.</p>
<pre><code class="language-plaintext">sevalla apps processes metrics cpu-usage &lt;app-id&gt; &lt;process-id&gt;
</code></pre>
<p>Similarly, you can query application metrics such as response time or request rates to detect performance issues.</p>
<p>Another common step in infrastructure automation is configuring domains. Instead of manually linking domains to applications, a script can add them during environment setup.</p>
<pre><code class="language-plaintext">sevalla apps domains add &lt;app-id&gt; --name example.com
</code></pre>
<p>With these commands combined in scripts or pipelines, you can fully automate the lifecycle of your infrastructure. A CI pipeline could create an application, provision a database, configure environment variables, deploy code, attach a domain, and monitor performance, all without human intervention.</p>
<p>Because every command supports JSON output, scripts can also capture values returned by the platform and reuse them in later steps. For example:</p>
<pre><code class="language-plaintext">APP_ID=$(sevalla apps list --json | jq -r '.[0].id')
</code></pre>
<p>This ability to chain commands together makes it easy to build powerful automation workflows.</p>
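<p>The same idea works from a Python script. The sketch below wraps a CLI call, parses its JSON output, and looks up an app by name instead of taking the first entry. It assumes the output is a JSON array of objects. The <code>id</code> field matches the <code>jq</code> example above, while the <code>name</code> field is an assumption that should be checked against the actual CLI output.</p>
<pre><code class="language-python">import json
import subprocess


def run_json(args, runner=subprocess.run):
    """Run a CLI command with --json appended and return the parsed output."""
    result = runner(args + ["--json"], capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def find_app_id(apps, name):
    """Return the ID of the app called `name`, or raise if it is missing."""
    for app in apps:
        if app.get("name") == name:
            return app["id"]
    raise LookupError("no app named " + name)


# Against the real CLI (requires sevalla to be installed and logged in):
# app_id = find_app_id(run_json(["sevalla", "apps", "list"]), "myapp")

# Offline demonstration with made-up data in the assumed shape.
sample = [{"id": "app-1", "name": "blog"}, {"id": "app-2", "name": "myapp"}]
print(find_app_id(sample, "myapp"))
</code></pre>
<p>Selecting by name is less brittle than <code>.[0]</code> when an account has more than one app, and <code>check=True</code> makes the script fail loudly if the CLI command itself returns an error.</p>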
<p>In practice, teams often place these commands inside deployment scripts or pipeline steps. Whenever code is pushed to a repository, the pipeline automatically provisions or updates the infrastructure needed to run the application.</p>
<p>This approach demonstrates how APIs and automation tools can turn infrastructure into something you can manage the same way you manage application code: through scripts, version control, and automated workflows.</p>
<h2 id="heading-infrastructure-as-code-improves-developer-productivity">Infrastructure as Code Improves Developer Productivity</h2>
<p>One of the biggest benefits of Infrastructure as Code is developer productivity.</p>
<p>Developers no longer need to wait for infrastructure changes or manually configure environments.</p>
<p>Instead, infrastructure becomes part of the development workflow.</p>
<p>When a new feature requires a service, the developer simply adds the infrastructure definition to the repository. The pipeline then creates it automatically.</p>
<p>This reduces delays and keeps development moving quickly.</p>
<p>It also makes onboarding easier. New team members can spin up a full environment with a single command.</p>
<h2 id="heading-the-future-of-infrastructure">The Future of Infrastructure</h2>
<p>Cloud infrastructure continues to evolve toward automation and programmability.</p>
<p>Platforms increasingly expose APIs that allow every resource to be created, configured, and monitored through code.</p>
<p>This trend aligns naturally with the way developers already work.</p>
<p>Applications are built with code. Deployments are automated with code. It makes sense that infrastructure should also be defined with code.</p>
<p>Infrastructure as Code with APIs takes this idea even further. It allows infrastructure to be embedded directly into development workflows, pipelines, and internal tools.</p>
<p>The result is faster development, fewer configuration errors, and more reliable systems.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Infrastructure as Code has transformed how teams manage cloud environments.</p>
<p>By replacing manual configuration with code, organizations gain consistency, automation, and repeatability.</p>
<p>Using APIs to control infrastructure adds another level of flexibility. Developers can integrate infrastructure directly into scripts, pipelines, and applications.</p>
<p>This approach turns the cloud into a programmable platform.</p>
<p>As systems grow more complex and deployment cycles accelerate, the ability to automate infrastructure will only become more important.</p>
<p>For modern development teams, treating infrastructure as code is no longer optional. It's the foundation of reliable and scalable software delivery.</p>
<p><em>Hope you enjoyed this article. Learn more about me by</em> <a href="https://www.linkedin.com/in/manishmshiva/"><em><strong>visiting my LinkedIn</strong></em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Full-Stack CRUD App with React, AWS Lambda, DynamoDB, and Cognito Auth ]]>
                </title>
                <description>
                    <![CDATA[ Building a web application that works only on your local machine is one thing. Building one that is secure, connected to a real database, and accessible to anyone on the internet is another challenge  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/full-stack-aws-react-lambda-dynamodb-tutorial/</link>
                <guid isPermaLink="false">69b96f7ec22d3eeb8ac3bf81</guid>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ React ]]>
                    </category>
                
                    <category>
                        <![CDATA[ full stack ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Benedicta Onyebuchi ]]>
                </dc:creator>
                <pubDate>Tue, 17 Mar 2026 15:13:02 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/1a996eff-72f5-4f4d-b8da-cf4d646c3224.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Building a web application that works only on your local machine is one thing. Building one that is secure, connected to a real database, and accessible to anyone on the internet is another challenge entirely. And it requires a different set of tools.</p>
<p>Most production web applications share a common set of needs: they store and retrieve data, they expose that data through an API, they require users to authenticate before accessing sensitive operations, and they need to be deployed somewhere reliable and fast.</p>
<p>Meeting all of those needs used to require managing servers, configuring databases, handling authentication infrastructure, and provisioning hosting environments – often as separate, manual processes.</p>
<p>AWS changes that model significantly. With the combination of services you'll use in this tutorial (Lambda, DynamoDB, API Gateway, Cognito, and CloudFront), you can build and deploy a fully functional, secured, globally distributed application without managing a single server.</p>
<p>Each service handles one specific responsibility:</p>
<ul>
<li><p>DynamoDB stores your data</p>
</li>
<li><p>Lambda runs your business logic on demand</p>
</li>
<li><p>API Gateway exposes your functions as a REST API</p>
</li>
<li><p>Cognito manages user authentication</p>
</li>
<li><p>CloudFront delivers your frontend worldwide over HTTPS</p>
</li>
</ul>
<p>The AWS CDK (Cloud Development Kit) ties all of this together by letting you define every one of those services as TypeScript code. Instead of clicking through the AWS Console to configure each resource manually, you describe your entire infrastructure in a single file and deploy it with one command.</p>
<p>By the end of this tutorial, you will have a fully deployed vendor management dashboard. Users can sign up, log in, and then create, read, and delete vendors, with all data securely stored in Amazon DynamoDB and all routes protected by Amazon Cognito authentication.</p>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>In this tutorial, you'll build a two-panel web app where authenticated users can:</p>
<ul>
<li><p>Add a new vendor (name, category, contact email)</p>
</li>
<li><p>View all saved vendors in real time</p>
</li>
<li><p>Delete a vendor from the list</p>
</li>
<li><p>Sign in and sign out securely</p>
</li>
</ul>
<p>The frontend is built with Next.js. The backend runs entirely on AWS: DynamoDB stores the data, Lambda functions handle the logic, API Gateway exposes a REST API, Cognito manages authentication, and CloudFront serves the app globally over HTTPS.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-who-this-is-for">Who This Is For</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-architecture-overview">Architecture Overview</a></p>
</li>
<li><p><a href="#heading-part-1-set-up-your-aws-account-and-tools">Part 1: Set Up Your AWS Account and Tools</a></p>
</li>
<li><p><a href="#heading-part-2-set-up-the-project-structure">Part 2: Set Up the Project Structure</a></p>
</li>
<li><p><a href="#heading-part-3-define-the-database-dynamodb">Part 3: Define the Database (DynamoDB)</a></p>
</li>
<li><p><a href="#heading-part-4-write-the-lambda-functions">Part 4: Write the Lambda Functions</a></p>
</li>
<li><p><a href="#heading-part-5-build-the-api-with-api-gateway">Part 5: Build the API with API Gateway</a></p>
</li>
<li><p><a href="#heading-part-6-deploy-the-backend-to-aws">Part 6: Deploy the Backend to AWS</a></p>
</li>
<li><p><a href="#heading-part-7-build-the-react-frontend">Part 7: Build the React Frontend</a></p>
</li>
<li><p><a href="#heading-part-8-add-authentication-with-amazon-cognito">Part 8: Add Authentication with Amazon Cognito</a></p>
</li>
<li><p><a href="#heading-part-9-deploy-the-frontend-with-s3-and-cloudfront">Part 9: Deploy the Frontend with S3 and CloudFront</a></p>
</li>
<li><p><a href="#heading-what-you-built">What You Built</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-who-this-is-for">Who This Is For</h2>
<p>This tutorial is for developers who know basic JavaScript and React but have never used AWS. You don't need any prior backend, cloud, or DevOps experience. I'll explain every AWS concept before we use it.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before starting, make sure you have the following installed and available:</p>
<ul>
<li><p><strong>Node.js 18 or higher</strong>: <a href="https://nodejs.org">Download here</a></p>
</li>
<li><p><strong>npm</strong>: Included with Node.js</p>
</li>
<li><p><strong>A code editor</strong>: I recommend VS Code</p>
</li>
<li><p><strong>A terminal</strong>: Any terminal on macOS, Linux, or Windows (WSL recommended on Windows)</p>
</li>
<li><p><strong>An AWS account</strong>: You will create one in Part 1. A credit card is required, but the Free Tier covers everything in this tutorial.</p>
</li>
<li><p><strong>Basic familiarity with React and TypeScript</strong>: You should understand components, <code>useState</code>, and <code>useEffect</code>.</p>
</li>
</ul>
<h2 id="heading-architecture-overview">Architecture Overview</h2>
<p>Before writing any code, here's a plain-English description of how the pieces fit together.</p>
<p>When a user clicks "Add Vendor" in the React app:</p>
<ol>
<li><p>The frontend reads the user's JWT auth token from the browser session</p>
</li>
<li><p>It sends a <code>POST</code> request to API Gateway, including the token in the request header</p>
</li>
<li><p>API Gateway checks the token against Cognito. If the token is invalid or missing, it rejects the request with a 401 error immediately</p>
</li>
<li><p>If the token is valid, API Gateway passes the request to the createVendor Lambda function</p>
</li>
<li><p>The Lambda function writes the new vendor to DynamoDB</p>
</li>
<li><p>DynamoDB confirms the write, and the Lambda returns a success response</p>
</li>
<li><p>The frontend re-fetches the vendor list and updates the UI</p>
</li>
</ol>
<p>The same flow applies to reading and deleting vendors, with different Lambda functions and HTTP methods.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/70486bdc-f272-45db-be30-f10752916546.png" alt="Architecture diagram of the Vendors Tracker Application" style="display:block;margin:0 auto" width="1100" height="499" loading="lazy">

<p><strong>How the app is deployed:</strong> Your React app is exported as a static site, uploaded to an S3 bucket, and served globally through CloudFront. Your backend infrastructure (Lambda functions, API Gateway, DynamoDB, Cognito) is defined in TypeScript using AWS CDK and deployed with a single command.</p>
<h2 id="heading-part-1-set-up-your-aws-account-and-tools">Part 1: Set Up Your AWS Account and Tools</h2>
<p>Before writing any application code, you need three things in place: an AWS account, the right tools on your machine, and credentials that let those tools communicate with AWS on your behalf.</p>
<h3 id="heading-11-create-your-aws-account">1.1 Create Your AWS Account</h3>
<p>If you don't have an AWS account:</p>
<ol>
<li><p>Go to <a href="https://aws.amazon.com">https://aws.amazon.com</a></p>
</li>
<li><p>Click <strong>Create an AWS Account</strong></p>
</li>
<li><p>Follow the sign-up prompts and add a payment method</p>
</li>
<li><p>Once registered, log in to the AWS Management Console</p>
</li>
</ol>
<p>AWS has a Free Tier that covers all the services used in this tutorial. You won't be charged for normal use while following along.</p>
<h3 id="heading-12-install-the-aws-cli-and-cdk">1.2 Install the AWS CLI and CDK</h3>
<p>The <strong>AWS CLI</strong> is a command-line tool that lets you interact with AWS from your terminal: checking resources, configuring credentials, and more.</p>
<p>The <strong>AWS CDK (Cloud Development Kit)</strong> is the tool you will use to define your entire backend (database, Lambda functions, API) using TypeScript code. Instead of clicking through the AWS Console to create each resource, you describe what you want in a TypeScript file and CDK builds it for you.</p>
<p>Install both:</p>
<pre><code class="language-shell"># Install AWS CLI (macOS)
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /

# For Linux, see: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html
# For Windows, see: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html

# Install AWS CDK globally
npm install -g aws-cdk
</code></pre>
<p>Verify both are installed:</p>
<pre><code class="language-shell">aws --version
cdk --version
</code></pre>
<p>Both commands should print a version number. If they do, you are ready to move on.</p>
<h3 id="heading-13-configure-your-aws-credentials-iam">1.3 Configure Your AWS Credentials (IAM)</h3>
<p>This step is critical. Your terminal needs a set of credentials – like a username and password – to act on your behalf inside AWS.</p>
<p>Think of your root account (the one you signed up with) as the master key to your entire AWS account. You should never use it for day-to-day development. Instead, you will create a separate IAM user with its own set of keys. If those keys are ever exposed, you can delete them without compromising your root account.</p>
<h4 id="heading-phase-1-create-an-iam-user">Phase 1: Create an IAM User</h4>
<ol>
<li><p>Log in to the AWS Console and search for IAM in the top search bar</p>
</li>
<li><p>In the left sidebar, click Users, then click Create user</p>
</li>
<li><p>Name the user <code>cdk-dev</code>. Leave "Provide user access to the AWS Management Console" unchecked – you only need terminal access, not console access</p>
</li>
<li><p>On the permissions screen, choose Attach policies directly</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/d4699108-c1aa-4dd3-957c-b84292c719a2.png" alt="IAM Console showing the “Attach policies directly” screen with AdministratorAccess checked" style="display:block;margin:0 auto" width="1100" height="441" loading="lazy">

<ol start="5">
<li><p>Search for <code>AdministratorAccess</code> and check the box next to it</p>
</li>
<li><p>Click through to the end and click Create user</p>
</li>
</ol>
<p>Note on permissions: In a production environment you would use a more restricted policy. For this tutorial, <code>AdministratorAccess</code> is used because CDK creates many different types of AWS resources.</p>
<h4 id="heading-phase-2-generate-access-keys">Phase 2: Generate Access Keys</h4>
<ol>
<li><p>Click on your newly created <code>cdk-dev</code> user from the Users list</p>
</li>
<li><p>Go to the Security credentials tab</p>
</li>
<li><p>Scroll down to Access keys and click Create access key</p>
</li>
<li><p>Select Command Line Interface (CLI), check the acknowledgment box, and click Next</p>
</li>
<li><p>Click Create access key</p>
</li>
</ol>
<p><strong>Important</strong>: Copy both the Access Key ID and the Secret Access Key right now. You will never be able to see the Secret Access Key again after closing this screen. Save both values in a password manager or secure note.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/d85bb4eb-0ecf-4d92-be92-d75af5a534c6.png" alt="IAM Console showing the Create access key screen with the Access Key ID and Secret Access Key" style="display:block;margin:0 auto" width="1100" height="426" loading="lazy">

<h4 id="heading-phase-3-connect-your-terminal-to-aws">Phase 3: Connect Your Terminal to AWS</h4>
<p>Run the following command in your terminal:</p>
<pre><code class="language-shell">aws configure
</code></pre>
<p>You will be prompted for four values:</p>
<pre><code class="language-shell">AWS Access Key ID:     [paste your Access Key ID]
AWS Secret Access Key: [paste your Secret Access Key]
Default region name:   us-east-1
Default output format: json
</code></pre>
<p>Use <code>us-east-1</code> as your region for this tutorial. After this step, every CDK and AWS CLI command you run will use these credentials automatically.</p>
<h2 id="heading-part-2-set-up-the-project-structure">Part 2: Set Up the Project Structure</h2>
<p>You will use a <strong>monorepo</strong> layout – one top-level folder with two sub-projects inside: <code>frontend</code> for your React app and <code>backend</code> for your AWS infrastructure code. They are deployed independently but live side by side.</p>
<h3 id="heading-21-create-the-workspace">2.1 Create the Workspace</h3>
<pre><code class="language-shell">mkdir vendor-tracker &amp;&amp; cd vendor-tracker
mkdir backend frontend
</code></pre>
<h3 id="heading-22-initialize-the-frontend-nextjs">2.2 Initialize the Frontend (Next.js)</h3>
<p>Navigate into the <code>frontend</code> folder and run:</p>
<pre><code class="language-shell">cd frontend
npx create-next-app@latest .
</code></pre>
<p>When prompted, choose the following options:</p>
<ul>
<li><p><strong>TypeScript</strong> --&gt; Yes</p>
</li>
<li><p><strong>ESLint</strong> --&gt; Yes</p>
</li>
<li><p><strong>Tailwind CSS</strong> --&gt; Yes</p>
</li>
<li><p><strong>src/ directory</strong> --&gt; No</p>
</li>
<li><p><strong>App Router</strong> --&gt; Yes</p>
</li>
<li><p><strong>Import alias</strong> --&gt; No (keep the default <code>@/*</code>, which the imports later in this tutorial rely on)</p>
</li>
</ul>
<h3 id="heading-23-initialize-the-backend-cdk">2.3 Initialize the Backend (CDK)</h3>
<p>Navigate into the <code>backend</code> folder and run:</p>
<pre><code class="language-shell">cd ../backend
cdk init app --language typescript
</code></pre>
<p>This generates a boilerplate CDK project. The most important file it creates is <code>backend/lib/backend-stack.ts</code>. This is where you will define all of your AWS infrastructure as TypeScript code.</p>
<p>Also install <code>esbuild</code>, which CDK uses to bundle your Lambda functions:</p>
<pre><code class="language-shell">npm install --save-dev esbuild
</code></pre>
<h3 id="heading-24-understanding-cdk-before-you-write-any-code">2.4 Understanding CDK Before You Write Any Code</h3>
<p>CDK is likely different from most tools you have used. Here is how it works:</p>
<p>Normally, you would create AWS resources by clicking through the AWS Console: create a table here, configure a Lambda function there. CDK lets you do all of that using TypeScript code instead.</p>
<p>When you run <code>cdk deploy</code>, CDK reads your TypeScript file, converts it into an AWS CloudFormation template (an internal AWS format for describing infrastructure), and submits it to AWS. AWS then creates all the resources you described.</p>
<p>A few terms you will see throughout this tutorial:</p>
<ul>
<li><p><strong>Stack</strong>: The collection of all AWS resources you define together. Your <code>BackendStack</code> class is your stack.</p>
</li>
<li><p><strong>Construct</strong>: Each individual AWS resource you create inside a stack (a table, a Lambda function, an API) is called a construct.</p>
</li>
<li><p><strong>Deploy</strong>: Running <code>cdk deploy</code> sends your TypeScript definition to AWS and creates or updates the real resources.</p>
</li>
</ul>
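<p>To make the conversion step more concrete, here is a toy model of it. This is not how CDK is implemented and it uses none of CDK's real classes: the <code>ToyTable</code> type and the <code>synthesize</code> function are invented for illustration. It only shows the idea that a description written in TypeScript is turned into a CloudFormation-style JSON document.</p>
<pre><code class="language-typescript">// Toy model only: these types and functions are invented for illustration
// and are not part of aws-cdk-lib.
type ToyTable = {
  logicalId: string;
  partitionKey: string;
};

type ToyTemplate = {
  Resources: { [logicalId: string]: { Type: string; Properties: object } };
};

// Turn a list of table descriptions into a CloudFormation-style template.
function synthesize(tables: ToyTable[]): ToyTemplate {
  const template: ToyTemplate = { Resources: {} };
  for (const table of tables) {
    template.Resources[table.logicalId] = {
      Type: "AWS::DynamoDB::Table",
      Properties: {
        KeySchema: [{ AttributeName: table.partitionKey, KeyType: "HASH" }],
        AttributeDefinitions: [
          { AttributeName: table.partitionKey, AttributeType: "S" },
        ],
        BillingMode: "PAY_PER_REQUEST",
      },
    };
  }
  return template;
}

const toyTemplate = synthesize([
  { logicalId: "VendorTable", partitionKey: "vendorId" },
]);
console.log(JSON.stringify(toyTemplate, null, 2));
</code></pre>
<p>To see the real template CDK generates for your stack, you can run <code>cdk synth</code> from the <code>backend</code> folder at any point.</p>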
<p>The main file you'll work in is <code>backend/lib/backend-stack.ts</code>. Think of it as the blueprint for your entire backend.</p>
<p>Your final project structure will look like this:</p>
<pre><code class="language-plaintext">vendor-tracker/
├── backend/
│   ├── lambda/
│   │   ├── createVendor.ts
│   │   ├── getVendors.ts
│   │   └── deleteVendor.ts
│   ├── lib/
│   │   └── backend-stack.ts
│   └── package.json
└── frontend/
    ├── app/
    │   ├── layout.tsx
    │   ├── page.tsx
    │   └── providers.tsx
    ├── lib/
    │   └── api.ts
    ├── types/
    │   └── vendor.ts
    └── .env.local
</code></pre>
<h2 id="heading-part-3-define-the-database-dynamodb">Part 3: Define the Database (DynamoDB)</h2>
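<p>DynamoDB is AWS's NoSQL database. Think of it as a fast, scalable key-value store in the cloud. Every item in a DynamoDB table is identified by its primary key. In the simplest case, which is what you'll use here, the primary key is a single attribute called the <strong>partition key</strong>, and its value must be unique for each item. For your vendor table, that key will be <code>vendorId</code>.</p>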
<p>Open <code>backend/lib/backend-stack.ts</code>. Replace the entire file contents with the following:</p>
<pre><code class="language-typescript">import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. DynamoDB Table
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: {
        name: 'vendorId',
        type: dynamodb.AttributeType.STRING,
      },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY, // For development only
    });
  }
}
</code></pre>
<p><strong>What each line does:</strong></p>
<ul>
<li><p><code>partitionKey</code> tells DynamoDB that <code>vendorId</code> is the unique identifier for every record. No two vendors can share the same <code>vendorId</code>.</p>
</li>
<li><p><code>PAY_PER_REQUEST</code> means you pay per read and write request instead of for provisioned capacity. An idle table incurs no request charges (stored data is billed separately), which makes it cost-effective for learning.</p>
</li>
<li><p><code>RemovalPolicy.DESTROY</code> means the table will be deleted when you run <code>cdk destroy</code>. For production apps you would not use this.</p>
</li>
</ul>
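<p>One consequence of the partition key being the unique identifier is worth seeing in code: writing an item whose key already exists replaces the existing item instead of adding a second one. The sketch below mimics that behavior with a plain in-memory object. The <code>putItem</code> helper and <code>fakeTable</code> are hypothetical stand-ins for illustration, not DynamoDB APIs.</p>
<pre><code class="language-typescript">// In-memory stand-in for a table keyed by vendorId (illustration only).
type VendorItem = { vendorId: string; vendorName: string };

const fakeTable: { [vendorId: string]: VendorItem } = {};

// Mimics a put: the item is stored under its key, replacing any previous item.
function putItem(item: VendorItem): void {
  fakeTable[item.vendorId] = item;
}

putItem({ vendorId: "v-1", vendorName: "Acme" });
putItem({ vendorId: "v-2", vendorName: "Globex" });
putItem({ vendorId: "v-1", vendorName: "Acme Corp" }); // same key: replaces "Acme"

console.log(Object.keys(fakeTable).length); // 2
console.log(fakeTable["v-1"].vendorName); // "Acme Corp"
</code></pre>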
<h2 id="heading-part-4-write-the-lambda-functions">Part 4: Write the Lambda Functions</h2>
<p>A Lambda function is your server, but unlike a traditional server, it only runs when it's called. AWS spins it up on demand, runs your code, and shuts it down. You're only charged for the time your code is actually running.</p>
<p>You'll write three Lambda functions:</p>
<ul>
<li><p><code>createVendor.ts</code>: Adds a new vendor to DynamoDB</p>
</li>
<li><p><code>getVendors.ts</code>: Returns all vendors from DynamoDB</p>
</li>
<li><p><code>deleteVendor.ts</code>: Removes a vendor from DynamoDB by ID</p>
</li>
</ul>
<p>From inside your <code>backend</code> folder, create a new folder for the Lambda code:</p>
<pre><code class="language-shell">mkdir lambda
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/6330a84b-77c3-4001-9783-5fedc89ae1c0.png" alt="6330a84b-77c3-4001-9783-5fedc89ae1c0" style="display:block;margin:0 auto" width="300" height="185" loading="lazy">

<h3 id="heading-a-note-on-the-aws-sdk">A Note on the AWS SDK</h3>
<p>All three Lambda functions use <strong>AWS SDK v3</strong> (<code>@aws-sdk/client-dynamodb</code> and <code>@aws-sdk/lib-dynamodb</code>). This is the current standard. The older v2 SDK (<code>aws-sdk</code>) exists but is deprecated and is not bundled in the Node.js 18 and later Lambda runtimes, which include the runtime you'll use. Stick to v3 throughout.</p>
<h3 id="heading-41-create-vendor-lambda">4.1 Create Vendor Lambda</h3>
<p>Create <code>backend/lambda/createVendor.ts</code>:</p>
<pre><code class="language-typescript">import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { randomUUID } from "crypto";

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler = async (event: any) =&gt; {
  try {
    const body = JSON.parse(event.body);

    const item = {
      vendorId: randomUUID(), // Generates a collision-safe unique ID
      name: body.name,
      category: body.category,
      contactEmail: body.contactEmail,
      createdAt: new Date().toISOString(),
    };

    await docClient.send(
      new PutCommand({
        TableName: process.env.TABLE_NAME!,
        Item: item,
      })
    );

    return {
      statusCode: 201,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type,Authorization",
        "Access-Control-Allow-Methods": "OPTIONS,POST,GET,DELETE",
      },
      body: JSON.stringify({ message: "Vendor created", vendorId: item.vendorId }),
    };
  } catch (error) {
    console.error("Error creating vendor:", error);
    return {
      statusCode: 500,
      headers: { "Access-Control-Allow-Origin": "*" },
      body: JSON.stringify({ error: "Failed to create vendor" }),
    };
  }
};
</code></pre>
<p><strong>What each part does:</strong></p>
<ul>
<li><p><code>randomUUID()</code> generates a universally unique ID using Node's built-in <code>crypto</code> module. No extra package is needed. This is more reliable than <code>Date.now()</code>, which can produce duplicate IDs if two requests arrive within the same millisecond.</p>
</li>
<li><p><code>process.env.TABLE_NAME</code> reads the DynamoDB table name from an environment variable. You'll set this value in the CDK stack. This avoids hardcoding the table name inside your Lambda code.</p>
</li>
<li><p>The <code>headers</code> block is required for CORS (Cross-Origin Resource Sharing). Without <code>Access-Control-Allow-Origin</code>, your browser will block responses from a different domain than your frontend. Without <code>Access-Control-Allow-Headers</code>, the <code>Authorization</code> header you add later for Cognito will be rejected during the browser's preflight check.</p>
</li>
</ul>
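<p>The handler above trusts whatever arrives in <code>event.body</code>. A small validation step before the <code>PutCommand</code> could reject incomplete requests with a clear message. The following is a minimal sketch of that idea. <code>parseVendorInput</code>, <code>isNonEmptyString</code>, and the result shape are hypothetical names, not part of the tutorial's code or of any AWS library.</p>
<pre><code class="language-typescript">type VendorInput = { name: string; category: string; contactEmail: string };
type ParseResult =
  | { ok: true; value: VendorInput }
  | { ok: false; error: string };

// True only for strings that contain something other than whitespace.
function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" ? value.trim() !== "" : false;
}

// Parse and validate the raw request body (a JSON string, or null when absent).
function parseVendorInput(rawBody: string | null): ParseResult {
  if (rawBody === null) {
    return { ok: false, error: "Request body is required" };
  }
  let parsed: any;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, error: "Request body must be valid JSON" };
  }
  if (parsed === null || typeof parsed !== "object") {
    return { ok: false, error: "Request body must be a JSON object" };
  }
  for (const field of ["name", "category", "contactEmail"]) {
    if (!isNonEmptyString(parsed[field])) {
      return { ok: false, error: field + " is required" };
    }
  }
  return {
    ok: true,
    value: {
      name: parsed.name,
      category: parsed.category,
      contactEmail: parsed.contactEmail,
    },
  };
}

console.log(parseVendorInput('{"name":"Acme"}'));
console.log(
  parseVendorInput('{"name":"Acme","category":"SaaS","contactEmail":"a@acme.test"}')
);
</code></pre>
<p>In the handler, a failed result would map naturally to a <code>400</code> response, and a successful one would supply the fields for the item you write to DynamoDB.</p>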
<h3 id="heading-42-get-vendors-lambda">4.2 Get Vendors Lambda</h3>
<p>Create <code>backend/lambda/getVendors.ts</code>:</p>
<pre><code class="language-typescript">import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler = async () =&gt; {
  try {
    const response = await docClient.send(
      new ScanCommand({
        TableName: process.env.TABLE_NAME!,
      })
    );

    return {
      statusCode: 200,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type,Authorization",
        "Content-Type": "application/json",
      },
      body: JSON.stringify(response.Items ?? []),
    };
  } catch (error) {
    console.error("Error fetching vendors:", error);
    return {
      statusCode: 500,
      headers: { "Access-Control-Allow-Origin": "*" },
      body: JSON.stringify({ error: "Failed to fetch vendors" }),
    };
  }
};
</code></pre>
<p><strong>What each part does:</strong></p>
<ul>
<li><p><code>ScanCommand</code> reads the items in the table and returns them as an array. A single call returns at most 1 MB of data, so larger tables need pagination using the <code>LastEvaluatedKey</code> value in the response. For a learning project this is fine. In a production app with millions of rows, you would use a more targeted <code>QueryCommand</code> to avoid reading the entire table on every request.</p>
</li>
<li><p><code>response.Items ?? []</code> returns an empty array if the table is empty, preventing the frontend from crashing when there are no vendors yet.</p>
</li>
</ul>
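<p>Scan results are paginated: when a response includes <code>LastEvaluatedKey</code>, more items remain, and you pass that key back as <code>ExclusiveStartKey</code> to fetch the next page. The sketch below shows the loop with the page fetch injected as a function, so it runs without AWS. <code>scanAll</code>, <code>Page</code>, <code>fakePages</code>, and <code>fakeFetch</code> are illustrative names. In a real handler, the injected function would presumably wrap <code>docClient.send(new ScanCommand({ TableName, ExclusiveStartKey }))</code>.</p>
<pre><code class="language-typescript">type Key = { [attribute: string]: unknown };
type Page = { Items?: object[]; LastEvaluatedKey?: Key };

// Fake data source standing in for DynamoDB: three pages of one item each.
const fakePages: Page[] = [
  { Items: [{ vendorId: "v-1" }], LastEvaluatedKey: { vendorId: "v-1" } },
  { Items: [{ vendorId: "v-2" }], LastEvaluatedKey: { vendorId: "v-2" } },
  { Items: [{ vendorId: "v-3" }] }, // no LastEvaluatedKey: this is the last page
];

// Return the page that follows the given start key (or the first page).
async function fakeFetch(startKey?: Key) {
  let index = 0;
  if (startKey !== undefined) {
    const lastSeen = startKey.vendorId;
    index = fakePages.findIndex(function (page) {
      return page.LastEvaluatedKey?.vendorId === lastSeen;
    }) + 1;
  }
  const page: Page = fakePages[index];
  return page;
}

// Keep requesting pages until a response has no LastEvaluatedKey.
async function scanAll(fetchPage: typeof fakeFetch) {
  const items: object[] = [];
  let startKey: Key | undefined = undefined;
  do {
    const page: Page = await fetchPage(startKey);
    items.push(...(page.Items ?? []));
    startKey = page.LastEvaluatedKey;
  } while (startKey !== undefined);
  return items;
}

scanAll(fakeFetch).then(function (items) {
  console.log(items.length); // 3
});
</code></pre>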
<h3 id="heading-43-delete-vendor-lambda">4.3 Delete Vendor Lambda</h3>
<p>Create <code>backend/lambda/deleteVendor.ts</code>:</p>
<pre><code class="language-typescript">import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, DeleteCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const docClient = DynamoDBDocumentClient.from(client);

export const handler = async (event: any) =&gt; {
  try {
    const body = JSON.parse(event.body ?? "{}"); // A missing body becomes {} so the guard below returns 400
    const { vendorId } = body;

    if (!vendorId) {
      return {
        statusCode: 400,
        headers: { "Access-Control-Allow-Origin": "*" },
        body: JSON.stringify({ error: "vendorId is required" }),
      };
    }

    await docClient.send(
      new DeleteCommand({
        TableName: process.env.TABLE_NAME!,
        Key: { vendorId },
      })
    );

    return {
      statusCode: 200,
      headers: {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Headers": "Content-Type,Authorization",
        "Access-Control-Allow-Methods": "OPTIONS,POST,GET,DELETE",
      },
      body: JSON.stringify({ message: "Vendor deleted" }),
    };
  } catch (error) {
    console.error("Error deleting vendor:", error);
    return {
      statusCode: 500,
      headers: { "Access-Control-Allow-Origin": "*" },
      body: JSON.stringify({ error: "Failed to delete vendor" }),
    };
  }
};
</code></pre>
<p><strong>What each part does:</strong></p>
<ul>
<li><p><code>DeleteCommand</code> removes the item whose <code>vendorId</code> matches the key you provide. DynamoDB doesn't return an error if the item doesn't exist. It simply does nothing.</p>
</li>
<li><p>The <code>400</code> guard at the top returns a clear error if the caller forgets to send a <code>vendorId</code>, rather than letting DynamoDB throw a confusing internal error.</p>
</li>
</ul>
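<p>All three handlers repeat the same CORS headers. One optional way to reduce the repetition is a small helper that builds the response object. This is a refactoring sketch rather than part of the tutorial's code: <code>jsonResponse</code> and <code>CORS_HEADERS</code> are hypothetical names, and the header values are the ones already used above.</p>
<pre><code class="language-typescript">const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Headers": "Content-Type,Authorization",
  "Access-Control-Allow-Methods": "OPTIONS,POST,GET,DELETE",
};

// Build a Lambda response with the shared CORS headers and a JSON body.
function jsonResponse(statusCode: number, payload: unknown) {
  return {
    statusCode,
    headers: { ...CORS_HEADERS, "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

console.log(jsonResponse(400, { error: "vendorId is required" }));
</code></pre>
<p>With a helper like this, each <code>return</code> in a handler shrinks to one line, and a change to the CORS policy happens in a single place.</p>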
<h2 id="heading-part-5-build-the-api-with-api-gateway">Part 5: Build the API with API Gateway</h2>
<p>API Gateway is what gives your Lambda functions a public URL. Without it, there's no way for your browser to trigger a Lambda function. Think of it as the front door of your backend: it receives HTTP requests, checks whether the caller is authorized, routes the request to the correct Lambda, and returns the Lambda's response to the caller.</p>
<p>Now you'll wire everything together in <code>backend/lib/backend-stack.ts</code>.</p>
<h3 id="heading-51-add-lambda-functions-and-api-gateway-to-the-stack">5.1 Add Lambda Functions and API Gateway to the Stack</h3>
<p>Replace the entire contents of <code>backend/lib/backend-stack.ts</code> with this complete, assembled file:</p>
<pre><code class="language-typescript">import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. DynamoDB Table 
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: {
        name: 'vendorId',
        type: dynamodb.AttributeType.STRING,
      },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // 2. Lambda Functions
    const lambdaEnv = { TABLE_NAME: vendorTable.tableName };

    const createVendorLambda = new NodejsFunction(this, 'CreateVendorHandler', {
      entry: 'lambda/createVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const getVendorsLambda = new NodejsFunction(this, 'GetVendorsHandler', {
      entry: 'lambda/getVendors.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const deleteVendorLambda = new NodejsFunction(this, 'DeleteVendorHandler', {
      entry: 'lambda/deleteVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    // 3. Permissions (Least Privilege)
    vendorTable.grantWriteData(createVendorLambda);
    vendorTable.grantReadData(getVendorsLambda);
    vendorTable.grantWriteData(deleteVendorLambda);

    // 4. API Gateway
    const api = new apigateway.RestApi(this, 'VendorApi', {
      restApiName: 'Vendor Service',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    const vendors = api.root.addResource('vendors');
    vendors.addMethod('POST', new apigateway.LambdaIntegration(createVendorLambda));
    vendors.addMethod('GET', new apigateway.LambdaIntegration(getVendorsLambda));
    vendors.addMethod('DELETE', new apigateway.LambdaIntegration(deleteVendorLambda));

    // 5. Outputs
    new cdk.CfnOutput(this, 'ApiEndpoint', {
      value: api.url,
    });
  }
}
</code></pre>
<p><strong>What each section does:</strong></p>
<p><code>NodejsFunction</code> is a special CDK construct that automatically bundles your Lambda code and all its dependencies into a single file using <code>esbuild</code> before uploading it to AWS. This is why you installed <code>esbuild</code> in Part 2.</p>
<p>For TypeScript handlers like these, prefer <code>NodejsFunction</code> over the basic <code>lambda.Function</code> construct. The basic version leaves compiling and bundling to you, and a missed dependency typically shows up as a "Module not found" error at runtime.</p>
<p><strong>Permissions (Least Privilege):</strong> In AWS, no resource can communicate with any other resource by default. A Lambda function has no access to DynamoDB, S3, or anything else unless you explicitly grant it.</p>
<p>This is called the <strong>Least Privilege</strong> principle: each piece of your system gets exactly the permissions it needs, and nothing more. <code>grantWriteData</code> lets a Lambda write and delete items. <code>grantReadData</code> lets a Lambda read items. Using separate grants for each function means the <code>getVendors</code> Lambda can never accidentally delete data.</p>
<p><code>CfnOutput</code> prints a value to your terminal after <code>cdk deploy</code> completes. You'll use the <code>ApiEndpoint</code> URL to configure your frontend.</p>
<h2 id="heading-part-6-deploy-the-backend-to-aws">Part 6: Deploy the Backend to AWS</h2>
<p>Your infrastructure is fully defined in code. Now you'll deploy it to AWS and get a live API URL.</p>
<h3 id="heading-61-bootstrap-your-aws-environment">6.1 Bootstrap Your AWS Environment</h3>
<p>Before your first CDK deployment, CDK needs a few supporting resources in your account, including an S3 bucket where it can upload your Lambda bundles and other assets, plus IAM roles used during deployment. Creating them is called <strong>bootstrapping</strong> and only needs to be done once per AWS account per region.</p>
<p>From inside your <code>backend</code> folder, run:</p>
<pre><code class="language-shell">cdk bootstrap
</code></pre>
<p><strong>Important</strong>: Bootstrapping is region-specific. If you ever switch to a different AWS region, you will need to run <code>cdk bootstrap</code> again in that region.</p>
<h3 id="heading-62-deploy">6.2 Deploy</h3>
<p>Run:</p>
<pre><code class="language-shell">cdk deploy
</code></pre>
<p>CDK will display a summary of everything it is about to create and ask for your confirmation. Type <code>y</code> and press Enter.</p>
<p>When the deployment finishes, you'll see an <strong>Outputs</strong> section in your terminal:</p>
<pre><code class="language-plaintext">Outputs:
BackendStack.ApiEndpoint = https://abcdef123.execute-api.us-east-1.amazonaws.com/prod/
</code></pre>
<p>Copy that URL. You'll need it when building the frontend.</p>
<h3 id="heading-63-troubleshooting-how-to-read-aws-error-logs">6.3 Troubleshooting: How to Read AWS Error Logs</h3>
<p>Real deployments rarely go perfectly the first time. If something goes wrong after deploying, here is how to find the actual error message.</p>
<h4 id="heading-error-502-bad-gateway">Error: 502 Bad Gateway</h4>
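<p>A <code>502</code> means API Gateway received your request but did not get a usable response from your Lambda, typically because the function crashed or failed to load. Common causes are a module that could not be imported or an error thrown outside the handler's <code>try</code>/<code>catch</code> block. A missing environment variable such as <code>TABLE_NAME</code> is also worth checking, although the handlers in this tutorial catch that failure and return a <code>500</code> instead.</p>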
<p>To find the actual error message, use <strong>CloudWatch Logs</strong>:</p>
<ol>
<li><p>Log in to the AWS Console and search for CloudWatch</p>
</li>
<li><p>In the left sidebar, click Logs --&gt; Log groups</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/abfb78fc-574b-4a75-a12b-12fb09f041b3.png" alt="CloudWatch left sidebar with log groups, and the search field showing /aws/lambda/" style="display:block;margin:0 auto" width="1915" height="428" loading="lazy">

<ol start="3">
<li><p>Find the group named <code>/aws/lambda/BackendStack-CreateVendorHandler...</code></p>
</li>
<li><p>Click the most recent Log stream</p>
</li>
<li><p>Read the error message. It will tell you exactly what went wrong</p>
</li>
</ol>
<p>Two common messages and their fixes:</p>
<ul>
<li><p><code>Runtime.ImportModuleError</code>: Your Lambda cannot find a module. Make sure you're using <code>NodejsFunction</code> (not <code>lambda.Function</code>) in your CDK stack. <code>NodejsFunction</code> automatically bundles dependencies; <code>lambda.Function</code> does not.</p>
</li>
<li><p><code>AccessDeniedException</code>: Your Lambda tried to access DynamoDB but doesn't have permission. Check that you have the correct <code>grantWriteData</code> or <code>grantReadData</code> call in your stack for that Lambda.</p>
</li>
</ul>
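<p>With the Lambda proxy integration that <code>LambdaIntegration</code> sets up by default, API Gateway expects the function to return an object with a numeric <code>statusCode</code> and, if present, a string <code>body</code>. A return value in any other shape is another source of <code>502</code> errors. The rough check below illustrates the expected shape. <code>looksLikeProxyResponse</code> is a hypothetical helper for local debugging, not an AWS API, and it does not cover every field API Gateway accepts.</p>
<pre><code class="language-typescript">// Illustrative check only: mirrors the main fields API Gateway reads
// from a Lambda proxy response.
function looksLikeProxyResponse(value: unknown): boolean {
  if (value === null || typeof value !== "object") {
    return false;
  }
  const candidate = value as { statusCode?: unknown; body?: unknown };
  const hasStatus = Number.isInteger(candidate.statusCode);
  const hasValidBody =
    candidate.body === undefined || typeof candidate.body === "string";
  return hasStatus ? hasValidBody : false;
}

// A correct response: the body is a JSON string.
console.log(
  looksLikeProxyResponse({ statusCode: 200, body: JSON.stringify({ ok: true }) })
); // true

// A common mistake: returning the object itself instead of a string.
console.log(looksLikeProxyResponse({ statusCode: 200, body: { ok: true } })); // false

// Another mistake: no statusCode at all.
console.log(looksLikeProxyResponse({ message: "done" })); // false
</code></pre>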
<h2 id="heading-part-7-build-the-react-frontend">Part 7: Build the React Frontend</h2>
<p>Your backend is live. Now you'll build the React UI that talks to it.</p>
<h3 id="heading-71-define-the-vendor-type">7.1 Define the Vendor Type</h3>
<p>Before writing any API or component code, define what a "vendor" looks like in TypeScript. This gives you type safety throughout your frontend code.</p>
<p>Create <code>frontend/types/vendor.ts</code>:</p>
<pre><code class="language-typescript">export interface Vendor {
  vendorId?: string; // Optional when creating — the Lambda generates it
  name: string;
  category: string;
  contactEmail: string;
  createdAt?: string;
}
</code></pre>
<p>The <code>vendorId?</code> is marked optional with <code>?</code> because when you are <em>creating</em> a new vendor, you don't have an ID yet. The <code>createVendor</code> Lambda generates one. When you <em>read</em> vendors back from the API, <code>vendorId</code> will always be present.</p>
<h3 id="heading-72-create-the-api-service-layer">7.2 Create the API Service Layer</h3>
<p>Rather than writing <code>fetch</code> calls directly inside your React components, you'll centralize all your API logic in one file. This pattern is called a <strong>service layer</strong>. It keeps your components clean and makes it easy to update API calls in one place.</p>
<p>First, create a <code>.env.local</code> file inside your <code>frontend</code> folder to store your API URL:</p>
<pre><code class="language-bash"># frontend/.env.local
NEXT_PUBLIC_API_URL=https://abcdef123.execute-api.us-east-1.amazonaws.com/prod
</code></pre>
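<p>Replace the URL with the <code>ApiEndpoint</code> value from your <code>cdk deploy</code> output, leaving off the trailing slash. The API functions you'll write next append <code>/vendors</code> themselves, so a trailing slash here would produce a double slash in the request URL. The <code>NEXT_PUBLIC_</code> prefix is required by Next.js to make an environment variable accessible in the browser.</p>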
<p>You might be wondering: <strong>why not hardcode the URL</strong>? If you paste your API URL directly into your code and push it to GitHub, it becomes publicly visible. While an API URL alone does not expose your data (Cognito will protect that), it's good practice to keep environment-specific URLs and secrets out of source control. Keep in mind that <code>NEXT_PUBLIC_</code> values are still embedded in the JavaScript sent to the browser, so never store real secrets in them.</p>
<p>From the project root, make sure <code>.env.local</code> is in your <code>.gitignore</code>:</p>
<pre><code class="language-shell">echo ".env.local" &gt;&gt; frontend/.gitignore
</code></pre>
<p>Now create <code>frontend/lib/api.ts</code>:</p>
<pre><code class="language-typescript">import { Vendor } from '@/types/vendor';

const BASE_URL = process.env.NEXT_PUBLIC_API_URL!;

export const getVendors = async (): Promise&lt;Vendor[]&gt; =&gt; {
  const response = await fetch(`${BASE_URL}/vendors`);
  if (!response.ok) throw new Error('Failed to fetch vendors');
  return response.json();
};

export const createVendor = async (vendor: Omit&lt;Vendor, 'vendorId' | 'createdAt'&gt;): Promise&lt;void&gt; =&gt; {
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(vendor),
  });
  if (!response.ok) throw new Error('Failed to create vendor');
};

export const deleteVendor = async (vendorId: string): Promise&lt;void&gt; =&gt; {
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'DELETE',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ vendorId }),
  });
  if (!response.ok) throw new Error('Failed to delete vendor');
};
</code></pre>
<p><strong>What each part does:</strong></p>
<ul>
<li><p><code>Omit&lt;Vendor, 'vendorId' | 'createdAt'&gt;</code> means the <code>createVendor</code> function accepts a vendor without an ID or timestamp (those are generated server-side).</p>
</li>
<li><p><code>if (!response.ok) throw new Error(...)</code> ensures that any HTTP error (4xx or 5xx) surfaces as a JavaScript error in your component, where you can show the user a meaningful message instead of silently failing.</p>
</li>
</ul>
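<p>Because <code>cdk deploy</code> prints the endpoint with a trailing slash while the <code>fetch</code> calls add <code>/vendors</code>, it is easy to end up with a double slash in the request URL. A small helper can normalize this. <code>joinUrl</code> is a hypothetical helper shown as a sketch. It is not part of Next.js or of the tutorial's code.</p>
<pre><code class="language-typescript">// Join a base URL and a path with exactly one slash between them.
function joinUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop trailing slashes
  const trimmedPath = path.replace(/^\/+/, ""); // drop leading slashes
  return trimmedBase + "/" + trimmedPath;
}

console.log(
  joinUrl("https://abcdef123.execute-api.us-east-1.amazonaws.com/prod/", "/vendors")
);
// https://abcdef123.execute-api.us-east-1.amazonaws.com/prod/vendors
</code></pre>
<p>If you adopt something like this, the calls in <code>api.ts</code> could use <code>joinUrl(BASE_URL, 'vendors')</code> and work whether or not the environment variable ends with a slash.</p>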
<p>You'll update these functions later in Part 8 to include the Cognito auth token.</p>
<h3 id="heading-73-build-the-main-page">7.3 Build the Main Page</h3>
<p>Now create the main page component. It includes a form for adding vendors and a live list that displays all current vendors.</p>
<p>Replace the contents of <code>frontend/app/page.tsx</code> with:</p>
<pre><code class="language-typescript">'use client';

import { useState, useEffect } from 'react';
import { createVendor, getVendors, deleteVendor } from '@/lib/api';
import { Vendor } from '@/types/vendor';

export default function Home() {
  const [vendors, setVendors] = useState&lt;Vendor[]&gt;([]);
  const [form, setForm] = useState({ name: '', category: '', contactEmail: '' });
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  const loadVendors = async () =&gt; {
    try {
      const data = await getVendors();
      setVendors(data);
    } catch {
      setError('Failed to load vendors.');
    }
  };

  // Load vendors once when the page first renders
  useEffect(() =&gt; {
    loadVendors();
  }, []);
  // The empty [] means this runs only once. Without it, the effect would
  // run after every render, causing an infinite loop of fetch requests.

  const handleSubmit = async (e: React.FormEvent) =&gt; {
    e.preventDefault(); // Prevent the browser from reloading the page on submit
    setLoading(true);
    setError('');
    try {
      await createVendor(form);
      setForm({ name: '', category: '', contactEmail: '' }); // Reset the form
      await loadVendors(); // Refresh the list from DynamoDB
    } catch {
      setError('Failed to add vendor. Please try again.');
    } finally {
      setLoading(false);
    }
  };

  const handleDelete = async (vendorId: string) =&gt; {
    try {
      await deleteVendor(vendorId);
      await loadVendors(); // Refresh after deleting
    } catch {
      setError('Failed to delete vendor.');
    }
  };

  return (
    &lt;main className="p-10 max-w-5xl mx-auto"&gt;
      &lt;h1 className="text-3xl font-bold mb-2 text-gray-900"&gt;Vendor Tracker&lt;/h1&gt;
      &lt;p className="text-gray-500 mb-8"&gt;Manage your vendors, stored in AWS DynamoDB.&lt;/p&gt;

      {error &amp;&amp; (
        &lt;div className="mb-4 p-3 bg-red-100 text-red-700 rounded"&gt;{error}&lt;/div&gt;
      )}

      &lt;div className="grid grid-cols-1 md:grid-cols-2 gap-10"&gt;

        {/* ── Add Vendor Form ── */}
        &lt;section&gt;
          &lt;h2 className="text-xl font-semibold mb-4 text-gray-800"&gt;Add New Vendor&lt;/h2&gt;
          &lt;form onSubmit={handleSubmit} className="space-y-4"&gt;
            &lt;input
              className="w-full p-2 border rounded text-black focus:outline-none focus:ring-2 focus:ring-orange-400"
              placeholder="Vendor Name"
              value={form.name}
              onChange={e =&gt; setForm({ ...form, name: e.target.value })}
              required
            /&gt;
            &lt;input
              className="w-full p-2 border rounded text-black focus:outline-none focus:ring-2 focus:ring-orange-400"
              placeholder="Category (e.g. SaaS, Hardware)"
              value={form.category}
              onChange={e =&gt; setForm({ ...form, category: e.target.value })}
              required
            /&gt;
            &lt;input
              className="w-full p-2 border rounded text-black focus:outline-none focus:ring-2 focus:ring-orange-400"
              placeholder="Contact Email"
              type="email"
              value={form.contactEmail}
              onChange={e =&gt; setForm({ ...form, contactEmail: e.target.value })}
              required
            /&gt;
            &lt;button
              type="submit"
              disabled={loading}
              className="w-full bg-orange-500 text-white p-2 rounded hover:bg-orange-600 disabled:bg-gray-400 transition-colors"
            &gt;
              {loading ? 'Saving...' : 'Add Vendor'}
            &lt;/button&gt;
          &lt;/form&gt;
        &lt;/section&gt;

        {/* ── Vendor List ── */}
        &lt;section&gt;
          &lt;h2 className="text-xl font-semibold mb-4 text-gray-800"&gt;
            Current Vendors ({vendors.length})
          &lt;/h2&gt;
          &lt;div className="space-y-3"&gt;
            {vendors.length === 0 ? (
              &lt;p className="text-gray-400 italic"&gt;No vendors yet. Add one using the form.&lt;/p&gt;
            ) : (
              vendors.map(v =&gt; (
                &lt;div
                  key={v.vendorId}
                  className="p-4 border rounded shadow-sm bg-white flex justify-between items-start"
                &gt;
                  &lt;div&gt;
                    &lt;p className="font-semibold text-gray-900"&gt;{v.name}&lt;/p&gt;
                    &lt;p className="text-sm text-gray-500"&gt;{v.category} · {v.contactEmail}&lt;/p&gt;
                  &lt;/div&gt;
                  &lt;button
                    onClick={() =&gt; v.vendorId &amp;&amp; handleDelete(v.vendorId)}
                    className="ml-4 text-sm text-red-500 hover:text-red-700 hover:underline"
                  &gt;
                    Delete
                  &lt;/button&gt;
                &lt;/div&gt;
              ))
            )}
          &lt;/div&gt;
        &lt;/section&gt;

      &lt;/div&gt;
    &lt;/main&gt;
  );
}
</code></pre>
<p><strong>Key points in this component:</strong></p>
<ul>
<li><p><code>'use client'</code> at the top is a Next.js directive. It marks this file as a Client Component, which is required for React hooks such as <code>useState</code> and <code>useEffect</code> and for event handlers. Without it, Next.js treats the file as a Server Component, where those features are not available.</p>
</li>
<li><p><code>e.preventDefault()</code> inside <code>handleSubmit</code> stops the browser's default form submission behavior, which would cause a full page reload and wipe your React state.</p>
</li>
<li><p>After every <code>createVendor</code> or <code>deleteVendor</code> call, <code>loadVendors()</code> is called again. This re-fetches the latest data from DynamoDB so the UI always matches what is actually stored in the database.</p>
</li>
</ul>
<h3 id="heading-74-test-the-app-locally">7.4 Test the App Locally</h3>
<p>Start your Next.js development server:</p>
<pre><code class="language-shell">cd frontend
npm run dev
</code></pre>
<p>Open <code>http://localhost:3000</code> in your browser. You should see the two-panel layout. Try adding a vendor and confirm it appears in the list.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/281f971a-27b8-49b3-9079-e12601525d80.png" alt="The running Vendor Tracker app at localhost:3000 showing the two-panel layout with the Add Vendor form on the left and an empty vendor list on the right" style="display:block;margin:0 auto" width="1690" height="708" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/88b5dd74-5847-4310-bec3-b1a2b129fbaa.png" alt="The Vendor Tracker app after a vendor has been added, showing the vendor card in the list" style="display:block;margin:0 auto" width="1646" height="598" loading="lazy">

<h4 id="heading-verifying-the-connection-to-aws">Verifying the connection to AWS</h4>
<p>Open Chrome DevTools (F12) and click the Network tab. When you add a vendor, you should see:</p>
<ul>
<li><p>A <code>POST</code> request to your AWS API URL returning a <strong>201</strong> status code</p>
</li>
<li><p>A <code>GET</code> request returning <strong>200</strong> with the updated vendor list</p>
</li>
</ul>
<p>You can also verify the data was saved by opening the AWS Console, navigating to <strong>DynamoDB --&gt; Tables --&gt; VendorTable --&gt; Explore table items</strong>. Your vendor should appear there.</p>
<h2 id="heading-part-8-add-authentication-with-amazon-cognito">Part 8: Add Authentication with Amazon Cognito</h2>
<p>Right now your API is completely open. Anyone who finds your API URL can add or delete vendors. You'll fix that with <strong>Amazon Cognito</strong>.</p>
<p>Cognito is AWS's authentication service. It manages a User Pool – a database of registered users with usernames and passwords. When a user logs in, Cognito issues a JWT (JSON Web Token): a cryptographically signed string that proves who the user is. Your API Gateway will check for this token on every request. No valid token means no access.</p>
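<p><strong>What is a JWT?</strong> A JSON Web Token is a string that looks like <code>eyJhbGci...</code>. It contains encoded information about the user and is signed by Cognito with a private key that only Cognito holds. The matching public key is published, so other services can check the signature.</p>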
<p>API Gateway can verify the signature without contacting Cognito on every request, which makes token checking fast. Think of it as a tamper-proof badge: anyone can read the name on it, but only Cognito's signature makes it valid.</p>
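<p>To make the badge analogy concrete, here is a minimal sketch that splits a JWT into its three dot-separated parts and decodes the middle one. It only reads the token; it does not verify the signature, which is API Gateway's job in this tutorial. The function name <code>decodeJwtPayload</code>, the helper <code>encodeSegment</code>, and the sample claims are made up for illustration, and the sketch assumes a Node.js runtime because it uses <code>Buffer</code>.</p>
<pre><code class="language-typescript">// Illustrative only: reads the claims inside a JWT without verifying its signature.
// Never trust decoded claims for access control on their own.
type JwtClaims = { [claim: string]: unknown };

function decodeJwtPayload(token: string): JwtClaims {
  // A JWT is three base64url-encoded segments: header.payload.signature
  const segments = token.split('.');
  if (segments.length !== 3) {
    throw new Error('Expected a token with three dot-separated segments');
  }
  const json = Buffer.from(segments[1], 'base64url').toString('utf8');
  return JSON.parse(json);
}

// Build a fake, unsigned token to show the round trip (hypothetical claims).
function encodeSegment(value: object): string {
  return Buffer.from(JSON.stringify(value)).toString('base64url');
}

const sampleToken = [
  encodeSegment({ alg: 'RS256', typ: 'JWT' }),
  encodeSegment({ email: 'user@example.com', token_use: 'id' }),
  'signature-placeholder',
].join('.');

console.log(decodeJwtPayload(sampleToken).email); // prints "user@example.com"
</code></pre>
<p>Because anyone can decode the payload this way, a JWT should never carry secrets. Its value comes from the signature, not from hiding its contents.</p>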
<h3 id="heading-81-add-cognito-to-the-cdk-stack">8.1 Add Cognito to the CDK Stack</h3>
<p>Open <code>backend/lib/backend-stack.ts</code> and update it to include Cognito. Here is the complete updated file:</p>
<pre><code class="language-typescript">import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // ─── 1. DynamoDB Table ────────────────────────────────────────────────────
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: { name: 'vendorId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // ─── 2. Lambda Functions ──────────────────────────────────────────────────
    const lambdaEnv = { TABLE_NAME: vendorTable.tableName };

    const createVendorLambda = new NodejsFunction(this, 'CreateVendorHandler', {
      entry: 'lambda/createVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const getVendorsLambda = new NodejsFunction(this, 'GetVendorsHandler', {
      entry: 'lambda/getVendors.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const deleteVendorLambda = new NodejsFunction(this, 'DeleteVendorHandler', {
      entry: 'lambda/deleteVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    // ─── 3. Permissions ───────────────────────────────────────────────────────
    vendorTable.grantWriteData(createVendorLambda);
    vendorTable.grantReadData(getVendorsLambda);
    vendorTable.grantWriteData(deleteVendorLambda);

    // ─── 4. Cognito User Pool ─────────────────────────────────────────────────
    const userPool = new cognito.UserPool(this, 'VendorUserPool', {
      selfSignUpEnabled: true,
      signInAliases: { email: true },
      autoVerify: { email: true },
      userVerification: {
        emailStyle: cognito.VerificationEmailStyle.CODE,
      },
    });

    // Adds a Cognito-hosted domain for the hosted UI and OAuth endpoints.
    // The prefix must be unique within the region, hence the account ID suffix.
    userPool.addDomain('VendorUserPoolDomain', {
      cognitoDomain: {
        domainPrefix: `vendor-tracker-${this.account}`,
      },
    });

    const userPoolClient = userPool.addClient('VendorAppClient');

    // ─── 5. API Gateway + Authorizer ──────────────────────────────────────────
    const api = new apigateway.RestApi(this, 'VendorApi', {
      restApiName: 'Vendor Service',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    const authorizer = new apigateway.CognitoUserPoolsAuthorizer(
      this,
      'VendorAuthorizer',
      { cognitoUserPools: [userPool] }
    );

    const authOptions = {
      authorizer,
      authorizationType: apigateway.AuthorizationType.COGNITO,
    };

    const vendors = api.root.addResource('vendors');
    vendors.addMethod('GET', new apigateway.LambdaIntegration(getVendorsLambda), authOptions);
    vendors.addMethod('POST', new apigateway.LambdaIntegration(createVendorLambda), authOptions);
    vendors.addMethod('DELETE', new apigateway.LambdaIntegration(deleteVendorLambda), authOptions);

    // ─── 6. Outputs ───────────────────────────────────────────────────────────
    new cdk.CfnOutput(this, 'ApiEndpoint', { value: api.url });
    new cdk.CfnOutput(this, 'UserPoolId', { value: userPool.userPoolId });
    new cdk.CfnOutput(this, 'UserPoolClientId', { value: userPoolClient.userPoolClientId });
  }
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/c5e91abf-e6af-429f-bf5b-b14d18233f6c.png" alt="The newly created User Pool (VendorUserPool...) in the User Pools list, with the User Pool ID visible" style="display:block;margin:0 auto" width="1100" height="454" loading="lazy">

<p><strong>What changed:</strong></p>
<ul>
<li><p><code>CognitoUserPoolsAuthorizer</code> tells API Gateway to check every request for a valid Cognito JWT before passing it to any Lambda. If the token is missing or invalid, API Gateway rejects the request with a <code>401 Unauthorized</code> response without ever touching your Lambda.</p>
</li>
<li><p><code>authOptions</code> is applied to all three API methods: GET, POST, and DELETE. All routes are now protected.</p>
</li>
<li><p><code>autoVerify: { email: true }</code> tells Cognito to mark the email attribute as verified after a user confirms via the verification code email. It doesn't skip the verification email: users still receive a code. If you want to skip verification during development, you can manually confirm users in the Cognito console (covered in section 8.6).</p>
</li>
<li><p>Two new <code>CfnOutput</code> values (<code>UserPoolId</code> and <code>UserPoolClientId</code>) will appear in your terminal after the next deployment. Your frontend needs them to connect to Cognito.</p>
</li>
</ul>
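<p>Once the updated stack is deployed, one way to confirm the routes are protected is to call the API with no <code>Authorization</code> header and check for a <code>401</code>. The sketch below is a hedged example rather than part of the project: <code>rejectsAnonymousRequests</code> and <code>protectedStub</code> are hypothetical names, the URL is the placeholder from this tutorial, and the fetch implementation is passed in so the check can be tried with a stub. It assumes Node.js 18 or later, where <code>fetch</code> and <code>Response</code> are globals.</p>
<pre><code class="language-typescript">// Hypothetical helper: calls the API with no Authorization header and reports
// whether API Gateway rejected the request.
async function rejectsAnonymousRequests(fetchFn: typeof fetch, apiUrl: string) {
  const response = await fetchFn(`${apiUrl}/vendors`);
  return response.status === 401;
}

// Stub that behaves like a protected endpoint (an assumption for this demo).
async function protectedStub() {
  return new Response(null, { status: 401 });
}

// Against the real API, pass the global fetch and your own ApiEndpoint value.
rejectsAnonymousRequests(protectedStub, 'https://abc123.execute-api.us-east-1.amazonaws.com/prod')
  .then(function (isProtected) {
    console.log(isProtected ? 'Anonymous requests are rejected' : 'API is still open');
  });
</code></pre>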
<p>Deploy the updated stack:</p>
<pre><code class="language-shell">cd backend
cdk deploy
</code></pre>
<p>After deployment, your terminal output will include three values:</p>
<pre><code class="language-plaintext">Outputs:
BackendStack.ApiEndpoint     = https://abc123.execute-api.us-east-1.amazonaws.com/prod/
BackendStack.UserPoolId      = us-east-1_xxxxxxxx
BackendStack.UserPoolClientId = xxxxxxxxxxxxxxxxxxxx
</code></pre>
<p>Save all three values. You'll use them in the next step.</p>
<h3 id="heading-82-install-and-configure-aws-amplify">8.2 Install and Configure AWS Amplify</h3>
<p><strong>AWS Amplify</strong> is a frontend library that handles all the complex authentication logic for you: it manages the login UI, stores tokens in the browser, refreshes expired tokens automatically, and exposes a simple API to read the current user's session.</p>
<p>Install the Amplify libraries inside your <code>frontend</code> folder:</p>
<pre><code class="language-shell">cd frontend
npm install aws-amplify @aws-amplify/ui-react
</code></pre>
<p>Create <code>frontend/app/providers.tsx</code>. This file initializes Amplify with your Cognito configuration. It runs once when the app loads:</p>
<pre><code class="language-typescript">'use client';

import { Amplify } from 'aws-amplify';

Amplify.configure(
  {
    Auth: {
      Cognito: {
        userPoolId: process.env.NEXT_PUBLIC_USER_POOL_ID!,
        userPoolClientId: process.env.NEXT_PUBLIC_USER_POOL_CLIENT_ID!,
      },
    },
  },
  { ssr: true }
);

export function Providers({ children }: { children: React.ReactNode }) {
  return &lt;&gt;{children}&lt;/&gt;;
}
</code></pre>
<p>Add the Cognito IDs to your <code>frontend/.env.local</code> file:</p>
<pre><code class="language-shell">NEXT_PUBLIC_API_URL=https://abc123.execute-api.us-east-1.amazonaws.com/prod
NEXT_PUBLIC_USER_POOL_ID=us-east-1_xxxxxxxx
NEXT_PUBLIC_USER_POOL_CLIENT_ID=xxxxxxxxxxxxxxxxxxxx
</code></pre>
<p>Replace the values with the outputs from your <code>cdk deploy</code>.</p>
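<p>If you would rather not copy the values by hand, <code>cdk deploy --outputs-file outputs.json</code> writes the stack outputs to a JSON file keyed by stack name. The following sketch shows how the <code>BackendStack</code> entry of that file could be turned into the three lines above. The function name <code>toEnvLines</code> is hypothetical, and the sample object reuses the placeholder values from this tutorial.</p>
<pre><code class="language-typescript">// Sketch: convert one stack's outputs into the lines frontend/.env.local expects.
type StackOutputs = { [outputName: string]: string };

function toEnvLines(outputs: StackOutputs): string {
  // Drop the trailing slash so `${BASE_URL}/vendors` does not contain "//"
  const apiUrl = outputs.ApiEndpoint.replace(/\/$/, '');
  return [
    `NEXT_PUBLIC_API_URL=${apiUrl}`,
    `NEXT_PUBLIC_USER_POOL_ID=${outputs.UserPoolId}`,
    `NEXT_PUBLIC_USER_POOL_CLIENT_ID=${outputs.UserPoolClientId}`,
  ].join('\n');
}

// Placeholder values copied from the sample deployment output.
const sampleOutputs: StackOutputs = {
  ApiEndpoint: 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/',
  UserPoolId: 'us-east-1_xxxxxxxx',
  UserPoolClientId: 'xxxxxxxxxxxxxxxxxxxx',
};

console.log(toEnvLines(sampleOutputs));
</code></pre>
<p>Note that the deployed <code>ApiEndpoint</code> ends with a slash while the value in <code>.env.local</code> does not, which is why the sketch trims it.</p>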
<h3 id="heading-83-wire-providers-into-the-app-layout">8.3 Wire Providers into the App Layout</h3>
<p><strong>This step is critical.</strong> Amplify must be initialized before any component tries to use authentication. If you skip this step, <code>fetchAuthSession()</code> will throw an "Amplify not configured" error and nothing will work.</p>
<p>Open <code>frontend/app/layout.tsx</code> and update it to wrap the app in the <code>Providers</code> component:</p>
<pre><code class="language-typescript">import type { Metadata } from 'next';
import './globals.css';
import { Providers } from './providers';

export const metadata: Metadata = {
  title: 'Vendor Tracker',
  description: 'Manage your vendors with AWS',
};

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    &lt;html lang="en"&gt;
      &lt;body&gt;
        &lt;Providers&gt;{children}&lt;/Providers&gt;
      &lt;/body&gt;
    &lt;/html&gt;
  );
}
</code></pre>
<p>By wrapping <code>{children}</code> in <code>&lt;Providers&gt;</code>, you ensure that Amplify is configured once at the root of the app, before any child page or component renders.</p>
<h3 id="heading-84-protect-the-ui-with-withauthenticator">8.4 Protect the UI with withAuthenticator</h3>
<p>Now wrap your <code>Home</code> component so that unauthenticated users see a login screen instead of the dashboard.</p>
<p>Replace the contents of <code>frontend/app/page.tsx</code> with this updated version:</p>
<pre><code class="language-typescript">'use client';

import { useState, useEffect } from 'react';
import { withAuthenticator } from '@aws-amplify/ui-react';
import '@aws-amplify/ui-react/styles.css';
import { getVendors, createVendor, deleteVendor } from '@/lib/api';
import { Vendor } from '@/types/vendor';

// withAuthenticator injects `signOut` and `user` as props automatically
function Home({ signOut, user }: { signOut?: () =&gt; void; user?: any }) {
  const [vendors, setVendors] = useState&lt;Vendor[]&gt;([]);
  const [form, setForm] = useState({ name: '', category: '', contactEmail: '' });
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState('');

  const loadVendors = async () =&gt; {
    try {
      const data = await getVendors();
      setVendors(data);
    } catch {
      setError('Failed to load vendors.');
    }
  };

  useEffect(() =&gt; {
    loadVendors();
  }, []);

  const handleSubmit = async (e: React.FormEvent) =&gt; {
    e.preventDefault();
    setLoading(true);
    setError('');
    try {
      await createVendor(form);
      setForm({ name: '', category: '', contactEmail: '' });
      await loadVendors();
    } catch {
      setError('Failed to add vendor.');
    } finally {
      setLoading(false);
    }
  };

  const handleDelete = async (vendorId: string) =&gt; {
    try {
      await deleteVendor(vendorId);
      await loadVendors();
    } catch {
      setError('Failed to delete vendor.');
    }
  };

  return (
    &lt;main className="p-10 max-w-5xl mx-auto"&gt;
      {/* ── Header ── */}
      &lt;header className="flex justify-between items-center mb-8 p-4 bg-gray-100 rounded"&gt;
        &lt;div&gt;
          &lt;h1 className="text-xl font-bold text-gray-900"&gt;Vendor Tracker&lt;/h1&gt;
          &lt;p className="text-sm text-gray-500"&gt;Signed in as: {user?.signInDetails?.loginId}&lt;/p&gt;
        &lt;/div&gt;
        &lt;button
          onClick={signOut}
          className="bg-red-500 text-white px-4 py-2 rounded hover:bg-red-600 transition-colors"
        &gt;
          Sign Out
        &lt;/button&gt;
      &lt;/header&gt;

      {error &amp;&amp; (
        &lt;div className="mb-4 p-3 bg-red-100 text-red-700 rounded"&gt;{error}&lt;/div&gt;
      )}

      &lt;div className="grid grid-cols-1 md:grid-cols-2 gap-10"&gt;

        {/* ── Add Vendor Form ── */}
        &lt;section&gt;
          &lt;h2 className="text-xl font-semibold mb-4 text-gray-800"&gt;Add New Vendor&lt;/h2&gt;
          &lt;form onSubmit={handleSubmit} className="space-y-4"&gt;
            &lt;input
              className="w-full p-2 border rounded text-black"
              placeholder="Vendor Name"
              value={form.name}
              onChange={e =&gt; setForm({ ...form, name: e.target.value })}
              required
            /&gt;
            &lt;input
              className="w-full p-2 border rounded text-black"
              placeholder="Category (e.g. SaaS, Hardware)"
              value={form.category}
              onChange={e =&gt; setForm({ ...form, category: e.target.value })}
              required
            /&gt;
            &lt;input
              className="w-full p-2 border rounded text-black"
              placeholder="Contact Email"
              type="email"
              value={form.contactEmail}
              onChange={e =&gt; setForm({ ...form, contactEmail: e.target.value })}
              required
            /&gt;
            &lt;button
              type="submit"
              disabled={loading}
              className="w-full bg-orange-500 text-white p-2 rounded hover:bg-orange-600 disabled:bg-gray-400"
            &gt;
              {loading ? 'Saving...' : 'Add Vendor'}
            &lt;/button&gt;
          &lt;/form&gt;
        &lt;/section&gt;

        {/* ── Vendor List ── */}
        &lt;section&gt;
          &lt;h2 className="text-xl font-semibold mb-4 text-gray-800"&gt;
            Current Vendors ({vendors.length})
          &lt;/h2&gt;
          &lt;div className="space-y-3"&gt;
            {vendors.length === 0 ? (
              &lt;p className="text-gray-400 italic"&gt;No vendors yet.&lt;/p&gt;
            ) : (
              vendors.map(v =&gt; (
                &lt;div
                  key={v.vendorId}
                  className="p-4 border rounded shadow-sm bg-white flex justify-between items-start"
                &gt;
                  &lt;div&gt;
                    &lt;p className="font-semibold text-gray-900"&gt;{v.name}&lt;/p&gt;
                    &lt;p className="text-sm text-gray-500"&gt;{v.category} · {v.contactEmail}&lt;/p&gt;
                  &lt;/div&gt;
                  &lt;button
                    onClick={() =&gt; v.vendorId &amp;&amp; handleDelete(v.vendorId)}
                    className="ml-4 text-sm text-red-500 hover:text-red-700 hover:underline"
                  &gt;
                    Delete
                  &lt;/button&gt;
                &lt;/div&gt;
              ))
            )}
          &lt;/div&gt;
        &lt;/section&gt;

      &lt;/div&gt;
    &lt;/main&gt;
  );
}

// Wrapping Home with withAuthenticator means any user who is not logged in
// will see Amplify's built-in login/signup screen instead of this component.
export default withAuthenticator(Home);
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/e65a88dc-ea75-4daa-b7cf-eac3406c8060.png" alt="Amplify-generated login screen" style="display:block;margin:0 auto" width="1639" height="805" loading="lazy">

<h3 id="heading-85-pass-the-auth-token-to-api-calls">8.5 Pass the Auth Token to API Calls</h3>
<p>Now that API Gateway requires a JWT on every request, your <code>fetch</code> calls need to include the token in the <code>Authorization</code> header. Without it, every request will return a <code>401 Unauthorized</code> error.</p>
<p>Update <code>frontend/lib/api.ts</code> with a token helper and updated fetch calls:</p>
<pre><code class="language-typescript">import { fetchAuthSession } from 'aws-amplify/auth';
import { Vendor } from '@/types/vendor';

const BASE_URL = process.env.NEXT_PUBLIC_API_URL!;

// Retrieves the current user's JWT token from the active Amplify session
const getAuthToken = async (): Promise&lt;string&gt; =&gt; {
  const session = await fetchAuthSession();
  const token = session.tokens?.idToken?.toString();
  if (!token) throw new Error('No active session. Please sign in.');
  return token;
};

export const getVendors = async (): Promise&lt;Vendor[]&gt; =&gt; {
  const token = await getAuthToken();
  const response = await fetch(`${BASE_URL}/vendors`, {
    headers: { Authorization: token },
  });
  if (!response.ok) throw new Error('Failed to fetch vendors');
  return response.json();
};

export const createVendor = async (
  vendor: Omit&lt;Vendor, 'vendorId' | 'createdAt'&gt;
): Promise&lt;void&gt; =&gt; {
  const token = await getAuthToken();
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: token,
    },
    body: JSON.stringify(vendor),
  });
  if (!response.ok) throw new Error('Failed to create vendor');
};

export const deleteVendor = async (vendorId: string): Promise&lt;void&gt; =&gt; {
  const token = await getAuthToken();
  const response = await fetch(`${BASE_URL}/vendors`, {
    method: 'DELETE',
    headers: {
      'Content-Type': 'application/json',
      Authorization: token,
    },
    body: JSON.stringify({ vendorId }),
  });
  if (!response.ok) throw new Error('Failed to delete vendor');
};
</code></pre>
<p><strong>What</strong> <code>getAuthToken</code> <strong>does:</strong></p>
<p><code>fetchAuthSession()</code> reads the currently logged-in user's session from the browser. After sign-in, Amplify keeps the tokens in browser storage: <code>localStorage</code> by default, or cookies when <code>ssr: true</code> is passed to <code>Amplify.configure</code>, as it is in <code>providers.tsx</code>.</p>
<p><code>session.tokens?.idToken</code> holds the ID token, and calling <code>.toString()</code> on it yields the JWT string that API Gateway's Cognito Authorizer is looking for. Passing it as the <code>Authorization</code> header tells API Gateway: "This request is from an authenticated user."</p>
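<p>The three API functions attach the token in the same way, so the header logic could live in one place. The sketch below is an optional refactor, not the tutorial's code: <code>buildAuthorizedInit</code> and the <code>AuthorizedInit</code> type are hypothetical, and the token and vendor ID are placeholders.</p>
<pre><code class="language-typescript">// Hypothetical helper: builds the options object passed to fetch so the
// Authorization header is attached the same way for GET, POST, and DELETE.
type AuthorizedInit = {
  method: string;
  headers: { [name: string]: string };
  body?: string;
};

function buildAuthorizedInit(token: string, method: string, payload?: unknown): AuthorizedInit {
  const init: AuthorizedInit = { method, headers: { Authorization: token } };
  if (payload !== undefined) {
    // Only requests that carry a JSON body need a Content-Type header
    init.headers['Content-Type'] = 'application/json';
    init.body = JSON.stringify(payload);
  }
  return init;
}

// Placeholder token and vendor ID, for illustration only.
const deleteInit = buildAuthorizedInit('eyJhbGci...', 'DELETE', { vendorId: 'v-123' });
console.log(deleteInit);
</code></pre>
<p>Inside <code>api.ts</code>, each function could then call <code>fetch(url, buildAuthorizedInit(await getAuthToken(), 'POST', vendor))</code> instead of repeating the headers.</p>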
<h3 id="heading-86-troubleshooting-cognito">8.6 Troubleshooting Cognito</h3>
<h4 id="heading-unconfirmed-user-error-after-sign-up">"Unconfirmed" user error after sign-up</h4>
<p>When a new user signs up through the Amplify UI, Cognito marks the account as <em>Unconfirmed</em> until the user verifies their email address. A verification code is sent to the user's email. After entering the code, the account becomes confirmed and the user can log in.</p>
<p>If you are testing locally and want to skip the email step, you can manually confirm any account in the AWS Console:</p>
<ol>
<li><p>Open the AWS Console and navigate to Cognito</p>
</li>
<li><p>Click on your User Pool (<code>VendorUserPool...</code>)</p>
</li>
<li><p>Click the Users tab</p>
</li>
<li><p>Click on the user's email address</p>
</li>
<li><p>Open the Actions dropdown and click Confirm account</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/158fb773-9cb1-4c14-9fd7-49e4369ba7e3.png" alt="Cognito Users list showing a user with &quot;Unconfirmed&quot; status" style="display:block;margin:0 auto" width="1100" height="190" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/5637ac80-ee0c-4fdf-93cf-d4b7d71f6a65.png" alt="Cognito Users list showing a user with &quot;Unconfirmed&quot; status" style="display:block;margin:0 auto" width="1100" height="442" loading="lazy">

<h4 id="heading-401-unauthorized-errors-after-deployment">401 Unauthorized errors after deployment</h4>
<p>If you are getting 401 errors, check two things:</p>
<ol>
<li><p>Open Chrome DevTools --&gt; Network tab, click the failing request, and look at the <strong>Request Headers</strong>. You should see an <code>Authorization</code> header with a long string of characters. If it is missing, <code>getAuthToken</code> is failing. Check that Amplify is configured correctly in <code>providers.tsx</code> and wired in via <code>layout.tsx</code>.</p>
</li>
<li><p>In your CDK stack, confirm that <code>authOptions</code> (the authorizer plus <code>authorizationType: apigateway.AuthorizationType.COGNITO</code>) is passed to every protected method, and that the authorizer references the same User Pool whose IDs are in <code>.env.local</code>. A token that is expired or was issued by a different User Pool is rejected with a 401.</p>
</li>
</ol>
<h2 id="heading-part-9-deploy-the-frontend-with-s3-and-cloudfront">Part 9: Deploy the Frontend with S3 and CloudFront</h2>
<p>Your app works locally. Now you'll deploy it to a real HTTPS URL that anyone in the world can visit.</p>
<p><strong>The strategy:</strong> Next.js will export your React app as a set of static HTML, CSS, and JavaScript files. Those files will be uploaded to an <strong>S3 bucket</strong> (AWS's file storage service). <strong>CloudFront</strong> sits in front of the bucket as a Content Delivery Network (CDN), distributing your files to servers around the world and serving them over HTTPS.</p>
<h3 id="heading-91-configure-nextjs-for-static-export">9.1 Configure Next.js for Static Export</h3>
<p>Open <code>frontend/next.config.js</code> (or <code>next.config.mjs</code>) and add the <code>output: 'export'</code> setting:</p>
<pre><code class="language-javascript">/** @type {import('next').NextConfig} */
const nextConfig = {
  output: 'export', // Generates a static /out folder instead of a Node.js server
};

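// ES module syntax, as used in next.config.mjs.
// In a CommonJS next.config.js, use `module.exports = nextConfig;` instead.
export default nextConfig;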
</code></pre>
<p><strong>Note on <code>'use client'</code> and static export</strong>: When <code>output: 'export'</code> is set, Next.js pre-renders every page to HTML at build time. Any component that relies on client-side features – like <code>withAuthenticator</code> from Amplify – must have <code>'use client'</code> at the top of the file. This marks it as a Client Component, so its hooks and event handlers run in the browser after the page loads.</p>
<p>You already have <code>'use client'</code> in <code>page.tsx</code>. If you ever see a build error mentioning <code>window is not defined</code> or similar, check that the relevant component has <code>'use client'</code> at the top and that any browser-only code runs inside <code>useEffect</code> or an event handler.</p>
<p>Build the frontend:</p>
<pre><code class="language-shell">cd frontend
npm run build
</code></pre>
<p>This generates an <code>/out</code> folder containing your complete website as static files. Verify the folder was created:</p>
<pre><code class="language-shell">ls out
# You should see: index.html, _next/, etc.
</code></pre>
<h3 id="heading-92-add-s3-and-cloudfront-to-the-cdk-stack">9.2 Add S3 and CloudFront to the CDK Stack</h3>
<p>Open <code>backend/lib/backend-stack.ts</code> and add the hosting infrastructure. Here's the complete final version of the file:</p>
<pre><code class="language-typescript">import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

export class BackendStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. DynamoDB Table 
    const vendorTable = new dynamodb.Table(this, 'VendorTable', {
      partitionKey: { name: 'vendorId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // 2. Lambda Functions
    const lambdaEnv = { TABLE_NAME: vendorTable.tableName };

    const createVendorLambda = new NodejsFunction(this, 'CreateVendorHandler', {
      entry: 'lambda/createVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const getVendorsLambda = new NodejsFunction(this, 'GetVendorsHandler', {
      entry: 'lambda/getVendors.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    const deleteVendorLambda = new NodejsFunction(this, 'DeleteVendorHandler', {
      entry: 'lambda/deleteVendor.ts',
      handler: 'handler',
      environment: lambdaEnv,
    });

    // 3. Permissions
    vendorTable.grantWriteData(createVendorLambda);
    vendorTable.grantReadData(getVendorsLambda);
    vendorTable.grantWriteData(deleteVendorLambda);

    // 4. Cognito User Pool
    const userPool = new cognito.UserPool(this, 'VendorUserPool', {
      selfSignUpEnabled: true,
      signInAliases: { email: true },
      autoVerify: { email: true },
      userVerification: {
        emailStyle: cognito.VerificationEmailStyle.CODE,
      },
    });

    userPool.addDomain('VendorUserPoolDomain', {
      cognitoDomain: { domainPrefix: `vendor-tracker-${this.account}` },
    });

    const userPoolClient = userPool.addClient('VendorAppClient');

    // 5. API Gateway + Authorizer
    const api = new apigateway.RestApi(this, 'VendorApi', {
      restApiName: 'Vendor Service',
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    const authorizer = new apigateway.CognitoUserPoolsAuthorizer(
      this,
      'VendorAuthorizer',
      { cognitoUserPools: [userPool] }
    );

    const authOptions = {
      authorizer,
      authorizationType: apigateway.AuthorizationType.COGNITO,
    };

    const vendors = api.root.addResource('vendors');
    vendors.addMethod('GET', new apigateway.LambdaIntegration(getVendorsLambda), authOptions);
    vendors.addMethod('POST', new apigateway.LambdaIntegration(createVendorLambda), authOptions);
    vendors.addMethod('DELETE', new apigateway.LambdaIntegration(deleteVendorLambda), authOptions);

    // 6. S3 Bucket (Frontend Files) 
    const siteBucket = new s3.Bucket(this, 'VendorSiteBucket', {
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      autoDeleteObjects: true,
    });

    // 7. CloudFront Distribution (HTTPS + CDN)
    const distribution = new cloudfront.Distribution(this, 'SiteDistribution', {
      defaultBehavior: {
        origin: new origins.S3Origin(siteBucket),
        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
      },
      defaultRootObject: 'index.html',
      errorResponses: [
        {
          // Redirect all 404s back to index.html so React can handle routing
          httpStatus: 404,
          responseHttpStatus: 200,
          responsePagePath: '/index.html',
        },
      ],
    });

    // 8. Deploy Frontend Files to S3 
    new s3deploy.BucketDeployment(this, 'DeployWebsite', {
      sources: [s3deploy.Source.asset('../frontend/out')],
      destinationBucket: siteBucket,
      distribution,
      distributionPaths: ['/*'], // Clears CloudFront cache on every deploy
    });

    // 9. Outputs
    new cdk.CfnOutput(this, 'ApiEndpoint', { value: api.url });
    new cdk.CfnOutput(this, 'UserPoolId', { value: userPool.userPoolId });
    new cdk.CfnOutput(this, 'UserPoolClientId', { value: userPoolClient.userPoolClientId });
    new cdk.CfnOutput(this, 'CloudFrontURL', {
      value: `https://${distribution.distributionDomainName}`,
    });
  }
}
</code></pre>
<p><strong>What the hosting infrastructure does:</strong></p>
<ul>
<li><p>The <strong>S3 bucket</strong> stores your static HTML, CSS, and JavaScript files. It is private – users cannot access it directly.</p>
</li>
<li><p><strong>CloudFront</strong> is the CDN that sits in front of S3. It gives you an HTTPS URL and caches your files at edge locations worldwide, so the app loads fast no matter where users are located. <code>REDIRECT_TO_HTTPS</code> automatically upgrades any HTTP request to HTTPS.</p>
</li>
<li><p>The <strong>error response</strong> for 404 returns <code>index.html</code> instead of an error page. This is necessary for single-page apps: if a user navigates directly to a route like <code>/vendors/123</code>, CloudFront cannot find a file at that path, but sending back <code>index.html</code> lets the React app handle the routing correctly.</p>
</li>
<li><p><code>distributionPaths: ['/*']</code> tells CloudFront to invalidate its entire cache after every deployment. This ensures users always see the latest version of your app immediately.</p>
</li>
<li><p><code>BucketDeployment</code> is a CDK construct that automatically uploads the contents of your <code>frontend/out</code> folder to the S3 bucket every time you run <code>cdk deploy</code>.</p>
</li>
</ul>
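<p>One caveat, stated as an assumption about this setup rather than something verified here: when the bucket is private and CloudFront is only allowed to read objects, S3 may answer a request for a missing file with <code>403</code> instead of <code>404</code>. If a deep link returns an "Access Denied" page, mapping both status codes to <code>index.html</code> is a common workaround. The sketch below builds such a list in plain TypeScript. The local <code>SpaErrorResponse</code> type mirrors the fields used in the <code>errorResponses</code> array above, and <code>buildSpaFallbacks</code> is a hypothetical helper.</p>
<pre><code class="language-typescript">// Local type mirroring the fields used in the stack's errorResponses entries.
type SpaErrorResponse = {
  httpStatus: number;
  responseHttpStatus: number;
  responsePagePath: string;
};

// Hypothetical helper: one fallback entry per origin status code.
function buildSpaFallbacks(statuses: number[]): SpaErrorResponse[] {
  return statuses.map(function (httpStatus) {
    return { httpStatus, responseHttpStatus: 200, responsePagePath: '/index.html' };
  });
}

// 403 covers a missing object in a private bucket; 404 covers a plain not-found.
const spaErrorResponses = buildSpaFallbacks([403, 404]);
console.log(spaErrorResponses);
</code></pre>
<p>In the stack, the resulting array would take the place of the single-entry <code>errorResponses</code> list on the distribution.</p>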
<h3 id="heading-93-run-the-final-deployment">9.3 Run the Final Deployment</h3>
<p>First, build the frontend with the latest environment variables:</p>
<pre><code class="language-shell">cd frontend
npm run build
</code></pre>
<p>Then deploy everything from the backend folder:</p>
<pre><code class="language-shell">cd ../backend
cdk deploy
</code></pre>
<p>After deployment finishes, copy the <code>CloudFrontURL</code> from the terminal output:</p>
<pre><code class="language-plaintext">Outputs:
BackendStack.CloudFrontURL = https://d1234abcd.cloudfront.net
</code></pre>
<p>Open that URL in your browser. Your app is now live on the internet, served over HTTPS, globally distributed.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62d53ab5bc2c7a1dc672b04f/f8e14979-a667-4afc-bdd4-9afe4abd9593.png" alt="Screenshot of the app after the final deployment" style="display:block;margin:0 auto" width="1686" height="804" loading="lazy">

<h2 id="heading-what-you-built">What You Built</h2>
<p>You now have a fully deployed, production-style full-stack application. Here is a summary of every piece you built and what it does:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Service</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td>Frontend</td>
<td>Next.js + CloudFront</td>
<td>React UI served globally over HTTPS</td>
</tr>
<tr>
<td>Auth</td>
<td>Amazon Cognito + Amplify</td>
<td>User sign-up, login, and JWT token management</td>
</tr>
<tr>
<td>API</td>
<td>API Gateway</td>
<td>Routes HTTP requests, validates auth tokens</td>
</tr>
<tr>
<td>Logic</td>
<td>AWS Lambda (×3)</td>
<td>Creates, reads, and deletes vendors on demand</td>
</tr>
<tr>
<td>Database</td>
<td>DynamoDB</td>
<td>Stores vendor records with no idle cost</td>
</tr>
<tr>
<td>Storage</td>
<td>S3</td>
<td>Holds your built frontend files</td>
</tr>
<tr>
<td>Infrastructure</td>
<td>AWS CDK</td>
<td>Defines and deploys all of the above as code</td>
</tr>
</tbody></table>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You have built and deployed the foundational pattern of almost every cloud application: a secured API backed by a database, deployed with infrastructure as code. Here is everything you accomplished:</p>
<p>You set up a professional AWS development environment with scoped IAM credentials. You defined your entire backend infrastructure as TypeScript code using AWS CDK, which means your database, API, Lambda functions, and authentication system are all version-controlled, repeatable, and deployable with a single command.</p>
<p>You wrote three Lambda functions that handle create, read, and delete operations, each with proper error handling and the correct AWS SDK v3 patterns. You connected them to a REST API through API Gateway and protected every route with Amazon Cognito authentication, so only registered, verified users can interact with your data.</p>
<p>On the frontend, you built a Next.js application with a service layer that cleanly separates API logic from UI components, manages JWTs automatically through AWS Amplify, and gives users a complete sign-up and sign-in flow without you writing a single line of authentication UI code.</p>
<p>Finally, you deployed the entire system: your backend to AWS Lambda and DynamoDB, and your frontend as a static site served globally through CloudFront over HTTPS.</p>
<p>The full source code for this tutorial is available on <a href="https://github.com/BenedictaUche/vendor-tracker">GitHub</a>. Clone it, modify it, and use it as a reference for your own projects.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Serverless RAG Pipeline on AWS That Scales to Zero ]]>
                </title>
                <description>
                    <![CDATA[ Most RAG tutorials end the same way: you've got a working prototype and a bill for a vector database that runs whether anyone's querying it or not. Add an always-on embedding service, a hosted LLM end ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-serverless-rag-pipeline-on-aws-that-scales-to-zero/</link>
                <guid isPermaLink="false">69b1b23c6c896b0519b4eda8</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Christopher Galliart ]]>
                </dc:creator>
                <pubDate>Wed, 11 Mar 2026 18:19:40 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/c0416d9e-9661-47a3-ba9c-8001f5f91b8c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most RAG tutorials end the same way: you've got a working prototype and a bill for a vector database that runs whether anyone's querying it or not. Add an always-on embedding service, a hosted LLM endpoint, and the usual AWS infrastructure, and you're looking at real money before a single user shows up.</p>
<p>But it doesn't have to work that way. In this tutorial, you'll deploy a fully serverless RAG pipeline that processes documents, images, video, and audio, then scales to zero when nobody's using it.</p>
<p>Everything runs in your AWS account, your data never leaves your infrastructure, and your ongoing monthly cost for a modest knowledge base will be closer to <code>2-3 USD</code> than <code>300 USD</code>.</p>
<p>We'll use <a href="https://github.com/HatmanStack/RAGStack-Lambda">RAGStack-Lambda</a>, an open-source project I built on AWS. By the end, you'll have a deployed pipeline with a dashboard, an AI chat interface with source citations, a drop-in web component you can embed in any app, and an MCP server you can use to feed your assistant context.</p>
<h3 id="heading-heres-what-well-cover">Here's what we'll cover:</h3>
<ul>
<li><p><a href="#heading-what-this-actually-costs">What This Actually Costs</a></p>
</li>
<li><p><a href="#heading-what-youre-building">What You're Building</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-deploying-from-aws-marketplace">Deploying from AWS Marketplace</a></p>
</li>
<li><p><a href="#heading-deploying-from-source">Deploying from Source</a></p>
</li>
<li><p><a href="#heading-uploading-your-first-documents">Uploading Your First Documents</a></p>
</li>
<li><p><a href="#heading-chatting-with-your-knowledge-base">Chatting With Your Knowledge Base</a></p>
</li>
<li><p><a href="#heading-embedding-the-web-component-in-your-app">Embedding the Web Component in Your App</a></p>
</li>
<li><p><a href="#heading-using-the-mcp-server">Using the MCP Server</a></p>
</li>
<li><p><a href="#heading-what-you-can-build-from-here">What You Can Build From Here</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-this-actually-costs">What This Actually Costs</h2>
<p>Before we build anything, let's talk money, because the cost story is the whole point.</p>
<p>RAG pipelines have two cost phases: ingestion (processing your documents once) and operation (querying them over time).</p>
<p>Most platforms charge you a flat monthly rate regardless of which phase you're in. A serverless architecture flips that: ingestion costs something, and then everything scales to zero.</p>
<h3 id="heading-ingestion-the-one-time-hit">Ingestion: The One-Time Hit</h3>
<p>When you upload documents, several things happen: text extraction (OCR for PDFs and images), embedding generation, metadata extraction, and storage. Here's what that actually costs per service:</p>
<p><strong>Textract (OCR):</strong> This is the most expensive part of ingestion, and it only applies to scanned PDFs and images that need text extraction. Plain text, HTML, CSV, and other text-based formats skip this entirely.</p>
<p>Textract charges about <code>1.50 USD</code> per 1,000 pages for standard text detection. If you're uploading 500 pages of scanned PDFs, that's about <code>0.75 USD</code>. A heavy initial load of several thousand scanned pages might run <code>5-10 USD</code>. But once your documents are processed, you never pay this again unless you add new ones.</p>
<p><strong>Bedrock Embeddings (Nova Multimodal):</strong> This is where your content gets converted into vectors for semantic search. The pricing is almost comically cheap:</p>
<ul>
<li><p>Text: <code>0.00002 USD</code> per 1,000 input tokens</p>
</li>
<li><p>Images: <code>0.00115 USD</code> per image</p>
</li>
<li><p>Video/Audio: <code>0.00200 USD</code> per minute</p>
</li>
</ul>
<p>To put that in perspective: if you have 1,500 text documents averaging 2,500 tokens each after chunking, your total embedding cost is about <code>0.08 USD</code>. A knowledge base with 500 images runs <code>0.58 USD</code>. Even a mixed corpus of text, images, and a few hours of video stays well under <code>2 USD</code> for the entire embedding pass. This is a one-time cost – you only re-embed if you add or update documents.</p>
<p><strong>Bedrock LLM (Metadata Extraction):</strong> RAGStack uses an LLM to analyze each document and extract structured metadata automatically. This is a few inference calls per document using Nova Lite or a similar model. At <code>0.06 USD</code>/<code>0.24 USD</code> per million input/output tokens, processing 1,500 documents costs well under <code>1 USD</code>.</p>
<p><strong>S3 Vectors (Storage):</strong> Storing your embeddings. At <code>0.06 USD</code> per GB/month, a knowledge base of 1,500 documents with 1,024-dimension vectors takes up a trivially small amount of space. We're talking pennies per month.</p>
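<p>To see why vector storage rounds to pennies, here is a rough sizing sketch. It assumes each vector is stored as 32-bit floats, that each document produces about five chunks, and that 1 GB is 10^9 bytes. None of those figures come from RAGStack itself, and metadata overhead is ignored.</p>
<pre><code class="language-python">BYTES_PER_FLOAT32 = 4
S3_VECTORS_USD_PER_GB_MONTH = 0.06  # storage rate quoted in this section


def vector_storage_cost(num_vectors, dimensions=1024):
    """Return (raw_bytes, monthly_usd) for float32 vectors, ignoring metadata overhead."""
    raw_bytes = num_vectors * dimensions * BYTES_PER_FLOAT32
    monthly_usd = raw_bytes / 1e9 * S3_VECTORS_USD_PER_GB_MONTH
    return raw_bytes, monthly_usd


# Assumption: 1,500 documents split into roughly 5 chunks each.
size_bytes, usd = vector_storage_cost(num_vectors=1500 * 5)
print(f"{size_bytes / 1e6:.1f} MB of raw vectors, about {usd:.4f} USD per month")
</code></pre>
<p>Under these assumptions the raw vectors come to roughly 30 MB, so even a tenfold increase in chunk count stays well below a cent per month in storage.</p>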
<p><strong>S3 (Document Storage):</strong> Your source documents in standard S3. Even cheaper, <code>0.023 USD</code> per GB/month.</p>
<p><strong>DynamoDB:</strong> Stores document metadata and processing state. The on-demand pricing model means you pay per request during ingestion, then essentially nothing at rest. A few cents for the initial load.</p>
<p>To put real numbers on it: if you upload 200 text documents (PDFs, HTML, markdown), your total ingestion cost is likely under <code>1 USD</code>. If you upload 1,000 scanned PDFs that need OCR, you might see <code>5-8 USD</code> as a one-time hit. If you've seen a <code>7-10 USD</code> figure quoted for ingestion, that's the upper end for a heavy initial load with lots of OCR work.</p>
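<p>If you want to plug in your own corpus, the arithmetic in this section can be sketched as a small estimator. The rates are the ones quoted above, so treat them as illustrative rather than current AWS pricing. The function name and parameters are made up for this example, and it leaves out the LLM metadata pass, Transcribe, and storage.</p>
<pre><code class="language-python"># Rates in USD, copied from this section. They may drift from current AWS pricing.
TEXTRACT_PER_1K_PAGES = 1.50
EMBED_TEXT_PER_1K_TOKENS = 0.00002
EMBED_PER_IMAGE = 0.00115
EMBED_PER_MEDIA_MINUTE = 0.00200


def ingestion_cost(scanned_pages=0, text_tokens=0, images=0, media_minutes=0.0):
    """Return a per-service cost breakdown (USD) for a one-time ingestion run."""
    breakdown = {
        "textract": scanned_pages / 1000 * TEXTRACT_PER_1K_PAGES,
        "text_embeddings": text_tokens / 1000 * EMBED_TEXT_PER_1K_TOKENS,
        "image_embeddings": images * EMBED_PER_IMAGE,
        "media_embeddings": media_minutes * EMBED_PER_MEDIA_MINUTE,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# 500 scanned pages, 1,500 documents at 2,500 tokens each, and 500 images.
estimate = ingestion_cost(scanned_pages=500, text_tokens=1500 * 2500, images=500)
for service, usd in estimate.items():
    print(f"{service:18} {usd:.4f} USD")
</code></pre>
<p>With those inputs, OCR accounts for more than half of the total, which matches the point above: embeddings are cheap, and scanned pages are what move the number.</p>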
<h3 id="heading-operation-where-scale-to-zero-shines">Operation: Where Scale-to-Zero Shines</h3>
<p>Once your documents are ingested, the pipeline is waiting. Not running. Waiting. Here's what each query costs:</p>
<p><strong>Lambda:</strong> Invocations are billed per request and duration. The free tier covers 1 million requests/month. For a personal or small-team knowledge base, you may never leave the free tier.</p>
<p><strong>S3 Vectors (Queries):</strong> <code>2.50 USD</code> per million query API calls, plus a per-TB data processing charge. For a small index queried a few hundred times a month, this rounds to effectively zero.</p>
<p><strong>Bedrock (Chat Inference):</strong> This is your main operating cost. Each chat response requires an LLM call. Using Nova Lite at <code>0.06 USD</code> per million input tokens and <code>0.24 USD</code> per million output tokens, a typical RAG query (retrieval context + user question + response) might cost <code>0.001-0.003 USD</code> per query. A hundred queries a month is <code>0.10-0.30 USD</code>.</p>
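<p>The per-query figure is just token arithmetic. Here is a minimal sketch using the Nova Lite rates quoted above. The token counts are assumptions chosen for illustration: your real numbers depend on how many chunks are retrieved and how long the answers run.</p>
<pre><code class="language-python">NOVA_LITE_INPUT_USD_PER_MILLION = 0.06
NOVA_LITE_OUTPUT_USD_PER_MILLION = 0.24


def query_cost(input_tokens, output_tokens):
    """Return the USD cost of one chat call at the rates quoted in this section."""
    input_usd = input_tokens * NOVA_LITE_INPUT_USD_PER_MILLION
    output_usd = output_tokens * NOVA_LITE_OUTPUT_USD_PER_MILLION
    return (input_usd + output_usd) / 1_000_000


# Assumption: 12,000 tokens of retrieved context plus question, 1,200 tokens of answer.
per_query = query_cost(input_tokens=12_000, output_tokens=1_200)
monthly = per_query * 100  # a hundred queries per month
print(f"per query: {per_query:.6f} USD, per 100 queries: {monthly:.4f} USD")
</code></pre>
<p>Smaller retrieval contexts push the per-query cost lower still, which is why chat inference stays cheap even though it is the largest line item.</p>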
<p><strong>Step Functions:</strong> Orchestrates the document processing pipeline. Standard workflows charge <code>0.025 USD</code> per 1,000 state transitions. Minimal during operation since it's only active during ingestion.</p>
<p><strong>Cognito:</strong> User authentication. Free for the first 10,000 monthly active users.</p>
<p><strong>CloudFront:</strong> Serves the dashboard UI. Free tier covers 1 TB of data transfer per month.</p>
<p><strong>API Gateway:</strong> Handles GraphQL API requests. Free tier covers 1 million API calls per month.</p>
<p>Add it all up for a knowledge base with 500 documents getting a few hundred queries per month, and your monthly operating cost is somewhere between <code>0.50 USD</code> and <code>3.00 USD</code>. Most of that is the LLM inference for chat responses.</p>
<h3 id="heading-the-comparison-that-matters">The Comparison That Matters</h3>
<p>Here's the same pipeline on a traditional always-on stack:</p>
<table>
<thead>
<tr>
<th>Service</th>
<th>RAGStack-Lambda</th>
<th>Traditional Stack</th>
</tr>
</thead>
<tbody><tr>
<td>Vector Database</td>
<td>S3 Vectors: pennies/mo</td>
<td>Pinecone Starter: <code>70 USD</code>/mo</td>
</tr>
<tr>
<td>Vector Database (alt)</td>
<td>S3 Vectors: pennies/mo</td>
<td>OpenSearch Serverless: about <code>350 USD</code>/mo min</td>
</tr>
<tr>
<td>Compute</td>
<td>Lambda: free tier</td>
<td>EC2 or ECS: <code>50-150 USD</code>/mo</td>
</tr>
<tr>
<td>LLM Inference</td>
<td>Same per-query cost</td>
<td>Same per-query cost</td>
</tr>
<tr>
<td>Total (idle)</td>
<td>about <code>0.50-3.00 USD</code>/mo</td>
<td><code>120-500 USD</code>/mo</td>
</tr>
</tbody></table>
<p>The LLM inference cost per query is roughly the same everywhere – that's Bedrock's on-demand pricing regardless of your architecture. The difference is everything else. Traditional stacks pay a floor cost whether anyone's using them or not. A serverless stack pays for what it uses, and idle costs essentially nothing.</p>
<h3 id="heading-what-about-transcribe">What About Transcribe?</h3>
<p>If you're uploading video or audio, Amazon Transcribe adds cost for speech-to-text conversion. Standard transcription runs about <code>0.024 USD</code> per minute of audio, so a 10-minute video costs <code>0.24 USD</code> to transcribe. This is a one-time ingestion cost: once transcribed and embedded, the resulting text chunks are queried like any other document.</p>
<h2 id="heading-what-youre-building">What You're Building</h2>
<p>By the end of this tutorial, you'll have a deployed pipeline that does the following:</p>
<ol>
<li><p>You upload a document through a web dashboard. Supported types include PDF, image, video, audio, HTML, and CSV; <a href="https://github.com/HatmanStack/RAGStack-Lambda/blob/main/docs/ARCHITECTURE.md">the full list</a> is extensive.</p>
</li>
<li><p>The pipeline detects the file type and routes it to the right processor. Scanned PDFs go through OCR via Textract. Video and audio go through Transcribe for speech-to-text, split into 30-second searchable chunks with speaker identification. Images get visual embeddings and any caption text you provide.</p>
</li>
<li><p>An LLM analyzes each document and extracts structured metadata: topic, document type, date range, people mentioned, and whatever else is relevant. This happens automatically.</p>
</li>
<li><p>Everything gets embedded using Amazon Nova Multimodal Embeddings and stored in a Bedrock Knowledge Base backed by S3 Vectors.</p>
</li>
<li><p>You (or your users) ask questions through an AI chat interface. The pipeline retrieves relevant documents, passes them as context to a Bedrock LLM, and returns an answer with collapsible source citations, including timestamp links for video and audio that jump to the exact position.</p>
</li>
</ol>
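<p>The routing step in that list can be pictured as a lookup on the file extension. This is only a sketch of the idea, not RAGStack's actual code: the processor names, the <code>is_scanned</code> flag, and the video and audio extensions are all assumptions made for illustration.</p>
<pre><code class="language-python">from pathlib import Path

# Hypothetical extension groups. The media extensions are guesses, not a documented list.
MEDIA = {".mp4", ".mov", ".mp3", ".wav"}
IMAGES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
TEXT = {".html", ".csv", ".md", ".txt", ".json", ".xml"}


def choose_processor(filename, is_scanned=False):
    """Pick a processing route for an uploaded file based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in MEDIA:
        return "transcribe"        # speech-to-text, then chunking
    if suffix in IMAGES:
        return "image_embedding"   # visual embedding plus optional caption
    if suffix == ".pdf" and is_scanned:
        return "textract_ocr"      # scanned pages need OCR first
    if suffix in TEXT or suffix == ".pdf":
        return "direct_text"       # text can be extracted directly
    raise ValueError(f"Unsupported file type: {suffix or filename}")


for name, scanned in [("notes.md", False), ("contract.pdf", True), ("standup.mp4", False)]:
    print(name, "goes to", choose_processor(name, is_scanned=scanned))
</code></pre>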
<p>All of this runs in your AWS account. No external control plane, no third-party services beyond AWS itself.</p>
<h3 id="heading-the-architecture">The Architecture</h3>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/45eca6a5-91b4-4f55-8b1a-ba9f59a3e25d.png" alt="The diagram illustrates a flowchart of a buyer's AWS account, detailing the application plane with processes like S3 to Lambda OCR, supported by services like Cognito Auth. It emphasizes Amazon Bedrock's integration for knowledge and chat." style="display:block;margin:0 auto" width="2816" height="1536" loading="lazy">

<p>A few things to note about this architecture:</p>
<p><strong>Step Functions orchestrate everything.</strong> When a document is uploaded, a state machine manages the entire processing flow: detecting the file type, routing to the right processor, waiting for async operations like Transcribe jobs, then triggering embedding and metadata extraction.</p>
<p>This is what makes the pipeline reliable without a running server. If a step fails, it retries. You can see exactly where every document is in the processing pipeline.</p>
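<p>Retries like this are declared in the state machine definition rather than written as code. The sketch below builds a hypothetical two-step definition in Amazon States Language and works out the wait before each retry. The state names, Lambda ARNs, and retry settings are invented for illustration and are not RAGStack's actual definition.</p>
<pre><code class="language-python">import json

# Hypothetical two-step pipeline. The account ID and function names are placeholders.
state_machine = {
    "Comment": "Illustrative document pipeline with retries",
    "StartAt": "ExtractText",
    "States": {
        "ExtractText": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-text",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "GenerateEmbeddings",
        },
        "GenerateEmbeddings": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-embeddings",
            "End": True,
        },
    },
}


def retry_delays(retry_rule):
    """Return the wait in seconds before each retry attempt for one Retry rule."""
    interval = retry_rule["IntervalSeconds"]
    backoff = retry_rule["BackoffRate"]
    return [interval * backoff ** attempt for attempt in range(retry_rule["MaxAttempts"])]


rule = state_machine["States"]["ExtractText"]["Retry"][0]
print("retry waits (seconds):", retry_delays(rule))
print(json.dumps(state_machine, indent=2))
</code></pre>
<p>Because the backoff lives in the definition, a transient Textract or Bedrock error gets retried without any retry loop inside the Lambda function itself.</p>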
<p><strong>Lambda does the compute.</strong> Every processing step is a Lambda function. They spin up when needed, run for a few seconds to a few minutes, and shut down. There's no EC2 instance idling at 3 AM.</p>
<p><strong>S3 Vectors is the vector store.</strong> Your embeddings live in S3's purpose-built vector storage rather than in a dedicated vector database like Pinecone or OpenSearch.</p>
<p>This is what makes the "scale to zero" cost possible: you're paying object storage rates for vector data instead of keeping a database cluster warm. It also means your vectors are sitting in your own S3 bucket, not in a third-party managed service that holds your data on their terms.</p>
<p><strong>Cognito handles auth.</strong> The dashboard and API are protected with Cognito user pools. When you deploy, you get a temporary password via email. The web component uses IAM-based authentication, and server-side integrations use API key auth.</p>
<p><strong>CloudFront serves the UI.</strong> The dashboard is a static React app served through CloudFront, so there's no web server to maintain.</p>
<h3 id="heading-two-ways-to-deploy">Two Ways to Deploy</h3>
<p>You have two deployment paths depending on what you want:</p>
<p><strong>AWS Marketplace (the fast path):</strong> Click deploy, fill in two fields (stack name and email), and wait about 10 minutes. No local tooling required. This is the path we'll walk through first.</p>
<p><strong>From Source (the developer path):</strong> Clone the repo, run <code>publish.py</code>, and deploy via SAM CLI. This is the path for when you want to customize the processing pipeline, modify the UI, or contribute to the project. We'll cover this after the Marketplace walkthrough.</p>
<p>Both paths produce the same stack. The Marketplace version just wraps the CloudFormation template in a one-click deployment.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you deploy, you'll need:</p>
<ul>
<li><p><strong>An AWS account</strong> with permissions to create CloudFormation stacks, Lambda functions, S3 buckets, DynamoDB tables, and Cognito user pools. If you're using an admin account, you're covered.</p>
</li>
<li><p><strong>Bedrock model access:</strong> RAGStack defaults to <code>us-east-1</code> because that's where Nova Multimodal Embeddings is available. Amazon's own models (including Nova) are available by default in Bedrock, with no manual enablement required. Just make sure your IAM role has the necessary <code>bedrock:InvokeModel</code> permissions.</p>
</li>
<li><p><strong>For the Marketplace path:</strong> just a web browser.</p>
</li>
<li><p><strong>For the source path:</strong> Python 3.13+, Node.js 24+, AWS CLI and SAM CLI configured, and Docker (for building Lambda layers).</p>
</li>
</ul>
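<p>If you're not deploying with an admin role, the Bedrock permission from the list above looks roughly like the policy this sketch generates. It is a minimal illustration, not the policy the stack creates for itself: the statement ID is made up, and the wildcard resource is broader than you'd want in a locked-down account.</p>
<pre><code class="language-python">import json


def bedrock_invoke_policy(region="us-east-1"):
    """Build an IAM policy document that allows invoking Bedrock foundation models."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBedrockInvoke",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                # Foundation model ARNs have an empty account field.
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/*",
            }
        ],
    }


policy = bedrock_invoke_policy()
print(json.dumps(policy, indent=2))
</code></pre>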
<h2 id="heading-deploying-from-aws-marketplace">Deploying from AWS Marketplace</h2>
<p>This is the fastest path – no local tools, no CLI, no Docker. You'll launch a CloudFormation stack and have a working pipeline in about 10 minutes.</p>
<h3 id="heading-step-1-launch-the-stack">Step 1: Launch the Stack</h3>
<p>Click the <a href="https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://ragstack-quicklaunch-public.s3.us-east-1.amazonaws.com/ragstack-template.yaml&amp;stackName=my-docs">direct deploy link</a> to open CloudFormation's "Quick create stack" page with the template pre-loaded.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/d354f6bc-dee8-4f44-9b3b-523ea27564c7.png" alt="Screenshot of AWS CloudFormation Quick Create Stack page in dark mode. Sections for template URL, stack name, parameters, and build options are visible." style="display:block;margin:0 auto" width="1691" height="886" loading="lazy">

<h3 id="heading-step-2-fill-in-two-fields">Step 2: Fill In Two Fields</h3>
<p>The page has a lot of options, but you only need two:</p>
<ul>
<li><p><strong>Stack name:</strong> Must be lowercase. This becomes the prefix for all your AWS resources (for example, <code>my-docs</code>, <code>team-kb</code>, <code>project-notes</code>). Keep it short.</p>
</li>
<li><p><strong>Admin Email:</strong> Under Required Settings. Cognito will send your temporary login credentials here. Use an email you can access right now.</p>
</li>
</ul>
<p>Everything else – Build Options, Advanced Settings, OCR Backend, model selections – can stay at the defaults. They're there for customization later, but the defaults work out of the box.</p>
<h3 id="heading-step-3-deploy">Step 3: Deploy</h3>
<p>Scroll to the bottom, check the three acknowledgment boxes under "Capabilities and transforms," and click <strong>Create stack</strong>.</p>
<p>Deployment takes roughly 10 minutes. You can watch the progress in the CloudFormation Events tab if you're curious, but there's nothing to do until the stack status flips to <code>CREATE_COMPLETE</code>.</p>
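<p>If you'd rather not refresh the console, you can poll the stack status from a script. The helper below is a sketch that takes any client object with a <code>describe_stacks</code> method, such as a boto3 CloudFormation client. The function name and polling defaults are my own choices. boto3 also ships a built-in <code>stack_create_complete</code> waiter that does much the same thing.</p>
<pre><code class="language-python">import time

FAILURE_MARKERS = ("FAILED", "ROLLBACK")


def wait_for_stack(cfn_client, stack_name, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll describe_stacks until the stack reaches CREATE_COMPLETE."""
    for _ in range(max_polls):
        response = cfn_client.describe_stacks(StackName=stack_name)
        status = response["Stacks"][0]["StackStatus"]
        if status == "CREATE_COMPLETE":
            return status
        if any(marker in status for marker in FAILURE_MARKERS):
            raise RuntimeError(f"Stack {stack_name} ended in {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"Stack {stack_name} did not finish after {max_polls} polls")


# Usage with real credentials (requires boto3 to be installed):
#   import boto3
#   client = boto3.client("cloudformation", region_name="us-east-1")
#   wait_for_stack(client, "my-docs")
</code></pre>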
<h3 id="heading-step-4-log-in">Step 4: Log In</h3>
<p>Once the stack finishes, check your email. Cognito sends you the dashboard URL and a temporary password. Log in, set a new password, and you're looking at an empty dashboard ready for documents.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/5ac31b6c-2782-4b66-82a9-0cb962c5dac4.png" alt="A software dashboard interface titled 'Document Pipeline (Demo)' displaying options for uploading, scraping, and searching documents. The screen shows no current documents or scrape jobs, with menu options on the left and a search and filter bar at the center. The overall tone is functional and minimalist." style="display:block;margin:0 auto" width="1902" height="886" loading="lazy">

<h2 id="heading-deploying-from-source">Deploying from Source</h2>
<p>If you want to customize the pipeline, modify the UI, or contribute to the project, deploy from source instead.</p>
<h3 id="heading-step-1-clone-and-set-up">Step 1: Clone and Set Up</h3>
<pre><code class="language-bash">git clone https://github.com/HatmanStack/RAGStack-Lambda.git
cd RAGStack-Lambda

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
</code></pre>
<h3 id="heading-step-2-deploy">Step 2: Deploy</h3>
<p>The <code>publish.py</code> script handles everything: building the frontend, packaging Lambda functions, and deploying via SAM CLI.</p>
<pre><code class="language-bash">python publish.py \
  --project-name my-docs \
  --admin-email admin@example.com
</code></pre>
<p>This defaults to <code>us-east-1</code> for Nova Multimodal Embeddings. The script will build the React dashboard, build the web component, package all Lambda layers with Docker, and deploy the CloudFormation stack through SAM.</p>
<p>The first deploy takes longer (15-20 minutes) because it builds everything from scratch. Subsequent deploys are faster since SAM caches unchanged resources.</p>
<p>If you only want to iterate on the backend and skip UI builds:</p>
<pre><code class="language-bash"># Skip dashboard build (still builds web component)
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui

# Skip ALL UI builds
python publish.py --project-name my-docs --admin-email admin@example.com --skip-ui-all
</code></pre>
<p>Once it finishes, you'll get the same Cognito email and dashboard URL as the Marketplace path.</p>
<h2 id="heading-uploading-your-first-documents">Uploading Your First Documents</h2>
<p>The dashboard has tabs for different content types. We'll start with the Documents tab since that's the most common use case.</p>
<h3 id="heading-documents">Documents</h3>
<p>Click the <strong>Documents</strong> tab and upload a file. RAGStack accepts a wide range of formats: PDF, DOCX, XLSX, HTML, CSV, JSON, XML, EML, EPUB, TXT, and Markdown. Drag and drop or use the file picker.</p>
<p>Once uploaded, the document enters the processing pipeline. You'll see the status update in real time:</p>
<ol>
<li><p><strong>UPLOADED:</strong> File received and stored in S3.</p>
</li>
<li><p><strong>PROCESSING:</strong> Step Functions has picked it up and routed it to the right processor. Text-based files (HTML, CSV, Markdown) go through direct extraction. Scanned PDFs and images go through Textract OCR. The LLM analyzes the content and extracts structured metadata: topic, document type, people mentioned, date ranges, and whatever else is relevant to the content.</p>
</li>
<li><p><strong>INDEXED:</strong> Embeddings generated, vectors stored, document is searchable.</p>
</li>
</ol>
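<p>The three statuses form a simple one-way lifecycle. Here is a minimal model of it, assuming only the happy path listed above. The real pipeline presumably has failure states as well, which this sketch leaves out.</p>
<pre><code class="language-python">from enum import Enum


class DocStatus(Enum):
    UPLOADED = "UPLOADED"
    PROCESSING = "PROCESSING"
    INDEXED = "INDEXED"


# Each status has at most one successor. INDEXED is terminal.
NEXT_STATUS = {
    DocStatus.UPLOADED: DocStatus.PROCESSING,
    DocStatus.PROCESSING: DocStatus.INDEXED,
}


def advance(status):
    """Return the status that follows the given one, or raise if it is terminal."""
    if status not in NEXT_STATUS:
        raise ValueError(f"{status.value} is a terminal status")
    return NEXT_STATUS[status]


def is_searchable(status):
    """A document only shows up in search and chat once it is indexed."""
    return status is DocStatus.INDEXED


status = DocStatus.UPLOADED
while not is_searchable(status):
    status = advance(status)
    print("document is now", status.value)
</code></pre>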
<p>Text documents typically process in 1-5 minutes. OCR-heavy documents (scanned PDFs, images with text) can take 2-15 minutes depending on page count.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/3df05041-2632-41a9-a71c-6d764c503f2a.png" alt="Screenshot of a document upload interface labeled &quot;Document Pipeline (Demo).&quot; Central panel shows a box for drag-and-drop file upload. Sleek, modern design." style="display:block;margin:0 auto" width="1902" height="886" loading="lazy">

<h3 id="heading-images">Images</h3>
<p>The <strong>Images</strong> tab works differently. Upload a JPG, PNG, GIF, or WebP and you can add a caption. Both the visual content and caption text get embedded using Nova Multimodal Embeddings, so you can search by what's in the image or by your description of it.</p>
<p>This is where multimodal embeddings earn their keep. A traditional text-only RAG pipeline would need you to describe every image manually. Here, the image itself becomes searchable, and since everything stays in your AWS account, you're not sending personal photos or sensitive visual content to an external service to get there.</p>
<h3 id="heading-what-about-video-and-audio">What About Video and Audio?</h3>
<p>Upload video or audio files and RAGStack routes them through Amazon Transcribe for speech-to-text conversion. The transcript gets split into 30-second chunks with speaker identification, then embedded like any other document. When chat results reference a video source, you get timestamp links that jump to the exact position in the recording.</p>
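<p>To make the chunking concrete, here is a sketch of how transcript segments could be grouped into 30-second windows and turned into timestamp links. The input shape, the function names, and the use of a <code>#t=</code> media fragment for links are assumptions for illustration. RAGStack's own implementation may differ.</p>
<pre><code class="language-python">from collections import defaultdict

CHUNK_SECONDS = 30


def chunk_transcript(segments, chunk_seconds=CHUNK_SECONDS):
    """Group (start_seconds, speaker, text) segments into fixed-length windows."""
    windows = defaultdict(list)
    for start, speaker, text in segments:
        windows[int(start // chunk_seconds)].append((speaker, text))

    chunks = []
    for index in sorted(windows):
        lines = [f"{speaker}: {text}" for speaker, text in windows[index]]
        chunks.append(
            {
                "start_seconds": index * chunk_seconds,
                "end_seconds": (index + 1) * chunk_seconds,
                "text": " ".join(lines),
            }
        )
    return chunks


def timestamp_link(media_url, start_seconds):
    """Build a link with a media fragment so the player seeks to the chunk start."""
    return f"{media_url}#t={start_seconds}"


# Made-up transcript segments for a short recording.
segments = [
    (2.0, "spk_0", "Welcome to the demo."),
    (14.5, "spk_1", "Thanks for having me."),
    (31.2, "spk_0", "Here is the architecture."),
]
chunks = chunk_transcript(segments)
for chunk in chunks:
    print(timestamp_link("https://example.com/talk.mp4", chunk["start_seconds"]), chunk["text"])
</code></pre>
<p>Keeping the start offset on every chunk is what lets a citation point at a moment in the recording rather than at the whole file.</p>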
<h3 id="heading-web-scraping">Web Scraping</h3>
<p>The <strong>Scrape</strong> tab lets you pull websites directly into your knowledge base. Enter a URL and RAGStack crawls the page, extracts the content, and processes it through the same pipeline as uploaded documents: metadata extraction, embedding, and indexing.</p>
<p>This is useful for building a knowledge base from existing web content without manually saving and uploading pages. It works for documentation sites, blog archives, reference material, or anything else that's publicly accessible.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/ac2c6239-a323-4770-80f7-31aa7ff3bdfb.png" alt="Web scraping interface with fields for URL, max pages, and depth. A dropdown for scope selection and a 'Start Scrape' button are visible." style="display:block;margin:0 auto" width="1902" height="886" loading="lazy">

<h2 id="heading-chatting-with-your-knowledge-base">Chatting With Your Knowledge Base</h2>
<p>This is the payoff. Go to the <strong>Chat</strong> tab, type a question, and RAGStack retrieves relevant documents from your knowledge base, passes them as context to a Bedrock LLM, and returns an answer with source citations.</p>
<p>The citations are collapsible, so click to expand and see which documents informed the answer, with the option to download the source file. For video and audio sources, you get clickable timestamps that jump to the relevant moment.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698f5932352111d3f67030a2/760b3cd0-8bb8-493d-97ce-5eb3d0138592.png" alt="Screenshot of a web interface titled &quot;Knowledge Base Chat&quot; with menu options on the left. The central section prompts users to ask document-related questions." style="display:block;margin:0 auto" width="1902" height="886" loading="lazy">

<h3 id="heading-metadata-filtering">Metadata Filtering</h3>
<p>If you've uploaded enough documents to have meaningful metadata categories, the chat interface lets you filter search results by metadata before querying. RAGStack auto-discovers the metadata structure from your documents, so you don't configure this manually. It just appears as your knowledge base grows.</p>
<p>This is useful when you have a large mixed corpus. Instead of hoping the vector search picks the right context from thousands of documents, you can narrow it down: "only search documents about project X" or "only search content from Q4 2024."</p>
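<p>Under the hood, a filtered search against a Bedrock Knowledge Base is a <code>retrieve</code> call with a filter attached. The sketch below only builds the request arguments so you can see their shape. The helper name and the metadata key are hypothetical, and the way RAGStack itself constructs its filters isn't covered in this tutorial.</p>
<pre><code class="language-python">def build_retrieve_request(knowledge_base_id, question, filters=None, top_k=5):
    """Build keyword arguments for the bedrock-agent-runtime retrieve() call."""
    vector_config = {"numberOfResults": top_k}
    conditions = [
        {"equals": {"key": key, "value": value}}
        for key, value in (filters or {}).items()
    ]
    if len(conditions) == 1:
        vector_config["filter"] = conditions[0]
    elif len(conditions) > 1:
        # andAll needs at least two conditions, so a single one is passed directly.
        vector_config["filter"] = {"andAll": conditions}
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_config},
    }


# "project" is a made-up metadata key, and the knowledge base ID is a placeholder.
request = build_retrieve_request(
    "KB_ID_PLACEHOLDER",
    "What did we decide about caching?",
    filters={"project": "X"},
)
print(request)

# To run it against AWS (requires boto3 and credentials):
#   import boto3
#   results = boto3.client("bedrock-agent-runtime").retrieve(**request)
</code></pre>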
<h2 id="heading-embedding-the-web-component-in-your-app">Embedding the Web Component in Your App</h2>
<p>The dashboard is useful for managing your knowledge base, but the real power is embedding RAGStack's chat in your own application. The web component works with any framework: React, Vue, Angular, Svelte, or plain HTML.</p>
<p>Load the script once from your CloudFront distribution:</p>
<pre><code class="language-html">&lt;script src="https://your-cloudfront-url/ragstack-chat.js"&gt;&lt;/script&gt;
</code></pre>
<p>Then drop the component wherever you want a chat interface:</p>
<pre><code class="language-html">&lt;ragstack-chat
  conversation-id="my-app"
  header-text="Ask About Documents"
&gt;&lt;/ragstack-chat&gt;
</code></pre>
<p>That's it. The component handles authentication (via IAM), manages conversation state, and renders source citations, all self-contained. Your CloudFront URL is in the stack outputs.</p>
<p>For server-side integrations that don't need a UI, the GraphQL API is available with API key authentication. You can find your endpoint and API key in the dashboard under Settings.</p>
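<p>A server-side call might look something like the sketch below, which builds an HTTP request using only the standard library. Treat the details as assumptions: the query and field names are placeholders rather than RAGStack's real schema, and the <code>x-api-key</code> header is the convention AWS API services commonly use. Check the project docs for the actual operations.</p>
<pre><code class="language-python">import json
import urllib.request

# Hypothetical query. The real schema may use different operation and field names.
QUERY = """
query Ask($question: String!) {
  queryKnowledgeBase(query: $question) {
    answer
  }
}
"""


def build_graphql_request(endpoint, api_key, question):
    """Build a POST request carrying a GraphQL query and an API key header."""
    payload = json.dumps({"query": QUERY, "variables": {"question": question}})
    return urllib.request.Request(
        endpoint,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_graphql_request(
    "https://example.com/graphql",  # placeholder for the endpoint shown in Settings
    "YOUR_API_KEY",
    "Which documents mention authentication?",
)
print(request.get_method(), request.full_url)

# To send it:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       result = json.load(response)
</code></pre>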
<h2 id="heading-using-the-mcp-server">Using the MCP Server</h2>
<p>RAGStack includes an MCP server that connects your knowledge base to AI assistants like Claude Desktop, Cursor, VS Code, and Amazon Q CLI. Instead of switching to the dashboard to search your documents, you ask your assistant directly.</p>
<p>Install it:</p>
<pre><code class="language-bash">pip install ragstack-mcp
</code></pre>
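<p>Then add an entry for it to your AI assistant's MCP configuration (in clients such as Claude Desktop and Cursor, server entries sit under the top-level <code>mcpServers</code> key):</p>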
<pre><code class="language-json">{
  "ragstack": {
    "command": "uvx",
    "args": ["ragstack-mcp"],
    "env": {
      "RAGSTACK_GRAPHQL_ENDPOINT": "YOUR_ENDPOINT",
      "RAGSTACK_API_KEY": "YOUR_API_KEY"
    }
  }
}
</code></pre>
<p>Your endpoint and API key are in the dashboard under Settings. Once configured, type <code>@ragstack</code> in your assistant's chat to invoke the MCP server, then ask things like "search my knowledge base for authentication docs" and it queries RAGStack directly.</p>
<p>See the <a href="https://github.com/HatmanStack/RAGStack-Lambda/blob/main/src/ragstack-mcp/README.md">MCP Server docs</a> for the full list of available tools and setup details.</p>
<h2 id="heading-what-you-can-build-from-here">What You Can Build From Here</h2>
<p>You've got a deployed RAG pipeline that costs almost nothing to run and handles text, images, video, and audio. A few directions you might take it:</p>
<p><strong>A searchable personal archive.</strong> Every conference talk you've saved, every PDF textbook, every tutorial video that's sitting in a folder somewhere. Upload it all, and now you have one search interface across years of accumulated material. The multimodal embeddings mean your screenshots and diagrams are searchable too, not just the text.</p>
<p>I built <a href="https://github.com/HatmanStack/family-archive-document-ai">a family archive app</a> this way (scanned letters, old photos, home videos), with RAGStack deployed as a nested CloudFormation stack so the whole family can search across decades of memories using the chat widget.</p>
<p><strong>A second brain for a client project.</strong> Scrape the client's existing docs, upload the SOW and meeting notes, drop in the codebase documentation. Now you've got a searchable knowledge base scoped to that engagement. Spin it up at the start, tear it down when the contract ends. At these costs, it's disposable infrastructure.</p>
<p><strong>AI chat over a niche dataset.</strong> Recipe collections, legal filings, research papers, local government meeting minutes: any corpus that's too specialized for general-purpose LLMs to know well. The web component means you can ship it as a standalone tool without building a frontend from scratch.</p>
<p><strong>RAG for your MCP workflow.</strong> If you're already using Claude Desktop or Cursor, the MCP server turns your knowledge base into another tool your assistant can reach for. Upload your team's runbooks and architecture docs, and now <code>@ragstack</code> in your editor gives you instant context without tab-switching.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The serverless RAG pipeline you just deployed handles document processing, multimodal embeddings, metadata extraction, and AI chat with source citations, all scaling to zero when idle, all running in your AWS account. Your documents, your vectors, your infrastructure. The traditional approach to this stack costs <code>120-500 USD</code>/month in baseline infrastructure. This one costs pocket change.</p>
<p>The full source is at <a href="https://github.com/HatmanStack/RAGStack-Lambda">github.com/HatmanStack/RAGStack-Lambda</a>. File issues, open PRs, or just poke around the architecture. If you want to go deeper on the technical tradeoffs, particularly how filtered vector search behaves on cost-optimized backends like S3 Vectors, that's a story for the next post.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
