<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Tolani Akintayo - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Tolani Akintayo - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 30 May 2026 16:30:29 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/tolani-akintayo/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Common DevOps Mistakes and How to Avoid Them — Tips for Startups ]]>
                </title>
                <description>
                    <![CDATA[ Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what not to do before they got into production. Startup environments make this worse. The p ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-avoid-devops-mistakes/</link>
                <guid isPermaLink="false">6a060c22baf09db7a6253878</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ startup ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tips ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 14 May 2026 17:53:38 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/6fcabd5e-272f-4f1d-b035-8241896e8296.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what <em>not</em> to do before they got into production.</p>
<p>Startup environments make this worse. The pressure to ship fast, the small team sizes, and the absence of senior engineers to review your decisions mean that mistakes happen quietly until they become outages, data loss events, or security incidents that cost the company thousands of dollars and weeks of recovery time.</p>
<p>This article is a direct breakdown of the ten most costly DevOps mistakes engineers make early in their careers at startups. For each mistake, you will get the real-world scenario, the business impact, and the concrete fix you can apply immediately.</p>
<p>Whether you are setting up your first production environment or auditing an existing one, this guide will help you build systems that are reliable, secure, and aligned with what the business actually needs.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-who-this-article-is-for">Who This Article Is For</a></p>
</li>
<li><p><a href="#heading-why-startups-are-a-different-environment">Why Startups Are a Different Environment</a></p>
</li>
<li><p><a href="#heading-mistake-1-deploying-without-understanding-what-youre-deploying">Mistake 1: Deploying Without Understanding What You're Deploying</a></p>
</li>
<li><p><a href="#heading-mistake-2-using-production-as-a-development-environment">Mistake 2: Using Production as a Development Environment</a></p>
</li>
<li><p><a href="#heading-mistake-3-hardcoding-secrets-and-credentials">Mistake 3: Hardcoding Secrets and Credentials</a></p>
</li>
<li><p><a href="#heading-mistake-4-overengineering-for-problems-you-dont-have-yet">Mistake 4: Overengineering for Problems You Don't Have Yet</a></p>
</li>
<li><p><a href="#heading-mistake-5-no-observability-before-launch">Mistake 5: No Observability Before Launch</a></p>
</li>
<li><p><a href="#heading-mistake-6-treating-security-as-a-final-step">Mistake 6: Treating Security as a Final Step</a></p>
</li>
<li><p><a href="#heading-mistake-7-manual-deployments-in-production">Mistake 7: Manual Deployments in Production</a></p>
</li>
<li><p><a href="#heading-mistake-8-no-disaster-recovery-plan">Mistake 8: No Disaster Recovery Plan</a></p>
</li>
<li><p><a href="#heading-mistake-9-no-documentation-or-runbooks">Mistake 9: No Documentation or Runbooks</a></p>
</li>
<li><p><a href="#heading-mistake-10-solving-technical-problems-without-understanding-the-business">Mistake 10: Solving Technical Problems Without Understanding the Business</a></p>
</li>
<li><p><a href="#heading-the-system-thinking-framework-every-devops-engineer-needs">The System Thinking Framework Every DevOps Engineer Needs</a></p>
</li>
<li><p><a href="#heading-your-production-readiness-checklist">Your Production Readiness Checklist</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-who-this-article-is-for">Who This Article Is For</h2>
<ul>
<li><p><strong>Early-career DevOps and cloud engineers</strong> who are building or maintaining production infrastructure at a startup.</p>
</li>
<li><p><strong>Backend developers</strong> who have recently taken on DevOps responsibilities.</p>
</li>
<li><p><strong>Engineers joining a startup</strong> who want to understand what operational discipline actually looks like in a fast-moving environment.</p>
</li>
</ul>
<p>You do not need to be an expert in any specific tool to follow this article. The focus is on decision-making patterns and operational discipline, not tool configuration.</p>
<h2 id="heading-why-startups-are-a-different-environment">Why Startups Are a Different Environment</h2>
<p>Before getting into the mistakes, you have to understand why startups produce them in the first place.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/f9bec1fa-8938-4144-b934-9e5af4edf4ad.svg" alt="diagram showing the startup DevOps reality, a single engineer handling infra, CI/CD, security, monitoring, and deployment pipelines simultaneously" style="display:block;margin:0 auto" width="680" height="506" loading="lazy">

<p>In a large company, you typically have dedicated security engineers, an SRE team, a platform team, and multiple reviewers for every infrastructure change. In a startup, you most likely have one engineer responsible for all of that simultaneously.</p>
<p>This creates four specific pressure points:</p>
<ol>
<li><p><strong>Speed pressure.</strong> The business needs features shipped now. Operational discipline gets treated as optional because nobody is watching closely yet.</p>
</li>
<li><p><strong>Budget constraints.</strong> Every infrastructure decision has a direct impact on company runway. Engineers optimize for the cheapest option rather than the most reliable one.</p>
</li>
<li><p><strong>Absent guardrails.</strong> There is no senior engineer reviewing your Terraform plans. There is no security audit before launch. The absence of immediate consequences can make bad decisions feel like good ones.</p>
</li>
<li><p><strong>Constantly changing requirements.</strong> The architecture you design today may need to support a completely different product in six months.</p>
</li>
</ol>
<p>None of these pressures are excuses for poor decisions. But understanding them helps you see why the following mistakes happen so consistently.</p>
<h2 id="heading-mistake-1-deploying-without-understanding-what-youre-deploying">Mistake 1: Deploying Without Understanding What You're Deploying</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A junior engineer is asked to deploy the company's Node.js API to AWS. They find a tutorial for Elastic Beanstalk, follow it, and it works. Two weeks later, traffic increases. They try to scale "the same way as in the tutorial." The application goes down. They cannot debug it because they never understood what the deployment was actually doing.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>When production breaks and the person who deployed the system cannot explain how it works, diagnosis takes hours instead of minutes. The longer the incident runs, the higher the cost in customer trust, team morale, and potentially direct revenue loss.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Before you deploy anything to production, you should be able to answer these five questions in writing:</p>
<ol>
<li><p><strong>What compute type is running my code?</strong> (EC2, Lambda, Fargate, container?)</p>
</li>
<li><p><strong>How does a new version replace the old one?</strong> (Rolling? Blue/green? All-at-once?)</p>
</li>
<li><p><strong>Where do configuration and secrets come from?</strong> (SSM? Secrets Manager? Environment file?)</p>
</li>
<li><p><strong>What downstream services depend on this?</strong> (Database connections? Other APIs? Cache?)</p>
</li>
<li><p><strong>How do I roll back in under five minutes if this breaks?</strong></p>
</li>
</ol>
<p>If you cannot answer all five, do not deploy until you can. The tutorial that got it running is not the documentation for how it operates.</p>
<blockquote>
<p>"It is better to spend two hours understanding a system before deploying it than two days debugging it after something breaks."</p>
</blockquote>
<p>Personally, when learning a new technology, tool, or implementing something I have not worked with before, I usually focus on three core questions: What, Why, and How.</p>
<ul>
<li><p><strong>The first question is: What is this technology or concept about?</strong><br>This helps me build a solid foundation by doing deep research, studying the official documentation, understanding the core principles, and sometimes even learning the history behind the tool or technology. I believe having a well-grounded understanding before implementation is very important.</p>
</li>
<li><p><strong>The second question is: Why do we need it?</strong><br>I try to understand the value the technology brings, why it should be implemented, what problem it solves, and how it benefits the team or organization. This helps me make informed technical decisions instead of just implementing tools without understanding their purpose.</p>
</li>
<li><p><strong>The third question is: How should it be implemented?</strong><br>There are usually multiple approaches to solving a problem or implementing a technology, so I focus on understanding the best and most practical approach based on the use case and expected outcome.</p>
</li>
</ul>
<p>This structured approach has helped me learn new technologies quickly, adapt fast, and implement solutions effectively in real-world environments.</p>
<h2 id="heading-mistake-2-using-production-as-a-development-environment">Mistake 2: Using Production as a Development Environment</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>To save time, an engineer tests a new deployment script directly in the production AWS account. They accidentally run a command that terminates the production database instance. Automated backups exist but were misconfigured. Six hours of customer data is unrecoverable.</p>
<p>This scenario happens more often than you would expect. The reasoning is always the same: "It will only take a minute."</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>A single test-in-production incident can result in data loss, hours of downtime, and a customer communication crisis. In a startup, that can permanently damage the company's reputation before it has had the chance to build one.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>You need at minimum three separate environments and ideally three separate AWS accounts:</p>
<table>
<thead>
<tr>
<th>Environment</th>
<th>Purpose</th>
<th>Access Level</th>
</tr>
</thead>
<tbody><tr>
<td><strong>dev</strong></td>
<td>Break things freely. No real data.</td>
<td>Engineers have broad access</td>
</tr>
<tr>
<td><strong>staging</strong></td>
<td>Mirror of production. Final verification.</td>
<td>Controlled access</td>
</tr>
<tr>
<td><strong>production</strong></td>
<td>Real customers. Real data.</td>
<td>MFA required. No manual deployments.</td>
</tr>
</tbody></table>
<p>Using separate AWS accounts (not just separate VPCs) gives you account-level isolation. A permission error in the dev account cannot accidentally touch production infrastructure at the API level.</p>
<p>Infrastructure as Code (Terraform or CloudFormation) makes this affordable: you write the configuration once and apply it three times with different variable files.</p>
<pre><code class="language-hcl"># terraform/environments/prod/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "production"
  instance_type     = "t3.medium"
  db_instance_class = "db.t3.medium"
  multi_az          = true
}
</code></pre>
<pre><code class="language-hcl"># terraform/environments/staging/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "staging"
  instance_type     = "t3.small"
  db_instance_class = "db.t3.small"
  multi_az          = false
}
</code></pre>
<p>The module is the same. The environment-specific variables are different. Separate environments are not a luxury; they are the minimum operating standard for any team running real software.</p>
<h2 id="heading-mistake-3-hardcoding-secrets-and-credentials">Mistake 3: Hardcoding Secrets and Credentials</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A new engineer joins a startup and clones the repository. Inside they find a <code>.env</code> file committed to Git containing the production database password, the Stripe secret key, and an AWS access key with admin permissions. The repository has been public for six months.</p>
<p>GitHub's automated secret scanning never triggered because the secrets were inside a <code>.env</code> file rather than raw in the code. The credentials had been valid and actively used for over six months.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Automated scanners run by attackers find exposed credentials within minutes of them being pushed to a public repository. A single exposed AWS access key with admin permissions can result in:</p>
<ul>
<li><p>Crypto-mining workloads generating thousands of dollars in cloud bills overnight</p>
</li>
<li><p>Complete exfiltration of customer data from every S3 bucket</p>
</li>
<li><p>Privilege escalation: the attacker creates new admin users and locks you out of your own account</p>
</li>
<li><p>AWS account suspension while the investigation runs</p>
</li>
</ul>
<p>According to <a href="https://github.blog/security/vulnerability-research/securing-millions-of-developers-together/">GitHub's annual security report</a>, millions of secrets are exposed in public repositories every year. The average time to detect a compromised cloud credential is 197 days.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p><strong>Step 1: Never commit secrets to Git.</strong> Not temporarily. Not in a branch. Not in a private repository.</p>
<p><strong>Step 2: Add</strong> <code>.gitignore</code> <strong>before you create the first file.</strong> Check in the <code>.gitignore</code> with the first line of code before any <code>.env</code> files exist.</p>
<pre><code class="language-gitignore"># .gitignore
.env
.env.*
*.pem
*.key
secrets/
</code></pre>
<p><strong>Step 3: Use AWS Secrets Manager or SSM Parameter Store for all production secrets.</strong> Your application reads secrets at runtime:</p>
<pre><code class="language-python"># Python example — fetch secret at runtime, never at build time
import boto3
import json
 
def get_secret(secret_name: str, region: str = "us-east-1") -&gt; dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
 
# Usage
db_config = get_secret("prod/myapp/database")
DATABASE_URL = db_config["connection_string"]
</code></pre>
<p><strong>Step 4: Scan your existing repositories immediately.</strong> You may already have a problem:</p>
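<pre><code class="language-bash"># Install TruffleHog v3 to scan for exposed secrets in your repo history.
# The pip package is the legacy v2 release, which lacks the subcommands used below.
# Homebrew is one option; see the TruffleHog README for other install methods.
brew install trufflehog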
 
# Scan the entire commit history of your repository
trufflehog git file://.
 
# Or scan a remote GitHub repo
trufflehog github --repo https://github.com/your-org/your-repo
</code></pre>
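<p>To make concrete what a scanner looks for, here is a minimal sketch of pattern-based detection. It is not how TruffleHog is implemented: the <code>find_secrets</code> helper and the generic name-based heuristic are assumptions for illustration, and only the <code>AKIA</code> prefix reflects the publicly documented shape of an AWS access key ID.</p>
<pre><code class="language-python">import re

# Documented shape of an AWS access key ID: "AKIA" followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
# Hypothetical heuristic: an assignment whose name suggests a credential.
GENERIC_SECRET = re.compile(r"(password|secret|api_key|token)\w*\s*[=:]\s*\S+", re.IGNORECASE)


def find_secrets(text):
    """Return (line_number, kind) for each line that looks like it holds a secret."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if AWS_ACCESS_KEY_ID.search(line):
            findings.append((number, "aws-access-key-id"))
        elif GENERIC_SECRET.search(line):
            findings.append((number, "generic-secret"))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used throughout the AWS documentation.
sample = "DEBUG=true\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\nDB_PASSWORD=hunter2\n"
print(find_secrets(sample))
</code></pre>
<p>Real scanners add entropy checks, hundreds of provider-specific patterns, and live credential verification, so treat this only as an illustration of the idea.</p>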
<p><strong>Step 5: Add a pre-commit hook to prevent future accidents:</strong></p>
<pre><code class="language-bash">pip install pre-commit
</code></pre>
<pre><code class="language-yaml"># .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/git-secrets
    rev: master
    hooks:
      - id: git-secrets
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets
</code></pre>
<pre><code class="language-bash">pre-commit install
# Now the hook runs before every commit and blocks detected secrets
</code></pre>
<p>There is no recovery from a publicly exposed database password. The fix takes ten minutes upfront. The incident takes weeks.</p>
<h2 id="heading-mistake-4-overengineering-for-problems-you-dont-have-yet">Mistake 4: Overengineering for Problems You Don't Have Yet</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A five-person startup with 200 users decides to build a microservices architecture on Kubernetes because "Netflix uses it." They spend three months setting up Kubernetes, Istio service mesh, ArgoCD, Vault, Prometheus, and Grafana. Their product has not shipped a new feature in three months. A competitor with a monolith on a single EC2 instance shipped twelve new features in the same period.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Every layer of infrastructure you add is a layer that can break, a layer that requires expertise to operate, and a layer that slows down every future change. Kubernetes is the right answer for organizations with the scale and team size to operate it. For a five-person startup, it is an expensive distraction.</p>
<p>Premature complexity does not just cost engineering time. It costs the competitive advantage that speed provides in the early stage.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Match your infrastructure to your actual stage:</p>
<table>
<thead>
<tr>
<th>Scale</th>
<th>Right Infrastructure</th>
<th>Cost Range</th>
</tr>
</thead>
<tbody><tr>
<td><strong>1–1,000 users</strong></td>
<td>Single EC2 + RDS + Nginx reverse proxy</td>
<td>$20–50/month</td>
</tr>
<tr>
<td><strong>1K–50K users</strong></td>
<td>Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD</td>
<td>$200–500/month</td>
</tr>
<tr>
<td><strong>50K–500K users</strong></td>
<td>ECS Fargate, RDS read replicas, ElastiCache, full observability</td>
<td>$1K–5K/month</td>
</tr>
<tr>
<td><strong>500K+ users</strong></td>
<td>Multi-region, managed Kubernetes, dedicated SRE</td>
<td>$10K+/month</td>
</tr>
</tbody></table>
<p>The question to ask before every infrastructure decision is: <strong>"What specific, measurable problem does this solve today that my current setup cannot solve?"</strong></p>
<p>Amazon, Netflix, and Uber did not start with microservices. They started with monoliths and extracted services only when the monolith became the actual bottleneck. You are not Netflix. You are solving the problems in front of you today.</p>
<p>Use managed services wherever possible: RDS instead of self-hosted Postgres, Fargate instead of self-managed Kubernetes, ElastiCache instead of self-hosted Redis. Managed services let your team focus on the product instead of the infrastructure.</p>
<h2 id="heading-mistake-5-no-observability-before-launch">Mistake 5: No Observability Before Launch</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's checkout flow breaks on a Friday evening. Users are abandoning their carts and the company is losing revenue. The DevOps engineer finds out 45 minutes later because a customer sent a direct message to the CEO on Twitter.</p>
<p>The engineer has no dashboards, no log aggregation, and no alerting. They SSH into the production server and scroll through raw log files. Two hours later, they find the issue: a database connection pool was exhausted by a memory leak introduced in that morning's deployment.</p>
<h3 id="heading-business-impact">The Business Impact</h3>
<p>Without observability:</p>
<ul>
<li><p>You find out about production problems from users, not from your systems</p>
</li>
<li><p>Incidents take 10x longer to resolve because diagnosis is guesswork</p>
</li>
<li><p>You cannot tell whether a deployment improved or degraded performance</p>
</li>
<li><p>You have no data for making better architecture decisions</p>
</li>
</ul>
<h3 id="heading-the-fix">The Fix</h3>
<p>Implement the four golden signals before any service goes to production. These come from <a href="https://sre.google/sre-book/monitoring-distributed-systems/">Google's Site Reliability Engineering book</a>:</p>
<ol>
<li><p><strong>Latency</strong>: How long requests take to complete (p50, p95, p99)</p>
</li>
<li><p><strong>Traffic</strong>: How many requests per second the system is handling</p>
</li>
<li><p><strong>Errors</strong>: The rate of failed requests (5xx responses per minute)</p>
</li>
<li><p><strong>Saturation</strong>: How close the system is to its limits (CPU, memory, connection pool)</p>
</li>
</ol>
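<p>To make the four signals concrete, here is a minimal sketch that computes them from one window of request records. The <code>golden_signals</code> function, the record shape, and the connection-pool numbers are assumptions for illustration; in practice your metrics backend does this aggregation for you.</p>
<pre><code class="language-python">import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    """requests: list of (latency_ms, status_code) pairs observed in the window."""
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status // 100 == 5)
    return {
        "latency_ms": {p: percentile(latencies, p) for p in (50, 95, 99)},
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": pool_in_use / pool_size,
    }


# 100 requests in a 10-second window: 98 fast successes and 2 slow server errors.
window = [(20, 200)] * 98 + [(900, 500), (1200, 503)]
signals = golden_signals(window, window_seconds=10, pool_in_use=18, pool_size=20)
print(signals)
</code></pre>
<p>In this sample the median and p95 both look fine at 20 ms, while p99 jumps to 900 ms. That is why tail percentiles are tracked alongside the median.</p>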
<p>Here is a minimal CloudWatch alarm setup using the AWS CLI:</p>
<pre><code class="language-shell"># Alert when targets return 10 or more 5xx responses per minute for 5 consecutive minutes.
# The count threshold is a placeholder to tune for your traffic; a percentage error rate
# needs metric math against RequestCount.

aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate-production" \
  --alarm-description "10+ target 5xx responses per minute for 5 minutes" \
  --metric-name "HTTPCode_Target_5XX_Count" \
  --namespace "AWS/ApplicationELB" \
  --statistic "Sum" \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 10 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --treat-missing-data "notBreaching" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-production" \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef
</code></pre>
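<p>A percentage error rate divides 5xx responses by total requests, which CloudWatch expresses with metric math. Here is a minimal sketch that builds the arguments for boto3's <code>put_metric_alarm</code>. The <code>build_error_rate_alarm</code> helper is hypothetical, and the load balancer ID and SNS topic ARN are the same placeholders used in the CLI example.</p>
<pre><code class="language-python">def build_error_rate_alarm(load_balancer, sns_topic_arn, threshold_percent=1.0):
    """Build keyword arguments for CloudWatch put_metric_alarm using metric math."""

    def alb_metric(query_id, metric_name):
        # One per-minute Sum of an Application Load Balancer metric, used only as input.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": "high-error-rate-production",
        "Metrics": [
            alb_metric("errors", "HTTPCode_Target_5XX_Count"),
            alb_metric("requests", "RequestCount"),
            {
                # The alarm evaluates this expression, not the raw counts.
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5xx error rate (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": 5,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


alarm = build_error_rate_alarm(
    "app/my-alb/1234567890abcdef",
    "arn:aws:sns:us-east-1:123456789012:pagerduty-production",
)
# With AWS credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
</code></pre>
<p>Building the arguments separately from the API call keeps the alarm definition reviewable and testable without touching AWS.</p>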
<p>Every application should also expose a <code>/health</code> endpoint that returns <code>200 OK</code> when healthy:</p>
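<pre><code class="language-python"># FastAPI example

import os

from fastapi import FastAPI, Response
from sqlalchemy import create_engine, text
 
app = FastAPI()
# Assumes DATABASE_URL holds a SQLAlchemy connection string
engine = create_engine(os.environ["DATABASE_URL"])
 
@app.get("/health")
def health_check(response: Response):
    # Check database connectivity
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        db_status = "healthy"
    except Exception:
        db_status = "unhealthy"
        # Return a non-200 status so load balancers and uptime monitors see the failure
        response.status_code = 503
 
    return {
        "status": "healthy" if db_status == "healthy" else "degraded",
        "database": db_status,
        "version": os.getenv("APP_VERSION", "unknown")
    }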
</code></pre>
<p>Your load balancer checks this endpoint. Your uptime monitor checks it. You check it after every deployment.</p>
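<p>The post-deployment check can be scripted rather than done by hand. Here is a minimal sketch that polls the endpoint until it reports healthy. The <code>wait_until_healthy</code> helper, the retry counts, and the example URL are assumptions for illustration; it expects the JSON body shown above.</p>
<pre><code class="language-python">import json
import time
import urllib.request


def fetch_health(url):
    """Return the parsed JSON body of the health endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_healthy(url, attempts=10, delay_seconds=3, fetch=fetch_health, sleep=time.sleep):
    """Poll the endpoint until it reports healthy or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except (OSError, ValueError):
            pass  # connection refused, timeout, HTTP error, or invalid JSON: retry
        if attempt != attempts:
            sleep(delay_seconds)
    return False


# Example usage with a hypothetical URL, as the last step of a deployment script:
#   if not wait_until_healthy("https://api.example.com/health"):
#       raise SystemExit("deployment is unhealthy, roll back")
</code></pre>
<p>The <code>fetch</code> and <code>sleep</code> parameters exist so the retry logic can be tested without a network or real delays.</p>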
<blockquote>
<p>You do not get to say a system is working unless you have data to prove it. "Nobody complained" is not the same as "nothing is broken."</p>
</blockquote>
<h2 id="heading-mistake-6-treating-security-as-a-final-step">Mistake 6: Treating Security as a Final Step</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup rushes to launch their MVP. Security reviews are "planned for after launch." Six months later, a potential enterprise customer requires a security audit before signing a contract. The audit reveals:</p>
<ul>
<li><p>S3 buckets publicly accessible by default</p>
</li>
<li><p>EC2 instances with port 22 open to <code>0.0.0.0/0</code></p>
</li>
<li><p>IAM users with <code>AdministratorAccess</code> for the entire team</p>
</li>
<li><p>No encryption on the database at rest</p>
</li>
<li><p>JWT secrets hardcoded in environment variables</p>
</li>
</ul>
<p>The audit fails. The enterprise deal worth $120,000 annually is lost. Remediation takes four weeks of engineering time.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Security debt is the most expensive technical debt you can accumulate. Unlike performance debt that degrades gradually, security vulnerabilities cause sudden, catastrophic events: data breaches, ransomware, account takeovers, and regulatory fines. At a startup, any one of these can end the company.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Apply these six security controls before the first line of production code ships:</p>
<p><strong>1. Apply the principle of least privilege so every IAM role gets only what it needs:</strong></p>
<p>One of the most common security mistakes in AWS is granting roles more permissions than they need, either out of convenience (<code>s3:*</code>) or uncertainty about what the service actually requires. This creates unnecessary risk: if a role is compromised, the attacker inherits every permission you granted.</p>
<p>The fix is simple: look at what your service actually does, then write a policy that allows exactly that.</p>
<p>If your app uploads and reads files from a specific S3 bucket, the policy should say exactly that:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}
</code></pre>
<p>Notice that the <code>Resource</code> is scoped to <code>my-app-uploads/*</code>, not all S3 buckets, and the <code>Action</code> list covers only <code>GetObject</code> and <code>PutObject</code>, not <code>DeleteObject</code> and not <code>s3:*</code>. If the service gets compromised, the attacker can read and write objects in that one bucket. That is it. The rest of your account is untouched.</p>
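<p>A check like this can be automated in code review. Here is a minimal sketch of a lint that flags wildcard actions and resources in a policy document. The <code>find_wildcards</code> helper is hypothetical and covers only the two patterns discussed here; for real audits, AWS IAM Access Analyzer is the more complete tool.</p>
<pre><code class="language-python">def find_wildcards(policy):
    """Return warnings for Allow statements that use overly broad wildcards."""
    warnings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                # Flag "*" anywhere, and service-wide actions such as "s3:*".
                if value == "*" or (key == "Action" and value.endswith(":*")):
                    warnings.append(f"Statement {index}: {key} '{value}' is too broad")
    return warnings


scoped = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::my-app-uploads/*",
    }],
}
broad = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}
print(find_wildcards(scoped))
print(find_wildcards(broad))
</code></pre>
<p>The scoped policy from this section produces no warnings, while the convenience policy is flagged twice: once for the action and once for the resource.</p>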
<p><strong>2. Block all S3 public access by default:</strong></p>
<p>S3 buckets are private by default when created, but that can be overridden at the bucket level, the object level, or through a bucket policy. Misconfigured S3 buckets are one of the most common causes of data breaches, and they are almost always accidental.</p>
<p>The safest approach is to enable the "Block Public Access" setting, which overrides ACLs and bucket policies and prevents a bucket from being made public even if someone tries. The command below applies it to a single bucket:</p>
<pre><code class="language-bash">aws s3api put-public-access-block \
  --bucket my-app-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<p>Run this for every bucket you create. Better yet, enable it at the AWS account level so it applies automatically to all future buckets by default.</p>
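<p>The account-level setting lives in the S3 Control API rather than the per-bucket API. Here is a minimal sketch using boto3, where the <code>block_public_access_for_account</code> helper and the account ID are placeholders for illustration:</p>
<pre><code class="language-python">def block_public_access_for_account(account_id, client=None):
    """Turn on all four Block Public Access settings for an entire AWS account."""
    if client is None:
        import boto3  # imported lazily so the helper can be tested with a stub client
        client = boto3.client("s3control")
    client.put_public_access_block(
        AccountId=account_id,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


# Example usage with a placeholder account ID and configured AWS credentials:
#   block_public_access_for_account("123456789012")
</code></pre>
<p>Once this is set, a bucket-level ACL or policy cannot make a bucket in the account public.</p>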
<p><strong>3. Never open SSH to the internet; use AWS Systems Manager Session Manager instead:</strong></p>
<p>Port 22 open to <code>0.0.0.0/0</code> is an attack surface that exists on thousands of AWS instances right now. Brute-force bots scan the internet continuously looking for open SSH ports. Even with a strong key, the exposure is unnecessary because AWS provides a better alternative.</p>
<p>AWS Systems Manager Session Manager gives you full shell access to any EC2 instance without opening a single inbound port on the security group. There is no port to scan, no port to attack, and the start of every session is recorded in CloudTrail. Session activity can also be streamed to CloudWatch Logs or S3:</p>
<pre><code class="language-bash"># Start a session on an EC2 instance without port 22 open
aws ssm start-session --target i-0123456789abcdef0
</code></pre>
<p>To use Session Manager, the EC2 instance needs the SSM Agent installed (included by default on Amazon Linux 2 and Ubuntu 20.04+) and an IAM instance profile with the <code>AmazonSSMManagedInstanceCore</code> policy attached. Once that is set up, you can close port 22 on the security group entirely.</p>
<p><strong>4. Enable MFA for all IAM users and enforce it via policy:</strong></p>
<p>A leaked IAM username and password with no MFA is a fully compromised account. Multi-factor authentication is the single most effective control against credential theft, and it costs nothing to enable.</p>
<p>Enforce it through an IAM policy that denies all actions when MFA is not present, except the actions needed to set up MFA in the first place. This means even if a set of credentials is stolen, the attacker cannot do anything without the second factor.</p>
<p>The AWS documentation provides the <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_users-self-manage-mfa-and-creds.html">Complete Deny Without MFA Policy</a>. Attach it to every IAM user or group in your account. This is a one-time setup that permanently raises your account's security baseline.</p>
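<p>The core of that policy is a single deny statement. Here is an abridged sketch of its shape, built as a Python dictionary. It is not the complete AWS policy: the action list is shortened for illustration, so use the linked documentation for the version you actually attach.</p>
<pre><code class="language-python">import json


def deny_without_mfa_statement():
    """Deny everything except MFA self-setup when the request was not made with MFA."""
    return {
        "Sid": "DenyAllExceptListedIfNoMFA",
        "Effect": "Deny",
        # NotAction lists the calls a user still needs in order to enroll an MFA device.
        "NotAction": [
            "iam:CreateVirtualMFADevice",
            "iam:EnableMFADevice",
            "iam:GetUser",
            "iam:ListMFADevices",
            "iam:ListVirtualMFADevices",
            "iam:ResyncMFADevice",
            "sts:GetSessionToken",
        ],
        "Resource": "*",
        # BoolIfExists also matches requests where the key is absent,
        # such as calls signed with long-term access keys.
        "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
    }


policy = {"Version": "2012-10-17", "Statement": [deny_without_mfa_statement()]}
print(json.dumps(policy, indent=2))
</code></pre>
<p>Because this is an explicit deny, it wins over any allow the user has elsewhere, which is what makes stolen credentials useless without the second factor.</p>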
<p><strong>5. Enable CloudTrail in all regions:</strong></p>
<p>Without CloudTrail, you have no record of who did what in your AWS account. If a credential is compromised, you cannot investigate what the attacker accessed. If an engineer accidentally deletes a resource, you cannot trace it. You are operating blind.</p>
<p>CloudTrail records AWS API calls: who made each call, from which IP, at what time, and what the response was. Enable it across all regions so activity in regions you do not actively use is also captured:</p>
<pre><code class="language-bash">aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation

# A trail created from the CLI does not record events until logging is started
aws cloudtrail start-logging --name production-audit-trail
</code></pre>
<p>The <code>--enable-log-file-validation</code> flag makes CloudTrail deliver signed digest files that let you verify the log files have not been modified or deleted. This is important if you ever need to use these logs in a security investigation or compliance audit. Once the trail is logging, every <code>AssumeRole</code>, every <code>DeleteBucket</code>, and every <code>RunInstances</code> call in your account is recorded.</p>
<p><strong>6. Run AWS Security Hub from day one:</strong></p>
<p>Most teams only discover security misconfigurations after a breach or a compliance audit. Security Hub inverts this: it continuously scans your AWS environment against industry-standard frameworks (CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices) and surfaces findings before they become incidents.</p>
<p>Enabling it takes a single command:</p>
<pre><code class="language-bash">aws securityhub enable-security-hub
</code></pre>
<p>Within minutes, Security Hub gives your account a compliance score and a prioritized list of findings. A finding might tell you that a security group has port 22 open to the world, that an S3 bucket has logging disabled, or that root account credentials were recently used. Each finding includes the affected resource and a remediation guide.</p>
<p>Treat every Security Hub finding the same way you treat a production bug: assign it a priority, assign an owner, and close it. A finding sitting unaddressed for 30 days is a known vulnerability you chose to leave open.</p>
<h2 id="heading-mistake-7-manual-deployments-in-production">Mistake 7: Manual Deployments in Production</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's deployment process is documented in a Notion page that is four months out of date. It involves SSH-ing into the server, running <code>git pull</code>, running <code>npm install</code>, and restarting the PM2 process. Different engineers do it slightly differently. One engineer, rushing a late-night release, skips <code>npm install</code>. The application starts crashing because a new dependency is missing.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Manual deployment processes are inherently unreliable. Humans under pressure skip steps, perform steps in the wrong order, and remember procedures differently. Every manual step in a production deployment process is a scheduled incident waiting for the right moment of stress.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>If a deployment step is performed manually more than twice, it needs to be automated. Here is a minimal but complete GitHub Actions deployment workflow for an ECS Fargate service:</p>
<pre><code class="language-yaml"># .github/workflows/deploy.yml
name: Deploy to Production
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write   # Required for OIDC authentication with AWS
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
 
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
 
      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/my-app:$IMAGE_TAG .
          docker push $ECR_REGISTRY/my-app:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/my-app:$IMAGE_TAG" &gt;&gt; $GITHUB_OUTPUT
 
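      - name: Render task definition with the new image
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-app   # Assumed: must match the container name in task-definition.json
          image: ${{ steps.build.outputs.image }}
 
      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: my-app-service
          cluster: production
          wait-for-service-stability: true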
</code></pre>
<p>Notice <code>wait-for-service-stability: true</code>. Without this, the workflow reports success the moment ECS accepts the new task definition, before the containers are actually healthy. With it, the workflow fails if the service does not reach a steady state, for example because the new containers crash. You want to know immediately, not discover it from user reports thirty minutes later.</p>
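<p>The workflow authenticates through OIDC, which means an IAM role has to exist on the AWS side that trusts GitHub's identity provider. The following Terraform is a minimal sketch of that role, not a definitive setup. The repository path <code>your-org/your-repo</code> and the role name <code>github-actions-deploy</code> are placeholders, and the permissions policy for ECR and ECS is deliberately left out:</p>
<pre><code class="language-hcl"># GitHub's OIDC identity provider, registered once per AWS account.
resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]

  # Thumbprint commonly published for GitHub's OIDC endpoint.
  # Newer AWS provider versions may let you omit this argument.
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

# Trust policy: only jobs from one repository, running in the
# "production" environment, may assume the role.
data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:your-org/your-repo:environment:production"]
    }
  }
}

# Placeholder role name; attach a least-privilege ECR/ECS policy separately.
resource "aws_iam_role" "github_deploy" {
  name               = "github-actions-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
</code></pre>
<p>The ARN of this role is the value the workflow reads from the <code>AWS_DEPLOY_ROLE_ARN</code> secret. Because the job declares <code>environment: production</code>, the token's subject claim takes the <code>environment:production</code> form matched above, and no long-lived access keys need to be stored in GitHub.</p>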
<h2 id="heading-mistake-8-no-disaster-recovery-plan">Mistake 8: No Disaster Recovery Plan</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup's production database runs on a single RDS instance with no Multi-AZ configuration. Automated backups are enabled but have never been tested. The EBS volume backing the instance fails. AWS provisions a new instance from the last snapshot, which is 18 hours old. 18 hours of customer data is permanently lost.</p>
<p>The startup had no disaster recovery plan, no tested recovery procedure, and no communication template ready for customers.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>The question is not whether your infrastructure will fail. It will fail. Every database, every server, every availability zone experiences failures. The question is whether you have a tested plan for when it does.</p>
<p>Data loss of any magnitude is serious. For startups that handle financial data, healthcare data, or anything under GDPR, even partial data loss can trigger regulatory consequences.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p><strong>Define your RTO and RPO before you design anything:</strong></p>
<ul>
<li><p><strong>RTO (Recovery Time Objective):</strong> How long can the business survive without this system? A payment API might have an RTO of 15 minutes. An internal analytics dashboard might have an RTO of 4 hours.</p>
</li>
<li><p><strong>RPO (Recovery Point Objective):</strong> How much data loss is acceptable? Zero means real-time replication. One hour means hourly snapshots are sufficient. This directly determines your backup frequency and architecture.</p>
</li>
</ul>
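<p>To make the link between RPO and backup frequency concrete, here is a hedged sketch of a one-hour RPO expressed as an AWS Backup plan in Terraform. The variable names, vault name, and plan name are assumptions for illustration, and the IAM role that AWS Backup assumes is expected to exist already:</p>
<pre><code class="language-hcl">variable "database_arn" {
  type        = string
  description = "ARN of the production database to protect (assumed input)"
}

variable "backup_role_arn" {
  type        = string
  description = "IAM role that AWS Backup assumes (assumed to exist already)"
}

resource "aws_backup_vault" "prod" {
  name = "prod-backups"
}

# One-hour RPO: take a backup at the top of every hour.
resource "aws_backup_plan" "hourly" {
  name = "prod-hourly"

  rule {
    rule_name         = "hourly"
    target_vault_name = aws_backup_vault.prod.name
    schedule          = "cron(0 * * * ? *)" # every hour, UTC

    lifecycle {
      delete_after = 7 # days to keep each recovery point
    }
  }
}

# Attach the database to the plan.
resource "aws_backup_selection" "prod_db" {
  name         = "prod-database"
  plan_id      = aws_backup_plan.hourly.id
  iam_role_arn = var.backup_role_arn
  resources    = [var.database_arn]
}
</code></pre>
<p>If the business later tightens the RPO, the change is a one-line edit to <code>schedule</code> (or a move to continuous backups), which is easier to review than a console setting.</p>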
<p><strong>Enable RDS Multi-AZ for all production databases:</strong></p>
<pre><code class="language-hcl"># Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
 
  # Multi-AZ: automatic failover to standby in a different AZ
  # No data loss. Automatic failover in ~60-120 seconds.
  multi_az = true
 
  # Encryption at rest — non-negotiable
  storage_encrypted = true
 
  # Automated backups with 7-day retention
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable deletion protection in production
  deletion_protection = true
 
  tags = {
    Environment = "production"
  }
}
</code></pre>
<p><strong>Test your backups on a schedule.</strong> Create a monthly calendar event: "Restore production backup to staging and verify data integrity." An untested backup is not a backup. It is a hope.</p>
<pre><code class="language-bash"># Restore a snapshot to a test instance and verify
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier recovery-test \
  --db-snapshot-identifier rds:prod-postgres-2025-01-15 \
  --db-instance-class db.t3.medium \
  --no-multi-az
 
# Connect and verify row counts
psql -h recovery-test.xxxx.rds.amazonaws.com -U admin -d mydb \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"
</code></pre>
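<p>If you would rather keep the restore test in Terraform, a throwaway instance can be restored from the automated backups and destroyed afterwards. This is a sketch under assumptions: it restores the latest restorable point of <code>prod-postgres</code> rather than a named snapshot, and the label <code>recovery_test</code> is illustrative:</p>
<pre><code class="language-hcl"># Disposable restore of the production database for backup verification.
resource "aws_db_instance" "recovery_test" {
  identifier     = "recovery-test"
  instance_class = "db.t3.medium"
  multi_az       = false

  # Restore from automated backups to the most recent restorable time.
  restore_to_point_in_time {
    source_db_instance_identifier = "prod-postgres"
    use_latest_restorable_time    = true
  }

  # Throwaway instance: allow a clean destroy after verification.
  skip_final_snapshot = true
  deletion_protection = false
}
</code></pre>
<p>Run <code>terraform apply</code>, verify the row counts as shown above, then run <code>terraform destroy</code> so the test instance does not keep accruing cost.</p>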
<p>For official guidance on RDS backup and restore, refer to the <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html">AWS RDS Backup and Restore documentation</a>.</p>
<h2 id="heading-mistake-9-no-documentation-or-runbooks">Mistake 9: No Documentation or Runbooks</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>The startup's most experienced DevOps engineer takes two weeks of vacation. On day three of their holiday, the staging environment goes down. Nobody else knows how it was built: the engineer set it up manually over six months with no documentation, no Terraform, and no notes. The team spends four days trying to reconstruct the environment from memory and guesswork. The engineer gets messages on their vacation every day. When they return, they rebuild the environment in four hours.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Undocumented infrastructure creates single points of failure not in your systems, but in your team. It makes onboarding new engineers take weeks instead of hours. It makes incident response depend on specific people being available. When one of those people leaves the company, the knowledge walks out with them.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Documentation for an engineering team means three specific things:</p>
<ol>
<li><p><strong>Infrastructure as Code is the highest form of documentation.</strong> The Terraform that defines your infrastructure IS the documentation for what exists and how it is configured. If something is not in code, it should not exist in production.</p>
</li>
<li><p><strong>A runbook for every operational task.</strong> A runbook is a step-by-step procedure written well enough that someone in their first week at the company can follow it during an incident:</p>
</li>
</ol>
<pre><code class="language-markdown"># Runbook: Production Database Connection Exhaustion
 
## Symptoms
- Application logs: "too many connections" errors
- 500 error rate spike on database-dependent endpoints
- pg_stat_activity shows max connections reached
 
## Diagnosis
# Check current connection count
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT COUNT(*) FROM pg_stat_activity;"
 
# See connections by application
psql -h "$DB_HOST" -U "$DB_USER" \
  -c "SELECT application_name, COUNT(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

## Resolution
1. Identify and restart the service causing the connection leak
2. If immediate relief needed: kill idle connections older than 10 minutes
3. Long-term: review connection pool settings in application config

## Escalation
If unresolved in 30 minutes: page the on-call backend engineer.
</code></pre>
<ol start="3">
<li><strong>An architecture README in every repository.</strong> Every engineer who clones your repository should be able to understand what it does, how to run it locally, how to deploy it, and what it depends on without asking anyone.</li>
</ol>
<h2 id="heading-mistake-10-solving-technical-problems-without-understanding-the-business">Mistake 10: Solving Technical Problems Without Understanding the Business</h2>
<h3 id="heading-the-scenario">The Scenario</h3>
<p>A startup is experiencing slow page loads. A DevOps engineer decides to solve it by migrating to Kubernetes with horizontal pod auto-scaling. The migration takes six weeks. Page loads improve slightly. But 80% of the slowness was caused by unoptimized database queries that had nothing to do with the infrastructure layer. The six-week migration solved 20% of the problem.</p>
<h3 id="heading-the-business-impact">The Business Impact</h3>
<p>Technical solutions to misdiagnosed problems are extraordinarily expensive. Every hour spent building the wrong solution is an hour not spent on the right one. Infrastructure is a tool for delivering business outcomes, not an end in itself.</p>
<h3 id="heading-the-fix">The Fix</h3>
<p>Before making any infrastructure decision, answer these four questions:</p>
<ol>
<li><p><strong>What is the actual, measured bottleneck?</strong> Instrument before you act. The bottleneck is almost never where you assumed it was.</p>
</li>
<li><p><strong>What does success look like, and how will you measure it?</strong> "Pages are faster" is not measurable. "p95 page load time drops below 1.2 seconds" is measurable.</p>
</li>
<li><p><strong>What is the full cost of this solution?</strong> Time to implement, ongoing operational burden, team learning curve. Is this cost justified by the measured impact?</p>
</li>
<li><p><strong>Can a simpler solution solve 80% of the problem in 20% of the time?</strong></p>
</li>
</ol>
<p>Always profile and measure before you rebuild:</p>
<pre><code class="language-bash"># Check slow queries in PostgreSQL before any infrastructure changes
psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 10;
"
</code></pre>
<p>Nine times out of ten, slow applications have slow queries, missing indexes, or an N+1 query problem, none of which require a new infrastructure layer to fix.</p>
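<p>Measurement can also be switched on at the infrastructure level. As a hedged sketch, the following Terraform parameter group makes PostgreSQL log every statement slower than a threshold. The group name and the 500 ms threshold are assumptions to tune for your workload, and the family matches the PostgreSQL 15 engine used earlier in this article:</p>
<pre><code class="language-hcl"># Log any statement that runs longer than 500 ms.
resource "aws_db_parameter_group" "slow_query_logging" {
  name   = "prod-postgres15-slow-queries"
  family = "postgres15"

  parameter {
    name  = "log_min_duration_statement"
    value = "500" # milliseconds
  }
}
</code></pre>
<p>Attach it by setting <code>parameter_group_name</code> on the database instance. A week of slow-query logs is usually a far cheaper input to a decision than a six-week migration.</p>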
<h2 id="heading-the-system-thinking-framework-every-devops-engineer-needs">The System Thinking Framework Every DevOps Engineer Needs</h2>
<p>Most of the mistakes above share a common root cause: the engineer was thinking about one component in isolation instead of the full system.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/b33035a6-448f-419b-b293-206b7b775594.jpg" alt="A diagram showing a request flowing through a full system: user → CDN → load balancer → application servers → cache → database → logs/monitoring" style="display:block;margin:0 auto" width="544" height="650" loading="lazy">

<p>A system thinker asks six questions before making any change in production:</p>
<table>
<thead>
<tr>
<th>Question</th>
<th>Why You Ask It</th>
</tr>
</thead>
<tbody><tr>
<td><strong>What does this change?</strong></td>
<td>List every configuration, file, or service that will be different.</td>
</tr>
<tr>
<td><strong>What does this depend on?</strong></td>
<td>What must be true upstream for this component to work correctly?</td>
</tr>
<tr>
<td><strong>What depends on this?</strong></td>
<td>What downstream systems are affected if this changes or fails?</td>
</tr>
<tr>
<td><strong>What is the failure mode?</strong></td>
<td>Does this fail loudly (500 errors) or silently (wrong data)?</td>
</tr>
<tr>
<td><strong>What is the rollback path?</strong></td>
<td>How do you reverse this in under five minutes?</td>
</tr>
<tr>
<td><strong>What does healthy look like after the change?</strong></td>
<td>What metrics confirm everything is working correctly?</td>
</tr>
</tbody></table>
<p>This is not a checklist you run through slowly. It is a thinking habit that becomes automatic with practice. Senior engineers do not spend more time on deployments than junior engineers do. They spend their time on different things, and this is one of them.</p>
<h2 id="heading-your-production-readiness-checklist">Your Production Readiness Checklist</h2>
<p>Use this checklist before any production system goes live. Mark each item as done, in progress, or not yet started.</p>
<h3 id="heading-infrastructure">Infrastructure</h3>
<ul>
<li><p>Infrastructure is defined as code (Terraform or CloudFormation) and version-controlled in Git</p>
</li>
<li><p>Separate dev, staging, and production environments exist with separate credentials</p>
</li>
<li><p>All production changes go through an automated CI/CD pipeline, no manual SSH deployments</p>
</li>
<li><p>You can rebuild the entire production environment from code in under two hours</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><p>No secrets, credentials, or API keys exist in any Git repository</p>
</li>
<li><p>All production secrets are in Secrets Manager or SSM Parameter Store</p>
</li>
<li><p>All IAM roles follow the principle of least privilege</p>
</li>
<li><p>S3 buckets have public access blocked by default</p>
</li>
<li><p>Port 22 is not open to <code>0.0.0.0/0</code> on any security group</p>
</li>
<li><p>CloudTrail is enabled in all regions</p>
</li>
<li><p>All IAM users have MFA enabled</p>
</li>
<li><p>AWS Security Hub is enabled and findings are reviewed weekly</p>
</li>
</ul>
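<p>Two of the items above can be enforced in code rather than checked by hand. The following Terraform is a minimal sketch: the variable names are hypothetical, and the SSH rule assumes you have a known office or VPN CIDR range to allow instead of <code>0.0.0.0/0</code>:</p>
<pre><code class="language-hcl">variable "admin_security_group_id" {
  type        = string
  description = "Security group that guards SSH access (assumed input)"
}

variable "office_cidr" {
  type        = string
  description = "Trusted CIDR range, for example an office or VPN (assumed input)"
}

# Block public S3 access for every bucket in the account.
resource "aws_s3_account_public_access_block" "account" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Allow SSH only from the trusted range, never from 0.0.0.0/0.
resource "aws_vpc_security_group_ingress_rule" "ssh_from_office" {
  security_group_id = var.admin_security_group_id
  cidr_ipv4         = var.office_cidr
  ip_protocol       = "tcp"
  from_port         = 22
  to_port           = 22
}
</code></pre>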
<h3 id="heading-observability">Observability</h3>
<ul>
<li><p>Every service has a <code>/health</code> endpoint that monitoring checks continuously</p>
</li>
<li><p>Alerts fire within five minutes of a production error rate spike</p>
</li>
<li><p>Dashboards exist showing latency, error rate, and resource utilization</p>
</li>
<li><p>Logs are centralized and searchable, not scattered across individual servers</p>
</li>
</ul>
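<p>As an illustration of the five-minute alerting item, here is a hedged Terraform sketch of a CloudWatch alarm on load balancer 5xx responses. It assumes the service sits behind an Application Load Balancer and that an SNS topic for alerts already exists. The variable names and the threshold of 50 errors per minute are placeholders:</p>
<pre><code class="language-hcl">variable "alb_arn_suffix" {
  type        = string
  description = "ARN suffix of the Application Load Balancer (assumed input)"
}

variable "alerts_topic_arn" {
  type        = string
  description = "SNS topic that notifies the on-call engineer (assumed to exist)"
}

# Fire when target 5xx responses stay above the threshold
# for five consecutive one-minute periods.
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "prod-alb-5xx-spike"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 50
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  alarm_actions = [var.alerts_topic_arn]
}
</code></pre>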
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><p>Production database has Multi-AZ enabled</p>
</li>
<li><p>Backup restoration has been tested in the last 30 days</p>
</li>
<li><p>Written runbooks exist for the three most likely failure scenarios</p>
</li>
<li><p>RTO and RPO requirements are documented and the architecture meets them</p>
</li>
</ul>
<h3 id="heading-documentation">Documentation</h3>
<ul>
<li><p>Every repository has a README explaining what it does and how to deploy it</p>
</li>
<li><p>A new engineer could understand the production architecture from documentation alone</p>
</li>
<li><p>No single engineer holds critical knowledge that lives only in their head</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>None of the mistakes in this article require rare misfortune to experience. They are the predictable result of decisions that feel reasonable under startup pressure but accumulate into real operational risk over time.</p>
<p>The good news is that every single one of them is preventable with the right awareness and the right habits applied early.</p>
<p>You do not need a perfect infrastructure from day one. You need a correct one: version-controlled, automated, observable, secure, and documented. Start with that foundation. Add complexity only when a specific, measured problem requires it. Always connect technical decisions to business outcomes.</p>
<p>The goal of DevOps in a startup is not to build impressive infrastructure. It is to build reliable systems that support product growth safely, efficiently, and sustainably, and to make sure that when something does break, you can recover faster than anyone notices.</p>
<h2 id="heading-want-to-go-deeper">Want to Go Deeper?</h2>
<p>If this article resonated with you, <a href="https://coachli.co/tolani-akintayo/PR-H4oQS"><strong>The Startup DevOps Field Guide</strong></a> covers these principles in full depth with complete infrastructure blueprints, security frameworks, CI/CD pipeline templates, and the end-to-end decision-making playbook for engineers building DevOps practices in startup environments from scratch.</p>
<p>It is written specifically for the engineer who wants to do this right from the beginning, not the one rebuilding everything after the first major incident.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Migrate to S3 Native State Locking in Terraform ]]>
                </title>
                <description>
                    <![CDATA[ If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them togethe ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-migrate-to-s3-native-state-locking-in-terraform/</link>
                <guid isPermaLink="false">69fd19239f93a850a430069b</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Infrastructure as code ]]>
                    </category>
                
                    <category>
                        <![CDATA[ S3 ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 07 May 2026 22:58:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/9619ad45-15c5-4be7-9221-ed4b76bc2b24.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.</p>
<p>But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.</p>
<p>The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.</p>
<p>That changed when Amazon S3 added conditional writes in 2024. Terraform 1.10, released in November 2024, builds on them to offer S3-native state locking, meaning <strong>DynamoDB is no longer required for state locking</strong>. The feature shipped as experimental in Terraform 1.10 and is now generally available.</p>
<p>In this tutorial, you'll learn:</p>
<ul>
<li><p>What S3 native locking is and how it works</p>
</li>
<li><p>How to set it up from scratch if you're starting a new project</p>
</li>
<li><p>How to migrate an existing S3 + DynamoDB setup to S3 native locking safely</p>
</li>
<li><p>How to verify locking is working and handle edge cases</p>
</li>
</ul>
<p>By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-terraform-state-locking">What Is Terraform State Locking?</a></p>
</li>
<li><p><a href="#heading-what-is-s3-native-state-locking">What Is S3 Native State Locking?</a></p>
</li>
<li><p><a href="#heading-how-s3-native-locking-compares-to-the-s3-dynamodb-approach">How S3 Native Locking Compares to the S3 + DynamoDB Approach</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-part-1-fresh-setup-how-to-configure-s3-native-locking-from-scratch">Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch</a></p>
<ul>
<li><p><a href="#heading-step-1-create-the-s3-bucket-with-versioning-and-encryption">Step 1: Create the S3 Bucket with Versioning and Encryption</a></p>
</li>
<li><p><a href="#heading-step-2-configure-the-terraform-backend-with-native-locking">Step 2: Configure the Terraform Backend with Native Locking</a></p>
</li>
<li><p><a href="#heading-step-3-initialize-and-verify">Step 3: Initialize and Verify</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-part-2-migration-how-to-move-from-s3-dynamodb-to-s3-native-locking">Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking</a></p>
<ul>
<li><p><a href="#heading-step-1-verify-your-current-setup">Step 1: Verify Your Current Setup</a></p>
</li>
<li><p><a href="#heading-step-2-enable-object-lock-on-the-existing-s3-bucket">Step 2: Enable Object Lock on the Existing S3 Bucket</a></p>
</li>
<li><p><a href="#heading-step-3-update-the-terraform-backend-configuration">Step 3: Update the Terraform Backend Configuration</a></p>
</li>
<li><p><a href="#heading-step-4-reinitialize-terraform">Step 4: Reinitialize Terraform</a></p>
</li>
<li><p><a href="#heading-step-5-verify-the-migration">Step 5: Verify the Migration</a></p>
</li>
<li><p><a href="#heading-step-6-clean-up-the-dynamodb-table">Step 6: Clean Up the DynamoDB Table</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-verify-that-locking-is-working">How to Verify That Locking Is Working</a></p>
</li>
<li><p><a href="#heading-how-to-handle-a-stuck-lock">How to Handle a Stuck Lock</a></p>
</li>
<li><p><a href="#heading-rollback-plan-if-something-goes-wrong">Rollback Plan: If Something Goes Wrong</a></p>
</li>
<li><p><a href="#heading-security-best-practices-for-your-state-bucket">Security Best Practices for Your State Bucket</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-what-is-terraform-state-locking">What Is Terraform State Locking?</h2>
<p>Before looking at the new approach, it helps to understand what state locking is solving.</p>
<p>Terraform stores everything it knows about your infrastructure in a <strong>state file</strong> – a JSON document that maps your configuration to real AWS resources. When you run <code>terraform apply</code>, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.</p>
<p>The problem arises when two engineers, or two CI/CD pipelines, try to apply changes at the same time. If both read the state file simultaneously, calculate changes independently, and then write back, you get a <strong>race condition</strong>. The second write overwrites the changes from the first, and your state is now out of sync with reality. This can leave resources untracked, duplicated, or unexpectedly destroyed.</p>
<p><strong>State locking</strong> solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.</p>
<pre><code class="language-plaintext">Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ----------&gt;|                            |
   |                                   |                            |
   |&lt;------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |&lt;------ 3. Acquire Lock ----|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |------- 4. Lock Denied ----&gt;|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ----------&gt;|                            |
   |                                   |                            |
   |           [COMPLETED]             |------- 6. Lock Granted ---&gt;|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |
   |                                   |                            |
</code></pre>
<h2 id="heading-what-is-s3-native-state-locking">What Is S3 Native State Locking?</h2>
<p>Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a <code>LockID</code> primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.</p>
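<p>For reference, the DynamoDB-based setup described above typically looks like the following sketch. The table name <code>terraform-state-locks</code> is a placeholder, and in practice the table usually lives in a separate bootstrap configuration rather than next to the backend block that depends on it:</p>
<pre><code class="language-hcl"># Legacy backend: state in S3, lock records in DynamoDB.
terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

# Lock table: Terraform expects a string hash key named "LockID".
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
</code></pre>
<p>This second resource, and the IAM permissions that go with it, are what the native approach removes.</p>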
<p>S3 native locking keeps the lock in S3 itself. The lock is acquired with S3 conditional writes, described in the steps below. <strong>S3 Object Lock</strong>, which this tutorial also enables on the state bucket, is a separate S3 feature designed to enforce WORM (Write Once, Read Many) retention for regulatory requirements.</p>
<p>When S3 native locking is enabled in your Terraform backend:</p>
<ol>
<li><p>Terraform writes your state to a <code>.tfstate</code> object in S3 (as before)</p>
</li>
<li><p>To acquire a lock, Terraform uses <strong>S3's conditional write operations</strong>, specifically the <code>if-none-match</code> conditional header, to atomically create a lock file (the state key with a <code>.tflock</code> suffix)</p>
</li>
<li><p>If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held</p>
</li>
<li><p>When the operation completes, Terraform deletes the lock file to release the lock</p>
</li>
</ol>
<p>The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.</p>
<p><strong>Note:</strong> This feature requires Terraform version <strong>1.10.0 or later</strong>. This tutorial also uses an S3 bucket with <strong>Object Lock enabled</strong>. Terraform's S3 backend documentation describes <code>use_lockfile</code> in terms of conditional writes and S3 permissions on the lock file, so treat Object Lock as a choice made in this tutorial rather than a documented prerequisite. Object Lock is simplest to enable at bucket creation time; the path for existing buckets is covered in Part 2.</p>
<h2 id="heading-how-s3-native-locking-compares-to-the-s3-dynamodb-approach">How S3 Native Locking Compares to the S3 + DynamoDB Approach</h2>
<table>
<thead>
<tr>
<th><strong>Aspect</strong></th>
<th><strong>S3 + DynamoDB (Old)</strong></th>
<th><strong>S3 Native Locking (New)</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>AWS services required</strong></td>
<td>S3 + DynamoDB</td>
<td>S3 only</td>
</tr>
<tr>
<td><strong>IAM permissions needed</strong></td>
<td>S3 + DynamoDB permissions</td>
<td>S3 permissions only</td>
</tr>
<tr>
<td><strong>Terraform version</strong></td>
<td>Any</td>
<td>1.10.0 or later</td>
</tr>
<tr>
<td><strong>Setup complexity</strong></td>
<td>Two resources, two IAM scopes</td>
<td>One resource</td>
</tr>
<tr>
<td><strong>Stuck lock resolution</strong></td>
<td>Delete DynamoDB record</td>
<td>Delete S3 lock file</td>
</tr>
<tr>
<td><strong>Cost</strong></td>
<td>S3 storage + DynamoDB on-demand</td>
<td>S3 storage only</td>
</tr>
<tr>
<td><strong>Object Lock on the bucket</strong></td>
<td>Not used</td>
<td>Enabled in this tutorial; the lock itself uses conditional writes</td>
</tr>
<tr>
<td><strong>Locking mechanism</strong></td>
<td>DynamoDB conditional writes</td>
<td>S3 conditional writes (<code>if-none-match</code>)</td>
</tr>
<tr>
<td><strong>State versioning</strong></td>
<td>S3 Versioning (recommended)</td>
<td>S3 Versioning (required for full safety)</td>
</tr>
</tbody></table>
<p>The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following in place:</p>
<ul>
<li><strong>Terraform 1.10.0 or later</strong> installed. Check your version:</li>
</ul>
<pre><code class="language-shell">terraform version
</code></pre>
<p>If you need to upgrade, follow the <a href="https://developer.hashicorp.com/terraform/install">official upgrade guide</a>.</p>
<ul>
<li><strong>AWS CLI</strong> installed and configured with credentials that have permission to create and manage S3 buckets.</li>
</ul>
<pre><code class="language-shell">aws --version
aws sts get-caller-identity   # confirm you're authenticated
</code></pre>
<ul>
<li><p><strong>IAM permissions</strong> to perform the following S3 actions:</p>
<ul>
<li><p><code>s3:CreateBucket</code></p>
</li>
<li><p><code>s3:PutBucketVersioning</code></p>
</li>
<li><p><code>s3:PutBucketEncryption</code></p>
</li>
<li><p><code>s3:PutObjectLegalHold</code></p>
</li>
<li><p><code>s3:PutObjectRetention</code></p>
</li>
<li><p><code>s3:GetObject</code></p>
</li>
<li><p><code>s3:PutObject</code></p>
</li>
<li><p><code>s3:DeleteObject</code></p>
</li>
<li><p><code>s3:ListBucket</code></p>
</li>
</ul>
</li>
<li><p>For the <strong>migration path</strong>: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.</p>
</li>
</ul>
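<p>The permissions above cover one-time bucket setup. For day-to-day runs, the identity that executes Terraform needs far less. The following is a minimal sketch of a least-privilege policy for one state file and its lock file, based on the bucket name and state key used later in this tutorial. The data source label <code>terraform_state</code> is illustrative:</p>
<pre><code class="language-hcl">data "aws_iam_policy_document" "terraform_state" {
  # List the bucket so Terraform can find the state object.
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::your-project-terraform-state"]
  }

  # Read and write the state file itself.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::your-project-terraform-state/production/terraform.tfstate"]
  }

  # Create, read, and delete the lock file (state key plus ".tflock").
  statement {
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::your-project-terraform-state/production/terraform.tfstate.tflock"]
  }
}
</code></pre>
<p>Scoping the statements to a single key means a pipeline for one environment cannot read or lock another environment's state.</p>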
<h2 id="heading-part-1-fresh-setup-how-to-configure-s3-native-locking-from-scratch">Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch</h2>
<p>Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.</p>
<h3 id="heading-step-1-create-the-s3-bucket-with-versioning-and-encryption">Step 1: Create the S3 Bucket with Versioning and Encryption</h3>
<p>Object Lock <strong>must be enabled at bucket creation time</strong>. You can't add it afterward through the standard console flow. Create the bucket using the AWS CLI with Object Lock enabled:</p>
<pre><code class="language-shell">aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket
</code></pre>
<p><strong>Note:</strong> For regions other than <code>us-east-1</code>, add the <code>--create-bucket-configuration</code> flag.</p>
<pre><code class="language-shell">aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
</code></pre>
<p>Now enable versioning on the bucket. Versioning is required alongside Object Lock and lets you restore a previous state version if something goes wrong:</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled
</code></pre>
<p>Enable server-side encryption so your state files are encrypted at rest:</p>
<pre><code class="language-shell">aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'
</code></pre>
<p>Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:</p>
<pre><code class="language-shell">aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<p>Verify the bucket configuration:</p>
<pre><code class="language-shell"># Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state
</code></pre>
<p>Expected output for the Object Lock check:</p>
<pre><code class="language-json">{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/2b2e56cf-687f-4932-a61e-ed7cc33ea6f1.png" alt="Terminal showing AWS CLI verification commands confirming S3 bucket is configured correctly with Object Lock, versioning, and encryption enabled" style="display:block;margin:0 auto" width="1120" height="616" loading="lazy">
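<p>The checks above don't cover the public access block. Here's a minimal sketch of that fourth check, assuming the same placeholder bucket name. <code>public_access_fully_blocked</code> is a hypothetical helper name, not part of the AWS CLI:</p>
<pre><code class="language-shell"># Hypothetical helper: succeeds only if all four public access block flags are on.
public_access_fully_blocked() {
  bucket="$1"
  # With --output text, the CLI prints the four booleans as True/False.
  flags="$(aws s3api get-public-access-block \
    --bucket "$bucket" \
    --query 'PublicAccessBlockConfiguration.[BlockPublicAcls,IgnorePublicAcls,BlockPublicPolicy,RestrictPublicBuckets]' \
    --output text)"
  case "$flags" in
    ""|*False*) echo "Public access is NOT fully blocked on $bucket"; return 1 ;;
    *)          echo "Public access is fully blocked on $bucket" ;;
  esac
}
</code></pre>
<p>Call it as <code>public_access_fully_blocked your-project-terraform-state</code>. A non-zero exit status means at least one flag is off or no configuration was returned.</p>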

<h3 id="heading-step-2-configure-the-terraform-backend-with-native-locking">Step 2: Configure the Terraform Backend with Native Locking</h3>
<p>In your Terraform project, create or update your <code>backend.tf</code> file:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0+ and a bucket with Object Lock enabled
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}
</code></pre>
<p>The critical difference from the old configuration is the <code>use_lockfile = true</code> parameter. Notice what is <strong>absent</strong>: there's no <code>dynamodb_table</code> argument. No DynamoDB table. No second service.</p>
<p>Here's a direct comparison of the old and new configurations:</p>
<p><strong>Old configuration (S3 + DynamoDB):</strong></p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}
</code></pre>
<p><strong>New configuration (S3 native locking):</strong></p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}
</code></pre>
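<p>Because <code>use_lockfile</code> is only recognized by Terraform 1.10.0 and later, it can help to check the installed version before running <code>terraform init</code>. The following is a rough sketch rather than an official check: <code>terraform_supports_lockfile</code> is a hypothetical helper, and it assumes a <code>sort</code> that supports <code>-V</code> (version sort), as GNU coreutils does:</p>
<pre><code class="language-shell"># Hypothetical helper: succeeds if the installed Terraform is 1.10.0 or newer.
terraform_supports_lockfile() {
  required="1.10.0"
  # "terraform version" prints a first line such as "Terraform v1.10.0".
  current="$(terraform version | head -n 1 | sed 's/^Terraform v//')"
  # Version-sort both values; the required version must sort first (or tie).
  lowest="$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)"
  [ "$lowest" = "$required" ]
}
</code></pre>
<p>One way to use it is as a guard in a setup script: <code>terraform_supports_lockfile || echo "Upgrade Terraform to 1.10.0 or later"</code>.</p>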
<h3 id="heading-step-3-initialize-and-verify">Step 3: Initialize and Verify</h3>
<p>Run <code>terraform init</code> to initialize the backend:</p>
<pre><code class="language-shell">terraform init
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!
</code></pre>
<p>Run a plan to confirm everything is working end-to-end:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>Terraform acquires the lock before the plan output appears. If you list the S3 bucket while the operation is running, you'll see a <code>.tflock</code> object next to your state file. Terraform creates it with an S3 conditional write, so a second writer is rejected while it exists, and deletes it when the operation completes.</p>
<h2 id="heading-part-2-migration-how-to-move-from-s3-dynamodb-to-s3-native-locking">Part 2: Migration&nbsp;– How to Move from S3 + DynamoDB to S3 Native Locking</h2>
<p>Follow this section if you have an <strong>existing Terraform setup</strong> using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.</p>
<p><strong>Important:</strong> Migration requires a maintenance window or at minimum a period where no Terraform operations are running. You're changing the backend configuration, which means <strong>all team members and CI/CD pipelines must stop running</strong> <code>terraform plan</code> <strong>or</strong> <code>terraform apply</code> <strong>during the migration</strong>. The migration itself takes under 10 minutes.</p>
<h3 id="heading-step-1-verify-your-current-setup">Step 1: Verify Your Current Setup</h3>
<p>Before making any changes, document your existing backend configuration and confirm the state file is accessible:</p>
<pre><code class="language-shell"># Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'
</code></pre>
<p>Check your current <code>backend.tf</code> and note the exact values:</p>
<pre><code class="language-hcl"># Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}
</code></pre>
<p>Run one final plan to confirm the current state is clean and there are no unexpected changes pending:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>If the plan shows no changes, you're in a safe state to proceed.</p>
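<p>If you'd rather have a script decide whether the plan is clean, the <code>-detailed-exitcode</code> flag makes that checkable: <code>terraform plan</code> exits with 0 when there are no changes, 2 when changes are pending, and 1 on error. A small sketch, where <code>plan_is_clean</code> is a hypothetical wrapper name:</p>
<pre><code class="language-shell"># Hypothetical wrapper: reports whether the current plan is empty.
plan_is_clean() {
  terraform plan -detailed-exitcode -input=false
  status=$?
  case "$status" in
    0) echo "No changes pending; safe to proceed." ;;
    2) echo "Plan shows pending changes; review them before migrating." ;;
    *) echo "terraform plan failed with exit code $status." ;;
  esac
  # Pass Terraform's exit code through so callers can branch on it.
  return "$status"
}
</code></pre>
<p>Recording this result before the migration also gives you a baseline to compare against in Step 5.</p>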
<h3 id="heading-step-2-enable-object-lock-on-the-existing-s3-bucket">Step 2: Enable Object Lock on the Existing S3 Bucket</h3>
<p>This is the most important step in the migration. Object Lock was originally a setting that could only be configured at bucket creation time.</p>
<p>AWS now supports enabling Object Lock on an existing bucket with the <code>PutObjectLockConfiguration</code> API, provided versioning is already enabled on that bucket. If you aren't sure about versioning, run the versioning check later in this step first.</p>
<p>Run the following AWS CLI command to enable Object Lock on your <strong>existing</strong> bucket:</p>
<pre><code class="language-bash">aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'
</code></pre>
<p><strong>Note:</strong> This command enables Object Lock <strong>without a default retention rule</strong>, so no retention mode or retention period is applied automatically to new objects. It turns on the locking capability only. This is what Terraform's native locking needs: the ability to create and delete lock files, not permanent object retention.</p>
<p>Verify Object Lock is now enabled:</p>
<pre><code class="language-shell">aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket
</code></pre>
<p>Expected output:</p>
<pre><code class="language-json">{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
</code></pre>
<p>Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):</p>
<pre><code class="language-shell">aws s3api get-bucket-versioning \
  --bucket your-existing-bucket
</code></pre>
<p>Expected output:</p>
<pre><code class="language-json">{
    "Status": "Enabled"
}
</code></pre>
<p>If versioning isn't enabled, enable it before proceeding:</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/cd17df01-3d0a-4f93-9250-3f51627e91c8.png" alt="Terminal output showing successful Object Lock enablement on an existing S3 bucket using the AWS CLI" style="display:block;margin:0 auto" width="1204" height="320" loading="lazy">
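<p>Since versioning has to be in place before Object Lock can be enabled on an existing bucket, the checks in this step can be folded into one ordered sequence. This is a sketch under that assumption; <code>enable_object_lock_on_existing_bucket</code> is a hypothetical helper that only wraps the commands shown above:</p>
<pre><code class="language-shell"># Hypothetical helper: enable versioning if needed, then enable Object Lock.
enable_object_lock_on_existing_bucket() {
  bucket="$1"
  # Prints "Enabled", "Suspended", or "None" (versioning never configured).
  versioning="$(aws s3api get-bucket-versioning \
    --bucket "$bucket" --query 'Status' --output text)"
  if [ "$versioning" != "Enabled" ]; then
    echo "Versioning is '$versioning' on $bucket; enabling it first."
    aws s3api put-bucket-versioning \
      --bucket "$bucket" \
      --versioning-configuration Status=Enabled || return 1
  fi
  aws s3api put-object-lock-configuration \
    --bucket "$bucket" \
    --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'
}
</code></pre>
<p>Run it as <code>enable_object_lock_on_existing_bucket your-existing-bucket</code>, then repeat the <code>get-object-lock-configuration</code> check to confirm the result.</p>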

<h3 id="heading-step-3-update-the-terraform-backend-configuration">Step 3: Update the Terraform Backend Configuration</h3>
<p>Update your <code>backend.tf</code> to remove the <code>dynamodb_table</code> argument and add <code>use_lockfile = true</code>:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket = "your-existing-bucket"
    key    = "path/to/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
 
    # Add this:
    use_lockfile = true
 
    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}
</code></pre>
<p>Your updated <code>backend.tf</code> should look like this:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}
</code></pre>
<h3 id="heading-step-4-reinitialize-terraform">Step 4: Reinitialize Terraform</h3>
<p>Run <code>terraform init</code> with the <code>-reconfigure</code> flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):</p>
<pre><code class="language-shell">terraform init -reconfigure
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!
</code></pre>
<p><strong>If you see an error here:</strong> First check that you're running Terraform 1.10.0 or later, because older versions reject the <code>use_lockfile</code> argument, and that your credentials can read the bucket. Then re-run the verification from Step 2 before proceeding.</p>
<h3 id="heading-step-5-verify-the-migration">Step 5: Verify the Migration</h3>
<p>Run a plan to confirm Terraform is working correctly with the new backend configuration:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>The plan should:</p>
<ul>
<li><p>Complete successfully</p>
</li>
<li><p>Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)</p>
</li>
<li><p>NOT mention DynamoDB anywhere in its output</p>
</li>
</ul>
<p>To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:</p>
<pre><code class="language-plaintext">╷
│ Error: Error acquiring the state lock
│
│ Error message: operation error S3: PutObject, https response error StatusCode: 409,
│ RequestID: ..., api error Conflict: Object lock already exists for this key.
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵
</code></pre>
<p>The <code>Path</code> field shows <code>.tfstate.tflock</code>, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/e9abb703-af6e-429c-83bb-2ea2dac43a3a.png" alt="Two terminals showing concurrent terraform plan commands, the second one displays a lock error confirming S3 native locking is working" style="display:block;margin:0 auto" width="1264" height="539" loading="lazy">
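<p>Failing immediately is Terraform's default when the lock is held. In CI, where two pipeline runs can overlap legitimately, you may prefer the second run to wait. The <code>-lock-timeout</code> flag makes Terraform keep retrying for the given duration before giving up. In this sketch, <code>plan_with_lock_wait</code> is a hypothetical wrapper and the 60-second default is an arbitrary choice:</p>
<pre><code class="language-shell"># Hypothetical wrapper: wait for the state lock instead of failing at once.
plan_with_lock_wait() {
  wait_for="${1:-60s}"   # how long to keep retrying, e.g. 60s or 5m
  terraform plan -input=false -lock-timeout="$wait_for"
}
</code></pre>
<p>For example, <code>plan_with_lock_wait 5m</code> retries for up to five minutes. The same flag is accepted by <code>terraform apply</code>.</p>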

<h3 id="heading-step-6-clean-up-the-dynamodb-table">Step 6: Clean Up the DynamoDB Table</h3>
<p>Once you've confirmed the migration is working correctly and your team has run at least one successful <code>plan</code> and <code>apply</code> cycle using the new backend, you can remove the DynamoDB table.</p>
<p><strong>Wait at least 24-48 hours before deleting the DynamoDB table</strong> if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.</p>
<p>When you're ready, delete the DynamoDB table:</p>
<pre><code class="language-shell">aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table
</code></pre>
<p>Confirm the deletion:</p>
<pre><code class="language-shell">aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found
</code></pre>
<p>This error confirms that the table is gone. The migration is complete.</p>
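<p>Table deletion is asynchronous, so a <code>describe-table</code> call made right after <code>delete-table</code> may briefly report a <code>DELETING</code> status instead of the error above. A hedged sketch that uses the AWS CLI's built-in waiter to block until the table is really gone (<code>delete_lock_table</code> is a hypothetical helper name):</p>
<pre><code class="language-shell"># Hypothetical helper: delete the lock table and wait until it no longer exists.
delete_lock_table() {
  table="$1"
  aws dynamodb delete-table --table-name "$table" || return 1
  # Polls DescribeTable until the table can no longer be found.
  aws dynamodb wait table-not-exists --table-name "$table" || return 1
  echo "Table $table has been deleted."
}
</code></pre>
<p>Use this only for tables that were created outside Terraform.</p>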
<p>If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run <code>terraform apply</code> to destroy it via Terraform rather than the CLI directly. This keeps your state clean:</p>
<pre><code class="language-hcl"># Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}
</code></pre>
<p>After removing the block, run:</p>
<pre><code class="language-bash">terraform apply
</code></pre>
<p>Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.</p>
<h2 id="heading-how-to-verify-that-locking-is-working">How to Verify That Locking Is Working</h2>
<p>After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.</p>
<h3 id="heading-method-1-observe-the-lock-file-during-an-operation">Method 1: Observe the lock file during an operation</h3>
<p>In one terminal, start a long-running plan against a configuration with many resources:</p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>While it's running, in a second terminal, check for the lock file in S3:</p>
<pre><code class="language-shell">aws s3 ls s3://your-bucket/path/to/ | grep tflock
</code></pre>
<p>You should see a file like:</p>
<pre><code class="language-plaintext">2026-05-06 14:22:01        512 terraform.tfstate.tflock
</code></pre>
<p>After the plan completes, run the same command again. The <code>.tflock</code> file should be gone.</p>
<h3 id="heading-method-2-read-the-lock-file-contents">Method 2: Read the lock file contents</h3>
<p>While a plan is running, download and read the lock file to see its contents:</p>
<pre><code class="language-shell">aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock &amp;&amp; cat /tmp/current.lock
</code></pre>
<p>Expected output (formatted for readability):</p>
<pre><code class="language-json">{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}
</code></pre>
<p>This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.</p>
<h2 id="heading-how-to-handle-a-stuck-lock">How to Handle a Stuck Lock</h2>
<p>With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the <code>.tflock</code> file from S3.</p>
<p>A lock can get stuck if:</p>
<ul>
<li><p>A <code>terraform apply</code> or <code>plan</code> process was killed mid-execution</p>
</li>
<li><p>A CI/CD pipeline runner crashed during a Terraform operation</p>
</li>
<li><p>A network interruption prevented the lock release from completing</p>
</li>
</ul>
<p>Here's how you can check for a stuck lock:</p>
<pre><code class="language-shell">aws s3 ls s3://your-bucket/path/to/ | grep tflock
</code></pre>
<p>If a <code>.tflock</code> file exists and no Terraform operation is currently running, it is a stuck lock.</p>
<p>You can also read the lock to understand who held it:</p>
<pre><code class="language-shell">aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock &amp;&amp; cat /tmp/stuck.lock
</code></pre>
<p>This tells you who (<code>Who</code> field) was running the operation, what operation it was (<code>Operation</code> field), and when it was acquired (<code>Created</code> field).</p>
<p>And you can force-unlock using Terraform like this:</p>
<pre><code class="language-shell">terraform force-unlock LOCK-ID
</code></pre>
<p>Replace <code>LOCK-ID</code> with the <code>ID</code> value from the lock file contents. For example:</p>
<pre><code class="language-shell">terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890
</code></pre>
<p>Terraform will confirm:</p>
<pre><code class="language-plaintext">Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!
</code></pre>
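<p>Copying the ID by hand is easy to get wrong, so the lookup and the unlock can be chained. The sketch below assumes the lock file has the JSON shape shown earlier. <code>lock_id</code> and <code>force_unlock_from_s3</code> are hypothetical helpers, and the <code>sed</code> expression is a simple pattern match rather than a full JSON parser:</p>
<pre><code class="language-shell"># Hypothetical helper: read lock file JSON on stdin and print its "ID" value.
lock_id() {
  sed -n 's/.*"ID"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: fetch the lock file from S3 and force-unlock with its ID.
force_unlock_from_s3() {
  lock_uri="$1"   # e.g. s3://your-bucket/path/to/terraform.tfstate.tflock
  # "aws s3 cp URI -" streams the object to stdout instead of writing a file.
  id="$(aws s3 cp "$lock_uri" - | lock_id)"
  if [ -z "$id" ]; then
    echo "No lock ID found at $lock_uri"
    return 1
  fi
  terraform force-unlock "$id"
}
</code></pre>
<p>Terraform still asks for confirmation before removing the lock, so the safety prompt shown above stays in place.</p>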
<p>If <code>terraform force-unlock</code> isn't an option (for example, because you're on a CI runner without Terraform installed), you can delete the lock file directly with the AWS CLI:</p>
<pre><code class="language-shell">aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock
</code></pre>
<p><strong>Only delete the lock file if you are certain no Terraform operation is currently running.</strong> Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.</p>
<h2 id="heading-rollback-plan-if-something-goes-wrong">Rollback Plan: If Something Goes Wrong</h2>
<p>If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.</p>
<p><strong>Step 1: Stop all Terraform operations</strong> in your team and CI/CD pipelines.</p>
<p><strong>Step 2: Recreate the DynamoDB table</strong> if you already deleted it:</p>
<pre><code class="language-shell">aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
</code></pre>
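<p>Table creation is asynchronous too, so <code>terraform init</code> can fail if it runs while the table is still being created. A minimal sketch that blocks until the table is usable, where <code>wait_for_lock_table</code> is a hypothetical helper around the CLI's built-in waiter:</p>
<pre><code class="language-shell"># Hypothetical helper: block until the recreated lock table is ACTIVE.
wait_for_lock_table() {
  table="$1"
  # Polls DescribeTable until the table status is ACTIVE.
  aws dynamodb wait table-exists --table-name "$table" || return 1
  echo "Table $table is ready."
}
</code></pre>
<p>Run <code>wait_for_lock_table terraform-state-lock</code> before moving on to the next step.</p>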
<p><strong>Step 3: Revert</strong> <code>backend.tf</code> to the previous configuration:</p>
<pre><code class="language-hcl">terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}
</code></pre>
<p><strong>Step 4: Reinitialize:</strong></p>
<pre><code class="language-shell">terraform init -reconfigure
</code></pre>
<p><strong>Step 5: Verify:</strong></p>
<pre><code class="language-shell">terraform plan
</code></pre>
<p>The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.</p>
<p><strong>Note:</strong> Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock and DynamoDB locking can coexist: Object Lock simply adds a capability to the bucket. Using <code>dynamodb_table</code> in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled on the bucket.</p>
<h2 id="heading-security-best-practices-for-your-state-bucket">Security Best Practices for Your State Bucket</h2>
<p>Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:</p>
<h3 id="heading-enable-versioning-required">Enable Versioning (Required)</h3>
<p>Versioning is required by Object Lock, and it's what makes your state recoverable: if a state file is accidentally overwritten or corrupted, you can restore a previous version.</p>
<pre><code class="language-shell">aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled
</code></pre>
<h3 id="heading-block-all-public-access-non-negotiable">Block All Public Access (Non-Negotiable)</h3>
<p>Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.</p>
<pre><code class="language-shell">aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
</code></pre>
<h3 id="heading-enable-server-side-encryption">Enable Server-Side Encryption</h3>
<p>Always encrypt state files at rest. AES256 (SSE-S3) is the minimum. If your organization requires KMS key management, use SSE-KMS with your own key instead:</p>
<pre><code class="language-shell">aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'
</code></pre>
<h3 id="heading-apply-least-privilege-iam-permissions">Apply Least-Privilege IAM Permissions</h3>
<p>The role or user that Terraform uses to access the state bucket should have only the permissions it needs. Here's a minimal IAM policy for S3 native locking:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-state-bucket",
        "arn:aws:s3:::your-state-bucket/*"
      ]
    },
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "s3:GetObjectLegalHold",
        "s3:PutObjectLegalHold",
        "s3:GetObjectRetention",
        "s3:PutObjectRetention"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*.tflock"
    }
  ]
}
</code></pre>
<p>Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.</p>
<h3 id="heading-enable-access-logging">Enable Access Logging</h3>
<p>Record access to your state bucket with S3 server access logs or CloudTrail data events. This gives you an audit trail of every time state was read, written, or locked. The command below turns on S3 server access logging:</p>
<pre><code class="language-shell">aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.</p>
<p>Here's a summary of what you accomplished:</p>
<ul>
<li><p>Understood what state locking is and why it's required for safe Terraform operations</p>
</li>
<li><p>Compared S3 native locking to the existing S3 + DynamoDB approach</p>
</li>
<li><p>Set up a fresh Terraform backend using S3 native locking with correct bucket configuration</p>
</li>
<li><p>Migrated an existing backend from S3 + DynamoDB to S3 native locking safely</p>
</li>
<li><p>Learned how to verify locking, handle stuck locks, and roll back if needed</p>
</li>
<li><p>Applied security best practices to the state bucket</p>
</li>
</ul>
<p>This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.</p>
<p><em>If you are building or optimizing cloud infrastructure for a startup and want a complete reference for production-ready Terraform modules, CI/CD pipeline patterns, and infrastructure runbooks, check out</em> <a href="https://coachli.co/tolani-akintayo/PR-H4oQS">The Startup DevOps Field Guide</a><em>. It covers the full lifecycle of AWS infrastructure from initial setup to production reliability.</em></p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a href="https://developer.hashicorp.com/terraform/language/backend/s3#use_lockfile">HashiCorp - S3 Backend Configuration: use_lockfile</a></p>
</li>
<li><p><a href="https://github.com/hashicorp/terraform/releases/tag/v1.10.0">HashiCorp: Terraform 1.10 Release Notes</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html">AWS Docs: S3 Object Lock Overview</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectLockConfiguration.html">AWS Docs: PutObjectLockConfiguration API</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-requests.html">AWS Docs: S3 Conditional Writes</a></p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/language/state/locking">HashiCorp: Backend State Locking</a></p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/cli/commands/force-unlock">HashiCorp: terraform force-unlock Command</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/manage-versioning-examples.html">AWS Docs: Enabling S3 Versioning</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html">AWS Docs: S3 Server-Side Encryption</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Land Your First Cloud or DevOps Role: What Hiring Managers Actually Look For ]]>
                </title>
                <description>
                    <![CDATA[ You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating. And yet  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-land-your-first-cloud-or-devops-role-what-hiring-managers-actually-look-for/</link>
                <guid isPermaLink="false">69f3683c909e64ad07e3b0fc</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Career ]]>
                    </category>
                
                    <category>
                        <![CDATA[ jobs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 14:33:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/374e807b-a67f-4f04-a639-dfa230b0ba5f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating.</p>
<p>And yet the applications go out, and nothing comes back.</p>
<p>This is one of the most frustrating experiences in tech. You're genuinely learning, genuinely putting in the time, and you have nothing to show for it in terms of results. You start to wonder if the market is too competitive, if you need one more certification, or if there's some hidden door everyone else found that you're missing.</p>
<p>The truth is simpler and more actionable than any of that: <strong>hiring managers can't see your YouTube watch history. They can see your GitHub.</strong> Most beginners optimize for learning. Hired candidates optimize for proof.</p>
<p>In this guide, you'll get an honest breakdown of the nine factors hiring managers actually evaluate when they look at a junior cloud or DevOps candidate, plus a concrete 90-day plan to address each one. By the end, you'll know exactly where you stand and exactly what to do next.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-three-patterns-that-keep-beginners-stuck">The Three Patterns That Keep Beginners Stuck</a></p>
<ul>
<li><p><a href="#heading-pattern-1-the-tutorial-loop">Pattern 1: The Tutorial Loop</a></p>
</li>
<li><p><a href="#heading-pattern-2-the-theory-practice-gap">Pattern 2: The Theory-Practice Gap</a></p>
</li>
<li><p><a href="#heading-pattern-3-silent-learning">Pattern 3: Silent Learning</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-hiring-managers-are-actually-evaluating">What Hiring Managers Are Actually Evaluating</a></p>
</li>
<li><p><a href="#heading-factor-1-proof-of-work-the-non-negotiable">Factor 1: Proof of Work (The Non-Negotiable)</a></p>
<ul>
<li><a href="#heading-the-three-projects-that-cover-everything">The Three Projects That Cover Everything</a></li>
</ul>
</li>
<li><p><a href="#heading-factor-2-system-level-thinking">Factor 2: System-Level Thinking</a></p>
</li>
<li><p><a href="#heading-factor-3-software-engineering-fundamentals">Factor 3: Software Engineering Fundamentals</a></p>
</li>
<li><p><a href="#heading-factor-4-communication-skills">Factor 4: Communication Skills</a></p>
</li>
<li><p><a href="#heading-factor-5-consistency-over-intensity">Factor 5: Consistency Over Intensity</a></p>
</li>
<li><p><a href="#heading-factor-6-networking-and-visibility">Factor 6: Networking and Visibility</a></p>
</li>
<li><p><a href="#heading-factor-7-ownership-mindset">Factor 7: Ownership Mindset</a></p>
</li>
<li><p><a href="#heading-factor-8--business-awareness">Factor 8: Business Awareness</a></p>
</li>
<li><p><a href="#heading-factor-9-learning-agility">Factor 9: Learning Agility</a></p>
</li>
<li><p><a href="#heading-your-90-day-action-plan">Your 90-Day Action Plan</a></p>
</li>
<li><p><a href="#heading-honest-self-assessment-where-do-you-stand">Honest Self-Assessment: Where Do You Stand?</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references-and-recommended-resources">References and Recommended Resources</a></p>
</li>
</ul>
<h2 id="heading-the-three-patterns-that-keep-beginners-stuck">The Three Patterns That Keep Beginners Stuck</h2>
<h3 id="heading-pattern-1-the-tutorial-loop">Pattern 1: The Tutorial Loop</h3>
<p>Week 1: You watch eight hours of Docker content. Week 2: You start an AWS course and get 70% through. Week 3: A Kubernetes series looks interesting, so you start that instead. Week 4: You open LinkedIn and wonder why you're not getting callbacks.</p>
<p>Watching tutorials feels like progress. It's comfortable, passive, and has no failure state. Nothing breaks. Nothing goes wrong.</p>
<p>The problem is that it produces nothing a hiring manager can evaluate. Courses and certifications tell an employer what you've been exposed to. Your GitHub tells them what you can actually do.</p>
<h3 id="heading-pattern-2-the-theory-practice-gap">Pattern 2: The Theory-Practice Gap</h3>
<p>You can explain CI/CD fluently. You've read the Kubernetes documentation. You understand the conceptual difference between a container and a virtual machine.</p>
<p>But you've never taken a simple application, containerized it, connected it to a pipeline, and deployed it to a cloud server with a real URL that someone can visit.</p>
<p>In an interview, "I understand how it works" and "I have built this and here is the link" are not equivalent answers. Hiring managers hear the first version from hundreds of candidates. The second version gets callbacks.</p>
<h3 id="heading-pattern-3-silent-learning">Pattern 3: Silent Learning</h3>
<p>This one is perhaps the most painful pattern because the learning is real. You're putting in the work every day, but nobody knows. No GitHub activity. No LinkedIn posts. No community presence. Just cold applications sent from job boards to applicant tracking systems (ATS) that filter you out before a human ever sees your name.</p>
<p>The hard truth: people get hired through people. A hiring manager who has seen your LinkedIn post about a problem you solved is significantly more likely to give your résumé serious attention than a stranger who applied through a portal.</p>
<h2 id="heading-what-hiring-managers-are-actually-evaluating">What Hiring Managers Are Actually Evaluating</h2>
<p>I've grouped the nine factors that follow into three buckets: <strong>Mindset</strong>, <strong>Execution</strong>, and <strong>Visibility</strong>. The order matters: mindset shapes how you execute, and execution is what powers visibility.</p>
<table>
<thead>
<tr>
<th>Bucket</th>
<th>Covers</th>
<th>Factors</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Mindset</strong></td>
<td>How you think about problems and your career</td>
<td>Factors 2, 7, 8, 9</td>
</tr>
<tr>
<td><strong>Execution</strong></td>
<td>What you actually build and demonstrate</td>
<td>Factors 1, 3</td>
</tr>
<tr>
<td><strong>Visibility</strong></td>
<td>Whether the right people know you exist</td>
<td>Factors 4, 5, 6</td>
</tr>
</tbody></table>
<p>Let's go through each one.</p>
<h2 id="heading-factor-1-proof-of-work-the-non-negotiable">Factor 1: Proof of Work (The Non-Negotiable)</h2>
<p>If there's one thing to take from this entire article, it's this: <strong>no portfolio means no serious consideration.</strong> The most technically capable candidate in the applicant pool is invisible without proof of work.</p>
<p>This isn't about impressing anyone with complexity. It's about demonstrating that you can take a system from zero to deployed, documented, and working.</p>
<p>Here's the checklist every portfolio project should meet before you consider it done:</p>
<ul>
<li><p><strong>It's deployed</strong>: there's a real URL you can share, not "it works on my machine"</p>
</li>
<li><p><strong>It has a CI/CD pipeline</strong>: code changes are automatically tested and deployed</p>
</li>
<li><p><strong>Infrastructure is defined as code</strong>: not manually clicked together in the AWS console</p>
</li>
<li><p><strong>It has monitoring and alerting</strong>: you know when it breaks before users tell you</p>
</li>
<li><p><strong>It's documented</strong>: a README explains what it does, how to run it, and how it works</p>
</li>
<li><p><strong>It's on GitHub publicly</strong>: with real commit history showing iterative work</p>
</li>
</ul>
<p>If your project meets all six criteria, you have proof of work. If it meets four of six, you have a project in progress. Finish it before you start applying.</p>
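<p>As a rough illustration, the six-point checklist above can be tracked as data. This is a minimal Python sketch rather than part of any real tool: the key names and the <code>assess</code> helper are made up for this example, and it treats anything short of six out of six as still in progress.</p>

```python
# Hypothetical keys, one per checklist item above.
CHECKLIST = (
    "deployed",
    "ci_cd_pipeline",
    "infrastructure_as_code",
    "monitoring_and_alerting",
    "documented",
    "public_on_github",
)


def assess(project):
    """Return (status, missing) for a dict mapping checklist items to True/False."""
    missing = [item for item in CHECKLIST if not project.get(item, False)]
    status = "proof of work" if not missing else "in progress"
    return status, missing


if __name__ == "__main__":
    # Example: a project that meets four of the six criteria.
    my_project = dict.fromkeys(CHECKLIST, True)
    my_project["infrastructure_as_code"] = False
    my_project["monitoring_and_alerting"] = False
    status, missing = assess(my_project)
    print(status, "- still missing:", ", ".join(missing))
```

<p>Keeping the list of missing items, not just a score, is the useful part: it tells you exactly what to finish before applying.</p>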
<h3 id="heading-the-three-projects-that-cover-everything">The Three Projects That Cover Everything</h3>
<p>You don't need ten projects. You need two to three projects that together demonstrate the full range of DevOps skills.</p>
<h4 id="heading-project-1-the-full-stack-deploy-pipeline">Project 1: The Full-Stack Deploy Pipeline</h4>
<p>This is the foundational DevOps project every beginner should build first.</p>
<p>Take any simple web application – a Python Flask app, a Node.js API, or even a static site. Containerize it with Docker. Write a CI/CD pipeline that runs tests, builds the Docker image, and deploys to a cloud server automatically on every push to the main branch. You can also set up Nginx as a reverse proxy and add an uptime monitor (UptimeRobot has a free tier).</p>
<p>Tools: GitHub Actions, Docker, AWS EC2 or <a href="http://Render.com">Render.com</a>, Nginx.</p>
<p>Why it matters to a hiring manager: it proves you can automate a full deployment workflow end-to-end. The hiring manager can visit your URL, see it running, and inspect your pipeline history.</p>
<p>This single project puts you ahead of most applicants who only have course completion screenshots.</p>
<h4 id="heading-project-2-infrastructure-as-code-with-terraform">Project 2: Infrastructure as Code with Terraform</h4>
<p>Write Terraform code that provisions a complete environment: a VPC, public and private subnets, an EC2 instance with properly scoped security group rules, and an S3 bucket for remote state. Destroy it and recreate it from scratch to prove the code actually works. Add a GitHub Actions workflow that runs <code>terraform plan</code> on pull requests and <code>terraform apply</code> on merge to main.</p>
<p>Tools: Terraform, AWS (or Azure/GCP), GitHub Actions.</p>
<p>Why it matters: Infrastructure as Code with Terraform is a required skill at almost every company running cloud infrastructure. Showing you can write, version-control, and automate Terraform demonstrates a core professional competency.</p>
<h4 id="heading-project-3-monitoring-and-observability-stack">Project 3: Monitoring and Observability Stack</h4>
<p>Deploy a monitoring stack using Docker Compose: Prometheus scraping metrics from your application and the host, Grafana dashboards showing CPU, memory, request rates, and error rates, and Alertmanager configured to send alerts to Slack or email when thresholds are crossed. Connect this to your Project 1 application so the pipeline deploys and the monitoring watches it.</p>
<p>Tools: Prometheus, Grafana, Alertmanager, Node Exporter, Docker Compose.</p>
<p>Why it matters: most beginner portfolios have zero observability work. This project immediately signals that you understand production engineering, not just deployment. Any senior DevOps engineer or SRE reviewing your application will notice it and it will set you apart.</p>
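<p>To make "Prometheus scraping metrics from your application" less abstract, here is a minimal Python sketch of the plain-text format Prometheus reads from a metrics endpoint. The metric name <code>app_requests_total</code> and the <code>render_counter</code> helper are invented for illustration, label values are assumed to need no escaping, and a real project would more likely use an official client library than hand-build this text.</p>

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    Assumption: label values contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = "{" + label_text + "}" if label_text else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical request counts, split by HTTP status code.
    requests_by_status = {
        (("status", "200"),): 12,
        (("status", "500"),): 1,
    }
    print(render_counter(
        "app_requests_total",
        "Total HTTP requests handled.",
        requests_by_status,
    ), end="")
```

<p>Once the application serves text like this on a <code>/metrics</code> path, request-rate and error-rate panels in Grafana become queries over that one counter.</p>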
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/da9e25be-9b59-48c8-9cf0-9cfdb050c277.png" alt="GitHub profile showing three pinned DevOps portfolio repositories with descriptive names " style="display:block;margin:0 auto" width="1353" height="584" loading="lazy">

<h2 id="heading-factor-2-system-level-thinking">Factor 2: System-Level Thinking</h2>
<p>This is the mindset that separates a DevOps engineer from someone who just knows a collection of tools. System-level thinking means you can see the whole picture, not just the part you happen to be working on at any given moment.</p>
<p>Here's the mental test hiring managers are running throughout your interview: <em>can you trace a user request from the moment they click a button to the moment they see a response, and explain what happens at every layer in between?</em></p>
<p>Here's the full journey of a web request, the map of modern infrastructure every DevOps engineer needs to understand:</p>
<table>
<thead>
<tr>
<th>Step</th>
<th>Layer</th>
<th>What's happening and what can go wrong</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td>User's Browser</td>
<td>The user types a URL. The browser needs to find the server.</td>
</tr>
<tr>
<td>2</td>
<td>DNS Resolution</td>
<td>The domain is translated into an IP address. DNS misconfigurations mean users can't reach you at all.</td>
</tr>
<tr>
<td>3</td>
<td>CDN / Edge Network</td>
<td>Traffic hits a CDN (Cloudflare, CloudFront) first. Static assets are served from the nearest edge. SSL terminates here.</td>
</tr>
<tr>
<td>4</td>
<td>Load Balancer</td>
<td>Routes the request to an available application server. If all targets are unhealthy, users get 502/503 errors.</td>
</tr>
<tr>
<td>5</td>
<td>Compute / Application Servers</td>
<td>The application code runs here in containers, on VMs, or in serverless functions. Business logic executes.</td>
</tr>
<tr>
<td>6</td>
<td>Database Layer</td>
<td>The application reads from or writes to a database. Slow queries or a full disk cause slow responses or outages.</td>
</tr>
<tr>
<td>7</td>
<td>Cache Layer</td>
<td>Redis or Memcached caches frequently-read data. Cache misses cause extra database load.</td>
</tr>
<tr>
<td>8</td>
<td>Response Returns</td>
<td>The response travels back through the stack and the user sees the result.</td>
</tr>
<tr>
<td>9</td>
<td>Logging and Monitoring</td>
<td>Every step above should emit logs and metrics. Good monitoring alerts you before users notice a problem.</td>
</tr>
</tbody></table>
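<p>One way to get a feel for these layers is to time them yourself. The following is a minimal Python sketch, using only the standard library, that times DNS resolution, the TCP connection, and the HTTP response for a plain-HTTP URL. The <code>probe</code> and <code>slowest_stage</code> names are hypothetical, <code>example.com</code> is a placeholder, and the sketch skips HTTPS, redirects, and retries.</p>

```python
import http.client
import socket
import time
from urllib.parse import urlparse


def probe(url, timeout=5.0):
    """Time DNS resolution, TCP connect, and the HTTP response for a plain-HTTP URL."""
    parts = urlparse(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    result = {}

    # Step 2 in the table: turn the host name into an address.
    start = time.perf_counter()
    address = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
    result["dns_seconds"] = time.perf_counter() - start

    # Steps 3-4: whatever answers on that address (CDN, load balancer, or the server).
    start = time.perf_counter()
    with socket.create_connection(address[:2], timeout=timeout):
        result["tcp_connect_seconds"] = time.perf_counter() - start

    # Steps 5-8: a fresh connection sends the request and reads the full response.
    start = time.perf_counter()
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        response.read()
        result["status"] = response.status
    finally:
        connection.close()
    result["http_seconds"] = time.perf_counter() - start
    return result


def slowest_stage(result):
    """Name the timed stage that took longest, as a first hint about where to look."""
    timings = {key: value for key, value in result.items() if key.endswith("_seconds")}
    return max(timings, key=timings.get)


if __name__ == "__main__":
    # Placeholder URL: point this at your own deployed project.
    timings = probe("http://example.com/")
    print(timings, "slowest:", slowest_stage(timings))
```

<p>The numbers matter less than the habit: when something is slow or down, you ask which layer is responsible instead of stopping at "the website is down."</p>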
<p>Why does this matter in an interview? Consider two candidates answering the question: <em>"Tell me about a time something broke in production."</em></p>
<p>Candidate A: "The website was down."</p>
<p>Candidate B: "The load balancer health checks were failing because the app containers were running out of memory due to a memory leak introduced in the previous deploy. We identified it via memory metrics in Grafana, rolled back, and added a memory limit to the container spec."</p>
<p>Same incident. Completely different answer. System-level thinking is what makes the difference.</p>
<h2 id="heading-factor-3-software-engineering-fundamentals">Factor 3: Software Engineering Fundamentals</h2>
<p>Many beginners rush to learn Kubernetes and Terraform before mastering the foundations that make those tools make sense. This creates a knowledge structure that looks impressive but has no solid base underneath it.</p>
<p>Here are the fundamentals that actually matter and what to do if you have a gap in any of them:</p>
<h3 id="heading-1-linux-and-the-command-line">1. Linux and the Command Line</h3>
<p>DevOps tools run on Linux. CI/CD jobs run in Linux containers. SSH is the front door to every server. If the terminal makes you uncomfortable, you're not ready for a production environment. This is not a preference; it's a prerequisite.</p>
<p>Start with daily Linux practice. The <a href="https://training.linuxfoundation.org/training/introduction-to-linux/">Linux Foundation's free introductory materials</a> are a solid starting point. And here's a <a href="https://www.freecodecamp.org/news/learn-the-basics-of-the-linux-operating-system/">solid freeCodeCamp course on Linux basics.</a></p>
<h3 id="heading-2-networking-fundamentals">2. Networking Fundamentals</h3>
<p>DNS, TCP/IP, HTTP/HTTPS, load balancing, firewalls, VPCs, subnets: these concepts appear in every cloud architecture. Without them, Terraform and Kubernetes are magic boxes. Study the request flow in Factor 2 above until you can draw it from memory without looking.</p>
<p>Here's a <a href="https://www.freecodecamp.org/news/computer-networking-fundamentals/">computer networking fundamentals course</a> to get you started.</p>
<h3 id="heading-3-scripting-bash-and-python">3. Scripting: Bash and Python</h3>
<p>CI/CD pipelines are scripts. Automation is scripting. If you cannot write a Bash script that reads a config file, calls an API, and handles errors gracefully, your automation ceiling is very low. Fix this by writing one small, useful script every week. Solve real problems with code.</p>
<p>Here's a helpful tutorial on <a href="https://www.freecodecamp.org/news/shell-scripting-crash-course-how-to-write-bash-scripts-in-linux/">shell scripting in Linux for beginners</a>.</p>
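<p>For the Python half of this section, here is a minimal sketch of the same three skills: reading a config file, calling an API, and handling errors gracefully. It assumes a JSON config with a single <code>health_url</code> key; that key, the file name, and the exit codes are choices made for this example, not a standard.</p>

```python
import json
import sys
import urllib.error
import urllib.request


def load_config(path):
    """Read a JSON config file and check that the one required key is present."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    if "health_url" not in config:
        raise ValueError("config is missing the 'health_url' key")
    return config


def check_health(url, timeout=5.0):
    """Return (ok, detail) instead of raising, so the caller decides how to react."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}"
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 503.
        return False, f"HTTP {error.code}"
    except (OSError, ValueError) as error:
        # Network failures and timeouts are OSError; a malformed URL is ValueError.
        return False, f"request failed: {error}"


def main(argv):
    if len(argv) != 2:
        print("usage: healthcheck.py CONFIG.json", file=sys.stderr)
        return 2
    try:
        config = load_config(argv[1])
    except (OSError, ValueError) as error:
        # Invalid JSON raises json.JSONDecodeError, a subclass of ValueError.
        print(f"config error: {error}", file=sys.stderr)
        return 2
    ok, detail = check_health(config["health_url"])
    print(f"{config['health_url']}: {detail}")
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

<p>Returning distinct exit codes (0 for healthy, 1 for unhealthy, 2 for a usage or config problem) lets a CI/CD step or cron job react to the result without parsing the output.</p>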
<h3 id="heading-4-git-and-version-control">4. Git and Version Control</h3>
<p>Not just <code>git commit</code> and <code>git push</code>. Branching strategies, pull requests, merge conflicts, rebasing, and tagging releases are all standard practice in professional DevOps teams. Use Git for everything including your personal learning notes. Practice branching workflows intentionally.</p>
<p>Here's a <a href="https://www.freecodecamp.org/news/gitting-things-done-book/">full book on all the Git basics</a> (and some more advanced topics, too) you need to know.</p>
<h3 id="heading-5-docker-and-containers">5. Docker and Containers</h3>
<p>Docker is the universal packaging format for modern software. Understanding layers, multi-stage builds, volumes, networking, and container security is the floor, not the ceiling. Every project you build should be containerized. Write your Dockerfiles by hand instead of copying them.</p>
<p>Here's a course on <a href="https://www.freecodecamp.org/news/learn-docker-and-kubernetes-hands-on-course/">Docker and Kubernetes</a> to get you started.</p>
<h2 id="heading-factor-4-communication-skills">Factor 4: Communication Skills</h2>
<p>Technical skills set your ceiling. Communication skills determine how fast you reach it. This is the most consistently underestimated factor among beginner DevOps candidates.</p>
<p>Two candidates with identical technical ability will have very different career outcomes based on how clearly they communicate. Here's what that looks like in practice:</p>
<p><strong>Architecture explanation</strong>: Can you describe how your project works to someone who has never seen it? Can you draw the architecture on a whiteboard and walk someone through your design decisions and the trade-offs you made?</p>
<p><strong>Trade-off articulation</strong>: <em>"I chose X over Y because..."</em> is one of the most powerful phrases in a technical interview. It shows you understand that every decision has pros and cons and you made a conscious, reasoned choice rather than just copying a tutorial.</p>
<p><strong>Written documentation</strong>: A README is your project's cover letter. A well-written README with clear setup instructions, an architecture diagram, and documented decisions demonstrates engineering maturity that most beginners don't show.</p>
<p>Here's a quick test: open your most recent project on GitHub and read the README as if you're a hiring manager seeing it for the first time. Does it answer these questions?</p>
<ul>
<li><p>What does this project do, and why did you build it?</p>
</li>
<li><p>What does the architecture look like?</p>
</li>
<li><p>How do I run this locally, and how do I deploy it?</p>
</li>
<li><p>What decisions did you make, and why?</p>
</li>
<li><p>What would you improve if you continued working on it?</p>
</li>
</ul>
<p>If you answered "no" to more than two of those, rewrite the README before applying anywhere. This single action will meaningfully improve your response rate.</p>
<p><strong>Interview communication</strong>: Hiring managers assess communication throughout the entire interview, not just your answers. Thinking out loud, structuring your responses, and admitting uncertainty honestly are all evaluated.</p>
<h2 id="heading-factor-5-consistency-over-intensity">Factor 5: Consistency Over Intensity</h2>
<p>Hiring managers are pattern recognition machines. They look at your GitHub contribution graph, your LinkedIn activity, and your learning trajectory and form an impression before reading a single word on your résumé.</p>
<p>A binge-learning approach (10-hour weekends followed by weeks of nothing) produces a GitHub graph that tells the wrong story. Thirty minutes of focused daily practice for six months beats a monthly 10-hour binge. At the six-month mark, the daily practitioner has about 90 hours of focused work. The binge learner has 60, with significantly worse retention.</p>
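<p>The arithmetic behind that comparison is easy to check. This small Python sketch assumes 30-day months and one 10-hour binge per month; <code>total_hours</code> is a helper invented for the example.</p>

```python
def total_hours(hours_per_session, sessions_per_month, months):
    """Total practice hours over a period, given a fixed session length and frequency."""
    return hours_per_session * sessions_per_month * months


# Thirty minutes a day, assuming 30-day months, for six months.
daily_practice = total_hours(0.5, 30, 6)

# One 10-hour binge per month for six months.
monthly_binge = total_hours(10, 1, 6)

print(daily_practice, monthly_binge)  # prints: 90.0 60
```

<p>The gap widens further once you account for how much of a 10-hour session is actually retained a month later.</p>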
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/1315bb8d-9e4e-4f84-836f-4e02b83c75ce.webp" alt="GitHub contribution graph showing 12 months of consistent activity with regular commits across the year" style="display:block;margin:0 auto" width="1080" height="273" loading="lazy">

<p>Here's how to build consistency in practice:</p>
<ul>
<li><p>Pick a time slot in your day that you will protect. Thirty minutes is enough to make progress.</p>
</li>
<li><p>Define a four-week learning sprint with a specific goal, not "learn Terraform" but "build and deploy a VPC with Terraform and write the README."</p>
</li>
<li><p>Keep a private learning journal: date, what you studied, what you built, what confused you.</p>
</li>
<li><p>When the sprint ends, evaluate what you built and plan the next one.</p>
</li>
</ul>
<p>What to avoid: declaring publicly on LinkedIn that you're "grinding DevOps full time" and then disappearing for six weeks. The absence is noticed. Only commit publicly to what you will actually sustain.</p>
<h2 id="heading-factor-6-networking-and-visibility">Factor 6: Networking and Visibility</h2>
<p>This is the factor most beginners resist most, and the one that makes the biggest practical difference in time-to-hire.</p>
<p>Most DevOps jobs are filled through people: referrals, community connections, LinkedIn conversations. A warm introduction from someone who has seen your work outweighs fifty cold applications every time.</p>
<p>Here are three ways to build visibility without it feeling performative:</p>
<h3 id="heading-community-engagement">Community Engagement</h3>
<p>Join communities where DevOps engineers actually talk: AWS User Groups, local DevOps meetups, DevOps Discord servers, Reddit communities like r/devops and r/kubernetes. You don't need to be the expert. Ask specific questions, answer what you genuinely know, and show up consistently. After three to six months, people will recognize your name.</p>
<h3 id="heading-linkedin-content">LinkedIn Content</h3>
<p>Post once per week about something you learned, built, or got stuck on. Not marketing – documentation. A post that says <em>"This week I configured Prometheus alerting for a Docker Compose stack. Here's what tripped me up and how I solved it"</em> attracts recruiters, leads to conversations, and builds a searchable record of your growth over time.</p>
<h3 id="heading-asking-good-questions-in-public">Asking Good Questions in Public</h3>
<p>When you get stuck and figure it out, write it up. Post the solution in the same community where you asked the question. Answer someone else's version of the same question later. You position yourself as a helpful, engaged learner, exactly who hiring managers want to hire.</p>
<p>Here's a concrete three-month visibility sprint to follow:</p>
<table>
<thead>
<tr>
<th>Timeframe</th>
<th>Action</th>
</tr>
</thead>
<tbody><tr>
<td>Week 1-2</td>
<td>Update your LinkedIn headline: "Cloud / DevOps Engineer in Training | Building with AWS, Docker, Terraform". Connect with 20 people in DevOps: engineers, recruiters, hiring managers. Add a short personal note when connecting.</td>
</tr>
<tr>
<td>Week 3-4</td>
<td>Write your first LinkedIn post. Document something you built or learned this week. Keep it honest and specific. 150–200 words is enough.</td>
</tr>
<tr>
<td>Month 2</td>
<td>Join one community. Introduce yourself. Answer one question per week.</td>
</tr>
<tr>
<td>Month 3</td>
<td>Post consistently once per week. Engage with others' posts. Start appearing in recruiter searches.</td>
</tr>
</tbody></table>
<p>By month three, recruiters searching for "DevOps" in your location will encounter your activity. Some of the best entry-level DevOps opportunities come from exactly this kind of low-pressure visibility.</p>
<h2 id="heading-factor-7-ownership-mindset">Factor 7: Ownership Mindset</h2>
<p>This factor is less about personality type and more about observable behavior. Hiring managers are looking for evidence that you finish what you start, not just that you start things.</p>
<p>Here's what the contrast looks like:</p>
<table>
<thead>
<tr>
<th>What hiring managers frequently see</th>
<th>What hiring managers want to see</th>
</tr>
</thead>
<tbody><tr>
<td>"I started a Kubernetes project and encountered a lot of issues"</td>
<td>"Here is a complete project. It deploys to AWS, has a CI/CD pipeline, is monitored, and you can access it at this URL right now."</td>
</tr>
<tr>
<td>"I was working through a Terraform course, learnt a lot about XYZ."</td>
<td>"I finished it, documented it, and wrote a post about what I learned."</td>
</tr>
</tbody></table>
<p>Ownership mindset has three components. First, finish things: a complete, simple project is worth ten times more than ten incomplete complex ones. Second, take responsibility without blame when something breaks: ownership means identifying the cause, fixing it, and adding monitoring so it doesn't happen again. Third, self-direct your learning: you don't wait for someone to tell you what to learn next. You see a gap, identify how to close it, and close it. This is what "junior who can work independently" actually means in job descriptions.</p>
<h2 id="heading-factor-8-business-awareness">Factor 8: Business Awareness</h2>
<p>Technical skill gets you in the door. Business awareness keeps you there and accelerates your career.</p>
<p>The core question hiring managers are testing is: <em>can you connect your technical decisions to cost, uptime, and user impact?</em> Infrastructure decisions are business decisions. Cloud costs are often the second-largest engineering expense after salaries. A misconfigured auto-scaling group or a forgotten large EC2 instance can quietly burn through thousands of dollars before anyone notices.</p>
<p>Here are a few benchmark questions worth being able to answer comfortably:</p>
<ul>
<li><p>If your company has a 99.9% SLA, how many minutes of downtime per month is that? (About 43 minutes.)</p>
</li>
<li><p>If you move workloads from on-demand EC2 instances to Reserved Instances, what's the approximate cost saving? (Around 40–60%.)</p>
</li>
<li><p>If your CI/CD pipeline takes 45 minutes per build and you run 20 builds per day, how much developer wait time does that represent weekly?</p>
</li>
</ul>
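<p>Those benchmark numbers can be worked out in a few lines. The Python sketch below assumes a 30-day month, a five-day working week, and that someone is waiting on every build; the function names and the $1,000 monthly bill are hypothetical.</p>

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Minutes of downtime an availability target permits over a period of `days`."""
    return (1 - sla_percent / 100) * days * 24 * 60


def weekly_build_wait_hours(minutes_per_build, builds_per_day, working_days=5):
    """Hours spent waiting on builds per week, assuming someone waits on every build."""
    return minutes_per_build * builds_per_day * working_days / 60


def reserved_saving_range(on_demand_monthly_cost, low=0.40, high=0.60):
    """Monthly saving range for the 40-60% discount band quoted in the list above."""
    return on_demand_monthly_cost * low, on_demand_monthly_cost * high


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
    print(weekly_build_wait_hours(45, 20))           # 75.0
    print(reserved_saving_range(1000))               # hypothetical $1,000 monthly bill
```

<p>Under those assumptions, a 45-minute pipeline run 20 times a day adds up to roughly 75 hours of waiting per week, which is the kind of figure that turns "the build is slow" into a business case for fixing it.</p>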
<p>Most junior candidates can't answer these fluently in an interview. Candidates who can stand out immediately, not because the questions are hard, but because so few people bother to connect infrastructure and business.</p>
<p>The simple habit to build: whenever you describe a technical decision in your project documentation or in an interview, add the business dimension. "I configured auto-scaling" becomes "I configured auto-scaling to handle traffic spikes, which eliminated the cost of over-provisioning and reduced our estimated monthly cloud spend by approximately $X."</p>
<h2 id="heading-factor-9-learning-agility">Factor 9: Learning Agility</h2>
<p>Everyone claims to be a fast learner. It's the most overused phrase in technology job applications. Here's how to make it actually mean something.</p>
<p>Saying "I'm a fast learner" in an interview is table stakes. The question is whether you can prove it. Proof sounds like this: <em>"I had never used GitHub Actions before. I needed a CI/CD pipeline for a project I was building. In 48 hours, I had a working pipeline that runs tests, builds a Docker image, and deploys to AWS."</em></p>
<p>What makes that credible: it names a specific tool, a specific timeframe, and a specific outcome. There is a GitHub repository with a commit history and a working pipeline that a hiring manager can actually look at.</p>
<p>Learning agility is not about knowing many tools shallowly. It's about picking up new tools quickly because you deeply understand the underlying concepts. Tool names change every few years. Concepts (networking, automation, observability, reliability) do not.</p>
<p>To build a concrete track record of learning agility: once a month, pick one tool you haven't used. Follow its quick-start guide. Build something small. Document what was difficult. Post about it. This is your learning agility portfolio: visible, dated, and specific.</p>
<h2 id="heading-your-90-day-action-plan">Your 90-Day Action Plan</h2>
<p>Here is a concrete, sequential plan that takes you from where you are now to being ready for your first DevOps interview.</p>
<h3 id="heading-month-1-build-your-foundation">Month 1: Build Your Foundation</h3>
<p>Focus entirely on Project 1 from the Proof of Work section. Build it completely. Deploy it. Get the live URL. Don't start Project 2 until Project 1 meets all six checklist criteria.</p>
<p>Alongside the build: 30 minutes of Linux and Bash scripting practice daily. This isn't optional; it's the foundation everything else runs on.</p>
<h3 id="heading-month-2-expand-your-execution-and-start-your-visibility">Month 2: Expand Your Execution and Start Your Visibility</h3>
<p>Begin Project 2 (Terraform IaC). Write your first LinkedIn post. It doesn't need to be polished; it needs to be specific. Join one community and introduce yourself.</p>
<h3 id="heading-month-3-complete-the-portfolio-and-document-everything">Month 3: Complete the Portfolio and Document Everything</h3>
<p>Finish all three projects to full checklist standard. Polish every README. Add architecture diagrams. Optimize your GitHub profile: pin your three best repos, write a profile README that describes who you are and what you build, and add links to your live project URLs.</p>
<h3 id="heading-month-4-onward-apply-with-strategy">Month 4 Onward: Apply with Strategy</h3>
<p>Don't start applying before month four. Apply with real proof of work in hand. Target five to ten quality applications per week rather than spraying a hundred. Include your GitHub and your best project's live URL in every application. For roles at companies where you have a community connection, reach out to that person before applying.</p>
<p>Track every application in a spreadsheet: company, role, date applied, status, outcome, notes. After thirty applications, you'll have enough data to see what's working and what isn't.</p>
<p>Here's the full 90-day breakdown:</p>
<table>
<thead>
<tr>
<th>Timeframe</th>
<th>Focus</th>
<th>Milestone</th>
</tr>
</thead>
<tbody><tr>
<td>Week 1-2</td>
<td>Linux fundamentals. Set up GitHub profile. Start Project 1.</td>
<td>Foundation</td>
</tr>
<tr>
<td>Week 3-4</td>
<td>Complete Project 1 CI/CD pipeline. Deploy. Get live URL. Write README.</td>
<td>First Proof of Work</td>
</tr>
<tr>
<td>Month 2</td>
<td>Begin Project 2. First LinkedIn post. Join one community.</td>
<td>Visibility begins</td>
</tr>
<tr>
<td>Month 2-3</td>
<td>Complete Project 2. Scaffold monitoring (Project 3). Post weekly on LinkedIn.</td>
<td>Building momentum</td>
</tr>
<tr>
<td>Month 3</td>
<td>Finish all 3 projects to checklist standard. Polish READMEs and GitHub profile.</td>
<td>Portfolio complete</td>
</tr>
<tr>
<td>Month 4+</td>
<td>Apply strategically. Continue posting and community engagement.</td>
<td>Active job search</td>
</tr>
</tbody></table>
<h2 id="heading-honest-self-assessment-where-do-you-stand">Honest Self-Assessment: Where Do You Stand?</h2>
<p>Go through each statement below. Be completely honest: this is for you, not anyone else.</p>
<table>
<thead>
<tr>
<th>Statement</th>
<th>Action if the answer is No</th>
</tr>
</thead>
<tbody><tr>
<td>I can explain a web request end-to-end (DNS → load balancer → compute → database → logs)</td>
<td>Study Factor 2 until you can draw this from memory</td>
</tr>
<tr>
<td>I have at least one deployed project with a live URL</td>
<td>This is Priority 1. Nothing else matters more right now.</td>
</tr>
<tr>
<td>My best project has a CI/CD pipeline that auto-deploys on push</td>
<td>Add this to your existing project this week</td>
</tr>
<tr>
<td>I have written infrastructure as code (Terraform or CloudFormation)</td>
<td>Project 2 is your next build target</td>
</tr>
<tr>
<td>My projects have READMEs that explain architecture and decisions</td>
<td>Spend one hour today rewriting your README</td>
</tr>
<tr>
<td>I have posted about my learning on LinkedIn in the last 30 days</td>
<td>Post something today: document what you built last week</td>
</tr>
<tr>
<td>I am part of at least one DevOps community</td>
<td>Join r/devops or an AWS Discord server this week</td>
</tr>
<tr>
<td>I can write a Bash script that solves a real automation problem</td>
<td>30 minutes of daily scripting practice for the next 30 days</td>
</tr>
<tr>
<td>I can explain what I built, why I made each decision, and what I'd change</td>
<td>Practice saying this out loud about each project until it's fluent</td>
</tr>
</tbody></table>
<p>Count your "no" answers. Each one is a specific, actionable gap, not a vague sense of being behind. That's the difference between this self-assessment and the anxious feeling of "I'm not ready yet." You're not behind. You just have a prioritized list of what to build next.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Here's what you know now that most beginners still don't:</p>
<p>The gap between you and a DevOps job isn't a gap in certifications, a gap in courses completed, or a gap in the number of tools you've heard about. It's a gap in proof of work, visibility, and the consistency with which you execute.</p>
<p>Hiring managers aren't looking for someone who has watched everything. They're looking for someone who has built something, documented it, deployed it, monitored it, and can clearly explain every decision they made along the way.</p>
<p>The path isn't secret. It's just work. Build two to three complete projects that meet the full checklist. Document everything. Show up consistently in communities and on LinkedIn. Apply with strategy. Iterate based on feedback.</p>
<p>If you want a production-grade reference to support your DevOps journey, complete with real Terraform modules, CI/CD workflow templates, infrastructure runbooks, and platform engineering patterns used in real startup environments, <a href="https://coachli.co/tolani-akintayo/PR-H4oQS">The Startup DevOps Field Guide</a> was built for exactly this stage of your career.</p>
<p>The information gap between you and your first DevOps role is smaller than you think. The execution gap is where the work is. Start today.</p>
<h2 id="heading-references-and-recommended-resources">References and Recommended Resources</h2>
<ul>
<li><p><a href="https://roadmap.sh/devops">roadmap.sh/devops</a>: The community-maintained DevOps learning roadmap. Use this to sequence what you learn next and avoid random jumps between topics.</p>
</li>
<li><p><a href="https://dora.dev">DORA State of DevOps Report</a>: Free annual report on what DevOps practices actually improve software delivery performance. Gives you the vocabulary hiring managers speak.</p>
</li>
<li><p><a href="https://training.linuxfoundation.org/training/introduction-to-linux/">Linux Foundation - Introduction to Linux</a>: Free introductory Linux course. If the terminal still makes you nervous, start here.</p>
</li>
<li><p><a href="https://itrevolution.com/product/the-phoenix-project/">The Phoenix Project</a>: A business novel about DevOps transformation. Teaches core concepts through story. Gives you vocabulary for business-aware conversations.</p>
</li>
<li><p><a href="http://ExplainShell.com">ExplainShell.com</a>: Paste any command you find online and see exactly what every part does. Use this constantly while building your projects.</p>
</li>
<li><p><a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-readmes">GitHub - How to Write a Good README</a>: Official GitHub guidance on repository documentation.</p>
</li>
<li><p><a href="https://prometheus.io/docs/introduction/overview/">Prometheus Documentation</a>: Official docs for the monitoring tool used in Project 3.</p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/tutorials/aws-get-started">Terraform Getting Started - AWS</a>: Official step-by-step guide for Project 2.</p>
</li>
<li><p><a href="https://docs.github.com/en/actions">GitHub Actions Documentation</a>: Complete reference for building CI/CD pipelines in Project 1.</p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/learn-linux-for-beginners-book-basic-to-advanced/">freeCodeCamp - Learn Linux for Beginners</a>: Comprehensive Linux guide available on freeCodeCamp.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up OpenID Connect (OIDC) in GitHub Actions for AWS
 ]]>
                </title>
                <description>
                    <![CDATA[ If you've been storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks i ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-set-up-openid-connect-oidc-in-github-actions-for-aws/</link>
                <guid isPermaLink="false">69ef7bbf330a1ad7f7f2d579</guid>
                
                    <category>
                        <![CDATA[ OpenID Connect ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OIDC ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GitHub Actions ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tolani Akintayo ]]>
                </dc:creator>
                <pubDate>Mon, 27 Apr 2026 15:07:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/83b71e24-b63b-42a4-ac1c-d59e226da6c3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been storing <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks in a CI/CD pipeline.</p>
<p>Here's why: static credentials don't expire on their own. If they get leaked through a misconfigured workflow, a public fork, or a compromised repository, an attacker has persistent access to your AWS environment until you manually rotate them. And most teams don't rotate them often enough.</p>
<p>OpenID Connect (OIDC) removes this risk. Instead of storing long-lived credentials, your workflow presents a <strong>short-lived token</strong> issued by GitHub to AWS and receives temporary credentials in return, every time it runs. No secrets to rotate. No long-lived credentials to leak. No manual key management.</p>
<p>In this tutorial, you'll learn how to set up OIDC authentication between GitHub Actions and AWS from scratch. By the end, your workflows will authenticate to AWS securely without storing a single access key.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-openid-connect-oidc">What Is OpenID Connect (OIDC)?</a></p>
</li>
<li><p><a href="#heading-how-oidc-works-between-github-actions-and-aws">How OIDC Works Between GitHub Actions and AWS</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-create-an-iam-oidc-identity-provider-in-aws">Step 1: Create an IAM OIDC Identity Provider in AWS</a></p>
</li>
<li><p><a href="#heading-step-2-create-an-iam-role-with-a-trust-policy">Step 2: Create an IAM Role with a Trust Policy</a></p>
</li>
<li><p><a href="#heading-step-3-attach-permissions-to-the-iam-role">Step 3: Attach Permissions to the IAM Role</a></p>
</li>
<li><p><a href="#heading-step-4-store-the-role-arn-as-a-github-actions-variable">Step 4: Store the Role ARN as a GitHub Actions Variable</a></p>
</li>
<li><p><a href="#heading-step-5-configure-your-github-actions-workflow">Step 5: Configure Your GitHub Actions Workflow</a></p>
</li>
<li><p><a href="#heading-step-6-run-and-verify-your-workflow">Step 6: Run and Verify Your Workflow</a></p>
</li>
<li><p><a href="#heading-security-best-practices">Security Best Practices</a></p>
</li>
<li><p><a href="#heading-troubleshooting-common-errors">Troubleshooting Common Errors</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-what-is-openid-connect-oidc">What Is OpenID Connect (OIDC)?</h2>
<p>OpenID Connect is an identity protocol built on top of OAuth 2.0. It allows systems to verify identity through tokens rather than shared secrets.</p>
<p>In the context of GitHub Actions and AWS:</p>
<ul>
<li><p><strong>GitHub</strong> acts as the <strong>identity provider (IdP)</strong>. It issues a signed JWT (JSON Web Token) for each workflow run.</p>
</li>
<li><p><strong>AWS</strong> acts as the <strong>service provider</strong> (the relying party, in OIDC terms). It validates that token against GitHub's public keys and exchanges it for temporary AWS credentials.</p>
</li>
</ul>
<p>The credentials AWS returns are short-lived (valid for 1 hour by default) and limited to the permissions of the IAM role you define. They expire on their own, so nothing long-lived is left behind once the workflow ends.</p>
<p>This model is called <strong>federated identity</strong>. It's the same concept used when you "Sign in with Google" on a third-party website. The difference is that instead of a user signing in, your workflow is the one authenticating.</p>
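<p>To make the token less abstract, here's a minimal shell sketch that decodes the claims segment of a JWT. The sample token is built locally for illustration (it isn't a real GitHub token), the function name <code>decode_jwt_payload</code> is hypothetical, and the sketch assumes a GNU-style <code>base64</code> that accepts <code>-d</code>. It doesn't verify the signature, which is the part AWS does for you.</p>
<pre><code class="language-shell"># Decode the payload (second segment) of a JWT. The signature is NOT verified.
decode_jwt_payload() {
  # Take the middle segment and convert base64url to standard base64.
  payload=$(printf '%s' "$1" | cut -d '.' -f2 | tr '_-' '/+')
  # Restore the '=' padding that JWTs strip.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Hypothetical sample claims, encoded the way a JWT payload would be.
claims='{"sub":"repo:your-org/your-repo:ref:refs/heads/main","aud":"sts.amazonaws.com"}'
body=$(printf '%s' "$claims" | base64 | tr -d '=\n' | tr '/+' '_-')
token="header.${body}.signature"

decode_jwt_payload "$token"
echo
</code></pre>
<p>Running it prints the sample claims back as JSON, which is the same kind of data AWS compares against your trust policy.</p>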
<h2 id="heading-how-oidc-works-between-github-actions-and-aws">How OIDC Works Between GitHub Actions and AWS</h2>
<p>Before writing a single line of YAML, it helps to understand the flow. This is my personal approach when implementing new technologies or concepts. Here's what happens every time your workflow runs:</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/8b5b39de-f671-4ffe-a2db-96d10ade69b3.jpg" alt="Diagram showing the OIDC authentication flow between GitHub Actions and AWS" style="display:block;margin:0 auto" width="449" height="544" loading="lazy">

<p>The diagram illustrates a secure authentication flow between GitHub Actions and AWS using OpenID Connect (OIDC), eliminating the need to store long-lived AWS credentials in GitHub. Here's what happens step-by-step:</p>
<p><strong>1. Initial Authentication Request</strong></p>
<p>When your GitHub Actions workflow starts, the runner (the virtual machine executing your workflow) requests a JSON Web Token (JWT) from GitHub's OIDC provider located at <code>https://token.actions.githubusercontent.com</code>.</p>
<p><strong>2. Token Issuance</strong></p>
<p>GitHub's OIDC provider generates and signs a JWT containing important claims (metadata) about your workflow. These claims include details like which repository the workflow is running from, which branch triggered it, what environment it's running in, and other contextual information that proves the workflow's identity.</p>
<p><strong>3. Token Validation</strong></p>
<p>The GitHub Actions runner presents this signed JWT to AWS Security Token Service (STS). AWS STS validates the JWT's signature by checking it against GitHub's publicly available cryptographic keys, ensuring the token is authentic and hasn't been tampered with.</p>
<p><strong>4. Trust Policy Verification</strong></p>
<p>AWS STS checks the trust policy configured on your IAM Role. This trust policy specifies which GitHub repositories, branches, or environments are allowed to assume this role. If the claims in the JWT match your trust policy conditions, authentication succeeds.</p>
<p><strong>5. Temporary Credentials Issued</strong></p>
<p>Once validated, AWS STS returns temporary security credentials to the GitHub Actions runner. These credentials include an Access Key ID, Secret Access Key, and Session Token that are valid for a limited time (1 hour by default, and configurable up to the role's maximum session duration, which can be set as high as 12 hours).</p>
<p><strong>6. AWS API Access</strong></p>
<p>The GitHub Actions runner uses these temporary credentials to authenticate API calls to your AWS resources such as pushing Docker images to ECR, updating ECS services, writing to S3 buckets, or invoking Lambda functions.</p>
<p>The key point: <strong>AWS never sees your GitHub credentials, and GitHub never sees your AWS credentials.</strong> The JWT is the only thing exchanged and it's signed, scoped, and short-lived.</p>
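<p>The exchange in steps 3 to 5 maps to a single STS API call. As a rough sketch, this is what the equivalent AWS CLI call could look like if you made it by hand. The function name <code>exchange_oidc_token</code> and the session name <code>manual-oidc-test</code> are placeholders chosen for this example. In practice, the <code>aws-actions/configure-aws-credentials</code> action used in Step 5 does this for you.</p>
<pre><code class="language-shell"># Exchange an OIDC token for temporary AWS credentials (sketch).
# $1 = role ARN, $2 = OIDC JWT issued by GitHub.
exchange_oidc_token() {
  aws sts assume-role-with-web-identity \
    --role-arn "$1" \
    --role-session-name "manual-oidc-test" \
    --web-identity-token "$2" \
    --duration-seconds 3600 \
    --query 'Credentials.[AccessKeyId,Expiration]' \
    --output text
}

# Example, inside a workflow step that already holds a token in $TOKEN:
# exchange_oidc_token "arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole" "$TOKEN"
</code></pre>
<p>The <code>--query</code> filter prints only the access key ID and the expiry time, so the secret key and session token never end up in your logs.</p>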
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following in place:</p>
<ul>
<li><p>An <strong>AWS account</strong> with IAM permissions to create identity providers and roles</p>
</li>
<li><p>A <strong>GitHub repository</strong> (public or private) where your workflows will run</p>
</li>
<li><p>Basic familiarity with <strong>GitHub Actions</strong>, knowing how to write a <code>.yml</code> workflow file</p>
</li>
<li><p>Basic familiarity with <strong>AWS IAM</strong> roles, policies, and permissions</p>
</li>
<li><p>The <strong>AWS CLI</strong> installed and configured (optional, but useful for verification)</p>
</li>
</ul>
<p>You don't need to be an AWS expert. Each step includes the exact console path and the configuration values you need.</p>
<h2 id="heading-step-1-create-an-iam-oidc-identity-provider-in-aws">Step 1: Create an IAM OIDC Identity Provider in AWS</h2>
<p>The first thing you need to do is tell AWS to trust GitHub as an identity provider. This is a one-time setup per AWS account.</p>
<h3 id="heading-how-to-do-it-in-the-aws-console">How to Do It in the AWS Console</h3>
<p>1. Open the <a href="https://console.aws.amazon.com/iam/">AWS IAM Console</a></p>
<p>2. In the left sidebar, click Identity providers</p>
<p>3. Click Add provider</p>
<p>4. For Provider type, select OpenID Connect</p>
<p>5. For Provider URL, enter:</p>
<pre><code class="language-plaintext">https://token.actions.githubusercontent.com
</code></pre>
<p>6. For Audience, enter:</p>
<pre><code class="language-plaintext">sts.amazonaws.com
</code></pre>
<p>7. Click Add provider</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/66f1de9d-36f9-462e-ad0c-090b152be6e5.png" alt="AWS IAM console showing the Add Identity Provider form configured for GitHub Actions OIDC" style="display:block;margin:0 auto" width="1349" height="609" loading="lazy">

<h3 id="heading-how-to-do-it-with-the-aws-cli">How to Do It with the AWS CLI</h3>
<p>If you prefer the terminal, run this command:</p>
<pre><code class="language-shell">aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/4b779fa0-0df2-4bc3-bbf4-9839ef8ce5e6.png" alt="terminal-oidc-connect-created" style="display:block;margin:0 auto" width="966" height="114" loading="lazy">

<p>Once created, you'll see <code>token.actions.githubusercontent.com</code> listed under <strong>Identity providers</strong> in your IAM console. This provider will be referenced in your IAM role's trust policy in the next step.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/eb820487-6553-43d2-b6b7-4e7b08d039ef.png" alt="verify oidc connect in AWS" style="display:block;margin:0 auto" width="1132" height="284" loading="lazy">
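<p>If you'd rather confirm the provider from the terminal, here's a small sketch. The helper name <code>github_oidc_provider_arn</code> and the account ID <code>123456789012</code> are placeholders. The ARN format is the same one used for the <code>Federated</code> principal in the trust policy in Step 2.</p>
<pre><code class="language-shell"># Build the ARN of the GitHub OIDC provider for a given 12-digit account ID.
github_oidc_provider_arn() {
  printf 'arn:aws:iam::%s:oidc-provider/token.actions.githubusercontent.com\n' "$1"
}

github_oidc_provider_arn 123456789012

# With AWS credentials configured, you could then inspect the provider:
# aws iam list-open-id-connect-providers
# aws iam get-open-id-connect-provider \
#   --open-id-connect-provider-arn "$(github_oidc_provider_arn 123456789012)"
</code></pre>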

<h2 id="heading-step-2-create-an-iam-role-with-a-trust-policy">Step 2: Create an IAM Role with a Trust Policy</h2>
<p>Now you need an IAM role that your GitHub Actions workflow will assume. The trust policy on this role controls which repositories and branches are allowed to request credentials.</p>
<h3 id="heading-how-to-create-the-iam-role-in-the-aws-console">How to Create the IAM Role in the AWS Console</h3>
<p>1. Open the <a href="https://console.aws.amazon.com/iam/">AWS IAM Console</a></p>
<p>2. In the left sidebar, click <strong>Roles</strong></p>
<p>3. Click <strong>Create role</strong></p>
<p>4. For <strong>Trusted entity type</strong>, select <strong>Web identity</strong></p>
<p>5. For <strong>Identity provider</strong>, choose <code>token.actions.githubusercontent.com</code>, which you created earlier</p>
<p>6. For <strong>Audience</strong>, choose <code>sts.amazonaws.com</code></p>
<p>7. For <strong>GitHub organization</strong>, enter your GitHub username or organization name</p>
<p>8. For <strong>GitHub repository</strong>, enter your repository name</p>
<p>9. For <strong>GitHub branch</strong>, enter your branch name (for example, <code>main</code>)</p>
<p>10. Click <strong>Next</strong>, then <strong>Next</strong> again, give the role a name, and click <strong>Create role</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/dca12969-db8a-4ec4-885e-e953f4808f6c.png" alt="create-iam-role-for-github-action-via-the-console" style="display:block;margin:0 auto" width="1351" height="620" loading="lazy">

<p>Note: Creating the IAM role this way automatically generates the trust policy from the values you entered in steps 4 to 9. You can verify this by opening the role and selecting the <strong>Trust relationships</strong> tab.</p>
<h3 id="heading-how-to-create-the-iam-role-with-the-aws-cli">How to Create the IAM Role with the AWS CLI</h3>
<p>First, create a trust policy document on your local machine. You can call it <code>trust-policy.json</code>:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_GITHUB_ORG/YOUR_REPO_NAME:*"
        }
      }
    }
  ]
}
</code></pre>
<p>Replace the following placeholders before saving:</p>
<table>
<thead>
<tr>
<th>Placeholder</th>
<th>Replace With</th>
</tr>
</thead>
<tbody><tr>
<td><code>YOUR_ACCOUNT_ID</code></td>
<td>Your 12-digit AWS account ID</td>
</tr>
<tr>
<td><code>YOUR_GITHUB_ORG</code></td>
<td>Your GitHub username or organization name</td>
</tr>
<tr>
<td><code>YOUR_REPO_NAME</code></td>
<td>The name of your GitHub repository</td>
</tr>
</tbody></table>
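<p>Editing three placeholders by hand is easy to get wrong, so one option is to substitute them with <code>sed</code>. This is only a sketch: the function name <code>fill_placeholders</code> and the sample values (<code>123456789012</code>, <code>my-org</code>, <code>my-repo</code>) are made up for illustration, and it assumes none of the values contain a <code>/</code>.</p>
<pre><code class="language-shell"># Replace the tutorial's placeholders on stdin and print the result.
# $1 = AWS account ID, $2 = GitHub org or username, $3 = repository name.
fill_placeholders() {
  sed -e "s/YOUR_ACCOUNT_ID/$1/g" \
      -e "s/YOUR_GITHUB_ORG/$2/g" \
      -e "s/YOUR_REPO_NAME/$3/g"
}

printf '%s\n' '"repo:YOUR_GITHUB_ORG/YOUR_REPO_NAME:*"' | fill_placeholders 123456789012 my-org my-repo
</code></pre>
<p>The sample line comes back as <code>"repo:my-org/my-repo:*"</code>. You could pipe a copy of the policy that still contains the placeholders through the same function and save the output as <code>trust-policy.json</code>.</p>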
<h3 id="heading-how-to-understand-the-sub-condition">How to Understand the <code>sub</code> Condition</h3>
<p>The <code>sub</code> (subject) claim in the JWT tells AWS exactly where the request is coming from. The value <code>repo:your-org/your-repo:*</code> means any workflow run in that repository can assume this role, whether it was triggered from a branch, a tag, a pull request, or an environment.</p>
<p>You can tighten this further depending on your needs:</p>
<pre><code class="language-shell"># Only the main branch
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
 
# Only a specific GitHub Environment
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"
</code></pre>
<p>Scoping this correctly is one of the most important security decisions in this setup. Here's how to decide:</p>
<ul>
<li><p>Use <code>ref:refs/heads/main</code> if only your main/production branch should deploy to AWS. This is the most restrictive and secure option: feature branches can't accidentally (or maliciously) trigger deployments or modify production resources.</p>
</li>
<li><p>Use <code>environment:production</code> if you're using GitHub Environments with protection rules (required reviewers, deployment gates). This lets you control deployments through GitHub's approval workflow while still restricting which workflows can access AWS.</p>
</li>
<li><p>Use <code>repo:your-org/your-repo:*</code> (wildcard) only if you need any branch to deploy, for example in development environments where every feature branch deploys to its own isolated stack. Never use this for production roles.</p>
</li>
</ul>
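<p>To see how these patterns behave, here's a rough local approximation of the <code>StringLike</code> check using shell pattern matching. The function name <code>sub_matches</code> is hypothetical, and shell globbing isn't identical to IAM's matching (it also treats <code>[...]</code> as a character set), so treat this as an illustration rather than a policy simulator.</p>
<pre><code class="language-shell"># Return 0 if a sub claim ($1) matches a StringLike-style pattern ($2).
# Approximation only: IAM supports the * and ? wildcards, shell globs support more.
sub_matches() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

main_sub="repo:your-org/your-repo:ref:refs/heads/main"
feature_sub="repo:your-org/your-repo:ref:refs/heads/feature-x"

for pattern in \
  "repo:your-org/your-repo:ref:refs/heads/main" \
  "repo:your-org/your-repo:*"; do
  for sub in "$main_sub" "$feature_sub"; do
    if sub_matches "$sub" "$pattern"; then
      echo "ALLOW $sub with $pattern"
    else
      echo "DENY  $sub with $pattern"
    fi
  done
done
</code></pre>
<p>The exact-branch pattern allows only the <code>main</code> subject, while the wildcard pattern allows both, which is why the wildcard doesn't belong on a production role.</p>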
<p>Run this command to create the role using your trust policy:</p>
<pre><code class="language-shell">aws iam create-role \
  --role-name GitHubActionsOIDCRole \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role assumed by GitHub Actions via OIDC"
</code></pre>
<p>Take note of the <strong>Role ARN</strong> in the output. It will look like this:</p>
<pre><code class="language-plaintext">arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole
</code></pre>
<p>You'll need this ARN in Step 4, where you store it as a GitHub Actions variable for your workflow.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/6bb154e7-0fb3-4c58-94e1-90116eaea95a.png" alt="terminal output of the AWS CLI create-role command showing the returned Role ARN" style="display:block;margin:0 auto" width="1123" height="615" loading="lazy">
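<p>If you lose track of the ARN, or want to double-check the trust policy that was saved, you can read both back with <code>aws iam get-role</code>. The two wrapper function names below are hypothetical, and the example assumes the role name used in this tutorial.</p>
<pre><code class="language-shell"># Print the ARN of an IAM role. $1 = role name.
role_arn() {
  aws iam get-role --role-name "$1" --query 'Role.Arn' --output text
}

# Print the trust policy attached to an IAM role. $1 = role name.
role_trust_policy() {
  aws iam get-role --role-name "$1" --query 'Role.AssumeRolePolicyDocument' --output json
}

# Example:
# role_arn GitHubActionsOIDCRole
# role_trust_policy GitHubActionsOIDCRole
</code></pre>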

<h2 id="heading-step-3-attach-permissions-to-the-iam-role">Step 3: Attach Permissions to the IAM Role</h2>
<p>The IAM role can now authenticate, but it has no permissions yet. You need to attach a policy that defines what your workflow is actually allowed to do in AWS.</p>
<h3 id="heading-how-to-apply-the-principle-of-least-privilege">How to Apply the Principle of Least Privilege</h3>
<p>Only grant the permissions your workflow genuinely needs. If your workflow deploys to S3, give it S3 permissions. If it pushes images to ECR, give it ECR permissions. Never attach <code>AdministratorAccess</code> to a CI/CD role.</p>
<h4 id="heading-option-1-attach-an-aws-managed-policy-quick-start">Option 1: Attach an AWS managed policy (quick start):</h4>
<pre><code class="language-shell">aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
</code></pre>
<h4 id="heading-option-2-create-a-custom-policy-scoped-to-a-specific-s3-bucket-recommended-for-production">Option 2: Create a custom policy scoped to a specific S3 bucket (recommended for production):</h4>
<p>This approach is recommended for production because it limits the blast radius of a security incident. If your workflow credentials are ever compromised, a custom policy scoped to a specific bucket means an attacker can only affect that single bucket, not every S3 bucket in your AWS account. It also prevents accidental misconfigurations in your workflow from impacting unrelated resources.</p>
<p>Create a file called <code>s3-deploy-policy.json</code>:</p>
<pre><code class="language-json">{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
</code></pre>
<p>Then create and attach it:</p>
<pre><code class="language-shell">aws iam create-policy \
  --policy-name GitHubActionsS3DeployPolicy \
  --policy-document file://s3-deploy-policy.json
 
aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/GitHubActionsS3DeployPolicy
</code></pre>
<p>Note: You can also complete <strong>Step 3</strong> in the AWS console.</p>
<p><strong>Reference:</strong> For a full list of available AWS IAM actions, see the <a href="https://docs.aws.amazon.com/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html">AWS IAM actions reference</a>.</p>
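<p>As a quick sanity check on least privilege, you could scan a policy file for wildcard grants before attaching it. The sketch below is a crude text match, not a real policy analyzer, and the function name <code>has_wildcard_grant</code> is made up for this example. It flags any quoted value that is exactly <code>*</code> or of the form <code>service:*</code>.</p>
<pre><code class="language-shell"># Succeed if the policy JSON on stdin contains a bare "*" or a "service:*" value.
# A crude text check, not a substitute for IAM Access Analyzer.
has_wildcard_grant() {
  grep -Eq '"([A-Za-z0-9-]+:)?\*"'
}

if printf '%s\n' '"Action": ["s3:PutObject", "s3:ListBucket"]' | has_wildcard_grant; then
  echo "scoped policy: wildcard found"
else
  echo "scoped policy: no wildcard grants"
fi

if printf '%s\n' '"Action": "s3:*"' | has_wildcard_grant; then
  echo "broad policy: wildcard found"
fi

# Against the file from this step:
# cat s3-deploy-policy.json | has_wildcard_grant
</code></pre>
<p>Resource ARNs that end in <code>/*</code>, like the bucket contents in <code>s3-deploy-policy.json</code>, don't trigger the check, because the pattern only matches values that are wildcards from the first character or right after the service prefix.</p>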
<h2 id="heading-step-4-store-the-role-arn-as-a-github-actions-variable">Step 4: Store the Role ARN as a GitHub Actions Variable</h2>
<p>Before you configure your workflow, you need to make the Role ARN available to it. You'll store it as a repository variable in GitHub, not a secret, because the ARN itself isn't sensitive data.</p>
<h3 id="heading-how-to-add-the-variable-in-your-repository">How to Add the Variable in Your Repository</h3>
<p>First, open your GitHub repository and click <strong>Settings:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/b2dd526a-00ca-44eb-8d22-b78dfd220a14.png" alt="GitHub repository top navigation bar with the Settings tab highlighted" style="display:block;margin:0 auto" width="1310" height="307" loading="lazy">

<p>In the left sidebar, scroll down to <strong>Secrets and variables</strong>, then click <strong>Actions:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/61d67c83-7bbc-4570-93ec-f2ee4207ad6e.png" alt="GitHub repository settings sidebar showing Secrets and variables expanded with Actions selected" style="display:block;margin:0 auto" width="1266" height="325" loading="lazy">

<p>Then click the <strong>Variables</strong> tab (not Secrets). Click <strong>New repository variable</strong> and set the <strong>Name</strong> to:</p>
<pre><code class="language-plaintext">AWS_ROLE_ARN
</code></pre>
<p>Set the <strong>Value</strong> to your Role ARN from Step 2, for example:</p>
<pre><code class="language-plaintext">arn:aws:iam::YOUR_ACCOUNT_ID::role/GitHubActionsOIDCRole
</code></pre>
<p>Click <strong>Add variable:</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65a5bfab4c73b29396c0b895/71f5468d-d4ab-45c1-aecd-8509f575237a.png" alt="GitHub repository Actions variables tab showing AWS_ROLE_ARN variable added successfully" style="display:block;margin:0 auto" width="1083" height="377" loading="lazy">

<p>You'll reference this variable in your workflow in the next step using <code>${{ vars.AWS_ROLE_ARN }}</code>.</p>
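<p>A malformed ARN is an easy mistake to make when pasting, so it can help to check the format before storing it. Here's a minimal sketch: the function name <code>is_role_arn</code> and the account ID are placeholders, the pattern only covers the standard <code>aws</code> partition, and the commented <code>gh variable set</code> line assumes you have the GitHub CLI installed and authenticated for the repository.</p>
<pre><code class="language-shell"># Succeed only if $1 looks like an IAM role ARN (12-digit account, single colons).
is_role_arn() {
  printf '%s\n' "$1" | grep -Eq '^arn:aws:iam::[0-9]{12}:role/.+$'
}

ROLE_ARN="arn:aws:iam::123456789012:role/GitHubActionsOIDCRole"  # placeholder account ID

if is_role_arn "$ROLE_ARN"; then
  echo "ARN format looks valid"
  # With the GitHub CLI, you could then store it without opening the browser:
  # gh variable set AWS_ROLE_ARN --body "$ROLE_ARN"
else
  echo "ARN format looks wrong"
fi
</code></pre>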
<h2 id="heading-step-5-configure-your-github-actions-workflow">Step 5: Configure Your GitHub Actions Workflow</h2>
<p>With AWS and GitHub fully configured, you now need to update your workflow to request an OIDC token and use it to authenticate.</p>
<h3 id="heading-how-to-set-the-required-workflow-permissions">How to Set the Required Workflow Permissions</h3>
<p>Your workflow <strong>must</strong> declare <code>id-token: write</code>. Without this, GitHub won't issue an OIDC token to the runner.</p>
<pre><code class="language-yaml">permissions:
  id-token: write   # Required to request the OIDC JWT
  contents: read    # Required to checkout the repository
</code></pre>
<p><strong>Important:</strong> If you set permissions at the job level, they override any top-level permissions. Make sure <code>id-token: write</code> is present at whichever level your AWS authentication step runs.</p>
<h3 id="heading-full-workflow-example">Full Workflow Example</h3>
<p>Here's a complete workflow that authenticates to AWS using OIDC and deploys a static site to S3:</p>
<pre><code class="language-yaml">name: Deploy to AWS S3
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write
  contents: read
 
jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-2
 
      - name: Verify AWS identity
        run: aws sts get-caller-identity
 
      - name: Deploy to S3
        run: |
          aws s3 sync ./code s3://your-bucket-name
</code></pre>
<p>Replace the following before committing:</p>
<table>
<thead>
<tr>
<th>Placeholder</th>
<th>Replace With</th>
</tr>
</thead>
<tbody><tr>
<td><code>AWS_ROLE_ARN</code></td>
<td>The variable name for your IAM role ARN in GitHub</td>
</tr>
<tr>
<td><code>us-east-2</code></td>
<td>Your target AWS region</td>
</tr>
<tr>
<td><code>your-bucket-name</code></td>
<td>Your S3 bucket name</td>
</tr>
<tr>
<td><code>./code</code></td>
<td>The local directory containing the files you want to sync to S3</td>
</tr>
</tbody></table>
<p>You can see the code sample in my GitHub Repo <a href="https://github.com/tolani-akintayo/OpenID-Connect-in-GitHub-Actions-for-AWS">here</a>.</p>
<p><strong>Note:</strong> The <code>aws-actions/configure-aws-credentials</code> action handles the entire OIDC token exchange automatically. It requests the JWT from GitHub, calls <code>sts:AssumeRoleWithWebIdentity</code>, and exports the temporary credentials as environment variables for the rest of the job.</p>
<p>See the <a href="https://github.com/aws-actions/configure-aws-credentials">action's official documentation</a> for all available options.</p>
<h2 id="heading-step-6-run-and-verify-your-workflow">Step 6: Run and Verify Your Workflow</h2>
<p>Push your workflow to the <code>main</code> branch and open the <strong>Actions</strong> tab in your repository to watch it run.</p>
<h3 id="heading-what-a-successful-run-looks-like">What a Successful Run Looks Like</h3>
<p>The Configure AWS credentials via OIDC step should show:</p>
<pre><code class="language-plaintext">Assuming role with OIDC: arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole
</code></pre>
<p>The Verify AWS identity step (<code>aws sts get-caller-identity</code>) should return:</p>
<pre><code class="language-json">{
    "UserId": "AROA...:GitHubActions",
    "Account": "YOUR_ACCOUNT_ID",
    "Arn": "arn:aws:sts::YOUR_ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions"
}
</code></pre>
<p>If you see an <code>assumed-role</code> ARN in the output, OIDC is working correctly. Your workflow is now authenticating to AWS without a single stored credential.</p>
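<p>If you want the workflow itself to make this check rather than reading the log by eye, a small pattern match is enough. This is a sketch under the assumption that you use the role name from this tutorial. The function name <code>is_assumed_role</code> and the sample ARN are illustrative only.</p>
<pre><code class="language-shell"># Succeed if $1 is an STS assumed-role ARN for the expected role name ($2).
is_assumed_role() {
  case "$1" in
    arn:aws:sts::*:assumed-role/"$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# In the workflow, CALLER_ARN would come from:
#   aws sts get-caller-identity --query Arn --output text
CALLER_ARN="arn:aws:sts::123456789012:assumed-role/GitHubActionsOIDCRole/GitHubActions"

if is_assumed_role "$CALLER_ARN" "GitHubActionsOIDCRole"; then
  echo "OIDC role assumed"
else
  echo "unexpected identity: $CALLER_ARN"
fi
</code></pre>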
<h2 id="heading-security-best-practices">Security Best Practices</h2>
<p>Getting OIDC working is step one. Locking it down properly is step two.</p>
<h3 id="heading-scope-the-sub-condition-as-tightly-as-possible">Scope the <code>sub</code> Condition as Tightly as Possible</h3>
<p>Don't use a wildcard like <code>repo:your-org/*:*</code> that allows any repository in your organization to assume the role. Scope it to the exact repository and branch that needs access.</p>
<pre><code class="language-json">"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
</code></pre>
<h3 id="heading-use-github-environments-for-production-deployments">Use GitHub Environments for Production Deployments</h3>
<p>GitHub Environments let you add manual approval gates and restrict which branches can deploy. When combined with OIDC, you can scope your trust policy to only allow the <code>production</code> environment:</p>
<pre><code class="language-json">"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"
</code></pre>
<h3 id="heading-apply-least-privilege-permissions-to-every-iam-role">Apply Least-Privilege Permissions to Every IAM Role</h3>
<p>Never attach <code>AdministratorAccess</code> or <code>PowerUserAccess</code> to a role used by CI/CD. Define a custom policy with only the actions your workflow actually needs.</p>
<h3 id="heading-create-separate-iam-roles-per-environment">Create Separate IAM Roles Per Environment</h3>
<p>A staging role and a production role should have different permission scopes. Your staging deployment role should never have write access to production resources.</p>
<h3 id="heading-enable-aws-cloudtrail">Enable AWS CloudTrail</h3>
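<p>API calls made with the temporary credentials are recorded in CloudTrail under the assumed role ARN, which gives you an audit trail of what your workflow did in AWS. Note that object-level S3 operations are data events, which CloudTrail only records if you enable data event logging on a trail.</p>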
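<p>To check that role assumptions are showing up in CloudTrail, you could query the recent event history from the CLI. This is a sketch: the function name <code>recent_oidc_logins</code> is hypothetical, and depending on which STS endpoint your workflow uses, you may need to add <code>--region</code> to look in the right place.</p>
<pre><code class="language-shell"># List recent AssumeRoleWithWebIdentity events recorded by CloudTrail.
# $1 = maximum number of events to return (defaults to 5).
recent_oidc_logins() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleWithWebIdentity \
    --max-results "${1:-5}" \
    --query 'Events[].[EventTime,Username]' \
    --output text
}

# Example:
# recent_oidc_logins 10
</code></pre>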
<p><strong>Reference:</strong> GitHub's official security hardening guide for OIDC: <a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect">About security hardening with OpenID Connect</a></p>
<h2 id="heading-troubleshooting-common-errors">Troubleshooting Common Errors</h2>
<h3 id="heading-error-not-authorized-to-perform-stsassumerolewithwebidentity">Error: <code>Not authorized to perform sts:AssumeRoleWithWebIdentity</code></h3>
<p>This usually means the trust policy on your IAM role doesn't match the <code>sub</code> claim in the JWT.</p>
<p>Check the following:</p>
<ul>
<li><p>The <code>sub</code> condition exactly matches your repository path (it is case-sensitive)</p>
</li>
<li><p>The <code>aud</code> condition is set to <code>sts.amazonaws.com</code></p>
</li>
<li><p>The <code>Federated</code> principal uses the correct AWS account ID</p>
</li>
</ul>
<p>To inspect the actual token claims your workflow is receiving, add this debug step temporarily:</p>
<pre><code class="language-yaml">- name: Print OIDC token claims
  run: |
    TOKEN=$(curl -s -H "Authorization: Bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&amp;audience=sts.amazonaws.com" | jq -r '.value')
    # JWT segments are base64url-encoded, so map the URL-safe characters back first
    echo "$TOKEN" | cut -d '.' -f2 | tr '_-' '/+' | base64 -d 2&gt;/dev/null | jq .
</code></pre>
<h3 id="heading-error-could-not-load-credentials-from-any-providers">Error: <code>Could not load credentials from any providers</code></h3>
<p>This almost always means <code>id-token: write</code> is missing from your workflow permissions. Double-check that you have:</p>
<pre><code class="language-yaml">permissions:
  id-token: write
  contents: read
</code></pre>
<h3 id="heading-error-accessdenied-when-calling-an-aws-service">Error: <code>AccessDenied</code> When Calling an AWS Service</h3>
<p>Authentication succeeded but the IAM role doesn't have permission to perform the action your workflow is attempting. Check the permissions policy attached to your role and compare it against the specific action in the error message.</p>
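<p>Rather than re-running the workflow after every policy change, you can ask IAM to evaluate the role's policies directly. The sketch below wraps <code>aws iam simulate-principal-policy</code>. The function name <code>can_role_do</code> and the example object key are placeholders, and the identity you run it as needs permission to call the simulator.</p>
<pre><code class="language-shell"># Ask IAM whether a role's policies allow one action on one resource.
# $1 = role ARN, $2 = action (for example s3:PutObject), $3 = resource ARN.
can_role_do() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names "$2" \
    --resource-arns "$3" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}

# Example:
# can_role_do "arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole" \
#   s3:PutObject "arn:aws:s3:::your-bucket-name/index.html"
</code></pre>
<p>Each line of output pairs the action with the simulator's decision, so you can see whether the action from the error message is covered before you push another commit.</p>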
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've gone from storing static, long-lived AWS credentials in GitHub Secrets to a fully keyless authentication setup using OIDC. Here's what you accomplished:</p>
<ul>
<li><p>Registered GitHub as a trusted OIDC identity provider in AWS.</p>
</li>
<li><p>Created an IAM role with a scoped trust policy tied to a specific repository.</p>
</li>
<li><p>Attached least-privilege permissions to that role.</p>
</li>
<li><p>Configured your GitHub Actions workflow to request and use short-lived AWS credentials.</p>
</li>
<li><p>Verified the authentication flow end-to-end.</p>
</li>
</ul>
<p>This pattern works across AWS services, including S3, ECS, Lambda, ECR, Secrets Manager, and more. The workflow example here uses S3, but you only need to swap out the permissions policy and the deployment commands to adapt it for any service.</p>
<p>If you want to go further, explore:</p>
<ul>
<li><p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect#supported-cloud-providers">Configuring OIDC for multiple cloud providers</a>: Azure, GCP, and HashiCorp Vault.</p>
</li>
<li><p><a href="https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment">GitHub Environments and deployment protection rules</a>: for multi-stage pipelines with approval gates.</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html">AWS IAM Access Analyzer</a>: to validate and tighten your role policies automatically.</p>
</li>
</ul>
<p><em>If you're building out your DevOps practice and want a complete, production-ready reference for infrastructure automation, CI/CD, and platform engineering, check out</em> <a href="https://coachli.co/tolani-akintayo/PR-H4oQS"><em><strong>The Startup DevOps Field Guide</strong></em></a><em>. It covers the patterns, templates, and runbooks I've used across real AWS environments.</em></p>
<p><em>You can also connect with me on</em> <a href="https://www.linkedin.com/in/tolani-akintayo"><em>LinkedIn</em></a><em>.</em></p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect">GitHub Docs: About security hardening with OpenID Connect</a></p>
</li>
<li><p><a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services">GitHub Docs: Configuring OpenID Connect in Amazon Web Services</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html">AWS Docs: Creating OpenID Connect (OIDC) identity providers</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html">AWS Docs: AssumeRoleWithWebIdentity API Reference</a></p>
</li>
<li><p><a href="https://github.com/aws-actions/configure-aws-credentials">aws-actions/configure-aws-credentials - GitHub</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/service-authorization/latest/reference/reference_policies_actions-resources-contextkeys.html">AWS IAM Actions Reference</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html">AWS CloudTrail User Guide</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
