Tolani Akintayo - freeCodeCamp.org

Common DevOps Mistakes and How to Avoid Them — Tips for Startups

Tolani Akintayo — Thu, 14 May 2026 17:53:38 +0000

Most DevOps engineers don't fail because they lack knowledge about tools. They fail because nobody told them what not to do before they got into production.

Startup environments make this worse. The pressure to ship fast, the small team sizes, and the absence of senior engineers to review your decisions mean mistakes happen quietly until they become outages, data loss events, or security incidents that cost the company thousands of dollars and weeks of recovery time.

This article is a direct breakdown of the ten most costly DevOps mistakes engineers make early in their careers at startups. For each mistake, you will get the real-world scenario, the business impact, and the concrete fix you can apply immediately.

Whether you are setting up your first production environment or auditing an existing one, this guide will help you build systems that are reliable, secure, and aligned with what the business actually needs.

Who This Article Is For
Why Startups Are a Different Environment
Mistake 1: Deploying Without Understanding What You're Deploying
Mistake 2: Using Production as a Development Environment
Mistake 3: Hardcoding Secrets and Credentials
Mistake 4: Overengineering for Problems You Don't Have Yet
Mistake 5: No Observability Before Launch
Mistake 6: Treating Security as a Final Step
Mistake 7: Manual Deployments in Production
Mistake 8: No Disaster Recovery Plan
Mistake 9: No Documentation or Runbooks
Mistake 10: Solving Technical Problems Without Understanding the Business
The System Thinking Framework Every DevOps Engineer Needs
Your Production Readiness Checklist
Conclusion

Who This Article Is For

Early-career DevOps and cloud engineers who are building or maintaining production infrastructure at a startup.
Backend developers who have recently taken on DevOps responsibilities.
Engineers joining a startup who want to understand what operational discipline actually looks like in a fast-moving environment.

You do not need to be an expert in any specific tool to follow this article. The focus is on decision-making patterns and operational discipline, not tool configuration.

Why Startups Are a Different Environment

Before getting into the mistakes, you have to understand why startups produce them in the first place.

In a large company, you typically have dedicated security engineers, an SRE team, a platform team, and multiple reviewers for every infrastructure change. In a startup, you most likely have one engineer responsible for all of that simultaneously.

This creates four specific pressure points:

Speed pressure. The business needs features shipped now. Operational discipline gets treated as optional because nobody is watching closely yet.
Budget constraints. Every infrastructure decision has a direct impact on company runway. Engineers optimize for the cheapest option rather than the most reliable one.
Absent guardrails. There is no senior engineer reviewing your Terraform plans. There is no security audit before launch. The absence of immediate consequences can make bad decisions feel like good ones.
Constantly changing requirements. The architecture you design today may need to support a completely different product in six months.

None of these pressures are excuses for poor decisions. But understanding them helps you see why the following mistakes happen so consistently.

Mistake 1: Deploying Without Understanding What You're Deploying

The Scenario

A junior engineer is asked to deploy the company's Node.js API to AWS. They find a tutorial for Elastic Beanstalk, follow it, and it works. Two weeks later, traffic increases. They try to scale "the same way as in the tutorial." The application goes down. They cannot debug it because they never understood what the deployment was actually doing.

The Business Impact

When production breaks and the person who deployed the system cannot explain how it works, diagnosis takes hours instead of minutes. The longer the incident runs, the higher the cost in customer trust, team morale, and potentially direct revenue loss.

The Fix

Before you deploy anything to production, you should be able to answer these five questions in writing:

What compute type is running my code? (EC2, Lambda, Fargate, container?)
How does a new version replace the old one? (Rolling? Blue/green? All-at-once?)
Where do configuration and secrets come from? (SSM? Secrets Manager? Environment file?)
What downstream services depend on this? (Database connections? Other APIs? Cache?)
How do I roll back in under five minutes if this breaks?

If you cannot answer all five, do not deploy until you can. The tutorial that got it running is not the documentation for how it operates.

"It is better to spend two hours understanding a system before deploying it than two days debugging it after something breaks."

Personally, when learning a new technology, tool, or implementing something I have not worked with before, I usually focus on three core questions: What, Why, and How.

The first question is: What is this technology or concept about?
This helps me build a solid foundation by doing deep research, studying the official documentation, understanding the core principles, and sometimes even learning the history behind the tool or technology. I believe having a well-grounded understanding before implementation is very important.
The second question is: Why do we need it?
I try to understand the value the technology brings, why it should be implemented, what problem it solves, and how it benefits the team or organization. This helps me make informed technical decisions instead of just implementing tools without understanding their purpose.
The third question is: How should it be implemented?
There are usually multiple approaches to solving a problem or implementing a technology, so I focus on understanding the best and most practical approach based on the use case and expected outcome.

This structured approach has helped me learn new technologies quickly, adapt fast, and implement solutions effectively in real-world environments.

Mistake 2: Using Production as a Development Environment

The Scenario

To save time, an engineer tests a new deployment script directly in the production AWS account. They accidentally run a command that terminates the production database instance. Automated backups exist but were misconfigured. Six hours of customer data is unrecoverable.

This scenario happens more often than you would expect. The reasoning is always the same: "It will only take a minute."

The Business Impact

A single test-in-production incident can result in data loss, hours of downtime, and a customer communication crisis. In a startup, that can permanently damage the company's reputation before it has had the chance to build one.

The Fix

You need at minimum three separate environments and ideally three separate AWS accounts:

Environment	Purpose	Access Level
dev	Break things freely. No real data.	Engineers have broad access
staging	Mirror of production. Final verification.	Controlled access
production	Real customers. Real data.	MFA required. No manual deployments.

Using separate AWS accounts (not just separate VPCs) gives you account-level isolation. A permission error in the dev account cannot accidentally touch production infrastructure at the API level.

Infrastructure as Code (Terraform or CloudFormation) makes this affordable: you write the configuration once and apply it three times with different variable files.

# terraform/environments/prod/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "production"
  instance_type     = "t3.medium"
  db_instance_class = "db.t3.medium"
  multi_az          = true
}

# terraform/environments/staging/main.tf
module "app" {
  source            = "../../modules/app"
  environment       = "staging"
  instance_type     = "t3.small"
  db_instance_class = "db.t3.small"
  multi_az          = false
}

The module is the same. The environment-specific variables are different. Separate environments are not a luxury; they are the minimum operating standard for any team running real software.

Mistake 3: Hardcoding Secrets and Credentials

The Scenario

A new engineer joins a startup and clones the repository. Inside they find a .env file committed to Git containing the production database password, the Stripe secret key, and an AWS access key with admin permissions. The repository has been public for six months.

GitHub's automated secret scanning never triggered because the secrets were inside a .env file rather than raw in the code. The credentials had been valid and actively used for over six months.

The Business Impact

Automated scanners run by attackers find exposed credentials within minutes of them being pushed to a public repository. A single exposed AWS access key with admin permissions can result in:

Crypto-mining workloads generating thousands of dollars in cloud bills overnight
Complete exfiltration of customer data from every S3 bucket
Privilege escalation: the attacker creates new admin users and locks you out of your own account
AWS account suspension while the investigation runs

According to GitHub's annual security report, millions of secrets are exposed in public repositories every year. The average time to detect a compromised cloud credential is 197 days.

The Fix

Step 1: Never commit secrets to Git. Not temporarily. Not in a branch. Not in a private repository.

Step 2: Add .gitignore before you create the first file. Check in the .gitignore with the first line of code before any .env files exist.

# .gitignore
.env
.env.*
*.pem
*.key
secrets/

Step 3: Use AWS Secrets Manager or SSM Parameter Store for all production secrets. Your application reads secrets at runtime:

# Python example — fetch secret at runtime, never at build time
import boto3
import json
 
def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
 
# Usage
db_config = get_secret("prod/myapp/database")
DATABASE_URL = db_config["connection_string"]

Step 4: Scan your existing repositories immediately. You may already have a problem:

# Install TruffleHog v3 to scan your repo history for exposed secrets.
# (pip install trufflehog installs the older v2 release, which does not
# support the subcommands used below.)
brew install trufflehog

# Scan the entire commit history of your repository
trufflehog git file://.

# Or scan a remote GitHub repo
trufflehog github --repo https://github.com/your-org/your-repo

Step 5: Add a pre-commit hook to prevent future accidents:

pip install pre-commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/awslabs/git-secrets
    rev: master
    hooks:
      - id: git-secrets
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.4.0
    hooks:
      - id: detect-secrets

pre-commit install
# Now the hook runs before every commit and blocks detected secrets
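
To make concrete what a secret-scanning hook looks for, here is a minimal Python sketch. It assumes only two illustrative patterns (the AKIA/ASIA prefix format of AWS access key IDs and PEM private key headers). Tools such as detect-secrets and git-secrets ship far broader rule sets, and the function name `find_secrets` is hypothetical rather than part of either tool.

```python
import re

# Illustrative patterns only; real scanners use many more rules plus entropy checks.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used throughout AWS documentation.
sample = "DEBUG=true\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n"
print(find_secrets(sample))
```

A hook built on this idea would run the check against staged changes and exit non-zero when the list is not empty, which is what blocks the commit.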

A publicly exposed database password cannot be un-exposed; all you can do is rotate it and audit what was accessed. The fix takes ten minutes upfront. The incident takes weeks.

Mistake 4: Overengineering for Problems You Don't Have Yet

The Scenario

A five-person startup with 200 users decides to build a microservices architecture on Kubernetes because "Netflix uses it." They spend three months setting up Kubernetes, Istio service mesh, ArgoCD, Vault, Prometheus, and Grafana. Their product has not shipped a new feature in three months. A competitor with a monolith on a single EC2 instance shipped twelve new features in the same period.

The Business Impact

Every layer of infrastructure you add is a layer that can break, a layer that requires expertise to operate, and a layer that slows down every future change. Kubernetes is the right answer for organizations with the scale and team size to operate it. For a five-person startup, it is an expensive distraction.

Premature complexity does not just cost engineering time. It costs the competitive advantage that speed provides in the early stage.

The Fix

Match your infrastructure to your actual stage:

Scale	Right Infrastructure	Approximate Cost
1–1K users	Single EC2 + RDS + Nginx reverse proxy	$20–50/month
1K–50K users	Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD	$200–500/month
50K–500K users	ECS Fargate, RDS read replicas, ElastiCache, full observability	$1K–5K/month
500K+ users	Multi-region, managed Kubernetes, dedicated SRE	$10K+/month
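
The table can also be read as a simple lookup. The sketch below encodes its user-count thresholds in Python; the function name `recommend_infrastructure` is hypothetical, and the thresholds are rough guides copied from the table rather than hard limits.

```python
# Upper bounds and descriptions copied from the table; treat them as rough guides.
TIERS = [
    (1_000, "Single EC2 + RDS + Nginx reverse proxy"),
    (50_000, "Auto-scaling group, RDS Multi-AZ, ALB, basic CI/CD"),
    (500_000, "ECS Fargate, RDS read replicas, ElastiCache, full observability"),
]
LARGEST_TIER = "Multi-region, managed Kubernetes, dedicated SRE"


def recommend_infrastructure(active_users: int) -> str:
    """Return the table row that matches the current number of users."""
    if active_users < 0:
        raise ValueError("active_users must not be negative")
    for upper_bound, infrastructure in TIERS:
        if active_users <= upper_bound:
            return infrastructure
    return LARGEST_TIER


# The five-person startup from the scenario, with 200 users:
print(recommend_infrastructure(200))
```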

The question to ask before every infrastructure decision is: "What specific, measurable problem does this solve today that my current setup cannot solve?"

Amazon, Netflix, and Uber did not start with microservices. They started with monoliths and extracted services only when the monolith became the actual bottleneck. You are not Netflix. You are solving the problems in front of you today.

Use managed services wherever possible: RDS instead of self-hosted Postgres, Fargate instead of self-managed Kubernetes, ElastiCache instead of self-hosted Redis. Managed services let your team focus on the product instead of the infrastructure.

Mistake 5: No Observability Before Launch

The Scenario

A startup's checkout flow breaks on a Friday evening. Users are abandoning their carts and the company is losing revenue. The DevOps engineer finds out 45 minutes later because a customer sent a direct message to the CEO on Twitter.

The engineer has no dashboards, no log aggregation, and no alerting. They SSH into the production server and scroll through raw log files. Two hours later, they find the issue: a database connection pool was exhausted by a memory leak introduced in that morning's deployment.

The Business Impact

Without observability:

You find out about production problems from users, not from your systems
Incidents take 10x longer to resolve because diagnosis is guesswork
You cannot tell whether a deployment improved or degraded performance
You have no data for making better architecture decisions

The Fix

Implement the four golden signals before any service goes to production. These come from Google's Site Reliability Engineering book:

Latency: How long requests take to complete (p50, p95, p99)
Traffic: How many requests per second the system is handling
Errors: The rate of failed requests (5xx responses per minute)
Saturation: How close the system is to its limits (CPU, memory, connection pool)
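
As a rough illustration of how the first three signals fall out of ordinary request logs, here is a minimal Python sketch. The `Request` record and the function names are hypothetical. Saturation is left out because it comes from resource metrics (CPU, memory, connection pool usage), not from request records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status_code: int


def percentile(sorted_values: list[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def golden_signals(requests: list[Request], window_seconds: float) -> dict:
    """Summarize latency, traffic, and errors for one time window."""
    if not requests:
        return {"traffic_rps": 0.0, "error_rate": 0.0, "latency_ms": {}}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status_code >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "latency_ms": {f"p{p}": percentile(durations, p) for p in (50, 95, 99)},
    }
```

In practice a metrics system computes these for you; the point of the sketch is that each signal is a concrete number you can alert on, not a feeling about how the system is doing.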

Here is a minimal CloudWatch alarm setup using the AWS CLI:

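# Alert when the target 5XX error rate exceeds 1% for 5 consecutive minutes.
# An error rate needs metric math: 5XX count divided by total request count.

aws cloudwatch put-metric-alarm \
  --alarm-name "high-error-rate-production" \
  --alarm-description "Error rate exceeded 1% for 5 minutes" \
  --evaluation-periods 5 \
  --threshold 1 \
  --comparison-operator "GreaterThanThreshold" \
  --treat-missing-data "notBreaching" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-production" \
  --metrics '[
    {"Id": "errors", "ReturnData": false, "MetricStat": {"Stat": "Sum", "Period": 60, "Metric": {
      "Namespace": "AWS/ApplicationELB", "MetricName": "HTTPCode_Target_5XX_Count",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}]}}},
    {"Id": "requests", "ReturnData": false, "MetricStat": {"Stat": "Sum", "Period": 60, "Metric": {
      "Namespace": "AWS/ApplicationELB", "MetricName": "RequestCount",
      "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}]}}},
    {"Id": "error_rate", "ReturnData": true, "Expression": "100 * errors / requests",
     "Label": "5XX error rate (%)"}
  ]'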

Every application should also expose a /health endpoint that returns 200 OK when healthy:

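# FastAPI example

import os

from fastapi import FastAPI, Response, status
from sqlalchemy import create_engine, text

app = FastAPI()

# Assumes the connection string is provided in a DATABASE_URL environment variable
engine = create_engine(os.environ["DATABASE_URL"])

@app.get("/health")
def health_check(response: Response):
    # Check database connectivity with a trivial query
    try:
        with engine.connect() as connection:
            connection.execute(text("SELECT 1"))
        db_status = "healthy"
    except Exception:
        db_status = "unhealthy"
        # Return 503 so load balancers and uptime monitors treat this instance as failing
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return {
        "status": "healthy" if db_status == "healthy" else "degraded",
        "database": db_status,
        "version": os.getenv("APP_VERSION", "unknown")
    }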

Your load balancer checks this endpoint. Your uptime monitor checks it. You check it after every deployment.
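
The "check it after every deployment" step is easy to automate. Below is a minimal Python sketch of a post-deploy poller. The function names and retry defaults are assumptions for illustration, and the HTTP check expects a JSON body with a `status` field like the one described for the /health endpoint.

```python
import json
import time
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers 2xx with status == "healthy"."""
    # urlopen raises HTTPError (a subclass of OSError) for 4xx and 5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = json.load(response)
    return body.get("status") == "healthy"


def wait_until_healthy(check: Callable[[], bool], attempts: int = 10,
                       delay_seconds: float = 3.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a health check until it passes or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error: not healthy yet
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Example call (hypothetical URL):
# ok = wait_until_healthy(lambda: http_health_check("https://api.example.com/health"))
```

Passing the check in as a function keeps the retry logic independent of HTTP, so the same loop can be reused for other readiness checks.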

You do not get to say a system is working unless you have data to prove it. "Nobody complained" is not the same as "nothing is broken."

Mistake 6: Treating Security as a Final Step

The Scenario

A startup rushes to launch their MVP. Security reviews are "planned for after launch." Six months later, a potential enterprise customer requires a security audit before signing a contract. The audit reveals:

S3 buckets publicly accessible by default
EC2 instances with port 22 open to 0.0.0.0/0
IAM users with AdministratorAccess for the entire team
No encryption on the database at rest
JWT secrets hardcoded in environment variables

The audit fails. The enterprise deal worth $120,000 annually is lost. Remediation takes four weeks of engineering time.

The Business Impact

Security debt is the most expensive technical debt you can accumulate. Unlike performance debt that degrades gradually, security vulnerabilities cause sudden, catastrophic events: data breaches, ransomware, account takeovers, and regulatory fines. At a startup, any one of these can end the company.

The Fix

Apply these six security controls before the first line of production code ships:

1. Apply the principle of least privilege so every IAM role gets only what it needs:

One of the most common security mistakes in AWS is granting roles more permissions than they need, either out of convenience (s3:*) or uncertainty about what the service actually requires. This creates unnecessary risk: if a role is compromised, the attacker inherits every permission you granted.

The fix is simple: look at what your service actually does, then write a policy that allows exactly that.

If your app uploads and reads files from a specific S3 bucket, the policy should say exactly that:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-uploads/*"
    }
  ]
}

Notice the Resource is scoped to my-app-uploads/*, not all S3 buckets. And the Action list covers only GetObject and PutObject, not DeleteObject and not s3:*. If the service gets compromised, the attacker can read and write to that one bucket. That is it. The rest of your account is untouched.
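
A quick way to keep policies scoped like this is to lint them before they are applied. The following Python sketch flags Allow statements that use a wildcard action or resource. The function name is hypothetical and the check is deliberately simple; it is not a replacement for IAM Access Analyzer.

```python
def find_overbroad_statements(policy: dict) -> list[str]:
    """Describe every Allow statement that uses a wildcard action or resource."""
    problems = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                problems.append(f"statement {index}: wildcard action {action!r}")
        for resource in resources:
            if resource == "*":
                problems.append(f"statement {index}: wildcard resource '*'")
    return problems


scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::my-app-uploads/*",
    }],
}
print(find_overbroad_statements(scoped_policy))
```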

2. Block all S3 public access by default:

AWS S3 buckets are private by default when created, but that can be overridden at the bucket level, the object level, or through a bucket policy. Misconfigured S3 buckets are one of the most common causes of data breaches, and they are almost always accidental.

The safest approach is to enable the "Block Public Access" settings, which override ACLs and bucket policies and prevent a bucket from being made public even if someone tries. The command below applies all four settings to a single bucket:

aws s3api put-public-access-block \
  --bucket my-app-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Run this for every bucket you create. Better yet, enable the same four settings at the AWS account level (aws s3control put-public-access-block with --account-id) so they apply to every existing and future bucket in the account.

3. Never open SSH to the internet; use AWS Systems Manager Session Manager instead:

Port 22 open to 0.0.0.0/0 is an attack surface that exists on thousands of AWS instances right now. Brute-force bots scan the internet continuously looking for open SSH ports. Even with a strong key, the exposure is unnecessary because AWS provides a better alternative.

AWS Systems Manager Session Manager gives you full shell access to any EC2 instance without opening a single inbound port on the security group. There is no port to scan, no port to attack, and every session start is recorded in CloudTrail. Full session transcripts can additionally be sent to CloudWatch Logs or S3 if you enable session logging:

# Start a session on an EC2 instance without port 22 open
aws ssm start-session --target i-0123456789abcdef0

To use Session Manager, the EC2 instance needs the SSM Agent installed (included by default on Amazon Linux 2 and Ubuntu 20.04+) and an IAM instance profile with the AmazonSSMManagedInstanceCore policy attached. Once that is set up, you can close port 22 on the security group entirely.

4. Enable MFA for all IAM users and enforce it via policy:

A leaked IAM username and password with no MFA is a fully compromised account. Multi-factor authentication is the single most effective control against credential theft, and it costs nothing to enable.

Enforce it through an IAM policy that denies all actions when MFA is not present, except the actions needed to set up MFA in the first place. This means even if a set of credentials is stolen, the attacker cannot do anything without the second factor.

The AWS documentation provides a complete deny-without-MFA example policy. Attach it to every IAM user or group in your account. This is a one-time setup that permanently raises your account's security baseline.
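
The shape of such a policy is easier to see written out. The Python sketch below builds an abbreviated version of the pattern: deny everything except a short list of MFA enrollment actions whenever the request was not authenticated with MFA. The action list is an assumption shortened for illustration, so check it against the current AWS documentation before attaching anything.

```python
import json

# Actions a user must still be able to call without MFA in order to enroll a device.
# Abbreviated and assumed for illustration; verify against the AWS example policy.
MFA_SETUP_ACTIONS = [
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:GetUser",
    "iam:ListMFADevices",
    "iam:ListVirtualMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def build_deny_without_mfa_policy() -> dict:
    """Return an IAM policy document that denies non-MFA requests."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyAllExceptListedIfNoMFA",
            "Effect": "Deny",
            "NotAction": MFA_SETUP_ACTIONS,
            "Resource": "*",
            # BoolIfExists also matches requests where the MFA key is absent,
            # such as calls made with long-term access keys.
            "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
        }],
    }


print(json.dumps(build_deny_without_mfa_policy(), indent=2))
```

Because the statement is a Deny, it overrides any Allow the user has elsewhere, which is why it can be attached alongside existing permissions.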

5. Enable CloudTrail in all regions:

Without CloudTrail, you have no record of who did what in your AWS account. If a credential is compromised, you cannot investigate what the attacker accessed. If an engineer accidentally deletes a resource, you cannot trace it. You are operating blind.

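CloudTrail logs AWS API activity: who made each call, from which IP, at what time, and what the response was. Management events are recorded by default; data events such as S3 object reads must be enabled separately. Enable it across all regions so activity in regions you do not actively use is also captured: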

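aws cloudtrail create-trail \
  --name production-audit-trail \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation

# create-trail only defines the trail; logging must be started explicitly
aws cloudtrail start-logging --name production-audit-trail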

The --enable-log-file-validation flag makes CloudTrail deliver digest files that let you verify the logs have not been tampered with. This matters if you ever need to use these logs in a security investigation or compliance audit. Once this is running, every AssumeRole, every DeleteBucket, and every RunInstances call in your account is recorded in the log bucket.

6. Run AWS Security Hub from day one:

Most teams only discover security misconfigurations after a breach or a compliance audit. Security Hub inverts this: it continuously scans your AWS environment against industry-standard frameworks (CIS AWS Foundations Benchmark, AWS Foundational Security Best Practices) and surfaces findings before they become incidents.

Enabling it takes a single command:

aws securityhub enable-security-hub

Within minutes, Security Hub gives your account a compliance score and a prioritized list of findings. A finding might tell you that a security group has port 22 open to the world, that an S3 bucket has logging disabled, or that root account credentials were recently used. Each finding includes the affected resource and a remediation guide.

Treat every Security Hub finding the same way you treat a production bug: assign it a priority, assign an owner, and close it. A finding sitting unaddressed for 30 days is a known vulnerability you chose to leave open.

Mistake 7: Manual Deployments in Production

The Scenario

A startup's deployment process is documented in a Notion page that is four months out of date. It involves SSH-ing into the server, running git pull, running npm install, and restarting the PM2 process. Different engineers do it slightly differently. One engineer, rushing a late-night release, skips npm install. The application starts crashing because a new dependency is missing.

The Business Impact

Manual deployment processes are inherently unreliable. Humans under pressure skip steps, perform steps in the wrong order, and remember procedures differently. Every manual step in a production deployment process is a scheduled incident waiting for the right moment of stress.

The Fix

If a deployment step is performed manually more than twice, it needs to be automated. Here is a minimal but complete GitHub Actions deployment workflow for an ECS Fargate service:

# .github/workflows/deploy.yml
name: Deploy to Production
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write   # Required for OIDC authentication with AWS
  contents: read
 
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
          aws-region: us-east-1
 
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
 
      - name: Build and push Docker image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/my-app:$IMAGE_TAG .
          docker push $ECR_REGISTRY/my-app:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/my-app:$IMAGE_TAG" >> $GITHUB_OUTPUT
 
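      # Insert the freshly built image into the task definition.
      # container-name must match the container in task-definition.json ("my-app" is assumed here).
      - name: Render task definition with the new image
        id: render
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: task-definition.json
          container-name: my-app
          image: ${{ steps.build.outputs.image }}

      - name: Deploy to Amazon ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: my-app-service
          cluster: production
          wait-for-service-stability: true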

Notice wait-for-service-stability: true. Without this, the workflow reports success the moment ECS accepts the new task definition, before the containers are actually healthy. With it, the workflow fails if the new containers crash. You want to know immediately, not discover it from user reports thirty minutes later.

Mistake 8: No Disaster Recovery Plan

The Scenario

A startup's production database runs on a single RDS instance with no Multi-AZ configuration. Automated backups are enabled but have never been tested. The EBS volume backing the instance fails. AWS provisions a new instance from the last snapshot, which is 18 hours old. 18 hours of customer data is permanently lost.

The startup had no disaster recovery plan, no tested recovery procedure, and no communication template ready for customers.

The Business Impact

The question is not whether your infrastructure will fail. It will fail. Every database, every server, every availability zone experiences failures. The question is whether you have a tested plan for when it does.

Data loss of any magnitude is serious. For startups that handle financial data, healthcare data, or anything under GDPR, even partial data loss can trigger regulatory consequences.

The Fix

Define your RTO and RPO before you design anything:

RTO (Recovery Time Objective): How long can the business survive without this system? A payment API might have an RTO of 15 minutes. An internal analytics dashboard might have an RTO of 4 hours.
RPO (Recovery Point Objective): How much data loss is acceptable? Zero means real-time replication. One hour means hourly snapshots are sufficient. This directly determines your backup frequency and architecture.
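
Once the two numbers are written down, checking a setup against them is simple arithmetic. Here is a minimal Python sketch; the class and function names are hypothetical, and the one-hour RPO in the example is an assumed value.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    system: str
    rto_minutes: float  # longest outage the business can tolerate
    rpo_minutes: float  # largest window of data loss the business can tolerate


def objective_violations(objective: RecoveryObjective,
                         backup_interval_minutes: float,
                         measured_restore_minutes: float) -> list[str]:
    """Return a description of each objective the current setup cannot meet."""
    violations = []
    # With periodic backups, a failure just before the next backup loses a full interval.
    if backup_interval_minutes > objective.rpo_minutes:
        violations.append(
            f"{objective.system}: backups every {backup_interval_minutes:g} min "
            f"exceed the RPO of {objective.rpo_minutes:g} min"
        )
    if measured_restore_minutes > objective.rto_minutes:
        violations.append(
            f"{objective.system}: restore takes {measured_restore_minutes:g} min, "
            f"above the RTO of {objective.rto_minutes:g} min"
        )
    return violations


# A payment API with a 15-minute RTO and an assumed 1-hour RPO,
# backed by an 18-hour-old snapshot and a 45-minute restore:
payments = RecoveryObjective("payment-api", rto_minutes=15, rpo_minutes=60)
for problem in objective_violations(payments, 18 * 60, 45):
    print(problem)
```

The restore time has to be a measured number from an actual restore drill; an estimate defeats the purpose of the check.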

Enable RDS Multi-AZ for all production databases:

# Terraform
resource "aws_db_instance" "production" {
  identifier        = "prod-postgres"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
 
  # Multi-AZ: automatic failover to standby in a different AZ
  # No data loss. Automatic failover in ~60-120 seconds.
  multi_az = true
 
  # Encryption at rest — non-negotiable
  storage_encrypted = true
 
  # Automated backups with 7-day retention
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
 
  # Enable deletion protection in production
  deletion_protection = true
 
  tags = {
    Environment = "production"
  }
}

Test your backups on a schedule. Create a monthly calendar event: "Restore production backup to staging and verify data integrity." An untested backup is not a backup; it is a hope.

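# Restore a snapshot to a test instance and verify
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier recovery-test \
  --db-snapshot-identifier rds:prod-postgres-2025-01-15 \
  --db-instance-class db.t3.medium \
  --no-multi-az

# Block until the restored instance is ready to accept connections
aws rds wait db-instance-available --db-instance-identifier recovery-test

# Connect and verify row counts
psql -h recovery-test.xxxx.rds.amazonaws.com -U admin -d mydb \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"

# Delete the test instance afterwards so it does not keep accruing cost
aws rds delete-db-instance --db-instance-identifier recovery-test --skip-final-snapshot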
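
Row counts only mean something when compared against a known baseline. The Python sketch below assumes you recorded per-table counts at snapshot time and compares them with the counts read from the restored instance. The function name and the tolerance parameter are hypothetical.

```python
def verify_restore(expected_counts: dict[str, int], restored_counts: dict[str, int],
                   max_missing_fraction: float = 0.0) -> list[str]:
    """Compare counts recorded at snapshot time with counts from the restored copy."""
    problems = []
    for table, expected in expected_counts.items():
        restored = restored_counts.get(table)
        if restored is None:
            problems.append(f"{table}: missing from restored database")
            continue
        allowed_missing = expected * max_missing_fraction
        if expected - restored > allowed_missing:
            problems.append(f"{table}: expected about {expected} rows, found {restored}")
    return problems


baseline = {"users": 100, "orders": 250}
print(verify_restore(baseline, {"users": 100, "orders": 250}))
```

An empty list means the restored copy matches the baseline. Matching counts are a necessary check, not a proof of integrity, so spot-check a few recent records as well.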

For official guidance on RDS backup and restore, refer to the AWS RDS Backup and Restore documentation.

Mistake 9: No Documentation or Runbooks

The Scenario

The startup's most experienced DevOps engineer takes two weeks of vacation. On day three of their holiday, the staging environment goes down. Nobody else knows how it was built: the engineer set it up manually over six months with no documentation, no Terraform, no notes. The team spends four days trying to reconstruct the environment from memory and guesswork. The engineer gets messages on their vacation every day. When they return, they rebuild the environment in four hours.

The Business Impact

Undocumented infrastructure creates single points of failure, not in your systems but in your team. It makes onboarding new engineers take weeks instead of hours. It makes incident response depend on specific people being available. When one of those people leaves the company, the knowledge walks out with them.

The Fix

Documentation for an engineering team means three specific things:

Infrastructure as Code is the highest form of documentation. The Terraform that defines your infrastructure IS the documentation for what exists and how it is configured. If something is not in code, it should not exist in production.
A runbook for every operational task. A runbook is a step-by-step procedure written well enough that someone in their first week at the company can follow it during an incident:

# Runbook: Production Database Connection Exhaustion
 
## Symptoms
- Application logs: "too many connections" errors
- 500 error rate spike on database-dependent endpoints
- pg_stat_activity shows max connections reached
 
## Diagnosis
# Check current connection count
psql -h $DB_HOST -U $DB_USER -c "SELECT COUNT(*) FROM pg_stat_activity;"
 
# See connections by application
psql -h $DB_HOST -U $DB_USER \
  -c "SELECT application_name, COUNT(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

## Resolution
1. Identify and restart the service causing the connection leak
2. If immediate relief needed: kill idle connections older than 10 minutes
3. Long-term: review connection pool settings in application config

## Escalation
If unresolved in 30 minutes: page the on-call backend engineer.

An architecture README in every repository. Every engineer who clones your repository should be able to understand what it does, how to run it locally, how to deploy it, and what it depends on without asking anyone.

Mistake 10: Solving Technical Problems Without Understanding the Business

The Scenario

A startup is experiencing slow page loads. A DevOps engineer decides to solve it by migrating to Kubernetes with horizontal pod auto-scaling. The migration takes six weeks. Page loads improve slightly. But 80% of the slowness was caused by unoptimized database queries that had nothing to do with the infrastructure layer. The six-week migration solved 20% of the problem.

The Business Impact

Technical solutions to misdiagnosed problems are extraordinarily expensive. Every hour spent building the wrong solution is an hour not spent on the right one. Infrastructure is a tool for delivering business outcomes, not an end in itself.

The Fix

Before making any infrastructure decision, answer these four questions:

What is the actual, measured bottleneck? Instrument before you act. The bottleneck is almost never where you assumed it was.
What does success look like, and how will you measure it? "Pages are faster" is not measurable. "p95 page load time drops below 1.2 seconds" is measurable.
What is the full cost of this solution? Time to implement, ongoing operational burden, team learning curve. Is this cost justified by the measured impact?
Can a simpler solution solve 80% of the problem in 20% of the time?

Always profile and measure before you rebuild:

# Check slow queries in PostgreSQL before any infrastructure changes
psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY avg_ms DESC
LIMIT 10;
"

Nine times out of ten, slow applications have slow queries, missing indexes, or an N+1 query problem, none of which require a new infrastructure layer to fix.

The System Thinking Framework Every DevOps Engineer Needs

Most of the mistakes above share a common root cause: the engineer was thinking about one component in isolation instead of the full system.

A system thinker asks six questions before making any change in production:

Question	Why You Ask It
What does this change?	List every configuration, file, or service that will be different.
What does this depend on?	What must be true upstream for this component to work correctly?
What depends on this?	What downstream systems are affected if this changes or fails?
What is the failure mode?	Does this fail loudly (500 errors) or silently (wrong data)?
What is the rollback path?	How do you reverse this in under five minutes?
What does healthy look like after the change?	What metrics confirm everything is working correctly?

This is not a checklist you run through slowly. It is a thinking habit that becomes automatic with practice. Senior engineers do not spend more time on deployments than junior engineers do; they spend their time on different things, and this is one of them.

Your Production Readiness Checklist

Use this checklist before any production system goes live. Mark each item as done, in progress, or not yet started.

Infrastructure

Infrastructure is defined as code (Terraform or CloudFormation) and version-controlled in Git
Separate dev, staging, and production environments exist with separate credentials
All production changes go through an automated CI/CD pipeline, with no manual SSH deployments
You can rebuild the entire production environment from code in under two hours

Security

No secrets, credentials, or API keys exist in any Git repository
All production secrets are in Secrets Manager or SSM Parameter Store
All IAM roles follow the principle of least privilege
S3 buckets have public access blocked by default
Port 22 is not open to 0.0.0.0/0 on any security group
CloudTrail is enabled in all regions
All IAM users have MFA enabled
AWS Security Hub is enabled and findings are reviewed weekly
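The first item on this list is easy to spot-check from a terminal. The sketch below is a deliberately narrow example: it only looks for strings shaped like AWS access key IDs in the files currently on disk, and the find_aws_access_keys name is hypothetical. It does not search Git history or other credential formats, so treat it as a quick sanity check rather than a replacement for a dedicated secret scanner.

```shell
# find_aws_access_keys: print file:line:match for anything that looks like an
# AWS access key ID (AKIA followed by 16 uppercase letters or digits).
# Returns 0 when at least one candidate is found, 1 when the tree is clean.
find_aws_access_keys() {
  grep -rEno --exclude-dir=.git 'AKIA[0-9A-Z]{16}' "$1"
}

# Example: scan the current repository checkout
#   find_aws_access_keys . && echo "possible credentials found"
```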

Observability

Every service has a /health endpoint that monitoring checks continuously
Alerts fire within five minutes of a production error rate spike
Dashboards exist showing latency, error rate, and resource utilization
Logs are centralized and searchable, not scattered across individual servers

Reliability

Production database has Multi-AZ enabled
Backup restoration has been tested in the last 30 days
Written runbooks exist for the three most likely failure scenarios
RTO and RPO requirements are documented and the architecture meets them

Documentation

Every repository has a README explaining what it does and how to deploy it
A new engineer could understand the production architecture from documentation alone
No single engineer holds critical knowledge that lives only in their head

Conclusion

None of the mistakes in this article requires unusual bad luck. They are the predictable result of decisions that feel reasonable under startup pressure but accumulate into real operational risk over time.

The good news is that every single one of them is preventable with the right awareness and the right habits applied early.

You do not need a perfect infrastructure from day one. You need a correct one: version-controlled, automated, observable, secure, and documented. Start with that foundation. Add complexity only when a specific, measured problem requires it. Always connect technical decisions to business outcomes.

The goal of DevOps in a startup is not to build impressive infrastructure. It is to build reliable systems that support product growth safely, efficiently, and sustainably, and to make sure that when something does break, you can recover before anyone notices.

Want to Go Deeper?

If this article resonated with you, The Startup DevOps Field Guide covers these principles in full depth with complete infrastructure blueprints, security frameworks, CI/CD pipeline templates, and the end-to-end decision-making playbook for engineers building DevOps practices in startup environments from scratch.

It is written specifically for the engineer who wants to do this right from the beginning, not the one rebuilding everything after the first major incident.

How to Migrate to S3 Native State Locking in Terraform

Tolani Akintayo — Thu, 07 May 2026 22:58:43 +0000

If you've been running Terraform on AWS for any length of time, you know the setup: an S3 bucket for state storage, a DynamoDB table for state locking, and a handful of IAM policies tying them together. It works. It has worked for years.

But it has always carried a cost that rarely gets discussed openly. That cost isn't just money, though a DynamoDB table with on-demand billing adds up across multiple teams and environments.

The real cost is complexity. Every new AWS environment needs both resources provisioned before Terraform can manage anything else. Every engineer who sets up their first Terraform backend has to understand why two completely different AWS services are responsible for what is logically one thing: storing and protecting state. And every incident involving a stuck lock has required someone to manually delete a record from DynamoDB to unblock the team.

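In August 2024, AWS added conditional writes to S3, which let a client create an object only if it does not already exist. Terraform 1.10, released in November 2024, built on this to lock state directly in S3 through a new use_lockfile backend argument, meaning DynamoDB is no longer required for state locking. The feature was experimental in 1.10 and became generally available in Terraform 1.11, which also deprecated the DynamoDB locking arguments.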

In this tutorial, you'll learn:

What S3 native locking is and how it works
How to set it up from scratch if you're starting a new project
How to migrate an existing S3 + DynamoDB setup to S3 native locking safely
How to verify locking is working and handle edge cases

By the end, you'll have a simpler, cleaner Terraform backend with one fewer AWS resource to manage.

What Is Terraform State Locking?
What Is S3 Native State Locking?
How S3 Native Locking Compares to the S3 + DynamoDB Approach
Prerequisites
Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch
Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking
How to Verify That Locking Is Working
How to Handle a Stuck Lock
Rollback Plan: If Something Goes Wrong
Security Best Practices for Your State Bucket
Conclusion
References

What Is Terraform State Locking?

Before looking at the new approach, it helps to understand what state locking is solving.

Terraform stores everything it knows about your infrastructure in a state file – a JSON document that maps your configuration to real AWS resources. When you run terraform apply, Terraform reads this file, calculates the difference between the current state and your configuration, and makes the necessary changes.

The problem arises when two engineers or two CI/CD pipelines try to apply changes at the same time. If both read the state file simultaneously, calculate changes independently, and then write back, you get a race condition. The second write overwrites changes from the first, and your state is now out of sync with reality. This is a serious problem that can cause resources to be untracked, duplicated, or destroyed unexpectedly.

State locking solves this by creating a lock when any operation starts that could modify state. If a lock already exists, Terraform refuses to proceed and reports who holds the lock and when it was acquired. Only one operation can hold the lock at a time. When the operation completes, the lock is released.

Terraform Run A                 State File / Lock                Terraform Run B
(User 1)                         (S3/DynamoDB)                   (User 2)

   |                                   |                            |
   |------- 1. Acquire Lock ---------->|                            |
   |                                   |                            |
   |<------ 2. Lock Granted -----------|                            |
   |                                   |                            |
   |                                   |<------ 3. Acquire Lock ----|
   |            [PROCESSING]           |                            |
   |      (Modifying Infrastructure)   |------- 4. Lock Denied ---->|
   |                                   |        (Wait / Retry)      |
   |                                   |                            |
   |------- 5. Release Lock ---------->|                            |
   |                                   |                            |
   |           [COMPLETED]             |------- 6. Lock Granted --->|
   |                                   |                            |
   |                                   |       [PROCESSING]         |
   |                                   | (Modifying Infrastructure) |
   |                                   |                            |

What Is S3 Native State Locking?

Previously, Terraform's S3 backend used a DynamoDB table as the locking mechanism. When a lock was needed, Terraform wrote a record to DynamoDB with a LockID primary key. DynamoDB's conditional writes guaranteed that only one process could create that record, which is what made the locking atomic.

S3 native locking keeps the lock in S3 itself, as a small lock file next to the state object. Despite the similar name, this is not S3 Object Lock. Object Lock is an S3 feature designed to enforce WORM (Write Once, Read Many) retention for regulatory requirements, while Terraform's lock file relies on S3 conditional writes.

When S3 native locking is enabled in your Terraform backend:

Terraform writes your state to a .tfstate object in S3 (as before)
To acquire a lock, Terraform uses S3's conditional write support, specifically the If-None-Match header, to create a .tflock file atomically
If the lock file already exists, S3 rejects the write, and Terraform reports that a lock is held
When the operation completes, Terraform deletes the lock file to release the lock

The key difference from DynamoDB: the entire locking mechanism lives inside S3. No second service. No second set of IAM permissions. No second resource to provision.
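The create-only-if-absent idea is easier to see on a local filesystem. The sketch below is only an analogy, not how Terraform talks to S3: it uses the shell's noclobber option so that creating the lock file fails when the file already exists, much like a conditional write is rejected when the object already exists. The acquire_lock and release_lock names are hypothetical.

```shell
# acquire_lock: create the lock file only if it does not exist yet.
# With noclobber set, the redirection fails instead of overwriting,
# which mirrors "write only if no object has this key".
acquire_lock() {
  ( set -o noclobber; printf '%s\n' "$2" > "$1" ) 2>/dev/null
}

# release_lock: delete the lock file so the next caller can acquire it.
release_lock() {
  rm -f "$1"
}

# Example:
#   acquire_lock /tmp/terraform.tfstate.tflock "run-a" || echo "lock is held"
```

Whoever creates the file first wins, and everyone else gets a failure they can report or retry on. No separate coordination service is involved.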

Note: This feature requires Terraform 1.10.0 or later (it is experimental in 1.10 and generally available from 1.11). Because the lock relies on conditional writes, use_lockfile does not require S3 Object Lock on the bucket. This tutorial still turns Object Lock on for the state bucket, so treat those steps as optional. Since November 2023, Object Lock can be enabled on an existing bucket as long as versioning is on, which we'll cover in Part 2.

How S3 Native Locking Compares to the S3 + DynamoDB Approach

Aspect	S3 + DynamoDB (Old)	S3 Native Locking (New)
AWS services required	S3 + DynamoDB	S3 only
IAM permissions needed	S3 + DynamoDB permissions	S3 permissions only
Terraform version	Any	1.10.0 or later
Setup complexity	Two resources, two IAM scopes	One resource
Stuck lock resolution	Delete DynamoDB record	Delete S3 lock file
Cost	S3 storage + DynamoDB on-demand	S3 storage only
Object Lock requirement	Not required	Not required (optional)
Locking mechanism	DynamoDB conditional writes	S3 conditional writes (`if-none-match`)
State versioning	S3 Versioning (recommended)	S3 Versioning (recommended; required if Object Lock is enabled)

The functional behavior from Terraform's perspective is identical. Locking works the same way. The lock information displayed when a lock is held has the same structure. The only difference is what happens under the hood.

Prerequisites

Before you start, make sure you have the following in place:

Terraform 1.10.0 or later installed. Check your version:

terraform version

If you need to upgrade, follow the official upgrade guide.
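If you want to enforce the minimum version in a script or pipeline, compare the numbers component by component. A plain string comparison would rank 1.9.8 above 1.10.0. The following is a minimal sketch that assumes plain MAJOR.MINOR.PATCH versions with no pre-release suffix; version_at_least is a hypothetical helper name.

```shell
# version_at_least CURRENT MINIMUM
# Returns 0 when CURRENT >= MINIMUM, comparing numeric components left to right.
version_at_least() {
  _cur=$1
  _min=$2
  for _i in 1 2 3; do
    _c=${_cur%%.*}
    _m=${_min%%.*}
    [ "$_c" -gt "$_m" ] && return 0
    [ "$_c" -lt "$_m" ] && return 1
    # Drop the component just compared; an exhausted version counts as 0.
    case $_cur in *.*) _cur=${_cur#*.} ;; *) _cur=0 ;; esac
    case $_min in *.*) _min=${_min#*.} ;; *) _min=0 ;; esac
  done
  return 0
}

# Example:
#   v=$(terraform version -json | sed -n 's/.*"terraform_version": *"\([^"]*\)".*/\1/p')
#   version_at_least "$v" 1.10.0 || echo "Terraform $v is too old for use_lockfile"
```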

AWS CLI installed and configured with credentials that have permission to create and manage S3 buckets.

aws --version
aws sts get-caller-identity   # confirm you're authenticated

IAM permissions to perform the following S3 actions:
- s3:CreateBucket
- s3:PutBucketVersioning
- s3:PutEncryptionConfiguration
- s3:PutBucketPublicAccessBlock
- s3:PutBucketObjectLockConfiguration (only if you enable Object Lock)
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket

For the migration path: access to your existing Terraform project and the S3 bucket and DynamoDB table currently in use.

Part 1: Fresh Setup – How to Configure S3 Native Locking from Scratch

Follow this section if you're starting a new Terraform project and want to use S3 native locking from the beginning.

Step 1: Create the S3 Bucket with Versioning and Encryption

Create the bucket using the AWS CLI. The --object-lock-enabled-for-bucket flag turns on Object Lock at creation time, which also enables versioning. Terraform's use_lockfile setting works without Object Lock, so you can leave the flag out if you don't need it:

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

Note: For regions other than us-east-1, add the --create-bucket-configuration flag.

aws s3api create-bucket \
  --bucket your-project-terraform-state \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
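If you create state buckets in more than one region, the us-east-1 special case is easy to get wrong, because that region rejects an explicit LocationConstraint while every other region needs one. Here is a small sketch of a helper that prints the right command for a given region. The create_bucket_command name is hypothetical, and the helper only prints the command so you can review it (and append flags such as --object-lock-enabled-for-bucket) before running it.

```shell
# create_bucket_command BUCKET REGION
# Print the create-bucket command, adding LocationConstraint only when the
# region is not us-east-1.
create_bucket_command() {
  _bucket=$1
  _region=$2
  if [ "$_region" = "us-east-1" ]; then
    printf 'aws s3api create-bucket --bucket %s --region %s\n' \
      "$_bucket" "$_region"
  else
    printf 'aws s3api create-bucket --bucket %s --region %s --create-bucket-configuration LocationConstraint=%s\n' \
      "$_bucket" "$_region" "$_region"
  fi
}

# Example:
#   create_bucket_command your-project-terraform-state eu-west-1
```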

Now enable versioning on the bucket. Object Lock requires versioning (a bucket created with Object Lock already has it on, so this command is a harmless confirmation), and versioning lets you restore a previous state version if something goes wrong:

aws s3api put-bucket-versioning \
  --bucket your-project-terraform-state \
  --versioning-configuration Status=Enabled

Enable server-side encryption so your state files are encrypted at rest:

aws s3api put-bucket-encryption \
  --bucket your-project-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "AES256"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Block all public access to the bucket. A Terraform state file contains resource IDs, IP addresses, and potentially sensitive values. It should never be publicly accessible:

aws s3api put-public-access-block \
  --bucket your-project-terraform-state \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Verify the bucket configuration:

# Confirm Object Lock is enabled
aws s3api get-object-lock-configuration \
  --bucket your-project-terraform-state
 
# Confirm versioning is enabled
aws s3api get-bucket-versioning \
  --bucket your-project-terraform-state
 
# Confirm encryption is configured
aws s3api get-bucket-encryption \
  --bucket your-project-terraform-state

Expected output for the Object Lock check:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}
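The individual checks above can be folded into one pass/fail function, which is handy in a pipeline. This is a sketch under a few assumptions: check_state_bucket is a hypothetical name, it covers only versioning and the public access block, and it relies on the CLI printing unset values as None and booleans as True or False in text output.

```shell
# check_state_bucket BUCKET
# Prints "ok" and returns 0 only if versioning is enabled and all four
# public access block settings are true.
check_state_bucket() {
  _bucket=$1
  _versioning=$(aws s3api get-bucket-versioning --bucket "$_bucket" \
    --query 'Status' --output text) || return 1
  if [ "$_versioning" != "Enabled" ]; then
    echo "versioning is not enabled (status: $_versioning)"
    return 1
  fi
  _pab=$(aws s3api get-public-access-block --bucket "$_bucket" \
    --query 'PublicAccessBlockConfiguration.[BlockPublicAcls,IgnorePublicAcls,BlockPublicPolicy,RestrictPublicBuckets]' \
    --output text 2>/dev/null)
  case $_pab in
    ''|*False*|*None*)
      echo "public access is not fully blocked"
      return 1
      ;;
  esac
  echo "ok"
}

# Example:
#   check_state_bucket your-project-terraform-state
```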

Step 2: Configure the Terraform Backend with Native Locking

In your Terraform project, create or update your backend.tf file:

terraform {
  backend "s3" {
    bucket = "your-project-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
 
    # Enable S3 native state locking
    # Requires Terraform 1.10.0+ (generally available from 1.11)
    use_lockfile = true
 
    # Encryption at rest
    encrypt = true
  }
}

The critical difference from the old configuration is the use_lockfile = true parameter. Notice what is absent: there's no dynamodb_table argument. No DynamoDB table. No second service.

Here's a direct comparison of the old and new configurations:

Old configuration (S3 + DynamoDB):

terraform {
  backend "s3" {
    bucket         = "your-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # this goes away
  }
}

New configuration (S3 native locking):

terraform {
  backend "s3" {
    bucket       = "your-project-terraform-state"
    key          = "production/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true   # this replaces dynamodb_table
  }
}

Step 3: Initialize and Verify

Run terraform init to initialize the backend:

terraform init

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
 
Terraform has been successfully initialized!

Run a plan to confirm everything is working end-to-end:

terraform plan

If locking is working, you'll see a brief pause while Terraform acquires the lock before the plan output appears. You'll also see the lock information if you look at the S3 bucket – a .tflock file will appear temporarily alongside your state file during the operation and disappear when it completes.

Part 2: Migration – How to Move from S3 + DynamoDB to S3 Native Locking

Follow this section if you have an existing Terraform setup using an S3 bucket and DynamoDB table for state locking, and you want to migrate to S3 native locking.

Important: Migration requires a maintenance window or at minimum a period where no Terraform operations are running. You're changing the backend configuration, which means all team members and CI/CD pipelines must stop running terraform plan or terraform apply during the migration. The migration itself takes under 10 minutes.

Step 1: Verify Your Current Setup

Before making any changes, document your existing backend configuration and confirm the state file is accessible:

# Confirm your state file is in S3
aws s3 ls s3://your-existing-bucket/path/to/terraform.tfstate
 
# Confirm the DynamoDB table exists
aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table \
  --query 'Table.TableStatus'

Check your current backend.tf and note the exact values:

# Your current backend.tf - note these values before changing anything
terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"       # note this
    key            = "path/to/terraform.tfstate"   # note this
    region         = "us-east-1"                   # note this
    encrypt        = true
    dynamodb_table = "your-dynamodb-lock-table"    # this will be removed
  }
}

Run one final plan to confirm the current state is clean and there are no unexpected changes pending:

terraform plan

If the plan shows no changes, you're in a safe state to proceed.
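Reading plan output by eye works, but the -detailed-exitcode flag makes the result machine-checkable: Terraform exits with 0 when there are no changes, 2 when changes are pending, and 1 on error. A minimal sketch of a wrapper follows. The plan_status name and the labels it prints are hypothetical.

```shell
# plan_status: run a plan quietly and translate the exit code into a label.
plan_status() {
  _rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || _rc=$?
  case $_rc in
    0) echo "clean" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
}

# Example: only continue with the migration from a clean state
#   [ "$(plan_status)" = "clean" ] || echo "resolve pending changes first"
```

Recording the label before the migration and comparing it with the label afterwards gives you a simple before-and-after check.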

Step 2: Enable Object Lock on the Existing S3 Bucket

This step is optional for locking itself, because use_lockfile relies on S3 conditional writes and works on a bucket without Object Lock. Include it only if you want Object Lock on the state bucket for other reasons.

Object Lock used to be a creation-time setting, but since November 2023 S3 lets you enable it on an existing bucket with a regular API call, as long as versioning is already enabled. If the command below is rejected, enable versioning first (shown at the end of this step) and retry.

Run the following AWS CLI command to enable Object Lock on your existing bucket:

aws s3api put-object-lock-configuration \
  --bucket your-existing-bucket \
  --object-lock-configuration '{"ObjectLockEnabled": "Enabled"}'

Note: This command enables Object Lock without a default retention rule, so no retention mode (governance or compliance) is applied to new objects automatically. That is what you want here: Terraform only needs to create and delete lock files, and a default retention period would force S3 to keep every old lock file version until its retention expires.

Verify Object Lock is now enabled:

aws s3api get-object-lock-configuration \
  --bucket your-existing-bucket

Expected output:

{
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled"
    }
}

Also verify that versioning is already enabled (it should be if you are running a production Terraform setup):

aws s3api get-bucket-versioning \
  --bucket your-existing-bucket

Expected output:

{
    "Status": "Enabled"
}

If versioning isn't enabled, enable it before proceeding:

aws s3api put-bucket-versioning \
  --bucket your-existing-bucket \
  --versioning-configuration Status=Enabled

Step 3: Update the Terraform Backend Configuration

Update your backend.tf to remove the dynamodb_table argument and add use_lockfile = true:

terraform {
  backend "s3" {
    bucket  = "your-existing-bucket"
    key     = "path/to/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
 
    # Add this:
    use_lockfile = true
 
    # Remove this line entirely:
    # dynamodb_table = "your-dynamodb-lock-table"
  }
}

Your updated backend.tf should look like this:

terraform {
  backend "s3" {
    bucket       = "your-existing-bucket"
    key          = "path/to/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}
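Before reinitializing, you can confirm which locking arguments are actually active in the file, since a commented-out line is easy to misread. The sketch below is a rough text check, not an HCL parser: backend_lock_mode is a hypothetical helper, it treats everything after a # as a comment, and it assumes each argument sits on its own line.

```shell
# backend_lock_mode FILE
# Prints "native", "dynamodb", "both", or "none" depending on which locking
# arguments are set on non-commented lines.
backend_lock_mode() {
  # Drop everything after a # so commented-out arguments are ignored.
  _active=$(sed 's/#.*$//' "$1")
  _native=no
  _dynamo=no
  if printf '%s\n' "$_active" |
    grep -Eq '^[[:space:]]*use_lockfile[[:space:]]*=[[:space:]]*true'; then
    _native=yes
  fi
  if printf '%s\n' "$_active" |
    grep -Eq '^[[:space:]]*dynamodb_table[[:space:]]*='; then
    _dynamo=yes
  fi
  case "$_native$_dynamo" in
    yesno)  echo "native" ;;
    noyes)  echo "dynamodb" ;;
    yesyes) echo "both" ;;
    *)      echo "none" ;;
  esac
}

# Example:
#   backend_lock_mode backend.tf
```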

Step 4: Reinitialize Terraform

Run terraform init with the -reconfigure flag. This flag tells Terraform that the backend configuration has changed intentionally and to reinitialize without prompting you to copy state (the state is already in the same bucket):

terraform init -reconfigure

Expected output:

Initializing the backend...
 
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
 
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
 
Terraform has been successfully initialized!

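If you see an error here: Check that your Terraform version is 1.10.0 or later (older versions reject use_lockfile as an unsupported argument) and that your credentials can read the state object. Missing permissions on the .tflock key usually surface later, when the first plan or apply tries to acquire the lock.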

Step 5: Verify the Migration

Run a plan to confirm Terraform is working correctly with the new backend configuration:

terraform plan

The plan should:

Complete successfully
Show the same result as the plan you ran in Step 1 (no changes, or the same changes as before)
NOT mention DynamoDB anywhere in its output

To confirm that locking is actually using S3 instead of DynamoDB, open a second terminal and run a plan while the first one is running. You should see the second terminal output a lock error that mentions S3, not DynamoDB:

╷
│ Error: Error acquiring the state lock
│
│ Error message: operation error S3: PutObject, https response error StatusCode: 412,
│ RequestID: ..., api error PreconditionFailed: At least one of the pre-conditions you specified did not hold
│
│ Lock Info:
│   ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
│   Path:      your-existing-bucket/path/to/terraform.tfstate.tflock
│   Operation: OperationTypePlan
│   Who:       user@hostname
│   Version:   1.10.0
│   Created:   2026-05-06 14:22:01 UTC
│   Info:
╵

The Path field shows .tfstate.tflock, a file in your S3 bucket, not a DynamoDB record. This confirms that locking is now handled entirely by S3.

Step 6: Clean Up the DynamoDB Table

Once you've confirmed the migration is working correctly and your team has run at least one successful plan and apply cycle using the new backend, you can remove the DynamoDB table.

Wait at least 24-48 hours before deleting the DynamoDB table if you have CI/CD pipelines or multiple team members. This gives time to catch any pipeline that wasn't updated with the new backend configuration.
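During that waiting period, you can check whether anything is still taking locks in the old table. The sketch below rests on an assumption about how the DynamoDB-based backend stores its items: an active lock carries an Info attribute, while the long-lived checksum items (whose LockID ends in -md5) do not. The active_dynamodb_locks name is hypothetical.

```shell
# active_dynamodb_locks TABLE
# Print the number of items that look like active locks, meaning items that
# have an Info attribute. Checksum items are not counted.
active_dynamodb_locks() {
  aws dynamodb scan \
    --table-name "$1" \
    --select COUNT \
    --filter-expression 'attribute_exists(#info)' \
    --expression-attribute-names '{"#info": "Info"}' \
    --query 'Count' \
    --output text
}

# Example:
#   active_dynamodb_locks your-dynamodb-lock-table
```

A non-zero count while your own operations are idle suggests that a pipeline or teammate is still using the old backend configuration.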

When you're ready, delete the DynamoDB table:

aws dynamodb delete-table \
  --table-name your-dynamodb-lock-table

Confirm the deletion:

aws dynamodb describe-table \
  --table-name your-dynamodb-lock-table

Expected output:

An error occurred (ResourceNotFoundException) when calling the DescribeTable operation:
Requested resource not found

This error confirms that the table is gone. The migration is complete.

If you provisioned the DynamoDB table using Terraform (which is the recommended pattern), remove the resource from your Terraform configuration and run terraform apply to destroy it via Terraform rather than the CLI directly. This keeps your state clean:

# Remove this entire block from your Terraform configuration:
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
 
  attribute {
    name = "LockID"
    type = "S"
  }
}

After removing the block, run:

terraform apply

Terraform will detect that the DynamoDB table resource has been removed from configuration and will destroy the table.

How to Verify That Locking Is Working

After completing either the fresh setup or the migration, use this procedure to independently verify that locking is functioning correctly.

Method 1: Observe the lock file during an operation

In one terminal, start a long-running plan against a configuration with many resources:

terraform plan

While it's running, in a second terminal, check for the lock file in S3:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

You should see a file like:

2026-05-06 14:22:01        512 terraform.tfstate.tflock

After the plan completes, run the same command again. The .tflock file should be gone.

Method 2: Read the lock file contents

While a plan is running, download and read the lock file to see its contents:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/current.lock && cat /tmp/current.lock

Expected output (formatted for readability):

{
  "ID": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "Operation": "OperationTypePlan",
  "Info": "",
  "Who": "tolani@dev-machine",
  "Version": "1.10.0",
  "Created": "2026-05-06T14:22:01.123456789Z",
  "Path": "your-bucket/path/to/terraform.tfstate"
}

This is the same lock information that Terraform displays when a lock is held. It's now a JSON file in S3 rather than a record in DynamoDB.
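If you only need one value from the lock file, such as who holds it, you can pull a single field out without installing a JSON tool. This is a minimal sketch that assumes the flat, string-valued JSON shown above. The lock_field name is hypothetical, and a real JSON parser is the safer choice for anything more complex.

```shell
# lock_field FIELD
# Read lock file JSON on stdin and print the string value of FIELD.
lock_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example: stream the lock file from S3 and show its holder
#   aws s3 cp s3://your-bucket/path/to/terraform.tfstate.tflock - | lock_field Who
```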

How to Handle a Stuck Lock

With the DynamoDB backend, resolving a stuck lock meant deleting a record from the DynamoDB table. With S3 native locking, it means deleting the .tflock file from S3.

A lock can get stuck if:

A terraform apply or plan process was killed mid-execution
A CI/CD pipeline runner crashed during a Terraform operation
A network interruption prevented the lock release from completing

Here's how you can check for a stuck lock:

aws s3 ls s3://your-bucket/path/to/ | grep tflock

If a .tflock file exists and no Terraform operation is currently running, it is a stuck lock.

You can also read the lock to understand who held it:

aws s3 cp \
  s3://your-bucket/path/to/terraform.tfstate.tflock \
  /tmp/stuck.lock && cat /tmp/stuck.lock

This tells you who (Who field) was running the operation, what operation it was (Operation field), and when it was acquired (Created field).

And you can force-unlock using Terraform like this:

terraform force-unlock LOCK-ID

Replace LOCK-ID with the ID value from the lock file contents. For example:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Terraform will confirm:

Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may be still be in use. Only 'yes' will be accepted to confirm.
 
  Enter a value: yes
 
Terraform state has been successfully unlocked!

An alternative is to delete the lock file directly via CLI. If terraform force-unlock doesn't work (for example, because you are running in a CI environment without Terraform available), delete the lock file directly:

aws s3 rm s3://your-bucket/path/to/terraform.tfstate.tflock

Only delete the lock file if you are certain no Terraform operation is currently running. Deleting a lock that is actively held by a running operation will allow a second concurrent operation to start, which is exactly the race condition locking is designed to prevent.
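One way to reduce the risk of deleting the wrong lock is to delete only when the lock ID still matches the one you inspected earlier, similar to how terraform force-unlock asks for the ID. The sketch below is illustrative: remove_lock_if_id_matches is a hypothetical helper, and a small window remains between the check and the delete, so it narrows the risk rather than removing it.

```shell
# remove_lock_if_id_matches S3_URI EXPECTED_ID
# Delete the lock file only when its current ID equals EXPECTED_ID.
remove_lock_if_id_matches() {
  _uri=$1
  _expected=$2
  # Stream the lock file to stdout and extract its ID field.
  _current=$(aws s3 cp "$_uri" - 2>/dev/null |
    sed -n 's/.*"ID"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
  if [ -z "$_current" ]; then
    echo "no lock file found at $_uri"
    return 1
  fi
  if [ "$_current" != "$_expected" ]; then
    echo "lock ID changed to $_current; not deleting"
    return 1
  fi
  aws s3 rm "$_uri"
}

# Example:
#   remove_lock_if_id_matches \
#     s3://your-bucket/path/to/terraform.tfstate.tflock \
#     a1b2c3d4-e5f6-7890-abcd-ef1234567890
```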

Rollback Plan: If Something Goes Wrong

If you encounter problems after migrating, you can roll back to the S3 + DynamoDB setup with these steps.

Step 1: Stop all Terraform operations in your team and CI/CD pipelines.

Step 2: Recreate the DynamoDB table if you already deleted it:

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Step 3: Revert backend.tf to the previous configuration:

terraform {
  backend "s3" {
    bucket         = "your-existing-bucket"
    key            = "path/to/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"   # restored
    # Remove: use_lockfile = true
  }
}

Step 4: Reinitialize:

terraform init -reconfigure

Step 5: Verify:

terraform plan

The state file hasn't moved, so there's no data loss during a rollback. The only change is which locking mechanism Terraform uses.

Note: Object Lock being enabled on the S3 bucket doesn't prevent the rollback. Object Lock and DynamoDB locking can coexist, because Object Lock simply adds a capability to the bucket. Setting dynamodb_table in your backend config tells Terraform to use DynamoDB regardless of whether Object Lock is enabled. Terraform also accepts dynamodb_table and use_lockfile together, in which case it acquires both locks, which is useful if you prefer a gradual transition.

Security Best Practices for Your State Bucket

Migrating to S3 native locking is a good opportunity to review the overall security configuration of your state bucket. Here are the practices every production Terraform state bucket should implement:

Enable Versioning (Strongly Recommended)

Versioning is not what makes the lock work, but no production state bucket should be without it (and Object Lock requires it). It ensures that if a state file is accidentally overwritten or corrupted, you can restore a previous version.

aws s3api put-bucket-versioning \
  --bucket your-state-bucket \
  --versioning-configuration Status=Enabled

Block All Public Access (Non-Negotiable)

Your state file contains resource ARNs, IP addresses, and may contain sensitive values passed through Terraform variables. It must never be publicly accessible.

aws s3api put-public-access-block \
  --bucket your-state-bucket \
  --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Enable Server-Side Encryption

Always encrypt state files at rest. AES256 is the minimum. If your organization requires KMS key management:

aws s3api put-bucket-encryption \
  --bucket your-state-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

Apply Least-Privilege IAM Permissions

The role or user that Terraform uses to access the state bucket should have only the permissions it needs. The lock file is an ordinary object that Terraform creates and deletes next to the state, so locking needs object read, write, and delete permissions on the .tflock key rather than any Object Lock permissions. Here's a minimal IAM policy for S3 native locking:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-state-bucket",
        "arn:aws:s3:::your-state-bucket/*"
      ]
    },
    {
      "Sid": "TerraformStateLocking",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-state-bucket/*.tflock"
    }
  ]
}

Notice what is absent: there are no DynamoDB permissions. This is a cleaner, smaller permission set than the old approach required.

Enable Access Logging

Log all access to your state bucket in CloudTrail or S3 server access logs. This gives you an audit trail of every time state was read, written, or locked:

aws s3api put-bucket-logging \
  --bucket your-state-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "your-logging-bucket",
      "TargetPrefix": "terraform-state-access/"
    }
  }'

Conclusion

AWS S3 native state locking removes the need for a DynamoDB table from your Terraform backend setup. The result is simpler infrastructure, a smaller IAM permission surface, and one fewer service to provision, monitor, and pay for across every environment your team manages.

Here's a summary of what you accomplished:

Understood what state locking is and why it's required for safe Terraform operations
Compared S3 native locking to the existing S3 + DynamoDB approach
Set up a fresh Terraform backend using S3 native locking with correct bucket configuration
Migrated an existing backend from S3 + DynamoDB to S3 native locking safely
Learned how to verify locking, handle stuck locks, and roll back if needed
Applied security best practices to the state bucket

This pattern – using S3 native locking – is the recommended approach for all new Terraform projects on AWS going forward. If you're managing a large estate with multiple Terraform backends, consider automating the migration using a script or Terraform module that applies the pattern across all your state buckets.

If you are building or optimizing cloud infrastructure for a startup and want a complete reference for production-ready Terraform modules, CI/CD pipeline patterns, and infrastructure runbooks, check out The Startup DevOps Field Guide. It covers the full lifecycle of AWS infrastructure from initial setup to production reliability.

How to Land Your First Cloud or DevOps Role: What Hiring Managers Actually Look For

Tolani Akintayo — Thu, 30 Apr 2026 14:33:32 +0000

You've completed three AWS courses. You have notes from a dozen Docker tutorials. You know what Kubernetes is, what CI/CD means, and you can explain Infrastructure as Code without hesitating.

And yet the applications go out, and nothing comes back.

This is one of the most frustrating experiences in tech. You're genuinely learning, genuinely putting in the time, and you have nothing to show for it in terms of results. You start to wonder if the market is too competitive, if you need one more certification, or if there's some hidden door everyone else found that you're missing.

The truth is simpler and more actionable than any of that: hiring managers can't see your YouTube watch history. They can see your GitHub. Most beginners optimize for learning. Hired candidates optimize for proof.

In this guide, you'll get an honest breakdown of the nine factors hiring managers actually evaluate when they look at a junior cloud or DevOps candidate, plus a concrete 90-day plan to address each one. By the end, you'll know exactly where you stand and exactly what to do next.

The Three Patterns That Keep Beginners Stuck
What Hiring Managers Are Actually Evaluating
Factor 1: Proof of Work (The Non-Negotiable)
- The Three Projects That Cover Everything
Factor 2: System-Level Thinking
Factor 3: Software Engineering Fundamentals
Factor 4: Communication Skills
Factor 5: Consistency Over Intensity
Factor 6: Networking and Visibility
Factor 7: Ownership Mindset
Factor 8: Business Awareness
Factor 9: Learning Agility
Your 90-Day Action Plan
Honest Self-Assessment: Where Do You Stand?
Conclusion
References and Recommended Resources

The Three Patterns That Keep Beginners Stuck

Pattern 1: The Tutorial Loop

Week 1: You watch eight hours of Docker content. Week 2: You start an AWS course and get 70% through. Week 3: A Kubernetes series looks interesting, so you start that instead. Week 4: You open LinkedIn and wonder why you're not getting callbacks.

Watching tutorials feels like progress. It's comfortable, passive, and has no failure state. Nothing breaks. Nothing goes wrong.

The problem is that it produces nothing a hiring manager can evaluate. Courses and certifications tell an employer what you've been exposed to. Your GitHub tells them what you can actually do.

Pattern 2: The Theory-Practice Gap

You can explain CI/CD fluently. You've read the Kubernetes documentation. You understand the conceptual difference between a container and a virtual machine.

But you've never taken a simple application, containerized it, connected it to a pipeline, and deployed it to a cloud server with a real URL that someone can visit.

In an interview, "I understand how it works" and "I have built this and here is the link" are not equivalent answers. Hiring managers hear the first version from hundreds of candidates. The second version gets callbacks.

Pattern 3: Silent Learning

This one is perhaps the most painful pattern because the learning is real. You're putting in the work every day, but nobody knows. No GitHub activity. No LinkedIn posts. No community presence. Just cold applications sent from job boards to applicant tracking systems that filter you out before a human ever sees your name.

The hard truth: people get hired through people. A hiring manager who has seen your LinkedIn post about a problem you solved is significantly more likely to give your résumé serious attention than a stranger who applied through a portal.

What Hiring Managers Are Actually Evaluating

I've grouped the nine factors that follow into three buckets: Mindset, Execution, and Visibility. The order matters: mindset shapes how you execute, and execution is what powers visibility.

Bucket	Covers	Factors
Mindset	How you think about problems and your career	Factors 2, 7, 8, 9
Execution	What you actually build and demonstrate	Factors 1, 3
Visibility	Whether the right people know you exist	Factors 4, 5, 6

Let's go through each one.

Factor 1: Proof of Work (The Non-Negotiable)

If there's one thing to take from this entire article, it's this: no portfolio means no serious consideration. The most technically capable candidate in the applicant pool is invisible without proof of work.

This isn't about impressing anyone with complexity. It's about demonstrating that you can take a system from zero to deployed, documented, and working.

Here's the checklist every portfolio project should meet before you consider it done:

It's deployed: there's a real URL you can share, not "it works on my machine"
It has a CI/CD pipeline: code changes are automatically tested and deployed
Infrastructure is defined as code: not manually clicked together in the AWS console
It has monitoring and alerting: you know when it breaks before users tell you
It's documented: a README explains what it does, how to run it, and how it works
It's on GitHub publicly: with real commit history showing iterative work

If your project meets all six criteria, you have proof of work. If it meets four of six, you have a project in progress. Finish it before you start applying.

The Three Projects That Cover Everything

You don't need ten projects. You need two to three projects that together demonstrate the full range of DevOps skills.

Project 1: The Full-Stack Deploy Pipeline

This is the foundational DevOps project every beginner should build first.

Take any simple web application – a Python Flask app, a Node.js API, or even a static site. Containerize it with Docker. Write a CI/CD pipeline that runs tests, builds the Docker image, and deploys to a cloud server automatically on every push to the main branch. You can also set up Nginx as a reverse proxy and add an uptime monitor (UptimeRobot has a free tier).

Tools: GitHub Actions, Docker, AWS EC2 or Render.com, Nginx.

Why it matters to a hiring manager: it proves you can automate a full deployment workflow end-to-end. The hiring manager can visit your URL, see it running, and inspect your pipeline history.

This single project puts you ahead of most applicants who only have course completion screenshots.

Project 2: Infrastructure as Code with Terraform

Write Terraform code that provisions a complete environment: a VPC, public and private subnets, an EC2 instance with properly scoped security group rules, and an S3 bucket for remote state. Destroy it and recreate it from scratch to prove the code actually works. Add a GitHub Actions workflow that runs terraform plan on pull requests and terraform apply on merge to main.

Tools: Terraform, AWS (or Azure/GCP), GitHub Actions.

Why it matters: Infrastructure as Code with Terraform is a required skill at almost every company running cloud infrastructure. Showing you can write, version-control, and automate Terraform demonstrates a core professional competency.

Project 3: Monitoring and Observability Stack

Deploy a monitoring stack using Docker Compose: Prometheus scraping metrics from your application and the host, Grafana dashboards showing CPU, memory, request rates, and error rates, and Alertmanager configured to send alerts to Slack or email when thresholds are crossed. Connect this to your Project 1 application so the pipeline deploys and the monitoring watches it.

Tools: Prometheus, Grafana, Alertmanager, Node Exporter, Docker Compose.

Why it matters: most beginner portfolios have zero observability work. This project immediately signals that you understand production engineering, not just deployment. Any senior DevOps engineer or SRE reviewing your application will notice it and it will set you apart.

Factor 2: System-Level Thinking

This is the mindset that separates a DevOps engineer from someone who just knows a collection of tools. System-level thinking means you can see the whole picture, not just the part you happen to be working on at any given moment.

Here's the mental test hiring managers are running throughout your interview: can you trace a user request from the moment they click a button to the moment they see a response, and explain what happens at every layer in between?

Here's the full journey of a web request, the map of modern infrastructure every DevOps engineer needs to understand:

Step	Layer	What's happening and what can go wrong
1	User's Browser	The user types a URL. The browser needs to find the server.
2	DNS Resolution	The domain is translated into an IP address. DNS misconfigurations mean users can't reach you at all.
3	CDN / Edge Network	Traffic hits a CDN (Cloudflare, CloudFront) first. Static assets are served from the nearest edge. SSL terminates here.
4	Load Balancer	Routes the request to an available application server. If all targets are unhealthy, users get 502/503 errors.
5	Compute / Application Servers	The application code runs here in containers, on VMs, or in serverless functions. Business logic executes.
6	Database Layer	The application reads from or writes to a database. Slow queries or a full disk cause slow responses or outages.
7	Cache Layer	Redis or Memcached caches frequently-read data. Cache misses cause extra database load.
8	Response Returns	The response travels back through the stack and the user sees the result.
9	Logging and Monitoring	Every step above should emit logs and metrics. Good monitoring alerts you before users notice a problem.

Why does this matter in an interview? Consider two candidates answering the question: "Tell me about a time something broke in production."

Candidate A: "The website was down."

Candidate B: "The load balancer health checks were failing because the app containers were running out of memory due to a memory leak introduced in the previous deploy. We identified it via memory metrics in Grafana, rolled back, and added a memory limit to the container spec."

Same incident. Completely different answer. System-level thinking is what makes the difference.

Factor 3: Software Engineering Fundamentals

Many beginners rush to learn Kubernetes and Terraform before mastering the foundations that make those tools make sense. This creates a knowledge structure that looks impressive but has no solid base underneath it.

Here are the fundamentals that actually matter and what to do if you have a gap in any of them:

1. Linux and the Command Line

DevOps tools run on Linux. CI/CD jobs run in Linux containers. SSH is the front door to every server. If the terminal makes you uncomfortable, you're not ready for a production environment. This is not a preference, it's a prerequisite.

Start with daily Linux practice. The Linux Foundation's free introductory materials are a solid starting point. And here's a solid freeCodeCamp course on Linux basics.

2. Networking Fundamentals

DNS, TCP/IP, HTTP/HTTPS, load balancing, firewalls, VPCs, subnets: these concepts appear in every cloud architecture. Without them, Terraform and Kubernetes are magic boxes. Study the request flow in Factor 2 above until you can draw it from memory without looking.

Here's a computer networking fundamentals course to get you started.

3. Scripting: Bash and Python

CI/CD pipelines are scripts. Automation is scripting. If you cannot write a Bash script that reads a config file, calls an API, and handles errors gracefully, your automation ceiling is very low. Fix this by writing one small, useful script every week. Solve real problems with code.

Here's a helpful tutorial on shell scripting in Linux for beginners.
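To make the "reads a config file, calls an API, and handles errors gracefully" pattern concrete, here is a minimal Python sketch of the same idea (the section covers both Bash and Python). The file name `healthcheck.json`, its `url` and `timeout` keys, and the function names are all hypothetical choices made for illustration.

```python
import json
import sys
import urllib.error
import urllib.request

def load_config(path):
    """Read a JSON config file; raise ValueError with a readable message on failure."""
    try:
        with open(path, encoding="utf-8") as fh:
            config = json.load(fh)
    except FileNotFoundError:
        raise ValueError(f"config file not found: {path}")
    except json.JSONDecodeError as exc:
        raise ValueError(f"config file is not valid JSON: {exc}")
    if "url" not in config:
        raise ValueError("config is missing the required 'url' key")
    return config

def check_endpoint(url, timeout=5, opener=urllib.request.urlopen):
    """Call the URL and return (ok, detail) instead of letting exceptions escape."""
    try:
        with opener(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError, timeouts, and connection errors all derive from OSError.
        return False, f"request failed: {exc}"

def main(argv):
    """Return a shell-style exit code: 0 healthy, 1 unhealthy, 2 bad config."""
    path = argv[1] if len(argv) > 1 else "healthcheck.json"
    try:
        config = load_config(path)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 2
    ok, detail = check_endpoint(config["url"], config.get("timeout", 5))
    print(f"{config['url']}: {detail}")
    return 0 if ok else 1
```

To use it as a script, you would call `sys.exit(main(sys.argv))` at the bottom of the file. The distinct exit codes matter because a CI/CD step decides whether to continue based on the exit status, not on what was printed.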

4. Git and Version Control

Not just git commit and git push. Branching strategies, pull requests, merge conflicts, rebasing, and tagging releases are all standard practice in professional DevOps teams. Use Git for everything including your personal learning notes. Practice branching workflows intentionally.

Here's a full book on all the Git basics (and some more advanced topics, too) you need to know.

5. Docker and Containers

Docker is the universal packaging format for modern software. Understanding layers, multi-stage builds, volumes, networking, and container security is the floor, not the ceiling. Every project you build should be containerized. Write your Dockerfiles by hand instead of copying them.

Here's a course on Docker and Kubernetes to get you started.

Factor 4: Communication Skills

Technical skills set your ceiling. Communication skills determine how fast you reach it. This is the most consistently underestimated factor among beginner DevOps candidates.

Two candidates with identical technical ability will have very different career outcomes based on how clearly they communicate. Here's what that looks like in practice:

Architecture explanation: Can you describe how your project works to someone who has never seen it? Can you draw the architecture on a whiteboard and walk someone through your design decisions and the trade-offs you made?

Trade-off articulation: "I chose X over Y because..." is one of the most powerful phrases in a technical interview. It shows you understand that every decision has pros and cons and you made a conscious, reasoned choice rather than just copying a tutorial.

Written documentation: A README is your project's cover letter. A well-written README with clear setup instructions, an architecture diagram, and documented decisions demonstrates engineering maturity that most beginners don't show.

Here's a quick test: open your most recent project on GitHub and read the README as if you're a hiring manager seeing it for the first time. Does it answer these questions?

What does this project do, and why did you build it?
What does the architecture look like?
How do I run this locally, and how do I deploy it?
What decisions did you make, and why?
What would you improve if you continued working on it?

If you answered "no" to more than two of those, rewrite the README before applying anywhere. This single action will meaningfully improve your response rate.

Interview communication: Hiring managers assess communication throughout the entire interview, not just your answers. Thinking out loud, structuring your responses, and admitting uncertainty honestly are all evaluated.

Factor 5: Consistency Over Intensity

Hiring managers are pattern recognition machines. They look at your GitHub contribution graph, your LinkedIn activity, and your learning trajectory and form an impression before reading a single word on your résumé.

A binge-learning approach (10-hour weekends followed by weeks of nothing) produces a GitHub graph that tells the wrong story. Thirty minutes of focused daily practice for six months beats a monthly 10-hour binge. At the six-month mark, the daily practitioner has about 90 hours of focused work. The binge learner has 60, with significantly worse retention.
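The arithmetic behind that comparison is easy to check. This small sketch assumes 30-day months and exactly one 10-hour binge per month; both are simplifications.

```python
# Assumptions: 30-day months, one 10-hour binge per month.
DAYS_PER_MONTH = 30
MONTHS = 6

daily_hours = 0.5 * DAYS_PER_MONTH * MONTHS  # 30 minutes every day
binge_hours = 10 * MONTHS                    # one 10-hour session a month

print(f"daily practice: {daily_hours:.0f} hours")  # 90 hours
print(f"monthly binge:  {binge_hours} hours")      # 60 hours
```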

Here's how to build consistency in practice:

Pick a time slot in your day that you will protect. Thirty minutes is enough to make progress.
Define a four-week learning sprint with a specific goal, not "learn Terraform" but "build and deploy a VPC with Terraform and write the README."
Keep a private learning journal: date, what you studied, what you built, what confused you.
When the sprint ends, evaluate what you built and plan the next one.

What to avoid: declaring publicly on LinkedIn that you're "grinding DevOps full time" and then disappearing for six weeks. The absence is noticed. Only commit publicly to what you will actually sustain.

Factor 6: Networking and Visibility

This is the factor most beginners resist most, and the one that makes the biggest practical difference in time-to-hire.

Most DevOps jobs are filled through people: referrals, community connections, LinkedIn conversations. A warm introduction from someone who has seen your work outweighs fifty cold applications every time.

Here are three ways to build visibility without it feeling performative:

Community Engagement

Join communities where DevOps engineers actually talk: AWS User Groups, local DevOps meetups, DevOps Discord servers, Reddit communities like r/devops and r/kubernetes. You don't need to be the expert. Ask specific questions, answer what you genuinely know, and show up consistently. After three to six months, people will recognize your name.

LinkedIn Content

Post once per week about something you learned, built, or got stuck on. Not marketing – documentation. A post that says "This week I configured Prometheus alerting for a Docker Compose stack. Here's what tripped me up and how I solved it" attracts recruiters, leads to conversations, and builds a searchable record of your growth over time.

Asking Good Questions in Public

When you get stuck and figure it out, write it up. Post the solution in the same community where you asked the question. Answer someone else's version of the same question later. You position yourself as a helpful, engaged learner, exactly who hiring managers want to hire.

Here's a concrete three-month visibility sprint to follow:

Timeframe	Action
Week 1-2	Update your LinkedIn headline: "Cloud / DevOps Engineer in Training | Building with AWS, Docker, Terraform". Connect with 20 people in DevOps: engineers, recruiters, and hiring managers. Add a short personal note when connecting.
Week 3-4	Write your first LinkedIn post. Document something you built or learned this week. Keep it honest and specific. 150–200 words is enough.
Month 2	Join one community. Introduce yourself. Answer one question per week.
Month 3	Post consistently once per week. Engage with others' posts. Start appearing in recruiter searches.

By month three, recruiters searching for "DevOps" in your location will encounter your activity. Some of the best entry-level DevOps opportunities come from exactly this kind of low-pressure visibility.

Factor 7: Ownership Mindset

This factor is less about personality type and more about observable behavior. Hiring managers are looking for evidence that you finish what you start, not just that you start things.

Here's what the contrast looks like:

What hiring managers frequently see	What hiring managers want to see
"I started a Kubernetes project and encountered a lot of issues"	"Here is a complete project. It deploys to AWS, has a CI/CD pipeline, is monitored, and you can access it at this URL right now."
"I was working through a Terraform course, learnt a lot about XYZ."	"I finished it, documented it, and wrote a post about what I learned."

Ownership mindset has three components. First, finish things: a complete, simple project is worth ten times more than ten incomplete complex ones. Second, take responsibility without blame when something breaks: ownership means identifying the cause, fixing it, and adding monitoring so it doesn't happen again. Third, self-direct your learning: you don't wait for someone to tell you what to learn next. You see a gap, identify how to close it, and close it. This is what "junior who can work independently" actually means in job descriptions.

Factor 8: Business Awareness

Technical skill gets you in the door. Business awareness keeps you there and accelerates your career.

The core question hiring managers are testing is: can you connect your technical decisions to cost, uptime, and user impact? Infrastructure decisions are business decisions. Cloud costs are often the second-largest engineering expense after salaries. A misconfigured auto-scaling group or a forgotten large EC2 instance can burn thousands of dollars overnight.

Here are a few benchmark questions worth being able to answer comfortably:

If your company has a 99.9% SLA, how many minutes of downtime per month is that? (About 43 minutes.)
If you move workloads from on-demand EC2 instances to Reserved Instances, what's the approximate cost saving? (Around 40–60%.)
If your CI/CD pipeline takes 45 minutes per build and you run 20 builds per day, how much developer wait time does that represent weekly?
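These benchmark numbers are simple enough to work out in a few lines. The sketch below assumes a 30-day month and a five-day work week; the function names are made up for illustration. Note that the build figure is total pipeline time, which equals developer wait time only if each build blocks one developer.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Minutes of downtime a given availability target permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

def weekly_build_hours(minutes_per_build, builds_per_day, workdays=5):
    """Total pipeline time per week, in hours."""
    return minutes_per_build * builds_per_day * workdays / 60

print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
print(weekly_build_hours(45, 20))                # 75.0
```

Seventy-five hours a week is roughly two full-time engineers' worth of waiting, which is the kind of framing that turns "the pipeline is slow" into a business case.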

Most junior candidates can't answer these fluently in an interview. Candidates who can stand out immediately, not because the questions are hard, but because so few people bother to connect infrastructure and business.

The simple habit to build: whenever you describe a technical decision in your project documentation or in an interview, add the business dimension. "I configured auto-scaling" becomes "I configured auto-scaling to handle traffic spikes, which eliminated the cost of over-provisioning and reduced our estimated monthly cloud spend by approximately $X."

Factor 9: Learning Agility

Everyone claims to be a fast learner. It's the most overused phrase in technology job applications. Here's how to make it actually mean something.

Saying "I'm a fast learner" in an interview is table stakes. The question is whether you can prove it. Proof sounds like this: "I had never used GitHub Actions before. I needed a CI/CD pipeline for a project I was building. In 48 hours, I had a working pipeline that runs tests, builds a Docker image, and deploys to AWS."

What makes that credible: it names a specific tool, a specific timeframe, and a specific outcome. There is a GitHub repository with a commit history and a working pipeline that a hiring manager can actually look at.

Learning agility is not about knowing many tools shallowly. It's about picking up new tools quickly because you deeply understand the underlying concepts. Tool names change every few years. The concepts (networking, automation, observability, reliability) do not.

To build a concrete track record of learning agility: once a month, pick one tool you haven't used. Follow its quick-start guide. Build something small. Document what was difficult. Post about it. This is your learning agility portfolio: visible, dated, and specific.

Your 90-Day Action Plan

Here is a concrete, sequential plan that takes you from where you are now to being ready for your first DevOps interview.

Month 1: Build Your Foundation

Focus entirely on Project 1 from the Proof of Work section. Build it completely. Deploy it. Get the live URL. Don't start Project 2 until Project 1 meets all six checklist criteria.

Alongside the build: 30 minutes of Linux and Bash scripting practice daily. This isn't optional, it's the foundation everything else runs on.

Month 2: Expand Your Execution and Start Your Visibility

Begin Project 2 (Terraform IaC). Write your first LinkedIn post. It doesn't need to be polished; it needs to be specific. Join one community and introduce yourself.

Month 3: Complete the Portfolio and Document Everything

Finish all three projects to full checklist standard. Polish every README. Add architecture diagrams. Optimize your GitHub profile: pin your three best repos, write a profile README that describes who you are and what you build, and add links to your live project URLs.

Month 4 Onward: Apply with Strategy

Don't start applying before month four. Apply with real proof of work in hand. Target five to ten quality applications per week rather than spraying a hundred. Include your GitHub and your best project's live URL in every application. For roles at companies where you have a community connection, reach out to that person before applying.

Track every application in a spreadsheet: company, role, date applied, status, outcome, notes. After thirty applications, you'll have enough data to see what's working and what isn't.
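If you keep the tracker as a CSV file, summarizing it takes only a few lines. This is a rough sketch with made-up rows and hypothetical column names (`company`, `role`, `date_applied`, `status`); adapt it to whatever columns your own spreadsheet uses.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; in practice this would be your exported spreadsheet.
SAMPLE = """company,role,date_applied,status
Acme,Junior DevOps Engineer,2026-05-01,rejected
Globex,Cloud Engineer,2026-05-03,interview
Initech,Platform Engineer,2026-05-04,no_response
Umbrella,DevOps Engineer,2026-05-06,interview
"""

def summarize(csv_text):
    """Count applications per status and compute the interview rate."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["status"] for row in rows)
    interview_rate = counts["interview"] / len(rows) if rows else 0.0
    return counts, interview_rate

counts, rate = summarize(SAMPLE)
print(dict(counts))
print(f"interview rate: {rate:.0%}")  # 50%
```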

Here's the full 90-day breakdown:

Timeframe	Focus	Milestone
Week 1-2	Linux fundamentals. Set up GitHub profile. Start Project 1.	Foundation
Week 3-4	Complete Project 1 CI/CD pipeline. Deploy. Get live URL. Write README.	First Proof of Work
Month 2	Begin Project 2. First LinkedIn post. Join one community.	Visibility begins
Month 2-3	Complete Project 2. Scaffold monitoring (Project 3). Post weekly on LinkedIn.	Building momentum
Month 3	Finish all 3 projects to checklist standard. Polish READMEs and GitHub profile.	Portfolio complete
Month 4+	Apply strategically. Continue posting and community engagement.	Active job search

Honest Self-Assessment: Where Do You Stand?

Go through each statement below. Be completely honest: this is for you, not anyone else.

Statement	Action if the answer is No
I can explain a web request end-to-end (DNS → load balancer → compute → database → logs)	Study Factor 2 until you can draw this from memory
I have at least one deployed project with a live URL	This is Priority 1. Nothing else matters more right now.
My best project has a CI/CD pipeline that auto-deploys on push	Add this to your existing project this week
I have written infrastructure as code (Terraform or CloudFormation)	Project 2 is your next build target
My projects have READMEs that explain architecture and decisions	Spend one hour today rewriting your README
I have posted about my learning on LinkedIn in the last 30 days	Post something today: document what you built last week
I am part of at least one DevOps community	Join r/devops or an AWS Discord server this week
I can write a Bash script that solves a real automation problem	30 minutes of daily scripting practice for the next 30 days
I can explain what I built, why I made each decision, and what I'd change	Practice saying this out loud about each project until it's fluent

Count your "no" answers. Each one is a specific, actionable gap, not a vague sense of being behind. That's the difference between this self-assessment and the anxious feeling of "I'm not ready yet." You're not behind. You just have a prioritized list of what to build next.

Conclusion

Here's what you know now that most beginners still don't:

The gap between you and a DevOps job isn't a gap in certifications, a gap in courses completed, or a gap in the number of tools you've heard about. It's a gap in proof of work, visibility, and the consistency with which you execute.

Hiring managers aren't looking for someone who has watched everything. They're looking for someone who has built something, documented it, deployed it, monitored it, and can clearly explain every decision they made along the way.

The path isn't secret. It's just work. Build two to three complete projects that meet the full checklist. Document everything. Show up consistently in communities and on LinkedIn. Apply with strategy. Iterate based on feedback.

If you want a production-grade reference to support your DevOps journey, complete with real Terraform modules, CI/CD workflow templates, infrastructure runbooks, and platform engineering patterns used in real startup environments, The Startup DevOps Field Guide was built for exactly this stage of your career.

The information gap between you and your first DevOps role is smaller than you think. The execution gap is where the work is. Start today.

References and Recommended Resources

roadmap.sh/devops: The community-maintained DevOps learning roadmap. Use this to sequence what you learn next and avoid random jumps between topics.
DORA State of DevOps Report: Free annual report on what DevOps practices actually improve software delivery performance. Gives you the vocabulary hiring managers speak.
Linux Foundation - Introduction to Linux: Free introductory Linux course. If the terminal still makes you nervous, start here.
The Phoenix Project: A business novel about DevOps transformation. Teaches core concepts through story. Gives you vocabulary for business-aware conversations.
ExplainShell.com: Paste any command you find online and see exactly what every part does. Use this constantly while building your projects.
GitHub - How to Write a Good README: Official GitHub guidance on repository documentation.
Prometheus Documentation: Official docs for the monitoring tool used in Project 3.
Terraform Getting Started - AWS: Official step-by-step guide for Project 2.
GitHub Actions Documentation: Complete reference for building CI/CD pipelines in Project 1.
freeCodeCamp - Learn Linux for Beginners: Comprehensive Linux guide available on freeCodeCamp.

How to Set Up OpenID Connect (OIDC) in GitHub Actions for AWS

Tolani Akintayo — Mon, 27 Apr 2026 15:07:43 +0000

If you've been storing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as GitHub Secrets to deploy to AWS, you're not alone. It's the most common approach and it's also one of the biggest security risks in a CI/CD pipeline.

Here's why: static credentials don't expire on their own. If they get leaked through a misconfigured workflow, a public fork, or a compromised repository, an attacker has persistent access to your AWS environment until you manually rotate them. And most teams don't rotate them often enough.

OpenID Connect (OIDC) removes this risk. Instead of storing long-lived credentials, your workflow presents a short-lived token issued by GitHub and exchanges it for temporary AWS credentials every time it runs. No secrets to rotate. No long-lived credentials to leak. No manual key management.

In this tutorial, you'll learn how to set up OIDC authentication between GitHub Actions and AWS from scratch. By the end, your workflows will authenticate to AWS securely without storing a single access key.

What Is OpenID Connect (OIDC)?
How OIDC Works Between GitHub Actions and AWS
Prerequisites
Step 1: Create an IAM OIDC Identity Provider in AWS
Step 2: Create an IAM Role with a Trust Policy
Step 3: Attach Permissions to the IAM Role
Step 4: Store the Role ARN as a GitHub Actions Variable
Step 5: Configure Your GitHub Actions Workflow
Step 6: Run and Verify Your Workflow
Security Best Practices
Troubleshooting Common Errors
Conclusion
References

What Is OpenID Connect (OIDC)?

OpenID Connect is an identity protocol built on top of OAuth 2.0. It allows systems to verify identity through tokens rather than shared secrets.

In the context of GitHub Actions and AWS:

GitHub acts as the identity provider (IdP). It issues a signed JWT (JSON Web Token) for each workflow run.
AWS acts as the service provider. It validates that token against GitHub's public keys and exchanges it for temporary AWS credentials.

The credentials AWS returns are short-lived (valid for 1 hour by default) and scoped to exactly the IAM role you define. When the workflow ends, the runner discards them, and any copy expires on its own once the session duration runs out.

This model is called federated identity. It's the same concept used when you "Sign in with Google" on a third-party website. The difference is that instead of a user signing in, your workflow is the one authenticating.

How OIDC Works Between GitHub Actions and AWS

Before writing a single line of YAML, it's worth understanding the flow. This is my personal approach when implementing new technologies or concepts.

The diagram illustrates a secure authentication flow between GitHub Actions and AWS using OpenID Connect (OIDC), eliminating the need to store long-lived AWS credentials in GitHub. Here's what happens, step by step, every time your workflow runs:

1. Initial Authentication Request

When your GitHub Actions workflow starts, the runner (the virtual machine executing your workflow) requests a JSON Web Token (JWT) from GitHub's OIDC provider located at https://token.actions.githubusercontent.com.

2. Token Issuance

GitHub's OIDC provider generates and signs a JWT containing important claims (metadata) about your workflow. These claims include details like which repository the workflow is running from, which branch triggered it, what environment it's running in, and other contextual information that proves the workflow's identity.

3. Token Validation

The GitHub Actions runner presents this signed JWT to AWS Security Token Service (STS). AWS STS validates the JWT's signature by checking it against GitHub's publicly available cryptographic keys, ensuring the token is authentic and hasn't been tampered with.

4. Trust Policy Verification

AWS STS checks the trust policy configured on your IAM Role. This trust policy specifies which GitHub repositories, branches, or environments are allowed to assume this role. If the claims in the JWT match your trust policy conditions, authentication succeeds.

5. Temporary Credentials Issued

Once validated, AWS STS returns temporary security credentials to the GitHub Actions runner. These credentials include an Access Key ID, Secret Access Key, and Session Token that are valid for a limited time (typically 1 hour by default, configurable up to 12 hours).

6. AWS API Access

The GitHub Actions runner uses these temporary credentials to authenticate API calls to your AWS resources such as pushing Docker images to ECR, updating ECS services, writing to S3 buckets, or invoking Lambda functions.

The key point: AWS never sees your GitHub credentials, and GitHub never sees your AWS credentials. The JWT is the only thing exchanged and it's signed, scoped, and short-lived.
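To make the token in this flow less abstract, here is a minimal Python sketch of what a JWT looks like on the wire and how its claims can be read. The claim values and the unsigned stand-in token are hypothetical, built only for illustration. The sketch skips signature verification entirely, which is the part AWS STS performs in stage 3.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_jwt_payload(token: str) -> dict:
    """Return the claims from a JWT payload WITHOUT verifying the signature."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    # JWT segments drop base64 padding, so restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical claims, modeled on the fields described in the flow above.
sample_claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "aud": "sts.amazonaws.com",
    "sub": "repo:your-org/your-repo:ref:refs/heads/main",
}

# An unsigned stand-in token with the usual header.payload.signature shape.
sample_token = ".".join([
    b64url_encode(json.dumps({"alg": "RS256", "typ": "JWT"}).encode()),
    b64url_encode(json.dumps(sample_claims).encode()),
    b64url_encode(b"not-a-real-signature"),
])

claims = decode_jwt_payload(sample_token)
print(claims["sub"])
```

The iss, aud, and sub claims are the ones the rest of this guide relies on: the provider URL, the audience, and the subject that your trust policy matches against.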

Prerequisites

Before you start, make sure you have the following in place:

An AWS account with IAM permissions to create identity providers and roles
A GitHub repository (public or private) where your workflows will run
Basic familiarity with GitHub Actions, knowing how to write a .yml workflow file
Basic familiarity with AWS IAM roles, policies, and permissions
The AWS CLI installed and configured (optional, but useful for verification)

You don't need to be an AWS expert. Each step includes the exact console path and the configuration values you need.

Step 1: Create an IAM OIDC Identity Provider in AWS

The first thing you need to do is tell AWS to trust GitHub as an identity provider. This is a one-time setup per AWS account.

How to Do It in the AWS Console

1. Open the AWS IAM Console

2. In the left sidebar, click Identity providers

3. Click Add provider

4. For Provider type, select OpenID Connect

5. For Provider URL, enter:

https://token.actions.githubusercontent.com

6. For Audience, enter:

sts.amazonaws.com

7. Click Add provider

How to Do It with the AWS CLI

If you prefer the terminal, run this command:

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com

Once created, you'll see token.actions.githubusercontent.com listed under Identity providers in your IAM console. This provider will be referenced in your IAM role's trust policy in the next step.

Step 2: Create an IAM Role with a Trust Policy

Now you need an IAM role that your GitHub Actions workflow will assume. The trust policy on this role controls which repositories and branches are allowed to request credentials.

How to Create the IAM Role in the AWS Console

1. Open the AWS IAM Console

2. In the left sidebar, click Roles

3. Click Create role

4. For Trusted entity type, select Web identity

5. For Identity provider, choose token.actions.githubusercontent.com, which you created in Step 1

6. For Audience, choose sts.amazonaws.com

7. For GitHub organization, enter your GitHub username or organization name

8. For GitHub repository, enter the name of your repository

9. For GitHub branch, enter your branch name (for example, main)

10. Click Next, then Next again, give the role a name, and click Create role

Note: Creating the role this way already generates a trust policy from the values you entered in steps 4-9. You can verify this by opening the role and navigating to the Trust relationships tab.

How to Create the IAM Role with the AWS CLI

First, create a trust policy document on your local machine. You can call it trust-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_GITHUB_ORG/YOUR_REPO_NAME:*"
        }
      }
    }
  ]
}

Replace the following placeholders before saving:

Placeholder	Replace With
`YOUR_ACCOUNT_ID`	Your 12-digit AWS account ID
`YOUR_GITHUB_ORG`	Your GitHub username or organization name
`YOUR_REPO_NAME`	The name of your GitHub repository
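If you would rather not edit the placeholders by hand, the same document can be generated. The following is a rough Python sketch, not part of the original setup: the function name build_trust_policy is hypothetical, and 123456789012 is a placeholder account ID. It produces the same structure as the JSON above.

```python
import json

PROVIDER = "token.actions.githubusercontent.com"


def build_trust_policy(account_id: str, org: str, repo: str, ref: str = "*") -> dict:
    """Build the trust policy shown above for one repository.

    `ref` is the last part of the sub claim; "*" allows any ref in the repo.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{PROVIDER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {f"{PROVIDER}:aud": "sts.amazonaws.com"},
                    "StringLike": {f"{PROVIDER}:sub": f"repo:{org}/{repo}:{ref}"},
                },
            }
        ],
    }


# Placeholder values; substitute your own account ID, org, and repository.
policy = build_trust_policy("123456789012", "your-org", "your-repo")
print(json.dumps(policy, indent=2))
```

Redirecting the printed output to trust-policy.json gives you the file the create-role command expects.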

How to Understand the `sub` Condition

The sub (subject) claim in the JWT tells AWS exactly where the request is coming from. The value repo:your-org/your-repo:* means any workflow run in that repository can assume this role, regardless of branch, tag, or environment.

You can tighten this further depending on your needs:

# Only the main branch
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
 
# Only a specific GitHub Environment
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"

Scoping this correctly is one of the most important security decisions in this setup. Here's how to decide:

Use ref:refs/heads/main if only your main/production branch should deploy to AWS. This is the most restrictive and secure option: feature branches can't accidentally (or maliciously) trigger deployments or modify production resources.
Use environment:production if you're using GitHub Environments with protection rules (required reviewers, deployment gates). This lets you control deployments through GitHub's approval workflow while still restricting which workflows can access AWS.
Use repo:your-org/your-repo:* (wildcard) only if you need any branch to deploy, for example in development environments where every feature branch deploys to its own isolated stack. Never use this for production roles.
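To see how these three options differ in practice, here is a small Python sketch that approximates how a StringLike condition compares a pattern with the sub claim. It assumes only the two wildcards IAM documents for StringLike (* for any run of characters, ? for a single character) and a case-sensitive comparison. The helper name string_like and the sample sub values are hypothetical, and this is not how AWS implements the check internally.

```python
import re


def string_like(pattern: str, value: str) -> bool:
    """Approximate IAM StringLike: '*' matches any run of characters, '?' matches one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    # fullmatch: the whole claim must match, and the comparison is case-sensitive.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


WILDCARD = "repo:your-org/your-repo:*"
MAIN_ONLY = "repo:your-org/your-repo:ref:refs/heads/main"

# Hypothetical sub claims from two different workflow runs.
MAIN_SUB = "repo:your-org/your-repo:ref:refs/heads/main"
FEATURE_SUB = "repo:your-org/your-repo:ref:refs/heads/feature/login"

print(string_like(WILDCARD, FEATURE_SUB))   # True: any ref in the repo is accepted
print(string_like(MAIN_ONLY, FEATURE_SUB))  # False: only main is accepted
print(string_like(MAIN_ONLY, MAIN_SUB))     # True
```

The second result is the reason to prefer the main-only pattern for production: a run from a feature branch presents a sub claim that doesn't match, so AWS refuses to issue credentials.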

Run this command to create the role using your trust policy:

aws iam create-role \
  --role-name GitHubActionsOIDCRole \
  --assume-role-policy-document file://trust-policy.json \
  --description "Role assumed by GitHub Actions via OIDC"

Take note of the Role ARN in the output. It will look like this:

arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

You'll need this ARN in Step 4, where you store it as a GitHub Actions variable for your workflow to use.

Step 3: Attach Permissions to the IAM Role

The IAM role can now authenticate, but it has no permissions yet. You need to attach a policy that defines what your workflow is actually allowed to do in AWS.

How to Apply the Principle of Least Privilege

Only grant the permissions your workflow genuinely needs. If your workflow deploys to S3, give it S3 permissions. If it pushes images to ECR, give it ECR permissions. Never attach AdministratorAccess to a CI/CD role.

Option 1: Attach an AWS managed policy (quick start):

aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

Option 2: Create a custom policy scoped to a specific S3 bucket (recommended for production):

This approach is recommended for production because it limits the blast radius of a security incident. If your workflow credentials are ever compromised, a custom policy scoped to a specific bucket means an attacker can only affect that single bucket, not every S3 bucket in your AWS account. It also prevents accidental misconfigurations in your workflow from impacting unrelated resources.

Create a file called s3-deploy-policy.json:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Then create and attach it:

aws iam create-policy \
  --policy-name GitHubActionsS3DeployPolicy \
  --policy-document file://s3-deploy-policy.json
 
aws iam attach-role-policy \
  --role-name GitHubActionsOIDCRole \
  --policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/GitHubActionsS3DeployPolicy

Note: You can also complete Step 3 in the AWS console.

Reference: For a full list of available AWS IAM actions, see the AWS IAM actions reference.

Step 4: Store the Role ARN as a GitHub Actions Variable

Before you configure your workflow, you need to make the Role ARN available to it. You'll store it as a repository variable in GitHub, not a secret, because the ARN itself isn't sensitive data.

How to Add the Variable in Your Repository

First, open your GitHub repository and click Settings.

In the left sidebar, scroll down to Secrets and variables, then click Actions.

Then click the Variables tab (not Secrets). Click New repository variable and set the Name to:

AWS_ROLE_ARN

Set the Value to your Role ARN from Step 2, for example:

arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

Click Add variable.

You'll reference this variable in your workflow in the next step using ${{ vars.AWS_ROLE_ARN }}.
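A stray extra colon or a missing account ID in the ARN is easy to overlook and only surfaces later as a failed workflow run. As a quick sanity check before saving the variable, a Python sketch like the one below can validate the shape of the value. It assumes the standard aws partition and a 12-digit account ID. The pattern and the function name parse_role_arn are illustrative, not an official AWS validator.

```python
import re
from typing import Tuple

# arn:aws:iam::<12-digit account ID>:role/<role name, optionally with a path>
ROLE_ARN_PATTERN = re.compile(r"arn:aws:iam::(\d{12}):role/([\w+=,.@/-]+)")


def parse_role_arn(arn: str) -> Tuple[str, str]:
    """Return (account_id, role_name) or raise ValueError for a malformed ARN."""
    match = ROLE_ARN_PATTERN.fullmatch(arn)
    if match is None:
        raise ValueError(f"not a valid IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


# 123456789012 is a placeholder account ID.
print(parse_role_arn("arn:aws:iam::123456789012:role/GitHubActionsOIDCRole"))

try:
    # An extra colon before "role" makes the ARN invalid.
    parse_role_arn("arn:aws:iam::123456789012::role/GitHubActionsOIDCRole")
except ValueError as err:
    print(err)
```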

Step 5: Configure Your GitHub Actions Workflow

With AWS and GitHub fully configured, you now need to update your workflow to request an OIDC token and use it to authenticate.

How to Set the Required Workflow Permissions

Your workflow must declare id-token: write. Without this, GitHub won't issue an OIDC token to the runner.

permissions:
  id-token: write   # Required to request the OIDC JWT
  contents: read    # Required to checkout the repository

Important: If you set permissions at the job level, they override any top-level permissions. Make sure id-token: write is present at whichever level your AWS authentication step runs.
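The override rule is easy to get wrong, so here is a simplified Python model of it. It assumes that a job-level permissions block replaces the top-level one for that job, and that any scope left out of the block in effect is not granted. The function names are hypothetical, and the model ignores repository and organization default settings.

```python
from typing import Dict, Optional

Permissions = Dict[str, str]


def effective_permissions(
    workflow_level: Optional[Permissions],
    job_level: Optional[Permissions],
) -> Permissions:
    """A job-level block, when present, replaces the workflow-level block entirely."""
    chosen = job_level if job_level is not None else workflow_level
    return dict(chosen or {})


def can_request_oidc_token(
    workflow_level: Optional[Permissions],
    job_level: Optional[Permissions] = None,
) -> bool:
    """The runner can request an OIDC token only if id-token is set to write."""
    return effective_permissions(workflow_level, job_level).get("id-token") == "write"


top_level = {"id-token": "write", "contents": "read"}

# No job-level block: the top-level permissions apply.
print(can_request_oidc_token(top_level))

# A job-level block that forgets id-token silently removes it for that job.
print(can_request_oidc_token(top_level, {"contents": "read"}))
```

The second call returns False, which is the situation that typically ends in a credentials error at the AWS authentication step.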

Full Workflow Example

Here's a complete workflow that authenticates to AWS using OIDC and deploys a static site to S3:

name: Deploy to AWS S3
 
on:
  push:
    branches:
      - main
 
permissions:
  id-token: write
  contents: read
 
jobs:
  deploy:
    name: Deploy
    runs-on: ubuntu-latest
 
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_ROLE_ARN }}
          aws-region: us-east-2
 
      - name: Verify AWS identity
        run: aws sts get-caller-identity
 
      - name: Deploy to S3
        run: |
          aws s3 sync ./code s3://your-bucket-name

Replace the following before committing:

Placeholder	Replace With
`AWS_ROLE_ARN`	The variable name for your IAM role ARN in GitHub
`us-east-2`	Your target AWS region
`your-bucket-name`	Your S3 bucket name
`./code`	The local directory containing the files you want to sync to S3

You can see the code sample in my GitHub Repo here.

Note: The aws-actions/configure-aws-credentials action handles the entire OIDC token exchange automatically. It requests the JWT from GitHub, calls sts:AssumeRoleWithWebIdentity, and exports the temporary credentials as environment variables for the rest of the job.

See the action's official documentation for all available options.

Step 6: Run and Verify Your Workflow

Push your workflow to the main branch and open the Actions tab in your repository to watch it run.

What a Successful Run Looks Like

The Configure AWS credentials via OIDC step should show:

Assuming role with OIDC: arn:aws:iam::YOUR_ACCOUNT_ID:role/GitHubActionsOIDCRole

The Verify AWS identity step (aws sts get-caller-identity) should return:

{
    "UserId": "AROA...:GitHubActions",
    "Account": "YOUR_ACCOUNT_ID",
    "Arn": "arn:aws:sts::YOUR_ACCOUNT_ID:assumed-role/GitHubActionsOIDCRole/GitHubActions"
}

If you see an assumed-role ARN in the output, OIDC is working correctly. Your workflow is now authenticating to AWS without a single stored credential.
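If you want a later step or a local script to check this automatically rather than by eye, the output can be parsed. This is a minimal Python sketch, assuming the JSON shape shown above. The helper name assumed_role_name and the sample values, including the truncated AROAEXAMPLE ID and the 123456789012 account ID, are placeholders.

```python
import json
from typing import Optional


def assumed_role_name(identity_json: str) -> Optional[str]:
    """Return the role name if the caller is an assumed role, otherwise None."""
    arn = json.loads(identity_json)["Arn"]
    # arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    resource = arn.split(":", 5)[5]
    parts = resource.split("/")
    if len(parts) != 3 or parts[0] != "assumed-role":
        return None
    return parts[1]


# Placeholder output in the shape `aws sts get-caller-identity` returns.
sample_output = json.dumps({
    "UserId": "AROAEXAMPLE:GitHubActions",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/GitHubActionsOIDCRole/GitHubActions",
})

print(assumed_role_name(sample_output))
```

A result of None would mean the step ran under some other identity, such as an IAM user whose static keys are still configured on the runner.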

Security Best Practices

Getting OIDC working is step one. Locking it down properly is step two.

Scope the `sub` Condition as Tightly as Possible

Don't use a wildcard like repo:your-org/*:* that allows any repository in your organization to assume the role. Scope it to the exact repository and branch that needs access.

"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"

Use GitHub Environments for Production Deployments

GitHub Environments let you add manual approval gates and restrict which branches can deploy. When combined with OIDC, you can scope your trust policy to only allow the production environment:

"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:environment:production"

Apply Least-Privilege Permissions to Every IAM Role

Never attach AdministratorAccess or PowerUserAccess to a role used by CI/CD. Define a custom policy with only the actions your workflow actually needs.
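One way to keep yourself honest here is to scan a policy document for wildcards before attaching it. The sketch below is a deliberately simple Python check. It only flags Allow statements whose action is * or service:*, or whose resource is *. The function name find_broad_statements is hypothetical, and this is no substitute for a tool like IAM Access Analyzer.

```python
from typing import List


def _as_list(value) -> list:
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value or [])


def find_broad_statements(policy: dict) -> List[str]:
    """Return a description of every overly broad Allow found in the policy."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action")):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: broad action {action}")
        for resource in _as_list(stmt.get("Resource")):
            if resource == "*":
                findings.append(f"statement {index}: resource *")
    return findings


# The bucket-scoped policy from Step 3 passes cleanly.
scoped = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::your-bucket-name", "arn:aws:s3:::your-bucket-name/*"],
    }],
}
# A full-access style statement is flagged twice.
broad = {"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}

print(find_broad_statements(scoped))
print(find_broad_statements(broad))
```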

Create Separate IAM Roles Per Environment

A staging role and a production role should have different permission scopes. Your staging deployment role should never have write access to production resources.

Enable AWS CloudTrail

Every call made using the temporary credentials is logged in CloudTrail under the assumed role ARN. This gives you a full audit trail of exactly what your workflow did in AWS.

Reference: GitHub's official security hardening guide for OIDC: About security hardening with OpenID Connect

Troubleshooting Common Errors

Error: `Not authorized to perform sts:AssumeRoleWithWebIdentity`

This usually means the trust policy on your IAM role doesn't match the sub claim in the JWT.

Check the following:

The sub condition exactly matches your repository path (it is case-sensitive)
The aud condition is set to sts.amazonaws.com
The Federated principal uses the correct AWS account ID

To inspect the actual token claims your workflow is receiving, add this debug step temporarily:

- name: Print OIDC token claims
  run: |
    TOKEN=$(curl -s -H "Authorization: Bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com" | jq -r '.value')
    echo $TOKEN | cut -d '.' -f2 | base64 -d 2>/dev/null | jq .

Error: `Could not load credentials from any providers`

This almost always means id-token: write is missing from your workflow permissions. Double-check that you have:

permissions:
  id-token: write
  contents: read

Error: `AccessDenied` When Calling an AWS Service

Authentication succeeded but the IAM role doesn't have permission to perform the action your workflow is attempting. Check the permissions policy attached to your role and compare it against the specific action in the error message.

Conclusion

You've gone from storing static, long-lived AWS credentials in GitHub Secrets to a fully keyless authentication setup using OIDC. Here's what you accomplished:

Registered GitHub as a trusted OIDC identity provider in AWS.
Created an IAM role with a scoped trust policy tied to a specific repository.
Attached least-privilege permissions to that role.
Configured your GitHub Actions workflow to request and use short-lived AWS credentials.
Verified the authentication flow end-to-end.

This pattern works with any AWS service: S3, ECS, Lambda, ECR, Secrets Manager, and more. The workflow example here uses S3, but you only need to swap out the permissions policy and the deployment commands to adapt it for any service.

If you want to go further, explore:

Configuring OIDC for multiple cloud providers: Azure, GCP, and HashiCorp Vault.
GitHub Environments and deployment protection rules: for multi-stage pipelines with approval gates.
AWS IAM Access Analyzer: to validate and tighten your role policies automatically.

If you're building out your DevOps practice and want a complete, production-ready reference for infrastructure automation, CI/CD, and platform engineering, check out The Startup DevOps Field Guide. It covers the patterns, templates, and runbooks I've used across real AWS environments.

You can also connect with me on LinkedIn.
