SRE - freeCodeCamp.org

How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator

Osomudeya Zudonu — Thu, 26 Mar 2026 14:25:52 +0000

If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?

Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.

In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.

By the end, you'll be able to:

Explain the full architecture from vault to pod
Run the lab locally in about 15 minutes
Prove why environment variables go stale after rotation, while mounted secret files stay fresh
Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD
Troubleshoot the most common failures

Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.

Prerequisites
How to Understand the Secret Flow
How to Run the Local Lab
How to Inspect the ExternalSecret and the Application
How to Test Secret Rotation
How to Choose Between External Secrets Operator and the CSI Driver
How to Deploy the Pattern on Amazon Elastic Kubernetes Service
How to Configure GitHub Actions Without Stored AWS Credentials
How to Troubleshoot the Most Common Failures
Conclusion

Prerequisites

Before you begin, make sure you have the following tools installed and configured.

For the local lab:

An AWS account with access to AWS Secrets Manager
The AWS CLI installed and configured. Run aws configure and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.
kubectl installed. For Microk8s, run microk8s kubectl config view --raw > ~/.kube/config after installation to connect kubectl to your local cluster.
Terraform installed
Helm installed
Docker installed
A local Kubernetes cluster: the lab supports Microk8s and kind. If you do not have either installed, follow the Microk8s install guide before continuing.

For the Amazon Elastic Kubernetes Service sections:

An Amazon Elastic Kubernetes Service cluster you can create or manage
A GitHub repository you can configure for workflows and secrets

The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's docs/DEPLOY-LOCAL.md and docs/DEPLOY-EKS.md.

How to Understand the Secret Flow

Before you run any command, you need to understand how the pieces connect.

The flow has four stages:

A developer or automated system updates a secret in AWS Secrets Manager.
The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.
Your pod reads that Kubernetes Secret.
During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.

How the External Secrets Operator Sync Works

The External Secrets Operator reads a custom Kubernetes resource called ExternalSecret. That resource tells the operator three things:

Which secret store to connect to
Which Kubernetes Secret name to create or update
How often to refresh

In this lab, the ExternalSecret creates a Kubernetes Secret named myapp-database-creds. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.

How the App Consumes Secrets

The sample application exposes three endpoints so you can validate behavior at any time.

/secrets/env shows what environment variables the pod sees
/secrets/volume shows what files in the mounted secret directory look like
/secrets/compare compares both and reports whether rotation has been detected

The app checks four keys: DB_USERNAME, DB_PASSWORD, DB_HOST, and DB_PORT.

How to Run the Local Lab

The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.

Step 1: Clone the Repo

git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab

Step 2: Run the Spin-Up Script

bash spinup.sh

The script will ask you to choose a local cluster type. Pick Microk8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.

If the script fails at any point, check docs/TROUBLESHOOTING.md before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a Microk8s storage add-on that is not enabled.

Important: Run the Lab UI

The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application. It's a React-based checklist at lab-ui/ that walks you through each concept and checkpoint as you work through the lab.

To start it, open a second terminal and run:

cd lab-ui && npm install && npm run dev

Then open http://localhost:5173. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.

Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (localhost:3000) are two separate things: the UI guides you through the steps, and the app shows you the live secrets.

Step 3: Access the Application

Once the lab finishes, port-forward the service.

kubectl port-forward svc/myapp 3000:80 -n default

Open http://localhost:3000. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.

Step 4: Validate That Secrets Match

Run the compare endpoint directly from the terminal.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

When everything is working, the response will include "all_match": true.

How to Inspect the ExternalSecret and the Application

At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.

Step 1: Read the ExternalSecret Manifest

Open k8s/aws/external-secret.yaml. Focus on these four fields:

refreshInterval: how often the operator polls AWS Secrets Manager
secretStoreRef: which store the operator authenticates against
target: the name of the Kubernetes Secret to create
data: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys

Here is what that mapping looks like in this lab:

spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username

The property field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.
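To make the mapping concrete, here is a minimal Python sketch of what the operator conceptually does with the data entries: parse the JSON secret, pick out each property, and store it base64-encoded under the matching secretKey. Only the username property comes from the manifest above. The other property names and all sample values are assumptions for illustration, and this is not the operator's actual code.

```python
import base64
import json

# Hypothetical value of the AWS secret prod/myapp/database. Only the
# "username" property is confirmed by the manifest; the rest is assumed.
aws_secret_string = json.dumps({
    "username": "appuser",
    "password": "s3cr3t",
    "host": "db.internal",
    "port": "5432",
})

# Each entry mirrors one item under spec.data: secretKey <- remoteRef.property.
data_mappings = [
    {"secretKey": "DB_USERNAME", "property": "username"},
    {"secretKey": "DB_PASSWORD", "property": "password"},
    {"secretKey": "DB_HOST", "property": "host"},
    {"secretKey": "DB_PORT", "property": "port"},
]


def build_secret_data(secret_string, mappings):
    """Return the base64-encoded data map a Kubernetes Secret would carry."""
    remote = json.loads(secret_string)
    data = {}
    for mapping in mappings:
        value = str(remote[mapping["property"]])
        data[mapping["secretKey"]] = base64.b64encode(value.encode()).decode()
    return data


secret_data = build_secret_data(aws_secret_string, data_mappings)
for key, encoded in secret_data.items():
    print(key, encoded)
```

Note that base64 is an encoding, not encryption: anyone who can read the Kubernetes Secret can decode the values.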

Two fields here are worth understanding before you move on. creationPolicy: Owner means the operator owns the Kubernetes Secret it creates. If you delete the ExternalSecret, the Secret is deleted too. ClusterSecretStore is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain SecretStore is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.

Step 2: Read the Deployment Manifest

Open k8s/aws/deployment.yaml. You are looking for two sections: envFrom and volumeMounts.

envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true

Both paths read from the same Kubernetes Secret, myapp-database-creds. The envFrom block injects all keys as environment variables at pod start. The volumeMounts block mounts the same secret as files under /etc/secrets.

This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.

Step 3: Read the App Comparison Logic

Open app/server.js. The comparison logic reads environment variables from process.env and reads mounted secret files from /etc/secrets/. Then it computes a per-key match and a global all_match value.

The /secrets/compare endpoint sets rotation_detected: true when any key differs between env and volume.
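The real implementation lives in app/server.js, which is JavaScript. As a rough Python re-expression of the same idea, the sketch below compares an environment mapping against files in a directory and derives match, all_match, and rotation_detected. The function name compare_secrets and the sample values are hypothetical, and the actual app may handle missing keys or whitespace differently.

```python
import tempfile
from pathlib import Path

KEYS = ["DB_USERNAME", "DB_PASSWORD", "DB_HOST", "DB_PORT"]


def compare_secrets(env, secrets_dir, keys=KEYS):
    """Compare env values with mounted files and summarize the result."""
    comparison = {}
    for key in keys:
        env_value = env.get(key)
        path = Path(secrets_dir) / key
        volume_value = path.read_text().strip() if path.exists() else None
        comparison[key] = {
            "env": env_value,
            "volume": volume_value,
            "match": env_value == volume_value,
        }
    all_match = all(entry["match"] for entry in comparison.values())
    return {
        "comparison": comparison,
        "all_match": all_match,
        # Any difference between env and volume is treated as a rotation.
        "rotation_detected": not all_match,
    }


with tempfile.TemporaryDirectory() as secrets_dir:
    fake_env = {
        "DB_USERNAME": "appuser",
        "DB_PASSWORD": "old",
        "DB_HOST": "db",
        "DB_PORT": "5432",
    }
    # Start with files that mirror the environment, as at pod start.
    for key, value in fake_env.items():
        (Path(secrets_dir) / key).write_text(value)
    before = compare_secrets(fake_env, secrets_dir)

    # Simulate a rotation: only the mounted file changes.
    (Path(secrets_dir) / "DB_PASSWORD").write_text("new")
    after = compare_secrets(fake_env, secrets_dir)

print(before["all_match"], after["all_match"], after["rotation_detected"])
```

Inside a pod, the equivalent call would pass the process environment and /etc/secrets instead of the temporary directory.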

How to Test Secret Rotation

Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.

How the Rotation Gap Works

When a pod starts, Kubernetes gives it two ways to read a secret.

The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.

The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.

Same secret, two paths. One goes stale while one stays fresh.

The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.

That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.
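You can reproduce the same effect on your laptop without Kubernetes. The sketch below starts a child process with DB_PASSWORD in its environment, rewrites a file that stands in for the mounted secret, and then asks the child what it sees. The file path, the values, and the stdin handshake are all assumptions made for the demo. In a real pod, the kubelet performs the file update.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Code run by the child process, which plays the role of the container.
CHILD_CODE = """
import os, sys
from pathlib import Path
sys.stdin.readline()  # block until the parent signals that rotation happened
print(os.environ["DB_PASSWORD"])
print(Path(sys.argv[1]).read_text())
"""

with tempfile.TemporaryDirectory() as mount_dir:
    secret_file = Path(mount_dir) / "DB_PASSWORD"
    secret_file.write_text("old-password")

    # "Pod start": the environment is copied into the new process once.
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(secret_file)],
        env={**os.environ, "DB_PASSWORD": "old-password"},
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

    # "Rotation": the file changes, but nothing can reach into the
    # child's environment from outside.
    secret_file.write_text("new-password")
    output, _ = child.communicate(input="rotated\n")

env_value, file_value = output.split()
print("env: ", env_value)
print("file:", file_value)
```

The child reports the old value from its environment and the new value from the file, which is the same split the lab shows through /secrets/compare.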

Here is what you're about to observe in the lab:

The rotation script updates the secret in AWS
ESO syncs the new value into Kubernetes within seconds, because the script forces a sync instead of waiting for the 1h refreshInterval
The volume file updates automatically
The environment variable stays stale until the pod restarts
The /secrets/compare endpoint shows both values side by side so you can see the gap live

Step 1: Confirm the Lab Is Ready

Make sure your pod and the External Secrets Operator are both running before you start.

kubectl get pods -n external-secrets
kubectl get pods -n default

Both should show Running.

Step 2: Run the Rotation Test Script

bash rotation/test-rotation.sh

The script performs these actions in order:

Reads the current DB_PASSWORD from the volume mount at /etc/secrets/DB_PASSWORD
Reads the current DB_PASSWORD from the environment variable
Updates AWS Secrets Manager with a new password using put-secret-value
Forces an immediate ESO sync by annotating the ExternalSecret with force-sync
Reads the volume value again
Reads the environment variable again

After the script runs, the volume and the env var will show different values.

Step 3: Validate With the Compare Endpoint

Hit the compare endpoint and look at the output.

curl -s http://localhost:3000/secrets/compare | python3 -m json.tool

You'll see something like this:

{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
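If you want to act on that response in a script, a small parser is enough. The sketch below assumes the response shape shown above and uses a hypothetical helper named stale_keys. It works on a saved copy of the JSON, so you can try it without a running cluster.

```python
import json

# A saved copy of the response shown above.
sample_response = """
{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
"""


def stale_keys(response_text):
    """Return the keys whose env value no longer matches the volume."""
    body = json.loads(response_text)
    return sorted(
        key for key, entry in body["comparison"].items() if not entry["match"]
    )


keys = stale_keys(sample_response)
if keys:
    print("Restart needed, stale env vars:", ", ".join(keys))
else:
    print("Env and volume agree, no restart needed")
```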

Step 4: Restart the Deployment to Sync Env Vars

Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.

kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default

Then hit /secrets/compare again. Every key should now show "match": true, and the response should include "all_match": true.

How to Automate Restarts With Reloader

If you don't want to restart deployments manually after every rotation, you can install Stakater Reloader. It watches an annotation on the Deployment and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.
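The idea behind an automatic restart can be pictured as a change detector: remember a fingerprint of the Secret the pods started with, and restart when the current fingerprint differs. The sketch below is a conceptual illustration only. It is not Reloader's implementation, and the function names and sample values are hypothetical.

```python
import hashlib
import json


def fingerprint(secret_data):
    """Return a stable SHA-256 digest of a Secret's key/value pairs."""
    canonical = json.dumps(secret_data, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def needs_restart(last_seen_digest, secret_data):
    """True when the Secret content differs from what the pods started with."""
    return fingerprint(secret_data) != last_seen_digest


at_pod_start = {"DB_PASSWORD": "old-password-value"}
after_rotation = {"DB_PASSWORD": "new-password-value"}

digest = fingerprint(at_pod_start)
unchanged = needs_restart(digest, at_pod_start)
rotated = needs_restart(digest, after_rotation)
print(unchanged, rotated)
```

In the lab you would let Reloader or kubectl rollout restart do the actual restart. The sketch only shows why a content change is a sufficient trigger.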

How to Choose Between External Secrets Operator and the CSI Driver

Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the Secrets Store CSI Driver.

Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:

Feature	External Secrets Operator	Secrets Store CSI Driver
Creates a Kubernetes Secret	Yes	No by default
Supports `envFrom`	Yes	No (workaround only)
Secret stored in etcd	Yes (base64)	No, if you skip sync
Rotation	ESO updates the Secret, Reloader restarts pods	Volume file can update in place
Best for	Most teams. Multi-cloud, env var support	Security policies that prohibit secrets in etcd

This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both envFrom and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.

Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native envFrom model.

How to Deploy the Pattern on Amazon Elastic Kubernetes Service

The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.

Step 1: Prepare Terraform and OpenID Connect Access

The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the terraform/github-oidc folder.

cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn

Copy the role ARN from the output. You'll need it in the next step.

Step 2: Set the Required Environment Variable

The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.

To find your AWS account ID, run:

aws sts get-caller-identity --query Account --output text

Then set the variable, replacing ACCOUNT with the number that command returns.

export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role

Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service

bash spinup.sh --cluster eks

When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing Match ✓.

Step 4: Test Rotation on the Deployed App

After you confirm normal operation, run the rotation test the same way you did locally.

bash rotation/test-rotation.sh

Then use /secrets/compare on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.

⚠️ Cost warning: Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run bash teardown.sh from the repo root to destroy all AWS resources and stop charges.

How to Configure GitHub Actions Without Stored AWS Credentials

The typical CI/CD setup stores AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in GitHub repository secrets. Those keys never rotate unless someone rotates them by hand. GitHub masks the values, but anyone who can edit a workflow can still exfiltrate them. When someone leaves the team, you have to revoke keys and update every workflow.

OpenID Connect eliminates that problem entirely.

How OpenID Connect Works for GitHub Actions

GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via AssumeRoleWithWebIdentity. No long-lived keys are ever stored anywhere.
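To see what "only accept requests from this repository and branch" looks like, here is a hedged Python sketch that builds a trust policy of the usual shape and checks sample token claims against it. The account ID, the repository name, and both function names are placeholders, and the repo's Terraform may express the conditions differently. The local check is a simplification: AWS also verifies the token's signature and expiry.

```python
import json

PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, repo, branch="main"):
    """Build an IAM trust policy limited to one repository and branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{PROVIDER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{PROVIDER}:aud": "sts.amazonaws.com",
                    f"{PROVIDER}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


def token_is_trusted(policy, claims):
    """Simplified local check of token claims against the policy conditions."""
    expected = policy["Statement"][0]["Condition"]["StringEquals"]
    return all(
        claims.get(condition_key.split(":", 1)[1]) == value
        for condition_key, value in expected.items()
    )


# Placeholder account ID and repository name.
policy = github_oidc_trust_policy("123456789012", "YOUR_ORG/YOUR_REPO")
print(json.dumps(policy, indent=2))

main_branch_run = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main",
}
other_repo_run = {
    "aud": "sts.amazonaws.com",
    "sub": "repo:SOMEONE_ELSE/FORK:ref:refs/heads/main",
}
print(token_is_trusted(policy, main_branch_run))
print(token_is_trusted(policy, other_repo_run))
```

The sub condition is what ties the role to one repository and branch. A workflow run from any other repository presents a different sub claim and is rejected.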

Step 1: Create the IAM Role With Terraform

The terraform/github-oidc folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.

Step 2: Add the Role ARN to GitHub Repository Secrets

In your GitHub repository:

Go to Settings → Secrets and variables → Actions
Click New repository secret
Name it AWS_ROLE_ARN
Paste the role ARN from the Terraform output

That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.

Step 3: Configure Terraform State

For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.

Step 4: Push to Main and Let Workflows Run

After your first spin-up, every push to the main branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use /secrets/compare to validate rotation behavior on the live environment.

How to Troubleshoot the Most Common Failures

Here's a shortlist of the most common symptoms and their fixes.

Symptom	Most Likely Cause	Fix
`ExternalSecret` is not syncing	Missing credentials or wrong store reference	Confirm the operator can access AWS Secrets Manager and that `secretStoreRef` points to the correct store
Pod is stuck in `Pending`	Missing storage setup for local cluster	For Microk8s, enable the storage add-on
Env var still shows the old value after rotation	Rotation happened but the pod never restarted	Run `kubectl rollout restart` or install Reloader
CRD or API version mismatch	ESO version and manifest `apiVersion` don't match	Verify the `apiVersion` for `ClusterSecretStore` and `ExternalSecret` match your installed ESO version
Amazon Elastic Kubernetes Service node group never joins	Networking or IAM permissions for nodes are wrong	Fix internet routing and review the node IAM policy

How to Inspect the Operator and the ExternalSecret

When something isn't syncing, start with these two commands.

# Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets

The status conditions on the ExternalSecret resource will usually tell you exactly what failed.

How to Validate Rotation From the App Side

When you are debugging rotation, don't rely only on Kubernetes resource state. Use the /secrets/compare endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.

Conclusion

You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the ExternalSecret and Deployment manifests, and validated that the application sees the right credentials.

You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.

Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.

The lab repository is at github.com/Osomudeya/k8s-secret-lab. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on Microk8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.

If this helped you, star the repo and share it with someone who is learning Kubernetes.


How to Debug Kubernetes Pods with Traceloop: A Complete Beginner's Guide

Opaluwa Emidowojo — Fri, 29 Aug 2025 16:09:24 +0000

Debugging Kubernetes pods can feel like detective work. Your app crashes, and you're left wondering what happened in those critical moments leading up to failure. Traditional kubectl commands show you logs and statuses, but they can't tell you exactly what your application was doing at the system level when things went wrong.

What if you had a flight recorder for your applications, something that captures every system call in real-time, so you can "rewind" and see the exact sequence of events that led to a crash? That's what Traceloop does. It continuously traces system calls in your pods, giving you a detailed replay of what happened before, during, and after issues occur.

In this guide, you’ll learn how to use Traceloop's system call tracing to debug pod issues that would otherwise be nearly impossible to diagnose.

Prerequisites

Before we begin, here are some prerequisites – things you’ll need to know and have:

Basic Kubernetes concepts: Understanding of pods, deployments, services, and namespaces
kubectl fundamentals: Comfortable with commands like kubectl get, kubectl describe, kubectl logs, and kubectl exec
Container basics: Understanding how containerized applications work
Basic Linux concepts: Understanding of processes and system calls (helpful, but we'll explain as we go)

Technical Requirements

Kubernetes cluster access: Local (minikube, kind, Docker Desktop) or cloud-based cluster
kubectl installed and configured to connect to your cluster
Sufficient permissions (cluster admin or equivalent RBAC) to:
- Install and run eBPF-based tools (Traceloop uses eBPF)
- Create/modify pods and deployments
- Access pod logs and system-level data
Linux-based Kubernetes nodes: Most clusters already run on Linux.

System Requirements

Extended Berkeley Packet Filter (eBPF) support: Used for tracing and monitoring at the kernel level. Kernel version 5.10+ recommended.
Sufficient cluster resources: Traceloop runs alongside your applications

What is Traceloop?
How Traceloop Works
How to Set Up Traceloop
Your First Trace: Hands-On Tutorial
Step-by-Step Debugging Walkthrough
Real-World Debugging Scenarios
Best Practices
Conclusion

What is Traceloop?

Traceloop is a system call tracing and observability tool that works across containerized environments, from Docker containers running locally to pods in production Kubernetes clusters. But before we discuss what that means, let's talk about why system calls matter for debugging.

Every time your application does anything (like opening a file, making a network request, allocating memory, or crashing), it has to interact with the operating system through system calls. These are the fundamental building blocks of how any program interacts with the world around it.

Here's where traditional debugging falls short: when your container crashes, the logs might tell you "segmentation fault" or "out of memory," but they don't tell you the sequence of events that led there. Did the application try to access a file that didn't exist? Was it making network calls that failed? Did it run out of file descriptors?
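To get a feel for what a failed system call looks like, here is a small Python sketch, assuming a Linux or other POSIX system. It makes a raw open() call on a path that doesn't exist and reports the outcome the way a trace would: a return value plus an error name. The helper try_open and the missing path are made up for the example.

```python
import errno
import os


def try_open(path):
    """Attempt a raw open() and report the outcome like a syscall trace."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # The kernel returns a negative errno value; Python surfaces it
        # as an exception carrying the positive errno.
        return -exc.errno, errno.errorcode[exc.errno]
    os.close(fd)
    return fd, "OK"


missing = try_open("/definitely/missing/file")
present = try_open(os.devnull)
print(missing)
print(present)
```

The first call fails with ENOENT. An application that swallows that exception would log nothing, but the failed call would still be visible at the system call level.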

Traceloop captures this missing piece. It sits at the kernel level using eBPF technology, recording every system call your application makes in real-time. Think of it as installing a dashcam in your application. It's always recording with minimal resources, and when something goes wrong, you have the footage.

Strace is another popular debugging tool, but you have to attach it before the problem occurs, which means you need to know there's a problem first. Traceloop can run continuously in the background with minimal overhead. If your container crashes at 3am, you can immediately "rewind the tape" and see exactly what system calls happened leading up to the crash.

This helps debug intermittent issues that happen randomly in production but never when you are watching. Because Traceloop is always recording, you finally have visibility into what your application was doing when these mysterious failures occur.

How Traceloop Works

Now that you understand what Traceloop does, let's look under the hood at how it captures and processes system calls in your containerized environments.

The Technical Foundation

Traceloop is built on eBPF, a technology that allows programs to run safely in the Linux kernel without changing kernel code. Think of eBPF as a way to install "hooks" directly into the kernel that can observe everything happening on your system with minimal performance impact.

Unlike traditional monitoring tools that work from userspace, eBPF programs run in kernel space, giving them access to system calls as they happen, without relying on the application logging appropriate error messages. This is why Traceloop can capture events that never make it to application logs, like failed system calls or crashes that happen before the application can write anything.

The Flight Recorder Architecture

Traceloop uses eBPF maps as an overwriteable ring buffer. Imagine a tape recorder that continuously records over itself. It's always capturing system calls, but it only keeps the most recent data in memory. When something goes wrong, the recording automatically preserves what happened leading up to the incident, just like an airplane's flight recorder after a crash.

This approach solves the production debugging problem: you don't need to predict when issues will happen or attach debuggers after the fact. The recording is always running, waiting for you to need it.
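The overwriting behavior is easy to picture with a user-space analogy. The Python sketch below uses a fixed-size deque as the "tape". Traceloop's real buffer lives in eBPF maps inside the kernel, so treat this as an illustration of the idea rather than its implementation. The class name FlightRecorder is hypothetical.

```python
from collections import deque


class FlightRecorder:
    """Keep only the most recent events, overwriting the oldest ones."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when it is full.
        self._events = deque(maxlen=capacity)

    def record(self, event):
        self._events.append(event)

    def dump(self):
        """Return the events still on the tape, oldest first."""
        return list(self._events)


recorder = FlightRecorder(capacity=3)
for syscall in ["openat", "read", "write", "close", "exit_group"]:
    recorder.record(syscall)

snapshot = recorder.dump()
print(snapshot)  # ['write', 'close', 'exit_group']
```

Memory use stays constant no matter how long the recorder runs, which is what makes always-on recording practical.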

System Call Capture Flow

Here's how Traceloop captures and processes system calls across your Kubernetes environment:

Application pods generate system calls through normal operation – opening files, making network connections, allocating memory.
eBPF probes (also called hooks) intercept these system calls at the kernel level before they're processed.
Traceloop recorder captures the events, buffers them, and adds container context using Inspektor Gadget enrichment (pod name, namespace, container ID).
Output stream formats the data and makes it available for analysis in real-time or after an incident.
Traceloop user views and analyzes the captured trace to diagnose the root cause of issues.

Below is a visual representation of the flow. The key advantage is that Traceloop sees everything your application does, even actions that fail silently or happen too quickly for traditional logging to catch. This gives you complete visibility into your application's interaction with the operating system.

Container Isolation and Context

One of Traceloop's strengths is understanding containerized environments. It doesn't just capture raw system calls – it adds context about which pod, container, and namespace generated each call. This means you can trace specific applications without getting overwhelmed by system calls from other containers running on the same node.

This container awareness makes Traceloop particularly powerful in Kubernetes environments where you might have dozens of pods running on a single node, but you only care about debugging one specific application.

How to Set Up Traceloop

Before we can start tracing system calls, we need to set up Traceloop in your Kubernetes environment. Traceloop is part of the Inspektor Gadget ecosystem, which provides flexibility in how you use it.

Installation Overview

This setup:

Deploys Inspektor Gadget components to all worker nodes
Eliminates the download and initialization overhead on each use, as components are pre-loaded and ready
Eliminates the need to reinstall or reconfigure for each debugging session – just run your traces immediately
Requires cluster admin permissions
Works best for teams doing regular debugging

Installation Requirements

First, ensure your cluster meets the requirements:

Kubernetes cluster with Linux nodes
eBPF support
kubectl installed and configured
Cluster admin permissions

Install kubectl gadget

The recommended way is using krew (kubectl plugin manager):

# Install krew if you don't have it
curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-linux_amd64.tar.gz"
tar zxvf krew-linux_amd64.tar.gz
./krew-linux_amd64 install krew
export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"

# Install kubectl gadget
kubectl krew install gadget

Alternatively, you can install directly:

# For Linux (amd64); pick the matching release asset for other platforms
curl -sL https://github.com/inspektor-gadget/inspektor-gadget/releases/latest/download/kubectl-gadget-linux-amd64.tar.gz | sudo tar -C /usr/local/bin -xzf - kubectl-gadget

# Verify installation
kubectl gadget version

Deploy Inspektor Gadget to Your Cluster

Deploy the Inspektor Gadget components to your cluster:

kubectl gadget deploy

This installs the necessary DaemonSets and RBAC configurations that allow gadgets like Traceloop to run on your cluster nodes.

Alternatively, you can also deploy using Helm.

Verify Installation

Check that the gadget pods are running:

kubectl get pods -n gadget

You should see gadget pods running on each node in your cluster.

Your First Trace: Hands-On Tutorial

Now let's capture our first system call trace. We'll create a simple scenario and watch what happens at the system level.

Setting Up the Test Environment

First, create a dedicated namespace for our tracing experiments:

kubectl create ns test-traceloop-ns

Expected output:

namespace/test-traceloop-ns created

Next, create a simple pod that we can interact with:

kubectl run -n test-traceloop-ns --image busybox test-traceloop-pod --command -- sleep inf

Expected output:

pod/test-traceloop-pod created

This creates a BusyBox container that sleeps indefinitely, giving us a stable target for tracing.

Starting Your First Trace

Next, start tracing system calls for our test pod:

kubectl gadget run traceloop:latest --namespace test-traceloop-ns

This command starts the flight recorder. You'll see column headers showing what information Traceloop captures:

K8S.NODE    K8S.NAMESPACE    K8S.PODNAME    K8S.CONTAINERNAME    CPU    PID    COMM    SYSCALL    PARAMETERS    RET

The trace is now running in this terminal, continuously recording system calls from our pod. Leave it open while you work in a second terminal.

Generating System Calls

With the trace running, let's generate some activity. In a new terminal window, run a command inside your test pod:

kubectl exec -ti -n test-traceloop-ns test-traceloop-pod -- /bin/sh

Once inside the container, run some basic commands:

ls /
echo "Hello World" > /tmp/test.txt
cat /tmp/test.txt

Collecting the Trace

Back in your original terminal where Traceloop is running, press Ctrl+C to stop the recording and see the captured system calls.

You'll see output similar to this:

K8S.NODE            K8S.NAMESPACE        K8S.PODNAME          K8S.CONTAINERNAME    CPU  PID    COMM  SYSCALL      PARAMETERS                   RET
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    openat       dfd=-100, filename="/lib"    3
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    getdents64   fd=3, dirent=0x...          201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    write        fd=1, buf="bin dev etc..."   201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    exit_group   error_code=0                 0

Understanding Your First Trace

Let's break down what we're seeing:

K8S.PODNAME: Which pod generated these system calls
PID: Process ID of the command that ran
COMM: The command name (ls, echo, cat)
SYSCALL: The actual system call made (openat, write, exit_group)
PARAMETERS: Arguments passed to the system call
RET: Return value (0 or a positive number usually means success; a negative number is an error code, such as -13 for permission denied)

This trace shows the ls command opening the /lib directory, reading directory entries, writing the output to stdout, and exiting successfully.
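If you want to work with trace output in a script, the columns described above can be split into named fields. The following is a minimal Python sketch, assuming the output is the plain-text table shown earlier with columns separated by two or more spaces. The `parse_trace` helper is hypothetical and not part of Inspektor Gadget.

```python
import re

# Header plus two rows copied from the sample trace output above.
SAMPLE = """\
K8S.NODE            K8S.NAMESPACE        K8S.PODNAME          K8S.CONTAINERNAME    CPU  PID    COMM  SYSCALL      PARAMETERS                   RET
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    openat       dfd=-100, filename="/lib"    3
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    exit_group   error_code=0                 0
"""

def parse_trace(text):
    """Split a column-aligned trace table into a list of dicts keyed by header name."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Columns are separated by runs of two or more spaces, so single spaces
    # inside a field (as in 'dfd=-100, filename=...') are preserved.
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        fields = re.split(r"\s{2,}", line.strip())
        # Skip rows that do not line up with the header instead of guessing.
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows

rows = parse_trace(SAMPLE)
for row in rows:
    print(row["COMM"], row["SYSCALL"], row["RET"])
```

Splitting on runs of whitespace is a simplification: it would misread a parameter value that itself contains two consecutive spaces, so treat it as a starting point rather than a robust parser.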

Clean Up

Remove the test resources:

kubectl delete pod test-traceloop-pod -n test-traceloop-ns
kubectl delete ns test-traceloop-ns

You can now see exactly what your applications are doing at the kernel level, something that traditional logs and kubectl commands can't show you.

Let's try this with an application that crashes.

Step-by-Step Debugging Walkthrough

Now that you know how to capture traces, let's take a look at a real debugging scenario. We'll create an application that crashes and use Traceloop to uncover the root cause, something that would be nearly impossible with traditional kubectl debugging.

The Scenario: A Mysterious Crash

Let's create a Python application that has a subtle bug. It tries to write to a file it doesn't have permission to access, then crashes. This mimics real-world scenarios where applications fail due to permission issues, missing files, or resource constraints.

Setting Up the Problematic Application

First, we’ll create a new namespace for our debugging exercise:

kubectl create ns debug-traceloop-ns

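Now, let's create a pod with an application that will crash. The python:3.9-slim image runs as root by default, and root is allowed to overwrite /etc/passwd, so the command uses --overrides to run the container as an unprivileged user (UID 1000 is an arbitrary choice):

kubectl run -n debug-traceloop-ns crash-app --image=python:3.9-slim --restart=Never --overrides='{"apiVersion": "v1", "spec": {"securityContext": {"runAsUser": 1000}}}' -- python3 -c "
import sys
import time
print('App starting...')
time.sleep(5)
print('Trying to write to restricted file...')
try:
    # Opening in write mode needs write permission on the file.
    with open('/etc/passwd', 'w') as f:
        f.write('malicious content')
except Exception as e:
    print(f'Error: {e}')
    # Exit non-zero so Kubernetes marks the container as failed.
    sys.exit(1)
"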

This creates a pod that will:

Start successfully
Try to write to /etc/passwd (a restricted system file)
Fail and crash with exit code 1

Starting the Trace Before the Crash

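Here's the key difference from traditional debugging: we start tracing before we know there's a problem. In a real scenario, you'd have Traceloop running continuously. In this exercise the app exits about five seconds after it starts, so have the command below running in a second terminal before you create the crash-app pod. If the pod has already exited, delete it and create it again once the trace is running.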

kubectl gadget run traceloop:latest --namespace debug-traceloop-ns

The trace starts recording immediately. You'll see the column headers, and the flight recorder is now capturing every system call.

Observing the Application Behavior

In another terminal, check the pod status:

kubectl get pods -n debug-traceloop-ns -w

You'll see the pod go through these states:

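Pending → Running → Error

Because the pod was created with --restart=Never, Kubernetes does not restart the container, so the pod stays in the Error state rather than entering CrashLoopBackOff.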

Traditional debugging would show you:

kubectl logs -n debug-traceloop-ns crash-app

Output:

App starting...
Trying to write to restricted file...
Error: [Errno 13] Permission denied: '/etc/passwd'

But this doesn't tell you exactly what the application tried to do at the system level.

Collecting and Analyzing the Trace

Back in your Traceloop terminal, press Ctrl+C to stop the recording. You'll see system calls like this:

K8S.NODE        K8S.NAMESPACE      K8S.PODNAME  COMM    SYSCALL    PARAMETERS                           RET
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename="/etc/passwd"    -13
minikube-docker debug-traceloop-ns crash-app    python3 write      fd=3, buf="App starting..."         16
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename="/etc/passwd"    -13
minikube-docker debug-traceloop-ns crash-app    python3 exit_group error_code=1                        0

Reading the System Call Story

The trace reveals the exact sequence of events:

openat filename="/etc/passwd" RET=-13: The application tried to open /etc/passwd for writing
- Return code -13 = EACCES (Permission denied)
write buf="App starting...": Normal logging output (successful)
openat filename="/etc/passwd" RET=-13: Second attempt to open the restricted file (still denied)
exit_group error_code=1: Application exits with error code 1
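Negative return values such as -13 are negated errno codes, and you do not need to memorize them. As a small sketch, Python's standard `errno` and `os` modules can translate them. The `describe_ret` helper is a hypothetical name, and the mapping assumes you run the script on Linux, because errno numbers are platform-specific.

```python
import errno
import os

def describe_ret(ret):
    """Translate a raw syscall return value into a readable label."""
    if ret >= 0:
        # Zero or a positive value (a byte count, a file descriptor) means success.
        return "ok"
    code = -ret
    # errno.errorcode maps numbers to symbolic names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"

print(describe_ret(-13))  # the failed openat from the trace
print(describe_ret(16))   # the successful write from the trace
```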

What Traceloop Revealed

Traditional debugging told us "Permission denied" but Traceloop shows us:

Exactly which file the application tried to access
When the permission denial happened in the execution flow
How many times it tried (twice in this case)
The exact system call that failed (openat)

Real-World Applications

This same approach works for debugging:

File not found errors: See exactly which files your app is looking for
Network connection failures: Observe failed connect() system calls with specific addresses
Memory issues: Watch mmap() and brk() calls that fail
Container startup problems: See which system calls fail during initialization

Clean Up

Remove the test resources:

kubectl delete pod crash-app -n debug-traceloop-ns
kubectl delete ns debug-traceloop-ns

Key Takeaway

Traditional Kubernetes debugging shows you what went wrong after it happened. Traceloop's continuous recording shows you exactly how it went wrong at the system level. This level of detail is invaluable for debugging complex production issues where the logs don't tell the full story.

Real-World Debugging Scenarios

Now that you understand the fundamentals, let's explore common production issues and how Traceloop helps diagnose them. These scenarios mirror real problems you'll encounter in Kubernetes environments.

Scenario 1: Container Startup Failures

The problem: Your pod gets stuck in CrashLoopBackOff with unhelpful logs.

Traditional kubectl commands show limited information:

kubectl describe pod failing-app
# Events: Back-off restarting failed container

kubectl logs failing-app
# (Empty or minimal output)

System calls show the application tried to:

Access configuration files that don't exist
Connect to services that aren't available
Write to directories without proper permissions

Key system calls to watch:

openat with -2 return (file not found)
connect with -111 return (connection refused)
access with -13 return (permission denied)
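The three patterns above lend themselves to a simple scan over parsed trace rows. The sketch below is illustrative only: the rows are hand-written stand-ins for real trace output, the file paths are hypothetical, and `find_startup_failures` is not an Inspektor Gadget function.

```python
# Hypothetical rows standing in for parsed trace output.
ROWS = [
    {"COMM": "app", "SYSCALL": "openat", "PARAMETERS": 'filename="/config/app.yaml"', "RET": "-2"},
    {"COMM": "app", "SYSCALL": "connect", "PARAMETERS": "fd=3", "RET": "-111"},
    {"COMM": "app", "SYSCALL": "write", "PARAMETERS": "fd=1", "RET": "12"},
    {"COMM": "app", "SYSCALL": "access", "PARAMETERS": 'filename="/data"', "RET": "-13"},
]

# The syscall and return-code pairs named in the list above.
WATCHED = {
    ("openat", -2): "file not found",
    ("connect", -111): "connection refused",
    ("access", -13): "permission denied",
}

def find_startup_failures(rows):
    """Return (syscall, parameters, reason) for every row matching a watched failure."""
    hits = []
    for row in rows:
        key = (row["SYSCALL"], int(row["RET"]))
        if key in WATCHED:
            hits.append((row["SYSCALL"], row["PARAMETERS"], WATCHED[key]))
    return hits

for syscall, params, reason in find_startup_failures(ROWS):
    print(f"{syscall}: {reason} ({params})")
```

In practice you would probably widen the table to cover more syscalls and error codes, but starting from a short watch list keeps the output readable during an incident.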

Scenario 2: Memory and Resource Issues

The problem: Application performance degrades or gets OOMKilled.

What Traceloop shows:

mmap calls failing (memory allocation issues)
brk system calls indicating heap growth
File descriptor exhaustion through failed openat calls
Excessive write calls indicating memory pressure

Example pattern:

SYSCALL    PARAMETERS           RET
mmap       length=1048576       -12  # ENOMEM - out of memory
brk        brk=0x55555557d000   0    # Heap expansion
openat     filename="/tmp/..."   -24  # EMFILE - too many open files

Scenario 3: Network Connectivity Problems

The problem: Service-to-service communication fails intermittently.

Traditional debugging limitations:

Application logs show "connection timeout"
Network policies seem correct
DNS resolution appears to work

What Traceloop reveals:

Exact IP addresses and ports being attempted
DNS resolution patterns through openat on /etc/resolv.conf
Failed connect calls with specific error codes
Socket creation and binding issues

Key indicators:

SYSCALL    PARAMETERS                    RET
socket     family=AF_INET, type=SOCK     3
connect    fd=3, addr=10.96.0.1:443     -110  # ETIMEDOUT
close      fd=3                         0

Scenario 4: Configuration and Secret Issues

The problem: Application can't access mounted secrets or config maps.

What system calls reveal:

File access patterns for mounted volumes
Permission checks on secret files
Configuration file parsing attempts

Common patterns:

Multiple openat attempts on different config file paths
access calls checking file permissions before opening
Failed reads from mounted secret volumes

Scenario 5: Performance Bottlenecks

The problem: Application response times are slow without obvious cause.

Traceloop analysis:

Excessive fsync calls (disk I/O bottlenecks)
Many futex calls (lock contention)
Frequent recvfrom timeouts (network issues)
Repeated file system operations

Performance indicators:

SYSCALL     FREQUENCY    ISSUE
fsync       High         Disk I/O bottleneck
futex       Excessive    Lock contention
poll        Many         Waiting for I/O
recvfrom    Timeouts     Network delays
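One way to spot the "High" or "Excessive" frequencies in the table is to count how often each system call appears in a trace. This is a minimal sketch using Python's `collections.Counter`. The input list is hypothetical and stands in for the SYSCALL column of a real trace.

```python
from collections import Counter

# Hypothetical syscall names, one per recorded event.
SYSCALLS = ["futex", "read", "futex", "fsync", "futex", "poll", "fsync", "futex"]

def top_syscalls(syscalls, n=3):
    """Return the n most frequent syscall names with their counts."""
    return Counter(syscalls).most_common(n)

for name, count in top_syscalls(SYSCALLS):
    print(f"{name:<8} {count}")
```

A raw count only tells you where to look. Whether a given number is excessive depends on the workload, so compare against a trace taken while the application was healthy.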

Best Practices

When to Use Traceloop

Traceloop is most useful when you’re dealing with the kinds of problems that are notoriously difficult to pin down. If you’ve ever struggled with debugging intermittent crashes that don’t happen on demand, or run into confusing permission and access issues, this is where it works best.

It also helps uncover performance bottlenecks at the system level and provides visibility into application behavior during tricky startup failures. Another common use case is diagnosing network connectivity problems between pods, where other tools usually can't help.

Of course, not every problem requires system call tracing. For application-level issues, logs and APM tools are more effective. Cluster-level concerns are often better handled with kubectl describe or by looking at events, and if you’re primarily monitoring resources, standard metrics and dashboards show you what's happening.

Performance Considerations

Like any tracing tool, Traceloop adds some overhead, though it is typically low. You can reduce it further by narrowing the scope of your traces, for example by filtering by namespace with --namespace specific-ns or targeting specific pods with --podname target-pod. In high-traffic environments, it’s best to run traces for shorter periods, and node-specific tracing can further isolate debugging when you don’t want to instrument the entire cluster.

In most cases, Traceloop uses very little CPU and memory, thanks to its eBPF-based approach. This makes it lighter than traditional tools like strace. The actual cost depends on the volume of system calls being recorded, so it’s a good practice to monitor resource usage in your own environment to confirm it’s operating within acceptable limits.

Integration with Your Workflow

Traceloop works well in dev and production workflows. In development, it’s a powerful way to understand how your application interacts with the system. You can use it to confirm that your app handles edge cases correctly, or to validate permission and resource configurations before promoting workloads into production.

In production environments, you can deploy it in different ways. Depending on how much overhead you're okay with, some teams run it continuously on a small subset of nodes, while others use it only when traditional debugging methods don’t provide enough insight. Pairing Traceloop with your existing monitoring and logging stack can give you a much more complete picture of system behavior.

It also helps with teamwork. Sharing trace outputs makes it easier for teams to reason about complex issues together. The data it provides can guide improvements in error handling and logging, and documenting common system call patterns can help onboard new developers more quickly.

Security Considerations

Because Traceloop records low-level system activity, you need to be mindful of what it captures.

What Traceloop Can See:

System call parameters (such as filenames and network addresses)
Process information and command arguments
File access patterns and permissions

Privacy Measures:

Limit trace duration to minimize data collection
Use namespace isolation to avoid capturing unrelated workloads
Apply data retention policies for trace outputs
Watch for sensitive information in file paths or system call parameters
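If you share trace output with teammates, one option is to mask sensitive paths first. The sketch below is an assumption-laden example, not a feature of Traceloop: it assumes parameters are formatted as in the earlier sample output (filename="..."), and the list of sensitive prefixes is illustrative and should be adapted to your cluster.

```python
import re

# Assumption: paths under these prefixes are treated as sensitive.
SENSITIVE_PREFIXES = ("/var/run/secrets/", "/etc/secrets/")

FILENAME_RE = re.compile(r'filename="([^"]*)"')

def redact_parameters(parameters):
    """Mask filename values that point into a sensitive directory."""
    def mask(match):
        path = match.group(1)
        if path.startswith(SENSITIVE_PREFIXES):
            return 'filename="[REDACTED]"'
        # Leave non-sensitive paths untouched.
        return match.group(0)
    return FILENAME_RE.sub(mask, parameters)

print(redact_parameters('dfd=-100, filename="/var/run/secrets/token"'))
print(redact_parameters('dfd=-100, filename="/lib"'))
```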

Conclusion

Traceloop doesn’t just tell you something went wrong – it shows you how. By recording every system call in real time, it turns mysterious Kubernetes failures into solvable problems. Whether the issue happened seconds ago or in the middle of the night, the tool gives you the ability to rewind, inspect, and respond with confidence.

When to Use It

Keep in mind that Traceloop complements your existing debugging toolkit rather than replacing it. Reach for it when logs don’t tell the whole story, when intermittent problems are hiding in the shadows, when kubectl commands leave you guessing, or when you need to see how your application is really interacting with the system.

Once you’re comfortable with Traceloop, you can add more tools. Inspektor Gadget offers other tools for network, security, and performance debugging that pair well with Traceloop. Integrating it into your incident response workflow, sharing insights across your team, and even considering continuous tracing for critical workloads are good things to try next.

The next time you run into a stubborn Kubernetes pod failure, you won’t be stuck speculating. With Traceloop, you can “rewind the tape” and see exactly what happened. System call tracing may sound complex at first, but in practice, it’s one of the most powerful ways to truly understand how applications behave in containerized environments.

PS: Have any questions about Traceloop or want to share your debugging challenges? The Inspektor Gadget team and community hang out in the #inspektor-gadget channel on Kubernetes Slack. It's a great place to get help from the engineers who built these tools, share experiences, and maybe even contribute to making the ecosystem even better.

You can also connect with me on LinkedIn if you’d like to stay in touch. If you made it to the end of this tutorial, thanks for reading!

What is SRE? A Beginner's Guide to Site Reliability Engineering

Omolade Ekpeni — Wed, 26 Mar 2025 16:07:59 +0000

In today’s digital age, we expect our online experiences to be fast, reliable, and always available. But what happens behind the scenes to make our expectations a reality?

The answer is Site Reliability Engineering (SRE). SRE is a discipline that ensures that your favorite online services keep running smoothly, even when things go wrong.

In this guide, you’ll learn about the core principles behind SRE, how automation can help you in this process, how to handle failure, and more.

SRE: More Than Just Fixing Problems
Bridging the Gap Between Development and Operations
The Core Principles of SRE
The SRE Role: A Balancing Act
Why Automation Matters
Key Takeaways for Anyone Involved in Digital Services
Wrapping Up

SRE: More Than Just Fixing Problems

SRE goes beyond reacting to outages. It is a proactive approach to building and maintaining reliable systems. You can think of it as a blend of traditional IT operations, software engineering, and a relentless pursuit of automation.

You might have heard of SRE being discussed alongside DevOps, so let’s differentiate them. DevOps is a broader set of principles that aims to improve collaboration and automation across the entire software development lifecycle. Site Reliability Engineering (SRE), on the other hand, is a specific implementation of these DevOps principles, with a strong focus on the operational aspects of running large-scale, highly reliable systems.

Let’s imagine a software company that wants to embrace DevOps. They might start the process by fostering better communication and shared goals between their development teams (who write the code) and their operations teams (who run the code in production). Also, they might implement continuous integration and continuous delivery (CI/CD) pipelines to automate the process of building, testing, and deploying software. This aligns with DevOps' focus on faster release cycles and improved collaboration.

Within this DevOps-oriented company, the SRE team might be specifically tasked with ensuring the reliability of their e-commerce platform. They would take the general DevOps principles and apply them, from a software engineering perspective, to the platform's operational challenges.

For example, they would:

define and measure Service Level Objectives (SLOs)
develop and implement automated monitoring and alerting systems
create self-healing infrastructure and automated incident response playbooks
collaborate with development teams early in the software development lifecycle to ensure reliability
conduct blameless post-incident reviews to learn from failures
and track and automate away 'toil'.

Bridging the Gap Between Development and Operations

So as you can see, SRE is closely related to DevOps. One of the ways SRE implements DevOps principles is by bridging the gap between development and operations. SREs can do this in several ways.

First, SREs share responsibility with development teams for the reliability and performance of applications in production. This helps foster a collaborative environment and ensures that operational concerns are considered throughout the software development lifecycle.

SREs also provide valuable feedback to development teams based on their operational experience. They understand how software is designed and how it actually runs in production. This unique perspective allows them to identify potential issues early on and suggest improvements to the code, architecture, or deployment process.

And finally, SREs and development teams work together towards common goals, such as improving system reliability, increasing deployment frequency, and reducing time to recovery. This alignment ensures that everyone is working towards the same objectives.

The Core Principles of SRE

Focus on Availability and Reliability

SREs aim to achieve specific service level objectives (SLOs), which are measurable targets for uptime and performance.

Scenario: A popular e-commerce website, used heavily during Nigerian business hours, sets an SLO of 99.9% uptime for its product catalog service. This high standard means the service is expected to be available almost all the time.

To understand just how little downtime this allows, let's break it down:

Downtime percentage: An uptime of 99.9% means the allowed downtime is 100% - 99.9% = 0.1%.
Minutes in a day: There are 24 hours in a day, and each hour has 60 minutes, so there are 24 x 60 = 1440 minutes in a day.
Minutes in an average month: Assuming an average month of 30 days, there are approximately 30 x 1440 = 43,200 minutes in a month.
Allowed downtime in minutes: To find 0.1% of the minutes in a month, we calculate (0.1 / 100) x 43,200 minutes = 0.001 x 43,200 minutes = 43.2 minutes.

Therefore, a 99.9% uptime SLO for the product catalog service means it can be unavailable for a maximum of about 43 minutes per month. The SRE team constantly monitors the service's availability using tools that track request success rates and latency. If the availability drops below 99.95% (a leading indicator), the SRE team is alerted to investigate and remediate before the SLO is breached.
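The same arithmetic can be written as a small function, which makes it easy to compare different SLO targets. This is a minimal sketch that assumes a 30-day month, as the calculation above does. The function name is illustrative.

```python
def allowed_downtime_minutes(slo_percent, days=30):
    """Minutes of downtime permitted by an uptime SLO over a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% uptime -> {allowed_downtime_minutes(slo):.1f} minutes per 30 days")
```

Each extra "nine" cuts the allowance by a factor of ten, which is why tighter SLOs are so much more expensive to meet.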

Example: An online banking platform in Nigeria has an SLO for transaction processing latency: 99% of transactions must be completed within 500 milliseconds. SRE dashboards track this metric in real time. If the latency starts to increase, indicating a potential performance issue, SREs investigate whether it's due to database bottlenecks, network congestion within Nigeria, or application code inefficiencies.
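A latency SLO like this one can be checked by computing the share of transactions that finish within the threshold. The sketch below uses made-up latency samples. The 500 ms threshold and 99% target come from the example above, and the function names are illustrative.

```python
def latency_sli(latencies_ms, threshold_ms=500):
    """Fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    within = sum(1 for value in latencies_ms if value <= threshold_ms)
    return within / len(latencies_ms)

def meets_slo(latencies_ms, target=0.99, threshold_ms=500):
    """True when the measured fraction is at or above the SLO target."""
    return latency_sli(latencies_ms, threshold_ms) >= target

# Hypothetical samples: 98 fast transactions and 2 slow ones.
samples = [120] * 98 + [750, 900]
print(latency_sli(samples), meets_slo(samples))
```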

Embrace Automation

Automation is the heart of SRE. It reduces manual labor, improves consistency, and speeds up issue resolution.

Scenario: When a new server is provisioned for an application, an SRE has automated the entire process using infrastructure-as-code tools (like Terraform or Ansible). This includes configuring the operating system, installing necessary software, setting up monitoring agents, and deploying the application code.

Previously, this involved multiple manual steps taking hours and was prone to human error. Now, it's completed consistently in minutes.

Example: During peak traffic hours (for example, around lunchtime in Nigeria when many people are online), the load on a web server cluster increases. An SRE has implemented auto-scaling rules that automatically add more servers to the cluster when CPU utilization exceeds a certain threshold and remove them when the load decreases. This automated scaling ensures the service remains responsive without manual intervention.
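The scaling rule in this example boils down to a threshold comparison. Here is a simplified sketch of that decision logic. The thresholds and server limits are assumptions chosen for illustration, and real auto-scalers typically add cooldown periods and averaging that are omitted here.

```python
def desired_servers(current, cpu_percent, scale_up_at=70, scale_down_at=30,
                    min_servers=2, max_servers=10):
    """Return the server count after applying simple threshold rules."""
    if cpu_percent > scale_up_at:
        target = current + 1
    elif cpu_percent < scale_down_at:
        target = current - 1
    else:
        target = current
    # Clamp so the cluster never shrinks below or grows beyond its limits.
    return max(min_servers, min(max_servers, target))

print(desired_servers(current=3, cpu_percent=85))  # load is high
print(desired_servers(current=3, cpu_percent=20))  # load is low
print(desired_servers(current=2, cpu_percent=10))  # already at the minimum
```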

Measure Everything

SREs rely on data and metrics to understand system behavior and identify areas for improvement.

Scenario: For a ride-hailing app popular in Lagos, SREs track a wide range of metrics beyond just uptime. These metrics are often referred to as Service Level Indicators (SLIs), which are quantitative measures of a service's performance.

Examples include:

Request latency: How long it takes for a user to request a ride and get a confirmation.
Error rates: The percentage of ride requests or payment transactions that fail.
Resource utilization: CPU, memory, and disk usage of the servers.
Database query performance: The time it takes for database operations.
User engagement metrics: How often key features are used.

These SLIs are crucial for determining if the service is meeting its Service Level Objectives (SLOs) – the target values or ranges for these indicators (for example, 99% of ride requests should have a latency under 200ms). The metrics are visualized on dashboards, allowing SREs to understand the system's health and identify correlations between different indicators, ultimately helping them determine if the SLOs are being met or are at risk.

Example: After deploying a new version of their mobile app, SREs closely monitor key performance indicators (KPIs) like the number of active users in Lagos, the average time to complete a booking, and the frequency of crashes reported by users in Nigeria. This data helps them quickly identify if the new release has introduced any performance or stability regressions.

Work with Developers

SREs collaborate closely with development teams to ensure that applications are designed for reliability.

Scenario: When developers are designing a new feature for their Nigerian user base that involves significant data processing, SREs are involved early in the design phase.

They provide guidance on how to build the feature in a reliable and scalable way, suggesting patterns like circuit breakers, retries, and proper error handling.

This proactive collaboration helps prevent reliability issues from being baked into the application. SREs can also participate in design reviews, providing operational insights and raising concerns about potential failure points.
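To make one of these patterns concrete, here is a minimal sketch of retries with exponential backoff. It is a generic illustration rather than a recommendation for any specific library: the function names, attempt count, and delays are assumptions, and the `sleep` parameter exists only so the example can run without waiting.

```python
import time

def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait 0.1s, 0.2s, 0.4s, ... before the next try.
            sleep(base_delay * (2 ** attempt))

# Hypothetical flaky dependency that fails twice and then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(call_with_retries(flaky, sleep=lambda seconds: None))
```

Production code would usually retry only on errors known to be transient and add random jitter to the delay, so that many clients do not retry at the same moment.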

Example: Before a major marketing campaign is launched in Nigeria, which is expected to significantly increase traffic, SREs work with the development team to perform load testing on the application. This helps identify potential bottlenecks and areas for optimization before the actual surge in users occurs.

SREs can analyze the load test results with developers, providing insights into the system's capacity and suggesting code changes, database optimizations, or infrastructure adjustments to handle the expected load. They can also jointly develop monitoring and alerting rules specific to the campaign's expected traffic.

Learn from Failure

Failure is inevitable. SREs use post-incident reviews to analyze failures, identify root causes, and implement preventative measures.

Scenario: A critical outage occurs on a payment gateway used by many Nigerian businesses. After the service is restored, the SRE team conducts a blameless post-incident review. They gather all relevant data (logs, metrics, timelines, communication records) and collaboratively analyze the sequence of events, the underlying causes (which might involve a combination of software bugs, configuration errors, and insufficient monitoring), and the impact on users.

The outcome of the review is a detailed document outlining the root causes and a list of actionable items with owners and deadlines to prevent similar incidents in the future (for example, improving monitoring for a specific metric, implementing a new rollback strategy, fixing a configuration management issue).

Example: A specific API endpoint becomes slow for a short period during peak hours in Lagos. Even though the impact is minimal, the SRE team still conducts a lightweight post-incident review.

They analyze the logs and metrics to understand why the slowdown happened (perhaps a temporary spike in database load) and identify potential preventative measures, such as optimizing the database query or adjusting resource limits.

The actionable item might be to create a new dashboard specifically for this API endpoint's performance, with a target completion date and assigned to a specific SRE (owner). Afterward, the team will follow up and ensure the dashboard is serving its purpose.

SREs acknowledge that systems will fail, and the goal is not to prevent all failures but to minimize their impact. SREs can achieve this through:

Monitoring: SREs implement real-time tracking of system health and performance, which allows them to detect issues early on.
Logging: They use detailed records of system events for analysis, investigation, debugging, and troubleshooting, which is essential for understanding the root cause of failures.
Alerting: SREs set up automated notifications when system metrics deviate from expected thresholds, enabling them to respond quickly to potential problems.
Incident response: They establish structured and documented procedures for responding to and resolving incidents, ensuring a coordinated and efficient approach.
Post-incident reviews: SREs conduct in-depth analysis of incidents to identify root causes and prevent recurrence, treating every incident as a learning opportunity. This is a crucial aspect of continuous improvement.

The SRE Role: A Balancing Act

SREs face the challenge of balancing day-to-day operational needs with longer-term engineering initiatives. This "balancing act" is crucial for maintaining a system's stability and its ability to evolve and improve.

SREs typically spend their time in two key areas, each requiring a different skillset and focus:

Operational Responsibilities (50%):

An SRE’s operational responsibilities are pretty wide-ranging. They typically involve responding to incidents and outages, which is a core part of any operations role. SREs are often on-call, meaning they are available to address urgent issues outside of regular work hours.

They also handle escalations, which means taking over complex or critical issues that other teams can't resolve.

SREs also provide support to internal and external customers, which can involve troubleshooting problems, answering questions, and providing guidance.

These responsibilities require strong problem-solving skills, quick thinking, and the ability to remain calm under pressure.

Engineering Responsibilities (50%):

Engineering responsibilities are what truly distinguish SREs. SREs are responsible for automating manual tasks, which is crucial for increasing efficiency and reducing errors.

They also develop monitoring and alerting systems, which involve designing and implementing tools to track system health and notify teams of potential problems.

SREs contribute to improving system reliability and performance by identifying and addressing bottlenecks, optimizing code, and implementing best practices.

They contribute to software development with a focus on operational concerns, which means they work with developers to ensure that applications are designed for scalability, maintainability, and resilience.

These responsibilities require strong programming skills, a deep understanding of system architecture, and a proactive approach to problem-solving.

Why Automation Matters

Automation is an important tool that SREs use to achieve both their operational and engineering goals. It's not about replacing human engineers, but about empowering them to work more effectively.

There are several key areas where automation is really important:

Reducing toil: SREs use automation to eliminate repetitive, manual tasks, often referred to as "toil." This frees up their time to focus on more strategic work, such as improving system design and implementing new features.
Improving efficiency: Automation can significantly speed up processes like deployments, rollbacks, and incident response, which leads to faster recovery times and reduced downtime.
Enhancing reliability: By automating critical processes, SREs can reduce the risk of human error, which is a common cause of outages and other issues.
Gaining deeper understanding: Every time an SRE automates a process, they gain a deeper understanding of the system, leading to further improvements or enhancements. This iterative process of automation and learning is central to the SRE approach.

Key Takeaways for Anyone Involved in Digital Services

Reliability is a feature: Treat reliability as a major requirement, not an option.
Automation is essential: Embrace automation to reduce toil and improve efficiency.
Make data-driven decisions: Use metrics to understand system behavior and guide improvements.
Collaboration is key: Foster close collaboration between development and operations teams.
Focus on continuous improvement: Adopt a culture of continuous learning and improvement.

Wrapping Up

You've now gained a foundational understanding of Site Reliability Engineering and its core principles centered around availability, automation, measurement, collaboration, and learning from failure. You’ve also learned how it plays a crucial role in ensuring the smooth operation of the digital services we rely on every day.

If you found this tutorial helpful and want to stay connected for more insights on Site Reliability Engineering, you can follow me on Twitter, connect on LinkedIn, or reach out via email at omolade.ekp@gmail.com.
