<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ SRE - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ SRE - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 18 May 2026 04:48:57 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/sre/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator ]]>
                </title>
                <description>
                    <![CDATA[ If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently? Storing them is straightforward. But handling rotation, stale env vars, and the gap ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-sync-aws-secrets-manager-secrets-into-kubernetes-with-the-external-secrets-operator/</link>
                <guid isPermaLink="false">69c541f010e664c5dadc877e</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cybersecurity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ secrets management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Thu, 26 Mar 2026 14:25:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6cca126e-dd50-4400-ae9d-65449581345b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?</p>
<p>Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.</p>
<p>In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.</p>
<p>By the end, you'll be able to:</p>
<ul>
<li><p>Explain the full architecture from vault to pod</p>
</li>
<li><p>Run the lab locally in about 15 minutes</p>
</li>
<li><p>Prove why environment variables go stale after rotation, while mounted secret files stay fresh</p>
</li>
<li><p>Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD</p>
</li>
<li><p>Troubleshoot the most common failures</p>
</li>
</ul>
<p>Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/ac8bfc9e-304e-41b8-b6a3-7ce1795b29a9.png" alt="Architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds." style="display:block;margin:0 auto" width="1024" height="1536" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-local-lab">How to Run the Local Lab</a></p>
</li>
<li><p><a href="#heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</a></p>
</li>
<li><p><a href="#heading-how-to-test-secret-rotation">How to Test Secret Rotation</a></p>
</li>
<li><p><a href="#heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</a></p>
</li>
<li><p><a href="#heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</a></p>
</li>
<li><p><a href="#heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following tools installed and configured.</p>
<p><strong>For the local lab:</strong></p>
<ul>
<li><p>An AWS account with access to AWS Secrets Manager</p>
</li>
<li><p>The AWS CLI installed and configured. Run <code>aws configure</code> and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.</p>
</li>
<li><p><code>kubectl</code> installed. For Microk8s, run <code>microk8s kubectl config view --raw &gt; ~/.kube/config</code> after installation to connect kubectl to your local cluster.</p>
</li>
<li><p>Terraform installed</p>
</li>
<li><p>Helm installed</p>
</li>
<li><p>Docker installed</p>
</li>
<li><p>A local Kubernetes cluster: the lab supports Microk8s and kind. If you do not have either installed, follow the <a href="https://microk8s.io/">Microk8s install guide</a> before continuing.</p>
</li>
</ul>
<p><strong>For the Amazon Elastic Kubernetes Service sections:</strong></p>
<ul>
<li><p>An Amazon Elastic Kubernetes Service cluster you can create or manage</p>
</li>
<li><p>A GitHub repository you can configure for workflows and secrets</p>
</li>
</ul>
<p>The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-LOCAL.md"><code>docs/DEPLOY-LOCAL.md</code></a> and <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-EKS.md"><code>docs/DEPLOY-EKS.md</code></a>.</p>
<h2 id="heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</h2>
<p>Before you run any command, you need to understand how the pieces connect.</p>
<p>The flow has four stages:</p>
<ol>
<li><p>A developer or automated system updates a secret in AWS Secrets Manager.</p>
</li>
<li><p>The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.</p>
</li>
<li><p>Your pod reads that Kubernetes Secret.</p>
</li>
<li><p>During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/9dc52f99-add4-490a-ad86-25a30d0ae306.png" alt="A step-by-step flow diagram showing the four stages of secret flow above" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-how-the-external-secrets-operator-sync-works">How the External Secrets Operator Sync Works</h3>
<p>The External Secrets Operator reads a custom Kubernetes resource called <code>ExternalSecret</code>. That resource tells the operator three things:</p>
<ul>
<li><p>Which secret store to connect to</p>
</li>
<li><p>Which Kubernetes Secret name to create or update</p>
</li>
<li><p>How often to refresh</p>
</li>
</ul>
<p>In this lab, the <code>ExternalSecret</code> creates a Kubernetes Secret named <code>myapp-database-creds</code>. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.</p>
<h3 id="heading-how-the-app-consumes-secrets">How the App Consumes Secrets</h3>
<p>The sample application exposes three endpoints so you can validate behavior at any time.</p>
<ul>
<li><p><code>/secrets/env</code> shows what environment variables the pod sees</p>
</li>
<li><p><code>/secrets/volume</code> shows what files in the mounted secret directory look like</p>
</li>
<li><p><code>/secrets/compare</code> compares both and reports whether rotation has been detected</p>
</li>
</ul>
<p>The app checks four keys: <code>DB_USERNAME</code>, <code>DB_PASSWORD</code>, <code>DB_HOST</code>, and <code>DB_PORT</code>.</p>
<h2 id="heading-how-to-run-the-local-lab">How to Run the Local Lab</h2>
<p>The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.</p>
<h3 id="heading-step-1-clone-the-repo">Step 1: Clone the Repo</h3>
<pre><code class="language-bash">git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab
</code></pre>
<h3 id="heading-step-2-run-the-spin-up-script">Step 2: Run the Spin-Up Script</h3>
<pre><code class="language-bash">bash spinup.sh
</code></pre>
<p>The script will ask you to choose a local cluster type. Pick Microk8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.</p>
<p>If the script fails at any point, check <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/TROUBLESHOOTING.md"><code>docs/TROUBLESHOOTING.md</code></a> before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a Microk8s storage add-on that is not enabled.</p>
<h3 id="heading-important-run-the-lab-ui">Important: Run the Lab UI</h3>
<p>The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application, it's a React-based checklist at <code>lab-ui/</code> that walks you through each concept and checkpoint as you work through the lab.</p>
<p>To start it, open a second terminal and run:</p>
<pre><code class="language-bash">cd lab-ui &amp;&amp; npm install &amp;&amp; npm run dev
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/873166e9-6bff-4e56-a18d-e58b9e9a5af9.png" alt="Screenshot of npm run dev lab ui terminal" style="display:block;margin:0 auto" width="849" height="435" loading="lazy">

<p>Then open <a href="http://localhost:5173"><code>http://localhost:5173</code></a>. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5a5b220b-3f23-4c7c-8388-f2e23d122e2c.png" alt="Screenshot of The Lab UI, a guided tutorial interface that runs alongside the lab and walks you through each concept and checkpoint." style="display:block;margin:0 auto" width="1399" height="1287" loading="lazy">

<p>Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (<code>localhost:3000</code>) are two separate things, the UI guides you through the steps, the app shows you the live secrets.</p>
<h3 id="heading-step-3-access-the-application">Step 3: Access the Application</h3>
<p>Once the lab finishes, port-forward the service.</p>
<pre><code class="language-bash">kubectl port-forward svc/myapp 3000:80 -n default
</code></pre>
<p>Open <code>http://localhost:3000</code>. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/dbe122ac-b787-40d0-96f4-4b1276bab017.png" alt="Screenshot of the running application at localhost:3000. Every row in the table should show &quot;Match ✓" style="display:block;margin:0 auto" width="1213" height="902" loading="lazy">

<h3 id="heading-step-4-validate-that-secrets-match">Step 4: Validate That Secrets Match</h3>
<p>Run the compare endpoint directly from the terminal.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>When everything is working, the response will include <code>"all_match": true</code>.</p>
<h2 id="heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</h2>
<p>At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.</p>
<h3 id="heading-step-1-read-the-externalsecret-manifest">Step 1: Read the ExternalSecret Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/external-secret.yaml"><code>k8s/aws/external-secret.yaml</code></a>. Focus on these four fields:</p>
<ul>
<li><p><code>refreshInterval</code>: how often the operator polls AWS Secrets Manager</p>
</li>
<li><p><code>secretStoreRef</code>: which store the operator authenticates against</p>
</li>
<li><p><code>target</code>: the name of the Kubernetes Secret to create</p>
</li>
<li><p><code>data</code>: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys</p>
</li>
</ul>
<p>Here is what that mapping looks like in this lab:</p>
<pre><code class="language-yaml">spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username
</code></pre>
<p>The <code>property</code> field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.</p>
<p>Two fields here are worth understanding before you move on. <code>creationPolicy: Owner</code> means the operator owns the Kubernetes Secret it creates. If you delete the <code>ExternalSecret</code>, the Secret is deleted too. <code>ClusterSecretStore</code> is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain <code>SecretStore</code> is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.</p>
<h3 id="heading-step-2-read-the-deployment-manifest">Step 2: Read the Deployment Manifest</h3>
<p>Open <a href="http://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/deployment.yaml"><code>k8s/aws/deployment.yaml</code></a>. You are looking for two sections: <code>envFrom</code> and <code>volumeMounts</code>.</p>
<pre><code class="language-yaml">envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true
</code></pre>
<p>Both paths read from the same Kubernetes Secret, <code>myapp-database-creds</code>. The <code>envFrom</code> block injects all keys as environment variables at pod start.<br>The <code>volumeMounts</code> block mounts the same secret as files under <code>/etc/secrets</code>.</p>
<p>This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.</p>
<h3 id="heading-step-3-read-the-app-comparison-logic">Step 3: Read the App Comparison Logic</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/app/server.js"><code>app/server.js</code></a>. The comparison logic reads environment variables from <code>process.env</code> and reads mounted secret files from <code>/etc/secrets/&lt;key&gt;</code>. Then it computes a per-key match and a global <code>all_match</code> value.</p>
<p>The <code>/secrets/compare</code> endpoint sets <code>rotation_detected: true</code> when any key differs between env and volume.</p>
<h2 id="heading-how-to-test-secret-rotation">How to Test Secret Rotation</h2>
<p>Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.</p>
<h3 id="heading-how-the-rotation-gap-works"><strong>How the Rotation Gap Works</strong></h3>
<p>When a pod starts, Kubernetes gives it two ways to read a secret.</p>
<p>The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.</p>
<p>The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.</p>
<p>Same secret, two paths. One goes stale while one stays fresh.</p>
<p>The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.</p>
<p>That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.</p>
<p>Here is what you're about to observe in the lab:</p>
<ul>
<li><p>The rotation script updates the secret in AWS</p>
</li>
<li><p>ESO syncs the new value into Kubernetes within seconds</p>
</li>
<li><p>The volume file updates automatically</p>
</li>
<li><p>The environment variable stays stale until the pod restarts</p>
</li>
<li><p>The <code>/secrets/compare</code> endpoint shows both values side by side so you can see the gap live</p>
</li>
</ul>
<h3 id="heading-step-1-confirm-the-lab-is-ready">Step 1: Confirm the Lab Is Ready</h3>
<p>Make sure your pod and the External Secrets Operator are both running before you start.</p>
<pre><code class="language-bash">kubectl get pods -n external-secrets
kubectl get pods -n default
</code></pre>
<p>Both should show <code>Running</code>.</p>
<h3 id="heading-step-2-run-the-rotation-test-script">Step 2: Run the Rotation Test Script</h3>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>The script performs these actions in order:</p>
<ol>
<li><p>Reads the current <code>DB_PASSWORD</code> from the volume mount at <code>/etc/secrets/DB_PASSWORD</code></p>
</li>
<li><p>Reads the current <code>DB_PASSWORD</code> from the environment variable</p>
</li>
<li><p>Updates AWS Secrets Manager with a new password using <code>put-secret-value</code></p>
</li>
<li><p>Forces an immediate ESO sync by annotating the <code>ExternalSecret</code> with <code>force-sync</code></p>
</li>
<li><p>Reads the volume value again</p>
</li>
<li><p>Reads the environment variable again</p>
</li>
</ol>
<p>After the script runs, the volume and the env var will show different values.</p>
<h3 id="heading-step-3-validate-with-the-compare-endpoint">Step 3: Validate With the Compare Endpoint</h3>
<p>Hit the compare endpoint and look at the output.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>You'll see something like this:</p>
<pre><code class="language-json">{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c4ebb09f-e605-4f68-8e12-1361d94199b2.png" alt="Rotation mismatch, the volume file updated with the new password but the env var still holds the old value from pod startup." style="display:block;margin:0 auto" width="832" height="290" loading="lazy">

<h3 id="heading-step-4-restart-the-deployment-to-sync-env-vars">Step 4: Restart the Deployment to Sync Env Vars</h3>
<p>Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.</p>
<pre><code class="language-bash">kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default
</code></pre>
<p>Then hit <code>/secrets/compare</code> again. All rows should now show <code>"all_match": true</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0040274d-a398-408c-9486-ce0a9e527479.png" alt="After a rolling restart, new pods pick up fresh env vars and all keys match." style="display:block;margin:0 auto" width="821" height="436" loading="lazy">

<h3 id="heading-how-to-automate-restarts-with-reloader">How to Automate Restarts With Reloader</h3>
<p>If you don't want to restart deployments manually after every rotation, you can install <a href="https://github.com/stakater/reloader"><strong>Stakater Reloader</strong></a>. It watches an annotation on the <code>Deployment</code> and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.</p>
<h2 id="heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</h2>
<p>Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the <a href="https://secrets-store-csi-driver.sigs.k8s.io/">Secrets Store CSI Driver</a>.</p>
<p>Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>External Secrets Operator</th>
<th>Secrets Store CSI Driver</th>
</tr>
</thead>
<tbody><tr>
<td>Creates a Kubernetes Secret</td>
<td>Yes</td>
<td>No by default</td>
</tr>
<tr>
<td>Supports <code>envFrom</code></td>
<td>Yes</td>
<td>No (workaround only)</td>
</tr>
<tr>
<td>Secret stored in etcd</td>
<td>Yes (base64)</td>
<td>No, if you skip sync</td>
</tr>
<tr>
<td>Rotation</td>
<td>ESO updates the Secret, Reloader restarts pods</td>
<td>Volume file can update in place</td>
</tr>
<tr>
<td>Best for</td>
<td>Most teams. Multi-cloud, env var support</td>
<td>Security policies that prohibit secrets in etcd</td>
</tr>
</tbody></table>
<p>This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both <code>envFrom</code> and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.</p>
<p>Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native <code>envFrom</code> model.</p>
<h2 id="heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</h2>
<p>The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.</p>
<h3 id="heading-step-1-prepare-terraform-and-openid-connect-access">Step 1: Prepare Terraform and OpenID Connect Access</h3>
<p>The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder.</p>
<pre><code class="language-bash">cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn
</code></pre>
<p>Copy the role ARN from the output. You'll need it in the next step.</p>
<h3 id="heading-step-2-set-the-required-environment-variable">Step 2: Set the Required Environment Variable</h3>
<p>The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.</p>
<p>To find your AWS account ID, run:</p>
<pre><code class="language-bash">aws sts get-caller-identity --query Account --output text
</code></pre>
<p>Then set the variable, replacing <code>ACCOUNT</code> with the number that command returns.</p>
<pre><code class="language-bash">export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role
</code></pre>
<h3 id="heading-step-3-run-the-spin-up-script-for-amazon-elastic-kubernetes-service">Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service</h3>
<pre><code class="language-bash">bash spinup.sh --cluster eks
</code></pre>
<p>When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing <code>Match ✓</code>.</p>
<h3 id="heading-step-4-test-rotation-on-the-deployed-app">Step 4: Test Rotation on the Deployed App</h3>
<p>After you confirm normal operation, run the rotation test the same way you did locally.</p>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>Then use <code>/secrets/compare</code> on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.</p>
<p>⚠️ <strong>Cost warning:</strong> Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/teardown.sh"><code>bash teardown.sh</code></a> from the repo root to destroy all AWS resources and stop charges.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/56f05ace-9ab6-4b67-ade6-a0bd1fa3962c.png" alt="Screenshot of the app running on the ALB URL, showing all keys matched" style="display:block;margin:0 auto" width="912" height="891" loading="lazy">

<h2 id="heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</h2>
<p>The typical CI/CD setup stores <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> in GitHub repository secrets. Those keys never rotate. Anyone with repo access can read them. When someone leaves the team, you have to revoke keys and update every workflow.</p>
<p>OpenID Connect eliminates that problem entirely.</p>
<h3 id="heading-how-openid-connect-works-for-github-actions">How OpenID Connect Works for GitHub Actions</h3>
<p>GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via <code>AssumeRoleWithWebIdentity</code>. No long-lived keys are ever stored anywhere.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/48e72210-a669-440e-b42e-81b0c15746ec.png" alt="The full OIDC authentication flow for GitHub Actions deploying to EKS — from minting the JWT token through AssumeRoleWithWebIdentity to temporary credentials, kubeconfig retrieval, and final kubectl apply steps." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-step-1-create-the-iam-role-with-terraform">Step 1: Create the IAM Role With Terraform</h3>
<p>The <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.</p>
<h3 id="heading-step-2-add-the-role-arn-to-github-repository-secrets">Step 2: Add the Role ARN to GitHub Repository Secrets</h3>
<p>In your GitHub repository:</p>
<ol>
<li><p>Go to Settings → Secrets and variables → Actions</p>
</li>
<li><p>Click New repository secret</p>
</li>
<li><p>Name it <code>AWS_ROLE_ARN</code></p>
</li>
<li><p>Paste the role ARN from the Terraform output</p>
</li>
</ol>
<p>That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.</p>
<h3 id="heading-step-3-configure-terraform-state">Step 3: Configure Terraform State</h3>
<p>For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.</p>
<h3 id="heading-step-4-push-to-main-and-let-workflows-run">Step 4: Push to Main and Let Workflows Run</h3>
<p>After your first spin-up, every push to the <code>main</code> branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use <code>/secrets/compare</code> to validate rotation behavior on the live environment.</p>
<h2 id="heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</h2>
<p>Here's a shortlist of the most common symptoms and their fixes.</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Most Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td><code>ExternalSecret</code> is not syncing</td>
<td>Missing credentials or wrong store reference</td>
<td>Confirm the operator can access AWS Secrets Manager and that <code>secretStoreRef</code> points to the correct store</td>
</tr>
<tr>
<td>Pod is stuck in <code>Pending</code></td>
<td>Missing storage setup for local cluster</td>
<td>For Microk8s, enable the storage add-on</td>
</tr>
<tr>
<td>Env and volume still match after rotation</td>
<td>Rotation happened but the pod never restarted</td>
<td>Run <code>kubectl rollout restart</code> or install Reloader</td>
</tr>
<tr>
<td>CRD or API version mismatch</td>
<td>ESO version and manifest <code>apiVersion</code> don't match</td>
<td>Verify the <code>apiVersion</code> for <code>ClusterSecretStore</code> and <code>ExternalSecret</code> match your installed ESO version</td>
</tr>
<tr>
<td>Amazon Elastic Kubernetes Service node group never joins</td>
<td>Networking or IAM permissions for nodes are wrong</td>
<td>Fix internet routing and review the node IAM policy</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-inspect-the-operator-and-the-externalsecret">How to Inspect the Operator and the ExternalSecret</h3>
<p>When something isn't syncing, start with these two commands.</p>
<pre><code class="language-bash"># Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets
</code></pre>
<p>The status conditions on the <code>ExternalSecret</code> resource will usually tell you exactly what failed.</p>
<h3 id="heading-how-to-validate-rotation-from-the-app-side">How to Validate Rotation From the App Side</h3>
<p>When you are debugging rotation, don't rely only on Kubernetes resource state. Use the <code>/secrets/compare</code> endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the <code>ExternalSecret</code> and <code>Deployment</code> manifests, and validated that the application sees the right credentials.</p>
<p>You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.</p>
<p>Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.</p>
<p>The lab repository is at <a href="https://github.com/Osomudeya/k8s-secret-lab">github.com/Osomudeya/k8s-secret-lab</a>. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on Microk8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.</p>
<p>If this helped you, star the repo and share it with someone who is learning Kubernetes.</p>
<p><em>I send weekly breakdowns of real production incidents and how engineers actually fix them, not tutorials but real failures<br>→</em> <a href="https://osomudeya.gumroad.com/subscribe"><em>Join the newsletter</em></a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Debug Kubernetes Pods with Traceloop: A Complete Beginner's Guide ]]>
                </title>
                <description>
                    <![CDATA[ Debugging Kubernetes pods can feel like detective work. Your app crashes, and you're left wondering what happened in those critical moments leading up to failure. Traditional kubectl commands show you logs and statuses, but they can't tell you exactl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-debug-kubernetes-pods-with-traceloop-a-complete-beginners-guide/</link>
                <guid isPermaLink="false">68b1d0b4c2405fa2535ed0c8</guid>
                
                    <category>
                        <![CDATA[ Traceloop ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ debugging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ inspektor gadget ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Opaluwa Emidowojo ]]>
                </dc:creator>
                <pubDate>Fri, 29 Aug 2025 16:09:24 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756483063551/4179b718-7883-4a89-a9c2-1c678185469a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Debugging Kubernetes pods can feel like detective work. Your app crashes, and you're left wondering what happened in those critical moments leading up to failure. Traditional <code>kubectl</code> commands show you logs and statuses, but they can't tell you exactly what your application was doing at the system level when things went wrong.</p>
<p>What if you had a flight recorder for your applications, something that captures every system call in real-time, so you can "rewind" and see the exact sequence of events that led to a crash? That's what Traceloop does. It continuously traces system calls in your pods, giving you a detailed replay of what happened before, during, and after issues occur.</p>
<p>In this guide, you’ll learn how to use Traceloop's system call tracing to debug pod issues that would otherwise be nearly impossible to diagnose.</p>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before we begin, here are some prerequisites – things you’ll need to know and have:</p>
<ul>
<li><p><strong>Basic Kubernetes concepts</strong>: Understanding of pods, deployments, services, and namespaces</p>
</li>
<li><p><strong>kubectl fundamentals</strong>: Comfortable with commands like <code>kubectl get</code>, <code>kubectl describe</code>, <code>kubectl logs</code>, and <code>kubectl exec</code></p>
</li>
<li><p><strong>Container basics</strong>: Understanding how containerized applications work</p>
</li>
<li><p><strong>Basic Linux concepts</strong>: Understanding of processes and system calls (helpful, but we'll explain as we go)</p>
</li>
</ul>
<p><strong>Technical Requirements</strong></p>
<ul>
<li><p><strong>Kubernetes cluster access</strong>: Local (minikube, kind, Docker Desktop) or cloud-based cluster</p>
</li>
<li><p><code>kubectl</code> installed and configured to connect to your cluster</p>
</li>
<li><p>Sufficient permissions (cluster admin or equivalent RBAC) to:</p>
<ul>
<li><p>Install and run eBPF-based tools (Traceloop uses eBPF)</p>
</li>
<li><p>Create/modify pods and deployments</p>
</li>
<li><p>Access pod logs and system-level data</p>
</li>
</ul>
</li>
<li><p><strong>Linux-based Kubernetes nodes</strong>: Most clusters already run on Linux.</p>
</li>
</ul>
<p><strong>System Requirements</strong></p>
<ul>
<li><p><strong>Extended Berkeley Packet Filter (eBPF) support</strong>: Used for tracing and monitoring at the kernel level. Kernel version 5.10+ recommended.</p>
</li>
<li><p><strong>Sufficient cluster resources</strong>: Traceloop runs alongside your applications</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-traceloop">What is Traceloop?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-traceloop-works">How Traceloop Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-traceloop">How to Set Up Traceloop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-your-first-trace-hands-on-tutorial">Your First Trace: Hands-On Tutorial</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-by-step-debugging-walkthrough">Step-by-Step Debugging Walkthrough</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices">Best Practices</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-traceloop">What is Traceloop?</h2>
<p><a target="_blank" href="https://inspektor-gadget.io/docs/main/gadgets/traceloop/">Traceloop</a> is a system call tracing and observability tool that works across containerized environments, from Docker containers running locally to pods in production Kubernetes clusters. But before we discuss what that means, let's talk about why system calls matter for debugging.</p>
<p>Every time your application does anything (like opening a file, making a network request, allocating memory, or crashing), it has to interact with the operating system through system calls. These are the fundamental building blocks of how any program interacts with the world around it.</p>
<p>Here's where traditional debugging falls short: when your container crashes, the logs might tell you "segmentation fault" or "out of memory," but they don't tell you the sequence of events that led there. Did the application try to access a file that didn't exist? Was it making network calls that failed? Did it run out of file descriptors?</p>
<p>Traceloop captures this missing piece. It sits at the kernel level using eBPF technology, recording every system call your application makes in real-time. Think of it as installing a dashcam in your application. It's always recording with minimal resources, and when something goes wrong, you have the footage.</p>
<p>Strace is another popular debugging tool – but it requires you to know that there's a problem first. With Traceloop, we can conveniently run it continuously in the background with minimal overhead. If your container crashes at 3am, you can immediately "rewind the tape" and see exactly what system calls happened leading up to the crash.</p>
<p>This helps debug intermittent issues that happen randomly in production but never when you are watching. Because Traceloop is always recording, you finally have visibility into what your application was doing when these mysterious failures occur.</p>
<h2 id="heading-how-traceloop-works">How Traceloop Works</h2>
<p>Now that you understand what Traceloop does, let's look under the hood at how it captures and processes system calls in your containerized environments.</p>
<h3 id="heading-the-technical-foundation">The Technical Foundation</h3>
<p>Traceloop is built on eBPF, a technology that allows programs to run safely in the Linux kernel without changing kernel code. Think of eBPF as a way to install "hooks" directly into the kernel that can observe everything happening on your system with minimal performance impact.</p>
<p>Unlike traditional monitoring tools that work from userspace, eBPF programs run in kernel space, giving them access to system calls as they happen, without relying on the application logging appropriate error messages. This is why Traceloop can capture events that never make it to application logs, like failed system calls or crashes that happen before the application can write anything.</p>
<h3 id="heading-the-flight-recorder-architecture">The Flight Recorder Architecture</h3>
<p>Traceloop uses eBPF maps as an overwriteable ring buffer. Imagine a tape recorder that continuously records over itself. It's always capturing system calls, but it only keeps the most recent data in memory. When something goes wrong, the recording automatically preserves what happened leading up to the incident, just like an airplane's flight recorder after a crash.</p>
<p>This approach solves the production debugging problem: you don't need to predict when issues will happen or attach debuggers after the fact. The recording is always running, waiting for you to need it.</p>
<h3 id="heading-system-call-capture-flow">System Call Capture Flow</h3>
<p>Here's how Traceloop captures and processes system calls across your Kubernetes environment:</p>
<ol>
<li><p><strong>Application pods</strong> generate system calls through normal operation – opening files, making network connections, allocating memory.</p>
</li>
<li><p><strong>eBPF probes (also called hooks)</strong> intercept these system calls at the kernel level before they're processed.</p>
</li>
<li><p><strong>Traceloop recorder</strong> captures the events, buffers them, and adds container context using Inspektor Gadget enrichment (pod name, namespace, container ID).</p>
</li>
<li><p><strong>Output stream</strong> formats the data and makes it available for analysis in real-time or after an incident.</p>
</li>
<li><p><strong>Traceloop user</strong> views and analyzes the captured trace to diagnose the root cause of issues.</p>
</li>
</ol>
<p>Below is a visual representation of the flow. The key advantage is that Traceloop sees everything your application does, even actions that fail silently or happen too quickly for traditional logging to catch. This gives you complete visibility into your application's interaction with the operating system.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755043403339/c5047de7-afc4-48aa-a28e-ee3a1dfbe47f.jpeg" alt="Flow diagram showing how Traceloop works. Application Pods generate system calls, which undergo kernel-level interception via eBPF probes. The probes capture events and pass them to the Traceloop Recorder, which buffers and formats the data. The Output Stream then displays the results to the Traceloop User. The process highlights steps from generating syscalls to capturing, recording, formatting, and presenting the results." class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-container-isolation-and-context">Container Isolation and Context</h3>
<p>One of Traceloop's strengths is understanding containerized environments. It doesn't just capture raw system calls – it adds context about which pod, container, and namespace generated each call. This means you can trace specific applications without getting overwhelmed by system calls from other containers running on the same node.</p>
<p>This container awareness makes Traceloop particularly powerful in Kubernetes environments where you might have dozens of pods running on a single node, but you only care about debugging one specific application.</p>
<h2 id="heading-how-to-set-up-traceloop">How to Set Up Traceloop</h2>
<p>Before we can start tracing system calls, we need to set up Traceloop in your Kubernetes environment. Traceloop is part of the <a target="_blank" href="https://inspektor-gadget.io/">Inspektor Gadget</a> ecosystem, which provides flexibility in how you use it.</p>
<h3 id="heading-installation-overview">Installation Overview</h3>
<p>This setup:</p>
<ul>
<li><p>Deploys Inspektor Gadget components to all worker nodes</p>
</li>
<li><p>Eliminates the download and initialization overhead on each use, as components are pre-loaded and ready </p>
</li>
<li><p>Eliminates the need to reinstall or reconfigure for each debugging session – just run your traces immediately</p>
</li>
<li><p>Requires cluster admin permissions</p>
</li>
<li><p>Works best for teams doing regular debugging</p>
</li>
</ul>
<h4 id="heading-installation-requirements">Installation Requirements</h4>
<p>First, ensure your cluster meets the requirements:</p>
<ul>
<li><p>Kubernetes cluster with Linux nodes</p>
</li>
<li><p>eBPF support</p>
</li>
<li><p>kubectl installed and configured</p>
</li>
<li><p>Cluster admin permissions</p>
</li>
</ul>
<h4 id="heading-install-kubectl-gadget">Install kubectl gadget</h4>
<p>The recommended way is using krew (kubectl plugin manager):</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install krew if you don't have it</span>
curl -fsSLO <span class="hljs-string">"https://github.com/kubernetes-sigs/krew/releases/latest/download/krew-linux_amd64.tar.gz"</span>
tar zxvf krew-linux_amd64.tar.gz
./krew-linux_amd64 install krew
<span class="hljs-built_in">export</span> PATH=<span class="hljs-string">"<span class="hljs-variable">${KREW_ROOT:-<span class="hljs-variable">$HOME</span>/.krew}</span>/bin:<span class="hljs-variable">$PATH</span>"</span>

<span class="hljs-comment"># Install kubectl gadget</span>
kubectl krew install gadget
</code></pre>
<p>Alternatively, you can install directly:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># For Linux/macOS</span>
curl -sL https://github.com/inspektor-gadget/inspektor-gadget/releases/latest/download/kubectl-gadget-linux-amd64.tar.gz | sudo tar -C /usr/<span class="hljs-built_in">local</span>/bin -xzf - kubectl-gadget

<span class="hljs-comment"># Verify installation</span>
kubectl gadget version
</code></pre>
<h4 id="heading-deploy-inspektor-gadget-to-your-cluster">Deploy Inspektor Gadget to Your Cluster</h4>
<p>Deploy the Inspektor Gadget components to your cluster:</p>
<pre><code class="lang-bash">kubectl gadget deploy
</code></pre>
<p>This installs the necessary DaemonSets and RBAC configurations that allow gadgets like Traceloop to run on your cluster nodes.</p>
<p>Alternatively, you can also deploy using <a target="_blank" href="https://inspektor-gadget.io/docs/v0.43.0/reference/install-kubernetes/#installation-with-the-helm-chart">Helm</a>.</p>
<h4 id="heading-verify-installation">Verify Installation</h4>
<p>Check that the gadget pods are running:</p>
<pre><code class="lang-bash">kubectl get pods -n gadget
</code></pre>
<p>You should see gadget pods running on each node in your cluster.</p>
<h2 id="heading-your-first-trace-hands-on-tutorial">Your First Trace: Hands-On Tutorial</h2>
<p>Now let's capture our first system call trace. We'll create a simple scenario and watch what happens at the system level.</p>
<h3 id="heading-setting-up-the-test-environment">Setting Up the Test Environment</h3>
<p>First, create a dedicated namespace for our tracing experiments:</p>
<pre><code class="lang-bash">kubectl create ns test-traceloop-ns
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="lang-bash">namespace/test-traceloop-ns created
</code></pre>
<p>Next, create a simple pod that we can interact with:</p>
<pre><code class="lang-bash">kubectl run -n test-traceloop-ns --image busybox test-traceloop-pod --<span class="hljs-built_in">command</span> -- sleep inf
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="lang-bash">pod/test-traceloop-pod created
</code></pre>
<p>This creates a BusyBox container that sleeps indefinitely, giving us a stable target for tracing.</p>
<h3 id="heading-starting-your-first-trace">Starting Your First Trace</h3>
<p>Next, start tracing system calls for our test pod:</p>
<pre><code class="lang-bash">kubectl gadget run traceloop:latest --namespace test-traceloop-ns
</code></pre>
<p>This command starts the flight recorder. You'll see column headers showing what information Traceloop captures:</p>
<pre><code class="lang-bash">K8S.NODE    K8S.NAMESPACE    K8S.PODNAME    K8S.CONTAINERNAME    CPU    PID    COMM    SYSCALL    PARAMETERS    RET
</code></pre>
<p>The trace is now running in the background, continuously recording system calls from our pod.</p>
<h3 id="heading-generating-system-calls">Generating System Calls</h3>
<p>With the trace running, let's generate some activity. In a new terminal window, run a command inside your test pod:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -ti -n test-traceloop-ns test-traceloop-pod -- /bin/sh
</code></pre>
<p>Once inside the container, run some basic commands:</p>
<pre><code class="lang-bash">ls /
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Hello World"</span> &gt; /tmp/test.txt
cat /tmp/test.txt
</code></pre>
<h3 id="heading-collecting-the-trace">Collecting the Trace</h3>
<p>Back in your original terminal where Traceloop is running, press <strong>Ctrl+C</strong> to stop the recording and see the captured system calls.</p>
<p>You'll see output similar to this:</p>
<pre><code class="lang-bash">K8S.NODE            K8S.NAMESPACE        K8S.PODNAME          K8S.CONTAINERNAME    CPU  PID    COMM  SYSCALL      PARAMETERS                   RET
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    openat       dfd=-100, filename=<span class="hljs-string">"/lib"</span>    3
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    getdents64   fd=3, dirent=0x...          201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    write        fd=1, buf=<span class="hljs-string">"bin dev etc..."</span>   201
minikube-docker     test-traceloop-ns    test-traceloop-pod   test-traceloop-pod   2    95419  ls    exit_group   error_code=0                 0
</code></pre>
<h3 id="heading-understanding-your-first-trace">Understanding Your First Trace</h3>
<p>Let's break down what we're seeing:</p>
<ul>
<li><p><strong>K8S.PODNAME</strong>: Which pod generated these system calls</p>
</li>
<li><p><strong>PID</strong>: Process ID of the command that ran</p>
</li>
<li><p><strong>COMM</strong>: The command name (ls, echo, cat)</p>
</li>
<li><p><strong>SYSCALL</strong>: The actual system call made (openat, write, exit_group)</p>
</li>
<li><p><strong>PARAMETERS</strong>: Arguments passed to the system call</p>
</li>
<li><p><strong>RET</strong>: Return value (0 usually means success)</p>
</li>
</ul>
<p>This trace shows the <code>ls</code> command opening the <code>/lib</code> directory, reading directory entries, writing the output to stdout, and exiting successfully.</p>
<h3 id="heading-clean-up">Clean Up</h3>
<p>Remove the test resources:</p>
<pre><code class="lang-bash">kubectl delete pod test-traceloop-pod -n test-traceloop-ns
kubectl delete ns test-traceloop-ns
</code></pre>
<p>You can now see exactly what your applications are doing at the kernel level, something that traditional logs and kubectl commands can't show you.</p>
<p>Let's try this with an application that crashes.</p>
<h2 id="heading-step-by-step-debugging-walkthrough">Step-by-Step Debugging Walkthrough</h2>
<p>Now that you know how to capture traces, let's take a look at a real debugging scenario. We'll create an application that crashes and use Traceloop to uncover the root cause. Something that would be nearly impossible with traditional kubectl debugging.</p>
<h3 id="heading-the-scenario-a-mysterious-crash">The Scenario: A Mysterious Crash</h3>
<p>Let's create a Python application that has a subtle bug. It tries to write to a file it doesn't have permission to access, then crashes. This mimics real-world scenarios where applications fail due to permission issues, missing files, or resource constraints.</p>
<h3 id="heading-setting-up-the-problematic-application">Setting Up the Problematic Application</h3>
<p>First, we’ll create a new namespace for our debugging exercise:</p>
<pre><code class="lang-bash">kubectl create ns debug-traceloop-ns
</code></pre>
<p>Now, let's create a pod with an application that will crash:</p>
<pre><code class="lang-bash">kubectl run -n debug-traceloop-ns crash-app --image=python:3.9-slim --restart=Never -- python3 -c <span class="hljs-string">"
import time
import os
print('App starting...')
time.sleep(5)
print('Trying to write to restricted file...')
try:
    with open('/etc/passwd', 'w') as f:
        f.write('malicious content')
except Exception as e:
    print(f'Error: {e}')
    exit(1)
"</span>
</code></pre>
<p>This creates a pod that will:</p>
<ol>
<li><p>Start successfully</p>
</li>
<li><p>Try to write to <code>/etc/passwd</code> (a restricted system file)</p>
</li>
<li><p>Fail and crash with exit code 1</p>
</li>
</ol>
<h3 id="heading-starting-the-trace-before-the-crash">Starting the Trace Before the Crash</h3>
<p>Here's the key difference from traditional debugging. We start tracing before we know there's a problem. In a real scenario, you'd have Traceloop running continuously.</p>
<pre><code class="lang-bash">kubectl gadget run traceloop:latest --namespace debug-traceloop-ns
</code></pre>
<p>The trace starts recording immediately. You'll see the column headers, and the flight recorder is now capturing every system call.</p>
<h3 id="heading-observing-the-application-behavior">Observing the Application Behavior</h3>
<p>In another terminal, check the pod status:</p>
<pre><code class="lang-bash">kubectl get pods -n debug-traceloop-ns -w
</code></pre>
<p>You'll see the pod go through these states:</p>
<ul>
<li><code>Pending</code> → <code>Running</code> → <code>Error</code> → <code>CrashLoopBackOff</code></li>
</ul>
<p>Traditional debugging would show you:</p>
<pre><code class="lang-bash">kubectl logs -n debug-traceloop-ns crash-app
</code></pre>
<p>Output:</p>
<pre><code class="lang-bash">App starting...
Trying to write to restricted file...
Error: [Errno 13] Permission denied: <span class="hljs-string">'/etc/passwd'</span>
</code></pre>
<p>But this doesn't tell you exactly what the application tried to do at the system level.</p>
<h3 id="heading-collecting-and-analyzing-the-trace">Collecting and Analyzing the Trace</h3>
<p>Back in your Traceloop terminal, press <strong>Ctrl+C</strong> to stop the recording. You'll see system calls like this:</p>
<pre><code class="lang-bash">K8S.NODE        K8S.NAMESPACE      K8S.PODNAME  COMM    SYSCALL    PARAMETERS                           RET
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename=<span class="hljs-string">"/etc/passwd"</span>    -13
minikube-docker debug-traceloop-ns crash-app    python3 write      fd=3, buf=<span class="hljs-string">"App starting..."</span>         16
minikube-docker debug-traceloop-ns crash-app    python3 openat     dfd=-100, filename=<span class="hljs-string">"/etc/passwd"</span>    -13
minikube-docker debug-traceloop-ns crash-app    python3 exit_group error_code=1                        0
</code></pre>
<h3 id="heading-reading-the-system-call-story">Reading the System Call Story</h3>
<p>The trace reveals the exact sequence of events:</p>
<ol>
<li><p><code>openat filename="/etc/passwd" RET=-13</code>: The application tried to open <code>/etc/passwd</code> for writing</p>
<ul>
<li>Return code <code>-13</code> = <code>EACCES</code> (Permission denied)</li>
</ul>
</li>
<li><p><code>write buf="App starting..."</code>: Normal logging output (successful)</p>
</li>
<li><p><code>openat filename="/etc/passwd" RET=-13</code>: Second attempt to open the restricted file (still denied)</p>
</li>
<li><p><code>exit_group error_code=1</code>: Application exits with error code 1</p>
</li>
</ol>
<h3 id="heading-what-traceloop-revealed">What Traceloop Revealed</h3>
<p>Traditional debugging told us "Permission denied" but Traceloop shows us:</p>
<ul>
<li><p><strong>Exactly which file</strong> the application tried to access</p>
</li>
<li><p><strong>When</strong> the permission denial happened in the execution flow</p>
</li>
<li><p><strong>How many times</strong> it tried (twice in this case)</p>
</li>
<li><p><strong>The exact system call</strong> that failed (<code>openat</code>)</p>
</li>
</ul>
<h3 id="heading-real-world-applications">Real-World Applications</h3>
<p>This same approach works for debugging:</p>
<ul>
<li><p><strong>File not found errors</strong>: See exactly which files your app is looking for</p>
</li>
<li><p><strong>Network connection failures</strong>: Observe failed <code>connect()</code> system calls with specific addresses</p>
</li>
<li><p><strong>Memory issues</strong>: Watch <code>mmap()</code> and <code>brk()</code> calls that fail</p>
</li>
<li><p><strong>Container startup problems</strong>: See which system calls fail during initialization</p>
</li>
</ul>
<h3 id="heading-clean-up-1">Clean Up</h3>
<p>Remove the test resources:</p>
<pre><code class="lang-bash">kubectl delete pod crash-app -n debug-traceloop-ns
kubectl delete ns debug-traceloop-ns
</code></pre>
<h3 id="heading-key-takeaway">Key Takeaway</h3>
<p>Traditional Kubernetes debugging shows you what went wrong after it happened. Traceloop's continuous recording shows you exactly how it went wrong at the system level. This level of detail is invaluable for debugging complex production issues where the logs don't tell the full story.</p>
<h2 id="heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</h2>
<p>Now that you understand the fundamentals, let's explore common production issues and how Traceloop helps diagnose them. These scenarios mirror real problems you'll encounter in Kubernetes environments.</p>
<h3 id="heading-scenario-1-container-startup-failures">Scenario 1: Container Startup Failures</h3>
<p><strong>The problem</strong>: Your pod gets stuck in <code>CrashLoopBackOff</code> with unhelpful logs.</p>
<p>Traditional <code>kubectl</code> commands show limited information:</p>
<pre><code class="lang-bash">kubectl describe pod failing-app
<span class="hljs-comment"># Events: Back-off restarting failed container</span>

kubectl logs failing-app
<span class="hljs-comment"># (Empty or minimal output)</span>
</code></pre>
<p>System calls show the application tried to:</p>
<ol>
<li><p>Access configuration files that don't exist</p>
</li>
<li><p>Connect to services that aren't available</p>
</li>
<li><p>Write to directories without proper permissions</p>
</li>
</ol>
<p>Key system calls to watch:</p>
<ol>
<li><p><code>openat</code> with <code>-2</code> return (file not found)</p>
</li>
<li><p><code>connect</code> with <code>-111</code> return (connection refused)</p>
</li>
<li><p><code>access</code> with <code>-13</code> return (permission denied)</p>
</li>
</ol>
<h3 id="heading-scenario-2-memory-and-resource-issues">Scenario 2: Memory and Resource Issues</h3>
<p><strong>The problem</strong>: Application performance degrades or gets OOMKilled.</p>
<p>What Traceloop shows:</p>
<ol>
<li><p><code>mmap</code> calls failing (memory allocation issues)</p>
</li>
<li><p><code>brk</code> system calls indicating heap growth</p>
</li>
<li><p>File descriptor exhaustion through failed <code>openat</code> calls</p>
</li>
<li><p>Excessive <code>write</code> calls indicating memory pressure</p>
</li>
</ol>
<p><strong>Example pattern</strong>:</p>
<pre><code class="lang-bash">SYSCALL    PARAMETERS           RET
mmap       length=1048576       -12  <span class="hljs-comment"># ENOMEM - out of memory</span>
brk        brk=0x55555557d000   0    <span class="hljs-comment"># Heap expansion</span>
openat     filename=<span class="hljs-string">"/tmp/..."</span>   -24  <span class="hljs-comment"># EMFILE - too many open files</span>
</code></pre>
<h3 id="heading-scenario-3-network-connectivity-problems">Scenario 3: Network Connectivity Problems</h3>
<p><strong>The problem</strong>: Service-to-service communication fails intermittently.</p>
<p>Traditional debugging limitations:</p>
<ol>
<li><p>Application logs show "connection timeout"</p>
</li>
<li><p>Network policies seem correct</p>
</li>
<li><p>DNS resolution appears to work</p>
</li>
</ol>
<p>What Traceloop reveals:</p>
<ol>
<li><p>Exact IP addresses and ports being attempted</p>
</li>
<li><p>DNS resolution patterns through <code>openat</code> on <code>/etc/resolv.conf</code></p>
</li>
<li><p>Failed <code>connect</code> calls with specific error codes</p>
</li>
<li><p>Socket creation and binding issues</p>
</li>
</ol>
<p><strong>Key indicators</strong>:</p>
<pre><code class="lang-bash">SYSCALL    PARAMETERS                    RET
socket     family=AF_INET, <span class="hljs-built_in">type</span>=SOCK     3
connect    fd=3, addr=10.96.0.1:443     -110  <span class="hljs-comment"># ETIMEDOUT</span>
close      fd=3                         0
</code></pre>
<h3 id="heading-scenario-4-configuration-and-secret-issues">Scenario 4: Configuration and Secret Issues</h3>
<p><strong>The problem</strong>: Application can't access mounted secrets or config maps.</p>
<p>What system calls reveal:</p>
<ol>
<li><p>File access patterns for mounted volumes</p>
</li>
<li><p>Permission checks on secret files</p>
</li>
<li><p>Configuration file parsing attempts</p>
</li>
</ol>
<p>Common patterns:</p>
<ol>
<li><p>Multiple <code>openat</code> attempts on different config file paths</p>
</li>
<li><p><code>access</code> calls checking file permissions before opening</p>
</li>
<li><p>Failed reads from mounted secret volumes</p>
</li>
</ol>
<h3 id="heading-scenario-5-performance-bottlenecks">Scenario 5: Performance Bottlenecks</h3>
<p><strong>The problem</strong>: Application response times are slow without obvious cause.</p>
<p>Traceloop analysis:</p>
<ol>
<li><p>Excessive <code>fsync</code> calls (disk I/O bottlenecks)</p>
</li>
<li><p>Many <code>futex</code> calls (lock contention)</p>
</li>
<li><p>Frequent <code>recvfrom</code> timeouts (network issues)</p>
</li>
<li><p>Repeated file system operations</p>
</li>
</ol>
<p><strong>Performance indicators</strong>:</p>
<pre><code class="lang-bash">SYSCALL     FREQUENCY    ISSUE
fsync       High         Disk I/O bottleneck
futex       Excessive    Lock contention
poll        Many         Waiting <span class="hljs-keyword">for</span> I/O
recvfrom    Timeouts     Network delays
</code></pre>
<h2 id="heading-best-practices"><strong>Best Practices</strong></h2>
<h3 id="heading-when-to-use-traceloop"><strong>When to Use Traceloop</strong></h3>
<p>Traceloop is most useful when you’re dealing with the kinds of problems that are notoriously difficult to pin down. If you’ve ever struggled with debugging intermittent crashes that don’t happen on demand, or run into confusing permission and access issues, this is where it works best.  </p>
<p>It also helps uncover performance bottlenecks at the system level and provides visibility into application behavior during tricky startup failures. Another common use case is diagnosing network connectivity problems between pods, where other tools usually can't help</p>
<p>Of course, not every problem requires system call tracing. For application-level issues, logs and APM tools are more effective. Cluster-level concerns are often better handled with <code>kubectl describe</code> or by looking at events, and if you’re primarily monitoring resources, standard metrics and dashboards show you what's happening.</p>
<h3 id="heading-performance-considerations"><strong>Performance Considerations</strong></h3>
<p>Like any tracing tool, Traceloop adds some overhead, but it keeps the overhead low. You can keep it efficient by narrowing the scope of your traces. For example, filtering by namespace with <code>--namespace specific-ns</code>, or targeting specific pods using <code>--podname target-pod</code>. In high-traffic environments, it’s best to run traces for shorter periods, and node-specific tracing can further isolate debugging when you don’t want to instrument the entire cluster.</p>
<p>In most cases, Traceloop uses very little CPU and memory, thanks to its eBPF-based approach. This makes it lighter than traditional tools like strace. The actual cost depends on the volume of system calls being recorded, so it’s a good practice to monitor resource usage in your own environment to confirm it’s operating within acceptable limits.</p>
<h3 id="heading-integration-with-your-workflow"><strong>Integration with Your Workflow</strong></h3>
<p>Traceloop works well in dev and production workflows. In development, it’s a powerful way to understand how your application interacts with the system. You can use it to confirm that your app handles edge cases correctly, or to validate permission and resource configurations before promoting workloads into production.</p>
<p>In production environments, you can deploy it in different ways. Depending on how much overhead you're okay with, some teams run it continuously on a small subset of nodes, while others use it only when traditional debugging methods don’t provide enough insight. Pairing Traceloop with your existing monitoring and logging stack can give you a much more complete picture of system behavior.</p>
<p>It also helps with teamwork. Sharing trace outputs makes it easier for teams to reason about complex issues together. The data it provides can guide improvements in error handling and logging, and documenting common system call patterns can help onboard new developers more quickly.</p>
<h3 id="heading-security-considerations"><strong>Security Considerations</strong></h3>
<p>Because Traceloop records low-level system activity, you need to be mindful of what it captures.</p>
<p><strong>What Traceloop Can See:</strong></p>
<ul>
<li><p>System call parameters (such as filenames and network addresses)</p>
</li>
<li><p>Process information and command arguments</p>
</li>
<li><p>File access patterns and permissions</p>
</li>
</ul>
<p><strong>Privacy Measures:</strong></p>
<ul>
<li><p>Limit trace duration to minimize data collection</p>
</li>
<li><p>Use namespace isolation to avoid capturing unrelated workloads</p>
</li>
<li><p>Apply data retention policies for trace outputs</p>
</li>
<li><p>Watch for sensitive information in file paths or system call parameters</p>
</li>
</ul>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Traceloop doesn’t just tell you something went wrong – it shows you how. By recording every system call in real time, it turns mysterious Kubernetes failures into solvable problems. Whether the issue happened seconds ago or in the middle of the night, the tool gives you the ability to rewind, inspect, and respond with confidence.</p>
<h3 id="heading-when-to-use-it">When to Use It</h3>
<p>Keep in mind that Traceloop complements your existing debugging toolkit rather than replacing it. Reach for it when logs don’t tell the whole story, when intermittent problems are hiding in the shadows, when <code>kubectl</code> commands leave you guessing, or when you need to see how your application is really interacting with the system.</p>
<p>Once you’re comfortable with Traceloop, you can add more tools. <a target="_blank" href="https://inspektor-gadget.io/">Inspektor Gadget</a> offers other tools for network, security, and performance debugging that pair well with Traceloop. Integrating it into your incident response workflow, sharing insights across your team, and even considering continuous tracing for critical workloads are good things to try next.</p>
<p>The next time you run into a stubborn Kubernetes pod failure, you won’t be stuck speculating. With Traceloop, you can “rewind the tape” and see exactly what happened. System call tracing may sound complex at first, but in practice, it’s one of the most powerful ways to truly understand how applications behave in containerized environments.</p>
<p><strong>PS:</strong> Have any questions about Traceloop or want to share your debugging challenges? The Inspektor Gadget team and community hang out in the <a target="_blank" href="https://kubernetes.slack.com/archives/CSYL75LF6">#inspektor-gadget</a> channel on Kubernetes Slack. It's a great place to get help from the engineers who built these tools, share experiences, and maybe even contribute to making the ecosystem even better.  </p>
<p>You can also connect with me on <a target="_blank" href="https://www.linkedin.com/in/emidowojo/">LinkedIn</a> if you’d like to stay in touch. If you made it to the end of this tutorial, thanks for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is SRE? A Beginner's Guide to Site Reliability Engineering ]]>
                </title>
                <description>
                    <![CDATA[ In today’s digital age, we expect our online experiences to be fast, reliable, and always available. But what happens behind the scenes to make our expectations a reality? The answer is Site Reliability Engineering (SRE). SRE is a discipline that ens... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-site-reliability-engineering/</link>
                <guid isPermaLink="false">67e4265f7281ac30028843e1</guid>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Site Reliability Engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Technical Support ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Omolade Ekpeni ]]>
                </dc:creator>
                <pubDate>Wed, 26 Mar 2025 16:07:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743005263079/95dfe528-a274-4172-8a06-46187c1668eb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In today’s digital age, we expect our online experiences to be fast, reliable, and always available. But what happens behind the scenes to make our expectations a reality?</p>
<p>The answer is Site Reliability Engineering (SRE). SRE is a discipline that ensures that your favorite online services keep running smoothly, even when things go wrong.</p>
<p>In this guide, you’ll learn about the core principles behind SRE, how automation can help you in this process, how to handle failure, and more.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-sre-is-more-than-just-fixing-problems">SRE: More Than Just Fixing Problems</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-bridging-the-gap-between-development-and-operations">Bridging the Gap Between Development and Operations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-core-principles-of-sre">The Core Principles of SRE</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-focus-on-availability-and-reliability">Focus on Availability and Reliability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-embrace-automation">Embrace Automation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-measure-everything">Measure Everything</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-work-with-developers">Work with Developers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-learn-from-failure">Learn from Failure</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-sre-role-a-balancing-act">The SRE Role: A Balancing Act</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-automation-matters">Why Automation Matters</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-takeaways-for-anyone-involved-in-digital-services">Key Takeaways for Anyone Involved in Digital Services</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-sre-more-than-just-fixing-problems">SRE: More Than Just Fixing Problems</h2>
<p>SRE goes beyond reacting to outages. It is a proactive approach to building and maintaining reliable systems. You can think of it as a blend of traditional IT operations, software engineering, and a relentless drive or pursuit for automation.</p>
<p>You might have heard of SRE being discussed alongside DevOps, so let’s differentiate them. DevOps is a broader set of principles that aims to improve collaboration and automation across the entire software development lifecycle. Site Reliability Engineering (SRE), on the other hand, is a specific implementation of these DevOps principles, with a strong focus on the operational aspects of running large-scale, highly reliable systems.  </p>
<p>Let’s imagine a software company that wants to embrace DevOps. They might start the process by fostering better communication and shared goals between their development teams (who write the code) and their operations teams (who run the code in production). Also, they might implement continuous integration and continuous delivery (CI/CD) pipelines to automate the process of building, testing, and deploying software. This aligns with DevOps' focus on faster release cycles and improved collaboration.</p>
<p>Within this DevOps-oriented company, the SRE team might be specifically tasked with ensuring the reliability of their e-commerce platform. They would take the general DevOps principles and apply them to the operational challenges being experienced with a software engineering view.</p>
<p>For example, they would:</p>
<ul>
<li><p>define and measure Service Level Objectives (SLOs)</p>
</li>
<li><p>develop and implement automated monitoring and alerting systems</p>
</li>
<li><p>create self-healing infrastructure and automated incident response playbooks</p>
</li>
<li><p>collaborate with development teams early in the software development lifecycle to ensure reliability</p>
</li>
<li><p>conduct blameless post-incident reviews to learn from failures</p>
</li>
<li><p>and track and automate away 'toil'.</p>
</li>
</ul>
<h3 id="heading-bridging-the-gap-between-development-and-operations">Bridging the Gap Between Development and Operations</h3>
<p>So as you can see, SRE is closely related to DevOps. One of the ways SRE implements DevOps principles is by bridging the gap between development and operations. SREs can do this in several ways.</p>
<p>First, SREs share responsibility with development teams for the reliability and performance of applications in production. This helps foster a collaborative environment and ensures that operational concerns are considered throughout the software development lifecycle.</p>
<p>SREs also provide valuable feedback to development teams based on their operational experience. They understand how software is designed and how it actually runs in production. This unique perspective allows them to identify potential issues early on and suggest improvements to the code, architecture, or deployment process.</p>
<p>And finally, SREs and development teams work together towards common goals, such as improving system reliability, increasing deployment frequency, and reducing time to recovery. This alignment ensures that everyone is working towards the same objectives.</p>
<h2 id="heading-the-core-principles-of-sre">The Core Principles of SRE:</h2>
<h3 id="heading-focus-on-availability-and-reliability"><strong>Focus on Availability and Reliability</strong></h3>
<p>SREs aim to achieve specific service level objectives (SLOs), which are measurable targets for uptime and performance.  </p>
<p><strong>Scenario</strong>: A popular e-commerce website, used heavily during Nigerian business hours, sets an SLO of 99.9% uptime for its product catalog service. This high standard means the service is expected to be available almost all the time.  </p>
<p>To understand just how little downtime this allows, let's break it down:</p>
<ol>
<li><p><strong>Downtime Percentage</strong>: An uptime of 99.9% means the allowed downtime is 100% - 99.9% = 0.1%.</p>
<ul>
<li><p><strong>Minutes in a day</strong>: There are 24 hours in a day, and each hour has 60 minutes, so there are 24 x 60 = 1440 minutes in a day.</p>
</li>
<li><p><strong>Minutes in an average month*</strong>:* Assuming an average month of 30 days, there are approximately 30 x 1440 = 43,200 minutes in a month.</p>
</li>
<li><p><strong>Allowed downtime in minutes</strong>: To find 0.1% of the minutes in a month, we calculate (0.1 / 100) x 43,200 minutes = 0.001 x 43,200 minutes = 43.2 minutes.</p>
</li>
</ul>
</li>
</ol>
<p>Therefore, a 99.9% uptime SLO for the product catalog service means it can be unavailable for a maximum of about 43 minutes per month. The SRE team constantly monitors the service's availability using tools that track request success rates and latency. If the availability drops below 99.95% (a leading indicator), the SRE team is alerted to investigate and remediate before the SLO is breached.  </p>
<p><strong>Example</strong>: An online banking platform in Nigeria has an SLO for transaction processing latency: 99% of transactions must be completed within 500 milliseconds. SRE dashboards track this metric in real time. If the latency starts to increase, indicating a potential performance issue, SREs investigate whether it's due to database bottlenecks, network congestion within Nigeria, or application code inefficiencies.</p>
<h3 id="heading-embrace-automation"><strong>Embrace Automation</strong></h3>
<p>Automation is the heart of SRE. It reduces manual labor, improves consistency, and speeds up issue resolution.  </p>
<p><strong>Scenario</strong>: When a new server is provisioned for an application, an SRE has automated the entire process using infrastructure-as-code tools (like Terraform or Ansible). This includes configuring the operating system, installing necessary software, setting up monitoring agents, and deploying the application code.</p>
<p>Previously, this involved multiple manual steps taking hours and was prone to human error. Now, it's completed consistently in minutes.  </p>
<p><strong>Example</strong>: During peak traffic hours (for example, around lunchtime in Nigeria when many people are online), the load on a web server cluster increases. An SRE has implemented auto-scaling rules that automatically add more servers to the cluster when CPU utilization exceeds a certain threshold and remove them when the load decreases. This automated scaling ensures the service remains responsive without manual intervention.</p>
<h3 id="heading-measure-everything"><strong>Measure Everything</strong></h3>
<p>SREs rely on data and metrics to understand system behavior and identify various areas for improvement.  </p>
<p><strong>Scenario</strong>: For a ride-hailing app popular in Lagos, SREs track a wide range of metrics beyond just uptime. These metrics are often referred to as Service Level Indicators (SLIs), which are quantitative measures of a service's performance.</p>
<p>Examples include:</p>
<ol>
<li><p><strong>Request latency</strong>: How long it takes for a user to request a ride and get a confirmation.</p>
</li>
<li><p><strong>Error rates:</strong> The percentage of ride requests or payment transactions that fail.</p>
</li>
<li><p><strong>Resource utilization:</strong> CPU, memory, and disk usage of the servers.</p>
</li>
<li><p><strong>Database query performance:</strong> The time it takes for database operations.</p>
</li>
<li><p><strong>User engagement metrics:</strong> How often key features are used.</p>
</li>
</ol>
<p>These SLIs are crucial for determining if the service is meeting its Service Level Objectives (SLOs) – the target values or ranges for these indicators (for example, 99% of ride requests should have a latency under 200ms). The metrics are visualized on dashboards, allowing SREs to understand the system's health and identify correlations between different indicators, ultimately helping them determine if the SLOs are being met or are at risk.  </p>
<p><strong>Example</strong>: After deploying a new version of their mobile app, SREs closely monitor key performance indicators (KPIs) like the number of active users in Lagos, the average time to complete a booking, and the frequency of crashes reported by users in Nigeria. This data helps them quickly identify if the new release has introduced any performance or stability regressions.</p>
<h3 id="heading-work-with-developers"><strong>Work with Developers</strong></h3>
<p>SREs collaborate closely with development teams to ensure that applications are designed for reliability.  </p>
<p><strong>Scenario</strong>: When developers are designing a new feature for their Nigerian user base that involves significant data processing, SREs are involved early in the design phase.</p>
<p>They provide guidance on how to build the feature in a reliable and scalable way, suggesting patterns like circuit breakers, retries, and proper error handling.</p>
<p>This proactive collaboration helps prevent reliability issues from being baked into the application. SREs can also participate in design reviews, providing operational insights and raising concerns about potential failure points.  </p>
<p><strong>Example</strong>: Before a major marketing campaign is launched in Nigeria, which is expected to significantly increase traffic, SREs work with the development team to perform load testing on the application. This helps identify potential bottlenecks and areas for optimization before the actual surge in users occurs.</p>
<p>SREs provide insights into the system's capacity and suggest code changes or infrastructure adjustments to handle the anticipated load. SREs can analyze the load test results with developers, providing insights into the system's capacity and suggesting code changes, database optimizations, or infrastructure adjustments to handle the expected load. They can also jointly develop monitoring and alerting rules specific to the campaign's expected traffic.</p>
<h3 id="heading-learn-from-failure"><strong>Learn from Failure</strong></h3>
<p>Failure is inevitable. SREs use post-incident reviews to analyze failures, identify root causes, and implement preventative measures.  </p>
<p><strong>Scenario:</strong> A critical outage occurred on a payment gateway used by many Nigerian businesses. After the service is restored, the SRE team conducts a blameless post-incident review. They gather all relevant data (logs, metrics, timelines, communication records) and collaboratively analyze the sequence of events, the underlying causes (which might involve a combination of software bugs, configuration errors, and insufficient monitoring), and the impact on users.</p>
<p>The outcome of the review is a detailed document outlining the root causes and a list of actionable items with owners and deadlines to prevent similar incidents in the future (for example, improving monitoring for a specific metric, implementing a new rollback strategy, fixing a configuration management issue).  </p>
<p><strong>Example:</strong> A minor incident occurred where a specific API endpoint became slow for a short period during peak hours in Lagos. Even though the impact was minimal, the SRE team still conducts a lightweight post-incident review.</p>
<p>They analyze the logs and metrics to understand why the slowdown happened (perhaps a temporary spike in database load) and identify potential preventative measures, such as optimizing the database query or adjusting resource limits.</p>
<p>The actionable item might be to create a new dashboard specifically for this API endpoint's performance, with a target completion date and assigned to a specific SRE (owner). Afterward, the team will follow up and ensure the dashboard is serving its purpose.</p>
<p>SREs acknowledge that systems will fail, and the goal is not to prevent all failures but to minimize their impact. SREs can achieve this through:</p>
<ol>
<li><p><strong>Monitoring</strong>: SREs implement real-time tracking of system health and performance, which allows them to detect issues early on.</p>
</li>
<li><p><strong>Logging</strong>: They use detailed records of system events for analysis, investigation, debugging, and troubleshooting, which is essential for understanding the root cause of failures.</p>
</li>
<li><p><strong>Alerting</strong>: SREs set up automated notifications when system metrics deviate from expected thresholds, enabling them to respond quickly to potential problems.</p>
</li>
<li><p>I<strong>ncident response</strong>: They establish structured and documented procedures for responding to and resolving incidents, ensuring a coordinated and efficient approach.</p>
</li>
<li><p><strong>Post-incident reviews</strong>: SREs conduct in-depth analysis of incidents to identify root causes and prevent recurrence, treating every incident as a learning opportunity. This is a crucial aspect of continuous improvement.</p>
</li>
</ol>
<h2 id="heading-the-sre-role-a-balancing-act">The SRE Role: A Balancing Act</h2>
<p>SREs face the challenge of balancing day-to-day operational needs with longer-term engineering initiatives. This "balancing act" is crucial for maintaining a system's stability and its ability to evolve and improve.</p>
<p>SREs typically spend their time in two key areas, each requiring a different skillset and focus:</p>
<h3 id="heading-operational-responsibilities-50"><strong>Operational Responsibilities (50%):</strong></h3>
<p>An SRE’s operational responsibilities are pretty wide-ranging. They typically involve responding to incidents and outages, which is a core part of any operations role. SREs are often on-call, meaning they are available to address urgent issues outside of regular work hours.</p>
<p>They also handle escalations, which means taking over complex or critical issues that other teams can't resolve.</p>
<p>SREs also provide support to internal and external customers, which can involve troubleshooting problems, answering questions, and providing guidance.</p>
<p>These responsibilities require strong problem-solving skills, quick thinking, and the ability to remain calm under pressure.</p>
<h3 id="heading-engineering-responsibilities-50"><strong>Engineering Responsibilities (50%):</strong></h3>
<p>Engineering responsibilities are what truly distinguish SREs. SREs are responsible for automating manual tasks, which is crucial for increasing efficiency and reducing errors.</p>
<p>They also develop monitoring and alerting systems, which involve designing and implementing tools to track system health and notify teams of potential problems.</p>
<p>SREs contribute to improving system reliability and performance by identifying and addressing bottlenecks, optimizing code, and implementing best practices.</p>
<p>They contribute to software development with a focus on operational concerns, which means they work with developers to ensure that applications are designed for scalability, maintainability, and resilience.</p>
<p>These responsibilities require strong programming skills, a deep understanding of system architecture, and a proactive approach to problem-solving.</p>
<h2 id="heading-why-automation-matters">Why Automation Matters</h2>
<p>Automation is an important tool that SREs use to achieve both their operational and engineering goals. It's not about replacing human engineers, but about empowering them to work more effectively.</p>
<p>There are several key areas where automation is really important:</p>
<ol>
<li><p><strong>Reducing toil</strong>: SREs use automation to eliminate repetitive, manual tasks, often referred to as "toil." This frees up their time to focus on more strategic work, such as improving system design and implementing new features.</p>
</li>
<li><p><strong>Improving efficiency</strong>: Automation can significantly speed up processes like deployments, rollbacks, and incident response, which leads to faster recovery times and reduced downtime.</p>
</li>
<li><p><strong>Enhancing reliability</strong>: By automating critical processes, SREs can reduce the risk of human error, which is a common cause of outages and other issues.</p>
</li>
<li><p><strong>Gaining deeper understanding</strong>: Every time an SRE automates a process, they gain a deeper understanding of the system, leading to further improvements or enhancements. This iterative process of automation and learning is central to the SRE approach.</p>
</li>
</ol>
<h2 id="heading-key-takeaways-for-anyone-involved-in-digital-services">Key Takeaways for Anyone Involved in Digital Services:</h2>
<ol>
<li><p><strong>Reliability is a feature:</strong> Treat reliability as a major requirement, not an option.</p>
</li>
<li><p><strong>Automation is essential:</strong> Embrace automation to reduce toil and improve efficiency.</p>
</li>
<li><p><strong>Make data-driven decisions:</strong> Use metrics to understand system behavior and in turn guide improvements.</p>
</li>
<li><p><strong>Collaboration is key:</strong> Foster close collaboration between development and operations teams.</p>
</li>
<li><p><strong>Focus on continuous improvement:</strong> Adopt a culture of continuous learning and improvement.</p>
</li>
</ol>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>You've now gained a foundational understanding of Site Reliability Engineering and its core principles centered around availability, automation, measurement, collaboration, and learning from failure. You’ve also learned how it plays a crucial role in ensuring the smooth operation of the digital services we rely on every day.</p>
<p>If you found this tutorial helpful and want to stay connected for more insights on Site Reliability Engineering, you can follow me on <a target="_blank" href="https://x.com/OmoladeEkpeni">Twitter</a>, connect on <a target="_blank" href="https://www.linkedin.com/in/omolade-ekpeni-b7b431188/">LinkedIn</a>, or reach out via email at omolade.ekp@gmail.com.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
