<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Kubernetes - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Kubernetes - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 18 May 2026 20:19:15 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/kubernetes/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Local DevOps HomeLab with Docker, Kubernetes, and Ansible ]]>
                </title>
                <description>
                    <![CDATA[ The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS. I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $ ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-local-devops-homelab-with-docker-kubernetes-and-ansible/</link>
                <guid isPermaLink="false">69dd667c217f5dfcbd55b7b4</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Homelab ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops articles ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 21:56:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/1e970f8b-eb52-4582-9c98-13cbce867c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS.</p>
<p>I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $34 bill for a machine running nothing.</p>
<p>That was the last time I practiced on someone else's infrastructure.</p>
<p>Everything in this guide runs on your laptop. No cloud account, no credit card, no bill at the end of the month. By the end, you'll be able to spin up a multi-server environment from scratch, configure it automatically with Ansible, serve a site you wrote yourself, and diagnose what breaks when you intentionally destroy it.</p>
<p>That last part is where the actual learning happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>A laptop with at least 8GB of RAM (16GB is better)</p>
</li>
<li><p>At least 20GB of free disk space</p>
</li>
<li><p>Windows, macOS, or Linux operating system</p>
</li>
<li><p>Administrator access to your computer</p>
</li>
<li><p>Virtualization enabled in your BIOS/UEFI settings</p>
</li>
<li><p>A stable internet connection for the initial downloads</p>
</li>
</ul>
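<p>If you're on Linux, a short script can check the RAM and disk items on this list. This is a minimal sketch, not part of the lab itself: <code>lab_ready</code> is a hypothetical helper, the 7000 MiB cutoff is an assumption that allows for the memory an "8GB" machine reserves for hardware, and the <code>/proc/meminfo</code> read only works on Linux.</p>
<pre><code class="language-bash"># Hypothetical helper: compare measured resources against this guide's minimums.
# $1 is total RAM in MiB, $2 is free disk space in GiB.
lab_ready() {
  local ram_mib="$1"
  local disk_gib="$2"
  # An "8GB" machine usually reports a little under 8192 MiB, so allow some slack.
  if [ "$ram_mib" -lt 7000 ]; then
    echo "not ready: less than 8GB of RAM"
    return 1
  fi
  if [ "$disk_gib" -lt 20 ]; then
    echo "not ready: less than 20GB of free disk space"
    return 1
  fi
  echo "ready"
}

# Linux only: read the real values for this machine and check them.
if [ -r /proc/meminfo ]; then
  ram_mib="$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)"
  disk_gib="$(df -Pk . | awk 'NR==2 {print int($4 / 1024 / 1024)}')"
  echo "RAM: ${ram_mib} MiB, free disk: ${disk_gib} GiB"
  lab_ready "$ram_mib" "$disk_gib" || echo "You can still try the lab, but expect slowdowns."
fi
</code></pre>
<p>On macOS or Windows, look up the two numbers in your system settings and pass them in by hand, for example <code>lab_ready 16000 100</code>.</p>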
<p>Knowledge and comfort level:</p>
<ul>
<li><p>You should be comfortable using a terminal (running commands, changing directories, and editing small text files with whatever editor you like).</p>
</li>
<li><p>Basic familiarity with concepts like “a server,” “SSH,” and “a port” helps, but you don't need prior experience with Docker, Kubernetes, Vagrant, or Ansible. This guide introduces them as you go.</p>
</li>
</ul>
<p>If you can follow step-by-step instructions and read error output without panicking, you're ready.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-is-devops">What is DevOps?</a></p>
</li>
<li><p><a href="#heading-why-build-a-local-lab">Why Build a Local Lab?</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-docker">How to Set Up Docker</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</a></p>
</li>
<li><p><a href="#heading-how-to-install-kubectl">How to Install kubectl</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-vagrant">How to Set Up Vagrant</a></p>
</li>
<li><p><a href="#heading-how-to-install-ansible">How to Install Ansible</a></p>
</li>
<li><p><a href="#heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</a></p>
</li>
<li><p><a href="#heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</a></p>
</li>
<li><p><a href="#heading-what-you-can-now-do">What You Can Now Do</a></p>
</li>
</ol>
<h2 id="heading-what-is-devops">What is DevOps?</h2>
<p>DevOps is the practice of breaking down the wall between software development and IT operations teams.</p>
<p>Traditionally, developers write code and hand it off to operations teams to deploy and maintain. That handoff causes delays, misunderstandings, and outages. DevOps is what happens when both teams work together from the start.</p>
<p>The tools you'll install in this guide each solve a specific part of that process:</p>
<ul>
<li><p><strong>Docker</strong> packages your application and everything it needs into a portable container that runs the same way on any machine.</p>
</li>
<li><p><strong>Kubernetes</strong> manages multiple containers at scale, handling restarts, networking, and load balancing automatically.</p>
</li>
<li><p><strong>Vagrant</strong> creates and manages virtual machine environments so your whole team always works on identical setups.</p>
</li>
<li><p><strong>Ansible</strong> automates repetitive configuration tasks across many servers without writing a script for each one.</p>
</li>
</ul>
<h2 id="heading-why-build-a-local-lab">Why Build a Local Lab?</h2>
<p>A local lab gives you a safe place to break things, fix them, and learn from that process without any cost or risk.</p>
<p>Here's what you get with a local setup:</p>
<ul>
<li><p><strong>Zero cost.</strong> No cloud bills, no surprise charges, and no credit card required.</p>
</li>
<li><p><strong>Works offline.</strong> Practice anywhere, even without internet after the initial setup.</p>
</li>
<li><p><strong>Full control.</strong> You manage every layer from the OS up to the application.</p>
</li>
<li><p><strong>Safe experimentation.</strong> Break things freely. Nothing here affects production.</p>
</li>
<li><p><strong>Fast feedback.</strong> No waiting for cloud resources to spin up. Everything runs on your machine.</p>
</li>
</ul>
<p>The tradeoff is resource limits. Your laptop's CPU and RAM are the ceiling. You can't simulate large-scale deployments, and some cloud-native services like AWS Lambda or S3 have no direct local equivalent. But for learning core DevOps workflows, none of that matters.</p>
<h2 id="heading-how-to-set-up-docker">How to Set Up Docker</h2>
<p>Docker is the foundation of this lab. Every other tool in this guide either runs inside Docker containers or works alongside them.</p>
<h3 id="heading-how-to-install-docker-on-windows">How to Install Docker on Windows</h3>
<p>First, enable virtualization in your BIOS:</p>
<ol>
<li><p>Restart your computer and enter BIOS/UEFI setup. The key is usually F2, F10, Del, or Esc during boot.</p>
</li>
<li><p>Find the virtualization setting. It's usually listed as Intel VT-x, AMD-V, SVM, or Virtualization Technology.</p>
</li>
<li><p>Enable it, save your changes, and exit.</p>
</li>
</ol>
<p>Then install Docker Desktop:</p>
<ol>
<li><p>Download Docker Desktop from <a href="https://www.docker.com/products/docker-desktop/">Docker's official website</a>.</p>
</li>
<li><p>Run the installer and follow the prompts.</p>
</li>
<li><p>Enable WSL 2 (Windows Subsystem for Linux) when asked.</p>
</li>
<li><p>Restart your computer.</p>
</li>
<li><p>Open Docker Desktop from the Start menu and wait for the whale icon in the taskbar to stop animating.</p>
</li>
</ol>
<p><strong>Troubleshooting:</strong> If Docker fails to start, run this in PowerShell as Administrator to verify virtualization is active:</p>
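<pre><code class="language-powershell">systeminfo | Select-String "Hyper-V Requirements" -Context 0,3
</code></pre>
<p>All four items should show "Yes". If they don't, revisit your BIOS settings. If the output says a hypervisor has been detected instead, virtualization is already active.</p>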
<h3 id="heading-how-to-install-docker-on-mac">How to Install Docker on Mac</h3>
<ol>
<li><p>Download Docker Desktop for Mac from <a href="https://www.docker.com/products/docker-desktop/">Docker's website</a>.</p>
</li>
<li><p>Open the downloaded <code>.dmg</code> file and drag Docker to your Applications folder.</p>
</li>
<li><p>Open Docker from Applications.</p>
</li>
<li><p>Enter your password when prompted.</p>
</li>
<li><p>Wait for the whale icon in the menu bar to stop animating.</p>
</li>
</ol>
<h3 id="heading-how-to-install-docker-on-linux">How to Install Docker on Linux</h3>
<p>These commands are for Ubuntu. Run them in order:</p>
<pre><code class="language-bash"># Update your package lists
sudo apt-get update

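# Install prerequisites
sudo apt-get install -y ca-certificates curl

# Add Docker's official GPG key to a dedicated keyring (apt-key is deprecated)
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the Docker repository for your CPU architecture and Ubuntu release
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release &amp;&amp; echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null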

# Update and install Docker
sudo apt-get update
sudo apt-get install docker-ce

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add your user to the docker group
sudo usermod -aG docker $USER
</code></pre>
<p>Log out and back in for the group change to take effect.</p>
<h3 id="heading-how-to-test-docker">How to Test Docker</h3>
<p>Run this command:</p>
<pre><code class="language-bash">docker run hello-world
</code></pre>
<p>If you see "Hello from Docker!" then Docker is working correctly.</p>
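<p>To go one step past <code>hello-world</code>, you can wrap two more checks in a small function. This is a sketch under assumptions: <code>docker_smoke_test</code> is a hypothetical name, and the <code>alpine</code> image is just a small public image chosen for the example.</p>
<pre><code class="language-bash"># Hypothetical helper: confirm the daemon answers and a container can run a command.
docker_smoke_test() {
  # Prints client and server versions; fails if the Docker daemon is not reachable.
  docker version || return 1
  # Run a throwaway container and remove it as soon as it exits.
  docker run --rm alpine echo "container ok"
}
</code></pre>
<p>Paste the function into your terminal and run <code>docker_smoke_test</code>. If the last line of output is <code>container ok</code>, Docker can pull images and run containers.</p>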
<p>Docker is set up. Next, you'll install Kubernetes to manage containers at scale.</p>
<h2 id="heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</h2>
<p>Kubernetes manages containers at scale. For a local lab, you have four options. Here's how to choose:</p>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Best for</th>
<th>RAM needed</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Minikube</strong></td>
<td>Beginners. Easiest setup, built-in dashboard</td>
<td>2GB+</td>
</tr>
<tr>
<td><strong>Kind</strong></td>
<td>Faster startup, works well inside CI pipelines</td>
<td>1GB+</td>
</tr>
<tr>
<td><strong>k3s</strong></td>
<td>Low-resource machines. Lightweight but production-like</td>
<td>512MB+</td>
</tr>
<tr>
<td><strong>kubeadm</strong></td>
<td>Learning how clusters are actually bootstrapped in production</td>
<td>2GB+ per node</td>
</tr>
</tbody></table>
<p>If you're just starting out, use Minikube. It has the simplest setup and a visual dashboard that helps you understand what's happening inside the cluster.</p>
<p>If your laptop has 8GB RAM or less, use k3s. It runs lean and behaves closer to a real cluster than Minikube does.</p>
<p>Use kubeadm only if you want to understand how Kubernetes nodes join a cluster — it requires more manual steps and isn't beginner-friendly.</p>
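<p>The advice above fits in a few lines of shell. This is only a sketch of the decision: <code>pick_k8s_tool</code> is a hypothetical function, and the goal names <code>bootstrap</code> and <code>ci</code> are made up for the example.</p>
<pre><code class="language-bash"># Hypothetical helper that encodes the guidance from the table.
# $1: laptop RAM in GB. $2: optional goal ("bootstrap" or "ci").
pick_k8s_tool() {
  local ram_gb="$1"
  local goal="${2:-learn}"
  if [ "$goal" = "bootstrap" ]; then
    echo "kubeadm"      # you want to see how nodes join a cluster
  elif [ "$goal" = "ci" ]; then
    echo "kind"         # fast startup, several clusters at once
  elif [ "$ram_gb" -le 8 ]; then
    echo "k3s"          # low-RAM machine
  else
    echo "minikube"     # default for beginners
  fi
}

pick_k8s_tool 16            # prints: minikube
pick_k8s_tool 8             # prints: k3s
pick_k8s_tool 16 bootstrap  # prints: kubeadm
</code></pre>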
<h3 id="heading-how-to-install-minikube-recommended-for-beginners">How to Install Minikube (Recommended for Beginners)</h3>
<p>Minikube creates a single-node Kubernetes cluster on your laptop.</p>
<p>On Windows:</p>
<ol>
<li><p>Download the Minikube installer from <a href="https://github.com/kubernetes/minikube/releases">Minikube's GitHub releases page</a>.</p>
</li>
<li><p>Run the <code>.exe</code> installer.</p>
</li>
<li><p>Open Command Prompt as Administrator and start Minikube:</p>
</li>
</ol>
<pre><code class="language-cmd">minikube start --driver=docker
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install minikube
minikube start --driver=docker
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube-linux-amd64
sudo mv minikube-linux-amd64 /usr/local/bin/minikube
minikube start --driver=docker
</code></pre>
<p>Test your cluster:</p>
<pre><code class="language-bash">minikube status
minikube dashboard
</code></pre>
<h3 id="heading-how-to-install-k3s-recommended-for-low-ram-machines">How to Install k3s (Recommended for Low-RAM Machines)</h3>
<p>k3s is a lightweight version of Kubernetes that installs in under a minute. It runs lean and behaves like a real cluster — not a simplified demo version.</p>
<p>On Linux (and Mac via Multipass):</p>
<pre><code class="language-bash">curl -sfL https://get.k3s.io | sh -
</code></pre>
<p>That single command installs k3s and runs it automatically in the background. Check that it is running:</p>
<pre><code class="language-bash">sudo k3s kubectl get nodes
</code></pre>
<p>You should see one node with status <code>Ready</code>.</p>
<p>On Mac: k3s doesn't run natively on macOS. Use <a href="https://multipass.run">Multipass</a> to spin up a lightweight Ubuntu VM first, then run the install command inside it.</p>
<p>On Windows: use WSL2 (Ubuntu), then run the install command inside your WSL2 terminal.</p>
<h3 id="heading-how-to-install-kind-kubernetes-in-docker">How to Install Kind (Kubernetes IN Docker)</h3>
<p>Kind runs a full Kubernetes cluster inside Docker containers. It starts faster than Minikube and is useful if you want to run multiple clusters simultaneously.</p>
<pre><code class="language-bash"># Mac or Linux
brew install kind

# Windows
choco install kind
</code></pre>
<p>Create a cluster:</p>
<pre><code class="language-bash">kind create cluster --name my-local-lab
</code></pre>
<h3 id="heading-how-to-install-kubeadm-for-understanding-cluster-bootstrap">How to Install kubeadm (For Understanding Cluster Bootstrap)</h3>
<p>kubeadm is the tool Kubernetes uses to initialize and join nodes in a real cluster. Use this when you want to understand what happens under the hood — not as your daily driver.</p>
<p>The setup is more involved than the options above, and you need at least two machines (or VMs) if you want to practice joining a worker node. Follow the <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/">official kubeadm installation guide</a> for your OS, then initialize your cluster on the control-plane machine:</p>
<pre><code class="language-bash">sudo kubeadm init --pod-network-cidr=10.244.0.0/16
</code></pre>
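<p>After init, install a pod network add-on (the <code>10.244.0.0/16</code> range above matches Flannel's default), because nodes stay <code>NotReady</code> until one is running. Then join worker nodes using the command kubeadm prints at the end of the output.</p>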
<h3 id="heading-how-to-install-kubectl">How to Install kubectl</h3>
<p>kubectl is the command-line tool you use to interact with any Kubernetes cluster.</p>
<p>On Windows:</p>
<p>Download <code>kubectl.exe</code> from <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/">Kubernetes' website</a> and place it in a directory that is in your PATH. Or install with Chocolatey:</p>
<pre><code class="language-cmd">choco install kubernetes-cli
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install kubectl
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl
</code></pre>
<p>Test it:</p>
<pre><code class="language-bash">kubectl get pods --all-namespaces
</code></pre>
<p>On a fresh cluster, you'll see system pods running in the <code>kube-system</code> namespace — things like <code>coredns</code> and <code>storage-provisioner</code>. That's the expected output. It means your cluster is up and kubectl can talk to it.</p>
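<p>Listing system pods proves kubectl can read from the cluster. To check that it can also schedule work, you could deploy something small and remove it again. A minimal sketch, assuming a hypothetical deployment name <code>hello-lab</code> and the public <code>nginx</code> image:</p>
<pre><code class="language-bash"># Hypothetical helper: create a deployment, wait for it, show it, delete it.
k8s_smoke_test() {
  kubectl create deployment hello-lab --image=nginx || return 1
  # Wait until the pod is running, or give up after two minutes.
  kubectl rollout status deployment/hello-lab --timeout=120s
  # "kubectl create deployment" labels its pods with app=NAME.
  kubectl get pods -l app=hello-lab
  # Clean up so the cluster is back to its fresh state.
  kubectl delete deployment hello-lab
}
</code></pre>
<p>Run <code>k8s_smoke_test</code> after pasting the function. On k3s, use <code>sudo k3s kubectl</code> in place of <code>kubectl</code> unless you have copied the k3s kubeconfig to your user account.</p>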
<p>Kubernetes is running. Next is Vagrant. But before that, there's one important distinction worth making.</p>
<h4 id="heading-docker-vs-vagrant-they-arent-the-same-thing">Docker vs Vagrant — they aren't the same thing</h4>
<p>Docker creates containers: lightweight processes that share your operating system's kernel. Vagrant creates full virtual machines: isolated computers with their own OS running inside your laptop.</p>
<p>Containers are fast and small. VMs are heavier but behave exactly like real servers. You'll use both in this lab for different reasons.</p>
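<p>You can see the difference by asking each environment for its kernel version. This is a sketch, not a required step: <code>compare_kernels</code> is a hypothetical helper, <code>alpine</code> is an arbitrary small image, and the last line only works from a directory with a running Vagrant VM, which you'll create in the next section.</p>
<pre><code class="language-bash"># Hypothetical helper: print the kernel version seen by the host, a container, and a VM.
compare_kernels() {
  echo "host:      $(uname -r)"
  # A container has no kernel of its own; it reports the one it shares.
  echo "container: $(docker run --rm alpine uname -r)"
  # A Vagrant VM boots its own kernel, independent of the host.
  echo "vm:        $(vagrant ssh -c 'uname -r')"
}
</code></pre>
<p>On a Linux host, the first two lines should match. On Windows and macOS, Docker Desktop runs containers inside a small Linux VM, so the container line shows that VM's kernel instead of your host's.</p>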
<h2 id="heading-how-to-set-up-vagrant">How to Set Up Vagrant</h2>
<p>Vagrant lets you create and manage reproducible virtual machine environments. It is ideal for simulating multi-server setups on a single laptop.</p>
<h3 id="heading-how-to-install-vagrant-on-windows">How to Install Vagrant on Windows</h3>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a> with default options.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
<li><p>Restart your computer if prompted.</p>
</li>
</ol>
<p><strong>Note:</strong> VirtualBox and Hyper-V can't run at the same time on Windows. Check if Hyper-V is active:</p>
<pre><code class="language-cmd">systeminfo | findstr "Hyper-V"
</code></pre>
<p>If it's enabled, you have two options: switch to the Hyper-V Vagrant provider, or disable Hyper-V with:</p>
<pre><code class="language-powershell">Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All
</code></pre>
<p>Restart after disabling.</p>
<h3 id="heading-how-to-install-vagrant-on-mac-and-linux">How to Install Vagrant on Mac and Linux</h3>
<p>On Mac:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>After installation, open <strong>System Preferences &gt; Security &amp; Privacy &gt; General</strong>. You will see a message saying system software from Oracle was blocked. Click <strong>Allow</strong> and restart your Mac. Without this step, VirtualBox will not run.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p><strong>Note for Apple Silicon (M1/M2/M3) Macs:</strong> VirtualBox support on Apple Silicon is still limited. If you're on an M-series Mac, use <a href="https://mac.getutm.app/">UTM</a> as your VM provider instead, or use Multipass which works natively on Apple Silicon.</p>
<p>On Linux:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p>Verify both are installed:</p>
<pre><code class="language-bash">vboxmanage --version
vagrant --version
</code></pre>
<h3 id="heading-how-to-create-your-first-vagrant-environment">How to Create Your First Vagrant Environment</h3>
<p>Create a new directory for your project. Inside it, create a file named <code>Vagrantfile</code> with this content:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  # Create a private network between VMs
  config.vm.network "private_network", type: "dhcp"

  # Forward port 8080 on your laptop to port 80 on the VM
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Install Nginx when the VM starts
  config.vm.provision "shell", inline: &lt;&lt;-SHELL
    apt-get update
    apt-get install -y nginx
    echo "Hello from Vagrant!" &gt; /var/www/html/index.html
  SHELL
end
</code></pre>
<p>Start the VM:</p>
<pre><code class="language-bash">vagrant up
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/342f11ad-7c7d-40d2-a810-113b8c71edac.png" alt="screenshot showing VB server and terminal installation processes" style="display:block;margin:0 auto" width="1848" height="323" loading="lazy">

<p>Visit <code>http://localhost:8080</code> in your browser. You should see "Hello from Vagrant!"</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/bcd66a76-4a5b-4f26-bb7e-e203672968d8.png" alt="screenshot showing &quot;Hello from Vagrant!&quot; in browser" style="display:block;margin:0 auto" width="643" height="483" loading="lazy">
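<p>If you prefer the terminal to the browser, the same check can be scripted. A minimal sketch, with <code>check_vagrant_vm</code> as a hypothetical helper name; run it from the directory that contains your <code>Vagrantfile</code>:</p>
<pre><code class="language-bash"># Hypothetical helper: confirm the VM is up and Nginx answers on the forwarded port.
check_vagrant_vm() {
  # Show the VM state; it should say "running".
  vagrant status
  # Ask systemd inside the VM whether Nginx is active.
  vagrant ssh -c "systemctl is-active nginx"
  # Fetch the page through the forwarded port, the same way the browser does.
  curl -s http://localhost:8080
}
</code></pre>
<p>When you're done, <code>vagrant halt</code> stops the VM and <code>vagrant destroy -f</code> deletes it. Run <code>vagrant up</code> to bring it back.</p>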

<h4 id="heading-troubleshooting-ssh-on-windows">Troubleshooting SSH on Windows</h4>
<p>If <code>vagrant ssh</code> fails, try:</p>
<pre><code class="language-bash">vagrant ssh -- -v
</code></pre>
<p>Or connect manually:</p>
<pre><code class="language-bash">ssh -i .vagrant/machines/default/virtualbox/private_key vagrant@127.0.0.1 -p 2222
</code></pre>
<h3 id="heading-how-to-create-a-local-vagrant-box-without-internet">How to Create a Local Vagrant Box Without Internet</h3>
<p><strong>Note:</strong> Most readers can skip this. Only do this if you want to work fully offline after the initial setup.</p>
<ol>
<li><p>Download <a href="https://ubuntu.com/download/server">Ubuntu 20.04 LTS</a> and save the <code>.iso</code> file locally.</p>
</li>
<li><p>Open VirtualBox and create a new VM: Name it <code>ubuntu-devops</code>, Type: Linux, Version: Ubuntu (64-bit).</p>
</li>
<li><p>Assign 2048MB RAM and a 20GB VDI disk.</p>
</li>
<li><p>Attach the <code>.iso</code> under Storage &gt; Optical Drive.</p>
</li>
<li><p>Start the VM and complete the Ubuntu installation.</p>
</li>
<li><p>Once installed, shut down the VM and run:</p>
</li>
</ol>
<pre><code class="language-bash">VBoxManage list vms
vagrant package --base "ubuntu-devops" --output ubuntu2004.box
vagrant box add ubuntu2004 ubuntu2004.box
</code></pre>
<p>You now have a reusable local box that works without internet.</p>
<p>You can spin up virtual machines. Next is Ansible, which automates what goes inside them.</p>
<h2 id="heading-how-to-install-ansible">How to Install Ansible</h2>
<p>Ansible automates configuration and software installation across multiple servers. Instead of SSH-ing into ten machines and running the same commands manually, you write a playbook once and Ansible handles the rest.</p>
<h3 id="heading-how-to-install-ansible-on-windows">How to Install Ansible on Windows</h3>
<p>Ansible doesn't run natively on Windows. You need to use it through WSL (Windows Subsystem for Linux).</p>
<ol>
<li>Open PowerShell as Administrator and enable WSL:</li>
</ol>
<pre><code class="language-powershell">dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
</code></pre>
<ol start="2">
<li><p>Restart your computer.</p>
</li>
<li><p>Install Ubuntu from the Microsoft Store.</p>
</li>
<li><p>Open Ubuntu and install Ansible:</p>
</li>
</ol>
<pre><code class="language-bash">sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-mac">How to Install Ansible on Mac</h3>
<pre><code class="language-bash">brew install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-linux">How to Install Ansible on Linux</h3>
<pre><code class="language-bash"># Ubuntu/Debian
sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

# Red Hat/CentOS (if the package isn't found, enable the EPEL repository first)
sudo yum install ansible
</code></pre>
<h3 id="heading-how-to-test-ansible">How to Test Ansible</h3>
<p>Create a file called <code>hosts</code> in your current directory:</p>
<pre><code class="language-ini">[local]
localhost ansible_connection=local
</code></pre>
<p>Create a file called <code>playbook.yml</code> in the same directory:</p>
<pre><code class="language-yaml">---
- name: Test playbook
  hosts: local
  tasks:
    - name: Print a message
      debug:
        msg: "Ansible is working!"
</code></pre>
<p>Run the playbook, passing the local <code>hosts</code> file with <code>-i</code>:</p>
<pre><code class="language-bash">ansible-playbook -i hosts playbook.yml
</code></pre>
<p>You should see the message "Ansible is working!" in the output.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/081e6ff3-b983-42a0-960e-5340bbd24e3b.png" alt="screenshot showing ansible playbook complete terminal installation" style="display:block;margin:0 auto" width="849" height="287" loading="lazy">
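<p>Playbooks aren't the only way to use Ansible. For quick checks, you can run a single module from the command line. The sketch below reuses the <code>hosts</code> file from this test; <code>ansible_local_check</code> is a hypothetical helper name.</p>
<pre><code class="language-bash"># Hypothetical helper: run two ad-hoc Ansible commands against the "local" group.
ansible_local_check() {
  # The ping module confirms Ansible can connect and run Python on the target.
  ansible -i hosts local -m ping || return 1
  # The setup module gathers facts; the filter keeps only the OS distribution facts.
  ansible -i hosts local -m setup -a "filter=ansible_distribution*"
}
</code></pre>
<p>A successful ping reports <code>"ping": "pong"</code>. The second command shows what Ansible knows about your operating system, which playbooks can use to make decisions.</p>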

<p>Alright, all your tools are installed. Now you'll use them together to build something real.</p>
<h2 id="heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</h2>
<p>You can find the entire code for this lab in this repo: <a href="https://github.com/Osomudeya/homelab-demo-article">https://github.com/Osomudeya/homelab-demo-article</a></p>
<p>Now you'll put these tools together in one project. Each tool will perform its actual job, and nothing is forced.</p>
<p><strong>Before you start,</strong> create a fresh directory for this project. Don't run it inside the directory you used to test Vagrant earlier, as the Vagrantfile here is different and will conflict.</p>
<p>You'll be building a two-VM environment: one machine serves a web page you write yourself inside a Docker container, and the other runs a MariaDB database. Vagrant creates the machines and Ansible configures them. The page you see at the end is yours.</p>
<h3 id="heading-step-1-create-the-project-directory">Step 1: Create the Project Directory</h3>
<pre><code class="language-bash">mkdir devops-lab-project &amp;&amp; cd devops-lab-project
</code></pre>
<h3 id="heading-step-2-write-your-site-content">Step 2: Write Your Site Content</h3>
<p>Create a file called <code>index.html</code> in the project directory. Write whatever you want on this page — it's what you'll see in your browser at the end:</p>
<pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;&lt;title&gt;My DevOps Lab&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;My DevOps Lab&lt;/h1&gt;
    &lt;p&gt;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&lt;/p&gt;
    &lt;p&gt;Built on a laptop. No cloud account needed.&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>Change the text to whatever you like. This is your page.</p>
<h3 id="heading-step-3-write-the-vagrantfile">Step 3: Write the Vagrantfile</h3>
<p>Create a file called <code>Vagrantfile</code> in the same directory:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.33.10"
    web.vm.network "forwarded_port", guest: 80, host: 8080
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.33.11"
  end
end
</code></pre>
<h3 id="heading-step-4-start-the-virtual-machines">Step 4: Start the Virtual Machines</h3>
<pre><code class="language-bash">vagrant up
</code></pre>
<p>The first run downloads the <code>ubuntu/focal64</code> box, which is around 500MB.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/264866b0-9977-490e-96a3-69b3070be589.png" alt="screenshot showing virtualbox installation processes in terminal" style="display:block;margin:0 auto" width="867" height="377" loading="lazy">

<p>Expect this to take 10–30 minutes depending on your connection. Subsequent runs will be much faster since the box is cached locally.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/118d2fb2-70f6-41e8-afb2-6f45fb895e98.png" alt="screenshot showing 2 virtualbox servers &quot;running&quot; in VB manager" style="display:block;margin:0 auto" width="926" height="396" loading="lazy">
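<p>Before moving on, it's worth confirming that the two machines can reach each other over the private network. A minimal sketch, assuming the IP addresses from the Vagrantfile above; <code>check_lab_vms</code> is a hypothetical helper name:</p>
<pre><code class="language-bash"># Hypothetical helper: check VM state and connectivity in both directions.
check_lab_vms() {
  # Both "web" and "db" should be listed as running.
  vagrant status
  # From the web VM, reach the db VM over the private network.
  vagrant ssh web -c "ping -c 2 192.168.33.11"
  # And the other direction.
  vagrant ssh db -c "ping -c 2 192.168.33.10"
}
</code></pre>
<p>If either ping fails, check the host-only network settings in VirtualBox before you start on the Ansible steps, since Ansible connects to the same addresses.</p>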

<h3 id="heading-step-5-create-the-ansible-inventory">Step 5: Create the Ansible Inventory</h3>
<p>Create a file called <code>inventory</code> in the same directory:</p>
<pre><code class="language-ini">[webservers]
192.168.33.10 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

[dbservers]
192.168.33.11 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/db/virtualbox/private_key
</code></pre>
<p>Ansible uses the Vagrant-generated private keys so it can SSH in as the <code>vagrant</code> user. Host key checking for this lab is turned off in <code>ansible.cfg</code> (next step), not in the inventory.</p>
<h3 id="heading-step-6-create-the-ansible-config-file">Step 6: Create the Ansible Config File</h3>
<p>Before running the playbook, create a file called <code>ansible.cfg</code> in the same directory:</p>
<pre><code class="language-ini">[defaults]
inventory = inventory
host_key_checking = False
</code></pre>
<p>The <code>inventory</code> line tells Ansible to use the inventory file in this folder by default. <code>host_key_checking = False</code> tells Ansible not to verify SSH host keys when connecting to your Vagrant VMs. Without it, Ansible will fail with a <code>Host key verification failed</code> error on first connection because the VM's key is not yet in your <code>known_hosts</code> file.</p>
<p>These settings are for a local lab only. Do not use <code>host_key_checking = False</code> for production systems.</p>
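<p>To confirm that Ansible picks up both files, you can ask it what it sees. This is a sketch with a hypothetical helper name, <code>check_ansible_setup</code>; run it from the project directory while both VMs are up.</p>
<pre><code class="language-bash"># Hypothetical helper: show the active config, the parsed inventory, and test SSH.
check_ansible_setup() {
  # List settings that differ from Ansible's defaults; host_key_checking should appear.
  ansible-config dump --only-changed
  # Show how Ansible parsed the inventory into groups.
  ansible-inventory --graph
  # Try an SSH connection to every host in the inventory.
  ansible all -m ping
}
</code></pre>
<p>The graph should list <code>webservers</code> and <code>dbservers</code> with one address each, and both hosts should answer the ping.</p>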
<h3 id="heading-step-7-create-the-ansible-playbook">Step 7: Create the Ansible Playbook</h3>
<p>Create a file called <code>playbook.yml</code>:</p>
<pre><code class="language-yaml">---
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:

    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes

    - name: Start Docker service
      service:
        name: docker
        state: started
        enabled: yes

    # Create the directory that will hold your site content
    - name: Create web content directory
      file:
        path: /var/www/html
        state: directory
        mode: '0755'

    # This copies your index.html from your laptop into the VM
    - name: Copy site content to web server
      copy:
        src: index.html
        dest: /var/www/html/index.html

    # This mounts that file into the Nginx container so it serves your page
    # The -v flag connects /var/www/html on the VM to /usr/share/nginx/html inside the container
    - name: Run Nginx serving your content
      shell: |
        docker rm -f webapp 2&gt;/dev/null || true
        docker run -d --name webapp --restart always -p 80:80 \
          -v /var/www/html:/usr/share/nginx/html:ro nginx

- name: Configure database server
  hosts: dbservers
  become: yes
  tasks:

    # Hash sum mismatch on .deb downloads is often stale lists, a flaky mirror, or apt pipelining
    # behind NAT; fresh indices + Pipeline-Depth 0 usually fixes it on lab VMs.
    - name: Disable apt HTTP pipelining (mirror/proxy hash mismatch workaround)
      copy:
        dest: /etc/apt/apt.conf.d/99disable-pipelining
        content: 'Acquire::http::Pipeline-Depth "0";'
        mode: "0644"

    - name: Clear apt package index cache
      shell: apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/* /var/lib/apt/lists/auxfiles/*
      changed_when: true

    - name: Update apt cache after reset
      apt:
        update_cache: yes

    - name: Install MariaDB
      apt:
        name: mariadb-server
        state: present
        update_cache: no

    - name: Start MariaDB service
      service:
        name: mariadb
        state: started
        enabled: yes
</code></pre>
<p>Two lines worth paying attention to:</p>
<ul>
<li><p><code>src: index.html</code> — Ansible looks for this file in the same directory as the playbook. That is the file you wrote in Step 2.</p>
</li>
<li><p><code>-v /var/www/html:/usr/share/nginx/html:ro</code> — this mounts the directory from the VM into the Nginx container. The <code>:ro</code> means read-only. Nginx serves whatever is in that folder.</p>
</li>
</ul>
<h3 id="heading-step-8-run-the-playbook">Step 8: Run the Playbook</h3>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>You'll see task-by-task output as Ansible connects to each VM over SSH and configures it. A green <code>ok</code> or yellow <code>changed</code> next to each task means it worked. Red <code>fatal</code> means something failed.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/91241b41-981c-4e23-9dc4-8531e551c39e.png" alt="terminal screenshot of A green ok or yellow changed next to each task means it worked. Red fatal means something failed." style="display:block;margin:0 auto" width="875" height="267" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c02db252-8aff-42e5-b937-d812d070a75b.png" alt="terminal screenshot of playbook run completion" style="display:block;margin:0 auto" width="867" height="425" loading="lazy">

<h3 id="heading-step-9-verify-the-setup">Step 9: Verify the Setup</h3>
<p>Open <code>http://localhost:8080</code> in your browser. You should see the page you wrote in Step 2 served from inside a Docker container, running on a Vagrant VM, configured automatically by Ansible.</p>
<p>If you see the page, every tool in this lab is working together.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0d3d897b-3f51-46fb-b548-832cc5ec3272.png" alt="Browser showing localhost:8082 with the heading &quot;My DevOps Lab&quot; and the text &quot;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&quot;" style="display:block;margin:0 auto" width="746" height="418" loading="lazy">

<h3 id="heading-step-9-clean-up-optional">Step 10: Clean Up (Optional)</h3>
<p>When you're done:</p>
<pre><code class="language-bash">vagrant destroy -f
</code></pre>
<p>This shuts down and deletes both VMs. Your <code>Vagrantfile</code>, <code>inventory</code>, <code>playbook.yml</code>, and <code>index.html</code> stay on disk — run <code>vagrant up</code> followed by <code>ansible-playbook -i inventory playbook.yml</code> any time to bring it all back.</p>
<p>Now that you have a working lab, let's use it properly.</p>
<h2 id="heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</h2>
<p>Following these steps has gotten you a running lab. Breaking things teaches you how everything actually works.</p>
<p>Here are five things to break and what to look for when you do.</p>
<h3 id="heading-break-1-crash-the-main-process-inside-the-container-and-watch-it-come-back">Break 1: Crash the Main Process Inside the Container (and Watch It Come Back)</h3>
<p>This break shows three things: a process inside the container can die (as it would from a real bug or an out-of-memory kill), Docker restarts the container because of <code>--restart always</code>, and your site comes back without re-running Ansible.</p>
<p>After <code>vagrant ssh web</code>, every <code>docker</code> command below runs <strong>on the web VM</strong>. So keep your browser on your laptop at <a href="http://localhost:8080"><code>http://localhost:8080</code></a> (Vagrant forwards your host port to the VM’s port 80).</p>
<h4 id="heading-troubleshooting-if-your-lab-isnt-ready">Troubleshooting: If Your Lab Isn't Ready</h4>
<p>Match your symptom to one of the cases below. Commands run from your project folder on the host (your laptop) unless the step says to run them on the VM:</p>
<ul>
<li><p>You ran <code>vagrant destroy -f</code>. Run <code>vagrant up</code>, then <code>ansible-playbook -i inventory playbook.yml</code>.</p>
</li>
<li><p><code>docker ps</code> shows <code>webapp</code> but status is Exited. On the web VM, run <code>sudo docker start webapp</code>, then <code>sudo docker ps</code> again.</p>
</li>
<li><p>There's no <code>webapp</code> row in <code>docker ps -a</code>. Re-run <code>ansible-playbook -i inventory playbook.yml</code> on the host.</p>
</li>
</ul>
<p>If the playbook is already applied and <code>webapp</code> is Up, skip this section and start at step 1 under Steps (happy path) below. (Don't skip SSH or <code>docker ps</code>. You need the VM shell and a quick check before you run <code>docker exec</code>.)</p>
<h4 id="heading-steps-happy-path">Steps (happy path)</h4>
<ol>
<li>SSH into the web VM:</li>
</ol>
<pre><code class="language-plaintext">vagrant ssh web
</code></pre>
<ol start="2">
<li><p>Confirm <code>webapp</code> is <strong>Up</strong>:</p>
<pre><code class="language-plaintext">sudo docker ps
</code></pre>
</li>
<li><p><strong>Break it on purpose:</strong> kill the container’s main process <strong>from inside</strong> (PID 1). That ends the container the same way a crashing app would, not the same as <code>docker stop</code> on the host:</p>
</li>
</ol>
<pre><code class="language-bash">sudo docker exec webapp sh -c 'sleep 5 &amp;&amp; kill 1'
</code></pre>
<p>The <code>sleep 5</code> gives you a moment to switch to the browser. Right after you run the command, open or refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a>. You may catch a brief error or blank page while nothing is listening on port 80.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/3ac89703-63f3-45d8-954f-35adbd2c7dec.png" alt="Browser showing ERR_CONNECTION_RESET on localhost:8082 after the Nginx container process was killed" style="display:block;margin:0 auto" width="1242" height="1057" loading="lazy">

<ol start="4">
<li>Watch Docker restart the container:</li>
</ol>
<pre><code class="language-bash">watch sudo docker ps -a
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5c61d90d-61d6-4023-b3f5-e3eb427e8492.png" alt="Terminal running watch docker ps showing webapp container status as Up 10 seconds after automatic restart" style="display:block;margin:0 auto" width="1011" height="393" loading="lazy">

<p>Within a few seconds you should see <strong>Exited (137)</strong> become <strong>Up</strong> again. (Press Ctrl+C to exit <code>watch</code>.)</p>
<p>5. Refresh the browser. You should see the same HTML as before, because the files live on the VM under <code>/var/www/html</code> and are bind-mounted into the container; restarting only replaced the Nginx process, not those files.</p>
<h4 id="heading-why-not-docker-stop-or-docker-kill-on-the-host-for-this-demo">Why not <code>docker stop</code> or <code>docker kill</code> on the host for this demo?</h4>
<p>Those commands go through Docker’s API. Docker generally records them as you deliberately stopping the container (<code>hasBeenManuallyStopped</code>), and the restart policy is then ignored until you start the container again yourself (for example with <code>docker start</code>) or the Docker daemon restarts.</p>
<p>Killing PID 1 from inside the container is treated more like an internal crash, so the restart policy you set in the playbook is the one you actually get to observe here.</p>
<p><strong>Kubernetes analogy:</strong> A pod whose containers exit can be restarted by the kubelet; a bare pod you delete (one with no Deployment or other controller behind it) does not come back by itself.</p>
<p><strong>What to observe (three separate checks):</strong></p>
<ol>
<li><p><strong>Exit code:</strong> After <code>kill 1</code>, <code>docker ps -a</code> should show the container exited with code 137, meaning the main process was killed by a signal. That confirms the container really died, not that you ran <code>docker stop</code> on the host.</p>
</li>
<li><p><strong>Restart delay vs browser:</strong> Watch how many seconds pass between Exited and Up in <code>docker ps -a</code>; that interval is Docker applying <code>--restart always</code>. That's separate from what you see in the browser: the browser only shows whether something is accepting connections on port 80 on the VM, so it may show an error or blank page during the gap even while Docker is about to restart the container.</p>
</li>
<li><p><strong>Content after recovery:</strong> After status is Up again, refresh the page. You should see the same HTML as before. That shows your content lives on the VM disk (mounted into the container with <code>-v</code>), not inside a file that vanishes when the container process restarts. The process was replaced, not your <code>index.html</code> on the host path.</p>
</li>
</ol>
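<p>The exit code in the first check can be decoded by hand. As a minimal sketch (the function name <code>explain_exit_code</code> is made up for this illustration and is not part of Docker), exit codes above 128 conventionally mean "128 plus the signal number", so 137 corresponds to signal 9 (SIGKILL). A process that handles its signal and shuts down cleanly may report a different code, so treat the number as a hint rather than proof:</p>
<pre><code class="language-bash"># Hypothetical helper, not a Docker command: explain a container exit code.
explain_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    # By shell convention, 128 + N means "terminated by signal N".
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "exited with error $code"
  fi
}

explain_exit_code 137   # prints: killed by signal 9
</code></pre>
<p>On the web VM, you could read the real value with <code>sudo docker inspect --format '{{.State.ExitCode}}' webapp</code> and pass it to the function.</p>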
<h3 id="heading-break-2-cause-a-container-name-conflict">Break 2: Cause a Container Name Conflict</h3>
<p>On a single Docker daemon (here, on your web VM), a container name is a <strong>unique label</strong>. Two containers can't share a name, whether they are running or stopped. Scripts and playbooks that always use <code>docker run --name webapp</code> without cleaning up first hit this error constantly, and recognizing it saves time in real work.</p>
<p><strong>Before you start:</strong> Ansible already created one container named <code>webapp</code>.<br>Stay on the web VM (for example still inside <code>vagrant ssh web</code>) so the commands below run where that container lives.</p>
<p>So now, try to start a second container and also call it <code>webapp</code>. The image is plain <code>nginx</code> here on purpose – the point is the <strong>name clash</strong>, not matching your site’s ports or volume mounts.</p>
<pre><code class="language-plaintext">sudo docker run -d --name webapp nginx
</code></pre>
<p>Docker <strong>doesn't</strong> create a second container. It returns an error immediately, and your original <code>webapp</code> is unchanged.</p>
<p>This is because the name <code>webapp</code> is already registered to the existing container (the error shows that container’s ID). Docker refuses to reuse the name until the old container is removed or renamed.</p>
<p>Example error (your ID will differ):</p>
<pre><code class="language-plaintext">docker: Error response from daemon: Conflict. The container name "/webapp" is already in use by container "2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/1fd42c16-c28e-4539-9290-3583206eb8ff.png" alt="container name conflict terminal error screenshot" style="display:block;margin:0 auto" width="914" height="252" loading="lazy">

<p>To fix it, free the name, then create <code>webapp</code> again the same way the playbook does (publish port 80, mount your HTML, restart policy):</p>
<pre><code class="language-plaintext">sudo docker rm -f webapp
sudo docker run -d --name webapp --restart always -p 80:80 \
  -v /var/www/html:/usr/share/nginx/html:ro nginx
</code></pre>
<p>After that, your site should behave as before (refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a> from your laptop).</p>
<h4 id="heading-what-to-observe">What to observe:</h4>
<p>Read Docker’s Conflict message end to end. You should see that the name <code>/webapp</code> is already in use, along with the ID of the container that holds it. In production, that pattern means something has already claimed the name: remove it, rename it, or pick a different name before you run <code>docker run</code> again.</p>
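<p>A script can avoid the conflict by checking for the name before it calls <code>docker run</code>. The sketch below assumes a list of container names on standard input, one per line. <code>name_in_use</code> is a hypothetical helper, and the <code>printf</code> line stands in for real Docker output so the example runs anywhere:</p>
<pre><code class="language-bash"># Hypothetical helper: succeeds if the given name appears on stdin as a whole line.
name_in_use() {
  grep -qxF -- "$1"
}

# Sample input standing in for: sudo docker ps -a --format '{{.Names}}'
if printf 'webapp\nother\n' | name_in_use webapp; then
  echo "webapp is taken: remove or rename it first"
else
  echo "webapp is free"
fi
</code></pre>
<p>On the web VM, you would pipe <code>sudo docker ps -a --format '{{.Names}}'</code> into the function instead of the sample. The playbook reaches the same result more bluntly by running <code>docker rm -f webapp</code> before every <code>docker run</code>.</p>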
<h3 id="heading-break-3-make-ansible-fail-to-reach-a-vm">Break 3: Make Ansible Fail to Reach a VM</h3>
<p>Ansible separates “could not connect” from “connected, but a task broke.” The first is <strong>UNREACHABLE</strong>, the second is <strong>FAILED</strong>. Knowing which one you have tells you whether to fix network / SSH or playbook / packages / permissions.</p>
<p>On your laptop, in the project folder, edit <code>inventory</code> and change the web server address from <code>192.168.33.10</code> to an IP <strong>no VM uses</strong>, for example <code>192.168.33.99</code>. Save the file.</p>
<pre><code class="language-ini">[webservers]
192.168.33.99 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key
</code></pre>
<p>What you run (from the same project folder on the host):</p>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>After this, Ansible tries to SSH to <code>192.168.33.99</code>. Nothing on your lab network answers as that host (or SSH never succeeds), so Ansible <strong>never runs tasks</strong> on the web server. It stops that host with UNREACHABLE:</p>
<pre><code class="language-plaintext">fatal: [192.168.33.99]: UNREACHABLE! =&gt; {"msg": "Failed to connect to the host via ssh"}
</code></pre>
<p>This is realistic because the same message shape appears when the IP is wrong, the VM isn't running, a firewall blocks port 22, or the network is misconfigured. The common thread is <strong>no working SSH session</strong>.</p>
<p>Now it's time to put it back: restore <code>192.168.33.10</code> in <code>inventory</code> and run <code>ansible-playbook -i inventory playbook.yml</code> again. The web play should reach the VM and complete (assuming your lab is up).</p>
<p><strong>UNREACHABLE vs FAILED – what to observe:</strong></p>
<ul>
<li><p>If Ansible prints UNREACHABLE, you should assume it never opened SSH on that host and never ran tasks there. Go ahead and fix the connection (IP, VM up, firewall, key path) before you debug playbook logic.</p>
</li>
<li><p>If Ansible prints FAILED, you should assume SSH worked and a task returned an error. Read the task output for the real cause (package name, permissions, syntax), not the network first.</p>
</li>
</ul>
<p>When you debug later, you should look at the keyword Ansible prints: <strong>UNREACHABLE</strong> points to reachability while <strong>FAILED</strong> points to task output and the first failed task under that host.</p>
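<p>That keyword check is simple enough to script. The sketch below is only an illustration: <code>classify_ansible_line</code> is a hypothetical helper, and the sample line is a shortened copy of the error shown earlier.</p>
<pre><code class="language-bash"># Hypothetical helper: map one line of ansible-playbook output to a first debugging step.
classify_ansible_line() {
  case "$1" in
    *"UNREACHABLE!"*) echo "connection: check IP, VM state, firewall, key path" ;;
    *"FAILED!"*)      echo "task: read the failed task's output" ;;
    *)                echo "no failure keyword on this line" ;;
  esac
}

# Shortened sample line; real output continues with a JSON-style message.
classify_ansible_line 'fatal: [192.168.33.99]: UNREACHABLE!'
</code></pre>
<p>The order of the patterns doesn't matter here because a single <code>fatal</code> line carries only one of the two keywords.</p>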
<h3 id="heading-break-4-fill-the-vms-disk">Break 4: Fill the VM's Disk</h3>
<p>Databases and other services need free disk for logs, temp files, and data. When the filesystem is full or nearly full, a service may fail to start or fail at runtime. This break walks through the same diagnosis habit you would use on a real server: check space, then read systemd and journal output for the service.</p>
<p>All commands below run <strong>on the db VM</strong> after <code>vagrant ssh db</code>. MariaDB was installed there by your playbook.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Open a shell on the db VM:</p>
<pre><code class="language-plaintext">vagrant ssh db
</code></pre>
</li>
<li><p>Allocate a large file full of zeros (here 1GB) to simulate something eating disk space:</p>
<pre><code class="language-plaintext">sudo dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

df -h
</code></pre>
<p>Use <code>df -h</code> to see how full the root filesystem (or relevant mount) is. Your Vagrant disk may be large enough that 1 GB only raises usage a little without filling it. If MariaDB still starts, you have still practiced the checks. To see a stronger effect, you can repeat with a larger <code>count=</code> <strong>only in a lab</strong> (never fill production disks on purpose without a plan).</p>
</li>
<li><p>Ask systemd to restart MariaDB and show status:</p>
<pre><code class="language-plaintext">sudo systemctl restart mariadb
sudo systemctl status mariadb
</code></pre>
<p>If the disk is critically full, restart may fail or the service may show failed or not running.</p>
</li>
<li><p>If something looks wrong, read recent logs for the MariaDB unit:</p>
<pre><code class="language-plaintext">sudo journalctl -u mariadb --no-pager | tail -20
</code></pre>
<p>Errors often mention disk, space, read-only filesystem, or InnoDB being unable to write.</p>
</li>
<li><p>Clean up so your VM stays usable:</p>
<pre><code class="language-plaintext">sudo rm /tmp/bigfile
</code></pre>
<p>Optionally run <code>sudo systemctl restart mariadb</code> again and confirm it is active (running).</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should use <code>df -h</code> first to confirm whether the filesystem is actually tight. That avoids blaming the database when disk space is fine.</p>
</li>
<li><p>You should read <code>systemctl status mariadb</code> to see whether systemd thinks the service is active, failed, or flapping.</p>
</li>
<li><p>You should read <code>journalctl -u mariadb</code> when status is bad, so you can tie the failure to concrete errors from MariaDB or the kernel (often mentioning disk, space, or read-only filesystem). <strong>Space + status + logs</strong> is the same order you would use on a production server.</p>
</li>
</ul>
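<p>The first check, disk space, is easy to automate. The sketch below assumes the POSIX output format of <code>df -P</code> (usage percentage in the fifth column, mount point in the sixth). <code>disk_pressure</code> is a hypothetical helper and the sample lines are made up, so the example runs without touching a real disk:</p>
<pre><code class="language-bash"># Hypothetical helper: read `df -P` output on stdin and report mounts at or above a usage threshold.
disk_pressure() {
  threshold="$1"
  while read -r fs blocks used avail pct mount; do
    case "$pct" in
      *%) pct="${pct%?}" ;;   # strip the trailing percent sign
      *)  continue ;;         # skip the header line
    esac
    if [ "$pct" -ge "$threshold" ]; then
      echo "$mount is ${pct}% full"
    fi
  done
}

# Made-up sample lines standing in for real `df -P` output.
printf '%s\n' \
  'Filesystem 1024-blocks Used Available Capacity Mounted on' \
  '/dev/sda1 40000000 38000000 2000000 95% /' \
  'tmpfs 500000 0 500000 0% /dev/shm' | disk_pressure 90
</code></pre>
<p>On the db VM, you would run <code>df -P | disk_pressure 90</code>. With the sample input above, the only line printed is <code>/ is 95% full</code>.</p>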
<h3 id="heading-break-5-run-minikube-out-of-resources">Break 5: Run Minikube Out of Resources</h3>
<p>Kubernetes schedules pods onto nodes that have enough CPU and memory. If you ask for more than the cluster can place, some pods stay <strong>Pending</strong> and <strong>Events</strong> explain why (for example <em>Insufficient cpu</em>). That is not the same as a pod that starts and then crashes.</p>
<p>To do this, you'll need a local cluster (we're using <a href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fmacos%2Fx86-64%2Fstable%2Fbinary+download"><strong>Minikube</strong></a> in this guide) and <code>kubectl</code> on your laptop. This break doesn't use the Vagrant VMs. If you haven't installed Minikube yet, complete the "How to Set Up Kubernetes" section first, or skip this break until you do.</p>
<p>You'll run this on your <strong>Mac, Linux, or Windows terminal</strong> (host), not inside <code>vagrant ssh</code>. If you're still inside a VM, type <code>exit</code> until your prompt is back on the host.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Check Minikube:</p>
<pre><code class="language-plaintext">minikube status
</code></pre>
<p>If it's stopped, start it (Docker driver matches earlier sections):</p>
<pre><code class="language-plaintext">minikube start --driver=docker
</code></pre>
</li>
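<li><p>Create a deployment with many replicas, then give each pod a CPU request large enough that your single Minikube node can't run them all at once. Without a request, the scheduler has nothing to count against the node's capacity and will usually place every pod:</p>
<pre><code class="language-plaintext">kubectl create deployment stress --image=nginx --replicas=20

# Request half a CPU per pod (an illustrative value; raise it if nothing stays Pending)
kubectl set resources deployment stress --requests=cpu=500m

# Watch pods start
kubectl get pods -w
</code></pre>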
<p>Press Ctrl+C when you're done watching. Some pods may stay <strong>Pending</strong> while others are <strong>Running</strong>.</p>
</li>
<li><p>Pick one Pending pod name from <code>kubectl get pods</code> and inspect it:</p>
<pre><code class="language-plaintext">kubectl describe pod &lt;pod-name&gt;
</code></pre>
<p>Under Events, look for FailedScheduling and a line similar to:</p>
<pre><code class="language-plaintext">Warning  FailedScheduling  0/1 nodes are available: 1 Insufficient cpu.
</code></pre>
<p>You might see <strong>Insufficient memory</strong> instead, depending on your machine.</p>
</li>
<li><p>Fix the lab by scaling back so the cluster can catch up:</p>
<pre><code class="language-plaintext">kubectl scale deployment stress --replicas=2
</code></pre>
<p>You can delete the deployment entirely when finished: <code>kubectl delete deployment stress</code>.</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should see Pending pods stay unscheduled until capacity frees up. That means the scheduler hasn't placed them on any <strong>node</strong> yet, usually because the node is out of CPU or memory for that workload.</p>
</li>
<li><p>You should read <code>kubectl describe pod &lt;pod-name&gt;</code> and scroll to <strong>Events</strong>. Messages like Insufficient cpu or Insufficient memory mean the cluster ran out of schedulable capacity, not that the container image is corrupt.</p>
</li>
<li><p>You should contrast that with a pod that reaches Running and then CrashLoopBackOff, which usually means the process inside the container keeps exiting. That is an application or config problem, not a “nowhere to run” problem.</p>
</li>
</ul>
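<p>To put a number on the first observation, you can count the Pending pods instead of eyeballing the list. The sketch below assumes the default column order of <code>kubectl get pods --no-headers</code> (NAME, READY, STATUS, RESTARTS, AGE). <code>count_pending</code> is a hypothetical helper and the pod names are made up:</p>
<pre><code class="language-bash"># Hypothetical helper: count Pending pods in `kubectl get pods --no-headers` output read from stdin.
count_pending() {
  awk '$3 == "Pending" { n++ } END { print n + 0 }'
}

# Made-up sample lines in the default column order: NAME READY STATUS RESTARTS AGE.
printf '%s\n' \
  'stress-7c5d8b9f6-abcde 1/1 Running 0 2m' \
  'stress-7c5d8b9f6-fghij 0/1 Pending 0 2m' \
  'stress-7c5d8b9f6-klmno 0/1 Pending 0 2m' | count_pending
</code></pre>
<p>Against a real cluster, you would run <code>kubectl get pods --no-headers | count_pending</code> before and after scaling down and compare the two numbers.</p>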
<h2 id="heading-what-you-can-now-do">What You Can Now Do</h2>
<p>You didn't just install tools in this tutorial. You also used them.</p>
<p>You can now spin up two servers from a single file. You can write a playbook that installs software and deploys a container without touching either machine manually.</p>
<p>You can serve a page you wrote from inside a Docker container running on a Vagrant VM, and bring the whole thing back from scratch in one command.</p>
<p>You also broke it. You saw what a container conflict looks like, what Ansible prints when it can't reach a machine, what disk pressure does to a running service, and what a Kubernetes scheduler says when it runs out of resources. Those error messages aren't unfamiliar anymore.</p>
<p>That's the difference between someone who has read about DevOps and someone who has run it.</p>
<p><strong>Here are four free projects you can run in this same lab to go further:</strong></p>
<ul>
<li><p><strong>DevOps Home-Lab 2026</strong> — Build a multi-service app (frontend, API, PostgreSQL, Redis) end-to-end with Docker Compose, Kubernetes, Prometheus/Grafana monitoring, GitOps with ArgoCD, and Cloudflare for global exposure.</p>
</li>
<li><p><strong>KubeLab</strong> — Trigger real Kubernetes failure scenarios (pod crashes, OOMKills, node drains, cascading failures) and watch how the cluster responds using live metrics.</p>
</li>
<li><p><strong>K8s Secrets Lab</strong> — Build a full secret management pipeline from AWS Secrets Manager into your cluster, including rotation behavior and IRSA.</p>
</li>
<li><p><strong>DevOps Troubleshooting Toolkit</strong> — Structured debugging guides across Linux, containers, Kubernetes, cloud, databases, and observability with copy-paste commands for real incidents.</p>
</li>
</ul>
<p>All free and open source: <a href="https://github.com/Osomudeya/List-Of-DevOps-Projects">github.com/Osomudeya/List-Of-DevOps-Projects</a>.</p>
<p>If you want to go deeper, you can find six full chapters covering Terraform, Ansible, monitoring, CI/CD, and a simulated three-VM production environment at <a href="https://osomudeya.gumroad.com/l/BuildYourOwnDevOpsLab">Build Your Own DevOps Lab</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build and Deploy Multi-Architecture Docker Apps on Google Cloud Using ARM Nodes (Without QEMU) ]]>
                </title>
                <description>
                    <![CDATA[ If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening ins ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-and-deploy-multi-architecture-docker-apps-on-google-cloud-using-arm-nodes/</link>
                <guid isPermaLink="false">69dcf2c3f57346bc1e05a01d</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ARM ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Amina Lawal ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:42:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e89ae65a-4b3a-44b7-94d8-d0638f017bf6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening inside cloud data centers.</p>
<p>Google Cloud Axion is Google's own custom ARM-based chip, built to handle the demands of modern cloud workloads. The performance and cost numbers are striking: Google claims Axion delivers up to 60% better energy efficiency and up to 65% better price-performance than comparable x86 machines.</p>
<p>AWS has Graviton. Azure has Cobalt. ARM is no longer niche. It's the direction the entire cloud industry is moving.</p>
<p>But there's a problem that catches almost every team off guard when they start this transition: <strong>container architecture mismatch</strong>.</p>
<p>If you build a Docker image on your M-series Mac and push it to an x86 server, it crashes on startup with a cryptic <code>exec format error</code>.</p>
<p>The server isn't broken. It just can't read the compiled instructions inside your image. An ARM binary and an x86 binary are written in fundamentally different languages at the machine level. The CPU literally can't execute instructions it wasn't designed for.</p>
<p>We're going to solve this problem completely in this tutorial. You'll build a single Docker image tag that automatically serves the correct binary on both ARM and x86 machines — no separate pipelines, no separate tags. Then you'll provision Google Cloud ARM nodes in GKE and configure your Kubernetes deployment to route workloads precisely to those cost-efficient nodes.</p>
<p><strong>Here's what you'll build, step by step:</strong></p>
<ul>
<li><p>A Go HTTP server that reports the CPU architecture it's running on at runtime</p>
</li>
<li><p>A multi-stage Dockerfile that cross-compiles for both <code>linux/amd64</code> and <code>linux/arm64</code> without slow QEMU emulation</p>
</li>
<li><p>A multi-arch image in Google Artifact Registry that acts as a single entry point for any architecture</p>
</li>
<li><p>A GKE cluster with two node pools: a standard x86 pool and an ARM Axion pool</p>
</li>
<li><p>A Kubernetes Deployment that pins your workload exclusively to the ARM nodes</p>
</li>
</ul>
<p>By the end, you'll hit a live endpoint and see the word <code>arm64</code> staring back at you from a Google Cloud ARM node. Let's get into it.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</a></p>
</li>
<li><p><a href="#heading-step-3-write-the-application">Step 3: Write the Application</a></p>
</li>
<li><p><a href="#heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</a></p>
</li>
<li><p><a href="#heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</a></p>
</li>
<li><p><a href="#heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</a></p>
</li>
<li><p><a href="#heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</a></p>
</li>
<li><p><a href="#heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-project-file-structure">Project File Structure</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following ready:</p>
<ul>
<li><p><strong>A Google Cloud project</strong> with billing enabled. If you don't have one, create it at <a href="https://console.cloud.google.com">console.cloud.google.com</a>. The total cost to follow this tutorial is around $5–10.</p>
</li>
<li><p><code>gcloud</code> <strong>CLI</strong> installed and authenticated. Run <code>gcloud auth login</code> to sign in and <code>gcloud config set project YOUR_PROJECT_ID</code> to point it at your project.</p>
</li>
<li><p><strong>Docker Desktop</strong>, or Docker Engine 19.03 or later. Docker Buildx (the tool we'll use for multi-arch builds) ships bundled with both.</p>
</li>
<li><p><code>kubectl</code> installed. This is the CLI for interacting with Kubernetes clusters.</p>
</li>
<li><p>Basic familiarity with <strong>Docker</strong> (images, layers, Dockerfile) and <strong>Kubernetes</strong> (pods, deployments, services). You don't need to be an expert, but you should know what these things are.</p>
</li>
</ul>
<h2 id="heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</h2>
<p>Before writing a single line of application code, let's get the cloud infrastructure side ready. This is the foundation everything else will build on.</p>
<h3 id="heading-enable-the-required-apis">Enable the Required APIs</h3>
<p>Google Cloud services are off by default in any new project. Run this command to turn on the three APIs we'll need:</p>
<pre><code class="language-bash">gcloud services enable \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  containeranalysis.googleapis.com
</code></pre>
<p>Here's what each one does:</p>
<ul>
<li><p><code>artifactregistry.googleapis.com</code> — enables <strong>Artifact Registry</strong>, where we'll store our Docker images</p>
</li>
<li><p><code>container.googleapis.com</code> — enables <strong>Google Kubernetes Engine (GKE)</strong>, where our cluster will run</p>
</li>
<li><p><code>containeranalysis.googleapis.com</code> — enables the Container Analysis API, which stores vulnerability and other metadata for images in Artifact Registry</p>
</li>
</ul>
<h3 id="heading-create-a-docker-repository-in-artifact-registry">Create a Docker Repository in Artifact Registry</h3>
<p>Artifact Registry is Google Cloud's managed container image store — the place where our built images will live before being deployed to the cluster. Create a dedicated repository for this tutorial:</p>
<pre><code class="language-bash">gcloud artifacts repositories create multi-arch-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Multi-arch tutorial images"
</code></pre>
<p>Breaking down the flags:</p>
<ul>
<li><p><code>--repository-format=docker</code> — tells Artifact Registry this repository stores Docker images (as opposed to npm packages, Maven artifacts, and so on)</p>
</li>
<li><p><code>--location=us-central1</code> — the Google Cloud region where your images will be stored. Use a region that's close to where your cluster will run to minimize image pull latency. Run <code>gcloud artifacts locations list</code> to see all options.</p>
</li>
<li><p><code>--description</code> — a human-readable label for the repository, shown in the console.</p>
</li>
</ul>
<h3 id="heading-authenticate-docker-to-push-to-artifact-registry">Authenticate Docker to Push to Artifact Registry</h3>
<p>Docker needs credentials before it can push images to Google Cloud. Run this command to wire up authentication automatically:</p>
<pre><code class="language-bash">gcloud auth configure-docker us-central1-docker.pkg.dev
</code></pre>
<p>This adds a credential helper entry to your <code>~/.docker/config.json</code> file. What that means in practice: any time Docker tries to push or pull from a URL under <code>us-central1-docker.pkg.dev</code>, it will automatically call <code>gcloud</code> to get a valid auth token. You won't need to run <code>docker login</code> manually.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/31fd020f-ffa2-40bd-9057-57b16a61b325.png" alt="Terminal output of the gcloud artifacts repositories list command, showing a row for multi-arch-repo with format DOCKER, location us-central1" style="display:block;margin:0 auto" width="2870" height="1512" loading="lazy">

<h2 id="heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</h2>
<p>With Artifact Registry ready to receive images, let's create the Kubernetes cluster. We'll start with a standard cluster using x86 nodes and add an ARM node pool later once we have an image to deploy.</p>
<pre><code class="language-bash">gcloud container clusters create axion-tutorial-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-standard-2 \
  --workload-pool=PROJECT_ID.svc.id.goog
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>What each flag does:</p>
<ul>
<li><p><code>--zone=us-central1-a</code>: creates a zonal cluster in a single availability zone. A regional cluster (using <code>--region</code>) would spread nodes across three zones for higher resilience, but for this tutorial a single zone keeps things simple and means the machine types we use only need capacity in one zone. If <code>us-central1-a</code> is unavailable, try <code>us-central1-b</code>.</p>
</li>
<li><p><code>--num-nodes=2</code> — two x86 nodes in this zone. We need at least 2 to have enough capacity alongside our ARM node pool later.</p>
</li>
<li><p><code>--machine-type=e2-standard-2</code> — the machine type for this default node pool. <code>e2-standard-2</code> is a cost-effective x86 machine with 2 vCPUs and 8 GB of memory, good for general workloads.</p>
</li>
<li><p><code>--workload-pool=PROJECT_ID.svc.id.goog</code> — enables <strong>Workload Identity</strong>, which is Google's recommended way for pods to authenticate with Google Cloud APIs. It avoids the need to download and store service account key files inside your cluster.</p>
</li>
</ul>
<p>This command takes a few minutes. While it runs, you can move on to writing the application. We'll come back to the cluster in Step 7, when we add the ARM node pool.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/332250a8-3f99-4eb1-849f-51ab054c9567.png" alt="GCP Console Kubernetes Engine Clusters page showing axion-tutorial-cluster with a green checkmark status, the zone us-central1-a, and Kubernetes version in the table." style="display:block;margin:0 auto" width="1457" height="720" loading="lazy">

<h2 id="heading-step-3-write-the-application">Step 3: Write the Application</h2>
<p>We need an application to containerize. We'll use <strong>Go</strong> for three specific reasons:</p>
<ol>
<li><p>Go compiles into a single, statically-linked binary. There's no runtime to install, no interpreter — just the binary. This makes for extremely lean container images.</p>
</li>
<li><p>Go has first-class, built-in cross-compilation support. We can compile an ARM64 binary from an x86 Mac, or vice versa, by setting two environment variables. This will matter a lot when we get to the Dockerfile.</p>
</li>
<li><p>Go exposes the architecture the binary was compiled for via <code>runtime.GOARCH</code>. Our server will report this at runtime, giving us hard proof that the correct binary is running on the correct hardware.</p>
</li>
</ol>
<p>Start by creating the project directories:</p>
<pre><code class="language-bash">mkdir -p hello-axion/app hello-axion/k8s
cd hello-axion/app
</code></pre>
<p>Initialize the Go module from inside <code>app/</code>. This creates <code>go.mod</code> in the current directory:</p>
<pre><code class="language-bash">go mod init hello-axion
</code></pre>
<p><code>go mod init</code> is Go's built-in command for starting a new module. It writes a <code>go.mod</code> file that declares the module name (<code>hello-axion</code>) and the minimum Go version required. Every modern Go project needs this file — without it, the compiler doesn't know how to resolve packages.</p>
<p>Now create the application at <code>app/main.go</code>:</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
    "os"
    "runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
    hostname, _ := os.Hostname()
    fmt.Fprintf(w, "Hello from freeCodeCamp!\n")
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
    fmt.Fprintf(w, "OS           : %s\n", runtime.GOOS)
    fmt.Fprintf(w, "Pod hostname : %s\n", hostname)
}

func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/healthz", healthz)
    fmt.Println("Server starting on port 8080...")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        fmt.Fprintf(os.Stderr, "server error: %v\n", err)
        os.Exit(1)
    }
}
</code></pre>
<p>Verify both files were created:</p>
<pre><code class="language-bash">ls -la
</code></pre>
<p>You should see <code>go.mod</code> and <code>main.go</code> listed.</p>
<p>Let's walk through what this code does:</p>
<ul>
<li><p><code>import "runtime"</code> — imports Go's built-in <code>runtime</code> package, which exposes information about the Go runtime environment, including the CPU architecture.</p>
</li>
<li><p><code>runtime.GOARCH</code>: a string constant such as <code>"arm64"</code> or <code>"amd64"</code> that records the architecture this binary was compiled for. When we deploy to an ARM node, this value will be <code>arm64</code>. This is the core of our proof.</p>
</li>
<li><p><code>os.Hostname()</code> — returns the pod's hostname, which Kubernetes sets to the pod name. This lets us see which specific pod responded when we test the app later.</p>
</li>
<li><p><code>handler</code> — the main HTTP handler, registered on the root path <code>/</code>. It writes the architecture, OS, and hostname to the response.</p>
</li>
<li><p><code>healthz</code> — a separate handler registered on <code>/healthz</code>. It returns HTTP 200 with the text <code>ok</code>. Kubernetes will use this endpoint to check whether the container is alive and ready to serve traffic — we'll wire this up in the deployment manifest later.</p>
</li>
<li><p><code>http.ListenAndServe(":8080", nil)</code> — starts the server on port 8080. If it fails to start (for example, if the port is already in use), it prints the error and exits with a non-zero code so Kubernetes knows something went wrong.</p>
</li>
</ul>
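<p>Before containerizing anything, it can be reassuring to exercise the handlers in memory. The sketch below is one way to do that with the standard library's <code>net/http/httptest</code> package. To stay self-contained it re-declares <code>healthz</code> and uses a trimmed-down <code>archLine</code> handler as a stand-in for the architecture line in <code>handler</code>. Both <code>archLine</code> and the <code>probe</code> helper are illustrative names, not part of <code>main.go</code>.</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "runtime"
    "strings"
)

// healthz mirrors the handler in main.go: HTTP 200 with the body "ok".
func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

// archLine is a stand-in that writes only the architecture line of the root handler.
func archLine(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
}

// probe sends an in-memory GET request to h and returns the status code and body.
func probe(h http.HandlerFunc, path string) (int, string) {
    rec := httptest.NewRecorder()
    req := httptest.NewRequest(http.MethodGet, path, nil)
    h(rec, req)
    return rec.Code, rec.Body.String()
}

func main() {
    code, body := probe(healthz, "/healthz")
    fmt.Println(code, strings.TrimSpace(body))

    _, body = probe(archLine, "/")
    fmt.Print(body)
}
</code></pre>
<p>No port is opened here, so this runs anywhere Go is installed. The architecture it prints is that of the machine you run it on, which is exactly the behavior we rely on later in the cluster.</p>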
<h2 id="heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</h2>
<p>Before we write the Dockerfile, we need to understand a fundamental constraint, because it directly shapes how the Dockerfile must be written.</p>
<h3 id="heading-why-your-docker-images-are-architecture-specific-by-default">Why Your Docker Images Are Architecture-Specific By Default</h3>
<p>A CPU only understands instructions written for its specific <strong>Instruction Set Architecture (ISA)</strong>. ARM64 and x86_64 are different ISAs — different vocabularies of machine-level operations. When you compile a Go program, the compiler translates your source code into binary instructions for exactly one ISA. That binary can't run on a different ISA.</p>
<p>When you build a Docker image the normal way (<code>docker build</code>), the binary inside that image is compiled for your local machine's ISA. If you're on an Apple Silicon Mac, you get an ARM64 binary. Push that image to an x86 server, and when Docker tries to execute the binary, the kernel rejects it:</p>
<pre><code class="language-shell">standard_init_linux.go:228: exec user process caused: exec format error
</code></pre>
<p>That's the operating system saying: "This binary was written for a different processor. I have no idea what to do with it."</p>
<h3 id="heading-the-solution-a-single-image-tag-that-serves-any-architecture">The Solution: A Single Image Tag That Serves Any Architecture</h3>
<p>Docker solves this with a structure called a <strong>Manifest List</strong> (also called a multi-arch image index). Instead of one image, a Manifest List is a pointer table. It holds multiple image references — one per architecture — all under the same tag.</p>
<p>When a server pulls <code>hello-axion:v1</code>, here's what actually happens:</p>
<ol>
<li><p>Docker contacts the registry and requests the manifest for <code>hello-axion:v1</code></p>
</li>
<li><p>The registry returns the Manifest List, which looks like this internally:</p>
</li>
</ol>
<pre><code class="language-json">{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}
</code></pre>
<ol start="3">
<li>Docker checks the current machine's architecture, finds the matching entry, and pulls only that image's layers. The x86 image never downloads onto your ARM server, and vice versa.</li>
</ol>
<p>One tag, two actual images. Completely transparent to your deployment manifests.</p>
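<p>To make the lookup concrete, here is a rough Go sketch of the selection step. It parses the simplified JSON from the example above and returns the digest for a given OS and architecture. The type names and <code>pickDigest</code> are made up for illustration, the digests are the same truncated placeholders, and a real image index carries more fields than this. It is not how Docker itself is implemented.</p>
<pre><code class="language-go">package main

import (
    "encoding/json"
    "fmt"
    "runtime"
)

type platform struct {
    Architecture string `json:"architecture"`
    OS           string `json:"os"`
}

type entry struct {
    Digest   string   `json:"digest"`
    Platform platform `json:"platform"`
}

type manifestList struct {
    Manifests []entry `json:"manifests"`
}

// sample has the same shape as the example above; the digests are placeholders.
const sample = `{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}`

// pickDigest returns the digest whose platform matches goos/goarch,
// or an empty string when the list has no matching entry.
func pickDigest(raw, goos, goarch string) (string, error) {
    list := new(manifestList)
    if err := json.Unmarshal([]byte(raw), list); err != nil {
        return "", err
    }
    want := platform{Architecture: goarch, OS: goos}
    for _, m := range list.Manifests {
        if m.Platform == want {
            return m.Digest, nil
        }
    }
    return "", nil
}

func main() {
    for _, arch := range []string{"amd64", "arm64"} {
        digest, err := pickDigest(sample, "linux", arch)
        if err != nil {
            panic(err)
        }
        fmt.Printf("linux/%s pulls %s\n", arch, digest)
    }
    fmt.Println("this machine would ask for:", runtime.GOOS+"/"+runtime.GOARCH)
}
</code></pre>
<p>The first two lines of output pair <code>linux/amd64</code> with <code>sha256:a1b2...</code> and <code>linux/arm64</code> with <code>sha256:c3d4...</code>. The last line depends on the machine you run it on.</p>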
<h3 id="heading-set-up-docker-buildx">Set Up Docker Buildx</h3>
<p><strong>Docker Buildx</strong> is the CLI tool that builds these Manifest Lists. It's powered by the <strong>BuildKit</strong> engine and ships bundled with Docker Desktop. Run the following to create and activate a new builder instance:</p>
<pre><code class="language-bash">docker buildx create --name multiarch-builder --use
</code></pre>
<ul>
<li><p><code>--name multiarch-builder</code> — gives this builder a memorable name. You can have multiple builders. This command creates a new one named <code>multiarch-builder</code>.</p>
</li>
<li><p><code>--use</code> — immediately sets this new builder as the active one, so all future <code>docker buildx build</code> commands use it.</p>
</li>
</ul>
<p>Now boot the builder and confirm it supports the platforms we need:</p>
<pre><code class="language-bash">docker buildx inspect --bootstrap
</code></pre>
<ul>
<li><code>--bootstrap</code> — starts the builder container if it isn't already running, and prints its full configuration.</li>
</ul>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">Name:          multiarch-builder
Driver:        docker-container
Platforms:     linux/amd64, linux/arm64, linux/arm/v7, linux/386, ...
</code></pre>
<p>The <code>Platforms</code> line lists every architecture this builder can produce images for. As long as you see <code>linux/amd64</code> and <code>linux/arm64</code> in that list, you're ready to build for both x86 and ARM.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/1c19aca1-30c4-406d-9c37-679ee4f2928f.png" alt="Terminal output showing the multiarch-builder details with Name, Driver set to docker-container, and a Platforms list that includes linux/amd64 and linux/arm64 highlighted." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<h2 id="heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</h2>
<p>Now we can write the Dockerfile. We'll use two techniques together: a <strong>multi-stage build</strong> to keep the final image tiny, and a <strong>cross-compilation trick</strong> to avoid slow CPU emulation.</p>
<p>Create <code>app/Dockerfile</code> with the following content:</p>
<pre><code class="language-dockerfile"># -----------------------------------------------------------
# Stage 1: Build
# -----------------------------------------------------------
# $BUILDPLATFORM = the machine running this build (your laptop)
# $TARGETOS / $TARGETARCH = the platform we are building FOR
# -----------------------------------------------------------
FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /app

COPY go.mod .
RUN go mod download

COPY main.go .

RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go

# -----------------------------------------------------------
# Stage 2: Runtime
# -----------------------------------------------------------

FROM alpine:latest

RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup
USER appuser

WORKDIR /app
COPY --from=builder /app/server .

EXPOSE 8080
CMD ["./server"]
</code></pre>
<p>There's a lot happening here. Let's go through it carefully.</p>
<h3 id="heading-stage-1-the-builder">Stage 1: The Builder</h3>
<p><code>FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder</code></p>
<p>This is the most important line in the file. <code>$BUILDPLATFORM</code> is a special build argument that Docker Buildx automatically injects. It equals the platform of the machine <em>running the build</em> (your laptop). By pinning the builder stage to <code>$BUILDPLATFORM</code>, the Go compiler always runs natively on your machine, not inside a CPU emulator. This is what makes multi-arch builds fast.</p>
<p>Without <code>--platform=$BUILDPLATFORM</code>, Buildx would have to use <strong>QEMU</strong> — a full CPU emulator — to run an ARM64 build environment on your x86 machine (or vice versa). QEMU works, but it's typically 5–10 times slower than native execution. For a project with many dependencies, that's the difference between a 2-minute build and a 20-minute build.</p>
<p><code>ARG TARGETOS</code> <strong>and</strong> <code>ARG TARGETARCH</code></p>
<p>These two lines declare that our Dockerfile expects build arguments named <code>TARGETOS</code> and <code>TARGETARCH</code>. Buildx injects these automatically based on the <code>--platform</code> flag you pass at build time. For a <code>linux/arm64</code> target, <code>TARGETOS</code> will be <code>linux</code> and <code>TARGETARCH</code> will be <code>arm64</code>.</p>
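<p>If the mapping from a platform string to these arguments feels abstract, the following Go sketch mimics it. <code>splitPlatform</code> is a hypothetical helper, not Buildx code. It only shows how a value like <code>linux/arm64</code> breaks down into what the Dockerfile sees as <code>TARGETOS</code> and <code>TARGETARCH</code>, plus <code>TARGETVARIANT</code> for platforms such as <code>linux/arm/v7</code>.</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "strings"
)

// splitPlatform breaks a platform string such as "linux/arm64" or
// "linux/arm/v7" into OS, architecture, and an optional variant.
func splitPlatform(p string) (goos, goarch, variant string) {
    parts := strings.SplitN(p, "/", 3)
    switch len(parts) {
    case 3:
        return parts[0], parts[1], parts[2]
    case 2:
        return parts[0], parts[1], ""
    default:
        // Not a complete platform string; return it unchanged as the OS part.
        return p, "", ""
    }
}

func main() {
    for _, p := range []string{"linux/amd64", "linux/arm64", "linux/arm/v7"} {
        goos, goarch, variant := splitPlatform(p)
        fmt.Printf("%-13s TARGETOS=%s TARGETARCH=%s TARGETVARIANT=%s\n", p, goos, goarch, variant)
    }
}
</code></pre>
<p>For the two platforms in this tutorial the variant is empty, which is why the Dockerfile only declares <code>TARGETOS</code> and <code>TARGETARCH</code>.</p>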
<p><code>COPY go.mod .</code> <strong>and</strong> <code>RUN go mod download</code></p>
<p>We copy <code>go.mod</code> first, before copying the rest of the source code. Docker builds images layer by layer and caches each layer. By copying only the module file first, we create a cached layer for <code>go mod download</code>.</p>
<p>On future builds, as long as <code>go.mod</code> hasn't changed, Docker skips the download step entirely — even if the source code changed. This speeds up iterative development significantly.</p>
<p><code>RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go</code></p>
<p>This is the cross-compilation step. <code>GOOS</code> and <code>GOARCH</code> are Go's built-in cross-compilation environment variables. Setting them tells the Go compiler to produce a binary for a different target than the machine it's running on. We set them from the <code>$TARGETOS</code> and <code>$TARGETARCH</code> build args injected by Buildx.</p>
<p>The <code>-ldflags="-w -s"</code> flag strips the debug symbol table and the DWARF debugging information from the binary. This has no effect on runtime behavior but reduces the binary size by roughly 30%.</p>
<h3 id="heading-stage-2-the-runtime-image">Stage 2: The Runtime Image</h3>
<p><code>FROM alpine:latest</code></p>
<p>This starts a brand-new image from Alpine Linux — a minimal Linux distribution that weighs about 5 MB. Critically, <code>alpine:latest</code> is itself a multi-arch image, so Docker automatically selects the <code>arm64</code> or <code>amd64</code> Alpine variant depending on which platform this stage is built for.</p>
<p>Everything from Stage 1 — the Go toolchain, the source files, the intermediate object files — is discarded. The final image contains <em>only</em> Alpine Linux plus our binary. Compared to a naive single-stage Go image (~300 MB), this approach produces an image under 15 MB.</p>
<p><code>RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup</code> and <code>USER appuser</code></p>
<p>These two lines create a non-root user and set it as the active user for the container. Running containers as root is a security risk — if an attacker exploits a vulnerability in your application, they gain root access inside the container. Running as a non-root user limits the blast radius.</p>
<p><code>COPY --from=builder /app/server .</code></p>
<p>This is how multi-stage builds work: the <code>--from=builder</code> flag tells Docker to copy files from the <code>builder</code> stage (Stage 1), not from your local disk. Only the compiled binary (<code>server</code>) makes it into the final image.</p>
<h2 id="heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</h2>
<p>With the application and Dockerfile in place, we can now build images for both architectures and push them to Artifact Registry — all in a single command.</p>
<p>From inside the <code>app/</code> directory, run:</p>
<pre><code class="language-bash">docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 \
  --push \
  .
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual GCP project ID.</p>
<p>Here's what each part of this command does:</p>
<ul>
<li><p><code>docker buildx build</code> — uses the Buildx CLI instead of the standard <code>docker build</code>. Buildx is required for multi-platform builds.</p>
</li>
<li><p><code>--platform linux/amd64,linux/arm64</code>: instructs Buildx to build the image twice, once targeting x86 Intel/AMD machines and once targeting ARM64. Both builds run in parallel. Because our Dockerfile uses the <code>$BUILDPLATFORM</code> cross-compilation trick, the Go compile step runs natively on your machine for both targets. Only the short <code>RUN</code> step in the runtime stage executes inside the target platform's image, so that is the one place where emulation can still come into play.</p>
</li>
<li><p><code>-t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1</code> — the full image path in Artifact Registry. The format is always <code>REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG</code>.</p>
</li>
<li><p><code>--push</code> — multi-arch images can't be loaded into your local Docker daemon (which only understands single-architecture images). This flag tells Buildx to skip local storage and push the completed Manifest List — with both architecture variants — directly to the registry.</p>
</li>
<li><p><code>.</code> — the build context, the directory Docker scans for the Dockerfile and any files the build needs.</p>
</li>
</ul>
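<p>The image path format is easy to get wrong by one segment, so here is a tiny Go sketch that assembles it from its parts. <code>imageRef</code> is an illustrative helper name, and <code>PROJECT_ID</code> is left as the same placeholder used in the command above.</p>
<pre><code class="language-go">package main

import "fmt"

// imageRef assembles REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG.
func imageRef(region, project, repo, image, tag string) string {
    return fmt.Sprintf("%s-docker.pkg.dev/%s/%s/%s:%s", region, project, repo, image, tag)
}

func main() {
    // PROJECT_ID is a placeholder; substitute your own project ID.
    fmt.Println(imageRef("us-central1", "PROJECT_ID", "multi-arch-repo", "hello-axion", "v1"))
}
</code></pre>
<p>This prints <code>us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1</code>, the same value passed to <code>-t</code>.</p>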
<p>Watch the output as the build runs. You'll see BuildKit working on both platforms simultaneously:</p>
<pre><code class="language-plaintext"> =&gt; [linux/amd64 builder 1/5] FROM golang:1.23-alpine
 =&gt; [linux/arm64 builder 1/5] FROM golang:1.23-alpine
 ...
 =&gt; pushing manifest for us-central1-docker.pkg.dev/.../hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/dc88f558-b4ee-4100-bfe1-eaa943bec9bc.png" alt="Terminal showing docker buildx build output with two parallel build tracks labeled linux/amd64 and linux/arm64, and a final line reading pushing manifest for the Artifact Registry image path." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<h3 id="heading-verify-the-multi-arch-image-in-artifact-registry">Verify the Multi-Arch Image in Artifact Registry</h3>
<p>Once the push completes, navigate to <strong>GCP Console → Artifact Registry → Repositories → multi-arch-repo</strong> and click on <code>hello-axion</code>.</p>
<p>You won't see a single image — you'll see something labelled <strong>"Image Index"</strong>. That's the Manifest List we created. Click into it, and you'll find two child images with separate digests, one for <code>linux/amd64</code> and one for <code>linux/arm64</code>.</p>
<p>You can also inspect this from the command line:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/28d0e4a4-1d45-4c0b-ac47-34dc3b72c11d.png" alt="Google Cloud Artifact Registry console showing hello-axion as an Image Index with two child images: one labeled linux/amd64 and one labeled linux/arm64, each with its own digest and size." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<p>The output lists every manifest inside the image index. You'll see entries for <code>linux/amd64</code> and <code>linux/arm64</code> — those are our two real images. You'll also see two entries with <code>Platform: unknown/unknown</code> labelled as <code>attestation-manifest</code>. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to prove how and where the image was built (a supply chain security feature called SLSA attestation).</p>
<p>The two entries you care about are <code>linux/amd64</code> and <code>linux/arm64</code>. Note the digest for the <code>arm64</code> entry — we'll use it in the verification step to confirm the cluster pulled the right variant.</p>
<h2 id="heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</h2>
<p>We have a universal image. Now we need somewhere to run it.</p>
<p>Recall the cluster we created in Step 2 — it's running <code>e2-standard-2</code> x86 machines. We're going to add a second node pool running ARM machines. This is the key architectural move: a <strong>mixed-architecture cluster</strong> where different workloads can be routed to different hardware.</p>
<h3 id="heading-choosing-your-arm-machine-type">Choosing Your ARM Machine Type</h3>
<p>Google Cloud currently offers two ARM-based machine series in GKE:</p>
<table>
<thead>
<tr>
<th>Series</th>
<th>Example type</th>
<th>What it is</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tau T2A</strong></td>
<td><code>t2a-standard-2</code></td>
<td>First-gen Google ARM (Ampere Altra). Broadly available across regions. Great for getting started.</td>
</tr>
<tr>
<td><strong>Axion (C4A)</strong></td>
<td><code>c4a-standard-2</code></td>
<td>Google's custom ARM chip (Arm Neoverse V2 core). Newest generation, best price-performance. Still expanding availability.</td>
</tr>
</tbody></table>
<p>This tutorial uses <code>t2a-standard-2</code> because it's widely available. The commands are identical for <code>c4a-standard-2</code> — just swap the <code>--machine-type</code> value. If <code>t2a-standard-2</code> isn't available in your zone, GKE will tell you immediately when you run the node pool creation command below, and you can try a neighbouring zone.</p>
<h3 id="heading-create-the-arm-node-pool">Create the ARM Node Pool</h3>
<p>Add the ARM node pool to your existing cluster:</p>
<pre><code class="language-bash">gcloud container node-pools create axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --num-nodes=2 \
  --node-labels=workload-type=arm-optimized
</code></pre>
<p>What each flag does:</p>
<ul>
<li><p><code>--cluster=axion-tutorial-cluster</code> — the name of the cluster we created in Step 2. Node pools are always added to an existing cluster.</p>
</li>
<li><p><code>--zone=us-central1-a</code> — must match the zone you used when creating the cluster.</p>
</li>
<li><p><code>--machine-type=t2a-standard-2</code> — GKE detects this is an ARM machine type and automatically provisions the nodes with an ARM-compatible version of Container-Optimized OS (COS). You don't need to configure anything special for ARM at the OS level.</p>
</li>
<li><p><code>--num-nodes=2</code> — two ARM nodes in the zone, enough to schedule our 3-replica deployment alongside other cluster overhead.</p>
</li>
<li><p><code>--node-labels=workload-type=arm-optimized</code>: attaches a custom label to every node in this pool. The deployment manifest in this tutorial selects nodes by the automatic <code>kubernetes.io/arch=arm64</code> label, but a descriptive custom label like this one is good practice in real clusters because it communicates the <em>intent</em> of the pool, not just its hardware. You can target it from a <code>nodeSelector</code> in exactly the same way.</p>
</li>
</ul>
<p>This command takes a few minutes. Once it completes, let's confirm our cluster now has both node pools:</p>
<pre><code class="language-bash">gcloud container clusters get-credentials axion-tutorial-cluster --zone=us-central1-a

kubectl get nodes --label-columns=kubernetes.io/arch
</code></pre>
<p>The <code>get-credentials</code> command configures <code>kubectl</code> to authenticate with your new cluster. The <code>get nodes</code> command then lists all nodes and adds a column showing the <code>kubernetes.io/arch</code> label.</p>
<p>You should see something like:</p>
<pre><code class="language-plaintext">NAME                                    STATUS   ARCH    AGE
gke-...default-pool-abc...              Ready    amd64   15m
gke-...default-pool-def...              Ready    amd64   15m
gke-...axion-pool-jkl...                Ready    arm64   3m
gke-...axion-pool-mno...                Ready    arm64   3m
</code></pre>
<p><code>amd64</code> for the default x86 pool, <code>arm64</code> for our new ARM pool. The <code>kubernetes.io/arch</code> label is applied automatically to every node. You don't set it yourself; it's derived from the hardware.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/6389f4c6-17fe-4086-982f-39d94dbfa252.png" alt="Terminal output of kubectl get nodes with a ARCH column showing amd64 for two default-pool nodes and arm64 for two axion-pool nodes." style="display:block;margin:0 auto" width="2330" height="646" loading="lazy">

<h2 id="heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</h2>
<p>We have a multi-arch image and a mixed-architecture cluster. Here's something important to understand before writing the deployment manifest: <strong>Kubernetes doesn't know or care about image architecture by default</strong>.</p>
<p>If you applied a standard Deployment right now, the scheduler would look for any available node with enough CPU and memory and place pods there — potentially landing on x86 nodes instead of your ARM Axion nodes. The multi-arch Manifest List would handle this gracefully (the right binary would run regardless), but you'd lose the cost efficiency you provisioned Axion nodes for in the first place.</p>
<p>To guarantee that pods land on ARM nodes and only ARM nodes, we use a <code>nodeSelector</code>.</p>
<h3 id="heading-how-nodeselector-works">How nodeSelector Works</h3>
<p>A <code>nodeSelector</code> is a set of key-value pairs in your pod spec. Before the Kubernetes scheduler places a pod, it checks every available node's labels. If a node doesn't have all the labels in the <code>nodeSelector</code>, the scheduler skips it. If no node qualifies, the pod stays in <code>Pending</code> state rather than landing on the wrong node.</p>
<p>This is a hard constraint, which is exactly what we want for cost optimization. Contrast this with Node Affinity's soft preference mode (<code>preferredDuringSchedulingIgnoredDuringExecution</code>), which says "try to use ARM, but fall back to x86 if needed." Soft preferences are useful for resilience, but they undermine the whole point of dedicated ARM pools. We want the hard constraint.</p>
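<p>The matching rule itself is small enough to sketch in a few lines of Go. This is a simplified model, not the scheduler's actual code: <code>matches</code> and the two node names are invented for illustration, and the labels are the ones we have seen on our pools.</p>
<pre><code class="language-go">package main

import "fmt"

// matches reports whether a node's labels contain every key/value pair in the selector.
func matches(nodeLabels, selector map[string]string) bool {
    for key, want := range selector {
        if got, ok := nodeLabels[key]; !ok || got != want {
            return false
        }
    }
    return true
}

func main() {
    selector := map[string]string{"kubernetes.io/arch": "arm64"}

    // Hypothetical node names; the labels follow the pools created earlier.
    nodes := map[string]map[string]string{
        "default-pool-node": {"kubernetes.io/arch": "amd64"},
        "axion-pool-node":   {"kubernetes.io/arch": "arm64", "workload-type": "arm-optimized"},
    }

    for _, name := range []string{"default-pool-node", "axion-pool-node"} {
        fmt.Printf("%s eligible: %v\n", name, matches(nodes[name], selector))
    }
}
</code></pre>
<p>The x86 node is reported as not eligible and the ARM node as eligible. Extra labels on a node never hurt; only a missing or different value for a selected key rules the node out.</p>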
<h3 id="heading-write-the-deployment-manifest">Write the Deployment Manifest</h3>
<p>Create <code>k8s/deployment.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-axion
  labels:
    app: hello-axion
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-axion
  template:
    metadata:
      labels:
        app: hello-axion
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64

      containers:
      - name: hello-axion
        image: us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        resources:
          requests:
            cpu: "250m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your project ID. Here's what the key sections do:</p>
<p><code>replicas: 3</code> tells Kubernetes to keep three instances of this pod running at all times. If one crashes or a node goes down, Kubernetes starts a replacement. Our ARM pool has two nodes in a single zone, so the three replicas share those two nodes. Spreading pods across availability zones would require a regional cluster.</p>
<p><code>selector.matchLabels</code> and <code>template.metadata.labels</code> — these two blocks must match. The <code>selector</code> tells the Deployment which pods it "owns," and the <code>template.metadata.labels</code> is what those pods will be tagged with. If they don't match, Kubernetes won't be able to manage the pods.</p>
<p><code>nodeSelector: kubernetes.io/arch: arm64</code> — this is the pin. The Kubernetes scheduler filters out every node that doesn't carry this label before considering resource availability. Since GKE automatically applies <code>kubernetes.io/arch=arm64</code> to all ARM nodes, our pods will schedule only onto the <code>axion-pool</code> nodes.</p>
<p><code>livenessProbe</code> — periodically calls <code>GET /healthz</code>. If this check fails a certain number of times in a row (indicating the container has deadlocked or is otherwise unresponsive), Kubernetes restarts the container. <code>initialDelaySeconds: 5</code> gives the server 5 seconds to start up before the first check.</p>
<p><code>readinessProbe</code> — similar to the liveness probe, but with a different purpose. While the readiness probe is failing, Kubernetes removes the pod from the service's load balancer, so no traffic is sent to it. This is important during startup — the pod won't receive traffic until it signals it's ready.</p>
<p><code>resources.requests</code> — reserves <code>250m</code> (25% of a CPU core) and <code>64Mi</code> of memory on the node for this pod. The scheduler uses these numbers to decide whether a node has enough room for the pod. Setting requests is required for sensible bin-packing. Without them, nodes can be silently overcommitted.</p>
<p><code>resources.limits</code> — caps the container at <code>500m</code> CPU and <code>128Mi</code> memory. If the container exceeds these limits, Kubernetes throttles the CPU or kills the container (for memory). This prevents a single misbehaving pod from starving other workloads on the same node.</p>
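<p>A quick back-of-the-envelope calculation shows what these numbers add up to. The Go sketch below multiplies the per-pod values from the manifest by the replica count. The allocatable CPU figure for a node is a made-up round number for illustration only; the real value is lower than the machine's full 2000m and appears in the output of <code>kubectl describe node</code>.</p>
<pre><code class="language-go">package main

import "fmt"

// Values copied from the Deployment manifest above.
const (
    replicas     = 3
    cpuRequestM  = 250 // millicores per pod
    cpuLimitM    = 500 // millicores per pod
    memRequestMi = 64
    memLimitMi   = 128
)

// Hypothetical allocatable CPU for one 2-vCPU node, in millicores.
const nodeAllocatableCPUM = 1900

// total multiplies a per-pod quantity by the number of pods.
func total(perPod, count int) int { return perPod * count }

// podsPerNode returns how many pods fit on one node when only CPU requests are considered.
func podsPerNode(allocatableM, requestM int) int { return allocatableM / requestM }

func main() {
    fmt.Printf("CPU requested by the Deployment: %dm\n", total(cpuRequestM, replicas))
    fmt.Printf("CPU limit across the Deployment: %dm\n", total(cpuLimitM, replicas))
    fmt.Printf("Memory requested: %dMi, memory limit: %dMi\n",
        total(memRequestMi, replicas), total(memLimitMi, replicas))
    fmt.Printf("Pods per node by CPU request alone: %d\n",
        podsPerNode(nodeAllocatableCPUM, cpuRequestM))
}
</code></pre>
<p>With three replicas the Deployment reserves 750m of CPU and 192Mi of memory in total, and could burst up to 1500m and 384Mi. Under the assumed allocatable figure, seven such pods would fit on one node by CPU request alone, before counting system pods.</p>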
<h3 id="heading-a-note-on-taints-and-tolerations">A Note on Taints and Tolerations</h3>
<p>Once you're comfortable with <code>nodeSelector</code>, the next step in production clusters is adding a <strong>taint</strong> to your ARM node pool. A taint is a repellent — any pod without an explicit <strong>toleration</strong> for that taint is blocked from landing on the tainted node.</p>
<p>This means other workloads in your cluster can't accidentally consume your ARM capacity. You'd add the taint when creating the pool:</p>
<pre><code class="language-bash"># Add --node-taints to the pool creation command:
--node-taints=workload-type=arm-optimized:NoSchedule
</code></pre>
<p>And a matching toleration in the pod spec:</p>
<pre><code class="language-yaml">tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "arm-optimized"
  effect: "NoSchedule"
</code></pre>
<p>We're not doing this in the tutorial to keep things simple, but it's the pattern production multi-tenant clusters use to enforce hard separation between workload types.</p>
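<p>For intuition, here is a simplified Go model of how a taint and a toleration interact. It handles only the <code>Equal</code> operator shown above and ignores the other operators and edge cases that real Kubernetes supports. The type and function names are illustrative, not Kubernetes APIs.</p>
<pre><code class="language-go">package main

import "fmt"

type taint struct{ Key, Value, Effect string }

type toleration struct{ Key, Operator, Value, Effect string }

// tolerates covers only the "Equal" operator: key, value, and effect must all match.
func tolerates(tol toleration, t taint) bool {
    if tol.Operator != "Equal" {
        return false // other operators are left out of this sketch
    }
    return taint{Key: tol.Key, Value: tol.Value, Effect: tol.Effect} == t
}

// schedulable reports whether every taint on the node is tolerated by the pod.
func schedulable(tolerations []toleration, taints []taint) bool {
    for _, t := range taints {
        tolerated := false
        for _, tol := range tolerations {
            if tolerates(tol, t) {
                tolerated = true
                break
            }
        }
        if !tolerated {
            return false
        }
    }
    return true
}

func main() {
    armTaints := []taint{{Key: "workload-type", Value: "arm-optimized", Effect: "NoSchedule"}}
    ours := []toleration{{Key: "workload-type", Operator: "Equal", Value: "arm-optimized", Effect: "NoSchedule"}}

    fmt.Println("pod without toleration:", schedulable(nil, armTaints))
    fmt.Println("pod with toleration:", schedulable(ours, armTaints))
}
</code></pre>
<p>The first pod is rejected and the second is allowed. Note that a toleration only permits a pod onto the tainted nodes. It does not steer the pod there, so you would still keep the <code>nodeSelector</code> alongside it.</p>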
<h3 id="heading-write-the-service-manifest">Write the Service Manifest</h3>
<p>We also need a Kubernetes Service to expose the pods over the network. Create <code>k8s/service.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: hello-axion-svc
spec:
  selector:
    app: hello-axion
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
</code></pre>
<ul>
<li><p><code>selector: app: hello-axion</code> — the Service discovers pods using labels. Any pod with <code>app: hello-axion</code> on it will be added to this Service's load balancer pool.</p>
</li>
<li><p><code>port: 80</code> — the port the Service is reachable on from outside the cluster.</p>
</li>
<li><p><code>targetPort: 8080</code> — the port on the pod that traffic gets forwarded to. Our Go server listens on port 8080, so this must match.</p>
</li>
<li><p><code>type: LoadBalancer</code> — tells GKE to provision an external Google Cloud load balancer and assign it a public IP. This is what makes the Service reachable from the internet.</p>
</li>
</ul>
<h3 id="heading-apply-both-manifests">Apply Both Manifests</h3>
<pre><code class="language-bash">kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
</code></pre>
<p><code>kubectl apply</code> reads each manifest file and creates or updates the resources described in it. If the resources don't exist yet, they're created. If they already exist, Kubernetes only applies the diff — it won't restart pods unnecessarily.</p>
<p>Watch the pods come up in real time:</p>
<pre><code class="language-bash">kubectl get pods -w
</code></pre>
<p>The <code>-w</code> flag watches for changes and prints updates as they happen. You should see pods transition from <code>Pending</code> → <code>ContainerCreating</code> → <code>Running</code>. Once all three show <code>Running</code>, press <code>Ctrl+C</code> to stop watching.</p>
<h2 id="heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</h2>
<p>Everything is running. Now we need evidence — not just that pods are up, but that they're on the right nodes and serving the right binary.</p>
<h3 id="heading-confirm-pod-placement">Confirm Pod Placement</h3>
<pre><code class="language-bash">kubectl get pods -o wide
</code></pre>
<p>The <code>-o wide</code> flag adds extra columns to the output, including the name of the node each pod was scheduled on. Look at the <code>NODE</code> column:</p>
<pre><code class="language-plaintext">NAME                          READY   STATUS    NODE
hello-axion-7b8d9f-abc12      1/1     Running   gke-axion-tutorial-axion-pool-a-...
hello-axion-7b8d9f-def34      1/1     Running   gke-axion-tutorial-axion-pool-b-...
hello-axion-7b8d9f-ghi56      1/1     Running   gke-axion-tutorial-axion-pool-c-...
</code></pre>
<p>All three pods should show node names containing <code>axion-pool</code>. None should show <code>default-pool</code>.</p>
<h3 id="heading-confirm-the-nodes-are-arm">Confirm the Nodes Are ARM</h3>
<p>Take one of those node names and verify its architecture label:</p>
<pre><code class="language-bash">kubectl get node NODE_NAME --show-labels | grep kubernetes.io/arch
</code></pre>
<p>Replace <code>NODE_NAME</code> with one of the node names from the previous command. You should see:</p>
<pre><code class="language-plaintext">kubernetes.io/arch=arm64
</code></pre>
<p>That's the automatic label GKE applied when it provisioned the ARM hardware. Our <code>nodeSelector</code> matched on this label to pin the pods here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/815312ea-e2bf-4106-863e-55cd0bdad5f7.png" alt="Terminal split into two sections: the top showing kubectl get pods -o wide with all pods scheduled on nodes containing axion-pool in the name, and the bottom showing kubectl get node with kubernetes.io/arch=arm64 in the labels output." style="display:block;margin:0 auto" width="2848" height="1500" loading="lazy">

<h3 id="heading-ask-the-application-itself">Ask the Application Itself</h3>
<p>This is the most satisfying verification step. Our Go server reports the architecture of the binary that's running. Let's ask it directly.</p>
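<p>Use <code>kubectl port-forward</code> to create a secure tunnel from port 8080 on your local machine to port 8080 on one of the Deployment's pods. When you target a Deployment, <code>kubectl</code> picks a single pod and forwards to it. It doesn't load-balance across replicas:</p>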
<pre><code class="language-bash">kubectl port-forward deployment/hello-axion 8080:8080
</code></pre>
<p>This command stays running in the foreground — open a <strong>second terminal window</strong> and run:</p>
<pre><code class="language-bash">curl http://localhost:8080
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">Hello from freeCodeCamp!
Architecture : arm64
OS           : linux
Pod hostname : hello-axion-7b8d9f-abc12
</code></pre>
<p><code>Architecture : arm64</code>. That's our Go binary confirming that it was compiled for ARM64 and is executing on an ARM64 CPU. The single image tag we built does the right thing automatically.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/114ff82d-950f-4059-a1fa-89baffb90b6c.png" alt="Terminal output of curl http://localhost:8080 showing the four-line response: Hello from freeCodeCamp, Architecture: arm64, OS: linux, and the pod hostname." style="display:block;margin:0 auto" width="1042" height="292" loading="lazy">

<h3 id="heading-the-bonus-see-the-manifest-list-in-action">The Bonus: See the Manifest List in Action</h3>
<p>Want to see the multi-arch image indexing at work? Stop the port-forward, then run:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>You'll see four entries in the manifest list. Two are real images — <code>Platform: linux/amd64</code> and <code>Platform: linux/arm64</code>. The other two show <code>Platform: unknown/unknown</code> with an <code>attestation-manifest</code> annotation. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to every image — a supply chain security feature (SLSA attestation) that proves how and where the image was built.</p>
<p>You may notice that if you check the image digest recorded in a running pod:</p>
<pre><code class="language-bash">kubectl get pod POD_NAME \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
</code></pre>
<p>Replace <code>POD_NAME</code> with one of the pod names from earlier.</p>
<p>The digest returned matches the <strong>top-level manifest list digest</strong>, not the <code>arm64</code>-specific one. This is expected behaviour. Modern Kubernetes (using containerd) records the manifest list digest, not the resolved platform digest. The platform resolution already happened when the node pulled the correct image variant.</p>
<p>The definitive proof that the right binary is running is what you already have: the node labeled <code>kubernetes.io/arch=arm64</code> and the application reporting <code>Architecture: arm64</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/7dffe0c8-28cf-4a5d-8459-1e8db3da7dc0.png" alt="top-level manifest list digest" style="display:block;margin:0 auto" width="2302" height="1000" loading="lazy">

<h2 id="heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</h2>
<p>The hands-on work is done. Let's talk about why any of this is worth the effort.</p>
<h3 id="heading-the-cost-math">The Cost Math</h3>
<p>At the time of writing, here's how ARM compares to equivalent x86 machines on Google Cloud (prices are approximate and change over time — check the <a href="https://cloud.google.com/compute/vm-instance-pricing">official pricing page</a> before making decisions):</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>vCPU</th>
<th>Memory</th>
<th>Approx. $/hour</th>
</tr>
</thead>
<tbody><tr>
<td><code>n2-standard-4</code> (x86)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.19</td>
</tr>
<tr>
<td><code>t2a-standard-4</code> (Tau ARM)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.14</td>
</tr>
<tr>
<td><code>c4a-standard-4</code> (Axion)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.15</td>
</tr>
</tbody></table>
<p>That's a raw reduction of roughly 21–26% in compute cost per node. Factor in Google's published claim of up to 65% better price-performance for Axion on relevant workloads (which means you may need fewer nodes to handle the same traffic) and the savings compound further.</p>
<p>Here's how that looks at scale, for a service running 20 nodes continuously for a year:</p>
<ul>
<li><p>20 × <code>n2-standard-4</code> × $0.19 × 8,760 hours = <strong>$33,288/year</strong></p>
</li>
<li><p>20 × <code>t2a-standard-4</code> × $0.14 × 8,760 hours = <strong>$24,528/year</strong></p>
</li>
</ul>
<p>That's roughly <strong>$8,760 saved annually</strong> on compute, before committed use discounts (which further widen the gap).</p>
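<p>Here is the same arithmetic as a small script, in case you want to plug in your own node count or current prices. The hourly figures are the approximate ones from the table above, not live pricing, and <code>annual_cost</code> is just an illustrative helper:</p>
<pre><code class="language-bash"># Approximate hourly prices from the table (assumed, not live pricing)
nodes=20
hours_per_year=8760

annual_cost() {
  # $1 = hourly price per node in USD
  awk -v n="$nodes" -v h="$hours_per_year" -v p="$1" 'BEGIN { printf "%.0f", n * h * p }'
}

x86=$(annual_cost 0.19)
arm=$(annual_cost 0.14)

echo "n2-standard-4:  \$$x86/year"
echo "t2a-standard-4: \$$arm/year"
echo "difference:     \$$(( x86 - arm ))/year"
</code></pre>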
<h3 id="heading-when-arm-is-the-right-choice">When ARM Is the Right Choice</h3>
<p>ARM works best for:</p>
<ul>
<li><p><strong>Stateless API servers and web applications</strong> — like the app we built. ARM excels at high-throughput, low-latency network workloads.</p>
</li>
<li><p><strong>Background workers and queue processors</strong> — long-running services that don't depend on x86-specific binaries.</p>
</li>
<li><p><strong>Microservices written in Go, Rust, or Python</strong> — these languages have excellent ARM64 support. Go and Rust cross-compile to ARM64 readily, and pure-Python code runs unchanged (packages with native extensions need ARM64 wheels).</p>
</li>
</ul>
<h3 id="heading-when-to-proceed-carefully">When to Proceed Carefully</h3>
<ul>
<li><p><strong>Native library dependencies</strong> — some older C libraries, proprietary SDKs, or compiled ML model-serving runtimes don't have ARM64 builds. Always audit your dependency tree before migrating.</p>
</li>
<li><p><strong>CI pipelines need ARM too</strong> — your automated tests should run on ARM, not just x86. An image that silently fails only on ARM is harder to debug than one that never claimed ARM support.</p>
</li>
<li><p><strong>Profile before optimizing</strong> — the cost savings are real, but measure your actual workload behavior on ARM before committing. Not every workload benefits equally.</p>
</li>
</ul>
<h2 id="heading-cleanup">Cleanup</h2>
<p>When you're done, clean up to avoid ongoing charges:</p>
<pre><code class="language-bash"># Remove the Kubernetes resources from the cluster
kubectl delete -f k8s/

# Delete the ARM node pool
gcloud container node-pools delete axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the cluster itself
gcloud container clusters delete axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the images from Artifact Registry (optional — storage costs are minimal)
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Let's recap what you built and why each part matters.</p>
<p>You started with a Go application, a Dockerfile, and a <code>docker buildx build</code> command that produced two images — one for x86, one for ARM64 — wrapped in a single Manifest List tag. Any server that pulls that tag gets the right binary automatically, without you maintaining separate pipelines or separate tags.</p>
<p>You provisioned a GKE cluster with two node pools running different CPU architectures, then used <code>nodeSelector</code> to make sure your ARM-optimized workload lands only on the ARM Axion nodes — not on x86 by accident. The result is a deployment that's both architecture-correct and cost-efficient.</p>
<p>The patterns you practiced here don't stop at this demo. The same Dockerfile technique works for any language with cross-compilation support. The same <code>nodeSelector</code> approach works for any workload you want to pin to ARM. As more teams migrate services to ARM over the coming years, having these skills will be a real asset.</p>
<p><strong>Where to go from here:</strong></p>
<ul>
<li><p>Add a GitHub Actions workflow that runs <code>docker buildx build --platform linux/amd64,linux/arm64</code> on every push, automating this entire process in CI.</p>
</li>
<li><p>Audit one of your existing stateless services for ARM compatibility and try migrating it.</p>
</li>
<li><p>Explore <strong>Node Affinity</strong> as a softer alternative to <code>nodeSelector</code> for workloads that can run on either architecture but prefer ARM.</p>
</li>
<li><p>Look into <strong>GKE Autopilot</strong>, which now supports ARM nodes and handles node pool management automatically.</p>
</li>
</ul>
<p>Happy building.</p>
<h2 id="heading-project-file-structure">Project File Structure</h2>
<pre><code class="language-plaintext">hello-axion/
├── app/
│   ├── main.go          — Go HTTP server
│   ├── go.mod           — Go module definition
│   └── Dockerfile       — Multi-stage Dockerfile
└── k8s/
    ├── deployment.yaml  — Deployment with nodeSelector and probes
    └── service.yaml     — LoadBalancer Service
</code></pre>
<p>All source files for this tutorial are available in the companion GitHub repository: <a href="https://github.com/Amiynarh/multi-arch-docker-gke-arm">https://github.com/Amiynarh/multi-arch-docker-gke-arm</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Authenticate Users in Kubernetes: x509 Certificates, OIDC, and Cloud Identity ]]>
                </title>
                <description>
                    <![CDATA[ Kubernetes doesn't know who you are. It has no user database, no built-in login system, no password file. When you run kubectl get pods, Kubernetes receives an HTTP request and asks one question: who  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-authenticate-users-in-kubernetes-x509-certificates-oidc-and-cloud-identity/</link>
                <guid isPermaLink="false">69d4182f40c9cabf4484dbdb</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ authentication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:31:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36356282-0cfb-43a8-8461-84f20e64b041.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Kubernetes doesn't know who you are.</p>
<p>It has no user database, no built-in login system, no password file. When you run <code>kubectl get pods</code>, Kubernetes receives an HTTP request and asks one question: who signed this, and do I trust that signature? Everything else — what you're allowed to do, which namespaces you can access, whether your request goes through at all — comes after that question is answered.</p>
<p>This surprises most engineers who are new to Kubernetes. They expect something like a database of users with passwords. Instead, they find a pluggable chain of authenticators, each one able to vouch for a request in a different way:</p>
<ul>
<li><p>Client certificates</p>
</li>
<li><p>OIDC tokens from an external identity provider</p>
</li>
<li><p>Cloud provider IAM tokens</p>
</li>
<li><p>Service account tokens projected into pods</p>
</li>
</ul>
<p>Any of these can be active at the same time.</p>
<p>Understanding this model is what separates engineers who can debug authentication failures from engineers who copy kubeconfig files and hope for the best.</p>
<p>In this article, you'll work through how the Kubernetes authentication chain works from first principles. You'll see how x509 client certificates are used — and why they're a poor choice for human users in production. You'll configure OIDC authentication with Dex, giving your cluster a real browser-based login flow. And you'll see how AWS, GCP, and Azure each plug into the same underlying model.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A running kind cluster — a fresh one works fine, or reuse an existing one</p>
</li>
<li><p><code>kubectl</code> and <code>helm</code> installed</p>
</li>
<li><p><code>openssl</code> available on your machine (comes pre-installed on macOS and most Linux distros)</p>
</li>
<li><p>Basic familiarity with what a JWT is (a signed JSON object with claims) — you don't need to be able to write one, just recognise one</p>
</li>
</ul>
<p>All demo files are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</a></p>
<ul>
<li><p><a href="#heading-the-authenticator-chain">The Authenticator Chain</a></p>
</li>
<li><p><a href="#heading-users-vs-service-accounts">Users vs Service Accounts</a></p>
</li>
<li><p><a href="#heading-what-happens-after-authentication">What Happens After Authentication</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</a></p>
<ul>
<li><p><a href="#heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</a></p>
</li>
<li><p><a href="#heading-the-cluster-ca">The Cluster CA</a></p>
</li>
<li><p><a href="#heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</a></p>
<ul>
<li><p><a href="#heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</a></p>
</li>
<li><p><a href="#heading-the-api-server-configuration">The API Server Configuration</a></p>
</li>
<li><p><a href="#heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</a></p>
</li>
<li><p><a href="#heading-how-kubelogin-works">How kubelogin Works</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2--configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</a></p>
</li>
<li><p><a href="#heading-cloud-provider-authentication">Cloud Provider Authentication</a></p>
<ul>
<li><p><a href="#heading-aws-eks">AWS EKS</a></p>
</li>
<li><p><a href="#heading-google-gke">Google GKE</a></p>
</li>
<li><p><a href="#heading-azure-aks">Azure AKS</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-webhook-token-authentication">Webhook Token Authentication</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</h2>
<p>Every request that reaches the Kubernetes API server — whether from <code>kubectl</code>, a pod, a controller, or a CI pipeline — carries a credential of some kind.</p>
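<p>The API server passes that credential through a chain of authenticators in sequence. The first authenticator that can verify the credential wins. If a credential is presented and none can verify it, the request is rejected with <code>401 Unauthorized</code>. If no credential is presented at all, the request is treated as anonymous, provided anonymous access is enabled.</p>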
<h3 id="heading-the-authenticator-chain">The Authenticator Chain</h3>
<p>Kubernetes supports several authentication strategies simultaneously. You can have client certificate authentication and OIDC authentication active on the same cluster at the same time, which is common in production: cluster administrators use certificates, regular developers use OIDC. The strategies active on a cluster are determined by flags passed to the <code>kube-apiserver</code> process.</p>
<p>The strategies available are x509 client certificates, bearer tokens (static token files — rarely used in production), bootstrap tokens (used during node join operations), service account tokens, OIDC tokens, authenticating proxies, and webhook token authentication. A cluster doesn't have to use all of them, and most don't. But knowing they all exist helps when you're diagnosing an auth failure.</p>
<h3 id="heading-users-vs-service-accounts">Users vs Service Accounts</h3>
<p>There is an important distinction in how Kubernetes thinks about identity. Service accounts are Kubernetes objects — they live in a namespace, get created with <code>kubectl create serviceaccount</code>, and have tokens managed by the cluster itself. Every pod runs as a service account. These are machine identities for workloads.</p>
<p>Users, on the other hand, don't exist as Kubernetes objects at all. There is no <code>kubectl create user</code> command. Kubernetes doesn't manage user accounts. Instead, it trusts external systems to assert user identity — a certificate authority, an OIDC provider, or a cloud provider's IAM system. Kubernetes just verifies the assertion and extracts the username and group memberships from it.</p>
<table>
<thead>
<tr>
<th></th>
<th>Service Account</th>
<th>User</th>
</tr>
</thead>
<tbody><tr>
<td>Kubernetes object?</td>
<td>Yes — lives in a namespace</td>
<td>No — managed externally</td>
</tr>
<tr>
<td>Created with</td>
<td><code>kubectl create serviceaccount</code></td>
<td>External system (CA, IdP, cloud IAM)</td>
</tr>
<tr>
<td>Used by</td>
<td>Pods and workloads</td>
<td>Humans and CI systems</td>
</tr>
<tr>
<td>Token managed by</td>
<td>Kubernetes</td>
<td>External system</td>
</tr>
<tr>
<td>Namespaced?</td>
<td>Yes</td>
<td>No</td>
</tr>
</tbody></table>
<h3 id="heading-what-happens-after-authentication">What Happens After Authentication</h3>
<p>Authentication only answers one question: who is this? Once the API server has a verified identity — a username and zero or more group memberships — it passes the request to the authorisation layer. By default that is RBAC, which checks the identity against Role and ClusterRole bindings to determine what the request is allowed to do.</p>
<p>This is why authentication and authorisation are separate concerns in Kubernetes. A valid certificate gets you past the front door. What you can do inside is RBAC's job. An authenticated user with no RBAC bindings can authenticate successfully but will be denied every API call.</p>
<p>If you want a deep dive into how RBAC rules, roles, and bindings work, check out this handbook on <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a>.</p>
<h2 id="heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</h2>
<p>x509 client certificate authentication is the oldest and simplest authentication method in Kubernetes. It's how <code>kubectl</code> works out of the box when you create a cluster — the kubeconfig file that <code>kind</code> or <code>kubeadm</code> generates contains an embedded client certificate signed by the cluster's Certificate Authority.</p>
<h3 id="heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</h3>
<p>When the API server receives a request with a client certificate, it validates the certificate against its trusted CA, then reads two fields (the Common Name and the Organization) from the certificate to construct an identity.</p>
<p>The <strong>Common Name (CN)</strong> field becomes the username. The <strong>Organization (O)</strong> field, which can contain multiple values, becomes the list of groups the user belongs to.</p>
<p>So a certificate with <code>CN=jane</code> and <code>O=engineering</code> authenticates as username <code>jane</code> in group <code>engineering</code>. If you want to give <code>jane</code> permissions, you create a RoleBinding that references either the username <code>jane</code> or the group <code>engineering</code> as a subject.</p>
<p>This is the same mechanism behind <code>system:masters</code>. When <code>kind</code> creates a cluster and writes a kubeconfig for you, it generates a certificate with <code>O=system:masters</code>. Kubernetes has a built-in ClusterRoleBinding that grants <code>cluster-admin</code> to anyone in the <code>system:masters</code> group. That's why your default kubeconfig has full admin access — it's not magic, it's a certificate with the right group.</p>
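<p>Because the Organization field can repeat, one certificate can carry several groups. A minimal sketch, using a throwaway self-signed certificate and illustrative names (<code>jane</code>, <code>engineering</code>, <code>platform</code>), shows what such a subject looks like:</p>
<pre><code class="language-bash"># Throwaway self-signed certificate, used only to inspect the subject
workdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$workdir/demo.key" \
  -out "$workdir/demo.crt" \
  -days 1 \
  -subj "/CN=jane/O=engineering/O=platform" 2>/dev/null

# RFC2253 formatting prints the subject without version-dependent spacing
subject=$(openssl x509 -in "$workdir/demo.crt" -noout -subject -nameopt RFC2253)
echo "$subject"
</code></pre>
<p>The API server would read a subject like this as username <code>jane</code> in the groups <code>engineering</code> and <code>platform</code>. A self-signed certificate like this one would not be accepted by a cluster, since it isn't signed by the cluster CA. It only illustrates the field mapping.</p>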
<h3 id="heading-the-cluster-ca">The Cluster CA</h3>
<p>Every Kubernetes cluster has a root Certificate Authority — a private key and a self-signed certificate that the API server trusts. Any client certificate signed by this CA is trusted by the cluster.</p>
<p>The CA certificate and key are typically stored in <code>/etc/kubernetes/pki/</code> on the control plane node, or in the <code>kube-system</code> namespace as a secret, depending on how the cluster was created.</p>
<p>On kind clusters, you can copy the CA cert and key directly from the control plane container:</p>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>Whoever holds the CA key can issue certificates for any username and any group, including <code>system:masters</code>. This makes the CA key the most sensitive secret in a Kubernetes cluster. Guard it accordingly.</p>
<h3 id="heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</h3>
<p>Client certificates work, but they have two fundamental problems that make them a poor choice for human users in production.</p>
<p>The first is that <strong>Kubernetes doesn't check certificate revocation lists (CRLs)</strong>. If a developer's kubeconfig is stolen, the embedded certificate remains valid until it expires — which is typically one year in most Kubernetes setups. There's no way to immediately invalidate it. You can't "log out" a certificate. The only mitigation is to rotate the entire cluster CA, which invalidates every certificate including those belonging to other legitimate users.</p>
<p>The second is <strong>operational overhead</strong>. Certificates must be generated, distributed to users, and rotated before expiry. There's no self-service. In a team of ten engineers, managing certificates is annoying. In a team of a hundred, it's a full-time job.</p>
<p>For human access in production, OIDC is the right answer: short-lived tokens issued by a trusted identity provider, with a central revocation mechanism, and a standard browser-based login flow. Certificates are fine for automation and other machine clients, where issuance and rotation can be handled programmatically.</p>
<p>That said, understanding certificates isn't optional. Your kubeconfig uses one. Your CI system probably does too. And cert-based auth is what you fall back to when everything else breaks.</p>
<h2 id="heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</h2>
<p>In this section, you'll generate a user certificate signed by the cluster CA, bind it to an RBAC role, and use it to authenticate to the cluster as a different user.</p>
<p><strong>This guide is for local development and learning only.</strong> Manually signing certificates with the cluster CA and storing keys on disk is done here for simplicity.</p>
<p>In production, you should use the Kubernetes CertificateSigningRequest API or cert-manager for certificate issuance, enforce short-lived certificates with automatic rotation, and store private keys in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or hardware security module (HSM) — never distribute the cluster CA key.</p>
<h3 id="heading-step-1-copy-the-ca-cert-and-key-from-the-kind-control-plane">Step 1: Copy the CA cert and key from the kind control plane</h3>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>This will create two files in your current directory called <code>ca.crt</code> and <code>ca.key</code>.</p>
<h3 id="heading-step-2-generate-a-private-key-and-csr-for-a-new-user">Step 2: Generate a private key and CSR for a new user</h3>
<p>You're creating a certificate for a user named <code>jane</code> in the <code>engineering</code> group:</p>
<pre><code class="language-bash"># Generate the private key
openssl genrsa -out jane.key 2048

# Generate a Certificate Signing Request
# CN = username, O = group
openssl req -new \
  -key jane.key \
  -out jane.csr \
  -subj "/CN=jane/O=engineering"
</code></pre>
<h3 id="heading-step-3-sign-the-csr-with-the-cluster-ca">Step 3: Sign the CSR with the cluster CA</h3>
<pre><code class="language-bash">openssl x509 -req \
  -in jane.csr \
  -CA ca.crt \
  -CAkey ca.key \
  -CAcreateserial \
  -out jane.crt \
  -days 365
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Certificate request self-signature ok
subject=CN=jane, O=engineering
</code></pre>
<h3 id="heading-step-4-inspect-the-certificate">Step 4: Inspect the certificate</h3>
<p>Before using it, confirm the identity it carries:</p>
<pre><code class="language-bash">openssl x509 -in jane.crt -noout -subject -dates
</code></pre>
<pre><code class="language-plaintext">subject=CN=jane, O=engineering
notBefore=Mar 20 10:00:00 2024 GMT
notAfter=Mar 20 10:00:00 2025 GMT
</code></pre>
<p>One year from now, this certificate becomes invalid and must be replaced. There's no way to extend it — you have to issue a new one.</p>
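<p>Since expiry is the only thing that ends a certificate's life, it can help to check how much time is left before it catches you out. A small sketch using <code>openssl x509 -checkend</code>; the helper name <code>expires_within_days</code> and the 30-day threshold are assumptions for illustration:</p>
<pre><code class="language-bash"># Succeeds (exit 0) if the certificate expires within the given number of days.
# expires_within_days is a hypothetical helper name, not part of openssl or kubectl.
# An unreadable file also counts as expiring, which errs on the safe side.
expires_within_days() {
  cert_file=$1
  days=$2
  # -checkend exits 0 when the certificate is still valid N seconds from now
  if openssl x509 -in "$cert_file" -noout -checkend "$(( days * 86400 ))" > /dev/null 2>/dev/null; then
    return 1
  fi
  return 0
}

# Example: flag jane's certificate once it has under 30 days left
if [ -f jane.crt ]; then
  if expires_within_days jane.crt 30; then
    echo "jane.crt expires within 30 days: issue a replacement"
  else
    echo "jane.crt is valid for at least 30 more days"
  fi
fi
</code></pre>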
<h3 id="heading-step-5-build-a-kubeconfig-entry-for-jane">Step 5: Build a kubeconfig entry for jane</h3>
<pre><code class="language-bash"># Get the cluster API server address from the current context
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Create a kubeconfig for jane
kubectl config set-cluster k8s-security \
  --server=$APISERVER \
  --certificate-authority=ca.crt \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-credentials jane \
  --client-certificate=jane.crt \
  --client-key=jane.key \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-context jane@k8s-security \
  --cluster=k8s-security \
  --user=jane \
  --kubeconfig=jane.kubeconfig

kubectl config use-context jane@k8s-security \
  --kubeconfig=jane.kubeconfig
</code></pre>
<h3 id="heading-step-6-test-authentication-before-rbac">Step 6: Test authentication — before RBAC</h3>
<p>Try to list pods using jane's kubeconfig:</p>
<pre><code class="language-bash">kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">Error from server (Forbidden): pods is forbidden: User "jane" cannot list
resource "pods" in API group "" in the namespace "staging"
</code></pre>
<p>This is correct. Jane authenticated successfully — Kubernetes knows who she is. But she has no RBAC bindings, so every API call is denied. Authentication passed, but authorisation failed.</p>
<h3 id="heading-step-7-grant-jane-access-with-rbac">Step 7: Grant jane access with RBAC</h3>
<p>RBAC bindings use the username exactly as it appears in the certificate's CN field. If you need a refresher on how Roles, ClusterRoles, and RoleBindings work, the handbook <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a> covers the full RBAC model. For now, a simple RoleBinding using the built-in <code>view</code> ClusterRole is enough:</p>
<pre><code class="language-yaml"># jane-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-reader
  namespace: staging
subjects:
  - kind: User
    name: jane          # matches the CN in the certificate
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f jane-rolebinding.yaml
kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">No resources found in staging namespace.
</code></pre>
<p>No error — jane can now list pods in <code>staging</code>. She can't delete them, create them, or access other namespaces. The certificate got her in. RBAC determines what she can do.</p>
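<p>To make that boundary visible, you could ask the API server directly what jane is allowed to do. The following is a minimal sketch that assumes the <code>jane.kubeconfig</code> file and the <code>staging</code> namespace from the steps above. The function name <code>jane_access</code> is only illustrative:</p>
<pre><code class="language-bash"># Print one "verb: yes/no" line per verb for pods in the staging namespace.
# Assumes jane.kubeconfig from Step 5; jane_access is a hypothetical helper name.
jane_access() {
  local verb answer
  for verb in get list create delete; do
    # "kubectl auth can-i" prints yes or no and exits non-zero on "no"
    answer=$(kubectl auth can-i "$verb" pods -n staging --kubeconfig=jane.kubeconfig || true)
    echo "$verb: $answer"
  done
}
</code></pre>
<p>Calling <code>jane_access</code> after applying the RoleBinding should report <code>yes</code> for the read verbs and <code>no</code> for <code>create</code> and <code>delete</code>, since the <code>view</code> ClusterRole is read-only.</p>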
<h2 id="heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</h2>
<p>OpenID Connect is an identity layer on top of OAuth 2.0. It's how Kubernetes integrates with enterprise identity providers — Active Directory, Okta, Google Workspace, Keycloak, and any other provider that speaks OIDC. Understanding how Kubernetes uses it requires following the token from the user's browser to the API server's decision.</p>
<h3 id="heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</h3>
<p>When a developer runs <code>kubectl get pods</code> with OIDC configured, the following happens:</p>
<ol>
<li><p><code>kubectl</code> invokes <code>kubelogin</code>, the credential plugin configured in the kubeconfig, which checks its local cache for a valid, unexpired ID token</p>
</li>
<li><p>If there is no valid cached token, <code>kubelogin</code> opens a browser window</p>
</li>
<li><p>The browser redirects to the OIDC provider (Dex, Okta, your corporate IdP)</p>
</li>
<li><p>The user logs in with their corporate credentials</p>
</li>
<li><p>The OIDC provider issues a signed JWT and returns it to kubelogin</p>
</li>
<li><p>kubelogin caches the token locally (under <code>~/.kube/cache/oidc-login/</code>) and returns it to <code>kubectl</code></p>
</li>
<li><p><code>kubectl</code> sends the token to the API server in an <code>Authorization: Bearer</code> header</p>
</li>
<li><p>The API server fetches the provider's public keys from its JWKS endpoint and verifies the token signature</p>
</li>
<li><p>If valid, the API server extracts the username and group claims from the token</p>
</li>
<li><p>RBAC takes over from there</p>
</li>
</ol>
<p>The Kubernetes API server does not contact the OIDC provider on every request. It fetches the provider's public keys periodically and verifies signatures locally, which keeps OIDC authentication stateless and scalable.</p>
<h3 id="heading-the-api-server-configuration">The API Server Configuration</h3>
<p>For OIDC to work, the API server needs to know where to find the identity provider and how to interpret the tokens it issues.</p>
<p>In Kubernetes v1.30+, this is configured through an <code>AuthenticationConfiguration</code> file passed via the <code>--authentication-config</code> flag. (In older versions, individual <code>--oidc-*</code> flags were used instead, but these were removed in v1.35.)</p>
<p>The <code>AuthenticationConfiguration</code> defines OIDC providers under the <code>jwt</code> key:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it does</th>
<th>Example</th>
</tr>
</thead>
<tbody><tr>
<td><code>issuer.url</code></td>
<td>The OIDC provider's base URL — must match the <code>iss</code> claim in the token</td>
<td><code>https://dex.example.com</code></td>
</tr>
<tr>
<td><code>issuer.audiences</code></td>
<td>The client IDs the token was issued for — must match the <code>aud</code> claim</td>
<td><code>["kubernetes"]</code></td>
</tr>
<tr>
<td><code>issuer.certificateAuthority</code></td>
<td>CA certificate to trust when contacting the OIDC provider (inlined PEM)</td>
<td><code>-----BEGIN CERTIFICATE-----...</code></td>
</tr>
<tr>
<td><code>claimMappings.username.claim</code></td>
<td>Which JWT claim to use as the Kubernetes username</td>
<td><code>email</code></td>
</tr>
<tr>
<td><code>claimMappings.groups.claim</code></td>
<td>Which JWT claim to use as the Kubernetes group list</td>
<td><code>groups</code></td>
</tr>
<tr>
<td><code>claimMappings.*.prefix</code></td>
<td>Prefix added to the claim value — set to <code>""</code> for no prefix</td>
<td><code>""</code></td>
</tr>
</tbody></table>
<p>On a kind cluster, the <code>--authentication-config</code> flag is set in the cluster configuration before creation, not after. You'll see this in the next demo.</p>
<h3 id="heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</h3>
<p>A JWT is a signed JSON object with three sections: a header, a payload, and a signature. The payload is a set of claims – key-value pairs that assert facts about the token. Kubernetes reads specific claims from the payload to build an identity.</p>
<p>The required claims are <code>iss</code> (the issuer URL, must match <code>issuer.url</code> in the <code>AuthenticationConfiguration</code>), <code>sub</code> (the subject, a unique identifier for the user), and <code>aud</code> (the audience, must match the <code>issuer.audiences</code> list). The <code>exp</code> claim (expiry time) is also required as the API server rejects expired tokens.</p>
<p>The most useful optional claim is <code>groups</code> (or whatever you configure via <code>claimMappings.groups.claim</code>). When this claim is present, Kubernetes can map OIDC group memberships directly to RBAC group bindings. A user in the <code>platform-engineers</code> group in your identity provider automatically gets the RBAC permissions you've bound to that group in Kubernetes — no manual user management required.</p>
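<p>These claims are easy to inspect by hand, because the payload is just base64url-encoded JSON. The sketch below builds an unsigned sample token purely for illustration (the issuer, subject, and expiry values are made up), decodes its payload, and checks <code>exp</code> against the current time. It does not verify a signature, which the API server always does:</p>
<pre><code class="language-bash"># Decode the payload (second segment) of a JWT. No signature verification.
jwt_payload() {
  local segment
  # base64url uses "-" and "_" and drops padding; convert back to standard base64
  segment=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  while [ $(( ${#segment} % 4 )) -ne 0 ]; do
    segment="${segment}="
  done
  printf '%s' "$segment" | base64 -d
}

# Succeeds (exit 0) when the token's exp claim is in the past.
jwt_expired() {
  local exp
  exp=$(jwt_payload "$1" | sed -n 's/.*"exp":[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  [ "$(date +%s)" -ge "$exp" ]
}

# Build a sample token with made-up claims (header.payload.empty-signature)
b64url() { base64 | tr -d '=\n' | tr '/+' '_-'; }
claims='{"iss":"https://dex.example.com","sub":"user-123","aud":"kubernetes","exp":1,"groups":["platform-engineers"]}'
sample_jwt="$(printf '%s' '{"alg":"none"}' | b64url).$(printf '%s' "$claims" | b64url)."

jwt_payload "$sample_jwt"; echo
if jwt_expired "$sample_jwt"; then echo "token expired"; fi
</code></pre>
<p>The sample uses <code>"exp": 1</code> so that it always counts as expired. A real token from your provider would carry a timestamp a little in the future.</p>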
<h3 id="heading-how-kubelogin-works">How kubelogin Works</h3>
<p>kubelogin (also distributed as <code>kubectl oidc-login</code>) is a kubectl credential plugin. Instead of embedding a static certificate or token in your kubeconfig, you configure a credential plugin that runs a helper binary when <code>kubectl</code> needs a token.</p>
<p>When kubelogin is invoked, it checks its local token cache. If the cached token is still valid, it returns it immediately. If the token has expired, it initiates the OIDC authorization code flow: it opens a browser, redirects to the identity provider, receives the token after login, caches it locally, and returns it to <code>kubectl</code>. When the flow triggers, it usually adds only a few seconds beyond the time the user needs to log in.</p>
<p>This means tokens are short-lived (typically an hour) and rotate automatically. If a developer's machine is compromised, the token expires on its own. There is no long-lived credential sitting in a file somewhere.</p>
<h2 id="heading-demo-2-configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</h2>
<p>In this section, you'll deploy Dex as a self-hosted OIDC provider, configure a kind cluster to trust it, and log in with a browser. Dex is a good demo vehicle because it runs inside the cluster and doesn't require a cloud account or an external service.</p>
<p><strong>This guide is for local development and learning only.</strong> Self-signed certificates, static passwords, and certs stored on disk are used here for simplicity.</p>
<p>In production, use a managed identity provider (Azure Entra ID, Google Workspace, Okta), automate certificate lifecycle with cert-manager, and store secrets in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or inject them via CSI driver — never commit or store certs as local files.</p>
<h3 id="heading-step-1-create-a-kind-cluster-with-oidc-authentication">Step 1: Create a kind cluster with OIDC authentication</h3>
<p>OIDC authentication for the API server must be configured at cluster creation time on Kind because the API server needs to know which identity provider to trust before it starts accepting requests.</p>
<p><strong>Note:</strong> Kubernetes v1.30+ deprecated the <code>--oidc-*</code> API server flags in favor of the structured <code>AuthenticationConfiguration</code> API (via <code>--authentication-config</code>). In v1.35+ the old flags are removed entirely. This guide uses the new approach.</p>
<p><strong>nip.io</strong> is a wildcard DNS service — <code>dex.127.0.0.1.nip.io</code> resolves to <code>127.0.0.1</code>. This lets us use a real hostname for TLS without editing <code>/etc/hosts</code>.</p>
<p>First, generate a self-signed CA and TLS certificate for Dex:</p>
<pre><code class="language-bash"># Generate a CA for Dex
openssl req -x509 -newkey rsa:4096 -keyout dex-ca.key \
  -out dex-ca.crt -days 365 -nodes \
  -subj "/CN=dex-ca"

# Generate a certificate for Dex signed by that CA
openssl req -newkey rsa:2048 -keyout dex.key \
  -out dex.csr -nodes \
  -subj "/CN=dex.127.0.0.1.nip.io"

openssl x509 -req -in dex.csr \
  -CA dex-ca.crt -CAkey dex-ca.key \
  -CAcreateserial -out dex.crt -days 365 \
  -extfile &lt;(printf "subjectAltName=DNS:dex.127.0.0.1.nip.io")
</code></pre>
<p>Next, generate the <code>AuthenticationConfiguration</code> file. This tells the API server how to validate JWTs — which issuer to trust (<code>url</code>), which audience to expect (<code>audiences</code>), and which JWT claims map to Kubernetes usernames and groups (<code>claimMappings</code>). The CA cert is inlined so the API server can verify Dex's TLS certificate when fetching signing keys:</p>
<pre><code class="language-bash">cat &gt; auth-config.yaml &lt;&lt;EOF
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
  - issuer:
      url: https://dex.127.0.0.1.nip.io:32000
      audiences:
        - kubernetes
      certificateAuthority: |
$(sed 's/^/        /' dex-ca.crt)
    claimMappings:
      username:
        claim: email
        prefix: ""
      groups:
        claim: groups
        prefix: ""
EOF
</code></pre>
<p>The <code>kind-oidc.yaml</code> config uses <code>extraPortMappings</code> to expose Dex's port to your browser, <code>extraMounts</code> to mount files into the Kind node, and a <code>kubeadmConfigPatches</code> entry to pass <code>--authentication-config</code> to the API server:</p>
<pre><code class="language-yaml"># kind-oidc.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      # Forward port 32000 from the Docker container to localhost,
      # so your browser can reach Dex's login page
      - containerPort: 32000
        hostPort: 32000
        protocol: TCP
    extraMounts:
      # Mount files from your machine into the Kind node's filesystem
      - hostPath: ./dex-ca.crt
        containerPath: /etc/ca-certificates/dex-ca.crt
        readOnly: true
      - hostPath: ./auth-config.yaml
        containerPath: /etc/kubernetes/auth-config.yaml
        readOnly: true
    kubeadmConfigPatches:
      # Patch the API server to enable OIDC authentication
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            # Tell the API server to load our AuthenticationConfiguration
            authentication-config: /etc/kubernetes/auth-config.yaml
          extraVolumes:
            # Mount files into the API server pod (it runs as a static pod,
            # so it needs explicit volume mounts even though files are on the node)
            - name: dex-ca
              hostPath: /etc/ca-certificates/dex-ca.crt
              mountPath: /etc/ca-certificates/dex-ca.crt
              readOnly: true
              pathType: File
            - name: auth-config
              hostPath: /etc/kubernetes/auth-config.yaml
              mountPath: /etc/kubernetes/auth-config.yaml
              readOnly: true
              pathType: File
</code></pre>
<p>Create the cluster:</p>
<pre><code class="language-bash">kind create cluster --name k8s-auth --config kind-oidc.yaml
</code></pre>
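<p>Once the cluster is up, it's worth confirming that the API server actually picked up the flag. The sketch below assumes kubeadm's default <code>component=kube-apiserver</code> label on the static pod and the <code>kind-k8s-auth</code> admin context that kind creates. The helper name <code>apiserver_has_auth_config</code> is hypothetical:</p>
<pre><code class="language-bash"># Succeeds when the running API server pod was started with --authentication-config.
# apiserver_has_auth_config is a hypothetical helper name.
apiserver_has_auth_config() {
  kubectl --context kind-k8s-auth -n kube-system get pod \
    -l component=kube-apiserver -o yaml |
    grep -q -- '--authentication-config'
}
</code></pre>
<p>Run <code>apiserver_has_auth_config; echo $?</code>. An exit status of <code>0</code> means the flag is present in the pod spec. Anything else suggests the <code>kubeadmConfigPatches</code> entry in <code>kind-oidc.yaml</code> was not applied.</p>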
<h3 id="heading-step-2-deploy-dex">Step 2: Deploy Dex</h3>
<p>Dex is an OIDC-compliant identity provider that acts as a bridge between Kubernetes and upstream identity sources (LDAP, SAML, GitHub, and so on). In this demo it runs inside the cluster with a static password database — two hardcoded users you can log in as.</p>
<p>The API server doesn't talk to Dex on every request. It contacts Dex only to fetch the discovery document and signing keys, using the CA certificate you inlined in the <code>AuthenticationConfiguration</code> to trust Dex's TLS certificate. It then verifies the JWT signatures on tokens that Dex issues against those keys locally.</p>
<p>The deployment has four parts: a ConfigMap with Dex's configuration, a Deployment to run Dex, a NodePort Service to expose it on port 32000 (matching the issuer URL), and RBAC resources so Dex can store state using Kubernetes CRDs.</p>
<p>First, create the namespace and load the TLS certificate as a Kubernetes Secret. Dex needs this to serve HTTPS. Without it, your browser and the API server would refuse to connect:</p>
<pre><code class="language-bash">kubectl create namespace dex

kubectl create secret tls dex-tls \
  --cert=dex.crt \
  --key=dex.key \
  -n dex
</code></pre>
<p>Save the following as <code>dex-config.yaml</code>. This configures Dex with a static password connector — two hardcoded users for the demo:</p>
<pre><code class="language-yaml"># dex-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dex-config
  namespace: dex
data:
  config.yaml: |
    # issuer must exactly match the URL in your AuthenticationConfiguration
    issuer: https://dex.127.0.0.1.nip.io:32000

    # Dex stores refresh tokens and auth codes — here it uses Kubernetes CRDs
    storage:
      type: kubernetes
      config:
        inCluster: true

    # Dex's HTTPS listener — serves the login page and token endpoints
    web:
      https: 0.0.0.0:5556
      tlsCert: /etc/dex/tls/tls.crt
      tlsKey: /etc/dex/tls/tls.key

    # staticClients defines which applications can request tokens.
    # "kubernetes" is the client ID that kubelogin uses when authenticating
    staticClients:
      - id: kubernetes
        redirectURIs:
          - http://localhost:8000     # kubelogin listens here to receive the callback
        name: Kubernetes
        secret: kubernetes-secret     # shared secret between kubelogin and Dex

    # Two demo users with the password "password" (bcrypt-hashed).
    # In production, you'd connect Dex to LDAP, SAML, or a social login instead
    enablePasswordDB: true
    staticPasswords:
      - email: "jane@example.com"
        # bcrypt hash of "password"; generate your own with: htpasswd -bnBC 10 "" password | tr -d ':\n'
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "jane"
        userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
      - email: "admin@example.com"
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "admin"
        userID: "a8b53e13-7e8c-4f7b-9a33-6c2f4d8c6a1b"
        groups:
          - platform-engineers
</code></pre>
<p>Save the following as <code>dex-deployment.yaml</code>. This creates the Deployment, Service, ServiceAccount, and RBAC that Dex needs to run:</p>
<pre><code class="language-yaml"># dex-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dex
  namespace: dex
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dex
  template:
    metadata:
      labels:
        app: dex
    spec:
      serviceAccountName: dex
      containers:
        - name: dex
          # v2.45.0+ required — earlier versions don't include groups from staticPasswords in tokens
          image: ghcr.io/dexidp/dex:v2.45.0
          command: ["dex", "serve", "/etc/dex/cfg/config.yaml"]
          ports:
            - name: https
              containerPort: 5556
          volumeMounts:
            - name: config
              mountPath: /etc/dex/cfg
            - name: tls
              mountPath: /etc/dex/tls
      volumes:
        - name: config
          configMap:
            name: dex-config
        - name: tls
          secret:
            secretName: dex-tls
---
# NodePort Service — exposes Dex on port 32000 on the Kind node.
# Combined with extraPortMappings, this makes Dex reachable from your browser
apiVersion: v1
kind: Service
metadata:
  name: dex
  namespace: dex
spec:
  type: NodePort
  ports:
    - name: https
      port: 5556
      targetPort: 5556
      nodePort: 32000
  selector:
    app: dex
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dex
  namespace: dex
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dex
rules:
  - apiGroups: ["dex.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dex
subjects:
  - kind: ServiceAccount
    name: dex
    namespace: dex
roleRef:
  kind: ClusterRole
  name: dex
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f dex-config.yaml
kubectl apply -f dex-deployment.yaml
kubectl rollout status deployment/dex -n dex
</code></pre>
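<p>Before moving on, you may want to confirm that Dex is serving its OIDC discovery document over TLS, since this is the endpoint the API server reads to locate the signing keys. A small sketch, assuming <code>curl</code> is installed and you are in the directory that holds <code>dex-ca.crt</code>. The helper name <code>dex_jwks_uri</code> is illustrative:</p>
<pre><code class="language-bash"># Fetch the OIDC discovery document from an issuer and print its jwks_uri.
# /.well-known/openid-configuration is the standard OIDC discovery path.
dex_jwks_uri() {
  curl -s --cacert dex-ca.crt "$1/.well-known/openid-configuration" |
    sed -n 's/.*"jwks_uri"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
</code></pre>
<p>Running <code>dex_jwks_uri https://dex.127.0.0.1.nip.io:32000</code> should print a keys URL under the same issuer. An empty result usually points to a TLS problem or an issuer URL that doesn't match the Dex config.</p>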
<h3 id="heading-step-3-install-kubelogin">Step 3: Install kubelogin</h3>
<pre><code class="language-bash"># macOS
brew install int128/kubelogin/kubelogin

# Linux
curl -LO https://github.com/int128/kubelogin/releases/latest/download/kubelogin_linux_amd64.zip
unzip -j kubelogin_linux_amd64.zip kubelogin -d /tmp
sudo mv /tmp/kubelogin /usr/local/bin/kubectl-oidc_login
rm kubelogin_linux_amd64.zip
</code></pre>
<p>Confirm it's installed:</p>
<pre><code class="language-bash">kubectl oidc-login --version
</code></pre>
<h3 id="heading-step-4-configure-a-kubeconfig-entry-for-oidc">Step 4: Configure a kubeconfig entry for OIDC</h3>
<p>This creates a new user and context in your kubeconfig. Instead of using a client certificate (like the default Kind admin), it tells kubectl to use kubelogin to get a token from Dex.</p>
<p>The <code>--oidc-extra-scope</code> flags are important: without <code>email</code> and <code>groups</code>, Dex won't include those claims in the JWT, and the API server won't know who you are or what groups you belong to.</p>
<pre><code class="language-bash">kubectl config set-credentials oidc-user \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://dex.127.0.0.1.nip.io:32000 \
  --exec-arg=--oidc-client-id=kubernetes \
  --exec-arg=--oidc-client-secret=kubernetes-secret \
  --exec-arg=--oidc-extra-scope=email \
  --exec-arg=--oidc-extra-scope=groups \
  --exec-arg=--certificate-authority=$(pwd)/dex-ca.crt

kubectl config set-context oidc@k8s-auth \
  --cluster=kind-k8s-auth \
  --user=oidc-user

kubectl config use-context oidc@k8s-auth
</code></pre>
<h3 id="heading-step-5-trigger-the-login-flow">Step 5: Trigger the login flow</h3>
<p>Jane has no RBAC permissions yet, so first grant her read access from the admin context:</p>
<pre><code class="language-bash">kubectl --context kind-k8s-auth create clusterrolebinding jane-view \
  --clusterrole=view --user=jane@example.com
</code></pre>
<p>The current context is already <code>oidc@k8s-auth</code> (set in Step 4), so the next command triggers a login:</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Your browser opens and redirects to the Dex login page. Log in as <code>jane@example.com</code> with password <code>password</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/44fe0657-b383-4245-9e43-45daea7a3f4f.png" alt="dexidp login screen" style="display:block;margin:0 auto" width="866" height="549" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/4f77442a-3055-47fc-a141-8d881731a1f4.png" alt="dexidp grant access" style="display:block;margin:0 auto" width="925" height="512" loading="lazy">

<p>After login, the terminal completes:</p>
<pre><code class="language-plaintext">No resources found in default namespace.
</code></pre>
<p>The browser-based authentication worked. <code>kubectl</code> received the token from Dex via kubelogin and sent it to the API server. The API server validated the JWT signature against Dex's public signing keys (fetched over TLS, which it trusts through the CA certificate in the <code>AuthenticationConfiguration</code>), extracted <code>jane@example.com</code> from the <code>email</code> claim, matched it against the RBAC binding, and authorized the request.</p>
<p>Without the <code>clusterrolebinding</code>, you would see <code>Error from server (Forbidden)</code> — authentication succeeds (the API server knows <em>who</em> you are) but authorization fails (jane has no permissions). This is the distinction between 401 Unauthorized and 403 Forbidden.</p>
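<p>You can also ask the API server which identity it derived from the token. The sketch below assumes a kubectl and cluster version of 1.28 or later, where <code>kubectl auth whoami</code> is generally available, and the <code>oidc@k8s-auth</code> context from Step 4. The helper name <code>oidc_identity</code> is hypothetical:</p>
<pre><code class="language-bash"># Show who the API server thinks you are, then test one permission.
# oidc_identity is a hypothetical helper; the context name comes from Step 4.
oidc_identity() {
  kubectl --context oidc@k8s-auth auth whoami
  kubectl --context oidc@k8s-auth auth can-i list pods -n default || true
}
</code></pre>
<p>For jane, <code>oidc_identity</code> should show <code>jane@example.com</code> as the username, followed by <code>yes</code> once the <code>jane-view</code> binding exists.</p>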
<h3 id="heading-step-6-inspect-the-jwt">Step 6: Inspect the JWT</h3>
<p>A JWT (JSON Web Token) is a signed JSON payload that contains claims about the user. kubelogin caches the token locally under <code>~/.kube/cache/oidc-login/</code> so you don't have to log in on every kubectl command.</p>
<p>List the directory to find the cached file:</p>
<pre><code class="language-bash">ls ~/.kube/cache/oidc-login/
</code></pre>
<p>Decode the JWT payload directly from the cache:</p>
<pre><code class="language-bash">cat ~/.kube/cache/oidc-login/$(ls ~/.kube/cache/oidc-login/ | grep -v lock | head -1) | \
  python3 -c "
import json, sys, base64
token = json.load(sys.stdin)['id_token'].split('.')[1]
token += '=' * (4 - len(token) % 4)
print(json.dumps(json.loads(base64.urlsafe_b64decode(token)), indent=2))
"
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-json">{
  "iss": "https://dex.127.0.0.1.nip.io:32000",
  "sub": "CiQwOGE4Njg0Yi1kYjg4LTRiNzMtOTBhOS0zY2QxNjYxZjU0NjYSBWxvY2Fs",
  "aud": "kubernetes",
  "exp": 1775307910,
  "iat": 1775221510,
  "email": "jane@example.com",
  "email_verified": true
}
</code></pre>
<p>The <code>email</code> claim becomes jane's Kubernetes username because the <code>AuthenticationConfiguration</code> maps <code>username.claim: email</code>. The <code>aud</code> matches the configured <code>audiences</code>. The <code>iss</code> matches the issuer <code>url</code>. This is how the API server validates the token without contacting Dex on every request: it verifies the JWT signature locally against Dex's cached public keys, and uses the CA certificate only to trust Dex's TLS endpoint when fetching those keys.</p>
<h3 id="heading-step-7-map-oidc-groups-to-rbac">Step 7: Map OIDC groups to RBAC</h3>
<p>The <code>admin@example.com</code> user has a <code>groups</code> claim in the Dex config containing <code>platform-engineers</code>. Instead of creating individual RBAC bindings per user, you can bind permissions to a group — anyone whose JWT contains that group gets the permissions automatically:</p>
<pre><code class="language-yaml"># platform-engineers-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers     # matches the groups claim in the JWT
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>You're currently logged in as <code>jane@example.com</code> via the OIDC context, but jane only has <code>view</code> permissions — she can't create cluster-wide RBAC bindings. Switch back to the admin context to apply this:</p>
<pre><code class="language-bash">kubectl config use-context kind-k8s-auth
kubectl apply -f platform-engineers-binding.yaml
kubectl config use-context oidc@k8s-auth
</code></pre>
<p>Now clear the cached token to log out of jane's session, then trigger a new login as <code>admin@example.com</code>:</p>
<pre><code class="language-bash"># Clear the cached token — this is how you "log out" with kubelogin
rm -rf ~/.kube/cache/oidc-login/

# This will open the browser again for a fresh login
kubectl get pods -n default
</code></pre>
<p>Log in as <code>admin@example.com</code> with password <code>password</code>. This time the JWT will contain <code>"groups": ["platform-engineers"]</code>, which matches the <code>ClusterRoleBinding</code> you just created. The admin user gets full cluster access without ever being named in an RBAC binding.</p>
<p>You can verify by decoding the new token (Step 6) — the <code>groups</code> claim will be present:</p>
<pre><code class="language-json">{
  "email": "admin@example.com",
  "groups": ["platform-engineers"]
}
</code></pre>
<p>This is the real power of OIDC group claims: you manage group membership in your identity provider, and Kubernetes permissions follow automatically. Add someone to the <code>platform-engineers</code> group in Dex (or any upstream IdP), and they get cluster-admin access on their next login — no kubeconfig or RBAC changes needed.</p>
<h2 id="heading-cloud-provider-authentication">Cloud Provider Authentication</h2>
<p>AWS, GCP, and Azure each give Kubernetes clusters a native authentication mechanism that ties into their IAM systems.</p>
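<p>The implementations differ in API surface, but they follow the same pattern you saw with Dex: the provider's identity system issues a short-lived token, the cluster verifies it, and the resulting identity is mapped to RBAC. Once you understand how Dex works above, these are all variations on the same theme.</p>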
<h3 id="heading-aws-eks">AWS EKS</h3>
<p>EKS uses the <code>aws-iam-authenticator</code> to translate AWS IAM identities into Kubernetes identities. When you run <code>kubectl</code> against an EKS cluster, the AWS CLI generates a short-lived token signed with your IAM credentials. The API server passes this token to the aws-iam-authenticator webhook, which verifies it against AWS STS and returns the corresponding username and groups.</p>
<p>User access is controlled via the <code>aws-auth</code> ConfigMap in <code>kube-system</code>, which maps IAM role ARNs and IAM user ARNs to Kubernetes usernames and groups. A typical entry looks like this:</p>
<pre><code class="language-yaml"># In kube-system/aws-auth ConfigMap
mapRoles:
  - rolearn: arn:aws:iam::123456789012:role/platform-engineers
    username: platform-engineer:{{SessionName}}
    groups:
      - platform-engineers
</code></pre>
<p>AWS is migrating from the <code>aws-auth</code> ConfigMap to a newer Access Entries API, which manages the same mapping through the EKS API rather than a ConfigMap. The underlying authentication mechanism is the same.</p>
<h3 id="heading-google-gke">Google GKE</h3>
<p>GKE integrates with Google Cloud IAM using two different mechanisms, depending on whether you're authenticating as a human user or as a workload.</p>
<p>For human users, GKE accepts standard Google OAuth2 tokens. Running <code>gcloud container clusters get-credentials</code> writes a kubeconfig that uses the <code>gcloud</code> CLI as a credential plugin, generating short-lived tokens from your Google account automatically.</p>
<p>For pod-level identity (letting a pod act as a Google Cloud service account), GKE uses Workload Identity. You annotate a Kubernetes service account to link it to a Google Service Account (GSA) and grant the Kubernetes service account the <code>roles/iam.workloadIdentityUser</code> role on that GSA. Pods running as that service account can then call Google Cloud APIs using the GSA's permissions:</p>
<pre><code class="language-bash"># Bind a Kubernetes SA to a Google Service Account
kubectl annotate serviceaccount my-app \
  --namespace production \
  iam.gke.io/gcp-service-account=my-app@my-project.iam.gserviceaccount.com
</code></pre>
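<p>The annotation is one half of the link. Google Cloud also needs an IAM binding that lets the Kubernetes service account impersonate the GSA. A sketch of that second half, reusing the example names above (<code>my-project</code>, <code>production</code>, <code>my-app</code>), with <code>wi_member</code> and <code>bind_workload_identity</code> as purely illustrative helper names:</p>
<pre><code class="language-bash"># Build the IAM member string that identifies a Kubernetes service account
# to Google Cloud: serviceAccount:PROJECT.svc.id.goog[NAMESPACE/KSA_NAME]
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Allow the Kubernetes SA to act as the Google Service Account.
# Project, namespace, and account names are the placeholders used above.
bind_workload_identity() {
  gcloud iam service-accounts add-iam-policy-binding \
    "my-app@my-project.iam.gserviceaccount.com" \
    --role roles/iam.workloadIdentityUser \
    --member "$(wi_member my-project production my-app)"
}
</code></pre>
<p>Calling <code>bind_workload_identity</code> requires an authenticated <code>gcloud</code> session with permission to edit the GSA's IAM policy.</p>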
<h3 id="heading-azure-aks">Azure AKS</h3>
<p>AKS integrates with Azure Active Directory (since renamed Microsoft Entra ID). When Azure AD integration is enabled, <code>kubectl</code> requests an Azure AD token on behalf of the user via the Azure CLI, and the AKS API server validates it against Azure AD.</p>
<p>For pod-level identity, AKS uses Azure Workload Identity, which follows the same OIDC federation pattern as GKE Workload Identity. A Kubernetes service account is annotated with an Azure Managed Identity client ID, and pods can request Azure AD tokens without storing any credentials:</p>
<pre><code class="language-bash"># Annotate a service account with the Azure Managed Identity client ID
kubectl annotate serviceaccount my-app \
  --namespace production \
  azure.workload.identity/client-id=&lt;MANAGED_IDENTITY_CLIENT_ID&gt;
</code></pre>
<p>The underlying pattern across all three providers is the same: a trusted OIDC token is issued by the cloud provider, verified by the Kubernetes API server, and mapped to an identity through a binding (the <code>aws-auth</code> ConfigMap, a GKE Workload Identity binding, or an AKS federated identity credential). The OIDC section in this article is the conceptual foundation for all of them.</p>
<h2 id="heading-webhook-token-authentication">Webhook Token Authentication</h2>
<p>Webhook token authentication is worth knowing about because it appears in several common Kubernetes setups, even if you never configure it yourself.</p>
<p>When a request arrives with a bearer token that no other authenticator recognises, Kubernetes can send that token to an external HTTP endpoint for validation. The endpoint returns a response indicating who the token belongs to.</p>
<p>This is how EKS authentication works: the API server forwards the token to the aws-iam-authenticator webhook described above. Bootstrap tokens, which <code>kubeadm join</code> embeds so that a new node can contact the API server for the first time, look similar from the outside but are not webhook-based. A separate built-in authenticator validates them against Secrets in <code>kube-system</code>.</p>
<p>For most clusters, you'll encounter webhook auth as something already running rather than something you configure. The main thing to know is that it exists and what it looks like when it appears in logs or configuration.</p>
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the OIDC demo cluster
kind delete cluster --name k8s-auth

# Remove generated certificate files
rm -f ca.crt ca.key jane.key jane.csr jane.crt jane.kubeconfig
rm -f dex-ca.crt dex-ca.key dex.crt dex.key dex.csr dex-ca.srl auth-config.yaml

# Remove the kubelogin token cache
rm -rf ~/.kube/cache/oidc-login/
</code></pre>
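<p><code>kind delete cluster</code> removes the <code>kind-k8s-auth</code> entries from your kubeconfig, but the OIDC user and context from the kubelogin demo were created by hand and stay behind. A sketch for removing them, assuming the names used in this article. The helper name <code>cleanup_oidc_kubeconfig</code> is hypothetical:</p>
<pre><code class="language-bash"># Remove the hand-made OIDC entries from the default kubeconfig.
# cleanup_oidc_kubeconfig is a hypothetical helper name.
cleanup_oidc_kubeconfig() {
  kubectl config delete-context oidc@k8s-auth || true
  kubectl config delete-user oidc-user || true
}
</code></pre>
<p>After calling <code>cleanup_oidc_kubeconfig</code>, run <code>kubectl config get-contexts</code> and pick a new current context if the old one pointed at the deleted cluster.</p>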
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kubernetes authentication is not a single mechanism — it's a chain of pluggable strategies, each one suited to different use cases. In this article you worked through the most important ones.</p>
<p>x509 client certificates are how Kubernetes works out of the box. The CN field becomes the username, the O field becomes the group, and the cluster CA is the trust anchor. You created a certificate for a new user, bound it to RBAC, and saw exactly how authentication and authorisation interact — authentication gets you in, RBAC determines what you can do.</p>
<p>You also saw the fundamental limitation: Kubernetes doesn't check certificate revocation lists, so a compromised certificate remains valid until it expires. This makes certificates a poor fit for human users in production environments.</p>
<p>OIDC is the production-grade answer. Tokens are short-lived, issued by a trusted identity provider, and map directly to Kubernetes groups through JWT claims. You deployed Dex as a self-hosted OIDC provider, configured the API server to trust it, and set up kubelogin for browser-based authentication.</p>
<p>You then decoded a JWT to see exactly what the API server reads from it, and mapped an OIDC group claim to a Kubernetes ClusterRoleBinding.</p>
<p>Cloud provider authentication — EKS, GKE, AKS — uses the same OIDC foundation with provider-specific wrappers. Understanding how Dex works makes each of those systems immediately readable.</p>
<p>All YAML, certificates, and configuration files from this article are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Run Multiple Kubernetes Clusters Without the Overhead Using kcp ]]>
                </title>
                <description>
                    <![CDATA[ In Kubernetes, when you need to isolate workloads, you might start by using namespaces. Namespaces provide a simple way to separate workloads within a single cluster. But as your requirements grow, es ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-run-multiple-kubernetes-clusters-without-the-overhead-using-kcp/</link>
                <guid isPermaLink="false">69c6ea5a7cf27065104ab997</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ multi-cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #multitenancy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ consumer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Provider ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Olalekan Odukoya ]]>
                </dc:creator>
                <pubDate>Fri, 27 Mar 2026 20:36:42 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a42c1a28-7a9e-4676-891d-eae7d64f2900.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In Kubernetes, when you need to isolate workloads, you might start by using namespaces. Namespaces provide a simple way to separate workloads within a single cluster.</p>
<p>But as your requirements grow, especially around compliance, security, multi-tenancy, or conflicting dependencies, your team will likely move beyond namespaces and start creating separate clusters.</p>
<p>What starts as a clean separation quickly becomes cluster sprawl, bringing higher costs, complex networking, and constant operational overhead.</p>
<p>In this article, we'll explore how <strong>kcp</strong> can help fix this problem by allowing you to run multiple “logical clusters” inside a single control plane.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-challenge-of-namespaces-and-multiple-kubernetes-clusters">The Challenge of Namespaces and Multiple Kubernetes Clusters</a></p>
</li>
<li><p><a href="#heading-introducing-kcp">Introducing kcp</a></p>
</li>
<li><p><a href="#heading-getting-started-with-kcp">Getting Started with kcp</a></p>
</li>
<li><p><a href="#heading-deploying-and-managing-applications">Deploying and Managing Applications</a></p>
</li>
<li><p><a href="#heading-beyond-the-primitives-what-we-didnt-cover">Beyond the Primitives: What We Didn't Cover</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p><strong>kubectl</strong> installed</p>
</li>
<li><p>A terminal to run commands</p>
</li>
<li><p><strong>curl</strong> installed</p>
</li>
</ul>
<h2 id="heading-the-challenge-of-namespaces-and-multiple-kubernetes-clusters">The Challenge of Namespaces and Multiple Kubernetes Clusters</h2>
<p>While namespaces provide some level of isolation, many teams often default to creating entirely new Kubernetes clusters to achieve stronger multi-tenancy, environment separation, or geographic distribution.</p>
<p>At first, this approach works well. But as systems grow, managing a fleet of clusters introduces challenges that often outweigh the benefits.</p>
<p>Every new cluster comes with its own control plane, which you'll need to continuously patch, upgrade, and monitor. Over time, this operational overhead will add up, consuming cycles that platform teams could otherwise spend on higher-value work.</p>
<p>Also, clusters don't naturally share service discovery or identity. This forces you to introduce extra layers like service meshes or VPN-based networking, which increases your system's complexity and expands the overall attack surface.</p>
<p>There’s also the cost factor. Clusters incur baseline infrastructure costs regardless of how much workload they run. Creating dedicated clusters for small teams can lead to underutilized resources or, worse, delay the creation of necessary environments because the cost feels too high.</p>
<p>As a result, platform teams often find themselves acting as “cluster plumbers”, spending more time maintaining infrastructure than enabling developer productivity.</p>
<h3 id="heading-illustrating-the-namespace-problem">Illustrating the Namespace Problem</h3>
<p>As I mentioned earlier, when managing multiple clusters gets too complex, a natural alternative is to use namespaces for isolation within a single cluster.</p>
<p>At first glance, this seems like the perfect solution.</p>
<p>But to understand where this approach falls short, let’s walk through a real-world example using a common requirement in shared Kubernetes environments: running databases.</p>
<p>We'll start by creating different namespaces for each team:</p>
<pre><code class="language-shell">➜ ~ kubectl create namespace team-a 
➜ ~ kubectl create namespace team-b
</code></pre>
<p>Let's say <strong>Team A</strong> needs a MongoDB database for one of its services. The team must first install the required <a href="https://github.com/mongodb/mongodb-kubernetes">MongoDB Custom Resource Definitions (CRDs)</a> into the cluster, so Kubernetes knows how to understand the different <code>MongoDB</code> resources:</p>
<pre><code class="language-shell">➜ ~ kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/crds.yaml

customresourcedefinition.apiextensions.k8s.io/clustermongodbroles.mongodb.com created
customresourcedefinition.apiextensions.k8s.io/mongodb.mongodb.com created
customresourcedefinition.apiextensions.k8s.io/mongodbmulticluster.mongodb.com created
customresourcedefinition.apiextensions.k8s.io/mongodbsearch.mongodb.com created
customresourcedefinition.apiextensions.k8s.io/mongodbusers.mongodb.com created
customresourcedefinition.apiextensions.k8s.io/opsmanagers.mongodb.com created
customresourcedefinition.apiextensions.k8s.io/mongodbcommunity.mongodbcommunity.mongodb.com created
</code></pre>
<p>Next, <strong>Team A</strong> installs the actual Operator application (the controller that continuously runs the database logic) into their designated namespace:</p>
<pre><code class="language-shell">➜ ~ kubectl apply -n team-a -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/mongodb-kubernetes.yaml
</code></pre>
<p>But the installation fails with the error below:</p>
<pre><code class="language-shell">the namespace from the provided object "mongodb" does not match the namespace "team-a". You must pass '--namespace=mongodb' to perform this operation.
</code></pre>
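<p>Why did this fail? The manifest hard-codes the <code>mongodb</code> namespace, which conflicts with the <code>-n team-a</code> flag. This reflects a broader pattern: most Kubernetes Operators are designed on the assumption that they own the entire cluster, not just a single namespace.</p>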
<p>To force the operator to run in <code>team-a</code>, we can modify the manifest on the fly:</p>
<pre><code class="language-shell">curl -s https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/mongodb-kubernetes.yaml \
  | sed 's/namespace: mongodb/namespace: team-a/g' \
  | kubectl apply -f -
</code></pre>
<p>We can then confirm that the operator is installed and running:</p>
<pre><code class="language-plaintext">➜ ~ kubectl get po -n team-a 
NAME                                          READY STATUS  RESTARTS AGE 
mongodb-kubernetes-operator-6f5f8bb7fd-8h5hj  1/1   Running 0        59s
</code></pre>
<p>But even after tricking the Operator into running inside <code>team-a</code>'s namespace, we still haven't solved the real problem.</p>
<p>At first glance, <code>team-a</code>'s operator is neatly confined to their namespace. But remember Step 1? <strong>The CRDs aren't namespaced – they're strictly cluster-scoped.</strong> So, even though <code>team-a</code> orchestrated this deployment purely for their own use, those CRDs are now globally registered across the entire cluster.</p>
<p>If Team B checks the API, they'll see all the MongoDB-related CRDs installed by Team A.</p>
<pre><code class="language-shell">➜ ~ kubectl get crds | grep mongodb

clustermongodbroles.mongodb.com               2026-03-24T10:49:35Z
mongodb.mongodb.com                           2026-03-24T10:49:36Z
mongodbcommunity.mongodbcommunity.mongodb.com 2026-03-24T10:49:38Z
mongodbmulticluster.mongodb.com               2026-03-24T10:49:36Z
mongodbsearch.mongodb.com                     2026-03-24T10:49:37Z 
mongodbusers.mongodb.com                      2026-03-24T10:49:37Z 
opsmanagers.mongodb.com                       2026-03-24T10:49:37Z
</code></pre>
<p>Now consider what happens if Team B needs to install a different version of MongoDB for its own services. Because the CRDs are shared across the cluster, both teams are now coupled to the same definitions. This means one team’s changes can easily impact the other, turning what should be isolated environments into a source of conflict.</p>
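<p>You can check this coupling directly on the shared cluster. The commands below are a minimal sketch, assuming the CRDs from the earlier step are still installed. The first shows that CustomResourceDefinitions are themselves a cluster-scoped resource type, and the second lists the versions that the one shared <code>MongoDBCommunity</code> CRD defines for every team:</p>
<pre><code class="language-shell"># CRDs are a cluster-scoped resource type: the NAMESPACED column should read "false".
kubectl api-resources --api-group=apiextensions.k8s.io

# List the versions defined by the shared CRD; every namespace sees this same list.
kubectl get crd mongodbcommunity.mongodbcommunity.mongodb.com -o jsonpath='{.spec.versions[*].name}'
</code></pre>
<p>Because there is only one copy of each CRD, whichever team applies it last decides what everyone else gets.</p>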
<h2 id="heading-introducing-kcp">Introducing kcp</h2>
<p><strong>kcp</strong> is an open-source project that lets you run multiple logical Kubernetes clusters on a single control plane.</p>
<p>These logical clusters are called <strong>workspaces</strong>, and each one behaves like an independent Kubernetes cluster. Every workspace has its own API endpoint, authentication, authorization, and policies, giving teams the experience of working in fully isolated environments.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e6abef0af89662115c0f5ca/ede32f6e-c260-426e-8d50-4f78f11fa1b1.svg" alt="brief kcp architecture and component" style="display:block;margin:0 auto" width="913.6673828124999" height="688.4281249999999" loading="lazy">

<p>This decoupling of the control plane from the worker nodes is what makes kcp different.</p>
<p>In traditional Kubernetes, spinning up a new cluster means provisioning a new API server, a new etcd instance, and all the associated controllers. With kcp, you spin up a workspace, and you have a strong, confined environment for your workload.</p>
<p>It's worth noting that <strong>kcp itself doesn't run workloads.</strong> It's strictly a control plane. Your actual applications still run on physical Kubernetes clusters. kcp only manages the workspaces and the synchronization of resources to those underlying clusters.</p>
<h2 id="heading-getting-started-with-kcp">Getting Started with kcp</h2>
<p>Now that we've covered what kcp is and why it matters, let's get our hands dirty. We'll set up a local kcp environment and explore the core concepts in action.</p>
<p>To make this realistic, we'll follow a common kcp workflow: a platform team that provides custom APIs, and tenant teams that consume them.</p>
<p>In our case, the platform team will export a MongoDB API, and our two tenant teams will subscribe to those APIs using <strong>APIBindings</strong>. Once bound, they can deploy MongoDB instances into their workspaces and sync them to physical clusters.</p>
<p>This pattern is at the heart of how kcp enables scalable multi-tenancy. The platform team controls the API definitions and versioning. Tenant teams get self-service access without needing to understand the underlying infrastructure. Let's see how it works!</p>
<h3 id="heading-installing-kcp">Installing kcp</h3>
<p>Running kcp locally is incredibly lightweight since there are no heavy worker nodes to spin up. You will need three things: the <code>kcp</code> server itself, the <code>kubectl-kcp</code> plugin, and the <code>kubectl-ws</code> plugin to manage workspaces.</p>
<p>To install the binaries, let's head over to the <a href="https://github.com/kcp-dev/kcp/releases/tag/v0.30.1">kcp-dev releases page</a>.</p>
<p>The commands below are for macOS Apple Silicon. If you're using an Intel Mac or Linux, simply replace <code>darwin_arm64</code> with your respective architecture.</p>
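<p>If you're not sure which suffix matches your machine, a small sketch like the one below can derive it from <code>uname</code>. The mapping from <code>x86_64</code> to <code>amd64</code> and the <code>kcp_VERSION_OS_ARCH.tar.gz</code> naming pattern are assumptions based on the file names used in this article, so double-check the result against the releases page:</p>
<pre><code class="language-shell"># Derive the OS/architecture suffix used in the release file names.
# Assumption: assets follow the kcp_VERSION_OS_ARCH.tar.gz pattern used in this article.
version="0.30.1"
os="$(uname -s | tr '[:upper:]' '[:lower:]')"   # e.g. darwin or linux
case "$(uname -m)" in
  x86_64)        arch="amd64" ;;
  arm64|aarch64) arch="arm64" ;;
  *)             arch="$(uname -m)" ;;          # unmapped: pass through as-is
esac
echo "kcp_${version}_${os}_${arch}.tar.gz"
</code></pre>
<p>The printed file name is the one to substitute into the <code>curl</code> and <code>tar</code> commands that follow.</p>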
<ol>
<li>Download the kcp server and workspace plugins:</li>
</ol>
<pre><code class="language-shell">➜ ~ curl -LO https://github.com/kcp-dev/kcp/releases/download/v0.30.1/kcp_0.30.1_darwin_arm64.tar.gz 

➜ ~ curl -LO https://github.com/kcp-dev/kcp/releases/download/v0.30.1/kubectl-kcp-plugin_0.30.1_darwin_arm64.tar.gz

➜ ~ curl -LO https://github.com/kcp-dev/kcp/releases/download/v0.30.1/kubectl-ws-plugin_0.30.1_darwin_arm64.tar.gz
</code></pre>
<ol>
<li>Extract the archives:</li>
</ol>
<pre><code class="language-shell">➜ ~ tar -xzf kcp_0.30.1_darwin_arm64.tar.gz 
➜ ~ tar -xzf kubectl-kcp-plugin_0.30.1_darwin_arm64.tar.gz
➜ ~ tar -xzf kubectl-ws-plugin_0.30.1_darwin_arm64.tar.gz
</code></pre>
<ol>
<li>Move the required binaries into your <strong>PATH</strong>:</li>
</ol>
<pre><code class="language-shell">➜ ~ sudo mv bin/kcp /usr/local/bin/
➜ ~ sudo mv bin/kubectl-kcp /usr/local/bin/
➜ ~ sudo mv bin/kubectl-ws /usr/local/bin/
</code></pre>
<p>You can confirm the installation by checking the version.</p>
<pre><code class="language-shell">➜ ~ kcp --version
kcp version v1.33.3+kcp-v0.0.0-627385a6
</code></pre>
<h3 id="heading-starting-the-server">Starting the Server</h3>
<p>With the binaries installed, let's boot up your local control plane and bind it to localhost. But first, let's create a "work-folder".</p>
<pre><code class="language-plaintext">➜ ~ mkdir kcp-test
➜ ~ cd kcp-test
</code></pre>
<p>We can then start the kcp server in this directory.</p>
<pre><code class="language-shell">➜ kcp-test kcp start --bind-address=127.0.0.1
</code></pre>
<p>You'll see a flurry of logs as kcp boots up its internal database and exposes the API server. Leave this terminal running in the background.</p>
<h3 id="heading-connecting-to-the-root-workspace">Connecting to the Root Workspace</h3>
<p>Open a new terminal window and navigate back into the <code>kcp-test</code> folder we just created.</p>
<p>At first, if you run a standard <code>ls</code> command, the folder will look empty. But during startup, kcp silently generated a hidden <code>.kcp</code> directory that contains our local certificates and our administrative <code>kubeconfig</code> file. Let's verify that:</p>
<pre><code class="language-shell">➜ ~ cd kcp-test 
➜ kcp-test ls
➜ kcp-test ls -a
. .. .kcp
➜ kcp-test ls .kcp
admin.kubeconfig apiserver.crt apiserver.key etcd-server sa.key
</code></pre>
<p>Now that we know exactly where the configuration file lives, let's export it so our <code>kubectl</code> commands are routed to kcp instead of your default cluster:</p>
<pre><code class="language-plaintext">export KUBECONFIG=$PWD/.kcp/admin.kubeconfig
</code></pre>
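<p>Since the path is relative to your current directory, it's easy to export a file that doesn't exist if you run the command from the wrong folder. As a hedged sketch, the helper below (the function name <code>use_kcp_kubeconfig</code> is hypothetical, not part of kcp) only exports the variable when the file is actually there:</p>
<pre><code class="language-shell"># Hypothetical helper: export the admin kubeconfig only if kcp has generated it.
use_kcp_kubeconfig() {
  kubeconfig="$PWD/.kcp/admin.kubeconfig"
  if [ -f "$kubeconfig" ]; then
    export KUBECONFIG="$kubeconfig"
    echo "KUBECONFIG set to $kubeconfig"
  else
    echo "No admin.kubeconfig under $PWD/.kcp; run this from the kcp-test folder"
    return 1
  fi
}
</code></pre>
<p>Define it in your shell, then call <code>use_kcp_kubeconfig</code> from inside <code>kcp-test</code> while the server is running.</p>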
<p>Finally, let's use the workspace plugin we installed earlier to verify that we're connected accurately:</p>
<pre><code class="language-shell"> ➜ kubectl ws .
</code></pre>
<p>You should see the message below printed to the console:</p>
<pre><code class="language-shell">Current workspace is 'root'.
</code></pre>
<p>This shows that you're now officially inside the kcp <strong>root workspace</strong>. This is the highest-level administrative boundary where we'll begin creating our tenant logical clusters.</p>
<h3 id="heading-creating-and-managing-workspaces">Creating and Managing Workspaces</h3>
<p>As we discussed above, in a standard Kubernetes cluster, separating teams means using <code>kubectl create namespace</code>. In kcp, we solve the problem by creating entirely isolated logical clusters – workspaces.</p>
<p>If you recall our architecture diagram from earlier, we want to create three distinct environments for our company: one for the platform engineers to manage shared APIs, and two for our isolated tenant development teams.</p>
<p>Since we're currently inside the administrative <code>root</code> workspace, we can create our new tenant workspaces as children of the <code>root</code>:</p>
<pre><code class="language-plaintext">➜ kubectl ws create platform-team
Workspace "platform-team" (type root:organization) created.
Waiting for it to be ready... 
Workspace "platform-team" (type root:organization) is ready to use.

➜ kubectl ws create team-a 
Workspace "team-a" (type root:organization) created.
Waiting for it to be ready... 
Workspace "team-a" (type root:organization) is ready to use.

➜ kubectl ws create team-b
Workspace "team-b" (type root:organization) created.
Waiting for it to be ready... 
Workspace "team-b" (type root:organization) is ready to use.
</code></pre>
<p>Now, here is where kcp truly shines. Unlike a standard cluster, where objects are just a massive flat list, kcp manages its API as a hierarchy. We can visually prove the structure of our new logical clusters using the <code>tree</code> command:</p>
<pre><code class="language-shell">➜ kubectl ws tree
.
└── root
      ├── platform-team
      ├── team-a
      └── team-b
</code></pre>
<p>Jumping between these logical clusters is as fast as changing directories in a terminal. Let's switch our context over into Team A's workspace:</p>
<pre><code class="language-plaintext">➜ kubectl ws team-a 
Current workspace is 'root:team-a' (type root:organization).
</code></pre>
<h4 id="heading-proving-the-isolation">Proving the Isolation</h4>
<p>To truly understand the power of what we just did, let's try running a standard Kubernetes command while inside <code>team-a</code>:</p>
<pre><code class="language-plaintext">➜ kubectl get namespaces

NAME STATUS AGE 
default Active 15m
</code></pre>
<p>Let's also ask the cluster what APIs are actually available to us out of the box:</p>
<pre><code class="language-plaintext">➜ kubectl api-resources
</code></pre>
<p>Your output should be similar to what is in the image below:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e6abef0af89662115c0f5ca/775eff52-8ae7-4363-bd37-fce6ab0cc587.png" alt="775eff52-8ae7-4363-bd37-fce6ab0cc587" style="display:block;margin:0 auto" width="2022" height="1360" loading="lazy">

<p>Take a closer look at that list and you'll notice that there are no Pods, Deployments, or even ReplicaSets. Many of the APIs you'd see in a standard Kubernetes cluster are missing.</p>
<p>This output proves exactly what we discussed in the architecture section. kcp is incredibly lightweight because every new workspace is born <strong>completely stripped of compute</strong>. Out of the box, it only contains the absolute bare-minimum control plane APIs needed for routing, RBAC, namespaces, and authentication.</p>
<p>From Team A's perspective, they own this pristine, empty universe. If they install a massive, noisy operator right now, like the MongoDB operator and its CRDs, it will exist only here, in this workspace.</p>
<p>But this raises the ultimate question: If there are no <code>Deployments</code> or <code>Pods</code> APIs in this workspace... how do we actually deploy our applications?</p>
<h2 id="heading-deploying-and-managing-applications">Deploying and Managing Applications</h2>
<p>Now that we have set up our isolated environments, we must address the glaring issue from our last terminal output: <strong>How do developers actually deploy applications</strong> if there are no <code>Deployment</code> or <code>Pod</code> APIs?</p>
<p>In standard Kubernetes, the API is monolithic. You get everything whether you need it or not, and adding a new schema (like an Operator) forces it globally onto everyone.</p>
<p>kcp takes the exact opposite approach. Every workspace starts completely empty. You then selectively "subscribe" your workspace to only the APIs you actually need using two incredibly powerful new concepts: <strong>APIExports</strong> and <strong>APIBindings</strong>.</p>
<p>Let's see exactly how this solves our MongoDB multi-tenancy problem, step by step.</p>
<h3 id="heading-1-the-platform-team-exports-the-api">1. The Platform Team "Exports" the API</h3>
<p>Instead of treating Custom Resource Definitions as global hazards, the platform engineers manage them centrally. First, let's switch into the <code>platform-team</code> workspace:</p>
<pre><code class="language-plaintext">➜ kubectl ws :root:platform-team

Current workspace is 'root:platform-team' (type root:organization).
</code></pre>
<p>Here, we'll install the MongoDB Operator CRDs in the platform-team's workspace:</p>
<pre><code class="language-plaintext">➜ kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/crds.yaml
</code></pre>
<p>To confirm that this is indeed isolated, let's first check which CRDs were installed:</p>
<pre><code class="language-shell">➜ kubectl get crd

NAME                                          CREATED AT
clustermongodbroles.mongodb.com               2026-03-24T20:45:50Z
mongodb.mongodb.com                           2026-03-24T20:45:50Z
mongodbcommunity.mongodbcommunity.mongodb.com 2026-03-24T20:45:51Z
mongodbmulticluster.mongodb.com               2026-03-24T20:45:50Z
mongodbsearch.mongodb.com                     2026-03-24T20:45:51Z
mongodbusers.mongodb.com                      2026-03-24T20:45:51Z
opsmanagers.mongodb.com                       2026-03-24T20:45:51Z
</code></pre>
<p>Now switch to <code>team-a</code>'s workspace. Any tenant workspace would do: the goal is to confirm that the installed <strong>CRDs</strong> are visible only in <code>platform-team</code>'s workspace.</p>
<pre><code class="language-shell">➜ kubectl ws :root:team-a

Current workspace is 'root:team-a' (type root:organization).
</code></pre>
<pre><code class="language-plaintext">➜ kubectl get crd 
No resources found
</code></pre>
<p>The output shows that no CRDs are registered in this workspace. This is the power of kcp.</p>
<p>If you don't want to continually type out paths to switch between your logical clusters, the <code>kcp</code> plugin includes a powerful interactive UI right in your terminal.</p>
<p>By running <code>kubectl ws -i</code>, you can use your arrow keys to navigate through your hierarchy and press <code>Enter</code> to instantly switch your context. Even better, this interactive mode provides a holistic view of your environment at any given time. With a single glance, you can see exactly how many APIExports are hosted inside a specific workspace, or which APIs are currently <strong>bound</strong> by other workspaces.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4d86d960-a23c-4cb1-8155-6fe236240893.png" alt="4d86d960-a23c-4cb1-8155-6fe236240893" style="display:block;margin:0 auto" width="2992" height="1796" loading="lazy">

<p>Let's switch back to <code>platform-team</code>'s workspace (<code>kubectl ws :root:platform-team</code>) to continue with our setup.</p>
<p>Now, we need to do something kcp-specific. If you check your resources right now, those CRDs are strictly local to this workspace. To safely share them with our tenant teams, we need to convert them into an internal kcp tracking object called an <strong>APIResourceSchema</strong>. This is how kcp structurally version-controls APIs so they can be securely exported.</p>
<p>To do this, we use our <code>kcp</code> plugin to take a "snapshot" of the local MongoDB CRD:</p>
<pre><code class="language-plaintext">kubectl get crd mongodbcommunity.mongodbcommunity.mongodb.com -o yaml | kubectl kcp crd snapshot -f - --prefix v1 | kubectl apply -f -
</code></pre>
<p>You should see an output that says:</p>
<blockquote>
<p>apiresourceschema.apis.kcp.io/v1.mongodbcommunity.mongodbcommunity.mongodb.com created</p>
</blockquote>
<p>This tells kcp: "Get the CRD we just installed, take a snapshot with the prefix 'v1', and apply the resulting <strong>APIResourceSchema</strong> back to the cluster."</p>
<p>Now, let's look for the schema kcp just generated for us:</p>
<pre><code class="language-plaintext">➜ kubectl get apiresourceschemas

NAME                                             AGE
v1.mongodbcommunity.mongodbcommunity.mongodb.com 11s
</code></pre>
<p>To safely share this API with our teams, we wrap that generated schema into an <code>APIExport</code>. This acts like "APIs as a Service," publishing the schema so that other workspaces can optionally choose to consume it.</p>
<p>Let's create the Export using the exact schema name we just found:</p>
<pre><code class="language-shell">➜ cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: apis.kcp.io/v1alpha1
kind: APIExport
metadata:
  name: mongodb-v1
spec:
  latestResourceSchemas:
    - v1.mongodbcommunity.mongodbcommunity.mongodb.com
EOF
</code></pre>
<p>We can confirm this was successfully created by listing the APIExport resources in the workspace:</p>
<pre><code class="language-plaintext">➜ kubectl get apiexports

NAME       AGE
mongodb-v1 2m46s
</code></pre>
<h3 id="heading-2-tenant-teams-bind-to-the-api">2. Tenant Teams "Bind" to the API</h3>
<p>Now let's switch our terminal context back over to Team A. Remember our previous output? Their workspace currently has no idea what a MongoDB cluster is. Let's prove it:</p>
<pre><code class="language-plaintext">➜ kubectl ws :root:team-a
Current workspace is "root:team-a" (type root:organization).

➜ kubectl api-resources | grep mongodb
# (No output. The API does not exist here!)
</code></pre>
<p>To securely subscribe to the platform team's newly created API service, Team A needs to create an <code>APIBinding</code>.</p>
<p>While we can write standard Kubernetes YAML to do this, the <code>kcp</code> plugin provides a <code>bind</code> command. Team A simply points the <code>bind</code> command directly at the workspace and the specific API export they want to consume:</p>
<pre><code class="language-plaintext">➜ kubectl kcp bind apiexport root:platform-team:mongodb-v1
apibinding mongodb-v1 created. Waiting to successfully bind ...
mongodb-v1 created and bound.

➜ kcp-test kubectl get apibindings
NAME                  AGE   READY
mongodb-v1            73s   True
tenancy.kcp.io-bqt7a  7h10m True
topology.kcp.io-9dlvq 7h10m True
</code></pre>
<p>The moment Team A executes that <code>bind</code> command, their workspace is magically updated with the new capabilities. Let's check our <code>api-resources</code> one more time:</p>
<pre><code class="language-plaintext">➜ kubectl api-resources | grep mongodb
mongodbcommunity mdbc mongodbcommunity.mongodb.com/v1 true MongoDBCommunity
</code></pre>
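<p>The same subscription can also be written as plain YAML, which is handy if you keep workspace setup in version control. The sketch below does this for Team B. It assumes the <code>apis.kcp.io/v1alpha1</code> <code>APIBinding</code> schema with a <code>spec.reference.export</code> block, matching the API version used for the <code>APIExport</code> earlier, so verify the field names against your kcp version before relying on it:</p>
<pre><code class="language-shell"># Switch into Team B's workspace first, so the binding is created there.
kubectl ws :root:team-b

# Declarative equivalent of the bind command (schema assumed, see note above).
cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: apis.kcp.io/v1alpha1
kind: APIBinding
metadata:
  name: mongodb-v1
spec:
  reference:
    export:
      path: root:platform-team   # workspace that hosts the APIExport
      name: mongodb-v1           # name of the APIExport created earlier
EOF
</code></pre>
<p>Once the binding reports ready, <code>kubectl api-resources | grep mongodb</code> in Team B's workspace should list the same <code>MongoDBCommunity</code> API that Team A sees.</p>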
<h2 id="heading-beyond-the-primitives-what-we-didnt-cover">Beyond the Primitives: What We Didn't Cover</h2>
<p>At this point, you should have a firm, hands-on grasp of the core user primitives of kcp, namely <strong>Workspaces</strong>, <strong>APIExports</strong>, and <strong>APIBindings</strong>. But we've only just scratched the surface of what this architecture makes possible.</p>
<p>To keep this guide digestible, there are a few massive topics that I deliberately didn't cover in this article:</p>
<ol>
<li><p><strong>Shards and High Availability:</strong> Since kcp is designed to host thousands of logical clusters, a single database isn't enough. kcp introduces the <code>Shard</code> primitive, allowing platform administrators to horizontally partition workspace state across multiple underlying <code>etcd</code> instances. This lets kcp scale horizontally and stay highly available (HA) without complicating the developer experience.</p>
</li>
<li><p><strong>Front-Proxy:</strong> Once workspaces are spread across multiple shards, kcp needs a way to direct traffic. The kcp <strong>Front-Proxy</strong> sits at the edge of the architecture and routes incoming <code>kubectl</code> API requests to the correct workspace and shard. It keeps the developer experience unified, no matter how large the background infrastructure becomes.</p>
</li>
<li><p><strong>Virtual Workspaces:</strong> While the workspaces we built today act as simple isolated buckets of state, kcp also supports <strong>Virtual Workspaces</strong>. These act as dynamic, read-only projections of data. For example, <em><strong>kcp</strong></em> uses virtual workspaces to project a unified view of a specific API across multiple tenant workspaces so that controllers can easily watch them all at once.</p>
</li>
<li><p><strong>APIExportEndpointSlices:</strong> Just like standard Kubernetes uses endpoints to route traffic to pods, kcp uses <code>EndpointSlices</code> to efficiently route and scale the delivery of massive <code>APIExports</code> across thousands of consuming workspaces.</p>
</li>
<li><p><strong>Wiring up the Sync Agent (</strong><code>api-syncagent</code><strong>):</strong> We discussed this conceptually in our architecture diagram, but we didn't actually attach a physical cluster. In a production scenario, you deploy the Sync Agent onto a fleet of downstream execution clusters (like EKS, GKE, or On-Premises environments) to automatically pull workloads safely out of kcp and execute them seamlessly on physical hardware.</p>
</li>
<li><p><strong>External Integrations Like Crossplane:</strong> Because kcp acts purely as a multi-tenant API control plane, it pairs well with <strong>Crossplane</strong>. By publishing Crossplane APIs through an <code>APIExport</code>, you can let developer teams provision actual cloud infrastructure (like AWS databases or Cloud Spanner instances) using standard YAML directly from their isolated kcp workspaces.</p>
</li>
</ol>
<p>We will cover those advanced integrations in a future deep-dive. But armed with just the base primitives we built today, we can already solve the incredibly complex infrastructure problems we outlined at the beginning of the article.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator ]]>
                </title>
                <description>
                    <![CDATA[ If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently? Storing them is straightforward. But handling rotation, stale env vars, and the gap ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-sync-aws-secrets-manager-secrets-into-kubernetes-with-the-external-secrets-operator/</link>
                <guid isPermaLink="false">69c541f010e664c5dadc877e</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cybersecurity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ secrets management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Thu, 26 Mar 2026 14:25:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6cca126e-dd50-4400-ae9d-65449581345b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?</p>
<p>Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.</p>
<p>In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.</p>
<p>By the end, you'll be able to:</p>
<ul>
<li><p>Explain the full architecture from vault to pod</p>
</li>
<li><p>Run the lab locally in about 15 minutes</p>
</li>
<li><p>Prove why environment variables go stale after rotation, while mounted secret files stay fresh</p>
</li>
<li><p>Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD</p>
</li>
<li><p>Troubleshoot the most common failures</p>
</li>
</ul>
<p>Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/ac8bfc9e-304e-41b8-b6a3-7ce1795b29a9.png" alt="Architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds." style="display:block;margin:0 auto" width="1024" height="1536" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-local-lab">How to Run the Local Lab</a></p>
</li>
<li><p><a href="#heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</a></p>
</li>
<li><p><a href="#heading-how-to-test-secret-rotation">How to Test Secret Rotation</a></p>
</li>
<li><p><a href="#heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</a></p>
</li>
<li><p><a href="#heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</a></p>
</li>
<li><p><a href="#heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following tools installed and configured.</p>
<p><strong>For the local lab:</strong></p>
<ul>
<li><p>An AWS account with access to AWS Secrets Manager</p>
</li>
<li><p>The AWS CLI installed and configured. Run <code>aws configure</code> and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.</p>
</li>
<li><p><code>kubectl</code> installed. For Microk8s, run <code>microk8s kubectl config view --raw &gt; ~/.kube/config</code> after installation to connect kubectl to your local cluster.</p>
</li>
<li><p>Terraform installed</p>
</li>
<li><p>Helm installed</p>
</li>
<li><p>Docker installed</p>
</li>
<li><p>A local Kubernetes cluster: the lab supports Microk8s and kind. If you do not have either installed, follow the <a href="https://microk8s.io/">Microk8s install guide</a> before continuing.</p>
</li>
</ul>
<p><strong>For the Amazon Elastic Kubernetes Service sections:</strong></p>
<ul>
<li><p>An Amazon Elastic Kubernetes Service cluster you can create or manage</p>
</li>
<li><p>A GitHub repository you can configure for workflows and secrets</p>
</li>
</ul>
<p>The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-LOCAL.md"><code>docs/DEPLOY-LOCAL.md</code></a> and <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-EKS.md"><code>docs/DEPLOY-EKS.md</code></a>.</p>
<h2 id="heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</h2>
<p>Before you run any command, you need to understand how the pieces connect.</p>
<p>The flow has four stages:</p>
<ol>
<li><p>A developer or automated system updates a secret in AWS Secrets Manager.</p>
</li>
<li><p>The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.</p>
</li>
<li><p>Your pod reads that Kubernetes Secret.</p>
</li>
<li><p>During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/9dc52f99-add4-490a-ad86-25a30d0ae306.png" alt="A step-by-step flow diagram showing the four stages of secret flow above" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-how-the-external-secrets-operator-sync-works">How the External Secrets Operator Sync Works</h3>
<p>The External Secrets Operator reads a custom Kubernetes resource called <code>ExternalSecret</code>. That resource tells the operator three things:</p>
<ul>
<li><p>Which secret store to connect to</p>
</li>
<li><p>Which Kubernetes Secret name to create or update</p>
</li>
<li><p>How often to refresh</p>
</li>
</ul>
<p>In this lab, the <code>ExternalSecret</code> creates a Kubernetes Secret named <code>myapp-database-creds</code>. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.</p>
<h3 id="heading-how-the-app-consumes-secrets">How the App Consumes Secrets</h3>
<p>The sample application exposes three endpoints so you can validate behavior at any time.</p>
<ul>
<li><p><code>/secrets/env</code> shows what environment variables the pod sees</p>
</li>
<li><p><code>/secrets/volume</code> shows what files in the mounted secret directory look like</p>
</li>
<li><p><code>/secrets/compare</code> compares both and reports whether rotation has been detected</p>
</li>
</ul>
<p>The app checks four keys: <code>DB_USERNAME</code>, <code>DB_PASSWORD</code>, <code>DB_HOST</code>, and <code>DB_PORT</code>.</p>
<h2 id="heading-how-to-run-the-local-lab">How to Run the Local Lab</h2>
<p>The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.</p>
<h3 id="heading-step-1-clone-the-repo">Step 1: Clone the Repo</h3>
<pre><code class="language-bash">git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab
</code></pre>
<h3 id="heading-step-2-run-the-spin-up-script">Step 2: Run the Spin-Up Script</h3>
<pre><code class="language-bash">bash spinup.sh
</code></pre>
<p>The script will ask you to choose a local cluster type. Pick Microk8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.</p>
<p>If the script fails at any point, check <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/TROUBLESHOOTING.md"><code>docs/TROUBLESHOOTING.md</code></a> before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a Microk8s storage add-on that is not enabled.</p>
<h3 id="heading-important-run-the-lab-ui">Important: Run the Lab UI</h3>
<p>The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application. It's a React-based checklist in <code>lab-ui/</code> that walks you through each concept and checkpoint as you work through the lab.</p>
<p>To start it, open a second terminal and run:</p>
<pre><code class="language-bash">cd lab-ui &amp;&amp; npm install &amp;&amp; npm run dev
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/873166e9-6bff-4e56-a18d-e58b9e9a5af9.png" alt="Screenshot of npm run dev lab ui terminal" style="display:block;margin:0 auto" width="849" height="435" loading="lazy">

<p>Then open <a href="http://localhost:5173"><code>http://localhost:5173</code></a>. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5a5b220b-3f23-4c7c-8388-f2e23d122e2c.png" alt="Screenshot of The Lab UI, a guided tutorial interface that runs alongside the lab and walks you through each concept and checkpoint." style="display:block;margin:0 auto" width="1399" height="1287" loading="lazy">

<p>Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (<code>localhost:3000</code>) are two separate things: the UI guides you through the steps, and the app shows you the live secrets.</p>
<h3 id="heading-step-3-access-the-application">Step 3: Access the Application</h3>
<p>Once the lab finishes, port-forward the service.</p>
<pre><code class="language-bash">kubectl port-forward svc/myapp 3000:80 -n default
</code></pre>
<p>Open <code>http://localhost:3000</code>. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/dbe122ac-b787-40d0-96f4-4b1276bab017.png" alt="Screenshot of the running application at localhost:3000. Every row in the table should show &quot;Match ✓&quot;." style="display:block;margin:0 auto" width="1213" height="902" loading="lazy">

<h3 id="heading-step-4-validate-that-secrets-match">Step 4: Validate That Secrets Match</h3>
<p>Run the compare endpoint directly from the terminal.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>When everything is working, the response will include <code>"all_match": true</code>.</p>
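<p>If you would rather get a pass/fail result than read the JSON by eye, a small helper can test the <code>all_match</code> field for you. This is a minimal sketch and not part of the repo: the function name <code>check_all_match</code> is hypothetical, and it assumes the response is a JSON object with a boolean <code>all_match</code> key, as described above.</p>
<pre><code class="language-bash"># Hypothetical helper: reads a /secrets/compare response on stdin and
# exits 0 only when "all_match" is true. Invalid JSON also exits non-zero.
check_all_match() {
  python3 -c 'import json, sys; sys.exit(0 if json.load(sys.stdin).get("all_match") is True else 1)'
}

# Usage (assumes the port-forward from Step 3 is still running):
# curl -s http://localhost:3000/secrets/compare | check_all_match
</code></pre>
<p>Because the helper only sets an exit code, you can drop it into an <code>if</code> statement or a CI step without parsing any output.</p>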
<h2 id="heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</h2>
<p>At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.</p>
<h3 id="heading-step-1-read-the-externalsecret-manifest">Step 1: Read the ExternalSecret Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/external-secret.yaml"><code>k8s/aws/external-secret.yaml</code></a>. Focus on these four fields:</p>
<ul>
<li><p><code>refreshInterval</code>: how often the operator polls AWS Secrets Manager</p>
</li>
<li><p><code>secretStoreRef</code>: which store the operator authenticates against</p>
</li>
<li><p><code>target</code>: the name of the Kubernetes Secret to create</p>
</li>
<li><p><code>data</code>: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys</p>
</li>
</ul>
<p>Here is what that mapping looks like in this lab:</p>
<pre><code class="language-yaml">spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username
</code></pre>
<p>The <code>property</code> field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.</p>
<p>Two fields here are worth understanding before you move on. <code>creationPolicy: Owner</code> means the operator owns the Kubernetes Secret it creates. If you delete the <code>ExternalSecret</code>, the Secret is deleted too. <code>ClusterSecretStore</code> is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain <code>SecretStore</code> is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.</p>
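<p>You can check both the ownership and the synced values from the command line. The helpers below are a sketch under a few assumptions: the function names are hypothetical, the Secret lives in the <code>default</code> namespace, and the operator has already completed at least one sync.</p>
<pre><code class="language-bash"># Hypothetical helpers for inspecting the Secret that ESO created.

# Print the kind of the first owner reference.
# With creationPolicy: Owner this is expected to be "ExternalSecret".
secret_owner_kind() {
  kubectl get secret myapp-database-creds -n default \
    -o jsonpath='{.metadata.ownerReferences[0].kind}'
}

# Decode one key from the Secret, for example: secret_value DB_USERNAME
secret_value() {
  kubectl get secret myapp-database-creds -n default \
    -o jsonpath="{.data.$1}" | base64 --decode
}
</code></pre>
<p>The second helper is also a reminder that values in a Kubernetes Secret are only base64-encoded, so anyone who can read the Secret can read the credentials.</p>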
<h3 id="heading-step-2-read-the-deployment-manifest">Step 2: Read the Deployment Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/deployment.yaml"><code>k8s/aws/deployment.yaml</code></a>. You are looking for two sections: <code>envFrom</code> and <code>volumeMounts</code>.</p>
<pre><code class="language-yaml">envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true
</code></pre>
<p>Both paths read from the same Kubernetes Secret, <code>myapp-database-creds</code>. The <code>envFrom</code> block injects all keys as environment variables at pod start. The <code>volumeMounts</code> block mounts the same secret as files under <code>/etc/secrets</code>.</p>
<p>This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.</p>
<h3 id="heading-step-3-read-the-app-comparison-logic">Step 3: Read the App Comparison Logic</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/app/server.js"><code>app/server.js</code></a>. The comparison logic reads environment variables from <code>process.env</code> and reads mounted secret files from <code>/etc/secrets/&lt;key&gt;</code>. Then it computes a per-key match and a global <code>all_match</code> value.</p>
<p>The <code>/secrets/compare</code> endpoint sets <code>rotation_detected: true</code> when any key differs between env and volume.</p>
<h2 id="heading-how-to-test-secret-rotation">How to Test Secret Rotation</h2>
<p>Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.</p>
<h3 id="heading-how-the-rotation-gap-works">How the Rotation Gap Works</h3>
<p>When a pod starts, Kubernetes gives it two ways to read a secret.</p>
<p>The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.</p>
<p>The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.</p>
<p>Same secret, two paths. One goes stale while one stays fresh.</p>
<p>The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.</p>
<p>That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.</p>
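<p>You can reproduce the sticky-note effect on your laptop without a cluster. The sketch below is only an analogy: a temporary directory stands in for <code>/etc/secrets</code>, a child shell stands in for the container process, and the two password strings are made up.</p>
<pre><code class="language-bash"># Stand-in for the mounted secret directory (/etc/secrets in the lab).
secret_dir=$(mktemp -d)
printf 'old-password' &gt; "$secret_dir/DB_PASSWORD"

# The "container" environment: the value is copied once, at startup.
DB_PASSWORD=$(cat "$secret_dir/DB_PASSWORD")
export DB_PASSWORD

# Rotate: only the file changes, as it would after an ESO sync.
printf 'new-password' &gt; "$secret_dir/DB_PASSWORD"

# A child process inherits the value captured at startup...
env_value=$(sh -c 'printf "%s" "$DB_PASSWORD"')
# ...while a fresh read of the file sees the rotated value.
volume_value=$(cat "$secret_dir/DB_PASSWORD")

echo "env:    $env_value"
echo "volume: $volume_value"
rm -r "$secret_dir"
</code></pre>
<p>The env line prints the old password and the volume line prints the new one. Nothing rewrites a process environment from outside, which is why a pod restart is the only way to refresh it.</p>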
<p>Here is what you're about to observe in the lab:</p>
<ul>
<li><p>The rotation script updates the secret in AWS</p>
</li>
<li><p>ESO syncs the new value into Kubernetes within seconds, because the script forces a sync instead of waiting for the <code>refreshInterval</code></p>
</li>
<li><p>The volume file updates automatically</p>
</li>
<li><p>The environment variable stays stale until the pod restarts</p>
</li>
<li><p>The <code>/secrets/compare</code> endpoint shows both values side by side so you can see the gap live</p>
</li>
</ul>
<h3 id="heading-step-1-confirm-the-lab-is-ready">Step 1: Confirm the Lab Is Ready</h3>
<p>Make sure your pod and the External Secrets Operator are both running before you start.</p>
<pre><code class="language-bash">kubectl get pods -n external-secrets
kubectl get pods -n default
</code></pre>
<p>Both should show <code>Running</code>.</p>
<h3 id="heading-step-2-run-the-rotation-test-script">Step 2: Run the Rotation Test Script</h3>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>The script performs these actions in order:</p>
<ol>
<li><p>Reads the current <code>DB_PASSWORD</code> from the volume mount at <code>/etc/secrets/DB_PASSWORD</code></p>
</li>
<li><p>Reads the current <code>DB_PASSWORD</code> from the environment variable</p>
</li>
<li><p>Updates AWS Secrets Manager with a new password using <code>put-secret-value</code></p>
</li>
<li><p>Forces an immediate ESO sync by annotating the <code>ExternalSecret</code> with <code>force-sync</code></p>
</li>
<li><p>Reads the volume value again</p>
</li>
<li><p>Reads the environment variable again</p>
</li>
</ol>
<p>After the script runs, the volume and the env var will show different values.</p>
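<p>If you want to run those steps by hand, they map to a few commands. What follows is a rough sketch, not the repo's script. It assumes the <code>ExternalSecret</code> is named <code>app-db-secret</code> (the name used in the troubleshooting section), that the Deployment is <code>myapp</code> in the <code>default</code> namespace, and that the container image provides <code>cat</code> and <code>printenv</code>. The function names and the JSON file argument are hypothetical.</p>
<pre><code class="language-bash"># Steps 1-2 and 5-6: print the value seen through each consumption path.
read_both_paths() {
  echo "volume: $(kubectl exec deploy/myapp -n default -- cat /etc/secrets/DB_PASSWORD)"
  echo "env:    $(kubectl exec deploy/myapp -n default -- printenv DB_PASSWORD)"
}

# Steps 3-4: write a new secret version, then ask ESO to sync right away.
# $1 is a local JSON file holding the complete secret. put-secret-value
# replaces the whole secret string, so the file must contain every key.
rotate_and_sync() {
  aws secretsmanager put-secret-value \
    --secret-id prod/myapp/database \
    --secret-string "file://$1"
  kubectl annotate externalsecret app-db-secret -n default \
    force-sync="$(date +%s)" --overwrite
}
</code></pre>
<p>Call <code>read_both_paths</code>, then <code>rotate_and_sync</code> with your JSON file, wait about a minute for the kubelet to refresh the mounted file, and call <code>read_both_paths</code> again.</p>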
<h3 id="heading-step-3-validate-with-the-compare-endpoint">Step 3: Validate With the Compare Endpoint</h3>
<p>Hit the compare endpoint and look at the output.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>You'll see something like this:</p>
<pre><code class="language-json">{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c4ebb09f-e605-4f68-8e12-1361d94199b2.png" alt="Rotation mismatch, the volume file updated with the new password but the env var still holds the old value from pod startup." style="display:block;margin:0 auto" width="832" height="290" loading="lazy">

<h3 id="heading-step-4-restart-the-deployment-to-sync-env-vars">Step 4: Restart the Deployment to Sync Env Vars</h3>
<p>Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.</p>
<pre><code class="language-bash">kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default
</code></pre>
<p>Then hit <code>/secrets/compare</code> again. Every key should now match, and the response should include <code>"all_match": true</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0040274d-a398-408c-9486-ce0a9e527479.png" alt="After a rolling restart, new pods pick up fresh env vars and all keys match." style="display:block;margin:0 auto" width="821" height="436" loading="lazy">

<h3 id="heading-how-to-automate-restarts-with-reloader">How to Automate Restarts With Reloader</h3>
<p>If you don't want to restart deployments manually after every rotation, you can install <a href="https://github.com/stakater/reloader"><strong>Stakater Reloader</strong></a>. It watches an annotation on the <code>Deployment</code> and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.</p>
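<p>Once Reloader is installed, opting a workload in is a single annotation. The sketch below uses the <code>reloader.stakater.com/auto</code> annotation from Reloader's documentation. The helper name <code>enable_reloader</code> is hypothetical, and it assumes Reloader is already running in the cluster.</p>
<pre><code class="language-bash"># Hypothetical helper: opt one Deployment in to Reloader-driven restarts.
# $1 = Deployment name, $2 = namespace (defaults to "default").
# "auto" watches every Secret and ConfigMap the pod template references.
enable_reloader() {
  kubectl annotate deployment "$1" -n "${2:-default}" \
    reloader.stakater.com/auto="true" --overwrite
}

# Example: enable_reloader myapp
</code></pre>
<p>The annotation goes on the <code>Deployment</code> itself rather than on the pod template, so applying it does not trigger a rollout on its own.</p>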
<h2 id="heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</h2>
<p>Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the <a href="https://secrets-store-csi-driver.sigs.k8s.io/">Secrets Store CSI Driver</a>.</p>
<p>Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>External Secrets Operator</th>
<th>Secrets Store CSI Driver</th>
</tr>
</thead>
<tbody><tr>
<td>Creates a Kubernetes Secret</td>
<td>Yes</td>
<td>No by default</td>
</tr>
<tr>
<td>Supports <code>envFrom</code></td>
<td>Yes</td>
<td>No (workaround only)</td>
</tr>
<tr>
<td>Secret stored in etcd</td>
<td>Yes (base64-encoded, not encrypted unless encryption at rest is enabled)</td>
<td>No, if you skip sync</td>
</tr>
<tr>
<td>Rotation</td>
<td>ESO updates the Secret, Reloader restarts pods</td>
<td>Volume file can update in place</td>
</tr>
<tr>
<td>Best for</td>
<td>Most teams. Multi-cloud, env var support</td>
<td>Security policies that prohibit secrets in etcd</td>
</tr>
</tbody></table>
<p>This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both <code>envFrom</code> and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.</p>
<p>Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native <code>envFrom</code> model.</p>
<h2 id="heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</h2>
<p>The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.</p>
<h3 id="heading-step-1-prepare-terraform-and-openid-connect-access">Step 1: Prepare Terraform and OpenID Connect Access</h3>
<p>The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder.</p>
<pre><code class="language-bash">cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn
</code></pre>
<p>Copy the role ARN from the output. You'll need it in the next step.</p>
<h3 id="heading-step-2-set-the-required-environment-variable">Step 2: Set the Required Environment Variable</h3>
<p>The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.</p>
<p>To find your AWS account ID, run:</p>
<pre><code class="language-bash">aws sts get-caller-identity --query Account --output text
</code></pre>
<p>Then set the variable, replacing <code>ACCOUNT</code> with the number that command returns.</p>
<pre><code class="language-bash">export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role
</code></pre>
<h3 id="heading-step-3-run-the-spin-up-script-for-amazon-elastic-kubernetes-service">Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service</h3>
<pre><code class="language-bash">bash spinup.sh --cluster eks
</code></pre>
<p>When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing <code>Match ✓</code>.</p>
<h3 id="heading-step-4-test-rotation-on-the-deployed-app">Step 4: Test Rotation on the Deployed App</h3>
<p>After you confirm normal operation, run the rotation test the same way you did locally.</p>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>Then use <code>/secrets/compare</code> on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.</p>
<p>⚠️ <strong>Cost warning:</strong> Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/teardown.sh"><code>bash teardown.sh</code></a> from the repo root to destroy all AWS resources and stop charges.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/56f05ace-9ab6-4b67-ade6-a0bd1fa3962c.png" alt="Screenshot of the app running on the ALB URL, showing all keys matched" style="display:block;margin:0 auto" width="912" height="891" loading="lazy">

<h2 id="heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</h2>
<p>The typical CI/CD setup stores <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> in GitHub repository secrets. Those keys don't rotate unless someone rotates them by hand. Any workflow in the repo can use them, so anyone who can change a workflow can leak them. When someone leaves the team, you have to revoke keys and update every workflow.</p>
<p>OpenID Connect eliminates that problem entirely.</p>
<h3 id="heading-how-openid-connect-works-for-github-actions">How OpenID Connect Works for GitHub Actions</h3>
<p>GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via <code>AssumeRoleWithWebIdentity</code>. No long-lived keys are ever stored anywhere.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/48e72210-a669-440e-b42e-81b0c15746ec.png" alt="The full OIDC authentication flow for GitHub Actions deploying to EKS — from minting the JWT token through AssumeRoleWithWebIdentity to temporary credentials, kubeconfig retrieval, and final kubectl apply steps." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">
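<p>To make the exchange described above concrete, here is roughly what happens inside a workflow step. It is a sketch only, and it assumes the job has the <code>id-token: write</code> permission and that <code>python3</code> and the AWS CLI are available on the runner. The helper name and the session name are hypothetical. In practice, an action such as <code>aws-actions/configure-aws-credentials</code> usually performs this exchange for you.</p>
<pre><code class="language-bash"># Hypothetical helper: swap the workflow's OIDC token for temporary AWS
# credentials. The two ACTIONS_ID_TOKEN_* variables are set by the runner
# when the job has "id-token: write" permission. $1 = IAM role ARN.
assume_role_with_oidc() {
  token=$(curl -s \
    -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
    "$ACTIONS_ID_TOKEN_REQUEST_URL&amp;audience=sts.amazonaws.com" |
    python3 -c 'import json, sys; print(json.load(sys.stdin)["value"])')
  aws sts assume-role-with-web-identity \
    --role-arn "$1" \
    --role-session-name github-actions \
    --web-identity-token "$token"
}
</code></pre>
<p>The response contains an access key, a secret key, and a session token that expire on their own, so there is nothing long-lived to store or revoke.</p>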

<h3 id="heading-step-1-create-the-iam-role-with-terraform">Step 1: Create the IAM Role With Terraform</h3>
<p>The <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.</p>
<h3 id="heading-step-2-add-the-role-arn-to-github-repository-secrets">Step 2: Add the Role ARN to GitHub Repository Secrets</h3>
<p>In your GitHub repository:</p>
<ol>
<li><p>Go to Settings → Secrets and variables → Actions</p>
</li>
<li><p>Click New repository secret</p>
</li>
<li><p>Name it <code>AWS_ROLE_ARN</code></p>
</li>
<li><p>Paste the role ARN from the Terraform output</p>
</li>
</ol>
<p>That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.</p>
<h3 id="heading-step-3-configure-terraform-state">Step 3: Configure Terraform State</h3>
<p>For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.</p>
<h3 id="heading-step-4-push-to-main-and-let-workflows-run">Step 4: Push to Main and Let Workflows Run</h3>
<p>After your first spin-up, every push to the <code>main</code> branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use <code>/secrets/compare</code> to validate rotation behavior on the live environment.</p>
<h2 id="heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</h2>
<p>Here's a shortlist of the most common symptoms and their fixes.</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Most Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td><code>ExternalSecret</code> is not syncing</td>
<td>Missing credentials or wrong store reference</td>
<td>Confirm the operator can access AWS Secrets Manager and that <code>secretStoreRef</code> points to the correct store</td>
</tr>
<tr>
<td>Pod is stuck in <code>Pending</code></td>
<td>Missing storage setup for local cluster</td>
<td>For Microk8s, enable the storage add-on</td>
</tr>
<tr>
<td>Env and volume values still differ after rotation</td>
<td>Rotation happened but the pod never restarted</td>
<td>Run <code>kubectl rollout restart</code> or install Reloader</td>
</tr>
<tr>
<td>CRD or API version mismatch</td>
<td>ESO version and manifest <code>apiVersion</code> don't match</td>
<td>Verify the <code>apiVersion</code> for <code>ClusterSecretStore</code> and <code>ExternalSecret</code> match your installed ESO version</td>
</tr>
<tr>
<td>Amazon Elastic Kubernetes Service node group never joins</td>
<td>Networking or IAM permissions for nodes are wrong</td>
<td>Fix internet routing and review the node IAM policy</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-inspect-the-operator-and-the-externalsecret">How to Inspect the Operator and the ExternalSecret</h3>
<p>When something isn't syncing, start with these two commands.</p>
<pre><code class="language-bash"># Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets
</code></pre>
<p>The status conditions on the <code>ExternalSecret</code> resource will usually tell you exactly what failed.</p>
<h3 id="heading-how-to-validate-rotation-from-the-app-side">How to Validate Rotation From the App Side</h3>
<p>When you are debugging rotation, don't rely only on Kubernetes resource state. Use the <code>/secrets/compare</code> endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the <code>ExternalSecret</code> and <code>Deployment</code> manifests, and validated that the application sees the right credentials.</p>
<p>You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.</p>
<p>Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.</p>
<p>The lab repository is at <a href="https://github.com/Osomudeya/k8s-secret-lab">github.com/Osomudeya/k8s-secret-lab</a>. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on Microk8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.</p>
<p>If this helped you, star the repo and share it with someone who is learning Kubernetes.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection ]]>
                </title>
                <description>
                    <![CDATA[ In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it. An attacker had found it, deployed pods inside T ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/</link>
                <guid isPermaLink="false">69c4112310e664c5dac43f41</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Wed, 25 Mar 2026 16:45:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4039b7a4-bb45-4df5-b13b-7414985c1a7e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it.</p>
<p>An attacker had found it, deployed pods inside Tesla's cluster, and was using them to mine cryptocurrency – all on Tesla's AWS bill. The cluster had no authentication on the dashboard, no network restrictions on egress, and nothing monitoring for intrusion. Any one of those controls would have stopped the attack. None of them were in place.</p>
<p>This wasn't a sophisticated zero-day exploit. It was a misconfigured default.</p>
<p>Kubernetes ships with powerful security primitives. The problem is that almost none of them are enabled by default. A fresh cluster is deliberately permissive so it's easy to get started. That permissiveness is a feature in development. In production, it's a liability.</p>
<p>In this handbook, we'll work through the three most impactful security layers in Kubernetes. We'll start with Role-Based Access Control, which governs who can do what to which resources in the API. From there we'll move to pod runtime security, which locks down what containers can actually do once they're running on a node. Finally we'll deploy Falco, a syscall-level detection engine that watches for attacks in progress and alerts in real time.</p>
<p>By the end, you'll have a hardened cluster with working RBAC policies, enforced pod security standards, and live detection rules that fire when something suspicious happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><code>kubectl</code> installed and configured</p>
</li>
<li><p>Docker Desktop or a Linux machine (to run kind)</p>
</li>
<li><p>Basic Kubernetes familiarity – you know what a Pod, Deployment, and Namespace are</p>
</li>
<li><p>No prior security experience needed</p>
</li>
</ul>
<p>All demos run on a local kind cluster. Full YAML and setup scripts are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-kubernetes-threat-landscape">The Kubernetes Threat Landscape</a></p>
</li>
<li><p><a href="#heading-what-youll-build">What You'll Build</a></p>
</li>
<li><p><a href="#heading-demo-1-run-a-cluster-security-baseline-with-kube-bench">Demo 1: Run a Cluster Security Baseline with kube-bench</a></p>
</li>
<li><p><a href="#heading-how-to-configure-rbac">How to Configure RBAC</a></p>
<ul>
<li><p><a href="#heading-the-four-rbac-objects">The Four RBAC Objects</a></p>
</li>
<li><p><a href="#heading-how-to-discover-resources-verbs-and-api-groups">How to Discover Resources, Verbs, and API Groups</a></p>
</li>
<li><p><a href="#heading-roles-and-clusterroles">Roles and ClusterRoles</a></p>
</li>
<li><p><a href="#heading-rolebindings-and-clusterrolebindings">RoleBindings and ClusterRoleBindings</a></p>
</li>
<li><p><a href="#heading-how-to-use-service-accounts-safely">How to Use Service Accounts Safely</a></p>
</li>
<li><p><a href="#heading-how-to-audit-your-rbac-configuration">How to Audit Your RBAC Configuration</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2-build-a-least-privilege-rbac-policy-for-a-ci-pipeline">Demo 2: Build a Least-Privilege RBAC Policy for a CI Pipeline</a></p>
</li>
<li><p><a href="#heading-demo-3-audit-rbac-with-rakkess-and-rbac-lookup">Demo 3: Audit RBAC with rakkess and rbac-lookup</a></p>
</li>
<li><p><a href="#heading-how-to-harden-pod-runtime-security">How to Harden Pod Runtime Security</a></p>
<ul>
<li><p><a href="#heading-pod-security-admission">Pod Security Admission</a></p>
</li>
<li><p><a href="#heading-how-to-configure-securitycontext">How to Configure securityContext</a></p>
</li>
<li><p><a href="#heading-opagatekeeper-vs-kyverno">OPA/Gatekeeper vs Kyverno</a></p>
</li>
<li><p><a href="#heading-how-to-detect-runtime-threats-with-falco">How to Detect Runtime Threats with Falco</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-4-harden-a-pod-with-securitycontext">Demo 4: Harden a Pod with securityContext</a></p>
</li>
<li><p><a href="#heading-demo-5-deploy-falco-and-write-a-custom-detection-rule">Demo 5: Deploy Falco and Write a Custom Detection Rule</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-kubernetes-threat-landscape">The Kubernetes Threat Landscape</h2>
<p>To understand what you're defending against, you need to understand where Kubernetes exposes attack surface. There are six main areas, and most production incidents trace back to at least one of them.</p>
<p>The <strong>API server</strong> is the front door to your cluster. Every <code>kubectl</code> command, every CI deploy, and every controller reconciliation loop sends requests here. Unauthenticated or over-privileged access to the API server is effectively game over: an attacker who can talk to it can create pods, read secrets, and modify workloads freely.</p>
<p><strong>etcd</strong> is the key-value store where all cluster state lives, including your Secrets. Kubernetes Secrets are base64-encoded by default, not encrypted. Anyone with direct access to etcd can read every password, token, and certificate in the cluster without going through the API server at all.</p>
<p>The <strong>kubelet</strong> runs on each node and manages the pods assigned to it. If its API is reachable without authentication – which is the default on older clusters – an attacker can exec into any pod on that node and read its memory without ever touching the API server.</p>
<p>The <strong>container runtime</strong> is the layer that actually runs your containers. A container that escapes its isolation boundary lands directly in the host OS. A privileged container with <code>hostPID: true</code> can read the memory of every other process on the node, including other containers.</p>
<p>Your <strong>supply chain</strong> (base images, third-party dependencies, Helm charts, operators) is a potential entry point at every step. The XZ Utils backdoor discovered in 2024 showed how close a well-positioned supply chain attack can come to widespread infrastructure compromise.</p>
<p>Finally, the <strong>network</strong>: by default, every pod in a Kubernetes cluster can reach every other pod on any port. There are no internal firewalls between workloads unless you explicitly create them with NetworkPolicy.</p>
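<p>To make the network point concrete, here is a minimal sketch of a default-deny policy. The <code>staging</code> namespace and the policy name are assumptions for illustration, and enforcement depends on your CNI plugin supporting NetworkPolicy:</p>
<pre><code class="language-yaml"># default-deny.yaml (illustrative name)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: staging        # assumed namespace
spec:
  podSelector: {}           # empty selector = every pod in the namespace
  policyTypes:
    - Ingress               # no ingress rules listed, so all inbound traffic is denied
    - Egress                # no egress rules listed, so all outbound traffic is denied
</code></pre>
<p>With a policy like this in place, traffic is only allowed where an additional NetworkPolicy explicitly permits it. Note that denying all egress also blocks DNS lookups until you add a rule for them.</p>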
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/2e49d975-4f69-4d14-9646-76c6ec377115.png" alt="Kubernetes threat landscape" style="display:block;margin:0 auto" width="4079" height="980" loading="lazy">

<h3 id="heading-real-world-breaches">Real-World Breaches</h3>
<p>These three incidents are worth understanding before you write a single line of YAML. They're not theoretical. They're documented incidents from real production environments.</p>
<table>
<thead>
<tr>
<th>Incident</th>
<th>Year</th>
<th>Root cause</th>
<th>What was missing</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tesla cryptomining</strong></td>
<td>2018</td>
<td>Kubernetes dashboard exposed with no authentication; unrestricted egress</td>
<td>RBAC on the dashboard endpoint + default-deny NetworkPolicy</td>
</tr>
<tr>
<td><strong>Capital One data breach</strong></td>
<td>2019</td>
<td>SSRF vulnerability in a WAF let an attacker reach the EC2 metadata API, which returned credentials for an over-privileged IAM role</td>
<td>Pod-level IAM restrictions (IRSA) + blocking metadata API egress</td>
</tr>
<tr>
<td><strong>Shopify bug bounty (Kubernetes)</strong></td>
<td>2021</td>
<td>A researcher accessed internal Kubernetes metadata through a misconfigured internal service, exposing pod environment variables containing secrets</td>
<td>Secret management outside environment variables + network segmentation</td>
</tr>
</tbody></table>
<p>The pattern across all three: not zero-day exploits, but misconfigured defaults and missing controls that should have been standard practice.</p>
<p>This article addresses the RBAC and pod security gaps directly.</p>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>Before the first command, here is the security posture you'll have by the end of this article:</p>
<p>You'll start by running kube-bench to get a CIS Benchmark baseline – a concrete score showing where a default cluster stands before any hardening. From there you'll build a least-privilege RBAC policy for a CI pipeline service account and verify its permission boundaries, then audit the full cluster to confirm no over-privileged accounts exist.</p>
<p>On the pod security side, you'll enforce the <code>restricted</code> Pod Security Admission profile on your workload namespace and apply a hardened <code>securityContext</code> to a deployment: non-root user, read-only root filesystem, dropped capabilities, and seccomp profile. To close out, you'll deploy Falco in eBPF mode with a custom detection rule that fires when suspicious tools are run inside a container.</p>
<p>Start to finish, with a kind cluster already running, the demos take about 45–60 minutes.</p>
<h2 id="heading-demo-1-run-a-cluster-security-baseline-with-kube-bench">Demo 1: Run a Cluster Security Baseline with kube-bench</h2>
<p>Before hardening anything, it's a good idea to measure where you are. <a href="https://github.com/aquasecurity/kube-bench">kube-bench</a> runs the CIS Kubernetes Benchmark against your cluster and reports which checks pass and which fail. A baseline run gives you a concrete picture of your cluster's default security posture – and a reference point you can re-run after applying any hardening changes.</p>
<h3 id="heading-step-1-create-a-kind-cluster">Step 1: Create a kind cluster</h3>
<p>Save the following as <code>kind-config.yaml</code>:</p>
<pre><code class="language-yaml"># kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
</code></pre>
<pre><code class="language-bash">kind create cluster --name k8s-security --config kind-config.yaml
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Creating cluster "k8s-security" ...
 ✓ Ensuring node image (kindest/node:v1.29.0) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-k8s-security"
</code></pre>
<h3 id="heading-step-2-run-kube-bench">Step 2: Run kube-bench</h3>
<p>kube-bench runs as a Job inside the cluster, mounting the host filesystem to inspect Kubernetes configuration files and processes:</p>
<pre><code class="language-bash">kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench
</code></pre>
<p>The output is long. Scroll to the summary at the bottom:</p>
<pre><code class="language-plaintext">== Summary master ==
0 checks PASS
11 checks FAIL
 9 checks WARN
 0 checks INFO

== Summary node ==
17 checks PASS
 2 checks FAIL
40 checks WARN
 0 checks INFO
</code></pre>
<p>In the run above, 13 checks fail: 11 on the control plane and 2 on the worker node. The exact count varies with the kind and kube-bench versions. Three of the most important failures explain why defaults are a problem:</p>
<table>
<thead>
<tr>
<th>Check ID</th>
<th>Description</th>
<th>Why it matters</th>
</tr>
</thead>
<tbody><tr>
<td><strong>1.2.1</strong></td>
<td><code>--anonymous-auth</code> is not set to false on the API server</td>
<td>Anonymous requests can reach the API server without authentication, the same class of unauthenticated access that exposed the Tesla dashboard</td>
</tr>
<tr>
<td><strong>1.2.6</strong></td>
<td><code>--kubelet-certificate-authority</code> is not set</td>
<td>The API server cannot verify kubelet identity, enabling man-in-the-middle attacks between the control plane and nodes</td>
</tr>
<tr>
<td><strong>4.2.6</strong></td>
<td><code>--protect-kernel-defaults</code> is not set on the kubelet</td>
<td>Without it, the kubelet adjusts kernel tunables to match its own expectations instead of failing when they differ, so a node can run with kernel settings you did not harden or intend</td>
</tr>
</tbody></table>
<p><strong>Note:</strong> Some kube-bench findings are expected on kind because kind is a development tool, not a production-hardened environment. The important thing is to understand what each finding means and whether it applies to your target production setup.</p>
<p>Delete the Job when you're done:</p>
<pre><code class="language-bash">kubectl delete job kube-bench
</code></pre>
<p>Now that you have a baseline, you know what you're starting from. The next step is to work through the most impactful control on that list: access control. RBAC governs every interaction with the Kubernetes API, and getting it right is the foundation everything else builds on.</p>
<h2 id="heading-how-to-configure-rbac">How to Configure RBAC</h2>
<p>Role-Based Access Control is the authorisation layer in Kubernetes. Every request that reaches the API server – from <code>kubectl</code>, from a pod, from a controller – is checked against RBAC rules after authentication succeeds. If there is no rule that explicitly allows the action, Kubernetes denies it.</p>
<p>The key word is "explicitly". RBAC in Kubernetes is additive only. There is no <code>deny</code> rule. You grant access by creating rules, and you remove access by deleting them. This makes the mental model clean: if a subject can do something, you gave it permission to do that thing.</p>
<h3 id="heading-a-brief-case-study-the-shopify-kubernetes-misconfiguration">A Brief Case Study: The Shopify Kubernetes Misconfiguration</h3>
<p>In 2021, security researcher Silas Cutler discovered that a Shopify internal service exposed Kubernetes metadata through an SSRF vulnerability. The metadata included pod environment variables that contained secrets. The root cause was partly RBAC: the service's service account had broader cluster access than it needed, and there was no least-privilege review process.</p>
<p>Shopify paid a $25,000 bug bounty and fixed the issue. The lesson is straightforward: a service account should only have the permissions it needs to do its specific job. Nothing more.</p>
<p>This is the principle you'll apply in Demo 2.</p>
<h3 id="heading-the-four-rbac-objects">The Four RBAC Objects</h3>
<p>RBAC in Kubernetes is built from four API objects. Two define permissions, two bind those permissions to subjects:</p>
<table>
<thead>
<tr>
<th>Object</th>
<th>Scope</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>Role</code></td>
<td>Namespace</td>
<td>Defines a set of permissions within one namespace</td>
</tr>
<tr>
<td><code>ClusterRole</code></td>
<td>Cluster-wide</td>
<td>Defines permissions across all namespaces, or for cluster-scoped resources like Nodes</td>
</tr>
<tr>
<td><code>RoleBinding</code></td>
<td>Namespace</td>
<td>Grants the permissions of a Role or ClusterRole to a subject, within one namespace</td>
</tr>
<tr>
<td><code>ClusterRoleBinding</code></td>
<td>Cluster-wide</td>
<td>Grants the permissions of a ClusterRole to a subject across the entire cluster</td>
</tr>
</tbody></table>
<p>A <strong>subject</strong> is a user, a group, or a service account. Users and groups come from your authentication layer – client certificates, OIDC tokens, or cloud provider identity. Service accounts are Kubernetes-native identities created for pods.</p>
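<p>As a sketch of how the three subject kinds look in a binding, the manifest below lists one of each. The user and group names are hypothetical and would come from your authentication layer. The service account and namespace match the ones used later in this article:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-subjects            # hypothetical name
  namespace: staging
subjects:
  - kind: User
    name: jane@example.com          # hypothetical user from certificates or OIDC
    apiGroup: rbac.authorization.k8s.io
  - kind: Group
    name: platform-team             # hypothetical group from the identity provider
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: ci-pipeline               # Kubernetes-native identity
    namespace: staging              # service accounts are namespaced, so the namespace is required
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Users and groups take an <code>apiGroup</code> and no namespace. Service accounts take a namespace and no <code>apiGroup</code>.</p>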
<h3 id="heading-how-to-discover-resources-verbs-and-api-groups">How to Discover Resources, Verbs, and API Groups</h3>
<p>Before you can write a <code>Role</code>, you need to know three things: the resource name, the API group it belongs to, and the verbs it supports. You shouldn't have to guess any of them – <code>kubectl</code> can tell you everything.</p>
<h4 id="heading-list-all-available-resources-and-their-api-groups">List all available resources and their API groups</h4>
<pre><code class="language-bash">kubectl api-resources
</code></pre>
<p>Partial output:</p>
<pre><code class="language-plaintext">NAME                    SHORTNAMES  APIVERSION                     NAMESPACED  KIND
bindings                            v1                             true        Binding
configmaps              cm          v1                             true        ConfigMap
endpoints               ep          v1                             true        Endpoints
events                  ev          v1                             true        Event
namespaces              ns          v1                             false       Namespace
nodes                   no          v1                             false       Node
pods                    po          v1                             true        Pod
secrets                             v1                             true        Secret
serviceaccounts         sa          v1                             true        ServiceAccount
services                svc         v1                             true        Service
deployments             deploy      apps/v1                        true        Deployment
replicasets             rs          apps/v1                        true        ReplicaSet
statefulsets            sts         apps/v1                        true        StatefulSet
cronjobs                cj          batch/v1                       true        CronJob
jobs                                batch/v1                       true        Job
ingresses               ing         networking.k8s.io/v1           true        Ingress
networkpolicies         netpol      networking.k8s.io/v1           true        NetworkPolicy
clusterroles                        rbac.authorization.k8s.io/v1   false       ClusterRole
roles                               rbac.authorization.k8s.io/v1   true        Role
</code></pre>
<p>The <code>APIVERSION</code> column tells you what to put in <code>apiGroups</code>: strip the version suffix and use only the group part.</p>
<table>
<thead>
<tr>
<th>APIVERSION in output</th>
<th>apiGroups value in Role</th>
</tr>
</thead>
<tbody><tr>
<td><code>v1</code></td>
<td><code>""</code> (empty string – the core group)</td>
</tr>
<tr>
<td><code>apps/v1</code></td>
<td><code>"apps"</code></td>
</tr>
<tr>
<td><code>batch/v1</code></td>
<td><code>"batch"</code></td>
</tr>
<tr>
<td><code>networking.k8s.io/v1</code></td>
<td><code>"networking.k8s.io"</code></td>
</tr>
<tr>
<td><code>rbac.authorization.k8s.io/v1</code></td>
<td><code>"rbac.authorization.k8s.io"</code></td>
</tr>
</tbody></table>
<p>The <code>NAMESPACED</code> column tells you whether a <code>Role</code> can grant access to the resource (namespaced resources) or whether a <code>ClusterRole</code> is required (non-namespaced resources like <code>nodes</code>).</p>
<h4 id="heading-filter-by-api-group">Filter by API group</h4>
<p>If you want to see only resources in a specific group, for example, everything in <code>apps</code>:</p>
<pre><code class="language-bash">kubectl api-resources --api-group=apps
</code></pre>
<pre><code class="language-plaintext">NAME                  SHORTNAMES  APIVERSION  NAMESPACED  KIND
controllerrevisions               apps/v1     true        ControllerRevision
daemonsets            ds          apps/v1     true        DaemonSet
deployments           deploy      apps/v1     true        Deployment
replicasets           rs          apps/v1     true        ReplicaSet
statefulsets          sts         apps/v1     true        StatefulSet
</code></pre>
<h4 id="heading-list-all-verbs-for-a-specific-resource">List all verbs for a specific resource</h4>
<p>Each resource supports a different set of verbs. To see exactly which verbs a resource supports, use <code>kubectl api-resources</code> with <code>-o wide</code> and look at the <code>VERBS</code> column:</p>
<pre><code class="language-bash">kubectl api-resources -o wide | grep -E "^NAME|^pods "
</code></pre>
<pre><code class="language-plaintext">NAME  SHORTNAMES  APIVERSION  NAMESPACED  KIND  VERBS
pods  po          v1          true        Pod   create,delete,deletecollection,get,list,patch,update,watch
</code></pre>
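<p>To confirm the resource's kind and API version and read its field documentation, use <code>kubectl explain</code> (it does not list verbs):</p>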
<pre><code class="language-bash">kubectl explain pod --api-version=v1 | head -10
</code></pre>
<p>The standard verbs you will use most often in RBAC rules are:</p>
<table>
<thead>
<tr>
<th>Verb</th>
<th>What it allows</th>
</tr>
</thead>
<tbody><tr>
<td><code>get</code></td>
<td>Read a single named resource: <code>kubectl get pod my-pod</code></td>
</tr>
<tr>
<td><code>list</code></td>
<td>Read all resources of a type: <code>kubectl get pods</code></td>
</tr>
<tr>
<td><code>watch</code></td>
<td>Stream changes to resources: used by controllers and informers</td>
</tr>
<tr>
<td><code>create</code></td>
<td>Create a new resource</td>
</tr>
<tr>
<td><code>update</code></td>
<td>Replace an existing resource (<code>kubectl replace</code>)</td>
</tr>
<tr>
<td><code>patch</code></td>
<td>Partially modify a resource (<code>kubectl patch</code>, and <code>kubectl apply</code> on an existing object)</td>
</tr>
<tr>
<td><code>delete</code></td>
<td>Delete a single resource</td>
</tr>
<tr>
<td><code>deletecollection</code></td>
<td>Delete all resources of a type in a namespace</td>
</tr>
</tbody></table>
<p><code>kubectl exec</code>, <code>kubectl port-forward</code>, <code>kubectl logs</code>, and API proxying are not verbs. They are authorised through pod <strong>subresources</strong> (<code>pods/exec</code>, <code>pods/portforward</code>, <code>pods/log</code>, <code>pods/proxy</code>) combined with a standard verb. Reading logs needs <code>get</code> on <code>pods/log</code>, and <code>kubectl exec</code> needs <code>create</code> on <code>pods/exec</code> (WebSocket-based clients may need <code>get</code> as well).</p>
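<p>Operations such as reading logs or opening an exec session are granted through pod subresources rather than dedicated verbs. A minimal sketch, assuming a hypothetical <code>pod-debugger</code> Role in the <code>staging</code> namespace:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-debugger               # hypothetical name
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]         # find the pod first
  - apiGroups: [""]
    resources: ["pods/log"]        # subresource behind kubectl logs
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["pods/exec"]       # subresource behind kubectl exec
    verbs: ["create", "get"]       # create for SPDY clients, get for WebSocket clients
</code></pre>
<p>Granting <code>pods</code> alone does not include its subresources. Each one has to be listed explicitly, which makes it easy to allow log access without allowing exec.</p>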
<p><strong>Important:</strong> <code>get</code> and <code>list</code> are separate verbs. Granting <code>list</code> on <code>secrets</code> lets a subject enumerate every secret name and value in a namespace, even if you didn't also grant <code>get</code>. Always think about both when working with sensitive resources like <code>secrets</code>, <code>serviceaccounts</code>, and <code>configmaps</code>.</p>
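<p>One way to avoid granting <code>list</code> on secrets at all is to allow <code>get</code> on a single named secret with <code>resourceNames</code>. This is a sketch under assumptions: the Role name and the secret name <code>app-db-credentials</code> are hypothetical.</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-secret-reader                    # hypothetical name
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["app-db-credentials"]    # hypothetical secret; only this object is readable
    verbs: ["get"]                           # no list or watch, so other secrets stay hidden
</code></pre>
<p>A subject bound to this Role can read that one secret by name but cannot enumerate the rest of the namespace.</p>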
<h4 id="heading-look-up-a-resources-group-with-kubectl-explain">Look up a resource's group with kubectl explain</h4>
<p>If you already know the resource name but aren't sure of its group, <code>kubectl explain</code> tells you:</p>
<pre><code class="language-bash">kubectl explain deployment
</code></pre>
<pre><code class="language-plaintext">GROUP:      apps
KIND:       Deployment
VERSION:    v1
...
</code></pre>
<pre><code class="language-bash">kubectl explain ingress
</code></pre>
<pre><code class="language-plaintext">GROUP:      networking.k8s.io
KIND:       Ingress
VERSION:    v1
...
</code></pre>
<p>This is the fastest way to look up the <code>apiGroups</code> value for any resource when writing a Role.</p>
<h4 id="heading-a-complete-lookup-workflow">A complete lookup workflow</h4>
<p>Here is the practical workflow when writing a new Role from scratch:</p>
<pre><code class="language-bash"># 1. Find the resource name and API group
kubectl api-resources | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment

# 2. Find the verbs it supports
kubectl api-resources -o wide | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment   create,delete,...,get,list,patch,update,watch

# 3. Write the Role using the group (strip the version) and the verbs you need
</code></pre>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: staging
rules:
  - apiGroups: ["apps"]       # from: apps/v1 → strip /v1
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>With this workflow, you never have to guess an API group or verb. You look it up, then write the minimal rule you need.</p>
<h3 id="heading-roles-and-clusterroles">Roles and ClusterRoles</h3>
<p>A <code>Role</code> defines which verbs are allowed on which resources. Here is a Role that grants read-only access to Pods and ConfigMaps inside the <code>staging</code> namespace:</p>
<pre><code class="language-yaml"># role-ci-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]          # "" = the core API group (Pods, Services, Secrets, ConfigMaps)
    resources: ["pods", "configmaps"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>The <code>apiGroups</code> field tells Kubernetes which API group owns the resource. The core group uses an empty string <code>""</code>. Apps-level resources like Deployments use <code>"apps"</code>. Resources in other named groups use that group's name, such as <code>"networking.k8s.io"</code> for Ingress and NetworkPolicy. Custom resources use the group defined in their CustomResourceDefinition.</p>
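<p>A single Role can hold one rule per API group. The following sketch, with a hypothetical <code>workload-viewer</code> name, combines read-only rules for the core, <code>apps</code>, and <code>networking.k8s.io</code> groups:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-viewer            # hypothetical name
  namespace: staging
rules:
  - apiGroups: [""]                # core group
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses", "networkpolicies"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>Keeping each group in its own rule avoids accidentally granting a verb on a resource that happens to share a name across groups.</p>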
<p>A <code>ClusterRole</code> is structurally identical but omits the namespace and can reference cluster-scoped resources like Nodes and PersistentVolumes:</p>
<pre><code class="language-yaml"># clusterrole-node-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader    # no namespace field
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
</code></pre>
<h4 id="heading-when-to-use-which">When to use which:</h4>
<p>Use a <code>Role</code> when the permission is specific to one namespace. A compromised service account can only affect that namespace: the blast radius is contained. Use a <code>ClusterRole</code> when you need access to cluster-scoped resources, or when you want a reusable permission template that multiple namespaces can share.</p>
<p>A common mistake is reaching for a <code>ClusterRole</code> "just to be safe" because it's easier to configure. Namespace-scoped <code>Roles</code> are almost always the right default.</p>
<h3 id="heading-rolebindings-and-clusterrolebindings">RoleBindings and ClusterRoleBindings</h3>
<p>A Role by itself does nothing. You need a binding to attach it to a subject. Here is a <code>RoleBinding</code> that grants the <code>ci-reader</code> Role to the <code>ci-pipeline</code> service account:</p>
<pre><code class="language-yaml"># rolebinding-ci.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline       # the service account name
    namespace: staging      # the namespace the SA lives in
roleRef:
  kind: Role
  name: ci-reader           # must match the Role name exactly
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>There is a useful pattern worth knowing: you can bind a <code>ClusterRole</code> using a <code>RoleBinding</code>. This creates namespace-scoped access using a reusable permission template. The <code>ClusterRole</code> defines the rules, while the <code>RoleBinding</code> constrains those rules to a single namespace.</p>
<pre><code class="language-yaml"># RoleBinding referencing a ClusterRole — scoped to one namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: ClusterRole          # ClusterRole, but bound to one namespace via RoleBinding
  name: view                 # Kubernetes built-in ClusterRole: read-only access to most resources
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Kubernetes ships with several useful built-in ClusterRoles: <code>view</code> (read-only access to most resources, excluding Secrets), <code>edit</code> (read/write to most resources), <code>admin</code> (full namespace admin), and <code>cluster-admin</code> (full cluster admin). Use them rather than reinventing them.</p>
<h3 id="heading-how-to-use-service-accounts-safely">How to Use Service Accounts Safely</h3>
<p>Every pod in Kubernetes runs as a service account. If you don't specify one, Kubernetes uses the <code>default</code> service account in that namespace.</p>
<p>The default service account starts with almost no permissions (only basic API discovery), but it still has a token automatically mounted into every pod at <code>/var/run/secrets/kubernetes.io/serviceaccount/token</code>. This means every container in your cluster can authenticate to the API server by default, even if it has nothing useful to do there.</p>
<p>The single most impactful change you can make is to disable this automatic token mounting on service accounts that don't need API access:</p>
<pre><code class="language-yaml"># serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
automountServiceAccountToken: false   # no token mounted into pods by default
</code></pre>
<p>You can also control it at the pod level:</p>
<pre><code class="language-yaml">spec:
  automountServiceAccountToken: false   # override at pod level
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-app:1.0
</code></pre>
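<p>The pod-level field also works in the other direction. If a service account disables automounting, a pod that needs API access can opt back in, because the pod spec takes precedence over the service account. A minimal sketch, where the pod name and image are hypothetical:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: ci-runner                           # hypothetical pod name
  namespace: staging
spec:
  serviceAccountName: ci-pipeline
  automountServiceAccountToken: true        # pod-level value overrides the service account's setting
  containers:
    - name: runner
      image: registry.example.com/ci-runner:1.0   # hypothetical image
</code></pre>
<p>This keeps "no token" as the default and makes every pod that talks to the API server an explicit, reviewable exception.</p>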
<h4 id="heading-the-cluster-admin-anti-pattern">The cluster-admin anti-pattern:</h4>
<p>Never bind <code>cluster-admin</code> to a service account that runs in a pod. <code>cluster-admin</code> grants full read/write access to every resource in the cluster. An attacker who compromises a pod running as <code>cluster-admin</code> owns your cluster completely.</p>
<p>You will see this in Helm charts and tutorials because it "makes things work". It works because it disables the entire authorisation layer. That is not a solution – it's a ticking clock.</p>
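<p>For reference when reviewing charts, the anti-pattern usually looks like the binding below. This is an example of what to reject, not something to apply. The binding name is hypothetical:</p>
<pre><code class="language-yaml"># ANTI-PATTERN: do not apply
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-app-admin              # hypothetical name
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: production
roleRef:
  kind: ClusterRole
  name: cluster-admin             # full access to every resource in every namespace
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>If you find a binding like this, replace it with a namespace-scoped Role that lists only the resources and verbs the workload uses.</p>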
<p>The Capital One breach is a direct example of this pattern at the cloud layer: an EC2 instance role had permissions far beyond what the application needed. The SSRF vulnerability was the initial foothold. The over-privileged role was what turned a minor bug into an $80 million fine.</p>
<h3 id="heading-how-to-audit-your-rbac-configuration">How to Audit Your RBAC Configuration</h3>
<p>The <code>kubectl auth can-i</code> command lets you check permissions for any subject. Use <code>--as</code> to impersonate a service account:</p>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

# These should return 'yes'
kubectl auth can-i list pods        --namespace staging --as $SA
kubectl auth can-i get  configmaps  --namespace staging --as $SA

# These should return 'no'
kubectl auth can-i delete pods      --namespace staging --as $SA
kubectl auth can-i get  secrets     --namespace staging --as $SA
kubectl auth can-i list pods        --namespace production --as $SA
</code></pre>
<p>To list every permission a subject has in a namespace:</p>
<pre><code class="language-bash">kubectl auth can-i --list \
  --namespace staging \
  --as system:serviceaccount:staging:ci-pipeline
</code></pre>
<p>For a matrix view, install <a href="https://github.com/corneliusweig/rakkess">rakkess</a>, which is distributed through krew as the <code>access-matrix</code> plugin:</p>
<pre><code class="language-bash">kubectl krew install access-matrix

# Which subjects can access pods in staging, and with which verbs
kubectl access-matrix resource pods --namespace staging \
  --verbs get,list,watch,create,update,patch,delete
</code></pre>
<p>Example output (simplified):</p>
<pre><code class="language-plaintext">NAME          GET  LIST  WATCH  CREATE  UPDATE  PATCH  DELETE
ci-pipeline    ✓    ✓     ✓      ✗       ✗       ✗      ✗
default        ✗    ✗     ✗      ✗       ✗       ✗      ✗
monitoring     ✓    ✓     ✓      ✗       ✗       ✗      ✗
</code></pre>
<p>If you see <code>✓</code> in the CREATE, UPDATE, PATCH, or DELETE columns for a service account that should only read, that's a finding that needs remediation.</p>
<p>⚠️ <strong>The wildcard danger:</strong> The most dangerous RBAC configuration is a wildcard on all three dimensions:</p>
<pre><code class="language-yaml">apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]
</code></pre>
<p>This is functionally identical to <code>cluster-admin</code>. You will find it in Helm charts for controllers installed with "convenience" permissions. Always audit third-party RBAC before installing operators into a production cluster.</p>
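<p>As a rough illustration of the alternative, a controller that only watches Deployments and writes back their status could be given a rule set like the one below. The controller name and the exact resources are hypothetical. A real chart's needs have to be read from its documentation or source.</p>
<pre><code class="language-yaml"># scoped-controller-role.yaml (hypothetical example, not taken from any real chart)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-controller          # hypothetical name
rules:
  # Read-only access to Deployments in the apps API group
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
  # Write access limited to the status subresource
  - apiGroups: ["apps"]
    resources: ["deployments/status"]
    verbs: ["update", "patch"]
</code></pre>
<p>Every rule names an API group, a resource, and a verb explicitly, so anything not listed stays denied.</p>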
<h2 id="heading-demo-2-build-a-least-privilege-rbac-policy-for-a-ci-pipeline">Demo 2 – Build a Least-Privilege RBAC Policy for a CI Pipeline</h2>
<p>In this demo, you'll create a service account for a CI pipeline that can list pods and read configmaps in the <code>staging</code> namespace – and nothing else.</p>
<h3 id="heading-step-1-create-the-namespace-and-service-account">Step 1: Create the namespace and service account</h3>
<pre><code class="language-bash">kubectl create namespace staging
</code></pre>
<pre><code class="language-yaml"># ci-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-pipeline
  namespace: staging
automountServiceAccountToken: false
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-serviceaccount.yaml
</code></pre>
<h3 id="heading-step-2-create-the-role">Step 2: Create the Role</h3>
<pre><code class="language-yaml"># ci-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-role.yaml
</code></pre>
<h3 id="heading-step-3-bind-the-role-to-the-service-account">Step 3: Bind the Role to the service account</h3>
<pre><code class="language-yaml"># ci-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: Role
  name: ci-reader
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-rolebinding.yaml
</code></pre>
<h3 id="heading-step-4-test-allowed-operations">Step 4: Test allowed operations</h3>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

kubectl auth can-i list pods       --namespace staging     --as $SA   # yes
kubectl auth can-i get  pods       --namespace staging     --as $SA   # yes
kubectl auth can-i list configmaps --namespace staging     --as $SA   # yes
</code></pre>
<h3 id="heading-step-5-test-denied-operations">Step 5: Test denied operations</h3>
<pre><code class="language-bash">kubectl auth can-i delete pods       --namespace staging     --as $SA   # no
kubectl auth can-i get  secrets      --namespace staging     --as $SA   # no
kubectl auth can-i list pods         --namespace production  --as $SA   # no
kubectl auth can-i create deployments --namespace staging    --as $SA   # no
</code></pre>
<p>All four should return <code>no</code>. Notice the third test: the <code>ci-reader</code> Role and its RoleBinding exist only in <code>staging</code>, so the service account has no permissions in <code>production</code>. A <code>RoleBinding</code> grants access only within its own namespace. This is by design.</p>
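<p>If the pipeline ever did need read access in <code>production</code>, it would take a separate binding created in that namespace. The sketch below assumes a <code>production</code> namespace exists and that a Role named <code>ci-reader</code> has also been created there. Neither is part of this demo.</p>
<pre><code class="language-yaml"># ci-rolebinding-production.yaml (illustrative only, not applied in this demo)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: production             # the binding lives where access is granted
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging              # the subject may come from another namespace
roleRef:
  kind: Role
  name: ci-reader                   # assumes this Role also exists in production
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>The subject can live in another namespace, but the permissions apply only in the namespace that holds the binding. Access to each namespace is therefore an explicit, reviewable object.</p>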
<p>Writing a least-privilege policy for a service account you control is the easy part. The harder part is auditing what already exists in a cluster. That's what Demo 3 covers.</p>
<h2 id="heading-demo-3-audit-rbac-with-rakkess-and-rbac-lookup">Demo 3 – Audit RBAC with rakkess and rbac-lookup</h2>
<p>Now you'll scan the full cluster to surface any accounts with more permissions than they need.</p>
<h3 id="heading-step-1-install-the-tools">Step 1: Install the tools</h3>
<pre><code class="language-bash">kubectl krew install access-matrix
kubectl krew install rbac-lookup
</code></pre>
<h3 id="heading-step-2-run-rakkess-across-the-cluster">Step 2: Run rakkess across the cluster</h3>
<pre><code class="language-bash"># All service accounts in kube-system
kubectl access-matrix --namespace kube-system

# All ServiceAccounts cluster-wide
kubectl access-matrix
</code></pre>
<h3 id="heading-step-3-find-all-cluster-admin-bindings">Step 3: Find all cluster-admin bindings</h3>
<p>There are two ways subjects get cluster-admin access: via a <code>ClusterRoleBinding</code> (cluster-wide), or via a <code>RoleBinding</code> that references the <code>cluster-admin</code> ClusterRole (namespace-scoped, still dangerous). Check both:</p>
<pre><code class="language-bash"># Find ClusterRoleBindings that grant cluster-admin
kubectl rbac-lookup cluster-admin --kind ClusterRole --output wide
</code></pre>
<p>On a fresh kind cluster this returns:</p>
<pre><code class="language-plaintext">No RBAC Bindings found
</code></pre>
<p>That result is expected, but read it carefully. <code>rbac-lookup</code> matches on subject names (users, groups, and service accounts), and a fresh kind cluster has no subject named <code>cluster-admin</code>. It does not mean that nothing is bound to the role. If a search like this returns entries in your production cluster, each one is a finding worth investigating.</p>
<p>To see who actually holds cluster-level admin access, query the bindings directly:</p>
<pre><code class="language-bash"># Find all ClusterRoleBindings and the subjects they grant
kubectl get clusterrolebindings -o wide
</code></pre>
<pre><code class="language-plaintext">NAME                                                   ROLE                                                                       AGE   USERS                         GROUPS                         SERVICEACCOUNTS
cluster-admin                                          ClusterRole/cluster-admin                                                  10d   system:masters
system:kube-controller-manager                         ClusterRole/system:kube-controller-manager                                 10d
system:kube-scheduler                                  ClusterRole/system:kube-scheduler                                          10d
system:node                                            ClusterRole/system:node                                                    10d
...
</code></pre>
<p>The <code>cluster-admin</code> ClusterRoleBinding grants access to the <code>system:masters</code> group – the group your kubeconfig certificate belongs to. This is expected. Every other binding in this list is worth reviewing to understand what it grants and why.</p>
<p><strong>What to look for:</strong> Any binding where the SERVICEACCOUNTS column is populated with an application service account (not a <code>system:</code> prefixed one) is a potential over-privilege finding. Application pods should never need cluster-admin.</p>
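<p>When an application service account turns up with a cluster-wide binding, one common remediation is to replace it with a namespace-scoped <code>RoleBinding</code> to a narrower role. The sketch below uses the built-in <code>view</code> ClusterRole. The service account name <code>my-app</code> is hypothetical.</p>
<pre><code class="language-yaml"># my-app-view-binding.yaml (hypothetical service account name)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-view
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: my-app                    # hypothetical application service account
    namespace: staging
roleRef:
  kind: ClusterRole                 # a ClusterRole referenced by a RoleBinding
  name: view                        # built-in read-only role; does not include Secrets
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Because the binding is a <code>RoleBinding</code>, the permissions defined in the ClusterRole apply only inside <code>staging</code>.</p>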
<h3 id="heading-step-4-verify-the-ci-pipeline-service-account">Step 4: Verify the ci-pipeline service account</h3>
<pre><code class="language-bash">kubectl rbac-lookup ci-pipeline --kind ServiceAccount --output wide
</code></pre>
<p>Expected output:</p>
<pre><code class="language-bash">SUBJECT                               SCOPE     ROLE             SOURCE
ServiceAccount/staging:ci-pipeline    staging   Role/ci-reader   RoleBinding/ci-reader-binding
</code></pre>
<p>The format of the last two columns is <code>&lt;role-kind&gt;/&lt;role-name&gt; &lt;binding-kind&gt;/&lt;binding-name&gt;</code>. This tells you:</p>
<ul>
<li><p>The service account is bound to the <code>ci-reader</code> Role</p>
</li>
<li><p>The binding is a <code>RoleBinding</code> named <code>ci-reader-binding</code></p>
</li>
<li><p>The role is a namespaced <code>Role</code>, not a <code>ClusterRole</code>, and its scope is the <code>staging</code> namespace</p>
</li>
</ul>
<p>If the output showed <code>ClusterRole/something</code> here, check the SCOPE column. A cluster-wide scope would be a finding: it would mean the service account has cluster-wide permissions, not namespace-scoped ones.</p>
<p><strong>rbac-lookup vs kubectl get:</strong> <code>rbac-lookup</code> gives you a subject-centric view: "what does this account have access to?" <code>kubectl get rolebindings,clusterrolebindings -A</code> gives you a binding-centric view: "what bindings exist in the cluster?" Use both. rbac-lookup is faster for auditing a specific service account, while the <code>kubectl get</code> approach is better for a full cluster inventory.</p>
<p>With RBAC locked down, the API server is protected. But RBAC says nothing about what a container can do once it's running. That's a separate layer entirely.</p>
<h2 id="heading-how-to-harden-pod-runtime-security">How to Harden Pod Runtime Security</h2>
<p>RBAC controls who can talk to the Kubernetes API. Pod security controls what containers can do once they're running on a node. These are different threat vectors: RBAC protects the control plane, pod security protects the data plane.</p>
<p>A container that runs as root with no capability restrictions can, if compromised, write backdoors to the host filesystem, load kernel modules, read the memory of other processes if <code>hostPID: true</code> is set, and in some configurations escape the container entirely. Pod security closes these doors before an attacker can open them.</p>
<h3 id="heading-a-case-study-the-hildegard-malware-campaign">A Case Study: The Hildegard Malware Campaign</h3>
<p>In early 2021, Palo Alto's Unit 42 research team documented a cryptomining malware campaign called Hildegard that specifically targeted Kubernetes clusters. The attack chain was:</p>
<ol>
<li><p>Find a cluster with the kubelet API exposed without authentication</p>
</li>
<li><p>Deploy a privileged pod with <code>hostPID: true</code></p>
</li>
<li><p>Use the privileged pod to read credentials from other containers' memory</p>
</li>
<li><p>Establish persistence by writing to the host filesystem</p>
</li>
</ol>
<p>Steps 3 and 4 would have been impossible if the pods in the cluster had been running with <code>readOnlyRootFilesystem: true</code>, dropped capabilities, and no <code>hostPID</code>. The attacker had the initial foothold. Pod security would have contained the blast radius.</p>
<h3 id="heading-pod-security-admission">Pod Security Admission</h3>
<p>Pod Security Admission (PSA) is the built-in admission controller that enforces pod security standards at the namespace level. It replaced PodSecurityPolicy in Kubernetes 1.25.</p>
<p><strong>Migrating from PSP?</strong> If you're on Kubernetes &lt; 1.25, you may still be using PodSecurityPolicy, which was removed in 1.25. The migration path is: enable PSA in <code>audit</code> mode first to identify violations, fix them workload by workload, then switch to <code>enforce</code>. For policies PSA cannot express, add Kyverno alongside it.</p>
<p>PSA defines three profiles:</p>
<table>
<thead>
<tr>
<th>Profile</th>
<th>Who it's for</th>
<th>What it restricts</th>
</tr>
</thead>
<tbody><tr>
<td><code>privileged</code></td>
<td>System components (CNI plugins, monitoring agents)</td>
<td>Nothing – no restrictions</td>
</tr>
<tr>
<td><code>baseline</code></td>
<td>Most workloads</td>
<td>Blocks known privilege escalations: no <code>hostNetwork</code>, no <code>hostPID</code>, no privileged containers</td>
</tr>
<tr>
<td><code>restricted</code></td>
<td>Security-sensitive workloads</td>
<td>Everything in baseline, plus: must run as non-root, must drop capabilities, must set a seccomp profile</td>
</tr>
</tbody></table>
<p>And three enforcement modes:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Effect</th>
<th>When to use</th>
</tr>
</thead>
<tbody><tr>
<td><code>enforce</code></td>
<td>Rejects pods that violate the profile at admission</td>
<td>Production – once you've fixed violations</td>
</tr>
<tr>
<td><code>audit</code></td>
<td>Allows pods but records violations in the audit log</td>
<td>Migration – see what would break without breaking anything</td>
</tr>
<tr>
<td><code>warn</code></td>
<td>Allows pods but sends a warning to the client</td>
<td>Development – fast feedback in your terminal</td>
</tr>
</tbody></table>
<p>The migration path: start with <code>audit</code> and <code>warn</code> to identify violations, fix them, then switch to <code>enforce</code>. The two modes can run simultaneously.</p>
<p>Apply them as namespace labels:</p>
<pre><code class="language-yaml"># namespace-staging.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    # Start here: audit and warn simultaneously
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
</code></pre>
<p>Once violations are resolved, add enforce:</p>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest
</code></pre>
<p>Note: the command deliberately omits <code>--overwrite</code>. Without it, if <code>enforce</code> is already set to a different value the command will error – which is exactly what you want. You should see:</p>
<pre><code class="language-plaintext">namespace/staging labeled
</code></pre>
<p>If you see <code>namespace/staging not labeled</code>, it means <code>enforce=restricted</code> and <code>enforce-version=latest</code> were already set to those exact values. Confirm enforcement is active:</p>
<pre><code class="language-bash">kubectl get namespace staging --show-labels
</code></pre>
<p>Look for <code>pod-security.kubernetes.io/enforce=restricted</code> in the output. If it's there, enforcement is active.</p>
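<p>If jumping straight to <code>enforce: restricted</code> is too large a step for an existing namespace, an intermediate configuration is possible: enforce <code>baseline</code> while continuing to audit and warn against <code>restricted</code>. This is a sketch of that combination, not a step this tutorial requires.</p>
<pre><code class="language-yaml"># namespace-staging-intermediate.yaml (optional intermediate step)
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    # Reject only the clearly dangerous pods for now
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    # Keep surfacing everything that would fail under restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
</code></pre>
<p>Each mode takes its own profile, so the namespace blocks privileged and host-namespace pods immediately while you work through the remaining <code>restricted</code> violations.</p>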
<h3 id="heading-how-to-configure-securitycontext">How to Configure securityContext</h3>
<p>A <code>securityContext</code> defines the privilege and access control settings for a pod or container. These are the seven fields you should configure on every production workload:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Set at</th>
<th>What it controls</th>
</tr>
</thead>
<tbody><tr>
<td><code>runAsNonRoot</code></td>
<td>Pod</td>
<td>Rejects containers that run as UID 0 (root)</td>
</tr>
<tr>
<td><code>runAsUser</code> / <code>runAsGroup</code></td>
<td>Pod</td>
<td>Sets a specific UID/GID – don't rely on the image default</td>
</tr>
<tr>
<td><code>fsGroup</code></td>
<td>Pod</td>
<td>All mounted volumes are owned by this GID</td>
</tr>
<tr>
<td><code>seccompProfile</code></td>
<td>Pod</td>
<td>Filters syscalls using a seccomp profile</td>
</tr>
<tr>
<td><code>allowPrivilegeEscalation</code></td>
<td>Container</td>
<td>Blocks <code>setuid</code> binaries and <code>sudo</code></td>
</tr>
<tr>
<td><code>readOnlyRootFilesystem</code></td>
<td>Container</td>
<td>Makes the container filesystem read-only</td>
</tr>
<tr>
<td><code>capabilities.drop</code></td>
<td>Container</td>
<td>Removes Linux capabilities (drop <code>ALL</code>, add back only what is needed)</td>
</tr>
</tbody></table>
<p>The annotated YAML below shows all seven in context:</p>
<pre><code class="language-yaml"># secure-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true         # container must run as a non-root user
        runAsUser: 10001           # explicit UID — don't rely on the image's default
        runAsGroup: 10001          # explicit GID
        fsGroup: 10001             # volumes are owned by this group
        seccompProfile:
          type: RuntimeDefault     # use the container runtime's default seccomp profile
      automountServiceAccountToken: false
      containers:
        - name: app
          image: nginx:1.25-alpine
          securityContext:
            allowPrivilegeEscalation: false   # block setuid and sudo inside the container
            readOnlyRootFilesystem: true      # the single highest-impact setting
            capabilities:
              drop:
                - ALL                         # drop every Linux capability
              add: []                         # add back only what is explicitly needed
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
      volumes:
        # nginx needs writable directories — provide them as emptyDir volumes
        - name: tmp
          emptyDir: {}
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}
</code></pre>
<h4 id="heading-why-readonlyrootfilesystem-true-is-the-most-important-setting">Why <code>readOnlyRootFilesystem: true</code> is the most important setting:</h4>
<p>Most post-exploitation techniques require writing to the filesystem. Dropping a backdoor, modifying a binary, writing a cron job, or installing a keylogger all require a writable filesystem. Set <code>readOnlyRootFilesystem: true</code> and these techniques are blocked everywhere except the paths you explicitly make writable.</p>
<p>The downside is that many applications write to directories like <code>/tmp</code> or <code>/var/cache</code>. The fix is to mount <code>emptyDir</code> volumes at those specific paths, as shown above. The rest of the filesystem stays read-only.</p>
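<p>To see the pattern in isolation, here is a minimal sketch of a pod with a read-only root filesystem and a single writable scratch directory. The pod name, the <code>busybox</code> image, the command, and the 64Mi <code>sizeLimit</code> are all assumptions chosen for illustration.</p>
<pre><code class="language-yaml"># tmp-writer.yaml (illustrative pod, not part of the demos)
apiVersion: v1
kind: Pod
metadata:
  name: tmp-writer                  # hypothetical name
  namespace: staging
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: busybox:1.36
      # Writes succeed under /tmp only; the rest of the filesystem is read-only
      command: ["sh", "-c", "echo ok &gt; /tmp/probe &amp;&amp; sleep 3600"]
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 64Mi             # assumed cap on the writable scratch space
</code></pre>
<p>Setting a <code>sizeLimit</code> keeps the writable area small, which limits how much an attacker can stage there.</p>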
<p><strong>What each field prevents:</strong></p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it prevents</th>
</tr>
</thead>
<tbody><tr>
<td><code>runAsNonRoot: true</code></td>
<td>Blocks containers that were built to run as root – the kubelet refuses to start them</td>
</tr>
<tr>
<td><code>runAsUser: 10001</code></td>
<td>Ensures a known, non-privileged UID even if the image doesn't set one</td>
</tr>
<tr>
<td><code>allowPrivilegeEscalation: false</code></td>
<td>Blocks <code>setuid</code> binaries and <code>sudo</code> – the most common privilege escalation path</td>
</tr>
<tr>
<td><code>readOnlyRootFilesystem: true</code></td>
<td>Prevents writing backdoors, modifying binaries, or creating persistence</td>
</tr>
<tr>
<td><code>capabilities: drop: ALL</code></td>
<td>Removes Linux capabilities like <code>NET_RAW</code> (raw socket access) and <code>SYS_ADMIN</code> (kernel operations)</td>
</tr>
<tr>
<td><code>seccompProfile: RuntimeDefault</code></td>
<td>Filters syscalls to a safe default set – blocks a few dozen dangerous syscalls (Docker's default profile disables around 44 of 300+)</td>
</tr>
</tbody></table>
<h3 id="heading-opagatekeeper-vs-kyverno">OPA/Gatekeeper vs Kyverno</h3>
<p>PSA covers the fundamentals. But you'll eventually need policies that PSA cannot express: all images must come from your private registry, all pods must have resource limits, no container may use the <code>latest</code> tag. For these, you need a policy engine.</p>
<p>Two mature options exist:</p>
<table>
<thead>
<tr>
<th></th>
<th>OPA/Gatekeeper</th>
<th>Kyverno</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Policy language</strong></td>
<td>Rego (a custom logic language)</td>
<td>YAML, same format as Kubernetes resources</td>
</tr>
<tr>
<td><strong>Learning curve</strong></td>
<td>Steep: Rego takes real time to learn</td>
<td>Gentle: if you write YAML, you can write policies</td>
</tr>
<tr>
<td><strong>Mutation</strong></td>
<td>Yes, via <code>Assign</code>/<code>AssignMetadata</code></td>
<td>Yes: first-class, well-documented feature</td>
</tr>
<tr>
<td><strong>Audit mode</strong></td>
<td>Yes: reports existing violations</td>
<td>Yes: policy audit mode</td>
</tr>
<tr>
<td><strong>Ecosystem</strong></td>
<td>Integrates with OPA in non-K8s contexts</td>
<td>Kubernetes-native only</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Complex cross-resource logic and teams already using OPA</td>
<td>Teams who want K8s-native syntax and fast setup</td>
</tr>
</tbody></table>
<p>If you're starting fresh, Kyverno gets you to working policies faster. Here is a Kyverno policy that blocks images from outside your trusted registry:</p>
<pre><code class="language-yaml"># kyverno-registry-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.corp.internal/"
        pattern:
          spec:
            containers:
              - image: "registry.corp.internal/*"
</code></pre>
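<p>The other two examples mentioned earlier, no <code>latest</code> tag and mandatory resource limits, can be sketched in the same style. The policy below is modelled on Kyverno's published sample policies. The policy and rule names are illustrative, and it runs in <code>Audit</code> mode so that it reports violations without blocking anything.</p>
<pre><code class="language-yaml"># kyverno-baseline-hygiene.yaml (illustrative policy and rule names)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: image-tag-and-limits
spec:
  validationFailureAction: Audit    # report only; switch to Enforce once clean
  background: true
  rules:
    - name: disallow-latest-tag
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must not use the 'latest' tag."
        pattern:
          spec:
            containers:
              - image: "!*:latest"  # "!" negates the wildcard match
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"       # "?*" means at least one character, i.e. set
                    memory: "?*"
</code></pre>
<p>Like the registry policy above, these patterns only inspect <code>spec.containers</code>. Covering <code>initContainers</code> and <code>ephemeralContainers</code> would need additional rules.</p>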
<h3 id="heading-how-to-detect-runtime-threats-with-falco">How to Detect Runtime Threats with Falco</h3>
<p>PSA and <code>securityContext</code> are preventive controls: they block known-bad configurations before pods start. Falco is a detective control. It watches what containers do while they're running and alerts when something looks wrong.</p>
<p>Falco operates at the syscall level using eBPF. It attaches to the Linux kernel and intercepts every system call made by every container on the node – file opens, network connections, process spawns, privilege escalations. It does this without modifying containers, without injecting sidecars, and with minimal overhead.</p>
<h4 id="heading-what-falco-detects-out-of-the-box">What Falco detects out of the box:</h4>
<p>Falco's default ruleset covers the most common attack patterns. It fires when a shell is opened inside a running container, whether that's a <code>kubectl exec</code> session or a reverse shell from an exploit.</p>
<p>It watches for reads on sensitive files like <code>/etc/shadow</code>, <code>/etc/kubernetes/admin.conf</code>, and <code>/root/.ssh/</code>. It catches the dropper pattern: a binary written to disk and immediately executed. It detects outbound connections to known malicious IPs, writes to <code>/proc</code> or <code>/sys</code> that suggest kernel manipulation, and package managers like <code>apt</code>, <code>yum</code>, or <code>pip</code> being run inside containers that have no business installing software.</p>
<p>Each of these is a rule in Falco's default ruleset. You can extend it with custom rules for your specific workloads – which is exactly what you'll do in Demo 5. But first let's harden the Pod.</p>
<h2 id="heading-demo-4-harden-a-pod-with-securitycontext">Demo 4 – Harden a Pod with securityContext</h2>
<p>In this demo, you'll start with a default nginx deployment, observe the PSA violations it triggers, harden it step by step, and confirm it passes under the <code>restricted</code> profile.</p>
<h3 id="heading-step-1-apply-psa-labels-in-audit-mode">Step 1: Apply PSA labels in audit mode</h3>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
</code></pre>
<h3 id="heading-step-2-deploy-insecure-nginx-and-observe-the-warnings">Step 2: Deploy insecure nginx and observe the warnings</h3>
<pre><code class="language-yaml"># insecure-nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-insecure
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-insecure
  template:
    metadata:
      labels:
        app: nginx-insecure
    spec:
      containers:
        - name: nginx
          image: nginx:1.25-alpine
</code></pre>
<pre><code class="language-bash">kubectl apply -f insecure-nginx.yaml
</code></pre>
<p>Expected output (PSA warns but still creates the deployment in <code>warn</code> mode):</p>
<pre><code class="language-plaintext">Warning: would violate PodSecurity "restricted:latest":
  allowPrivilegeEscalation != false (container "nginx" must set
    securityContext.allowPrivilegeEscalation=false)
  unrestricted capabilities (container "nginx" must set
    securityContext.capabilities.drop=["ALL"])
  runAsNonRoot != true (pod or container "nginx" must set
    securityContext.runAsNonRoot=true)
  seccompProfile not set (pod or container "nginx" must set
    securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/nginx-insecure created
</code></pre>
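<p>Four violations. Every one of them is a real security gap. But the Deployment was still created, as the final line <code>deployment.apps/nginx-insecure created</code> confirms, because <code>warn</code> mode never blocks anything.</p>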
<h3 id="heading-step-3-deploy-the-hardened-version">Step 3: Deploy the hardened version</h3>
<pre><code class="language-bash">kubectl apply -f secure-deployment.yaml   # the YAML from the securityContext section above
</code></pre>
<p>No warnings this time.</p>
<h3 id="heading-step-4-switch-the-namespace-to-enforce">Step 4: Switch the namespace to enforce</h3>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">namespace/staging labeled
</code></pre>
<p>This is the moment enforcement becomes active. Any new pod that violates the <code>restricted</code> profile will be rejected from this point on.</p>
<h3 id="heading-step-5-confirm-insecure-deployments-are-now-rejected">Step 5: Confirm insecure deployments are now rejected</h3>
<pre><code class="language-bash">kubectl delete deployment nginx-insecure -n staging
kubectl apply -f insecure-nginx.yaml
</code></pre>
<p>Expected output:</p>
<pre><code class="language-shell">Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false ...
deployment.apps/nginx-insecure created
</code></pre>
<p>The Deployment object is created. PSA enforces at the <strong>pod</strong> level, not the Deployment level. The Deployment and its ReplicaSet exist, but every attempt to create a pod is rejected. Check the ReplicaSet:</p>
<pre><code class="language-bash">kubectl get replicaset -n staging -l app=nginx-insecure
</code></pre>
<pre><code class="language-plaintext">NAME                       DESIRED   CURRENT   READY   AGE
nginx-insecure-b668d867b   1         0         0       30s
</code></pre>
<p><code>DESIRED=1</code> but <code>CURRENT=0</code>. The ReplicaSet cannot create any pods because they're rejected at admission. Describe the ReplicaSet to see the rejection events:</p>
<pre><code class="language-bash">kubectl describe replicaset -n staging -l app=nginx-insecure
</code></pre>
<pre><code class="language-plaintext">Warning  FailedCreate  ReplicaSet "nginx-insecure-b668d867b" create Pod
  "nginx-insecure-xxx" failed: pods is forbidden: violates PodSecurity
  "restricted:latest": allowPrivilegeEscalation != false, unrestricted
  capabilities, runAsNonRoot != true, seccompProfile not set
</code></pre>
<p>The hardened deployment continues running with its pods intact. The insecure one has zero pods and never will. This is exactly how PSA is supposed to work.</p>
<h3 id="heading-step-6-score-the-hardened-pod-with-kube-score">Step 6: Score the hardened pod with kube-score</h3>
<p><a href="https://github.com/zegl/kube-score">kube-score</a> is a static analysis tool that scores Kubernetes manifests against security and reliability best practices:</p>
<pre><code class="language-bash"># macOS
brew install kube-score
# Linux: https://github.com/zegl/kube-score/releases

kube-score score secure-deployment.yaml -v
</code></pre>
<p>Expected output (abridged):</p>
<pre><code class="language-plaintext">apps/v1/Deployment secure-app in staging 
  path=secure-deployment.yaml
    [OK] Stable version
    [OK] Label values
    [CRITICAL] Container Resources
        · app -&gt; CPU limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.cpu
        · app -&gt; Memory limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.memory
        · app -&gt; CPU request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.cpu
        · app -&gt; Memory request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.memory
    [CRITICAL] Container Image Pull Policy
        · app -&gt; ImagePullPolicy is not set to Always
            It's recommended to always set the ImagePullPolicy to Always, to make sure that the imagePullSecrets are always correct, and to always get the image you want.
    [OK] Pod Probes Identical
    [CRITICAL] Container Ephemeral Storage Request and Limit
        · app -&gt; Ephemeral Storage limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.ephemeral-storage
        · app -&gt; Ephemeral Storage request is not set
            Resource requests are recommended to make sure the application can start and run without crashing. Set resource.requests.ephemeral-storage
    [OK] Environment Variable Key Duplication
    [OK] Container Security Context Privileged
    [OK] Pod Topology Spread Constraints
        · Pod Topology Spread Constraints
            No Pod Topology Spread Constraints set, kube-scheduler defaults assumed
    [OK] Container Image Tag
    [CRITICAL] Pod NetworkPolicy
        · The pod does not have a matching NetworkPolicy
            Create a NetworkPolicy that targets this pod to control who/what can communicate with this pod. Note, this feature needs to be supported by the CNI implementation used in the Kubernetes cluster to have an effect.
    [OK] Container Security Context User Group ID
    [OK] Container Security Context ReadOnlyRootFilesystem
    [CRITICAL] Deployment has PodDisruptionBudget
        · No matching PodDisruptionBudget was found
            It's recommended to define a PodDisruptionBudget to avoid unexpected downtime during Kubernetes maintenance operations, such as when draining a node.
    [WARNING] Deployment has host PodAntiAffinity
        · Deployment does not have a host podAntiAffinity set
            It's recommended to set a podAntiAffinity that stops multiple pods from a deployment from being scheduled on the same node. This increases availability in case the node becomes unavailable.
    [OK] Deployment Pod Selector labels match template metadata labels
</code></pre>
<p>Notice there are no security context violations: <code>securityContext</code>, <code>readOnlyRootFilesystem</code>, <code>seccompProfile</code>, and <code>runAsNonRoot</code> all pass. The remaining findings are about <strong>resource management</strong> (CPU/memory limits, ephemeral storage), <strong>availability</strong> (PodDisruptionBudget, anti-affinity), and <strong>network policy</strong> – not security context hardening. Those are important for production readiness, but they're a separate concern from the pod security hardening we did here.</p>
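<p>If you want to clear the resource-related findings as well, the container entry in <code>secure-deployment.yaml</code> could be extended roughly as follows. This is a fragment, not a complete manifest, and every number in it is an assumed placeholder that you should size from real usage data.</p>
<pre><code class="language-yaml"># Fragment of spec.template.spec in secure-deployment.yaml (values are placeholders)
containers:
  - name: app
    image: nginx:1.25-alpine
    imagePullPolicy: Always         # addresses the Image Pull Policy finding
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
        ephemeral-storage: 64Mi
      limits:
        cpu: 250m
        memory: 128Mi
        ephemeral-storage: 128Mi
    # securityContext and volumeMounts stay exactly as shown earlier
</code></pre>
<p>The NetworkPolicy, PodDisruptionBudget, and anti-affinity findings need separate objects or settings and are not covered by this fragment.</p>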
<p>You now have a pod that PSA accepts and kube-score validates. The next step is to add a detection layer – something that watches what the pod does at runtime, not just how it was configured at admission.</p>
<h2 id="heading-demo-5-deploy-falco-and-write-a-custom-detection-rule">Demo 5 – Deploy Falco and Write a Custom Detection Rule</h2>
<p>Now, you'll deploy Falco in eBPF mode, trigger a default alert, then extend Falco with a custom rule that catches <code>curl</code> and <code>wget</code> being run inside containers.</p>
<h3 id="heading-step-1-install-falco-via-helm">Step 1: Install Falco via Helm</h3>
<pre><code class="language-bash">helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  --wait
</code></pre>
<p>Confirm Falco is running on every node:</p>
<pre><code class="language-shell">kubectl get pods -n falco
</code></pre>
<pre><code class="language-shell">NAME           READY   STATUS    RESTARTS   AGE
falco-x8k2p    1/1     Running   0          45s
falco-m9nqr    1/1     Running   0          45s
falco-j4tpw    1/1     Running   0          45s
</code></pre>
<p>One pod per node. Falco runs as a DaemonSet because it needs to monitor syscalls on every node independently.</p>
<h3 id="heading-step-2-trigger-a-default-alert">Step 2: Trigger a default alert</h3>
<p>Open a second terminal and stream the Falco logs:</p>
<pre><code class="language-shell"># Terminal 2 — watch for alerts
kubectl logs -n falco -l app.kubernetes.io/name=falco -f --max-log-requests 3
</code></pre>
<p>In your first terminal, exec into the secure-app pod:</p>
<pre><code class="language-bash"># Terminal 1 — trigger the shell detection
POD=$(kubectl get pod -n staging -l app=secure-app \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -n staging -- sh
</code></pre>
<p>Within a second, Terminal 2 shows:</p>
<pre><code class="language-plaintext">2024-03-15T14:23:41.456Z: Notice A shell was spawned in a container with an attached terminal
  (user=root user_loginuid=-1 k8s.ns=staging k8s.pod=secure-app-7d9f8b-xxx
   container=app shell=sh parent=runc cmdline=sh terminal=34816)
  rule=Terminal shell in container  priority=NOTICE
  tags=[container, shell, mitre_execution]
</code></pre>
<p>This is Falco's built-in <code>Terminal shell in container</code> rule firing. It detected the <code>kubectl exec</code> session the moment you ran it.</p>
<h3 id="heading-step-3-write-a-custom-rule">Step 3: Write a custom rule</h3>
<p>The built-in rules are comprehensive, but every production environment has workloads with unique behaviour. Here is a custom rule that alerts when <code>curl</code> or <code>wget</code> is executed inside any container:</p>
<pre><code class="language-yaml"># custom-rules.yaml
customRules:
  custom-rules.yaml: |-
    - rule: Suspicious network tool in container
      desc: &gt;
        Detects execution of curl or wget inside a running container.
        These tools are commonly used for data exfiltration, downloading
        attacker payloads, or reaching command-and-control servers.
        Production containers should not be making ad-hoc HTTP requests.
      condition: &gt;
        spawned_process
        and container
        and proc.name in (curl, wget)
      output: &gt;
        Network tool executed in container
        (user=%user.name tool=%proc.name cmd=%proc.cmdline
         pod=%k8s.pod.name ns=%k8s.ns.name image=%container.image)
      priority: WARNING
      tags: [network, exfiltration, custom]
</code></pre>
<p>Apply it by upgrading the Helm release:</p>
<pre><code class="language-bash">helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  -f custom-rules.yaml
</code></pre>
<p>After the upgrade, wait for the Falco pods to return to <code>Running</code>, then test your custom rule.</p>
<h3 id="heading-step-4-test-the-custom-rule">Step 4: Test the custom rule</h3>
<pre><code class="language-bash"># Terminal 1 — run curl inside the container
kubectl exec -it $POD -n staging -- sh -c 'curl https://example.com'
</code></pre>
<p>Terminal 2 immediately shows:</p>
<pre><code class="language-plaintext">2024-03-15T14:31:07.812Z: Warning Network tool executed in container
  (user=root tool=curl cmd=curl https://example.com
   pod=secure-app-7d9f8b-xxx ns=staging image=nginx:1.25-alpine)
  rule=Suspicious network tool in container  priority=WARNING
  tags=[network, exfiltration, custom]
</code></pre>
<h3 id="heading-step-5-route-alerts-to-slack-with-falcosidekick">Step 5: Route alerts to Slack with Falcosidekick</h3>
<p>Streaming logs is useful during development. In production, you need alerts routed to your alerting pipeline. Falcosidekick handles this with support for Slack, PagerDuty, Datadog, Elasticsearch, and over 50 other outputs:</p>
<pre><code class="language-yaml"># falcosidekick-values.yaml
config:
  slack:
    webhookurl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    minimumpriority: "warning"
    messageformat: &gt;
      [{{.Priority}}] {{.Rule}} |
      pod: {{index .OutputFields "k8s.pod.name"}} |
      ns: {{index .OutputFields "k8s.ns.name"}} |
      image: {{index .OutputFields "container.image"}}
</code></pre>
<pre><code class="language-bash">helm install falcosidekick falcosecurity/falcosidekick \
  --namespace falco \
  -f falcosidekick-values.yaml
</code></pre>
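<p>To make the effect of <code>minimumpriority: "warning"</code> concrete, here is a minimal sketch of the same filtering idea in plain shell. It is not how Falcosidekick is implemented. The function names <code>falco_rank</code> and <code>falco_min_priority</code> are hypothetical, and the sketch assumes each alert arrives as a single line whose second field is the priority, as in the sample output above.</p>
<pre><code class="language-bash"># Hypothetical helper: map a Falco priority name to a number (higher is more severe).
falco_rank() {
  case "$(printf '%s' "$1" | tr 'A-Z' 'a-z')" in
    emergency) echo 8 ;;
    alert) echo 7 ;;
    critical) echo 6 ;;
    error) echo 5 ;;
    warning) echo 4 ;;
    notice) echo 3 ;;
    informational|info) echo 2 ;;
    debug) echo 1 ;;
    *) echo 0 ;;
  esac
}

# Hypothetical helper: print only the alerts on stdin at or above the given priority.
# Assumes one alert per line, formatted as "TIMESTAMP: PRIORITY MESSAGE".
falco_min_priority() {
  threshold=$(falco_rank "$1")
  while IFS= read -r line; do
    prio=$(printf '%s\n' "$line" | awk '{ print $2 }')
    rank=$(falco_rank "$prio")
    # Rank 0 means the second field was not a priority, so the line is skipped.
    if [ "$rank" -ne 0 ]; then
      if [ "$rank" -ge "$threshold" ]; then
        printf '%s\n' "$line"
      fi
    fi
  done
}
</code></pre>
<p>For example, <code>kubectl logs -n falco -l app.kubernetes.io/name=falco | falco_min_priority warning</code> would keep the <code>Warning</code> alert from the custom rule and drop the <code>Notice</code> alert from the shell detection, which mirrors what the Slack output forwards with the values above.</p>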
<p><strong>Tuning Falco for production:</strong> A fresh Falco deployment will generate false positives, especially in the first week. Your job is to tune rules to match your workloads' normal behaviour, not to respond to every alert.</p>
<p>Here's the workflow: deploy in staging → identify false positives → add <code>exceptions</code> to the noisy rules → validate that the false positive rate is low → enable in production with alerting.</p>
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the staging namespace and everything in it
kubectl delete namespace staging
 
# Delete Falco and Falcosidekick
helm uninstall falco -n falco
helm uninstall falcosidekick -n falco
kubectl delete namespace falco
 
# Delete the kind cluster entirely
kind delete cluster --name k8s-security
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this handbook, you secured a Kubernetes cluster across three layers: RBAC, pod runtime security, and runtime threat detection.</p>
<p>You built a least-privilege service account, enforced the restricted Pod Security Admission profile, hardened pods with securityContext, deployed Falco for syscall-level detection, and wrote a custom rule to catch suspicious tools inside containers.</p>
<p>Each layer maps to a real-world breach – Tesla, Capital One, Hildegard – showing how these controls would have contained the damage. Run kube-bench again to measure the improvement.</p>
<p>All YAML manifests, Helm values, and setup scripts from this article are available in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Implement GitOps on Kubernetes Using Argo CD ]]>
                </title>
                <description>
                    <![CDATA[ If you’re still running kubectl apply from your local terminal, you aren’t managing a cluster, you’re babysitting one. I’ve spent more nights than I care to admit staring at a terminal, trying to figu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-implement-gitops-on-kubernetes-using-argo-cd/</link>
                <guid isPermaLink="false">69b99877c22d3eeb8ae62100</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gitops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ArgoCD ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Olumoko Moses ]]>
                </dc:creator>
                <pubDate>Tue, 17 Mar 2026 18:07:51 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/2fe40cbd-1b8a-4cc6-a721-45cc20a80c76.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re still running <code>kubectl apply</code> from your local terminal, you aren’t managing a cluster, you’re babysitting one.</p>
<p>I’ve spent more nights than I care to admit staring at a terminal, trying to figure out why a staging environment suddenly "broke" even though no one supposedly touched it.</p>
<p>We’ve all been there, right? A manual edit here, a quick hotfix there, and suddenly your Git repository is no longer a Source of Truth. It’s a historical document of what used to be running.</p>
<p>Without a reliable strategy, Kubernetes deployments quickly descend into a mess of drift, painful rollbacks, and non-existent audit trails. I learned the hard way that simply storing manifests in Git isn't enough. If your cluster isn't actively listening to your code, you're still working with a gap.</p>
<p>GitOps closes that gap. It turns your cluster into a mirror of your repository. If it isn't in Git, it doesn't exist.</p>
<p>In this tutorial, you aren't just going to read about the theory. You’re going to implement a "Zero-Touch" deployment loop from scratch. We’ll use Argo CD, GitHub Actions, and the Argo CD Image Updater to build a system that builds, tags, and deploys your code the second you hit <code>git push</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/93b61e74-66c3-47d7-aeff-f7a69d0d7390.jpg" alt="Architecture diagram of a complete GitOps CI/CD workflow. A developer pushes code to a GitHub repository, triggering a GitHub Actions pipeline that builds and pushes a new Docker image to DockerHub. The Argo CD Image Updater polls DockerHub for the new tag and commits the change back to the GitHub repository. Finally, the Argo CD Server detects the updated manifest in Git and syncs the changes to the live Kubernetes cluster." style="display:block;margin:0 auto" width="640" height="640" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-gitops-really-means">What GitOps Really Means</a></p>
</li>
<li><p><a href="#heading-what-is-argo-cd-and-how-does-it-implement-gitops">What is Argo CD and How Does it Implement GitOps?</a></p>
</li>
<li><p><a href="#heading-preparing-the-application-source-code">Preparing the Application Source Code and Repo Structure</a></p>
</li>
<li><p><a href="#heading-automating-image-builds-with-github-actions">Automating Image Builds with GitHub Actions</a></p>
</li>
<li><p><a href="#heading-how-to-install-and-access-argo-cd">How to Install and Access Argo CD</a></p>
</li>
<li><p><a href="#heading-understanding-the-argo-cd-application">Understanding the Argo CD Application</a></p>
</li>
<li><p><a href="#heading-deploying-the-application-manifest">Deploying the Application Manifest</a></p>
</li>
<li><p><a href="#heading-automating-updates-with-argo-cd-image-updater">Automating Updates with Argo CD Image Updater</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following ready in your environment:</p>
<ul>
<li><p><strong>A GitHub repository:</strong> You'll need a repository (for example, <code>my-gitops-demo</code>) to serve as your Single Source of Truth. If you're following this tutorial from scratch, start with an empty repo.</p>
</li>
<li><p><strong>A DockerHub account:</strong> This will act as your Container Registry. You’ll need this to build, push, and store the Docker images that GitHub Actions creates.</p>
</li>
<li><p><strong>A running Kubernetes cluster:</strong> You can use a local solution like Minikube or Kind, or a cloud-managed service like Amazon EKS or GKE.</p>
</li>
<li><p><strong>Kubernetes tooling:</strong> Ensure <code>kubectl</code> is installed and configured to communicate with your cluster.</p>
</li>
<li><p><strong>Fundamental K8s knowledge:</strong> You should be comfortable with basic Kubernetes concepts like Pods, Deployments, and Services.</p>
</li>
</ul>
<h3 id="heading-note-for-readers-with-existing-projects">Note for Readers with Existing Projects</h3>
<p>If you already have a project and want to migrate it to this GitOps workflow, you don't need to start over. You can adapt your existing repository by following these three steps:</p>
<ol>
<li><p>Standardize your manifests: Move all your existing Kubernetes YAML files into a dedicated Kubernetes-manifest/ directory at the root of your project.</p>
</li>
<li><p>Containerize your services: Ensure every service you intend to deploy has a Dockerfile in its respective subdirectory (for example, /main-api/Dockerfile).</p>
</li>
<li><p>Prepare for automation: Be ready to replace any manual kubectl apply steps in your current CI pipeline with the automated tagging strategy we’ll implement in the next sections.</p>
</li>
</ol>
<h2 id="heading-what-gitops-really-means">What GitOps Really Means</h2>
<p>At its core, GitOps is an operational framework that uses Git as the single source of truth for your infrastructure and applications. In a traditional setup, you might run <code>kubectl apply -f deployment.yaml</code> from your laptop. This makes it impossible to track who changed what, leading to "snowflake" clusters that no one can reproduce.</p>
<p>GitOps enforces four key principles:</p>
<ol>
<li><p><strong>Declarative:</strong> You describe the <em>desired</em> state (for example, "3 replicas of Nginx"), not the commands to get there.</p>
</li>
<li><p><strong>Versioned and immutable:</strong> Your entire state is in Git. If a deployment fails, you <code>git revert</code> to a previous known-good state.</p>
</li>
<li><p><strong>Pulled automatically:</strong> A software agent (Argo CD) pulls the state from Git.</p>
</li>
<li><p><strong>Continuously reconciled:</strong> The system constantly fixes "drift." If a developer manually changes a service in the cluster, Argo CD will overwrite it to match Git.</p>
</li>
</ol>
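<p>The reconciliation idea in principle 4 can be sketched as a toy shell function. This is only an illustration under loud assumptions: two local files stand in for the desired state (Git) and the live state (the cluster), and <code>reconcile</code> is a hypothetical name. Argo CD itself compares manifests against the Kubernetes API rather than files on disk.</p>
<pre><code class="language-bash"># Toy reconcile step. Both arguments are file paths:
#   $1 = desired state (stands in for what is committed to Git)
#   $2 = live state (stands in for what is running in the cluster)
reconcile() {
  desired="$1"
  live="$2"
  if cmp -s "$desired" "$live"; then
    # The two states are identical, so there is nothing to do.
    echo "in-sync"
  else
    # Drift detected: overwrite the live state with the desired state.
    cp "$desired" "$live"
    echo "drift-corrected"
  fi
}
</code></pre>
<p>Calling <code>reconcile desired.yaml live.yaml</code> after someone edits <code>live.yaml</code> by hand restores it from <code>desired.yaml</code>. A second call reports that the two are in sync. A real GitOps agent runs this kind of comparison in a loop.</p>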
<h2 id="heading-what-is-argo-cd-and-how-does-it-implement-gitops">What is Argo CD and How Does it Implement GitOps?</h2>
<p>Before we dive into the setup, let’s define the tool we'll be working with.</p>
<p>Argo CD is a declarative GitOps continuous delivery tool built specifically for Kubernetes. As a graduated project of the Cloud Native Computing Foundation (CNCF), it has become one of the most widely adopted tools for managing Kubernetes deployments.</p>
<p>Think of Argo CD as a persistent watchdog that lives inside your cluster. To understand why it's so powerful, we have to look at how it differs from traditional CI/CD tools like Jenkins or GitHub Actions.</p>
<h3 id="heading-the-push-vs-pull-model">The "Push" vs. "Pull" Model</h3>
<p>Traditional tools like the ones mentioned above use a <strong>"Push" model</strong>. In this setup, an external pipeline sends commands (like <code>kubectl apply</code>) into your cluster. This is risky because you must store sensitive cluster admin credentials inside your external CI tool. If your CI tool is compromised, your cluster is, too.</p>
<p>Argo CD flips this script using a <strong>"Pull" model</strong>:</p>
<ul>
<li><p><strong>The bridge:</strong> It sits between your Git repo (the "Desired State") and your cluster (the "Live State").</p>
</li>
<li><p><strong>Continuous monitoring:</strong> It polls your Git repo at a regular interval (every few minutes by default). When it detects a new commit, it "pulls" that change and applies it from <em>inside</em> the cluster.</p>
</li>
<li><p><strong>Self-healing:</strong> If someone manually changes a setting in the cluster (known as "drift"), Argo CD detects the discrepancy and automatically overwrites it to match what is written in Git.</p>
</li>
</ul>
<p>This approach is not only more secure, since no cluster credentials ever leave the environment, but it also ensures that your infrastructure is a perfect, predictable mirror of your code.</p>
<h2 id="heading-preparing-the-application-source-code">Preparing the Application Source Code</h2>
<p>Before we automate the build, we need actual code in our repository. We'll create two simple microservices: a Main API and an Auxiliary Service.</p>
<h3 id="heading-repo-structure">Repo Structure</h3>
<p>Ensure your repository follows this structure exactly. Consistency in naming is vital for the automation to find your files.</p>
<pre><code class="language-plaintext">GITOPS-ARGOCD-DEMO/
├── .github/workflows/main.yml
├── auxiliary-service/
│   └── Dockerfile
├── main-api/
│   └── Dockerfile
├── Kubernetes-manifest/
│   ├── aux-api.yaml
│   ├── kustomization.yaml
│   └── main-api.yaml
├── application.yaml
└── image-updater.yaml
</code></pre>
<h3 id="heading-create-the-dockerfiles">Create the Dockerfiles</h3>
<p>In each service folder, create a simple <code>Dockerfile</code> so our pipeline has something to build.</p>
<p><strong>main-api/Dockerfile</strong></p>
<pre><code class="language-plaintext">FROM nginx:alpine
RUN echo "&lt;h1&gt;Main API - Version 1.0&lt;/h1&gt;" &gt; /usr/share/nginx/html/index.html
EXPOSE 80
</code></pre>
<p><strong>auxiliary-service/Dockerfile</strong></p>
<pre><code class="language-plaintext">FROM nginx:alpine
RUN echo "&lt;h1&gt;Auxiliary Service - Version 1.0&lt;/h1&gt;" &gt; /usr/share/nginx/html/index.html
EXPOSE 80
</code></pre>
<h2 id="heading-automating-image-builds-with-github-actions">Automating Image Builds with GitHub Actions</h2>
<p>In a professional GitOps workflow, your Kubernetes manifests and your application source code often live in the same repository (or linked ones). While Argo CD handles the deployment, you still need a way to turn your code into Docker images. This is where <strong>Continuous Integration (CI)</strong> comes in.</p>
<p>I have included a GitHub Actions workflow in this demo to automate this. Every time you push code to the <code>main</code> branch, this pipeline builds your images and pushes them to DockerHub.</p>
<h3 id="heading-the-ci-pipeline-workflow">The CI Pipeline Workflow</h3>
<p>Create a file at <code>.github/workflows/main.yml</code> and add the following:</p>
<pre><code class="language-plaintext">name: Build and Push Image to DockerHub

on:
  push:
    branches:
      - main
    # Skip builds for image updater commits
    paths-ignore:
      - 'Kubernetes-manifest/**'

jobs:
  docker_build:
    name: Build &amp; Push ${{ matrix.service }}
    environment: argocd-demo
    runs-on: ubuntu-latest
    permissions:
      contents: write

    strategy:
      matrix:
        include:
          - service: aux-service
            dockerfile: auxiliary-service/Dockerfile
          - service: main-service
            dockerfile: main-api/Dockerfile

    env:
      DOCKER_USER: ${{ secrets.DOCKERHUB_USERNAME }}
      RUN_TAG: ${{ github.run_number }}

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ env.DOCKER_USER }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and Push ${{ matrix.service }}
        uses: docker/build-push-action@v6
        with:
          context: .
          file: ${{ matrix.dockerfile }}
          push: true
          tags: ${{ env.DOCKER_USER }}/${{ matrix.service }}:${{ env.RUN_TAG }}
          cache-from: type=gha,scope=${{ matrix.service }}
          cache-to: type=gha,mode=max,scope=${{ matrix.service }}
</code></pre>
<p><strong>Pro tip:</strong> The <code>paths-ignore</code> section is critical. Later, the Argo CD Image Updater will commit changes back to the <code>Kubernetes-manifest/</code> folder. Without this ignore rule, each of those commits would trigger a new build, which would produce a new image tag, which would trigger another write-back commit, and so on in an endless loop.</p>
<p><strong>Note:</strong> You must add <code>DOCKERHUB_USERNAME</code> and <code>DOCKERHUB_PASSWORD</code> as secrets in your GitHub repo under Settings &gt; Secrets and variables &gt; Actions.</p>
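<p>If the <code>paths-ignore</code> behaviour feels abstract, the sketch below models the decision in shell. GitHub skips a push-triggered workflow only when every changed path matches an ignore pattern. The function name <code>workflow_decision</code> is hypothetical, and the pattern is hard-coded to the single folder ignored in the workflow above.</p>
<pre><code class="language-bash"># Hypothetical helper: decide whether a push would start the workflow above.
# Arguments are the file paths changed by the push.
workflow_decision() {
  for path in "$@"; do
    case "$path" in
      Kubernetes-manifest/*)
        # Matched by paths-ignore, so this file alone does not trigger a build.
        ;;
      *)
        # Any path outside the ignored folder is enough to run the workflow.
        echo "run"
        return 0
        ;;
    esac
  done
  echo "skip"
}
</code></pre>
<p>A write-back commit that only touches <code>Kubernetes-manifest/kustomization.yaml</code> yields <code>skip</code>, while a commit that also changes <code>main-api/Dockerfile</code> yields <code>run</code>.</p>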
<h2 id="heading-how-to-install-and-access-argo-cd">How to Install and Access Argo CD</h2>
<p>Now that your cluster is running, you can install Argo CD. You'll perform the installation using a standard Kubernetes manifest provided by the Argo project.</p>
<h3 id="heading-step-1-create-the-namespace-and-apply-the-manifests">Step 1: Create the Namespace and Apply the Manifests</h3>
<p>In Kubernetes, it is a best practice to keep your administrative tools separate from your applications. You will create a dedicated namespace named <code>argocd</code> and then apply the official installation manifest from the Argo project. This manifest includes all the necessary ServiceAccounts, Roles, and Deployments.</p>
<p>Run the following commands in your terminal:</p>
<pre><code class="language-markdown">kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</code></pre>
<p>You'll see a long list of resources being created. Wait a minute or two for the pods to initialize, then verify that all the core components of Argo CD are running:</p>
<pre><code class="language-markdown">kubectl get all -n argocd
</code></pre>
<p>Ensure all pods show a status of <code>Running</code> before proceeding.</p>
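<p>If you would rather script this check than read the table by eye, here is a minimal sketch. The function name <code>pods_ready</code> is hypothetical, and it assumes the default <code>kubectl get pods</code> table layout, where the first line is a header and the third column is <code>STATUS</code>.</p>
<pre><code class="language-bash"># Hypothetical helper: read a "kubectl get pods" table on stdin and report
# "ready" only if there is at least one pod and every pod is Running.
pods_ready() {
  awk '
    NR != 1 { total++; if ($3 == "Running") running++ }
    END {
      if (total != 0) { if (total == running) { print "ready"; exit } }
      print "not-ready"
    }
  '
}
</code></pre>
<p>You could use it as <code>kubectl get pods -n argocd | pods_ready</code>. Note that it reads the output of <code>get pods</code>, not <code>get all</code>, because the latter also lists Services and Deployments with different columns.</p>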
<h3 id="heading-step-2-access-the-argo-cd-user-interface">Step 2: Access the Argo CD User Interface</h3>
<p>To access the dashboard, we use a technique called <strong>port forwarding</strong>. Since the Argo CD server is running inside the cluster's private network, your browser can't see it yet. Port forwarding creates a secure 'tunnel' between a port on your local machine (8080) and a port on the cluster service (443). This allows you to interact with internal services without exposing them to the public internet.</p>
<p>Run the following command:</p>
<pre><code class="language-markdown">kubectl port-forward svc/argocd-server -n argocd 8080:443
</code></pre>
<p>You can now open your browser and navigate to <code>https://localhost:8080</code>. Your browser may warn you that the connection is not private because of a self-signed certificate. You can safely click "Advanced" and proceed to the site.</p>
<h3 id="heading-step-3-how-to-log-in">Step 3: How to Log In</h3>
<p>The default username for Argo CD is <code>admin</code>. The password is autogenerated during the installation process and is stored securely as a Kubernetes secret.</p>
<p>To retrieve this password, open a new terminal tab (so the port-forwarding keeps running) and run:</p>
<pre><code class="language-markdown">kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo
</code></pre>
<p>Copy the output and use it as the password to log into the dashboard.</p>
<h2 id="heading-understanding-the-argo-cd-application">Understanding the Argo CD Application</h2>
<p>An Argo CD <strong>Application</strong> is a custom resource (defined by a Custom Resource Definition, or CRD) that acts as a "contract" between your Git repo and your cluster. It defines:</p>
<ul>
<li><p><code>repoURL</code> &amp; <code>path</code>: This tells Argo CD exactly which Git repository to watch and which folder inside that repo contains your YAML manifests.</p>
</li>
<li><p><code>destination</code>: This defines where the app should live. We use <code>https://kubernetes.default.svc</code> to point to the local cluster where Argo CD is installed.</p>
</li>
<li><p><code>syncPolicy</code>: This is the heart of GitOps. By setting <code>automated</code> with <code>selfHeal: true</code>, we tell Argo CD to automatically fix the cluster if someone manually changes something (drift). The <code>prune: true</code> setting ensures that if you delete a file in Git, it also gets deleted in the cluster.</p>
</li>
</ul>
<h3 id="heading-the-application-manifest">The Application Manifest</h3>
<p>Create <code>application.yaml</code> in your project root:</p>
<pre><code class="language-plaintext">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitops-argocd-demo
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/&lt;YOUR_GITHUB_USERNAME&gt;/&lt;YOUR_REPO_NAME&gt;.git
    targetRevision: HEAD
    path: Kubernetes-manifest
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd-demo-ns
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
    syncOptions:
      - CreateNamespace=true
</code></pre>
<h3 id="heading-deploying-the-application-manifest">Deploying the Application Manifest</h3>
<p>Now we'll define our Kubernetes resources in the <code>Kubernetes-manifest/</code> folder.</p>
<p><strong>main-api.yaml</strong></p>
<pre><code class="language-plaintext">apiVersion: apps/v1
kind: Deployment
metadata:
  name: main-deployment
  namespace: argocd-demo-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: main-api
  template:
    metadata:
      labels:
        app: main-api
    spec:
      containers:
      - name: main-service
        image: &lt;YOUR_DOCKERHUB_USERNAME&gt;/main-service:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: main-service-lb
  namespace: argocd-demo-ns
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: main-api
</code></pre>
<p><strong>aux-api.yaml</strong></p>
<pre><code class="language-plaintext">apiVersion: apps/v1
kind: Deployment
metadata:
  name: aux-deployment
  namespace: argocd-demo-ns
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aux-service
  template:
    metadata:
      labels:
        app: aux-service
    spec:
      containers:
      - name: aux-service
        image: &lt;YOUR_DOCKERHUB_USERNAME&gt;/aux-service:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: aux-service
  namespace: argocd-demo-ns
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: aux-service
</code></pre>
<h2 id="heading-push-and-sync">Push and Sync</h2>
<h3 id="heading-step-1-apply-the-application-manifest">Step 1: Apply the Application Manifest</h3>
<p>Use <code>kubectl</code> to deploy this manifest into the <code>argocd</code> namespace:</p>
<pre><code class="language-plaintext">kubectl apply -f application.yaml -n argocd
</code></pre>
<h3 id="heading-step-2-push-to-your-repository">Step 2: Push to Your Repository</h3>
<p>To trigger the initial deployment and ensure Argo CD stays in sync with your source of truth, add, commit, and push your latest changes to the GitHub repository you configured in the manifest:</p>
<pre><code class="language-plaintext">git add .
git commit -m "initial argo application deployment"
git push origin main
</code></pre>
<h3 id="heading-step-3-verify-the-result-in-argo-cd">Step 3: Verify the Result in Argo CD</h3>
<p>Once you push your changes, head over to your Argo CD dashboard. You'll see the <code>gitops-argocd-demo</code> application appear. After the initial sync, the dashboard will display a healthy, green status indicating that your live cluster state perfectly matches your Git repository.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/7d37a8bd-c913-4393-b82c-ab0adc875574.jpg" alt="Argo CD dashboard showing the gitops-argocd-demo application in a Healthy and Synced state. The resource tree displays the hierarchy of services, deployments, replica sets, and pods running in the cluster." style="display:block;margin:0 auto" width="1440" height="547" loading="lazy">

<p><strong>Note:</strong> As you can see in the screenshot above, Argo CD provides a visual representation of how your Kubernetes objects – Services, Deployments, and Pods – are related and confirms they are "Synced" with your Git repo.</p>
<h2 id="heading-automating-updates-with-argo-cd-image-updater">Automating Updates with Argo CD Image Updater</h2>
<p>Now that we have automated the deployment, let’s solve the final manual hurdle: automatically updating image tags in our manifests whenever a new build is pushed to DockerHub.</p>
<h3 id="heading-step-1-install-argocd-image-updater">Step 1: Install Argo CD Image Updater</h3>
<p>Install the Image Updater into the <code>argocd</code> namespace:</p>
<pre><code class="language-plaintext">kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj-labs/argocd-image-updater/stable/config/install.yaml
</code></pre>
<p>Verify the pod is running:</p>
<pre><code class="language-plaintext">kubectl get pods -n argocd | grep image-updater
</code></pre>
<p><strong>Note:</strong> Version 1.1+ uses a CRD-based approach (<code>ImageUpdater</code> custom resources) instead of the annotation-based approach used in older versions. This guide covers the CRD method.</p>
<h3 id="heading-step-2-create-a-github-personal-access-token">Step 2: Create a GitHub Personal Access Token</h3>
<p>The Image Updater needs Git credentials to push write-back commits to your repository.</p>
<ol>
<li><p>Go to GitHub → Settings → Developer Settings → Personal Access Tokens → Tokens (classic)</p>
</li>
<li><p>Click Generate new token</p>
</li>
<li><p>Select the <code>repo</code> scope (full control of private repositories)</p>
</li>
<li><p>Copy the generated token</p>
</li>
</ol>
<h3 id="heading-step-3-create-the-git-credentials-secret">Step 3: Create the Git Credentials Secret</h3>
<p>Store the GitHub credentials as a Kubernetes secret in the <code>argocd</code> namespace:</p>
<pre><code class="language-plaintext">kubectl -n argocd create secret generic git-creds \
  --from-literal=username=&lt;YOUR_GITHUB_USERNAME&gt; \
  --from-literal=password=&lt;YOUR_GITHUB_PAT&gt;
</code></pre>
<p>Replace <code>&lt;YOUR_GITHUB_USERNAME&gt;</code> and <code>&lt;YOUR_GITHUB_PAT&gt;</code> with your actual values.</p>
<h3 id="heading-step-4-add-a-kustomization-file-to-your-manifests">Step 4: Add a Kustomization File to Your Manifests</h3>
<p>The Image Updater uses Kustomize's <code>images</code> field to write updated tags. If your <code>Kubernetes-manifest/</code> directory contains plain YAML files, you'll need to wrap them with a <strong>kustomization.yaml</strong> file.</p>
<p>Create a <strong>kustomization.yaml</strong> file:</p>
<pre><code class="language-plaintext">apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - main-api.yaml
  - aux-api.yaml
</code></pre>
<p><strong>How it works:</strong> When the Image Updater detects a new tag, it appends an <code>images</code> section to this file:</p>
<pre><code class="language-plaintext">images:
  - name: &lt;YOUR_DOCKERHUB_USERNAME&gt;/main-service
    newTag: "12"
  - name: &lt;YOUR_DOCKERHUB_USERNAME&gt;/aux-service
    newTag: "12"
</code></pre>
<p>Kustomize then overrides the image tags at deploy time, without modifying your original deployment YAML files.</p>
<p>We use Kustomize here because it allows the Image Updater to manage image tags in a separate, clean way. Instead of the Updater 'messing' with your original <code>main-api.yaml</code> file, it simply updates the <code>kustomization.yaml</code> file. Argo CD then uses Kustomize to merge those changes during deployment.</p>
<h3 id="heading-step-5-create-the-imageupdater-custom-resource">Step 5: Create the ImageUpdater Custom Resource</h3>
<p>Create <strong>image-updater.yaml</strong> in your project root:</p>
<pre><code class="language-plaintext">apiVersion: argocd-image-updater.argoproj.io/v1alpha1
kind: ImageUpdater
metadata:
  name: gitops-argocd-demo-updater
  namespace: argocd
spec:
  commonUpdateSettings:
    updateStrategy: newest-build
    allowTags: "regexp:^[0-9]+$"
  applicationRefs:
    - namePattern: "gitops-argocd-demo"
      writeBackConfig:
        method: "git:secret:argocd/git-creds"
        gitConfig:
          branch: main
          writeBackTarget: "kustomization:."
      images:
        - alias: main-service
          imageName: &lt;YOUR_DOCKERHUB_USERNAME&gt;/main-service
        - alias: aux-service
          imageName: &lt;YOUR_DOCKERHUB_USERNAME&gt;/aux-service
</code></pre>
<p>This ImageUpdater resource is the <strong>"brain"</strong> of our automated tagging system. Here is what the specific fields are doing:</p>
<p><code>updateStrategy:</code></p>
<ul>
<li><code>newest-build:</code> It tells the updater to always look for the most recent image version in DockerHub based on creation time.</li>
</ul>
<p><code>writeBackConfig:</code> This is where the magic happens. It uses the git-creds secret we created to authorize the updater to 'write' back to your repository.</p>
<p><code>writeBackTarget:</code></p>
<ul>
<li><code>kustomization:</code> We are telling the updater specifically to modify the kustomization.yaml file in the manifests folder rather than touching the deployment files directly.</li>
</ul>
<p><code>images:</code> We provide aliases (main-service and aux-service) so the updater knows exactly which images in DockerHub correspond to which containers in our Kubernetes manifests.</p>
<p><strong>Apply the ImageUpdater CR to the cluster:</strong></p>
<pre><code class="language-plaintext">kubectl apply -f image-updater.yaml -n argocd
</code></pre>
<p>Push the kustomization.yaml to your Git repository (the Image Updater clones the repo, so it must exist remotely):</p>
<pre><code class="language-plaintext">git add Kubernetes-manifest/kustomization.yaml
git commit -m "Add kustomization.yaml for image updater write-back"
git push origin main
</code></pre>
<h3 id="heading-step-6-verify-the-image-updater">Step 6: Verify the Image Updater</h3>
<p>Check the Image Updater logs to confirm it's working:</p>
<pre><code class="language-plaintext">kubectl logs -n argocd deployment/argocd-image-updater-controller --tail=20
</code></pre>
<p><strong>Successful output looks like:</strong></p>
<pre><code class="language-plaintext">msg="Starting image update cycle, considering 1 application(s) for update"
msg="Setting new image to YOUR_DOCKERHUB_USERNAME/main-service:11"
msg="Successfully updated image 'YOUR_DOCKERHUB_USERNAME/main-service:7' to 'YOUR_DOCKERHUB_USERNAME/main-service:11'"
msg="Setting new image to YOUR_DOCKERHUB_USERNAME/aux-service:11"
msg="Committing 2 parameter update(s) for application gitops-argocd-demo"
msg="git push origin main"
msg="Successfully updated the live application spec"
msg="Processing results: applications=1 images_considered=2 images_skipped=0 images_updated=2 errors=0"
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You have successfully implemented a professional-grade GitOps loop from scratch. By integrating GitHub Actions, Argo CD, and the Argo CD Image Updater, you’ve bridged the gap between your source code and your live environment.</p>
<p>Think about the workflow you just built:</p>
<ol>
<li><p>You push code to GitHub.</p>
</li>
<li><p>GitHub Actions builds and tags a fresh Docker image.</p>
</li>
<li><p>Argo CD Image Updater detects that new tag and automatically commits it back to your Git manifests.</p>
</li>
<li><p>Argo CD pulls those changes and reconciles your cluster to the new desired state.</p>
</li>
</ol>
<p>No more manual <code>kubectl apply</code>, no more configuration drift, and no more 2:00 AM mysteries. Your Git repository is now truly the Single Source of Truth. If it isn't in Git, it doesn't exist in your cluster, and that is the ultimate DevOps superpower.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster ]]>
                </title>
                <description>
                    <![CDATA[ I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet contro ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kubernetes-self-healing-explained/</link>
                <guid isPermaLink="false">69aae80e78c5adcd0e1c63bc</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Testing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Fri, 06 Mar 2026 14:43:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/ef1ba178-622f-4a28-b58a-7fb8a58be964.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet controller fire, an OOMKill from <code>kubectl describe</code>, or watched pod endpoints go empty during a cascading failure. That's where 3 am incidents find you. This tutorial puts you on the other side of it.</p>
<p>You will clone one repo, spin up a real 3-node cluster, break it seven different ways, and watch it fix itself each time. No simulated output or fake clusters. Real Kubernetes, real failures, and real recovery. By the end, you will recognize these failure patterns when they show up in your production environment.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-kubelab">What is KubeLab?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-get-the-lab-running">How to Get the Lab Running</a></p>
</li>
<li><p><a href="#heading-simulation-1-kill-random-pod">Simulation 1 — Kill Random Pod</a></p>
</li>
<li><p><a href="#heading-simulation-2-drain-a-worker-node">Simulation 2 — Drain a Worker Node</a></p>
</li>
<li><p><a href="#heading-simulation-3-cpu-stress-and-throttling">Simulation 3 — CPU Stress and Throttling</a></p>
</li>
<li><p><a href="#heading-simulation-4-memory-stress-and-oomkill">Simulation 4 — Memory Stress and OOMKill</a></p>
</li>
<li><p><a href="#heading-simulation-5-database-failure">Simulation 5 — Database Failure</a></p>
</li>
<li><p><a href="#heading-simulation-6-cascading-pod-failure">Simulation 6 — Cascading Pod Failure</a></p>
</li>
<li><p><a href="#heading-simulation-7-readiness-probe-failure">Simulation 7 — Readiness Probe Failure</a></p>
</li>
<li><p><a href="#heading-4-how-to-read-the-signals-in-grafana">How to Read the Signals in Grafana</a></p>
</li>
<li><p><a href="#heading-6-how-to-use-this-for-production-debugging">How to Use This for Production Debugging</a></p>
</li>
</ul>
<h2 id="heading-what-is-kubelab"><strong>What is KubeLab?</strong></h2>
<p>KubeLab is an open-source Kubernetes failure simulation lab. It runs a real Node.js backend, a PostgreSQL database, Prometheus and Grafana, all inside a real cluster. When you click "Kill Pod", the backend calls the Kubernetes API and deletes an actual running pod. Nothing is fake.</p>
<table>
<thead>
<tr>
<th>Simulation</th>
<th>What it teaches</th>
</tr>
</thead>
<tbody><tr>
<td>Kill Random Pod</td>
<td>ReplicaSet self-healing, pod immutability</td>
</tr>
<tr>
<td>Drain Worker Node</td>
<td>Zero-downtime maintenance, PodDisruptionBudgets</td>
</tr>
<tr>
<td>CPU Stress</td>
<td>Throttling vs crashing, invisible latency</td>
</tr>
<tr>
<td>Memory Stress</td>
<td>OOMKill, exit code 137, silent restart loops</td>
</tr>
<tr>
<td>Database Failure</td>
<td>StatefulSets, PVC persistence</td>
</tr>
<tr>
<td>Cascading Pod Failure</td>
<td>Why replicas: 2 isn't enough</td>
</tr>
<tr>
<td>Readiness Probe Failure</td>
<td>Liveness vs readiness, traffic control</td>
</tr>
</tbody></table>
<p>Plan about 90 minutes for the full path. Or jump directly to any simulation if you have a specific production problem you want to reproduce.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/1cd2a06d-7a7a-4250-ab5d-8a78d24af7b5.png" alt="KubeLab cluster map — pods grouped by node, color-coded by status. During simulations, chips change color and move between nodes in real time." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>You need basic familiarity with Docker and comfort with the command line, but no prior Kubernetes experience is required.</p>
<p><strong>Hardware:</strong> 8GB RAM minimum, 16GB recommended. The lab can run on Mac, Linux, or Windows with WSL2. You'll need to install three tools. Multipass spins up Ubuntu VMs for the cluster. kubectl is the Kubernetes CLI you will use for every simulation. Git clones the repo. If you cannot run three VMs, the repo includes a Docker Compose preview at <a href="https://github.com/Osomudeya/kubelab/blob/main/setup/docker-compose-preview.md">setup/docker-compose-preview.md</a>: the full UI with mock data, no real cluster needed.</p>
<h2 id="heading-how-to-get-the-lab-running"><strong>How to Get the Lab Running</strong></h2>
<p>Full cluster setup lives at <a href="https://github.com/Osomudeya/kubelab/blob/main/setup/k8s-cluster-setup.md">setup/k8s-cluster-setup.md</a> in the repo. It walks through creating three VMs with Multipass, installing MicroK8s, joining the worker nodes, and deploying KubeLab. Follow it until all eleven pods show Running:</p>
<pre><code class="language-bash">kubectl get pods -n kubelab
# All 11 pods should show STATUS: Running
</code></pre>
<p>Then open two port-forwards in separate terminal tabs and keep them running for the entire tutorial:</p>
<pre><code class="language-bash"># Tab 1 — KubeLab UI at http://localhost:8080
kubectl port-forward -n kubelab svc/frontend 8080:80

# Tab 2 — Grafana at http://localhost:3000
kubectl port-forward -n kubelab svc/grafana 3000:3000
</code></pre>
<p>Grafana login: <code>admin</code> / <code>kubelab-grafana-2026</code>.</p>
<blockquote>
<p>Position the KubeLab UI and Grafana side by side. Left half of the screen is the app. Right half is Grafana. You will watch both simultaneously from Simulation 3 onward.</p>
</blockquote>
<h2 id="heading-simulation-1-kill-random-pod"><strong>Simulation 1: Kill Random Pod</strong></h2>
<p>This simulation deletes a running backend pod via the Kubernetes API. Without Kubernetes, you would SSH to the server, find the crashed process, and restart it manually, usually after a user alert at 3 am told you something was down.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -w</code>. Watch for a pod to go Terminating then a new one to appear.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/3d3cb733-407a-482f-82e7-cbeea496157b.png" alt="Terminals running side by side before clicking Run, events streaming, pod watch, frontend and grafana port forwarding." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<pre><code class="language-bash">kubectl get pods -n kubelab -w
# backend-abc123  1/1   Terminating   0   2m
# backend-xyz789  1/1   Running       0   0s   ← ReplicaSet created a replacement
</code></pre>
<p><strong>What happened:</strong> The ReplicaSet controller noticed actual(1) did not match desired(2) and created a replacement in parallel with the shutdown. The Endpoints controller removed the terminating pod from the Service as soon as it was marked for deletion, in parallel with the kubelet sending SIGTERM, so little or no traffic reached the dying pod.</p>
<p><strong>The production trap:</strong> A missing readiness probe means the new pod receives traffic before it has opened a DB connection. You get 500s on every deployment for 2–3 seconds.</p>
<p><strong>The fix:</strong> Set <code>replicas: 2</code>, add a readiness probe, and set <code>terminationGracePeriodSeconds</code> to match your longest request timeout.</p>
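<p>A minimal sketch of how those three settings fit together in a Deployment. The image name, container port <code>3000</code>, probe timings, and the 30-second grace period are assumptions for illustration; the <code>app: backend</code> label and the <code>/ready</code> path are the ones used elsewhere in this tutorial.</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: kubelab
spec:
  replicas: 2                             # one pod can die without an outage
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      # Assumed value: set this to your longest request timeout
      terminationGracePeriodSeconds: 30
      containers:
        - name: backend
          image: example/backend:1.0      # placeholder image
          ports:
            - containerPort: 3000         # assumed port
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            periodSeconds: 5              # assumed timing
            failureThreshold: 3
</code></pre>
<p>With the readiness probe in place, a replacement pod is added to the Service only after <code>/ready</code> starts returning success.</p>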
<h2 id="heading-simulation-2-drain-a-worker-node"><strong>Simulation 2: Drain a Worker Node</strong></h2>
<p>This simulation cordons a worker node, then evicts all its pods to the remaining node.</p>
<p>To <em><strong>"cordon"</strong></em> a worker node means to mark it as unschedulable. When you run <code>kubectl cordon &lt;node-name&gt;</code>, the Kubernetes control plane adds the <code>node.kubernetes.io/unschedulable:NoSchedule</code> taint to the node. (A <strong>taint</strong> is a marker that tells the scheduler to avoid placing pods on that node unless they have a matching "toleration.") This tells the scheduler to stop placing any new pods onto that node. It does <strong>not</strong> affect the pods that are already running there.</p>
<p>Cordoning is the first, safe step in preparing a node for maintenance. It ensures that while you are draining the node, the scheduler isn't simultaneously trying to schedule new workloads onto it, which would defeat the purpose of the drain.</p>
<p>Without Kubernetes, you would drain the server manually, guess when in-flight requests finish, patch it, and bring it back. The window of downtime is unpredictable.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -o wide -w</code>. Watch which node each pod runs on.</p>
<pre><code class="language-bash">kubectl get pods -n kubelab -o wide -w
</code></pre>
<pre><code class="language-plaintext">NAME                     NODE               STATUS
backend-abc123-xk2qp    kubelab-worker-1   Terminating   ← evicted
backend-abc123-n7mw3    kubelab-worker-2   Running       ← rescheduled
</code></pre>
<p>In <code>kubectl get nodes</code> the node shows <code>Ready,SchedulingDisabled</code> until you run <code>kubectl uncordon</code>.</p>
<p><strong>What happened:</strong> The node spec got <code>spec.unschedulable=true</code>. The Eviction API ran per pod. That path goes through PodDisruptionBudget policy checks before proceeding, unlike a raw delete. A raw <code>kubectl delete pod</code> bypasses this check entirely — which is why draining with <code>kubectl drain</code> is always safer than deleting pods manually during maintenance.</p>
<p><strong>The production trap:</strong> Two replicas with no pod anti-affinity often land on the same node. Drain that node and both pods evict at once. Complete downtime despite <code>replicas: 2</code>.</p>
<p><strong>The fix:</strong> Use pod anti-affinity with topology key: <code>kubernetes.io/hostname</code> and a PodDisruptionBudget with <code>minAvailable: 1</code>.</p>
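<p>As a sketch, the anti-affinity half of that fix is a fragment in the Deployment's pod template. It assumes the backend pods carry the label <code>app: backend</code>, and it uses the hard (<code>required</code>) form, which leaves a replica Pending if no other node is available.</p>
<pre><code class="language-yaml"># Fragment of a Deployment spec (not a complete manifest)
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Never schedule two pods with app=backend on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: backend
              topologyKey: kubernetes.io/hostname
</code></pre>
<p>If you would rather have both replicas running on one node than one replica stuck Pending, <code>preferredDuringSchedulingIgnoredDuringExecution</code> is the softer variant.</p>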
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/1161cbf9-2482-41c7-9b5c-751762d3baaa.png" alt="Node drain CLI output: cordoned node shows Ready,SchedulingDisabled; pods reschedule to the other node." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-simulation-3-cpu-stress-and-throttling"><strong>Simulation 3: CPU Stress and Throttling</strong></h2>
<p>This simulation burns CPU inside a backend pod for 60 seconds, hitting the 200m limit. Without Kubernetes, one runaway process can consume all CPU on the host and starve every other service.</p>
<p><strong>Before you click:</strong> Run <code>watch -n 2 kubectl top pods -n kubelab</code> and open the Grafana CPU Usage panel.</p>
<pre><code class="language-bash">kubectl top pods -n kubelab
# backend-abc123   200m   ← pegged at limit for 60s; the other pod stays ~15m
</code></pre>
<p><strong>What happened:</strong> The Linux CFS scheduler enforces the cgroup limit by granting 20ms of CPU per 100ms period then freezing all processes in the cgroup for 80ms. The pod is not slow because it is broken. It is slow because it is frozen 80% of the time.</p>
<p><strong>The production trap:</strong> <code>kubectl top</code> shows the pod using 95-150m, which looks normal. The metric shows usage at the ceiling, not the throttle rate. Teams spend hours checking application code for a latency bug that is actually a CPU limit set too low.</p>
<p><strong>The fix:</strong> For latency-sensitive workloads, set CPU requests but remove CPU limits. Requests tell the scheduler where to place the pod without throttling at runtime. Confirm throttling with <code>rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])</code>.</p>
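<p>A sketch of what that looks like in a container spec. The request values are assumptions for illustration; the <code>256Mi</code> memory limit matches the one the backend uses in the next simulation.</p>
<pre><code class="language-yaml"># Fragment of a container spec (not a complete manifest)
resources:
  requests:
    cpu: 200m        # assumed value; used by the scheduler for placement
    memory: 128Mi    # assumed value
  limits:
    memory: 256Mi    # keep the memory limit
    # no cpu limit: the container is not CFS-throttled at a fixed ceiling
</code></pre>
<p>Dropping the CPU limit removes the throttle, but it also means a runaway pod can use spare CPU on the node, so keep requests realistic.</p>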
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5e3fd49b-c9a0-4271-9be7-b7fec3122c1a.png" alt="One backend pod flatlined at exactly 95-150m for 60 seconds. A healthy pod's CPU fluctuates, this flat ceiling is the throttle." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-simulation-4-memory-stress-and-oomkill"><strong>Simulation 4: Memory Stress and OOMKill</strong></h2>
<p>This simulation allocates memory in 50MB chunks inside a backend pod until the kernel kills it. Without Kubernetes the process dies, the server goes down, and someone gets paged.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -l app=backend -w</code> and open the Grafana Memory Usage panel.</p>
<pre><code class="language-bash">kubectl get pods -n kubelab -l app=backend -w
# backend-abc123   0/1   OOMKilled   3   5m   ← no Terminating phase; SIGKILL bypasses graceful shutdown
</code></pre>
<p><strong>What happened:</strong> The container's memory usage crossed its 256Mi cgroup limit. The Linux kernel OOM killer scored processes in the container's cgroup and sent SIGKILL (exit code 137) to the top consumer. Not Kubernetes, the kernel. SIGKILL cannot be caught or handled, so no preStop hook runs and in-memory data or open transactions can be lost. Kubernetes only observed the exit, labeled it OOMKilled, and started a fresh container.</p>
<p><strong>The production trap:</strong> The pod runs fine for 8 hours, OOMKills, and restarts. Memory resets to zero and everything looks healthy again. This repeats every 8 hours. The restart count climbs to 7, then 15, then 30, but no alert fires because the metrics look normal between crashes. You find out when a user emails saying the app has been "a bit glitchy lately."</p>
<p><strong>The fix:</strong> Alert on <code>increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) &gt; 3</code> before users notice.<br>The Prometheus expression means: count how many times each container in the <code>kubelab</code> namespace has restarted over the last hour, and fire an alert if that count exceeds 3. (<code>rate()</code> over the same window returns restarts per second, so comparing it directly to 3 would effectively never fire.) A healthy pod rarely restarts. Several restarts in an hour usually means the container is hitting its memory limit, dying, and coming back in a loop, so this alert catches the silent OOMKill pattern before users do.</p>
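<p>A sketch of that alert as a Prometheus rule file. The group name, alert name, <code>for</code> duration, and severity label are assumptions. It uses <code>increase()</code> so the threshold reads directly as restarts in the last hour.</p>
<pre><code class="language-yaml">groups:
  - name: kubelab-restarts              # assumed group name
    rules:
      - alert: ContainerRestartLoop     # assumed alert name
        # More than 3 restarts of the same container within one hour
        expr: increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) &gt; 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted more than 3 times in the last hour"
</code></pre>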
<p>Confirm it happened:</p>
<pre><code class="language-bash">kubectl describe pod -n kubelab &lt;pod-name&gt; | grep -A 5 "Last State:"
# Reason: OOMKilled
# Exit Code: 137
</code></pre>
<p>To see the last output before the kernel killed the process, run <code>kubectl logs -n kubelab &lt;pod-name&gt; --previous</code>. The log stream stops abruptly with no shutdown message, because SIGKILL leaves no time for cleanup or final logs.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/8ced107b-9d14-4d40-b6d6-7ae0fe35b1b7.png" alt="One backend pod's memory climbs, then the line drops at the OOMKill and reappears as the container restarts. The other pod's line stays flat the whole time" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-simulation-5-database-failure"><strong>Simulation 5: Database Failure</strong></h2>
<p>This simulation scales the PostgreSQL StatefulSet to 0 replicas. The pod terminates completely. Without Kubernetes, the database server crashes and data recovery depends on whether backups exist and when they ran.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods,pvc -n kubelab</code>. Note that the PVC exists before you start.</p>
<pre><code class="language-bash">kubectl get pods,pvc -n kubelab
# postgres-0   (gone)
# postgres-data-postgres-0   Bound   ← PVC stays; data lives on the volume
</code></pre>
<p>A PVC, or PersistentVolumeClaim, is a request for storage by a user. Think of it as a pod's way of saying, "I need a certain amount of durable, persistent storage." In the context of a stateful application like PostgreSQL, the PVC is critical. When the database pod is deleted, the PVC (and the underlying PersistentVolume it is bound to) remains. This is where the actual database files are stored. When a new <code>postgres-0</code> pod is created, the StatefulSet knows to re-attach the same PVC, ensuring the new pod has access to all the old data, preventing data loss.</p>
<p><strong>What happened:</strong> The StatefulSet controller deleted the pod but left the PersistentVolumeClaim untouched. StatefulSets guarantee stable names and stable PVC binding. <code>postgres-0</code> always mounts <code>postgres-data-postgres-0</code>. When you restore, the same pod name comes back and reattaches the same volume. PostgreSQL replays WAL to reach a consistent state.</p>
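<p>The stable PVC name comes from the StatefulSet's <code>volumeClaimTemplates</code>: each claim is named <code>&lt;template-name&gt;-&lt;statefulset-name&gt;-&lt;ordinal&gt;</code>. The sketch below shows a manifest that would produce <code>postgres-data-postgres-0</code>. The image tag, Service name, labels, and storage size are assumptions, not KubeLab's actual manifest.</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres                 # pod becomes postgres-0
  namespace: kubelab
spec:
  serviceName: postgres          # assumed headless Service name
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16     # assumed image tag
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: postgres-data      # PVC becomes postgres-data-postgres-0
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi         # assumed size
</code></pre>
<p>Scaling to 0 removes the pod, but the claim created from the template is left in place, which is why the data is still there when <code>postgres-0</code> returns.</p>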
<p><strong>The production trap:</strong> Apps without connection retry logic return 500s and stay broken even after PostgreSQL restores. Connection pools that do not validate on acquire hold dead connections forever.</p>
<p><strong>The fix:</strong> Add connection retry with exponential backoff in your app. Use network-attached storage (EBS, GCE PD) in production so the pod can reschedule to any node.</p>
<h2 id="heading-simulation-6-cascading-pod-failure"><strong>Simulation 6: Cascading Pod Failure</strong></h2>
<p>This simulation deletes both backend replicas at the same time. Without Kubernetes, if everything went down at once, you'd have to restart every service manually and hope they came up in the right order.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get endpoints -n kubelab backend-service -w</code>. Watch the IP list.</p>
<pre><code class="language-bash">kubectl get endpoints -n kubelab backend-service -w
# ENDPOINTS   &lt;none&gt;   ← every request in this window gets Connection refused
</code></pre>
<p><strong>What happened:</strong> Both pods were deleted. The Service had zero endpoints. The ReplicaSet created two replacements in parallel, but traffic stayed broken until both passed their readiness probes. The endpoint list went empty and came back. You can see the exact downtime window in Grafana's HTTP Request Rate panel.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/6cae14e0-faf2-4d42-90f4-32d00a1b4119.png" alt="The 5xx spike during Cascading Failure, 5 to 15 seconds of real downtime with the exact window timestamped" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>The production trap:</strong> <code>replicas: 2</code> protects you from one pod dying at a time, nothing more.<br>If both replicas land on the same node and that node goes down, you have zero replicas and full downtime.<br>Check right now with <code>kubectl get pods -n kubelab -o wide | grep backend</code>, and if both pods show the same NODE, you are one node failure away from an outage.</p>
<p><strong>The fix:</strong> Use pod anti-affinity to force replicas onto different nodes and a PodDisruptionBudget with <code>minAvailable: 1</code> to block any voluntary action that would leave zero replicas.</p>
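<p>A minimal sketch of the PodDisruptionBudget half of that fix, assuming the backend pods are labeled <code>app: backend</code>. The resource name is hypothetical.</p>
<pre><code class="language-yaml">apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb        # hypothetical name
  namespace: kubelab
spec:
  minAvailable: 1          # evictions that would drop below 1 ready pod are refused
  selector:
    matchLabels:
      app: backend
</code></pre>
<p>A PodDisruptionBudget only guards voluntary disruptions that go through the Eviction API, such as <code>kubectl drain</code>. It does not stop a direct pod delete or a node crash, which is why it is paired with anti-affinity.</p>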
<h2 id="heading-simulation-7-readiness-probe-failure"><strong>Simulation 7: Readiness Probe Failure</strong></h2>
<p>This simulation makes one backend pod fail its readiness probe for 120 seconds without restarting it. Without Kubernetes, you'd have no way to take a pod out of traffic rotation without killing it. This is what happens in production when your app connects to a database on startup but the DB is slow. The pod is alive, but it's not ready. Kubernetes holds it out of rotation until it is.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -w</code> in one tab and <code>kubectl get endpoints -n kubelab backend-service -w</code> in another.</p>
<pre><code class="language-bash"># Pods tab: STATUS Running, RESTARTS 0 — almost nothing changes
# Endpoints tab: one IP disappears — the pod is alive but not receiving traffic
</code></pre>
<p><strong>What happened:</strong> <code>/ready</code> returned 503. The kubelet marked the pod <code>Ready=False</code>. The Endpoints controller removed its IP from the Service. The liveness probe (<code>/health</code>) still returned 200, so no restart. After 120 seconds <code>/ready</code> recovered and the pod rejoined. Run <code>kubectl logs -n kubelab &lt;failing-pod&gt; -f</code> to see the app log 503s for the readiness endpoint while the pod stays Running and receives no traffic.</p>
<p><strong>The production trap:</strong> Readiness probes that check external dependencies (database, cache, downstream API) will remove all pods from rotation when that dependency goes down. Instead of degrading gracefully, your entire app goes offline.</p>
<p><strong>The fix:</strong> Readiness probes should test only what the pod itself controls. Use a separate deep health endpoint for dependency checks and never tie readiness to external service availability.</p>
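<p>A sketch of the two probes side by side, using the <code>/health</code> and <code>/ready</code> paths from this simulation. The port and timing values are assumptions.</p>
<pre><code class="language-yaml"># Fragment of a pod spec (not a complete manifest)
containers:
  - name: backend
    livenessProbe:             # failure restarts the container
      httpGet:
        path: /health
        port: 3000             # assumed port
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:            # failure only removes the pod from Service endpoints
      httpGet:
        path: /ready
        port: 3000
      periodSeconds: 5
      failureThreshold: 2
</code></pre>
<p>Keeping the two endpoints separate lets <code>/ready</code> report only on the pod's own state, while deeper dependency checks live on a different endpoint that no probe calls.</p>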
<h2 id="heading-4-how-to-read-the-signals-in-grafana"><strong>How to Read the Signals in Grafana</strong></h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/e6709c25-2d80-489c-b7fb-418ef303b7e2.png" alt="A screenshot showing my grafana dashboards" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><code>kubectl</code> shows current state. Grafana shows what happened over time. That history is essential when you are debugging something that started 4 hours ago.</p>
<h3 id="heading-the-four-panels-that-matter"><strong>The Four Panels that Matter</strong></h3>
<p><strong>Pod Restarts:</strong> A flat line is good. A step up every few hours is a silent OOMKill loop — the most common invisible production failure.</p>
<p><strong>CPU Usage:</strong> A healthy pod's CPU fluctuates. A throttled pod's CPU is unnaturally flat at its limit. That flat ceiling is the signal, not the number.</p>
<p><strong>Memory Usage:</strong> Watch for a line that climbs steadily then disappears. That disappearance is an OOMKill. The line reappearing from zero is the restart.</p>
<p><strong>HTTP Request Rate:</strong> During Cascading Failure you see a spike of 5xx for 5–15 seconds, the exact downtime window, timestamped.</p>
<h3 id="heading-5-how-to-read-the-terminal-signals"><strong>How to Read the Terminal Signals</strong></h3>
<p>What you see in the terminal during and after each simulation tells you things Grafana cannot. Five commands matter.</p>
<p>The <code>-w</code> flag on <code>kubectl get pods -n kubelab -w</code> streams changes in real time. The columns that matter are READY, STATUS, and RESTARTS. READY shows containers ready vs total: <code>1/2</code> means only one of the pod's two containers is passing its readiness probe. STATUS shows the pod's phase or the reason for its most recent state: Running, Pending, Terminating, OOMKilled. RESTARTS is the most important column in production. A number climbing silently over days is a memory leak or a crash loop nobody has noticed yet.</p>
<p><code>kubectl get events -n kubelab --sort-by=.lastTimestamp</code> is the control plane's diary. Every action the cluster took is here: Killing, SuccessfulCreate, Scheduled, Pulled, Started, OOMKilling, BackOff. When something breaks and you do not know why, read the events. The timestamp gap between a Killing event and the next Started event is your actual downtime window — not an estimate, the exact number.</p>
<p><code>kubectl describe pod -n kubelab &lt;pod-name&gt;</code> is the deepest single-pod view. Three sections matter: Conditions (Ready: True/False tells you if the pod is in the Service endpoints), Last State (shows the previous container's exit reason — OOMKilled, exit code 137, or a crash), and Events at the bottom (the scheduler's reasoning for every placement decision). This is the first command to run when a pod is misbehaving.</p>
<p><code>kubectl get endpoints -n kubelab backend-service</code> shows which pod IPs are actually receiving traffic right now. A pod can show Running in <code>kubectl get pods</code> and be completely absent from this list. That is a readiness probe failure. If this list is empty, no request to that Service will succeed regardless of how many pods show Running. Check this whenever users report errors but pods look healthy.</p>
<p><code>kubectl logs -n kubelab &lt;pod-name&gt;</code> shows the container's stdout and stderr. Use <code>-f</code> to follow the stream. After a pod restarts, use <code>--previous</code> to see the logs from the container that just exited, essential when you need to know what the app was doing right before an OOMKill or crash. Logs are per container and are gone once the pod is replaced, so grab them before the ReplicaSet creates a new pod with a new name.</p>
<p>A full event sequence during Kill Pod recovery looks like this:</p>
<pre><code class="language-bash">kubectl get events -n kubelab --sort-by=.lastTimestamp | tail -10
</code></pre>
<pre><code class="language-plaintext">REASON            MESSAGE
Killing           Stopping container backend          ← SIGTERM sent
SuccessfulCreate  Created pod backend-xyz789          ← ReplicaSet fired
Scheduled         Successfully assigned to worker-2   ← Scheduler placed it
Pulled            Container image already present     ← no pull delay
Started           Started container backend           ← running
</code></pre>
<p>The gap between Killing and Started is your actual recovery time. In a healthy cluster with a cached image it is 3–8 seconds. If it takes longer, check the Scheduled line, because the scheduler may have struggled to find a node.</p>
<h3 id="heading-two-prometheus-queries-worth-memorizing"><strong>Two Prometheus Queries Worth Memorizing</strong></h3>
<p><strong>First query: silent restart loop.</strong> <code>increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])</code> returns how many times each container in that namespace restarted over the last hour. (<code>rate()</code> over the same window gives the same signal as restarts per second, which is awkward to compare with a per-hour threshold.) A healthy workload rarely restarts. If this count is high (for example more than 3 restarts per hour), something is killing the container repeatedly, often an OOMKill or a crash. Alert when it exceeds a threshold so you see the pattern before users report errors.</p>
<p><strong>Second query: invisible CPU throttling.</strong> <code>rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])</code> measures how much time, per second, the Linux scheduler spent throttling containers in that namespace over the last 5 minutes. A result of 0.25 means the container was frozen 25% of the time. High latency with no restarts and "normal" CPU usage in <code>kubectl top</code> often means the CPU limit is too low and the kernel is throttling the process. Alert when this rate exceeds about 0.25 (25% throttled).</p>
<pre><code class="language-plaintext"># Silent restart loop: alert when this exceeds 3 (restarts in the last hour)
increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])

# Invisible throttling: alert when this exceeds 0.25 (throttled 25% of the time)
rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])
</code></pre>
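<p>To turn the throttling query into an alert, a Prometheus rule along these lines could work. This is a sketch: the group name, alert name, <code>for</code> duration, and severity label are assumptions, and the 0.25 threshold is the one suggested above.</p>
<pre><code class="language-yaml">groups:
  - name: kubelab-cpu-throttling        # assumed group name
    rules:
      - alert: ContainerCpuThrottled    # assumed alert name
        # Container spent more than 25% of the last 5 minutes throttled
        expr: rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]) &gt; 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is being CPU throttled"
</code></pre>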
<p>Run these against your own cluster. Not just KubeLab. These are production queries.</p>
<h2 id="heading-6-how-to-use-this-for-production-debugging"><strong>How to Use This for Production Debugging</strong></h2>
<p>The repo includes <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/diagnose.md">docs/diagnose.md</a>, a symptom-to-simulation map. Find the simulation that reproduces your issue, run it in KubeLab, and understand the mechanics before you touch production.</p>
<p><strong>Exit code 137, pods restarting.</strong> Run the Memory Stress simulation. Confirm with <code>kubectl describe pod | grep -A 5 "Last State:"</code> and look for <code>Reason: OOMKilled</code>. Raise limits or find the leak. The simulation shows both.</p>
<p><strong>High latency, pods look healthy, zero restarts.</strong> Run the CPU Stress simulation. Check <code>container_cpu_cfs_throttled_seconds_total</code> in Prometheus. If it climbs, your CPU limit is too low and the pod is frozen by CFS.</p>
<p><strong>503 on some requests, pods show Running.</strong> Run the Readiness Probe Failure simulation. Check <code>kubectl get endpoints</code> — one pod IP is missing despite Running. The pod gets zero traffic.</p>
<p><strong>Pods stuck Pending after a node went down.</strong> Run the Drain Node simulation. Run <code>kubectl describe pod &lt;pending-pod&gt;</code> and read Events. The scheduler will state why it cannot place the pod, often insufficient capacity or a PVC on the failed node.</p>
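<p>The exit code 137 in the first symptom follows a common convention: an exit code above 128 usually means the process was terminated by signal number <code>code - 128</code>. Signal 9 is SIGKILL, which is what the kernel sends on an OOM kill. Here is a small sketch that decodes such codes. The helper name <code>describe_exit_code</code> is hypothetical, and the signal names assume a Linux or macOS system.</p>

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the common 128 + signal convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signal_number = code - 128
        try:
            name = signal.Signals(signal_number).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signal_number} ({name})"
    return f"application error (exit status {code})"


print(describe_exit_code(137))  # terminated by signal 9 (SIGKILL)
print(describe_exit_code(143))  # terminated by signal 15 (SIGTERM)
```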
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>You just broke a real Kubernetes cluster seven ways and watched it fix itself each time. You have seen the ReplicaSet controller fire, read an OOMKill from <code>kubectl describe</code>, watched endpoints go empty during a cascading failure, and understood why a pod can be Running and receiving zero traffic at the same time.</p>
<p>What you practiced here applies to other clusters too, including staging or production environments you can read but cannot safely break. That muscle memory (events, endpoints, restart counter) is what you reach for at 3 am when something is wrong. KubeLab is the safe place to build that reflex.</p>
<p>The repo holds more than this article covered. Explore mode lets you run simulations without the guided flow. The full interview prep doc at <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/interview-prep.md">docs/interview-prep.md</a> has answers to the 13 most common Kubernetes interview questions. The observability guide at <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/observability.md">docs/observability.md</a> covers Prometheus and Grafana setup in detail.</p>
<p>If this helped you, star the repo at <a href="https://github.com/Osomudeya/kubelab">https://github.com/Osomudeya/kubelab</a> and share it with someone who is learning Kubernetes the hard way.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Master Kubernetes Through Production-Ready Practice ]]>
                </title>
                <description>
                    <![CDATA[ Stop memorizing isolated commands and start building like a platform engineer. We just posted a comprehensive Kubernetes course on the freeCodeCamp.org YouTube channel. This hands-on course is designed  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/master-kubernetes-through-production-ready-practice/</link>
                <guid isPermaLink="false">69a06148ab6baac8ff198f13</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 26 Feb 2026 15:05:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5f68e7df6dfc523d0a894e7c/b1d63be6-f5d9-4ddc-8afc-4455c2ed95ee.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Stop memorizing isolated commands and start building like a platform engineer.</p>
<p>We just posted a comprehensive Kubernetes course on the <a href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel. This hands-on course is designed to bridge the gap between theoretical container orchestration and real-world deployment. Saiyam Pathak developed this course.</p>
<p>You will learn to deploy a cloud-native microservices stack from the ground up. You’ll explore the inner workings of the Kubernetes architecture, including the Control Plane, Worker Nodes, and essential interfaces like CRI, CNI, and CSI. You'll learn to ship a functional application complete with a frontend, auth service, and game engine.</p>
<p>This course is a deep dive into the modern Kubernetes ecosystem. You will implement advanced industry standards such as Gateway API for traffic management, CloudNativePG for managing PostgreSQL databases, and cert-manager for automated HTTPS security. By the time you reach the final demo, you’ll have integrated full-stack observability using Prometheus and Grafana, giving you the confidence to manage production-grade environments.</p>
<p>Watch the full course on <a href="https://youtu.be/_4uQI4ihGVU">the freeCodeCamp.org YouTube channel</a> (6-hour watch).</p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/_4uQI4ihGVU?si=jZUjCZl2V2T7fEz9" frameborder="0" allowfullscreen="" title="Embedded content" loading="lazy"></iframe></div> ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Different Container Runtimes: Docker, Podman, and Containerd Explained ]]>
                </title>
                <description>
                    <![CDATA[ If you’re a developer working with containers, chances are Docker is your go-to tool. But did you know that there's a whole ecosystem of container runtimes out there? Some are lighter, some are more secure, and some are specifically built for Kuberne... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-different-container-runtimes-docker-podman-and-containerd-explained/</link>
                <guid isPermaLink="false">6994e01b44a48dd86fdf0816</guid>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Tue, 17 Feb 2026 21:39:39 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771357533601/1cba7a91-19f0-4038-93e6-504b121a9a03.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re a developer working with containers, chances are Docker is your go-to tool. But did you know that there's a whole ecosystem of container runtimes out there? Some are lighter, some are more secure, and some are specifically built for Kubernetes.</p>
<p>Understanding different container runtimes gives you more options. You can choose the right tool for your specific needs, whether that's better security, lower resource usage, or easier integration with Kubernetes.</p>
<p>In this tutorial, you'll learn about three major container runtimes and how to use them on your system. We’ll dive into practical examples with complete code you can run right now. By the end, you’ll understand when to use each runtime and how to move containers between them.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-are-container-runtimes">What Are Container Runtimes?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-understand-high-level-vs-low-level-runtimes">How to Understand High-Level vs Low-Level Runtimes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-docker-as-your-baseline">How to Use Docker as Your Baseline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-podman-the-daemonless-alternative">How to Use Podman – The Daemonless Alternative</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-work-with-containerd">How to Work with Containerd</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-move-containers-between-runtimes">How to Move Containers Between Runtimes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-use-cases">Real-World Use Cases</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-quick-reference-guide">Quick Reference Guide</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-are-container-runtimes">What Are Container Runtimes?</h2>
<p>A container runtime is the software that actually runs your containers. When you type <code>docker run nginx</code>, for example, several things happen behind the scenes. The Docker CLI talks to the Docker daemon, which then uses a container runtime (usually containerd) to actually create and run the container.</p>
<p>Think of it like this: if containers are apps on your phone, the container runtime is the operating system that makes those apps work. Just like you can install the same app on different phones (iPhone vs Android), you can run the same container on different runtimes.</p>
<h3 id="heading-why-does-this-matter">Why Does This Matter?</h3>
<p>You might wonder why you should care about what's running your containers. Docker works fine, right? Here are a few reasons:</p>
<ol>
<li><p><strong>Security:</strong> Some runtimes like Podman can run containers without root privileges. This means if someone breaks out of your container, they don't have full system access.</p>
</li>
<li><p><strong>Resource usage:</strong> Different runtimes use different amounts of memory and CPU. On a resource-constrained server or edge device, this matters a lot.</p>
</li>
<li><p><strong>Integration:</strong> If you're deploying to Kubernetes, understanding containerd or CRI-O helps you troubleshoot production issues.</p>
</li>
<li><p><strong>Licensing:</strong> Docker Desktop has licensing requirements for large companies. Alternatives like Podman are completely free.</p>
</li>
</ol>
<p>Here’s a chart that summarizes these key points:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770901945553/8ef53746-02d1-4936-8930-fc7255aaa2bc.jpeg" alt="Container runtime comparison chart" class="image--center mx-auto" width="2647" height="1510" loading="lazy"></p>
<h2 id="heading-how-to-understand-high-level-vs-low-level-runtimes">How to Understand High-Level vs Low-Level Runtimes</h2>
<p>Container runtimes are split into two categories, and understanding this distinction helps you see how everything fits together.</p>
<h3 id="heading-low-level-runtimes">Low-Level Runtimes</h3>
<p>Low-level runtimes like <code>runc</code> and <code>crun</code> do the actual work of creating containers. They interact directly with the Linux kernel to create isolated environments using features like namespaces and cgroups.</p>
<p><strong>Namespaces</strong> isolate what a process can see. For example, a PID (process ID) namespace means the container can't see other processes running on your system. A network namespace means it has its own network stack.</p>
<p><strong>Cgroups</strong> (control groups) limit what a process can use. You can limit a container to 512MB of RAM or 50% of one CPU core. This prevents one container from hogging all your resources.</p>
<p>These low-level runtimes implement the OCI (Open Container Initiative) Runtime Specification. This is a standard that defines exactly how to run a container. Because of this standard, you can swap out runtimes and your containers still work.</p>
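<p>To make the cgroup limits mentioned above concrete, here is a rough sketch of the values a runtime might write for "512MB of RAM" and "50% of one CPU core". It assumes the cgroup v2 interface files <code>cpu.max</code> and <code>memory.max</code> and the usual 100 ms CPU period. The helper names are illustrative, and in practice the runtime writes these files for you.</p>

```python
MIB = 1024 * 1024
DEFAULT_PERIOD_US = 100_000  # assumed CPU period: 100 ms, in microseconds


def cpu_max(cores: float, period_us: int = DEFAULT_PERIOD_US) -> str:
    """Build a cgroup v2 cpu.max value ("quota period") for a fraction of a core."""
    quota_us = int(cores * period_us)
    return f"{quota_us} {period_us}"


def memory_max(mebibytes: int) -> str:
    """Build a cgroup v2 memory.max value, which is a byte count."""
    return str(mebibytes * MIB)


print(cpu_max(0.5))     # 50000 100000
print(memory_max(512))  # 536870912
```

<p>A quota of 50000 microseconds per 100000-microsecond period means the container may use the CPU for half of each period, which is where the "50% of one core" figure comes from.</p>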
<h3 id="heading-high-level-runtimes">High-Level Runtimes</h3>
<p>High-level runtimes like Docker, Podman, and containerd manage images, networking, volumes, and provide user-friendly interfaces. They handle pulling images from registries, setting up networks between containers, and managing container lifecycles.</p>
<p>These high-level runtimes use low-level runtimes under the hood. When you run <code>docker run</code>, Docker ultimately calls <code>runc</code> to create the container. This layering means you get a nice interface while still benefiting from the standard, battle-tested low-level runtime.</p>
<h4 id="heading-why-this-layering-matters">Why This Layering Matters:</h4>
<p>This separation of concerns is powerful. High-level runtimes can focus on user experience and features while low-level runtimes focus on reliably creating containers. You can swap low-level runtimes without changing your workflow. Some people use <code>crun</code> instead of <code>runc</code> because it's written in C and starts faster.</p>
<h2 id="heading-how-to-use-docker-as-your-baseline">How to Use Docker as Your Baseline</h2>
<p>Let's start with Docker since you're probably already familiar with it. This will give us a baseline to compare other runtimes against. We'll build a simple web application and then run the same application in different runtimes to see how they compare.</p>
<h3 id="heading-how-to-install-docker">How to Install Docker</h3>
<p>You can find installation guides for your operating system:</p>
<ul>
<li><p><a target="_blank" href="https://docs.docker.com/desktop/install/mac-install/">Docker Desktop for Mac</a></p>
</li>
<li><p><a target="_blank" href="https://docs.docker.com/desktop/install/windows-install/">Docker Desktop for Windows</a></p>
</li>
<li><p><a target="_blank" href="https://docs.docker.com/engine/install/">Docker Engine for Linux</a></p>
</li>
</ul>
<h3 id="heading-how-to-run-a-test-container">How to Run a Test Container</h3>
<p>Let's verify that Docker works by running a simple container:</p>
<pre><code class="lang-bash">docker run hello-world
</code></pre>
<p>You should see a message that says:</p>
<pre><code class="lang-bash">Hello from Docker!
This message shows that your installation appears to be working correctly.
</code></pre>
<h4 id="heading-what-just-happened">What Just Happened?</h4>
<p>When you ran that command, Docker checked if the <code>hello-world</code> image exists locally. It didn't find it, so it pulled the image from Docker Hub (a public registry). Then it created a container from that image, started the container, and the container printed its message and exited.</p>
<p>All of this happened in a few seconds. Now let's build something more useful.</p>
<h3 id="heading-how-to-create-a-web-server">How to Create a Web Server</h3>
<p>Create a new directory for your project:</p>
<pre><code class="lang-bash">mkdir ~/container-demo
<span class="hljs-built_in">cd</span> ~/container-demo
</code></pre>
<p>The <code>~</code> symbol means your home directory. On macOS, this is <code>/Users/yourname</code>. On Linux, it's <code>/home/yourname</code>.</p>
<p>Create a simple HTML file:</p>
<pre><code class="lang-bash">cat &gt; index.html &lt;&lt; <span class="hljs-string">'EOF'</span>
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;&lt;title&gt;Container Demo&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
  &lt;h1&gt;Hello from Docker!&lt;/h1&gt;
  &lt;p&gt;This is running <span class="hljs-keyword">in</span> a container.&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
EOF
</code></pre>
<p>This creates a basic HTML file. The <code>cat &gt;</code> command writes to a file, and <code>&lt;&lt; 'EOF'</code> means "read until you see EOF" (End Of File). This is a handy way to create files from the command line.</p>
<h3 id="heading-how-to-create-a-dockerfile">How to Create a Dockerfile</h3>
<p>You can create a Dockerfile like this:</p>
<pre><code class="lang-bash">cat &gt; Dockerfile &lt;&lt; <span class="hljs-string">'EOF'</span>
FROM nginx:alpine
COPY index.html /usr/share/nginx/html/
EOF
</code></pre>
<h4 id="heading-understanding-the-dockerfile">Understanding the Dockerfile:</h4>
<p>The Dockerfile has two instructions:</p>
<ol>
<li><p><strong>FROM nginx:alpine</strong>: This starts with the official Nginx image. The <code>:alpine</code> tag means we're using the Alpine Linux version, which is much smaller (about 20MB instead of 130MB). Alpine is a minimal Linux distribution popular in containers because of its small size.</p>
</li>
<li><p><strong>COPY index.html /usr/share/nginx/html/</strong>: This copies your HTML file into the location where Nginx serves files. Inside the container, Nginx is configured to serve files from <code>/usr/share/nginx/html/</code>.</p>
</li>
</ol>
<h3 id="heading-how-to-build-a-docker-image">How to Build a Docker Image</h3>
<pre><code class="lang-bash">docker build -t my-web-app .
</code></pre>
<p>The <code>-t</code> flag means "tag" – we're naming the image <code>my-web-app</code>. The <code>.</code> at the end means "use the current directory as the build context". Docker will look for a Dockerfile in the current directory and send all files here to the Docker daemon for building.</p>
<p>You'll see output like:</p>
<pre><code class="lang-bash">[+] Building 2.3s (7/7) FINISHED
=&gt; [internal] load build definition from Dockerfile
=&gt; =&gt; transferring dockerfile: 98B
=&gt; [internal] load .dockerignore
...
=&gt; =&gt; naming to docker.io/library/my-web-app
</code></pre>
<p>This shows Docker building your image layer by layer. Each instruction in the Dockerfile creates a new layer. These layers are cached, so a rebuild with no changes finishes almost instantly.</p>
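<p>One way to picture layer caching is as a chain of keys, where each key depends on the previous layer, the instruction text, and the contents of any copied file. The sketch below is a deliberately simplified model of that idea, not Docker's actual implementation. The function <code>layer_keys</code> and its inputs are hypothetical.</p>

```python
import hashlib


def layer_keys(instructions, file_contents):
    """Return one cache key per instruction, chained to the previous key.

    Simplified model: a key depends on the parent key, the instruction text,
    and (for COPY) the bytes of the copied file.
    """
    keys = []
    parent = ""
    for instruction in instructions:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        if instruction.startswith("COPY "):
            source = instruction.split()[1]
            digest.update(file_contents[source])
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


dockerfile = ["FROM nginx:alpine", "COPY index.html /usr/share/nginx/html/"]
first = layer_keys(dockerfile, {"index.html": b"Hello from Docker!"})
second = layer_keys(dockerfile, {"index.html": b"Hello again!"})

print(first[0] == second[0])  # True: the FROM step is unchanged, so it is reused
print(first[1] == second[1])  # False: index.html changed, so COPY is rebuilt
```

<p>Because each key includes its parent, a change early in the Dockerfile invalidates every layer after it. That is why frequently changing files are usually copied as late as possible.</p>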
<h3 id="heading-how-to-run-a-docker-container">How to Run a Docker Container</h3>
<pre><code class="lang-bash">docker run -d -p 8080:80 my-web-app
</code></pre>
<h4 id="heading-understanding-the-flags">Understanding the Flags:</h4>
<ul>
<li><p><strong>-d</strong> means "detached mode" – run in the background. Without this, the container runs in the foreground and you'll see Nginx's log output. With <code>-d</code>, it returns immediately and runs in the background.</p>
</li>
<li><p><strong>-p 8080:80</strong> maps port 8080 on your host machine to port 80 inside the container. Nginx listens on port 80 inside the container. To access it from your browser, you need to map it to a port on your machine. We chose 8080, but you could use any available port.</p>
</li>
</ul>
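<p>The <code>host:container</code> shape of the <code>-p</code> value is easy to mix up, so here is a minimal sketch that splits it the same way you should read it. It only handles the simple two-part form used in this tutorial, and <code>parse_port_mapping</code> is an illustrative helper, not part of Docker.</p>

```python
def parse_port_mapping(value: str) -> tuple[int, int]:
    """Split a "host:container" port mapping such as "8080:80"."""
    host_text, container_text = value.split(":")
    host_port, container_port = int(host_text), int(container_text)
    for port in (host_port, container_port):
        if port not in range(1, 65536):
            raise ValueError(f"port out of range: {port}")
    return host_port, container_port


host_port, container_port = parse_port_mapping("8080:80")
print(f"http://localhost:{host_port} forwards to port {container_port} in the container")
```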
<p>Open your browser and visit <code>http://localhost:8080</code>. You should see your HTML page!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770902754636/b6641413-7bd6-4548-aa75-dbc487630c1d.png" alt="Localhost running docker container" class="image--center mx-auto" width="884" height="594" loading="lazy"></p>
<h4 id="heading-how-to-check-running-containers">How to Check Running Containers:</h4>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>This shows all running containers. You'll see something like:</p>
<pre><code class="lang-bash">CONTAINER ID   IMAGE        COMMAND                  PORTS                  NAMES
a1b2c3d4e5f6   my-web-app   <span class="hljs-string">"/docker-entrypoint.…"</span>   0.0.0.0:8080-&gt;80/tcp   peaceful_curie
</code></pre>
<p>Docker automatically generated a random name (<code>peaceful_curie</code> in this example). You can specify a name with <code>--name</code> if you prefer.</p>
<h4 id="heading-how-to-view-container-logs">How to View Container Logs:</h4>
<pre><code class="lang-bash">docker logs &lt;container-id&gt;
</code></pre>
<p>Replace <code>&lt;container-id&gt;</code> with the ID from <code>docker ps</code> (just the first few characters work). This shows what's happening inside the container. For Nginx, you'll see access logs showing requests to your web server.</p>
<h4 id="heading-how-to-stop-the-container">How to Stop the Container:</h4>
<pre><code class="lang-bash">docker stop &lt;container-id&gt;
</code></pre>
<p>This gracefully stops the container. Nginx receives a signal to shut down cleanly.</p>
<p>Now that you understand how to use Docker, let’s check out how Podman works next.</p>
<h2 id="heading-how-to-use-podman-the-daemonless-alternative">How to Use Podman – The Daemonless Alternative</h2>
<p>Now let's try Podman. It's designed to be a drop-in replacement for Docker, but with some key differences that make it interesting for specific use cases.</p>
<h3 id="heading-why-podman-exists">Why Podman Exists</h3>
<p>Docker runs as a daemon (a background service) that, by default, requires root privileges. This daemon is always running and listening for commands. This architecture has some downsides:</p>
<ol>
<li><p><strong>Security:</strong> The Docker daemon runs as root. If someone compromises the daemon, they have root access to your entire system.</p>
</li>
<li><p><strong>Resource Usage:</strong> The daemon consumes resources even when you're not running containers.</p>
</li>
<li><p><strong>Single Point of Failure:</strong> If the daemon crashes, all your containers stop (unless Docker's live restore option is enabled).</p>
</li>
</ol>
<p>Podman solves these problems by not using a daemon at all. Each <code>podman</code> command runs independently. This is called a "daemonless" architecture.</p>
<h3 id="heading-key-podman-features">Key Podman Features</h3>
<p>To summarize, here are some key helpful features of Podman that might make it a good fit for your projects:</p>
<ol>
<li><p><strong>No daemon required:</strong> Each command runs independently. No background service needed.</p>
</li>
<li><p><strong>Rootless by default:</strong> Containers run as your regular user, not as root. This dramatically improves security.</p>
</li>
<li><p><strong>Drop-in Docker replacement:</strong> Most Docker commands work exactly the same. You can even alias <code>docker=podman</code> and many applications won't notice the difference.</p>
</li>
<li><p><strong>Pod support:</strong> Podman has a concept of "pods" like Kubernetes. This is unique among container tools.</p>
</li>
</ol>
<p>Now that you understand the benefits of Podman, let’s see how you can use it.</p>
<h3 id="heading-how-to-install-podman">How to Install Podman</h3>
<p>Podman installation varies by operating system. Here are the official guides:</p>
<ul>
<li><p><a target="_blank" href="https://podman.io/docs/installation#macos">Podman for macOS</a></p>
</li>
<li><p><a target="_blank" href="https://podman.io/docs/installation#windows">Podman for Windows</a></p>
</li>
<li><p><a target="_blank" href="https://podman.io/docs/installation#linux">Podman for Linux</a></p>
</li>
</ul>
<p><strong>For macOS users</strong> (what we'll use in this tutorial), you can install Podman using Homebrew:</p>
<pre><code class="lang-bash">brew install podman
</code></pre>
<h3 id="heading-how-to-initialize-and-start-podman-machine">How to Initialize and Start Podman Machine</h3>
<p>On macOS, Podman needs a Linux VM to run containers (since containers use Linux kernel features). Podman Machine handles this for you:</p>
<pre><code class="lang-bash">podman machine init
</code></pre>
<p>This creates a small Linux VM. You’ll only need to do this once. The VM is about 1GB and uses minimal resources when running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903100891/671690cc-8073-4748-b2df-c4308585d411.png" alt="Initialize podman machine" class="image--center mx-auto" width="1028" height="344" loading="lazy"></p>
<p>Start the machine:</p>
<pre><code class="lang-bash">podman machine start
</code></pre>
<p>Verify it's working:</p>
<pre><code class="lang-bash">podman --version
</code></pre>
<p>You should see something like:</p>
<pre><code class="lang-bash">podman version 4.5.0
</code></pre>
<h3 id="heading-how-to-run-containers-with-podman">How to Run Containers with Podman</h3>
<p>Here's where it gets interesting. You can use nearly identical commands to Docker. Let's build and run the same web server you created earlier:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Build the image (same command as Docker)</span>
podman build -t my-web-app .

<span class="hljs-comment"># Run the container</span>
podman run -d -p 8081:80 my-web-app

<span class="hljs-comment"># See running container</span>
podman ps
</code></pre>
<p>Notice that we used port 8081 this time so it doesn't conflict with the Docker container if it's still running. Visit <code>http://localhost:8081</code> and you'll see the same page, but this time it's running in Podman!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903417925/4717dd8f-bda5-4aaa-ad16-2a24726ee820.png" alt="Localhost running podman container" class="image--center mx-auto" width="856" height="458" loading="lazy"></p>
<p>If you run into an issue when running the <code>podman build</code> command, you can delete the Docker image using <code>docker image rm my-web-app:latest</code>.</p>
<h4 id="heading-whats-different-under-the-hood">What's Different Under the Hood?</h4>
<p>Even though the commands look the same, what's happening is different. First, no daemon was involved: the <code>podman</code> command directly created and started the container. Second, the container is running as your user, not as root.</p>
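<p>You can verify this by checking which user owns the process, both inside the container and on the host:</p>
<pre><code class="lang-bash">podman top &lt;container-id&gt; user huser
</code></pre>
<p>The <code>user</code> column shows the user inside the container (often <code>root</code>), while <code>huser</code> shows the matching user on the host. In rootless mode, the container's <code>root</code> maps to an unprivileged user on the host rather than to the host's <code>root</code>. On macOS, the "host" here is the Podman machine VM.</p>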
<h3 id="heading-podman-pods-a-unique-feature">Podman Pods – A Unique Feature</h3>
<p>Podman has a unique feature that Docker doesn't have: pods. A pod is a group of containers that share networking and storage. This is the same concept Kubernetes uses, which makes Podman excellent for local Kubernetes development.</p>
<h4 id="heading-why-pods-matter">Why Pods Matter:</h4>
<p>In real applications, you often have multiple containers that need to work together. For example, a web application typically needs a database to store data, a cache layer for frequently accessed data, and a logging container that records requests, responses, and other critical but non-sensitive application metadata.</p>
<p>These four containers (web, database, cache, logger) need to communicate with each other. In Docker, you'd create a custom network and connect each container to it. In Podman, you can create a pod that automatically handles this networking.</p>
<h3 id="heading-how-to-create-a-podman-pod">How to Create a Podman Pod</h3>
<pre><code class="lang-bash">podman pod create --name my-app-pod -p 8082:80
</code></pre>
<p>This creates a pod named <code>my-app-pod</code> and exposes port 8082 on your host to port 80 inside the pod. Notice that you don't expose ports on individual containers – you expose them on the pod.</p>
<p>Add a web server to the pod:</p>
<pre><code class="lang-bash">podman run -d --pod my-app-pod --name web nginx:alpine
</code></pre>
<p>The <code>--pod</code> flag tells Podman to run this container inside the pod. The container doesn't need its own port mapping because the pod handles that.</p>
<p>Add Redis (an in-memory database) to the pod:</p>
<pre><code class="lang-bash">podman run -d --pod my-app-pod --name cache redis:alpine
</code></pre>
<p>Now you have two containers running in the same pod. Here's the powerful part: they share the same network namespace.</p>
<p>To check your pod:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># List all pods</span>
podman pod ps -a

<span class="hljs-comment"># Show details for one pod</span>
podman pod inspect &lt;pod-name-or-id&gt;

<span class="hljs-comment"># Check processes running in the pod</span>
podman pod top &lt;pod-name-or-id&gt;

<span class="hljs-comment"># See logs from containers in that pod</span>
podman logs &lt;container-name-or-id&gt;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903859712/3cabe09b-d693-4adf-85bf-74115122203a.png" alt="Podman pod inspection showing container running" class="image--center mx-auto" width="1128" height="744" loading="lazy"></p>
<h4 id="heading-understanding-shared-networking">Understanding Shared Networking:</h4>
<p>Both containers can reach each other using <code>localhost</code>. The web container can connect to Redis using <code>localhost:6379</code> (Redis's default port). It's as if they're running on the same machine.</p>
<p>This is exactly how Kubernetes pods work. If you learn Podman pods, you're learning Kubernetes networking too.</p>
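<p>The effect of a shared network namespace can be imitated on a single machine: two programs that share a network stack reach each other over the loopback address. The sketch below is only an analogy, using one thread as a stand-in for the cache container and the main thread as a stand-in for the web container. It does not talk to Podman or Redis, and all names in it are illustrative.</p>

```python
import socket
import threading


def serve_once(server: socket.socket) -> None:
    """Accept one connection and answer like a tiny cache would."""
    conn, _ = server.accept()
    with conn:
        conn.recv(1024)
        conn.sendall(b"PONG")


def ping_over_localhost() -> bytes:
    # Stand-in for the "cache" container: listen on loopback, any free port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()

    # Stand-in for the "web" container: connect through localhost.
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(b"PING")
        reply = client.recv(1024)

    worker.join()
    server.close()
    return reply


print(ping_over_localhost())  # b'PONG'
```

<p>Inside a real pod, the web container would connect to <code>localhost:6379</code> in the same way, with no custom network or hostname needed.</p>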
<h3 id="heading-how-to-generate-kubernetes-yaml-from-pods">How to Generate Kubernetes YAML from Pods</h3>
<p>Here's where Podman really shines. You can generate Kubernetes-compatible YAML from your pod:</p>
<pre><code class="lang-bash">podman generate kube my-app-pod &gt; my-app-pod.yaml
</code></pre>
<p>Open <code>my-app-pod.yaml</code> and you'll see proper Kubernetes configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Save the output of this file and use kubectl create -f to import</span>
<span class="hljs-comment"># it into Kubernetes.</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Created with podman-5.7.1</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">io.kubernetes.cri-o.SandboxID/cache:</span> <span class="hljs-string">5e56bd9eab1a02a88654e3614312302d0f3f8d3652480498e6d1eef7d4824019</span>
    <span class="hljs-attr">io.kubernetes.cri-o.SandboxID/web:</span> <span class="hljs-string">5e56bd9eab1a02a88654e3614312302d0f3f8d3652480498e6d1eef7d4824019</span>
  <span class="hljs-attr">creationTimestamp:</span> <span class="hljs-string">"2026-02-12T13:44:55Z"</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">my-app-pod</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-app-pod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">args:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">nginx</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">-g</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">daemon</span> <span class="hljs-string">off;</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.io/library/nginx:alpine</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">web</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">hostPort:</span> <span class="hljs-number">8082</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">args:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">redis-server</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.io/library/redis:alpine</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">cache</span>
</code></pre>
<p>This file can be applied to a Kubernetes cluster, for example a local minikube cluster:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># using minikube cluster</span>
kubectl apply -f my-app-pod.yaml
</code></pre>
<p>This is incredibly useful for local development. You can prototype your application using Podman pods, generate the YAML, and deploy to Kubernetes without rewriting anything.</p>
<h3 id="heading-how-to-manage-podman-machines">How to Manage Podman Machines</h3>
<p>When working with Podman on macOS or Windows, you're using a Linux VM. Here's how to manage it.</p>
<h4 id="heading-list-all-podman-machines">List all Podman machines:</h4>
<pre><code class="lang-bash">podman machine list
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074980607/84d2692e-11ce-4943-9187-a6a993d43c1d.png" alt="podman machine list" class="image--center mx-auto" width="1296" height="210" loading="lazy"></p>
<p>This shows all your Podman VMs, their status (running or stopped), and their names. The default machine is usually called <code>podman-machine-default</code>.</p>
<h4 id="heading-check-machine-status-and-info">Check machine status and info:</h4>
<pre><code class="lang-bash">podman machine info
</code></pre>
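<p>This displays information about the machine host, such as the VM provider, the default machine, and how many machines are defined. To see the CPU, memory, and disk allocated to a specific machine, use <code>podman machine inspect</code>.</p>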
<h4 id="heading-stop-the-podman-machine">Stop the Podman machine:</h4>
<pre><code class="lang-bash">podman machine stop
</code></pre>
<p>If you have multiple machines, specify the name:</p>
<pre><code class="lang-bash">podman machine stop podman-machine-default
</code></pre>
<p>This stops the VM but preserves it. All your images and containers remain intact. When you stop the machine, all running containers inside it are stopped.</p>
<h4 id="heading-start-a-stopped-machine">Start a stopped machine:</h4>
<pre><code class="lang-bash">podman machine start
</code></pre>
<p>Or with a specific name:</p>
<pre><code class="lang-bash">podman machine start podman-machine-default
</code></pre>
<p>This restarts the VM. Your images are still there, but containers remain stopped unless you started them with a restart policy.</p>
<h4 id="heading-delete-a-podman-machine">Delete a Podman machine:</h4>
<pre><code class="lang-bash">podman machine rm podman-machine-default
</code></pre>
<p>This completely destroys the VM and all its contents (images, containers, volumes). Use this when you want to start fresh or free up disk space.</p>
<p>With this basic understanding of how Podman works, we can move on and learn about how to use Containerd.</p>
<h2 id="heading-how-to-work-with-containerd">How to Work with Containerd</h2>
<p>containerd is the runtime that Docker itself uses under the hood. It's also the default runtime for most Kubernetes installations. When you run Docker, you're already using containerd, even if you never interact with it directly.</p>
<h3 id="heading-why-use-containerd-directly">Why Use containerd Directly?</h3>
<p>You might wonder why you'd use containerd directly if Docker already uses it. Here are a few reasons:</p>
<ol>
<li><p><strong>Kubernetes:</strong> Most Kubernetes clusters use containerd as their container runtime. Understanding it helps you troubleshoot production issues.</p>
</li>
<li><p><strong>Minimal footprint:</strong> containerd has no UI and minimal features. It typically uses far less memory than Docker Desktop (roughly 50MB versus 2GB, though the exact numbers depend on your setup and workload).</p>
</li>
<li><p><strong>Building tools:</strong> If you're building container orchestration tools, working directly with containerd gives you fine-grained control.</p>
</li>
</ol>
<h3 id="heading-understanding-the-architecture">Understanding the Architecture</h3>
<p>The containerd architecture looks like this:</p>
<pre><code class="lang-bash">Your Command → nerdctl → containerd → runc → Container
</code></pre>
<p>In this chain, nerdctl provides a Docker-like CLI, containerd manages images and container lifecycle, and runc actually creates the container using kernel features.</p>
<h3 id="heading-how-to-install-containerd-with-nerdctl">How to Install containerd with nerdctl</h3>
<p>containerd is designed for systems (like Kubernetes) rather than direct developer use. The installation approach differs by operating system:</p>
<ul>
<li><p><a target="_blank" href="https://lima-vm.io/docs/installation/">Lima for macOS</a> (includes nerdctl)</p>
</li>
<li><p><a target="_blank" href="https://github.com/containerd/containerd/blob/main/docs/getting-started.md">containerd for Linux</a> (native installation)</p>
</li>
<li><p><a target="_blank" href="https://github.com/containerd/nerdctl/releases">nerdctl releases</a> (for all platforms)</p>
</li>
</ul>
<p><strong>For macOS users</strong> (the setup this tutorial follows), we’ll use Lima, which provides a Linux VM with containerd and nerdctl already installed.</p>
<pre><code class="lang-bash">brew install lima
</code></pre>
<p>Lima comes with nerdctl built-in, so you don't need to install it separately.</p>
<p><strong>For Linux users</strong>, you can install containerd directly from your package manager and download nerdctl from the GitHub releases page. Containerd runs natively on Linux without needing a VM.</p>
<h3 id="heading-how-to-start-a-lima-instance">How to Start a Lima Instance</h3>
<pre><code class="lang-bash">limactl start
</code></pre>
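<p>This creates a default Linux VM running containerd with nerdctl available. The VM is configured with reasonable defaults (recent Lima releases use up to 4 CPUs, up to 4GiB of RAM, and a 100GiB disk; <code>limactl list</code> shows the actual values for your VM). You can customize these settings if needed.</p>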
<p>Lima mounts your home directory inside the VM, so you can access your files. This makes working with Lima feel transparent – you don't need to copy files into the VM.</p>
<p>Verify it's working:</p>
<pre><code class="lang-bash">lima nerdctl run hello-world
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074008992/aa76dcd5-8eb5-4baf-9d72-47e1f4aa3ae3.png" alt="Containerd Lima instance and Hello-world container" class="image--center mx-auto" width="1686" height="834" loading="lazy"></p>
<h3 id="heading-how-to-run-your-app-with-nerdctl">How to Run Your App with nerdctl</h3>
<p>The commands are nearly identical to Docker. This is intentional – nerdctl aims for Docker compatibility. Since we're running through Lima, we’ll prefix commands with <code>lima</code>.</p>
<p>Navigate to your project directory:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> ~/container-demo
</code></pre>
<p>Build the image:</p>
<pre><code class="lang-bash">lima nerdctl build -t my-web-app .
</code></pre>
<p>Run the container:</p>
<pre><code class="lang-bash">lima nerdctl run -d -p 8083:80 my-web-app
</code></pre>
<p>Visit <code>http://localhost:8083</code> to see your app running on containerd!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074352767/8ee7339d-8145-494e-9bac-41bfc8f620e1.png" alt="Localhost running containerd container" class="image--center mx-auto" width="1066" height="598" loading="lazy"></p>
<h3 id="heading-whats-different-from-docker">What's Different from Docker?</h3>
<p>Under the hood, a lot is different. containerd, not dockerd, is managing your image and container: nerdctl talks to the containerd daemon directly, with no Docker Engine in between. Images are stored differently too (though they're OCI-compliant, so they're compatible).</p>
<p>But from your perspective as a developer, the commands feel the same. This is the power of standards like OCI.</p>
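<p>One way to see this for yourself is to ask nerdctl which server components it is talking to. The sketch below assumes the default Lima VM from earlier is running. The <code>show_runtime_stack</code> helper name is made up for this example, and it skips quietly if <code>lima</code> is not installed.</p>
<pre><code class="lang-bash"># Hypothetical helper: print the client and server versions nerdctl reports
show_runtime_stack() {
  if [ -x "$(command -v lima)" ]; then
    # The Server section of the output should list containerd and runc
    lima nerdctl version
  else
    echo "lima not found; skipping"
  fi
}

show_runtime_stack
</code></pre>
<p>There is no Docker Engine entry in the output, because nothing in this chain depends on dockerd.</p>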
<h4 id="heading-how-to-check-running-containers-1">How to Check Running Containers:</h4>
<pre><code class="lang-bash">lima nerdctl ps
</code></pre>
<p>This shows all running containers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074426408/3b5da24c-0dad-4c8e-9ce8-fd8cb319f9f9.png" alt="Running containers" class="image--center mx-auto" width="1878" height="304" loading="lazy"></p>
<h3 id="heading-how-to-manage-lima-vms">How to Manage Lima VMs</h3>
<p>When working with containerd through Lima, you're using a Linux VM. Here's how to manage it.</p>
<h4 id="heading-list-all-lima-vms">List all Lima VMs:</h4>
<pre><code class="lang-bash">limactl list
</code></pre>
<p>This shows all your Lima VMs, their status (running or stopped), and their names. The default VM is usually called <code>default</code>.</p>
<h4 id="heading-check-vm-status-and-info">Check VM status and info:</h4>
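<pre><code class="lang-bash">limactl list --json default
</code></pre>
<p>This prints the status and configuration of the specified VM (CPUs, memory, disk, and so on) as JSON. Note that <code>limactl info</code> does not take a VM name: it prints diagnostic information about the Lima installation itself.</p>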
<h4 id="heading-stop-the-lima-vm">Stop the Lima VM:</h4>
<pre><code class="lang-bash">limactl stop default
</code></pre>
<p>This stops the VM but preserves it. All your images and containers remain intact. When you stop the VM, all running containers inside it are stopped. The next time you start it, your images will still be there but containers remain stopped.</p>
<h4 id="heading-start-a-stopped-vm">Start a stopped VM:</h4>
<pre><code class="lang-bash">limactl start default
</code></pre>
<p>This restarts the VM. Your images persist across restarts, so you don't need to rebuild them.</p>
<h4 id="heading-delete-a-lima-vm">Delete a Lima VM:</h4>
<pre><code class="lang-bash">limactl delete default
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074893694/071702bb-8a35-4681-98ec-f1375a52c5d7.png" alt="Containerd VM list and deletion" class="image--center mx-auto" width="1272" height="324" loading="lazy"></p>
<p>This completely destroys the VM and all its contents (images, containers, volumes). Use this when you want to start fresh or free up disk space. You'll need to run <code>limactl start</code> again to create a new VM.</p>
<h4 id="heading-create-a-new-vm-with-custom-settings">Create a new VM with custom settings:</h4>
<pre><code class="lang-bash">limactl start --name my-custom-vm --cpus 4 --memory 8
</code></pre>
<p>This creates a new VM with 4 CPUs and 8GB of memory. You can have multiple Lima VMs for different projects.</p>
<h2 id="heading-how-to-move-containers-between-runtimes">How to Move Containers Between Runtimes</h2>
<p>Thanks to the OCI (Open Container Initiative) standard, you can move container images between different runtimes. This is incredibly powerful – you can build with one tool and deploy with another.</p>
<h3 id="heading-why-standards-matter">Why Standards Matter</h3>
<p>Before OCI, each container runtime used its own image format. Moving images between runtimes was difficult or impossible.</p>
<p>OCI created standards for the Runtime Specification (how to run a container), the Image Specification (how to package a container image), and the Distribution Specification (how to transfer images between systems).</p>
<p>Now all major runtimes follow these standards, making images portable.</p>
<h3 id="heading-method-1-using-container-registries">Method 1 – Using Container Registries</h3>
<p>The easiest way to share images is through a container registry like Docker Hub, GitHub Container Registry, or your own private registry. Any runtime can push and pull from registries.</p>
<p>First, build with Docker:</p>
<pre><code class="lang-bash">docker build -t my-username/my-app:v1 .
</code></pre>
<p>The image name has three parts: <code>my-username</code> (your registry username), <code>my-app</code> (the application name), and <code>v1</code> (a version tag).</p>
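<p>To make the three parts concrete, here is a small sketch that splits an image name using plain shell parameter expansion. The <code>split_image_ref</code> helper is hypothetical, and it assumes the name has exactly the <code>user/name:tag</code> shape (no registry host and no digest).</p>
<pre><code class="lang-bash"># Hypothetical helper: split "user/name:tag" into its three parts
split_image_ref() {
  ref="$1"
  username="${ref%%/*}"   # everything before the first slash
  rest="${ref#*/}"        # everything after the first slash
  app_name="${rest%%:*}"  # everything before the colon
  tag="${rest##*:}"       # everything after the colon
  echo "$username $app_name $tag"
}

split_image_ref "my-username/my-app:v1"   # prints: my-username my-app v1
</code></pre>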
<p>Push to Docker Hub:</p>
<pre><code class="lang-bash">docker login
docker push my-username/my-app:v1
</code></pre>
<p>You'll need to create a free Docker Hub account if you don't have one. The <code>docker login</code> command prompts for your credentials.</p>
<p>Now pull with Podman:</p>
<pre><code class="lang-bash">podman pull my-username/my-app:v1
</code></pre>
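<p>Podman downloads the image from Docker Hub. Even though it was built with Docker, Podman can use it because both follow OCI standards. If Podman asks which registry to use or can't resolve the short name, pull the fully qualified name instead: <code>docker.io/my-username/my-app:v1</code>.</p>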
<p>Or pull with nerdctl:</p>
<pre><code class="lang-bash">lima nerdctl pull my-username/my-app:v1
</code></pre>
<p>Same image, three different runtimes. This is the power of standards.</p>
<h3 id="heading-method-2-export-and-import">Method 2 – Export and Import</h3>
<p>If you don't want to use a public registry (maybe your image contains proprietary code), you can export images as tar files. This is perfect for air-gapped environments or simply moving images between machines.</p>
<p>Export from Docker:</p>
<pre><code class="lang-bash">docker save my-web-app -o my-web-app.tar
</code></pre>
<p>This creates a file called <code>my-web-app.tar</code> containing the image and all its layers. The file might be large (tens or hundreds of megabytes) depending on your image.</p>
<p>Import to Podman:</p>
<pre><code class="lang-bash">podman load -i my-web-app.tar
</code></pre>
<p>Import to nerdctl:</p>
<pre><code class="lang-bash">lima nerdctl load -i my-web-app.tar
</code></pre>
<p>Now you have the same image available in all three runtimes! You can verify:</p>
<pre><code class="lang-bash">docker images
podman images  
lima nerdctl images
</code></pre>
<p>All three commands will show <code>my-web-app</code> in their image lists.</p>
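<p>If Docker and Podman are installed on the same machine, you can also skip the intermediate file, because <code>docker save</code> writes to standard output and <code>podman load</code> reads from standard input. This is a minimal sketch: the <code>copy_image_docker_to_podman</code> helper name is made up for this example, and it does nothing if either tool is missing.</p>
<pre><code class="lang-bash"># Hypothetical helper: stream an image from Docker straight into Podman
copy_image_docker_to_podman() {
  image="$1"
  if [ ! -x "$(command -v docker)" ]; then
    echo "docker not found; skipping"
    return 0
  fi
  if [ ! -x "$(command -v podman)" ]; then
    echo "podman not found; skipping"
    return 0
  fi
  # docker save writes the tar archive to stdout; podman load reads it from stdin
  docker save "$image" | podman load
}
</code></pre>
<p>You would call it as <code>copy_image_docker_to_podman my-web-app</code>. The tar file approach is still the better fit when the two runtimes live on different machines.</p>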
<h4 id="heading-understanding-image-layers">Understanding Image Layers:</h4>
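<p>When you export an image, you're exporting all its layers. Dockerfile instructions that change the filesystem (such as <code>RUN</code>, <code>COPY</code>, and <code>ADD</code>) each create a layer, while instructions like <code>ENV</code> or <code>CMD</code> only add metadata. These layers are shared between images, which saves disk space.</p>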
<p>For example, if you have 10 images all based on <code>nginx:alpine</code>, they all share the nginx layers. Only the layers unique to each image take up additional space.</p>
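<p>To see the layers of an image you built earlier, you can list its build history. This is a small sketch that assumes the <code>my-web-app</code> image from this tutorial exists in Docker. The <code>show_layers</code> helper name is made up for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: list the build steps behind an image
show_layers() {
  image="${1:-my-web-app}"
  if [ -x "$(command -v docker)" ]; then
    # One row per build step; rows with a size of 0B only changed metadata
    docker history "$image"
  else
    echo "docker not found; skipping"
  fi
}

show_layers my-web-app
</code></pre>
<p>The rows at the bottom of the output come from the base image, which is the part that other images built on the same base can share.</p>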
<h2 id="heading-real-world-use-cases">Real-World Use Cases</h2>
<p>Let's look at some real scenarios where choosing the right runtime matters. These examples show how technical decisions have practical impacts.</p>
<h3 id="heading-use-case-1-security-first-development">Use Case 1 – Security-First Development</h3>
<p>If you're working on security-sensitive applications (financial services, healthcare, government), Podman's rootless containers are a huge advantage.</p>
<h4 id="heading-the-security-problem">The Security Problem:</h4>
<p>By default, the Docker daemon runs as root (a rootless mode exists, but you have to opt in). If someone exploits a vulnerability in your container and escapes to the host system, they may end up with root access. This is called a "container escape".</p>
<p>Podman's rootless mode solves this:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># All Podman commands run as your user by default</span>
podman run --rm -it alpine whoami
</code></pre>
<p>This outputs <code>root</code>, but that is root only inside the container. In rootless mode, Podman uses a user namespace to map the container's root user to your own unprivileged user on the host. The command uses <code>--rm</code> to remove the container when it exits (cleanup), <code>-it</code> to make it interactive with a terminal, <code>alpine</code> as a minimal Linux distribution, and <code>whoami</code> as a command that prints the current username.</p>
<p>Even if someone breaks out of the container, they only have your user's permissions. They can't install system-wide malware, access other users' data, modify system configuration, or install kernel modules.</p>
<p>This dramatically reduces the impact of a container escape.</p>
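<p>If you want to confirm that Podman is really running rootless, a quick check along these lines may help. The <code>check_rootless</code> helper name is made up for this example, and it skips if <code>podman</code> is not installed.</p>
<pre><code class="lang-bash"># Hypothetical helper: compare the user inside and outside the container
check_rootless() {
  if [ -x "$(command -v podman)" ]; then
    # Reports true when Podman is running in rootless mode
    podman info --format '{{.Host.Security.Rootless}}'
    # UID as seen inside the container (0 means root inside the container)
    podman run --rm alpine id -u
    # Your UID on the machine where you typed the command
    id -u
  else
    echo "podman not found; skipping"
  fi
}

check_rootless
</code></pre>
<p>On macOS and Windows the containers run inside the Podman machine, so the last number describes your desktop user rather than the user inside the VM. The point is the same either way: UID 0 inside the container does not mean root on the host.</p>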
<h4 id="heading-example-security-scenario">Example Security Scenario:</h4>
<p>Imagine you're running a web application that processes user uploads. A vulnerability lets an attacker execute code in your container. With Docker running as root, they could escape the container, install a rootkit, steal all data from your server, and persist even after you patch the vulnerability.</p>
<p>With Podman rootless, they might escape the container but can only access files your user can access. They can't affect other users or system files, and anything they leave behind is limited to what your user account is allowed to modify.</p>
<p>The difference is dramatic.</p>
<h3 id="heading-use-case-2-testing-kubernetes-locally">Use Case 2 – Testing Kubernetes Locally</h3>
<p>Podman can generate Kubernetes YAML from running containers. This is perfect for prototyping before you commit to a Kubernetes configuration.</p>
<h4 id="heading-the-development-workflow">The Development Workflow:</h4>
<ol>
<li><p>Run your application locally with Podman</p>
</li>
<li><p>Test and iterate quickly</p>
</li>
<li><p>Generate Kubernetes YAML when it works</p>
</li>
<li><p>Deploy to a real cluster</p>
</li>
</ol>
<p>Here's a practical example. Let's say you're building a web application with a database:</p>
<p>Run your containers:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a pod (like a Kubernetes pod)</span>
podman pod create --name myapp -p 8080:80

<span class="hljs-comment"># Add web server</span>
podman run -d --pod myapp --name web nginx:alpine

<span class="hljs-comment"># Add PostgreSQL</span>
podman run -d --pod myapp --name db \
  -e POSTGRES_PASSWORD=secret \
  postgres:alpine
</code></pre>
<p>Test your application at <code>http://localhost:8080</code>. When it works, generate Kubernetes YAML:</p>
<pre><code class="lang-bash">podman generate kube myapp &gt; myapp.yaml
</code></pre>
<p>Now you can deploy <code>myapp.yaml</code> to any Kubernetes cluster:</p>
<pre><code class="lang-bash">kubectl apply -f myapp.yaml
</code></pre>
<p>This is much faster than writing Kubernetes YAML by hand and debugging in a cluster. You iterate locally, then deploy when ready.</p>
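<p>Before sending the generated file to a cluster, you can replay it locally and ask kubectl to check it without creating anything. This is a sketch under a few assumptions: a recent Podman release that has <code>podman kube play</code> (older releases call it <code>podman play kube</code>), a pod named <code>myapp</code> as created above, and a kubeconfig that kubectl can use. The <code>replay_generated_yaml</code> helper name is made up for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: replay generated YAML locally, then dry-run it with kubectl
replay_generated_yaml() {
  yaml="${1:-myapp.yaml}"
  if [ -x "$(command -v podman)" ]; then
    # Remove the original pod so the names in the YAML don't collide
    podman pod rm -f myapp
    # Recreate the pod and its containers from the generated file
    podman kube play "$yaml"
  else
    echo "podman not found; skipping"
  fi
  if [ -x "$(command -v kubectl)" ]; then
    # Client-side dry run: nothing is created in the cluster
    kubectl apply --dry-run=client -f "$yaml"
  else
    echo "kubectl not found; skipping"
  fi
}
</code></pre>
<p>Call it as <code>replay_generated_yaml myapp.yaml</code>. If the pod comes back up locally and the dry run reports no errors, the file is in reasonable shape for a real deployment.</p>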
<h4 id="heading-why-this-matters">Why This Matters:</h4>
<p>Kubernetes has a steep learning curve. The YAML configuration is verbose and error-prone. By starting with simple Podman commands and generating YAML, you can focus on your application first, learn Kubernetes gradually, catch configuration errors early, and iterate quickly without cloud costs.</p>
<h3 id="heading-use-case-3-resource-constrained-environments">Use Case 3 – Resource-Constrained Environments</h3>
<p>containerd has the smallest footprint. If you're running containers on edge devices, Raspberry Pi, or resource-constrained servers, this matters a lot.</p>
<h4 id="heading-comparing-memory-usage">Comparing Memory Usage:</h4>
<p>Here are rough, illustrative memory footprints for each runtime (actual numbers vary with version, platform, and workload):</p>
<ul>
<li><p>Docker Desktop uses approximately 2GB RAM (includes the VM, daemon, UI, and Kubernetes).</p>
</li>
<li><p>Podman uses approximately 500MB RAM (includes the VM on macOS).</p>
</li>
<li><p>Containerd uses approximately 50MB RAM (just the runtime, no extras).</p>
</li>
</ul>
<p>On a developer laptop with 16GB RAM, this difference doesn't matter much. But consider these scenarios:</p>
<p><strong>1. Edge Computing:</strong></p>
<p>You're running containers on edge devices with 1GB RAM total. A desktop-style stack with its own VM and UI won't fit. containerd leaves room for your application.</p>
<p><strong>2. IoT Devices:</strong></p>
<p>On a Raspberry Pi with 2GB RAM, every extra daemon leaves less room for your application. containerd uses minimal resources.</p>
<p><strong>3. High-Density Servers:</strong></p>
<p>Suppose you run a fleet of 100 servers, each packed with containers. Every MB counts. If a leaner runtime saves 2GB per server, that adds up to 2GB × 100 servers = 200GB across the fleet.</p>
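<p>The arithmetic behind the high-density scenario can be sketched in shell. The figures are the rough ones from the list above (2GB taken as 2048MB, versus 50MB), and the <code>fleet_savings_mb</code> helper name is made up for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: total memory saved across a fleet, in MB
# Arguments: heavy runtime MB, light runtime MB, number of servers
fleet_savings_mb() {
  echo $(( ($1 - $2) * $3 ))
}

# 2048MB versus 50MB, on 100 servers
fleet_savings_mb 2048 50 100   # prints: 199800
</code></pre>
<p>That is roughly 195GB, which is where the round "200GB" figure comes from. Plug in numbers measured on your own servers before drawing conclusions.</p>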
<p><strong>Example Setup for Edge Device:</strong></p>
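<pre><code class="lang-bash"><span class="hljs-comment"># On a Raspberry Pi or similar device</span>
sudo apt-get update
sudo apt-get install -y containerd
<span class="hljs-comment"># nerdctl isn't packaged for every distribution; if this fails,</span>
<span class="hljs-comment"># download it from the nerdctl GitHub releases page instead</span>
sudo apt-get install -y nerdctl

<span class="hljs-comment"># Now you can run containers with minimal overhead</span>
<span class="hljs-comment"># (sudo is needed unless you've set up rootless containerd)</span>
sudo nerdctl run -d my-lightweight-app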
</code></pre>
<p>Your application gets to use most of the available RAM instead of competing with a heavy runtime.</p>
<h2 id="heading-quick-reference-guide">Quick Reference Guide</h2>
<p>Here's a handy comparison of common commands across runtimes:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Task</td><td>Docker</td><td>Podman</td><td>nerdctl (via Lima)</td></tr>
</thead>
<tbody>
<tr>
<td>Build image</td><td><code>docker build -t app .</code></td><td><code>podman build -t app .</code></td><td><code>lima nerdctl build -t app .</code></td></tr>
<tr>
<td>Run container</td><td><code>docker run -d app</code></td><td><code>podman run -d app</code></td><td><code>lima nerdctl run -d app</code></td></tr>
<tr>
<td>List containers</td><td><code>docker ps</code></td><td><code>podman ps</code></td><td><code>lima nerdctl ps</code></td></tr>
<tr>
<td>View logs</td><td><code>docker logs &lt;id&gt;</code></td><td><code>podman logs &lt;id&gt;</code></td><td><code>lima nerdctl logs &lt;id&gt;</code></td></tr>
<tr>
<td>Stop container</td><td><code>docker stop &lt;id&gt;</code></td><td><code>podman stop &lt;id&gt;</code></td><td><code>lima nerdctl stop &lt;id&gt;</code></td></tr>
<tr>
<td>Remove container</td><td><code>docker rm &lt;id&gt;</code></td><td><code>podman rm &lt;id&gt;</code></td><td><code>lima nerdctl rm &lt;id&gt;</code></td></tr>
<tr>
<td>List images</td><td><code>docker images</code></td><td><code>podman images</code></td><td><code>lima nerdctl images</code></td></tr>
<tr>
<td>Pull image</td><td><code>docker pull nginx</code></td><td><code>podman pull nginx</code></td><td><code>lima nerdctl pull nginx</code></td></tr>
<tr>
<td>Push to registry</td><td><code>docker push app</code></td><td><code>podman push app</code></td><td><code>lima nerdctl push app</code></td></tr>
<tr>
<td>Execute in container</td><td><code>docker exec -it &lt;id&gt; sh</code></td><td><code>podman exec -it &lt;id&gt; sh</code></td><td><code>lima nerdctl exec -it &lt;id&gt; sh</code></td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">Conclusion</h2>
<p>In this guide, we’ve explored three major container runtimes and learned how to use Docker, Podman, and containerd. The container ecosystem is much bigger than just Docker, and knowing alternatives gives you more options for security, performance, and specialized use cases.</p>
<p>Use Docker when you're learning or need the best documentation. Use Podman when you need rootless security or are building CI/CD pipelines. Use containerd when you need minimal resource usage or are deploying to Kubernetes clusters.</p>
<p>Thanks to OCI standards, your containers are portable. Build with Docker, test with Podman, deploy with containerd – it all works together! You're not locked into one vendor or tool.</p>
<p>As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> and check out my <a target="_blank" href="https://github.com/Caesarsage/DevOps-Cloud-Projects">DevOps Cloud Projects</a> repository on GitHub.</p>
<p>Happy containerizing!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Prepare for the Kubernetes Administrator Certification and Pass [2026 update] ]]>
                </title>
                <description>
                    <![CDATA[ We just posted a course on the freeCodeCamp.org YouTube channel to help prepare you for the Certified Kubernetes Administrator Certification. This course is designed to provide a deep, practical understanding of Kubernetes administration, from founda... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/prepare-for-the-kubernetes-administrator-certification-and-pass-2026-update/</link>
                <guid isPermaLink="false">698b4fff1eac20c7da3bd7ba</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Tue, 10 Feb 2026 15:34:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770737611970/40c85fc5-30e0-450c-8473-b05124831718.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>We just posted a course on the freeCodeCamp.org YouTube channel to help prepare you for the Certified Kubernetes Administrator Certification. This course is designed to provide a deep, practical understanding of Kubernetes administration, from foundational concepts to advanced troubleshooting.</p>
<p>You can watch the course on <a target="_blank" href="https://youtu.be/l57xKN6OBhY">the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<p>This course was made possible by a grant from Linux Foundation. Use code FREECODECAMP to get 30% off training, certifications, and bundles from Linux Foundation.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/l57xKN6OBhY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>There are many demos in the course using Kubernetes. Below you can find all the commands used in the course so it is easier for you to follow along on your local machine.</p>
<h2 id="heading-cka-hands-on-companion-commands-and-demos">CKA Hands-On Companion: Commands and Demos</h2>
<h2 id="heading-part-1-kubernetes-fundamentals-and-lab-setup">Part 1: Kubernetes Fundamentals and Lab Setup</h2>
<p>This section covers the setup of a single-node cluster using <code>kubeadm</code> to create an environment that mirrors the CKA exam.</p>
<h3 id="heading-section-13-setting-up-your-cka-practice-environment">Section 1.3: Setting Up Your CKA Practice Environment</h3>
<h4 id="heading-step-1-install-a-container-runtime-on-all-nodes"><strong>Step 1: Install a Container Runtime (on all nodes)</strong></h4>
<ol>
<li><p><strong>Load required kernel modules:</strong></p>
<pre><code class="lang-bash">cat &lt;&lt;EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter
</code></pre>
</li>
<li><p><strong>Configure sysctl for networking:</strong></p>
<pre><code class="lang-bash">cat &lt;&lt;EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

sudo sysctl --system
</code></pre>
</li>
<li><p><strong>Install containerd:</strong></p>
<pre><code class="lang-bash"> sudo apt-get update
 sudo apt-get install -y containerd
</code></pre>
</li>
<li><p><strong>Configure containerd for systemd cgroup driver:</strong></p>
<pre><code class="lang-bash"> sudo mkdir -p /etc/containerd
 sudo containerd config default | sudo tee /etc/containerd/config.toml
 sudo sed -i <span class="hljs-string">'s/SystemdCgroup = false/SystemdCgroup = true/'</span> /etc/containerd/config.toml
</code></pre>
</li>
<li><p><strong>Restart and enable containerd:</strong></p>
<pre><code class="lang-bash"> sudo systemctl restart containerd
 sudo systemctl <span class="hljs-built_in">enable</span> containerd
</code></pre>
</li>
</ol>
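<p>Before moving on, it can help to confirm that the cgroup setting was applied and that the service is up. This is a minimal sketch, not part of the course commands. The <code>verify_containerd</code> helper name is made up for this example, and it skips on machines without containerd.</p>
<pre><code class="lang-bash"># Hypothetical helper: check the containerd config and service state
verify_containerd() {
  if [ -x "$(command -v containerd)" ]; then
    # Should print a line containing: SystemdCgroup = true
    grep SystemdCgroup /etc/containerd/config.toml
    # Should print: active
    systemctl is-active containerd
    # Should print: enabled
    systemctl is-enabled containerd
  else
    echo "containerd not found; skipping"
  fi
}

verify_containerd
</code></pre>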
<h4 id="heading-step-2-install-kubernetes-binaries-on-all-nodes"><strong>Step 2: Install Kubernetes Binaries (on all nodes)</strong></h4>
<ol>
<li><p><strong>Disable swap memory:</strong></p>
<pre><code class="lang-bash"> sudo swapoff -a
 <span class="hljs-comment"># Comment out swap in fstab to make it persistent:</span>
 sudo sed -i <span class="hljs-string">'/ swap / s/^\(.*\)$/#\1/g'</span> /etc/fstab
</code></pre>
</li>
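<li><p><strong>Add the Kubernetes apt repository</strong> (the <code>v1.29</code> in both URLs selects the Kubernetes minor version; change it to match the release you want):</p>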
<pre><code class="lang-bash"> sudo apt-get update
 sudo apt-get install -y apt-transport-https ca-certificates curl gpg
 sudo mkdir -p -m 755 /etc/apt/keyrings
 curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
 <span class="hljs-built_in">echo</span> <span class="hljs-string">'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /'</span> | sudo tee /etc/apt/sources.list.d/kubernetes.list
</code></pre>
</li>
<li><p><strong>Install and hold binaries (adjust version as needed):</strong></p>
<pre><code class="lang-bash"> sudo apt-get update
 sudo apt-get install -y kubelet kubeadm kubectl
 sudo apt-mark hold kubelet kubeadm kubectl
</code></pre>
</li>
</ol>
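<p>A quick sanity check after this step might look like the sketch below. It is not part of the course commands. The <code>verify_k8s_binaries</code> helper name is made up for this example, and it skips on machines where kubeadm is not installed.</p>
<pre><code class="lang-bash"># Hypothetical helper: confirm swap is off and the binaries are installed and held
verify_k8s_binaries() {
  if [ -x "$(command -v kubeadm)" ]; then
    # No output means swap is off
    swapon --show
    # Should list kubeadm, kubectl, and kubelet
    apt-mark showhold
    # Print the installed versions
    kubeadm version -o short
    kubectl version --client
    kubelet --version
  else
    echo "kubeadm not found; skipping"
  fi
}

verify_k8s_binaries
</code></pre>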
<h4 id="heading-step-3-configure-a-single-node-cluster-on-the-control-plane"><strong>Step 3: Configure a Single-Node Cluster (on the control plane)</strong></h4>
<ol>
<li><p><strong>Initialize the control-plane node:</strong></p>
<pre><code class="lang-bash"> sudo kubeadm init --pod-network-cidr=10.244.0.0/16
</code></pre>
</li>
<li><p><strong>Configure kubectl for the administrative user:</strong></p>
<pre><code class="lang-bash"> mkdir -p <span class="hljs-variable">$HOME</span>/.kube
 sudo cp -i /etc/kubernetes/admin.conf <span class="hljs-variable">$HOME</span>/.kube/config
 sudo chown $(id -u):$(id -g) <span class="hljs-variable">$HOME</span>/.kube/config
</code></pre>
</li>
<li><p><strong>Remove the control-plane taint:</strong></p>
<pre><code class="lang-bash"> kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</code></pre>
</li>
<li><p><strong>Install the Flannel CNI plugin:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
</code></pre>
</li>
<li><p><strong>Verify the cluster:</strong></p>
<pre><code class="lang-bash"> kubectl get nodes
 kubectl get pods -n kube-system
</code></pre>
</li>
</ol>
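<p>The node usually takes a little while to become <code>Ready</code> after the CNI plugin is installed. Instead of re-running <code>kubectl get nodes</code> by hand, you could wait for it, as in this sketch. The <code>wait_for_cluster</code> helper name and the two-minute timeout are choices made for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: wait for nodes to become Ready, then list all pods
wait_for_cluster() {
  if [ -x "$(command -v kubectl)" ]; then
    # Block until every node reports Ready, or give up after two minutes
    kubectl wait --for=condition=Ready node --all --timeout=120s
    # List pods in every namespace, including the one the CNI plugin uses
    kubectl get pods -A
  else
    echo "kubectl not found; skipping"
  fi
}

wait_for_cluster
</code></pre>
<p>Listing all namespaces is deliberate here: depending on the manifest version, the CNI plugin's pods may not live in <code>kube-system</code>.</p>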
<hr>
<h2 id="heading-part-2-cluster-architecture-installation-amp-configuration-25">Part 2: Cluster Architecture, Installation &amp; Configuration (25%)</h2>
<h3 id="heading-section-21-bootstrapping-a-multi-node-cluster-with-kubeadm">Section 2.1: Bootstrapping a Multi-Node Cluster with <code>kubeadm</code></h3>
<h4 id="heading-initializing-the-control-plane-run-on-control-plane-node"><strong>Initializing the Control Plane (Run on Control Plane node)</strong></h4>
<ol>
<li><p><strong>Run</strong> <code>kubeadm init</code> (Replace <code>&lt;control-plane-private-ip&gt;</code>):</p>
<pre><code class="lang-bash"> sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=&lt;control-plane-private-ip&gt;
</code></pre>
<ul>
<li><strong>Note:</strong> Save the <code>kubeadm join</code> command from the output.</li>
</ul>
</li>
<li><p><strong>Install Calico CNI Plugin:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
</code></pre>
</li>
<li><p><strong>Verify Cluster and CNI installation:</strong></p>
<pre><code class="lang-bash"> kubectl get pods -n kube-system
 kubectl get nodes
</code></pre>
</li>
</ol>
<h4 id="heading-joining-worker-nodes-run-on-each-worker-node"><strong>Joining Worker Nodes (Run on each Worker node)</strong></h4>
<ol>
<li><p><strong>Run the join command saved from</strong> <code>kubeadm init</code>:</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># EXAMPLE - Use the exact command from your kubeadm init output</span>
 sudo kubeadm join &lt;control-plane-private-ip&gt;:6443 --token &lt;token&gt; \
     --discovery-token-ca-cert-hash sha256:&lt;<span class="hljs-built_in">hash</span>&gt;
</code></pre>
</li>
<li><p><strong>Verify the full cluster (from Control Plane node):</strong></p>
<pre><code class="lang-bash"> kubectl get nodes -o wide
</code></pre>
</li>
</ol>
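<p>If you lost the join command, or the token has expired (bootstrap tokens are valid for 24 hours by default), you can generate a fresh one on the control plane node. This is a small sketch. The <code>print_join_command</code> helper name is made up for this example, and it skips on machines without kubeadm.</p>
<pre><code class="lang-bash"># Hypothetical helper: create a new token and print the matching join command
print_join_command() {
  if [ -x "$(command -v kubeadm)" ]; then
    # Creates a fresh bootstrap token and prints the full kubeadm join command
    sudo kubeadm token create --print-join-command
  else
    echo "kubeadm not found; skipping"
  fi
}

print_join_command
</code></pre>
<p>Run the printed command with <code>sudo</code> on each worker node, exactly as you would with the output of <code>kubeadm init</code>.</p>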
<h3 id="heading-section-22-managing-the-cluster-lifecycle">Section 2.2: Managing the Cluster Lifecycle</h3>
<h4 id="heading-upgrading-clusters-with-kubeadm-example-upgrade-to-1291"><strong>Upgrading Clusters with</strong> <code>kubeadm</code> (Example: Upgrade to 1.29.1)</h4>
<ol>
<li><p><strong>Upgrade Control Plane: Upgrade</strong> <code>kubeadm</code> binary:</p>
<pre><code class="lang-bash"> sudo apt-mark unhold kubeadm
 sudo apt-get update &amp;&amp; sudo apt-get install -y kubeadm=<span class="hljs-string">'1.29.1-1.1'</span>
 sudo apt-mark hold kubeadm
</code></pre>
</li>
<li><p><strong>Plan and apply the upgrade (on Control Plane node):</strong></p>
<pre><code class="lang-bash"> sudo kubeadm upgrade plan
 sudo kubeadm upgrade apply v1.29.1
</code></pre>
</li>
<li><p><strong>Upgrade</strong> <code>kubelet</code> and <code>kubectl</code> (on Control Plane node):</p>
<pre><code class="lang-bash"> sudo apt-mark unhold kubelet kubectl
 sudo apt-get update &amp;&amp; sudo apt-get install -y kubelet=<span class="hljs-string">'1.29.1-1.1'</span> kubectl=<span class="hljs-string">'1.29.1-1.1'</span>
 sudo apt-mark hold kubelet kubectl
 sudo systemctl daemon-reload
 sudo systemctl restart kubelet
</code></pre>
</li>
<li><p><strong>Upgrade Worker Node: Drain the node (from Control Plane node):</strong></p>
<pre><code class="lang-bash"> kubectl drain &lt;node-to-upgrade&gt; --ignore-daemonsets
</code></pre>
</li>
<li><p><strong>Upgrade binaries (on Worker Node):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># On the worker node</span>
 sudo apt-mark unhold kubeadm kubelet
 sudo apt-get update
 sudo apt-get install -y kubeadm=<span class="hljs-string">'1.29.1-1.1'</span> kubelet=<span class="hljs-string">'1.29.1-1.1'</span>
 sudo apt-mark hold kubeadm kubelet
</code></pre>
</li>
<li><p><strong>Upgrade node configuration and restart kubelet (on Worker Node):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># On the worker node</span>
 sudo kubeadm upgrade node
 sudo systemctl daemon-reload
 sudo systemctl restart kubelet
</code></pre>
</li>
<li><p><strong>Uncordon the Node (from Control Plane node):</strong></p>
<pre><code class="lang-bash"> kubectl uncordon &lt;node-to-upgrade&gt;
</code></pre>
</li>
</ol>
<h4 id="heading-backing-up-and-restoring-etcd-run-on-control-plane-node"><strong>Backing Up and Restoring etcd (Run on Control Plane node)</strong></h4>
<ol>
<li><p><strong>Perform a Backup (using host</strong> <code>etcdctl</code>):</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Create the backup directory first</span>
 sudo mkdir -p /var/lib/etcd-backup

 sudo ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd-backup/snapshot.db \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key
</code></pre>
</li>
<li><p><strong>Perform a Restore (on the control plane node):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Stop kubelet to stop static pods</span>
 sudo systemctl stop kubelet

 <span class="hljs-comment"># Restore the snapshot to a new data directory</span>
 sudo ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd-backup/snapshot.db \
     --data-dir /var/lib/etcd-restored

 <span class="hljs-comment"># !! IMPORTANT: Manually edit /etc/kubernetes/manifests/etcd.yaml to point to the new data-dir /var/lib/etcd-restored !!</span>

 <span class="hljs-comment"># Restart kubelet to pick up the manifest change</span>
 sudo systemctl start kubelet
</code></pre>
</li>
</ol>
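<p>For the manual edit in the restore step, the change usually amounts to the <code>hostPath</code> of the etcd data volume. The excerpt below is a minimal sketch, assuming the default kubeadm static Pod manifest, where the volume is named <code>etcd-data</code> and points at <code>/var/lib/etcd</code>. Check your own manifest, as names and paths may differ:</p>
<pre><code class="lang-yaml"> # /etc/kubernetes/manifests/etcd.yaml (excerpt, assumes the kubeadm default layout)
 spec:
   volumes:
   - hostPath:
       path: /var/lib/etcd-restored   # was /var/lib/etcd
       type: DirectoryOrCreate
     name: etcd-data
</code></pre>
<p>With this approach, the container's <code>--data-dir</code> flag and <code>volumeMounts</code> can stay as they are, because only the host directory behind the mount changes.</p>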
<h3 id="heading-section-23-implementing-a-highly-available-ha-control-plane">Section 2.3: Implementing a Highly-Available (HA) Control Plane</h3>
<ol>
<li><p><strong>Initialize the First Control-Plane Node (replace</strong> <code>load-balancer.example.com:6443</code> <strong>with your load balancer's address and port):</strong></p>
<pre><code class="lang-bash"> sudo kubeadm init --control-plane-endpoint <span class="hljs-string">"load-balancer.example.com:6443"</span> --upload-certs
</code></pre>
<ul>
<li><strong>Note:</strong> Save the HA-specific join command and the <code>--certificate-key</code>.</li>
</ul>
</li>
<li><p><strong>Join Additional Control-Plane Nodes (Run on the second and third Control Plane nodes):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># EXAMPLE - Use the exact command from your `kubeadm init` output</span>
 sudo kubeadm join load-balancer.example.com:6443 --token &lt;token&gt; \
     --discovery-token-ca-cert-hash sha256:&lt;<span class="hljs-built_in">hash</span>&gt; \
     --control-plane --certificate-key &lt;key&gt;
</code></pre>
</li>
</ol>
<h3 id="heading-section-24-managing-role-based-access-control-rbac">Section 2.4: Managing Role-Based Access Control (RBAC)</h3>
<h4 id="heading-demo-granting-read-only-access"><strong>Demo: Granting Read-Only Access</strong></h4>
<ol>
<li><p><strong>Create a Namespace and ServiceAccount:</strong></p>
<pre><code class="lang-bash"> kubectl create namespace rbac-test
 kubectl create serviceaccount dev-user -n rbac-test
</code></pre>
</li>
<li><p><strong>Create the</strong> <code>Role</code> manifest (<code>role.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># role.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">rbac.authorization.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Role</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">namespace:</span> <span class="hljs-string">rbac-test</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">pod-reader</span>
 <span class="hljs-attr">rules:</span>
 <span class="hljs-bullet">-</span> <span class="hljs-attr">apiGroups:</span> [<span class="hljs-string">""</span>] <span class="hljs-comment"># "" indicates the core API group</span>
   <span class="hljs-attr">resources:</span> [<span class="hljs-string">"pods"</span>]
   <span class="hljs-attr">verbs:</span> [<span class="hljs-string">"get"</span>, <span class="hljs-string">"list"</span>, <span class="hljs-string">"watch"</span>]
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f role.yaml</code></p>
</li>
<li><p><strong>Create the</strong> <code>RoleBinding</code> manifest (<code>rolebinding.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># rolebinding.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">rbac.authorization.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">RoleBinding</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">read-pods</span>
   <span class="hljs-attr">namespace:</span> <span class="hljs-string">rbac-test</span>
 <span class="hljs-attr">subjects:</span>
 <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceAccount</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">dev-user</span>
   <span class="hljs-attr">namespace:</span> <span class="hljs-string">rbac-test</span>
 <span class="hljs-attr">roleRef:</span>
   <span class="hljs-attr">kind:</span> <span class="hljs-string">Role</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">pod-reader</span>
   <span class="hljs-attr">apiGroup:</span> <span class="hljs-string">rbac.authorization.k8s.io</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f rolebinding.yaml</code></p>
</li>
<li><p><strong>Verify Permissions:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Check if the ServiceAccount can list pods (Should be YES)</span>
 kubectl auth can-i list pods --as=system:serviceaccount:rbac-test:dev-user -n rbac-test

 <span class="hljs-comment"># Check if the ServiceAccount can delete pods (Should be NO)</span>
 kubectl auth can-i delete pods --as=system:serviceaccount:rbac-test:dev-user -n rbac-test
</code></pre>
</li>
</ol>
<h3 id="heading-section-25-application-management-with-helm-and-kustomize">Section 2.5: Application Management with Helm and Kustomize</h3>
<h4 id="heading-demo-installing-an-application-with-helm"><strong>Demo: Installing an Application with Helm</strong></h4>
<ol>
<li><p><strong>Add a Chart Repository:</strong></p>
<pre><code class="lang-bash"> helm repo add bitnami https://charts.bitnami.com/bitnami
 helm repo update
</code></pre>
</li>
<li><p><strong>Install a Chart with a value override:</strong></p>
<pre><code class="lang-bash"> helm install my-nginx bitnami/nginx --<span class="hljs-built_in">set</span> service.type=NodePort
</code></pre>
</li>
<li><p><strong>Manage the application:</strong></p>
<pre><code class="lang-bash"> helm upgrade my-nginx bitnami/nginx --<span class="hljs-built_in">set</span> service.type=ClusterIP
 helm rollback my-nginx 1
 helm uninstall my-nginx
</code></pre>
</li>
</ol>
<h4 id="heading-demo-customizing-a-deployment-with-kustomize"><strong>Demo: Customizing a Deployment with Kustomize</strong></h4>
<ol>
<li><p><strong>Create base manifest (</strong><code>my-app/base/deployment.yaml</code>):</p>
<pre><code class="lang-bash"> mkdir -p my-app/base
 cat &lt;&lt;EOF &gt; my-app/base/deployment.yaml
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: my-app
 spec:
   replicas: 1
   selector:
     matchLabels:
       app: my-app
   template:
     metadata:
       labels:
         app: my-app
     spec:
       containers:
       - name: nginx
         image: nginx:1.25.0
 EOF
</code></pre>
</li>
<li><p><strong>Create base Kustomization file (</strong><code>my-app/base/kustomization.yaml</code>):</p>
<pre><code class="lang-bash"> cat &lt;&lt;EOF &gt; my-app/base/kustomization.yaml
 resources:
 - deployment.yaml
 EOF
</code></pre>
</li>
<li><p><strong>Create production overlay and patch:</strong></p>
<pre><code class="lang-bash"> mkdir -p my-app/overlays/production
 cat &lt;&lt;EOF &gt; my-app/overlays/production/patch.yaml
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: my-app
 spec:
   replicas: 3
 EOF
 cat &lt;&lt;EOF &gt; my-app/overlays/production/kustomization.yaml
 resources:
 - ../../base
 patches:
 - path: patch.yaml
 EOF
</code></pre>
</li>
<li><p><strong>Apply the overlay (note the</strong> <code>-k</code> flag for kustomize):</p>
<pre><code class="lang-bash"> kubectl apply -k my-app/overlays/production
</code></pre>
</li>
<li><p><strong>Verify the change:</strong></p>
<pre><code class="lang-bash"> kubectl get deployment my-app
</code></pre>
</li>
</ol>
<hr>
<h2 id="heading-part-3-workloads-amp-scheduling-15">Part 3: Workloads &amp; Scheduling (15%)</h2>
<h3 id="heading-section-31-mastering-deployments">Section 3.1: Mastering Deployments</h3>
<h4 id="heading-demo-performing-a-rolling-update"><strong>Demo: Performing a Rolling Update</strong></h4>
<ol>
<li><p><strong>Create a base Deployment manifest (</strong><code>deployment.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># deployment.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-deployment</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
   <span class="hljs-attr">selector:</span>
     <span class="hljs-attr">matchLabels:</span>
       <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
   <span class="hljs-attr">template:</span>
     <span class="hljs-attr">metadata:</span>
       <span class="hljs-attr">labels:</span>
         <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
     <span class="hljs-attr">spec:</span>
       <span class="hljs-attr">containers:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
         <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:1.24.0</span>
         <span class="hljs-attr">ports:</span>
         <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f deployment.yaml</code></p>
</li>
<li><p><strong>Update the Container Image to trigger the rolling update:</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">set</span> image deployment/nginx-deployment nginx=nginx:1.25.0
</code></pre>
</li>
<li><p><strong>Observe the rollout:</strong></p>
<pre><code class="lang-bash"> kubectl rollout status deployment/nginx-deployment
 kubectl get pods -l app=nginx -w
</code></pre>
</li>
</ol>
<h4 id="heading-executing-and-verifying-rollbacks"><strong>Executing and Verifying Rollbacks</strong></h4>
<ol>
<li><p><strong>View Revision History:</strong></p>
<pre><code class="lang-bash"> kubectl rollout <span class="hljs-built_in">history</span> deployment/nginx-deployment
</code></pre>
</li>
<li><p><strong>Roll back to the previous version:</strong></p>
<pre><code class="lang-bash"> kubectl rollout undo deployment/nginx-deployment
</code></pre>
</li>
<li><p><strong>Roll back to a specific revision (e.g., revision 1):</strong></p>
<pre><code class="lang-bash"> kubectl rollout undo deployment/nginx-deployment --to-revision=1
</code></pre>
</li>
</ol>
<h3 id="heading-section-32-configuring-applications-with-configmaps-and-secrets">Section 3.2: Configuring Applications with ConfigMaps and Secrets</h3>
<h4 id="heading-creation-methods"><strong>Creation Methods</strong></h4>
<ol>
<li><p><strong>ConfigMap: Imperative Creation:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># From literal values</span>
 kubectl create configmap app-config --from-literal=app.color=blue --from-literal=app.mode=production

 <span class="hljs-comment"># From a file</span>
 <span class="hljs-built_in">echo</span> <span class="hljs-string">"retries = 3"</span> &gt; config.properties
 kubectl create configmap app-config-file --from-file=config.properties
</code></pre>
</li>
<li><p><strong>Secret: Imperative Creation:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># kubectl base64-encodes the values automatically (this is encoding, not encryption)</span>
 kubectl create secret generic db-credentials --from-literal=username=admin --from-literal=password=<span class="hljs-string">'s3cr3t'</span>
</code></pre>
</li>
</ol>
<h4 id="heading-demo-consuming-configmaps-and-secrets-in-pods"><strong>Demo: Consuming ConfigMaps and Secrets in Pods</strong></h4>
<ol>
<li><p><strong>Manifest: Environment Variables (</strong><code>pod-config.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pod-config.yaml (assumes a ConfigMap named app-config-declarative with a ui.theme key and the db-credentials Secret already exist; the ConfigMap is not created by the commands above)</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">config-demo-pod</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">containers:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">demo-container</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
     <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/sh"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"env &amp;&amp; sleep 3600"</span>]
     <span class="hljs-attr">env:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">THEME</span>
         <span class="hljs-attr">valueFrom:</span>
           <span class="hljs-attr">configMapKeyRef:</span>
             <span class="hljs-attr">name:</span> <span class="hljs-string">app-config-declarative</span>
             <span class="hljs-attr">key:</span> <span class="hljs-string">ui.theme</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_PASSWORD</span>
         <span class="hljs-attr">valueFrom:</span>
           <span class="hljs-attr">secretKeyRef:</span>
             <span class="hljs-attr">name:</span> <span class="hljs-string">db-credentials</span>
             <span class="hljs-attr">key:</span> <span class="hljs-string">password</span>
   <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pod-config.yaml</code> <strong>Verify:</strong> <code>kubectl logs config-demo-pod</code></p>
</li>
<li><p><strong>Manifest: Mounted Volumes (</strong><code>pod-volume.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pod-volume.yaml (Assumes app-config-file ConfigMap exists)</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">volume-demo-pod</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">containers:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">demo-container</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
     <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/sh"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"cat /etc/config/config.properties &amp;&amp; sleep 3600"</span>]
     <span class="hljs-attr">volumeMounts:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">config-volume</span>
       <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/etc/config</span>
   <span class="hljs-attr">volumes:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">config-volume</span>
     <span class="hljs-attr">configMap:</span>
       <span class="hljs-attr">name:</span> <span class="hljs-string">app-config-file</span>
   <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pod-volume.yaml</code> <strong>Verify:</strong> <code>kubectl logs volume-demo-pod</code></p>
</li>
</ol>
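<p>The environment-variable Pod above reads the key <code>ui.theme</code> from a ConfigMap named <code>app-config-declarative</code>, which the imperative commands in this section do not create. A minimal sketch of such a manifest follows. The file name <code>configmap.yaml</code> and the value <code>dark</code> are assumptions chosen for illustration:</p>
<pre><code class="lang-yaml"> # configmap.yaml (hypothetical file name; the value "dark" is an assumed example)
 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: app-config-declarative
 data:
   ui.theme: dark
</code></pre>
<p>Apply it with <code>kubectl apply -f configmap.yaml</code> before creating <code>config-demo-pod</code>. Otherwise the Pod's container cannot start, because the referenced ConfigMap key cannot be resolved.</p>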
<h3 id="heading-section-33-implementing-workload-autoscaling">Section 3.3: Implementing Workload Autoscaling</h3>
<h4 id="heading-demo-installing-and-verifying-the-metrics-server"><strong>Demo: Installing and Verifying the Metrics Server</strong></h4>
<ol>
<li><p><strong>Install the Metrics Server:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
</code></pre>
</li>
<li><p><strong>Verify Installation:</strong></p>
<pre><code class="lang-bash"> kubectl top nodes
 kubectl top pods -A
</code></pre>
</li>
</ol>
<h4 id="heading-demo-autoscaling-a-deployment"><strong>Demo: Autoscaling a Deployment</strong></h4>
<ol>
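<li><p><strong>Create a Deployment with Resource Requests (the</strong> <code>hpa-demo-deployment.yaml</code> <strong>manifest is not provided, so use imperative commands):</strong></p>
<pre><code class="lang-bash"> kubectl create deployment php-apache --image=registry.k8s.io/hpa-example --port=80
 <span class="hljs-comment"># kubectl create deployment has no --requests flag; set the CPU request separately</span>
 kubectl <span class="hljs-built_in">set</span> resources deployment php-apache --requests=cpu=200m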
 kubectl expose deployment php-apache --port=80
</code></pre>
</li>
<li><p><strong>Create an HPA (target 50% CPU, scale 1-10 replicas):</strong></p>
<pre><code class="lang-bash"> kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
</code></pre>
</li>
<li><p><strong>Generate Load (runs in the foreground, so use a separate terminal):</strong></p>
<pre><code class="lang-bash"> kubectl run -it --rm load-generator --image=busybox -- /bin/sh -c <span class="hljs-string">"while true; do wget -q -O- http://php-apache; done"</span>
</code></pre>
</li>
<li><p><strong>Observe Scaling:</strong></p>
<pre><code class="lang-bash"> kubectl get hpa -w
</code></pre>
<p> <em>(Stop the load generator to observe scale down)</em></p>
</li>
</ol>
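<p>The <code>kubectl autoscale</code> command above can also be expressed declaratively. The following is a sketch of a roughly equivalent <code>autoscaling/v2</code> manifest, assuming the <code>php-apache</code> Deployment from this demo. The file name <code>hpa.yaml</code> is hypothetical:</p>
<pre><code class="lang-yaml"> # hpa.yaml (hypothetical file name)
 apiVersion: autoscaling/v2
 kind: HorizontalPodAutoscaler
 metadata:
   name: php-apache
 spec:
   scaleTargetRef:
     apiVersion: apps/v1
     kind: Deployment
     name: php-apache
   minReplicas: 1
   maxReplicas: 10
   metrics:
   - type: Resource
     resource:
       name: cpu
       target:
         type: Utilization
         averageUtilization: 50   # percent of the requested CPU (200m in this demo)
</code></pre>
<p>Utilization is calculated against the container's CPU request, which is why the Deployment needs a request set before the HPA can report a meaningful target.</p>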
<h3 id="heading-section-35-advanced-scheduling">Section 3.5: Advanced Scheduling</h3>
<h4 id="heading-demo-using-node-affinity"><strong>Demo: Using Node Affinity</strong></h4>
<ol>
<li><p><strong>Label a Node:</strong></p>
<pre><code class="lang-bash"> kubectl label node &lt;your-worker-node-name&gt; disktype=ssd
</code></pre>
</li>
<li><p><strong>Create a Pod with Node Affinity (the</strong> <code>affinity-pod.yaml</code> <strong>manifest is not provided; write a Pod spec whose node affinity requires the</strong> <code>disktype=ssd</code> <strong>label):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Create the pod using the affinity rules</span>
 kubectl apply -f affinity-pod.yaml <span class="hljs-comment"># Or equivalent manifest with node affinity</span>
</code></pre>
</li>
</ol>
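<p>The <code>affinity-pod.yaml</code> file referenced above is not included in this guide. One possible version is sketched here, assuming the <code>disktype=ssd</code> label from step 1. The Pod name <code>affinity-pod</code> and the <code>nginx</code> image are illustrative choices:</p>
<pre><code class="lang-yaml"> # affinity-pod.yaml (Pod name and image are assumptions)
 apiVersion: v1
 kind: Pod
 metadata:
   name: affinity-pod
 spec:
   affinity:
     nodeAffinity:
       # Hard requirement: only schedule onto nodes labeled disktype=ssd
       requiredDuringSchedulingIgnoredDuringExecution:
         nodeSelectorTerms:
         - matchExpressions:
           - key: disktype
             operator: In
             values:
             - ssd
   containers:
   - name: nginx
     image: nginx
</code></pre>
<p>After applying it, <code>kubectl get pod affinity-pod -o wide</code> should show the Pod on the labeled node. If no node carries the label, the Pod stays <code>Pending</code>.</p>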
<h4 id="heading-demo-using-taints-and-tolerations"><strong>Demo: Using Taints and Tolerations</strong></h4>
<ol>
<li><p><strong>Taint a Node (Effect:</strong> <code>NoSchedule</code>):</p>
<pre><code class="lang-bash"> kubectl taint node &lt;another-worker-node-name&gt; app=gpu:NoSchedule
</code></pre>
</li>
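<li><p><strong>Create a Pod with a Toleration (the</strong> <code>toleration-pod.yaml</code> <strong>manifest is not provided; write a Pod named</strong> <code>gpu-pod</code> <strong>that tolerates the</strong> <code>app=gpu:NoSchedule</code> <strong>taint):</strong></p>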
<pre><code class="lang-bash"> <span class="hljs-comment"># Create the pod using the toleration rules</span>
 kubectl apply -f toleration-pod.yaml <span class="hljs-comment"># Or equivalent manifest with toleration</span>
</code></pre>
</li>
<li><p><strong>Verify Pod scheduling on the tainted node:</strong></p>
<pre><code class="lang-bash"> kubectl get pod gpu-pod -o wide
</code></pre>
</li>
</ol>
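<p>The <code>toleration-pod.yaml</code> file is likewise not included. A minimal sketch that matches the <code>app=gpu:NoSchedule</code> taint from step 1 and the <code>gpu-pod</code> name used in the verification step could look like this (the <code>nginx</code> image is an assumption):</p>
<pre><code class="lang-yaml"> # toleration-pod.yaml (image is an assumption)
 apiVersion: v1
 kind: Pod
 metadata:
   name: gpu-pod
 spec:
   tolerations:
   - key: "app"
     operator: "Equal"
     value: "gpu"
     effect: "NoSchedule"
   containers:
   - name: nginx
     image: nginx
</code></pre>
<p>A toleration only permits scheduling onto the tainted node; it does not force it. To pin the Pod to that node, combine the toleration with a node selector or node affinity rule.</p>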
<hr>
<h2 id="heading-part-4-services-amp-networking-20">Part 4: Services &amp; Networking (20%)</h2>
<h3 id="heading-section-42-kubernetes-services">Section 4.2: Kubernetes Services</h3>
<h4 id="heading-demo-creating-a-clusterip-service"><strong>Demo: Creating a ClusterIP Service</strong></h4>
<ol>
<li><p><strong>Create a Deployment:</strong></p>
<pre><code class="lang-bash"> kubectl create deployment my-app --image=nginx --replicas=2
</code></pre>
</li>
<li><p><strong>Expose the Deployment with a ClusterIP Service (the</strong> <code>clusterip-service.yaml</code> <strong>manifest is not provided, so use an imperative command):</strong></p>
<pre><code class="lang-bash"> kubectl expose deployment my-app --port=80 --target-port=80 --name=my-app-service --<span class="hljs-built_in">type</span>=ClusterIP
</code></pre>
</li>
<li><p><strong>Verify Access (inside a temporary Pod):</strong></p>
<pre><code class="lang-bash"> kubectl run tmp-shell --rm -it --image=busybox -- /bin/sh
 <span class="hljs-comment"># Inside the shell:</span>
 <span class="hljs-comment"># wget -O- my-app-service</span>
</code></pre>
</li>
</ol>
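<p>For reference, a declarative counterpart to the <code>kubectl expose</code> command above might look like the sketch below. It assumes the Pods carry the <code>app=my-app</code> label, which <code>kubectl create deployment my-app</code> sets by default:</p>
<pre><code class="lang-yaml"> # clusterip-service.yaml
 apiVersion: v1
 kind: Service
 metadata:
   name: my-app-service
 spec:
   type: ClusterIP
   selector:
     app: my-app        # label applied by "kubectl create deployment my-app"
   ports:
   - protocol: TCP
     port: 80           # port exposed by the Service
     targetPort: 80     # port the nginx container listens on
</code></pre>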
<h4 id="heading-demo-creating-a-nodeport-service"><strong>Demo: Creating a NodePort Service</strong></h4>
<ol>
<li><p><strong>Create a NodePort Service (the</strong> <code>nodeport-service.yaml</code> <strong>manifest is not provided, so use an imperative command):</strong></p>
<pre><code class="lang-bash"> kubectl expose deployment my-app --port=80 --target-port=80 --name=my-app-nodeport --<span class="hljs-built_in">type</span>=NodePort
</code></pre>
</li>
<li><p><strong>Verify Access information:</strong></p>
<pre><code class="lang-bash"> kubectl get service my-app-nodeport
 kubectl get nodes -o wide
 <span class="hljs-comment"># Access from outside via &lt;NodeIP&gt;:&lt;NodePort&gt;</span>
</code></pre>
</li>
</ol>
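<p>A declarative version of the NodePort Service could be sketched as follows, again assuming the <code>app=my-app</code> Pod label. The explicit <code>nodePort</code> value is an illustrative assumption. If you omit the field, Kubernetes picks a free port from the NodePort range (30000-32767 by default):</p>
<pre><code class="lang-yaml"> # nodeport-service.yaml
 apiVersion: v1
 kind: Service
 metadata:
   name: my-app-nodeport
 spec:
   type: NodePort
   selector:
     app: my-app
   ports:
   - protocol: TCP
     port: 80
     targetPort: 80
     nodePort: 30080    # assumed example value; omit to let Kubernetes choose
</code></pre>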
<h3 id="heading-section-43-ingress-and-the-gateway-api">Section 4.3: Ingress and the Gateway API</h3>
<h4 id="heading-demo-path-based-routing-with-nginx-ingress"><strong>Demo: Path-Based Routing with NGINX Ingress</strong></h4>
<ol>
<li><p><strong>Install the NGINX Ingress Controller:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.1/deploy/static/provider/cloud/deploy.yaml
</code></pre>
</li>
<li><p><strong>Deploy Two Sample Applications and Services:</strong></p>
<pre><code class="lang-bash"> kubectl create deployment app-one --image=registry.k8s.io/echoserver:1.4
 kubectl expose deployment app-one --port=8080

 kubectl create deployment app-two --image=registry.k8s.io/echoserver:1.4
 kubectl expose deployment app-two --port=8080
</code></pre>
</li>
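<li><p><strong>Create an Ingress Resource (the</strong> <code>ingress.yaml</code> <strong>manifest is not provided; create one that routes</strong> <code>/app1</code> <strong>to</strong> <code>app-one</code> <strong>and</strong> <code>/app2</code> <strong>to</strong> <code>app-two</code> <strong>on port 8080):</strong></p>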
<pre><code class="lang-bash"> <span class="hljs-comment"># Apply ingress.yaml</span>
 kubectl apply -f ingress.yaml
</code></pre>
</li>
<li><p><strong>Test the Ingress:</strong></p>
<pre><code class="lang-bash"> INGRESS_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath=<span class="hljs-string">'{.status.loadBalancer.ingress[0].ip}'</span>)
 curl http://<span class="hljs-variable">$INGRESS_IP</span>/app1
 curl http://<span class="hljs-variable">$INGRESS_IP</span>/app2
</code></pre>
</li>
</ol>
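<p>The <code>ingress.yaml</code> file used in step 3 is not included in this guide. A minimal sketch that matches the two Services and the <code>/app1</code> and <code>/app2</code> paths tested above could look like this. The resource name <code>example-ingress</code> is hypothetical, and <code>ingressClassName: nginx</code> assumes the IngressClass created by the NGINX Ingress Controller manifest from step 1:</p>
<pre><code class="lang-yaml"> # ingress.yaml (resource name is an assumption)
 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
   name: example-ingress
 spec:
   ingressClassName: nginx
   rules:
   - http:
       paths:
       - path: /app1
         pathType: Prefix
         backend:
           service:
             name: app-one
             port:
               number: 8080
       - path: /app2
         pathType: Prefix
         backend:
           service:
             name: app-two
             port:
               number: 8080
</code></pre>
<p>On clusters without a cloud load balancer, the controller Service may never receive an external IP. In that case, test through the controller's NodePort instead.</p>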
<h3 id="heading-section-44-network-policies">Section 4.4: Network Policies</h3>
<h4 id="heading-demo-securing-an-application-with-network-policies"><strong>Demo: Securing an Application with Network Policies</strong></h4>
<ol>
<li><p><strong>Create a Default Deny-All Ingress Policy (</strong><code>deny-all.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># deny-all.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny-ingress</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">podSelector:</span> {} <span class="hljs-comment"># Matches all pods in the namespace</span>
   <span class="hljs-attr">policyTypes:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f deny-all.yaml</code></p>
</li>
<li><p><strong>Deploy a Web Server and a Service:</strong></p>
<pre><code class="lang-bash"> kubectl create deployment web-server --image=nginx
 kubectl expose deployment web-server --port=80
</code></pre>
</li>
<li><p><strong>Attempt connection (will fail):</strong></p>
<pre><code class="lang-bash"> kubectl run tmp-shell --rm -it --image=busybox -- /bin/sh -c <span class="hljs-string">"wget -O- --timeout=2 web-server"</span>
</code></pre>
</li>
<li><p><strong>Create an "Allow" Policy (</strong><code>allow-web-access.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># allow-web-access.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">allow-web-access</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">podSelector:</span>
     <span class="hljs-attr">matchLabels:</span>
       <span class="hljs-attr">app:</span> <span class="hljs-string">web-server</span>
   <span class="hljs-attr">policyTypes:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
   <span class="hljs-attr">ingress:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">podSelector:</span>
         <span class="hljs-attr">matchLabels:</span>
           <span class="hljs-attr">access:</span> <span class="hljs-string">"true"</span>
     <span class="hljs-attr">ports:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
       <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f allow-web-access.yaml</code></p>
</li>
<li><p><strong>Test the "Allow" Policy (connection will succeed):</strong></p>
<pre><code class="lang-bash"> kubectl run tmp-shell --rm -it --labels=access=<span class="hljs-literal">true</span> --image=busybox -- /bin/sh -c <span class="hljs-string">"wget -O- web-server"</span>
</code></pre>
</li>
</ol>
<h3 id="heading-section-45-coredns">Section 4.5: CoreDNS</h3>
<h4 id="heading-demo-customizing-coredns-for-an-external-domain"><strong>Demo: Customizing CoreDNS for an External Domain</strong></h4>
<ol>
<li><p><strong>Edit the CoreDNS ConfigMap:</strong></p>
<pre><code class="lang-bash"> kubectl edit configmap coredns -n kube-system
</code></pre>
</li>
<li><p><strong>Add a new server block inside the</strong> <code>Corefile</code> data structure (e.g., for <code>my-corp.com</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># ... inside the data.Corefile string...</span>
     <span class="hljs-string">my-corp.com:53</span> {
         <span class="hljs-string">errors</span>
         <span class="hljs-string">cache</span> <span class="hljs-number">30</span>
         <span class="hljs-string">forward</span> <span class="hljs-string">.</span> <span class="hljs-number">10.10</span><span class="hljs-number">.0</span><span class="hljs-number">.53</span> <span class="hljs-comment"># Forward to your internal DNS server</span>
     }
 <span class="hljs-comment"># ...</span>
</code></pre>
</li>
</ol>
<hr>
<h2 id="heading-part-5-storage-10">Part 5: Storage (10%)</h2>
<h3 id="heading-section-52-volume-configuration">Section 5.2: Volume Configuration</h3>
<h4 id="heading-static-provisioning-demo"><strong>Static Provisioning Demo</strong></h4>
<ol>
<li><p><strong>Create a PersistentVolume (</strong><code>pv.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pv.yaml (Using hostPath for local testing)</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolume</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">task-pv-volume</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">capacity:</span>
     <span class="hljs-attr">storage:</span> <span class="hljs-string">10Gi</span>
   <span class="hljs-attr">accessModes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
   <span class="hljs-attr">persistentVolumeReclaimPolicy:</span> <span class="hljs-string">Retain</span>
   <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">manual</span>
   <span class="hljs-attr">hostPath:</span>
     <span class="hljs-attr">path:</span> <span class="hljs-string">"/mnt/data"</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pv.yaml</code></p>
</li>
<li><p><strong>Create a PersistentVolumeClaim (</strong><code>pvc.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pvc.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">task-pv-claim</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">manual</span>
   <span class="hljs-attr">accessModes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
   <span class="hljs-attr">resources:</span>
     <span class="hljs-attr">requests:</span>
       <span class="hljs-attr">storage:</span> <span class="hljs-string">3Gi</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pvc.yaml</code></p>
</li>
<li><p><strong>Verify Binding:</strong></p>
<pre><code class="lang-bash"> kubectl get pv,pvc
</code></pre>
</li>
<li><p><strong>Create a Pod that Uses the PVC (</strong><code>pod-storage.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pod-storage.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">storage-pod</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">containers:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
       <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
       <span class="hljs-attr">volumeMounts:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">mountPath:</span> <span class="hljs-string">"/usr/share/nginx/html"</span>
         <span class="hljs-attr">name:</span> <span class="hljs-string">my-storage</span>
   <span class="hljs-attr">volumes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">my-storage</span>
       <span class="hljs-attr">persistentVolumeClaim:</span>
         <span class="hljs-attr">claimName:</span> <span class="hljs-string">task-pv-claim</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pod-storage.yaml</code></p>
</li>
</ol>
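<p>To confirm that the claim is bound and that the mounted data outlives the Pod, a rough check along these lines may help. This is only a sketch: the helper names <code>pvc_phase</code> and <code>write_test_page</code> are made up for illustration, and it assumes the <code>task-pv-claim</code> and <code>storage-pod</code> objects from the steps above.</p>
<pre><code class="lang-bash"># Print the phase of a PVC; "Bound" means a matching PV was found.
pvc_phase() {
  kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# Write a test page into the volume mounted by a running Pod.
write_test_page() {
  kubectl exec "$1" -- sh -c 'echo "hello from the PV" | tee /usr/share/nginx/html/index.html'
}
</code></pre>
<p>Running <code>pvc_phase task-pv-claim</code> should print <code>Bound</code>. After <code>write_test_page storage-pod</code>, deleting the Pod and re-applying <code>pod-storage.yaml</code> should leave the page in place on a single-node cluster, because the data lives under the PV's <code>hostPath</code> rather than in the container's filesystem.</p>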
<h3 id="heading-section-53-storageclasses-and-dynamic-provisioning">Section 5.3: StorageClasses and Dynamic Provisioning</h3>
<h4 id="heading-demo-using-a-default-storageclass"><strong>Demo: Using a Default StorageClass</strong></h4>
<ol>
<li><p><strong>Inspect the Available StorageClasses:</strong></p>
<pre><code class="lang-bash"> kubectl get storageclass
</code></pre>
</li>
<li><p><strong>Create a PVC without a PV (relies on a default StorageClass):</strong></p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># dynamic-pvc.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">my-dynamic-claim</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">accessModes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
   <span class="hljs-attr">resources:</span>
     <span class="hljs-attr">requests:</span>
       <span class="hljs-attr">storage:</span> <span class="hljs-string">1Gi</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f dynamic-pvc.yaml</code></p>
</li>
<li><p><strong>Observe Dynamic Provisioning:</strong></p>
<pre><code class="lang-bash"> kubectl get pv
</code></pre>
</li>
</ol>
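<p>If the claim stays <code>Pending</code>, it can help to check which StorageClass is marked as the default and which class the claim was given. A minimal sketch, assuming the <code>my-dynamic-claim</code> PVC from above (the function names are illustrative):</p>
<pre><code class="lang-bash"># List StorageClasses flagged as default; kubectl appends "(default)" to their name.
default_storageclass() {
  kubectl get storageclass | grep '(default)'
}

# Print the StorageClass that was assigned to a PVC.
pvc_storageclass() {
  kubectl get pvc "$1" -o jsonpath='{.spec.storageClassName}'
}
</code></pre>
<p>Note that a StorageClass with <code>volumeBindingMode: WaitForFirstConsumer</code> keeps the claim <code>Pending</code> until a Pod that uses it is scheduled, so an unbound claim is not necessarily an error.</p>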
<hr>
<h2 id="heading-part-6-troubleshooting-30">Part 6: Troubleshooting (30%)</h2>
<h3 id="heading-section-62-troubleshooting-applications-and-pods">Section 6.2: Troubleshooting Applications and Pods</h3>
<h4 id="heading-debugging-tools-for-crashes-and-failures"><strong>Debugging Tools for Crashes and Failures</strong></h4>
<ol>
<li><p><strong>Get detailed information on a resource (the most critical debugging command):</strong></p>
<pre><code class="lang-bash"> kubectl describe pod &lt;pod-name&gt;
</code></pre>
</li>
<li><p><strong>Check application logs (for current container):</strong></p>
<pre><code class="lang-bash"> kubectl logs &lt;pod-name&gt;
</code></pre>
</li>
<li><p><strong>Check application logs (for previous crashed container instance):</strong></p>
<pre><code class="lang-bash"> kubectl logs &lt;pod-name&gt; --previous
</code></pre>
</li>
<li><p><strong>Get a shell inside a running container for live debugging:</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">exec</span> -it &lt;pod-name&gt; -- /bin/sh
</code></pre>
</li>
</ol>
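<p>When a container keeps restarting, the reason and exit code of its last termination are often the quickest clue. The following is a small sketch with hypothetical helper names. It assumes you are interested in the first container of the Pod.</p>
<pre><code class="lang-bash"># Print why the first container of a Pod last terminated (for example "OOMKilled" or "Error"),
# followed by its exit code.
last_termination() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Show events in the current namespace, oldest first.
recent_events() {
  kubectl get events --sort-by=.metadata.creationTimestamp
}
</code></pre>
<p>If <code>last_termination</code> prints nothing useful, the container has probably not been restarted yet, and <code>kubectl describe pod</code> plus the event list are the better starting points.</p>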
<h3 id="heading-section-63-troubleshooting-cluster-and-nodes">Section 6.3: Troubleshooting Cluster and Nodes</h3>
<ol>
<li><p><strong>Check node status:</strong></p>
<pre><code class="lang-bash"> kubectl get nodes
</code></pre>
</li>
<li><p><strong>Get detailed node information:</strong></p>
<pre><code class="lang-bash"> kubectl describe node &lt;node-name&gt;
</code></pre>
</li>
<li><p><strong>View node resource capacity (for scheduling issues):</strong></p>
<pre><code class="lang-bash"> kubectl describe node &lt;node-name&gt; | grep -A 6 Allocatable
</code></pre>
</li>
<li><p><strong>Check the</strong> <code>kubelet</code> service status (on the affected node via SSH):</p>
<pre><code class="lang-bash"> sudo systemctl status kubelet
 sudo journalctl -u kubelet -f
</code></pre>
</li>
<li><p><strong>Re-enable scheduling on a cordoned node:</strong></p>
<pre><code class="lang-bash"> kubectl uncordon &lt;node-name&gt;
</code></pre>
</li>
</ol>
<h3 id="heading-section-65-troubleshooting-services-and-networking">Section 6.5: Troubleshooting Services and Networking</h3>
<ol>
<li><p><strong>Check Service and Endpoints (for connectivity issues):</strong></p>
<pre><code class="lang-bash"> kubectl describe service &lt;service-name&gt;
 kubectl get endpoints &lt;service-name&gt;
</code></pre>
</li>
<li><p><strong>Check DNS resolution from a client Pod (run from your workstation via <code>kubectl exec</code>):</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">exec</span> -it &lt;client-pod&gt; -- nslookup &lt;service-name&gt;
</code></pre>
</li>
<li><p><strong>Check Network Policies (to see if traffic is being blocked):</strong></p>
<pre><code class="lang-bash"> kubectl get networkpolicy
</code></pre>
</li>
</ol>
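<p>A Service with no endpoints usually means its selector does not match any Pod labels. One way to compare the two is sketched below. The helper names are illustrative and not part of <code>kubectl</code>.</p>
<pre><code class="lang-bash"># Print the label selector a Service uses to find its Pods.
service_selector() {
  kubectl get service "$1" -o jsonpath='{.spec.selector}'
}

# List Pods that carry a given label (for example app=frontend), with all their labels.
pods_with_label() {
  kubectl get pods -l "$1" --show-labels
}
</code></pre>
<p>If <code>pods_with_label</code> returns no Pods for the label shown by <code>service_selector</code>, fix the labels or the selector before looking at DNS or Network Policies.</p>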
<h3 id="heading-section-66-monitoring-cluster-and-application-resource-usage">Section 6.6: Monitoring Cluster and Application Resource Usage</h3>
<ol>
<li><p><strong>Get node resource usage (requires Metrics Server):</strong></p>
<pre><code class="lang-bash"> kubectl top nodes
</code></pre>
</li>
<li><p><strong>Get Pod resource usage (requires Metrics Server):</strong></p>
<pre><code class="lang-bash"> kubectl top pods -n &lt;namespace&gt;
</code></pre>
</li>
</ol>
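<p>Two variations on <code>kubectl top</code> are handy when hunting for a resource hog. This is a minimal sketch with made-up wrapper names, and it assumes Metrics Server is installed as noted above.</p>
<pre><code class="lang-bash"># Show the Pods in a namespace sorted by memory usage, highest first.
top_pods_by_memory() {
  kubectl top pods -n "$1" --sort-by=memory
}

# Break usage down per container, which helps with multi-container Pods.
top_containers() {
  kubectl top pods -n "$1" --containers
}
</code></pre>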
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Build Your Own Kubernetes Operators with Go and Kubebuilder ]]>
                </title>
                <description>
                    <![CDATA[ We just posted a Kubernetes Operator course on the freeCodeCamp.org YouTube channel. You will learn how to extend Kubernetes by building your own custom operators and controllers from scratch. You’ll go beyond simply using Kubernetes and start treati... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-your-own-kubernetes-operators-with-go-and-kubebuilder/</link>
                <guid isPermaLink="false">696929962e8dfc11a91953f7</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jan 2026 17:53:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768499578905/0f42cb04-c790-4f52-bf66-bcfe91a0ce79.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>We just posted a Kubernetes Operator course on the freeCodeCamp.org YouTube channel. You will learn how to extend Kubernetes by building your own custom operators and controllers from scratch. You’ll go beyond simply using Kubernetes and start treating it as a Software Development Kit (SDK). </p>
<p>You will learn how to build a real-world operator that manages AWS EC2 instances directly from Kubernetes, covering everything from the internal architecture of Informers and Caches to advanced concepts like Finalizers and Idempotency.</p>
<p>The course is broken up into six parts:</p>
<ul>
<li><p>Part 1: The Theory of Controllers</p>
</li>
<li><p>Part 2: Kubernetes Extensibility</p>
</li>
<li><p>Part 3: Setting Up the Environment</p>
</li>
<li><p>Part 4: Building the API &amp; Logic</p>
</li>
<li><p>Part 5: Hands-on Development</p>
</li>
<li><p>Part 6: Advanced Internals &amp; Deployment</p>
</li>
</ul>
<p>Watch the full course <a target="_blank" href="https://www.youtube.com/watch?v=odP153inZUo">on the freeCodeCamp.org YouTube channel</a> (6-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/odP153inZUo" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Debug Kubernetes Apps When Logs Fail You – An eBPF Tracing Handbook ]]>
                </title>
                <description>
                    <![CDATA[ Let’s say your Kubernetes pod crashes at 3am and the logs show nothing useful. By the time you SSH into the node, the container is gone, and you're left guessing what happened in those final moments. This is the reality of debugging modern applicatio... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-debug-kubernetes-apps-when-logs-fail-you-an-ebpf-tracing-handbook/</link>
                <guid isPermaLink="false">694190c566a5d5cb99995f9f</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ eBPF ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ inspektor gadget ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Opaluwa Emidowojo ]]>
                </dc:creator>
                <pubDate>Tue, 16 Dec 2025 17:03:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765899860869/3eadf316-8539-4624-afba-1d4190b6c62a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Let’s say your Kubernetes pod crashes at 3am and the logs show nothing useful. By the time you SSH into the node, the container is gone, and you're left guessing what happened in those final moments.</p>
<p>This is the reality of debugging modern applications. Traditional monitoring wasn't built for containers that live for seconds, services that shift across nodes, or network paths that change constantly.</p>
<p>eBPF changes this. It lets you see <em>inside</em> the kernel itself, watching every system call, every network packet, and every process execution – without modifying a single line of code.</p>
<p>In this tutorial, you will trace a real Kubernetes application using eBPF-powered gadgets from the Inspektor Gadget project. The fundamentals you learn here apply across the modern observability ecosystem.</p>
<p>By the end, you’ll be able to:</p>
<ul>
<li><p>Trace requests as they move through your Kubernetes pods</p>
</li>
<li><p>Observe behavior at the kernel and syscall level</p>
</li>
<li><p>Debug failures that logs and metrics simply can’t explain</p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p><strong>Knowledge requirements:</strong></p>
<ul>
<li><p>Basic Kubernetes concepts: pods, deployments, services, namespaces</p>
</li>
<li><p>Familiarity with kubectl: <code>get</code>, <code>describe</code>, <code>logs</code>, <code>exec</code></p>
</li>
<li><p>Container basics</p>
</li>
<li><p>Basic Linux concepts: processes, system calls</p>
</li>
</ul>
<p><strong>Technical requirements:</strong></p>
<ul>
<li><p>Kubernetes cluster (local or cloud-based)</p>
</li>
<li><p><code>kubectl</code> installed and configured</p>
</li>
<li><p>Cluster admin permissions</p>
</li>
<li><p>Linux kernel 5.10+ (most managed services have this)</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-understanding-ebpf-observability">Understanding eBPF Observability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-ebpf-tracing-works-without-getting-lost-in-the-kernel">How eBPF Tracing Works (Without Getting Lost in the Kernel)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-environment">How to Set Up Your Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-trace-your-first-request-hands-on-tutorial">How to Trace Your First Request: Hands-On Tutorial</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-interpret-traces">How to Interpret Traces</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-tracing-insights">Advanced Tracing Insights</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices-and-production-considerations">Best Practices and Production Considerations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps-and-resources">Next Steps and Resources</a></p>
</li>
</ul>
<h2 id="heading-understanding-ebpf-observability">Understanding eBPF Observability</h2>
<p>eBPF (extended Berkeley Packet Filter) is a technology that allows you to run custom programs inside the Linux kernel without changing kernel code or loading kernel modules.</p>
<p>The Linux kernel is the control center of your operating system. Historically, if you wanted to observe low-level activity (like network packets, system calls, or file operations), you had to rely on kernel changes or kernel modules. Both approaches were fragile, difficult to maintain, and carried real stability and security risks.</p>
<p>eBPF shifts how we approach observability. It provides a safe, sandboxed environment where you can run observability programs directly in the kernel with built-in safety checks that prevent crashes or security vulnerabilities.</p>
<h3 id="heading-why-does-this-matter-for-observability">Why does this matter for observability?</h3>
<p>In traditional observability, you instrument your application code. You add logging statements, metrics libraries, and tracing SDKs. This works, but has significant limitations:</p>
<ul>
<li><p><strong>Code changes are required</strong>: You must modify and redeploy applications</p>
</li>
<li><p><strong>It’s language-specific</strong>: Different languages need different libraries</p>
</li>
<li><p><strong>There will likely be blind spots</strong>: You can only see what you explicitly instrument</p>
</li>
<li><p><strong>It adds overhead</strong>: Heavy instrumentation slows down applications</p>
</li>
<li><p><strong>Container challenges</strong>: By the time you add instrumentation and redeploy, the problem may have disappeared</p>
</li>
</ul>
<p>eBPF takes a different approach. Instead of instrumenting applications, you instrument the kernel. Since every application ultimately makes system calls to the kernel for network I/O, file operations, and process management, you can observe everything from one vantage point.</p>
<h3 id="heading-the-ebpf-advantage-for-kubernetes">The eBPF advantage for Kubernetes</h3>
<p>Kubernetes adds another layer of complexity. Your application might be spread across multiple containers, pods, and nodes. Traditional APM (Application Performance Monitoring) tools struggle here because containers come and go rapidly, network topology changes constantly, service meshes add routing complexity, and you often don't control application code (think third-party services or legacy applications you can't modify).</p>
<p>eBPF doesn't care about any of this. It sees all activity at the kernel level, regardless of what language your app is written in, whether it's containerized, how many times the pod has been rescheduled, or whether you have access to modify the source code. This universal visibility is why the Cloud Native Computing Foundation (CNCF) and major cloud providers are betting heavily on eBPF for the future of observability.</p>
<h2 id="heading-how-ebpf-tracing-works-without-getting-lost-in-the-kernel">How eBPF Tracing Works (Without Getting Lost in the Kernel)</h2>
<p>When your application runs on Kubernetes, there's a clear separation between user space and kernel space. Your code runs in user space, where it's isolated, safe, and has limited access to system resources. To do anything useful – make network calls, read files, allocate memory – your application must ask the kernel for help. The kernel handles these requests via system calls, commonly called syscalls.</p>
<p>eBPF lets us hook into these syscalls with very little overhead. It’s like having a CCTV camera at every doorway between user space and kernel space, watching who passes through, when, and what they’re carrying.</p>
<h3 id="heading-a-simple-example-http-request-tracing">A Simple Example: HTTP Request Tracing</h3>
<p>Your application initiates an HTTP GET request, which needs to go through the network stack. To establish a connection, your application first makes a <code>socket()</code> system call to create a network socket. Then it calls <code>connect()</code> to establish a connection to the remote server. Once connected, it uses <code>send()</code> to transmit the HTTP request. Network packets are sent across the wire, and eventually your application calls <code>recv()</code> to receive the response.</p>
<p>With eBPF tools like Inspektor Gadget's Traceloop, you can automatically hook into these syscalls. The eBPF program captures request metadata including source and destination IPs, ports, timing information, and payload sizes. You get a complete trace of the request without touching your application code.</p>
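<p>You can watch the same sequence without any eBPF tooling by running a client under <code>strace</code> on a Linux machine. This is a rough sketch that assumes <code>strace</code> and <code>curl</code> are installed. The helper name is illustrative, and the exact calls vary by client: you may see <code>sendto()</code> and <code>recvfrom()</code> instead of <code>send()</code> and <code>recv()</code>, and clients that use plain <code>write()</code> and <code>read()</code> on the socket will not show those calls under this filter.</p>
<pre><code class="lang-bash"># Run a command under strace and print only its network-related syscalls.
# -f follows child processes and threads; -e trace=network selects socket(), connect(), and related calls.
trace_network_syscalls() {
  strace -f -e trace=network "$@"
}

# Example: trace_network_syscalls curl -s -o /dev/null http://example.com
</code></pre>
<p>Unlike eBPF, <code>strace</code> stops the traced process on every syscall and knows nothing about pods or namespaces, which is why it works for a quick local experiment but not for tracing production workloads.</p>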
<h3 id="heading-the-ebpf-execution-flow">The eBPF Execution Flow</h3>
<p>Here's what happens under the hood when you run a trace. When you deploy Inspektor Gadget and run a gadget, an eBPF program is loaded into the kernel and attached to the relevant hooks. From then on, it runs whenever a traced event occurs.</p>
<p>When your application makes a syscall, the eBPF hook triggers and quickly collects relevant data: timestamps, process IDs, container IDs, pod names, request details, and latency information. This data is sent to user space through eBPF maps, which are efficient data structures for kernel-to-userspace communication.</p>
<p>Inspektor Gadget adds Kubernetes context to raw kernel data. Instead of seeing only process IDs, you can see pod names, namespaces, labels, and other metadata. For example, you can tell that a request originated from the frontend pod in the production namespace and targeted the backend service.</p>
<p>The gadget then presents this information in a format that's immediately useful, whether you're using the CLI or integrating with other observability tools.</p>
<p>eBPF is fast because:</p>
<ul>
<li><p><strong>JIT compilation</strong>: Programs are turned into native machine code for maximum performance</p>
</li>
<li><p><strong>Event-driven</strong>: Only execute when relevant events occur, not continuously polling</p>
</li>
<li><p><strong>Kernel-resident</strong>: No expensive context switching between kernel and user space</p>
</li>
<li><p><strong>Highly optimized</strong>: Typically adds less than 5% overhead even under heavy load</p>
</li>
</ul>
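<p>If you have shell access to a node, you can see the programs and maps described in this section for yourself. This is a small sketch that assumes <code>bpftool</code> is installed on the node and that you have root privileges. The wrapper name is illustrative.</p>
<pre><code class="lang-bash"># List the eBPF programs and maps currently loaded in this node's kernel.
show_ebpf_objects() {
  sudo bpftool prog show
  sudo bpftool map show
}
</code></pre>
<p>Running it before and after starting a gadget is a simple way to see which programs and maps the gadget adds.</p>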
<h3 id="heading-the-tool-inspektor-gadget-amp-traceloop">The Tool: Inspektor Gadget &amp; Traceloop</h3>
<p>For this tutorial, we're using Traceloop, an eBPF-based tool that traces request flows through applications by observing syscalls, network calls, and I/O operations at the kernel level.</p>
<p>Why are we using Traceloop for this tutorial?</p>
<ul>
<li><p>It’s quick to install and run (one command)</p>
</li>
<li><p>The output maps directly to the application’s behavior</p>
</li>
<li><p>It automatically adds Kubernetes context (pod names, namespaces)</p>
</li>
<li><p>You don’t need to make any application code changes</p>
</li>
</ul>
<p>What you'll learn applies beyond Traceloop. Other eBPF tracing tools (such as Pixie, Cilium Hubble, and Tetragon) rely on the same basic mechanism: they attach to kernel hooks and collect event data. Once you understand the concepts here, you'll find it much easier to pick up any eBPF observability tool.</p>
<h2 id="heading-how-to-set-up-your-environment">How to Set Up Your Environment</h2>
<p>To get your environment ready for hands-on tracing, we'll verify that your cluster meets the requirements, install Inspektor Gadget, and deploy a sample application to trace.</p>
<h3 id="heading-verify-that-your-cluster-meets-the-requirements">Verify that Your Cluster Meets the Requirements</h3>
<p>Before installing anything, confirm that your Kubernetes cluster is ready for eBPF.</p>
<h4 id="heading-check-your-kubernetes-version">Check your Kubernetes version:</h4>
<pre><code class="lang-bash">kubectl version
</code></pre>
<p>You need Kubernetes 1.19 or later. Most modern clusters exceed this requirement, but it's worth verifying.</p>
<h4 id="heading-verify-kernel-version-on-your-nodes">Verify kernel version on your nodes:</h4>
<pre><code class="lang-bash">kubectl get nodes -o wide
</code></pre>
<p>The <code>KERNEL-VERSION</code> column shows the kernel each node is running. You can also check directly on one of your nodes:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># If using a local cluster like minikube or kind</span>
uname -r

<span class="hljs-comment"># For cloud clusters, you might need to check node details</span>
kubectl debug node/&lt;node-name&gt; -it --image=ubuntu -- bash -c <span class="hljs-string">"uname -r"</span>
</code></pre>
<p>You need Linux kernel 5.10 or later for the best eBPF support. Kernel 4.18+ works but with some limitations. If you're using a managed Kubernetes service (GKE, EKS, AKS), you almost certainly have a compatible kernel.</p>
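<p>On clusters with many nodes, you may prefer to script this check. The sketch below is one possible approach, not an official tool: <code>kernel_at_least</code> and <code>check_node_kernels</code> are hypothetical helper names, and the comparison only looks at the major and minor numbers of the version string.</p>
<pre><code class="lang-bash"># Return success if a kernel version string is at least MAJOR.MINOR.
# Usage: kernel_at_least "5.15.0-91-generic" 5 10
kernel_at_least() {
  local version="$1" want_major="$2" want_minor="$3"
  local major="${version%%.*}"     # text before the first dot, e.g. 5
  local rest="${version#*.}"       # text after the first dot, e.g. 15.0-91-generic
  local minor="${rest%%[!0-9]*}"   # leading digits of the rest, e.g. 15
  if [ "$major" -gt "$want_major" ]; then
    return 0
  fi
  if [ "$major" -eq "$want_major" ]; then
    [ "$minor" -ge "$want_minor" ]
    return
  fi
  return 1
}

# Print each node with its kernel version and whether it meets the 5.10 recommendation.
check_node_kernels() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kernelVersion}{"\n"}{end}' |
  while read -r node kernel; do
    if kernel_at_least "$kernel" 5 10; then
      echo "$node $kernel ok"
    else
      echo "$node $kernel older than 5.10"
    fi
  done
}
</code></pre>
<p>Calling <code>check_node_kernels</code> prints one line per node. Nodes reported as older than 5.10 may still work if they run 4.18 or later, with the limitations mentioned above.</p>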
<h4 id="heading-confirm-that-you-have-cluster-admin-permissions">Confirm that you have cluster admin permissions:</h4>
<pre><code class="lang-bash">kubectl auth can-i '*' '*' --all-namespaces
</code></pre>
<p>This should return "yes". Inspektor Gadget needs elevated permissions to load eBPF programs into the kernel.</p>
<h3 id="heading-install-inspektor-gadget">Install Inspektor Gadget</h3>
<p>You can install Inspektor Gadget in several ways. We'll use the kubectl plugin method as it's the most straightforward for learning.</p>
<h4 id="heading-install-the-kubectl-gadget-plugin">Install the kubectl gadget plugin:</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># Download and install kubectl-gadget</span>
kubectl krew install gadget

<span class="hljs-comment"># Verify installation</span>
kubectl gadget version
</code></pre>
<p>If you don't have krew (the kubectl plugin manager), you can install it first:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install krew</span>
(
  <span class="hljs-built_in">set</span> -x; <span class="hljs-built_in">cd</span> <span class="hljs-string">"<span class="hljs-subst">$(mktemp -d)</span>"</span> &amp;&amp;
  OS=<span class="hljs-string">"<span class="hljs-subst">$(uname | tr '[:upper:]' '[:lower:]')</span>"</span> &amp;&amp;
  ARCH=<span class="hljs-string">"<span class="hljs-subst">$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')</span>"</span> &amp;&amp;
  KREW=<span class="hljs-string">"krew-<span class="hljs-variable">${OS}</span>_<span class="hljs-variable">${ARCH}</span>"</span> &amp;&amp;
  curl -fsSLO <span class="hljs-string">"https://github.com/kubernetes-sigs/krew/releases/latest/download/<span class="hljs-variable">${KREW}</span>.tar.gz"</span> &amp;&amp;
  tar zxvf <span class="hljs-string">"<span class="hljs-variable">${KREW}</span>.tar.gz"</span> &amp;&amp;
  ./<span class="hljs-string">"<span class="hljs-variable">${KREW}</span>"</span> install krew
)

<span class="hljs-comment"># Add krew to your PATH</span>
<span class="hljs-built_in">export</span> PATH=<span class="hljs-string">"<span class="hljs-variable">${KREW_ROOT:-<span class="hljs-variable">$HOME</span>/.krew}</span>/bin:<span class="hljs-variable">$PATH</span>"</span>
</code></pre>
<h4 id="heading-deploy-inspektor-gadget-to-your-cluster">Deploy Inspektor Gadget to your cluster:</h4>
<pre><code class="lang-bash">kubectl gadget deploy
</code></pre>
<p>This creates a <code>gadget</code> namespace and deploys the Inspektor Gadget daemon as a DaemonSet, ensuring each node in your cluster can run eBPF programs.</p>
<h4 id="heading-verify-the-deployment">Verify the deployment:</h4>
<pre><code class="lang-bash">kubectl get pods -n gadget
</code></pre>
<p>You should see one <code>gadget-*</code> pod per node, all in the <code>Running</code> state. If a pod is stuck in <code>Pending</code> or <code>CrashLoopBackOff</code>, check that your kernel meets the version requirements.</p>
<h3 id="heading-deploying-a-sample-application">Deploy a Sample Application</h3>
<p>To learn tracing effectively, we need an application that does something interesting. We'll deploy a simple microservices application with multiple components so you can see traces flowing across service boundaries.</p>
<p>Start by creating a namespace for our demo app:</p>
<pre><code class="lang-bash">kubectl create namespace demo-app
</code></pre>
<p>Then define a simple web application with a backend. Save the following manifest as <code>demo-app.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">gcr.io/google-samples/microservices-demo/frontend:v0.8.0</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"8080"</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PRODUCT_CATALOG_SERVICE_ADDR</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"productcatalog:3550"</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">productcatalog</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">server</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">gcr.io/google-samples/microservices-demo/productcatalogservice:v0.8.0</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">3550</span>
        <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PORT</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">"3550"</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">demo-app</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">productcatalog</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">3550</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">3550</span>
</code></pre>
<p>Apply the configuration:</p>
<pre><code class="lang-bash">kubectl apply -f demo-app.yaml
</code></pre>
<p>And wait for pods to be ready:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">wait</span> --<span class="hljs-keyword">for</span>=condition=ready pod -l app=frontend -n demo-app --timeout=300s
kubectl <span class="hljs-built_in">wait</span> --<span class="hljs-keyword">for</span>=condition=ready pod -l app=productcatalog -n demo-app --timeout=300s
</code></pre>
<p>Then just verify that everything is running:</p>
<pre><code class="lang-bash">kubectl get pods -n demo-app
</code></pre>
<p>You should see both <code>frontend</code> and <code>productcatalog</code> pods in the <code>Running</code> state.</p>
<p>Now you’ll need to get the frontend URL:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># For local clusters (minikube, kind, Docker Desktop)</span>
kubectl port-forward -n demo-app service/frontend 8080:80

<span class="hljs-comment"># Then access http://localhost:8080 in your browser</span>

<span class="hljs-comment"># For cloud clusters</span>
kubectl get service frontend -n demo-app
<span class="hljs-comment"># Look for the EXTERNAL-IP</span>
</code></pre>
<p>Visit the application in your browser to confirm it's working. You should see a simple e-commerce storefront. This application makes HTTP requests from the frontend to the product catalog service, which is perfect for tracing.</p>
<h2 id="heading-how-to-trace-your-first-request-hands-on-tutorial">How to Trace Your First Request: Hands-On Tutorial</h2>
<p>Now that everything is set up, let's capture our first trace and see eBPF observability in action.</p>
<h3 id="heading-generate-the-traffic-to-trace">Generate the Traffic to Trace</h3>
<p>First, we need some application activity to observe. We will generate a few requests for our demo application.</p>
<p>In one terminal, start the Traceloop gadget:</p>
<pre><code class="lang-bash">kubectl gadget traceloop -n demo-app
</code></pre>
<p>This command starts recording the system calls made by containers in the <code>demo-app</code> namespace. Inspektor Gadget observes them from inside the kernel, so you see the low-level activity that occurs while each HTTP request is processed, without touching the application.</p>
<p>In another terminal, generate some traffic:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># If using port-forward</span>
curl http://localhost:8080

<span class="hljs-comment"># If you have an external IP</span>
curl http://&lt;EXTERNAL-IP&gt;

<span class="hljs-comment"># Generate multiple requests</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..10}; <span class="hljs-keyword">do</span> curl http://localhost:8080; sleep 1; <span class="hljs-keyword">done</span>
</code></pre>
<h3 id="heading-viewing-your-first-trace">Viewing Your First Trace</h3>
<p>Switch back to the terminal running the traceloop gadget. You should see output appearing as requests flow through your application. The output will look something like this:</p>
<pre><code>NODE         NAMESPACE   POD              CONTAINER    PID    TYPE       COUNT
minikube     demo-app    frontend-abc123  frontend     1234   loop       1      
minikube     demo-app    frontend-abc123  frontend     1234   loop       2
</code></pre>
<p>Each line shows a traced execution flow, with the count increasing as the same pattern is observed again.</p>
<p>We can make the output more interesting by filtering:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Stop the previous trace with Ctrl+C, then run:</span>
kubectl gadget traceloop -n demo-app --podname frontend
</code></pre>
<p>This narrows our observation to just the frontend pod, reducing noise and making patterns clearer.</p>
<h4 id="heading-understanding-what-youre-seeing">Understanding what you're seeing:</h4>
<p>Each column shows different information about your application:</p>
<ul>
<li><p><strong>NODE</strong>: Which Kubernetes node the traced event occurred on. In multi-node clusters, this helps you understand workload distribution and identify node-specific issues.</p>
</li>
<li><p><strong>NAMESPACE</strong>: The Kubernetes namespace. We filtered to <code>demo-app</code>, so you'll only see that namespace. In production, filtering by namespace is crucial for focusing on specific applications.</p>
</li>
<li><p><strong>POD</strong>: The specific pod where the event occurred. Each pod gets a unique name (like <code>frontend-abc123</code>), allowing you to distinguish between replicas of the same application.</p>
</li>
<li><p><strong>CONTAINER</strong>: Which container within the pod. Pods can have multiple containers (main application, sidecars, init containers), so this helps you pinpoint exactly where activity is happening.</p>
</li>
<li><p><strong>PID</strong>: The process ID inside the container. This is the actual Linux process that made the syscalls eBPF observed. Multiple PIDs might appear if your application uses multiple processes or threads.</p>
</li>
<li><p><strong>TYPE</strong>: The type of event traced. For Traceloop, this identifies kernel-level patterns detected during request processing.</p>
</li>
<li><p><strong>COUNT</strong>: How many times this pattern has been observed. A rapidly incrementing count indicates high request volume.</p>
</li>
</ul>
<h4 id="heading-what-this-tells-you-about-your-application">What this tells you about your application:</h4>
<p>Even from this simple output, you can derive insights. If you see events appearing for the <code>frontend</code> pod but not the <code>productcatalog</code> pod, it might indicate that requests aren't making it to the backend. This is a potential configuration issue. If the <code>COUNT</code> increases rapidly for one pod but not others, you know which replica is receiving traffic, useful for debugging load balancing issues.</p>
<p>The real power becomes clear when you correlate these kernel-level observations with what you know about your application. When you made 10 curl requests, you should see corresponding activity in the trace output. This direct relationship between application behavior and kernel observations is the foundation of eBPF observability.</p>
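<p>If you save the gadget output to a file, a small script can do the per-pod comparison described above for you. The following is a minimal sketch, assuming the illustrative column layout shown earlier (POD in the third column, one header line). The function name and the sample rows are hypothetical, and real gadget output may use different columns.</p>

```shell
# Tally traced events per pod from saved gadget output.
# Assumes the illustrative layout shown earlier: POD is the third
# whitespace-separated column and the first line is a header.
count_events_by_pod() {
  awk 'NR > 1 && NF >= 3 { seen[$3]++ }
       END { for (pod in seen) print pod, seen[pod] }' | sort
}

# Hypothetical sample: two frontend events and no productcatalog events.
sample_trace() {
  printf '%s\n' \
    'NODE      NAMESPACE  POD              CONTAINER  PID   TYPE  COUNT' \
    'minikube  demo-app   frontend-abc123  frontend   1234  loop  1' \
    'minikube  demo-app   frontend-abc123  frontend   1234  loop  2'
}

sample_trace | count_events_by_pod
```

<p>With this sample, only <code>frontend-abc123</code> appears in the tally. A backend pod that never shows up is the hint that requests may not be reaching it.</p>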
<h2 id="heading-how-to-interpret-traces">How to Interpret Traces</h2>
<p>Understanding raw trace output is valuable, but interpreting what it means for your application's health and performance is where the real skill lies.</p>
<h3 id="heading-trace-anatomy-spans-timing-and-request-flow">Trace Anatomy: Spans, Timing, and Request Flow</h3>
<p>A trace represents a single request's journey through your system. When you curl the frontend, that generates one trace. A span represents a single operation within that trace, such as "frontend handles request," "frontend calls product catalog," "product catalog queries data," and "frontend returns response." Each span has timing information: when it started, when it ended, and therefore how long it took.</p>
<p>In traditional distributed tracing with OpenTelemetry or Jaeger, you'd explicitly create these spans in your application code. With eBPF, the tool infers spans from syscall patterns. When eBPF sees your frontend process call <code>connect()</code> to the product catalog's IP, followed by <code>send()</code> and <code>recv()</code>, it understands that's a span representing an HTTP request to the backend service.</p>
<p>The request flow is the sequence of spans showing how your request moved through services. In our demo app, the flow looks like this:</p>
<ol>
<li><p>The user's request arrives at the frontend.</p>
</li>
<li><p>The frontend connects to the product catalog.</p>
</li>
<li><p>The product catalog processes the request.</p>
</li>
<li><p>The product catalog returns the data.</p>
</li>
<li><p>The frontend renders the page.</p>
</li>
<li><p>The response is sent to the user.</p>
</li>
</ol>
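<p>To make the idea of span timing concrete, here is a small sketch that turns start and end timestamps into durations. The input format (<code>name start_ms end_ms</code>), the function name, and the numbers are all invented for illustration. They are not output from any gadget.</p>

```shell
# Compute per-span durations from start/end timestamps in milliseconds.
# Input: one span per line as "name start_ms end_ms" (hypothetical format).
span_durations() {
  awk '{ printf "%s %dms\n", $1, $3 - $2 }'
}

# Hypothetical spans for one request through the demo app.
printf '%s\n' \
  'frontend-handles-request 0 48' \
  'frontend-calls-productcatalog 5 40' \
  'productcatalog-processes-request 8 37' |
  span_durations
```

<p>In this made-up request, the frontend span lasts 48 ms, of which 35 ms is spent waiting on the product catalog call. Nested spans like these are what let you see where the time in a request actually goes.</p>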
<h3 id="heading-how-to-follow-requests-across-services">How to Follow Requests Across Services</h3>
<p>Let's trace a request across service boundaries to see this flow in action.</p>
<p>First, we’ll start a more detailed trace:</p>
<pre><code class="lang-bash">kubectl gadget trace_tcp -n demo-app
</code></pre>
<p>The trace_tcp gadget shows network connections, giving us visibility into service-to-service communication.</p>
<p>Next, generate a request:</p>
<pre><code class="lang-bash">curl http://localhost:8080
</code></pre>
<p>In the trace output, look for connection patterns:</p>
<p>You should see the frontend pod establishing a TCP connection to the product catalog service. The trace will show the source (frontend) and destination (product catalog) IPs and ports, along with timing information.</p>
<p>This is how eBPF lets you follow requests: by observing the network syscalls that implement service communication. You don't need a service mesh or instrumentation libraries, because the kernel sees all network activity and eBPF captures it.</p>
<h4 id="heading-understanding-the-flow">Understanding the flow:</h4>
<ol>
<li><p>Your curl command opens a TCP connection to <code>localhost:8080</code>, which <code>kubectl port-forward</code> relays to the frontend pod</p>
</li>
<li><p>The frontend processes the request and opens a TCP connection to the product catalog's IP on port 3550</p>
</li>
<li><p>Data flows back and forth (you'll see send/receive events)</p>
</li>
<li><p>Connections close when requests complete</p>
</li>
</ol>
<p>Each step is visible to eBPF because each step requires syscalls that the kernel handles.</p>
<h3 id="heading-how-to-identify-bottlenecks-and-errors">How to Identify Bottlenecks and Errors</h3>
<p>We can also use tracing to identify performance issues.</p>
<p>First, let’s simulate a failing backend by taking the product catalog offline:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Make the backend unavailable by scaling the deployment to zero replicas</span>
kubectl scale deployment productcatalog -n demo-app --replicas=0

<span class="hljs-comment"># After you have generated the test requests below, scale back up</span>
kubectl scale deployment productcatalog -n demo-app --replicas=1
</code></pre>
<p>While the product catalog is down, generate some requests:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..5}; <span class="hljs-keyword">do</span> curl http://localhost:8080; <span class="hljs-keyword">done</span>
</code></pre>
<p>In the trace, look for connection attempts from the frontend to the product catalog. With the backend unavailable, these attempts no longer complete normally: depending on the exact timing, you may see connection timeouts or connection refused errors.</p>
<p>What bottlenecks look like in traces:</p>
<ul>
<li><p><strong>Long spans</strong>: A span that takes significantly longer than others indicates a bottleneck. In traceloop output, you might see gaps between events or notice certain operations taking longer.</p>
</li>
<li><p><strong>Retries</strong>: Repeated connection attempts to the same destination suggest a failing or slow service.</p>
</li>
<li><p><strong>Error patterns</strong>: Connection failures, timeouts, or unusual syscall sequences indicate problems.</p>
</li>
</ul>
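<p>The retry pattern in particular is easy to check mechanically once you have connection records in a file. The sketch below assumes a simplified, hypothetical input format of <code>source destination</code> per attempt. The function name, the threshold, and the IP addresses are illustrative only.</p>

```shell
# Flag destinations that receive repeated connection attempts.
# Input: one attempt per line as "source destination" (hypothetical format).
# Usage: find_retries THRESHOLD
find_retries() {
  awk -v min="$1" '{ attempts[$2]++ }
       END { for (dst in attempts)
               if (attempts[dst] >= min) print dst, attempts[dst] }' | sort
}

# Hypothetical sample: three attempts to one backend, one DNS lookup.
printf '%s\n' \
  'frontend 10.96.0.20:3550' \
  'frontend 10.96.0.20:3550' \
  'frontend 10.96.0.20:3550' \
  'frontend 10.96.0.10:53' |
  find_retries 3
```

<p>Here only the destination on port 3550 crosses the threshold, which is the kind of signal that points at a failing or slow service.</p>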
<p>The best skill to have is pattern recognition. A typical, healthy request flow has a rhythm, and events occur in predictable sequences with consistent timing. When something breaks, the rhythm changes. Requests take longer, errors appear, or expected events don't occur at all.</p>
<h2 id="heading-real-world-debugging-scenarios">Real-World Debugging Scenarios</h2>
<p>Now let's go through three realistic scenarios where eBPF helps:</p>
<h3 id="heading-scenario-1-finding-a-slow-endpoint">Scenario 1: Finding a Slow Endpoint</h3>
<p><strong>The problem:</strong> Users report that the product catalog page sometimes loads very slowly, but metrics show normal average latency.</p>
<p>Let’s use Traceloop to investigate:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Trace only the frontend pod to keep the output focused</span>
kubectl gadget traceloop -n demo-app --podname frontend
</code></pre>
<p>We’ll generate some mixed traffic:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Some requests to the homepage (fast)</span>
curl http://localhost:8080

<span class="hljs-comment"># Some requests to the product catalog (potentially slow)</span>
curl http://localhost:8080/products
</code></pre>
<p>In the trace output, compare the <code>COUNT</code> increments for different request patterns. If certain patterns show significantly more loop iterations or longer gaps between events, that indicates those requests are doing more work, possibly hitting a slow endpoint.</p>
<h4 id="heading-the-diagnosis">The diagnosis:</h4>
<p>You might notice that requests to <code>/products</code> cause the frontend to make multiple calls to the product catalog service (visible with <code>kubectl gadget trace_tcp</code>), while homepage requests don't. This explains why the product page is slow: it's making synchronous calls to a backend service, and if that service is slow or the network is congested, users feel the delay.</p>
<h4 id="heading-the-fix">The fix:</h4>
<p>You might implement caching, make the backend calls asynchronous, or optimize the product catalog service itself. The key is that eBPF helped you identify which specific code path was slow without adding instrumentation to your application.</p>
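<p>This scenario started with average latency looking normal while some users saw slow pages. A quick calculation shows how that can happen. The sketch below uses invented latencies (98 fast requests and 2 slow ones) and a hypothetical helper name. It only illustrates the arithmetic and is not tied to any gadget's output.</p>

```shell
# Summarize latencies: mean, maximum, and how many exceed a threshold.
# Usage: latency_summary SLOW_THRESHOLD_MS   (reads one latency in ms per line)
latency_summary() {
  awk -v limit="$1" '
    { sum += $1; if ($1 > max) max = $1; if ($1 > limit) slow++ }
    END { if (NR) printf "mean=%.1fms max=%dms slow=%d/%d\n", sum / NR, max, slow, NR }'
}

# Hypothetical sample: 98 requests at 40 ms plus two slow outliers.
{
  i=0
  while [ "$i" -lt 98 ]; do echo 40; i=$((i + 1)); done
  echo 900
  echo 1100
} | latency_summary 500
```

<p>For this sample the mean is 59.2 ms, which looks healthy on a dashboard, yet two requests took about a second. Averages hide exactly the outliers your users complain about, which is why per-request traces are useful here.</p>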
<h3 id="heading-scenario-2-tracking-down-failed-requests">Scenario 2: Tracking Down Failed Requests</h3>
<p><strong>The problem:</strong> Your monitoring shows a 5% error rate, but application logs don't show any errors. Where are the failures happening?</p>
<p>Now let’s use eBPF to investigate:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Trace network connections to see connection failures</span>
kubectl gadget trace_tcp -n demo-app
</code></pre>
<p>We’ll simulate intermittent failures:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a failing scenario by temporarily breaking service connectivity</span>
kubectl delete service productcatalog -n demo-app

<span class="hljs-comment"># Generate requests</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..10}; <span class="hljs-keyword">do</span> curl http://localhost:8080; sleep 1; <span class="hljs-keyword">done</span>

<span class="hljs-comment"># Restore the service</span>
kubectl apply -f demo-app.yaml
</code></pre>
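<p>In the TCP trace, look for connection attempts from the frontend to the product catalog that fail or time out. The trace shows the source, the destination, and what happened (connection refused, timeout, and so on). Because deleting the Service also removes its cluster DNS name, you might instead see no new connection attempts at all, in which case the failure is visible as failed lookups in a DNS trace.</p>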
<h4 id="heading-the-diagnosis-1">The diagnosis:</h4>
<p>The failures are happening at the network level: the frontend can't reach the product catalog. This might be due to network policy issues, service mesh misconfiguration, or DNS problems. Traditional application logs might not capture this because the connection fails before the application receives any response to log.</p>
<h4 id="heading-why-ebpf-finds-this-when-logs-dont">Why eBPF finds this when logs don't:</h4>
<p>Your application logs what it experiences. If a connection fails at the TCP level, your application might just see "connection refused" and retry without detailed logging.</p>
<p>eBPF sees the actual syscalls and network events, giving you visibility into what's happening beneath your application layer.</p>
<h3 id="heading-scenario-3-understanding-service-dependencies">Scenario 3: Understanding Service Dependencies</h3>
<p><strong>The problem:</strong> You're not sure which services depend on each other, and you want to understand the actual runtime dependencies before making changes.</p>
<p>We’ll use eBPF to map dependencies:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Trace all TCP connections to see who talks to whom</span>
kubectl gadget trace_tcp -n demo-app
</code></pre>
<p>And then generate normal traffic:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Make various requests to exercise different code paths</span>
curl http://localhost:8080
curl http://localhost:8080/products
curl http://localhost:8080/cart
</code></pre>
<p>The trace output shows source and destination for every connection. Build a mental (or actual) map of which pods connect to which services.</p>
<h4 id="heading-the-discovery">The discovery:</h4>
<p>You'll see that the frontend pod connects to the product catalog service, but you might also discover unexpected dependencies. Perhaps the frontend also makes calls to a Redis cache, an authentication service, or external APIs. These runtime dependencies might not be documented or might differ from what architectural diagrams show.</p>
<h4 id="heading-why-this-matters">Why this matters:</h4>
<p>Before deploying a change to the product catalog service, you now know exactly which services will be affected. Before implementing a network policy, you know which connections to allow. Before decomposing a monolith, you understand the actual communication patterns.</p>
<p>This is observability-driven architecture understanding: letting the system show you how it actually works, not how you think it works.</p>
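<p>If you would rather build an actual map than a mental one, you can reduce the connection records to a list of unique edges. This is a minimal sketch that assumes you have already extracted <code>source destination</code> pairs from the trace output into a simple text format. That format, the function name, and the sample names are hypothetical.</p>

```shell
# Reduce connection records to unique "source -> destination" edges.
# Input: one connection per line as "source destination" (hypothetical format).
dependency_edges() {
  awk '{ print $1 " -> " $2 }' | sort -u
}

# Hypothetical sample: repeated calls to the backend plus a DNS lookup.
printf '%s\n' \
  'frontend productcatalog:3550' \
  'frontend productcatalog:3550' \
  'frontend kube-dns:53' |
  dependency_edges
```

<p>Each distinct edge is printed once, no matter how many connections were observed, so the result can serve as a starting point for a network policy allowlist or a dependency diagram.</p>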
<h2 id="heading-advanced-tracing-insights">Advanced Tracing Insights</h2>
<p>Once you're comfortable with basic request tracing, Inspektor Gadget offers deeper observability capabilities that reveal even more about your system's behavior.</p>
<h3 id="heading-syscall-level-observation">Syscall-Level Observation</h3>
<p>The traceloop and trace_tcp gadgets give you application-level insights, but sometimes you need to go deeper. The trace_exec gadget shows you every process execution in your containers.</p>
<p>First, let’s monitor process execution:</p>
<pre><code class="lang-bash">kubectl gadget trace_exec -n demo-app
</code></pre>
<p>And generate activity:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Exec into a pod and run commands</span>
kubectl <span class="hljs-built_in">exec</span> -it -n demo-app deployment/frontend -- /bin/sh
ls -la
ps aux
<span class="hljs-built_in">exit</span>
</code></pre>
<p>Every command you run inside the container appears in the trace: <code>/bin/sh</code>, <code>ls</code>, <code>ps</code>, and anything else. This helps you understand what's running in your containers, detect suspicious activity, or debug initialization issues.</p>
<p>In production scenarios, this helps you answer questions like: Is my application spawning unexpected subprocesses? Are there security issues like someone running <code>curl</code> to download malicious scripts? Is my <code>init</code> script actually running the commands I think it is?</p>
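<p>One way to answer the "unexpected subprocess" question is to compare the executed commands against an allowlist. The following sketch assumes you have extracted one command path per line from the trace output. The function name, the allowlist contents, and the <code>/src/server</code> path are hypothetical.</p>

```shell
# Print executed commands that are not on an allowlist.
# Usage: unexpected_commands ALLOWLIST_FILE   (reads one command per line)
unexpected_commands() {
  # -x matches whole lines, -F treats allowlist entries as fixed strings
  grep -vxF -f "$1" | sort -u
}

# Hypothetical allowlist and observed commands.
allowlist=$(mktemp)
printf '%s\n' '/src/server' '/bin/sh' > "$allowlist"

printf '%s\n' '/src/server' '/bin/sh' '/usr/bin/curl' '/bin/sh' |
  unexpected_commands "$allowlist"

rm -f "$allowlist"
```

<p>With this sample, only <code>/usr/bin/curl</code> is reported. An allowlist is a deliberately strict choice: anything you did not expect is surfaced, which suits containers that should only ever run one or two binaries.</p>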
<h3 id="heading-network-tracing-insights">Network Tracing Insights</h3>
<p>Beyond TCP connections, you can trace DNS queries, which often reveal surprising things about your application's behavior.</p>
<p>Run <code>trace_dns</code>:</p>
<pre><code class="lang-bash">kubectl gadget trace_dns -n demo-app
</code></pre>
<p>Generate requests:</p>
<pre><code class="lang-bash">curl http://localhost:8080
</code></pre>
<p>You'll see every DNS query your application makes: resolving service names, checking for external APIs, perhaps even unexpected queries that indicate misconfiguration or dependencies you didn't know about.</p>
<p>Common insights from DNS tracing include discovering that your application is using external dependencies you didn't document, finding DNS resolution failures that cause intermittent errors, or identifying excessive DNS queries that could be cached.</p>
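<p>To spot resolution failures quickly, you can filter the DNS records down to lookups that did not succeed. The sketch below assumes a simplified <code>name rcode</code> format and the response code strings <code>NoError</code> and <code>NXDomain</code>. Both the format and the function name are assumptions, so adjust them to whatever your gadget version prints.</p>

```shell
# List DNS names whose lookups did not succeed, with the response code.
# Input: one response per line as "name rcode" (hypothetical format).
failed_lookups() {
  awk '$2 != "NoError" { print $1, $2 }' | sort -u
}

# Hypothetical sample: one successful lookup and a repeated failure.
printf '%s\n' \
  'frontend.demo-app.svc.cluster.local. NoError' \
  'productcatalog.demo-app.svc.cluster.local. NXDomain' \
  'productcatalog.demo-app.svc.cluster.local. NXDomain' |
  failed_lookups
```

<p>In this sample the product catalog name fails to resolve, which is what you would expect to see while its Service is missing.</p>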
<h3 id="heading-combining-ebpf-data-with-logs-and-metrics">Combining eBPF Data with Logs and Metrics</h3>
<p>eBPF observability delivers the best results when combined with traditional observability signals. To combine them effectively:</p>
<ul>
<li><p>Use metrics for high-level health monitoring, alerting on anomalies, tracking trends over time, and dashboard visualization.</p>
</li>
<li><p>Use logs for application-specific context, business logic details, error messages with stack traces, and debugging application code.</p>
</li>
<li><p>Use eBPF traces for understanding request flows, identifying where time is spent, discovering runtime dependencies, and debugging issues that don't appear in logs.</p>
</li>
</ul>
<h4 id="heading-a-practical-workflow">A practical workflow:</h4>
<p>Your metrics alert you that latency increased. You check logs but don't see errors: requests are succeeding, just slowly. You use eBPF tracing to identify that requests are spending extra time in network I/O to a particular backend service. Now you check that service's metrics and logs, and discover it's under heavy load. The eBPF trace gave you the clue that logs and metrics alone couldn't provide.</p>
<p>This approach to observability, using the right tool for each question, is how experienced engineers debug complex systems efficiently.</p>
<h3 id="heading-what-ebpf-can-and-cant-see">What eBPF Can and Can't See</h3>
<p>eBPF excels at:</p>
<ul>
<li><p>Network traffic (requests, responses, latency)</p>
</li>
<li><p>System calls (file I/O, process creation, memory allocation)</p>
</li>
<li><p>Kernel functions (scheduling, locking, resource usage)</p>
</li>
<li><p>Function calls in binaries (with uprobes)</p>
</li>
</ul>
<p>But keep in mind that eBPF has limitations:</p>
<ul>
<li><p>Cannot decrypt encrypted payloads (unless hooking SSL libraries before encryption)</p>
</li>
<li><p>Doesn't automatically understand application logic</p>
</li>
<li><p>Captures low-level events but may need context for high-level semantics</p>
</li>
</ul>
<p>That's why eBPF complements traditional observability rather than replacing it entirely. It gives you infrastructure-level visibility with no code changes and universal coverage. Traditional APM provides application-level context, business metrics, and custom instrumentation. Together, they give you complete observability across your entire stack.</p>
<h2 id="heading-best-practices-and-production-considerations">Best Practices and Production Considerations</h2>
<p>Before using eBPF tracing in production, there are important considerations around performance, security, and operational practices.</p>
<h3 id="heading-performance-impact">Performance Impact</h3>
<p>eBPF's reputation for low overhead is well-deserved, but "low" isn't "zero."</p>
<p>As a rough guide, eBPF tracing tools typically add around 2-5% CPU overhead and negligible memory overhead. The exact number depends on event frequency: tracing a service that handles 10,000 requests per second costs more than tracing one that handles 10 per second.</p>
<p>Measuring the impact:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Before enabling tracing, check baseline resource usage</span>
kubectl top pods -n demo-app

<span class="hljs-comment"># Enable tracing</span>
kubectl gadget traceloop -n demo-app

<span class="hljs-comment"># Check resource usage again</span>
kubectl top pods -n demo-app
</code></pre>
<p>You should see a small increase in CPU usage in the pods where tracing is active. This is the cost of the eBPF programs running in the kernel and processing events.</p>
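<p>To turn the two readings into a percentage, a one-line calculation is enough. The sketch below assumes you pass the CPU values as plain millicore numbers (for example, <code>250</code> for a reading of <code>250m</code>). The function name and the sample readings are hypothetical.</p>

```shell
# Percent change in CPU between two readings in millicores.
# Usage: cpu_overhead_pct BASELINE_M TRACED_M
cpu_overhead_pct() {
  awk -v base="$1" -v traced="$2" 'BEGIN {
    if (base + 0 == 0) { print "baseline must be non-zero"; exit 1 }
    printf "%.1f%%\n", (traced - base) * 100 / base
  }'
}

# Hypothetical readings: 250m before tracing, 260m while tracing.
cpu_overhead_pct 250 260
```

<p>For these sample readings the result is <code>4.0%</code>. Take several readings under similar load before drawing conclusions, since a single pair of samples can easily be dominated by normal traffic variation.</p>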
<h4 id="heading-production-best-practices">Production best practices:</h4>
<p>Use targeted tracing rather than tracing everything everywhere. Trace specific namespaces, pods, or individual containers when investigating issues. For high-volume services, reduce overhead by applying filters, aggregation, or sampling where supported by the tracing tool.</p>
<p>Stop tracing when you’re done investigating. Unlike metrics collection, which typically runs continuously, eBPF-based tracing is best used as an on-demand diagnostic tool to capture detailed insights during active debugging.</p>
<h4 id="heading-when-overhead-matters">When overhead matters:</h4>
<p>If you're running latency-sensitive applications (like high-frequency trading systems or real-time communications), even 2-5% overhead might be unacceptable. In these cases, use eBPF tracing in pre-production environments to identify issues, or enable it temporarily in production only when actively debugging.</p>
<h3 id="heading-security-considerations">Security Considerations</h3>
<p>eBPF is powerful, which means it requires elevated privileges. Understanding the security implications is crucial.</p>
<h4 id="heading-what-ebpf-can-access">What eBPF can access:</h4>
<p>eBPF programs can observe all syscalls, network traffic, and process execution in the kernel. This includes potentially sensitive data like connection details, file paths, and process arguments. The kernel verifier sandboxes eBPF programs so that they can't crash the kernel, and tracing gadgets observe events rather than modify them, but they can still read information that might be sensitive.</p>
<h4 id="heading-privilege-requirements">Privilege requirements:</h4>
<p>Loading eBPF programs requires the <code>CAP_SYS_ADMIN</code> capability or, on Linux 5.8 and newer, the narrower <code>CAP_BPF</code> capability. This is a privileged operation, so only trusted users should have this access. The Inspektor Gadget DaemonSet runs with these privileges, so protect access to it accordingly.</p>
<h4 id="heading-best-practices">Best practices:</h4>
<p>Implement RBAC (Role-Based Access Control) to restrict who can run gadgets. Not every developer needs the ability to trace production systems.</p>
<p>Also, be mindful of what data you're collecting. If your traces might contain sensitive information (like authentication tokens in HTTP headers), restrict access to trace data.</p>
<p>Lastly, consider using admission controllers to prevent unauthorized eBPF program loading. Audit eBPF usage in production environments to track who ran which gadgets when.</p>
<h4 id="heading-network-policies">Network policies:</h4>
<p>Inspektor Gadget's DaemonSet needs to communicate with the API server and between its components. Ensure your network policies allow this communication while still maintaining appropriate segmentation.</p>
<h3 id="heading-when-to-use-ebpf-tracing-vs-traditional-apm">When to Use eBPF Tracing vs. Traditional APM</h3>
<p>eBPF tracing and traditional APM tools like New Relic, Datadog, or Dynatrace serve different purposes. Understanding when to use each helps you build an effective observability strategy.</p>
<p>Use eBPF tracing when:</p>
<ul>
<li><p>You can't modify application code (third-party applications, legacy systems, compiled binaries)</p>
</li>
<li><p>You need infrastructure-level visibility (network, syscalls, kernel behavior)</p>
</li>
<li><p>You're debugging issues that span service boundaries but don't show up in application logs</p>
</li>
<li><p>You want zero instrumentation overhead during normal operation (run tracing only when needed)</p>
</li>
<li><p>You need to understand what's actually happening versus what the application reports</p>
</li>
</ul>
<p>Use traditional APM when:</p>
<ul>
<li><p>You need business-context metrics (user IDs, transaction types, business-specific data)</p>
</li>
<li><p>You want automatic instrumentation with minimal setup for supported frameworks</p>
</li>
<li><p>You need long-term storage and analysis of all traces (eBPF tracing is often used for real-time investigation)</p>
</li>
<li><p>You want pre-built dashboards and alerting for common application patterns</p>
</li>
<li><p>You need application code-level visibility (stack traces, variable values, function calls)</p>
</li>
</ul>
<h3 id="heading-the-ideal-approach-use-both">The Ideal Approach: Use Both</h3>
<p>Many teams run traditional APM for continuous monitoring and use eBPF tracing for targeted investigation when APM data isn't sufficient. For example, your APM shows that a service is slow but doesn't explain why. You enable eBPF tracing on that service to understand what's happening at the kernel level (network delays, excessive syscalls, unexpected dependencies) and find the root cause.</p>
<p>This complementary approach gives you both the continuous visibility of APM and the deep diagnostic power of eBPF without the overhead of running both at maximum depth all the time.</p>
<h2 id="heading-next-steps-and-resources">Next Steps and Resources</h2>
<p>If you got this far, thanks for reading! Now that you have learned the fundamentals of eBPF observability and practiced hands-on tracing with Inspektor Gadget, here are some ways to continue your journey.</p>
<h3 id="heading-exploring-other-ebpf-tools">Exploring Other eBPF Tools</h3>
<p>Now that you understand eBPF concepts through traceloop, exploring other tools will be much easier.</p>
<h4 id="heading-try-other-inspektor-gadget-gadgets">Try other Inspektor Gadget gadgets:</h4>
<pre><code class="lang-bash"><span class="hljs-comment"># See all available gadgets</span>
kubectl gadget --<span class="hljs-built_in">help</span>

<span class="hljs-comment"># Some useful ones to explore:</span>
kubectl gadget trace_open -n demo-app     <span class="hljs-comment"># File I/O tracing</span>
kubectl gadget trace_bind -n demo-app     <span class="hljs-comment"># Port binding events</span>
kubectl gadget profile cpu -n demo-app    <span class="hljs-comment"># CPU profiling</span>
kubectl gadget snapshot process -n demo-app  <span class="hljs-comment"># Process listing</span>
</code></pre>
<p>Each gadget teaches you something different about system behavior and gives you another diagnostic tool in your toolkit.</p>
<h4 id="heading-experiment-with-other-ebpf-platforms">Experiment with other eBPF platforms:</h4>
<p>If you're interested in broader observability platforms, try Pixie for its auto-instrumentation and rich UI. Install Cilium with Hubble if you're focused on network observability and want to understand service mesh behavior. Explore Tetragon if security observability interests you: it shows which processes are executing and which files they're accessing.</p>
<p>The concepts transfer directly: all these tools attach eBPF programs to kernel hooks, collect event data, and present it in different ways. Your understanding of syscalls, traces, and kernel-level observation applies universally.</p>
<h3 id="heading-connect-to-the-cncf-observability-ecosystem">Connect to the CNCF Observability Ecosystem</h3>
<p>eBPF observability tools don't exist in isolation. They're part of the broader Cloud Native Computing Foundation ecosystem.</p>
<h4 id="heading-opentelemetry-integration">OpenTelemetry integration:</h4>
<p>Many eBPF tools can export data in OpenTelemetry format, allowing you to combine kernel-level traces with application-level traces in a unified observability backend. This gives you the complete picture: eBPF shows you infrastructure behavior while OpenTelemetry shows you application context.</p>
<h4 id="heading-prometheus-and-grafana">Prometheus and Grafana:</h4>
<p>eBPF-derived metrics can be exposed as Prometheus metrics and visualized in Grafana alongside your application metrics. This unified dashboard approach helps you correlate infrastructure and application behavior.</p>
<h4 id="heading-service-mesh-integration">Service mesh integration:</h4>
<p>If you're using Istio, Linkerd, or other service meshes, eBPF tools like Cilium Hubble can provide deeper visibility into service-to-service communication than the mesh alone provides. The mesh handles traffic management while eBPF gives you kernel-level visibility.</p>
<h4 id="heading-jaeger-and-zipkin">Jaeger and Zipkin:</h4>
<p>For organizations using distributed tracing backends, eBPF traces can be exported to these systems, enriching your trace data with infrastructure-level spans that application instrumentation misses.</p>
<h3 id="heading-community-resources-and-learning-paths">Community Resources and Learning Paths</h3>
<p>The eBPF community is vibrant and welcoming. You can continue learning from the resources below.</p>
<p><strong>Official documentation and blog:</strong></p>
<ul>
<li><p><a target="_blank" href="http://eBPF.io">eBPF.io</a>: The central hub for eBPF documentation, tutorials, and project listings</p>
</li>
<li><p><a target="_blank" href="https://inspektor-gadget.io/docs/latest/">Inspektor Gadget docs</a>: Comprehensive guides for all gadgets and use cases</p>
</li>
<li><p><a target="_blank" href="https://docs.cilium.io/en/stable/index.html">Cilium documentation</a>: Deep dives into eBPF networking</p>
</li>
<li><p><a target="_blank" href="https://www.cncf.io/blog/2025/01/27/what-is-observability-2-0/">CNCF Blog: “What is Observability 2.0?”</a>: A quick overview of how modern observability moves beyond traditional tools by unifying metrics, logs, and traces for real-time insight in cloud-native systems.</p>
</li>
</ul>
<p><strong>Learning resources:</strong></p>
<ul>
<li><p><a target="_blank" href="https://cilium.isovalent.com/hubfs/Learning-eBPF%20-%20Full%20book.pdf">Learning eBPF by Liz Rice</a>: Comprehensive book covering eBPF fundamentals</p>
</li>
<li><p><a target="_blank" href="https://ebpf.io/summit-2025/">eBPF Summit</a>: Annual conference with talks from eBPF creators and users</p>
</li>
<li><p><a target="_blank" href="https://www.cncf.io/online-programs/cncf-on-demand-webinar-how-to-start-building-a-self-service-infrastructure-platform-on-kubernetes/">CNCF webinars</a>: Regular sessions on observability topics</p>
</li>
<li><p><a target="_blank" href="https://www.kubernetes.dev/community/community-groups/">Kubernetes observability SIGs</a>: Community discussions and projects</p>
</li>
</ul>
<p>To make this tutorial easy to follow and experiment with, I have included all Kubernetes manifests, demo applications, and eBPF tracing commands in this <a target="_blank" href="https://github.com/Emidowojo/ebpf-k8s-tracing-tutorial">repository</a>. You can also connect with me on <a target="_blank" href="https://www.linkedin.com/in/emidowojo/">LinkedIn</a> if you’d like to stay in touch.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy Your Own CockroachDB Instance on Kubernetes [Full Book for Devs] ]]>
                </title>
                <description>
                    <![CDATA[ Developers are smart, wonderful people, and they’re some of the most logical thinkers you’ll ever meet. But we’re pretty terrible at naming things 😂 Like, what in the world – out of every other possible name, they decided to name a database after a ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deploy-your-own-cockroach-db-instance-on-kubernetes-full-book-for-devs/</link>
                <guid isPermaLink="false">6925e482ccc8b29b82c002c5</guid>
                
                    <category>
                        <![CDATA[ cockroachdb ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Prince Onukwili ]]>
                </dc:creator>
                <pubDate>Tue, 25 Nov 2025 17:16:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764088553942/496bf5f4-f059-4873-b6c1-419a86e594ef.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Developers are smart, wonderful people, and they’re some of the most logical thinkers you’ll ever meet. But we’re pretty terrible at naming things 😂</p>
<p>Like, what in the world – out of every other possible name, they decided to name a database after a <em>literal cockroach</em>? 🤣</p>
<p>I mean, I get it: cockroaches are known for being resilient, and the devs were probably trying to say “our database never dies”… but still…a cockroach?</p>
<p>The name aside, out of all the databases out there, you might be wondering why you would choose CockroachDB. And if you did choose it, where would you even start when trying to host and deploy it? Would you go for a managed cloud service? Or could you actually self-manage it?</p>
<p>If you ever thought of doing it yourself – maybe in a dev environment, or even introducing it to your company – how would you go about it?</p>
<p>Well, just calm your nerves 😄</p>
<p>In this book, we’ll explore everything you need to know about <strong>deploying and managing CockroachDB on Kubernetes</strong>. We’ll dive deep into:</p>
<ul>
<li><p>Understanding how CockroachDB’s masterless (multi-primary) architecture actually works</p>
</li>
<li><p>Setting up and deploying CockroachDB on a Kubernetes cluster</p>
</li>
<li><p>Automating backups to Google Cloud Storage using just a few queries in the CockroachDB cluster</p>
</li>
<li><p>Managing service accounts and authentication securely</p>
</li>
<li><p>Tuning CockroachDB’s memory settings for stable performance</p>
</li>
<li><p>Scaling the cluster horizontally and vertically without downtime</p>
</li>
<li><p>Monitoring and maintaining the database like a pro</p>
</li>
</ul>
<p>By the end, you’ll not only understand how CockroachDB works, you’ll be confident enough to deploy and manage your own resilient, production-ready instance. 🚀</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-even-is-cockroachdb">What Even Is CockroachDB? 🤔</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-simple-definition">Simple Definition</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-who-made-cockroachdb-when-was-it-released">Who Made CockroachDB? When Was it Released?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-problems-does-cockroachdb-try-to-solve">What Problems Does CockroachDB Try to Solve?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-terms-you-should-know-in-plain-language">Key Terms You Should Know (in plain language):</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-the-name-cockroachdb">Why the name “CockroachDB”? 😅</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-why-choose-cockroachdb-over-postgresql-or-mongodb">Why Choose CockroachDB Over PostgreSQL or MongoDB 🤷🏾‍♂️?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-fault-tolerance-is-handled-in-postgresql-and-mongodb">How Fault Tolerance is Handled in PostgreSQL and MongoDB</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-cockroachdb-handles-it-differently">How CockroachDB Handles It Differently</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-cockroachdb-works-behind-the-scenes">How CockroachDB Works Behind the Scenes ⚙️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ranges-the-small-pieces-of-data">Ranges: The Small Pieces of Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-replication-many-copies-for-safety">Replication: Many Copies for Safety</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-raft-consensus-how-all-copies-agree">Raft Consensus: How All Copies Agree</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-multiraft-keeping-raft-efficient-when-things-scale">MultiRaft: Keeping Raft Efficient When Things Scale</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-rebalancing-movement-for-balance">Rebalancing: Movement for Balance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-distributed-transactions-doing-work-across-multiple-ranges">Distributed Transactions: Doing Work Across Multiple Ranges</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-it-all-fits-together-read-write-flow-what-happens-when-you-use-it">How It All Fits Together: Read + Write Flow (What Happens When You Use It)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-this-all-matters-putting-it-in-plain-english">Why This All Matters (Putting It in Plain English)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-where-and-how-should-you-host-cockroachdb">Where (and How) Should You Host CockroachDB? ☁️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-option-1-cockroachdb-cloud-fully-managed-by-cockroach-labs">Option 1: CockroachDB Cloud (fully managed by Cockroach Labs)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-bring-your-own-cloud-byoc">Option 2: Bring Your Own Cloud (BYOC)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-3-use-cloud-marketplaces-aws-gcp-azure">Option 3: Use Cloud Marketplaces (AWS, GCP, Azure)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-4-my-favorite-self-hosting-especially-using-kubernetes">Option 4 (My Favorite 😁): Self-Hosting — Especially Using Kubernetes</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-setting-up-your-local-environment">Setting Up Your Local Environment 🧑‍💻</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-these-tools">Why these tools?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-install-minikube">Step 1: Install Minikube</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-install-kubectl">Step 2: Install kubectl</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-install-helm">Step 3: Install Helm</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-deploying-cockroachdb-on-minikube-the-fun-part-begins">Deploying CockroachDB on Minikube (The Fun Part Begins 😁!)</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-visit-artifacthub">Step 1: Visit ArtifactHub</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-explore-the-helm-chart">Step 2: Explore the Helm Chart</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-copy-the-default-values">Step 3: Copy the Default Values</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-create-a-folder-for-our-project">Step 4: Create a Folder for Our Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-understanding-the-key-configurations">Step 5: Understanding the Key Configurations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-create-a-simplified-values-config-for-the-cockroachdb-helm-chart">Step 6: Create a Simplified Values Config for the CockroachDB Helm Chart</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-overview-of-the-yaml-values">Overview of the YAML values</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-install-the-cockroachdb-cluster-using-helm">🚀 Step 7: Install the CockroachDB Cluster Using Helm</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-accessing-the-cockroachdb-console-amp-viewing-metrics">Accessing the CockroachDB Console &amp; Viewing Metrics</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-locate-the-cockroachdb-public-service">Step 1: Locate the CockroachDB Public Service</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-learn-more-about-the-service">Step 2: Learn More About the Service</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-access-the-cockroachdb-dashboard">Step 3: Access the CockroachDB Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-visit-the-dashboard">Step 4: Visit the Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-exploring-the-metrics-dashboard">Step 5: Exploring the Metrics Dashboard</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-creating-a-little-load-on-the-cockroachdb-cluster">Step 6: Creating a Little Load on the CockroachDB Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-7-viewing-the-metrics-from-the-load">Step 7: Viewing the Metrics from the Load</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-8-view-the-list-of-created-items-in-the-database">Step 8: View the List of Created Items in the Database</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-backing-up-cockroachdb-to-google-cloud-storage">Backing Up CockroachDB to Google Cloud Storage ☁️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-backups-are-absolutely-critical">Why Backups Are Absolutely Critical</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-our-db-installing-beekeeper-studio">Connecting to Our DB – Installing Beekeeper Studio</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-install-beekeeper-studio">How to Install Beekeeper Studio</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-beekeeper-studio-to-cockroachdb">Connecting Beekeeper Studio to CockroachDB</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exposing-the-cluster-for-local-access">Exposing the Cluster for Local Access</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-via-beekeeper-studio">🐝 Connecting via Beekeeper Studio</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-verify-the-connection">Verify the Connection</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-a-google-cloud-account">Creating a Google Cloud Account</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-a-google-cloud-storage-bucket">Creating a Google Cloud Storage Bucket</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-giving-cockroachdb-access-to-the-bucket">Giving CockroachDB Access to the Bucket</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-attaching-the-key-to-our-cockroachdb-cluster">Attaching the Key to Our CockroachDB Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-testing-our-backup-disaster-recovery-time">Testing Our Backup — Disaster Recovery Time</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-managing-resources-amp-optimizing-memory-usage">Managing Resources &amp; Optimizing Memory Usage</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-how-cockroachdb-uses-memory">How CockroachDB Uses Memory</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-memory-usage-formula-you-must-follow">The Memory Usage Formula You Must Follow</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-where-you-find-these-settings">Where You Find These Settings</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-concrete-example-step-by-step">Concrete Example (Step-by-Step)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-on-requests-vs-limits-in-kubernetes">⚠️ On Requests vs Limits in Kubernetes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-overriding-the-default-fractions">Overriding the Default Fractions</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-scaling-cockroachdb-the-right-way">Scaling CockroachDB the Right Way</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-key-metrics-to-understand">Key Metrics to Understand</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-when-and-what-to-scale-based-on-your-metrics">When (and What) to Scale Based on Your Metrics</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-disk-bound-situations-what-to-do-when-your-disk-is-the-limiting-factor">Disk-Bound Situations — What to Do When Your Disk Is the Limiting Factor</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-memory-pressure-what-to-do-when-your-database-hits-the-limit">Memory Pressure — What to Do When Your Database Hits the Limit</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-when-queries-are-slow-but-everything-else-cpu-memory-amp-disk-looks-fine">When Queries Are Slow but Everything Else (CPU, Memory &amp; Disk) Looks “Fine”</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-disk-speed-iops-amp-throughput-across-cloud-providers">Understanding Disk Speed (IOPS &amp; Throughput) Across Cloud Providers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-downsizing-the-cluster-reducing-replicas">Downsizing the Cluster (Reducing Replicas)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-wrong-way-to-downscale">⚠️ The Wrong Way to Downscale</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-decommissioning-a-node-before-scaling-down-the-cluster">Decommissioning a Node Before Scaling Down the Cluster</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-what-to-consider-when-deploying-cockroachdb-on-google-kubernetes-engine-gke">What to Consider When Deploying CockroachDB on Google Kubernetes Engine (GKE) ☁️</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-creating-your-gke-cluster">Creating Your GKE Cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-your-gke-cluster">Connecting to your GKE cluster</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-deploying-cockroachdb-in-production-on-gke">Deploying CockroachDB in Production (on GKE)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-configuration">Understanding the Configuration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-installing-the-cockroachdb-cluster-on-gke">Installing the CockroachDB Cluster on GKE</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-our-cockroachdb-cluster-now-that-tls-mtls-are-enabled">Connecting to Our CockroachDB Cluster (Now That TLS + mTLS Are Enabled)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-via-mutual-tls-mtls-why-we-need-a-certificate-for-our-root-user">Connecting via Mutual TLS (mTLS) — Why We Need a Certificate for Our root User</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-lets-explore-our-clusters-certificate">Let’s Explore Our Cluster’s Certificate</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-certificate-sections-explained-super-simply">Understanding the Certificate Sections (Explained Super Simply)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-creating-a-client-certificate-so-we-can-finally-connect-to-cockroachdb">Creating a Client Certificate (So We Can Finally Connect to CockroachDB)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-our-cockroachdb-cluster-securely-using-mtls">Connecting to Our CockroachDB Cluster Securely (Using mTLS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-restoring-our-previous-database-into-the-new-gke-cockroachdb-cluster-without-sa-keys">Restoring Our Previous Database into the New GKE CockroachDB Cluster (without SA keys)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-restoring-our-previous-database-from-google-cloud-storage">Restoring Our Previous Database from Google Cloud Storage</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-now-lets-restore-the-data">Now, Let’s Restore the Data 🎉</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-to-the-database-with-a-new-user">Connecting to the Database with a New User</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-with-passwordless-authentication-mutual-tls">Connecting with Passwordless Authentication (Mutual TLS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-connecting-via-mutual-tls-mtls-from-our-apps-on-kubernetes">Connecting via Mutual TLS (mTLS) from Our Apps on Kubernetes</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-get-a-cockroachdb-enterprise-license-for-free">How to Get a CockroachDB Enterprise License for FREEE!</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-three-types-of-licenses">Three Types of Licenses</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-apply-for-the-free-enterprise-license">How to Apply for the Free Enterprise License</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-adding-your-license-to-the-cockroachdb-cluster">Adding Your License to the CockroachDB Cluster</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion-amp-next-steps">Conclusion &amp; Next Steps ✨</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-about-the-author">About the Author 👨🏾‍💻</a></li>
</ul>
</li>
</ol>
<h2 id="heading-what-even-is-cockroachdb">What Even Is CockroachDB? 🤔</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760416037885/c67edcbb-be85-4614-bdf3-104942048eea.jpeg" alt="An image summarizing what CockroachDB is" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Hey! Before we jump into setting up our Kubernetes cluster and deploying our CockroachDB cluster, let’s get grounded in what CockroachDB really is. (Because if you don’t understand the why and how, the implementation and practical session will just feel like magic 😅.)</p>
<h3 id="heading-simple-definition">Simple Definition</h3>
<p>CockroachDB is a distributed SQL database. This means it gives you the features of a relational database (tables, SQL queries, JOINs, transactions) but copies data across multiple replicas (servers, nodes, instances). No need to shard manually. 😃</p>
<p>It’s built to survive failures, scale easily (compared to other SQL databases), and keep your data consistent no matter what (across all the instances).</p>
<h3 id="heading-who-made-cockroachdb-when-was-it-released">Who Made CockroachDB? When Was it Released?</h3>
<p>CockroachDB was created by <a target="_blank" href="https://www.cockroachlabs.com/"><strong>Cockroach Labs</strong></a>, founded by Spencer Kimball, Peter Mattis, and Ben Darnell. The idea first started taking shape around 2014, and by 2015 Cockroach Labs was formally founded.</p>
<p>Its 1.0 “production-ready” version was announced in 2017, marking its transition from beta to being suitable for real-world use.</p>
<h3 id="heading-what-problems-does-cockroachdb-try-to-solve">What Problems Does CockroachDB Try to Solve?</h3>
<p>Traditional relational databases are great, but they run into real challenges when your app grows. CockroachDB was built to solve those. Here are the key pain points and how CockroachDB addresses them:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pain Point</td><td>What usually happens</td><td>How CockroachDB fixes it</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Single primary bottleneck</strong></td><td>ONLY ONE “primary” node handles all writes (inserts, updates, and deletes). That node is difficult to scale (adapt to the DB usage) without downtime.</td><td>CockroachDB is <strong>multi-primary</strong>, meaning every node can accept reads and writes. No single “primary” for the entire cluster.</td></tr>
<tr>
<td><strong>Manual sharding complexity</strong></td><td>You have to split data (shard) by hand, decide which piece goes where, and handle cross-shard queries. Lots of headache 😖.</td><td>CockroachDB automatically partitions data into smaller units (called <em>ranges</em>) and moves them around to balance load.</td></tr>
<tr>
<td><strong>Failover downtime</strong></td><td>If the primary node fails, you need to promote a replica (read-only instance) and switch over. During that time, your app might be down.</td><td>Because there’s no single primary, if one of the instances fails, the others take over seamlessly (via consensus) without a big outage.</td></tr>
<tr>
<td><strong>Geographic scaling &amp; latency</strong></td><td>Serving users in different regions is hard — either data is far away (slow) or you must build complex replication logic.</td><td>CockroachDB lets you distribute nodes across regions. You can serve local reads/writes while keeping global consistency.</td></tr>
</tbody>
</table>
</div><p>So instead of fighting your database as it grows, CockroachDB handles much of the hard work for you.</p>
<h3 id="heading-key-terms-you-should-know-in-plain-language">Key Terms You Should Know (in plain language):</h3>
<ul>
<li><p><strong>Node:</strong> A single running instance (server) of your database. When several nodes each hold a copy of the data, those copies are also known as replicas. In classic setups, a replica can be read-only (data can only be read from it, for example using SELECT statements) or read-write (data can be read, created, updated, and deleted).</p>
</li>
<li><p><strong>Replication</strong>: making copies of data on multiple nodes. If one node fails, others still have the data.</p>
</li>
<li><p><strong>Raft (consensus algorithm)</strong>: a system that ensures copies (replicas) agree on changes in a safe, reliable way. For example, when you want to write data, Raft ensures that a majority of the copies agree before the write is accepted.</p>
</li>
<li><p><strong>Sharding / Ranges</strong>: Instead of putting all your data in one big blob, CockroachDB splits it into smaller chunks called <em>ranges</em>. Each range is replicated and can move between nodes.</p>
</li>
<li><p><strong>Distributed transaction</strong>: a transaction (series of operations) that might touch data stored in different nodes. CockroachDB manages this, so you still get ACID (atomic, consistent, isolated, durable) properties.</p>
</li>
</ul>
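<p>To make the “majority” idea behind Raft concrete, here is a minimal Python sketch of the quorum arithmetic. It isn’t CockroachDB code, and the function names are made up for illustration. It only shows how many copies must agree, and how many can be lost, for a given number of replicas.</p>
<pre><code class="lang-python"># Illustrative only: quorum arithmetic for a group of replicas.
# The function names are hypothetical, not part of any CockroachDB API.


def quorum(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be down while a majority is still reachable."""
    return replicas - quorum(replicas)


for n in (3, 5, 7):
    print(n, "replicas: quorum =", quorum(n), "| can lose", tolerated_failures(n))
</code></pre>
<p>With 3 replicas, 2 must agree and 1 can fail. With 5 replicas, 3 must agree and 2 can fail.</p>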
<h3 id="heading-why-the-name-cockroachdb">Why the name “CockroachDB”? 😅</h3>
<p>You might wonder: <em>Why name a database after a cockroach?</em> It sounds weird at first, but there's a reason:</p>
<p>Cockroaches are known for surviving harsh conditions: radiation, natural disasters, and so on. The founders wanted a database that feels almost “impossible to kill,” that can survive node failures, outages, and network splits. The name is a tongue-in-cheek nod to resilience.</p>
<h2 id="heading-why-choose-cockroachdb-over-postgresql-or-mongodb">Why Choose CockroachDB Over PostgreSQL or MongoDB 🤷🏾‍♂️?</h2>
<p>Let’s compare the classic setup (Postgres / MongoDB) to CockroachDB, looking at why you might want to go with CockroachDB and how it eases scaling. I’ll also explain some terms to make sure you’re following.</p>
<p>In many Postgres or MongoDB setups, you’ll have one “primary” node that handles all writes (that is, inserts, updates, deletes).</p>
<p>Then you have multiple “read replicas” that copy the primary’s data and serve read requests (selects). That works okay – reads can be spread out – but all write traffic goes to that one primary node.</p>
<p>Usually, the primary eventually gets stressed when the write volume grows (for example, more customers create accounts and products on your platform).</p>
<p>You can add more read replicas (horizontal scaling for reads, for example customers trying to view their accounts, or previously created products on your site), but scaling the primary is much harder.</p>
<p>To scale the primary, you often resort to upgrading its resources (CPU, RAM, disk) – that’s vertical scaling – which often needs downtime (shut down the primary database, increase its CPU and RAM, then spin it back up).</p>
<p>Or you’d have to manually shard (split) your data across multiple primaries, route traffic carefully, and manage complexity.</p>
<h3 id="heading-how-fault-tolerance-is-handled-in-postgresql-and-mongodb">How Fault Tolerance is Handled in PostgreSQL and MongoDB</h3>
<p>When you try to make Postgres (or MongoDB) highly available and fault tolerant in a self-managed setup, you often need one primary and two or more read replicas.</p>
<p>The tricky part is handling what happens when the primary fails (or is taken down temporarily for an upgrade). You need something that can promote a replica to a primary automatically.</p>
<p>In Postgres land, that’s often handled by <a target="_blank" href="https://github.com/patroni/patroni"><strong>Patroni</strong></a> or <a target="_blank" href="https://www.repmgr.org/"><strong>repmgr</strong></a> (tools that handle cluster management, failover, leader election, and so on).</p>
<p>In MongoDB, such logic is part of the <strong>replica set</strong> behavior: it does automatic elections among replicas.</p>
<p>Here are some of the core challenges with that classic model, along with the extra pieces you end up running around the database:</p>
<ul>
<li><p>Every write must go to a single primary. If that primary fails or is overloaded, your whole system suffers.</p>
</li>
<li><p>Scaling reads is easy (add more replicas), but scaling writes is hard.</p>
</li>
<li><p>Vertical scaling (give more resources to one server) has its cons. If the primary node needs more resources, you might experience some downtime when it’s being scaled up.</p>
</li>
<li><p>Manual sharding is messy: you decide which piece of data goes to which shard, handle cross-shard queries, and build routing logic. That’s a lot of maintenance and can lead to unexpected issues if not handled properly.</p>
</li>
<li><p>One service (or load balancer/proxy) points to the primary (for ALL write queries).</p>
</li>
<li><p>Another service or routing logic handles read queries and can share reads across replicas.</p>
</li>
<li><p>You might use <strong>HAProxy</strong>, <strong>pgpool-II</strong>, or <strong>pgBouncer</strong> for Postgres to route traffic, do read/write splitting, or manage connection pooling. These are external (not part of the database core) tools you have to configure.</p>
</li>
</ul>
<p>So when the primary fails, Patroni (or repmgr, and so on) will detect it and promote one of the read replicas to be the new primary.</p>
<p>But that promotion, reconfiguration, and traffic rerouting often cause a brief window of downtime (when your primary database node becomes unavailable).</p>
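<p>The read/write splitting described above can be sketched in a few lines of Python. This is a toy router, not how HAProxy or pgpool-II is implemented. The host names and the <code>route</code> function are hypothetical, and it naively treats any statement that starts with SELECT as a read.</p>
<pre><code class="lang-python">import itertools

# Hypothetical addresses, used only for illustration.
PRIMARY = "primary:5432"
REPLICAS = ["replica-1:5432", "replica-2:5432"]

# Hand out read replicas in round-robin order.
_next_replica = itertools.cycle(REPLICAS)


def route(sql):
    """Send SELECTs to a read replica; send everything else to the primary."""
    if sql.strip().upper().startswith("SELECT"):
        return next(_next_replica)
    return PRIMARY


for statement in ("SELECT * FROM users", "INSERT INTO users VALUES (1)", "SELECT 1"):
    print(statement, "goes to", route(statement))
</code></pre>
<p>Reads get spread across the replicas, but every write still lands on the one primary, which is why that node becomes the bottleneck and the single point of failure.</p>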
<h3 id="heading-how-cockroachdb-handles-it-differently">How CockroachDB Handles It Differently</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760416070693/af1ade70-19bb-4e9f-82ec-9711c13d8079.jpeg" alt="A brief look at CockroachDB properties" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>CockroachDB changes the rules:</p>
<ul>
<li><p><strong>All replicas are equal</strong> for reads <em>and</em> writes. You don’t have a special “primary” that handles writes. Every node in the cluster can accept write requests.</p>
</li>
<li><p>CockroachDB breaks your data into small chunks (ranges) and replicates them across nodes. If you add a new node, data moves around automatically to balance the load.</p>
</li>
<li><p>Every write is automatically copied to other replicas, and consistency is managed by a protocol (Raft), so you don’t have to build this yourself.</p>
</li>
<li><p>No manual sharding needed. Because the database handles how data is split and moved, you don’t need to decide how to shard by hand.</p>
</li>
<li><p>You <strong>don’t need a special service</strong> to route write queries versus read queries. Any node can accept both reads <strong>and</strong> writes.</p>
</li>
<li><p>During scaling, you don’t have to worry about which node is the primary – because <em>there is no primary</em>.</p>
</li>
<li><p>You can scale your nodes one at a time (rollout style). When one node is being upgraded, the others continue to serve traffic. You won’t hit a downtime window just because you're scaling the “primary.”</p>
</li>
<li><p>Because there's no replica promotion logic to fight with, there's no moment where a replica needs to be “elevated” to primary – it’s all just nodes continuing to serve.</p>
</li>
</ul>
<h2 id="heading-how-cockroachdb-works-behind-the-scenes">How CockroachDB Works Behind the Scenes ⚙️</h2>
<p>In CockroachDB, there are many moving parts behind the scenes. But they work together, so you don’t have to babysit them. The core ideas, which we’ve mostly already touched on, are:</p>
<ul>
<li><p>Splitting data into pieces (<strong>ranges</strong>)</p>
</li>
<li><p>Keeping multiple copies of each piece (<strong>replicas/replication</strong>)</p>
</li>
<li><p>Making sure all copies agree via <strong>Raft consensus</strong></p>
</li>
<li><p>Moving pieces around to balance the load (<strong>automatic rebalancing/distribution</strong>)</p>
</li>
<li><p>Coordinating transactions that might touch many pieces</p>
</li>
</ul>
<p>Let’s go through each of those, one by one.</p>
<h3 id="heading-ranges-the-small-pieces-of-data">Ranges: The Small Pieces of Data</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760413105037/984f8b5c-bd53-4850-9704-57ce1dcedb80.png" alt="A little depiction of CockroachDB ranges" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Imagine you have a giant book of recipes. If you try to carry the whole thing, it’s heavy. So you split the book into smaller booklets, each covering recipes for a certain range of meals: breakfasts, lunches, dinners, desserts.</p>
<p>In CockroachDB, data is split into ranges, which are like those smaller booklets:</p>
<ul>
<li><p>Each range covers a certain block of data (like “all users whose ID is 1-1000”)</p>
</li>
<li><p>When a range gets too big (like having too many recipes in one booklet) it’s cut/split into two smaller ones. That makes each piece easier to manage.</p>
</li>
<li><p>If two neighboring ranges have become very small (few recipes), they might be merged (joined) back together so you’re not keeping too many tiny booklets.</p>
</li>
<li><p>These splits and merges happen automatically, behind the scenes, so the database stays smooth as things grow or shrink.</p>
</li>
</ul>
<p>This chopping helps the system in many ways: moving pieces, copying them, balancing load, and recovering from node failures all become easier.</p>
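<p>To make the splitting rule concrete, here is a minimal shell sketch of the idea, not CockroachDB’s real algorithm. The <code>split_range</code> function, the numeric key span, and the midpoint split are illustrative assumptions, and the 512 MiB threshold is only a stand-in for the configurable maximum range size (check the docs for your version’s default).</p>

```bash
#!/usr/bin/env bash
# Toy model of range splitting; NOT CockroachDB's real algorithm.
MAX_RANGE_MIB=512   # assumed threshold for this sketch

# split_range START END SIZE_MIB
# Prints one line per resulting range, formatted as "start-end".
split_range() {
  local start=$1 end=$2 size=$3
  if [ "$size" -gt "$MAX_RANGE_MIB" ]; then
    # Too big: cut the key span in half at the midpoint.
    local mid=$(( (start + end) / 2 ))
    echo "${start}-${mid}"
    echo "$(( mid + 1 ))-${end}"
  else
    # Small enough: leave the range alone.
    echo "${start}-${end}"
  fi
}

split_range 1 1000 700   # oversized range: gets split in two
split_range 1 1000 100   # small range: stays as it is
```

<p>The oversized range comes back as <code>1-500</code> and <code>501-1000</code>, while the small one stays <code>1-1000</code>. The real database splits on actual keys and also takes load into account, but the “too big, so cut it in two” intuition is the same.</p>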
<h3 id="heading-replication-many-copies-for-safety">Replication: Many Copies for Safety</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760413678362/a0066780-1360-4511-8fd0-466f54ea2135.jpeg" alt="Replication of Ranges across multiple Nodes (databases) in CockroachDB" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Nobody likes losing their work, so you keep backup copies. CockroachDB does this for data as well.</p>
<p>For each range, there are usually 3 copies (replicas) stored on different machines (nodes), so if one machine dies, you still have the others (<a target="_blank" href="https://www.cockroachlabs.com/docs/stable/architecture/replication-layer">cockroachlabs.com</a>). These copies are kept in sync: when you write something (for example, an insert or update), the change is propagated to the other copies.</p>
<p>The database also tolerates failures. If one node goes down, the system detects it and eventually makes a new copy elsewhere to replace it. So the target number of copies is maintained. This gives you fault tolerance: your data stays safe even when parts of your system fail.</p>
<h3 id="heading-raft-consensus-how-all-copies-agree">Raft Consensus: How All Copies Agree</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760415307117/79859a4b-4341-46eb-91d9-cccc3bde9a66.jpeg" alt="Raft consensus among CockroachDB replicas" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Having copies is useful, but you also need them to agree with each other – like making sure every copy of a recipe booklet has the same content. The Raft protocol is a way to make sure that happens reliably.</p>
<p>Here’s how Raft works in simple terms:</p>
<ul>
<li><p>Each range has a group of replicas. One of these replicas acts as the <strong>leader</strong>. Others are <strong>followers</strong>.</p>
</li>
<li><p>All write requests for that range go through the leader. The leader gets the request, then tells followers to record the same change.</p>
</li>
<li><p>Once most of the copies (a majority) say “yep, we got it,” the change is considered final (committed). Then the leader tells the client, “Done.”</p>
</li>
<li><p>If the leader stops working (the machine dies or the network fails), the followers notice it (they stop getting regular “I’m alive” messages), then they hold an election to pick a new leader, and the show goes on.</p>
</li>
<li><p>This way, the system ensures everyone has the same final data and no conflicting changes happen.</p>
</li>
</ul>
<p>So Raft is the agreement protocol that keeps all copies in sync and safe.</p>
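<p>The “majority” rule is simple arithmetic. As a rough sketch (the function names <code>quorum</code> and <code>tolerated_failures</code> are made up for this example), you could compute it like this:</p>

```bash
#!/usr/bin/env bash
# Majority (quorum) size for a Raft group with N replicas.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many replicas can fail while a majority is still reachable.
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

for n in 3 5 7; do
  echo "replicas=$n quorum=$(quorum "$n") can_lose=$(tolerated_failures "$n")"
done
```

<p>With 3 replicas the quorum is 2, so one replica can fail. With 5 it is 3, so two can fail. Note that 4 replicas still tolerate only one failure, which is why replica counts are usually odd.</p>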
<h3 id="heading-multiraft-keeping-raft-efficient-when-things-scale">MultiRaft: Keeping Raft Efficient When Things Scale</h3>
<p>When you have many ranges (many booklets), each range has its own Raft group. That can mean a lot of “are you alive?” messages between nodes, and a lot of overhead. MultiRaft is the trick CockroachDB uses to make this efficient.</p>
<p>MultiRaft groups together Raft work for many ranges that share nodes, so overhead is reduced. Instead of sending separate heartbeat (are you alive?) messages for each range, some of the messages are bundled.</p>
<p>This reduces network chatter and resource waste and helps the database scale smoothly when you have tons of data and many pieces.</p>
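<p>To get a feel for why bundling helps, here is a back-of-the-envelope sketch. It is a deliberately simplified model, not a measurement: it assumes one heartbeat per follower per range without bundling, and one bundled heartbeat per pair of nodes with bundling. Both function names are hypothetical.</p>

```bash
#!/usr/bin/env bash
# Back-of-the-envelope model only; real heartbeat traffic depends on
# leadership placement, timing, and CockroachDB internals.

# Without bundling: each range leader pings each of its followers.
per_range_heartbeats() {
  local ranges=$1 replicas=$2
  echo $(( ranges * (replicas - 1) ))
}

# With bundling: one combined heartbeat per ordered pair of nodes.
coalesced_heartbeats() {
  local nodes=$1
  echo $(( nodes * (nodes - 1) ))
}

echo "per-range: $(per_range_heartbeats 10000 3) messages per tick"
echo "coalesced: $(coalesced_heartbeats 3) messages per tick"
```

<p>Under these assumptions, 10,000 ranges with 3 replicas each would need 20,000 messages per tick, while 3 nodes exchanging bundled heartbeats would need only 6. The exact numbers in a real cluster will differ, but the gap shows why the overhead stops growing with the number of ranges.</p>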
<h3 id="heading-rebalancing-movement-for-balance">Rebalancing: Movement for Balance</h3>
<p>When your ranges are not evenly spread across nodes (machines), some machines are doing way too much work, and some hardly any. That’s not good. So CockroachDB automatically moves pieces around to balance things.</p>
<ul>
<li><p>The system watches how busy each node is (how many ranges it holds, how much data, how much read/write traffic).</p>
</li>
<li><p>If one node is overloaded, it will move some ranges to other nodes.</p>
</li>
<li><p>If a node dies, the system notices and makes sure that ranges that were on that node get copied somewhere else so safety (replica count) is maintained.</p>
</li>
<li><p>If you add a new node, the system starts moving ranges to the new node so its resources are used.</p>
</li>
</ul>
<p>This happens without you having to manually decide “move this here, move that there.”</p>
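<p>As a toy illustration of the goal, the sketch below takes the number of ranges on each node and prints the even spread a rebalancer would aim for. The function name <code>plan_rebalance</code> is hypothetical, and it only counts ranges. The real system also weighs data size and traffic, as described above.</p>

```bash
#!/usr/bin/env bash
# Toy planner: given the current range count on each node, print the
# evenly balanced target for each node. It only counts ranges.
# Usage: plan_rebalance COUNT_NODE1 COUNT_NODE2 ...
plan_rebalance() {
  local total=0 c i=1
  for c in "$@"; do total=$(( total + c )); done
  local nodes=$#
  local base=$(( total / nodes ))    # every node gets at least this many
  local extra=$(( total % nodes ))   # the first few nodes get one more
  for c in "$@"; do
    local target=$base
    if [ "$i" -le "$extra" ]; then target=$(( base + 1 )); fi
    echo "node$i: $c -> $target ranges"
    i=$(( i + 1 ))
  done
}

plan_rebalance 9 3 0   # the third node was just added and is empty
```

<p>For the example input, every node ends up with a target of 4 ranges, so the busy first node hands ranges over to the new, empty one.</p>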
<h3 id="heading-distributed-transactions-doing-work-across-multiple-ranges">Distributed Transactions: Doing Work Across Multiple Ranges</h3>
<p>Often, an operation touches multiple ranges. For example, “transfer money from account A (in range 1) to account B (in range 2)”. That must be handled carefully so that either both parts succeed, or neither does.</p>
<p>CockroachDB supports <strong>distributed transactions</strong>, meaning a single transaction can work across many ranges. It uses “intent” writes (temporary placeholders) and once everything is ready, it commits the transaction so it becomes permanent. If something fails, it aborts (cancels) the whole thing. The system ensures atomic behavior: all or nothing.</p>
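<p>The all-or-nothing idea can be sketched in a few lines of shell. This is a single-process toy with two variables standing in for accounts. The <code>transfer</code> function and the starting balances are invented for illustration, and the real database coordinates its provisional writes across ranges on different machines.</p>

```bash
#!/usr/bin/env bash
# Toy all-or-nothing transfer between two accounts held in shell variables.
balance_a=100
balance_b=50

transfer() {
  local amount=$1
  # Stage both changes first (loosely like provisional "intent" writes).
  local new_a=$(( balance_a - amount ))
  local new_b=$(( balance_b + amount ))
  if [ "$new_a" -lt 0 ]; then
    echo "aborted"      # nothing was applied to either account
    return 1
  fi
  # Commit: apply both staged values together.
  balance_a=$new_a
  balance_b=$new_b
  echo "committed"
}

transfer 30            # enough funds: both balances change
transfer 500 || true   # not enough funds: neither balance changes
echo "A=$balance_a B=$balance_b"
```

<p>The first transfer commits and the second aborts, so the final line shows <code>A=70 B=80</code>: the failed transfer left no half-finished change behind.</p>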
<h3 id="heading-how-it-all-fits-together-read-write-flow-what-happens-when-you-use-it">How It All Fits Together: Read + Write Flow (What Happens When You Use It)</h3>
<p>Let’s picture a write, step by step:</p>
<ol>
<li><p>Your app sends a write (for example, “add new user”) to any node in the CockroachDB cluster.</p>
</li>
<li><p>That node figures out which range(s) are involved (which pieces hold the data you want to write).</p>
</li>
<li><p>For each range, the write goes to that range’s leader.</p>
</li>
<li><p>The leader writes the change to its own copy, then tells followers to do the same.</p>
</li>
<li><p>Once most copies confirm they have the change, the leader declares it “committed” and tells your app, “yes, write done.”</p>
</li>
<li><p>If a node is busy or down, others still handle traffic.</p>
</li>
</ol>
<p>Read flow:</p>
<ul>
<li><p>Your app sends a read (for example “get user by ID”) to any node.</p>
</li>
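<li><p>That node checks whether it can serve the read itself, which in CockroachDB means holding the range’s lease (the “leaseholder” is the replica guaranteed to have the up-to-date copy). If it can, it answers. If not, it forwards the request to the node that can.</p>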
</li>
</ul>
<p>Together, these steps keep data correct, up to date, and reliably available even if machines fail or the network lags.</p>
<h3 id="heading-why-this-all-matters-putting-it-in-plain-english">Why This All Matters (Putting It in Plain English)</h3>
<p>All these mechanisms matter for several key reasons. First, because data is chopped into ranges and replicated, no single node is a bottleneck. Also, Raft ensures consensus, so you can trust that data is consistent across all working replicas.</p>
<p>Beyond this, because rebalancing is automatic, you don’t have to micromanage shards or worry about nodes drowning in load. And because transactions that touch multiple ranges are coordinated, you can trust ACID properties even in a distributed setup.</p>
<h2 id="heading-where-and-how-should-you-host-cockroachdb">Where (and How) Should You Host CockroachDB? ☁️</h2>
<p>There isn’t just one “right” way to host CockroachDB. There are a few paths you can pick, each with pros and cons. What you pick depends on cost, control, ease of use, and your risk tolerance.</p>
<p>In this section, we’ll explore:</p>
<ul>
<li><p>Cockroach Labs’ own managed cloud (CockroachDB Cloud)</p>
</li>
<li><p>“Bring Your Own Cloud” (BYOC) – letting Cockroach Labs manage it inside <em>your</em> cloud account</p>
</li>
<li><p>Hosting via cloud marketplaces (AWS, GCP, Azure)</p>
</li>
<li><p>Self-hosting / Kubernetes / your own infrastructure</p>
</li>
<li><p>And notes on DigitalOcean support</p>
</li>
</ul>
<p>Let’s dive in.</p>
<h3 id="heading-option-1-cockroachdb-cloud-fully-managed-by-cockroach-labs">Option 1: CockroachDB Cloud (fully managed by Cockroach Labs)</h3>
<p>This is the easiest option if you want to offload operations. You don’t manage nodes (computers, virtual machines, and so on), upgrades, or backups, because Cockroach Labs handles all of that.</p>
<p><strong>What it offers:</strong></p>
<ul>
<li><p>You sign up and click “create cluster.”</p>
</li>
<li><p>Automatic scaling, zero-downtime upgrades, and managed backups.</p>
</li>
<li><p>It supports multiple cloud providers behind the scenes (you pick region(s)).</p>
</li>
<li><p>You get tools, APIs, and Terraform integration to automate it.</p>
</li>
<li><p>They often give free credits to get started.</p>
</li>
</ul>
<p><strong>Tradeoffs:</strong></p>
<ul>
<li><p>You have less control over the underlying infrastructure, for example virtual machines, networking, disks, and so on (you trade control for convenience).</p>
</li>
<li><p>You pay a premium for the managed service.</p>
</li>
<li><p>You rely on Cockroach Labs’ SLAs, uptime, and support.</p>
</li>
</ul>
<p>If you want, you can check it out here: <a target="_blank" href="https://www.cockroachlabs.com/product/cloud/">CockroachDB Cloud (managed by Cockroach Labs)</a>.</p>
<h3 id="heading-option-2-bring-your-own-cloud-byoc">Option 2: Bring Your Own Cloud (BYOC)</h3>
<p>This is a middle ground: you keep your cloud environment, but let Cockroach Labs manage the database. It gives you control over infrastructure, billing, network, and so on, while still offloading operational complexity.</p>
<p><strong>How it works:</strong></p>
<ul>
<li><p>You run CockroachDB Cloud inside your cloud account (AWS, GCP, and so on).</p>
</li>
<li><p>Cockroach Labs still handles provisioning, upgrades, backups, and observability. You manage roles, networking, and logs.</p>
</li>
<li><p>Useful for complying with regulations, keeping data within your cloud folder/account, and using your cloud discounts.</p>
</li>
</ul>
<p><strong>Tradeoffs:</strong></p>
<ul>
<li><p>You still need to set up cloud aspects (VPCs, IAM, roles) correctly.</p>
</li>
<li><p>There’s more complexity than pure managed, but more control as well.</p>
</li>
<li><p>Cockroach Labs needs access to certain parts of your account (permissions).</p>
</li>
</ul>
<p>If you want to explore BYOC, you can read more here: <a target="_blank" href="https://www.cockroachlabs.com/product/cloud/bring-your-own-cloud/">CockroachDB Bring Your Own Cloud</a>.</p>
<h3 id="heading-option-3-use-cloud-marketplaces-aws-gcp-azure">Option 3: Use Cloud Marketplaces (AWS, GCP, Azure)</h3>
<p>If you already use a cloud provider, sometimes the easiest way is to deploy via their marketplace offerings. It gives you familiarity, billing simplicity, and so on.</p>
<ul>
<li><p><strong>GCP Marketplace</strong> – CockroachDB is available on the Google Cloud Marketplace, making it easier to deploy within your GCP environment. You can learn more here: <a target="_blank" href="https://console.cloud.google.com/marketplace/product/cockroachdb-public/cockroachdb">GCP Marketplace</a>.</p>
</li>
<li><p><strong>AWS Marketplace</strong> – CockroachDB is listed there: <a target="_blank" href="https://aws.amazon.com/marketplace/pp/prodview-n3xpypxea63du">AWS Marketplace</a>.</p>
</li>
<li><p><strong>Azure Marketplace</strong> – Also supported for Azure deployments (SaaS/managed listings): <a target="_blank" href="https://marketplace.microsoft.com/en-us/product/saas/cockroachlabs1586448087626.cockroachdb-azure?tab=overview">Azure Marketplace</a>.</p>
</li>
<li><p><strong>DigitalOcean</strong> – There is support for CockroachDB deployment on DigitalOcean using their infrastructure: <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/deploy-cockroachdb-on-digital-ocean">Deploy CockroachDB on DigitalOcean</a>.</p>
</li>
</ul>
<p>These options let you stay in your cloud console, use your existing cloud accounts, and integrate with other resources you already have.</p>
<p>But you're still responsible for certain operational tasks (networking, security, monitoring, backups) depending on how the marketplace offering is configured.</p>
<h3 id="heading-option-4-my-favorite-self-hosting-especially-using-kubernetes">Option 4 (My Favorite 😁): Self-Hosting — Especially Using Kubernetes</h3>
<p>If you self-host CockroachDB, you get <strong>full control</strong>. You’re the boss of everything: the machines, storage, networking, backups, upgrades, monitoring – all of it.</p>
<p>What’s even better is that using Kubernetes means your setup isn’t tied to one cloud provider. You can run it on AWS, GCP, Azure, or even on-premises later, with very little change. Kubernetes gives you a “portable infra” layer.</p>
<p>Managed CockroachDB services charge you extra for “maintenance, upgrades, backup, etc.” – those are baked into the price. When you self-host, you take on that burden yourself, but you also avoid paying the extra margin. You pay for compute, disks, network, and your time/ops work.</p>
<p>You can also self-host in the cloud (using cloud VMs) but still manage every layer: disks, network, security, and so on. With Kubernetes, you get a sweet middle ground: cloud reliability for the VMs, and full control over everything above that.</p>
<h4 id="heading-why-kubernetes-beats-tools-like-docker-swarm-or-hashicorp-nomad-for-databases">Why Kubernetes Beats Tools Like Docker Swarm or HashiCorp Nomad for Databases</h4>
<p>Because CockroachDB is a <strong>stateful</strong> system (it holds data), you need strong support for “data that stays even when a pod restarts or moves.” Kubernetes is designed with good primitives for that. Other tools don’t always shine there.</p>
<p>Here’s the comparison in simple terms:</p>
<ul>
<li><p><strong>Docker Swarm / Docker Compose:</strong> Great for stateless apps (web servers, APIs), but when it comes to databases, it struggles. Swarm doesn’t natively support persistent volume claims at a cluster level, so if a container (database replica) moves to a different node (VM), it might lose access to its storage. Devs often pin containers to specific nodes manually to avoid this.</p>
</li>
<li><p><strong>Nomad:</strong> More flexible and simpler in some ways, but it’s not as rich in features around connectivity, storage management, and built-in tooling for containers. It works well in mixed workloads, but handling complex databases usually means you need to build extra layers.</p>
</li>
<li><p><strong>Kubernetes:</strong> It has built-in support for stateful workloads:</p>
<ul>
<li><p><strong>StatefulSets (Properly managing data for each database):</strong> This ensures that each CockroachDB replica (pod) keeps its identity and storage intact even if the pod restarts. So the database replica doesn’t lose its “name” or data when things change.</p>
</li>
<li><p><strong>Persistent volumes and persistent volume claims (external disks):</strong> These are like dedicated hard drives or disks attached to pods (database replicas). Even if a pod moves, crashes, or restarts, the disk (data) stays. Kubernetes makes sure the data stays safe.</p>
</li>
<li><p><strong>StorageClasses (choose your disk):</strong> You can choose the type of disk your data is stored on, for example:</p>
<ul>
<li><p>HDD (most affordable, but slower),</p>
</li>
<li><p>Balanced Disk (SSD-backed, a balance between cost and speed),</p>
</li>
<li><p>Fast SSD (very fast, recommended by the CockroachDB team, but a bit more expensive than a Balanced Disk).</p>
</li>
</ul>
</li>
<li><p><strong>Rolling updates and anti-affinity (no downtime, high availability, fault tolerance):</strong></p>
<ul>
<li><p>Anti-affinity means you can tell Kubernetes, “don’t put more than one CockroachDB replica on the same VM or physical machine.” That way, if one VM goes bad, the other replicas are safe.</p>
</li>
<li><p>Rolling updates let you update one replica at a time (configuration, version, resources) without bringing down the whole cluster. While one replica updates, others serve traffic. That helps avoid downtime.</p>
</li>
<li><p>Kubernetes also has ordered start/stop for replicas (via StatefulSets), so things are predictable and safe.</p>
</li>
</ul>
</li>
<li><p><strong>Vertical vs. horizontal scaling (a quick reminder):</strong><br>  As we discussed in earlier sections:</p>
<ul>
<li><p><strong>Horizontal scaling</strong> means adding more replicas (more pods, more nodes) so load spreads out.</p>
</li>
<li><p><strong>Vertical scaling</strong> means increasing the resources (CPU, RAM, disk) of existing nodes/replicas.</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>In tools like Nomad or Docker Swarm, vertical scaling tends to be harder. It often involves stopping services, shutting things down, and restarting VMs, which causes downtime.</p>
<p>Kubernetes makes vertical and horizontal scaling easier at the pod level (you can resize a single pod’s CPU and RAM) and manages rolling upgrades so you don’t take everything down at once.</p>
<p>You can also add more database replicas to the cluster easily (to balance load and make the database process queries faster), and the data is automatically copied to the new database replica (replication), especially when you use the official CockroachDB Helm Chart.</p>
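<p>Once you have a cluster running (we set one up locally later in this guide), you can look at these building blocks yourself. The loop below is a small, read-only sketch that assumes <code>kubectl</code> is installed and pointed at your cluster. The <code>checked</code> counter is just bookkeeping for the example.</p>

```bash
#!/usr/bin/env bash
# Read-only inspection of the Kubernetes objects discussed above.
# Assumes kubectl is installed and configured for a reachable cluster.
checked=0
for kind in statefulsets persistentvolumeclaims storageclasses; do
  echo "== $kind =="
  kubectl get "$kind" --all-namespaces || echo "(could not list $kind; is a cluster running?)"
  checked=$(( checked + 1 ))
done
echo "Checked $checked resource kinds"
```

<p>On a fresh cluster the first two lists will most likely be empty. After CockroachDB is deployed, you should see its StatefulSet and one persistent volume claim per database pod.</p>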
<h4 id="heading-why-other-tools-swarm-nomad-docker-compose-dont-match-up-here">Why Other Tools (Swarm / Nomad / Docker Compose) Don’t Match Up Here</h4>
<p>Docker Swarm and Docker Compose are simpler to use and are good when you don’t have much complexity. But they lack robust features for stable storage, default support for replication, vertical scaling, horizontal scaling of stateful services, and so on. For example, Swarm doesn’t have built-in StatefulSets or dynamic volume provisioning like Kubernetes.</p>
<p>Nomad is more flexible than Swarm in some ways, but many users say its storage plugins (CSI) are weaker than what Kubernetes has. It also has less built-in support for ordered startup and rolling updates of stateful apps.</p>
<p>So while these work fine for simpler apps (stateless services, small apps), when you have a distributed stateful SQL database like CockroachDB, Kubernetes gives you more safety, more control, and less chance of data loss or misconfiguration.</p>
<p>Because of all this, running CockroachDB on Kubernetes gives you the tools you need baked in, reducing how much custom plumbing you must write yourself.</p>
<h4 id="heading-trade-offhttpswwwredditcomrhashicorpcomments1ivtuo5utmsourcechatgptcoms-things-to-watch-out-for">Trade-offs (Things to Watch Out For)</h4>
<ul>
<li><p>You have to manage everything: backups, monitoring the ENTIRE CockroachDB cluster, withstanding failures (fault tolerance), and upgrades. That’s work 🥲.</p>
</li>
<li><p>You need to know your way around infra (VMs, disks, networking, and inter-node connections) and operations (or have teammates who do – DevOps Engineers, Cloud Architects, Site Reliability Engineers).</p>
</li>
<li><p>Using managed Kubernetes (like GKE, EKS, AKS) helps as you offload the control plane. You still manage the nodes, storage, and higher layers.</p>
</li>
<li><p>But even with that, you avoid paying for “database management as a service” markup – you're only paying for infrastructure plus your time.</p>
</li>
</ul>
<h2 id="heading-setting-up-your-local-environment">Setting Up Your Local Environment 🧑‍💻</h2>
<p>Alright, we’ve learned quite a bit so far: what CockroachDB is, how it works behind the scenes, and where you can host it. Now, it’s time to roll up our sleeves and get our hands dirty with some practical setup.</p>
<p>Before we deploy CockroachDB, we need a safe “playground” where we can test and experiment without touching the cloud or spending a dime.</p>
<h3 id="heading-why-these-tools">Why These Tools?</h3>
<p>Before we jump into running commands, here’s a quick rundown of the tools we’ll use and why:</p>
<ul>
<li><p><strong>Minikube</strong>: A tool that runs a small Kubernetes cluster on your computer. It gives you a local “mini cloud” where you can deploy and experiment.</p>
</li>
<li><p><strong>kubectl</strong>: The command-line tool you’ll use to talk to your Kubernetes cluster to deploy apps, check status, and manage resources.</p>
</li>
<li><p><strong>Helm</strong>: A package manager for Kubernetes. It helps you install complex applications (like CockroachDB) with fewer manual steps.</p>
</li>
</ul>
<h3 id="heading-step-1-install-minikube">Step 1: Install Minikube</h3>
<p><strong>What is Minikube?</strong><br>Minikube is a lightweight tool that helps you run a small Kubernetes cluster on your personal computer.</p>
<p>Think of it as your own mini-cloud environment where you can test, deploy, and learn Kubernetes (and in our case, CockroachDB) locally. It’s perfect for learning and experimenting before deploying on the cloud.</p>
<p>Here’s how to get it on different operating systems:</p>
<h4 id="heading-windows">🪟 Windows</h4>
<ol>
<li><p>Make sure you have a hypervisor (VirtualBox, Hyper-V) or Docker installed.</p>
</li>
<li><p>Open PowerShell as Administrator.</p>
</li>
<li><p>Run:</p>
<pre><code class="lang-bash"> choco install minikube
</code></pre>
<p> or use:</p>
<pre><code class="lang-bash"> winget install Kubernetes.minikube
</code></pre>
</li>
<li><p>After installation, check the version:</p>
<pre><code class="lang-bash"> minikube version
</code></pre>
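<p> If it returns a version number, you’re good 👍🏾 Then run <code>minikube start</code> to start your local cluster (the macOS and Linux steps below do the same).</p>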
</li>
</ol>
<p>If you don’t have the <code>choco</code> or <code>winget</code> package manager, you can install Minikube via PowerShell by following the steps in the <a target="_blank" href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fwindows%2Fx86-64%2Fstable%2F.exe+download">docs</a>.</p>
<h4 id="heading-macos">🍎 macOS</h4>
<ol>
<li><p>Ensure you have Homebrew installed.</p>
</li>
<li><p>In Terminal, run:</p>
<pre><code class="lang-bash"> brew install minikube
</code></pre>
</li>
<li><p>Start the cluster:</p>
<pre><code class="lang-bash"> minikube start
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> minikube status
</code></pre>
</li>
</ol>
<h4 id="heading-linux">🐧 Linux</h4>
<ol>
<li><p>Ensure you’re on a supported distribution (Ubuntu, Fedora, and so on) and virtualization (Docker, KVM, and so on) is enabled.</p>
</li>
<li><p>Run:</p>
<pre><code class="lang-bash"> curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
 sudo install minikube-linux-amd64 /usr/<span class="hljs-built_in">local</span>/bin/minikube
 rm minikube-linux-amd64
</code></pre>
</li>
<li><p>Start the cluster:</p>
<pre><code class="lang-bash"> minikube start
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> minikube status
</code></pre>
</li>
</ol>
<p>✅ At this point you should have a local Kubernetes cluster up and running on your machine! Next, we’ll install kubectl so you can talk to the cluster from your command line.</p>
<h3 id="heading-step-2-install-kubectl">Step 2: Install kubectl</h3>
<p><strong>What kubectl does:</strong><br>kubectl is the command-line tool that lets you talk to your Kubernetes cluster. Using it, you can deploy applications, check your cluster’s health, and manage resources inside your cluster.</p>
<p>You’ll use it a lot when working with Kubernetes on Minikube and later when you deploy CockroachDB.</p>
<p>Here’s how to install it on Windows, macOS, and Linux:</p>
<h4 id="heading-windows-1">🪟 Windows</h4>
<ol>
<li><p>Open PowerShell as Administrator.</p>
</li>
<li><p>Run:</p>
<pre><code class="lang-bash"> choco install kubernetes-cli
</code></pre>
<p> or, if you prefer winget:</p>
<pre><code class="lang-bash"> winget install -e --id Kubernetes.kubectl
</code></pre>
</li>
<li><p>Then check the version:</p>
<pre><code class="lang-bash"> kubectl version --client
</code></pre>
<p> If it prints a version number, you’re good.</p>
</li>
</ol>
<h4 id="heading-macos-1">🍎 macOS</h4>
<ol>
<li><p>Open Terminal.</p>
</li>
<li><p>If you have Homebrew installed, run:</p>
<pre><code class="lang-bash"> brew install kubectl
</code></pre>
</li>
<li><p>Check the version:</p>
<pre><code class="lang-bash"> kubectl version --client
</code></pre>
<p> That should show something like “Client Version: v1.x.x”.</p>
</li>
</ol>
<h4 id="heading-linux-1">🐧 Linux</h4>
<ol>
<li><p>Open your terminal.</p>
</li>
<li><p>Download the latest kubectl binary:</p>
<pre><code class="lang-bash"> curl -LO <span class="hljs-string">"https://dl.k8s.io/release/<span class="hljs-subst">$(curl -L -s https://dl.k8s.io/release/stable.txt)</span>/bin/linux/amd64/kubectl"</span>
</code></pre>
</li>
<li><p>Make it executable and move it into your PATH:</p>
<pre><code class="lang-bash"> chmod +x ./kubectl
 sudo mv ./kubectl /usr/<span class="hljs-built_in">local</span>/bin/kubectl
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> kubectl version --client
</code></pre>
</li>
</ol>
<p>After this, you’ll have kubectl installed and ready to use with your local Minikube cluster. Next up we’ll install Helm, which will make deploying CockroachDB much easier.</p>
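<p>Here’s a quick, optional sanity check, assuming you have already run <code>minikube start</code>. It confirms that kubectl is pointed at the local cluster before listing its nodes. The context name <code>minikube</code> is what a default Minikube profile creates, so adjust <code>EXPECTED_CONTEXT</code> if you named your profile differently.</p>

```bash
#!/usr/bin/env bash
# Confirm kubectl is talking to the local Minikube cluster.
EXPECTED_CONTEXT="minikube"   # default context name created by `minikube start`

# Fall back to "none" if kubectl is missing or no context is set.
current=$(kubectl config current-context 2>/dev/null || echo "none")

if [ "$current" = "$EXPECTED_CONTEXT" ]; then
  kubectl cluster-info
  kubectl get nodes
else
  echo "kubectl is pointing at '$current', not '$EXPECTED_CONTEXT'"
fi
```

<p>If the context is wrong, <code>kubectl config use-context minikube</code> switches it back. A healthy local setup should list a single node in the <code>Ready</code> state.</p>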
<h3 id="heading-step-3-install-helm">Step 3: Install Helm</h3>
<p>Helm is basically the package manager for Kubernetes. Think of it like how you use <code>apt</code>, <code>yum</code>, or <code>brew</code> to install software on your computer. Helm does something similar for Kubernetes apps.</p>
<p>With Kubernetes, deploying a full app often means writing lots of configs (manifests – Deployments, Services, PersistentVolumes, ConfigMaps, and so on). Helm lets us bundle all of that into a single “package” (called a chart) so we don’t have to manually create the resources one-after-the-other (which could be hectic to manage btw 😖).</p>
<p>Because our goal is to deploy a pretty complex system (CockroachDB) on Kubernetes – which includes stateful nodes, persistent storage, networking, SSL/TLS, and so on – using a Helm chart makes it <em>so much easier</em> than crafting dozens of YAML files from scratch.</p>
<p>So before we install CockroachDB, we’ll install Helm. This gives us the toolkit to deploy and manage our cluster much more easily.</p>
<p>Let’s install Helm on each platform. After this, you’ll have the <code>helm</code> command ready to deploy apps into your Kubernetes cluster.</p>
<h4 id="heading-windows-2">🪟 Windows</h4>
<ol>
<li><p>Open PowerShell as Administrator.</p>
</li>
<li><p>If you have Chocolatey installed, run:</p>
<pre><code class="lang-bash"> choco install kubernetes-helm
</code></pre>
<p> Alternatively, with winget:</p>
<pre><code class="lang-bash"> winget install Helm.Helm
</code></pre>
</li>
<li><p>Confirm installation:</p>
<pre><code class="lang-bash"> helm version
</code></pre>
<p> You should see something like <code>version.BuildInfo{Version:"v3.x.x",…}</code>.</p>
</li>
</ol>
<h4 id="heading-macos-2">🍎 macOS</h4>
<ol>
<li><p>Open Terminal.</p>
</li>
<li><p>With Homebrew installed, run:</p>
<pre><code class="lang-bash"> brew install helm
</code></pre>
</li>
<li><p>Verify:</p>
<pre><code class="lang-bash"> helm version
</code></pre>
<p> If you see version info, you’re good.</p>
</li>
</ol>
<h4 id="heading-linux-2">🐧 Linux</h4>
<ol>
<li><p>Open your terminal.</p>
</li>
<li><p>Download and install the binary (example for the latest version):</p>
<pre><code class="lang-bash"> curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
 chmod 700 get_helm.sh
 ./get_helm.sh
</code></pre>
<p> Or you can directly download the binary and move it into your <code>PATH</code>.</p>
</li>
<li><p>Check version:</p>
<pre><code class="lang-bash"> helm version
</code></pre>
</li>
</ol>
<p>✅ After this, you have <code>helm</code> installed and you’re ready to use it.</p>
<p>In the next part, we’ll use Helm to install CockroachDB into your local Minikube cluster. We’ll add the CockroachDB chart, configure it, and spin up a multi-node replica setup right on your PC.</p>
<h2 id="heading-deploying-cockroachdb-on-minikube-the-fun-part-begins">Deploying CockroachDB on Minikube (The Fun Part Begins 😁!)</h2>
<p>Before we go to the cloud, we’ll deploy CockroachDB locally on Minikube using Helm.</p>
<p>This process will help us:</p>
<ul>
<li><p>Understand how CockroachDB runs in a cluster</p>
</li>
<li><p>Learn how Kubernetes manages database replicas</p>
</li>
<li><p>Gain hands-on experience before deploying to the cloud</p>
</li>
</ul>
<h3 id="heading-step-1-visit-artifacthub">Step 1: Visit ArtifactHub</h3>
<p><strong>ArtifactHub</strong> is like an App Store for Kubernetes Helm Charts – a huge collection of open-source Helm charts and packages you can easily install.</p>
<ol>
<li><p>Go to <a target="_blank" href="https://artifacthub.io">https://artifacthub.io</a></p>
</li>
<li><p>In the search bar, type <strong>CockroachDB</strong></p>
</li>
<li><p>Click the <strong>CockroachDB Helm chart</strong> result (you’ll see it published by <em>Cockroach Labs</em>).</p>
</li>
</ol>
<p>You’ll see something like this 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760848079912/1778bbcf-088a-4919-80bb-ca24241ffa85.png" alt="The official CockroachDB Helm chart" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-2-explore-the-helm-chart">Step 2: Explore the Helm Chart</h3>
<p>You’ll notice a lot of information on the page:</p>
<ul>
<li><p><strong>README</strong> – the documentation for installing and customizing CockroachDB</p>
</li>
<li><p><strong>Default Values</strong> – all the settings that define how the database runs</p>
</li>
</ul>
<p>Don’t worry if it looks overwhelming. We’ll walk through it together 😉</p>
<h3 id="heading-step-3-copy-the-default-values">Step 3: Copy the Default Values</h3>
<p>Every Helm chart has a <em>default configuration</em> file (<code>values.yaml</code>). These defaults are usually too advanced or too heavy for local setups, so we’ll create our own lighter version. But first, let’s copy the original for reference.</p>
<ol>
<li><p>On the CockroachDB chart page, click the <strong>Default Values</strong> button.</p>
</li>
<li><p>A modal window will pop up showing a long YAML file.</p>
</li>
<li><p>Click the <strong>Copy</strong> icon in the top-right corner to copy all the default values.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760848210119/17cd734b-6d7c-40dc-a8c3-f01c85edd7a7.png" alt="The Default Values button description" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760848520060/1e1ce249-0cf0-46cb-abbc-00efb3ea1343.png" alt="Copy the default values" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-4-create-a-folder-for-our-project">Step 4: Create a Folder for Our Project</h3>
<p>We’ll keep everything organized in a single folder.</p>
<pre><code class="lang-bash">mkdir cockroachdb-tutorial
<span class="hljs-built_in">cd</span> cockroachdb-tutorial
</code></pre>
<p>Inside this folder, create a new file called:</p>
<pre><code class="lang-bash">nano cockroachdb-original-values.yml
</code></pre>
<p>Now paste all the default values you copied earlier (use Ctrl+V or right-click → Paste), then save and exit (<code>Ctrl+O</code>, then <code>Ctrl+X</code> in nano).</p>
<p>If you’re on Windows, just open Notepad/VSCode, paste the content, and save the file in the same folder.</p>
<h3 id="heading-step-5-understanding-the-key-configurations">Step 5: Understanding the Key Configurations</h3>
<p>Let’s break down a few important values you’ll notice in the file.</p>
<h4 id="heading-statefulsetreplicas">🧩 <code>statefulset.replicas</code></h4>
<p>This tells CockroachDB how many database nodes (replicas) to run in the cluster. By default, it’s set to 3, meaning you’ll have 3 independent database instances that can all read and write data.</p>
<h4 id="heading-statefulsetresourcesrequests-and-statefulsetresourceslimits">⚙️ <code>statefulset.resources.requests</code> and <code>statefulset.resources.limits</code></h4>
<p>These settings tell Kubernetes how much CPU and memory to give CockroachDB.</p>
<ul>
<li><p><code>requests</code>: the minimum guaranteed amount</p>
</li>
<li><p><code>limits</code>: the maximum allowed amount</p>
</li>
</ul>
<p>CockroachDB can be a bit greedy with memory 😅, so limits make sure it doesn’t take everything and leave no room for other apps.</p>
<h4 id="heading-storagepersistentvolumesize">💾 <code>storage.persistentVolume.size</code></h4>
<p>This defines how much disk space each CockroachDB node gets. For example, if you set it to <code>10Gi</code> and you have 3 replicas, total usage = <code>30Gi</code>.</p>
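<p>The arithmetic above can be sketched in a few lines of Python. This is only an illustration; <code>total_storage_gi</code> is a hypothetical helper name, not part of the chart or of Helm.</p>

```python
def total_storage_gi(replicas: int, size_gi_per_node: int) -> int:
    """Total persistent storage requested across all CockroachDB nodes, in Gi."""
    return replicas * size_gi_per_node


# 3 replicas with a 10Gi volume each
print(total_storage_gi(3, 10))  # 30
```

<p>Each replica gets its own volume, so the total grows linearly with the replica count.</p>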
<h4 id="heading-storagepersistentvolumestorageclass">💽 <code>storage.persistentVolume.storageClass</code></h4>
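<p>This defines the type of disk to use. The available names depend on where your cluster runs. On Google Kubernetes Engine, for example, you might see:</p>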
<ul>
<li><p><code>standard</code>: HDD (cheap but slow)</p>
</li>
<li><p><code>standard-rwo</code>: SSD (faster and affordable)</p>
</li>
<li><p><code>pd-ssd</code> or <code>fast-ssd</code>: high-performance SSD (fastest, but pricier)</p>
</li>
</ul>
<p>You can check available storage classes in your Minikube cluster using:</p>
<pre><code class="lang-bash">kubectl get sc
</code></pre>
<p>On Minikube, the default storage class is usually <code>standard</code>.</p>
<p>You can learn more about <a target="_blank" href="https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview">Google Cloud storage classes here</a>.</p>
<h4 id="heading-tlsenabled">🔐 <code>tls.enabled</code></h4>
<p>This controls whether CockroachDB requires <strong>TLS certificates</strong> for secure connections.</p>
<p>If <code>true</code>, you’ll need to generate certificates for any app or client that connects to your cluster (instead of using a username and password). This is <strong>strongly recommended for production</strong>, but for our local Minikube setup, we’ll disable it so it’s easier to play around and test connections.</p>
<h3 id="heading-step-6-create-a-simplified-values-config-for-the-cockroachdb-helm-chart">Step 6: Create a Simplified Values Config for the CockroachDB Helm Chart</h3>
<p>We’ll now create a new config file with lighter resource settings for our local test environment.</p>
<p>In the same folder, create:</p>
<pre><code class="lang-bash">nano cockroachdb-values.yml
</code></pre>
<p>Then paste this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">podSecurityContext:</span>
    <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span> <span class="hljs-comment"># You should have 3GB+ of RAM free on your device; else, you can reduce this to 500Mi (this will result in your PC needing just 1.5 GB of RAM free)</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>  <span class="hljs-comment"># The same with this, you can reduce it to 500m CPU if you don't have up to 3 CPU cores (1 CPU core * 3 replicas)</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">podAntiAffinity:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>
  <span class="hljs-attr">nodeSelector:</span>
    <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>

<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span> <span class="hljs-comment"># Make sure you have 15GB+ of free storage on your local machine, if not, you can reduce it to 2 - 3 Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">standard</span>

<span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

<span class="hljs-attr">init:</span>
  <span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">wait:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
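<p>Setting the <code>requests</code> and <code>limits</code> to the same values gives each pod the <em>Guaranteed</em> Quality of Service class, so Kubernetes evicts these pods last when the node runs low on resources. A container that goes over its own memory limit can still be OOM-killed, and CPU usage above the limit is throttled.</p>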
<p>You can <a target="_blank" href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/">read more about this here</a>.</p>
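<p>To make the effect of matching <code>requests</code> and <code>limits</code> concrete, here is a minimal Python sketch of how Kubernetes assigns a Quality of Service class. It is simplified to a single container and compares the values literally (so <code>"1Gi"</code> and <code>"1024Mi"</code> would count as different); <code>qos_class</code> is a hypothetical helper, not a Kubernetes API.</p>

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Classify a single-container pod's QoS class from its resources.

    Simplified sketch: real Kubernetes checks every container in the pod
    and normalizes quantities before comparing them.
    """
    if not requests and not limits:
        return "BestEffort"
    for resource in ("cpu", "memory"):
        limit = limits.get(resource)
        # When a limit is set but the request is omitted,
        # Kubernetes defaults the request to the limit.
        request = requests.get(resource, limit)
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


# The values from cockroachdb-values.yml: requests equal limits
print(qos_class({"memory": "1Gi", "cpu": 1}, {"memory": "1Gi", "cpu": 1}))
# A lower memory request than the limit
print(qos_class({"memory": "500Mi", "cpu": 1}, {"memory": "1Gi", "cpu": 1}))
```

<p>With the values used in this tutorial, the first call returns <code>Guaranteed</code>. If you lower only the requests, the pods become <code>Burstable</code>.</p>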
<h3 id="heading-overview-of-the-yaml-values">Overview of the YAML values</h3>
<p>Now, let’s walk through the contents of the <code>cockroachdb-values.yml</code> file together.</p>
<p><code>podSecurityContext</code> – why you need it on Minikube:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">podSecurityContext:</span>
  <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
</code></pre>
<p>This block sets the Linux user and group IDs that the CockroachDB process runs as inside the container, and the group ownership for mounted files.</p>
<p>Why this matters, simply:</p>
<ul>
<li><p>The CockroachDB process runs as <strong>UID 1000</strong> inside the container. If the disk mount (the persistent volume) is owned by a different UID, Cockroach can’t create files there and fails with <code>permission denied</code>.</p>
</li>
<li><p><code>runAsUser</code> and <code>runAsGroup</code> make the container process run as UID/GID 1000.</p>
</li>
<li><p><code>fsGroup</code> makes the mounted volume be accessible to that group, so the process can write to <code>/cockroach/cockroach-data</code>.</p>
</li>
</ul>
<p>In short, these lines make sure the DB process has permission to create and write files on the mounted disk (volume), which is especially important on Minikube and other local setups where host-mounted storage can have odd permissions.</p>
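<p>The permission problem can be illustrated with a small Python sketch of the classic Unix write check. The ownership and mode values below are hypothetical examples, and <code>can_write</code> is an illustrative helper. How <code>fsGroup</code> is actually applied depends on the volume type.</p>

```python
def can_write(uid: int, gids: set, owner_uid: int, owner_gid: int, mode: int) -> bool:
    """Classic Unix write check for a directory (ignores ACLs and capabilities)."""
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        return bool(mode & 0o200)  # owner write bit
    if owner_gid in gids:
        return bool(mode & 0o020)  # group write bit
    return bool(mode & 0o002)  # "other" write bit


# Hypothetical volume owned by root:root with mode 0755:
# a process running as UID 1000 cannot write to it.
print(can_write(1000, {1000}, owner_uid=0, owner_gid=0, mode=0o755))  # False

# Same volume once its group is 1000 and group-writable:
# the process can now write through its group membership.
print(can_write(1000, {1000}, owner_uid=0, owner_gid=1000, mode=0o775))  # True
```

<p>The second case is roughly what <code>fsGroup: 1000</code> aims for: the process does not own the volume, but it belongs to a group that is allowed to write to it.</p>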
<p><code>podAntiAffinity</code> and <code>nodeSelector</code> – what they do:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">podAntiAffinity:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>

<span class="hljs-attr">nodeSelector:</span>
  <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
</code></pre>
<p>By default, the chart applies a <code>podAntiAffinity</code> rule. Normally, this tells Kubernetes to <em>spread</em> pods across different nodes (VMs), so replicas don’t run on the same physical machine. This is good for high availability, because one node failing won’t kill multiple replicas.</p>
<p>Setting <code>type: ""</code> (empty) <strong>disables</strong> that spreading rule, so Kubernetes can place multiple CockroachDB replicas on the same node.</p>
<p><code>nodeSelector</code> tells Kubernetes to schedule pods only on nodes that match the label you set (here <code>kubernetes.io/hostname: minikube</code>). That forces all pods to run on the node named <code>minikube</code>.</p>
<p>Quick summary of the effect:</p>
<ul>
<li><p>Good for local testing on a multi-node Minikube cluster, when only one node has properly mounted writable storage.</p>
</li>
<li><p><strong>Not recommended for production</strong>, because it places all replicas on the same machine (single point of failure).</p>
</li>
</ul>
<p>PS: If you’re using another local Kubernetes distribution (for example K3s or Kind), the pods might never get scheduled, because the <code>nodeSelector</code> property targets a node named <code>minikube</code>. In that case, I’d advise removing the <code>nodeSelector</code> property entirely:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">nodeSelector:</span>
    <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
<span class="hljs-string">...</span>
</code></pre>
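<p>The matching rule behind <code>nodeSelector</code> is simple: a node is eligible only if it carries every label listed in the selector. Here is a minimal Python sketch of that rule. The node labels are hypothetical examples and <code>matches_node_selector</code> is an illustrative helper, not a Kubernetes API.</p>

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """A node is eligible only if it carries every label in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


selector = {"kubernetes.io/hostname": "minikube"}

# Hypothetical label sets for two different local clusters
minikube_node = {"kubernetes.io/hostname": "minikube", "kubernetes.io/os": "linux"}
kind_node = {"kubernetes.io/hostname": "kind-control-plane", "kubernetes.io/os": "linux"}

print(matches_node_selector(minikube_node, selector))  # True
print(matches_node_selector(kind_node, selector))      # False
print(matches_node_selector(kind_node, {}))            # True
```

<p>The last call shows why removing the property helps on other clusters: an empty selector matches any node.</p>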
<p>✅ <strong>At this point</strong>, we’ve:</p>
<ul>
<li><p>Copied the default CockroachDB Helm chart configuration</p>
</li>
<li><p>Created a lightweight version for Minikube</p>
</li>
<li><p>Learned what each key property means</p>
</li>
</ul>
<h3 id="heading-step-7-install-the-cockroachdb-cluster-using-helm">🚀 Step 7: Install the CockroachDB Cluster Using Helm</h3>
<p>Great job so far! You’ve created your <code>cockroachdb-values.yml</code> file and set up your custom configuration for Minikube. Now we’ll actually deploy the cluster.</p>
<p><strong>What we’re going to do:</strong><br>We’ll use Helm to install the official CockroachDB Helm chart using our custom values. This will spin up your 3-node cluster locally so you can play with it.</p>
<p><strong>Command to run:</strong></p>
<pre><code class="lang-bash">helm install crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>Here:</p>
<ul>
<li><p><code>crdb</code> is the name we’re giving this release (you can pick something else if you like).</p>
</li>
<li><p><code>cockroachdb/cockroachdb</code> tells Helm which chart to use.</p>
</li>
<li><p><code>-f cockroachdb-values.yml</code> tells Helm to apply our custom values on top of the chart’s defaults. Any setting we didn’t specify keeps its default value.</p>
</li>
</ul>
<h4 id="heading-after-the-command-runs">After the command runs:</h4>
<p>After a little while the command completes, and you’ll see output telling you what resources were created (pods, services, persistent volume claims, and so on).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761386160496/babc3e67-1ea9-4aa1-b6a7-516fe3a9972a.png" alt="The CockroachDB Helm Chart post-installation message" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now to check if everything is working, do this:</p>
<pre><code class="lang-bash">kubectl get pods | grep -i crdb
</code></pre>
<p>This filters pods with “crdb” in the name (our release prefix).</p>
<p>You should see something like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761386195190/21469ce5-c909-4336-ba5f-a4c4a776a470.png" alt="The CockroachDB replicas running successfully" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>The three primary pods (<code>0</code>, <code>1</code>, <code>2</code>) should be in <code>Running</code> state. The <code>init</code> job or pod (<code>crdb-cockroachdb-init-xxx</code>) should show <code>Completed</code>. This means the initialization tasks (cluster bootstrap) succeeded.</p>
<p>If you see that, congratulations! You’ve got your local CockroachDB cluster up and running! 🎉</p>
<h2 id="heading-accessing-the-cockroachdb-console-amp-viewing-metrics">Accessing the CockroachDB Console &amp; Viewing Metrics</h2>
<p>Alright! Now that our CockroachDB cluster is up and running, let’s take a peek behind the scenes and explore the CockroachDB Admin Console. It’s a beautiful web dashboard that helps us visualize everything happening in our database cluster.</p>
<p>In this section, we’ll learn how to:</p>
<ul>
<li><p>Access the CockroachDB admin console right from your browser 🖥️</p>
</li>
<li><p>Understand what each built-in dashboard shows (CPU, memory, disk, SQL performance)</p>
</li>
<li><p>Confirm that our cluster is healthy and that all 3 nodes are working together perfectly</p>
</li>
</ul>
<h3 id="heading-step-1-locate-the-cockroachdb-public-service">Step 1: Locate the CockroachDB Public Service</h3>
<p>CockroachDB automatically creates a <strong>public service</strong> that allows us to connect to the database and also access its dashboard.</p>
<p>Let’s check it out by running:</p>
<pre><code class="lang-bash">kubectl get svc | grep -i crdb
</code></pre>
<p>You should see a line similar to:</p>
<pre><code class="lang-bash">crdb-cockroachdb-public   ClusterIP   10.x.x.x   &lt;none&gt;   26257/TCP,8080/TCP   ...
</code></pre>
<p>This service (<code>crdb-cockroachdb-public</code>) is what we’ll use to connect to both:</p>
<ul>
<li><p>The <strong>database</strong> itself (via port 26257)</p>
</li>
<li><p>The <strong>dashboard UI</strong> (via port 8080)</p>
</li>
</ul>
<h3 id="heading-step-2-learn-more-about-the-service">Step 2: Learn More About the Service</h3>
<p>Let’s dig a little deeper to understand it:</p>
<pre><code class="lang-bash">kubectl describe svc crdb-cockroachdb-public
</code></pre>
<p>Here’s what you’ll notice:</p>
<ul>
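<li><p><strong>Port 26257</strong> is CockroachDB’s main port. It serves <strong>SQL client connections</strong> (over the PostgreSQL wire protocol) as well as gRPC traffic between nodes. This is how applications connect to send and receive SQL queries.</p>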
</li>
<li><p><strong>Port 8080</strong> is used for the <strong>web dashboard</strong>, where we can view metrics and monitor performance.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761387757614/dab8cfd0-2d89-45b0-a54f-41e530f1a6ab.png" alt="Description of the crdb-cockroachdb-public service" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
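<p>Because SQL clients reach CockroachDB over the PostgreSQL wire protocol on port 26257, a connection URL for this setup can be sketched as follows. This is a minimal sketch: the <code>root</code> user, the <code>defaultdb</code> database, and <code>sslmode=disable</code> assume the insecure (<code>tls.enabled: false</code>) configuration used in this tutorial, and <code>cockroach_dsn</code> is a hypothetical helper name.</p>

```python
def cockroach_dsn(
    host: str,
    port: int = 26257,
    user: str = "root",
    database: str = "defaultdb",
    sslmode: str = "disable",
) -> str:
    """Build a PostgreSQL-style connection URL for an insecure local cluster."""
    return f"postgresql://{user}@{host}:{port}/{database}?sslmode={sslmode}"


# From inside the cluster (same namespace), the service name works as the host
print(cockroach_dsn("crdb-cockroachdb-public"))
```

<p>From your own machine, the host would be <code>localhost</code> once the SQL port is forwarded the same way as the dashboard port.</p>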
<h3 id="heading-step-3-access-the-cockroachdb-dashboard">Step 3: Access the CockroachDB Dashboard</h3>
<p>Now, let’s make the dashboard available on your local computer. Run this command:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 8080:8080
</code></pre>
<p>This command simply tells Kubernetes:</p>
<blockquote>
<p>“Hey, please open a tunnel from my local computer’s port 8080 to the CockroachDB service’s port 8080 in the cluster.”</p>
</blockquote>
<p>Once you see something like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761387838362/186ff222-c643-4e67-b0a4-dbaff8777977.png" alt="Result of port-forwarding the crdb-cockroachdb-public service on port 8080" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>...you’re good to go!</p>
<h3 id="heading-step-4-visit-the-dashboard">Step 4: Visit the Dashboard</h3>
<p>Now, open your browser and go to http://localhost:8080.</p>
<p>You’ll see the CockroachDB Admin Console. This is your central command center for monitoring your cluster.</p>
<p>Here, you’ll be able to view:</p>
<ul>
<li><p><strong>Number of replicas (nodes)</strong>: You should see 3 in our setup.</p>
</li>
<li><p><strong>RAM usage</strong> per node: Helps track how much memory each CockroachDB instance is using.</p>
</li>
<li><p><strong>CPU usage</strong>: Useful to know when your database is getting busy.</p>
</li>
<li><p><strong>Disk space</strong>: Shows how much data your cluster is storing and how much free space remains.</p>
</li>
</ul>
<p>Here’s what your dashboard might look like 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761387968743/327288e5-4811-42bf-8fd8-74ed187792a4.png" alt="The CockroachDB dashboard UI on http://localhost:8080" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-step-5-exploring-the-metrics-dashboard">Step 5: Exploring the Metrics Dashboard</h3>
<p>Now that you’re inside the CockroachDB Admin Console (<a target="_blank" href="http://localhost:8080">http://localhost:8080</a>), let’s take things a step further by exploring the <strong>Metrics</strong> section. This is where CockroachDB really shines.</p>
<p>On the left-hand side, click on “Metrics.” Here, you’ll find a collection of dashboards showing how your database is performing behind the scenes, things like query activity, performance, memory use, and much more.</p>
<p>These metrics help you understand what’s happening inside your cluster and make data-driven decisions – like when to scale up, optimize queries, or add more nodes.</p>
<p>We’ll start by focusing on some of the most insightful ones, such as:</p>
<ul>
<li><p><strong>SQL Queries Per Second</strong> – how busy your database is</p>
</li>
<li><p><strong>Service Latency (SQL Statements, 99th percentile)</strong> – how fast or slow your queries are</p>
</li>
</ul>
<p>Then, we’ll also look at others like SQL Contention, Replicas per Node, and Capacity to get a complete view of your CockroachDB cluster’s health.</p>
<p>Here’s what each of these metrics means in simple, everyday terms 👇🏾</p>
<h4 id="heading-sql-queries-per-second">SQL Queries Per Second</h4>
<p>This metric shows the number of SQL commands (like <code>SELECT</code>, <code>INSERT</code>, <code>UPDATE</code>, <code>DELETE</code>) your database cluster is handling every second. In simpler words, it’s how busy your database is. Imagine cars passing through a toll booth – this is the count of cars per second.</p>
<p>This is useful to know because if this number is steadily climbing, your system is getting more traffic or work. You may need to scale up (more nodes, more resources) or optimize queries. If it drops suddenly without an obvious reason, something might be wrong (for example, requests are no longer reaching the database).</p>
<p>Look for a stable or expected value for your workload. Spikes or sustained high values mean you should check performance.</p>
<h4 id="heading-service-latency-sql-statements-99th-percentile">Service Latency: SQL Statements, 99th percentile</h4>
<p>This metric shows how long statements take, from when the database gets the request until it finishes executing it. The 99th percentile is the time within which 99% of statements complete, so only the slowest ~1% take longer. Think of waiting in a queue: the 99th percentile is the wait that only the slowest people (1 in 100) exceeded.</p>
<p>You’ll want to know this because if the slowest queries are taking too long, it might signal a bottleneck (CPU, disk, network, and so on). Low latency = good user experience.</p>
<p>So keep an eye out: if this value rises (gets worse) over time, investigate what’s slowing down. If it stays low and stable, you’re in good shape.</p>
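<p>If percentiles are new to you, this small Python sketch shows how a 99th percentile can be computed with the nearest-rank method. The latency numbers are made up for illustration, and CockroachDB’s own calculation may differ in detail.</p>

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up sample of 100 statement latencies in milliseconds:
# 98 fast statements, one slow one, and one very slow outlier.
latencies_ms = [5] * 98 + [40, 900]

print(percentile(latencies_ms, 50))  # 5
print(percentile(latencies_ms, 99))  # 40
```

<p>Notice that the single 900 ms outlier doesn’t set the 99th percentile. It only moves once more than 1% of statements become slow, which is why a rising p99 is worth investigating.</p>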
<h4 id="heading-sql-statement-contention">SQL Statement Contention</h4>
<p>Statement contention shows the number of SQL queries that got “stuck” or had to wait because other queries were using the same data or resources. It’s like two people trying to grab the same book – one has to wait. That waiting is contention.</p>
<p>High contention means your database is spending time on conflicts, waiting for locks or resources. This slows things down overall. So you’ll want to keep this number as low as possible. If it starts rising, you might need to revisit your schema, queries, or scale differently.</p>
<h4 id="heading-replicas-per-node">Replicas per Node</h4>
<p>This tells you how many copies (“replicas”) of data ranges live on each database node. If you imagine your data is like documents saved in several safes (nodes), this shows how many copies are in each safe.</p>
<p>This matters, because you want balanced replicas so no node is overloaded with too many copies (which can slow it down or put it at risk).</p>
<p>To check on this, make sure nodes have roughly equal replica counts. If one node has many more replicas, you might need to rebalance or add nodes.</p>
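<p>One simple way to put a number on “roughly equal” is to compare the gap between the busiest and the quietest node with the average. This is just a sketch: the replica counts are made up, and <code>replica_imbalance</code> is a hypothetical helper rather than a metric that CockroachDB reports.</p>

```python
def replica_imbalance(replicas_per_node: dict) -> float:
    """Return (max - min) / mean replica count; 0.0 means perfectly balanced."""
    counts = list(replicas_per_node.values())
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean if mean else 0.0


# Made-up replica counts for a 3-node cluster
print(replica_imbalance({"node1": 60, "node2": 60, "node3": 60}))  # 0.0
print(replica_imbalance({"node1": 90, "node2": 45, "node3": 45}))  # 0.75
```

<p>A value near zero means the nodes share the data evenly. What counts as “too high” is a judgment call for your own workload.</p>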
<h4 id="heading-capacity">Capacity</h4>
<p>Capacity shows how much disk/storage your cluster has (total), how much is used, and how much is free. Imagine a warehouse: it’s like how many boxes you can store, how many you’ve filled, and how much empty space remains.</p>
<p>You’ll need to know this because if capacity is nearly full, you risk running out of space, which can cause downtime or performance issues.</p>
<p>Usage should stay at a healthy level (for example, below ~80% of total capacity). If it crosses that, plan to add storage or nodes.</p>
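<p>The 80% rule of thumb is easy to express in code. Here is a minimal Python sketch using made-up usage numbers for the 3 x 5Gi setup from this tutorial. The helper names are hypothetical, and the threshold is only the example value mentioned above.</p>

```python
def capacity_used_fraction(used_gi: float, total_gi: float) -> float:
    """Fraction of total storage currently in use."""
    if total_gi <= 0:
        raise ValueError("total_gi must be positive")
    return used_gi / total_gi


def needs_more_storage(used_gi: float, total_gi: float, threshold: float = 0.80) -> bool:
    """True once usage reaches the chosen threshold."""
    return capacity_used_fraction(used_gi, total_gi) >= threshold


total_gi = 3 * 5  # 3 replicas with a 5Gi volume each

print(needs_more_storage(6, total_gi))   # False (40% used)
print(needs_more_storage(13, total_gi))  # True (about 87% used)
```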
<h4 id="heading-why-these-matter-together">Why These Matter Together</h4>
<p>When you combine these metrics, you get a clear picture:</p>
<ul>
<li><p>High Queries Per Second + high latency = maybe you're under-powered.</p>
</li>
<li><p>High contention = your workload design might be fighting itself.</p>
</li>
<li><p>Imbalanced replicas or full capacity = infrastructure issues.</p>
</li>
<li><p>Stable low latency + balanced replicas + plenty of capacity = sounds like a healthy cluster.</p>
</li>
</ul>
<p>So by keeping an eye on these, you make data-driven decisions: when to scale, when to optimize, when to tweak configs.</p>
<h3 id="heading-step-6-creating-a-little-load-on-the-cockroachdb-cluster">Step 6: Creating a Little Load on the CockroachDB Cluster</h3>
<p>So far, we’ve explored the CockroachDB dashboard and understood what each metric means. Now, let’s make things a bit more fun. 🎉</p>
<p>In this part, we’ll run a simple Python app that connects to our CockroachDB cluster and performs a few database operations (creating, updating, deleting, and retrieving some records). This will help us generate a small load on the database so we can actually see the metrics in action.</p>
<p>Here’s what we’ll be doing step-by-step 👇🏾</p>
<h4 id="heading-step-61-create-a-configmap-for-our-books-data">Step 6.1: Create a ConfigMap for Our Books Data</h4>
<p>We’ll first create a list of 20 books that our Python script will interact with. Each book will have basic info like name, author, genre, pages, and price.</p>
<ol>
<li><p>Create a new file called <code>books.json</code></p>
<ul>
<li><p>On Linux:</p>
<pre><code class="lang-bash">  nano books.json
</code></pre>
<p>  Paste the below JSON content into it.</p>
<pre><code class="lang-json">  [
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Bright Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Ava Hart"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9783218196000"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2020</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">234</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">10.99</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Library"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Liam Stone"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9783863794026"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1993</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">358</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Romance"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">30.2</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Shadow Archive"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Maya Chen"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781615594078"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2001</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">404</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"History"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">16.21</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Bright Voyage"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Noah Rivers"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9785931034133"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1987</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">507</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">13.14</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Shadow Garden"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Zara Malik"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9785534192834"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2004</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">404</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Sci-Fi"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">28.13</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Ethan Brooks"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9785030564135"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2009</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">508</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Self-Help"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">20.79</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Atomic Atlas"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Iris Park"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9787242388493"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2025</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">442</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Romance"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">18.5</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The First Library"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Caleb Nguyen"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9787101226911"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2017</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">528</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Romance"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">24.47</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal River"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Sofia Diaz"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781845146276"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2004</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">599</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">31.15</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Archive"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Jude Bennett"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784893252883"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1996</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">632</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">40.47</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Last Compass"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Nina Volkova"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784303911713"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2018</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">451</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"History"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">29.53</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Garden"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Omar Haddad"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784896383461"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1988</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">251</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Thriller"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">36.38</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Silent Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Priya Kapoor"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781509839308"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2008</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">649</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">28.05</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Compass"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Felix Romero"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781834738291"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2025</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">180</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Self-Help"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">19.15</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Lost Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Tara Quinn"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781165667017"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2010</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">368</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">41.37</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Last Signal"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Hana Sato"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9783387262476"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2005</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">467</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Nonfiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">42.01</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Crystal Archive"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Leo Fischer"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9780801326776"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1984</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">573</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Nonfiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">42.31</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Atlas"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Mila Novak"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9784746872343"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">2005</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">180</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Nonfiction"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">16.58</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Hidden Compass"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Arthur Wells"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9780097882086"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1983</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">713</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Fantasy"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">39.42</span>
    },
    {
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"The Silent Atlas"</span>,
      <span class="hljs-attr">"author"</span>: <span class="hljs-string">"Selene Ortiz"</span>,
      <span class="hljs-attr">"isbn"</span>: <span class="hljs-string">"9781939909169"</span>,
      <span class="hljs-attr">"published_year"</span>: <span class="hljs-number">1991</span>,
      <span class="hljs-attr">"pages"</span>: <span class="hljs-number">190</span>,
      <span class="hljs-attr">"genre"</span>: <span class="hljs-string">"Self-Help"</span>,
      <span class="hljs-attr">"price"</span>: <span class="hljs-number">33.79</span>
    }
  ]
</code></pre>
<p>  To save and close the file in nano:</p>
<ul>
<li><p>Press <code>CTRL + O</code> → then <code>ENTER</code> (to save)</p>
</li>
<li><p>Press <code>CTRL + X</code> (to exit the editor)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Then create a ConfigMap from the file:</p>
<pre><code class="lang-bash">kubectl create configmap books-json --from-file=books.json
</code></pre>
</li>
</ol>
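<p>Before relying on the ConfigMap, it can help to sanity-check the data. The sketch below is not part of the original setup: it assumes the file is a JSON array of objects shaped like the entries above, and the helper name <code>validate_books</code> is hypothetical. It checks for the three keys the script reads directly and for duplicate ISBNs, since the <code>isbn</code> column is declared <code>UNIQUE</code> in the next step.</p>

```python
REQUIRED_KEYS = {"name", "author", "isbn"}  # keys run.py reads with b["..."]


def validate_books(books):
    """Return a list of human-readable problems found in the parsed JSON."""
    if not isinstance(books, list):
        return ["top-level JSON value must be an array"]
    problems = []
    seen = set()
    for i, book in enumerate(books):
        missing = REQUIRED_KEYS - set(book)
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        isbn = book.get("isbn")
        if isbn is not None and isbn in seen:
            problems.append(f"entry {i}: duplicate isbn {isbn}")
        seen.add(isbn)
    return problems


# Two entries copied from the sample data above; an empty list means no problems.
sample = [
    {"name": "The Crystal Garden", "author": "Omar Haddad", "isbn": "9784896383461"},
    {"name": "The Silent Signal", "author": "Priya Kapoor", "isbn": "9781509839308"},
]
print(validate_books(sample))
```

<p>To check the real file, you could pass it the result of <code>json.load()</code> on <code>books.json</code> instead of the inline sample.</p>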
<h4 id="heading-step-62-create-the-python-script-configmap">Step 6.2: Create the Python Script ConfigMap</h4>
<p>Next, we’ll create a simple Python script that:</p>
<ul>
<li><p>Creates a new table for books</p>
</li>
<li><p>Inserts 20 records</p>
</li>
<li><p>Updates 7 of them</p>
</li>
<li><p>Deletes 5</p>
</li>
<li><p>Retrieves 15 books from the database</p>
</li>
</ul>
<p>It’s like simulating a small library app. 📚</p>
<p>Create a new file called <code>books-script.yml</code> and paste the content below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">books-script</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">run.py:</span> <span class="hljs-string">|
    #!/usr/bin/env python3
    import argparse
    import json
    import os
    import sys
    import time
    from typing import List, Dict
</span>
    <span class="hljs-string">import</span> <span class="hljs-string">psycopg</span>
    <span class="hljs-string">from</span> <span class="hljs-string">psycopg.rows</span> <span class="hljs-string">import</span> <span class="hljs-string">dict_row</span>

    <span class="hljs-string">DDL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    CREATE TABLE IF NOT EXISTS books (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name STRING NOT NULL,
        author STRING NOT NULL,
        isbn STRING UNIQUE,
        published_year INT4,
        pages INT4,
        genre STRING,
        price DECIMAL(10,2),
        created_at TIMESTAMPTZ NOT NULL DEFAULT now()
    );
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">INSERT_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    INSERT INTO books (name, author, isbn, published_year, pages, genre, price)
    VALUES (%s, %s, %s, %s, %s, %s, %s);
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">UPDATE_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    UPDATE books
    SET price = %s, pages = %s
    WHERE isbn = %s;
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">DELETE_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    DELETE FROM books
    WHERE isbn = %s;
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">GET_SQL</span> <span class="hljs-string">=</span> <span class="hljs-string">""</span><span class="hljs-string">"
    SELECT id, name, author, isbn, published_year, pages, genre, price, created_at
    FROM books
    WHERE isbn = %s;
    "</span><span class="hljs-string">""</span>

    <span class="hljs-string">def</span> <span class="hljs-string">load_books(path:</span> <span class="hljs-string">str)</span> <span class="hljs-string">-&gt;</span> <span class="hljs-string">List[Dict]:</span>
        <span class="hljs-string">with</span> <span class="hljs-string">open(path,</span> <span class="hljs-string">"r"</span><span class="hljs-string">)</span> <span class="hljs-attr">as f:</span>
            <span class="hljs-string">return</span> <span class="hljs-string">json.load(f)</span>

    <span class="hljs-string">def</span> <span class="hljs-string">connect_with_retry(dsn:</span> <span class="hljs-string">str,</span> <span class="hljs-attr">attempts:</span> <span class="hljs-string">int</span> <span class="hljs-string">=</span> <span class="hljs-number">30</span><span class="hljs-string">,</span> <span class="hljs-attr">delay:</span> <span class="hljs-string">float</span> <span class="hljs-string">=</span> <span class="hljs-number">2.0</span><span class="hljs-string">):</span>
        <span class="hljs-string">last_exc</span> <span class="hljs-string">=</span> <span class="hljs-string">None</span>
        <span class="hljs-string">for</span> <span class="hljs-string">_</span> <span class="hljs-string">in</span> <span class="hljs-string">range(attempts):</span>
            <span class="hljs-attr">try:</span>
                <span class="hljs-string">conn</span> <span class="hljs-string">=</span> <span class="hljs-string">psycopg.connect(dsn,</span> <span class="hljs-string">autocommit=False)</span>
                <span class="hljs-string">return</span> <span class="hljs-string">conn</span>
            <span class="hljs-attr">except Exception as e:</span>
                <span class="hljs-string">last_exc</span> <span class="hljs-string">=</span> <span class="hljs-string">e</span>
                <span class="hljs-string">time.sleep(delay)</span>
        <span class="hljs-string">raise</span> <span class="hljs-string">last_exc</span>

    <span class="hljs-string">def</span> <span class="hljs-string">main():</span>
        <span class="hljs-string">ap</span> <span class="hljs-string">=</span> <span class="hljs-string">argparse.ArgumentParser()</span>
        <span class="hljs-string">ap.add_argument("--dsn",</span> <span class="hljs-string">required=True,</span> <span class="hljs-string">help="Postgres/CockroachDB</span> <span class="hljs-string">DSN")</span>
        <span class="hljs-string">ap.add_argument("--json",</span> <span class="hljs-string">default="/app/books.json",</span> <span class="hljs-string">help="Path</span> <span class="hljs-string">to</span> <span class="hljs-string">books</span> <span class="hljs-string">JSON")</span>
        <span class="hljs-string">args</span> <span class="hljs-string">=</span> <span class="hljs-string">ap.parse_args()</span>

        <span class="hljs-string">books</span> <span class="hljs-string">=</span> <span class="hljs-string">load_books(args.json)</span>
        <span class="hljs-string">print(f"Loaded</span> {<span class="hljs-string">len(books)</span>} <span class="hljs-string">books")</span>

        <span class="hljs-string">conn</span> <span class="hljs-string">=</span> <span class="hljs-string">connect_with_retry(args.dsn)</span>
        <span class="hljs-string">conn.row_factory</span> <span class="hljs-string">=</span> <span class="hljs-string">dict_row</span>
        <span class="hljs-attr">try:</span>
            <span class="hljs-attr">with conn:</span>
                <span class="hljs-string">with</span> <span class="hljs-string">conn.cursor()</span> <span class="hljs-attr">as cur:</span>
                    <span class="hljs-string">print("Creating</span> <span class="hljs-string">table...")</span>
                    <span class="hljs-string">cur.execute(DDL)</span>

                    <span class="hljs-string">print("Inserting</span> <span class="hljs-number">20</span> <span class="hljs-string">books...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[:20]:</span>
                        <span class="hljs-string">cur.execute(INSERT_SQL,</span> <span class="hljs-string">(</span>
                            <span class="hljs-string">b["name"],</span> <span class="hljs-string">b["author"],</span> <span class="hljs-string">b["isbn"],</span>
                            <span class="hljs-string">b.get("published_year"),</span> <span class="hljs-string">b.get("pages"),</span>
                            <span class="hljs-string">b.get("genre"),</span> <span class="hljs-string">b.get("price"),</span>
                        <span class="hljs-string">))</span>

                    <span class="hljs-string">print("Updating</span> <span class="hljs-number">7</span> <span class="hljs-string">books...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[:7]:</span>
                        <span class="hljs-string">new_price</span> <span class="hljs-string">=</span> <span class="hljs-string">round(float(b.get("price",</span> <span class="hljs-number">10</span><span class="hljs-string">))</span> <span class="hljs-string">+</span> <span class="hljs-number">1.23</span><span class="hljs-string">,</span> <span class="hljs-number">2</span><span class="hljs-string">)</span>
                        <span class="hljs-string">new_pages</span> <span class="hljs-string">=</span> <span class="hljs-string">int(b.get("pages",</span> <span class="hljs-number">100</span><span class="hljs-string">))</span> <span class="hljs-string">+</span> <span class="hljs-number">5</span>
                        <span class="hljs-string">cur.execute(UPDATE_SQL,</span> <span class="hljs-string">(new_price,</span> <span class="hljs-string">new_pages,</span> <span class="hljs-string">b["isbn"]))</span>

                    <span class="hljs-string">print("Deleting</span> <span class="hljs-number">5</span> <span class="hljs-string">books...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[-5:]:</span>
                        <span class="hljs-string">cur.execute(DELETE_SQL,</span> <span class="hljs-string">(b["isbn"],))</span>

                    <span class="hljs-string">print("Performing</span> <span class="hljs-number">15</span> <span class="hljs-string">retrievals...")</span>
                    <span class="hljs-string">for</span> <span class="hljs-string">b</span> <span class="hljs-string">in</span> <span class="hljs-string">books[:15]:</span>
                        <span class="hljs-string">cur.execute(GET_SQL,</span> <span class="hljs-string">(b["isbn"],))</span>
                        <span class="hljs-string">row</span> <span class="hljs-string">=</span> <span class="hljs-string">cur.fetchone()</span>
                        <span class="hljs-attr">if row:</span>
                            <span class="hljs-string">print(f"GET</span> {<span class="hljs-string">b</span>[<span class="hljs-string">'isbn'</span>]}<span class="hljs-string">:</span> {<span class="hljs-string">row</span>[<span class="hljs-string">'name'</span>]} <span class="hljs-string">by</span> {<span class="hljs-string">row</span>[<span class="hljs-string">'author'</span>]} <span class="hljs-string">(${row['price']})")</span>
                        <span class="hljs-attr">else:</span>
                            <span class="hljs-string">print(f"GET</span> {<span class="hljs-string">b</span>[<span class="hljs-string">'isbn'</span>]}<span class="hljs-string">:</span> <span class="hljs-string">not</span> <span class="hljs-string">found</span> <span class="hljs-string">(possibly</span> <span class="hljs-string">deleted)")</span>

            <span class="hljs-string">print("All</span> <span class="hljs-string">operations</span> <span class="hljs-string">completed.")</span>
        <span class="hljs-attr">finally:</span>
            <span class="hljs-string">conn.close()</span>

    <span class="hljs-string">if</span> <span class="hljs-string">__name__</span> <span class="hljs-string">==</span> <span class="hljs-attr">"__main__":</span>
        <span class="hljs-string">main()</span>
</code></pre>
<p>This script connects to the CockroachDB cluster, creates a table (if it doesn’t exist), and performs all those operations in sequence.</p>
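<p>The <code>connect_with_retry</code> helper in the script is a plain fixed-delay retry loop, which is useful because the database may not be accepting connections the moment the Job starts. Here is a minimal sketch of the same pattern in isolation. The names <code>retry</code> and <code>flaky_connect</code> are made up for illustration, and <code>flaky_connect</code> is only a stand-in for <code>psycopg.connect</code>:</p>

```python
import time


def retry(func, attempts=30, delay=2.0):
    """Call func() until it succeeds or attempts run out, sleeping between tries."""
    last_exc = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as exc:  # broad on purpose, mirroring run.py
            last_exc = exc
            time.sleep(delay)
    raise last_exc


# Hypothetical stand-in for psycopg.connect: fails twice, then succeeds.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("database not ready yet")
    return "connected"


print(retry(flaky_connect, attempts=5, delay=0.01))
```

<p>With the defaults used in the script (30 attempts, 2 seconds apart), the Job keeps trying for about a minute before giving up and re-raising the last error.</p>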
<p>It runs 48 SQL statements in total: one <code>CREATE TABLE</code>, 20 <code>INSERT</code>, 7 <code>UPDATE</code>, 5 <code>DELETE</code>, and 15 <code>SELECT</code> statements.</p>
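<p>The tally can be checked by replaying the script’s slicing logic against an in-memory dictionary instead of a database. This is only a sketch: it assumes <code>books.json</code> holds exactly 20 books with unique ISBNs, and the placeholder ISBNs are made up.</p>

```python
# Stand-in data: 20 books with made-up ISBNs (assumption: the real file has 20).
books = [{"isbn": f"isbn-{i:02d}"} for i in range(20)]

table = {}      # plays the role of the books table, keyed by ISBN
statements = 1  # the CREATE TABLE IF NOT EXISTS statement

for b in books[:20]:   # INSERT the first 20 books
    table[b["isbn"]] = dict(b)
    statements += 1
for b in books[:7]:    # UPDATE the first 7
    table[b["isbn"]]["updated"] = True
    statements += 1
for b in books[-5:]:   # DELETE the last 5
    table.pop(b["isbn"], None)
    statements += 1
found = 0
for b in books[:15]:   # SELECT the first 15
    found += b["isbn"] in table
    statements += 1

# Prints: statements issued, rows left in the table, lookups that found a row.
print(statements, len(table), found)
```

<p>Under those assumptions, the deletes hit the last five books and the lookups target the first fifteen, so every lookup finds a row.</p>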
<p>Now apply it:</p>
<pre><code class="lang-bash">kubectl apply -f books-script.yml
</code></pre>
<h4 id="heading-step-63-create-the-job-to-run-the-script">Step 6.3: Create the Job to Run the Script</h4>
<p>Next, let’s create a Kubernetes Job that will actually run our Python script inside a container.</p>
<p>Create a file called <code>books-job.yml</code> and paste the manifest below:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Job</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">books-job</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">runner</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">python:3.12-slim</span>
          <span class="hljs-attr">env:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">CRDB_DSN</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">"postgresql://root@crdb-cockroachdb-public:26257/defaultdb?sslmode=disable"</span>
          <span class="hljs-attr">command:</span> [<span class="hljs-string">"bash"</span>, <span class="hljs-string">"-lc"</span>]
          <span class="hljs-attr">args:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              pip install --no-cache-dir "psycopg[binary]&gt;=3.1,&lt;3.3" &amp;&amp; \
              python /app/run.py --dsn "$CRDB_DSN" --json /app/books.json
</span>          <span class="hljs-attr">volumeMounts:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">script</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/app/run.py</span>
              <span class="hljs-attr">subPath:</span> <span class="hljs-string">run.py</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">books</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/app/books.json</span>
              <span class="hljs-attr">subPath:</span> <span class="hljs-string">books.json</span>
      <span class="hljs-attr">volumes:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">script</span>
          <span class="hljs-attr">configMap:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">books-script</span>
            <span class="hljs-attr">defaultMode:</span> <span class="hljs-number">0555</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">books</span>
          <span class="hljs-attr">configMap:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">books-json</span>
</code></pre>
<p>Here’s what’s happening:</p>
<ul>
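<li><p>The Job runs a container from the <code>python:3.12-slim</code> image, installs <code>psycopg</code> with <code>pip</code>, and then runs <code>run.py</code>.</p>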
</li>
<li><p>It connects to CockroachDB using the connection string <code>postgresql://root@crdb-cockroachdb-public:26257/defaultdb?sslmode=disable</code>. Note the <code>sslmode=disable</code> parameter: it’s there because we disabled TLS in our Helm values earlier.</p>
</li>
<li><p>The Job mounts the two ConfigMaps we created earlier (<code>books-json</code> and <code>books-script</code>) as <strong>volumes</strong> inside the container. Think of volumes like small external drives that the container can read from.</p>
</li>
</ul>
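<p>If the connection string looks dense, it can be pulled apart with Python’s standard library. This sketch only parses the string used in the manifest above; it does not connect to anything:</p>

```python
from urllib.parse import parse_qs, urlsplit

dsn = "postgresql://root@crdb-cockroachdb-public:26257/defaultdb?sslmode=disable"
parts = urlsplit(dsn)

print("user:    ", parts.username)          # SQL user
print("host:    ", parts.hostname)          # Kubernetes Service name
print("port:    ", parts.port)              # CockroachDB SQL port
print("database:", parts.path.lstrip("/"))  # database to connect to
print("options: ", parse_qs(parts.query))   # extra settings such as sslmode
```

<p>The host is simply the name of the Service, which resolves from inside the cluster through Kubernetes DNS.</p>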
<p>Apply it:</p>
<pre><code class="lang-bash">kubectl apply -f books-job.yml
</code></pre>
<h4 id="heading-step-64-check-if-the-job-ran-successfully">Step 6.4: Check if the Job Ran Successfully</h4>
<p>After a minute or two, check your pods:</p>
<pre><code class="lang-bash">kubectl get po
</code></pre>
<p>If you see <code>books-job-xxx</code> with the status <strong>Completed</strong>, then your script ran successfully 🎉</p>
<p>That means our database just got a nice little workout – some records were created, updated, deleted, and read.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761460118429/99ed49a3-52e9-4357-ba2b-9295f0dfbdc8.png" alt="The Completed state of the Books Job" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
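<p>If you would rather check completion from a script than by eye, one option is to read the Job object’s status conditions. The sketch below assumes you feed it the parsed output of <code>kubectl get job books-job -o json</code>. The function name <code>job_state</code> is hypothetical, and the sample dictionary is a trimmed, hand-written stand-in rather than real cluster output:</p>

```python
def job_state(job):
    """Classify a batch/v1 Job object (parsed JSON) as complete, failed or running."""
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"].lower()
    return "running"


# Hand-written stand-in for: kubectl get job books-job -o json
sample = {
    "metadata": {"name": "books-job"},
    "status": {
        "succeeded": 1,
        "conditions": [{"type": "Complete", "status": "True"}],
    },
}
print(job_state(sample))
```

<p>A Job with no terminal condition yet is treated as still running, which also covers a Job whose pod is still installing dependencies.</p>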
<h3 id="heading-step-7-viewing-the-metrics-from-the-load">Step 7: Viewing the Metrics from the Load</h3>
<p>Now that we’ve generated a small load, let’s jump back to the CockroachDB dashboard.</p>
<p>Head to the Metrics section, and under SQL Queries Per Second, you should see a little spike: this shows the activity from our Python job.👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761460175366/6c1e129e-c8bd-4f41-89de-60a1a753026e.png" alt="The SQL Queries Per Second Metric" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Hover your mouse over the graph lines to see exact numbers.</p>
<p>Do the same for Service Latency: SQL Statements (99th percentile). You’ll notice a few bumps showing how long some of the queries took.👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761460224971/8ba9d5ed-0724-4dc6-82f4-7e5d0d05be82.png" alt="The Service Latency Metric" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This small experiment gives you a real feel for how CockroachDB behaves under load, even a tiny one.</p>
<p>To explore more metrics and dashboards, check out the <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/ui-overview-dashboard">official CockroachDB documentation here</a>.</p>
<h3 id="heading-step-8-view-the-list-of-created-items-in-the-database">Step 8: View the List of Created Items in the Database</h3>
<p>Now that our Python job ran and touched the database (creating, updating, deleting, retrieving records), let’s check the content of our <code>books</code> table just to verify everything really happened.</p>
<p>To do that, we’ll create another Kubernetes Job that connects to our CockroachDB cluster and runs a simple SQL query, <code>SELECT * FROM books;</code>, which returns all the remaining records in the table.</p>
<p>Here’s the manifest to use. Create a file named <code>view-books.yml</code> and paste the content below into it:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">batch/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Job</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">view-books</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">client</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.2</span>
          <span class="hljs-attr">command:</span> [<span class="hljs-string">"bash"</span>, <span class="hljs-string">"-lc"</span>]
          <span class="hljs-attr">args:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">|
              cockroach sql \
                --insecure \
                --host=crdb-cockroachdb-public:26257 \
                --database=defaultdb \
                --format=records \
                --execute="SELECT * FROM public.books;"</span>
</code></pre>
<p>Note: We use <code>--insecure</code> because we turned off TLS for our Minikube cluster. This Job mounts nothing fancy. It just spins up, connects to the database, runs the <code>SELECT</code>, and displays the result.</p>
<p>Run the job:</p>
<pre><code class="lang-bash">kubectl apply -f view-books.yml
</code></pre>
<p>Wait a minute, then check the pod status:</p>
<pre><code class="lang-bash">kubectl get po
</code></pre>
<p>Look for something like <code>view-books-xxx</code> in the <strong>Completed</strong> state.</p>
<p>Finally, view the job logs to see the actual records:</p>
<pre><code class="lang-bash">kubectl logs job/view-books
</code></pre>
<p>You’ll see output similar to this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761462270132/c881eca7-18b0-4647-a6b1-2841e7774969.png" alt="The list of created books in the books table in the CockroachDB database" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-backing-up-cockroachdb-to-google-cloud-storage">Backing Up CockroachDB to Google Cloud Storage ☁️</h2>
<p>In this section we’ll explain how you can automate backups of your CockroachDB cluster using simple SQL commands, service accounts (for authenticating to Google Cloud), and Google Cloud Storage (where the data will be stored).</p>
<h3 id="heading-why-backups-are-absolutely-critical">Why Backups Are Absolutely Critical</h3>
<p>Imagine you’ve built your cluster on Kubernetes, and everything’s humming along for weeks or months. You’ve got tens or hundreds of gigabytes of data and 10k+ users relying on it.</p>
<p>Then <strong>BAM!</strong> Something happens. Maybe someone accidentally overwrote the Helm release (<code>helm upgrade --install …</code> with the same release name, for example <code>crdb</code>), or a cloud disk got deleted, or a critical node failed and you lost the majority of data replicas. That’s the nightmare we all dread 😭.</p>
<p>Mistakes happen, even if you’re super careful. What matters most is: How fast and easily could you recover?</p>
<p>That’s why we’ll set up <strong>daily backups</strong> of our CockroachDB cluster, targeting a Google Cloud Storage bucket. (Quick note: Google Cloud Storage is an object storage service where you can keep large amounts of data in the cloud as “objects” and retrieve them whenever you need, a bit like Google Drive or iCloud. 😃)</p>
<p>With your backups going into a storage bucket, if disaster strikes, you can restore the entire cluster (or specific databases/tables) in minutes or hours – instead of days or losing data forever.</p>
<h3 id="heading-connecting-to-our-db-installing-beekeeper-studio">Connecting to Our DB – Installing Beekeeper Studio</h3>
<p>So far, we’ve been connecting to our database programmatically, running commands from pods or jobs inside Kubernetes. But what if there was a <em>more visual</em> and <em>user-friendly</em> way to explore our data?</p>
<p>Well, meet my friend <strong>Beekeeper Studio.</strong> 🙂</p>
<p>Beekeeper Studio is a sleek, open-source database management tool that lets you connect to a wide range of databases like PostgreSQL, MySQL, SQLite, and (most importantly for us) CockroachDB.</p>
<p>It comes with a simple, modern interface for running queries, browsing tables, and viewing data – no need to jump into pods or remember command-line flags 😄</p>
<h3 id="heading-how-to-install-beekeeper-studio">How to Install Beekeeper Studio</h3>
<ol>
<li><p>Visit the official Beekeeper Studio download page here: <a target="_blank" href="https://www.beekeeperstudio.io/get">https://www.beekeeperstudio.io/get</a></p>
</li>
<li><p>Click the “Skip to the download” link. You’ll see something like this:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761542821015/2e7a0fd5-7047-4090-97fb-46b81a3dd638.png" alt="Finding the button to skip to the download page on the Beekeeper Studio website" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>You’ll be redirected to a page listing download options for different operating systems.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761542877590/6034dcf0-d9b0-447b-bd2b-089458729db7.png" alt="Page to select download option according to the user OS" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>Choose your OS and download the correct installer.</p>
</li>
<li><p>Afterwards, run the installer and follow the steps for your OS.</p>
</li>
</ol>
<h3 id="heading-connecting-beekeeper-studio-to-cockroachdb">Connecting Beekeeper Studio to CockroachDB</h3>
<p>Now that we’ve installed Beekeeper Studio, it’s time to connect it to our CockroachDB cluster running inside Minikube.</p>
<p>But before we jump in, here’s something important to note:👇🏾</p>
<p>Our CockroachDB cluster is running INSIDE Kubernetes, and by default, it’s not accessible from outside the cluster.</p>
<p>To confirm this, run:</p>
<pre><code class="lang-bash">kubectl get svc crdb-cockroachdb-public
</code></pre>
<p>You should see something like this 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761544640270/2cf9f8f1-15f1-459b-acd0-63b1c361fa54.png" alt="The CockroachDB service being of type ClusterIP" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Notice that the <strong>TYPE</strong> column says <code>ClusterIP</code>. That means the service can only be reached by other pods INSIDE the Minikube cluster – not from your laptop or external apps.</p>
<h3 id="heading-exposing-the-cluster-for-local-access">Exposing the Cluster for Local Access</h3>
<p>To make our database accessible from your local machine (so Beekeeper Studio can reach it), we’ll use <strong>Kubernetes Port Forwarding</strong>.</p>
<p>In a new terminal tab, run:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26257
</code></pre>
<p>This command tells Kubernetes to forward your local port 26257 to the CockroachDB service’s port 26257 inside the cluster.</p>
<p>Once it’s running, your CockroachDB instance will be accessible at <code>localhost:26257</code>.<br>(Note: you can’t open it in your browser, because this isn’t an HTTP endpoint 😅)</p>
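<p>Since a browser can’t confirm the forward is working, a quick TCP check is one way to do it. This is a minimal sketch using only the standard library. The helper name <code>port_is_open</code> is made up, and it only tests that something is listening on the port, not that it speaks the SQL protocol:</p>

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# True while the port-forward is running, False otherwise.
print(port_is_open("localhost", 26257))
```

<p>If this prints <code>False</code>, check that the <code>kubectl port-forward</code> command is still running in its own terminal tab.</p>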
<h3 id="heading-connecting-via-beekeeper-studio">🐝 Connecting via Beekeeper Studio</h3>
<ol>
<li><p>Open Beekeeper Studio.</p>
</li>
<li><p>Click on the dropdown that says “Select a connection type…”.</p>
</li>
<li><p>Choose CockroachDB from the list.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761544886889/98443b46-574d-4bcc-a41c-d2daa7412201.png" alt="Selecting CockroachDB as a connection type in Beekeeper Studio" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>In the connection window that pops up:</p>
<ul>
<li><p>Disable the <code>Enable SSL</code> option.</p>
</li>
<li><p>Set User to <code>root</code></p>
</li>
<li><p>Set Default Database to <code>defaultdb</code></p>
</li>
<li><p>Set Host to <code>localhost</code></p>
</li>
<li><p>Set Port to <code>26257</code></p>
</li>
</ul>
</li>
<li><p>Now click <strong>Test</strong> (bottom right corner). You should see a success message like <em>Connection looks good</em>.</p>
</li>
</ol>
<p>Your setup should look like this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761544818021/0248173e-9969-433c-a9d4-e83684bf34cf.png" alt="Connecting to the CockroachDB cluster from the Beekeeper Studio software" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Finally, click Connect (right beside the Test button).</p>
<h3 id="heading-verify-the-connection">Verify the Connection</h3>
<p>Once connected, you’ll land on a clean workspace where you can run SQL queries.</p>
<p>To confirm you’re connected to the right cluster, run:</p>
<pre><code class="lang-sql">SELECT * FROM books;
</code></pre>
<p>You should see a table containing about 15 books (the same ones we inserted earlier):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761545094817/99ef4415-bd0d-4452-817f-380996485397.png" alt="List of books in the CockroachDB database" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>And there you go. You’ve now connected Beekeeper Studio to your CockroachDB running inside Minikube! 🚀</p>
<h3 id="heading-creating-a-google-cloud-account">Creating a Google Cloud Account</h3>
<p>Before we can back up our CockroachDB data to Google Cloud Storage, we need to have a Google Cloud account ready.</p>
<h4 id="heading-step-1-visit-the-google-cloud-console">Step 1: Visit the Google Cloud Console</h4>
<p>Head over to 👉🏾 <a target="_blank" href="https://console.cloud.google.com">https://console.cloud.google.com</a></p>
<p>If you don’t have a Google account yet, don’t worry. The process is simple and self-explanatory once you visit the site :). You’ll be guided to create a Google account first, and then your Google Cloud account.</p>
<h4 id="heading-step-2-create-or-use-a-project">Step 2: Create or Use a Project</h4>
<p>Once you’re in the Google Cloud Console, you’ll either:</p>
<ul>
<li><p>Use the <strong>default project</strong> that was automatically created for you, <strong>or</strong></p>
</li>
<li><p>Create a new one by clicking on <strong>“New Project”</strong> and naming it <code>crdb-tutorial</code>.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761546797213/295c7b09-9bb8-4c34-85cf-8701242b2768.png" alt="Creating a new Project in our Google Cloud account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Projects are like folders that contain all your Google Cloud resources: compute instances, storage buckets, databases, and more.</p>
<h4 id="heading-step-3-link-a-billing-account-optional-but-recommended">Step 3: Link a Billing Account (Optional but Recommended)</h4>
<p>If you already have a billing account, link it to your project.</p>
<p>If not, you can easily create one by <a target="_blank" href="https://docs.cloud.google.com/billing/docs/how-to/create-billing-account">following Google’s instructions here</a>. (You’ll need a valid Debit or Credit card.)</p>
<p>Don’t worry if your card doesn’t link right away. Sometimes Google’s billing system can be picky. 😅</p>
<p>Here’s a quick fix that usually works:</p>
<ol>
<li><p>Add your card to Google Pay first.</p>
</li>
<li><p>Then go to Google Subscriptions in your Google account, and link it to your Google Billing Account.</p>
</li>
</ol>
<p>To add your card via Google Subscriptions, <a target="_blank" href="https://myaccount.google.com/payments-and-subscriptions">visit here</a>. (You need to have a Google account first. Don’t worry, the site will direct you on what to do if you don’t.)</p>
<p>You’ll see a page like this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1761546938934/9e983134-dd7e-49b1-85a7-cd12bd01bf67.png" alt="Adding a card to Google Subscriptions" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click Manage payment methods, then add your card details.</p>
<p>Once you’ve done that, refresh your Google Billing Account page – you should now see your card as one of the available options.</p>
<h3 id="heading-creating-a-google-cloud-storage-bucket">Creating a Google Cloud Storage Bucket</h3>
<p>Now that we’ve set up our Google Cloud account and enabled billing, let’s create a Cloud Storage Bucket. This is simply a location (like an online folder) where our CockroachDB backup files will be stored.</p>
<p>In your Google Cloud console, type “storage” in the search bar at the top. From the dropdown results, click on “Cloud Storage”:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089121918/c737c3e1-e45f-48e1-aed9-99e273583425.png" alt="Navigating to the Cloud Storage page" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>On the new page, click on the “Buckets” link in the side menu, then click the “Create Bucket” button.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089164660/8b9336fc-c0c3-4811-ab98-d3538596ee5a.png" alt="Creating a new Bucket in Cloud Storage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Give your bucket a unique name, such as <em>cockroachdb-backup-</em> followed by a few random characters. For example, <em>cockroachdb-backup-i8wu</em> or <em>cockroachdb-backup-7gw8u</em>. Bucket names share a single global namespace, so the random characters make it unlikely that your name is already taken by another Google Cloud user.</p>
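<p>If you’d rather not invent the random characters yourself, a small shell helper can generate them. This is just a sketch: the function name <code>bucket_name</code> is hypothetical, and it assumes <code>od</code> and <code>/dev/urandom</code> are available (as on most Linux and macOS systems):</p>

```shell
# Generate a bucket name with a random 4-character hex suffix.
# bucket_name is a hypothetical helper, not a Google Cloud command.
bucket_name() {
  # Read 2 random bytes and print them as 4 lowercase hex characters.
  suffix=$(od -An -N2 -tx1 /dev/urandom | tr -d ' \n')
  echo "cockroachdb-backup-${suffix}"
}

bucket_name
```

<p>Copy the printed name into the bucket name field in the console. If Google reports that the name is taken, run the helper again.</p>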
<p>Scroll to the bottom and click “Create” to create your bucket.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089287083/a376f695-81b8-4f5a-80a7-cd563c8b4c81.png" alt="Creating your Bucket in Google Cloud Storage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ll see a pop-up asking you to <strong>confirm public access prevention</strong>. This means that only you (and people you explicitly give access to) can view or edit your bucket. Make sure the “Enforce public access prevention on this bucket” checkbox is checked, then click “Confirm.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089404876/38c8e6b5-0de0-4771-9bed-9334f8f8c43a.png" alt="Preventing random users from accessing your bucket" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Perfect! 🎉 You’ve now created a storage bucket where your CockroachDB backups will live.</p>
<h3 id="heading-giving-cockroachdb-access-to-the-bucket">Giving CockroachDB Access to the Bucket</h3>
<p>Our next goal is to let the CockroachDB cluster upload and read files from this bucket. To do this, we’ll create something called a <strong>Service Account</strong> using <strong>Google IAM</strong>.</p>
<p><strong>What’s IAM?</strong><br>IAM stands for <em>Identity and Access Management.</em> It’s basically Google Cloud’s way of managing who can access what in your project.</p>
<p>With IAM, we can create a service account (like a “digital employee”) and give it permission to interact with our bucket instead of using our personal Google account.</p>
<h4 id="heading-creating-a-service-account">Creating a Service Account</h4>
<p>Type “service account” in the search bar and click on “Service Accounts” in the results.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089569066/2855b7fa-d896-4249-825d-4ec590499ca8.png" alt="Navigating the Service Accounts page" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click “Create Service Account” at the top of the page. On the new page, type <em>cockroachdb-backup</em> as the service account name, then click “Create and Continue.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089677768/05c9f9ed-257f-44c6-89b5-3880c8af017d.png" alt="Creating a new Service Account for the CockroachDB cluster, to give it access to our Cloud Storage Bucket" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now we’ll give this service account permission to work with our storage bucket. In the <em>Permissions</em> section, type “storage object creator” in the filter box and select it from the dropdown.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762089744927/64ed65df-88ee-43c9-8be4-892a41a24989.png" alt="Providing our Service Account with the necessary permissions to access the bucket" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Repeat the same for “storage object viewer”, and “storage object user”.</p>
<p>At the end, you should see three roles assigned:</p>
<ul>
<li><p>Storage Object Creator</p>
</li>
<li><p>Storage Object Viewer</p>
</li>
<li><p>Storage Object User</p>
</li>
</ul>
<p>Click “Continue”, then “Done.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762092953125/0419abe8-a1ff-4f1c-b367-f9e203bdf6ff.png" alt="The necessary permissions to be assigned to the Service Account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ve now created a service account that can create and read files in your bucket.</p>
<h4 id="heading-downloading-the-service-account-key">Downloading the Service Account Key</h4>
<p>To let our CockroachDB cluster use this service account, we’ll generate a <strong>key file</strong>.</p>
<p><strong>What’s a key file?</strong><br>It’s just a small <strong>JSON file</strong> containing secret information your app (CockroachDB) can use to authenticate securely with Google Cloud – like an ID card.</p>
<p><strong>But be careful ⚠️</strong> If this key gets into the wrong hands, anyone could use it to access your Google Cloud resources. <strong>Never share or upload this file</strong> to your GitHub, BitBucket, or GitLab repository, or any other online repositories.</p>
<p>In the Service Accounts page, find your <code>cockroachdb-backup</code> account, click the three dots (⋮) under the Action column, then select “Manage Keys.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762090008411/11c4b373-87b0-416d-bf14-1a9ccd15c452.png" alt="Finding the newly created service account, and creating a key" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>On the new page, click “Add Key” then “Create new key.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762090059309/ebe17228-e2a8-4abe-b41b-7378013570d5.png" alt="Creating a new key for the new service account" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>A dialog box will pop up. Choose JSON as the key type and click “Create.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762090115728/5ed82664-f57a-4489-af08-be85c2ad42e9.png" alt="Selecting the Key Type as JSON" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Google will automatically download a file named something like <code>cockroachdb-backup-1234567890abcdef.json</code>.</p>
<p>We’ll use this key soon when we configure our CockroachDB backup job.</p>
<h3 id="heading-attaching-the-key-to-our-cockroachdb-cluster">Attaching the Key to Our CockroachDB Cluster</h3>
<p>Now that we’ve downloaded the service account key, we need to attach it to our CockroachDB cluster so that the DB can upload and read backups from our Google Cloud Storage bucket.</p>
<p><strong>Why this is needed:</strong><br>Our Minikube cluster (and even any managed Kubernetes cluster like GKE, EKS, or AKS) <strong>doesn’t have direct access</strong> to the files on your computer. So, we’ll upload the key file to Kubernetes as a Secret, and then mount it inside our CockroachDB pods as a volume.</p>
<h4 id="heading-step-1-create-a-kubernetes-secret">Step 1: Create a Kubernetes Secret</h4>
<p>Run the command below in your terminal 👇🏾 and replace <code>&lt;PATH_TO_KEY&gt;</code> with the path to your downloaded key file:</p>
<pre><code class="lang-bash">kubectl create secret generic gcs-key --from-file=key.json=&lt;PATH_TO_KEY&gt;
</code></pre>
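<p>This command creates a <strong>Kubernetes Secret</strong> named <code>gcs-key</code> that holds your Google Cloud key under the file name <code>key.json</code>. Keep in mind that Secrets are only base64-encoded by default, not encrypted, so anyone with permission to read Secrets in this namespace can recover the key.</p>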
<h4 id="heading-step-2-mount-the-secret-to-the-cockroachdb-cluster">Step 2: Mount the Secret to the CockroachDB Cluster</h4>
<p>Now, let’s tell Kubernetes to use this secret inside our CockroachDB cluster.</p>
<p>Open your <code>cockroachdb-values.yml</code> file and scroll to the <code>statefulset:</code> section. Add the following lines under it:👇🏾</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-string">...</span>
  <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GOOGLE_APPLICATION_CREDENTIALS</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">/var/run/gcp/key.json</span>

  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">gcs-key</span>

  <span class="hljs-attr">volumeMounts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/run/gcp</span>
      <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Here’s what this does:</p>
<ul>
<li><p>The <code>volumes</code> section tells Kubernetes to create a volume from the secret we just made.</p>
</li>
<li><p>The <code>volumeMounts</code> section attaches that volume inside the CockroachDB container.</p>
</li>
<li><p>The <code>GOOGLE_APPLICATION_CREDENTIALS</code> environment variable points CockroachDB to our key file so it knows where to find it when connecting to Google Cloud.</p>
</li>
</ul>
<p>Your final file should look like this:👇🏾</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">podSecurityContext:</span>
    <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
    <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">podAntiAffinity:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>
  <span class="hljs-attr">nodeSelector:</span>
    <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
  <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GOOGLE_APPLICATION_CREDENTIALS</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">/var/run/gcp/key.json</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">gcs-key</span>
  <span class="hljs-attr">volumeMounts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
      <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/run/gcp</span>
      <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>

<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">standard</span>

<span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>

<span class="hljs-attr">init:</span>
  <span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">wait:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Now, apply the update using Helm:👇🏾</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<h4 id="heading-step-3-confirm-the-key-exists-in-the-cluster">Step 3: Confirm the Key Exists in the Cluster</h4>
<p>Once the upgrade is complete, run this command to confirm the key is now inside your CockroachDB pods:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it crdb-cockroachdb-1 -- cat /var/run/gcp/key.json
</code></pre>
<p>You should see something similar to this:👇🏾</p>
<pre><code class="lang-bash">prince@DESKTOP-QHVTAUD:~/programming/cockroachdb-tutorial$ kubectl <span class="hljs-built_in">exec</span> -it crdb-cockroachdb-1 -- cat /var/run/gcp/key.json
{
  <span class="hljs-string">"type"</span>: <span class="hljs-string">"service_account"</span>,
  <span class="hljs-string">"project_id"</span>: ***,
  <span class="hljs-string">"private_key_id"</span>: ***,
  <span class="hljs-string">"private_key"</span>: ***,
  <span class="hljs-string">"client_email"</span>: ***,
  <span class="hljs-string">"client_id"</span>: ***,
  <span class="hljs-string">"auth_uri"</span>: <span class="hljs-string">"https://accounts.google.com/o/oauth2/auth"</span>,
  <span class="hljs-string">"token_uri"</span>: <span class="hljs-string">"https://oauth2.googleapis.com/token"</span>,
  <span class="hljs-string">"auth_provider_x509_cert_url"</span>: <span class="hljs-string">"https://www.googleapis.com/oauth2/v1/certs"</span>,
  <span class="hljs-string">"client_x509_cert_url"</span>: ***,
  <span class="hljs-string">"universe_domain"</span>: <span class="hljs-string">"googleapis.com"</span>
}
</code></pre>
<p>Nice! That means our cluster now has access to the Google Cloud key.</p>
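<p>Printing the key with <code>cat</code> leaves the private key in your terminal scrollback. A gentler check, sketched below, only lists the mounted file. It assumes <code>ls</code> is available in the CockroachDB image and that your pods follow the <code>crdb-cockroachdb-N</code> naming used above. The helper name <code>check_key_mounted</code> is hypothetical:</p>

```shell
# Confirm the key file is mounted without printing its contents.
# check_key_mounted is a hypothetical helper name.
check_key_mounted() {
  pod="${1:-crdb-cockroachdb-1}"
  # -L follows the symlink Kubernetes creates for files in a Secret volume.
  kubectl exec "$pod" -- ls -lL /var/run/gcp/key.json
}

# Usage (needs kubectl pointed at your Minikube cluster):
#   check_key_mounted crdb-cockroachdb-0
```

<p>A listing with a non-zero file size suggests the Secret was mounted correctly on that pod.</p>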
<h4 id="heading-step-4-creating-the-backup-schedule">Step 4: Creating the Backup Schedule</h4>
<p>CockroachDB makes backups super convenient. It can automatically back up your database <strong>on a schedule</strong> (without you needing to manually create Kubernetes CronJobs).</p>
<p>To create an automatic backup schedule, run this SQL command inside the CockroachDB SQL shell 👇🏾 (replace the <code>&lt;BUCKET_NAME&gt;</code> placeholder with the name of your Google Cloud Storage bucket):</p>
<pre><code class="lang-sql">CREATE SCHEDULE backup_cluster
FOR BACKUP INTO <span class="hljs-string">'gs://&lt;BUCKET_NAME&gt;/cluster?AUTH=implicit'</span>
WITH revision_history
RECURRING <span class="hljs-string">'@hourly'</span>
FULL BACKUP <span class="hljs-string">'@daily'</span>
WITH SCHEDULE OPTIONS first_run = <span class="hljs-string">'now'</span>;
</code></pre>
<p>Here’s what each part means:</p>
<ul>
<li><p><code>AUTH=implicit</code> tells CockroachDB to use the Google key we mounted (<code>GOOGLE_APPLICATION_CREDENTIALS</code>) for authentication.</p>
</li>
<li><p><code>WITH revision_history</code> records the history of changes in each backup, which makes it possible to restore to a specific point in time covered by the backups.</p>
</li>
<li><p><code>FULL BACKUP '@daily'</code> creates a complete backup of the entire cluster every day.</p>
</li>
<li><p><code>RECURRING '@hourly'</code> creates smaller, incremental backups every hour, capturing just the changes since the last backup.</p>
</li>
<li><p><code>WITH SCHEDULE OPTIONS first_run = 'now'</code> starts the first backup immediately after running the command.</p>
</li>
</ul>
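<p>To double-check the schedule and what it has written so far, CockroachDB provides the <code>SHOW SCHEDULES</code> and <code>SHOW BACKUPS IN</code> statements. The sketch below only prints those statements with the bucket URI filled in, so you can paste them into the SQL shell. The helper name <code>backup_uri</code> is hypothetical, and the bucket name is the example from earlier:</p>

```shell
# Build the collection URI used in CREATE SCHEDULE so it is typed only once.
# backup_uri is a hypothetical helper name.
backup_uri() {
  echo "gs://$1/cluster?AUTH=implicit"
}

BUCKET_NAME="cockroachdb-backup-7gw8u"   # example name; use your own bucket

# Print SQL you can paste into the CockroachDB SQL shell.
echo "SHOW SCHEDULES;"
echo "SHOW BACKUPS IN '$(backup_uri "$BUCKET_NAME")';"
```

<p>Building the URI in one place helps keep the backup and restore commands pointed at the same location.</p>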
<p>After running it, CockroachDB will return two rows:</p>
<ul>
<li><p>The first is for the <strong>recurring incremental backup</strong> (hourly updates)</p>
</li>
<li><p>The second is for the <strong>full backup</strong> (daily snapshot)</p>
</li>
</ul>
<p>You can read more about full and incremental backups in the official docs here 👉🏾<a target="_blank" href="https://www.cockroachlabs.com/docs/stable/take-full-and-incremental-backups">CockroachDB Backups Guide</a>.</p>
<h4 id="heading-step-5-checking-backup-status">Step 5: Checking Backup Status</h4>
<p>To see the status of your backups, copy the <strong>schedule ID</strong> from the second row (the <code>id</code> column) and run this command:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762103549260/742fc309-9c4d-4967-9436-91539851a9b9.png" alt="The job ID to copy" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-sql">SHOW JOBS FOR SCHEDULE &lt;YOUR_SCHEDULE_ID&gt;;
</code></pre>
<p>Replace <code>&lt;YOUR_SCHEDULE_ID&gt;</code> with the ID you copied.</p>
<p>You’ll see output similar to this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762103606748/8627d561-0b54-4e6d-9109-ba7e1c7a85c3.png" alt="Getting the status of the backup job" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, do the same for the recurring backup schedule (the ID in the first row of the earlier result).</p>
<p>If both statuses show <code>succeeded</code>, that means your full and recurring backups worked perfectly! If either is still running, just give it a few minutes – backups can take a bit of time :)</p>
<h3 id="heading-testing-our-backup-disaster-recovery-time">Testing Our Backup — Disaster Recovery Time</h3>
<p>Woohoo! We’ve successfully created a backup of our CockroachDB cluster to Google Cloud Storage. That’s a huge milestone. But let’s be honest: how can we be <em>sure</em> it works if we’ve never tried restoring it?</p>
<p>So, in true brave-developer fashion, we’re going to do the unthinkable: <strong>destroy our entire database</strong>...yes, everything! 😬</p>
<p>Why would we do that?! Because in real life, disasters happen. A node crashes, data gets wiped, or an upgrade goes sideways. The question is: <em>Can we recover?</em> Let’s find out.</p>
<h4 id="heading-step-1-uninstall-the-helm-chart">Step 1: Uninstall the Helm Chart</h4>
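<p>First, let’s remove the CockroachDB Helm release with the command below. This deletes the resources the chart created, such as the StatefulSet, its pods, and the services. The <code>gcs-key</code> Secret we created by hand with <code>kubectl</code> is not part of the release, so it stays in place for the reinstall.</p>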
<pre><code class="lang-bash">helm uninstall crdb
</code></pre>
<p>This removes the running cluster, but <strong>not the actual data</strong>, which is stored on Persistent Volumes (PVs).</p>
<h4 id="heading-step-2-delete-persistent-volume-claims-pvcs">Step 2: Delete Persistent Volume Claims (PVCs)</h4>
<p>Each CockroachDB node stores its data on a Persistent Volume that it requests through a <strong>Persistent Volume Claim</strong> (PVC). These PVCs remain even after uninstalling the Helm release, so let’s manually delete them:</p>
<pre><code class="lang-bash">kubectl delete pvc datadir-crdb-cockroachdb-0
kubectl delete pvc datadir-crdb-cockroachdb-1
kubectl delete pvc datadir-crdb-cockroachdb-2
</code></pre>
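<p>With more replicas, typing one command per PVC gets tedious. Here is a minimal sketch that generates the PVC names from the replica count, assuming the <code>datadir-crdb-cockroachdb-N</code> naming shown above. The helper name <code>pvc_names</code> is hypothetical:</p>

```shell
# Print the PVC name for each replica, one per line.
# pvc_names is a hypothetical helper; the naming pattern comes from the commands above.
pvc_names() {
  replicas="${1:-3}"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "datadir-crdb-cockroachdb-${i}"
    i=$((i + 1))
  done
}

# Dry run: only prints the names.
pvc_names 3

# To delete them for real, pipe the names into kubectl:
#   pvc_names 3 | xargs kubectl delete pvc
```

<p>Running the dry run first lets you confirm the names match the output of <code>kubectl get pvc</code> before anything is deleted.</p>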
<h4 id="heading-step-3-delete-the-persistent-volumes-pvs">Step 3: Delete the Persistent Volumes (PVs)</h4>
<p>Next, list all the Persistent Volumes:</p>
<pre><code class="lang-bash">kubectl get pv
</code></pre>
<p>You’ll see a list of volumes similar to this 👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762107818554/01defffd-543b-486a-aa19-4bbf6f768270.png" alt="List existing Persistent Volumes for CockroachDB" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
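<p>Look for the PVs whose <code>CLAIM</code> column points to the PVCs you just deleted. Depending on the storage class’s reclaim policy, some or all of them may already have been removed automatically. Delete any that remain using:</p>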
<pre><code class="lang-bash">kubectl delete pv &lt;PV_NAME&gt;
</code></pre>
<p>At this point, you’ve completely wiped out your database like it never existed 🥲. Don’t worry: this is all part of the plan.</p>
<h4 id="heading-step-4-reinstall-the-cluster">Step 4: Reinstall the Cluster</h4>
<p>Let’s bring CockroachDB back to life (an empty one for now):</p>
<pre><code class="lang-bash">helm install crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>Once the installation is done, expose the cluster locally again:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26257
</code></pre>
<h4 id="heading-step-5-check-whats-left">Step 5: Check What’s Left</h4>
<p>Reconnect Beekeeper Studio to your database if it isn’t already connected, then run the query below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>You’ll get an error saying the <code>books</code> table doesn’t exist, because this is a <em>brand new</em> database.</p>
<h4 id="heading-step-6-restore-from-google-cloud-storage">Step 6: Restore from Google Cloud Storage</h4>
<p>Now for the magic part: let’s bring our data back from the backup we created earlier 😃!</p>
<p>Run this query on the new cluster:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">RESTORE</span> <span class="hljs-keyword">FROM</span> LATEST <span class="hljs-keyword">IN</span> <span class="hljs-string">'gs://&lt;BUCKET_NAME&gt;/cluster?AUTH=implicit'</span>;
</code></pre>
<p>Replace <code>&lt;BUCKET_NAME&gt;</code> with your actual Google Cloud Storage bucket name (for example: <code>cockroachdb-backup-7gw8u</code>).</p>
<p>CockroachDB will begin restoring your data. This can take a few seconds or minutes depending on your backup size. When it’s done, you’ll see a response showing a success status:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762108106557/0da98d45-d8f4-48ed-b852-9f76209fb20f.png" alt="Database restored successfully" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-7-confirm-the-restoration">Step 7: Confirm the Restoration</h4>
<p>Now, run the same query again:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom 💥 your books are back 😁! That means your backup and restore process works perfectly. You just performed a full disaster recovery test.</p>
<p>Congrats! You’ve done something many real-world teams fail to test: a <strong>full backup and restore cycle</strong>. You’ve now proven that your database setup is resilient, even in a worst-case scenario.</p>
<h2 id="heading-managing-resources-amp-optimizing-memory-usage">Managing Resources &amp; Optimizing Memory Usage</h2>
<p>In this section, we’ll learn how CockroachDB handles memory internally (for things like caching and SQL query work), and how to tune these settings so you avoid OOM kills or evictions, where Kubernetes stops the database because it used more memory than was allocated to it.</p>
<h3 id="heading-how-cockroachdb-uses-memory">How CockroachDB Uses Memory</h3>
<p>When you deploy CockroachDB nodes (each replica) via Kubernetes, each pod (node) needs memory for multiple things. At a high level, there are two major internal uses:</p>
<ul>
<li><p><strong>Cache</strong> (<code>conf.cache</code>): This is the space CockroachDB uses to keep frequently accessed data in memory so queries can run faster without hitting the disk.</p>
</li>
<li><p><strong>SQL Memory</strong> (<code>conf.max-sql-memory</code>): This is the memory used when running SQL queries (things like sorting, joins, buffering numbers, and temporary data).</p>
</li>
</ul>
<p>Together, they need to be sized appropriately relative to the total memory you give the pod, so there’s room for these internal operations <em>plus</em> other overhead (networking, logging, background tasks).</p>
<h3 id="heading-the-memory-usage-formula-you-must-follow">The Memory Usage Formula You Must Follow</h3>
<p>Here’s the golden rule you should <strong>never forget</strong>:</p>
<pre><code class="lang-yaml"><span class="hljs-string">(2</span> <span class="hljs-string">×</span> <span class="hljs-string">max-sql-memory)</span> <span class="hljs-string">+</span> <span class="hljs-string">cache</span>  <span class="hljs-string">≤</span>  <span class="hljs-number">80</span><span class="hljs-string">%</span> <span class="hljs-string">of</span> <span class="hljs-string">the</span> <span class="hljs-string">memory</span> <span class="hljs-string">limit</span>
</code></pre>
<p>What this means:</p>
<ul>
<li><p>You take the <code>max-sql-memory</code> value and multiply by 2 (because SQL work may need space for both input and output, etc)</p>
</li>
<li><p>Add your <code>cache</code> value</p>
</li>
<li><p>That total must be <strong>less than or equal to 80%</strong> of the pod’s memory limit (<code>statefulset.resources.limits.memory</code>)</p>
</li>
<li><p>The remaining ~20% (or more) is free space for <em>other internal CockroachDB processes</em> like background jobs, metrics, network, and so on</p>
</li>
</ul>
<p>If you give CockroachDB too little “free” memory beyond these two settings, you risk OOM kills (pod gets killed by Kubernetes because it used more memory than allowed) or performance issues.</p>
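<p>The rule above is easy to check with a few lines of shell arithmetic. This is a minimal sketch, assuming all values are given in whole MiB. The function name <code>check_memory_budget</code> is hypothetical and not part of CockroachDB or the Helm chart:</p>

```shell
# Check (2 x max-sql-memory) + cache against 80% of the memory limit.
# Arguments, all in MiB: limit, cache, max-sql-memory.
# check_memory_budget is a hypothetical helper name.
check_memory_budget() {
  limit_mib="$1"
  cache_mib="$2"
  sql_mib="$3"
  planned=$(( 2 * $sql_mib + $cache_mib ))
  budget=$(( $limit_mib * 80 / 100 ))    # integer division rounds down
  if [ "$planned" -le "$budget" ]; then
    echo "OK: planned ${planned} MiB, budget ${budget} MiB"
  else
    echo "TOO HIGH: planned ${planned} MiB, budget ${budget} MiB"
    return 1
  fi
}

# Example: 2 GiB limit with 512 MiB cache and 512 MiB SQL memory.
check_memory_budget 2048 512 512
```

<p>The function returns a non-zero exit code when the settings exceed the budget, so it can also be used as a guard in a deployment script.</p>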
<h3 id="heading-where-you-find-these-settings">Where You Find These Settings</h3>
<p>If you open the <a target="_blank" href="https://artifacthub.io/packages/helm/cockroachdb/cockroachdb">CockroachDB Helm chart page on ArtifactHub</a> and scroll down to the <strong>Configuration</strong> section (or press Ctrl-F for <code>conf.cache</code>), you’ll see:</p>
<ul>
<li><p><code>conf.cache</code> (cache size)</p>
</li>
<li><p><code>conf.max-sql-memory</code> (SQL memory size)</p>
</li>
<li><p>A note that each of these defaults to roughly 25% of the memory you set in <code>resources.limits.memory</code> for the statefulset.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762235290740/bd176882-43bd-4abd-94e0-cce083335d64.png" alt="Artifacthub docs for the CockroachDB Helm chart" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-concrete-example-step-by-step">Concrete Example (Step-by-Step)</h3>
<p>Let’s do the math with numbers in our Minikube environment.</p>
<ul>
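<li><p>For this example, assume <code>statefulset.resources.limits.memory</code> = <strong>2 GiB</strong> for each CockroachDB pod. (If your values file uses a different limit, such as the <code>1Gi</code> shown earlier, the same steps apply with your number.)</p>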
</li>
<li><p>The Helm chart’s default of ¼ (25%) for each setting means:</p>
<ul>
<li><p><code>conf.cache</code> = ¼ × 2 GiB = <strong>512 MiB</strong></p>
</li>
<li><p><code>conf.max-sql-memory</code> = ¼ × 2 GiB = <strong>512 MiB</strong></p>
</li>
</ul>
</li>
<li><p>Apply the formula: <code>(2 × 512 MiB) + 512 MiB = 1,536 MiB</code></p>
</li>
<li><p>Calculate 80% of the memory limit: <code>80% of 2 GiB = 1,638 MiB</code> (approximately)</p>
</li>
<li><p>Compare: 1,536 MiB ≤ 1,638 MiB – so we’re within the safe zone ✅</p>
</li>
<li><p>That means in this configuration, CockroachDB expects to use <strong>~1,536 MiB</strong> for its cache + SQL memory. This leaves <strong>~512 MiB</strong> (25%) of the 2 GiB limit for other internal processes.</p>
</li>
</ul>
<p>That leftover memory is for things like internal bookkeeping (range rebalancing, replication metadata), communication among database replicas, metric collection, logging, garbage collection, and temporary or unexpected memory spikes.</p>
<p>If you don’t leave this free space, your node might struggle during normal operations. And on Kubernetes, if the pod uses more memory than its <code>limits.memory</code> allows, it can get OOM-killed, which causes restarts and downtime.</p>
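<p>To repeat this calculation for a different memory limit, here is a small sketch that derives the default sizes and the leftover headroom. It assumes the defaults are exactly one quarter of the limit (the chart describes them as roughly 25%) and that the limit is given in whole MiB. The function name <code>default_sizes</code> is hypothetical:</p>

```shell
# Derive the default cache and SQL memory sizes (a quarter of the limit each)
# and the memory left over for everything else. Input and output are in MiB.
# default_sizes is a hypothetical helper name.
default_sizes() {
  limit_mib="$1"
  quarter=$(( $limit_mib / 4 ))
  planned=$(( 2 * $quarter + $quarter ))   # (2 x max-sql-memory) + cache
  headroom=$(( $limit_mib - $planned ))
  echo "cache=${quarter}MiB max-sql-memory=${quarter}MiB planned=${planned}MiB headroom=${headroom}MiB"
}

default_sizes 2048   # the 2 GiB example above
default_sizes 1024   # a 1 GiB limit
```

<p>With the defaults, the planned usage always comes to 75% of the limit, which is why it stays under the 80% threshold.</p>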
<h3 id="heading-on-requests-vs-limits-in-kubernetes">⚠️ On Requests vs Limits in Kubernetes</h3>
<p>Important nuance: Kubernetes schedules pods based on <strong>requests</strong> (what you ask for) but caps their usage based on <strong>limits</strong> (what you allow).</p>
<ul>
<li><p><code>statefulset.resources.requests.memory</code> = what the scheduler guarantees the pod will have.</p>
</li>
<li><p><code>statefulset.resources.limits.memory</code> = the maximum the pod can use before Kubernetes will kill it for excess memory.</p>
</li>
</ul>
<p>Because CockroachDB’s internal memory computations (cache + SQL memory) use the <strong>limit</strong> value to calculate sizing, if you set requests &lt; limits you’ll get a mismatch. Example:</p>
<ul>
<li><p>Suppose requests = 1 GiB, limits = 2 GiB</p>
</li>
<li><p>Kubernetes may schedule the pod on a node that has (at least) 1 GiB free</p>
</li>
<li><p>But internally, CockroachDB will plan for ~1.5 GiB usage (based on the 2 GiB limit)</p>
</li>
<li><p>The node may not actually have that much free memory available</p>
</li>
<li><p>The pod might then use more memory than the scheduler reserved for it, squeezing other pods on the node and putting itself at risk of eviction</p>
</li>
</ul>
<p>✅ <strong>Best practice:</strong> Set requests = limits for memory and CPU for CockroachDB pods. That way the scheduler reserves enough space for what CockroachDB will use internally.</p>
<h3 id="heading-overriding-the-default-fractions">Overriding the Default Fractions</h3>
<p>If you want to set static <code>conf.cache</code> or <code>conf.max-sql-memory</code> values (rather than relying on 25% of limit) you <em>can</em> – but you must still obey the memory usage formula.</p>
<p>For example, if you set:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">conf:</span>
  <span class="hljs-attr">cache:</span> <span class="hljs-string">"1Gi"</span>
  <span class="hljs-attr">max-sql-memory:</span> <span class="hljs-string">"1Gi"</span>
<span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
</code></pre>
<p>With this configuration, the pod’s memory request and limit are both <strong>3 GiB</strong>. Now apply the formula:</p>
<pre><code class="lang-yaml"><span class="hljs-string">(2</span> <span class="hljs-string">×</span> <span class="hljs-string">1Gi)</span> <span class="hljs-string">+</span> <span class="hljs-string">1Gi</span> <span class="hljs-string">=</span> <span class="hljs-string">3Gi</span>
<span class="hljs-number">80</span><span class="hljs-string">%</span> <span class="hljs-string">of</span> <span class="hljs-string">3Gi</span> <span class="hljs-string">=</span> <span class="hljs-string">~2.4Gi</span>
</code></pre>
<p>Here <strong>3Gi &gt; 2.4Gi</strong>, so you’d be violating the rule. This is a risky setup.</p>
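<p>So you’ll need to either reduce both the cache and SQL memory (for example, to 768Mi each, which gives 2,304 MiB against a ceiling of about 2,458 MiB) or increase the memory limit (for example, to 4Gi, where 80% is about 3.2Gi) so that the formula results in ≤ 80% of the limit.</p>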
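<p>If you plan to give <code>conf.cache</code> and <code>conf.max-sql-memory</code> the same static value, you can work backwards from the limit: three times that value has to fit inside 80% of the limit. The sketch below assumes whole MiB, and the function name <code>crdb_max_equal_setting</code> is hypothetical.</p>
<pre><code class="lang-bash"># Hypothetical helper: largest value (in MiB) that cache and max-sql-memory
# can both be set to while (2 x max-sql-memory) + cache stays within 80% of the limit.
crdb_max_equal_setting() {
  local limit_mib=$1
  echo $(( limit_mib * 80 / 300 ))   # (limit x 0.8) / 3, rounded down
}

crdb_max_equal_setting 3072   # 3Gi limit
crdb_max_equal_setting 4096   # 4Gi limit
</code></pre>
<p>This prints <code>819</code> and <code>1092</code>. With a 3Gi limit, 768Mi each fits but 1Gi each does not. With a 4Gi limit, 1Gi each fits.</p>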
<h2 id="heading-scaling-cockroachdb-the-right-way">Scaling CockroachDB the Right Way</h2>
<p>In this section we’ll look at when and how you should grow your CockroachDB cluster – whether that means adding more replicas (horizontal scale), giving each node more CPU/RAM (vertical scale), or giving them more storage.</p>
<p>I’ll explain everything in simple terms and cover what metrics to watch, what decisions to make, and how to scale safely.</p>
<p>What we’ll discuss:</p>
<ul>
<li><p>How you can tell it’s time to “grow” your cluster</p>
</li>
<li><p>How to safely add more nodes or upgrade what you already have</p>
</li>
<li><p>How to decide whether you need more nodes, bigger nodes, or bigger disks</p>
</li>
<li><p>How to do all this without causing downtime or stress</p>
</li>
</ul>
<h3 id="heading-key-metrics-to-understand">Key Metrics to Understand</h3>
<p>Before we dive into how to scale our cluster, we need to understand what certain metrics mean. These metrics will help us make calculated decisions about what to scale and when.</p>
<h4 id="heading-read-bytessecond-amp-write-bytessecond-throughput">Read bytes/second &amp; Write bytes/second (Throughput)</h4>
<p>Read bytes/second is how much data (in bytes) is being <strong>read</strong> from the disk every second, that is, passing from the disk to the database app.</p>
<p>Write bytes/second is how much data is being <strong>written</strong> to the disk per second, that is, moving from the database to the disk.</p>
<p>This matters because your database is an application that stores data on disk. If your app needs to read a lot of data (reads) or write a lot of data (writes), this metric shows the <strong>volume</strong> of data flowing to/from disk.</p>
<p>To keep an eye on it, go to your CockroachDB dashboard and navigate to the “Metrics” link on the sidebar. Under the “Metrics” title, click the “Dashboard:…” drop-down and select “Hardware” from the options.</p>
<p>Now, scroll down a bit till you see “Disk Read Bytes/s” and “Disk Write Bytes/s”.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762325396257/553ac9d4-4927-40f3-b654-8b19a0b2aef8.png" alt="The Disk Read &amp; Write Bytes/s metrics" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-read-iops-amp-write-iops">Read IOPS &amp; Write IOPS</h4>
<p><strong>IOPS</strong> = “Input/Output Operations Per Second”. Here, Read IOPS = how many <strong>read operations</strong> the disk is performing per second. Write IOPS = how many <strong>write operations</strong> per second.</p>
<p>This is different from throughput because throughput is about how many bytes (data) are being transferred. IOPS, on the other hand, is about <strong>how many operations</strong> are happening (regardless of size).</p>
<p>Here’s an example: 10 read operations/sec of 1 MiB each = 10 MiB/sec throughput, 10 IOPS. Another scenario: 100 reads/sec of 10 KiB each = ~1 MiB/sec throughput, but 100 IOPS (a higher operation count, though each operation moves less data).</p>
<p>Scroll down a bit more to view the IOPS metrics:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762325699278/dd549ac3-16cf-4373-9637-5a1e798bf5db.png" alt="Illustrating the IOPS metrics on the dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
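<p>The relationship between the two metrics is simply throughput = IOPS × operation size. Here is a small Bash sketch of the two scenarios above. The function name is made up, and it assumes every operation has the same size, which real workloads rarely do.</p>
<pre><code class="lang-bash"># Hypothetical helper: throughput in KiB/s for a given IOPS and a fixed operation size in KiB.
throughput_kib_per_sec() {
  local iops=$1 op_size_kib=$2
  echo $(( iops * op_size_kib ))
}

throughput_kib_per_sec 10 1024   # 10 reads/sec of 1 MiB each
throughput_kib_per_sec 100 10    # 100 reads/sec of 10 KiB each
</code></pre>
<p>This prints <code>10240</code> (10 MiB/s at 10 IOPS) and <code>1000</code> (just under 1 MiB/s at 100 IOPS).</p>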
<h4 id="heading-sql-p99-latency-99th-percentile-latency">SQL p99 Latency (99th percentile latency)</h4>
<p>P99 latency is the time within which 99% of your queries finish. Only the <strong>slowest 1% of queries</strong> take longer than this.</p>
<p>For example, let’s say you run 1,000 queries. P99 is the latency that 990 of them stayed within, so the slowest 10 took at least that long.</p>
<p>This matters because it’s not about the average query, but about the tail (worst cases). If your p99 is high, it means some queries are seriously lagging. All other queries might be fine, but some are dragging.</p>
<p>So if p99 jumps up (for example, from 10 ms → 300 ms), you should investigate: maybe big joins, missing indexes, contention, or slow disk writes.</p>
<p>To access the SQL P99 Latency metrics, simply click the “Dashboard:…” select field, and choose the “Overview” option from the dropdown.</p>
<p>PS: The higher the p99 latency, the bigger the problem (slower queries).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762326088120/e6f39e6e-942b-4db9-b808-cb228c1e0cc5.png" alt="The SQL p99 latency metric" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
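<p>To make the idea concrete, here is a rough Bash sketch that computes a p99 from a list of latencies using the nearest-rank method. This only illustrates the concept. CockroachDB calculates the metric internally in its own way, and the function name here is hypothetical.</p>
<pre><code class="lang-bash"># Hypothetical helper: reads one integer latency (ms) per line on stdin
# and prints the nearest-rank 99th percentile.
p99_nearest_rank() {
  local sorted count rank
  sorted=$(sort -n)
  count=$(printf '%s\n' "$sorted" | wc -l)
  rank=$(( (count * 99 + 99) / 100 ))   # ceil(0.99 x count) using integer math
  printf '%s\n' "$sorted" | sed -n "${rank}p"
}

# 1,000 queries with latencies of 1 ms, 2 ms, ... 1000 ms
seq 1 1000 | p99_nearest_rank
</code></pre>
<p>This prints <code>990</code>: 990 of the 1,000 queries finished within 990 ms, and only the slowest 10 took longer.</p>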
<h4 id="heading-disk-ops-in-progress-queue-depth">Disk Ops In Progress (Queue Depth)</h4>
<p>This shows how many disk reads and writes are waiting <em>in line</em> (queued) because the storage system is busy.</p>
<p>A queue depth of 0–5 is generally OK. If it frequently goes into double-digits (10+), that means storage is struggling and latency may spike. If you see this number high and staying high, you may need faster storage or more database replicas.</p>
<p>Simple rule: if “Ops In Progress” &gt; ~9 for extended time, this is a bad sign. Time to check disks and I/O.</p>
<p>To access the “Disk Ops In Progress” metric, return to the “Hardware” dashboard, and scroll down:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762488796957/b2a215fd-ec51-4ee3-9056-a5fa6d511c61.png" alt="Accessing the Disk Ops In Progress metrics on the CockroachDB dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>By monitoring these, you can choose:</p>
<ul>
<li><p>“I need <strong>more nodes</strong>” (horizontal scale)</p>
</li>
<li><p>“I need <strong>bigger nodes or faster storage</strong>” (vertical scale)</p>
</li>
<li><p>“I need <strong>better query/index tuning</strong>” (optimize rather than scale)</p>
</li>
</ul>
<h3 id="heading-when-and-what-to-scale-based-on-your-metrics">When (and What) to Scale Based on Your Metrics</h3>
<p>So, let’s imagine you’re watching your CockroachDB dashboard and notice this pattern:</p>
<ul>
<li><p>The <strong>SQL P99 latency</strong> (the slowest 1% of your queries) is high, meaning your queries are taking too long.</p>
</li>
<li><p>The <strong>CPU usage</strong> for your CockroachDB pods (under <em>Cockroach process CPU%</em>) is above <strong>80%</strong> consistently.</p>
</li>
</ul>
<p>That’s a classic sign your cluster is running out of CPU power and the database is struggling to process queries fast enough because the CPU is maxed out.</p>
<p>Here’s how to fix it 👇🏾</p>
<h4 id="heading-step-1-add-more-cpu-power">Step 1: Add More CPU Power</h4>
<p>You can scale up your CPUs directly through the <strong>Helm chart values file</strong>, <code>cockroachdb-values.yml</code>.</p>
<p>In that file, look for the section where CPU and memory requests/limits are defined under <code>statefulset.resources</code>. Then, increase the CPU allocations. For example:</p>
<pre><code class="lang-bash">statefulset:
  resources:
    requests:
      cpu: <span class="hljs-string">"3"</span>
      memory: <span class="hljs-string">"6Gi"</span>
    limits:
      cpu: <span class="hljs-string">"3"</span>
      memory: <span class="hljs-string">"6Gi"</span>
</code></pre>
<p>This means each CockroachDB pod (replica) will now <em>request</em> 3 vCPUs (guaranteed). Save the file, then apply the update with the Helm command:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>Once the upgrade is done, give it 30 minutes to 1 hour to stabilize. The CockroachDB dashboard will automatically start showing you updated metrics.</p>
<p>If you see that the CPU usage drops below 70% and the SQL P99 latency improves, you’re good. 👍🏾</p>
<h4 id="heading-step-2-add-another-replica-new-node">Step 2: Add Another Replica (New Node)</h4>
<p>But…what if the latency is <strong>still high</strong> even after adding more CPU? That likely means the cluster is still overloaded, and it’s time to add another node (replica) to distribute the load.</p>
<p>Here’s why that works: CockroachDB is horizontally scalable, meaning it automatically spreads out your data (remember <strong>ranges</strong>?) and balances reads/writes across all replicas. So, the more nodes you add, the more evenly your cluster can share the work.</p>
<p>To add another replica, simply increase the <code>replicas</code> value in your Helm config:</p>
<pre><code class="lang-bash">statefulset:
  replicas: 4  <span class="hljs-comment"># If it was 3 before</span>
</code></pre>
<p>Then, redeploy again:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>This adds a new pod (a new CockroachDB node) to your cluster. CockroachDB will automatically rebalance your data across nodes – no manual migration needed.</p>
<p>💡 <strong>Tip:</strong> Try to keep one CockroachDB pod (replica) per VM. For example, if you have 3 replicas, you should ideally have 3 separate VMs (worker nodes). This ensures better fault tolerance and performance.</p>
<p>Luckily, the official CockroachDB Helm chart already helps with this by managing <strong>Pod</strong> <strong>anti-affinity rules</strong>, so pods are automatically spread across nodes safely.</p>
<h3 id="heading-disk-bound-situations-what-to-do-when-your-disk-is-the-limiting-factor">Disk-Bound Situations — What to Do When Your Disk Is the Limiting Factor</h3>
<p>If you’re seeing this kind of pattern in your CockroachDB dashboard and Kubernetes cluster:</p>
<ul>
<li><p>SQL P99 latency is high (queries are slow)</p>
</li>
<li><p>“Disk Ops In Progress” (queue depth) stays above ~9-10 – meaning many disk I/O operations are waiting to be processed</p>
</li>
<li><p>Disk “Read bytes/sec” or “Write bytes/sec” (throughput) are high <strong>or</strong> “Read IOPS” or “Write IOPS” are high (even though CPU looks okay)</p>
</li>
</ul>
<p>Then you’re very likely <strong>disk-bound</strong>, meaning your storage is the bottleneck.</p>
<p>Here’s how to fix it (and yes, it’s a bit more complex than just “add more RAM”)…</p>
<h4 id="heading-step-1-increase-disk-size-in-your-helm-values">Step 1: Increase Disk Size in Your Helm Values</h4>
<p>Often the first problem is that the disk size is too small. Here’s how you can increase it:</p>
<ol>
<li><p>Open your <code>cockroachdb-values.yml</code> (the Helm chart values file)</p>
</li>
<li><p>Look for the storage section, for example:</p>
</li>
</ol>
<pre><code class="lang-bash">storage:
  persistentVolume:
    size: 5Gi  <span class="hljs-comment"># current size</span>
</code></pre>
<ol start="3">
<li>Update it to a larger size, like:</li>
</ol>
<pre><code class="lang-bash">storage:
  persistentVolume:
    size: 15Gi  <span class="hljs-comment"># increased size</span>
</code></pre>
<ol start="4">
<li>Save the file and run:</li>
</ol>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p><strong>N.B.</strong> If this doesn’t work, or Helm returns an error about not being able to modify certain values (this is normal), just upsize the disk this way 👇🏾 (replace the <code>PVC_NAME</code> and <code>SIZE</code> placeholders accordingly):</p>
<pre><code class="lang-bash">kubectl patch pvc &lt;PVC_NAME&gt; \
  -p <span class="hljs-string">'{"spec":{"resources":{"requests":{"storage":"&lt;SIZE&gt;"}}}}'</span>
</code></pre>
<p>Do that for each PVC (<code>datadir-crdb-cockroachdb-0</code>, <code>datadir-crdb-cockroachdb-1</code>, and so on).</p>
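<p>With several replicas, typing that command for each PVC gets tedious. The sketch below only <em>prints</em> the commands so you can review them before running anything. It assumes the <code>datadir-crdb-cockroachdb-N</code> naming shown above and a StorageClass that allows volume expansion. The function name is hypothetical.</p>
<pre><code class="lang-bash"># Hypothetical helper: prints one "kubectl patch pvc" command per replica.
# Arguments: number of replicas, new size (for example 15Gi).
pvc_patch_commands() {
  local replicas=$1 size=$2 i=0
  while [ "$i" -lt "$replicas" ]; do
    printf 'kubectl patch pvc datadir-crdb-cockroachdb-%s -p %s\n' "$i" \
      "'{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${size}\"}}}}'"
    i=$(( i + 1 ))
  done
}

# Print the commands for a 3-replica cluster
pvc_patch_commands 3 15Gi
</code></pre>
<p>Once the printed commands look right, you can copy and run them one at a time.</p>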
<p><strong>Important:</strong> Increasing size <em>may help</em>, but often alone is not enough because your disk speed (IOPS/throughput) also depends on factors beyond just size.</p>
<p>Let’s break down why that’s the case, and what really affects your disk performance (especially on Google Cloud, which is what I’m using, too).</p>
<h4 id="heading-why-disk-speed-can-vary">Why Disk Speed Can Vary</h4>
<p>Your CockroachDB cluster uses <strong>external disks</strong> provided by your cloud provider (like Google, AWS, or Azure). The speed of those disks – that is, how fast they can read/write data – isn’t fixed. It depends on a few key factors.</p>
<p>On Google Cloud, disk performance depends on three main things:</p>
<ol>
<li><p><strong>Disk type</strong>: HDD, SSD, or fast SSD (pd-ssd) (the faster the disk type, the faster it can handle data operations)</p>
</li>
<li><p><strong>Disk size</strong>: larger disks usually come with higher speed limits (the bigger, the faster)</p>
</li>
<li><p><strong>VM’s vCPU count</strong>: more CPUs mean higher quotas for both</p>
<ul>
<li><p>read/write operations per second (<strong>IOPS</strong>), and</p>
</li>
<li><p>how much data can flow to/from the disk per second (<strong>throughput</strong>)</p>
</li>
</ul>
</li>
</ol>
<h4 id="heading-the-recommended-disk-type-for-cockroachdb">The Recommended Disk Type for CockroachDB</h4>
<p>The pd-ssd (Google’s fast SSD) is the recommended type for CockroachDB.</p>
<ul>
<li><p>Each pd-ssd disk starts with a minimum of 6,000 IOPS (read or write operations per second).</p>
</li>
<li><p>It also has around 240 MiB/s (~252 MB/s) of read/write throughput.</p>
</li>
</ul>
<p>In simple terms, that means your CockroachDB disk can handle at least 6,000 read/write operations EVERY SECOND, and move around 250 MB of data in and out every second. That’s pretty impressive!</p>
<p>But here’s the catch: those numbers can still vary depending on your <strong>VM family</strong> and <strong>CPU count</strong>.</p>
<h4 id="heading-how-vm-family-affects-disk-speed-e2-example">How VM Family Affects Disk Speed (E2 Example)</h4>
<p>If your CockroachDB is running on an E2 VM family (one of Google Cloud’s general-purpose VM types):</p>
<ul>
<li><p>A VM with 2–7 vCPUs can handle up to:</p>
<ul>
<li><p>15k IOPS (read/write operations per second)</p>
</li>
<li><p>~240 MiB/s throughput (which is already far more than many databases ever use 😅)</p>
</li>
</ul>
</li>
<li><p>A VM with 8–15 vCPUs still allows 15k IOPS, but throughput jumps up to ~800 MiB/s 😮 –<br>  meaning your disk can push roughly 0.8 GB of data in/out EVERY SECOND.</p>
</li>
</ul>
<p>The more vCPUs you have, the higher these limits grow, both for IOPS and throughput.</p>
<h4 id="heading-putting-it-all-together">Putting It All Together</h4>
<p>So, if you notice high SQL P99 latency (queries taking long), and disk read and write IOPS or throughput (read &amp; write bytes) usage close to their limits, then your disk may be maxing out, not your database itself.</p>
<p>Here’s what you can do:</p>
<ul>
<li><p>Check your current VM’s vCPU count and disk performance limit for that CPU.</p>
</li>
<li><p>If you’re using E2 with low vCPUs (for example, 2–4), try increasing it to <strong>8 vCPUs or more</strong>. That’ll immediately lift your IOPS and throughput ceiling.</p>
</li>
</ul>
<h4 id="heading-example-e2-vm-family-iopsthroughput-table">Example: E2 VM Family IOPS/Throughput Table</h4>
<pre><code class="lang-bash">E2 per-VM caps (pd-ssd):

e2-medium:     10k write / 12k <span class="hljs-built_in">read</span> IOPS, 200/200 MiB/s
2–7 vCPUs:     15k / 15k IOPS, 240/240 MiB/s
8–15 vCPUs:    15k / 15k IOPS, 800/800 MiB/s
16–31 vCPUs:   25k / 25k IOPS, 1,000 write / 1,200 <span class="hljs-built_in">read</span> MiB/s
32 vCPUs:      60k / 60k IOPS, 1,000 write / 1,200 <span class="hljs-built_in">read</span> MiB/s
</code></pre>
<p>The rule is simple — the higher the CPU tier (2–7, 8–15, and so on), the higher the disk speed cap.</p>
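<p>If you want to compare your dashboard numbers against these caps in a script, a lookup like the one below can help. The values are copied from the table above, not fetched from Google Cloud, so check them against the current documentation before relying on them. The function name is hypothetical, and the shared-core <code>e2-medium</code> row is left out.</p>
<pre><code class="lang-bash"># Hypothetical helper: prints "write_iops read_iops write_mib_s read_mib_s"
# for an E2 VM with the given vCPU count, using the table above.
e2_pd_ssd_caps() {
  local vcpus=$1
  if [ "$vcpus" -lt 2 ]; then
    echo "unknown"
  elif [ "$vcpus" -le 7 ]; then
    echo "15000 15000 240 240"
  elif [ "$vcpus" -le 15 ]; then
    echo "15000 15000 800 800"
  elif [ "$vcpus" -le 31 ]; then
    echo "25000 25000 1000 1200"
  elif [ "$vcpus" -eq 32 ]; then
    echo "60000 60000 1000 1200"
  else
    echo "unknown"
  fi
}

e2_pd_ssd_caps 4
e2_pd_ssd_caps 8
</code></pre>
<p>This prints <code>15000 15000 240 240</code> and <code>15000 15000 800 800</code>. Going from 4 to 8 vCPUs leaves the IOPS cap unchanged but more than triples the throughput cap.</p>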
<h4 id="heading-but-what-if-youre-still-seeing-slow-queries">⚠️ But What If You’re Still Seeing Slow Queries?</h4>
<p>If your CockroachDB queries are <em>still</em> slow, but your metrics show that you’re not fully using your disk capacity (based on your VM’s CPU range), then your <strong>disk size</strong> might be the actual limitation.</p>
<p>In that case:</p>
<ul>
<li><p>Gradually increase your disk size, for example from <code>50Gi</code> to <code>70Gi</code> to <code>100Gi</code>.</p>
</li>
<li><p>Each increase lets your disk move more data in and out per second (especially with pd-ssd).</p>
</li>
<li><p>Remember: once you increase disk size on Google Cloud, <strong>you can’t shrink it back down</strong>, so grow it slowly and observe improvements before scaling again.</p>
</li>
</ul>
<p>This step helps you pinpoint <em>exactly</em> whether the slowdown is coming from insufficient IOPS, throughput, or just a disk that’s too small for CockroachDB’s workload 💪🏾</p>
<h3 id="heading-memory-pressure-what-to-do-when-your-database-hits-the-limit">Memory Pressure — What to Do When Your Database Hits the Limit</h3>
<p>There are some signs in your cluster you can look out for that’ll tell you your database is getting close to its limit. Pods (database replicas) might be getting <strong>OOMKilled</strong> (out of memory) or being evicted by Kubernetes, or your memory usage might be staying above ~ 75–80% for a while.</p>
<p>If either of these is the case, you’re likely dealing with <strong>memory pressure</strong> (you can check memory usage on the CockroachDB overview dashboard).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762584827011/e7828548-7ed7-4a87-b6b2-fff52c6f6df1.png" alt="Accessing your Cluster memory usage" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-why-this-happens">Why this happens</h4>
<p>If you didn’t set memory requests and limits properly for each replica, the pod might not have enough head-room for all of its internal work (cache, SQL memory, background jobs) and Kubernetes kills it or it crashes.</p>
<p>Also, as you increase load (lots of queries, many users), your database needs more memory for two internal areas:</p>
<ul>
<li><p><code>--cache</code> (or <code>conf.cache</code>): in-memory data caching</p>
</li>
<li><p><code>--max-sql-memory</code> (or <code>conf.max-sql-memory</code>): memory for running SQL queries (joins, sorts, and so on).<br>  And yes, we covered the formula earlier <code>(2 × max-sql-memory) + cache ≤ ~ 80% of RAM limit</code>.</p>
</li>
</ul>
<h4 id="heading-what-to-do">What to do:</h4>
<p>First, you can increase the DB memory. In your Helm chart values (<code>cockroachdb-values.yml</code>), bump up the <code>statefulset.resources.limits.memory</code> and <code>statefulset.resources.requests.memory</code>. Or you can modify <code>conf.cache</code> and <code>conf.max-sql-memory</code> values (if you’re comfortable) but only if the total RAM limit is sufficient to support them.</p>
<p>Because the defaults (when you installed) set each to ~25% of RAM limit, they will scale automatically when you increase RAM.</p>
<p>For example:</p>
<ul>
<li><p>If RAM limit per pod = <strong>5 GiB</strong>, then cache ≈ <strong>1.25 GiB</strong>, max-sql-memory ≈ <strong>1.25 GiB</strong></p>
</li>
<li><p>If you raise RAM limit to <strong>8 GiB</strong>, these become ≈ <strong>2 GiB</strong> each. This keeps you inside the formula and avoids memory crashes.</p>
</li>
</ul>
<h4 id="heading-quick-yaml-snippet-example">Quick YAML snippet example:</h4>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"8Gi"</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"8Gi"</span>
<span class="hljs-attr">conf:</span>
  <span class="hljs-attr">cache:</span> <span class="hljs-string">"25%"</span>
  <span class="hljs-attr">max-sql-memory:</span> <span class="hljs-string">"25%"</span>
</code></pre>
<p>After editing your values file, remember to apply it:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<h3 id="heading-when-queries-are-slow-but-everything-else-cpu-memory-amp-disk-looks-fine">When Queries Are Slow but Everything Else (CPU, Memory &amp; Disk) Looks “Fine”</h3>
<p>Sometimes you’ll see that your resource metrics (CPU, memory, disk I/O) all seem healthy. But your queries are still slow.</p>
<p>What then? One important cause: <strong>hotspots</strong> – especially “hot ranges” or “hot nodes” in CockroachDB.</p>
<p>A <strong>hot range</strong> is a portion of data (in CockroachDB, a range is a section of data from a table) that’s receiving much more traffic (reads or writes) than others.</p>
<p>A <strong>hot node</strong>, on the other hand, is a node/replica in the cluster which has significantly more load compared to the other nodes – often because it holds one or more hot ranges.</p>
<p>Because most of the traffic (queries) goes to a range that lives on one specific node, performance suffers locally even though your overall CPU / memory / disk metrics might look “okay”: queries are funneled into that specific range, creating a “hotspot”.</p>
<p>Learn more about Hotspots <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/understand-hotspots">here</a>.</p>
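<p>Before suspecting a hotspot, it helps to rule out the resource bottlenecks from the previous sections. The sketch below strings those checks together, assuming p99 latency is already high. The thresholds (CPU above 80%, memory above 80%, queue depth above 9) come from this guide, while the order of the checks and the function name are my own assumptions.</p>
<pre><code class="lang-bash"># Hypothetical helper: rough triage when SQL p99 latency is high.
# Arguments: CPU usage %, memory usage %, disk ops in progress (whole numbers).
triage_slow_queries() {
  local cpu_pct=$1 mem_pct=$2 queue_depth=$3
  if [ "$cpu_pct" -gt 80 ]; then
    echo "cpu-bound"          # add CPU, then add a node
  elif [ "$queue_depth" -gt 9 ]; then
    echo "disk-bound"         # bigger or faster disk, more vCPUs
  elif [ "$mem_pct" -gt 80 ]; then
    echo "memory-pressure"    # raise memory requests and limits
  else
    echo "check-hotspots"     # resources look fine, inspect the Hot Ranges page
  fi
}

triage_slow_queries 90 50 2
triage_slow_queries 50 50 2
</code></pre>
<p>This prints <code>cpu-bound</code> and then <code>check-hotspots</code>. Treat the result as a starting point for investigation, not a diagnosis.</p>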
<h4 id="heading-why-a-high-write-workload-can-slow-reads">Why A High Write Workload Can Slow Reads</h4>
<p>When you have lots of write queries, they may overload specific ranges or nodes (especially if the keyspace is skewed). Writes tend to:</p>
<ul>
<li><p>Acquire locks or latches on rows or ranges</p>
</li>
<li><p>Cause contention among transactions</p>
</li>
<li><p>Require coordination (for example, via Raft consensus) which impacts performance.</p>
</li>
</ul>
<p>When writes dominate a range, read queries that hit the same ranges may get queued behind these write operations, or suffer longer wait times.</p>
<p>Since reads and writes share the same underlying data/ranges, too many writes can delay reads by creating bottlenecks. The docs cover this under “write hotspots”.</p>
<h4 id="heading-key-signs-you-might-have-a-hotspot">Key Signs You Might Have a Hotspot</h4>
<ul>
<li><p>One node’s CPU % is much higher than the others (even though overall resources seem fine)</p>
</li>
<li><p>On the Hot Ranges page in the CockroachDB UI, some ranges show very high QPS (queries per second) compared to others.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762586236835/aeb3b0ea-b280-48d3-b12f-4cfe78d11dc1.png" alt="The Hot Ranges page in the CockroachDB dashboard UI" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>You observe that increasing overall resources (more CPU, more nodes) didn’t resolve the slowness. This suggests the problem isn’t “not enough resources” but “resource imbalance”.</p>
</li>
</ul>
<h4 id="heading-what-you-can-do">What You Can Do</h4>
<p>There are a few things you can do to prevent hotspots:</p>
<ul>
<li><p>Use the <strong>Hot Ranges</strong> UI page (go to the Database Console and then to Hot Ranges) to identify the range IDs and table/indexes causing the issue.</p>
</li>
<li><p>Examine how the key space is being used. If your table/index primary key is monotonically increasing (for example, timestamps or serial IDs), the writes may target a narrow portion of the data, causing a hotspot. The docs suggest using hash-sharded indexes or distributing writes across the key-space.</p>
</li>
<li><p>Ensure load is balanced across nodes: avoid “one node doing most of the work”. If needed, add nodes or ensure range distribution/lease-holder movement is happening.</p>
</li>
<li><p>Monitor the write-versus-read workload. If writes are heavy, they may cause queuing for reads even when resources appear OK. So look at write-heavy traffic patterns and try reducing the number of writes (if possible).</p>
</li>
</ul>
<h4 id="heading-note">⚠️ Note</h4>
<p>Learning everything about hotspots, key visualizers, and range splitting is a bit advanced. For those wanting to dive deeper: see the CockroachDB <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/performance-recipes">Performance Recipes page</a>.</p>
<h3 id="heading-understanding-disk-speed-iops-amp-throughput-across-cloud-providers">Understanding Disk Speed (IOPS &amp; Throughput) Across Cloud Providers</h3>
<p>So far, we’ve talked about how disk speed affects CockroachDB’s performance – especially how Google Cloud measures it. But it’s important to know that <strong>each cloud provider has its own way of measuring and limiting disk performance</strong> (IOPS and throughput).</p>
<p>So, while our earlier examples focused on Google Cloud, similar logic applies to AWS, Azure, and even DigitalOcean, just with different formulas and limits.</p>
<h4 id="heading-for-google-cloud">For Google Cloud:</h4>
<p>These guides break down how disk performance works:</p>
<ul>
<li><p><a target="_blank" href="https://cloud.google.com/compute/docs/disks/performance">Persistent Disk performance overview</a>: explains how baseline IOPS and throughput are calculated and the per-instance caps.</p>
</li>
<li><p><a target="_blank" href="https://docs.cloud.google.com/compute/docs/disks/persistent-disks">About Persistent Disks</a>: quick definitions of <code>pd-standard</code> (HDD), <code>pd-balanced</code> (SSD), and <code>pd-ssd</code> (SSD).</p>
</li>
<li><p><a target="_blank" href="https://cloud.google.com/compute/docs/disks/optimizing-pd-performance">Optimize PD performance</a>: shows how disk size, machine series, and tuning can affect performance.</p>
</li>
</ul>
<h4 id="heading-for-aws-ebs">For AWS (EBS):</h4>
<p>AWS’s Elastic Block Store (EBS) has several disk types:</p>
<ul>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volume-types.html">EBS volume types</a>: overview of all SSD and HDD types (<code>gp3</code>, <code>gp2</code>, <code>io2</code>, and so on).</p>
</li>
<li><p><a target="_blank" href="https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html">General Purpose SSD (gp3)</a>: lets you provision custom IOPS and throughput for your disks (about 0.25 MiB/s per IOPS, up to 2,000 MiB/s).</p>
</li>
</ul>
<h4 id="heading-for-azure-managed-disks">For Azure (Managed Disks):</h4>
<p>Azure disks also vary by type and size:</p>
<ul>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types">Disk types overview</a>: compares Standard HDD, Standard SSD, Premium SSD, Premium SSD v2, and Ultra Disk.</p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-deploy-premium-v2">Premium SSD v2</a>: lets you independently set IOPS and throughput for your disks.</p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/disks-performance">VM &amp; disk performance</a>: lists per-VM IOPS and throughput caps.</p>
</li>
</ul>
<h4 id="heading-for-digitalocean">For DigitalOcean:</h4>
<p>DigitalOcean offers simpler storage setups:</p>
<ul>
<li><p><a target="_blank" href="https://docs.digitalocean.com/products/volumes/">Volumes overview</a>: explains block storage and NVMe details.</p>
</li>
<li><p><a target="_blank" href="https://docs.digitalocean.com/products/volumes/details/limits/">Volume Limits</a>: shows per-Droplet IOPS and throughput caps (including burst windows).</p>
</li>
</ul>
<h3 id="heading-downsizing-the-cluster-reducing-replicas">Downsizing the Cluster (Reducing Replicas)</h3>
<p>Now that we’ve seen how to scale up our CockroachDB cluster, let’s look at how to scale it down safely and correctly.</p>
<p>Let’s assume we scaled our cluster from 3 replicas to 5 replicas earlier (to handle more workload).</p>
<p>PS: If your CockroachDB pods were crashing often, you might need to increase the CPU and memory limits in the Helm chart configuration, like this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">5</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"2Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span> <span class="hljs-comment"># We can keep the memory requests and limits inconsistent for now, since we're in a development environment</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
<span class="hljs-string">...</span>
</code></pre>
<p>Then, you update the cluster using:</p>
<pre><code class="lang-bash">helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
<p>After a few minutes, you can confirm the newly added replicas with <code>kubectl get pods</code>. You should now see five CockroachDB pods running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762612478598/dee9f9e7-6b31-4b06-aed3-e2b0b97268fd.png" alt="The newly added CockroachDB replicas" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Also, check your CockroachDB Admin UI – the new nodes should now appear in the cluster overview.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762612539734/30e01a7d-3d2b-4160-be90-2988a161d87d.png" alt="Newly added nodes in the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>P.S: You might experience some issues when upscaling your cluster, especially if you don’t have sufficient memory and CPU on your PC or wherever you’re running your Kubernetes cluster.</p>
<h3 id="heading-the-wrong-way-to-downscale">⚠️ The Wrong Way to Downscale</h3>
<p>Now, what if your workload reduces and you’d like to cut costs by scaling down from 5 replicas back to 3?</p>
<p>You might think, <em>“Oh, I’ll just reduce the number of replicas in the Helm chart from 5 to 3 and redeploy.”</em> But hold on, that’s very wrong! 😅</p>
<p>Scaling up CockroachDB is simple, but scaling down must be done carefully, for reasons we’ll explain below.</p>
<h3 id="heading-decommissioning-a-node-before-scaling-down-the-cluster">Decommissioning a Node Before Scaling Down the Cluster</h3>
<p>Before you go ahead and reduce the number of replicas in your CockroachDB cluster, it’s important to follow the right process.</p>
<p>You <em>can’t</em> just go from 5 replicas down to 3 and expect everything to go smoothly. There are steps you must take.</p>
<h4 id="heading-why-you-cant-just-scale-from-5-to-3-instantly">Why you can’t just scale from 5 to 3 instantly</h4>
<p>If you reduce your cluster size too quickly, you might:</p>
<ul>
<li><p>Lose data redundancy or fail to meet the required replication factor.</p>
</li>
<li><p>Cause data rebalancing to happen under heavy load, which can slow queries.</p>
</li>
<li><p>Put your cluster into a state where certain ranges or data replicas don’t have enough copies to remain fault-tolerant.</p>
</li>
</ul>
<h4 id="heading-the-correct-approach-decommission-first-then-scale-down-one-node-at-a-time">✅ The correct approach: Decommission first, then scale down one node at a time</h4>
<p>Here’s the safe way to downscale:</p>
<ol>
<li><p><strong>Decommission</strong> the node you plan to remove.</p>
</li>
<li><p>Once decommissioning is complete, <strong>reduce the replica count</strong> (for example, from 5 to 4).</p>
</li>
<li><p>Delete the disk/PVC tied to that removed node.</p>
</li>
<li><p>Repeat the process (remove one node at a time) until you reach your target size (for example, down to 3 replicas).</p>
</li>
</ol>
<h4 id="heading-step-by-step-decommission-the-5th-node-before-scaling-5-to-4">Step-by-step: Decommission the 5th node (before scaling 5 to 4)</h4>
<ol>
<li><p><strong>Create a client pod</strong> to run CockroachDB commands.<br> Create a file named <code>cockroachdb-client.yml</code> with this content:</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">cockroachdb-client</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">serviceAccountName:</span> <span class="hljs-string">&lt;SA&gt;</span>
   <span class="hljs-attr">containers:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">cockroachdb-client</span>
       <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.1</span>
       <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span>
       <span class="hljs-attr">command:</span>
         <span class="hljs-bullet">-</span> <span class="hljs-string">sleep</span>
         <span class="hljs-bullet">-</span> <span class="hljs-string">"2147483648"</span>
   <span class="hljs-attr">terminationGracePeriodSeconds:</span> <span class="hljs-number">300</span>
</code></pre>
<p> Replace <code>&lt;SA&gt;</code> with your CockroachDB service account name (find it via <code>kubectl get sa -l app.kubernetes.io/name=cockroachdb</code>).</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620657038/34d5eb4b-de16-4e8a-b85c-1e7bf6b76172.png" alt="The CockroachDB service account details" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p>Apply the manifest:</p>
<pre><code class="lang-bash"> kubectl apply -f cockroachdb-client.yml
</code></pre>
</li>
<li><p>Confirm the pod is running:</p>
<pre><code class="lang-bash"> kubectl get pods
</code></pre>
<p> You should see <code>cockroachdb-client</code>.</p>
</li>
<li><p>Exec into the client pod:</p>
<pre><code class="lang-bash"> kubectl exec -it cockroachdb-client -- bash
</code></pre>
</li>
<li><p>Get the list of nodes and IDs:</p>
<pre><code class="lang-bash"> ./cockroach node status --insecure --host &lt;SERVICE_NAME&gt;
</code></pre>
<p> Find your service name: <code>kubectl get svc -l app.kubernetes.io/component=cockroachdb</code>. In our case it’s <code>crdb-cockroachdb-public</code>.</p>
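<p> You’ll see nodes with IDs 1, 2, 3, 4, 5. In our case, each maps to a replica pod like <code>crdb-cockroachdb-0</code>, <code>-1</code>, <code>-2</code>, <code>-3</code>, <code>-4</code>. Node IDs are assigned in the order nodes join the cluster, so confirm the mapping from the <code>address</code> column rather than assuming it.</p>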
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620790692/af8d382e-71db-4eab-af7a-a3491d98c8a8.png" alt="The nodes in the CockroachDB cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p><strong>Decommission the node with the highest index</strong> (since Kubernetes will remove the highest-numbered replica when scaling down).<br> For example, if you’re removing the pod <code>crdb-cockroachdb-4…</code>, and the node ID is 5:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620838125/b51856cb-2fbb-4b24-ba41-21f572c7678c.png" alt="The node to be decommissioned" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p> Run the command below to decommission the 5th node.</p>
<pre><code class="lang-bash"> ./cockroach node decommission 5 --host crdb-cockroachdb-public --insecure
</code></pre>
</li>
<li><p>Navigate to the CockroachDB dashboard, and monitor until the node status shows as <code>decommissioned</code>.<br> In the CockroachDB Console’s Cluster Overview page, decommissioned nodes are listed under “Recently Decommissioned Nodes”.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762620923692/e678b21b-e2cc-4fe5-bd5b-46c4b0248958.png" alt="The decommissioned node listed under Recently Decommissioned Nodes" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
<li><p><strong>Scale down the replicas</strong> in your Helm values file:</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">statefulset:</span>
   <span class="hljs-attr">replicas:</span> <span class="hljs-number">4</span>
 <span class="hljs-string">...</span>
</code></pre>
<p> Then run:</p>
<pre><code class="lang-bash"> helm upgrade crdb cockroachdb/cockroachdb -f cockroachdb-values.yml
</code></pre>
</li>
<li><p>Verify pods:</p>
<pre><code class="lang-bash"> kubectl get pods
</code></pre>
<p> You should now see 4 CockroachDB replica pods.</p>
</li>
<li><p><strong>Delete the PVC</strong> for the removed node (to avoid paying for storage you’re no longer using):</p>
<pre><code class="lang-bash"> kubectl delete pvc datadir-crdb-cockroachdb-4
</code></pre>
</li>
<li><p>Repeat the process for the next node if you want to go from 4 to 3 replicas: decommission the node backing <code>crdb-cockroachdb-3</code> next (node ID 4 in our case), scale to 3, delete its PVC, and so on.</p>
</li>
</ol>
<p>After you’re done, you’ll have the target state (for example, 3 nodes) safely and cleanly without causing cluster instability or data loss.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762621007089/cf7fce07-a3a6-4b01-9536-1d5476c2119e.png" alt="Scaling down to 3 nodes, the nodes status on the CockroachDB dashboard" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>To learn more about scaling down your CockroachDB nodes, visit the <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/scale-cockroachdb-kubernetes?filters=helm#remove-nodes">official CockroachDB docs</a>.</p>
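<p>If you prefer the command line to the dashboard, the decommissioning progress from step 7 can also be checked through the client pod. This is a minimal sketch that assumes the pod name (<code>cockroachdb-client</code>) and service name (<code>crdb-cockroachdb-public</code>) used above, and a cluster running in insecure (non-TLS) mode:</p>
<pre><code class="lang-bash"># Show decommissioning details for the cluster's nodes
kubectl exec cockroachdb-client -- ./cockroach node status --decommission --insecure --host crdb-cockroachdb-public

# Show per-node range health before removing the next node
kubectl exec cockroachdb-client -- ./cockroach node status --ranges --insecure --host crdb-cockroachdb-public
</code></pre>
<p>In the first output, look at the <code>membership</code> column: it should read <code>decommissioned</code> for the node you removed. In the second, the <code>ranges_underreplicated</code> and <code>ranges_unavailable</code> columns should be back at 0 before you decommission another node.</p>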
<p>Note that you should <strong>NOT</strong> use Horizontal Pod Autoscalers for scaling up and down your CockroachDB cluster.</p>
<p>Remember, before scaling down, you need to <strong>DECOMMISSION THE NODES FIRST</strong>, and <strong>scale down ONE AT A TIME</strong>!</p>
<p>However, Horizontal Pod Autoscalers do NOT follow this process. So if you want to auto-scale your CockroachDB cluster, it’s best to keep a fixed number of replicas (for example, 3, 5, or 7).</p>
<p>Then set up a Vertical Pod Autoscaler to scale their CPU and RAM. (Remember to set the memory and CPU requests and limits to the same quantity to prevent eviction, as explained earlier.)</p>
<h2 id="heading-what-to-consider-when-deploying-cockroachdb-on-google-kubernetes-engine-gke">What to Consider When Deploying CockroachDB on Google Kubernetes Engine (GKE) ☁️</h2>
<p>Up until now we’ve been working in a <strong>development environment</strong> (using Minikube, local setups), testing and learning.</p>
<p>Now we’re ready to move into <strong>production mode 🤓</strong>. And one of the best places to host CockroachDB in production is on GKE.</p>
<p>In this section, we’ll cover GKE-specific considerations, such as storage classes, load balancers, networking, and how to secure our CockroachDB cluster on GKE using mTLS for authenticating our clients and encrypting any data sent to and from our CockroachDB cluster.</p>
<h3 id="heading-creating-your-gke-cluster">Creating Your GKE Cluster</h3>
<p>To get started, head over to the <a target="_blank" href="https://console.cloud.google.com/"><strong>Google Cloud Console</strong></a>.</p>
<p>In the search bar at the top, type “Kubernetes” and click on “Kubernetes Engine” from the dropdown.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836788168/0d509529-69fb-4308-ba05-6a1426ee7fe1.png" alt="Searching the Kubernetes Engine resource" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You’ll be taken to the Kubernetes Engine page. On the left sidebar, click “Clusters.” Then click the “Create” button at the top.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836843514/fc6d59a2-5b9d-4dee-9fea-7bbb7fc2a023.png" alt="Creating a new cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>💡 <strong>Note:</strong> You’ll need to enable the <strong>Compute Engine API</strong> before you can create a GKE cluster. If you haven’t done that yet, Google Cloud will automatically redirect you to a page where you can enable it. Just click “Enable”, then return to the cluster page.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763998084001/3ecbe47c-3def-4f9c-bc80-dabe2c0002c8.png" alt="Enabling the Compute Engine API" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can also learn more about enabling APIs in Google Cloud here: <a target="_blank" href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api">Enable APIs in Google Cloud</a>.</p>
<p>Once you’re back, you’ll see the cluster creation page. If it defaults to Autopilot, click “Switch to Standard cluster” in the top-right corner. This gives you more control over node settings.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836938958/a2c35e79-6404-4c3a-a821-94d4ce926839.png" alt="Switching to Standard Cluster settings" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Under Cluster basics, give your cluster a name – something like <code>cockroachdb-tutorial</code> works great! Then, set Location type to Zonal (that’s fine for now).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762836985443/eb7b1f79-66e3-4ca4-bfe3-842c5571509b.png" alt="Configuring Zonal clusters" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>On the left sidebar, go to “Node pools.” You’ll see a default pool already added.</p>
<ul>
<li><p>Keep the name as is.</p>
</li>
<li><p>Set the Number of nodes to 1.</p>
</li>
<li><p>Enable the Cluster autoscaler option (so it can scale up automatically later).</p>
</li>
<li><p>Set the Maximum number of Nodes to 10, and the minimum to 0.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762918866561/89a00b2c-46e8-440d-8662-77386cc2cf0e.png" alt="Modifying our default node pool, the cluster autoscaler, etc" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ul>
<p>Next, click the dropdown arrow beside “default-pool” and select “Nodes.” Here, set up your node specifications:</p>
<ul>
<li><p><strong>VM family:</strong> <code>E2</code></p>
</li>
<li><p><strong>Machine type:</strong> <code>Custom</code></p>
</li>
<li><p><strong>vCPUs:</strong> 2</p>
</li>
<li><p><strong>Memory:</strong> 7 GB</p>
</li>
<li><p><strong>Boot disk type:</strong> Standard persistent disk</p>
</li>
<li><p><strong>Disk size:</strong> 50 GB</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762837157043/89da8297-8ecc-4369-aef5-c3b0e75e37be.png" alt="Configuring the E2 Machine type" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762920102117/173a1d66-d31b-49e3-835b-436ec2781b49.png" alt="Configuring our default pool CPU, RAM, and disk" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
</li>
</ul>
<p>When all that’s set, click “Create.” Your cluster will start provisioning.</p>
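<p>If you’d rather script this than click through the console, the same settings can be approximated with the <code>gcloud</code> CLI (installing and authenticating it is covered in the next section). Treat this as a sketch rather than an exact export of the console form: the zone <code>us-central1-a</code> is taken from the connection command used later, and <code>e2-custom-2-7168</code> is the custom machine type name for 2 vCPUs and 7 GB (7168 MB) of memory.</p>
<pre><code class="lang-bash"># Standard (non-Autopilot) zonal cluster with one autoscaling node pool
gcloud container clusters create cockroachdb-tutorial \
  --zone us-central1-a \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 10 \
  --machine-type e2-custom-2-7168 \
  --disk-type pd-standard \
  --disk-size 50
</code></pre>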
<h3 id="heading-connecting-to-your-gke-cluster">Connecting to your GKE cluster</h3>
<p>Once your GKE cluster creation is complete (this might take a few minutes), you’ll see something like this in the console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762844143298/042cc870-82ae-4981-b7c8-d80b187f37a9.png" alt="Accessing our new cluster page" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Next, click the “Connect” link at the top of the page. A modal will pop up. Copy the CLI command you see.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762844213835/119b603c-26c3-46ee-83e1-8feba78031a7.png" alt="Getting the command to access the cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>It’ll look something like:</p>
<pre><code class="lang-bash">gcloud container clusters get-credentials cockroachdb-tutorial --zone us-central1-a --project &lt;PROJECT_NAME&gt;
</code></pre>
<p>📌 <strong>Note:</strong> To run this command successfully, you need to have the <code>gcloud</code> CLI tool installed. If you don’t have it yet, visit <a target="_blank" href="https://docs.cloud.google.com/sdk/docs/install">Install Google Cloud SDK</a> and pick the steps for your OS.</p>
<p>After installing the <code>gcloud</code> CLI, run:</p>
<pre><code class="lang-bash">gcloud auth login
</code></pre>
<p>This authenticates your terminal with your Google Cloud account so you can access the cluster securely.</p>
<p>Next, run the command you copied earlier. You should see something like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762844890936/12e6d8a7-b0ae-44d1-a77c-aeb118ba269b.png" alt="The output of the command that connects your terminal to the newly created Kubernetes cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now run the command to retrieve your pods, <code>kubectl get po</code>. This will retrieve the pods from your new cluster on Google Kubernetes Engine, not Minikube.</p>
<p>For now, we’ve not deployed anything yet, so the namespace should be empty.</p>
<p>But we should have at least 1 worker node available. Run the <code>kubectl get nodes</code> command to view it. You should see something similar to this (GKE takes care of our control plane for us, so when we view the nodes, we’ll only see the worker nodes).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762917947091/c29eb598-1723-43d0-a77f-c6611d04d3d8.png" alt="The available nodes in the GKE cluster" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-deploying-cockroachdb-in-production-on-gke">Deploying CockroachDB in Production (on GKE)</h3>
<p>Now that we’ve successfully created our Google Kubernetes Engine (GKE) cluster, it’s time to deploy our CockroachDB cluster in it – this time, in production mode.</p>
<p>Unlike our earlier Minikube setup (which we used for local development), deploying to GKE introduces new considerations like security, storage classes, and authentication methods – all tailored for a real-world production environment.</p>
<p>To get started, create a new file called <code>cockroachdb-production.yml</code>, and paste the following configuration inside:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">statefulset:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">serviceAccount:</span>
    <span class="hljs-attr">create:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"crdb-cockroachdb"</span>
    <span class="hljs-attr">annotations:</span>
      <span class="hljs-attr">iam.gke.io/gcp-service-account:</span> <span class="hljs-string">&lt;GOOGLE_SERVICE_ACCOUNT&gt;</span>

<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">10Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">premium-rwo</span>

<span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>

<span class="hljs-attr">init:</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app.kubernetes.io/component:</span> <span class="hljs-string">init</span>
  <span class="hljs-attr">jobs:</span>
    <span class="hljs-attr">wait:</span>
      <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>Replace the placeholder <code>&lt;GOOGLE_SERVICE_ACCOUNT&gt;</code> with the <strong>CockroachDB backup service account</strong> you created earlier (in the “Backing Up CockroachDB to Google Cloud Storage” section). It should look something like this <code>cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</code>.</p>
<h3 id="heading-understanding-the-configuration">Understanding the Configuration</h3>
<p>Let’s break down what’s happening in this production Helm values configuration and how it differs from the one we used in Minikube.👇🏽</p>
<h4 id="heading-1-modified-the-statefulset-configuration">1. Modified the <code>statefulset</code> Configuration</h4>
<p>We’re allocating 3 GiB of RAM and 1 vCPU to each replica, both as requests and limits.</p>
<p>This guarantees each node the resources it needs and makes Kubernetes less likely to evict the pod when a worker node runs low on resources.</p>
<p>We also defined a <strong>service account</strong> and annotated it with a GCP service account using the <code>iam.gke.io/gcp-service-account</code> annotation.</p>
<p>This annotation allows CockroachDB to securely access Google Cloud services (like Google Cloud Storage) without using static JSON key files (key.json), thanks to a GKE feature called <strong>Workload Identity</strong>.</p>
<p>In production, we let GKE handle authentication to Google services instead of mounting key files.</p>
<h4 id="heading-2-removed-podsecuritycontext">2. Removed <code>podSecurityContext</code></h4>
<p>In Minikube, we included this section:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">podSecurityContext:</span>
  <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
  <span class="hljs-attr">runAsGroup:</span> <span class="hljs-number">1000</span>
<span class="hljs-string">...</span>
</code></pre>
<p>We did that to give CockroachDB permission to access our local disk for persistent storage. But in GKE, this isn’t needed. Google Cloud handles storage mounting securely on our behalf, so we can safely omit this part.</p>
<h4 id="heading-3-removed-podantiaffinity-and-nodeselector">3. Removed <code>podAntiAffinity</code> and <code>nodeSelector</code></h4>
<p>In our Minikube deployment, we used:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">podAntiAffinity:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">""</span>
<span class="hljs-attr">nodeSelector:</span>
  <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">minikube</span>
<span class="hljs-string">...</span>
</code></pre>
<p>That was just to <strong>force all CockroachDB instances to run on the same node</strong> on Minikube.</p>
<p>But in production, we <em>want</em> each replica on a different VM. This ensures high availability: even if one VM fails, only one CockroachDB replica is affected, and the cluster stays active.</p>
<p>Since our cluster uses a replication factor of 3, at least 2 replicas (a quorum) need to be active for the database to stay online. Otherwise, the affected data becomes unavailable until a quorum is restored 🥲.</p>
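<p>The quorum arithmetic generalizes to other replication factors: data stays available as long as a majority of its replicas are reachable. Here is a small illustrative shell sketch (the <code>quorum</code> and <code>tolerated_failures</code> functions are hypothetical helpers written for this example, not part of CockroachDB):</p>
<pre><code class="lang-bash"># Majority of N replicas: integer division by 2, plus 1
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of replicas that can be lost while a majority remains
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

for rf in 3 5 7; do
  echo "replication factor $rf: quorum $(quorum "$rf"), tolerates $(tolerated_failures "$rf") failure(s)"
done
</code></pre>
<p>With a replication factor of 3, the quorum is 2 and one failure is tolerated. With 5, the quorum is 3 and two failures are tolerated.</p>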
<h4 id="heading-4-removed-env-volumes-and-volumemounts">4. Removed <code>env</code>, <code>volumes</code>, and <code>volumeMounts</code></h4>
<p>In Minikube, we had to manually mount the Service Account key:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">env:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">GOOGLE_APPLICATION_CREDENTIALS</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">/var/run/gcp/key.json</span>
<span class="hljs-attr">volumes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
    <span class="hljs-attr">secret:</span>
      <span class="hljs-attr">secretName:</span> <span class="hljs-string">gcs-key</span>
<span class="hljs-attr">volumeMounts:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gcp-sa</span>
    <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/run/gcp</span>
    <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>
<span class="hljs-string">...</span>
</code></pre>
<p>This was needed so CockroachDB could access our Google Cloud Storage bucket for backups.</p>
<p>But in production, we don’t use key files. Instead, we use a GKE feature called Workload Identity.</p>
<p>It securely binds a Kubernetes Service Account to a Google Service Account, giving our CockroachDB pods the same permissions as the GCP account: no keys, no secrets, and much safer 🔒</p>
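<p>The annotation on its own only names the Google service account. For the binding to take effect, Workload Identity also has to be enabled on the GKE cluster, and the Google service account has to allow the Kubernetes service account to impersonate it. Here is a hedged sketch of that IAM binding, assuming the <code>default</code> namespace and the service account names used in this guide. Replace <code>&lt;PROJECT_ID&gt;</code> with your own project ID, and skip this if you have already created the binding:</p>
<pre><code class="lang-bash"># Allow the Kubernetes service account default/crdb-cockroachdb
# to act as the Google service account
gcloud iam service-accounts add-iam-policy-binding \
  cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:&lt;PROJECT_ID&gt;.svc.id.goog[default/crdb-cockroachdb]"
</code></pre>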
<h4 id="heading-5-updated-storagepersistentvolumestorageclass">5. Updated <code>storage.persistentVolume.storageClass</code></h4>
<p>In Minikube, we used a standard disk:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">standard</span>
<span class="hljs-string">...</span>
</code></pre>
<p>But for production, we’re switching to a faster SSD:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">storage:</span>
  <span class="hljs-attr">persistentVolume:</span>
    <span class="hljs-attr">size:</span> <span class="hljs-string">10Gi</span>
    <span class="hljs-attr">storageClass:</span> <span class="hljs-string">premium-rwo</span>
<span class="hljs-string">...</span>
</code></pre>
<p>This uses Google Cloud’s <code>pd-ssd</code> disk type which is the recommended choice for CockroachDB due to its <strong>high IOPS</strong> (read/write operations per second) and <strong>throughput</strong>. This gives our cluster faster read and write speeds under load, leading to better performance.</p>
<h4 id="heading-6-enabled-tls-for-secure-communication">6. Enabled TLS for Secure Communication</h4>
<p>In development, we disabled TLS:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">tls:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>
</code></pre>
<p>That made it easier and simpler to connect without dealing with certificates.</p>
<p>But in production, security is non-negotiable. We’re enabling TLS to ensure that all communication with CockroachDB is encrypted in transit, and that only clients with <strong>valid certificates</strong> (signed by the same authority) can connect. This is <strong>mutual TLS (mTLS)</strong> authentication.</p>
<p>mTLS ensures that both sides (client and server) prove who they are, preventing impersonation or man-in-the-middle attacks. It’s one of the strongest ways to secure a production database connection.</p>
<p>To learn more about TLS and mTLS encryption, check out:</p>
<ul>
<li><p><a target="_blank" href="https://www.freecodecamp.org/news/understanding-website-encryption/">Understanding Website Encryption (FreeCodeCamp)</a></p>
</li>
<li><p><a target="_blank" href="https://medium.com/@LukV/mutual-tls-mtls-a-deep-dive-into-secure-client-server-communication-bbb83f463292">Mutual TLS Deep Dive (Medium)</a></p>
</li>
</ul>
<h3 id="heading-installing-the-cockroachdb-cluster-on-gke">Installing the CockroachDB Cluster on GKE</h3>
<p>We’ll use the values file you created (<code>cockroachdb-production.yml</code>) and deploy our CockroachDB cluster in our GKE cluster using Helm.</p>
<h4 id="heading-deploy-the-cluster">Deploy the cluster</h4>
<p>Run the following command:</p>
<pre><code class="lang-bash">helm install crdb cockroachdb/cockroachdb -f cockroachdb-production.yml
</code></pre>
<p>This command tells Helm to install a release named <code>crdb</code> using the <code>cockroachdb/cockroachdb</code> chart with your custom production-values file.</p>
<p>This step will take a few minutes. GKE will spin up 3 (or more) worker nodes to host the CockroachDB replicas.</p>
<p>Thanks to pod anti-affinity rules, you’ll typically see <strong>one replica pod per VM</strong> (which improves fault tolerance).</p>
<h4 id="heading-verify-the-pods">Verify the pods</h4>
<p>Once provisioning is done, check the pods:</p>
<pre><code class="lang-bash">kubectl get pods
</code></pre>
<p>You should see three CockroachDB replica pods (for example: <code>crdb-cockroachdb-0</code>, <code>crdb-cockroachdb-1</code>, <code>crdb-cockroachdb-2</code>) in <code>Running</code> status.</p>
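<p>To check the one-replica-per-VM placement yourself, you can ask <code>kubectl</code> to include the node each pod was scheduled on:</p>
<pre><code class="lang-bash"># The NODE column shows which GKE worker VM hosts each replica
kubectl get pods -o wide
</code></pre>
<p>If the anti-affinity rules were honored, each CockroachDB pod should list a different node name.</p>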
<h4 id="heading-verify-the-storage-class-ssd">Verify the storage class (SSD)</h4>
<p>Now check the persistent volume claims to confirm they’re using the fast SSD storage class you requested:</p>
<pre><code class="lang-bash">kubectl get pvc
</code></pre>
<p>Look for your PVCs (persistent volume claims) and check the <code>STORAGECLASS</code> column. You should see something like <code>premium-rwo</code> instead of <code>standard</code> or <code>standard-rwo</code>. This confirms that your replicas are using the high-performance disk type you configured.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762928441524/d7e3d17f-c144-468f-8cc5-d71628ac6a3b.png" alt="The CockroachDB replicas and disk in production" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>📌 This is important, because in production you want good disk IOPS and throughput. Slower disks can bottleneck the database.</p>
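<p>To confirm what sits behind the storage class name, you can inspect the <code>StorageClass</code> object itself. This sketch assumes the GKE-provided <code>premium-rwo</code> class, which is expected to report the <code>pd-ssd</code> disk type:</p>
<pre><code class="lang-bash"># Print the provisioner and the underlying disk type of the storage class
kubectl get storageclass premium-rwo -o jsonpath='{.provisioner}{"\n"}{.parameters.type}{"\n"}'
</code></pre>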
<h3 id="heading-connecting-to-our-cockroachdb-cluster-now-that-tls-mtls-are-enabled">Connecting to Our CockroachDB Cluster (Now That TLS + mTLS Are Enabled)</h3>
<p>Now that we’ve enabled TLS encryption and mTLS authentication, let’s actually try connecting to the cluster so you can <em>see</em> what this security setup looks like in action.</p>
<p>We’ll break down in more detail what TLS and mTLS mean shortly. But for now, let’s jump straight into trying to connect – because once you see the behavior, the explanation becomes much easier to understand.</p>
<h4 id="heading-step-1-expose-the-cockroachdb-cluster-to-your-local-pc-using-port-forwarding">Step 1: Expose the CockroachDB Cluster to Your Local PC (Using Port Forwarding)</h4>
<p>Just like we've been doing from the start, we’ll expose our CockroachDB cluster through <strong>port-forwarding</strong>.</p>
<p>Open a new terminal window and run:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26259:26257
</code></pre>
<p>What this means:</p>
<ul>
<li><p>The first port (26259) is the port on your computer.</p>
</li>
<li><p>The second port (26257) is the port inside the CockroachDB cluster.</p>
</li>
<li><p>Format is: <code>&lt;YOUR_COMPUTER_PORT&gt;:&lt;COCKROACHDB_PORT&gt;</code></p>
</li>
</ul>
<p>So now, CockroachDB will be reachable locally at <code>localhost:26259</code>.</p>
<h4 id="heading-step-2-open-beekeeper-studio-and-create-a-fresh-connection">Step 2: Open Beekeeper Studio and Create a Fresh Connection</h4>
<p>If Beekeeper Studio is still connected to our old Minikube cluster, or you're not seeing the “new connection” screen, just press <code>Ctrl + Shift + N</code>. This opens a new connection window instantly.</p>
<h4 id="heading-step-3-enter-the-connection-details">Step 3: Enter the Connection Details</h4>
<p>Now fill in these fields:</p>
<ul>
<li><p><strong>Port:</strong> <code>26259</code></p>
</li>
<li><p><strong>User:</strong> <code>root</code></p>
</li>
<li><p><strong>Default Database:</strong> <code>defaultdb</code></p>
</li>
</ul>
<p>Now click Test Connection.</p>
<p>And boom! You should see a message telling you something like:</p>
<blockquote>
<p>“This cluster is running in secure mode. You must use SSL to connect.”</p>
</blockquote>
<p>It’ll look similar to this:👇🏾</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763193779864/f3e7abcb-34b0-4c21-8652-48a03e4ff6c9.png" alt="Trying to connect to the new CockroachDB cluster in insecure mode" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This is good: it means our CockroachDB cluster is officially in <strong>secure mode</strong>, and it’s rejecting any connection that doesn’t include proper TLS certificates.</p>
<h3 id="heading-connecting-via-mutual-tls-mtls-why-we-need-a-certificate-for-our-root-user">Connecting via Mutual TLS (mTLS) — Why We Need a Certificate for Our <code>root</code> User</h3>
<p>Now that our CockroachDB cluster is officially running in secure mode, we can’t just connect to it with a username and port anymore. CockroachDB won’t accept that.</p>
<p>To talk to it, <strong>we must connect using Mutual TLS (mTLS)</strong>.</p>
<p>Why? Because TLS alone only authenticates one side of the connection (you verify the server). mTLS authenticates both sides (you verify the server, and the server also verifies <em>you</em>).</p>
<p>Let’s break this down in simple, everyday English 👇🏾</p>
<h4 id="heading-why-tls-exists-in-the-first-place">Why TLS Exists in the First Place</h4>
<p>Whenever you send anything to CockroachDB, like a query, a connection, a password, whatever, it’s all data moving over a network – for example, the internet.</p>
<p>Without protection, anyone on the network path could intercept and read the data being sent to your DB while it’s on its way.<br>TLS fixes that :)</p>
<p>✔️ The CockroachDB cluster has its own <strong>public key + private key</strong><br>✔️ It has a <strong>certificate</strong> that carries its public key<br>✔️ When you connect, the cluster sends you this certificate<br>✔️ Your database tool, for example Beekeeper, checks the certificate and uses it during the TLS handshake to agree on session keys that encrypt all traffic between you and the DB<br>✔️ Only the holder of the matching private key (CockroachDB) can complete that handshake, so nobody else can read the traffic or pose as the server</p>
<p>This gives you encryption and proof you’re really talking to CockroachDB, not some fake service pretending to be it.</p>
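<p>You can watch the first half of that handshake yourself. The sketch below assumes the port-forward from Step 1 is still running and that your <code>openssl</code> is recent enough to support the <code>-starttls postgres</code> option (CockroachDB speaks the PostgreSQL wire protocol). It asks the server for its certificate and prints who it belongs to and who signed it.</p>
<pre><code class="lang-bash"># Request the server certificate over the PostgreSQL protocol, then print its subject and issuer
openssl s_client -connect localhost:26259 -starttls postgres &lt;/dev/null 2&gt;/dev/null \
  | openssl x509 -noout -subject -issuer
</code></pre>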
<h4 id="heading-why-mtls-exists-mutual-tls">Why mTLS Exists (Mutual TLS)</h4>
<p>TLS verifies the server – CockroachDB. mTLS verifies <strong>both sides</strong> – you and CockroachDB.</p>
<p>So CockroachDB also wants YOU to send your certificate.</p>
<p>But not just any certificate. Your certificate must be:</p>
<ul>
<li><p>Signed by <strong>THE SAME Certificate Authority (CA)</strong> that signed the cluster’s certificates</p>
</li>
<li><p>Trusted by the CockroachDB cluster</p>
</li>
<li><p>Mapped to a CockroachDB user (like <code>root</code>)</p>
</li>
</ul>
<p>This is how CockroachDB says:</p>
<blockquote>
<p>“Let me see your certificate so I know you’re someone I should allow in.”</p>
</blockquote>
<p>And we reply:</p>
<blockquote>
<p>“Here is my certificate, signed by the same CA that signed yours.”</p>
</blockquote>
<p>At that point, both sides trust each other.</p>
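<p>To make the “signed by the same CA” idea concrete, here is a small local experiment that doesn't touch the cluster at all. It's a sketch using plain <code>openssl</code>. The file names and the <code>Demo</code> organization are made up for illustration, and this is not how the Helm chart generates its certificates. Note that it moves your terminal into a temporary directory.</p>
<pre><code class="lang-bash"># Work in a throwaway directory so nothing mixes with the tutorial files
cd "$(mktemp -d)"

# 1. Create a demo CA: a private key plus a self-signed certificate
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/O=Demo/CN=Demo CA" \
  -keyout demo-ca.key -out demo-ca.crt

# 2. Create a key and a signing request for a client whose CN is "root"
openssl req -newkey rsa:2048 -nodes \
  -subj "/O=Demo/CN=root" \
  -keyout demo-client.key -out demo-client.csr

# 3. Sign the client request with the demo CA's private key
openssl x509 -req -in demo-client.csr -days 1 \
  -CA demo-ca.crt -CAkey demo-ca.key -CAcreateserial \
  -out demo-client.crt

# 4. Check that the client certificate chains back to the demo CA
openssl verify -CAfile demo-ca.crt demo-client.crt
</code></pre>
<p>The last command should report <code>OK</code> for <code>demo-client.crt</code>. If you verified it against a different CA file, the check would fail, which is exactly the behaviour CockroachDB relies on.</p>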
<p>If this still feels abstract, <a target="_blank" href="https://www.youtube.com/watch?v=EnY6fSng3Ew">watch this video</a>. It explains TLS beautifully.</p>
<h3 id="heading-lets-explore-our-clusters-certificate">Let’s Explore Our Cluster’s Certificate</h3>
<p>Remember that the Helm chart automatically created:</p>
<ul>
<li><p>The CockroachDB Certificate Authority</p>
</li>
<li><p>The CockroachDB node certificates</p>
</li>
<li><p>The keypairs used for encryption</p>
</li>
</ul>
<p>You can list all the CockroachDB-related Kubernetes secrets with:</p>
<pre><code class="lang-bash">kubectl get secrets
</code></pre>
<p>The one we're interested in is:</p>
<pre><code class="lang-bash">crdb-cockroachdb-node-secret
</code></pre>
<p>If you inspect this secret, you’ll see three keys inside:</p>
<ul>
<li><p><code>ca.crt</code>: the CA’s public certificate</p>
</li>
<li><p><code>tls.key</code>: the CockroachDB node’s private key</p>
</li>
<li><p><code>tls.crt</code>: the CockroachDB node certificate</p>
</li>
</ul>
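<p>One way to inspect the secret without printing the private key is <code>kubectl describe</code>, which lists each key with its size in bytes rather than its contents. This assumes the secret lives in your current namespace, as it has throughout this tutorial.</p>
<pre><code class="lang-bash"># Lists the keys (ca.crt, tls.crt, tls.key) and their sizes, not their values
kubectl describe secret crdb-cockroachdb-node-secret
</code></pre>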
<p>Now let’s decode the CockroachDB node certificate.</p>
<p>Run this:</p>
<pre><code class="lang-bash">kubectl get secret crdb-cockroachdb-node-secret -o jsonpath=<span class="hljs-string">'{.data.tls\.crt}'</span> | base64 -d &gt; crdb-node.crt
</code></pre>
<p>This gives you the raw certificate (which looks like gibberish):</p>
<pre><code class="lang-bash">-----BEGIN CERTIFICATE-----
MIIEGDCCAwCgAwIBAgIQWgOPJa4OLoZZjcXLgDF3bjANBgkqhkiG9w0BAQsFADAr
...
-----END CERTIFICATE-----
</code></pre>
<p>Let’s decode it into something readable:</p>
<pre><code class="lang-bash">openssl x509 -<span class="hljs-keyword">in</span> ./crdb-node.crt -text -noout &gt; crdb-node.crt.decoded
</code></pre>
<p>Open the <code>crdb-node.crt.decoded</code> file. This is the <strong>human-readable</strong> CockroachDB cluster certificate.</p>
<p><strong>N.B.:</strong> You need the <code>openssl</code> tool installed to make the certificate human-readable. If you don’t have it, <a target="_blank" href="https://github.com/openssl/openssl#download">follow the download instructions in the OpenSSL repository</a>.</p>
<h3 id="heading-understanding-the-certificate-sections-explained-super-simply">Understanding the Certificate Sections (Explained Super Simply)</h3>
<h4 id="heading-1-issuer">1. Issuer</h4>
<p>You’ll see something like:</p>
<pre><code class="lang-bash">Issuer: O = Cockroach, CN = Cockroach CA
</code></pre>
<p>This tells us:</p>
<ul>
<li><p>The certificate was signed by a Certificate Authority created by the Helm chart</p>
</li>
<li><p>The <strong>Organization (O)</strong> is “Cockroach”</p>
</li>
<li><p>The <strong>Common Name (CN)</strong> is “Cockroach CA”</p>
</li>
</ul>
<p>This basically means:</p>
<blockquote>
<p>“This certificate comes from the CockroachDB internal CA.”</p>
</blockquote>
<h4 id="heading-2-subject">2. Subject</h4>
<p>You’ll also see this:</p>
<pre><code class="lang-bash">Subject: O = Cockroach, CN = node
</code></pre>
<p>What does this mean?</p>
<p><strong>Organization = Cockroach</strong></p>
<ul>
<li><p>This simply groups all CockroachDB-generated certificates under one “organization label.”</p>
</li>
<li><p>It doesn’t refer to the company. It’s just a logical grouping created by CockroachDB’s built-in toolset.</p>
</li>
</ul>
<p><strong>Common Name = node</strong></p>
<ul>
<li><p>This tells CockroachDB that this certificate belongs to a <strong>cluster node</strong>, not a user or a client machine.</p>
</li>
<li><p>In CockroachDB, node certificates are used for:</p>
<ol>
<li><p>DB-to-DB communication</p>
</li>
<li><p>cluster gossip</p>
</li>
<li><p>handling incoming connections from clients (you)</p>
</li>
</ol>
</li>
</ul>
<p>So this certificate is saying:</p>
<blockquote>
<p>“Hi, I’m a CockroachDB node. Please trust me as part of the cluster.”</p>
</blockquote>
<h4 id="heading-3-extended-key-usage-eku">3. Extended Key Usage (EKU)</h4>
<p>Scroll down and you’ll see:</p>
<pre><code class="lang-bash">X509v3 Extended Key Usage:
    TLS Web Server Authentication
    TLS Web Client Authentication
</code></pre>
<p>This is <em>super important</em>, because it defines <strong>how</strong> this certificate is allowed to be used.</p>
<p>Let’s simplify it:</p>
<h4 id="heading-tls-web-server-authentication">TLS Web Server Authentication</h4>
<p>This means:</p>
<blockquote>
<p>“This certificate can be presented <strong>by a server</strong> to prove its identity.”</p>
</blockquote>
<p>In our case, the CockroachDB node uses this certificate to prove to you (the client) that it is the real CockroachDB server. Think of it like the server flashing an ID card before you agree to talk to it.</p>
<h4 id="heading-tls-web-client-authentication">TLS Web Client Authentication</h4>
<p>This means:</p>
<blockquote>
<p>“This certificate can also be used <strong>as a client certificate</strong>.”</p>
</blockquote>
<p>Why would a server have a client certificate? Well, because in CockroachDB, nodes (DBs) talk to each other. When node A connects to node B, node A is a <strong>client</strong>, and node B is a <strong>server</strong>.</p>
<p>So the same certificate serves two roles. Your local machine will use a different certificate, created specifically for your <code>root</code> user. We’ll generate that soon.</p>
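<p>If you'd rather not scroll through the whole decoded file, <code>openssl</code> can print just the three fields we discussed. This is a sketch that assumes the <code>crdb-node.crt</code> file from earlier is in your current directory and that your OpenSSL version supports the <code>-ext</code> option (older releases may not).</p>
<pre><code class="lang-bash"># Print only the Issuer, the Subject, and the Extended Key Usage of the node certificate
openssl x509 -in crdb-node.crt -noout -issuer -subject -ext extendedKeyUsage
</code></pre>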
<h3 id="heading-creating-a-client-certificate-so-we-can-finally-connect-to-cockroachdb">Creating a Client Certificate (So We Can Finally Connect to CockroachDB)</h3>
<p>Now that we’ve seen how the CockroachDB node certificate works, let’s generate our client certificate – the one we’ll use to connect from Beekeeper Studio.</p>
<p>Remember: CockroachDB is running in secure mode, so it won’t accept any connection that doesn’t come with a valid, signed certificate.</p>
<p>To fix that, let’s build a tiny Kubernetes pod whose only job is to create a certificate for our <code>root</code> SQL user.</p>
<h4 id="heading-step-1-create-a-file-called-gen-root-certyml">Step 1: Create a File Called <code>gen-root-cert.yml</code></h4>
<p>Paste this into it:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gen-root-cert</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span>
        <span class="hljs-attr">items:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.crt</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.crt</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.key</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.key</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gen</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.1</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"sh"</span>, <span class="hljs-string">"-ec"</span>]
      <span class="hljs-attr">args:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">|
          mkdir -p /out
</span>
          <span class="hljs-comment"># Copy the CockroachDB cluster Certificate Authority certificate file `ca.crt` (for Mutual TLS authentication)</span>
          <span class="hljs-string">cp</span> <span class="hljs-string">/ca/ca.crt</span> <span class="hljs-string">/out/ca.crt</span>

          <span class="hljs-comment"># Create the client certificate and key pair for the SQL user 'root' using the CockroachDB cluster Certificate Authority private key `ca.key`</span>
          <span class="hljs-string">/cockroach/cockroach</span> <span class="hljs-string">cert</span> <span class="hljs-string">create-client</span> <span class="hljs-string">root</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--certs-dir=/out</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--ca-key=/ca/ca.key</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--lifetime=5h</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--overwrite</span>

          <span class="hljs-comment"># List the generated files</span>
          <span class="hljs-string">ls</span> <span class="hljs-string">-al</span> <span class="hljs-string">/out</span>

          <span class="hljs-comment"># Keep the pod alive so we can kubectl cp the files</span>
          <span class="hljs-string">sleep</span> <span class="hljs-number">3600</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> { <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>, <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/ca</span>, <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span> }
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">requests:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"50Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"10m"</span>
        <span class="hljs-attr">limits:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"500Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"50m"</span>
</code></pre>
<p>So how does this work?</p>
<p>We previously mentioned that the Helm chart created a secret, <code>crdb-cockroachdb-ca-secret</code>.</p>
<p>This secret contains:</p>
<ul>
<li><p>The Certificate Authority public certificate</p>
</li>
<li><p>The private key (used for signing)</p>
</li>
<li><p>The CA metadata</p>
</li>
</ul>
<p>CockroachDB requires that the server certificate (node cert) and the client certificate (your root cert) be signed by <strong>THE SAME CA</strong>, because this is what lets each side verify and trust the other.</p>
<p>So what do we do?</p>
<p>We mount the CA secret into the pod:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">volumes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
    <span class="hljs-attr">secret:</span>
      <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span>
</code></pre>
<p>This gives the pod access to:</p>
<ul>
<li><p><code>/ca/ca.crt</code>: CA public certificate</p>
</li>
<li><p><code>/ca/ca.key</code>: CA <em>private</em> key</p>
</li>
</ul>
<p>And with these, we can sign new client certificates inside the cluster.</p>
<p>The important command inside the pod:</p>
<pre><code class="lang-bash">/cockroach/cockroach cert create-client root \
  --certs-dir=/out \
  --ca-key=/ca/ca.key \
  --lifetime=5h \
  --overwrite
</code></pre>
<p>What this does:</p>
<ul>
<li><p>Generates a brand new public/private key pair for the <code>root</code> SQL user</p>
</li>
<li><p>Uses the CA private key to <strong>sign the client certificate</strong></p>
</li>
<li><p>Places everything inside <code>/out</code></p>
</li>
<li><p>Makes the certificate valid for <strong>5 hours</strong></p>
</li>
</ul>
<p>If we passed <code>demo</code> instead of <code>root</code>, then the certificate CN would be <code>demo</code>, and CockroachDB would treat anyone using that certificate as the <code>demo</code> SQL user.</p>
<p>That’s how CockroachDB identifies and authenticates users when running in secure mode.</p>
<h4 id="heading-step-2-deploy-the-pod">Step 2: Deploy the Pod</h4>
<p>Run:</p>
<pre><code class="lang-bash">kubectl apply -f gen-root-cert.yml
</code></pre>
<p>Give it a minute to start and generate the files.</p>
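<p>Rather than guessing when it's done, you can ask Kubernetes directly. A minimal sketch, assuming the pod was created in your current namespace and kept the name <code>gen-root-cert</code>:</p>
<pre><code class="lang-bash"># Block until the pod reports Ready (or give up after two minutes)
kubectl wait --for=condition=Ready pod/gen-root-cert --timeout=120s

# The script runs `ls -al /out` after creating the certificate, so the logs should list the generated files
kubectl logs gen-root-cert
</code></pre>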
<h4 id="heading-step-3-copy-the-certificates-to-your-local-pc">Step 3: Copy the Certificates to Your Local PC</h4>
<p>We need three files:</p>
<ul>
<li><p><code>client.root.crt</code>: client certificate</p>
</li>
<li><p><code>client.root.key</code>: private key</p>
</li>
<li><p><code>ca.crt</code>: CA certificate</p>
</li>
</ul>
<p>Copy them from the pod to your machine:</p>
<pre><code class="lang-bash">kubectl cp default/gen-root-cert:/out/client.root.crt ./client.root.crt
kubectl cp default/gen-root-cert:/out/client.root.key ./client.root.key
kubectl cp default/gen-root-cert:/out/ca.crt             ./ca.crt
</code></pre>
<p>Now your folder should contain:</p>
<pre><code class="lang-bash">client.root.crt
client.root.key
ca.crt
</code></pre>
<p>These are the files Beekeeper Studio needs for mTLS.</p>
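<p>Two optional bits of housekeeping are worth considering once the files are on your machine. Treat this as a suggestion rather than a required step: the helper pod still has the CA private key mounted and will otherwise keep sleeping for an hour, and some clients refuse private keys that other users can read.</p>
<pre><code class="lang-bash"># Restrict the private key to your own user
chmod 600 client.root.key

# Remove the helper pod now that the files have been copied out
kubectl delete pod gen-root-cert
</code></pre>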
<h4 id="heading-step-4-decode-the-client-certificate-just-like-we-did-for-the-node-certificate">Step 4: Decode the Client Certificate (Just Like We Did for the Node Certificate)</h4>
<p>Run:</p>
<pre><code class="lang-bash">openssl x509 -<span class="hljs-keyword">in</span> client.root.crt -text -noout &gt; crdb-root.crt.decoded
</code></pre>
<p>Open the <code>crdb-root.crt.decoded</code> file and look at the contents.</p>
<h4 id="heading-understanding-the-client-certificate">Understanding the Client Certificate</h4>
<ol>
<li><strong>Issuer</strong></li>
</ol>
<p>You'll see <code>Issuer: O = Cockroach, CN = Cockroach CA</code></p>
<p>This is the same Issuer as the CockroachDB node certificate.</p>
<p>This indicates that both certificates were signed by the <em>same</em> Certificate Authority, so each side can validate the other’s certificate and mTLS can work.</p>
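<p>Matching Issuer strings are a good sign, but you can also check the signature itself. A short sketch, assuming <code>ca.crt</code> and <code>client.root.crt</code> are in your current directory:</p>
<pre><code class="lang-bash"># Cryptographically check that client.root.crt was signed by the CA in ca.crt
openssl verify -CAfile ca.crt client.root.crt

# Print who the certificate is for and when it expires (we asked for a 5 hour lifetime)
openssl x509 -in client.root.crt -noout -subject -enddate
</code></pre>
<p>Keep an eye on that expiry time: once the certificate expires, you'll need to generate a new one before you can connect again.</p>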
<ol start="2">
<li><strong>Subject</strong></li>
</ol>
<p>You’ll see: <code>Subject: O = Cockroach, CN = root</code></p>
<p>This means that the Organization is just a label grouping CockroachDB identities, and that the Common Name is <code>root</code>. This is VERY important.</p>
<p>The CN of a client certificate literally tells CockroachDB:</p>
<blockquote>
<p>“This connection belongs to the SQL user named <code>root</code>.”</p>
</blockquote>
<p>If the CN were <code>demo</code>, CockroachDB would authenticate you as the <code>demo</code> SQL user.</p>
<h4 id="heading-extended-key-usage-eku">Extended Key Usage (EKU)</h4>
<p>You should see: <code>TLS Web Client Authentication</code>.</p>
<p>This is exactly what we want. It tells CockroachDB:</p>
<blockquote>
<p>“This certificate is only for clients connecting to the database.”</p>
</blockquote>
<p>Unlike node certificates, you will NOT see: <code>TLS Web Server Authentication</code>.</p>
<p>Why?</p>
<p>Because:</p>
<ul>
<li><p><strong>Server Authentication</strong> = for certificates the SERVER SHOWS TO THE CLIENT. For example: CockroachDB nodes proving they are legitimate.</p>
</li>
<li><p><strong>Client Authentication</strong> = for certificates THE CLIENT SENDS TO THE SERVER. For example: You proving you are the real <code>root</code> user.</p>
</li>
</ul>
<h4 id="heading-why-your-client-certificate-cannot-be-used-as-a-server-certificate">Why your client certificate <strong>cannot</strong> be used as a server certificate</h4>
<p>Because a server certificate says:</p>
<blockquote>
<p>“Trust me, I AM the CockroachDB server.”</p>
</blockquote>
<p>But your client certificate says:</p>
<blockquote>
<p>“Trust me, I am an authenticated user.”</p>
</blockquote>
<p>Two very different identities. And CockroachDB will <em>reject</em> any certificate used in the wrong role.</p>
<p>So having only TLS Web Client Authentication in your certificate is perfect for our use case. :)</p>
<h3 id="heading-connecting-to-our-cockroachdb-cluster-securely-using-mtls">Connecting to Our CockroachDB Cluster Securely (Using mTLS)</h3>
<p>Now that we’ve successfully generated the certificates and key pairs we need, it's time to use them to securely connect to our CockroachDB cluster from Beekeeper Studio.</p>
<p>Remember: CockroachDB is running in secure mode, so without these certificates, it will <em>reject our <code>root</code> connection</em>, no matter what else you enter in the form.</p>
<p>Let’s walk through the steps.👇🏾</p>
<h4 id="heading-step-1-make-sure-port-forwarding-is-still-running">Step 1: Make Sure Port Forwarding Is Still Running</h4>
<p>Before connecting, ensure that your CockroachDB cluster is still exposed to your PC.</p>
<p>If you already closed the previous terminal window, simply re-run this:</p>
<pre><code class="lang-bash">kubectl port-forward svc/crdb-cockroachdb-public 26259:26257
</code></pre>
<p>This makes your CockroachDB node reachable at: <code>localhost:26259</code>. If this step isn’t active, <em>Beekeeper Studio will not be able to connect</em>.</p>
<h4 id="heading-step-2-open-beekeeper-studio-and-set-up-the-connection">Step 2: Open Beekeeper Studio and Set Up the Connection</h4>
<p>Launch Beekeeper Studio and open a fresh connection window (Ctrl + Shift + N if needed).</p>
<p>Now fill in the fields like this:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>Value</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Connection Type</strong></td><td>CockroachDB</td></tr>
<tr>
<td><strong>Host</strong></td><td><code>localhost</code></td></tr>
<tr>
<td><strong>Port</strong></td><td><code>26259</code></td></tr>
<tr>
<td><strong>User</strong></td><td><code>root</code></td></tr>
<tr>
<td><strong>Default Database</strong></td><td><code>defaultdb</code></td></tr>
</tbody>
</table>
</div><p>Now enable the <strong>“Enable SSL”</strong> option. Once enabled, expand the SSL section and set the following three fields:</p>
<ul>
<li><p><strong>CA Cert:</strong> Set this to the location of: <code>ca.crt</code>. This is the root Certificate Authority file you copied earlier using: <code>kubectl cp default/gen-root-cert:/out/ca.crt ./ca.crt</code>. It should still be in your project’s root directory (for example, <code>cockroachdb-tutorial/</code>).</p>
</li>
<li><p><strong>Certificate:</strong> Set this to the location of: <code>client.root.crt</code></p>
</li>
<li><p><strong>Key File:</strong> Set this to the location of: <code>client.root.key</code></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763389469459/bbdb17c5-1c3b-4163-932f-3cd5382160f4.png" alt="Connecting to the CockroachDB cluster from Beekeeper Studio in &quot;Secure&quot; mode" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h4 id="heading-step-3-click-connect">Step 3: Click “Connect”</h4>
<p>Once all the fields are set properly, click <strong>Connect</strong>.</p>
<p>If everything was done correctly, you should now be connected to your CockroachDB cluster securely over Mutual TLS.</p>
<p>If the connection fails:</p>
<ul>
<li><p>Double-check your certificate paths</p>
</li>
<li><p>Ensure port-forwarding is running</p>
</li>
<li><p>Verify the user is <code>root</code></p>
</li>
<li><p>Confirm the selected connection type is <code>CockroachDB</code>.</p>
</li>
</ul>
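<p>If Beekeeper Studio still won't connect, it can help to rule out the GUI by trying the same connection from the command line. The sketch below assumes you have the <code>psql</code> client installed (this tutorial doesn't otherwise need it) and that you run it from the folder holding the three certificate files. I've picked <code>sslmode=verify-ca</code> here as an assumption: it checks the server certificate against <code>ca.crt</code> without also checking the hostname.</p>
<pre><code class="lang-bash"># Connect as root over mTLS and run a trivial query
psql "postgresql://root@localhost:26259/defaultdb?sslmode=verify-ca&amp;sslrootcert=ca.crt&amp;sslcert=client.root.crt&amp;sslkey=client.root.key" -c "SELECT 1;"
</code></pre>
<p>If this works but Beekeeper Studio doesn't, the problem is most likely in the connection form (paths or connection type) rather than in the certificates.</p>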
<h4 id="heading-step-4-run-your-first-secure-query">Step 4: Run Your First Secure Query</h4>
<p>Now that you're connected, let’s verify everything works by running:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SHOW</span> <span class="hljs-keyword">users</span>;
</code></pre>
<p>You should see two users automatically created by CockroachDB:</p>
<ul>
<li><p><strong>admin</strong></p>
</li>
<li><p><strong>root</strong></p>
</li>
</ul>
<p>In the next subsection, we’ll create a <strong>new SQL user</strong> and generate a certificate for that user (just like we did for the <code>root</code> user) so you’ll understand how CockroachDB handles user authentication in production environments.</p>
<h3 id="heading-restoring-our-previous-database-into-the-new-gke-cockroachdb-cluster-without-sa-keys">Restoring Our Previous Database into the New GKE CockroachDB Cluster (without SA keys)</h3>
<p>Now that our CockroachDB cluster is up and running on GKE – fully secured with TLS encryption and mTLS authentication – it’s time to bring back the data from our previous setup.</p>
<p>Remember how we backed up our CockroachDB database (running on Minikube) to Google Cloud Storage?</p>
<p>Well, now we’re going to restore that same backup into our new production cluster on GKE. But before CockroachDB can access our bucket, we must give it permission – securely.</p>
<p>And here’s the cool part: <strong>we don’t need to use Service Account keys anymore.</strong></p>
<h4 id="heading-why-we-dont-need-service-account-keys-on-gke">Why We Don’t Need Service Account Keys on GKE</h4>
<p>Earlier, in the backup section, we generated a Service Account key on our PC and mounted it into our Minikube cluster.</p>
<p>But for GKE, we intentionally left out the following fields in our <code>cockroachdb-production.yml</code>:</p>
<ul>
<li><p><code>env</code></p>
</li>
<li><p><code>volumes</code></p>
</li>
<li><p><code>volumeMounts</code></p>
</li>
</ul>
<p>The reason? GKE supports something called <strong>Workload Identity</strong>.</p>
<p>Workload Identity lets us securely connect Kubernetes Service Accounts (KSAs) to Google Cloud Service Accounts (GSAs), without storing or mounting any secret keys. The authentication happens “implicitly” thanks to Google’s metadata server.</p>
<p>💡 Workload Identity works easily when your cluster is running on GKE. It’s more complex to set up on Minikube, Kind, EKS, AKS, or any other non-GKE cluster.</p>
<h4 id="heading-step-1-linking-the-google-service-account-to-our-kubernetes-service-account">Step 1: Linking the Google Service Account to Our Kubernetes Service Account</h4>
<p>We already touched on this when deploying our cluster, but let’s look at the specific line again.</p>
<p>Open your <code>cockroachdb-production.yml</code> Helm values file and scroll to the <code>serviceAccount</code> section. You should see something like this:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">serviceAccount:</span>
    <span class="hljs-attr">create:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">"crdb-cockroachdb"</span>
    <span class="hljs-attr">annotations:</span>
      <span class="hljs-attr">iam.gke.io/gcp-service-account:</span> <span class="hljs-string">cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</span>
<span class="hljs-string">...</span>
</code></pre>
<p>Replace the <code>&lt;PROJECT_ID&gt;</code> placeholder with your real Google Cloud project ID.</p>
<p>If you’re unsure of the ID, go to Google Cloud Console, then to IAM &amp; Admin, and finally to Service Accounts. Search for <code>cockroachdb-backup</code> and copy the project ID from there.</p>
<p>This annotation instructs GKE to automatically authenticate our CockroachDB pods as the <code>cockroachdb-backup</code> Google Service Account – no keys needed.</p>
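<p>If you prefer the terminal to the Cloud Console, the project ID and the service account email can also be looked up with <code>gcloud</code>. A small sketch, assuming <code>gcloud</code> is already authenticated and pointed at the project you've been using:</p>
<pre><code class="lang-bash"># Print the project ID gcloud is currently configured to use
gcloud config get-value project

# List the service accounts in that project and look for cockroachdb-backup
gcloud iam service-accounts list
</code></pre>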
<h4 id="heading-step-2-binding-ksa-gsa-using-workload-identity">Step 2: Binding KSA ↔️ GSA Using Workload Identity</h4>
<p>Annotating the Service Account isn’t enough. We still need to explicitly allow our KSA to “impersonate” the GSA.</p>
<p>Run this command to set the active project:</p>
<pre><code class="lang-bash">gcloud config <span class="hljs-built_in">set</span> project &lt;PROJECT_ID&gt;
</code></pre>
<p>Now, apply the IAM policy binding:</p>
<pre><code class="lang-bash">gcloud iam service-accounts add-iam-policy-binding \
  &lt;GOOGLE_SERVICE_ACCOUNT&gt; \
  --role roles/iam.workloadIdentityUser \
  --member <span class="hljs-string">"serviceAccount:&lt;PROJECT_ID&gt;.svc.id.goog[&lt;NAMESPACE&gt;/&lt;KUBERNETES_SERVICE_ACCOUNT&gt;]"</span>
</code></pre>
<p>Replace the placeholders with:</p>
<ul>
<li><p><code>&lt;GOOGLE_SERVICE_ACCOUNT&gt;</code> with <code>cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</code></p>
</li>
<li><p><code>&lt;PROJECT_ID&gt;</code> with your GCP project ID</p>
</li>
<li><p><code>&lt;NAMESPACE&gt;</code> with where CockroachDB runs (<code>default</code>)</p>
</li>
<li><p><code>&lt;KUBERNETES_SERVICE_ACCOUNT&gt;</code> with <code>crdb-cockroachdb</code></p>
</li>
</ul>
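<p>The member string is easy to get wrong, so here is one way to assemble it from variables and review the final command before running it. This is only a sketch: <code>my-project-id</code> is a made-up project ID, and the script just prints the command instead of executing it.</p>
<pre><code class="lang-bash"># Hypothetical project ID used only for illustration; replace it with your own
PROJECT_ID="my-project-id"
NAMESPACE="default"
KSA="crdb-cockroachdb"
GSA="cockroachdb-backup@${PROJECT_ID}.iam.gserviceaccount.com"

# Workload Identity member string: PROJECT_ID.svc.id.goog[NAMESPACE/KSA]
MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA}]"

# Print the final command so you can review it before running it yourself
echo gcloud iam service-accounts add-iam-policy-binding "$GSA" \
  --role roles/iam.workloadIdentityUser \
  --member "\"$MEMBER\""
</code></pre>
<p>Once the printed command looks right, copy it into your terminal and run it.</p>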
<p>After a few seconds, you should see something like:</p>
<pre><code class="lang-yaml"><span class="hljs-string">Updated</span> <span class="hljs-string">IAM</span> <span class="hljs-string">policy</span> <span class="hljs-string">for</span> <span class="hljs-string">serviceAccount</span> [<span class="hljs-string">cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com</span>]<span class="hljs-string">.</span>
<span class="hljs-attr">bindings:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">members:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">serviceAccount:&lt;PROJECT_ID&gt;.svc.id.goog[default/crdb-cockroachdb]</span>
  <span class="hljs-attr">role:</span> <span class="hljs-string">roles/iam.workloadIdentityUser</span>
<span class="hljs-attr">etag:</span> <span class="hljs-string">***</span>
<span class="hljs-attr">version:</span> <span class="hljs-number">1</span>
</code></pre>
<p>Perfect. Your KSA can now act as the <code>cockroachdb-backup</code> GSA, so the CockroachDB pods can reach Google Cloud Storage without any keys.</p>
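<p>If you'd like to double-check both halves of the link, the two commands below show the binding on the Google side and the annotation on the Kubernetes side. It's a sketch that assumes the names used in this tutorial. Replace <code>&lt;PROJECT_ID&gt;</code> as before.</p>
<pre><code class="lang-bash"># Show the IAM policy on the Google Service Account; the workloadIdentityUser binding should be listed
gcloud iam service-accounts get-iam-policy cockroachdb-backup@&lt;PROJECT_ID&gt;.iam.gserviceaccount.com

# Show the Kubernetes Service Account and confirm the iam.gke.io/gcp-service-account annotation is present
kubectl get serviceaccount crdb-cockroachdb -o yaml
</code></pre>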
<h3 id="heading-restoring-our-previous-database-from-google-cloud-storage">Restoring Our Previous Database from Google Cloud Storage</h3>
<p>Now that authentication is set up, let’s restore the backup we previously created in the Minikube cluster.</p>
<p>Open Beekeeper Studio and reconnect to your CockroachDB cluster (the one running on GKE).</p>
<p>Before restoring anything, let’s check if the <code>books</code> table exists:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>You should see an error saying the table doesn’t exist. Don’t worry, that’s expected.</p>
<h3 id="heading-now-lets-restore-the-data">Now, Let’s Restore the Data 🎉</h3>
<p>Run this command:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">RESTORE</span> <span class="hljs-keyword">FROM</span> LATEST <span class="hljs-keyword">IN</span> <span class="hljs-string">'gs://&lt;BUCKET_NAME&gt;/cluster?AUTH=implicit'</span>;
</code></pre>
<p>Replace <code>&lt;BUCKET_NAME&gt;</code> with the name of the bucket you created earlier (for example: <code>cockroachdb-backup-7gw8u</code>).</p>
<p>CockroachDB will now:</p>
<ul>
<li><p>Authenticate using Workload Identity</p>
</li>
<li><p>Find the latest backup inside your bucket</p>
</li>
<li><p>Restore all tables, schemas, and data into your new GKE cluster</p>
</li>
</ul>
<p>After a couple of minutes, you should get a Success message.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763393752870/f95d76c0-3722-491a-a97c-a1b8a79bdc79.png" alt="Successfully restored CockroachDB database" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Now, run the query again:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom! Your books from the Minikube cluster should now appear inside the new CockroachDB cluster running on GKE 😃.</p>
<h3 id="heading-connecting-to-the-database-with-a-new-user">Connecting to the Database with a New User</h3>
<p>So far, we’ve been connecting to our CockroachDB cluster using the <code>root</code> user. While this is super convenient for tutorials, it’s not recommended for real apps.</p>
<p>This is because the <code>root</code> user has advanced privileges – basically, full access to your entire cluster. If an attacker got hold of these credentials, or your application was compromised, they could do <strong>A LOT</strong> of damage. 😬</p>
<p>Instead, it’s best practice to create a user with <strong>limited permissions</strong> for your apps. This way, even if the user is compromised, the damage is contained.</p>
<h4 id="heading-authentication-options-for-users">Authentication Options for Users</h4>
<p>CockroachDB is flexible when it comes to authentication:</p>
<ol>
<li><p><strong>Password Authentication:</strong> Create a user with a password and connect using just username + password (no client certificates required).</p>
</li>
<li><p><strong>Passwordless / Mutual TLS Authentication:</strong> Create a user without a password, then connect using client certificates signed by the same CA (like we did for <code>root</code>).</p>
</li>
<li><p><strong>Both Password + Mutual TLS:</strong> Create a user with a password and also connect using client certificates. This adds an extra layer of security.</p>
</li>
</ol>
<p>In this subsection, we’ll start simple and use password authentication.</p>
<h4 id="heading-step-1-create-the-new-user">Step 1: Create the New User</h4>
<p>Open your current connection in Beekeeper Studio (signed in as <code>root</code>) and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">USER</span> password_auth <span class="hljs-keyword">WITH</span> <span class="hljs-keyword">PASSWORD</span> <span class="hljs-string">'supersecret'</span>;
</code></pre>
<p>You should see a message confirming the user was created successfully.</p>
<h4 id="heading-step-2-connect-as-the-new-user">Step 2: Connect as the New User</h4>
<p>Open a new Beekeeper Studio window (Ctrl + Shift + N). <strong>DO NOT</strong> exit/close the old window, as we’ll need it later.</p>
<p>Fill in the connection fields:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Field</td><td>Value</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Connection Type</strong></td><td>CockroachDB</td></tr>
<tr>
<td><strong>Host</strong></td><td><code>localhost</code></td></tr>
<tr>
<td><strong>Port</strong></td><td><code>26259</code></td></tr>
<tr>
<td><strong>Database</strong></td><td><code>defaultdb</code></td></tr>
<tr>
<td><strong>User</strong></td><td><code>password_auth</code></td></tr>
<tr>
<td><strong>Password</strong></td><td><code>huh</code> (for now, we’ll try a wrong password to see it fail)</td></tr>
</tbody>
</table>
</div><p>Click Connect.</p>
<p>❌ You’ll see an error about SSL connection being required.</p>
<p>Even though we’re connecting with a password instead of certificates, <strong>enabling SSL is still important</strong>. It encrypts the data between Beekeeper Studio and CockroachDB.</p>
<p>Without it, sensitive info like passwords and queries could be intercepted (man-in-the-middle attacks).</p>
<h4 id="heading-step-3-enable-ssl-amp-ca-verification">Step 3: Enable SSL &amp; CA Verification</h4>
<ul>
<li><p>Tick <strong>Enable SSL</strong></p>
</li>
<li><p>Click the <strong>CA Cert</strong> field and select the <code>ca.crt</code> file in your project root (<code>cockroachdb-tutorial/</code>)</p>
</li>
</ul>
<p>This ensures that Beekeeper Studio verifies it’s really talking to our CockroachDB cluster and protects against attackers trying to intercept the connection.</p>
<p>Now, click Connect again.</p>
<p>❌ This time, you’ll see a <strong>Password authentication failed</strong> error instead, because we intentionally entered the wrong password.</p>
<h4 id="heading-step-4-connect-with-the-correct-password">Step 4: Connect With the Correct Password</h4>
<p>Replace the password with <code>supersecret</code>, then click Connect.</p>
<p>You are now signed in as the <code>password_auth</code> user!</p>
<h4 id="heading-step-5-check-permissions">Step 5: Check Permissions</h4>
<p>Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>❌ You should see an error stating that <code>password_auth</code> does not have permission to access the <code>books</code> table.</p>
<p>This is expected, as it confirms that our limited-access user can <strong>only access what we explicitly grant it</strong>. Even if this user’s credentials are compromised, an attacker can’t read or modify the rest of our database.</p>
<h4 id="heading-step-6-granting-access-to-specific-tables">Step 6: Granting Access to Specific Tables</h4>
<p>To allow <code>password_auth</code> to work with the <code>books</code> table, switch back to the <code>root</code> connection Beekeeper Studio window and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">USAGE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">SCHEMA</span> defaultdb.public <span class="hljs-keyword">TO</span> password_auth;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span>, <span class="hljs-keyword">INSERT</span>, <span class="hljs-keyword">UPDATE</span>, <span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> defaultdb.public.books <span class="hljs-keyword">TO</span> password_auth;
</code></pre>
<p>This gives the user read and write access to the <code>books</code> table only.</p>
<h4 id="heading-step-7-verify-the-new-user-access">Step 7: Verify the New User Access</h4>
<p>Go back to the Beekeeper Studio window where you’re signed in as <code>password_auth</code> and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom! You should now see the list of books from your restored database.</p>
<p>Our new user is fully functional with <strong>limited privileges</strong>, making it safe for use in real applications.</p>
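<p>If you’d like to repeat the same check from a terminal instead of Beekeeper Studio, here is a minimal sketch. It assumes you have the <code>psql</code> client installed locally (the tutorial doesn’t require it) and that CockroachDB is reachable on the same <code>localhost:26259</code> endpoint we used above. The <code>crdb_conninfo</code> helper is a hypothetical name made up for this example; it only assembles a standard libpq keyword/value connection string.</p>
<pre><code class="lang-bash"># Build a libpq keyword/value connection string that verifies the server
# certificate against our CA. Arguments: host port database user ca_cert_path
crdb_conninfo() {
  printf 'host=%s port=%s dbname=%s user=%s sslmode=verify-full sslrootcert=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example (commented out because it needs psql and a reachable cluster):
# PGPASSWORD='supersecret' psql "$(crdb_conninfo localhost 26259 defaultdb password_auth ./ca.crt)" \
#   -c 'SELECT count(*) FROM books;'
</code></pre>
<p><code>sslmode=verify-full</code> also checks that the server certificate lists the host name you connect to. If that check fails for <code>localhost</code> in your setup, <code>sslmode=verify-ca</code> still verifies the certificate against the CA without the host name check.</p>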
<h3 id="heading-connecting-with-passwordless-authentication-mutual-tls">Connecting with Passwordless Authentication (Mutual TLS)</h3>
<p>We’ve already seen how to connect to the database using a user that authenticates with a password, and without any client certificates.</p>
<p>Now, let’s look at the opposite scenario: passwordless authentication via Mutual TLS (mTLS).</p>
<p>This is one of the strongest forms of authentication because instead of a password, the database verifies you using a <strong>cryptographically signed certificate</strong>.</p>
<p>Let’s walk through it.</p>
<h4 id="heading-step-1-create-the-mtlsauth-user">Step 1: Create the <code>mtls_auth</code> User</h4>
<p>Navigate back to the Beekeeper Studio window where you're currently signed in as the <code>root</code> user. Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">USER</span> mtls_auth;
</code></pre>
<p>You should see a success message confirming that the user has been created.</p>
<p><strong>N.B.:</strong> If this query fails, there’s a good chance your <code>root</code> client certificate has expired. Remember that we set a <strong>5-hour lifetime</strong> when generating it earlier.</p>
<p>If this happens, delete the certificate-generation pod:</p>
<pre><code class="lang-bash">kubectl delete po/gen-root-cert
</code></pre>
<p>Then re-apply the <code>gen-root-cert.yml</code> manifest. Copy the newly generated <code>client.root.crt</code>, <code>client.root.key</code>, and <code>ca.crt</code> back to your PC. Then try creating the user again.</p>
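<p>Before regenerating anything, you can confirm whether the certificate has really expired. The sketch below assumes <code>openssl</code> is installed on your PC and that <code>client.root.crt</code> is in the current directory. <code>cert_still_valid</code> is a hypothetical helper name; it wraps the <code>-checkend</code> option of <code>openssl x509</code>, which exits with a non-zero status when the certificate expires within the given number of seconds.</p>
<pre><code class="lang-bash"># Succeeds (exit status 0) when the certificate stays valid for at least
# the given number of seconds (default 0, meaning "not expired right now").
# Arguments: certificate_path [seconds]
cert_still_valid() {
  openssl x509 -in "$1" -noout -checkend "${2:-0}"
}

# Example (commented out because it needs the certificate file):
# openssl x509 -in client.root.crt -noout -enddate
# if cert_still_valid client.root.crt; then echo "still valid"; else echo "expired, regenerate it"; fi
</code></pre>
<p>The first example line prints the expiry timestamp (<code>notAfter=...</code>), which is handy for seeing how much of the 5-hour lifetime is left.</p>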
<h4 id="heading-step-2-attempt-signing-in-as-mtlsauth-expect-failure">Step 2: Attempt Signing In as <code>mtls_auth</code> (Expect Failure)</h4>
<p>Open a new Beekeeper Studio window (Ctrl + Shift + N).</p>
<p>Try filling in the connection settings using:</p>
<ul>
<li><p>User: <code>mtls_auth</code></p>
</li>
<li><p>SSL enabled</p>
</li>
<li><p>CA Cert: <code>ca.crt</code></p>
</li>
<li><p>Client Cert: <code>client.root.crt</code></p>
</li>
<li><p>Client Key: <code>client.root.key</code></p>
</li>
</ul>
<p>Click Connect.</p>
<p>You’ll see an error message similar to this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763444971964/93f41787-425b-4e36-86da-4b688cef672f.png" alt="Connecting as the mtls_auth user with the wrong certificate and key-pair" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Why does this fail?</p>
<ol>
<li><p>The user has no password, so password login is impossible.</p>
</li>
<li><p>You’re using the <em>root</em> certificate, not a certificate belonging to <code>mtls_auth</code>. CockroachDB is strict: each user must authenticate using <em>their own</em> certificate.</p>
</li>
</ol>
<p>So let's fix that by generating a new certificate + key pair for the <code>mtls_auth</code> user.</p>
<h4 id="heading-step-3-create-certificate-key-for-mtlsauth">Step 3: Create Certificate + Key for <code>mtls_auth</code></h4>
<p>Just like we generated certificates for the <code>root</code> user earlier, we’ll do the same for <code>mtls_auth</code>.</p>
<p>Create a new manifest named <code>gen-mtls_auth-cert.yml</code>.</p>
<p>Paste in this content:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gen-mtls-auth-cert</span> 
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span> 
        <span class="hljs-attr">items:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.crt</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.crt</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.key</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.key</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">gen</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">cockroachdb/cockroach:v25.3.1</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">"sh"</span>, <span class="hljs-string">"-ec"</span>]
      <span class="hljs-attr">args:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-string">|
          mkdir -p /out
</span>
          <span class="hljs-comment"># Copy the CA certificate</span>
          <span class="hljs-string">cp</span> <span class="hljs-string">/ca/ca.crt</span> <span class="hljs-string">/out/ca.crt</span>

          <span class="hljs-comment"># Create the client certificate and key pair for user 'mtls_auth'</span>
          <span class="hljs-string">/cockroach/cockroach</span> <span class="hljs-string">cert</span> <span class="hljs-string">create-client</span> <span class="hljs-string">mtls_auth</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--certs-dir=/out</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--ca-key=/ca/ca.key</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--lifetime=5h</span> <span class="hljs-string">\</span>
            <span class="hljs-string">--overwrite</span>

          <span class="hljs-comment"># List generated files</span>
          <span class="hljs-string">ls</span> <span class="hljs-string">-al</span> <span class="hljs-string">/out</span>

          <span class="hljs-comment"># Keep pod alive for kubectl cp</span>
          <span class="hljs-string">sleep</span> <span class="hljs-number">3600</span>
      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> { <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>, <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/ca</span>, <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span> }
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">requests:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"50Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"10m"</span>
        <span class="hljs-attr">limits:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"500Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"50m"</span>
</code></pre>
<p>Apply this file, wait for the pod to start, then copy the generated files:</p>
<pre><code class="lang-bash">kubectl cp default/gen-mtls-auth-cert:/out/client.mtls_auth.crt ./client.mtls_auth.crt 
kubectl cp default/gen-mtls-auth-cert:/out/client.mtls_auth.key ./client.mtls_auth.key
kubectl cp default/gen-mtls-auth-cert:/out/ca.crt ./ca.crt
</code></pre>
<p>Now we have the correct certificate + key pair for our new user.</p>
<h4 id="heading-step-4-connect-as-mtlsauth">Step 4: Connect as <code>mtls_auth</code></h4>
<p>Go back to the new Beekeeper Studio window and update the SSL fields:</p>
<ul>
<li><p><strong>CA Cert:</strong> <code>ca.crt</code></p>
</li>
<li><p><strong>Certificate:</strong> <code>client.mtls_auth.crt</code></p>
</li>
<li><p><strong>Key File:</strong> <code>client.mtls_auth.key</code></p>
</li>
</ul>
<p>Click Connect.</p>
<p>This time, it should succeed instantly.</p>
<h4 id="heading-step-5-inspect-the-certificate">Step 5: Inspect the Certificate</h4>
<p>To understand how CockroachDB links certificates to users, decode the certificate:</p>
<pre><code class="lang-bash">openssl x509 -<span class="hljs-keyword">in</span> client.mtls_auth.crt -text -noout &gt; client.mtls_auth.crt.decoded
</code></pre>
<p>Open the file, scroll to the Subject field, and you’ll see:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">Subject:</span> <span class="hljs-string">O</span> <span class="hljs-string">=</span> <span class="hljs-string">Cockroach,</span> <span class="hljs-string">CN</span> <span class="hljs-string">=</span> <span class="hljs-string">mtls_auth</span>
<span class="hljs-string">...</span>
</code></pre>
<p>The <code>CN</code> (Common Name) is the username CockroachDB uses to authenticate the session.</p>
<p>This is how CockroachDB knows you’re connecting as the <code>mtls_auth</code> user without any password at all. :)</p>
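<p>If you only want the username and not the full decoded dump, a small helper can pull out just the Common Name. This is a sketch that assumes <code>openssl</code> is available locally. <code>cert_common_name</code> is a hypothetical name, and the parsing is deliberately simple: it expects a CN without commas, which is true for usernames like ours.</p>
<pre><code class="lang-bash"># Print the Common Name (CN) from a PEM certificate's Subject.
# With -nameopt RFC2253 the subject looks like: subject=CN=mtls_auth,O=Cockroach
cert_common_name() {
  openssl x509 -in "$1" -noout -subject -nameopt RFC2253 \
    | sed 's/^subject= *//' \
    | tr ',' '\n' \
    | sed -n 's/^CN=//p' \
    | head -n 1
}

# Example (commented out because it needs the certificate file):
# cert_common_name client.mtls_auth.crt
</code></pre>
<p>Running it against <code>client.root.crt</code> is a quick way to see why that certificate could not be used to sign in as <code>mtls_auth</code> earlier.</p>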
<h4 id="heading-step-6-try-reading-the-books-table">Step 6: Try Reading the Books Table</h4>
<p>Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>❌ You’ll get a permission error, just like we did earlier with the <code>password_auth</code> user.</p>
<p>This is expected because <code>mtls_auth</code> has <em>no</em> privileges yet. Perfect!</p>
<h4 id="heading-step-7-grant-permissions-to-mtlsauth">Step 7: Grant Permissions to <code>mtls_auth</code></h4>
<p>Switch to the Beekeeper Studio window where you're signed in as <code>root</code>, and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">USAGE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">SCHEMA</span> defaultdb.public <span class="hljs-keyword">TO</span> mtls_auth;
<span class="hljs-keyword">GRANT</span> <span class="hljs-keyword">SELECT</span>, <span class="hljs-keyword">INSERT</span>, <span class="hljs-keyword">UPDATE</span>, <span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">ON</span> <span class="hljs-keyword">TABLE</span> defaultdb.public.books <span class="hljs-keyword">TO</span> mtls_auth;
</code></pre>
<p>You should see a success message.</p>
<p>Now return to the <code>mtls_auth</code> session and run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> books;
</code></pre>
<p>Boom! You should now see your previously restored list of books.</p>
<p>You’ve successfully connected using passwordless, certificate-based authentication and granted controlled permissions to the new user. :)</p>
<h3 id="heading-connecting-via-mutual-tls-mtls-from-our-apps-on-kubernetes">Connecting via Mutual TLS (mTLS) from Our Apps on Kubernetes</h3>
<p>So far, we’ve been connecting to our CockroachDB cluster <em>securely</em> using Beekeeper Studio thanks to our TLS certificates and mTLS authentication.</p>
<p>But…what happens when we have applications running inside our Kubernetes cluster that need to talk to CockroachDB as well?</p>
<p>Exactly: those apps also need to authenticate using client certificates.</p>
<p>And that brings us to a very important point…</p>
<h4 id="heading-why-we-should-not-generate-client-certificates-using-pods-the-dangerous-way">Why We Should <em>Not</em> Generate Client Certificates Using Pods (The Dangerous Way)</h4>
<p>Up until now, we’ve been generating our client certificates using Kubernetes Pods like:</p>
<ul>
<li><p><code>gen-root-cert</code></p>
</li>
<li><p><code>gen-mtls-auth-cert</code></p>
</li>
</ul>
<p>They <em>work</em>, yes…but they’re not safe for production.</p>
<p>Why? Because these Pods <strong>mount our Certificate Authority (CA) key</strong> inside the container:</p>
<pre><code class="lang-yaml"><span class="hljs-string">...</span>
<span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-ca</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-cockroachdb-ca-secret</span>
        <span class="hljs-attr">items:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.crt</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.crt</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">ca.key</span>
            <span class="hljs-attr">path:</span> <span class="hljs-string">ca.key</span>
<span class="hljs-string">...</span>
</code></pre>
<p>This is a <em>big</em> security risk!</p>
<p>If an attacker ever gains access to that pod?</p>
<p>🔥 Your CA key is exposed<br>🔥 They can generate <em>their own trusted certificates</em><br>🔥 They can impersonate ANY client/user, including the <code>root</code> and <code>admin</code> users<br>🔥 They’ll have full access to your CockroachDB cluster</p>
<p>And they’ll keep that access <strong>forever</strong>, until you rotate the CA key (which is painful and disruptive).</p>
<p>This is why CockroachDB strongly advises against mounting CA keys into Pods.</p>
<h4 id="heading-the-right-way-using-cert-manager-recommended-by-cockroachdb">The Right Way: Using Cert Manager (Recommended by CockroachDB)</h4>
<p>CockroachDB’s <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/secure-cockroachdb-kubernetes?filters=helm#deploy-cert-manager-for-mtls">official docs recommend</a> managing client certificates using <strong>cert-manager</strong>.</p>
<p>This is because instead of YOU exposing your CA key inside Pods, cert-manager handles everything <em>internally and securely:</em></p>
<ul>
<li><p>Cert-manager stores and protects your CA key</p>
</li>
<li><p>It generates client certificates for you</p>
</li>
<li><p>It issues private keys <em>without ever exposing your CA key</em></p>
</li>
<li><p>It auto-renews certificates before they expire</p>
</li>
<li><p>And it gives you production-grade certificate lifecycle management</p>
</li>
</ul>
<h4 id="heading-but-wait-dont-we-need-the-ca-key-to-generate-client-certificates">But Wait: Don’t We Need the CA Key to Generate Client Certificates?</h4>
<p>Great question.</p>
<p>Yes, normally you need the CA key to sign client certificates…but <strong>cert-manager takes care of that for us</strong>.</p>
<p>You simply:</p>
<ol>
<li><p>Create an Issuer (or ClusterIssuer)</p>
</li>
<li><p>Tell cert-manager to use your CockroachDB CA</p>
</li>
<li><p>Request a Certificate</p>
</li>
</ol>
<p>Then cert-manager automatically:</p>
<ol>
<li><p>Signs it</p>
</li>
<li><p>Stores it in a Kubernetes Secret (where it’s safe)</p>
</li>
<li><p>Rotates it before expiry</p>
</li>
<li><p>Keeps your CA key completely secure</p>
</li>
</ol>
<p>No more exposing the CA key in Pods. No more writing custom Kubernetes Pods.</p>
<h4 id="heading-certificate-rotation-another-huge-win">Certificate Rotation — Another Huge Win</h4>
<p>Let’s talk about expirations.</p>
<p>Right now:</p>
<ul>
<li><p>The <code>mtls_auth</code> client cert we generated manually has <strong>5 hours</strong> validity</p>
</li>
<li><p>After 5 hours, it expires</p>
</li>
<li><p>Your apps will fail all DB connections</p>
</li>
<li><p>You’d need to regenerate a new certificate manually</p>
</li>
<li><p>Or worse: create a CronJob to regenerate them every 4 hours</p>
</li>
</ul>
<p>This is messy and unsafe.</p>
<p>With cert-manager?</p>
<ul>
<li><p>Certificates are automatically rotated</p>
</li>
<li><p>Renewed before expiration</p>
</li>
<li><p>No downtime</p>
</li>
<li><p>No manual intervention</p>
</li>
<li><p>Apps easily reload the new certificates</p>
</li>
</ul>
<h4 id="heading-alright-lets-install-cert-manager">Alright — Let’s Install Cert Manager</h4>
<p>To start using cert-manager, install it using the Helm chart:</p>
<pre><code class="lang-bash">helm repo add cert-manager https://charts.jetstack.io

helm install cert-manager cert-manager/cert-manager \
  --<span class="hljs-built_in">set</span> crds.enabled=<span class="hljs-literal">true</span> \
  --create-namespace \
  -n cert-manager \
  --version 1.19.1
</code></pre>
<p>Once cert-manager is installed, we’ll:</p>
<ol>
<li><p>Create an <strong>Issuer</strong> that uses our CockroachDB CA</p>
</li>
<li><p>Create a <strong>Certificate</strong> for our <code>mtls_auth</code> user</p>
</li>
<li><p>Mount that Certificate into our application Pods</p>
</li>
<li><p>Connect securely to CockroachDB via mTLS from inside Kubernetes</p>
</li>
</ol>
<p>That’s what we’ll walk through next.</p>
<p>Before cert-manager can issue our certificates, it needs an <strong>Issuer</strong>. And before creating an Issuer, we need a secret that contains our CA certificate and CA key using the correct key names.</p>
<h4 id="heading-creating-a-ca-secret-for-the-issuer">Creating a CA Secret for the Issuer</h4>
<p>cert-manager’s <code>Issuer</code> is a bit picky about the secret format. It expects the secret to contain two keys:</p>
<ul>
<li><p><code>tls.crt</code>: the CA certificate</p>
</li>
<li><p><code>tls.key</code>: the CA private key</p>
</li>
</ul>
<p>But the CockroachDB Helm chart automatically generates a secret named <code>crdb-cockroachdb-ca-secret</code>, which uses different key names:</p>
<ul>
<li><p><code>ca.crt</code></p>
</li>
<li><p><code>ca.key</code></p>
</li>
</ul>
<p>So even though this secret contains exactly what we need, cert-manager won’t accept it because the keys are not named the way it expects.</p>
<p>To fix this, we’ll re-create a new secret with the correct key names. First, copy the existing CA files from Kubernetes to your local machine:</p>
<pre><code class="lang-bash">kubectl get secret crdb-cockroachdb-ca-secret -o jsonpath=<span class="hljs-string">'{.data.ca\.crt}'</span> | base64 -d &gt; ca.crt
</code></pre>
<p>If you get a “permission denied” error, simply delete any existing <code>ca.crt</code> file in your project directory and run the command again.</p>
<p>Now copy the key:</p>
<pre><code class="lang-bash">kubectl get secret crdb-cockroachdb-ca-secret -o jsonpath=<span class="hljs-string">'{.data.ca\.key}'</span> | base64 -d &gt; ca.key
</code></pre>
<p>Next, create the properly formatted secret:</p>
<pre><code class="lang-bash">kubectl create secret tls crdb-ca-issuer-secret --cert=ca.crt --key=ca.key
</code></pre>
<p>If you describe it:</p>
<pre><code class="lang-bash">kubectl describe secret crdb-ca-issuer-secret
</code></pre>
<p>You should now see <code>tls.crt</code> and <code>tls.key</code> in the <code>Data</code> section – exactly what cert-manager needs.</p>
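<p>As an optional sanity check, you can confirm that the <code>ca.crt</code> and <code>ca.key</code> files you just exported really belong together by comparing their public keys. This sketch assumes <code>openssl</code> is installed locally and that both files are in the current directory. <code>key_matches_cert</code> is a hypothetical helper name.</p>
<pre><code class="lang-bash"># Succeeds when the private key and the certificate share the same public key.
# Arguments: certificate_path private_key_path
key_matches_cert() {
  local cert_pub key_pub
  # Public key embedded in the certificate
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey)"
  # Public key derived from the private key
  key_pub="$(openssl pkey -in "$2" -pubout)"
  # Treat an unreadable certificate as a mismatch
  [ -n "$cert_pub" ] || return 1
  [ "$cert_pub" = "$key_pub" ]
}

# Example (commented out because it needs the exported files):
# if key_matches_cert ca.crt ca.key; then echo "CA pair OK"; else echo "mismatch"; fi
</code></pre>
<p>Keep in mind that <code>ca.key</code> is now sitting unencrypted on your machine. Once you no longer need the local copy, deleting it is a sensible precaution.</p>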
<h4 id="heading-creating-the-issuer">Creating the Issuer</h4>
<p>Now that we have a properly formatted CA secret, we can create the Issuer that cert-manager will use to sign our client certificates.</p>
<p>Create a file called <code>crdb-issuer.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cert-manager.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Issuer</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-issuer</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ca:</span>
    <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-ca-issuer-secret</span>
</code></pre>
<p>Apply it:</p>
<pre><code class="lang-bash">kubectl apply -f crdb-issuer.yml
</code></pre>
<p>Confirm that it’s ready:</p>
<pre><code class="lang-bash">kubectl get issuer crdb-issuer
</code></pre>
<p>The <code>Ready</code> column should display <code>True</code>.</p>
<h4 id="heading-creating-the-certificate-manifest">Creating the Certificate Manifest</h4>
<p>Now we’ll define a Certificate object. This doesn’t create the client certificate instantly – instead, it tells cert-manager <strong>what kind</strong> of certificate we need. cert-manager then generates and stores the certificate automatically.</p>
<p>Create a file named <code>crdb-mtls_auth-certificate.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">cert-manager.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Certificate</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-mtls-auth-certificate</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-mtls-auth-certificate</span> <span class="hljs-comment"># Secret that will hold the cert+key</span>
  <span class="hljs-attr">commonName:</span> <span class="hljs-string">mtls_auth</span> <span class="hljs-comment"># MUST match Cockroach SQL role</span>
  <span class="hljs-attr">duration:</span> <span class="hljs-string">24h</span> <span class="hljs-comment"># 1 day</span>
  <span class="hljs-attr">renewBefore:</span> <span class="hljs-string">20h</span> <span class="hljs-comment"># renew 20 hours before expiry (about 4 hours after issuance)</span>
  <span class="hljs-attr">privateKey:</span>
    <span class="hljs-attr">algorithm:</span> <span class="hljs-string">RSA</span>
    <span class="hljs-attr">size:</span> <span class="hljs-number">2048</span>
    <span class="hljs-attr">encoding:</span> <span class="hljs-string">PKCS8</span>
  <span class="hljs-attr">usages:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">client</span> <span class="hljs-string">auth</span> <span class="hljs-comment"># important: client certificate</span>
  <span class="hljs-attr">issuerRef:</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-issuer</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Issuer</span>
    <span class="hljs-attr">group:</span> <span class="hljs-string">cert-manager.io</span>
</code></pre>
<p>Let’s look at the important properties so we can understand what the Certificate resource does:</p>
<ul>
<li><p><strong>secretName:</strong> The Kubernetes secret where cert-manager will store the generated certificate, key, and CA certificate. This is where your apps will later mount the certificate files from.</p>
</li>
<li><p><strong>commonName:</strong> Very important! This must match the <strong>CockroachDB SQL user</strong> (<code>mtls_auth</code>), because CockroachDB uses the certificate’s Common Name to identify the connecting user.</p>
</li>
<li><p><strong>duration</strong> and <strong>renewBefore:</strong> <code>duration</code> defines how long the certificate is valid. <code>renewBefore</code> tells cert-manager how long before expiry to renew it, so a fresh certificate is always in place before the old one expires (avoiding downtime).</p>
</li>
<li><p><strong>usages:</strong> Tells cert-manager what the certificate is for. <code>client auth</code> ensures this certificate is only used by clients connecting to servers, not the other way around.</p>
</li>
<li><p><strong>issuerRef:</strong> Points to the Issuer we created earlier. This tells cert-manager <em>who</em> should sign the certificate.</p>
</li>
</ul>
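<p>To make the renewal timing concrete: cert-manager renews a certificate <code>renewBefore</code> ahead of its expiry, so the renewal happens roughly <code>duration</code> minus <code>renewBefore</code> after issuance. Here is that arithmetic as a tiny sketch, using whole hours for simplicity. <code>renewal_offset_hours</code> is a hypothetical helper name.</p>
<pre><code class="lang-bash"># Hours after issuance at which renewal is due:
# duration minus renewBefore (both given in whole hours).
renewal_offset_hours() {
  echo $(( $1 - $2 ))
}

# With the values from our manifest (duration: 24h, renewBefore: 20h):
renewal_offset_hours 24 20   # prints 4
</code></pre>
<p>So with our manifest, each certificate is replaced about 4 hours after it is issued, long before its 24-hour lifetime runs out.</p>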
<p>Apply the manifest:</p>
<pre><code class="lang-bash">kubectl apply -f crdb-mtls_auth-certificate.yml
</code></pre>
<p>After a few seconds, cert-manager will generate the certificate.</p>
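<p>Rather than guessing how long to wait, you can ask <code>kubectl</code> to block until the resource reports <code>Ready</code>. This is a small sketch that assumes the Certificate lives in your current namespace. <code>wait_until_ready</code> is a hypothetical wrapper name, and the 120-second default timeout is an arbitrary choice.</p>
<pre><code class="lang-bash"># Block until a resource reports the Ready condition.
# Arguments: resource (kind/name) [timeout, default 120s]
wait_until_ready() {
  kubectl wait --for=condition=Ready "$1" --timeout="${2:-120s}"
}

# Example (commented out because it needs a running cluster):
# wait_until_ready certificate/crdb-mtls-auth-certificate
</code></pre>
<p>The same wrapper works for the Issuer, for example <code>wait_until_ready issuer/crdb-issuer 30s</code>.</p>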
<p>Check the secret:</p>
<pre><code class="lang-bash">kubectl get secret crdb-mtls-auth-certificate
</code></pre>
<p>Describe it to view the keys:</p>
<pre><code class="lang-bash">kubectl describe secret crdb-mtls-auth-certificate
</code></pre>
<p>You should see:</p>
<ul>
<li><p><code>tls.crt</code></p>
</li>
<li><p><code>tls.key</code></p>
</li>
<li><p><code>ca.crt</code></p>
</li>
</ul>
<p>These are the files the application will use.</p>
<p>If we copied the content of <code>tls.crt</code> to our local machine and decoded it using the <code>openssl x509...</code> command, we’d see details similar to those in the <code>client.mtls_auth.crt</code> client certificate we generated earlier, with the Common Name (CN) being <code>mtls_auth</code>.</p>
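<p>You can also check that the certificate cert-manager issued really chains up to our CockroachDB CA. The sketch below assumes <code>openssl</code> is installed locally and that the <code>ca.crt</code> file exported earlier is in the current directory. <code>verify_with_ca</code> is a hypothetical helper name; it relies on <code>openssl verify</code> reading the certificate from standard input when no certificate file is passed.</p>
<pre><code class="lang-bash"># Verify that a PEM certificate read from stdin was signed by the given CA.
# Arguments: ca_certificate_path
verify_with_ca() {
  openssl verify -CAfile "$1"
}

# Example (commented out because it needs a running cluster):
# kubectl get secret crdb-mtls-auth-certificate -o jsonpath='{.data.tls\.crt}' \
#   | base64 -d | verify_with_ca ./ca.crt
</code></pre>
<p>A successful check ends with <code>OK</code>. A failure here usually means the Issuer is pointing at a different CA than the one CockroachDB trusts.</p>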
<h4 id="heading-creating-a-pod-that-connects-using-the-client-certificate">Creating a Pod That Connects Using the Client Certificate</h4>
<p>Now let’s create a simple Pod that uses our new client certificate to connect to CockroachDB.</p>
<p>Create a file called <code>books-pod.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">books-pod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
  <span class="hljs-attr">volumes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-certs</span>
      <span class="hljs-attr">secret:</span>
        <span class="hljs-attr">secretName:</span> <span class="hljs-string">crdb-mtls-auth-certificate</span>
        <span class="hljs-comment"># Make secret files read-only for the owner only: 0400 (without this, the Python app will throw an error). However, this is not compulsory for all apps, just the one used in this tutorial :)</span>
        <span class="hljs-attr">defaultMode:</span> <span class="hljs-number">0400</span>
  <span class="hljs-attr">containers:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">books</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">prince2006/cockroachdb-tutorial-python-app:new</span>
      <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">Always</span>
      <span class="hljs-attr">env:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DATABASE_URL</span>
          <span class="hljs-attr">value:</span> <span class="hljs-string">&gt;-
            postgresql://mtls_auth@crdb-cockroachdb-public.default:26257/defaultdb?sslmode=verify-full&amp;sslrootcert=/crdb-certs/ca.crt&amp;sslcert=/crdb-certs/tls.crt&amp;sslkey=/crdb-certs/tls.key
</span>      <span class="hljs-attr">volumeMounts:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">crdb-certs</span>
          <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/crdb-certs</span>
          <span class="hljs-attr">readOnly:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">limits:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"100Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"50m"</span>
        <span class="hljs-attr">requests:</span>
          <span class="hljs-attr">memory:</span> <span class="hljs-string">"50Mi"</span>
          <span class="hljs-attr">cpu:</span> <span class="hljs-string">"10m"</span>
</code></pre>
<p>Here’s what’s happening:</p>
<ul>
<li><p>We mount the generated certificate secret into <code>/crdb-certs</code>.</p>
</li>
<li><p>The Python app uses those certificate files (<code>tls.crt</code>, <code>tls.key</code>, <code>ca.crt</code>) to authenticate.</p>
</li>
<li><p>The connection string does <strong>NOT</strong> include a password. CockroachDB authenticates the user entirely via the certificate’s Common Name.</p>
</li>
</ul>
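<p>One detail about <code>defaultMode: 0400</code> in the manifest above: YAML reads that value as an octal number, but JSON has no octal notation, so a JSON manifest would need the decimal equivalent instead. As a small side sketch (not part of the original setup), you can check the conversion in a shell:</p>
<pre><code class="lang-bash"># bash's printf treats a leading 0 as octal, so this prints the decimal value of 0400 (256)
printf '%d\n' 0400
</code></pre>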
<p>Apply the Pod:</p>
<pre><code class="lang-bash">kubectl apply -f books-pod.yml
</code></pre>
<p>After about a minute, view the logs:</p>
<pre><code class="lang-bash">kubectl logs books-pod
</code></pre>
<p>Or if the Pod already restarted:</p>
<pre><code class="lang-bash">kubectl logs -p books-pod
</code></pre>
<p>You should see a successful connection to CockroachDB using the <code>mtls_auth</code> user, followed by a list of books.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763534354156/60114f7b-ba62-4706-a0b7-7629e20bfaaa.png" alt="List of books from our books-pod logs" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>If you remove the certificate files or try connecting without them, the app will fail – as expected.</p>
<p><strong>Congratulations!</strong></p>
<p>You’ve officially built a fully secure, production-ready CockroachDB cluster on Kubernetes – complete with:</p>
<ul>
<li><p>End-to-end encryption (TLS)</p>
</li>
<li><p>Mutual TLS authentication (mTLS) for users and apps</p>
</li>
<li><p>Automated, daily backups to Google Cloud Storage</p>
</li>
<li><p>Proper certificate rotation with cert-manager</p>
</li>
</ul>
<h2 id="heading-how-to-get-a-cockroachdb-enterprise-license-for-free">How to Get a CockroachDB Enterprise License for Free</h2>
<p>Okay, so here’s the thing: even though you’ve built a super professional CockroachDB cluster, there’s one small catch. <strong>Without a license, your cluster might be “throttled.”</strong></p>
<p>We know this because the dashboard shows a message warning that the cluster is being throttled.</p>
<p>That means things slow down: queries take longer, performance gets worse, and scaling up won’t magically make it faster. Yeah, it’s real. 🥲</p>
<p>Why does this happen? Because CockroachDB’s “full feature set” is under a special license. If you don’t set a valid license, it limits how many SQL transactions you can run at a time.</p>
<h3 id="heading-three-types-of-licenses">Three Types of Licenses</h3>
<p>Here’s a breakdown of the different kinds of CockroachDB licenses and what they mean for you:</p>
<ol>
<li><p><strong>Trial License</strong></p>
<ul>
<li><p>Valid for <strong>30 days</strong>.</p>
</li>
<li><p>Lets you try all the “Enterprise” features.</p>
</li>
<li><p>You <em>must</em> send telemetry (more on that soon) while the trial is active.</p>
</li>
</ul>
</li>
<li><p><strong>Enterprise License (Paid)</strong></p>
<ul>
<li><p>This is CockroachDB’s “premium / fully paid” version.</p>
</li>
<li><p>You can pick the kind of license based on your environment: “Production”, “Pre-production”, or “Development.”</p>
</li>
<li><p>Companies with more than <strong>$10 million in annual revenue</strong> need to pay for this license.</p>
</li>
<li><p>There <em>are</em> discounts, startup perks, or “free” versions for smaller companies (more below).</p>
</li>
</ul>
</li>
<li><p><strong>Enterprise Free License</strong></p>
<ul>
<li><p>This is the magic one for early-stage companies or startups: it has exactly the same features as the paid Enterprise license. But it’s free if your business makes <strong>under $10 million per year</strong>.</p>
</li>
<li><p>You <em>do</em> need to renew it each year.</p>
</li>
<li><p>Support for this “Free” license is <strong>community-level</strong> (forums, docs), not paid enterprise.</p>
</li>
</ul>
</li>
</ol>
<p><strong>N.B.:</strong> To keep your free license active and <em>not</em> get throttled, CockroachDB requires telemetry. Telemetry means your cluster sends some usage data back to Cockroach Labs. And no, they’re not “stealing your data”. Here’s what that actually means:</p>
<ul>
<li><p>Telemetry includes basic usage stats, cluster health info, and configuration metrics.</p>
</li>
<li><p>It does NOT send your business data, queries, or personal customer data.</p>
</li>
<li><p>It helps Cockroach Labs <em>make sure the free license is used responsibly</em>, and helps them build better features.</p>
</li>
<li><p>If you stop sending telemetry, your cluster will be throttled (slowed down) after 7 days.</p>
</li>
</ul>
<h3 id="heading-how-to-apply-for-the-free-enterprise-license">How to Apply for the Free Enterprise License</h3>
<p>Here’s how you can try to get that free enterprise license:</p>
<ol>
<li><p>Go to the CockroachDB Cloud Console (sign up if you don’t have an account). Then click the “Organization” link in the menu and choose “Enterprise Licenses” from the dropdown.</p>
</li>
<li><p>Click the Create License button → Enable the “Find out if my company qualifies for an Enterprise Free license” option.</p>
</li>
<li><p>Fill in the form: your name, company name, job function, and the intended use of the license.</p>
</li>
<li><p>Click “Continue”.</p>
</li>
</ol>
<p>You should see this success message: “Based on your company's intended use, you qualify for an Enterprise Free license.” Now agree to the terms and conditions, then click the “Generate License key” button.</p>
<p>Learn more about CockroachDB licenses here 👉🏾 <a target="_blank" href="https://www.cockroachlabs.com/docs/stable/licensing-faqs">https://www.cockroachlabs.com/docs/stable/licensing-faqs</a></p>
<h3 id="heading-adding-your-license-to-the-cockroachdb-cluster">Adding Your License to the CockroachDB Cluster</h3>
<p>Now that you’ve gotten your shiny new CockroachDB license (whether it’s the Free one or the Enterprise one), the next step is…actually <em>using it</em>.</p>
<p>Let’s add it to your CockroachDB cluster so it stops shouting “THROTTLED!” at you every time you open the dashboard :)</p>
<p>We’ll do this by updating our CockroachDB Helm configuration.</p>
<h4 id="heading-step-1-update-your-cockroachdb-productionyml">Step 1: Update Your <code>cockroachdb-production.yml</code></h4>
<p>Open your production Helm values file, and inside the <code>init</code> section, add the following:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">init:</span>
<span class="hljs-string">...</span>
    <span class="hljs-attr">provisioning:</span>
        <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
        <span class="hljs-attr">clusterSettings:</span>
          <span class="hljs-attr">cluster.organization:</span> <span class="hljs-string">"'&lt;ORGANIZATION&gt;'"</span> <span class="hljs-comment"># Enter the name of your organization here </span>
          <span class="hljs-attr">enterprise.license:</span> <span class="hljs-string">"'&lt;LICENSE&gt;'"</span> <span class="hljs-comment"># Enter your CockroachDB Enterprise license key here</span>
<span class="hljs-string">...</span>
</code></pre>
<p>Now replace:</p>
<ul>
<li><p><code>&lt;ORGANIZATION&gt;</code> with the name of your startup, business, project, or company</p>
</li>
<li><p><code>&lt;LICENSE&gt;</code> with the exact license string CockroachDB gave you</p>
</li>
</ul>
<p>That’s it – super simple.</p>
<h4 id="heading-step-2-apply-the-changes-with-helm">Step 2: Apply the Changes With Helm</h4>
<p>Run your usual Helm upgrade command:</p>
<pre><code class="lang-bash">helm upgrade cockroachdb -f cockroachdb-production.yml cockroachdb/cockroachdb
</code></pre>
<h4 id="heading-step-3-confirm-the-license-was-added-correctly">Step 3: Confirm the License Was Added Correctly</h4>
<p>Now let’s double-check everything worked.</p>
<ol>
<li><p>Connect as the <code>root</code> user: You can connect using Beekeeper Studio (like we’ve been doing).</p>
</li>
<li><p>Run this query to check your license:</p>
</li>
</ol>
<pre><code class="lang-sql"><span class="hljs-keyword">SHOW</span> CLUSTER SETTING enterprise.license;
</code></pre>
<p>If everything went well, you should see your license key printed out in the results.</p>
<h4 id="heading-step-4-make-sure-telemetry-is-enabled-important">Step 4: Make Sure Telemetry Is Enabled (Important!)</h4>
<p>Remember: without telemetry enabled, your cluster will still get throttled, even if you have a valid license 🥲</p>
<p>Run:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SHOW</span> CLUSTER SETTING diagnostics.reporting.enabled;
</code></pre>
<p>If the result says “true”, you're good! Telemetry is on, CockroachDB can verify your license, and your cluster will behave normally without slowing down.</p>
<h2 id="heading-conclusion-amp-next-steps"><strong>Conclusion &amp; Next Steps ✨</strong></h2>
<p>Throughout this book, you’ve gone from “What even is CockroachDB?” to actually running your <strong>own secure, production-ready database</strong> on Kubernetes – and that’s a BIG deal. 🎉</p>
<p>You learned why CockroachDB is special, how it avoids downtime, and why it’s different from the usual databases everyone talks about.</p>
<p>Then you set up your own local environment, practiced everything safely on Minikube, and gradually built your way to a full production setup on GKE.</p>
<p>You explored CockroachDB’s dashboard, checked your cluster’s health, backed up your data to the cloud, and even learned how to keep your database fast, stable, and ready to grow when needed.</p>
<p>Finally, you deployed it on Google Cloud, secured it with encryption and certificates, and connected to it from your own PC – all step-by-step.</p>
<p>By now, you’ve basically gone from curious learner to “I can actually run this thing in production.” 🚀</p>
<p>You’ve covered a lot – and you’ve built something powerful, modern, and production-worthy. Amazing job 👏🏾😁!! And thanks for reading.</p>
<h3 id="heading-about-the-author">About the Author 👨🏾‍💻</h3>
<p>Hi, I’m Prince! I’m a DevOps engineer and Cloud architect passionate about building, deploying, architecting, and managing applications and sharing knowledge with the tech community.</p>
<p>If you enjoyed this book, you can learn more about me by exploring more of my blogs and projects on my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/">LinkedIn profile</a>, and reach out to me on <a target="_blank" href="https://x.com/POnukwili">Twitter (X)</a>. You can find more of my <a target="_blank" href="https://www.linkedin.com/in/prince-onukwili-a82143233/details/publications/">articles here</a> or on <a target="_blank" href="https://www.freecodecamp.org/news/author/onukwilip/">my freeCodeCamp blog</a>.</p>
<p>You can also <a target="_blank" href="https://prince-onuk.vercel.app">visit my website</a>. Let’s connect and grow together! 😊</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Prepare for the Kubernetes Administrator Certification and Pass ]]>
                </title>
                <description>
                    <![CDATA[ We just posted a course on the freeCodeCamp.org YouTube channel to help prepare you for the Certified Kubernetes Administrator Certification. This course is designed to provide a deep, practical understanding of Kubernetes administration, from founda... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/prepare-for-the-kubernetes-administrator-certification-and-pass/</link>
                <guid isPermaLink="false">6902148987da12fd1cbfa416</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Wed, 29 Oct 2025 13:20:09 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761739982392/f255a6af-6ec9-4136-b45f-6f14d4fb2c8c.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>We just posted a course on the freeCodeCamp.org YouTube channel to help prepare you for the Certified Kubernetes Administrator Certification. This course is designed to provide a deep, practical understanding of Kubernetes administration, from foundational concepts to advanced troubleshooting.</p>
<p>You can watch the course on <a target="_blank" href="https://youtu.be/Fr9GqFwl6NM">the freeCodeCamp.org YouTube channel</a> (2-hour watch).</p>
<p>This course was made possible by a grant from the Linux Foundation. Use code FREECODECAMP to get 30% off training, certifications, and bundles from the Linux Foundation.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/Fr9GqFwl6NM" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>There are many demos in the course using Kubernetes. Below you can find all the commands used in the course so it is easier for you to follow along on your local machine.</p>
<h2 id="heading-cka-hands-on-companion-commands-and-demos">CKA Hands-On Companion: Commands and Demos</h2>
<h2 id="heading-part-1-kubernetes-fundamentals-and-lab-setup">Part 1: Kubernetes Fundamentals and Lab Setup</h2>
<p>This section covers the setup of a single-node cluster using <code>kubeadm</code> to create an environment that mirrors the CKA exam.</p>
<h3 id="heading-section-13-setting-up-your-cka-practice-environment">Section 1.3: Setting Up Your CKA Practice Environment</h3>
<h4 id="heading-step-1-install-a-container-runtime-on-all-nodes"><strong>Step 1: Install a Container Runtime (on all nodes)</strong></h4>
<ol>
<li><p><strong>Load required kernel modules:</strong></p>
<pre><code class="lang-bash"> cat &lt;&lt;EOF | sudo tee /etc/modules-load.d/k8s.conf
 overlay
 br_netfilter
 EOF

 sudo modprobe overlay
 sudo modprobe br_netfilter
</code></pre>
</li>
<li><p><strong>Configure sysctl for networking:</strong></p>
<pre><code class="lang-bash"> cat &lt;&lt;EOF | sudo tee /etc/sysctl.d/k8s.conf
 net.bridge.bridge-nf-call-iptables  = 1
 net.bridge.bridge-nf-call-ip6tables = 1
 net.ipv4.ip_forward               = 1
 EOF

 sudo sysctl --system
</code></pre>
</li>
<li><p><strong>Install containerd:</strong></p>
<pre><code class="lang-bash"> sudo apt-get update
 sudo apt-get install -y containerd
</code></pre>
</li>
<li><p><strong>Configure containerd for systemd cgroup driver:</strong></p>
<pre><code class="lang-bash"> sudo mkdir -p /etc/containerd
 sudo containerd config default | sudo tee /etc/containerd/config.toml
 sudo sed -i <span class="hljs-string">'s/SystemdCgroup = false/SystemdCgroup = true/'</span> /etc/containerd/config.toml
</code></pre>
</li>
<li><p><strong>Restart and enable containerd:</strong></p>
<pre><code class="lang-bash"> sudo systemctl restart containerd
 sudo systemctl <span class="hljs-built_in">enable</span> containerd
</code></pre>
</li>
</ol>
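<p>Before moving on, it can help to confirm that the runtime changes took effect. The following is a minimal sketch that assumes the default config path used above:</p>
<pre><code class="lang-bash"># Should print a line containing "SystemdCgroup = true"
grep SystemdCgroup /etc/containerd/config.toml

# Should print "active" once containerd has restarted cleanly
systemctl is-active containerd

# The bridge and forwarding settings should all report 1
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward
</code></pre>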
<h4 id="heading-step-2-install-kubernetes-binaries-on-all-nodes"><strong>Step 2: Install Kubernetes Binaries (on all nodes)</strong></h4>
<ol>
<li><p><strong>Disable swap memory:</strong></p>
<pre><code class="lang-bash"> sudo swapoff -a
 <span class="hljs-comment"># Comment out swap in fstab to make it persistent:</span>
 sudo sed -i <span class="hljs-string">'/ swap / s/^\(.*\)$/#\1/g'</span> /etc/fstab
</code></pre>
</li>
<li><p><strong>Add the Kubernetes apt repository:</strong></p>
<pre><code class="lang-bash"> sudo apt-get update
 sudo apt-get install -y apt-transport-https ca-certificates curl gpg
 sudo mkdir -p -m 755 /etc/apt/keyrings
 curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
 <span class="hljs-built_in">echo</span> <span class="hljs-string">'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /'</span> | sudo tee /etc/apt/sources.list.d/kubernetes.list
</code></pre>
</li>
<li><p><strong>Install and hold binaries (adjust version as needed):</strong></p>
<pre><code class="lang-bash"> sudo apt-get update
 sudo apt-get install -y kubelet kubeadm kubectl
 sudo apt-mark hold kubelet kubeadm kubectl
</code></pre>
</li>
</ol>
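<p>As a quick sanity check (a sketch, not a step from the course), you can confirm which versions were installed and that the packages are held:</p>
<pre><code class="lang-bash"># Print the installed versions of each binary
kubeadm version -o short
kubelet --version
kubectl version --client

# List packages that apt will not upgrade automatically
apt-mark showhold
</code></pre>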
<h4 id="heading-step-3-configure-a-single-node-cluster-on-the-control-plane"><strong>Step 3: Configure a Single-Node Cluster (on the control plane)</strong></h4>
<ol>
<li><p><strong>Initialize the control-plane node:</strong></p>
<pre><code class="lang-bash"> sudo kubeadm init --pod-network-cidr=10.244.0.0/16
</code></pre>
</li>
<li><p><strong>Configure kubectl for the administrative user:</strong></p>
<pre><code class="lang-bash"> mkdir -p <span class="hljs-variable">$HOME</span>/.kube
 sudo cp -i /etc/kubernetes/admin.conf <span class="hljs-variable">$HOME</span>/.kube/config
 sudo chown $(id -u):$(id -g) <span class="hljs-variable">$HOME</span>/.kube/config
</code></pre>
</li>
<li><p><strong>Remove the control-plane taint:</strong></p>
<pre><code class="lang-bash"> kubectl taint nodes --all node-role.kubernetes.io/control-plane-
</code></pre>
</li>
<li><p><strong>Install the Flannel CNI plugin:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
</code></pre>
</li>
<li><p><strong>Verify the cluster:</strong></p>
<pre><code class="lang-bash"> kubectl get nodes
 kubectl get pods -n kube-system
</code></pre>
</li>
</ol>
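<p>The node usually reports <code>NotReady</code> until the CNI pods are running. A minimal sketch for waiting on that is shown below. The <code>kube-flannel</code> namespace is an assumption based on recent Flannel manifests, so adjust it if your manifest version deploys elsewhere:</p>
<pre><code class="lang-bash"># Block until every node reports Ready, or give up after five minutes
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# Assumption: recent Flannel manifests run their pods in the kube-flannel namespace
kubectl get pods -n kube-flannel
</code></pre>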
<hr>
<h2 id="heading-part-2-cluster-architecture-installation-amp-configuration-25">Part 2: Cluster Architecture, Installation &amp; Configuration (25%)</h2>
<h3 id="heading-section-21-bootstrapping-a-multi-node-cluster-with-kubeadm">Section 2.1: Bootstrapping a Multi-Node Cluster with <code>kubeadm</code></h3>
<h4 id="heading-initializing-the-control-plane-run-on-control-plane-node"><strong>Initializing the Control Plane (Run on Control Plane node)</strong></h4>
<ol>
<li><p><strong>Run</strong> <code>kubeadm init</code> (Replace <code>&lt;control-plane-private-ip&gt;</code>):</p>
<pre><code class="lang-bash"> sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=&lt;control-plane-private-ip&gt;
</code></pre>
<ul>
<li><strong>Note:</strong> Save the <code>kubeadm join</code> command from the output.</li>
</ul>
</li>
<li><p><strong>Install Calico CNI Plugin:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
</code></pre>
</li>
<li><p><strong>Verify Cluster and CNI installation:</strong></p>
<pre><code class="lang-bash"> kubectl get pods -n kube-system
 kubectl get nodes
</code></pre>
</li>
</ol>
<h4 id="heading-joining-worker-nodes-run-on-each-worker-node"><strong>Joining Worker Nodes (Run on each Worker node)</strong></h4>
<ol>
<li><p><strong>Run the join command saved from</strong> <code>kubeadm init</code>:</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># EXAMPLE - Use the exact command from your kubeadm init output</span>
 sudo kubeadm join &lt;control-plane-private-ip&gt;:6443 --token &lt;token&gt; \
     --discovery-token-ca-cert-hash sha256:&lt;<span class="hljs-built_in">hash</span>&gt;
</code></pre>
</li>
<li><p><strong>Verify the full cluster (from Control Plane node):</strong></p>
<pre><code class="lang-bash"> kubectl get nodes -o wide
</code></pre>
</li>
</ol>
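<p>If the join command was not saved, or the bootstrap token has expired (kubeadm tokens are valid for 24 hours by default), a fresh one can be generated on the control plane node. A short sketch:</p>
<pre><code class="lang-bash"># Run on the control plane node: create a new token and print the full join command
sudo kubeadm token create --print-join-command

# Optional: list existing tokens and their expiry times
sudo kubeadm token list
</code></pre>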
<h3 id="heading-section-22-managing-the-cluster-lifecycle">Section 2.2: Managing the Cluster Lifecycle</h3>
<h4 id="heading-upgrading-clusters-with-kubeadm-example-upgrade-to-1291"><strong>Upgrading Clusters with</strong> <code>kubeadm</code> (Example: Upgrade to 1.29.1)</h4>
<ol>
<li><p><strong>Upgrade Control Plane: Upgrade</strong> <code>kubeadm</code> binary:</p>
<pre><code class="lang-bash"> sudo apt-mark unhold kubeadm
 sudo apt-get update &amp;&amp; sudo apt-get install -y kubeadm=<span class="hljs-string">'1.29.1-1.1'</span>
 sudo apt-mark hold kubeadm
</code></pre>
</li>
<li><p><strong>Plan and apply the upgrade (on Control Plane node):</strong></p>
<pre><code class="lang-bash"> sudo kubeadm upgrade plan
 sudo kubeadm upgrade apply v1.29.1
</code></pre>
</li>
<li><p><strong>Upgrade</strong> <code>kubelet</code> and <code>kubectl</code> (on Control Plane node):</p>
<pre><code class="lang-bash"> sudo apt-mark unhold kubelet kubectl
 sudo apt-get update &amp;&amp; sudo apt-get install -y kubelet=<span class="hljs-string">'1.29.1-1.1'</span> kubectl=<span class="hljs-string">'1.29.1-1.1'</span>
 sudo apt-mark hold kubelet kubectl
 sudo systemctl daemon-reload
 sudo systemctl restart kubelet
</code></pre>
</li>
<li><p><strong>Upgrade Worker Node: Drain the node (from Control Plane node):</strong></p>
<pre><code class="lang-bash"> kubectl drain &lt;node-to-upgrade&gt; --ignore-daemonsets
</code></pre>
</li>
<li><p><strong>Upgrade binaries (on Worker Node):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># On the worker node</span>
 sudo apt-mark unhold kubeadm kubelet
 sudo apt-get update
 sudo apt-get install -y kubeadm=<span class="hljs-string">'1.29.1-1.1'</span> kubelet=<span class="hljs-string">'1.29.1-1.1'</span>
 sudo apt-mark hold kubeadm kubelet
</code></pre>
</li>
<li><p><strong>Upgrade node configuration and restart kubelet (on Worker Node):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># On the worker node</span>
 sudo kubeadm upgrade node
 sudo systemctl daemon-reload
 sudo systemctl restart kubelet
</code></pre>
</li>
<li><p><strong>Uncordon the Node (from Control Plane node):</strong></p>
<pre><code class="lang-bash"> kubectl uncordon &lt;node-to-upgrade&gt;
</code></pre>
</li>
</ol>
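<p>To confirm the upgrade landed, one option is to compare the versions reported by the cluster. This is a sketch rather than a command from the course:</p>
<pre><code class="lang-bash"># The VERSION column shows the kubelet version running on each node
kubectl get nodes

# Run on the node you upgraded to check its kubeadm binary
kubeadm version -o short
</code></pre>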
<h4 id="heading-backing-up-and-restoring-etcd-run-on-control-plane-node"><strong>Backing Up and Restoring etcd (Run on Control Plane node)</strong></h4>
<ol>
<li><p><strong>Perform a Backup (using host</strong> <code>etcdctl</code>):</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Create the backup directory first</span>
 sudo mkdir -p /var/lib/etcd-backup

 sudo ETCDCTL_API=3 etcdctl snapshot save /var/lib/etcd-backup/snapshot.db \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key
</code></pre>
</li>
<li><p><strong>Perform a Restore (on the control plane node):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Stop kubelet to stop static pods</span>
 sudo systemctl stop kubelet

 <span class="hljs-comment"># Restore the snapshot to a new data directory</span>
 sudo ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd-backup/snapshot.db \
     --data-dir /var/lib/etcd-restored

 <span class="hljs-comment"># !! IMPORTANT: Manually edit /etc/kubernetes/manifests/etcd.yaml to point to the new data-dir /var/lib/etcd-restored !!</span>

 <span class="hljs-comment"># Restart kubelet to pick up the manifest change</span>
 sudo systemctl start kubelet
</code></pre>
</li>
</ol>
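<p>It is worth inspecting a snapshot before relying on it for a restore. The sketch below assumes the backup path used above. Newer etcd releases move this subcommand to <code>etcdutl</code>, so <code>etcdctl</code> may print a deprecation warning:</p>
<pre><code class="lang-bash"># Show the hash, revision, key count, and size of the snapshot
sudo ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd-backup/snapshot.db --write-out=table

# Show which lines of the static pod manifest reference the etcd data directory,
# so you know what to change when pointing etcd at /var/lib/etcd-restored
sudo grep -n '/var/lib/etcd' /etc/kubernetes/manifests/etcd.yaml
</code></pre>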
<h3 id="heading-section-23-implementing-a-highly-available-ha-control-plane">Section 2.3: Implementing a Highly-Available (HA) Control Plane</h3>
<ol>
<li><p><strong>Initialize the First Control-Plane Node (Replace</strong> <code>load-balancer.example.com:6443</code> with your load balancer's address and port):</p>
<pre><code class="lang-bash"> sudo kubeadm init --control-plane-endpoint <span class="hljs-string">"load-balancer.example.com:6443"</span> --upload-certs
</code></pre>
<ul>
<li><strong>Note:</strong> Save the HA-specific join command and the <code>--certificate-key</code>.</li>
</ul>
</li>
<li><p><strong>Join Additional Control-Plane Nodes (Run on the second and third Control Plane nodes):</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># EXAMPLE - Use the exact command from your `kubeadm init` output</span>
 sudo kubeadm join load-balancer.example.com:6443 --token &lt;token&gt; \
     --discovery-token-ca-cert-hash sha256:&lt;<span class="hljs-built_in">hash</span>&gt; \
     --control-plane --certificate-key &lt;key&gt;
</code></pre>
</li>
</ol>
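<p>The certificate key printed by <code>kubeadm init --upload-certs</code> is short-lived (two hours by default). If it has expired before the other control plane nodes join, it can be regenerated. A sketch, run on the first control plane node:</p>
<pre><code class="lang-bash"># Re-upload the control plane certificates and print a new certificate key
sudo kubeadm init phase upload-certs --upload-certs

# After joining, each control plane node should run its own etcd static pod
kubectl get pods -n kube-system -l component=etcd -o wide
</code></pre>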
<h3 id="heading-section-24-managing-role-based-access-control-rbac">Section 2.4: Managing Role-Based Access Control (RBAC)</h3>
<h4 id="heading-demo-granting-read-only-access"><strong>Demo: Granting Read-Only Access</strong></h4>
<ol>
<li><p><strong>Create a Namespace and ServiceAccount:</strong></p>
<pre><code class="lang-bash"> kubectl create namespace rbac-test
 kubectl create serviceaccount dev-user -n rbac-test
</code></pre>
</li>
<li><p><strong>Create the</strong> <code>Role</code> manifest (<code>role.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># role.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">rbac.authorization.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Role</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">namespace:</span> <span class="hljs-string">rbac-test</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">pod-reader</span>
 <span class="hljs-attr">rules:</span>
 <span class="hljs-bullet">-</span> <span class="hljs-attr">apiGroups:</span> [<span class="hljs-string">""</span>] <span class="hljs-comment"># "" indicates the core API group</span>
   <span class="hljs-attr">resources:</span> [<span class="hljs-string">"pods"</span>]
   <span class="hljs-attr">verbs:</span> [<span class="hljs-string">"get"</span>, <span class="hljs-string">"list"</span>, <span class="hljs-string">"watch"</span>]
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f role.yaml</code></p>
</li>
<li><p><strong>Create the</strong> <code>RoleBinding</code> manifest (<code>rolebinding.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># rolebinding.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">rbac.authorization.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">RoleBinding</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">read-pods</span>
   <span class="hljs-attr">namespace:</span> <span class="hljs-string">rbac-test</span>
 <span class="hljs-attr">subjects:</span>
 <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceAccount</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">dev-user</span>
   <span class="hljs-attr">namespace:</span> <span class="hljs-string">rbac-test</span>
 <span class="hljs-attr">roleRef:</span>
   <span class="hljs-attr">kind:</span> <span class="hljs-string">Role</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">pod-reader</span>
   <span class="hljs-attr">apiGroup:</span> <span class="hljs-string">rbac.authorization.k8s.io</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f rolebinding.yaml</code></p>
</li>
<li><p><strong>Verify Permissions:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Check if the ServiceAccount can list pods (Should be YES)</span>
 kubectl auth can-i list pods --as=system:serviceaccount:rbac-test:dev-user -n rbac-test

 <span class="hljs-comment"># Check if the ServiceAccount can delete pods (Should be NO)</span>
 kubectl auth can-i delete pods --as=system:serviceaccount:rbac-test:dev-user -n rbac-test
</code></pre>
</li>
</ol>
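<p>The same Role and RoleBinding can also be produced imperatively, which is often faster under exam time pressure. The sketch below uses <code>--dry-run=client -o yaml</code> so it only prints the manifests and does not conflict with the objects created above:</p>
<pre><code class="lang-bash"># Generate a Role equivalent to role.yaml
kubectl create role pod-reader --verb=get,list,watch --resource=pods \
    -n rbac-test --dry-run=client -o yaml

# Generate a RoleBinding equivalent to rolebinding.yaml
kubectl create rolebinding read-pods --role=pod-reader \
    --serviceaccount=rbac-test:dev-user -n rbac-test --dry-run=client -o yaml
</code></pre>
<p>Dropping the <code>--dry-run=client -o yaml</code> flags creates the objects directly.</p>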
<h3 id="heading-section-25-application-management-with-helm-and-kustomize">Section 2.5: Application Management with Helm and Kustomize</h3>
<h4 id="heading-demo-installing-an-application-with-helm"><strong>Demo: Installing an Application with Helm</strong></h4>
<ol>
<li><p><strong>Add a Chart Repository:</strong></p>
<pre><code class="lang-bash"> helm repo add bitnami https://charts.bitnami.com/bitnami
 helm repo update
</code></pre>
</li>
<li><p><strong>Install a Chart with a value override:</strong></p>
<pre><code class="lang-bash"> helm install my-nginx bitnami/nginx --<span class="hljs-built_in">set</span> service.type=NodePort
</code></pre>
</li>
<li><p><strong>Manage the application:</strong></p>
<pre><code class="lang-bash"> helm upgrade my-nginx bitnami/nginx --<span class="hljs-built_in">set</span> service.type=ClusterIP
 helm rollback my-nginx 1
 helm uninstall my-nginx
</code></pre>
</li>
</ol>
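<p>Between the upgrade, rollback, and uninstall steps it helps to look at what Helm has recorded. A short sketch, to be run while the <code>my-nginx</code> release still exists (that is, before <code>helm uninstall</code>):</p>
<pre><code class="lang-bash"># List releases in the current namespace
helm list

# Show the revision history that helm rollback refers to
helm history my-nginx

# Show the values you overrode with --set
helm get values my-nginx
</code></pre>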
<h4 id="heading-demo-customizing-a-deployment-with-kustomize"><strong>Demo: Customizing a Deployment with Kustomize</strong></h4>
<ol>
<li><p><strong>Create base manifest (</strong><code>my-app/base/deployment.yaml</code>):</p>
<pre><code class="lang-bash">mkdir -p my-app/base
cat &lt;&lt;EOF &gt; my-app/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: nginx
        image: nginx:1.25.0
EOF
</code></pre>
</li>
<li><p><strong>Create base Kustomization file (</strong><code>my-app/base/kustomization.yaml</code>):</p>
<pre><code class="lang-bash"> cat &lt;&lt;EOF &gt; my-app/base/kustomization.yaml
 resources:
 - deployment.yaml
 EOF
</code></pre>
</li>
<li><p><strong>Create production overlay and patch:</strong></p>
<pre><code class="lang-bash">mkdir -p my-app/overlays/production
cat &lt;&lt;EOF &gt; my-app/overlays/production/patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
EOF
cat &lt;&lt;EOF &gt; my-app/overlays/production/kustomization.yaml
# "resources" replaces the deprecated "bases" field
resources:
- ../../base
patches:
- path: patch.yaml
EOF
</code></pre>
</li>
<li><p><strong>Apply the overlay (note the</strong> <code>-k</code> flag for kustomize):</p>
<pre><code class="lang-bash"> kubectl apply -k my-app/overlays/production
</code></pre>
</li>
<li><p><strong>Verify the change:</strong></p>
<pre><code class="lang-bash"> kubectl get deployment my-app
</code></pre>
</li>
</ol>
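<p>To see what the overlay produces without touching the cluster, the manifests can be rendered locally first. A minimal sketch using the directory layout above:</p>
<pre><code class="lang-bash"># Render the overlay to stdout without applying it
kubectl kustomize my-app/overlays/production

# After applying, print only the replica count of the Deployment
kubectl get deployment my-app -o jsonpath='{.spec.replicas}'
</code></pre>
<p>The rendered output should show the base Deployment with the replica count from <code>patch.yaml</code>.</p>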
<hr>
<h2 id="heading-part-3-workloads-amp-scheduling-15">Part 3: Workloads &amp; Scheduling (15%)</h2>
<h3 id="heading-section-31-mastering-deployments">Section 3.1: Mastering Deployments</h3>
<h4 id="heading-demo-performing-a-rolling-update"><strong>Demo: Performing a Rolling Update</strong></h4>
<ol>
<li><p><strong>Create a base Deployment manifest (</strong><code>deployment.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># deployment.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-deployment</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
   <span class="hljs-attr">selector:</span>
     <span class="hljs-attr">matchLabels:</span>
       <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
   <span class="hljs-attr">template:</span>
     <span class="hljs-attr">metadata:</span>
       <span class="hljs-attr">labels:</span>
         <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
     <span class="hljs-attr">spec:</span>
       <span class="hljs-attr">containers:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
         <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:1.24.0</span>
         <span class="hljs-attr">ports:</span>
         <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f deployment.yaml</code></p>
</li>
<li><p><strong>Update the Container Image to trigger the rolling update:</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">set</span> image deployment/nginx-deployment nginx=nginx:1.25.0
</code></pre>
</li>
<li><p><strong>Observe the rollout:</strong></p>
<pre><code class="lang-bash"> kubectl rollout status deployment/nginx-deployment
 kubectl get pods -l app=nginx -w
</code></pre>
</li>
</ol>
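<p>The pace of the rolling update above is governed by the Deployment's <code>strategy</code> field. As a minimal sketch, the fragment below shows where it could sit in <code>deployment.yaml</code>. The values <code>maxSurge: 1</code> and <code>maxUnavailable: 0</code> are illustrative assumptions, not settings taken from the demo:</p>
<pre><code class="lang-yaml"> # Fragment of deployment.yaml: only the strategy block is new
 spec:
   replicas: 3
   strategy:
     type: RollingUpdate
     rollingUpdate:
       maxSurge: 1        # allow at most one extra Pod above the desired replica count
       maxUnavailable: 0  # keep all desired replicas available during the update
</code></pre>
<p>With these assumed values, Kubernetes brings up one new Pod at a time and removes an old Pod only after the new one is ready.</p>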
<h4 id="heading-executing-and-verifying-rollbacks"><strong>Executing and Verifying Rollbacks</strong></h4>
<ol>
<li><p><strong>View Revision History:</strong></p>
<pre><code class="lang-bash"> kubectl rollout <span class="hljs-built_in">history</span> deployment/nginx-deployment
</code></pre>
</li>
<li><p><strong>Roll back to the previous version:</strong></p>
<pre><code class="lang-bash"> kubectl rollout undo deployment/nginx-deployment
</code></pre>
</li>
<li><p><strong>Roll back to a specific revision (e.g., revision 1):</strong></p>
<pre><code class="lang-bash"> kubectl rollout undo deployment/nginx-deployment --to-revision=1
</code></pre>
</li>
</ol>
<h3 id="heading-section-32-configuring-applications-with-configmaps-and-secrets">Section 3.2: Configuring Applications with ConfigMaps and Secrets</h3>
<h4 id="heading-creation-methods"><strong>Creation Methods</strong></h4>
<ol>
<li><p><strong>ConfigMap: Imperative Creation:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># From literal values</span>
 kubectl create configmap app-config --from-literal=app.color=blue --from-literal=app.mode=production

 <span class="hljs-comment"># From a file</span>
 <span class="hljs-built_in">echo</span> <span class="hljs-string">"retries = 3"</span> &gt; config.properties
 kubectl create configmap app-config-file --from-file=config.properties
</code></pre>
</li>
<li><p><strong>Secret: Imperative Creation:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># kubectl base64-encodes the values for you (this is encoding, not encryption)</span>
 kubectl create secret generic db-credentials --from-literal=username=admin --from-literal=password=<span class="hljs-string">'s3cr3t'</span>
</code></pre>
</li>
</ol>
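<p>The Pod manifest in the following demo reads a key named <code>ui.theme</code> from a ConfigMap called <code>app-config-declarative</code>, which the imperative commands above do not create. A minimal declarative sketch could look like this; the file name and the value <code>dark</code> are assumptions chosen for illustration:</p>
<pre><code class="lang-yaml"> # app-config-declarative.yaml (hypothetical file name)
 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: app-config-declarative
 data:
   ui.theme: dark   # assumed example value, referenced by key in pod-config.yaml
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f app-config-declarative.yaml</code></p>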
<h4 id="heading-demo-consuming-configmaps-and-secrets-in-pods"><strong>Demo: Consuming ConfigMaps and Secrets in Pods</strong></h4>
<ol>
<li><p><strong>Manifest: Environment Variables (</strong><code>pod-config.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pod-config.yaml (Assumes app-config-declarative ConfigMap and db-credentials Secret exist)</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">config-demo-pod</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">containers:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">demo-container</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
     <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/sh"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"env &amp;&amp; sleep 3600"</span>]
     <span class="hljs-attr">env:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">THEME</span>
         <span class="hljs-attr">valueFrom:</span>
           <span class="hljs-attr">configMapKeyRef:</span>
             <span class="hljs-attr">name:</span> <span class="hljs-string">app-config-declarative</span>
             <span class="hljs-attr">key:</span> <span class="hljs-string">ui.theme</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DB_PASSWORD</span>
         <span class="hljs-attr">valueFrom:</span>
           <span class="hljs-attr">secretKeyRef:</span>
             <span class="hljs-attr">name:</span> <span class="hljs-string">db-credentials</span>
             <span class="hljs-attr">key:</span> <span class="hljs-string">password</span>
   <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pod-config.yaml</code> <strong>Verify:</strong> <code>kubectl logs config-demo-pod</code></p>
</li>
<li><p><strong>Manifest: Mounted Volumes (</strong><code>pod-volume.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pod-volume.yaml (Assumes app-config-file ConfigMap exists)</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">volume-demo-pod</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">containers:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">demo-container</span>
     <span class="hljs-attr">image:</span> <span class="hljs-string">busybox</span>
     <span class="hljs-attr">command:</span> [<span class="hljs-string">"/bin/sh"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">"cat /etc/config/config.properties &amp;&amp; sleep 3600"</span>]
     <span class="hljs-attr">volumeMounts:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">config-volume</span>
       <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/etc/config</span>
   <span class="hljs-attr">volumes:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">config-volume</span>
     <span class="hljs-attr">configMap:</span>
       <span class="hljs-attr">name:</span> <span class="hljs-string">app-config-file</span>
   <span class="hljs-attr">restartPolicy:</span> <span class="hljs-string">Never</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pod-volume.yaml</code> <strong>Verify:</strong> <code>kubectl logs volume-demo-pod</code></p>
</li>
</ol>
<h3 id="heading-section-33-implementing-workload-autoscaling">Section 3.3: Implementing Workload Autoscaling</h3>
<h4 id="heading-demo-installing-and-verifying-the-metrics-server"><strong>Demo: Installing and Verifying the Metrics Server</strong></h4>
<ol>
<li><p><strong>Install the Metrics Server:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
</code></pre>
</li>
<li><p><strong>Verify Installation:</strong></p>
<pre><code class="lang-bash"> kubectl top nodes
 kubectl top pods -A
</code></pre>
</li>
</ol>
<h4 id="heading-demo-autoscaling-a-deployment"><strong>Demo: Autoscaling a Deployment</strong></h4>
<ol>
<li><p><strong>Create a Deployment with a CPU request</strong> (the <code>hpa-demo-deployment.yaml</code> manifest is not included here, so use imperative commands; the HPA needs a CPU request to calculate utilization):</p>
<pre><code class="lang-bash"> kubectl create deployment php-apache --image=registry.k8s.io/hpa-example
 kubectl <span class="hljs-built_in">set</span> resources deployment php-apache --requests=cpu=200m
 kubectl expose deployment php-apache --port=80
</code></pre>
</li>
<li><p><strong>Create an HPA (target 50% CPU, scale 1-10 replicas):</strong></p>
<pre><code class="lang-bash"> kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
</code></pre>
</li>
<li><p><strong>Generate load (run this in a separate terminal; it stays in the foreground until you press Ctrl+C):</strong></p>
<pre><code class="lang-bash"> kubectl run -it --rm load-generator --image=busybox -- /bin/sh -c <span class="hljs-string">"while true; do wget -q -O- http://php-apache; done"</span>
</code></pre>
</li>
<li><p><strong>Observe Scaling:</strong></p>
<pre><code class="lang-bash"> kubectl get hpa -w
</code></pre>
<p> <em>(Stop the load generator to observe scale down)</em></p>
</li>
</ol>
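<p>The <code>kubectl autoscale</code> command above can also be expressed declaratively. The manifest below is a sketch of a roughly equivalent <code>autoscaling/v2</code> HorizontalPodAutoscaler; the file name <code>hpa.yaml</code> is an assumption:</p>
<pre><code class="lang-yaml"> # hpa.yaml (hypothetical file name)
 apiVersion: autoscaling/v2
 kind: HorizontalPodAutoscaler
 metadata:
   name: php-apache
 spec:
   scaleTargetRef:
     apiVersion: apps/v1
     kind: Deployment
     name: php-apache
   minReplicas: 1
   maxReplicas: 10
   metrics:
   - type: Resource
     resource:
       name: cpu
       target:
         type: Utilization
         averageUtilization: 50   # target 50% of the requested CPU
</code></pre>
<p>Use either the imperative command or this manifest, not both, since each creates an HPA named <code>php-apache</code>.</p>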
<h3 id="heading-section-35-advanced-scheduling">Section 3.5: Advanced Scheduling</h3>
<h4 id="heading-demo-using-node-affinity"><strong>Demo: Using Node Affinity</strong></h4>
<ol>
<li><p><strong>Label a Node:</strong></p>
<pre><code class="lang-bash"> kubectl label node &lt;your-worker-node-name&gt; disktype=ssd
</code></pre>
</li>
<li><p><strong>Create a Pod with Node Affinity</strong> (the <code>affinity-pod.yaml</code> manifest is not included here; write one with a <code>nodeAffinity</code> rule that matches the <code>disktype=ssd</code> label):</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Create the pod using the affinity rules</span>
 kubectl apply -f affinity-pod.yaml <span class="hljs-comment"># Or equivalent manifest with node affinity</span>
</code></pre>
</li>
</ol>
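<p>One possible <code>affinity-pod.yaml</code> for the Node Affinity demo above is sketched below. The Pod name <code>affinity-pod</code> and the <code>nginx</code> image are assumptions; the only part the demo depends on is the rule matching <code>disktype=ssd</code>:</p>
<pre><code class="lang-yaml"> # affinity-pod.yaml
 apiVersion: v1
 kind: Pod
 metadata:
   name: affinity-pod   # hypothetical name
 spec:
   affinity:
     nodeAffinity:
       # Hard requirement: schedule only onto nodes labeled disktype=ssd
       requiredDuringSchedulingIgnoredDuringExecution:
         nodeSelectorTerms:
         - matchExpressions:
           - key: disktype
             operator: In
             values:
             - ssd
   containers:
   - name: nginx
     image: nginx
</code></pre>
<p>After applying it, <code>kubectl get pod affinity-pod -o wide</code> should show the Pod on the node you labeled.</p>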
<h4 id="heading-demo-using-taints-and-tolerations"><strong>Demo: Using Taints and Tolerations</strong></h4>
<ol>
<li><p><strong>Taint a Node (Effect:</strong> <code>NoSchedule</code>):</p>
<pre><code class="lang-bash"> kubectl taint node &lt;another-worker-node-name&gt; app=gpu:NoSchedule
</code></pre>
</li>
<li><p><strong>Create a Pod with a Toleration</strong> (the <code>toleration-pod.yaml</code> manifest is not included here; write one for a Pod named <code>gpu-pod</code> that tolerates the <code>app=gpu:NoSchedule</code> taint):</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Create the pod using the toleration rules</span>
 kubectl apply -f toleration-pod.yaml <span class="hljs-comment"># Or equivalent manifest with toleration</span>
</code></pre>
</li>
<li><p><strong>Verify where the Pod was scheduled</strong> (a toleration allows placement on the tainted node but does not force it):</p>
<pre><code class="lang-bash"> kubectl get pod gpu-pod -o wide
</code></pre>
</li>
</ol>
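<p>A possible <code>toleration-pod.yaml</code> for the demo above is sketched below. The Pod name <code>gpu-pod</code> comes from the verification command; the <code>nginx</code> image is an assumption:</p>
<pre><code class="lang-yaml"> # toleration-pod.yaml
 apiVersion: v1
 kind: Pod
 metadata:
   name: gpu-pod
 spec:
   tolerations:
   # Matches the taint app=gpu:NoSchedule applied in step 1
   - key: "app"
     operator: "Equal"
     value: "gpu"
     effect: "NoSchedule"
   containers:
   - name: nginx
     image: nginx
</code></pre>
<p>To pin the Pod to the tainted node rather than merely permit it, the toleration would need to be combined with a node selector or node affinity rule.</p>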
<hr>
<h2 id="heading-part-4-services-amp-networking-20">Part 4: Services &amp; Networking (20%)</h2>
<h3 id="heading-section-42-kubernetes-services">Section 4.2: Kubernetes Services</h3>
<h4 id="heading-demo-creating-a-clusterip-service"><strong>Demo: Creating a ClusterIP Service</strong></h4>
<ol>
<li><p><strong>Create a Deployment:</strong></p>
<pre><code class="lang-bash"> kubectl create deployment my-app --image=nginx --replicas=2
</code></pre>
</li>
<li><p><strong>Expose the Deployment with a ClusterIP Service</strong> (the <code>clusterip-service.yaml</code> manifest is not included here, so use the imperative equivalent):</p>
<pre><code class="lang-bash"> kubectl expose deployment my-app --port=80 --target-port=80 --name=my-app-service --<span class="hljs-built_in">type</span>=ClusterIP
</code></pre>
</li>
<li><p><strong>Verify Access (inside a temporary Pod):</strong></p>
<pre><code class="lang-bash"> kubectl run tmp-shell --rm -it --image=busybox -- /bin/sh
 <span class="hljs-comment"># Inside the shell:</span>
 <span class="hljs-comment"># wget -O- my-app-service</span>
</code></pre>
</li>
</ol>
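<p>For reference, a declarative <code>clusterip-service.yaml</code> roughly equivalent to the <code>kubectl expose</code> command above might look like this. It assumes the Pods carry the <code>app: my-app</code> label that <code>kubectl create deployment my-app</code> adds by default:</p>
<pre><code class="lang-yaml"> # clusterip-service.yaml
 apiVersion: v1
 kind: Service
 metadata:
   name: my-app-service
 spec:
   type: ClusterIP
   selector:
     app: my-app       # label set by "kubectl create deployment my-app"
   ports:
   - protocol: TCP
     port: 80          # port exposed by the Service
     targetPort: 80    # port the nginx container listens on
</code></pre>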
<h4 id="heading-demo-creating-a-nodeport-service"><strong>Demo: Creating a NodePort Service</strong></h4>
<ol>
<li><p><strong>Create a NodePort Service</strong> (the <code>nodeport-service.yaml</code> manifest is not included here, so use the imperative equivalent):</p>
<pre><code class="lang-bash"> kubectl expose deployment my-app --port=80 --target-port=80 --name=my-app-nodeport --<span class="hljs-built_in">type</span>=NodePort
</code></pre>
</li>
<li><p><strong>Verify Access information:</strong></p>
<pre><code class="lang-bash"> kubectl get service my-app-nodeport
 kubectl get nodes -o wide
 <span class="hljs-comment"># Access from outside via &lt;NodeIP&gt;:&lt;NodePort&gt;</span>
</code></pre>
</li>
</ol>
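<p>A declarative <code>nodeport-service.yaml</code> could be sketched as follows. The fixed <code>nodePort: 30080</code> is an assumption for illustration; omit that field to let Kubernetes pick a port from the default 30000-32767 range, as the imperative command does:</p>
<pre><code class="lang-yaml"> # nodeport-service.yaml
 apiVersion: v1
 kind: Service
 metadata:
   name: my-app-nodeport
 spec:
   type: NodePort
   selector:
     app: my-app
   ports:
   - protocol: TCP
     port: 80
     targetPort: 80
     nodePort: 30080   # assumed value; must fall inside the cluster's NodePort range
</code></pre>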
<h3 id="heading-section-43-ingress-and-the-gateway-api">Section 4.3: Ingress and the Gateway API</h3>
<h4 id="heading-demo-path-based-routing-with-nginx-ingress"><strong>Demo: Path-Based Routing with NGINX Ingress</strong></h4>
<ol>
<li><p><strong>Install the NGINX Ingress Controller:</strong></p>
<pre><code class="lang-bash"> kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.10.1/deploy/static/provider/cloud/deploy.yaml
</code></pre>
</li>
<li><p><strong>Deploy Two Sample Applications and Services:</strong></p>
<pre><code class="lang-bash"> kubectl create deployment app-one --image=registry.k8s.io/echoserver:1.4
 kubectl expose deployment app-one --port=8080

 kubectl create deployment app-two --image=registry.k8s.io/echoserver:1.4
 kubectl expose deployment app-two --port=8080
</code></pre>
</li>
<li><p><strong>Create an Ingress Resource</strong> (the <code>ingress.yaml</code> manifest is not included here; it should route <code>/app1</code> to <code>app-one</code> and <code>/app2</code> to <code>app-two</code>):</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Apply ingress.yaml</span>
 kubectl apply -f ingress.yaml
</code></pre>
</li>
<li><p><strong>Test the Ingress:</strong></p>
<pre><code class="lang-bash"> INGRESS_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath=<span class="hljs-string">'{.status.loadBalancer.ingress[0].ip}'</span>)
 curl http://<span class="hljs-variable">$INGRESS_IP</span>/app1
 curl http://<span class="hljs-variable">$INGRESS_IP</span>/app2
</code></pre>
</li>
</ol>
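<p>A possible <code>ingress.yaml</code> for the path-based routing demo above is sketched below. The resource name <code>example-ingress</code> is an assumption, and <code>ingressClassName: nginx</code> assumes the IngressClass created by the controller manifest from step 1:</p>
<pre><code class="lang-yaml"> # ingress.yaml
 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
   name: example-ingress   # hypothetical name
 spec:
   ingressClassName: nginx
   rules:
   - http:
       paths:
       - path: /app1
         pathType: Prefix
         backend:
           service:
             name: app-one
             port:
               number: 8080
       - path: /app2
         pathType: Prefix
         backend:
           service:
             name: app-two
             port:
               number: 8080
</code></pre>
<p>On a cluster without a cloud load balancer, the controller Service may never receive an external IP, in which case the test would have to go through the controller's NodePort instead.</p>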
<h3 id="heading-section-44-network-policies">Section 4.4: Network Policies</h3>
<h4 id="heading-demo-securing-an-application-with-network-policies"><strong>Demo: Securing an Application with Network Policies</strong></h4>
<ol>
<li><p><strong>Create a Default Deny-All Ingress Policy (</strong><code>deny-all.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># deny-all.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">default-deny-ingress</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">podSelector:</span> {} <span class="hljs-comment"># Matches all pods in the namespace</span>
   <span class="hljs-attr">policyTypes:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f deny-all.yaml</code></p>
</li>
<li><p><strong>Deploy a Web Server and a Service:</strong></p>
<pre><code class="lang-bash"> kubectl create deployment web-server --image=nginx
 kubectl expose deployment web-server --port=80
</code></pre>
</li>
<li><p><strong>Attempt connection (will fail):</strong></p>
<pre><code class="lang-bash"> kubectl run tmp-shell --rm -it --image=busybox -- /bin/sh -c <span class="hljs-string">"wget -O- --timeout=2 web-server"</span>
</code></pre>
</li>
<li><p><strong>Create an "Allow" Policy (</strong><code>allow-web-access.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># allow-web-access.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">allow-web-access</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">podSelector:</span>
     <span class="hljs-attr">matchLabels:</span>
       <span class="hljs-attr">app:</span> <span class="hljs-string">web-server</span>
   <span class="hljs-attr">policyTypes:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
   <span class="hljs-attr">ingress:</span>
   <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">podSelector:</span>
         <span class="hljs-attr">matchLabels:</span>
           <span class="hljs-attr">access:</span> <span class="hljs-string">"true"</span>
     <span class="hljs-attr">ports:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
       <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f allow-web-access.yaml</code></p>
</li>
<li><p><strong>Test the "Allow" Policy (connection will succeed):</strong></p>
<pre><code class="lang-bash"> kubectl run tmp-shell --rm -it --labels=access=<span class="hljs-literal">true</span> --image=busybox -- /bin/sh -c <span class="hljs-string">"wget -O- web-server"</span>
</code></pre>
</li>
</ol>
<h3 id="heading-section-45-coredns">Section 4.5: CoreDNS</h3>
<h4 id="heading-demo-customizing-coredns-for-an-external-domain"><strong>Demo: Customizing CoreDNS for an External Domain</strong></h4>
<ol>
<li><p><strong>Edit the CoreDNS ConfigMap:</strong></p>
<pre><code class="lang-bash"> kubectl edit configmap coredns -n kube-system
</code></pre>
</li>
<li><p><strong>Add a new server block inside the</strong> <code>Corefile</code> data (e.g., for <code>my-corp.com</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># ... inside the data.Corefile string...</span>
     <span class="hljs-string">my-corp.com:53</span> {
         <span class="hljs-string">errors</span>
         <span class="hljs-string">cache</span> <span class="hljs-number">30</span>
         <span class="hljs-string">forward</span> <span class="hljs-string">.</span> <span class="hljs-number">10.10</span><span class="hljs-number">.0</span><span class="hljs-number">.53</span> <span class="hljs-comment"># Forward to your internal DNS server</span>
     }
 <span class="hljs-comment"># ...</span>
</code></pre>
</li>
</ol>
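<p>To show where the new server block sits, here is a sketch of the edited ConfigMap. The default <code>.:53</code> block is deliberately abbreviated, so treat this as a layout guide and not as a file to apply as-is:</p>
<pre><code class="lang-yaml"> apiVersion: v1
 kind: ConfigMap
 metadata:
   name: coredns
   namespace: kube-system
 data:
   Corefile: |
     .:53 {
         # existing default server block, left unchanged (abbreviated here)
     }
     my-corp.com:53 {
         errors
         cache 30
         forward . 10.10.0.53
     }
</code></pre>
<p>If the default block includes the <code>reload</code> plugin, CoreDNS should pick up the change on its own after a short delay. Otherwise, on a kubeadm-style cluster where CoreDNS runs as a Deployment named <code>coredns</code>, <code>kubectl rollout restart deployment coredns -n kube-system</code> forces a reload.</p>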
<hr>
<h2 id="heading-part-5-storage-10">Part 5: Storage (10%)</h2>
<h3 id="heading-section-52-volume-configuration">Section 5.2: Volume Configuration</h3>
<h4 id="heading-static-provisioning-demo"><strong>Static Provisioning Demo</strong></h4>
<ol>
<li><p><strong>Create a PersistentVolume (</strong><code>pv.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pv.yaml (Using hostPath for local testing)</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolume</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">task-pv-volume</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">capacity:</span>
     <span class="hljs-attr">storage:</span> <span class="hljs-string">10Gi</span>
   <span class="hljs-attr">accessModes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
   <span class="hljs-attr">persistentVolumeReclaimPolicy:</span> <span class="hljs-string">Retain</span>
   <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">manual</span>
   <span class="hljs-attr">hostPath:</span>
     <span class="hljs-attr">path:</span> <span class="hljs-string">"/mnt/data"</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pv.yaml</code></p>
</li>
<li><p><strong>Create a PersistentVolumeClaim (</strong><code>pvc.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pvc.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">task-pv-claim</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">manual</span>
   <span class="hljs-attr">accessModes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
   <span class="hljs-attr">resources:</span>
     <span class="hljs-attr">requests:</span>
       <span class="hljs-attr">storage:</span> <span class="hljs-string">3Gi</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pvc.yaml</code></p>
</li>
<li><p><strong>Verify Binding:</strong></p>
<pre><code class="lang-bash"> kubectl get pv,pvc
</code></pre>
</li>
<li><p><strong>Create a Pod that Uses the PVC (</strong><code>pod-storage.yaml</code>):</p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># pod-storage.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">storage-pod</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">containers:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
       <span class="hljs-attr">image:</span> <span class="hljs-string">nginx</span>
       <span class="hljs-attr">volumeMounts:</span>
       <span class="hljs-bullet">-</span> <span class="hljs-attr">mountPath:</span> <span class="hljs-string">"/usr/share/nginx/html"</span>
         <span class="hljs-attr">name:</span> <span class="hljs-string">my-storage</span>
   <span class="hljs-attr">volumes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">my-storage</span>
       <span class="hljs-attr">persistentVolumeClaim:</span>
         <span class="hljs-attr">claimName:</span> <span class="hljs-string">task-pv-claim</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f pod-storage.yaml</code></p>
</li>
</ol>
<h3 id="heading-section-53-storageclasses-and-dynamic-provisioning">Section 5.3: StorageClasses and Dynamic Provisioning</h3>
<h4 id="heading-demo-using-a-default-storageclass"><strong>Demo: Using a Default StorageClass</strong></h4>
<ol>
<li><p><strong>Inspect the Available StorageClasses:</strong></p>
<pre><code class="lang-bash"> kubectl get storageclass
</code></pre>
</li>
<li><p><strong>Create a PVC without a PV (relies on a default StorageClass):</strong></p>
<pre><code class="lang-yaml"> <span class="hljs-comment"># dynamic-pvc.yaml</span>
 <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
 <span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
 <span class="hljs-attr">metadata:</span>
   <span class="hljs-attr">name:</span> <span class="hljs-string">my-dynamic-claim</span>
 <span class="hljs-attr">spec:</span>
   <span class="hljs-attr">accessModes:</span>
     <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
   <span class="hljs-attr">resources:</span>
     <span class="hljs-attr">requests:</span>
       <span class="hljs-attr">storage:</span> <span class="hljs-string">1Gi</span>
</code></pre>
<p> <strong>Apply:</strong> <code>kubectl apply -f dynamic-pvc.yaml</code></p>
</li>
<li><p><strong>Observe Dynamic Provisioning:</strong></p>
<pre><code class="lang-bash"> kubectl get pv
</code></pre>
</li>
</ol>
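<p>If the default StorageClass uses <code>volumeBindingMode: WaitForFirstConsumer</code>, the claim above stays <code>Pending</code> and no PV appears until a Pod actually uses it. A minimal consumer Pod could be sketched like this; the Pod name, volume name, and <code>nginx</code> image are assumptions:</p>
<pre><code class="lang-yaml"> # dynamic-pod.yaml (hypothetical file name)
 apiVersion: v1
 kind: Pod
 metadata:
   name: dynamic-storage-pod   # hypothetical name
 spec:
   containers:
   - name: nginx
     image: nginx
     volumeMounts:
     - mountPath: "/usr/share/nginx/html"
       name: dynamic-storage
   volumes:
   - name: dynamic-storage
     persistentVolumeClaim:
       claimName: my-dynamic-claim   # the PVC created in step 2
</code></pre>
<p>Once the Pod is scheduled, <code>kubectl get pv,pvc</code> should show a dynamically created volume bound to <code>my-dynamic-claim</code>.</p>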
<hr>
<h2 id="heading-part-6-troubleshooting-30">Part 6: Troubleshooting (30%)</h2>
<h3 id="heading-section-62-troubleshooting-applications-and-pods">Section 6.2: Troubleshooting Applications and Pods</h3>
<h4 id="heading-debugging-tools-for-crashes-and-failures"><strong>Debugging Tools for Crashes and Failures</strong></h4>
<ol>
<li><p><strong>Get detailed information on a resource (the most critical debugging command):</strong></p>
<pre><code class="lang-bash"> kubectl describe pod &lt;pod-name&gt;
</code></pre>
</li>
<li><p><strong>Check application logs (for current container):</strong></p>
<pre><code class="lang-bash"> kubectl logs &lt;pod-name&gt;
</code></pre>
</li>
<li><p><strong>Check application logs (for previous crashed container instance):</strong></p>
<pre><code class="lang-bash"> kubectl logs &lt;pod-name&gt; --previous
</code></pre>
</li>
<li><p><strong>Get a shell inside a running container for live debugging:</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">exec</span> -it &lt;pod-name&gt; -- /bin/sh
</code></pre>
</li>
</ol>
<h3 id="heading-section-63-troubleshooting-cluster-and-nodes">Section 6.3: Troubleshooting Cluster and Nodes</h3>
<ol>
<li><p><strong>Check node status:</strong></p>
<pre><code class="lang-bash"> kubectl get nodes
</code></pre>
</li>
<li><p><strong>Get detailed node information:</strong></p>
<pre><code class="lang-bash"> kubectl describe node &lt;node-name&gt;
</code></pre>
</li>
<li><p><strong>View node resource capacity (for scheduling issues):</strong></p>
<pre><code class="lang-bash"> kubectl describe node &lt;node-name&gt; | grep -A 6 Allocatable
</code></pre>
</li>
<li><p><strong>Check the</strong> <code>kubelet</code> service status (on the affected node via SSH):</p>
<pre><code class="lang-bash"> sudo systemctl status kubelet
 sudo journalctl -u kubelet -f
</code></pre>
</li>
<li><p><strong>Re-enable scheduling on a cordoned node:</strong></p>
<pre><code class="lang-bash"> kubectl uncordon &lt;node-name&gt;
</code></pre>
</li>
</ol>
<h3 id="heading-section-65-troubleshooting-services-and-networking">Section 6.5: Troubleshooting Services and Networking</h3>
<ol>
<li><p><strong>Check Service and Endpoints (for connectivity issues):</strong></p>
<pre><code class="lang-bash"> kubectl describe service &lt;service-name&gt;
 kubectl get endpoints &lt;service-name&gt;
</code></pre>
</li>
<li><p><strong>Check DNS resolution from a client Pod:</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">exec</span> -it &lt;client-pod&gt; -- nslookup &lt;service-name&gt;
</code></pre>
</li>
<li><p><strong>Check Network Policies (to see if traffic is being blocked):</strong></p>
<pre><code class="lang-bash"> kubectl get networkpolicy
</code></pre>
</li>
</ol>
<h3 id="heading-section-66-monitoring-cluster-and-application-resource-usage">Section 6.6: Monitoring Cluster and Application Resource Usage</h3>
<ol>
<li><p><strong>Get node resource usage (requires Metrics Server):</strong></p>
<pre><code class="lang-bash"> kubectl top nodes
</code></pre>
</li>
<li><p><strong>Get Pod resource usage (requires Metrics Server):</strong></p>
<pre><code class="lang-bash"> kubectl top pods -n &lt;namespace&gt;
</code></pre>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
