<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Kubernetes - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Kubernetes - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 02 Jul 2026 22:41:17 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/kubernetes/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Durable, Autoscaling AI Agent with Temporal, Composio, KEDA, and Kubernetes ]]>
                </title>
                <description>
                    <![CDATA[ Most AI agents are great at quick tasks. Send a message, the agent calls a few tools, and you get a response back in seconds. That works perfectly when you're asking it to summarize a document or do s ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-durable-autoscaling-ai-agent-with-temporal-composio-keda-and-kubernetes/</link>
                <guid isPermaLink="false">6a3ab180022a80fcba0df6e8</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ keda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ temporal ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Shrijal Acharya ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jun 2026 16:17:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/93e5088e-4cac-48c3-911d-982ef96dfe13.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most AI agents are great at quick tasks. Send a message, the agent calls a few tools, and you get a response back in seconds. That works perfectly when you're asking it to summarize a document or do some internet research.</p>
<p>But what happens when the task actually takes time? Something like "go through the last three days of my emails, draft replies for anything urgent, and create Linear tickets for the engineering-related ones." That's not a quick job. It might take minutes, hours, or even longer. And this is just one example.</p>
<p>That's a full workflow, and the moment your server crashes or your process restarts, you lose everything. No retry, and no resume. You're starting from scratch.</p>
<p>That's the problem this article is about.</p>
<p>In this article, you'll build a durable background agent runtime that holds up under real conditions. Dispatch a task, walk away, and it gets done.</p>
<p>Under the hood, it runs on Kubernetes with KEDA autoscaling so workers scale to zero when idle and spin back up the moment work arrives. For crash recovery and durable execution we'll use Temporal, and for agentic capabilities and tool usage we'll use Composio.</p>
<h2 id="heading-whats-covered">What's Covered?</h2>
<p>In this tutorial, you'll build a durable background agent runtime that runs on Kubernetes and scales based on actual workload. Here's what you'll learn along the way:</p>
<ul>
<li><p>What an agent loop is and how to build one with Claude and Composio</p>
</li>
<li><p>How to make long-running agent tasks handle crashes using Temporal</p>
</li>
<li><p>How to build a gateway that decouples task dispatch from execution</p>
</li>
<li><p>How to containerize the worker and gateway with Docker</p>
</li>
<li><p>How to deploy the full system to a local Kubernetes cluster</p>
</li>
<li><p>How to autoscale workers to zero with KEDA based on queue depth</p>
</li>
</ul>
<p>This gets into some advanced concepts, but follow along and you'll learn a lot along the way.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-whats-the-plan-the-architecture">What's the Plan (the Architecture)</a></p>
<ul>
<li><p><a href="#heading-dispatching-a-task">Dispatching a Task:</a></p>
</li>
<li><p><a href="#heading-running-the-task">Running the Task</a></p>
</li>
<li><p><a href="#heading-scaling">Scaling</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-the-project">How to Set Up the Project</a></p>
</li>
<li><p><a href="#heading-core-components-in-the-application">Core Components in the Application</a></p>
<ul>
<li><p><a href="#heading-the-agent-loop">The Agent Loop</a></p>
</li>
<li><p><a href="#heading-making-it-durable-with-temporal">Making it Durable with Temporal</a></p>
</li>
<li><p><a href="#heading-the-agent-gateway">The Agent Gateway</a></p>
</li>
<li><p><a href="#heading-containerizing-the-application">Containerizing the Application</a></p>
</li>
<li><p><a href="#heading-deploying-to-kubernetes">Deploying to Kubernetes</a></p>
</li>
<li><p><a href="#heading-autoscaling-with-keda">Autoscaling with KEDA</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-agent-in-action">Agent in Action</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-whats-the-plan-the-architecture">What's the Plan (the Architecture)</h2>
<p>Before diving into the code, it helps to understand how everything fits together.</p>
<p>The system is split into two distinct planes: a control plane that handles user-facing interactions (Next.js frontend), and an execution plane where the actual agent work happens. They never directly call each other, and that separation is intentional.</p>
<img src="https://cdn.hashnode.com/uploads/covers/641fd8b0be4ca15b2ad2a590/d7fd737e-6c38-4e61-abd1-26246bec2169.png" alt="Agent Architecture" style="display:block;margin:0 auto" width="1461" height="950" loading="lazy">

<p>Here's the flow from start to finish:</p>
<h3 id="heading-dispatching-a-task">Dispatching a Task</h3>
<p>When a user submits a goal, the gateway first runs a pre-flight check to verify the required Composio tool connections are active for that user. If they are, it hands the task off to Temporal and returns immediately. The user doesn't wait around.</p>
<p><strong>NOTE:</strong> You don't wait for the response to be back from the agent. It just happens all in the background. This isn't your regular chat app. You just launch the task and you forget.</p>
<h3 id="heading-running-the-task">Running the Task</h3>
<p>Temporal puts the task on a queue and a worker pod picks it up. The worker runs the agent loop, LLM reasons over the goal, Composio executes the tools, and the result gets written back to Temporal. The frontend automatically polls the gateway for status updates so the user can see progress without doing anything.</p>
<h3 id="heading-scaling">Scaling</h3>
<p>KEDA watches the Temporal queue depth and scales worker pods based on how much work is pending. When the queue is empty, workers scale down to zero. When tasks come in, they load back up. That's the beauty!</p>
<p>The reason the gateway never touches agent code is straightforward: agent tasks can take minutes, or even hours based on the work, and you don't want that in your API layer. Keeping them separate helps the control plane stay fast regardless of what's happening in the background.</p>
<p>Also, the application supports Linux CronJob-style task scheduling with no human involved. So, having a pre-flight check helps there, because failing fast at dispatch is much better than having a task silently fail because a tool connection was missing.</p>
<p>That's pretty much the high level architecture of our application. To put it simply:</p>
<ul>
<li><p><strong>Kubernetes (k8s):</strong> Orchestration Layer</p>
</li>
<li><p><strong>KEDA:</strong> Auto-scaling Layer</p>
</li>
<li><p><strong>Temporal:</strong> Durability Layer</p>
</li>
<li><p><strong>Composio:</strong> Tool Layer</p>
</li>
<li><p><strong>Any LLM of your choice (in our case, Anthropic)</strong> = Reasoning layer</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-the-project">How to Set Up the Project</h2>
<p>Before you start, make sure you have the following installed:</p>
<ul>
<li><p>Docker</p>
</li>
<li><p>k3d (for running a local Kubernetes cluster)</p>
</li>
<li><p>kubectl</p>
</li>
<li><p>Helm</p>
</li>
<li><p>Node.js and Python 3.11+</p>
</li>
</ul>
<p>You'll also need API keys for <a href="https://www.anthropic.com/">Anthropic</a> and <a href="https://dashboard.composio.dev">Composio</a>.</p>
<p>Start by cloning the repository:</p>
<pre><code class="language-shell">git clone https://github.com/shricodev/kron-k8s-agent.git
cd kron-k8s-agent
</code></pre>
<p>Next, create the cluster, build the images, and load them in:</p>
<pre><code class="language-shell"># Create the local cluster
k3d cluster create agent --wait

# Build both images and import them into the cluster
bash scripts/build-images.sh
bash scripts/load-images.sh

# Deploy Temporal (creates the temporal namespace, Postgres, and server)
kubectl apply -f infra/k8s/temporal/temporal-dev.yaml
</code></pre>
<p>Next, create the namespace and your secret. The secret has to exist before the app gets deployed, since the pods read their keys from it:</p>
<pre><code class="language-shell"># Create the agent namespace
kubectl apply -f infra/k8s/00-namespace.yaml

# Create the secret with your keys (you're supposed to remove the placeholders with the actual values...)
kubectl create secret generic agent-secrets -n agent \
 --from-literal=ANTHROPIC_API_KEY=sk-ant-... \
 --from-literal=COMPOSIO_API_KEY=ak_... \
 --from-literal=JWT_SECRET=$(openssl rand -hex 32)
</code></pre>
<p>With that in place, deploy the app and set up autoscaling:</p>
<pre><code class="language-shell"># Install KEDA, then apply the scalers
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda -n keda --create-namespace --wait

kubectl apply -f infra/k8s/40-keda-worker-scaledobject.yaml -f infra/k8s/41-gateway-hpa.yaml
</code></pre>
<p>Finally, port-forward the gateway so you can reach it from your machine:</p>
<pre><code class="language-shell"># Port-forward the gateway to localhost:8000
kubectl -n agent port-forward svc/gateway 8000:8000
</code></pre>
<p>Point the frontend at <code>http://localhost:8000</code> and you're ready to start tasks.</p>
<p><strong>Note:</strong> You don't need to touch the <code>.env</code> files in <code>apps/worker/</code> or <code>apps/gateway/</code> for this. Those are only for running the apps directly on your machine.</p>
<p>In the cluster, the pods get their config from the ConfigMap and the secret you just created gets injected as environment variables at runtime.</p>
<h2 id="heading-core-components-in-the-application">Core Components in the Application</h2>
<p>The project is huge. Walking through every single line from scratch would turn this into an hours-long read, so instead I'll focus on the core components that actually make the system work.</p>
<h3 id="heading-the-agent-loop">The Agent Loop</h3>
<p>The agent loop is the brain of the entire system. Every time a task gets dispatched, this is what runs.</p>
<p>The idea is simple even if the implementation isn't. Give the LLM a goal, let it reason, let it call tools, feed the results back, and repeat until it's done.</p>
<pre><code class="language-python">async def run_agent(
user_id: str,
goal: str,
toolkit_hint: str | None = None,
) -&gt; dict:
</code></pre>
<p>It takes three things: the user's ID (so Composio knows which connected accounts to use), the goal itself, and an optional toolkit hint. The hint lets you scope which tools get loaded. If the task is clearly a Gmail job, passing "gmail" avoids loading every tool the user has connected.</p>
<p>Before the loop starts, it creates a Composio session and fetches the tools for that user:</p>
<pre><code class="language-python">session = await create_session(user_id, toolkit_hint);
tools = await get_tools(session);
</code></pre>
<p>Then the actual loop runs:</p>
<pre><code class="language-python">for turn in range(1, settings.max_iterations + 1):
    response = await client.messages.create(
        model=settings.model,
        max_tokens=settings.max_tokens,
        system=SYSTEM_PROMPT,
        tools=tools,
        messages=messages,
    )

if response.stop_reason == "end_turn":
return finish("completed", _extract_text(response.content))

if response.stop_reason == "tool_use":
      # execute the tools, append results, continue
      # ...
</code></pre>
<p>Each turn, Claude looks at the goal and the conversation history, then decides what to do next. The <code>stop_reason</code> tells what happened:</p>
<ul>
<li><p><code>"end_turn"</code>: Claude is done. It completed the task and is returning a final answer.</p>
</li>
<li><p><code>"tool_use"</code>: it wants to call one or more tools. The loop executes those through Composio, appends the results back into the message history, and goes around again.</p>
</li>
</ul>
<p>If a tool call fails, the error gets fed back into the conversation rather than crashing the run:</p>
<pre><code class="language-python">except ComposioError as exc:
tool_result_blocks = [
    {
        "type": "tool_result",
        "tool_use_id": block.id,
        "content": f"Tool execution failed: {exc}",
        "is_error": True,
    }
        for block in tool_use_blocks
]
</code></pre>
<p>The loop runs for at most <code>max_iterations</code> turns which is 20 by default, defined in <code>apps/worker/agent/config.py</code>. If it hits that ceiling without finishing, it returns a <code>max_iterations_reached</code> status instead of hanging indefinitely.</p>
<p>Every <code>run_agent</code> call returns the same dict shape: a status, a summary, and a list of every step taken. That consistent shape is what makes it straightforward for Temporal to store and inspect the result, which we'll get to next.</p>
<h3 id="heading-making-it-durable-with-temporal">Making it Durable with Temporal</h3>
<p>The agent loop by itself has a real problem. If the worker process crashes partway through a 15-step task, everything is gone. You have no way to know how far it got, and you have to start from scratch.</p>
<h4 id="heading-workflows-and-activities">Workflows and Activities</h4>
<p>Temporal splits your code into two distinct pieces: workflows and activities.</p>
<p>A workflow describes what should happen and in what order, but it never does the actual work itself. No network calls, nothing. That constraint is exactly what lets Temporal safely replay it to reconstruct state after a crash.</p>
<p>An activity is where the real work happens. Network calls, LLM requests, tool executions – all of that goes inside an activity. Activities can fail and be retried independently without affecting the workflow state.</p>
<p>In this project, <code>AgentWorkflow</code> in <code>apps/worker/workflows.py</code> is the workflow, and <code>run_agent_activity</code> in <code>apps/worker/activities.py</code> is the activity that wraps the agent loop.</p>
<p><strong>The Workflow</strong></p>
<pre><code class="language-python">@workflow.defn(name="AgentWorkflow")
class AgentWorkflow:
    def init(self) -&gt; None:
        self._status: str = "running"
        self._result: dict | None = None
</code></pre>
<p>When a task gets dispatched, Temporal starts this workflow. It sets up a retry policy and hands all the real work off to the activity:</p>
<pre><code class="language-python">retry = RetryPolicy(
  (initial_interval = timedelta((seconds = 2))),
  (backoff_coefficient = 2.0),
  (maximum_interval = timedelta((minutes = 2))),
  (maximum_attempts = 5),
  (non_retryable_error_types = ["ValueError", "AuthenticationError"]),
);

result = await workflow.execute_activity(
  run_agent_activity,
  (args = [user_id, goal, toolkit_hint]),
  (start_to_close_timeout = timedelta((minutes = 30))),
  (retry_policy = retry),
);
</code></pre>
<p>The <code>start_to_close_timeout</code> is set to 30 minutes and caps at 5 attempts, because agent tasks can genuinely take that long. You can increase or decrease the timer based on your work requirement.</p>
<p><strong>Querying the Workflow</strong></p>
<p>One thing that makes Temporal convenient here is query handlers. The workflow exposes its current status and result without needing a separate database to track it:</p>
<pre><code class="language-python">@workflow.query
def status(self) -&gt; str:
return self._status

@workflow.query
def result(self) -&gt; dict | None:
return self._result
</code></pre>
<p>The gateway can ask Temporal "what is the status of workflow X?" at any point and get a live answer back. That's how the frontend polling works.</p>
<p><strong>The Activity</strong></p>
<p>The activity is straightforward. It wraps <code>run_agent</code> and logs what happens:</p>
<pre><code class="language-python">@activity.defn(name="run_agent_activity")
async def run_agent_activity(user_id: str, goal: str, toolkit_hint: str | None) -&gt; dict:
    result = await run_agent(user_id=user_id, goal=goal,             toolkit_hint=toolkit_hint)
    return result
</code></pre>
<p>Anything that touches the network lives here, not in the workflow. That separation is what lets Temporal do its job.</p>
<p><strong>The Worker</strong></p>
<p>The worker process is what registers everything and starts polling the queue:</p>
<pre><code class="language-python">worker = Worker(
            client,
            task_queue=temporal_settings.temporal_task_queue,
            workflows=[AgentWorkflow],
            activities=[run_agent_activity, notify_activity],
            max_concurrent_activities=5,
)
</code></pre>
<p>It connects to Temporal, registers the workflow and activities, and listens on the task queue. When a task arrives, it picks it up and runs it. This is the process running inside the Kubernetes pod, and it's exactly what KEDA will scale based on queue depth later.</p>
<h3 id="heading-the-agent-gateway">The Agent Gateway</h3>
<p>The gateway is a FastAPI app that sits between the user and Temporal. It handles task dispatch, status polling, and cancellation. Crucially, it never runs agent code itself. Its only job is to talk to Temporal and return quickly.</p>
<h4 id="heading-dispatching-a-task">Dispatching a Task</h4>
<p>The dispatch endpoint in <code>apps/gateway/routes/tasks.py</code> is where everything begins:</p>
<pre><code class="language-python">  @router.post("/dispatch", response_model=DispatchResponse)
  async def dispatch(
      body: DispatchRequest,
      user_id: str = Depends(current_user_id),
  ) -&gt; DispatchResponse:
      if body.toolkit:
          access = await check_toolkit_access(user_id, body.toolkit)
          if not access["allowed"]:
              raise HTTPException(
                  status_code=status.HTTP_409_CONFLICT,
                  detail={
                      "error": "toolkit_not_connected",
                      "connect_url": access["connect_url"],
                  },
              )

      workflow_id = f"agent-{user_id}-{uuid.uuid4().hex[:8]}"
      await client.start_workflow(
          WORKFLOW_NAME,
          args=[user_id, body.goal, body.toolkit],
          id=workflow_id,
          task_queue=settings.temporal_task_queue,
          cron_schedule=body.schedule or "",
      )
      return DispatchResponse(workflow_id=workflow_id, status="dispatched")
</code></pre>
<p>The request carries three fields: the goal, an optional toolkit name (to not spend time figuring out toolkit names), and an optional cron schedule. The endpoint runs a preflight check, hands the task off to Temporal, and returns the workflow ID immediately. The user doesn't wait for the agent to finish.</p>
<p>Notice the <code>cron_schedule</code> field. Passing a standard cron expression here turns the task into a recurring job. Temporal handles the scheduling itself, no extra infra needed.</p>
<h4 id="heading-the-preflight-check">The Preflight Check</h4>
<p>The preflight check lives in <code>apps/gateway/routes/preflight.py</code>. Before a task gets dispatched, it verifies that the user actually has the required toolkit connected in Composio:</p>
<pre><code class="language-python">  async def check_toolkit_access(user_id: str, toolkit_hint: str | None) -&gt; dict:
      if not toolkit_hint:
          return {"allowed": True}

      connected = await asyncio.to_thread(
          _has_active_account, composio, user_id, toolkit_hint
      )
      if connected:
          return {"allowed": True}

      connect_url = await asyncio.to_thread(
          _connect_link, composio, user_id, toolkit_hint
      )
      return {"allowed": False, "toolkit": toolkit_hint, "connect_url": connect_url}
</code></pre>
<p>If the connection is missing, the gateway returns a <code>connect_url</code> so the user can authorize the app right away. This matters especially for scheduled tasks.</p>
<h4 id="heading-checking-status">Checking Status</h4>
<p>Once a task is running, the frontend polls this endpoint:</p>
<pre><code class="language-python">  @router.get("/{workflow_id}", response_model=TaskStatusResponse)
  async def get_task(workflow_id: str, ...) -&gt; TaskStatusResponse:
      if not _owns(workflow_id, user_id):
          raise HTTPException(status_code=404, detail="task not found")

      handle = client.get_workflow_handle(workflow_id, run_id=run_id)
      agent_status = await handle.query("status")

      if desc.status == WorkflowExecutionStatus.COMPLETED:
          result = await handle.query("result")

      return TaskStatusResponse(...)
</code></pre>
<p>The <code>status</code> and <code>result</code> come straight from Temporal's query handlers that you saw in the workflow. There's no separate status table, and no database write after each step. Temporal is the <strong>source of truth</strong>.</p>
<h3 id="heading-containerizing-the-application">Containerizing the Application</h3>
<p>The gateway and the worker are packaged as two separate images. They share nothing at runtime, which is exactly what you want since they scale independently and have different responsibilities.</p>
<p>Both Dockerfiles live in the <code>/docker</code> directory, and use a multi-stage build.</p>
<h4 id="heading-why-multi-stage">Why Multi-Stage? 🤔</h4>
<p>The builder stage installs compilers and build tools to compile Python packages. The runtime stage gets only the finished dependencies and the application code. There's no point in putting the build tools into the final image.</p>
<h4 id="heading-the-gateway-image">The Gateway Image</h4>
<pre><code class="language-dockerfile">FROM python:3.14-slim-bookworm AS builder

RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install -r requirements.txt

FROM python:3.14-slim-bookworm AS runtime

ENV PYTHONUNBUFFERED=1 PATH="/opt/venv/bin:$PATH"

RUN useradd --create-home --uid 10001 app
WORKDIR /app

COPY --from=builder /opt/venv /opt/venv
COPY . .

USER app
EXPOSE 8000
CMD ["python", "main.py"]
</code></pre>
<p>The runtime stage copies the <code>venv</code> from the builder, drops into a non-root user, and starts the FastAPI app. And as always, we run it as a non-root user (be a good dev and follow proper security practices 😺).</p>
<h4 id="heading-the-worker-image">The Worker Image</h4>
<p>The worker Dockerfile is nearly identical with one small difference:</p>
<pre><code class="language-dockerfile"># procps gives us pgrep for the liveness probe
RUN apt-get update \
 &amp;&amp; apt-get install -y --no-install-recommends procps \
 &amp;&amp; rm -rf /var/lib/apt/lists/*
</code></pre>
<p>It installs procps so the Kubernetes liveness probe can run pgrep to check that the process is still alive.</p>
<h4 id="heading-building-and-loading-the-images">Building and Loading the Images</h4>
<p>The build script in <code>scripts/build-images.sh</code> builds both images, passing each app directory as the build context:</p>
<pre><code class="language-shell">docker build \
 -f "$ROOT/docker/Dockerfile.gateway" \
    -t "agent-gateway:$TAG" \
 "$ROOT/apps/gateway"

docker build \
 -f "$ROOT/docker/Dockerfile.worker" \
    -t "agent-worker:$TAG" \
 "$ROOT/apps/worker"
</code></pre>
<p>The Dockerfiles live under <code>docker/</code> but each one is built against its own app directory. That's what <code>COPY . .</code> actually copies.</p>
<p>After building, there's one more step before the images can run in the cluster. A local k3d cluster has no access to your Docker daemon, so images built locally aren't accessible to it. You have to import them explicitly:</p>
<pre><code class="language-shell">k3d image import "agent-gateway:dev" "agent-worker:dev" -c agent
</code></pre>
<p><code>scripts/load-images.sh</code> does this for you. Once the import completes, the cluster can pull the images like it usually does and your pods will start. 🎊</p>
<h3 id="heading-deploying-to-kubernetes">Deploying to Kubernetes</h3>
<p>With the images built and loaded into the cluster, the next step is applying the manifests. The setup is organized into two tiers. Tier 1 is the core application: the <code>namespace</code>, <code>config</code>, and <code>deployments</code>. Tier 2 is autoscaling, covered in the next section.</p>
<h4 id="heading-config-and-secrets">Config and Secrets</h4>
<p>Non-sensitive config lives in a <code>ConfigMap</code> at <code>infra/k8s/01-configmap.yaml</code>:</p>
<pre><code class="language-yaml">data:
  MODEL: "claude-opus-4-8"
  MAX_TOKENS: "4096"
  MAX_ITERATIONS: "20"
  TEMPORAL_HOST: "temporal-frontend.temporal.svc.cluster.local:7233"
  TEMPORAL_TASK_QUEUE: "agent-tasks"
  GATEWAY_HOST: "0.0.0.0"
  GATEWAY_PORT: "8000"
</code></pre>
<p>This is where the Temporal host address comes from. Notice that it uses the full in-cluster DNS name pointing at the Temporal frontend service in the temporal namespace. That address only resolves from inside the cluster, which is fine since both the gateway and the worker run there.</p>
<p>API keys go in a Kubernetes Secret that you create manually and never commit in Git. Both the <code>ConfigMap</code> and the <code>Secret</code> are mounted as environment variables using <code>envFrom</code> in each deployment.</p>
<h4 id="heading-the-gateway-deployment">The Gateway Deployment</h4>
<pre><code class="language-yaml">spec:
  replicas: 2
  containers:
    - name: gateway
      image: agent-gateway:dev
      imagePullPolicy: IfNotPresent
      command:
        [
          "python",
          "-m",
          "uvicorn",
          "main:/app",
          "--host",
          "0.0.0.0",
          "--port",
          "8000",
        ]
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
</code></pre>
<p>A few things worth noting. <code>imagePullPolicy: IfNotPresent</code> tells Kubernetes to use the locally loaded image instead of trying to pull from a registry. The startup command bypasses the reload=True flag that main.py uses when run directly locally. The readiness probe hits <code>/health</code> before Kubernetes sends any traffic to the pod, so the gateway only receives requests once it's actually up.</p>
<p>The gateway also gets a <code>ClusterIP</code> Service so other pods and the port-forward can reach it:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: gateway
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
</code></pre>
<h4 id="heading-the-worker-deployment">The Worker Deployment</h4>
<pre><code class="language-yaml"># just polls Temporal.
spec:
  replicas: 1
  containers:
    - name: worker
      image: agent-worker:dev
      livenessProbe:
        exec:
          command: ["pgrep", "-f", "worker.py"]
        initialDelaySeconds: 15
        periodSeconds: 20
      resources:
        requests:
          cpu: 250m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
</code></pre>
<p>The worker has no Service. It never accepts incoming connections. It connects outward to Temporal and polls for work, so nothing needs to reach it. That's why <strong>procps was installed in the Dockerfile</strong>.</p>
<p>The worker also gets more resources than the gateway. It's the one running LLM calls and executing tools, so it needs more resources. You can cap it depending on your requirements.</p>
<h4 id="heading-applying-everything">Applying Everything</h4>
<p>The deploy script at <code>scripts/deploy.sh</code> applies Tier 1 in the correct order:</p>
<pre><code class="language-bash">kubectl apply -f "$K8S/00-namespace.yaml"
kubectl apply -f "$K8S/01-configmap.yaml"
kubectl apply -f "$K8S/10-gateway-deployment.yaml"
kubectl apply -f "$K8S/20-worker-deployment.yaml"
</code></pre>
<p>Order matters here. The namespace has to exist before anything else can be created inside it, and the <code>ConfigMap</code> has to exist before the pods that read from it start up.</p>
<h3 id="heading-autoscaling-with-keda">Autoscaling with KEDA</h3>
<p>Kubernetes scales pods based on CPU or memory. That works fine for the gateway, which handles HTTP requests and actually uses CPU proportional to traffic. But it's the not the right signal for workers.</p>
<p>The worker sits completely idle when no tasks are queued. It doesn't burn CPU waiting. When tasks arrive, it gets busy fast. What you actually want to scale on is queue depth: how many tasks are waiting to be picked up.</p>
<p>That's what KEDA does. It reads external metrics like queue lengths, message counts, or in this case Temporal task queue depth, and scales your deployments accordingly.</p>
<h4 id="heading-scaling-the-worker">Scaling the Worker</h4>
<p>The <code>ScaledObject</code> in <code>infra/k8s/40-keda-worker-scaledobject.yaml</code> is what KEDA watches:</p>
<pre><code class="language-yaml">spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 120
  triggers:
    - type: temporal
      metadata:
        endpoint: temporal-frontend.temporal.svc.cluster.local:7233
        namespace: default
        taskQueue: agent-tasks
        queueTypes: "workflow,activity"
        targetQueueSize: "5"
        activationTargetQueueSize: "0"
</code></pre>
<p>Let's walk through the important fields:</p>
<ul>
<li><p><code>minReplicaCount</code>: 0 is the big one. KEDA can scale to zero, which a standard HPA can't do. When the queue is empty, every worker pod shuts down. You pay for nothing while the system is idle.</p>
</li>
<li><p><code>activationTargetQueueSize</code>: "0" means KEDA wakes the deployment the moment a single task enters the queue. Zero tasks, zero pods. One task, pods start spinning up.</p>
</li>
<li><p><code>targetQueueSize</code>: "5" tells KEDA to target roughly one worker pod per 5 pending tasks. Ten tasks in the queue means two pods.</p>
</li>
<li><p><code>cooldownPeriod</code>: 120 adds a 120-second buffer before KEDA scales back down after the queue clears.</p>
</li>
<li><p><code>queueTypes</code>: "workflow,activity" watches both queues. Without this, KEDA would only see part of the pending work.</p>
</li>
</ul>
<p><strong>Note</strong>: The Temporal scaler requires KEDA v2.17 or later. Make sure your Helm install is on that version or above.</p>
<h4 id="heading-scaling-the-gateway">Scaling the Gateway</h4>
<p>The gateway gets a plain CPU-based HPA at <code>infra/k8s/41-gateway-hpa.yaml</code>:</p>
<pre><code class="language-yaml">spec:
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
</code></pre>
<p>CPU is the right signal here because the gateway does real work proportional to incoming HTTP requests. It stays at a minimum of 2 replicas so there's no cold start delay on the API side.</p>
<h4 id="heading-installing-keda">Installing KEDA</h4>
<p>KEDA is installed via Helm before applying the <code>ScaledObject</code>:</p>
<pre><code class="language-shell">helm install keda kedacore/keda -n keda --create-namespace --wait
kubectl apply -f infra/k8s/40-keda-worker-scaledobject.yaml -f infra/k8s/41-gateway-hpa.yaml
</code></pre>
<p>Once those are applied, the system is fully operational. Submit a task and watch a worker pod appear. Let the queue empty and watch it disappear. That's the whole point.</p>
<p>And just like that, you have a fully durable, autoscaling AI Agent that you can schedule to run anytime. How cool is that? 😎</p>
<h2 id="heading-agent-in-action">Agent in Action</h2>
<p>Here's a quick demo of the agent in action (running inside a Kubernetes Cluster):</p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/aZy_scANmU4" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>

<h2 id="heading-conclusion">Conclusion</h2>
<p>Running AI agents in production is a completely different problem than building them. I tried to focus on that gap here, and hopefully it gave you a solid reference for how to think about durability and scaling. And I hope it also helped you build or understand something different from a regular AI chat application.</p>
<p>The Temporal and KEDA combination is really something you should learn and know more about if you're into building AI agents or doing DevOps in general. Temporal helps with the biggest issue with AI agents (the durability), and KEDA makes sure that you aren't paying for idle workers at 2am (if used in prod) if nothing is running. You aren't just scaling on CPU, but based on events and that is important.</p>
<p>There's a lot of room to extend this from here. You could swap the dev JWT for proper OIDC, or expand the toolkit coverage through Composio to support more of your workflows.</p>
<p>The foundation is there. The rest is just building on top of it.</p>
<p>You can find the complete source code here: <a href="https://github.com/shricodev/kron-k8s-agent">shricodev/kron-k8s-agent</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Spring Boot App with MySQL on Amazon EKS ]]>
                </title>
                <description>
                    <![CDATA[ If you've been looking to deploy your Spring Boot app to the cloud but feel a little overwhelmed by all the moving pieces, don't worry, you're not alone. Kubernetes can seem intimidating at first, but ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-deploy-a-spring-boot-app-with-mysql-on-amazon-eks/</link>
                <guid isPermaLink="false">6a20609578a43e3153ae5422</guid>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ EKS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Springboot ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chisom Uma ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 17:12:53 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/5a7cd6a7-7850-4e3c-9a45-b577c2f91598.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've been looking to deploy your Spring Boot app to the cloud but feel a little overwhelmed by all the moving pieces, don't worry, you're not alone.</p>
<p>Kubernetes can seem intimidating at first, but Amazon EKS (Elastic Kubernetes Service) makes it much more approachable, especially when you have a step-by-step guide to follow.</p>
<p>In this tutorial, we'll walk through exactly how to get a Spring Boot application with a MySQL database up and running on Amazon EKS. I'll take you from from containerizing your app to connecting it to a managed database, all the way to accessing it live in the cloud. Let’s get started.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-application-overview">Application Overview</a></p>
</li>
<li><p><a href="#heading-what-is-amazon-eks">What is Amazon EKS?</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-a-spring-boot-app-with-mysql-on-amazon-eks">How to Deploy a Spring Boot App with MySQL on Amazon EKS</a></p>
<ul>
<li><p><a href="#heading-step-1-create-the-vpc">Step 1: Create the VPC</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-the-mysql-database-in-a-private-subnet">Step 2: Set Up the MySQL Database in a Private Subnet</a></p>
</li>
<li><p><a href="#heading-step-3-deploy-ec2-instance-in-a-public-subnet">Step 3: Deploy EC2 Instance in a Public Subnet</a></p>
</li>
<li><p><a href="#heading-step-4-create-ssh-tunneling-for-the-database">Step 4: Create SSH Tunneling for the Database</a></p>
</li>
<li><p><a href="#heading-step-5-set-up-a-simple-springboot-application-development">Step 5: Set Up a Simple SpringBoot Application Development</a></p>
</li>
<li><p><a href="#heading-step-6-configure-springboot-app-for-database">Step 6: Configure SpringBoot App for Database</a></p>
</li>
<li><p><a href="#heading-step-7-dockerize-the-spring-boot-application">Step 7: Dockerize the Spring Boot Application</a></p>
</li>
<li><p><a href="#heading-step-8-push-the-image-to-elastic-container-registry-ecr">Step 8: Push the Image to Elastic Container Registry (ECR)</a></p>
</li>
<li><p><a href="#heading-step-9-implement-aws-app-load-balancer">Step 9: Implement AWS App Load Balancer</a></p>
</li>
<li><p><a href="#heading-step-10-create-a-cluster-in-eks">Step 10: Create a Cluster in EKS</a></p>
</li>
<li><p><a href="#heading-step-11-install-aws-load-balancing">Step 11: Install AWS Load Balancing</a></p>
</li>
<li><p><a href="#heading-step-12-create-and-deploy-kubernetes">Step 12: Create and Deploy Kubernetes</a></p>
</li>
<li><p><a href="#heading-step-13-delete-cluster">Step 13: Delete Cluster</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, ensure you have the following:</p>
<ul>
<li><p>Basic knowledge of AWS (AWS Console access).</p>
</li>
<li><p>Basic knowledge of containerization.</p>
</li>
<li><p>Working knowledge of Kubernetes.</p>
</li>
<li><p>Basic knowledge of databases.</p>
</li>
<li><p><a href="https://helm.sh/docs/intro/install/">Helm</a> installed</p>
</li>
<li><p><a href="https://kubernetes.io/docs/tasks/tools/">Kubectl</a> installed</p>
</li>
<li><p><a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/setting-up-eksctl.html">Eksctl</a> installed</p>
</li>
<li><p>An IDE</p>
</li>
</ul>
<h2 id="heading-application-overview">Application Overview</h2>
<p>The application runs inside an AWS VPC spread across two availability zones for high availability. When a user makes a request, it flows through an Internet Gateway into an AWS Application Load Balancer sitting in the public subnet, which handles incoming traffic via an Ingress rule.</p>
<p>The Load Balancer routes requests to the App Service, which distributes them across multiple App Pods running inside AWS EKS (Elastic Kubernetes Service) in the private subnets.</p>
<p>The Docker images for these pods are pulled from AWS ECR (Elastic Container Registry). For data persistence, the app pods connect to Amazon RDS MySQL databases through a MySQL External Service, with an RDS instance in each availability zone to ensure redundancy.</p>
<p>A NAT Gateway in the public subnet allows the private resources to make outbound internet calls without being directly exposed to the internet.</p>
<h2 id="heading-what-is-amazon-eks">What is Amazon EKS?</h2>
<p>If you've ever tried to manage containers manually, you already know it can get messy pretty quickly, tracking which containers are running, restarting ones that crash, scaling up when traffic spikes... It's a lot.</p>
<p>That's exactly the problem Kubernetes was built to solve. It automates the deployment, scaling, and management of containerized applications. But setting up and maintaining your own Kubernetes cluster from scratch? That's a whole other challenge.</p>
<p>That's where <a href="https://aws.amazon.com/pm/eks/">Amazon EKS</a> comes in. EKS is a fully managed Kubernetes service provided by AWS, which means AWS handles the heavy lifting of setting up, securing, and maintaining the Kubernetes control plane for you. You just focus on deploying your application.</p>
<h2 id="heading-how-to-deploy-a-spring-boot-app-with-mysql-on-amazon-eks">How to Deploy a Spring Boot App with MySQL on Amazon EKS</h2>
<p>In this section, we’ll cover the steps to follow in deploying your SpringBoot application with MySQL on Amazon EKS.</p>
<h3 id="heading-step-1-create-the-vpc">Step 1: Create the VPC</h3>
<p>To create a VPC, log in to the <a href="https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fus-east-1.console.aws.amazon.com%2Fiam%3Fca-oauth-flow-id%3Df7d2%26hashArgs%3D%2523%26isauthcode%3Dtrue%26oauthStart%3D1777888354778%26region%3Dus-east-1%26state%3DhashArgsFromTB_us-east-1_0481039a94bc47bd&amp;client_id=arn%3Aaws%3Asignin%3A%3A%3Aconsole%2Fiamv2&amp;forceMobileApp=0&amp;code_challenge=USO5m22DxkRMX1kvbC19ZE-zr5Eyzp52MXY5jnbANB8&amp;code_challenge_method=SHA-256">AWS IAM Console</a> and search for “VPC,” then click create VPC.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/9a1f57fd-7665-469f-a0c2-7d548590c20f.png" alt="vpc interface" style="display:block;margin:0 auto" width="714" height="192" loading="lazy">

<p>Select the "VPC and more option:, and give your VPC a name for your project, for example, spring-demo. Set the IPv4 CIDR block to 10.4.0.0/16. For the NAT gateway configuration, select Zonal, then In 1 AZ.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/960002c0-9d53-481d-90be-79a7092088ce.png" alt="NAT gateway config" style="display:block;margin:0 auto" width="465" height="228" loading="lazy">

<p>Select None for VPC endpoints configuration. Next, click Create VPC, then click View VPC. This takes you to the VPC resource map.</p>
<h3 id="heading-step-2-set-up-the-mysql-database-in-a-private-subnet">Step 2: Set Up the MySQL Database in a Private Subnet</h3>
<p>First, you need to create the security group for the MySQL and EC2 instance deployment. To do that, navigate to EC2 &gt; Security Groups. For the inbound rule, select Type: All traffic and Source: Anywhere-IPv4. Then click Create security group.</p>
<p>Next, we’ll create the subnet group for the database. To do that, navigate to Aurora and RDS &gt; Subnet groups and click Create DB subnet group. Next, configure the DB subnet to include:</p>
<ul>
<li><p><strong>Name</strong>: private-subnet-db</p>
</li>
<li><p><strong>Description</strong>: private-subnet-db</p>
</li>
<li><p><strong>VPC</strong>: Select VPC</p>
</li>
<li><p><strong>Add subnets</strong>: Choose <code>us-east-1a</code> and <code>us-east-1b</code> as the availability zones, then select the private and public subnets</p>
</li>
</ul>
<p>Click Create**.**</p>
<p>Now, navigate to Databases, click Create database, and select Full configuration. Select MySQL as the engine type.</p>
<p>Select the Free tier when choosing a sample template. Next, give your DB a username and a strong password. Choose <code>db.t3.micro</code> as the instance type.</p>
<p>Select your VPC and associated private subnet. Now, uncheck the "Enable auto minor version upgrade" option in the Additional configuration section and click Create database.</p>
<p>While our database initializes, let's create a key pair for the EC2 instance, which will be launched in a public subnet. To do that, navigate to EC2 &gt; Network &amp; Security &gt; Key Pairs and click Create key pair.</p>
<p>Give your key pair a name, for example, ece-db-key-pair. Leave everything else as-is and click Create key pair. This automatically downloads the key-pair into your local machine.</p>
<h3 id="heading-step-3-deploy-ec2-instance-in-a-public-subnet">Step 3: Deploy EC2 Instance in a Public Subnet</h3>
<p>Now it’s time to create an EC2 instance. To do this, navigate to EC2 &gt; Instances and click Launch instances. Select the key pair you just created in the Key pair section.</p>
<p>Next, in the Network section, select the VPC created earlier for the project. For Auto-assign public IP, choose Enable. Next, choose the Select existing security group option and select the all-access-sg security group created earlier. Next, click Launch instance.</p>
<h3 id="heading-step-4-create-ssh-tunneling-for-the-database">Step 4: Create SSH Tunneling for the Database</h3>
<p>For this step, go into your terminal and navigate to the folder where your key pair is downloaded. Run the ls command, and you should see your key pair there.</p>
<p>Next, you need to change the permission of the key pair file. Use the command below:</p>
<pre><code class="language-shell">chmod 0400 ece-db-key-pair.pem&nbsp;
</code></pre>
<p>Now, run the SSH tunneling command below:</p>
<pre><code class="language-shell">ssh -i &lt;YOUR-KEY-PAIR&gt;.pem -f -N -L &lt;LOCAL-PORT&gt;:&lt;YOUR-RDS-ENDPOINT&gt;:&lt;RDS-PORT&gt; &lt;EC2-USERNAME&gt;@&lt;YOUR-EC2-PUBLIC-DNS&gt; -v
</code></pre>
<ul>
<li><p><code>&lt;YOUR-KEY-PAIR&gt;.pem</code>: the name of your downloaded key pair file</p>
</li>
<li><p><code>&lt;LOCAL-PORT&gt;</code>:&nbsp; the port on your laptop (3306 for MySQL, 5432 for PostgreSQL)</p>
</li>
<li><p><code>&lt;YOUR-RDS-ENDPOINT&gt;</code>: found in AWS Console &gt; RDS &gt; Your database &gt; Connectivity &amp; Security &gt; Endpoint</p>
</li>
<li><p><code>&lt;RDS-PORT&gt;</code>: same as local port (3306 for MySQL, 5432 for PostgreSQL)</p>
</li>
<li><p><code>&lt;EC2-USERNAME&gt;</code>: usually ec2-user for Amazon Linux, ubuntu for Ubuntu</p>
</li>
<li><p><code>&lt;YOUR-EC2-PUBLIC-DNS&gt;</code>: found in AWS Console &gt; EC2 &gt; Your instance &gt; Public IPv4 DNS</p>
</li>
</ul>
<p>This command lets your laptop or local machine talk directly to your remote database, as if the database were sitting on your own computer.</p>
<p>After running this command, you can open a database tool (like MySQL Workbench, DBeaver, or TablePlus) on your laptop and connect to:</p>
<ul>
<li><p>Host: localhost</p>
</li>
<li><p>Port: 3306</p>
</li>
</ul>
<p>For this tutorial, I’ll be using the community version of DBeaver. You can use other similar tools, but if you prefer to use the same tool for the purpose of this guide, you can install the community version from the official <a href="https://dbeaver.io/download/">DBeaver download page</a>.</p>
<p>After download and installation, open the DBeaver client and click the Connect to a database icon in the top-left corner of the app.</p>
<p>Select MySQL and click Next. On the next window, enter your database username and password, and set Server Host to 127.0.0.1.</p>
<p>Click Test Connection.</p>
<p>You should see a window appear on your screen, indicating that the connection is successful.</p>
<p>Click OK and Finish.</p>
<p>Now, on the left panel, you should see your connection. Expand it to see the database structure.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/5c8115eb-020a-4b8d-9c84-9a1b4c10a071.png" alt="database structure" style="display:block;margin:0 auto" width="730" height="212" loading="lazy">

<p>Now, you have successfully created SSH tunneling for your database.</p>
<h4 id="heading-troubleshooting">Troubleshooting</h4>
<p>While attempting to test the database connection, I initially ran into a “Plugin 'mysql_native_password' is not loaded” error. If you encounter this error, follow the steps below to fix it.</p>
<ol>
<li><p>On the Connection Settings window, navigate to the Driver properties tab.</p>
</li>
<li><p>Look for allowPublicKeyRetrieval and set it to FALSE.</p>
</li>
<li><p>Navigate back to the Main tab and click Test Connection.</p>
</li>
</ol>
<p>Everything should work fine now.</p>
<h3 id="heading-step-5-set-up-a-simple-springboot-application-development">Step 5: Set Up a Simple SpringBoot Application Development</h3>
<p>To get started, head over to the <a href="https://start.spring.io/">Spring Initializr website</a>. Rename Artifact to “springboot-mysql-eks”. Then click ADD DEPENDENCIES… to add dependencies for the REST APIs. Search for the following dependencies:</p>
<ul>
<li><p><strong>Spring Web:</strong> Build web apps, including RESTful applications using Spring MVC. Uses Apache Tomcat as the default embedded container.</p>
</li>
<li><p><strong>Spring Data JPA:</strong> Persist data in SQL stores with the Java Persistence API using Spring Data and Hibernate.</p>
</li>
<li><p><strong>IBM DB2 Driver:</strong> A JDBC driver that provides access to IBM DB2.</p>
</li>
<li><p><strong>Lombok:</strong> A Java annotation library that helps to reduce boilerplate code.</p>
</li>
</ul>
<p>Next, click GENERATE at the bottom center of the page. This action downloads a zip file to your local machine. Open this file in an IDE, such as VSCode or IntelliJ IDEA. For this tutorial, I use VSCode. In the build.gradle file, you can see all the added dependencies:</p>
<pre><code class="language-json">dependencies {
   implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
   implementation 'org.springframework.boot:spring-boot-starter-webmvc'
   compileOnly 'org.projectlombok:lombok'
   runtimeOnly 'com.ibm.db2:jcc'
   annotationProcessor 'org.projectlombok:lombok'
   testImplementation 'org.springframework.boot:spring-boot-starter-data-jpa-test'
   testImplementation 'org.springframework.boot:spring-boot-starter-webmvc-test'
   testCompileOnly 'org.projectlombok:lombok'
   testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
   testAnnotationProcessor 'org.projectlombok:lombok'
}
</code></pre>
<h4 id="heading-what-were-building">What we're building</h4>
<p>The Spring Boot app is a currency exchange rate and conversion app:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/eef403da-eb1d-47e8-8edd-d0e4d845a3d1.png" alt="image of counter " style="display:block;margin:0 auto" width="750" height="434" loading="lazy">

<p>We'll be inserting the exchange data into the database table.</p>
<p>To continue with this tutorial, you can clone the project repo <a href="https://github.com/ChisomUma/sprint-boot-msql-eks">here</a> to save time.</p>
<p>In main &gt; java &gt; com.. &gt; model &gt; ExchangeRate, you’ll see the code below:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.model;

import jakarta.persistence.*;
import lombok.Getter;
import lombok.Setter;

import java.sql.Date;

@Getter
@Setter
@Entity
@Table(name = "exchange-rate")
public class ExchangeRate {
   @Id
   @GeneratedValue(strategy=GenerationType.AUTO)
   private Integer transactionId;
   private String sourceCurrency;
   private String targetCurrency;
   private double amount;
   private Date lastUpdated;
}
</code></pre>
<p>This class is essentially a blueprint for storing currency exchange rate data in our database. It uses the libraries and dependencies added earlier. Lombok handles all the repetitive getter/setter boilerplate so you don't have to write it yourself, while JPA annotations like <code>@Entity</code> and <code>@Table</code> tell Spring, "hey, this class maps to a database table called exchange-rate."</p>
<p>Inside the class, there are five fields that become database columns:</p>
<ul>
<li><p>A self-incrementing transactionId as the primary key.</p>
</li>
<li><p>sourceCurrency and targetCurrency to track which currencies are being converted,</p>
</li>
<li><p>The amount holding the actual exchange rate</p>
</li>
<li><p>lastUpdated date, so you always know how fresh your data is.</p>
</li>
</ul>
<p>To store the data, create a repository file in main &gt; java &gt; com.. &gt; repository &gt; ExchangeRateRepository:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.repository;

import com.example.springbootmysqleks.model.ExchangeRate;
import org.springframework.data.jpa.repository.JpaRepository;

public interface ExchangeRateRepository extends JpaRepository&lt;ExchangeRate, Integer&gt; {
   ExchangeRate findBySourceCurrencyAndTargetCurrency(String sourceCurrency, String targetCurrency);
}
</code></pre>
<p>This file acts as the middleman between your code and the database. By simply extending JpaRepository, you instantly get a whole suite of built-in database operations (like save, delete, findAll, and so on) completely for free, without writing a single SQL query.</p>
<p>The interface is typed to work with the <code>ExchangeRate</code> model we just looked at, using Integer as the primary key type.</p>
<p>The one custom method, <code>findBySourceCurrencyAndTargetCurrency</code>, is where Spring's magic really shines. Just by following a naming convention, Spring automatically figures out the SQL query it needs to run, so you can look up an exchange rate by simply passing in two currency codes like "USD" and "EUR" without writing any query logic yourself.</p>
<p>To use the <code>findBySourceCurrencyAndTargetCurrency</code> method, create a service file in main &gt; java &gt; com.. &gt; service &gt; ExchangeRateService with the code below:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.service;

import com.example.springbootmysqleks.model.ExchangeRate;
import com.example.springbootmysqleks.repository.ExchangeRateRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class ExchangeRateService {

   @Autowired
   private ExchangeRateRepository exchangeRateRepository;

   public ExchangeRate addExchangeRate(ExchangeRate exchangeRate) {
       return exchangeRateRepository.save(exchangeRate);
   }

   public double getAmount(String sourceCurrency, String targetCurrency) {
       ExchangeRate exchangeRate =  exchangeRateRepository.findBySourceCurrencyAndTargetCurrency(sourceCurrency, targetCurrency);
       return exchangeRate == null ? 0 : exchangeRate.getAmount();
   }
}
</code></pre>
<p>Here, we created a <code>@Service</code> class that interacts with the repository.</p>
<p>The class has two methods, the <code>addExchangeRate</code>, which simply takes an <code>ExchangeRate</code> object and saves it to the database, and <code>getAmount</code>, which takes a source and target currency, uses our custom repository method to look up the matching record, and then either returns the exchange rate amount or a safe default of 0 if no record is found.</p>
<p>That little ternary check (<code>exchangeRate == null ? 0 : exchangeRate.getAmount()</code>) ensures the app doesn't crash if you query a currency pair that doesn't exist in the database yet.</p>
<p>In main &gt; java &gt; com.. &gt; controller &gt; ExchangeRateService, we have the following code:</p>
<pre><code class="language-java">package com.example.springbootmysqleks.controller;

import com.example.springbootmysqleks.model.ExchangeRate;
import com.example.springbootmysqleks.service.ExchangeRateService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

@RestController
public class ExchangeRateController {

   @Autowired
   ExchangeRateService exchangeRateService;

   @GetMapping("/getAmount")
   public double getAmount(@RequestParam String sourceCurrency, @RequestParam String targetCurrency) {
       return exchangeRateService.getAmount(sourceCurrency, targetCurrency);
   }

   @PostMapping("/addExchangeRate")
   public ExchangeRate addExchangeRate(@RequestBody ExchangeRate exchangeRate) {
       return exchangeRateService.addExchangeRate(exchangeRate);
   }

   @GetMapping("/")
   public String getHealth() {
       return "up";
   }

}
</code></pre>
<p>The <code>@RestController</code> annotation tells Spring this class will be serving up REST API endpoints, and again <code>@Autowired</code> wires in the service layer automatically.</p>
<p>There are three endpoints:</p>
<ol>
<li><p>a GET request to <code>/getAmount</code> that accepts <code>sourceCurrency</code> and <code>targetCurrency</code> as query parameters and returns the exchange rate amount</p>
</li>
<li><p>a POST request to <code>/addExchangeRate</code> that accepts a full <code>ExchangeRate</code> object as a JSON body and saves it to the database</p>
</li>
<li><p>and finally a simple health check endpoint at / that just returns "up",&nbsp; which is a common pattern in cloud deployments to let load balancers and orchestration tools know the app is alive and running.</p>
</li>
</ol>
<h3 id="heading-step-6-configure-springboot-app-for-database">Step 6: Configure SpringBoot App for Database</h3>
<p>Now, it’s time to configure the application for the database. Navigate to src &gt; main &gt; resources &gt; application.properties, and you should see this:</p>
<pre><code class="language-java">spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://\({MYSQL_HOSTNAME}:\){MYSQL_PORT}/${MYSQL_DATABASE}?createDatabaseIfNotExist=true
spring.datasource.username=${MYSQL_USERNAME}
spring.datasource.password=${MYSQL_PASSWORD}

spring.jpa.hibernate.ddl-auto=update

spring.jpa.show-sql: true
</code></pre>
<p>These are the configurations that allow your app to connect with the database.</p>
<ul>
<li><p><code>spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver</code>: The driver class for the MySQL database.</p>
</li>
<li><p><code>spring.datasource.url=jdbc:mysql://\({MYSQL_HOSTNAME}:\){MYSQL_PORT}/${MYSQL_DATABASE}?createDatabaseIfNotExist=true</code>: This is the data source URL in which we are using the MySQL hostname (127.0.0.1), port name, and database name.</p>
</li>
<li><p><code>spring.datasource.username=${MYSQL_USERNAME}</code>: your database user name.</p>
</li>
<li><p><code>spring.datasource.password=${MYSQL_PASSWORD}</code>: your database password.</p>
</li>
</ul>
<p>One thing to note: the process of configuring environment variables with your actual credentials varies depending on the IDE you're using. If you're using IntelliJ IDEA, this process is pretty straightforward. If you're using VS Code, the process is different.</p>
<p>To configure your actual credentials for the <code>env</code> variables, create a <code>.vscode/launch.json</code> file in your project root folder and paste in the following configuration:</p>
<pre><code class="language-json">{
 "version": "0.2.0",
 "configurations": [
   {
     "type": "java",
     "name": "Spring Boot App",
     "request": "launch",
     "mainClass": "com.example.springbootmysqleks.SpringbootMysqlEksApplication",
     "projectName": "springboot-mysql-eks",
     "env": {
       "MYSQL_HOSTNAME": "localhost",
       "MYSQL_PORT": "3306",
       "MYSQL_DATABASE": "exchangedb",
       "MYSQL_USERNAME": "root",
       "MYSQL_PASSWORD": "CHANGE_ME"
     }
   }
 ]
}
</code></pre>
<p>Configure the credentials to use your actual credentials.</p>
<p>Now, when you run the app, you should be able to see the created <code>exchangedb</code> table in DBeaver:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/9ea1b85c-25a8-4587-a013-4d592d1664eb.png" alt="exchnage db image" style="display:block;margin:0 auto" width="724" height="224" loading="lazy">

<p>Use an API testing tool like Postman to send a POST request to the database:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/5fa9e950-17ff-4fd6-a650-b6a33b607744.png" alt="postman request image" style="display:block;margin:0 auto" width="446" height="97" loading="lazy">

<p>Next, run the <code>select * from exchange_rate er</code> script in the <code>exchangedb</code> SQL script editor:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/a7a2ffea-f930-4f27-b29a-f5aaf8f8543e.png" alt="sql editor image" style="display:block;margin:0 auto" width="2048" height="1051" loading="lazy">

<p>At the bottom of the editor, you should see the created table from the Postman request.</p>
<p>Now, run a GET request to the endpoint below:</p>
<pre><code class="language-json">http://localhost:8080/getAmount?sourceCurrency=USD&amp;targetCurrency=EUR&amp;transactionId=1
</code></pre>
<p>You should get a 200 OK response with the currency exchange value, for example, 0.93.</p>
<h3 id="heading-step-7-dockerize-the-springboot-application">Step 7: Dockerize the SpringBoot Application</h3>
<p>To Dockerize your application, create a file named Dockerfile and paste in the configuration below:</p>
<pre><code class="language-dockerfile">FROM eclipse-temurin:17-jre-jammy
WORKDIR /app
COPY build/libs/springboot-mysql-eks.jar /app
EXPOSE 8080
CMD ["java", "-jar", "springboot-mysql-eks.jar"]
</code></pre>
<p>Our Dockerfile starts by pulling the lightweight <code>eclipse-temurin:17-jre-jammy</code> base image to keep things lean, then sets /app as the working directory inside the container. It copies our compiled Spring Boot JAR file from the local build/libs/ folder into that directory, exposes port 8080 for incoming traffic, and finally runs the app with <code>java -jar</code> when the container starts up.</p>
<p>Next, build the app to create the <code>.jar</code> file. To do that, run the command below:</p>
<pre><code class="language-shell">./gradlew clean assemble 
</code></pre>
<p>You should get a successful build output as shown below:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/bde4395b-bc30-46e4-9687-bd03836f574d.png" alt="bde4395b-bc30-46e4-9687-bd03836f574d" style="display:block;margin:0 auto" width="471" height="93" loading="lazy">

<p>Navigate to build &gt; the libs folder. You’ll see the <code>springboot-mysql-eks</code> file created.</p>
<p>If you run into an “operation couldn’t be completed.” error, try running the export commands to fix this issue. If you’re using a Mac, then run the command below:</p>
<pre><code class="language-shell">brew install openjdk@21
</code></pre>
<p>Next, run the export commands:</p>
<pre><code class="language-shell">export JAVA_HOME=/opt/homebrew/opt/openjdk@21/libexec/openjdk.jdk/Contents/Home

export PATH=\(JAVA_HOME/bin:\)PATH
</code></pre>
<p>Then run the <code>./gradlew clean assemble</code> command again.</p>
<h3 id="heading-step-8-push-the-image-to-elastic-container-registry-ecr">Step 8: Push the Image to Elastic Container Registry (ECR)</h3>
<p>In this next step, we’ll create an Amazon ECR and push our image to the registry.</p>
<p>To get started, head back into your AWS Console and search for “ECR”. On the ECR page, click Create**.** Then, enter a repository name, for example, “springboot-mysql-eks.” Next, click Create.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/21c56eef-c656-49b6-a624-b1da66cf1096.png" alt="ECR image" style="display:block;margin:0 auto" width="1176" height="248" loading="lazy">

<p>Next, select the repo and click View push commands at the top of the page. This presents a window with a bunch of commands you can use to push your image to the registry. Open your terminal and run these commands. You'll need to ensure Docker is running on your local machine before running the commands.</p>
<p>After running the commands, you should see that your image has been successfully pushed to the registry.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/b39d0b9c-ea99-4cf3-bfb0-3496ac79dea9.png" alt="ECR image" style="display:block;margin:0 auto" width="1118" height="196" loading="lazy">

<h3 id="heading-step-9-implement-aws-app-load-balancer">Step 9: Implement AWS App Load Balancer</h3>
<p>Before getting started with this step, make sure you check out the installation steps and link to additional AWS documentation in the project README. This will help you follow along.</p>
<p>Now, to get started, create a new folder in your root directory named <code>cluster</code> . This is where you'll download the AWS IAM policy for the load balancer. To download the policy, go into your terminal and <code>cd</code> into <code>cluster</code>, then run the command below:</p>
<pre><code class="language-shell">curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.14.1/docs/install/iam_policy.json
</code></pre>
<p>This command is gotten from the <a href="https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html">AWS documentation</a>. Now, when you go to the folder, you’ll see an iam_policy.json file automatically generated.</p>
<p>Next, apply the IAM policy using the command below:</p>
<pre><code class="language-shell">aws iam create-policy \
    --policy-name AWSLoadBalancerControllerIAMPolicy \
    --policy-document file://iam_policy.json
</code></pre>
<p>You should get an output like this in your terminal:</p>
<img alt="terminal image" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>This shows that the IAM policy has been successfully created. To confirm this, head over to the IAM section in your console, navigate to Policies**,** and search for “AWSLoad…”. You should see the policy created there.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/c022959a-c732-4dfd-bd58-4b80d3011b71.png" alt="load balancer policy image" style="display:block;margin:0 auto" width="577" height="229" loading="lazy">

<p>The next step is creating the Kubernetes service account. But before that, you need to tag your public and private subnets as described in this <a href="https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html">documentation</a>.</p>
<p>Now, head over to the VPC dashboard, navigate to Subnets, click into a subnet, and navigate to Tags. Then, click Manage tags.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/615f3ec2-d266-413e-8936-d0767d03316d.png" alt="tag image" style="display:block;margin:0 auto" width="1180" height="207" loading="lazy">

<p>Click Add new tag, then enter the key/pair value in the documentation.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/253abe00-cb7e-4276-81de-79ba0ffc249a.png" alt="tag image" style="display:block;margin:0 auto" width="1162" height="167" loading="lazy">

<h3 id="heading-step-10-create-a-cluster-in-eks">Step 10: Create a Cluster in EKS</h3>
<p>To create a Kubernetes cluster on EKS, you need the eksctl CLI. Follow the instructions in the <a href="https://docs.aws.amazon.com/eks/latest/eksctl/installation.html">AWS eksctl documentation</a> to install the CLI. Next, you need a <a href="https://docs.aws.amazon.com/eks/latest/eksctl/schema.html">config file schema</a> to create the cluster. To use this schema, create a new file called cluster.yaml in the cluster folder.</p>
<p>Next, paste in the following configurations:</p>
<pre><code class="language-dockerfile">apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: spring-test-cluster
  region: us-east-1
  version: "1.30"

vpc:
  id: "&lt;your-vpc-id&gt;"
  subnets:
    private:
      us-east-1a:
        id: "&lt;your-private-subnet-1a-id&gt;" # spring-demo-subnet-private1-us-east-1a
      us-east-1b:
        id: "&lt;your-private-subnet-1b-id&gt;" # spring-demo-subnet-private2-us-east-1b
    public:
      us-east-1a:
        id: "&lt;your-public-subnet-1a-id&gt;" # spring-demo-subnet-public1-us-east-1a
      us-east-1b:
        id: "&lt;your-public-subnet-1b-id&gt;" # spring-demo-subnet-public2-us-east-1b

nodeGroups:
  - name: ng-1
    labels: { role: backend }
    instanceType: t2.micro
    desiredCapacity: 3
    minSize: 3
    maxSize: 5
    privateNetworking: true
    ssh:
      allow: true
      publicKeyName: &lt;your-ec2-key-name&gt;
    iam:
      withAddonPolicies:
        imageBuilder: true
        awsLoadBalancerController: true
        autoScaler: true
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: aws-load-balancer-controller
        namespace: kube-system
      attachPolicyARNs:
        - arn:aws:iam::&lt;YOUR_AWS_ACCOUNT_ID&gt;:policy/AWSLoadBalancerControllerIAMPolicy
</code></pre>
<p>Th <code>ClusterConfig</code> file is used by eksctl to create our EKS cluster called <code>spring-test-cluster</code> in the <code>us-east-1 region</code>, running Kubernetes version 1.30. It plugs into our existing VPC, placing the worker nodes across private subnets in two availability zones <code>us-east-1a</code> and <code>us-east-1b</code>) for high availability, while keeping public subnets available for the load balancer.</p>
<p>The node group spins up t2.micro EC2 instances with a desired count of 3 (scaling up to 5 if needed), all with private networking enabled for security. It also sets up the necessary IAM permissions for the AWS Load Balancer Controller, Auto Scaler, and ECR image access so our cluster has everything it needs to manage traffic and pull our Docker images automatically.</p>
<p>Now, after updating your configuration with your credentials, run the command below:</p>
<pre><code class="language-shell">eksctl create cluster -f cluster.yaml
</code></pre>
<p>This creates the cluster. You should see an output like this on your terminal:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/758c6b7d-9cf3-4fa6-a0d5-2276adc82147.png" alt="cluster creation image" style="display:block;margin:0 auto" width="1466" height="514" loading="lazy">

<p>Now, in your AWS console, navigate to CloudFormation, and you’ll see your cluster creation process in progress.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/4aa96465-45c1-4e43-b8ca-d44a8802e02d.png" alt="stack creation image" style="display:block;margin:0 auto" width="1249" height="211" loading="lazy">

<p>Now, when you go into the EC2 instance page, you should see the three nodes created.</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/d946d993-0c64-4870-90fb-1132df51544f.png" alt="running cluster image" style="display:block;margin:0 auto" width="950" height="99" loading="lazy">

<h3 id="heading-step-11-install-aws-load-balancing">Step 11: Install AWS Load Balancing</h3>
<p>The next step is installing a load balancer for our application. To get started, run the command below:</p>
<pre><code class="language-shell"> kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller/crds?ref=master"
</code></pre>
<p>This installs <a href="https://www.geeksforgeeks.org/devops/custom-resource-definitions-crds/">custom resource definitions (CRDs)</a> for our controller. Next, run the command below to add the Helm chart repo.</p>
<pre><code class="language-shell">helm repo add eks https://aws.github.io/eks-charts
</code></pre>
<p>Update your local repo to ensure you have the most recent charts:</p>
<pre><code class="language-shell">helm repo update eks
</code></pre>
<p>Next, install the Helm chart:</p>
<pre><code class="language-shell">helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
&nbsp; -n kube-system \
&nbsp; --set clusterName=my-cluster \
&nbsp; --set serviceAccount.create=false \
&nbsp; --set serviceAccount.name=aws-load-balancer-controller \
&nbsp; --version 1.14.0
</code></pre>
<p>Next, verify that the controller is installed:</p>
<pre><code class="language-shell">kubectl get deployment -n kube-system aws-load-balancer-controller
</code></pre>
<p>You should see this on your terminal:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/61d954f3-b09d-4691-9b1c-02134c2d8bf1.png" alt="61d954f3-b09d-4691-9b1c-02134c2d8bf1" style="display:block;margin:0 auto" width="1226" height="118" loading="lazy">

<p>This indicates that your controller is ready.</p>
<h3 id="heading-step-12-create-and-deploy-kubernetes">Step 12: Create and Deploy Kubernetes</h3>
<p>To get started, you'll first need to create a Kubernetes manifest file. For that, we’ll use <a href="https://www.freecodecamp.org/news/what-is-a-helm-chart-tutorial-for-kubernetes-beginners/">Helm Chart</a>.</p>
<pre><code class="language-shell">helm create ytchart
</code></pre>
<p>The command above creates a folder named <code>ytchart</code> with the templates for the components. In this folder, you need to make some configurations for your use case. First, navigate to ytchart &gt; templates and delete the <code>serviceaccount.yaml</code> file, since we already created the service account earlier.</p>
<p>Next, go to values.yaml and make the following changes:</p>
<ul>
<li><p>For <code>repository</code>, navigate to the ECR service page on the AWS Console and copy the image URI.</p>
</li>
<li><p>Tag is <code>latest</code>.</p>
</li>
<li><p>Set database name</p>
</li>
</ul>
<pre><code class="language-dockerfile">mysql:
 databaseName: exchangedb
</code></pre>
<ul>
<li><p>Change service account creation to <code>false</code>.</p>
</li>
<li><p>Scroll down a bit more and change the service <code>type</code> to <code>NodePort</code> and <code>port</code> to <code>8080</code>.</p>
</li>
</ul>
<p>You also need to store the database username and password using secrets. Navigate to the <code>templates</code> folder and go into the file named <code>secrets.yaml</code>. Here, set your database username and password, then comment out the liveness and readiness probe in <code>deployment.yaml</code>.</p>
<p>Next, we’ll create a service to connect to the database. To do that, navigate to the <code>mysql.yaml</code> file, then for <code>externalName</code>. Navigate to the RDS service page on the AWS console and copy the database endpoint.</p>
<p>Now, in the <code>deployment.yaml</code> file, paste in the following configuration:</p>
<pre><code class="language-dockerfile">          env:
            - name: SPRING_DATASOURCE_URL
              value: jdbc:mysql://spring-mysql:3306/{{ .Values.mysql.databaseName }}?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useUnicode=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true
            - name: SPRING_DATASOURCE_USERNAME
              valueFrom:
                secretKeyRef:
                  name: mysql-username
                  key: username
            - name: SPRING_DATASOURCE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-root-password
                  key: password
</code></pre>
<p>You have successfully created environment variables to secure your database credentials.</p>
<p>In the <code>ingress.yaml</code> file, paste in the following configuration:</p>
<pre><code class="language-dockerfile">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: "spring-microservice-ingress"
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/load-balancer-name: spring-alb-test
  labels:
    app: spring-microservice
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ include "ytchart.fullname" . }}
                port:
                  number: 8080
</code></pre>
<p>This is your configuration for the ingress service.</p>
<p>Run the command below to see all your configuration values:</p>
<pre><code class="language-shell">helm template ytchart/
</code></pre>
<p>Next, run the command below to deploy the chart:</p>
<pre><code class="language-shell">helm install mychart ytchart
</code></pre>
<p>You should see an output like this on your terminal:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/4178105c-f469-4bd6-94f0-58046d12c080.png" alt="helm chart image" style="display:block;margin:0 auto" width="970" height="398" loading="lazy">

<p>Now, when you run kubectl get all, you should see this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/62754329317fc95a74ca62a8/23e91fff-6368-42a1-adda-1af45616e9ef.png" alt="deployment image" style="display:block;margin:0 auto" width="576" height="127" loading="lazy">

<p>Now, navigate to EC2 &gt; Load balancers, copy the DNS name, and enter it into a browser. You should see the “up” text. This indicates that your application is working properly.</p>
<p>Now, when you call the API using the DNS URL as such:</p>
<pre><code class="language-shell">http://spring-alb-test-260424558.us-east-1.elb.amazonaws.com/addExchangeRate
</code></pre>
<p>You should get a 200 OK response. Congratulations, you have successfully deployed a SpringBoot app in Kubernetes!</p>
<h3 id="heading-step-13-delete-cluster">Step 13: Delete Cluster</h3>
<p>If you’re familiar with AWS and the cloud, you should already be aware of how costly it can be to leave resources running for extended periods, especially when you’re not using them actively.</p>
<p>Now that we've come to the end of this tutorial, it’s time to delete the resources.</p>
<p>These are the resources to delete:</p>
<ul>
<li><p>RDS database.</p>
</li>
<li><p>Cluster using the command eksctl delete cluster -f cluster.yaml.</p>
</li>
<li><p>Navigate to VPC and delete the NAT Gateway</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Deploying a Spring Boot application with MySQL on Amazon EKS involves a lot of moving parts, but each step builds logically on the last.</p>
<p>In this tutorial, you've gone from setting up a VPC and provisioning a managed database to containerizing your app, pushing it to ECR, and finally orchestrating everything with Kubernetes and an Application Load Balancer.</p>
<p>What you get is a production-grade setup with high availability, private networking, secure credential management, and auto-scaling built in. This is the kind of infrastructure that would take significant manual effort to replicate without managed services like EKS and RDS.</p>
<p>As a next step, consider adding HTTPS support via AWS Certificate Manager, setting up horizontal pod autoscaling, or integrating a CI/CD pipeline to automate future deployments. And remember to clean up your AWS resources when you're done experimenting. Your wallet will thank you.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The LLM Gateway Pattern: Why Every Kubernetes-Based AI App Needs One ]]>
                </title>
                <description>
                    <![CDATA[ You ship your first LLM-powered feature. It works and the users love it. A second team adds another feature calling a different model, and a third integrates a completely different provider. Six month ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-llm-gateway-pattern-why-every-kubernetes-based-ai-app-needs-one/</link>
                <guid isPermaLink="false">6a20607178a43e3153ae3cc4</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ development ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Temitope Oyedele ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 17:12:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/35be7043-56b7-4df6-b56b-a48620be2dd8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You ship your first LLM-powered feature. It works and the users love it. A second team adds another feature calling a different model, and a third integrates a completely different provider.</p>
<p>Six months later, you have fourteen microservices, each holding their own API keys, writing their own retry logic, and failing in their own unique ways.</p>
<p>Nobody knows how much you're spending on tokens or which service is hammering the rate limit. And when OpenAI goes down, everything goes down with it.</p>
<p>That scenario plays out across engineering teams every single day, and the root cause is almost always the same: moving fast with LLMs while skipping the infrastructure thinking that holds everything together at scale.</p>
<p>Fortunately, a well-established architectural pattern solves exactly these problems. If you already run Kubernetes, you're more than halfway to implementing it. That pattern is called the LLM Gateway Pattern, and this article walks you through what it is, why it matters, and how to put it into practice.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-the-llm-gateway-pattern">What Is the LLM Gateway Pattern?</a></p>
<ul>
<li><a href="#heading-how-it-works">How It Works</a></li>
</ul>
</li>
<li><p><a href="#heading-the-problem-without-a-gateway">The Problem Without a Gateway</a></p>
</li>
<li><p><a href="#heading-deploying-an-llm-gateway-on-kubernetes">Deploying an LLM Gateway on Kubernetes</a></p>
<ul>
<li><p><a href="#heading-storing-api-keys-securely">Storing API Keys Securely</a></p>
</li>
<li><p><a href="#heading-defining-routing-rules-in-a-configmap">Defining Routing Rules in a ConfigMap</a></p>
</li>
<li><p><a href="#heading-scaling-the-gateway">Scaling the Gateway</a></p>
</li>
<li><p><a href="#heading-wiring-up-observability">Wiring Up Observability</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-features-of-an-llm-gateway">Features of an LLM Gateway</a></p>
<ul>
<li><p><a href="#heading-multi-provider-routing">Multi-Provider Routing</a></p>
</li>
<li><p><a href="#heading-semantic-caching">Semantic Caching</a></p>
</li>
<li><p><a href="#heading-rate-limiting-per-consumer">Rate Limiting Per Consumer</a></p>
</li>
<li><p><a href="#heading-fallback-and-failover">Fallback and Failover</a></p>
</li>
<li><p><a href="#heading-token-usage-tracking">Token Usage Tracking</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-is-the-llm-gateway-pattern">What Is the LLM Gateway Pattern?</h2>
<p>The LLM Gateway Pattern is an architectural approach where all LLM API traffic from your applications flows through a single, centralized proxy service before reaching any external provider. Think of it as the AI equivalent of an API gateway, except it's purpose-built for the unique challenges that come with language models: token budgets, streaming responses, model routing, semantic caching, and multi-provider fallback.</p>
<p>Instead of every service in your cluster talking directly to OpenAI or Anthropic, they all talk to one internal gateway. That gateway handles authentication, routing, rate limiting, logging, and failover. Your application services stay clean and focused on business logic, while the gateway takes on all the messy operational concerns of working with LLMs at scale.</p>
<p>The pattern itself is not new in concept. Engineers have used API gateways for years to manage REST traffic. What makes LLM gateways distinct is that they understand the specific shape of LLM requests, including token counts, model parameters, prompt structure, and streaming semantics.</p>
<h3 id="heading-how-it-works">How It Works</h3>
<p>The core components of an LLM Gateway on Kubernetes are straightforward. Here is the high-level flow:</p>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/2aaa42ed-d6b4-4a9e-9d4c-2faa42e76783.png" alt="Diagram showing how LLM Gateway works on Kubernetes" style="display:block;margin:0 auto" width="1162" height="718" loading="lazy">

<p><strong>App Pods</strong> send requests to the gateway using a standard OpenAI-compatible API format. Because of this, most existing LLM client libraries work without modification — you just change the base URL to point at your internal gateway service.</p>
<p><strong>The Gateway Service</strong> receives each incoming request, authenticates the caller, applies any configured rate limits, checks the cache, selects the appropriate upstream provider based on routing rules, and forwards the request. On the way back, it logs token usage and latency before returning the response to the caller.</p>
<p><strong>ConfigMap</strong> holds the routing rules. Which model should handle requests tagged as fast? Which provider should the system fall back to if the primary one is unavailable? All of this lives in configuration, not code, so you can update routing behaviour without redeploying anything.</p>
<p><strong>Secrets</strong> hold the actual API keys for each provider. The gateway is the only service in the cluster that needs access to them. Application pods never touch provider credentials directly.</p>
<p><strong>Provider endpoints</strong> are the actual LLM APIs: OpenAI, Anthropic, a self-hosted vLLM instance running in your cluster, or any other provider that exposes an OpenAI-compatible interface.</p>
<h2 id="heading-the-problem-without-a-gateway">The Problem Without a Gateway</h2>
<p>To appreciate why this pattern matters, it helps to look at what happens when you skip it.</p>
<h3 id="heading-1-scattered-secrets-and-no-central-control">1. Scattered Secrets and No Central Control</h3>
<p>Every service that calls an LLM needs an API key. In Kubernetes, this usually means creating a <a href="https://kubernetes.io/docs/concepts/configuration/secret/">Secret</a> per namespace or per deployment.</p>
<p>When that key rotates or gets compromised, you're hunting through dozens of manifests to update it. There's no single place to revoke access or audit who is calling what.</p>
<h3 id="heading-2-no-visibility-into-cost-or-usage">2. No Visibility into Cost or Usage</h3>
<p>LLM APIs charge per token. Without a centralized layer collecting usage data, you have no reliable way to know which service is responsible for that spike in your monthly bill.</p>
<h3 id="heading-3-provider-lock-in-at-the-application-level">3. Provider Lock-in at the Application Level</h3>
<p>When you hardcode <a href="https://api.openai.com">https://api.openai.com</a> into your service, switching to a different provider or routing certain requests to a cheaper model becomes a code change. You need to redeploy your application just to change which model handles a request type.</p>
<h3 id="heading-4-no-caching">4. No Caching</h3>
<p>Many LLM applications send semantically similar or identical prompts repeatedly. Without a shared caching layer, each one incurs full token costs and full latency. The savings from even basic caching can be significant.</p>
<p>All of these problems compound as your team grows and more services start calling LLMs. The gateway pattern cuts through all of them in one architectural decision.</p>
<h2 id="heading-deploying-an-llm-gateway-on-kubernetes">Deploying an LLM Gateway on Kubernetes</h2>
<p>There are several tools that can serve as an LLM gateway in a Kubernetes environment, including <a href="https://docs.litellm.ai/docs/simple_proxy">LiteLLM Proxy</a>, <a href="https://portkey.ai/">Portkey</a>, <a href="https://openrouter.ai/">OpenRouter</a>, and Envoy with custom filters.</p>
<p>For the rest of this walkthrough, we'll use LiteLLM Proxy. It ships with a Helm chart, supports over a hundred models across all major providers, and comes with a management UI that makes initial configuration straightforward.</p>
<h3 id="heading-storing-api-keys-securely">Storing API Keys Securely</h3>
<p>Start by creating a Kubernetes Secret that holds your provider API keys. Your gateway pods will consume these credentials as environment variables, which means no provider key ever needs to live inside your application containers:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: llm-gateway-secrets
  namespace: ai-platform
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-..."
  ANTHROPIC_API_KEY: "sk-ant-..."
</code></pre>
<h3 id="heading-defining-routing-rules-in-a-configmap">Defining Routing Rules in a <code>ConfigMap</code></h3>
<p>The routing configuration tells the gateway which models are available and how to reach each one. Keeping this in a <code>ConfigMap</code> means you can update your routing rules without touching a single line of application code:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-gateway-config
  namespace: ai-platform
data:
  config.yaml: |
    model_list:
      - model_name: gpt-4o
        litellm_params:
          model: openai/gpt-4o
          api_key: os.environ/OPENAI_API_KEY
      - model_name: claude-sonnet
        litellm_params:
          model: anthropic/claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
      - model_name: fast
        litellm_params:
          model: openai/gpt-4o-mini
          api_key: os.environ/OPENAI_API_KEY
</code></pre>
<p>With this configuration in place, any application in your cluster can reach the gateway at <a href="http://llm-gateway.ai-platform.svc.cluster.local">http://llm-gateway.ai-platform.svc.cluster.local</a> using the standard OpenAI client format, regardless of which actual provider sits behind it.</p>
<h3 id="heading-scaling-the-gateway">Scaling the Gateway</h3>
<p>Because the gateway is stateless, horizontal scaling is straightforward. You can attach a <code>HorizontalPodAutoscaler</code> to scale based on CPU utilization or request rate:</p>
<pre><code class="language-yaml">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gateway-hpa
  namespace: ai-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
</code></pre>
<h3 id="heading-wiring-up-observability">Wiring Up Observability</h3>
<p>A gateway you can't observe is a gateway you can't trust, so wiring up monitoring before you go to production is worth the extra hour it takes.</p>
<p>LiteLLM exposes a <code>/metrics</code> endpoint in Prometheus format. You can scrape it with a standard <code>ServiceMonitor</code> if you run the Prometheus Operator, or configure Prometheus directly to target the gateway service.</p>
<p>The metrics that matter most in day-to-day operations are token throughput per model, request latency percentiles, error rates per provider, and cache hit ratio.</p>
<p>Once Prometheus is collecting that data, you can build Grafana dashboards that show token spend broken down by caller, model, and time period. This gives engineering managers and finance teams the cost visibility they've been asking for, and it takes surprisingly little effort to set up once the metrics pipeline is in place.</p>
<p>If you run an OpenTelemetry collector in your cluster, you can also configure the gateway to emit trace spans for every LLM request. This lets you see the full latency breakdown from the moment a user action triggers a call in your application all the way through to the provider response. So when something is slow, you can tell immediately whether the bottleneck sits in your service, the gateway, or upstream with the provider.</p>
<h2 id="heading-features-of-an-llm-gateway">Features of an LLM Gateway</h2>
<p>Not all gateway implementations are equal, so as your needs grow, these are the core capabilities worth evaluating.</p>
<h3 id="heading-multi-provider-routing">Multi-Provider Routing</h3>
<p>A well-built gateway routes requests to different providers based on declarative, configurable rules that live entirely outside your application code. This means that changing a model never requires a redeployment.</p>
<h3 id="heading-semantic-caching">Semantic Caching</h3>
<p>Rather than only caching byte-for-byte identical prompts, a semantic cache uses embedding similarity to recognise when two different prompts are asking essentially the same thing. This can cut redundant API calls dramatically.</p>
<h3 id="heading-rate-limiting-per-consumer">Rate Limiting Per Consumer</h3>
<p>The gateway should let you set token budgets and request limits per team, per namespace, or per application, so no single runaway service can starve the rest of your cluster or drive up costs unchecked.</p>
<h3 id="heading-fallback-and-failover">Fallback and Failover</h3>
<p>When a primary provider fails or exceeds acceptable latency thresholds, the gateway should automatically retry against a configured fallback. This centralizes logic that is notoriously hard to get right inside individual services.</p>
<h3 id="heading-token-usage-tracking">Token Usage Tracking</h3>
<p>Every request should produce a detailed usage record capturing input tokens, output tokens, model, caller identity, and latency. This gives engineering managers the clear, actionable picture of AI spending they need.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The LLM Gateway Pattern solves a set of operational problems that every team building on language models at scale will eventually run into. Scattered secrets, invisible costs, inconsistent failure handling, and provider lock-in are all symptoms of the same underlying issue: infrastructure concerns leaking into services that shouldn't have to deal with them.</p>
<p>A centralized gateway on Kubernetes gives your application teams a stable, provider-agnostic interface while giving your platform team the visibility and controls they need to manage cost and reliability effectively. When a provider goes down in the middle of the night, your configured fallback kicks in automatically instead of someone waking up to a page.</p>
<p>Start with LiteLLM Proxy, wire up the Prometheus metrics, build a simple Grafana dashboard, and watch how quickly the pattern pays for itself. Once you have seen what centralized LLM traffic management looks like in practice, it becomes very hard to go back to doing it any other way.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Hybrid Cloud Platform with Google Cloud Services and On-Premise Kubernetes Infrastructure ]]>
                </title>
                <description>
                    <![CDATA[ In this article, you'll learn how to design and build a secure, scalable hybrid cloud platform that connects your on‑premises Kubernetes infrastructure to Google Cloud Platform. This allows on‑prem ap ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-hybrid-cloud-platform-with-google-cloud-services-and-on-premise-k8s-infra/</link>
                <guid isPermaLink="false">6a18c124782587548340fa90</guid>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cloud native ]]>
                    </category>
                
                    <category>
                        <![CDATA[ CNCF ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Hybrid Cloud ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Shubham Katara ]]>
                </dc:creator>
                <pubDate>Thu, 28 May 2026 22:26:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a86db163-f513-48bd-8194-18c6cb894615.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, you'll learn how to design and build a secure, scalable hybrid cloud platform that connects your on‑premises Kubernetes infrastructure to Google Cloud Platform. This allows on‑prem apps can consume cloud services (notably GPUs) without brittle long‑lived keys, manual credential management, or risky network patterns.</p>
<p>Who this is for:</p>
<ul>
<li><p>Platform engineers, SREs, and security-focused cloud architects who operate mixed on‑prem and cloud Kubernetes estates.</p>
</li>
<li><p>Teams that need scalable, auditable access from on‑prem workloads to GCP resources (especially GPU instances) while minimizing operational overhead and blast radius.</p>
</li>
</ul>
<p>What you’ll get from this guide:</p>
<ul>
<li><p>The motivation and economics behind a hybrid approach (why GPUs often push workloads to the cloud).</p>
</li>
<li><p>Common pitfalls with service account keys and how “accidental air gaps” occur in real environments.</p>
</li>
<li><p>A practical, end‑to‑end pattern that uses Workload Identity Federation to give on‑prem pods short‑lived, auditable access to GCP without embedding keys.</p>
</li>
</ul>
<p>What’s included:</p>
<ul>
<li><p>Conceptual explanations, security tradeoffs, and operational best practices.</p>
</li>
<li><p>Concrete examples and Kubernetes/Terraform artifacts (linked in the GitHub repo at the end of this article) so you can reproduce the setup in your environment.</p>
</li>
</ul>
<p>Read on for the theory, then follow the hands‑on sections to provision GCP resources, configure federation, enforce policies with CEL and Kyverno, and validate secure, scalable GPU access from your on‑prem Kubernetes clusters.</p>
<p><strong>Note:</strong> Kubernetes and Terraform artifacts are linked in the GitHub repo at the end of this article.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-why-hybrid-cloud-matters">Why Hybrid Cloud Matters</a></p>
</li>
<li><p><a href="#heading-the-economics-of-hybrid-gpus-changed-everything">The Economics of Hybrid: GPUs Changed Everything</a></p>
</li>
<li><p><a href="#heading-why-service-account-keys-fail-at-scale">Why Service Account Keys Fail at Scale</a></p>
</li>
<li><p><a href="#heading-how-the-accidental-air-gap-happens">How the Accidental Air Gap Happens</a></p>
</li>
<li><p><a href="#heading-how-workload-identity-federation-bridges-the-gap">How Workload Identity Federation Bridges the Gap</a></p>
</li>
<li><p><a href="#heading-how-kubernetes-identity-works">How Kubernetes Identity Works</a></p>
</li>
<li><p><a href="#heading-how-to-prepare-google-cloud-platform-resources">How to prepare Google Cloud Platform resources</a></p>
</li>
<li><p><a href="#heading-how-to-use-cel-for-fine-grained-access-control">How to Use CEL for Fine-Grained Access Control</a></p>
</li>
<li><p><a href="#heading-how-to-inject-credentials-automatically-with-kyverno">How to Inject Credentials Automatically with Kyverno</a></p>
</li>
<li><p><a href="#heading-how-to-grant-iam-permissions-to-federated-identities">How to Grant IAM Permissions to Federated Identities</a></p>
</li>
<li><p><a href="#heading-how-to-verify-the-setup">How to Verify the Setup</a></p>
</li>
<li><p><a href="#heading-how-to-connect-on-prem-apps-to-cloud-gpus">How to Connect On-Prem Apps to Cloud GPUs</a></p>
</li>
<li><p><a href="#heading-how-to-scale-gpu-access-with-cel-conditions">How to Scale GPU Access with CEL Conditions</a></p>
</li>
<li><p><a href="#heading-the-security-properties-compared">The Security Properties Compared</a></p>
</li>
<li><p><a href="#heading-the-complete-infrastructure-as-code-layout">The Complete Infrastructure as Code Layout</a></p>
</li>
<li><p><a href="#heading-how-to-run-a-proof-of-concept-with-vcluster">How to Run a Proof of Concept with vCluster</a></p>
</li>
<li><p><a href="#heading-common-issues-and-how-to-solve-them">Common Issues and How to Solve Them</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before following along, you'll need:</p>
<ul>
<li><p>A Kubernetes cluster that is <strong>not</strong> GKE (on-premises, bare-metal, or a virtual cluster)</p>
</li>
<li><p>A Google Cloud project with the following APIs enabled: IAM, Security Token Service (STS), and Workload Identity</p>
</li>
<li><p><a href="https://developer.hashicorp.com/terraform/install">Terraform</a> installed and configured</p>
</li>
<li><p><a href="https://kyverno.io/docs/installation/">Kyverno</a> installed in your cluster</p>
</li>
<li><p>Python 3 with <code>google-cloud-secret-manager</code> and <code>google-cloud-aiplatform</code> libraries (for the verification steps. Code available in the github repository.)</p>
</li>
<li><p><code>kubectl</code> access to your cluster</p>
</li>
</ul>
<h2 id="heading-why-hybrid-cloud-matters">Why Hybrid Cloud Matters</h2>
<p>If everything goes right, a hybrid cloud platform lets your on-premises and cloud workloads talk to each other as if they were part of the same network.</p>
<p>There are many practical reasons to run a hybrid cloud setup:</p>
<ul>
<li><p><strong>Offloading analytics to BigQuery:</strong> You keep your analytics apps on-prem for data sovereignty, but pipe large datasets into BigQuery for world-class processing power — without buying extra servers.</p>
</li>
<li><p><strong>Creating a unified network with Cloud Interconnect:</strong> Using Cloud Interconnect or Cloud VPN, your on-premises datacenter becomes an extension of the Google Cloud Platform (GCP) Virtual Private Cloud (VPC). Your on-prem invoice apps can talk to cloud-based user services with low latency and no public internet exposure.</p>
</li>
<li><p><strong>Cost-effective scalability via Cloud Storage:</strong> You can use cloud storage as a backend for local apps, storing logs, backups, and historical data while paying only for what you use.</p>
</li>
<li><p><strong>Event-driven syncing with Pub/Sub:</strong> When something happens on-prem, a message through Cloud Pub/Sub lets cloud services react instantly — no manual polling required.</p>
</li>
</ul>
<h2 id="heading-the-economics-of-hybrid-gpus-changed-everything">The Economics of Hybrid: GPUs Changed Everything</h2>
<p>Before diving into the technical problem, it's worth understanding why hybrid clouds matter more than ever.</p>
<p>Your organization, like most enterprises, has made significant investments in on-premises datacenters. Servers are bought. Racks are filled. Network infrastructure is paid for. The marginal cost of running one more workload is essentially zero.</p>
<p>Then came the AI wave.</p>
<p>Suddenly every team needs Graphics Processing Units (GPUs). Not one or two — dozens of A100s for training, fleets of inference endpoints, vector databases that need to sit close to the models. GPUs are scarce. Lead times for on-prem GPU hardware stretch into months. Cloud providers have them available in minutes.</p>
<p>The architecture that actually makes economic sense looks like this:</p>
<ul>
<li><p><strong>The on-prem datacenter handles the bulk of compute</strong> — web servers, business logic, databases, batch processing. This is commodity compute you've already paid for.</p>
</li>
<li><p><strong>The cloud handles what's scarce</strong> — GPU-accelerated inference, model training, AI/ML endpoints. You pay per request, scale on demand, and don't wait six months for hardware.</p>
</li>
</ul>
<p>The cloud isn't a full migration destination — it's an extension for capabilities you can't easily build on-prem.</p>
<p>But those on-prem workloads need to authenticate to cloud services. Every API call from the datacenter to a Vertex AI endpoint, every request to a GPU-powered inference service, every write to Cloud Storage for model artifacts — all of it needs credentials. That's the problem this article solves.</p>
<h2 id="heading-why-service-account-keys-fail-at-scale">Why Service Account Keys Fail at Scale</h2>
<p>Here's a scenario that plays out in thousands of enterprises daily.</p>
<p>A development team needs their on-prem application to write to Google Cloud Storage. The "obvious" solution? Generate a GCP service account key, base64 encode it, store it in a Kubernetes Secret, and mount it in the pod:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: gcp-credentials
type: Opaque
data:
  key.json: eyJ0eXBlIjoic2VydmljZV9hY2NvdW50IiwicHJvamVjdF9pZCI6…
</code></pre>
<p>This works. It also introduces serious problems:</p>
<ul>
<li><p><strong>Never expires.</strong> That key is valid until someone remembers to rotate it (they won't) or it gets compromised (it will).</p>
</li>
<li><p><strong>Can be exfiltrated trivially.</strong> Anyone with read access to that namespace can run <code>kubectl get secret -o yaml</code> and walk away with permanent GCP access.</p>
</li>
<li><p><strong>Has no audit trail for the actual workload.</strong> GCP sees "service-account-xyz accessed this bucket" — not "pod frontend-abc-123 in namespace production."</p>
</li>
<li><p><strong>Scales terribly.</strong> 50 teams × 3 environments × 4 GCP projects = 600 keys to track, rotate, and hope haven't been committed to git.</p>
</li>
</ul>
<p>Security teams know this. That's why many organizations have done the only sensible thing: they have disabled service account key generation entirely.</p>
<h2 id="heading-how-the-accidental-air-gap-happens">How the Accidental Air Gap Happens</h2>
<p>When you disable key generation, you haven't solved the hybrid cloud platform problem — you've just made it someone else's problem. That someone is usually a platform team staring at a Jira ticket that says "cannot access GCP from on-prem, P1, blocking release."</p>
<p>The result? Your "hybrid cloud platform" isn't hybrid at all. It's two disconnected systems.</p>
<p>Teams resort to building intermediary services, API gateways that proxy requests, or finding creative ways to get keys anyway. None of this is a platform. It's duct tape.</p>
<h2 id="heading-how-workload-identity-federation-bridges-the-gap">How Workload Identity Federation Bridges the Gap</h2>
<p>Every Kubernetes cluster already issues cryptographically signed identity tokens to every pod. And Google Cloud has a service specifically designed to trust those tokens.</p>
<p>This is <strong>Workload Identity Federation</strong> — and combined with OpenID Connect (OIDC), it's the missing piece that makes hybrid platforms actually work.</p>
<p>The service is quite well named because of the word Federation. it means GCP doesn't store your identity — it agrees to trust identities issued by another system, as long as they can be cryptographically verified. This all works with a very well orchestrated set of steps in the following order:</p>
<ol>
<li><p>Pod presents its Kubernetes-issued JWT to GCP's STS endpoint.</p>
</li>
<li><p>STS verifies the signature against your cluster's public JWKS.</p>
</li>
<li><p>STS checks the JWT's claims against the Workload Identity Pool's rules (audience, issuer, CEL conditions).</p>
</li>
<li><p>STS returns a short-lived Google access token (typically 1 hour) that the pod uses for API calls.</p>
</li>
</ol>
<p>It is also worth mentioning that Workload Identity Federation is not Kubernetes specific. It works with AWS IAM, Azure AD, GitHub Actions OIDC, and any OIDC-compliant identity provider.</p>
<h2 id="heading-how-kubernetes-identity-works">How Kubernetes Identity Works</h2>
<p>Every pod with a ServiceAccount gets a JSON Web Token (JWT) automatically mounted at <code>/run/secrets/kubernetes.io/serviceaccount/token</code>. This isn't just an opaque blob — it's a signed assertion of identity:</p>
<pre><code class="language-json">{
  "iss": "https://kubernetes.default.svc.cluster.local",
  "sub": "system:serviceaccount:production:backend-api",
  "aud": ["https://iam.googleapis.com/..."],
  "kubernetes.io": {
    "namespace": "production",
    "serviceaccount": {
      "name": "backend-api"
    }
  },
  "exp": 1735689600
}
</code></pre>
<p>In a JWT, claims are just the key-value pairs inside the token's payload — each one is a claim the issuer is making about the subject. Think of them as facts the token is asserting, signed cryptographically so the verifier can trust them.</p>
<p>The critical insight: this token is created by a set of JSON Web Key Set (JWKS) and is verifiable by anyone who has your cluster's public keys, exposed via the JSON Web Key Set (JWKS) endpoint:</p>
<pre><code class="language-bash">kubectl get --raw /openid/v1/jwks
</code></pre>
<p>Google Cloud's Security Token Service (STS) can validate these tokens. No keys are exchanged. No secrets are stored. Just cryptographic proof of identity.</p>
<h2 id="heading-how-to-prepare-google-cloud-platform-resources">How to Prepare Google Cloud Platform resources</h2>
<p>The Workload Identity Pool is a trust boundary — a declaration that says "I accept identities from external sources." The OIDC Provider configures how to validate those identities.</p>
<pre><code class="language-hcl">resource "google_iam_workload_identity_pool" "pool" {
  workload_identity_pool_id = "hybrid-platform-pool"
  project                   = "my-project"
}

resource "google_iam_workload_identity_pool_provider" "k8s_provider" {
  project                            = "my-project"
  workload_identity_pool_id          = google_iam_workload_identity_pool.pool.workload_identity_pool_id
  workload_identity_pool_provider_id = "on-prem-cluster"

  attribute_mapping = {
    "google.subject"      = "assertion.sub"
    "attribute.namespace" = "assertion['kubernetes.io']['namespace']"
  }

  attribute_condition = "attribute.namespace in [\"production\", \"staging\"]"

  oidc {
    issuer_uri = "https://kubernetes.default.svc.cluster.local"
    jwks_json  = file("jwks.json")  # Your cluster's public keys
  }
}
</code></pre>
<p>Two things to note here:</p>
<ol>
<li><p><code>attribute_mapping</code> extracts claims from the Kubernetes JWT and makes them available as GCP attributes. By using `assertion['kubernetes.io']['namespace']`, the namespace is pulled out so you can use it for access control.</p>
</li>
<li><p><code>attribute_condition</code> is where security policy lives. More on this in the next section.</p>
</li>
</ol>
<h2 id="heading-how-to-use-cel-for-fine-grained-access-control">How to Use CEL for Fine-Grained Access Control</h2>
<p>The <code>attribute_condition</code> field uses Common Expression Language (CEL). This single line of policy can replace dozens of Identity and Access Management (IAM) bindings:</p>
<pre><code class="language-plaintext">attribute.namespace in ["production", "staging"]
</code></pre>
<p>With this condition, a pod in the <code>kube-system</code> namespace cannot authenticate to GCP at all — the token exchange is rejected before IAM is even consulted.</p>
<p>You can get more sophisticated:</p>
<pre><code class="language-plaintext">// Only production namespace, and only specific service accounts
attribute.namespace == "production" &amp;&amp;
  attribute.service_account in ["payment-processor", "order-service"]

// Allow staging, but only during business hours
attribute.namespace == "staging" &amp;&amp;
  request.time.getHours("America/New_York") &gt;= 9 &amp;&amp;
  request.time.getHours("America/New_York") &lt; 17
</code></pre>
<p>This is defense in depth. Even if someone creates a rogue ServiceAccount or has <code>kubectl</code> access, they cannot authenticate to GCP unless the CEL condition passes. The security boundary is enforced by Google's infrastructure, not by hoping developers follow policy.</p>
<h2 id="heading-how-to-inject-credentials-automatically-with-kyverno">How to Inject Credentials Automatically with Kyverno</h2>
<p>Having a working identity federation is only half the battle. Your customers and developers shouldn't need to understand OIDC, STS, or credential configuration files. They should deploy their app and have it work.</p>
<p>Before we get to the automation, it's worth pausing on what a <em>credential configuration file</em> actually is — because the name is a little misleading.</p>
<p>A credential configuration file (sometimes called an "external account config" or "ADC config") is a small JSON document that tells Google's client libraries <strong>how to obtain</strong> a credential at runtime. It is <strong>not</strong> itself a credential. You'll see the actual file later in this article — it contains no secrets. Just metadata: the Workload Identity Pool audience, the STS token-exchange endpoint, the source token type, and the path on the pod's filesystem where the real (short-lived) Kubernetes ServiceAccount token lives.</p>
<p>Compare that to a traditional service account key:</p>
<table>
<thead>
<tr>
<th></th>
<th>Service Account Key (<code>key.json</code>)</th>
<th>Credential Config (<code>credential-configuration.json</code>)</th>
</tr>
</thead>
<tbody><tr>
<td>What's inside the file</td>
<td>An RSA private key that <em>is</em> the credential</td>
<td>Instructions for exchanging an external token</td>
</tr>
<tr>
<td>Lifetime of the secret material</td>
<td>Forever, until manually rotated</td>
<td>Source token rotates automatically (~1h TTL)</td>
</tr>
<tr>
<td>If the file leaks</td>
<td>Long-lived access to a GCP service account</td>
<td>Useless on its own — points to a token only the pod can read</td>
</tr>
<tr>
<td>Identity model</td>
<td>Impersonates a GCP service account directly</td>
<td>Federates an external identity into GCP via STS</td>
</tr>
<tr>
<td>Who handles rotation</td>
<td>A human (or no one)</td>
<td>The Kubernetes API server, transparently</td>
</tr>
</tbody></table>
<p>Both files end up referenced by <code>GOOGLE_APPLICATION_CREDENTIALS</code> and look interchangeable from the application's point of view — but only one of them is dangerous to lose. The credential config file is safe to ship in a ConfigMap precisely because there's nothing to steal.</p>
<p>Having this file in the ConfigMap is half the solution. It actually needs to end up in the workload pods that need access to GCP services. This is where Kyverno comes in. A single ClusterPolicy automatically injects everything a pod needs:</p>
<pre><code class="language-yaml">apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: workload-identity-federation
spec:
  rules:
    - name: inject-gcp-credentials
      match:
        any:
          - resources:
              kinds:
                - Deployment
              selector:
                matchLabels:
                  workload-identity-federation: "enabled"
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                volumes:
                  - name: workload-identity-credential-configuration
                    configMap:
                      name: workload-identity-federation-config
                containers:
                  - (name): "*"
                    volumeMounts:
                      - name: workload-identity-credential-configuration
                        mountPath: /etc/workload-identity
                        readOnly: true
                    env:
                      - name: GOOGLE_APPLICATION_CREDENTIALS
                        value: "/etc/workload-identity/credential-configuration.json"
</code></pre>
<p>The above cluster policy does three things:</p>
<ol>
<li><p>Mounts the configmap inside the containers in the deployment at <code>/etc/workload-identity</code>.</p>
</li>
<li><p>Injects an environment variable called <code>GOOGLE_APPLICATION_CREDENTIALS</code> that points to the absolute path of the credential config file.</p>
</li>
</ol>
<p>From a developer's perspective, this is their entire integration:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    workload-identity-federation: "enabled" # That's it.
spec:
  # ... normal deployment spec
</code></pre>
<p>The credential configuration file (created by Terraform as a ConfigMap) tells Google's client libraries how to exchange tokens:</p>
<pre><code class="language-json">{
  "type": "external_account",
  "audience": "//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/providers/PROVIDER_ID",
  "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
  "token_url": "https://sts.googleapis.com/v1/token",
  "credential_source": {
    "file": "/run/secrets/kubernetes.io/serviceaccount/token"
  }
}
</code></pre>
<p>This JSON file is a credential configuration for Google's Workload Identity Federation. It instructs Google Cloud client libraries to obtain cloud access tokens by exchanging a Kubernetes ServiceAccount token (located at <code>/run/secrets/kubernetes.io/serviceaccount/token</code>) for a Google Cloud access token, using an external identity provider configured via a Workload Identity Pool. This allows workloads running outside of GCP, such as on-premises Kubernetes clusters, to authenticate to Google Cloud services without needing to manage long-lived service account keys.</p>
<p>Every Google Cloud SDK and client library understands this format. Python, Go, Java, and Node.js all just work.</p>
<h2 id="heading-how-to-grant-iam-permissions-to-federated-identities">How to Grant IAM Permissions to Federated Identities</h2>
<p>The service account token that has been trusted by the STS service, also known as a federated identity, need permissions to access resources. You bind IAM roles to the identity pool attributes:</p>
<pre><code class="language-hcl">resource "google_project_iam_member" "secret_access" {
  for_each = toset(["production", "staging"])
  project  = "my-project"
  role     = "roles/secretmanager.secretAccessor"
  member   = "principalSet://iam.googleapis.com/projects/\({PROJECT_NUMBER}/locations/global/workloadIdentityPools/\){POOL_ID}/attribute.namespace/${each.value}"
}
</code></pre>
<p>This grants Secret Manager access to all pods authenticated from the <code>production</code> or <code>staging</code> namespaces. The <code>principalSet</code> syntax allows matching on attributes. You can also restrict to specific service accounts:</p>
<pre><code class="language-plaintext">member = "principal://iam.googleapis.com/.../subject/system:serviceaccount:production:payment-processor"
</code></pre>
<h2 id="heading-how-to-verify-the-setup">How to Verify the Setup</h2>
<p>You can verify the setup with a simple Python script that lists secrets from Secret Manager. This runs inside a pod on your on-premises cluster:</p>
<pre><code class="language-python"># list_secrets.py - running on-prem, accessing GCP Secret Manager
from google.cloud import secretmanager

def list_secrets(project_id: str):
    """
    List all secrets in a GCP project.

    No credentials are passed explicitly. The google-cloud-secret-manager
    library automatically:
    1. Reads GOOGLE_APPLICATION_CREDENTIALS env var (set by Kyverno)
    2. Loads the credential configuration JSON
    3. Reads the K8s ServiceAccount token from /run/secrets/...
    4. Exchanges it for a GCP access token via STS
    5. Uses that token to call the Secret Manager API
    """
    client = secretmanager.SecretManagerServiceClient()
    parent = f"projects/{project_id}"

    print(f"Secrets in {project_id}:")
    print("-" * 40)

    for secret in client.list_secrets(request={"parent": parent}):
        secret_name = secret.name.split("/")[-1]
        print(f"  - {secret_name}")

    print("-" * 40)
    print("Authentication: Workload Identity Federation")
    print("Credentials: None stored, token exchanged at runtime")

if __name__ == "__main__":
    list_secrets("my-project-id")
</code></pre>
<p>Run this inside your labeled pod:</p>
<pre><code class="language-bash">$ kubectl exec -it my-app-xyz -- python list_secrets.py

Secrets in my-project-id:
----------------------------------------
  - database-password
  - api-key-stripe
  - oauth-client-secret
  - ml-model-api-key
----------------------------------------
Authentication: Workload Identity Federation
Credentials: None stored, token exchanged at runtime
</code></pre>
<p>No service account key. No secret mounted. Just a Kubernetes ServiceAccount token exchanged for GCP credentials at runtime.</p>
<p>This same pattern works for any GCP service — Secret Manager, Cloud Storage, BigQuery, Pub/Sub, and Vertex AI.</p>
<h2 id="heading-how-to-connect-on-prem-apps-to-cloud-gpus">How to Connect On-Prem Apps to Cloud GPUs</h2>
<p>Consider a typical flow: an on-prem order processing service needs to call a Vertex AI endpoint for fraud detection. The model runs on GPUs in Google Cloud (you can spin up A100s in minutes, not months). The application logic stays on-prem (you've already paid for that compute).</p>
<p>With the IAM bindings in place, any pod in the allowed namespaces can call Vertex AI:</p>
<pre><code class="language-python"># fraud_detector.py - running on-prem, calling cloud GPUs
from google.cloud import aiplatform

def check_fraud(transaction: dict) -&gt; float:
    """
    Call a Vertex AI endpoint for fraud detection.

    The model runs on A100 GPUs in Google Cloud.
    This code runs on-prem in the datacenter.

    Authentication is automatic:
    1. Kyverno injected GOOGLE_APPLICATION_CREDENTIALS
    2. The aiplatform SDK reads the credential config
    3. K8s SA token is exchanged for GCP token via STS
    4. Request is authenticated to Vertex AI
    """
    endpoint = aiplatform.Endpoint(
        endpoint_name="projects/my-project/locations/us-central1/endpoints/fraud-model"
    )
    prediction = endpoint.predict(instances=[transaction])
    return prediction.predictions[0]["fraud_score"]


def generate_embeddings(texts: list[str]) -&gt; list[list[float]]:
    """
    Generate text embeddings using a cloud-hosted model.

    Embedding models are GPU-intensive. Running them on-prem
    would require dedicated hardware. In the cloud, you pay per request.
    """
    from vertexai.language_models import TextEmbeddingModel

    model = TextEmbeddingModel.from_pretrained("text-embedding-004")
    embeddings = model.get_embeddings(texts)
    return [e.values for e in embeddings]
</code></pre>
<p>The developer doesn't think about authentication at all. They add the label to their deployment, and their on-prem pod can call:</p>
<ul>
<li><p><strong>Vertex AI endpoints</strong> for ML inference on cloud GPUs</p>
</li>
<li><p><strong>Cloud Storage</strong> for model artifacts and training data</p>
</li>
<li><p><strong>BigQuery</strong> for feature stores and analytics</p>
</li>
<li><p><strong>Pub/Sub</strong> for event streaming between environments</p>
</li>
<li><p><strong>Secret Manager</strong> for API keys and configuration</p>
</li>
</ul>
<p>This is the hybrid platform working as intended.</p>
<h2 id="heading-how-to-scale-gpu-access-with-cel-conditions">How to Scale GPU Access with CEL Conditions</h2>
<p>CEL conditions become especially powerful when you want to restrict GPU access to specific namespaces. For example, to allow only ML-related namespaces to access Vertex AI:</p>
<pre><code class="language-plaintext">attribute.namespace in ["ml-inference", "ml-training", "data-science"] &amp;&amp;
  attribute.service_account.startsWith("ml-")
</code></pre>
<p>You can also grant different access levels per namespace:</p>
<pre><code class="language-hcl"># ML inference namespace gets prediction access
resource "google_project_iam_member" "ml_inference" {
  project = "my-project"
  role    = "roles/aiplatform.user"
  member  = "principalSet://iam.googleapis.com/.../attribute.namespace/ml-inference"
}

# Data science namespace gets full Vertex AI access (for experimentation)
resource "google_project_iam_member" "data_science" {
  project = "my-project"
  role    = "roles/aiplatform.admin"
  member  = "principalSet://iam.googleapis.com/.../attribute.namespace/data-science"
}
</code></pre>
<p>The on-prem application teams don't need to know or care about GCP IAM. They deploy to the right namespace, add a label, and the platform handles the rest.</p>
<h2 id="heading-the-security-properties-compared">The Security Properties Compared</h2>
<p>Here's a side-by-side comparison of the two authentication approaches:</p>
<table>
<thead>
<tr>
<th>Property</th>
<th>Service Account Keys</th>
<th>Workload Identity Federation</th>
</tr>
</thead>
<tbody><tr>
<td>Credential lifetime</td>
<td>Until manually rotated (often years)</td>
<td>Short-lived (1 hour for GCP tokens)</td>
</tr>
<tr>
<td>Exfiltration risk</td>
<td>High — static key can be copied anywhere</td>
<td>Low — token expires quickly</td>
</tr>
<tr>
<td>Audit trail</td>
<td>Service account name only</td>
<td>Namespace + service account name</td>
</tr>
<tr>
<td>Key management overhead</td>
<td>600+ keys at scale</td>
<td>Zero keys to manage</td>
</tr>
<tr>
<td>Security policy enforcement</td>
<td>Manual / trust-based</td>
<td>Enforced by GCP infrastructure via CEL</td>
</tr>
<tr>
<td>Developer experience</td>
<td>Copy key, create secret, mount volume</td>
<td>Add one label to the deployment</td>
</tr>
</tbody></table>
<p>The short-lived nature of tokens deserves emphasis. Even in a worst-case scenario where a token is somehow exfiltrated, it expires. Kubernetes ServiceAccount tokens have a configurable lifetime, and the GCP access tokens issued by STS are valid for one hour. A service account key, by contrast, remains valid until someone explicitly rotates it — often years.</p>
<h2 id="heading-the-complete-infrastructure-as-code-layout">The Complete Infrastructure as Code Layout</h2>
<p>The entire solution is codified in Terraform, managing both GCP and Kubernetes resources:</p>
<pre><code class="language-plaintext">workload-identity-federation/
├── providers.tf      # Google + Kubernetes providers
├── locals.tf         # Configuration (namespaces, project ID, etc.)
├── gcp.tf            # Identity pool, provider, IAM bindings
└── kubernetes.tf     # ConfigMap with credential configuration
</code></pre>
<p>A single <code>terraform apply</code>:</p>
<ol>
<li><p>Creates the Workload Identity Pool in GCP</p>
</li>
<li><p>Configures the OIDC provider with your cluster's JWKS</p>
</li>
<li><p>Sets up IAM bindings for allowed namespaces</p>
</li>
<li><p>Creates ConfigMaps in each namespace with the credential configuration</p>
</li>
</ol>
<p>Combined with the Kyverno policy, you get a fully automated pipeline:</p>
<pre><code class="language-plaintext">New namespace added to allowed list
        │
        ▼
Terraform creates ConfigMap in that namespace
        │
        ▼
Developer deploys with label
        │
        ▼
Kyverno injects credentials automatically
        │
        ▼
Pod authenticates to GCP via OIDC
        │
        ▼
Application accesses GCP services
</code></pre>
<p>No tickets. No key requests. No secrets to manage.</p>
<h2 id="heading-how-to-run-a-proof-of-concept-with-vcluster">How to Run a Proof of Concept with vCluster</h2>
<p>To validate this works outside GKE, you can set up a demonstration using <a href="https://www.vcluster.com/">vCluster</a> — a virtual Kubernetes cluster that runs inside another Kubernetes cluster. This proves the solution works for any cluster. You can setup vCluster in Docker using <a href="https://github.com/loft-sh/vind/blob/main/docs/getting-started.md">vind</a></p>
<pre><code class="language-yaml"># vcluster.yaml
experimental:
  docker:
    nodes:
      - name: worker-1
      - name: worker-2
deploy:
  cni:
    flannel:
      enabled: true
controlPlane:
  distro:
    k8s:
      version: "v1.35.0"
</code></pre>
<pre><code class="language-shell">[root@localhost #] vcluster create hybrid --driver docker -f vcluster.yaml
[root@localhost #] kubectl get nodes
hybrid-control-plane   Ready    control-plane   14d   v1.34.0   192.168.107.2   &lt;none&gt;        Debian GNU/Linux 12 (bookworm)   7.0.5-orbstack-00330-ge3df4e19b0a0-dirty   containerd://2.1.3
hybrid-worker          Ready    &lt;none&gt;          14d   v1.34.0   192.168.107.3   &lt;none&gt;        Debian GNU/Linux 12 (bookworm)   7.0.5-orbstack-00330-ge3df4e19b0a0-dirty   containerd://2.1.3
hybrid-worker2         Ready    &lt;none&gt;          14d   v1.34.0   192.168.107.4   &lt;none&gt;        Debian GNU/Linux 12 (bookworm)   7.0.5-orbstack-00330-ge3df4e19b0a0-dirty   containerd://2.1.3
</code></pre>
<p>Inside the vCluster, deploy a simple test deployment:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: gcp-test
  labels:
    workload-identity-federation: "enabled"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gcp-test
  template:
    metadata:
      labels:
        app: gcp-test
    spec:
      containers:
        - name: test
          image: google/cloud-sdk:slim
          command: ["sleep", "infinity"]
</code></pre>
<p>Exec into the pod and verify:</p>
<pre><code class="language-bash">$ kubectl exec -it gcp-test-xxx -- bash

# Inside the pod:
\( gcloud auth login --cred-file=\)GOOGLE_APPLICATION_CREDENTIALS
Authenticated with external account credentials for: [principal://iam.googleapis.com/...]

$ gcloud secrets list --project=my-project
NAME                 CREATED
database-password    2024-01-15T10:30:00Z
api-key              2024-01-14T09:15:00Z
</code></pre>
<p>No keys. No secrets mounted. Just identity federation working as designed.</p>
<h2 id="heading-common-issues-and-how-to-solve-them">Common Issues and How to Solve Them</h2>
<h3 id="heading-how-to-handle-jwks-retrieval-for-air-gapped-clusters">How to Handle JWKS Retrieval for Air-Gapped Clusters</h3>
<p>If your cluster's OIDC discovery endpoint isn't publicly reachable (most on-prem clusters aren't), you need to manually export the JWKS and upload it to GCP:</p>
<pre><code class="language-bash">kubectl get --raw /openid/v1/jwks &gt; jwks.json
</code></pre>
<p>This file must be updated if the cluster's signing keys rotate. Set up a periodic job that checks for key changes and updates the Terraform configuration.</p>
<h3 id="heading-how-to-fix-issuer-url-mismatches">How to Fix Issuer URL Mismatches</h3>
<p>The <code>iss</code> claim in the Kubernetes token must exactly match the issuer URL configured in the OIDC provider. For clusters using internal DNS:</p>
<pre><code class="language-plaintext">issuer_uri = "https://kubernetes.default.svc.cluster.local"
</code></pre>
<p>This URL doesn't need to be reachable from GCP — the JWKS file provides the validation keys. But it must match what's in the token exactly.</p>
<h3 id="heading-how-to-debug-token-exchange-failures">How to Debug Token Exchange Failures</h3>
<p>When authentication fails, the error messages can be cryptic. Common causes and fixes:</p>
<table>
<thead>
<tr>
<th>Error</th>
<th>Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td><code>invalid_grant</code></td>
<td>Issuer URL mismatch</td>
<td>Check <code>iss</code> claim in JWT against configured <code>issuer_uri</code></td>
</tr>
<tr>
<td><code>audience mismatch</code></td>
<td>Wrong <code>audience</code> in credential config</td>
<td>Regenerate the credential configuration JSON via Terraform</td>
</tr>
<tr>
<td><code>CEL condition failed</code></td>
<td>Namespace not in allowed list</td>
<td>Add namespace to <code>attribute_condition</code> and re-apply</td>
</tr>
<tr>
<td><code>JWKS validation failed</code></td>
<td>Signing keys have rotated</td>
<td>Re-export JWKS and update Terraform config</td>
</tr>
</tbody></table>
<h2 id="heading-conclusion">Conclusion</h2>
<p>After implementing this setup, on-premises workloads authenticate to Google Cloud exactly like GKE workloads do — without a single long-lived credential. The security team is happy (no keys to audit), developers are happy (just add a label), and the platform team is happy (no more credential management tickets).</p>
<p>Here's what you accomplished in this tutorial:</p>
<ol>
<li><p>/Understood why service account keys fail at scale and the security risks they introduce</p>
</li>
<li><p>Created a Workload Identity Pool and OIDC provider in GCP to trust your cluster's token issuer</p>
</li>
<li><p>Used CEL conditions to enforce fine-grained, namespace-level access policies</p>
</li>
<li><p>Automated credential injection into pods using a Kyverno ClusterPolicy</p>
</li>
<li><p>Bound IAM roles to federated identity attributes — no long-lived keys anywhere</p>
</li>
<li><p>Verified the setup by calling GCP APIs (Secret Manager, Vertex AI) from an on-prem pod</p>
</li>
<li><p>Proved the solution works on any Kubernetes cluster using vCluster</p>
</li>
</ol>
<p>The technologies used here aren't new. OIDC has been in Kubernetes since version 1.20. Workload Identity Federation has been in GCP for years. Kyverno and Terraform are mature tools. What this tutorial puts together is an end-to-end solution that developers can adopt with minimal effort.</p>
<p>If your organization has disabled service account keys (or should), this is the path forward. Your on-prem and cloud clusters can finally be what they were always meant to be: secure extensions of each other.</p>
<p><em>The complete implementation is available as a Terraform module with Kyverno policies:</em> <a href="https://github.com/shkatara/hybrid-platform-gcp-workload-identity-federation"><em>github.com/shkatara/hybrid-platform-gcp-workload-identity-federation</em></a></p>
<p>If this helps, you can follow me on <a href="https://www.linkedin.com/in/shubhamkatara/">https://www.linkedin.com/in/shubhamkatara/</a>, <a href="https://www.youtube.com/@kubesimplify">https://www.youtube.com/@kubesimplify</a>, <a href="https://www.linkedin.com/company/kubesimplify/">https://www.linkedin.com/company/kubesimplify/</a> and</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Avoid Rebuilding Infrastructure for Every New Project ]]>
                </title>
                <description>
                    <![CDATA[ Every production engineering team knows the pattern. A new project begins with energy. Product goals are clear. Deadlines are ambitious. Teams want to move quickly and deliver something customers can  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-avoid-rebuilding-infrastructure-for-every-new-project/</link>
                <guid isPermaLink="false">6a0f78aad8e265f60d5f7b56</guid>
                
                    <category>
                        <![CDATA[ PaaS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Thu, 21 May 2026 21:27:06 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/6c414744-42af-430a-8bbd-76a33b564e4b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every production engineering team knows the pattern. A new project begins with energy. Product goals are clear. Deadlines are ambitious. Teams want to move quickly and deliver something customers can use.</p>
<p>Then the real work starts. Infrastructure must be provisioned. CI/CD pipelines need to be set up. Secrets require management. Monitoring needs wiring. Databases need deployment. Logging needs configuration. Security policies need implementation. Networking rules need review.</p>
<p>Weeks disappear before users see anything useful. Many organizations treat this as normal. They call it engineering rigour. They assume this operational setup phase is simply part of software development.</p>
<p>It is not.</p>
<p>For teams already running production systems, rebuilding infrastructure foundations for every new project is organizational waste. It is repetitive operational labour disguised as an engineering discipline.</p>
<p>The uncomfortable question is not, “How can we do this setup faster?” The real question is: why are we still doing it ourselves at all?</p>
<p>This is where <a href="https://www.freecodecamp.org/news/from-metrics-to-meaning-how-paas-helps-developers-understand-production/">Platform as a Service</a> changes the conversation. A good PaaS shifts the starting point from “rebuild the foundations” to “start shipping. Because new projects should begin closer to customer value, not closer to infrastructure assembly.</p>
<p>In this article, we'll look at why many production teams waste time rebuilding the same infrastructure for every new project, how PaaS helps remove that work, and why engineering teams should question if managing complex infrastructure still makes sense for most projects.</p>
<h2 id="heading-what-well-cover">What We'll Cover:</h2>
<ul>
<li><p><a href="#heading-most-teams-were-not-hired-to-build-infrastructure">Most Teams Were Not Hired to Build Infrastructure</a></p>
</li>
<li><p><a href="#heading-aws-primitives-are-not-a-competitive-advantage">AWS Primitives Are Not a Competitive Advantage</a></p>
</li>
<li><p><a href="#heading-most-teams-should-not-be-managing-kubernetes">Most Teams Should Not Be Managing Kubernetes</a></p>
</li>
<li><p><a href="#heading-paas-changes-the-starting-point">PaaS Changes the Starting Point</a></p>
</li>
<li><p><a href="#heading-repetition-creates-hidden-organizational-waste">Repetition Creates Hidden Organizational Waste</a></p>
</li>
<li><p><a href="#heading-standardization-is-usually-faster-than-flexibility">Standardization Is Usually Faster Than Flexibility</a></p>
</li>
<li><p><a href="#heading-platform-teams-become-multipliers">Platform Teams Become Multipliers</a></p>
</li>
<li><p><a href="#heading-easier-starts-create-more-innovation">Easier Starts Create More Innovation</a></p>
</li>
<li><p><a href="#heading-when-specialized-control-actually-matters">When Specialized Control Actually Matters</a></p>
</li>
<li><p><a href="#heading-starting-from-zero-is-a-process-failure">Starting From Zero Is a Process Failure</a></p>
</li>
</ul>
<h2 id="heading-most-teams-were-not-hired-to-build-infrastructure"><strong>Most Teams Were Not Hired to Build Infrastructure</strong></h2>
<p>Software teams exist to solve business problems. Customers do not care whether Kubernetes manifests were structured elegantly. They do not admire carefully designed Terraform modules. They do not celebrate handcrafted networking policies.</p>
<p>Customers care about outcomes. They care about faster onboarding. Better recommendations. Smoother payments. Fewer bugs. Simpler workflows.</p>
<p>Yet many engineering organizations spend huge portions of time doing work customers never see.</p>
<p>Teams repeatedly create deployment pipelines. Configure environments. Manage certificates. Set up observability stacks. Tune infrastructure rules. Assemble cloud primitives.</p>
<p>Infrastructure matters. Reliability matters. Security matters.</p>
<p>The problem is duplication. If every project independently recreates the same operational systems, organizations keep rebuilding internal platforms over and over again without admitting it.</p>
<p>This behaviour has become so normalized that teams barely notice it anymore. But rebuilding the same foundation repeatedly is not operational maturity. It is inefficiency scaled across the organization.</p>
<h2 id="heading-aws-primitives-are-not-a-competitive-advantage"><strong>AWS Primitives Are Not a Competitive Advantage</strong></h2>
<p>Many teams confuse cloud ownership with strategic advantage. Owning Kubernetes clusters does not create differentiation. Managing IAM rules does not create customer value. Writing infrastructure glue code does not strengthen market position.</p>
<p>These are implementation details. Yet many organizations spend extraordinary energy managing them as if they are core business assets.</p>
<p>Some teams effectively become part-time infrastructure companies without realizing it. Their engineers slowly accumulate operational responsibilities until maintaining systems consumes more effort than delivering products.</p>
<p>The outcome becomes predictable. Infrastructure expands. Operational complexity grows. Delivery speed declines. Nobody notices because the pain arrives gradually.</p>
<p>A team starts with one Kubernetes cluster. Then another environment appears. More deployment pipelines emerge. Additional tooling gets layered on top. Logging systems become fragmented. Monitoring evolves differently across products.</p>
<p>Eventually, teams spend increasing amounts of time maintaining systems they never intended to own.</p>
<p>Infrastructure ownership is often not a strategy. It is inertia.</p>
<h2 id="heading-most-teams-should-not-be-managing-kubernetes"><strong>Most Teams Should Not Be Managing Kubernetes</strong></h2>
<p><a href="https://www.freecodecamp.org/news/what-does-k8s-mean-kubernetes-setup-guide/">Kubernetes</a> has become an engineering culture. It appears in architecture diagrams, conference talks, hiring requirements, and internal roadmaps. Its adoption often feels inevitable.</p>
<p>But normalization and necessity are not the same thing. Many organizations adopted Kubernetes because industry momentum made it seem like the default path. Not because they had workloads that required its complexity. But the result is predictable.</p>
<p>Small and medium teams end up managing orchestration systems designed for massive operational environments.</p>
<p>They maintain YAML configurations, networking layers, ingress systems, deployment strategies, and operational tooling stacks before delivering meaningful product value. This has become strangely accepted.</p>
<p>A ten-person engineering team maintaining infrastructure patterns designed for internet-scale organizations should raise serious questions. A small team pretending to be a platform team is an operational dysfunction.</p>
<p>Many companies adopt infrastructure complexity built for organizations operating at a vastly different scale. They inherit the burden without inheriting the benefits.</p>
<h2 id="heading-paas-changes-the-starting-point"><strong>PaaS Changes the Starting Point</strong></h2>
<p><a href="https://www.freecodecamp.org/news/the-hidden-tax-of-infrastructure-why-your-team-shouldn-t-be-running-it-anymore/">Traditional infrastructure</a> approaches force teams to think from the bottom upward. Servers come first. Then operating systems. Then networking. Then deployment systems. Then monitoring. Eventually, applications arrive.</p>
<p>PaaS reverses this sequence. Developers begin with applications and business goals. The platform absorbs operational complexity.</p>
<p>Teams stop asking, “How do we provision resources?” They start asking, “What problem are we solving?” That sounds like a small shift. In practice, it changes everything.</p>
<p>A mature PaaS environment often provides deployment pipelines, integrated observability, databases, scaling behaviour, security controls, and operational standards before a team writes meaningful application logic.</p>
<p>Projects begin with product development rather than infrastructure construction. That dramatically changes time-to-value.</p>
<h2 id="heading-repetition-creates-hidden-organizational-waste"><strong>Repetition Creates Hidden Organizational Waste</strong></h2>
<p>Organizations often underestimate operational waste because repetitive work feels familiar. Setting up a deployment pipeline may consume only a few days. Configuring logging may feel routine. Creating security rules may seem manageable.</p>
<p>No individual task appears expensive. The cost appears when repetition scales.</p>
<p>If ten projects independently spend two weeks rebuilding nearly identical operational systems, months of engineering capacity disappear. Those engineers could have shipped customer capabilities. They could have reduced friction. They could have tested new ideas. Instead, they rebuilt plumbing.</p>
<p>Engineering teams understand leverage in nearly every other area. Nobody rewrites sorting algorithms for every application. Nobody recreates database engines from scratch. Nobody builds networking stacks repeatedly.</p>
<p>Reuse is accepted as basic engineering wisdom. Infrastructure should not receive special treatment. Build once. Reuse many times.</p>
<p>PaaS simply applies software engineering principles to operational systems.</p>
<h2 id="heading-standardization-is-usually-faster-than-flexibility"><strong>Standardization Is Usually Faster Than Flexibility</strong></h2>
<p>Engineering teams often resist standardization because they fear losing control. Every project feels unique. Every system appears different. The desire for flexibility sounds reasonable.</p>
<p>But complete flexibility often creates operational chaos. Different teams deploy applications differently. Logging behaves inconsistently. Monitoring varies across systems. Security implementations drift.</p>
<p>Documentation fragments. Onboarding slows. Incident response becomes harder. Complexity quietly accumulates.</p>
<p>PaaS introduces constraints, and many engineers instinctively resist constraints. They should not. Useful constraints often increase speed.</p>
<p>Predictable deployment patterns reduce confusion. Shared monitoring standards simplify troubleshooting. Consistent environments reduce cognitive overhead.</p>
<p>Developers spend less energy understanding infrastructure differences and more time delivering product functionality.</p>
<p>Consistency compounds.</p>
<h2 id="heading-platform-teams-become-multipliers"><strong>Platform Teams Become Multipliers</strong></h2>
<p>Many organizations interpret PaaS as buying a vendor product. That misses the bigger idea.</p>
<p>PaaS is fundamentally about creating reusable capabilities. Some organizations buy platforms. Others build internal platforms.</p>
<p>The principle remains the same.</p>
<p>A platform team creates systems once and allows everyone else to benefit. Instead of dozens of product teams independently solving operational problems, a dedicated group centralizes expertise and builds reusable solutions.</p>
<p>The effect becomes substantial. One deployment improvement accelerates every future release. One observability improvement strengthens every application. One security enhancement protects every team.</p>
<p>Platform teams create organizational leverage. Without this model, expertise stays fragmented. With it, expertise compounds.</p>
<h2 id="heading-easier-starts-create-more-innovation"><strong>Easier Starts Create More Innovation</strong></h2>
<p>Operational friction changes behaviour. When launching projects becomes expensive, organizations become cautious. Teams avoid experiments. Small ideas feel risky. Prototypes become difficult to justify.</p>
<p>Over time, innovation slows. Not because organizations lack ideas, but because starting became too expensive.</p>
<p>Teams running mature platforms understand this relationship. Reducing startup friction increases experimentation. Smaller projects become practical. Learning cycles become shorter.</p>
<p>New ideas appear more often because the cost of testing them falls dramatically. The easier it becomes to launch something, the more opportunities organizations create.</p>
<p>PaaS reduces startup friction. That reduction changes culture.</p>
<h2 id="heading-when-specialized-control-actually-matters"><strong>When Specialized Control Actually Matters</strong></h2>
<p>There are exceptions. Massive data platforms, highly specialized machine learning systems, and extremely customized environments may require lower-level infrastructure ownership.</p>
<p>Some workloads genuinely need deeper operational control. But these scenarios are exceptions, not defaults. Too many teams inherit infrastructure complexity designed for edge cases and treat it as standard practice.</p>
<p>Most production applications do not need custom orchestration layers. Most teams do not need to own Kubernetes. Most engineering groups do not need to spend weeks assembling infrastructure before shipping software.</p>
<p>The default assumption should be the opposite.</p>
<h2 id="heading-starting-from-zero-is-a-process-failure"><strong>Starting From Zero Is a Process Failure</strong></h2>
<p>Many organizations normalize unnecessary operational drag. Long setup cycles become accepted. Infrastructure duplication becomes routine. Cloud complexity becomes expected.</p>
<p>Eventually, teams stop questioning it. They assume this is simply how engineering works. It is not.</p>
<p>If launching a new application requires weeks of foundational setup before customer value appears, that is not an engineering discipline.</p>
<p>The goal was never to become an infrastructure company. It was to ship software.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Encrypt Kubernetes Traffic with cert-manager, Let's Encrypt, and Internal TLS ]]>
                </title>
                <description>
                    <![CDATA[ Most engineers assume their Kubernetes cluster encrypts all of its traffic. It doesn't. The commands you run with kubectl are encrypted — your client and the API server speak TLS. The API server talki ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-encrypt-kubernetes-traffic/</link>
                <guid isPermaLink="false">6a0df3b68b034602219e482c</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Wed, 20 May 2026 17:47:34 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/c1cf9847-fa0f-49f3-93f4-3c5c1e8ac4c0.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most engineers assume their Kubernetes cluster encrypts all of its traffic. It doesn't. The commands you run with <code>kubectl</code> are encrypted — your client and the API server speak TLS. The API server talking to etcd is usually encrypted too, depending on how the cluster was provisioned.</p>
<p>But traffic between your pods? Plaintext by default. Ingress traffic from the internet to your services? Only encrypted if you explicitly configure TLS. And certificates for internal services? You have to provision those yourself.</p>
<p>This is not a Kubernetes oversight. It's a deliberate design choice — Kubernetes provides the primitives and leaves the implementation to you. The problem is that certificate management is notoriously painful. Certificates expire. Provisioning them manually doesn't scale. Forgetting to rotate them causes outages.</p>
<p>cert-manager solves this. It runs as a controller inside your cluster, watches for <code>Certificate</code> resources, requests certificates from configured issuers, stores them in Kubernetes Secrets, and rotates them automatically before they expire. You declare what you want, cert-manager makes it happen and keeps it that way.</p>
<p>In this article you'll work through how cert-manager's core model works, automate public Ingress TLS using Let's Encrypt, set up an internal Certificate Authority for service-to-service encryption, and understand how certificate rotation works so outages caused by expired certificates become a thing of the past.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A kind cluster with the nginx Ingress controller installed</p>
</li>
<li><p>Helm 3 installed</p>
</li>
<li><p>A domain name with DNS you control — needed for the Let's Encrypt demo</p>
</li>
<li><p>Basic understanding of TLS: you know what a certificate, a private key, and a CA are</p>
</li>
</ul>
<p>All demo files are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security/cert-manager">DevOps-Cloud-Projects GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-and-isnt-encrypted-in-kubernetes">What Is and Isn't Encrypted in Kubernetes</a></p>
</li>
<li><p><a href="#heading-how-cert-manager-works">How cert-manager Works</a></p>
<ul>
<li><p><a href="#heading-the-four-core-resources">The Four Core Resources</a></p>
</li>
<li><p><a href="#heading-issuers-and-clusterissuers">Issuers and ClusterIssuers</a></p>
</li>
<li><p><a href="#heading-the-certificate-lifecycle">The Certificate Lifecycle</a></p>
</li>
<li><p><a href="#heading-acme-challenges-http-01-vs-dns-01">ACME Challenges: HTTP-01 vs DNS-01</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-1--install-cert-manager-and-issue-a-lets-encrypt-certificate">Demo 1 — Install cert-manager and Issue a Let's Encrypt Certificate</a></p>
</li>
<li><p><a href="#heading-how-to-get-a-wildcard-certificate-with-dns-01">How to Get a Wildcard Certificate with DNS-01</a></p>
</li>
<li><p><a href="#heading-demo-2--set-up-an-internal-ca-for-service-to-service-tls">Demo 2 — Set Up an Internal CA for Service-to-Service TLS</a></p>
</li>
<li><p><a href="#heading-how-certificate-rotation-works">How Certificate Rotation Works</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-and-isnt-encrypted-in-kubernetes">What Is and Isn't Encrypted in Kubernetes?</h2>
<p>Before installing anything, it's worth being precise about what the cluster already protects and what it leaves open.</p>
<table>
<thead>
<tr>
<th>Traffic path</th>
<th>Encrypted by default?</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td><code>kubectl</code> → API server</td>
<td>Yes</td>
<td>TLS with the cluster CA</td>
</tr>
<tr>
<td>API server → etcd</td>
<td>Usually</td>
<td>Depends on cluster provisioner — verify with your setup</td>
</tr>
<tr>
<td>API server → kubelet</td>
<td>Yes</td>
<td>TLS, but kubelet cert verification depends on configuration</td>
</tr>
<tr>
<td>Pod → Pod (same cluster)</td>
<td><strong>No</strong></td>
<td>Plaintext unless you add a service mesh or mTLS</td>
</tr>
<tr>
<td>Internet → Ingress</td>
<td><strong>No</strong></td>
<td>Opt-in — requires TLS configuration on the Ingress resource</td>
</tr>
<tr>
<td>Pod → Kubernetes API</td>
<td>Yes</td>
<td>Via the service account token and cluster CA</td>
</tr>
</tbody></table>
<p>The two gaps that matter most in practice are pod-to-pod traffic and Ingress TLS. This article covers both Ingress TLS with Let's Encrypt and internal service-to-service encryption using a private CA.</p>
<h2 id="heading-how-cert-manager-works">How cert-manager Works</h2>
<p>cert-manager is a Kubernetes operator. It extends the Kubernetes API with custom resources that represent certificate requests and their configuration. When you create a <code>Certificate</code> resource, cert-manager's controller picks it up, requests a certificate from the configured issuer, and stores the resulting certificate and private key in a Kubernetes Secret. When the certificate approaches its expiry, cert-manager renews it automatically.</p>
<p>This model means your application doesn't know or care about certificate management. It reads a Secret. cert-manager keeps that Secret fresh.</p>
<h3 id="heading-the-four-core-resources">The Four Core Resources</h3>
<p>cert-manager introduces four custom resources that you'll use regularly:</p>
<table>
<thead>
<tr>
<th>Resource</th>
<th>What it represents</th>
</tr>
</thead>
<tbody><tr>
<td><code>Issuer</code></td>
<td>A certificate authority or ACME account — namespace-scoped</td>
</tr>
<tr>
<td><code>ClusterIssuer</code></td>
<td>Same as Issuer, but available cluster-wide</td>
</tr>
<tr>
<td><code>Certificate</code></td>
<td>A request for a certificate — describes what you want</td>
</tr>
<tr>
<td><code>CertificateRequest</code></td>
<td>An individual signing request — created automatically by cert-manager, rarely touched directly</td>
</tr>
</tbody></table>
<p>In practice you'll mostly deal with <code>ClusterIssuer</code> and <code>Certificate</code>. The <code>ClusterIssuer</code> defines where certificates come from. The <code>Certificate</code> defines what certificate you want and where to store it.</p>
<h3 id="heading-issuers-and-clusterissuers">Issuers and ClusterIssuers</h3>
<p>An <code>Issuer</code> can only issue certificates within its own namespace. A <code>ClusterIssuer</code> can issue certificates in any namespace. For shared infrastructure like Let's Encrypt, you almost always want a <code>ClusterIssuer</code>. For application-specific internal CAs, an <code>Issuer</code> scoped to that application's namespace is the safer choice.</p>
<p>cert-manager supports several issuer types. The three you'll encounter most often are:</p>
<p><strong>ACME</strong> — for public certificates from Let's Encrypt or any ACME-compatible CA. Ownership of the domain is proven via an HTTP-01 or DNS-01 challenge.</p>
<p><strong>CA</strong> — for internal certificates signed by a CA whose private key is stored in a Kubernetes Secret. Used for service-to-service TLS within the cluster.</p>
<p><strong>Self-signed</strong> — generates self-signed certificates. Rarely useful on its own, but essential as the bootstrap step when creating an internal CA.</p>
<h3 id="heading-the-certificate-lifecycle">The Certificate Lifecycle</h3>
<p>When you create a <code>Certificate</code> resource, cert-manager follows this sequence:</p>
<ol>
<li><p>Creates a <code>CertificateRequest</code> with a CSR (Certificate Signing Request)</p>
</li>
<li><p>Passes the CSR to the configured issuer</p>
</li>
<li><p>For ACME issuers: creates a <code>Challenge</code> resource and fulfils it (more on this below)</p>
</li>
<li><p>Receives the signed certificate from the issuer</p>
</li>
<li><p>Stores the certificate and private key in the Kubernetes Secret named in <code>spec.secretName</code></p>
</li>
<li><p>Monitors the certificate's expiry — by default, renews when 2/3 of the validity period has elapsed</p>
</li>
</ol>
<p>Your application mounts the Secret. cert-manager updates it silently. Most applications that watch for file changes will pick up the new certificate without a restart.</p>
<h3 id="heading-acme-challenges-http-01-vs-dns-01">ACME Challenges: HTTP-01 vs DNS-01</h3>
<p>Let's Encrypt needs proof that you control the domain before it issues a certificate. ACME defines two challenge types for this.</p>
<p><strong>HTTP-01</strong> works by having cert-manager create a temporary HTTP endpoint at <code>http://&lt;your-domain&gt;/.well-known/acme-challenge/&lt;token&gt;</code>. Let's Encrypt sends a request to that URL. If the response matches the expected token, the challenge passes. This requires your cluster to be reachable from the internet on port 80.</p>
<p><strong>DNS-01</strong> works by having cert-manager create a temporary DNS TXT record at <code>_acme-challenge.&lt;your-domain&gt;</code>. Let's Encrypt checks for that record. This doesn't require inbound HTTP access, which makes it the right choice for private clusters, and it's the only way to get wildcard certificates (<code>*.example.com</code>).</p>
<p>The trade-off: HTTP-01 is simpler to set up but only works for single domains and requires internet-accessible infrastructure. DNS-01 requires API access to your DNS provider but works for internal clusters and wildcards.</p>
<h2 id="heading-demo-1-install-cert-manager-and-issue-a-certificate-using-pebble-and-lets-encrypt">Demo 1 — Install cert-manager and Issue a Certificate Using Pebble and Let's Encrypt</h2>
<p>Pebble is Let's Encrypt's local ACME test server. It runs inside your cluster, issues certificates using the same ACME protocol as Let's Encrypt, and requires no public domain or internet access. Using Pebble lets you test the full cert-manager flow — challenge, issuance, renewal — on a plain kind cluster.</p>
<p>Once you understand the flow locally, switching to real Let's Encrypt is a one-line change: replace the ClusterIssuer server URL and point a DNS record at a publicly reachable cluster. The rest of the configuration is identical.</p>
<p>You'll install cert-manager, create a <code>ClusterIssuer</code> for Let's Encrypt, deploy a sample application with an Ingress, and watch a real certificate be issued and stored automatically.</p>
<h3 id="heading-step-1-install-cert-manager">Step 1: Install cert-manager</h3>
<p>cert-manager is now distributed via OCI Helm charts from <code>quay.io/jetstack</code>. The <code>--set crds.enabled=true</code> flag installs the Custom Resource Definitions as part of the chart:</p>
<pre><code class="language-bash">helm upgrade cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --install \
  --create-namespace \
  --namespace cert-manager \
  --set crds.enabled=true \
  --version v1.17.0 \
  --wait
</code></pre>
<p>You also need the nginx Ingress controller — cert-manager routes HTTP-01 challenges through it. The <code>controller.service.type=ClusterIP</code> override is for kind specifically: the default <code>LoadBalancer</code> Service never gets an <code>EXTERNAL-IP</code> on kind (there's no cloud LB), which makes <code>--wait</code> hang forever. On a real cluster, drop the override and keep <code>LoadBalancer</code>.</p>
<pre><code class="language-bash">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=ClusterIP \
  --wait
</code></pre>
<p>Confirm all four components are running:</p>
<pre><code class="language-bash">kubectl get pods -n cert-manager
kubectl get pods -n ingress-nginx
</code></pre>
<pre><code class="language-plaintext">NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-76f84784c8-r4fx4              1/1     Running   0          6m45s
cert-manager-cainjector-66fbf49587-gv25n   1/1     Running   0          6m45s
cert-manager-webhook-577fddf86-l5wj4       1/1     Running   0          6m45s

NAME                                        READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-6c7cd85885-h7zgx   1/1     Running   0          3m34s
</code></pre>
<blockquote>
<p>kind-specific gotcha — remove the nginx admission webhook now.** On kind, the nginx admission webhook serves with a self-signed certificate that the Kubernetes API server cannot verify. The first time you try to create <em>any</em> Ingress resource you'll see <code>failed calling webhook "validate.nginx.ingress.kubernetes.io": ... x509: certificate signed by unknown authority</code>. Delete the webhook up front so the rest of the demo doesn't trip over it:</p>
</blockquote>
<pre><code class="language-bash">kubectl delete validatingwebhookconfiguration ingress-nginx-admission
</code></pre>
<h3 id="heading-step-2-install-pebble">Step 2: Install Pebble</h3>
<p>Pebble is the local ACME test server, distributed by the JupyterHub project. It ships with a companion CoreDNS deployment (<code>pebble-coredns</code>) that Pebble uses to resolve names during ACME validation.</p>
<pre><code class="language-bash">helm install pebble pebble \
  --repo https://jupyterhub.github.io/helm-chart/ \
  --namespace pebble \
  --create-namespace \
  --wait
</code></pre>
<p>Confirm both pods are running:</p>
<pre><code class="language-bash">kubectl get pods -n pebble
</code></pre>
<pre><code class="language-plaintext">NAME                              READY   STATUS    RESTARTS   AGE
pebble-8d8d49d64-lz8ck            1/1     Running   0          36s
pebble-coredns-7fb5c7cbf4-4jw9h   1/1     Running   0          36s
</code></pre>
<h3 id="heading-step-3-wire-up-dns-for-the-fake-hostname">Step 3: Wire up DNS for the fake hostname</h3>
<p>We're going to issue a cert for <code>echo.pebble.local</code>. That hostname is fake — it doesn't exist in any real DNS — so we have to teach <strong>two</strong> independent resolvers about it before issuance will work:</p>
<table>
<thead>
<tr>
<th>Resolver</th>
<th>Used by</th>
<th>What we need it to do</th>
</tr>
</thead>
<tbody><tr>
<td><code>pebble-coredns</code> (in the <code>pebble</code> namespace)</td>
<td>Pebble itself, when it makes the HTTP-01 validation request</td>
<td>Resolve <code>echo.pebble.local</code> → ingress-nginx ClusterIP</td>
</tr>
<tr>
<td>Cluster CoreDNS (<code>kube-system</code>)</td>
<td>cert-manager's HTTP-01 <strong>self-check</strong> before reporting the challenge ready</td>
<td>Forward <code>pebble.local</code> lookups to <code>pebble-coredns</code></td>
</tr>
</tbody></table>
<p>If you skip either layer, the Order will go to <code>invalid</code> state with a DNS lookup failure.</p>
<p>First grab the two IPs you'll need:</p>
<pre><code class="language-bash">NGINX_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.spec.clusterIP}')
PEBBLE_DNS_IP=$(kubectl get svc pebble-coredns -n pebble \
  -o jsonpath='{.spec.clusterIP}')
echo "NGINX_IP=\(NGINX_IP  PEBBLE_DNS_IP=\)PEBBLE_DNS_IP"
</code></pre>
<p><strong>Patch</strong> <code>pebble-coredns</code> to answer for <code>*.pebble.local</code> with the ingress controller's IP. The CoreDNS <code>template</code> plugin parses unreliably when the whole block is collapsed onto one line, so apply a real multi-line ConfigMap:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: pebble-coredns
  namespace: pebble
data:
  Corefile: |
    .:8053 {
      errors
      health
      ready
      template ANY ANY pebble.local {
        answer "{{ .Name }} 60 IN A ${NGINX_IP}"
      }
      forward . /etc/resolv.conf
      cache 2
      reload
    }
EOF

kubectl rollout restart deploy/pebble-coredns -n pebble
kubectl rollout status deploy/pebble-coredns -n pebble
</code></pre>
<p>Verify it answers correctly:</p>
<pre><code class="language-bash">kubectl run dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup echo.pebble.local ${PEBBLE_DNS_IP}
</code></pre>
<p>You should see <code>Address: &lt;NGINX_IP&gt;</code> in the response. If you get <code>SERVFAIL</code>, check <code>kubectl logs -n pebble deploy/pebble-coredns</code> — a parser error like <code>not a TTL: "}"</code> means the template block collapsed onto one line again.</p>
<p><strong>Patch the cluster CoreDNS</strong> so cert-manager's self-check can resolve the same name. Add a stub zone that forwards <code>pebble.local</code> to <code>pebble-coredns</code>:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
    pebble.local:53 {
        forward . ${PEBBLE_DNS_IP}
    }
EOF

kubectl rollout restart deploy/coredns -n kube-system
kubectl rollout status deploy/coredns -n kube-system
</code></pre>
<p>Verify the cluster resolver now answers for <code>echo.pebble.local</code> (without specifying a server — it'll use the default kube-dns):</p>
<pre><code class="language-bash">kubectl run dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup echo.pebble.local
</code></pre>
<p>Both <code>Server: 10.96.0.10</code> and <code>Address: &lt;NGINX_IP&gt;</code> should appear.</p>
<h3 id="heading-step-4-fetch-the-pebble-ca-and-create-the-clusterissuer">Step 4: Fetch the Pebble CA and create the ClusterIssuer</h3>
<p>Pebble signs its certificates with a self-signed root that lives in the <code>pebble</code> ConfigMap under <code>root-cert.pem</code>. cert-manager needs to trust this CA to talk to Pebble's ACME directory, so we pass it as a base64-encoded <code>caBundle</code> in the ClusterIssuer:</p>
<pre><code class="language-bash">kubectl get configmap pebble -n pebble \
  -o jsonpath='{.data.root-cert\.pem}' &gt; pebble-ca.crt

head -1 pebble-ca.crt   # should print -----BEGIN CERTIFICATE-----

CA_BUNDLE=$(base64 -i pebble-ca.crt | tr -d '\n')
echo "CA_BUNDLE length: ${#CA_BUNDLE}"   # ~1600 chars, one continuous line
</code></pre>
<p>Create the ClusterIssuer using the heredoc — the <code>${CA_BUNDLE}</code> shell variable gets substituted into the YAML before kubectl reads it:</p>
<pre><code class="language-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: pebble
spec:
  acme:
    server: https://pebble.pebble.svc.cluster.local/dir
    email: test@example.com
    privateKeySecretRef:
      name: pebble-account-key
    caBundle: ${CA_BUNDLE}
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
EOF
</code></pre>
<p>Check the issuer is ready:</p>
<pre><code class="language-bash">kubectl get clusterissuer pebble
</code></pre>
<pre><code class="language-plaintext">NAME     READY   AGE
pebble   True    5s
</code></pre>
<p>If <code>READY</code> stays <code>False</code>, the two most common causes are a malformed caBundle (verify it's a single unbroken base64 line with no newlines) or Pebble being unreachable from the <code>cert-manager</code> namespace. To check reachability:</p>
<pre><code class="language-bash">kubectl run test-curl --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --namespace cert-manager -- \
  curl -k https://pebble.pebble.svc.cluster.local/dir
</code></pre>
<p>If that returns JSON, Pebble is reachable.</p>
<h3 id="heading-step-5-deploy-a-sample-application">Step 5: Deploy a sample application</h3>
<pre><code class="language-yaml"># echo-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: ealen/echo-server:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: default
spec:
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 80
</code></pre>
<pre><code class="language-bash">kubectl apply -f echo-app.yaml
</code></pre>
<p>Verify the resources came up:</p>
<pre><code class="language-bash">kubectl get deploy,pod,svc -n default
</code></pre>
<pre><code class="language-plaintext">NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/echo   1/1     1            1           32s

NAME                        READY   STATUS    RESTARTS   AGE
pod/echo-5665fbcfdd-mbgxj   1/1     Running   0          36s

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/echo         ClusterIP   10.96.103.114   &lt;none&gt;        80/TCP    40s
service/kubernetes   ClusterIP   10.96.0.1       &lt;none&gt;        443/TCP   32m
</code></pre>
<h3 id="heading-step-6-create-an-ingress-with-tls">Step 6: Create an Ingress with TLS</h3>
<p>The <code>cert-manager.io/cluster-issuer: pebble</code> annotation tells cert-manager to automatically create a <code>Certificate</code> resource for this Ingress, using the issuer we just created. The hostname <code>echo.pebble.local</code> doesn't need to resolve externally — we taught both DNS resolvers about it in Step 3.</p>
<pre><code class="language-yaml"># echo-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echo
  namespace: default
  annotations:
    cert-manager.io/cluster-issuer: pebble
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - echo.pebble.local
      secretName: echo-tls     # cert-manager will create this Secret
  rules:
    - host: echo.pebble.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echo
                port:
                  number: 80
</code></pre>
<pre><code class="language-bash">kubectl apply -f echo-ingress.yaml
</code></pre>
<h3 id="heading-step-7-watch-the-certificate-being-issued">Step 7: Watch the certificate being issued</h3>
<pre><code class="language-bash"># Watch the Certificate resource (Ctrl-C once Ready=True)
kubectl get certificate echo-tls -n default -w
</code></pre>
<pre><code class="language-plaintext">NAME       READY   SECRET     AGE
echo-tls   False   echo-tls   5s
echo-tls   True    echo-tls   28s
</code></pre>
<p>When <code>READY</code> becomes <code>True</code>, the certificate has been issued and stored in the <code>echo-tls</code> Secret. The full chain — CertificateRequest → Order → Challenge → solver pod → Secret — happens in well under a minute on a healthy cluster:</p>
<pre><code class="language-bash">kubectl get certificate,certificaterequest,order,challenge -n default
</code></pre>
<pre><code class="language-plaintext">NAME                                   READY   SECRET     AGE
certificate.cert-manager.io/echo-tls   True    echo-tls   81s

NAME                                            APPROVED   DENIED   READY   ISSUER   AGE
certificaterequest.cert-manager.io/echo-tls-1   True                True    pebble   81s

NAME                                               STATE   AGE
order.acme.cert-manager.io/echo-tls-1-1824732543   valid   81s
</code></pre>
<p>(Challenges are deleted automatically once an Order completes, so <code>kubectl get challenge -n default</code> typically shows nothing at this point — that's success, not failure.)</p>
<p>If <code>READY</code> stays <code>False</code> for more than a minute, see the troubleshooting tips at the end of this section.</p>
<p>Inspect the issued certificate to confirm Pebble signed it:</p>
<pre><code class="language-bash">kubectl get secret echo-tls -n default -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -issuer -subject -dates
</code></pre>
<pre><code class="language-plaintext">issuer=CN=Pebble Intermediate CA 05478c
subject=
notBefore=May 17 19:09:22 2026 GMT
notAfter=Aug 15 19:09:21 2026 GMT
</code></pre>
<p>Issuer is Pebble's intermediate CA — proof the full ACME flow worked end-to-end. The cert is valid for 90 days, and cert-manager will renew it automatically at day 60.</p>
<p>Hit the ingress over HTTPS from inside the cluster to confirm everything is wired together:</p>
<pre><code class="language-bash">kubectl run curltest --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk https://echo.pebble.local/
</code></pre>
<p>The echo server should return a JSON blob — note the <code>"x-forwarded-proto":"https"</code> field, which proves the request came through nginx over TLS.</p>
<p><strong>Troubleshooting if the cert never goes Ready:</strong></p>
<ul>
<li><p><code>kubectl describe order -n default</code> — look for "DNS problem" or "Connection refused" in the events.</p>
</li>
<li><p><code>kubectl logs -n pebble deploy/pebble --tail=50</code> — Pebble logs the exact URL it tried to fetch during validation and any errors.</p>
</li>
<li><p>If the Order is stuck pending with no events: cert-manager hasn't reconciled yet. Wait 30s.</p>
</li>
<li><p>If the Order is <code>invalid</code>: one of the two DNS layers (Step 3) is misconfigured. Re-run both <code>nslookup</code> checks.</p>
</li>
<li><p>If the Ingress apply itself failed with an x509 webhook error: you skipped the <code>kubectl delete validatingwebhookconfiguration ingress-nginx-admission</code> step in Step 1.</p>
</li>
</ul>
<h3 id="heading-step-8-switch-to-lets-encrypt-staging-real-public-domain">Step 8: Switch to Let's Encrypt staging (real public domain)</h3>
<p>Pebble proved the flow works locally. Now move to a publicly-reachable domain pointed at a publicly-reachable cluster. The DNS gymnastics from Step 3 go away — the domain is real, so both resolvers find it without intervention.</p>
<p>Use Let's Encrypt <strong>staging</strong> first. It speaks the same ACME protocol as production but with generous rate limits, so failed attempts during testing won't lock you out:</p>
<pre><code class="language-yaml"># clusterissuer-staging.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
</code></pre>
<pre><code class="language-bash">kubectl apply -f clusterissuer-staging.yaml

# Point the Ingress at staging and the real hostname, then force re-issuance
kubectl annotate ingress echo \
  cert-manager.io/cluster-issuer=letsencrypt-staging --overwrite -n default
kubectl delete secret echo-tls -n default
</code></pre>
<p>The new cert's issuer will look something like <code>(STAGING) Let's Encrypt</code>.</p>
<h3 id="heading-step-9-switch-to-lets-encrypt-production">Step 9: Switch to Let's Encrypt production</h3>
<p>Once staging works, repeat with the production ClusterIssuer. The only difference is the <code>server</code> URL:</p>
<pre><code class="language-yaml"># clusterissuer-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
</code></pre>
<pre><code class="language-bash">kubectl apply -f clusterissuer-prod.yaml
kubectl annotate ingress echo \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite -n default
kubectl delete secret echo-tls -n default
</code></pre>
<p>cert-manager detects the missing Secret and immediately requests a browser-trusted certificate from production Let's Encrypt.</p>
<p>cert-manager detects the missing Secret and immediately triggers a new certificate request using the production issuer.</p>
<h2 id="heading-how-to-get-a-wildcard-certificate-with-dns-01">How to Get a Wildcard Certificate with DNS-01</h2>
<p>HTTP-01 challenges work well for single domains with public ingress. But there are two situations where you need DNS-01 instead: when your cluster is not publicly accessible (internal clusters, air-gapped environments, staging namespaces behind a VPN), and when you want a wildcard certificate that covers all subdomains of your domain.</p>
<p>DNS-01 requires cert-manager to be able to create and delete TXT records in your DNS provider. cert-manager has built-in support for Route53, Cloud DNS, Cloudflare, Azure DNS, and many others.</p>
<p>Here is a <code>ClusterIssuer</code> for DNS-01 using AWS Route53:</p>
<pre><code class="language-yaml"># clusterissuer-dns01.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            # Use IRSA (IAM Roles for Service Accounts) in production
            # rather than static credentials
            hostedZoneID: YOUR_HOSTED_ZONE_ID
</code></pre>
<p>A wildcard <code>Certificate</code> using that issuer:</p>
<pre><code class="language-yaml"># wildcard-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns01
    kind: ClusterIssuer
  commonName: "*.example.com"
  dnsNames:
    - "*.example.com"
    - "example.com"        # Also cover the apex domain
  duration: 2160h           # 90 days
  renewBefore: 720h         # Renew 30 days before expiry
</code></pre>
<p>The resulting Secret <code>wildcard-example-com-tls</code> can be referenced by any Ingress in the <code>default</code> namespace. All subdomains — <code>api.example.com</code>, <code>dashboard.example.com</code>, <code>staging.example.com</code> — are covered by a single certificate that rotates automatically.</p>
<p>For Cloudflare instead of Route53, the solver section looks like this:</p>
<pre><code class="language-yaml">    solvers:
      - dns01:
          cloudflare:
            email: your-email@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
</code></pre>
<h2 id="heading-demo-2-set-up-an-internal-ca-for-service-to-service-tls">Demo 2 — Set Up an Internal CA for Service-to-Service TLS</h2>
<p>Let's Encrypt certificates are great for public-facing services. But for internal services — a gRPC microservice calling another, a web application talking to its database — you don't need public trust. You need a CA that the cluster trusts, and you need it to issue certificates for service names that don't exist as public DNS records.</p>
<p>cert-manager's CA issuer handles this. You create a root CA, tell cert-manager about it, and then issue certificates for internal services using that CA. Every service that trusts the root CA trusts every certificate it issues.</p>
<h3 id="heading-step-1-create-a-self-signed-clusterissuer">Step 1: Create a self-signed ClusterIssuer</h3>
<p>A self-signed issuer generates certificates that are signed by the certificate itself — it is its own CA. You use this as a bootstrap step to create the root CA certificate:</p>
<pre><code class="language-yaml"># selfsigned-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned
spec:
  selfSigned: {}
</code></pre>
<pre><code class="language-bash">kubectl apply -f selfsigned-issuer.yaml
</code></pre>
<h3 id="heading-step-2-create-the-root-ca-certificate">Step 2: Create the root CA certificate</h3>
<p>Use the self-signed issuer to create a CA certificate. The <code>isCA: true</code> field tells cert-manager this certificate can sign other certificates:</p>
<pre><code class="language-yaml"># internal-ca.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager    # Store in cert-manager namespace
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-secret
  duration: 87600h           # 10 years — this is a root CA
  renewBefore: 720h
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
</code></pre>
<pre><code class="language-bash">kubectl apply -f internal-ca.yaml
kubectl get certificate internal-ca -n cert-manager
</code></pre>
<pre><code class="language-plaintext">NAME          READY   SECRET               AGE
internal-ca   True    internal-ca-secret   8s
</code></pre>
<h3 id="heading-step-3-create-a-ca-clusterissuer-backed-by-the-root-ca">Step 3: Create a CA ClusterIssuer backed by the root CA</h3>
<p>Now create a <code>ClusterIssuer</code> that uses the root CA Secret you just created. This is the issuer that will sign certificates for your internal services:</p>
<pre><code class="language-yaml"># internal-ca-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-secret   # References the Secret in cert-manager namespace
</code></pre>
<pre><code class="language-bash">kubectl apply -f internal-ca-issuer.yaml
kubectl get clusterissuer internal-ca
</code></pre>
<pre><code class="language-plaintext">NAME          READY   AGE
internal-ca   True    5s
</code></pre>
<h3 id="heading-step-4-issue-a-certificate-for-an-internal-service">Step 4: Issue a certificate for an internal service</h3>
<p>Now issue a certificate for an internal gRPC service. The <code>dnsNames</code> use Kubernetes internal DNS names — <code>&lt;service&gt;.&lt;namespace&gt;.svc.cluster.local</code>:</p>
<pre><code class="language-yaml"># payments-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-tls
  namespace: production
spec:
  secretName: payments-tls-secret
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
  commonName: payments.production.svc.cluster.local
  dnsNames:
    - payments.production.svc.cluster.local
    - payments.production.svc
    - payments
  duration: 2160h     # 90 days
  renewBefore: 360h   # Renew 15 days before expiry
</code></pre>
<pre><code class="language-bash">kubectl create namespace production
kubectl apply -f payments-cert.yaml
kubectl get certificate payments-tls -n production
</code></pre>
<pre><code class="language-plaintext">NAME           READY   SECRET                AGE
payments-tls   True    payments-tls-secret   6s
</code></pre>
<p>The Secret <code>payments-tls-secret</code> now contains <code>tls.crt</code>, <code>tls.key</code>, and <code>ca.crt</code>. Mount this into your application pod:</p>
<pre><code class="language-yaml"># In your Deployment spec
volumes:
  - name: tls
    secret:
      secretName: payments-tls-secret
containers:
  - name: payments
    volumeMounts:
      - name: tls
        mountPath: /etc/tls
        readOnly: true
</code></pre>
<p>Your application reads <code>/etc/tls/tls.crt</code> and <code>/etc/tls/tls.key</code> to configure TLS. Other services that need to trust it read <code>/etc/tls/ca.crt</code>.</p>
<h3 id="heading-step-5-distribute-the-ca-bundle-with-trust-manager">Step 5: Distribute the CA bundle with trust-manager</h3>
<p>The problem with a custom CA is that every service needs to know about it. cert-manager's companion tool, trust-manager, handles this by distributing the CA bundle as a <code>ConfigMap</code> to every namespace:</p>
<pre><code class="language-bash">helm upgrade trust-manager oci://quay.io/jetstack/charts/trust-manager \
  --install \
  --namespace cert-manager \
  --wait
</code></pre>
<p>Create a <code>Bundle</code> resource that takes the CA certificate from the <code>internal-ca-secret</code> and distributes it cluster-wide:</p>
<pre><code class="language-yaml"># ca-bundle.yaml
apiVersion: trust.cert-manager.io/v1alpha1
kind: Bundle
metadata:
  name: internal-ca-bundle
spec:
  sources:
    - secret:
        name: internal-ca-secret
        key: ca.crt
  target:
    configMap:
      key: ca-bundle.crt
    namespaceSelector:
      matchLabels:
        # Distribute to all namespaces with this label
        kubernetes.io/metadata.name: production
</code></pre>
<pre><code class="language-bash">kubectl apply -f ca-bundle.yaml
</code></pre>
<p>After a few seconds, every matching namespace has a ConfigMap named <code>internal-ca-bundle</code> containing the CA certificate. Applications mount this ConfigMap to trust internally-issued certificates without any per-service configuration.</p>
<h3 id="heading-step-6-verify-the-certificate-chain">Step 6: Verify the certificate chain</h3>
<pre><code class="language-bash"># Extract the CA cert and service cert
kubectl get secret payments-tls-secret -n production \
  -o jsonpath='{.data.ca\.crt}' | base64 -d &gt; ca.crt

kubectl get secret payments-tls-secret -n production \
  -o jsonpath='{.data.tls\.crt}' | base64 -d &gt; payments.crt

# Verify the cert was signed by the CA
openssl verify -CAfile ca.crt payments.crt
</code></pre>
<pre><code class="language-plaintext">payments.crt: OK
</code></pre>
<h2 id="heading-how-certificate-rotation-works">How Certificate Rotation Works</h2>
<p>Certificate rotation is the part of certificate management that breaks production clusters most often. cert-manager handles it automatically, but understanding the mechanism helps you tune it and debug it when things go wrong.</p>
<p>cert-manager watches every <code>Certificate</code> resource it manages and checks the expiry of the underlying certificate in the Secret. When the remaining validity drops below the <code>renewBefore</code> threshold, cert-manager triggers a renewal. The default <code>renewBefore</code> is 1/3 of the certificate's total validity period — so a 90-day certificate starts renewing at day 60.</p>
<p>The renewal creates a new <code>CertificateRequest</code>, goes through the full issuance flow, and updates the Secret in place. The new certificate replaces the old one atomically. Applications that use file mounts and watch for changes (most modern web servers and gRPC frameworks do) will pick up the new certificate without restarting.</p>
<pre><code class="language-bash"># See the current rotation status
kubectl describe certificate echo-tls -n default
</code></pre>
<p>Look for these fields in the output:</p>
<pre><code class="language-plaintext">Status:
  Not After:   2024-06-18T10:00:00Z
  Not Before:  2024-03-20T10:00:00Z
  Renewal Time: 2024-05-18T10:00:00Z   # When cert-manager will start renewing
  Conditions:
    Type:    Ready
    Status:  True
    Message: Certificate is up to date and has not expired
</code></pre>
<p>If a renewal fails — for example, because the HTTP-01 challenge can't be completed — cert-manager retries with exponential backoff. The existing certificate continues to serve until it actually expires, giving you a window to debug the issue.</p>
<p>To see renewal events in real time:</p>
<pre><code class="language-bash">kubectl get events -n default --field-selector reason=Issued
kubectl get events -n default --field-selector reason=Failed
</code></pre>
<p><strong>Setting</strong> <code>renewBefore</code> <strong>correctly:</strong> For public-facing services, 30 days before a 90-day certificate is a sensible buffer. For internal short-lived certificates (24-hour validity), set <code>renewBefore</code> to 8 hours so rotation happens well before expiry even if the first attempt fails. Never set <code>renewBefore</code> to more than half the certificate's validity — cert-manager will immediately try to renew a certificate it just issued.</p>
<h2 id="heading-cleanup">Cleanup</h2>
<pre><code class="language-bash"># Remove demo resources
kubectl delete ingress echo -n default
kubectl delete service echo -n default
kubectl delete deployment echo -n default
kubectl delete secret echo-tls -n default
kubectl delete certificate payments-tls -n production
kubectl delete namespace production

# Uninstall cert-manager and trust-manager
helm uninstall trust-manager -n cert-manager
helm uninstall cert-manager -n cert-manager
kubectl delete namespace cert-manager

# Remove ClusterIssuers
kubectl delete clusterissuer letsencrypt-staging letsencrypt-prod \
  internal-ca selfsigned 2&gt;/dev/null
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kubernetes leaves TLS configuration entirely to you. In this article you worked through both the public and internal sides of that responsibility.</p>
<p>On the public side, you installed cert-manager using the current OCI Helm chart, created a <code>ClusterIssuer</code> backed by Let's Encrypt, and watched cert-manager go through the full ACME HTTP-01 challenge flow — from creating a temporary solver pod to storing a valid certificate in a Kubernetes Secret. You saw how switching from staging to production is a one-line annotation change, and how cert-manager renews certificates automatically before they expire.</p>
<p>On the internal side, you bootstrapped a private CA using cert-manager's self-signed issuer, created a <code>ClusterIssuer</code> backed by that CA, and issued certificates for internal service names that only exist inside the cluster. You used trust-manager to distribute the CA bundle cluster-wide so services can trust each other's certificates without per-service configuration. And you saw how to verify the certificate chain with <code>openssl</code> so you can confirm it's working before deploying to production.</p>
<p>Understanding certificate rotation is what separates teams that manage TLS confidently from teams that get woken up at 3am by an expired certificate. cert-manager automates the renewal, but the <code>renewBefore</code> field is your safety margin — set it correctly and know how to read the renewal status.</p>
<p>All YAML manifests and Helm values from this article are available in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security/cert-manager">DevOps-Cloud-Projects GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Local DevOps HomeLab with Docker, Kubernetes, and Ansible ]]>
                </title>
                <description>
                    <![CDATA[ The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS. I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $ ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-local-devops-homelab-with-docker-kubernetes-and-ansible/</link>
                <guid isPermaLink="false">69dd667c217f5dfcbd55b7b4</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Homelab ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops articles ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 21:56:12 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/1e970f8b-eb52-4582-9c98-13cbce867c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The first time I tried to follow a DevOps tutorial, it told me to sign up for AWS.</p>
<p>I did. I spun up an EC2 instance, followed along for an hour, and then forgot to shut it down. A week later I had a $34 bill for a machine running nothing.</p>
<p>That was the last time I practiced on someone else's infrastructure.</p>
<p>Everything in this guide runs on your laptop. No cloud account, no credit card, no bill at the end of the month. By the end, you'll be able to spin up a multi-server environment from scratch, configure it automatically with Ansible, serve a site you wrote yourself, and diagnose what breaks when you intentionally destroy it.</p>
<p>That last part is where the actual learning happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>A laptop with at least 8GB of RAM (16GB is better)</p>
</li>
<li><p>At least 20GB of free disk space</p>
</li>
<li><p>Windows, macOS, or Linux operating system</p>
</li>
<li><p>Administrator access to your computer</p>
</li>
<li><p>Virtualization enabled in your BIOS/UEFI settings</p>
</li>
<li><p>A stable internet connection for the initial downloads</p>
</li>
</ul>
<p>Knowledge and comfort level:</p>
<ul>
<li><p>You should be comfortable using a terminal (running commands, changing directories, and editing small text files with whatever editor you like).</p>
</li>
<li><p>Basic familiarity with concepts like “a server,” “SSH,” and “a port” helps, but you don't need prior experience with Docker, Kubernetes, Vagrant, or Ansible. This guide introduces them as you go.</p>
</li>
</ul>
<p>If you can follow step-by-step instructions and read error output without panicking, you're ready.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-what-is-devops">What is DevOps?</a></p>
</li>
<li><p><a href="#heading-why-build-a-local-lab">Why Build a Local Lab?</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-docker">How to Set Up Docker</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</a></p>
</li>
<li><p><a href="#heading-how-to-install-kubectl">How to Install kubectl</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-vagrant">How to Set Up Vagrant</a></p>
</li>
<li><p><a href="#heading-how-to-install-ansible">How to Install Ansible</a></p>
</li>
<li><p><a href="#heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</a></p>
</li>
<li><p><a href="#heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</a></p>
</li>
<li><p><a href="#heading-what-you-can-now-do">What You Can Now Do</a></p>
</li>
</ol>
<h2 id="heading-what-is-devops">What is DevOps?</h2>
<p>DevOps is the practice of breaking down the wall between software development and IT operations teams.</p>
<p>Traditionally, developers write code and hand it off to operations teams to deploy and maintain. That handoff causes delays, misunderstandings, and outages. DevOps is what happens when both teams work together from the start.</p>
<p>The tools you'll install in this guide each solve a specific part of that process:</p>
<ul>
<li><p><strong>Docker</strong> packages your application and everything it needs into a portable container that runs the same way on any machine.</p>
</li>
<li><p><strong>Kubernetes</strong> manages multiple containers at scale, handling restarts, networking, and load balancing automatically.</p>
</li>
<li><p><strong>Vagrant</strong> creates and manages virtual machine environments so your whole team always works on identical setups.</p>
</li>
<li><p><strong>Ansible</strong> automates repetitive configuration tasks across many servers without writing a script for each one.</p>
</li>
</ul>
<h2 id="heading-why-build-a-local-lab">Why Build a Local Lab?</h2>
<p>A local lab gives you a safe place to break things, fix them, and learn from that process without any cost or risk.</p>
<p>Here's what you get with a local setup:</p>
<ul>
<li><p><strong>Zero cost.</strong> No cloud bills, no surprise charges, and no credit card required.</p>
</li>
<li><p><strong>Works offline.</strong> Practice anywhere, even without internet after the initial setup.</p>
</li>
<li><p><strong>Full control.</strong> You manage every layer from the OS up to the application.</p>
</li>
<li><p><strong>Safe experimentation.</strong> Break things freely. Nothing here affects production.</p>
</li>
<li><p><strong>Fast feedback.</strong> No waiting for cloud resources to spin up. Everything runs on your machine.</p>
</li>
</ul>
<p>The tradeoff is resource limits. Your laptop's CPU and RAM are the ceiling. You can't simulate large-scale deployments, and some cloud-native services like AWS Lambda or S3 have no direct local equivalent. But for learning core DevOps workflows, none of that matters.</p>
<h2 id="heading-how-to-set-up-docker">How to Set Up Docker</h2>
<p>Docker is the foundation of this lab. Every other tool in this guide either runs inside Docker containers or works alongside them.</p>
<h3 id="heading-how-to-install-docker-on-windows">How to Install Docker on Windows</h3>
<p>First, enable virtualization in your BIOS:</p>
<ol>
<li><p>Restart your computer and enter BIOS/UEFI setup. The key is usually F2, F10, Del, or Esc during boot.</p>
</li>
<li><p>Find the virtualization setting. It's usually listed as Intel VT-x, AMD-V, SVM, or Virtualization Technology.</p>
</li>
<li><p>Enable it, save your changes, and exit.</p>
</li>
</ol>
<p>Then install Docker Desktop:</p>
<ol>
<li><p>Download Docker Desktop from <a href="https://www.docker.com/products/docker-desktop/">Docker's official website</a>.</p>
</li>
<li><p>Run the installer and follow the prompts.</p>
</li>
<li><p>Enable WSL 2 (Windows Subsystem for Linux) when asked.</p>
</li>
<li><p>Restart your computer.</p>
</li>
<li><p>Open Docker Desktop from the Start menu and wait for the whale icon in the taskbar to stop animating.</p>
</li>
</ol>
<p><strong>Troubleshooting:</strong> If Docker fails to start, run this in PowerShell as Administrator to verify virtualization is active:</p>
<pre><code class="language-powershell">systeminfo | findstr "Hyper-V Requirements"
</code></pre>
<p>All items should show "Yes". If they don't, revisit your BIOS settings.</p>
<h3 id="heading-how-to-install-docker-on-mac">How to Install Docker on Mac</h3>
<ol>
<li><p>Download Docker Desktop for Mac from <a href="https://www.docker.com/products/docker-desktop/">Docker's website</a>.</p>
</li>
<li><p>Open the downloaded <code>.dmg</code> file and drag Docker to your Applications folder.</p>
</li>
<li><p>Open Docker from Applications.</p>
</li>
<li><p>Enter your password when prompted.</p>
</li>
<li><p>Wait for the whale icon in the menu bar to stop animating.</p>
</li>
</ol>
<h3 id="heading-how-to-install-docker-on-linux">How to Install Docker on Linux</h3>
<p>Run these commands in order:</p>
<pre><code class="language-bash"># Update your package lists
sudo apt-get update

# Install prerequisites
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

# Add the Docker repository
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# Update and install Docker
sudo apt-get update
sudo apt-get install docker-ce

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add your user to the docker group
sudo usermod -aG docker $USER
</code></pre>
<p>Log out and back in for the group change to take effect.</p>
<h3 id="heading-how-to-test-docker">How to Test Docker</h3>
<p>Run this command:</p>
<pre><code class="language-bash">docker run hello-world
</code></pre>
<p>If you see "Hello from Docker!" then Docker is working correctly.</p>
<p>Docker is set up. Next, you'll install Kubernetes to manage containers at scale.</p>
<h2 id="heading-how-to-set-up-kubernetes">How to Set Up Kubernetes</h2>
<p>Kubernetes manages containers at scale. For a local lab, you have four options. Here's how to choose:</p>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Best for</th>
<th>RAM needed</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Minikube</strong></td>
<td>Beginners. Easiest setup, built-in dashboard</td>
<td>2GB+</td>
</tr>
<tr>
<td><strong>Kind</strong></td>
<td>Faster startup, works well inside CI pipelines</td>
<td>1GB+</td>
</tr>
<tr>
<td><strong>k3s</strong></td>
<td>Low-resource machines. Lightweight but production-like</td>
<td>512MB+</td>
</tr>
<tr>
<td><strong>kubeadm</strong></td>
<td>Learning how clusters are actually bootstrapped in production</td>
<td>2GB+ per node</td>
</tr>
</tbody></table>
<p>If you're just starting out, use Minikube. It has the simplest setup and a visual dashboard that helps you understand what's happening inside the cluster.</p>
<p>If your laptop has 8GB RAM or less, use k3s. It runs lean and behaves closer to a real cluster than Minikube does.</p>
<p>Use kubeadm only if you want to understand how Kubernetes nodes join a cluster — it requires more manual steps and isn't beginner-friendly.</p>
<h3 id="heading-how-to-install-minikube-recommended-for-beginners">How to Install Minikube (Recommended for Beginners)</h3>
<p>Minikube creates a single-node Kubernetes cluster on your laptop.</p>
<p>On Windows:</p>
<ol>
<li><p>Download the Minikube installer from <a href="https://github.com/kubernetes/minikube/releases">Minikube's GitHub releases page</a>.</p>
</li>
<li><p>Run the <code>.exe</code> installer.</p>
</li>
<li><p>Open Command Prompt as Administrator and start Minikube:</p>
</li>
</ol>
<pre><code class="language-cmd">minikube start --driver=docker
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install minikube
minikube start --driver=docker
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube-linux-amd64
sudo mv minikube-linux-amd64 /usr/local/bin/minikube
minikube start --driver=docker
</code></pre>
<p>Test your cluster:</p>
<pre><code class="language-bash">minikube status
minikube dashboard
</code></pre>
<h3 id="heading-how-to-install-k3s-recommended-for-low-ram-machines">How to Install k3s (Recommended for Low-RAM Machines)</h3>
<p>k3s is a lightweight version of Kubernetes that installs in under a minute. It runs lean and behaves like a real cluster — not a simplified demo version.</p>
<p>On Linux (and Mac via Multipass):</p>
<pre><code class="language-bash">curl -sfL https://get.k3s.io | sh -
</code></pre>
<p>That single command installs k3s and runs it automatically in the background. Check that it is running:</p>
<pre><code class="language-bash">sudo k3s kubectl get nodes
</code></pre>
<p>You should see one node with status <code>Ready</code>.</p>
<p>On Mac directly — k3s doesn't run natively on macOS. Use <a href="https://multipass.run">Multipass</a> to spin up a lightweight Ubuntu VM first, then run the install command inside it.</p>
<p>On Windows — use WSL2 (Ubuntu), then run the install command inside your WSL2 terminal.</p>
<h3 id="heading-how-to-install-kind-kubernetes-in-docker">How to Install Kind (Kubernetes IN Docker)</h3>
<p>Kind runs a full Kubernetes cluster inside Docker containers. It starts faster than Minikube and is useful if you want to run multiple clusters simultaneously.</p>
<pre><code class="language-bash"># Mac or Linux
brew install kind

# Windows
choco install kind
</code></pre>
<p>Create a cluster:</p>
<pre><code class="language-bash">kind create cluster --name my-local-lab
</code></pre>
<h3 id="heading-how-to-install-kubeadm-for-understanding-cluster-bootstrap">How to Install kubeadm (For Understanding Cluster Bootstrap)</h3>
<p>kubeadm is the tool Kubernetes uses to initialize and join nodes in a real cluster. Use this when you want to understand what happens under the hood — not as your daily driver.</p>
<p>It requires at least two machines (or VMs). The setup is more involved than the options above. Follow the <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/">official kubeadm installation guide</a> for your OS, then initialize your cluster:</p>
<pre><code class="language-bash">sudo kubeadm init --pod-network-cidr=10.244.0.0/16
</code></pre>
<p>After init, join worker nodes using the command kubeadm prints at the end of the output.</p>
<h3 id="heading-how-to-install-kubectl">How to Install kubectl</h3>
<p>kubectl is the command-line tool you use to interact with any Kubernetes cluster.</p>
<p>On Windows:</p>
<p>Download <code>kubectl.exe</code> from <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl-windows/">Kubernetes' website</a> and place it in a directory that is in your PATH. Or install with Chocolatey:</p>
<pre><code class="language-cmd">choco install kubernetes-cli
</code></pre>
<p>On Mac:</p>
<pre><code class="language-bash">brew install kubectl
</code></pre>
<p>On Linux:</p>
<pre><code class="language-bash">curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl
</code></pre>
<p>Test it:</p>
<pre><code class="language-bash">kubectl get pods --all-namespaces
</code></pre>
<p>On a fresh cluster, you'll see system pods running in the <code>kube-system</code> namespace — things like <code>coredns</code> and <code>storage-provisioner</code>. That's the expected output. It means your cluster is up and kubectl can talk to it.</p>
<p>Kubernetes is running. Next is Vagrant. But before that, there's one important distinction worth making.</p>
<h4 id="heading-docker-vs-vagrant-they-arent-the-same-thing">Docker vs Vagrant — they aren't the same thing</h4>
<p>Docker creates containers: lightweight processes that share your operating system's kernel. Vagrant creates full virtual machines: isolated computers with their own OS running inside your laptop.</p>
<p>Containers are fast and small. VMs are heavier but behave exactly like real servers. You'll use both in this lab for different reasons.</p>
<h2 id="heading-how-to-set-up-vagrant">How to Set Up Vagrant</h2>
<p>Vagrant lets you create and manage reproducible virtual machine environments. It is ideal for simulating multi-server setups on a single laptop.</p>
<h3 id="heading-how-to-install-vagrant-on-windows">How to Install Vagrant on Windows</h3>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a> with default options.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
<li><p>Restart your computer if prompted.</p>
</li>
</ol>
<p><strong>Note:</strong> VirtualBox and Hyper-V can't run at the same time on Windows. Check if Hyper-V is active:</p>
<pre><code class="language-cmd">systeminfo | findstr "Hyper-V"
</code></pre>
<p>If it's enabled, you have two options: switch to the Hyper-V Vagrant provider, or disable Hyper-V with:</p>
<pre><code class="language-powershell">Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All
</code></pre>
<p>Restart after disabling.</p>
<h3 id="heading-how-to-install-vagrant-on-mac-and-linux">How to Install Vagrant on Mac and Linux</h3>
<p>On Mac:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>After installation, open <strong>System Preferences &gt; Security &amp; Privacy &gt; General</strong>. You will see a message saying system software from Oracle was blocked. Click <strong>Allow</strong> and restart your Mac. Without this step, VirtualBox will not run.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p><strong>Note for Apple Silicon (M1/M2/M3) Macs:</strong> VirtualBox support on Apple Silicon is still limited. If you're on an M-series Mac, use <a href="https://mac.getutm.app/">UTM</a> as your VM provider instead, or use Multipass which works natively on Apple Silicon.</p>
<p>On Linux:</p>
<ol>
<li><p>Download and install <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.</p>
</li>
<li><p>Download and install <a href="https://developer.hashicorp.com/vagrant/downloads">Vagrant</a>.</p>
</li>
</ol>
<p>Verify both are installed:</p>
<pre><code class="language-bash">vboxmanage --version
vagrant --version
</code></pre>
<h3 id="heading-how-to-create-your-first-vagrant-environment">How to Create Your First Vagrant Environment</h3>
<p>Create a new directory for your project. Inside it, create a file named <code>Vagrantfile</code> with this content:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  # Create a private network between VMs
  config.vm.network "private_network", type: "dhcp"

  # Forward port 8080 on your laptop to port 80 on the VM
  config.vm.network "forwarded_port", guest: 80, host: 8080

  # Install Nginx when the VM starts
  config.vm.provision "shell", inline: &lt;&lt;-SHELL
    apt-get update
    apt-get install -y nginx
    echo "Hello from Vagrant!" &gt; /var/www/html/index.html
  SHELL
end
</code></pre>
<p>Start the VM:</p>
<pre><code class="language-bash">vagrant up
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/342f11ad-7c7d-40d2-a810-113b8c71edac.png" alt="screnshot showing VB server and terminal installation processes" style="display:block;margin:0 auto" width="1848" height="323" loading="lazy">

<p>Visit <code>http://localhost:8080</code> in your browser. You should see "Hello from Vagrant!"</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/bcd66a76-4a5b-4f26-bb7e-e203672968d8.png" alt="screenshot showing &quot;Hello from Vagrant!&quot; in browser" style="display:block;margin:0 auto" width="643" height="483" loading="lazy">

<h4 id="heading-troubleshooting-ssh-on-windows">Troubleshooting SSH on Windows</h4>
<p>If <code>vagrant ssh</code> fails, try:</p>
<pre><code class="language-bash">vagrant ssh -- -v
</code></pre>
<p>Or connect manually:</p>
<pre><code class="language-bash">ssh -i .vagrant/machines/default/virtualbox/private_key vagrant@127.0.0.1 -p 2222
</code></pre>
<h3 id="heading-how-to-create-a-local-vagrant-box-without-internet">How to Create a Local Vagrant Box Without Internet</h3>
<p><strong>Note:</strong> Most readers can skip this. Only do this if you want to work fully offline after the initial setup.</p>
<ol>
<li><p>Download <a href="https://ubuntu.com/download/server">Ubuntu 20.04 LTS</a> and save the <code>.iso</code> file locally.</p>
</li>
<li><p>Open VirtualBox and create a new VM: Name it <code>ubuntu-devops</code>, Type: Linux, Version: Ubuntu (64-bit).</p>
</li>
<li><p>Assign 2048MB RAM and a 20GB VDI disk.</p>
</li>
<li><p>Attach the <code>.iso</code> under Storage &gt; Optical Drive.</p>
</li>
<li><p>Start the VM and complete the Ubuntu installation.</p>
</li>
<li><p>Once installed, shut down the VM and run:</p>
</li>
</ol>
<pre><code class="language-bash">VBoxManage list vms
vagrant package --base "ubuntu-devops" --output ubuntu2004.box
vagrant box add ubuntu2004 ubuntu2004.box
</code></pre>
<p>You now have a reusable local box that works without internet.</p>
<p>You can spin up virtual machines. Next is Ansible, which automates what goes inside them.</p>
<h2 id="heading-how-to-install-ansible">How to Install Ansible</h2>
<p>Ansible automates configuration and software installation across multiple servers. Instead of SSH-ing into ten machines and running the same commands manually, you write a playbook once and Ansible handles the rest.</p>
<h3 id="heading-how-to-install-ansible-on-windows">How to Install Ansible on Windows</h3>
<p>Ansible doesn't run natively on Windows. You need to use it through WSL (Windows Subsystem for Linux).</p>
<ol>
<li>Open PowerShell as Administrator and enable WSL:</li>
</ol>
<pre><code class="language-powershell">dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
</code></pre>
<ol>
<li><p>Restart your computer.</p>
</li>
<li><p>Install Ubuntu from the Microsoft Store.</p>
</li>
<li><p>Open Ubuntu and install Ansible:</p>
</li>
</ol>
<pre><code class="language-bash">sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-mac">How to Install Ansible on Mac</h3>
<pre><code class="language-bash">brew install ansible
</code></pre>
<h3 id="heading-how-to-install-ansible-on-linux">How to Install Ansible on Linux</h3>
<pre><code class="language-bash"># Ubuntu/Debian
sudo apt update
sudo apt install software-properties-common
sudo apt-add-repository --yes --update ppa:ansible/ansible
sudo apt install ansible

# Red Hat/CentOS
sudo yum install ansible
</code></pre>
<h3 id="heading-how-to-test-ansible">How to Test Ansible</h3>
<p>Create a file called <code>hosts</code> in your current directory:</p>
<pre><code class="language-ini">[local]
localhost ansible_connection=local
</code></pre>
<p>Create a file called <code>playbook.yml</code> in the same directory:</p>
<pre><code class="language-yaml">---
- name: Test playbook
  hosts: local
  tasks:
    - name: Print a message
      debug:
        msg: "Ansible is working!"
</code></pre>
<p>Run the playbook, passing the local <code>hosts</code> file with <code>-i</code>:</p>
<pre><code class="language-bash">ansible-playbook -i hosts playbook.yml
</code></pre>
<p>You should see the message "Ansible is working!" in the output.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/081e6ff3-b983-42a0-960e-5340bbd24e3b.png" alt="screenshot showing ansible playbook complete terminal installation" style="display:block;margin:0 auto" width="849" height="287" loading="lazy">

<p>Alright, all your tools are installed. Now you'll use them together to build something real.</p>
<h2 id="heading-how-to-build-your-first-devops-project">How to Build Your First DevOps Project</h2>
<p>You can find the entire code for this lab in this repo: <a href="https://github.com/Osomudeya/homelab-demo-article">https://github.com/Osomudeya/homelab-demo-article</a></p>
<p>Now you'll put these tools together in one project. Each tool will perform its actual job, and nothing is forced.</p>
<p><strong>Before you start,</strong> create a fresh directory for this project. Don't run it inside the directory you used to test Vagrant earlier, as the Vagrantfile here is different and will conflict.</p>
<p>You'll be building a two-VM environment: one machine serves a web page you write yourself inside a Docker container, and the other runs a MariaDB database. Vagrant creates the machines and Ansible configures them. The page you see at the end is yours.</p>
<h3 id="heading-step-1-create-the-project-directory">Step 1: Create the Project Directory</h3>
<pre><code class="language-bash">mkdir devops-lab-project &amp;&amp; cd devops-lab-project
</code></pre>
<h3 id="heading-step-2-write-your-site-content">Step 2: Write Your Site Content</h3>
<p>Create a file called <code>index.html</code> in the project directory. Write whatever you want on this page — it's what you'll see in your browser at the end:</p>
<pre><code class="language-html">&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;&lt;title&gt;My DevOps Lab&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;My DevOps Lab&lt;/h1&gt;
    &lt;p&gt;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&lt;/p&gt;
    &lt;p&gt;Built on a laptop. No cloud account needed.&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>Change the text to whatever you like. This is your page.</p>
<h3 id="heading-step-3-write-the-vagrantfile">Step 3: Write the Vagrantfile</h3>
<p>Create a file called <code>Vagrantfile</code> in the same directory:</p>
<pre><code class="language-ruby">Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/focal64"

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.33.10"
    web.vm.network "forwarded_port", guest: 80, host: 8080
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.33.11"
  end
end
</code></pre>
<h3 id="heading-step-4-start-the-virtual-machines">Step 4: Start the Virtual Machines</h3>
<pre><code class="language-bash">vagrant up
</code></pre>
<p>The first run downloads the <code>ubuntu/focal64</code> box, which is around 500MB.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/264866b0-9977-490e-96a3-69b3070be589.png" alt="screenshot showing virtualbox installation processes in terminal" style="display:block;margin:0 auto" width="867" height="377" loading="lazy">

<p>Expect this to take 10–30 minutes depending on your connection. Subsequent runs will be much faster since the box is cached locally.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/118d2fb2-70f6-41e8-afb2-6f45fb895e98.png" alt="screenshot showing 2 virtualbox servers &quot;running&quot; in VB manager" style="display:block;margin:0 auto" width="926" height="396" loading="lazy">

<h3 id="heading-step-5-create-the-ansible-inventory">Step 5: Create the Ansible Inventory</h3>
<p>Create a file called <code>inventory</code> in the same directory:</p>
<pre><code class="language-ini">[webservers]
192.168.33.10 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key

[dbservers]
192.168.33.11 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/db/virtualbox/private_key
</code></pre>
<p>Ansible uses the Vagrant-generated private keys so it can SSH in as the <code>vagrant</code> user. Host key checking for this lab is turned off in <code>ansible.cfg</code> (next step), not in the inventory.</p>
<h3 id="heading-step-6-create-the-ansible-config-file">Step 6: Create the Ansible Config File</h3>
<p>Before running the playbook, create a file called <code>ansible.cfg</code> in the same directory:</p>
<pre><code class="language-ini">[defaults]
inventory = inventory
host_key_checking = False
</code></pre>
<p>The inventory line tells Ansible to use the inventory file in this folder by default. host_key_checking = False tells Ansible not to verify SSH host keys when connecting to your Vagrant VMs. Without it, Ansible will fail with a Host key verification failed error on first connection because the VM's key is not yet in your known_hosts file.</p>
<p>These settings are for a local lab only. Do not use host_key_checking = False for production systems.</p>
<h3 id="heading-step-7-create-the-ansible-playbook">Step 7: Create the Ansible Playbook</h3>
<p>Create a file called <code>playbook.yml</code>:</p>
<pre><code class="language-yaml">---
- name: Configure web server
  hosts: webservers
  become: yes
  tasks:

    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: yes

    - name: Start Docker service
      service:
        name: docker
        state: started
        enabled: yes

    # Create the directory that will hold your site content
    - name: Create web content directory
      file:
        path: /var/www/html
        state: directory
        mode: '0755'

    # This copies your index.html from your laptop into the VM
    - name: Copy site content to web server
      copy:
        src: index.html
        dest: /var/www/html/index.html

    # This mounts that file into the Nginx container so it serves your page
    # The -v flag connects /var/www/html on the VM to /usr/share/nginx/html inside the container
    - name: Run Nginx serving your content
      shell: |
        docker rm -f webapp 2&gt;/dev/null || true
        docker run -d --name webapp --restart always -p 80:80 \
          -v /var/www/html:/usr/share/nginx/html:ro nginx

- name: Configure database server
  hosts: dbservers
  become: yes
  tasks:

    # Hash sum mismatch on .deb downloads is often stale lists, a flaky mirror, or apt pipelining
    # behind NAT; fresh indices + Pipeline-Depth 0 usually fixes it on lab VMs.
    - name: Disable apt HTTP pipelining (mirror/proxy hash mismatch workaround)
      copy:
        dest: /etc/apt/apt.conf.d/99disable-pipelining
        content: 'Acquire::http::Pipeline-Depth "0";'
        mode: "0644"

    - name: Clear apt package index cache
      shell: apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/* /var/lib/apt/lists/auxfiles/*
      changed_when: true

    - name: Update apt cache after reset
      apt:
        update_cache: yes

    - name: Install MariaDB
      apt:
        name: mariadb-server
        state: present
        update_cache: no

    - name: Start MariaDB service
      service:
        name: mariadb
        state: started
        enabled: yes
</code></pre>
<p>Two lines worth paying attention to:</p>
<ul>
<li><p><code>src: index.html</code> — Ansible looks for this file in the same directory as the playbook. That is the file you wrote in Step 2.</p>
</li>
<li><p><code>-v /var/www/html:/usr/share/nginx/html:ro</code> — this mounts the directory from the VM into the Nginx container. The <code>:ro</code> means read-only. Nginx serves whatever is in that folder.</p>
</li>
</ul>
<h3 id="heading-step-8-run-the-playbook">Step 8: Run the Playbook</h3>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>You'll see task-by-task output as Ansible connects to each VM over SSH and configures it. A green <code>ok</code> or yellow <code>changed</code> next to each task means it worked. Red <code>fatal</code> means something failed.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/91241b41-981c-4e23-9dc4-8531e551c39e.png" alt="terminal screenshot of A green ok or yellow changed next to each task means it worked. Red fatal means something failed." style="display:block;margin:0 auto" width="875" height="267" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c02db252-8aff-42e5-b937-d812d070a75b.png" alt="terminal screenshot of playbook run completion" style="display:block;margin:0 auto" width="867" height="425" loading="lazy">

<h3 id="heading-step-9-verify-the-setup">Step 9: Verify the Setup</h3>
<p>Open <code>http://localhost:8080</code> in your browser. You should see the page you wrote in Step 2 served from inside a Docker container, running on a Vagrant VM, configured automatically by Ansible.</p>
<p>If you see the page, every tool in this lab is working together.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0d3d897b-3f51-46fb-b548-832cc5ec3272.png" alt="Browser showing localhost:8082 with the heading &quot;My DevOps Lab&quot; and the text &quot;Provisioned by Vagrant. Configured by Ansible. Served by Docker.&quot;" style="display:block;margin:0 auto" width="746" height="418" loading="lazy">

<h3 id="heading-step-9-clean-up-optional">Step 9: Clean Up (Optional)</h3>
<p>When you're done:</p>
<pre><code class="language-bash">vagrant destroy -f
</code></pre>
<p>This shuts down and deletes both VMs. Your <code>Vagrantfile</code>, <code>inventory</code>, <code>playbook.yml</code>, and <code>index.html</code> stay on disk — run <code>vagrant up</code> followed by <code>ansible-playbook -i inventory playbook.yml</code> any time to bring it all back.</p>
<p>Now that you have a working lab, let's use it properly.</p>
<h2 id="heading-how-to-break-your-lab-on-purpose">How to Break Your Lab on Purpose</h2>
<p>Following these steps has gotten you a running lab. Breaking things teaches you how everything actually works.</p>
<p>Here are five things to break and what to look for when you do.</p>
<h3 id="heading-break-1-crash-the-main-process-inside-the-container-and-watch-it-come-back">Break 1: Crash the Main Process Inside the Container (and Watch It Come Back)</h3>
<p>Doing this just proves that something inside the container can die (like a real bug or OOM), Docker can restart the container because of <code>--restart always</code>, and your site can come back without re-running Ansible.</p>
<p>After <code>vagrant ssh web</code>, every <code>docker</code> command below runs <strong>on the web VM</strong>. So keep your browser on your laptop at <a href="http://localhost:8080"><code>http://localhost:8080</code></a> (Vagrant forwards your host port to the VM’s port 80).</p>
<h4 id="heading-troubleshooting-if-your-lab-isnt-ready">Troubleshooting: If Your Lab Isn't Ready</h4>
<p>From your project folder on the host (your laptop) – unless the step says to run it on the VM:</p>
<ul>
<li><p>You ran <code>vagrant destroy -f</code>. Run <code>vagrant up</code>, then <code>ansible-playbook -i inventory playbook.yml</code>.</p>
</li>
<li><p><code>docker ps</code> shows <code>webapp</code> but status is Exited. On the web VM, run <code>sudo docker start webapp</code>, then <code>sudo docker ps</code> again.</p>
</li>
<li><p>There's no <code>webapp</code> row in <code>docker ps -a</code><strong>.</strong> Re-run <code>ansible-playbook -i inventory playbook.yml</code> on the host.</p>
</li>
</ul>
<p>If the playbook is already applied and <code>webapp</code> is Up, skip this section and start at step 1 under Steps (happy path) below. (Don't skip SSH or <code>docker ps</code>. You need the VM shell and a quick check before you run <code>docker exec</code>.)</p>
<h4 id="heading-steps-happy-path">Steps (happy path)</h4>
<ol>
<li>SSH into the web VM:</li>
</ol>
<pre><code class="language-plaintext">vagrant ssh web
</code></pre>
<ol>
<li><p>Confirm <code>webapp</code> is <strong>Up</strong>:</p>
<pre><code class="language-plaintext">sudo docker ps
</code></pre>
</li>
<li><p><strong>Break it on purpose:</strong> kill the container’s main process <strong>from inside</strong> (PID 1). That ends the container the same way a crashing app would, not the same as <code>docker stop</code> on the host:</p>
</li>
</ol>
<pre><code class="language-bash">sudo docker exec webapp sh -c 'sleep 5 &amp;&amp; kill 1'
</code></pre>
<p>The <code>sleep</code> 5 gives you a moment to switch to the browser. Right after you run the command, open or refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a>. You may catch a brief error or blank page while nothing is listening on port 80.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/3ac89703-63f3-45d8-954f-35adbd2c7dec.png" alt="Browser showing ERR_CONNECTION_RESET on localhost:8082 after the Nginx container process was killed" style="display:block;margin:0 auto" width="1242" height="1057" loading="lazy">

<ol>
<li>Watch Docker restart the container:</li>
</ol>
<pre><code class="language-bash">watch sudo docker ps -a
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5c61d90d-61d6-4023-b3f5-e3eb427e8492.png" alt="Terminal running watch docker ps showing webapp container status as Up 10 seconds after automatic restart" style="display:block;margin:0 auto" width="1011" height="393" loading="lazy">

<p>Within a few seconds you should see <strong>Exited (137)</strong> become <strong>Up</strong> again. (Press Ctrl+C to exit <code>watch</code>.)</p>
<p>5. Refresh the browser. You should see the same HTML as before, because the files live on the VM under <code>/var/www/html</code> and are bind-mounted into the container; restarting only replaced the Nginx process, not those files.</p>
<h4 id="heading-why-not-docker-stop-or-docker-kill-on-the-host-for-this-demo"><strong>Why not</strong> <code>docker stop</code> <strong>or</strong> <code>docker kill</code> <strong>on the host for this demo?</strong></h4>
<p>Those commands go through Docker’s API. On many setups (including recent Docker), Docker treats them as you choosing to stop the container (<code>hasBeenManuallyStopped</code>), and <code>--restart always</code> may not bring the container back until you <code>docker start</code> it or similar.</p>
<p>Killing PID 1 from inside the container is treated more like an internal crash, so the restart policy you set in the playbook is the one you actually get to observe here.</p>
<p><strong>Kubernetes analogy:</strong> A pod whose containers exit can be restarted by the kubelet; a pod you delete does not come back by itself.</p>
<p><strong>What to observe (three separate checks):</strong></p>
<ol>
<li><p><strong>Exit code:</strong> After <code>kill 1</code>, <code>docker ps -a</code> should show the container exited with code 137, meaning the main process was killed by a signal. That confirms the container really died, not that you ran <code>docker stop</code> on the host.</p>
</li>
<li><p><strong>Restart delay vs browser:</strong> Watch how many seconds pass between Exited and Up in <code>docker ps -a</code>; that interval is Docker applying <code>--restart always</code>. That's separate from what you see in the browser: the browser only shows whether something is accepting connections on port 80 on the VM, so it may show an error or blank page during the gap even while Docker is about to restart the container.</p>
</li>
<li><p><strong>Content after recovery:</strong> After status is Up again, refresh the page. You should see the same HTML as before. That shows your content lives on the VM disk (mounted into the container with <code>-v</code>), not inside a file that vanishes when the container process restarts. The process was replaced, not your <code>index.html</code> on the host path.</p>
</li>
</ol>
<h3 id="heading-break-2-cause-a-container-name-conflict">Break 2: Cause a Container Name Conflict</h3>
<p>On a single Docker daemon (here, on your web VM), a container name is a <strong>unique label</strong>. Two running (or stopped) containers can't share the same name. Scripts and playbooks that always use <code>docker run --name webapp</code> without cleaning up first hit this error constantly and recognizing it saves time in real work.</p>
<p><strong>Before you start:</strong> Ansible already created one container named <code>webapp</code>.<br>Stay on the web VM (for example still inside <code>vagrant ssh web</code>) so the commands below run where that container lives.</p>
<p>So now, try to start a second container and also call it <code>webapp</code>. The image is plain <code>nginx</code> here on purpose – the point is the <strong>name clash</strong>, not matching your site’s ports or volume mounts.</p>
<pre><code class="language-plaintext">sudo docker run -d --name webapp nginx
</code></pre>
<p>What actually happens here is that Docker <strong>doesn't</strong> create a second container. It returns an error immediately. Your original <code>webapp</code> is unchanged.</p>
<p>This is because the name <code>webapp</code> is already registered to the existing container (the error shows that container’s ID). Docker refuses to reuse the name until the old container is removed or renamed.</p>
<p>Example error (your ID will differ):</p>
<pre><code class="language-plaintext">docker: Error response from daemon: Conflict. The container name "/webapp" is already in use by container "2e48b81a311c4b71cdc1e25e0df75a22296845c7eb53aab82f9ae739fb6410ec". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/1fd42c16-c28e-4539-9290-3583206eb8ff.png" alt="container name conflict terminal error screenshot" style="display:block;margin:0 auto" width="914" height="252" loading="lazy">

<p>To fix it, free the name, then create <code>webapp</code> again the same way the playbook does (publish port 80, mount your HTML, restart policy):</p>
<pre><code class="language-plaintext">sudo docker rm -f webapp
sudo docker run -d --name webapp --restart always -p 80:80 \
  -v /var/www/html:/usr/share/nginx/html:ro nginx
</code></pre>
<p>After that, your site should behave as before (refresh <a href="http://localhost:8080"><code>http://localhost:8080</code></a> from your laptop).</p>
<h4 id="heading-what-to-observe">What to observe:</h4>
<p>Read Docker’s Conflict message end to end. You should see that the name <code>/webapp</code> is already in use and a container ID pointing at the existing box. In production, that pattern means “something already claimed this name. Just remove it, rename it, or pick a different name before you run <code>docker run</code> again.”</p>
<h3 id="heading-break-3-make-ansible-fail-to-reach-a-vm">Break 3: Make Ansible Fail to Reach a VM</h3>
<p>Ansible separates “could not connect” from “connected, but a task broke.” The first is <strong>UNREACHABLE</strong>, the second is <strong>FAILED</strong>. Knowing which one you have tells you whether to fix network / SSH or playbook / packages / permissions.</p>
<p>On your laptop, in the project folder, edit <code>inventory</code> and change the web server address from <code>192.168.33.10</code> to an IP <strong>no VM uses</strong>, for example <code>192.168.33.99</code>. Save the file.</p>
<pre><code class="language-ini">[webservers]
192.168.33.99 ansible_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/web/virtualbox/private_key
</code></pre>
<p>What you run (from the same project folder on the host):</p>
<pre><code class="language-bash">ansible-playbook -i inventory playbook.yml
</code></pre>
<p>After this, Ansible tries to SSH to <code>192.168.33.99</code>. Nothing on your lab network answers as that host (or SSH never succeeds), so Ansible <strong>never runs tasks</strong> on the web server. It stops that host with UNREACHABLE:</p>
<pre><code class="language-plaintext">fatal: [192.168.33.99]: UNREACHABLE! =&gt; {"msg": "Failed to connect to the host via ssh"}
</code></pre>
<p>This is realistic because the same message shape appears when the IP is wrong, the VM isn't running, a firewall blocks port 22, or the network is misconfigured. The common thread is <strong>no working SSH session</strong>.</p>
<p>Now it's time to put it back: restore <code>192.168.33.10</code> in <code>inventory</code> and run <code>ansible-playbook -i inventory playbook.yml</code> again. The web play should reach the VM and complete (assuming your lab is up).</p>
<p><strong>UNREACHABLE vs FAILED – what to observe:</strong></p>
<ul>
<li><p>If Ansible prints UNREACHABLE, you should assume it never opened SSH on that host and never ran tasks there. Go ahead and fix the connection (IP, VM up, firewall, key path) before you debug playbook logic.</p>
</li>
<li><p>If Ansible prints FAILED, you should assume SSH worked and a task returned an error. Read the task output for the real cause (package name, permissions, syntax), not the network first.</p>
</li>
</ul>
<p>When you debug later, you should look at the keyword Ansible prints: <strong>UNREACHABLE</strong> points to reachability while <strong>FAILED</strong> points to task output and the first failed task under that host.</p>
<h3 id="heading-break-4-fill-the-vms-disk">Break 4: Fill the VM's Disk</h3>
<p>Databases and other services need free disk for logs, temp files, and data. When the filesystem is full or nearly full, a service may fail to start or fail at runtime. This break walks through the same diagnosis habit you would use on a real server: check space, then read systemd and journal output for the service.</p>
<p>All commands below run <strong>on the db VM</strong> after <code>vagrant ssh db</code>. MariaDB was installed there by your playbook.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Open a shell on the db VM:</p>
<pre><code class="language-plaintext">vagrant ssh db
</code></pre>
</li>
<li><p>Allocate a large file full of zeros (here 1GB) to simulate something eating disk space:</p>
<pre><code class="language-plaintext">sudo dd if=/dev/zero of=/tmp/bigfile bs=1M count=1024

df -h
</code></pre>
<p>Use <code>df -h</code> to see how full the root filesystem (or relevant mount) is. Your Vagrant disk may be large enough that 1GB only raises usage. If MariaDB still starts, you still practiced the checks. To see a stronger effect, you can repeat with a larger <code>count=</code> <strong>only in a lab</strong> (never fill production disks on purpose without a plan).</p>
</li>
<li><p>Ask systemd to restart MariaDB and show status:</p>
<pre><code class="language-plaintext">sudo systemctl restart mariadb
sudo systemctl status mariadb
</code></pre>
<p>If the disk is critically full, restart may fail or the service may show failed or not running.</p>
</li>
<li><p>If something looks wrong, read recent logs for the MariaDB unit:</p>
<pre><code class="language-plaintext">sudo journalctl -u mariadb --no-pager | tail -20
</code></pre>
<p>Errors often mention disk, space, read-only filesystem, or InnoDB being unable to write.</p>
</li>
<li><p>Clean up so your VM stays usable:</p>
<pre><code class="language-plaintext">sudo rm /tmp/bigfile
</code></pre>
<p>Optionally run <code>sudo systemctl restart mariadb</code> again and confirm it is active (running).</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should use <code>df -h</code> first to confirm whether the filesystem is actually tight. That avoids blaming the database when disk space is fine.</p>
</li>
<li><p>You should read <code>systemctl status mariadb</code> to see whether systemd thinks the service is active, failed, or flapping.</p>
</li>
<li><p>You should read <code>journalctl -u mariadb</code> when status is bad, so you can tie the failure to concrete errors from MariaDB or the kernel (often mentioning disk, space, or read-only filesystem). <strong>Space + status + logs</strong> is the same order you would use on a production server.</p>
</li>
</ul>
<h3 id="heading-break-5-run-minikube-out-of-resources">Break 5: Run Minikube Out of Resources</h3>
<p>Kubernetes schedules pods onto nodes that have enough CPU and memory. If you ask for more than the cluster can place, some pods stay <strong>Pending</strong> and <strong>Events</strong> explain why (for example <em>Insufficient cpu</em>). That is not the same as a pod that starts and then crashes.</p>
<p>To do this, you'll need a local cluster (we're using <a href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Fmacos%2Fx86-64%2Fstable%2Fbinary+download"><strong>Minikube</strong></a> in this guide) and <code>kubectl</code> on your laptop. This break doesn't use the Vagrant VMs. If you haven't installed Minikube yet, complete the "How to Set Up Kubernetes" section first, or skip this break until you do.</p>
<p>You'll run this on your <strong>Mac, Linux, or Windows terminal</strong> (host), not inside <code>vagrant ssh</code>. If you're still inside a VM, type <code>exit</code> until your prompt is back on the host.</p>
<h4 id="heading-what-you-do">What you do:</h4>
<ol>
<li><p>Check Minikube:</p>
<pre><code class="language-plaintext">minikube status
</code></pre>
<p>If it's stopped, start it (Docker driver matches earlier sections):</p>
<pre><code class="language-plaintext">minikube start --driver=docker
</code></pre>
</li>
<li><p>Create a deployment with many replicas so your single Minikube node can't run them all at once:</p>
<pre><code class="language-plaintext">kubectl create deployment stress --image=nginx --replicas=20

#watch pods start
kubectl get pods -w
</code></pre>
<p>Press Ctrl+C when you're done watching. Some pods may stay <strong>Pending</strong> while others are <strong>Running</strong>.</p>
</li>
<li><p>Pick one Pending pod name from <code>kubectl get pods</code> and inspect it:</p>
<pre><code class="language-plaintext">kubectl describe pod &lt;pod-name&gt;
</code></pre>
<p>Under Events, look for FailedScheduling and a line similar to:</p>
<pre><code class="language-plaintext">Warning  FailedScheduling  0/1 nodes are available: 1 Insufficient cpu.
</code></pre>
<p>You might see <strong>Insufficient memory</strong> instead, depending on your machine.</p>
</li>
<li><p>Fix the lab by scaling back so the cluster can catch up:</p>
<pre><code class="language-plaintext">kubectl scale deployment stress --replicas=2
</code></pre>
<p>You can delete the deployment entirely when finished: <code>kubectl delete deployment stress</code>.</p>
</li>
</ol>
<p><strong>What to observe:</strong></p>
<ul>
<li><p>You should see Pending pods stay unscheduled until capacity frees up. That means the scheduler hasn't placed them on any <strong>node</strong> yet, usually because the node is out of CPU or memory for that workload.</p>
</li>
<li><p>You should read <code>kubectl describe pod &lt;pod-name&gt;</code> and scroll to <strong>Events</strong>. Messages like Insufficient cpu or Insufficient memory mean the cluster ran out of schedulable capacity, not that the container image image is corrupt.</p>
</li>
<li><p>You should contrast that with a pod that reaches Running and then CrashLoopBackOff, which usually means the process inside the container keeps exiting. that is an application or config problem, not a “nowhere to run” problem.</p>
</li>
</ul>
<h2 id="heading-what-you-can-now-do">What You Can Now Do</h2>
<p>You didn't just install tools in this tutorial. You also used them.</p>
<p>You can now spin up two servers from a single file. You can write a playbook that installs software and deploys a container without touching either machine manually.</p>
<p>You can serve a page you wrote from inside a Docker container running on a Vagrant VM, and bring the whole thing back from scratch in one command.</p>
<p>You also broke it. You saw what a container conflict looks like, what Ansible prints when it can't reach a machine, what disk pressure does to a running service, and what a Kubernetes scheduler says when it runs out of resources. Those error messages aren't unfamiliar anymore.</p>
<p>That's the difference between someone who has read about DevOps and someone who has run it.</p>
<p><strong>Here are four free projects you can run in this same lab to go further:</strong></p>
<ul>
<li><p><strong>DevOps Home-Lab 2026</strong> — Build a multi-service app (frontend, API, PostgreSQL, Redis) end-to-end with Docker Compose, Kubernetes, Prometheus/Grafana monitoring, GitOps with ArgoCD, and Cloudflare for global exposure.</p>
</li>
<li><p><strong>KubeLab</strong> — Trigger real Kubernetes failure scenarios, pod crashes, OOMKills, node drains, cascading failures, and watch how the cluster responds using live metrics.</p>
</li>
<li><p><strong>K8s Secrets Lab</strong> — Build a full secret management pipeline from AWS Secrets Manager into your cluster, including rotation behavior and IRSA.</p>
</li>
<li><p><strong>DevOps Troubleshooting Toolkit</strong> — Structured debugging guides across Linux, containers, Kubernetes, cloud, databases, and observability with copy-paste commands for real incidents.</p>
</li>
</ul>
<p>All free and open source: <a href="https://github.com/Osomudeya/List-Of-DevOps-Projects">github.com/Osomudeya/List-Of-DevOps-Projects</a>.</p>
<p>If you want to go deeper, you can find six full chapters covering Terraform, Ansible, monitoring, CI/CD, and a simulated three-VM production environment at <a href="https://osomudeya.gumroad.com/l/BuildYourOwnDevOpsLab">Build Your Own DevOps Lab</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build and Deploy Multi-Architecture Docker Apps on Google Cloud Using ARM Nodes (Without QEMU)
 ]]>
                </title>
                <description>
                    <![CDATA[ If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening ins ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-and-deploy-multi-architecture-docker-apps-on-google-cloud-using-arm-nodes/</link>
                <guid isPermaLink="false">69dcf2c3f57346bc1e05a01d</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ARM ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Amina Lawal ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:42:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e89ae65a-4b3a-44b7-94d8-d0638f017bf6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening inside cloud data centers.</p>
<p>Google Cloud Axion is Google's own custom ARM-based chip, built to handle the demands of modern cloud workloads. The performance and cost numbers are striking: Google claims Axion delivers up to 60% better energy efficiency and up to 65% better price-performance compared to comparable x86 machines.</p>
<p>AWS has Graviton. Azure has Cobalt. ARM is no longer niche. It's the direction the entire cloud industry is moving.</p>
<p>But there's a problem that catches almost every team off guard when they start this transition: <strong>container architecture mismatch</strong>.</p>
<p>If you build a Docker image on your M-series Mac and push it to an x86 server, it crashes on startup with a cryptic <code>exec format error</code>.</p>
<p>The server isn't broken. It just can't read the compiled instructions inside your image. An ARM binary and an x86 binary are written in fundamentally different languages at the machine level. The CPU literally can't execute instructions it wasn't designed for.</p>
<p>We're going to solve this problem completely in this tutorial. You'll build a single Docker image tag that automatically serves the correct binary on both ARM and x86 machines — no separate pipelines, no separate tags. Then you'll provision Google Cloud ARM nodes in GKE and configure your Kubernetes deployment to route workloads precisely to those cost-efficient nodes.</p>
<p><strong>Here's what you'll build, step by step:</strong></p>
<ul>
<li><p>A Go HTTP server that reports the CPU architecture it's running on at runtime</p>
</li>
<li><p>A multi-stage Dockerfile that cross-compiles for both <code>linux/amd64</code> and <code>linux/arm64</code> without slow QEMU emulation</p>
</li>
<li><p>A multi-arch image in Google Artifact Registry that acts as a single entry point for any architecture</p>
</li>
<li><p>A GKE cluster with two node pools: a standard x86 pool and an ARM Axion pool</p>
</li>
<li><p>A Kubernetes Deployment that pins your workload exclusively to the ARM nodes</p>
</li>
</ul>
<p>By the end, you'll hit a live endpoint and see the word <code>arm64</code> staring back at you from a Google Cloud ARM node. Let's get into it.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</a></p>
</li>
<li><p><a href="#heading-step-3-write-the-application">Step 3: Write the Application</a></p>
</li>
<li><p><a href="#heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</a></p>
</li>
<li><p><a href="#heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</a></p>
</li>
<li><p><a href="#heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</a></p>
</li>
<li><p><a href="#heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</a></p>
</li>
<li><p><a href="#heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-project-file-structure">Project File Structure</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following ready:</p>
<ul>
<li><p><strong>A Google Cloud project</strong> with billing enabled. If you don't have one, create it at <a href="https://console.cloud.google.com">console.cloud.google.com</a>. The total cost to follow this tutorial is around $5–10.</p>
</li>
<li><p><code>gcloud</code> <strong>CLI</strong> installed and authenticated. Run <code>gcloud auth login</code> to sign in and <code>gcloud config set project YOUR_PROJECT_ID</code> to point it at your project.</p>
</li>
<li><p><strong>Docker Desktop</strong> version 19.03 or later. Docker Buildx (the tool we'll use for multi-arch builds) ships bundled with it.</p>
</li>
<li><p><code>kubectl</code> installed. This is the CLI for interacting with Kubernetes clusters.</p>
</li>
<li><p>Basic familiarity with <strong>Docker</strong> (images, layers, Dockerfile) and <strong>Kubernetes</strong> (pods, deployments, services). You don't need to be an expert, but you should know what these things are.</p>
</li>
</ul>
<h2 id="heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</h2>
<p>Before writing a single line of application code, let's get the cloud infrastructure side ready. This is the foundation everything else will build on.</p>
<h3 id="heading-enable-the-required-apis">Enable the Required APIs</h3>
<p>Google Cloud services are off by default in any new project. Run this command to turn on the three APIs we'll need:</p>
<pre><code class="language-bash">gcloud services enable \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  containeranalysis.googleapis.com
</code></pre>
<p>Here's what each one does:</p>
<ul>
<li><p><code>artifactregistry.googleapis.com</code> — enables <strong>Artifact Registry</strong>, where we'll store our Docker images</p>
</li>
<li><p><code>container.googleapis.com</code> — enables <strong>Google Kubernetes Engine (GKE)</strong>, where our cluster will run</p>
</li>
<li><p><code>containeranalysis.googleapis.com</code> — enables vulnerability scanning for images stored in Artifact Registry</p>
</li>
</ul>
<h3 id="heading-create-a-docker-repository-in-artifact-registry">Create a Docker Repository in Artifact Registry</h3>
<p>Artifact Registry is Google Cloud's managed container image store — the place where our built images will live before being deployed to the cluster. Create a dedicated repository for this tutorial:</p>
<pre><code class="language-bash">gcloud artifacts repositories create multi-arch-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Multi-arch tutorial images"
</code></pre>
<p>Breaking down the flags:</p>
<ul>
<li><p><code>--repository-format=docker</code> — tells Artifact Registry this repository stores Docker images (as opposed to npm packages, Maven artifacts, and so on)</p>
</li>
<li><p><code>--location=us-central1</code> — the Google Cloud region where your images will be stored. Use a region that's close to where your cluster will run to minimize image pull latency. Run <code>gcloud artifacts locations list</code> to see all options.</p>
</li>
<li><p><code>--description</code> — a human-readable label for the repository, shown in the console.</p>
</li>
</ul>
<h3 id="heading-authenticate-docker-to-push-to-artifact-registry">Authenticate Docker to Push to Artifact Registry</h3>
<p>Docker needs credentials before it can push images to Google Cloud. Run this command to wire up authentication automatically:</p>
<pre><code class="language-bash">gcloud auth configure-docker us-central1-docker.pkg.dev
</code></pre>
<p>This adds a credential helper entry to your <code>~/.docker/config.json</code> file. What that means in practice: any time Docker tries to push or pull from a URL under <code>us-central1-docker.pkg.dev</code>, it will automatically call <code>gcloud</code> to get a valid auth token. You won't need to run <code>docker login</code> manually.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/31fd020f-ffa2-40bd-9057-57b16a61b325.png" alt="Terminal output of the gcloud artifacts repositories list command, showing a row for multi-arch-repo with format DOCKER, location us-central1" style="display:block;margin:0 auto" width="2870" height="1512" loading="lazy">

<h2 id="heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</h2>
<p>With Artifact Registry ready to receive images, let's create the Kubernetes cluster. We'll start with a standard cluster using x86 nodes and add an ARM node pool later once we have an image to deploy.</p>
<pre><code class="language-bash">gcloud container clusters create axion-tutorial-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-standard-2 \
  --workload-pool=PROJECT_ID.svc.id.goog
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>What each flag does:</p>
<ul>
<li><p><code>--zone=us-central1-a</code> — creates a zonal cluster in a single availability zone. A regional cluster (using <code>--region</code>) would spread nodes across three zones for higher resilience, but for this tutorial a single zone keeps things simple and avoids capacity issues that can affect specific zones. If <code>us-central1-a</code> is unavailable, try <code>us-central1-b</code>.</p>
</li>
<li><p><code>--num-nodes=2</code> — two x86 nodes in this zone. We need at least 2 to have enough capacity alongside our ARM node pool later.</p>
</li>
<li><p><code>--machine-type=e2-standard-2</code> — the machine type for this default node pool. <code>e2-standard-2</code> is a cost-effective x86 machine with 2 vCPUs and 8 GB of memory, good for general workloads.</p>
</li>
<li><p><code>--workload-pool=PROJECT_ID.svc.id.goog</code> — enables <strong>Workload Identity</strong>, which is Google's recommended way for pods to authenticate with Google Cloud APIs. It avoids the need to download and store service account key files inside your cluster.</p>
</li>
</ul>
<p>This command takes a few minutes. While it runs, you can move on to writing the application. We'll come back to the cluster in Step 6.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/332250a8-3f99-4eb1-849f-51ab054c9567.png" alt="GCP Console Kubernetes Engine Clusters page showing axion-tutorial-cluster with a green checkmark status, the zone us-central1-a, and Kubernetes version in the table." style="display:block;margin:0 auto" width="1457" height="720" loading="lazy">

<h2 id="heading-step-3-write-the-application">Step 3: Write the Application</h2>
<p>We need an application to containerize. We'll use <strong>Go</strong> for three specific reasons:</p>
<ol>
<li><p>Go compiles into a single, statically-linked binary. There's no runtime to install, no interpreter — just the binary. This makes for extremely lean container images.</p>
</li>
<li><p>Go has first-class, built-in cross-compilation support. We can compile an ARM64 binary from an x86 Mac, or vice versa, by setting two environment variables. This will matter a lot when we get to the Dockerfile.</p>
</li>
<li><p>Go exposes the architecture the binary was compiled for via <code>runtime.GOARCH</code>. Our server will report this at runtime, giving us hard proof that the correct binary is running on the correct hardware.</p>
</li>
</ol>
<p>Start by creating the project directories:</p>
<pre><code class="language-bash">mkdir -p hello-axion/app hello-axion/k8s
cd hello-axion/app
</code></pre>
<p>Initialize the Go module from inside <code>app/</code>. This creates <code>go.mod</code> in the current directory:</p>
<pre><code class="language-bash">go mod init hello-axion
</code></pre>
<p><code>go mod init</code> is Go's built-in command for starting a new module. It writes a <code>go.mod</code> file that declares the module name (<code>hello-axion</code>) and the minimum Go version required. Every modern Go project needs this file — without it, the compiler doesn't know how to resolve packages.</p>
<p>Now create the application at <code>app/main.go</code>:</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
    "os"
    "runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
    hostname, _ := os.Hostname()
    fmt.Fprintf(w, "Hello from freeCodeCamp!\n")
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
    fmt.Fprintf(w, "OS           : %s\n", runtime.GOOS)
    fmt.Fprintf(w, "Pod hostname : %s\n", hostname)
}

func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/healthz", healthz)
    fmt.Println("Server starting on port 8080...")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        fmt.Fprintf(os.Stderr, "server error: %v\n", err)
        os.Exit(1)
    }
}
</code></pre>
<p>Verify both files were created:</p>
<pre><code class="language-bash">ls -la
</code></pre>
<p>You should see <code>go.mod</code> and <code>main.go</code> listed.</p>
<p>Let's walk through what this code does:</p>
<ul>
<li><p><code>import "runtime"</code> — imports Go's built-in <code>runtime</code> package, which exposes information about the Go runtime environment, including the CPU architecture.</p>
</li>
<li><p><code>runtime.GOARCH</code> — returns a string like <code>"arm64"</code> or <code>"amd64"</code> representing the architecture this binary was compiled for. When we deploy to an ARM node, this value will be <code>arm64</code>. This is the core of our proof.</p>
</li>
<li><p><code>os.Hostname()</code> — returns the pod's hostname, which Kubernetes sets to the pod name. This lets us see which specific pod responded when we test the app later.</p>
</li>
<li><p><code>handler</code> — the main HTTP handler, registered on the root path <code>/</code>. It writes the architecture, OS, and hostname to the response.</p>
</li>
<li><p><code>healthz</code> — a separate handler registered on <code>/healthz</code>. It returns HTTP 200 with the text <code>ok</code>. Kubernetes will use this endpoint to check whether the container is alive and ready to serve traffic — we'll wire this up in the deployment manifest later.</p>
</li>
<li><p><code>http.ListenAndServe(":8080", nil)</code> — starts the server on port 8080. If it fails to start (for example, if the port is already in use), it prints the error and exits with a non-zero code so Kubernetes knows something went wrong.</p>
</li>
</ul>
<h2 id="heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</h2>
<p>Before we write the Dockerfile, we need to understand a fundamental constraint, because it directly shapes how the Dockerfile must be written.</p>
<h3 id="heading-why-your-docker-images-are-architecture-specific-by-default">Why Your Docker Images Are Architecture-Specific By Default</h3>
<p>A CPU only understands instructions written for its specific <strong>Instruction Set Architecture (ISA)</strong>. ARM64 and x86_64 are different ISAs — different vocabularies of machine-level operations. When you compile a Go program, the compiler translates your source code into binary instructions for exactly one ISA. That binary can't run on a different ISA.</p>
<p>When you build a Docker image the normal way (<code>docker build</code>), the binary inside that image is compiled for your local machine's ISA. If you're on an Apple Silicon Mac, you get an ARM64 binary. Push that image to an x86 server, and when Docker tries to execute the binary, the kernel rejects it:</p>
<pre><code class="language-shell">standard_init_linux.go:228: exec user process caused: exec format error
</code></pre>
<p>That's the operating system saying: "This binary was written for a different processor. I have no idea what to do with it."</p>
<h3 id="heading-the-solution-a-single-image-tag-that-serves-any-architecture">The Solution: A Single Image Tag That Serves Any Architecture</h3>
<p>Docker solves this with a structure called a <strong>Manifest List</strong> (also called a multi-arch image index). Instead of one image, a Manifest List is a pointer table. It holds multiple image references — one per architecture — all under the same tag.</p>
<p>When a server pulls <code>hello-axion:v1</code>, here's what actually happens:</p>
<ol>
<li><p>Docker contacts the registry and requests the manifest for <code>hello-axion:v1</code></p>
</li>
<li><p>The registry returns the Manifest List, which looks like this internally:</p>
</li>
</ol>
<pre><code class="language-json">{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}
</code></pre>
<ol>
<li>Docker checks the current machine's architecture, finds the matching entry, and pulls only that specific image layer. The x86 image never downloads onto your ARM server, and vice versa.</li>
</ol>
<p>One tag, two actual images. Completely transparent to your deployment manifests.</p>
<h3 id="heading-set-up-docker-buildx">Set Up Docker Buildx</h3>
<p><strong>Docker Buildx</strong> is the CLI tool that builds these Manifest Lists. It's powered by the <strong>BuildKit</strong> engine and ships bundled with Docker Desktop. Run the following to create and activate a new builder instance:</p>
<pre><code class="language-bash">docker buildx create --name multiarch-builder --use
</code></pre>
<ul>
<li><p><code>--name multiarch-builder</code> — gives this builder a memorable name. You can have multiple builders. This command creates a new one named <code>multiarch-builder</code>.</p>
</li>
<li><p><code>--use</code> — immediately sets this new builder as the active one, so all future <code>docker buildx build</code> commands use it.</p>
</li>
</ul>
<p>Now boot the builder and confirm it supports the platforms we need:</p>
<pre><code class="language-bash">docker buildx inspect --bootstrap
</code></pre>
<ul>
<li><code>--bootstrap</code> — starts the builder container if it isn't already running, and prints its full configuration.</li>
</ul>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">Name:          multiarch-builder
Driver:        docker-container
Platforms:     linux/amd64, linux/arm64, linux/arm/v7, linux/386, ...
</code></pre>
<p>The <code>Platforms</code> line lists every architecture this builder can produce images for. As long as you see <code>linux/amd64</code> and <code>linux/arm64</code> in that list, you're ready to build for both x86 and ARM.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/1c19aca1-30c4-406d-9c37-679ee4f2928f.png" alt="Terminal output showing the multiarch-builder details with Name, Driver set to docker-container, and a Platforms list that includes linux/amd64 and linux/arm64 highlighted." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<h2 id="heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</h2>
<p>Now we can write the Dockerfile. We'll use two techniques together: a <strong>multi-stage build</strong> to keep the final image tiny, and a <strong>cross-compilation trick</strong> to avoid slow CPU emulation.</p>
<p>Create <code>app/Dockerfile</code> with the following content:</p>
<pre><code class="language-dockerfile"># -----------------------------------------------------------
# Stage 1: Build
# -----------------------------------------------------------
# $BUILDPLATFORM = the machine running this build (your laptop)
# \(TARGETOS / \)TARGETARCH = the platform we are building FOR
# -----------------------------------------------------------
FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /app

COPY go.mod .
RUN go mod download

COPY main.go .

RUN GOOS=\(TARGETOS GOARCH=\)TARGETARCH go build -ldflags="-w -s" -o server main.go

# -----------------------------------------------------------
# Stage 2: Runtime
# -----------------------------------------------------------

FROM alpine:latest

RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup
USER appuser

WORKDIR /app
COPY --from=builder /app/server .

EXPOSE 8080
CMD ["./server"]
</code></pre>
<p>There's a lot happening here. Let's go through it carefully.</p>
<h3 id="heading-stage-1-the-builder">Stage 1: The Builder</h3>
<p><code>FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder</code></p>
<p>This is the most important line in the file. <code>\(BUILDPLATFORM</code> is a special build argument that Docker Buildx automatically injects — it equals the platform of the machine <em>running the build</em> (your laptop). By pinning the builder stage to <code>\)BUILDPLATFORM</code>, the Go compiler always runs natively on your machine, not inside a CPU emulator. This is what makes multi-arch builds fast.</p>
<p>Without <code>--platform=$BUILDPLATFORM</code>, Buildx would have to use <strong>QEMU</strong> — a full CPU emulator — to run an ARM64 build environment on your x86 machine (or vice versa). QEMU works, but it's typically 5–10 times slower than native execution. For a project with many dependencies, that's the difference between a 2-minute build and a 20-minute build.</p>
<p><code>ARG TARGETOS</code> <strong>and</strong> <code>ARG TARGETARCH</code></p>
<p>These two lines declare that our Dockerfile expects build arguments named <code>TARGETOS</code> and <code>TARGETARCH</code>. Buildx injects these automatically based on the <code>--platform</code> flag you pass at build time. For a <code>linux/arm64</code> target, <code>TARGETOS</code> will be <code>linux</code> and <code>TARGETARCH</code> will be <code>arm64</code>.</p>
<p><code>COPY go.mod .</code> <strong>and</strong> <code>RUN go mod download</code></p>
<p>We copy <code>go.mod</code> first, before copying the rest of the source code. Docker builds images layer by layer and caches each layer. By copying only the module file first, we create a cached layer for <code>go mod download</code>.</p>
<p>On future builds, as long as <code>go.mod</code> hasn't changed, Docker skips the download step entirely — even if the source code changed. This speeds up iterative development significantly.</p>
<p><code>RUN GOOS=\(TARGETOS GOARCH=\)TARGETARCH go build -ldflags="-w -s" -o server main.go</code></p>
<p>This is the cross-compilation step. <code>GOOS</code> and <code>GOARCH</code> are Go's built-in cross-compilation environment variables. Setting them tells the Go compiler to produce a binary for a different target than the machine it's running on. We set them from the <code>\(TARGETOS</code> and <code>\)TARGETARCH</code> build args injected by Buildx.</p>
<p>The <code>-ldflags="-w -s"</code> flag strips the debug symbol table and the DWARF debugging information from the binary. This has no effect on runtime behavior but reduces the binary size by roughly 30%.</p>
<h3 id="heading-stage-2-the-runtime-image">Stage 2: The Runtime Image</h3>
<p><code>FROM alpine:latest</code></p>
<p>This starts a brand-new image from Alpine Linux — a minimal Linux distribution that weighs about 5 MB. Critically, <code>alpine:latest</code> is itself a multi-arch image, so Docker automatically selects the <code>arm64</code> or <code>amd64</code> Alpine variant depending on which platform this stage is built for.</p>
<p>Everything from Stage 1 — the Go toolchain, the source files, the intermediate object files — is discarded. The final image contains <em>only</em> Alpine Linux plus our binary. Compared to a naive single-stage Go image (~300 MB), this approach produces an image under 15 MB.</p>
<p><code>RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup</code> and <code>USER appuser</code></p>
<p>These two lines create a non-root user and set it as the active user for the container. Running containers as root is a security risk — if an attacker exploits a vulnerability in your application, they gain root access inside the container. Running as a non-root user limits the blast radius.</p>
<p><code>COPY --from=builder /app/server .</code></p>
<p>This is how multi-stage builds work: the <code>--from=builder</code> flag tells Docker to copy files from the <code>builder</code> stage (Stage 1), not from your local disk. Only the compiled binary (<code>server</code>) makes it into the final image.</p>
<h2 id="heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</h2>
<p>With the application and Dockerfile in place, we can now build images for both architectures and push them to Artifact Registry — all in a single command.</p>
<p>From inside the <code>app/</code> directory, run:</p>
<pre><code class="language-bash">docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 \
  --push \
  .
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual GCP project ID.</p>
<p>Here's what each part of this command does:</p>
<ul>
<li><p><code>docker buildx build</code> — uses the Buildx CLI instead of the standard <code>docker build</code>. Buildx is required for multi-platform builds.</p>
</li>
<li><p><code>--platform linux/amd64,linux/arm64</code> — instructs Buildx to build the image twice: once targeting x86 Intel/AMD machines, and once targeting ARM64. Both builds run in parallel. Because our Dockerfile uses the <code>$BUILDPLATFORM</code> cross-compilation trick, both builds run natively on your machine without QEMU emulation.</p>
</li>
<li><p><code>-t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1</code> — the full image path in Artifact Registry. The format is always <code>REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG</code>.</p>
</li>
<li><p><code>--push</code> — multi-arch images can't be loaded into your local Docker daemon (which only understands single-architecture images). This flag tells Buildx to skip local storage and push the completed Manifest List — with both architecture variants — directly to the registry.</p>
</li>
<li><p><code>.</code> — the build context, the directory Docker scans for the Dockerfile and any files the build needs.</p>
</li>
</ul>
<p>Watch the output as the build runs. You'll see BuildKit working on both platforms simultaneously:</p>
<pre><code class="language-plaintext"> =&gt; [linux/amd64 builder 1/5] FROM golang:1.23-alpine
 =&gt; [linux/arm64 builder 1/5] FROM golang:1.23-alpine
 ...
 =&gt; pushing manifest for us-central1-docker.pkg.dev/.../hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/dc88f558-b4ee-4100-bfe1-eaa943bec9bc.png" alt="Terminal showing docker buildx build output with two parallel build tracks labeled linux/amd64 and linux/arm64, and a final line reading pushing manifest for the Artifact Registry image path." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<h3 id="heading-verify-the-multi-arch-image-in-artifact-registry">Verify the Multi-Arch Image in Artifact Registry</h3>
<p>Once the push completes, navigate to <strong>GCP Console → Artifact Registry → Repositories → multi-arch-repo</strong> and click on <code>hello-axion</code>.</p>
<p>You won't see a single image — you'll see something labelled <strong>"Image Index"</strong>. That's the Manifest List we created. Click into it, and you'll find two child images with separate digests, one for <code>linux/amd64</code> and one for <code>linux/arm64</code>.</p>
<p>You can also inspect this from the command line:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/28d0e4a4-1d45-4c0b-ac47-34dc3b72c11d.png" alt="Google Cloud Artifact Registry console showing hello-axion as an Image Index with two child images: one labeled linux/amd64 and one labeled linux/arm64, each with its own digest and size." style="display:block;margin:0 auto" width="2188" height="1258" loading="lazy">

<p>The output lists every manifest inside the image index. You'll see entries for <code>linux/amd64</code> and <code>linux/arm64</code> — those are our two real images. You'll also see two entries with <code>Platform: unknown/unknown</code> labelled as <code>attestation-manifest</code>. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to prove how and where the image was built (a supply chain security feature called SLSA attestation).</p>
<p>The two entries you care about are <code>linux/amd64</code> and <code>linux/arm64</code>. Note the digest for the <code>arm64</code> entry — we'll use it in the verification step to confirm the cluster pulled the right variant.</p>
<h2 id="heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</h2>
<p>We have a universal image. Now we need somewhere to run it.</p>
<p>Recall the cluster we created in Step 2 — it's running <code>e2-standard-2</code> x86 machines. We're going to add a second node pool running ARM machines. This is the key architectural move: a <strong>mixed-architecture cluster</strong> where different workloads can be routed to different hardware.</p>
<h3 id="heading-choosing-your-arm-machine-type">Choosing Your ARM Machine Type</h3>
<p>Google Cloud currently offers two ARM-based machine series in GKE:</p>
<table>
<thead>
<tr>
<th>Series</th>
<th>Example type</th>
<th>What it is</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tau T2A</strong></td>
<td><code>t2a-standard-2</code></td>
<td>First-gen Google ARM (Ampere Altra). Broadly available across regions. Great for getting started.</td>
</tr>
<tr>
<td><strong>Axion (C4A)</strong></td>
<td><code>c4a-standard-2</code></td>
<td>Google's custom ARM chip (Arm Neoverse V2 core). Newest generation, best price-performance. Still expanding availability.</td>
</tr>
</tbody></table>
<p>This tutorial uses <code>t2a-standard-2</code> because it's widely available. The commands are identical for <code>c4a-standard-2</code> — just swap the <code>--machine-type</code> value. If <code>t2a-standard-2</code> isn't available in your zone, GKE will tell you immediately when you run the node pool creation command below, and you can try a neighbouring zone.</p>
<h3 id="heading-create-the-arm-node-pool">Create the ARM Node Pool</h3>
<p>Add the ARM node pool to your existing cluster:</p>
<pre><code class="language-bash">gcloud container node-pools create axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --num-nodes=2 \
  --node-labels=workload-type=arm-optimized
</code></pre>
<p>What each flag does:</p>
<ul>
<li><p><code>--cluster=axion-tutorial-cluster</code> — the name of the cluster we created in Step 2. Node pools are always added to an existing cluster.</p>
</li>
<li><p><code>--zone=us-central1-a</code> — must match the zone you used when creating the cluster.</p>
</li>
<li><p><code>--machine-type=t2a-standard-2</code> — GKE detects this is an ARM machine type and automatically provisions the nodes with an ARM-compatible version of Container-Optimized OS (COS). You don't need to configure anything special for ARM at the OS level.</p>
</li>
<li><p><code>--num-nodes=2</code> — two ARM nodes in the zone, enough to schedule our 3-replica deployment alongside other cluster overhead.</p>
</li>
<li><p><code>--node-labels=workload-type=arm-optimized</code> — attaches a custom label to every node in this pool. We'll use this label in our deployment manifest to target these specific nodes. Using a descriptive custom label (rather than just relying on the automatic <code>kubernetes.io/arch=arm64</code> label) is good practice in real clusters — it communicates the <em>intent</em> of the pool, not just its hardware.</p>
</li>
</ul>
<p>This command takes a few minutes. Once it completes, let's confirm our cluster now has both node pools:</p>
<pre><code class="language-bash">gcloud container clusters get-credentials axion-tutorial-cluster --zone=us-central1-a

kubectl get nodes --label-columns=kubernetes.io/arch
</code></pre>
<p>The <code>get-credentials</code> command configures <code>kubectl</code> to authenticate with your new cluster. The <code>get nodes</code> command then lists all nodes and adds a column showing the <code>kubernetes.io/arch</code> label.</p>
<p>You should see something like:</p>
<pre><code class="language-plaintext">NAME                                    STATUS   ARCH    AGE
gke-...default-pool-abc...              Ready    amd64   15m
gke-...default-pool-def...              Ready    amd64   15m
gke-...axion-pool-jkl...                Ready    arm64   3m
gke-...axion-pool-mno...                Ready    arm64   3m
</code></pre>
<p><code>amd64</code> for the default x86 pool, <code>arm64</code> for our new Axion pool. This <code>kubernetes.io/arch</code> label is applied automatically by GKE — you don't set it, it's derived from the hardware.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/6389f4c6-17fe-4086-982f-39d94dbfa252.png" alt="Terminal output of kubectl get nodes with a ARCH column showing amd64 for two default-pool nodes and arm64 for two axion-pool nodes." style="display:block;margin:0 auto" width="2330" height="646" loading="lazy">

<h2 id="heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</h2>
<p>We have a multi-arch image and a mixed-architecture cluster. Here's something important to understand before writing the deployment manifest: <strong>Kubernetes doesn't know or care about image architecture by default</strong>.</p>
<p>If you applied a standard Deployment right now, the scheduler would look for any available node with enough CPU and memory and place pods there — potentially landing on x86 nodes instead of your ARM Axion nodes. The multi-arch Manifest List would handle this gracefully (the right binary would run regardless), but you'd lose the cost efficiency you provisioned Axion nodes for in the first place.</p>
<p>To guarantee that pods land on ARM nodes and only ARM nodes, we use a <code>nodeSelector</code>.</p>
<h3 id="heading-how-nodeselector-works">How nodeSelector Works</h3>
<p>A <code>nodeSelector</code> is a set of key-value pairs in your pod spec. Before the Kubernetes scheduler places a pod, it checks every available node's labels. If a node doesn't have all the labels in the <code>nodeSelector</code>, the scheduler skips it — the pod will remain in <code>Pending</code> state rather than land on the wrong node.</p>
<p>This is a hard constraint, which is exactly what we want for cost optimization. Contrast this with Node Affinity's soft preference mode (<code>preferredDuringSchedulingIgnoredDuringExecution</code>), which says "try to use ARM, but fall back to x86 if needed." Soft preferences are useful for resilience, but they undermine the whole point of dedicated ARM pools. We want the hard constraint.</p>
<h3 id="heading-write-the-deployment-manifest">Write the Deployment Manifest</h3>
<p>Create <code>k8s/deployment.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-axion
  labels:
    app: hello-axion
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-axion
  template:
    metadata:
      labels:
        app: hello-axion
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64

      containers:
      - name: hello-axion
        image: us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        resources:
          requests:
            cpu: "250m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your project ID. Here's what the key sections do:</p>
<p><code>replicas: 3</code> — tells Kubernetes to keep three instances of this pod running at all times. If one crashes or a node goes down, the scheduler spins up a replacement. Three replicas also means one pod per ARM node in <code>us-central1</code>, which distributes load across availability zones.</p>
<p><code>selector.matchLabels</code> and <code>template.metadata.labels</code> — these two blocks must match. The <code>selector</code> tells the Deployment which pods it "owns," and the <code>template.metadata.labels</code> is what those pods will be tagged with. If they don't match, Kubernetes won't be able to manage the pods.</p>
<p><code>nodeSelector: kubernetes.io/arch: arm64</code> — this is the pin. The Kubernetes scheduler filters out every node that doesn't carry this label before considering resource availability. Since GKE automatically applies <code>kubernetes.io/arch=arm64</code> to all ARM nodes, our pods will schedule only onto the <code>axion-pool</code> nodes.</p>
<p><code>livenessProbe</code> — periodically calls <code>GET /healthz</code>. If this check fails a certain number of times in a row (indicating the container has deadlocked or is otherwise unresponsive), Kubernetes restarts the container. <code>initialDelaySeconds: 5</code> gives the server 5 seconds to start up before the first check.</p>
<p><code>readinessProbe</code> — similar to the liveness probe, but with a different purpose. While the readiness probe is failing, Kubernetes removes the pod from the service's load balancer, so no traffic is sent to it. This is important during startup — the pod won't receive traffic until it signals it's ready.</p>
<p><code>resources.requests</code> — reserves <code>250m</code> (25% of a CPU core) and <code>64Mi</code> of memory on the node for this pod. The scheduler uses these numbers to decide whether a node has enough room for the pod. Setting requests is required for sensible bin-packing. Without them, nodes can be silently overcommitted.</p>
<p><code>resources.limits</code> — caps the container at <code>500m</code> CPU and <code>128Mi</code> memory. If the container exceeds these limits, Kubernetes throttles the CPU or kills the container (for memory). This prevents a single misbehaving pod from starving other workloads on the same node.</p>
<h3 id="heading-a-note-on-taints-and-tolerations">A Note on Taints and Tolerations</h3>
<p>Once you're comfortable with <code>nodeSelector</code>, the next step in production clusters is adding a <strong>taint</strong> to your ARM node pool. A taint is a repellent — any pod without an explicit <strong>toleration</strong> for that taint is blocked from landing on the tainted node.</p>
<p>This means other workloads in your cluster can't accidentally consume your ARM capacity. You'd add the taint when creating the pool:</p>
<pre><code class="language-bash"># Add --node-taints to the pool creation command:
--node-taints=workload-type=arm-optimized:NoSchedule
</code></pre>
<p>And a matching toleration in the pod spec:</p>
<pre><code class="language-yaml">tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "arm-optimized"
  effect: "NoSchedule"
</code></pre>
<p>We're not doing this in the tutorial to keep things simple, but it's the pattern production multi-tenant clusters use to enforce hard separation between workload types.</p>
<h3 id="heading-write-the-service-manifest">Write the Service Manifest</h3>
<p>We also need a Kubernetes Service to expose the pods over the network. Create <code>k8s/service.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: hello-axion-svc
spec:
  selector:
    app: hello-axion
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
</code></pre>
<ul>
<li><p><code>selector: app: hello-axion</code> — the Service discovers pods using labels. Any pod with <code>app: hello-axion</code> on it will be added to this Service's load balancer pool.</p>
</li>
<li><p><code>port: 80</code> — the port the Service is reachable on from outside the cluster.</p>
</li>
<li><p><code>targetPort: 8080</code> — the port on the pod that traffic gets forwarded to. Our Go server listens on port 8080, so this must match.</p>
</li>
<li><p><code>type: LoadBalancer</code> — tells GKE to provision an external Google Cloud load balancer and assign it a public IP. This is what makes the Service reachable from the internet.</p>
</li>
</ul>
<h3 id="heading-apply-both-manifests">Apply Both Manifests</h3>
<pre><code class="language-bash">kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
</code></pre>
<p><code>kubectl apply</code> reads each manifest file and creates or updates the resources described in it. If the resources don't exist yet, they're created. If they already exist, Kubernetes only applies the diff — it won't restart pods unnecessarily.</p>
<p>Watch the pods come up in real time:</p>
<pre><code class="language-bash">kubectl get pods -w
</code></pre>
<p>The <code>-w</code> flag watches for changes and prints updates as they happen. You should see pods transition from <code>Pending</code> → <code>ContainerCreating</code> → <code>Running</code>. Once all three show <code>Running</code>, press <code>Ctrl+C</code> to stop watching.</p>
<h2 id="heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</h2>
<p>Everything is running. Now we need evidence — not just that pods are up, but that they're on the right nodes and serving the right binary.</p>
<h3 id="heading-confirm-pod-placement">Confirm Pod Placement</h3>
<pre><code class="language-bash">kubectl get pods -o wide
</code></pre>
<p>The <code>-o wide</code> flag adds extra columns to the output, including the name of the node each pod was scheduled on. Look at the <code>NODE</code> column:</p>
<pre><code class="language-plaintext">NAME                          READY   STATUS    NODE
hello-axion-7b8d9f-abc12      1/1     Running   gke-axion-tutorial-axion-pool-a-...
hello-axion-7b8d9f-def34      1/1     Running   gke-axion-tutorial-axion-pool-b-...
hello-axion-7b8d9f-ghi56      1/1     Running   gke-axion-tutorial-axion-pool-c-...
</code></pre>
<p>All three pods should show node names containing <code>axion-pool</code>. None should show <code>default-pool</code>.</p>
<h3 id="heading-confirm-the-nodes-are-arm">Confirm the Nodes Are ARM</h3>
<p>Take one of those node names and verify its architecture label:</p>
<pre><code class="language-bash">kubectl get node NODE_NAME --show-labels | grep kubernetes.io/arch
</code></pre>
<p>Replace <code>NODE_NAME</code> with one of the node names from the previous command. You should see:</p>
<pre><code class="language-plaintext">kubernetes.io/arch=arm64
</code></pre>
<p>That's the automatic label GKE applied when it provisioned the ARM hardware. Our <code>nodeSelector</code> matched on this label to pin the pods here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/815312ea-e2bf-4106-863e-55cd0bdad5f7.png" alt="Terminal split into two sections: the top showing kubectl get pods -o wide with all pods scheduled on nodes containing axion-pool in the name, and the bottom showing kubectl get node with kubernetes.io/arch=arm64 in the labels output." style="display:block;margin:0 auto" width="2848" height="1500" loading="lazy">

<h3 id="heading-ask-the-application-itself">Ask the Application Itself</h3>
<p>This is the most satisfying verification step. Our Go server reports the architecture of the binary that's running. Let's ask it directly.</p>
<p>Use <code>kubectl port-forward</code> to create a secure tunnel from port 8080 on your local machine to port 8080 on the Deployment:</p>
<pre><code class="language-bash">kubectl port-forward deployment/hello-axion 8080:8080
</code></pre>
<p>This command stays running in the foreground — open a <strong>second terminal window</strong> and run:</p>
<pre><code class="language-bash">curl http://localhost:8080
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">Hello from freeCodeCamp!
Architecture : arm64
OS           : linux
Pod hostname : hello-axion-7b8d9f-abc12
</code></pre>
<p><code>Architecture : arm64</code>. That's our Go binary confirming that it was compiled for ARM64 and is executing on an ARM64 CPU. The single image tag we built does the right thing automatically.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/114ff82d-950f-4059-a1fa-89baffb90b6c.png" alt="Terminal output of curl http://localhost:8080 showing the four-line response: Hello from freeCodeCamp, Architecture: arm64, OS: linux, and the pod hostname." style="display:block;margin:0 auto" width="1042" height="292" loading="lazy">

<h3 id="heading-the-bonus-see-the-manifest-list-in-action">The Bonus: See the Manifest List in Action</h3>
<p>Want to see the multi-arch image indexing at work? Stop the port-forward, then run:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>You'll see four entries in the manifest list. Two are real images — <code>Platform: linux/amd64</code> and <code>Platform: linux/arm64</code>. The other two show <code>Platform: unknown/unknown</code> with an <code>attestation-manifest</code> annotation. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to every image — a supply chain security feature (SLSA attestation) that proves how and where the image was built.</p>
<p>You may notice that if you check the image digest recorded in a running pod:</p>
<pre><code class="language-bash">kubectl get pod POD_NAME \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
</code></pre>
<p>Replace <code>POD_NAME</code> with one of the pod names from earlier.</p>
<p>The digest returned matches the <strong>top-level manifest list digest</strong>, not the <code>arm64</code>-specific one. This is expected behaviour. Modern Kubernetes (using containerd) records the manifest list digest, not the resolved platform digest. The platform resolution already happened when the node pulled the correct image variant.</p>
<p>The definitive proof that the right binary is running is what you already have: the node labeled <code>kubernetes.io/arch=arm64</code> and the application reporting <code>Architecture: arm64</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/7dffe0c8-28cf-4a5d-8459-1e8db3da7dc0.png" alt="top-level manifest list digest" style="display:block;margin:0 auto" width="2302" height="1000" loading="lazy">

<h2 id="heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</h2>
<p>The hands-on work is done. Let's talk about why any of this is worth the effort.</p>
<h3 id="heading-the-cost-math">The Cost Math</h3>
<p>At the time of writing, here's how ARM compares to equivalent x86 machines on Google Cloud (prices are approximate and change over time — check the <a href="https://cloud.google.com/compute/vm-instance-pricing">official pricing page</a> before making decisions):</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>vCPU</th>
<th>Memory</th>
<th>Approx. $/hour</th>
</tr>
</thead>
<tbody><tr>
<td><code>n2-standard-4</code> (x86)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.19</td>
</tr>
<tr>
<td><code>t2a-standard-4</code> (Tau ARM)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.14</td>
</tr>
<tr>
<td><code>c4a-standard-4</code> (Axion)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.15</td>
</tr>
</tbody></table>
<p>That's a raw 25–30% reduction in compute cost per node. Factor in Google's published claim of up to 65% better price-performance for Axion on relevant workloads — meaning you may need fewer nodes to handle the same traffic — and the savings compound further.</p>
<p>Here's how that looks at scale, for a service running 20 nodes continuously for a year:</p>
<ul>
<li><p>20 × <code>n2-standard-4</code> × \(0.19 × 8,760 hours = <strong>\)33,288/year</strong></p>
</li>
<li><p>20 × <code>t2a-standard-4</code> × \(0.14 × 8,760 hours = <strong>\)24,528/year</strong></p>
</li>
</ul>
<p>That's roughly <strong>$8,760 saved annually</strong> on compute, before committed use discounts (which further widen the gap).</p>
<h3 id="heading-when-arm-is-the-right-choice">When ARM Is the Right Choice</h3>
<p>ARM works best for:</p>
<ul>
<li><p><strong>Stateless API servers and web applications</strong> — like the app we built. ARM excels at high-throughput, low-latency network workloads.</p>
</li>
<li><p><strong>Background workers and queue processors</strong> — long-running services that don't depend on x86-specific binaries.</p>
</li>
<li><p><strong>Microservices written in Go, Rust, or Python</strong> — these languages have excellent ARM64 support and are built cross-platform by default.</p>
</li>
</ul>
<h3 id="heading-when-to-proceed-carefully">When to Proceed Carefully</h3>
<ul>
<li><p><strong>Native library dependencies</strong> — some older C libraries, proprietary SDKs, or compiled ML model-serving runtimes don't have ARM64 builds. Always audit your dependency tree before migrating.</p>
</li>
<li><p><strong>CI pipelines need ARM too</strong> — your automated tests should run on ARM, not just x86. An image that silently fails only on ARM is harder to debug than one that never claimed ARM support.</p>
</li>
<li><p><strong>Profile before optimizing</strong> — the cost savings are real, but measure your actual workload behavior on ARM before committing. Not every workload benefits equally.</p>
</li>
</ul>
<h2 id="heading-cleanup">Cleanup</h2>
<p>When you're done, clean up to avoid ongoing charges:</p>
<pre><code class="language-bash"># Remove the Kubernetes resources from the cluster
kubectl delete -f k8s/

# Delete the ARM node pool
gcloud container node-pools delete axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the cluster itself
gcloud container clusters delete axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the images from Artifact Registry (optional — storage costs are minimal)
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Let's recap what you built and why each part matters.</p>
<p>You started with a Go application, a Dockerfile, and a <code>docker buildx build</code> command that produced two images — one for x86, one for ARM64 — wrapped in a single Manifest List tag. Any server that pulls that tag gets the right binary automatically, without you maintaining separate pipelines or separate tags.</p>
<p>You provisioned a GKE cluster with two node pools running different CPU architectures, then used <code>nodeSelector</code> to make sure your ARM-optimized workload lands only on the ARM Axion nodes — not on x86 by accident. The result is a deployment that's both architecture-correct and cost-efficient.</p>
<p>The patterns you practiced here don't stop at this demo. The same Dockerfile technique works for any language with cross-compilation support. The same <code>nodeSelector</code> approach works for any workload you want to pin to ARM. As more teams migrate services to ARM over the coming years, having these skills will be a real asset.</p>
<p><strong>Where to go from here:</strong></p>
<ul>
<li><p>Add a GitHub Actions workflow that runs <code>docker buildx build --platform linux/amd64,linux/arm64</code> on every push, automating this entire process in CI.</p>
</li>
<li><p>Audit one of your existing stateless services for ARM compatibility and try migrating it.</p>
</li>
<li><p>Explore <strong>Node Affinity</strong> as a softer alternative to <code>nodeSelector</code> for workloads that can run on either architecture but prefer ARM.</p>
</li>
<li><p>Look into <strong>GKE Autopilot</strong>, which now supports ARM nodes and handles node pool management automatically.</p>
</li>
</ul>
<p>Happy building.</p>
<h2 id="heading-project-file-structure">Project File Structure</h2>
<pre><code class="language-plaintext">hello-axion/
├── app/
│   ├── main.go          — Go HTTP server
│   ├── go.mod           — Go module definition
│   └── Dockerfile       — Multi-stage Dockerfile
└── k8s/
    ├── deployment.yaml  — Deployment with nodeSelector and probes
    └── service.yaml     — LoadBalancer Service
</code></pre>
<p>All source files for this tutorial are available in the companion GitHub repository: <a href="https://github.com/Amiynarh/multi-arch-docker-gke-arm">https://github.com/Amiynarh/multi-arch-docker-gke-arm</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Authenticate Users in Kubernetes: x509 Certificates, OIDC, and Cloud Identity ]]>
                </title>
                <description>
                    <![CDATA[ Kubernetes doesn't know who you are. It has no user database, no built-in login system, no password file. When you run kubectl get pods, Kubernetes receives an HTTP request and asks one question: who  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-authenticate-users-in-kubernetes-x509-certificates-oidc-and-cloud-identity/</link>
                <guid isPermaLink="false">69d4182f40c9cabf4484dbdb</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ authentication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:31:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36356282-0cfb-43a8-8461-84f20e64b041.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Kubernetes doesn't know who you are.</p>
<p>It has no user database, no built-in login system, no password file. When you run <code>kubectl get pods</code>, Kubernetes receives an HTTP request and asks one question: who signed this, and do I trust that signature? Everything else — what you're allowed to do, which namespaces you can access, whether your request goes through at all — comes after that question is answered.</p>
<p>This surprises most engineers who are new to Kubernetes. They expect something like a database of users with passwords. Instead, they find a pluggable chain of authenticators, each one able to vouch for a request in a different way:</p>
<ul>
<li><p>Client certificates</p>
</li>
<li><p>OIDC tokens from an external identity provider</p>
</li>
<li><p>Cloud provider IAM tokens</p>
</li>
<li><p>Service account tokens projected into pods.</p>
</li>
</ul>
<p>Any of these can be active at the same time.</p>
<p>Understanding this model is what separates engineers who can debug authentication failures from engineers who copy kubeconfig files and hope for the best.</p>
<p>In this article, you'll work through how the Kubernetes authentication chain works from first principles. You'll see how x509 client certificates are used — and why they're a poor choice for human users in production. You'll configure OIDC authentication with Dex, giving your cluster a real browser-based login flow. And you'll see how AWS, GCP, and Azure each plug into the same underlying model.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A running kind cluster — a fresh one works fine, or reuse an existing one</p>
</li>
<li><p><code>kubectl</code> and <code>helm</code> installed</p>
</li>
<li><p><code>openssl</code> available on your machine (comes pre-installed on macOS and most Linux distros)</p>
</li>
<li><p>Basic familiarity with what a JWT is (a signed JSON object with claims) — you don't need to be able to write one, just recognise one</p>
</li>
</ul>
<p>All demo files are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</a></p>
<ul>
<li><p><a href="#heading-the-authenticator-chain">The Authenticator Chain</a></p>
</li>
<li><p><a href="#heading-users-vs-service-accounts">Users vs Service Accounts</a></p>
</li>
<li><p><a href="#heading-what-happens-after-authentication">What Happens After Authentication</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</a></p>
<ul>
<li><p><a href="#heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</a></p>
</li>
<li><p><a href="#the-cluster-ca">The Cluster CA</a></p>
</li>
<li><p><a href="#heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-1--create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</a></p>
<ul>
<li><p><a href="#heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</a></p>
</li>
<li><p><a href="#heading-the-api-server-configuration">The API Server Configuration</a></p>
</li>
<li><p><a href="#heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</a></p>
</li>
<li><p><a href="#heading-how-kubelogin-works">How kubelogin Works</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2--configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</a></p>
</li>
<li><p><a href="#heading-cloud-provider-authentication">Cloud Provider Authentication</a></p>
<ul>
<li><p><a href="#heading-aws-eks">AWS EKS</a></p>
</li>
<li><p><a href="#heading-google-gke">Google GKE</a></p>
</li>
<li><p><a href="#heading-azure-aks">Azure AKS</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-webhook-token-authentication">Webhook Token Authentication</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</h2>
<p>Every request that reaches the Kubernetes API server — whether from <code>kubectl</code>, a pod, a controller, or a CI pipeline — carries a credential of some kind.</p>
<p>The API server passes that credential through a chain of authenticators in sequence. The first authenticator that can verify the credential wins. If none can, the request is treated as anonymous.</p>
<h3 id="heading-the-authenticator-chain">The Authenticator Chain</h3>
<p>Kubernetes supports several authentication strategies simultaneously. You can have client certificate authentication and OIDC authentication active on the same cluster at the same time, which is common in production: cluster administrators use certificates, regular developers use OIDC. The strategies active on a cluster are determined by flags passed to the <code>kube-apiserver</code> process.</p>
<p>The strategies available are x509 client certificates, bearer tokens (static token files — rarely used in production), bootstrap tokens (used during node join operations), service account tokens, OIDC tokens, authenticating proxies, and webhook token authentication. A cluster doesn't have to use all of them, and most don't. But knowing they all exist helps when you're diagnosing an auth failure.</p>
<h3 id="heading-users-vs-service-accounts">Users vs Service Accounts</h3>
<p>There is an important distinction in how Kubernetes thinks about identity. Service accounts are Kubernetes objects — they live in a namespace, get created with <code>kubectl create serviceaccount</code>, and have tokens managed by the cluster itself. Every pod runs as a service account. These are machine identities for workloads.</p>
<p>Users, on the other hand, don't exist as Kubernetes objects at all. There is no <code>kubectl create user</code> command. Kubernetes doesn't manage user accounts. Instead, it trusts external systems to assert user identity — a certificate authority, an OIDC provider, or a cloud provider's IAM system. Kubernetes just verifies the assertion and extracts the username and group memberships from it.</p>
<table>
<thead>
<tr>
<th></th>
<th>Service Account</th>
<th>User</th>
</tr>
</thead>
<tbody><tr>
<td>Kubernetes object?</td>
<td>Yes — lives in a namespace</td>
<td>No — managed externally</td>
</tr>
<tr>
<td>Created with</td>
<td><code>kubectl create serviceaccount</code></td>
<td>External system (CA, IdP, cloud IAM)</td>
</tr>
<tr>
<td>Used by</td>
<td>Pods and workloads</td>
<td>Humans and CI systems</td>
</tr>
<tr>
<td>Token managed by</td>
<td>Kubernetes</td>
<td>External system</td>
</tr>
<tr>
<td>Namespaced?</td>
<td>Yes</td>
<td>No</td>
</tr>
</tbody></table>
<h3 id="heading-what-happens-after-authentication">What Happens After Authentication</h3>
<p>Authentication only answers one question: who is this? Once the API server has a verified identity — a username and zero or more group memberships — it passes the request to the authorisation layer. By default that is RBAC, which checks the identity against Role and ClusterRole bindings to determine what the request is allowed to do.</p>
<p>This is why authentication and authorisation are separate concerns in Kubernetes. A valid certificate gets you past the front door. What you can do inside is RBAC's job. An authenticated user with no RBAC bindings can authenticate successfully but will be denied every API call.</p>
<p>If you want a deep dive into how RBAC rules, roles, and bindings work, check out this handbook on <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a>.</p>
<h2 id="heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</h2>
<p>x509 client certificate authentication is the oldest and simplest authentication method in Kubernetes. It's how <code>kubectl</code> works out of the box when you create a cluster — the kubeconfig file that <code>kind</code> or <code>kubeadm</code> generates contains an embedded client certificate signed by the cluster's Certificate Authority.</p>
<h3 id="heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</h3>
<p>When the API server receives a request with a client certificate, it validates the certificate against its trusted CA, then reads two fields (The Common Name and Organization) from the certificate to construct an identity.</p>
<p>The <strong>Common Name (CN)</strong> field becomes the username. The <strong>Organization (O)</strong> field, which can contain multiple values, becomes the list of groups the user belongs to.</p>
<p>So a certificate with <code>CN=jane</code> and <code>O=engineering</code> authenticates as username <code>jane</code> in group <code>engineering</code>. If you want to give <code>jane</code> permissions, you create a RoleBinding that references either the username <code>jane</code> or the group <code>engineering</code> as a subject.</p>
<p>This is the same mechanism behind <code>system:masters</code>. When <code>kind</code> creates a cluster and writes a kubeconfig for you, it generates a certificate with <code>O=system:masters</code>. Kubernetes has a built-in ClusterRoleBinding that grants <code>cluster-admin</code> to anyone in the <code>system:masters</code> group. That's why your default kubeconfig has full admin access — it's not magic, it's a certificate with the right group.</p>
<h3 id="heading-the-cluster-ca">The Cluster CA</h3>
<p>Every Kubernetes cluster has a root Certificate Authority — a private key and a self-signed certificate that the API server trusts. Any client certificate signed by this CA is trusted by the cluster.</p>
<p>The CA certificate and key are typically stored in <code>/etc/kubernetes/pki/</code> on the control plane node, or in the <code>kube-system</code> namespace as a secret, depending on how the cluster was created.</p>
<p>On kind clusters, you can copy the CA cert and key directly from the control plane container:</p>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>Whoever holds the CA key can issue certificates for any username and any group, including <code>system:masters</code>. This makes the CA key the most sensitive secret in a Kubernetes cluster. Guard it accordingly.</p>
<h3 id="heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</h3>
<p>Client certificates work, but they have two fundamental problems that make them a poor choice for human users in production.</p>
<p>The first is that <strong>Kubernetes doesn't check certificate revocation lists (CRLs)</strong>. If a developer's kubeconfig is stolen, the embedded certificate remains valid until it expires — which is typically one year in most Kubernetes setups. There's no way to immediately invalidate it. You can't "log out" a certificate. The only mitigation is to rotate the entire cluster CA, which invalidates every certificate including those belonging to other legitimate users.</p>
<p>The second is <strong>operational overhead</strong>. Certificates must be generated, distributed to users, and rotated before expiry. There's no self-service. In a team of ten engineers, managing certificates is annoying. In a team of a hundred, it's a full-time job.</p>
<p>For human access in production, OIDC is the right answer: short-lived tokens issued by a trusted identity provider, with a central revocation mechanism, and a standard browser-based login flow. Certificates are fine for service accounts and automation, where token management can be automated and rotation is handled programmatically.</p>
<p>That said, understanding certificates isn't optional. Your kubeconfig uses one. Your CI system probably does too. And cert-based auth is what you fall back to when everything else breaks.</p>
<h2 id="heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</h2>
<p>In this section, you'll generate a user certificate signed by the cluster CA, bind it to an RBAC role, and use it to authenticate to the cluster as a different user.</p>
<p><strong>This guide is for local development and learning only.</strong> Manually signing certificates with the cluster CA and storing keys on disk is done here for simplicity.</p>
<p>In production, you should use the Kubernetes CertificateSigningRequest API or cert-manager for certificate issuance, enforce short-lived certificates with automatic rotation, and store private keys in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or hardware security module (HSM) — never distribute the cluster CA key.</p>
<h3 id="heading-step-1-copy-the-ca-cert-and-key-from-the-kind-control-plane">Step 1: Copy the CA cert and key from the kind control plane</h3>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>This will create two files in your current directory called <code>ca.crt</code> and <code>ca.key</code></p>
<h3 id="heading-step-2-generate-a-private-key-and-csr-for-a-new-user">Step 2: Generate a private key and CSR for a new user</h3>
<p>You're creating a certificate for a user named <code>jane</code> in the <code>engineering</code> group:</p>
<pre><code class="language-bash"># Generate the private key
openssl genrsa -out jane.key 2048

# Generate a Certificate Signing Request
# CN = username, O = group
openssl req -new \
  -key jane.key \
  -out jane.csr \
  -subj "/CN=jane/O=engineering"
</code></pre>
<h3 id="heading-step-3-sign-the-csr-with-the-cluster-ca">Step 3: Sign the CSR with the cluster CA</h3>
<pre><code class="language-bash">openssl x509 -req \
  -in jane.csr \
  -CA ca.crt \
  -CAkey ca.key \
  -CAcreateserial \
  -out jane.crt \
  -days 365
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Certificate request self-signature ok
subject=CN=jane, O=engineering
</code></pre>
<h3 id="heading-step-4-inspect-the-certificate">Step 4: Inspect the certificate</h3>
<p>Before using it, confirm the identity it carries:</p>
<pre><code class="language-bash">openssl x509 -in jane.crt -noout -subject -dates
</code></pre>
<pre><code class="language-plaintext">subject=CN=jane, O=engineering
notBefore=Mar 20 10:00:00 2024 GMT
notAfter=Mar 20 10:00:00 2025 GMT
</code></pre>
<p>One year from now, this certificate becomes invalid and must be replaced. There's no way to extend it — you have to issue a new one.</p>
<h3 id="heading-step-5-build-a-kubeconfig-entry-for-jane">Step 5: Build a kubeconfig entry for jane</h3>
<pre><code class="language-bash"># Get the cluster API server address from the current context
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Create a kubeconfig for jane
kubectl config set-cluster k8s-security \
  --server=$APISERVER \
  --certificate-authority=ca.crt \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-credentials jane \
  --client-certificate=jane.crt \
  --client-key=jane.key \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-context jane@k8s-security \
  --cluster=k8s-security \
  --user=jane \
  --kubeconfig=jane.kubeconfig

kubectl config use-context jane@k8s-security \
  --kubeconfig=jane.kubeconfig
</code></pre>
<h3 id="heading-step-6-test-authentication-before-rbac">Step 6: Test authentication — before RBAC</h3>
<p>Try to list pods using jane's kubeconfig:</p>
<pre><code class="language-bash">kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">Error from server (Forbidden): pods is forbidden: User "jane" cannot list
resource "pods" in API group "" in the namespace "staging"
</code></pre>
<p>This is correct. Jane authenticated successfully — Kubernetes knows who she is. But she has no RBAC bindings, so every API call is denied. Authentication passed, but authorisation failed.</p>
<h3 id="heading-step-7-grant-jane-access-with-rbac">Step 7: Grant jane access with RBAC</h3>
<p>RBAC bindings use the username exactly as it appears in the certificate's CN field. If you need a refresher on how Roles, ClusterRoles, and RoleBindings work, this handbook <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a> covers the full RBAC model. For now, a simple RoleBinding using the built-in <code>view</code> ClusterRole is enough:</p>
<pre><code class="language-yaml"># jane-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-reader
  namespace: staging
subjects:
  - kind: User
    name: jane          # matches the CN in the certificate
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f jane-rolebinding.yaml
kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">No resources found in staging namespace.
</code></pre>
<p>No error — jane can now list pods in <code>staging</code>. She can't delete them, create them, or access other namespaces. The certificate got her in. RBAC determines what she can do.</p>
<h2 id="heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</h2>
<p>OpenID Connect is an identity layer on top of OAuth 2.0. It's how Kubernetes integrates with enterprise identity providers — Active Directory, Okta, Google Workspace, Keycloak, and any other provider that speaks OIDC. Understanding how Kubernetes uses it requires following the token from the user's browser to the API server's decision.</p>
<h3 id="heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</h3>
<p>When a developer runs <code>kubectl get pods</code> with OIDC configured, the following happens:</p>
<ol>
<li><p><code>kubectl</code> checks whether the current credential in the kubeconfig is a valid, unexpired OIDC token</p>
</li>
<li><p>If not, it launches <code>kubelogin</code>, a kubectl plugin that opens a browser window</p>
</li>
<li><p>The browser redirects to the OIDC provider (Dex, Okta, your corporate IdP)</p>
</li>
<li><p>The user logs in with their corporate credentials</p>
</li>
<li><p>The OIDC provider issues a signed JWT and returns it to kubelogin</p>
</li>
<li><p>kubelogin caches the token locally (under <code>~/.kube/cache/oidc-login/</code>) and returns it to <code>kubectl</code></p>
</li>
<li><p><code>kubectl</code> sends the token to the API server as a <code>Bearer</code> header</p>
</li>
<li><p>The API server fetches the provider's public keys from its JWKS endpoint and verifies the token signature</p>
</li>
<li><p>If valid, the API server extracts the username and group claims from the token</p>
</li>
<li><p>RBAC takes over from there</p>
</li>
</ol>
<p>The Kubernetes API server never contacts the OIDC provider for each request. It only fetches the provider's public keys periodically to verify signatures locally. This makes OIDC authentication stateless and scalable.</p>
<h3 id="heading-the-api-server-configuration">The API Server Configuration</h3>
<p>For OIDC to work, the API server needs to know where to find the identity provider and how to interpret the tokens it issues.</p>
<p>In Kubernetes v1.30+, this is configured through an <code>AuthenticationConfiguration</code> file passed via the <code>--authentication-config</code> flag. (In older versions, individual <code>--oidc-*</code> flags were used instead, but these were removed in v1.35.)</p>
<p>The <code>AuthenticationConfiguration</code> defines OIDC providers under the <code>jwt</code> key:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it does</th>
<th>Example</th>
</tr>
</thead>
<tbody><tr>
<td><code>issuer.url</code></td>
<td>The OIDC provider's base URL — must match the <code>iss</code> claim in the token</td>
<td><code>https://dex.example.com</code></td>
</tr>
<tr>
<td><code>issuer.audiences</code></td>
<td>The client IDs the token was issued for — must match the <code>aud</code> claim</td>
<td><code>["kubernetes"]</code></td>
</tr>
<tr>
<td><code>issuer.certificateAuthority</code></td>
<td>CA certificate to trust when contacting the OIDC provider (inlined PEM)</td>
<td><code>-----BEGIN CERTIFICATE-----...</code></td>
</tr>
<tr>
<td><code>claimMappings.username.claim</code></td>
<td>Which JWT claim to use as the Kubernetes username</td>
<td><code>email</code></td>
</tr>
<tr>
<td><code>claimMappings.groups.claim</code></td>
<td>Which JWT claim to use as the Kubernetes group list</td>
<td><code>groups</code></td>
</tr>
<tr>
<td><code>claimMappings.*.prefix</code></td>
<td>Prefix added to the claim value — set to <code>""</code> for no prefix</td>
<td><code>""</code></td>
</tr>
</tbody></table>
<p>On a kind cluster, the <code>--authentication-config</code> flag is set in the cluster configuration before creation, not after. You'll see this in the next demo.</p>
<h3 id="heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</h3>
<p>A JWT is a signed JSON object with three sections: a header, a payload, and a signature. The payload is a set of claims – key-value pairs that assert facts about the token. Kubernetes reads specific claims from the payload to build an identity.</p>
<p>The required claims are <code>iss</code> (the issuer URL, must match <code>issuer.url</code> in the <code>AuthenticationConfiguration</code>), <code>sub</code> (the subject, a unique identifier for the user), and <code>aud</code> (the audience, must match the <code>issuer.audiences</code> list). The <code>exp</code> claim (expiry time) is also required as the API server rejects expired tokens.</p>
<p>The most useful optional claim is <code>groups</code> (or whatever you configure via <code>claimMappings.groups.claim</code>). When this claim is present, Kubernetes can map OIDC group memberships directly to RBAC group bindings. A user in the <code>platform-engineers</code> group in your identity provider automatically gets the RBAC permissions you've bound to that group in Kubernetes — no manual user management required.</p>
<h3 id="heading-how-kubelogin-works">How kubelogin Works</h3>
<p>kubelogin (also distributed as <code>kubectl oidc-login</code>) is a kubectl credential plugin. Instead of embedding a static certificate or token in your kubeconfig, you configure a credential plugin that runs a helper binary when <code>kubectl</code> needs a token.</p>
<p>When kubelogin is invoked, it checks its local token cache. If the cached token is still valid, it returns it immediately. If the token has expired, it initiates the OIDC authorization code flow — opens a browser, redirects to the identity provider, receives the token after login, caches it locally, and returns it to <code>kubectl</code>. The whole flow takes about five seconds when it triggers.</p>
<p>This means tokens are short-lived (typically an hour) and rotate automatically. If a developer's machine is compromised, the token expires on its own. There is no long-lived credential sitting in a file somewhere.</p>
<h2 id="heading-demo-2-configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</h2>
<p>In this section, you'll deploy Dex as a self-hosted OIDC provider, configure a kind cluster to trust it, and log in with a browser. Dex is a good demo vehicle because it runs inside the cluster and doesn't require a cloud account or an external service.</p>
<p><strong>This guide is for local development and learning only.</strong> Self-signed certificates, static passwords, and certs stored on disk are used here for simplicity.</p>
<p>In production, use a managed identity provider (Azure Entra ID, Google Workspace, Okta), automate certificate lifecycle with cert-manager, and store secrets in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or inject them via CSI driver — never commit or store certs as local files.</p>
<h3 id="heading-step-1-create-a-kind-cluster-with-oidc-authentication">Step 1: Create a kind cluster with OIDC authentication</h3>
<p>OIDC authentication for the API server must be configured at cluster creation time on Kind because the API server needs to know which identity provider to trust before it starts accepting requests.</p>
<p><strong>Note:</strong> Kubernetes v1.30+ deprecated the <code>--oidc-*</code> API server flags in favor of the structured <code>AuthenticationConfiguration</code> API (via <code>--authentication-config</code>). In v1.35+ the old flags are removed entirely. This guide uses the new approach.</p>
<p><strong>nip.io</strong> is a wildcard DNS service — <code>dex.127.0.0.1.nip.io</code> resolves to <code>127.0.0.1</code>. This lets us use a real hostname for TLS without editing <code>/etc/hosts</code>.</p>
<p>First, generate a self-signed CA and TLS certificate for Dex:</p>
<pre><code class="language-bash"># Generate a CA for Dex
openssl req -x509 -newkey rsa:4096 -keyout dex-ca.key \
  -out dex-ca.crt -days 365 -nodes \
  -subj "/CN=dex-ca"

# Generate a certificate for Dex signed by that CA
openssl req -newkey rsa:2048 -keyout dex.key \
  -out dex.csr -nodes \
  -subj "/CN=dex.127.0.0.1.nip.io"

openssl x509 -req -in dex.csr \
  -CA dex-ca.crt -CAkey dex-ca.key \
  -CAcreateserial -out dex.crt -days 365 \
  -extfile &lt;(printf "subjectAltName=DNS:dex.127.0.0.1.nip.io")
</code></pre>
<p>Next, generate the <code>AuthenticationConfiguration</code> file. This tells the API server how to validate JWTs — which issuer to trust (<code>url</code>), which audience to expect (<code>audiences</code>), and which JWT claims map to Kubernetes usernames and groups (<code>claimMappings</code>). The CA cert is inlined so the API server can verify Dex's TLS certificate when fetching signing keys:</p>
<pre><code class="language-bash">cat &gt; auth-config.yaml &lt;&lt;EOF
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
  - issuer:
      url: https://dex.127.0.0.1.nip.io:32000
      audiences:
        - kubernetes
      certificateAuthority: |
$(sed 's/^/        /' dex-ca.crt)
    claimMappings:
      username:
        claim: email
        prefix: ""
      groups:
        claim: groups
        prefix: ""
EOF
</code></pre>
<p>The <code>kind-oidc.yaml</code> config uses <code>extraPortMappings</code> to expose Dex's port to your browser, <code>extraMounts</code> to copy files into the Kind node, and a <code>kubeadmConfigPatch</code> to pass <code>--authentication-config</code> to the API server:</p>
<pre><code class="language-yaml"># kind-oidc.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      # Forward port 32000 from the Docker container to localhost,
      # so your browser can reach Dex's login page
      - containerPort: 32000
        hostPort: 32000
        protocol: TCP
    extraMounts:
      # Copy files from your machine into the Kind node's filesystem
      - hostPath: ./dex-ca.crt
        containerPath: /etc/ca-certificates/dex-ca.crt
        readOnly: true
      - hostPath: ./auth-config.yaml
        containerPath: /etc/kubernetes/auth-config.yaml
        readOnly: true
    kubeadmConfigPatches:
      # Patch the API server to enable OIDC authentication
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            # Tell the API server to load our AuthenticationConfiguration
            authentication-config: /etc/kubernetes/auth-config.yaml
          extraVolumes:
            # Mount files into the API server pod (it runs as a static pod,
            # so it needs explicit volume mounts even though files are on the node)
            - name: dex-ca
              hostPath: /etc/ca-certificates/dex-ca.crt
              mountPath: /etc/ca-certificates/dex-ca.crt
              readOnly: true
              pathType: File
            - name: auth-config
              hostPath: /etc/kubernetes/auth-config.yaml
              mountPath: /etc/kubernetes/auth-config.yaml
              readOnly: true
              pathType: File
</code></pre>
<p>Create the cluster:</p>
<pre><code class="language-bash">kind create cluster --name k8s-auth --config kind-oidc.yaml
</code></pre>
<h3 id="heading-step-2-deploy-dex">Step 2: Deploy Dex</h3>
<p>Dex is an OIDC-compliant identity provider that acts as a bridge between Kubernetes and upstream identity sources (LDAP, SAML, GitHub, and so on). In this demo it runs inside the cluster with a static password database — two hardcoded users you can log in as.</p>
<p>The API server doesn't talk to Dex directly on every request. It only needs Dex's CA certificate (which you inlined in the <code>AuthenticationConfiguration</code>) to verify the JWT signatures on tokens that Dex issues.</p>
<p>The deployment has four parts: a ConfigMap with Dex's configuration, a Deployment to run Dex, a NodePort Service to expose it on port 32000 (matching the issuer URL), and RBAC resources so Dex can store state using Kubernetes CRDs.</p>
<p>First, create the namespace and load the TLS certificate as a Kubernetes Secret. Dex needs this to serve HTTPS. Without it, your browser and the API server would refuse to connect:</p>
<pre><code class="language-bash">kubectl create namespace dex

kubectl create secret tls dex-tls \
  --cert=dex.crt \
  --key=dex.key \
  -n dex
</code></pre>
<p>Save the following as <code>dex-config.yaml</code>. This configures Dex with a static password connector — two hardcoded users for the demo:</p>
<pre><code class="language-yaml"># dex-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dex-config
  namespace: dex
data:
  config.yaml: |
    # issuer must exactly match the URL in your AuthenticationConfiguration
    issuer: https://dex.127.0.0.1.nip.io:32000

    # Dex stores refresh tokens and auth codes — here it uses Kubernetes CRDs
    storage:
      type: kubernetes
      config:
        inCluster: true

    # Dex's HTTPS listener — serves the login page and token endpoints
    web:
      https: 0.0.0.0:5556
      tlsCert: /etc/dex/tls/tls.crt
      tlsKey: /etc/dex/tls/tls.key

    # staticClients defines which applications can request tokens.
    # "kubernetes" is the client ID that kubelogin uses when authenticating
    staticClients:
      - id: kubernetes
        redirectURIs:
          - http://localhost:8000     # kubelogin listens here to receive the callback
        name: Kubernetes
        secret: kubernetes-secret     # shared secret between kubelogin and Dex

    # Two demo users with the password "password" (bcrypt-hashed).
    # In production, you'd connect Dex to LDAP, SAML, or a social login instead
    enablePasswordDB: true
    staticPasswords:
      - email: "jane@example.com"
        # bcrypt hash of "password" — generate your own with: htpasswd -bnBC 10 "" password
        hash: "\(2a\)10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "jane"
        userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
      - email: "admin@example.com"
        hash: "\(2a\)10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "admin"
        userID: "a8b53e13-7e8c-4f7b-9a33-6c2f4d8c6a1b"
        groups:
          - platform-engineers
</code></pre>
<p>Save the following as <code>dex-deployment.yaml</code>. This creates the Deployment, Service, ServiceAccount, and RBAC that Dex needs to run:</p>
<pre><code class="language-yaml"># dex-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dex
  namespace: dex
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dex
  template:
    metadata:
      labels:
        app: dex
    spec:
      serviceAccountName: dex
      containers:
        - name: dex
          # v2.45.0+ required — earlier versions don't include groups from staticPasswords in tokens
          image: ghcr.io/dexidp/dex:v2.45.0
          command: ["dex", "serve", "/etc/dex/cfg/config.yaml"]
          ports:
            - name: https
              containerPort: 5556
          volumeMounts:
            - name: config
              mountPath: /etc/dex/cfg
            - name: tls
              mountPath: /etc/dex/tls
      volumes:
        - name: config
          configMap:
            name: dex-config
        - name: tls
          secret:
            secretName: dex-tls
---
# NodePort Service — exposes Dex on port 32000 on the Kind node.
# Combined with extraPortMappings, this makes Dex reachable from your browser
apiVersion: v1
kind: Service
metadata:
  name: dex
  namespace: dex
spec:
  type: NodePort
  ports:
    - name: https
      port: 5556
      targetPort: 5556
      nodePort: 32000
  selector:
    app: dex
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dex
  namespace: dex
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dex
rules:
  - apiGroups: ["dex.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dex
subjects:
  - kind: ServiceAccount
    name: dex
    namespace: dex
roleRef:
  kind: ClusterRole
  name: dex
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f dex-config.yaml
kubectl apply -f dex-deployment.yaml
kubectl rollout status deployment/dex -n dex
</code></pre>
<h3 id="heading-step-3-install-kubelogin">Step 3: Install kubelogin</h3>
<pre><code class="language-bash"># macOS
brew install int128/kubelogin/kubelogin

# Linux
curl -LO https://github.com/int128/kubelogin/releases/latest/download/kubelogin_linux_amd64.zip
unzip -j kubelogin_linux_amd64.zip kubelogin -d /tmp
sudo mv /tmp/kubelogin /usr/local/bin/kubectl-oidc_login
rm kubelogin_linux_amd64.zip
</code></pre>
<p>Confirm it's installed:</p>
<pre><code class="language-bash">kubectl oidc-login --version
</code></pre>
<h3 id="heading-step-4-configure-a-kubeconfig-entry-for-oidc">Step 4: Configure a kubeconfig entry for OIDC</h3>
<p>This creates a new user and context in your kubeconfig. Instead of using a client certificate (like the default Kind admin), it tells kubectl to use kubelogin to get a token from Dex.</p>
<p>The <code>--oidc-extra-scope</code> flags are important: without <code>email</code> and <code>groups</code>, Dex won't include those claims in the JWT, and the API server won't know who you are or what groups you belong to.</p>
<pre><code class="language-bash">kubectl config set-credentials oidc-user \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://dex.127.0.0.1.nip.io:32000 \
  --exec-arg=--oidc-client-id=kubernetes \
  --exec-arg=--oidc-client-secret=kubernetes-secret \
  --exec-arg=--oidc-extra-scope=email \
  --exec-arg=--oidc-extra-scope=groups \
  --exec-arg=--certificate-authority=$(pwd)/dex-ca.crt

kubectl config set-context oidc@k8s-auth \
  --cluster=kind-k8s-auth \
  --user=oidc-user

kubectl config use-context oidc@k8s-auth
</code></pre>
<h3 id="heading-step-5-trigger-the-login-flow">Step 5: Trigger the login flow</h3>
<p>Jane has no RBAC permissions yet, so first grant her read access from the admin context:</p>
<pre><code class="language-bash">kubectl --context kind-k8s-auth create clusterrolebinding jane-view \
  --clusterrole=view --user=jane@example.com
</code></pre>
<p>Now switch to the OIDC context and trigger a login:</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Your browser opens and redirects to the Dex login page. Log in as <code>jane@example.com</code> with password <code>password</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/44fe0657-b383-4245-9e43-45daea7a3f4f.png" alt="dexidp login screen" style="display:block;margin:0 auto" width="866" height="549" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/4f77442a-3055-47fc-a141-8d881731a1f4.png" alt="dexidp grant access" style="display:block;margin:0 auto" width="925" height="512" loading="lazy">

<p>After login, the terminal completes:</p>
<pre><code class="language-plaintext">No resources found in default namespace.
</code></pre>
<p>The browser-based authentication worked. <code>kubectl</code> received the token from Dex, sent it to the API server, the API server validated the JWT signature using the CA certificate from the <code>AuthenticationConfiguration</code>, extracted <code>jane@example.com</code> from the <code>email</code> claim, matched it against the RBAC binding, and authorized the request.</p>
<p>Without the <code>clusterrolebinding</code>, you would see <code>Error from server (Forbidden)</code> — authentication succeeds (the API server knows <em>who</em> you are) but authorization fails (jane has no permissions). This is the distinction between 401 Unauthorized and 403 Forbidden.</p>
<h3 id="heading-step-6-inspect-the-jwt">Step 6: Inspect the JWT</h3>
<p>A JWT (JSON Web Token) is a signed JSON payload that contains claims about the user. kubelogin caches the token locally under <code>~/.kube/cache/oidc-login/</code> so you don't have to log in on every kubectl command.</p>
<p>List the directory to find the cached file:</p>
<pre><code class="language-bash">ls ~/.kube/cache/oidc-login/
</code></pre>
<p>Decode the JWT payload directly from the cache:</p>
<pre><code class="language-bash">cat ~/.kube/cache/oidc-login/$(ls ~/.kube/cache/oidc-login/ | grep -v lock | head -1) | \
  python3 -c "
import json, sys, base64
token = json.load(sys.stdin)['id_token'].split('.')[1]
token += '=' * (4 - len(token) % 4)
print(json.dumps(json.loads(base64.urlsafe_b64decode(token)), indent=2))
"
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-json">{
  "iss": "https://dex.127.0.0.1.nip.io:32000",
  "sub": "CiQwOGE4Njg0Yi1kYjg4LTRiNzMtOTBhOS0zY2QxNjYxZjU0NjYSBWxvY2Fs",
  "aud": "kubernetes",
  "exp": 1775307910,
  "iat": 1775221510,
  "email": "jane@example.com",
  "email_verified": true
}
</code></pre>
<p>The <code>email</code> claim becomes jane's Kubernetes username because the <code>AuthenticationConfiguration</code> maps <code>username.claim: email</code>. The <code>aud</code> matches the configured <code>audiences</code>. The <code>iss</code> matches the issuer <code>url</code>. This is how the API server validates the token without contacting Dex on every request — it only needs the CA certificate to verify the JWT signature.</p>
<h3 id="heading-step-7-map-oidc-groups-to-rbac">Step 7: Map OIDC groups to RBAC</h3>
<p>The <code>admin@example.com</code> user has a <code>groups</code> claim in the Dex config containing <code>platform-engineers</code>. Instead of creating individual RBAC bindings per user, you can bind permissions to a group — anyone whose JWT contains that group gets the permissions automatically:</p>
<pre><code class="language-yaml"># platform-engineers-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers     # matches the groups claim in the JWT
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>You're currently logged in as <code>jane@example.com</code> via the OIDC context, but jane only has <code>view</code> permissions — she can't create cluster-wide RBAC bindings. Switch back to the admin context to apply this:</p>
<pre><code class="language-bash">kubectl config use-context kind-k8s-auth
kubectl apply -f platform-engineers-binding.yaml
kubectl config use-context oidc@k8s-auth
</code></pre>
<p>Now clear the cached token to log out of jane's session, then trigger a new login as <code>admin@example.com</code>:</p>
<pre><code class="language-bash"># Clear the cached token — this is how you "log out" with kubelogin
rm -rf ~/.kube/cache/oidc-login/

# This will open the browser again for a fresh login
kubectl get pods -n default
</code></pre>
<p>Log in as <code>admin@example.com</code> with password <code>password</code>. This time the JWT will contain <code>"groups": ["platform-engineers"]</code>, which matches the <code>ClusterRoleBinding</code> you just created. The admin user gets full cluster access — without ever being added to a kubeconfig by name.</p>
<p>You can verify by decoding the new token (Step 6) — the <code>groups</code> claim will be present:</p>
<pre><code class="language-json">{
  "email": "admin@example.com",
  "groups": ["platform-engineers"]
}
</code></pre>
<p>This is the real power of OIDC group claims: you manage group membership in your identity provider, and Kubernetes permissions follow automatically. Add someone to the <code>platform-engineers</code> group in Dex (or any upstream IdP), and they get cluster-admin access on their next login — no kubeconfig or RBAC changes needed.</p>
<h2 id="heading-cloud-provider-authentication">Cloud Provider Authentication</h2>
<p>AWS, GCP, and Azure each give Kubernetes clusters a native authentication mechanism that ties into their IAM systems.</p>
<p>The implementations differ in API surface, but they all use the same underlying mechanism: OIDC token projection. Once you understand how Dex works above, these are all variations on the same theme.</p>
<h3 id="heading-aws-eks">AWS EKS</h3>
<p>EKS uses the <code>aws-iam-authenticator</code> to translate AWS IAM identities into Kubernetes identities. When you run <code>kubectl</code> against an EKS cluster, the AWS CLI generates a short-lived token signed with your IAM credentials. The API server passes this token to the aws-iam-authenticator webhook, which verifies it against AWS STS and returns the corresponding username and groups.</p>
<p>User access is controlled via the <code>aws-auth</code> ConfigMap in <code>kube-system</code>, which maps IAM role ARNs and IAM user ARNs to Kubernetes usernames and groups. A typical entry looks like this:</p>
<pre><code class="language-yaml"># In kube-system/aws-auth ConfigMap
mapRoles:
  - rolearn: arn:aws:iam::123456789:role/platform-engineers
    username: platform-engineer:{{SessionName}}
    groups:
      - platform-engineers
</code></pre>
<p>AWS is migrating from the <code>aws-auth</code> ConfigMap to a newer Access Entries API, which manages the same mapping through the EKS API rather than a ConfigMap. The underlying authentication mechanism is the same.</p>
<h3 id="heading-google-gke">Google GKE</h3>
<p>GKE integrates with Google Cloud IAM using two different mechanisms, depending on whether you're authenticating as a human user or as a workload.</p>
<p>For human users, GKE accepts standard Google OAuth2 tokens. Running <code>gcloud container clusters get-credentials</code> writes a kubeconfig that uses the <code>gcloud</code> CLI as a credential plugin, generating short-lived tokens from your Google account automatically.</p>
<p>For pod-level identity — letting a pod assume a Google Cloud IAM role — GKE uses Workload Identity. You annotate a Kubernetes service account to bind it to a Google Service Account, and pods running as that service account can call Google Cloud APIs using the GSA's permissions:</p>
<pre><code class="language-bash"># Bind a Kubernetes SA to a Google Service Account
kubectl annotate serviceaccount my-app \
  --namespace production \
  iam.gke.io/gcp-service-account=my-app@my-project.iam.gserviceaccount.com
</code></pre>
<h3 id="heading-azure-aks">Azure AKS</h3>
<p>AKS integrates with Azure Active Directory. When Azure AD integration is enabled, <code>kubectl</code> requests an Azure AD token on behalf of the user via the Azure CLI, and the AKS API server validates it against Azure AD.</p>
<p>For pod-level identity, AKS uses Azure Workload Identity, which follows the same OIDC federation pattern as GKE Workload Identity. A Kubernetes service account is annotated with an Azure Managed Identity client ID, and pods can request Azure AD tokens without storing any credentials:</p>
<pre><code class="language-bash"># Annotate a service account with the Azure Managed Identity client ID
kubectl annotate serviceaccount my-app \
  --namespace production \
  azure.workload.identity/client-id=&lt;MANAGED_IDENTITY_CLIENT_ID&gt;
</code></pre>
<p>The underlying pattern across all three providers is the same: a trusted OIDC token is issued by the cloud provider, verified by the Kubernetes API server, and mapped to an identity through a binding (the <code>aws-auth</code> ConfigMap, a GKE Workload Identity binding, or an AKS federated identity credential). The OIDC section in this article is the conceptual foundation for all of them.</p>
<h2 id="heading-webhook-token-authentication">Webhook Token Authentication</h2>
<p>Webhook token authentication is worth knowing about because it appears in several common Kubernetes setups, even if you never configure it yourself.</p>
<p>When a request arrives with a bearer token that no other authenticator recognises, Kubernetes can send that token to an external HTTP endpoint for validation. The endpoint returns a response indicating who the token belongs to.</p>
<p>This is how EKS authentication worked before the aws-iam-authenticator was built into the API server. It's also how bootstrap tokens work during node join operations: a token is generated, embedded in the <code>kubeadm join</code> command, and validated by the bootstrap webhook when the new node contacts the API server for the first time.</p>
<p>For most clusters, you'll encounter webhook auth as something already running rather than something you configure. The main thing to know is that it exists and what it looks like when it appears in logs or configuration.</p>
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the OIDC demo cluster
kind delete cluster --name k8s-auth

# Remove generated certificate files
rm -f ca.crt ca.key jane.key jane.csr jane.crt jane.kubeconfig
rm -f dex-ca.crt dex-ca.key dex.crt dex.key dex.csr dex-ca.srl auth-config.yaml

# Remove the kubelogin token cache
rm -rf ~/.kube/cache/oidc-login/
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kubernetes authentication is not a single mechanism — it's a chain of pluggable strategies, each one suited to different use cases. In this article you worked through the most important ones.</p>
<p>x509 client certificates are how Kubernetes works out of the box. The CN field becomes the username, the O field becomes the group, and the cluster CA is the trust anchor. You created a certificate for a new user, bound it to RBAC, and saw exactly how authentication and authorisation interact — authentication gets you in, RBAC determines what you can do.</p>
<p>You also saw the fundamental limitation: Kubernetes doesn't check certificate revocation lists, so a compromised certificate remains valid until it expires. This makes certificates a poor fit for human users in production environments.</p>
<p>OIDC is the production-grade answer. Tokens are short-lived, issued by a trusted identity provider, and map directly to Kubernetes groups through JWT claims. You deployed Dex as a self-hosted OIDC provider, configured the API server to trust it, and set up kubelogin for browser-based authentication.</p>
<p>You then decoded a JWT to see exactly what the API server reads from it, and mapped an OIDC group claim to a Kubernetes ClusterRoleBinding.</p>
<p>Cloud provider authentication — EKS, GKE, AKS — uses the same OIDC foundation with provider-specific wrappers. Understanding how Dex works makes each of those systems immediately readable.</p>
<p>All YAML, certificates, and configuration files from this article are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Run Multiple Kubernetes Clusters Without the Overhead Using kcp ]]>
                </title>
                <description>
                    <![CDATA[ In Kubernetes, when you need to isolate workloads, you might start by using namespaces. Namespaces provide a simple way to separate workloads within a single cluster. But as your requirements grow, es ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-run-multiple-kubernetes-clusters-without-the-overhead-using-kcp/</link>
                <guid isPermaLink="false">69c6ea5a7cf27065104ab997</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ multi-cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #multitenancy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ consumer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Provider ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Olalekan Odukoya ]]>
                </dc:creator>
                <pubDate>Fri, 27 Mar 2026 20:36:42 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a42c1a28-7a9e-4676-891d-eae7d64f2900.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In Kubernetes, when you need to isolate workloads, you might start by using namespaces. Namespaces provide a simple way to separate workloads within a single cluster.</p>
<p>But as your requirements grow, especially around compliance, security, multi-tenancy, or conflicting dependencies, your team will likely move beyond namespaces and start creating separate clusters.</p>
<p>What starts as a clean separation quickly becomes cluster sprawl, bringing higher costs, complex networking, and constant operational overhead.</p>
<p>In this article, we'll explore how <strong>kcp</strong> can help fix this problem by allowing you to run multiple “logical clusters” inside a single control plane.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-challenge-of-namespaces-and-multiple-kubernetes-clusters">The Challenge of Namespaces and Multiple Kubernetes Clusters</a></p>
</li>
<li><p><a href="#heading-introducing-kcp">Introducing kcp</a></p>
</li>
<li><p><a href="#heading-getting-started-with-kcp">Getting Started with kcp</a></p>
</li>
<li><p><a href="#heading-deploying-and-managing-applications">Deploying and Managing Applications</a></p>
</li>
<li><p><a href="#heading-beyond-the-primitives-what-we-didnt-cover">Beyond the Primitives: What We Didn't Cover</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p><strong>kubectl</strong> installed.</p>
</li>
<li><p>A terminal to run commands</p>
</li>
<li><p><strong>Curl</strong> installed</p>
</li>
</ul>
<h2 id="heading-the-challenge-of-namespaces-and-multiple-kubernetes-clusters">The Challenge of Namespaces and Multiple Kubernetes Clusters</h2>
<p>While namespaces provide some level of isolation, many teams often default to creating entirely new Kubernetes clusters to achieve stronger multi-tenancy, environment separation, or geographic distribution.</p>
<p>At first, this approach works well. But as systems grow, managing a fleet of clusters introduces challenges that often outweigh the benefits.</p>
<p>Every new cluster comes with its own control plane, which you'll need to continuously patch, upgrade, and monitor. Over time, this operational overhead will add up, consuming cycles that platform teams could otherwise spend on higher-value work.</p>
<p>Also, clusters don't naturally share service discovery or identity. This forces you to introduce extra layers like service meshes or VPN-based networking, which increases your system's complexity and expands the overall attack surface.</p>
<p>There’s also the cost factor. Clusters incur baseline infrastructure costs regardless of how much workload they run. Creating dedicated clusters for small teams can lead to underutilized resources or, worse, delay the creation of necessary environments because the cost feels too high.</p>
<p>As a result, platform teams often find themselves acting as “cluster plumbers”, spending more time maintaining infrastructure than enabling developer productivity.</p>
<h3 id="heading-illustrating-the-namespace-problem">Illustrating the Namespace Problem</h3>
<p>As I mentioned earlier, when managing multiple clusters gets too complex, a natural alternative is to use namespaces for isolation within a single cluster.</p>
<p>At first glance, this seems like the perfect solution.</p>
<p>But to understand where this approach falls short, let’s walk through a real-world example using a common requirement in shared Kubernetes environments: running databases.</p>
<p>We'll start by creating different namespaces for each team:</p>
<pre><code class="language-shell">➜ ~ kubectl create namespace team-a 
➜ ~ kubectl create namespace team-b
</code></pre>
<p>Let's say <strong>Team A</strong> needs a MongoDB database for one of its services. The team must first install the required <a href="https://github.com/mongodb/mongodb-kubernetes">MongoDB Custom Resource Definitions (CRDs)</a> into the cluster, so Kubernetes knows how to understand the different <code>MongoDB</code> resources:</p>
<pre><code class="language-shell">➜ ~ kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/crds.yaml

customresourcedefinition.apiextensions.k8s.io/clustermongodbroles.mongodb.com created customresourcedefinition.apiextensions.k8s.io/mongodb.mongodb.com created customresourcedefinition.apiextensions.k8s.io/mongodbmulticluster.mongodb.com created customresourcedefinition.apiextensions.k8s.io/mongodbsearch.mongodb.com created customresourcedefinition.apiextensions.k8s.io/mongodbusers.mongodb.com created customresourcedefinition.apiextensions.k8s.io/opsmanagers.mongodb.com created customresourcedefinition.apiextensions.k8s.io/mongodbcommunity.mongodbcommunity.mongodb.com created
</code></pre>
<p>Secondly, <strong>Team A</strong> installs the actual Operator application (the controller that continuously runs the database logic) into their designated namespace:</p>
<pre><code class="language-shell">➜ ~ kubectl apply -n team-a -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/mongodb-kubernetes.yaml
</code></pre>
<p>But the installation isn't completed due to the error below:</p>
<pre><code class="language-shell">the namespace from the provided object "mongodb" does not match the namespace "team-a". You must pass '--namespace=mongodb' to perform this operation.
</code></pre>
<p>Why did this fail? This is because most Kubernetes Operators are designed assuming they own the entire cluster and not just a single namespace.</p>
<p>To force the operator to run in <code>team-a</code>, we can modify the manifest on the fly:</p>
<pre><code class="language-shell">curl -s https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/mongodb-kubernetes.yaml \
  | sed 's/namespace: mongodb/namespace: team-a/g' \
  | kubectl apply -f 
</code></pre>
<p>We can then confirm that the operator is installed and running:</p>
<pre><code class="language-plaintext">➜ ~ k get po -n team-a 
NAME                                          READY STATUS  RESTARTS AGE 
mongodb-kubernetes-operator-6f5f8bb7fd-8h5hj  1/1   Running 0        59s
</code></pre>
<p>But even after tricking the Operator into running inside <code>team-a</code>'s namespace, we still haven't solved the real problem.</p>
<p>At first glance, <code>team-a</code>'s operator is neatly confined to their namespace. But remember Step 1? <strong>The CRDs aren't namespaced – they're strictly cluster-scoped.</strong> So, even though <code>team-a</code> orchestrated this deployment purely for their own use, those CRDs are now globally registered across the entire cluster.</p>
<p>If Team B checks the API, they'll see all the MongoDB-related CRDs installed by Team A.</p>
<pre><code class="language-shell">➜ ~ kubectl get crds | grep mongodb

clustermongodbroles.mongodb.com               2026-03-24T10:49:35Z
mongodb.mongodb.com                           2026-03-24T10:49:36Z
mongodbcommunity.mongodbcommunity.mongodb.com 2026-03-24T10:49:38Z
mongodbmulticluster.mongodb.com               2026-03-24T10:49:36Z
mongodbsearch.mongodb.com                     2026-03-24T10:49:37Z 
mongodbusers.mongodb.com                      2026-03-24T10:49:37Z 
opsmanagers.mongodb.com                       2026-03-24T10:49:37Z
</code></pre>
<p>Now consider what happens if Team B needs to install a different version of MongoDB for its own services. Because the CRDs are shared across the cluster, both teams are now coupled to the same definitions. This means one team’s changes can easily impact the other, turning what should be isolated environments into a source of conflict.</p>
<h2 id="heading-introducing-kcp">Introducing kcp</h2>
<p><strong>kcp</strong> is an open-source project that lets you run multiple logical Kubernetes clusters on a single control plane.</p>
<p>These logical clusters are called <strong>workspaces</strong>, and each one behaves like an independent Kubernetes cluster. Every workspace has its own API endpoint, authentication, authorization, and policies, giving teams the experience of working in fully isolated environments.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e6abef0af89662115c0f5ca/ede32f6e-c260-426e-8d50-4f78f11fa1b1.svg" alt="brief kcp architecture and component" style="display:block;margin:0 auto" width="913.6673828124999" height="688.4281249999999" loading="lazy">

<p>This decoupling of the control plane from the worker nodes is what makes kcp different.</p>
<p>In traditional Kubernetes, spinning up a new cluster means provisioning a new API server, a new etcd instance, and all the associated controllers. With kcp, you spin up a workspace, and you have a strong, confined environment for your workload.</p>
<p>It's worth noting that <strong>kcp itself doesn't run workloads.</strong> It's strictly a control plane. Your actual applications still run on physical Kubernetes clusters. kcp only manages the workspaces and the synchronization of resources to those underlying clusters.</p>
<h2 id="heading-getting-started-with-kcp">Getting Started with kcp</h2>
<p>Now that we've covered what kcp is and why it matters, let's get our hands dirty. We'll set up a local kcp environment and explore the core concepts in action.</p>
<p>To make this realistic, we'll follow a common kcp workflow: a platform team that provides custom APIs, and tenant teams that consume them.</p>
<p>In our case, the platform team will export a MongoDB API, and our two tenant teams will subscribe to those APIs using <strong>APIBindings</strong>. Once bound, they can deploy MongoDB instances into their workspaces and sync them to physical clusters.</p>
<p>This pattern is at the heart of how kcp enables scalable multi-tenancy. The platform team controls the API definitions and versioning. Tenant teams get self-service access without needing to understand the underlying infrastructure. Let's see how it works!</p>
<h3 id="heading-installing-kcp">Installing kcp</h3>
<p>Running kcp locally is incredibly lightweight since there are no heavy worker nodes to spin up. You will need two things: the <code>kcp</code> server itself, <code>kubectl-kcp</code> , and the <code>kubectl-ws</code> plugin to manage workspaces.</p>
<p>To install the binaries, let's head over to the <a href="https://github.com/kcp-dev/kcp/releases/tag/v0.30.1">kcp-dev releases page</a>.</p>
<p>The commands below are for macOS Apple Silicon. If you're using an Intel Mac or Linux, simply replace <code>darwin_arm64</code> with your respective architecture.</p>
<ol>
<li>Download the kcp server and workspace plugins:</li>
</ol>
<pre><code class="language-shell">➜ ~ curl -LO https://github.com/kcp-dev/kcp/releases/download/v0.30.1/kcp_0.30.1_darwin_arm64.tar.gz 

➜ ~ curl -LO https://github.com/kcp-dev/kcp/releases/download/v0.30.1/kubectl-kcp-plugin_0.30.1_darwin_arm64.tar.gz

➜ ~ curl -LO https://github.com/kcp-dev/kcp/releases/download/v0.30.1/kubectl-ws-plugin_0.30.1_darwin_arm64.tar.gz
</code></pre>
<ol>
<li>Extract the archives:</li>
</ol>
<pre><code class="language-shell">➜ ~ tar -xzf kcp_0.30.1_darwin_arm64.tar.gz 
➜ ~ tar -xzf kubectl-kcp-plugin_0.30.1_darwin_arm64.tar.gz
➜ ~ tar -xzf kubectl-ws-plugin_0.30.1_darwin_arm64.tar.gz
</code></pre>
<ol>
<li>Move the required binaries into your <strong>PATH</strong>:</li>
</ol>
<pre><code class="language-shell">➜ ~ sudo mv bin/kcp /usr/local/bin/
➜ ~ sudo mv bin/kubectl-kcp /usr/local/bin/
➜ ~ sudo mv bin/kubectl-ws /usr/local/bin/
</code></pre>
<p>You can confirm the installation by checking the version.</p>
<pre><code class="language-shell">➜ ~ kcp --version
kcp version v1.33.3+kcp-v0.0.0-627385a6
</code></pre>
<h3 id="heading-starting-the-server">Starting the Server</h3>
<p>With the binaries installed, let's boot up your local control plane and bind it to localhost. But first, let's create a "work-folder".</p>
<pre><code class="language-plaintext">➜ ~ mkdir kcp-test
➜ ~ cd kcp-test
</code></pre>
<p>We can then start the kcp server in this directory.</p>
<pre><code class="language-shell">➜ ~ kcp start --bind-address=127.0.0.1
</code></pre>
<p>You'll see a flurry of logs as kcp boots up its internal database and exposes the API server. Leave this terminal running in the background.</p>
<h3 id="heading-connecting-to-the-root-workspace">Connecting to the Root Workspace</h3>
<p>Open a new terminal window and navigate back into the <code>kcp-test</code> folder we just created.</p>
<p>At first, if you run a standard <code>ls</code> command, the folder will look empty. But during startup, kcp silently generated a hidden <code>.kcp</code> directory that contains our local certificates and our administrative <code>kubeconfig</code> file. Let's verify that:</p>
<pre><code class="language-shell">➜ ~ cd kcp-test 
➜ kcp-test ls
➜ kcp-test ls -a . .. .kcp 
➜ kcp-test ls .kcp admin.kubeconfig apiserver.crt apiserver.key etcd-server sa.key
</code></pre>
<p>Now that we know exactly where the configuration file lives, let's export it so our <code>kubectl</code> commands are routed to kcp instead of your default cluster:</p>
<pre><code class="language-plaintext">export KUBECONFIG=$PWD/.kcp/admin.kubeconfig
</code></pre>
<p>Finally, let's use the workspace plugin we installed earlier to verify that we're connected accurately:</p>
<pre><code class="language-shell"> ➜ kubectl ws .
</code></pre>
<p>You should see the message below printed to the console:</p>
<pre><code class="language-shell">Current workspace is 'root'.
</code></pre>
<p>This shows that you're now officially inside the kcp <strong>root workspace</strong>. This is the highest-level administrative boundary where we'll begin creating our tenant logical clusters.</p>
<h3 id="heading-creating-and-managing-workspaces">Creating and Managing Workspaces</h3>
<p>As we discussed above, in a standard Kubernetes cluster, separating teams means using <code>kubectl create namespace</code>. In kcp, we solve the problem by creating entirely isolated logical clusters – workspaces.</p>
<p>If you recall our architecture diagram from earlier, we want to create three distinct environments for our company: one for the platform engineers to manage shared APIs, and two for our isolated tenant development teams.</p>
<p>Since we're currently inside the administrative <code>root</code> workspace, we can create our new tenant workspaces as children of the <code>root</code>:</p>
<pre><code class="language-plaintext">➜ kubectl ws create platform-team
Workspace "platform-team" (type root:organization) created.
Waiting for it to be ready... 
Workspace "platform-team" (type root:organization) is ready to use.

➜ kubectl ws create team-a 
Workspace "team-a" (type root:organization) created.
Waiting for it to be ready... 
Workspace "team-a" (type root:organization) is ready to use.

➜ kubectl ws create team-b
Workspace "team-b" (type root:organization) created.
Waiting for it to be ready... 
Workspace "team-b" (type root:organization) is ready to use.
</code></pre>
<p>Now, here is where kcp truly shines. Unlike a standard cluster, where objects are just a massive flat list, kcp manages its API as a hierarchy. We can visually prove the structure of our new logical clusters using the <code>tree</code> command:</p>
<pre><code class="language-shell">➜ kubectl ws tree
.
└── root
      ├── platform-team
      ├── team-a
      └── team-b
</code></pre>
<p>Jumping between these logical clusters is as fast as changing directories in a terminal. Let's switch our context over into Team A's workspace:</p>
<pre><code class="language-plaintext">➜ kubectl ws team-a 
Current workspace is 'root:team-a' (type root:organization).
</code></pre>
<h4 id="heading-proving-the-isolation">Proving the Isolation</h4>
<p>To truly understand the power of what we just did, let's try running a standard Kubernetes command while inside <code>team-a</code>:</p>
<pre><code class="language-plaintext">➜ kubectl get namespaces

NAME STATUS AGE 
default Active 15m
</code></pre>
<p>Let's also ask the cluster what APIs are actually available to us out of the box:</p>
<pre><code class="language-plaintext">➜ kubectl api-resources
</code></pre>
<p>Your output should be similar to what is in the image below:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e6abef0af89662115c0f5ca/775eff52-8ae7-4363-bd37-fce6ab0cc587.png" alt="775eff52-8ae7-4363-bd37-fce6ab0cc587" style="display:block;margin:0 auto" width="2022" height="1360" loading="lazy">

<p>When you take a closer look at that list. You'll notice that there are no Pods, Deployments, or even ReplicaSets. You don't see all the available APIs that you see in a standard Kubernetes Cluster.</p>
<p>This output proves exactly what we discussed in the architecture section. kcp is incredibly lightweight because every new workspace is born <strong>completely stripped of compute</strong>. Out of the box, it only contains the absolute bare-minimum control plane APIs needed for routing, RBAC, namespaces, and authentication.</p>
<p>From Team A's perspective, they own this pristine, empty universe. If they install a massive, noisy operator right now, like the MongoDB CRD, it will only exist right here in this specific API bucket.</p>
<p>But this raises the ultimate question: If there are no <code>Deployments</code> or <code>Pods</code> APIs in this workspace... how do we actually deploy our applications?</p>
<h2 id="heading-deploying-and-managing-applications">Deploying and Managing Applications</h2>
<p>Now that we have set up our isolated environments, we must address the glaring issue from our last terminal output: <strong>How do developers actually deploy applications</strong> if there are no <code>Deployment</code> or <code>Pod</code> APIs?</p>
<p>In standard Kubernetes, the API is monolithic. You get everything whether you need it or not, and adding a new schema (like an Operator) forces it globally onto everyone.</p>
<p>kcp takes the exact opposite approach. Every workspace starts completely empty. You then selectively "subscribe" your workspace to only the APIs you actually need using two incredibly powerful new concepts: <strong>APIExports</strong> and <strong>APIBindings</strong>.</p>
<p>Let's see exactly how this solves our MongoDB multi-tenancy problem, step by step.</p>
<h3 id="heading-1-the-platform-team-exports-the-api">1. The Platform Team "Exports" the API</h3>
<p>Instead of treating Custom Resource Definitions as global hazards, the platform engineers manage them centrally. First, lets switch into the platform-team workspace:</p>
<pre><code class="language-plaintext">➜ kubectl ws :root:platform-team

Current workspace is 'root:platform-team' (type root:organization).
</code></pre>
<p>Here, we'll install the MongoDB Operator CRDs in the platform-team's workspace:</p>
<pre><code class="language-plaintext">➜ kubectl apply -f kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-kubernetes/1.7.0/public/crds.yaml
</code></pre>
<p>To confirm that this is indeed isolated, let's first check what CRDs were installed,</p>
<pre><code class="language-shell">➜ kubectl get crd

NAME                                          CREATED AT
clustermongodbroles.mongodb.com               2026-03-24T20:45:50Z
mongodb.mongodb.com                           2026-03-24T20:45:50Z
mongodbcommunity.mongodbcommunity.mongodb.com 2026-03-24T20:45:51Z
mongodbmulticluster.mongodb.com               2026-03-24T20:45:50Z
mongodbsearch.mongodb.com                     2026-03-24T20:45:51Z
mongodbusers.mongodb.com                      2026-03-24T20:45:51Z
opsmanagers.mongodb.com                       2026-03-24T20:45:51Z
</code></pre>
<p>We can switch to <code>team-a'</code>s workspace (any of the team's workspaces can be used, we're just trying to establish that the installed <em><strong>CRD</strong></em> is only visible in the <code>platform-team'</code>s workspace).</p>
<pre><code class="language-shell">➜ kubectl ws :root:team-a

Current workspace is 'root:team-a' (type root:organization).
</code></pre>
<pre><code class="language-plaintext">➜ kubectl get crd 
No resources found
</code></pre>
<p>What we get as output is that there are no custom resources found or registered. This is the power of kcp.</p>
<p>If you don't want to continually type out paths to switch between your logical clusters, the <code>kcp</code> plugin includes a powerful interactive UI right in your terminal.</p>
<p>By running <code>kubectl ws -i</code>, you can use your arrow keys to navigate through your hierarchy and press <code>Enter</code> to instantly switch your context. Even better, this interactive mode provides a holistic view of your environment at any given time. With a single glance, you can see exactly how many APIExports are hosted inside a specific workspace, or which APIs are currently <strong>bound</strong> by other workspaces.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4d86d960-a23c-4cb1-8155-6fe236240893.png" alt="4d86d960-a23c-4cb1-8155-6fe236240893" style="display:block;margin:0 auto" width="2992" height="1796" loading="lazy">

<p>Let's switch back to the <code>platform-team'</code>s workspace to continue with our setup.</p>
<p>Now, we need to do something kcp-specific. If you check your resources right now, those CRDs are strictly local to this workspace. To safely share them with our tenant teams, we need to convert them into an internal kcp tracking object called an <strong>APIResourceSchema</strong>. This is how kcp structurally version-controls APIs so they can be securely exported.</p>
<p>To do this, we use our <code>kcp</code> plugin to take a "snapshot" of the local MongoDB CRD:</p>
<pre><code class="language-plaintext">kubectl get crd mongodbcommunity.mongodbcommunity.mongodb.com -o yaml | kubectl kcp crd snapshot -f - --prefix v1 | kubectl apply -f -
</code></pre>
<p>You should see an output that says:</p>
<blockquote>
<p>apiresourceschema.apis.kcp.io/v1.mongodbcommunity.mongodbcommunity.mongodb.com created</p>
</blockquote>
<p>This tells kcp: "Get the CRD we just installed, take a snapshot with the prefix 'v1', and apply the resulting <strong>APIResourceSchema</strong> back to the cluster."</p>
<p>Now, let's look for the schema kcp just generated for us:</p>
<pre><code class="language-plaintext">➜ kubectl get apiresourceschemas

NAME                                             AGE
v1.mongodbcommunity.mongodbcommunity.mongodb.com 11s
</code></pre>
<p>To safely share this API with our teams, we wrap that generated schema into an <code>APIExport</code>. This acts like "APIs as a Service," publishing the schema so that other workspaces can optionally choose to consume it.</p>
<p>Let's create the Export using the exact schema name we just found:</p>
<pre><code class="language-shell">➜ cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: apis.kcp.io/v1alpha1
kind: APIExport
metadata:
  name: mongodb-v1
spec:
  latestResourceSchemas:
    - v1.mongodbcommunity.mongodbcommunity.mongodb.com
EOF
</code></pre>
<p>We can confirm this was successfully created by checking the APIExport resource we have</p>
<pre><code class="language-plaintext">➜ kubectl get apiexports

NAME       AGE
mongodb-v1 2m46s
</code></pre>
<h3 id="heading-2-tenant-teams-bind-to-the-api">2. Tenant Teams "Bind" to the API</h3>
<p>Now let's switch our terminal context back over to Team A. Remember our previous output? Their workspace currently has no idea what a MongoDB cluster is. Let's prove it:</p>
<pre><code class="language-plaintext">➜ kubectl ws :root:team-a
Current workspace is "root:team-a" (type root:organization).

➜ kubectl api-resources | grep mongodb
# (No output. The API does not exist here!)
</code></pre>
<p>To securely subscribe to the platform team's newly created API service, Team A needs to create an <code>APIBinding</code>.</p>
<p>While we can write standard Kubernetes YAML to do this, the <code>kcp</code> plugin provides a <code>bind</code> command. Team A simply points the <code>bind</code> command directly at the workspace and the specific API export they want to consume:</p>
<pre><code class="language-plaintext">➜ kubectl kcp bind apiexport root:platform-team:mongodb-v1
apibinding mongodb-v1 created. Waiting to successfully bind ...
mongodb-v1 created and bound.

➜ kcp-test kubectl get apibindings
NAME                  AGE   READY
mongodb-v1            73s   True
tenancy.kcp.io-bqt7a  7h10m True
topology.kcp.io-9dlvq 7h10m True
</code></pre>
<p>The moment Team A executes that <code>bind</code> command, their workspace is magically updated with the new capabilities. Let's check our <code>api-resources</code> one more time:</p>
<pre><code class="language-plaintext">➜ kubectl api-resources | grep mongodb
mongodbcommunity mdbc mongodbcommunity.mongodb.com/v1 true MongoDBCommunity
</code></pre>
<h2 id="heading-beyond-the-primitives-what-we-didnt-cover">Beyond the Primitives: What We Didn't Cover</h2>
<p>At this point, you should have a firm, hands-on grasp of the core user primitives of kcp, that is <strong>Workspaces</strong>, <strong>APIExports</strong>, and <strong>APIBindings</strong>. But we've only just scratched the surface of what this architecture makes possible.</p>
<p>To keep this guide digestible, there are a few massive topics that I deliberately didn't cover in this article:</p>
<ol>
<li><p><strong>Shards and High Availability:</strong> Since kcp is designed to host thousands of logical clusters, a single database isn't enough. kcp introduces the <code>Shard</code> primitive, allowing platform administrators to horizontally partition workspace state across multiple underlying <code>etcd</code> instances. This gives kcp infinite scalability and massive High Availability (HA) without complicating the developer experience.</p>
</li>
<li><p><strong>Front-Proxy:</strong> When kcp scales to host millions of logical clusters, it needs a way to seamlessly direct traffic. The kcp <strong>Front-Proxy</strong> sits at the absolute edge of the architecture, dynamically routing incoming <code>kubectl</code> API requests go straight to the correct underlying workspace and shard. It ensures the developer experience feels perfectly unified, no matter how massive the background infrastructure actually becomes.</p>
</li>
<li><p><strong>Virtual Workspaces:</strong> While the workspaces we built today act as simple isolated buckets of state, kcp also supports <strong>Virtual Workspaces</strong>. These act as dynamic, read-only projections of data. For example, <em><strong>kcp</strong></em> uses virtual workspaces to project a unified view of a specific API across multiple tenant workspaces so that controllers can easily watch them all at once.</p>
</li>
<li><p><strong>APIExportEndpointSlices:</strong> Just like standard Kubernetes uses endpoints to route traffic to pods, kcp uses <code>EndpointSlices</code> to efficiently route and scale the delivery of massive <code>APIExports</code> across thousands of consuming workspaces.</p>
</li>
<li><p><strong>Wiring up the Sync Agent (</strong><code>api-syncagent</code><strong>):</strong> We discussed this conceptually in our architecture diagram, but we didn't actually attach a physical cluster. In a production scenario, you deploy the Sync Agent onto a fleet of downstream execution clusters (like EKS, GKE, or On-Premises environments) to automatically pull workloads safely out of kcp and execute them seamlessly on physical hardware.</p>
</li>
<li><p><strong>External Integrations Like Crossplane:</strong> Because kcp acts purely as a multi-tenant API control plane, it pairs incredibly well with <strong>Crossplane</strong>. By publishing Crossplane as an <code>APIExport</code>, you can empower developer teams to provision actual cloud infrastructure (like AWS databases or Cloud Spanners) using standard YAML directly from their completely isolated kcp workspaces.</p>
</li>
</ol>
<p>We will cover those advanced integrations in a future deep-dive. But armed with just the base primitives we built today, we can already solve the incredibly complex infrastructure problems we outlined at the beginning of the article.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Sync AWS Secrets Manager Secrets into Kubernetes with the External Secrets Operator ]]>
                </title>
                <description>
                    <![CDATA[ If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently? Storing them is straightforward. But handling rotation, stale env vars, and the gap ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-sync-aws-secrets-manager-secrets-into-kubernetes-with-the-external-secrets-operator/</link>
                <guid isPermaLink="false">69c541f010e664c5dadc877e</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cybersecurity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Terraform ]]>
                    </category>
                
                    <category>
                        <![CDATA[ secrets management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SRE ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Thu, 26 Mar 2026 14:25:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6cca126e-dd50-4400-ae9d-65449581345b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If someone asked you how secrets flow from AWS Secrets Manager into a running pod, could you explain it confidently?</p>
<p>Storing them is straightforward. But handling rotation, stale env vars, and the gap between what your pod reads and what AWS actually holds is where many engineers go quiet.</p>
<p>In this guide, you'll build a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods. You'll provision the infrastructure with Terraform, sync secrets using the External Secrets Operator, and run a sample application that reads the same credentials in two different ways: via environment variables and via a volume mount.</p>
<p>By the end, you'll be able to:</p>
<ul>
<li><p>Explain the full architecture from vault to pod</p>
</li>
<li><p>Run the lab locally in about 15 minutes</p>
</li>
<li><p>Prove why environment variables go stale after rotation, while mounted secret files stay fresh</p>
</li>
<li><p>Deploy the same pattern on Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD</p>
</li>
<li><p>Troubleshoot the most common failures</p>
</li>
</ul>
<p>Below is an architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/ac8bfc9e-304e-41b8-b6a3-7ce1795b29a9.png" alt="Architecture diagram showing secrets flowing from AWS Secrets Manager through the External Secrets Operator into a Kubernetes Secret, then splitting into environment variables set at pod start and a volume mount that updates within 60 seconds." style="display:block;margin:0 auto" width="1024" height="1536" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-local-lab">How to Run the Local Lab</a></p>
</li>
<li><p><a href="#heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</a></p>
</li>
<li><p><a href="#heading-how-to-test-secret-rotation">How to Test Secret Rotation</a></p>
</li>
<li><p><a href="#heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</a></p>
</li>
<li><p><a href="#heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</a></p>
</li>
<li><p><a href="#heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</a></p>
</li>
<li><p><a href="#heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following tools installed and configured.</p>
<p><strong>For the local lab:</strong></p>
<ul>
<li><p>An AWS account with access to AWS Secrets Manager</p>
</li>
<li><p>The AWS CLI installed and configured. Run <code>aws configure</code> and provide your access key, secret key, default region, and output format. The credentials need permission to read and write secrets in AWS Secrets Manager.</p>
</li>
<li><p><code>kubectl</code> installed. For Microk8s, run <code>microk8s kubectl config view --raw &gt; ~/.kube/config</code> after installation to connect kubectl to your local cluster.</p>
</li>
<li><p>Terraform installed</p>
</li>
<li><p>Helm installed</p>
</li>
<li><p>Docker installed</p>
</li>
<li><p>A local Kubernetes cluster: the lab supports Microk8s and kind. If you do not have either installed, follow the <a href="https://microk8s.io/">Microk8s install guide</a> before continuing.</p>
</li>
</ul>
<p><strong>For the Amazon Elastic Kubernetes Service sections:</strong></p>
<ul>
<li><p>An Amazon Elastic Kubernetes Service cluster you can create or manage</p>
</li>
<li><p>A GitHub repository you can configure for workflows and secrets</p>
</li>
</ul>
<p>The lab repository includes two deployment paths: a local path for fast learning and an Amazon Elastic Kubernetes Service path for a production-like setup. All the exact commands for each path live in the repo's <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-LOCAL.md"><code>docs/DEPLOY-LOCAL.md</code></a> and <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/DEPLOY-EKS.md"><code>docs/DEPLOY-EKS.md</code></a>.</p>
<h2 id="heading-how-to-understand-the-secret-flow">How to Understand the Secret Flow</h2>
<p>Before you run any command, you need to understand how the pieces connect.</p>
<p>The flow has four stages:</p>
<ol>
<li><p>A developer or automated system updates a secret in AWS Secrets Manager.</p>
</li>
<li><p>The External Secrets Operator polls AWS Secrets Manager on a schedule and creates or updates a Kubernetes Secret.</p>
</li>
<li><p>Your pod reads that Kubernetes Secret.</p>
</li>
<li><p>During rotation, the Kubernetes Secret updates, but your two consumption modes behave differently.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/9dc52f99-add4-490a-ad86-25a30d0ae306.png" alt="A step-by-step flow diagram showing the four stages of secret flow above" style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-how-the-external-secrets-operator-sync-works">How the External Secrets Operator Sync Works</h3>
<p>The External Secrets Operator reads a custom Kubernetes resource called <code>ExternalSecret</code>. That resource tells the operator three things:</p>
<ul>
<li><p>Which secret store to connect to</p>
</li>
<li><p>Which Kubernetes Secret name to create or update</p>
</li>
<li><p>How often to refresh</p>
</li>
</ul>
<p>In this lab, the <code>ExternalSecret</code> creates a Kubernetes Secret named <code>myapp-database-creds</code>. The operator also adds a template annotation that can trigger a pod restart when the secret rotates.</p>
<h3 id="heading-how-the-app-consumes-secrets">How the App Consumes Secrets</h3>
<p>The sample application exposes three endpoints so you can validate behavior at any time.</p>
<ul>
<li><p><code>/secrets/env</code> shows what environment variables the pod sees</p>
</li>
<li><p><code>/secrets/volume</code> shows what files in the mounted secret directory look like</p>
</li>
<li><p><code>/secrets/compare</code> compares both and reports whether rotation has been detected</p>
</li>
</ul>
<p>The app checks four keys: <code>DB_USERNAME</code>, <code>DB_PASSWORD</code>, <code>DB_HOST</code>, and <code>DB_PORT</code>.</p>
<h2 id="heading-how-to-run-the-local-lab">How to Run the Local Lab</h2>
<p>The local lab gives you a fast learning loop. You can see the full pipeline working and test rotation without waiting for a cloud deployment.</p>
<h3 id="heading-step-1-clone-the-repo">Step 1: Clone the Repo</h3>
<pre><code class="language-bash">git clone https://github.com/Osomudeya/k8s-secret-lab
cd k8s-secret-lab
</code></pre>
<h3 id="heading-step-2-run-the-spin-up-script">Step 2: Run the Spin-Up Script</h3>
<pre><code class="language-bash">bash spinup.sh
</code></pre>
<p>The script will ask you to choose a local cluster type. Pick Microk8s or kind, depending on what you have installed. The script installs the External Secrets Operator via Helm, applies the Terraform configuration, and deploys the sample application.</p>
<p>If the script fails at any point, check <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/docs/TROUBLESHOOTING.md"><code>docs/TROUBLESHOOTING.md</code></a> before retrying. The most common causes are missing AWS credentials, a misconfigured kubeconfig, or a Microk8s storage add-on that is not enabled.</p>
<h3 id="heading-important-run-the-lab-ui">Important: Run the Lab UI</h3>
<p>The lab ships with a separate guided tutorial interface that runs on your laptop. This is not the in-cluster application, it's a React-based checklist at <code>lab-ui/</code> that walks you through each concept and checkpoint as you work through the lab.</p>
<p>To start it, open a second terminal and run:</p>
<pre><code class="language-bash">cd lab-ui &amp;&amp; npm install &amp;&amp; npm run dev
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/873166e9-6bff-4e56-a18d-e58b9e9a5af9.png" alt="Screenshot of npm run dev lab ui terminal" style="display:block;margin:0 auto" width="849" height="435" loading="lazy">

<p>Then open <a href="http://localhost:5173"><code>http://localhost:5173</code></a>. You'll see a module-by-module guide covering the full flow from external secrets to rotation to CI/CD.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5a5b220b-3f23-4c7c-8388-f2e23d122e2c.png" alt="Screenshot of The Lab UI, a guided tutorial interface that runs alongside the lab and walks you through each concept and checkpoint." style="display:block;margin:0 auto" width="1399" height="1287" loading="lazy">

<p>Keep this terminal running alongside your lab. The Lab UI and the in-cluster app (<code>localhost:3000</code>) are two separate things, the UI guides you through the steps, the app shows you the live secrets.</p>
<h3 id="heading-step-3-access-the-application">Step 3: Access the Application</h3>
<p>Once the lab finishes, port-forward the service.</p>
<pre><code class="language-bash">kubectl port-forward svc/myapp 3000:80 -n default
</code></pre>
<p>Open <code>http://localhost:3000</code>. You should see a table showing each secret key and whether the environment variable value matches the volume mount value.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/dbe122ac-b787-40d0-96f4-4b1276bab017.png" alt="Screenshot of the running application at localhost:3000. Every row in the table should show &quot;Match ✓" style="display:block;margin:0 auto" width="1213" height="902" loading="lazy">

<h3 id="heading-step-4-validate-that-secrets-match">Step 4: Validate That Secrets Match</h3>
<p>Run the compare endpoint directly from the terminal.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>When everything is working, the response will include <code>"all_match": true</code>.</p>
<h2 id="heading-how-to-inspect-the-externalsecret-and-the-application">How to Inspect the ExternalSecret and the Application</h2>
<p>At this point the lab is running. Now you'll want to inspect the manifests so you understand what each part does.</p>
<h3 id="heading-step-1-read-the-externalsecret-manifest">Step 1: Read the ExternalSecret Manifest</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/external-secret.yaml"><code>k8s/aws/external-secret.yaml</code></a>. Focus on these four fields:</p>
<ul>
<li><p><code>refreshInterval</code>: how often the operator polls AWS Secrets Manager</p>
</li>
<li><p><code>secretStoreRef</code>: which store the operator authenticates against</p>
</li>
<li><p><code>target</code>: the name of the Kubernetes Secret to create</p>
</li>
<li><p><code>data</code>: the mapping from AWS Secrets Manager JSON keys to Kubernetes Secret keys</p>
</li>
</ul>
<p>Here is what that mapping looks like in this lab:</p>
<pre><code class="language-yaml">spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: myapp-database-creds
    creationPolicy: Owner
  data:
    - secretKey: DB_USERNAME
      remoteRef:
        key: prod/myapp/database
        property: username
</code></pre>
<p>The <code>property</code> field tells the operator which JSON key inside the AWS secret to extract. If your secret in AWS Secrets Manager is a JSON object, each field gets its own entry here.</p>
<p>Two fields here are worth understanding before you move on. <code>creationPolicy: Owner</code> means the operator owns the Kubernetes Secret it creates. If you delete the <code>ExternalSecret</code>, the Secret is deleted too. <code>ClusterSecretStore</code> is a cluster-scoped store, meaning any namespace in the cluster can use it. A plain <code>SecretStore</code> is namespace-scoped. For this lab, cluster-scoped is the right choice because it keeps the setup simple.</p>
<h3 id="heading-step-2-read-the-deployment-manifest">Step 2: Read the Deployment Manifest</h3>
<p>Open <a href="http://github.com/Osomudeya/k8s-secret-lab/blob/main/k8s/aws/deployment.yaml"><code>k8s/aws/deployment.yaml</code></a>. You are looking for two sections: <code>envFrom</code> and <code>volumeMounts</code>.</p>
<pre><code class="language-yaml">envFrom:
  - secretRef:
      name: myapp-database-creds

volumeMounts:
  - name: db-secret-vol
    mountPath: /etc/secrets
    readOnly: true
</code></pre>
<p>Both paths read from the same Kubernetes Secret, <code>myapp-database-creds</code>. The <code>envFrom</code> block injects all keys as environment variables at pod start.<br>The <code>volumeMounts</code> block mounts the same secret as files under <code>/etc/secrets</code>.</p>
<p>This is the core of the rotation lesson. Both paths read the same source. But they behave differently after that source changes.</p>
<h3 id="heading-step-3-read-the-app-comparison-logic">Step 3: Read the App Comparison Logic</h3>
<p>Open <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/app/server.js"><code>app/server.js</code></a>. The comparison logic reads environment variables from <code>process.env</code> and reads mounted secret files from <code>/etc/secrets/&lt;key&gt;</code>. Then it computes a per-key match and a global <code>all_match</code> value.</p>
<p>The <code>/secrets/compare</code> endpoint sets <code>rotation_detected: true</code> when any key differs between env and volume.</p>
<h2 id="heading-how-to-test-secret-rotation">How to Test Secret Rotation</h2>
<p>Secret rotation is where real teams feel pain. This lab makes that pain visible so you can explain it clearly and fix it confidently.</p>
<h3 id="heading-how-the-rotation-gap-works"><strong>How the Rotation Gap Works</strong></h3>
<p>When a pod starts, Kubernetes gives it two ways to read a secret.</p>
<p>The first way is environment variables. Think of these like sticky notes written on the wall of the container the moment it boots up. The value gets written once, at startup, and never changes. Even if the secret in AWS gets updated ten minutes later, the sticky note still says the old value. The container cannot see the update because nobody rewrote the note.</p>
<p>The second way is a volume mount. Think of this like a shared folder that someone else can update remotely. Kubernetes creates a small folder inside the container and puts the secret value in a file there. When the secret changes in AWS and ESO syncs it into Kubernetes, the kubelet quietly updates that file within about 60 seconds. The container reads the file fresh every time it needs the value, so it sees the new password automatically.</p>
<p>Same secret, two paths. One goes stale while one stays fresh.</p>
<p>The problem happens when your app reads the database password from the environment variable, the sticky note, and someone rotates the password in AWS. ESO updates Kubernetes. The file gets the new password. But your app is still reading the sticky note, which has the old one. Connection fails.</p>
<p>That difference isn't a bug. It's how the Linux process model and the kubelet work. Understanding it is the difference between knowing Kubernetes secrets and actually operating them.</p>
<p>Here is what you're about to observe in the lab:</p>
<ul>
<li><p>The rotation script updates the secret in AWS</p>
</li>
<li><p>ESO syncs the new value into Kubernetes within seconds</p>
</li>
<li><p>The volume file updates automatically</p>
</li>
<li><p>The environment variable stays stale until the pod restarts</p>
</li>
<li><p>The <code>/secrets/compare</code> endpoint shows both values side by side so you can see the gap live</p>
</li>
</ul>
<h3 id="heading-step-1-confirm-the-lab-is-ready">Step 1: Confirm the Lab Is Ready</h3>
<p>Make sure your pod and the External Secrets Operator are both running before you start.</p>
<pre><code class="language-bash">kubectl get pods -n external-secrets
kubectl get pods -n default
</code></pre>
<p>Both should show <code>Running</code>.</p>
<h3 id="heading-step-2-run-the-rotation-test-script">Step 2: Run the Rotation Test Script</h3>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>The script performs these actions in order:</p>
<ol>
<li><p>Reads the current <code>DB_PASSWORD</code> from the volume mount at <code>/etc/secrets/DB_PASSWORD</code></p>
</li>
<li><p>Reads the current <code>DB_PASSWORD</code> from the environment variable</p>
</li>
<li><p>Updates AWS Secrets Manager with a new password using <code>put-secret-value</code></p>
</li>
<li><p>Forces an immediate ESO sync by annotating the <code>ExternalSecret</code> with <code>force-sync</code></p>
</li>
<li><p>Reads the volume value again</p>
</li>
<li><p>Reads the environment variable again</p>
</li>
</ol>
<p>After the script runs, the volume and the env var will show different values.</p>
<h3 id="heading-step-3-validate-with-the-compare-endpoint">Step 3: Validate With the Compare Endpoint</h3>
<p>Hit the compare endpoint and look at the output.</p>
<pre><code class="language-bash">curl -s http://localhost:3000/secrets/compare | python3 -m json.tool
</code></pre>
<p>You'll see something like this:</p>
<pre><code class="language-json">{
  "comparison": {
    "DB_PASSWORD": {
      "env": "old-password-value",
      "volume": "new-password-value",
      "match": false
    }
  },
  "all_match": false,
  "rotation_detected": true,
  "message": "Volume has new value; env still has old value."
}
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/c4ebb09f-e605-4f68-8e12-1361d94199b2.png" alt="Rotation mismatch, the volume file updated with the new password but the env var still holds the old value from pod startup." style="display:block;margin:0 auto" width="832" height="290" loading="lazy">

<h3 id="heading-step-4-restart-the-deployment-to-sync-env-vars">Step 4: Restart the Deployment to Sync Env Vars</h3>
<p>Env vars don't update in place. You need a pod restart so new containers start with the updated Kubernetes Secret.</p>
<pre><code class="language-bash">kubectl rollout restart deployment/myapp -n default
kubectl rollout status deployment/myapp -n default
</code></pre>
<p>Then hit <code>/secrets/compare</code> again. All rows should now show <code>"all_match": true</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/0040274d-a398-408c-9486-ce0a9e527479.png" alt="After a rolling restart, new pods pick up fresh env vars and all keys match." style="display:block;margin:0 auto" width="821" height="436" loading="lazy">

<h3 id="heading-how-to-automate-restarts-with-reloader">How to Automate Restarts With Reloader</h3>
<p>If you don't want to restart deployments manually after every rotation, you can install <a href="https://github.com/stakater/reloader"><strong>Stakater Reloader</strong></a>. It watches an annotation on the <code>Deployment</code> and triggers a rolling restart automatically when the referenced Kubernetes Secret changes. New pods start with fresh env vars, while old pods drain cleanly. The repo's local deployment guide includes the install steps.</p>
<h2 id="heading-how-to-choose-between-external-secrets-operator-and-the-csi-driver">How to Choose Between External Secrets Operator and the CSI Driver</h2>
<p>Two patterns dominate when it comes to pulling external secrets into Kubernetes: the External Secrets Operator and the <a href="https://secrets-store-csi-driver.sigs.k8s.io/">Secrets Store CSI Driver</a>.</p>
<p>Both get cloud secrets into pods, but they do it differently. Here's a plain comparison:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>External Secrets Operator</th>
<th>Secrets Store CSI Driver</th>
</tr>
</thead>
<tbody><tr>
<td>Creates a Kubernetes Secret</td>
<td>Yes</td>
<td>No by default</td>
</tr>
<tr>
<td>Supports <code>envFrom</code></td>
<td>Yes</td>
<td>No (workaround only)</td>
</tr>
<tr>
<td>Secret stored in etcd</td>
<td>Yes (base64)</td>
<td>No, if you skip sync</td>
</tr>
<tr>
<td>Rotation</td>
<td>ESO updates the Secret, Reloader restarts pods</td>
<td>Volume file can update in place</td>
</tr>
<tr>
<td>Best for</td>
<td>Most teams. Multi-cloud, env var support</td>
<td>Security policies that prohibit secrets in etcd</td>
</tr>
</tbody></table>
<p>This lab uses the External Secrets Operator for two reasons. First, it produces a native Kubernetes Secret, which means your application and deployment patterns match standard Kubernetes workflows. Second, having both <code>envFrom</code> and a volume mount point to the same Secret makes the rotation behavior easy to observe side by side.</p>
<p>Use the CSI Driver when your security team prohibits storing secrets in etcd through a Kubernetes Secret. The driver mounts secret data directly into the pod file system without creating a Kubernetes Secret. The tradeoff is that you lose the native <code>envFrom</code> model.</p>
<h2 id="heading-how-to-deploy-the-pattern-on-amazon-elastic-kubernetes-service">How to Deploy the Pattern on Amazon Elastic Kubernetes Service</h2>
<p>The local lab is ideal for learning. The Amazon Elastic Kubernetes Service path adds the production-like pieces: IAM role-based permissions for the operator, a load balancer for the app, and a full CI/CD workflow.</p>
<h3 id="heading-step-1-prepare-terraform-and-openid-connect-access">Step 1: Prepare Terraform and OpenID Connect Access</h3>
<p>The repository includes a one-time setup guide for OpenID Connect-based access from GitHub Actions to AWS. Run these commands in the <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder.</p>
<pre><code class="language-bash">cd terraform/github-oidc
terraform init
terraform plan -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform apply -var="github_repo=YOUR_ORG/YOUR_REPO"
terraform output role_arn
</code></pre>
<p>Copy the role ARN from the output. You'll need it in the next step.</p>
<h3 id="heading-step-2-set-the-required-environment-variable">Step 2: Set the Required Environment Variable</h3>
<p>The Amazon Elastic Kubernetes Service spin-up path needs your GitHub Actions role ARN so Terraform can grant the CI/CD runner access to the cluster.</p>
<p>To find your AWS account ID, run:</p>
<pre><code class="language-bash">aws sts get-caller-identity --query Account --output text
</code></pre>
<p>Then set the variable, replacing <code>ACCOUNT</code> with the number that command returns.</p>
<pre><code class="language-bash">export GITHUB_ACTIONS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/your-github-oidc-role
</code></pre>
<h3 id="heading-step-3-run-the-spin-up-script-for-amazon-elastic-kubernetes-service">Step 3: Run the Spin-Up Script for Amazon Elastic Kubernetes Service</h3>
<pre><code class="language-bash">bash spinup.sh --cluster eks
</code></pre>
<p>When the script finishes, it prints the application URL. Open that URL in a browser and confirm that you see the same secrets table you saw locally, with all keys showing <code>Match ✓</code>.</p>
<h3 id="heading-step-4-test-rotation-on-the-deployed-app">Step 4: Test Rotation on the Deployed App</h3>
<p>After you confirm normal operation, run the rotation test the same way you did locally.</p>
<pre><code class="language-bash">bash rotation/test-rotation.sh
</code></pre>
<p>Then use <code>/secrets/compare</code> on the Amazon Elastic Kubernetes Service load balancer URL to validate behavior in the cloud environment.</p>
<p>⚠️ <strong>Cost warning:</strong> Amazon Elastic Kubernetes Service runs at approximately $0.16 per hour. When you're done with the lab, run <a href="https://github.com/Osomudeya/k8s-secret-lab/blob/main/teardown.sh"><code>bash teardown.sh</code></a> from the repo root to destroy all AWS resources and stop charges.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/56f05ace-9ab6-4b67-ade6-a0bd1fa3962c.png" alt="Screenshot of the app running on the ALB URL, showing all keys matched" style="display:block;margin:0 auto" width="912" height="891" loading="lazy">

<h2 id="heading-how-to-configure-github-actions-without-stored-aws-credentials">How to Configure GitHub Actions Without Stored AWS Credentials</h2>
<p>The typical CI/CD setup stores <code>AWS_ACCESS_KEY_ID</code> and <code>AWS_SECRET_ACCESS_KEY</code> in GitHub repository secrets. Those keys never rotate. Anyone with repo access can read them. When someone leaves the team, you have to revoke keys and update every workflow.</p>
<p>OpenID Connect eliminates that problem entirely.</p>
<h3 id="heading-how-openid-connect-works-for-github-actions">How OpenID Connect Works for GitHub Actions</h3>
<p>GitHub can issue a short-lived token for each workflow run. That token identifies the run: the repository, branch, and workflow name. You create an IAM role in AWS whose trust policy says: only accept requests that come from this specific GitHub repository and branch. The GitHub Actions runner exchanges that token for temporary AWS credentials via <code>AssumeRoleWithWebIdentity</code>. No long-lived keys are ever stored anywhere.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/48e72210-a669-440e-b42e-81b0c15746ec.png" alt="The full OIDC authentication flow for GitHub Actions deploying to EKS — from minting the JWT token through AssumeRoleWithWebIdentity to temporary credentials, kubeconfig retrieval, and final kubectl apply steps." style="display:block;margin:0 auto" width="1536" height="1024" loading="lazy">

<h3 id="heading-step-1-create-the-iam-role-with-terraform">Step 1: Create the IAM Role With Terraform</h3>
<p>The <a href="https://github.com/Osomudeya/k8s-secret-lab/tree/main/terraform/github-oidc"><code>terraform/github-oidc</code></a> folder creates the OpenID Connect provider and the IAM role for you. You already ran this in the Amazon Elastic Kubernetes Service setup above. The role ARN is the only value you need to store.</p>
<h3 id="heading-step-2-add-the-role-arn-to-github-repository-secrets">Step 2: Add the Role ARN to GitHub Repository Secrets</h3>
<p>In your GitHub repository:</p>
<ol>
<li><p>Go to Settings → Secrets and variables → Actions</p>
</li>
<li><p>Click New repository secret</p>
</li>
<li><p>Name it <code>AWS_ROLE_ARN</code></p>
</li>
<li><p>Paste the role ARN from the Terraform output</p>
</li>
</ol>
<p>That is the only secret you store. The role ARN isn't sensitive. It's an identifier, not a credential.</p>
<h3 id="heading-step-3-configure-terraform-state">Step 3: Configure Terraform State</h3>
<p>For CI/CD to work consistently across runs, Terraform needs a shared state backend. The lab stores Terraform state in an Amazon S3 bucket and uses an Amazon DynamoDB table for state locking. The Amazon Elastic Kubernetes Service deployment guide in the repo covers the backend setup in full.</p>
<h3 id="heading-step-4-push-to-main-and-let-workflows-run">Step 4: Push to Main and Let Workflows Run</h3>
<p>After your first spin-up, every push to the <code>main</code> branch drives the CI/CD pipeline. The repo includes separate workflow files for Terraform infrastructure changes and application deployment changes. Once your application is reachable, use <code>/secrets/compare</code> to validate rotation behavior on the live environment.</p>
<h2 id="heading-how-to-troubleshoot-the-most-common-failures">How to Troubleshoot the Most Common Failures</h2>
<p>Here's a shortlist of the most common symptoms and their fixes.</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Most Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td><code>ExternalSecret</code> is not syncing</td>
<td>Missing credentials or wrong store reference</td>
<td>Confirm the operator can access AWS Secrets Manager and that <code>secretStoreRef</code> points to the correct store</td>
</tr>
<tr>
<td>Pod is stuck in <code>Pending</code></td>
<td>Missing storage setup for local cluster</td>
<td>For Microk8s, enable the storage add-on</td>
</tr>
<tr>
<td>Env and volume still match after rotation</td>
<td>Rotation happened but the pod never restarted</td>
<td>Run <code>kubectl rollout restart</code> or install Reloader</td>
</tr>
<tr>
<td>CRD or API version mismatch</td>
<td>ESO version and manifest <code>apiVersion</code> don't match</td>
<td>Verify the <code>apiVersion</code> for <code>ClusterSecretStore</code> and <code>ExternalSecret</code> match your installed ESO version</td>
</tr>
<tr>
<td>Amazon Elastic Kubernetes Service node group never joins</td>
<td>Networking or IAM permissions for nodes are wrong</td>
<td>Fix internet routing and review the node IAM policy</td>
</tr>
</tbody></table>
<h3 id="heading-how-to-inspect-the-operator-and-the-externalsecret">How to Inspect the Operator and the ExternalSecret</h3>
<p>When something isn't syncing, start with these two commands.</p>
<pre><code class="language-bash"># Check the ExternalSecret status
kubectl describe externalsecret app-db-secret -n default

# Check the ESO operator logs
kubectl logs -n external-secrets -l app.kubernetes.io/name=external-secrets
</code></pre>
<p>The status conditions on the <code>ExternalSecret</code> resource will usually tell you exactly what failed.</p>
<h3 id="heading-how-to-validate-rotation-from-the-app-side">How to Validate Rotation From the App Side</h3>
<p>When you are debugging rotation, don't rely only on Kubernetes resource state. Use the <code>/secrets/compare</code> endpoint to see what the running application actually observes. The endpoint tells you whether env and volume match and whether rotation has been detected. That is the ground truth for your application's behavior.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a complete secrets pipeline from AWS Secrets Manager into Kubernetes pods using Terraform and the External Secrets Operator. You ran the local lab, inspected the <code>ExternalSecret</code> and <code>Deployment</code> manifests, and validated that the application sees the right credentials.</p>
<p>You also tested secret rotation and observed the key behavior firsthand: mounted secret files update within the kubelet sync period, while environment variables stay stale until the pod restarts. That single observation explains a large class of production incidents.</p>
<p>Finally, you saw how the same design extends to Amazon Elastic Kubernetes Service with OpenID Connect-based CI/CD, and you have a troubleshooting checklist for the failures most teams hit.</p>
<p>The lab repository is at <a href="https://github.com/Osomudeya/k8s-secret-lab">github.com/Osomudeya/k8s-secret-lab</a>. If you ran the local lab, the natural next step is phases 4 and 5 from the repo's staged learning path: try the CSI driver path on Microk8s, then follow the EKS setup to see the same pipeline with a real CI/CD workflow and no credentials stored in GitHub. Both are documented in the repo and take less than 30 minutes each.</p>
<p>If this helped you, star the repo and share it with someone who is learning Kubernetes.</p>
<p><em>I send weekly breakdowns of real production incidents and how engineers actually fix them, not tutorials but real failures<br>→</em> <a href="https://osomudeya.gumroad.com/subscribe"><em>Join the newsletter</em></a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection ]]>
                </title>
                <description>
                    <![CDATA[ In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it. An attacker had found it, deployed pods inside T ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/</link>
                <guid isPermaLink="false">69c4112310e664c5dac43f41</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Wed, 25 Mar 2026 16:45:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4039b7a4-bb45-4df5-b13b-7414985c1a7e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it.</p>
<p>An attacker had found it, deployed pods inside Tesla's cluster, and was using them to mine cryptocurrency – all on Tesla's AWS bill. The cluster had no authentication on the dashboard, no network restrictions on egress, and nothing monitoring for intrusion. Any one of those controls would have stopped the attack. None of them were in place.</p>
<p>This wasn't a sophisticated zero-day exploit. It was a misconfigured default.</p>
<p>Kubernetes ships with powerful security primitives. The problem is that almost none of them are enabled by default. A fresh cluster is deliberately permissive so it's easy to get started. That permissiveness is a feature in development. In production, it's a liability.</p>
<p>In this handbook, we'll work through the three most impactful security layers in Kubernetes. We'll start with Role-Based Access Control, which governs who can do what to which resources in the API. From there we'll move to pod runtime security, which locks down what containers can actually do once they're running on a node. Finally we'll deploy Falco, a syscall-level detection engine that watches for attacks in progress and alerts in real time.</p>
<p>By the end, you'll have a hardened cluster with working RBAC policies, enforced pod security standards, and live detection rules that fire when something suspicious happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><code>kubectl</code> installed and configured</p>
</li>
<li><p>Docker Desktop or a Linux machine (to run kind)</p>
</li>
<li><p>Basic Kubernetes familiarity – you know what a Pod, Deployment, and Namespace are</p>
</li>
<li><p>No prior security experience needed</p>
</li>
</ul>
<p>All demos run on a local kind cluster. Full YAML and setup scripts are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-kubernetes-threat-landscape">The Kubernetes Threat Landscape</a></p>
</li>
<li><p><a href="#heading-what-youll-build">What You'll Build</a></p>
</li>
<li><p><a href="#heading-demo-1--run-a-cluster-security-baseline-with-kube-bench">Demo 1 — Run a Cluster Security Baseline with kube-bench</a></p>
</li>
<li><p><a href="#heading-how-to-configure-rbac">How to Configure RBAC</a></p>
<ul>
<li><p><a href="#heading-the-four-rbac-objects">The Four RBAC Objects</a></p>
</li>
<li><p><a href="#heading-how-to-discover-resources-verbs-and-api-groups">How to Discover Resources, Verbs, and API Groups</a></p>
</li>
<li><p><a href="#heading-roles-and-clusterroles">Roles and ClusterRoles</a></p>
</li>
<li><p><a href="#heading-rolebindings-and-clusterrolebindings">RoleBindings and ClusterRoleBindings</a></p>
</li>
<li><p><a href="#heading-how-to-use-service-accounts-safely">How to Use Service Accounts Safely</a></p>
</li>
<li><p><a href="#heading-how-to-audit-your-rbac-configuration">How to Audit Your RBAC Configuration</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2--build-a-least-privilege-rbac-policy-for-a-ci-pipeline">Demo 2 — Build a Least-Privilege RBAC Policy for a CI Pipeline</a></p>
</li>
<li><p><a href="#heading-demo-3--audit-rbac-with-rakkess-and-rbac-lookup">Demo 3 — Audit RBAC with rakkess and rbac-lookup</a></p>
</li>
<li><p><a href="#how-to-harden-pod-runtime-security">How to Harden Pod Runtime Security</a></p>
<ul>
<li><p><a href="#heading-pod-security-admission">Pod Security Admission</a></p>
</li>
<li><p><a href="#heading-how-to-configure-securitycontext">How to Configure securityContext</a></p>
</li>
<li><p><a href="#heading-opagatekeeper-vs-kyverno">OPA/Gatekeeper vs Kyverno</a></p>
</li>
<li><p><a href="#heading-how-to-detect-runtime-threats-with-falco">How to Detect Runtime Threats with Falco</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-4--harden-a-pod-with-securitycontext">Demo 4 — Harden a Pod with securityContext</a></p>
</li>
<li><p><a href="#heading-demo-5--deploy-falco-and-write-a-custom-detection-rule">Demo 5 — Deploy Falco and Write a Custom Detection Rule</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-kubernetes-threat-landscape">The Kubernetes Threat Landscape</h2>
<p>To understand what you're defending against, you need to understand where Kubernetes exposes attack surface. There are six main areas, and most production incidents trace back to at least one of them.</p>
<p>The <strong>API server</strong> is the front door to your cluster. Every <code>kubectl</code> command, every CI deploy, and every controller reconciliation loop sends requests here. Unauthenticated or over-privileged access to the API server is effectively game over: an attacker who can talk to it can create pods, read secrets, and modify workloads freely.</p>
<p><strong>etcd</strong> is the key-value store where all cluster state lives, including your Secrets. Kubernetes Secrets are base64-encoded by default, not encrypted. Anyone with direct access to etcd can read every password, token, and certificate in the cluster without going through the API server at all.</p>
<p>The <strong>kubelet</strong> runs on each node and manages the pods assigned to it. If its API is reachable without authentication – which is the default on older clusters – an attacker can exec into any pod on that node and read its memory without ever touching the API server.</p>
<p>The <strong>container runtime</strong> is the layer that actually runs your containers. A container that escapes its isolation boundary lands directly in the host OS. A privileged container with <code>hostPID: true</code> can read the memory of every other process on the node, including other containers.</p>
<p>Your <strong>supply chain</strong> (base images, third-party dependencies, Helm charts, operators) is a potential entry point at every step. The XZ Utils backdoor discovered in 2024 showed how close a well-positioned supply chain attack can come to widespread infrastructure compromise.</p>
<p>Finally, the <strong>network</strong>: by default, every pod in a Kubernetes cluster can reach every other pod on any port. There are no internal firewalls between workloads unless you explicitly create them with NetworkPolicy.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/2e49d975-4f69-4d14-9646-76c6ec377115.png" alt="Kubernetes threat landscape" style="display:block;margin:0 auto" width="4079" height="980" loading="lazy">

<h3 id="heading-real-world-breaches">Real-World Breaches</h3>
<p>These three incidents are worth understanding before you write a single line of YAML. They're not theoretical – they're documented post-mortems from real production clusters.</p>
<table>
<thead>
<tr>
<th>Incident</th>
<th>Year</th>
<th>Root cause</th>
<th>What was missing</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tesla cryptomining</strong></td>
<td>2018</td>
<td>Kubernetes dashboard exposed with no authentication, Unrestricted egress</td>
<td>RBAC on the dashboard endpoint + default-deny NetworkPolicy</td>
</tr>
<tr>
<td><strong>Capital One data breach</strong></td>
<td>2019</td>
<td>SSRF vulnerability in a WAF let an attacker reach the EC2 metadata API, which returned credentials for an over-privileged IAM role</td>
<td>Pod-level IAM restrictions (IRSA) + blocking metadata API egress</td>
</tr>
<tr>
<td><strong>Shopify bug bounty (Kubernetes)</strong></td>
<td>2021</td>
<td>A researcher accessed internal Kubernetes metadata through a misconfigured internal service, exposing pod environment variables containing secrets</td>
<td>Secret management outside environment variables + network segmentation</td>
</tr>
</tbody></table>
<p>The pattern across all three: not zero-day exploits, but misconfigured defaults and missing controls that should have been standard practice.</p>
<p>This article addresses the RBAC and pod security gaps directly.</p>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>Before the first command, here is the security posture you'll have by the end of this article:</p>
<p>You'll start by running kube-bench to get a CIS Benchmark baseline – a concrete score showing where a default cluster stands before any hardening. From there you'll build a least-privilege RBAC policy for a CI pipeline service account and verify its permission boundaries, then audit the full cluster to confirm no over-privileged accounts exist.</p>
<p>On the pod security side, you'll enforce the <code>restricted</code> Pod Security Admission profile on your workload namespace and apply a hardened <code>securityContext</code> to a deployment: non-root user, read-only root filesystem, dropped capabilities, and seccomp profile. To close out, you'll deploy Falco in eBPF mode with a custom detection rule that fires when suspicious tools are run inside a container.</p>
<p>Start to finish, with a kind cluster already running, the demos take about 45–60 minutes.</p>
<h2 id="heading-demo-1-run-a-cluster-security-baseline-with-kube-bench">Demo 1: Run a Cluster Security Baseline with kube-bench</h2>
<p>Before hardening anything, it's a good idea to measure where you are. <a href="https://github.com/aquasecurity/kube-bench">kube-bench</a> runs the CIS Kubernetes Benchmark against your cluster and reports which checks pass and which fail. A baseline run gives you a concrete picture of your cluster's default security posture – and a reference point you can re-run after applying any hardening changes.</p>
<h3 id="heading-step-1-create-a-kind-cluster">Step 1: Create a kind cluster</h3>
<p>Save the following as <code>kind-config.yaml</code>:</p>
<pre><code class="language-yaml"># kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
</code></pre>
<pre><code class="language-bash">kind create cluster --name k8s-security --config kind-config.yaml
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Creating cluster "k8s-security" ...
 ✓ Ensuring node image (kindest/node:v1.29.0) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-k8s-security"
</code></pre>
<h3 id="heading-step-2-run-kube-bench">Step 2: Run kube-bench</h3>
<p>kube-bench runs as a Job inside the cluster, mounting the host filesystem to inspect Kubernetes configuration files and processes:</p>
<pre><code class="language-bash">kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench
</code></pre>
<p>The output is long. Scroll to the summary at the bottom:</p>
<pre><code class="language-plaintext">== Summary master ==
0 checks PASS
11 checks FAIL
 9 checks WARN
 0 checks INFO

== Summary node ==
17 checks PASS
 2 checks FAIL
40 checks WARN
 0 checks INFO
</code></pre>
<p>A fresh kind cluster typically fails around 14 checks. Three of the most important failures explain why defaults are a problem:</p>
<table>
<thead>
<tr>
<th>Check ID</th>
<th>Description</th>
<th>Why it matters</th>
</tr>
</thead>
<tbody><tr>
<td><strong>1.2.1</strong></td>
<td><code>--anonymous-auth</code> is not set to false on the API server</td>
<td>Anonymous requests can reach the API server without authentication – exactly how the Tesla dashboard was accessed</td>
</tr>
<tr>
<td><strong>1.2.6</strong></td>
<td><code>--kubelet-certificate-authority</code> is not set</td>
<td>The API server cannot verify kubelet identity, enabling man-in-the-middle attacks between the control plane and nodes</td>
</tr>
<tr>
<td><strong>4.2.6</strong></td>
<td><code>--protect-kernel-defaults</code> is not set on the kubelet</td>
<td>Kernel parameters can be modified from within a container, which is one step toward a container escape</td>
</tr>
</tbody></table>
<p><strong>Note:</strong> Some kube-bench findings are expected on kind because kind is a development tool, not a production-hardened environment. The important thing is to understand what each finding means and whether it applies to your target production setup.</p>
<p>Delete the Job when you're done:</p>
<pre><code class="language-bash">kubectl delete job kube-bench
</code></pre>
<p>Now that you have a baseline, you know what you're starting from. The next step is to work through the most impactful control on that list: access control. RBAC governs every interaction with the Kubernetes API, and getting it right is the foundation everything else builds on.</p>
<h2 id="heading-how-to-configure-rbac">How to Configure RBAC</h2>
<p>Role-Based Access Control is the authorisation layer in Kubernetes. Every request that reaches the API server – from <code>kubectl</code>, from a pod, from a controller – is checked against RBAC rules after authentication succeeds. If there is no rule that explicitly allows the action, Kubernetes denies it.</p>
<p>The key word is "explicitly". RBAC in Kubernetes is additive only. There is no <code>deny</code> rule. You grant access by creating rules, and you remove access by deleting them. This makes the mental model clean: if a subject can do something, you gave it permission to do that thing.</p>
<h3 id="heading-a-brief-case-study-the-shopify-kubernetes-misconfiguration">A Brief Case Study: The Shopify Kubernetes Misconfiguration</h3>
<p>In 2021, security researcher Silas Cutler discovered that a Shopify internal service exposed Kubernetes metadata through an SSRF vulnerability. The metadata included pod environment variables that contained secrets. The root cause was partly RBAC: the service's service account had broader cluster access than it needed, and there was no least-privilege review process.</p>
<p>Shopify paid a $25,000 bug bounty and fixed the issue. The lesson is straightforward: a service account should only have the permissions it needs to do its specific job. Nothing more.</p>
<p>This is the principle you'll apply in Demo 2.</p>
<h3 id="heading-the-four-rbac-objects">The Four RBAC Objects</h3>
<p>RBAC in Kubernetes is built from four API objects. Two define permissions, two bind those permissions to subjects:</p>
<table>
<thead>
<tr>
<th>Object</th>
<th>Scope</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>Role</code></td>
<td>Namespace</td>
<td>Defines a set of permissions within one namespace</td>
</tr>
<tr>
<td><code>ClusterRole</code></td>
<td>Cluster-wide</td>
<td>Defines permissions across all namespaces, or for cluster-scoped resources like Nodes</td>
</tr>
<tr>
<td><code>RoleBinding</code></td>
<td>Namespace</td>
<td>Grants the permissions of a Role or ClusterRole to a subject, within one namespace</td>
</tr>
<tr>
<td><code>ClusterRoleBinding</code></td>
<td>Cluster-wide</td>
<td>Grants the permissions of a ClusterRole to a subject across the entire cluster</td>
</tr>
</tbody></table>
<p>A <strong>subject</strong> is a user, a group, or a service account. Users and groups come from your authentication layer – client certificates, OIDC tokens, or cloud provider identity. Service accounts are Kubernetes-native identities created for pods.</p>
<h3 id="heading-how-to-discover-resources-verbs-and-api-groups">How to Discover Resources, Verbs, and API Groups</h3>
<p>Before you can write a <code>Role</code>, you need to know three things: the resource name, the API group it belongs to, and the verbs it supports. You shouldn't have to guess any of them – <code>kubectl</code> can tell you everything.</p>
<h4 id="heading-list-all-available-resources-and-their-api-groups">List all available resources and their API groups</h4>
<pre><code class="language-bash">kubectl api-resources
</code></pre>
<p>Partial output:</p>
<pre><code class="language-plaintext">NAME                    SHORTNAMES  APIVERSION                     NAMESPACED  KIND
bindings                            v1                             true        Binding
configmaps              cm          v1                             true        ConfigMap
endpoints               ep          v1                             true        Endpoints
events                  ev          v1                             true        Event
namespaces              ns          v1                             false       Namespace
nodes                   no          v1                             false       Node
pods                    po          v1                             true        Pod
secrets                             v1                             true        Secret
serviceaccounts         sa          v1                             true        ServiceAccount
services                svc         v1                             true        Service
deployments             deploy      apps/v1                        true        Deployment
replicasets             rs          apps/v1                        true        ReplicaSet
statefulsets            sts         apps/v1                        true        StatefulSet
cronjobs                cj          batch/v1                       true        CronJob
jobs                                batch/v1                       true        Job
ingresses               ing         networking.k8s.io/v1           true        Ingress
networkpolicies         netpol      networking.k8s.io/v1           true        NetworkPolicy
clusterroles                        rbac.authorization.k8s.io/v1   false       ClusterRole
roles                               rbac.authorization.k8s.io/v1   true        Role
</code></pre>
<p>The <code>APIVERSION</code> column is what you put in <code>apiGroups</code>. Strip the version suffix and use only the group part:</p>
<table>
<thead>
<tr>
<th>APIVERSION in output</th>
<th>apiGroups value in Role</th>
</tr>
</thead>
<tbody><tr>
<td><code>v1</code></td>
<td><code>""</code> (empty string – the core group)</td>
</tr>
<tr>
<td><code>apps/v1</code></td>
<td><code>"apps"</code></td>
</tr>
<tr>
<td><code>batch/v1</code></td>
<td><code>"batch"</code></td>
</tr>
<tr>
<td><code>networking.k8s.io/v1</code></td>
<td><code>"networking.k8s.io"</code></td>
</tr>
<tr>
<td><code>rbac.authorization.k8s.io/v1</code></td>
<td><code>"rbac.authorization.k8s.io"</code></td>
</tr>
</tbody></table>
<p>The <code>NAMESPACED</code> column tells you whether to use a <code>Role</code> (namespaced resources) or a <code>ClusterRole</code> (non-namespaced resources like <code>nodes</code>).</p>
<h4 id="heading-filter-by-api-group">Filter by API group</h4>
<p>If you want to see only resources in a specific group, for example, everything in <code>apps</code>:</p>
<pre><code class="language-bash">kubectl api-resources --api-group=apps
</code></pre>
<pre><code class="language-plaintext">NAME                  SHORTNAMES  APIVERSION  NAMESPACED  KIND
controllerrevisions               apps/v1     true        ControllerRevision
daemonsets            ds          apps/v1     true        DaemonSet
deployments           deploy      apps/v1     true        Deployment
replicasets           rs          apps/v1     true        ReplicaSet
statefulsets          sts         apps/v1     true        StatefulSet
</code></pre>
<h4 id="heading-list-all-verbs-for-a-specific-resource">List all verbs for a specific resource</h4>
<p>Each resource supports a different set of verbs. To see exactly which verbs a resource supports, use <code>kubectl api-resources</code> with <code>-o wide</code> and look at the <code>VERBS</code> column:</p>
<pre><code class="language-bash">kubectl api-resources -o wide | grep -E "^NAME|^pods "
</code></pre>
<pre><code class="language-plaintext">NAME  SHORTNAMES  APIVERSION  NAMESPACED  KIND  VERBS
pods  po          v1          true        Pod   create,delete,deletecollection,get,list,patch,update,watch
</code></pre>
<p>Or explain the resource directly:</p>
<pre><code class="language-bash">kubectl explain pod --api-version=v1 | head -10
</code></pre>
<p>The full set of verbs Kubernetes supports in RBAC rules is:</p>
<table>
<thead>
<tr>
<th>Verb</th>
<th>What it allows</th>
</tr>
</thead>
<tbody><tr>
<td><code>get</code></td>
<td>Read a single named resource: <code>kubectl get pod my-pod</code></td>
</tr>
<tr>
<td><code>list</code></td>
<td>Read all resources of a type: <code>kubectl get pods</code></td>
</tr>
<tr>
<td><code>watch</code></td>
<td>Stream changes to resources: used by controllers and informers</td>
</tr>
<tr>
<td><code>create</code></td>
<td>Create a new resource</td>
</tr>
<tr>
<td><code>update</code></td>
<td>Replace an existing resource (<code>kubectl apply</code> on an existing object)</td>
</tr>
<tr>
<td><code>patch</code></td>
<td>Partially modify a resource (<code>kubectl patch</code>)</td>
</tr>
<tr>
<td><code>delete</code></td>
<td>Delete a single resource</td>
</tr>
<tr>
<td><code>deletecollection</code></td>
<td>Delete all resources of a type in a namespace</td>
</tr>
<tr>
<td><code>exec</code></td>
<td>Run a command inside a pod (<code>kubectl exec</code>)</td>
</tr>
<tr>
<td><code>portforward</code></td>
<td>Forward a port from a pod (<code>kubectl port-forward</code>)</td>
</tr>
<tr>
<td><code>proxy</code></td>
<td>Proxy HTTP requests to a pod</td>
</tr>
<tr>
<td><code>log</code></td>
<td>Read pod logs (<code>kubectl logs</code>)</td>
</tr>
</tbody></table>
<p><strong>Important:</strong> <code>get</code> and <code>list</code> are separate verbs. Granting <code>list</code> on <code>secrets</code> lets a subject enumerate every secret name and value in a namespace, even if you didn't also grant <code>get</code>. Always think about both when working with sensitive resources like <code>secrets</code>, <code>serviceaccounts</code>, and <code>configmaps</code>.</p>
<h4 id="heading-look-up-a-resources-group-with-kubectl-explain">Look up a resource's group with kubectl explain</h4>
<p>If you already know the resource name but aren't sure of its group, <code>kubectl explain</code> tells you:</p>
<pre><code class="language-bash">kubectl explain deployment
</code></pre>
<pre><code class="language-plaintext">GROUP:      apps
KIND:       Deployment
VERSION:    v1
...
</code></pre>
<pre><code class="language-bash">kubectl explain ingress
</code></pre>
<pre><code class="language-plaintext">GROUP:      networking.k8s.io
KIND:       Ingress
VERSION:    v1
...
</code></pre>
<p>This is the fastest way to look up the <code>apiGroups</code> value for any resource when writing a Role.</p>
<h4 id="heading-a-complete-lookup-workflow">A complete lookup workflow</h4>
<p>Here is the practical workflow when writing a new Role from scratch:</p>
<pre><code class="language-bash"># 1. Find the resource name and API group
kubectl api-resources | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment

# 2. Find the verbs it supports
kubectl api-resources -o wide | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment   create,delete,...,get,list,patch,update,watch

# 3. Write the Role using the group (strip the version) and the verbs you need
</code></pre>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: staging
rules:
  - apiGroups: ["apps"]       # from: apps/v1 → strip /v1
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>With this workflow, you never have to guess an API group or verb. You look it up, then write the minimal rule you need.</p>
<h3 id="heading-roles-and-clusterroles">Roles and ClusterRoles</h3>
<p>A <code>Role</code> defines which verbs are allowed on which resources. Here is a Role that grants read-only access to Pods and ConfigMaps inside the <code>staging</code> namespace:</p>
<pre><code class="language-yaml"># role-ci-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]          # "" = the core API group (Pods, Services, Secrets, ConfigMaps)
    resources: ["pods", "configmaps"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>The <code>apiGroups</code> field tells Kubernetes which API group owns the resource. The core group uses an empty string <code>""</code>. Apps-level resources like Deployments use <code>"apps"</code>. Custom resources use their own group, such as <code>"networking.k8s.io"</code>.</p>
<p>A <code>ClusterRole</code> is structurally identical but omits the namespace and can reference cluster-scoped resources like Nodes and PersistentVolumes:</p>
<pre><code class="language-yaml"># clusterrole-node-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader    # no namespace field
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
</code></pre>
<h4 id="heading-when-to-use-which">When to use which:</h4>
<p>Use a <code>Role</code> when the permission is specific to one namespace. A compromised service account can only affect that namespace: the blast radius is contained. Use a <code>ClusterRole</code> when you need access to cluster-scoped resources, or when you want a reusable permission template that multiple namespaces can share.</p>
<p>A common mistake is reaching for a <code>ClusterRole</code> "just to be safe" because it's easier to configure. Namespace-scoped <code>Roles</code> are almost always the right default.</p>
<h3 id="heading-rolebindings-and-clusterrolebindings">RoleBindings and ClusterRoleBindings</h3>
<p>A Role by itself does nothing. You need a binding to attach it to a subject. Here is a <code>RoleBinding</code> that grants the <code>ci-reader</code> Role to the <code>ci-pipeline</code> service account:</p>
<pre><code class="language-yaml"># rolebinding-ci.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline       # the service account name
    namespace: staging      # the namespace the SA lives in
roleRef:
  kind: Role
  name: ci-reader           # must match the Role name exactly
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>There is a useful pattern worth knowing: you can bind a <code>ClusterRole</code> using a <code>RoleBinding</code>. This creates namespace-scoped access using a reusable permission template. The <code>ClusterRole</code> defines the rules, while the <code>RoleBinding</code> constrains those rules to a single namespace.</p>
<pre><code class="language-yaml"># RoleBinding referencing a ClusterRole — scoped to one namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: ClusterRole          # ClusterRole, but bound to one namespace via RoleBinding
  name: view                 # Kubernetes built-in ClusterRole: read-only access to most resources
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Kubernetes ships with several useful built-in ClusterRoles: <code>view</code> (read-only access to most resources), <code>edit</code> (read/write to most resources), <code>admin</code> (full namespace admin), and <code>cluster-admin</code> (full cluster admin). Use them rather than reinventing them.</p>
<h3 id="heading-how-to-use-service-accounts-safely">How to Use Service Accounts Safely</h3>
<p>Every pod in Kubernetes runs as a service account. If you don't specify one, Kubernetes uses the <code>default</code> service account in that namespace.</p>
<p>The default service account starts with no permissions – but it still has a token automatically mounted into every pod at <code>/var/run/secrets/kubernetes.io/serviceaccount/token</code>. This means every container in your cluster can authenticate to the API server by default, even if it has nothing useful to do there.</p>
<p>The single most impactful change you can make is to disable this automatic token mounting on service accounts that don't need API access:</p>
<pre><code class="language-yaml"># serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
automountServiceAccountToken: false   # no token mounted into pods by default
</code></pre>
<p>You can also control it at the pod level:</p>
<pre><code class="language-yaml">spec:
  automountServiceAccountToken: false   # override at pod level
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-app:1.0
</code></pre>
<h4 id="heading-the-cluster-admin-anti-pattern">The cluster-admin anti-pattern:</h4>
<p>Never bind <code>cluster-admin</code> to a service account that runs in a pod. <code>cluster-admin</code> grants full read/write access to every resource in the cluster. An attacker who compromises a pod running as <code>cluster-admin</code> owns your cluster completely.</p>
<p>You will see this in Helm charts and tutorials because it "makes things work". It works because it disables the entire authorisation layer. That is not a solution – it's a ticking clock.</p>
<p>The Capital One breach is a direct example of this pattern at the cloud layer: an EC2 instance role had permissions far beyond what the application needed. The SSRF vulnerability was the initial foothold. The over-privileged role was what turned a minor bug into a $80 million fine.</p>
<h3 id="heading-how-to-audit-your-rbac-configuration">How to Audit Your RBAC Configuration</h3>
<p>The <code>kubectl auth can-i</code> command lets you check permissions for any subject. Use <code>--as</code> to impersonate a service account:</p>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

# These should return 'yes'
kubectl auth can-i list pods        --namespace staging --as $SA
kubectl auth can-i get  configmaps  --namespace staging --as $SA

# These should return 'no'
kubectl auth can-i delete pods      --namespace staging --as $SA
kubectl auth can-i get  secrets     --namespace staging --as $SA
kubectl auth can-i list pods        --namespace production --as $SA
</code></pre>
<p>To list every permission a subject has in a namespace:</p>
<pre><code class="language-bash">kubectl auth can-i --list \
  --namespace staging \
  --as system:serviceaccount:staging:ci-pipeline
</code></pre>
<p>For a visual matrix across the whole cluster, install <a href="https://github.com/corneliusweig/rakkess">rakkess</a> (part of krew):</p>
<pre><code class="language-bash">kubectl krew install access-matrix

# Permission matrix for all service accounts in staging
kubectl access-matrix --namespace staging
</code></pre>
<p>Example output:</p>
<pre><code class="language-plaintext">NAME          GET  LIST  WATCH  CREATE  UPDATE  PATCH  DELETE
ci-pipeline    ✓    ✓     ✓      ✗       ✗       ✗      ✗
default        ✗    ✗     ✗      ✗       ✗       ✗      ✗
monitoring     ✓    ✓     ✓      ✗       ✗       ✗      ✗
</code></pre>
<p>If you see <code>✓</code> in the CREATE, UPDATE, PATCH, or DELETE columns for a service account that should only read, that's a finding that needs remediation.</p>
<p>⚠️ <strong>The wildcard danger:</strong> The most dangerous RBAC configuration is a wildcard on all three dimensions:</p>
<pre><code class="language-yaml">apiGroups: [""] 
resources: [""] 
verbs: ["*"]
</code></pre>
<p>This is functionally identical to <code>cluster-admin</code>. You will find it in Helm charts for controllers installed with "convenience" permissions. Always audit third-party RBAC before installing operators into a production cluster.</p>
<h2 id="heading-demo-2-build-a-least-privilege-rbac-policy-for-a-ci-pipeline">Demo 2 – Build a Least-Privilege RBAC Policy for a CI Pipeline</h2>
<p>In this demo, you'll create a service account for a CI pipeline that can list pods and read configmaps in the <code>staging</code> namespace – and nothing else.</p>
<h3 id="heading-step-1-create-the-namespace-and-service-account">Step 1: Create the namespace and service account</h3>
<pre><code class="language-bash">kubectl create namespace staging
</code></pre>
<pre><code class="language-yaml"># ci-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-pipeline
  namespace: staging
automountServiceAccountToken: false
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-serviceaccount.yaml
</code></pre>
<h3 id="heading-step-2-create-the-role">Step 2: Create the Role</h3>
<pre><code class="language-yaml"># ci-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-role.yaml
</code></pre>
<h3 id="heading-step-3-bind-the-role-to-the-service-account">Step 3: Bind the Role to the service account</h3>
<pre><code class="language-yaml"># ci-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: Role
  name: ci-reader
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-rolebinding.yaml
</code></pre>
<h3 id="heading-step-4-test-allowed-operations">Step 4: Test allowed operations</h3>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

kubectl auth can-i list pods       --namespace staging     --as $SA   # yes
kubectl auth can-i get  pods       --namespace staging     --as $SA   # yes
kubectl auth can-i list configmaps --namespace staging     --as $SA   # yes
</code></pre>
<h3 id="heading-step-5-test-denied-operations">Step 5: Test denied operations</h3>
<pre><code class="language-bash">kubectl auth can-i delete pods       --namespace staging     --as $SA   # no
kubectl auth can-i get  secrets      --namespace staging     --as $SA   # no
kubectl auth can-i list pods         --namespace production  --as $SA   # no
kubectl auth can-i create deployments --namespace staging    --as $SA   # no
</code></pre>
<p>All four should return <code>no</code>. Notice the third test: even if there were a matching Role in the <code>staging</code> namespace, the service account cannot access <code>production</code>. A <code>RoleBinding</code> cannot cross namespace boundaries, this is by design.</p>
<p>Writing a least-privilege policy for a service account you control is the easy part. The harder part is auditing what already exists in a cluster. That's what Demo 3 covers.</p>
<h2 id="heading-demo-3-audit-rbac-with-rakkess-and-rbac-lookup">Demo 3 – Audit RBAC with rakkess and rbac-lookup</h2>
<p>Now you'll scan the full cluster to surface any accounts with more permissions than they need.</p>
<h3 id="heading-step-1-install-the-tools">Step 1: Install the tools</h3>
<pre><code class="language-bash">kubectl krew install access-matrix
kubectl krew install rbac-lookup
</code></pre>
<h3 id="heading-step-2-run-rakkess-across-the-cluster">Step 2: Run rakkess across the cluster</h3>
<pre><code class="language-bash"># All service accounts in kube-system
kubectl access-matrix --namespace kube-system

# All ServiceAccounts cluster-wide
kubectl access-matrix
</code></pre>
<h3 id="heading-step-3-find-all-cluster-admin-bindings">Step 3: Find all cluster-admin bindings</h3>
<p>There are two ways subjects get cluster-admin access: via a <code>ClusterRoleBinding</code> (cluster-wide), or via a <code>RoleBinding</code> that references the <code>cluster-admin</code> ClusterRole (namespace-scoped, still dangerous). Check both:</p>
<pre><code class="language-bash"># Find ClusterRoleBindings that grant cluster-admin
kubectl rbac-lookup cluster-admin --kind ClusterRole --output wide
</code></pre>
<p>On a fresh kind cluster this returns:</p>
<pre><code class="language-plaintext">No RBAC Bindings found
</code></pre>
<p>That is the correct and expected result. A default kind cluster doesn't create any <code>ClusterRoleBindings</code> to <code>cluster-admin</code>. The role exists, but nothing is bound to it at the cluster level by default. If you see entries here in your production cluster, each one is a finding worth investigating.</p>
<p>To find who has cluster-level admin access through other means, query the bindings directly:</p>
<pre><code class="language-bash"># Find all ClusterRoleBindings and the subjects they grant
kubectl get clusterrolebindings -o wide
</code></pre>
<pre><code class="language-plaintext">NAME                                                   ROLE                                                                       AGE   USERS                         GROUPS                         SERVICEACCOUNTS
cluster-admin                                          ClusterRole/cluster-admin                                                  10d   system:masters
system:kube-controller-manager                         ClusterRole/system:kube-controller-manager                                 10d
system:kube-scheduler                                  ClusterRole/system:kube-scheduler                                          10d
system:node                                            ClusterRole/system:node                                                    10d
...
</code></pre>
<p>The <code>cluster-admin</code> ClusterRoleBinding grants access to the <code>system:masters</code> group – the group your kubeconfig certificate belongs to. This is expected. Every other binding in this list is worth reviewing to understand what it grants and why.</p>
<p><strong>What to look for:</strong> Any binding where the SERVICEACCOUNTS column is populated with an application service account (not a <code>system:</code> prefixed one) is a potential over-privilege finding. Application pods should never need cluster-admin.</p>
<h3 id="heading-step-4-verify-the-ci-pipeline-service-account">Step 4: Verify the ci-pipeline service account</h3>
<pre><code class="language-bash">kubectl rbac-lookup ci-pipeline --kind ServiceAccount --output wide
</code></pre>
<p>Expected output:</p>
<pre><code class="language-bash">SUBJECT                               SCOPE     ROLE             SOURCE
ServiceAccount/staging:ci-pipeline    staging   Role/ci-reader   RoleBinding/ci-reader-binding
</code></pre>
<p>The format is <code>/&lt;role-name&gt; &lt;binding-kind&gt;/&lt;binding-name&gt;</code>. This tells you:</p>
<ul>
<li><p>The service account is bound to the <code>ci-reader</code> Role</p>
</li>
<li><p>The binding is a <code>RoleBinding</code> named <code>ci-reader-binding</code></p>
</li>
<li><p>There is no namespace prefix on the role name because it is a namespaced <code>Role</code>, not a <code>ClusterRole</code></p>
</li>
</ul>
<p>If the output showed <code>ClusterRole/something</code> here, that would be a finding. It would mean the service account has cluster-wide permissions, not namespace-scoped ones.</p>
<p><strong>rbac-lookup vs kubectl get:</strong> <code>rbac-lookup</code> gives you a subject-centric view: "what does this account have access to?" <code>kubectl get rolebindings,clusterrolebindings -A</code> gives you a binding-centric view: "what bindings exist in the cluster?" Use both. rbac-lookup is faster for auditing a specific service account, while the <code>kubectl get</code> approach is better for a full cluster inventory.</p>
<p>With RBAC locked down, the API server is protected. But RBAC says nothing about what a container can do once it's running. That's a separate layer entirely.</p>
<h2 id="heading-how-to-harden-pod-runtime-security">How to Harden Pod Runtime Security</h2>
<p>RBAC controls who can talk to the Kubernetes API. Pod security controls what containers can do once they're running on a node. These are different threat vectors: RBAC protects the control plane, pod security protects the data plane.</p>
<p>A container that runs as root with no capability restrictions can, if compromised, write backdoors to the host filesystem, load kernel modules, read the memory of other processes if <code>hostPID: true</code> is set, and in some configurations escape the container entirely. Pod security closes these doors before an attacker can open them.</p>
<h3 id="heading-a-case-study-the-hildegard-malware-campaign">A Case Study: The Hildegard Malware Campaign</h3>
<p>In early 2021, Palo Alto's Unit 42 research team documented a cryptomining malware campaign called Hildegard that specifically targeted Kubernetes clusters. The attack chain was:</p>
<ol>
<li><p>Find a cluster with the kubelet API exposed without authentication</p>
</li>
<li><p>Deploy a privileged pod with <code>hostPID: true</code></p>
</li>
<li><p>Use the privileged pod to read credentials from other containers' memory</p>
</li>
<li><p>Establish persistence by writing to the host filesystem</p>
</li>
</ol>
<p>Steps 3 and 4 would have been impossible if the pods in the cluster had been running with <code>readOnlyRootFilesystem: true</code>, dropped capabilities, and no <code>hostPID</code>. The attacker had the initial foothold. Pod security would have contained the blast radius.</p>
<h3 id="heading-pod-security-admission">Pod Security Admission</h3>
<p>Pod Security Admission (PSA) is the built-in admission controller that enforces pod security standards at the namespace level. It replaced PodSecurityPolicy in Kubernetes 1.25.</p>
<p><strong>Migrating from PSP?</strong> If you're on Kubernetes &lt; 1.25, you may still be using PodSecurityPolicy, which was removed in 1.25. The migration path is: enable PSA in <code>audit</code> mode first to identify violations, fix them workload by workload, then switch to <code>enforce</code>. For policies PSA cannot express, add Kyverno alongside it.</p>
<p>PSA defines three profiles:</p>
<table>
<thead>
<tr>
<th>Profile</th>
<th>Who it's for</th>
<th>What it restricts</th>
</tr>
</thead>
<tbody><tr>
<td><code>privileged</code></td>
<td>System components (CNI plugins, monitoring agents)</td>
<td>Nothing – no restrictions</td>
</tr>
<tr>
<td><code>baseline</code></td>
<td>Most workloads</td>
<td>Blocks known privilege escalations: no <code>hostNetwork</code>, no <code>hostPID</code>, no privileged containers</td>
</tr>
<tr>
<td><code>restricted</code></td>
<td>Security-sensitive workloads</td>
<td>Everything in baseline, plus: must run as non-root, must drop capabilities, must set a seccomp profile</td>
</tr>
</tbody></table>
<p>And three enforcement modes:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Effect</th>
<th>When to use</th>
</tr>
</thead>
<tbody><tr>
<td><code>enforce</code></td>
<td>Rejects pods that violate the profile at admission</td>
<td>Production – once you've fixed violations</td>
</tr>
<tr>
<td><code>audit</code></td>
<td>Allows pods but records violations in the audit log</td>
<td>Migration – see what would break without breaking anything</td>
</tr>
<tr>
<td><code>warn</code></td>
<td>Allows pods but sends a warning to the client</td>
<td>Development – fast feedback in your terminal</td>
</tr>
</tbody></table>
<p>The migration path: start with <code>audit</code> and <code>warn</code> to identify violations, fix them, then switch to <code>enforce</code>. The two modes can run simultaneously.</p>
<p>Apply them as namespace labels:</p>
<pre><code class="language-yaml"># namespace-staging.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    # Start here: audit and warn simultaneously
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
</code></pre>
<p>Once violations are resolved, add enforce:</p>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest \
  --overwrite
</code></pre>
<p>Note: don't use <code>--overwrite</code> here. Without it, if <code>enforce</code> is already set to a different value the command will error – which is exactly what you want. You should see:</p>
<pre><code class="language-plaintext">namespace/staging labeled
</code></pre>
<p>If you see <code>namespace/staging not labeled</code>, it means <code>enforce=restricted</code> and <code>enforce-version=latest</code> were already set to those exact values. Confirm enforcement is active:</p>
<pre><code class="language-bash">kubectl get namespace staging --show-labels
</code></pre>
<p>Look for <code>pod-security.kubernetes.io/enforce=restricted</code> in the output. If it's there, enforcement is active.</p>
<h3 id="heading-how-to-configure-securitycontext">How to Configure securityContext</h3>
<p>A <code>securityContext</code> defines the privilege and access control settings for a pod or container. These are the seven fields you should configure on every production workload:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Set at</th>
<th>What it controls</th>
</tr>
</thead>
<tbody><tr>
<td><code>runAsNonRoot</code></td>
<td>Pod</td>
<td>Rejects containers that run as UID 0 (root)</td>
</tr>
<tr>
<td><code>runAsUser</code> / <code>runAsGroup</code></td>
<td>Pod</td>
<td>Sets a specific UID/GID – don't rely on the image default</td>
</tr>
<tr>
<td><code>fsGroup</code></td>
<td>Pod</td>
<td>All mounted volumes are owned by this GID</td>
</tr>
<tr>
<td><code>seccompProfile</code></td>
<td>Pod</td>
<td>Filters syscalls using a seccomp profile</td>
</tr>
<tr>
<td><code>allowPrivilegeEscalation</code></td>
<td>Container</td>
<td>Blocks <code>setuid</code> binaries and <code>sudo</code></td>
</tr>
<tr>
<td><code>readOnlyRootFilesystem</code></td>
<td>Container</td>
<td>Makes the container filesystem read-only</td>
</tr>
<tr>
<td><code>capabilities.drop</code></td>
<td>Container</td>
<td>Removes Linux capabilities (drop <code>ALL</code>, add back only what is needed)</td>
</tr>
</tbody></table>
<p>The annotated YAML below shows all seven in context:</p>
<pre><code class="language-yaml"># secure-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true         # container must run as a non-root user
        runAsUser: 10001           # explicit UID — don't rely on the image's default
        runAsGroup: 10001          # explicit GID
        fsGroup: 10001             # volumes are owned by this group
        seccompProfile:
          type: RuntimeDefault     # use the container runtime's default seccomp profile
      automountServiceAccountToken: false
      containers:
        - name: app
          image: nginx:1.25-alpine
          securityContext:
            allowPrivilegeEscalation: false   # block setuid and sudo inside the container
            readOnlyRootFilesystem: true      # the single highest-impact setting
            capabilities:
              drop:
                - ALL                         # drop every Linux capability
              add: []                         # add back only what is explicitly needed
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
      volumes:
        # nginx needs writable directories — provide them as emptyDir volumes
        - name: tmp
          emptyDir: {}
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}
</code></pre>
<h4 id="heading-why-readonlyrootfilesystem-true-is-the-most-important-setting">Why <code>readOnlyRootFilesystem: true</code> is the most important setting:</h4>
<p>Most post-exploitation techniques require writing to the filesystem. Dropping a backdoor, modifying a binary, writing a cron job, or installing a keylogger all require a writable filesystem. Set <code>readOnlyRootFilesystem: true</code> and every one of these techniques is blocked.</p>
<p>The downside is that many applications write to directories like <code>/tmp</code> or <code>/var/cache</code>. The fix is to mount <code>emptyDir</code> volumes at those specific paths, as shown above. The rest of the filesystem stays read-only.</p>
<p><strong>What each field prevents:</strong></p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it prevents</th>
</tr>
</thead>
<tbody><tr>
<td><code>runAsNonRoot: true</code></td>
<td>Blocks containers that were built to run as root – they fail at admission</td>
</tr>
<tr>
<td><code>runAsUser: 10001</code></td>
<td>Ensures a known, non-privileged UID even if the image doesn't set one</td>
</tr>
<tr>
<td><code>allowPrivilegeEscalation: false</code></td>
<td>Blocks <code>setuid</code> binaries and <code>sudo</code> – the most common privilege escalation path</td>
</tr>
<tr>
<td><code>readOnlyRootFilesystem: true</code></td>
<td>Prevents writing backdoors, modifying binaries, or creating persistence</td>
</tr>
<tr>
<td><code>capabilities: drop: ALL</code></td>
<td>Removes Linux capabilities like <code>NET_RAW</code> (raw socket access) and <code>SYS_ADMIN</code> (kernel operations)</td>
</tr>
<tr>
<td><code>seccompProfile: RuntimeDefault</code></td>
<td>Filters syscalls to a safe default set – blocks ~300 of the ~400 available syscalls</td>
</tr>
</tbody></table>
<h3 id="heading-opagatekeeper-vs-kyverno">OPA/Gatekeeper vs Kyverno</h3>
<p>PSA covers the fundamentals. But you'll eventually need policies that PSA cannot express: all images must come from your private registry, all pods must have resource limits, no container may use the <code>latest</code> tag. For these, you need a policy engine.</p>
<p>Two mature options exist:</p>
<table>
<thead>
<tr>
<th></th>
<th>OPA/Gatekeeper</th>
<th>Kyverno</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Policy language</strong></td>
<td>Rego (a custom logic language)</td>
<td>YAML, same format as Kubernetes resources</td>
</tr>
<tr>
<td><strong>Learning curve</strong></td>
<td>Steep: Rego takes real time to learn</td>
<td>Gentle: if you write YAML, you can write policies</td>
</tr>
<tr>
<td><strong>Mutation</strong></td>
<td>Yes, via <code>Assign</code>/<code>AssignMetadata</code></td>
<td>Yes: first-class, well-documented feature</td>
</tr>
<tr>
<td><strong>Audit mode</strong></td>
<td>Yes: reports existing violations</td>
<td>Yes: policy audit mode</td>
</tr>
<tr>
<td><strong>Ecosystem</strong></td>
<td>Integrates with OPA in non-K8s contexts</td>
<td>Kubernetes-native only</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Complex cross-resource logic and teams already using OPA</td>
<td>Teams who want K8s-native syntax and fast setup</td>
</tr>
</tbody></table>
<p>If you're starting fresh, Kyverno gets you to working policies faster. Here is a Kyverno policy that blocks images from outside your trusted registry:</p>
<pre><code class="language-yaml"># kyverno-registry-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.corp.internal/"
        pattern:
          spec:
            containers:
              - image: "registry.corp.internal/*"
</code></pre>
<h3 id="heading-how-to-detect-runtime-threats-with-falco">How to Detect Runtime Threats with Falco</h3>
<p>PSA and <code>securityContext</code> are preventive controls: they block known-bad configurations before pods start. Falco is a detective control. It watches what containers do while they're running and alerts when something looks wrong.</p>
<p>Falco operates at the syscall level using eBPF. It attaches to the Linux kernel and intercepts every system call made by every container on the node – file opens, network connections, process spawns, privilege escalations. It does this without modifying containers, without injecting sidecars, and with minimal overhead.</p>
<h4 id="heading-what-falco-detects-out-of-the-box">What Falco detects out of the box:</h4>
<p>Falco's default ruleset covers the most common attack patterns. It fires when a shell is opened inside a running container, whether that's a <code>kubectl exec</code> session or a reverse shell from an exploit.</p>
<p>It watches for reads on sensitive files like <code>/etc/shadow</code>, <code>/etc/kubernetes/admin.conf</code>, and <code>/root/.ssh/</code>. It catches the dropper pattern: a binary written to disk and immediately executed. It detects outbound connections to known malicious IPs, writes to <code>/proc</code> or <code>/sys</code> that suggest kernel manipulation, and package managers like <code>apt</code>, <code>yum</code>, or <code>pip</code> being run inside containers that have no business installing software.</p>
<p>Each of these is a rule in Falco's default ruleset. You can extend it with custom rules for your specific workloads – which is exactly what you'll do in Demo 5. But first let's harden the Pod.</p>
<h2 id="heading-demo-4-harden-a-pod-with-securitycontext">Demo 4 – Harden a Pod with securityContext</h2>
<p>In this demo, you'll start with a default nginx deployment, observe the PSA violations it triggers, harden it step by step, and confirm it passes under the <code>restricted</code> profile.</p>
<h3 id="heading-step-1-apply-psa-labels-in-audit-mode">Step 1: Apply PSA labels in audit mode</h3>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
</code></pre>
<h3 id="heading-step-2-deploy-insecure-nginx-and-observe-the-warnings">Step 2: Deploy insecure nginx and observe the warnings</h3>
<pre><code class="language-yaml"># insecure-nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-insecure
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-insecure
  template:
    metadata:
      labels:
        app: nginx-insecure
    spec:
      containers:
        - name: nginx
          image: nginx:1.25-alpine
</code></pre>
<pre><code class="language-bash">kubectl apply -f insecure-nginx.yaml
</code></pre>
<p>Expected output (PSA warns but still creates the deployment in <code>warn</code> mode):</p>
<pre><code class="language-plaintext">Warning: would violate PodSecurity "restricted:latest":
  allowPrivilegeEscalation != false (container "nginx" must set
    securityContext.allowPrivilegeEscalation=false)
  unrestricted capabilities (container "nginx" must set
    securityContext.capabilities.drop=["ALL"])
  runAsNonRoot != true (pod or container "nginx" must set
    securityContext.runAsNonRoot=true)
  seccompProfile not set (pod or container "nginx" must set
    securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/nginx-insecure created
</code></pre>
<p>Four violations. Every one of them is a real security gap. But the pod was still created "deployment.apps/nginx-insecure created"</p>
<h3 id="heading-step-3-deploy-the-hardened-version">Step 3: Deploy the hardened version</h3>
<pre><code class="language-bash">kubectl apply -f secure-deployment.yaml   # the YAML from the securityContext section above
</code></pre>
<p>No warnings this time.</p>
<h3 id="heading-step-4-switch-the-namespace-to-enforce">Step 4: Switch the namespace to enforce</h3>
<pre><code class="language-bash&quot;">kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">namespace/staging labeled
</code></pre>
<p>This is the moment enforcement becomes active. Any new pod that violates the <code>restricted</code> profile will be rejected from this point on.</p>
<h3 id="heading-step-5-confirm-insecure-deployments-are-now-rejected">Step 5: Confirm insecure deployments are now rejected</h3>
<pre><code class="language-bash">kubectl delete deployment nginx-insecure -n staging
kubectl apply -f insecure-nginx.yaml
</code></pre>
<p>Expected output:</p>
<pre><code class="language-shell">Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false ...
deployment.apps/nginx-insecure created
</code></pre>
<p>The Deployment object is created. PSA enforces at the <strong>pod</strong> level, not the Deployment level. The Deployment and its ReplicaSet exist, but every attempt to create a pod is rejected. Check the ReplicaSet:</p>
<pre><code class="language-bash">kubectl get replicaset -n staging -l app=nginx-insecure
</code></pre>
<pre><code class="language-plaintext">NAME                       DESIRED   CURRENT   READY   AGE
nginx-insecure-b668d867b   1         0         0       30s
</code></pre>
<p><code>DESIRED=1</code> but <code>CURRENT=0</code>. The ReplicaSet cannot create any pods because they're rejected at admission. Describe the ReplicaSet to see the rejection events:</p>
<pre><code class="language-bash">kubectl describe replicaset -n staging -l app=nginx-insecure
</code></pre>
<pre><code class="language-plaintext">Warning  FailedCreate  ReplicaSet "nginx-insecure-b668d867b" create Pod
  "nginx-insecure-xxx" failed: pods is forbidden: violates PodSecurity
  "restricted:latest": allowPrivilegeEscalation != false, unrestricted
  capabilities, runAsNonRoot != true, seccompProfile not set
</code></pre>
<p>The hardened deployment continues running with its pods intact. The insecure one has zero pods and never will. This is exactly how PSA is supposed to work.</p>
<h3 id="heading-step-6-score-the-hardened-pod-with-kube-score">Step 6: Score the hardened pod with kube-score</h3>
<p><a href="https://github.com/zegl/kube-score">kube-score</a> is a static analysis tool that scores Kubernetes manifests against security and reliability best practices:</p>
<pre><code class="language-bash"># macOS
brew install kube-score
# Linux: https://github.com/zegl/kube-score/releases

kube-score score secure-deployment.yaml -v
</code></pre>
<p>Expected output (abridged):</p>
<pre><code class="language-plaintext">apps/v1/Deployment secure-app in staging 
  path=secure-deployment.yaml
    [OK] Stable version
    [OK] Label values
    [CRITICAL] Container Resources
        · app -&gt; CPU limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.cpu
        · app -&gt; Memory limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.memory
        · app -&gt; CPU request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.cpu
        · app -&gt; Memory request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.memory
    [CRITICAL] Container Image Pull Policy
        · app -&gt; ImagePullPolicy is not set to Always
            It's recommended to always set the ImagePullPolicy to Always, to make sure that the imagePullSecrets are always correct, and to always get the image you want.
    [OK] Pod Probes Identical
    [CRITICAL] Container Ephemeral Storage Request and Limit
        · app -&gt; Ephemeral Storage limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.ephemeral-storage
        · app -&gt; Ephemeral Storage request is not set
            Resource requests are recommended to make sure the application can start and run without crashing. Set resource.requests.ephemeral-storage
    [OK] Environment Variable Key Duplication
    [OK] Container Security Context Privileged
    [OK] Pod Topology Spread Constraints
        · Pod Topology Spread Constraints
            No Pod Topology Spread Constraints set, kube-scheduler defaults assumed
    [OK] Container Image Tag
    [CRITICAL] Pod NetworkPolicy
        · The pod does not have a matching NetworkPolicy
            Create a NetworkPolicy that targets this pod to control who/what can communicate with this pod. Note, this feature needs to be supported by the CNI implementation used in the Kubernetes cluster to have an effect.
    [OK] Container Security Context User Group ID
    [OK] Container Security Context ReadOnlyRootFilesystem
    [CRITICAL] Deployment has PodDisruptionBudget
        · No matching PodDisruptionBudget was found
            It's recommended to define a PodDisruptionBudget to avoid unexpected downtime during Kubernetes maintenance operations, such as when draining a node.
    [WARNING] Deployment has host PodAntiAffinity
        · Deployment does not have a host podAntiAffinity set
            It's recommended to set a podAntiAffinity that stops multiple pods from a deployment from being scheduled on the same node. This increases availability in case the node becomes unavailable.
    [OK] Deployment Pod Selector labels match template metadata labels
</code></pre>
<p>Notice there are no security context violations: <code>securityContext</code>, <code>readOnlyRootFilesystem</code>, <code>seccompProfile</code>, and <code>runAsNonRoot</code> all pass. The remaining findings are about <strong>resource management</strong> (CPU/memory limits, ephemeral storage), <strong>availability</strong> (PodDisruptionBudget, anti-affinity), and <strong>network policy</strong> – not security context hardening. Those are important for production readiness, but they're a separate concern from the pod security hardening we did here.</p>
<p>You now have a pod that PSA accepts and kube-score validates. The next step is to add a detection layer – something that watches what the pod does at runtime, not just how it was configured at admission.</p>
<h2 id="heading-demo-5-deploy-falco-and-write-a-custom-detection-rule">Demo 5 – Deploy Falco and Write a Custom Detection Rule</h2>
<p>Now, you'll deploy Falco in eBPF mode, trigger a default alert, then extend Falco with a custom rule that catches <code>curl</code> and <code>wget</code> being run inside containers.</p>
<h3 id="heading-step-1-install-falco-via-helm">Step 1: Install Falco via Helm</h3>
<pre><code class="language-bash">helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  --wait
</code></pre>
<p>Confirm Falco is running on every node:</p>
<pre><code class="language-shell">kubectl get pods -n falco
</code></pre>
<pre><code class="language-shell">NAME           READY   STATUS    RESTARTS   AGE
falco-x8k2p    1/1     Running   0          45s
falco-m9nqr    1/1     Running   0          45s
falco-j4tpw    1/1     Running   0          45s
</code></pre>
<p>One pod per node. Falco runs as a DaemonSet because it needs to monitor syscalls on every node independently.</p>
<h3 id="heading-step-2-trigger-a-default-alert">Step 2: Trigger a default alert</h3>
<p>Open a second terminal and stream the Falco logs:</p>
<pre><code class="language-shell"># Terminal 2 — watch for alerts
kubectl logs -n falco -l app.kubernetes.io/name=falco -f --max-log-requests 3
</code></pre>
<p>In your first terminal, exec into the secure-app pod:</p>
<pre><code class="language-bash"># Terminal 1 — trigger the shell detection
POD=$(kubectl get pod -n staging -l app=secure-app \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -n staging -- sh
</code></pre>
<p>Within a second, Terminal 2 shows:</p>
<pre><code class="language-plaintext">2024-03-15T14:23:41.456Z: Notice A shell was spawned in a container with an attached terminal
  (user=root user_loginuid=-1 k8s.ns=staging k8s.pod=secure-app-7d9f8b-xxx
   container=app shell=sh parent=runc cmdline=sh terminal=34816)
  rule=Terminal shell in container  priority=NOTICE
  tags=[container, shell, mitre_execution]
</code></pre>
<p>This is Falco's built-in <code>Terminal shell in container</code> rule firing. It detected the <code>kubectl exec</code> session the moment you ran it.</p>
<h3 id="heading-step-3-write-a-custom-rule">Step 3: Write a custom rule</h3>
<p>The built-in rules are comprehensive, but every production environment has workloads with unique behaviour. Here is a custom rule that alerts when <code>curl</code> or <code>wget</code> is executed inside any container:</p>
<pre><code class="language-yaml"># custom-rules.yaml
customRules:
  custom-rules.yaml: |-
    - rule: Suspicious network tool in container
      desc: &gt;
        Detects execution of curl or wget inside a running container.
        These tools are commonly used for data exfiltration, downloading
        attacker payloads, or reaching command-and-control servers.
        Production containers should not be making ad-hoc HTTP requests.
      condition: &gt;
        spawned_process
        and container
        and proc.name in (curl, wget)
      output: &gt;
        Network tool executed in container
        (user=%user.name tool=%proc.name cmd=%proc.cmdline
         pod=%k8s.pod.name ns=%k8s.ns.name image=%container.image)
      priority: WARNING
      tags: [network, exfiltration, custom]
</code></pre>
<p>Apply it by upgrading the Helm release:</p>
<pre><code class="language-bash"> helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  -f custom-rules.yaml
</code></pre>
<p>Good, it deployed. Now wait for pods to be ready and test your custom rule:</p>
<h3 id="heading-step-4-test-the-custom-rule">Step 4: Test the custom rule</h3>
<pre><code class="language-bash"># Terminal 1 — run curl inside the container
kubectl exec -it $POD -n staging -- sh -c 'curl https://example.com'
</code></pre>
<p>Terminal 2 immediately shows:</p>
<pre><code class="language-plaintext">2024-03-15T14:31:07.812Z: Warning Network tool executed in container
  (user=root tool=curl cmd=curl https://example.com
   pod=secure-app-7d9f8b-xxx ns=staging image=nginx:1.25-alpine)
  rule=Suspicious network tool in container  priority=WARNING
  tags=[network, exfiltration, custom]
</code></pre>
<h3 id="heading-step-5-route-alerts-to-slack-with-falcosidekick">Step 5: Route alerts to Slack with Falcosidekick</h3>
<p>Streaming logs is useful during development. In production, you need alerts routed to your alerting pipeline. Falcosidekick handles this with support for Slack, PagerDuty, Datadog, Elasticsearch, and over 50 other outputs:</p>
<pre><code class="language-yaml"># falcosidekick-values.yaml
config:
  slack:
    webhookurl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    minimumpriority: "warning"
    messageformat: &gt;
      [{{.Priority}}] {{.Rule}} |
      pod: {{.OutputFields.k8s.pod.name}} |
      ns: {{.OutputFields.k8s.ns.name}} |
      image: {{.OutputFields.container.image}}
</code></pre>
<pre><code class="language-bash">helm install falcosidekick falcosecurity/falcosidekick \
  --namespace falco \
  -f falcosidekick-values.yaml
</code></pre>
<p><strong>Tuning Falco for production:</strong> A fresh Falco deployment will generate false positives, especially in the first week. Your job is to tune rules to match your workloads' normal behaviour, not to respond to every alert.</p>
<p>Here's the workflow: deploy in staging → identify false positives → add <code>except</code> conditions to rules → validate the false positive rate is low → enable in production with alerting.</p>
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the staging namespace and everything in it
kubectl delete namespace staging
 
# Delete Falco and Falcosidekick
helm uninstall falco -n falco
helm uninstall falcosidekick -n falco
kubectl delete namespace falco
 
# Delete the kind cluster entirely
kind delete cluster --name k8s-security
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this handbook, you secured a Kubernetes cluster across three layers: RBAC, pod runtime security, and runtime threat detection.</p>
<p>You built a least-privilege service account, enforced the restricted Pod Security Admission profile, hardened pods with securityContext, deployed Falco for syscall-level detection, and wrote a custom rule to catch suspicious tools inside containers.</p>
<p>Each layer maps to a real-world breach – Tesla, Capital One, Hildegard – showing how these controls would have contained the damage. Run kube-bench again to measure the improvement.</p>
<p>All YAML manifests, Helm values, and setup scripts from this article are available in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Implement GitOps on Kubernetes Using Argo CD ]]>
                </title>
                <description>
                    <![CDATA[ If you’re still running kubectl apply from your local terminal, you aren’t managing a cluster, you’re babysitting one. I’ve spent more nights than I care to admit staring at a terminal, trying to figu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-implement-gitops-on-kubernetes-using-argo-cd/</link>
                <guid isPermaLink="false">69b99877c22d3eeb8ae62100</guid>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gitops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ArgoCD ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Olumoko Moses ]]>
                </dc:creator>
                <pubDate>Tue, 17 Mar 2026 18:07:51 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/2fe40cbd-1b8a-4cc6-a721-45cc20a80c76.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re still running <code>kubectl apply</code> from your local terminal, you aren’t managing a cluster, you’re babysitting one.</p>
<p>I’ve spent more nights than I care to admit staring at a terminal, trying to figure out why a staging environment suddenly "broke" even though no one supposedly touched it.</p>
<p>We’ve all been there, right? a manual edit here, a quick hotfix there, and suddenly your Git repository is no longer a Source of Truth, it’s a historical document of what used to be running.</p>
<p>Without a reliable strategy, Kubernetes deployments quickly descend into a mess of drift, painful rollbacks, and non-existent audit trails. I learned the hard way that simply storing manifests in Git isn't enough. If your cluster isn't actively listening to your code, you're still working with a gap.</p>
<p>GitOps closes that gap. It turns your cluster into a mirror of your repository. If it isn't in Git, it doesn't exist.</p>
<p>In this tutorial, you aren't just going to read about the theory. You’re going to implement a "Zero-Touch" deployment loop from scratch. We’ll use Argo CD, GitHub Actions, and the Argo CD Image Updater to build a system that builds, tags, and deploys your code the second you hit <code>git push</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/93b61e74-66c3-47d7-aeff-f7a69d0d7390.jpg" alt="Architecture diagram of a complete GitOps CI/CD workflow. A developer pushes code to a GitHub repository, triggering a GitHub Actions pipeline that builds and pushes a new Docker image to DockerHub. The Argo CD Image Updater polls DockerHub for the new tag and commits the change back to the GitHub repository. Finally, the Argo CD Server detects the updated manifest in Git and syncs the changes to the live Kubernetes cluster." style="display:block;margin:0 auto" width="640" height="640" loading="lazy">

<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-gitops-really-means">What GitOps Really Means</a></p>
</li>
<li><p><a href="#heading-what-is-argo-cd-and-how-does-it-implement-gitops">What is Argo CD and How Does it Implement GitOps?</a></p>
</li>
<li><p><a href="#heading-preparing-the-application-source-code">Preparing the Application Source Code and Repo Structure</a></p>
</li>
<li><p><a href="#heading-automating-image-builds-with-github-actions">Automating Image Builds with GitHub Actions</a></p>
</li>
<li><p><a href="#heading-how-to-install-and-access-argo-cd">How to Install and Access Argo CD</a></p>
</li>
<li><p><a href="#heading-understanding-the-argo-cd-application">Understanding the Argo CD Application</a></p>
</li>
<li><p><a href="#heading-deploying-the-application-manifest">Deploying the Application Manifest</a></p>
</li>
<li><p><a href="#heading-automating-updates-with-argo-cd-image-updater">Automating Updates with Argo CD Image Updater</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following ready in your environment:</p>
<ul>
<li><p><strong>A GitHub repository:</strong> You'll need a repository (for example, <code>my-gitops-demo</code>) to serve as your Single Source of Truth. If you're following this tutorial from scratch, start with an empty repo.</p>
</li>
<li><p><strong>A DockerHub account:</strong> This will act as your Container Registry. You’ll need this to build, push, and store the Docker images that GitHub Actions creates.</p>
</li>
<li><p><strong>A running Kubernetes cluster:</strong> You can use a local solution like Minikube or Kind, or a cloud-managed service like Amazon EKS or GKE.</p>
</li>
<li><p><strong>Kubernetes tooling:</strong> Ensure <code>kubectl</code> is installed and configured to communicate with your cluster.</p>
</li>
<li><p><strong>Fundamental K8s knowledge:</strong> You should be comfortable with basic Kubernetes concepts like Pods, Deployments, and Services.</p>
</li>
</ul>
<h3 id="heading-note-for-readers-with-existing-projects">Note for Readers with Existing Projects</h3>
<p>If you already have a project and want to migrate it to this GitOps workflow, you don't need to start over. You can adapt your existing repository by following these three steps:</p>
<ol>
<li><p>Standardize your manifests: Move all your existing Kubernetes YAML files into a dedicated Kubernetes-manifest/ directory at the root of your project.</p>
</li>
<li><p>Containerize your services: Ensure every service you intend to deploy has a Dockerfile in its respective subdirectory (for example, /main-api/Dockerfile).</p>
</li>
<li><p>Prepare for automation: Be ready to replace any manual kubectl apply steps in your current CI pipeline with the automated tagging strategy we’ll implement in the next sections.</p>
</li>
</ol>
<h2 id="heading-what-gitops-really-means">What GitOps Really Means</h2>
<p>At its core, GitOps is an operational framework that uses Git as the single source of truth for your infrastructure and applications. In a traditional setup, you might run <code>kubectl apply -f deployment.yaml</code> from your laptop. This makes it impossible to track who changed what, leading to "snowflake" clusters that no one can reproduce.</p>
<p>GitOps enforces four key principles:</p>
<ol>
<li><p><strong>Declarative:</strong> You describe the <em>desired</em> state (for example, "3 replicas of Nginx"), not the commands to get there.</p>
</li>
<li><p><strong>Versioned and immutable:</strong> Your entire state is in Git. If a deployment fails, you <code>git revert</code> to a previous known-good state.</p>
</li>
<li><p><strong>Pulled automatically:</strong> A software agent (Argo CD) pulls the state from Git.</p>
</li>
<li><p><strong>Continuously reconciled:</strong> The system constantly fixes "drift." If a developer manually changes a service in the cluster, Argo CD will overwrite it to match Git.</p>
</li>
</ol>
<h2 id="heading-what-is-argo-cd-and-how-does-it-implement-gitops">What is Argo CD and How Does it Implement GitOps</h2>
<p>Before we dive into the setup, let’s define the tool we'll be working with.</p>
<p>Argo CD is a declarative, GitOps' continuous delivery engine built specifically for Kubernetes. As a graduated project of the Cloud Native Computing Foundation (CNCF), it has become the industry standard for managing modern infrastructure.</p>
<p>Think of Argo CD as a persistent watchdog that lives inside your cluster. To understand why it's so powerful, we have to look at how it differs from traditional CI/CD tools like Jenkins or GitHub Actions.</p>
<h3 id="heading-the-push-vs-pull-model">The "Push" vs. "Pull" Model</h3>
<p>Traditional tools like the one I mentioned above use a <strong>"Push" model</strong>. In this setup, an external pipeline sends commands (like <code>kubectl apply</code>) into your cluster. This is risky because you must store sensitive cluster administrative keys inside your external CI tool. If your CI tool is compromised, your cluster is, too.</p>
<p>Argo CD flips this script using a <strong>"Pull"</strong> <strong>model</strong>:</p>
<ul>
<li><p><strong>The bridge:</strong> It sits between your Git repo (the "Desired State") and your cluster (the "Live State").</p>
</li>
<li><p><strong>Continuous monitoring:</strong> It watches your Git repo 24/7. The moment it detects a new commit, it "pulls" that change and applies it from <em>inside</em> the cluster.</p>
</li>
<li><p><strong>Self-healing:</strong> If someone manually changes a setting in the cluster (known as "drift"), Argo CD detects the discrepancy and automatically overwrites it to match what is written in Git.</p>
</li>
</ul>
<p>This approach is not only more secure, since no cluster credentials ever leave the environment, but it also ensures that your infrastructure is a perfect, predictable mirror of your code.</p>
<h2 id="heading-preparing-the-application-source-code">Preparing the Application Source Code</h2>
<p>Before we automate the build, we need actual code in our repository. We'll create two simple microservices: a Main API and an Auxiliary Service.</p>
<h3 id="heading-repo-structure">Repo Structure</h3>
<p>Ensure your repository follows this structure exactly. Consistency in naming is vital for the automation to find your files.</p>
<pre><code class="language-plaintext">GITOPS-ARGOCD-DEMO/
├── .github/workflows/main.yml
├── auxiliary-service/
│   └── Dockerfile
├── main-api/
│   └── Dockerfile
├── Kubernetes-manifest/
│   ├── aux-api.yaml
│   ├── kustomization.yaml
│   └── main-api.yaml
├── application.yaml
└── image-updater.yaml
</code></pre>
<h3 id="heading-create-the-dockerfiles">Create the Dockerfiles</h3>
<p>In each service folder, create a simple <code>Dockerfile</code> so our pipeline has something to build.</p>
<p><strong>main-api/Dockerfile</strong></p>
<pre><code class="language-plaintext">FROM nginx:alpine
RUN echo "&lt;h1&gt;Main API - Version 1.0&lt;/h1&gt;" &gt; /usr/share/nginx/html/index.html
EXPOSE 80
</code></pre>
<p><strong>auxiliary-service/Dockerfile</strong></p>
<pre><code class="language-plaintext">FROM nginx:alpine
RUN echo "&lt;h1&gt;Auxiliary Service - Version 1.0&lt;/h1&gt;" &gt; /usr/share/nginx/html/index.html
EXPOSE 80
</code></pre>
<h2 id="heading-automating-image-builds-with-github-actions">Automating Image Builds with GitHub Actions</h2>
<p>In a professional GitOps workflow, your Kubernetes manifests and your application source code often live in the same repository (or linked ones). While Argo CD handles the deployment, you still need a way to turn your code into Docker images. This is where <strong>Continuous Integration (CI)</strong> comes in.</p>
<p>I have included a GitHub Actions workflow in this demo to automate this. Every time you push code to the <code>main</code> branch, this pipeline builds your images and pushes them to DockerHub.</p>
<h3 id="heading-the-ci-pipeline-workflow">The CI Pipeline Workflow</h3>
<p>Create a file at <code>.github/workflows/main.yml</code> and add the following:</p>
<pre><code class="language-plaintext">name: Build and Push Image to DockerHub

on:
  push:
    branches:
      - main
    # Skip builds for image updater commits
    paths-ignore:
      - 'Kubernetes-manifest/**'

jobs:
  docker_build:
    name: Build &amp; Push ${{ matrix.service }}
    environment: argocd-demo
    runs-on: ubuntu-latest
    permissions:
      contents: write

    strategy:
      matrix:
        include:
          - service: aux-service
            dockerfile: auxiliary-service/Dockerfile
          - service: main-service
            dockerfile: main-api/Dockerfile

    env:
      DOCKER_USER: ${{ secrets.DOCKERHUB_USERNAME }}
      RUN_TAG: ${{ github.run_number }}

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ env.DOCKER_USER }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and Push ${{ matrix.service }}
        uses: docker/build-push-action@v6
        with:
          context: .
          file: ${{ matrix.dockerfile }}
          push: true
          tags: \({{ env.DOCKER_USER }}/\){{ matrix.service }}:${{ env.RUN_TAG }}
          cache-from: type=gha,scope=${{ matrix.service }}
          cache-to: type=gha,mode=max,scope=${{ matrix.service }}
</code></pre>
<p><strong>Pro tip:</strong> The <code>paths-ignore</code> section is critical. Later, the Argo CD Image Updater will write changes back to the <code>Kubernetes-manifest/</code> folder. Without this ignore rule, your pipeline would trigger itself forever in an infinite loop.</p>
<p><strong>Note:</strong> You must add <code>DOCKERHUB_USERNAME</code> and <code>DOCKERHUB_PASSWORD</code> to your GitHub Repo Settings &gt; Secrets.</p>
<h2 id="heading-how-to-install-and-access-argo-cd">How to Install and Access Argo CD</h2>
<p>Now that your cluster is running, you can install Argo CD. You'll perform the installation using a standard Kubernetes manifest provided by the Argo project.</p>
<h3 id="heading-step-1-create-the-namespace-and-apply-the-manifests">Step 1: Create the Namespace and Apply the Manifests</h3>
<p>In Kubernetes, it is a best practice to keep your administrative tools separate from your applications. You will create a dedicated namespace named <code>argocd</code> and then apply the official installation script from the Argo project. This script includes all the necessary ServiceAccounts, Roles, and Deployments.</p>
<p>Run the following commands in your terminal:</p>
<pre><code class="language-markdown">kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
</code></pre>
<p>You'll see a long list of resources being created. Wait a minute or two for the pods to initialize. you can verify that all the core components of Argo CD are running.:</p>
<pre><code class="language-markdown">kubectl get all -n argocd
</code></pre>
<p>Ensure all pods show a status of <code>Running</code> before proceeding.</p>
<h3 id="heading-step-2-access-the-argo-cd-user-interface">Step 2: Access the Argo CD User Interface</h3>
<p>To access the dashboard, we use a technique called <strong>port forwarding</strong>. Since the Argo CD server is running inside the cluster's private network, your browser can't see it yet. Port forwarding creates a secure 'tunnel' between a port on your local machine (8080) and a port on the cluster service (443). This allows you to interact with internal services without exposing them to the public internet.</p>
<p>Run the following command:</p>
<pre><code class="language-markdown">kubectl port-forward svc/argocd-server -n argocd 8080:443
</code></pre>
<p>You can now open your browser and navigate to <code>https://localhost:8080</code>. Your browser may warn you that the connection is not private because of a self-signed certificate. You can safely click "Advanced" and proceed to the site.</p>
<h3 id="heading-step-3-how-to-log-in">Step 3: How to Log In</h3>
<p>The default username for Argo CD is <code>admin</code>. The password is autogenerated during the installation process and is stored securely as a Kubernetes secret.</p>
<p>To retrieve this password, open a new terminal tab (so the port-forwarding keeps running) and run:</p>
<pre><code class="language-markdown">kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo
</code></pre>
<p>Copy the output and use it as the password to log into the dashboard.</p>
<h2 id="heading-understanding-the-argo-cd-application">Understanding the Argo CD Application</h2>
<p>An Argo CD <strong>Application</strong> is a Custom Resource (CRD) that acts as a "contract" between your Git repo and your cluster. It defines:</p>
<ul>
<li><p><code>repoURL</code> &amp; <code>path</code>: This tells Argo CD exactly which Git repository to watch and which folder inside that repo contains your YAML manifests.</p>
</li>
<li><p><code>destination</code>: This defines where the app should live. We use <code>https://kubernetes.default.svc</code> to point to the local cluster where Argo CD is installed.</p>
</li>
<li><p><code>syncPolicy</code>: This is the heart of GitOps. By setting <code>automated</code> with <code>selfHeal: true</code>, we tell Argo CD to automatically fix the cluster if someone manually changes something (drift). The <code>prune: true</code> setting ensures that if you delete a file in Git, it also gets deleted in the cluster.</p>
</li>
</ul>
<h3 id="heading-the-application-manifest">The Application Manifest</h3>
<p>Create <code>application.yaml</code> in your project root:</p>
<pre><code class="language-plaintext">apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitops-argocd-demo
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/&lt;YOUR_GITHUB_USERNAME&gt;/&lt;YOUR_REPO_NAME&gt;.git
    targetRevision: HEAD
    path: Kubernetes-manifest
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd-demo-ns
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
    syncOptions:
      - CreateNamespace=true
</code></pre>
<h3 id="heading-deploying-the-application-manifest">Deploying the Application Manifest</h3>
<p>Now we'll define our Kubernetes resources in the <code>Kubernetes-manifest/</code> folder.</p>
<p><strong>main-api.yaml</strong></p>
<pre><code class="language-plaintext">apiVersion: apps/v1
kind: Deployment
metadata:
  name: main-deployment
  namespace: argocd-demo-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: main-api
  template:
    metadata:
      labels:
        app: main-api
    spec:
      containers:
      - name: main-service
        image: &lt;YOUR_DOCKERHUB_USERNAME&gt;/main-service:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: main-service-lb
  namespace: argocd-demo-ns
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: main-api
</code></pre>
<p><strong>aux-api.yaml</strong></p>
<pre><code class="language-plaintext">apiVersion: apps/v1
kind: Deployment
metadata:
  name: aux-deployment
  namespace: argocd-demo-ns
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aux-service
  template:
    metadata:
      labels:
        app: aux-service
    spec:
      containers:
      - name: aux-service
        image: &lt;YOUR_DOCKERHUB_USERNAME&gt;/aux-service:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: aux-service
  namespace: argocd-demo-ns
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: aux-service
</code></pre>
<h2 id="heading-push-and-sync">Push and Sync</h2>
<h3 id="heading-step-1-apply-the-application-manifest">Step 1: Apply the Application Manifest</h3>
<p>Use <code>kubectl</code> to deploy this manifest into the <code>argocd</code> namespace:</p>
<pre><code class="language-plaintext">kubectl apply -f application.yaml -n argocd
</code></pre>
<h3 id="heading-step-2-push-to-your-repository">Step 2: Push to Your Repository</h3>
<p>To trigger the initial deployment and ensure Argo CD stays in sync with your source of truth, add, commit, and push your latest changes to the GitHub repository you configured in the manifest:</p>
<pre><code class="language-plaintext">git add .
git commit -m "initial argo application deployment"
git push origin main
</code></pre>
<h3 id="heading-step-3-verify-the-result-in-argo-cd">Step 3: Verify the Result in Argo CD</h3>
<p>Once you push your changes, head over to your Argo CD dashboard. You'll see the <code>gitops-argocd-demo</code> application appear. After the initial sync, the dashboard will display a healthy, green status indicating that your live cluster state perfectly matches your Git repository.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/7d37a8bd-c913-4393-b82c-ab0adc875574.jpg" alt="Argo CD dashboard showing the gitops-argocd-demo application in a Healthy and Synced state. The resource tree displays the hierarchy of services, deployments, replica sets, and pods running in the cluster." style="display:block;margin:0 auto" width="1440" height="547" loading="lazy">

<p><strong>Note:</strong> As you can see in the screenshot above, Argo CD provides a visual representation of how your Kubernetes objects – Services, Deployments, and Pods – are related and confirms they are "Synced" with your Git repo.</p>
<h2 id="heading-automating-updates-with-argo-cd-image-updater">Automating Updates with Argo CD Image Updater</h2>
<p>Now that we have automated the deployment, let’s solve the final manual hurdle: automatically updating image tags in our manifests whenever a new build is pushed to DockerHub.</p>
<h3 id="heading-step-1-install-argocd-image-updater">Step 1: Install ArgoCD Image Updater</h3>
<p>Install the Image Updater into the <code>argocd</code> namespace:</p>
<pre><code class="language-plaintext">kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj-labs/argocd-image-updater/stable/config/install.yaml
</code></pre>
<p>Verify the pod is running:</p>
<pre><code class="language-plaintext">kubectl get pods -n argocd | grep image-updater
</code></pre>
<p><strong>Note:</strong> Version 1.1+ uses a CRD-based approach (<code>ImageUpdater</code> custom resources) instead of the annotation-based approach used in older versions. This guide covers the CRD method.</p>
<h3 id="heading-step-2-create-a-github-personal-access-token">Step 2: Create a GitHub Personal Access Token</h3>
<p>The Image Updater needs Git credentials to push write-back commits to your repository.</p>
<ol>
<li><p>Go to GitHub → Settings → Developer Settings → Personal Access Tokens → Tokens (classic)</p>
</li>
<li><p>Click Generate new token</p>
</li>
<li><p>Select the <code>repo</code> scope (full control of private repositories)</p>
</li>
<li><p>Copy the generated token</p>
</li>
</ol>
<h3 id="heading-step-3-create-the-git-credentials-secret">Step 3: Create the Git Credentials Secret</h3>
<p>Store the GitHub credentials as a Kubernetes secret in the <code>argocd</code> namespace:</p>
<pre><code class="language-plaintext">kubectl -n argocd create secret generic git-creds \
  --from-literal=username=&lt;YOUR_GITHUB_USERNAME&gt; \
  --from-literal=password=&lt;YOUR_GITHUB_PAT&gt;
</code></pre>
<p>Replace <code>&lt;YOUR_GITHUB_USERNAME&gt;</code> and <code>&lt;YOUR_GITHUB_PAT&gt;</code> with your actual values.</p>
<h3 id="heading-step-4-add-a-kustomization-file-to-your-manifests">Step 4: Add a Kustomization File to Your Manifests</h3>
<p>The Image Updater uses Kustomize's <code>images</code> field to write updated tags. If your <code>Kubernetes-manifest/</code> directory contains plain YAML files, you'll need to wrap them with a <strong>kustomization.yaml</strong> file.</p>
<p>Create a <strong>kustomization.yaml</strong> file:</p>
<pre><code class="language-plaintext">apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - main-api.yaml
  - aux-api.yaml
</code></pre>
<p><strong>How it works:</strong> When the Image Updater detects a new tag, it appends an <code>images</code> section to this file:</p>
<pre><code class="language-plaintext">images:
  - name: &lt;YOUR_GITHUB_USERNAME&gt;/main-service
    newTag: "12"
  - name: &lt;YOUR_GITHUB_USERNAME&gt;/aux-service
    newTag: "12"
</code></pre>
<p>Kustomize then overrides the image tags at deploy time, without modifying your original deployment YAML files.</p>
<p>We use Kustomize here because it allows the Image Updater to manage image tags in a separate, clean way. Instead of the Updater 'messing' with your original <code>main-api.yaml</code> file, it simply updates the <code>kustomization.yaml</code> file. Argo CD then uses Kustomize to merge those changes during deployment.</p>
<h3 id="heading-step-5-create-the-imageupdater-custom-resource">Step 5: Create the ImageUpdater Custom Resource</h3>
<p>Create <strong>image-updater.yaml</strong> in your project root:</p>
<pre><code class="language-plaintext">apiVersion: argocd-image-updater.argoproj.io/v1alpha1
kind: ImageUpdater
metadata:
  name: gitops-argocd-demo-updater
  namespace: argocd
spec:
  commonUpdateSettings:
    updateStrategy: newest-build
    allowTags: "regexp:^[0-9]+$"
  applicationRefs:
    - namePattern: "gitops-argocd-demo"
      writeBackConfig:
        method: "git:secret:argocd/git-creds"
        gitConfig:
          branch: main
          writeBackTarget: "kustomization:."
      images:
        - alias: main-service
          imageName: &lt;YOUR_DOCKERHUB_USERNAME&gt;/main-service
        - alias: aux-service
          imageName: &lt;YOUR_DOCKERHUB_USERNAME&gt;/aux-service
</code></pre>
<p>This ImageUpdater resource is the <strong>"brain"</strong> of our automated tagging system. Here is what the specific fields are doing:</p>
<p><code>updateStrategy:</code></p>
<ul>
<li><code>newest-build:</code> It tells the updater to always look for the most recent image version in DockerHub based on creation time.</li>
</ul>
<p><code>writeBackConfig:</code> This is where the magic happens. It uses the git-creds secret we created to authorize the updater to 'write' back to your repository.</p>
<p><code>writeBackTarget:</code></p>
<ul>
<li><code>kustomization:</code> We are telling the updater specifically to modify the kustomization.yaml file in the manifests folder rather than touching the deployment files directly.</li>
</ul>
<p><code>images:</code> We provide aliases (main-service and aux-service) so the updater knows exactly which images in DockerHub correspond to which containers in our Kubernetes manifests.</p>
<p><strong>Apply the ImageUpdater CR to the cluster:</strong></p>
<pre><code class="language-plaintext">kubectl apply -f image-updater.yaml -n argocd
</code></pre>
<p>Push the kustomization.yaml to your Git repository (the Image Updater clones the repo, so it must exist remotely):</p>
<pre><code class="language-plaintext">git add Kubernetes-manifest/kustomization.yaml
git commit -m "Add kustomization.yaml for image updater write-back"
git push origin main
</code></pre>
<h3 id="heading-step-6-verify-the-image-updater">Step 6: Verify the Image Updater</h3>
<p>Check the Image Updater logs to confirm it's working:</p>
<pre><code class="language-plaintext">kubectl logs -n argocd deployment/argocd-image-updater-controller --tail=20
</code></pre>
<p><strong>Successful output looks like:</strong></p>
<pre><code class="language-plaintext">msg="Starting image update cycle, considering 1 application(s) for update"
msg="Setting new image to YOUR_DOCKERHUB_USERNAME/main-service:11"
msg="Successfully updated image 'YOUR_DOCKERHUB_USERNAME/main-service:7' to 'YOUR_DOCKERHUB_USERNAME/main-service:11'"
msg="Setting new image to YOUR_DOCKERHUB_USERNAME/aux-service:11"
msg="Committing 2 parameter update(s) for application gitops-argocd-demo"
msg="git push origin main"
msg="Successfully updated the live application spec"
msg="Processing results: applications=1 images_considered=2 images_skipped=0 images_updated=2 errors=0"
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You have successfully implemented a professional-grade GitOps loop from scratch. By integrating GitHub Actions, Argo CD, and the Argo CD Image Updater, you’ve bridged the gap between your source code and your live environment.</p>
<p>Think about the workflow you just built:</p>
<ol>
<li><p>You push code to GitHub.</p>
</li>
<li><p>GitHub Actions builds and tags a fresh Docker image.</p>
</li>
<li><p>Argo CD Image Updater detects that new tag and automatically commits it back to your Git manifests.</p>
</li>
<li><p>Argo CD pulls those changes and reconciles your cluster to the new desired state.</p>
</li>
</ol>
<p>No more manual <code>kubectl apply</code>, no more configuration drift, and no more 2:00 AM mysteries. Your Git repository is now truly the Single Source of Truth. If it isn't in Git, it doesn't exist in your cluster, and that is the ultimate DevOps superpower.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster ]]>
                </title>
                <description>
                    <![CDATA[ I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet contro ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kubernetes-self-healing-explained/</link>
                <guid isPermaLink="false">69aae80e78c5adcd0e1c63bc</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Testing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Osomudeya Zudonu ]]>
                </dc:creator>
                <pubDate>Fri, 06 Mar 2026 14:43:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/ef1ba178-622f-4a28-b58a-7fb8a58be964.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet controller fire, an OOMKill from <code>kubectl describe</code>, or watched pod endpoints go empty during a cascading failure. That's where 3 am incidents find you. This tutorial puts you on the other side of it.</p>
<p>You will clone one repo, spin up a real 3-node cluster, break it seven different ways, and watch it fix itself each time. No simulated output or fake clusters. Real Kubernetes, real failures, and real recovery. By the end, you will recognize these failure patterns when they show up in your production environment.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-kubelab-is">What KubeLab Is?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-to-get-the-lab-running">How to Get the Lab Running</a></p>
</li>
<li><p><a href="#heading-simulation-1-kill-random-pod">Simulation 1 — Kill Random Pod</a></p>
</li>
<li><p><a href="#heading-simulation-2-drain-a-worker-node">Simulation 2 — Drain a Worker Node</a></p>
</li>
<li><p><a href="#heading-simulation-3-cpu-stress-and-throttling">Simulation 3 — CPU Stress and Throttling</a></p>
</li>
<li><p><a href="#heading-simulation-4-memory-stress-and-oomkill">Simulation 4 — Memory Stress and OOMKill</a></p>
</li>
<li><p><a href="#heading-simulation-5-database-failure">Simulation 5 — Database Failure</a></p>
</li>
<li><p><a href="#heading-simulation-6-cascading-pod-failure">Simulation 6 — Cascading Pod Failure</a></p>
</li>
<li><p><a href="#heading-simulation-7-readiness-probe-failure">Simulation 7 — Readiness Probe Failure</a></p>
</li>
<li><p><a href="#heading-how-to-read-the-signals-in-grafana">How to Read the Signals in Grafana</a></p>
</li>
<li><p><a href="#heading-how-to-use-this-for-production-debugging">How to Use This for Production Debugging</a></p>
</li>
</ul>
<h2 id="heading-what-is-kubelab"><strong>What is KubeLab?</strong></h2>
<p>KubeLab is an open-source Kubernetes failure simulation lab. It runs a real Node.js backend, a PostgreSQL database, Prometheus and Grafana, all inside a real cluster. When you click "Kill Pod", the backend calls the Kubernetes API and deletes an actual running pod. Nothing is fake.</p>
<table>
<thead>
<tr>
<th>Simulation</th>
<th>What it teaches</th>
</tr>
</thead>
<tbody><tr>
<td>Kill Random Pod</td>
<td>ReplicaSet self-healing, pod immutability</td>
</tr>
<tr>
<td>Drain Worker Node</td>
<td>Zero-downtime maintenance, PodDisruptionBudgets</td>
</tr>
<tr>
<td>CPU Stress</td>
<td>Throttling vs crashing, invisible latency</td>
</tr>
<tr>
<td>Memory Stress</td>
<td>OOMKill, exit code 137, silent restart loops</td>
</tr>
<tr>
<td>Database Failure</td>
<td>StatefulSets, PVC persistence</td>
</tr>
<tr>
<td>Cascading Pod Failure</td>
<td>Why replicas: 2 isn't enough</td>
</tr>
<tr>
<td>Readiness Probe Failure</td>
<td>Liveness vs readiness, traffic control</td>
</tr>
</tbody></table>
<p>Plan about 90 minutes for the full path. Or jump directly to any simulation if you have a specific production problem you want to reproduce.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/1cd2a06d-7a7a-4250-ab5d-8a78d24af7b5.png" alt="KubeLab cluster map — pods grouped by node, color-coded by status. During simulations, chips change color and move between nodes in real time." style="display:block;margin:0 auto" width="920" height="505" loading="lazy">

<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>You need basic familiarity with Docker and comfort with the command line, but no prior Kubernetes experience is required.</p>
<p><strong>Hardware:</strong> 8GB RAM minimum, 16GB recommended. The lab can run on Mac, Linux, or Windows with WSL2. You'll need to install three tools. Multipass spins up Ubuntu VMs for the cluster. kubectl is the Kubernetes CLI you will use for every simulation. Git clones the repo. If you cannot run three VMs, the repo includes a Docker Compose preview at <a href="https://github.com/Osomudeya/kubelab/blob/main/setup/docker-compose-preview.md">setup/docker-compose-preview.md</a> full UI with mock data, no real cluster needed.</p>
<h2 id="heading-how-to-get-the-lab-running"><strong>How to Get the Lab Running</strong></h2>
<p>Full cluster setup lives at <a href="https://github.com/Osomudeya/kubelab/blob/main/setup/k8s-cluster-setup.md">setup/k8s-cluster-setup.md</a> in the repo. It walks through creating three VMs with Multipass, installing MicroK8s, joining the worker nodes, and deploying KubeLab. Follow it until all eleven pods show Running:</p>
<pre><code class="language-bash">kubectl get pods -n kubelab
# All 11 pods should show STATUS: Running
</code></pre>
<p>Then open two port-forwards in separate terminal tabs and keep them running for the entire tutorial:</p>
<pre><code class="language-bash"># Tab 1 — KubeLab UI at http://localhost:8080
kubectl port-forward -n kubelab svc/frontend 8080:80

# Tab 2 — Grafana at http://localhost:3000
kubectl port-forward -n kubelab svc/grafana 3000:3000
</code></pre>
<p>Grafana login: <code>admin</code> / <code>kubelab-grafana-2026</code>.</p>
<blockquote>
<p>Position the KubeLab UI and Grafana side by side. Left half of the screen is the app. Right half is Grafana. You will watch both simultaneously from Simulation 3 onward.</p>
</blockquote>
<h2 id="heading-simulation-1-kill-random-pod"><strong>Simulation 1: Kill Random Pod</strong></h2>
<p>This simulation deletes a running backend pod via the Kubernetes API. Without Kubernetes, you would SSH to the server, find the crashed process, and restart it manually, usually discovered by a user alert at 3am.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -w</code>. Watch for a pod to go Terminating then a new one to appear.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/3d3cb733-407a-482f-82e7-cbeea496157b.png" alt="Terminals running side by side before clicking Run, events streaming, pod watch, frontend and grafana port forwarding." style="display:block;margin:0 auto" width="706" height="1250" loading="lazy">

<pre><code class="language-bash">kubectl get pods -n kubelab -w
# backend-abc123  1/1   Terminating   0   2m
# backend-xyz789  1/1   Running       0   0s   ← ReplicaSet created a replacement
</code></pre>
<p><strong>What happened:</strong> The ReplicaSet controller noticed actual(1) did not match desired(2) and created a replacement in parallel with the shutdown. The Endpoints controller removed the dying pod from the Service before SIGTERM fired, so zero traffic hit a dying pod.</p>
<p><strong>The production trap:</strong> A missing readiness probe means the new pod receives traffic before it has opened a DB connection. You get 500s on every deployment for 2–3 seconds.</p>
<p><strong>The fix:</strong> Set <code>replicas: 2</code>, add a readiness probe, and set <code>terminationGracePeriodSeconds</code> to match your longest request timeout.</p>
<h2 id="heading-simulation-2-drain-a-worker-node"><strong>Simulation 2: Drain a Worker Node</strong></h2>
<p>This simulation cordons a worker node, then evicts all its pods to the remaining node.</p>
<p>To <em><strong>"cordon"</strong></em> a worker node means to mark it as unschedulable. When you run <code>kubectl cordon &lt;node-name&gt;</code>, the Kubernetes control plane adds the <code>node.kubernetes.io/unschedulable:NoSchedule</code> taint to the node. (A <strong>taint</strong> is a marker that tells the scheduler to avoid placing pods on that node unless they have a matching "toleration.") This tells the scheduler to stop placing any new pods onto that node. It does <strong>not</strong> affect the pods that are already running there.</p>
<p>Cordoning is the first, safe step in preparing a node for maintenance. It ensures that while you are draining the node, the scheduler isn't simultaneously trying to schedule new workloads onto it, which would defeat the purpose of the drain.</p>
<p>Without Kubernetes you would drain the server manually, guess when in-flight requests finish, patch it, and bring it back, the window of downtime is unpredictable.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -o wide -w</code>. Watch which node each pod runs on.</p>
<pre><code class="language-bash">kubectl get pods -n kubelab -o wide -w
</code></pre>
<pre><code class="language-plaintext">NAME                     NODE               STATUS
backend-abc123-xk2qp    kubelab-worker-1   Terminating   ← evicted
backend-abc123-n7mw3    kubelab-worker-2   Running       ← rescheduled
</code></pre>
<p>In <code>kubectl get nodes</code> the node shows <code>Ready,SchedulingDisabled</code> until you run <code>kubectl uncordon</code>.</p>
<p><strong>What happened:</strong> The node spec got <code>spec.unschedulable=true</code>. The Eviction API ran per pod. That path goes through PodDisruptionBudget policy checks before proceeding, unlike a raw delete. A raw <code>kubectl delete pod</code> bypasses this check entirely — which is why draining with <code>kubectl drain</code> is always safer than deleting pods manually during maintenance.</p>
<p><strong>The production trap:</strong> Two replicas with no pod anti-affinity often land on the same node. Drain that node and both pods evict at once. Complete downtime despite <code>replicas: 2</code>.</p>
<p><strong>The fix:</strong> Use pod anti-affinity with topology key: <code>kubernetes.io/hostname</code> and a PodDisruptionBudget with <code>minAvailable: 1</code>.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/1161cbf9-2482-41c7-9b5c-751762d3baaa.png" alt="Node drain CLI output: cordoned node shows Ready,SchedulingDisabled; pods reschedule to the other node." style="display:block;margin:0 auto" width="729" height="128" loading="lazy">

<h2 id="heading-simulation-3-cpu-stress-and-throttling"><strong>Simulation 3: CPU Stress and Throttling</strong></h2>
<p>This simulation burns CPU inside a backend pod for 60 seconds, hitting the 200m limit. Without Kubernetes, one runaway process can consume all CPU on the host and starve every other service.</p>
<p><strong>Before you click:</strong> Run <code>watch -n 2 kubectl top pods -n kubelab</code> and open the Grafana CPU Usage panel.</p>
<pre><code class="language-bash">kubectl top pods -n kubelab
# backend-abc123   200m   ← pegged at limit for 60s; the other pod stays ~15m
</code></pre>
<p><strong>What happened:</strong> The Linux CFS scheduler enforces the cgroup limit by granting 20ms of CPU per 100ms period then freezing all processes in the cgroup for 80ms. The pod is not slow because it is broken. It is slow because it is frozen 80% of the time.</p>
<p><strong>The production trap:</strong> <code>kubectl top</code> shows the pod using 95-150m, which looks normal. The metric shows usage at the ceiling, not the throttle rate. Teams spend hours checking application code for a latency bug that is actually a CPU limit set too low.</p>
<p><strong>The fix:</strong> For latency-sensitive workloads, set CPU requests but remove CPU limits. Requests tell the scheduler where to place the pod without throttling at runtime. Confirm throttling with <code>rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/5e3fd49b-c9a0-4271-9be7-b7fec3122c1a.png" alt="One backend pod flatlined at exactly 95-150m for 60 seconds. A healthy pod's CPU fluctuates, this flat ceiling is the throttle." style="display:block;margin:0 auto" width="1476" height="788" loading="lazy">

<h2 id="heading-simulation-4-memory-stress-and-oomkill"><strong>Simulation 4: Memory Stress and OOMKill</strong></h2>
<p>This simulation allocates memory in 50MB chunks inside a backend pod until the kernel kills it. Without Kubernetes the process dies, the server goes down, and someone gets paged.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -l app=backend -w</code> and open the Grafana Memory Usage panel.</p>
<pre><code class="language-bash">kubectl get pods -n kubelab -l app=backend -w
# backend-abc123   0/1   OOMKilled   3   5m   ← no Terminating phase; SIGKILL bypasses graceful shutdown
</code></pre>
<p><strong>What happened:</strong> The cgroup memory limit crossed 256Mi. The Linux kernel OOM killer scored processes in the container's cgroup and sent SIGKILL (exit code 137) to the top consumer. Not Kubernetes, the kernel. SIGKILL cannot be caught or handled, so no preStop hook runs and in-memory data or open transactions can be lost. Kubernetes only observed the exit, labeled it OOMKilled, and started a fresh container.</p>
<p><strong>The production trap:</strong> The pod runs fine for 8 hours, OOMKills, and restarts. Memory resets to zero and everything looks healthy again. This repeats every 8 hours. The restart count climbs to 7, then 15, then 30, but no alert fires because the metrics look normal between crashes. You find out when a user emails saying the app has been "a bit glitchy lately."</p>
<p><strong>The fix:</strong> Alert on <code>rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) &gt; 3</code> before users notice.<br>The Prometheus expression means: look at how many times containers in the <code>kubelab</code> namespace have restarted over the last hour, calculate how fast that number is increasing per second, and fire an alert if that rate exceeds the equivalent of 3 restarts per hour. A healthy pod rarely restarts. Several restarts in an hour usually means the container is hitting its memory limit, dying, and coming back in a loop, so this alert catches the silent OOMKill pattern before users do.</p>
<p>Confirm it happened:</p>
<pre><code class="language-bash">kubectl describe pod -n kubelab &lt;pod-name&gt; | grep -A 5 "Last State:"
# Reason: OOMKilled
# Exit Code: 137
</code></pre>
<p>To see the last output before the kernel killed the process, run <code>kubectl logs -n kubelab &lt;pod-name&gt; --previous</code>. The log stream stops abruptly with no shutdown message, SIGKILL leaves no time for cleanup or final logs.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698d563262d4ce66226a844a/8ced107b-9d14-4d40-b6d6-7ae0fe35b1b7.png" alt="One backend pod's memory climbs, then the line drops at the OOMKill and reappears as the container restarts. The other pod's line stays flat the whole time" style="display:block;margin:0 auto" width="735" height="298" loading="lazy">

<h2 id="heading-simulation-5-database-failure"><strong>Simulation 5: Database Failure</strong></h2>
<p>This simulation scales the PostgreSQL StatefulSet to 0 replicas. The pod terminates completely. Without Kubernetes, the database server crashes and data recovery depends on whether backups exist and when they ran.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods,pvc -n kubelab</code>. Note that the PVC exists before you start.</p>
<pre><code class="language-bash">kubectl get pods,pvc -n kubelab
# postgres-0   (gone)
# postgres-data-postgres-0   Bound   ← PVC stays; data lives on the volume
</code></pre>
<p>A PVC, or PersistentVolumeClaim, is a request for storage by a user. Think of it as a pod's way of saying, "I need a certain amount of durable, persistent storage." In the context of a stateful application like PostgreSQL, the PVC is critical. When the database pod is deleted, the PVC (and the underlying PersistentVolume it is bound to) remains. This is where the actual database files are stored. When a new <code>postgres-0</code> pod is created, the StatefulSet knows to re-attach the same PVC, ensuring the new pod has access to all the old data, preventing data loss.</p>
<p><strong>What happened:</strong> The StatefulSet controller deleted the pod but left the PersistentVolumeClaim untouched. StatefulSets guarantee stable names and stable PVC binding. <code>postgres-0</code> always mounts <code>postgres-data-postgres-0</code>. When you restore, the same pod name comes back and reattaches the same volume. PostgreSQL replays WAL to reach a consistent state.</p>
<p><strong>The production trap:</strong> Apps without connection retry logic return 500s and stay broken even after PostgreSQL restores. Connection pools that do not validate on acquire hold dead connections forever.</p>
<p><strong>The fix:</strong> Add connection retry with exponential backoff in your app. Use network-attached storage (EBS, GCE PD) in production so the pod can reschedule to any node.</p>
<h2 id="heading-simulation-6-cascading-pod-failure"><strong>Simulation 6: Cascading Pod Failure</strong></h2>
<p>This simulation deletes both backend replicas at the same time. If everything is down, without Kubernetes, you'd have to restart every service manually, and hope they come up in the right order.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get endpoints -n kubelab backend-service -w</code>. Watch the IP list.</p>
<pre><code class="language-bash">kubectl get endpoints -n kubelab backend-service -w
# ENDPOINTS   &lt;none&gt;   ← every request in this window gets Connection refused
</code></pre>
<p><strong>What happened:</strong> Both pods were deleted. The Service had zero endpoints. The ReplicaSet created two replacements in parallel, but traffic stayed broken until both passed their readiness probes. The endpoint list went empty and came back. You can see the exact downtime window in Grafana's HTTP Request Rate panel.</p>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/6cae14e0-faf2-4d42-90f4-32d00a1b4119.png" alt="The 5xx spike during Cascading Failure, 5 to 15 seconds of real downtime with the exact window timestamped" style="display:block;margin:0 auto" width="746" height="291" loading="lazy">

<p><strong>The production trap:</strong> <code>replicas: 2</code> protects you from one pod dying at a time, nothing more.<br>If both replicas land on the same node and that node goes down, you have zero replicas and full downtime.<br>Check right now with <code>kubectl get pods -n kubelab -o wide | grep backend</code>, and if both pods show the same NODE, you are one node failure away from an outage.</p>
<p><strong>The fix:</strong> Use pod anti-affinity to force replicas onto different nodes and a PodDisruptionBudget with <code>minAvailable: 1</code> to block any voluntary action that would leave zero replicas.</p>
<h2 id="heading-simulation-7-readiness-probe-failure"><strong>Simulation 7: Readiness Probe Failure</strong></h2>
<p>This simulation makes one backend pod fail its readiness probe for 120 seconds without restarting it. Without Kubernetes, you'd have no way to take a pod out of traffic rotation without killing it. This is what happens in production when your app connects to a database on startup but the DB is slow. The pod is alive, but it's not ready. Kubernetes holds it out of rotation until it is.</p>
<p><strong>Before you click:</strong> Run <code>kubectl get pods -n kubelab -w</code> in one tab and <code>kubectl get endpoints -n kubelab backend-service -w</code> in another.</p>
<pre><code class="language-bash"># Pods tab: STATUS Running, RESTARTS 0 — almost nothing changes
# Endpoints tab: one IP disappears — the pod is alive but not receiving traffic
</code></pre>
<p><strong>What happened:</strong> <code>/ready</code> returned 503. The kubelet marked the pod <code>Ready=False</code>. The Endpoints controller removed its IP from the Service. The liveness probe <code>/health</code>) still returned 200, so no restart. After 120 seconds <code>/ready</code> recovered and the pod rejoined. Run <code>kubectl logs -n kubelab &lt;failing-pod&gt; -f</code> to see the app log 503s for the readiness endpoint while the pod stays Running and receives no traffic.</p>
<p><strong>The production trap:</strong> Readiness probes that check external dependencies (database, cache, downstream API) will remove all pods from rotation when that dependency goes down. Instead of degrading gracefully, your entire app goes offline.</p>
<p><strong>The fix:</strong> Readiness probes should test only what the pod itself controls. Use a separate deep health endpoint for dependency checks and never tie readiness to external service availability.</p>
<h2 id="heading-4-how-to-read-the-signals-in-grafana"><strong>4. How to Read the Signals in Grafana</strong></h2>
<img src="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/698d563262d4ce66226a844a/e6709c25-2d80-489c-b7fb-418ef303b7e2.png" alt="A screenshot showing my grafana dashboards" style="display:block;margin:0 auto" width="1110" height="1201" loading="lazy">

<p><code>kubectl</code> shows current state. Grafana shows what happened over time. That history is essential when you are debugging something that started 4 hours ago.</p>
<h3 id="heading-the-four-panels-that-matter"><strong>The Four Panels that Matter</strong></h3>
<p><strong>Pod Restarts:</strong> A flat line is good. A step up every few hours is a silent OOMKill loop — the most common invisible production failure.</p>
<p><strong>CPU Usage:</strong> A healthy pod's CPU fluctuates. A throttled pod's CPU is unnaturally flat at its limit. That flat ceiling is the signal, not the number.</p>
<p><strong>Memory Usage:</strong> Watch for a line that climbs steadily then disappears. That disappearance is an OOMKill. The line reappearing from zero is the restart.</p>
<p><strong>HTTP Request Rate:</strong> During Cascading Failure you see a spike of 5xx for 5–15 seconds, the exact downtime window, timestamped.</p>
<h3 id="heading-5-how-to-read-the-terminal-signals"><strong>5. How to Read the Terminal Signals</strong></h3>
<p>What you see in the terminal during and after each simulation tells you things Grafana cannot. Five commands matter.</p>
<p>The <code>-w</code> flag on <code>kubectl get pods -n kubelab -w</code> streams changes in real time. The columns that matter are READY, STATUS, and RESTARTS. READY shows containers ready vs total — <code>1/2</code> means one container is alive but not passing its readiness probe. STATUS shows the pod lifecycle phase: Running, Pending, Terminating, OOMKilled. RESTARTS is the most important column in production. A number climbing silently over days is a memory leak or a crash loop nobody has noticed yet.</p>
<p><code>kubectl get events -n kubelab --sort-by=.lastTimestamp</code> is the control plane's diary. Every action the cluster took is here: Killing, SuccessfulCreate, Scheduled, Pulled, Started, OOMKilling, BackOff. When something breaks and you do not know why, read the events. The timestamp gap between a Killing event and the next Started event is your actual downtime window — not an estimate, the exact number.</p>
<p><code>kubectl describe pod -n kubelab &lt;pod-name&gt;</code> is the deepest single-pod view. Three sections matter: Conditions (Ready: True/False tells you if the pod is in the Service endpoints), Last State (shows the previous container's exit reason — OOMKilled, exit code 137, or a crash), and Events at the bottom (the scheduler's reasoning for every placement decision). This is the first command to run when a pod is misbehaving.</p>
<p><code>kubectl get endpoints -n kubelab backend-service</code> shows which pod IPs are actually receiving traffic right now. A pod can show Running in <code>kubectl get pods</code> and be completely absent from this list. That is a readiness probe failure. If this list is empty, no request to that Service will succeed regardless of how many pods show Running. Check this whenever users report errors but pods look healthy.</p>
<p><code>kubectl logs -n kubelab &lt;pod-name&gt;</code> shows the container's stdout and stderr. Use <code>-f</code> to follow the stream. After a pod restarts, use <code>--previous</code> to see the logs from the container that just exited, essential when you need to know what the app was doing right before an OOMKill or crash. Logs are per container and are gone once the pod is replaced, so grab them before the ReplicaSet creates a new pod with a new name.</p>
<p>A full event sequence during Kill Pod recovery looks like this:</p>
<pre><code class="language-bash">kubectl get events -n kubelab --sort-by=.lastTimestamp | tail -10
</code></pre>
<pre><code class="language-plaintext">REASON            MESSAGE
Killing           Stopping container backend          ← SIGTERM sent
SuccessfulCreate  Created pod backend-xyz789          ← ReplicaSet fired
Scheduled         Successfully assigned to worker-2   ← Scheduler placed it
Pulled            Container image already present     ← no pull delay
Started           Started container backend           ← running
</code></pre>
<p>The line between Killing and Started is your actual recovery time. In a healthy cluster with a cached image it is 3–8 seconds. If it takes longer, check the Scheduled line, the scheduler may have struggled to find a node.</p>
<h3 id="heading-two-prometheus-queries-worth-memorizing"><strong>Two Prometheus Queries Worth Memorizing</strong></h3>
<p><strong>First query: silent restart loop.</strong> <code>rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])</code> counts how many times containers in that namespace have restarted over the last hour and expresses it as a rate (restarts per second). A healthy workload rarely restarts. If this rate is high (for example more than 3 restarts per hour), something is killing the container repeatedly, often an OOMKill or a crash. Alert when it exceeds a threshold so you see the pattern before users report errors.</p>
<p><strong>Second query: invisible CPU throttling.</strong> <code>rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])</code> measures how much time, per second, the Linux scheduler spent throttling containers in that namespace over the last 5 minutes. A result of 0.25 means the container was frozen 25% of the time. High latency with no restarts and "normal" CPU usage in <code>kubectl top</code> often means the CPU limit is too low and the kernel is throttling the process. Alert when this rate exceeds about 0.25 (25% throttled).</p>
<pre><code class="language-plaintext"># Silent restart loop — alert when this exceeds 3 per hour
rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])

# Invisible throttling — alert when this exceeds 25%
rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])
</code></pre>
<p>Run these against your own cluster. Not just KubeLab. These are production queries.</p>
<h2 id="heading-6-how-to-use-this-for-production-debugging"><strong>6. How to Use This for Production Debugging</strong></h2>
<p>The repo includes <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/diagnose.md">docs/diagnose.md</a>, a symptom-to-simulation map. Find the simulation that reproduces your issue, run it in KubeLab, and understand the mechanics before you touch production.</p>
<p><strong>Exit code 137, pods restarting.</strong> Run the Memory Stress simulation. Confirm with <code>kubectl describe pod | grep -A 5 "Last State:"</code> and look for <code>Reason: OOMKilled</code>. Raise limits or find the leak. The simulation shows both.</p>
<p><strong>High latency, pods look healthy, zero restarts.</strong> Run the CPU Stress simulation. Check <code>container_cpu_cfs_throttled_seconds_total</code> in Prometheus. If it climbs, your CPU limit is too low and the pod is frozen by CFS.</p>
<p><strong>503 on some requests, pods show Running.</strong> Run the Readiness Probe Failure simulation. Check <code>kubectl get endpoints</code> — one pod IP is missing despite Running. The pod gets zero traffic.</p>
<p><strong>Pods stuck Pending after a node went down.</strong> Run the Drain Node simulation. Run <code>kubectl describe pod &lt;pending-pod&gt;</code> and read Events. The scheduler will state why it cannot place the pod, often insufficient capacity or a PVC on the failed node.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>You just broke a real Kubernetes cluster seven ways and watched it fix itself each time. You have seen the ReplicaSet controller fire, read an OOMKill from <code>kubectl describe</code>, watched endpoints go empty during a cascading failure, and understood why a pod can be Running and receiving zero traffic at the same time.</p>
<p>What you practiced here applies to other clusters, staging or production you can read but not safely break. That muscle memory (events, endpoints, restart counter) is what you reach for at 3 am when something is wrong. KubeLab is the safe place to build that reflex.</p>
<p>The repo holds more than this article covered. Explore mode lets you run simulations without the guided flow. The full interview prep doc at <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/interview-prep.md">docs/interview-prep.md</a> has answers to the 13 most common Kubernetes interview questions. The observability guide at <a href="https://github.com/Osomudeya/kubelab/blob/main/docs/observability.md">docs/observability.md</a> covers Prometheus and Grafana setup in detail.</p>
<p>If this helped you, star the repo at <a href="https://github.com/Osomudeya/kubelab">https://github.com/Osomudeya/kube-lab</a> and share it with someone who is learning Kubernetes the hard way.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Master Kubernetes Through Production-Ready Practice ]]>
                </title>
                <description>
                    <![CDATA[ Stop memorizing isolated commands and start building like a platform engineer. We just posted comprehensive Kubernetes course on the freeCodeCamp.org YouTube channel. This hands-on course is designed  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/master-kubernetes-through-production-ready-practice/</link>
                <guid isPermaLink="false">69a06148ab6baac8ff198f13</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 26 Feb 2026 15:05:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5f68e7df6dfc523d0a894e7c/b1d63be6-f5d9-4ddc-8afc-4455c2ed95ee.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Stop memorizing isolated commands and start building like a platform engineer.</p>
<p>We just posted comprehensive Kubernetes course on the <a href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel. This hands-on course is designed to bridge the gap between theoretical container orchestration and real-world deployment. Saiyam Pathak developed this course.</p>
<p>You will learn to deploy a cloud-native microservices stack from the ground up. You’ll explore the inner workings of the Kubernetes architecture, including the Control Plane, Worker Nodes, and essential interfaces like CRI, CNI, and CSI. You'll learn to ship a functional application complete with a frontend, auth service, and game engine.</p>
<p>This course is a deep dive into the modern Kubernetes ecosystem. You will implement advanced industry standards such as Gateway API for traffic management, CloudNativePG for managing PostgreSQL databases, and cert-manager for automated HTTPS security. By the time you reach the final demo, you’ll have integrated full-stack observability using Prometheus and Grafana, giving you the confidence to manage production-grade environments.</p>
<p>Watch the full course on <a href="https://youtu.be/_4uQI4ihGVU">the freeCodeCamp.org YouTube channel</a> (6-hour watch).</p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/_4uQI4ihGVU?si=jZUjCZl2V2T7fEz9" frameborder="0" allowfullscreen="" title="Embedded content" loading="lazy"></iframe></div> ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
