Destiny Erhabor - freeCodeCamp.org

How to Implement Zero-Trust Workload Identity in Kubernetes with SPIFFE, SPIRE, and Cilium

Destiny Erhabor — Tue, 07 Jul 2026 21:47:50 +0000

Your network policy says: allow traffic from 10.0.1.45.

Yesterday, 10.0.1.45 was your payment service. Today, after a rolling deployment, it's your logging agent. Your payment service is now at 10.0.1.89.

Kubernetes has already updated all the endpoints and service records — but your network policy has no idea. It silently allows traffic through based on an IP address that no longer belongs to the workload you intended to trust.

This is the workload identity problem. An IP address isn't an identity; it's a location. And in a Kubernetes cluster, location changes constantly. Building security policy on top of IP addresses means your security posture silently degrades every time a pod is scheduled, rescheduled, or scaled.

The answer is cryptographic workload identity: every workload gets a certificate-backed identity that proves who it is, not where it is. Services authenticate each other using those certificates before exchanging any data. If the certificate doesn't match, the connection is refused, regardless of what IP address it came from.

This is what SPIFFE and SPIRE provide. And this is how Cilium enforces it using eBPF, without injecting a sidecar into every pod.

In this article you'll understand how the SPIFFE identity model works, deploy SPIRE to issue cryptographic identities to workloads, and use Cilium's built-in SPIRE integration to enforce mutual TLS between services without touching your application code.

Prerequisites

Familiarity with Kubernetes RBAC and pod security — this handbook covers the foundations
Familiarity with TLS certificates and Kubernetes Secrets — this handbook covers cert-manager and certificate concepts
Helm 3 and the Cilium CLI installed
A kind cluster — you'll create a fresh one with Cilium as the CNI in this article
Patience: this is the most complex demo I've covered in this group of articles. SPIRE has more moving parts than anything else covered so far.

All demo files are in the companion GitHub repository.

Prerequisites
The Workload Identity Problem
How SPIFFE Works
How SPIRE Works
How Cilium Implements Mutual TLS with SPIFFE
Demo 1 — Install Cilium with SPIRE Integration
Demo 2 — Enforce Mutual TLS with a CiliumNetworkPolicy
Conclusion
Cleanup (kind)

The Workload Identity Problem

The opening scenario isn't theoretical. In Kubernetes, pods are ephemeral. The scheduler can place a pod on any node, and a pod's IP address is assigned at scheduling time from the node's IP pool.

When a pod is deleted and recreated through a rolling deployment, a node drain, or an autoscaler event, it gets a new IP address. If you've written a NetworkPolicy that says, "allow traffic from this IP", that policy is now pointing at nothing, or worse, at a different workload.
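
To make the failure mode concrete, here is a toy model of the opening scenario in plain shell. It is only a sketch: 10.0.1.45 and 10.0.1.89 come from the scenario above, while 10.0.1.12 (the logging agent's earlier address) and the function names are assumptions invented for illustration.

```shell
#!/bin/sh
# Toy model: the same two workloads before and after a rolling deployment.
# Each line is "ip workload". 10.0.1.12 is a made-up earlier address.
BEFORE='10.0.1.45 payment-service
10.0.1.12 logging-agent'
AFTER='10.0.1.89 payment-service
10.0.1.45 logging-agent'

# workload_at TABLE IP: print the workload that currently owns IP.
workload_at() (
  printf '%s\n' "$1" | awk -v ip="$2" '$1 == ip { print $2 }'
)

# admitted_by_ip_rule TABLE: an IP-based rule admits whoever holds ALLOWED_IP.
ALLOWED_IP=10.0.1.45
admitted_by_ip_rule() (
  workload_at "$1" "$ALLOWED_IP"
)

echo "before rollout, rule admits: $(admitted_by_ip_rule "$BEFORE")"
echo "after rollout, rule admits:  $(admitted_by_ip_rule "$AFTER")"
```

The rule never changed, yet the workload it admits did. That is the gap an identity-based policy closes.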

Kubernetes Service names help here for east-west traffic: a Service name resolves consistently regardless of which pods back it. But a NetworkPolicy doesn't match on Service names. It selects pods by label, and a label match is not a cryptographic assertion. Anyone who can create a pod carrying the right labels can bypass it.

What you actually want is this: before service A and service B exchange any data, each proves its identity to the other cryptographically. If either side can't prove it is who it claims to be, the other refuses the connection. This is mutual TLS, and the key question is: where do the identities come from?

SPIFFE answers that question.

How SPIFFE Works

SPIFFE — Secure Production Identity Framework for Everyone — is a CNCF standard that defines a model for workload identity. It doesn't implement anything by itself. It specifies the format of identities, the API for requesting them, and the trust model that makes them verifiable across services, clusters, and clouds. SPIRE is the reference implementation of that specification.

SPIFFE IDs and Trust Domains

A SPIFFE identity is a URI with a specific format:

spiffe://<trust-domain>/<path>

The trust domain is a string that identifies the administrative boundary — typically your organisation, cluster, or environment. Everything within the same trust domain can verify each other's identities. Identities from different trust domains require explicit federation configuration.

Some concrete examples:

spiffe://payments.corp/ns/production/sa/checkout
spiffe://analytics.corp/ns/data/sa/pipeline-worker
spiffe://cluster.local/ns/monitoring/sa/prometheus

The path after the trust domain is arbitrary — it's defined by your SPIRE configuration and typically encodes the Kubernetes namespace and service account of the workload.
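
The URI structure is simple enough to take apart with shell parameter expansion. The sketch below is a shape check only, not a full validator for the SPIFFE specification, and parse_spiffe_id is a hypothetical helper name.

```shell
#!/bin/sh
# parse_spiffe_id ID: print "<trust-domain> <path>", or exit 1 if ID does not
# look like spiffe://<trust-domain>/<path>.
parse_spiffe_id() (
  id=$1
  case $id in
    spiffe://?*/?*) ;;          # needs the scheme, a trust domain, and a path
    *) echo "not a SPIFFE ID with a path: $id" >&2; exit 1 ;;
  esac
  rest=${id#spiffe://}          # strip the scheme
  trust_domain=${rest%%/*}      # everything up to the first slash
  path=/${rest#*/}              # the remainder, with its leading slash restored
  printf '%s %s\n' "$trust_domain" "$path"
)

parse_spiffe_id spiffe://payments.corp/ns/production/sa/checkout
# prints: payments.corp /ns/production/sa/checkout
```

Splitting on the first slash after the scheme mirrors how the trust domain and the path are treated as separate parts of the identity.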

SVIDs: The Cryptographic Identity Document

An SVID — SPIFFE Verifiable Identity Document — is how a SPIFFE identity is materialised into something a service can actually use.

There are two SVID formats.

An X.509 SVID is a standard TLS certificate where the SPIFFE ID is embedded in the Subject Alternative Name (SAN) URI field. Because it's a standard X.509 certificate, any TLS library can use it without modification.

The workload presents this certificate in a TLS handshake, and the peer verifies the certificate was signed by a trusted SPIRE server. This is the format used for long-lived connections like gRPC streams.
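
You can see where the SPIFFE ID lives in a certificate with nothing more than openssl. This sketch assumes OpenSSL 1.1.1 or newer (for -addext) and creates a self-signed certificate purely for inspection. A real SVID is signed by the SPIRE CA, and the helper names here are hypothetical.

```shell
#!/bin/sh
# make_demo_svid SPIFFE_ID OUT_DIR: write a self-signed certificate whose
# URI SAN carries SPIFFE_ID. This only imitates the shape of an X.509 SVID.
make_demo_svid() (
  openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/O=SVID demo" \
    -addext "subjectAltName=URI:$1" \
    -keyout "$2/key.pem" -out "$2/cert.pem" 2>/dev/null
)

# spiffe_id_of CERT: print the spiffe:// URI found in the certificate's SAN.
spiffe_id_of() (
  openssl x509 -in "$1" -noout -text |
    grep -o 'URI:spiffe://[^,[:space:]]*' |
    sed 's/^URI://'
)

dir=$(mktemp -d)
make_demo_svid spiffe://example.org/ns/production/sa/checkout "$dir"
spiffe_id_of "$dir/cert.pem"
rm -rf "$dir"
```

A verifier does the same lookup during the TLS handshake: it validates the chain against the trust bundle, then reads the URI SAN to learn which workload it is talking to.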

A JWT SVID is a signed JSON Web Token containing the SPIFFE ID as a claim. It's suitable for request-based authentication over HTTP — pass it in an Authorization header, and the receiving service verifies the signature.

JWT SVIDs are shorter-lived than X.509 SVIDs and scoped to a specific audience to prevent token reuse across services.

For Cilium's mutual authentication, X.509 SVIDs are used. The rest of this article focuses on X.509.

The Trust Bundle

For service A to verify service B's certificate, service A needs to know which Certificate Authority signed it. In SPIFFE, this is called the trust bundle — the set of CA certificates that are trusted within a trust domain.

SPIRE makes the trust bundle available via the Workload API. When a workload requests its identity, it also receives the current trust bundle. When the SPIRE server rotates its CA, it distributes the new trust bundle to all agents, which push it to all workloads. Your application never has to manage trust bundles manually.

How SPIRE Works

SPIRE is the engine that issues SVIDs and manages the identity lifecycle. Understanding its architecture is what makes the Cilium integration make sense.

SPIRE Server and SPIRE Agent

SPIRE has two main components. The SPIRE Server is the central CA. It maintains a registry of workload entries (records that describe which SPIFFE IDs should be issued to which workloads). It issues SVIDs to agents on behalf of workloads, and it's the root of trust for the entire trust domain.

The SPIRE Agent runs on every node as a DaemonSet. It has two jobs. First, it proves to the SPIRE Server that it's running on a legitimate node. This is called node attestation. Second, it exposes the SPIFFE Workload API on a Unix socket on the node, which workloads use to request their SVIDs.

The agent caches SVIDs locally so that a temporary loss of connection to the SPIRE Server doesn't immediately break workload identity.

This split — central server, per-node agents — is deliberate. Workloads never contact the SPIRE Server directly. They only talk to the agent on their own node. The agent mediates all identity requests, which limits the blast radius if a node is compromised.

Node Attestation

When a SPIRE Agent starts up on a new node, it needs to prove its own identity to the SPIRE Server before it can serve identities to workloads. This is node attestation.

In Kubernetes, SPIRE uses PSAT — Projected Service Account Tokens — for node attestation. The agent presents a Kubernetes service account token that is projected specifically for the SPIRE server's audience. The SPIRE Server contacts the Kubernetes API to verify the token, confirms the agent is running in the expected namespace with the expected service account, and issues the agent its own SVID.

This is why SPIRE depends on the kube-apiserver's token configuration. The API server must be able to issue and review projected service account tokens for a specific audience, which is governed by its --service-account-issuer and --api-audiences flags. kind's kubeadm-provisioned clusters already satisfy this, which is why the kind cluster config in the demo below carries no extra API server flags.

Workload Attestation

Once a node has been attested, its agent can attest workloads. When a workload connects to the Workload API socket and requests an SVID, the agent identifies the calling process and collects facts about it (like its Kubernetes namespace, service account, pod name, and labels) by querying the kubelet on its node. It matches those facts against the workload entries registered in the SPIRE Server. If a matching entry exists, the agent issues the corresponding SVID.

A workload entry looks like this:

SPIFFE ID: spiffe://example.org/ns/production/sa/checkout
Parent ID: spiffe://example.org/spire/agent/k8s_psat/default/
Selectors:
  k8s:ns:production
  k8s:sa:checkout

The selectors describe the Kubernetes facts that must match. A pod running in the production namespace with service account checkout will receive the SPIFFE ID spiffe://example.org/ns/production/sa/checkout. Any other pod will not.
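
The matching rule is "every selector on the entry must be present among the workload's facts". A minimal sketch of that rule in shell follows. It is not SPIRE's implementation, the function name is hypothetical, and the pod-label selector is an extra fact added to show that surplus facts don't prevent a match.

```shell
#!/bin/sh
# entry_matches ENTRY_SELECTORS WORKLOAD_SELECTORS
# Both arguments are newline-separated selector lists. Succeeds only if every
# selector on the entry also appears among the facts gathered for the workload.
entry_matches() (
  entry=$1
  workload=$2
  printf '%s\n' "$entry" | while IFS= read -r selector; do
    [ -n "$selector" ] || continue
    # -x: whole-line match, -F: fixed string, -q: quiet
    printf '%s\n' "$workload" | grep -qxF -- "$selector" || exit 1
  done
)

ENTRY='k8s:ns:production
k8s:sa:checkout'
POD='k8s:ns:production
k8s:sa:checkout
k8s:pod-label:app:checkout'

if entry_matches "$ENTRY" "$POD"; then
  echo "issue spiffe://example.org/ns/production/sa/checkout"
fi
```

A pod in another namespace, or one running under a different service account, fails the check because at least one entry selector is missing from its facts.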

SVID Issuance and Rotation

SVIDs are short-lived by design. The default TTL for X.509 SVIDs in SPIRE is one hour. The SPIRE Agent automatically rotates them in the background — generating a new key pair, requesting a fresh SVID from the server, and making the new SVID available on the Workload API before the old one expires.

Workloads that use the Workload API directly or tools like the SPIFFE CSI driver get the new SVID transparently.

Short-lived credentials are a core zero-trust practice. If a workload's SVID is stolen, it's usable for at most an hour. Compare that to a Kubernetes service account token, which was historically valid forever.
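
To get a feel for the timing, the arithmetic below computes when rotation would begin for a given issue time and TTL. It assumes renewal starts once half of the lifetime has elapsed. Treat that fraction, and the function names, as assumptions for the sketch rather than a statement of SPIRE's exact behaviour.

```shell
#!/bin/sh
# rotate_at ISSUED_EPOCH TTL_SECONDS: print the epoch second at which rotation
# starts, assuming renewal begins at half of the SVID's lifetime.
rotate_at() (
  echo $(( $1 + $2 / 2 ))
)

# needs_rotation NOW ISSUED_EPOCH TTL_SECONDS: succeed if NOW is past that point.
needs_rotation() (
  [ "$1" -ge "$(rotate_at "$2" "$3")" ]
)

issued=1000000
ttl=3600                      # the one-hour default mentioned above
rotate_at "$issued" "$ttl"    # prints 1001800
```

Starting well before expiry leaves time to retry if the server is briefly unreachable, so the workload always holds a valid SVID.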

How Cilium Implements Mutual TLS with SPIFFE

Traditional approaches to service mesh mTLS (like Istio or Linkerd) inject a sidecar proxy into every pod. The proxy intercepts all traffic and handles the TLS handshake. The application has no idea TLS is happening. The sidecar adds memory overhead (roughly 50–100MB per pod for Envoy), an extra network hop on every request, and a complex certificate injection mechanism.

Cilium takes a different path. Rather than injecting a proxy into every pod, it handles authentication at the network layer. The Cilium agent running on each node performs the mutual TLS handshake using SPIFFE SVIDs, and the eBPF datapath in the kernel enforces the result, so no per-pod user-space proxy sits in the request path.

The mechanism works like this. When pod A initiates a connection to pod B, the Cilium agent on pod A's node intercepts the connection. It retrieves pod A's SVID from the SPIRE Workload API. It checks whether there's a CiliumNetworkPolicy requiring mutual authentication for this connection. If there is, it performs a TLS handshake with the Cilium agent on pod B's node, presenting pod A's SVID and requesting pod B's SVID in return.

Both agents verify the SVID against the SPIRE trust bundle. If both SVIDs are valid and the policy allows the connection, it proceeds. If either SVID is invalid or missing, the connection is dropped.
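
The decision described above reduces to a small truth table. The sketch below encodes it as a shell function. It assumes the policy already permits the connection at the network level, and the function name and its yes/no arguments are hypothetical stand-ins for what the Cilium agents determine internally.

```shell
#!/bin/sh
# auth_verdict POLICY_MODE SRC_SVID_OK DST_SVID_OK
# POLICY_MODE is "required" or anything else; the SVID flags are "yes" or "no".
auth_verdict() (
  if [ "$1" != required ]; then
    echo FORWARD               # no mutual-auth requirement on this connection
  elif [ "$2" = yes ] && [ "$3" = yes ]; then
    echo FORWARD               # both peers presented SVIDs that verified
  else
    echo DROP                  # a missing or invalid SVID fails closed
  fi
)

auth_verdict required yes yes   # prints FORWARD
auth_verdict required yes no    # prints DROP
```

The important property is the last branch: when authentication is required and cannot be completed, the default is to drop.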

The application on pod A receives data from the application on pod B. Neither application wrote any TLS code. Neither has a sidecar. The authentication happened entirely in the Cilium agents on their respective nodes.

In Cilium's model, the Cilium agent itself gets a SPIFFE identity from SPIRE. It acts as a delegate identity that can request SVIDs on behalf of workloads.

This is slightly different from the standalone SPIRE model, where each workload requests its own SVID directly. The Cilium operator registers workload entries in SPIRE automatically based on the Cilium security identities it manages, so you don't need to manually create SPIRE entries for every pod.

Demo 1 — Install Cilium with SPIRE Integration

You'll create a kind cluster with Cilium as the CNI and enable its built-in SPIRE integration in a single Helm command.

Step 1: Install the Cilium CLI

# macOS
brew install cilium-cli

# Linux
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --remote-name-all \
  https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
sudo tar -xzf cilium-linux-amd64.tar.gz -C /usr/local/bin

Step 2: Create a kind Cluster Without a Default CNI

kind's default CNI (kindnet) must be disabled so Cilium can take its place. Save this as kind-cilium.yaml:

# kind-cilium.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
networking:
  disableDefaultCNI: true   # Required: let Cilium be the CNI
  kubeProxyMode: none       # Cilium replaces kube-proxy too

kind create cluster --name k8s-mtls --config kind-cilium.yaml

The nodes will be in a NotReady state until Cilium is installed. This is expected because there's no CNI yet.

Step 3: Install Cilium with SPIRE Enabled

Because Step 2 set kubeProxyMode: none, Cilium has to play the kube-proxy role itself. That means its bootstrap pods can't reach the API server via the kubernetes Service ClusterIP, because nothing is routing it yet.

You have to pass the API server's real address up front. Grab the kind control-plane's IP from Docker:

API_SERVER_IP=$(docker inspect k8s-mtls-control-plane \
  --format='{{ .NetworkSettings.Networks.kind.IPAddress }}')
echo "API_SERVER_IP=$API_SERVER_IP"

Then install Cilium with SPIRE:

helm repo add cilium https://helm.cilium.io/
helm repo update

helm upgrade cilium cilium/cilium \
  --install \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=6443 \
  --set authentication.enabled=true \
  --set authentication.mutual.spire.enabled=true \
  --set authentication.mutual.spire.install.enabled=true \
  --set authentication.mutual.spire.install.server.dataStorage.enabled=false

A few of these flags are easy to miss but each is load-bearing:

kubeProxyReplacement=true: Cilium installs its eBPF-based replacement for kube-proxy. Mandatory whenever the kind config sets kubeProxyMode: none.
k8sServiceHost / k8sServicePort: direct API server address used during bootstrap, before Cilium can route the Service ClusterIP. On EKS/GKE/AKS you don't need this because kube-proxy is still present during install.
authentication.enabled=true: required alongside authentication.mutual.spire.enabled=true. If you set only the mutual flag, the chart's validate.yaml rejects the install with the message "SPIRE integration requires .Values.authentication.enabled=true and .Values.authentication.mutual.spire.enabled=true".
dataStorage.enabled=false: switches the SPIRE server from a PVC-backed datastore to in-memory. Fine for a lab cluster, but in production leave this enabled and ensure your cluster has PersistentVolume support.

Notice there's no --wait flag here. On a fresh cluster, --wait will appear to fail with context deadline exceeded because the install is racy by design: the SPIRE server has to schedule on a NotReady node (its tolerations allow this), then the Cilium agents come up using SPIRE, then the nodes flip to Ready. Let the install return immediately and watch the pods come up over the next ~2 minutes:

kubectl get pods -A -w
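
If you are scripting the install rather than watching it, a bounded wait can stand in for the interactive watch. This is a sketch: wait_for_cilium is a hypothetical helper, the DaemonSet name cilium matches the status output shown in Step 4, and the default timeout is an arbitrary choice.

```shell
#!/bin/sh
# wait_for_cilium [TIMEOUT]: block until all nodes are Ready and the cilium
# DaemonSet has rolled out, or fail once TIMEOUT (default 300s) passes.
wait_for_cilium() (
  timeout=${1:-300s}
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout" &&
    kubectl -n kube-system rollout status daemonset/cilium --timeout="$timeout"
)

# Usage, once the helm command has returned:
#   wait_for_cilium 300s
```

Unlike helm's --wait, this runs after the install has returned, so a slow start shows up as a failed wait you can simply repeat.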

Step 4: Verify the Installation

cilium status --wait

    /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       disabled
    \__/       ClusterMesh:        disabled

DaemonSet              cilium             Desired: 3, Ready: 3/3, Available: 3/3
DaemonSet              cilium-envoy       Desired: 3, Ready: 3/3, Available: 3/3
Deployment             cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2

Three Cilium agents, one per node, including the control-plane (the Cilium DaemonSet tolerates its taint). Check the SPIRE components in the cilium-spire namespace:

kubectl get all -n cilium-spire

NAME                    READY   STATUS    RESTARTS   AGE
pod/spire-agent-2cpsr   1/1     Running   0          3m
pod/spire-agent-klhjx   1/1     Running   0          3m
pod/spire-agent-vhsnc   1/1     Running   0          3m
pod/spire-server-0      2/2     Running   0          3m

NAME                              TYPE        CLUSTER-IP    PORT(S)    AGE
service/spire-server              ClusterIP   10.96.x.x     8081/TCP   3m

NAME                          DESIRED   CURRENT   READY   AGE
daemonset.apps/spire-agent    3         3         3       3m

NAME                             READY   AGE
statefulset.apps/spire-server    1/1     3m

One SPIRE agent per node. The SPIRE server is a StatefulSet with two containers: the server itself plus the SPIRE controller manager, which automatically creates workload registration entries for Cilium identities.

Run a health check on the SPIRE server:

kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
  /opt/spire/bin/spire-server healthcheck

Server is healthy.

Verify the SPIRE agents have been attested:

kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
  /opt/spire/bin/spire-server agent list

Found 3 attested agents:

SPIFFE ID         : spiffe://spiffe.cilium/spire/agent/k8s_psat/default/
Attestation type  : k8s_psat
Expiration time   : 2026-05-17 21:08:47 +0000 UTC
Serial number     : 91532884191503307904684123063465502141
Can re-attest     : true

SPIFFE ID         : spiffe://spiffe.cilium/spire/agent/k8s_psat/default/
...

Three agents, one per node, all attested via Kubernetes PSAT. The SPIRE server trusts every node and will issue SVIDs to workloads running on them.
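
For a scripted check, you can filter that listing down to the identities themselves. The helpers below are hypothetical and rely only on the "SPIFFE ID : value" line format shown in the output above.

```shell
#!/bin/sh
# spiffe_ids: read `spire-server agent list` output on stdin and print just
# the value of each "SPIFFE ID" line.
spiffe_ids() {
  awk -F' : ' '/^SPIFFE ID/ { print $2 }'
}

# count_attested: how many agents appear in the listing on stdin.
count_attested() {
  spiffe_ids | grep -c .
}

# Usage against the demo cluster:
#   kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
#     /opt/spire/bin/spire-server agent list | count_attested
```

Comparing that count with the number of nodes is a quick way to spot a node whose agent never attested.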

At this point the identity platform is fully in place, but nothing is using it yet. Demo 1 built the machinery that issues cryptographic identities. Demo 2, which we'll walk through next, puts that machinery to work, turning those SVIDs into an enforced mutual-TLS policy between two real services. Keep the cluster from Demo 1 running, as Demo 2 builds directly on it.

Demo 2 — Enforce Mutual TLS with a CiliumNetworkPolicy

Picking up in the same cluster from Demo 1, you'll deploy two services, enforce mutual authentication between them with a CiliumNetworkPolicy, verify that authenticated traffic flows, and confirm that unauthenticated connections are blocked.

Every request here is authenticated with the SVIDs that the SPIRE server you just verified hands out. These two demos are one continuous walkthrough, not standalone exercises.

Step 1: Deploy a Client and Server

This file contains both the server and the client — the client is a sleeping curl pod we'll use to exec into.

# echo-workloads.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      containers:
        - name: echo-server
          image: ealen/echo-server:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: echo-server
  namespace: default
spec:
  selector:
    app: echo-server
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-client
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-client
  template:
    metadata:
      labels:
        app: echo-client
    spec:
      containers:
        - name: client
          image: curlimages/curl:latest
          command: ["sleep", "infinity"]

kubectl apply -f echo-workloads.yaml
# kubectl rollout status only takes one resource at a time
kubectl rollout status deployment/echo-server -n default
kubectl rollout status deployment/echo-client -n default

Step 2: Confirm Traffic Flows Without Authentication

Before enforcing mTLS, confirm the client can reach the server:

CLIENT=$(kubectl get pod -l app=echo-client -o jsonpath='{.items[0].metadata.name}')
kubectl exec $CLIENT -- curl -s http://echo-server/

You should get a JSON response from the echo server. Traffic flows freely with no authentication.

Step 3: Apply a CiliumNetworkPolicy Requiring Mutual Authentication

Adding authentication.mode: required to a CiliumNetworkPolicy tells Cilium to enforce mutual TLS for matching traffic. Both sides of the connection must present a valid SPIFFE SVID:

# mtls-policy.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: echo-server-mtls
  namespace: default
spec:
  endpointSelector:
    matchLabels:
      app: echo-server
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: echo-client
      authentication:
        mode: required     # Require mutual TLS for this traffic

kubectl apply -f mtls-policy.yaml

Step 4: Verify Authenticated Traffic Still Flows

kubectl exec $CLIENT -- curl -s http://echo-server/

The connection succeeds. Cilium intercepted it, performed the SPIFFE mTLS handshake between the Cilium agents on both pods' nodes, verified both SVIDs, and allowed the traffic through. The application on the client sent a plain HTTP request and received a response — the mutual authentication happened transparently at the network layer.

Step 5: Observe the Authentication with Hubble (Optional)

Hubble is Cilium's observability layer. It needs its own CLI:

# macOS
brew install hubble

# Linux
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all \
  https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz
sudo tar -xzf hubble-linux-amd64.tar.gz -C /usr/local/bin

Enable Hubble in the cluster, then watch flows. cilium hubble enable deploys Hubble Relay and restarts the Cilium agents to switch on the Hubble server inside them, so wait for it to settle before port-forwarding. If you skip the wait, the port-forward connects before Relay is listening, then dies with connection reset by peer / rpc error … EOF:

cilium hubble enable
cilium status --wait          # wait for "Hubble Relay: OK" before continuing

cilium hubble port-forward &

# Watch flows for the echo-server (Ctrl-C to stop)
hubble observe --namespace default --pod echo-server --follow

Trigger another request in a second terminal:

kubectl exec $CLIENT -- curl -s http://echo-server/

In the Hubble output you'll see:


ℹ️  Hubble Relay is available at 127.0.0.1:4245
Jul  7 12:44:42.380: default/echo-client-86d446b8f-9bn5v:47500 (ID:2822) -> default/echo-server-7467b4b54d-5tvkz:80 (ID:30854) policy-verdict:none TRAFFIC_DIRECTION_UNKNOWN ALLOWED (TCP Flags: SYN)
Jul  7 12:44:42.380: default/echo-client-86d446b8f-9bn5v:47500 (ID:2822) -> default/echo-server-7467b4b54d-5tvkz:80 (ID:30854) to-endpoint FORWARDED (TCP Flags: SYN)
Jul  7 12:44:42.381: default/echo-client-86d446b8f-9bn5v:47500 (ID:2822) <- default/echo-server-7467b4b54d-5tvkz:80 (ID:30854) to-endpoint FORWARDED (TCP Flags: SYN, ACK)

The ALLOWED verdict with the policy-verdict reason confirms the CiliumNetworkPolicy matched and authentication was verified. No sidecar involved — this happened in the Cilium agents.

Prefer a graphical view? Enable the Hubble UI. Everything above is the API + terminal path (Relay on port 4245 backs the hubble CLI). Hubble also ships a web dashboard with a live service map — but it's a separate component that cilium hubble enable does not start by default:

# Add the UI (re-runs enable, keeps Relay, adds the hubble-ui deployment)
cilium hubble enable --ui

# Wait for it to be Ready before opening — same race as Relay. Skip this and
# `cilium hubble ui` fails with "connection refused" on port 8081, because the
# UI's frontend container isn't listening yet.
kubectl -n kube-system rollout status deployment/hubble-ui --timeout=90s

# Port-forwards hubble-ui and opens http://localhost:12000 in your browser
cilium hubble ui

Select the default namespace from the dropdown. That's where the demo pods and the policy live. The map is live: it renders edges from flows as they happen, so an idle namespace looks empty. Trigger a request to light it up:

kubectl exec $CLIENT -- curl -s http://echo-server/

You'll see a forwarded edge echo-client → echo-server. Click it (or open the flow table at the bottom) to read the policy-verdict: ALLOWED. Leave the UI open through Step 6. When you run the unauthorized-client test there, its connection shows up as a red dropped edge, the visual counterpart to the curl timeout.

The UI has three parts.

The service map at the top draws each workload identity as a box and each observed connection as an edge colored by verdict. echo-client → echo-server:80 is a solid green (forwarded) edge. The box labelled default is the unauthorized pod: it has no app label, so it carries only the namespace identity and Hubble names it after that. It reaches echo-server over a red dashed (dropped) line. The 🔒 lock on echo-server's → 80 TCP port marks that endpoint as mutually authenticated by the policy.

The flow table underneath logs one row per flow: source identity, destination identity, destination port, L7 info, Verdict, and timestamp. This lets you read both outcomes side by side, with echo-client → echo-server rows marked forwarded and default → echo-server rows marked dropped. This is the same allow/deny split as the CLI, one line per packet.

The top bar holds the namespace selector, a flow filter, the Any verdict / Visual toggle, and a live flows/s rate alongside the count of reporting nodes (3/3).

Step 6: Verify That a Pod Without the Matching Label is Blocked

Deploy a third pod without the echo-client label and try to reach the server:

# unauthorized-client.yaml
apiVersion: v1
kind: Pod
metadata:
  name: unauthorized
  namespace: default
spec:
  containers:
    - name: client
      image: curlimages/curl:latest
      command: ["sleep", "infinity"]

kubectl apply -f unauthorized-client.yaml
kubectl wait --for=condition=Ready pod/unauthorized --timeout=60s
kubectl exec unauthorized -- curl -sS --max-time 5 http://echo-server/

curl: (28) Connection timed out after 5000 milliseconds

The connection times out. The CiliumNetworkPolicy only permits ingress from pods with app: echo-client. A pod without that label never matches the ingress rule, so it is denied before authentication even comes into play. Cilium drops the traffic silently.

There are two gotchas to watch out for here. First, run kubectl wait before exec. If you run exec too soon after apply, you get container not found ("client") because the pod's container hasn't started yet.

Second, use curl -sS, not plain -s. With only -s, curl swallows the error text and you just see command terminated with exit code 28. That's the same result (28 is curl's timeout code), but the -S restores the readable message. The fact that it times out (rather than "connection refused") is the signature of a policy drop: the packets are silently blackholed, not actively rejected. A refusal would return instantly with a different error.
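
If you script this check, the exit code is the reliable signal. The helper below is a hypothetical name for a small lookup of the curl exit codes relevant to this demo. The interpretations in the messages are hints for this setup, not guarantees.

```shell
#!/bin/sh
# explain_curl_exit CODE: map a curl exit status to what it suggests here.
explain_curl_exit() (
  case $1 in
    0)  echo "request succeeded" ;;
    6)  echo "could not resolve host: check the Service name and DNS" ;;
    7)  echo "connection failed or refused: the peer actively rejected it" ;;
    28) echo "timed out: packets were silently dropped, consistent with a policy drop" ;;
    *)  echo "curl exit code $1: see the EXIT CODES section of the curl man page" ;;
  esac
)

# Usage: kubectl exec passes the remote command's exit status through.
#   kubectl exec unauthorized -- curl -sS --max-time 5 http://echo-server/
#   explain_curl_exit $?
```

Distinguishing 28 from 7 in a test script lets you assert that traffic was dropped by policy rather than refused by the application.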

Step 7: Check the Workload Entries in SPIRE

Cilium's SPIRE controller manager automatically created SPIFFE identities for the Cilium security identities in this cluster. You can see them:

kubectl exec -n cilium-spire spire-server-0 -c spire-server -- \
  /opt/spire/bin/spire-server entry show \
  -selector cilium:mutual-auth

Each entry maps a Cilium security identity to a SPIFFE ID. The Cilium operator manages this registry automatically, so you never need to register workloads manually when using Cilium's built-in integration.

Conclusion

IP addresses are location, not identity. And in Kubernetes, location changes with every deployment, so any policy built on address matching silently degrades over time.

Cryptographic workload identity fixes that at the foundation. SPIFFE defines the model (a SPIFFE ID names a workload within a trust domain, an X.509 SVID materialises it into a certificate any TLS library can verify), and SPIRE implements it: the server is the CA and registry, while per-node agents attest via Kubernetes PSAT and issue short-lived, auto-rotating SVIDs.

Cilium wires that identity layer into the network. Add authentication.mode: required to a CiliumNetworkPolicy and the Cilium agents fetch both workloads' SVIDs, run the mutual TLS handshake, and enforce the verdict in the eBPF datapath. There's no sidecar, no application changes, and none of the per-pod proxy overhead of a sidecar-based mesh. And you deployed the whole stack in a single Helm command: the complexity lives in the infrastructure, not in your code.

Cleanup (kind)

# Delete demo workloads
kubectl delete deployment echo-server echo-client -n default
kubectl delete service echo-server -n default
kubectl delete pod unauthorized -n default
kubectl delete ciliumnetworkpolicy echo-server-mtls -n default

# Uninstall Cilium (helm doesn't delete the cilium-spire namespace it created)
helm uninstall cilium -n kube-system
kubectl delete namespace cilium-spire

# Delete the cluster (easiest reset on kind)
kind delete cluster --name k8s-mtls

How to Encrypt Kubernetes Traffic with cert-manager, Let's Encrypt, and Internal TLS

Destiny Erhabor — Wed, 20 May 2026 17:47:34 +0000

Most engineers assume their Kubernetes cluster encrypts all of its traffic. It doesn't. The commands you run with kubectl are encrypted — your client and the API server speak TLS. The API server talking to etcd is usually encrypted too, depending on how the cluster was provisioned.

But traffic between your pods? Plaintext by default. Ingress traffic from the internet to your services? Only encrypted if you explicitly configure TLS. And certificates for internal services? You have to provision those yourself.

This is not a Kubernetes oversight. It's a deliberate design choice — Kubernetes provides the primitives and leaves the implementation to you. The problem is that certificate management is notoriously painful. Certificates expire. Provisioning them manually doesn't scale. Forgetting to rotate them causes outages.

cert-manager solves this. It runs as a controller inside your cluster, watches for Certificate resources, requests certificates from configured issuers, stores them in Kubernetes Secrets, and rotates them automatically before they expire. You declare what you want, cert-manager makes it happen and keeps it that way.

In this article you'll work through how cert-manager's core model works, automate public Ingress TLS using Let's Encrypt, set up an internal Certificate Authority for service-to-service encryption, and understand how certificate rotation works so outages caused by expired certificates become a thing of the past.

Prerequisites

A kind cluster with the nginx Ingress controller installed
Helm 3 installed
A domain name with DNS you control — needed for the Let's Encrypt demo
Basic understanding of TLS: you know what a certificate, a private key, and a CA are

All demo files are in the DevOps-Cloud-Projects GitHub repository.

What Is and Isn't Encrypted in Kubernetes
How cert-manager Works
Demo 1 — Install cert-manager and Issue a Let's Encrypt Certificate
How to Get a Wildcard Certificate with DNS-01
Demo 2 — Set Up an Internal CA for Service-to-Service TLS
How Certificate Rotation Works
Cleanup
Conclusion

What Is and Isn't Encrypted in Kubernetes?

Before installing anything, it's worth being precise about what the cluster already protects and what it leaves open.

Traffic path	Encrypted by default?	Notes
`kubectl` → API server	Yes	TLS with the cluster CA
API server → etcd	Usually	Depends on cluster provisioner — verify with your setup
API server → kubelet	Yes	TLS, but kubelet cert verification depends on configuration
Pod → Pod (same cluster)	No	Plaintext unless you add a service mesh or mTLS
Internet → Ingress	No	Opt-in — requires TLS configuration on the Ingress resource
Pod → Kubernetes API	Yes	Via the service account token and cluster CA

The two gaps that matter most in practice are pod-to-pod traffic and Ingress TLS. This article covers both Ingress TLS with Let's Encrypt and internal service-to-service encryption using a private CA.

How cert-manager Works

cert-manager is a Kubernetes operator. It extends the Kubernetes API with custom resources that represent certificate requests and their configuration. When you create a Certificate resource, cert-manager's controller picks it up, requests a certificate from the configured issuer, and stores the resulting certificate and private key in a Kubernetes Secret. When the certificate approaches its expiry, cert-manager renews it automatically.

This model means your application doesn't know or care about certificate management. It reads a Secret. cert-manager keeps that Secret fresh.

The Four Core Resources

cert-manager introduces four custom resources that you'll use regularly:

Resource	What it represents
`Issuer`	A certificate authority or ACME account — namespace-scoped
`ClusterIssuer`	Same as Issuer, but available cluster-wide
`Certificate`	A request for a certificate — describes what you want
`CertificateRequest`	An individual signing request — created automatically by cert-manager, rarely touched directly

In practice you'll mostly deal with ClusterIssuer and Certificate. The ClusterIssuer defines where certificates come from. The Certificate defines what certificate you want and where to store it.

Issuers and ClusterIssuers

An Issuer can only issue certificates within its own namespace. A ClusterIssuer can issue certificates in any namespace. For shared infrastructure like Let's Encrypt, you almost always want a ClusterIssuer. For application-specific internal CAs, an Issuer scoped to that application's namespace is the safer choice.

cert-manager supports several issuer types. The three you'll encounter most often are:

ACME — for public certificates from Let's Encrypt or any ACME-compatible CA. Ownership of the domain is proven via an HTTP-01 or DNS-01 challenge.

CA — for internal certificates signed by a CA whose private key is stored in a Kubernetes Secret. Used for service-to-service TLS within the cluster.

Self-signed — generates self-signed certificates. Rarely useful on its own, but essential as the bootstrap step when creating an internal CA.

The Certificate Lifecycle

When you create a Certificate resource, cert-manager follows this sequence:

Creates a CertificateRequest with a CSR (Certificate Signing Request)
Passes the CSR to the configured issuer
For ACME issuers: creates a Challenge resource and fulfils it (more on this below)
Receives the signed certificate from the issuer
Stores the certificate and private key in the Kubernetes Secret named in spec.secretName
Monitors the certificate's expiry — by default, renews when 2/3 of the validity period has elapsed

Your application mounts the Secret. cert-manager updates it silently. Most applications that watch for file changes will pick up the new certificate without a restart.
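To make "watch for file changes" concrete, here is a minimal Python sketch of how an application might notice a rotated certificate. It is an illustration under assumptions, not how any particular server implements reloading: the CertWatcher class and the demo function are hypothetical names, and a temporary file stands in for the mounted tls.crt.

```python
import hashlib
import tempfile
from pathlib import Path


class CertWatcher:
    """Detects when a mounted certificate file's contents change."""

    def __init__(self, cert_path: str) -> None:
        self.cert_path = Path(cert_path)
        self._last_digest = self._digest()

    def _digest(self) -> str:
        # Hash the contents instead of relying on the path: Secret volumes
        # are updated by swapping a symlink, so the path itself never changes.
        return hashlib.sha256(self.cert_path.read_bytes()).hexdigest()

    def changed(self) -> bool:
        """Return True once per content change, False otherwise."""
        current = self._digest()
        if current == self._last_digest:
            return False
        self._last_digest = current
        return True


def demo() -> list[bool]:
    """Simulate one rotation, using a temporary file in place of tls.crt."""
    with tempfile.TemporaryDirectory() as tmp:
        cert = Path(tmp) / "tls.crt"
        cert.write_text("certificate v1")
        watcher = CertWatcher(str(cert))
        before = watcher.changed()         # nothing has rotated yet
        cert.write_text("certificate v2")  # stands in for a renewal
        after = watcher.changed()          # the rotation is detected
        again = watcher.changed()          # no further change
        return [before, after, again]


if __name__ == "__main__":
    print(demo())  # [False, True, False]
```

In a real service you would call changed() on a timer (or use a file-notification library) and rebuild the TLS configuration whenever it returns True.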

ACME Challenges: HTTP-01 vs DNS-01

Let's Encrypt needs proof that you control the domain before it issues a certificate. ACME defines two challenge types for this.

HTTP-01 works by having cert-manager create a temporary HTTP endpoint at http://YOUR_DOMAIN/.well-known/acme-challenge/TOKEN. Let's Encrypt sends a request to that URL. If the response matches the expected token, the challenge passes. This requires your cluster to be reachable from the internet on port 80.

DNS-01 works by having cert-manager create a temporary DNS TXT record at _acme-challenge.YOUR_DOMAIN. Let's Encrypt checks for that record. This doesn't require inbound HTTP access, which makes it the right choice for private clusters, and it's the only way to get wildcard certificates (*.example.com).

The trade-off: HTTP-01 is simpler to set up but cannot issue wildcard certificates and requires internet-accessible infrastructure. DNS-01 requires API access to your DNS provider but works for internal clusters and wildcards.
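The two challenge types boil down to a well-known URL and a well-known TXT record name, both defined by the ACME specification (RFC 8555). The Python sketch below only computes those values for illustration; the function names are hypothetical, and cert-manager performs all of this for you.

```python
import base64
import hashlib


def http01_url(domain: str, token: str) -> str:
    """URL the ACME server fetches during an HTTP-01 challenge."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def dns01_record_name(domain: str) -> str:
    """TXT record name for a DNS-01 challenge (the wildcard prefix is dropped)."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base}"


def dns01_txt_value(key_authorization: str) -> str:
    """TXT record value: unpadded base64url of SHA-256(key authorization)."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def usable_challenges(domain: str) -> list[str]:
    """Wildcard names can only be validated with DNS-01."""
    return ["dns-01"] if domain.startswith("*.") else ["http-01", "dns-01"]


if __name__ == "__main__":
    print(http01_url("example.com", "TOKEN"))
    print(dns01_record_name("*.example.com"))
    print(usable_challenges("*.example.com"))
```

Note that the wildcard name *.example.com and the apex example.com share the same record name, _acme-challenge.example.com, which is why one DNS zone credential is enough for both.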

Demo 1 — Install cert-manager and Issue a Certificate Using Pebble and Let's Encrypt

Pebble is Let's Encrypt's local ACME test server. It runs inside your cluster, issues certificates using the same ACME protocol as Let's Encrypt, and requires no public domain or internet access. Using Pebble lets you test the full cert-manager flow — challenge, issuance, renewal — on a plain kind cluster.

Once you understand the flow locally, switching to real Let's Encrypt is a one-line change: replace the ClusterIssuer server URL and point a DNS record at a publicly reachable cluster. The rest of the configuration is identical.

You'll install cert-manager, create a ClusterIssuer (first for Pebble, then for Let's Encrypt), deploy a sample application with an Ingress, and watch a certificate be issued and stored automatically.

Step 1: Install cert-manager

cert-manager is now distributed via OCI Helm charts from quay.io/jetstack. The --set crds.enabled=true flag installs the Custom Resource Definitions as part of the chart:

helm upgrade cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --install \
  --create-namespace \
  --namespace cert-manager \
  --set crds.enabled=true \
  --version v1.17.0 \
  --wait

You also need the nginx Ingress controller — cert-manager routes HTTP-01 challenges through it. The controller.service.type=ClusterIP override is for kind specifically: the default LoadBalancer Service never gets an EXTERNAL-IP on kind (there's no cloud LB), which makes --wait hang forever. On a real cluster, drop the override and keep LoadBalancer.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=ClusterIP \
  --wait

Confirm all four components are running:

kubectl get pods -n cert-manager
kubectl get pods -n ingress-nginx

NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-76f84784c8-r4fx4              1/1     Running   0          6m45s
cert-manager-cainjector-66fbf49587-gv25n   1/1     Running   0          6m45s
cert-manager-webhook-577fddf86-l5wj4       1/1     Running   0          6m45s

NAME                                        READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-6c7cd85885-h7zgx   1/1     Running   0          3m34s

kind-specific gotcha: remove the nginx admission webhook now. On kind, the nginx admission webhook serves with a self-signed certificate that the Kubernetes API server cannot verify. The first time you try to create any Ingress resource you'll see failed calling webhook "validate.nginx.ingress.kubernetes.io": ... x509: certificate signed by unknown authority. Delete the webhook up front so the rest of the demo doesn't trip over it:

kubectl delete validatingwebhookconfiguration ingress-nginx-admission

Step 2: Install Pebble

Pebble is the local ACME test server introduced above; the Helm chart used here is published by the JupyterHub project. It ships with a companion CoreDNS deployment (pebble-coredns) that Pebble uses to resolve names during ACME validation.

helm install pebble pebble \
  --repo https://jupyterhub.github.io/helm-chart/ \
  --namespace pebble \
  --create-namespace \
  --wait

Confirm both pods are running:

kubectl get pods -n pebble

NAME                              READY   STATUS    RESTARTS   AGE
pebble-8d8d49d64-lz8ck            1/1     Running   0          36s
pebble-coredns-7fb5c7cbf4-4jw9h   1/1     Running   0          36s

Step 3: Wire up DNS for the fake hostname

We're going to issue a cert for echo.pebble.local. That hostname is fake — it doesn't exist in any real DNS — so we have to teach two independent resolvers about it before issuance will work:

Resolver	Used by	What we need it to do
`pebble-coredns` (in the `pebble` namespace)	Pebble itself, when it makes the HTTP-01 validation request	Resolve `echo.pebble.local` → ingress-nginx ClusterIP
Cluster CoreDNS (`kube-system`)	cert-manager's HTTP-01 self-check before reporting the challenge ready	Forward `pebble.local` lookups to `pebble-coredns`

If you skip either layer, the Order will end up in the invalid state with a DNS lookup failure.

First grab the two IPs you'll need:

NGINX_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.spec.clusterIP}')
PEBBLE_DNS_IP=$(kubectl get svc pebble-coredns -n pebble \
  -o jsonpath='{.spec.clusterIP}')
echo "NGINX_IP=${NGINX_IP}  PEBBLE_DNS_IP=${PEBBLE_DNS_IP}"

Patch pebble-coredns to answer for *.pebble.local with the ingress controller's IP. The CoreDNS template plugin parses unreliably when the whole block is collapsed onto one line, so apply a real multi-line ConfigMap:

cat <


Verify it answers correctly:
kubectl run dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup echo.pebble.local ${PEBBLE_DNS_IP}

You should see Address: followed by the ingress-nginx ClusterIP (the value of NGINX_IP) in the response. If you get SERVFAIL, check kubectl logs -n pebble deploy/pebble-coredns. A parser error like not a TTL: "}" means the template block collapsed onto one line again.

Patch the cluster CoreDNS so cert-manager's self-check can resolve the same name. Add a stub zone that forwards pebble.local to pebble-coredns:
cat <

Verify the cluster resolver now answers for echo.pebble.local (without specifying a server — it'll use the default kube-dns):
kubectl run dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup echo.pebble.local

Both Server: 10.96.0.10 and Address: followed by the ingress-nginx ClusterIP should appear.

Step 4: Fetch the Pebble CA and create the ClusterIssuer

Pebble signs its certificates with a self-signed root that lives in the pebble ConfigMap under root-cert.pem. cert-manager needs to trust this CA to talk to Pebble's ACME directory, so we pass it as a base64-encoded caBundle in the ClusterIssuer:
kubectl get configmap pebble -n pebble \
  -o jsonpath='{.data.root-cert\.pem}' > pebble-ca.crt

head -1 pebble-ca.crt   # should print -----BEGIN CERTIFICATE-----

CA_BUNDLE=$(base64 -i pebble-ca.crt | tr -d '\n')
echo "CA_BUNDLE length: ${#CA_BUNDLE}"   # ~1600 chars, one continuous line
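The base64 command behaves slightly differently across platforms, so it can help to check the encoded value independently. The Python sketch below is one way to do that, under assumptions: the helper names are hypothetical and SAMPLE_PEM is a placeholder string, not a real certificate.

```python
import base64

PEM_HEADER = "-----BEGIN CERTIFICATE-----"

# Placeholder text standing in for the contents of pebble-ca.crt.
SAMPLE_PEM = (
    "-----BEGIN CERTIFICATE-----\n"
    "PLACEHOLDERBODYNOTAREALCERTIFICATE\n"
    "-----END CERTIFICATE-----\n"
)


def encode_ca_bundle(pem_text: str) -> str:
    """Base64-encode PEM text into a single line suitable for caBundle."""
    if not pem_text.lstrip().startswith(PEM_HEADER):
        raise ValueError("input does not look like a PEM certificate")
    # b64encode never inserts line breaks, so no newline stripping is needed.
    return base64.b64encode(pem_text.encode("ascii")).decode("ascii")


def is_valid_ca_bundle(value: str) -> bool:
    """Check that a caBundle value is one unbroken base64 line wrapping PEM."""
    if "\n" in value or " " in value:
        return False
    try:
        decoded = base64.b64decode(value, validate=True).decode("ascii")
    except ValueError:  # covers malformed base64 and non-ASCII payloads
        return False
    return decoded.lstrip().startswith(PEM_HEADER)


if __name__ == "__main__":
    bundle = encode_ca_bundle(SAMPLE_PEM)
    print(len(bundle), is_valid_ca_bundle(bundle))
```

To check the real file, you could pass the contents of pebble-ca.crt to encode_ca_bundle and compare the result with the CA_BUNDLE shell variable.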

Create the ClusterIssuer using the heredoc — the ${CA_BUNDLE} shell variable gets substituted into the YAML before kubectl reads it:
kubectl apply -f - <

Check the issuer is ready:
kubectl get clusterissuer pebble

NAME     READY   AGE
pebble   True    5s

If READY stays False, the two most common causes are a malformed caBundle (verify it's a single unbroken base64 line with no newlines) or Pebble being unreachable from the cert-manager namespace. To check reachability:
kubectl run test-curl --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --namespace cert-manager -- \
  curl -k https://pebble.pebble.svc.cluster.local/dir

If that returns JSON, Pebble is reachable.
Step 5: Deploy a sample application
# echo-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: ealen/echo-server:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: default
spec:
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 80

kubectl apply -f echo-app.yaml

Verify the resources came up:
kubectl get deploy,pod,svc -n default

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/echo   1/1     1            1           32s

NAME                        READY   STATUS    RESTARTS   AGE
pod/echo-5665fbcfdd-mbgxj   1/1     Running   0          36s

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/echo         ClusterIP   10.96.103.114   <none>        80/TCP    40s
service/kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP   32m

Step 6: Create an Ingress with TLS
The cert-manager.io/cluster-issuer: pebble annotation tells cert-manager to automatically create a Certificate resource for this Ingress, using the issuer we just created. The hostname echo.pebble.local doesn't need to resolve externally — we taught both DNS resolvers about it in Step 3.
# echo-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echo
  namespace: default
  annotations:
    cert-manager.io/cluster-issuer: pebble
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - echo.pebble.local
      secretName: echo-tls     # cert-manager will create this Secret
  rules:
    - host: echo.pebble.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echo
                port:
                  number: 80

kubectl apply -f echo-ingress.yaml

Step 7: Watch the certificate being issued
# Watch the Certificate resource (Ctrl-C once Ready=True)
kubectl get certificate echo-tls -n default -w

NAME       READY   SECRET     AGE
echo-tls   False   echo-tls   5s
echo-tls   True    echo-tls   28s

When READY becomes True, the certificate has been issued and stored in the echo-tls Secret. The full chain — CertificateRequest → Order → Challenge → solver pod → Secret — happens in well under a minute on a healthy cluster:
kubectl get certificate,certificaterequest,order,challenge -n default

NAME                                   READY   SECRET     AGE
certificate.cert-manager.io/echo-tls   True    echo-tls   81s

NAME                                            APPROVED   DENIED   READY   ISSUER   AGE
certificaterequest.cert-manager.io/echo-tls-1   True                True    pebble   81s

NAME                                               STATE   AGE
order.acme.cert-manager.io/echo-tls-1-1824732543   valid   81s

(Challenges are deleted automatically once an Order completes, so kubectl get challenge -n default typically shows nothing at this point — that's success, not failure.)
If READY stays False for more than a minute, see the troubleshooting tips at the end of this section.
Inspect the issued certificate to confirm Pebble signed it:
kubectl get secret echo-tls -n default -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -issuer -subject -dates

issuer=CN=Pebble Intermediate CA 05478c
subject=
notBefore=May 17 19:09:22 2026 GMT
notAfter=Aug 15 19:09:21 2026 GMT

Issuer is Pebble's intermediate CA — proof the full ACME flow worked end-to-end. The cert is valid for 90 days, and cert-manager will renew it automatically at day 60.
Hit the ingress over HTTPS from inside the cluster to confirm everything is wired together:
kubectl run curltest --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk https://echo.pebble.local/

The echo server should return a JSON blob — note the "x-forwarded-proto":"https" field, which proves the request came through nginx over TLS.
Troubleshooting if the cert never goes Ready:

kubectl describe order -n default — look for "DNS problem" or "Connection refused" in the events.

kubectl logs -n pebble deploy/pebble --tail=50 — Pebble logs the exact URL it tried to fetch during validation and any errors.

If the Order is stuck pending with no events: cert-manager hasn't reconciled yet. Wait 30s.

If the Order is invalid: one of the two DNS layers (Step 3) is misconfigured. Re-run both nslookup checks.

If the Ingress apply itself failed with an x509 webhook error: you skipped the kubectl delete validatingwebhookconfiguration ingress-nginx-admission step in Step 1.


Step 8: Switch to Let's Encrypt staging (real public domain)
Pebble proved the flow works locally. Now move to a publicly-reachable domain pointed at a publicly-reachable cluster. The DNS gymnastics from Step 3 go away — the domain is real, so both resolvers find it without intervention.
Use Let's Encrypt staging first. It speaks the same ACME protocol as production but with generous rate limits, so failed attempts during testing won't lock you out:
# clusterissuer-staging.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx

kubectl apply -f clusterissuer-staging.yaml

# Point the Ingress at the staging issuer, then force re-issuance.
# Also change the host and tls.hosts entries in echo-ingress.yaml to your real domain.
kubectl annotate ingress echo \
  cert-manager.io/cluster-issuer=letsencrypt-staging --overwrite -n default
kubectl delete secret echo-tls -n default

The new cert's issuer will look something like (STAGING) Let's Encrypt.
Step 9: Switch to Let's Encrypt production
Once staging works, repeat with the production ClusterIssuer. The only difference is the server URL:
# clusterissuer-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx

kubectl apply -f clusterissuer-prod.yaml
kubectl annotate ingress echo \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite -n default
kubectl delete secret echo-tls -n default

cert-manager detects the missing Secret and immediately requests a browser-trusted certificate from production Let's Encrypt.
How to Get a Wildcard Certificate with DNS-01
HTTP-01 challenges work well for single domains with public ingress. But there are two situations where you need DNS-01 instead: when your cluster is not publicly accessible (internal clusters, air-gapped environments, staging namespaces behind a VPN), and when you want a wildcard certificate that covers all subdomains of your domain.
DNS-01 requires cert-manager to be able to create and delete TXT records in your DNS provider. cert-manager has built-in support for Route53, Cloud DNS, Cloudflare, Azure DNS, and many others.
Here is a ClusterIssuer for DNS-01 using AWS Route53:
# clusterissuer-dns01.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            # Use IRSA (IAM Roles for Service Accounts) in production
            # rather than static credentials
            hostedZoneID: YOUR_HOSTED_ZONE_ID

A wildcard Certificate using that issuer:
# wildcard-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns01
    kind: ClusterIssuer
  commonName: "*.example.com"
  dnsNames:
    - "*.example.com"
    - "example.com"        # Also cover the apex domain
  duration: 2160h           # 90 days
  renewBefore: 720h         # Renew 30 days before expiry

The resulting Secret wildcard-example-com-tls can be referenced by any Ingress in the default namespace. Every first-level subdomain (api.example.com, dashboard.example.com, staging.example.com) is covered by a single certificate that rotates automatically; deeper names such as a.b.example.com are not matched by *.example.com.
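It is worth being precise about what a wildcard name covers, because the wildcard stands in for exactly one DNS label. The Python sketch below illustrates that matching rule; wildcard_covers and covered_by are hypothetical helper names, and this is a simplified model of hostname matching rather than a full TLS implementation.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if a certificate dnsName matches the given hostname."""
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    # A wildcard never changes the number of labels.
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard replaces exactly one non-empty leftmost label.
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def covered_by(dns_names: list[str], hostname: str) -> bool:
    """Return True if any dnsName on the certificate matches the hostname."""
    return any(wildcard_covers(name, hostname) for name in dns_names)


if __name__ == "__main__":
    names = ["*.example.com", "example.com"]
    for host in ["api.example.com", "example.com", "a.b.example.com"]:
        print(host, covered_by(names, host))
```

This is why the Certificate above lists example.com alongside *.example.com: the wildcard alone does not match the apex domain.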
For Cloudflare instead of Route53, the solver section looks like this:
    solvers:
      - dns01:
          cloudflare:
            email: your-email@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token

Demo 2 — Set Up an Internal CA for Service-to-Service TLS
Let's Encrypt certificates are great for public-facing services. But for internal services — a gRPC microservice calling another, a web application talking to its database — you don't need public trust. You need a CA that the cluster trusts, and you need it to issue certificates for service names that don't exist as public DNS records.
cert-manager's CA issuer handles this. You create a root CA, tell cert-manager about it, and then issue certificates for internal services using that CA. Every service that trusts the root CA trusts every certificate it issues.
Step 1: Create a self-signed ClusterIssuer
A self-signed issuer produces certificates that are signed with their own private key, so each one acts as its own CA. You use this as a bootstrap step to create the root CA certificate:
# selfsigned-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned
spec:
  selfSigned: {}

kubectl apply -f selfsigned-issuer.yaml

Step 2: Create the root CA certificate
Use the self-signed issuer to create a CA certificate. The isCA: true field tells cert-manager this certificate can sign other certificates:
# internal-ca.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager    # Store in cert-manager namespace
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-secret
  duration: 87600h           # 10 years — this is a root CA
  renewBefore: 720h
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer

kubectl apply -f internal-ca.yaml
kubectl get certificate internal-ca -n cert-manager

NAME          READY   SECRET               AGE
internal-ca   True    internal-ca-secret   8s

Step 3: Create a CA ClusterIssuer backed by the root CA
Now create a ClusterIssuer that uses the root CA Secret you just created. This is the issuer that will sign certificates for your internal services:
# internal-ca-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-secret   # References the Secret in cert-manager namespace

kubectl apply -f internal-ca-issuer.yaml
kubectl get clusterissuer internal-ca

NAME          READY   AGE
internal-ca   True    5s

Step 4: Issue a certificate for an internal service
Now issue a certificate for an internal gRPC service. The dnsNames use Kubernetes internal DNS names of the form SERVICE.NAMESPACE.svc.cluster.local:
# payments-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-tls
  namespace: production
spec:
  secretName: payments-tls-secret
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
  commonName: payments.production.svc.cluster.local
  dnsNames:
    - payments.production.svc.cluster.local
    - payments.production.svc
    - payments
  duration: 2160h     # 90 days
  renewBefore: 360h   # Renew 15 days before expiry

kubectl create namespace production
kubectl apply -f payments-cert.yaml
kubectl get certificate payments-tls -n production

NAME           READY   SECRET                AGE
payments-tls   True    payments-tls-secret   6s
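The three dnsNames in payments-cert.yaml follow a fixed pattern, so they can be generated rather than typed by hand. A small Python sketch, assuming the default cluster domain cluster.local (service_dns_names is a hypothetical helper, not part of cert-manager):

```python
def service_dns_names(
    service: str, namespace: str, cluster_domain: str = "cluster.local"
) -> list[str]:
    """Return the in-cluster DNS names a client may use to reach a Service."""
    return [
        f"{service}.{namespace}.svc.{cluster_domain}",  # fully qualified name
        f"{service}.{namespace}.svc",                   # without cluster domain
        service,                                        # same-namespace short name
    ]


if __name__ == "__main__":
    for name in service_dns_names("payments", "production"):
        print(name)
```

Listing every form matters because TLS clients verify the exact name they dialled: a client that connects to payments will reject a certificate that only lists the fully qualified name.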

The Secret payments-tls-secret now contains tls.crt, tls.key, and ca.crt. Mount this into your application pod:
# In your Deployment spec
volumes:
  - name: tls
    secret:
      secretName: payments-tls-secret
containers:
  - name: payments
    volumeMounts:
      - name: tls
        mountPath: /etc/tls
        readOnly: true

Your application reads /etc/tls/tls.crt and /etc/tls/tls.key to configure TLS. Clients that connect to it need the CA certificate (the same ca.crt) in their own trust store to verify it.
Step 5: Distribute the CA bundle with trust-manager
The problem with a custom CA is that every service needs to know about it. cert-manager's companion tool, trust-manager, handles this by distributing the CA bundle as a ConfigMap to every namespace:
helm upgrade trust-manager oci://quay.io/jetstack/charts/trust-manager \
  --install \
  --namespace cert-manager \
  --wait

Create a Bundle resource that takes the CA certificate from the internal-ca-secret and distributes it to every namespace matched by the namespaceSelector (here, only production):
# ca-bundle.yaml
apiVersion: trust.cert-manager.io/v1alpha1
kind: Bundle
metadata:
  name: internal-ca-bundle
spec:
  sources:
    - secret:
        name: internal-ca-secret
        key: ca.crt
  target:
    configMap:
      key: ca-bundle.crt
    namespaceSelector:
      matchLabels:
        # Selects only the production namespace; Kubernetes sets this label automatically
        kubernetes.io/metadata.name: production

kubectl apply -f ca-bundle.yaml

After a few seconds, every matching namespace has a ConfigMap named internal-ca-bundle containing the CA certificate. Applications mount this ConfigMap to trust internally-issued certificates without any per-service configuration.
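Which namespaces count as "matching" depends entirely on the matchLabels selector. The Python sketch below models the standard matchLabels rule (every listed key must be present with the same value) against a hypothetical set of namespaces; it illustrates the selection logic only and does not talk to a cluster.

```python
def matches(match_labels: dict[str, str], namespace_labels: dict[str, str]) -> bool:
    """matchLabels semantics: every selector key must match exactly."""
    return all(namespace_labels.get(k) == v for k, v in match_labels.items())


def target_namespaces(
    match_labels: dict[str, str], namespaces: dict[str, dict[str, str]]
) -> list[str]:
    """Return the sorted names of namespaces selected by match_labels."""
    return sorted(
        name for name, labels in namespaces.items() if matches(match_labels, labels)
    )


# Hypothetical cluster state. Kubernetes adds the
# kubernetes.io/metadata.name label to every namespace automatically.
NAMESPACES = {
    "default": {"kubernetes.io/metadata.name": "default"},
    "production": {"kubernetes.io/metadata.name": "production"},
    "staging": {"kubernetes.io/metadata.name": "staging"},
}

if __name__ == "__main__":
    selector = {"kubernetes.io/metadata.name": "production"}
    print(target_namespaces(selector, NAMESPACES))
    print(target_namespaces({}, NAMESPACES))
```

With the selector from ca-bundle.yaml only production is selected. In this model an empty selector matches every namespace, so a shared custom label is the usual way to widen distribution gradually.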
Step 6: Verify the certificate chain
# Extract the CA cert and service cert
kubectl get secret payments-tls-secret -n production \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt

kubectl get secret payments-tls-secret -n production \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > payments.crt

# Verify the cert was signed by the CA
openssl verify -CAfile ca.crt payments.crt

payments.crt: OK

How Certificate Rotation Works
Certificate rotation is the part of certificate management that breaks production clusters most often. cert-manager handles it automatically, but understanding the mechanism helps you tune it and debug it when things go wrong.
cert-manager watches every Certificate resource it manages and checks the expiry of the underlying certificate in the Secret. When the remaining validity drops below the renewBefore threshold, cert-manager triggers a renewal. The default renewBefore is 1/3 of the certificate's total validity period — so a 90-day certificate starts renewing at day 60.
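The renewal arithmetic is simple enough to check by hand. The Python sketch below reproduces it under the rule just described (renew at notAfter minus renewBefore, with renewBefore defaulting to one third of the lifetime). The function name and the dates are illustrative assumptions, and the error for an oversized renewBefore is this sketch's own guard, not cert-manager's behaviour.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(
    not_before: datetime,
    not_after: datetime,
    renew_before: Optional[timedelta] = None,
) -> datetime:
    """Return the moment renewal should start for a certificate."""
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3  # default: renew at 2/3 of the lifetime
    if renew_before >= lifetime:
        raise ValueError("renewBefore must be shorter than the certificate lifetime")
    return not_after - renew_before


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=90)

    # Default threshold: day 60 of a 90-day certificate.
    print(renewal_time(issued, expires))

    # renewBefore: 360h (15 days), as in payments-cert.yaml: day 75.
    print(renewal_time(issued, expires, timedelta(hours=360)))
```

For a 90-day certificate the default and an explicit renewBefore of 720h give the same answer, which is why the wildcard example earlier sets 720h without changing the schedule.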
The renewal creates a new CertificateRequest, goes through the full issuance flow, and updates the Secret in place. The new certificate replaces the old one atomically. Applications that use file mounts and watch for changes (most modern web servers and gRPC frameworks do) will pick up the new certificate without restarting.
# See the current rotation status
kubectl describe certificate echo-tls -n default

Look for these fields in the output:
Status:
  Not After:   2024-06-18T10:00:00Z
  Not Before:  2024-03-20T10:00:00Z
  Renewal Time: 2024-05-19T10:00:00Z   # When cert-manager will start renewing
  Conditions:
    Type:    Ready
    Status:  True
    Message: Certificate is up to date and has not expired

If a renewal fails — for example, because the HTTP-01 challenge can't be completed — cert-manager retries with exponential backoff. The existing certificate continues to serve until it actually expires, giving you a window to debug the issue.
To see renewal events in real time:
kubectl get events -n default --field-selector reason=Issued
kubectl get events -n default --field-selector reason=Failed

Setting renewBefore correctly: For public-facing services, 30 days before a 90-day certificate is a sensible buffer. For internal short-lived certificates (24-hour validity), set renewBefore to 8 hours so rotation happens well before expiry even if the first attempt fails. Keep renewBefore below half the certificate's validity: anything larger replaces each certificate before it has served half its lifetime, and a value close to the full validity makes cert-manager start renewing almost as soon as it issues.
Cleanup
# Remove demo resources
kubectl delete ingress echo -n default
kubectl delete service echo -n default
kubectl delete deployment echo -n default
kubectl delete secret echo-tls -n default
kubectl delete certificate payments-tls -n production
kubectl delete namespace production

# Remove ClusterIssuers before uninstalling cert-manager
kubectl delete clusterissuer pebble letsencrypt-staging letsencrypt-prod \
  letsencrypt-dns01 internal-ca selfsigned --ignore-not-found

# Uninstall cert-manager and trust-manager
helm uninstall trust-manager -n cert-manager
helm uninstall cert-manager -n cert-manager
kubectl delete namespace cert-manager

# Uninstall Pebble and the nginx Ingress controller
helm uninstall pebble -n pebble
helm uninstall ingress-nginx -n ingress-nginx
kubectl delete namespace pebble ingress-nginx

Conclusion
Kubernetes leaves TLS configuration entirely to you. In this article you worked through both the public and internal sides of that responsibility.
On the public side, you installed cert-manager using the current OCI Helm chart, created a ClusterIssuer backed by Let's Encrypt, and watched cert-manager go through the full ACME HTTP-01 challenge flow — from creating a temporary solver pod to storing a valid certificate in a Kubernetes Secret. You saw how switching from staging to production is a one-line annotation change, and how cert-manager renews certificates automatically before they expire.
On the internal side, you bootstrapped a private CA using cert-manager's self-signed issuer, created a ClusterIssuer backed by that CA, and issued certificates for internal service names that only exist inside the cluster. You used trust-manager to distribute the CA bundle cluster-wide so services can trust each other's certificates without per-service configuration. And you saw how to verify the certificate chain with openssl so you can confirm it's working before deploying to production.
Understanding certificate rotation is what separates teams that manage TLS confidently from teams that get woken up at 3am by an expired certificate. cert-manager automates the renewal, but the renewBefore field is your safety margin — set it correctly and know how to read the renewal status.
All YAML manifests and Helm values from this article are available in the DevOps-Cloud-Projects GitHub repository.



How to Authenticate Users in Kubernetes: x509 Certificates, OIDC, and Cloud Identity
Destiny Erhabor — Mon, 06 Apr 2026 20:31:43 +0000
Kubernetes doesn't know who you are.
It has no user database, no built-in login system, no password file. When you run kubectl get pods, Kubernetes receives an HTTP request and asks one question: who signed this, and do I trust that signature? Everything else — what you're allowed to do, which namespaces you can access, whether your request goes through at all — comes after that question is answered.
This surprises most engineers who are new to Kubernetes. They expect something like a database of users with passwords. Instead, they find a pluggable chain of authenticators, each one able to vouch for a request in a different way:

Client certificates
OIDC tokens from an external identity provider
Cloud provider IAM tokens
Service account tokens projected into pods

Any of these can be active at the same time.
Understanding this model is what separates engineers who can debug authentication failures from engineers who copy kubeconfig files and hope for the best.
In this article, you'll work through how the Kubernetes authentication chain works from first principles. You'll see how x509 client certificates are used — and why they're a poor choice for human users in production. You'll configure OIDC authentication with Dex, giving your cluster a real browser-based login flow. And you'll see how AWS, GCP, and Azure each plug into the same underlying model.
Prerequisites

A running kind cluster (a fresh one works fine, or reuse an existing one)
kubectl and helm installed
openssl available on your machine (comes pre-installed on macOS and most Linux distros)
Basic familiarity with what a JWT is (a signed JSON object with claims); you don't need to be able to write one, just recognise one

All demo files are in the companion GitHub repository.
Table of Contents

How Kubernetes Authentication Works
  The Authenticator Chain
  Users vs Service Accounts
  What Happens After Authentication
How to Use x509 Client Certificates
  How the Certificate Maps to an Identity
  The Cluster CA
  The Limits of Certificate-Based Auth
Demo 1 — Create and Use an x509 Client Certificate
How to Set Up OIDC Authentication
  How the OIDC Flow Works in Kubernetes
  The API Server Configuration
  JWT Claims Kubernetes Uses
  How kubelogin Works
Demo 2 — Configure OIDC Login with Dex and kubelogin
Cloud Provider Authentication
  AWS EKS
  Google GKE
  Azure AKS
Webhook Token Authentication
Cleanup
Conclusion

How Kubernetes Authentication Works
Every request that reaches the Kubernetes API server — whether from kubectl, a pod, a controller, or a CI pipeline — carries a credential of some kind.
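The API server passes that credential through a chain of authenticators in sequence. The first authenticator that can verify the credential wins. If a credential is presented but none of them can verify it, the request is rejected with 401 Unauthorized. If no credential is presented at all, the request is treated as anonymous (provided anonymous access is enabled, which is the default).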
The Authenticator Chain
Kubernetes supports several authentication strategies simultaneously. You can have client certificate authentication and OIDC authentication active on the same cluster at the same time, which is common in production: cluster administrators use certificates, regular developers use OIDC. The strategies active on a cluster are determined by flags passed to the kube-apiserver process.
The strategies available are x509 client certificates, bearer tokens (static token files — rarely used in production), bootstrap tokens (used during node join operations), service account tokens, OIDC tokens, authenticating proxies, and webhook token authentication. A cluster doesn't have to use all of them, and most don't. But knowing they all exist helps when you're diagnosing an auth failure.
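To make the "first authenticator that can verify the credential wins" idea concrete, here is a minimal Python sketch of a chain. It is a toy model, not how kube-apiserver is implemented: the request dictionary, the two authenticator functions, and the hardcoded token table are all hypothetical. It also assumes that a request presenting a credential nobody can verify is rejected rather than downgraded to anonymous.

```python
from typing import Callable, Optional

# Hypothetical identity record: (username, groups)
Identity = tuple[str, list[str]]
Authenticator = Callable[[dict], Optional[Identity]]


def cert_authenticator(request: dict) -> Optional[Identity]:
    # Pretend the TLS layer has already validated the client certificate.
    cert = request.get("client_cert")
    if cert is None:
        return None
    return cert["cn"], cert.get("o", [])


def token_authenticator(request: dict) -> Optional[Identity]:
    # Toy lookup table standing in for real token verification.
    known = {
        "abc123": ("system:serviceaccount:default:app", ["system:serviceaccounts"]),
    }
    return known.get(request.get("bearer_token", ""))


def authenticate(request: dict, chain: list[Authenticator]) -> Identity:
    for authenticator in chain:
        identity = authenticator(request)
        if identity is not None:
            return identity  # the first authenticator to vouch for the request wins
    if request.get("client_cert") or request.get("bearer_token"):
        # A credential was presented, but nothing in the chain could verify it.
        raise PermissionError("401 Unauthorized")
    # No credential at all: fall back to the anonymous identity.
    return "system:anonymous", ["system:unauthenticated"]
```

Calling `authenticate({"client_cert": {"cn": "jane", "o": ["engineering"]}}, [cert_authenticator, token_authenticator])` returns jane's identity without the token authenticator ever being consulted.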
Users vs Service Accounts
There is an important distinction in how Kubernetes thinks about identity. Service accounts are Kubernetes objects — they live in a namespace, get created with kubectl create serviceaccount, and have tokens managed by the cluster itself. Every pod runs as a service account. These are machine identities for workloads.
Users, on the other hand, don't exist as Kubernetes objects at all. There is no kubectl create user command. Kubernetes doesn't manage user accounts. Instead, it trusts external systems to assert user identity — a certificate authority, an OIDC provider, or a cloud provider's IAM system. Kubernetes just verifies the assertion and extracts the username and group memberships from it.




| | Service Account | User |
|---|---|---|
| Kubernetes object? | Yes, lives in a namespace | No, managed externally |
| Created with | kubectl create serviceaccount | External system (CA, IdP, cloud IAM) |
| Used by | Pods and workloads | Humans and CI systems |
| Token managed by | Kubernetes | External system |
| Namespaced? | Yes | No |


What Happens After Authentication
Authentication only answers one question: who is this? Once the API server has a verified identity — a username and zero or more group memberships — it passes the request to the authorisation layer. By default that is RBAC, which checks the identity against Role and ClusterRole bindings to determine what the request is allowed to do.
This is why authentication and authorisation are separate concerns in Kubernetes. A valid certificate gets you past the front door. What you can do inside is RBAC's job. An authenticated user with no RBAC bindings can authenticate successfully but will be denied every API call.
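The split between the two stages can be sketched in a few lines of Python. This is a simplified model under stated assumptions: real RBAC separates Roles from RoleBindings and supports wildcards and API groups, while the hypothetical `Binding` class here collapses everything into one record.

```python
from dataclasses import dataclass, field


@dataclass
class Binding:
    # Hypothetical, flattened stand-in for a Role plus its RoleBinding.
    namespace: str
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)
    verbs: set = field(default_factory=set)
    resources: set = field(default_factory=set)


def is_allowed(username, groups, verb, resource, namespace, bindings):
    """Return True if any binding grants the verb on the resource."""
    for binding in bindings:
        subject_matches = username in binding.users or bool(binding.groups & set(groups))
        if (
            subject_matches
            and binding.namespace == namespace
            and verb in binding.verbs
            and resource in binding.resources
        ):
            return True
    return False  # deny by default: no matching binding means no access
```

With an empty list of bindings, `is_allowed` returns False for every call, which mirrors the authenticated-but-unbound user described above.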
If you want a deep dive into how RBAC rules, roles, and bindings work, check out this handbook on How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection.
How to Use x509 Client Certificates
x509 client certificate authentication is the oldest and simplest authentication method in Kubernetes. It's how kubectl works out of the box when you create a cluster — the kubeconfig file that kind or kubeadm generates contains an embedded client certificate signed by the cluster's Certificate Authority.
How the Certificate Maps to an Identity
When the API server receives a request with a client certificate, it validates the certificate against its trusted CA, then reads two fields (the Common Name and Organization) from the certificate's subject to construct an identity.
The Common Name (CN) field becomes the username. The Organization (O) field, which can contain multiple values, becomes the list of groups the user belongs to.
So a certificate with CN=jane and O=engineering authenticates as username jane in group engineering. If you want to give jane permissions, you create a RoleBinding that references either the username jane or the group engineering as a subject.
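As a rough illustration of that mapping, the Python sketch below splits an OpenSSL-style subject line into a username and a list of groups. The function name `identity_from_subject` is hypothetical, and the parsing is deliberately naive: it assumes comma-separated fields and would break on values that themselves contain commas.

```python
def identity_from_subject(subject: str) -> tuple[str, list[str]]:
    """Split a subject such as 'subject=CN=jane, O=engineering' into (username, groups)."""
    subject = subject.removeprefix("subject=")
    username = ""
    groups = []
    for part in subject.split(","):
        key, _, value = part.strip().partition("=")
        key = key.strip()
        value = value.strip()
        if key == "CN":
            username = value      # Common Name becomes the username
        elif key == "O":
            groups.append(value)  # each Organization value becomes a group
    return username, groups
```

A certificate can carry several O values, so a subject like `CN=admin, O=ops, O=engineering` would yield two groups for the same user.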
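This is the same mechanism behind system:masters. Kubernetes has a built-in ClusterRoleBinding that grants cluster-admin to anyone in the system:masters group, so a certificate with O=system:masters carries full admin access. When kind creates a cluster and writes a kubeconfig for you, it generates a certificate with an admin group in the O field: system:masters on older releases, and kubeadm:cluster-admins (bound to cluster-admin in the same way) on clusters bootstrapped by kubeadm v1.29 or later. That's why your default kubeconfig has full admin access. It's not magic, it's a certificate with the right group.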
The Cluster CA
Every Kubernetes cluster has a root Certificate Authority — a private key and a self-signed certificate that the API server trusts. Any client certificate signed by this CA is trusted by the cluster.
The CA certificate and key are typically stored in /etc/kubernetes/pki/ on the control plane node, or in the kube-system namespace as a secret, depending on how the cluster was created.
On kind clusters, you can copy the CA cert and key directly from the control plane container:
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key

Whoever holds the CA key can issue certificates for any username and any group, including system:masters. This makes the CA key the most sensitive secret in a Kubernetes cluster. Guard it accordingly.
The Limits of Certificate-Based Auth
Client certificates work, but they have two fundamental problems that make them a poor choice for human users in production.
The first is that Kubernetes doesn't check certificate revocation lists (CRLs). If a developer's kubeconfig is stolen, the embedded certificate remains valid until it expires — which is typically one year in most Kubernetes setups. There's no way to immediately invalidate it. You can't "log out" a certificate. The only mitigation is to rotate the entire cluster CA, which invalidates every certificate including those belonging to other legitimate users.
The second is operational overhead. Certificates must be generated, distributed to users, and rotated before expiry. There's no self-service. In a team of ten engineers, managing certificates is annoying. In a team of a hundred, it's a full-time job.
For human access in production, OIDC is the right answer: short-lived tokens issued by a trusted identity provider, with a central revocation mechanism, and a standard browser-based login flow. Certificates are fine for cluster components and automation, where issuance and rotation can be handled programmatically.
That said, understanding certificates isn't optional. Your kubeconfig uses one. Your CI system probably does too. And cert-based auth is what you fall back to when everything else breaks.
Demo 1 — Create and Use an x509 Client Certificate
In this section, you'll generate a user certificate signed by the cluster CA, bind it to an RBAC role, and use it to authenticate to the cluster as a different user.
This guide is for local development and learning only. Manually signing certificates with the cluster CA and storing keys on disk is done here for simplicity.
In production, you should use the Kubernetes CertificateSigningRequest API or cert-manager for certificate issuance, enforce short-lived certificates with automatic rotation, and store private keys in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or hardware security module (HSM) — never distribute the cluster CA key.
Step 1: Copy the CA cert and key from the kind control plane
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key

This creates two files in your current directory: ca.crt and ca.key.
Step 2: Generate a private key and CSR for a new user
You're creating a certificate for a user named jane in the engineering group:
# Generate the private key
openssl genrsa -out jane.key 2048

# Generate a Certificate Signing Request
# CN = username, O = group
openssl req -new \
  -key jane.key \
  -out jane.csr \
  -subj "/CN=jane/O=engineering"

Step 3: Sign the CSR with the cluster CA
openssl x509 -req \
  -in jane.csr \
  -CA ca.crt \
  -CAkey ca.key \
  -CAcreateserial \
  -out jane.crt \
  -days 365

Expected output:
Certificate request self-signature ok
subject=CN=jane, O=engineering

Step 4: Inspect the certificate
Before using it, confirm the identity it carries:
openssl x509 -in jane.crt -noout -subject -dates

subject=CN=jane, O=engineering
notBefore=Mar 20 10:00:00 2024 GMT
notAfter=Mar 20 10:00:00 2025 GMT

One year from now, this certificate becomes invalid and must be replaced. There's no way to extend it — you have to issue a new one.
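If you want to keep an eye on how long a certificate has left, a small Python helper can turn the notAfter line into a day count. This is a sketch that assumes the date format shown in the openssl output above and an English locale for month names; the function names are hypothetical.

```python
from datetime import datetime, timezone

# Format of the notBefore/notAfter lines printed by `openssl x509 -dates`.
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_openssl_date(line: str) -> datetime:
    """Parse a line such as 'notAfter=Mar 20 10:00:00 2025 GMT' into a UTC datetime."""
    value = line.split("=", 1)[1]
    return datetime.strptime(value, OPENSSL_DATE_FORMAT).replace(tzinfo=timezone.utc)


def days_until_expiry(not_after_line: str, now: datetime) -> int:
    """Whole days remaining before the certificate expires (negative if already expired)."""
    return (parse_openssl_date(not_after_line) - now).days
```

For a real check you would pass `datetime.now(timezone.utc)` as `now` and alert when the result drops below whatever rotation window your team uses.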
Step 5: Build a kubeconfig entry for jane
# Get the cluster API server address from the current context
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Create a kubeconfig for jane
kubectl config set-cluster k8s-security \
  --server=$APISERVER \
  --certificate-authority=ca.crt \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-credentials jane \
  --client-certificate=jane.crt \
  --client-key=jane.key \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-context jane@k8s-security \
  --cluster=k8s-security \
  --user=jane \
  --kubeconfig=jane.kubeconfig

kubectl config use-context jane@k8s-security \
  --kubeconfig=jane.kubeconfig

Step 6: Test authentication — before RBAC
Try to list pods using jane's kubeconfig:
kubectl get pods -n staging --kubeconfig=jane.kubeconfig

Error from server (Forbidden): pods is forbidden: User "jane" cannot list
resource "pods" in API group "" in the namespace "staging"

This is correct. Jane authenticated successfully — Kubernetes knows who she is. But she has no RBAC bindings, so every API call is denied. Authentication passed, but authorisation failed.
Step 7: Grant jane access with RBAC
RBAC bindings use the username exactly as it appears in the certificate's CN field. If you need a refresher on how Roles, ClusterRoles, and RoleBindings work, this handbook How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection covers the full RBAC model. For now, a simple RoleBinding using the built-in view ClusterRole is enough:
# jane-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-reader
  namespace: staging
subjects:
  - kind: User
    name: jane          # matches the CN in the certificate
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

kubectl apply -f jane-rolebinding.yaml
kubectl get pods -n staging --kubeconfig=jane.kubeconfig

No resources found in staging namespace.

No error — jane can now list pods in staging. She can't delete them, create them, or access other namespaces. The certificate got her in. RBAC determines what she can do.
How to Set Up OIDC Authentication
OpenID Connect is an identity layer on top of OAuth 2.0. It's how Kubernetes integrates with enterprise identity providers — Active Directory, Okta, Google Workspace, Keycloak, and any other provider that speaks OIDC. Understanding how Kubernetes uses it requires following the token from the user's browser to the API server's decision.
How the OIDC Flow Works in Kubernetes
When a developer runs kubectl get pods with OIDC configured, the following happens:

kubectl checks whether the current credential in the kubeconfig is a valid, unexpired OIDC token

If not, it launches kubelogin, a kubectl plugin that opens a browser window

The browser redirects to the OIDC provider (Dex, Okta, your corporate IdP)

The user logs in with their corporate credentials

The OIDC provider issues a signed JWT and returns it to kubelogin

kubelogin caches the token locally (under ~/.kube/cache/oidc-login/) and returns it to kubectl

kubectl sends the token to the API server in an Authorization: Bearer header

The API server fetches the provider's public keys from its JWKS endpoint and verifies the token signature

If valid, the API server extracts the username and group claims from the token

RBAC takes over from there


The Kubernetes API server doesn't contact the OIDC provider on each request. It only fetches the provider's public keys periodically and uses them to verify token signatures locally. This makes OIDC authentication stateless and scalable.
The API Server Configuration
For OIDC to work, the API server needs to know where to find the identity provider and how to interpret the tokens it issues.
In Kubernetes v1.30+, this is configured through an AuthenticationConfiguration file passed via the --authentication-config flag. (In older versions, individual --oidc-* flags were used instead, but these were removed in v1.35.)
The AuthenticationConfiguration defines OIDC providers under the jwt key:



| Field | What it does | Example |
|---|---|---|
| issuer.url | The OIDC provider's base URL; must match the iss claim in the token | https://dex.example.com |
| issuer.audiences | The client IDs the token was issued for; must match the aud claim | ["kubernetes"] |
| issuer.certificateAuthority | CA certificate to trust when contacting the OIDC provider (inlined PEM) | -----BEGIN CERTIFICATE-----... |
| claimMappings.username.claim | Which JWT claim to use as the Kubernetes username | email |
| claimMappings.groups.claim | Which JWT claim to use as the Kubernetes group list | groups |
| claimMappings.*.prefix | Prefix added to the claim value; set to "" for no prefix | "" |


On a kind cluster, the --authentication-config flag is set in the cluster configuration before creation, not after. You'll see this in the next demo.
JWT Claims Kubernetes Uses
A JWT is a signed JSON object with three sections: a header, a payload, and a signature. The payload is a set of claims – key-value pairs that assert facts about the token. Kubernetes reads specific claims from the payload to build an identity.
The required claims are iss (the issuer URL, must match issuer.url in the AuthenticationConfiguration), sub (the subject, a unique identifier for the user), and aud (the audience, must match the issuer.audiences list). The exp claim (expiry time) is also required as the API server rejects expired tokens.
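The checks on those claims can be sketched in Python as follows. This is an illustration only: it assumes the signature has already been verified elsewhere, the function name `check_claims` is hypothetical, and the real API server applies additional validation rules.

```python
def check_claims(claims: dict, issuer_url: str, audiences: list, now: int) -> list:
    """Return a list of problems; an empty list means the claims pass these checks."""
    problems = [f"missing claim: {name}" for name in ("iss", "sub", "aud", "exp") if name not in claims]
    if problems:
        return problems

    if claims["iss"] != issuer_url:
        problems.append("iss does not match issuer.url")

    # The aud claim may be a single string or a list of strings.
    token_audiences = claims["aud"]
    if isinstance(token_audiences, str):
        token_audiences = [token_audiences]
    if not set(token_audiences) & set(audiences):
        problems.append("aud does not match issuer.audiences")

    if claims["exp"] <= now:
        problems.append("token has expired")
    return problems
```

Passing the current Unix time as `now` (for example `int(time.time())`) gives you a quick way to see which claim is at fault when a token is rejected.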
The most useful optional claim is groups (or whatever you configure via claimMappings.groups.claim). When this claim is present, Kubernetes can map OIDC group memberships directly to RBAC group bindings. A user in the platform-engineers group in your identity provider automatically gets the RBAC permissions you've bound to that group in Kubernetes — no manual user management required.
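Here is a minimal Python sketch of how claim mappings turn a token payload into a Kubernetes identity. The function name `map_identity` and its parameters are hypothetical stand-ins for the claimMappings fields described earlier, and the sketch assumes the groups claim, when present, is a list of strings.

```python
def map_identity(claims: dict, username_claim: str = "email", groups_claim: str = "groups",
                 username_prefix: str = "", groups_prefix: str = "") -> tuple:
    """Build (username, groups) from a verified JWT payload."""
    username = username_prefix + claims[username_claim]
    # A token without a groups claim simply yields an empty group list.
    groups = [groups_prefix + group for group in claims.get(groups_claim, [])]
    return username, groups
```

With empty prefixes, a payload containing `"groups": ["platform-engineers"]` maps to the group name platform-engineers exactly, which is the name an RBAC binding would need to reference. A non-empty prefix such as `oidc:` would change that name to `oidc:platform-engineers`.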
How kubelogin Works
kubelogin (also distributed as kubectl oidc-login) is a kubectl credential plugin. Instead of embedding a static certificate or token in your kubeconfig, you configure a credential plugin that runs a helper binary when kubectl needs a token.
When kubelogin is invoked, it checks its local token cache. If the cached token is still valid, it returns it immediately. If the token has expired, it initiates the OIDC authorization code flow — opens a browser, redirects to the identity provider, receives the token after login, caches it locally, and returns it to kubectl. The whole flow takes about five seconds when it triggers.
This means tokens are short-lived (typically an hour) and rotate automatically. If a developer's machine is compromised, the token expires on its own. There is no long-lived credential sitting in a file somewhere.
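The cache-then-login behaviour described above can be modelled in a few lines of Python. This is a conceptual sketch, not kubelogin's actual code: the in-memory `cache` dictionary and the `login` callback are hypothetical, and the real plugin persists its cache on disk.

```python
from typing import Callable, Tuple


def get_token(cache: dict, now: float, login: Callable[[], Tuple[str, float]]) -> str:
    """Return a cached token if it is still valid, otherwise run the login flow."""
    cached = cache.get("id_token")
    if cached is not None and cache.get("expires_at", 0) > now:
        return cached  # still valid: no browser, no network round trip

    # Expired or missing: run the interactive login and remember the result.
    token, expires_at = login()
    cache["id_token"] = token
    cache["expires_at"] = expires_at
    return token
```

The `login` callback stands in for the browser-based authorization code flow and returns the new token together with its expiry time.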
Demo 2 — Configure OIDC Login with Dex and kubelogin
In this section, you'll deploy Dex as a self-hosted OIDC provider, configure a kind cluster to trust it, and log in with a browser. Dex is a good demo vehicle because it runs inside the cluster and doesn't require a cloud account or an external service.
This guide is for local development and learning only. Self-signed certificates, static passwords, and certs stored on disk are used here for simplicity.
In production, use a managed identity provider (Azure Entra ID, Google Workspace, Okta), automate certificate lifecycle with cert-manager, and store secrets in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or inject them via CSI driver — never commit or store certs as local files.
Step 1: Create a kind cluster with OIDC authentication
OIDC authentication for the API server must be configured at cluster creation time on Kind because the API server needs to know which identity provider to trust before it starts accepting requests.
Note: Kubernetes v1.30+ deprecated the --oidc-* API server flags in favor of the structured AuthenticationConfiguration API (via --authentication-config). In v1.35+ the old flags are removed entirely. This guide uses the new approach.
nip.io is a wildcard DNS service — dex.127.0.0.1.nip.io resolves to 127.0.0.1. This lets us use a real hostname for TLS without editing /etc/hosts.
First, generate a self-signed CA and TLS certificate for Dex:
# Generate a CA for Dex
openssl req -x509 -newkey rsa:4096 -keyout dex-ca.key \
  -out dex-ca.crt -days 365 -nodes \
  -subj "/CN=dex-ca"

# Generate a certificate for Dex signed by that CA
openssl req -newkey rsa:2048 -keyout dex.key \
  -out dex.csr -nodes \
  -subj "/CN=dex.127.0.0.1.nip.io"

openssl x509 -req -in dex.csr \
  -CA dex-ca.crt -CAkey dex-ca.key \
  -CAcreateserial -out dex.crt -days 365 \
  -extfile <(printf "subjectAltName=DNS:dex.127.0.0.1.nip.io")

Next, generate the AuthenticationConfiguration file. This tells the API server how to validate JWTs — which issuer to trust (url), which audience to expect (audiences), and which JWT claims map to Kubernetes usernames and groups (claimMappings). The CA cert is inlined so the API server can verify Dex's TLS certificate when fetching signing keys:
cat > auth-config.yaml <
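The command above is cut off in this copy of the article, so the exact file contents are not shown. As a stand-in, here is a minimal Python sketch that produces an equivalent file from the description above. The apiVersion (v1beta1), the function name `build_auth_config`, and the exact field layout are assumptions based on the fields this article describes, not the author's original heredoc, so check them against the Kubernetes version you are running.

```python
import textwrap


def build_auth_config(ca_pem: str, issuer_url: str, audience: str) -> str:
    """Render an AuthenticationConfiguration with the issuer's CA certificate inlined."""
    # The PEM block must be indented deeper than the certificateAuthority key.
    indented_ca = textwrap.indent(ca_pem.strip() + "\n", " " * 8)
    return (
        "apiVersion: apiserver.config.k8s.io/v1beta1\n"  # assumption: beta API version
        "kind: AuthenticationConfiguration\n"
        "jwt:\n"
        "  - issuer:\n"
        f"      url: {issuer_url}\n"
        "      audiences:\n"
        f"        - {audience}\n"
        "      certificateAuthority: |\n"
        f"{indented_ca}"
        "    claimMappings:\n"
        "      username:\n"
        "        claim: email\n"
        '        prefix: ""\n'
        "      groups:\n"
        "        claim: groups\n"
        '        prefix: ""\n'
    )
```

To write the file for this demo, you could read dex-ca.crt with `pathlib.Path("dex-ca.crt").read_text()`, pass it in along with `https://dex.127.0.0.1.nip.io:32000` and `kubernetes`, and save the returned string as auth-config.yaml.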

The kind-oidc.yaml config uses extraPortMappings to expose Dex's port to your browser, extraMounts to copy files into the Kind node, and a kubeadmConfigPatch to pass --authentication-config to the API server:
# kind-oidc.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      # Forward port 32000 from the Docker container to localhost,
      # so your browser can reach Dex's login page
      - containerPort: 32000
        hostPort: 32000
        protocol: TCP
    extraMounts:
      # Copy files from your machine into the Kind node's filesystem
      - hostPath: ./dex-ca.crt
        containerPath: /etc/ca-certificates/dex-ca.crt
        readOnly: true
      - hostPath: ./auth-config.yaml
        containerPath: /etc/kubernetes/auth-config.yaml
        readOnly: true
    kubeadmConfigPatches:
      # Patch the API server to enable OIDC authentication
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            # Tell the API server to load our AuthenticationConfiguration
            authentication-config: /etc/kubernetes/auth-config.yaml
          extraVolumes:
            # Mount files into the API server pod (it runs as a static pod,
            # so it needs explicit volume mounts even though files are on the node)
            - name: dex-ca
              hostPath: /etc/ca-certificates/dex-ca.crt
              mountPath: /etc/ca-certificates/dex-ca.crt
              readOnly: true
              pathType: File
            - name: auth-config
              hostPath: /etc/kubernetes/auth-config.yaml
              mountPath: /etc/kubernetes/auth-config.yaml
              readOnly: true
              pathType: File

Create the cluster:
kind create cluster --name k8s-auth --config kind-oidc.yaml

Step 2: Deploy Dex
Dex is an OIDC-compliant identity provider that acts as a bridge between Kubernetes and upstream identity sources (LDAP, SAML, GitHub, and so on). In this demo it runs inside the cluster with a static password database — two hardcoded users you can log in as.
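The API server doesn't talk to Dex on every request. It periodically fetches Dex's public signing keys over HTTPS and uses them to verify the JWT signatures on tokens that Dex issues. Dex's CA certificate (which you inlined in the AuthenticationConfiguration) is what lets the API server trust that HTTPS connection.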
The deployment has four parts: a ConfigMap with Dex's configuration, a Deployment to run Dex, a NodePort Service to expose it on port 32000 (matching the issuer URL), and RBAC resources so Dex can store state using Kubernetes CRDs.
First, create the namespace and load the TLS certificate as a Kubernetes Secret. Dex needs this to serve HTTPS. Without it, your browser and the API server would refuse to connect:
kubectl create namespace dex

kubectl create secret tls dex-tls \
  --cert=dex.crt \
  --key=dex.key \
  -n dex

Save the following as dex-config.yaml. This configures Dex with a static password connector — two hardcoded users for the demo:
# dex-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dex-config
  namespace: dex
data:
  config.yaml: |
    # issuer must exactly match the URL in your AuthenticationConfiguration
    issuer: https://dex.127.0.0.1.nip.io:32000

    # Dex stores refresh tokens and auth codes — here it uses Kubernetes CRDs
    storage:
      type: kubernetes
      config:
        inCluster: true

    # Dex's HTTPS listener — serves the login page and token endpoints
    web:
      https: 0.0.0.0:5556
      tlsCert: /etc/dex/tls/tls.crt
      tlsKey: /etc/dex/tls/tls.key

    # staticClients defines which applications can request tokens.
    # "kubernetes" is the client ID that kubelogin uses when authenticating
    staticClients:
      - id: kubernetes
        redirectURIs:
          - http://localhost:8000     # kubelogin listens here to receive the callback
        name: Kubernetes
        secret: kubernetes-secret     # shared secret between kubelogin and Dex

    # Two demo users with the password "password" (bcrypt-hashed).
    # In production, you'd connect Dex to LDAP, SAML, or a social login instead
    enablePasswordDB: true
    staticPasswords:
      - email: "jane@example.com"
        # bcrypt hash of "password" — generate your own with: htpasswd -bnBC 10 "" password
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "jane"
        userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
      - email: "admin@example.com"
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "admin"
        userID: "a8b53e13-7e8c-4f7b-9a33-6c2f4d8c6a1b"
        groups:
          - platform-engineers

Save the following as dex-deployment.yaml. This creates the Deployment, Service, ServiceAccount, and RBAC that Dex needs to run:
# dex-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dex
  namespace: dex
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dex
  template:
    metadata:
      labels:
        app: dex
    spec:
      serviceAccountName: dex
      containers:
        - name: dex
          # v2.45.0+ required — earlier versions don't include groups from staticPasswords in tokens
          image: ghcr.io/dexidp/dex:v2.45.0
          command: ["dex", "serve", "/etc/dex/cfg/config.yaml"]
          ports:
            - name: https
              containerPort: 5556
          volumeMounts:
            - name: config
              mountPath: /etc/dex/cfg
            - name: tls
              mountPath: /etc/dex/tls
      volumes:
        - name: config
          configMap:
            name: dex-config
        - name: tls
          secret:
            secretName: dex-tls
---
# NodePort Service — exposes Dex on port 32000 on the Kind node.
# Combined with extraPortMappings, this makes Dex reachable from your browser
apiVersion: v1
kind: Service
metadata:
  name: dex
  namespace: dex
spec:
  type: NodePort
  ports:
    - name: https
      port: 5556
      targetPort: 5556
      nodePort: 32000
  selector:
    app: dex
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dex
  namespace: dex
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dex
rules:
  - apiGroups: ["dex.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dex
subjects:
  - kind: ServiceAccount
    name: dex
    namespace: dex
roleRef:
  kind: ClusterRole
  name: dex
  apiGroup: rbac.authorization.k8s.io

kubectl apply -f dex-config.yaml
kubectl apply -f dex-deployment.yaml
kubectl rollout status deployment/dex -n dex

Step 3: Install kubelogin
# macOS
brew install int128/kubelogin/kubelogin

# Linux
curl -LO https://github.com/int128/kubelogin/releases/latest/download/kubelogin_linux_amd64.zip
unzip -j kubelogin_linux_amd64.zip kubelogin -d /tmp
sudo mv /tmp/kubelogin /usr/local/bin/kubectl-oidc_login
rm kubelogin_linux_amd64.zip

Confirm it's installed:
kubectl oidc-login --version

Step 4: Configure a kubeconfig entry for OIDC
This creates a new user and context in your kubeconfig. Instead of using a client certificate (like the default Kind admin), it tells kubectl to use kubelogin to get a token from Dex.
The --oidc-extra-scope flags are important: without email and groups, Dex won't include those claims in the JWT, and the API server won't know who you are or what groups you belong to.
kubectl config set-credentials oidc-user \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://dex.127.0.0.1.nip.io:32000 \
  --exec-arg=--oidc-client-id=kubernetes \
  --exec-arg=--oidc-client-secret=kubernetes-secret \
  --exec-arg=--oidc-extra-scope=email \
  --exec-arg=--oidc-extra-scope=groups \
  --exec-arg=--certificate-authority=$(pwd)/dex-ca.crt

kubectl config set-context oidc@k8s-auth \
  --cluster=kind-k8s-auth \
  --user=oidc-user

kubectl config use-context oidc@k8s-auth

Step 5: Trigger the login flow
Jane has no RBAC permissions yet, so first grant her read access from the admin context:
kubectl --context kind-k8s-auth create clusterrolebinding jane-view \
  --clusterrole=view --user=jane@example.com

Now switch to the OIDC context and trigger a login:
kubectl get pods -n default

Your browser opens and redirects to the Dex login page. Log in as jane@example.com with password password.




After login, the terminal completes:
No resources found in default namespace.

The browser-based authentication worked. kubectl received the token from Dex and sent it to the API server. The API server verified the JWT signature using Dex's public signing keys (fetched over a TLS connection it trusts because of the CA certificate in the AuthenticationConfiguration), extracted jane@example.com from the email claim, matched it against the RBAC binding, and authorized the request.
Without the clusterrolebinding, you would see Error from server (Forbidden) — authentication succeeds (the API server knows who you are) but authorization fails (jane has no permissions). This is the distinction between 401 Unauthorized and 403 Forbidden.
Step 6: Inspect the JWT
A JWT (JSON Web Token) is a signed JSON payload that contains claims about the user. kubelogin caches the token locally under ~/.kube/cache/oidc-login/ so you don't have to log in on every kubectl command.
List the directory to find the cached file:
ls ~/.kube/cache/oidc-login/

Decode the JWT payload directly from the cache:
cat ~/.kube/cache/oidc-login/$(ls ~/.kube/cache/oidc-login/ | grep -v lock | head -1) | \
  python3 -c "
import json, sys, base64
token = json.load(sys.stdin)['id_token'].split('.')[1]
token += '=' * (-len(token) % 4)  # restore the base64 padding that JWTs strip
print(json.dumps(json.loads(base64.urlsafe_b64decode(token)), indent=2))
"

You'll see something like:
{
  "iss": "https://dex.127.0.0.1.nip.io:32000",
  "sub": "CiQwOGE4Njg0Yi1kYjg4LTRiNzMtOTBhOS0zY2QxNjYxZjU0NjYSBWxvY2Fs",
  "aud": "kubernetes",
  "exp": 1775307910,
  "iat": 1775221510,
  "email": "jane@example.com",
  "email_verified": true
}

The email claim becomes jane's Kubernetes username because the AuthenticationConfiguration maps username.claim: email. The aud matches the configured audiences. The iss matches the issuer url. This is how the API server validates the token without contacting Dex on every request: it verifies the JWT signature with Dex's cached public keys, and the CA certificate is only needed to trust the HTTPS connection used to fetch those keys.
Step 7: Map OIDC groups to RBAC
The admin@example.com user has a groups claim in the Dex config containing platform-engineers. Instead of creating individual RBAC bindings per user, you can bind permissions to a group — anyone whose JWT contains that group gets the permissions automatically:
# platform-engineers-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers     # matches the groups claim in the JWT
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io

You're currently logged in as jane@example.com via the OIDC context, but jane only has view permissions — she can't create cluster-wide RBAC bindings. Switch back to the admin context to apply this:
kubectl config use-context kind-k8s-auth
kubectl apply -f platform-engineers-binding.yaml
kubectl config use-context oidc@k8s-auth

Now clear the cached token to log out of jane's session, then trigger a new login as admin@example.com:
# Clear the cached token — this is how you "log out" with kubelogin
rm -rf ~/.kube/cache/oidc-login/

# This will open the browser again for a fresh login
kubectl get pods -n default

Log in as admin@example.com with password password. This time the JWT will contain "groups": ["platform-engineers"], which matches the ClusterRoleBinding you just created. The admin user gets full cluster access — without ever being added to a kubeconfig by name.
You can verify by decoding the new token (Step 6) — the groups claim will be present:
{
  "email": "admin@example.com",
  "groups": ["platform-engineers"]
}

This is the real power of OIDC group claims: you manage group membership in your identity provider, and Kubernetes permissions follow automatically. Add someone to the platform-engineers group in Dex (or any upstream IdP), and they get cluster-admin access on their next login — no kubeconfig or RBAC changes needed.
Cloud Provider Authentication
AWS, GCP, and Azure each give Kubernetes clusters a native authentication mechanism that ties into their IAM systems.
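The implementations differ in API surface, but they all follow the same underlying pattern: the cloud provider's identity system issues a short-lived signed token, and the cluster verifies it and maps it to a Kubernetes identity. Once you understand how Dex works above, these are all variations on the same theme.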
AWS EKS
EKS uses the aws-iam-authenticator to translate AWS IAM identities into Kubernetes identities. When you run kubectl against an EKS cluster, the AWS CLI generates a short-lived token signed with your IAM credentials. The API server passes this token to the aws-iam-authenticator webhook, which verifies it against AWS STS and returns the corresponding username and groups.
User access is controlled via the aws-auth ConfigMap in kube-system, which maps IAM role ARNs and IAM user ARNs to Kubernetes usernames and groups. A typical entry looks like this:
# In kube-system/aws-auth ConfigMap
mapRoles:
  - rolearn: arn:aws:iam::123456789012:role/platform-engineers
    username: platform-engineer:{{SessionName}}
    groups:
      - platform-engineers

AWS is migrating from the aws-auth ConfigMap to a newer Access Entries API, which manages the same mapping through the EKS API rather than a ConfigMap. The underlying authentication mechanism is the same.
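The role-to-identity lookup that a mapRoles entry expresses can be pictured with a short Python sketch. This is an illustration of the mapping logic only, not the authenticator's code: the function name `map_role` is hypothetical, and the account ID in the usage example is a placeholder.

```python
from typing import Optional


def map_role(role_arn: str, session_name: str, map_roles: list) -> Optional[tuple]:
    """Return (username, groups) for the first mapRoles entry matching the role ARN."""
    for entry in map_roles:
        if entry["rolearn"] == role_arn:
            # {{SessionName}} is replaced with the session name of the assumed role.
            username = entry["username"].replace("{{SessionName}}", session_name)
            return username, list(entry.get("groups", []))
    return None  # no mapping: the IAM identity has no Kubernetes identity
```

An IAM role that has no entry comes back as None, which corresponds to a caller who holds valid AWS credentials but is unknown to the cluster.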
Google GKE
GKE integrates with Google Cloud IAM using two different mechanisms, depending on whether you're authenticating as a human user or as a workload.
For human users, GKE accepts standard Google OAuth2 tokens. Running gcloud container clusters get-credentials writes a kubeconfig that uses the gcloud CLI as a credential plugin, generating short-lived tokens from your Google account automatically.
For pod-level identity — letting a pod assume a Google Cloud IAM role — GKE uses Workload Identity. You annotate a Kubernetes service account to bind it to a Google Service Account, and pods running as that service account can call Google Cloud APIs using the GSA's permissions:
# Bind a Kubernetes SA to a Google Service Account
kubectl annotate serviceaccount my-app \
  --namespace production \
  iam.gke.io/gcp-service-account=my-app@my-project.iam.gserviceaccount.com

Azure AKS
AKS integrates with Azure Active Directory. When Azure AD integration is enabled, kubectl requests an Azure AD token on behalf of the user via the Azure CLI, and the AKS API server validates it against Azure AD.
For pod-level identity, AKS uses Azure Workload Identity, which follows the same OIDC federation pattern as GKE Workload Identity. A Kubernetes service account is annotated with an Azure Managed Identity client ID, and pods can request Azure AD tokens without storing any credentials:
# Annotate a service account with the Azure Managed Identity client ID
kubectl annotate serviceaccount my-app \
  --namespace production \
  azure.workload.identity/client-id="<managed-identity-client-id>"

The underlying pattern across all three providers is the same: a trusted OIDC token is issued by the cloud provider, verified by the Kubernetes API server, and mapped to an identity through a binding (the aws-auth ConfigMap, a GKE Workload Identity binding, or an AKS federated identity credential). The OIDC section in this article is the conceptual foundation for all of them.
Webhook Token Authentication
Webhook token authentication is worth knowing about because it appears in several common Kubernetes setups, even if you never configure it yourself.
When a request arrives with a bearer token that no other authenticator recognises, Kubernetes can send that token to an external HTTP endpoint for validation. The endpoint returns a response indicating who the token belongs to.
This is the mechanism behind the EKS flow described above: the API server hands the token to the aws-iam-authenticator webhook instead of validating it itself. Bootstrap tokens look similar from the outside, since a token is generated, embedded in the kubeadm join command, and presented when the new node contacts the API server for the first time. They are validated by a separate built-in bootstrap token authenticator, though, not by a webhook.
For most clusters, you'll encounter webhook auth as something already running rather than something you configure. The main thing to know is that it exists and what it looks like when it appears in logs or configuration.
Cleanup
To remove everything created in this article:
# Delete the OIDC demo cluster
kind delete cluster --name k8s-auth

# Remove generated certificate files
rm -f ca.crt ca.key jane.key jane.csr jane.crt jane.kubeconfig
rm -f dex-ca.crt dex-ca.key dex.crt dex.key dex.csr dex-ca.srl auth-config.yaml

# Remove the kubelogin token cache
rm -rf ~/.kube/cache/oidc-login/

Conclusion
Kubernetes authentication is not a single mechanism — it's a chain of pluggable strategies, each one suited to different use cases. In this article you worked through the most important ones.
x509 client certificates are how Kubernetes works out of the box. The CN field becomes the username, the O field becomes the group, and the cluster CA is the trust anchor. You created a certificate for a new user, bound it to RBAC, and saw exactly how authentication and authorisation interact — authentication gets you in, RBAC determines what you can do.
You also saw the fundamental limitation: Kubernetes doesn't check certificate revocation lists, so a compromised certificate remains valid until it expires. This makes certificates a poor fit for human users in production environments.
OIDC is the production-grade answer. Tokens are short-lived, issued by a trusted identity provider, and map directly to Kubernetes groups through JWT claims. You deployed Dex as a self-hosted OIDC provider, configured the API server to trust it, and set up kubelogin for browser-based authentication.
You then decoded a JWT to see exactly what the API server reads from it, and mapped an OIDC group claim to a Kubernetes ClusterRoleBinding.
Cloud provider authentication — EKS, GKE, AKS — uses the same OIDC foundation with provider-specific wrappers. Understanding how Dex works makes each of those systems immediately readable.
All YAML, certificates, and configuration files from this article are in the companion GitHub repository.
 


How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection
Destiny Erhabor — Wed, 25 Mar 2026 16:45:23 +0000
 In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it.
An attacker had found it, deployed pods inside Tesla's cluster, and was using them to mine cryptocurrency – all on Tesla's AWS bill. The cluster had no authentication on the dashboard, no network restrictions on egress, and nothing monitoring for intrusion. Any one of those controls would have stopped the attack. None of them were in place.
This wasn't a sophisticated zero-day exploit. It was a misconfigured default.
Kubernetes ships with powerful security primitives. The problem is that almost none of them are enabled by default. A fresh cluster is deliberately permissive so it's easy to get started. That permissiveness is a feature in development. In production, it's a liability.
In this handbook, we'll work through the three most impactful security layers in Kubernetes. We'll start with Role-Based Access Control, which governs who can do what to which resources in the API. From there we'll move to pod runtime security, which locks down what containers can actually do once they're running on a node. Finally we'll deploy Falco, a syscall-level detection engine that watches for attacks in progress and alerts in real time.
By the end, you'll have a hardened cluster with working RBAC policies, enforced pod security standards, and live detection rules that fire when something suspicious happens.
Prerequisites

kubectl installed and configured

Docker Desktop or a Linux machine (to run kind)

Basic Kubernetes familiarity – you know what a Pod, Deployment, and Namespace are

No prior security experience needed


All demos run on a local kind cluster. Full YAML and setup scripts are in the companion GitHub repository.
Table of Contents

The Kubernetes Threat Landscape

What You'll Build

Demo 1 — Run a Cluster Security Baseline with kube-bench

How to Configure RBAC

The Four RBAC Objects

How to Discover Resources, Verbs, and API Groups

Roles and ClusterRoles

RoleBindings and ClusterRoleBindings

How to Use Service Accounts Safely

How to Audit Your RBAC Configuration



Demo 2 — Build a Least-Privilege RBAC Policy for a CI Pipeline

Demo 3 — Audit RBAC with rakkess and rbac-lookup

How to Harden Pod Runtime Security

Pod Security Admission

How to Configure securityContext

OPA/Gatekeeper vs Kyverno

How to Detect Runtime Threats with Falco



Demo 4 — Harden a Pod with securityContext

Demo 5 — Deploy Falco and Write a Custom Detection Rule

Cleanup

Conclusion


The Kubernetes Threat Landscape
To understand what you're defending against, you need to understand where Kubernetes exposes attack surface. There are six main areas, and most production incidents trace back to at least one of them.
The API server is the front door to your cluster. Every kubectl command, every CI deploy, and every controller reconciliation loop sends requests here. Unauthenticated or over-privileged access to the API server is effectively game over: an attacker who can talk to it can create pods, read secrets, and modify workloads freely.
etcd is the key-value store where all cluster state lives, including your Secrets. Kubernetes Secrets are base64-encoded by default, not encrypted. Anyone with direct access to etcd can read every password, token, and certificate in the cluster without going through the API server at all.
The kubelet runs on each node and manages the pods assigned to it. If its API is reachable without authentication – which is the default on older clusters – an attacker can exec into any pod on that node and read its memory without ever touching the API server.
The container runtime is the layer that actually runs your containers. A container that escapes its isolation boundary lands directly in the host OS. A privileged container with hostPID: true can read the memory of every other process on the node, including other containers.
Your supply chain (base images, third-party dependencies, Helm charts, operators) is a potential entry point at every step. The XZ Utils backdoor discovered in 2024 showed how close a well-positioned supply chain attack can come to widespread infrastructure compromise.
Finally, the network: by default, every pod in a Kubernetes cluster can reach every other pod on any port. There are no internal firewalls between workloads unless you explicitly create them with NetworkPolicy.


Real-World Breaches
These three incidents are worth understanding before you write a single line of YAML. They're not theoretical – they're documented post-mortems from real production clusters.



| Incident | Year | Root cause | What was missing |
| --- | --- | --- | --- |
| Tesla cryptomining | 2018 | Kubernetes dashboard exposed with no authentication; unrestricted egress | RBAC on the dashboard endpoint + default-deny NetworkPolicy |
| Capital One data breach | 2019 | SSRF vulnerability in a WAF let an attacker reach the EC2 metadata API, which returned credentials for an over-privileged IAM role | Pod-level IAM restrictions (IRSA) + blocking metadata API egress |
| Shopify bug bounty (Kubernetes) | 2021 | A researcher accessed internal Kubernetes metadata through a misconfigured internal service, exposing pod environment variables containing secrets | Secret management outside environment variables + network segmentation |


The pattern across all three: not zero-day exploits, but misconfigured defaults and missing controls that should have been standard practice.
This article addresses the RBAC and pod security gaps directly.
What You'll Build
Before the first command, here is the security posture you'll have by the end of this article:
You'll start by running kube-bench to get a CIS Benchmark baseline – a concrete score showing where a default cluster stands before any hardening. From there you'll build a least-privilege RBAC policy for a CI pipeline service account and verify its permission boundaries, then audit the full cluster to confirm no over-privileged accounts exist.
On the pod security side, you'll enforce the restricted Pod Security Admission profile on your workload namespace and apply a hardened securityContext to a deployment: non-root user, read-only root filesystem, dropped capabilities, and seccomp profile. To close out, you'll deploy Falco in eBPF mode with a custom detection rule that fires when suspicious tools are run inside a container.
Start to finish, with a kind cluster already running, the demos take about 45–60 minutes.
Demo 1 – Run a Cluster Security Baseline with kube-bench
Before hardening anything, it's a good idea to measure where you are. kube-bench runs the CIS Kubernetes Benchmark against your cluster and reports which checks pass and which fail. A baseline run gives you a concrete picture of your cluster's default security posture – and a reference point you can re-run after applying any hardening changes.
Step 1: Create a kind cluster
Save the following as kind-config.yaml:
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

kind create cluster --name k8s-security --config kind-config.yaml

Expected output:
Creating cluster "k8s-security" ...
 ✓ Ensuring node image (kindest/node:v1.29.0) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-k8s-security"

Step 2: Run kube-bench
kube-bench runs as a Job inside the cluster, mounting the host filesystem to inspect Kubernetes configuration files and processes:
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench

The output is long. Scroll to the summary at the bottom:
== Summary master ==
0 checks PASS
11 checks FAIL
 9 checks WARN
 0 checks INFO

== Summary node ==
17 checks PASS
 2 checks FAIL
40 checks WARN
 0 checks INFO

A fresh kind cluster typically fails around a dozen checks; the run above shows 13 (11 on the control plane and 2 on the node). Three of the most important failures explain why defaults are a problem:



| Check ID | Description | Why it matters |
| --- | --- | --- |
| 1.2.1 | --anonymous-auth is not set to false on the API server | Anonymous requests can reach the API server without authentication, the same class of gap that left the Tesla dashboard open |
| 1.2.6 | --kubelet-certificate-authority is not set | The API server cannot verify kubelet identity, enabling man-in-the-middle attacks between the control plane and nodes |
| 4.2.6 | --protect-kernel-defaults is not set on the kubelet | The kubelet adjusts kernel tunables to match its own defaults instead of failing when they differ, so a node's kernel hardening can drift without anyone noticing |


Note: Some kube-bench findings are expected on kind because kind is a development tool, not a production-hardened environment. The important thing is to understand what each finding means and whether it applies to your target production setup.
Delete the Job when you're done:
kubectl delete job kube-bench

Now that you have a baseline, you know what you're starting from. The next step is to work through the most impactful control on that list: access control. RBAC governs every interaction with the Kubernetes API, and getting it right is the foundation everything else builds on.
How to Configure RBAC
Role-Based Access Control is the authorisation layer in Kubernetes. Every request that reaches the API server – from kubectl, from a pod, from a controller – is checked against RBAC rules after authentication succeeds. If there is no rule that explicitly allows the action, Kubernetes denies it.
The key word is "explicitly". RBAC in Kubernetes is additive only. There is no deny rule. You grant access by creating rules, and you remove access by deleting them. This makes the mental model clean: if a subject can do something, you gave it permission to do that thing.
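The additive model can be sketched in a few lines of Python. This is an illustrative model only, not how the API server is implemented; the Rule class and the is_allowed function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One RBAC rule: every field is an allow-list (hypothetical model)."""
    api_groups: tuple
    resources: tuple
    verbs: tuple


def _matches(allowed, value):
    # "*" is the RBAC wildcard; otherwise the value must be listed explicitly.
    return "*" in allowed or value in allowed


def is_allowed(rules, api_group, resource, verb):
    """Return True if any rule allows the request; the default is deny."""
    return any(
        _matches(r.api_groups, api_group)
        and _matches(r.resources, resource)
        and _matches(r.verbs, verb)
        for r in rules
    )


# A rule set shaped like a read-only Role for pods and configmaps.
read_only = [Rule(("",), ("pods", "configmaps"), ("get", "list", "watch"))]

print(is_allowed(read_only, "", "pods", "list"))     # a rule matches
print(is_allowed(read_only, "", "pods", "delete"))   # no rule matches, so denied
print(is_allowed([], "", "pods", "get"))             # no rules at all, so denied
```

There is no code path that subtracts a permission: the only way to revoke access in this model is to remove the rule that granted it, which mirrors how RBAC behaves.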
A Brief Case Study: The Shopify Kubernetes Misconfiguration
In 2021, security researcher Silas Cutler discovered that a Shopify internal service exposed Kubernetes metadata through an SSRF vulnerability. The metadata included pod environment variables that contained secrets. The root cause was partly RBAC: the service's service account had broader cluster access than it needed, and there was no least-privilege review process.
Shopify paid a $25,000 bug bounty and fixed the issue. The lesson is straightforward: a service account should only have the permissions it needs to do its specific job. Nothing more.
This is the principle you'll apply in Demo 2.
The Four RBAC Objects
RBAC in Kubernetes is built from four API objects. Two define permissions, two bind those permissions to subjects:



| Object | Scope | What it does |
| --- | --- | --- |
| Role | Namespace | Defines a set of permissions within one namespace |
| ClusterRole | Cluster-wide | Defines permissions across all namespaces, or for cluster-scoped resources like Nodes |
| RoleBinding | Namespace | Grants the permissions of a Role or ClusterRole to a subject, within one namespace |
| ClusterRoleBinding | Cluster-wide | Grants the permissions of a ClusterRole to a subject across the entire cluster |


A subject is a user, a group, or a service account. Users and groups come from your authentication layer – client certificates, OIDC tokens, or cloud provider identity. Service accounts are Kubernetes-native identities created for pods.
How to Discover Resources, Verbs, and API Groups
Before you can write a Role, you need to know three things: the resource name, the API group it belongs to, and the verbs it supports. You shouldn't have to guess any of them – kubectl can tell you everything.
List all available resources and their API groups
kubectl api-resources

Partial output:
NAME                    SHORTNAMES  APIVERSION                     NAMESPACED  KIND
bindings                            v1                             true        Binding
configmaps              cm          v1                             true        ConfigMap
endpoints               ep          v1                             true        Endpoints
events                  ev          v1                             true        Event
namespaces              ns          v1                             false       Namespace
nodes                   no          v1                             false       Node
pods                    po          v1                             true        Pod
secrets                             v1                             true        Secret
serviceaccounts         sa          v1                             true        ServiceAccount
services                svc         v1                             true        Service
deployments             deploy      apps/v1                        true        Deployment
replicasets             rs          apps/v1                        true        ReplicaSet
statefulsets            sts         apps/v1                        true        StatefulSet
cronjobs                cj          batch/v1                       true        CronJob
jobs                                batch/v1                       true        Job
ingresses               ing         networking.k8s.io/v1           true        Ingress
networkpolicies         netpol      networking.k8s.io/v1           true        NetworkPolicy
clusterroles                        rbac.authorization.k8s.io/v1   false       ClusterRole
roles                               rbac.authorization.k8s.io/v1   true        Role

The APIVERSION column is what you put in apiGroups. Strip the version suffix and use only the group part:



| APIVERSION in output | apiGroups value in Role |
| --- | --- |
| v1 | "" (empty string – the core group) |
| apps/v1 | "apps" |
| batch/v1 | "batch" |
| networking.k8s.io/v1 | "networking.k8s.io" |
| rbac.authorization.k8s.io/v1 | "rbac.authorization.k8s.io" |


The NAMESPACED column tells you whether to use a Role (namespaced resources) or a ClusterRole (non-namespaced resources like nodes).
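The two lookups above are mechanical enough to write down as code. The following is a minimal Python sketch; api_group and role_kind are hypothetical helper names, not part of kubectl or any Kubernetes client library.

```python
def api_group(api_version: str) -> str:
    """Map an APIVERSION value to the string used in a Role's apiGroups."""
    # "v1" has no group prefix, so it belongs to the core group: "".
    if "/" not in api_version:
        return ""
    # "apps/v1" -> "apps"; "networking.k8s.io/v1" -> "networking.k8s.io"
    group, _version = api_version.rsplit("/", 1)
    return group


def role_kind(namespaced: bool) -> str:
    """Namespaced resources can use a Role; cluster-scoped ones need a ClusterRole."""
    return "Role" if namespaced else "ClusterRole"


for api_version in ("v1", "apps/v1", "batch/v1", "networking.k8s.io/v1"):
    print(f"{api_version!r:28} -> apiGroups: [{api_group(api_version)!r}]")

print("pods  ->", role_kind(True))
print("nodes ->", role_kind(False))
```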
Filter by API group
If you want to see only resources in a specific group, for example, everything in apps:
kubectl api-resources --api-group=apps

NAME                  SHORTNAMES  APIVERSION  NAMESPACED  KIND
controllerrevisions               apps/v1     true        ControllerRevision
daemonsets            ds          apps/v1     true        DaemonSet
deployments           deploy      apps/v1     true        Deployment
replicasets           rs          apps/v1     true        ReplicaSet
statefulsets          sts         apps/v1     true        StatefulSet

List all verbs for a specific resource
Each resource supports a different set of verbs. To see exactly which verbs a resource supports, use kubectl api-resources with -o wide and look at the VERBS column:
kubectl api-resources -o wide | grep -E "^NAME|^pods "

NAME  SHORTNAMES  APIVERSION  NAMESPACED  KIND  VERBS
pods  po          v1          true        Pod   create,delete,deletecollection,get,list,patch,update,watch

kubectl explain doesn't list verbs, but it confirms the resource's kind and version and documents its fields:
kubectl explain pod --api-version=v1 | head -10

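The standard verbs Kubernetes supports in RBAC rules are:

| Verb | What it allows |
| --- | --- |
| get | Read a single named resource: kubectl get pod my-pod |
| list | Read all resources of a type: kubectl get pods |
| watch | Stream changes to resources: used by controllers and informers |
| create | Create a new resource |
| update | Replace an existing resource (kubectl replace) |
| patch | Partially modify a resource (kubectl patch, or kubectl apply on an existing object) |
| delete | Delete a single resource |
| deletecollection | Delete all resources of a type in a namespace |

Operations such as exec, port-forward, proxy, and logs are not verbs. They are subresources of pods, and you grant access by listing the subresource under resources together with a standard verb:

| Subresource | Typical verb | What it allows |
| --- | --- | --- |
| pods/exec | create | Run a command inside a pod (kubectl exec) |
| pods/portforward | create | Forward a port from a pod (kubectl port-forward) |
| pods/proxy | get | Proxy HTTP requests to a pod |
| pods/log | get | Read pod logs (kubectl logs) |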


Important: get and list are separate verbs. Granting list on secrets lets a subject enumerate every secret name and value in a namespace, even if you didn't also grant get. Always think about both when working with sensitive resources like secrets, serviceaccounts, and configmaps.
Look up a resource's group with kubectl explain
If you already know the resource name but aren't sure of its group, kubectl explain tells you:
kubectl explain deployment

GROUP:      apps
KIND:       Deployment
VERSION:    v1
...

kubectl explain ingress

GROUP:      networking.k8s.io
KIND:       Ingress
VERSION:    v1
...

This is the fastest way to look up the apiGroups value for any resource when writing a Role.
A complete lookup workflow
Here is the practical workflow when writing a new Role from scratch:
# 1. Find the resource name and API group
kubectl api-resources | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment

# 2. Find the verbs it supports
kubectl api-resources -o wide | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment   create,delete,...,get,list,patch,update,watch

# 3. Write the Role using the group (strip the version) and the verbs you need

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: staging
rules:
  - apiGroups: ["apps"]       # from: apps/v1 → strip /v1
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]

With this workflow, you never have to guess an API group or verb. You look it up, then write the minimal rule you need.
Roles and ClusterRoles
A Role defines which verbs are allowed on which resources. Here is a Role that grants read-only access to Pods and ConfigMaps inside the staging namespace:
# role-ci-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]          # "" = the core API group (Pods, Services, Secrets, ConfigMaps)
    resources: ["pods", "configmaps"]
    verbs: ["get", "list", "watch"]

The apiGroups field tells Kubernetes which API group owns the resource. The core group uses an empty string "". Apps-level resources like Deployments use "apps". Resources in other named groups use the full group name, such as "networking.k8s.io" for Ingress and NetworkPolicy, and custom resources use the group defined in their CRD.
A ClusterRole is structurally identical but omits the namespace and can reference cluster-scoped resources like Nodes and PersistentVolumes:
# clusterrole-node-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader    # no namespace field
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]

When to use which:
Use a Role when the permission is specific to one namespace. A compromised service account can only affect that namespace: the blast radius is contained. Use a ClusterRole when you need access to cluster-scoped resources, or when you want a reusable permission template that multiple namespaces can share.
A common mistake is reaching for a ClusterRole "just to be safe" because it's easier to configure. Namespace-scoped Roles are almost always the right default.
RoleBindings and ClusterRoleBindings
A Role by itself does nothing. You need a binding to attach it to a subject. Here is a RoleBinding that grants the ci-reader Role to the ci-pipeline service account:
# rolebinding-ci.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline       # the service account name
    namespace: staging      # the namespace the SA lives in
roleRef:
  kind: Role
  name: ci-reader           # must match the Role name exactly
  apiGroup: rbac.authorization.k8s.io

There is a useful pattern worth knowing: you can bind a ClusterRole using a RoleBinding. This creates namespace-scoped access using a reusable permission template. The ClusterRole defines the rules, while the RoleBinding constrains those rules to a single namespace.
# RoleBinding referencing a ClusterRole — scoped to one namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: ClusterRole          # ClusterRole, but bound to one namespace via RoleBinding
  name: view                 # Kubernetes built-in ClusterRole: read-only access to most resources
  apiGroup: rbac.authorization.k8s.io

Kubernetes ships with several useful built-in ClusterRoles: view (read-only access to most resources), edit (read/write to most resources), admin (full namespace admin), and cluster-admin (full cluster admin). Use them rather than reinventing them.
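The scoping behaviour of a ClusterRole bound through a RoleBinding is easy to get wrong, so here is a small Python model of it. It is a sketch for reasoning only; binding_applies is a hypothetical function, and it ignores everything except where a binding takes effect.

```python
from typing import Optional


def binding_applies(binding_kind: str,
                    binding_namespace: Optional[str],
                    request_namespace: Optional[str]) -> bool:
    """Model where a binding's rules take effect (illustrative only).

    request_namespace is None for cluster-scoped resources such as nodes.
    """
    if binding_kind == "ClusterRoleBinding":
        return True  # applies in every namespace and to cluster-scoped resources
    # A RoleBinding only applies inside its own namespace, even when its
    # roleRef points at a ClusterRole such as the built-in "view".
    return request_namespace is not None and binding_namespace == request_namespace


# RoleBinding in staging that references the ClusterRole "view"
print(binding_applies("RoleBinding", "staging", "staging"))      # same namespace
print(binding_applies("RoleBinding", "staging", "production"))   # other namespace
print(binding_applies("RoleBinding", "staging", None))           # cluster-scoped resource
print(binding_applies("ClusterRoleBinding", None, "production"))
```

The ClusterRole supplies the rules, and the kind of binding decides how far those rules reach.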
How to Use Service Accounts Safely
Every pod in Kubernetes runs as a service account. If you don't specify one, Kubernetes uses the default service account in that namespace.
The default service account starts with no permissions – but it still has a token automatically mounted into every pod at /var/run/secrets/kubernetes.io/serviceaccount/token. This means every container in your cluster can authenticate to the API server by default, even if it has nothing useful to do there.
The single most impactful change you can make is to disable this automatic token mounting on service accounts that don't need API access:
# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
automountServiceAccountToken: false   # no token mounted into pods by default

You can also control it at the pod level:
spec:
  automountServiceAccountToken: false   # override at pod level
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-app:1.0
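When both the ServiceAccount and the pod set automountServiceAccountToken, the pod-level value takes precedence. The Python sketch below models that precedence; token_is_mounted is a hypothetical helper, and None stands for a field that is not set in the manifest.

```python
from typing import Optional


def token_is_mounted(pod_setting: Optional[bool], sa_setting: Optional[bool]) -> bool:
    """Model whether a service account token ends up mounted in a pod."""
    if pod_setting is not None:
        return pod_setting   # the pod spec wins when it sets the field
    if sa_setting is not None:
        return sa_setting    # otherwise the ServiceAccount's value applies
    return True              # neither is set: the token is mounted by default


print(token_is_mounted(None, None))    # nothing configured
print(token_is_mounted(None, False))   # disabled on the ServiceAccount
print(token_is_mounted(True, False))   # a pod opts back in
print(token_is_mounted(False, None))   # disabled on the pod only
```

Setting the field to false on the ServiceAccount is the safer default because each pod that needs a token then has to opt in explicitly.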

The cluster-admin anti-pattern:
Never bind cluster-admin to a service account that runs in a pod. cluster-admin grants full read/write access to every resource in the cluster. An attacker who compromises a pod running as cluster-admin owns your cluster completely.
You will see this in Helm charts and tutorials because it "makes things work". It works because it disables the entire authorisation layer. That is not a solution – it's a ticking clock.
The Capital One breach is a direct example of this pattern at the cloud layer: an EC2 instance role had permissions far beyond what the application needed. The SSRF vulnerability was the initial foothold. The over-privileged role was what turned a minor bug into an $80 million fine.
How to Audit Your RBAC Configuration
The kubectl auth can-i command lets you check permissions for any subject. Use --as to impersonate a service account:
SA="system:serviceaccount:staging:ci-pipeline"

# These should return 'yes'
kubectl auth can-i list pods        --namespace staging --as $SA
kubectl auth can-i get  configmaps  --namespace staging --as $SA

# These should return 'no'
kubectl auth can-i delete pods      --namespace staging --as $SA
kubectl auth can-i get  secrets     --namespace staging --as $SA
kubectl auth can-i list pods        --namespace production --as $SA

To list every permission a subject has in a namespace:
kubectl auth can-i --list \
  --namespace staging \
  --as system:serviceaccount:staging:ci-pipeline
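The --as value follows a fixed format: system:serviceaccount:, then the namespace, a colon, and the account name. If you audit many accounts, it can help to generate the commands and leave the string-building to code. The following is a minimal Python sketch that only builds the command text and never runs kubectl; the function names are hypothetical.

```python
import shlex


def service_account_user(namespace: str, name: str) -> str:
    """Username the API server assigns to a service account."""
    return f"system:serviceaccount:{namespace}:{name}"


def service_account_groups(namespace: str) -> list:
    """Groups that every service account in the namespace belongs to."""
    return ["system:serviceaccounts", f"system:serviceaccounts:{namespace}"]


def can_i(verb: str, resource: str, namespace: str,
          sa_namespace: str, sa_name: str) -> str:
    """Build (but do not run) a kubectl auth can-i command for a service account."""
    args = ["kubectl", "auth", "can-i", verb, resource,
            "--namespace", namespace,
            "--as", service_account_user(sa_namespace, sa_name)]
    return shlex.join(args)


# Check the staging CI account against two namespaces.
for target_namespace in ("staging", "production"):
    print(can_i("list", "pods", target_namespace, "staging", "ci-pipeline"))
```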

For a visual matrix, install rakkess, which krew distributes as the access-matrix plugin:
kubectl krew install access-matrix

# Which subjects can act on pods in staging, and with which verbs
kubectl access-matrix resource pods --namespace staging \
  --verbs get,list,watch,create,update,patch,delete

Example output:
NAME          GET  LIST  WATCH  CREATE  UPDATE  PATCH  DELETE
ci-pipeline    ✓    ✓     ✓      ✗       ✗       ✗      ✗
default        ✗    ✗     ✗      ✗       ✗       ✗      ✗
monitoring     ✓    ✓     ✓      ✗       ✗       ✗      ✗

If you see ✓ in the CREATE, UPDATE, PATCH, or DELETE columns for a service account that should only read, that's a finding that needs remediation.
⚠️ The wildcard danger: The most dangerous RBAC configuration is a wildcard on all three dimensions:
apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]

This is functionally identical to cluster-admin. You will find it in Helm charts for controllers installed with "convenience" permissions. Always audit third-party RBAC before installing operators into a production cluster.
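One way to audit for this is to scan the JSON that kubectl get clusterroles -o json produces. The Python sketch below runs against an inline sample instead of a live cluster; the role names in the sample are made up, and wildcard_rules is a hypothetical helper.

```python
import json


def wildcard_rules(role_list: dict):
    """Yield (role name, rule) for rules that wildcard groups, resources, and verbs."""
    for role in role_list.get("items", []):
        # rules can be null (for example on aggregated roles), so fall back to [].
        for rule in role.get("rules") or []:
            if all("*" in (rule.get(key) or [])
                   for key in ("apiGroups", "resources", "verbs")):
                yield role["metadata"]["name"], rule


# Inline sample shaped like `kubectl get clusterroles -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node-reader"},
   "rules": [{"apiGroups": [""], "resources": ["nodes"],
              "verbs": ["get", "list", "watch"]}]},
  {"metadata": {"name": "example-operator"},
   "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]},
  {"metadata": {"name": "example-aggregated"}, "rules": null}
]}
""")

for name, rule in wildcard_rules(sample):
    print(f"wildcard rule in ClusterRole {name}: {rule}")
```

To run it against a real cluster, you could save the kubectl output to a file and load that file in place of the inline sample.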
Demo 2 – Build a Least-Privilege RBAC Policy for a CI Pipeline
In this demo, you'll create a service account for a CI pipeline that can list pods and read configmaps in the staging namespace – and nothing else.
Step 1: Create the namespace and service account
kubectl create namespace staging

# ci-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-pipeline
  namespace: staging
automountServiceAccountToken: false

kubectl apply -f ci-serviceaccount.yaml

Step 2: Create the Role
# ci-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]

kubectl apply -f ci-role.yaml

Step 3: Bind the Role to the service account
# ci-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: Role
  name: ci-reader
  apiGroup: rbac.authorization.k8s.io

kubectl apply -f ci-rolebinding.yaml

Step 4: Test allowed operations
SA="system:serviceaccount:staging:ci-pipeline"

kubectl auth can-i list pods       --namespace staging     --as $SA   # yes
kubectl auth can-i get  pods       --namespace staging     --as $SA   # yes
kubectl auth can-i list configmaps --namespace staging     --as $SA   # yes

Step 5: Test denied operations
kubectl auth can-i delete pods       --namespace staging     --as $SA   # no
kubectl auth can-i get  secrets      --namespace staging     --as $SA   # no
kubectl auth can-i list pods         --namespace production  --as $SA   # no
kubectl auth can-i create deployments --namespace staging    --as $SA   # no

All four should return no. Notice the third test: the ci-reader Role and its binding exist only in the staging namespace, so the service account has no access in production. A RoleBinding cannot grant access outside its own namespace. This is by design.
Writing a least-privilege policy for a service account you control is the easy part. The harder part is auditing what already exists in a cluster. That's what Demo 3 covers.
Demo 3 – Audit RBAC with rakkess and rbac-lookup
Now you'll scan the full cluster to surface any accounts with more permissions than they need.
Step 1: Install the tools
kubectl krew install access-matrix
kubectl krew install rbac-lookup

Step 2: Run rakkess across the cluster
# Your current user's access to every resource type in kube-system
kubectl access-matrix --namespace kube-system

# The same matrix at cluster scope
kubectl access-matrix

Step 3: Find all cluster-admin bindings
There are two ways subjects get cluster-admin access: via a ClusterRoleBinding (cluster-wide), or via a RoleBinding that references the cluster-admin ClusterRole (namespace-scoped, still dangerous). Check both:
# Search for a subject named cluster-admin
kubectl rbac-lookup cluster-admin --kind ClusterRole --output wide

On a fresh kind cluster this returns:
No RBAC Bindings found

That result is expected, but it doesn't mean nothing is bound to cluster-admin. rbac-lookup matches on subject names, and --kind filters by subject kind (user, group, or service account), not by role. A default kind cluster has no subject called cluster-admin, so the search comes back empty. If a search like this returns entries in your production cluster, each one is worth investigating.
To see who actually holds cluster-level admin access, query the bindings directly:
# Find all ClusterRoleBindings and the subjects they grant
kubectl get clusterrolebindings -o wide

NAME                             ROLE                                         AGE   USERS   GROUPS           SERVICEACCOUNTS
cluster-admin                    ClusterRole/cluster-admin                    10d           system:masters
system:kube-controller-manager   ClusterRole/system:kube-controller-manager   10d
system:kube-scheduler            ClusterRole/system:kube-scheduler            10d
system:node                      ClusterRole/system:node                      10d
...

The cluster-admin ClusterRoleBinding grants access to the system:masters group – the group your kubeconfig certificate belongs to. This is expected. Every other binding in this list is worth reviewing to understand what it grants and why.
What to look for: Any binding where the SERVICEACCOUNTS column is populated with an application service account (not a system: prefixed one) is a potential over-privilege finding. Application pods should never need cluster-admin.
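The check described above can be sketched in Python. This is a minimal sketch, not part of any tool: the function name, the list of "system" namespaces, and the sample data (including the ci-admin-binding entry) are assumptions made for illustration. The input is shaped like the output of kubectl get clusterrolebindings -o json.

```python
def find_app_service_account_bindings(crb_list):
    """Return (binding, role, namespace/name) tuples for ClusterRoleBindings
    whose subjects include a ServiceAccount outside the system namespaces."""
    # Assumption: these namespaces hold platform components, not applications.
    system_namespaces = {"kube-system", "kube-public", "kube-node-lease"}
    findings = []
    for item in crb_list.get("items", []):
        role = item["roleRef"]["name"]
        # "subjects" can be missing or null on a binding with no subjects.
        for subject in item.get("subjects") or []:
            if subject.get("kind") != "ServiceAccount":
                continue
            if subject.get("namespace") in system_namespaces:
                continue
            account = f'{subject.get("namespace")}/{subject["name"]}'
            findings.append((item["metadata"]["name"], role, account))
    return findings


# Hypothetical sample shaped like `kubectl get clusterrolebindings -o json`.
sample = {
    "items": [
        {
            "metadata": {"name": "cluster-admin"},
            "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
            "subjects": [{"kind": "Group", "name": "system:masters"}],
        },
        {
            "metadata": {"name": "ci-admin-binding"},  # invented finding
            "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
            "subjects": [
                {"kind": "ServiceAccount", "name": "ci-pipeline",
                 "namespace": "staging"}
            ],
        },
    ]
}

for binding, role, account in find_app_service_account_bindings(sample):
    print(f"{binding}: ServiceAccount {account} -> ClusterRole/{role}")
```

To run it against a real cluster, you could load the JSON from kubectl with json.load and pass the resulting dictionary to the function. Treat every result as a prompt to investigate, not as proof of a problem.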
Step 4: Verify the ci-pipeline service account
kubectl rbac-lookup ci-pipeline --kind ServiceAccount --output wide

Expected output:
SUBJECT                               SCOPE     ROLE             SOURCE
ServiceAccount/staging:ci-pipeline    staging   Role/ci-reader   RoleBinding/ci-reader-binding

Each entry in the ROLE and SOURCE columns has the form Kind/name. This tells you:

The service account is bound to the ci-reader Role

The binding is a RoleBinding named ci-reader-binding

There is no namespace prefix on the role name because it is a namespaced Role, not a ClusterRole


If the output showed ClusterRole/something here, that would be a finding. It would mean the service account has cluster-wide permissions, not namespace-scoped ones.
rbac-lookup vs kubectl get: rbac-lookup gives you a subject-centric view: "what does this account have access to?" kubectl get rolebindings,clusterrolebindings -A gives you a binding-centric view: "what bindings exist in the cluster?" Use both. rbac-lookup is faster for auditing a specific service account, while the kubectl get approach is better for a full cluster inventory.
With RBAC locked down, the API server is protected. But RBAC says nothing about what a container can do once it's running. That's a separate layer entirely.
How to Harden Pod Runtime Security
RBAC controls who can talk to the Kubernetes API. Pod security controls what containers can do once they're running on a node. These are different threat vectors: RBAC protects the control plane, while pod security protects the data plane.
A container that runs as root with no capability restrictions can, if compromised, write backdoors to the host filesystem, load kernel modules, read the memory of other processes if hostPID: true is set, and in some configurations escape the container entirely. Pod security closes these doors before an attacker can open them.
A Case Study: The Hildegard Malware Campaign
In early 2021, Palo Alto's Unit 42 research team documented a cryptomining malware campaign called Hildegard that specifically targeted Kubernetes clusters. The attack chain was:

1. Find a cluster with the kubelet API exposed without authentication

2. Deploy a privileged pod with hostPID: true

3. Use the privileged pod to read credentials from other containers' memory

4. Establish persistence by writing to the host filesystem


Steps 3 and 4 would have been impossible if the pods in the cluster had been running with readOnlyRootFilesystem: true, dropped capabilities, and no hostPID. The attacker had the initial foothold. Pod security would have contained the blast radius.
Pod Security Admission
Pod Security Admission (PSA) is the built-in admission controller that enforces pod security standards at the namespace level. It replaced PodSecurityPolicy in Kubernetes 1.25.
Migrating from PSP? If you're on Kubernetes < 1.25, you may still be using PodSecurityPolicy, which was removed in 1.25. The migration path is: enable PSA in audit mode first to identify violations, fix them workload by workload, then switch to enforce. For policies PSA cannot express, add Kyverno alongside it.
PSA defines three profiles:



| Profile | Who it's for | What it restricts |
| --- | --- | --- |
| privileged | System components (CNI plugins, monitoring agents) | Nothing: no restrictions |
| baseline | Most workloads | Blocks known privilege escalations: no hostNetwork, no hostPID, no privileged containers |
| restricted | Security-sensitive workloads | Everything in baseline, plus: must run as non-root, must drop capabilities, must set a seccomp profile |


And three enforcement modes:



| Mode | Effect | When to use |
| --- | --- | --- |
| enforce | Rejects pods that violate the profile at admission | Production: once you've fixed violations |
| audit | Allows pods but records violations in the audit log | Migration: see what would break without breaking anything |
| warn | Allows pods but sends a warning to the client | Development: fast feedback in your terminal |


The migration path: start with audit and warn to identify violations, fix them, then switch to enforce. The audit and warn modes can run simultaneously on the same namespace.
Apply them as namespace labels:
# namespace-staging.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    # Start here: audit and warn simultaneously
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
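How the three modes combine can be sketched as a small Python function. This is a simplified model for illustration, not the admission controller's code: the function name is hypothetical, and it assumes every mode on the namespace uses the same profile.

```python
def psa_outcome(namespace_labels, pod_violates_profile):
    """Model what happens to a pod under the PSA modes set on a namespace.

    Simplification: assumes all modes reference the same profile, so a single
    boolean describes whether the pod violates it.
    """
    prefix = "pod-security.kubernetes.io/"
    modes = {m for m in ("enforce", "audit", "warn")
             if prefix + m in namespace_labels}
    violated = bool(pod_violates_profile)
    return {
        # Only enforce can reject a pod.
        "admitted": not (violated and "enforce" in modes),
        # warn sends a message back to the client (for example kubectl).
        "client_warning": violated and "warn" in modes,
        # audit records the violation in the API server audit log.
        "audit_annotation": violated and "audit" in modes,
    }


# The labels from namespace-staging.yaml above: audit and warn only.
staging_labels = {
    "pod-security.kubernetes.io/audit": "restricted",
    "pod-security.kubernetes.io/audit-version": "latest",
    "pod-security.kubernetes.io/warn": "restricted",
    "pod-security.kubernetes.io/warn-version": "latest",
}

print(psa_outcome(staging_labels, pod_violates_profile=True))
```

With only audit and warn set, a violating pod is still admitted, but both a client warning and an audit record are produced. Adding the enforce label is what flips admitted to False.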

Once violations are resolved, add enforce:
kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest

Note: the command deliberately omits --overwrite. Without it, if enforce is already set to a different value the command will error, which is exactly what you want. You should see:
namespace/staging labeled

If you see namespace/staging not labeled, it means enforce=restricted and enforce-version=latest were already set to those exact values. Confirm enforcement is active:
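The three possible results of labeling without --overwrite (labeled, not labeled, or an error) can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not kubectl's implementation; the function name and the exact error wording are assumptions.

```python
def label_namespace(existing, requested, overwrite=False):
    """Model how `kubectl label` treats keys that already exist.

    Returns (new_labels, message). Raises ValueError in the case where
    kubectl would refuse to change a label because --overwrite is not set.
    """
    for key, value in requested.items():
        if key in existing and existing[key] != value and not overwrite:
            raise ValueError(
                f"'{key}' already has a value ({existing[key]}), "
                "and --overwrite is false"
            )
    merged = {**existing, **requested}
    # Nothing changed means every requested label was already set as asked.
    message = "labeled" if merged != existing else "not labeled"
    return merged, message


enforce = {"pod-security.kubernetes.io/enforce": "restricted"}

print(label_namespace({}, enforce)[1])       # first run on a clean namespace
print(label_namespace(enforce, enforce)[1])  # same value already present
```

The first call prints labeled and the second prints not labeled. If the namespace already had enforce set to baseline, the call would raise instead of silently replacing it, which is the safety net you lose by passing --overwrite.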
kubectl get namespace staging --show-labels

Look for pod-security.kubernetes.io/enforce=restricted in the output. If it's there, enforcement is active.
How to Configure securityContext
A securityContext defines the privilege and access control settings for a pod or container. These are the seven fields you should configure on every production workload:



| Field | Set at | What it controls |
| --- | --- | --- |
| runAsNonRoot | Pod | Rejects containers that run as UID 0 (root) |
| runAsUser / runAsGroup | Pod | Sets a specific UID/GID: don't rely on the image default |
| fsGroup | Pod | All mounted volumes are owned by this GID |
| seccompProfile | Pod | Filters syscalls using a seccomp profile |
| allowPrivilegeEscalation | Container | Blocks setuid binaries and sudo |
| readOnlyRootFilesystem | Container | Makes the container filesystem read-only |
| capabilities.drop | Container | Removes Linux capabilities (drop ALL, add back only what is needed) |


The annotated YAML below shows all seven in context:
# secure-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true         # container must run as a non-root user
        runAsUser: 10001           # explicit UID — don't rely on the image's default
        runAsGroup: 10001          # explicit GID
        fsGroup: 10001             # volumes are owned by this group
        seccompProfile:
          type: RuntimeDefault     # use the container runtime's default seccomp profile
      automountServiceAccountToken: false
      containers:
        - name: app
          image: nginx:1.25-alpine
          securityContext:
            allowPrivilegeEscalation: false   # block setuid and sudo inside the container
            readOnlyRootFilesystem: true      # the single highest-impact setting
            capabilities:
              drop:
                - ALL                         # drop every Linux capability
              add: []                         # add back only what is explicitly needed
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
      volumes:
        # nginx needs writable directories — provide them as emptyDir volumes
        - name: tmp
          emptyDir: {}
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}

Why readOnlyRootFilesystem: true is the most important setting:
Most post-exploitation techniques require writing to the filesystem. Dropping a backdoor, modifying a binary, writing a cron job, or installing a keylogger all require a writable filesystem. Set readOnlyRootFilesystem: true and every one of these techniques is blocked.
The downside is that many applications write to directories like /tmp or /var/cache. The fix is to mount emptyDir volumes at those specific paths, as shown above. The rest of the filesystem stays read-only.
What each field prevents:



| Field | What it prevents |
| --- | --- |
| runAsNonRoot: true | Blocks containers that were built to run as root: the kubelet refuses to start them |
| runAsUser: 10001 | Ensures a known, non-privileged UID even if the image doesn't set one |
| allowPrivilegeEscalation: false | Blocks setuid binaries and sudo, the most common privilege escalation path |
| readOnlyRootFilesystem: true | Prevents writing backdoors, modifying binaries, or creating persistence |
| capabilities: drop: ALL | Removes Linux capabilities like NET_RAW (raw socket access) and SYS_ADMIN (kernel operations) |
| seccompProfile: RuntimeDefault | Filters syscalls through the runtime's default profile, which blocks a few dozen rarely needed, high-risk syscalls (Docker's default profile disables around 44 of 300+) |


OPA/Gatekeeper vs Kyverno
PSA covers the fundamentals. But you'll eventually need policies that PSA cannot express: all images must come from your private registry, all pods must have resource limits, no container may use the latest tag. For these, you need a policy engine.
Two mature options exist:




| | OPA/Gatekeeper | Kyverno |
| --- | --- | --- |
| Policy language | Rego (a custom logic language) | YAML, same format as Kubernetes resources |
| Learning curve | Steep: Rego takes real time to learn | Gentle: if you write YAML, you can write policies |
| Mutation | Yes, via Assign/AssignMetadata | Yes: first-class, well-documented feature |
| Audit mode | Yes: reports existing violations | Yes: policy audit mode |
| Ecosystem | Integrates with OPA in non-K8s contexts | Kubernetes-native only |
| Best for | Complex cross-resource logic and teams already using OPA | Teams who want K8s-native syntax and fast setup |


If you're starting fresh, Kyverno gets you to working policies faster. Here is a Kyverno policy that blocks images from outside your trusted registry:
# kyverno-registry-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.corp.internal/"
        pattern:
          spec:
            containers:
              - image: "registry.corp.internal/*"
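To see which images a wildcard pattern like the one above would accept, here is a rough Python approximation. It is a sketch, not Kyverno's matching engine: it uses Python's fnmatch module to stand in for Kyverno's wildcard handling, and the function name and sample images are made up for illustration.

```python
from fnmatch import fnmatchcase

TRUSTED_PATTERN = "registry.corp.internal/*"


def untrusted_images(pod_spec, pattern=TRUSTED_PATTERN):
    """Return the container images in a pod spec that do not match the pattern.

    Like the policy above, this only looks at spec.containers.
    """
    return [
        container["image"]
        for container in pod_spec.get("containers", [])
        if not fnmatchcase(container["image"], pattern)
    ]


# Hypothetical pod spec with one trusted and two untrusted images.
pod_spec = {
    "containers": [
        {"name": "api", "image": "registry.corp.internal/team/api:1.4.2"},
        {"name": "proxy", "image": "nginx:1.25-alpine"},
        {"name": "sneaky", "image": "registry.corp.internal.evil.example/api"},
    ]
}

for image in untrusted_images(pod_spec):
    print(f"rejected: {image}")
```

The trailing slash in the pattern matters. Without it, a look-alike hostname such as the third image would match the prefix and slip through.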

How to Detect Runtime Threats with Falco
PSA and securityContext are preventive controls: they block known-bad configurations before pods start. Falco is a detective control. It watches what containers do while they're running and alerts when something looks wrong.
Falco operates at the syscall level using eBPF. It attaches to the Linux kernel and intercepts every system call made by every container on the node – file opens, network connections, process spawns, privilege escalations. It does this without modifying containers, without injecting sidecars, and with minimal overhead.
What Falco detects out of the box:
Falco's default ruleset covers the most common attack patterns. It fires when a shell is opened inside a running container, whether that's a kubectl exec session or a reverse shell from an exploit.
It watches for reads on sensitive files like /etc/shadow, /etc/kubernetes/admin.conf, and /root/.ssh/. It catches the dropper pattern: a binary written to disk and immediately executed. It detects outbound connections to known malicious IPs, writes to /proc or /sys that suggest kernel manipulation, and package managers like apt, yum, or pip being run inside containers that have no business installing software.
Each of these is a rule in Falco's default ruleset. You can extend it with custom rules for your specific workloads – which is exactly what you'll do in Demo 5. But first let's harden the Pod.
Demo 4 – Harden a Pod with securityContext
In this demo, you'll start with a default nginx deployment, observe the PSA violations it triggers, harden it step by step, and confirm it passes under the restricted profile.
Step 1: Apply PSA labels in audit mode
kubectl label namespace staging \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted

Step 2: Deploy insecure nginx and observe the warnings
# insecure-nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-insecure
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-insecure
  template:
    metadata:
      labels:
        app: nginx-insecure
    spec:
      containers:
        - name: nginx
          image: nginx:1.25-alpine

kubectl apply -f insecure-nginx.yaml

Expected output (PSA warns but still creates the deployment in warn mode):
Warning: would violate PodSecurity "restricted:latest":
  allowPrivilegeEscalation != false (container "nginx" must set
    securityContext.allowPrivilegeEscalation=false)
  unrestricted capabilities (container "nginx" must set
    securityContext.capabilities.drop=["ALL"])
  runAsNonRoot != true (pod or container "nginx" must set
    securityContext.runAsNonRoot=true)
  seccompProfile not set (pod or container "nginx" must set
    securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/nginx-insecure created

Four violations. Every one of them is a real security gap. But the Deployment was still created, as the last line of the output confirms: deployment.apps/nginx-insecure created.
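The four checks in that warning can be approximated in Python to make clear what PSA is looking for. This is a simplified sketch covering only these four rules, not the real restricted profile (which checks more fields); the function name is hypothetical.

```python
def restricted_violations(pod_spec):
    """Approximate four of the checks in the PSA "restricted" profile."""
    violations = []
    pod_sc = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container["name"]
        sc = container.get("securityContext") or {}

        # Must be explicitly false; unset counts as a violation.
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation != false")

        dropped = (sc.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            violations.append(f"{name}: unrestricted capabilities")

        # A container-level value takes precedence over the pod-level one.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot != true")

        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile not set")
    return violations


# The pod template from insecure-nginx.yaml: no securityContext at all.
insecure = {"containers": [{"name": "nginx", "image": "nginx:1.25-alpine"}]}

for violation in restricted_violations(insecure):
    print(violation)
```

Running it on the insecure template reports the same four problems as the kubectl warning. The hardened template from secure-deployment.yaml sets all four fields, so the same function returns an empty list for it.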
Step 3: Deploy the hardened version
kubectl apply -f secure-deployment.yaml   # the YAML from the securityContext section above

No warnings this time.
Step 4: Switch the namespace to enforce
kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest

Expected output:
namespace/staging labeled

This is the moment enforcement becomes active. Any new pod that violates the restricted profile will be rejected from this point on.
Step 5: Confirm insecure deployments are now rejected
kubectl delete deployment nginx-insecure -n staging
kubectl apply -f insecure-nginx.yaml

Expected output:
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false ...
deployment.apps/nginx-insecure created

The Deployment object is created. PSA enforces at the pod level, not the Deployment level. The Deployment and its ReplicaSet exist, but every attempt to create a pod is rejected. Check the ReplicaSet:
kubectl get replicaset -n staging -l app=nginx-insecure

NAME                       DESIRED   CURRENT   READY   AGE
nginx-insecure-b668d867b   1         0         0       30s

DESIRED=1 but CURRENT=0. The ReplicaSet cannot create any pods because they're rejected at admission. Describe the ReplicaSet to see the rejection events:
kubectl describe replicaset -n staging -l app=nginx-insecure

Warning  FailedCreate  ReplicaSet "nginx-insecure-b668d867b" create Pod
  "nginx-insecure-xxx" failed: pods is forbidden: violates PodSecurity
  "restricted:latest": allowPrivilegeEscalation != false, unrestricted
  capabilities, runAsNonRoot != true, seccompProfile not set

The hardened deployment continues running with its pods intact. The insecure one has zero pods and never will. This is exactly how PSA is supposed to work.
Step 6: Score the hardened pod with kube-score
kube-score is a static analysis tool that scores Kubernetes manifests against security and reliability best practices:
# macOS
brew install kube-score
# Linux: https://github.com/zegl/kube-score/releases

kube-score score secure-deployment.yaml -v

Expected output (abridged):
apps/v1/Deployment secure-app in staging 
  path=secure-deployment.yaml
    [OK] Stable version
    [OK] Label values
    [CRITICAL] Container Resources
        · app -> CPU limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.cpu
        · app -> Memory limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.memory
        · app -> CPU request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.cpu
        · app -> Memory request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.memory
    [CRITICAL] Container Image Pull Policy
        · app -> ImagePullPolicy is not set to Always
            It's recommended to always set the ImagePullPolicy to Always, to make sure that the imagePullSecrets are always correct, and to always get the image you want.
    [OK] Pod Probes Identical
    [CRITICAL] Container Ephemeral Storage Request and Limit
        · app -> Ephemeral Storage limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.ephemeral-storage
        · app -> Ephemeral Storage request is not set
            Resource requests are recommended to make sure the application can start and run without crashing. Set resource.requests.ephemeral-storage
    [OK] Environment Variable Key Duplication
    [OK] Container Security Context Privileged
    [OK] Pod Topology Spread Constraints
        · Pod Topology Spread Constraints
            No Pod Topology Spread Constraints set, kube-scheduler defaults assumed
    [OK] Container Image Tag
    [CRITICAL] Pod NetworkPolicy
        · The pod does not have a matching NetworkPolicy
            Create a NetworkPolicy that targets this pod to control who/what can communicate with this pod. Note, this feature needs to be supported by the CNI implementation used in the Kubernetes cluster to have an effect.
    [OK] Container Security Context User Group ID
    [OK] Container Security Context ReadOnlyRootFilesystem
    [CRITICAL] Deployment has PodDisruptionBudget
        · No matching PodDisruptionBudget was found
            It's recommended to define a PodDisruptionBudget to avoid unexpected downtime during Kubernetes maintenance operations, such as when draining a node.
    [WARNING] Deployment has host PodAntiAffinity
        · Deployment does not have a host podAntiAffinity set
            It's recommended to set a podAntiAffinity that stops multiple pods from a deployment from being scheduled on the same node. This increases availability in case the node becomes unavailable.
    [OK] Deployment Pod Selector labels match template metadata labels

Notice there are no security context violations: securityContext, readOnlyRootFilesystem, seccompProfile, and runAsNonRoot all pass. The remaining findings are about resource management (CPU/memory limits, ephemeral storage), availability (PodDisruptionBudget, anti-affinity), and network policy – not security context hardening. Those are important for production readiness, but they're a separate concern from the pod security hardening we did here.
You now have a pod that PSA accepts and kube-score validates. The next step is to add a detection layer – something that watches what the pod does at runtime, not just how it was configured at admission.
Demo 5 – Deploy Falco and Write a Custom Detection Rule
Now, you'll deploy Falco in eBPF mode, trigger a default alert, then extend Falco with a custom rule that catches curl and wget being run inside containers.
Step 1: Install Falco via Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  --wait

Confirm Falco is running on every node:
kubectl get pods -n falco

NAME           READY   STATUS    RESTARTS   AGE
falco-x8k2p    1/1     Running   0          45s
falco-m9nqr    1/1     Running   0          45s
falco-j4tpw    1/1     Running   0          45s

One pod per node. Falco runs as a DaemonSet because it needs to monitor syscalls on every node independently.
Step 2: Trigger a default alert
Open a second terminal and stream the Falco logs:
# Terminal 2 — watch for alerts
kubectl logs -n falco -l app.kubernetes.io/name=falco -f --max-log-requests 3

In your first terminal, exec into the secure-app pod:
# Terminal 1 — trigger the shell detection
POD=$(kubectl get pod -n staging -l app=secure-app \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -n staging -- sh

Within a second, Terminal 2 shows:
2024-03-15T14:23:41.456Z: Notice A shell was spawned in a container with an attached terminal
  (user=root user_loginuid=-1 k8s.ns=staging k8s.pod=secure-app-7d9f8b-xxx
   container=app shell=sh parent=runc cmdline=sh terminal=34816)
  rule=Terminal shell in container  priority=NOTICE
  tags=[container, shell, mitre_execution]

This is Falco's built-in Terminal shell in container rule firing. It detected the kubectl exec session the moment you ran it.
Step 3: Write a custom rule
The built-in rules are comprehensive, but every production environment has workloads with unique behaviour. Here is a custom rule that alerts when curl or wget is executed inside any container:
# custom-rules.yaml
customRules:
  custom-rules.yaml: |-
    - rule: Suspicious network tool in container
      desc: >
        Detects execution of curl or wget inside a running container.
        These tools are commonly used for data exfiltration, downloading
        attacker payloads, or reaching command-and-control servers.
        Production containers should not be making ad-hoc HTTP requests.
      condition: >
        spawned_process
        and container
        and proc.name in (curl, wget)
      output: >
        Network tool executed in container
        (user=%user.name tool=%proc.name cmd=%proc.cmdline
         pod=%k8s.pod.name ns=%k8s.ns.name image=%container.image)
      priority: WARNING
      tags: [network, exfiltration, custom]
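The condition in that rule is a boolean expression over fields of each syscall event. The Python sketch below mimics how it evaluates. It is an illustration only: the event dictionaries are invented, and the expansions of the spawned_process and container macros are simplified versions of Falco's definitions.

```python
SUSPICIOUS_TOOLS = {"curl", "wget"}


def rule_matches(event):
    """Evaluate: spawned_process and container and proc.name in (curl, wget)."""
    # Simplified spawned_process macro: a completed execve/execveat call.
    spawned_process = (
        event.get("evt.type") in ("execve", "execveat")
        and event.get("evt.dir") == "<"
    )
    # Simplified container macro: the event did not come from the host itself.
    in_container = event.get("container.id", "host") != "host"
    return (
        spawned_process
        and in_container
        and event.get("proc.name") in SUSPICIOUS_TOOLS
    )


# Hypothetical events, keyed by Falco field names.
events = [
    {"evt.type": "execve", "evt.dir": "<", "container.id": "3ad7b26ded6d",
     "proc.name": "curl"},
    {"evt.type": "execve", "evt.dir": "<", "container.id": "3ad7b26ded6d",
     "proc.name": "sh"},
    {"evt.type": "execve", "evt.dir": "<", "container.id": "host",
     "proc.name": "curl"},
]

for event in events:
    print(event["proc.name"], event["container.id"], rule_matches(event))
```

Only the first event matches. A shell in the container is not in the tool list, and curl run directly on the node is excluded by the container check.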

Apply it by upgrading the Helm release:
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  -f custom-rules.yaml

Once the upgrade completes, wait for the Falco pods to become ready again, then test your custom rule.
Step 4: Test the custom rule
# Terminal 1 — run curl inside the container
kubectl exec -it $POD -n staging -- sh -c 'curl https://example.com'

Terminal 2 immediately shows:
2024-03-15T14:31:07.812Z: Warning Network tool executed in container
  (user=root tool=curl cmd=curl https://example.com
   pod=secure-app-7d9f8b-xxx ns=staging image=nginx:1.25-alpine)
  rule=Suspicious network tool in container  priority=WARNING
  tags=[network, exfiltration, custom]

Step 5: Route alerts to Slack with Falcosidekick
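Streaming logs is useful during development. In production, you need alerts routed to your alerting pipeline. Falcosidekick handles this with support for Slack, PagerDuty, Datadog, Elasticsearch, and over 50 other outputs. Falco must also be configured to forward its events to Falcosidekick, for example through its HTTP output, because installing the Falcosidekick chart on its own does not connect the two. A values file for the Slack output looks like this: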
# falcosidekick-values.yaml
config:
  slack:
    webhookurl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    minimumpriority: "warning"
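    # Field names contain dots, so they are looked up with Go's `index` function
    messageformat: >
      [{{ .Priority }}] {{ .Rule }} |
      pod: {{ index .OutputFields "k8s.pod.name" }} |
      ns: {{ index .OutputFields "k8s.ns.name" }} |
      image: {{ index .OutputFields "container.image" }}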

helm install falcosidekick falcosecurity/falcosidekick \
  --namespace falco \
  -f falcosidekick-values.yaml
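The minimumpriority setting decides which alerts reach Slack. The sketch below shows the idea in Python, using Falco's priority order from most to least severe. It is a model for illustration with a hypothetical function name, not Falcosidekick's code.

```python
# Falco priorities, most severe first.
PRIORITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def should_forward(event_priority, minimum_priority):
    """Return True if the event is at least as severe as the minimum."""
    rank = {name: position for position, name in enumerate(PRIORITIES)}
    return rank[event_priority.lower()] <= rank[minimum_priority.lower()]


# The two alerts triggered in this demo, checked against minimumpriority "warning".
print(should_forward("Notice", "warning"))   # Terminal shell in container
print(should_forward("Warning", "warning"))  # Suspicious network tool in container
```

With the values file above, the built-in shell alert (NOTICE) stays out of Slack, while the custom curl/wget rule (WARNING) is forwarded. Lower the minimum to notice if you want both.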

Tuning Falco for production: A fresh Falco deployment will generate false positives, especially in the first week. Your job is to tune rules to match your workloads' normal behaviour, not to respond to every alert.
Here's the workflow: deploy in staging → identify false positives → add exceptions to the rules that produce them → validate the false positive rate is low → enable in production with alerting.
Cleanup
To remove everything created in this article:
# Delete the staging namespace and everything in it
kubectl delete namespace staging
 
# Delete Falco and Falcosidekick
helm uninstall falco -n falco
helm uninstall falcosidekick -n falco
kubectl delete namespace falco
 
# Delete the kind cluster entirely
kind delete cluster --name k8s-security

Conclusion
In this handbook, you secured a Kubernetes cluster across three layers: RBAC, pod runtime security, and runtime threat detection.
You built a least-privilege service account, enforced the restricted Pod Security Admission profile, hardened pods with securityContext, deployed Falco for syscall-level detection, and wrote a custom rule to catch suspicious tools inside containers.
Each layer maps to a real-world breach – Tesla, Capital One, Hildegard – showing how these controls would have contained the damage. Run kube-bench again to measure the improvement.
All YAML manifests, Helm values, and setup scripts from this article are available in the companion GitHub repository.
 


 How to Use Different Container Runtimes: Docker, Podman, and Containerd Explained 
Destiny Erhabor — Tue, 17 Feb 2026 21:39:39 +0000
 If you’re a developer working with containers, chances are Docker is your go-to tool. But did you know that there's a whole ecosystem of container runtimes out there? Some are lighter, some are more secure, and some are specifically built for Kubernetes.
Understanding different container runtimes gives you more options. You can choose the right tool for your specific needs, whether that's better security, lower resource usage, or easier integration with Kubernetes.
In this tutorial, you'll learn about three major container runtimes and how to use them on your system. We’ll dive into practical examples with complete code you can run right now. By the end, you’ll understand when to use each runtime and how to move containers between them.
Table of Contents

What Are Container Runtimes?

How to Understand High-Level vs Low-Level Runtimes

How to Use Docker as Your Baseline

How to Use Podman – The Daemonless Alternative

How to Work with Containerd

How to Move Containers Between Runtimes

Real-World Use Cases

Quick Reference Guide

Conclusion


What Are Container Runtimes?
A container runtime is the software that actually runs your containers. When you type docker run nginx, for example, several things happen behind the scenes. The Docker CLI talks to the Docker daemon, which then uses a container runtime (usually containerd) to actually create and run the container.
Think of it like this: if containers are apps on your phone, the container runtime is the operating system that makes those apps work. Just like you can install the same app on different phones (iPhone vs Android), you can run the same container on different runtimes.
Why Does This Matter?
You might wonder why you should care about what's running your containers. Docker works fine, right? Here are a few reasons:

Security: Some runtimes like Podman can run containers without root privileges. This means if someone breaks out of your container, they don't have full system access.

Resource usage: Different runtimes use different amounts of memory and CPU. On a resource-constrained server or edge device, this matters a lot.

Integration: If you're deploying to Kubernetes, understanding containerd or CRI-O helps you troubleshoot production issues.

Licensing: Docker Desktop has licensing requirements for large companies. Alternatives like Podman are completely free.


Here’s a chart that summarizes these key points:

How to Understand High-Level vs Low-Level Runtimes
Container runtimes are split into two categories, and understanding this distinction helps you see how everything fits together.
Low-Level Runtimes
Low-level runtimes like runc and crun do the actual work of creating containers. They interact directly with the Linux kernel to create isolated environments using features like namespaces and cgroups.
Namespaces isolate what a process can see. For example, a PID (process) namespace means the container can't see other processes running on your system. A network namespace means it has its own network stack.
Cgroups (control groups) limit what a process can use. You can limit a container to 512MB of RAM or 50% of one CPU core. This prevents one container from hogging all your resources.
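To make those numbers concrete, here is a small Python sketch that converts the two limits into the values a cgroup v2 hierarchy stores in its memory.max and cpu.max files. The function name is hypothetical, and it assumes "512MB" means 512 MiB and the usual 100,000 microsecond CPU period.

```python
def cgroup_v2_limits(memory_mib, cpu_fraction, period_us=100_000):
    """Translate human-friendly limits into cgroup v2 file contents."""
    # memory.max is expressed in bytes.
    memory_max = memory_mib * 1024 * 1024
    # cpu.max is "<quota> <period>": microseconds of CPU time allowed
    # within each period.
    quota_us = int(cpu_fraction * period_us)
    return {
        "memory.max": str(memory_max),
        "cpu.max": f"{quota_us} {period_us}",
    }


print(cgroup_v2_limits(512, 0.5))
# {'memory.max': '536870912', 'cpu.max': '50000 100000'}
```

In other words, 50% of one core means the container's processes may use 50,000 microseconds of CPU time in every 100,000 microsecond window. You rarely write these files yourself, because the runtime does it for you when you pass resource limits.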
These low-level runtimes implement the OCI (Open Container Initiative) Runtime Specification. This is a standard that defines exactly how to run a container. Because of this standard, you can swap out runtimes and your containers still work.
High-Level Runtimes
High-level runtimes like Docker, Podman, and containerd manage images, networking, and volumes, and provide user-friendly interfaces. They handle pulling images from registries, setting up networks between containers, and managing container lifecycles.
These high-level runtimes use low-level runtimes under the hood. When you run docker run, Docker ultimately calls runc to create the container. This layering means you get a nice interface while still benefiting from the standard, battle-tested low-level runtime.
Why This Layering Matters:
This separation of concerns is powerful. High-level runtimes can focus on user experience and features while low-level runtimes focus on reliably creating containers. You can swap low-level runtimes without changing your workflow. Some people use crun instead of runc because it's written in C and starts faster.
How to Use Docker as Your Baseline
Let's start with Docker since you're probably already familiar with it. This will give us a baseline to compare other runtimes against. We'll build a simple web application and then run the same application in different runtimes to see how they compare.
How to Install Docker
You can find installation guides for your operating system:

Docker Desktop for Mac

Docker Desktop for Windows

Docker Engine for Linux


How to Run a Test Container
Let's verify that Docker works by running a simple container:
docker run hello-world

You should see a message that says:
Hello from Docker!
This message shows that your installation appears to be working correctly.

What Just Happened?
When you ran that command, Docker checked if the hello-world image exists locally. It didn't find it, so it pulled the image from Docker Hub (a public registry). Then it created a container from that image, started the container, and the container printed its message and exited.
All of this happened in a few seconds. Now let's build something more useful.
How to Create a Web Server
Create a new directory for your project:
mkdir ~/container-demo
cd ~/container-demo

The ~ symbol means your home directory. On macOS, this is /Users/yourname. On Linux, it's /home/yourname.
Create a simple HTML file:
cat > index.html << 'EOF'
<!DOCTYPE html>
<html>
<head><title>Container Demo</title></head>
<body>
  <h1>Hello from Docker!</h1>
  <p>This is running in a container.</p>
</body>
</html>
EOF

This creates a basic HTML file. The cat > command writes to a file, and << 'EOF' means "read until you see EOF" (End Of File). This is a handy way to create files from the command line.
How to Create a Dockerfile
You can create a Dockerfile like this:
cat > Dockerfile << 'EOF'
FROM nginx:alpine
COPY index.html /usr/share/nginx/html/
EOF

Understanding the Dockerfile:
The Dockerfile has two instructions:

FROM nginx:alpine: This starts with the official Nginx image. The :alpine tag means we're using the Alpine Linux version, which is much smaller (about 20MB instead of 130MB). Alpine is a minimal Linux distribution popular in containers because of its small size.

COPY index.html /usr/share/nginx/html/: This copies your HTML file into the location where Nginx serves files. Inside the container, Nginx is configured to serve files from /usr/share/nginx/html/.


How to Build a Docker Image
docker build -t my-web-app .

The -t flag means "tag" – we're naming the image my-web-app. The . at the end means "use the current directory as the build context". Docker will look for a Dockerfile in the current directory and send all files here to the Docker daemon for building.
You'll see output like:
[+] Building 2.3s (7/7) FINISHED
=> [internal] load build definition from Dockerfile
=> => transferring dockerfile: 98B
=> [internal] load .dockerignore
...
=> => naming to docker.io/library/my-web-app

This shows Docker building your image step by step. Instructions that change the filesystem, such as COPY, RUN, and ADD, each create a new layer. These layers are cached, so a rebuild without changes finishes almost instantly.
How to Run a Docker Container
docker run -d -p 8080:80 my-web-app

Understanding the Flags:

-d means "detached mode" – run in the background. Without this, the container runs in the foreground and you'll see Nginx's log output. With -d, it returns immediately and runs in the background.

-p 8080:80 maps port 8080 on your host machine to port 80 inside the container. Nginx listens on port 80 inside the container. To access it from your browser, you need to map it to a port on your machine. We chose 8080, but you could use any available port.


Open your browser and visit http://localhost:8080. You should see your HTML page!

How to Check Running Containers:
docker ps

This shows all running containers. You'll see something like:
CONTAINER ID   IMAGE        COMMAND                  PORTS                  NAMES
a1b2c3d4e5f6   my-web-app   "/docker-entrypoint.…"   0.0.0.0:8080->80/tcp   peaceful_curie

Docker automatically generated a random name (peaceful_curie in this example). You can specify a name with --name if you prefer.
How to View Container Logs:
docker logs <container-id>

Replace <container-id> with the ID from docker ps (just the first few characters work). This shows what's happening inside the container. For Nginx, you'll see access logs showing requests to your web server.
How to Stop the Container:
docker stop <container-id>

This gracefully stops the container. Nginx receives a signal to shut down cleanly.
Now that you understand how to use Docker, let’s check out how Podman works next.
How to Use Podman – The Daemonless Alternative
Now let's try Podman. It's designed to be a drop-in replacement for Docker, but with some key differences that make it interesting for specific use cases.
Why Podman Exists
Docker runs as a daemon (a background service) that requires root privileges. This daemon always runs, listening for commands. This architecture has some downsides:

Security: The Docker daemon runs as root. If someone compromises the daemon, they have root access to your entire system.

Resource Usage: The daemon consumes resources even when you're not running containers.

Single Point of Failure: If the daemon crashes, all your containers stop.


Podman solves these problems by not using a daemon at all. Each podman command runs independently. This is called a "daemonless" architecture.
Key Podman Features
To summarize, here are some key helpful features of Podman that might make it a good fit for your projects:

No daemon required: Each command runs independently. No background service needed.

Rootless by default: Containers run as your regular user, not as root. This dramatically improves security.

Drop-in Docker replacement: Most Docker commands work exactly the same. You can even alias docker=podman and many applications won't notice the difference.

Pod support: Podman has a concept of "pods" like Kubernetes. This is unique among container tools.


Now that you understand the benefits of Podman, let’s see how you can use it.
How to Install Podman
Podman installation varies by operating system. Here are the official guides:

Podman for macOS

Podman for Windows

Podman for Linux


For macOS users (what we'll use in this tutorial), you can install Podman using Homebrew:
brew install podman

How to Initialize and Start Podman Machine
On macOS, Podman needs a Linux VM to run containers (since containers use Linux kernel features). Podman Machine handles this for you:
podman machine init

This creates a small Linux VM. You’ll only need to do this once. The VM is about 1GB and uses minimal resources when running.

Start the machine:
podman machine start

Verify it's working:
podman --version

You should see something like:
podman version 4.5.0

How to Run Containers with Podman
Here's where it gets interesting. You can use nearly identical commands to Docker. Let's build and run the same web server you created earlier:
# Build the image (same command as Docker)
podman build -t my-web-app .

# Run the container
podman run -d -p 8081:80 my-web-app

# See running container
podman ps

Notice that we used port 8081 this time so it doesn't conflict with the Docker container if it's still running. Visit http://localhost:8081 and you'll see the same page, but this time it's running in Podman!

If you run into an issue when running the podman build command, you can delete the Docker image using docker image rm my-web-app:latest.
What's Different Under the Hood?
Even though the commands look the same, what's happening is different. First, no daemon was involved: the podman command directly created and started the container. Second, the container is running as your user, not as root.
You can verify this by checking which user owns the process (replace <container-id> with the ID from podman ps):
podman top <container-id> user

You'll see your username, not root.
Podman Pods – A Unique Feature
Podman has a unique feature that Docker doesn't have: pods. A pod is a group of containers that share networking and storage. This is the same concept Kubernetes uses, which makes Podman excellent for local Kubernetes development.
Why Pods Matter:
In real applications, you often have multiple containers that need to work together. For example, a web application typically needs a database to store data, a cache layer for frequently accessed data, and a logging container that records requests, responses, and other non-sensitive application metadata.
These four containers (web, database, cache, logger) need to communicate with each other. In Docker, you'd create a custom network and connect each container to it. In Podman, you can create a pod that automatically handles this networking.
How to Create a Podman Pod
podman pod create --name my-app-pod -p 8082:80

This creates a pod named my-app-pod and exposes port 8082 on your host to port 80 inside the pod. Notice that you don't expose ports on individual containers – you expose them on the pod.
Add a web server to the pod:
podman run -d --pod my-app-pod --name web nginx:alpine

The --pod flag tells Podman to run this container inside the pod. The container doesn't need its own port mapping because the pod handles that.
Add Redis (an in-memory database) to the pod:
podman run -d --pod my-app-pod --name cache redis:alpine

Now you have two containers running in the same pod. Here's the powerful part: they share the same network namespace.
To check your pod:
# List all pods
podman pod ps -a

# Show details for one pod
podman pod inspect <pod-name>

# Check processes running in the pod
podman pod top <pod-name>

# See logs from a container in that pod
podman logs <container-name>


Understanding Shared Networking:
Both containers can reach each other using localhost. The web container can connect to Redis using localhost:6379 (Redis's default port). It's as if they're running on the same machine.
This is exactly how Kubernetes pods work. If you learn Podman pods, you're learning Kubernetes networking too.
How to Generate Kubernetes YAML from Pods
Here's where Podman really shines. You can generate Kubernetes-compatible YAML from your pod:
podman generate kube my-app-pod > my-app-pod.yaml

Open my-app-pod.yaml and you'll see proper Kubernetes configuration:
# Save the output of this file and use kubectl create -f to import
# it into Kubernetes.
#
# Created with podman-5.7.1
apiVersion: v1
kind: Pod
metadata:
  annotations:
    io.kubernetes.cri-o.SandboxID/cache: 5e56bd9eab1a02a88654e3614312302d0f3f8d3652480498e6d1eef7d4824019
    io.kubernetes.cri-o.SandboxID/web: 5e56bd9eab1a02a88654e3614312302d0f3f8d3652480498e6d1eef7d4824019
  creationTimestamp: "2026-02-12T13:44:55Z"
  labels:
    app: my-app-pod
  name: my-app-pod
spec:
  containers:
  - args:
    - nginx
    - -g
    - daemon off;
    image: docker.io/library/nginx:alpine
    name: web
    ports:
    - containerPort: 80
      hostPort: 8082
  - args:
    - redis-server
    image: docker.io/library/redis:alpine
    name: cache

This file can be deployed directly to any Kubernetes cluster:
# using minikube cluster
kubectl apply -f my-app-pod.yaml

This is incredibly useful for local development. You can prototype your application using Podman pods, generate the YAML, and deploy to Kubernetes without rewriting anything.
How to Manage Podman Machines
When working with Podman on macOS or Windows, you're using a Linux VM. Here's how to manage it.
List all Podman machines:
podman machine list


This shows all your Podman VMs, their status (running or stopped), and their names. The default machine is usually called podman-machine-default.
Check machine status and info:
podman machine info

This displays detailed information about your current machine including CPU, memory, and disk usage.
Stop the Podman machine:
podman machine stop

If you have multiple machines, specify the name:
podman machine stop podman-machine-default

This stops the VM but preserves it. All your images and containers remain intact. When you stop the machine, all running containers inside it are stopped.
Start a stopped machine:
podman machine start

Or with a specific name:
podman machine start podman-machine-default

This restarts the VM. Your images are still there, but containers remain stopped unless you started them with a restart policy.
Delete a Podman machine:
podman machine rm podman-machine-default

This completely destroys the VM and all its contents (images, containers, volumes). Use this when you want to start fresh or free up disk space.
With this basic understanding of how Podman works, we can move on and learn about how to use Containerd.
How to Work with Containerd
Containerd is the runtime that Docker itself uses under the hood. It's also the default runtime for most Kubernetes installations. When you run Docker, you're actually using containerd without knowing it.
Why Use containerd Directly?
You might wonder why you'd use containerd directly if Docker already uses it. Here are a few reasons:

Kubernetes: Most Kubernetes clusters use containerd as their container runtime. Understanding it helps you troubleshoot production issues.

Minimal footprint: containerd has no UI and minimal features. It uses less memory than Docker Desktop (about 50MB vs 2GB).

Building tools: If you're building container orchestration tools, working directly with containerd gives you fine-grained control.


Understanding the Architecture
The containerd architecture looks like this:
Your Command → nerdctl → containerd → runc → Container

In this chain, nerdctl provides a Docker-like CLI, containerd manages images and container lifecycle, and runc actually creates the container using kernel features.
How to Install containerd with nerdctl
containerd is designed for systems (like Kubernetes) rather than direct developer use. The installation approach differs by operating system:

Lima for macOS (includes nerdctl)

containerd for Linux (native installation)

nerdctl releases (for all platforms)


For macOS users (what we'll use in this tutorial), we’ll use Lima, which provides a Linux VM with containerd and nerdctl already installed.
brew install lima

Lima comes with nerdctl built-in, so you don't need to install it separately.
For Linux users, you can install containerd directly from your package manager and download nerdctl from the GitHub releases page. Containerd runs natively on Linux without needing a VM.
How to Start a Lima Instance
limactl start

This creates a default Linux VM running containerd with nerdctl available. The VM is configured with reasonable defaults (2GB RAM, 100GB disk). You can customize these settings if needed.
Lima mounts your home directory inside the VM, so you can access your files. This makes working with Lima feel transparent – you don't need to copy files into the VM.
Verify it's working:
lima nerdctl run hello-world


How to Run Your App with nerdctl
The commands are nearly identical to Docker. This is intentional – nerdctl aims for Docker compatibility. Since we're running through Lima, we’ll prefix commands with lima.
Navigate to your project directory:
cd ~/container-demo

Build the image:
lima nerdctl build -t my-web-app .

Run the container:
lima nerdctl run -d -p 8083:80 my-web-app

Visit http://localhost:8083 to see your app running on containerd!

What's Different from Docker?
Under the hood, a lot is different. containerd manages your images and containers directly, with no dockerd in between. containerd is still a daemon, but a much leaner one than dockerd. Images are stored differently, though they're OCI-compliant and therefore compatible.
But from your perspective as a developer, the commands feel the same. This is the power of standards like OCI.
How to Check Running Containers:
lima nerdctl ps

This shows all running containers.

How to Manage Lima VMs
When working with containerd through Lima, you're using a Linux VM. Here's how to manage it.
List all Lima VMs:
limactl list

This shows all your Lima VMs, their status (running or stopped), and their names. The default VM is usually called default.
Check VM status and info:
limactl info default

This displays detailed information about the specified VM including its configuration and resource usage.
Stop the Lima VM:
limactl stop default

This stops the VM but preserves it. All your images and containers remain intact. When you stop the VM, all running containers inside it are stopped. The next time you start it, your images will still be there but containers remain stopped.
Start a stopped VM:
limactl start default

This restarts the VM. Your images persist across restarts, so you don't need to rebuild them.
Delete a Lima VM:
limactl delete default


This completely destroys the VM and all its contents (images, containers, volumes). Use this when you want to start fresh or free up disk space. You'll need to run limactl start again to create a new VM.
Create a new VM with custom settings:
limactl start --name my-custom-vm --cpus 4 --memory 8

This creates a new VM with 4 CPUs and 8GB of memory. You can have multiple Lima VMs for different projects.
How to Move Containers Between Runtimes
Thanks to the OCI (Open Container Initiative) standard, you can move container images between different runtimes. This is incredibly powerful – you can build with one tool and deploy with another.
Why Standards Matter
Before OCI, each container runtime used its own image format. Moving images between runtimes was difficult or impossible.
OCI created standards for the Runtime Specification (how to run a container), the Image Specification (how to package a container image), and the Distribution Specification (how to transfer images between systems).
Now all major runtimes follow these standards, making images portable.
Method 1 – Using Container Registries
The easiest way to share images is through a container registry like Docker Hub, GitHub Container Registry, or your own private registry. Any runtime can push and pull from registries.
First, build with Docker:
docker build -t my-username/my-app:v1 .

The image name has three parts: my-username (your registry username), my-app (the application name), and v1 (a version tag).
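To make the three-part naming concrete, here is a minimal Go sketch that splits a simple user/app:tag name into its parts. The ImageRef type and parseImageRef function are hypothetical helpers written for this illustration, and the sketch assumes there is no registry host, port, or digest in the name. Real image references can be more complex than this.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageRef holds the three parts of a simple "user/app:tag" image name.
// Hypothetical type for illustration only.
type ImageRef struct {
	User string
	App  string
	Tag  string
}

// parseImageRef splits a simple "user/app:tag" reference.
// Assumption: no registry host, port, or digest is present.
func parseImageRef(ref string) (ImageRef, error) {
	user, rest, ok := strings.Cut(ref, "/")
	if !ok || user == "" || rest == "" {
		return ImageRef{}, fmt.Errorf("expected user/app[:tag], got %q", ref)
	}
	app, tag, hasTag := strings.Cut(rest, ":")
	if app == "" || (hasTag && tag == "") {
		return ImageRef{}, fmt.Errorf("expected user/app[:tag], got %q", ref)
	}
	if !hasTag {
		tag = "latest" // Docker falls back to "latest" when no tag is given
	}
	return ImageRef{User: user, App: app, Tag: tag}, nil
}

func main() {
	ref, err := parseImageRef("my-username/my-app:v1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("user=%s app=%s tag=%s\n", ref.User, ref.App, ref.Tag)
}
```

If you leave the tag off, the sketch fills in latest, which mirrors what Docker does when you omit it.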
Push to Docker Hub:
docker login
docker push my-username/my-app:v1

You'll need to create a free Docker Hub account if you don't have one. The docker login command prompts for your credentials.
Now pull with Podman:
podman pull my-username/my-app:v1

Podman downloads the image from Docker Hub. Even though it was built with Docker, Podman can use it because both follow OCI standards.
Or pull with nerdctl:
lima nerdctl pull my-username/my-app:v1

Same image, three different runtimes. This is the power of standards.
Method 2 – Export and Import
If you don't want to use a public registry (maybe your image contains proprietary code), you can export images as tar files. This is perfect for air-gapped environments or simply moving images between machines.
Export from Docker:
docker save my-web-app -o my-web-app.tar

This creates a file called my-web-app.tar containing the image and all its layers. The file might be large (tens or hundreds of megabytes) depending on your image.
Import to Podman:
podman load -i my-web-app.tar

Import to nerdctl:
lima nerdctl load -i my-web-app.tar

Now you have the same image available in all three runtimes! You can verify:
docker images
podman images  
lima nerdctl images

All three commands will show my-web-app in their image lists.
Understanding Image Layers:
When you export an image, you're exporting all its layers. Each filesystem-changing instruction in your Dockerfile (such as COPY or RUN) creates a layer. These layers are shared between images, which saves disk space.
For example, if you have 10 images all based on nginx:alpine, they all share the nginx layers. Only the layers unique to each image take up additional space.
Real-World Use Cases
Let's look at some real scenarios where choosing the right runtime matters. These examples show how technical decisions have practical impacts.
Use Case 1 – Security-First Development
If you're working on security-sensitive applications (financial services, healthcare, government), Podman's rootless containers are a huge advantage.
The Security Problem:
Traditional Docker requires root privileges. If someone exploits a vulnerability in your container and escapes to the host system, they have root access. This is called a "container escape" vulnerability.
Podman's rootless mode solves this:
# All Podman commands run as your user by default
podman run --rm -it alpine whoami

This outputs your username, not root. The command uses --rm to remove the container when it exits (cleanup), -it to make it interactive with a terminal, alpine as a minimal Linux distribution, and whoami as a command that prints your username.
Even if someone breaks out of the container, they only have your user's permissions. They can't install system-wide malware, access other users' data, modify system configuration, or install kernel modules.
This dramatically reduces the impact of a container escape.
Example Security Scenario:
Imagine you're running a web application that processes user uploads. A vulnerability lets an attacker execute code in your container. With Docker running as root, they could escape the container, install a rootkit, steal all data from your server, and persist even after you patch the vulnerability.
With Podman rootless, they might escape the container but can only access files your user can access. They can't install system-wide persistence and can't affect other users or system files.
The difference is dramatic.
Use Case 2 – Testing Kubernetes Locally
Podman can generate Kubernetes YAML from running containers. This is perfect for prototyping before you commit to a Kubernetes configuration.
The Development Workflow:

Run your application locally with Podman

Test and iterate quickly

Generate Kubernetes YAML when it works

Deploy to a real cluster


Here's a practical example. Let's say you're building a web application with a database:
Run your containers:
# Create a pod (like a Kubernetes pod)
podman pod create --name myapp -p 8080:80

# Add web server
podman run -d --pod myapp --name web nginx:alpine

# Add PostgreSQL
podman run -d --pod myapp --name db \
  -e POSTGRES_PASSWORD=secret \
  postgres:alpine

Test your application at http://localhost:8080. When it works, generate Kubernetes YAML:
podman generate kube myapp > myapp.yaml

Now you can deploy myapp.yaml to any Kubernetes cluster:
kubectl apply -f myapp.yaml

This is much faster than writing Kubernetes YAML by hand and debugging in a cluster. You iterate locally, then deploy when ready.
Why This Matters:
Kubernetes has a steep learning curve. The YAML configuration is verbose and error-prone. By starting with simple Podman commands and generating YAML, you can focus on your application first, learn Kubernetes gradually, catch configuration errors early, and iterate quickly without cloud costs.
Use Case 3 – Resource-Constrained Environments
containerd has the smallest footprint. If you're running containers on edge devices, Raspberry Pi, or resource-constrained servers, this matters a lot.
Comparing Memory Usage:
Here are typical memory footprints for each runtime:

Docker Desktop uses approximately 2GB RAM (includes the VM, daemon, UI, and Kubernetes).

Podman uses approximately 500MB RAM (includes the VM on macOS).

Containerd uses approximately 50MB RAM (just the runtime, no extras).


On a developer laptop with 16GB RAM, this difference doesn't matter much. But consider these scenarios:
1. Edge Computing:
You're running containers on edge devices with 1GB RAM total. Docker Desktop won't fit. containerd leaves room for your application.
2. IoT Devices:
A Raspberry Pi with 2GB RAM running Docker Desktop leaves little room for your application. containerd uses minimal resources.
3. High-Density Servers:
Say you run 100 containers per server across a fleet of 100 servers. Every MB counts. If containerd saves about 2GB per server compared with a full Docker setup, that's 2GB × 100 servers = 200GB across the fleet.
Example Setup for Edge Device:
# On a Raspberry Pi or similar device
sudo apt-get install containerd
# If your distribution doesn't package nerdctl, download it from the GitHub releases page instead
sudo apt-get install nerdctl

# Now you can run containers with minimal overhead
nerdctl run -d my-lightweight-app

Your application gets to use most of the available RAM instead of competing with a heavy runtime.
Quick Reference Guide
Here's a handy comparison of common commands across runtimes:




| Task | Docker | Podman | nerdctl (via Lima) |
|---|---|---|---|
| Build image | docker build -t app . | podman build -t app . | lima nerdctl build -t app . |
| Run container | docker run -d app | podman run -d app | lima nerdctl run -d app |
| List containers | docker ps | podman ps | lima nerdctl ps |
| View logs | docker logs <container> | podman logs <container> | lima nerdctl logs <container> |
| Stop container | docker stop <container> | podman stop <container> | lima nerdctl stop <container> |
| Remove container | docker rm <container> | podman rm <container> | lima nerdctl rm <container> |
| List images | docker images | podman images | lima nerdctl images |
| Pull image | docker pull nginx | podman pull nginx | lima nerdctl pull nginx |
| Push to registry | docker push app | podman push app | lima nerdctl push app |
| Execute in container | docker exec -it <container> sh | podman exec -it <container> sh | lima nerdctl exec -it <container> sh |


Conclusion
In this guide, we’ve explored three major container runtimes and learned how to use Docker, Podman, and containerd. The container ecosystem is much bigger than just Docker, and knowing alternatives gives you more options for security, performance, and specialized use cases.
Use Docker when you're learning or need the best documentation. Use Podman when you need rootless security or are building CI/CD pipelines. Use containerd when you need minimal resource usage or are deploying to Kubernetes clusters.
Thanks to OCI standards, your containers are portable. Build with Docker, test with Podman, deploy with containerd – it all works together! You're not locked into one vendor or tool.
As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on LinkedIn and DevOps Cloud Projects.
Happy containerizing!
 


 How to Build a Production-Grade Distributed Chatroom in Go [Full Handbook] 
Destiny Erhabor — Fri, 13 Feb 2026 16:17:41 +0000
 If you've ever wondered how chat applications like Slack, Discord, or WhatsApp work behind the scenes, this tutorial will show you. You'll build a real-time chat server from scratch using Go, learning the fundamental concepts that power modern communication systems.
By the end of this guide, you'll have built a working chatroom that supports many concurrent users chatting in real-time, message persistence that survives server crashes, session management so users can reconnect after network interruptions, private messaging between users, and graceful handling of slow or disconnected clients.
More importantly, you'll understand the fundamental concepts behind distributed systems. You'll learn concurrent programming with goroutines and channels, TCP socket programming for network communication, write-ahead logging for data durability, state management with mutexes, and how to design systems that degrade gracefully under failure. These concepts power everything from databases to message queues to web servers.
Table of Contents

What is a Distributed Chatroom?

What You'll Learn

Prerequisites

Tutorial Overview

Architecture Overview

Core Concepts You Need to Know

How to Set Up the Project Structure

How to Define Core Data Types

How to Initialize the Server

How to Build the Event Loop

How to Handle Client Connections

How to Implement Message Broadcasting

How to Add Persistence with WAL and Snapshots

How to Implement Session Management

How to Build the Command System

How to Create the Client

How to Test Your Chatroom

How to Deploy Your Server

Enhancements You Can Add

Conclusion


The complete source code for this project is available on GitHub if you'd like to reference it while following along.
What is a Distributed Chatroom?
A chatroom is a server that lets multiple users connect simultaneously and exchange messages in real-time. When we say "production-grade," we mean it includes features you'd expect in a real application: it persists data so messages aren't lost when the server restarts, it handles network failures gracefully, and it can support many concurrent users without slowing down.
The "distributed" aspect refers to how the system manages multiple clients connecting from different locations, all trying to send and receive messages at the same time. This introduces interesting challenges: how do you ensure everyone sees messages in the same order? How do you handle clients with slow internet connections? What happens when someone disconnects unexpectedly?
These aren't just theoretical problems. Every networked application deals with concurrency, state management, and failure handling. Whether you're building a chat app, a multiplayer game, a collaborative editor, or a trading platform, you'll face similar challenges. The patterns you'll learn here apply broadly across distributed systems.
Chat applications are excellent learning projects because they combine several challenging problems in one place. You need to manage concurrent connections safely, broadcast messages to multiple clients without blocking, handle unreliable networks, persist data durably, and ensure the system recovers gracefully from crashes. Each of these topics could be its own tutorial, but here you'll see how they work together in a real application.
What You'll Learn
This tutorial demonstrates several important concepts that are fundamental to building distributed systems. Here's what you'll learn:
1. TCP Socket Programming in Go
You'll learn how to accept incoming TCP connections, read and write data over network sockets, and handle connection failures gracefully. These skills are essential for any networked application, from web servers to database clients.
2. Concurrent Programming with Goroutines and Channels
Go's concurrency model is one of its strongest features. You'll see how to use goroutines to handle multiple clients simultaneously without blocking. You'll use channels to coordinate between goroutines safely, avoiding the common pitfalls of shared memory concurrency like race conditions and deadlocks.
3. State Management in Distributed Systems
Managing shared state across concurrent operations is tricky. You'll learn when to use mutexes versus channels, how to design lock granularity to avoid bottlenecks, and how to ensure data consistency when multiple goroutines access the same data.
4. Write-Ahead Logging (WAL) for Durability
Databases use WAL to ensure data isn't lost during crashes. You'll implement the same pattern, learning how to balance durability with performance. You'll see why fsync is critical, understand the trade-offs of different persistence strategies, and learn how to recover state after unexpected shutdowns.
5. Session Management and Reconnection
Networks are unreliable. Users disconnect, WiFi drops, mobile connections switch towers. You'll build a token-based session system that lets users reconnect seamlessly, preserving their chat history and identity without requiring passwords or complex authentication.
6. Graceful Degradation and Fault Tolerance
Perfect reliability is impossible, so you need to design for partial failures. You'll learn how to prevent slow clients from affecting fast ones, how to continue operating when persistence fails, and how to clean up resources properly when things go wrong.
Prerequisites
To get the most out of this tutorial, you should have some foundational knowledge. You don't need to be an expert, but you should be comfortable with the basics.

Go basics (goroutines, channels, interfaces)

TCP/IP networking fundamentals

Basic concurrency concepts

File I/O operations


Tutorial Overview
This tutorial takes you through building a production-ready chatroom step by step.
You'll start by exploring the overall architecture to understand how components fit together. Then you'll learn about core concepts like concurrency models and persistence strategies.
Next, you'll set up your project structure and define the core data types that represent clients, messages, and the chatroom. Then you'll implement the server initialization and event loop, which is where all coordination happens.
After that, you'll build the networking layer to handle client connections, implement message broadcasting so messages reach all users, and add persistence using write-ahead logging and snapshots.
You'll then implement session management for reconnection, build a command system for user actions, and create a simple client application to test your server.
Finally, you’ll learn how to test and deploy your chatroom, and review key lessons from building a distributed system.
By the end, you'll have a complete, working chatroom and understand how distributed systems handle concurrency, persistence, and failure recovery.
Architecture Overview
The system follows a client-server architecture with internal components that work together to provide a robust chat experience.
High-Level Architecture

Component Breakdown
1. Network Layer

TCP Listener: Accepts incoming connections on port 9000

Connection Handler: Manages individual client connections with dedicated goroutines

Protocol: Simple newline-delimited text protocol


2. Client Management
Each client connection spawns two goroutines:

Read Goroutine: Receives messages from client

Write Goroutine: Sends messages to client (non-blocking with buffered channels)
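The "non-blocking with buffered channels" idea can be sketched as follows. This is an illustration under assumptions, not the handbook's code: the Client fields and the trySend helper are hypothetical, and dropping the message when the buffer is full is just one possible policy.

```go
package main

import "fmt"

// Client is a hypothetical stand-in for a connected user.
type Client struct {
	name     string
	outgoing chan string // buffered; drained by the client's write goroutine
}

// trySend queues msg for the client without ever blocking the caller.
// It returns false when the client's buffer is already full.
func trySend(c *Client, msg string) bool {
	select {
	case c.outgoing <- msg:
		return true
	default:
		return false
	}
}

func main() {
	// Nobody drains this client's channel, so it behaves like a slow client.
	slow := &Client{name: "slow", outgoing: make(chan string, 2)}
	for i := 1; i <= 3; i++ {
		ok := trySend(slow, fmt.Sprintf("message %d", i))
		fmt.Printf("message %d queued: %v\n", i, ok)
	}
}
```

With a buffer of two and no reader, the first two sends succeed and the third reports false instead of stalling the sender.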


3. ChatRoom Core
This is the heart of the system – a single goroutine running an event loop:
for {
    select {
    case client := <-cr.join:
        // Handle new client
    case client := <-cr.leave:
        // Handle disconnection
    case message := <-cr.broadcast:
        // Broadcast to all clients
    case client := <-cr.listUsers:
        // Send user list
    case dm := <-cr.directMessage:
        // Handle private message
    }
}
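To see the event-loop pattern run end to end, here is a cut-down, self-contained sketch. It keeps only the join, leave, and broadcast cases and adds a quit channel so the demo can stop. In this sketch the clients map is touched only by the loop goroutine, which is a simplifying assumption, and the NewChatRoom constructor and field layout are hypothetical.

```go
package main

import "fmt"

// Client is a hypothetical stand-in for a connected user.
type Client struct {
	name     string
	outgoing chan string
}

// ChatRoom is a cut-down sketch: only join, leave, and broadcast are modeled.
type ChatRoom struct {
	join      chan *Client
	leave     chan *Client
	broadcast chan string
	quit      chan struct{}
	clients   map[*Client]bool // touched only by run()
}

func NewChatRoom() *ChatRoom {
	return &ChatRoom{
		join:      make(chan *Client),
		leave:     make(chan *Client),
		broadcast: make(chan string),
		quit:      make(chan struct{}),
		clients:   make(map[*Client]bool),
	}
}

// run is the event loop: every state change goes through this one goroutine.
func (cr *ChatRoom) run() {
	for {
		select {
		case c := <-cr.join:
			cr.clients[c] = true
		case c := <-cr.leave:
			delete(cr.clients, c)
		case msg := <-cr.broadcast:
			for c := range cr.clients {
				select {
				case c.outgoing <- msg:
				default: // buffer full: skip this client rather than block
				}
			}
		case <-cr.quit:
			return
		}
	}
}

func main() {
	cr := NewChatRoom()
	go cr.run()

	alice := &Client{name: "alice", outgoing: make(chan string, 8)}
	cr.join <- alice
	cr.broadcast <- "hello, room"
	fmt.Println(<-alice.outgoing)

	cr.leave <- alice
	close(cr.quit)
}
```

Because the loop handles one event at a time, joins, leaves, and broadcasts are applied in a single, well-defined order.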

4. State Management
We have three synchronized data structures:

clients map[*Client]bool: Active connections (mutex-protected)

sessions map[string]*SessionInfo: User sessions for reconnection

messages []Message: In-memory message history
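As one way to picture a synchronized structure like the message history, here is a small sketch using sync.RWMutex. The History type, its methods, and the Message fields are hypothetical names chosen for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical chat message; the field names are assumptions.
type Message struct {
	From string
	Text string
}

// History is a mutex-guarded, in-memory message log.
type History struct {
	mu       sync.RWMutex
	messages []Message
}

// Append adds a message under the write lock.
func (h *History) Append(m Message) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.messages = append(h.messages, m)
}

// Recent returns a copy of the last n messages, so callers never share
// the backing slice with concurrent writers.
func (h *History) Recent(n int) []Message {
	h.mu.RLock()
	defer h.mu.RUnlock()
	if n > len(h.messages) {
		n = len(h.messages)
	}
	if n < 0 {
		n = 0
	}
	out := make([]Message, n)
	copy(out, h.messages[len(h.messages)-n:])
	return out
}

func main() {
	var h History
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			h.Append(Message{From: "user", Text: fmt.Sprintf("msg %d", i)})
		}(i)
	}
	wg.Wait()
	fmt.Println("stored:", len(h.Recent(1000)))
}
```

Returning a copy from Recent costs an allocation, but it means readers can keep the result after the lock is released without racing against later appends.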


5. Persistence Layer
Two-tier approach:

Write-Ahead Log (WAL): Immediate append-only log for durability

Snapshots: Periodic full state dumps for faster recovery


6. Session Management
This enables reconnection with token-based authentication:

Generates unique tokens per user

1-hour session timeout

Preserves chat history for returning users


Message Flow
Here's how a message travels through the system:
User Input → Client Read → Server Receive → Broadcast Channel 
    → ChatRoom Loop → Persist to WAL → Fan-out to All Clients
    → Client Write Goroutines → TCP Send → User Display

The broadcast channel acts as a synchronization point, ensuring total message ordering.
Core Concepts You Need to Know
Understanding the Concurrency Model
This chatroom uses Go's CSP (Communicating Sequential Processes) model. This is a fundamentally different approach to concurrency than you might be used to from other languages.
In traditional concurrent programming, you protect shared memory with locks (mutexes). Multiple threads access the same data structure, and you use locks to ensure only one thread modifies it at a time. This works, but it's error-prone. Forget a lock, and you have a race condition. Acquire two locks in an inconsistent order, and you have a deadlock.
Go encourages a different approach: instead of communicating by sharing memory, you share memory by communicating. You pass data between goroutines through channels. Only one goroutine owns the data at a time, eliminating many concurrency bugs by design.
Channels provide several advantages. They eliminate most race conditions by design, because if only one goroutine owns the data at a time, there's no race to access it. They provide natural flow control since channels can block when full (back pressure) or block when empty (waiting for data). They make it easier to reason about message flow because you can trace how data moves through your system by following the channels. And they offer better composability since you can combine channels with select statements to coordinate multiple operations.
That said, we’ll still use mutexes in this project. Channels aren't always the right tool. We’ll use mutexes when multiple goroutines need quick, frequent access to shared data structures like maps. And we’ll use channels when we want to coordinate behavior or transfer ownership of data.
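To make the two styles concrete, here is a minimal sketch that counts the same increments both ways. It is not part of the chatroom code: lockedCounter, countViaMutex, and countViaChannel are illustrative names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedCounter shares memory between goroutines and guards it with a mutex.
type lockedCounter struct {
	mu sync.Mutex
	n  int
}

func (c *lockedCounter) inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// countViaMutex lets every worker touch the counter directly, under the lock.
func countViaMutex(workers, perWorker int) int {
	var c lockedCounter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.inc()
			}
		}()
	}
	wg.Wait()
	return c.n
}

// countViaChannel gives one goroutine sole ownership of the total.
// Workers never touch it; they only send increments over a channel.
func countViaChannel(workers, perWorker int) int {
	incs := make(chan int)
	result := make(chan int)

	go func() { // the owner goroutine
		total := 0
		for d := range incs {
			total += d
		}
		result <- total
	}()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				incs <- 1
			}
		}()
	}
	wg.Wait()
	close(incs)
	return <-result
}

func main() {
	fmt.Println(countViaMutex(4, 100), countViaChannel(4, 100)) // 400 400
}
```

Both versions are correct. The mutex version is shorter for a quick shared lookup, while the channel version makes it obvious which goroutine owns the data.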
Here's how the chatroom uses channels to coordinate everything:
type ChatRoom struct {
    join          chan *Client        // New connections
    leave         chan *Client        // Disconnections
    broadcast     chan string         // Messages to all
    listUsers     chan *Client        // User list requests
    directMessage chan DirectMessage  // Private messages

    // Shared state (mutex-protected)
    clients    map[*Client]bool
    mu         sync.Mutex

    // Message history (separate mutex)
    messages   []Message
    messageMu  sync.Mutex
}

Notice that we have five channels for different types of events. The main event loop receives from all these channels using a select statement. This means all state changes happen sequentially in one place, making the system much easier to reason about.
We could have used one channel that accepts different message types, but separate channels make the code clearer. When you send to chatRoom.join, it's obvious what you're doing. When you send to chatRoom.broadcast, same thing.
The mutexes protect data that many goroutines read frequently. The clients map needs to be accessed every time we broadcast a message. Using a mutex for quick read access is more efficient than passing the entire map through a channel.
Understanding the Persistence Strategy
When your server crashes (and it will eventually), you need to recover the chat history. Users expect their messages to be there when the server restarts. But persistence is expensive: writing to disk is thousands of times slower than writing to memory. So you need a strategy that balances durability with performance.
We’ll use a two-tier approach that's similar to what real databases use: WAL (Write-ahead log) and snapshots.
The WAL is your primary durability mechanism. Here's how it works: every message is immediately appended to a file called messages.wal. This file is append-only, which means we only write to the end. Append-only writes are fast because the disk doesn't need to seek to different locations.
Each message is written as a single line of JSON. After writing each message, we call fsync. This tells the operating system to actually write the data to the physical disk right now, not just buffer it in memory. Without fsync, the OS might lose your data if the power fails before it gets around to writing.
The WAL is append-only and never modified. This makes it very reliable. If the server crashes mid-write, the worst case is one corrupted line at the end, which we can detect and skip during recovery.
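The append-and-sync cycle and the "skip a corrupted last line" recovery rule can be sketched as follows. This is only an illustration under assumptions: walEntry stands in for the tutorial's Message type, and appendEntry, replay, and demoWAL are hypothetical helpers, not the functions the chatroom defines later.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// walEntry is a simplified stand-in for the chatroom's Message type.
type walEntry struct {
	ID      int    `json:"id"`
	Content string `json:"content"`
}

// appendEntry writes one JSON line and forces it to disk with Sync (fsync).
func appendEntry(f *os.File, e walEntry) error {
	line, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(line, '\n')); err != nil {
		return err
	}
	return f.Sync()
}

// replay reads the log line by line and skips lines that fail to parse,
// such as a half-written final line left behind by a crash.
func replay(path string) ([]walEntry, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var entries []walEntry
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var e walEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue // corrupted or partial line
		}
		entries = append(entries, e)
	}
	return entries, scanner.Err()
}

// demoWAL writes two entries plus a truncated third line, then replays the file.
func demoWAL() (int, error) {
	dir, err := os.MkdirTemp("", "wal-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "messages.wal")
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return 0, err
	}
	for i, text := range []string{"hello", "world"} {
		if err := appendEntry(f, walEntry{ID: i + 1, Content: text}); err != nil {
			f.Close()
			return 0, err
		}
	}
	f.WriteString(`{"id":3,"cont`) // simulate a crash in the middle of a write
	f.Close()

	entries, err := replay(path)
	return len(entries), err
}

func main() {
	n, err := demoWAL()
	fmt.Println(n, err) // 2 <nil>
}
```

Note that bufio.Scanner rejects lines longer than 64 KB by default, so a real log with large messages would need a bigger buffer or a different reader.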
The problem with a write-ahead log is that it grows forever. If you have a million messages, you need to replay a million log entries every time you restart the server. That's slow.
Snapshots solve this problem. Every 5 minutes, if the history holds more than 100 messages, we write the entire message history to a separate file called snapshot.json. This is the complete state of the chat at that moment.
After creating a snapshot, we truncate (empty) the WAL. New messages continue to append to the WAL, but now we only need to replay messages since the last snapshot.
When the server starts, it first loads the snapshot file (if it exists). This gives us the state from the last snapshot, which might be 100,000 messages. Loading this takes about 100ms. Then it replays all entries from the WAL. This gives us messages written since the last snapshot, which might be only 50 messages. Replaying this takes milliseconds. Finally, it resumes normal operation.
Total recovery time is a few hundred milliseconds instead of several minutes.
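The startup order described above (snapshot first, then the WAL on top) might look roughly like this. It is a sketch under assumptions: entry is a simplified message type, recoverState and demoRecover are hypothetical names, and the file names follow the ones mentioned in the text.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// entry is a simplified stand-in for the chatroom's Message type.
type entry struct {
	ID      int    `json:"id"`
	Content string `json:"content"`
}

// recoverState loads snapshot.json if it exists, then applies messages.wal on top.
func recoverState(dir string) ([]entry, error) {
	var state []entry

	snap, err := os.ReadFile(filepath.Join(dir, "snapshot.json"))
	if err == nil {
		if json.Unmarshal(snap, &state) != nil {
			state = nil // unreadable snapshot: start from an empty history
		}
	} else if !errors.Is(err, fs.ErrNotExist) {
		return nil, err
	}

	wal, err := os.ReadFile(filepath.Join(dir, "messages.wal"))
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return nil, err
	}
	for _, line := range strings.Split(string(wal), "\n") {
		var e entry
		if json.Unmarshal([]byte(line), &e) != nil {
			continue // blank or corrupted line
		}
		state = append(state, e)
	}
	return state, nil
}

// demoRecover writes a two-message snapshot and a one-message WAL, then recovers.
func demoRecover() ([]int, error) {
	dir, err := os.MkdirTemp("", "recover-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	snapshot := `[{"id":1,"content":"a"},{"id":2,"content":"b"}]`
	wal := `{"id":3,"content":"c"}` + "\n"
	if err := os.WriteFile(filepath.Join(dir, "snapshot.json"), []byte(snapshot), 0o644); err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(dir, "messages.wal"), []byte(wal), 0o644); err != nil {
		return nil, err
	}

	state, err := recoverState(dir)
	ids := make([]int, 0, len(state))
	for _, e := range state {
		ids = append(ids, e.ID)
	}
	return ids, err
}

func main() {
	ids, err := demoRecover()
	fmt.Println(ids, err) // [1 2 3] <nil>
}
```

Because the WAL is emptied after each snapshot, the WAL entries always come after the snapshot's entries, so appending them in file order rebuilds the history.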
This two-tier system gives us the best of both worlds: fast writes during normal operation with the append-only WAL, fast recovery after crashes with snapshot plus small WAL replay, guaranteed durability through fsync after every message, and bounded recovery time because the WAL never grows too large.
The trade-off is that snapshots use more disk space temporarily since you have both the snapshot and the WAL. But disk space is cheap, and correctness is expensive.
Now that you understand the key concepts behind the chatroom's design, it's time to start building. You'll begin by setting up your project structure and creating the necessary directories and files.
How to Set Up the Project Structure
First, create the directory structure for your project. You'll create the files inside these directories as you work through the tutorial:
mkdir -p chatroom-with-broadcast/cmd/server
mkdir -p chatroom-with-broadcast/cmd/client
mkdir -p chatroom-with-broadcast/internal/chatroom
mkdir -p chatroom-with-broadcast/pkg/token
mkdir -p chatroom-with-broadcast/chatdata
cd chatroom-with-broadcast


Then initialize the Go module.
Note that you’ll need Go 1.23.2 or later installed on your machine. Earlier versions might work, but the code examples assume features available in Go 1.23 and above. This version includes improvements to the standard library that make concurrent programming more efficient.
go mod init github.com/yourusername/chatroom

Your go.mod file should look like this:
module github.com/yourusername/chatroom

go 1.23.2

With your project structure in place, you're ready to start writing code. The first step is defining the data types that will represent the core components of your chatroom: messages, clients, and the chatroom itself.
How to Define Core Data Types
Create a new file internal/chatroom/types.go to define your core data structures. These types form the foundation of your chatroom, so it's important to understand what each one represents and why it's designed the way it is.
package chatroom

import (
    "net"
    "os"
    "sync"
    "time"
)

// Message represents a single chat message with metadata
type Message struct {
    ID        int       `json:"id"`
    From      string    `json:"from"`
    Content   string    `json:"content"`
    Timestamp time.Time `json:"timestamp"`
    Channel   string    `json:"channel"` // "global" or "private:username"
}

// Client represents a connected user
type Client struct {
    conn         net.Conn      // TCP connection
    username     string        // Display name
    outgoing     chan string   // Buffered channel for writes
    lastActive   time.Time     // For idle detection
    messagesSent int           // Statistics
    messagesRecv int
    isSlowClient bool          // Testing flag

    reconnectToken string
    mu             sync.Mutex   // Protects stats fields
}

// ChatRoom is the central coordinator
type ChatRoom struct {
    // Communication channels
    join          chan *Client
    leave         chan *Client
    broadcast     chan string
    listUsers     chan *Client
    directMessage chan DirectMessage

    // State
    clients       map[*Client]bool
    mu            sync.Mutex
    totalMessages int
    startTime     time.Time

    // Message history
    messages      []Message
    messageMu     sync.Mutex
    nextMessageID int

    // Persistence
    walFile       *os.File
    walMu         sync.Mutex
    dataDir       string

    // Sessions
    sessions      map[string]*SessionInfo
    sessionsMu    sync.Mutex
}

// SessionInfo tracks reconnection data
type SessionInfo struct {
    Username       string
    ReconnectToken string
    LastSeen       time.Time
    CreatedAt      time.Time
}

// DirectMessage represents a private message
type DirectMessage struct {
    toClient *Client
    message  string
}

Understanding the Message Type
The Message struct stores everything we need to know about a chat message. The ID field uniquely identifies each message and ensures messages stay in order. The Timestamp lets us show when messages were sent, which is important for chat history.
The Channel field is interesting. Right now, we only use "global" for public messages, but this design lets us add private channels or chat rooms later without changing the data structure. Good data structures anticipate future needs.
Understanding the Client Type
Each connected user is represented by a Client struct. The conn field is their TCP connection – this is how we send and receive data.
The outgoing channel is crucial for performance. Notice it's a chan string, which means it's a channel of strings. We'll make this a buffered channel (size 10). This buffer means we can queue up 10 messages for this client without blocking. If a client is slow to read, we can keep sending to other clients.
Without this buffer, one slow client would block the entire broadcast. With the buffer plus a non-blocking send, a client that falls more than 10 messages behind simply misses messages, which is much better than slowing everyone down.
The lastActive timestamp helps us detect idle users. If someone hasn't sent a message in 5 minutes, we can disconnect them to free up resources.
The mu mutex protects the statistics fields. Multiple goroutines will update messagesSent and messagesRecv, so we need a mutex to prevent race conditions.
Understanding the ChatRoom Type
This is the heart of the system. Notice that we have two kinds of fields: channels and protected state.
The five channels (join, leave, broadcast, listUsers, directMessage) are how different parts of the system communicate with the main event loop. When a new client connects, we send them to the join channel. When someone sends a message, it goes to the broadcast channel.
These channels are unbuffered (capacity 0) because we want synchronization. When you send to an unbuffered channel, you block until someone receives. This ensures the event loop processes events in order.
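Here is a small standalone sketch of that hand-off behavior. The handoff function and the event strings are made up for illustration; the point is only that each send on an unbuffered channel completes when the receiver takes the value, so events are consumed in the order they were sent.

```go
package main

import "fmt"

// handoff sends three events over an unbuffered channel and returns
// them in the order the receiving goroutine saw them.
func handoff() []string {
	events := make(chan string) // unbuffered: capacity 0
	done := make(chan struct{})
	var seen []string

	go func() {
		for e := range events {
			seen = append(seen, e) // only the receiver touches seen
		}
		close(done)
	}()

	for _, e := range []string{"join:alice", "msg:hello", "leave:alice"} {
		events <- e // blocks until the receiver has taken e
	}
	close(events)
	<-done // wait for the receiver to finish before reading seen
	return seen
}

func main() {
	fmt.Println(handoff()) // [join:alice msg:hello leave:alice]
}
```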
The protected state (maps and slices) needs mutexes because multiple goroutines access it. Notice that we use separate mutexes for different data. The mu mutex protects the clients map. The messageMu mutex protects the messages slice. The sessionsMu mutex protects the sessions map.
Why separate mutexes? Performance. If we used one mutex for everything, broadcasting a message would lock all the data, preventing new clients from joining. Separate mutexes mean different operations can happen concurrently.
The WAL file (walFile) also has its own mutex (walMu) because writing to disk is slow. We don't want to hold the main mutex while waiting for disk I/O.
With your data types defined, the next step is creating a function to initialize the server. This function will set up all your data structures, restore any persisted state from previous runs, and start background workers.
How to Initialize the Server
Server initialization is critical because you need to set up all your data structures in the right order. If you restore state after opening the WAL, you might replay messages twice. If you start accepting connections before loading history, users won't see old messages.
Create a file internal/chatroom/run.go to bootstrap the server:
package chatroom

import (
    "fmt"
    "net"
    "time"
)

func NewChatRoom(dataDir string) (*ChatRoom, error) {
    cr := &ChatRoom{
        clients:       make(map[*Client]bool),
        join:          make(chan *Client),
        leave:         make(chan *Client),
        broadcast:     make(chan string),
        listUsers:     make(chan *Client),
        directMessage: make(chan DirectMessage),
        sessions:      make(map[string]*SessionInfo),
        messages:      make([]Message, 0),
        startTime:     time.Now(),
        dataDir:       dataDir,
    }

    // Restore from snapshot if available
    if err := cr.loadSnapshot(); err != nil {
        fmt.Printf("Failed to load snapshot: %v\n", err)
    }

    // Initialize WAL for new messages
    if err := cr.initializePersistence(); err != nil {
        return nil, err
    }

    // Start background snapshot worker
    go cr.periodicSnapshots()

    return cr, nil
}

func (cr *ChatRoom) periodicSnapshots() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()

    for range ticker.C {
        cr.messageMu.Lock()
        messageCount := len(cr.messages)
        cr.messageMu.Unlock()

        if messageCount > 100 {
            if err := cr.createSnapshot(); err != nil {
                fmt.Printf("Snapshot failed: %v\n", err)
            }
        }
    }
}

Let's break down what happens during initialization:
1. Creating Data Structures
We start by creating all the maps and channels. The make function initializes these properly. For maps, this creates an empty map ready to use. For channels, this creates an unbuffered channel (capacity 0).
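Notice we create the messages slice with make([]Message, 0), which gives an empty, non-nil slice with length and capacity 0. Appending works just as well on a nil slice, so this isn't a performance optimization. The practical difference is that an empty slice encodes to [] in JSON while a nil slice encodes to null, which matters if the history is ever serialized while it's still empty.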
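If you want to see how a nil slice and make([]Message, 0) actually compare, this short program prints their length, capacity, and JSON encoding. The Message type here is trimmed to one field, and encode is a helper written just for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is reduced to a single field for this comparison.
type Message struct {
	ID int `json:"id"`
}

// encode returns the JSON form of a message slice.
func encode(msgs []Message) string {
	b, _ := json.Marshal(msgs)
	return string(b)
}

func main() {
	var nilSlice []Message
	emptySlice := make([]Message, 0)

	// Both start with length 0 and capacity 0.
	fmt.Println(len(nilSlice), cap(nilSlice), len(emptySlice), cap(emptySlice)) // 0 0 0 0

	// They differ only in how they serialize.
	fmt.Println(encode(nilSlice), encode(emptySlice)) // null []

	// append works on a nil slice without any setup.
	nilSlice = append(nilSlice, Message{ID: 1})
	fmt.Println(encode(nilSlice)) // [{"id":1}]
}
```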
2. Loading the Snapshot
Before we accept any connections, we try to load a snapshot from disk. This restores the chat history from the last time the server ran. If the snapshot doesn't exist (first run) or fails to load (corrupted file), we just continue with an empty history.
This step must happen before initializing the WAL. If we opened the WAL first, we might replay messages that are already in the snapshot, creating duplicates.
3. Initializing the WAL
The initializePersistence() function opens the WAL file in append mode. It also replays any entries in the WAL that happened after the last snapshot. This ensures we don't lose any messages that were written to the WAL but not yet included in a snapshot.
If this step fails, we return an error and refuse to start. Why? Because if we can't write to the WAL, we can't guarantee durability. It's better to refuse to start than to lie to users by accepting messages we can't persist.
4. Starting Background Workers
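The periodicSnapshots() function runs in a separate goroutine. It wakes up every 5 minutes and checks if we need to create a snapshot. Notice the defer ticker.Stop(): stopping a ticker you no longer need releases its timer. Before Go 1.23, a ticker that was never stopped could not be garbage collected, so the habit is worth keeping even though this particular loop runs for the life of the server.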
The goroutine acquires the messageMu lock just to read the message count, then releases it immediately. We don't hold the lock during the snapshot creation because that's slow and would block message broadcasting.
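createSnapshot() hasn't been shown yet, so the following is only one possible way to keep slow work outside the lock: copy the data while holding the mutex, release it, then do the expensive part on the copy. The history type and its methods are hypothetical names for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// history is a simplified stand-in for the chatroom's message store.
type history struct {
	mu       sync.Mutex
	messages []string
}

func (h *history) add(m string) {
	h.mu.Lock()
	h.messages = append(h.messages, m)
	h.mu.Unlock()
}

// copyForSnapshot holds the lock only long enough to copy the slice.
// The caller can then write the copy to disk without blocking add.
func (h *history) copyForSnapshot() []string {
	h.mu.Lock()
	defer h.mu.Unlock()
	out := make([]string, len(h.messages))
	copy(out, h.messages)
	return out
}

// demoCopy shows that messages added after the copy don't change it.
func demoCopy() (snapshotLen, liveLen int) {
	h := &history{}
	h.add("first")
	h.add("second")

	snapshot := h.copyForSnapshot()
	h.add("third") // arrives while the snapshot would be written to disk

	return len(snapshot), len(h.copyForSnapshot())
}

func main() {
	fmt.Println(demoCopy()) // 2 3
}
```

The copy costs some memory, but it means a slow disk never delays message handling.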
Why 5 Minutes and 100 Messages?
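These are tunable parameters. A 5-minute interval bounds how much of the WAL has to be replayed after a crash. The 100-message threshold avoids writing snapshots for a nearly empty chat. Note that the check in periodicSnapshots() compares the total history length, so once the history passes 100 messages a snapshot is written on every tick. To skip quiet periods, you would compare against the message count at the last snapshot instead.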
In a production system, you might make these configurable. A high-traffic chat might want shorter intervals. A low-traffic chat might want longer intervals to reduce disk I/O.
Now that your server is initialized with all the necessary data structures and background workers, you need to build the core coordination mechanism. The event loop is where all state changes happen in your chatroom. It's the heartbeat that keeps everything synchronized.
How to Build the Event Loop
The event loop is the heart of your chatroom. Every client connection, message, and disconnection flows through this single point. This might seem like it could be a bottleneck, but it's actually what makes the system simple and safe.
The Run() method implements this loop. Add it to run.go:
func (cr *ChatRoom) Run() {
    fmt.Println("ChatRoom heartbeat started...")
    go cr.cleanupInactiveClients()

    for {
        select {
        case client := <-cr.join:
            cr.handleJoin(client)

        case client := <-cr.leave:
            cr.handleLeave(client)

        case message := <-cr.broadcast:
            cr.handleBroadcast(message)

        case client := <-cr.listUsers:
            cr.sendUserList(client)

        case dm := <-cr.directMessage:
            cr.handleDirectMessage(dm)
        }
    }
}

Understanding the Select Statement
The select statement is one of Go's most powerful concurrency features. It's like a switch statement for channels. The select waits until one of its cases can proceed, then it executes that case.
Here's what happens: The loop blocks on the select statement, waiting for data on any of the five channels. When data arrives on any channel, that case executes. After the case completes, the loop goes back to waiting.
For example, when a new client connects, code elsewhere in your program sends that client to cr.join. The select receives it and executes cr.handleJoin(client). Once that finishes, the loop goes back to waiting.
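You can try the same pattern in isolation with the sketch below. It uses plain strings instead of *Client values, and it adds a quit channel so the demo can stop, which the tutorial's Run() loop does not have. runLoop and demoLoop are names invented for this example.

```go
package main

import "fmt"

// runLoop handles one event at a time until quit is closed,
// and returns a record of what it handled, in order.
func runLoop(join, leave, broadcast <-chan string, quit <-chan struct{}) []string {
	var handled []string
	for {
		select {
		case name := <-join:
			handled = append(handled, "join "+name)
		case name := <-leave:
			handled = append(handled, "leave "+name)
		case msg := <-broadcast:
			handled = append(handled, "broadcast "+msg)
		case <-quit:
			return handled
		}
	}
}

// demoLoop drives the loop with three events from a single sender.
func demoLoop() []string {
	join := make(chan string)
	leave := make(chan string)
	broadcast := make(chan string)
	quit := make(chan struct{})
	result := make(chan []string)

	go func() { result <- runLoop(join, leave, broadcast, quit) }()

	// Each send blocks until the loop receives it, so the order is fixed.
	join <- "alice"
	broadcast <- "hello"
	leave <- "alice"
	close(quit)
	return <-result
}

func main() {
	fmt.Println(demoLoop()) // [join alice broadcast hello leave alice]
}
```

When several channels are ready at the same moment, select picks one of them at random, so ordering between different senders is decided by the loop, not by the senders.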
Why Use a Single Event Loop?
This might seem like a bottleneck. You have one goroutine processing all events sequentially. Why not process events in parallel?
The answer is consistency. Here's what you gain from sequential processing:
1. No Race Conditions on State
All join, leave, and broadcast handling runs on one goroutine, so these handlers never interleave with each other. When you add a client in handleJoin, you know for certain that the loop isn't simultaneously removing clients or broadcasting messages. Other goroutines still read or update some shared state (for example, handleClient creates sessions), which is why the maps and the message slice keep their mutexes.
This is incredibly powerful. Most bugs in concurrent systems come from unexpected interleaving of operations. By processing events sequentially, you eliminate an entire class of bugs.
2. Total Ordering of Events
Messages are broadcast in the order they arrive. This seems obvious, but it's important. If Alice sends "Hello" and then Bob sends "Hi", you can guarantee everyone sees them in that order. With parallel processing, you'd need additional synchronization to maintain ordering.
3. Simple State Transitions
You can reason about your system state as a series of transitions. "After this join event, the client is in the map. After this leave event, the client is removed." You don't need to worry about concurrent state changes making your reasoning invalid.
4. Easy to Debug
When something goes wrong, you can add logging to the event loop and see exactly what sequence of events led to the problem. With parallel processing, the order of events depends on thread scheduling, making bugs hard to reproduce.
Is This Actually a Bottleneck?
You might worry that sequential processing limits performance. In practice, it's fine for this workload. Here's why:
The handlers are fast. They do simple things like adding to a map, removing from a map, or forwarding a message to channels. These operations take microseconds. The event loop can process thousands of events per second.
Sending to client connections happens in other goroutines. The event loop only places each message on a client's buffered channel, then immediately moves to the next event. The one slow step that does run inside the loop is the WAL append in handleBroadcast, so the cost of fsync sets the upper bound on broadcast throughput.
If you needed higher throughput, you could shard your chat into multiple rooms, each with its own event loop. But for a single chatroom, sequential processing is both simpler and fast enough.
Understanding the Cleanup Worker
Notice the line go cr.cleanupInactiveClients() before the loop. This starts a background goroutine that periodically checks for idle clients.
Why not include this in the event loop? Because it's time-based, not event-based. The cleanup worker wakes up every 30 seconds and sends disconnect events for idle clients. These events flow through the normal event loop, maintaining our single-threaded state mutation property.
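cleanupInactiveClients() isn't shown at this point, so here is only a rough sketch of how a time-based worker can feed the event loop instead of changing state itself. idleSince, cleanupLoop, demoCleanup, and the lastActive callback are all hypothetical, and usernames stand in for *Client values.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// idleSince returns the names whose last activity is older than timeout.
func idleSince(lastActive map[string]time.Time, now time.Time, timeout time.Duration) []string {
	var idle []string
	for name, t := range lastActive {
		if now.Sub(t) > timeout {
			idle = append(idle, name)
		}
	}
	sort.Strings(idle) // stable output regardless of map order
	return idle
}

// cleanupLoop wakes on every tick and forwards idle names to the leave
// channel, so the event loop stays the only place that removes clients.
func cleanupLoop(tick <-chan time.Time, lastActive func() map[string]time.Time, timeout time.Duration, leave chan<- string) {
	for now := range tick {
		for _, name := range idleSince(lastActive(), now, timeout) {
			leave <- name
		}
	}
}

// demoCleanup fires a single tick by hand and returns who was asked to leave.
func demoCleanup() string {
	now := time.Now()
	last := map[string]time.Time{
		"alice": now.Add(-10 * time.Minute), // idle
		"bob":   now.Add(-1 * time.Minute),  // active
	}
	tick := make(chan time.Time)
	leave := make(chan string, 1)
	go cleanupLoop(tick, func() map[string]time.Time { return last }, 5*time.Minute, leave)

	tick <- now // a real server would use time.NewTicker(30 * time.Second).C
	close(tick)
	return <-leave
}

func main() {
	fmt.Println(demoCleanup()) // alice
}
```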
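Now add the runServer() function and shutdown handler. Go requires imports to appear before any other declarations, so merge the three packages below into the import block at the top of run.go rather than adding a second block: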
import (
    "os"
    "os/signal"
    "syscall"
)

func runServer() {
    chatRoom, err := NewChatRoom("./chatdata")
    if err != nil {
        fmt.Printf("Failed to initialize: %v\n", err)
        return
    }
    defer chatRoom.shutdown()

    // Set up signal handling for graceful shutdown
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    go func() {
        <-sigChan
        fmt.Println("\nReceived shutdown signal")
        chatRoom.shutdown()
        os.Exit(0)
    }()

    go chatRoom.Run()

    listener, err := net.Listen("tcp", ":9000")
    if err != nil {
        fmt.Println("Error starting server:", err)
        return
    }
    defer listener.Close()

    fmt.Println("Server started on :9000")

    for {
        conn, err := listener.Accept()
        if err != nil {
            fmt.Println("Error accepting connection:", err)
            continue
        }
        fmt.Println("New connection from:", conn.RemoteAddr())
        go handleClient(conn, chatRoom)
    }
}

func (cr *ChatRoom) shutdown() {
    fmt.Println("\nShutting down...")
    if err := cr.createSnapshot(); err != nil {
        fmt.Printf("Final snapshot failed: %v\n", err)
    }
    if cr.walFile != nil {
        cr.walFile.Close()
    }
    fmt.Println("Shutdown complete")
}

The runServer() function ties everything together:

Create the chatroom with NewChatRoom()

Defer the shutdown function so it runs when the function exits

Start the event loop in a separate goroutine with go chatRoom.Run()

Listen for TCP connections on port 9000

For each connection, spawn a goroutine with go handleClient()


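The defer statement is important. Whether the function returns normally, returns early on an error, or panics, the shutdown function runs. This ensures we create a final snapshot and close the WAL file cleanly. The one exit path that skips deferred calls is os.Exit, which is why the signal handler calls shutdown() itself before exiting.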
The signal handling goroutine listens for SIGINT (Ctrl+C) or SIGTERM (system shutdown). When it receives one, it calls shutdown() and exits gracefully. This means when you press Ctrl+C, the server saves its state before stopping.
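Because shutdown() is registered with defer and is also called from the signal goroutine, it could in principle be invoked more than once. One possible safeguard, which is not part of the tutorial's code, is to wrap the body in sync.Once. The server type and its snapshots counter below are placeholders for the real snapshot and WAL-close logic.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a placeholder for ChatRoom; snapshots counts how many
// times the final-snapshot logic actually ran.
type server struct {
	once      sync.Once
	snapshots int
}

// shutdown is safe to call from several goroutines; the body runs once.
func (s *server) shutdown() {
	s.once.Do(func() {
		s.snapshots++ // stand-in for createSnapshot() and walFile.Close()
	})
}

// demoShutdown calls shutdown twice concurrently, like a deferred call
// racing with a signal handler, and reports how often the body ran.
func demoShutdown() int {
	s := &server{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.shutdown()
		}()
	}
	wg.Wait()
	return s.snapshots
}

func main() {
	fmt.Println(demoShutdown()) // 1
}
```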
With your event loop running and listening for connections, the next step is handling what happens when a client actually connects. This involves reading their username, creating a session, and setting up the communication channels.
How to Handle Client Connections
When a client connects to your server, several things need to happen: you need to establish the TCP connection, prompt for a username, create a Client object to represent them, start goroutines to read and write messages, and handle both normal disconnections and unexpected failures.
Create a file internal/chatroom/io.go for managing client connections. When a client connects, handleClient() manages the entire lifecycle:
package chatroom

import (
    "bufio"
    "fmt"
    "math/rand"
    "net"
    "strings"
    "time"
)

func handleClient(conn net.Conn, chatRoom *ChatRoom) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Printf("Panic in handleClient: %v\n", r)
        }
        conn.Close()
    }()

    // Set initial timeout for username entry
    conn.SetReadDeadline(time.Now().Add(30 * time.Second))

    reader := bufio.NewReader(conn)

    // Prompt for username or reconnection
    conn.Write([]byte("Enter username (or 'reconnect:<username>:<token>'): \n"))

    input, err := reader.ReadString('\n')
    if err != nil {
        fmt.Println("Failed to read username:", err)
        return
    }
    input = strings.TrimSpace(input)

    var username string
    var reconnectToken string
    var isReconnecting bool

    // Parse reconnection attempt
    if strings.HasPrefix(input, "reconnect:") {
        parts := strings.Split(input, ":")
        if len(parts) == 3 {
            username = parts[1]
            reconnectToken = parts[2]
            isReconnecting = true
        } else {
            conn.Write([]byte("Invalid format. Use: reconnect:<username>:<token>\n"))
            return
        }
    } else {
        username = input
    }

    // Generate guest name if empty
    if username == "" {
        username = fmt.Sprintf("Guest%d", rand.Intn(1000))
    }

    // Validate reconnection or check for duplicate
    if isReconnecting {
        if chatRoom.validateReconnectToken(username, reconnectToken) {
            fmt.Printf("%s reconnected successfully\n", username)
            conn.Write([]byte(fmt.Sprintf("Welcome back, %s!\n", username)))
        } else {
            conn.Write([]byte("Invalid token or session expired.\n"))
            return
        }
    } else {
        // Prevent duplicate logins
        if chatRoom.isUsernameConnected(username) {
            conn.Write([]byte("Username already connected. Use reconnect if you lost connection.\n"))
            return
        }

        // Create or retrieve session
        chatRoom.sessionsMu.Lock()
        existingSession := chatRoom.sessions[username]
        chatRoom.sessionsMu.Unlock()

        if existingSession != nil {
            token := existingSession.ReconnectToken
            msg := fmt.Sprintf("Tip: Save this token: %s\n", token)
            msg += fmt.Sprintf("To reconnect: reconnect:%s:%s\n", username, token)
            conn.Write([]byte(msg))
        } else {
            session := chatRoom.createSession(username)
            token := session.ReconnectToken
            msg := fmt.Sprintf("Your token: %s\n", token)
            msg += fmt.Sprintf("To reconnect: reconnect:%s:%s\n", username, token)
            conn.Write([]byte(msg))
        }
    }

    // Create client object
    client := &Client{
        conn:           conn,
        username:       username,
        outgoing:       make(chan string, 10), // Buffered
        lastActive:     time.Now(),
        reconnectToken: reconnectToken,
        isSlowClient:   rand.Float64() < 0.1, // 10% chance for testing
    }

    // Clear timeout for normal operation
    conn.SetReadDeadline(time.Time{})

    // Notify chatroom
    chatRoom.join <- client

    // Send welcome message
    welcomeMsg := buildWelcomeMessage(username)
    conn.Write([]byte(welcomeMsg))

    // Start read/write loops
    go readMessages(client, chatRoom)
    writeMessages(client) // Blocks until disconnect

    // Update session on disconnect
    chatRoom.updateSessionActivity(username)
    chatRoom.leave <- client
}

func buildWelcomeMessage(username string) string {
    msg := fmt.Sprintf("Welcome, %s!\n", username)
    msg += "Commands:\n"
    msg += "  /users - List all users\n"
    msg += "  /history [N] - Show last N messages\n"
    msg += "  /msg <user> <message> - Private message\n"
    msg += "  /token - Show your reconnect token\n"
    msg += "  /stats - Show your stats\n"
    msg += "  /quit - Leave\n"
    return msg
}

The initial 30-second timeout prevents connection exhaustion by disconnecting clients who don't enter a username quickly. The buffered outgoing channel prevents slow clients from blocking the broadcaster. Token-based reconnection lets users resume their session without complex authentication. The dual goroutine design means reading and writing happen independently, so a slow write doesn't block incoming messages.
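The first-line parsing inside handleClient() can be pulled into a small function that is easy to test on its own. This is a sketch that mirrors the logic above rather than code from the tutorial: parseLogin is a hypothetical name, and rejecting empty username or token parts is an extra check added here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseLogin interprets the first line a client sends. It returns the
// username, the reconnect token (if any), and whether this was a
// reconnect attempt.
func parseLogin(input string) (username, token string, reconnect bool, err error) {
	input = strings.TrimSpace(input)
	if !strings.HasPrefix(input, "reconnect:") {
		return input, "", false, nil // plain username (may be empty)
	}
	parts := strings.Split(input, ":")
	if len(parts) != 3 || parts[1] == "" || parts[2] == "" {
		return "", "", false, errors.New("invalid format, expected reconnect:<username>:<token>")
	}
	return parts[1], parts[2], true, nil
}

func main() {
	fmt.Println(parseLogin("alice\n"))
	fmt.Println(parseLogin("reconnect:alice:abc123\n"))
	fmt.Println(parseLogin("reconnect:alice\n"))
}
```

One consequence of splitting on ":" is that neither usernames nor tokens may contain a colon, which is worth keeping in mind when choosing a token format.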
How to Read Messages from Clients
Add the readMessages() goroutine to handle all incoming data:
func readMessages(client *Client, chatRoom *ChatRoom) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Printf("Panic in readMessages for %s: %v\n", client.username, r)
        }
    }()

    reader := bufio.NewReader(client.conn)

    for {
        // Set 5-minute idle timeout
        client.conn.SetReadDeadline(time.Now().Add(5 * time.Minute))

        message, err := reader.ReadString('\n')
        if err != nil {
            if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
                fmt.Printf("%s timed out\n", client.username)
            } else {
                fmt.Printf("%s disconnected: %v\n", client.username, err)
            }
            return
        }

        client.markActive() // Update activity timestamp

        message = strings.TrimSpace(message)
        if message == "" {
            continue
        }

        client.mu.Lock()
        client.messagesRecv++
        client.mu.Unlock()

        // Process commands vs. regular messages
        if strings.HasPrefix(message, "/") {
            handleCommand(client, chatRoom, message)
            continue
        }

        // Regular message - format and broadcast
        formatted := fmt.Sprintf("[%s]: %s\n", client.username, message)
        chatRoom.broadcast <- formatted
    }
}

5 minutes of idle time triggers auto-disconnect. This prevents zombie connections from consuming resources.
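To watch a read deadline expire without waiting five minutes, you can run the same check against an in-memory connection from net.Pipe with a much shorter timeout. readLine and demoTimeout are helper names invented for this sketch, and it uses errors.As instead of the type assertion shown above, which also matches wrapped errors.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"time"
)

// readLine waits up to idle for one newline-terminated line.
// timedOut is true when the deadline passed before a full line arrived.
func readLine(conn net.Conn, r *bufio.Reader, idle time.Duration) (line string, timedOut bool, err error) {
	if err := conn.SetReadDeadline(time.Now().Add(idle)); err != nil {
		return "", false, err
	}
	line, err = r.ReadString('\n')
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "", true, nil
	}
	return line, false, err
}

// demoTimeout reads one line that arrives in time, then one that never comes.
func demoTimeout() (string, bool) {
	server, client := net.Pipe() // in-memory connection pair
	defer server.Close()
	defer client.Close()
	r := bufio.NewReader(server)

	go client.Write([]byte("hello\n"))
	first, _, _ := readLine(server, r, time.Second)

	// Nothing else is written, so this read hits the deadline.
	_, timedOut, _ := readLine(server, r, 50*time.Millisecond)
	return first, timedOut
}

func main() {
	first, timedOut := demoTimeout()
	fmt.Printf("%q %v\n", first, timedOut) // "hello\n" true
}
```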
How to Write Messages to Clients
Add the writeMessages() function to drain the client's outgoing channel:
func writeMessages(client *Client) {
    defer func() {
        if r := recover(); r != nil {
            fmt.Printf("Panic in writeMessages for %s: %v\n", client.username, r)
        }
    }()

    writer := bufio.NewWriter(client.conn)

    for message := range client.outgoing {
        // Simulate slow client (testing mode)
        if client.isSlowClient {
            time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
        }

        _, err := writer.WriteString(message)
        if err != nil {
            fmt.Printf("Write error for %s: %v\n", client.username, err)
            return
        }

        err = writer.Flush()
        if err != nil {
            fmt.Printf("Flush error for %s: %v\n", client.username, err)
            return
        }
    }
}

Real-world clients have varying network speeds. A client with a slow internet connection shouldn't block message delivery to other users. This is a fundamental challenge in any system that broadcasts to multiple recipients.
To handle this, we use two techniques. First, the outgoing channel is buffered with a size of 10. This means the system can queue up 10 messages for a client without blocking. If a client temporarily slows down (maybe they're loading a large webpage in another tab), the buffer absorbs the slowdown.
Second, when broadcasting messages (which you'll see in the next section), we use non-blocking sends. If a client's buffer is full because they're consistently too slow, we skip sending to them rather than blocking everyone else. The slow client misses some messages, but everyone else continues normally. This is called graceful degradation: the system continues working even when parts of it have problems.
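The combination of a 10-slot buffer and a non-blocking send can be tried in isolation. In this sketch, trySend and demoDrop are illustrative names, and nobody reads from the channel, which imitates a client that has stalled completely.

```go
package main

import "fmt"

// trySend queues msg for a client without blocking. It reports false
// when the client's buffer is full and the message is dropped.
func trySend(outgoing chan<- string, msg string) bool {
	select {
	case outgoing <- msg:
		return true
	default:
		return false
	}
}

// demoDrop sends 12 messages to a channel with room for 10.
func demoDrop() (delivered, dropped int) {
	outgoing := make(chan string, 10) // same buffer size as the Client's channel
	for i := 0; i < 12; i++ {
		if trySend(outgoing, fmt.Sprintf("msg %d\n", i)) {
			delivered++
		} else {
			dropped++
		}
	}
	return delivered, dropped
}

func main() {
	fmt.Println(demoDrop()) // 10 2
}
```

The first ten messages fill the buffer, and the last two are dropped immediately instead of making the sender wait.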
With client connections handled, the next step is implementing the core feature of any chat system: broadcasting messages to all connected users. Broadcasting means taking one message and sending it to many recipients efficiently and safely.
How to Implement Message Broadcasting
Broadcasting is the heart of a chat application. When one user sends a message, it needs to reach everyone else instantly. But this is trickier than it sounds because you need to persist the message for durability, send it to clients at different speeds without blocking, and maintain message ordering across all clients.
Create internal/chatroom/handlers.go to handle events.
The handleBroadcast() method is where messages reach all users:
package chatroom

import (
    "fmt"
    "strings"
    "time"
)

func (cr *ChatRoom) handleBroadcast(message string) {
    // Parse message metadata
    parts := strings.SplitN(message, ": ", 2)
    from := "system"
    actualContent := message

    if len(parts) == 2 {
        from = strings.Trim(parts[0], "[]")
        actualContent = parts[1]
    }

    // Create persistent message record
    cr.messageMu.Lock()
    msg := Message{
        ID:        cr.nextMessageID,
        From:      from,
        Content:   actualContent,
        Timestamp: time.Now(),
        Channel:   "global",
    }
    cr.nextMessageID++
    cr.messages = append(cr.messages, msg)
    cr.messageMu.Unlock()

    // Persist to WAL
    if err := cr.persistMessage(msg); err != nil {
        fmt.Printf("Failed to persist: %v\n", err)
        // Continue anyway - availability over consistency
    }

    // Collect current clients
    cr.mu.Lock()
    clients := make([]*Client, 0, len(cr.clients))
    for client := range cr.clients {
        clients = append(clients, client)
    }
    cr.totalMessages++
    cr.mu.Unlock()

    fmt.Printf("Broadcasting to %d clients: %s", len(clients), message)

    // Fan-out to all clients
    for _, client := range clients {
        select {
        case client.outgoing <- message:
            client.mu.Lock()
            client.messagesSent++
            client.mu.Unlock()
        default:
            fmt.Printf("Skipped %s (channel full)\n", client.username)
        }
    }
}

Consistency Trade-off:
If a WAL write fails, you still broadcast the message. Why? Because availability is more important than perfect consistency for a chat application. Users get their messages immediately, and you can handle WAL repair manually if needed.
How to Handle Join and Leave Events
Add these handlers to handlers.go:
func (cr *ChatRoom) handleJoin(client *Client) {
    cr.mu.Lock()
    cr.clients[client] = true
    cr.mu.Unlock()

    client.markActive()

    fmt.Printf("%s joined (total: %d)\n", client.username, len(cr.clients))

    cr.sendHistory(client, 10)

    announcement := fmt.Sprintf("*** %s joined the chat ***\n", client.username)
    cr.handleBroadcast(announcement)
}

func (cr *ChatRoom) handleLeave(client *Client) {
    cr.mu.Lock()
    if !cr.clients[client] {
        cr.mu.Unlock()
        return
    }
    delete(cr.clients, client)
    cr.mu.Unlock()

    fmt.Printf("%s left (total: %d)\n", client.username, len(cr.clients))

    // The membership check above lets only the first leave event for a
    // client get this far, so the channel is closed exactly once.
    close(client.outgoing)

    announcement := fmt.Sprintf("*** %s left the chat ***\n", client.username)
    cr.handleBroadcast(announcement)
}

The handleJoin function adds the client to the active clients map, marks them as active for idle tracking, sends them the last 10 messages so they can see recent conversation, and broadcasts an announcement so everyone knows they joined.
The handleLeave function removes the client from the map, closes their outgoing channel, and broadcasts a departure announcement. The membership check at the top is what makes the close safe: a second leave event for the same client returns early, so the channel is never closed twice. (A select that receives from the channel can't serve as an "already closed" test, because a receive also succeeds whenever a message is still buffered.)
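If you would rather not depend on the caller to guarantee a single close, sync.Once is a common alternative. The sketch below assumes a hypothetical safeQueue wrapper, which is not a type from this project. It only shows that a second Close becomes a no-op and that buffered messages remain readable after the close.

```go
package main

import (
	"fmt"
	"sync"
)

// safeQueue is a hypothetical wrapper around an outgoing channel.
type safeQueue struct {
	ch        chan string
	closeOnce sync.Once
}

func newSafeQueue(size int) *safeQueue {
	return &safeQueue{ch: make(chan string, size)}
}

// Close closes the channel at most once, no matter how often it is called.
func (q *safeQueue) Close() {
	q.closeOnce.Do(func() { close(q.ch) })
}

func main() {
	q := newSafeQueue(10)
	q.ch <- "pending message"

	q.Close()
	q.Close() // second call does nothing instead of panicking

	// Buffered messages are still delivered after close; then the loop ends.
	for msg := range q.ch {
		fmt.Println("drained:", msg)
	}
}
```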
How to Send User Lists and History
Add these helper functions to handlers.go:
func (cr *ChatRoom) sendHistory(client *Client, count int) {
    cr.messageMu.Lock()
    defer cr.messageMu.Unlock()

    start := len(cr.messages) - count
    if start < 0 {
        start = 0
    }

    historyMsg := "Recent messages:\n"
    for i := start; i < len(cr.messages); i++ {
        msg := cr.messages[i]
        historyMsg += fmt.Sprintf(" [%s]: %s\n", msg.From, msg.Content)
    }

    select {
    case client.outgoing <- historyMsg:
    default:
    }
}

func (cr *ChatRoom) sendUserList(client *Client) {
    cr.mu.Lock()
    defer cr.mu.Unlock()

    list := "Users online:\n"
    for c := range cr.clients {
        status := ""
        if c.isInactive(1 * time.Minute) {
            status = " (idle)"
        }
        list += fmt.Sprintf("  - %s%s\n", c.username, status)
    }

    list += fmt.Sprintf("\nTotal messages: %d\n", cr.totalMessages)
    list += fmt.Sprintf("Uptime: %s\n", time.Since(cr.startTime).Round(time.Second))

    select {
    case client.outgoing <- list:
    default:
    }
}

func (cr *ChatRoom) handleDirectMessage(dm DirectMessage) {
    select {
    case dm.toClient.outgoing <- dm.message:
        dm.toClient.mu.Lock()
        dm.toClient.messagesSent++
        dm.toClient.mu.Unlock()
    default:
        fmt.Printf("Couldn't deliver DM to %s\n", dm.toClient.username)
    }
}

func (cr *ChatRoom) findClientByUsername(username string) *Client {
    cr.mu.Lock()
    defer cr.mu.Unlock()

    for client := range cr.clients {
        if client.username == username {
            return client
        }
    }
    return nil
}

func (c *Client) markActive() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.lastActive = time.Now()
}

func (c *Client) isInactive(timeout time.Duration) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return time.Since(c.lastActive) > timeout
}

You now have a working chat system where clients can connect and exchange messages.
But there's a critical problem: if the server crashes or restarts, all messages are lost. The next step is adding persistence so messages survive failures.
How to Add Persistence with WAL and Snapshots
Persistence ensures your chat history survives server crashes and restarts. Without it, users would lose all their conversations every time the server goes down.
You'll implement this using two complementary mechanisms: a write-ahead log for immediate durability and snapshots for fast recovery.
Create internal/chatroom/persistence.go to handle data durability.
The WAL ensures messages survive crashes:
package chatroom

import (
    "bufio"
    "encoding/json"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func (cr *ChatRoom) initializePersistence() error {
    if err := os.MkdirAll(cr.dataDir, 0755); err != nil {
        return fmt.Errorf("create data dir: %w", err)
    }

    walPath := filepath.Join(cr.dataDir, "messages.wal")

    if err := cr.recoverFromWAL(walPath); err != nil {
        fmt.Printf("Recovery failed: %v\n", err)
    }

    file, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        return fmt.Errorf("open wal: %w", err)
    }

    cr.walFile = file
    fmt.Printf("WAL initialized: %s\n", walPath)
    return nil
}

func (cr *ChatRoom) recoverFromWAL(walPath string) error {
    file, err := os.Open(walPath)
    if err != nil {
        if os.IsNotExist(err) {
            fmt.Println("No WAL found (fresh start)")
            return nil
        }
        return err
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    recovered := 0

    for scanner.Scan() {
        line := scanner.Text()
        if line == "" {
            continue
        }

        var msg Message
        if err := json.Unmarshal([]byte(line), &msg); err != nil {
            fmt.Printf("Skipping corrupt line: %s\n", line)
            continue
        }

        cr.messages = append(cr.messages, msg)

        if msg.ID >= cr.nextMessageID {
            cr.nextMessageID = msg.ID + 1
        }
        recovered++
    }

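    // Scan stops silently on read errors or over-long lines, so check for them.
    if err := scanner.Err(); err != nil {
        return fmt.Errorf("read wal: %w", err)
    }

    fmt.Printf("Recovered %d messages\n", recovered)
    return nil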
}

func (cr *ChatRoom) persistMessage(msg Message) error {
    cr.walMu.Lock()
    defer cr.walMu.Unlock()

    data, err := json.Marshal(msg)
    if err != nil {
        return err
    }

    _, err = cr.walFile.Write(append(data, '\n'))
    if err != nil {
        return err
    }

    return cr.walFile.Sync()
}

Each line is a JSON-encoded message:
{"id":1,"from":"Alice","content":"Hello world","timestamp":"2024-02-06T10:00:00Z","channel":"global"}
{"id":2,"from":"Bob","content":"Hi Alice!","timestamp":"2024-02-06T10:00:05Z","channel":"global"}

The Sync() call is critical for durability. Without it, the OS might buffer writes in memory, losing them on a crash. The trade-off is that Sync() is expensive (about 1-10ms per call). Production systems might batch multiple messages to improve throughput.
How to Create and Load Snapshots
Add snapshot functionality to persistence.go:
func (cr *ChatRoom) createSnapshot() error {
    snapshotPath := filepath.Join(cr.dataDir, "snapshot.json")
    tempPath := snapshotPath + ".tmp"

    file, err := os.Create(tempPath)
    if err != nil {
        return err
    }
    defer file.Close()

    cr.messageMu.Lock()
    data, err := json.MarshalIndent(cr.messages, "", "  ")
    cr.messageMu.Unlock()

    if err != nil {
        return err
    }

    if _, err := file.Write(data); err != nil {
        return err
    }

    if err := file.Sync(); err != nil {
        return err
    }

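    // Close before renaming and surface any error; the deferred Close
    // above then becomes a harmless no-op.
    if err := file.Close(); err != nil {
        return err
    }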

    if err := os.Rename(tempPath, snapshotPath); err != nil {
        return err
    }

    fmt.Printf("Snapshot created (%d messages)\n", len(cr.messages))
    return cr.truncateWAL()
}

func (cr *ChatRoom) truncateWAL() error {
    cr.walMu.Lock()
    defer cr.walMu.Unlock()

    if cr.walFile != nil {
        cr.walFile.Close()
    }

    walPath := filepath.Join(cr.dataDir, "messages.wal")
    file, err := os.OpenFile(walPath, os.O_TRUNC|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        return err
    }
    cr.walFile = file
    fmt.Println("WAL truncated")
    return nil
}

func (cr *ChatRoom) loadSnapshot() error {
    snapshotPath := filepath.Join(cr.dataDir, "snapshot.json")
    file, err := os.Open(snapshotPath)
    if err != nil {
        if os.IsNotExist(err) {
            return nil
        }
        return err
    }
    defer file.Close()

    data, err := io.ReadAll(file)
    if err != nil {
        return err
    }

    cr.messageMu.Lock()
    err = json.Unmarshal(data, &cr.messages)
    cr.messageMu.Unlock()

    if err != nil {
        return err
    }

    for _, msg := range cr.messages {
        if msg.ID >= cr.nextMessageID {
            cr.nextMessageID = msg.ID + 1
        }
    }

    fmt.Printf("Loaded %d messages from snapshot\n", len(cr.messages))
    return nil
}

Writing to .tmp then renaming ensures you never have a half-written snapshot. Even if power fails mid-write, the old snapshot remains valid.
Recovery Flow
When the server starts, it first loads the snapshot if it exists, which might contain 100K messages and takes about 100ms. Then it replays WAL entries written since the snapshot, which might be only recent messages. Total recovery time is seconds instead of minutes.
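The snapshot-then-replay order can be sketched as a pure function. Everything named here is illustrative: record stands in for Message, recoverState is a hypothetical helper, and IDs are assumed to start at 1 as in the sample WAL lines. Skipping WAL entries whose IDs the snapshot already covers is an extra safeguard added for this sketch, not something the code above does.

```go
package main

import "fmt"

// record is a simplified stand-in for the article's Message type.
type record struct {
	ID      int
	Content string
}

// recoverState starts from the snapshot, replays WAL records on top, and
// returns the merged history plus the next free message ID.
func recoverState(snapshot, wal []record) (messages []record, nextID int) {
	messages = append(messages, snapshot...)

	lastID := 0 // assumes real IDs start at 1
	for _, r := range snapshot {
		if r.ID > lastID {
			lastID = r.ID
		}
	}

	for _, r := range wal {
		if r.ID <= lastID {
			continue // already covered by the snapshot
		}
		messages = append(messages, r)
		lastID = r.ID
	}
	return messages, lastID + 1
}

func main() {
	snapshot := []record{{ID: 1, Content: "Hello world"}, {ID: 2, Content: "Hi Alice!"}}
	wal := []record{{ID: 2, Content: "Hi Alice!"}, {ID: 3, Content: "Anyone there?"}}

	messages, nextID := recoverState(snapshot, wal)
	fmt.Printf("recovered %d messages, next ID is %d\n", len(messages), nextID)
}
```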
With persistence in place, your messages are safe. But network connections are unreliable. Users get disconnected when their WiFi drops, their phone switches towers, or their laptop goes to sleep. The next step is implementing session management so users can reconnect without losing their identity or chat history.
How to Implement Session Management
Session management lets users reconnect to your server after network interruptions without needing to create a new account or re-enter credentials. You'll implement this using cryptographically secure tokens that persist across connections.
Create internal/chatroom/session.go for reconnection handling.
package chatroom

import (
    "fmt"
    "time"

    "github.com/yourusername/chatroom/pkg/token"
)

func (cr *ChatRoom) createSession(username string) *SessionInfo {
    cr.sessionsMu.Lock()
    defer cr.sessionsMu.Unlock()

    tok := token.GenerateToken()

    session := &SessionInfo{
        Username:       username,
        ReconnectToken: tok,
        LastSeen:       time.Now(),
        CreatedAt:      time.Now(),
    }

    cr.sessions[username] = session

    fmt.Printf("Created session for %s (token: %s...)\n", username, tok[:8])

    return session
}

func (cr *ChatRoom) validateReconnectToken(username, token string) bool {
    cr.sessionsMu.Lock()
    defer cr.sessionsMu.Unlock()

    session, exists := cr.sessions[username]
    if !exists {
        return false
    }

    if session.ReconnectToken != token {
        return false
    }

    if time.Since(session.LastSeen) > 1*time.Hour {
        delete(cr.sessions, username)
        return false
    }

    session.LastSeen = time.Now()

    return true
}

func (cr *ChatRoom) updateSessionActivity(username string) {
    cr.sessionsMu.Lock()
    defer cr.sessionsMu.Unlock()

    if session, exists := cr.sessions[username]; exists {
        session.LastSeen = time.Now()
    }
}

func (cr *ChatRoom) isUsernameConnected(username string) bool {
    cr.mu.Lock()
    defer cr.mu.Unlock()

    for client := range cr.clients {
        if client.username == username {
            return true
        }
    }

    return false
}

func (cr *ChatRoom) cleanupInactiveClients() {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        cr.mu.Lock()
        var toRemove []*Client

        for client := range cr.clients {
            if client.isInactive(5 * time.Minute) {
                fmt.Printf("Removing inactive: %s\n", client.username)
                toRemove = append(toRemove, client)
            }
        }
        cr.mu.Unlock()

        for _, client := range toRemove {
            cr.leave <- client
        }
    }
}

How to Generate Secure Tokens
Create pkg/token/token.go for token generation:
package token

import (
    "crypto/rand"
    "encoding/hex"
)

// GenerateToken returns 16 cryptographically random bytes encoded as a 32-character hex string
func GenerateToken() string {
    b := make([]byte, 16)
    _, _ = rand.Read(b)
    return hex.EncodeToString(b)
}

Tokens here are transmitted in plaintext over TCP. For production use, you should use TLS encryption to protect tokens in transit, hash tokens before storage so a database breach doesn't expose them, and implement rate limiting on reconnection attempts to prevent brute force attacks.
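Hashing before storage could look like the sketch below. The helper names hashToken and tokenMatches are assumptions, not part of the project. A single SHA-256 pass is a reasonable fit here only because the tokens are long random values; it would not be appropriate for user-chosen passwords. The comparison uses crypto/subtle so it takes the same time whether or not the values match.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a token.
// Store this value instead of the raw token.
func hashToken(tok string) string {
	sum := sha256.Sum256([]byte(tok))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it to the stored
// hash in constant time.
func tokenMatches(storedHash, presented string) bool {
	presentedHash := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(storedHash), []byte(presentedHash)) == 1
}

func main() {
	// Example value only; real tokens come from GenerateToken.
	token := "338f04ca9d1e4b7a8c2f6e5d4c3b2a19"
	stored := hashToken(token)

	fmt.Println("correct token accepted:", tokenMatches(stored, token))
	fmt.Println("wrong token accepted:  ", tokenMatches(stored, "not-the-token"))
}
```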
Your chatroom now supports basic messaging and reconnection. But users need ways to interact with the system beyond just sending messages. The command system provides features like listing users, viewing history, and sending private messages.
How to Build the Command System
Commands are messages that start with a forward slash and perform special actions instead of being broadcast to everyone. This is a pattern used by many chat applications like Slack and Discord. You'll implement several useful commands that enhance the user experience.
Add command handling to io.go:
func handleCommand(client *Client, chatRoom *ChatRoom, command string) {
    parts := strings.Fields(command)
    if len(parts) == 0 {
        return
    }

    switch parts[0] {
    case "/users":
        chatRoom.listUsers <- client

    case "/stats":
        client.mu.Lock()
        stats := "Your Stats:\n"
        stats += fmt.Sprintf("  Messages sent: %d\n", client.messagesSent)
        stats += fmt.Sprintf("  Messages received: %d\n", client.messagesRecv)
        stats += fmt.Sprintf("  Last active: %s ago\n", 
            time.Since(client.lastActive).Round(time.Second))
        client.mu.Unlock()

        select {
        case client.outgoing <- stats:
        default:
        }

    case "/msg":
        if len(parts) < 3 {
            select {
            case client.outgoing <- "Usage: /msg <username> <message>\n":
            default:
            }
            return
        }

        targetUsername := parts[1]
        messageText := strings.Join(parts[2:], " ")

        targetClient := chatRoom.findClientByUsername(targetUsername)
        if targetClient == nil {
            select {
            case client.outgoing <- fmt.Sprintf("User '%s' not found\n", targetUsername):
            default:
            }
            return
        }

        privateMsg := fmt.Sprintf("[From %s]: %s\n", client.username, messageText)
        select {
        case targetClient.outgoing <- privateMsg:
        default:
            select {
            case client.outgoing <- fmt.Sprintf("%s's inbox is full\n", targetUsername):
            default:
            }
            return
        }

        select {
        case client.outgoing <- fmt.Sprintf("Message sent to %s\n", targetUsername):
        default:
        }

    case "/history":
        count := 20
        if len(parts) > 1 {
            fmt.Sscanf(parts[1], "%d", &count)
        }
        if count > 100 {
            count = 100
        }
        chatRoom.sendHistory(client, count)

    case "/token":
        chatRoom.sessionsMu.Lock()
        session := chatRoom.sessions[client.username]
        chatRoom.sessionsMu.Unlock()

        if session != nil {
            msg := "Your reconnect token:\n"
            msg += fmt.Sprintf("   reconnect:%s:%s\n", client.username, session.ReconnectToken)
            select {
            case client.outgoing <- msg:
            default:
            }
        }

    case "/quit":
        announcement := fmt.Sprintf("%s left the chat\n", client.username)
        chatRoom.broadcast <- announcement

        select {
        case client.outgoing <- "Goodbye!\n":
        default:
        }

        time.Sleep(100 * time.Millisecond)
        client.conn.Close()

    default:
        select {
        case client.outgoing <- fmt.Sprintf("Unknown: %s\n", parts[0]):
        default:
        }
    }
}
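
One detail of the /msg branch: strings.Fields collapses runs of whitespace, so a private message like "hello   there" is delivered as "hello there". If you want to keep the text exactly as typed, a parser along these lines is one option. parseDirectMessage is a hypothetical helper, not part of the code above, and it needs Go 1.20 or later for strings.CutPrefix.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDirectMessage splits "/msg <username> <message>" into its parts while
// leaving the spacing inside the message untouched.
func parseDirectMessage(line string) (target, text string, ok bool) {
	rest, found := strings.CutPrefix(line, "/msg ")
	if !found {
		return "", "", false
	}
	rest = strings.TrimLeft(rest, " ")

	// Everything up to the first space is the username; the rest is the message.
	target, text, found = strings.Cut(rest, " ")
	text = strings.TrimSpace(text)
	if !found || target == "" || text == "" {
		return "", "", false
	}
	return target, text, true
}

func main() {
	for _, line := range []string{"/msg Bob hello   there", "/msg Bob", "/users"} {
		target, text, ok := parseDirectMessage(line)
		fmt.Printf("%q -> target=%q text=%q ok=%v\n", line, target, text, ok)
	}
}
```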

Your server is now complete with all the core features: connection handling, message broadcasting, persistence, session management, and commands. But to actually use your chatroom, you need a client application. The client is much simpler than the server because it just needs to connect and relay messages.
How to Create the Client
The client application provides the user interface for your chatroom. It connects to the server, displays incoming messages, and sends outgoing messages typed by the user. While the server is complex with many concurrent components, the client is straightforward.
Create internal/chatroom/client.go for the client implementation.
package chatroom

import (
    "bufio"
    "fmt"
    "net"
    "os"
    "strings"
)

func StartClient() {
    conn, err := net.Dial("tcp", ":9000")
    if err != nil {
        fmt.Println("Error connecting:", err)
        return
    }
    defer conn.Close()

    fmt.Println("Connected to chat server")

    // Background goroutine: read from server
    go func() {
        reader := bufio.NewReader(conn)
        for {
            message, err := reader.ReadString('\n')
            if err != nil {
                fmt.Println("Disconnected from server.")
                os.Exit(0)
            }
            // Clear current prompt line and print message
            fmt.Print("\r" + message)
            fmt.Print(">> ")
        }
    }()

    // Main goroutine: read from stdin
    inputReader := bufio.NewReader(os.Stdin)
    fmt.Println("Welcome to the chat server!")

    for {
        fmt.Print(">> ")
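        message, err := inputReader.ReadString('\n')
        if err != nil {
            // stdin was closed (for example with Ctrl+D): stop instead of looping forever
            return
        }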
        message = strings.TrimSpace(message)

        if message == "" {
            continue
        }

        if _, err := conn.Write([]byte(message + "\n")); err != nil {
            fmt.Println("Error sending:", err)
            return
        }
    }
}

How the Client Works:
The client uses two goroutines to handle communication simultaneously. The main goroutine reads from stdin (your keyboard) and sends messages to the server. When you type a message and press Enter, it gets sent over the TCP connection immediately.
The background goroutine continuously reads from the server. Whenever a message arrives, it prints it to your screen. The \r (carriage return) moves the cursor back to the start of the line, so the incoming message overwrites the >> prompt instead of being appended after it. After printing the message, it reprints the prompt so you can continue typing.
This dual-goroutine design means you can receive messages while typing. If someone sends a message while you're in the middle of typing yours, their message appears immediately and your prompt reappears below it.
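The defer conn.Close() closes the connection when StartClient returns. If the server disconnects, the read goroutine gets an error and calls os.Exit(0), which terminates the entire client program immediately. Note that os.Exit does not run deferred functions, so on that path the socket is released by the operating system rather than by the deferred Close.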
How to Create Entry Points
Create cmd/server/main.go:
package main

import (
    "fmt"
    "os"

    "github.com/yourusername/chatroom/internal/chatroom"
)

func main() {
    fmt.Println("Starting server from cmd/server...")
    chatroom.StartServer()
    os.Exit(0)
}

Create cmd/client/main.go:
package main

import (
    "fmt"
    "github.com/yourusername/chatroom/internal/chatroom"
)

func main() {
    fmt.Println("Starting client from cmd/client...")
    chatroom.StartClient()
}

Add a wrapper function in internal/chatroom/server.go:
package chatroom

func StartServer() {
    runServer()
}

With all your entry points created, your chatroom is complete and ready to test. The next step is learning how to test your implementation to ensure everything works correctly.
How to Test Your Chatroom
Testing a concurrent system like a chatroom requires a different approach than testing typical sequential code. You need to verify that goroutines coordinate correctly, messages arrive in the right order, and the system handles edge cases like disconnections.
How to Write Unit Tests
Unit tests verify individual components in isolation. For your chatroom, the most important test is verifying that messages broadcast correctly to all connected clients.
Create internal/chatroom/chatroom_test.go:
package chatroom

import (
    "strings"
    "testing"
    "time"
)

func TestBroadcast(t *testing.T) {
    cr, _ := NewChatRoom("./testdata")
    defer cr.shutdown()

    go cr.Run()

    // Create mock clients
    client1 := &Client{
        username: "Alice",
        outgoing: make(chan string, 10),
    }
    client2 := &Client{
        username: "Bob",
        outgoing: make(chan string, 10),
    }

    // Join clients
    cr.join <- client1
    cr.join <- client2
    time.Sleep(100 * time.Millisecond)

    // Broadcast message
    cr.broadcast <- "[Alice]: Hello!"

    // Verify both receive it. Joining also queues history and join
    // announcements on each client, so skip those until the broadcast arrives.
    for _, client := range []*Client{client1, client2} {
        timeout := time.After(1 * time.Second)
        for received := false; !received; {
            select {
            case msg := <-client.outgoing:
                received = strings.Contains(msg, "Hello!")
            case <-timeout:
                t.Fatalf("%s didn't receive message", client.username)
            }
        }
    }
}

Understanding the Test:
This test creates a chatroom instance and starts its event loop with go cr.Run(). Then it creates two mock clients. Notice these aren't real TCP connections – they're just Client structs with outgoing channels. This lets you test the broadcast logic without needing actual network connections.
The test sends both clients to the join channel, waits 100 milliseconds for them to be processed, then broadcasts a message. The select statements with timeout are crucial. They try to receive from each client's outgoing channel, but if nothing arrives within 1 second, the test fails. This prevents the test from hanging forever if something goes wrong.
The time.Sleep(100 * time.Millisecond) gives the event loop time to process the join events before broadcasting. In a real system, you'd use channels to synchronize, but for tests, a small sleep is acceptable.
Run tests with:
go test ./internal/chatroom -v

The -v flag shows verbose output, printing each test as it runs. You'll see whether the broadcast test passes and how long it took. Below is the output showing that the test passed:

How to Do Integration Testing
Integration tests verify the entire system working together – the real server, real clients, and real network connections. Unlike unit tests that mock components, integration tests exercise the full stack.
Test the full client-server flow:
# Terminal 1: Start server
go run cmd/server/main.go

# Terminal 2: Client 1
go run cmd/client/main.go
# Enter username: Alice

# Terminal 3: Client 2  
go run cmd/client/main.go
# Enter username: Bob

# Terminal 4: Client 3  
go run cmd/client/main.go
# Enter username: John

# Test messaging between clients

What to Test:
Once you have the server running and multiple clients connected, you can verify all the features you built. Here's what a complete test session looks like:

Basic Messaging: Send a message from Alice and verify Bob and John both receive it. You should see the message appear in all client windows with the sender's username in brackets. Try sending from each client to verify the broadcast works in all directions.

Join and Leave Announcements: When a new client connects, all existing clients should see a "joined the chat" announcement. When someone disconnects (either with /quit or by closing their terminal), everyone should see a "left the chat" message. This confirms your join and leave handlers work correctly.

Private Messaging: From Alice's client, run /msg Bob this is a private message. The message should appear only in Bob's window, not in John's or Alice's. Try sending private messages between different pairs of users to verify the routing works correctly. The sender should receive a confirmation that the message was sent.

User List: Run /users from any client. You should see a list of all connected users. If someone has been idle for over a minute, they should show an "(idle)" status. The command should also display total message count and server uptime.

Chat History: New clients should automatically receive the last 10 messages when they join. You can also use /history 20 to request the last 20 messages. This verifies your message persistence is working.

Session Reconnection: From one client, use /token to get your reconnection token. It will look something like reconnect:Alice:338f04ca.... Copy this token, disconnect the client with Ctrl+C, start a new client, and paste the reconnection string when prompted. You should rejoin the chat with your previous identity, and other users won't see duplicate join announcements.

Statistics: Use /stats to see how many messages you've sent and received, and when you were last active. This verifies that the per-client statistics tracking on the server works.

Error Handling: Try connecting with a username that's already in use – you should be rejected. Try sending a private message to a non-existent user – you should get an error. Try using an invalid reconnection token – you should be denied. These tests verify your validation logic works.


Look at the server terminal to see the server's perspective. You'll see connection logs, broadcast confirmations, and any errors. When clients disconnect, you should see their sessions being updated. When the server creates snapshots, you'll see those logged, too.
Integration testing catches problems that unit tests miss, like network timeouts, message ordering issues across multiple clients, or problems with how the WAL file is created and locked. The screenshot below shows a successful integration test with three clients (Alice, Bob, and John) all communicating successfully, with private messages, public broadcasts, and proper join/leave handling.

How to Deploy Your Server
Deploying your chatroom means running it on a server that stays up 24/7, automatically restarts if it crashes, and starts when the server boots. There are several approaches depending on your infrastructure.
How to Use Systemd
Systemd is the standard init system on most Linux distributions. It manages services, handles restarts, and ensures your chatroom starts on boot.
Create /etc/systemd/system/chatroom.service:
[Unit]
Description=Chatroom Server
After=network.target

[Service]
Type=simple
User=chatroom
WorkingDirectory=/opt/chatroom
ExecStart=/opt/chatroom/server
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Understanding the Configuration:
The [Unit] section describes the service and its dependencies. After=network.target ensures the network is up before starting your chatroom.
The [Service] section defines how to run your server. Type=simple means systemd should just run the command and consider it started. User=chatroom runs the server as a dedicated user (not root) for security. WorkingDirectory sets where the server runs, which is important because your WAL and snapshot files are created relative to this directory.
Restart=on-failure tells systemd to automatically restart your server if it crashes. RestartSec=5s waits 5 seconds before restarting, preventing rapid restart loops if there's a persistent problem.
The [Install] section makes your service start at boot when you enable it.
Deploying Your Server:
First, build your server binary:
go build -o server cmd/server/main.go

Then copy it to the deployment location:
sudo mkdir -p /opt/chatroom
sudo cp server /opt/chatroom/
sudo mkdir -p /opt/chatroom/chatdata

Create a dedicated user for running the service:
sudo useradd -r -s /bin/false chatroom
sudo chown -R chatroom:chatroom /opt/chatroom

Enable and start the service:
sudo systemctl enable chatroom
sudo systemctl start chatroom

Check that it's running:
sudo systemctl status chatroom

You can view logs with:
sudo journalctl -u chatroom -f

The -f flag follows the logs in real-time, similar to tail -f.
How to Use Docker
Docker packages your application with all its dependencies, making it easy to deploy anywhere that runs Docker.
Create a Dockerfile:
FROM golang:1.23-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o server cmd/server/main.go

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/server .
COPY --from=builder /app/chatdata ./chatdata
EXPOSE 9000
CMD ["./server"]

Understanding the Dockerfile:
This uses a multi-stage build. The first stage (builder) uses the full Go image to compile your server. The second stage uses a minimal Alpine Linux image and copies only the compiled binary. This keeps the final image small (about 20MB instead of 800MB).
EXPOSE 9000 documents which port the container uses. CMD ["./server"] specifies what command runs when the container starts.
Build and Run:
docker build -t chatroom .
docker run -p 9000:9000 -v $(pwd)/chatdata:/root/chatdata chatroom

The -p 9000:9000 maps port 9000 in the container to port 9000 on your host, making the chatroom accessible. The -v $(pwd)/chatdata:/root/chatdata mounts your local chatdata directory into the container, so messages persist even if you stop and remove the container.
Running in Production:
For production, you'd typically use Docker Compose or Kubernetes. Here's a simple docker-compose.yml:
version: '3.8'
services:
  chatroom:
    build: .
    ports:
      - "9000:9000"
    volumes:
      - ./chatdata:/root/chatdata
    restart: unless-stopped

Run with:
docker-compose up -d

The restart: unless-stopped policy ensures your container restarts automatically if it crashes or if the Docker daemon restarts.
Enhancements You Could Add
1. Multi-Room Support
You could add the concept of channels/rooms like this:
type ChatRoom struct {
    rooms map[string]*Room
}

type Room struct {
    name    string
    clients map[*Client]bool
    history []Message
}

2. User Authentication
You could replace simple usernames with proper authentication for added security:
type User struct {
    ID           int
    Username     string
    PasswordHash string
    Email        string
    CreatedAt    time.Time
}

3. File Sharing
You could allow users to upload files:
type FileMessage struct {
    Message
    FileName string
    FileSize int64
    FileURL  string
}

4. WebSocket Support
You could add an HTTP/WebSocket endpoint for web clients.
5. Horizontal Scaling
For massive scale, you could shard across multiple servers using Redis pub/sub or NATS for inter-server communication.
Conclusion
You've now built a production-ready distributed chatroom from scratch. This project demonstrates important distributed systems concepts including concurrency patterns, network programming, state management, persistence, and fault tolerance.
Additional resources:

Go Concurrency: "Concurrency in Go" by Katherine Cox-Buday

Distributed Systems: "Designing Data-Intensive Applications" by Martin Kleppmann

Networking: "Unix Network Programming" by Stevens


The full source code is available on GitHub. Feel free to open issues or contribute improvements.
As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on LinkedIn.
 


 How to Manage Blue-Green Deployments on AWS ECS with Database Migrations: Complete Implementation Guide 
Destiny Erhabor — Thu, 15 Jan 2026 18:25:13 +0000
 Blue-green deployments are celebrated for enabling zero-downtime releases and instant rollbacks. You deploy your new version (green) alongside the current one (blue), switch traffic over, and if something goes wrong, you switch back. Simple, right?
Not quite. While blue-green deployments work beautifully for stateless applications, they become significantly more complex when you introduce databases and stateful services into the equation. The moment your blue and green environments need to share a database, you're facing a fundamental challenge: how do you evolve your schema and data without breaking either version?
In this article, we'll tackle the real-world complexities of implementing blue-green deployments on Amazon ECS when your application depends on shared state. You'll learn practical strategies for handling database migrations, managing sessions, and maintaining data consistency across application versions.
💡 Complete Working Example: All code examples in this article are available in the bluegreen-deployment-ecs repository on GitHub. You can clone it and deploy the entire infrastructure to your AWS account.
Table of Contents

The Problem with State in Blue-Green Deployments

Database Migration Strategies for Blue-Green

Handling Stateful Services in ECS

Complete Implementation: End-to-End Example

Rollback Strategies

Monitoring During Deployments

Best Practices

When NOT to Use Blue-Green

Alternative Deployment Strategies

Cleanup

Conclusion

Further Resources


The Problem with State in Blue-Green Deployments
The elegance of blue-green deployments starts to crumble when you consider databases. Here's why: your blue environment runs application version 1, your green environment runs version 2, but they both connect to the same RDS instance.

Consider this scenario: you're adding a new feature that requires a new database column. Version 2 of your application expects this column to exist. You deploy green, run your migration to add the column, and switch traffic.
Everything works great until you need to roll back. Now version 1 is receiving traffic, but it doesn't know what to do with that new column. Worse, if your migration removed or renamed a column that version 1 depends on, your rollback will fail catastrophically.
Here are the specific challenges you'll face:

Schema versioning conflicts: Your blue environment expects schema version N, while green expects version N+1. Any breaking schema change will cause one environment to fail.

Data inconsistencies: If version 2 writes data in a new format that version 1 can't read, switching back to blue will result in errors or data corruption.

Irreversible migrations: Some database changes are inherently destructive. Dropping a column, changing data types, or restructuring tables can't be easily undone.

Failed rollbacks: The promise of instant rollback becomes hollow when your database has evolved beyond what the blue environment can handle.


Let's explore the strategies that solve these problems.
Database Migration Strategies for Blue-Green
Strategy 1: The Expand-Contract Pattern (Recommended)
The expand-contract pattern is the most practical approach for blue-green deployments with shared databases. It works by breaking schema changes into three phases, ensuring backwards compatibility throughout.
Phase 1: Expand
In this phase, you add new schema elements while keeping old ones intact. If you're renaming a column, add the new column without removing the old one. If you're changing table structure, create new tables alongside existing ones.
-- Example: Renaming 'user_name' to 'username'
-- Phase 1: Expand - Add new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Populate new column from old column
UPDATE users SET username = user_name WHERE username IS NULL;

At this point, your database supports both the old schema (used by blue) and the new schema (used by green). Your application code needs to handle both as well.
Phase 2: Deploy
Now, deploy your green environment with code that uses the new schema. But this code should still write to both old and new columns to maintain compatibility.
# Version 2 code - writes to both columns
def update_user(user_id, username):
    db.execute(
        "UPDATE users SET username = %s, user_name = %s WHERE id = %s",
        (username, username, user_id)
    )

Traffic shifts from blue to green. Both environments work because the database supports both schemas. If you need to roll back, blue still functions perfectly because the old columns are intact.
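To make the expand and deploy phases concrete, here is a minimal, self-contained sketch. It uses an in-memory SQLite database as a stand-in for the shared production database, and the function names (update_user_v2, read_user_v1, read_user_v2) are hypothetical, not taken from the companion repository.

```python
import sqlite3

# In-memory SQLite stands in for the shared database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.execute("INSERT INTO users (id, user_name) VALUES (1, 'alice')")

# Phase 1 (expand): add the new column and backfill it from the old one.
conn.execute("ALTER TABLE users ADD COLUMN username TEXT")
conn.execute("UPDATE users SET username = user_name WHERE username IS NULL")


def update_user_v2(user_id, username):
    """Version 2 (green): dual-write so both columns stay in sync."""
    conn.execute(
        "UPDATE users SET username = ?, user_name = ? WHERE id = ?",
        (username, username, user_id),
    )


def read_user_v1(user_id):
    """Version 1 (blue): only knows about the old column."""
    row = conn.execute(
        "SELECT user_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


def read_user_v2(user_id):
    """Version 2 (green): reads the new column."""
    row = conn.execute(
        "SELECT username FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


update_user_v2(1, "alice_smith")
print(read_user_v1(1), read_user_v2(1))  # alice_smith alice_smith
```

Because green writes both columns, a write made by version 2 is immediately visible to version 1, which is the property that keeps rollback safe.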
Phase 3: Contract
After you're confident green is stable and you've decommissioned blue, remove the old schema elements in a separate deployment.
-- Phase 3: Contract - Remove old column
ALTER TABLE users DROP COLUMN user_name;

Update your application code to stop writing to the old columns. This is now version 3, deployed as a standard release.
When to use: This should be your default approach for most schema changes including adding/removing columns, renaming fields, changing constraints, and restructuring tables.
Strategy 2: Parallel Schemas or Databases
For major breaking changes where backwards compatibility is impractical, you might maintain entirely separate database versions. Version 1 connects to database A, version 2 connects to database B. This approach requires data synchronization between databases. AWS Database Migration Service (DMS) can replicate data in near real-time, or you can build custom replication logic using change data capture.
# Configuration for version-specific database connections
DATABASE_CONFIG = {
    'v1': {
        'host': 'blue-db.cluster-xxxxx.us-east-1.rds.amazonaws.com',
        'database': 'app_v1'
    },
    'v2': {
        'host': 'green-db.cluster-yyyyy.us-east-1.rds.amazonaws.com',
        'database': 'app_v2'
    }
}

During the transition period, you run DMS to keep both databases synchronized, with the understanding that writes go to the active version's database.
The challenge is that you're now managing data synchronization, dealing with replication lag, and paying for two databases. Eventually, you need to consolidate back to one database, which requires another migration. This is expensive and complex, which is why it's the "nuclear option."
When to use: Only for major architectural changes, complete data model redesigns, or when migrating between database types (for example, MySQL to PostgreSQL). If expand-contract can possibly work, use that instead.
Strategy 3: Feature Flags for Gradual Rollout
Feature flags allow you to decouple deployment from release. Both blue and green run the same codebase, but features are toggled on or off via configuration. This shifts the problem from schema compatibility to code-level compatibility.
def create_user(user_data):
    config = get_feature_config()
    if config['use_new_user_schema']:
        return create_user_v2(user_data)
    else:
        return create_user_v1(user_data)

Instead of having two separate deployments (blue and green), you have ONE deployment with conditional logic. The "switch" from old to new behavior happens via configuration change, not infrastructure change. This is technically not pure blue-green, but it's a powerful hybrid approach.
How it works
Your application checks AWS AppConfig (or similar service) for feature flags before executing code paths. When a flag is off, it uses the old schema/logic. When on, it uses the new schema/logic. You can even enable features for a percentage of users (5% get new behavior, 95% get old behavior) for gradual rollout.
The tradeoff is that your codebase temporarily contains both old and new logic with conditional branches everywhere. This increases complexity and requires disciplined cleanup after the feature is fully released. However, you gain fine-grained control and can toggle features on/off instantly without deploying new infrastructure.
When to use: For large features with uncertain stability, gradual rollouts to monitor impact, or when you want instant rollback capability without touching infrastructure. Also useful when combined with expand-contract for extra safety.
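The percentage-based rollout mentioned above can be sketched with a stable hash, so that each user lands in the same bucket on every request. This is only an illustration: in_rollout and the rollout_percent parameter are hypothetical names, and a managed flag service would normally do this bucketing for you.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing the flag name plus user ID gives each user a stable bucket
    in the range 0-99, so the same user gets the same answer every time.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def create_user(user_data: dict, rollout_percent: int) -> str:
    # rollout_percent would come from your flag configuration (hypothetical wiring).
    if in_rollout("use_new_user_schema", user_data["id"], rollout_percent):
        return "v2"  # stand-in for create_user_v2(user_data)
    return "v1"  # stand-in for create_user_v1(user_data)
```

Raising the percentage only ever adds users to the new path, since a bucket below 5 is also below 50. Users who already saw the new behavior keep seeing it as the rollout widens.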
Handling Stateful Services in ECS
Beyond databases, several other stateful components require careful consideration during blue-green deployments.
Session Management
It’s a good idea to store sessions in ElastiCache or DynamoDB rather than application memory:
# Flask-Session's DynamoDB backend expects a boto3 resource rather than a client
app.config['SESSION_TYPE'] = 'dynamodb'
app.config['SESSION_DYNAMODB'] = boto3.resource('dynamodb')

Shared Resources
Beyond database sessions, your application likely depends on other stateful components that need coordination during blue-green deployments:
S3 buckets
If your application stores files or data in S3, schema changes to object metadata or file formats can cause compatibility issues between versions. To address this, you can enable S3 versioning to maintain multiple format versions simultaneously.
For example, if version 2 writes JSON files with a new structure, version 1 should still be able to read the old format. You can include a version prefix in object keys (like v1/user-data.json and v2/user-data.json) or embed version metadata in the objects themselves.
Message queues (SQS/SNS)
Messages sent by one version must be readable by the other during the transition. You can use versioned message schemas with a schema_version field in your message payload. Both blue and green should be able to parse messages from either version, even if they only produce messages in their preferred format. Consider using a schema registry or validation library to ensure compatibility.
Cache layers (ElastiCache/Redis)
Cached data structure changes can cause deserialization errors when switching between versions. Try versioning your cache keys by including the schema version: CACHE_VERSION = 'v2' and then cache_key = f"user:{CACHE_VERSION}:{user_id}". This ensures blue and green maintain separate cache namespaces, preventing cross-contamination. When you fully migrate to green, you can flush the old cache keys or let them expire naturally.
CACHE_VERSION = 'v2'
cache_key = f"user:{CACHE_VERSION}:{user_id}"
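For the message queue case described above, a consumer that accepts both schema versions might look like the following sketch. The event fields (user_id, user_name, username) are illustrative assumptions that reuse the column rename from the expand-contract example, and normalize_user_event is a hypothetical helper.

```python
import json


def normalize_user_event(raw: str) -> dict:
    """Parse a user event from either schema version into one internal shape."""
    msg = json.loads(raw)
    # Assumption: messages without schema_version predate versioning (v1).
    version = msg.get("schema_version", 1)
    if version == 1:
        return {"user_id": msg["user_id"], "username": msg["user_name"]}
    if version == 2:
        return {"user_id": msg["user_id"], "username": msg["username"]}
    # Unknown versions are rejected explicitly rather than guessed at.
    raise ValueError(f"unsupported schema_version: {version}")


v1 = '{"user_id": 7, "user_name": "alice"}'
v2 = '{"schema_version": 2, "user_id": 7, "username": "alice"}'
print(normalize_user_event(v1) == normalize_user_event(v2))  # True
```

Both blue and green can ship this parser while each keeps producing only its own format.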

Implementation: End-to-End Example
Let's walk through a complete blue-green deployment with ECS, handling a database schema change using the expand-contract pattern. We'll migrate from a single address text field to structured street_address, city, state, and zip_code fields.

Here’s the scenario: You're running an e-commerce application on ECS. The current version (blue) stores customer addresses in a single address text field. Version 2 (green) splits this into structured fields: street_address, city, state, and zip_code.
Architecture Setup

Your infrastructure includes:

ECS cluster running Fargate tasks

Application Load Balancer with two target groups (blue and green)

RDS PostgreSQL database (shared between environments)

CodeDeploy for managing traffic shifts

Parameter Store for database connection strings


💡 Implementation Note: The complete Terraform code for this architecture is available in the companion GitHub repository.
Prerequisites
Before starting, make sure that you have the following tools installed and your AWS credentials properly configured:
# Required tools
aws --version      # AWS CLI
terraform --version # Terraform >= 1.0
docker --version   # Docker
psql --version     # PostgreSQL client

# Configure AWS credentials
aws configure
aws sts get-caller-identity  # Verify your identity

Step 1: Deploy Infrastructure and Blue Environment
We’ll start by setting up the entire AWS infrastructure from scratch using Terraform, then deploying the initial version of our application (blue environment).
First, clone the repository and set up your environment:
# Clone the repository
git clone https://github.com/Caesarsage/bluegreen-deployment-ecs.git
cd bluegreen-deployment-ecs

# Create terraform variables
cd terraform
cat > terraform.tfvars <<EOF
aws_region         = "us-east-1"  # confirm the variable name against variables.tf
project_name       = "ecommerce-bluegreen"
environment        = "production"
vpc_cidr           = "10.0.0.0/16"

# Database credentials (CHANGE THESE!)
db_username = "dbadmin"
db_password = "ChangeThisPassword123!"

# Container configuration
container_image = "PLACEHOLDER"  # Will update after building image
container_port  = 8080

# Scaling configuration
desired_count = 2
cpu           = "256"
memory        = "512"

# Notifications
notification_email = "your-email@example.com"
EOF

Security Note: Never commit terraform.tfvars to Git. It's already in .gitignore.
Next, initialize Terraform and create the ECR repository:
# Initialize Terraform
terraform init
terraform validate

# Create ECR repository
terraform apply -target=aws_ecr_repository.app

# Get ECR repository URL
export ECR_REPO=$(terraform output -raw ecr_repository_url)
echo "ECR Repository: $ECR_REPO"

We create the ECR repository first because we need somewhere to push our Docker image. Then we'll build the image, push it, and finally deploy the rest of the infrastructure that depends on that image existing.
Build and push the initial application like this:

cd ..  # Back to project root

# Set variables
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export ECR_REPOSITORY=ecommerce-bluegreen
export IMAGE_TAG=v1.0.0

# Login to ECR
aws ecr get-login-password --region $AWS_REGION | \
    docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# Build the image
docker build --platform linux/amd64 -t $ECR_REPOSITORY:$IMAGE_TAG -f docker/Dockerfile .

# Tag and push to ECR
docker tag $ECR_REPOSITORY:$IMAGE_TAG \
    $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

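# Replace the PLACEHOLDER line in terraform.tfvars with the image URL
# (appending a second container_image line would define the attribute twice)
sed -i.bak "s|^container_image .*|container_image = \"$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG\"|" terraform/terraform.tfvars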


The application code is a Flask application that handles both old and new schema formats based on the APP_VERSION environment variable.
Now deploy the complete infrastructure:
cd terraform
terraform apply  # Takes ~15-20 minutes

# Get outputs
export ALB_URL=$(terraform output -raw alb_url)
export TEST_URL=$(terraform output -raw test_url)
export DB_ENDPOINT=$(terraform output -raw db_endpoint)
export ECR_URL=$(terraform output -raw ecr_repository_url)
export BASTION_IP=$(terraform output -raw bastion_public_ip)

echo "Application URL: $ALB_URL"
echo "Test URL: $TEST_URL"
echo "Database Endpoint: $DB_ENDPOINT"



The production listener (port 80) is what your users hit. The test listener (port 8080) lets you test the green environment before shifting production traffic to it. This is crucial for validation.
You can see the complete Terraform configuration in the terraform directory of the repository.
Step 2: Initialize Database Schema
Now you’ll need to initialize the database with the schema for version 1 (blue). We'll use the bastion host for secure access:
# Copy the schema and migration files from your local machine to the bastion host

scp -i ~/.ssh/id_rsa docker/init.sql ec2-user@$BASTION_IP:/tmp/
scp -i ~/.ssh/id_rsa migrations/*.sql ec2-user@$BASTION_IP:/tmp/

# Then SSH into the bastion
ssh -i ~/.ssh/id_rsa ec2-user@$BASTION_IP

# Inside the bastion:
export DB_ENDPOINT=""  # paste the db_endpoint value from terraform output
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f /tmp/init.sql

# Verify
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

# Exit the bastion
exit


Step 3: Verify Blue Environment
We’ll want to test that everything works before we start the migration. This is your baseline: you want to confirm that the current system is healthy before introducing changes.
# Check health
curl $ALB_URL/health | jq

# Expected response:
# {
#   "status": "healthy",
#   "version": "blue",
#   "environment": "production",
#   "database": "connected",
#   "schema": "compatible"
# }

# Create a customer with the old schema (single address field)
curl -X POST $ALB_URL/api/customers \
    -H "Content-Type: application/json" \
    -d '{
      "name": "John Doe",
      "email": "john@example.com",
      "address": "123 Main St, New York, NY, 10001"
    }' | jq

# List customers
curl $ALB_URL/api/customers | jq


Step 4: Expand Phase – Add New Columns
This is the first phase of expand-contract. We're adding the new columns WITHOUT removing the old one, creating a database schema that supports both blue and green simultaneously.
Run the expand migration (migrations/001_expand_address.sql):
-- Migration: 001_expand_address.sql
BEGIN;

ALTER TABLE customers 
  ADD COLUMN street_address VARCHAR(255),
  ADD COLUMN city VARCHAR(100),
  ADD COLUMN state VARCHAR(2),
  ADD COLUMN zip_code VARCHAR(10);

-- Populate new columns from existing data
-- This uses a simple parsing strategy; yours might be more sophisticated

UPDATE customers 
SET 
  street_address = SPLIT_PART(address, ',', 1),
  city = TRIM(SPLIT_PART(address, ',', 2)),
  state = TRIM(SPLIT_PART(address, ',', 3)),
  zip_code = TRIM(SPLIT_PART(address, ',', 4))
WHERE address IS NOT NULL;

COMMIT;

Critical observation: We're NOT dropping the address column. It's still there. Blue continues reading and writing to it, completely unaware that new columns exist. This is what makes the migration safe – nothing breaks.
# SSH into the bastion and run the migration
ssh -i ~/.ssh/id_rsa ec2-user@$BASTION_IP

# Inside the bastion:
export DB_ENDPOINT=""  # paste the db_endpoint value from terraform output

psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f /tmp/001_expand_address.sql

# Verify new columns exist
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

exit


Verification: The \d customers command shows the table structure. You should see BOTH the old address column AND the new street_address, city, state, zip_code columns. This confirms the expand phase worked.
The database now supports both old (blue) and new (green) schemas. Blue is still running and working perfectly, and nothing has changed from its perspective.
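The SPLIT_PART backfill assumes every address has exactly four comma-separated parts. Before running it against real data, it can help to dry-run the same parsing in application code and flag rows that would not fit. The sketch below mirrors the SQL under that assumption; split_address and backfill_problems are hypothetical helpers, and the length limits come from the column definitions in the migration.

```python
def split_address(address: str) -> dict:
    """Mirror the SPLIT_PART backfill: comma-separated parts 1 through 4."""
    parts = address.split(",")
    parts += [""] * (4 - len(parts))  # SPLIT_PART returns '' for a missing part
    return {
        "street_address": parts[0],  # the SQL does not trim this field
        "city": parts[1].strip(" "),
        "state": parts[2].strip(" "),
        "zip_code": parts[3].strip(" "),
    }


def backfill_problems(address: str) -> list:
    """List reasons this address would backfill badly, or [] if it looks fine."""
    problems = []
    if len(address.split(",")) != 4:
        problems.append("expected 4 comma-separated parts")
    fields = split_address(address)
    if len(fields["state"]) > 2:
        problems.append("state exceeds VARCHAR(2)")
    if len(fields["zip_code"]) > 10:
        problems.append("zip_code exceeds VARCHAR(10)")
    return problems


print(backfill_problems("123 Main St, New York, NY, 10001"))
print(backfill_problems("123 Main St, Apt 4, New York, NY, 10001"))
```

The second address has an extra comma, so "New York" would land in the two-character state column. Finding such rows ahead of time lets you clean them up or adjust the parsing before the migration runs.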
Step 5: Build and Deploy Green Environment
Now we’ll build version 2 of our application that knows how to work with the new structured address fields, while maintaining backwards compatibility with the old schema.
Start by building version 2 with structured address support:
cd ..  # Back to project root

# Build new version
export IMAGE_TAG=v2.0.0

docker build --platform linux/amd64 -t $ECR_REPOSITORY:$IMAGE_TAG -f docker/Dockerfile .

docker tag $ECR_REPOSITORY:$IMAGE_TAG \
    $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:$IMAGE_TAG

What’s different is that the v2 application code now has logic that:

Reads from the new structured columns (street_address, city, and so on)

Writes to BOTH new columns AND the old address column

Accepts API requests with structured address format


Why write to both: This is crucial. Even though green prefers the new format, it maintains the old format, too. If you need to roll back to blue, all the data blue needs is there and up-to-date. Without this, rollback would be impossible: blue would see empty or stale address fields.
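As a rough illustration of the dual write, the helper below turns a structured address payload into values for both the new columns and the legacy address column. The function name is hypothetical and the repository's actual implementation may differ. The payload keys match the structured API request used later in this walkthrough, and the legacy string follows the "street, city, state, zip" shape of the sample data.

```python
def customer_address_columns(address: dict) -> dict:
    """Return values for the new columns plus the legacy address column."""
    legacy = ", ".join(
        [address["street"], address["city"], address["state"], address["zip"]]
    )
    return {
        "street_address": address["street"],
        "city": address["city"],
        "state": address["state"],
        "zip_code": address["zip"],
        "address": legacy,  # kept in sync so blue can still read it
    }


cols = customer_address_columns(
    {"street": "456 Oak Ave", "city": "Los Angeles", "state": "CA", "zip": "90001"}
)
print(cols["address"])  # 456 Oak Ave, Los Angeles, CA, 90001
```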
Now create and register the green task definition:
cd terraform

# Get necessary ARNs
EXECUTION_ROLE_ARN=$(terraform output -raw ecs_task_execution_role_arn)
TASK_ROLE_ARN=$(terraform output -raw ecs_task_role_arn)
DB_SECRET_ARN=$(terraform output -raw db_secret_arn)

# Create task definition
cat > task-def-green.json <<EOF
{
  "family": "ecommerce-bluegreen",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "${EXECUTION_ROLE_ARN}",
  "taskRoleArn": "${TASK_ROLE_ARN}",
  "containerDefinitions": [{
    "name": "app",
    "image": "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ECR_REPOSITORY}:${IMAGE_TAG}",
    "essential": true,
    "portMappings": [{
      "containerPort": 8080,
      "protocol": "tcp"
    }],
    "environment": [
      {"name": "APP_VERSION", "value": "green"},
      {"name": "ENVIRONMENT", "value": "production"},
      {"name": "AWS_REGION", "value": "${AWS_REGION}"},
      {"name": "DB_HOST", "value": "${DB_ENDPOINT}"},
      {"name": "DB_PORT", "value": "5432"},
      {"name": "DB_NAME", "value": "ecommerce"}
    ],
    "secrets": [
      {
        "name": "DB_USER",
        "valueFrom": "${DB_SECRET_ARN}:username::"
      },
      {
        "name": "DB_PASSWORD",
        "valueFrom": "${DB_SECRET_ARN}:password::"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/ecommerce-bluegreen",
        "awslogs-region": "${AWS_REGION}",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3,
      "startPeriod": 60
    }
  }]
}
EOF

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-def-green.json

This JSON tells ECS everything about how to run your container:

Which Docker image to use (the v2.0.0 we just built)

How much CPU/memory to allocate (256 CPU units = 0.25 vCPU)

Environment variables (notice APP_VERSION is set to "green")

Secrets (database credentials pulled from AWS Secrets Manager)

Health check configuration (curl the /health endpoint every 30 seconds)

Logging configuration (send logs to CloudWatch)


Key detail: The APP_VERSION environment variable is how the application knows whether to behave as blue or green. Same codebase, different behavior based on configuration.
Step 6: Execute Blue-Green Deployment
Alright, now it’s time to create the AppSpec file and trigger the deployment:
TASK_DEF_ARN=$(aws ecs describe-task-definition \
  --task-definition ecommerce-bluegreen \
  --query 'taskDefinition.taskDefinitionArn' \
  --output text)

cat > appspec.json <<EOF
{
  "version": 0.0,
  "Resources": [{
    "TargetService": {
      "Type": "AWS::ECS::Service",
      "Properties": {
        "TaskDefinition": "${TASK_DEF_ARN}",
        "LoadBalancerInfo": {
          "ContainerName": "app",
          "ContainerPort": 8080
        }
      }
    }
  }]
}
EOF

# Deploy
APPSPEC=$(jq -c . appspec.json)
# jq -Rs encodes the compact AppSpec as a single JSON string for the content field
aws deploy create-deployment \
  --application-name ecommerce-bluegreen \
  --deployment-group-name ecommerce-bluegreen-deployment-group \
  --deployment-config-name CodeDeployDefault.ECSLinear10PercentEvery3Minutes \
  --description "Blue-green deployment to structured address schema" \
  --cli-input-json "{
    \"revision\": {
      \"revisionType\": \"AppSpecContent\",
      \"appSpecContent\": {
        \"content\": $(printf '%s' "$APPSPEC" | jq -Rs .)
      }
    }
  }"

DEPLOYMENT_ID=$(aws deploy list-deployments \
    --application-name ecommerce-bluegreen \
    --deployment-group-name ecommerce-bluegreen-deployment-group \
    --query 'deployments[0]' --output text)

Monitor the deployment:
# Watch status
watch -n 10 "aws deploy get-deployment --deployment-id $DEPLOYMENT_ID \
    --query 'deploymentInfo.status' --output text"

# Monitor traffic distribution
while true; do
    echo "Production: $(curl -s $ALB_URL/health | jq -r '.version')"
    echo "Test: $(curl -s $TEST_URL/health | jq -r '.version')"
    sleep 30
done

The deployment shifts 10% of traffic every 3 minutes, so the full shift takes roughly 30 minutes.
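To reason about how much traffic green carries at a given moment, you can tabulate a linear schedule. The sketch below is plain arithmetic, not a CodeDeploy API call, and it assumes the first increment is applied at minute 0; linear_schedule is a hypothetical helper.

```python
def linear_schedule(step_percent: int, interval_minutes: int) -> list:
    """Return (minutes_elapsed, percent_on_green) pairs for a linear shift.

    Assumes the first increment is applied at minute 0.
    """
    schedule = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


for minute, percent in linear_schedule(10, 3):
    print(f"t+{minute:>2} min: {percent}% on green")
```

Under that assumption the last increment lands 27 minutes after the first, which lines up with the roughly half-hour window you should plan to spend watching the deployment.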
Step 7: Validate Green Environment
After the deployment begins, you need to validate that the green environment is functioning correctly with the new structured address format before allowing production traffic to reach it.
The CodeDeploy console shows the traffic migration and deployment status.

We can also test through the test listener (port 8080), which provides isolated access to green tasks:
# Test new structured address API
curl -X POST $TEST_URL/api/customers \
    -H "Content-Type: application/json" \
    -d '{
      "name": "Jane Smith",
      "email": "jane@example.com",
      "address": {
        "street": "456 Oak Ave",
        "city": "Los Angeles",
        "state": "CA",
        "zip": "90001"
      }
    }' | jq

curl $ALB_URL/api/customers | jq


What you're validating:

The green environment accepts the new structured address format

Data is correctly written to both new columns (street_address, city, state, zip_code) and the old address column for backwards compatibility

The API response matches expectations for the new schema

Existing data from blue environment is still accessible and readable


If any of these tests fail, you can stop the deployment before production traffic reaches green, preventing customer impact.
Step 8: Post-Deployment Validation
Once CodeDeploy completes the traffic shift, all production requests route to green. This is your opportunity to verify that the deployment was successful and that the new version is handling real production traffic correctly.
# Verify all production traffic goes to green
# Running this multiple times confirms consistent routing
for i in {1..10}; do
    curl -s $ALB_URL/health | jq -r '.version'
done
# Expected output: "green" for all 10 requests

# Test complete CRUD operations with the new API
# Create a customer with structured address
CUSTOMER_ID=$(curl -s -X POST $ALB_URL/api/customers \
    -H "Content-Type: application/json" \
    -d '{"name": "Test User", "email": "test@example.com",
         "address": {"street": "789 Test St", "city": "Test City", 
         "state": "TX", "zip": "75001"}}' | jq -r '.id')

# Read the customer back to verify data persistence
curl $ALB_URL/api/customers/$CUSTOMER_ID | jq

# Update the customer to test modification
curl -X PUT $ALB_URL/api/customers/$CUSTOMER_ID \
    -H "Content-Type: application/json" \
    -d '{"address": {"street": "999 Updated Ave", "city": "Test City", 
         "state": "TX", "zip": "75001"}}' | jq

# Delete the test customer for cleanup
curl -X DELETE $ALB_URL/api/customers/$CUSTOMER_ID


What you're validating:

Traffic routing is 100% to green with no requests reaching blue

Create operations work with the new structured address format

Read operations return correct data with proper address structure

Update operations successfully modify existing records

Delete operations work without errors

The application correctly writes to both new columns and old address column (enabling potential rollback)


Check your CloudWatch logs and metrics during this validation period for any unexpected errors, increased latency, or database connection issues.
Step 9: Contract Phase (After 24-72 Hours)
This is the final phase of expand-contract. We're removing the old address column now that we're confident green is stable. This is the point of no return.
CRITICAL: Only proceed after green has been stable for your confidence period!
# Backup database first (store the identifier so the wait targets the same snapshot)
SNAPSHOT_ID=pre-contract-$(date +%Y%m%d-%H%M%S)
aws rds create-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db \
    --db-snapshot-identifier $SNAPSHOT_ID

# Wait for snapshot
aws rds wait db-snapshot-completed \
    --db-snapshot-identifier $SNAPSHOT_ID

# Run contract migration
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f /tmp/002_contract_address.sql

# Verify old column is gone
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

The contract migration (migrations/002_contract_address.sql) removes the old address column.

Why wait 24-72 hours: You want to be absolutely certain green is stable before making irreversible changes. During this waiting period:

All your monitoring should show green performing normally

You've seen the system handle multiple daily traffic patterns (morning peak, evening peak, overnight)

Scheduled batch jobs that fall within the window have run successfully

You've verified third-party integrations work

No unusual errors or performance degradation


It’s important to snapshot first because once you drop that column, there's no undo button. The snapshot is your safety net. If you discover a critical issue after contracting, you can restore this snapshot and get back to a state where rollback is possible. Without it, you're gambling.
What the contract migration does:
-- migrations/002_contract_address.sql
BEGIN;
ALTER TABLE customers DROP COLUMN address;
COMMIT;

It's simple but permanent. The old address column is gone. The Blue environment will no longer work with this database, as it expects that column to exist. This is fine because blue has been decommissioned (no traffic, tasks terminated).
What to update: Version 2 (green) is still writing to both the new columns and the old address column, so you should also deploy version 3 of your application, which removes the dual-write logic. Unless the application tolerates the missing column, writes that still reference address will fail once it is dropped, so have version 3 live before the column goes away.
Your migration is now complete!
Rollback Strategies
During Deployment (Safe Window)
Use this strategy when you detect issues during the traffic shift, before all traffic has moved to green. CodeDeploy is still managing the deployment, which means it can automatically revert traffic distribution to the previous state.
# Immediate rollback
aws deploy stop-deployment \
    --deployment-id $DEPLOYMENT_ID \
    --auto-rollback-enabled

You should use this strategy when you notice increased error rates, degraded performance, or functional issues during the canary or linear traffic shift. CodeDeploy automatically shifts all traffic back to blue, and green tasks are terminated. This is the safest and fastest rollback option.
This works because the database still contains the old address column (expand phase), so blue can function normally. No data has been lost or made incompatible.
After Deployment (Before Contract)
Use this when the deployment completed successfully, but you discover issues hours or days later during the monitoring period, before you've run the contract migration. Both blue and green environments still exist, and the database supports both schemas.
# Manual listener update
aws elbv2 modify-listener \
    --listener-arn $(terraform output -raw alb_listener_arn) \
    --default-actions Type=forward,TargetGroupArn=$(terraform output -raw blue_target_group_arn)

Or use the provided script:
cd scripts
./rollback.sh

Use this when you discover bugs in green that weren't caught during initial testing, business metrics show unexpected changes (conversion rates drop, customer complaints increase), or third-party integration issues emerge.
This works because the database still has both old and new schema elements. Blue tasks still exist and can serve traffic immediately. Because green was writing to both old and new columns, blue sees all the latest data.
With this, the traffic immediately shifts from green back to blue. Green continues running for observability, but serves no traffic. You can debug green in place without customer impact.
After Contract Phase
Use this as a last resort when you've already removed the old address column, and blue can no longer function with the current database schema. This is significantly more complex and time-consuming than the previous two strategies.
# Restore from snapshot
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db-restored \
    --db-snapshot-identifier pre-contract-YYYYMMDD-HHMMSS

Only use this strategy when you discover a critical, production-breaking issue after the contract phase, and you have no other option but to return to the previous version.
Why it's painful:

Database restore takes 10-30 minutes depending on size

You lose all data written after the snapshot was taken

Requires updating connection strings to point to the restored instance

Need to re-deploy blue environment

Must communicate downtime to users


This is why you wait 24-72 hours before contracting, and take a snapshot immediately before the contract migration. The lengthy waiting period allows you to catch most issues while the safer rollback strategies are still available.
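The three rollback windows described above can be summarized as a small decision helper. This is only an illustrative sketch: the `Phase` names and the `choose_rollback` function are hypothetical and not part of the tutorial's repository.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical labels for where the release currently stands."""
    SHIFTING_TRAFFIC = "shifting_traffic"   # CodeDeploy still manages the deployment
    PRE_CONTRACT = "pre_contract"           # green is live, old column still present
    CONTRACTED = "contracted"               # old address column already dropped


def choose_rollback(phase: Phase) -> str:
    """Return the rollback strategy this section recommends for a phase."""
    if phase is Phase.SHIFTING_TRAFFIC:
        return "stop the CodeDeploy deployment with auto-rollback"
    if phase is Phase.PRE_CONTRACT:
        return "point the ALB listener back at the blue target group"
    return "restore the database from the pre-contract snapshot"


for phase in Phase:
    print(f"{phase.value}: {choose_rollback(phase)}")
```

The ordering reflects cost: each later phase has a slower and riskier way back, which is the argument for delaying the contract migration.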
Monitoring During Deployments
Essential Metrics
During a blue-green deployment, you need to monitor both environments simultaneously to detect issues early and make informed decisions about proceeding or rolling back. For each target group (blue and green), track these CloudWatch metrics:
1. TargetResponseTime
Measures latency from when the load balancer sends a request to when it receives a response. You're looking for sudden spikes or gradual degradation. Green should have similar response times to blue (within 10-20%). If green's latency is significantly higher, you may have performance regressions, inefficient queries with the new schema, or resource constraints.
2. RequestCount
Shows traffic volume hitting each target group. During the deployment, you should see blue's count decreasing while green's increases proportionally. If the numbers don't add up (total requests drop significantly), users might be experiencing errors and not retrying. If green should be receiving traffic but shows zero requests, its targets are probably failing health checks.
3. HTTPCode_Target_5XX_Count
Server errors indicate application problems. Even a single 5XX error during deployment warrants investigation. Green should have zero 5XX errors during the initial traffic shift. Any errors could indicate incompatibility issues with the new schema, missing environment variables, or database connection problems.
4. DatabaseConnections (from RDS metrics)
Shows active database connections from both environments. Watch for connection pool exhaustion, which manifests as a sudden spike or plateau at your max connections limit. If green uses more connections than blue did, you might have connection leaks or inefficient connection handling in the new code.
5. CPUUtilization
Monitor both ECS task CPU and RDS CPU. Green tasks should use similar CPU to blue tasks for the same request volume. Higher CPU might indicate less efficient code or more complex queries. RDS CPU spikes during deployment often indicate poorly optimized new queries or missing indexes for the new schema.
What to expect:

First 5-10 minutes: Green receives 10% traffic, metrics should closely match blue's baseline

15-20 minutes: Green at 30-50% traffic, both environments should show stable metrics

25-30 minutes: Green at 100% traffic, metrics should stabilize at historical levels

Any divergence from these patterns warrants stopping the deployment and investigating


Custom application metrics: Beyond infrastructure metrics, monitor business-critical metrics like checkout completion rates, API success rates, and user sign-up flows. Sometimes technical metrics look fine but user-facing functionality is broken.
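The RequestCount guidance (blue's count falling while green's rises, with the total holding steady) comes down to simple arithmetic. Here is a minimal sketch, assuming you have already pulled per-target-group request counts from CloudWatch. The function names and the 10% allowed drop are assumptions for illustration.

```python
def green_share(blue_requests: int, green_requests: int) -> float:
    """Fraction of observed requests that were served by green."""
    total = blue_requests + green_requests
    return green_requests / total if total else 0.0


def traffic_accounted_for(baseline: int, blue_requests: int,
                          green_requests: int, max_drop: float = 0.10) -> bool:
    """True if blue + green stays within `max_drop` of the usual volume."""
    return (blue_requests + green_requests) >= baseline * (1 - max_drop)


# Hypothetical numbers for a five-minute window
print(green_share(900, 100))                  # 0.1
print(traffic_accounted_for(1000, 880, 100))  # True
print(traffic_accounted_for(1000, 600, 100))  # False
```

A falling total with a healthy-looking green share is the pattern to worry about, because it suggests requests are being lost rather than shifted.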
Best Practices
Test Migrations in Staging
Always run your database migrations against a staging environment that mirrors production scale and complexity before touching production. Copy a recent production snapshot to staging and execute your expand migration there first.
Why this matters: Migrations that work fine on small datasets can timeout or lock tables on production-scale data. You might discover that adding an index to a 50-million-row table takes 2 hours, or that your column population query needs optimization.
What to test:

Migration execution time (should complete in seconds/minutes, not hours)

Table locks and their impact (can reads/writes continue during migration?)

Query performance with new schema (are your indexes still effective?)

Rollback procedures (can you undo the migration if needed?)


Use Migration Tools
Don't write raw SQL migrations manually. Use Flyway, Liquibase, Alembic (for Python), or your framework's built-in migration tools (Rails migrations, Django migrations, Entity Framework migrations).
Why this matters: Migration tools provide version tracking, rollback capabilities, checksums to prevent tampering, and a standardized way to manage schema changes across environments.
Configure Health Checks Properly
Your health check endpoint should verify that the application can actually function, not just that the process is running. A comprehensive health check validates database connectivity, schema compatibility, and dependent service availability.
@app.route('/health')
def health_check():
    checks = {
        'database': check_database(),
        'schema': check_schema_compatibility(),
        'cache': check_cache_connection()
    }

    if all(checks.values()):
        return jsonify(checks), 200
    else:
        return jsonify(checks), 503

def check_schema_compatibility():
    """Verify expected schema elements exist"""
    try:
        result = db.query("""
            SELECT column_name 
            FROM information_schema.columns 
            WHERE table_name = 'customers'
            AND column_name IN ('street_address', 'city', 'state', 'zip_code')
        """)
        return len(result) == 4
    except Exception:
        # Treat any query failure as "schema not compatible"
        return False

For ALB health checks specifically, make sure you configure appropriate thresholds in your target group settings. A healthy threshold of 2 means the target must pass 2 consecutive health checks before receiving traffic. An unhealthy threshold of 3 means it must fail 3 consecutive checks before being removed. Set your interval to 30 seconds and timeout to 5 seconds to balance responsiveness with stability.
# Terraform configuration for ALB health checks
resource "aws_lb_target_group" "green" {
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }
}

This configuration ensures that ECS tasks aren't marked healthy prematurely (preventing traffic to broken tasks) while also not being overly sensitive to transient issues (preventing unnecessary task replacements).
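To see what these thresholds mean in wall-clock time, multiply the interval by the threshold. The sketch below is a rough estimate only. It ignores when the first check fires and how long each check takes, and the function names are made up for illustration.

```python
def seconds_to_healthy(interval: int, healthy_threshold: int) -> int:
    """Approximate time for a new target to start receiving traffic."""
    return interval * healthy_threshold


def seconds_to_unhealthy(interval: int, unhealthy_threshold: int) -> int:
    """Approximate time for a failing target to be taken out of rotation."""
    return interval * unhealthy_threshold


# Values from the Terraform configuration above
print(seconds_to_healthy(30, 2))    # 60
print(seconds_to_unhealthy(30, 3))  # 90
```

So with these settings a broken task can keep receiving traffic for roughly a minute and a half, which is worth keeping in mind when you choose how fast to shift traffic.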
Plan the Contract Phase
The contract phase is irreversible, so treat it with appropriate caution. Wait a minimum of 24-72 hours after green deployment before removing old schema elements. This waiting period isn't arbitrary: it ensures you've observed the system under various conditions.
What to verify before contracting:

Green has handled multiple daily traffic patterns (morning rush, evening peak, overnight batch jobs)

All scheduled jobs and cron tasks have run successfully with the new schema

Weekly reports or analytics pipelines have completed

Third-party integrations (payment processors, shipping APIs, analytics tools) are working

No unusual error patterns in logs

Business metrics (conversions, sign-ups, purchases) remain stable

Customer support hasn't reported related issues
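The waiting period and the verification list can be combined into a single gate. This is a hedged sketch rather than anything from the project's scripts: `ready_to_contract` and the check names are hypothetical, and the checks themselves would be filled in by a person or by your monitoring.

```python
from datetime import datetime, timedelta


def ready_to_contract(deployed_at: datetime, now: datetime,
                      checks: dict[str, bool], min_hours: int = 24) -> bool:
    """True only if the soak period has elapsed and every check passed."""
    soaked = now - deployed_at >= timedelta(hours=min_hours)
    return soaked and all(checks.values())


# Hypothetical example values
deployed = datetime(2025, 1, 6, 9, 0)
checks = {"cron_jobs_ok": True, "integrations_ok": True, "metrics_stable": True}

print(ready_to_contract(deployed, datetime(2025, 1, 6, 21, 0), checks))  # False (12h)
print(ready_to_contract(deployed, datetime(2025, 1, 8, 9, 0), checks))   # True (48h)
```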


The pre-contract checklist:
# 1. Create a final snapshot
aws rds create-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db \
    --db-snapshot-identifier pre-contract-$(date +%Y%m%d-%H%M%S)

# 2. Document current state
echo "Green tasks: $(aws ecs describe-services --cluster ecommerce --services ecommerce-green | jq '.services[0].runningCount')"
echo "Error rate: $(aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Sum)"

# 3. Notify team
echo "Running contract migration at $(date)"

# 4. Run migration
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -f migrations/002_contract_address.sql

# 5. Verify
psql -h $DB_ENDPOINT -U dbadmin -d ecommerce -c "\d customers"

Version Your APIs
When changing data formats, maintain backward compatibility by supporting both old and new API versions simultaneously. This allows API consumers (mobile apps, third-party integrations, other services) to migrate at their own pace without coordinating releases.
# Support both API versions during transition
@app.route('/api/v1/customers/<id>')
def get_customer_v1(id):
    customer = Customer.find(id)
    return jsonify({
        'id': customer.id,
        'name': customer.name,
        'address': customer.address  # Old format
    })

@app.route('/api/v2/customers/<id>')
def get_customer_v2(id):
    customer = Customer.find(id)
    return jsonify({
        'id': customer.id,
        'name': customer.name,
        'address': {  # New structured format
            'street': customer.street_address,
            'city': customer.city,
            'state': customer.state,
            'zip': customer.zip_code
        }
    })

To implement this, first deploy both endpoints with blue-green, then monitor usage of the v1 endpoint over time. Once v1 traffic drops below 1% (meaning clients have migrated), deprecate it formally. Remove the v1 endpoint in a subsequent release, not during the blue-green deployment itself.
Announce the new API version to consumers with a migration timeline. Give them 2-3 months to update their integrations. Send reminder emails at the halfway point and 2 weeks before v1 shutdown.
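The 1% rule is easy to turn into a check you can run against your request logs or metrics. A minimal sketch, assuming you can count requests per API version; the function names are hypothetical.

```python
def v1_share(v1_requests: int, v2_requests: int) -> float:
    """Fraction of customer API traffic still hitting the v1 endpoint."""
    total = v1_requests + v2_requests
    return v1_requests / total if total else 0.0


def can_deprecate_v1(v1_requests: int, v2_requests: int,
                     threshold: float = 0.01) -> bool:
    """True once v1 traffic has dropped below the threshold (1% by default)."""
    return v1_share(v1_requests, v2_requests) < threshold


print(can_deprecate_v1(50, 950))  # False: v1 still serves 5% of requests
print(can_deprecate_v1(5, 995))   # True: v1 is down to 0.5%
```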
Monitor Both Environments
During the transition period, both blue and green are production environments serving real traffic. Monitor them separately to detect version-specific issues.
Set up separate CloudWatch dashboards for blue and green target groups with the same metrics arranged identically. This makes it easy to spot differences at a glance. If green's response time is 200ms while blue's is 50ms, that's a red flag.
Alert on metric divergence
Create alarms that trigger when green's metrics deviate significantly from blue's baseline. For example, if green's error rate is more than 2x blue's historical average, trigger an alert. If green's database query time is 50% higher, investigate before shifting more traffic.
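The two example thresholds can be expressed as plain comparisons before you encode them as CloudWatch alarms. This is a sketch under stated assumptions: the function names are invented here, and the inputs are values you would read from your own dashboards.

```python
def error_rate_diverged(green_rate: float, blue_baseline: float,
                        factor: float = 2.0) -> bool:
    """True if green's error rate exceeds `factor` times blue's baseline."""
    return green_rate > blue_baseline * factor


def query_time_diverged(green_ms: float, blue_ms: float,
                        max_increase: float = 0.5) -> bool:
    """True if green's query time is more than `max_increase` above blue's."""
    return green_ms > blue_ms * (1 + max_increase)


# Hypothetical readings
print(error_rate_diverged(0.005, 0.002))  # True: 0.5% vs a 0.2% baseline
print(query_time_diverged(70, 50))        # False: 40% slower, under the limit
print(query_time_diverged(80, 50))        # True: 60% slower
```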
Log aggregation
Ensure logs from both environments are tagged with their version (environment: blue or environment: green) so you can filter and compare them. Use CloudWatch Insights queries to spot patterns.
When NOT to Use Blue-Green
Blue-green isn't always the right choice. Avoid it when you have:

Very large database migrations: If your migration takes hours or requires significant locks, use a traditional maintenance window.

Highly stateful applications: Real-time collaboration tools or WebSocket applications with complex in-memory state may need rolling deployments instead.

Cost constraints: Running two environments doubles costs. Consider canary deployments for cost-sensitive applications.

Complex data model redesigns: Use the strangler fig pattern to gradually migrate functionality to a new service.


Alternative Deployment Strategies
Canary Deployments
Route a small percentage (5-10%) of traffic to the new version:
{
  "trafficRouting": {
    "type": "TimeBasedCanary",
    "timeBasedCanary": {
      "canaryPercentage": 10,
      "canaryInterval": 5
    }
  }
}
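With a time-based canary, traffic moves in two steps: the canary percentage first, then the remainder once the interval (in minutes) has passed. The following sketch models that schedule for the configuration above. The function name is hypothetical, and it only models the behavior rather than calling CodeDeploy.

```python
def green_traffic_percent(minutes_elapsed: float,
                          canary_percentage: int = 10,
                          canary_interval: int = 5) -> int:
    """Percentage of traffic on green at a given time into the deployment."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must not be negative")
    if minutes_elapsed < canary_interval:
        return canary_percentage
    return 100


for minute in (0, 4, 5, 10):
    print(minute, green_traffic_percent(minute))
```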

Rolling Deployments
Gradually replace old tasks with new ones:
{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}
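These two percentages bound how many tasks ECS keeps running during the rollout. The sketch below works through the arithmetic, assuming the minimum is rounded up and the maximum rounded down, as the ECS documentation describes. The function name is made up for illustration.

```python
import math


def rolling_bounds(desired_count: int,
                   minimum_healthy_percent: int = 100,
                   maximum_percent: int = 200) -> tuple[int, int]:
    """Return (minimum healthy tasks, maximum running tasks) during a rollout."""
    min_healthy = math.ceil(desired_count * minimum_healthy_percent / 100)
    max_running = math.floor(desired_count * maximum_percent / 100)
    return min_healthy, max_running


print(rolling_bounds(4))           # (4, 8)
print(rolling_bounds(3, 50, 150))  # (2, 4)
```

With 100/200, ECS can start a full second set of tasks before stopping any old ones, so capacity never dips but peak task count briefly doubles.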

Cleanup
After you've successfully completed your blue-green deployment, validated the green environment, and run the contract phase, you need to clean up the AWS resources to avoid unnecessary costs and resource sprawl.
What you're removing:

The entire infrastructure stack (VPC, subnets, NAT gateways, load balancer, ECS cluster, RDS database, and all associated resources)

This is appropriate for a tutorial/testing scenario where you deployed everything from scratch


Important considerations before cleanup:

Ensure you have backups if you need to reference any data later

Export any logs or metrics you want to retain

Document lessons learned from the deployment

Verify no production traffic is still using these resources


cd terraform

# Terraform will prompt you to confirm with "yes"
# Review the destruction plan carefully before confirming
terraform destroy  # Takes ~10-15 minutes

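Partial cleanup: If you want to keep certain resources (like the RDS instance for reference), you can remove them from Terraform state before destroying. Keep in mind that resources the retained instance still depends on, such as its subnet group or security groups, may fail to delete until the instance itself is removed:
# Remove RDS from Terraform management before destroying
terraform state rm aws_db_instance.main
terraform destroy  # Destroys what is still tracked in state; RDS is left untouched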

For production environments, you would NOT destroy everything. Instead, you'd decommission the blue environment specifically after confirming green is stable:
# Production scenario - remove only blue environment
terraform destroy -target=aws_ecs_service.blue
terraform destroy -target=aws_lb_target_group.blue

Conclusion
Blue-green deployments with databases require careful planning, but the expand-contract pattern makes it manageable.
Here are some key takeaways:

Use expand-contract as default – Maintains backwards compatibility and safe rollbacks.

Externalize state – Sessions, caches, and storage should use external services.

Plan for three phases – Don't rush to the contract phase.

Test everything in staging – Mirror production scale and complexity.

Monitor aggressively – Track technical and business metrics for both environments.

Know when to use alternatives – Blue-green isn't always the answer.

Document rollback procedures – Everyone should know the rollback process before deployment.


The expand-contract pattern requires more work upfront, but this investment pays dividends in reduced risk and maintained uptime. With the strategies and complete implementation provided here, you can successfully deploy even complex, stateful applications with confidence.
As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on LinkedIn.
For more practical hands-on Cloud/DevOps projects like this one, follow and star this repository: Learn-DevOps-by-building.
Further Resources

Complete Code: github.com/Caesarsage/bluegreen-deployment-ecs

Learn DevOps by Building: GitHub repo

AWS ECS Blue/Green Documentation: AWS Docs

AWS CodeDeploy for ECS: AWS Docs


 


 How to Create Kubernetes Cluster and Security Groups for Pods in AWS [Full Handbook] 
Destiny Erhabor — Wed, 15 Oct 2025 23:53:37 +0000
 Amazon Elastic Kubernetes Service (EKS) Security Groups for Pods is a powerful feature that enables fine-grained network security controls at the pod level. This guide walks you through implementing this feature, from initial cluster setup to testing pod-level security group assignments.
Traditionally, security groups could only be assigned at the EC2 instance level in EKS clusters. This meant that all pods running on a node shared the same network security rules. With Security Groups for Pods, you can now assign specific security groups to individual pods, providing much more granular control over network access.
Table of Contents

Prerequisites

Understanding the Architecture

Infrastructure Foundation

EKS Cluster Configuration

Management Instance Setup

Security Group Configuration

Database Setup

CNI Plugin Configuration

Security Policies Implementation

Testing and Validation

Cleanup and Maintenance

Conclusion


Prerequisites
Before starting this guide, ensure you have:

An AWS account with appropriate permissions

AWS CLI configured on your local machine

Basic understanding of Kubernetes concepts

Familiarity with AWS networking concepts (VPCs, security groups, subnets)

Understanding of Amazon EKS fundamentals


Understanding the Architecture
Before we dive into implementation, let's understand how Security Groups for Pods changes the EKS networking model. We'll start by looking at the traditional approach, then explore the enhanced model, and finally understand the components that make it all work.
Traditional EKS Networking
In the standard EKS networking setup, security happens at the node level rather than the pod level. When you create an EKS cluster using the traditional model, every EC2 worker node gets assigned a security group. All pods running on that node inherit the same security group settings from their host node. This means if you have ten different applications running on the same node, they all share identical network security rules.
This approach has significant limitations. For example, if one pod needs to access a database while another pod should not, you can't enforce this distinction when both pods share the node's security group. The security boundary exists at the node level, creating a coarse-grained security model where all pods on a node have the same network permissions.

Security Groups for Pods Architecture
Security Groups for Pods changes this model completely. With the feature enabled, you can assign dedicated security groups to individual pods based on their specific needs. Instead of all pods inheriting the node's security group, certain pods can get their own Elastic Network Interface (ENI) with custom security group assignments.
An ENI (Elastic Network Interface) is essentially a virtual network card in AWS. Just as your physical computer has a network card to connect to the internet, EC2 instances and now individual pods can have their own virtual network interfaces. Each ENI can have its own IP address, security groups, and network settings. When we assign an ENI to a pod, that pod gets its own dedicated network identity separate from the node it runs on.
This architecture provides true pod-level security. For instance, you might have a frontend pod and a database access pod running on the same node. The frontend pod uses the node's security group and cannot access the database. Meanwhile, the database access pod gets its own ENI with a security group that explicitly allows database connections. Even though they share the same physical node, these pods have completely different network security profiles.

How It Works:
The implementation of Security Groups for Pods relies on several interconnected mechanisms working together. First, when you mark a pod for special security group treatment through a SecurityGroupPolicy, the system automatically provisions a dedicated ENI for that pod. This ENI assignment happens through AWS VPC CNI's branch networking feature, which allows multiple network interfaces to attach to a single EC2 instance.
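The branch networking capability is crucial here. EC2 instances have limits on how many ENIs they can support. For example, a t3.medium or an m5.large can support up to three ENIs, while an m5.xlarge can support up to four. Rather than consuming one of these standard slots per pod, the VPC Resource Controller attaches a single trunk interface to the node and creates branch interfaces on top of it for pods that need custom security groups. Each branch interface can then have its own security group configuration independent of the node's primary network interface. Keep in mind that the feature requires a supported Nitro-based instance type, and the burstable t family (including t3.medium) is not supported.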
This fine-grained control means you can now enforce network policies at the application level. Different microservices in your cluster can have completely different network access patterns, even when running on the same infrastructure. A payment processing pod might have strict database access, while a logging pod might only need access to your log aggregation service, and a frontend pod might only need internet access for serving web traffic.
Key Components:
Several Kubernetes and AWS components work together to enable this functionality. Let's walk through each one to understand how they contribute to the overall system.
SecurityGroupPolicy CRD
The SecurityGroupPolicy Custom Resource Definition (CRD) is a Kubernetes object that you create to tell the system which pods should receive which security groups. You use standard Kubernetes label selectors to identify pods, then specify one or more AWS security group IDs that should be attached to those pods. When you create a SecurityGroupPolicy, the system doesn't immediately change anything. Instead, it creates a rule that applies to future pods matching those labels.
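To build intuition for how label selection maps pods to security groups, here is a simplified model in Python. It is not the controller's actual logic. It only mimics exact-match label selection, and the policy dictionaries, label values, and security group IDs are all made up for illustration.

```python
def matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """A pod matches when it carries every label in the selector."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def security_groups_for(pod_labels: dict[str, str],
                        policies: list[dict]) -> list[str]:
    """Collect the group IDs of every policy whose selector matches the pod."""
    groups: list[str] = []
    for policy in policies:
        if matches(policy["match_labels"], pod_labels):
            groups.extend(policy["group_ids"])
    return groups


# Hypothetical policy and pods
policies = [
    {"match_labels": {"role": "db-client"}, "group_ids": ["sg-0123456789abcdef0"]},
]
print(security_groups_for({"app": "orders", "role": "db-client"}, policies))
print(security_groups_for({"app": "frontend"}, policies))
```

A pod that matches no policy gets an empty list here, which corresponds to the default behavior of falling back to the node's security group.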
VPC Resource Controller
The VPC Resource Controller is an AWS component that runs in your cluster's control plane. This controller constantly watches for pods that match your SecurityGroupPolicy definitions.
When a matching pod is created, the controller communicates with AWS EC2 APIs to provision the necessary ENI, attach the specified security groups, and configure the network interface. It also handles the cleanup process when pods are deleted, ensuring that ENIs are properly released and don't become orphaned resources in your AWS account.
AWS VPC CNI
Finally, the AWS VPC CNI plugin is enhanced to support this branch networking feature. When the VPC Resource Controller provisions an ENI for a pod, the CNI plugin on the worker node handles the low-level networking configuration. It attaches the ENI to the pod's network namespace, configures routing rules, and ensures that traffic from that pod flows through the dedicated interface rather than the node's primary network interface. The CNI plugin also maintains the necessary iptables rules and network policies to keep pod networking isolated and secure.
Together, these components create a seamless experience where you simply label your pods and define security policies, and the system handles all the complex AWS networking configuration automatically.
Infrastructure Foundation
Now we'll build the underlying AWS infrastructure that our EKS cluster needs. This includes setting up IAM roles, creating the VPC with proper subnets, and configuring the networking components. We'll work through each step, ensuring that every component is properly configured for Security Groups for Pods to function correctly.
IAM Roles and Policies Setup
Before creating any infrastructure, we need to set up the IAM roles that will define what permissions different AWS services have. Think of IAM roles as identity cards that services present to AWS to prove they're allowed to perform certain actions. We'll create several distinct roles, each with specific permissions tailored to their purpose.
EKS Cluster Service Role:
First, we'll create the IAM role that the EKS service itself will use when managing your cluster. This role establishes a trust relationship between your AWS account and the EKS service, essentially giving EKS permission to perform actions on your behalf.
# Create the EKS cluster service role
aws iam create-role \
  --role-name EKSClusterRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "eks.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

Here’s what’s going on:
This command creates an IAM role that establishes trust between your AWS account and the EKS service:

assume-role-policy-document: Defines which AWS service can assume this role

"Service": "eks.amazonaws.com": Only the EKS service can use this role

"Action": "sts:AssumeRole": Lets that service request temporary credentials for the role



EKS Cluster Attached Role Policy:
Now that we have the role created, we need to attach managed policies that grant the actual permissions EKS needs to function. We'll attach two AWS-managed policies that provide comprehensive permissions for EKS operations.
# Attach the required policies
aws iam attach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

aws iam attach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSVPCResourceController

Let me explain what each of these policies does. The AmazonEKSClusterPolicy is a managed policy that AWS maintains, giving the Kubernetes control plane permission to manage AWS resources on your behalf. This includes actions like describing EC2 instances and network interfaces and creating load balancers for your Services. Without this policy, the control plane couldn't integrate with the AWS resources your cluster depends on.
The second policy, AmazonEKSVPCResourceController, is particularly critical for our Security Groups for Pods implementation. This policy allows the VPC Resource Controller to create and delete ENIs, assign security groups to those interfaces, and manage VPC resources on behalf of pods. When a pod needs a dedicated ENI with specific security groups, this policy is what authorizes EKS to make those changes in your VPC.

EKS Node Group Role:
Next, we'll create the IAM role that EC2 worker nodes will use. While the cluster role is for the EKS control plane, this role is for the actual compute instances that run your pods.
# Create the node group role
aws iam create-role \
  --role-name EKSNodeGroupRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "ec2.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

This role's assume role policy specifies ec2.amazonaws.com as the trusted service, meaning EC2 instances can assume this role. When an EC2 instance launches as part of your EKS node group, it automatically assumes this role and uses it to authenticate with AWS services. This is how your worker nodes can pull container images, register with the cluster, and perform other necessary operations.
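The cluster role and the node group role use the same trust policy shape and differ only in the service principal. The sketch below generates that document in Python to make the difference explicit. The `trust_policy` helper is hypothetical, and the CLI commands above remain the actual setup steps.

```python
import json


def trust_policy(service_principal: str) -> dict:
    """Build a trust policy that lets one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


cluster_trust = trust_policy("eks.amazonaws.com")  # for EKSClusterRole
node_trust = trust_policy("ec2.amazonaws.com")     # for EKSNodeGroupRole
print(json.dumps(node_trust, indent=2))
```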

EKS Node Group Role Attached Policy:
With the node group role created, we now need to attach policies that give worker nodes the permissions they need. We'll attach three different managed policies, each serving a specific purpose in the node's lifecycle.
# Attach required policies for worker nodes
aws iam attach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

aws iam attach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

aws iam attach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

Each policy serves a specific purpose for worker node functionality:

AmazonEKSWorkerNodePolicy: Allows nodes to connect to the EKS cluster

AmazonEKS_CNI_Policy: Enables the CNI plugin to manage pod networking

AmazonEC2ContainerRegistryReadOnly: Allows nodes to pull container images from ECR



IAM Role for Management Instance:
To complete our foundation setup, we'll create a dedicated role for the EC2 instance that we'll use to manage the cluster. This management instance will act as our control point for running kubectl commands, configuring the cluster, and performing administrative tasks.
# Create IAM role for management instance
aws iam create-role \
  --role-name EKS-Management-Role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "ec2.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

# Create instance profile
aws iam create-instance-profile \
  --instance-profile-name EKS-Management-Profile

# Add role to instance profile
aws iam add-role-to-instance-profile \
  --instance-profile-name EKS-Management-Profile \
  --role-name EKS-Management-Role

# Create and attach custom policy for EKS management
cat > eks-management-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:*",
                "ec2:DescribeInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeNetworkInterfaces",
                "ec2:CreateSecurityGroup",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:RevokeSecurityGroupIngress",
                "rds:DescribeDBInstances",
                "rds:CreateDBInstance",
                "rds:DeleteDBInstance",
                "iam:PassRole"
            ],
            "Resource": "*"
        }
    ]
}
EOF

aws iam create-policy \
  --policy-name EKS-Management-Policy \
  --policy-document file://eks-management-policy.json

aws iam attach-role-policy \
  --role-name EKS-Management-Role \
  --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/EKS-Management-Policy

This setup is more complex than the previous roles because it involves several steps.
First, we create the role with EC2 as the trusted service. Then we create an instance profile, which is AWS's mechanism for attaching IAM roles to EC2 instances. Think of an instance profile as a container that holds the role and makes it available to EC2.
The custom policy we're creating gives comprehensive administrative permissions for managing EKS clusters. The eks:* wildcard grants all EKS actions, while the specific EC2 and RDS permissions allow for infrastructure management.
The iam:PassRole permission is particularly important. It allows this management instance to pass the cluster and node group roles to EKS when creating resources. Without this permission, we couldn't create the cluster from this instance.


VPC and Networking Infrastructure
With our IAM roles configured, we'll now build the network infrastructure that will host our EKS cluster. We're going to create a production-ready VPC with both public and private subnets across multiple availability zones. This architecture provides both security and high availability.
VPC Creation and Configuration
Let's start by creating our Virtual Private Cloud and the Internet Gateway that will provide internet connectivity for our public resources.
# Create VPC
export VPC_ID=$(aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=eks-security-demo}]' \
  --query 'Vpc.VpcId' \
  --output text)

# Create Internet Gateway first
export IGW_ID=$(aws ec2 create-internet-gateway \
  --query 'InternetGateway.InternetGatewayId' \
  --output text)

# Attach Internet Gateway to VPC
aws ec2 attach-internet-gateway \
  --internet-gateway-id $IGW_ID \
  --vpc-id $VPC_ID

When we create the VPC with a 10.0.0.0/16 CIDR block, we're defining an IP address range that provides 65,536 possible IP addresses. This is a private IP range (meaning these addresses aren't routable on the public internet) from the RFC 1918 specification. This gives us plenty of room to create multiple subnets and scale our infrastructure as needed. The /16 designation means the first 16 bits of the IP address are fixed (10.0), while the remaining 16 bits are available for our use.
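You can verify the CIDR arithmetic yourself with Python's standard `ipaddress` module. This is just a quick check of the numbers above, not part of the setup.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

print(vpc.prefixlen)          # 16 fixed bits
print(32 - vpc.prefixlen)     # 16 bits left for our own addressing
print(vpc.num_addresses)      # 65536
print(vpc.is_private)         # True: inside the RFC 1918 10.0.0.0/8 range
```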
The Internet Gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between instances in our VPC and the internet. By attaching it to our VPC, we're setting up the foundation for resources in public subnets to communicate with the outside world.


Subnet Architecture Strategy
Now we'll create four subnets – two public and two private – spread across two different availability zones. This multi-AZ approach is crucial for high availability and follows AWS best practices for production deployments.
# Public subnets for NAT Gateway and Load Balancers
export PUBLIC_SUBNET_1=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.1.0/24 \
  --availability-zone eu-west-1a \
  --query 'Subnet.SubnetId' \
  --output text)

export PUBLIC_SUBNET_2=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.2.0/24 \
  --availability-zone eu-west-1b \
  --query 'Subnet.SubnetId' \
  --output text)

# Enable auto-assign public IP for public subnets
aws ec2 modify-subnet-attribute \
  --subnet-id $PUBLIC_SUBNET_1 \
  --map-public-ip-on-launch

aws ec2 modify-subnet-attribute \
  --subnet-id $PUBLIC_SUBNET_2 \
  --map-public-ip-on-launch

# Private subnets for worker nodes and RDS
export PRIVATE_SUBNET_1=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.3.0/24 \
  --availability-zone eu-west-1a \
  --query 'Subnet.SubnetId' \
  --output text)

export PRIVATE_SUBNET_2=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID \
  --cidr-block 10.0.4.0/24 \
  --availability-zone eu-west-1b \
  --query 'Subnet.SubnetId' \
  --output text)

Let me walk you through the subnet design:
Each subnet uses a /24 CIDR block, which provides 256 IP addresses per subnet (though AWS reserves 5 addresses in each subnet for internal use, leaving 251 usable addresses). We're creating these subnets in pairs across two availability zones (eu-west-1a and eu-west-1b). If one availability zone experiences an outage, resources in the other zone can continue operating.
The public subnets (10.0.1.0/24 and 10.0.2.0/24) will host our NAT Gateway, our management instance, and potentially load balancers in the future. We enable auto-assign public IP on these subnets so that any instances launched here automatically receive public IP addresses. The management instance relies on this for SSH access. The NAT Gateway does not, because it gets its public address from the Elastic IP we allocate for it.
The private subnets (10.0.3.0/24 and 10.0.4.0/24) will host our EKS worker nodes and RDS database. Resources in these subnets don't receive public IP addresses, meaning they can't be directly accessed from the internet. This provides an additional layer of security for our application workloads and database.
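Before creating subnets, it can help to confirm that each planned CIDR block actually sits inside the VPC range. The sketch below does that with plain shell arithmetic. `ip_to_int` and `cidr_within` are hypothetical helper names, and the subtraction of 5 reflects the AWS-reserved addresses mentioned above.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1            # intentionally unquoted: split the address on dots
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if CIDR block $1 lies entirely inside CIDR block $2.
cidr_within() (
  inner_ip=${1%/*}; inner_len=${1#*/}
  outer_ip=${2%/*}; outer_len=${2#*/}
  # The inner block must be at least as specific as the outer one.
  [ "$inner_len" -ge "$outer_len" ] || exit 1
  bits=$(( 32 - outer_len ))
  # Both network addresses must share the outer block's prefix bits.
  [ $(( $(ip_to_int "$inner_ip") >> bits )) -eq $(( $(ip_to_int "$outer_ip") >> bits )) ]
)

for subnet in 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24; do
  if cidr_within "$subnet" 10.0.0.0/16; then
    echo "$subnet fits; usable addresses: $(( (1 << (32 - ${subnet#*/})) - 5 ))"
  else
    echo "$subnet is outside the VPC range"
  fi
done
```

Each of the four subnets in this section should report 251 usable addresses.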


EKS Subnet Tagging
Next, we need to add specific tags to our subnets so that EKS can automatically discover and use them correctly. These tags tell EKS which subnets to use for different types of load balancers.
# Tag subnets for EKS auto-discovery
aws ec2 create-tags \
  --resources $PUBLIC_SUBNET_1 $PUBLIC_SUBNET_2 \
  --tags Key=kubernetes.io/cluster/pod-security-cluster-demo,Value=shared \
         Key=kubernetes.io/role/elb,Value=1

aws ec2 create-tags \
  --resources $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \
  --tags Key=kubernetes.io/cluster/pod-security-cluster-demo,Value=shared \
         Key=kubernetes.io/role/internal-elb,Value=1

These tags serve specific purposes in the EKS ecosystem:

The kubernetes.io/cluster/pod-security-cluster-demo=shared tag identifies subnets that belong to our cluster. The "shared" value indicates that these subnets might be used by multiple clusters, though in our case we're only using them for one.

The kubernetes.io/role/elb=1 tag on public subnets tells Kubernetes to use these subnets when creating internet-facing load balancers.

The kubernetes.io/role/internal-elb=1 tag on private subnets tells Kubernetes to use these subnets when creating internal load balancers.


When you create a Kubernetes Service of type LoadBalancer, these tags help Kubernetes automatically choose the correct subnets based on whether you want an internal or external load balancer.
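To confirm the tags landed where we expect, we can ask EC2 for the subnets that carry each role tag. This is a sketch rather than a required step. `subnets_for_role` is a hypothetical helper name, and running it needs AWS credentials and the VPC_ID variable from earlier.

```shell
# List the subnet IDs in a VPC that carry a given load balancer role tag.
# $1 is the VPC ID, $2 is the role: "elb" or "internal-elb".
subnets_for_role() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" "Name=tag:kubernetes.io/role/$2,Values=1" \
    --query 'Subnets[].SubnetId' \
    --output text
}

# Usage (commented out because it calls AWS):
#   subnets_for_role "$VPC_ID" elb            # expect the two public subnets
#   subnets_for_role "$VPC_ID" internal-elb   # expect the two private subnets
```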
Routing and NAT Gateway
Now we'll set up the routing infrastructure that controls how traffic flows in and out of our subnets. This includes creating route tables for both public and private subnets, and setting up a NAT Gateway to provide internet access for resources in private subnets.
# Create route table for public subnets
export PUBLIC_RT=$(aws ec2 create-route-table \
  --vpc-id $VPC_ID \
  --query 'RouteTable.RouteTableId' \
  --output text)

# Create route to Internet Gateway
aws ec2 create-route \
  --route-table-id $PUBLIC_RT \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id $IGW_ID

# Associate public subnets with public route table
aws ec2 associate-route-table \
  --subnet-id $PUBLIC_SUBNET_1 \
  --route-table-id $PUBLIC_RT

aws ec2 associate-route-table \
  --subnet-id $PUBLIC_SUBNET_2 \
  --route-table-id $PUBLIC_RT

# Create NAT Gateway
export EIP_ALLOC=$(aws ec2 allocate-address \
  --domain vpc \
  --query 'AllocationId' \
  --output text)

export NAT_GW=$(aws ec2 create-nat-gateway \
  --subnet-id $PUBLIC_SUBNET_1 \
  --allocation-id $EIP_ALLOC \
  --query 'NatGateway.NatGatewayId' \
  --output text)

# Wait for NAT Gateway to be available
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT_GW

# Create route table for private subnets
export PRIVATE_RT=$(aws ec2 create-route-table \
  --vpc-id $VPC_ID \
  --query 'RouteTable.RouteTableId' \
  --output text)

# Create route to NAT Gateway
aws ec2 create-route \
  --route-table-id $PRIVATE_RT \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id $NAT_GW

# Associate private subnets with private route table
aws ec2 associate-route-table \
  --subnet-id $PRIVATE_SUBNET_1 \
  --route-table-id $PRIVATE_RT

aws ec2 associate-route-table \
  --subnet-id $PRIVATE_SUBNET_2 \
  --route-table-id $PRIVATE_RT

Let me explain how this routing configuration enables secure internet access:
We start by creating a route table for our public subnets and adding a default route (0.0.0.0/0) that points to the Internet Gateway. This means any traffic from public subnet resources that doesn't match a more specific route will go directly to the Internet Gateway and out to the internet.
Next, we create a NAT Gateway, which requires an Elastic IP address. An Elastic IP is a static public IPv4 address that AWS allocates to your account. The NAT Gateway lives in a public subnet and acts as a middleman for outbound internet traffic from private subnets. When a resource in a private subnet wants to reach the internet (for example, to download software updates), the traffic goes to the NAT Gateway, which then forwards it to the Internet Gateway. Response traffic comes back through the same path.
For the private subnets, we create a separate route table with a default route pointing to the NAT Gateway instead of directly to the Internet Gateway. This setup allows resources in private subnets to initiate outbound connections to the internet (which they need for things like pulling container images or downloading patches), but prevents inbound connections from the internet. This is a key security feature: your worker nodes and databases can access the internet when needed, but the internet can't directly access them.
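The phrase "doesn't match a more specific route" describes longest-prefix matching. The sketch below simulates it locally for the private route table, which holds the implicit local route for the VPC plus our default route to the NAT Gateway. The `CIDR=target` notation and the helper names are made up for illustration and are not AWS CLI output.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1            # intentionally unquoted: split the address on dots
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Print the target of the most specific route that matches destination $1.
# The remaining arguments are routes written as CIDR=target.
pick_route() (
  dest=$(ip_to_int "$1"); shift
  best_len=-1; best_target=none
  for route in "$@"; do
    cidr=${route%%=*}; target=${route#*=}
    net=$(ip_to_int "${cidr%/*}"); len=${cidr#*/}
    bits=$(( 32 - len ))
    # A route matches when the destination shares its prefix bits;
    # among matching routes, the longest prefix wins.
    if [ $(( dest >> bits )) -eq $(( net >> bits )) ] && [ "$len" -gt "$best_len" ]; then
      best_len=$len; best_target=$target
    fi
  done
  echo "$best_target"
)

pick_route 10.0.4.25    10.0.0.0/16=local 0.0.0.0/0=nat-gateway   # -> local
pick_route 151.101.1.69 10.0.0.0/16=local 0.0.0.0/0=nat-gateway   # -> nat-gateway
```

Traffic between subnets stays on the local route, and everything else falls through to the default route.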


EKS Cluster Configuration
With our networking foundation in place, we're ready to create the actual EKS cluster. We'll configure the cluster to support Security Groups for Pods and set up managed worker nodes with appropriate instance types.
EKS Cluster Creation
Let's create our EKS cluster with configuration options specifically chosen to support the Security Groups for Pods feature.
export CLUSTER_ROLE_ARN=$(aws iam get-role \
  --role-name EKSClusterRole \
  --query 'Role.Arn' \
  --output text)

# Create the EKS cluster with detailed configuration
aws eks create-cluster \
  --name pod-security-cluster-demo \
  --kubernetes-version 1.33 \
  --role-arn $CLUSTER_ROLE_ARN \
  --access-config authenticationMode=API_AND_CONFIG_MAP \
  --resources-vpc-config subnetIds=$PUBLIC_SUBNET_1,$PUBLIC_SUBNET_2,$PRIVATE_SUBNET_1,$PRIVATE_SUBNET_2

# Wait for cluster to be active (this can take 10-15 minutes)
aws eks wait cluster-active --name pod-security-cluster-demo

So what's happening in this cluster creation command? First, we're using Kubernetes version 1.33, which at the time of writing is a recent version available on EKS and supports Security Groups for Pods. The role-arn parameter specifies the EKSClusterRole we created earlier, giving the cluster permission to manage AWS resources.
The access-config setting is particularly important. By specifying API_AND_CONFIG_MAP, we're enabling both modern API-based authentication and the traditional aws-auth ConfigMap approach. This dual authentication mode provides flexibility in how we manage cluster access.
We're including all four of our subnets in the resources-vpc-config. This is crucial because the EKS control plane needs to communicate with worker nodes across availability zones. By specifying both public and private subnets, we ensure that the cluster can place resources wherever they're needed while maintaining proper security boundaries.
The cluster creation process typically takes 10-15 minutes. During this time, AWS is provisioning the Kubernetes control plane components (API server, etcd, controller manager, and scheduler) across multiple availability zones for high availability.

Managed Node Group Setup
With the cluster created, we now need to add worker nodes that will actually run our pods. We'll create a managed node group with instance types specifically chosen to support multiple ENIs.
# Get the ARN of the node group role
export NODE_ROLE_ARN=$(aws iam get-role \
  --role-name EKSNodeGroupRole \
  --query 'Role.Arn' \
  --output text)

# Create the managed node group
aws eks create-nodegroup \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers \
  --subnets $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \
  --node-role $NODE_ROLE_ARN \
  --instance-types m5.large \
  --scaling-config minSize=1,maxSize=3,desiredSize=2 \
  --disk-size 20 \
  --capacity-type ON_DEMAND

# Wait for node group to be active
aws eks wait nodegroup-active \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers

Let me explain the key configuration choices here:
We're launching our worker nodes in the private subnets only, which follows security best practices by keeping compute resources away from direct internet access. The nodes can still download images and updates through the NAT Gateway we set up earlier.
The instance type selection is important for Security Groups for Pods. We're using m5.large instances, which support up to 3 ENIs. One ENI is the primary network interface for the node itself. When pod ENI support is enabled, the VPC Resource Controller attaches a trunk ENI to the node, and each pod that matches a SecurityGroupPolicy gets its own branch ENI associated with that trunk. This gives us reasonable pod density while keeping the ability to assign custom security groups to individual pods.
Our scaling configuration starts with 2 nodes (desiredSize=2), can scale down to 1 node (minSize=1), and up to 3 nodes (maxSize=3). This provides enough capacity for our demonstration while keeping costs reasonable. We're using the ON_DEMAND capacity type, which means these are standard EC2 instances billed at the regular on-demand rate. While Spot instances are cheaper, ON_DEMAND ensures consistent availability without interruptions during our testing.

Instance Type Selection for ENI Limits:
Understanding the ENI limits of different instance types can help when planning for Security Groups for Pods. Let's check the ENI capacity of various instance types to see how they compare.
# Check ENI limits for different instance types
aws ec2 describe-instance-types \
  --instance-types a1.2xlarge t3.medium t3.large m5.large m5.xlarge \
  --query 'InstanceTypes[*].[InstanceType,NetworkInfo.MaximumNetworkInterfaces]' \
  --output table

This command shows the ENI limit for each instance type, which is one of the inputs to how many pods can have dedicated security groups.

The m5.large instance type we chose provides a maximum of 3 network interfaces. One ENI is always used as the primary network interface for the node itself, handling the standard node networking. Once pod ENI support is enabled, one of the remaining slots is taken by the trunk interface, and the branch ENIs that make Security Groups for Pods possible are associated with that trunk.
A t3.medium also reports 3 ENIs, but AWS does not support Security Groups for Pods on the t instance family, so it would not work for this demo. An m5.xlarge supports 4 ENIs and provides more capacity at a higher price. The m5.large is a reasonable balance: it provides adequate density for pods requiring security group policies while remaining cost-effective for demonstration purposes. In a production environment, you'd want to carefully calculate your needs based on how many pods will require custom security groups and choose your instance types accordingly.
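For context, ENI counts also bound ordinary pod density when the VPC CNI runs in its default secondary-IP mode. The commonly cited formula is ENIs × (IPv4 addresses per ENI − 1) + 2. The sketch below applies it. Note the assumptions: the per-ENI address counts (10 for m5.large, 15 for m5.xlarge) are values supplied here, not output of the query above, and the result applies to standard pods rather than to pods using branch ENIs.

```shell
# Maximum pods per node in the VPC CNI's default secondary-IP mode.
# $1 = number of ENIs, $2 = IPv4 addresses per ENI (both assumed inputs).
# One address per ENI is the ENI's own primary address, and the +2 accounts
# for host-networked pods that do not consume a VPC address.
max_pods() {
  echo $(( $1 * ($2 - 1) + 2 ))
}

max_pods 3 10   # m5.large  -> 29
max_pods 4 15   # m5.xlarge -> 58
```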
EKS Cluster Access Configuration
Now we need to configure access to the cluster so our management instance can run kubectl commands. Instead of using the older aws-auth ConfigMap approach, we'll use EKS access entries, which provide a cleaner and more maintainable way to manage cluster access.
# Export the management role ARN 
export MANAGEMENT_ROLE_ARN=$(aws iam get-role \
  --role-name EKS-Management-Role \
  --query 'Role.Arn' \
  --output text)

# Create access entry using the variable
aws eks create-access-entry \
  --cluster-name pod-security-cluster-demo \
  --principal-arn $MANAGEMENT_ROLE_ARN

# Associate admin policy using the variable
aws eks associate-access-policy \
  --cluster-name pod-security-cluster-demo \
  --principal-arn $MANAGEMENT_ROLE_ARN \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster

# Verify the policy was associated using the variable
aws eks list-associated-access-policies \
  --cluster-name pod-security-cluster-demo \
  --principal-arn $MANAGEMENT_ROLE_ARN

This access configuration demonstrates enterprise-grade role separation. The EKSClusterRole we created earlier is a service role that EKS itself uses to manage AWS infrastructure like VPCs, security groups, and load balancers. That's different from the EKS-Management-Role we're configuring now, which is an administrative role that human operators (or in our case, the management EC2 instance) use to interact with Kubernetes resources.
By creating an access entry for the management role and associating it with the AmazonEKSClusterAdminPolicy, we're granting full administrative access to the cluster. This means any EC2 instance that assumes the EKS-Management-Role can run kubectl commands with full permissions.
Access entries are the modern approach to cluster access management in EKS, providing better auditability and easier management compared to manually editing the aws-auth ConfigMap.

Management Instance Setup
Now we'll create a dedicated EC2 instance that will serve as our management workstation for interacting with the EKS cluster. This instance will have all the necessary tools pre-installed and will use the IAM role we configured earlier to access both AWS services and the Kubernetes cluster.
Security Group for Management Access
First, let's create a security group that will control network access to our management instance. This security group will allow SSH connections so we can access the instance.
# Create security group for the management instance
export EC2_SG=$(aws ec2 create-security-group \
  --group-name EKS-Management-SG \
  --description "Security group for EKS management instance" \
  --vpc-id $VPC_ID \
  --query 'GroupId' \
  --output text)

# Allow SSH (open to any IP here for the demo; restrict it to your own IP in real use)
aws ec2 authorize-security-group-ingress \
  --group-id $EC2_SG \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0  # for security consider using your ip ${MY_IP}/32

We're creating a security group specifically for the management instance and allowing SSH access on port 22. In the example above, we're using 0.0.0.0/0 which allows SSH from any IP address. This is convenient for demonstration purposes, but in a production environment, you should definitely restrict this to your specific IP address instead.
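If you want to restrict SSH to your own address, a small guard like the one below can turn a single IPv4 address into the /32 CIDR that the ingress rule expects, and reject anything malformed. `to_host_cidr` is a hypothetical helper name, and the lookup through checkip.amazonaws.com is one option among several for finding your public IP.

```shell
# Turn a single IPv4 address into a /32 CIDR, rejecting malformed input.
to_host_cidr() (
  ip=$1
  IFS=.
  set -- $ip           # intentionally unquoted: split the address on dots
  [ $# -eq 4 ] || exit 1
  for octet in "$@"; do
    case $octet in
      ''|*[!0-9]*) exit 1 ;;   # empty or non-numeric octet
    esac
    [ "$octet" -le 255 ] || exit 1
  done
  echo "$ip/32"
)

to_host_cidr 203.0.113.7   # -> 203.0.113.7/32

# Usage with the rule above (commented out because it calls AWS):
#   MY_IP=$(curl -s https://checkip.amazonaws.com)
#   aws ec2 authorize-security-group-ingress --group-id "$EC2_SG" \
#     --protocol tcp --port 22 --cidr "$(to_host_cidr "$MY_IP")"
```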

Automated Tool Installation
Now we'll launch the management instance with a user data script that automatically installs all the tools we'll need. User data scripts run automatically when an EC2 instance first boots up, allowing us to fully configure the instance without manual intervention.
# Create user data script for automatic tool installation
cat > user-data.sh << 'EOF'
#!/bin/bash
yum update -y
yum install -y unzip git

# Install AWS CLI v2
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

# Install kubectl (reference: https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.33.4/2025-08-20/bin/linux/amd64/kubectl
chmod +x ./kubectl
# User data runs as root, so install system-wide instead of under root's $HOME/bin
mv ./kubectl /usr/local/bin/kubectl

# Install eksctl for additional EKS management
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
cp /tmp/eksctl /usr/local/bin

# Install helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Install PostgreSQL 13 client (newer version required for RDS SCRAM authentication)
sudo amazon-linux-extras install -y postgresql13

echo "Management tools installed successfully" > /var/log/setup-complete.log
EOF

# Get the latest Amazon Linux 2 AMI ID
export AMI_ID=$(aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=amzn2-ami-hvm-*-x86_64-gp2" "Name=state,Values=available" \
  --query 'Images | sort_by(@, &CreationDate) | [-1].ImageId' \
  --output text)

# Launch instance with user data
export INSTANCE_ID=$(aws ec2 run-instances \
  --image-id $AMI_ID \
  --count 1 \
  --instance-type t3.micro \
  --subnet-id $PUBLIC_SUBNET_1 \
  --security-group-ids $EC2_SG \
  --iam-instance-profile Name=EKS-Management-Profile \
  --user-data file://user-data.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=EKS-Management},{Key=Environment,Value=Demo}]' \
  --query 'Instances[0].InstanceId' \
  --output text)

# Wait for instance to be running
aws ec2 wait instance-running --instance-ids $INSTANCE_ID

# Get instance public IP
export INSTANCE_IP=$(aws ec2 describe-instances \
  --instance-ids $INSTANCE_ID \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text)

echo "Management instance is ready. Public IP: $INSTANCE_IP"

Let me walk through what this script does. The user data script starts by updating the system and installing basic utilities like unzip and git. Then it installs the AWS CLI version 2, which we'll use to interact with AWS services from the instance.
Next, we install kubectl, which is the command-line tool for interacting with Kubernetes clusters. We're installing version 1.33.4 to match our EKS cluster version. The script also installs eksctl, which is a higher-level tool for managing EKS clusters, and Helm, which is a package manager for Kubernetes applications.
Finally, we install the PostgreSQL 13 client. This will allow us to connect to the RDS database we'll create later and verify that our pod-level security groups are working correctly. The script writes a completion message to a log file so we can verify later that all tools installed successfully.
When we launch the instance, we're placing it in PUBLIC_SUBNET_1 so it gets a public IP address and can be accessed via SSH. We're attaching the EKS-Management-Profile IAM instance profile, which gives the instance the permissions we configured earlier. We're also using a t3.micro instance type, which is a small, low-cost general-purpose instance. It's perfectly adequate for running kubectl commands and managing the cluster while keeping costs minimal.
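Once you're logged in to the instance, a quick way to confirm that the user data script did its job is to check that each tool is on the PATH. This is a small sketch, and `missing_tools` is a hypothetical helper name.

```shell
# Print each named command that is not on PATH; print nothing if all are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# On the management instance you might run:
#   missing_tools aws kubectl eksctl helm psql
# An empty result means every tool was found.
```

If anything is listed, the cloud-init output log on the instance is usually the first place to look for the failed install step.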


Security Group Configuration
With our infrastructure and cluster in place, we now need to create and configure the security groups that will control pod-level network access. This is where the real power of Security Groups for Pods comes into play. We'll create security groups that individual pods can use to enforce fine-grained network policies.
Retrieving Cluster Network Information
Let's start by connecting to our management instance and gathering the networking details we need from our EKS cluster.
# Connect to the management instance, then get the cluster VPC details

# Verify that the tool installation finished
cat /var/log/setup-complete.log

# Update kubeconfig for your cluster
aws eks update-kubeconfig --name pod-security-cluster-demo --region eu-west-1

# Verify configuration
kubectl get nodes

# Check cluster info
kubectl cluster-info

export VPC_ID=$(aws eks describe-cluster \
   --name pod-security-cluster-demo \
   --query "cluster.resourcesVpcConfig.vpcId" \
   --output text)

echo "Cluster VPC ID: $VPC_ID"

First, we're checking the setup completion log to ensure that all our tools installed correctly during the instance's first boot. Then we configure kubectl to communicate with our EKS cluster by updating the kubeconfig file. This command retrieves the cluster endpoint and certificate authority data, storing them in ~/.kube/config so kubectl knows how to authenticate with our cluster.
Running kubectl get nodes should show us our two worker nodes in a Ready state. The kubectl cluster-info command displays the API server endpoint, confirming that we have proper connectivity to the cluster. Finally, we're extracting the VPC ID where our cluster is running. We'll need this ID when creating security groups, since security groups must be associated with a specific VPC.
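If you'd rather check node readiness in a script than by eye, the STATUS column can be counted with awk. The sketch below reads `kubectl get nodes --no-headers` output from standard input. `count_ready_nodes` is a hypothetical helper name, and the sample node names are invented.

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Nodes shown as "NotReady" or "Ready,SchedulingDisabled" are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Against the real cluster:
#   kubectl get nodes --no-headers | count_ready_nodes

# With sample output (node names are made up):
printf '%s\n' \
  'ip-10-0-3-12.eu-west-1.compute.internal   Ready      <none>   5m   v1.33.0' \
  'ip-10-0-4-87.eu-west-1.compute.internal   NotReady   <none>   1m   v1.33.0' \
  | count_ready_nodes   # -> 1
```

For this walkthrough the count should reach 2 once both worker nodes have joined.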

Pod-Level Security Group Creation
Now we'll create the security group that specific pods will use when they need database access. This is the security group we'll later assign through a SecurityGroupPolicy.
# Create security group for pods requiring database access
aws ec2 create-security-group \
   --description 'Pod Security Group - Database Access' \
   --group-name 'POD_SG' \
   --vpc-id ${VPC_ID}

export POD_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=POD_SG Name=vpc-id,Values=${VPC_ID} \
   --query "SecurityGroups[0].GroupId" --output text)

echo "Pod Security Group ID: ${POD_SG}"

This command creates a new security group in our cluster's VPC with a descriptive name and purpose. At this point, the security group has no inbound rules, only the default outbound rule that allows all egress traffic. We're storing the security group ID in the POD_SG variable because we'll need to reference it multiple times: when creating ingress rules, when setting up the SecurityGroupPolicy, and when verifying our configuration later.

Database Security Group Configuration
Next, let's create a dedicated security group for our RDS PostgreSQL database. This security group will strictly control which sources can connect to the database.
# Create security group for RDS database
aws ec2 create-security-group \
   --description 'RDS Security Group - PostgreSQL Database' \
   --group-name 'RDS_SG' \
   --vpc-id ${VPC_ID}

export RDS_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=RDS_SG Name=vpc-id,Values=${VPC_ID} \
   --query "SecurityGroups[0].GroupId" --output text)

echo "RDS Security Group ID: ${RDS_SG}"

Similar to the pod security group, we're creating an RDS-specific security group without any inbound rules initially. This security group will be attached to our RDS database instance when we create it. The beauty of this approach is that we can control database access by simply defining which security groups are allowed to communicate with the RDS security group. We don't need to know specific IP addresses, because we can allow access based on security group membership instead.

Inter-Service Communication Rules
Now comes the critical part: configuring the security group rules that will enable the necessary communication between components while maintaining security boundaries.
# Get cluster's node group security group
export NODE_GROUP_SG=$(aws ec2 describe-security-groups \
   --filters Name=tag:Name,Values=eks-cluster-sg-pod-security-cluster-demo-* Name=vpc-id,Values=${VPC_ID} \
   --query "SecurityGroups[0].GroupId" \
   --output text)

# Allow pods with POD_SG to resolve DNS through node group
aws ec2 authorize-security-group-ingress \
   --group-id ${NODE_GROUP_SG} \
   --protocol tcp \
   --port 53 \
   --source-group ${POD_SG}

aws ec2 authorize-security-group-ingress \
   --group-id ${NODE_GROUP_SG} \
   --protocol udp \
   --port 53 \
   --source-group ${POD_SG}

# Allow management instance access to RDS
export MGMT_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=EKS-Management-SG Name=vpc-id,Values=${VPC_ID} \
   --query "SecurityGroups[0].GroupId" --output text)

aws ec2 authorize-security-group-ingress \
   --group-id ${RDS_SG} \
   --protocol tcp \
   --port 5432 \
   --source-group ${MGMT_SG}

# Allow pods with POD_SG to access RDS (the management instance rule was added above)
aws ec2 authorize-security-group-ingress \
   --group-id ${RDS_SG} \
   --protocol tcp \
   --port 5432 \
   --source-group ${POD_SG}

Let me explain what each of these rules accomplishes:
First, we're finding the cluster security group that EKS created automatically (its Name tag starts with eks-cluster-sg-). With managed node groups, this security group is attached to the worker nodes, so it controls traffic to and from the nodes themselves.
The first two rules we add allow DNS resolution. When pods with our POD_SG security group need to look up domain names (like our database hostname), they need to query the DNS service that runs on the worker nodes. By allowing both TCP and UDP traffic on port 53 from POD_SG to the node group security group, we ensure that pods with custom security groups can still resolve DNS names. Without these rules, our pods would get ENIs but wouldn't be able to look up any hostnames.
Next, we configure database access rules. We allow the management instance security group to access PostgreSQL port 5432 on the RDS security group. This lets us connect to the database from our management instance to set up test data and verify connectivity.
Most importantly, we allow pods with the POD_SG security group to connect to port 5432 on the RDS security group. This is the rule that will allow our "green pod" (which will be assigned POD_SG) to connect to the database. Notice that we're not allowing the node group security group to access the database - this means that pods without POD_SG cannot connect to the database, even though they're running on the same nodes as pods that can connect.
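To double-check the rules described above, we can ask EC2 which source security groups are allowed on a given port. This is a sketch under a few assumptions: `sg_allows_from` is a hypothetical helper name, it matches rules by their starting port only, and it needs AWS credentials plus the variables exported earlier.

```shell
# Succeed if security group $1 has an ingress rule for port $2
# whose source is security group $3.
sg_allows_from() {
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query "SecurityGroups[0].IpPermissions[?FromPort==\`$2\`].UserIdGroupPairs[].GroupId" \
    --output text | tr '\t' '\n' | grep -qx "$3"
}

# Usage (commented out because it calls AWS):
#   sg_allows_from "$RDS_SG" 5432 "$POD_SG"        && echo "POD_SG can reach PostgreSQL"
#   sg_allows_from "$RDS_SG" 5432 "$MGMT_SG"       && echo "management instance can reach PostgreSQL"
#   sg_allows_from "$RDS_SG" 5432 "$NODE_GROUP_SG" || echo "node security group is not allowed"
```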

Database Setup
Now we'll create an Amazon RDS PostgreSQL instance to serve as the protected resource that will demonstrate pod-level access controls. We'll configure the database securely and populate it with test data that we can query from authorized pods.
RDS Subnet Group Creation
Before creating the RDS instance, we need to define where it can be placed by creating a DB subnet group.
# Create DB subnet group spanning private subnets
aws rds create-db-subnet-group \
   --db-subnet-group-name rds-ekslab \
   --db-subnet-group-description "Subnet group for EKS lab RDS instance" \
   --subnet-ids ${PRIVATE_SUBNET_1} ${PRIVATE_SUBNET_2}

A DB subnet group tells RDS which subnets it can use when launching a database instance. We're including both of our private subnets, which serves two important purposes. First, it ensures the database is never exposed directly to the internet. It will only be reachable from within our VPC. Second, it enables multi-AZ deployment if we wanted to add high availability later, since RDS would be able to place a standby replica in the second availability zone.
Secure Password Generation
Let's generate a cryptographically secure password for our database. This is much safer than using a predictable or manually chosen password.
# Generate cryptographically secure password
export RDS_PASSWORD=$(openssl rand -base64 32 | tr -d "=+/" | cut -c1-25)
echo $RDS_PASSWORD > .rds_password
echo "Generated secure RDS password"

Here's what this command does step by step. First, openssl rand -base64 32 generates 32 bytes of random data and encodes it in base64 format. The tr command removes characters that might cause issues in connection strings (equals signs, plus signs, and forward slashes). Finally, we truncate it to 25 characters to ensure it meets RDS password requirements. We save this password to a file so we can retrieve it later when connecting to the database.
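A quick sanity check on the pipeline is to confirm that the result has the expected length and contains only letters and digits. The sketch below wraps the same command in a function. `gen_password` is a hypothetical name, and the check assumes openssl is installed.

```shell
# Same pipeline as above, wrapped so it can be reused.
gen_password() {
  openssl rand -base64 32 | tr -d '=+/' | cut -c1-25
}

password=$(gen_password)
echo "length: ${#password}"

# After stripping '=', '+' and '/', only base64's letters and digits remain.
case $password in
  *[!A-Za-z0-9]*) echo "unexpected character found" ;;
  *)              echo "alphanumeric only" ;;
esac
```

The 32 random bytes encode to 44 base64 characters, so even after stripping a few there are in practice always at least 25 left to keep.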
RDS Instance Configuration
Now we'll create the actual PostgreSQL database instance with security-focused configuration.
# Create PostgreSQL RDS instance
aws rds create-db-instance \
   --db-instance-identifier rds-ekslab \
   --db-instance-class db.t3.micro \
   --engine postgres \
   --master-username postgres \
   --master-user-password ${RDS_PASSWORD} \
   --allocated-storage 20 \
   --vpc-security-group-ids ${RDS_SG} \
   --db-subnet-group-name rds-ekslab \
   --no-publicly-accessible \
   --backup-retention-period 0 \
   --storage-type gp2

# Wait for database to become available
aws rds wait db-instance-available --db-instance-identifier rds-ekslab

Let me walk through these configuration choices. We're using db.t3.micro, which is one of the smallest instance classes available. It's perfect for our demonstration while keeping costs minimal. The engine is PostgreSQL, which is a robust open-source relational database that works well for demonstrating network connectivity.
The vpc-security-group-ids parameter attaches our RDS_SG security group to the database. This is what enforces our carefully crafted access rules: only sources allowed by the security group rules we created earlier will be able to connect.
The --no-publicly-accessible flag is crucial for security. This ensures the database doesn't get a public IP address and can't be reached from the internet. Combined with our private subnet placement, this creates multiple layers of network security.
We're setting backup-retention-period to 0 because this is a demonstration environment and we don't need automated backups. In a production environment, you would definitely want automated backups enabled. The storage-type gp2 specifies general-purpose SSD storage, which provides good performance at reasonable cost.
The wait command at the end blocks until the database is fully available, which typically takes 5-10 minutes. During this time, RDS is provisioning the database instance, configuring storage, setting up the master user, and performing initial system setup.
Database Initialization
Once the database is available, we need to connect to it and create some test data that will help us verify connectivity from our pods later.
# Connect to the management instance and create test data
# Load the password generated earlier (if .rds_password is not on this host,
# copy it over or export RDS_PASSWORD manually)
export RDS_PASSWORD=$(cat .rds_password)

export RDS_ENDPOINT=$(aws rds describe-db-instances \
   --db-instance-identifier rds-ekslab \
   --query 'DBInstances[0].Endpoint.Address' \
   --output text)

# Connect to database and create test table
PGPASSWORD=${RDS_PASSWORD} psql -h ${RDS_ENDPOINT} -U postgres -d postgres << EOF
CREATE TABLE IF NOT EXISTS test_data (
    id SERIAL PRIMARY KEY,
    message TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO test_data (message) VALUES 
    ('Hello from authorized pod!'),
    ('Security groups for pods working correctly!'),
    ('Fine-grained network access control demonstrated.');

SELECT * FROM test_data;
EOF

Here's what we're doing in this initialization script. First, we retrieve the database endpoint hostname from AWS. This is the DNS name we'll use to connect to the database. Then we use the psql command-line tool to connect to PostgreSQL. We pass the password via the PGPASSWORD environment variable, which is a standard way to provide passwords to psql without interactive prompts.
Inside the SQL commands, we create a simple table called test_data with three columns: an auto-incrementing ID, a message text field, and a timestamp that defaults to the current time. We insert three test messages that we'll query later from our pods to verify connectivity. Finally, we select all rows to confirm the data was inserted successfully.
When you run this, you should see the three messages displayed, confirming the database is set up and accessible from the management instance.
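The password in the script above is a placeholder that you are expected to replace. As a minimal sketch of one way to do that, the snippet below generates a random alphanumeric password and writes it to the same .rds_password file with owner-only permissions. The generate_password helper is a hypothetical name introduced here for illustration, and the sketch assumes a Linux-like system with /dev/urandom and the base64 utility available.

```shell
# Sketch: generate a random password instead of hard-coding one.
# generate_password is a hypothetical helper, not part of the original walkthrough.
PASSWORD_FILE=".rds_password"

generate_password() {
  # Keep only letters and digits so the value avoids the characters that
  # RDS rejects in master passwords (/, @, ", and spaces).
  local length="${1:-25}"
  head -c 64 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | cut -c "1-${length}"
}

RDS_PASSWORD="$(generate_password 25)"
export RDS_PASSWORD

# umask only affects newly created files, so the file is owner-readable
# only if it does not already exist with looser permissions.
( umask 077 && printf '%s\n' "$RDS_PASSWORD" > "$PASSWORD_FILE" )
```

Because the value is written to .rds_password, the later step that reads the password back from that file keeps working unchanged.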

CNI Plugin Configuration
Now we need to configure the AWS VPC CNI plugin to enable the pod-level ENI assignment and branch networking functionality. This is a crucial step that activates the underlying technology that makes Security Groups for Pods possible.
Enabling Pod ENI Support
We'll activate the feature flag that tells the VPC CNI plugin to support dedicated ENIs for pods.
# Enable pod ENI feature on AWS VPC CNI
kubectl -n kube-system set env daemonset aws-node ENABLE_POD_ENI=true

# Restart CNI pods to apply configuration
kubectl -n kube-system rollout restart daemonset aws-node
kubectl -n kube-system rollout status daemonset aws-node

Let me explain what's happening here. The aws-node DaemonSet runs the VPC CNI plugin on every worker node in your cluster. This plugin is responsible for assigning IP addresses to pods and configuring their network interfaces. By setting the ENABLE_POD_ENI environment variable to true, we're telling the CNI plugin to support branch networking mode.
When this feature is enabled, the CNI plugin will watch for pods that have SecurityGroupPolicy rules applied to them. For these special pods, instead of just assigning an IP address from the node's primary ENI, the plugin will work with the VPC Resource Controller to provision a dedicated branch ENI. This dedicated ENI can then have its own security groups attached, independent of the node's security groups.
The rollout restart command forces all the aws-node pods to restart with the new configuration. The rollout status command then waits for the restart to complete successfully across all nodes. This typically takes a minute or two as each node's CNI pod is restarted in a rolling fashion.
Verification and Troubleshooting
After enabling the feature, let's verify that everything is configured correctly and that our nodes are ready to support pod ENIs.
# Verify CNI configuration
kubectl -n kube-system get daemonset aws-node -o yaml | grep -A 5 -B 5 ENABLE_POD_ENI

# Check node ENI capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_ENI:.status.allocatable.vpc\\.amazonaws\\.com/pod-eni

# Verify trunk ENI creation on nodes
NODE_ID=$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}' | cut -d'/' -f5)
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=$NODE_ID Name=interface-type,Values=trunk \
  --query 'NetworkInterfaces[*].NetworkInterfaceId'

# Check CNI logs for errors
kubectl -n kube-system logs -l k8s-app=aws-node --tail=20

These verification commands help us confirm that the feature is working as expected. The first command checks that the ENABLE_POD_ENI environment variable is properly set in the DaemonSet configuration. You should see the value set to "true" in the output.
The second command displays the pod-eni capacity for each node. This shows how many pods with dedicated ENIs each node can support. For our m5.large instances, you should see a number like "9" or similar, indicating that each node can support that many pods with custom security groups.
The third command looks for trunk ENIs on one of our nodes. When branch networking is enabled, the VPC CNI creates a special "trunk" ENI on each node that serves as the anchor point for branch ENIs. If you see a network interface ID returned here, it confirms that the trunk networking is properly configured.
Finally, we check the CNI plugin logs for any errors. If everything is working correctly, you shouldn't see any error messages. If there are problems, the logs will typically contain helpful information about what went wrong – perhaps permission issues, insufficient ENI capacity, or configuration problems.
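The trunk ENI check above extracts the EC2 instance ID from the node's providerID with cut, which depends on the ID being the fifth slash-separated field. As a small sketch of an alternative, the hypothetical helper below takes everything after the last slash instead. It assumes the providerID has the usual EC2-backed form aws:///&lt;availability-zone&gt;/&lt;instance-id&gt;; the sample ID is made up for illustration.

```shell
# Sketch: extract the instance ID from a node providerID.
# instance_id_from_provider_id is a hypothetical helper name.
instance_id_from_provider_id() {
  # Strip the longest prefix ending in "/" so only the last path segment remains.
  printf '%s\n' "${1##*/}"
}

# Usage against a live cluster (not run here):
#   NODE_ID=$(instance_id_from_provider_id \
#     "$(kubectl get nodes -o jsonpath='{.items[0].spec.providerID}')")

# Demonstration with a made-up providerID:
instance_id_from_provider_id "aws:///us-west-2a/i-0123456789abcdef0"
```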

Security Policies Implementation
With our infrastructure ready and the CNI plugin configured, we can now create the SecurityGroupPolicy resources that define which pods should receive which security groups. This is where we bridge the gap between Kubernetes pod identity (labels) and AWS network security (security groups).
Namespace and Context Setup
Let's start by creating a dedicated namespace for our demonstration resources. This helps keep things organized and makes cleanup easier.
# Create dedicated namespace for demonstration
kubectl create namespace networking
kubectl config set-context $(kubectl config current-context) --namespace=networking

# Verify namespace creation
kubectl get namespaces

Using a dedicated namespace provides several benefits for our demonstration. First, it isolates our demo resources from system components in the kube-system namespace and from any other applications that might be running. Second, it makes cleanup straightforward – we can delete the entire namespace later to remove all associated resources at once. Third, it provides a scope for our security policies, making it clear which resources they apply to.
The config set-context command changes your default namespace so that subsequent kubectl commands will operate in the networking namespace by default. This saves you from having to specify -n networking with every command.

SecurityGroupPolicy Resource Creation
Now we'll create the SecurityGroupPolicy custom resource that tells the system which pods should get our POD_SG security group.
# Look up the VPC ID, then the ID of the POD_SG security group
export VPC_ID=$(aws eks describe-cluster \
   --name pod-security-cluster-demo \
   --query "cluster.resourcesVpcConfig.vpcId" \
   --output text)

export POD_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=POD_SG Name=vpc-id,Values=${VPC_ID} \
   --query "SecurityGroups[0].GroupId" --output text)

# Verify SecurityGroupPolicy CRD exists
kubectl get crd securitygrouppolicies.vpcresources.k8s.aws

# Create security group policy
cat << EOF > sg-per-pod-policy.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: allow-rds-access
  namespace: networking
spec:
  podSelector:
    matchLabels:
      app: green-pod
  securityGroups:
    groupIds:
      - ${POD_SG}
EOF


kubectl apply -f sg-per-pod-policy.yaml

# Verify policy creation
kubectl -n networking get securitygrouppolicies
kubectl -n networking describe securitygrouppolicy allow-rds-access

Let me walk you through what this SecurityGroupPolicy does. The podSelector section uses Kubernetes label selectors to identify which pods should receive the security group. In this case, we're matching any pod with the label app: green-pod. This is standard Kubernetes label selector syntax, so you can use more complex selectors if needed (like multiple labels, or expressions).
The securityGroups section lists the AWS security group IDs that should be attached to matching pods. When a pod with the label app: green-pod is created in the networking namespace, the VPC Resource Controller sees it matches this policy. The controller then provisions a dedicated ENI for that pod and attaches our POD_SG security group to that ENI.
It's important to understand that this policy doesn't immediately change anything: it creates a rule that will apply to future pods. When you later create a pod with matching labels, that's when the ENI provisioning and security group attachment happens.
The verify commands at the end confirm that the SecurityGroupPolicy was created successfully and show its current status. You should see the policy listed with details about the pod selector and security groups.
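Since podSelector is a standard Kubernetes label selector, a more selective policy might combine matchLabels with matchExpressions. The sketch below renders such a manifest to a separate file so you can review it before applying anything. The policy name allow-rds-access-expr, the tier: backend label, and the reporting-pod value are hypothetical and do not appear elsewhere in this walkthrough; the fallback security group ID is a placeholder used only when POD_SG is not already exported.

```shell
# Sketch: a SecurityGroupPolicy that uses both matchLabels and matchExpressions.
# The fallback ID below is a placeholder; in the walkthrough POD_SG is already exported.
POD_SG="${POD_SG:-sg-0123456789abcdef0}"

cat << EOF > sg-per-pod-policy-expr.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: allow-rds-access-expr
  namespace: networking
spec:
  podSelector:
    matchLabels:
      tier: backend
    matchExpressions:
      - key: app
        operator: In
        values:
          - green-pod
          - reporting-pod
  securityGroups:
    groupIds:
      - ${POD_SG}
EOF

# Review the rendered manifest before applying it with kubectl apply -f.
cat sg-per-pod-policy-expr.yaml
```

Both conditions must hold for a pod to match: it needs the tier: backend label and an app label whose value is one of the listed names.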

Testing and Validation
Now comes the exciting part. We'll create two pods to demonstrate that our Security Groups for Pods implementation is working correctly. One pod will have the matching label and should be able to access the database, while the other pod won't have the label and should be blocked.
Kubernetes Secrets for Database Connectivity
First, we need to securely store our database connection credentials using Kubernetes secrets. This is a security best practice that keeps sensitive information out of pod specifications.
# Create secret with RDS connection details
export RDS_PASSWORD=$(cat .rds_password)
export RDS_ENDPOINT=$(aws rds describe-db-instances \
   --db-instance-identifier rds-ekslab \
   --query 'DBInstances[0].Endpoint.Address' \
   --output text)

kubectl -n networking create secret generic rds \
  --from-literal=password="${RDS_PASSWORD}" \
  --from-literal=host="${RDS_ENDPOINT}" \
  --from-literal=username=postgres \
  --from-literal=database=postgres \
  --dry-run=client -o yaml | kubectl apply -f -

# Verify secret creation
kubectl -n networking describe secret rds

Here's what we're doing with this secret creation. We're retrieving the password we generated earlier and the database endpoint hostname, then storing them in a Kubernetes secret along with the username and database name. The secret is created in the networking namespace where our test pods will run.
The --dry-run=client -o yaml | kubectl apply -f - pattern is a common Kubernetes technique that makes the command idempotent. If the secret already exists, it updates it rather than failing. This is useful when you need to run the command multiple times during testing or troubleshooting.
When pods reference this secret, Kubernetes injects the values as environment variables or mounts them as files, depending on how you configure the pod. The sensitive values stay out of the pod specification itself. Keep in mind that whether Secrets are encrypted at rest in etcd depends on how the cluster is configured, so you should also restrict who can read Secrets with RBAC.
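It helps to see why read access to a Secret matters: the values in a Secret's data field are base64-encoded, which is an encoding rather than encryption. The short sketch below round-trips a sample value locally. It assumes a base64 utility that accepts -d for decoding (as GNU coreutils does) and does not contact the cluster.

```shell
# Sketch: base64 is reversible by anyone who can read the Secret.
# Encode a sample value the way it appears in a Secret's data field.
encoded="$(printf '%s' 'postgres' | base64)"
echo "stored form:  $encoded"

# A plain base64 decode recovers the original value.
decoded="$(printf '%s' "$encoded" | base64 -d)"
echo "decoded form: $decoded"

# Against the live cluster, the equivalent check would be:
#   kubectl -n networking get secret rds -o jsonpath='{.data.username}' | base64 -d
```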

Green Pod (Authorized Database Access)
Now let's create our green pod – the pod that has the matching label and should successfully connect to the database.
cat << EOF > green-pod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: green-pod
  namespace: networking
  labels:
    app: green-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: green-pod
  template:
    metadata:
      labels:
        app: green-pod
    spec:
      containers:
      - name: postgres-client
        image: postgres:13-alpine
        env:
        - name: PGHOST
          valueFrom:
            secretKeyRef:
              name: rds
              key: host
        - name: PGPASSWORD
          valueFrom:
            secretKeyRef:
              name: rds
              key: password
        - name: PGUSER
          valueFrom:
            secretKeyRef:
              name: rds
              key: username
        - name: PGDATABASE
          valueFrom:
            secretKeyRef:
              name: rds
              key: database
        - name: PGSSLMODE
          value: require
        command: ["/bin/sh"]
        args:
        - -c
        - |
          echo "Green pod starting - should have database access..."
          echo "Attempting to connect to database at \$PGHOST"
          if psql -c "SELECT version();" 2>/dev/null; then
            echo "SUCCESS: Connected to PostgreSQL!"
            psql -c "SELECT version();"
            echo ""
            echo "Test data from database:"
            psql -c "SELECT id, message FROM test_data ORDER BY id;"
          else
            echo "ERROR: Could not connect to database"
            echo "This indicates security group configuration issues"
          fi
          echo "Sleeping to keep container running..."
          sleep 3600
        resources:
          limits:
            memory: "128Mi"
            cpu: "100m"
          requests:
            memory: "64Mi"
            cpu: "50m"
EOF

# Deploy green pod
kubectl apply -f green-pod.yaml
kubectl rollout status deployment green-pod

# Get pod name and check logs
export GREEN_POD_NAME=$(kubectl get pods -l app=green-pod -o jsonpath='{.items[0].metadata.name}')
echo "Green pod: $GREEN_POD_NAME"
kubectl logs $GREEN_POD_NAME

Let me explain what makes this pod special and why it should work. The key is the label app: green-pod in the pod template's metadata section. This label matches our SecurityGroupPolicy selector, so when this pod is created, the VPC Resource Controller will provision a dedicated ENI for it and attach the POD_SG security group.
The pod uses environment variables sourced from our Kubernetes secret to get the database connection details. PostgreSQL's command-line tools (like psql) automatically use these environment variables when set with the PG prefix. This means we don't need to specify connection parameters explicitly – the tools just work.
The startup script in the command section attempts to connect to the database and run a simple query. If the security groups are working correctly, the connection should succeed because this pod's ENI has the POD_SG security group, which is allowed to connect to port 5432 on the RDS security group. The script then queries our test_data table to display the messages we inserted earlier.
The expected output from the green pod logs should look similar to this (abridged: the version row is also printed once before the SUCCESS line, because the connection test itself writes its result to the log):
Green pod starting - should have database access...
Attempting to connect to database at rds-ekslab.xxxxx.us-west-2.rds.amazonaws.com
SUCCESS: Connected to PostgreSQL!
 PostgreSQL 13.x on x86_64-pc-linux-gnu...

Test data from database:
 id |                      message
----+---------------------------------------------------
  1 | Hello from authorized pod!
  2 | Security groups for pods working correctly!
  3 | Fine-grained network access control demonstrated.
(3 rows)

Sleeping to keep container running...


Red Pod (Unauthorized Database Access)
Now let's create the red pod – a pod without the matching label that should be blocked from accessing the database.
cat << EOF > red-pod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: red-pod
  namespace: networking
  labels:
    app: red-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: red-pod
  template:
    metadata:
      labels:
        app: red-pod
    spec:
      containers:
      - name: postgres-client
        image: postgres:13-alpine
        env:
        - name: PGHOST
          valueFrom:
            secretKeyRef:
              name: rds
              key: host
        - name: PGPASSWORD
          valueFrom:
            secretKeyRef:
              name: rds
              key: password
        - name: PGUSER
          valueFrom:
            secretKeyRef:
              name: rds
              key: username
        - name: PGDATABASE
          valueFrom:
            secretKeyRef:
              name: rds
              key: database
        - name: PGSSLMODE
          value: require
        command: ["/bin/sh"]
        args:
        - -c
        - |
          echo "Red pod starting - should NOT have database access..."
          echo "Attempting to connect to database at \$PGHOST"

          # Test database connection (should fail)
          if psql -c "SELECT version();" 2>/dev/null; then
            echo "UNEXPECTED: Connected to database!"
            echo "This suggests security group policy is not working correctly"
          else
            echo "EXPECTED: Could not connect to database"
            echo "This is correct - red pod should not have database access"
            echo "Security groups for pods is working properly!"
          fi

          # Keep container running for inspection
          echo "Sleeping to keep container running..."
          sleep 3600
        resources:
          limits:
            memory: "128Mi"
            cpu: "100m"
          requests:
            memory: "64Mi"
            cpu: "50m"
EOF

# Deploy red pod
kubectl apply -f red-pod.yaml
kubectl rollout status deployment red-pod

# Get pod name and check logs
export RED_POD_NAME=$(kubectl get pods -l app=red-pod -o jsonpath='{.items[0].metadata.name}')
echo "Red pod: $RED_POD_NAME"
kubectl logs $RED_POD_NAME

The red pod is intentionally configured almost identically to the green pod: same container image, same database credentials, same connection attempt. The only significant difference is the label: this pod has app: red-pod instead of app: green-pod.
Because this pod's label doesn't match our SecurityGroupPolicy, the VPC Resource Controller won't provision a dedicated ENI for it. Instead, this pod will use the node's primary network interface and inherit the node's security group. Since we specifically didn't add a rule allowing the node security group to access the RDS security group, this pod's connection attempts should be blocked at the network level.
The expected output from the red pod logs should look like this:
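Red pod starting - should NOT have database access...
Attempting to connect to database at rds-ekslab.xxxxx.us-west-2.rds.amazonaws.com
EXPECTED: Could not connect to database
This is correct - red pod should not have database access
Security groups for pods is working properly!
Sleeping to keep container running...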
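When a security group silently drops traffic, the client sees no refusal, so a connection attempt can hang until the TCP timeout expires. As a sketch of how to get a faster answer, the hypothetical probe_db helper below sets PGCONNECT_TIMEOUT, a standard libpq environment variable, before calling psql. It assumes it runs somewhere psql is installed and the other PG* variables are already set, for example inside either test pod via kubectl exec.

```shell
# Sketch: bounded connectivity probe. probe_db is a hypothetical helper name.
probe_db() {
  # Give up after N seconds instead of waiting for the TCP timeout.
  local timeout="${1:-5}"
  if PGCONNECT_TIMEOUT="$timeout" psql -c 'SELECT 1;' >/dev/null 2>&1; then
    echo "reachable"
  else
    # A failure here can be a timeout, a refused connection, or bad credentials.
    echo "unreachable"
  fi
}

# Usage inside a pod (not run here):
#   probe_db 5
```

Because any psql failure is reported the same way, an "unreachable" result from the green pod is a prompt to check the pod logs and security group rules rather than proof of a network block.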


ENI Assignment Verification
Let's verify that the green pod actually received a dedicated ENI while the red pod did not.
# Check green pod ENI assignment
echo "=== Green Pod ENI Assignment ==="
kubectl describe pod $GREEN_POD_NAME | grep -A 3 -B 3 "vpc.amazonaws.com/pod-eni"
kubectl get pod $GREEN_POD_NAME -o yaml | grep -A 5 "annotations:" | grep "vpc.amazonaws.com"

# Check red pod networking (should use node networking)
echo "=== Red Pod Networking ==="
kubectl describe pod $RED_POD_NAME | grep "vpc.amazonaws.com" || echo "No dedicated ENI (expected for red pod)"

# Verify ENI creation in AWS
echo "=== AWS ENI Verification ==="
aws ec2 describe-network-interfaces \
  --filters Name=interface-type,Values=branch \
  --query 'NetworkInterfaces[*].[NetworkInterfaceId,Description,Groups[0].GroupId]' \
  --output table

# Compare pod IP addresses
echo "=== Pod IP Comparison ==="
kubectl get pods -o wide

These verification commands help us understand what's happening at the infrastructure level. When you check the green pod's description, you should see annotations like vpc.amazonaws.com/pod-eni that indicate a dedicated ENI was assigned. The annotation will contain the ENI ID and other networking details.
For the red pod, you won't see these annotations because it's using the node's primary network interface instead of a dedicated ENI. This is the expected behavior.
The AWS CLI command queries EC2 for network interfaces of type branch, which are the ENIs created for pods with SecurityGroupPolicy assignments. You'll see the network interface ID, its description, and importantly, the security group ID (which should match our POD_SG).
When you run kubectl get pods -o wide, you can see the IP addresses assigned to each pod. Both pods will have IP addresses from your VPC's CIDR range, but they're coming from different network interfaces at the infrastructure level.
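If you want a yes/no answer rather than scanning grep output, the annotation can be read directly with a JSONPath expression. The sketch below wraps that in two hypothetical helpers, pod_eni_annotation and has_branch_eni. It assumes kubectl is pointed at the networking namespace, as set up earlier, and that the annotation key is vpc.amazonaws.com/pod-eni as described above.

```shell
# Sketch: check whether a pod carries the pod-eni annotation.
# Both function names are hypothetical helpers for this walkthrough.
pod_eni_annotation() {
  # Dots inside the annotation key are escaped so JSONPath treats it as one key.
  kubectl get pod "$1" -o jsonpath='{.metadata.annotations.vpc\.amazonaws\.com/pod-eni}'
}

has_branch_eni() {
  # Succeeds only when the annotation exists and is non-empty.
  [ -n "$(pod_eni_annotation "$1")" ]
}

# Usage against a live cluster (not run here):
#   has_branch_eni "$GREEN_POD_NAME" && echo "green pod has a branch ENI"
#   has_branch_eni "$RED_POD_NAME"   || echo "red pod has no branch ENI"
```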
Troubleshooting Common Issues
If things aren't working as expected, here are some diagnostic commands for resolving common implementation problems:
# If green pod cannot connect to database:

# 1. Verify security group rules
aws ec2 describe-security-groups --group-ids $POD_SG $RDS_SG

# 2. Check if POD_SG has access to RDS_SG
aws ec2 describe-security-groups --group-ids $RDS_SG --query 'SecurityGroups[0].IpPermissions'

# 3. Verify ENI assignment
kubectl describe pod $GREEN_POD_NAME | grep -E "(Events|vpc.amazonaws.com)"

# 4. Check CNI plugin status
kubectl -n kube-system logs -l k8s-app=aws-node --tail=100 | grep -i error

# 5. Validate SecurityGroupPolicy
kubectl get sgp allow-rds-access -o yaml

# 6. Ensure ENABLE_POD_ENI is set
kubectl -n kube-system get ds aws-node -o yaml | grep -A 5 ENABLE_POD_ENI

# If red pod unexpectedly connects:

# 1. Verify pod labels don't match policy
kubectl get pod $RED_POD_NAME --show-labels

# 2. Check for unintended security group rules
aws ec2 describe-security-groups --group-ids $RDS_SG --query 'SecurityGroups[0].IpPermissions'

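# 3. Confirm the node security group doesn't allow RDS access
#    (managed node groups use the cluster security group by default)
export NODE_SG=$(aws eks describe-cluster \
   --name pod-security-cluster-demo \
   --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
   --output text)
aws ec2 describe-security-groups --group-ids $NODE_SG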

These troubleshooting commands help you systematically diagnose problems. If the green pod can't connect, you work through the checklist: verify the security group rules exist, confirm the ENI was actually assigned, check for CNI errors, and validate the SecurityGroupPolicy configuration.
If the red pod unexpectedly can connect, you check whether it somehow got the wrong labels, whether there's an unintended security group rule allowing node-level access, or whether the node security group itself has database access that it shouldn't have.
Cleanup and Maintenance
When you're finished with this demonstration, it's important to clean up all the resources to avoid ongoing AWS charges. We'll walk through the cleanup process in the proper order to avoid dependency issues.
Kubernetes Resource Cleanup
Let's start by removing all the Kubernetes resources we created during the demonstration.
# Delete application deployments
kubectl delete -f green-pod.yaml
kubectl delete -f red-pod.yaml

# Delete security group policy
kubectl delete -f sg-per-pod-policy.yaml

# Delete secrets
kubectl delete secret rds

# Delete namespace (removes all resources)
kubectl delete namespace networking

# Disable pod ENI feature
kubectl -n kube-system set env daemonset aws-node ENABLE_POD_ENI=false
kubectl -n kube-system rollout status daemonset aws-node

# Verify ENI cleanup
kubectl get nodes -o custom-columns='NAME:.metadata.name,POD_ENI:.status.allocatable.vpc\.amazonaws\.com/pod-eni'

We start by deleting the individual deployments to ensure the pods are terminated gracefully. Then we remove the SecurityGroupPolicy, which stops the VPC Resource Controller from creating new ENIs. Deleting the namespace removes any remaining resources we might have created during testing.
Disabling the ENABLE_POD_ENI feature returns the CNI plugin to its default behavior. This doesn't immediately remove existing trunk ENIs, but it prevents new ones from being created.
RDS and Database Cleanup
Next, we'll remove the RDS database instance and its associated resources.
# Delete RDS instance (skip final snapshot for demo)
aws rds delete-db-instance \
   --db-instance-identifier rds-ekslab \
   --delete-automated-backups \
   --skip-final-snapshot

# Wait for deletion completion
aws rds wait db-instance-deleted --db-instance-identifier rds-ekslab

# Delete DB subnet group
aws rds delete-db-subnet-group \
   --db-subnet-group-name rds-ekslab

The --skip-final-snapshot flag means we won't create a snapshot before deleting the database. In a production environment, you'd typically want a final snapshot, but for our demonstration where the data isn't valuable, skipping it speeds up the deletion process. The wait command blocks until RDS confirms the instance is fully deleted, which can take several minutes.
EKS Cluster Deletion
Now we'll delete the EKS cluster and its node groups.
# Delete managed node group first
aws eks delete-nodegroup \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers

# Wait for node group deletion
aws eks wait nodegroup-deleted \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers

# Delete EKS cluster
aws eks delete-cluster --name pod-security-cluster-demo

# Wait for cluster deletion
aws eks wait cluster-deleted --name pod-security-cluster-demo

It's important to delete the node group before deleting the cluster. If you try to delete the cluster first, it will fail because node groups are dependent resources. The node group deletion process terminates all the EC2 instances and cleans up their associated resources. The cluster deletion then removes the control plane components.
Management Instance Cleanup
Let's remove the management instance and its associated resources.
# Terminate management instance
aws ec2 terminate-instances --instance-ids $INSTANCE_ID

# Wait for termination
aws ec2 wait instance-terminated --instance-ids $INSTANCE_ID

# Release Elastic IP
aws ec2 release-address --allocation-id $EIP_ALLOC

Terminating the instance is straightforward: AWS handles the cleanup of attached volumes and network interfaces automatically. But we need to explicitly release the Elastic IP address. Elastic IPs incur charges if they're allocated but not attached to a running instance, so releasing them is important to avoid unnecessary costs.
Complete VPC Infrastructure Removal
Now we'll remove all the VPC components. This is the most complex cleanup section because VPC resources have many interdependencies that must be resolved in the correct order.
export VPC_ID=$(aws ec2 describe-vpcs \
  --filters Name=cidr-block-association.cidr-block,Values=10.0.0.0/16 Name=isDefault,Values=false \
  --query 'Vpcs[?State==`available`].VpcId | [0]' --output text)

#!/bin/bash
set -euo pipefail

echo "=== Starting comprehensive VPC cleanup ==="

# First, let's identify what's still attached
echo "=== Remaining dependencies check ==="
aws ec2 describe-network-interfaces --filters Name=vpc-id,Values="$VPC_ID" --query 'NetworkInterfaces[*].[NetworkInterfaceId,Description,Status]' --output table
aws ec2 describe-instances --filters Name=vpc-id,Values="$VPC_ID" Name=instance-state-name,Values=running,pending,stopping --query 'Reservations[].Instances[*].[InstanceId,State.Name]' --output table

echo "=== Force delete any remaining ENIs ==="
for eni in $(aws ec2 describe-network-interfaces --filters Name=vpc-id,Values="$VPC_ID" --query 'NetworkInterfaces[?Status!=`in-use`].NetworkInterfaceId' --output text); do
  echo "Deleting ENI: $eni"
  aws ec2 delete-network-interface --network-interface-id "$eni" || true
done

echo "=== Wait for ENI cleanup ==="
sleep 30

echo "=== Delete load balancers ==="
# Delete ALB/NLB
for arn in $(aws elbv2 describe-load-balancers --query "LoadBalancers[?VpcId=='$VPC_ID'].LoadBalancerArn" --output text); do
  aws elbv2 delete-load-balancer --load-balancer-arn "$arn"
  echo "Deleted ALB/NLB: $arn"
done

# Delete Classic ELB
for name in $(aws elb describe-load-balancers --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text); do
  aws elb delete-load-balancer --load-balancer-name "$name"
  echo "Deleted Classic ELB: $name"
done

echo "=== Delete VPC Endpoints ==="
EP_IDS=$(aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values="$VPC_ID" --query 'VpcEndpoints[].VpcEndpointId' --output text || true)
if [ -n "${EP_IDS:-}" ]; then
  aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $EP_IDS
fi

echo "=== Delete NAT Gateways ==="
for nat in $(aws ec2 describe-nat-gateways --filter Name=vpc-id,Values="$VPC_ID" --query 'NatGateways[?State!=`deleted`].NatGatewayId' --output text); do
  aws ec2 delete-nat-gateway --nat-gateway-id "$nat"
  echo "Deleted NAT Gateway: $nat"
done

# Wait for NAT Gateway deletion
echo "Waiting for NAT Gateways to delete..."
while [ $(aws ec2 describe-nat-gateways --filter Name=vpc-id,Values="$VPC_ID" --query 'length(NatGateways[?State!=`deleted`])' --output text) != "0" ]; do
  echo "Still waiting for NAT Gateway deletion..."
  sleep 15
done

echo "=== Delete Internet Gateways ==="
for igw in $(aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values="$VPC_ID" --query 'InternetGateways[].InternetGatewayId' --output text); do
  aws ec2 detach-internet-gateway --internet-gateway-id "$igw" --vpc-id "$VPC_ID" || true
  aws ec2 delete-internet-gateway --internet-gateway-id "$igw"
  echo "Deleted Internet Gateway: $igw"
done

echo "=== Terminate any remaining instances ==="
for iid in $(aws ec2 describe-instances --filters Name=vpc-id,Values="$VPC_ID" Name=instance-state-name,Values=running,pending,stopping --query 'Reservations[].Instances[].InstanceId' --output text); do
  aws ec2 terminate-instances --instance-ids "$iid"
  echo "Terminating instance: $iid"
done

# Wait for instances to terminate
if [ $(aws ec2 describe-instances --filters Name=vpc-id,Values="$VPC_ID" Name=instance-state-name,Values=running,pending,stopping --query 'length(Reservations[].Instances[])' --output text) != "0" ]; then
  echo "Waiting for instances to terminate..."
  aws ec2 wait instance-terminated --instance-ids $(aws ec2 describe-instances --filters Name=vpc-id,Values="$VPC_ID" Name=instance-state-name,Values=running,pending,stopping --query 'Reservations[].Instances[].InstanceId' --output text)
fi

echo "=== Delete subnets ==="
for subnet in $(aws ec2 describe-subnets --filters Name=vpc-id,Values="$VPC_ID" --query 'Subnets[].SubnetId' --output text); do
  aws ec2 delete-subnet --subnet-id "$subnet" || true
  echo "Deleted subnet: $subnet"
done

echo "=== Delete route tables ==="
for rt in $(aws ec2 describe-route-tables --filters Name=vpc-id,Values="$VPC_ID" --query 'RouteTables[?!(Associations[?Main==`true`])].RouteTableId' --output text); do
  # Disassociate route table first
  for assoc in $(aws ec2 describe-route-tables --route-table-ids "$rt" --query 'RouteTables[].Associations[?Main==`false`].RouteTableAssociationId' --output text); do
    aws ec2 disassociate-route-table --association-id "$assoc" || true
  done
  aws ec2 delete-route-table --route-table-id "$rt" || true
  echo "Deleted route table: $rt"
done

echo "=== Delete security groups ==="
# Delete custom security groups (retry logic for dependencies)
for attempt in {1..3}; do
  echo "Security group deletion attempt $attempt..."
  for sg in $(aws ec2 describe-security-groups --filters Name=vpc-id,Values="$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text); do
    aws ec2 delete-security-group --group-id "$sg" 2>/dev/null && echo "Deleted SG: $sg" || echo "Failed to delete SG: $sg (will retry)"
  done
  sleep 10
done

echo "=== Final VPC deletion ==="
aws ec2 delete-vpc --vpc-id "$VPC_ID"
echo "VPC cleanup completed successfully!"

This cleanup script is comprehensive and handles all the common dependency issues you might encounter when deleting a VPC. Let me explain the order and reasoning behind each section.
We start by checking what resources are still attached to the VPC. This gives us visibility into any unexpected dependencies that might cause deletion failures. Then we delete any detached ENIs. These are network interfaces that EKS or the CNI plugin might have created that are no longer attached to instances.
Load balancers must be deleted before we can remove subnets, because they create ENIs in the subnets. We check for both modern Application/Network Load Balancers and Classic ELBs. VPC endpoints, if any were created, also need to be removed before subnet deletion.
The NAT Gateway deletion is particularly important to wait for completely, because NAT Gateways take several minutes to fully delete. If you try to delete the subnet while the NAT Gateway is still in "deleting" state, the deletion will fail.
Internet Gateways must be detached before they can be deleted. We use the || true pattern here because if the detachment fails (maybe it's already detached), we still want to try the deletion.
Subnets can be deleted once all resources using them are removed. Route tables need to be disassociated from subnets before deletion – we only delete non-main route tables, as the main route table is automatically deleted with the VPC.
Security groups often have dependencies on each other (if rules reference other security groups), so we use a retry loop with three attempts. Each iteration, some security groups might successfully delete, breaking dependencies for others.
Finally, once all attached resources are cleaned up, we can delete the VPC itself.
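The retry idea used for security groups generalizes to other deletions that fail while a dependency is still being torn down. As a minimal sketch, the hypothetical retry helper below reruns a single command up to a fixed number of attempts with a delay between them. Note that it differs from the loop in the script above, which re-lists the remaining security groups on every pass; this helper only repeats one command.

```shell
# Sketch: generic retry wrapper. retry is a hypothetical helper name.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  # Rerun the command until it succeeds or the attempt budget is used up.
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example against AWS (not run here):
#   retry 3 10 aws ec2 delete-security-group --group-id "$sg"
```

The function returns non-zero when every attempt fails, so under set -e you would still append || true if a failed deletion should not stop the script.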
IAM Resource Cleanup
The last step is cleaning up the IAM roles and policies we created at the beginning.
# Detach policies from roles
aws iam detach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

aws iam detach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSVPCResourceController

aws iam detach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

aws iam detach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

aws iam detach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

# Clean up management instance IAM resources
aws iam detach-role-policy \
  --role-name EKS-Management-Role \
  --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/EKS-Management-Policy

aws iam delete-policy \
  --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/EKS-Management-Policy

aws iam remove-role-from-instance-profile \
  --instance-profile-name EKS-Management-Profile \
  --role-name EKS-Management-Role

aws iam delete-instance-profile \
  --instance-profile-name EKS-Management-Profile

# Delete IAM roles
aws iam delete-role --role-name EKSClusterRole
aws iam delete-role --role-name EKSNodeGroupRole  
aws iam delete-role --role-name EKS-Management-Role

echo "Complete cleanup finished successfully"

IAM cleanup follows a specific order: first detach all policies from roles, then delete any custom policies we created, remove roles from instance profiles, delete the instance profiles, and finally delete the roles themselves. IAM requires this order because of the dependency chain: you can't delete a role that still has policies attached, and you can't delete an instance profile that still contains a role.
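The dependency chain can also be written down as data and sorted. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The step names are informal labels I made up for the commands above, not AWS API names.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must finish before it can run.
# The step names are hypothetical labels for the cleanup commands.
prerequisites = {
    "detach policies from roles": set(),
    "delete custom policy": {"detach policies from roles"},
    "remove role from instance profile": set(),
    "delete instance profile": {"remove role from instance profile"},
    "delete roles": {
        "detach policies from roles",
        "remove role from instance profile",
    },
}

# static_order() yields one valid order; several orders can satisfy the rules.
order = list(TopologicalSorter(prerequisites).static_order())
for step in order:
    print(step)
```

Any order it prints respects the same rules the script follows: policies are detached before roles or policies are deleted, and the role leaves the instance profile before either is removed.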
The custom EKS-Management-Policy that we created needs to be deleted using your account ID in the ARN. The aws sts get-caller-identity command retrieves your account ID dynamically so the command works regardless of which AWS account you're using.
Once this cleanup is complete, you've removed all resources created during this guide and won't incur any further charges.
Conclusion
This comprehensive guide demonstrated how to implement Security Groups for Pods in Amazon EKS, providing fine-grained network security controls at the pod level.
As always, I hope you enjoyed this guide and learned something valuable about securing your EKS workloads. If you want to stay connected or see more hands-on DevOps content, you can follow me on LinkedIn or Twitter.
For more practical, hands-on DevOps projects like this one, follow and star this repository: Learn-DevOps-by-building
 


 Docker Build Tutorial: Learn Contexts, Architecture, and Performance Optimization Techniques 
Destiny Erhabor — Tue, 07 Oct 2025 18:20:08 +0000
 Docker build is a fundamental concept every developer needs to understand. Whether you're containerizing your first application or optimizing existing Docker workflows, understanding Docker build contexts and Docker build architecture is essential for creating efficient, scalable containerized applications.
This comprehensive guide covers everything from basic concepts to advanced optimization techniques, helping you avoid common pitfalls and build better Docker images.
Table of Contents

What is Docker Build?

Docker Build Architecture: How It All Works

Docker Build Features

Docker Build Context

Types of Docker Build Contexts

Common Docker Build Mistakes (And How to Fix Them)

How to Optimize and Monitor Build Performance

Best Practices for Docker Build Performance

Troubleshooting Docker Build Issues

Conclusion


What is Docker Build?
Docker build is the process of creating a Docker image from a Dockerfile and a set of files called the build context. When you run docker build, you're instructing Docker to:

Read your Dockerfile instructions

Gather the necessary files (build context)

Execute each instruction step-by-step

Create a final Docker image


Think of it like following a recipe: the Dockerfile is your recipe, and the build context contains all the ingredients you might need.
Docker Build Architecture: How It All Works
Docker Build uses a client-server architecture where two separate components (Buildx and BuildKit) work together to build your Docker images. This is different from how many people think Docker works, as it's not just one monolithic program doing everything.
What is Buildx (The Client)?
Buildx serves as the user interface that you interact with directly whenever you work with Docker builds. When you type docker build . in your terminal, you're actually communicating with Buildx, which acts as the intermediary between you and the actual build engine.
Buildx’s primary jobs:

Interprets your build command and options

Sends structured build requests to BuildKit

Manages multiple BuildKit instances (builders)

Handles authentication and secrets

Displays build progress to you


What is BuildKit (The Server/Builder)?
BuildKit functions as the actual build engine that performs all the heavy lifting during the Docker build process. This powerful backend component receives the structured build requests from Buildx and immediately begins reading and interpreting your Dockerfiles line by line.
BuildKit’s primary jobs:

Receives build requests from Buildx

Reads and interprets Dockerfiles

Executes build instructions step by step

Manages build cache and layers

Requests only the files it needs from the client

Creates the final Docker image


How They Communicate
Here's what happens when you run docker build .:

The command initiates a multi-step exchange between Buildx and BuildKit.
First, it sends a build request containing your Dockerfile, build arguments, export options, and cache options. BuildKit then intelligently requests only the files it needs when it needs them, starting with package.json to run npm install for dependency installation.
After that's complete, it requests the src/ directory containing your application code and copies those files into the image with the COPY command.
Once all build steps are finished, BuildKit sends back the completed image. Optionally, you can then push this image to a container registry for distribution or deployment.
This on-demand file transfer approach is one of BuildKit's key optimizations: rather than sending your entire build context upfront, it only requests specific files as each build step needs them, making the build process more efficient.
Key Communication Details
Build request contains:
{
  "dockerfile": "FROM node:18\nWORKDIR /app\n...",
  "buildArgs": {"NODE_ENV": "production"},
  "exportOptions": {"type": "image", "name": "my-app:latest"},
  "cacheOptions": {"type": "registry", "ref": "my-app:cache"}
}

Resource requests:

BuildKit asks: "I need the file at ./package.json"

Buildx responds: Sends the actual file content

BuildKit asks: "I need the directory ./src/"

Buildx responds: Sends all files in that directory


Why This Architecture Exists
1. Efficiency
The old Docker builder had a major flaw: it always copied your entire build context upfront, regardless of what was actually needed. Even if your Dockerfile only used a few files, Docker would transfer hundreds of megabytes before starting the build.
BuildKit fixes this through on-demand file transfers. It only requests specific files at each step.
# Old Docker Builder (legacy)
# Always copied ENTIRE context upfront
$ docker build .
Sending build context to Docker daemon  245.7MB  # Everything!

# New BuildKit Architecture  
# Only requests files when needed
$ docker build .
#1 [internal] load build definition from Dockerfile    0.1s
#2 [internal] load .dockerignore                       0.1s
#3 [1/5] FROM node:18                                  0.5s
#4 [internal] load build context                       0.1s
#4 transferring context: 234B  # Only package.json initially!
#5 [2/5] WORKDIR /app                                  0.2s
#6 [3/5] COPY package*.json ./                         0.1s
#7 [4/5] RUN npm install                               5.2s
#8 [internal] load build context                       0.3s
#8 transferring context: 2.1MB  # Now requests src/ files
#9 [5/5] COPY src/ ./src/                              0.2s

2. Scalability
The client-server architecture enables scalability features. Multiple Docker CLI clients can connect to the same BuildKit instance, and BuildKit can run on remote servers instead of your local machine. This means you could execute builds on a cloud server while controlling them from your laptop. Teams can also deploy multiple BuildKit instances for different teams or purposes, scaling from individual developers to large enterprises.
3. Security
Security is improved by only requesting sensitive files when explicitly needed. BuildKit never sees files your Dockerfile doesn't reference, reducing the attack surface. It also handles credentials through separate, secure channels rather than mixing them with your build context, preventing secrets from being embedded in image layers or exposed in build logs.
Real-World Example
Let's trace through a typical build step by step. The full code is available here: 02-python-cache.
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src/ ./src/
COPY main.py .
CMD ["python", "main.py"]

Let’s see what actually happens here:

You run docker build .

Buildx says to BuildKit: "Here's a build request with this Dockerfile"

BuildKit processes FROM python:3.9-slim. No client files are needed; it pulls the base image.

BuildKit processes COPY requirements.txt . and asks Buildx: "I need requirements.txt". Buildx sends the file content.

BuildKit processes RUN pip install -r requirements.txt. No client files are needed; the command runs inside the container.

BuildKit processes COPY src/ ./src/ and asks Buildx: "I need all files in src/ directory". Buildx sends all files in src/.

BuildKit processes COPY main.py . and asks Buildx: "I need main.py". Buildx sends the file.

BuildKit says to Buildx: "Build complete, here's your image"


As the trace shows, BuildKit requests only what it needs, when it needs it, rather than this entire context:

my-app/
├── src/                 # ← Only loaded when COPY src/ runs
├── tests/              # ← Never requested (not in Dockerfile)
├── docs/               # ← Never requested  
├── node_modules/       # ← Never requested (in .dockerignore)
├── requirements.txt    # ← Loaded early (first COPY)
└── main.py            # ← Loaded later (second COPY)
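To make the trace concrete, here is a rough Python sketch that lists which context paths a Dockerfile's COPY and ADD instructions name, in order. It is a simplification I wrote for illustration, not how BuildKit parses Dockerfiles: it ignores the JSON form of COPY, line continuations, and wildcards.

```python
import shlex

DOCKERFILE = """\
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src/ ./src/
COPY main.py .
CMD ["python", "main.py"]
"""


def requested_paths(dockerfile_text):
    """Return the context paths named by COPY/ADD, in the order they appear."""
    paths = []
    for line in dockerfile_text.splitlines():
        parts = shlex.split(line, comments=True)
        if not parts or parts[0].upper() not in ("COPY", "ADD"):
            continue
        # COPY --from=<stage> reads from another stage, not the client context.
        if any(p.startswith("--from=") for p in parts[1:]):
            continue
        args = [p for p in parts[1:] if not p.startswith("--")]
        paths.extend(args[:-1])  # the last argument is the destination
    return paths


print(requested_paths(DOCKERFILE))  # ['requirements.txt', 'src/', 'main.py']
```

Directories such as tests/ and docs/ never show up in the result because no instruction names them.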

Docker Build Features
Named Contexts
👉 Demo project: 07-named-contexts
Named contexts allow you to include files from multiple sources during a build while keeping them logically separated. This is useful when you need documentation, configuration files, or shared libraries from different directories or repositories in your build.
# Build with additional named context
docker build --build-context docs=./documentation .

# Use named context in Dockerfile
FROM alpine
COPY . /app
# Mount files from named context
RUN --mount=from=docs,target=/docs \
    cp /docs/manual.pdf /app/

Build Secrets
👉 Demo project: 06-build-secrets
Build secrets let you pass sensitive information (like API keys or passwords) to your build without including them in the final image or build history. The secrets are mounted temporarily during specific RUN commands and are never stored in image layers.
# Pass secret to build from an environment variable
API_KEY=secret123 docker build --secret id=apikey,env=API_KEY .

# Use secret in Dockerfile
FROM alpine
# The alpine base image does not ship curl, so install it first
RUN apk add --no-cache curl
RUN --mount=type=secret,id=apikey \
    API_KEY=$(cat /run/secrets/apikey) && \
    curl -H "Authorization: $API_KEY" https://api.example.com/data

Docker Build Context
What is a Build Context?
The build context is the collection of files and directories that Docker can access during the build process. It's like gathering all your cooking ingredients on the counter before you start cooking.
docker build [OPTIONS] CONTEXT
                       ^^^^^^^
                       This is your build context

Why Build Contexts Matter

Security: Only files in the context can be accessed during build

Performance: Large contexts slow down builds

Functionality: Your Dockerfile can only COPY/ADD files from the context

Efficiency: Understanding contexts helps you build faster, leaner images


Types of Docker Build Contexts
1. Local Directory Context (Most Common)
👉 See code here: 01-node-local-context
This is what you'll use in 90% of cases – pointing to a folder on your machine:
# Use current directory
docker build .

# Use specific directory
docker build /path/to/my/project

# Use parent directory
docker build ..

Example Project Structure:
my-webapp/
├── src/
│   ├── index.js
│   └── utils.js
├── public/
│   ├── index.html
│   └── styles.css
├── package.json
├── package-lock.json
├── Dockerfile
├── .dockerignore
└── README.md

Corresponding Dockerfile:
FROM node:18-alpine
WORKDIR /app

# Copy package files first for better layer caching
COPY package*.json ./
RUN npm ci --only=production

# Copy application source
COPY src/ ./src/
COPY public/ ./public/

EXPOSE 3000
CMD ["node", "src/index.js"]

2. Remote Git Repository Context
You can build directly from Git repositories without cloning locally:
# Build from GitHub main branch
docker build https://github.com//project.git

# Build from specific branch
docker build https://github.com//project.git#develop

# Build from specific directory in repo
docker build https://github.com//project.git#main:docker

# Build with authentication
docker build --ssh default git@github.com:/private-repo.git

This has various use cases, such as CI/CD pipelines, building open-source projects, ensuring clean builds from source control, and automated deployments.
3. Remote Tarball Context
You can also build from compressed archives hosted on web servers. A remote tarball is a .tar.gz or similar compressed archive file accessible via HTTP/HTTPS. This is useful when your source code is packaged and hosted on a web server, artifact repository, or CDN. Docker downloads and extracts the archive automatically, using its contents as the build context.
This approach works well for CI/CD pipelines where build artifacts are stored centrally, or when you want to build images from released versions of your code without cloning entire repositories.
# Build from remote tarball
docker build http://server.com/context.tar.gz

# BuildKit downloads and extracts automatically
docker build https://example.com/project-v1.2.3.tar.gz

4. Empty Context (Advanced)
When you don't need any files, you can pipe the Dockerfile directly:
# Create image without file context
docker build -t hello-world - <<EOF
# alpine is an assumed base image; any image with a shell works
FROM alpine
RUN echo "Hello, World!" > /hello.txt
CMD cat /hello.txt
EOF

Common Docker Build Mistakes (And How to Fix Them)
Mistake 1: Wrong Context Directory
👉 Reproduced here: 04-wrong-context
This mistake occurs when you run docker build from the wrong directory, causing the build context to be different from what your Dockerfile expects.
In the example, running docker build frontend/ from the /projects/ directory means the context is /projects/frontend/, but the Dockerfile tries to access ../shared/utils.js, which is outside this context. Docker can only access files within the build context, so any attempt to reference files outside it will fail.
# Project structure
/projects/
├── frontend/
│   ├── Dockerfile
│   ├── src/
│   └── package.json
└── shared/
    └── utils.js

# WRONG - Running from projects directory
docker build frontend/
# This won't work if Dockerfile tries to COPY ../shared/utils.js

How to fix wrong context directory:
The key is aligning your build context with what your Dockerfile needs.

Option 1 changes your working directory so the context matches your Dockerfile's expectations. You run the build from inside frontend/, making that directory the context root.

Option 2 keeps you in the parent directory but explicitly sets it as the context (the . argument) while telling Docker where to find the Dockerfile with the -f flag. Now both frontend/ and shared/ are accessible since they're both within the /projects/ context.


# Option 1: Run from correct directory
cd frontend
docker build .

# Option 2: Use parent directory as context
docker build -f frontend/Dockerfile .
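If you want to check a path before building, the rule "the source must stay inside the context" can be sketched in a few lines of Python. This is an illustrative helper (the function name is mine), not something Docker provides, and it assumes POSIX-style paths like the ones in the example.

```python
from pathlib import Path


def is_inside_context(context_dir, source):
    """Return True if source, taken relative to context_dir, stays inside it."""
    context = Path(context_dir).resolve()
    target = (context / source).resolve()
    return target == context or context in target.parents


# Context is frontend/, so ../shared escapes it.
print(is_inside_context("/projects/frontend", "../shared/utils.js"))  # False
# Context is the parent directory, so shared/ is reachable.
print(is_inside_context("/projects", "shared/utils.js"))              # True
```

The first call mirrors the failing build, and the second mirrors Option 2, where the parent directory is the context.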

Mistake 2: Including Massive Files
👉 Optimized version with .dockerignore: 05-dockerignore-optimization
This mistake happens when your build context contains large, unnecessary files that slow down the build process.
Every file in the context is a candidate for transfer to the builder, and a broad instruction such as COPY . . pulls all of it in. Including files like node_modules (which can be hundreds of MB), git history, build artifacts, logs, and database dumps therefore makes builds painfully slow. These files are rarely needed in the final image and should be excluded.
# This context includes everything!
my-app/
├── node_modules/        # 200MB+ 
├── .git/               # Version history
├── dist/               # Built files
├── logs/               # Log files
├── temp/               # Temporary files
├── database.dump       # 1GB database backup
└── Dockerfile

How to fix an oversized build context:
Use .dockerignore to exclude unnecessary files, dramatically reducing context size and build time. We’ll discuss this in more detail below.
Mistake 3: Inefficient Layer Caching
👉 See good practice code here: 02-python-cache
This mistake defeats Docker's layer caching by copying frequently changing files (like source code) before running expensive operations (like npm install). When you modify your source code, Docker invalidates the cache for that layer and all subsequent layers, forcing npm install to run again even though dependencies haven't changed. This can turn a 5-second build into a 5-minute build.
# BAD - Changes to source code rebuild npm install
FROM node:18
COPY . /app
WORKDIR /app
RUN npm install
CMD ["npm", "start"]

How to fix inefficient layer caching:
Copy dependency files first, install dependencies, then copy source code. This way, npm install only runs when package.json or package-lock.json actually changes:
# GOOD - npm install only rebuilds when package.json changes
FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["npm", "start"]
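The effect of instruction order on caching can be modeled with a chained hash. The Python sketch below is a toy model under my own assumptions, not Docker's actual cache implementation: each layer's key depends on the previous key, the instruction text, and the content of the files it copies. The file names and contents are made up.

```python
import hashlib


def layer_keys(instructions, files):
    """Compute a chained cache key for each (instruction, copied_paths) pair."""
    keys = []
    previous = ""
    for text, copied in instructions:
        digest = hashlib.sha256()
        digest.update(previous.encode())
        digest.update(text.encode())
        for path in sorted(copied):
            digest.update(path.encode())
            digest.update(files[path].encode())
        previous = digest.hexdigest()
        keys.append(previous)
    return keys


def rebuilt_steps(instructions, before, after):
    """List the instructions whose cache key differs between two file states."""
    old = layer_keys(instructions, before)
    new = layer_keys(instructions, after)
    return [text for (text, _), a, b in zip(instructions, old, new) if a != b]


BAD = [
    ("COPY . /app", ["package.json", "src/index.js"]),
    ("RUN npm install", []),
]
GOOD = [
    ("COPY package*.json ./", ["package.json"]),
    ("RUN npm install", []),
    ("COPY . .", ["package.json", "src/index.js"]),
]

before = {"package.json": '{"name": "demo"}', "src/index.js": "console.log(1)"}
after = {**before, "src/index.js": "console.log(2)"}  # only source code changes

print(rebuilt_steps(BAD, before, after))   # ['COPY . /app', 'RUN npm install']
print(rebuilt_steps(GOOD, before, after))  # ['COPY . .']
```

Because every key includes the previous one, a change early in the chain invalidates everything after it, which is why the dependency install belongs before the source copy.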

How to Optimize and Monitor Build Performance
Understanding build performance metrics helps you identify bottlenecks and measure improvements.
How to Optimize Docker Builds with .dockerignore
The .dockerignore file is your secret weapon for faster, more secure builds. It tells Docker which files to exclude from the build context.
Creating .dockerignore Patterns
Create a .dockerignore file in your project root. The syntax is similar to .gitignore, and you can use wildcards (*), match specific file extensions (*.log), exclude entire directories (node_modules/), or use negation patterns (!important.txt) to include files that would otherwise be excluded. Each line represents a pattern, and comments start with #.
Example of a .dockerignore file:
# Dependencies
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Build outputs
dist/
build/
*.tgz

# Version control
.git/
.gitignore
.svn/

# IDE and editor files
.vscode/
.idea/
*.swp
*.swo
*~

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Logs and databases
*.log
*.sqlite
*.db

# Environment and secrets
.env
.env.local
.env.*.local
secrets/
*.key
*.pem

# Documentation
README.md
docs/
*.md

# Test files
test/
tests/
*.test.js
coverage/

# Temporary files
tmp/
temp/
*.tmp
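To get a feel for how ordered patterns and negation interact, here is a simplified matcher in Python. It is a sketch, not Docker's implementation: it matches patterns segment by segment with `fnmatch`, does not support `**`, and lets the last matching pattern win. The sample patterns are a hypothetical subset of the file above.

```python
from fnmatch import fnmatchcase


def _matches(pattern, path):
    """Match a pattern against the leading segments of a path."""
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) > len(path_parts):
        return False
    # Comparing only the leading segments means "node_modules/" also
    # covers everything underneath that directory.
    prefix = path_parts[:len(pattern_parts)]
    return all(fnmatchcase(part, pat) for part, pat in zip(prefix, pattern_parts))


def is_ignored(path, patterns):
    """Apply patterns in order; the last match wins and "!" re-includes."""
    ignored = False
    for raw in patterns:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        if negated:
            line = line[1:]
        if _matches(line, path):
            ignored = not negated
    return ignored


patterns = ["# Dependencies", "node_modules/", "*.md", "!README.md", ".env"]
print(is_ignored("node_modules/express/index.js", patterns))  # True
print(is_ignored("CHANGELOG.md", patterns))                   # True
print(is_ignored("README.md", patterns))                      # False
print(is_ignored("src/index.js", patterns))                   # False
```

Order matters here: the `!README.md` line only re-includes the file because it comes after `*.md`.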

Measuring Build Performance
Analyzing Build Time
Understanding where your build spends time helps identify bottlenecks and optimization opportunities. The detailed progress output shows timing for each build step, cache hits/misses, and resource usage.
# Enable BuildKit progress output
DOCKER_BUILDKIT=1 docker build --progress=plain .

# Use buildx for detailed timing
docker buildx build --progress=plain .

Profiling Context Transfer
Monitor context transfer time to understand how build context size affects overall performance. Profile which directories contribute most to help target .dockerignore optimizations.
# Measure context transfer time
time docker build --no-cache .

# Profile context size by directory
du -sh */ | sort -hr

Measuring .dockerignore Impact
Before adding .dockerignore, you'll notice that the context transfer is 245.7MB and takes 15.2s:
$ docker build .
#1 [internal] load build context
#1 transferring context: 245.7MB in 15.2s

After adding the .dockerignore file, the context shrinks to 2.1MB and transfers in 0.3s:
$ docker build .
#1 [internal] load build context  
#1 transferring context: 2.1MB in 0.3s

Result: 99% reduction in context size and 50x faster context transfer!
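The arithmetic behind those two figures is easy to check. This short Python snippet just plugs in the numbers from the two build outputs above:

```python
before_mb, after_mb = 245.7, 2.1   # context size before and after .dockerignore
before_s, after_s = 15.2, 0.3      # context transfer time before and after

size_reduction = (before_mb - after_mb) / before_mb * 100
speedup = before_s / after_s

print(f"Context size reduction: {size_reduction:.1f}%")  # 99.1%
print(f"Transfer speedup: {speedup:.1f}x")               # 50.7x
```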
Best Practices for Docker Build Performance
We've covered several optimization techniques throughout this guide. Here's a quick recap of the key practices, plus some additional strategies:

Layer Caching (covered in Mistake 3): Copy dependency files before source code to maximize cache reuse.

Using .dockerignore (covered in Mistake 2): Exclude unnecessary files to reduce context size and improve build speed.

Choosing the Right Context (covered earlier): Select appropriate context types (local, Git, tarball) based on your use case.


Now let’s talk about some more ways you can improve performance:
Use Multi-Stage Builds
👉 Demo project: 03-multistage-node
Multi-stage builds let you use one image for building/compiling your application and a different, smaller image for running it. This dramatically reduces your final image size by excluding build tools, source code, and other unnecessary files from the production image.
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

Use Specific Base Images
Generic base images like ubuntu:latest include many packages you don't need, making your images larger and slower to download. Specific images like node:18-alpine or distroless images contain only what's necessary for your application to run.
# Large base image
FROM ubuntu:latest

# Smaller, more specific base image  
FROM node:18-alpine

# Even smaller distroless image
FROM gcr.io/distroless/nodejs18-debian11

Combine RUN Commands
Each RUN command creates a new layer in your image. Multiple RUN commands create multiple layers, increasing image size. Combining commands into a single RUN instruction creates just one layer, and you can clean up temporary files in the same step.
# Creates multiple layers
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get clean

# Single layer
RUN apt-get update && \
    apt-get install -y curl && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

Troubleshooting Docker Build Issues
Issue: "COPY failed: no such file or directory"
Problem: File not in build context
What’s going wrong: Docker can only access files within the build context (the directory you specify in docker build). If your Dockerfile tries to COPY a file that doesn't exist in the context directory, the build fails. This often happens when running the build command from the wrong directory or when the file path is incorrect relative to the context root.
Solution:
# Check what's in your context
ls -la

# Verify file path relative to context
docker build -t debug . --progress=plain

Issue: "Docker Build is extremely slow"
Problem: Large build context
What’s going wrong: Every file your build uses from the context has to be transferred to the BuildKit daemon. If your context contains large files, directories like node_modules, or other unnecessary files, and your Dockerfile copies them in (for example, with COPY . .), this transfer can take minutes instead of seconds. The larger the context, the slower your builds become.
Solution:
# Check context size
du -sh .

# Add more patterns to .dockerignore
echo "large-directory/" >> .dockerignore
echo "*.zip" >> .dockerignore

Issue: "Cannot locate specified Dockerfile"
Problem: Dockerfile not in context root
What’s going wrong: By default, Docker looks for a file named Dockerfile in the root of your build context. If your Dockerfile is in a subdirectory or has a different name, Docker can't find it. This is common in monorepo setups where Dockerfiles are organized in separate folders.
Solution:
# Specify Dockerfile location
docker build -f path/to/Dockerfile .

# Or move Dockerfile to context root
mv path/to/Dockerfile .

Issue: "Cache misses on unchanged files"
Problem: File permissions or content changed
What’s going wrong: Docker's layer caching relies on checksums of file content and metadata such as permissions. Even if a file looks unchanged, a different mode bit or a regenerated file can cause a cache miss and force an unnecessary rebuild. Modification time alone is not part of the cache checksum, so a changed timestamp by itself should not invalidate the cache. Mismatches often appear after git operations, file system operations, or when files are copied between systems.
Solution:
# Check file modifications
git status

# Show permission (mode) changes that git has detected
git diff --summary

Conclusion
Understanding Docker build contexts and architecture is essential for achieving faster builds. In this article, we covered optimizing contexts and caching, creating smaller images with efficient layering and multi-stage builds, handling secrets properly to keep the attack surface small, and speeding up iteration cycles for a better developer experience.
👉 Full code examples are available on GitHub here: Docker build architecture examples
As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on LinkedIn or Twitter.
For more hands-on projects, follow and star this repository: Learn-DevOps-by-building
 


 Kubernetes Networking Tutorial: A Guide for Developers 
Destiny Erhabor — Mon, 23 Jun 2025 17:31:40 +0000
 Kubernetes networking is one of the most critical and complex parts of running containerized workloads in production. It’s what allows different parts of a Kubernetes system – like containers and services – to talk to each other.
This tutorial will walk you through both the theory as well as some hands-on examples and best practices for mastering Kubernetes networking.
Prerequisites

A basic understanding of containers, with Docker installed on your system.

A basic understanding of general networking terms.

The kubectl tool installed for running Kubernetes commands.

A Kubernetes cluster (Kind, Minikube, and so on).

Helm installed for Kubernetes package management.


Table of Contents

Introduction to Kubernetes Networking

Core Concepts in Kubernetes Networking

Cluster Networking Components

DNS and Service Discovery

Pod Networking Deep Dive

Services and Load Balancing

Network Policies and Security

Common Pitfalls and Troubleshooting

Summary and Next Steps


What is Kubernetes Networking?
So what actually is networking in Kubernetes? Well, in basic terms, it helps make sure that each container can communicate with the others, even if they're on different machines. It also ensures that outside traffic can reach the right containers when it needs to.
Kubernetes abstracts much of the complexity involved in networking, but understanding its internal workings helps you optimize and troubleshoot applications.
A key factor is that each pod gets a unique IP address and can communicate with all other pods without Network Address Translation (NAT). This simple yet powerful model supports complex distributed systems.
NAT (Network Address Translation) refers to the process of rewriting the source or destination IP address (and possibly port) of packets as they pass through a router or gateway.
Because NAT alters packet headers, it breaks the “end-to-end” transparency of the network:

The receiving host sees the NAT device’s address instead of the original sender’s.

Packet captures (for example, via tcpdump) only show the translated addresses, obscuring which internal endpoint truly sent the traffic.


Example: Home Wi-Fi Router NAT
Imagine your home network: you have a laptop, a phone, and a smart TV all connected to the same Wi-Fi. Your Internet provider assigns you one public IP address (say, 203.0.113.5). Internally, your router gives each device a private IP (for example, 192.168.1.10 for your laptop, 192.168.1.11 for your phone, and so on).

Outbound traffic: When your laptop (192.168.1.10) requests a webpage, the router rewrites the packet’s source IP from 192.168.1.10 → 203.0.113.5 (and tracks which internal port maps to which device).

Inbound traffic: When the webpage replies, it arrives at 203.0.113.5, and the router uses its NAT table to forward that packet back to 192.168.1.10.


Because of this translation:

External servers only see the router’s IP (203.0.113.5), not your laptop’s.

Packets are “masqueraded” so multiple devices can share one public address.
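The router's bookkeeping can be sketched as a small lookup table. The Python class below is a toy model of source NAT for the home router example: the port numbers and the web server address 198.51.100.7 are hypothetical values chosen for illustration.

```python
class NatRouter:
    """Toy source-NAT table, loosely modeled on the home router example."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.table = {}  # public port -> (private ip, private port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source address and remember the mapping."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (src_ip, src_port)
        return (self.public_ip, public_port, dst_ip, dst_port)

    def inbound(self, src_ip, src_port, dst_port):
        """Use the table to send a reply back to the original device."""
        private_ip, private_port = self.table[dst_port]
        return (src_ip, src_port, private_ip, private_port)


router = NatRouter("203.0.113.5")
sent = router.outbound("192.168.1.10", 51000, "198.51.100.7", 443)
reply = router.inbound("198.51.100.7", 443, sent[1])
print(sent)   # ('203.0.113.5', 40000, '198.51.100.7', 443)
print(reply)  # ('198.51.100.7', 443, '192.168.1.10', 51000)
```

The web server only ever sees 203.0.113.5 in `sent`, and the laptop's address reappears only because the router kept state in its table.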


In contrast, Kubernetes pods communicate without this extra translation layer – each pod IP is “real” within the cluster, so no router-like step obscures who talked to whom.
Example: E-Commerce Microservices
Consider an online store built as separate microservices, each running in its own pod with a unique IP:

Product Catalog Service: 10.244.1.2

Shopping Cart Service: 10.244.2.3

User Authentication Service: 10.244.1.4

Payment Processing Service: 10.244.3.5


When a shopper adds an item to their cart, the Shopping Cart Pod reaches out directly to the Product Catalog Pod at 10.244.1.2. Because there’s no NAT or external proxy in the data path, this communication is fast and reliable – which is crucial for delivering a snappy, real-time user experience.
Tip: For a complete, hands-on implementation of this scenario (and others), check out the “networking-concepts-practice” section of my repository: Learn-DevOps-by-building | networking-concepts-practice
Importance in Distributed Systems
Networking in distributed systems facilitates the interaction of multiple services, enabling microservices architectures to function efficiently. Reliable networking supports redundancy, scalability, and fault tolerance.
Kubernetes Networking Model Principles
Kubernetes networking operates on three foundational pillars that create a consistent and high-performance network environment:
1. Unique IP per Pod
Every pod receives its own routable IP address, eliminating port conflicts and simplifying service discovery. This design treats pods like traditional VMs or physical hosts: each can bind to standard ports (for example, 80/443) without remapping.
This helps developers avoid port-management complexity, and tools (like monitoring, tracing) work seamlessly, since pods appear as first-class network endpoints.
2. NAT-Free Pod Communication
Pods communicate directly without Network Address Translation (NAT). Packets retain their original source/destination IPs, ensuring end-to-end visibility. This simplifies debugging (for example, tcpdump shows real pod IPs) and enables precise network policies. No translation layer also means lower latency and no hidden stateful bottlenecks.
3. Direct Node-Pod Routing
Nodes route traffic to pods without centralized gateways. Each node handles forwarding decisions locally (via CNI plugins), creating a flat L3 network. This avoids single points of failure and optimizes performance – cross-node traffic flows directly between nodes, not through proxies. Scalability is inherent, and adding nodes expands capacity linearly.
Challenges in Container Networking
Common challenges include managing dynamic IP addresses, securing communications, and scaling networks without performance degradation. While Kubernetes abstracts networking complexities, real-world deployments face hurdles, like:
Dynamic IP Management:
Pods are ephemeral – IPs change constantly during scaling, failures, or updates. Hard-coded IPs break, and DNS caching (with misconfigured TTLs) risks routing to stale endpoints. Solutions like CoreDNS dynamically track pod IPs via the Kubernetes API, while readiness probes ensure only live pods are advertised.
Secure Communication:
Default cluster-wide pod connectivity exposes "east-west" threats. Compromised workloads can scan internal services, and encrypting traffic (for example, mTLS) adds CPU overhead. Network Policies enforce segmentation (for example, isolating PCI-compliant services), and service meshes automate encryption without app changes.
Performance at Scale:
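Large clusters strain legacy tooling. In iptables mode, the number of rules grows with every Service and endpoint, so packet processing slows down in clusters with thousands of Services. Overlay networks (for example, VXLAN) add encapsulation overhead and can fragment packets when the MTU is not adjusted, and centralized load balancers can bottleneck traffic. Modern CNIs (Cilium with eBPF, Calico with BGP) avoid some of these bottlenecks, while kube-proxy's IPVS mode replaces sequential iptables rule matching with hash-table lookups that are effectively O(1).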
Core Concepts in Kubernetes Networking
What are Pods and Nodes?
Pods are the smallest deployable units. Each pod runs on a node, which could be a virtual or physical machine.
Scenario Example: Web Application Deployment
A typical web application might have:

Three frontend pods running NGINX (distributed across two nodes)

Five backend API pods running Node.js (distributed across three nodes)

Two database pods running PostgreSQL (on dedicated nodes with SSD storage)


# View pods distributed across nodes
kubectl get pods -o wide

NAME                        READY   STATUS    NODE
frontend-6f4d85b5c9-1p4z2   1/1     Running   worker-node-1
frontend-6f4d85b5c9-2m5x3   1/1     Running   worker-node-1
frontend-6f4d85b5c9-3n6c4   1/1     Running   worker-node-2
backend-7c8d96b6b8-4q7d5    1/1     Running   worker-node-2
backend-7c8d96b6b8-5r8e6    1/1     Running   worker-node-3
...

What are Services?
Services expose pods using selectors. They provide a stable network identity even as pod IPs change.
kubectl expose pod nginx-pod --port=80 --target-port=80 --name=nginx-service

Scenario Example: Database Service Migration
A team needs to migrate their database from MySQL to PostgreSQL without disrupting application functionality:

Deploy PostgreSQL pods alongside existing MySQL pods

Create a database service that initially selects only MySQL pods:


apiVersion: v1
kind: Service
metadata:
  name: database-service
spec:
  selector:
    app: mysql
  ports:
  - port: 3306
    targetPort: 3306


Update application to be compatible with both databases

Update the service selector to include both MySQL and PostgreSQL pods:


selector:
  app: database  # New label applied to both MySQL and PostgreSQL pods


Gradually remove MySQL pods while the service routes traffic to available PostgreSQL pods

The service abstraction allows for zero-downtime migration by providing a consistent endpoint throughout the transition.
Communication Paths
A communication path is simply the route that network traffic takes from its source to its destination within (or into/out of) the cluster. In Kubernetes, the three main paths are:

Pod-to-Pod: Direct traffic between two pods (possibly on different nodes).

Pod-to-Service: Traffic from a pod destined for a Kubernetes Service (which then load-balances to one of its backend pods).

External-to-Service: Traffic originating outside the cluster (e.g. from an end-user or external system) directed at a Service (often via a LoadBalancer or Ingress).


Pod-to-Pod Communication
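Pods communicate directly with each other using their IP addresses without NAT. Pod names are not resolvable through cluster DNS by default, so use the target pod's IP (shown by kubectl get pod pod-b -o wide). For example:
kubectl exec -it pod-a -- ping <pod-b-ip>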

Scenario Example: Sidecar Logging
In a log aggregation setup, each application pod has a sidecar container that processes and forwards logs:

Application container writes logs to a shared volume

Sidecar container reads from the volume and forwards to a central logging service


# Check communication between application and sidecar
kubectl exec -it app-pod -c app -- ls -la /var/log/app
kubectl exec -it app-pod -c log-forwarder -- tail -f /var/log/app/application.log

Because both containers are in the same pod, they can communicate via localhost and shared volumes without any network configuration.
Pod-to-Service Communication
Pods communicate with services using DNS names, enabling load-balanced access to multiple pods:
kubectl exec -it pod-a -- curl http://my-service.default.svc.cluster.local

Scenario Example: API Gateway Pattern
A microservices architecture uses an API gateway pattern:

Frontend pods need to access fifteen or more backend microservices

Instead of tracking individual pod IPs, the frontend connects to service names:


// Frontend code
const authService = 'http://auth-service.default.svc.cluster.local';
const userService = 'http://user-service.default.svc.cluster.local';
const productService = 'http://product-service.default.svc.cluster.local';

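async function getUserProducts(userId) {
  const authResponse = await fetch(`${authService}/validate`);
  if (!authResponse.ok) {
    throw new Error(`Auth check failed: ${authResponse.status}`);
  }
  // Fetch both resources in parallel and parse the bodies
  // (assumes both services respond with JSON)
  const [userRes, productsRes] = await Promise.all([
    fetch(`${userService}/users/${userId}`),
    fetch(`${productService}/products?user=${userId}`),
  ]);
  return { user: await userRes.json(), products: await productsRes.json() };
}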

Each service name resolves to a stable endpoint, even as the underlying pods are scaled, replaced, or rescheduled.
External-to-Service Communication
External communication is facilitated through service types like NodePort or LoadBalancer. An example of NodePort usage:
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport-service
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30080
  selector:
    app: my-app

Now, this service can be accessed externally via the IP address of any node:
curl http://<node-ip>:30080

Scenario Example: Public-Facing Web Application
A company runs a public-facing web application that needs external access:

Deploy the application pods with three replicas

Create a LoadBalancer service to expose the application:


apiVersion: v1
kind: Service
metadata:
  name: web-app
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb  # Cloud-specific annotation
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: web-app


When deployed on AWS, this automatically provisions a Network Load Balancer with a public IP

External users access the application through the load balancer, which distributes traffic across all three pods


# Check the external IP assigned to the service
kubectl get service web-app

NAME      TYPE           CLUSTER-IP      EXTERNAL-IP            PORT(S)
web-app   LoadBalancer   10.100.41.213   a1b2c3.amazonaws.com   80:32456/TCP

Cluster Networking Components
Kubernetes networking transforms abstract principles into reality through tightly orchestrated components. Central to this is the Container Network Interface (CNI), a standardized specification that governs how network connectivity is established for containers.
What is a Container Network Interface (CNI)?
At its essence, CNI acts as Kubernetes' networking plugin framework. It’s responsible for dynamically assigning IP addresses to pods, creating virtual network interfaces (like virtual Ethernet pairs), and configuring routes whenever a pod starts or stops.
Crucially, Kubernetes delegates these low-level networking operations to CNI plugins, allowing you to choose implementations aligned with your environment’s needs: whether that’s Flannel’s simple overlay networks for portability, Calico’s high-performance BGP routing for bare-metal efficiency, or Cilium’s eBPF-powered data plane for advanced security and observability.
Working alongside CNI, kube-proxy operates on every node, translating Service abstractions into concrete routing rules within the node’s kernel (using iptables or IPVS). Meanwhile, CoreDNS provides seamless service discovery by dynamically mapping human-readable names (for example, cart-service.production.svc.cluster.local) to stable Service IPs. Together, these components form a cohesive fabric, ensuring pods can communicate reliably whether they’re on the same node or distributed across global clusters.
High-Level CNI Plugin Differences:

Flannel: Simple overlay (VXLAN, host-gw) for basic multi-host networking.

Calico: Pure-L3 routing using BGP or IP-in-IP, plus rich network policies.

Cilium: eBPF-based dataplane for ultra-fast packet processing and advanced features like API-aware policies.


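These plugins implement the CNI standard for managing pod IPs and routing. In many clusters, the CNI plugin's pods run in the kube-system namespace, so you can check which one is installed with: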
kubectl get pods -n kube-system

Scenario Example: Multi-Cloud Deployment with Calico
A company operates a hybrid deployment across AWS and Azure:

Choose Calico as the CNI plugin for consistent networking across clouds:

# Install Calico on both clusters
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Verify Calico pods are running
kubectl get pods -n kube-system -l k8s-app=calico-node

Calico provides:

Consistent IPAM (IP Address Management) across both clouds

Network policy enforcement in both environments

BGP routing for optimized cross-node traffic



When migrating workloads between clouds, the networking layer behaves consistently despite different underlying infrastructure.

What is kube-proxy?
kube-proxy is a network component that runs on each node and implements Kubernetes’ Service abstraction. Its responsibilities include:

Watching the API server for Service and Endpoint changes.

Programming the node’s packet-filtering layer (iptables or IPVS) so that traffic to a Service ClusterIP:port gets load-balanced to one of its healthy backend pods.

Handling session affinity, if configured (so repeated requests from the same client go to the same pod).


By doing this per-node, kube-proxy ensures any pod on that node can reach any Service IP without needing a central gateway.
What are iptables & IPVS?
Both iptables and IPVS are Linux kernel subsystems that kube-proxy can use to manage Service traffic:
iptables mode
kube-proxy generates a set of NAT rules (in the nat table) so that when a packet arrives for a Service IP, the kernel rewrites its destination to one of the backend pod IPs.
IPVS mode
IPVS (IP Virtual Server) runs as part of the kernel’s Netfilter framework. Instead of dozens or hundreds of iptables rules, it keeps a high-performance hash table of virtual services and real servers.
Here's a comparison of the iptables and IPVS modes:

| Mode | Pros | Cons |
|---|---|---|
| iptables | Simple and universally available on Linux systems; battle-tested and easy to debug | Rule complexity grows linearly with Services/Endpoints; packet processing slows at scale due to sequential rule checks; Service updates trigger full rule reloads |
| IPVS | O(1) lookup time regardless of cluster size; built-in load-balancing algorithms (RR, LC, SH); incremental updates without full rule recomputation; lower CPU overhead for large clusters | Requires Linux kernel ≥4.4 and IPVS modules loaded; more complex initial configuration; limited visibility with traditional tools |
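To make the scaling difference concrete, here is a toy model in Go. It is not how kube-proxy or the kernel is implemented: the rule type, the 5,000-Service figure, and the generated 10.96.x.y addresses are assumptions chosen only to contrast a sequential rule scan with a hash-table lookup.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for one iptables Service rule.
type rule struct {
    clusterIP string
    backend   string
}

// linearLookup scans rules in order, the way a sequential rule chain is
// evaluated. It returns the backend and how many rules were inspected.
func linearLookup(rules []rule, clusterIP string) (string, int) {
    for i, r := range rules {
        if r.clusterIP == clusterIP {
            return r.backend, i + 1
        }
    }
    return "", len(rules)
}

// hashLookup models a hash-table lookup: one probe regardless of table size.
func hashLookup(table map[string]string, clusterIP string) (string, bool) {
    backend, ok := table[clusterIP]
    return backend, ok
}

func main() {
    const services = 5000 // illustrative cluster size
    rules := make([]rule, 0, services)
    table := make(map[string]string, services)
    for i := 0; i < services; i++ {
        ip := fmt.Sprintf("10.96.%d.%d", i/256, i%256)
        backend := fmt.Sprintf("pod-%d", i)
        rules = append(rules, rule{clusterIP: ip, backend: backend})
        table[ip] = backend
    }

    target := "10.96.19.135" // the last Service added (i = 4999)
    backend, inspected := linearLookup(rules, target)
    fmt.Printf("linear scan: %s after %d rule checks\n", backend, inspected)

    backend, _ = hashLookup(table, target)
    fmt.Printf("hash lookup: %s after a single probe\n", backend)
}
```

The worst case for the scan grows with the number of Services, while the map lookup stays roughly constant. That is the intuition behind the O(1) claim for IPVS.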


Scenario Example: Debugging Service Connectivity
When troubleshooting service connectivity issues in a production cluster:

First, check if kube-proxy is functioning:

# Check kube-proxy pods
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Examine kube-proxy logs
kubectl logs -n kube-system kube-proxy-a1b2c


Inspect the iptables rules created by kube-proxy on a node:

# Connect to a node
ssh worker-node-1

# View iptables rules for a specific service
sudo iptables-save | grep my-service


The output reveals how traffic to ClusterIP 10.96.45.10 is load-balanced across multiple backend pod IPs:

-A KUBE-SVC-XYZAB12345 -m comment --comment "default/my-service" -m statistic --mode random --probability 0.33332 -j KUBE-SEP-POD1
-A KUBE-SVC-XYZAB12345 -m comment --comment "default/my-service" -m statistic --mode random --probability 0.50000 -j KUBE-SEP-POD2
-A KUBE-SVC-XYZAB12345 -m comment --comment "default/my-service" -j KUBE-SEP-POD3

Understanding these rules helps diagnose why traffic might not be reaching certain pods.
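The probabilities in those rules look uneven, but they produce an even split. Each rule is evaluated only if the previous ones did not match, so rule i out of n needs a match probability of 1/(n-i), and the last rule always matches. The following Go sketch reproduces that arithmetic. The function names are illustrative and are not part of kube-proxy.

```go
package main

import "fmt"

// ruleProbabilities returns the per-rule match probability for n backends.
// Rule i (0-based) is only reached if all earlier rules missed, so it must
// match with probability 1/(n-i) to give every backend an equal share.
func ruleProbabilities(n int) []float64 {
    probs := make([]float64, n)
    for i := range probs {
        probs[i] = 1.0 / float64(n-i)
    }
    return probs
}

// effectiveShares converts per-rule probabilities into the overall fraction
// of traffic each backend receives.
func effectiveShares(probs []float64) []float64 {
    shares := make([]float64, len(probs))
    remaining := 1.0 // probability that traffic reaches the current rule
    for i, p := range probs {
        shares[i] = remaining * p
        remaining *= 1 - p
    }
    return shares
}

func main() {
    probs := ruleProbabilities(3)
    shares := effectiveShares(probs)
    for i := range probs {
        fmt.Printf("rule %d: match probability %.5f, overall share %.5f\n",
            i+1, probs[i], shares[i])
    }
}
```

For three backends, the per-rule probabilities are about 0.333, 0.5, and 1, and each backend ends up with roughly one third of the traffic.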
DNS and Service Discovery
Every service in Kubernetes relies on DNS to map a human-friendly name (for example, my-svc.default.svc.cluster.local) to its ClusterIP. When pods come and go, DNS records must update quickly so clients never hit stale addresses.
Kubernetes uses CoreDNS as its cluster DNS server. When you create a Service, CoreDNS serves an A (or AAAA) record that points to its ClusterIP, plus an SRV record for each named port. For headless Services (clusterIP: None), the A records point directly at the ready pod IPs instead. CoreDNS watches the Kubernetes API for Service and endpoint changes, so if a pod crashes or is rescheduled, its records are updated in near real time.
Key mechanics:

Service A record → ClusterIP

SRV records → named Service ports (for headless Services, the backend pods & ports)

TTL tuning → how long clients cache entries


Why recovery matters:

A DNS TTL that’s too long can leave clients retrying an old IP.

A TTL that’s too short increases DNS load.

Readiness probes must report “not ready” before a pod is removed from the Service’s endpoints (and, for headless Services, from DNS).


CoreDNS
CoreDNS provides DNS resolution for services inside the cluster.
kubectl exec -it busybox -- nslookup nginx-service

Service discovery is automatic, using:
<service-name>.<namespace>.svc.cluster.local
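As a small illustration of that naming pattern, here is a Go helper that assembles a Service's fully qualified name. The serviceFQDN function is a name introduced here for the example, and cluster.local is the default cluster domain, which a cluster administrator can change.

```go
package main

import "fmt"

// serviceFQDN builds the cluster-internal DNS name for a Service.
// "cluster.local" is the default cluster domain; clusters can be configured
// with a different one, so treat it as an assumption.
func serviceFQDN(service, namespace string) string {
    return fmt.Sprintf("%s.%s.svc.cluster.local", service, namespace)
}

func main() {
    fmt.Println(serviceFQDN("nginx-service", "default"))
    fmt.Println(serviceFQDN("payment-service", "finance"))
}
```

Pods in the same namespace can also use the short name (nginx-service), because the pod's DNS search path fills in the rest.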

Scenario Example: Microservices Environment Variables vs. DNS
A team is migrating from hard-coded Service IPs in environment variables to Kubernetes DNS names:
Before: Configuration via hard-coded IPs
apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
  - name: order-app
    image: order-service:v1
    env:
    - name: PAYMENT_SERVICE_HOST
      value: "10.100.45.12"
    - name: INVENTORY_SERVICE_HOST
      value: "10.100.67.34"
    - name: USER_SERVICE_HOST
      value: "10.100.23.78"

After: Using Kubernetes DNS service discovery
apiVersion: v1
kind: Pod
metadata:
  name: order-service
spec:
  containers:
  - name: order-app
    image: order-service:v2
    env:
    - name: PAYMENT_SERVICE_HOST
      value: "payment-service.default.svc.cluster.local"
    - name: INVENTORY_SERVICE_HOST
      value: "inventory-service.default.svc.cluster.local"
    - name: USER_SERVICE_HOST
      value: "user-service.default.svc.cluster.local"

When the team needs to relocate the payment service to a dedicated namespace for PCI compliance:

Move payment service to "finance" namespace

Update only one environment variable:


- name: PAYMENT_SERVICE_HOST
  value: "payment-service.finance.svc.cluster.local"


The application continues working without rebuilding container images or updating other services

Pod Networking Deep Dive
Under the hood, each pod has its own network namespace, virtual Ethernet (veth) pair, and an interface like eth0. The CNI plugin glues these into the cluster fabric.
When the kubelet creates a pod, the container runtime invokes your CNI plugin, which:


Allocates an IP from a pool.

Creates a veth pair and moves one end into the pod’s netns.

Programs routes on the host so that other nodes know how to reach this IP.
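The first of those steps, IP allocation, can be sketched as a tiny in-memory allocator. This is a simplified model and not the code of any real CNI IPAM plugin: the pool type and its methods are hypothetical, state is kept only in memory, and it does not reserve gateway or broadcast addresses.

```go
package main

import (
    "errors"
    "fmt"
    "net/netip"
)

// pool is a toy IPAM allocator for a single node's pod CIDR.
type pool struct {
    prefix netip.Prefix
    used   map[netip.Addr]bool
}

func newPool(cidr string) (*pool, error) {
    prefix, err := netip.ParsePrefix(cidr)
    if err != nil {
        return nil, err
    }
    return &pool{prefix: prefix.Masked(), used: make(map[netip.Addr]bool)}, nil
}

// allocate returns the first free address in the pool, skipping the
// network address itself.
func (p *pool) allocate() (netip.Addr, error) {
    for addr := p.prefix.Addr().Next(); p.prefix.Contains(addr); addr = addr.Next() {
        if !p.used[addr] {
            p.used[addr] = true
            return addr, nil
        }
    }
    return netip.Addr{}, errors.New("pool exhausted")
}

// release returns an address to the pool when its pod is deleted.
func (p *pool) release(addr netip.Addr) {
    delete(p.used, addr)
}

func main() {
    p, err := newPool("10.244.2.0/24")
    if err != nil {
        panic(err)
    }
    first, _ := p.allocate()
    second, _ := p.allocate()
    fmt.Println(first, second)

    // When a pod is deleted, its address becomes available again.
    p.release(first)
    reused, _ := p.allocate()
    fmt.Println(reused)
}
```

Because addresses are recycled like this, a pod IP is only meaningful for the lifetime of the pod that holds it.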






Namespaces and Virtual Ethernet
Each pod gets a Linux network namespace and connects to the host via a virtual Ethernet pair.
kubectl exec -it nginx-pod -- ip addr

Scenario Example: Debugging Network Connectivity
When troubleshooting connectivity issues between pods:

Examine the network interfaces inside a pod:

kubectl exec -it web-frontend-pod -- ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
2: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 82:cf:d8:e9:7a:12 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.2.45/24 scope global eth0
    inet6 fe80::80cf:d8ff:fee9:7a12/64 scope link


Trace the path from pod to node:

# On the node hosting the pod
sudo ip netns list
# Shows namespace like: cni-1a2b3c4d-e5f6-7890-a1b2-c3d4e5f6g7h8

# Examine connections on the node
sudo ip link | grep veth
# Shows virtual ethernet pairs like: veth123456@if2: ...

# Check routes on the node
sudo ip route | grep 10.244.2.45
# Shows how traffic reaches the pod

This investigation reveals how traffic flows from the pod through its namespace, via virtual ethernet pairs, then through the node's routing table to reach other pods.
Shared Networking in Multi-Container Pods
Multi-container pods share the same network namespace. Use this for sidecar and helper containers.
Scenario Example: Service Mesh Sidecar
When implementing Istio service mesh with automatic sidecar injection:

Deploy an application with Istio sidecar injection enabled:

apiVersion: v1
kind: Pod
metadata:
  name: api-service
  annotations:
    sidecar.istio.io/inject: "true"
spec:
  containers:
  - name: api-app
    image: api-service:v1
    ports:
    - containerPort: 8080


After deployment, the pod has two containers sharing the same network namespace:

kubectl describe pod api-service

Name:         api-service
...
Containers:
  api-app:
    ...
    Ports:          8080/TCP
    ...
  istio-proxy:
    ...
    Ports:          15000/TCP, 15001/TCP, 15006/TCP, 15008/TCP
    ...


The sidecar container intercepts all network traffic:

kubectl exec -it api-service -c istio-proxy -- netstat -tulpn

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address     Foreign Address     State       PID/Program name
tcp        0      0 0.0.0.0:15001     0.0.0.0:*           LISTEN      1/envoy
tcp        0      0 0.0.0.0:15006     0.0.0.0:*           LISTEN      1/envoy


Traffic entering or leaving the pod is transparently intercepted without requiring application changes. Loopback traffic inside the pod is not redirected through the proxy, so test from a different pod:

kubectl exec -it <client-pod> -- curl http://<api-service-pod-ip>:8080
# The request passes through the istio-proxy sidecar before it reaches api-app

This shared network namespace enables the service mesh to implement features like traffic encryption, routing, and metrics collection without application modifications.
Services and Load Balancing
Kubernetes Services abstract a set of pods behind a single virtual IP.
A Service object defines a stable IP (ClusterIP), a DNS entry, and a selector. kube-proxy then programs each node to intercept traffic to that IP and forward it to one of the matching pods. How the Service is exposed depends on its type.
Service types:

ClusterIP (default): internal only

NodePort: opens the Service on every node’s port (e.g. 30080)

LoadBalancer: asks your cloud provider for an external LB

ExternalName: CNAME to an outside DNS name


Load-balancing mechanics:

kube-proxy with iptables (random backend selection) or IPVS (round-robin, least-connection, and other schedulers)

External Ingress (NGINX, Traefik) for HTTP/S with host/path routing
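To illustrate two of the scheduling strategies named above, here is a minimal Go sketch of round-robin and least-connection selection. It is a conceptual model and not IPVS code. The backend type, the pod addresses, and the connection counts are made up for the example.

```go
package main

import "fmt"

// backend tracks a pod IP and its number of in-flight connections.
type backend struct {
    addr   string
    active int
}

// roundRobin hands out backends in a fixed rotation.
type roundRobin struct {
    backends []*backend
    next     int
}

func (rr *roundRobin) pick() *backend {
    b := rr.backends[rr.next%len(rr.backends)]
    rr.next++
    return b
}

// leastConn returns the backend with the fewest active connections,
// preferring the earliest one on ties.
func leastConn(backends []*backend) *backend {
    best := backends[0]
    for _, b := range backends[1:] {
        if b.active < best.active {
            best = b
        }
    }
    return best
}

func main() {
    // Hypothetical pods with different current load.
    pods := []*backend{
        {addr: "10.244.1.10", active: 4},
        {addr: "10.244.2.11", active: 1},
        {addr: "10.244.3.12", active: 7},
    }
    rr := &roundRobin{backends: pods}
    for i := 0; i < 4; i++ {
        fmt.Println("round-robin:", rr.pick().addr)
    }
    fmt.Println("least-conn: ", leastConn(pods).addr)
}
```

Round-robin ignores load and simply cycles, while least-connection favors the pod with the fewest open connections, which helps when request durations vary widely.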


🔧 Service Types




| Type | Description |
|---|---|
| ClusterIP | Default, internal only |
| NodePort | Exposes service on node IP |
| LoadBalancer | Uses cloud provider LB |
| ExternalName | DNS alias for external service |


Scenario Example: Multi-Tier Application Exposure
A company runs a three-tier web application with different exposure requirements:

Frontend web tier (public-facing):

apiVersion: v1
kind: Service
metadata:
  name: frontend-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:region:account:certificate/cert-id"
spec:
  type: LoadBalancer
  ports:
  - port: 443
    targetPort: 8080
  selector:
    app: frontend


API tier (internal to frontend only):

apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  type: ClusterIP  # Internal only
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: api


Database tier (internal to API only):

apiVersion: v1
kind: Service
metadata:
  name: db-service
spec:
  type: ClusterIP
  ports:
  - port: 5432
    targetPort: 5432
  selector:
    app: database

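This configuration exposes each tier at a different level:

Only the frontend is exposed to the internet (with TLS)

The API and the database are reachable only from inside the cluster, because ClusterIP Services have no external address

A ClusterIP Service does not by itself limit which pods can connect. To make sure that only frontend pods reach the API and only API pods reach the database, add NetworkPolicies as well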


Ingress Controllers
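Ingress provides HTTP(S) routing and TLS termination. An Ingress resource only takes effect when an Ingress controller, such as ingress-nginx, is running in the cluster:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install my-ingress ingress-nginx/ingress-nginx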

Scenario Example: Hosting Multiple Applications on a Single Domain
A company hosts multiple microservices apps under the same domain with different paths:

Deploy nginx-ingress controller:

helm install nginx-ingress ingress-nginx/ingress-nginx --set controller.publishService.enabled=true


Configure routing for multiple services:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: company-apps
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
  - hosts:
    - services.company.com
    secretName: company-tls
  rules:
  - host: services.company.com
    http:
      paths:
      - path: /dashboard
        pathType: Prefix
        backend:
          service:
            name: dashboard-service
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-gateway
            port:
              number: 80
      - path: /docs
        pathType: Prefix
        backend:
          service:
            name: documentation-service
            port:
              number: 80


User traffic flow:

User visits https://services.company.com/dashboard

Traffic hits the LoadBalancer service for the ingress controller

Ingress controller routes to the dashboard-service based on path

Dashboard service load balances across dashboard pods




This allows hosting multiple applications behind a single domain and TLS certificate.
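The path routing in this scenario relies on pathType: Prefix, which matches whole path segments, so /api matches /api/users but not /apiary. When several rules match, the longest prefix wins. Below is a simplified Go model of that selection logic. It is a sketch of the matching rules and not the ingress-nginx implementation, and the route type and function names are assumptions for illustration.

```go
package main

import (
    "fmt"
    "strings"
)

// route maps an Ingress path prefix to a backend Service name.
type route struct {
    prefix  string
    service string
}

// prefixMatches reports whether requestPath matches prefix element by
// element, so "/api" matches "/api/users" but not "/apiary".
func prefixMatches(prefix, requestPath string) bool {
    prefix = strings.TrimSuffix(prefix, "/")
    if prefix == "" {
        return true // "/" matches every path
    }
    return requestPath == prefix || strings.HasPrefix(requestPath, prefix+"/")
}

// pickBackend returns the Service for the longest matching prefix.
func pickBackend(routes []route, requestPath string) (string, bool) {
    best, found := route{}, false
    for _, r := range routes {
        if prefixMatches(r.prefix, requestPath) && (!found || len(r.prefix) > len(best.prefix)) {
            best, found = r, true
        }
    }
    return best.service, found
}

func main() {
    routes := []route{
        {prefix: "/dashboard", service: "dashboard-service"},
        {prefix: "/api", service: "api-gateway"},
        {prefix: "/docs", service: "documentation-service"},
    }
    for _, p := range []string{"/dashboard", "/api/users", "/docs/", "/apiary"} {
        if svc, ok := pickBackend(routes, p); ok {
            fmt.Printf("%-12s -> %s\n", p, svc)
        } else {
            fmt.Printf("%-12s -> no matching rule\n", p)
        }
    }
}
```

A request that matches no rule is handled by the controller's default backend, which typically returns a 404.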
Network Policies and Security
Network Policies restrict communication based on pod selectors and namespaces.
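apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-ingress  # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Ingress  # no ingress rules listed, so all inbound traffic to frontend pods is denied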

Use Cases

Isolate environments (for example, dev vs prod)

Control egress to the internet

Enforce zero-trust networking


Scenario Example: PCI Compliance for Payment Processing
A financial application processes credit card payments and must comply with PCI DSS requirements:

Create dedicated namespace with strict isolation:

kubectl create namespace payment-processing


Deploy payment pods to the isolated namespace:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-processor
  namespace: payment-processing
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment
  template:
    metadata:
      labels:
        app: payment
        pci: "true"
    spec:
      containers:
      - name: payment-app
        image: payment-processor:v1
        ports:
        - containerPort: 8080


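Define a network policy that:

Only allows traffic from authorized services

Blocks all egress except to specific APIs

A standard NetworkPolicy only allows or denies traffic. To monitor and log connection attempts, pair it with CNI tooling such as Cilium Hubble or Calico flow logs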




apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pci-payment-policy
  namespace: payment-processing
spec:
  podSelector:
    matchLabels:
      pci: "true"
  policyTypes:
  - Ingress
  - Egress
  ingress:
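  - from:
    # One list item with both selectors: the source must be a checkout pod
    # AND run in a namespace labelled environment=production
    - namespaceSelector:
        matchLabels:
          environment: production
      podSelector:
        matchLabels:
          role: checkout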
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - ipBlock:
        cidr: 192.168.5.0/24  # Payment gateway API
    ports:
    - protocol: TCP
      port: 443
  - to:
    - namespaceSelector:
        matchLabels:
          name: logging
    ports:
    - protocol: TCP
      port: 8125  # Metrics port


Validate policy with connectivity tests:

# Test from authorized pod (should succeed)
kubectl exec -it -n production checkout-pod -- curl payment-processor.payment-processing.svc.cluster.local:8080

# Test from unauthorized pod (should fail)
kubectl exec -it -n default test-pod -- curl payment-processor.payment-processing.svc.cluster.local:8080

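This policy isolates the payment pods so that only authorized services can reach them. Because egress is restricted as well, the payment pods cannot resolve DNS names unless you also allow egress to the cluster DNS service on port 53 (UDP and TCP).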
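How entries in a from list combine is easy to get wrong. Selectors inside one list item are ANDed, while separate list items are ORed. The following Go sketch is a simplified model of that rule, not the Kubernetes implementation. The types are hypothetical, it ignores ipBlock and matchExpressions, and the role=reporting pod is invented for the example.

```go
package main

import "fmt"

// labels is a set of key/value labels on a pod or namespace.
type labels map[string]string

// peer models one entry in a NetworkPolicy "from" list.
// A nil selector means "not specified".
type peer struct {
    namespaceSelector labels
    podSelector       labels
}

// source describes the pod that is trying to connect.
type source struct {
    namespaceLabels labels
    podLabels       labels
    sameNamespace   bool // true if the source is in the policy's namespace
}

// matches reports whether every key/value in selector is present in have.
func matches(selector, have labels) bool {
    for k, want := range selector {
        if got, ok := have[k]; !ok || got != want {
            return false
        }
    }
    return true
}

// peerAllows applies the selectors within one peer with AND semantics.
func peerAllows(p peer, s source) bool {
    if p.namespaceSelector != nil {
        if !matches(p.namespaceSelector, s.namespaceLabels) {
            return false
        }
    } else if !s.sameNamespace {
        // A podSelector on its own only selects pods in the policy's namespace.
        return false
    }
    return p.podSelector == nil || matches(p.podSelector, s.podLabels)
}

// allowed ORs the peers together: any single matching peer admits traffic.
func allowed(peers []peer, s source) bool {
    for _, p := range peers {
        if peerAllows(p, s) {
            return true
        }
    }
    return false
}

func main() {
    production := labels{"environment": "production"}
    checkout := labels{"role": "checkout"}

    // Two separate list items: namespace matches OR pod matches.
    separate := []peer{{namespaceSelector: production}, {podSelector: checkout}}
    // One combined item: namespace matches AND pod matches.
    combined := []peer{{namespaceSelector: production, podSelector: checkout}}

    // A non-checkout pod running in a production-labelled namespace.
    other := source{namespaceLabels: production, podLabels: labels{"role": "reporting"}}

    fmt.Println("separate items allow it:", allowed(separate, other))
    fmt.Println("combined item allows it:", allowed(combined, other))
}
```

With separate items, any pod in a production namespace gets through. With the combined item, only checkout pods in a production namespace do, which is usually what a policy like this intends.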
Common Pitfalls and Troubleshooting
Pod Not Reachable

Symptom: ping or application traffic times out.

Steps to troubleshoot:

Check pod status & logs:
 kubectl get pod myapp-abc123 -o wide
 kubectl logs myapp-abc123


Inspect CNI plugin logs:
 # e.g. for Calico on kube-system:
 kubectl -n kube-system logs ds/calico-node


Run a network debug container (netshoot):
 kubectl run -it --rm netshoot --image=nicolaka/netshoot -- bash
 # inside netshoot:
 ping <pod-ip>
 ip link show
 ip route show




Why pods can be unreachable: IP allocation failures, misconfigured veth, MTU mismatch, CNI initialization errors.


Service Unreachable

Symptom: Clients can’t hit the Service IP, or curl to ClusterIP:port fails.

Steps to troubleshoot:

Verify Service and Endpoints:
 kubectl get svc my-svc -o yaml
 kubectl get endpoints my-svc -o wide


Inspect kube-proxy rules:
 # iptables mode:
 sudo iptables-save | grep <service-name>
 # IPVS mode:
 sudo ipvsadm -Ln


Test connectivity from a pod:
 kubectl exec -it netshoot -- curl -v http://<cluster-ip>:<port>




Why services break: Missing endpoints (selector mismatch), stale kube-proxy rules, DNS entries pointing at wrong IP.


Policy-Blocked Traffic

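Symptom: Connections time out, because most CNI plugins silently drop packets that a policy denies. Some plugins reject them instead, which shows up as a connection that is refused or reset immediately.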

Steps to troubleshoot:

List NetworkPolicies in the namespace:
 kubectl get netpol


Describe the policy logic:
 kubectl describe netpol allow-frontend


Simulate allowed vs. blocked flows:
 # From a debug pod:
 kubectl exec -it netshoot -- \
   curl --connect-timeout 2 http://<service-ip>:<port>




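Why policies bite you: Once any NetworkPolicy selects a pod, all traffic in that direction that is not explicitly allowed is denied. Other common causes are an overly strict podSelector or namespaceSelector and missing egress rules (including DNS).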


🔍 Tools you can use:

kubectl exec: Run arbitrary commands inside any pod. It’s ideal for running ping, curl, ip, or tcpdump from the pod’s own network namespace.

tcpdump: Capture raw packets on an interface. Use it (inside netshoot or via kubectl exec) to see if traffic actually leaves/arrives at a pod.

Netshoot: A utility pod image packed with networking tools (ping, traceroute, dig, curl, tcpdump, and so on) so you don’t have to build your own.

Cilium Hubble: An observability UI/API for Cilium that shows per-connection flows, L4/L7 metadata, and policy verdicts in real time.

Calico Flow Logs: Calico’s logging of allow/deny decisions and packet metadata. It’s great for auditing exactly which policy rule matched a given packet.


Scenario Example: Troubleshooting Service Connection Issues
A team is experiencing connection failures to a database service:

Check if the service exists and has endpoints:

kubectl get service postgres-db
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
postgres-db  ClusterIP   10.96.145.232   <none>        5432/TCP   3d

kubectl get endpoints postgres-db
NAME         ENDPOINTS   AGE
postgres-db  <none>      3d


The service exists but has no endpoints. Check pod selectors:

kubectl describe service postgres-db
Name:              postgres-db
Namespace:         default
Selector:          app=postgres,tier=db
...

kubectl get pods --selector=app=postgres,tier=db
No resources found in default namespace.


Inspect the database pods:

kubectl get pods -l app=postgres
NAME                        READY   STATUS    RESTARTS   AGE
postgres-6b4f87b5c9-8p7x2   1/1     Running   0          3d

kubectl describe pod postgres-6b4f87b5c9-8p7x2
...
Labels:       app=postgres
              pod-template-hash=6b4f87b5c9
...


Found the issue: The pod has label app=postgres but missing the tier=db label required by the service selector.

Fix by updating the service selector:


kubectl patch service postgres-db -p '{"spec":{"selector":{"app":"postgres"}}}'


Verify endpoints are now populated:

kubectl get endpoints postgres-db
NAME         ENDPOINTS             AGE
postgres-db  10.244.2.45:5432      3d

This systematic debugging approach quickly identified a label mismatch causing the connection issues.
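The root cause here comes down to how equality-based selectors work: a pod is selected only if it carries every key/value pair in the selector. A minimal Go sketch of that check, using the labels from this scenario, looks like this. The selectorMatches function is an illustrative name and not a Kubernetes API.

```go
package main

import "fmt"

// selectorMatches reports whether a pod's labels satisfy every key/value
// pair in a Service selector (equality-based selectors only).
func selectorMatches(selector, podLabels map[string]string) bool {
    for key, want := range selector {
        if got, ok := podLabels[key]; !ok || got != want {
            return false
        }
    }
    return true
}

func main() {
    podLabels := map[string]string{
        "app":               "postgres",
        "pod-template-hash": "6b4f87b5c9",
    }

    original := map[string]string{"app": "postgres", "tier": "db"}
    patched := map[string]string{"app": "postgres"}

    fmt.Println("original selector matches:", selectorMatches(original, podLabels))
    fmt.Println("patched selector matches: ", selectorMatches(patched, podLabels))
}
```

Extra labels on the pod (such as pod-template-hash) never prevent a match. Only labels that the selector asks for and the pod lacks do. An alternative fix would be to add the tier=db label to the pod template instead of loosening the selector.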
Summary
In this tutorial, you explored:

Pod and service communication

Cluster-wide routing and discovery

Load balancing and ingress

Network policy configuration


As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on LinkedIn or Twitter.
For more hands-on projects, follow and star this repository: Learn-DevOps-by-building | networking-concepts-practice
 


 How to Send and Parse JSON Data in Golang – Data Encoding and Decoding Explained With Examples 
Destiny Erhabor — Mon, 05 Aug 2024 13:00:54 +0000
 When building web applications in Golang, working with JSON data is inevitable. Whether you're sending responses to clients or parsing requests, JSON encoding and decoding are essential skills to master. 
In this article, we'll explore the different ways to encode and decode JSON in Golang.
Table of Contents

How to Send JSON Responses (Encoding)
How to Use the Marshal Function for JSON Encoding
How to Use the NewEncoder Function
How to Parse JSON Requests (Decoding)
How to Use the Unmarshal Function to Parse JSON Requests
How to Use NewDecoder Function for JSON Decoding
Custom JSON Marshaling and Unmarshaling
How to Use JSON Marshaler
How to Use JSON Unmarshaler
Trade-offs
Use Cases and Recommendations
Conclusion

How to Send JSON Responses (Encoding)
JSON encoding is the process of converting Go data structures into JSON format.
Encoding refers to the process of converting data from one format to another. In the context of computing and data transmission, encoding typically involves converting data into a standardized format that can be easily stored, transmitted, or processed by different systems or applications.
Think of encoding like packing a suitcase for a trip. You take your clothes (data) and pack them into a suitcase (encoded format) so that they can be easily transported (transmitted) and unpacked (decoded) at your destination.
In the case of JSON encoding, the data is converted into a text-based format that uses human-readable characters to represent the data. This makes it easy for humans to read and understand the data, as well as for different systems to exchange and process the data.
Some common reasons for encoding data include:

Data compression: Reducing the size of the data to make it easier to store or transmit.
Data security: Protecting the data from unauthorized access or tampering.
Data compatibility: Converting data into a format that can be read and processed by different systems or applications.
Data transmission: Converting data into a format that can be easily transmitted over a network or other communication channels.

In Golang, we can use the encoding/json package to encode JSON data.
How to Use the Marshal Function for JSON Encoding
The Marshal function is the most commonly used method for encoding JSON data in Golang. It takes a Go data structure as input and returns the JSON encoding as a byte slice ([]byte), along with an error.
package main

import (
    "encoding/json"
    "net/http"
)

// Person is the data structure we send back as JSON.
type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func handler(w http.ResponseWriter, r *http.Request) {
    person := Person{Name: "John", Age: 30}

    // Encoding - one step: Marshal returns the JSON as a byte slice
    jsonBytes, err := json.Marshal(person)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // Tell the client that the body is JSON before writing it
    w.Header().Set("Content-Type", "application/json")
    w.Write(jsonBytes)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

Code Explanation:
Imports:

encoding/json: Provides functions for encoding and decoding JSON.
net/http: Provides the HTTP server and the handler types.

Person Struct:

Defines a struct Person with fields Name and Age.
Struct tags (for example: json:"name") specify the JSON key names.

handler Function:

Creates a Person instance.
Calls json.Marshal to encode the person struct into JSON. This returns a byte slice and an error.
If there's an error, it responds with a 500 Internal Server Error. Otherwise, it writes the byte slice to the response.
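To see what Marshal produces without starting a server, here is a minimal standalone sketch. The encodePerson helper is a name introduced here for illustration. It wraps json.Marshal and converts the returned byte slice to a string.

```go
package main

import (
    "encoding/json"
    "fmt"
)

type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

// encodePerson wraps json.Marshal and converts the resulting byte slice
// to a string for display.
func encodePerson(p Person) (string, error) {
    data, err := json.Marshal(p)
    if err != nil {
        return "", err
    }
    return string(data), nil
}

func main() {
    out, err := encodePerson(Person{Name: "John", Age: 30})
    if err != nil {
        fmt.Println("encode failed:", err)
        return
    }
    fmt.Println(out) // {"name":"John","age":30}
}
```

The keys in the output come from the struct tags, not from the Go field names.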

How to Use the NewEncoder Function
The NewEncoder function is used to encode JSON data to a writer, such as a file or network connection.
package main

import (
    "encoding/json"
    "net/http"
)

type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func handler(w http.ResponseWriter, r *http.Request) {
    person := Person{Name: "John", Age: 30}

    // Encoding - 2 step. NewEncoder and Encode
    encoder := json.NewEncoder(w)

    err := encoder.Encode(person)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

Code Explanation:
Inside the handler:

The handler function is an HTTP handler that handles incoming HTTP requests.
w http.ResponseWriter: Used to write the response.
r *http.Request: Represents the incoming request.
A Person instance named person was created and initialized with the values Name: "John" and Age: 30.
A JSON encoder was created using json.NewEncoder(w), which will write the JSON output to the response writer w.
The person struct was encoded to JSON and written to the response using encoder.Encode(person).
If an error occurs during encoding, it is sent back to the client as an HTTP error response with a status code 500 Internal Server Error.
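Because json.NewEncoder accepts any io.Writer, the same approach works outside an HTTP handler. The sketch below writes to an in-memory buffer instead of the response writer. The helper names writePeople and peopleToJSONLines are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Person mirrors the struct used in the handler example.
type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// writePeople is a hypothetical helper: it encodes each person to w.
// Encode writes one JSON value followed by a newline per call.
func writePeople(w io.Writer, people []Person) error {
	enc := json.NewEncoder(w)
	for _, p := range people {
		if err := enc.Encode(p); err != nil {
			return err
		}
	}
	return nil
}

// peopleToJSONLines is a hypothetical wrapper that collects the
// encoder output in memory so it can be inspected as a string.
func peopleToJSONLines(people []Person) (string, error) {
	var buf bytes.Buffer
	if err := writePeople(&buf, people); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := peopleToJSONLines([]Person{
		{Name: "John", Age: 30},
		{Name: "Jane", Age: 25},
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Print(out)
	// prints:
	// {"name":"John","age":30}
	// {"name":"Jane","age":25}
}
```

Swapping the buffer for a file or a network connection requires no change to writePeople, since all of them satisfy io.Writer.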

How to Parse JSON Requests (Decoding)
JSON decoding is the process of converting JSON data into Go data structures. 
Decoding refers to the process of converting data from a standardized format back into its original form. In computing and data transmission, decoding involves taking encoded data and transforming it into a format that can be easily understood and processed by a specific system or application.
Think of decoding like unpacking a suitcase after a trip. You take the packed suitcase (encoded data) and unpack it, putting each item (data) back to its original place, so that you can use it again.
In the case of JSON decoding, the text-based JSON data is converted back into its original form, such as a Go data structure (like a struct or slice), so that it can be easily accessed and processed by the application.
Some common reasons for decoding data include:

Data extraction: Retrieving specific data from a larger encoded dataset.
Data analysis: Converting encoded data into a format that can be easily analyzed or processed.
Data storage: Converting encoded data into a format that can be easily stored in a database or file system.
Data visualization: Converting encoded data into a format that can be easily visualized or displayed.

Decoding is essentially the reverse process of encoding, and it's an essential step in many data processing pipelines.
In Golang, we can use the encoding/json package to decode JSON data.
How to Use the Unmarshal Function to Parse JSON Requests
The Unmarshal function is the most commonly used method for decoding JSON data in Golang. It takes a JSON-encoded byte slice and a pointer to a Go value, stores the decoded data in that value, and returns an error.
package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Read the whole request body into a byte slice
    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // Decoding - One step
    var person Person
    if err := json.Unmarshal(body, &person); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // For the request body {"name":"John","age":30}:
    fmt.Println(person.Name) // Output: John
    fmt.Println(person.Age)  // Output: 30
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

Code Explanation:
Inside the handler:

The handler function is an HTTP handler that handles incoming HTTP requests.
w http.ResponseWriter: Used to write the response.
r *http.Request: Represents the incoming request.
io.ReadAll(r.Body): Reads the entire request body into a byte slice.
A variable person of type Person was declared.
json.Unmarshal(body, &person): This decodes the JSON bytes into the person struct.
If an error occurs while reading or decoding, it sends back an HTTP error response with a status code 400 Bad Request.
If decoding is successful, the person struct fields Name and Age are printed using fmt.Println.

How to Use the NewDecoder Function for JSON Decoding
The NewDecoder function decodes JSON data directly from a reader, such as a file or network connection, so you don't have to read the data into a byte slice yourself first.
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type Person struct {
    Name string `json:"name"`
    Age  int    `json:"age"`
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Decoding - 2 step. NewDecoder and Decode
    decoder := json.NewDecoder(r.Body)

    var person Person
    err := decoder.Decode(&person)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // For the request body {"name":"John","age":30}:
    fmt.Println(person.Name) // Output: John
    fmt.Println(person.Age)  // Output: 30
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

Code Explanation:
Inside the handler function:

The handler function is an HTTP handler that handles incoming HTTP requests.
w http.ResponseWriter: Used to write the response.
r *http.Request: Represents the incoming request.

Create a Decoder:

decoder := json.NewDecoder(r.Body): Creates a new JSON decoder that reads from the request body.

Declare a Person Variable:

var person Person: Declares a variable person of type Person.

Decode JSON into Person Struct:

err := decoder.Decode(&person): Decodes the JSON from the request body into the person struct.
If an error occurs during decoding, it sends an HTTP 400 error response with the status code 400 Bad Request and returns from the function.

Print the Decoded Values:

fmt.Println(person.Name): Prints the Name field of the person struct.
fmt.Println(person.Age): Prints the Age field of the person struct.
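A decoder can also read several JSON values in sequence from the same reader. The following is a minimal sketch, assuming the input holds one Person object after another. The helper names readPeople and parsePeople are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Person mirrors the struct used in the handler example.
type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// readPeople is a hypothetical helper: it decodes Person values from r
// one at a time until the reader is exhausted.
func readPeople(r io.Reader) ([]Person, error) {
	dec := json.NewDecoder(r)
	var people []Person
	for {
		var p Person
		err := dec.Decode(&p)
		if err == io.EOF {
			return people, nil // no more values
		}
		if err != nil {
			return nil, err
		}
		people = append(people, p)
	}
}

// parsePeople is a hypothetical wrapper that feeds a string to readPeople.
func parsePeople(s string) ([]Person, error) {
	return readPeople(strings.NewReader(s))
}

func main() {
	input := `{"name":"John","age":30}
{"name":"Jane","age":25}`

	people, err := parsePeople(input)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, p := range people {
		fmt.Println(p.Name, p.Age)
	}
	// prints:
	// John 30
	// Jane 25
}
```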

Custom JSON Marshaling and Unmarshaling
In some cases, the default JSON encoding and decoding behavior provided by json.Marshal and json.Unmarshal may not be sufficient. For instance, you may need to customize how certain fields are represented in JSON. This is where the json.Marshaler and json.Unmarshaler interfaces come in handy.
How to use JSON Marshaler
The json.Marshaler interface allows you to customize the JSON encoding of a type by implementing the MarshalJSON method. This method returns a JSON-encoded byte slice and an error.
func (p Person) MarshalJSON() ([]byte, error) {
    type Alias Person
    return json.Marshal(&struct {
        Alias
        Age string `json:"age"`
    }{
        Alias: (Alias)(p),
        Age:   strconv.Itoa(p.Age) + " years",
    })
}

In this example, the Age field is converted to a string with a " years" suffix when encoding to JSON.
How to use JSON Unmarshaler
The json.Unmarshaler interface allows you to customize the JSON decoding of a type by implementing the UnmarshalJSON method. This method takes a JSON-encoded byte slice and returns an error.
func (p *Person) UnmarshalJSON(data []byte) error {
    type Alias Person
    aux := &struct {
        Alias
        Age string `json:"age"`
    }{Alias: (Alias)(*p)}

    if err := json.Unmarshal(data, &aux); err != nil {
        return err
    }

    ageStr := strings.TrimSuffix(aux.Age, " years")
    age, err := strconv.Atoi(ageStr)
    if err != nil {
        return err
    }

    p.Age = age
    p.Name = aux.Name
    return nil
}

In this example, the Age field is converted from a string with a " years" suffix to an integer when decoding from JSON.
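To check that the two methods work together, here is a self-contained round-trip sketch that combines them into one program. The roundTrip helper is an assumption made for illustration. The methods themselves follow the same Alias pattern shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// MarshalJSON writes Age as a string with a " years" suffix.
// Alias has Person's fields but none of its methods, which avoids
// infinite recursion when json.Marshal is called inside this method.
func (p Person) MarshalJSON() ([]byte, error) {
	type Alias Person
	return json.Marshal(&struct {
		Alias
		Age string `json:"age"` // shadows the embedded int Age
	}{
		Alias: Alias(p),
		Age:   strconv.Itoa(p.Age) + " years",
	})
}

// UnmarshalJSON reverses MarshalJSON: it strips the suffix and parses the number.
func (p *Person) UnmarshalJSON(data []byte) error {
	type Alias Person
	aux := &struct {
		Alias
		Age string `json:"age"`
	}{}
	if err := json.Unmarshal(data, aux); err != nil {
		return err
	}
	age, err := strconv.Atoi(strings.TrimSuffix(aux.Age, " years"))
	if err != nil {
		return err
	}
	p.Name = aux.Name
	p.Age = age
	return nil
}

// roundTrip is a hypothetical helper: it encodes p, decodes the result,
// and returns both the JSON text and the decoded value.
func roundTrip(p Person) (string, Person, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", Person{}, err
	}
	var out Person
	if err := json.Unmarshal(b, &out); err != nil {
		return "", Person{}, err
	}
	return string(b), out, nil
}

func main() {
	encoded, decoded, err := roundTrip(Person{Name: "John", Age: 30})
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(encoded)                   // prints {"name":"John","age":"30 years"}
	fmt.Println(decoded.Name, decoded.Age) // prints John 30
}
```

The outer Age field takes precedence over the embedded one because encoding/json prefers the least nested field when two fields share a JSON key.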
Trade-offs
Here are the trade-offs for the most commonly used methods of encoding and decoding JSON described above:
json.Marshal and json.Unmarshal:
Pros:

Ease of Use: Straightforward for encoding (Marshal) and decoding (Unmarshal) JSON.
Flexibility: Can be used with various types including structs, maps, slices, and more.
Customization: Struct tags (json:"name") allow customization of JSON keys and other options.

Cons:

Performance: May not be the fastest method for very large or complex JSON structures.
Error Handling: Error messages can sometimes be less descriptive for deeply nested or complex data structures.

json.NewEncoder and json.NewDecoder:
Pros:

Stream-Based: Suitable for encoding/decoding JSON in a streaming manner, which can handle large data sets without consuming a lot of memory.
Flexibility: Can work directly with io.Reader and io.Writer interfaces, making them useful for network operations and large files.

Cons:

Complexity: Slightly more complex to use compared to json.Marshal and json.Unmarshal.
Error Handling: Similar to json.Marshal and json.Unmarshal, error messages can be less clear for complex structures.

Custom Marshaler and Unmarshaler Interfaces (json.Marshaler and json.Unmarshaler):
Pros:

Customization: Full control over how types are encoded/decoded. Useful for handling complex types or custom JSON structures.
Flexibility: Allows for implementing custom logic during marshaling and unmarshaling.

Cons:

Complexity: More complex to implement and use, as it requires writing custom methods.
Maintenance: Increases the maintenance burden since custom logic needs to be kept in sync with any changes in the struct or data format.

Use Cases and Recommendations

Simple Data Structures: Use json.Marshal and json.Unmarshal for straightforward encoding/decoding of simple data structures.
Large Data Streams: Use json.NewEncoder and json.NewDecoder for working with large data streams or when interacting with files or network operations.
Custom Requirements: Implement json.Marshaler and json.Unmarshaler interfaces when you need custom behavior for specific types.
Quick Operations: Use anonymous structs for quick, throwaway operations where defining a full struct type is unnecessary.

Each method has its own strengths and trade-offs, and the best choice depends on the specific requirements of your application.
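The struct tag options mentioned in the trade-offs go beyond renaming keys. As a brief sketch, the omitempty option drops a field when it holds its zero value, and the "-" tag excludes a field entirely. The Account type and the encodeAccount helper below are hypothetical names used only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Account is a hypothetical type showing three struct tag forms.
type Account struct {
	Name     string `json:"name"`            // renamed key
	Email    string `json:"email,omitempty"` // omitted when empty
	Password string `json:"-"`               // never encoded
}

// encodeAccount is a hypothetical helper that returns the JSON as a string.
func encodeAccount(a Account) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeAccount(Account{Name: "John", Password: "secret"})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s) // prints {"name":"John"}
}
```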
Conclusion
In conclusion, mastering JSON encoding and decoding is crucial for developing web applications in Golang. 
By understanding the different methods available in the encoding/json package, you can choose the most suitable approach based on your specific requirements.
The Marshal and Unmarshal functions offer simplicity and flexibility for general use, while NewEncoder and NewDecoder provide efficient streaming capabilities for large datasets. 
For scenarios that demand customized JSON representations, implementing the json.Marshaler and json.Unmarshaler interfaces gives you fine-grained control over the encoding and decoding processes. 
Each method has its own strengths and trade-offs, and knowing when and how to use them will enable you to handle JSON data effectively in your applications.
 


 How to Upload Large Files Efficiently with AWS S3 Multipart Upload 
Destiny Erhabor — Mon, 08 Jul 2024 12:02:56 +0000
 Imagine running a media streaming platform where users upload large high-definition videos. Uploading such large files can be slow and may fail if the network is unreliable. 
Using traditional single-part uploads can be cumbersome and inefficient for large files, often leading to timeout errors or the need to restart the entire upload process if any part fails. This is where the Amazon S3 multipart upload feature comes into play, offering a robust solution to these challenges.
In this article, you'll explore how to efficiently handle large files with Amazon S3 multipart upload. We'll discuss the benefits of using this feature, walk through the process of uploading files in parts, and provide code examples using the AWS SDK in a full-stack Node.js and React project. 
By the end of this article, you should have a good understanding of how to leverage the Amazon S3 multipart upload to optimize file uploads in your applications.
Prerequisites
Before we start, ensure you have the following:

An AWS account with IAM user credentials.
Node.js installed on your development machine.
Basic knowledge of JavaScript, React, and Node.js.

Table of Contents:

Introduction
Prerequisites
Table of Contents
How it works
Step 1: How to Set Up AWS S3
How to Create an S3 Bucket
How to Configure s3 Bucket Policy
Step 2: How to Set Up AWS S3 Backend with Node.js
How to Initialize a Node.js Project
Install Required Packages
Create Server file
Imports and configuration
Middleware and AWS Configuration
Routes
Start/Initialize Upload Endpoint
Upload Part Endpoint
Complete Upload Endpoint
Start the Server
Environment Variables
Running the Server
Step 3: How to Set Up the Frontend with React
How to Initialize a React Project
Install Required Packages
Create Components
App Component
Testing
Part Upload
Complete Part Upload
Full Code on GitHub
Conclusion

How It Works
A large file is divided into smaller parts (chunks), and each part is uploaded independently to Amazon S3. Once all the parts have been uploaded, S3 combines them to create the final object.
Example: Uploading a 100MB file in 5MB parts would result in 20 parts being uploaded to S3. Each part is uploaded with a unique identifier, and the order is maintained to ensure that the file can be reassembled correctly.
Retries can be configured to automatically retry failed parts, and the upload can be paused and resumed at any time. This makes the process more robust and fault-tolerant, especially for large files.
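The part arithmetic from the example above can be sketched in a few lines. This is only an illustration of how byte ranges map to part numbers, not code from the project in this article. The names partRange and splitIntoParts are hypothetical.

```go
package main

import "fmt"

// partRange describes one chunk: its 1-based part number and
// the byte offsets [Start, End) it covers in the file.
type partRange struct {
	Number     int
	Start, End int64
}

// splitIntoParts is a hypothetical helper that computes the byte range
// of every part for a file of fileSize bytes and a given partSize.
func splitIntoParts(fileSize, partSize int64) []partRange {
	if fileSize <= 0 || partSize <= 0 {
		return nil
	}
	var parts []partRange
	number := 1
	for start := int64(0); start < fileSize; start += partSize {
		end := start + partSize
		if end > fileSize {
			end = fileSize // the last part may be smaller
		}
		parts = append(parts, partRange{Number: number, Start: start, End: end})
		number++
	}
	return parts
}

func main() {
	const mb = 1024 * 1024

	parts := splitIntoParts(100*mb, 5*mb)
	fmt.Println(len(parts)) // prints 20

	last := parts[len(parts)-1]
	fmt.Println(last.Number, last.Start, last.End) // prints 20 99614720 104857600
}
```

Part numbers start at 1, and S3 requires every part except the last to be at least 5 MB, which is why 5 MB is a common chunk size.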

multipart AWS s3 uploads
Learn more on the Amazon S3 multipart upload docs.
Let's get started!
Step 1: How to Set Up AWS S3
How to Create an S3 Bucket
First, log in to the AWS Management Console.

Navigate to the S3 service.


How to create an s3 bucket
Create a new bucket and take note of the bucket name.
Uncheck the Block Public Access settings for simplicity. We'll configure access to the bucket with a bucket policy after creating it.

How to create an s3 bucket

Leave other settings as default and create the bucket.

How to Configure S3 Bucket Policy
Now that you have created the bucket, let's set up a policy that allows users to read your objects (files/videos) through their URLs.

Click on the bucket name and navigate to the Permissions tab.


How to configure s3 bucket policy
Navigate to the Bucket Policy section and click on Edit.
Input the following policy, and replace your-bucket-name with your actual bucket name:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}

Version: The version of the policy language. 2012-10-17 is the current version.
Statement: An array of one or more individual statements that define the policy.
Effect: The effect determines whether the statement allows or denies access.
Principal: The entity that the policy is applied to. In this case, we are allowing all principals. In production, you should specify the IAM user or role that needs access.
Action: The action that the policy allows or denies. In this case, we are allowing the s3:GetObject action, which allows users to retrieve objects from the bucket.
Resource: The Amazon Resource Name (ARN) of the bucket and objects that the policy applies to. In this case, we are allowing access to all objects in the bucket.
Click on Save changes to apply the policy.
Step 2: How to Set Up AWS S3 Backend with Node.js
Next, let's set up the backend server with AWS SDK to handle the file upload process.
How to Initialize a Node.js Project
Create a new directory for your project and initialize a new Node.js project:
mkdir s3-multipart-upload
cd s3-multipart-upload
npm init -y

Install Required Packages
Install the following packages using npm:
 npm install express cors dotenv multer aws-sdk

Create Server File
Create a new file named app.js (For simplicity, we are going to use this file only for all the upload logic) and add the following code:
Imports and Configurations
const cors = require("cors");
const express = require("express");
const AWS = require("aws-sdk");
const dotenv = require("dotenv");
const multer = require("multer");

const multerUpload = multer();
dotenv.config();

const app = express();
const port = 3001;

Imports
cors: Middleware for enabling Cross-Origin Resource Sharing (CORS). This is necessary to allow your frontend application to interact with a backend hosted on a different domain or port.
express: A minimal and flexible Node.js web application framework.
AWS: The AWS SDK for JavaScript, which allows you to interact with AWS services.
dotenv: A module that loads environment variables from a .env file into process.env.
multer: Middleware for handling multipart/form-data, which is primarily used for uploading files.
Configurations
multerUpload: Initializes multer for handling file uploads.
dotenv.config(): Loads the environment variables from a .env file.
app: Initializes an Express application.
port: Sets the port on which the Express application will run.
Middleware and AWS Configuration
Next, add the following code to configure middleware and AWS SDK:
app.use(cors());

AWS.config.update({
  accessKeyId: process.env.AWS_ACCESS_KEY,
  secretAccessKey: process.env.AWS_SECRET_KEY,
  region: process.env.AWS_REGION,
});

const s3 = new AWS.S3();
app.use(express.json({ limit: "50mb" }));
app.use(express.urlencoded({ limit: "50mb", extended: true }));

app.use(cors()): Enables CORS for all routes, allowing your frontend to communicate with the backend without issues related to cross-origin requests.
AWS.config.update({ ... }): Configures the AWS SDK with the access key, secret key, and region from the environment variables.
const s3 = new AWS.S3(): Creates an instance of the S3 service.
app.use(express.json({ limit: '50mb' })): Configures Express to parse JSON bodies with a size limit of 50MB.
app.use(express.urlencoded({ limit: '50mb', extended: true })): Configures Express to parse URL-encoded bodies with a size limit of 50MB.
Routes
It's time to start creating our routes. The routes required for the multipart upload process are as follows:

Initialization of the upload process.
Uploading parts of the file.
Completing the upload process.

Start/Initialize Upload Endpoint
This route puts the upload process in play. Add the following code to create an endpoint for initializing the multipart upload process:
app.post("/start-upload", async (req, res) => {
  const { fileName, fileType } = req.body;

  const params = {
    Bucket: process.env.S3_BUCKET,
    Key: fileName,
    ContentType: fileType,
  };

  try {
    const upload = await s3.createMultipartUpload(params).promise();
    res.send({ uploadId: upload.UploadId });
  } catch (error) {
    // Respond with an error status so the client can detect the failure
    res.status(500).send({ error: error.message });
  }
});

The function above creates a POST endpoint /start-upload that expects a JSON body with fileName and fileType properties. It then uses the createMultipartUpload method from the S3 service to initialize the multipart upload process. If successful, it returns the uploadId to the user, which will be used to upload parts of the file.
Upload Part Endpoint
This is the route where the different smaller parts of the large file upload are received and tagged. Add the following code to create an endpoint for uploading parts of the file:
app.post("/upload-part", multerUpload.single("fileChunk"), async (req, res) => {
  const { fileName, partNumber, uploadId, fileChunk } = req.body;

  const params = {
    Bucket: process.env.S3_BUCKET,
    Key: fileName,
    PartNumber: partNumber,
    UploadId: uploadId,
    Body: Buffer.from(fileChunk, "base64"),
  };

  try {
    const uploadParts = await s3.uploadPart(params).promise();
    console.log({ uploadParts });
    res.send({ ETag: uploadParts.ETag });
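  } catch (error) {
    // Respond with an error status so the client can detect the failure
    res.status(500).send({ error: error.message });
  }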
});

The function above creates a POST endpoint at /upload-part that expects a body with fileName, partNumber, uploadId, and fileChunk properties, where fileChunk is the chunk encoded as a base64 string. It uses the uploadPart method from the S3 service to upload the part of the file. If successful, it returns the ETag of the uploaded part to the client.
The ETag is a unique identifier for the upload part that will be used to complete the multipart upload.
Complete Upload Endpoint
Once all the parts have been uploaded, the final step is to combine them to create the final object.
Add the following code to create an endpoint for completing the multipart upload process:
app.post("/complete-upload", async (req, res) => {
  const { fileName, uploadId, parts } = req.body;

  const params = {
    Bucket: process.env.S3_BUCKET,
    Key: fileName,
    UploadId: uploadId,
    MultipartUpload: {
      Parts: parts,
    },
  };

  try {
    const complete = await s3.completeMultipartUpload(params).promise();
    console.log({ complete });
    res.send({ fileUrl: complete.Location });
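  } catch (error) {
    // Respond with an error status so the client can detect the failure
    res.status(500).send({ error: error.message });
  }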
});

The function above creates a POST endpoint at /complete-upload that expects a JSON body with uploadId, fileName, and parts properties. It uses the completeMultipartUpload method from the S3 service to combine the uploaded parts and create the final object. If successful, it returns an object containing fileUrl, the URL of the completed upload.
Start the Server
Finally, add the following code to start the Express server:
app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});

This code starts the Express server on port 3001 and logs a message to the console when the server is running.
Environment Variables
Create a new file named .env in the root directory of your project and add the following environment variables:
AWS_ACCESS_KEY=your-access-key
AWS_SECRET_KEY=your-secret-key
AWS_REGION=your-region
S3_BUCKET=your-bucket-name

Replace your-access-key, your-secret-key, your-region, and your-bucket-name with your actual AWS credentials and bucket name.
Running the Server
To run the server, execute the following command in your terminal:
node app.js

This will start the server on port 3001.
Step 3: How to Set Up the Frontend with React
Now that the backend is set up, let's create a React frontend to interact with the server and upload files to S3 using the multipart upload process.
The frontend will be in charge of splitting the file into parts, uploading each part to the server, and completing the upload process.
How to Initialize a React Project
Create a new React project using Create React App:
npx create-react-app s3-multipart-upload-frontend
cd s3-multipart-upload-frontend

Install Required Packages
Install the following packages using npm:
  npm install axios

Create Components
Create a new file named FileUpload.js in the src/components directory and add the following code:
import React, { useState } from "react";
import axios from "axios";

const CHUNK_SIZE = 5 * 1024 * 1024; // 5MB

const FileUpload = () => {
  const [file, setFile] = useState(null);
  const [fileUrl, setFileUrl] = useState("");

  const handleFileChange = (e) => {
    setFile(e.target.files[0]);
  };

  const handleFileUpload = async () => {
    const fileName = file.name;
    const fileType = file.type;
    let uploadId = "";
    let parts = [];

    try {
      // Start the multipart upload
      const startUploadResponse = await axios.post(
        "http://localhost:3001/start-upload",
        {
          fileName,
          fileType,
        }
      );

      uploadId = startUploadResponse.data.uploadId;

      // Split the file into chunks and upload each part
      const totalParts = Math.ceil(file.size / CHUNK_SIZE);

      console.log(totalParts);

      for (let partNumber = 1; partNumber <= totalParts; partNumber++) {
        const start = (partNumber - 1) * CHUNK_SIZE;
        const end = Math.min(start + CHUNK_SIZE, file.size);
        const fileChunk = file.slice(start, end);

        const reader = new FileReader();
        reader.readAsArrayBuffer(fileChunk);

        const uploadPart = () => {
          return new Promise((resolve, reject) => {
            reader.onload = async () => {
              const fileChunkBase64 = btoa(
                new Uint8Array(reader.result).reduce(
                  (data, byte) => data + String.fromCharCode(byte),
                  ""
                )
              );

              const uploadPartResponse = await axios.post(
                "http://localhost:3001/upload-part",
                {
                  fileName,
                  partNumber,
                  uploadId,
                  fileChunk: fileChunkBase64,
                }
              );

              parts.push({
                ETag: uploadPartResponse.data.ETag,
                PartNumber: partNumber,
              });
              resolve();
            };
            reader.onerror = reject;
          });
        };

        await uploadPart();
      }

      // Complete the multipart upload
      const completeUploadResponse = await axios.post(
        "http://localhost:3001/complete-upload",
        {
          fileName,
          uploadId,
          parts,
        }
      );

      setFileUrl(completeUploadResponse.data.fileUrl);
      alert("File uploaded successfully");
    } catch (error) {
      console.error("Error uploading file:", error);
    }
  };

  return (
    <div>
      <input type="file" onChange={handleFileChange} />
      <button disabled={!file} onClick={handleFileUpload}>
        Upload
      </button>
      <hr />
      <br />
      <br />
      {fileUrl && (
        <a href={fileUrl} target="_blank" rel="noopener noreferrer">
          View Uploaded File
        </a>
      )}
    </div>
  );
};

export default FileUpload;

The FileUpload component above handles the file upload process using the multipart upload method. It splits the file into chunks, uploads each part to the server, and completes the upload process.
The component consists of the following key parts:
CHUNK_SIZE: The size of each part in bytes. In this case, we are using 5MB parts.
handleFileChange: A function that sets the selected file in the state.
handleFileUpload: A function that initiates the multipart upload process by sending the file to the server in parts.

It starts the upload process by calling the /start-upload endpoint and retrieves the uploadId.
It splits the file into chunks and uploads each part to the server using the /upload-part endpoint.
It completes the upload process by calling the /complete-upload endpoint with the uploadId and parts array.

fileUrl: A state variable that stores the URL of the uploaded file.
The component renders an input field for selecting a file, a button to upload the file, and a link to view the uploaded file.
App Component
Update the App.js file in the src directory with the following code:

import React from "react";

import FileUpload from "./components/FileUpload";

function App() {
  return (
    <div className="App">
      <h1>Large File Upload with S3 Multipart Upload</h1>
      <FileUpload />
    </div>
  );
}


export default App;

The App component renders the FileUpload component, which handles the file upload process.
How to Start the Frontend
To run the frontend, execute the following command in your terminal:
npm start

This will start the React development server on port 3000 and open the application in your default web browser.
Testing
Let's test the application by uploading a large file using the frontend. By inspecting your browser's network tab, you should see the file being uploaded in parts and then combined to create the final object.
Part Upload
In the image below, the start-upload endpoint is called to initialize the upload process. The large file is then broken into chunks, and each chunk is uploaded with the upload-part endpoint. The number of upload-part requests you see depends on the file size and the chunk size.
Each uploaded part has a unique identifier, its ETag, which is used to complete the upload.

Image uploading in parts
Complete Part Upload
The last and final step of the process is the complete-upload endpoint where the upload parts are combined to form a single object for the file uploaded.

Image uploads completed
You can click on the View Uploaded File to access your uploaded file.
Full Code on GitHub
Click the link below to access the full code on GitHub:
Multipart file uploads with react and NodeJS
Conclusion
In this article, we explored how to efficiently handle large files with Amazon S3 multipart upload. We discussed the benefits of using this feature, walked through the process of uploading files in parts, and provided code examples using Node.js and React. 
This is a high-level implementation of the multipart upload process. You can enhance it further by adding features like progress tracking, more robust error handling, and resumable uploads.
By leveraging Amazon S3 multipart upload, you can optimize file uploads in your applications by dividing large files into smaller parts, uploading them independently, and combining them to create the final object. This approach not only enhances upload performance but also adds fault tolerance and flexibility to pause and resume uploads, making it ideal for handling large files over unstable networks.
 


 How to Handle Concurrency with Goroutines and Channels in Go 
Destiny Erhabor — Fri, 10 May 2024 15:07:54 +0000
 Concurrency is the ability of a program to perform multiple tasks simultaneously. It is a crucial aspect of building scalable and responsive systems. 
Go's concurrency model is based on the concept of goroutines, lightweight threads that can run multiple functions concurrently, and channels, a built-in communication mechanism for safe and efficient data exchange between goroutines.
Go's concurrency features enable developers to write programs that can:

Handle multiple requests simultaneously, improving responsiveness and throughput.
Utilize multi-core processors efficiently, maximizing system resources.
Write concurrent code that is safe, efficient, and easy to maintain.

Go's concurrency model is designed to minimize overhead, reduce latency, and help you avoid common concurrency errors like race conditions and deadlocks. 
With Go, developers can build high-performance, scalable, and concurrent systems with ease, making it an ideal choice for building modern distributed systems, networks, and cloud infrastructure.
Table of Contents

Case study: A Bank Teller
Sequential Processing
Concurrency
What are Goroutines and Channels?
What is a Goroutine?
How to Implement a Goroutine
How Does a Goroutine Work?
What are waitGroups?
What are Channels?
How to Write Data to a Channel
How to Read Data from a Channel
How to Implement Channels with Goroutine
What are Channel Buffers?
What is an Unbuffered Channel?
How to Create a Buffered Channel
What are Channel Directions?
How to Handle Multiple Communication Operations with Channel Select
How to Timeout Long Running Processes in a Channel
How to Close a Channel
How to Iterate Over Channel Messages
Conclusion

Let's consider a scenario to illustrate concurrency:
Case Study: A Bank Teller
Imagine a busy bank with two tellers, Maria and David. Customers arrive at the bank to conduct various transactions like deposits, withdrawals, and transfers. The goal is to serve customers quickly and efficiently.
Sequential Processing (No Concurrency)
Maria and David work sequentially, one at a time. When a customer arrives, Maria helps the customer, and David waits until Maria is finished before helping the next customer. This leads to a long wait time for customers.
Concurrency
Maria and David work concurrently, serving customers simultaneously. When a customer arrives, Maria helps the customer with a transaction, and David simultaneously helps another customer with a different transaction. They work together, sharing resources like the bank's database and cash supplies, to serve multiple customers at the same time.
In this scenario, concurrency enables Maria and David to work together efficiently, serving multiple customers simultaneously, and improving the overall customer experience. This same concept applies to computer programming, where concurrency enables multiple tasks to run simultaneously, improving responsiveness, efficiency, and performance.
What are Goroutines and Channels?
A goroutine is a function that runs concurrently with other code and is scheduled by the Go runtime as a lightweight thread. It helps address concurrency and async flow requirements.
Goroutines allow you to start up and run other threads of execution concurrently within your program.
Channels are used to communicate between goroutines. A channel is a typed conduit through which you can send and receive values with the channel operator: <-.
How to Implement a Goroutine
To start a goroutine, put the go keyword in front of a function call.
package main

import (
  "fmt"
  "math/rand"
  "time"
)

func pause() {
  time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
}

func sendMsg(msg string) {
  pause()
  fmt.Println(msg)
}

func main() {
  sendMsg("hello") // sync

  go sendMsg("test1") // async
  go sendMsg("test2") // async
  go sendMsg("test3") // async

  sendMsg("main") // sync

  time.Sleep(2 * time.Second)
}

From the example above,

The sendMsg function is called both synchronously and asynchronously.
It runs synchronously when it is called without the go keyword.
It runs asynchronously when it is called with the go keyword.

How Does a Goroutine Work?
When sendMsg is called with the go keyword, the call returns immediately: the main function does not wait for sendMsg to finish before moving on to the next line of code.
Without the go keyword, the function is called synchronously, and the main function waits for sendMsg to finish before it continues to the next line of code.
The output order when you run the example above can differ from the order of the calls in the code. The three goroutines run concurrently, and because each one pauses for a random period of time, they print in whatever order they wake up.
The time.Sleep(2 * time.Second) call is a quick and simple way to keep the main function running for 2 seconds so the goroutines can finish. Without it, the program would exit as soon as main returns, and any goroutine that had not finished yet would be stopped before printing its message.
What are WaitGroups?
Unlike the time.Sleep(2 * time.Second) call used in the example above, a WaitGroup is the standard way to wait for a collection of goroutines to finish executing. It is a simple way to synchronize multiple goroutines.
A goroutine can also be started from an anonymous function, as the first goroutine in the example below shows.
package main

import (
  "fmt"
  "sync"
  "time"
  "math/rand"
)

func pause() {
  time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
}

func sendMsg(msg string, wg *sync.WaitGroup) {
  defer wg.Done()
  pause()
  fmt.Println(msg)
}

func main() {
  var wg sync.WaitGroup

  wg.Add(3)

  go func(msg string) {
    defer wg.Done()
    pause()
    fmt.Println(msg)
  }("test1")


  go sendMsg("test2", &wg)
  go sendMsg("test3", &wg)

  wg.Wait()
}

From the example above, the sync.WaitGroup is used to wait for the three goroutines to finish executing before the main function exits. It synchronizes the three goroutines with the main function.

The sync.WaitGroup (wg) keeps track of the number of goroutines that are still running.
The sync.WaitGroup.Add (wg.Add) method increases that counter by the number passed as its argument, here 3.
The sync.WaitGroup.Done (wg.Done) method decrements the counter by one when a goroutine finishes.
The sync.WaitGroup.Wait (wg.Wait) method blocks until the counter reaches zero, that is, until all the goroutines have finished executing.
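As a further illustration of the Add, Done, and Wait steps described above, here is a minimal sketch that starts one goroutine per value and waits for all of them. The squares function and its results slice are hypothetical names chosen for this example, and calling wg.Add(1) inside the loop is just one common way to register goroutines when the count is not fixed up front.

```go
package main

import (
	"fmt"
	"sync"
)

// squares computes i*i for every i in [0, n), using one goroutine per value.
// The name and signature are illustrative, not part of any library.
func squares(n int) []int {
	results := make([]int, n)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1) // register the goroutine before starting it
		go func(i int) {
			defer wg.Done()    // mark this goroutine as finished
			results[i] = i * i // each goroutine writes only to its own index
		}(i)
	}

	wg.Wait() // block until every goroutine has called Done
	return results
}

func main() {
	fmt.Println(squares(5)) // [0 1 4 9 16]
}
```

Because every goroutine writes to a different slice index, no extra locking is needed in this sketch.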

What are Channels?
Channels are used to communicate between goroutines. A channel is a typed conduit through which you can send and receive messages with the channel operator, <-.
In their simplest form, one goroutine writes messages into the channel and another goroutine reads the same messages out of the channel.
Channels are created using the built-in make function and the chan keyword together with an element type. A channel can only transfer messages of the type it was declared with.
Example:
package main

func main() {
    msgChan := make(chan string)
    _ = msgChan // placeholder use: Go refuses to compile unused local variables
}

The example above creates a channel msgChan of type string.
How to Write Data to a Channel
To write data to a channel, first specify the name (msgChan) of the channel, followed by the <- operator and the message. This is considered the Sender.
msgChan <- "hello world"

How to Read Data from a Channel
To read data from a channel, simply move the operator (<-) in front of the channel name (msgChan). You can assign the result to a variable. This is considered the Receiver.
msg := <- msgChan

How to Implement Channels with Goroutine
package main

import (
  "fmt"
  "math/rand"
  "time"
)

func main() {

  msgChan := make(chan string)

  go func() {
    time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
    msgChan <- "hello" // Write data to the channel
    msgChan <- "world" // Write data to the channel
  }()

  msg1 := <- msgChan
  msg2 := <- msgChan

  fmt.Println(msg1, msg2)
}

The example above shows how to write and read data from a channel. The msgChan channel is created and the go keyword is used to create a goroutine that writes data to the channel. The msg1 and msg2 variables are used to read data from the channel.
Channels behave as a first-in-first-out queue. So, when one goroutine writes data to the channel, the other goroutine reads the data from the channel in the same order it was written.
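The first-in-first-out behavior can be checked with a small sketch. The relay function below is a hypothetical helper written for this illustration: one goroutine sends a list of values, and the caller reads them back in the same order.

```go
package main

import "fmt"

// relay sends values through an unbuffered channel from one goroutine
// and collects them in the calling goroutine.
// The function name is illustrative only.
func relay(values []string) []string {
	ch := make(chan string)

	go func() {
		for _, v := range values {
			ch <- v // each send waits for the matching receive below
		}
	}()

	received := make([]string, 0, len(values))
	for range values {
		received = append(received, <-ch)
	}
	return received
}

func main() {
	fmt.Println(relay([]string{"first", "second", "third"})) // [first second third]
}
```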
What are Channel Buffers?
Channels can be buffered or unbuffered. The previous examples all use unbuffered channels.
What is an Unbuffered Channel?
An unbuffered channel has no storage: a send blocks until another goroutine is ready to receive the message, and a receive blocks until a message is sent.
What is a Buffered Channel?
A buffered channel allows the sender to send messages into the channel without blocking until the buffer is full. The sender blocks only once the buffer has filled up, and it stays blocked until another goroutine reads from the channel and frees up space.
How to Create a Buffered Channel
When creating a buffered channel, use the make function and specify a second parameter to indicate the buffer size.
msgBufChan := make(chan string, 2)

The example above creates a buffered channel msgBufChan of type string with a buffer size of 2. This means that the channel can hold up to two messages before it blocks.
package main

import (
  "time"
)

func main() {
  size := 3
  msgBufChan := make(chan int, size)

  // reader (receiver)
  go func() {
    for {
      _ = <- msgBufChan
      time.Sleep(time.Second)
    }
  }()

  //writer (sender)
  writer := func() {
    for i := 0; i <= 10; i++ {
      msgBufChan <- i
      println(i)
    }
  }

  writer()
}

The example above creates a buffered channel msgBufChan of type int with a buffer size of 3.

The writer function writes data to the channel and the reader goroutine reads data from the channel.
When the program runs, you will typically see the numbers 0 through 3 printed almost immediately, while the remaining numbers 4 through 10 are printed slowly, about one per second (because of the time.Sleep(time.Second) call in the reader).
This shows the effect of a buffered channel: the writer only blocks once the buffer already holds as many messages as its size allows.
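To see how full a buffer is, the built-in len and cap functions can be applied to a channel. The sketch below uses a hypothetical helper, bufferState, and assumes the number of sends never exceeds the buffer size (otherwise the extra send would block forever, since nothing is receiving).

```go
package main

import "fmt"

// bufferState creates a buffered channel of the given size, sends n values
// into it with no receiver, and reports how many values are queued (len)
// and how many the buffer can hold (cap).
// Assumption: n <= size, so none of the sends block.
func bufferState(size, n int) (length, capacity int) {
	ch := make(chan int, size)
	for i := 0; i < n; i++ {
		ch <- i // does not block while the buffer still has room
	}
	return len(ch), cap(ch)
}

func main() {
	length, capacity := bufferState(3, 2)
	fmt.Println(length, capacity) // 2 3
}
```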

What are Channel Directions?
When using channels as function parameters, by default, you can send and receive messages within the function. To provide additional safety at compile time, channel function parameters can be defined with a direction. That is, they can be defined to be read-only or write-only.
Example:
package main

import (
  "fmt"
  "time"
)

func writer(channel chan<- string, msg string) {
  channel <- msg
}

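func reader(channel <-chan string) {
  // Keep receiving for as long as the program runs. Reading only one
  // message would leave the writer blocked once the buffer fills up,
  // and the program would stop with a deadlock error.
  for msg := range channel {
    fmt.Println(msg)
  }
}

func main() {
  msgChan := make(chan string, 1)

  go reader(msgChan)


  for i := 0; i < 10; i++ {
    writer(msgChan, fmt.Sprintf("msg %d", i))
  }

  // Crude wait so the reader can print the last messages before main exits.
  time.Sleep(time.Second * 5)
}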

The example above shows how to define a channel with a direction.

The writer function is defined with a write-only (send-only) channel, chan<- string.
The reader function is defined with a read-only (receive-only) channel, <-chan string.

The msgChan channel is created with a buffer size of 1. The writer function writes data to the channel and the reader function reads data from the channel.
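Directional channel types are often combined into small pipelines. The sketch below is one possible arrangement, with hypothetical function names (produce, double, doubledSum). It also relies on close and range, which are covered later in this article.

```go
package main

import "fmt"

// produce returns a receive-only channel that yields 1..n and is then closed.
func produce(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i
		}
	}()
	return out
}

// double may only receive from in and only send to out;
// the compiler rejects any attempt to do the opposite.
func double(in <-chan int, out chan<- int) {
	defer close(out)
	for v := range in {
		out <- v * 2
	}
}

// doubledSum wires the two stages together and adds up the results.
func doubledSum(n int) int {
	doubled := make(chan int)
	go double(produce(n), doubled)

	total := 0
	for v := range doubled {
		total += v
	}
	return total
}

func main() {
	fmt.Println(doubledSum(3)) // 12
}
```

A bidirectional channel converts implicitly to either directional type, which is why doubled can be passed to double without a cast.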
How to Handle Multiple Communication Operations with Channel Select
The select statement lets a goroutine wait on multiple communication operations. A select blocks until one of its cases can run, then it executes that case. It chooses one at random if multiple are ready.
The select and case statements are used to simplify the management and readability of wait across multiple channels.
Example:
package main

import (
  "fmt"
  "time"
  "math/rand"
)

func pause() {
  time.Sleep(time.Duration(rand.Intn(1000)) * time.Millisecond)
}

func test1(c chan<- string) {
  for {
    pause()
    c <- "hello"
  }
}

func test2(c chan<- string) {
  for {
    pause()
    c <- "world"
  }
}

func main() {
  rand.Seed(time.Now().Unix())

  c1 := make(chan string)
  c2 := make(chan string)

  go test1(c1)
  go test2(c2)

  for {
    select {
    case msg1 := <- c1:
      fmt.Println(msg1)
    case msg2 := <- c2:
      fmt.Println(msg2)
    }
  }
}

The example above shows how to use the select statement to wait on multiple channels. The test1 and test2 functions write data to the c1 and c2 channels respectively. The main function reads data from the c1 and c2 channels using the select statement.
The select statement will block until one of the channels is ready to send or receive data. If both channels are ready, the select statement will choose one at random.
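The same pattern can be wrapped in a function that stops after a fixed number of messages instead of looping forever. The merge helper below is a hypothetical sketch: it takes whichever channel is ready, so the order of the result is not guaranteed when both have data.

```go
package main

import "fmt"

// merge receives n messages in total from a and b, taking whichever
// channel is ready first. The function name is illustrative only.
func merge(a, b <-chan string, n int) []string {
	out := make([]string, 0, n)
	for len(out) < n {
		select {
		case v := <-a:
			out = append(out, v)
		case v := <-b:
			out = append(out, v)
		}
	}
	return out
}

func main() {
	a := make(chan string, 2)
	b := make(chan string, 2)
	a <- "a1"
	a <- "a2"
	b <- "b1"
	b <- "b2"

	// All four messages arrive, but their interleaving can vary per run.
	fmt.Println(len(merge(a, b, 4))) // 4
}
```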
How to Timeout Long Running Processes in a Channel
The time.After function is used to create a channel that sends a message after a specified duration. This can be used to implement a timeout for a channel.
It can be specified in a select statement to help manage situations where it's taking too long to receive a message from any of the channels being monitored.
Also consider using a timeout when working with external resources: you can never guarantee the response time, so you may need to proactively take action after a predetermined time has passed.
Implementing a timeout with a select statement is very straightforward.
Example:
package main

import (
  "fmt"
  "time"
)

func main() {
  c1 := make(chan string)

  go func(channel chan string) {
    time.Sleep(1 * time.Second)
    channel <- "hello world"
  }(c1)

  select {
  case msg2 := <-c1:
    fmt.Println(msg2)
  case <-time.After(2 * time.Second): // Timeout after 2 seconds
    fmt.Println("timeout")
  }
}


The example above shows how to use the time.After function to create a channel that sends a message after a specified duration.
The main function reads data from the c1 channel using the select statement.
The select statement will block until one of the channels is ready to send or receive data.
If the c1 channel is ready, the main function will print the message.
If the c1 channel is not ready after 2 seconds, the main function will print a timeout message.
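The timeout pattern above can be packaged as a reusable helper. The recvWithTimeout function below is a hypothetical name for this sketch: it returns the message and true if one arrives in time, or an empty string and false if the timer fires first.

```go
package main

import (
	"fmt"
	"time"
)

// recvWithTimeout waits up to d for a message on ch.
// The boolean result reports whether a message was received.
func recvWithTimeout(ch <-chan string, d time.Duration) (string, bool) {
	select {
	case msg := <-ch:
		return msg, true
	case <-time.After(d):
		return "", false
	}
}

func main() {
	ready := make(chan string, 1)
	ready <- "hello world"
	msg, ok := recvWithTimeout(ready, time.Second)
	fmt.Printf("%q %v\n", msg, ok) // "hello world" true

	silent := make(chan string) // nothing is ever sent on this channel
	msg, ok = recvWithTimeout(silent, 50*time.Millisecond)
	fmt.Printf("%q %v\n", msg, ok) // "" false
}
```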

How to Close a Channel
Closing a channel indicates that no more values will be sent on it. It signals to the receiver that it can stop waiting for new values.
Go channels are never closed automatically. The sender should close a channel explicitly when receivers need to know that all the values have been sent, which helps with synchronization.
Closing a channel is done by invoking the built-in close function.
close(channel)

Example:
package main

import (
  "fmt"
  "bytes"
)

func process(work <-chan string, fin chan<- string) {
  var b bytes.Buffer
  for {
    // The second value is false once the channel is closed and drained.
    if msg, notClosed := <-work; notClosed {
      fmt.Printf("%s received...\n", msg)
      b.WriteString(msg) // accumulate the letters so there is a result to send back
    } else {
      fmt.Println("Channel closed")
      fin <- b.String()
      return
    }
  }
}

func main() {
  work := make(chan string, 3)
  fin := make(chan string)

  go process(work, fin)

  word := "hello world"

  for i := 0; i < len(word); i++ {
    letter := string(word[i])
    work <- letter
    fmt.Printf("%s sent ...\n", letter)
  }

  close(work)

  fmt.Printf("result: %s\n", <-fin)
}

The example above shows how to close a channel. The work channel is created with a buffer size of 3. The process function reads data from the work channel and writes data to the fin channel. The main function writes data to the work channel and closes the work channel. The process function will print the message if the work channel is not closed. If the work channel is closed, the process function will print a message and write the data to the fin channel.
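One detail worth seeing in isolation is what a receive returns after close. In the sketch below (the read type and readN helper are hypothetical names), values that were already buffered are still delivered, and only then does the receive report the zero value with false.

```go
package main

import "fmt"

// read records one receive: the value and whether the channel was still open.
type read struct {
	value int
	open  bool
}

// readN performs n receives on ch using the two-value receive form.
func readN(ch <-chan int, n int) []read {
	reads := make([]read, 0, n)
	for i := 0; i < n; i++ {
		v, ok := <-ch
		reads = append(reads, read{v, ok})
	}
	return reads
}

func main() {
	ch := make(chan int, 2)
	ch <- 1
	ch <- 2
	close(ch)

	// Buffered values survive the close; the third receive gets 0 and false.
	fmt.Println(readN(ch, 3)) // [{1 true} {2 true} {0 false}]
}
```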
How to Iterate Over Channel Messages
Channels can be iterated over by using the range keyword, similar to arrays, slices, and maps. This allows you to quickly and easily iterate over the messages within a channel.
Example:
package main

import (
  "fmt"
)

func main() {
  c := make(chan string, 3)

  go func() {
    c <- "hello"
    c <- "world"
    c <- "goroutine"
    close(c) // Close the channel once all values are sent; otherwise the range loop below never ends and the program deadlocks
  }()

  for msg := range c {
    fmt.Println(msg)
  }
}

The example above shows how to iterate over a channel using the range keyword. The c channel is created with a buffer size of 3. The go keyword is used to create a goroutine that writes data to the c channel. The main function iterates over the c channel using the range keyword and prints the message.
Conclusion
In this article, we learned how to handle concurrency with goroutines and channels in Go. We learned how to create goroutines, and how to use WaitGroups and channels to communicate between goroutines. 
We also learned how to use channel buffers, channel directions, channel select, channel timeout, channel closing, and channel range. 
Goroutines and channels are powerful features in Go that help address concurrency and async flow requirements.
As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on LinkedIn or Twitter.
 


 What is Amazon EC2 Auto Scaling? 
Destiny Erhabor — Mon, 06 May 2024 16:32:47 +0000
 Auto scaling is like having a smart system that keeps an eye on how many people are visiting your website. When you have a lot of people, it quickly adds more servers to handle the extra traffic. And when things quiet down, it scales back to save you money.
In AWS, there are two important services that help with this: Amazon EC2 Auto Scaling and AWS Auto Scaling. Amazon EC2 Auto Scaling is specifically for managing your EC2 servers, while AWS Auto Scaling can also handle other things like DynamoDB tables and Amazon Aurora databases.
In this article, we'll dive deeper into how Amazon EC2 Auto Scaling works and how you can use it to keep your website running smoothly without overspending on servers.
Prerequisites

Have an AWS account
Basic understanding of EC2 instance

Table of Contents

Prerequisites
Example Use case
Advantages of Amazon EC2 Auto Scaling
Components of EC2 Auto Scaling
Launch Configurations vs Launch Templates
How to create a launch template
What are Auto Scaling Groups (ASGs)
How to create an Auto Scaling Group
What are Scaling Policies
Conclusion

Example Use Case
Scenario:
Imagine running a website that sells trendy clothes. Sometimes, lots of people visit your site at once, especially during lunch breaks or evenings. Other times, it's pretty quiet.
Problem:
You need enough servers to handle busy times, but you don't want to waste money on too many servers when it's quiet.
Solution with Amazon EC2 Auto Scaling:
Traffic Analysis: Look at when people visit your site the most. This helps you understand when you need more servers.
Set Rules: Decide when to add or remove servers automatically. For example, you might say, "If more than 70% of our servers are busy for more than 5 minutes, add one more server."
Adjust Server Numbers: Tell Amazon the smallest and biggest number of servers you need. You can also say how many you'd like on average. For instance, you might say, "Keep at least 2 servers running all the time. But if it's busy, go up to 10 servers. And usually, we need around 4."
Load Balancing: Make sure all servers get some work. Use a load balancer to send visitors to the least busy server. This keeps everything running smoothly even if you have many servers.
Test and Watch: Before trusting everything, test to see if it works as planned. Keep an eye on it afterward to make sure it's doing its job right.
Save Money: With auto scaling, you don't pay for servers you're not using. When traffic is low, it reduces the number of servers, saving you money. When traffic picks up, it adds more servers, so your site stays fast.
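The rules in this use case can be modeled in a few lines of code. The sketch below is purely illustrative: the Group type and nextDesired function are hypothetical and are not part of any AWS SDK. It also simplifies the rule by ignoring the "for more than 5 minutes" condition and looking at a single utilization reading.

```go
package main

import "fmt"

// Group is a hypothetical, simplified model of an Auto Scaling group's
// size settings. It is not an AWS SDK type.
type Group struct {
	Min, Max, Desired int
}

// nextDesired applies the example rule "above 70% busy, add one server"
// and then keeps the result inside the [Min, Max] range.
func nextDesired(g Group, busyPercent float64) int {
	desired := g.Desired
	if busyPercent > 70 {
		desired++
	}
	if desired < g.Min {
		desired = g.Min
	}
	if desired > g.Max {
		desired = g.Max
	}
	return desired
}

func main() {
	g := Group{Min: 2, Max: 10, Desired: 4}
	fmt.Println(nextDesired(g, 85)) // 5
	fmt.Println(nextDesired(g, 40)) // 4

	atLimit := Group{Min: 2, Max: 10, Desired: 10}
	fmt.Println(nextDesired(atLimit, 85)) // 10
}
```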
Advantages of Using Amazon EC2 Auto Scaling
Cost Optimization: EC2 Auto Scaling helps optimize costs by automatically adjusting the number of EC2 instances based on demand. During periods of low traffic, it reduces the number of instances, saving on operational costs. Conversely, during high traffic, it scales up to ensure optimal performance without over-provisioning resources.
Improved Availability: EC2 Auto Scaling improves the availability and fault tolerance of your application by keeping multiple instances running to share incoming traffic. If any instance fails or becomes unhealthy, the Auto Scaling group replaces it with a new one, ensuring minimal disruption to your services.
Scalability: EC2 Auto Scaling allows your application to handle sudden spikes in traffic or increased workload without manual intervention. 
Enhanced Performance: With EC2 Auto Scaling, you can maintain consistent performance levels even during peak usage periods. By automatically adding more instances when traffic increases, it prevents performance degradation and ensures a smooth user experience.
Ease of Management: EC2 Auto Scaling simplifies the management of your EC2 fleet by automating instance provisioning, scaling, and monitoring.
Integration with AWS Services: EC2 Auto Scaling integrates seamlessly with other AWS services such as Elastic Load Balancing (ELB) and Amazon CloudWatch.
Highly Customizable: EC2 Auto Scaling offers flexibility and customization options to meet the specific needs of your application.
Components of EC2 Auto Scaling
Let's get a better understanding of how Auto Scaling works through its different components.
There are two distinct steps to configuration. The first step is the creation of a launch configuration or launch template. The second is the creation of an Auto Scaling group.
Launch Configurations and Launch Templates
Launch configurations or launch templates define the configuration settings for the EC2 instances that will be launched by the Auto Scaling group. 
These settings include the AMI (Amazon Machine Image), instance type, security groups, key pair, and user data. 
Launch configurations are older and being phased out in favor of launch templates, which offer more features and flexibility.
How to Create a Launch Template
First, navigate to the EC2 Instances page.

AWS instance page
Select Launch Templates under the Instances section and click the create button.

AWS launch templates
The following screen should show up. It is similar to the one for launching an EC2 instance. Fill in the required information accordingly.

Create AWS launch templates
After configuration, click the "Create launch template" button and allow it to create, then view your newly created launch template, whose default and latest versions are both 1. You can use this launch template to create another launch template and specify a different version for it.

View AWS launch templates
Auto scaling requires either a launch template or launch configuration to identify the instance it's launching and its configurations.
What are Auto Scaling Groups (ASGs)
Auto Scaling groups are the core component of EC2 Auto Scaling. They define the group of EC2 instances that are managed together and share the same scaling policies. ASGs ensure that your application can automatically scale out (add instances) or scale in (remove instances) based on demand.
How to create an Auto Scaling Group
First, navigate to the EC2 Instances page. Then, under Auto Scaling Groups, click the create button.

creating an Auto Scaling group
On the create screen, the first step is to give your ASG a Name and then select your launch template created from the steps above. 

creating a launch template
The next step requires you to select or override an instance launch template. You also select a VPC and subnet.

selecting instance launch template
The next step is to configure advanced options such as adding a load balancer and monitoring. You can attach or add a new load balancer but for this article we will skip this part.

configuring advanced options
Next, configure the group size and scaling. Here, we want the group to scale between a minimum of 2 and a maximum of 5 instances. Also, set the metric type to track CPU utilization for scaling (the target is set to 50 here; you can increase it to 70 or more).

configuring group size and scaling
The next two steps are for adding notifications (you will need to create an SNS topic for this) and tags. In this article, we are going to skip these and create our ASG.
Create the ASG and view it. From its Activity tab, you can see the two instances being launched. Also, from the Instances page, you should see two EC2 instances. This is because we set our desired capacity to 2.

Auto Scaling groups

Auto Scaling groups
What are Scaling Policies?
Scaling policies define the rules that govern how the Auto Scaling group scales in or out in response to changing demand. There are four types of scaling: manual, scheduled, dynamic, and predictive.
Let's break down each type with examples:
Manual Scaling
Manual scaling involves adjusting the number of EC2 instances in your Auto Scaling group manually, without relying on automated triggers or policies. This type of scaling is typically done in response to predictable events or planned changes in demand.
Example: Suppose you run an e-commerce website and you know that an upcoming flash sale will attract a large number of visitors. To handle the expected surge in traffic, you can manually increase the desired capacity of your Auto Scaling group before the event, adding more EC2 instances in advance of the anticipated demand spike. After the event is over, you can manually reduce the desired capacity back to its normal level.
Pros:

Control: Offers direct control over the number of EC2 instances in the Auto Scaling group.
Flexibility: Allows for immediate adjustments based on specific requirements or events.

Cons:

Manual Intervention: Relies on human intervention, which can be time-consuming and prone to errors.
Lack of Automation: Not suitable for handling dynamic or unpredictable fluctuations in demand efficiently.

Scheduled Scaling
Scheduled scaling involves defining schedules that adjust the number of EC2 instances in your Auto Scaling group automatically. This type of scaling is useful for applications with predictable traffic patterns, such as daily or weekly fluctuations in demand.
Example: Consider a video streaming service that experiences peak traffic during evenings and weekends. You can set up scheduled scaling actions to increase the desired capacity of your Auto Scaling group every evening at 6 PM and decrease it every morning at 6 AM. This ensures that you have enough capacity to handle peak demand periods without overspending on resources during off-peak hours.
Pros:

Predictability: Well-suited for applications with predictable traffic patterns, such as daily or weekly fluctuations.
Cost Optimization: Helps optimize costs by aligning resources with expected demand patterns.

Cons:

Limited Adaptability: May not be responsive to sudden changes in demand or unexpected traffic spikes.
Requires Planning: Requires upfront planning and configuration of schedules based on historical data or business insights.

Dynamic Scaling
Dynamic scaling adjusts the number of EC2 instances in your Auto Scaling group automatically based on real-time metrics, such as CPU utilization, network traffic, or other application-specific metrics. This type of scaling is responsive to fluctuations in demand and helps ensure optimal performance and cost-effectiveness.
Types:

Step Scaling: This policy scales the number of instances based on a series of scaling adjustments defined by step adjustments and associated metrics thresholds. 
Target Tracking: This policy automatically adjusts the number of instances to maintain a specified target metric, such as average CPU utilization or network traffic.
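To build intuition for target tracking, the sketch below uses a simple proportional model: if the metric is above the target, grow the group by the same ratio, and shrink it when the metric is below. This is an assumption made for illustration and is not AWS's actual algorithm. The function name and parameters are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// targetTrackingDesired estimates how many instances would bring the
// average metric back to the target, assuming load spreads evenly.
// Simplified proportional model; not the real AWS implementation.
func targetTrackingDesired(current int, metric, target float64, minSize, maxSize int) int {
	desired := int(math.Ceil(float64(current) * metric / target))
	if desired < minSize {
		desired = minSize
	}
	if desired > maxSize {
		desired = maxSize
	}
	return desired
}

func main() {
	// 2 instances at 60% CPU with a 50% target: 2 * 60 / 50 = 2.4, rounded up.
	fmt.Println(targetTrackingDesired(2, 60, 50, 2, 5)) // 3

	// 4 instances at 75% CPU would need 6, but the maximum is 5.
	fmt.Println(targetTrackingDesired(4, 75, 50, 2, 5)) // 5

	// 4 instances at 25% CPU can shrink to 2.
	fmt.Println(targetTrackingDesired(4, 25, 50, 2, 5)) // 2
}
```

Rounding up errs on the side of having slightly more capacity than the model strictly requires.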

When adding instances to the ASG, it will take a few minutes for them to come online and handle load. This is why a cooldown policy has to be set.
Scaling Cooldowns: Scaling cooldowns help prevent rapid fluctuations in the number of instances by imposing a cooldown period after a scaling activity is triggered. During this cooldown period, EC2 Auto Scaling will not launch or terminate additional instances, allowing time for the newly launched instances to stabilize or for the impact of terminated instances to be observed.
Example: Let's say you operate a ride-sharing platform where demand can vary unpredictably throughout the day. With dynamic scaling, you can configure Auto Scaling policies to add more EC2 instances when the number of ride requests exceeds a certain threshold, and remove instances when demand decreases. This allows you to dynamically adapt to changing traffic patterns in real-time, ensuring a seamless experience for both drivers and passengers.
Pros:

Real-Time Responsiveness: Adjusts resource allocation dynamically in response to actual demand, ensuring optimal performance.
Cost Efficiency: Automatically scales resources up or down, helping to optimize costs by only using what is needed.

Cons:

Potential Over-Provisioning: May lead to over-provisioning during sudden spikes in demand if scaling policies are not properly configured.
Complexity: Requires careful configuration of scaling policies and monitoring of metrics to ensure effective scaling behavior.

Predictive Scaling
Predictive scaling uses machine learning algorithms and historical data to forecast future demand and proactively adjust the number of EC2 instances in your Auto Scaling group. This type of scaling helps prevent under-provisioning or over-provisioning of resources by anticipating changes in demand before they occur.
Example: Suppose you operate a weather forecasting application that experiences increased demand during severe weather events. By analyzing historical data on weather patterns and user behavior, predictive scaling can predict when a surge in traffic is likely to occur and automatically scale up the capacity of your Auto Scaling group ahead of time. This ensures that your application remains responsive and available during peak usage periods without unnecessary resource waste.
Pros:

Proactive Optimization: Anticipates future demand based on historical data, ensuring resources are provisioned ahead of time.
Improved Cost Management: Helps prevent under-provisioning and over-provisioning, optimizing resource usage and costs.

Cons:

Data Dependence: Relies on accurate historical data and effective machine learning models for accurate predictions.
Initial Setup: Requires initial setup and configuration of predictive scaling models, which can be complex and resource-intensive.

Conclusion
In conclusion, Amazon EC2 Auto Scaling offers a range of strategies to effectively manage and optimize the performance of applications running on EC2 instances.
Whether it's through manual adjustments, scheduled scaling, dynamic responses to real-time metrics, or proactive measures based on predictive analytics, EC2 Auto Scaling provides the flexibility and automation needed to ensure that resources are aligned with demand. 
By leveraging these scaling capabilities, businesses can enhance availability, improve cost efficiency, and deliver a seamless user experience, ultimately driving better outcomes for their applications and customers on the AWS platform.
As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on LinkedIn or Twitter.
 


 The new() vs make() Functions in Go – When to Use Each One 
Destiny Erhabor — Thu, 04 Jan 2024 15:44:43 +0000
 Go, also known as Golang, is a statically-typed, compiled programming language designed for simplicity and efficiency. 
When it comes to working with data structures like slices, maps, and channels, you'll likely encounter the new() and make() functions. While both are used for memory allocation, they serve distinct purposes. 
In this article, we'll explore the differences between new() and make() in Go and discuss when to use each.
The new() Function
The new() function in Go is a built-in function that allocates memory for a new zeroed value of a specified type and returns a pointer to it. It is primarily used for initializing and obtaining a pointer to a newly allocated zeroed value of a given type, usually for data types like structs.
Here's a simple example:
package main

import "fmt"

type Person struct {
    Name     string
    Age      int
    Gender     string
}

func main() {
    // Using new() to allocate memory for a Person struct
    p := new(Person)

    // Initializing the fields
    p.Name = "John Doe"
    p.Age = 30
    p.Gender = "Male"

    fmt.Println(p)
}

In this example, new(Person) allocates memory for a new Person struct, and p is a pointer to the newly allocated zeroed value.
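For comparison, the sketch below shows that new(T) yields the same result as taking the address of an empty composite literal, and that new works for any type, not only structs. The Person type here is a shortened copy for this example, and zeroPerson is a hypothetical helper.

```go
package main

import "fmt"

// Person is a reduced version of the struct above, for illustration.
type Person struct {
	Name string
	Age  int
}

// zeroPerson returns a pointer to a zeroed Person using new().
func zeroPerson() *Person {
	return new(Person)
}

func main() {
	p := zeroPerson() // pointer to a Person with Name "" and Age 0
	q := &Person{}    // composite literal form with the same result
	fmt.Println(*p == *q) // true

	n := new(int) // new also works for built-in types
	fmt.Println(*n) // 0
	*n = 30
	fmt.Println(*n) // 30
}
```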
The make() Function
On the other hand, the make() function is used for initializing slices, maps, and channels – data structures that require runtime initialization. Unlike new(), make() does not return a pointer: it returns an initialized, ready-to-use value of the specified type.
Let's look at an example using a slice:
package main

import "fmt"

func main() {
    // Using make() to create a slice with a specified length and capacity
    s := make([]int, 10, 15)

    // Initializing the elements
    for i := 0; i < 10; i++ {
        s[i] = i + 1
    }

    fmt.Println(s)
}

In this example, make([]int, 10, 15) creates a slice of integers with a length of 10 and a capacity of 15. The make() function initializes the slice's underlying array, so all 10 elements start at zero and can be assigned right away.
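A short sketch makes the length and capacity distinction concrete. The allZero helper is a hypothetical function written for this example. It confirms that the elements created by make() start at their zero value.

```go
package main

import "fmt"

// allZero reports whether every element of s is 0.
func allZero(s []int) bool {
	for _, v := range s {
		if v != 0 {
			return false
		}
	}
	return true
}

func main() {
	s := make([]int, 3, 5)
	fmt.Println(s, len(s), cap(s)) // [0 0 0] 3 5
	fmt.Println(allZero(s))        // true

	// Appending within the capacity grows the length but not the capacity.
	s = append(s, 7)
	fmt.Println(s, len(s), cap(s)) // [0 0 0 7] 4 5
}
```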
When to Use new() and make() in Go
Use new() for Value Types
When dealing with value types like structs, you can use new() to allocate memory for a new zeroed value. This is suitable for scenarios where you want a pointer to a zero-valued structure that you will fill in afterwards.
p := new(Person)

Use make() for Reference Types:
For slices, maps, and channels, where initialization involves setting up data structures and internal pointers, use make() to create an initialized instance.
s := make([]int, 5, 10)

Pointer vs. Value:
Keep in mind that new() returns a pointer to a zeroed value, while make() returns an initialized value of the type itself rather than a pointer. Choose the appropriate method based on whether you need a pointer or an initialized value.
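The practical difference shows up clearly with maps. In the sketch below, tryWrite is a hypothetical helper that uses recover to report whether writing to a map panicked. A map obtained through new() is still nil and cannot be written to, while one created with make() is ready to use.

```go
package main

import "fmt"

// tryWrite attempts to store a key in m and reports whether that panicked.
func tryWrite(m map[string]int) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	m["key"] = 1
	return false
}

func main() {
	var declared map[string]int // declared only: a nil map
	fmt.Println(tryWrite(declared)) // true

	viaNew := new(map[string]int) // pointer to a nil map
	fmt.Println(tryWrite(*viaNew)) // true

	viaMake := make(map[string]int) // initialized and writable
	fmt.Println(tryWrite(viaMake)) // false
}
```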
Conclusion
Understanding the distinction between new() and make() in Go is crucial for writing clean and efficient code. By using the right method for the appropriate data types, you can ensure proper memory allocation and initialization in your Go programs.

Field	What it does	Example
`issuer.url`	The OIDC provider's base URL — must match the `iss` claim in the token	`https://dex.example.com`
`issuer.audiences`	The client IDs the token was issued for — must match the `aud` claim	`["kubernetes"]`
`issuer.certificateAuthority`	CA certificate to trust when contacting the OIDC provider (inlined PEM)	`-----BEGIN CERTIFICATE-----...`
`claimMappings.username.claim`	Which JWT claim to use as the Kubernetes username	`email`
`claimMappings.groups.claim`	Which JWT claim to use as the Kubernetes group list	`groups`
`claimMappings.*.prefix`	Prefix added to the claim value — set to `""` for no prefix	`""`

APIVERSION in output	apiGroups value in Role
`v1`	`""` (empty string – the core group)
`apps/v1`	`"apps"`
`batch/v1`	`"batch"`
`networking.k8s.io/v1`	`"networking.k8s.io"`
`rbac.authorization.k8s.io/v1`	`"rbac.authorization.k8s.io"`
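The mapping in this table follows one rule: the apiGroups value is everything before the slash in APIVERSION, or the empty string when there is no slash. A minimal Go sketch of that rule is shown here. The function name apiGroupFor is hypothetical and not part of any Kubernetes library.

```go
package main

import (
	"fmt"
	"strings"
)

// apiGroupFor is a hypothetical helper that derives the RBAC apiGroups
// value from an APIVERSION string such as "apps/v1" or "v1".
func apiGroupFor(apiVersion string) string {
	group, _, found := strings.Cut(apiVersion, "/")
	if !found {
		return "" // no slash, e.g. "v1": the core group
	}
	return group
}

func main() {
	for _, v := range []string{"v1", "apps/v1", "networking.k8s.io/v1"} {
		fmt.Printf("%s -> %q\n", v, apiGroupFor(v))
	}
}
```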

Verb	What it allows
`get`	Read a single named resource: `kubectl get pod my-pod`
`list`	Read all resources of a type: `kubectl get pods`
`watch`	Stream changes to resources: used by controllers and informers
`create`	Create a new resource
`update`	Replace an existing resource (`kubectl replace`)
`patch`	Partially modify a resource (`kubectl patch`, and `kubectl apply` on an existing object)
`delete`	Delete a single resource
`deletecollection`	Delete all resources of a type in a namespace
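`exec`	Run a command inside a pod (`kubectl exec`). Not a standalone verb: granted as `create` on the `pods/exec` subresource
`portforward`	Forward a port from a pod (`kubectl port-forward`). Granted as `create` on the `pods/portforward` subresource
`proxy`	Proxy HTTP requests to a pod. Granted through the `pods/proxy` subresource
`log`	Read pod logs (`kubectl logs`). Granted as `get` on the `pods/log` subresource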

Field	What it prevents
`runAsNonRoot: true`	Blocks containers that would run as root (UID 0): the kubelet refuses to start them
`runAsUser: 10001`	Ensures a known, non-privileged UID even if the image doesn't set one
`allowPrivilegeEscalation: false`	Blocks `setuid` binaries and `sudo` – the most common privilege escalation path
`readOnlyRootFilesystem: true`	Prevents writing backdoors, modifying binaries, or creating persistence
`capabilities: drop: ALL`	Removes Linux capabilities like `NET_RAW` (raw socket access) and `SYS_ADMIN` (kernel operations)
`seccompProfile: RuntimeDefault`	Filters syscalls through the container runtime's default seccomp profile, which blocks a set of rarely needed, high-risk syscalls

Task	Docker	Podman	nerdctl (via Lima)
Build image	`docker build -t app .`	`podman build -t app .`	`lima nerdctl build -t app .`
Run container	`docker run -d app`	`podman run -d app`	`lima nerdctl run -d app`
List containers	`docker ps`	`podman ps`	`lima nerdctl ps`
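View logs	`docker logs <container>`	`podman logs <container>`	`lima nerdctl logs <container>`
Stop container	`docker stop <container>`	`podman stop <container>`	`lima nerdctl stop <container>`
Remove container	`docker rm <container>`	`podman rm <container>`	`lima nerdctl rm <container>`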
List images	`docker images`	`podman images`	`lima nerdctl images`
Pull image	`docker pull nginx`	`podman pull nginx`	`lima nerdctl pull nginx`
Push to registry	`docker push app`	`podman push app`	`lima nerdctl push app`
Execute in container	`docker exec -it <container> sh`	`podman exec -it <container> sh`	`lima nerdctl exec -it <container> sh`

	Service Account	User
Kubernetes object?	Yes — lives in a namespace	No — managed externally
Created with	`kubectl create serviceaccount`	External system (CA, IdP, cloud IAM)
Used by	Pods and workloads	Humans and CI systems
Token managed by	Kubernetes	External system
Namespaced?	Yes	No

Incident	Year	Root cause	What was missing
Tesla cryptomining	2018	Kubernetes dashboard exposed with no authentication; unrestricted egress	RBAC on the dashboard endpoint + default-deny NetworkPolicy
Capital One data breach	2019	SSRF vulnerability in a WAF let an attacker reach the EC2 metadata API, which returned credentials for an over-privileged IAM role	Pod-level IAM restrictions (IRSA) + blocking metadata API egress
Shopify bug bounty (Kubernetes)	2021	A researcher accessed internal Kubernetes metadata through a misconfigured internal service, exposing pod environment variables containing secrets	Secret management outside environment variables + network segmentation

Check ID	Description	Why it matters
1.2.1	`--anonymous-auth` is not set to false on the API server	Anonymous requests can reach the API server without authentication – exactly how the Tesla dashboard was accessed
1.2.6	`--kubelet-certificate-authority` is not set	The API server cannot verify kubelet identity, enabling man-in-the-middle attacks between the control plane and nodes
4.2.6	`--protect-kernel-defaults` is not set on the kubelet	Kernel parameters can be modified from within a container, which is one step toward a container escape

Object	Scope	What it does
`Role`	Namespace	Defines a set of permissions within one namespace
`ClusterRole`	Cluster-wide	Defines permissions across all namespaces, or for cluster-scoped resources like Nodes
`RoleBinding`	Namespace	Grants the permissions of a Role or ClusterRole to a subject, within one namespace
`ClusterRoleBinding`	Cluster-wide	Grants the permissions of a ClusterRole to a subject across the entire cluster

Profile	Who it's for	What it restricts
`privileged`	System components (CNI plugins, monitoring agents)	Nothing – no restrictions
`baseline`	Most workloads	Blocks known privilege escalations: no `hostNetwork`, no `hostPID`, no privileged containers
`restricted`	Security-sensitive workloads	Everything in baseline, plus: must run as non-root, must drop capabilities, must set a seccomp profile

Mode	Effect	When to use
`enforce`	Rejects pods that violate the profile at admission	Production – once you've fixed violations
`audit`	Allows pods but records violations in the audit log	Migration – see what would break without breaking anything
`warn`	Allows pods but sends a warning to the client	Development – fast feedback in your terminal

Field	Set at	What it controls
`runAsNonRoot`	Pod	Rejects containers that run as UID 0 (root)
`runAsUser` / `runAsGroup`	Pod	Sets a specific UID/GID – don't rely on the image default
`fsGroup`	Pod	All mounted volumes are owned by this GID
`seccompProfile`	Pod	Filters syscalls using a seccomp profile
`allowPrivilegeEscalation`	Container	Blocks `setuid` binaries and `sudo`
`readOnlyRootFilesystem`	Container	Makes the container filesystem read-only
`capabilities.drop`	Container	Removes Linux capabilities (drop `ALL`, add back only what is needed)

	OPA/Gatekeeper	Kyverno
Policy language	Rego (a custom logic language)	YAML, same format as Kubernetes resources
Learning curve	Steep: Rego takes real time to learn	Gentle: if you write YAML, you can write policies
Mutation	Yes, via `Assign`/`AssignMetadata`	Yes: first-class, well-documented feature
Audit mode	Yes: reports existing violations	Yes: policy audit mode
Ecosystem	Integrates with OPA in non-K8s contexts	Kubernetes-native only
Best for	Complex cross-resource logic and teams already using OPA	Teams who want K8s-native syntax and fast setup

Mode	Pros	Cons
iptables	Simple and universally available on Linux systems; battle-tested and easy to debug	Rule complexity grows linearly with Services/Endpoints; packet processing slows at scale due to sequential rule checks; Service updates trigger full rule reloads
IPVS	O(1) lookup time regardless of cluster size; built-in load-balancing algorithms (RR, LC, SH); incremental updates without full rule recomputation; lower CPU overhead for large clusters	Requires Linux kernel ≥4.4 and IPVS modules loaded; more complex initial configuration; limited visibility with traditional tools

Type	Description
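ClusterIP	Default; reachable only from inside the cluster on a virtual IP
NodePort	Exposes the Service on a static port on every node's IP
LoadBalancer	Provisions an external load balancer through the cloud provider
ExternalName	Maps the Service to an external DNS name (CNAME record)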