<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Destiny Erhabor - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Destiny Erhabor - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 16:29:35 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/CaesarSage/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Encrypt Kubernetes Traffic with cert-manager, Let's Encrypt, and Internal TLS ]]>
                </title>
                <description>
                    <![CDATA[ Most engineers assume their Kubernetes cluster encrypts all of its traffic. It doesn't. The commands you run with kubectl are encrypted — your client and the API server speak TLS. The API server talki ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-encrypt-kubernetes-traffic/</link>
                <guid isPermaLink="false">6a0df3b68b034602219e482c</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Wed, 20 May 2026 17:47:34 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/c1cf9847-fa0f-49f3-93f4-3c5c1e8ac4c0.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most engineers assume their Kubernetes cluster encrypts all of its traffic. It doesn't. The commands you run with <code>kubectl</code> are encrypted — your client and the API server speak TLS. The API server talking to etcd is usually encrypted too, depending on how the cluster was provisioned.</p>
<p>But traffic between your pods? Plaintext by default. Ingress traffic from the internet to your services? Only encrypted if you explicitly configure TLS. And certificates for internal services? You have to provision those yourself.</p>
<p>This is not a Kubernetes oversight. It's a deliberate design choice — Kubernetes provides the primitives and leaves the implementation to you. The problem is that certificate management is notoriously painful. Certificates expire. Provisioning them manually doesn't scale. Forgetting to rotate them causes outages.</p>
<p>cert-manager solves this. It runs as a controller inside your cluster, watches for <code>Certificate</code> resources, requests certificates from configured issuers, stores them in Kubernetes Secrets, and rotates them automatically before they expire. You declare what you want, cert-manager makes it happen and keeps it that way.</p>
<p>In this article you'll work through how cert-manager's core model works, automate public Ingress TLS using Let's Encrypt, set up an internal Certificate Authority for service-to-service encryption, and understand how certificate rotation works so outages caused by expired certificates become a thing of the past.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A kind cluster with the nginx Ingress controller installed</p>
</li>
<li><p>Helm 3 installed</p>
</li>
<li><p>A domain name with DNS you control — needed for the Let's Encrypt demo</p>
</li>
<li><p>Basic understanding of TLS: you know what a certificate, a private key, and a CA are</p>
</li>
</ul>
<p>All demo files are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security/cert-manager">DevOps-Cloud-Projects GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-and-isnt-encrypted-in-kubernetes">What Is and Isn't Encrypted in Kubernetes</a></p>
</li>
<li><p><a href="#heading-how-cert-manager-works">How cert-manager Works</a></p>
<ul>
<li><p><a href="#heading-the-four-core-resources">The Four Core Resources</a></p>
</li>
<li><p><a href="#heading-issuers-and-clusterissuers">Issuers and ClusterIssuers</a></p>
</li>
<li><p><a href="#heading-the-certificate-lifecycle">The Certificate Lifecycle</a></p>
</li>
<li><p><a href="#heading-acme-challenges-http-01-vs-dns-01">ACME Challenges: HTTP-01 vs DNS-01</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-1-install-cert-manager-and-issue-a-certificate-using-pebble-and-lets-encrypt">Demo 1 — Install cert-manager and Issue a Certificate Using Pebble and Let's Encrypt</a></p>
</li>
<li><p><a href="#heading-how-to-get-a-wildcard-certificate-with-dns-01">How to Get a Wildcard Certificate with DNS-01</a></p>
</li>
<li><p><a href="#heading-demo-2-set-up-an-internal-ca-for-service-to-service-tls">Demo 2 — Set Up an Internal CA for Service-to-Service TLS</a></p>
</li>
<li><p><a href="#heading-how-certificate-rotation-works">How Certificate Rotation Works</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-and-isnt-encrypted-in-kubernetes">What Is and Isn't Encrypted in Kubernetes?</h2>
<p>Before installing anything, it's worth being precise about what the cluster already protects and what it leaves open.</p>
<table>
<thead>
<tr>
<th>Traffic path</th>
<th>Encrypted by default?</th>
<th>Notes</th>
</tr>
</thead>
<tbody><tr>
<td><code>kubectl</code> → API server</td>
<td>Yes</td>
<td>TLS with the cluster CA</td>
</tr>
<tr>
<td>API server → etcd</td>
<td>Usually</td>
<td>Depends on cluster provisioner — verify with your setup</td>
</tr>
<tr>
<td>API server → kubelet</td>
<td>Yes</td>
<td>TLS, but kubelet cert verification depends on configuration</td>
</tr>
<tr>
<td>Pod → Pod (same cluster)</td>
<td><strong>No</strong></td>
<td>Plaintext unless you add a service mesh or mTLS</td>
</tr>
<tr>
<td>Internet → Ingress</td>
<td><strong>No</strong></td>
<td>Opt-in — requires TLS configuration on the Ingress resource</td>
</tr>
<tr>
<td>Pod → Kubernetes API</td>
<td>Yes</td>
<td>Via the service account token and cluster CA</td>
</tr>
</tbody></table>
<p>The two gaps that matter most in practice are pod-to-pod traffic and Ingress TLS. This article covers both Ingress TLS with Let's Encrypt and internal service-to-service encryption using a private CA.</p>
<h2 id="heading-how-cert-manager-works">How cert-manager Works</h2>
<p>cert-manager is a Kubernetes operator. It extends the Kubernetes API with custom resources that represent certificate requests and their configuration. When you create a <code>Certificate</code> resource, cert-manager's controller picks it up, requests a certificate from the configured issuer, and stores the resulting certificate and private key in a Kubernetes Secret. When the certificate approaches its expiry, cert-manager renews it automatically.</p>
<p>This model means your application doesn't know or care about certificate management. It reads a Secret. cert-manager keeps that Secret fresh.</p>
<h3 id="heading-the-four-core-resources">The Four Core Resources</h3>
<p>cert-manager introduces four custom resources that you'll use regularly:</p>
<table>
<thead>
<tr>
<th>Resource</th>
<th>What it represents</th>
</tr>
</thead>
<tbody><tr>
<td><code>Issuer</code></td>
<td>A certificate authority or ACME account — namespace-scoped</td>
</tr>
<tr>
<td><code>ClusterIssuer</code></td>
<td>Same as Issuer, but available cluster-wide</td>
</tr>
<tr>
<td><code>Certificate</code></td>
<td>A request for a certificate — describes what you want</td>
</tr>
<tr>
<td><code>CertificateRequest</code></td>
<td>An individual signing request — created automatically by cert-manager, rarely touched directly</td>
</tr>
</tbody></table>
<p>In practice you'll mostly deal with <code>ClusterIssuer</code> and <code>Certificate</code>. The <code>ClusterIssuer</code> defines where certificates come from. The <code>Certificate</code> defines what certificate you want and where to store it.</p>
<h3 id="heading-issuers-and-clusterissuers">Issuers and ClusterIssuers</h3>
<p>An <code>Issuer</code> can only issue certificates within its own namespace. A <code>ClusterIssuer</code> can issue certificates in any namespace. For shared infrastructure like Let's Encrypt, you almost always want a <code>ClusterIssuer</code>. For application-specific internal CAs, an <code>Issuer</code> scoped to that application's namespace is the safer choice.</p>
<p>cert-manager supports several issuer types. The three you'll encounter most often are:</p>
<p><strong>ACME</strong> — for public certificates from Let's Encrypt or any ACME-compatible CA. Ownership of the domain is proven via an HTTP-01 or DNS-01 challenge.</p>
<p><strong>CA</strong> — for internal certificates signed by a CA whose private key is stored in a Kubernetes Secret. Used for service-to-service TLS within the cluster.</p>
<p><strong>Self-signed</strong> — generates self-signed certificates. Rarely useful on its own, but essential as the bootstrap step when creating an internal CA.</p>
<h3 id="heading-the-certificate-lifecycle">The Certificate Lifecycle</h3>
<p>When you create a <code>Certificate</code> resource, cert-manager follows this sequence:</p>
<ol>
<li><p>Creates a <code>CertificateRequest</code> with a CSR (Certificate Signing Request)</p>
</li>
<li><p>Passes the CSR to the configured issuer</p>
</li>
<li><p>For ACME issuers: creates an <code>Order</code> and one or more <code>Challenge</code> resources, then fulfils the challenges (more on this below)</p>
</li>
<li><p>Receives the signed certificate from the issuer</p>
</li>
<li><p>Stores the certificate and private key in the Kubernetes Secret named in <code>spec.secretName</code></p>
</li>
<li><p>Monitors the certificate's expiry — by default, renews when 2/3 of the validity period has elapsed</p>
</li>
</ol>
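<p>To make the default renewal rule in the last step concrete, here is a minimal shell sketch of the arithmetic. The function name <code>renewal_point_hours</code> is hypothetical and not part of cert-manager. It only assumes the two-thirds default described above.</p>
<pre><code class="language-bash"># Hypothetical helper: hours after issuance at which renewal is attempted,
# assuming the default of 2/3 of the validity period.
renewal_point_hours() {
  duration_hours=$1
  echo $(( duration_hours * 2 / 3 ))
}

# A 90-day (2160h) certificate
renewal_point_hours 2160                        # prints 1440
echo $(( $(renewal_point_hours 2160) / 24 ))    # prints 60
</code></pre>
<p>Under that assumption a 90-day certificate is renewed around day 60, which leaves roughly 30 days of slack if a renewal attempt fails.</p>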
<p>Your application mounts the Secret. cert-manager updates it silently. Most applications that watch for file changes will pick up the new certificate without a restart.</p>
<h3 id="heading-acme-challenges-http-01-vs-dns-01">ACME Challenges: HTTP-01 vs DNS-01</h3>
<p>Let's Encrypt needs proof that you control the domain before it issues a certificate. ACME defines two challenge types for this.</p>
<p><strong>HTTP-01</strong> works by having cert-manager create a temporary HTTP endpoint at <code>http://&lt;your-domain&gt;/.well-known/acme-challenge/&lt;token&gt;</code>. Let's Encrypt sends a request to that URL. If the response matches the expected token, the challenge passes. This requires your cluster to be reachable from the internet on port 80.</p>
<p><strong>DNS-01</strong> works by having cert-manager create a temporary DNS TXT record at <code>_acme-challenge.&lt;your-domain&gt;</code>. Let's Encrypt checks for that record. This doesn't require inbound HTTP access, which makes it the right choice for private clusters, and it's the only way to get wildcard certificates (<code>*.example.com</code>).</p>
<p>The trade-off: HTTP-01 is simpler to set up, but it cannot issue wildcard certificates and requires infrastructure that is reachable from the internet on port 80. DNS-01 requires API access to your DNS provider, but it works for internal clusters and wildcards.</p>
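<p>As a rough illustration of where each challenge is published, the sketch below prints the HTTP-01 URL and the DNS-01 record name for a domain. The helper names (<code>http01_url</code>, <code>dns01_record</code>, <code>required_challenge</code>) and the token <code>abc123</code> are made up for this example. cert-manager does this work for you.</p>
<pre><code class="language-bash"># Hypothetical helpers that print where each ACME challenge is published.

# HTTP-01: URL that the ACME server fetches over port 80.
http01_url() {
  domain=$1
  token=$2
  echo "http://${domain}/.well-known/acme-challenge/${token}"
}

# DNS-01: TXT record name. For a wildcard, the leading "*." is dropped.
dns01_record() {
  domain=${1#\*.}
  echo "_acme-challenge.${domain}"
}

# Wildcard names can only be validated with DNS-01.
required_challenge() {
  case $1 in
    \*.*) echo "dns01" ;;
    *)    echo "http01 or dns01" ;;
  esac
}

http01_url echo.example.com abc123
dns01_record "*.example.com"
required_challenge "*.example.com"
</code></pre>
<p>Note that the wildcard and the apex domain share the same TXT record name, which is why one DNS-01 solver can cover both.</p>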
<h2 id="heading-demo-1-install-cert-manager-and-issue-a-certificate-using-pebble-and-lets-encrypt">Demo 1 — Install cert-manager and Issue a Certificate Using Pebble and Let's Encrypt</h2>
<p>Pebble is Let's Encrypt's local ACME test server. It runs inside your cluster, issues certificates using the same ACME protocol as Let's Encrypt, and requires no public domain or internet access. Using Pebble lets you test the full cert-manager flow — challenge, issuance, renewal — on a plain kind cluster.</p>
<p>Once you understand the flow locally, switching to real Let's Encrypt is a one-line change: replace the ClusterIssuer server URL and point a DNS record at a publicly reachable cluster. The rest of the configuration is identical.</p>
<p>You'll install cert-manager, create a <code>ClusterIssuer</code> that points at Pebble, deploy a sample application with an Ingress, and watch a certificate be issued and stored automatically. The final steps switch the same setup to Let's Encrypt staging and then production.</p>
<h3 id="heading-step-1-install-cert-manager">Step 1: Install cert-manager</h3>
<p>cert-manager is now distributed via OCI Helm charts from <code>quay.io/jetstack</code>. The <code>--set crds.enabled=true</code> flag installs the Custom Resource Definitions as part of the chart:</p>
<pre><code class="language-bash">helm upgrade cert-manager oci://quay.io/jetstack/charts/cert-manager \
  --install \
  --create-namespace \
  --namespace cert-manager \
  --set crds.enabled=true \
  --version v1.17.0 \
  --wait
</code></pre>
<p>You also need the nginx Ingress controller — cert-manager routes HTTP-01 challenges through it. The <code>controller.service.type=ClusterIP</code> override is for kind specifically: the default <code>LoadBalancer</code> Service never gets an <code>EXTERNAL-IP</code> on kind (there's no cloud LB), which makes <code>--wait</code> hang forever. On a real cluster, drop the override and keep <code>LoadBalancer</code>.</p>
<pre><code class="language-bash">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=ClusterIP \
  --wait
</code></pre>
<p>Confirm all four components are running:</p>
<pre><code class="language-bash">kubectl get pods -n cert-manager
kubectl get pods -n ingress-nginx
</code></pre>
<pre><code class="language-plaintext">NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-76f84784c8-r4fx4              1/1     Running   0          6m45s
cert-manager-cainjector-66fbf49587-gv25n   1/1     Running   0          6m45s
cert-manager-webhook-577fddf86-l5wj4       1/1     Running   0          6m45s

NAME                                        READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-6c7cd85885-h7zgx   1/1     Running   0          3m34s
</code></pre>
<blockquote>
<p><strong>kind-specific gotcha: remove the nginx admission webhook now.</strong> On kind, the nginx admission webhook serves with a self-signed certificate that the Kubernetes API server cannot verify. The first time you try to create <em>any</em> Ingress resource you'll see <code>failed calling webhook "validate.nginx.ingress.kubernetes.io": ... x509: certificate signed by unknown authority</code>. Delete the webhook up front so the rest of the demo doesn't trip over it:</p>
</blockquote>
<pre><code class="language-bash">kubectl delete validatingwebhookconfiguration ingress-nginx-admission
</code></pre>
<h3 id="heading-step-2-install-pebble">Step 2: Install Pebble</h3>
<p>The Pebble Helm chart used here is published by the JupyterHub project. It ships with a companion CoreDNS deployment (<code>pebble-coredns</code>) that Pebble uses to resolve names during ACME validation.</p>
<pre><code class="language-bash">helm install pebble pebble \
  --repo https://jupyterhub.github.io/helm-chart/ \
  --namespace pebble \
  --create-namespace \
  --wait
</code></pre>
<p>Confirm both pods are running:</p>
<pre><code class="language-bash">kubectl get pods -n pebble
</code></pre>
<pre><code class="language-plaintext">NAME                              READY   STATUS    RESTARTS   AGE
pebble-8d8d49d64-lz8ck            1/1     Running   0          36s
pebble-coredns-7fb5c7cbf4-4jw9h   1/1     Running   0          36s
</code></pre>
<h3 id="heading-step-3-wire-up-dns-for-the-fake-hostname">Step 3: Wire up DNS for the fake hostname</h3>
<p>We're going to issue a cert for <code>echo.pebble.local</code>. That hostname is fake — it doesn't exist in any real DNS — so we have to teach <strong>two</strong> independent resolvers about it before issuance will work:</p>
<table>
<thead>
<tr>
<th>Resolver</th>
<th>Used by</th>
<th>What we need it to do</th>
</tr>
</thead>
<tbody><tr>
<td><code>pebble-coredns</code> (in the <code>pebble</code> namespace)</td>
<td>Pebble itself, when it makes the HTTP-01 validation request</td>
<td>Resolve <code>echo.pebble.local</code> → ingress-nginx ClusterIP</td>
</tr>
<tr>
<td>Cluster CoreDNS (<code>kube-system</code>)</td>
<td>cert-manager's HTTP-01 <strong>self-check</strong> before reporting the challenge ready</td>
<td>Forward <code>pebble.local</code> lookups to <code>pebble-coredns</code></td>
</tr>
</tbody></table>
<p>If you skip either layer, the Order will go to <code>invalid</code> state with a DNS lookup failure.</p>
<p>First grab the two IPs you'll need:</p>
<pre><code class="language-bash">NGINX_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.spec.clusterIP}')
PEBBLE_DNS_IP=$(kubectl get svc pebble-coredns -n pebble \
  -o jsonpath='{.spec.clusterIP}')
echo "NGINX_IP=$NGINX_IP  PEBBLE_DNS_IP=$PEBBLE_DNS_IP"
</code></pre>
<p><strong>Patch</strong> <code>pebble-coredns</code> to answer for <code>*.pebble.local</code> with the ingress controller's IP. The CoreDNS <code>template</code> plugin parses unreliably when the whole block is collapsed onto one line, so apply a real multi-line ConfigMap:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: pebble-coredns
  namespace: pebble
data:
  Corefile: |
    .:8053 {
      errors
      health
      ready
      template ANY ANY pebble.local {
        answer "{{ .Name }} 60 IN A ${NGINX_IP}"
      }
      forward . /etc/resolv.conf
      cache 2
      reload
    }
EOF

kubectl rollout restart deploy/pebble-coredns -n pebble
kubectl rollout status deploy/pebble-coredns -n pebble
</code></pre>
<p>Verify it answers correctly:</p>
<pre><code class="language-bash">kubectl run dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup echo.pebble.local ${PEBBLE_DNS_IP}
</code></pre>
<p>You should see <code>Address: &lt;NGINX_IP&gt;</code> in the response. If you get <code>SERVFAIL</code>, check <code>kubectl logs -n pebble deploy/pebble-coredns</code> — a parser error like <code>not a TTL: "}"</code> means the template block collapsed onto one line again.</p>
<p><strong>Patch the cluster CoreDNS</strong> so cert-manager's self-check can resolve the same name. Add a stub zone that forwards <code>pebble.local</code> to <code>pebble-coredns</code>:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
    pebble.local:53 {
        forward . ${PEBBLE_DNS_IP}
    }
EOF

kubectl rollout restart deploy/coredns -n kube-system
kubectl rollout status deploy/coredns -n kube-system
</code></pre>
<p>Verify the cluster resolver now answers for <code>echo.pebble.local</code> (without specifying a server — it'll use the default kube-dns):</p>
<pre><code class="language-bash">kubectl run dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup echo.pebble.local
</code></pre>
<p>Both <code>Server: 10.96.0.10</code> and <code>Address: &lt;NGINX_IP&gt;</code> should appear.</p>
<h3 id="heading-step-4-fetch-the-pebble-ca-and-create-the-clusterissuer">Step 4: Fetch the Pebble CA and create the ClusterIssuer</h3>
<p>Pebble signs its certificates with a self-signed root that lives in the <code>pebble</code> ConfigMap under <code>root-cert.pem</code>. cert-manager needs to trust this CA to talk to Pebble's ACME directory, so we pass it as a base64-encoded <code>caBundle</code> in the ClusterIssuer:</p>
<pre><code class="language-bash">kubectl get configmap pebble -n pebble \
  -o jsonpath='{.data.root-cert\.pem}' &gt; pebble-ca.crt

head -1 pebble-ca.crt   # should print -----BEGIN CERTIFICATE-----

CA_BUNDLE=$(base64 -i pebble-ca.crt | tr -d '\n')
echo "CA_BUNDLE length: ${#CA_BUNDLE}"   # ~1600 chars, one continuous line
</code></pre>
<p>Create the ClusterIssuer using the heredoc — the <code>${CA_BUNDLE}</code> shell variable gets substituted into the YAML before kubectl reads it:</p>
<pre><code class="language-bash">kubectl apply -f - &lt;&lt;EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: pebble
spec:
  acme:
    server: https://pebble.pebble.svc.cluster.local/dir
    email: test@example.com
    privateKeySecretRef:
      name: pebble-account-key
    caBundle: ${CA_BUNDLE}
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
EOF
</code></pre>
<p>Check the issuer is ready:</p>
<pre><code class="language-bash">kubectl get clusterissuer pebble
</code></pre>
<pre><code class="language-plaintext">NAME     READY   AGE
pebble   True    5s
</code></pre>
<p>If <code>READY</code> stays <code>False</code>, the two most common causes are a malformed caBundle (verify it's a single unbroken base64 line with no newlines) or Pebble being unreachable from the <code>cert-manager</code> namespace. To check reachability:</p>
<pre><code class="language-bash">kubectl run test-curl --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --namespace cert-manager -- \
  curl -k https://pebble.pebble.svc.cluster.local/dir
</code></pre>
<p>If that returns JSON, Pebble is reachable.</p>
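<p>For the other common cause, a malformed caBundle, a quick local check can help. The sketch below uses a hypothetical function, <code>check_ca_bundle</code>, that tests whether a value is a single unbroken line and decodes to something that starts with a PEM certificate header. It assumes a <code>base64</code> that accepts <code>-d</code>, and the sample input is a placeholder, not a real certificate.</p>
<pre><code class="language-bash"># Hypothetical check: a usable caBundle is one line of base64 that decodes to PEM.
check_ca_bundle() {
  bundle=$1
  # Reject any embedded whitespace or newlines
  case $bundle in
    *[[:space:]]*) echo "invalid: contains whitespace"; return 1 ;;
  esac
  # The decoded text should start with a PEM certificate header
  first_line=$(printf '%s' "$bundle" | base64 -d | head -n 1)
  if [ "$first_line" = "-----BEGIN CERTIFICATE-----" ]; then
    echo "ok"
  else
    echo "invalid: not a PEM certificate"
    return 1
  fi
}

# Demo with a placeholder PEM body (not a real certificate)
sample=$(printf '%s\n' "-----BEGIN CERTIFICATE-----" "MIIB" "-----END CERTIFICATE-----" | base64 | tr -d '\n')
check_ca_bundle "$sample"
</code></pre>
<p>In the demo you would call it as <code>check_ca_bundle "$CA_BUNDLE"</code> before applying the ClusterIssuer. It does not validate the certificate itself, only the encoding.</p>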
<h3 id="heading-step-5-deploy-a-sample-application">Step 5: Deploy a sample application</h3>
<pre><code class="language-yaml"># echo-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: ealen/echo-server:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: default
spec:
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 80
</code></pre>
<pre><code class="language-bash">kubectl apply -f echo-app.yaml
</code></pre>
<p>Verify the resources came up:</p>
<pre><code class="language-bash">kubectl get deploy,pod,svc -n default
</code></pre>
<pre><code class="language-plaintext">NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/echo   1/1     1            1           32s

NAME                        READY   STATUS    RESTARTS   AGE
pod/echo-5665fbcfdd-mbgxj   1/1     Running   0          36s

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/echo         ClusterIP   10.96.103.114   &lt;none&gt;        80/TCP    40s
service/kubernetes   ClusterIP   10.96.0.1       &lt;none&gt;        443/TCP   32m
</code></pre>
<h3 id="heading-step-6-create-an-ingress-with-tls">Step 6: Create an Ingress with TLS</h3>
<p>The <code>cert-manager.io/cluster-issuer: pebble</code> annotation tells cert-manager to automatically create a <code>Certificate</code> resource for this Ingress, using the issuer we just created. The hostname <code>echo.pebble.local</code> doesn't need to resolve externally — we taught both DNS resolvers about it in Step 3.</p>
<pre><code class="language-yaml"># echo-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echo
  namespace: default
  annotations:
    cert-manager.io/cluster-issuer: pebble
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - echo.pebble.local
      secretName: echo-tls     # cert-manager will create this Secret
  rules:
    - host: echo.pebble.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echo
                port:
                  number: 80
</code></pre>
<pre><code class="language-bash">kubectl apply -f echo-ingress.yaml
</code></pre>
<h3 id="heading-step-7-watch-the-certificate-being-issued">Step 7: Watch the certificate being issued</h3>
<pre><code class="language-bash"># Watch the Certificate resource (Ctrl-C once Ready=True)
kubectl get certificate echo-tls -n default -w
</code></pre>
<pre><code class="language-plaintext">NAME       READY   SECRET     AGE
echo-tls   False   echo-tls   5s
echo-tls   True    echo-tls   28s
</code></pre>
<p>When <code>READY</code> becomes <code>True</code>, the certificate has been issued and stored in the <code>echo-tls</code> Secret. The full chain — CertificateRequest → Order → Challenge → solver pod → Secret — happens in well under a minute on a healthy cluster:</p>
<pre><code class="language-bash">kubectl get certificate,certificaterequest,order,challenge -n default
</code></pre>
<pre><code class="language-plaintext">NAME                                   READY   SECRET     AGE
certificate.cert-manager.io/echo-tls   True    echo-tls   81s

NAME                                            APPROVED   DENIED   READY   ISSUER   AGE
certificaterequest.cert-manager.io/echo-tls-1   True                True    pebble   81s

NAME                                               STATE   AGE
order.acme.cert-manager.io/echo-tls-1-1824732543   valid   81s
</code></pre>
<p>(Challenges are deleted automatically once an Order completes, so <code>kubectl get challenge -n default</code> typically shows nothing at this point — that's success, not failure.)</p>
<p>If <code>READY</code> stays <code>False</code> for more than a minute, see the troubleshooting tips at the end of this section.</p>
<p>Inspect the issued certificate to confirm Pebble signed it:</p>
<pre><code class="language-bash">kubectl get secret echo-tls -n default -o jsonpath='{.data.tls\.crt}' | \
  base64 -d | openssl x509 -noout -issuer -subject -dates
</code></pre>
<pre><code class="language-plaintext">issuer=CN=Pebble Intermediate CA 05478c
subject=
notBefore=May 17 19:09:22 2026 GMT
notAfter=Aug 15 19:09:21 2026 GMT
</code></pre>
<p>Issuer is Pebble's intermediate CA — proof the full ACME flow worked end-to-end. The cert is valid for 90 days, and cert-manager will renew it automatically at day 60.</p>
<p>Hit the ingress over HTTPS from inside the cluster to confirm everything is wired together:</p>
<pre><code class="language-bash">kubectl run curltest --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk https://echo.pebble.local/
</code></pre>
<p>The echo server should return a JSON blob — note the <code>"x-forwarded-proto":"https"</code> field, which proves the request came through nginx over TLS.</p>
<p><strong>Troubleshooting if the cert never goes Ready:</strong></p>
<ul>
<li><p><code>kubectl describe order -n default</code> — look for "DNS problem" or "Connection refused" in the events.</p>
</li>
<li><p><code>kubectl logs -n pebble deploy/pebble --tail=50</code> — Pebble logs the exact URL it tried to fetch during validation and any errors.</p>
</li>
<li><p>If the Order is stuck pending with no events: cert-manager hasn't reconciled yet. Wait 30s.</p>
</li>
<li><p>If the Order is <code>invalid</code>: one of the two DNS layers (Step 3) is misconfigured. Re-run both <code>nslookup</code> checks.</p>
</li>
<li><p>If the Ingress apply itself failed with an x509 webhook error: you skipped the <code>kubectl delete validatingwebhookconfiguration ingress-nginx-admission</code> step in Step 1.</p>
</li>
</ul>
<h3 id="heading-step-8-switch-to-lets-encrypt-staging-real-public-domain">Step 8: Switch to Let's Encrypt staging (real public domain)</h3>
<p>Pebble proved the flow works locally. Now move to a publicly-reachable domain pointed at a publicly-reachable cluster. The DNS gymnastics from Step 3 go away — the domain is real, so both resolvers find it without intervention.</p>
<p>Use Let's Encrypt <strong>staging</strong> first. It speaks the same ACME protocol as production but with generous rate limits, so failed attempts during testing won't lock you out:</p>
<pre><code class="language-yaml"># clusterissuer-staging.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
</code></pre>
<pre><code class="language-bash">kubectl apply -f clusterissuer-staging.yaml

# First change the host in echo-ingress.yaml to your real domain and re-apply it.
# Then point the Ingress at staging and force re-issuance.
kubectl annotate ingress echo \
  cert-manager.io/cluster-issuer=letsencrypt-staging --overwrite -n default
kubectl delete secret echo-tls -n default
</code></pre>
<p>The new cert's issuer will look something like <code>(STAGING) Let's Encrypt</code>.</p>
<h3 id="heading-step-9-switch-to-lets-encrypt-production">Step 9: Switch to Let's Encrypt production</h3>
<p>Once staging works, repeat with the production ClusterIssuer. The only difference is the <code>server</code> URL:</p>
<pre><code class="language-yaml"># clusterissuer-prod.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
</code></pre>
<pre><code class="language-bash">kubectl apply -f clusterissuer-prod.yaml
kubectl annotate ingress echo \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite -n default
kubectl delete secret echo-tls -n default
</code></pre>
<p>cert-manager detects the missing Secret and immediately requests a new, browser-trusted certificate through the production issuer.</p>
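<p>Because the staging and production issuers differ only in name and <code>server</code> URL, you could generate both from one template. The following is a sketch only: <code>render_clusterissuer</code> is a hypothetical helper, and the manifest it prints mirrors the two YAML files above.</p>
<pre><code class="language-bash"># Hypothetical helper: print an HTTP-01 ClusterIssuer for staging or production.
render_clusterissuer() {
  target=$1   # "staging" or "prod"
  email=$2
  case $target in
    staging) server="https://acme-staging-v02.api.letsencrypt.org/directory" ;;
    prod)    server="https://acme-v02.api.letsencrypt.org/directory" ;;
    *)       echo "unknown target: $target"; return 1 ;;
  esac
  printf '%s\n' \
    "apiVersion: cert-manager.io/v1" \
    "kind: ClusterIssuer" \
    "metadata:" \
    "  name: letsencrypt-${target}" \
    "spec:" \
    "  acme:" \
    "    server: ${server}" \
    "    email: ${email}" \
    "    privateKeySecretRef:" \
    "      name: letsencrypt-${target}-account-key" \
    "    solvers:" \
    "      - http01:" \
    "          ingress:" \
    "            ingressClassName: nginx"
}

render_clusterissuer staging your-email@example.com
</code></pre>
<p>To apply the result, pipe it to kubectl, for example <code>render_clusterissuer prod your-email@example.com | kubectl apply -f -</code>.</p>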
<h2 id="heading-how-to-get-a-wildcard-certificate-with-dns-01">How to Get a Wildcard Certificate with DNS-01</h2>
<p>HTTP-01 challenges work well for single domains with public ingress. But there are two situations where you need DNS-01 instead: when your cluster is not publicly accessible (internal clusters, air-gapped environments, staging environments behind a VPN), and when you want a wildcard certificate that covers all subdomains of your domain.</p>
<p>DNS-01 requires cert-manager to be able to create and delete TXT records in your DNS provider. cert-manager has built-in support for Route53, Cloud DNS, Cloudflare, Azure DNS, and many others.</p>
<p>Here is a <code>ClusterIssuer</code> for DNS-01 using AWS Route53:</p>
<pre><code class="language-yaml"># clusterissuer-dns01.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-dns01-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            # Use IRSA (IAM Roles for Service Accounts) in production
            # rather than static credentials
            hostedZoneID: YOUR_HOSTED_ZONE_ID
</code></pre>
<p>A wildcard <code>Certificate</code> using that issuer:</p>
<pre><code class="language-yaml"># wildcard-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns01
    kind: ClusterIssuer
  commonName: "*.example.com"
  dnsNames:
    - "*.example.com"
    - "example.com"        # Also cover the apex domain
  duration: 2160h           # 90 days
  renewBefore: 720h         # Renew 30 days before expiry
</code></pre>
<p>The resulting Secret <code>wildcard-example-com-tls</code> can be referenced by any Ingress in the <code>default</code> namespace. All subdomains — <code>api.example.com</code>, <code>dashboard.example.com</code>, <code>staging.example.com</code> — are covered by a single certificate that rotates automatically.</p>
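<p>As a rough sketch of what "referenced by any Ingress" looks like in practice, the command below creates an Ingress for <code>api.example.com</code> that terminates TLS with the wildcard Secret. The Service name <code>api</code> and port <code>80</code> are assumptions made for illustration, so substitute your own backend. No <code>cert-manager.io/cluster-issuer</code> annotation is needed here, because the <code>Certificate</code> resource above already manages the Secret:</p>
<pre><code class="language-bash"># Hypothetical backend: a Service called "api" listening on port 80
kubectl create ingress api \
  --namespace default \
  --class nginx \
  --rule "api.example.com/*=api:80,tls=wildcard-example-com-tls"
</code></pre>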
<p>For Cloudflare instead of Route53, the solver section looks like this:</p>
<pre><code class="language-yaml">    solvers:
      - dns01:
          cloudflare:
            email: your-email@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
</code></pre>
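<p>The Cloudflare solver expects a Secret named <code>cloudflare-api-token</code> with a key called <code>api-token</code>. A minimal way to create it is shown below, assuming cert-manager runs in the <code>cert-manager</code> namespace, which is where it looks up Secrets referenced by a <code>ClusterIssuer</code> by default. The token value is a placeholder:</p>
<pre><code class="language-bash"># Replace the placeholder with a Cloudflare API token that can edit DNS records
kubectl create secret generic cloudflare-api-token \
  --namespace cert-manager \
  --from-literal=api-token=YOUR_CLOUDFLARE_API_TOKEN
</code></pre>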
<h2 id="heading-demo-2-set-up-an-internal-ca-for-service-to-service-tls">Demo 2 — Set Up an Internal CA for Service-to-Service TLS</h2>
<p>Let's Encrypt certificates are great for public-facing services. But for internal services — a gRPC microservice calling another, a web application talking to its database — you don't need public trust. You need a CA that the cluster trusts, and you need it to issue certificates for service names that don't exist as public DNS records.</p>
<p>cert-manager's CA issuer handles this. You create a root CA, tell cert-manager about it, and then issue certificates for internal services using that CA. Every service that trusts the root CA trusts every certificate it issues.</p>
<h3 id="heading-step-1-create-a-self-signed-clusterissuer">Step 1: Create a self-signed ClusterIssuer</h3>
<p>A self-signed issuer produces certificates that are signed with their own private key, so each certificate acts as its own CA. You use this as a bootstrap step to create the root CA certificate:</p>
<pre><code class="language-yaml"># selfsigned-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned
spec:
  selfSigned: {}
</code></pre>
<pre><code class="language-bash">kubectl apply -f selfsigned-issuer.yaml
</code></pre>
<h3 id="heading-step-2-create-the-root-ca-certificate">Step 2: Create the root CA certificate</h3>
<p>Use the self-signed issuer to create a CA certificate. The <code>isCA: true</code> field tells cert-manager this certificate can sign other certificates:</p>
<pre><code class="language-yaml"># internal-ca.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager    # Store in cert-manager namespace
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-secret
  duration: 87600h           # 10 years — this is a root CA
  renewBefore: 720h
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned
    kind: ClusterIssuer
</code></pre>
<pre><code class="language-bash">kubectl apply -f internal-ca.yaml
kubectl get certificate internal-ca -n cert-manager
</code></pre>
<pre><code class="language-plaintext">NAME          READY   SECRET               AGE
internal-ca   True    internal-ca-secret   8s
</code></pre>
<h3 id="heading-step-3-create-a-ca-clusterissuer-backed-by-the-root-ca">Step 3: Create a CA ClusterIssuer backed by the root CA</h3>
<p>Now create a <code>ClusterIssuer</code> that uses the root CA Secret you just created. This is the issuer that will sign certificates for your internal services:</p>
<pre><code class="language-yaml"># internal-ca-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-secret   # References the Secret in cert-manager namespace
</code></pre>
<pre><code class="language-bash">kubectl apply -f internal-ca-issuer.yaml
kubectl get clusterissuer internal-ca
</code></pre>
<pre><code class="language-plaintext">NAME          READY   AGE
internal-ca   True    5s
</code></pre>
<h3 id="heading-step-4-issue-a-certificate-for-an-internal-service">Step 4: Issue a certificate for an internal service</h3>
<p>Now issue a certificate for an internal gRPC service. The <code>dnsNames</code> use Kubernetes internal DNS names — <code>&lt;service&gt;.&lt;namespace&gt;.svc.cluster.local</code>:</p>
<pre><code class="language-yaml"># payments-cert.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: payments-tls
  namespace: production
spec:
  secretName: payments-tls-secret
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
  commonName: payments.production.svc.cluster.local
  dnsNames:
    - payments.production.svc.cluster.local
    - payments.production.svc
    - payments
  duration: 2160h     # 90 days
  renewBefore: 360h   # Renew 15 days before expiry
</code></pre>
<pre><code class="language-bash">kubectl create namespace production
kubectl apply -f payments-cert.yaml
kubectl get certificate payments-tls -n production
</code></pre>
<pre><code class="language-plaintext">NAME           READY   SECRET                AGE
payments-tls   True    payments-tls-secret   6s
</code></pre>
<p>The Secret <code>payments-tls-secret</code> now contains <code>tls.crt</code>, <code>tls.key</code>, and <code>ca.crt</code>. Mount this into your application pod:</p>
<pre><code class="language-yaml"># In your Deployment spec
volumes:
  - name: tls
    secret:
      secretName: payments-tls-secret
containers:
  - name: payments
    volumeMounts:
      - name: tls
        mountPath: /etc/tls
        readOnly: true
</code></pre>
<p>Your application reads <code>/etc/tls/tls.crt</code> and <code>/etc/tls/tls.key</code> to configure TLS. The <code>ca.crt</code> file in the same Secret holds the CA certificate that clients need in order to trust it; the next step distributes that CA to other workloads.</p>
<h3 id="heading-step-5-distribute-the-ca-bundle-with-trust-manager">Step 5: Distribute the CA bundle with trust-manager</h3>
<p>The problem with a custom CA is that every service needs to know about it. cert-manager's companion tool, trust-manager, handles this by distributing the CA bundle as a <code>ConfigMap</code> to every namespace:</p>
<pre><code class="language-bash">helm upgrade trust-manager oci://quay.io/jetstack/charts/trust-manager \
  --install \
  --namespace cert-manager \
  --wait
</code></pre>
<p>Create a <code>Bundle</code> resource that takes the CA certificate from the <code>internal-ca-secret</code> and distributes it to the namespaces you select:</p>
<pre><code class="language-yaml"># ca-bundle.yaml
apiVersion: trust.cert-manager.io/v1alpha1
kind: Bundle
metadata:
  name: internal-ca-bundle
spec:
  sources:
    - secret:
        name: internal-ca-secret
        key: ca.crt
  target:
    configMap:
      key: ca-bundle.crt
    namespaceSelector:
      matchLabels:
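        # Only namespaces matching this label receive the bundle.
        # kubernetes.io/metadata.name is set automatically on every
        # namespace, so this selects just "production".
        kubernetes.io/metadata.name: production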
</code></pre>
<pre><code class="language-bash">kubectl apply -f ca-bundle.yaml
</code></pre>
<p>After a few seconds, every matching namespace has a ConfigMap named <code>internal-ca-bundle</code> containing the CA certificate. Applications mount this ConfigMap to trust internally-issued certificates without any per-service configuration.</p>
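<p>One way to confirm the distribution worked is to read the ConfigMap back and decode the certificate it carries. This sketch assumes the <code>production</code> namespace from Step 4 and the <code>ca-bundle.crt</code> key set in the <code>Bundle</code> above. If the sync succeeded, the subject should be the <code>internal-ca</code> common name from Step 2:</p>
<pre><code class="language-bash"># Print the subject and expiry date of the first certificate in the bundle
kubectl get configmap internal-ca-bundle -n production \
  -o jsonpath='{.data.ca-bundle\.crt}' \
  | openssl x509 -noout -subject -enddate
</code></pre>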
<h3 id="heading-step-6-verify-the-certificate-chain">Step 6: Verify the certificate chain</h3>
<pre><code class="language-bash"># Extract the CA cert and service cert
kubectl get secret payments-tls-secret -n production \
  -o jsonpath='{.data.ca\.crt}' | base64 -d &gt; ca.crt

kubectl get secret payments-tls-secret -n production \
  -o jsonpath='{.data.tls\.crt}' | base64 -d &gt; payments.crt

# Verify the cert was signed by the CA
openssl verify -CAfile ca.crt payments.crt
</code></pre>
<pre><code class="language-plaintext">payments.crt: OK
</code></pre>
<h2 id="heading-how-certificate-rotation-works">How Certificate Rotation Works</h2>
<p>Certificate rotation is the part of certificate management that breaks production clusters most often. cert-manager handles it automatically, but understanding the mechanism helps you tune it and debug it when things go wrong.</p>
<p>cert-manager watches every <code>Certificate</code> resource it manages and checks the expiry of the underlying certificate in the Secret. When the remaining validity drops below the <code>renewBefore</code> threshold, cert-manager triggers a renewal. The default <code>renewBefore</code> is 1/3 of the certificate's total validity period — so a 90-day certificate starts renewing at day 60.</p>
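<p>The arithmetic behind that threshold is simple enough to sketch in a few lines of shell. The helper below is purely illustrative and is not part of cert-manager: <code>renewal_day</code> is a made-up function name, and it only mirrors the rule described above (renew at total validity minus <code>renewBefore</code>, with a default of one third):</p>
<pre><code class="language-bash"># renewal_day DURATION_HOURS [RENEW_BEFORE_HOURS]
# Prints the day of the certificate's life on which renewal starts.
# With no second argument it falls back to one third of the duration,
# mirroring the default described above.
renewal_day() {
  local duration_hours=$1
  local renew_before_hours=${2:-$(( duration_hours / 3 ))}
  echo $(( (duration_hours - renew_before_hours) / 24 ))
}

renewal_day 2160        # 90-day certificate, default renewBefore
renewal_day 2160 360    # payments-tls: 90 days with renewBefore: 360h
</code></pre>
<p>The first call prints <code>60</code> and the second prints <code>75</code>, which matches the 15-day buffer configured for <code>payments-tls</code> earlier.</p>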
<p>The renewal creates a new <code>CertificateRequest</code>, goes through the full issuance flow, and updates the Secret in place. The kubelet then syncs the new files into any pod that mounts the Secret as a volume. Applications that watch those files for changes pick up the new certificate without restarting; applications that read the certificate only at startup need a reload or restart before they serve it.</p>
<pre><code class="language-bash"># See the current rotation status
kubectl describe certificate echo-tls -n default
</code></pre>
<p>Look for these fields in the output:</p>
<pre><code class="language-plaintext">Status:
  Not After:   2024-06-18T10:00:00Z
  Not Before:  2024-03-20T10:00:00Z
  Renewal Time: 2024-05-18T10:00:00Z   # When cert-manager will start renewing
  Conditions:
    Type:    Ready
    Status:  True
    Message: Certificate is up to date and has not expired
</code></pre>
<p>If a renewal fails — for example, because the HTTP-01 challenge can't be completed — cert-manager retries with exponential backoff. The existing certificate continues to serve until it actually expires, giving you a window to debug the issue.</p>
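<p>If you have the optional <code>cmctl</code> command-line tool installed (it ships separately from cert-manager, so treat its presence as an assumption), it can shorten that debugging loop. The sketch below inspects the <code>echo-tls</code> certificate from the first demo and then forces a renewal, which is a convenient way to rehearse rotation without waiting for the threshold:</p>
<pre><code class="language-bash"># Summarise the Certificate, its CertificateRequest, and any pending challenges
cmctl status certificate echo-tls -n default

# Trigger a renewal now instead of waiting for the renewBefore threshold
cmctl renew echo-tls -n default
</code></pre>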
<p>To list recent issuance and failure events (add <code>--watch</code> to follow them in real time):</p>
<pre><code class="language-bash">kubectl get events -n default --field-selector reason=Issued
kubectl get events -n default --field-selector reason=Failed
</code></pre>
<p><strong>Setting</strong> <code>renewBefore</code> <strong>correctly:</strong> For public-facing services, 30 days before a 90-day certificate is a sensible buffer. For internal short-lived certificates (24-hour validity), set <code>renewBefore</code> to 8 hours so rotation happens well before expiry even if the first attempt fails. Avoid setting <code>renewBefore</code> to more than half the certificate's validity: renewal is scheduled at expiry minus <code>renewBefore</code>, so the larger the value, the sooner cert-manager reissues a certificate it has only just issued.</p>
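<p>To review the buffer you actually ended up with across the cluster, you can print each certificate's expiry next to its scheduled renewal time. This is a sketch that relies on the <code>notAfter</code> and <code>renewalTime</code> status fields shown in the <code>kubectl describe</code> output above. The column names are arbitrary:</p>
<pre><code class="language-bash"># List every Certificate with its expiry and its scheduled renewal time
kubectl get certificates.cert-manager.io --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,NOT_AFTER:.status.notAfter,RENEWAL_TIME:.status.renewalTime'
</code></pre>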
<h2 id="heading-cleanup">Cleanup</h2>
<pre><code class="language-bash"># Remove demo resources
kubectl delete ingress echo -n default
kubectl delete service echo -n default
kubectl delete deployment echo -n default
kubectl delete secret echo-tls -n default
kubectl delete certificate payments-tls -n production
kubectl delete namespace production

# Remove the Bundle and ClusterIssuers while their CRDs are still installed
kubectl delete bundle internal-ca-bundle --ignore-not-found
kubectl delete clusterissuer letsencrypt-staging letsencrypt-prod \
  letsencrypt-dns01 internal-ca selfsigned --ignore-not-found

# Uninstall trust-manager and cert-manager
helm uninstall trust-manager -n cert-manager
helm uninstall cert-manager -n cert-manager
kubectl delete namespace cert-manager
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kubernetes leaves TLS configuration entirely to you. In this article you worked through both the public and internal sides of that responsibility.</p>
<p>On the public side, you installed cert-manager using the current OCI Helm chart, created a <code>ClusterIssuer</code> backed by Let's Encrypt, and watched cert-manager go through the full ACME HTTP-01 challenge flow — from creating a temporary solver pod to storing a valid certificate in a Kubernetes Secret. You saw how switching from staging to production is a one-line annotation change, and how cert-manager renews certificates automatically before they expire.</p>
<p>On the internal side, you bootstrapped a private CA using cert-manager's self-signed issuer, created a <code>ClusterIssuer</code> backed by that CA, and issued certificates for internal service names that only exist inside the cluster. You used trust-manager to distribute the CA bundle cluster-wide so services can trust each other's certificates without per-service configuration. And you saw how to verify the certificate chain with <code>openssl</code> so you can confirm it's working before deploying to production.</p>
<p>Understanding certificate rotation is what separates teams that manage TLS confidently from teams that get woken up at 3am by an expired certificate. cert-manager automates the renewal, but the <code>renewBefore</code> field is your safety margin — set it correctly and know how to read the renewal status.</p>
<p>All YAML manifests and Helm values from this article are available in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security/cert-manager">DevOps-Cloud-Projects GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Authenticate Users in Kubernetes: x509 Certificates, OIDC, and Cloud Identity ]]>
                </title>
                <description>
                    <![CDATA[ Kubernetes doesn't know who you are. It has no user database, no built-in login system, no password file. When you run kubectl get pods, Kubernetes receives an HTTP request and asks one question: who  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-authenticate-users-in-kubernetes-x509-certificates-oidc-and-cloud-identity/</link>
                <guid isPermaLink="false">69d4182f40c9cabf4484dbdb</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ authentication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:31:43 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36356282-0cfb-43a8-8461-84f20e64b041.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Kubernetes doesn't know who you are.</p>
<p>It has no user database, no built-in login system, no password file. When you run <code>kubectl get pods</code>, Kubernetes receives an HTTP request and asks one question: who signed this, and do I trust that signature? Everything else — what you're allowed to do, which namespaces you can access, whether your request goes through at all — comes after that question is answered.</p>
<p>This surprises most engineers who are new to Kubernetes. They expect something like a database of users with passwords. Instead, they find a pluggable chain of authenticators, each one able to vouch for a request in a different way:</p>
<ul>
<li><p>Client certificates</p>
</li>
<li><p>OIDC tokens from an external identity provider</p>
</li>
<li><p>Cloud provider IAM tokens</p>
</li>
<li><p>Service account tokens projected into pods</p>
</li>
</ul>
<p>Any of these can be active at the same time.</p>
<p>Understanding this model is what separates engineers who can debug authentication failures from engineers who copy kubeconfig files and hope for the best.</p>
<p>In this article, you'll work through how the Kubernetes authentication chain works from first principles. You'll see how x509 client certificates are used — and why they're a poor choice for human users in production. You'll configure OIDC authentication with Dex, giving your cluster a real browser-based login flow. And you'll see how AWS, GCP, and Azure each plug into the same underlying model.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A running kind cluster — a fresh one works fine, or reuse an existing one</p>
</li>
<li><p><code>kubectl</code> and <code>helm</code> installed</p>
</li>
<li><p><code>openssl</code> available on your machine (comes pre-installed on macOS and most Linux distros)</p>
</li>
<li><p>Basic familiarity with what a JWT is (a signed JSON object with claims) — you don't need to be able to write one, just recognise one</p>
</li>
</ul>
<p>All demo files are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</a></p>
<ul>
<li><p><a href="#heading-the-authenticator-chain">The Authenticator Chain</a></p>
</li>
<li><p><a href="#heading-users-vs-service-accounts">Users vs Service Accounts</a></p>
</li>
<li><p><a href="#heading-what-happens-after-authentication">What Happens After Authentication</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</a></p>
<ul>
<li><p><a href="#heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</a></p>
</li>
<li><p><a href="#heading-the-cluster-ca">The Cluster CA</a></p>
</li>
<li><p><a href="#heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</a></p>
<ul>
<li><p><a href="#heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</a></p>
</li>
<li><p><a href="#heading-the-api-server-configuration">The API Server Configuration</a></p>
</li>
<li><p><a href="#heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</a></p>
</li>
<li><p><a href="#heading-how-kubelogin-works">How kubelogin Works</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2--configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</a></p>
</li>
<li><p><a href="#heading-cloud-provider-authentication">Cloud Provider Authentication</a></p>
<ul>
<li><p><a href="#heading-aws-eks">AWS EKS</a></p>
</li>
<li><p><a href="#heading-google-gke">Google GKE</a></p>
</li>
<li><p><a href="#heading-azure-aks">Azure AKS</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-webhook-token-authentication">Webhook Token Authentication</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-how-kubernetes-authentication-works">How Kubernetes Authentication Works</h2>
<p>Every request that reaches the Kubernetes API server — whether from <code>kubectl</code>, a pod, a controller, or a CI pipeline — carries a credential of some kind.</p>
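<p>The API server passes that credential through a chain of authenticators. The first authenticator that can verify the credential wins. If a credential is presented and none of them can verify it, the request is rejected with <code>401 Unauthorized</code>. A request that presents no credential at all is treated as anonymous, provided anonymous access is enabled.</p>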
<h3 id="heading-the-authenticator-chain">The Authenticator Chain</h3>
<p>Kubernetes supports several authentication strategies simultaneously. You can have client certificate authentication and OIDC authentication active on the same cluster at the same time, which is common in production: cluster administrators use certificates, regular developers use OIDC. The strategies active on a cluster are determined by flags passed to the <code>kube-apiserver</code> process.</p>
<p>The strategies available are x509 client certificates, bearer tokens (static token files — rarely used in production), bootstrap tokens (used during node join operations), service account tokens, OIDC tokens, authenticating proxies, and webhook token authentication. A cluster doesn't have to use all of them, and most don't. But knowing they all exist helps when you're diagnosing an auth failure.</p>
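<p>On a kind cluster you can see which of these strategies are switched on by reading the API server's static pod manifest. The sketch below assumes the control plane container is named <code>k8s-security-control-plane</code>, the same name used in the <code>docker cp</code> commands later in this article, and that the manifest sits at the usual kubeadm path:</p>
<pre><code class="language-bash"># Show the authentication-related flags passed to kube-apiserver
docker exec k8s-security-control-plane \
  grep -E -e '--(client-ca-file|oidc-|token-auth-file|enable-bootstrap-token-auth|authentication-)' \
  /etc/kubernetes/manifests/kube-apiserver.yaml
</code></pre>
<p>A fresh cluster typically shows <code>--client-ca-file</code> and no <code>--oidc-</code> flags, which tells you certificate authentication is active and OIDC is not yet configured.</p>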
<h3 id="heading-users-vs-service-accounts">Users vs Service Accounts</h3>
<p>There is an important distinction in how Kubernetes thinks about identity. Service accounts are Kubernetes objects — they live in a namespace, get created with <code>kubectl create serviceaccount</code>, and have tokens managed by the cluster itself. Every pod runs as a service account. These are machine identities for workloads.</p>
<p>Users, on the other hand, don't exist as Kubernetes objects at all. There is no <code>kubectl create user</code> command. Kubernetes doesn't manage user accounts. Instead, it trusts external systems to assert user identity — a certificate authority, an OIDC provider, or a cloud provider's IAM system. Kubernetes just verifies the assertion and extracts the username and group memberships from it.</p>
<table>
<thead>
<tr>
<th></th>
<th>Service Account</th>
<th>User</th>
</tr>
</thead>
<tbody><tr>
<td>Kubernetes object?</td>
<td>Yes — lives in a namespace</td>
<td>No — managed externally</td>
</tr>
<tr>
<td>Created with</td>
<td><code>kubectl create serviceaccount</code></td>
<td>External system (CA, IdP, cloud IAM)</td>
</tr>
<tr>
<td>Used by</td>
<td>Pods and workloads</td>
<td>Humans and CI systems</td>
</tr>
<tr>
<td>Token managed by</td>
<td>Kubernetes</td>
<td>External system</td>
</tr>
<tr>
<td>Namespaced?</td>
<td>Yes</td>
<td>No</td>
</tr>
</tbody></table>
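<p>You can see the difference directly from the command line. In this sketch, <code>demo-sa</code> is a throwaway name chosen for illustration. The service account commands succeed, while the request for users fails because the API server has no such resource type:</p>
<pre><code class="language-bash"># Service accounts are API objects you can create, list, and delete
kubectl create serviceaccount demo-sa -n default
kubectl get serviceaccounts -n default

# Users are not: the API server has no such resource type, so this fails
kubectl get users

# Remove the demo service account again
kubectl delete serviceaccount demo-sa -n default
</code></pre>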
<h3 id="heading-what-happens-after-authentication">What Happens After Authentication</h3>
<p>Authentication only answers one question: who is this? Once the API server has a verified identity — a username and zero or more group memberships — it passes the request to the authorisation layer. By default that is RBAC, which checks the identity against Role and ClusterRole bindings to determine what the request is allowed to do.</p>
<p>This is why authentication and authorisation are separate concerns in Kubernetes. A valid certificate gets you past the front door. What you can do inside is RBAC's job. An authenticated user with no RBAC bindings can authenticate successfully but will be denied every API call.</p>
<p>If you want a deep dive into how RBAC rules, roles, and bindings work, check out this handbook on <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a>.</p>
<h2 id="heading-how-to-use-x509-client-certificates">How to Use x509 Client Certificates</h2>
<p>x509 client certificate authentication is the oldest and simplest authentication method in Kubernetes. It's how <code>kubectl</code> works out of the box when you create a cluster — the kubeconfig file that <code>kind</code> or <code>kubeadm</code> generates contains an embedded client certificate signed by the cluster's Certificate Authority.</p>
<h3 id="heading-how-the-certificate-maps-to-an-identity">How the Certificate Maps to an Identity</h3>
<p>When the API server receives a request with a client certificate, it validates the certificate against its trusted CA, then reads two fields (the Common Name and Organization) from the certificate to construct an identity.</p>
<p>The <strong>Common Name (CN)</strong> field becomes the username. The <strong>Organization (O)</strong> field, which can contain multiple values, becomes the list of groups the user belongs to.</p>
<p>So a certificate with <code>CN=jane</code> and <code>O=engineering</code> authenticates as username <code>jane</code> in group <code>engineering</code>. If you want to give <code>jane</code> permissions, you create a RoleBinding that references either the username <code>jane</code> or the group <code>engineering</code> as a subject.</p>
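<p>To see what multiple Organization values look like, you can generate a throwaway certificate locally and print its subject. This is only an illustration: the certificate is self-signed rather than signed by the cluster CA, so it can't authenticate to anything, and the second group, <code>platform</code>, is a made-up example:</p>
<pre><code class="language-bash"># Throwaway self-signed certificate, only to show how subject fields look.
# It is not signed by the cluster CA and cannot authenticate to a cluster.
workdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$workdir/demo.key" \
  -out "$workdir/demo.crt" \
  -subj "/CN=jane/O=engineering/O=platform"

# Print the subject: CN is the username, each O value is a group
openssl x509 -in "$workdir/demo.crt" -noout -subject -nameopt RFC2253
</code></pre>
<p>The subject line lists <code>CN=jane</code> along with both <code>O=</code> values. A certificate like this, if it were signed by the cluster CA, would authenticate as user <code>jane</code> in the groups <code>engineering</code> and <code>platform</code>.</p>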
<p>This is the same mechanism behind <code>system:masters</code>. Kubernetes has a built-in ClusterRoleBinding that grants <code>cluster-admin</code> to anyone in the <code>system:masters</code> group, and older <code>kind</code> and <code>kubeadm</code> releases put <code>O=system:masters</code> in the admin certificate they write to your kubeconfig. Clusters built with kubeadm 1.29 or later use <code>O=kubeadm:cluster-admins</code> instead, a group that is bound to <code>cluster-admin</code> through an ordinary ClusterRoleBinding. Either way, your default kubeconfig has full admin access because of the group in its certificate, not because of magic.</p>
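<p>To check which group your own admin certificate carries, you can decode it straight out of the kubeconfig. This sketch assumes the current context embeds its client certificate as base64 data, which is what <code>kind</code> does by default:</p>
<pre><code class="language-bash"># Decode the client certificate of the current context and print its subject
kubectl config view --raw --minify \
  -o jsonpath='{.users[0].user.client-certificate-data}' \
  | base64 -d | openssl x509 -noout -subject
</code></pre>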
<h3 id="heading-the-cluster-ca">The Cluster CA</h3>
<p>Every Kubernetes cluster has a root Certificate Authority — a private key and a self-signed certificate that the API server trusts. Any client certificate signed by this CA is trusted by the cluster.</p>
<p>The CA certificate and key are typically stored in <code>/etc/kubernetes/pki/</code> on the control plane node, or in the <code>kube-system</code> namespace as a secret, depending on how the cluster was created.</p>
<p>On kind clusters, you can copy the CA cert and key directly from the control plane container:</p>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>Whoever holds the CA key can issue certificates for any username and any group, including <code>system:masters</code>. This makes the CA key the most sensitive secret in a Kubernetes cluster. Guard it accordingly.</p>
<h3 id="heading-the-limits-of-certificate-based-auth">The Limits of Certificate-Based Auth</h3>
<p>Client certificates work, but they have two fundamental problems that make them a poor choice for human users in production.</p>
<p>The first is that <strong>Kubernetes doesn't check certificate revocation lists (CRLs)</strong>. If a developer's kubeconfig is stolen, the embedded certificate remains valid until it expires — which is typically one year in most Kubernetes setups. There's no way to immediately invalidate it. You can't "log out" a certificate. The only mitigation is to rotate the entire cluster CA, which invalidates every certificate including those belonging to other legitimate users.</p>
<p>The second is <strong>operational overhead</strong>. Certificates must be generated, distributed to users, and rotated before expiry. There's no self-service. In a team of ten engineers, managing certificates is annoying. In a team of a hundred, it's a full-time job.</p>
<p>For human access in production, OIDC is the right answer: short-lived tokens issued by a trusted identity provider, with a central revocation mechanism, and a standard browser-based login flow. Certificates are fine for machine identities and automation, where issuance and rotation can be handled programmatically.</p>
<p>That said, understanding certificates isn't optional. Your kubeconfig uses one. Your CI system probably does too. And cert-based auth is what you fall back to when everything else breaks.</p>
<h2 id="heading-demo-1-create-and-use-an-x509-client-certificate">Demo 1 — Create and Use an x509 Client Certificate</h2>
<p>In this section, you'll generate a user certificate signed by the cluster CA, bind it to an RBAC role, and use it to authenticate to the cluster as a different user.</p>
<p><strong>This guide is for local development and learning only.</strong> Manually signing certificates with the cluster CA and storing keys on disk is done here for simplicity.</p>
<p>In production, you should use the Kubernetes CertificateSigningRequest API or cert-manager for certificate issuance, enforce short-lived certificates with automatic rotation, and store private keys in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or hardware security module (HSM) — never distribute the cluster CA key.</p>
<h3 id="heading-step-1-copy-the-ca-cert-and-key-from-the-kind-control-plane">Step 1: Copy the CA cert and key from the kind control plane</h3>
<pre><code class="language-bash">docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.crt ./ca.crt
docker cp k8s-security-control-plane:/etc/kubernetes/pki/ca.key ./ca.key
</code></pre>
<p>This will create two files in your current directory called <code>ca.crt</code> and <code>ca.key</code>.</p>
<h3 id="heading-step-2-generate-a-private-key-and-csr-for-a-new-user">Step 2: Generate a private key and CSR for a new user</h3>
<p>You're creating a certificate for a user named <code>jane</code> in the <code>engineering</code> group:</p>
<pre><code class="language-bash"># Generate the private key
openssl genrsa -out jane.key 2048

# Generate a Certificate Signing Request
# CN = username, O = group
openssl req -new \
  -key jane.key \
  -out jane.csr \
  -subj "/CN=jane/O=engineering"
</code></pre>
<h3 id="heading-step-3-sign-the-csr-with-the-cluster-ca">Step 3: Sign the CSR with the cluster CA</h3>
<pre><code class="language-bash">openssl x509 -req \
  -in jane.csr \
  -CA ca.crt \
  -CAkey ca.key \
  -CAcreateserial \
  -out jane.crt \
  -days 365
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Certificate request self-signature ok
subject=CN=jane, O=engineering
</code></pre>
<h3 id="heading-step-4-inspect-the-certificate">Step 4: Inspect the certificate</h3>
<p>Before using it, confirm the identity it carries:</p>
<pre><code class="language-bash">openssl x509 -in jane.crt -noout -subject -dates
</code></pre>
<pre><code class="language-plaintext">subject=CN=jane, O=engineering
notBefore=Mar 20 10:00:00 2024 GMT
notAfter=Mar 20 10:00:00 2025 GMT
</code></pre>
<p>One year from now, this certificate becomes invalid and must be replaced. There's no way to extend it — you have to issue a new one.</p>
<h3 id="heading-step-5-build-a-kubeconfig-entry-for-jane">Step 5: Build a kubeconfig entry for jane</h3>
<pre><code class="language-bash"># Get the cluster API server address from the current context
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')

# Create a kubeconfig for jane
kubectl config set-cluster k8s-security \
  --server=$APISERVER \
  --certificate-authority=ca.crt \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-credentials jane \
  --client-certificate=jane.crt \
  --client-key=jane.key \
  --embed-certs=true \
  --kubeconfig=jane.kubeconfig

kubectl config set-context jane@k8s-security \
  --cluster=k8s-security \
  --user=jane \
  --kubeconfig=jane.kubeconfig

kubectl config use-context jane@k8s-security \
  --kubeconfig=jane.kubeconfig
</code></pre>
<h3 id="heading-step-6-test-authentication-before-rbac">Step 6: Test authentication — before RBAC</h3>
<p>Try to list pods using jane's kubeconfig:</p>
<pre><code class="language-bash">kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">Error from server (Forbidden): pods is forbidden: User "jane" cannot list
resource "pods" in API group "" in the namespace "staging"
</code></pre>
<p>This is correct. Jane authenticated successfully — Kubernetes knows who she is. But she has no RBAC bindings, so every API call is denied. Authentication passed, but authorisation failed.</p>
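<p>You can confirm the authentication half on its own by asking the API server which identity it sees. This assumes a reasonably recent <code>kubectl</code> and cluster (the command is generally available from Kubernetes 1.28). The output should list <code>jane</code> as the username and <code>engineering</code> among the groups:</p>
<pre><code class="language-bash"># Ask the API server who it thinks this credential belongs to
kubectl auth whoami --kubeconfig=jane.kubeconfig
</code></pre>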
<h3 id="heading-step-7-grant-jane-access-with-rbac">Step 7: Grant jane access with RBAC</h3>
<p>RBAC bindings use the username exactly as it appears in the certificate's CN field. If you need a refresher on how Roles, ClusterRoles, and RoleBindings work, this handbook <a href="https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/">How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection</a> covers the full RBAC model. For now, a simple RoleBinding using the built-in <code>view</code> ClusterRole is enough:</p>
<pre><code class="language-yaml"># jane-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jane-reader
  namespace: staging
subjects:
  - kind: User
    name: jane          # matches the CN in the certificate
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f jane-rolebinding.yaml
kubectl get pods -n staging --kubeconfig=jane.kubeconfig
</code></pre>
<pre><code class="language-plaintext">No resources found in staging namespace.
</code></pre>
<p>No error — jane can now list pods in <code>staging</code>. She can't delete them, create them, or access other namespaces. The certificate got her in. RBAC determines what she can do.</p>
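<p>The mapping from certificate subject to Kubernetes identity can be sketched in a few lines of Python. This is an illustrative model, not the API server's code: <code>identity_from_subject</code> is a hypothetical helper, and the subject string with the <code>developers</code> organisation is an assumed example that is not part of this demo.</p>
<pre><code class="language-python">def identity_from_subject(subject):
    """Split an OpenSSL-style subject such as '/CN=jane/O=developers'.

    Simplified model: CN becomes the username, each O becomes a group.
    Real subjects can contain escaped characters that this ignores.
    """
    username = None
    groups = []
    for part in subject.strip("/").split("/"):
        key, _, value = part.partition("=")
        if key == "CN":
            username = value
        elif key == "O":
            groups.append(value)
    return username, groups


# Assumed example subject; jane's demo certificate may not carry an O field
print(identity_from_subject("/CN=jane/O=developers"))
</code></pre>
<p>The username that comes out of this step is the string a RoleBinding's <code>User</code> subject has to match, which is why the binding above uses <code>name: jane</code>.</p>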
<h2 id="heading-how-to-set-up-oidc-authentication">How to Set Up OIDC Authentication</h2>
<p>OpenID Connect is an identity layer on top of OAuth 2.0. It's how Kubernetes integrates with enterprise identity providers — Active Directory, Okta, Google Workspace, Keycloak, and any other provider that speaks OIDC. Understanding how Kubernetes uses it requires following the token from the user's browser to the API server's decision.</p>
<h3 id="heading-how-the-oidc-flow-works-in-kubernetes">How the OIDC Flow Works in Kubernetes</h3>
<p>When a developer runs <code>kubectl get pods</code> with OIDC configured, the following happens:</p>
<ol>
<li><p><code>kubectl</code> sees that the kubeconfig user is backed by a credential plugin and runs <code>kubelogin</code> to get a token</p>
</li>
<li><p>If <code>kubelogin</code> has no valid, unexpired OIDC token in its cache, it opens a browser window</p>
</li>
<li><p>The browser redirects to the OIDC provider (Dex, Okta, your corporate IdP)</p>
</li>
<li><p>The user logs in with their corporate credentials</p>
</li>
<li><p>The OIDC provider issues a signed JWT and returns it to kubelogin</p>
</li>
<li><p>kubelogin caches the token locally (under <code>~/.kube/cache/oidc-login/</code>) and returns it to <code>kubectl</code></p>
</li>
<li><p><code>kubectl</code> sends the token to the API server in an <code>Authorization: Bearer</code> header</p>
</li>
<li><p>The API server fetches the provider's public keys from its JWKS endpoint and verifies the token signature</p>
</li>
<li><p>If valid, the API server extracts the username and group claims from the token</p>
</li>
<li><p>RBAC takes over from there</p>
</li>
</ol>
<p>The Kubernetes API server does not contact the OIDC provider on every request. It fetches the provider's public keys only periodically and uses them to verify token signatures locally, which keeps OIDC authentication stateless and scalable.</p>
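<p>To make the "fetch keys occasionally, verify locally" idea concrete, here is a minimal Python sketch of a signing-key cache. It is a model under stated assumptions, not how the API server is implemented: <code>SigningKeyCache</code>, <code>fetch_keys</code>, and the one-hour refresh window are all hypothetical, and the key material is a placeholder string.</p>
<pre><code class="language-python">import time


class SigningKeyCache:
    """Hypothetical cache of an issuer's public signing keys, indexed by key ID."""

    def __init__(self, fetch_keys, refresh_seconds=3600, clock=time.time):
        self._fetch_keys = fetch_keys            # callable returning {kid: key}
        self._refresh_seconds = refresh_seconds  # assumed refresh interval
        self._clock = clock
        self._keys = {}
        self._window = None                      # time window of the last fetch

    def get(self, kid):
        window = int(self._clock() // self._refresh_seconds)
        if window != self._window or kid not in self._keys:
            # Refresh once per window, or when a token names an unknown key
            self._keys = self._fetch_keys()
            self._window = window
        return self._keys.get(kid)


fetches = []


def fake_fetch():
    """Stand-in for an HTTPS call to the provider's JWKS endpoint."""
    fetches.append(1)
    return {"key-1": "public-key-material"}


cache = SigningKeyCache(fake_fetch, clock=lambda: 10.0)
for _ in range(1000):        # a thousand requests inside one refresh window
    cache.get("key-1")
print(len(fetches))          # number of trips to the provider
</code></pre>
<p>In this model a thousand token checks cost a single key fetch, which is the property that lets the identity provider stay out of the request path.</p>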
<h3 id="heading-the-api-server-configuration">The API Server Configuration</h3>
<p>For OIDC to work, the API server needs to know where to find the identity provider and how to interpret the tokens it issues.</p>
<p>In Kubernetes v1.30+, this is configured through an <code>AuthenticationConfiguration</code> file passed via the <code>--authentication-config</code> flag. (In older versions, individual <code>--oidc-*</code> flags were used instead, but these were removed in v1.35.)</p>
<p>The <code>AuthenticationConfiguration</code> defines OIDC providers under the <code>jwt</code> key:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it does</th>
<th>Example</th>
</tr>
</thead>
<tbody><tr>
<td><code>issuer.url</code></td>
<td>The OIDC provider's base URL — must match the <code>iss</code> claim in the token</td>
<td><code>https://dex.example.com</code></td>
</tr>
<tr>
<td><code>issuer.audiences</code></td>
<td>The client IDs the token was issued for — must match the <code>aud</code> claim</td>
<td><code>["kubernetes"]</code></td>
</tr>
<tr>
<td><code>issuer.certificateAuthority</code></td>
<td>CA certificate to trust when contacting the OIDC provider (inlined PEM)</td>
<td><code>-----BEGIN CERTIFICATE-----...</code></td>
</tr>
<tr>
<td><code>claimMappings.username.claim</code></td>
<td>Which JWT claim to use as the Kubernetes username</td>
<td><code>email</code></td>
</tr>
<tr>
<td><code>claimMappings.groups.claim</code></td>
<td>Which JWT claim to use as the Kubernetes group list</td>
<td><code>groups</code></td>
</tr>
<tr>
<td><code>claimMappings.*.prefix</code></td>
<td>Prefix added to the claim value — set to <code>""</code> for no prefix</td>
<td><code>""</code></td>
</tr>
</tbody></table>
<p>On a kind cluster, the <code>--authentication-config</code> flag is set in the cluster configuration before creation, not after. You'll see this in the next demo.</p>
<h3 id="heading-jwt-claims-kubernetes-uses">JWT Claims Kubernetes Uses</h3>
<p>A JWT is a signed JSON object with three sections: a header, a payload, and a signature. The payload is a set of claims – key-value pairs that assert facts about the token. Kubernetes reads specific claims from the payload to build an identity.</p>
<p>The required claims are <code>iss</code> (the issuer URL, must match <code>issuer.url</code> in the <code>AuthenticationConfiguration</code>), <code>sub</code> (the subject, a unique identifier for the user), and <code>aud</code> (the audience, must match the <code>issuer.audiences</code> list). The <code>exp</code> claim (expiry time) is also required as the API server rejects expired tokens.</p>
<p>The most useful optional claim is <code>groups</code> (or whatever you configure via <code>claimMappings.groups.claim</code>). When this claim is present, Kubernetes can map OIDC group memberships directly to RBAC group bindings. A user in the <code>platform-engineers</code> group in your identity provider automatically gets the RBAC permissions you've bound to that group in Kubernetes — no manual user management required.</p>
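<p>The claim checks and mappings described above can be sketched as a single Python function. Treat it as a simplified model: signature verification is assumed to have happened already, and the <code>config</code> dictionary and its key names are hypothetical stand-ins for the fields of the <code>AuthenticationConfiguration</code>.</p>
<pre><code class="language-python">def identity_from_claims(claims, config, now):
    """Apply iss/aud/exp checks, then map claims to (username, groups).

    Simplified sketch: the JWT signature is assumed to be verified
    before this function is called.
    """
    if claims.get("iss") != config["issuer_url"]:
        raise ValueError("issuer mismatch")

    aud = claims.get("aud")
    token_audiences = {aud} if isinstance(aud, str) else set(aud or [])
    if token_audiences.isdisjoint(config["audiences"]):
        raise ValueError("audience mismatch")

    seconds_left = max(0, claims.get("exp", 0) - now)
    if seconds_left == 0:
        raise ValueError("token expired")

    username = config["username_prefix"] + claims[config["username_claim"]]
    groups = [config["groups_prefix"] + group
              for group in claims.get(config["groups_claim"], [])]
    return username, groups


config = {
    "issuer_url": "https://dex.example.com",
    "audiences": ["kubernetes"],
    "username_claim": "email",
    "username_prefix": "",
    "groups_claim": "groups",
    "groups_prefix": "",
}
claims = {
    "iss": "https://dex.example.com",
    "aud": "kubernetes",
    "exp": 2000,
    "email": "admin@example.com",
    "groups": ["platform-engineers"],
}
print(identity_from_claims(claims, config, now=1000))
</code></pre>
<p>The returned group names are what RBAC compares against <code>Group</code> subjects, so a non-empty prefix would have to appear in your bindings as well.</p>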
<h3 id="heading-how-kubelogin-works">How kubelogin Works</h3>
<p>kubelogin (also distributed as <code>kubectl oidc-login</code>) is a kubectl credential plugin. Instead of embedding a static certificate or token in your kubeconfig, you configure a credential plugin that runs a helper binary when <code>kubectl</code> needs a token.</p>
<p>When kubelogin is invoked, it checks its local token cache. If the cached token is still valid, it returns it immediately. If the token has expired, it initiates the OIDC authorization code flow — opens a browser, redirects to the identity provider, receives the token after login, caches it locally, and returns it to <code>kubectl</code>. The whole flow takes about five seconds when it triggers.</p>
<p>This means tokens are short-lived (typically an hour) and rotate automatically. If a developer's machine is compromised, the token expires on its own. There is no long-lived credential sitting in a file somewhere.</p>
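<p>The cache-then-login behaviour, and the hand-off back to <code>kubectl</code>, can be sketched as follows. This is a model of the idea rather than kubelogin's actual code: the cache dictionary layout, <code>get_token</code>, and the <code>login</code> callable are hypothetical. The <code>ExecCredential</code> shape is the JSON object that credential plugins print for <code>kubectl</code>.</p>
<pre><code class="language-python">import json
from datetime import datetime, timezone


def get_token(cache, now, login):
    """Return a cached ID token if it has time left, otherwise log in again.

    cache: dict that may hold 'id_token' and 'exp' (Unix seconds)
    login: callable standing in for the browser flow, returns (token, exp)
    """
    seconds_left = max(0, cache.get("exp", 0) - now)
    if seconds_left == 0:
        cache["id_token"], cache["exp"] = login()
    return cache["id_token"], cache["exp"]


def exec_credential(token, exp):
    """Build the ExecCredential JSON that a credential plugin prints for kubectl."""
    expiry = datetime.fromtimestamp(exp, tz=timezone.utc)
    return json.dumps({
        "apiVersion": "client.authentication.k8s.io/v1beta1",
        "kind": "ExecCredential",
        "status": {
            "token": token,
            "expirationTimestamp": expiry.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }, indent=2)


cache = {}
token, exp = get_token(cache, now=1000,
                       login=lambda: ("header.payload.signature", 4600))
print(exec_credential(token, exp))
</code></pre>
<p>Because the plugin reports an expiry time along with the token, <code>kubectl</code> knows when to ask the plugin again instead of reusing a stale credential.</p>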
<h2 id="heading-demo-2-configure-oidc-login-with-dex-and-kubelogin">Demo 2 — Configure OIDC Login with Dex and kubelogin</h2>
<p>In this section, you'll deploy Dex as a self-hosted OIDC provider, configure a kind cluster to trust it, and log in with a browser. Dex is a good demo vehicle because it runs inside the cluster and doesn't require a cloud account or an external service.</p>
<p><strong>This guide is for local development and learning only.</strong> Self-signed certificates, static passwords, and certs stored on disk are used here for simplicity.</p>
<p>In production, use a managed identity provider (Microsoft Entra ID, Google Workspace, Okta), automate certificate lifecycle with cert-manager, and store secrets in a secrets manager (HashiCorp Vault, AWS Secrets Manager) or inject them via a CSI driver. Never commit certificates or keys, and don't keep them as local files.</p>
<h3 id="heading-step-1-create-a-kind-cluster-with-oidc-authentication">Step 1: Create a kind cluster with OIDC authentication</h3>
<p>OIDC authentication for the API server must be configured at cluster creation time on Kind because the API server needs to know which identity provider to trust before it starts accepting requests.</p>
<p><strong>Note:</strong> Kubernetes v1.30+ deprecated the <code>--oidc-*</code> API server flags in favor of the structured <code>AuthenticationConfiguration</code> API (via <code>--authentication-config</code>). In v1.35+ the old flags are removed entirely. This guide uses the new approach.</p>
<p><strong>nip.io</strong> is a wildcard DNS service — <code>dex.127.0.0.1.nip.io</code> resolves to <code>127.0.0.1</code>. This lets us use a real hostname for TLS without editing <code>/etc/hosts</code>.</p>
<p>First, generate a self-signed CA and TLS certificate for Dex:</p>
<pre><code class="language-bash"># Generate a CA for Dex
openssl req -x509 -newkey rsa:4096 -keyout dex-ca.key \
  -out dex-ca.crt -days 365 -nodes \
  -subj "/CN=dex-ca"

# Generate a certificate for Dex signed by that CA
openssl req -newkey rsa:2048 -keyout dex.key \
  -out dex.csr -nodes \
  -subj "/CN=dex.127.0.0.1.nip.io"

openssl x509 -req -in dex.csr \
  -CA dex-ca.crt -CAkey dex-ca.key \
  -CAcreateserial -out dex.crt -days 365 \
  -extfile &lt;(printf "subjectAltName=DNS:dex.127.0.0.1.nip.io")
</code></pre>
<p>Next, generate the <code>AuthenticationConfiguration</code> file. This tells the API server how to validate JWTs — which issuer to trust (<code>url</code>), which audience to expect (<code>audiences</code>), and which JWT claims map to Kubernetes usernames and groups (<code>claimMappings</code>). The CA cert is inlined so the API server can verify Dex's TLS certificate when fetching signing keys:</p>
<pre><code class="language-bash">cat &gt; auth-config.yaml &lt;&lt;EOF
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
  - issuer:
      url: https://dex.127.0.0.1.nip.io:32000
      audiences:
        - kubernetes
      certificateAuthority: |
$(sed 's/^/        /' dex-ca.crt)
    claimMappings:
      username:
        claim: email
        prefix: ""
      groups:
        claim: groups
        prefix: ""
EOF
</code></pre>
<p>The <code>kind-oidc.yaml</code> config uses <code>extraPortMappings</code> to expose Dex's port to your browser, <code>extraMounts</code> to mount files from your machine into the Kind node, and <code>kubeadmConfigPatches</code> to pass <code>--authentication-config</code> to the API server:</p>
<pre><code class="language-yaml"># kind-oidc.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      # Forward port 32000 from the Docker container to localhost,
      # so your browser can reach Dex's login page
      - containerPort: 32000
        hostPort: 32000
        protocol: TCP
    extraMounts:
      # Bind-mount files from your machine into the Kind node's filesystem
      - hostPath: ./dex-ca.crt
        containerPath: /etc/ca-certificates/dex-ca.crt
        readOnly: true
      - hostPath: ./auth-config.yaml
        containerPath: /etc/kubernetes/auth-config.yaml
        readOnly: true
    kubeadmConfigPatches:
      # Patch the API server to enable OIDC authentication
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            # Tell the API server to load our AuthenticationConfiguration
            authentication-config: /etc/kubernetes/auth-config.yaml
          extraVolumes:
            # Mount files into the API server pod (it runs as a static pod,
            # so it needs explicit volume mounts even though files are on the node)
            - name: dex-ca
              hostPath: /etc/ca-certificates/dex-ca.crt
              mountPath: /etc/ca-certificates/dex-ca.crt
              readOnly: true
              pathType: File
            - name: auth-config
              hostPath: /etc/kubernetes/auth-config.yaml
              mountPath: /etc/kubernetes/auth-config.yaml
              readOnly: true
              pathType: File
</code></pre>
<p>Create the cluster:</p>
<pre><code class="language-bash">kind create cluster --name k8s-auth --config kind-oidc.yaml
</code></pre>
<h3 id="heading-step-2-deploy-dex">Step 2: Deploy Dex</h3>
<p>Dex is an OIDC-compliant identity provider that acts as a bridge between Kubernetes and upstream identity sources (LDAP, SAML, GitHub, and so on). In this demo it runs inside the cluster with a static password database — two hardcoded users you can log in as.</p>
<p>The API server doesn't talk to Dex on every request. It periodically fetches Dex's public signing keys over HTTPS, using the CA certificate you inlined in the <code>AuthenticationConfiguration</code> to trust Dex's TLS certificate, and then verifies the JWT signatures on Dex-issued tokens locally.</p>
<p>The deployment has four parts: a ConfigMap with Dex's configuration, a Deployment to run Dex, a NodePort Service to expose it on port 32000 (matching the issuer URL), and RBAC resources so Dex can store state using Kubernetes CRDs.</p>
<p>First, create the namespace and load the TLS certificate as a Kubernetes Secret. Dex needs this to serve HTTPS. Without it, your browser and the API server would refuse to connect:</p>
<pre><code class="language-bash">kubectl create namespace dex

kubectl create secret tls dex-tls \
  --cert=dex.crt \
  --key=dex.key \
  -n dex
</code></pre>
<p>Save the following as <code>dex-config.yaml</code>. This configures Dex with a static password connector — two hardcoded users for the demo:</p>
<pre><code class="language-yaml"># dex-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dex-config
  namespace: dex
data:
  config.yaml: |
    # issuer must exactly match the URL in your AuthenticationConfiguration
    issuer: https://dex.127.0.0.1.nip.io:32000

    # Dex stores refresh tokens and auth codes — here it uses Kubernetes CRDs
    storage:
      type: kubernetes
      config:
        inCluster: true

    # Dex's HTTPS listener — serves the login page and token endpoints
    web:
      https: 0.0.0.0:5556
      tlsCert: /etc/dex/tls/tls.crt
      tlsKey: /etc/dex/tls/tls.key

    # staticClients defines which applications can request tokens.
    # "kubernetes" is the client ID that kubelogin uses when authenticating
    staticClients:
      - id: kubernetes
        redirectURIs:
          - http://localhost:8000     # kubelogin listens here to receive the callback
        name: Kubernetes
        secret: kubernetes-secret     # shared secret between kubelogin and Dex

    # Two demo users with the password "password" (bcrypt-hashed).
    # In production, you'd connect Dex to LDAP, SAML, or a social login instead
    enablePasswordDB: true
    staticPasswords:
      - email: "jane@example.com"
        # bcrypt hash of "password" — generate your own with: htpasswd -bnBC 10 "" password
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "jane"
        userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
      - email: "admin@example.com"
        hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
        username: "admin"
        userID: "a8b53e13-7e8c-4f7b-9a33-6c2f4d8c6a1b"
        groups:
          - platform-engineers
</code></pre>
<p>Save the following as <code>dex-deployment.yaml</code>. This creates the Deployment, Service, ServiceAccount, and RBAC that Dex needs to run:</p>
<pre><code class="language-yaml"># dex-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dex
  namespace: dex
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dex
  template:
    metadata:
      labels:
        app: dex
    spec:
      serviceAccountName: dex
      containers:
        - name: dex
          # v2.45.0+ required — earlier versions don't include groups from staticPasswords in tokens
          image: ghcr.io/dexidp/dex:v2.45.0
          command: ["dex", "serve", "/etc/dex/cfg/config.yaml"]
          ports:
            - name: https
              containerPort: 5556
          volumeMounts:
            - name: config
              mountPath: /etc/dex/cfg
            - name: tls
              mountPath: /etc/dex/tls
      volumes:
        - name: config
          configMap:
            name: dex-config
        - name: tls
          secret:
            secretName: dex-tls
---
# NodePort Service — exposes Dex on port 32000 on the Kind node.
# Combined with extraPortMappings, this makes Dex reachable from your browser
apiVersion: v1
kind: Service
metadata:
  name: dex
  namespace: dex
spec:
  type: NodePort
  ports:
    - name: https
      port: 5556
      targetPort: 5556
      nodePort: 32000
  selector:
    app: dex
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dex
  namespace: dex
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dex
rules:
  - apiGroups: ["dex.coreos.com"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dex
subjects:
  - kind: ServiceAccount
    name: dex
    namespace: dex
roleRef:
  kind: ClusterRole
  name: dex
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f dex-config.yaml
kubectl apply -f dex-deployment.yaml
kubectl rollout status deployment/dex -n dex
</code></pre>
<h3 id="heading-step-3-install-kubelogin">Step 3: Install kubelogin</h3>
<pre><code class="language-bash"># macOS
brew install int128/kubelogin/kubelogin

# Linux
curl -LO https://github.com/int128/kubelogin/releases/latest/download/kubelogin_linux_amd64.zip
unzip -j kubelogin_linux_amd64.zip kubelogin -d /tmp
sudo mv /tmp/kubelogin /usr/local/bin/kubectl-oidc_login
rm kubelogin_linux_amd64.zip
</code></pre>
<p>Confirm it's installed:</p>
<pre><code class="language-bash">kubectl oidc-login --version
</code></pre>
<h3 id="heading-step-4-configure-a-kubeconfig-entry-for-oidc">Step 4: Configure a kubeconfig entry for OIDC</h3>
<p>This creates a new user and context in your kubeconfig. Instead of using a client certificate (like the default Kind admin), it tells kubectl to use kubelogin to get a token from Dex.</p>
<p>The <code>--oidc-extra-scope</code> flags are important: without <code>email</code> and <code>groups</code>, Dex won't include those claims in the JWT, and the API server won't know who you are or what groups you belong to.</p>
<pre><code class="language-bash">kubectl config set-credentials oidc-user \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl \
  --exec-arg=oidc-login \
  --exec-arg=get-token \
  --exec-arg=--oidc-issuer-url=https://dex.127.0.0.1.nip.io:32000 \
  --exec-arg=--oidc-client-id=kubernetes \
  --exec-arg=--oidc-client-secret=kubernetes-secret \
  --exec-arg=--oidc-extra-scope=email \
  --exec-arg=--oidc-extra-scope=groups \
  --exec-arg=--certificate-authority="$(pwd)/dex-ca.crt"

kubectl config set-context oidc@k8s-auth \
  --cluster=kind-k8s-auth \
  --user=oidc-user

kubectl config use-context oidc@k8s-auth
</code></pre>
<h3 id="heading-step-5-trigger-the-login-flow">Step 5: Trigger the login flow</h3>
<p>Jane has no RBAC permissions yet, so first grant her read access from the admin context:</p>
<pre><code class="language-bash">kubectl --context kind-k8s-auth create clusterrolebinding jane-view \
  --clusterrole=view --user=jane@example.com
</code></pre>
<p>The OIDC context is already active from Step 4, so run any command to trigger a login:</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Your browser opens and redirects to the Dex login page. Log in as <code>jane@example.com</code> with password <code>password</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/44fe0657-b383-4245-9e43-45daea7a3f4f.png" alt="dexidp login screen" style="display:block;margin:0 auto" width="866" height="549" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/4f77442a-3055-47fc-a141-8d881731a1f4.png" alt="dexidp grant access" style="display:block;margin:0 auto" width="925" height="512" loading="lazy">

<p>After login, the terminal completes:</p>
<pre><code class="language-plaintext">No resources found in default namespace.
</code></pre>
<p>The browser-based authentication worked. <code>kubectl</code> received the token from Dex and sent it to the API server. The API server verified the JWT signature against Dex's public signing keys (fetched over TLS that it trusts through the CA certificate in the <code>AuthenticationConfiguration</code>), extracted <code>jane@example.com</code> from the <code>email</code> claim, matched it against the RBAC binding, and authorized the request.</p>
<p>Without the <code>clusterrolebinding</code>, you would see <code>Error from server (Forbidden)</code>: authentication succeeds (the API server knows <em>who</em> you are) but authorization fails (jane has no permissions). This is the distinction between 401 Unauthorized, where the server can't establish your identity, and 403 Forbidden, where it knows who you are but won't allow the action.</p>
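<p>The two gates can be modelled in a few lines of Python. This is a toy sketch, not Kubernetes code: <code>handle_request</code>, the token table, and the permission set are hypothetical, and real RBAC evaluates Roles and bindings rather than a flat set of pairs.</p>
<pre><code class="language-python">def handle_request(token, known_tokens, permissions, verb, resource):
    """Model the two gates: authentication first, then authorization."""
    user = known_tokens.get(token)
    if user is None:
        return 401, "Unauthorized"    # the server cannot tell who you are
    if (verb, resource) not in permissions.get(user, set()):
        return 403, "Forbidden"       # it knows who you are, but denies the action
    return 200, "OK"


known_tokens = {"janes-token": "jane@example.com"}

# Before the ClusterRoleBinding: authenticated, but not authorized
print(handle_request("janes-token", known_tokens, {}, "list", "pods"))

# After a binding grants list on pods
granted = {"jane@example.com": {("list", "pods")}}
print(handle_request("janes-token", known_tokens, granted, "list", "pods"))

# A token the server cannot validate fails at the first gate
print(handle_request("garbage", known_tokens, granted, "list", "pods"))
</code></pre>
<p>The three calls return 403, 200, and 401 in that order, mirroring the states you can hit in this demo.</p>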
<h3 id="heading-step-6-inspect-the-jwt">Step 6: Inspect the JWT</h3>
<p>A JWT (JSON Web Token) is a signed JSON payload that contains claims about the user. kubelogin caches the token locally under <code>~/.kube/cache/oidc-login/</code> so you don't have to log in on every kubectl command.</p>
<p>List the directory to find the cached file:</p>
<pre><code class="language-bash">ls ~/.kube/cache/oidc-login/
</code></pre>
<p>Decode the JWT payload directly from the cache:</p>
<pre><code class="language-bash">cat ~/.kube/cache/oidc-login/$(ls ~/.kube/cache/oidc-login/ | grep -v lock | head -1) | \
  python3 -c "
import json, sys, base64
token = json.load(sys.stdin)['id_token'].split('.')[1]
token += '=' * (-len(token) % 4)  # pad to a multiple of 4 for base64
print(json.dumps(json.loads(base64.urlsafe_b64decode(token)), indent=2))
"
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-json">{
  "iss": "https://dex.127.0.0.1.nip.io:32000",
  "sub": "CiQwOGE4Njg0Yi1kYjg4LTRiNzMtOTBhOS0zY2QxNjYxZjU0NjYSBWxvY2Fs",
  "aud": "kubernetes",
  "exp": 1775307910,
  "iat": 1775221510,
  "email": "jane@example.com",
  "email_verified": true
}
</code></pre>
<p>The <code>email</code> claim becomes jane's Kubernetes username because the <code>AuthenticationConfiguration</code> maps <code>username.claim: email</code>. The <code>aud</code> matches the configured <code>audiences</code>. The <code>iss</code> matches the issuer <code>url</code>. This is how the API server validates the token without contacting Dex on every request: it verifies the signature with Dex's cached public signing keys, and the CA certificate is only needed to trust Dex's TLS endpoint when it fetches those keys.</p>
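<p>The <code>iat</code> and <code>exp</code> claims are Unix timestamps, so you can work out how long a token lives by subtracting them. The short Python sketch below uses the two values from the sample payload above. In that sample the difference is 86,400 seconds, so this particular token is valid for 24 hours.</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone

# iat and exp copied from the sample payload
claims = {"iat": 1775221510, "exp": 1775307910}

lifetime = timedelta(seconds=claims["exp"] - claims["iat"])
expires = datetime.fromtimestamp(claims["exp"], tz=timezone.utc)

print("lifetime:", lifetime)
print("expires (UTC):", expires.isoformat())
</code></pre>
<p>If that lifetime is longer than you are comfortable with, the place to shorten it is the identity provider's token expiry setting, not the cluster.</p>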
<h3 id="heading-step-7-map-oidc-groups-to-rbac">Step 7: Map OIDC groups to RBAC</h3>
<p>The <code>admin@example.com</code> user has a <code>groups</code> claim in the Dex config containing <code>platform-engineers</code>. Instead of creating individual RBAC bindings per user, you can bind permissions to a group — anyone whose JWT contains that group gets the permissions automatically:</p>
<pre><code class="language-yaml"># platform-engineers-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-engineers-admin
subjects:
  - kind: Group
    name: platform-engineers     # matches the groups claim in the JWT
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>You're currently logged in as <code>jane@example.com</code> via the OIDC context, but jane only has <code>view</code> permissions — she can't create cluster-wide RBAC bindings. Switch back to the admin context to apply this:</p>
<pre><code class="language-bash">kubectl config use-context kind-k8s-auth
kubectl apply -f platform-engineers-binding.yaml
kubectl config use-context oidc@k8s-auth
</code></pre>
<p>Now clear the cached token to log out of jane's session, then trigger a new login as <code>admin@example.com</code>:</p>
<pre><code class="language-bash"># Clear the cached token — this is how you "log out" with kubelogin
rm -rf ~/.kube/cache/oidc-login/

# This will open the browser again for a fresh login
kubectl get pods -n default
</code></pre>
<p>Log in as <code>admin@example.com</code> with password <code>password</code>. This time the JWT will contain <code>"groups": ["platform-engineers"]</code>, which matches the <code>ClusterRoleBinding</code> you just created. The admin user gets full cluster access — without ever being added to a kubeconfig by name.</p>
<p>You can verify by decoding the new token (Step 6) — the <code>groups</code> claim will be present:</p>
<pre><code class="language-json">{
  "email": "admin@example.com",
  "groups": ["platform-engineers"]
}
</code></pre>
<p>This is the real power of OIDC group claims: you manage group membership in your identity provider, and Kubernetes permissions follow automatically. Add someone to the <code>platform-engineers</code> group in Dex (or any upstream IdP), and they get cluster-admin access on their next login — no kubeconfig or RBAC changes needed.</p>
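<p>As a rough model of why a group binding covers users who were never named individually, here is a minimal Python sketch of subject matching. <code>binding_applies</code> is a hypothetical helper that only handles <code>User</code> and <code>Group</code> subjects. It is not the RBAC authorizer itself.</p>
<pre><code class="language-python">def binding_applies(subjects, username, groups):
    """Return True if any subject in a binding matches the identity."""
    for subject in subjects:
        if subject["kind"] == "User" and subject["name"] == username:
            return True
        if subject["kind"] == "Group" and subject["name"] in groups:
            return True
    return False


# Subjects from platform-engineers-binding.yaml
subjects = [{"kind": "Group", "name": "platform-engineers"}]

# admin's token carries the group, jane's does not
print(binding_applies(subjects, "admin@example.com", ["platform-engineers"]))
print(binding_applies(subjects, "jane@example.com", []))
</code></pre>
<p>The first call matches through the group and the second does not match at all, even though neither username appears in the binding.</p>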
<h2 id="heading-cloud-provider-authentication">Cloud Provider Authentication</h2>
<p>AWS, GCP, and Azure each give Kubernetes clusters a native authentication mechanism that ties into their IAM systems.</p>
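<p>The implementations differ in API surface, but they share the same underlying pattern: a short-lived, signed token issued by the provider's identity system is verified and mapped to a Kubernetes identity. GKE and AKS build this directly on OAuth 2.0 and OIDC, while EKS uses a webhook that checks the token against AWS STS. Once you understand how Dex works above, these are all variations on the same theme.</p>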
<h3 id="heading-aws-eks">AWS EKS</h3>
<p>EKS uses the <code>aws-iam-authenticator</code> to translate AWS IAM identities into Kubernetes identities. When you run <code>kubectl</code> against an EKS cluster, the AWS CLI generates a short-lived token signed with your IAM credentials. The API server passes this token to the aws-iam-authenticator webhook, which verifies it against AWS STS and returns the corresponding username and groups.</p>
<p>User access is controlled via the <code>aws-auth</code> ConfigMap in <code>kube-system</code>, which maps IAM role ARNs and IAM user ARNs to Kubernetes usernames and groups. A typical entry looks like this:</p>
<pre><code class="language-yaml"># In kube-system/aws-auth ConfigMap
mapRoles:
  - rolearn: arn:aws:iam::123456789012:role/platform-engineers
    username: platform-engineer:{{SessionName}}
    groups:
      - platform-engineers
</code></pre>
<p>AWS is migrating from the <code>aws-auth</code> ConfigMap to a newer Access Entries API, which manages the same mapping through the EKS API rather than a ConfigMap. The underlying authentication mechanism is the same.</p>
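<p>The lookup that a <code>mapRoles</code> entry drives can be sketched in Python. This is a simplified model and not the aws-iam-authenticator source: <code>map_assumed_role</code> is a hypothetical helper, the account ID and session name are made up, and only assumed-role ARNs without an IAM path are handled.</p>
<pre><code class="language-python">def map_assumed_role(caller_arn, map_roles):
    """Map an STS assumed-role ARN to (username, groups) using mapRoles entries.

    Simplified: handles only 'assumed-role' ARNs without an IAM path.
    """
    prefix, _, resource = caller_arn.partition(":assumed-role/")
    role_name, _, session_name = resource.partition("/")
    account_id = prefix.split(":")[-1]
    role_arn = "arn:aws:iam::" + account_id + ":role/" + role_name

    for entry in map_roles:
        if entry["rolearn"] == role_arn:
            username = entry["username"].replace("{{SessionName}}", session_name)
            return username, entry["groups"]
    return None


map_roles = [{
    "rolearn": "arn:aws:iam::123456789012:role/platform-engineers",
    "username": "platform-engineer:{{SessionName}}",
    "groups": ["platform-engineers"],
}]

# Hypothetical caller identity, as STS would report it for an assumed role
caller = "arn:aws:sts::123456789012:assumed-role/platform-engineers/alice"
print(map_assumed_role(caller, map_roles))
</code></pre>
<p>The resulting username and groups are ordinary Kubernetes identities, so from here on RBAC works the same way as in the Dex demo.</p>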
<h3 id="heading-google-gke">Google GKE</h3>
<p>GKE integrates with Google Cloud IAM using two different mechanisms, depending on whether you're authenticating as a human user or as a workload.</p>
<p>For human users, GKE accepts standard Google OAuth2 tokens. Running <code>gcloud container clusters get-credentials</code> writes a kubeconfig that uses the <code>gcloud</code> CLI as a credential plugin, generating short-lived tokens from your Google account automatically.</p>
<p>For pod-level identity (letting a pod act as a Google Cloud service account), GKE uses Workload Identity. You annotate a Kubernetes service account to bind it to a Google Service Account (GSA), and grant the Kubernetes service account the <code>roles/iam.workloadIdentityUser</code> role on that GSA. Pods running as that service account can then call Google Cloud APIs using the GSA's permissions:</p>
<pre><code class="language-bash"># Bind a Kubernetes SA to a Google Service Account
kubectl annotate serviceaccount my-app \
  --namespace production \
  iam.gke.io/gcp-service-account=my-app@my-project.iam.gserviceaccount.com
</code></pre>
<h3 id="heading-azure-aks">Azure AKS</h3>
<p>AKS integrates with Microsoft Entra ID (formerly Azure Active Directory, or Azure AD). When the integration is enabled, <code>kubectl</code> requests a token on behalf of the user via the Azure CLI, and the AKS API server validates it against Azure AD.</p>
<p>For pod-level identity, AKS uses Azure Workload Identity, which follows the same OIDC federation pattern as GKE Workload Identity. A Kubernetes service account is annotated with an Azure Managed Identity client ID, and pods can request Azure AD tokens without storing any credentials:</p>
<pre><code class="language-bash"># Annotate a service account with the Azure Managed Identity client ID
kubectl annotate serviceaccount my-app \
  --namespace production \
  azure.workload.identity/client-id=&lt;MANAGED_IDENTITY_CLIENT_ID&gt;
</code></pre>
<p>The underlying pattern across all three providers is the same: a short-lived token is issued by the cloud provider's identity system, verified on behalf of the Kubernetes API server, and mapped to an identity through a binding (the <code>aws-auth</code> ConfigMap, a GKE Workload Identity binding, or an AKS federated identity credential). The OIDC section in this article is the conceptual foundation for all of them.</p>
<h2 id="heading-webhook-token-authentication">Webhook Token Authentication</h2>
<p>Webhook token authentication is worth knowing about because it appears in several common Kubernetes setups, even if you never configure it yourself.</p>
<p>When a request arrives with a bearer token that no other authenticator recognises, Kubernetes can send that token to an external HTTP endpoint for validation, wrapped in a <code>TokenReview</code> object. The endpoint returns a response indicating whether the token is valid and, if so, which user and groups it belongs to.</p>
<p>This is how EKS authentication works: the aws-iam-authenticator runs on the managed control plane as a webhook token authenticator. Bootstrap tokens, which are generated and embedded in the <code>kubeadm join</code> command, look similar from the outside, but a separate built-in authenticator validates them against Secrets in <code>kube-system</code> when the new node first contacts the API server.</p>
<p>For most clusters, you'll encounter webhook auth as something already running rather than something you configure. The main thing to know is that it exists and what it looks like when it appears in logs or configuration.</p>
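<p>If you do run into a webhook authenticator, the exchange is a JSON <code>TokenReview</code> object in each direction. The Python sketch below shows the shape of that exchange. <code>review_token</code>, the token table, and the <code>developers</code> group are hypothetical, and a real webhook would sit behind an HTTPS server that the API server is configured to trust.</p>
<pre><code class="language-python">import json


def review_token(token_review, lookup):
    """Answer a TokenReview request the way a webhook authenticator would.

    lookup: hypothetical callable mapping a bearer token to
            (username, groups), or None if the token is unknown.
    """
    identity = lookup(token_review["spec"]["token"])
    status = {"authenticated": False}
    if identity is not None:
        username, groups = identity
        status = {
            "authenticated": True,
            "user": {"username": username, "groups": groups},
        }
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "status": status,
    }


# What the API server posts to the webhook
request = {
    "apiVersion": "authentication.k8s.io/v1",
    "kind": "TokenReview",
    "spec": {"token": "opaque-bearer-token"},
}
tokens = {"opaque-bearer-token": ("jane@example.com", ["developers"])}
print(json.dumps(review_token(request, tokens.get), indent=2))
</code></pre>
<p>A response with <code>authenticated: false</code> means this authenticator could not vouch for the token. The username and groups in a successful response feed into RBAC exactly like certificate or OIDC identities.</p>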
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the OIDC demo cluster
kind delete cluster --name k8s-auth

# Remove generated certificate files
rm -f ca.crt ca.key jane.key jane.csr jane.crt jane.kubeconfig
rm -f dex-ca.crt dex-ca.key dex.crt dex.key dex.csr dex-ca.srl auth-config.yaml

# Remove the kubelogin token cache
rm -rf ~/.kube/cache/oidc-login/
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Kubernetes authentication is not a single mechanism — it's a chain of pluggable strategies, each one suited to different use cases. In this article you worked through the most important ones.</p>
<p>x509 client certificates are how Kubernetes works out of the box. The CN field becomes the username, the O field becomes the group, and the cluster CA is the trust anchor. You created a certificate for a new user, bound it to RBAC, and saw exactly how authentication and authorisation interact — authentication gets you in, RBAC determines what you can do.</p>
<p>You also saw the fundamental limitation: Kubernetes doesn't check certificate revocation lists, so a compromised certificate remains valid until it expires. This makes certificates a poor fit for human users in production environments.</p>
<p>OIDC is the production-grade answer. Tokens are short-lived, issued by a trusted identity provider, and map directly to Kubernetes groups through JWT claims. You deployed Dex as a self-hosted OIDC provider, configured the API server to trust it, and set up kubelogin for browser-based authentication.</p>
<p>You then decoded a JWT to see exactly what the API server reads from it, and mapped an OIDC group claim to a Kubernetes ClusterRoleBinding.</p>
<p>Cloud provider authentication on EKS, GKE, and AKS follows the same token-based pattern with provider-specific wrappers. Understanding how Dex works makes each of those systems much easier to read.</p>
<p>All YAML, certificates, and configuration files from this article are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/k8/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Secure a Kubernetes Cluster: RBAC, Pod Hardening, and Runtime Protection ]]>
                </title>
                <description>
                    <![CDATA[ In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it. An attacker had found it, deployed pods inside T ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-secure-a-kubernetes-cluster-handbook/</link>
                <guid isPermaLink="false">69c4112310e664c5dac43f41</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Wed, 25 Mar 2026 16:45:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4039b7a4-bb45-4df5-b13b-7414985c1a7e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In 2018, RedLock's cloud security research team discovered that Tesla's Kubernetes dashboard was exposed to the public internet with no password on it.</p>
<p>An attacker had found it, deployed pods inside Tesla's cluster, and was using them to mine cryptocurrency – all on Tesla's AWS bill. The cluster had no authentication on the dashboard, no network restrictions on egress, and nothing monitoring for intrusion. Any one of those controls would have stopped the attack. None of them were in place.</p>
<p>This wasn't a sophisticated zero-day exploit. It was a misconfigured default.</p>
<p>Kubernetes ships with powerful security primitives. The problem is that almost none of them are enabled by default. A fresh cluster is deliberately permissive so it's easy to get started. That permissiveness is a feature in development. In production, it's a liability.</p>
<p>In this handbook, we'll work through the three most impactful security layers in Kubernetes. We'll start with Role-Based Access Control, which governs who can do what to which resources in the API. From there we'll move to pod runtime security, which locks down what containers can actually do once they're running on a node. Finally we'll deploy Falco, a syscall-level detection engine that watches for attacks in progress and alerts in real time.</p>
<p>By the end, you'll have a hardened cluster with working RBAC policies, enforced pod security standards, and live detection rules that fire when something suspicious happens.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><code>kubectl</code> installed and configured</p>
</li>
<li><p>Docker Desktop or a Linux machine (to run kind)</p>
</li>
<li><p>Basic Kubernetes familiarity – you know what a Pod, Deployment, and Namespace are</p>
</li>
<li><p>No prior security experience needed</p>
</li>
</ul>
<p>All demos run on a local kind cluster. Full YAML and setup scripts are in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/security">companion GitHub repository</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-kubernetes-threat-landscape">The Kubernetes Threat Landscape</a></p>
</li>
<li><p><a href="#heading-what-youll-build">What You'll Build</a></p>
</li>
<li><p><a href="#heading-demo-1-run-a-cluster-security-baseline-with-kube-bench">Demo 1: Run a Cluster Security Baseline with kube-bench</a></p>
</li>
<li><p><a href="#heading-how-to-configure-rbac">How to Configure RBAC</a></p>
<ul>
<li><p><a href="#heading-the-four-rbac-objects">The Four RBAC Objects</a></p>
</li>
<li><p><a href="#heading-how-to-discover-resources-verbs-and-api-groups">How to Discover Resources, Verbs, and API Groups</a></p>
</li>
<li><p><a href="#heading-roles-and-clusterroles">Roles and ClusterRoles</a></p>
</li>
<li><p><a href="#heading-rolebindings-and-clusterrolebindings">RoleBindings and ClusterRoleBindings</a></p>
</li>
<li><p><a href="#heading-how-to-use-service-accounts-safely">How to Use Service Accounts Safely</a></p>
</li>
<li><p><a href="#heading-how-to-audit-your-rbac-configuration">How to Audit Your RBAC Configuration</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-2-build-a-least-privilege-rbac-policy-for-a-ci-pipeline">Demo 2 – Build a Least-Privilege RBAC Policy for a CI Pipeline</a></p>
</li>
<li><p><a href="#heading-demo-3--audit-rbac-with-rakkess-and-rbac-lookup">Demo 3 — Audit RBAC with rakkess and rbac-lookup</a></p>
</li>
<li><p><a href="#heading-how-to-harden-pod-runtime-security">How to Harden Pod Runtime Security</a></p>
<ul>
<li><p><a href="#heading-pod-security-admission">Pod Security Admission</a></p>
</li>
<li><p><a href="#heading-how-to-configure-securitycontext">How to Configure securityContext</a></p>
</li>
<li><p><a href="#heading-opagatekeeper-vs-kyverno">OPA/Gatekeeper vs Kyverno</a></p>
</li>
<li><p><a href="#heading-how-to-detect-runtime-threats-with-falco">How to Detect Runtime Threats with Falco</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-demo-4--harden-a-pod-with-securitycontext">Demo 4 — Harden a Pod with securityContext</a></p>
</li>
<li><p><a href="#heading-demo-5--deploy-falco-and-write-a-custom-detection-rule">Demo 5 — Deploy Falco and Write a Custom Detection Rule</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-kubernetes-threat-landscape">The Kubernetes Threat Landscape</h2>
<p>To understand what you're defending against, you need to understand where Kubernetes exposes attack surface. There are six main areas, and most production incidents trace back to at least one of them.</p>
<p>The <strong>API server</strong> is the front door to your cluster. Every <code>kubectl</code> command, every CI deploy, and every controller reconciliation loop sends requests here. Unauthenticated or over-privileged access to the API server is effectively game over: an attacker who can talk to it can create pods, read secrets, and modify workloads freely.</p>
<p><strong>etcd</strong> is the key-value store where all cluster state lives, including your Secrets. Kubernetes Secrets are base64-encoded by default, not encrypted. Anyone with direct access to etcd can read every password, token, and certificate in the cluster without going through the API server at all.</p>
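<p>To see why base64 offers no protection, here is a minimal sketch that encodes a value and then decodes it with no key involved. The secret value is made up for illustration.</p>
<pre><code class="language-bash"># Illustration only: base64 is an encoding, not encryption.
secret_plain='s3cr3t-password'   # hypothetical secret value

# Encode the value the way a Secret manifest's data field stores it
secret_encoded="$(printf '%s' "$secret_plain" | base64)"
echo "stored:  $secret_encoded"

# Anyone who can read the stored value can reverse it without a key
secret_decoded="$(printf '%s' "$secret_encoded" | base64 --decode)"
echo "decoded: $secret_decoded"
</code></pre>
<p>The decoded value matches the original exactly, which is why read access to etcd is equivalent to read access to every Secret.</p>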
<p>The <strong>kubelet</strong> runs on each node and manages the pods assigned to it. If its API is reachable without authentication – which is the default on older clusters – an attacker can exec into any pod on that node and read its memory without ever touching the API server.</p>
<p>The <strong>container runtime</strong> is the layer that actually runs your containers. A container that escapes its isolation boundary lands directly in the host OS. A privileged container with <code>hostPID: true</code> can read the memory of every other process on the node, including other containers.</p>
<p>Your <strong>supply chain</strong> (base images, third-party dependencies, Helm charts, operators) is a potential entry point at every step. The XZ Utils backdoor discovered in 2024 showed how close a well-positioned supply chain attack can come to widespread infrastructure compromise.</p>
<p>Finally, the <strong>network</strong>: by default, every pod in a Kubernetes cluster can reach every other pod on any port. There are no internal firewalls between workloads unless you explicitly create them with NetworkPolicy.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f2a6b76d7d55f162b5da2ee/2e49d975-4f69-4d14-9646-76c6ec377115.png" alt="Kubernetes threat landscape" style="display:block;margin:0 auto" width="4079" height="980" loading="lazy">
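<p>To make the network point concrete, here is a minimal sketch of a default-deny NetworkPolicy. It assumes a namespace called <code>staging</code> and a policy name chosen for illustration. NetworkPolicy is only enforced when the cluster's CNI plugin supports it, so treat this as a starting point rather than a guarantee.</p>
<pre><code class="language-yaml"># default-deny.yaml (illustrative: namespace and policy name are assumptions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: staging
spec:
  podSelector: {}        # empty selector = every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all inbound traffic is denied
    - Egress             # no egress rules listed, so all outbound traffic is denied
</code></pre>
<p>With a policy like this in place, you add narrower policies that allow only the traffic each workload actually needs.</p>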

<h3 id="heading-real-world-breaches">Real-World Breaches</h3>
<p>These three incidents are worth understanding before you write a single line of YAML. They're not theoretical – they're documented post-mortems from real production clusters.</p>
<table>
<thead>
<tr>
<th>Incident</th>
<th>Year</th>
<th>Root cause</th>
<th>What was missing</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tesla cryptomining</strong></td>
<td>2018</td>
<td>Kubernetes dashboard exposed with no authentication; unrestricted egress</td>
<td>RBAC on the dashboard endpoint + default-deny NetworkPolicy</td>
</tr>
<tr>
<td><strong>Capital One data breach</strong></td>
<td>2019</td>
<td>SSRF vulnerability in a WAF let an attacker reach the EC2 metadata API, which returned credentials for an over-privileged IAM role</td>
<td>Pod-level IAM restrictions (IRSA) + blocking metadata API egress</td>
</tr>
<tr>
<td><strong>Shopify bug bounty (Kubernetes)</strong></td>
<td>2021</td>
<td>A researcher accessed internal Kubernetes metadata through a misconfigured internal service, exposing pod environment variables containing secrets</td>
<td>Secret management outside environment variables + network segmentation</td>
</tr>
</tbody></table>
<p>The pattern across all three: not zero-day exploits, but misconfigured defaults and missing controls that should have been standard practice.</p>
<p>This article addresses the RBAC and pod security gaps directly.</p>
<h2 id="heading-what-youll-build">What You'll Build</h2>
<p>Before the first command, here is the security posture you'll have by the end of this article:</p>
<p>You'll start by running kube-bench to get a CIS Benchmark baseline – a concrete score showing where a default cluster stands before any hardening. From there you'll build a least-privilege RBAC policy for a CI pipeline service account and verify its permission boundaries, then audit the full cluster to confirm no over-privileged accounts exist.</p>
<p>On the pod security side, you'll enforce the <code>restricted</code> Pod Security Admission profile on your workload namespace and apply a hardened <code>securityContext</code> to a deployment: non-root user, read-only root filesystem, dropped capabilities, and seccomp profile. To close out, you'll deploy Falco in eBPF mode with a custom detection rule that fires when suspicious tools are run inside a container.</p>
<p>Start to finish, with a kind cluster already running, the demos take about 45–60 minutes.</p>
<h2 id="heading-demo-1-run-a-cluster-security-baseline-with-kube-bench">Demo 1: Run a Cluster Security Baseline with kube-bench</h2>
<p>Before hardening anything, it's a good idea to measure where you are. <a href="https://github.com/aquasecurity/kube-bench">kube-bench</a> runs the CIS Kubernetes Benchmark against your cluster and reports which checks pass and which fail. A baseline run gives you a concrete picture of your cluster's default security posture – and a reference point you can re-run after applying any hardening changes.</p>
<h3 id="heading-step-1-create-a-kind-cluster">Step 1: Create a kind cluster</h3>
<p>Save the following as <code>kind-config.yaml</code>:</p>
<pre><code class="language-yaml"># kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
</code></pre>
<pre><code class="language-bash">kind create cluster --name k8s-security --config kind-config.yaml
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Creating cluster "k8s-security" ...
 ✓ Ensuring node image (kindest/node:v1.29.0) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✓ Joining worker nodes 🚜
Set kubectl context to "kind-k8s-security"
</code></pre>
<h3 id="heading-step-2-run-kube-bench">Step 2: Run kube-bench</h3>
<p>kube-bench runs as a Job inside the cluster, mounting the host filesystem to inspect Kubernetes configuration files and processes:</p>
<pre><code class="language-bash">kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench
</code></pre>
<p>The output is long. Scroll to the summary at the bottom:</p>
<pre><code class="language-plaintext">== Summary master ==
0 checks PASS
11 checks FAIL
 9 checks WARN
 0 checks INFO

== Summary node ==
17 checks PASS
 2 checks FAIL
40 checks WARN
 0 checks INFO
</code></pre>
<p>In this run, 13 checks fail: 11 on the control plane and 2 on the nodes. Your exact numbers may differ. Three of the most important failures explain why defaults are a problem:</p>
<table>
<thead>
<tr>
<th>Check ID</th>
<th>Description</th>
<th>Why it matters</th>
</tr>
</thead>
<tbody><tr>
<td><strong>1.2.1</strong></td>
<td><code>--anonymous-auth</code> is not set to false on the API server</td>
<td>Anonymous requests can reach the API server without authentication – the same class of unauthenticated access that exposed the Tesla dashboard</td>
</tr>
<tr>
<td><strong>1.2.6</strong></td>
<td><code>--kubelet-certificate-authority</code> is not set</td>
<td>The API server cannot verify kubelet identity, enabling man-in-the-middle attacks between the control plane and nodes</td>
</tr>
<tr>
<td><strong>4.2.6</strong></td>
<td><code>--protect-kernel-defaults</code> is not set on the kubelet</td>
<td>Without it, the kubelet changes kernel tunables to match its own defaults instead of failing when they differ, which can silently undo node-level kernel hardening</td>
</tr>
</tbody></table>
<p><strong>Note:</strong> Some kube-bench findings are expected on kind because kind is a development tool, not a production-hardened environment. The important thing is to understand what each finding means and whether it applies to your target production setup.</p>
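<p>Because the full report is long, it can help to reduce it to the failing checks only. The helper below is a small sketch, not part of kube-bench. It assumes each result line starts with a status tag such as <code>[FAIL]</code>, and the function names are made up for this example.</p>
<pre><code class="language-bash"># Keep only the failing checks from kube-bench output read on stdin.
# Assumes result lines begin with a status tag such as [PASS], [FAIL] or [WARN].
kube_bench_failures() {
  grep -E '^\[FAIL\]' || true   # "|| true" keeps the exit status clean when nothing fails
}

# Count the failing checks, for example to compare runs before and after hardening.
kube_bench_fail_count() {
  kube_bench_failures | wc -l | tr -d ' '
}

# Usage (run while the Job still exists):
#   kubectl logs job/kube-bench | kube_bench_failures
#   kubectl logs job/kube-bench | kube_bench_fail_count
</code></pre>
<p>Saving the filtered list to a file gives you something concrete to diff against after each hardening step.</p>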
<p>Delete the Job when you're done:</p>
<pre><code class="language-bash">kubectl delete job kube-bench
</code></pre>
<p>Now that you have a baseline, you know what you're starting from. The next step is to work through the most impactful control on that list: access control. RBAC governs every interaction with the Kubernetes API, and getting it right is the foundation everything else builds on.</p>
<h2 id="heading-how-to-configure-rbac">How to Configure RBAC</h2>
<p>Role-Based Access Control is the authorisation layer in Kubernetes. Every request that reaches the API server – from <code>kubectl</code>, from a pod, from a controller – is checked against RBAC rules after authentication succeeds. If there is no rule that explicitly allows the action, Kubernetes denies it.</p>
<p>The key word is "explicitly". RBAC in Kubernetes is additive only. There is no <code>deny</code> rule. You grant access by creating rules, and you remove access by deleting them. This makes the mental model clean: if a subject can do something, you gave it permission to do that thing.</p>
<h3 id="heading-a-brief-case-study-the-shopify-kubernetes-misconfiguration">A Brief Case Study: The Shopify Kubernetes Misconfiguration</h3>
<p>In 2021, security researcher Silas Cutler discovered that a Shopify internal service exposed Kubernetes metadata through an SSRF vulnerability. The metadata included pod environment variables that contained secrets. The root cause was partly RBAC: the service's service account had broader cluster access than it needed, and there was no least-privilege review process.</p>
<p>Shopify paid a $25,000 bug bounty and fixed the issue. The lesson is straightforward: a service account should only have the permissions it needs to do its specific job. Nothing more.</p>
<p>This is the principle you'll apply in Demo 2.</p>
<h3 id="heading-the-four-rbac-objects">The Four RBAC Objects</h3>
<p>RBAC in Kubernetes is built from four API objects. Two define permissions, two bind those permissions to subjects:</p>
<table>
<thead>
<tr>
<th>Object</th>
<th>Scope</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>Role</code></td>
<td>Namespace</td>
<td>Defines a set of permissions within one namespace</td>
</tr>
<tr>
<td><code>ClusterRole</code></td>
<td>Cluster-wide</td>
<td>Defines permissions across all namespaces, or for cluster-scoped resources like Nodes</td>
</tr>
<tr>
<td><code>RoleBinding</code></td>
<td>Namespace</td>
<td>Grants the permissions of a Role or ClusterRole to a subject, within one namespace</td>
</tr>
<tr>
<td><code>ClusterRoleBinding</code></td>
<td>Cluster-wide</td>
<td>Grants the permissions of a ClusterRole to a subject across the entire cluster</td>
</tr>
</tbody></table>
<p>A <strong>subject</strong> is a user, a group, or a service account. Users and groups come from your authentication layer – client certificates, OIDC tokens, or cloud provider identity. Service accounts are Kubernetes-native identities created for pods.</p>
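<p>As a rough illustration of how the three subject kinds look in a binding, the sketch below lists one of each. The user and group names are placeholders that would come from your authentication layer, and the binding name is made up.</p>
<pre><code class="language-yaml"># Illustrative only: subject names are assumptions
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-subjects
  namespace: staging
subjects:
  - kind: User                       # e.g. the CN of a client certificate
    name: jane@example.com
    apiGroup: rbac.authorization.k8s.io
  - kind: Group                      # e.g. an OIDC groups claim
    name: platform-team
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount             # Kubernetes-native identity, needs a namespace
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: ClusterRole
  name: view                         # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Note that <code>User</code> and <code>Group</code> subjects carry an <code>apiGroup</code>, while a <code>ServiceAccount</code> subject carries a <code>namespace</code> instead.</p>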
<h3 id="heading-how-to-discover-resources-verbs-and-api-groups">How to Discover Resources, Verbs, and API Groups</h3>
<p>Before you can write a <code>Role</code>, you need to know three things: the resource name, the API group it belongs to, and the verbs it supports. You shouldn't have to guess any of them – <code>kubectl</code> can tell you everything.</p>
<h4 id="heading-list-all-available-resources-and-their-api-groups">List all available resources and their API groups</h4>
<pre><code class="language-bash">kubectl api-resources
</code></pre>
<p>Partial output:</p>
<pre><code class="language-plaintext">NAME                    SHORTNAMES  APIVERSION                     NAMESPACED  KIND
bindings                            v1                             true        Binding
configmaps              cm          v1                             true        ConfigMap
endpoints               ep          v1                             true        Endpoints
events                  ev          v1                             true        Event
namespaces              ns          v1                             false       Namespace
nodes                   no          v1                             false       Node
pods                    po          v1                             true        Pod
secrets                             v1                             true        Secret
serviceaccounts         sa          v1                             true        ServiceAccount
services                svc         v1                             true        Service
deployments             deploy      apps/v1                        true        Deployment
replicasets             rs          apps/v1                        true        ReplicaSet
statefulsets            sts         apps/v1                        true        StatefulSet
cronjobs                cj          batch/v1                       true        CronJob
jobs                                batch/v1                       true        Job
ingresses               ing         networking.k8s.io/v1           true        Ingress
networkpolicies         netpol      networking.k8s.io/v1           true        NetworkPolicy
clusterroles                        rbac.authorization.k8s.io/v1   false       ClusterRole
roles                               rbac.authorization.k8s.io/v1   true        Role
</code></pre>
<p>The <code>APIVERSION</code> column is where the <code>apiGroups</code> value comes from. Strip the version suffix and use only the group part:</p>
<table>
<thead>
<tr>
<th>APIVERSION in output</th>
<th>apiGroups value in Role</th>
</tr>
</thead>
<tbody><tr>
<td><code>v1</code></td>
<td><code>""</code> (empty string – the core group)</td>
</tr>
<tr>
<td><code>apps/v1</code></td>
<td><code>"apps"</code></td>
</tr>
<tr>
<td><code>batch/v1</code></td>
<td><code>"batch"</code></td>
</tr>
<tr>
<td><code>networking.k8s.io/v1</code></td>
<td><code>"networking.k8s.io"</code></td>
</tr>
<tr>
<td><code>rbac.authorization.k8s.io/v1</code></td>
<td><code>"rbac.authorization.k8s.io"</code></td>
</tr>
</tbody></table>
<p>The <code>NAMESPACED</code> column tells you whether to use a <code>Role</code> (namespaced resources) or a <code>ClusterRole</code> (non-namespaced resources like <code>nodes</code>).</p>
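<p>The stripping rule is mechanical, so it can be sketched as a small shell function. <code>api_group</code> is a hypothetical helper written for this example, not a kubectl feature.</p>
<pre><code class="language-bash"># Derive the RBAC apiGroups value from an APIVERSION string.
# "apps/v1" gives "apps"; a bare "v1" (the core group) gives an empty string.
api_group() {
  case "$1" in
    */*) printf '%s\n' "${1%/*}" ;;   # drop the trailing "/version" part
    *)   printf '\n' ;;               # no slash means the core group: ""
  esac
}

api_group "apps/v1"                 # prints: apps
api_group "networking.k8s.io/v1"    # prints: networking.k8s.io
api_group "v1"                      # prints an empty line (core group)
</code></pre>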
<h4 id="heading-filter-by-api-group">Filter by API group</h4>
<p>If you want to see only resources in a specific group, for example, everything in <code>apps</code>:</p>
<pre><code class="language-bash">kubectl api-resources --api-group=apps
</code></pre>
<pre><code class="language-plaintext">NAME                  SHORTNAMES  APIVERSION  NAMESPACED  KIND
controllerrevisions               apps/v1     true        ControllerRevision
daemonsets            ds          apps/v1     true        DaemonSet
deployments           deploy      apps/v1     true        Deployment
replicasets           rs          apps/v1     true        ReplicaSet
statefulsets          sts         apps/v1     true        StatefulSet
</code></pre>
<h4 id="heading-list-all-verbs-for-a-specific-resource">List all verbs for a specific resource</h4>
<p>Each resource supports a different set of verbs. To see exactly which verbs a resource supports, use <code>kubectl api-resources</code> with <code>-o wide</code> and look at the <code>VERBS</code> column:</p>
<pre><code class="language-bash">kubectl api-resources -o wide | grep -E "^NAME|^pods "
</code></pre>
<pre><code class="language-plaintext">NAME  SHORTNAMES  APIVERSION  NAMESPACED  KIND  VERBS
pods  po          v1          true        Pod   create,delete,deletecollection,get,list,patch,update,watch
</code></pre>
<p>To confirm the resource's kind, version, and fields, use <code>kubectl explain</code> (it doesn't list verbs):</p>
<pre><code class="language-bash">kubectl explain pod --api-version=v1 | head -10
</code></pre>
<p>The standard verbs you'll use in RBAC rules are:</p>
<table>
<thead>
<tr>
<th>Verb</th>
<th>What it allows</th>
</tr>
</thead>
<tbody><tr>
<td><code>get</code></td>
<td>Read a single named resource: <code>kubectl get pod my-pod</code></td>
</tr>
<tr>
<td><code>list</code></td>
<td>Read all resources of a type: <code>kubectl get pods</code></td>
</tr>
<tr>
<td><code>watch</code></td>
<td>Stream changes to resources: used by controllers and informers</td>
</tr>
<tr>
<td><code>create</code></td>
<td>Create a new resource</td>
</tr>
<tr>
<td><code>update</code></td>
<td>Replace an existing resource (<code>kubectl replace</code>)</td>
</tr>
<tr>
<td><code>patch</code></td>
<td>Partially modify a resource (<code>kubectl patch</code>, and <code>kubectl apply</code> on an existing object)</td>
</tr>
<tr>
<td><code>delete</code></td>
<td>Delete a single resource</td>
</tr>
<tr>
<td><code>deletecollection</code></td>
<td>Delete all resources of a type in a namespace</td>
</tr>
</tbody></table>
<p><code>exec</code>, <code>portforward</code>, <code>proxy</code>, and <code>log</code> are not verbs. They're pod subresources, written as <code>pods/exec</code>, <code>pods/portforward</code>, <code>pods/proxy</code>, and <code>pods/log</code> in the <code>resources</code> field and combined with one of the verbs above. For example, <code>kubectl logs</code> needs <code>get</code> on <code>pods/log</code>.</p>
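<p>Reading pod logs is authorised through a pod subresource rather than a dedicated verb. Here is a minimal sketch of a Role that allows reading pods and their logs and nothing else. The Role name is hypothetical.</p>
<pre><code class="language-yaml"># Illustrative only: the Role name is an assumption
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-log-reader
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods"]        # needed to find the pod in the first place
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/log"]    # subresource used by kubectl logs
    verbs: ["get"]
</code></pre>
<p>Leaving <code>pods/exec</code> out of the list means a subject bound to this Role can read logs but can't run commands inside the container.</p>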
<p><strong>Important:</strong> <code>get</code> and <code>list</code> are separate verbs. Granting <code>list</code> on <code>secrets</code> lets a subject enumerate every secret name and value in a namespace, even if you didn't also grant <code>get</code>. Always think about both when working with sensitive resources like <code>secrets</code>, <code>serviceaccounts</code>, and <code>configmaps</code>.</p>
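<p>One way to keep Secret access narrow is to grant <code>get</code> on specific names and leave <code>list</code> out entirely. The sketch below assumes a Secret called <code>registry-credentials</code>, and both names are placeholders.</p>
<pre><code class="language-yaml"># Illustrative only: Role and Secret names are assumptions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-one-secret
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["registry-credentials"]   # only this Secret is readable
    verbs: ["get"]                            # no list, so no enumeration
</code></pre>
<p>A subject bound to this Role can fetch that one Secret by name but can't discover or read any other Secret in the namespace.</p>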
<h4 id="heading-look-up-a-resources-group-with-kubectl-explain">Look up a resource's group with kubectl explain</h4>
<p>If you already know the resource name but aren't sure of its group, <code>kubectl explain</code> tells you:</p>
<pre><code class="language-bash">kubectl explain deployment
</code></pre>
<pre><code class="language-plaintext">GROUP:      apps
KIND:       Deployment
VERSION:    v1
...
</code></pre>
<pre><code class="language-bash">kubectl explain ingress
</code></pre>
<pre><code class="language-plaintext">GROUP:      networking.k8s.io
KIND:       Ingress
VERSION:    v1
...
</code></pre>
<p>This is the fastest way to look up the <code>apiGroups</code> value for any resource when writing a Role.</p>
<h4 id="heading-a-complete-lookup-workflow">A complete lookup workflow</h4>
<p>Here is the practical workflow when writing a new Role from scratch:</p>
<pre><code class="language-bash"># 1. Find the resource name and API group
kubectl api-resources | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment

# 2. Find the verbs it supports
kubectl api-resources -o wide | grep deployment

# Output:
# deployments   deploy   apps/v1   true   Deployment   create,delete,...,get,list,patch,update,watch

# 3. Write the Role using the group (strip the version) and the verbs you need
</code></pre>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: staging
rules:
  - apiGroups: ["apps"]       # from: apps/v1 → strip /v1
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>With this workflow, you never have to guess an API group or verb. You look it up, then write the minimal rule you need.</p>
<h3 id="heading-roles-and-clusterroles">Roles and ClusterRoles</h3>
<p>A <code>Role</code> defines which verbs are allowed on which resources. Here is a Role that grants read-only access to Pods and ConfigMaps inside the <code>staging</code> namespace:</p>
<pre><code class="language-yaml"># role-ci-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]          # "" = the core API group (Pods, Services, Secrets, ConfigMaps)
    resources: ["pods", "configmaps"]
    verbs: ["get", "list", "watch"]
</code></pre>
<p>The <code>apiGroups</code> field tells Kubernetes which API group owns the resource. The core group uses an empty string <code>""</code>. Apps-level resources like Deployments use <code>"apps"</code>. Resources in other named groups use that group's name, such as <code>"networking.k8s.io"</code> for Ingresses and NetworkPolicies, and custom resources use the group defined in their CRD.</p>
<p>A <code>ClusterRole</code> is structurally identical but omits the namespace and can reference cluster-scoped resources like Nodes and PersistentVolumes:</p>
<pre><code class="language-yaml"># clusterrole-node-reader.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader    # no namespace field
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
</code></pre>
<h4 id="heading-when-to-use-which">When to use which:</h4>
<p>Use a <code>Role</code> when the permission is specific to one namespace. A compromised service account can only affect that namespace: the blast radius is contained. Use a <code>ClusterRole</code> when you need access to cluster-scoped resources, or when you want a reusable permission template that multiple namespaces can share.</p>
<p>A common mistake is reaching for a <code>ClusterRole</code> "just to be safe" because it's easier to configure. Namespace-scoped <code>Roles</code> are almost always the right default.</p>
<h3 id="heading-rolebindings-and-clusterrolebindings">RoleBindings and ClusterRoleBindings</h3>
<p>A Role by itself does nothing. You need a binding to attach it to a subject. Here is a <code>RoleBinding</code> that grants the <code>ci-reader</code> Role to the <code>ci-pipeline</code> service account:</p>
<pre><code class="language-yaml"># rolebinding-ci.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline       # the service account name
    namespace: staging      # the namespace the SA lives in
roleRef:
  kind: Role
  name: ci-reader           # must match the Role name exactly
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>There is a useful pattern worth knowing: you can bind a <code>ClusterRole</code> using a <code>RoleBinding</code>. This creates namespace-scoped access using a reusable permission template. The <code>ClusterRole</code> defines the rules, while the <code>RoleBinding</code> constrains those rules to a single namespace.</p>
<pre><code class="language-yaml"># RoleBinding referencing a ClusterRole — scoped to one namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: ClusterRole          # ClusterRole, but bound to one namespace via RoleBinding
  name: view                 # Kubernetes built-in ClusterRole: read-only access to most resources
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>Kubernetes ships with several useful built-in ClusterRoles: <code>view</code> (read-only access to most resources), <code>edit</code> (read/write to most resources), <code>admin</code> (full namespace admin), and <code>cluster-admin</code> (full cluster admin). Use them rather than reinventing them.</p>
<h3 id="heading-how-to-use-service-accounts-safely">How to Use Service Accounts Safely</h3>
<p>Every pod in Kubernetes runs as a service account. If you don't specify one, Kubernetes uses the <code>default</code> service account in that namespace.</p>
<p>The default service account starts with no permissions – but it still has a token automatically mounted into every pod at <code>/var/run/secrets/kubernetes.io/serviceaccount/token</code>. This means every container in your cluster can authenticate to the API server by default, even if it has nothing useful to do there.</p>
<p>The single most impactful change you can make is to disable this automatic token mounting on service accounts that don't need API access:</p>
<pre><code class="language-yaml"># serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
automountServiceAccountToken: false   # no token mounted into pods by default
</code></pre>
<p>You can also control it at the pod level:</p>
<pre><code class="language-yaml">spec:
  automountServiceAccountToken: false   # override at pod level
  serviceAccountName: my-app
  containers:
    - name: app
      image: my-app:1.0
</code></pre>
<h4 id="heading-the-cluster-admin-anti-pattern">The cluster-admin anti-pattern:</h4>
<p>Never bind <code>cluster-admin</code> to a service account that runs in a pod. <code>cluster-admin</code> grants full read/write access to every resource in the cluster. An attacker who compromises a pod running as <code>cluster-admin</code> owns your cluster completely.</p>
<p>You will see this in Helm charts and tutorials because it "makes things work". It works because it disables the entire authorisation layer. That is not a solution – it's a ticking clock.</p>
<p>The Capital One breach is a direct example of this pattern at the cloud layer: an EC2 instance role had permissions far beyond what the application needed. The SSRF vulnerability was the initial foothold. The over-privileged role was what turned a minor bug into an $80 million fine.</p>
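<p>To help you recognise the anti-pattern when reviewing a chart, this is roughly what it looks like in YAML. The binding, service account, and namespace names are invented for illustration. Don't apply this.</p>
<pre><code class="language-yaml"># ANTI-PATTERN: shown for recognition only, do not apply
# All names below are assumptions made up for this example
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-operator-admin
subjects:
  - kind: ServiceAccount
    name: my-operator
    namespace: operators
roleRef:
  kind: ClusterRole
  name: cluster-admin          # full access to every resource in the cluster
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<p>When you find a binding like this, the usual fix is to replace it with a narrowly scoped <code>Role</code> and <code>RoleBinding</code> covering only the resources the workload touches.</p>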
<h3 id="heading-how-to-audit-your-rbac-configuration">How to Audit Your RBAC Configuration</h3>
<p>The <code>kubectl auth can-i</code> command lets you check permissions for any subject. Use <code>--as</code> to impersonate a service account:</p>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

# These should return 'yes'
kubectl auth can-i list pods        --namespace staging --as $SA
kubectl auth can-i get  configmaps  --namespace staging --as $SA

# These should return 'no'
kubectl auth can-i delete pods      --namespace staging --as $SA
kubectl auth can-i get  secrets     --namespace staging --as $SA
kubectl auth can-i list pods        --namespace production --as $SA
</code></pre>
<p>To list every permission a subject has in a namespace:</p>
<pre><code class="language-bash">kubectl auth can-i --list \
  --namespace staging \
  --as system:serviceaccount:staging:ci-pipeline
</code></pre>
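<p>The <code>--as</code> value for a service account always follows the format <code>system:serviceaccount:&lt;namespace&gt;:&lt;name&gt;</code>. If you run these checks often, a small wrapper can build that string for you. Both function names below are hypothetical helpers, sketched on the assumption that <code>kubectl</code> is pointed at the right cluster.</p>
<pre><code class="language-bash"># Build the impersonation subject string for a service account.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether a service account may perform an action.
# Arguments: namespace, service account name, verb, resource
check_sa() {
  local ns="$1" sa="$2" verb="$3" resource="$4"
  kubectl auth can-i "$verb" "$resource" \
    --namespace "$ns" \
    --as "$(sa_subject "$ns" "$sa")"
}

# Usage:
#   check_sa staging ci-pipeline list pods
#   check_sa staging ci-pipeline get secrets
</code></pre>
<p>Note that the wrapper checks the service account's own namespace. Testing cross-namespace access still needs a direct <code>kubectl auth can-i</code> call with a different <code>--namespace</code>.</p>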
<p>For a visual matrix across the whole cluster, install <a href="https://github.com/corneliusweig/rakkess">rakkess</a>, which is distributed through krew as the <code>access-matrix</code> plugin:</p>
<pre><code class="language-bash">kubectl krew install access-matrix

# Permission matrix for all service accounts in staging
kubectl access-matrix --namespace staging
</code></pre>
<p>Example output:</p>
<pre><code class="language-plaintext">NAME          GET  LIST  WATCH  CREATE  UPDATE  PATCH  DELETE
ci-pipeline    ✓    ✓     ✓      ✗       ✗       ✗      ✗
default        ✗    ✗     ✗      ✗       ✗       ✗      ✗
monitoring     ✓    ✓     ✓      ✗       ✗       ✗      ✗
</code></pre>
<p>If you see <code>✓</code> in the CREATE, UPDATE, PATCH, or DELETE columns for a service account that should only read, that's a finding that needs remediation.</p>
<p>⚠️ <strong>The wildcard danger:</strong> The most dangerous RBAC configuration is a wildcard on all three dimensions:</p>
<pre><code class="language-yaml">apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]
</code></pre>
<p>This is functionally identical to <code>cluster-admin</code>. You will find it in Helm charts for controllers installed with "convenience" permissions. Always audit third-party RBAC before installing operators into a production cluster.</p>
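<p>Before installing a third-party chart, a quick text scan of its rendered manifests can surface wildcard rules. The sketch below is a rough first pass, not a full RBAC parser: it only catches flow-style lists written on one line (such as <code>verbs: ["*"]</code>), and the function name <code>find_wildcard_rules</code> is illustrative rather than part of any tool.</p>
<pre><code class="language-bash"># Print (with line numbers) every apiGroups/resources/verbs line on stdin
# that contains a literal asterisk. Prints nothing when there is no match.
find_wildcard_rules() {
  grep -nE '(apiGroups|resources|verbs):.*\*' || true
}

# Example usage (release and chart names are placeholders):
#   helm template my-release my-repo/my-chart | find_wildcard_rules
</code></pre>
<p>Treat any hit as a prompt to read the surrounding Role or ClusterRole in full. A wildcard on verbs alone can be acceptable for a narrowly scoped resource, while wildcards on all three fields rarely are.</p>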
<h2 id="heading-demo-2-build-a-least-privilege-rbac-policy-for-a-ci-pipeline">Demo 2 – Build a Least-Privilege RBAC Policy for a CI Pipeline</h2>
<p>In this demo, you'll create a service account for a CI pipeline that can list pods and read configmaps in the <code>staging</code> namespace – and nothing else.</p>
<h3 id="heading-step-1-create-the-namespace-and-service-account">Step 1: Create the namespace and service account</h3>
<pre><code class="language-bash">kubectl create namespace staging
</code></pre>
<pre><code class="language-yaml"># ci-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-pipeline
  namespace: staging
automountServiceAccountToken: false
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-serviceaccount.yaml
</code></pre>
<h3 id="heading-step-2-create-the-role">Step 2: Create the Role</h3>
<pre><code class="language-yaml"># ci-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ci-reader
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-role.yaml
</code></pre>
<h3 id="heading-step-3-bind-the-role-to-the-service-account">Step 3: Bind the Role to the service account</h3>
<pre><code class="language-yaml"># ci-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-reader-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: ci-pipeline
    namespace: staging
roleRef:
  kind: Role
  name: ci-reader
  apiGroup: rbac.authorization.k8s.io
</code></pre>
<pre><code class="language-bash">kubectl apply -f ci-rolebinding.yaml
</code></pre>
<h3 id="heading-step-4-test-allowed-operations">Step 4: Test allowed operations</h3>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

kubectl auth can-i list pods       --namespace staging     --as $SA   # yes
kubectl auth can-i get  pods       --namespace staging     --as $SA   # yes
kubectl auth can-i list configmaps --namespace staging     --as $SA   # yes
</code></pre>
<h3 id="heading-step-5-test-denied-operations">Step 5: Test denied operations</h3>
<pre><code class="language-bash">kubectl auth can-i delete pods       --namespace staging     --as $SA   # no
kubectl auth can-i get  secrets      --namespace staging     --as $SA   # no
kubectl auth can-i list pods         --namespace production  --as $SA   # no
kubectl auth can-i create deployments --namespace staging    --as $SA   # no
</code></pre>
<p>All four should return <code>no</code>. Notice the third test: even though the <code>ci-reader</code> Role allows listing pods in <code>staging</code>, the service account cannot list them in <code>production</code>. A <code>RoleBinding</code> grants permissions only inside its own namespace. This is by design.</p>
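<p>The allowed and denied checks from Steps 4 and 5 can also be wrapped in a small helper so they run as a repeatable test, for example in CI. This is a minimal sketch: <code>expect_access</code> is a hypothetical helper name, and it assumes the first word printed by <code>kubectl auth can-i</code> is <code>yes</code> or <code>no</code>.</p>
<pre><code class="language-bash">SA="system:serviceaccount:staging:ci-pipeline"

# expect_access WANT VERB RESOURCE NAMESPACE
# WANT is "yes" or "no". Returns non-zero when the answer differs.
expect_access() {
  want=$1; verb=$2; resource=$3; ns=$4
  # can-i exits non-zero when the answer is "no", so ignore its exit status
  got=$(kubectl auth can-i "$verb" "$resource" --namespace "$ns" --as "$SA" || true)
  got=${got%% *}   # keep only the first word of the answer
  if [ "$got" = "$want" ]; then
    echo "PASS: $verb $resource in $ns = $got"
  else
    echo "FAIL: $verb $resource in $ns = $got (wanted $want)"
    return 1
  fi
}

# Example usage:
#   expect_access yes list   pods  staging
#   expect_access no  delete pods  staging
#   expect_access no  list   pods  production
</code></pre>
<p>Because the helper returns non-zero on a mismatch, a pipeline step that calls it will fail as soon as someone widens or narrows the Role by accident.</p>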
<p>Writing a least-privilege policy for a service account you control is the easy part. The harder part is auditing what already exists in a cluster. That's what Demo 3 covers.</p>
<h2 id="heading-demo-3-audit-rbac-with-rakkess-and-rbac-lookup">Demo 3 – Audit RBAC with rakkess and rbac-lookup</h2>
<p>Now you'll scan the full cluster to surface any accounts with more permissions than they need.</p>
<h3 id="heading-step-1-install-the-tools">Step 1: Install the tools</h3>
<pre><code class="language-bash">kubectl krew install access-matrix
kubectl krew install rbac-lookup
</code></pre>
<h3 id="heading-step-2-run-rakkess-across-the-cluster">Step 2: Run rakkess across the cluster</h3>
<pre><code class="language-bash"># All service accounts in kube-system
kubectl access-matrix --namespace kube-system

# All ServiceAccounts cluster-wide
kubectl access-matrix
</code></pre>
<h3 id="heading-step-3-find-all-cluster-admin-bindings">Step 3: Find all cluster-admin bindings</h3>
<p>There are two ways subjects get cluster-admin access: via a <code>ClusterRoleBinding</code> (cluster-wide), or via a <code>RoleBinding</code> that references the <code>cluster-admin</code> ClusterRole (namespace-scoped, still dangerous). Check both:</p>
<pre><code class="language-bash"># Find ClusterRoleBindings that grant cluster-admin
kubectl rbac-lookup cluster-admin --kind ClusterRole --output wide
</code></pre>
<p>On a fresh kind cluster this returns:</p>
<pre><code class="language-plaintext">No RBAC Bindings found
</code></pre>
<p>That result is expected, but it does not mean nothing is bound to <code>cluster-admin</code>. <code>rbac-lookup</code> searches by subject name (user, group, or service account), and a default kind cluster has no subject named <code>cluster-admin</code>. If this query returns entries in your production cluster, a subject with that name holds bindings, and each one is worth investigating.</p>
<p>To find who has cluster-level admin access through other means, query the bindings directly:</p>
<pre><code class="language-bash"># Find all ClusterRoleBindings and the subjects they grant
kubectl get clusterrolebindings -o wide
</code></pre>
<pre><code class="language-plaintext">NAME                                                   ROLE                                                                       AGE   USERS                         GROUPS                         SERVICEACCOUNTS
cluster-admin                                          ClusterRole/cluster-admin                                                  10d   system:masters
system:kube-controller-manager                         ClusterRole/system:kube-controller-manager                                 10d
system:kube-scheduler                                  ClusterRole/system:kube-scheduler                                          10d
system:node                                            ClusterRole/system:node                                                    10d
...
</code></pre>
<p>The <code>cluster-admin</code> ClusterRoleBinding grants access to the <code>system:masters</code> group – the group your kubeconfig certificate belongs to. This is expected. Every other binding in this list is worth reviewing to understand what it grants and why.</p>
<p><strong>What to look for:</strong> Any binding where the SERVICEACCOUNTS column is populated with an application service account (not a <code>system:</code> prefixed one) is a potential over-privilege finding. Application pods should never need cluster-admin.</p>
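<p>To cover the second path mentioned at the start of this step, namespace-scoped <code>RoleBindings</code> that reference the <code>cluster-admin</code> ClusterRole, one option is to print every binding as a flat row and filter on the role name. The sketch below assumes the column order shown in the usage comment, and <code>filter_cluster_admin</code> is an illustrative name.</p>
<pre><code class="language-bash"># Reads rows of "KIND NAMESPACE NAME ROLE SUBJECTS" on stdin and keeps
# only the bindings whose roleRef (4th column) is cluster-admin.
filter_cluster_admin() {
  awk '$4 == "cluster-admin" { print }'
}

# Example usage against a live cluster:
#   kubectl get clusterrolebindings,rolebindings --all-namespaces --no-headers \
#     -o custom-columns='KIND:.kind,NAMESPACE:.metadata.namespace,NAME:.metadata.name,ROLE:.roleRef.name,SUBJECTS:.subjects[*].name' \
#     | filter_cluster_admin
</code></pre>
<p>Rows with kind <code>RoleBinding</code> in the result are the namespace-scoped grants that a ClusterRoleBinding listing alone would miss.</p>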
<h3 id="heading-step-4-verify-the-ci-pipeline-service-account">Step 4: Verify the ci-pipeline service account</h3>
<pre><code class="language-bash">kubectl rbac-lookup ci-pipeline --kind ServiceAccount --output wide
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">SUBJECT                               SCOPE     ROLE             SOURCE
ServiceAccount/staging:ci-pipeline    staging   Role/ci-reader   RoleBinding/ci-reader-binding
</code></pre>
<p>The ROLE column uses the format <code>&lt;role-kind&gt;/&lt;role-name&gt;</code> and the SOURCE column uses <code>&lt;binding-kind&gt;/&lt;binding-name&gt;</code>. This tells you:</p>
<ul>
<li><p>The service account is bound to the <code>ci-reader</code> Role</p>
</li>
<li><p>The binding is a <code>RoleBinding</code> named <code>ci-reader-binding</code></p>
</li>
<li><p>The SCOPE column shows <code>staging</code>, so the permissions apply only in that namespace: the role is a namespaced <code>Role</code>, not a <code>ClusterRole</code></p>
</li>
</ul>
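<p>If the SCOPE column showed <code>cluster-wide</code>, or the SOURCE column showed a <code>ClusterRoleBinding</code>, that would be a finding. It would mean the service account has cluster-wide permissions, not namespace-scoped ones. A <code>ClusterRole</code> in the ROLE column is not a finding by itself: when it is bound through a <code>RoleBinding</code>, it still applies only within that namespace.</p>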
<p><strong>rbac-lookup vs kubectl get:</strong> <code>rbac-lookup</code> gives you a subject-centric view: "what does this account have access to?" <code>kubectl get rolebindings,clusterrolebindings -A</code> gives you a binding-centric view: "what bindings exist in the cluster?" Use both. rbac-lookup is faster for auditing a specific service account, while the <code>kubectl get</code> approach is better for a full cluster inventory.</p>
<p>With RBAC locked down, the API server is protected. But RBAC says nothing about what a container can do once it's running. That's a separate layer entirely.</p>
<h2 id="heading-how-to-harden-pod-runtime-security">How to Harden Pod Runtime Security</h2>
<p>RBAC controls who can talk to the Kubernetes API. Pod security controls what containers can do once they're running on a node. These are different threat vectors: RBAC protects the control plane, while pod security protects the data plane.</p>
<p>A container that runs as root with no capability restrictions can, if compromised, write backdoors to the host filesystem, load kernel modules, read the memory of other processes if <code>hostPID: true</code> is set, and in some configurations escape the container entirely. Pod security closes these doors before an attacker can open them.</p>
<h3 id="heading-a-case-study-the-hildegard-malware-campaign">A Case Study: The Hildegard Malware Campaign</h3>
<p>In early 2021, Palo Alto's Unit 42 research team documented a cryptomining malware campaign called Hildegard that specifically targeted Kubernetes clusters. The attack chain was:</p>
<ol>
<li><p>Find a cluster with the kubelet API exposed without authentication</p>
</li>
<li><p>Deploy a privileged pod with <code>hostPID: true</code></p>
</li>
<li><p>Use the privileged pod to read credentials from other containers' memory</p>
</li>
<li><p>Establish persistence by writing to the host filesystem</p>
</li>
</ol>
<p>Steps 3 and 4 would have been impossible if the pods in the cluster had been running with <code>readOnlyRootFilesystem: true</code>, dropped capabilities, and no <code>hostPID</code>. The attacker had the initial foothold. Pod security would have contained the blast radius.</p>
<h3 id="heading-pod-security-admission">Pod Security Admission</h3>
<p>Pod Security Admission (PSA) is the built-in admission controller that enforces pod security standards at the namespace level. It replaced PodSecurityPolicy in Kubernetes 1.25.</p>
<p><strong>Migrating from PSP?</strong> If you're on Kubernetes &lt; 1.25, you may still be using PodSecurityPolicy, which was removed in 1.25. The migration path is: enable PSA in <code>audit</code> mode first to identify violations, fix them workload by workload, then switch to <code>enforce</code>. For policies PSA cannot express, add Kyverno alongside it.</p>
<p>PSA defines three profiles:</p>
<table>
<thead>
<tr>
<th>Profile</th>
<th>Who it's for</th>
<th>What it restricts</th>
</tr>
</thead>
<tbody><tr>
<td><code>privileged</code></td>
<td>System components (CNI plugins, monitoring agents)</td>
<td>Nothing – no restrictions</td>
</tr>
<tr>
<td><code>baseline</code></td>
<td>Most workloads</td>
<td>Blocks known privilege escalations: no <code>hostNetwork</code>, no <code>hostPID</code>, no privileged containers</td>
</tr>
<tr>
<td><code>restricted</code></td>
<td>Security-sensitive workloads</td>
<td>Everything in baseline, plus: must run as non-root, must drop capabilities, must set a seccomp profile</td>
</tr>
</tbody></table>
<p>And three enforcement modes:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Effect</th>
<th>When to use</th>
</tr>
</thead>
<tbody><tr>
<td><code>enforce</code></td>
<td>Rejects pods that violate the profile at admission</td>
<td>Production – once you've fixed violations</td>
</tr>
<tr>
<td><code>audit</code></td>
<td>Allows pods but records violations in the audit log</td>
<td>Migration – see what would break without breaking anything</td>
</tr>
<tr>
<td><code>warn</code></td>
<td>Allows pods but sends a warning to the client</td>
<td>Development – fast feedback in your terminal</td>
</tr>
</tbody></table>
<p>The migration path: start with <code>audit</code> and <code>warn</code> to identify violations, fix them, then switch to <code>enforce</code>. The two modes can run simultaneously.</p>
<p>Apply them as namespace labels:</p>
<pre><code class="language-yaml"># namespace-staging.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    # Start here: audit and warn simultaneously
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
</code></pre>
<p>Once violations are resolved, add enforce:</p>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest
</code></pre>
<p>Note: <code>--overwrite</code> is deliberately left off. Without it, if <code>enforce</code> is already set to a different value the command will error, which is exactly what you want. You should see:</p>
<pre><code class="language-plaintext">namespace/staging labeled
</code></pre>
<p>If you see <code>namespace/staging not labeled</code>, it means <code>enforce=restricted</code> and <code>enforce-version=latest</code> were already set to those exact values. Confirm enforcement is active:</p>
<pre><code class="language-bash">kubectl get namespace staging --show-labels
</code></pre>
<p>Look for <code>pod-security.kubernetes.io/enforce=restricted</code> in the output. If it's there, enforcement is active.</p>
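<p>If you would rather preview the impact before committing to <code>enforce</code>, a server-side dry run of the label change reports which existing pods would violate the profile without persisting anything. The wrapper below is a small sketch; <code>psa_dry_run</code> is a hypothetical helper name, and <code>--overwrite</code> is harmless here because a dry run changes nothing.</p>
<pre><code class="language-bash"># psa_dry_run NAMESPACE [PROFILE]
# Asks the API server what setting enforce=PROFILE would do, without applying it.
# PROFILE defaults to "restricted".
psa_dry_run() {
  ns=$1
  profile=${2:-restricted}
  kubectl label --dry-run=server --overwrite namespace "$ns" \
    "pod-security.kubernetes.io/enforce=$profile"
}

# Example usage:
#   psa_dry_run staging
#   psa_dry_run staging baseline
</code></pre>
<p>Any warnings printed by the dry run point at workloads to fix before you apply the real label.</p>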
<h3 id="heading-how-to-configure-securitycontext">How to Configure securityContext</h3>
<p>A <code>securityContext</code> defines the privilege and access control settings for a pod or container. These are the seven fields you should configure on every production workload:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Set at</th>
<th>What it controls</th>
</tr>
</thead>
<tbody><tr>
<td><code>runAsNonRoot</code></td>
<td>Pod</td>
<td>Rejects containers that run as UID 0 (root)</td>
</tr>
<tr>
<td><code>runAsUser</code> / <code>runAsGroup</code></td>
<td>Pod</td>
<td>Sets a specific UID/GID – don't rely on the image default</td>
</tr>
<tr>
<td><code>fsGroup</code></td>
<td>Pod</td>
<td>All mounted volumes are owned by this GID</td>
</tr>
<tr>
<td><code>seccompProfile</code></td>
<td>Pod</td>
<td>Filters syscalls using a seccomp profile</td>
</tr>
<tr>
<td><code>allowPrivilegeEscalation</code></td>
<td>Container</td>
<td>Blocks <code>setuid</code> binaries and <code>sudo</code></td>
</tr>
<tr>
<td><code>readOnlyRootFilesystem</code></td>
<td>Container</td>
<td>Makes the container filesystem read-only</td>
</tr>
<tr>
<td><code>capabilities.drop</code></td>
<td>Container</td>
<td>Removes Linux capabilities (drop <code>ALL</code>, add back only what is needed)</td>
</tr>
</tbody></table>
<p>The annotated YAML below shows all seven in context:</p>
<pre><code class="language-yaml"># secure-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true         # container must run as a non-root user
        runAsUser: 10001           # explicit UID — don't rely on the image's default
        runAsGroup: 10001          # explicit GID
        fsGroup: 10001             # volumes are owned by this group
        seccompProfile:
          type: RuntimeDefault     # use the container runtime's default seccomp profile
      automountServiceAccountToken: false
      containers:
        - name: app
          image: nginx:1.25-alpine
          securityContext:
            allowPrivilegeEscalation: false   # block setuid and sudo inside the container
            readOnlyRootFilesystem: true      # the single highest-impact setting
            capabilities:
              drop:
                - ALL                         # drop every Linux capability
              add: []                         # add back only what is explicitly needed
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: nginx-cache
              mountPath: /var/cache/nginx
            - name: nginx-run
              mountPath: /var/run
      volumes:
        # nginx needs writable directories — provide them as emptyDir volumes
        - name: tmp
          emptyDir: {}
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-run
          emptyDir: {}
</code></pre>
<h4 id="heading-why-readonlyrootfilesystem-true-is-the-most-important-setting">Why <code>readOnlyRootFilesystem: true</code> is the most important setting:</h4>
<p>Most post-exploitation techniques require writing to the filesystem. Dropping a backdoor, modifying a binary, writing a cron job, or installing a keylogger all require a writable filesystem. Set <code>readOnlyRootFilesystem: true</code> and these techniques are blocked everywhere except the volumes you explicitly mount as writable.</p>
<p>The downside is that many applications write to directories like <code>/tmp</code> or <code>/var/cache</code>. The fix is to mount <code>emptyDir</code> volumes at those specific paths, as shown above. The rest of the filesystem stays read-only.</p>
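<p>One way to confirm which paths actually ended up writable is to try creating a file in a running pod. The helper below is a minimal sketch: <code>check_write</code> is a hypothetical name, it assumes the image ships a <code>touch</code> binary, and a failure can come from file ownership as well as from the read-only root, so it reports "not writable" rather than naming a cause.</p>
<pre><code class="language-bash"># check_write POD NAMESPACE PATH
# Tries to create PATH inside the pod's default container and reports the result.
check_write() {
  if kubectl exec "$1" -n "$2" -- touch "$3"; then
    echo "writable: $3"
  else
    echo "not writable: $3"
  fi
}

# Example usage (replace my-pod with a name from "kubectl get pods -n staging"):
#   check_write my-pod staging /usr/share/nginx/html/probe   # expected: not writable
#   check_write my-pod staging /tmp/probe                    # expected: writable
</code></pre>
<p>For the deployment above, only the three <code>emptyDir</code> mount paths should come back as writable.</p>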
<p><strong>What each field prevents:</strong></p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it prevents</th>
</tr>
</thead>
<tbody><tr>
<td><code>runAsNonRoot: true</code></td>
<td>Blocks containers that were built to run as root: the kubelet refuses to start them</td>
</tr>
<tr>
<td><code>runAsUser: 10001</code></td>
<td>Ensures a known, non-privileged UID even if the image doesn't set one</td>
</tr>
<tr>
<td><code>allowPrivilegeEscalation: false</code></td>
<td>Blocks <code>setuid</code> binaries and <code>sudo</code> – the most common privilege escalation path</td>
</tr>
<tr>
<td><code>readOnlyRootFilesystem: true</code></td>
<td>Prevents writing backdoors, modifying binaries, or creating persistence</td>
</tr>
<tr>
<td><code>capabilities: drop: ALL</code></td>
<td>Removes Linux capabilities like <code>NET_RAW</code> (raw socket access) and <code>SYS_ADMIN</code> (kernel operations)</td>
</tr>
<tr>
<td><code>seccompProfile: RuntimeDefault</code></td>
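<td>Filters syscalls through the runtime's default profile, which blocks a few dozen high-risk syscalls that most applications never use (Docker's default profile disables around 44 of 300+)</td>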
</tr>
</tbody></table>
<h3 id="heading-opagatekeeper-vs-kyverno">OPA/Gatekeeper vs Kyverno</h3>
<p>PSA covers the fundamentals. But you'll eventually need policies that PSA cannot express: all images must come from your private registry, all pods must have resource limits, no container may use the <code>latest</code> tag. For these, you need a policy engine.</p>
<p>Two mature options exist:</p>
<table>
<thead>
<tr>
<th></th>
<th>OPA/Gatekeeper</th>
<th>Kyverno</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Policy language</strong></td>
<td>Rego (a custom logic language)</td>
<td>YAML, same format as Kubernetes resources</td>
</tr>
<tr>
<td><strong>Learning curve</strong></td>
<td>Steep: Rego takes real time to learn</td>
<td>Gentle: if you write YAML, you can write policies</td>
</tr>
<tr>
<td><strong>Mutation</strong></td>
<td>Yes, via <code>Assign</code>/<code>AssignMetadata</code></td>
<td>Yes: first-class, well-documented feature</td>
</tr>
<tr>
<td><strong>Audit mode</strong></td>
<td>Yes: reports existing violations</td>
<td>Yes: policy audit mode</td>
</tr>
<tr>
<td><strong>Ecosystem</strong></td>
<td>Integrates with OPA in non-K8s contexts</td>
<td>Kubernetes-native only</td>
</tr>
<tr>
<td><strong>Best for</strong></td>
<td>Complex cross-resource logic and teams already using OPA</td>
<td>Teams who want K8s-native syntax and fast setup</td>
</tr>
</tbody></table>
<p>If you're starting fresh, Kyverno gets you to working policies faster. Here is a Kyverno policy that blocks images from outside your trusted registry:</p>
<pre><code class="language-yaml"># kyverno-registry-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from registry.corp.internal/"
        pattern:
          spec:
            containers:
              - image: "registry.corp.internal/*"
</code></pre>
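<p>To get a feel for which image references the <code>registry.corp.internal/*</code> pattern accepts, the matching can be approximated with a shell glob. This is only a model of the wildcard behaviour, not Kyverno itself, and <code>image_allowed</code> is an illustrative name.</p>
<pre><code class="language-bash"># image_allowed IMAGE
# Approximates the policy's wildcard pattern with a shell glob.
image_allowed() {
  case "$1" in
    registry.corp.internal/*) echo "allowed" ;;
    *) echo "blocked" ;;
  esac
}

# Example usage:
#   image_allowed registry.corp.internal/team/app:1.0          # allowed
#   image_allowed nginx:1.25-alpine                            # blocked
#   image_allowed registry.corp.internal.evil.example/app:1.0  # blocked
</code></pre>
<p>The last case shows why the trailing slash in the pattern matters: without it, a lookalike hostname that merely starts with your registry's name would match. Note also that the policy as written lists only <code>spec.containers</code>, so init containers and ephemeral containers would likely need their own entries.</p>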
<h3 id="heading-how-to-detect-runtime-threats-with-falco">How to Detect Runtime Threats with Falco</h3>
<p>PSA and <code>securityContext</code> are preventive controls: they block known-bad configurations before pods start. Falco is a detective control. It watches what containers do while they're running and alerts when something looks wrong.</p>
<p>Falco operates at the syscall level using eBPF. It attaches to the Linux kernel and intercepts every system call made by every container on the node – file opens, network connections, process spawns, privilege escalations. It does this without modifying containers, without injecting sidecars, and with minimal overhead.</p>
<h4 id="heading-what-falco-detects-out-of-the-box">What Falco detects out of the box:</h4>
<p>Falco's default ruleset covers the most common attack patterns. It fires when a shell is opened inside a running container, whether that's a <code>kubectl exec</code> session or a reverse shell from an exploit.</p>
<p>It watches for reads on sensitive files like <code>/etc/shadow</code>, <code>/etc/kubernetes/admin.conf</code>, and <code>/root/.ssh/</code>. It catches the dropper pattern: a binary written to disk and immediately executed. It detects outbound connections to known malicious IPs, writes to <code>/proc</code> or <code>/sys</code> that suggest kernel manipulation, and package managers like <code>apt</code>, <code>yum</code>, or <code>pip</code> being run inside containers that have no business installing software.</p>
<p>Each of these is a rule in Falco's default ruleset. You can extend it with custom rules for your specific workloads – which is exactly what you'll do in Demo 5. But first let's harden the Pod.</p>
<h2 id="heading-demo-4-harden-a-pod-with-securitycontext">Demo 4 – Harden a Pod with securityContext</h2>
<p>In this demo, you'll start with a default nginx deployment, observe the PSA violations it triggers, harden it step by step, and confirm it passes under the <code>restricted</code> profile.</p>
<h3 id="heading-step-1-apply-psa-labels-in-audit-mode">Step 1: Apply PSA labels in audit mode</h3>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
</code></pre>
<h3 id="heading-step-2-deploy-insecure-nginx-and-observe-the-warnings">Step 2: Deploy insecure nginx and observe the warnings</h3>
<pre><code class="language-yaml"># insecure-nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-insecure
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-insecure
  template:
    metadata:
      labels:
        app: nginx-insecure
    spec:
      containers:
        - name: nginx
          image: nginx:1.25-alpine
</code></pre>
<pre><code class="language-bash">kubectl apply -f insecure-nginx.yaml
</code></pre>
<p>Expected output (PSA warns but still creates the deployment in <code>warn</code> mode):</p>
<pre><code class="language-plaintext">Warning: would violate PodSecurity "restricted:latest":
  allowPrivilegeEscalation != false (container "nginx" must set
    securityContext.allowPrivilegeEscalation=false)
  unrestricted capabilities (container "nginx" must set
    securityContext.capabilities.drop=["ALL"])
  runAsNonRoot != true (pod or container "nginx" must set
    securityContext.runAsNonRoot=true)
  seccompProfile not set (pod or container "nginx" must set
    securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/nginx-insecure created
</code></pre>
<p>Four violations. Every one of them is a real security gap. But the Deployment was still created, as the last line shows: <code>deployment.apps/nginx-insecure created</code>.</p>
<h3 id="heading-step-3-deploy-the-hardened-version">Step 3: Deploy the hardened version</h3>
<pre><code class="language-bash">kubectl apply -f secure-deployment.yaml   # the YAML from the securityContext section above
</code></pre>
<p>No warnings this time.</p>
<h3 id="heading-step-4-switch-the-namespace-to-enforce">Step 4: Switch the namespace to enforce</h3>
<pre><code class="language-bash">kubectl label namespace staging \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">namespace/staging labeled
</code></pre>
<p>This is the moment enforcement becomes active. Any new pod that violates the <code>restricted</code> profile will be rejected from this point on.</p>
<h3 id="heading-step-5-confirm-insecure-deployments-are-now-rejected">Step 5: Confirm insecure deployments are now rejected</h3>
<pre><code class="language-bash">kubectl delete deployment nginx-insecure -n staging
kubectl apply -f insecure-nginx.yaml
</code></pre>
<p>Expected output:</p>
<pre><code class="language-plaintext">Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false ...
deployment.apps/nginx-insecure created
</code></pre>
<p>The Deployment object is created. PSA enforces at the <strong>pod</strong> level, not the Deployment level. The Deployment and its ReplicaSet exist, but every attempt to create a pod is rejected. Check the ReplicaSet:</p>
<pre><code class="language-bash">kubectl get replicaset -n staging -l app=nginx-insecure
</code></pre>
<pre><code class="language-plaintext">NAME                       DESIRED   CURRENT   READY   AGE
nginx-insecure-b668d867b   1         0         0       30s
</code></pre>
<p><code>DESIRED=1</code> but <code>CURRENT=0</code>. The ReplicaSet cannot create any pods because they're rejected at admission. Describe the ReplicaSet to see the rejection events:</p>
<pre><code class="language-bash">kubectl describe replicaset -n staging -l app=nginx-insecure
</code></pre>
<pre><code class="language-plaintext">Warning  FailedCreate  ReplicaSet "nginx-insecure-b668d867b" create Pod
  "nginx-insecure-xxx" failed: pods is forbidden: violates PodSecurity
  "restricted:latest": allowPrivilegeEscalation != false, unrestricted
  capabilities, runAsNonRoot != true, seccompProfile not set
</code></pre>
<p>The hardened deployment continues running with its pods intact. The insecure one has zero pods and never will. This is exactly how PSA is supposed to work.</p>
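<p>Because a rejected workload shows up only as a ReplicaSet that never reaches its desired count, it can help to list those ReplicaSets directly after turning on <code>enforce</code>. The filter below is a sketch that reads the default <code>kubectl get replicaset</code> table for a single namespace; <code>stalled_replicasets</code> is a hypothetical name, and a mismatch can also mean a rollout is simply still in progress.</p>
<pre><code class="language-bash"># Reads "NAME DESIRED CURRENT READY AGE" rows on stdin (header on line 1)
# and prints the name of every ReplicaSet whose CURRENT differs from DESIRED.
stalled_replicasets() {
  awk 'NR != 1 { if ($2 != $3) print $1 }'
}

# Example usage:
#   kubectl get replicaset -n staging | stalled_replicasets
</code></pre>
<p>Describe each ReplicaSet it prints to check whether the cause is a PodSecurity rejection or something else.</p>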
<h3 id="heading-step-6-score-the-hardened-pod-with-kube-score">Step 6: Score the hardened pod with kube-score</h3>
<p><a href="https://github.com/zegl/kube-score">kube-score</a> is a static analysis tool that scores Kubernetes manifests against security and reliability best practices:</p>
<pre><code class="language-bash"># macOS
brew install kube-score
# Linux: https://github.com/zegl/kube-score/releases

kube-score score secure-deployment.yaml -v
</code></pre>
<p>Expected output (abridged):</p>
<pre><code class="language-plaintext">apps/v1/Deployment secure-app in staging 
  path=secure-deployment.yaml
    [OK] Stable version
    [OK] Label values
    [CRITICAL] Container Resources
        · app -&gt; CPU limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.cpu
        · app -&gt; Memory limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.memory
        · app -&gt; CPU request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.cpu
        · app -&gt; Memory request is not set
            Resource requests are recommended to make sure that the application can start and run without crashing. Set resources.requests.memory
    [CRITICAL] Container Image Pull Policy
        · app -&gt; ImagePullPolicy is not set to Always
            It's recommended to always set the ImagePullPolicy to Always, to make sure that the imagePullSecrets are always correct, and to always get the image you want.
    [OK] Pod Probes Identical
    [CRITICAL] Container Ephemeral Storage Request and Limit
        · app -&gt; Ephemeral Storage limit is not set
            Resource limits are recommended to avoid resource DDOS. Set resources.limits.ephemeral-storage
        · app -&gt; Ephemeral Storage request is not set
            Resource requests are recommended to make sure the application can start and run without crashing. Set resource.requests.ephemeral-storage
    [OK] Environment Variable Key Duplication
    [OK] Container Security Context Privileged
    [OK] Pod Topology Spread Constraints
        · Pod Topology Spread Constraints
            No Pod Topology Spread Constraints set, kube-scheduler defaults assumed
    [OK] Container Image Tag
    [CRITICAL] Pod NetworkPolicy
        · The pod does not have a matching NetworkPolicy
            Create a NetworkPolicy that targets this pod to control who/what can communicate with this pod. Note, this feature needs to be supported by the CNI implementation used in the Kubernetes cluster to have an effect.
    [OK] Container Security Context User Group ID
    [OK] Container Security Context ReadOnlyRootFilesystem
    [CRITICAL] Deployment has PodDisruptionBudget
        · No matching PodDisruptionBudget was found
            It's recommended to define a PodDisruptionBudget to avoid unexpected downtime during Kubernetes maintenance operations, such as when draining a node.
    [WARNING] Deployment has host PodAntiAffinity
        · Deployment does not have a host podAntiAffinity set
            It's recommended to set a podAntiAffinity that stops multiple pods from a deployment from being scheduled on the same node. This increases availability in case the node becomes unavailable.
    [OK] Deployment Pod Selector labels match template metadata labels
</code></pre>
<p>Notice there are no security context violations: the <code>Container Security Context Privileged</code>, <code>User Group ID</code>, and <code>ReadOnlyRootFilesystem</code> checks all pass. The remaining findings are about <strong>resource management</strong> (CPU/memory limits, ephemeral storage), <strong>image pull policy</strong>, <strong>availability</strong> (PodDisruptionBudget, anti-affinity), and <strong>network policy</strong>, not security context hardening. Those are important for production readiness, but they're a separate concern from the pod security hardening we did here.</p>
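<p>If you want to track those remaining findings over time, counting the <code>[CRITICAL]</code> lines in the report is a simple starting point. The helper below is a sketch that works on the human-readable output shown above; <code>count_critical</code> is an illustrative name.</p>
<pre><code class="language-bash"># Counts the [CRITICAL] findings in a kube-score report read from stdin.
# Prints 0 when there are none.
count_critical() {
  grep -c '\[CRITICAL\]' || true
}

# Example usage:
#   kube-score score secure-deployment.yaml | count_critical
</code></pre>
<p>For the report above this prints 5, one for each check marked <code>[CRITICAL]</code>.</p>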
<p>You now have a pod that PSA accepts and kube-score validates. The next step is to add a detection layer – something that watches what the pod does at runtime, not just how it was configured at admission.</p>
<h2 id="heading-demo-5-deploy-falco-and-write-a-custom-detection-rule">Demo 5 – Deploy Falco and Write a Custom Detection Rule</h2>
<p>Now, you'll deploy Falco in eBPF mode, trigger a default alert, then extend Falco with a custom rule that catches <code>curl</code> and <code>wget</code> being run inside containers.</p>
<h3 id="heading-step-1-install-falco-via-helm">Step 1: Install Falco via Helm</h3>
<pre><code class="language-bash">helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  --wait
</code></pre>
<p>Confirm Falco is running on every node:</p>
<pre><code class="language-shell">kubectl get pods -n falco
</code></pre>
<pre><code class="language-shell">NAME           READY   STATUS    RESTARTS   AGE
falco-x8k2p    1/1     Running   0          45s
falco-m9nqr    1/1     Running   0          45s
falco-j4tpw    1/1     Running   0          45s
</code></pre>
<p>One pod per node. Falco runs as a DaemonSet because it needs to monitor syscalls on every node independently.</p>
<h3 id="heading-step-2-trigger-a-default-alert">Step 2: Trigger a default alert</h3>
<p>Open a second terminal and stream the Falco logs:</p>
<pre><code class="language-shell"># Terminal 2 — watch for alerts
kubectl logs -n falco -l app.kubernetes.io/name=falco -f --max-log-requests 3
</code></pre>
<p>In your first terminal, exec into the secure-app pod:</p>
<pre><code class="language-bash"># Terminal 1 — trigger the shell detection
POD=$(kubectl get pod -n staging -l app=secure-app \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -n staging -- sh
</code></pre>
<p>Within a second, Terminal 2 shows:</p>
<pre><code class="language-plaintext">2024-03-15T14:23:41.456Z: Notice A shell was spawned in a container with an attached terminal
  (user=root user_loginuid=-1 k8s.ns=staging k8s.pod=secure-app-7d9f8b-xxx
   container=app shell=sh parent=runc cmdline=sh terminal=34816)
  rule=Terminal shell in container  priority=NOTICE
  tags=[container, shell, mitre_execution]
</code></pre>
<p>This is Falco's built-in <code>Terminal shell in container</code> rule firing. It detected the <code>kubectl exec</code> session the moment you ran it.</p>
<h3 id="heading-step-3-write-a-custom-rule">Step 3: Write a custom rule</h3>
<p>The built-in rules are comprehensive, but every production environment has workloads with unique behaviour. Here is a custom rule that alerts when <code>curl</code> or <code>wget</code> is executed inside any container:</p>
<pre><code class="language-yaml"># custom-rules.yaml
customRules:
  custom-rules.yaml: |-
    - rule: Suspicious network tool in container
      desc: &gt;
        Detects execution of curl or wget inside a running container.
        These tools are commonly used for data exfiltration, downloading
        attacker payloads, or reaching command-and-control servers.
        Production containers should not be making ad-hoc HTTP requests.
      condition: &gt;
        spawned_process
        and container
        and proc.name in (curl, wget)
      output: &gt;
        Network tool executed in container
        (user=%user.name tool=%proc.name cmd=%proc.cmdline
         pod=%k8s.pod.name ns=%k8s.ns.name image=%container.image)
      priority: WARNING
      tags: [network, exfiltration, custom]
</code></pre>
<p>Apply it by upgrading the Helm release:</p>
<pre><code class="language-bash">helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --set driver.kind=modern_ebpf \
  --set tty=true \
  -f custom-rules.yaml
</code></pre>
<p>Once the upgrade completes, wait for the Falco pods to be ready again, then test your custom rule.</p>
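<p>A minimal way to wait for the rollout, assuming the chart named the DaemonSet <code>falco</code> after the Helm release (check with <code>kubectl get daemonset -n falco</code> if yours differs):</p>
<pre><code class="language-bash"># Block until every Falco pod has been replaced and is ready, or give up after two minutes
kubectl rollout status daemonset/falco -n falco --timeout=120s
</code></pre>
<p>If the log stream in Terminal 2 stopped when the old pods were replaced, re-run the <code>kubectl logs</code> command from Step 2 before continuing.</p>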
<h3 id="heading-step-4-test-the-custom-rule">Step 4: Test the custom rule</h3>
<pre><code class="language-bash"># Terminal 1 — run curl inside the container
kubectl exec -it $POD -n staging -- sh -c 'curl https://example.com'
</code></pre>
<p>Terminal 2 immediately shows:</p>
<pre><code class="language-plaintext">2024-03-15T14:31:07.812Z: Warning Network tool executed in container
  (user=root tool=curl cmd=curl https://example.com
   pod=secure-app-7d9f8b-xxx ns=staging image=nginx:1.25-alpine)
  rule=Suspicious network tool in container  priority=WARNING
  tags=[network, exfiltration, custom]
</code></pre>
<h3 id="heading-step-5-route-alerts-to-slack-with-falcosidekick">Step 5: Route alerts to Slack with Falcosidekick</h3>
<p>Streaming logs is useful during development. In production, you need alerts routed to your alerting pipeline. Falcosidekick handles this with support for Slack, PagerDuty, Datadog, Elasticsearch, and over 50 other outputs:</p>
<pre><code class="language-yaml"># falcosidekick-values.yaml
config:
  slack:
    webhookurl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    minimumpriority: "warning"
    messageformat: &gt;
      [{{.Priority}}] {{.Rule}} |
      pod: {{index .OutputFields "k8s.pod.name"}} |
      ns: {{index .OutputFields "k8s.ns.name"}} |
      image: {{index .OutputFields "container.image"}}
</code></pre>
<pre><code class="language-bash">helm install falcosidekick falcosecurity/falcosidekick \
  --namespace falco \
  -f falcosidekick-values.yaml
</code></pre>
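<p>Falco also has to be told where to send its events. The following is a sketch under two assumptions that you should check against your chart version: that the Falco chart exposes <code>json_output</code> and <code>http_output</code> under its <code>falco</code> key, and that Falcosidekick is reachable inside the namespace at the Service name <code>falcosidekick</code> on port 2801:</p>
<pre><code class="language-bash"># Keep the existing values (driver, custom rules) and add the HTTP output
helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --reuse-values \
  --set falco.json_output=true \
  --set falco.http_output.enabled=true \
  --set falco.http_output.url=http://falcosidekick:2801/
</code></pre>
<p>After the Falco pods restart, repeating the <code>curl</code> test from Step 4 should produce a Slack message as well as a log line.</p>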
<p><strong>Tuning Falco for production:</strong> A fresh Falco deployment will generate false positives, especially in the first week. Your job is to tune rules to match your workloads' normal behaviour, not to respond to every alert.</p>
<p>Here's the workflow: deploy in staging → identify false positives → add <code>exceptions</code> to the affected rules → confirm the false-positive rate is acceptably low → enable in production with alerting.</p>
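<p>As an illustration of the tuning step, here is a sketch of the custom rule from Step 3 with an <code>exceptions</code> entry added. The image name <code>docker.io/example/healthcheck</code> is hypothetical and stands in for a workload that you expect to run <code>curl</code>. Note that this overwrites the <code>custom-rules.yaml</code> file from Step 3:</p>
<pre><code class="language-bash">cat &gt; custom-rules.yaml &lt;&lt; 'EOF'
customRules:
  custom-rules.yaml: |-
    - rule: Suspicious network tool in container
      desc: Detects curl or wget running inside a container.
      condition: spawned_process and container and proc.name in (curl, wget)
      exceptions:
        # Hypothetical allow-list entry: this image is expected to run curl
        - name: allowed_image_and_tool
          fields: [container.image.repository, proc.name]
          comps: [=, =]
          values:
            - [docker.io/example/healthcheck, curl]
      output: Network tool executed in container (user=%user.name tool=%proc.name cmd=%proc.cmdline pod=%k8s.pod.name ns=%k8s.ns.name image=%container.image)
      priority: WARNING
      tags: [network, exfiltration, custom]
EOF
</code></pre>
<p>Re-run the <code>helm upgrade</code> command from Step 3 to load the updated rule. Each row under <code>values</code> is matched against the listed <code>fields</code>, so the rule stays silent only when both the image and the process name match.</p>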
<h2 id="heading-cleanup">Cleanup</h2>
<p>To remove everything created in this article:</p>
<pre><code class="language-bash"># Delete the staging namespace and everything in it
kubectl delete namespace staging
 
# Delete Falco and Falcosidekick
helm uninstall falco -n falco
helm uninstall falcosidekick -n falco
kubectl delete namespace falco
 
# Delete the kind cluster entirely
kind delete cluster --name k8s-security
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this handbook, you secured a Kubernetes cluster across three layers: RBAC, pod runtime security, and runtime threat detection.</p>
<p>You built a least-privilege service account, enforced the restricted Pod Security Admission profile, hardened pods with securityContext, deployed Falco for syscall-level detection, and wrote a custom rule to catch suspicious tools inside containers.</p>
<p>Each layer maps to a real-world breach – Tesla, Capital One, Hildegard – showing how these controls would have contained the damage. Run kube-bench again to measure the improvement.</p>
<p>All YAML manifests, Helm values, and setup scripts from this article are available in the <a href="https://github.com/Caesarsage/DevOps-Cloud-Projects/tree/main/intermediate/security">companion GitHub repository</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Different Container Runtimes: Docker, Podman, and Containerd Explained ]]>
                </title>
                <description>
                    <![CDATA[ If you’re a developer working with containers, chances are Docker is your go-to tool. But did you know that there's a whole ecosystem of container runtimes out there? Some are lighter, some are more secure, and some are specifically built for Kuberne... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-different-container-runtimes-docker-podman-and-containerd-explained/</link>
                <guid isPermaLink="false">6994e01b44a48dd86fdf0816</guid>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Tue, 17 Feb 2026 21:39:39 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771357533601/1cba7a91-19f0-4038-93e6-504b121a9a03.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re a developer working with containers, chances are Docker is your go-to tool. But did you know that there's a whole ecosystem of container runtimes out there? Some are lighter, some are more secure, and some are specifically built for Kubernetes.</p>
<p>Understanding different container runtimes gives you more options. You can choose the right tool for your specific needs, whether that's better security, lower resource usage, or easier integration with Kubernetes.</p>
<p>In this tutorial, you'll learn about three major container runtimes and how to use them on your system. We’ll dive into practical examples with complete code you can run right now. By the end, you’ll understand when to use each runtime and how to move containers between them.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-are-container-runtimes">What Are Container Runtimes?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-understand-high-level-vs-low-level-runtimes">How to Understand High-Level vs Low-Level Runtimes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-docker-as-your-baseline">How to Use Docker as Your Baseline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-podman-the-daemonless-alternative">How to Use Podman – The Daemonless Alternative</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-work-with-containerd">How to Work with Containerd</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-move-containers-between-runtimes">How to Move Containers Between Runtimes</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-use-cases">Real-World Use Cases</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-quick-reference-guide">Quick Reference Guide</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-are-container-runtimes">What Are Container Runtimes?</h2>
<p>A container runtime is the software that actually runs your containers. When you type <code>docker run nginx</code>, for example, several things happen behind the scenes. The Docker CLI talks to the Docker daemon, which then uses a container runtime (usually containerd) to actually create and run the container.</p>
<p>Think of it like this: if containers are apps on your phone, the container runtime is the operating system that makes those apps work. Just like you can install the same app on different phones (iPhone vs Android), you can run the same container on different runtimes.</p>
<h3 id="heading-why-does-this-matter">Why Does This Matter?</h3>
<p>You might wonder why you should care about what's running your containers. Docker works fine, right? Here are a few reasons:</p>
<ol>
<li><p><strong>Security:</strong> Some runtimes like Podman can run containers without root privileges. This means if someone breaks out of your container, they don't have full system access.</p>
</li>
<li><p><strong>Resource usage:</strong> Different runtimes use different amounts of memory and CPU. On a resource-constrained server or edge device, this matters a lot.</p>
</li>
<li><p><strong>Integration:</strong> If you're deploying to Kubernetes, understanding containerd or CRI-O helps you troubleshoot production issues.</p>
</li>
<li><p><strong>Licensing:</strong> Docker Desktop has licensing requirements for large companies. Alternatives like Podman are completely free.</p>
</li>
</ol>
<p>Here’s a chart that summarizes these key points:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770901945553/8ef53746-02d1-4936-8930-fc7255aaa2bc.jpeg" alt="Container runtime comparison chart" class="image--center mx-auto" width="2647" height="1510" loading="lazy"></p>
<h2 id="heading-how-to-understand-high-level-vs-low-level-runtimes">How to Understand High-Level vs Low-Level Runtimes</h2>
<p>Container runtimes are split into two categories, and understanding this distinction helps you see how everything fits together.</p>
<h3 id="heading-low-level-runtimes">Low-Level Runtimes</h3>
<p>Low-level runtimes like <code>runc</code> and <code>crun</code> do the actual work of creating containers. They interact directly with the Linux kernel to create isolated environments using features like namespaces and cgroups.</p>
<p><strong>Namespaces</strong> isolate what a process can see. For example, a PID namespace means the container can't see other processes running on your system. A network namespace means it has its own network stack.</p>
<p><strong>Cgroups</strong> (control groups) limit what a process can use. You can limit a container to 512MB of RAM or 50% of one CPU core. This prevents one container from hogging all your resources.</p>
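<p>On a Linux host, you can see both mechanisms through the <code>/proc</code> filesystem. This is a small sketch for illustration, and it won't work in a macOS terminal because macOS has no <code>/proc</code>:</p>
<pre><code class="lang-bash"># List the namespaces this process belongs to (pid, net, mnt, and so on)
ls -l /proc/self/ns

# Show the cgroup this process has been placed in
cat /proc/self/cgroup
</code></pre>
<p>Run the same two commands inside a container and the namespace IDs will typically differ from the host's, which is the isolation that the low-level runtime sets up.</p>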
<p>These low-level runtimes implement the OCI (Open Container Initiative) Runtime Specification. This is a standard that defines exactly how to run a container. Because of this standard, you can swap out runtimes and your containers still work.</p>
<h3 id="heading-high-level-runtimes">High-Level Runtimes</h3>
<p>High-level runtimes like Docker, Podman, and containerd manage images, networking, volumes, and provide user-friendly interfaces. They handle pulling images from registries, setting up networks between containers, and managing container lifecycles.</p>
<p>These high-level runtimes use low-level runtimes under the hood. When you run <code>docker run</code>, Docker ultimately calls <code>runc</code> to create the container. This layering means you get a nice interface while still benefiting from the standard, battle-tested low-level runtime.</p>
<h4 id="heading-why-this-layering-matters">Why This Layering Matters:</h4>
<p>This separation of concerns is powerful. High-level runtimes can focus on user experience and features while low-level runtimes focus on reliably creating containers. You can swap low-level runtimes without changing your workflow. Some people use <code>crun</code> instead of <code>runc</code> because it's written in C and starts faster.</p>
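<p>Once Docker and Podman are installed (both are covered later in this tutorial), you can check which low-level runtime each one is configured to use. This is a quick sketch, and the values you see depend on your installation:</p>
<pre><code class="lang-bash"># Docker: print the default low-level runtime (typically runc)
docker info --format '{{.DefaultRuntime}}'

# Podman: print the name of the OCI runtime it uses (often crun)
podman info --format '{{.Host.OCIRuntime.Name}}'
</code></pre>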
<h2 id="heading-how-to-use-docker-as-your-baseline">How to Use Docker as Your Baseline</h2>
<p>Let's start with Docker since you're probably already familiar with it. This will give us a baseline to compare other runtimes against. We'll build a simple web application and then run the same application in different runtimes to see how they compare.</p>
<h3 id="heading-how-to-install-docker">How to Install Docker</h3>
<p>You can find installation guides for your operating system:</p>
<ul>
<li><p><a target="_blank" href="https://docs.docker.com/desktop/install/mac-install/">Docker Desktop for Mac</a></p>
</li>
<li><p><a target="_blank" href="https://docs.docker.com/desktop/install/windows-install/">Docker Desktop for Windows</a></p>
</li>
<li><p><a target="_blank" href="https://docs.docker.com/engine/install/">Docker Engine for Linux</a></p>
</li>
</ul>
<h3 id="heading-how-to-run-a-test-container">How to Run a Test Container</h3>
<p>Let's verify that Docker works by running a simple container:</p>
<pre><code class="lang-bash">docker run hello-world
</code></pre>
<p>You should see a message that says:</p>
<pre><code class="lang-bash">Hello from Docker!
This message shows that your installation appears to be working correctly.
</code></pre>
<h4 id="heading-what-just-happened">What Just Happened?</h4>
<p>When you ran that command, Docker checked if the <code>hello-world</code> image exists locally. It didn't find it, so it pulled the image from Docker Hub (a public registry). Then it created a container from that image, started the container, and the container printed its message and exited.</p>
<p>All of this happened in a few seconds. Now let's build something more useful.</p>
<h3 id="heading-how-to-create-a-web-server">How to Create a Web Server</h3>
<p>Create a new directory for your project:</p>
<pre><code class="lang-bash">mkdir ~/container-demo
<span class="hljs-built_in">cd</span> ~/container-demo
</code></pre>
<p>The <code>~</code> symbol means your home directory. On macOS, this is <code>/Users/yourname</code>. On Linux, it's <code>/home/yourname</code>.</p>
<p>Create a simple HTML file:</p>
<pre><code class="lang-bash">cat &gt; index.html &lt;&lt; <span class="hljs-string">'EOF'</span>
&lt;!DOCTYPE html&gt;
&lt;html&gt;
&lt;head&gt;&lt;title&gt;Container Demo&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
  &lt;h1&gt;Hello from Docker!&lt;/h1&gt;
  &lt;p&gt;This is running <span class="hljs-keyword">in</span> a container.&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
EOF
</code></pre>
<p>This creates a basic HTML file. The <code>cat &gt;</code> command writes to a file, and <code>&lt;&lt; 'EOF'</code> means "read until you see EOF" (End Of File). This is a handy way to create files from the command line.</p>
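<p>The quotes around <code>'EOF'</code> matter: they tell the shell to write the text literally instead of expanding variables in it. A short sketch of the difference, using throwaway file names chosen for this example:</p>
<pre><code class="lang-bash">NAME="Docker"

# Quoted delimiter: the body is written literally, so $NAME is not expanded
cat &gt; quoted.txt &lt;&lt; 'EOF'
Hello from $NAME
EOF

# Unquoted delimiter: the shell expands $NAME before writing
cat &gt; unquoted.txt &lt;&lt; EOF
Hello from $NAME
EOF

cat quoted.txt unquoted.txt
</code></pre>
<p>The first file contains the text <code>Hello from $NAME</code> unchanged, while the second contains <code>Hello from Docker</code>. Quoting the delimiter is the safer default when the file itself contains <code>$</code> characters.</p>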
<h3 id="heading-how-to-create-a-dockerfile">How to Create a Dockerfile</h3>
<p>Next, create a Dockerfile:</p>
<pre><code class="lang-bash">cat &gt; Dockerfile &lt;&lt; <span class="hljs-string">'EOF'</span>
FROM nginx:alpine
COPY index.html /usr/share/nginx/html/
EOF
</code></pre>
<h4 id="heading-understanding-the-dockerfile">Understanding the Dockerfile:</h4>
<p>The Dockerfile has two instructions:</p>
<ol>
<li><p><strong>FROM nginx:alpine</strong>: This starts with the official Nginx image. The <code>:alpine</code> tag means we're using the Alpine Linux version, which is much smaller (about 20MB instead of 130MB). Alpine is a minimal Linux distribution popular in containers because of its small size.</p>
</li>
<li><p><strong>COPY index.html /usr/share/nginx/html/</strong>: This copies your HTML file into the location where Nginx serves files. Inside the container, Nginx is configured to serve files from <code>/usr/share/nginx/html/</code>.</p>
</li>
</ol>
<h3 id="heading-how-to-build-a-docker-image">How to Build a Docker Image</h3>
<pre><code class="lang-bash">docker build -t my-web-app .
</code></pre>
<p>The <code>-t</code> flag means "tag" – we're naming the image <code>my-web-app</code>. The <code>.</code> at the end means "use the current directory as the build context". Docker will look for a Dockerfile in the current directory and send all files here to the Docker daemon for building.</p>
<p>You'll see output like:</p>
<pre><code class="lang-bash">[+] Building 2.3s (7/7) FINISHED
=&gt; [internal] load build definition from Dockerfile
=&gt; =&gt; transferring dockerfile: 98B
=&gt; [internal] load .dockerignore
...
=&gt; =&gt; naming to docker.io/library/my-web-app
</code></pre>
<p>This shows Docker building your image layer by layer. Each instruction in the Dockerfile creates a new layer. These layers are cached, so a rebuild with no changes finishes almost instantly.</p>
<h3 id="heading-how-to-run-a-docker-container">How to Run a Docker Container</h3>
<pre><code class="lang-bash">docker run -d -p 8080:80 my-web-app
</code></pre>
<h4 id="heading-understanding-the-flags">Understanding the Flags:</h4>
<ul>
<li><p><strong>-d</strong> means "detached mode". Without this flag, the container runs in the foreground and you'll see Nginx's log output. With <code>-d</code>, the command returns immediately and the container keeps running in the background.</p>
</li>
<li><p><strong>-p 8080:80</strong> maps port 8080 on your host machine to port 80 inside the container. Nginx listens on port 80 inside the container. To access it from your browser, you need to map it to a port on your machine. We chose 8080, but you could use any available port.</p>
</li>
</ul>
<p>Open your browser and visit <code>http://localhost:8080</code>. You should see your HTML page!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770902754636/b6641413-7bd6-4548-aa75-dbc487630c1d.png" alt="Localhost running docker container" class="image--center mx-auto" width="884" height="594" loading="lazy"></p>
<h4 id="heading-how-to-check-running-containers">How to Check Running Containers:</h4>
<pre><code class="lang-bash">docker ps
</code></pre>
<p>This shows all running containers. You'll see something like:</p>
<pre><code class="lang-bash">CONTAINER ID   IMAGE        COMMAND                  PORTS                  NAMES
a1b2c3d4e5f6   my-web-app   <span class="hljs-string">"/docker-entrypoint.…"</span>   0.0.0.0:8080-&gt;80/tcp   peaceful_curie
</code></pre>
<p>Docker automatically generated a random name (<code>peaceful_curie</code> in this example). You can specify a name with <code>--name</code> if you prefer.</p>
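<p>For example, here is a sketch that starts a second container with an explicit name. The name <code>web-demo</code> and host port 8083 are arbitrary choices for this illustration:</p>
<pre><code class="lang-bash"># Start a second container with an explicit name on a free host port
docker run -d --name web-demo -p 8083:80 my-web-app

# The name works anywhere a container ID is accepted
docker logs web-demo
docker stop web-demo
docker rm web-demo
</code></pre>
<p>Named containers are easier to script against because you don't have to look up a generated ID first.</p>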
<h4 id="heading-how-to-view-container-logs">How to View Container Logs:</h4>
<pre><code class="lang-bash">docker logs &lt;container-id&gt;
</code></pre>
<p>Replace <code>&lt;container-id&gt;</code> with the ID from <code>docker ps</code> (just the first few characters work). This shows what's happening inside the container. For Nginx, you'll see access logs showing requests to your web server.</p>
<h4 id="heading-how-to-stop-the-container">How to Stop the Container:</h4>
<pre><code class="lang-bash">docker stop &lt;container-id&gt;
</code></pre>
<p>This gracefully stops the container. Nginx receives a signal to shut down cleanly.</p>
<p>Now that you understand how to use Docker, let’s check out how Podman works next.</p>
<h2 id="heading-how-to-use-podman-the-daemonless-alternative">How to Use Podman – The Daemonless Alternative</h2>
<p>Now let's try Podman. It's designed to be a drop-in replacement for Docker, but with some key differences that make it interesting for specific use cases.</p>
<h3 id="heading-why-podman-exists">Why Podman Exists</h3>
<p>Docker runs as a daemon (a background service) that, by default, runs with root privileges. This daemon always runs, listening for commands. This architecture has some downsides:</p>
<ol>
<li><p><strong>Security:</strong> By default, the Docker daemon runs as root. If someone compromises the daemon, they have root access to your entire system.</p>
</li>
<li><p><strong>Resource Usage:</strong> The daemon consumes resources even when you're not running containers.</p>
</li>
<li><p><strong>Single Point of Failure:</strong> If the daemon crashes, all your containers stop (unless you have enabled Docker's live restore option).</p>
</li>
</ol>
<p>Podman solves these problems by not using a daemon at all. Each <code>podman</code> command runs independently. This is called a "daemonless" architecture.</p>
<h3 id="heading-key-podman-features">Key Podman Features</h3>
<p>To summarize, here are some key helpful features of Podman that might make it a good fit for your projects:</p>
<ol>
<li><p><strong>No daemon required:</strong> Each command runs independently. No background service needed.</p>
</li>
<li><p><strong>Rootless by default:</strong> Containers run as your regular user, not as root. This dramatically improves security.</p>
</li>
<li><p><strong>Drop-in Docker replacement:</strong> Most Docker commands work exactly the same. You can even alias <code>docker=podman</code> and many applications won't notice the difference.</p>
</li>
<li><p><strong>Pod support:</strong> Podman has a concept of "pods" like Kubernetes. This is unique among container tools.</p>
</li>
</ol>
<p>Now that you understand the benefits of Podman, let’s see how you can use it.</p>
<h3 id="heading-how-to-install-podman">How to Install Podman</h3>
<p>Podman installation varies by operating system. Here are the official guides:</p>
<ul>
<li><p><a target="_blank" href="https://podman.io/docs/installation#macos">Podman for macOS</a></p>
</li>
<li><p><a target="_blank" href="https://podman.io/docs/installation#windows">Podman for Windows</a></p>
</li>
<li><p><a target="_blank" href="https://podman.io/docs/installation#linux">Podman for Linux</a></p>
</li>
</ul>
<p><strong>For macOS users</strong> (what we'll use in this tutorial), you can install Podman using Homebrew:</p>
<pre><code class="lang-bash">brew install podman
</code></pre>
<h3 id="heading-how-to-initialize-and-start-podman-machine">How to Initialize and Start Podman Machine</h3>
<p>On macOS, Podman needs a Linux VM to run containers (since containers use Linux kernel features). Podman Machine handles this for you:</p>
<pre><code class="lang-bash">podman machine init
</code></pre>
<p>This creates a small Linux VM. You’ll only need to do this once. The VM is about 1GB and uses minimal resources when running.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903100891/671690cc-8073-4748-b2df-c4308585d411.png" alt="Initialize podman machine" class="image--center mx-auto" width="1028" height="344" loading="lazy"></p>
<p>Start the machine:</p>
<pre><code class="lang-bash">podman machine start
</code></pre>
<p>Verify it's working:</p>
<pre><code class="lang-bash">podman --version
</code></pre>
<p>You should see something like:</p>
<pre><code class="lang-bash">podman version 5.7.1
</code></pre>
<h3 id="heading-how-to-run-containers-with-podman">How to Run Containers with Podman</h3>
<p>Here's where it gets interesting. You can use nearly identical commands to Docker. Let's build and run the same web server you created earlier:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Build the image (same command as Docker)</span>
podman build -t my-web-app .

<span class="hljs-comment"># Run the container</span>
podman run -d -p 8081:80 my-web-app

<span class="hljs-comment"># See running container</span>
podman ps
</code></pre>
<p>Notice that we used port 8081 this time so it doesn't conflict with the Docker container if it's still running. Visit <code>http://localhost:8081</code> and you'll see the same page, but this time it's running in Podman!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903417925/4717dd8f-bda5-4aaa-ad16-2a24726ee820.png" alt="Localhost running podman container" class="image--center mx-auto" width="856" height="458" loading="lazy"></p>
<p>If you run into an issue when running the <code>podman build</code> command, you can delete the Docker image using <code>docker image rm my-web-app:latest</code> and try again.</p>
<h4 id="heading-whats-different-under-the-hood">What's Different Under the Hood?</h4>
<p>Even though the commands look the same, what happens under the hood is different. First, no daemon was involved: the <code>podman</code> command directly created and started the container. Second, the container is running as your user, not as root.</p>
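<p>You can verify this by comparing the user inside the container with the user that owns the process on the host:</p>
<pre><code class="lang-bash">podman top &lt;container-id&gt; user huser
</code></pre>
<p>The <code>user</code> column shows the user inside the container, and <code>huser</code> shows the corresponding user on the host. In a rootless setup, a process that runs as <code>root</code> inside the container maps to an unprivileged user on the host (on macOS, that host is the Podman machine VM).</p>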
<h3 id="heading-podman-pods-a-unique-feature">Podman Pods – A Unique Feature</h3>
<p>Podman has a unique feature that Docker doesn't have: pods. A pod is a group of containers that share networking and storage. This is the same concept Kubernetes uses, which makes Podman excellent for local Kubernetes development.</p>
<h4 id="heading-why-pods-matter">Why Pods Matter:</h4>
<p>In real applications, you often have multiple containers that need to work together. For example, a web application typically needs a database to store data, a cache layer for frequently accessed data, and a logging container that records requests, responses, and other non-sensitive application metadata.</p>
<p>These four containers (web, database, cache, logger) need to communicate with each other. In Docker, you'd create a custom network and connect each container to it. In Podman, you can create a pod that automatically handles this networking.</p>
<h3 id="heading-how-to-create-a-podman-pod">How to Create a Podman Pod</h3>
<pre><code class="lang-bash">podman pod create --name my-app-pod -p 8082:80
</code></pre>
<p>This creates a pod named <code>my-app-pod</code> and exposes port 8082 on your host to port 80 inside the pod. Notice that you don't expose ports on individual containers – you expose them on the pod.</p>
<p>Add a web server to the pod:</p>
<pre><code class="lang-bash">podman run -d --pod my-app-pod --name web nginx:alpine
</code></pre>
<p>The <code>--pod</code> flag tells Podman to run this container inside the pod. The container doesn't need its own port mapping because the pod handles that.</p>
<p>Add Redis (an in-memory database) to the pod:</p>
<pre><code class="lang-bash">podman run -d --pod my-app-pod --name cache redis:alpine
</code></pre>
<p>Now you have two containers running in the same pod. Here's the powerful part: they share the same network namespace.</p>
<p>To check your pod:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># List all pods</span>
podman pod ps -a

<span class="hljs-comment"># Show details for one pod</span>
podman pod inspect &lt;pod-name-or-id&gt;

<span class="hljs-comment"># Check processes running in the pod</span>
podman pod top &lt;pod-name-or-id&gt;

<span class="hljs-comment"># See logs from containers in that pod</span>
podman logs &lt;container-name-or-id&gt;
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903859712/3cabe09b-d693-4adf-85bf-74115122203a.png" alt="Podman pod inspection showing container running" class="image--center mx-auto" width="1128" height="744" loading="lazy"></p>
<h4 id="heading-understanding-shared-networking">Understanding Shared Networking:</h4>
<p>Both containers can reach each other using <code>localhost</code>. The web container can connect to Redis using <code>localhost:6379</code> (Redis's default port). It's as if they're running on the same machine.</p>
<p>This is exactly how Kubernetes pods work. If you learn Podman pods, you're learning Kubernetes networking too.</p>
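<p>You can check the shared network namespace from inside the pod. This sketch assumes the <code>web</code> and <code>cache</code> containers created above are still running, and it relies on the <code>wget</code> and <code>redis-cli</code> tools that ship in the <code>redis:alpine</code> image:</p>
<pre><code class="lang-bash"># From the Redis container, fetch the Nginx welcome page over localhost
podman exec cache wget -qO- http://localhost:80

# Ping Redis over the loopback interface on its default port
podman exec cache redis-cli -h localhost -p 6379 ping
</code></pre>
<p>The first command should print the HTML served by the <code>web</code> container even though it runs in the <code>cache</code> container, and the second should answer with <code>PONG</code>.</p>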
<h3 id="heading-how-to-generate-kubernetes-yaml-from-pods">How to Generate Kubernetes YAML from Pods</h3>
<p>Here's where Podman really shines. You can generate Kubernetes-compatible YAML from your pod:</p>
<pre><code class="lang-bash">podman generate kube my-app-pod &gt; my-app-pod.yaml
</code></pre>
<p>Open <code>my-app-pod.yaml</code> and you'll see proper Kubernetes configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Save the output of this file and use kubectl create -f to import</span>
<span class="hljs-comment"># it into Kubernetes.</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Created with podman-5.7.1</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">io.kubernetes.cri-o.SandboxID/cache:</span> <span class="hljs-string">5e56bd9eab1a02a88654e3614312302d0f3f8d3652480498e6d1eef7d4824019</span>
    <span class="hljs-attr">io.kubernetes.cri-o.SandboxID/web:</span> <span class="hljs-string">5e56bd9eab1a02a88654e3614312302d0f3f8d3652480498e6d1eef7d4824019</span>
  <span class="hljs-attr">creationTimestamp:</span> <span class="hljs-string">"2026-02-12T13:44:55Z"</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">my-app-pod</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-app-pod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">args:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">nginx</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">-g</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">daemon</span> <span class="hljs-string">off;</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.io/library/nginx:alpine</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">web</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">hostPort:</span> <span class="hljs-number">8082</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">args:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">redis-server</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">docker.io/library/redis:alpine</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">cache</span>
</code></pre>
<p>This file can be deployed directly to any Kubernetes cluster:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># using minikube cluster</span>
kubectl apply -f my-app-pod.yaml
</code></pre>
<p>This is incredibly useful for local development. You can prototype your application using Podman pods, generate the YAML, and deploy to Kubernetes without rewriting anything.</p>
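<p>Before applying the file to a cluster, it can help to confirm that it round-trips. The sketch below is one possible check, assuming Podman 4.3 or later (for the <code>podman kube</code> subcommands) and a <code>kubectl</code> that is pointed at a cluster such as minikube. The function name <code>check_kube_yaml</code> is made up for this example. Note that <code>--replace</code> tears down an existing pod with the same name before recreating it from the YAML.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Hypothetical helper: replay generated YAML locally, then validate it with kubectl</span>
check_kube_yaml() {
  local yaml="$1"
  <span class="hljs-comment"># Recreate the pod from the YAML using Podman itself</span>
  podman kube play --replace "$yaml" || return 1
  <span class="hljs-comment"># Remove the pod that kube play created</span>
  podman kube down "$yaml" || return 1
  <span class="hljs-comment"># Ask kubectl to process the manifest without creating anything</span>
  kubectl apply --dry-run=client -f "$yaml"
}
</code></pre>
<p>Calling <code>check_kube_yaml my-app-pod.yaml</code> stops at the first step that fails, so a problem in the generated file shows up locally rather than in the cluster.</p>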
<h3 id="heading-how-to-manage-podman-machines">How to Manage Podman Machines</h3>
<p>When working with Podman on macOS or Windows, you're using a Linux VM. Here's how to manage it.</p>
<h4 id="heading-list-all-podman-machines">List all Podman machines:</h4>
<pre><code class="lang-bash">podman machine list
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074980607/84d2692e-11ce-4943-9187-a6a993d43c1d.png" alt="podman machine list" class="image--center mx-auto" width="1296" height="210" loading="lazy"></p>
<p>This shows all your Podman VMs, their status (running or stopped), and their names. The default machine is usually called <code>podman-machine-default</code>.</p>
<h4 id="heading-check-machine-status-and-info">Check machine status and info:</h4>
<pre><code class="lang-bash">podman machine info
</code></pre>
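<p>This displays information about the machine host: the VM provider, the default machine, how many machines exist, and the Podman version. To see the CPU, memory, and disk allocation of a specific machine, run <code>podman machine inspect</code>.</p>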
<h4 id="heading-stop-the-podman-machine">Stop the Podman machine:</h4>
<pre><code class="lang-bash">podman machine stop
</code></pre>
<p>If you have multiple machines, specify the name:</p>
<pre><code class="lang-bash">podman machine stop podman-machine-default
</code></pre>
<p>This stops the VM but preserves it. All your images and containers remain intact. When you stop the machine, all running containers inside it are stopped.</p>
<h4 id="heading-start-a-stopped-machine">Start a stopped machine:</h4>
<pre><code class="lang-bash">podman machine start
</code></pre>
<p>Or with a specific name:</p>
<pre><code class="lang-bash">podman machine start podman-machine-default
</code></pre>
<p>This restarts the VM. Your images are still there, but containers remain stopped unless you started them with a restart policy.</p>
<h4 id="heading-delete-a-podman-machine">Delete a Podman machine:</h4>
<pre><code class="lang-bash">podman machine rm podman-machine-default
</code></pre>
<p>This completely destroys the VM and all its contents (images, containers, volumes). Use this when you want to start fresh or free up disk space.</p>
<p>With this basic understanding of how Podman works, we can move on and learn how to use containerd.</p>
<h2 id="heading-how-to-work-with-containerd">How to Work with Containerd</h2>
<p>Containerd is the runtime that Docker itself uses under the hood. It's also the default runtime for most Kubernetes installations. When you run Docker, you're actually using containerd without knowing it.</p>
<h3 id="heading-why-use-containerd-directly">Why Use containerd Directly?</h3>
<p>You might wonder why you'd use containerd directly if Docker already uses it. Here are a few reasons:</p>
<ol>
<li><p><strong>Kubernetes:</strong> Most Kubernetes clusters use containerd as their container runtime. Understanding it helps you troubleshoot production issues.</p>
</li>
<li><p><strong>Minimal footprint:</strong> containerd has no UI and a deliberately small feature set, so it uses far less memory than Docker Desktop (roughly 50MB versus 2GB, though exact figures depend on version and workload).</p>
</li>
<li><p><strong>Building tools:</strong> If you're building container orchestration tools, working directly with containerd gives you fine-grained control.</p>
</li>
</ol>
<h3 id="heading-understanding-the-architecture">Understanding the Architecture</h3>
<p>The containerd architecture looks like this:</p>
<pre><code class="lang-bash">Your Command → nerdctl → containerd → runc → Container
</code></pre>
<p>In this chain, nerdctl provides a Docker-like CLI, containerd manages images and container lifecycle, and runc actually creates the container using kernel features.</p>
<h3 id="heading-how-to-install-containerd-with-nerdctl">How to Install containerd with nerdctl</h3>
<p>containerd is designed for systems (like Kubernetes) rather than direct developer use. The installation approach differs by operating system:</p>
<ul>
<li><p><a target="_blank" href="https://lima-vm.io/docs/installation/">Lima for macOS</a> (includes nerdctl)</p>
</li>
<li><p><a target="_blank" href="https://github.com/containerd/containerd/blob/main/docs/getting-started.md">containerd for Linux</a> (native installation)</p>
</li>
<li><p><a target="_blank" href="https://github.com/containerd/nerdctl/releases">nerdctl releases</a> (for all platforms)</p>
</li>
</ul>
<p><strong>For macOS users</strong> (what we'll use in this tutorial), we’ll use Lima, which provides a Linux VM with containerd and nerdctl already installed.</p>
<pre><code class="lang-bash">brew install lima
</code></pre>
<p>Lima comes with nerdctl built-in, so you don't need to install it separately.</p>
<p><strong>For Linux users</strong>, you can install containerd directly from your package manager and download nerdctl from the GitHub releases page. Containerd runs natively on Linux without needing a VM.</p>
<h3 id="heading-how-to-start-a-lima-instance">How to Start a Lima Instance</h3>
<pre><code class="lang-bash">limactl start
</code></pre>
<p>This creates a default Linux VM running containerd with nerdctl available. The VM is configured with reasonable defaults (in recent Lima releases, up to 4 CPUs and 4GiB of RAM, depending on the host, plus a 100GiB disk). You can customize these settings if needed.</p>
<p>Lima mounts your home directory inside the VM, so you can access your files. This makes working with Lima feel transparent – you don't need to copy files into the VM.</p>
<p>Verify it's working:</p>
<pre><code class="lang-bash">lima nerdctl run hello-world
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074008992/aa76dcd5-8eb5-4baf-9d72-47e1f4aa3ae3.png" alt="Containerd Lima instance and Hello-world container" class="image--center mx-auto" width="1686" height="834" loading="lazy"></p>
<h3 id="heading-how-to-run-your-app-with-nerdctl">How to Run Your App with nerdctl</h3>
<p>The commands are nearly identical to Docker. This is intentional – nerdctl aims for Docker compatibility. Since we're running through Lima, we’ll prefix commands with <code>lima</code>.</p>
<p>Navigate to your project directory:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> ~/container-demo
</code></pre>
<p>Build the image:</p>
<pre><code class="lang-bash">lima nerdctl build -t my-web-app .
</code></pre>
<p>Run the container:</p>
<pre><code class="lang-bash">lima nerdctl run -d -p 8083:80 my-web-app
</code></pre>
<p>Visit <code>http://localhost:8083</code> to see your app running on containerd!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074352767/8ee7339d-8145-494e-9bac-41bfc8f620e1.png" alt="Localhost running containerd container" class="image--center mx-auto" width="1066" height="598" loading="lazy"></p>
<h3 id="heading-whats-different-from-docker">What's Different from Docker?</h3>
<p>Under the hood, a lot is different. containerd, not dockerd, is managing your images and containers. containerd is still a long-running daemon, but it does far less than dockerd, and nerdctl talks to it directly. Images are stored differently (in containerd's own content store), though they're OCI-compliant, so they remain compatible.</p>
<p>But from your perspective as a developer, the commands feel the same. This is the power of standards like OCI.</p>
<h4 id="heading-how-to-check-running-containers-1">How to Check Running Containers:</h4>
<pre><code class="lang-bash">lima nerdctl ps
</code></pre>
<p>This shows all running containers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074426408/3b5da24c-0dad-4c8e-9ce8-fd8cb319f9f9.png" alt="Running containers" class="image--center mx-auto" width="1878" height="304" loading="lazy"></p>
<h3 id="heading-how-to-manage-lima-vms">How to Manage Lima VMs</h3>
<p>When working with containerd through Lima, you're using a Linux VM. Here's how to manage it.</p>
<h4 id="heading-list-all-lima-vms">List all Lima VMs:</h4>
<pre><code class="lang-bash">limactl list
</code></pre>
<p>This shows all your Lima VMs, their status (running or stopped), and their names. The default VM is usually called <code>default</code>.</p>
<h4 id="heading-check-vm-status-and-info">Check VM status and info:</h4>
<pre><code class="lang-bash">limactl list default
</code></pre>
<p>This shows the status, CPU count, memory, and disk size of the specified VM. To print diagnostic information about the Lima installation itself, run <code>limactl info</code>, which takes no VM name.</p>
<h4 id="heading-stop-the-lima-vm">Stop the Lima VM:</h4>
<pre><code class="lang-bash">limactl stop default
</code></pre>
<p>This stops the VM but preserves it. All your images and containers remain intact. When you stop the VM, all running containers inside it are stopped. The next time you start it, your images will still be there but containers remain stopped.</p>
<h4 id="heading-start-a-stopped-vm">Start a stopped VM:</h4>
<pre><code class="lang-bash">limactl start default
</code></pre>
<p>This restarts the VM. Your images persist across restarts, so you don't need to rebuild them.</p>
<h4 id="heading-delete-a-lima-vm">Delete a Lima VM:</h4>
<pre><code class="lang-bash">limactl delete default
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771074893694/071702bb-8a35-4681-98ec-f1375a52c5d7.png" alt="Containerd VM list and deletion" class="image--center mx-auto" width="1272" height="324" loading="lazy"></p>
<p>This completely destroys the VM and all its contents (images, containers, volumes). Use this when you want to start fresh or free up disk space. You'll need to run <code>limactl start</code> again to create a new VM.</p>
<h4 id="heading-create-a-new-vm-with-custom-settings">Create a new VM with custom settings:</h4>
<pre><code class="lang-bash">limactl start --name my-custom-vm --cpus 4 --memory 8
</code></pre>
<p>This creates a new VM with 4 CPUs and 8GB of memory. You can have multiple Lima VMs for different projects.</p>
<h2 id="heading-how-to-move-containers-between-runtimes">How to Move Containers Between Runtimes</h2>
<p>Thanks to the OCI (Open Container Initiative) standard, you can move container images between different runtimes. This is incredibly powerful – you can build with one tool and deploy with another.</p>
<h3 id="heading-why-standards-matter">Why Standards Matter</h3>
<p>Before OCI, each container runtime used its own image format. Moving images between runtimes was difficult or impossible.</p>
<p>OCI created standards for the Runtime Specification (how to run a container), the Image Specification (how to package a container image), and the Distribution Specification (how to transfer images between systems).</p>
<p>Now all major runtimes follow these standards, making images portable.</p>
<h3 id="heading-method-1-using-container-registries">Method 1 – Using Container Registries</h3>
<p>The easiest way to share images is through a container registry like Docker Hub, GitHub Container Registry, or your own private registry. Any runtime can push and pull from registries.</p>
<p>First, build with Docker:</p>
<pre><code class="lang-bash">docker build -t my-username/my-app:v1 .
</code></pre>
<p>The image name has three parts: <code>my-username</code> (your registry username), <code>my-app</code> (the application name), and <code>v1</code> (a version tag).</p>
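<p>To make the three parts concrete, here is a small sketch that splits a reference using Bash parameter expansion. It assumes the simple <code>namespace/name:tag</code> form used above. References that include a registry host, a port, or no tag would need extra handling. The variable names are illustrative only.</p>
<pre><code class="lang-bash">ref="my-username/my-app:v1"

<span class="hljs-comment"># Everything before the first slash</span>
namespace="${ref%%/*}"
<span class="hljs-comment"># Everything after the first slash: "my-app:v1"</span>
rest="${ref#*/}"
<span class="hljs-comment"># The name is the part before the colon, the tag is the part after it</span>
name="${rest%%:*}"
tag="${rest##*:}"

echo "namespace=$namespace name=$name tag=$tag"
<span class="hljs-comment"># prints: namespace=my-username name=my-app tag=v1</span>
</code></pre>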
<p>Push to Docker Hub:</p>
<pre><code class="lang-bash">docker login
docker push my-username/my-app:v1
</code></pre>
<p>You'll need to create a free Docker Hub account if you don't have one. The <code>docker login</code> command prompts for your credentials.</p>
<p>Now pull with Podman:</p>
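<pre><code class="lang-bash">podman pull docker.io/my-username/my-app:v1
</code></pre>
<p>Podman downloads the image from Docker Hub. The <code>docker.io/</code> prefix names the registry explicitly: Docker assumes Docker Hub for short names, whereas Podman, depending on its configuration, may prompt you to pick a registry or reject the short name. Even though the image was built with Docker, Podman can use it because both follow OCI standards.</p>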
<p>Or pull with nerdctl:</p>
<pre><code class="lang-bash">lima nerdctl pull my-username/my-app:v1
</code></pre>
<p>Same image, three different runtimes. This is the power of standards.</p>
<h3 id="heading-method-2-export-and-import">Method 2 – Export and Import</h3>
<p>If you don't want to use a public registry (maybe your image contains proprietary code), you can export images as tar files. This is perfect for air-gapped environments or simply moving images between machines.</p>
<p>Export from Docker:</p>
<pre><code class="lang-bash">docker save my-web-app -o my-web-app.tar
</code></pre>
<p>This creates a file called <code>my-web-app.tar</code> containing the image and all its layers. The file might be large (tens or hundreds of megabytes) depending on your image.</p>
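<p>If the archive is going to be copied to another machine, a checksum lets you confirm it arrived intact before loading it. This is a minimal sketch, assuming <code>sha256sum</code> from GNU coreutils is available (on macOS, <code>shasum -a 256</code> is the usual equivalent). The two function names are invented for this example.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Record the archive's SHA-256 next to it (run on the source machine)</span>
write_checksum() {
  sha256sum "$1" &gt; "$1.sha256"
}

<span class="hljs-comment"># Recompute and compare (run on the destination machine,</span>
<span class="hljs-comment"># in the directory that contains both files)</span>
verify_checksum() {
  sha256sum -c "$1.sha256"
}
</code></pre>
<p>For example, run <code>write_checksum my-web-app.tar</code> after <code>docker save</code>, copy both files across, then run <code>verify_checksum my-web-app.tar</code> before loading the image.</p>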
<p>Import to Podman:</p>
<pre><code class="lang-bash">podman load -i my-web-app.tar
</code></pre>
<p>Import to nerdctl:</p>
<pre><code class="lang-bash">lima nerdctl load -i my-web-app.tar
</code></pre>
<p>Now you have the same image available in all three runtimes! You can verify:</p>
<pre><code class="lang-bash">docker images
podman images  
lima nerdctl images
</code></pre>
<p>All three commands will show <code>my-web-app</code> in their image lists.</p>
<h4 id="heading-understanding-image-layers">Understanding Image Layers:</h4>
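<p>When you export an image, you're exporting all its layers. Dockerfile instructions that change the filesystem (such as <code>RUN</code>, <code>COPY</code>, and <code>ADD</code>) each create a layer, while other instructions only add metadata. Layers are identified by their content, so images built on the same base reuse the same stored layers, which saves disk space.</p>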
<p>For example, if you have 10 images all based on <code>nginx:alpine</code>, they all share the nginx layers. Only the layers unique to each image take up additional space.</p>
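<p>You can see this sharing for yourself by comparing layer digests. The sketch below assumes Docker and two images that already exist locally. <code>layer_digests</code> and <code>shared_layers</code> are hypothetical helper names, and the process substitution requires Bash. <code>podman image inspect</code> exposes the same <code>RootFS.Layers</code> field, so the same idea should carry over.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Print one layer digest per line for the given image, sorted</span>
layer_digests() {
  docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' "$1" \
    | sed '/^$/d' | sort
}

<span class="hljs-comment"># Print only the digests that both images have in common</span>
shared_layers() {
  comm -12 &lt;(layer_digests "$1") &lt;(layer_digests "$2")
}
</code></pre>
<p>Running <code>shared_layers my-web-app nginx:alpine</code> lists the layers that are stored once and referenced by both images.</p>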
<h2 id="heading-real-world-use-cases">Real-World Use Cases</h2>
<p>Let's look at some real scenarios where choosing the right runtime matters. These examples show how technical decisions have practical impacts.</p>
<h3 id="heading-use-case-1-security-first-development">Use Case 1 – Security-First Development</h3>
<p>If you're working on security-sensitive applications (financial services, healthcare, government), Podman's rootless containers are a huge advantage.</p>
<h4 id="heading-the-security-problem">The Security Problem:</h4>
<p>By default, the Docker daemon runs with root privileges (a rootless mode exists, but you have to opt in). If someone exploits a vulnerability in your container and escapes to the host system, they may end up with root access. This is called a "container escape".</p>
<p>Podman's rootless mode solves this:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># All Podman commands run as your user by default</span>
podman run --rm -it alpine whoami
</code></pre>
<p>This prints <code>root</code>, but that is only root inside the container's user namespace. On the host, the container process runs under your own unprivileged user ID, not as real root. The command uses <code>--rm</code> to remove the container when it exits (cleanup), <code>-it</code> to make it interactive with a terminal, <code>alpine</code> as a minimal Linux distribution, and <code>whoami</code> as a command that prints the current user name.</p>
<p>Even if someone breaks out of the container, they only have your user's permissions. They can't install system-wide malware, access other users' data, modify system configuration, or install kernel modules.</p>
<p>This dramatically reduces the impact of a container escape.</p>
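<p>You can observe the user-namespace mapping behind this directly. The following is a sketch, assuming a rootless Podman setup that can pull the <code>alpine</code> image. The function name is made up for illustration, and the numbers in the UID map depend on your system.</p>
<pre><code class="lang-bash">show_rootless_mapping() {
  <span class="hljs-comment"># Prints "true" when Podman is running in rootless mode</span>
  podman info --format '{{.Host.Security.Rootless}}'
  <span class="hljs-comment"># UID as seen inside the container (0, i.e. root, for alpine's default user)</span>
  podman run --rm alpine id -u
  <span class="hljs-comment"># Mapping columns: UID inside the namespace, UID outside it, range length</span>
  podman run --rm alpine cat /proc/self/uid_map
}
</code></pre>
<p>On a rootless setup, calling <code>show_rootless_mapping</code> should show container UID 0 mapped to an unprivileged ID outside the namespace, which is why a process that escapes the container does not gain real root.</p>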
<h4 id="heading-example-security-scenario">Example Security Scenario:</h4>
<p>Imagine you're running a web application that processes user uploads. A vulnerability lets an attacker execute code in your container. With Docker running as root, they could escape the container, install a rootkit, steal all data from your server, and persist even after you patch the vulnerability.</p>
<p>With Podman rootless, they might escape the container but can only access files your user can access. Gaining system-wide persistence, or affecting other users or system files, would require a separate privilege-escalation exploit.</p>
<p>The difference is dramatic.</p>
<h3 id="heading-use-case-2-testing-kubernetes-locally">Use Case 2 – Testing Kubernetes Locally</h3>
<p>Podman can generate Kubernetes YAML from running containers. This is perfect for prototyping before you commit to a Kubernetes configuration.</p>
<h4 id="heading-the-development-workflow">The Development Workflow:</h4>
<ol>
<li><p>Run your application locally with Podman</p>
</li>
<li><p>Test and iterate quickly</p>
</li>
<li><p>Generate Kubernetes YAML when it works</p>
</li>
<li><p>Deploy to a real cluster</p>
</li>
</ol>
<p>Here's a practical example. Let's say you're building a web application with a database:</p>
<p>Run your containers:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a pod (like a Kubernetes pod)</span>
podman pod create --name myapp -p 8080:80

<span class="hljs-comment"># Add web server</span>
podman run -d --pod myapp --name web nginx:alpine

<span class="hljs-comment"># Add PostgreSQL</span>
podman run -d --pod myapp --name db \
  -e POSTGRES_PASSWORD=secret \
  postgres:alpine
</code></pre>
<p>Test your application at <code>http://localhost:8080</code>. When it works, generate Kubernetes YAML:</p>
<pre><code class="lang-bash">podman generate kube myapp &gt; myapp.yaml
</code></pre>
<p>Now you can deploy <code>myapp.yaml</code> to any Kubernetes cluster:</p>
<pre><code class="lang-bash">kubectl apply -f myapp.yaml
</code></pre>
<p>This is much faster than writing Kubernetes YAML by hand and debugging in a cluster. You iterate locally, then deploy when ready.</p>
<h4 id="heading-why-this-matters">Why This Matters:</h4>
<p>Kubernetes has a steep learning curve. The YAML configuration is verbose and error-prone. By starting with simple Podman commands and generating YAML, you can focus on your application first, learn Kubernetes gradually, catch configuration errors early, and iterate quickly without cloud costs.</p>
<h3 id="heading-use-case-3-resource-constrained-environments">Use Case 3 – Resource-Constrained Environments</h3>
<p>containerd has the smallest footprint. If you're running containers on edge devices, Raspberry Pi, or resource-constrained servers, this matters a lot.</p>
<h4 id="heading-comparing-memory-usage">Comparing Memory Usage:</h4>
<p>Here are rough memory footprints for each runtime. Treat them as ballpark figures, since actual usage depends on version, configuration, and workload:</p>
<ul>
<li><p>Docker Desktop uses approximately 2GB RAM (includes the VM, daemon, UI, and Kubernetes).</p>
</li>
<li><p>Podman uses approximately 500MB RAM (includes the VM on macOS).</p>
</li>
<li><p>Containerd uses approximately 50MB RAM (just the runtime, no extras).</p>
</li>
</ul>
<p>On a developer laptop with 16GB RAM, this difference doesn't matter much. But consider these scenarios:</p>
<p><strong>1. Edge Computing:</strong></p>
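<p>You're running containers on edge devices with 1GB RAM total. Docker Desktop is a desktop application, so it isn't an option there, and a full Docker Engine install still layers dockerd on top of containerd. Running containerd directly leaves more room for your application.</p>
<p><strong>2. IoT Devices:</strong></p>
<p>A Raspberry Pi with 2GB RAM has little memory to spare for tooling. Running containerd directly keeps the runtime's share of that memory small.</p>
<p><strong>3. High-Density Servers:</strong></p>
<p>Picture a fleet of 100 servers, each running 100 containers. Every MB counts, because any per-server runtime overhead is multiplied across the fleet: at the 2GB-per-server figure above, that is 2GB × 100 servers = 200GB. Production servers normally run Docker Engine rather than Docker Desktop, so the real saving is smaller, but it scales the same way.</p>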
<p><strong>Example Setup for Edge Device:</strong></p>
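<pre><code class="lang-bash"><span class="hljs-comment"># On a Raspberry Pi or similar Debian-based device</span>
sudo apt-get update
sudo apt-get install -y containerd
<span class="hljs-comment"># nerdctl is not packaged for every distribution; if this fails,</span>
<span class="hljs-comment"># download it from the nerdctl GitHub releases page instead</span>
sudo apt-get install -y nerdctl

<span class="hljs-comment"># Now you can run containers with minimal overhead</span>
<span class="hljs-comment"># (a system-wide containerd needs root, hence sudo)</span>
sudo nerdctl run -d my-lightweight-app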
</code></pre>
<p>Your application gets to use most of the available RAM instead of competing with a heavy runtime.</p>
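<p>Memory figures like the ones above vary from system to system, so it is worth measuring on your own device. Here is a rough sketch for Linux. <code>sum_rss_mib</code> is a helper name invented here: it reads resident set sizes in KiB, one per line, and prints the total in MiB.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Sum RSS values (KiB, one per line) from stdin and print the total in MiB</span>
sum_rss_mib() {
  awk '{ total += $1 } END { printf "%.1f MiB\n", total / 1024 }'
}
</code></pre>
<p>With the procps version of <code>ps</code>, <code>ps -C containerd -o rss= | sum_rss_mib</code> reports the resident memory of the containerd daemon. Substituting <code>dockerd</code> gives a comparison on a Docker host. Resident set size counts shared pages, so read the result as an estimate.</p>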
<h2 id="heading-quick-reference-guide">Quick Reference Guide</h2>
<p>Here's a handy comparison of common commands across runtimes:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Task</td><td>Docker</td><td>Podman</td><td>nerdctl (via Lima)</td></tr>
</thead>
<tbody>
<tr>
<td>Build image</td><td><code>docker build -t app .</code></td><td><code>podman build -t app .</code></td><td><code>lima nerdctl build -t app .</code></td></tr>
<tr>
<td>Run container</td><td><code>docker run -d app</code></td><td><code>podman run -d app</code></td><td><code>lima nerdctl run -d app</code></td></tr>
<tr>
<td>List containers</td><td><code>docker ps</code></td><td><code>podman ps</code></td><td><code>lima nerdctl ps</code></td></tr>
<tr>
<td>View logs</td><td><code>docker logs &lt;id&gt;</code></td><td><code>podman logs &lt;id&gt;</code></td><td><code>lima nerdctl logs &lt;id&gt;</code></td></tr>
<tr>
<td>Stop container</td><td><code>docker stop &lt;id&gt;</code></td><td><code>podman stop &lt;id&gt;</code></td><td><code>lima nerdctl stop &lt;id&gt;</code></td></tr>
<tr>
<td>Remove container</td><td><code>docker rm &lt;id&gt;</code></td><td><code>podman rm &lt;id&gt;</code></td><td><code>lima nerdctl rm &lt;id&gt;</code></td></tr>
<tr>
<td>List images</td><td><code>docker images</code></td><td><code>podman images</code></td><td><code>lima nerdctl images</code></td></tr>
<tr>
<td>Pull image</td><td><code>docker pull nginx</code></td><td><code>podman pull nginx</code></td><td><code>lima nerdctl pull nginx</code></td></tr>
<tr>
<td>Push to registry</td><td><code>docker push app</code></td><td><code>podman push app</code></td><td><code>lima nerdctl push app</code></td></tr>
<tr>
<td>Execute in container</td><td><code>docker exec -it &lt;id&gt; sh</code></td><td><code>podman exec -it &lt;id&gt; sh</code></td><td><code>lima nerdctl exec -it &lt;id&gt; sh</code></td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">Conclusion</h2>
<p>In this guide, we’ve explored three major container runtimes and learned how to use Docker, Podman, and containerd. The container ecosystem is much bigger than just Docker, and knowing alternatives gives you more options for security, performance, and specialized use cases.</p>
<p>Use Docker when you're learning or need the best documentation. Use Podman when you need rootless security or are building CI/CD pipelines. Use containerd when you need minimal resource usage or are deploying to Kubernetes clusters.</p>
<p>Thanks to OCI standards, your containers are portable. Build with Docker, test with Podman, deploy with containerd – it all works together! You're not locked into one vendor or tool.</p>
<p>As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> and browse my <a target="_blank" href="https://github.com/Caesarsage/DevOps-Cloud-Projects">DevOps Cloud Projects</a> repository on GitHub.</p>
<p>Happy containerizing!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Production-Grade Distributed Chatroom in Go [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ If you've ever wondered how chat applications like Slack, Discord, or WhatsApp work behind the scenes, this tutorial will show you. You'll build a real-time chat server from scratch using Go, learning the fundamental concepts that power modern commun... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-production-grade-distributed-chatroom-in-go-full-handbook/</link>
                <guid isPermaLink="false">698f4ea5c4c4900d2483651c</guid>
                
                    <category>
                        <![CDATA[ distributed system ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Programming Blogs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Fri, 13 Feb 2026 16:17:41 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770999686180/3bccee22-e7e9-477f-8a5f-50800896e972.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've ever wondered how chat applications like Slack, Discord, or WhatsApp work behind the scenes, this tutorial will show you. You'll build a real-time chat server from scratch using Go, learning the fundamental concepts that power modern communication systems.</p>
<p>By the end of this guide, you'll have built a working chatroom that supports many concurrent users chatting in real-time, message persistence that survives server crashes, session management so users can reconnect after network interruptions, private messaging between users, and graceful handling of slow or disconnected clients.</p>
<p>More importantly, you'll understand the fundamental concepts behind distributed systems. You'll learn concurrent programming with goroutines and channels, TCP socket programming for network communication, write-ahead logging for data durability, state management with mutexes, and how to design systems that degrade gracefully under failure. These concepts power everything from databases to message queues to web servers.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-a-distributed-chatroom">What is a Distributed Chatroom</a>?</p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tutorial-overview">Tutorial Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-architecture-overview">Architecture Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-core-concepts-you-need-to-know">Core Concepts You Need to Know</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-the-project-structure">How to Set Up the Project Structure</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-define-core-data-types">How to Define Core Data Types</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-initialize-the-server">How to Initialize the Server</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-event-loop">How to Build the Event Loop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-handle-client-connections">How to Handle Client Connections</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-message-broadcasting">How to Implement Message Broadcasting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-add-persistence-with-wal-and-snapshots">How to Add Persistence with WAL and Snapshots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-session-management">How to Implement Session Management</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-command-system">How to Build the Command System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-the-client">How to Create the Client</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-test-your-chatroom">How to Test Your Chatroom</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-deploy-your-server">How to Deploy Your Server</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-enhancements-you-can-add">Enhancements You Can Add</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<p>The complete source code for this project is available on <a target="_blank" href="https://github.com/Caesarsage/distributed-system/tree/main/chatroom-with-broadcast">GitHub</a> if you'd like to reference it while following along.</p>
<h2 id="heading-what-is-a-distributed-chatroom">What is a Distributed Chatroom?</h2>
<p>A chatroom is a server that lets multiple users connect simultaneously and exchange messages in real-time. When we say "production-grade," we mean it includes features you'd expect in a real application: it persists data so messages aren't lost when the server restarts, it handles network failures gracefully, and it can support many concurrent users without slowing down.</p>
<p>The "distributed" aspect refers to how the system manages multiple clients connecting from different locations, all trying to send and receive messages at the same time. This introduces interesting challenges: how do you ensure everyone sees messages in the same order? How do you handle clients with slow internet connections? What happens when someone disconnects unexpectedly?</p>
<p>These aren't just theoretical problems. Every networked application deals with concurrency, state management, and failure handling. Whether you're building a chat app, a multiplayer game, a collaborative editor, or a trading platform, you'll face similar challenges. The patterns you'll learn here apply broadly across distributed systems.</p>
<p>Chat applications are excellent learning projects because they combine several challenging problems in one place. You need to manage concurrent connections safely, broadcast messages to multiple clients without blocking, handle unreliable networks, persist data durably, and ensure the system recovers gracefully from crashes. Each of these topics could be its own tutorial, but here you'll see how they work together in a real application.</p>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<p>This tutorial demonstrates several important concepts that are fundamental to building distributed systems. Here's what you'll learn:</p>
<h3 id="heading-1-tcp-socket-programming-in-go">1. TCP Socket Programming in Go</h3>
<p>You'll learn how to accept incoming TCP connections, read and write data over network sockets, and handle connection failures gracefully. These skills are essential for any networked application, from web servers to database clients.</p>
<h3 id="heading-2-concurrent-programming-with-goroutines-and-channels">2. Concurrent Programming with Goroutines and Channels</h3>
<p>Go's concurrency model is one of its strongest features. You'll see how to use goroutines to handle multiple clients simultaneously without blocking. You'll use channels to coordinate between goroutines safely, avoiding the common pitfalls of shared memory concurrency like race conditions and deadlocks.</p>
<h3 id="heading-3-state-management-in-distributed-systems">3. State Management in Distributed Systems</h3>
<p>Managing shared state across concurrent operations is tricky. You'll learn when to use mutexes versus channels, how to design lock granularity to avoid bottlenecks, and how to ensure data consistency when multiple goroutines access the same data.</p>
<h3 id="heading-4-write-ahead-logging-wal-for-durability">4. Write-Ahead Logging (WAL) for Durability</h3>
<p>Databases use WAL to ensure data isn't lost during crashes. You'll implement the same pattern, learning how to balance durability with performance. You'll see why fsync is critical, understand the trade-offs of different persistence strategies, and learn how to recover state after unexpected shutdowns.</p>
<h3 id="heading-5-session-management-and-reconnection">5. Session Management and Reconnection</h3>
<p>Networks are unreliable. Users disconnect, WiFi drops, mobile connections switch towers. You'll build a token-based session system that lets users reconnect seamlessly, preserving their chat history and identity without requiring passwords or complex authentication.</p>
<h3 id="heading-6-graceful-degradation-and-fault-tolerance">6. Graceful Degradation and Fault Tolerance</h3>
<p>Perfect reliability is impossible, so you need to design for partial failures. You'll learn how to prevent slow clients from affecting fast ones, how to continue operating when persistence fails, and how to clean up resources properly when things go wrong.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this tutorial, you should have some foundational knowledge. You don't need to be an expert, but you should be comfortable with the basics.</p>
<ul>
<li><p>Go basics (goroutines, channels, interfaces)</p>
</li>
<li><p>TCP/IP networking fundamentals</p>
</li>
<li><p>Basic concurrency concepts</p>
</li>
<li><p>File I/O operations</p>
</li>
</ul>
<h2 id="heading-tutorial-overview">Tutorial Overview</h2>
<p>This tutorial takes you through building a production-ready chatroom step by step.</p>
<p>You'll start by exploring the overall architecture to understand how components fit together. Then you'll learn about core concepts like concurrency models and persistence strategies.</p>
<p>Next, you'll set up your project structure and define the core data types that represent clients, messages, and the chatroom. Then you'll implement the server initialization and event loop, which is where all coordination happens.</p>
<p>After that, you'll build the networking layer to handle client connections, implement message broadcasting so messages reach all users, and add persistence using write-ahead logging and snapshots.</p>
<p>You'll then implement session management for reconnection, build a command system for user actions, and create a simple client application to test your server.</p>
<p>Finally, you’ll learn how to test and deploy your chatroom, and review key lessons from building a distributed system.</p>
<p>By the end, you'll have a complete, working chatroom and understand how distributed systems handle concurrency, persistence, and failure recovery.</p>
<h2 id="heading-architecture-overview">Architecture Overview</h2>
<p>The system follows a client-server architecture with internal components that work together to provide a robust chat experience.</p>
<h3 id="heading-high-level-architecture">High-Level Architecture</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770347113467/2ca609e9-c902-4311-a571-e9c3b1280786.jpeg" alt="Chatroom broadcast architecture diagram" class="image--center mx-auto" width="1994" height="2005" loading="lazy"></p>
<h3 id="heading-component-breakdown">Component Breakdown</h3>
<h4 id="heading-1-network-layer">1. <strong>Network Layer</strong></h4>
<ul>
<li><p><strong>TCP Listener</strong>: Accepts incoming connections on port 9000</p>
</li>
<li><p><strong>Connection Handler</strong>: Manages individual client connections with dedicated goroutines</p>
</li>
<li><p><strong>Protocol</strong>: Simple newline-delimited text protocol</p>
</li>
</ul>
<h4 id="heading-2-client-management">2. <strong>Client Management</strong></h4>
<p>Each client connection spawns two goroutines:</p>
<ul>
<li><p><strong>Read Goroutine</strong>: Receives messages from client</p>
</li>
<li><p><strong>Write Goroutine</strong>: Sends messages to client (non-blocking with buffered channels)</p>
</li>
</ul>
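<p>The two-goroutine split can be sketched as follows. This is a minimal illustration, not the tutorial's final code: the function names <code>readLoop</code>, <code>writeLoop</code>, and <code>echoOnce</code> are hypothetical, and <code>net.Pipe</code> stands in for a real TCP connection so the example runs without opening a port.</p>

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// readLoop forwards each newline-delimited line from conn to incoming.
// It closes incoming when the connection ends.
func readLoop(conn net.Conn, incoming chan<- string) {
	defer close(incoming)
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		incoming <- scanner.Text()
	}
}

// writeLoop drains outgoing and writes each message as one line.
func writeLoop(conn net.Conn, outgoing <-chan string) {
	for msg := range outgoing {
		if _, err := fmt.Fprintln(conn, msg); err != nil {
			return // connection is gone; stop writing
		}
	}
}

// echoOnce wires both loops to an in-memory connection and pushes one line
// through them. net.Pipe is an assumption made to keep the sketch runnable.
func echoOnce(line string) string {
	server, client := net.Pipe()
	incoming := make(chan string)
	outgoing := make(chan string, 10) // buffered, like the client's write queue

	go readLoop(server, incoming)
	go writeLoop(server, outgoing)

	go fmt.Fprintln(client, line) // the "user" sends a line
	received := <-incoming
	outgoing <- "echo: " + received

	reply, _ := bufio.NewReader(client).ReadString('\n')
	close(outgoing)
	client.Close()
	server.Close()
	return strings.TrimSpace(reply)
}

func main() {
	fmt.Println(echoOnce("hello")) // prints "echo: hello"
}
```

<p>Keeping reads and writes in separate goroutines means a client that is slow to receive data never stops the server from reading what that client sends.</p>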
<h4 id="heading-3-chatroom-core">3. <strong>ChatRoom Core</strong></h4>
<p>This is the heart of the system – a single goroutine running an event loop:</p>
<pre><code class="lang-go"><span class="hljs-keyword">for</span> {
    <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client := &lt;-cr.join:
            <span class="hljs-comment">// Handle new client</span>
        <span class="hljs-keyword">case</span> client := &lt;-cr.leave:
            <span class="hljs-comment">// Handle disconnection</span>
        <span class="hljs-keyword">case</span> message := &lt;-cr.broadcast:
            <span class="hljs-comment">// Broadcast to all clients</span>
        <span class="hljs-keyword">case</span> client := &lt;-cr.listUsers:
            <span class="hljs-comment">// Send user list</span>
        <span class="hljs-keyword">case</span> dm := &lt;-cr.directMessage:
            <span class="hljs-comment">// Handle private message</span>
    }
}
</code></pre>
<h4 id="heading-4-state-management">4. <strong>State Management</strong></h4>
<p>We have three synchronized data structures:</p>
<ul>
<li><p><code>clients map[*Client]bool</code>: Active connections (mutex-protected)</p>
</li>
<li><p><code>sessions map[string]*SessionInfo</code>: User sessions for reconnection</p>
</li>
<li><p><code>messages []Message</code>: In-memory message history</p>
</li>
</ul>
<h4 id="heading-5-persistence-layer">5. <strong>Persistence Layer</strong></h4>
<p>Two-tier approach:</p>
<ul>
<li><p><strong>Write-Ahead Log (WAL)</strong>: Immediate append-only log for durability</p>
</li>
<li><p><strong>Snapshots</strong>: Periodic full state dumps for faster recovery</p>
</li>
</ul>
<h4 id="heading-6-session-management">6. <strong>Session Management</strong></h4>
<p>This enables reconnection with token-based authentication:</p>
<ul>
<li><p>Generates unique tokens per user</p>
</li>
<li><p>1-hour session timeout</p>
</li>
<li><p>Preserves chat history for returning users</p>
</li>
</ul>
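<p>As a rough sketch of the token idea, the snippet below generates a random token and checks the 1-hour timeout. The helper names <code>newToken</code> and <code>expired</code> and the 16-byte token length are assumptions made for illustration, not the tutorial's actual implementation.</p>

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// sessionTimeout mirrors the 1-hour timeout listed above.
const sessionTimeout = time.Hour

// newToken returns 16 random bytes encoded as a 32-character hex string.
// The length is an assumption for this sketch.
func newToken() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// expired reports whether a session last seen at lastSeen has timed out at now.
func expired(lastSeen, now time.Time) bool {
	return now.Sub(lastSeen) > sessionTimeout
}

func main() {
	token, err := newToken()
	if err != nil {
		fmt.Println("token error:", err)
		return
	}
	now := time.Now()
	fmt.Println("token length:", len(token))
	fmt.Println("expired after 30m:", expired(now.Add(-30*time.Minute), now))
	fmt.Println("expired after 2h:", expired(now.Add(-2*time.Hour), now))
}
```

<p>Using <code>crypto/rand</code> rather than <code>math/rand</code> matters here because the token is the only thing that identifies a returning user, so it should not be guessable.</p>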
<h3 id="heading-message-flow">Message Flow</h3>
<p>Here's how a message travels through the system:</p>
<pre><code class="lang-bash">User Input → Client Read → Server Receive → Broadcast Channel 
    → ChatRoom Loop → Persist to WAL → Fan-out to All Clients
    → Client Write Goroutines → TCP Send → User Display
</code></pre>
<p>The broadcast channel acts as a synchronization point, ensuring total message ordering.</p>
<h2 id="heading-core-concepts-you-need-to-know">Core Concepts You Need to Know</h2>
<h3 id="heading-understanding-the-concurrency-model">Understanding the Concurrency Model</h3>
<p>This chatroom uses Go's CSP (Communicating Sequential Processes) model. This is a fundamentally different approach to concurrency than you might be used to from other languages.</p>
<p>In traditional concurrent programming, you protect shared memory with locks (mutexes). Multiple threads access the same data structure, and you use locks to ensure only one thread modifies it at a time. This works, but it's error-prone. Forget a lock, and you have a race condition. Acquire two locks in inconsistent orders, and you have a deadlock. Hold a lock too long, and every other thread stalls waiting for it.</p>
<p>Go encourages a different approach: instead of communicating by sharing memory, you share memory by communicating. You pass data between goroutines through channels. Only one goroutine owns the data at a time, eliminating many concurrency bugs by design.</p>
<p>Channels provide several advantages. They eliminate most race conditions by design, because if only one goroutine owns the data at a time, there's no race to access it. They provide natural flow control since channels can block when full (back pressure) or block when empty (waiting for data). They make it easier to reason about message flow because you can trace how data moves through your system by following the channels. And they offer better composability since you can combine channels with select statements to coordinate multiple operations.</p>
<p>That said, we’ll still use mutexes in this project. Channels aren't always the right tool. We’ll use mutexes when multiple goroutines need quick, frequent access to shared data structures like maps. And we’ll use channels when we want to coordinate behavior or transfer ownership of data.</p>
<p>Here's how the chatroom uses channels to coordinate everything:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> ChatRoom <span class="hljs-keyword">struct</span> {
    join          <span class="hljs-keyword">chan</span> *Client        <span class="hljs-comment">// New connections</span>
    leave         <span class="hljs-keyword">chan</span> *Client        <span class="hljs-comment">// Disconnections</span>
    broadcast     <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>         <span class="hljs-comment">// Messages to all</span>
    listUsers     <span class="hljs-keyword">chan</span> *Client        <span class="hljs-comment">// User list requests</span>
    directMessage <span class="hljs-keyword">chan</span> DirectMessage  <span class="hljs-comment">// Private messages</span>

    <span class="hljs-comment">// Shared state (mutex-protected)</span>
    clients    <span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>
    mu         sync.Mutex

    <span class="hljs-comment">// Message history (separate mutex)</span>
    messages   []Message
    messageMu  sync.Mutex
}
</code></pre>
<p>Notice that we have five channels for different types of events. The main event loop receives from all these channels using a select statement. This means all state changes happen sequentially in one place, making the system much easier to reason about.</p>
<p>We could have used one channel that accepts different message types, but separate channels make the code clearer. When you send to <code>chatRoom.join</code>, it's obvious what you're doing. When you send to <code>chatRoom.broadcast</code>, same thing.</p>
<p>The mutexes protect data that many goroutines read frequently. The <code>clients</code> map needs to be accessed every time we broadcast a message. Using a mutex for quick read access is more efficient than passing the entire map through a channel.</p>
<h3 id="heading-understanding-the-persistence-strategy">Understanding the Persistence Strategy</h3>
<p>When your server crashes (and it will eventually), you need to recover the chat history. Users expect their messages to be there when the server restarts. But persistence is expensive: writing to disk is thousands of times slower than writing to memory. So you need a strategy that balances durability with performance.</p>
<p>We’ll use a two-tier approach that's similar to what real databases use: WAL (Write-ahead log) and snapshots.</p>
<p>The WAL is your primary durability mechanism. Here's how it works: every message is immediately appended to a file called <code>messages.wal</code>. This file is append-only, which means we only write to the end. Append-only writes are fast because the disk doesn't need to seek to different locations.</p>
<p>Each message is written as a single line of JSON. After writing each message, we call <code>fsync</code>. This tells the operating system to actually write the data to the physical disk right now, not just buffer it in memory. Without fsync, the OS might lose your data if the power fails before it gets around to writing.</p>
<p>The WAL is append-only and never modified. This makes it very reliable. If the server crashes mid-write, the worst case is one corrupted line at the end, which we can detect and skip during recovery.</p>
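<p>Here is a minimal sketch of that append-and-sync step. The <code>Message</code> type is a trimmed-down stand-in for the tutorial's type, and <code>appendToWAL</code> and <code>walDemo</code> are hypothetical names. The demo writes to a temporary directory rather than the real data directory.</p>

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Message is a trimmed-down version of the tutorial's Message type.
type Message struct {
	ID      int    `json:"id"`
	From    string `json:"from"`
	Content string `json:"content"`
}

// appendToWAL writes m as one JSON line and flushes it to disk.
func appendToWAL(wal *os.File, m Message) error {
	line, err := json.Marshal(m)
	if err != nil {
		return err
	}
	if _, err := wal.Write(append(line, '\n')); err != nil {
		return err
	}
	return wal.Sync() // File.Sync performs the fsync
}

// walDemo appends two messages to a WAL in a temporary directory and
// returns how many lines the file contains afterwards.
func walDemo() (int, error) {
	dir, err := os.MkdirTemp("", "wal-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "messages.wal")
	wal, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return 0, err
	}
	defer wal.Close()

	for i, text := range []string{"hello", "anyone here?"} {
		msg := Message{ID: i + 1, From: "alice", Content: text}
		if err := appendToWAL(wal, msg); err != nil {
			return 0, err
		}
	}

	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return bytes.Count(data, []byte("\n")), nil
}

func main() {
	lines, err := walDemo()
	fmt.Println(lines, err) // prints "2 <nil>"
}
```

<p>Opening the file with <code>os.O_APPEND</code> means every write lands at the end of the file, which is what keeps the log append-only.</p>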
<p>The problem with a write-ahead log is that it grows forever. If you have a million messages, you need to replay a million log entries every time you restart the server. That's slow.</p>
<p>Snapshots solve this problem. Every 5 minutes, if the in-memory history holds more than 100 messages, we write the entire message history to a separate file called <code>snapshot.json</code>. This is the complete state of the chat at that moment.</p>
<p>After creating a snapshot, we truncate (empty) the WAL. New messages continue to append to the WAL, but now we only need to replay messages since the last snapshot.</p>
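<p>The snapshot-then-truncate sequence might look like the sketch below. Two things here are assumptions rather than the tutorial's confirmed behavior: the helper names (<code>writeSnapshot</code>, <code>snapshotDemo</code>), and the choice to write to a temporary file and rename it, which is added so that a crash mid-write cannot leave a half-written <code>snapshot.json</code> behind.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Message is a trimmed-down version of the tutorial's Message type.
type Message struct {
	ID      int    `json:"id"`
	From    string `json:"from"`
	Content string `json:"content"`
}

// writeSnapshot saves msgs to snapshot.json and then empties the WAL.
func writeSnapshot(dir string, msgs []Message, wal *os.File) error {
	data, err := json.Marshal(msgs)
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, "snapshot.json.tmp")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, filepath.Join(dir, "snapshot.json")); err != nil {
		return err
	}
	// Truncate only after the snapshot is safely in place.
	return wal.Truncate(0)
}

// snapshotDemo returns the WAL size and the snapshot's message count
// after taking one snapshot in a temporary directory.
func snapshotDemo() (int64, int, error) {
	dir, err := os.MkdirTemp("", "snapshot-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	walPath := filepath.Join(dir, "messages.wal")
	wal, err := os.OpenFile(walPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return 0, 0, err
	}
	defer wal.Close()
	if _, err := wal.WriteString("{\"id\":1}\n{\"id\":2}\n"); err != nil {
		return 0, 0, err
	}

	history := []Message{
		{ID: 1, From: "alice", Content: "hi"},
		{ID: 2, From: "bob", Content: "hey"},
	}
	if err := writeSnapshot(dir, history, wal); err != nil {
		return 0, 0, err
	}

	info, err := os.Stat(walPath)
	if err != nil {
		return 0, 0, err
	}
	raw, err := os.ReadFile(filepath.Join(dir, "snapshot.json"))
	if err != nil {
		return 0, 0, err
	}
	var restored []Message
	if err := json.Unmarshal(raw, &restored); err != nil {
		return 0, 0, err
	}
	return info.Size(), len(restored), nil
}

func main() {
	walSize, count, err := snapshotDemo()
	fmt.Println(walSize, count, err) // prints "0 2 <nil>"
}
```

<p>The order matters: if the WAL were truncated before the snapshot was complete, a crash between the two steps would lose messages.</p>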
<p>When the server starts, it first loads the snapshot file (if it exists). This gives us the state from the last snapshot, which might be 100,000 messages. Loading this takes about 100ms. Then it replays all entries from the WAL. This gives us messages written since the last snapshot, which might be only 50 messages. Replaying this takes milliseconds. Finally, it resumes normal operation.</p>
<p>Total recovery time is a few hundred milliseconds instead of several minutes.</p>
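<p>The recovery sequence can be sketched like this: load the snapshot if it exists, then replay the WAL line by line and skip any line that doesn't parse. The names <code>recoverMessages</code> and <code>recoverDemo</code> are hypothetical, the <code>Message</code> type is simplified, and unlike the tutorial's server this sketch returns an error on a corrupt snapshot and leaves the decision to the caller.</p>

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// Message is a trimmed-down version of the tutorial's Message type.
type Message struct {
	ID      int    `json:"id"`
	From    string `json:"from"`
	Content string `json:"content"`
}

// recoverMessages rebuilds the history from snapshot.json plus messages.wal.
func recoverMessages(dir string) ([]Message, error) {
	var msgs []Message

	// Step 1: load the snapshot, if there is one.
	snap, err := os.ReadFile(filepath.Join(dir, "snapshot.json"))
	switch {
	case err == nil:
		if err := json.Unmarshal(snap, &msgs); err != nil {
			return nil, fmt.Errorf("corrupt snapshot: %w", err)
		}
	case !errors.Is(err, os.ErrNotExist):
		return nil, err
	}

	// Step 2: replay WAL entries written since that snapshot.
	wal, err := os.Open(filepath.Join(dir, "messages.wal"))
	if errors.Is(err, os.ErrNotExist) {
		return msgs, nil
	}
	if err != nil {
		return nil, err
	}
	defer wal.Close()

	// bufio.Scanner rejects lines over 64 KiB by default, which is
	// assumed to be plenty for chat messages.
	scanner := bufio.NewScanner(wal)
	for scanner.Scan() {
		var m Message
		if err := json.Unmarshal(scanner.Bytes(), &m); err != nil {
			continue // partial or corrupt line, for example from a crash mid-write
		}
		msgs = append(msgs, m)
	}
	return msgs, scanner.Err()
}

// recoverDemo builds a snapshot and a WAL with a truncated final line,
// then returns the recovered message count and the last message ID.
func recoverDemo() (int, int, error) {
	dir, err := os.MkdirTemp("", "recover-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	snapshot := `[{"id":1,"from":"alice","content":"hi"},{"id":2,"from":"bob","content":"hey"}]`
	wal := `{"id":3,"from":"alice","content":"back again"}` + "\n" + `{"id":4,"from":"bo`
	if err := os.WriteFile(filepath.Join(dir, "snapshot.json"), []byte(snapshot), 0o644); err != nil {
		return 0, 0, err
	}
	if err := os.WriteFile(filepath.Join(dir, "messages.wal"), []byte(wal), 0o644); err != nil {
		return 0, 0, err
	}

	msgs, err := recoverMessages(dir)
	if err != nil || len(msgs) == 0 {
		return len(msgs), 0, err
	}
	return len(msgs), msgs[len(msgs)-1].ID, nil
}

func main() {
	count, lastID, err := recoverDemo()
	fmt.Println(count, lastID, err) // prints "3 3 <nil>"
}
```

<p>The truncated fourth entry is dropped, so recovery ends with the two snapshot messages plus the one complete WAL entry.</p>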
<p>This two-tier system gives us the best of both worlds: fast writes during normal operation with the append-only WAL, fast recovery after crashes with snapshot plus small WAL replay, guaranteed durability through fsync after every message, and bounded recovery time because the WAL never grows too large.</p>
<p>The trade-off is that snapshots use more disk space temporarily since you have both the snapshot and the WAL. But disk space is cheap, and correctness is expensive.</p>
<p>Now that you understand the key concepts behind the chatroom's design, it's time to start building. You'll begin by setting up your project structure and creating the necessary directories and files.</p>
<h2 id="heading-how-to-set-up-the-project-structure">How to Set Up the Project Structure</h2>
<p>First, create the directory structure for your project. You'll create the files inside these directories as we walk through the tutorial:</p>
<pre><code class="lang-bash">mkdir -p chatroom-with-broadcast/cmd/server
mkdir -p chatroom-with-broadcast/cmd/client
mkdir -p chatroom-with-broadcast/internal/chatroom
mkdir -p chatroom-with-broadcast/pkg/token
mkdir -p chatroom-with-broadcast/chatdata
<span class="hljs-built_in">cd</span> chatroom-with-broadcast
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770347289299/82f067a8-2cd0-49f0-8338-a002846e618b.png" alt="Chatroom project structure" class="image--center mx-auto" width="388" height="788" loading="lazy"></p>
<p>Then initialize the Go module.</p>
<p>Note that this tutorial targets Go 1.23.2. Earlier versions might work, but the code examples assume features available in Go 1.23 and above.</p>
<pre><code class="lang-bash">go mod init github.com/yourusername/chatroom
</code></pre>
<p>Your <code>go.mod</code> file should look like this:</p>
<pre><code class="lang-go">module github.com/yourusername/chatroom

<span class="hljs-keyword">go</span> <span class="hljs-number">1.23</span><span class="hljs-number">.2</span>
</code></pre>
<p>With your project structure in place, you're ready to start writing code. The first step is defining the data types that will represent the core components of your chatroom: messages, clients, and the chatroom itself.</p>
<h2 id="heading-how-to-define-core-data-types">How to Define Core Data Types</h2>
<p>Create a new file <code>internal/chatroom/types.go</code> to define your core data structures. These types form the foundation of your chatroom, so it's important to understand what each one represents and why it's designed the way it is.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"sync"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-comment">// Message represents a single chat message with metadata</span>
<span class="hljs-keyword">type</span> Message <span class="hljs-keyword">struct</span> {
    ID        <span class="hljs-keyword">int</span>       <span class="hljs-string">`json:"id"`</span>
    From      <span class="hljs-keyword">string</span>    <span class="hljs-string">`json:"from"`</span>
    Content   <span class="hljs-keyword">string</span>    <span class="hljs-string">`json:"content"`</span>
    Timestamp time.Time <span class="hljs-string">`json:"timestamp"`</span>
    Channel   <span class="hljs-keyword">string</span>    <span class="hljs-string">`json:"channel"`</span> <span class="hljs-comment">// "global" or "private:username"</span>
}

<span class="hljs-comment">// Client represents a connected user</span>
<span class="hljs-keyword">type</span> Client <span class="hljs-keyword">struct</span> {
    conn         net.Conn      <span class="hljs-comment">// TCP connection</span>
    username     <span class="hljs-keyword">string</span>        <span class="hljs-comment">// Display name</span>
    outgoing     <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>   <span class="hljs-comment">// Buffered channel for writes</span>
    lastActive   time.Time     <span class="hljs-comment">// For idle detection</span>
    messagesSent <span class="hljs-keyword">int</span>           <span class="hljs-comment">// Statistics</span>
    messagesRecv <span class="hljs-keyword">int</span>
    isSlowClient <span class="hljs-keyword">bool</span>          <span class="hljs-comment">// Testing flag</span>

    reconnectToken <span class="hljs-keyword">string</span>
    mu             sync.Mutex   <span class="hljs-comment">// Protects stats fields</span>
}

<span class="hljs-comment">// ChatRoom is the central coordinator</span>
<span class="hljs-keyword">type</span> ChatRoom <span class="hljs-keyword">struct</span> {
    <span class="hljs-comment">// Communication channels</span>
    join          <span class="hljs-keyword">chan</span> *Client
    leave         <span class="hljs-keyword">chan</span> *Client
    broadcast     <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>
    listUsers     <span class="hljs-keyword">chan</span> *Client
    directMessage <span class="hljs-keyword">chan</span> DirectMessage

    <span class="hljs-comment">// State</span>
    clients       <span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>
    mu            sync.Mutex
    totalMessages <span class="hljs-keyword">int</span>
    startTime     time.Time

    <span class="hljs-comment">// Message history</span>
    messages      []Message
    messageMu     sync.Mutex
    nextMessageID <span class="hljs-keyword">int</span>

    <span class="hljs-comment">// Persistence</span>
    walFile       *os.File
    walMu         sync.Mutex
    dataDir       <span class="hljs-keyword">string</span>

    <span class="hljs-comment">// Sessions</span>
    sessions      <span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]*SessionInfo
    sessionsMu    sync.Mutex
}

<span class="hljs-comment">// SessionInfo tracks reconnection data</span>
<span class="hljs-keyword">type</span> SessionInfo <span class="hljs-keyword">struct</span> {
    Username       <span class="hljs-keyword">string</span>
    ReconnectToken <span class="hljs-keyword">string</span>
    LastSeen       time.Time
    CreatedAt      time.Time
}

<span class="hljs-comment">// DirectMessage represents a private message</span>
<span class="hljs-keyword">type</span> DirectMessage <span class="hljs-keyword">struct</span> {
    toClient *Client
    message  <span class="hljs-keyword">string</span>
}
</code></pre>
<h4 id="heading-understanding-the-message-type">Understanding the Message Type</h4>
<p>The <code>Message</code> struct stores everything we need to know about a chat message. The <code>ID</code> field uniquely identifies each message and ensures messages stay in order. The <code>Timestamp</code> lets us show when messages were sent, which is important for chat history.</p>
<p>The <code>Channel</code> field is interesting. Right now, we only use "global" for public messages, but this design lets us add private channels or chat rooms later without changing the data structure. Good data structures anticipate future needs.</p>
<h4 id="heading-understanding-the-client-type">Understanding the Client Type</h4>
<p>Each connected user is represented by a <code>Client</code> struct. The <code>conn</code> field is their TCP connection – this is how we send and receive data.</p>
<p>The <code>outgoing</code> channel is crucial for performance. It's a <code>chan string</code> that we'll create with a buffer of 10, so up to 10 messages can be queued for this client without blocking the sender. If a client is slow to read, we can keep sending to other clients.</p>
<p>Without this buffer, one slow client would block the entire broadcast. The buffer alone only postpones that problem, though: once it fills up, an ordinary send would block again. Pairing the buffer with a non-blocking send means a client whose queue is full simply misses that message, which is much better than slowing everyone down.</p>
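<p>A non-blocking send on a buffered channel can be sketched with <code>select</code> and a <code>default</code> case. The helper name <code>trySend</code> is hypothetical, and the demo uses no reader at all so the drop behavior is easy to see.</p>

```go
package main

import "fmt"

// trySend queues msg for a client without ever blocking.
// It returns false when the client's buffer is full and the message is dropped.
func trySend(outgoing chan string, msg string) bool {
	select {
	case outgoing <- msg:
		return true
	default:
		return false
	}
}

func main() {
	outgoing := make(chan string, 10) // same buffer size as described above

	delivered, dropped := 0, 0
	for i := 0; i < 12; i++ {
		if trySend(outgoing, fmt.Sprintf("message %d", i)) {
			delivered++
		} else {
			dropped++
		}
	}
	// Nothing is reading from the channel, so only the first 10 sends fit.
	fmt.Printf("delivered=%d dropped=%d\n", delivered, dropped) // delivered=10 dropped=2
}
```

<p>The <code>default</code> case is what makes the send non-blocking: when the buffer has no room, <code>select</code> takes that branch immediately instead of waiting.</p>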
<p>The <code>lastActive</code> timestamp helps us detect idle users. If someone hasn't sent a message in 5 minutes, we can disconnect them to free up resources.</p>
<p>The <code>mu</code> mutex protects the statistics fields. Multiple goroutines will update <code>messagesSent</code> and <code>messagesRecv</code>, so we need a mutex to prevent race conditions.</p>
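<p>To illustrate why the counters need a lock, here is a small sketch in which several goroutines update one counter. The <code>stats</code> type and its methods are hypothetical stand-ins for the statistics fields on <code>Client</code>.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// stats is a stand-in for the statistics fields on Client.
type stats struct {
	mu           sync.Mutex
	messagesSent int
}

// recordSent increments the counter while holding the lock.
func (s *stats) recordSent() {
	s.mu.Lock()
	s.messagesSent++
	s.mu.Unlock()
}

// sent returns the current count while holding the lock.
func (s *stats) sent() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.messagesSent
}

// countConcurrently has several goroutines update the same counter.
func countConcurrently(goroutines, perGoroutine int) int {
	var s stats
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perGoroutine; j++ {
				s.recordSent()
			}
		}()
	}
	wg.Wait()
	return s.sent()
}

func main() {
	fmt.Println(countConcurrently(10, 100)) // prints 1000
}
```

<p>If you remove the lock and run the program with <code>go run -race</code>, the race detector reports the unsynchronized writes to the counter.</p>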
<h4 id="heading-understanding-the-chatroom-type">Understanding the ChatRoom Type</h4>
<p>This is the heart of the system. Notice that we have two kinds of fields: channels and protected state.</p>
<p>The five channels (<code>join</code>, <code>leave</code>, <code>broadcast</code>, <code>listUsers</code>, <code>directMessage</code>) are how different parts of the system communicate with the main event loop. When a new client connects, we send them to the <code>join</code> channel. When someone sends a message, it goes to the <code>broadcast</code> channel.</p>
<p>These channels are unbuffered (capacity 0) because we want synchronization. When you send to an unbuffered channel, you block until someone receives. This ensures the event loop processes events in order.</p>
<p>The protected state (maps and slices) needs mutexes because multiple goroutines access it. Notice that we use separate mutexes for different data. The <code>mu</code> mutex protects the <code>clients</code> map. The <code>messageMu</code> mutex protects the <code>messages</code> slice. The <code>sessionsMu</code> mutex protects the <code>sessions</code> map.</p>
<p>Why separate mutexes? Performance. If we used one mutex for everything, broadcasting a message would lock all the data, preventing new clients from joining. Separate mutexes mean different operations can happen concurrently.</p>
<p>The WAL file (<code>walFile</code>) also has its own mutex (<code>walMu</code>) because writing to disk is slow. We don't want to hold the main mutex while waiting for disk I/O.</p>
<p>With your data types defined, the next step is creating a function to initialize the server. This function will set up all your data structures, restore any persisted state from previous runs, and start background workers.</p>
<h2 id="heading-how-to-initialize-the-server">How to Initialize the Server</h2>
<p>Server initialization is critical because you need to set up all your data structures in the right order. If you restore state after opening the WAL, you might replay messages twice. If you start accepting connections before loading history, users won't see old messages.</p>
<p>Create a file <code>internal/chatroom/run.go</code> to bootstrap the server:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">NewChatRoom</span><span class="hljs-params">(dataDir <span class="hljs-keyword">string</span>)</span> <span class="hljs-params">(*ChatRoom, error)</span></span> {
    cr := &amp;ChatRoom{
        clients:       <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>),
        join:          <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Client),
        leave:         <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Client),
        broadcast:     <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>),
        listUsers:     <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> *Client),
        directMessage: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> DirectMessage),
        sessions:      <span class="hljs-built_in">make</span>(<span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]*SessionInfo),
        messages:      <span class="hljs-built_in">make</span>([]Message, <span class="hljs-number">0</span>),
        startTime:     time.Now(),
        dataDir:       dataDir,
    }

    <span class="hljs-comment">// Restore from snapshot if available</span>
    <span class="hljs-keyword">if</span> err := cr.loadSnapshot(); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Failed to load snapshot: %v\n"</span>, err)
    }

    <span class="hljs-comment">// Initialize WAL for new messages</span>
    <span class="hljs-keyword">if</span> err := cr.initializePersistence(); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>, err
    }

    <span class="hljs-comment">// Start background snapshot worker</span>
    <span class="hljs-keyword">go</span> cr.periodicSnapshots()

    <span class="hljs-keyword">return</span> cr, <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">periodicSnapshots</span><span class="hljs-params">()</span></span> {
    ticker := time.NewTicker(<span class="hljs-number">5</span> * time.Minute)
    <span class="hljs-keyword">defer</span> ticker.Stop()

    <span class="hljs-keyword">for</span> <span class="hljs-keyword">range</span> ticker.C {
        cr.messageMu.Lock()
        messageCount := <span class="hljs-built_in">len</span>(cr.messages)
        cr.messageMu.Unlock()

        <span class="hljs-keyword">if</span> messageCount &gt; <span class="hljs-number">100</span> {
            <span class="hljs-keyword">if</span> err := cr.createSnapshot(); err != <span class="hljs-literal">nil</span> {
                fmt.Printf(<span class="hljs-string">"Snapshot failed: %v\n"</span>, err)
            }
        }
    }
}
</code></pre>
<p>Let's break down what happens during initialization:</p>
<h4 id="heading-1-creating-data-structures">1. Creating Data Structures</h4>
<p>We start by creating all the maps and channels. The <code>make</code> function initializes these properly. For maps, this creates an empty map ready to use. For channels, this creates an unbuffered channel (capacity 0).</p>
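<p>Notice we create the <code>messages</code> slice with <code>make([]Message, 0)</code>, which gives an empty, non-nil slice of length 0. Appending works the same way on a <code>nil</code> slice, so this isn't a performance optimization. The practical difference is that an empty slice encodes to JSON as <code>[]</code> while a <code>nil</code> slice encodes as <code>null</code>, which matters if the history is ever serialized while it's still empty.</p>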
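<p>One observable difference between a <code>nil</code> slice and an empty slice is how each encodes to JSON. The short sketch below shows it, using a simplified <code>Message</code> type and a hypothetical <code>encode</code> helper.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is a trimmed-down version of the tutorial's Message type.
type Message struct {
	ID      int    `json:"id"`
	Content string `json:"content"`
}

// encode returns the JSON form of a message history.
func encode(history []Message) string {
	data, err := json.Marshal(history)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

func main() {
	var nilHistory []Message           // nil slice
	emptyHistory := make([]Message, 0) // empty, non-nil slice

	fmt.Println(encode(nilHistory))   // prints null
	fmt.Println(encode(emptyHistory)) // prints []

	// Appending behaves the same for both.
	nilHistory = append(nilHistory, Message{ID: 1, Content: "hi"})
	emptyHistory = append(emptyHistory, Message{ID: 1, Content: "hi"})
	fmt.Println(len(nilHistory), len(emptyHistory)) // prints 1 1
}
```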
<h4 id="heading-2-loading-the-snapshot">2. Loading the Snapshot</h4>
<p>Before we accept any connections, we try to load a snapshot from disk. This restores the chat history from the last time the server ran. If the snapshot doesn't exist (first run) or fails to load (corrupted file), we just continue with an empty history.</p>
<p>This step must happen before initializing the WAL. If we opened the WAL first, we might replay messages that are already in the snapshot, creating duplicates.</p>
<h4 id="heading-3-initializing-the-wal">3. Initializing the WAL</h4>
<p>The <code>initializePersistence()</code> function opens the WAL file in append mode. It also replays any entries in the WAL that happened after the last snapshot. This ensures we don't lose any messages that were written to the WAL but not yet included in a snapshot.</p>
<p>If this step fails, we return an error and refuse to start. Why? Because if we can't write to the WAL, we can't guarantee durability. It's better to refuse to start than to lie to users by accepting messages we can't persist.</p>
<h4 id="heading-4-starting-background-workers">4. Starting Background Workers</h4>
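<p>The <code>periodicSnapshots()</code> function runs in a separate goroutine. It wakes up every 5 minutes and checks if we need to create a snapshot. Notice the <code>defer ticker.Stop()</code>. Stopping a ticker releases its timer resources, and before Go 1.23 a ticker that was never stopped could not be garbage collected. In this function the loop runs for the lifetime of the server, so the deferred call only takes effect if you later add a way to exit the loop, but it's a good habit to keep.</p>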
<p>The goroutine acquires the <code>messageMu</code> lock just to read the message count, then releases it immediately. We don't hold the lock during the snapshot creation because that's slow and would block message broadcasting.</p>
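<p>The loop in <code>periodicSnapshots()</code> runs until the process exits. If you want a worker that can be shut down cleanly, one common shape is to select on the ticker and a <code>done</code> channel. The sketch below is an assumption about how that could look, not part of the tutorial's code, and the names <code>runEvery</code> and <code>tickDemo</code> are hypothetical.</p>

```go
package main

import (
	"fmt"
	"time"
)

// runEvery calls task on every tick until done is closed.
func runEvery(interval time.Duration, done <-chan struct{}, task func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // runs when the loop returns

	for {
		select {
		case <-ticker.C:
			task()
		case <-done:
			return
		}
	}
}

// tickDemo runs a short-interval worker for a moment, stops it,
// and returns how many times the task ran.
func tickDemo() int {
	done := make(chan struct{})
	finished := make(chan struct{})
	count := 0

	go func() {
		defer close(finished)
		runEvery(10*time.Millisecond, done, func() { count++ })
	}()

	time.Sleep(100 * time.Millisecond)
	close(done) // ask the worker to stop
	<-finished  // wait until it has actually returned
	return count
}

func main() {
	fmt.Println("task ran", tickDemo(), "times")
}
```

<p>Waiting on <code>finished</code> before reading <code>count</code> ensures the worker goroutine is no longer touching it. The exact number of ticks depends on timing, so the example doesn't assume a specific count.</p>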
<h4 id="heading-why-5-minutes-and-100-messages">Why 5 Minutes and 100 Messages?</h4>
<p>These are tunable parameters. The 5-minute interval caps how much the WAL has to replay on recovery: once snapshots are being taken, that's at most about 5 minutes of messages. The 100-message threshold means a nearly empty chat doesn't write snapshots it doesn't need.</p>
<p>In a production system, you might make these configurable. A high-traffic chat might want shorter intervals. A low-traffic chat might want longer intervals to reduce disk I/O.</p>
<p>Now that your server is initialized with all the necessary data structures and background workers, you need to build the core coordination mechanism. The event loop is where all state changes happen in your chatroom. It's the heartbeat that keeps everything synchronized.</p>
<h2 id="heading-how-to-build-the-event-loop">How to Build the Event Loop</h2>
<p>The event loop is the heart of your chatroom. Every client connection, message, and disconnection flows through this single point. This might seem like it could be a bottleneck, but it's actually what makes the system simple and safe.</p>
<p>The <code>Run()</code> method implements this loop. Add this to <code>run.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">Run</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"ChatRoom heartbeat started..."</span>)
    <span class="hljs-keyword">go</span> cr.cleanupInactiveClients()

    <span class="hljs-keyword">for</span> {
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client := &lt;-cr.join:
            cr.handleJoin(client)

        <span class="hljs-keyword">case</span> client := &lt;-cr.leave:
            cr.handleLeave(client)

        <span class="hljs-keyword">case</span> message := &lt;-cr.broadcast:
            cr.handleBroadcast(message)

        <span class="hljs-keyword">case</span> client := &lt;-cr.listUsers:
            cr.sendUserList(client)

        <span class="hljs-keyword">case</span> dm := &lt;-cr.directMessage:
            cr.handleDirectMessage(dm)
        }
    }
}
</code></pre>
<h4 id="heading-understanding-the-select-statement">Understanding the Select Statement</h4>
<p>The <code>select</code> statement is one of Go's most powerful concurrency features. It's like a switch statement for channels. The select waits until one of its cases can proceed, then it executes that case.</p>
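<p>Here's what happens: The loop blocks on the select statement, waiting for data on any of the five channels. When data arrives on any channel, that case executes. If several channels are ready at the same time, Go picks one of them at random, so no single channel can starve the others. After the case completes, the loop goes back to waiting.</p>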
<p>For example, when a new client connects, code elsewhere in your program sends that client to <code>cr.join</code>. The select receives it and executes <code>cr.handleJoin(client)</code>. Once that finishes, the loop goes back to waiting.</p>
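<p>To see the behavior in isolation, here is a minimal sketch with two channels instead of five. The <code>nextEvent</code> helper and its channel names are illustrative and not part of the chatroom code. One detail worth knowing: if several cases are ready at the same time, Go picks one of them at random, so <code>select</code> gives no priority to any channel.</p>

```go
package main

import "fmt"

// nextEvent blocks until one of the two channels has a value and
// reports which one fired. It is a hypothetical helper for illustration.
func nextEvent(join <-chan string, broadcast <-chan string) string {
	select {
	case name := <-join:
		return "join:" + name
	case msg := <-broadcast:
		return "broadcast:" + msg
	}
}

func main() {
	join := make(chan string, 1)
	broadcast := make(chan string, 1)

	// Only the join channel has data, so that case runs.
	join <- "alice"
	fmt.Println(nextEvent(join, broadcast))

	// Now only the broadcast channel has data.
	broadcast <- "hello"
	fmt.Println(nextEvent(join, broadcast))
}
```

<p>Wrapping the <code>select</code> in a <code>for</code> loop, as <code>Run()</code> does, turns this one-shot wait into an event loop.</p>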
<h4 id="heading-why-use-a-single-event-loop">Why Use a Single Event Loop?</h4>
<p>This might seem like a bottleneck. You have one goroutine processing all events sequentially. Why not process events in parallel?</p>
<p>The answer is consistency. Here's what you gain from sequential processing:</p>
<p><strong>1. No Race Conditions on State</strong></p>
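<p>The event loop is the only goroutine that adds clients to or removes them from the <code>clients</code> map, and the only one that appends to the <code>messages</code> slice. You never need to worry about two of these operations interfering with each other. When you add a client in <code>handleJoin</code>, you know for certain that no other code is simultaneously removing clients or broadcasting messages. Other goroutines still read some of this state (for example, to check whether a username is already connected), and the <code>sessions</code> map is also touched by connection goroutines, which is why the code guards them with short-lived mutexes.</p>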
<p>This is incredibly powerful. Most bugs in concurrent systems come from unexpected interleaving of operations. By processing events sequentially, you eliminate an entire class of bugs.</p>
<p><strong>2. Total Ordering of Events</strong></p>
<p>Messages are broadcast in the order the event loop receives them, and every client sees that same order. This seems obvious, but it's important. If Alice's "Hello" reaches the loop before Bob's "Hi", everyone sees "Hello" first. With parallel processing, you'd need additional synchronization to maintain ordering.</p>
<p><strong>3. Simple State Transitions</strong></p>
<p>You can reason about your system state as a series of transitions. "After this join event, the client is in the map. After this leave event, the client is removed." You don't need to worry about concurrent state changes making your reasoning invalid.</p>
<p><strong>4. Easy to Debug</strong></p>
<p>When something goes wrong, you can add logging to the event loop and see exactly what sequence of events led to the problem. With parallel processing, the order of events depends on goroutine scheduling, making bugs hard to reproduce.</p>
<h4 id="heading-is-this-actually-a-bottleneck">Is This Actually a Bottleneck?</h4>
<p>You might worry that sequential processing limits performance. In practice, it's fine for this workload. Here's why:</p>
<p>The handlers are fast. They do simple things like adding to a map, removing from a map, or forwarding a message to channels. These operations take microseconds. The event loop can process thousands of events per second.</p>
<p>The slow operations (writing to disk, sending to client connections) happen in other goroutines. The event loop doesn't wait for them. It just sends data to a channel or adds work to a queue, then immediately moves to the next event.</p>
<p>If you needed higher throughput, you could shard your chat into multiple rooms, each with its own event loop. But for a single chatroom, sequential processing is both simpler and fast enough.</p>
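<p>As a rough sketch of what that sharding could look like, the example below gives each room its own goroutine and keeps a small registry of rooms. The <code>Room</code> and <code>Hub</code> types are hypothetical and heavily simplified. They are not part of this tutorial's code, and a real version would carry the join, leave, and persistence logic as well.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// Room is a hypothetical, stripped-down stand-in for ChatRoom.
type Room struct {
	broadcast chan string
	done      chan struct{}
	log       []string // written only by this room's loop
}

func newRoom() *Room {
	r := &Room{broadcast: make(chan string), done: make(chan struct{})}
	go r.run()
	return r
}

// run is the room's private event loop.
func (r *Room) run() {
	for msg := range r.broadcast {
		r.log = append(r.log, msg)
	}
	close(r.done)
}

// Close stops the loop, waits for it to finish, and returns what it recorded.
func (r *Room) Close() []string {
	close(r.broadcast)
	<-r.done
	return r.log
}

// Hub maps room names to rooms. The mutex guards only the map itself.
type Hub struct {
	mu    sync.Mutex
	rooms map[string]*Room
}

// Room returns the named room, creating it (and its loop) on first use.
func (h *Hub) Room(name string) *Room {
	h.mu.Lock()
	defer h.mu.Unlock()
	r, ok := h.rooms[name]
	if !ok {
		r = newRoom()
		h.rooms[name] = r
	}
	return r
}

func main() {
	hub := &Hub{rooms: make(map[string]*Room)}
	hub.Room("go").broadcast <- "hello gophers"
	hub.Room("rust").broadcast <- "hello rustaceans"
	fmt.Println(hub.Room("go").Close())
	fmt.Println(hub.Room("rust").Close())
}
```

<p>Each room keeps the same sequential guarantees as before, while different rooms make progress in parallel.</p>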
<h4 id="heading-understanding-the-cleanup-worker">Understanding the Cleanup Worker</h4>
<p>Notice the line <code>go cr.cleanupInactiveClients()</code> before the loop. This starts a background goroutine that periodically checks for idle clients.</p>
<p>Why not include this in the event loop? Because it's time-based, not event-based. The cleanup worker wakes up every 30 seconds and sends disconnect events for idle clients. These events flow through the normal event loop, so state is still mutated by a single goroutine.</p>
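<p>The general shape of such a worker can be sketched with a <code>time.Ticker</code>. This is an illustration of the pattern only, not the tutorial's <code>cleanupInactiveClients()</code> implementation: the <code>idleReaper</code> function, its <code>findIdle</code> callback, and the <code>stop</code> channel are all assumptions made for the example.</p>

```go
package main

import (
	"fmt"
	"time"
)

// idleReaper is a hypothetical worker. On every tick it asks findIdle for
// idle client names and forwards each one on the leave channel. It never
// modifies shared state itself; the receiver of leave does that.
func idleReaper(interval time.Duration, findIdle func() []string, leave chan<- string, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			for _, name := range findIdle() {
				select {
				case leave <- name: // hand the event to the event loop
				case <-stop:
					return
				}
			}
		case <-stop:
			return
		}
	}
}

func main() {
	leave := make(chan string)
	stop := make(chan struct{})
	findIdle := func() []string { return []string{"bob"} }

	go idleReaper(10*time.Millisecond, findIdle, leave, stop)

	fmt.Println("idle client reported:", <-leave)
	close(stop)
}
```

<p>The worker only decides <em>when</em> to check. The decision about <em>what</em> changes stays with whoever receives from the channel.</p>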
<p>Now add the <code>runServer()</code> function and shutdown handler:</p>
<pre><code class="lang-go"><span class="hljs-keyword">import</span> (
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"os/signal"</span>
    <span class="hljs-string">"syscall"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">runServer</span><span class="hljs-params">()</span></span> {
    chatRoom, err := NewChatRoom(<span class="hljs-string">"./chatdata"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Failed to initialize: %v\n"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> chatRoom.shutdown()

    <span class="hljs-comment">// Set up signal handling for graceful shutdown</span>
    sigChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> os.Signal, <span class="hljs-number">1</span>)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        &lt;-sigChan
        fmt.Println(<span class="hljs-string">"\nReceived shutdown signal"</span>)
        chatRoom.shutdown()
        os.Exit(<span class="hljs-number">0</span>)
    }()

    <span class="hljs-keyword">go</span> chatRoom.Run()

    listener, err := net.Listen(<span class="hljs-string">"tcp"</span>, <span class="hljs-string">":9000"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(<span class="hljs-string">"Error starting server:"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> listener.Close()

    fmt.Println(<span class="hljs-string">"Server started on :9000"</span>)

    <span class="hljs-keyword">for</span> {
        conn, err := listener.Accept()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            fmt.Println(<span class="hljs-string">"Error accepting connection:"</span>, err)
            <span class="hljs-keyword">continue</span>
        }
        fmt.Println(<span class="hljs-string">"New connection from:"</span>, conn.RemoteAddr())
        <span class="hljs-keyword">go</span> handleClient(conn, chatRoom)
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">shutdown</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"\nShutting down..."</span>)
    <span class="hljs-keyword">if</span> err := cr.createSnapshot(); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Final snapshot failed: %v\n"</span>, err)
    }
    <span class="hljs-keyword">if</span> cr.walFile != <span class="hljs-literal">nil</span> {
        cr.walFile.Close()
    }
    fmt.Println(<span class="hljs-string">"Shutdown complete"</span>)
}
</code></pre>
<p>The <code>runServer()</code> function ties everything together:</p>
<ol>
<li><p>Create the chatroom with <code>NewChatRoom()</code></p>
</li>
<li><p>Defer the shutdown function so it runs when the function exits</p>
</li>
<li><p>Start the event loop in a separate goroutine with <code>go chatRoom.Run()</code></p>
</li>
<li><p>Listen for TCP connections on port 9000</p>
</li>
<li><p>For each connection, spawn a goroutine with <code>go handleClient()</code></p>
</li>
</ol>
<p>The <code>defer</code> statement is important. If <code>runServer()</code> returns early because of an error, or unwinds because of a panic, the deferred <code>shutdown()</code> call still runs. This ensures we create a final snapshot and close the WAL file cleanly.</p>
<p>The signal handling goroutine listens for SIGINT (Ctrl+C) or SIGTERM (system shutdown). When it receives one, it calls <code>shutdown()</code> and then <code>os.Exit(0)</code>. The explicit call is needed because deferred functions do not run when a program ends through <code>os.Exit</code>. This means when you press Ctrl+C, the server saves its state before stopping.</p>
<p>With your event loop running and listening for connections, the next step is handling what happens when a client actually connects. This involves reading their username, creating a session, and setting up the communication channels.</p>
<h2 id="heading-how-to-handle-client-connections">How to Handle Client Connections</h2>
<p>When a client connects to your server, several things need to happen: you need to establish the TCP connection, prompt for a username, create a Client object to represent them, start goroutines to read and write messages, and handle both normal disconnections and unexpected failures.</p>
<p>Create a file <code>internal/chatroom/io.go</code> for managing client connections. When a client connects, <code>handleClient()</code> manages the entire lifecycle:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"bufio"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"math/rand"</span>
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"strings"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handleClient</span><span class="hljs-params">(conn net.Conn, chatRoom *ChatRoom)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">if</span> r := <span class="hljs-built_in">recover</span>(); r != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Panic in handleClient: %v\n"</span>, r)
        }
        conn.Close()
    }()

    <span class="hljs-comment">// Set initial timeout for username entry</span>
    conn.SetReadDeadline(time.Now().Add(<span class="hljs-number">30</span> * time.Second))

    reader := bufio.NewReader(conn)

    <span class="hljs-comment">// Prompt for username or reconnection</span>
    conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Enter username (or 'reconnect:&lt;username&gt;:&lt;token&gt;'): \n"</span>))

    input, err := reader.ReadString(<span class="hljs-string">'\n'</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(<span class="hljs-string">"Failed to read username:"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    input = strings.TrimSpace(input)

    <span class="hljs-keyword">var</span> username <span class="hljs-keyword">string</span>
    <span class="hljs-keyword">var</span> reconnectToken <span class="hljs-keyword">string</span>
    <span class="hljs-keyword">var</span> isReconnecting <span class="hljs-keyword">bool</span>

    <span class="hljs-comment">// Parse reconnection attempt</span>
    <span class="hljs-keyword">if</span> strings.HasPrefix(input, <span class="hljs-string">"reconnect:"</span>) {
        parts := strings.Split(input, <span class="hljs-string">":"</span>)
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) == <span class="hljs-number">3</span> {
            username = parts[<span class="hljs-number">1</span>]
            reconnectToken = parts[<span class="hljs-number">2</span>]
            isReconnecting = <span class="hljs-literal">true</span>
        } <span class="hljs-keyword">else</span> {
            conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Invalid format. Use: reconnect:&lt;username&gt;:&lt;token&gt;\n"</span>))
            <span class="hljs-keyword">return</span>
        }
    } <span class="hljs-keyword">else</span> {
        username = input
    }

    <span class="hljs-comment">// Generate guest name if empty</span>
    <span class="hljs-keyword">if</span> username == <span class="hljs-string">""</span> {
        username = fmt.Sprintf(<span class="hljs-string">"Guest%d"</span>, rand.Intn(<span class="hljs-number">1000</span>))
    }

    <span class="hljs-comment">// Validate reconnection or check for duplicate</span>
    <span class="hljs-keyword">if</span> isReconnecting {
        <span class="hljs-keyword">if</span> chatRoom.validateReconnectToken(username, reconnectToken) {
            fmt.Printf(<span class="hljs-string">"%s reconnected successfully\n"</span>, username)
            conn.Write([]<span class="hljs-keyword">byte</span>(fmt.Sprintf(<span class="hljs-string">"Welcome back, %s!\n"</span>, username)))
        } <span class="hljs-keyword">else</span> {
            conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Invalid token or session expired.\n"</span>))
            <span class="hljs-keyword">return</span>
        }
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-comment">// Prevent duplicate logins</span>
        <span class="hljs-keyword">if</span> chatRoom.isUsernameConnected(username) {
            conn.Write([]<span class="hljs-keyword">byte</span>(<span class="hljs-string">"Username already connected. Use reconnect if you lost connection.\n"</span>))
            <span class="hljs-keyword">return</span>
        }

        <span class="hljs-comment">// Create or retrieve session</span>
        chatRoom.sessionsMu.Lock()
        existingSession := chatRoom.sessions[username]
        chatRoom.sessionsMu.Unlock()

        <span class="hljs-keyword">if</span> existingSession != <span class="hljs-literal">nil</span> {
            token := existingSession.ReconnectToken
            reconnectToken = token <span class="hljs-comment">// keep the token on the Client as well</span>
            msg := fmt.Sprintf(<span class="hljs-string">"Tip: Save this token: %s\n"</span>, token)
            msg += fmt.Sprintf(<span class="hljs-string">"To reconnect: reconnect:%s:%s\n"</span>, username, token)
            conn.Write([]<span class="hljs-keyword">byte</span>(msg))
        } <span class="hljs-keyword">else</span> {
            session := chatRoom.createSession(username)
            token := session.ReconnectToken
            reconnectToken = token <span class="hljs-comment">// keep the token on the Client as well</span>
            msg := fmt.Sprintf(<span class="hljs-string">"Your token: %s\n"</span>, token)
            msg += fmt.Sprintf(<span class="hljs-string">"To reconnect: reconnect:%s:%s\n"</span>, username, token)
            conn.Write([]<span class="hljs-keyword">byte</span>(msg))
        }
    }

    <span class="hljs-comment">// Create client object</span>
    client := &amp;Client{
        conn:           conn,
        username:       username,
        outgoing:       <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">10</span>), <span class="hljs-comment">// Buffered</span>
        lastActive:     time.Now(),
        reconnectToken: reconnectToken,
        isSlowClient:   rand.Float64() &lt; <span class="hljs-number">0.1</span>, <span class="hljs-comment">// 10% chance for testing</span>
    }

    <span class="hljs-comment">// Clear timeout for normal operation</span>
    conn.SetReadDeadline(time.Time{})

    <span class="hljs-comment">// Notify chatroom</span>
    chatRoom.join &lt;- client

    <span class="hljs-comment">// Send welcome message</span>
    welcomeMsg := buildWelcomeMessage(username)
    conn.Write([]<span class="hljs-keyword">byte</span>(welcomeMsg))

    <span class="hljs-comment">// Start read/write loops</span>
    <span class="hljs-keyword">go</span> readMessages(client, chatRoom)
    writeMessages(client) <span class="hljs-comment">// Blocks until disconnect</span>

    <span class="hljs-comment">// Update session on disconnect</span>
    chatRoom.updateSessionActivity(username)
    chatRoom.leave &lt;- client
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">buildWelcomeMessage</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">string</span></span> {
    msg := fmt.Sprintf(<span class="hljs-string">"Welcome, %s!\n"</span>, username)
    msg += <span class="hljs-string">"Commands:\n"</span>
    msg += <span class="hljs-string">"  /users - List all users\n"</span>
    msg += <span class="hljs-string">"  /history [N] - Show last N messages\n"</span>
    msg += <span class="hljs-string">"  /msg &lt;user&gt; &lt;msg&gt; - Private message\n"</span>
    msg += <span class="hljs-string">"  /token - Show your reconnect token\n"</span>
    msg += <span class="hljs-string">"  /stats - Show your stats\n"</span>
    msg += <span class="hljs-string">"  /quit - Leave\n"</span>
    <span class="hljs-keyword">return</span> msg
}
</code></pre>
<p>The initial 30-second timeout prevents connection exhaustion by disconnecting clients who don't enter a username quickly. The buffered <code>outgoing</code> channel prevents slow clients from blocking the broadcaster. Token-based reconnection lets users resume their session without complex authentication. The dual goroutine design means reading and writing happen independently, so a slow write doesn't block incoming messages.</p>
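<p>The login parsing is easier to test when it is pulled out of the connection handler. The sketch below shows one way to do that. The <code>parseLogin</code> helper is hypothetical and not part of the tutorial's code, and unlike <code>handleClient()</code> it also rejects an empty username or token in the reconnect form, which is an extra assumption.</p>

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseLogin is a hypothetical helper that mirrors the parsing done in
// handleClient. It accepts either a plain username or the form
// "reconnect:<username>:<token>".
func parseLogin(input string) (username, token string, reconnecting bool, err error) {
	input = strings.TrimSpace(input)
	if !strings.HasPrefix(input, "reconnect:") {
		return input, "", false, nil
	}

	parts := strings.Split(input, ":")
	// Extra check (an assumption): empty username or token is rejected.
	if len(parts) != 3 || parts[1] == "" || parts[2] == "" {
		return "", "", false, errors.New("invalid format, use reconnect:<username>:<token>")
	}
	return parts[1], parts[2], true, nil
}

func main() {
	for _, in := range []string{"alice\n", "reconnect:bob:abc123\n", "reconnect:bob\n"} {
		user, token, reconnecting, err := parseLogin(in)
		fmt.Printf("user=%q token=%q reconnecting=%v err=%v\n", user, token, reconnecting, err)
	}
}
```

<p>Because the split is on <code>:</code>, this format only works while usernames and tokens themselves contain no colon.</p>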
<h3 id="heading-how-to-read-messages-from-clients">How to Read Messages from Clients</h3>
<p>Add the <code>readMessages()</code> function, which runs in its own goroutine and handles all incoming data:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">readMessages</span><span class="hljs-params">(client *Client, chatRoom *ChatRoom)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">if</span> r := <span class="hljs-built_in">recover</span>(); r != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Panic in readMessages for %s: %v\n"</span>, client.username, r)
        }
    }()

    reader := bufio.NewReader(client.conn)

    <span class="hljs-keyword">for</span> {
        <span class="hljs-comment">// Set 5-minute idle timeout</span>
        client.conn.SetReadDeadline(time.Now().Add(<span class="hljs-number">5</span> * time.Minute))

        message, err := reader.ReadString(<span class="hljs-string">'\n'</span>)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">if</span> netErr, ok := err.(net.Error); ok &amp;&amp; netErr.Timeout() {
                fmt.Printf(<span class="hljs-string">"%s timed out\n"</span>, client.username)
            } <span class="hljs-keyword">else</span> {
                fmt.Printf(<span class="hljs-string">"%s disconnected: %v\n"</span>, client.username, err)
            }
            <span class="hljs-keyword">return</span>
        }

        client.markActive() <span class="hljs-comment">// Update activity timestamp</span>

        message = strings.TrimSpace(message)
        <span class="hljs-keyword">if</span> message == <span class="hljs-string">""</span> {
            <span class="hljs-keyword">continue</span>
        }

        client.mu.Lock()
        client.messagesRecv++
        client.mu.Unlock()

        <span class="hljs-comment">// Process commands vs. regular messages</span>
        <span class="hljs-keyword">if</span> strings.HasPrefix(message, <span class="hljs-string">"/"</span>) {
            handleCommand(client, chatRoom, message)
            <span class="hljs-keyword">continue</span>
        }

        <span class="hljs-comment">// Regular message - format and broadcast</span>
        formatted := fmt.Sprintf(<span class="hljs-string">"[%s]: %s\n"</span>, client.username, message)
        chatRoom.broadcast &lt;- formatted
    }
}
</code></pre>
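<p>Because the read deadline is reset before every read, a client that sends nothing for 5 minutes causes <code>ReadString</code> to return a timeout error, and the read loop exits. This prevents zombie connections from consuming resources.</p>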
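<p>You can observe the deadline behavior without a real network by using <code>net.Pipe()</code>, which returns two connected in-memory <code>net.Conn</code> values. The sketch below is a standalone experiment with much shorter timeouts. The <code>readLine</code> and <code>tryPipe</code> helpers are hypothetical names introduced for this example.</p>

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"net"
	"time"
)

// readLine waits up to d for one line. timedOut is true when the read
// deadline expired before a complete line arrived.
func readLine(conn net.Conn, r *bufio.Reader, d time.Duration) (line string, timedOut bool, err error) {
	if err = conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return "", false, err
	}
	line, err = r.ReadString('\n')

	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "", true, nil
	}
	return line, false, err
}

// tryPipe runs readLine against an in-memory connection. If send is
// non-empty the peer writes it; otherwise the peer stays silent.
func tryPipe(send string, d time.Duration) (string, bool, error) {
	server, client := net.Pipe()
	defer server.Close()
	defer client.Close()

	if send != "" {
		go client.Write([]byte(send))
	}
	return readLine(server, bufio.NewReader(server), d)
}

func main() {
	line, timedOut, err := tryPipe("hello\n", time.Second)
	fmt.Printf("active peer: line=%q timedOut=%v err=%v\n", line, timedOut, err)

	line, timedOut, err = tryPipe("", 50*time.Millisecond)
	fmt.Printf("silent peer: line=%q timedOut=%v err=%v\n", line, timedOut, err)
}
```

<p>The timeout check uses <code>errors.As</code>, which also matches a timeout error that has been wrapped by another error.</p>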
<h3 id="heading-how-to-write-messages-to-clients">How to Write Messages to Clients</h3>
<p>Add the <code>writeMessages()</code> function to drain the client's <code>outgoing</code> channel:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writeMessages</span><span class="hljs-params">(client *Client)</span></span> {
    <span class="hljs-keyword">defer</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        <span class="hljs-keyword">if</span> r := <span class="hljs-built_in">recover</span>(); r != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Panic in writeMessages for %s: %v\n"</span>, client.username, r)
        }
    }()

    writer := bufio.NewWriter(client.conn)

    <span class="hljs-keyword">for</span> message := <span class="hljs-keyword">range</span> client.outgoing {
        <span class="hljs-comment">// Simulate slow client (testing mode)</span>
        <span class="hljs-keyword">if</span> client.isSlowClient {
            time.Sleep(time.Duration(rand.Intn(<span class="hljs-number">500</span>)) * time.Millisecond)
        }

        _, err := writer.WriteString(message)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Write error for %s: %v\n"</span>, client.username, err)
            <span class="hljs-keyword">return</span>
        }

        err = writer.Flush()
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Flush error for %s: %v\n"</span>, client.username, err)
            <span class="hljs-keyword">return</span>
        }
    }
}
</code></pre>
<p>Real-world clients have varying network speeds. A client with a slow internet connection shouldn't block message delivery to other users. This is a fundamental challenge in any system that broadcasts to multiple recipients.</p>
<p>To handle this, we use two techniques. First, the <code>outgoing</code> channel is buffered with a size of 10. This means the system can queue up 10 messages for a client without blocking. If a client temporarily slows down (maybe they're loading a large webpage in another tab), the buffer absorbs the slowdown.</p>
<p>Second, when broadcasting messages (which you'll see in the next section), we use non-blocking sends. If a client's buffer is full because they're consistently too slow, we skip sending to them rather than blocking everyone else. The slow client misses some messages, but everyone else continues normally. This is called graceful degradation: the system continues working even when parts of it have problems.</p>
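<p>The non-blocking send is a <code>select</code> with a <code>default</code> case. Here is a minimal standalone sketch. The <code>trySend</code> helper is a hypothetical name, and the buffer size of 2 is chosen only to keep the demonstration short.</p>

```go
package main

import "fmt"

// trySend attempts to queue msg without blocking. It reports whether the
// message was accepted; false means the buffer was full.
func trySend(out chan<- string, msg string) bool {
	select {
	case out <- msg:
		return true
	default:
		return false
	}
}

func main() {
	out := make(chan string, 2) // small buffer for the demo; the chat uses 10

	// Nobody is reading from out, so the third send finds the buffer full.
	for i := 1; i <= 3; i++ {
		queued := trySend(out, fmt.Sprintf("message %d", i))
		fmt.Printf("message %d queued: %v\n", i, queued)
	}
}
```

<p>The sender learns immediately whether the message was queued, so it can log the drop and move on to the next client.</p>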
<p>With client connections handled, the next step is implementing the core feature of any chat system: broadcasting messages to all connected users. Broadcasting means taking one message and sending it to many recipients efficiently and safely.</p>
<h2 id="heading-how-to-implement-message-broadcasting">How to Implement Message Broadcasting</h2>
<p>Broadcasting is the heart of a chat application. When one user sends a message, it needs to reach everyone else instantly. But this is trickier than it sounds because you need to persist the message for durability, send it to clients at different speeds without blocking, and maintain message ordering across all clients.</p>
<p>Create <code>internal/chatroom/handlers.go</code> to handle events.</p>
<p>The <code>handleBroadcast()</code> method is where messages reach all users:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"strings"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleBroadcast</span><span class="hljs-params">(message <span class="hljs-keyword">string</span>)</span></span> {
    <span class="hljs-comment">// Parse message metadata</span>
    parts := strings.SplitN(message, <span class="hljs-string">": "</span>, <span class="hljs-number">2</span>)
    from := <span class="hljs-string">"system"</span>
    actualContent := message

    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) == <span class="hljs-number">2</span> {
        from = strings.Trim(parts[<span class="hljs-number">0</span>], <span class="hljs-string">"[]"</span>)
        actualContent = parts[<span class="hljs-number">1</span>]
    }

    <span class="hljs-comment">// Create persistent message record</span>
    cr.messageMu.Lock()
    msg := Message{
        ID:        cr.nextMessageID,
        From:      from,
        Content:   actualContent,
        Timestamp: time.Now(),
        Channel:   <span class="hljs-string">"global"</span>,
    }
    cr.nextMessageID++
    cr.messages = <span class="hljs-built_in">append</span>(cr.messages, msg)
    cr.messageMu.Unlock()

    <span class="hljs-comment">// Persist to WAL</span>
    <span class="hljs-keyword">if</span> err := cr.persistMessage(msg); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Failed to persist: %v\n"</span>, err)
        <span class="hljs-comment">// Continue anyway - availability over consistency</span>
    }

    <span class="hljs-comment">// Collect current clients</span>
    cr.mu.Lock()
    clients := <span class="hljs-built_in">make</span>([]*Client, <span class="hljs-number">0</span>, <span class="hljs-built_in">len</span>(cr.clients))
    <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
        clients = <span class="hljs-built_in">append</span>(clients, client)
    }
    cr.totalMessages++
    cr.mu.Unlock()

    fmt.Printf(<span class="hljs-string">"Broadcasting to %d clients: %s"</span>, <span class="hljs-built_in">len</span>(clients), message)

    <span class="hljs-comment">// Fan-out to all clients</span>
    <span class="hljs-keyword">for</span> _, client := <span class="hljs-keyword">range</span> clients {
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- message:
            client.mu.Lock()
            client.messagesSent++
            client.mu.Unlock()
        <span class="hljs-keyword">default</span>:
            fmt.Printf(<span class="hljs-string">"Skipped %s (channel full)\n"</span>, client.username)
        }
    }
}
</code></pre>
<h4 id="heading-consistency-trade-off">Consistency Trade-off</h4>
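<p>If a WAL write fails, you still broadcast the message. Why? Because for a chat application, delivering messages promptly matters more than guaranteeing that every message is durably recorded. Users get their messages immediately. The cost is that a message whose WAL write failed is not in the log, so it may be missing from the history after a restart, and you can handle WAL repair manually if needed.</p>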
<h3 id="heading-how-to-handle-join-and-leave-events">How to Handle Join and Leave Events</h3>
<p>Add these handlers to <code>handlers.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleJoin</span><span class="hljs-params">(client *Client)</span></span> {
    cr.mu.Lock()
    cr.clients[client] = <span class="hljs-literal">true</span>
    cr.mu.Unlock()

    client.markActive()

    fmt.Printf(<span class="hljs-string">"%s joined (total: %d)\n"</span>, client.username, <span class="hljs-built_in">len</span>(cr.clients))

    cr.sendHistory(client, <span class="hljs-number">10</span>)

    announcement := fmt.Sprintf(<span class="hljs-string">"*** %s joined the chat ***\n"</span>, client.username)
    cr.handleBroadcast(announcement)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleLeave</span><span class="hljs-params">(client *Client)</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">if</span> !cr.clients[client] {
        cr.mu.Unlock()
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-built_in">delete</span>(cr.clients, client)
    cr.mu.Unlock()

    fmt.Printf(<span class="hljs-string">"%s left (total: %d)\n"</span>, client.username, <span class="hljs-built_in">len</span>(cr.clients))

    <span class="hljs-comment">// Close the channel so writeMessages exits. The map check above</span>
    <span class="hljs-comment">// guarantees this runs only once per client.</span>
    <span class="hljs-built_in">close</span>(client.outgoing)

    announcement := fmt.Sprintf(<span class="hljs-string">"*** %s left the chat ***\n"</span>, client.username)
    cr.handleBroadcast(announcement)
}
</code></pre>
<p>The <code>handleJoin</code> function adds the client to the active clients map, marks them as active for idle tracking, sends them the last 10 messages so they can see recent conversation, and broadcasts an announcement so everyone knows they joined.</p>
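<p>The <code>handleLeave</code> function removes the client from the map, closes their outgoing channel so that <code>writeMessages()</code> stops, and broadcasts a departure announcement. Closing a channel twice causes a panic, so the membership check at the top matters: a second leave event for the same client returns early before reaching the close.</p>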
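<p>A general note on closing channels in Go: a receive inside a <code>select</code> cannot tell a closed channel apart from an open one that still has buffered data, so it is not a reliable "already closed" test. The sketch below demonstrates this and shows <code>sync.Once</code> as one alternative for code where more than one path might close the same channel. The <code>looksClosed</code> function and <code>onceCloser</code> type are hypothetical and not part of the chatroom code.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// looksClosed mimics a "receive to test for closed" check. It also returns
// true when the channel is open but holds a buffered value, and in that
// case it silently consumes the value.
func looksClosed(ch chan string) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

// onceCloser closes its channel at most once, however often Close is called.
type onceCloser struct {
	once sync.Once
	ch   chan string
}

func (c *onceCloser) Close() {
	c.once.Do(func() { close(c.ch) })
}

func main() {
	ch := make(chan string, 1)
	ch <- "pending"
	// Reports true even though ch was never closed.
	fmt.Println("looks closed:", looksClosed(ch))

	c := &onceCloser{ch: make(chan string)}
	c.Close()
	c.Close() // second call is a no-op instead of a panic
	_, open := <-c.ch
	fmt.Println("still open:", open)
}
```

<p>In the chatroom, the single event loop plus the map check already ensures one close per client, so <code>sync.Once</code> is only worth adding if other goroutines might close the channel too.</p>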
<h3 id="heading-how-to-send-user-lists-and-history">How to Send User Lists and History</h3>
<p>Add these helper functions to <code>handlers.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">sendHistory</span><span class="hljs-params">(client *Client, count <span class="hljs-keyword">int</span>)</span></span> {
    cr.messageMu.Lock()
    <span class="hljs-keyword">defer</span> cr.messageMu.Unlock()

    start := <span class="hljs-built_in">len</span>(cr.messages) - count
    <span class="hljs-keyword">if</span> start &lt; <span class="hljs-number">0</span> {
        start = <span class="hljs-number">0</span>
    }

    historyMsg := <span class="hljs-string">"Recent messages:\n"</span>
    <span class="hljs-keyword">for</span> i := start; i &lt; <span class="hljs-built_in">len</span>(cr.messages); i++ {
        msg := cr.messages[i]
        historyMsg += fmt.Sprintf(<span class="hljs-string">" [%s]: %s\n"</span>, msg.From, msg.Content)
    }

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> client.outgoing &lt;- historyMsg:
    <span class="hljs-keyword">default</span>:
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">sendUserList</span><span class="hljs-params">(client *Client)</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">defer</span> cr.mu.Unlock()

    list := <span class="hljs-string">"Users online:\n"</span>
    <span class="hljs-keyword">for</span> c := <span class="hljs-keyword">range</span> cr.clients {
        status := <span class="hljs-string">""</span>
        <span class="hljs-keyword">if</span> c.isInactive(<span class="hljs-number">1</span> * time.Minute) {
            status = <span class="hljs-string">" (idle)"</span>
        }
        list += fmt.Sprintf(<span class="hljs-string">"  - %s%s\n"</span>, c.username, status)
    }

    list += fmt.Sprintf(<span class="hljs-string">"\nTotal messages: %d\n"</span>, cr.totalMessages)
    list += fmt.Sprintf(<span class="hljs-string">"Uptime: %s\n"</span>, time.Since(cr.startTime).Round(time.Second))

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> client.outgoing &lt;- list:
    <span class="hljs-keyword">default</span>:
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">handleDirectMessage</span><span class="hljs-params">(dm DirectMessage)</span></span> {
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> dm.toClient.outgoing &lt;- dm.message:
        dm.toClient.mu.Lock()
        dm.toClient.messagesSent++
        dm.toClient.mu.Unlock()
    <span class="hljs-keyword">default</span>:
        fmt.Printf(<span class="hljs-string">"Couldn't deliver DM to %s\n"</span>, dm.toClient.username)
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">findClientByUsername</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> *<span class="hljs-title">Client</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">defer</span> cr.mu.Unlock()

    <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
        <span class="hljs-keyword">if</span> client.username == username {
            <span class="hljs-keyword">return</span> client
        }
    }
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(c *Client)</span> <span class="hljs-title">markActive</span><span class="hljs-params">()</span></span> {
    c.mu.Lock()
    <span class="hljs-keyword">defer</span> c.mu.Unlock()
    c.lastActive = time.Now()
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(c *Client)</span> <span class="hljs-title">isInactive</span><span class="hljs-params">(timeout time.Duration)</span> <span class="hljs-title">bool</span></span> {
    c.mu.Lock()
    <span class="hljs-keyword">defer</span> c.mu.Unlock()
    <span class="hljs-keyword">return</span> time.Since(c.lastActive) &gt; timeout
}
</code></pre>
<p>You now have a working chat system where clients can connect and exchange messages.</p>
<p>But there's a critical problem: if the server crashes or restarts, all messages are lost. The next step is adding persistence so messages survive failures.</p>
<h2 id="heading-how-to-add-persistence-with-wal-and-snapshots">How to Add Persistence with WAL and Snapshots</h2>
<p>Persistence ensures your chat history survives server crashes and restarts. Without it, users would lose all their conversations every time the server goes down.</p>
<p>You'll implement this using two complementary mechanisms: a write-ahead log for immediate durability and snapshots for fast recovery.</p>
<p>Create <code>internal/chatroom/persistence.go</code> to handle data durability.</p>
<p>The WAL ensures messages survive crashes:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"bufio"</span>
    <span class="hljs-string">"encoding/json"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"io"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"path/filepath"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">initializePersistence</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-keyword">if</span> err := os.MkdirAll(cr.dataDir, <span class="hljs-number">0755</span>); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"create data dir: %w"</span>, err)
    }

    walPath := filepath.Join(cr.dataDir, <span class="hljs-string">"messages.wal"</span>)

    <span class="hljs-keyword">if</span> err := cr.recoverFromWAL(walPath); err != <span class="hljs-literal">nil</span> {
        fmt.Printf(<span class="hljs-string">"Recovery failed: %v\n"</span>, err)
    }

    file, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, <span class="hljs-number">0644</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> fmt.Errorf(<span class="hljs-string">"open wal: %w"</span>, err)
    }

    cr.walFile = file
    fmt.Printf(<span class="hljs-string">"WAL initialized: %s\n"</span>, walPath)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">recoverFromWAL</span><span class="hljs-params">(walPath <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">error</span></span> {
    file, err := os.Open(walPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            fmt.Println(<span class="hljs-string">"No WAL found (fresh start)"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    scanner := bufio.NewScanner(file)
    recovered := <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> scanner.Scan() {
        line := scanner.Text()
        <span class="hljs-keyword">if</span> line == <span class="hljs-string">""</span> {
            <span class="hljs-keyword">continue</span>
        }

        <span class="hljs-keyword">var</span> msg Message
        <span class="hljs-keyword">if</span> err := json.Unmarshal([]<span class="hljs-keyword">byte</span>(line), &amp;msg); err != <span class="hljs-literal">nil</span> {
            fmt.Printf(<span class="hljs-string">"Skipping corrupt line: %s\n"</span>, line)
            <span class="hljs-keyword">continue</span>
        }

        cr.messages = <span class="hljs-built_in">append</span>(cr.messages, msg)

        <span class="hljs-keyword">if</span> msg.ID &gt;= cr.nextMessageID {
            cr.nextMessageID = msg.ID + <span class="hljs-number">1</span>
        }
        recovered++
    }

    fmt.Printf(<span class="hljs-string">"Recovered %d messages\n"</span>, recovered)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">persistMessage</span><span class="hljs-params">(msg Message)</span> <span class="hljs-title">error</span></span> {
    cr.walMu.Lock()
    <span class="hljs-keyword">defer</span> cr.walMu.Unlock()

    data, err := json.Marshal(msg)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    _, err = cr.walFile.Write(<span class="hljs-built_in">append</span>(data, <span class="hljs-string">'\n'</span>))
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">return</span> cr.walFile.Sync()
}
</code></pre>
<p>Each line is a JSON-encoded message:</p>
<pre><code class="lang-json">{<span class="hljs-attr">"id"</span>:<span class="hljs-number">1</span>,<span class="hljs-attr">"from"</span>:<span class="hljs-string">"Alice"</span>,<span class="hljs-attr">"content"</span>:<span class="hljs-string">"Hello world"</span>,<span class="hljs-attr">"timestamp"</span>:<span class="hljs-string">"2024-02-06T10:00:00Z"</span>,<span class="hljs-attr">"channel"</span>:<span class="hljs-string">"global"</span>}
{<span class="hljs-attr">"id"</span>:<span class="hljs-number">2</span>,<span class="hljs-attr">"from"</span>:<span class="hljs-string">"Bob"</span>,<span class="hljs-attr">"content"</span>:<span class="hljs-string">"Hi Alice!"</span>,<span class="hljs-attr">"timestamp"</span>:<span class="hljs-string">"2024-02-06T10:00:05Z"</span>,<span class="hljs-attr">"channel"</span>:<span class="hljs-string">"global"</span>}
</code></pre>
<p>The <code>Sync()</code> call is critical for durability. Without it, the OS may keep writes in its page cache, and a crash or power loss would lose them. The trade-off is that <code>Sync()</code> is expensive (often around 1-10ms per call, depending on the storage device), so production systems often batch several messages into one sync to improve throughput.</p>
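<p>To make the batching idea concrete, here is a minimal sketch that writes a group of messages and calls <code>Sync()</code> once for the whole group. The <code>walEntry</code> type and the <code>appendBatch</code> and <code>demoBatch</code> helpers are hypothetical names used for illustration, not part of the chatroom code.</p>

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// walEntry is a hypothetical stand-in for the article's Message type.
type walEntry struct {
	ID      int    `json:"id"`
	From    string `json:"from"`
	Content string `json:"content"`
}

// appendBatch writes each entry as one JSON line, then flushes and syncs
// once, so the fsync cost is paid per batch instead of per message.
func appendBatch(f *os.File, batch []walEntry) error {
	w := bufio.NewWriter(f)
	for _, e := range batch {
		data, err := json.Marshal(e)
		if err != nil {
			return err
		}
		if _, err := w.Write(append(data, '\n')); err != nil {
			return err
		}
	}
	if err := w.Flush(); err != nil {
		return err
	}
	return f.Sync()
}

// demoBatch writes two entries to a temporary WAL file and returns how
// many lines ended up on disk.
func demoBatch() (int, error) {
	dir, err := os.MkdirTemp("", "wal-batch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "messages.wal")
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	batch := []walEntry{
		{ID: 1, From: "Alice", Content: "Hello world"},
		{ID: 2, From: "Bob", Content: "Hi Alice!"},
	}
	if err := appendBatch(f, batch); err != nil {
		return 0, err
	}

	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strings.Count(string(data), "\n"), nil
}

func main() {
	n, err := demoBatch()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("lines written:", n)
}
```

<p>The cost of batching is a wider loss window: any messages collected but not yet synced are gone if the process crashes, so the batch size (or flush interval) is a durability decision as much as a performance one.</p>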
<h3 id="heading-how-to-create-and-load-snapshots">How to Create and Load Snapshots</h3>
<p>Add snapshot functionality to <code>persistence.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">createSnapshot</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    snapshotPath := filepath.Join(cr.dataDir, <span class="hljs-string">"snapshot.json"</span>)
    tempPath := snapshotPath + <span class="hljs-string">".tmp"</span>

    file, err := os.Create(tempPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    cr.messageMu.Lock()
    data, err := json.MarshalIndent(cr.messages, <span class="hljs-string">""</span>, <span class="hljs-string">"  "</span>)
    count := <span class="hljs-built_in">len</span>(cr.messages) <span class="hljs-comment">// read the length while the lock is still held</span>
    cr.messageMu.Unlock()

    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">if</span> _, err := file.Write(data); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">if</span> err := file.Sync(); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-comment">// Close before renaming; the deferred Close then returns an error that is ignored</span>
    file.Close()

    <span class="hljs-keyword">if</span> err := os.Rename(tempPath, snapshotPath); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    fmt.Printf(<span class="hljs-string">"Snapshot created (%d messages)\n"</span>, count)
    <span class="hljs-keyword">return</span> cr.truncateWAL()
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">truncateWAL</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    cr.walMu.Lock()
    <span class="hljs-keyword">defer</span> cr.walMu.Unlock()

    <span class="hljs-keyword">if</span> cr.walFile != <span class="hljs-literal">nil</span> {
        cr.walFile.Close()
    }

    walPath := filepath.Join(cr.dataDir, <span class="hljs-string">"messages.wal"</span>)
    file, err := os.OpenFile(walPath, os.O_TRUNC|os.O_CREATE|os.O_WRONLY, <span class="hljs-number">0644</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    cr.walFile = file
    fmt.Println(<span class="hljs-string">"WAL truncated"</span>)
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">loadSnapshot</span><span class="hljs-params">()</span> <span class="hljs-title">error</span></span> {
    snapshotPath := filepath.Join(cr.dataDir, <span class="hljs-string">"snapshot.json"</span>)
    file, err := os.Open(snapshotPath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">if</span> os.IsNotExist(err) {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
        }
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> file.Close()

    data, err := io.ReadAll(file)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    cr.messageMu.Lock()
    err = json.Unmarshal(data, &amp;cr.messages)
    cr.messageMu.Unlock()

    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-keyword">for</span> _, msg := <span class="hljs-keyword">range</span> cr.messages {
        <span class="hljs-keyword">if</span> msg.ID &gt;= cr.nextMessageID {
            cr.nextMessageID = msg.ID + <span class="hljs-number">1</span>
        }
    }

    fmt.Printf(<span class="hljs-string">"Loaded %d messages from snapshot\n"</span>, <span class="hljs-built_in">len</span>(cr.messages))
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
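<p>Writing to <code>.tmp</code> and then renaming ensures that readers never see a half-written snapshot. A rename within the same directory replaces the target atomically on POSIX filesystems, so even if power fails mid-write, the old snapshot remains valid.</p>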
<h4 id="heading-recovery-flow">Recovery Flow</h4>
<p>When the server starts, it first loads the snapshot if it exists, which might contain 100K messages and takes about 100ms. Then it replays WAL entries written since the snapshot, which might be only recent messages. Total recovery time is seconds instead of minutes.</p>
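<p>The recovery order described here can be sketched as a single function. This is an illustration under stated assumptions, not the chatroom's actual startup code: <code>storedMessage</code>, <code>recoverMessages</code>, and <code>demoRecover</code> are hypothetical names, and the sketch assumes message IDs only ever increase, so WAL entries already covered by the snapshot can be skipped.</p>

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// storedMessage is a hypothetical stand-in for the article's Message type.
type storedMessage struct {
	ID      int    `json:"id"`
	Content string `json:"content"`
}

// recoverMessages loads snapshot.json if it exists, then replays
// messages.wal on top of it. WAL entries whose ID is already covered by
// the snapshot are skipped (assumes IDs only ever increase).
func recoverMessages(dir string) ([]storedMessage, error) {
	var msgs []storedMessage

	data, err := os.ReadFile(filepath.Join(dir, "snapshot.json"))
	if err == nil {
		if err := json.Unmarshal(data, &msgs); err != nil {
			return nil, fmt.Errorf("parse snapshot: %w", err)
		}
	} else if !errors.Is(err, fs.ErrNotExist) {
		return nil, err
	}

	lastID := 0
	for _, m := range msgs {
		if m.ID > lastID {
			lastID = m.ID
		}
	}

	wal, err := os.Open(filepath.Join(dir, "messages.wal"))
	if errors.Is(err, fs.ErrNotExist) {
		return msgs, nil // no WAL: the snapshot is the full state
	}
	if err != nil {
		return nil, err
	}
	defer wal.Close()

	scanner := bufio.NewScanner(wal)
	for scanner.Scan() {
		var m storedMessage
		if err := json.Unmarshal(scanner.Bytes(), &m); err != nil {
			continue // skip blank, corrupt, or partially written lines
		}
		if m.ID > lastID {
			msgs = append(msgs, m)
			lastID = m.ID
		}
	}
	return msgs, scanner.Err()
}

// demoRecover builds a snapshot (IDs 1-2) and a WAL (IDs 2-3 plus a torn
// final line) in a temp directory and returns the recovered IDs.
func demoRecover() ([]int, error) {
	dir, err := os.MkdirTemp("", "recover")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	files := map[string]string{
		"snapshot.json": `[{"id":1,"content":"Hello"},{"id":2,"content":"Hi"}]`,
		"messages.wal": `{"id":2,"content":"Hi"}` + "\n" +
			`{"id":3,"content":"Still here"}` + "\n" +
			`{"id":4,"cont`,
	}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			return nil, err
		}
	}

	msgs, err := recoverMessages(dir)
	if err != nil {
		return nil, err
	}
	var ids []int
	for _, m := range msgs {
		ids = append(ids, m.ID)
	}
	return ids, nil
}

func main() {
	ids, err := demoRecover()
	fmt.Println("recovered IDs:", ids, "error:", err)
}
```

<p>Skipping already-seen IDs matters if the server crashes after the snapshot is renamed into place but before the WAL is truncated. In that window the same messages exist in both files, and replaying the WAL blindly would duplicate them.</p>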
<p>With persistence in place, your messages are safe. But network connections are unreliable. Users get disconnected when their WiFi drops, their phone switches towers, or their laptop goes to sleep. The next step is implementing session management so users can reconnect without losing their identity or chat history.</p>
<h2 id="heading-how-to-implement-session-management">How to Implement Session Management</h2>
<p>Session management lets users reconnect to your server after network interruptions without needing to create a new account or re-enter credentials. You'll implement this using cryptographically secure tokens that persist across connections.</p>
<p>Create <code>internal/chatroom/session.go</code> for reconnection handling.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"time"</span>

    <span class="hljs-string">"github.com/yourusername/chatroom/pkg/token"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">createSession</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> *<span class="hljs-title">SessionInfo</span></span> {
    cr.sessionsMu.Lock()
    <span class="hljs-keyword">defer</span> cr.sessionsMu.Unlock()

    tok := token.GenerateToken()

    session := &amp;SessionInfo{
        Username:       username,
        ReconnectToken: tok,
        LastSeen:       time.Now(),
        CreatedAt:      time.Now(),
    }

    cr.sessions[username] = session

    fmt.Printf(<span class="hljs-string">"Created session for %s (token: %s...)\n"</span>, username, tok[:<span class="hljs-number">8</span>])

    <span class="hljs-keyword">return</span> session
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">validateReconnectToken</span><span class="hljs-params">(username, token <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">bool</span></span> {
    cr.sessionsMu.Lock()
    <span class="hljs-keyword">defer</span> cr.sessionsMu.Unlock()

    session, exists := cr.sessions[username]
    <span class="hljs-keyword">if</span> !exists {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
    }

    <span class="hljs-keyword">if</span> session.ReconnectToken != token {
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
    }

    <span class="hljs-keyword">if</span> time.Since(session.LastSeen) &gt; <span class="hljs-number">1</span>*time.Hour {
        <span class="hljs-built_in">delete</span>(cr.sessions, username)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
    }

    session.LastSeen = time.Now()

    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">updateSessionActivity</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span></span> {
    cr.sessionsMu.Lock()
    <span class="hljs-keyword">defer</span> cr.sessionsMu.Unlock()

    <span class="hljs-keyword">if</span> session, exists := cr.sessions[username]; exists {
        session.LastSeen = time.Now()
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">isUsernameConnected</span><span class="hljs-params">(username <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">bool</span></span> {
    cr.mu.Lock()
    <span class="hljs-keyword">defer</span> cr.mu.Unlock()

    <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
        <span class="hljs-keyword">if</span> client.username == username {
            <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(cr *ChatRoom)</span> <span class="hljs-title">cleanupInactiveClients</span><span class="hljs-params">()</span></span> {
    ticker := time.NewTicker(<span class="hljs-number">30</span> * time.Second)
    <span class="hljs-keyword">defer</span> ticker.Stop()

    <span class="hljs-keyword">for</span> <span class="hljs-keyword">range</span> ticker.C {
        cr.mu.Lock()
        <span class="hljs-keyword">var</span> toRemove []*Client

        <span class="hljs-keyword">for</span> client := <span class="hljs-keyword">range</span> cr.clients {
            <span class="hljs-keyword">if</span> client.isInactive(<span class="hljs-number">5</span> * time.Minute) {
                fmt.Printf(<span class="hljs-string">"Removing inactive: %s\n"</span>, client.username)
                toRemove = <span class="hljs-built_in">append</span>(toRemove, client)
            }
        }
        cr.mu.Unlock()

        <span class="hljs-keyword">for</span> _, client := <span class="hljs-keyword">range</span> toRemove {
            cr.leave &lt;- client
        }
    }
}
</code></pre>
<h3 id="heading-how-to-generate-secure-tokens">How to Generate Secure Tokens</h3>
<p>Create <code>pkg/token/token.go</code> for token generation:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> token

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"crypto/rand"</span>
    <span class="hljs-string">"encoding/hex"</span>
)

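<span class="hljs-comment">// GenerateToken returns 16 bytes from crypto/rand, hex-encoded as a 32-character string</span>
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">GenerateToken</span><span class="hljs-params">()</span> <span class="hljs-title">string</span></span> {
    b := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">byte</span>, <span class="hljs-number">16</span>)
    <span class="hljs-keyword">if</span> _, err := rand.Read(b); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-comment">// No entropy means no safe token, so fail loudly</span>
        <span class="hljs-built_in">panic</span>(<span class="hljs-string">"token: crypto/rand failed: "</span> + err.Error())
    }
    <span class="hljs-keyword">return</span> hex.EncodeToString(b)
}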
</code></pre>
<p>Tokens here are transmitted in plaintext over TCP. For production use, you should use TLS encryption to protect tokens in transit, hash tokens before storage so a database breach doesn't expose them, and implement rate limiting on reconnection attempts to prevent brute force attacks.</p>
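<p>As a rough sketch of the "hash tokens before storage" advice, the server could keep only a SHA-256 digest of each token and compare digests in constant time. The <code>hashToken</code> and <code>tokenMatches</code> helpers below are hypothetical and not part of the chatroom code. A plain SHA-256 is a reasonable choice here only because the tokens are long random values. It would not be appropriate for user-chosen passwords.</p>

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a token. Storing
// the digest instead of the token means a leaked session store does not
// reveal usable tokens.
func hashToken(tok string) string {
	sum := sha256.Sum256([]byte(tok))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it with the stored
// digest in constant time, so the comparison itself does not leak how
// many leading characters were correct.
func tokenMatches(storedHash, presented string) bool {
	candidate := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(storedHash), []byte(candidate)) == 1
}

func main() {
	stored := hashToken("3f9a1c0de4b7a2556c8d9e0f1a2b3c4d") // example token value
	fmt.Println("correct token accepted:", tokenMatches(stored, "3f9a1c0de4b7a2556c8d9e0f1a2b3c4d"))
	fmt.Println("wrong token accepted:", tokenMatches(stored, "not-the-token"))
}
```

<p>With this approach, <code>validateReconnectToken</code> would compare against the stored digest rather than the raw token, and the raw token would only ever exist on the client and in transit.</p>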
<p>Your chatroom now supports basic messaging and reconnection. But users need ways to interact with the system beyond just sending messages. The command system provides features like listing users, viewing history, and sending private messages.</p>
<h2 id="heading-how-to-build-the-command-system">How to Build the Command System</h2>
<p>Commands are messages that start with a forward slash and perform special actions instead of being broadcast to everyone. This is a pattern used by many chat applications like Slack and Discord. You'll implement several useful commands that enhance the user experience.</p>
<p>Add command handling to <code>io.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handleCommand</span><span class="hljs-params">(client *Client, chatRoom *ChatRoom, command <span class="hljs-keyword">string</span>)</span></span> {
    parts := strings.Fields(command)
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) == <span class="hljs-number">0</span> {
        <span class="hljs-keyword">return</span>
    }

    <span class="hljs-keyword">switch</span> parts[<span class="hljs-number">0</span>] {
    <span class="hljs-keyword">case</span> <span class="hljs-string">"/users"</span>:
        chatRoom.listUsers &lt;- client

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/stats"</span>:
        client.mu.Lock()
        stats := <span class="hljs-string">"Your Stats:\n"</span>
        stats += fmt.Sprintf(<span class="hljs-string">"  Messages sent: %d\n"</span>, client.messagesSent)
        stats += fmt.Sprintf(<span class="hljs-string">"  Messages received: %d\n"</span>, client.messagesRecv)
        stats += fmt.Sprintf(<span class="hljs-string">"  Last active: %s ago\n"</span>, 
            time.Since(client.lastActive).Round(time.Second))
        client.mu.Unlock()

        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- stats:
        <span class="hljs-keyword">default</span>:
        }

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/msg"</span>:
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) &lt; <span class="hljs-number">3</span> {
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- <span class="hljs-string">"Usage: /msg &lt;username&gt; &lt;message&gt;\n"</span>:
            <span class="hljs-keyword">default</span>:
            }
            <span class="hljs-keyword">return</span>
        }

        targetUsername := parts[<span class="hljs-number">1</span>]
        messageText := strings.Join(parts[<span class="hljs-number">2</span>:], <span class="hljs-string">" "</span>)

        targetClient := chatRoom.findClientByUsername(targetUsername)
        <span class="hljs-keyword">if</span> targetClient == <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"User '%s' not found\n"</span>, targetUsername):
            <span class="hljs-keyword">default</span>:
            }
            <span class="hljs-keyword">return</span>
        }

        privateMsg := fmt.Sprintf(<span class="hljs-string">"[From %s]: %s\n"</span>, client.username, messageText)
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> targetClient.outgoing &lt;- privateMsg:
        <span class="hljs-keyword">default</span>:
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"%s's inbox is full\n"</span>, targetUsername):
            <span class="hljs-keyword">default</span>:
            }
            <span class="hljs-keyword">return</span>
        }

        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"Message sent to %s\n"</span>, targetUsername):
        <span class="hljs-keyword">default</span>:
        }

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/history"</span>:
        count := <span class="hljs-number">20</span>
        <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(parts) &gt; <span class="hljs-number">1</span> {
            fmt.Sscanf(parts[<span class="hljs-number">1</span>], <span class="hljs-string">"%d"</span>, &amp;count)
        }
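        <span class="hljs-keyword">if</span> count &lt; <span class="hljs-number">1</span> {
            count = <span class="hljs-number">20</span> <span class="hljs-comment">// fall back to the default for zero or negative input</span>
        }
        <span class="hljs-keyword">if</span> count &gt; <span class="hljs-number">100</span> {
            count = <span class="hljs-number">100</span>
        }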
        chatRoom.sendHistory(client, count)

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/token"</span>:
        chatRoom.sessionsMu.Lock()
        session := chatRoom.sessions[client.username]
        chatRoom.sessionsMu.Unlock()

        <span class="hljs-keyword">if</span> session != <span class="hljs-literal">nil</span> {
            msg := <span class="hljs-string">"Your reconnect token:\n"</span>
            msg += fmt.Sprintf(<span class="hljs-string">"   reconnect:%s:%s\n"</span>, client.username, session.ReconnectToken)
            <span class="hljs-keyword">select</span> {
            <span class="hljs-keyword">case</span> client.outgoing &lt;- msg:
            <span class="hljs-keyword">default</span>:
            }
        }

    <span class="hljs-keyword">case</span> <span class="hljs-string">"/quit"</span>:
        announcement := fmt.Sprintf(<span class="hljs-string">"%s left the chat\n"</span>, client.username)
        chatRoom.broadcast &lt;- announcement

        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- <span class="hljs-string">"Goodbye!\n"</span>:
        <span class="hljs-keyword">default</span>:
        }

        time.Sleep(<span class="hljs-number">100</span> * time.Millisecond)
        client.conn.Close()

    <span class="hljs-keyword">default</span>:
        <span class="hljs-keyword">select</span> {
        <span class="hljs-keyword">case</span> client.outgoing &lt;- fmt.Sprintf(<span class="hljs-string">"Unknown: %s\n"</span>, parts[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">default</span>:
        }
    }
}
</code></pre>
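<p>One detail worth knowing about the <code>/msg</code> handler: splitting with <code>strings.Fields</code> and rejoining with a single space collapses any runs of spaces or tabs inside the message text. If you want to keep the text exactly as typed, one option is to cut the line only at the first two separators. The <code>parseDirectMessage</code> helper below is a hypothetical sketch, not part of the code above.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// parseDirectMessage splits "/msg <username> <message>" into its parts
// without touching whitespace inside the message text. ok is false when
// the line is not a complete /msg command.
func parseDirectMessage(line string) (target, text string, ok bool) {
	const prefix = "/msg "
	if !strings.HasPrefix(line, prefix) {
		return "", "", false
	}

	// Drop extra spaces between the command and the username.
	rest := strings.TrimLeft(line[len(prefix):], " ")

	// Cut at the first space: everything after it is the message.
	target, text, found := strings.Cut(rest, " ")
	text = strings.TrimSpace(text)
	if !found || target == "" || text == "" {
		return "", "", false
	}
	return target, text, true
}

func main() {
	target, text, ok := parseDirectMessage("/msg bob see   you  at 5")
	fmt.Printf("target=%q text=%q ok=%v\n", target, text, ok)

	_, _, ok = parseDirectMessage("/msg bob")
	fmt.Println("missing message accepted:", ok)
}
```

<p>Whether this matters depends on your users. For plain chat it rarely does, but it becomes noticeable as soon as someone pastes indented text or code into a private message.</p>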
<p>Your server is now complete with all the core features: connection handling, message broadcasting, persistence, session management, and commands. But to actually use your chatroom, you need a client application. The client is much simpler than the server because it just needs to connect and relay messages.</p>
<h2 id="heading-how-to-create-the-client">How to Create the Client</h2>
<p>The client application provides the user interface for your chatroom. It connects to the server, displays incoming messages, and sends outgoing messages typed by the user. While the server is complex with many concurrent components, the client is straightforward.</p>
<p>Create <code>internal/chatroom/client.go</code> for the client implementation.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"bufio"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"net"</span>
    <span class="hljs-string">"os"</span>
    <span class="hljs-string">"strings"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">StartClient</span><span class="hljs-params">()</span></span> {
    conn, err := net.Dial(<span class="hljs-string">"tcp"</span>, <span class="hljs-string">":9000"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(<span class="hljs-string">"Error connecting:"</span>, err)
        <span class="hljs-keyword">return</span>
    }
    <span class="hljs-keyword">defer</span> conn.Close()

    fmt.Println(<span class="hljs-string">"Connected to chat server"</span>)

    <span class="hljs-comment">// Background goroutine: read from server</span>
    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
        reader := bufio.NewReader(conn)
        <span class="hljs-keyword">for</span> {
            message, err := reader.ReadString(<span class="hljs-string">'\n'</span>)
            <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
                fmt.Println(<span class="hljs-string">"Disconnected from server."</span>)
                os.Exit(<span class="hljs-number">0</span>)
            }
            <span class="hljs-comment">// Clear current prompt line and print message</span>
            fmt.Print(<span class="hljs-string">"\r"</span> + message)
            fmt.Print(<span class="hljs-string">"&gt;&gt; "</span>)
        }
    }()

    <span class="hljs-comment">// Main goroutine: read from stdin</span>
    inputReader := bufio.NewReader(os.Stdin)
    fmt.Println(<span class="hljs-string">"Welcome to the chat server!"</span>)

    <span class="hljs-keyword">for</span> {
        fmt.Print(<span class="hljs-string">"&gt;&gt; "</span>)
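        message, err := inputReader.ReadString(<span class="hljs-string">'\n'</span>)
        <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
            <span class="hljs-comment">// stdin was closed (for example Ctrl-D): exit instead of looping forever</span>
            <span class="hljs-keyword">return</span>
        }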
        message = strings.TrimSpace(message)

        <span class="hljs-keyword">if</span> message == <span class="hljs-string">""</span> {
            <span class="hljs-keyword">continue</span>
        }

        conn.Write([]<span class="hljs-keyword">byte</span>(message + <span class="hljs-string">"\n"</span>))
    }
}
</code></pre>
<h4 id="heading-how-the-client-works">How the Client Works</h4>
<p>The client uses two goroutines to handle communication simultaneously. The main goroutine reads from stdin (your keyboard) and sends messages to the server. When you type a message and press Enter, it gets sent over the TCP connection immediately.</p>
<p>The background goroutine continuously reads from the server. Whenever a message arrives, it prints it to your screen. The <code>\r</code> (carriage return) moves the cursor back to the start of the line, so the incoming message overwrites the <code>&gt;&gt;</code> prompt instead of appearing after it. After printing the message, the goroutine reprints the prompt so you can continue typing.</p>
<p>This dual-goroutine design means you can receive messages while typing. If someone sends a message while you're in the middle of typing yours, their message appears immediately and your prompt reappears below it.</p>
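<p>The <code>defer conn.Close()</code> closes the connection when <code>StartClient</code> returns. If the server disconnects, the read goroutine gets an error and calls <code>os.Exit(0)</code>, which terminates the entire client program immediately. Note that <code>os.Exit</code> does not run deferred functions, so on that path the operating system closes the socket as the process exits.</p>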
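<p>If you would rather have the client return normally when the server hangs up, so that deferred cleanup runs, one possible shape is to let the main loop wait on both the server and the keyboard. The sketch below is an assumption-laden illustration, not the article's client: <code>relay</code> and <code>demoRelay</code> are hypothetical names, and the demo uses an in-memory connection from <code>net.Pipe</code> in place of a real server.</p>

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net"
)

// relay copies server output to out and sends each line read from in to
// the server. It returns when either side ends, so the deferred Close runs.
func relay(conn net.Conn, in io.Reader, out io.Writer) error {
	defer conn.Close()

	// Server -> screen. The buffered channel lets this goroutine finish
	// even if relay has already returned.
	serverDone := make(chan error, 1)
	go func() {
		_, err := io.Copy(out, conn)
		serverDone <- err
	}()

	// Keyboard -> channel, so the loop below can wait on both sources.
	lines := make(chan string)
	go func() {
		defer close(lines)
		scanner := bufio.NewScanner(in)
		for scanner.Scan() {
			lines <- scanner.Text()
		}
	}()

	for {
		select {
		case err := <-serverDone:
			return err // the server closed the connection
		case line, ok := <-lines:
			if !ok {
				return nil // input ended (for example Ctrl-D)
			}
			if _, err := fmt.Fprintln(conn, line); err != nil {
				return err
			}
		}
	}
}

// demoRelay runs relay against an in-memory connection: the fake server
// sends one line and hangs up while the input side stays idle.
func demoRelay() (string, error) {
	client, server := net.Pipe()
	idleInput, inputWriter := io.Pipe()
	defer inputWriter.Close()

	go func() {
		server.Write([]byte("hello\n"))
		server.Close()
	}()

	var screen bytes.Buffer
	err := relay(client, idleInput, &screen)
	return screen.String(), err
}

func main() {
	got, err := demoRelay()
	fmt.Printf("received %q, error: %v\n", got, err)
}
```

<p>One caveat: the goroutine that reads input can remain blocked after <code>relay</code> returns. In a command-line client that exits right afterwards this is harmless, but a long-lived program would need a way to interrupt that read.</p>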
<h3 id="heading-how-to-create-entry-points">How to Create Entry Points</h3>
<p>Create <code>cmd/server/main.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"os"</span>

    <span class="hljs-string">"github.com/yourusername/chatroom/internal/chatroom"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"Starting server from cmd/server..."</span>)
    chatroom.StartServer()
    os.Exit(<span class="hljs-number">0</span>)
}
</code></pre>
<p>Create <code>cmd/client/main.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"github.com/yourusername/chatroom/internal/chatroom"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    fmt.Println(<span class="hljs-string">"Starting client from cmd/client..."</span>)
    chatroom.StartClient()
}
</code></pre>
<p>Add a wrapper function in <code>internal/chatroom/server.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">StartServer</span><span class="hljs-params">()</span></span> {
    runServer()
}
</code></pre>
<p>With all your entry points created, your chatroom is complete and ready to test. The next step is learning how to test your implementation to ensure everything works correctly.</p>
<h2 id="heading-how-to-test-your-chatroom">How to Test Your Chatroom</h2>
<p>Testing a concurrent system like a chatroom requires a different approach than testing typical sequential code. You need to verify that goroutines coordinate correctly, messages arrive in the right order, and the system handles edge cases like disconnections.</p>
<h3 id="heading-how-to-write-unit-tests">How to Write Unit Tests</h3>
<p>Unit tests verify individual components in isolation. For your chatroom, the most important test is verifying that messages broadcast correctly to all connected clients.</p>
<p>Create <code>internal/chatroom/chatroom_test.go</code>:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> chatroom

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"strings"</span>
    <span class="hljs-string">"testing"</span>
    <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">TestBroadcast</span><span class="hljs-params">(t *testing.T)</span></span> {
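    cr, err := NewChatRoom(<span class="hljs-string">"./testdata"</span>)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        t.Fatalf(<span class="hljs-string">"NewChatRoom failed: %v"</span>, err)
    }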
    <span class="hljs-keyword">defer</span> cr.shutdown()

    <span class="hljs-keyword">go</span> cr.Run()

    <span class="hljs-comment">// Create mock clients</span>
    client1 := &amp;Client{
        username: <span class="hljs-string">"Alice"</span>,
        outgoing: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">10</span>),
    }
    client2 := &amp;Client{
        username: <span class="hljs-string">"Bob"</span>,
        outgoing: <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">10</span>),
    }

    <span class="hljs-comment">// Join clients</span>
    cr.join &lt;- client1
    cr.join &lt;- client2
    time.Sleep(<span class="hljs-number">100</span> * time.Millisecond)

    <span class="hljs-comment">// Broadcast message</span>
    cr.broadcast &lt;- <span class="hljs-string">"[Alice]: Hello!"</span>

    <span class="hljs-comment">// Verify both receive it</span>
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> msg := &lt;-client1.outgoing:
        <span class="hljs-keyword">if</span> !strings.Contains(msg, <span class="hljs-string">"Hello!"</span>) {
            t.Fatal(<span class="hljs-string">"Client1 didn't receive correct message"</span>)
        }
    <span class="hljs-keyword">case</span> &lt;-time.After(<span class="hljs-number">1</span> * time.Second):
        t.Fatal(<span class="hljs-string">"Client1 didn't receive message"</span>)
    }

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> msg := &lt;-client2.outgoing:
        <span class="hljs-keyword">if</span> !strings.Contains(msg, <span class="hljs-string">"Hello!"</span>) {
            t.Fatal(<span class="hljs-string">"Client2 didn't receive correct message"</span>)
        }
    <span class="hljs-keyword">case</span> &lt;-time.After(<span class="hljs-number">1</span> * time.Second):
        t.Fatal(<span class="hljs-string">"Client2 didn't receive message"</span>)
    }
}
</code></pre>
<h4 id="heading-understanding-the-test">Understanding the Test:</h4>
<p>This test creates a chatroom instance and starts its event loop with <code>go cr.Run()</code>. Then it creates two mock clients. Notice these aren't real TCP connections – they're just Client structs with outgoing channels. This lets you test the broadcast logic without needing actual network connections.</p>
<p>The test sends both clients to the join channel, waits 100 milliseconds for them to be processed, then broadcasts a message. The <code>select</code> statements with timeout are crucial. They try to receive from each client's outgoing channel, but if nothing arrives within 1 second, the test fails. This prevents the test from hanging forever if something goes wrong.</p>
<p>The <code>time.Sleep(100 * time.Millisecond)</code> gives the event loop time to process the join events before broadcasting. In production code you'd synchronize explicitly (for example, with an acknowledgment channel) rather than sleep. For a small test, a short sleep is acceptable, at the cost of some flakiness on slow machines.</p>
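<p>To illustrate what channel-based synchronization might look like, here is a minimal, self-contained sketch. It does not use the article's <code>ChatRoom</code>: the <code>room</code> and <code>joinRequest</code> types, the <code>done</code> channel, and <code>joinAndWait</code> are all assumptions made for this example. The idea is that each join request carries a channel that the event loop closes once the join has been applied, so the caller knows exactly when it is safe to continue.</p>

```go
package main

import "fmt"

// joinRequest is a hypothetical message type: it carries the client name
// plus a channel the event loop closes once the join has been applied.
type joinRequest struct {
    name string
    done chan struct{}
}

// room is a stripped-down stand-in for a chatroom event loop.
type room struct {
    join    chan joinRequest
    count   chan int
    members map[string]bool
}

func newRoom() *room {
    return &room{
        join:    make(chan joinRequest),
        count:   make(chan int),
        members: make(map[string]bool),
    }
}

// run is the only goroutine that touches members, so no mutex is needed.
func (r *room) run() {
    for {
        select {
        case req := <-r.join:
            r.members[req.name] = true
            close(req.done) // signal that the join is fully processed
        case r.count <- len(r.members):
            // a caller asked for the current member count
        }
    }
}

// joinAndWait blocks until the event loop has registered the client.
func (r *room) joinAndWait(name string) {
    req := joinRequest{name: name, done: make(chan struct{})}
    r.join <- req
    <-req.done
}

func main() {
    r := newRoom()
    go r.run()

    r.joinAndWait("Alice")
    r.joinAndWait("Bob")

    // No sleep needed: both joins are guaranteed to be applied here.
    fmt.Println("members:", <-r.count)
}
```

<p>The tradeoff is a slightly larger message type in exchange for deterministic tests that never wait longer than necessary.</p>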
<p>Run tests with:</p>
<pre><code class="lang-bash">go test ./internal/chatroom -v
</code></pre>
<p>The <code>-v</code> flag shows verbose output, printing each test as it runs. You'll see whether the broadcast test passes and how long it took. Below is the output showing that the test passed:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770480869735/1102c089-1d10-43fc-bd7f-5571283213a0.png" alt="Chatroom unit test" class="image--center mx-auto" width="1294" height="456" loading="lazy"></p>
<h3 id="heading-how-to-do-integration-testing">How to Do Integration Testing</h3>
<p>Integration tests verify the entire system working together – the real server, real clients, and real network connections. Unlike unit tests that mock components, integration tests exercise the full stack.</p>
<p>Test the full client-server flow:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Terminal 1: Start server</span>
go run cmd/server/main.go

<span class="hljs-comment"># Terminal 2: Client 1</span>
go run cmd/client/main.go
<span class="hljs-comment"># Enter username: Alice</span>

<span class="hljs-comment"># Terminal 3: Client 2  </span>
go run cmd/client/main.go
<span class="hljs-comment"># Enter username: Bob</span>

<span class="hljs-comment"># Terminal 4: Client 3  </span>
go run cmd/client/main.go
<span class="hljs-comment"># Enter username: John</span>

<span class="hljs-comment"># Test messaging between clients</span>
</code></pre>
<h4 id="heading-what-to-test">What to Test:</h4>
<p>Once you have the server running and multiple clients connected, you can verify all the features you built. Here's what a complete test session looks like:</p>
<ol>
<li><p><strong>Basic Messaging:</strong> Send a message from Alice and verify Bob and John both receive it. You should see the message appear in all client windows with the sender's username in brackets. Try sending from each client to verify the broadcast works in all directions.</p>
</li>
<li><p><strong>Join and Leave Announcements:</strong> When a new client connects, all existing clients should see a "joined the chat" announcement. When someone disconnects (either with <code>/quit</code> or by closing their terminal), everyone should see a "left the chat" message. This confirms your join and leave handlers work correctly.</p>
</li>
<li><p><strong>Private Messaging:</strong> Use <code>/msg Bob this is a private message</code> from Alice's client. The message should appear only in Bob's window and never in John's. Alice should see only a confirmation that the message was sent. Try sending private messages between different pairs of users to verify the routing works correctly.</p>
</li>
<li><p><strong>User List:</strong> Run <code>/users</code> from any client. You should see a list of all connected users. If someone has been idle for over a minute, they should show an "(idle)" status. The command should also display total message count and server uptime.</p>
</li>
<li><p><strong>Chat History:</strong> New clients should automatically receive the last 10 messages when they join. You can also use <code>/history 20</code> to request the last 20 messages. This verifies your message persistence is working.</p>
</li>
<li><p><strong>Session Reconnection:</strong> From one client, use <code>/token</code> to get your reconnection token. It will look something like <code>reconnect:Alice:338f04ca...</code>. Copy this token, disconnect the client with Ctrl+C, start a new client, and paste the reconnection string when prompted. You should rejoin the chat with your previous identity, and other users won't see duplicate join announcements.</p>
</li>
<li><p><strong>Statistics:</strong> Use <code>/stats</code> to see how many messages you've sent and received, and when you were last active. This verifies the client-side statistics tracking works.</p>
</li>
<li><p><strong>Error Handling:</strong> Try connecting with a username that's already in use – you should be rejected. Try sending a private message to a non-existent user – you should get an error. Try using an invalid reconnection token – you should be denied. These tests verify your validation logic works.</p>
</li>
</ol>
<p>Look at the server terminal to see the server's perspective. You'll see connection logs, broadcast confirmations, and any errors. When clients disconnect, you should see their sessions being updated. When the server creates snapshots, you'll see those logged, too.</p>
<p>Integration testing catches problems that unit tests miss, like network timeouts, message ordering issues across multiple clients, or problems with how the WAL file is created and locked. The screenshot below shows a successful integration test with three clients (Alice, Bob, and John) all communicating successfully, with private messages, public broadcasts, and proper join/leave handling.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770348441963/d66bb2da-088b-4b9c-95a2-e5f401c5f49e.png" alt="chatroom broadcast test" class="image--center mx-auto" width="2833" height="644" loading="lazy"></p>
<h2 id="heading-how-to-deploy-your-server">How to Deploy Your Server</h2>
<p>Deploying your chatroom means running it on a server that stays up 24/7, automatically restarts if it crashes, and starts when the server boots. There are several approaches depending on your infrastructure.</p>
<h3 id="heading-how-to-use-systemd">How to Use Systemd</h3>
<p>Systemd is the standard init system on most Linux distributions. It manages services, handles restarts, and ensures your chatroom starts on boot.</p>
<p>Create <code>/etc/systemd/system/chatroom.service</code>:</p>
<pre><code class="lang-ini"><span class="hljs-section">[Unit]</span>
<span class="hljs-attr">Description</span>=Chatroom Server
<span class="hljs-attr">After</span>=network.target

<span class="hljs-section">[Service]</span>
<span class="hljs-attr">Type</span>=simple
<span class="hljs-attr">User</span>=chatroom
<span class="hljs-attr">WorkingDirectory</span>=/opt/chatroom
<span class="hljs-attr">ExecStart</span>=/opt/chatroom/server
<span class="hljs-attr">Restart</span>=<span class="hljs-literal">on</span>-failure
<span class="hljs-attr">RestartSec</span>=<span class="hljs-number">5</span>s

<span class="hljs-section">[Install]</span>
<span class="hljs-attr">WantedBy</span>=multi-user.target
</code></pre>
<h4 id="heading-understanding-the-configuration">Understanding the Configuration:</h4>
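<p>The <code>[Unit]</code> section describes the service and its dependencies. <code>After=network.target</code> orders the service after systemd has brought up the network stack. That is usually enough for a server that only listens on a local port, but it does not guarantee that a specific address or route is already configured.</p>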
<p>The <code>[Service]</code> section defines how to run your server. <code>Type=simple</code> means systemd should just run the command and consider it started. <code>User=chatroom</code> runs the server as a dedicated user (not root) for security. <code>WorkingDirectory</code> sets where the server runs, which is important because your WAL and snapshot files are created relative to this directory.</p>
<p><code>Restart=on-failure</code> tells systemd to automatically restart your server if it crashes. <code>RestartSec=5s</code> waits 5 seconds before restarting, preventing rapid restart loops if there's a persistent problem.</p>
<p>The <code>[Install]</code> section makes your service start at boot when you enable it.</p>
<h4 id="heading-deploying-your-server">Deploying Your Server:</h4>
<p>First, build your server binary:</p>
<pre><code class="lang-bash">go build -o server cmd/server/main.go
</code></pre>
<p>Then copy it to the deployment location:</p>
<pre><code class="lang-bash">sudo mkdir -p /opt/chatroom
sudo cp server /opt/chatroom/
sudo mkdir -p /opt/chatroom/chatdata
</code></pre>
<p>Create a dedicated user for running the service:</p>
<pre><code class="lang-bash">sudo useradd -r -s /bin/<span class="hljs-literal">false</span> chatroom
sudo chown -R chatroom:chatroom /opt/chatroom
</code></pre>
<p>Enable and start the service:</p>
<pre><code class="lang-bash">sudo systemctl <span class="hljs-built_in">enable</span> chatroom
sudo systemctl start chatroom
</code></pre>
<p>Check that it's running:</p>
<pre><code class="lang-bash">sudo systemctl status chatroom
</code></pre>
<p>You can view logs with:</p>
<pre><code class="lang-bash">sudo journalctl -u chatroom -f
</code></pre>
<p>The <code>-f</code> flag follows the logs in real-time, similar to <code>tail -f</code>.</p>
<h3 id="heading-how-to-use-docker">How to Use Docker</h3>
<p>Docker packages your application with all its dependencies, making it easy to deploy anywhere that runs Docker.</p>
<p>Create a <code>Dockerfile</code>:</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> golang:<span class="hljs-number">1.23</span>-alpine AS builder
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> go.mod go.sum ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> go mod download</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>
<span class="hljs-keyword">RUN</span><span class="bash"> go build -o server cmd/server/main.go</span>

<span class="hljs-keyword">FROM</span> alpine:latest
<span class="hljs-keyword">RUN</span><span class="bash"> apk --no-cache add ca-certificates</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /root/</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/server .</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/chatdata ./chatdata</span>
<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">9000</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"./server"</span>]</span>
</code></pre>
<h4 id="heading-understanding-the-dockerfile">Understanding the Dockerfile:</h4>
<p>This uses a multi-stage build. The first stage (<code>builder</code>) uses the full Go image to compile your server. The second stage uses a minimal Alpine Linux image and copies in only the compiled binary and the <code>chatdata</code> directory. This keeps the final image small (roughly 20MB instead of 800MB).</p>
<p><code>EXPOSE 9000</code> documents which port the container uses. <code>CMD ["./server"]</code> specifies what command runs when the container starts.</p>
<p>Build and Run:</p>
<pre><code class="lang-bash">docker build -t chatroom .
docker run -p 9000:9000 -v $(<span class="hljs-built_in">pwd</span>)/chatdata:/root/chatdata chatroom
</code></pre>
<p>The <code>-p 9000:9000</code> maps port 9000 in the container to port 9000 on your host, making the chatroom accessible. The <code>-v $(pwd)/chatdata:/root/chatdata</code> mounts your local chatdata directory into the container, so messages persist even if you stop and remove the container.</p>
<h4 id="heading-running-in-production">Running in Production:</h4>
<p>For production, you'd typically use Docker Compose or Kubernetes. Here's a simple <code>docker-compose.yml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">version:</span> <span class="hljs-string">'3.8'</span>
<span class="hljs-attr">services:</span>
  <span class="hljs-attr">chatroom:</span>
    <span class="hljs-attr">build:</span> <span class="hljs-string">.</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"9000:9000"</span>
    <span class="hljs-attr">volumes:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">./chatdata:/root/chatdata</span>
    <span class="hljs-attr">restart:</span> <span class="hljs-string">unless-stopped</span>
</code></pre>
<p>Run with:</p>
<pre><code class="lang-bash">docker-compose up -d
</code></pre>
<p>The <code>restart: unless-stopped</code> policy ensures your container restarts automatically if it crashes or if the Docker daemon restarts, unless you stopped the container yourself.</p>
<h2 id="heading-enhancements-you-could-add">Enhancements You Could Add</h2>
<h3 id="heading-1-multi-room-support">1. Multi-Room Support</h3>
<p>You could add the concept of channels/rooms like this:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> ChatRoom <span class="hljs-keyword">struct</span> {
    rooms <span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]*Room
}

<span class="hljs-keyword">type</span> Room <span class="hljs-keyword">struct</span> {
    name    <span class="hljs-keyword">string</span>
    clients <span class="hljs-keyword">map</span>[*Client]<span class="hljs-keyword">bool</span>
    history []Message
}
</code></pre>
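<p>As a rough sketch of how these types might be used, the example below adds two helpers that are not part of the article's code: <code>joinRoom</code> and <code>broadcastTo</code> are hypothetical names, and the <code>Client</code> type here is reduced to the <code>username</code> and <code>outgoing</code> fields used in the unit test. It assumes both helpers are only ever called from the single event-loop goroutine, so no locking is shown.</p>

```go
package main

import "fmt"

// Client is reduced to the two fields used in the article's unit test.
type Client struct {
    username string
    outgoing chan string
}

type Room struct {
    name    string
    clients map[*Client]bool
}

type ChatRoom struct {
    rooms map[string]*Room
}

// joinRoom (hypothetical) creates the room on first use and adds the client.
func (cr *ChatRoom) joinRoom(roomName string, c *Client) {
    room, ok := cr.rooms[roomName]
    if !ok {
        room = &Room{name: roomName, clients: make(map[*Client]bool)}
        cr.rooms[roomName] = room
    }
    room.clients[c] = true
}

// broadcastTo (hypothetical) delivers msg only to members of the named room
// and reports how many clients accepted it.
func (cr *ChatRoom) broadcastTo(roomName, msg string) int {
    room, ok := cr.rooms[roomName]
    if !ok {
        return 0
    }
    delivered := 0
    for c := range room.clients {
        select {
        case c.outgoing <- msg:
            delivered++
        default:
            // The client's buffer is full: skip it rather than block the loop.
        }
    }
    return delivered
}

func main() {
    cr := &ChatRoom{rooms: make(map[string]*Room)}
    alice := &Client{username: "Alice", outgoing: make(chan string, 10)}
    bob := &Client{username: "Bob", outgoing: make(chan string, 10)}

    cr.joinRoom("general", alice)
    cr.joinRoom("random", bob)

    fmt.Println("delivered:", cr.broadcastTo("general", "[Alice]: Hello!"))
    fmt.Println("queued for Bob:", len(bob.outgoing))
}
```

<p>Dropping messages for slow clients is one design choice among several. Disconnecting the client or growing the buffer are alternatives, depending on how much delivery matters.</p>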
<h3 id="heading-2-user-authentication">2. User Authentication</h3>
<p>You could replace simple usernames with proper authentication for added security:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> User <span class="hljs-keyword">struct</span> {
    ID           <span class="hljs-keyword">int</span>
    Username     <span class="hljs-keyword">string</span>
    PasswordHash <span class="hljs-keyword">string</span>
    Email        <span class="hljs-keyword">string</span>
    CreatedAt    time.Time
}
</code></pre>
<h3 id="heading-3-file-sharing">3. File Sharing</h3>
<p>You could allow users to upload files:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> FileMessage <span class="hljs-keyword">struct</span> {
    Message
    FileName <span class="hljs-keyword">string</span>
    FileSize <span class="hljs-keyword">int64</span>
    FileURL  <span class="hljs-keyword">string</span>
}
</code></pre>
<h3 id="heading-4-websocket-support">4. WebSocket Support</h3>
<p>You could add an HTTP/WebSocket endpoint for web clients.</p>
<h3 id="heading-5-horizontal-scaling">5. Horizontal Scaling</h3>
<p>For massive scale, you could shard across multiple servers using Redis pub/sub or NATS for inter-server communication.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've now built a production-ready distributed chatroom from scratch. This project demonstrates important distributed systems concepts including concurrency patterns, network programming, state management, persistence, and fault tolerance.</p>
<p>Additional resources:</p>
<ul>
<li><p><strong>Go Concurrency</strong>: "Concurrency in Go" by Katherine Cox-Buday</p>
</li>
<li><p><strong>Distributed Systems</strong>: "Designing Data-Intensive Applications" by Martin Kleppmann</p>
</li>
<li><p><strong>Networking</strong>: "Unix Network Programming" by Stevens</p>
</li>
</ul>
<p>The full source code is available on <a target="_blank" href="https://github.com/Caesarsage/distributed-system/tree/main/chatroom-with-broadcast">GitHub</a>. Feel free to open issues or contribute improvements.</p>
<p>As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Manage Blue-Green Deployments on AWS ECS with Database Migrations: Complete Implementation Guide ]]>
                </title>
                <description>
                    <![CDATA[ Blue-green deployments are celebrated for enabling zero-downtime releases and instant rollbacks. You deploy your new version (green) alongside the current one (blue), switch traffic over, and if something goes wrong, you switch back. Simple, right? N... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-manage-blue-green-deployments-on-aws-ecs-with-database-migrations/</link>
                <guid isPermaLink="false">69693109596ef11a775126fb</guid>
                
                    <category>
                        <![CDATA[ deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Blue/Green deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Databases ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jan 2026 18:25:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768497873258/be1ce2a3-c95f-488e-913a-a772007a0d2a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Blue-green deployments are celebrated for enabling zero-downtime releases and instant rollbacks. You deploy your new version (green) alongside the current one (blue), switch traffic over, and if something goes wrong, you switch back. Simple, right?</p>
<p>Not quite. While blue-green deployments work beautifully for stateless applications, they become significantly more complex when you introduce databases and stateful services into the equation. The moment your blue and green environments need to share a database, you're facing a fundamental challenge: how do you evolve your schema and data without breaking either version?</p>
<p>In this article, we'll tackle the real-world complexities of implementing blue-green deployments on Amazon ECS when your application depends on shared state. You'll learn practical strategies for handling database migrations, managing sessions, and maintaining data consistency across application versions.</p>
<p>💡 <strong>Complete Working Example</strong>: All code examples in this article are available in the <a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs">bluegreen-deployment-ecs repository on GitHub</a>. You can clone it and deploy the entire infrastructure to your AWS account.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-problem-with-state-in-blue-green-deployments">The Problem with State in Blue-Green Deployments</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-database-migration-strategies-for-blue-green">Database Migration Strategies for Blue-Green</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-handling-stateful-services-in-ecs">Handling Stateful Services in ECS</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-complete-implementation-end-to-end-example">Complete Implementation: End-to-End Example</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-rollback-strategies">Rollback Strategies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-monitoring-during-deployments">Monitoring During Deployments</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices">Best Practices</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-when-not-to-use-blue-green">When NOT to Use Blue-Green</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-alternative-deployment-strategies">Alternative Deployment Strategies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-further-resources">Further Resources</a></p>
</li>
</ul>
<h2 id="heading-the-problem-with-state-in-blue-green-deployments">The Problem with State in Blue-Green Deployments</h2>
<p>The elegance of blue-green deployments starts to crumble when you consider databases. Here's why: your blue environment runs application version 1, your green environment runs version 2, but they both connect to the same RDS instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768056130585/109ceff8-4500-45d7-aaa0-5e259b4a7b11.png" alt="Figure 1: The blue-green dilemma - both environments share the same database but expect different schemas" class="image--center mx-auto" width="1579" height="1131" loading="lazy"></p>
<p>Consider this scenario: you're adding a new feature that requires a new database column. Version 2 of your application expects this column to exist. You deploy green, run your migration to add the column, and switch traffic.</p>
<p>Everything works great until you need to roll back. Now version 1 is receiving traffic, but it doesn't know what to do with that new column. Worse, if your migration removed or renamed a column that version 1 depends on, your rollback will fail catastrophically.</p>
<p>Here are the specific challenges you'll face:</p>
<ul>
<li><p><strong>Schema versioning conflicts</strong>: Your blue environment expects schema version N, while green expects version N+1. Any breaking schema change will cause one environment to fail.</p>
</li>
<li><p><strong>Data inconsistencies</strong>: If version 2 writes data in a new format that version 1 can't read, switching back to blue will result in errors or data corruption.</p>
</li>
<li><p><strong>Irreversible migrations</strong>: Some database changes are inherently destructive. Dropping a column, changing data types, or restructuring tables can't be easily undone.</p>
</li>
<li><p><strong>Failed rollbacks</strong>: The promise of instant rollback becomes hollow when your database has evolved beyond what the blue environment can handle.</p>
</li>
</ul>
<p>Let's explore the strategies that solve these problems.</p>
<h2 id="heading-database-migration-strategies-for-blue-green">Database Migration Strategies for Blue-Green</h2>
<h3 id="heading-strategy-1-the-expand-contract-pattern-recommended">Strategy 1: The Expand-Contract Pattern (Recommended)</h3>
<p>The expand-contract pattern is the most practical approach for blue-green deployments with shared databases. It works by breaking schema changes into three phases, ensuring backwards compatibility throughout.</p>
<h4 id="heading-phase-1-expand">Phase 1: Expand</h4>
<p>In this phase, you add new schema elements while keeping old ones intact. If you're renaming a column, add the new column without removing the old one. If you're changing table structure, create new tables alongside existing ones.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Example: Renaming 'user_name' to 'username'</span>
<span class="hljs-comment">-- Phase 1: Expand - Add new column</span>
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> username <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>);

<span class="hljs-comment">-- Populate new column from old column</span>
<span class="hljs-keyword">UPDATE</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">SET</span> username = user_name <span class="hljs-keyword">WHERE</span> username <span class="hljs-keyword">IS</span> <span class="hljs-literal">NULL</span>;
</code></pre>
<p>At this point, your database supports both the old schema (used by blue) and the new schema (used by green). Your application code needs to handle both as well.</p>
<h4 id="heading-phase-2-deploy">Phase 2: Deploy</h4>
<p>Now, deploy your green environment with code that uses the new schema. But this code should still write to both old and new columns to maintain compatibility.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Version 2 code - writes to both columns</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">update_user</span>(<span class="hljs-params">user_id, username</span>):</span>
    db.execute(
        <span class="hljs-string">"UPDATE users SET username = %s, user_name = %s WHERE id = %s"</span>,
        (username, username, user_id)
    )
</code></pre>
<p>Traffic shifts from blue to green. Both environments work because the database supports both schemas. If you need to roll back, blue still functions perfectly because the old columns are intact.</p>
<h4 id="heading-phase-3-contract">Phase 3: Contract</h4>
<p>After you're confident green is stable and you've decommissioned blue, remove the old schema elements in a separate deployment.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Phase 3: Contract - Remove old column</span>
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">DROP</span> <span class="hljs-keyword">COLUMN</span> user_name;
</code></pre>
<p>Update your application code to stop writing to the old columns. This is now version 3, deployed as a standard release.</p>
<p><strong>When to use</strong>: This should be your default approach for most schema changes including adding/removing columns, renaming fields, changing constraints, and restructuring tables.</p>
<h3 id="heading-strategy-2-parallel-schemas-or-databases">Strategy 2: Parallel Schemas or Databases</h3>
<p>For major breaking changes where backwards compatibility is impractical, you might maintain entirely separate database versions. Version 1 connects to database A, version 2 connects to database B. This approach requires data synchronization between databases. AWS Database Migration Service (DMS) can replicate data in near real-time, or you can build custom replication logic using change data capture.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Configuration for version-specific database connections</span>
DATABASE_CONFIG = {
    <span class="hljs-string">'v1'</span>: {
        <span class="hljs-string">'host'</span>: <span class="hljs-string">'blue-db.cluster-xxxxx.us-east-1.rds.amazonaws.com'</span>,
        <span class="hljs-string">'database'</span>: <span class="hljs-string">'app_v1'</span>
    },
    <span class="hljs-string">'v2'</span>: {
        <span class="hljs-string">'host'</span>: <span class="hljs-string">'green-db.cluster-yyyyy.us-east-1.rds.amazonaws.com'</span>,
        <span class="hljs-string">'database'</span>: <span class="hljs-string">'app_v2'</span>
    }
}
</code></pre>
<p>During the transition period, you run DMS to keep both databases synchronized, with the understanding that writes go to the active version's database.</p>
<p>The challenge is that you're now managing data synchronization, dealing with replication lag, and paying for two databases. Eventually, you need to consolidate back to one database, which requires another migration. This is expensive and complex, which is why it's the "nuclear option."</p>
<p><strong>When to use</strong>: Only for major architectural changes, complete data model redesigns, or when migrating between database types (for example, MySQL to PostgreSQL). If expand-contract can possibly work, use that instead.</p>
<h3 id="heading-strategy-3-feature-flags-for-gradual-rollout">Strategy 3: Feature Flags for Gradual Rollout</h3>
<p>Feature flags allow you to decouple deployment from release. Both blue and green run the same codebase, but features are toggled on or off via configuration. This shifts the problem from schema compatibility to code-level compatibility.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_user</span>(<span class="hljs-params">user_data</span>):</span>
    config = get_feature_config()
    <span class="hljs-keyword">if</span> config[<span class="hljs-string">'use_new_user_schema'</span>]:
        <span class="hljs-keyword">return</span> create_user_v2(user_data)
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> create_user_v1(user_data)
</code></pre>
<p>Instead of having two separate deployments (blue and green), you have ONE deployment with conditional logic. The "switch" from old to new behavior happens via configuration change, not infrastructure change. This is technically not pure blue-green, but it's a powerful hybrid approach.</p>
<h4 id="heading-how-it-works">How it works</h4>
<p>Your application checks AWS AppConfig (or a similar service) for feature flags before executing code paths. When a flag is off, it uses the old schema/logic. When it's on, it uses the new schema/logic. You can even enable features for a percentage of users (5% get new behavior, 95% get old behavior) for gradual rollout.</p>
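<p>To make the percentage rollout concrete, here is a minimal sketch of deterministic bucketing. It assumes you compute the bucket yourself from a user ID. The function name <code>in_rollout</code> and its parameters are hypothetical, and your flag service may already provide an equivalent feature.</p>

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage."""
    # Hash the flag name together with the user ID so that each flag
    # gets its own stable, independent assignment of users to buckets.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0-99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    enabled = sum(in_rollout("use_new_user_schema", u, 5) for u in users)
    print(f"{enabled / len(users):.1%} of users see the new behavior")
```

<p>Because the bucket depends only on the flag name and user ID, a given user keeps seeing the same behavior across requests, and raising the percentage only ever adds users to the new path.</p>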
<p>The tradeoff is that your codebase temporarily contains both old and new logic with conditional branches everywhere. This increases complexity and requires disciplined cleanup after the feature is fully released. However, you gain fine-grained control and can toggle features on/off instantly without deploying new infrastructure.</p>
<p><strong>When to use:</strong> For large features with uncertain stability, gradual rollouts to monitor impact, or when you want instant rollback capability without touching infrastructure. Also useful when combined with expand-contract for extra safety.</p>
<h2 id="heading-handling-stateful-services-in-ecs">Handling Stateful Services in ECS</h2>
<p>Beyond databases, several other stateful components require careful consideration during blue-green deployments.</p>
<h3 id="heading-session-management">Session Management</h3>
<p>It’s a good idea to store sessions in ElastiCache or DynamoDB rather than application memory:</p>
<pre><code class="lang-python">app.config[<span class="hljs-string">'SESSION_TYPE'</span>] = <span class="hljs-string">'dynamodb'</span>
app.config[<span class="hljs-string">'SESSION_DYNAMODB'</span>] = boto3.resource(<span class="hljs-string">'dynamodb'</span>)  <span class="hljs-comment"># Flask-Session expects a boto3 resource here, not a client</span>
</code></pre>
<h3 id="heading-shared-resources">Shared Resources</h3>
<p>Beyond databases and sessions, your application likely depends on other stateful components that need coordination during blue-green deployments:</p>
<h4 id="heading-1-s3-buckets">S3 buckets</h4>
<p>If your application stores files or data in S3, schema changes to object metadata or file formats can cause compatibility issues between versions. To address this, you can enable S3 versioning to maintain multiple format versions simultaneously.</p>
<p>For example, if version 2 writes JSON files with a new structure, version 1 should still be able to read the old format. You can include a version prefix in object keys (like <code>v1/user-data.json</code> and <code>v2/user-data.json</code>) or embed version metadata in the objects themselves.</p>
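<p>Both ideas can be sketched in a few lines. This is only an illustration: the helper names, the <code>format_version</code> field, and the rule that unmarked objects count as v1 are all assumptions, and the actual S3 calls are left out.</p>

```python
import json
from typing import Tuple

CURRENT_FORMAT = "v2"  # hypothetical: the format this application version writes


def object_key(name: str, fmt: str = CURRENT_FORMAT) -> str:
    """Build a version-prefixed key such as 'v2/user-data.json'."""
    return f"{fmt}/{name}"


def wrap_payload(data: dict, fmt: str = CURRENT_FORMAT) -> bytes:
    """Embed the format version inside the object body itself."""
    return json.dumps({"format_version": fmt, "data": data}).encode("utf-8")


def read_payload(body: bytes) -> Tuple[str, dict]:
    """Return (format_version, data) for objects with or without a marker."""
    doc = json.loads(body)
    # Assumption: objects written before the marker existed are v1.
    if "format_version" not in doc:
        return "v1", doc
    return doc["format_version"], doc["data"]
```

<p>A reader can then branch on the returned version instead of guessing the format from the object's shape.</p>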
<h4 id="heading-message-queues-sqssns">Message queues (SQS/SNS)</h4>
<p>Messages sent by one version must be readable by the other during the transition. You can use versioned message schemas with a <code>schema_version</code> field in your message payload. Both blue and green should be able to parse messages from either version, even if they only produce messages in their preferred format. Consider using a schema registry or validation library to ensure compatibility.</p>
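<p>As a rough sketch of what "parse messages from either version" can look like, the consumer below normalizes both shapes into one internal dictionary. The payload fields and the function name <code>parse_customer_event</code> are hypothetical, and treating a missing <code>schema_version</code> as version 1 is an assumption.</p>

```python
import json


def parse_customer_event(body: str) -> dict:
    """Normalize a v1 or v2 message into one internal shape."""
    msg = json.loads(body)
    version = msg.get("schema_version", 1)  # assumption: unversioned means v1

    if version == 1:
        # Hypothetical v1 payload: a single free-text address.
        return {"email": msg["email"], "address": msg["address"], "structured": None}

    if version == 2:
        # Hypothetical v2 payload: structured address fields.
        parts = {k: msg[k] for k in ("street_address", "city", "state", "zip_code")}
        return {
            "email": msg["email"],
            "address": ", ".join(parts.values()),
            "structured": parts,
        }

    # Fail loudly on versions this consumer does not understand.
    raise ValueError(f"unsupported schema_version: {version}")
```

<p>Rejecting unknown versions explicitly lets such messages go to a dead-letter queue rather than being silently misread.</p>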
<h4 id="heading-cache-layers-elasticacheredis">Cache layers (ElastiCache/Redis)</h4>
<p>Cached data structure changes can cause deserialization errors when switching between versions. Try versioning your cache keys by including the schema version: <code>CACHE_VERSION = 'v2'</code> and then <code>cache_key = f"user:{CACHE_VERSION}:{user_id}"</code>. This ensures blue and green maintain separate cache namespaces, preventing cross-contamination. When you fully migrate to green, you can flush the old cache keys or let them expire naturally.</p>
<pre><code class="lang-python">CACHE_VERSION = <span class="hljs-string">'v2'</span>
cache_key = <span class="hljs-string">f"user:<span class="hljs-subst">{CACHE_VERSION}</span>:<span class="hljs-subst">{user_id}</span>"</span>
</code></pre>
<h2 id="heading-implementation-end-to-end-example">Implementation: End-to-End Example</h2>
<p>Let's walk through a complete blue-green deployment with ECS, handling a database schema change using the <strong>expand-contract pattern</strong>. We'll migrate from a single <code>address</code> text field to structured <code>street_address</code>, <code>city</code>, <code>state</code>, and <code>zip_code</code> fields.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768052075044/fdb732dd-cf3d-473f-a22c-f5ab98870625.png" alt="Figure 2: The three phases of expand-contract migration ensuring continuous compatibility" class="image--center mx-auto" width="3444" height="624" loading="lazy"></p>
<p><strong>Here’s the scenario:</strong> You're running an e-commerce application on ECS. The current version (blue) stores customer addresses in a single address text field. Version 2 (green) splits this into structured fields: street_address, city, state, and zip_code.</p>
<h3 id="heading-architecture-setup"><strong>Architecture Setup</strong></h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768087707691/ff19ce97-b745-4aa8-8b39-4d835fd781cd.png" alt="Figure 3: Complete AWS architecture for blue-green ECS deployment with shared RDS database" class="image--center mx-auto" width="2479" height="3679" loading="lazy"></p>
<p>Your infrastructure includes:</p>
<ul>
<li><p>ECS cluster running Fargate tasks</p>
</li>
<li><p>Application Load Balancer with two target groups (blue and green)</p>
</li>
<li><p>RDS PostgreSQL database (shared between environments)</p>
</li>
<li><p>CodeDeploy for managing traffic shifts</p>
</li>
<li><p>Parameter Store for database connection strings</p>
</li>
</ul>
<p>💡 <strong>Implementation Note</strong>: The complete Terraform code for this architecture is available in the <a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs/tree/main/terraform">companion GitHub repository</a>.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>Before starting, make sure that you have the following tools installed and your AWS credentials properly configured:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Required tools</span>
aws --version      <span class="hljs-comment"># AWS CLI</span>
terraform --version <span class="hljs-comment"># Terraform &gt;= 1.0</span>
docker --version   <span class="hljs-comment"># Docker</span>
psql --version     <span class="hljs-comment"># PostgreSQL client</span>

<span class="hljs-comment"># Configure AWS credentials</span>
aws configure
aws sts get-caller-identity  <span class="hljs-comment"># Verify your identity</span>
</code></pre>
<h3 id="heading-step-1-deploy-infrastructure-and-blue-environment">Step 1: Deploy Infrastructure and Blue Environment</h3>
<p>We’ll start by setting up the entire AWS infrastructure from scratch using Terraform, then deploying the initial version of our application (blue environment).</p>
<p>First, clone the repository and set up your environment:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Clone the repository</span>
git <span class="hljs-built_in">clone</span> https://github.com/Caesarsage/bluegreen-deployment-ecs.git
<span class="hljs-built_in">cd</span> bluegreen-deployment-ecs

<span class="hljs-comment"># Create terraform variables</span>
<span class="hljs-built_in">cd</span> terraform
cat &gt; terraform.tfvars &lt;&lt;EOF
aws_region         = <span class="hljs-string">"us-east-1"</span>
project_name       = <span class="hljs-string">"ecommerce-bluegreen"</span>
environment        = <span class="hljs-string">"production"</span>
vpc_cidr           = <span class="hljs-string">"10.0.0.0/16"</span>

<span class="hljs-comment"># Database credentials (CHANGE THESE!)</span>
db_username = <span class="hljs-string">"dbadmin"</span>
db_password = <span class="hljs-string">"ChangeThisPassword123!"</span>

<span class="hljs-comment"># Container configuration</span>
container_image = <span class="hljs-string">"PLACEHOLDER"</span>  <span class="hljs-comment"># Will update after building image</span>
container_port  = 8080

<span class="hljs-comment"># Scaling configuration</span>
desired_count = 2
cpu           = <span class="hljs-string">"256"</span>
memory        = <span class="hljs-string">"512"</span>

<span class="hljs-comment"># Notifications</span>
notification_email = <span class="hljs-string">"your-email@example.com"</span>
EOF
</code></pre>
<p><strong>Security Note:</strong> Never commit <code>terraform.tfvars</code> to Git. It's already in <code>.gitignore</code>.</p>
<p>Next, initialize Terraform and create the ECR repository:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Initialize Terraform</span>
terraform init
terraform validate

<span class="hljs-comment"># Create ECR repository</span>
terraform apply -target=aws_ecr_repository.app

<span class="hljs-comment"># Get ECR repository URL</span>
<span class="hljs-built_in">export</span> ECR_REPO=$(terraform output -raw ecr_repository_url)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"ECR Repository: <span class="hljs-variable">$ECR_REPO</span>"</span>
</code></pre>
<p>We create the ECR repository first because we need somewhere to push our Docker image. Then we'll build the image, push it, and finally deploy the rest of the infrastructure that depends on that image existing.</p>
<p>Build and push the initial application like this:</p>
<pre><code class="lang-bash">
<span class="hljs-built_in">cd</span> ..  <span class="hljs-comment"># Back to project root</span>

<span class="hljs-comment"># Set variables</span>
<span class="hljs-built_in">export</span> AWS_REGION=us-east-1
<span class="hljs-built_in">export</span> AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
<span class="hljs-built_in">export</span> ECR_REPOSITORY=ecommerce-bluegreen
<span class="hljs-built_in">export</span> IMAGE_TAG=v1.0.0

<span class="hljs-comment"># Login to ECR</span>
aws ecr get-login-password --region <span class="hljs-variable">$AWS_REGION</span> | \
    docker login --username AWS --password-stdin <span class="hljs-variable">$AWS_ACCOUNT_ID</span>.dkr.ecr.<span class="hljs-variable">$AWS_REGION</span>.amazonaws.com

<span class="hljs-comment"># Build the image</span>
docker build --platform linux/amd64 -t <span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span> -f docker/Dockerfile .

<span class="hljs-comment"># Tag and push to ECR</span>
docker tag <span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span> \
    <span class="hljs-variable">$AWS_ACCOUNT_ID</span>.dkr.ecr.<span class="hljs-variable">$AWS_REGION</span>.amazonaws.com/<span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span>

docker push <span class="hljs-variable">$AWS_ACCOUNT_ID</span>.dkr.ecr.<span class="hljs-variable">$AWS_REGION</span>.amazonaws.com/<span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span>

<span class="hljs-comment"># Replace the placeholder container_image in terraform.tfvars (appending a second container_image line would be a duplicate definition)</span>
sed -i.bak <span class="hljs-string">"s|^container_image.*|container_image = \"<span class="hljs-variable">$AWS_ACCOUNT_ID</span>.dkr.ecr.<span class="hljs-variable">$AWS_REGION</span>.amazonaws.com/<span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span>\"|"</span> terraform/terraform.tfvars
rm -f terraform/terraform.tfvars.bak  <span class="hljs-comment"># The backup copy still contains the database password</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768137809806/820d7005-b924-4224-9b58-de5701466c1f.png" alt="Figure 4: ECR Private repository for Docker image" class="image--center mx-auto" width="2442" height="632" loading="lazy"></p>
<p>The <a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs/tree/main/app">application code</a> is a Flask application that handles both old and new schema formats based on the <code>APP_VERSION</code> environment variable.</p>
<p>Now deploy the complete infrastructure:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> terraform
terraform apply  <span class="hljs-comment"># Takes ~15-20 minutes</span>

<span class="hljs-comment"># Get outputs</span>
<span class="hljs-built_in">export</span> ALB_URL=$(terraform output -raw alb_url)
<span class="hljs-built_in">export</span> TEST_URL=$(terraform output -raw test_url)
<span class="hljs-built_in">export</span> DB_ENDPOINT=$(terraform output -raw db_endpoint)
<span class="hljs-built_in">export</span> ECR_URL=$(terraform output -raw ecr_repository_url)
<span class="hljs-built_in">export</span> BASTION_IP=$(terraform output -raw bastion_public_ip)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Application URL: <span class="hljs-variable">$ALB_URL</span>"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Test URL: <span class="hljs-variable">$TEST_URL</span>"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Database Endpoint: <span class="hljs-variable">$DB_ENDPOINT</span>"</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768141033921/07c2e9b9-c652-4cec-91ae-2de956d8655d.png" alt="Application Load Balancer with two target groups (blue and green)" class="image--center mx-auto" width="2504" height="844" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768142296716/9963c779-e0a8-4418-8d69-9bc8fcbbc553.png" alt="Figure 5: Application Load Balancer with two target groups (blue and green)" class="image--center mx-auto" width="2553" height="458" loading="lazy"></p>
<p>The production listener (port 80) is what your users hit. The test listener (port 8080) lets you test the green environment before shifting production traffic to it. This is crucial for validation.</p>
<p>You can see the complete Terraform configuration in <a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs/tree/main/terraform"><code>terraform</code></a>.</p>
<h3 id="heading-step-2-initialize-database-schema">Step 2: Initialize Database Schema</h3>
<p>Now you’ll need to initialize the database with the schema for version 1 (blue). We'll use the bastion host for secure access:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Copy the migration files to the bastion host from your local machine</span>

scp -i ~/.ssh/id_rsa docker/init.sql ec2-user@<span class="hljs-variable">$BASTION_IP</span>:/tmp/
scp -i ~/.ssh/id_rsa migrations/*.sql ec2-user@<span class="hljs-variable">$BASTION_IP</span>:/tmp/

<span class="hljs-comment"># Then SSH into the bastion with the same key and run the migrations</span>
ssh -i ~/.ssh/id_rsa ec2-user@<span class="hljs-variable">$BASTION_IP</span>

<span class="hljs-comment"># Inside the bastion:</span>
<span class="hljs-built_in">export</span> DB_ENDPOINT=<span class="hljs-string">""</span>  <span class="hljs-comment"># set to the db_endpoint value from terraform output</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -f /tmp/init.sql

<span class="hljs-comment"># Verify</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -c <span class="hljs-string">"\d customers"</span>

<span class="hljs-comment"># Exit the bastion</span>
<span class="hljs-built_in">exit</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089062401/8f23655e-b50b-4b24-af98-b195e29da9c7.png" alt="Figure 6: Database schema - the customers table with the original columns" class="image--center mx-auto" width="1298" height="402" loading="lazy"></p>
<h3 id="heading-step-3-verify-blue-environment">Step 3: Verify Blue Environment</h3>
<p>We’ll want to test that everything works before we start the migration. This is your baseline: you want to confirm that the current system is healthy before introducing changes.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check health</span>
curl <span class="hljs-variable">$ALB_URL</span>/health | jq

<span class="hljs-comment"># Expected response:</span>
<span class="hljs-comment"># {</span>
<span class="hljs-comment">#   "status": "healthy",</span>
<span class="hljs-comment">#   "version": "blue",</span>
<span class="hljs-comment">#   "environment": "production",</span>
<span class="hljs-comment">#   "database": "connected",</span>
<span class="hljs-comment">#   "schema": "compatible"</span>
<span class="hljs-comment"># }</span>

<span class="hljs-comment"># Create a customer with the old schema (single address field)</span>
curl -X POST <span class="hljs-variable">$ALB_URL</span>/api/customers \
    -H <span class="hljs-string">"Content-Type: application/json"</span> \
    -d <span class="hljs-string">'{
      "name": "John Doe",
      "email": "john@example.com",
      "address": "123 Main St, New York, NY, 10001"
    }'</span> | jq

<span class="hljs-comment"># List customers</span>
curl <span class="hljs-variable">$ALB_URL</span>/api/customers | jq
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768138569485/b7455a6e-b101-4cdb-83b8-40e0dbafb0b0.png" alt="Figure 7: Blue Environment Verification" class="image--center mx-auto" width="1068" height="434" loading="lazy"></p>
<h3 id="heading-step-4-expand-phase-add-new-columns">Step 4: Expand Phase – Add New Columns</h3>
<p>This is the first phase of expand-contract. We're adding the new columns WITHOUT removing the old one, creating a database schema that supports both blue and green simultaneously.</p>
<p>Run the expand migration (<a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs/blob/main/migrations/001_expand_address.sql"><code>migrations/001_expand_address.sql</code></a>):</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Migration: 001_expand_address.sql</span>
<span class="hljs-keyword">BEGIN</span>;

<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> customers 
  <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> street_address <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> city <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">100</span>),
  <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> state <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">2</span>),
  <span class="hljs-keyword">ADD</span> <span class="hljs-keyword">COLUMN</span> zip_code <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">10</span>);

<span class="hljs-comment">-- Populate new columns from existing data</span>
<span class="hljs-comment">-- This uses a simple parsing strategy; yours might be more sophisticated</span>

<span class="hljs-keyword">UPDATE</span> customers 
<span class="hljs-keyword">SET</span> 
  street_address = SPLIT_PART(address, <span class="hljs-string">','</span>, <span class="hljs-number">1</span>),
  city = <span class="hljs-keyword">TRIM</span>(SPLIT_PART(address, <span class="hljs-string">','</span>, <span class="hljs-number">2</span>)),
  state = <span class="hljs-keyword">TRIM</span>(SPLIT_PART(address, <span class="hljs-string">','</span>, <span class="hljs-number">3</span>)),
  zip_code = <span class="hljs-keyword">TRIM</span>(SPLIT_PART(address, <span class="hljs-string">','</span>, <span class="hljs-number">4</span>))
<span class="hljs-keyword">WHERE</span> address <span class="hljs-keyword">IS</span> <span class="hljs-keyword">NOT</span> <span class="hljs-literal">NULL</span>;

<span class="hljs-keyword">COMMIT</span>;
</code></pre>
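<p>Before running the <code>UPDATE</code> above against real data, it can help to preview how the comma-based parsing behaves on a sample of addresses. The sketch below mirrors <code>SPLIT_PART</code> and <code>TRIM</code> in Python. The helper names are hypothetical, and it only models positive field indexes.</p>

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Mimic PostgreSQL SPLIT_PART for positive n: 1-based, '' if the field is missing."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def parse_address(address: str) -> dict:
    """Apply the same split-and-trim rules as the expand migration."""
    return {
        "street_address": split_part(address, ",", 1),
        "city": split_part(address, ",", 2).strip(" "),
        "state": split_part(address, ",", 3).strip(" "),
        "zip_code": split_part(address, ",", 4).strip(" "),
    }


def fits_schema(parsed: dict) -> bool:
    """Check the parsed values against the VARCHAR limits of the new columns."""
    limits = {"street_address": 255, "city": 100, "state": 2, "zip_code": 10}
    return all(len(parsed[col]) <= limit for col, limit in limits.items())


if __name__ == "__main__":
    for sample in ["123 Main St, New York, NY, 10001",
                   "456 Oak Ave, Apt 2, Austin, TX, 78701"]:
        parsed = parse_address(sample)
        print(sample, "->", parsed, "fits:", fits_schema(parsed))
```

<p>An address with an extra comma, such as the second sample, puts the city name into <code>state</code>, which is longer than <code>VARCHAR(2)</code> allows. PostgreSQL would reject that row, and because the migration runs in a single transaction, the whole migration would roll back. Checking a sample first tells you whether the simple parsing strategy is good enough for your data.</p>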
<p><strong>Critical observation:</strong> We're NOT dropping the <code>address</code> column. It's still there. Blue continues reading and writing to it, completely unaware that new columns exist. This is what makes the migration safe – nothing breaks.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># SSH into the bastion and run the expand migration</span>
ssh -i ~/.ssh/id_rsa ec2-user@<span class="hljs-variable">$BASTION_IP</span>

<span class="hljs-comment"># Inside the bastion:</span>
<span class="hljs-built_in">export</span> DB_ENDPOINT=<span class="hljs-string">""</span>  <span class="hljs-comment"># set to the db_endpoint value from terraform output (no spaces around =)</span>

psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -f /tmp/001_expand_address.sql

<span class="hljs-comment"># Verify new columns exist</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -c <span class="hljs-string">"\d customers"</span>

<span class="hljs-built_in">exit</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768089194050/e053dee3-382b-4ccd-a0e0-8c17003e9832.png" alt="Figure 8: Database schema evolution - the customers table during expand phase with both old and new columns" class="image--center mx-auto" width="1638" height="694" loading="lazy"></p>
<p><strong>Verification:</strong> The <code>\d customers</code> command shows the table structure. You should see BOTH the old <code>address</code> column AND the new <code>street_address</code>, <code>city</code>, <code>state</code>, <code>zip_code</code> columns. This confirms the expand phase worked.</p>
<p>The database now supports both old (blue) and new (green) schemas. Blue is still running and working perfectly, and nothing has changed from its perspective.</p>
<h3 id="heading-step-5-build-and-deploy-green-environment">Step 5: Build and Deploy Green Environment</h3>
<p>Now we’ll build version 2 of our application that knows how to work with the new structured address fields, while maintaining backwards compatibility with the old schema.</p>
<p>Start by building version 2 with structured address support:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> ..  <span class="hljs-comment"># Back to project root</span>

<span class="hljs-comment"># Build new version</span>
<span class="hljs-built_in">export</span> IMAGE_TAG=v2.0.0

docker build --platform linux/amd64 -t <span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span> -f docker/Dockerfile .

docker tag <span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span> \
    <span class="hljs-variable">$AWS_ACCOUNT_ID</span>.dkr.ecr.<span class="hljs-variable">$AWS_REGION</span>.amazonaws.com/<span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span>

docker push <span class="hljs-variable">$AWS_ACCOUNT_ID</span>.dkr.ecr.<span class="hljs-variable">$AWS_REGION</span>.amazonaws.com/<span class="hljs-variable">$ECR_REPOSITORY</span>:<span class="hljs-variable">$IMAGE_TAG</span>
</code></pre>
<p>What’s different is that the v2 <a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs/blob/main/app/models.py">application code</a> now has logic that:</p>
<ul>
<li><p><strong>Reads</strong> from the new structured columns (<code>street_address</code>, <code>city</code>, and so on)</p>
</li>
<li><p><strong>Writes</strong> to BOTH new columns AND the old <code>address</code> column</p>
</li>
<li><p>Accepts API requests with structured address format</p>
</li>
</ul>
<p><strong>Why write to both:</strong> This is crucial. Even though green prefers the new format, it maintains the old format, too. If you need to roll back to blue, all the data blue needs is there and up-to-date. Without this, rollback would be impossible: blue would see empty or stale <code>address</code> fields.</p>
<p>Now create and register the green task definition:</p>
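<p>The dual-write idea can be sketched as a small helper that derives the legacy value from the structured fields. This is not the repository's actual code: the function name <code>build_customer_row</code> and the payload shape are assumptions for illustration.</p>

```python
def build_customer_row(payload: dict) -> dict:
    """Build the column values green would write during the transition."""
    structured = {
        "street_address": payload["street_address"],
        "city": payload["city"],
        "state": payload["state"],
        "zip_code": payload["zip_code"],
    }
    # Keep the legacy column in sync so blue still sees a complete address.
    legacy_address = ", ".join(structured.values())
    return {
        "name": payload["name"],
        "email": payload["email"],
        "address": legacy_address,
        **structured,
    }


if __name__ == "__main__":
    row = build_customer_row({
        "name": "Jane Doe",
        "email": "jane@example.com",
        "street_address": "456 Oak Ave",
        "city": "Boston",
        "state": "MA",
        "zip_code": "02101",
    })
    print(row["address"])
```

<p>Deriving <code>address</code> from the structured fields in one place means the two representations cannot drift apart while both versions are live.</p>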
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> terraform

<span class="hljs-comment"># Get necessary ARNs</span>
EXECUTION_ROLE_ARN=$(terraform output -raw ecs_task_execution_role_arn)
TASK_ROLE_ARN=$(terraform output -raw ecs_task_role_arn)
DB_SECRET_ARN=$(terraform output -raw db_secret_arn)

<span class="hljs-comment"># Create task definition</span>
cat &gt; task-def-green.json &lt;&lt;EOF
{
  <span class="hljs-string">"family"</span>: <span class="hljs-string">"ecommerce-bluegreen"</span>,
  <span class="hljs-string">"networkMode"</span>: <span class="hljs-string">"awsvpc"</span>,
  <span class="hljs-string">"requiresCompatibilities"</span>: [<span class="hljs-string">"FARGATE"</span>],
  <span class="hljs-string">"cpu"</span>: <span class="hljs-string">"256"</span>,
  <span class="hljs-string">"memory"</span>: <span class="hljs-string">"512"</span>,
  <span class="hljs-string">"executionRoleArn"</span>: <span class="hljs-string">"<span class="hljs-variable">${EXECUTION_ROLE_ARN}</span>"</span>,
  <span class="hljs-string">"taskRoleArn"</span>: <span class="hljs-string">"<span class="hljs-variable">${TASK_ROLE_ARN}</span>"</span>,
  <span class="hljs-string">"containerDefinitions"</span>: [{
    <span class="hljs-string">"name"</span>: <span class="hljs-string">"app"</span>,
    <span class="hljs-string">"image"</span>: <span class="hljs-string">"<span class="hljs-variable">${AWS_ACCOUNT_ID}</span>.dkr.ecr.<span class="hljs-variable">${AWS_REGION}</span>.amazonaws.com/<span class="hljs-variable">${ECR_REPOSITORY}</span>:<span class="hljs-variable">${IMAGE_TAG}</span>"</span>,
    <span class="hljs-string">"essential"</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-string">"portMappings"</span>: [{
      <span class="hljs-string">"containerPort"</span>: 8080,
      <span class="hljs-string">"protocol"</span>: <span class="hljs-string">"tcp"</span>
    }],
    <span class="hljs-string">"environment"</span>: [
      {<span class="hljs-string">"name"</span>: <span class="hljs-string">"APP_VERSION"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-string">"green"</span>},
      {<span class="hljs-string">"name"</span>: <span class="hljs-string">"ENVIRONMENT"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-string">"production"</span>},
      {<span class="hljs-string">"name"</span>: <span class="hljs-string">"AWS_REGION"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-string">"<span class="hljs-variable">${AWS_REGION}</span>"</span>},
      {<span class="hljs-string">"name"</span>: <span class="hljs-string">"DB_HOST"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-string">"<span class="hljs-variable">${DB_ENDPOINT}</span>"</span>},
      {<span class="hljs-string">"name"</span>: <span class="hljs-string">"DB_PORT"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-string">"5432"</span>},
      {<span class="hljs-string">"name"</span>: <span class="hljs-string">"DB_NAME"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-string">"ecommerce"</span>}
    ],
    <span class="hljs-string">"secrets"</span>: [
      {
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"DB_USER"</span>,
        <span class="hljs-string">"valueFrom"</span>: <span class="hljs-string">"<span class="hljs-variable">${DB_SECRET_ARN}</span>:username::"</span>
      },
      {
        <span class="hljs-string">"name"</span>: <span class="hljs-string">"DB_PASSWORD"</span>,
        <span class="hljs-string">"valueFrom"</span>: <span class="hljs-string">"<span class="hljs-variable">${DB_SECRET_ARN}</span>:password::"</span>
      }
    ],
    <span class="hljs-string">"logConfiguration"</span>: {
      <span class="hljs-string">"logDriver"</span>: <span class="hljs-string">"awslogs"</span>,
      <span class="hljs-string">"options"</span>: {
        <span class="hljs-string">"awslogs-group"</span>: <span class="hljs-string">"/ecs/ecommerce-bluegreen"</span>,
        <span class="hljs-string">"awslogs-region"</span>: <span class="hljs-string">"<span class="hljs-variable">${AWS_REGION}</span>"</span>,
        <span class="hljs-string">"awslogs-stream-prefix"</span>: <span class="hljs-string">"ecs"</span>
      }
    },
    <span class="hljs-string">"healthCheck"</span>: {
      <span class="hljs-string">"command"</span>: [<span class="hljs-string">"CMD-SHELL"</span>, <span class="hljs-string">"curl -f http://localhost:8080/health || exit 1"</span>],
      <span class="hljs-string">"interval"</span>: 30,
      <span class="hljs-string">"timeout"</span>: 5,
      <span class="hljs-string">"retries"</span>: 3,
      <span class="hljs-string">"startPeriod"</span>: 60
    }
  }]
}
EOF

<span class="hljs-comment"># Register the task definition</span>
aws ecs register-task-definition --cli-input-json file://task-def-green.json
</code></pre>
<p>This JSON tells ECS everything about how to run your container:</p>
<ul>
<li><p>Which Docker image to use (the v2.0.0 we just built)</p>
</li>
<li><p>How much CPU/memory to allocate (256 CPU units = 0.25 vCPU)</p>
</li>
<li><p>Environment variables (notice <code>APP_VERSION</code> is set to "green")</p>
</li>
<li><p>Secrets (database credentials pulled from AWS Secrets Manager)</p>
</li>
<li><p>Health check configuration (curl the /health endpoint every 30 seconds)</p>
</li>
<li><p>Logging configuration (send logs to CloudWatch)</p>
</li>
</ul>
<p><strong>Key detail:</strong> The <code>APP_VERSION</code> environment variable is how the application knows whether to behave as blue or green. Same codebase, different behavior based on configuration.</p>
<h3 id="heading-step-6-execute-blue-green-deployment">Step 6: Execute Blue-Green Deployment</h3>
<p>Alright, now it’s time to create the AppSpec file and trigger the deployment:</p>
<pre><code class="lang-bash">TASK_DEF_ARN=$(aws ecs describe-task-definition \
  --task-definition ecommerce-bluegreen \
  --query <span class="hljs-string">'taskDefinition.taskDefinitionArn'</span> \
  --output text)

cat &gt; appspec.json &lt;&lt;EOF
{
  <span class="hljs-string">"version"</span>: 0.0,
  <span class="hljs-string">"Resources"</span>: [{
    <span class="hljs-string">"TargetService"</span>: {
      <span class="hljs-string">"Type"</span>: <span class="hljs-string">"AWS::ECS::Service"</span>,
      <span class="hljs-string">"Properties"</span>: {
        <span class="hljs-string">"TaskDefinition"</span>: <span class="hljs-string">"<span class="hljs-variable">${TASK_DEF_ARN}</span>"</span>,
        <span class="hljs-string">"LoadBalancerInfo"</span>: {
          <span class="hljs-string">"ContainerName"</span>: <span class="hljs-string">"app"</span>,
          <span class="hljs-string">"ContainerPort"</span>: 8080
        }
      }
    }
  }]
}
EOF

<span class="hljs-comment"># Deploy: build the revision JSON with jq so the AppSpec is embedded as a correctly escaped string</span>
REVISION=$(jq -cn --arg content <span class="hljs-string">"<span class="hljs-subst">$(jq -c . appspec.json)</span>"</span> \
  <span class="hljs-string">'{revisionType: "AppSpecContent", appSpecContent: {content: $content}}'</span>)

<span class="hljs-comment"># Capture the deployment ID directly from the create-deployment response</span>
DEPLOYMENT_ID=$(aws deploy create-deployment \
  --application-name ecommerce-bluegreen \
  --deployment-group-name ecommerce-bluegreen-deployment-group \
  --deployment-config-name CodeDeployDefault.ECSLinear10PercentEvery3Minutes \
  --description <span class="hljs-string">"Blue-green deployment to structured address schema"</span> \
  --revision <span class="hljs-string">"<span class="hljs-variable">$REVISION</span>"</span> \
  --query <span class="hljs-string">'deploymentId'</span> --output text)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Deployment ID: <span class="hljs-variable">$DEPLOYMENT_ID</span>"</span>
</code></pre>
<p>Monitor the deployment:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Watch status</span>
watch -n 10 <span class="hljs-string">"aws deploy get-deployment --deployment-id <span class="hljs-variable">$DEPLOYMENT_ID</span> \
    --query 'deploymentInfo.status' --output text"</span>

<span class="hljs-comment"># Monitor traffic distribution</span>
<span class="hljs-keyword">while</span> <span class="hljs-literal">true</span>; <span class="hljs-keyword">do</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Production: <span class="hljs-subst">$(curl -s $ALB_URL/health | jq -r '.version')</span>"</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Test: <span class="hljs-subst">$(curl -s $TEST_URL/health | jq -r '.version')</span>"</span>
    sleep 30
<span class="hljs-keyword">done</span>
</code></pre>
<p>With <code>ECSLinear10PercentEvery3Minutes</code>, CodeDeploy shifts 10% of traffic every 3 minutes, so the full shift completes in roughly 30 minutes.</p>
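<p>To make that schedule concrete, here is a minimal sketch that prints the expected share of traffic on green at each step. It assumes the first 10% is shifted as soon as traffic shifting begins and ignores the time CodeDeploy spends provisioning green tasks, so under that assumption the last step lands at about 27 minutes. The <code>linear_shift_schedule</code> helper is hypothetical and not part of the repository.</p>
<pre><code class="lang-python">def linear_shift_schedule(step_percent=10, interval_minutes=3):
    """Return (minute, green_percent) pairs for a linear traffic shift.

    Assumption: the first increment is applied at minute 0 of the shift.
    """
    steps = 100 // step_percent
    return [(i * interval_minutes, (i + 1) * step_percent) for i in range(steps)]


# Print the schedule for the configuration used in this tutorial
for minute, percent in linear_shift_schedule():
    print(f"t+{minute:02d} min: {percent}% of traffic on green")
</code></pre>
<p>You can compare this list against the traffic percentages shown in the CodeDeploy console while the deployment runs.</p>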
<h3 id="heading-step-7-validate-green-environment">Step 7: Validate Green Environment</h3>
<p>After the deployment begins, you need to validate that the green environment is functioning correctly with the new structured address format before allowing production traffic to reach it.</p>
<p>The CodeDeploy dashboard below shows the traffic migration and deployment status:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768093087711/fc1b869c-7fae-421e-8d98-45769300cb0a.png" alt="Monitoring in CodeDeploy" class="image--center mx-auto" width="2282" height="1460" loading="lazy"></p>
<p>We can also test through the test listener (port 8080), which provides isolated access to green tasks:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Test new structured address API</span>
curl -X POST <span class="hljs-variable">$TEST_URL</span>/api/customers \
    -H <span class="hljs-string">"Content-Type: application/json"</span> \
    -d <span class="hljs-string">'{
      "name": "Jane Smith",
      "email": "jane@example.com",
      "address": {
        "street": "456 Oak Ave",
        "city": "Los Angeles",
        "state": "CA",
        "zip": "90001"
      }
    }'</span> | jq

curl <span class="hljs-variable">$ALB_URL</span>/api/customers | jq
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768140730325/57c6a047-994f-4b5e-8e19-4d6fb25ad44e.png" alt="Validate Green environment response" class="image--center mx-auto" width="1422" height="672" loading="lazy"></p>
<p>What you're validating:</p>
<ul>
<li><p>The green environment accepts the new structured address format</p>
</li>
<li><p>Data is correctly written to both new columns (street_address, city, state, zip_code) and the old address column for backwards compatibility</p>
</li>
<li><p>The API response matches expectations for the new schema</p>
</li>
<li><p>Existing data from blue environment is still accessible and readable</p>
</li>
</ul>
<p>If any of these tests fail, you can stop the deployment before production traffic reaches green, preventing customer impact.</p>
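<p>If you'd rather not eyeball the JSON, a small check like the one below can flag responses that are missing parts of the structured address. This is only a sketch: it assumes the API returns the customer with an <code>address</code> object that uses the same keys as the request payload above, and <code>missing_address_fields</code> is a hypothetical helper.</p>
<pre><code class="lang-python">REQUIRED_ADDRESS_FIELDS = ("street", "city", "state", "zip")


def missing_address_fields(customer):
    """Return the structured-address fields absent from a customer dict."""
    address = customer.get("address")
    if not isinstance(address, dict):
        # A plain string (or nothing) means the old single-column format
        return list(REQUIRED_ADDRESS_FIELDS)
    return [field for field in REQUIRED_ADDRESS_FIELDS if not address.get(field)]


# Example payload taken from the curl request above
sample = {
    "name": "Jane Smith",
    "email": "jane@example.com",
    "address": {"street": "456 Oak Ave", "city": "Los Angeles", "state": "CA", "zip": "90001"},
}
print(missing_address_fields(sample))
print(missing_address_fields({"name": "Legacy", "address": "123 Main St"}))
</code></pre>
<p>An empty list means the response carries the full structured address. Anything else is a reason to stop the deployment and investigate.</p>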
<h3 id="heading-step-8-post-deployment-validation">Step 8: Post-Deployment Validation</h3>
<p>Once CodeDeploy completes the traffic shift, all production requests route to green. This is your opportunity to verify that the deployment was successful and that the new version is handling real production traffic correctly.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Verify all production traffic goes to green</span>
<span class="hljs-comment"># Running this multiple times confirms consistent routing</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> {1..10}; <span class="hljs-keyword">do</span>
    curl -s <span class="hljs-variable">$ALB_URL</span>/health | jq -r <span class="hljs-string">'.version'</span>
<span class="hljs-keyword">done</span>
<span class="hljs-comment"># Expected output: "green" for all 10 requests</span>

<span class="hljs-comment"># Test complete CRUD operations with the new API</span>
<span class="hljs-comment"># Create a customer with structured address</span>
CUSTOMER_ID=$(curl -s -X POST <span class="hljs-variable">$ALB_URL</span>/api/customers \
    -H <span class="hljs-string">"Content-Type: application/json"</span> \
    -d <span class="hljs-string">'{"name": "Test User", "email": "test@example.com",
         "address": {"street": "789 Test St", "city": "Test City", 
         "state": "TX", "zip": "75001"}}'</span> | jq -r <span class="hljs-string">'.id'</span>)

<span class="hljs-comment"># Read the customer back to verify data persistence</span>
curl <span class="hljs-variable">$ALB_URL</span>/api/customers/<span class="hljs-variable">$CUSTOMER_ID</span> | jq

<span class="hljs-comment"># Update the customer to test modification</span>
curl -X PUT <span class="hljs-variable">$ALB_URL</span>/api/customers/<span class="hljs-variable">$CUSTOMER_ID</span> \
    -H <span class="hljs-string">"Content-Type: application/json"</span> \
    -d <span class="hljs-string">'{"address": {"street": "999 Updated Ave", "city": "Test City", 
         "state": "TX", "zip": "75001"}}'</span> | jq

<span class="hljs-comment"># Delete the test customer for cleanup</span>
curl -X DELETE <span class="hljs-variable">$ALB_URL</span>/api/customers/<span class="hljs-variable">$CUSTOMER_ID</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768140850962/a31273e9-cbc1-4d09-9f6d-7248b402f712.png" alt="Verify all production traffic goes to green" class="image--center mx-auto" width="846" height="270" loading="lazy"></p>
<p>What you're validating:</p>
<ul>
<li><p>Traffic routing is 100% to green with no requests reaching blue</p>
</li>
<li><p>Create operations work with the new structured address format</p>
</li>
<li><p>Read operations return correct data with proper address structure</p>
</li>
<li><p>Update operations successfully modify existing records</p>
</li>
<li><p>Delete operations work without errors</p>
</li>
<li><p>The application correctly writes to both new columns and old address column (enabling potential rollback)</p>
</li>
</ul>
<p>Check your CloudWatch logs and metrics during this validation period for any unexpected errors, increased latency, or database connection issues.</p>
<h3 id="heading-step-9-contract-phase-after-24-72-hours">Step 9: Contract Phase (After 24-72 Hours)</h3>
<p>This is the final phase of expand-contract. We're removing the old <code>address</code> column now that we're confident green is stable. This is the point of no return.</p>
<p><strong>CRITICAL</strong>: Only proceed after green has been stable for your confidence period!</p>
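<pre><code class="lang-bash"><span class="hljs-comment"># Backup database first; generate the identifier once so both commands use the same name</span>
SNAPSHOT_ID=pre-contract-$(date +%Y%m%d-%H%M%S)
aws rds create-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db \
    --db-snapshot-identifier <span class="hljs-string">"<span class="hljs-variable">$SNAPSHOT_ID</span>"</span>

<span class="hljs-comment"># Wait for snapshot</span>
aws rds <span class="hljs-built_in">wait</span> db-snapshot-completed \
    --db-snapshot-identifier <span class="hljs-string">"<span class="hljs-variable">$SNAPSHOT_ID</span>"</span>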

<span class="hljs-comment"># Run contract migration</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -f /tmp/002_contract_address.sql

<span class="hljs-comment"># Verify old column is gone</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -c <span class="hljs-string">"\d customers"</span>
</code></pre>
<p>The contract migration (<a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs/blob/main/migrations/002_contract_address.sql"><code>migrations/002_contract_address.sql</code></a>) removes the old <code>address</code> column.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768140955991/d6f6f287-09e5-4693-a4e9-77c1d9080466.png" alt="d6f6f287-09e5-4693-a4e9-77c1d9080466" class="image--center mx-auto" width="1506" height="444" loading="lazy"></p>
<p><strong>Why wait 24-72 hours:</strong> You want to be absolutely certain green is stable before making irreversible changes. During this waiting period:</p>
<ul>
<li><p>All your monitoring should show green performing normally</p>
</li>
<li><p>You've seen the system handle multiple daily traffic patterns (morning peak, evening peak, overnight)</p>
</li>
<li><p>Scheduled batch jobs have run successfully (extend the window if you rely on weekly jobs)</p>
</li>
<li><p>You've verified third-party integrations work</p>
</li>
<li><p>No unusual errors or performance degradation</p>
</li>
</ul>
<p>It’s important to snapshot first because once you drop that column, there's no undo button. The snapshot is your safety net. If you discover a critical issue after contracting, you can restore this snapshot and get back to a state where rollback is possible. Without it, you're gambling.</p>
<p><strong>What the contract migration does:</strong></p>
<pre><code class="lang-sql"><span class="hljs-comment">-- migrations/002_contract_address.sql</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> customers <span class="hljs-keyword">DROP</span> <span class="hljs-keyword">COLUMN</span> address;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>It's simple but permanent. The old <code>address</code> column is gone. The blue environment will no longer work with this database, as it expects that column to exist. This is fine because blue has been decommissioned (no traffic, tasks terminated).</p>
<p><strong>What to update:</strong> You should also deploy version 3 of your application, which removes the dual-write logic. Version 2 (green) still writes to both the new columns and the old <code>address</code> column, and in PostgreSQL any statement that names a dropped column fails. If version 2 references <code>address</code> explicitly, roll out version 3 before running the contract migration rather than after it.</p>
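<p>The difference between version 2 and version 3 can be sketched as a single flag on the function that maps a request payload to table columns. The function name, the flag, and the layout of the legacy address string are all assumptions made for illustration, since the real application code may be organized differently.</p>
<pre><code class="lang-python">def customer_columns(name, email, address, dual_write=True):
    """Map an API payload to column values for the customers table.

    dual_write=True mimics version 2 (green); dual_write=False mimics version 3.
    """
    row = {
        "name": name,
        "email": email,
        "street_address": address["street"],
        "city": address["city"],
        "state": address["state"],
        "zip_code": address["zip"],
    }
    if dual_write:
        # Legacy single-string column; the exact layout is an assumption
        row["address"] = "{street}, {city}, {state} {zip}".format(**address)
    return row


payload = {"street": "789 Test St", "city": "Test City", "state": "TX", "zip": "75001"}
print(sorted(customer_columns("Test User", "test@example.com", payload)))
print(sorted(customer_columns("Test User", "test@example.com", payload, dual_write=False)))
</code></pre>
<p>Only the first call produces an <code>address</code> key, which is exactly the write that has to disappear once the column is dropped.</p>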
<p>Your migration is now complete!</p>
<h2 id="heading-rollback-strategies">Rollback Strategies</h2>
<h3 id="heading-during-deployment-safe-window">During Deployment (Safe Window)</h3>
<p>Use this strategy when you detect issues <strong>during the traffic shift</strong>, before all traffic has moved to green. CodeDeploy is still managing the deployment, which means it can automatically revert traffic distribution to the previous state.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Immediate rollback</span>
aws deploy stop-deployment \
    --deployment-id <span class="hljs-variable">$DEPLOYMENT_ID</span> \
    --auto-rollback-enabled
</code></pre>
<p>You should use this strategy when you notice increased error rates, degraded performance, or functional issues during the canary or linear traffic shift. CodeDeploy automatically shifts all traffic back to blue, and green tasks are terminated. This is the safest and fastest rollback option.</p>
<p>This works because the database still contains the old <code>address</code> column (expand phase), so blue can function normally. No data has been lost or made incompatible.</p>
<h3 id="heading-after-deployment-before-contract">After Deployment (Before Contract)</h3>
<p>Use this when the deployment completed successfully, but you discover issues hours or days later during the monitoring period, before you've run the contract migration. Both blue and green environments still exist, and the database supports both schemas.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Manual listener update</span>
aws elbv2 modify-listener \
    --listener-arn $(terraform output -raw alb_listener_arn) \
    --default-actions Type=forward,TargetGroupArn=$(terraform output -raw blue_target_group_arn)
</code></pre>
<p>Or use the provided script:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> scripts
./rollback.sh
</code></pre>
<p>Use this when you discover bugs in green that weren't caught during initial testing, business metrics show unexpected changes (conversion rates drop, customer complaints increase), or third-party integration issues emerge.</p>
<p>This works because the database still has both old and new schema elements. Blue tasks still exist and can serve traffic immediately. Because green was writing to both old and new columns, blue sees all the latest data.</p>
<p>With this, the traffic immediately shifts from green back to blue. Green continues running for observability, but serves no traffic. You can debug green in place without customer impact.</p>
<h3 id="heading-after-contract-phase">After Contract Phase</h3>
<p>Use this as a <strong>last resort</strong> when you've already removed the old address column, and blue can no longer function with the current database schema. This is significantly more complex and time-consuming than the previous two strategies.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Restore from snapshot</span>
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db-restored \
    --db-snapshot-identifier pre-contract-YYYYMMDD-HHMMSS
</code></pre>
<p>Only use this strategy when you discover a critical, production-breaking issue after the contract phase, and you have no other option but to return to the previous version.</p>
<p><strong>Why it's painful</strong>:</p>
<ul>
<li><p>Database restore takes 10-30 minutes depending on size</p>
</li>
<li><p>You lose all data written after the snapshot was taken</p>
</li>
<li><p>Requires updating connection strings to point to the restored instance</p>
</li>
<li><p>Need to re-deploy blue environment</p>
</li>
<li><p>Must communicate downtime to users</p>
</li>
</ul>
<p>This is why you wait 24-72 hours before contracting, and take a snapshot immediately before the contract migration. The lengthy waiting period allows you to catch most issues while the safer rollback strategies are still available.</p>
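<p>The three strategies above boil down to two questions: has the contract migration run, and is CodeDeploy still shifting traffic? A minimal sketch of that decision follows. The function and the returned labels are hypothetical names, not commands.</p>
<pre><code class="lang-python">def choose_rollback(traffic_shift_in_progress, contract_applied):
    """Pick the rollback path described in this section."""
    if contract_applied:
        # Last resort: data written after the snapshot is lost
        return "restore-snapshot"
    if traffic_shift_in_progress:
        # CodeDeploy still manages the deployment and can revert traffic to blue
        return "stop-deployment"
    # Deployment finished, old column still present: point the listener back at blue
    return "switch-listener-to-blue"


print(choose_rollback(traffic_shift_in_progress=True, contract_applied=False))
print(choose_rollback(traffic_shift_in_progress=False, contract_applied=False))
print(choose_rollback(traffic_shift_in_progress=False, contract_applied=True))
</code></pre>
<p>Writing the decision down like this before the deployment makes it easier for everyone on call to agree on which path applies.</p>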
<h2 id="heading-monitoring-during-deployments">Monitoring During Deployments</h2>
<h3 id="heading-essential-metrics">Essential Metrics</h3>
<p>During a blue-green deployment, you need to monitor both environments simultaneously to detect issues early and make informed decisions about proceeding or rolling back. For each target group (blue and green), track these CloudWatch metrics:</p>
<h4 id="heading-1-targetresponsetime">1. TargetResponseTime</h4>
<p>Measures latency from when the load balancer sends a request to when it receives a response. You're looking for sudden spikes or gradual degradation. Green should have similar response times to blue (within 10-20%). If green's latency is significantly higher, you may have performance regressions, inefficient queries with the new schema, or resource constraints.</p>
<h4 id="heading-2-requestcount">2. RequestCount</h4>
<p>Shows traffic volume hitting each target group. During the deployment, you should see blue's count decreasing while green's increases proportionally. If the numbers don't add up (total requests drop significantly), users might be experiencing errors and not retrying. If green receives traffic but shows zero requests, health checks might be failing.</p>
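<p>One way to check that the numbers add up is to compare green's observed share of requests with the percentage CodeDeploy should have shifted so far. The sketch below assumes you have already pulled the two <code>RequestCount</code> sums for the same period. The helper names and the 5-point tolerance are illustrative choices.</p>
<pre><code class="lang-python">import math


def green_share(blue_requests, green_requests):
    """Fraction of requests served by green in one metric period."""
    total = blue_requests + green_requests
    return green_requests / total if total else 0.0


def share_matches(blue_requests, green_requests, expected_percent, tolerance_points=5):
    """True when green's observed share is within tolerance of the expected step."""
    observed = 100 * green_share(blue_requests, green_requests)
    return math.isclose(observed, expected_percent, abs_tol=tolerance_points)


# 900 requests on blue and 100 on green during the 10% step
print(share_matches(900, 100, expected_percent=10))
# The same counts would be suspicious during the 50% step
print(share_matches(900, 100, expected_percent=50))
</code></pre>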
<h4 id="heading-3-httpcodetarget5xxcount">3. HTTPCode_Target_5XX_Count</h4>
<p>Server errors indicate application problems. Even a single 5XX error during deployment warrants investigation. Green should have zero 5XX errors during the initial traffic shift. Any errors could indicate incompatibility issues with the new schema, missing environment variables, or database connection problems.</p>
<h4 id="heading-4-databaseconnections-from-rds-metrics">4. DatabaseConnections (from RDS metrics)</h4>
<p>Shows active database connections from both environments. Watch for connection pool exhaustion, which manifests as a sudden spike or plateau at your max connections limit. If green uses more connections than blue did, you might have connection leaks or inefficient connection handling in the new code.</p>
<h4 id="heading-5-cpuutilization">5. CPUUtilization</h4>
<p>Monitor both ECS task CPU and RDS CPU. Green tasks should use similar CPU to blue tasks for the same request volume. Higher CPU might indicate less efficient code or more complex queries. RDS CPU spikes during deployment often indicate poorly optimized new queries or missing indexes for the new schema.</p>
<p><strong>What to expect</strong>:</p>
<ul>
<li><p>First 3 minutes of the traffic shift: Green receives 10% of traffic, and its metrics should closely match blue's baseline</p>
</li>
<li><p>Around 12-15 minutes: Green at roughly half of traffic, both environments should show stable metrics</p>
</li>
<li><p>Around 27-30 minutes: Green at 100% of traffic, metrics should stabilize at historical levels</p>
</li>
<li><p>Any divergence from these patterns warrants stopping the deployment and investigating</p>
</li>
</ul>
<p><strong>Custom application metrics</strong>: Beyond infrastructure metrics, monitor business-critical metrics like checkout completion rates, API success rates, and user sign-up flows. Sometimes technical metrics look fine but user-facing functionality is broken.</p>
<h2 id="heading-best-practices">Best Practices</h2>
<h3 id="heading-test-migrations-in-staging">Test Migrations in Staging</h3>
<p>Always run your database migrations against a staging environment that mirrors production scale and complexity before touching production. Copy a recent production snapshot to staging and execute your expand migration there first.</p>
<p><strong>Why this matters</strong>: Migrations that work fine on small datasets can time out or lock tables on production-scale data. You might discover that adding an index to a 50-million-row table takes 2 hours, or that your column population query needs optimization.</p>
<p><strong>What to test</strong>:</p>
<ul>
<li><p>Migration execution time (should complete in seconds/minutes, not hours)</p>
</li>
<li><p>Table locks and their impact (can reads/writes continue during migration?)</p>
</li>
<li><p>Query performance with new schema (are your indexes still effective?)</p>
</li>
<li><p>Rollback procedures (can you undo the migration if needed?)</p>
</li>
</ul>
<h3 id="heading-use-migration-tools">Use Migration Tools</h3>
<p>Don't write raw SQL migrations manually. Use Flyway, Liquibase, Alembic (for Python), or your framework's built-in migration tools (Rails migrations, Django migrations, Entity Framework migrations).</p>
<p><strong>Why this matters</strong>: Migration tools provide version tracking, rollback capabilities, checksums to prevent tampering, and a standardized way to manage schema changes across environments.</p>
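<p>To show what the checksum protection buys you, here is a simplified sketch of the idea: record a hash when a migration is applied, then refuse to continue if the file on disk no longer matches. Real tools such as Flyway or Liquibase use their own algorithms and bookkeeping tables, so treat the SHA-256 choice and both helper names as assumptions.</p>
<pre><code class="lang-python">import hashlib


def checksum(sql_text):
    """SHA-256 of a migration's contents."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()


def changed_migrations(applied, on_disk):
    """Return names of already-applied migrations whose contents changed.

    applied: {name: checksum recorded when the migration ran}
    on_disk: {name: current SQL text}
    """
    return [
        name
        for name, recorded in applied.items()
        if name in on_disk and checksum(on_disk[name]) != recorded
    ]


original = "BEGIN;\nALTER TABLE customers DROP COLUMN address;\nCOMMIT;\n"
applied = {"002_contract_address.sql": checksum(original)}

# Someone edits the file after it has already been applied
tampered = {"002_contract_address.sql": original.replace("address", "email")}
print(changed_migrations(applied, {"002_contract_address.sql": original}))
print(changed_migrations(applied, tampered))
</code></pre>
<p>The second call reports the edited file, which is the situation a migration tool is designed to catch before it reaches production.</p>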
<h3 id="heading-configure-health-checks-properly">Configure Health Checks Properly</h3>
<p>Your health check endpoint should verify that the application can actually function, not just that the process is running. A comprehensive health check validates database connectivity, schema compatibility, and dependent service availability.</p>
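<pre><code class="lang-python"><span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, jsonify

app = Flask(__name__)

<span class="hljs-comment"># check_database(), check_cache_connection(), and db are assumed to be defined elsewhere in the app</span>
<span class="hljs-meta">@app.route('/health')</span>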
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">health_check</span>():</span>
    checks = {
        <span class="hljs-string">'database'</span>: check_database(),
        <span class="hljs-string">'schema'</span>: check_schema_compatibility(),
        <span class="hljs-string">'cache'</span>: check_cache_connection()
    }

    <span class="hljs-keyword">if</span> all(checks.values()):
        <span class="hljs-keyword">return</span> jsonify(checks), <span class="hljs-number">200</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> jsonify(checks), <span class="hljs-number">503</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_schema_compatibility</span>():</span>
    <span class="hljs-string">"""Verify expected schema elements exist"""</span>
    <span class="hljs-keyword">try</span>:
        result = db.query(<span class="hljs-string">"""
            SELECT column_name 
            FROM information_schema.columns 
            WHERE table_name = 'customers'
            AND column_name IN ('street_address', 'city', 'state', 'zip_code')
        """</span>)
        <span class="hljs-keyword">return</span> len(result) == <span class="hljs-number">4</span>
    <span class="hljs-keyword">except</span> Exception:  <span class="hljs-comment"># any query failure counts as an incompatible schema</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>
</code></pre>
<p>For ALB health checks specifically, make sure you configure appropriate thresholds in your target group settings. A healthy threshold of 2 means the target must pass 2 consecutive health checks before receiving traffic. An unhealthy threshold of 3 means it must fail 3 consecutive checks before being removed. Set your interval to 30 seconds and timeout to 5 seconds to balance responsiveness with stability.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Terraform configuration for ALB health checks</span>
resource <span class="hljs-string">"aws_lb_target_group"</span> <span class="hljs-string">"green"</span> {
  health_check {
    enabled             = <span class="hljs-literal">true</span>
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = <span class="hljs-string">"/health"</span>
    matcher             = <span class="hljs-string">"200"</span>
  }
}
</code></pre>
<p>This configuration ensures that ECS tasks aren't marked healthy prematurely (preventing traffic to broken tasks) while also not being overly sensitive to transient issues (preventing unnecessary task replacements).</p>
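<p>It helps to translate those thresholds into wall-clock time. The sketch below uses the simple approximation of threshold multiplied by interval. Actual timing can vary slightly depending on when a task registers relative to the check cycle, and both helper names are hypothetical.</p>
<pre><code class="lang-python">def seconds_to_healthy(healthy_threshold=2, interval=30):
    """Approximate time before a new target starts receiving traffic."""
    return healthy_threshold * interval


def seconds_to_unhealthy(unhealthy_threshold=3, interval=30):
    """Approximate time before a failing target is taken out of rotation."""
    return unhealthy_threshold * interval


print(f"New task receives traffic after about {seconds_to_healthy()} seconds")
print(f"Failing task is removed after about {seconds_to_unhealthy()} seconds")
</code></pre>
<p>With the values from the Terraform snippet, that works out to about one minute to go healthy and about a minute and a half to be removed. Shorten the interval if that detection window is too long for your traffic.</p>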
<h3 id="heading-plan-the-contract-phase">Plan the Contract Phase</h3>
<p>The contract phase is irreversible, so treat it with appropriate caution. Wait at least 24 hours, and ideally closer to 72, after the green deployment before removing old schema elements. This waiting period isn't arbitrary: it ensures you've observed the system under various conditions.</p>
<p><strong>What to verify before contracting</strong>:</p>
<ul>
<li><p>Green has handled multiple daily traffic patterns (morning rush, evening peak, overnight batch jobs)</p>
</li>
<li><p>All scheduled jobs and cron tasks have run successfully with the new schema</p>
</li>
<li><p>Weekly reports or analytics pipelines have completed</p>
</li>
<li><p>Third-party integrations (payment processors, shipping APIs, analytics tools) are working</p>
</li>
<li><p>No unusual error patterns in logs</p>
</li>
<li><p>Business metrics (conversions, sign-ups, purchases) remain stable</p>
</li>
<li><p>Customer support hasn't reported related issues</p>
</li>
</ul>
<p>The pre-contract checklist:</p>
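<pre><code class="lang-bash"><span class="hljs-comment"># 1. Create a final snapshot and wait until it is available</span>
SNAPSHOT_ID=pre-contract-$(date +%Y%m%d-%H%M%S)
aws rds create-db-snapshot \
    --db-instance-identifier ecommerce-bluegreen-db \
    --db-snapshot-identifier <span class="hljs-string">"<span class="hljs-variable">$SNAPSHOT_ID</span>"</span>
aws rds <span class="hljs-built_in">wait</span> db-snapshot-completed \
    --db-snapshot-identifier <span class="hljs-string">"<span class="hljs-variable">$SNAPSHOT_ID</span>"</span>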

<span class="hljs-comment"># 2. Document current state</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Green tasks: <span class="hljs-subst">$(aws ecs describe-services --cluster ecommerce --services ecommerce-green | jq '.services[0].runningCount')</span>"</span>
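<span class="hljs-comment"># ALB metrics are published per load balancer, so a LoadBalancer dimension is required.</span>
<span class="hljs-comment"># ALB_DIMENSION is an assumed variable holding the load balancer's ARN suffix.</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"5XX count (last hour): $(aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \
    --dimensions Name=LoadBalancer,Value=$ALB_DIMENSION \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 3600 --statistics Sum \
    --query 'Datapoints[0].Sum' --output text)"</span>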

<span class="hljs-comment"># 3. Notify team</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Running contract migration at <span class="hljs-subst">$(date)</span>"</span>

<span class="hljs-comment"># 4. Run migration</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -f migrations/002_contract_address.sql

<span class="hljs-comment"># 5. Verify</span>
psql -h <span class="hljs-variable">$DB_ENDPOINT</span> -U dbadmin -d ecommerce -c <span class="hljs-string">"\d customers"</span>
</code></pre>
<h3 id="heading-version-your-apis">Version Your APIs</h3>
<p>When changing data formats, maintain backward compatibility by supporting both old and new API versions simultaneously. This allows API consumers (mobile apps, third-party integrations, other services) to migrate at their own pace without coordinating releases.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Support both API versions during transition</span>
<span class="hljs-meta">@app.route('/api/v1/customers/&lt;id&gt;')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_customer_v1</span>(<span class="hljs-params">id</span>):</span>
    customer = Customer.find(id)
    <span class="hljs-keyword">return</span> jsonify({
        <span class="hljs-string">'id'</span>: customer.id,
        <span class="hljs-string">'name'</span>: customer.name,
        <span class="hljs-string">'address'</span>: customer.address  <span class="hljs-comment"># Old format</span>
    })

<span class="hljs-meta">@app.route('/api/v2/customers/&lt;id&gt;')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_customer_v2</span>(<span class="hljs-params">id</span>):</span>
    customer = Customer.find(id)
    <span class="hljs-keyword">return</span> jsonify({
        <span class="hljs-string">'id'</span>: customer.id,
        <span class="hljs-string">'name'</span>: customer.name,
        <span class="hljs-string">'address'</span>: {  <span class="hljs-comment"># New structured format</span>
            <span class="hljs-string">'street'</span>: customer.street_address,
            <span class="hljs-string">'city'</span>: customer.city,
            <span class="hljs-string">'state'</span>: customer.state,
            <span class="hljs-string">'zip'</span>: customer.zip_code
        }
    })
</code></pre>
<p>To implement this, deploy both endpoints in the same blue-green release, then monitor usage of the v1 endpoint over time. Once v1 traffic drops below 1% (meaning clients have migrated), deprecate it formally. Remove the v1 endpoint in a subsequent release, not during the blue-green deployment itself.</p>
<p>Announce the new API version to consumers with a migration timeline. Give them 2-3 months to update their integrations. Send reminder emails at the halfway point and 2 weeks before v1 shutdown.</p>
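<p>Both the 1% threshold and the reminder schedule are easy to compute ahead of time. The sketch below is one way to do it. The request counts and dates are made-up inputs, and the helper names are hypothetical.</p>
<pre><code class="lang-python">from datetime import date, timedelta


def v1_share(v1_requests, v2_requests):
    """Fraction of customer API traffic still using the v1 endpoint."""
    total = v1_requests + v2_requests
    return v1_requests / total if total else 0.0


def reminder_dates(announced, shutdown):
    """Return the halfway reminder and the two-weeks-before reminder."""
    halfway = announced + (shutdown - announced) / 2
    final_notice = shutdown - timedelta(days=14)
    return halfway, final_notice


# Example inputs: a three-month migration window
halfway, final_notice = reminder_dates(date(2026, 1, 1), date(2026, 4, 1))
print(f"Halfway reminder: {halfway}, final reminder: {final_notice}")
print(f"v1 share of traffic: {v1_share(50, 9950):.2%}")
</code></pre>
<p>Once the reported share stays under 1% for a sustained period, you have the evidence you need to start the formal deprecation.</p>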
<h3 id="heading-monitor-both-environments">Monitor Both Environments</h3>
<p>During the transition period, both blue and green are production environments serving real traffic. Monitor them separately to detect version-specific issues.</p>
<p>Set up separate CloudWatch dashboards for blue and green target groups with the same metrics arranged identically. This makes it easy to spot differences at a glance. If green's response time is 200ms while blue's is 50ms, that's a red flag.</p>
<h4 id="heading-alert-on-metric-divergence">Alert on metric divergence</h4>
<p>Create alarms that trigger when green's metrics deviate significantly from blue's baseline. For example, if green's error rate is more than 2x blue's historical average, trigger an alert. If green's database query time is 50% higher, investigate before shifting more traffic.</p>
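<p>The comparison behind such an alarm can be sketched in a few lines. The metric names and sample values here are placeholders, and the limits mirror the examples in the paragraph above (2x for error rate, 1.5x for query time). In practice you would feed in values read from CloudWatch.</p>
<pre><code class="lang-python">LIMITS = {"error_rate": 2.0, "db_query_ms": 1.5}


def breached_limits(green, blue, limits=LIMITS):
    """Return the metrics where green exceeds blue's baseline by more than the allowed ratio."""
    breaches = []
    for metric, max_ratio in limits.items():
        baseline = blue[metric]
        if baseline == 0:
            # Blue had none of this, so any non-zero value on green is a divergence
            exceeded = green[metric] != 0
        else:
            exceeded = green[metric] / baseline &gt; max_ratio
        if exceeded:
            breaches.append(metric)
    return breaches


blue = {"error_rate": 0.02, "db_query_ms": 50}
green = {"error_rate": 0.05, "db_query_ms": 60}
print(breached_limits(green, blue))
</code></pre>
<p>Here only the error rate is flagged: green's query time is 20% higher than blue's, which stays inside the 50% allowance.</p>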
<h4 id="heading-log-aggregation">Log aggregation</h4>
<p>Ensure logs from both environments are tagged with their version (<code>environment: blue</code> or <code>environment: green</code>) so you can filter and compare them. Use CloudWatch Logs Insights queries to spot patterns.</p>
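<p>One way to get that tag onto every log line is a small JSON formatter that reads the same <code>APP_VERSION</code> variable the application already uses. This is a sketch using only the standard library. The class name, logger name, and field names are assumptions rather than what the repository actually ships.</p>
<pre><code class="lang-python">import json
import logging
import os


class EnvironmentJsonFormatter(logging.Formatter):
    """Render each record as JSON with an environment field."""

    def __init__(self, environment):
        super().__init__()
        self.environment = environment

    def format(self, record):
        return json.dumps({
            "environment": self.environment,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def build_logger(environment):
    """Create a logger whose output is tagged with the given environment."""
    logger = logging.getLogger(f"app.{environment}")
    handler = logging.StreamHandler()
    handler.setFormatter(EnvironmentJsonFormatter(environment))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Falls back to "blue" when APP_VERSION is not set
logger = build_logger(os.environ.get("APP_VERSION", "blue"))
logger.info("customer created")
</code></pre>
<p>Because every line is JSON with an <code>environment</code> key, you can filter on that field to compare blue and green side by side.</p>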
<h2 id="heading-when-not-to-use-blue-green">When NOT to Use Blue-Green</h2>
<p>Blue-green isn't always the right choice. Avoid it when you have:</p>
<ul>
<li><p><strong>Very large database migrations</strong>: If your migration takes hours or requires significant locks, use a traditional maintenance window.</p>
</li>
<li><p><strong>Highly stateful applications</strong>: Real-time collaboration tools or WebSocket applications with complex in-memory state may need rolling deployments instead.</p>
</li>
<li><p><strong>Cost constraints</strong>: Running two environments doubles costs. Consider canary deployments for cost-sensitive applications.</p>
</li>
<li><p><strong>Complex data model redesigns</strong>: Use the strangler fig pattern to gradually migrate functionality to a new service.</p>
</li>
</ul>
<h3 id="heading-alternative-deployment-strategies">Alternative Deployment Strategies</h3>
<h4 id="heading-canary-deployments">Canary Deployments</h4>
<p>Route a small percentage of traffic (5-10%) to the new version first, then shift the rest once the canary interval (in minutes) has passed:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"trafficRouting"</span>: {
    <span class="hljs-attr">"type"</span>: <span class="hljs-string">"TimeBasedCanary"</span>,
    <span class="hljs-attr">"timeBasedCanary"</span>: {
      <span class="hljs-attr">"canaryPercentage"</span>: <span class="hljs-number">10</span>,
      <span class="hljs-attr">"canaryInterval"</span>: <span class="hljs-number">5</span>
    }
  }
}
</code></pre>
<h4 id="heading-rolling-deployments">Rolling Deployments</h4>
<p>Gradually replace old tasks with new ones:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"deploymentConfiguration"</span>: {
    <span class="hljs-attr">"maximumPercent"</span>: <span class="hljs-number">200</span>,
    <span class="hljs-attr">"minimumHealthyPercent"</span>: <span class="hljs-number">100</span>
  }
}
</code></pre>
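<p>To see what those two percentages mean for a concrete service, the sketch below converts them into task counts. It follows my understanding of the ECS rules, where the minimum is rounded up and the maximum is rounded down, so verify against the ECS documentation before relying on edge cases. The <code>rolling_bounds</code> helper is hypothetical.</p>
<pre><code class="lang-python">import math


def rolling_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) task counts during a rolling deployment."""
    min_running = math.ceil(desired_count * minimum_healthy_percent / 100)
    max_running = math.floor(desired_count * maximum_percent / 100)
    return min_running, max_running


# With the configuration above and 4 desired tasks
print(rolling_bounds(4))
# A tighter configuration that trades capacity for lower cost
print(rolling_bounds(3, minimum_healthy_percent=50, maximum_percent=150))
</code></pre>
<p>With 100/200, ECS never drops below the desired count and may briefly run twice as many tasks, which is why this setting avoids capacity dips at the price of temporary extra compute.</p>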
<h2 id="heading-cleanup">Cleanup</h2>
<p>After you've successfully completed your blue-green deployment, validated the green environment, and run the contract phase, you need to clean up the AWS resources to avoid unnecessary costs and resource sprawl.</p>
<p><strong>What you're removing</strong>:</p>
<ul>
<li><p>The entire infrastructure stack (VPC, subnets, NAT gateways, load balancer, ECS cluster, RDS database, and all associated resources)</p>
</li>
<li><p>This is appropriate for a tutorial/testing scenario where you deployed everything from scratch</p>
</li>
</ul>
<p>Important considerations before cleanup:</p>
<ul>
<li><p>Ensure you have backups if you need to reference any data later</p>
</li>
<li><p>Export any logs or metrics you want to retain</p>
</li>
<li><p>Document lessons learned from the deployment</p>
</li>
<li><p>Verify no production traffic is still using these resources</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> terraform

<span class="hljs-comment"># Terraform will prompt you to confirm with "yes"</span>
<span class="hljs-comment"># Review the destruction plan carefully before confirming</span>
terraform destroy  <span class="hljs-comment"># Takes ~10-15 minutes</span>
</code></pre>
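<p><strong>Partial cleanup</strong>: If you want to keep certain resources (like the RDS instance for reference), you can remove them from Terraform state before destroying. Terraform may then fail to delete resources the instance still uses, such as its subnet group and security groups, so expect to clean those up separately:</p>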
<pre><code class="lang-bash"><span class="hljs-comment"># Remove RDS from Terraform management before destroying</span>
terraform state rm aws_db_instance.main
terraform destroy  <span class="hljs-comment"># Now destroys everything except RDS</span>
</code></pre>
<p>For production environments, you would NOT destroy everything. Instead, you'd decommission the blue environment specifically after confirming green is stable:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Production scenario - remove only blue environment</span>
terraform destroy -target=aws_ecs_service.blue
terraform destroy -target=aws_lb_target_group.blue
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Blue-green deployments with databases require careful planning, but the expand-contract pattern makes it manageable.</p>
<p>Here are some key takeaways:</p>
<ol>
<li><p><strong>Use expand-contract as default</strong> – Maintains backwards compatibility and safe rollbacks.</p>
</li>
<li><p><strong>Externalize state</strong> – Sessions, caches, and storage should use external services.</p>
</li>
<li><p><strong>Plan for three phases</strong> – Don't rush to the contract phase.</p>
</li>
<li><p><strong>Test everything in staging</strong> – Mirror production scale and complexity.</p>
</li>
<li><p><strong>Monitor aggressively</strong> – Track technical and business metrics for both environments.</p>
</li>
<li><p><strong>Know when to use alternatives</strong> – Blue-green isn't always the answer.</p>
</li>
<li><p><strong>Document rollback procedures</strong> – Everyone should know the rollback process before deployment.</p>
</li>
</ol>
<p>The expand-contract pattern requires more work upfront, but this investment pays dividends in reduced risk and maintained uptime. With the strategies and complete implementation provided here, you can successfully deploy even complex, stateful applications with confidence.</p>
<p>As always, I hope you enjoyed this guide and learned something. If you want to stay connected or see more hands-on DevOps content, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a>.</p>
<p>For more practical hands-on Cloud/DevOps projects like this one, follow and star this repository: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building">Learn-DevOps-by-building</a>.</p>
<h2 id="heading-further-resources">Further Resources</h2>
<ul>
<li><p>Complete Code: <a target="_blank" href="https://github.com/Caesarsage/bluegreen-deployment-ecs">github.com/Caesarsage/bluegreen-deployment-ecs</a></p>
</li>
<li><p>Learn DevOps by Building: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building">GitHub repo</a></p>
</li>
<li><p>AWS ECS Blue/Green Documentation: <a target="_blank" href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-bluegreen.html">AWS Docs</a></p>
</li>
<li><p>AWS CodeDeploy for ECS: <a target="_blank" href="https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html">AWS Docs</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Kubernetes Cluster and Security Groups for Pods in AWS [Full Handbook] ]]>
                </title>
                <description>
                    <![CDATA[ Amazon Elastic Kubernetes Service (EKS) Security Groups for Pods is a powerful feature that enables fine-grained network security controls at the pod level. This guide walks you through implementing this feature, from initial cluster setup to testing... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-kubernetes-cluster-and-security-groups-for-pods-in-aws-handbook/</link>
                <guid isPermaLink="false">68f034017abb7495f91ce942</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Security ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Wed, 15 Oct 2025 23:53:37 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760572399710/e6ff9b5b-2fa5-4e61-9b89-9b68c81e6d46.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Amazon Elastic Kubernetes Service (EKS) Security Groups for Pods is a powerful feature that enables fine-grained network security controls at the pod level. This guide walks you through implementing this feature, from initial cluster setup to testing pod-level security group assignments.</p>
<p>Traditionally, security groups could only be assigned at the EC2 instance level in EKS clusters. This meant that all pods running on a node shared the same network security rules. With Security Groups for Pods, you can now assign specific security groups to individual pods, providing much more granular control over network access.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-architecture">Understanding the Architecture</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-infrastructure-foundation">Infrastructure Foundation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-eks-cluster-configuration">EKS Cluster Configuration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-management-instance-setup">Management Instance Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-security-group-configuration">Security Group Configuration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-database-setup">Database Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cni-plugin-configuration">CNI Plugin Configuration</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-security-policies-implementation">Security Policies Implementation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-testing-and-validation">Testing and Validation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cleanup-and-maintenance">Cleanup and Maintenance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before starting this guide, ensure you have:</p>
<ul>
<li><p>An AWS account with appropriate permissions</p>
</li>
<li><p>AWS CLI configured on your local machine</p>
</li>
<li><p>Basic understanding of Kubernetes concepts</p>
</li>
<li><p>Familiarity with AWS networking concepts (VPCs, security groups, subnets)</p>
</li>
<li><p>Understanding of Amazon EKS fundamentals</p>
</li>
</ul>
<h2 id="heading-understanding-the-architecture">Understanding the Architecture</h2>
<p>Before we dive into implementation, let's understand how Security Groups for Pods changes the EKS networking model. We'll start by looking at the traditional approach, then explore the enhanced model, and finally understand the components that make it all work.</p>
<h3 id="heading-traditional-eks-networking">Traditional EKS Networking</h3>
<p>In the standard EKS networking setup, security happens at the node level rather than the pod level. When you create an EKS cluster using the traditional model, every EC2 worker node gets assigned a security group. All pods running on that node inherit the same security group settings from their host node. This means if you have ten different applications running on the same node, they all share identical network security rules.</p>
<p>This approach has significant limitations. For example, if one pod needs to access a database while another pod should not, you can't enforce this distinction when both pods share the node's security group. The security boundary exists at the node level, creating a coarse-grained security model where all pods on a node have the same network permissions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758251803143/fa7ac487-a847-4029-9543-839c428ec20c.png" alt="Security group pod architecture without security groups" class="image--center mx-auto" width="266" height="251" loading="lazy"></p>
<h3 id="heading-security-groups-for-pods-architecture">Security Groups for Pods Architecture</h3>
<p>Security Groups for Pods changes this paradigm completely. With the feature enabled, you can assign dedicated security groups to individual pods based on their specific needs. Instead of all pods inheriting the node's security group, certain pods can get their own Elastic Network Interface (ENI) with custom security group assignments.</p>
<p>An ENI (Elastic Network Interface) is essentially a virtual network card in AWS. Just as your physical computer has a network card to connect to the internet, EC2 instances and now individual pods can have their own virtual network interfaces. Each ENI can have its own IP address, security groups, and network settings. When we assign an ENI to a pod, that pod gets its own dedicated network identity separate from the node it runs on.</p>
<p>This architecture provides true pod-level security. For instance, you might have a frontend pod and a database access pod running on the same node. The frontend pod uses the node's security group and cannot access the database. Meanwhile, the database access pod gets its own ENI with a security group that explicitly allows database connections. Even though they share the same physical node, these pods have completely different network security profiles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758252157480/fda41975-aabc-4e1c-a638-75a26c565bb0.png" alt="Security group pod architecture with security groups" class="image--center mx-auto" width="266" height="251" loading="lazy"></p>
<h3 id="heading-how-it-works">How It Works:</h3>
<p>The implementation of Security Groups for Pods relies on several interconnected mechanisms working together. First, when you mark a pod for special security group treatment through a <code>SecurityGroupPolicy</code>, the system automatically provisions a dedicated ENI for that pod. This ENI assignment happens through <a target="_blank" href="https://docs.aws.amazon.com/eks/latest/best-practices/vpc-cni.html">AWS VPC CNI's</a> branch networking feature, which allows multiple network interfaces to attach to a single EC2 instance.</p>
<p>The branch networking capability is crucial here. EC2 instances have limits on how many ENIs they can support. For example, a t3.medium and an m5.large each support up to three ENIs, while an m5.xlarge supports up to four. With Security Groups for Pods, one of those slots is used for a trunk interface, and the branch interfaces for pods that need custom security groups are created on top of that trunk. Each branch interface can then have its own security group configuration independent of the node's primary network interface. Note that the feature is only available on supported Nitro-based instance types, and instance types in the t family are not supported.</p>
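<p>Rather than relying on remembered numbers, you can look up the ENI limits for any instance type with the AWS CLI. The sketch below assumes your CLI is already configured for a region. The helper name <code>show_eni_limits</code> is made up for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: print the max ENIs and IPv4 addresses per ENI
# for each instance type passed as an argument
show_eni_limits() {
  aws ec2 describe-instance-types \
    --instance-types "$@" \
    --query 'InstanceTypes[].[InstanceType,NetworkInfo.MaximumNetworkInterfaces,NetworkInfo.Ipv4AddressesPerInterface]' \
    --output table
}

# Usage:
#   show_eni_limits t3.medium m5.large m5.xlarge
</code></pre>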
<p>This fine-grained control means you can now enforce network policies at the application level. Different microservices in your cluster can have completely different network access patterns, even when running on the same infrastructure. A payment processing pod might have strict database access, while a logging pod might only need access to your log aggregation service, and a frontend pod might only need internet access for serving web traffic.</p>
<h3 id="heading-key-components">Key Components:</h3>
<p>Several Kubernetes and AWS components work together to enable this functionality. Let's walk through each one to understand how they contribute to the overall system.</p>
<h4 id="heading-securitygrouppolicy-crd">SecurityGroupPolicy CRD</h4>
<p>The SecurityGroupPolicy Custom Resource Definition (CRD) is a Kubernetes object that you create to tell the system which pods should receive which security groups. You use standard Kubernetes label selectors to identify pods, then specify one or more AWS security group IDs that should be attached to those pods. When you create a SecurityGroupPolicy, the system doesn't immediately change anything. Instead, it creates a rule that applies to future pods matching those labels.</p>
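<p>To make this concrete, here is a rough sketch of what such a policy can look like. The policy name, namespace, label, and security group ID are all placeholders chosen for illustration, so substitute your own values.</p>
<pre><code class="lang-bash"># Write an example policy to a file (all names and IDs below are placeholders)
cat &gt; sgp-example.yaml &lt;&lt; 'EOF'
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: db-access-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db-client
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
EOF

# Apply it once the cluster is running and the CRD is installed:
#   kubectl apply -f sgp-example.yaml
</code></pre>
<p>Any pod created in the <code>default</code> namespace with the label <code>role: db-client</code> after this policy exists would be matched by it. Pods that were already running keep their current networking until they are recreated.</p>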
<h4 id="heading-vpc-resource-controller">VPC Resource Controller</h4>
<p>The VPC Resource Controller is an AWS component that runs in your cluster's control plane. This controller constantly watches for pods that match your SecurityGroupPolicy definitions.</p>
<p>When a matching pod is created, the controller communicates with AWS EC2 APIs to provision the necessary ENI, attach the specified security groups, and configure the network interface. It also handles the cleanup process when pods are deleted, ensuring that ENIs are properly released and don't become orphaned resources in your AWS account.</p>
<h4 id="heading-aws-vpc-cni">AWS VPC CNI</h4>
<p>Finally, the AWS VPC CNI plugin is enhanced to support this branch networking feature. When the VPC Resource Controller provisions an ENI for a pod, the CNI plugin on the worker node handles the low-level networking configuration. It attaches the ENI to the pod's network namespace, configures routing rules, and ensures that traffic from that pod flows through the dedicated interface rather than the node's primary network interface. The CNI plugin also maintains the necessary iptables rules and network policies to keep pod networking isolated and secure.</p>
<p>Together, these components create a seamless experience where you simply label your pods and define security policies, and the system handles all the complex AWS networking configuration automatically.</p>
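<p>Once the feature has been enabled on a running cluster, one way to see these components at work is to look for nodes that have had a trunk interface attached. This is a small sketch that assumes <code>kubectl</code> is already pointed at your cluster. The helper name <code>nodes_with_trunk</code> is invented for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: list nodes labeled as having a trunk ENI attached
nodes_with_trunk() {
  kubectl get nodes -o wide -l vpc.amazonaws.com/has-trunk-attached=true
}

# Usage:
#   nodes_with_trunk
</code></pre>
<p>If no nodes are returned, the feature is probably not enabled yet or the node's instance type doesn't support it.</p>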
<h2 id="heading-infrastructure-foundation">Infrastructure Foundation</h2>
<p>Now we'll build the underlying AWS infrastructure that our EKS cluster needs. This includes setting up IAM roles, creating the VPC with proper subnets, and configuring the networking components. We'll work through each step, ensuring that every component is properly configured for Security Groups for Pods to function correctly.</p>
<h3 id="heading-iam-roles-and-policies-setup">IAM Roles and Policies Setup</h3>
<p>Before creating any infrastructure, we need to set up the IAM roles that will define what permissions different AWS services have. Think of IAM roles as identity cards that services present to AWS to prove they're allowed to perform certain actions. We'll create several distinct roles, each with specific permissions tailored to their purpose.</p>
<h4 id="heading-eks-cluster-service-role">EKS Cluster Service Role:</h4>
<p>First, we'll create the IAM role that the EKS service itself will use when managing your cluster. This role establishes a trust relationship between your AWS account and the EKS service, essentially giving EKS permission to perform actions on your behalf.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create the EKS cluster service role</span>
aws iam create-role \
  --role-name EKSClusterRole \
  --assume-role-policy-document <span class="hljs-string">'{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "eks.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'</span>
</code></pre>
<p><strong>Here’s what’s going on:</strong></p>
<p>This command creates an IAM role that establishes trust between your AWS account and the EKS service:</p>
<ul>
<li><p><code>assume-role-policy-document</code>: Defines which AWS service can assume this role</p>
</li>
<li><p><code>"Service": "eks.amazonaws.com"</code>: Only the EKS service can use this role</p>
</li>
<li><p><code>"Action": "sts:AssumeRole"</code>: Lets the EKS service request temporary credentials for this role</p>
</li>
</ul>
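<p>If you want to confirm the trust relationship was stored as intended, you can read it back from IAM. The following is a sketch under the assumption that the role was created with the name used above. The helper name <code>trusted_service</code> is hypothetical.</p>
<pre><code class="lang-bash"># Hypothetical helper: print the service principal trusted by a role
trusted_service() {
  aws iam get-role \
    --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument.Statement[0].Principal.Service' \
    --output text
}

# Usage:
#   trusted_service EKSClusterRole
</code></pre>
<p>For the cluster role, the command should print <code>eks.amazonaws.com</code>.</p>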
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758420978401/1f7a7849-923b-4378-9ee5-1063cf4b2ffd.png" alt="EKS Cluster iam role" class="image--center mx-auto" width="2430" height="404" loading="lazy"></p>
<h4 id="heading-eks-cluster-attached-role-policy">EKS Cluster Attached Role Policy:</h4>
<p>Now that we have the role created, we need to attach managed policies that grant the actual permissions EKS needs to function. We'll attach two AWS-managed policies that provide comprehensive permissions for EKS operations.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Attach the required policies</span>
aws iam attach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

aws iam attach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSVPCResourceController
</code></pre>
<p>Let me explain what each of these policies does. The <strong>AmazonEKSClusterPolicy</strong> is a managed policy that AWS maintains. It gives the Kubernetes control plane that EKS runs for you permission to manage AWS resources on your behalf, such as describing and tagging EC2 resources and working with load balancers. The control plane components themselves (the API server, etcd, the controller manager, and the scheduler) are operated by AWS, but without this policy they couldn't interact with the resources in your account.</p>
<p>The second policy, <strong>AmazonEKSVPCResourceController</strong>, is particularly critical for our Security Groups for Pods implementation. This policy allows the VPC Resource Controller to create and delete ENIs, assign security groups to those interfaces, and manage VPC resources on behalf of pods. When a pod needs a dedicated ENI with specific security groups, this policy is what authorizes EKS to make those changes in your VPC.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758421026844/2ef6755d-1306-40f5-8048-04f3334a8308.png" alt="EKS Cluster role policy" class="image--center mx-auto" width="2461" height="1252" loading="lazy"></p>
<h4 id="heading-eks-node-group-role">EKS Node Group Role:</h4>
<p>Next, we'll create the IAM role that EC2 worker nodes will use. While the cluster role is for the EKS control plane, this role is for the actual compute instances that run your pods.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create the node group role</span>
aws iam create-role \
  --role-name EKSNodeGroupRole \
  --assume-role-policy-document <span class="hljs-string">'{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "ec2.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'</span>
</code></pre>
<p>This role's <code>assume role policy</code> specifies <strong>ec2.amazonaws.com</strong> as the trusted service, meaning EC2 instances can assume this role. When an EC2 instance launches as part of your EKS node group, it automatically assumes this role and uses it to authenticate with AWS services. This is how your worker nodes can pull container images, register with the cluster, and perform other necessary operations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758417915291/f06a2ab3-d6be-4bfc-a489-fae3a0c22e8f.png" alt="EKS node group role" class="image--center mx-auto" width="2394" height="412" loading="lazy"></p>
<h4 id="heading-eks-node-group-role-attached-policy">EKS Node Group Role Attached Policy:</h4>
<p>With the node group role created, we now need to attach policies that give worker nodes the permissions they need. We'll attach three different managed policies, each serving a specific purpose in the node's lifecycle.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Attach required policies for worker nodes</span>
aws iam attach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

aws iam attach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

aws iam attach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
</code></pre>
<p>Each policy serves a specific purpose for worker node functionality:</p>
<ol>
<li><p><strong>AmazonEKSWorkerNodePolicy</strong>: Allows nodes to connect to the EKS cluster</p>
</li>
<li><p><strong>AmazonEKS_CNI_Policy</strong>: Enables the CNI plugin to manage pod networking</p>
</li>
<li><p><strong>AmazonEC2ContainerRegistryReadOnly</strong>: Allows nodes to pull container images from ECR</p>
</li>
</ol>
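<p>A quick way to double-check that the attachments succeeded is to list the managed policies on each role. This sketch assumes the role names used in this guide. The helper name <code>attached_policies</code> is made up for illustration.</p>
<pre><code class="lang-bash"># Hypothetical helper: list the names of managed policies attached to a role
attached_policies() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}

# Usage:
#   attached_policies EKSClusterRole
#   attached_policies EKSNodeGroupRole
</code></pre>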
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758417798644/50a4146a-86ae-4c08-8488-b7eba6b454fe.png" alt="node group iam role and policy" class="image--center mx-auto" width="2689" height="1457" loading="lazy"></p>
<h4 id="heading-iam-role-for-management-instance">IAM Role for Management Instance:</h4>
<p>To complete our foundation setup, we'll create a dedicated role for the EC2 instance that we'll use to manage the cluster. This management instance will act as our control point for running kubectl commands, configuring the cluster, and performing administrative tasks.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create IAM role for management instance</span>
aws iam create-role \
  --role-name EKS-Management-Role \
  --assume-role-policy-document <span class="hljs-string">'{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "ec2.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'</span>

<span class="hljs-comment"># Create instance profile</span>
aws iam create-instance-profile \
  --instance-profile-name EKS-Management-Profile

<span class="hljs-comment"># Add role to instance profile</span>
aws iam add-role-to-instance-profile \
  --instance-profile-name EKS-Management-Profile \
  --role-name EKS-Management-Role

<span class="hljs-comment"># Create and attach custom policy for EKS management</span>
cat &gt; eks-management-policy.json &lt;&lt; <span class="hljs-string">'EOF'</span>
{
    <span class="hljs-string">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-string">"Statement"</span>: [
        {
            <span class="hljs-string">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-string">"Action"</span>: [
                <span class="hljs-string">"eks:*"</span>,
                <span class="hljs-string">"ec2:DescribeInstances"</span>,
                <span class="hljs-string">"ec2:DescribeSecurityGroups"</span>,
                <span class="hljs-string">"ec2:DescribeVpcs"</span>,
                <span class="hljs-string">"ec2:DescribeSubnets"</span>,
                <span class="hljs-string">"ec2:DescribeNetworkInterfaces"</span>,
                <span class="hljs-string">"ec2:CreateSecurityGroup"</span>,
                <span class="hljs-string">"ec2:AuthorizeSecurityGroupIngress"</span>,
                <span class="hljs-string">"ec2:RevokeSecurityGroupIngress"</span>,
                <span class="hljs-string">"rds:DescribeDBInstances"</span>,
                <span class="hljs-string">"rds:CreateDBInstance"</span>,
                <span class="hljs-string">"rds:DeleteDBInstance"</span>,
                <span class="hljs-string">"iam:PassRole"</span>
            ],
            <span class="hljs-string">"Resource"</span>: <span class="hljs-string">"*"</span>
        }
    ]
}
EOF

aws iam create-policy \
  --policy-name EKS-Management-Policy \
  --policy-document file://eks-management-policy.json

aws iam attach-role-policy \
  --role-name EKS-Management-Role \
  --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/EKS-Management-Policy
</code></pre>
<p>This setup is more complex than the previous roles because it involves several steps.</p>
<p>First, we create the role with EC2 as the trusted service. Then we create an <strong>instance profile</strong>, which is AWS's mechanism for attaching IAM roles to EC2 instances. Think of an instance profile as a container that holds the role and makes it available to EC2.</p>
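<p>To see that the instance profile really does hold the role, you can ask IAM which roles it contains. This is a minimal sketch assuming the profile name used above. The helper name <code>profile_roles</code> is hypothetical.</p>
<pre><code class="lang-bash"># Hypothetical helper: print the role names contained in an instance profile
profile_roles() {
  aws iam get-instance-profile \
    --instance-profile-name "$1" \
    --query 'InstanceProfile.Roles[].RoleName' \
    --output text
}

# Usage:
#   profile_roles EKS-Management-Profile
</code></pre>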
<p>The custom policy we're creating gives comprehensive administrative permissions for managing EKS clusters. The <code>eks:*</code> wildcard grants all EKS actions, while the specific EC2 and RDS permissions allow for infrastructure management.</p>
<p>The <strong>iam:PassRole</strong> permission is particularly important. It allows this management instance to pass the cluster and node group roles to EKS when creating resources. Without this permission, we couldn't create the cluster from this instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758424448145/c7c68885-35ca-4c96-997a-5f8c447c4652.png" alt="EKS Management role" class="image--center mx-auto" width="2493" height="460" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758424489993/2c574387-54f3-4283-9dcf-357beef020eb.png" alt="EKS Management policy" class="image--center mx-auto" width="2497" height="1316" loading="lazy"></p>
<h3 id="heading-vpc-and-networking-infrastructure">VPC and Networking Infrastructure</h3>
<p>With our IAM roles configured, we'll now build the network infrastructure that will host our EKS cluster. We're going to create a production-ready VPC with both public and private subnets across multiple availability zones. This architecture provides both security and high availability.</p>
<h4 id="heading-vpc-creation-and-configuration">VPC Creation and Configuration</h4>
<p>Let's start by creating our Virtual Private Cloud and the Internet Gateway that will provide internet connectivity for our public resources.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create VPC</span>
<span class="hljs-built_in">export</span> VPC_ID=$(aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --tag-specifications <span class="hljs-string">'ResourceType=vpc,Tags=[{Key=Name,Value=eks-security-demo}]'</span> \
  --query <span class="hljs-string">'Vpc.VpcId'</span> \
  --output text)

<span class="hljs-comment"># Create Internet Gateway first</span>
<span class="hljs-built_in">export</span> IGW_ID=$(aws ec2 create-internet-gateway \
  --query <span class="hljs-string">'InternetGateway.InternetGatewayId'</span> \
  --output text)

<span class="hljs-comment"># Attach Internet Gateway to VPC</span>
aws ec2 attach-internet-gateway \
  --internet-gateway-id <span class="hljs-variable">$IGW_ID</span> \
  --vpc-id <span class="hljs-variable">$VPC_ID</span>
</code></pre>
<p>When we create the VPC with a 10.0.0.0/16 CIDR block, we're defining an IP address range that provides 65,536 possible IP addresses. This is a private IP range (meaning these addresses aren't routable on the public internet) from the RFC 1918 specification. This gives us plenty of room to create multiple subnets and scale our infrastructure as needed. The /16 designation means the first 16 bits of the IP address are fixed (10.0), while the remaining 16 bits are available for our use.</p>
<p>The Internet Gateway is a horizontally scaled, redundant, and highly available VPC component that allows communication between instances in our VPC and the internet. By attaching it to our VPC, we're setting up the foundation for resources in public subnets to communicate with the outside world.</p>
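<p>As a sanity check, you can confirm that the gateway ended up attached to the right VPC. The sketch below assumes <code>IGW_ID</code> is still set in your shell from the previous step. The helper name <code>igw_attachment</code> is invented for this example.</p>
<pre><code class="lang-bash"># Hypothetical helper: print the VPC ID and attachment state of an Internet Gateway
igw_attachment() {
  aws ec2 describe-internet-gateways \
    --internet-gateway-ids "$1" \
    --query 'InternetGateways[0].Attachments[0].[VpcId,State]' \
    --output text
}

# Usage:
#   igw_attachment "$IGW_ID"
</code></pre>
<p>The VPC ID in the output should match the value of <code>VPC_ID</code>.</p>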
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758419087788/25ed1689-31f5-4076-b6ba-cd33358e59dc.png" alt="25ed1689-31f5-4076-b6ba-cd33358e59dc" class="image--center mx-auto" width="2525" height="336" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758419294210/8c7db06c-b213-47f3-9d37-ff74eaea9960.png" alt="vpc" class="image--center mx-auto" width="2476" height="1370" loading="lazy"></p>
<h4 id="heading-subnet-architecture-strategy">Subnet Architecture Strategy</h4>
<p>Now we'll create four subnets – two public and two private – spread across two different availability zones. This multi-AZ approach is crucial for high availability and follows AWS best practices for production deployments.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Public subnets for NAT Gateway and Load Balancers</span>
<span class="hljs-built_in">export</span> PUBLIC_SUBNET_1=$(aws ec2 create-subnet \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.1.0/24 \
  --availability-zone eu-west-1a \
  --query <span class="hljs-string">'Subnet.SubnetId'</span> \
  --output text)

<span class="hljs-built_in">export</span> PUBLIC_SUBNET_2=$(aws ec2 create-subnet \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.2.0/24 \
  --availability-zone eu-west-1b \
  --query <span class="hljs-string">'Subnet.SubnetId'</span> \
  --output text)

<span class="hljs-comment"># Enable auto-assign public IP for public subnets</span>
aws ec2 modify-subnet-attribute \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_1</span> \
  --map-public-ip-on-launch

aws ec2 modify-subnet-attribute \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_2</span> \
  --map-public-ip-on-launch

<span class="hljs-comment"># Private subnets for worker nodes and RDS</span>
<span class="hljs-built_in">export</span> PRIVATE_SUBNET_1=$(aws ec2 create-subnet \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.3.0/24 \
  --availability-zone eu-west-1a \
  --query <span class="hljs-string">'Subnet.SubnetId'</span> \
  --output text)

<span class="hljs-built_in">export</span> PRIVATE_SUBNET_2=$(aws ec2 create-subnet \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --cidr-block 10.0.4.0/24 \
  --availability-zone eu-west-1b \
  --query <span class="hljs-string">'Subnet.SubnetId'</span> \
  --output text)
</code></pre>
<p>Let me walk you through the subnet design:</p>
<p>Each subnet uses a /24 CIDR block, which provides 256 IP addresses per subnet (though AWS reserves 5 addresses in each subnet for internal use, leaving 251 usable addresses). We're creating these subnets in pairs across two availability zones (eu-west-1a and eu-west-1b). If one availability zone experiences an outage, resources in the other zone can continue operating.</p>
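<p>The address counts for the VPC and its subnets follow directly from the prefix length, and you can reproduce them with a few lines of Bash arithmetic. The function names here are just illustrative helpers.</p>
<pre><code class="lang-bash"># Total addresses in a CIDR block: 2^(32 - prefix length)
cidr_size() {
  echo $(( 2 ** (32 - $1) ))
}

# Usable addresses in an AWS subnet: total minus the 5 that AWS reserves
usable_in_subnet() {
  echo $(( $(cidr_size "$1") - 5 ))
}

cidr_size 16          # the VPC, 10.0.0.0/16
cidr_size 24          # each /24 subnet
usable_in_subnet 24   # what remains after AWS reservations
</code></pre>
<p>Running this prints 65536, 256, and 251, matching the figures quoted for the VPC and its subnets.</p>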
<p>The public subnets (10.0.1.0/24 and 10.0.2.0/24) will host our NAT Gateway and potentially load balancers in the future. We enable auto-assign public IP on these subnets so that any instances launched here automatically receive public IP addresses. The NAT Gateway itself doesn't depend on this setting, since it gets its public address from an Elastic IP allocated to it.</p>
<p>The private subnets (10.0.3.0/24 and 10.0.4.0/24) will host our EKS worker nodes and RDS database. Resources in these subnets don't receive public IP addresses, meaning they can't be directly accessed from the internet. This provides an additional layer of security for our application workloads and database.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758420805187/4c6452d0-5188-46d8-8b75-36b2c6ee1180.png" alt="internet gateway" class="image--center mx-auto" width="2521" height="494" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758419921540/5d231ace-0788-4b30-97bf-71d5b1cbc5d5.png" alt="subnets" class="image--center mx-auto" width="2596" height="612" loading="lazy"></p>
<h4 id="heading-eks-subnet-tagging">EKS Subnet Tagging</h4>
<p>Next, we need to add specific tags to our subnets so that EKS can automatically discover and use them correctly. These tags tell EKS which subnets to use for different types of load balancers.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Tag subnets for EKS auto-discovery</span>
aws ec2 create-tags \
  --resources <span class="hljs-variable">$PUBLIC_SUBNET_1</span> <span class="hljs-variable">$PUBLIC_SUBNET_2</span> \
  --tags Key=kubernetes.io/cluster/pod-security-cluster-demo,Value=shared \
         Key=kubernetes.io/role/elb,Value=1

aws ec2 create-tags \
  --resources <span class="hljs-variable">$PRIVATE_SUBNET_1</span> <span class="hljs-variable">$PRIVATE_SUBNET_2</span> \
  --tags Key=kubernetes.io/cluster/pod-security-cluster-demo,Value=shared \
         Key=kubernetes.io/role/internal-elb,Value=1
</code></pre>
<p>These tags serve specific purposes in the EKS ecosystem:</p>
<ul>
<li><p>The <code>kubernetes.io/cluster/pod-security-cluster-demo=shared</code> tag identifies subnets that belong to our cluster. The "shared" value indicates that these subnets might be used by multiple clusters, though in our case we're only using them for one.</p>
</li>
<li><p>The <code>kubernetes.io/role/elb=1</code> tag on public subnets tells Kubernetes to use these subnets when creating internet-facing load balancers.</p>
</li>
<li><p>The <code>kubernetes.io/role/internal-elb=1</code> tag on private subnets indicates where internal load balancers should be created.</p>
</li>
</ul>
<p>When you create a Kubernetes Service of type LoadBalancer, these tags help Kubernetes automatically choose the correct subnets based on whether you want an internal or external load balancer.</p>
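<p>To confirm the tags landed on the right subnets, you can list them back. This is a minimal check, assuming the four subnet variables from the earlier steps are still set in your shell:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># List every tag on the four subnets as ResourceId / Key / Value rows</span>
aws ec2 describe-tags \
  --filters "Name=resource-id,Values=$PUBLIC_SUBNET_1,$PUBLIC_SUBNET_2,$PRIVATE_SUBNET_1,$PRIVATE_SUBNET_2" \
  --query <span class="hljs-string">'Tags[].[ResourceId,Key,Value]'</span> \
  --output table
</code></pre>
<p>The public subnets should show the <code>kubernetes.io/role/elb</code> key and the private subnets should show <code>kubernetes.io/role/internal-elb</code>.</p>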
<h4 id="heading-routing-and-nat-gateway">Routing and NAT Gateway</h4>
<p>Now we'll set up the routing infrastructure that controls how traffic flows in and out of our subnets. This includes creating route tables for both public and private subnets, and setting up a NAT Gateway to provide internet access for resources in private subnets.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create route table for public subnets</span>
<span class="hljs-built_in">export</span> PUBLIC_RT=$(aws ec2 create-route-table \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --query <span class="hljs-string">'RouteTable.RouteTableId'</span> \
  --output text)

<span class="hljs-comment"># Create route to Internet Gateway</span>
aws ec2 create-route \
  --route-table-id <span class="hljs-variable">$PUBLIC_RT</span> \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id <span class="hljs-variable">$IGW_ID</span>

<span class="hljs-comment"># Associate public subnets with public route table</span>
aws ec2 associate-route-table \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_1</span> \
  --route-table-id <span class="hljs-variable">$PUBLIC_RT</span>

aws ec2 associate-route-table \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_2</span> \
  --route-table-id <span class="hljs-variable">$PUBLIC_RT</span>

<span class="hljs-comment"># Create NAT Gateway</span>
<span class="hljs-built_in">export</span> EIP_ALLOC=$(aws ec2 allocate-address \
  --domain vpc \
  --query <span class="hljs-string">'AllocationId'</span> \
  --output text)

<span class="hljs-built_in">export</span> NAT_GW=$(aws ec2 create-nat-gateway \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_1</span> \
  --allocation-id <span class="hljs-variable">$EIP_ALLOC</span> \
  --query <span class="hljs-string">'NatGateway.NatGatewayId'</span> \
  --output text)

<span class="hljs-comment"># Wait for NAT Gateway to be available</span>
aws ec2 <span class="hljs-built_in">wait</span> nat-gateway-available --nat-gateway-ids <span class="hljs-variable">$NAT_GW</span>

<span class="hljs-comment"># Create route table for private subnets</span>
<span class="hljs-built_in">export</span> PRIVATE_RT=$(aws ec2 create-route-table \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --query <span class="hljs-string">'RouteTable.RouteTableId'</span> \
  --output text)

<span class="hljs-comment"># Create route to NAT Gateway</span>
aws ec2 create-route \
  --route-table-id <span class="hljs-variable">$PRIVATE_RT</span> \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id <span class="hljs-variable">$NAT_GW</span>

<span class="hljs-comment"># Associate private subnets with private route table</span>
aws ec2 associate-route-table \
  --subnet-id <span class="hljs-variable">$PRIVATE_SUBNET_1</span> \
  --route-table-id <span class="hljs-variable">$PRIVATE_RT</span>

aws ec2 associate-route-table \
  --subnet-id <span class="hljs-variable">$PRIVATE_SUBNET_2</span> \
  --route-table-id <span class="hljs-variable">$PRIVATE_RT</span>
</code></pre>
<p>Let me explain how this routing configuration enables secure internet access:</p>
<p>We start by creating a <code>route table</code> for our public subnets and adding a default route (0.0.0.0/0) that points to the Internet Gateway. This means any traffic from public subnet resources that doesn't match a more specific route will go directly to the Internet Gateway and out to the internet.</p>
<p>Next, we create a <code>NAT Gateway</code>, which requires an Elastic IP address. An <strong>Elastic IP</strong> is a static public IPv4 address that AWS allocates to your account. The NAT Gateway lives in a public subnet and acts as a middleman for outbound internet traffic from private subnets. When a resource in a private subnet wants to reach the internet (for example, to download software updates), the traffic goes to the NAT Gateway, which then forwards it to the Internet Gateway. Response traffic comes back through the same path.</p>
<p>For the <code>private subnets</code>, we create a separate route table with a default route pointing to the NAT Gateway instead of directly to the Internet Gateway. This setup allows resources in private subnets to initiate outbound connections to the internet (which they need for things like pulling container images or downloading patches), but prevents inbound connections from the internet. This is a key security feature: your worker nodes and databases can access the internet when needed, but the internet can't directly access them.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758420369369/ffa92cd6-691d-4b3c-867f-739f6aa33c21.png" alt="route table" class="image--center mx-auto" width="2596" height="1472" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758420512111/f7922e4a-103f-4f4d-84be-0baa20681b6b.png" alt="nat gateway" class="image--center mx-auto" width="2581" height="372" loading="lazy"></p>
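<p>If you'd like to double-check the routing from the CLI, the following sketch prints the routes in both tables. It assumes <code>$PUBLIC_RT</code> and <code>$PRIVATE_RT</code> are still set from the commands above:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Show destination and target for each route in the public and private route tables</span>
aws ec2 describe-route-tables \
  --route-table-ids $PUBLIC_RT $PRIVATE_RT \
  --query <span class="hljs-string">'RouteTables[].{Table:RouteTableId,Routes:Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId]}'</span> \
  --output json
</code></pre>
<p>In the public table, the 0.0.0.0/0 route should list an Internet Gateway ID as its target. In the private table, the same destination should list the NAT Gateway ID instead.</p>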
<h2 id="heading-eks-cluster-configuration">EKS Cluster Configuration</h2>
<p>With our networking foundation in place, we're ready to create the actual EKS cluster. We'll configure the cluster to support Security Groups for Pods and set up managed worker nodes with appropriate instance types.</p>
<h3 id="heading-eks-cluster-creation">EKS Cluster Creation</h3>
<p>Let's create our EKS cluster with configuration options specifically chosen to support the Security Groups for Pods feature.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> CLUSTER_ROLE_ARN=$(aws iam get-role \
  --role-name EKSClusterRole \
  --query <span class="hljs-string">'Role.Arn'</span> \
  --output text)

<span class="hljs-comment"># Create the EKS cluster with detailed configuration</span>
aws eks create-cluster \
  --name pod-security-cluster-demo \
  --kubernetes-version 1.33 \
  --role-arn <span class="hljs-variable">$CLUSTER_ROLE_ARN</span> \
  --access-config authenticationMode=API_AND_CONFIG_MAP \
  --resources-vpc-config subnetIds=<span class="hljs-variable">$PUBLIC_SUBNET_1</span>,<span class="hljs-variable">$PUBLIC_SUBNET_2</span>,<span class="hljs-variable">$PRIVATE_SUBNET_1</span>,<span class="hljs-variable">$PRIVATE_SUBNET_2</span>

<span class="hljs-comment"># Wait for cluster to be active (this can take 10-15 minutes)</span>
aws eks <span class="hljs-built_in">wait</span> cluster-active --name pod-security-cluster-demo
</code></pre>
<p>So what's happening in this cluster creation command? First, we're using Kubernetes version <code>1.33</code>, a recent version offered by EKS at the time of writing that works with Security Groups for Pods. The <code>role-arn</code> parameter specifies the EKSClusterRole we created earlier, giving the cluster permission to manage AWS resources.</p>
<p>The access-config setting is particularly important. By specifying <code>API_AND_CONFIG_MAP</code>, we're enabling both modern API-based authentication and the traditional aws-auth ConfigMap approach. This dual authentication mode provides flexibility in how we manage cluster access.</p>
<p>We're including all four of our subnets in the <code>resources-vpc-config</code>. This is crucial because the EKS control plane needs to communicate with worker nodes across availability zones. By specifying both public and private subnets, we ensure that the cluster can place resources wherever they're needed while maintaining proper security boundaries.</p>
<p>The cluster creation process typically takes 10-15 minutes. During this time, AWS is provisioning the Kubernetes control plane components (API server, etcd, controller manager, and scheduler) across multiple availability zones for high availability.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758424603683/eba3fe22-b120-476d-84e7-1d3f86ff30d4.png" alt="EKS Cluster" class="image--center mx-auto" width="2481" height="560" loading="lazy"></p>
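<p>Once the waiter returns, a quick way to confirm the settings discussed above is to read them back from the cluster. This is a small sketch that uses only the cluster name from this tutorial:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Read back status, version, authentication mode, and subnets from the new cluster</span>
aws eks describe-cluster \
  --name pod-security-cluster-demo \
  --query <span class="hljs-string">'cluster.{Status:status,Version:version,AuthMode:accessConfig.authenticationMode,Subnets:resourcesVpcConfig.subnetIds}'</span> \
  --output json
</code></pre>
<p>You should see the status reported as ACTIVE, the authentication mode you passed in, and all four subnet IDs.</p>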
<h3 id="heading-managed-node-group-setup">Managed Node Group Setup</h3>
<p>With the cluster created, we now need to add worker nodes that will actually run our pods. We'll create a managed node group with instance types specifically chosen to support multiple ENIs.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Get the ARN of the node group role</span>
<span class="hljs-built_in">export</span> NODE_ROLE_ARN=$(aws iam get-role \
  --role-name EKSNodeGroupRole \
  --query <span class="hljs-string">'Role.Arn'</span> \
  --output text)

<span class="hljs-comment"># Create the managed node group</span>
aws eks create-nodegroup \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers \
  --subnets <span class="hljs-variable">$PRIVATE_SUBNET_1</span> <span class="hljs-variable">$PRIVATE_SUBNET_2</span> \
  --node-role <span class="hljs-variable">$NODE_ROLE_ARN</span> \
  --instance-types m5.large \
  --scaling-config minSize=1,maxSize=3,desiredSize=2 \
  --disk-size 20 \
  --capacity-type ON_DEMAND

<span class="hljs-comment"># Wait for node group to be active</span>
aws eks <span class="hljs-built_in">wait</span> nodegroup-active \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers
</code></pre>
<p>Let me explain the key configuration choices here:</p>
<p>We're launching our <code>worker nodes</code> in the private subnets only, which follows security best practices by keeping compute resources away from direct internet access. The nodes can still download images and updates through the NAT Gateway we set up earlier.</p>
<p>The <code>instance type</code> selection is important for Security Groups for Pods. We're using m5.large instances, which can support up to 3 ENIs. One ENI is the primary network interface for the node itself. When Security Groups for Pods is enabled, the VPC CNI attaches a trunk ENI to the node and creates branch ENIs on top of it. Each pod that matches a SecurityGroupPolicy gets its own branch ENI carrying that pod's security groups, and the number of branch ENIs a node can hold depends on the instance type.</p>
<p>Our <code>scaling configuration</code> starts with 2 nodes (desiredSize=2), can scale down to 1 node (minSize=1), and up to 3 nodes (maxSize=3). This provides enough capacity for our demonstration while keeping costs reasonable. We're using the ON_DEMAND capacity type, which means these are standard EC2 instances billed at the regular On-Demand rate. While Spot instances are cheaper, ON_DEMAND ensures consistent availability without interruptions during our testing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758425098518/006a5962-900b-4da6-87fd-2c3fa5194965.png" alt="node groups" class="image--center mx-auto" width="2366" height="458" loading="lazy"></p>
<h4 id="heading-instance-type-selection-for-eni-limits">Instance Type Selection for ENI Limits</h4>
<p>Understanding the ENI limits of different instance types can help when planning for Security Groups for Pods. Let's check the ENI capacity of various instance types to see how they compare.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check ENI limits for different instance types</span>
aws ec2 describe-instance-types \
  --instance-types a1.2xlarge t3.medium t3.large m5.large m5.xlarge \
  --query <span class="hljs-string">'InstanceTypes[*].[InstanceType,NetworkInfo.MaximumNetworkInterfaces]'</span> \
  --output table
</code></pre>
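<p>This command shows the ENI limit for each instance type. The number of pods that can have dedicated security groups is governed by a separate branch ENI limit that also varies by instance type, but the ENI count is a useful first comparison:</p>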
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758970656167/a244aca3-94ad-4c8c-a9d5-136613fd420a.png" alt="ENI instance type" class="image--center mx-auto" width="564" height="454" loading="lazy"></p>
<p>The <code>m5.large instance type</code> we chose provides 3 maximum network interfaces. Here's how that breaks down in practice: one ENI is always used as the primary network interface for the node itself, handling all the standard node networking. Once Security Groups for Pods is enabled, one of the remaining ENI slots is taken by the trunk interface, and the branch ENIs that carry pod security groups are attached to that trunk.</p>
<p>Keep in mind that the ENI count alone doesn't tell you whether an instance type works with this feature. A t3.medium also supports 3 ENIs, but AWS doesn't support Security Groups for Pods on any instance type in the t family, so it wouldn't work for our demo. An m5.xlarge supports 4 ENIs and provides more capacity, but the m5.large offers a good balance: it provides adequate pod density for pods requiring security group policies while remaining cost-effective for demonstration purposes. In a production environment, you'd want to carefully calculate your ENI needs based on how many pods will require custom security groups and choose your instance types accordingly.</p>
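<p>The same API call can return a few more fields that help with capacity planning. The sketch below adds the hypervisor and the number of IPv4 addresses per interface to the earlier query. Treat it as a starting point only: the output does not include the branch ENI limit, and a Nitro hypervisor alone doesn't guarantee that an instance type supports Security Groups for Pods.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Compare hypervisor, ENI count, and IPv4 addresses per ENI for a few instance types</span>
aws ec2 describe-instance-types \
  --instance-types t3.medium m5.large m5.xlarge \
  --query <span class="hljs-string">'InstanceTypes[*].[InstanceType,Hypervisor,NetworkInfo.MaximumNetworkInterfaces,NetworkInfo.Ipv4AddressesPerInterface]'</span> \
  --output table
</code></pre>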
<h3 id="heading-eks-cluster-access-configuration">EKS Cluster Access Configuration</h3>
<p>Now we need to configure access to the cluster so our management instance can run kubectl commands. Instead of using the older aws-auth ConfigMap approach, we'll use EKS access entries, which provide a cleaner and more maintainable way to manage cluster access.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Export the management role ARN </span>
<span class="hljs-built_in">export</span> MANAGEMENT_ROLE_ARN=$(aws iam get-role \
  --role-name EKS-Management-Role \
  --query <span class="hljs-string">'Role.Arn'</span> \
  --output text)

<span class="hljs-comment"># Create access entry using the variable</span>
aws eks create-access-entry \
  --cluster-name pod-security-cluster-demo \
  --principal-arn <span class="hljs-variable">$MANAGEMENT_ROLE_ARN</span>

<span class="hljs-comment"># Associate admin policy using the variable</span>
aws eks associate-access-policy \
  --cluster-name pod-security-cluster-demo \
  --principal-arn <span class="hljs-variable">$MANAGEMENT_ROLE_ARN</span> \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope <span class="hljs-built_in">type</span>=cluster

<span class="hljs-comment"># Verify the policy was associated using the variable</span>
aws eks list-associated-access-policies \
  --cluster-name pod-security-cluster-demo \
  --principal-arn <span class="hljs-variable">$MANAGEMENT_ROLE_ARN</span>
</code></pre>
<p>This access configuration keeps two roles separate. The EKSClusterRole we created earlier is a service role that EKS itself uses to manage AWS infrastructure like VPCs, security groups, and load balancers. That's different from the EKS-Management-Role we're configuring now, which is an administrative role that human operators (or in our case, the management EC2 instance) use to interact with Kubernetes resources.</p>
<p>By creating an access entry for the management role and associating it with the AmazonEKSClusterAdminPolicy, we're granting full administrative access to the cluster. This means any EC2 instance that assumes the EKS-Management-Role can run kubectl commands with full permissions.</p>
<p>Access entries are the modern approach to cluster access management in EKS, providing better auditability and easier management compared to manually editing the aws-auth ConfigMap.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758979128608/f8bd41dc-67e6-45f2-978e-b54b92c880e0.png" alt="EKS cluster access configuration and Policy verification" class="image--center mx-auto" width="1498" height="564" loading="lazy"></p>
<h2 id="heading-management-instance-setup">Management Instance Setup</h2>
<p>Now we'll create a dedicated EC2 instance that will serve as our management workstation for interacting with the EKS cluster. This instance will have all the necessary tools pre-installed and will use the IAM role we configured earlier to access both AWS services and the Kubernetes cluster.</p>
<h3 id="heading-security-group-for-management-access">Security Group for Management Access</h3>
<p>First, let's create a security group that will control network access to our management instance. This security group will allow SSH connections so we can access the instance.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create security group for the management instance</span>
<span class="hljs-built_in">export</span> EC2_SG=$(aws ec2 create-security-group \
  --group-name EKS-Management-SG \
  --description <span class="hljs-string">"Security group for EKS management instance"</span> \
  --vpc-id <span class="hljs-variable">$VPC_ID</span> \
  --query <span class="hljs-string">'GroupId'</span> \
  --output text)

<span class="hljs-comment"># Allow SSH (open to any IP here for the demo only)</span>
aws ec2 authorize-security-group-ingress \
  --group-id <span class="hljs-variable">$EC2_SG</span> \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0  <span class="hljs-comment"># for security consider using your ip ${MY_IP}/32</span>
</code></pre>
<p>We're creating a security group specifically for the management instance and allowing SSH access on port 22. In the example above, we're using 0.0.0.0/0 which allows SSH from any IP address. This is convenient for demonstration purposes, but in a production environment, you should definitely restrict this to your specific IP address instead.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758424881848/1a7bf4f6-454b-4e6e-b670-7445bfc401e2.png" alt="security group for eks management" class="image--center mx-auto" width="2545" height="504" loading="lazy"></p>
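<p>Here is one way to tighten the rule to a single address. This is a sketch under a few assumptions: <code>MY_IP</code> is a variable name chosen for illustration, <code>https://checkip.amazonaws.com</code> is used to look up your current public IP, and the open 0.0.0.0/0 rule from the previous block already exists.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Look up your current public IP (the endpoint returns it as plain text)</span>
<span class="hljs-built_in">export</span> MY_IP=$(curl -s https://checkip.amazonaws.com)

<span class="hljs-comment"># Remove the open rule created earlier</span>
aws ec2 revoke-security-group-ingress \
  --group-id <span class="hljs-variable">$EC2_SG</span> \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0

<span class="hljs-comment"># Allow SSH from your address only</span>
aws ec2 authorize-security-group-ingress \
  --group-id <span class="hljs-variable">$EC2_SG</span> \
  --protocol tcp \
  --port 22 \
  --cidr <span class="hljs-variable">${MY_IP}</span>/32
</code></pre>
<p>If your IP address changes (for example, on a home connection), you'll need to repeat these steps with the new address.</p>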
<h3 id="heading-automated-tool-installation">Automated Tool Installation</h3>
<p>Now we'll launch the management instance with a user data script that automatically installs all the tools we'll need. User data scripts run automatically when an EC2 instance first boots up, allowing us to fully configure the instance without manual intervention.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create user data script for automatic tool installation</span>
cat &gt; user-data.sh &lt;&lt; <span class="hljs-string">'EOF'</span>
<span class="hljs-comment">#!/bin/bash</span>
yum update -y
yum install -y unzip git

<span class="hljs-comment"># Install AWS CLI v2</span>
curl <span class="hljs-string">"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"</span> -o <span class="hljs-string">"awscliv2.zip"</span>
unzip awscliv2.zip
./aws/install

<span class="hljs-comment"># Install kubectl (reference: https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)</span>
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.33.4/2025-08-20/bin/linux/amd64/kubectl
chmod +x ./kubectl
<span class="hljs-comment"># Install system-wide so every user on the instance finds kubectl on the PATH</span>
cp ./kubectl /usr/local/bin/kubectl

<span class="hljs-comment"># Install eksctl for additional EKS management</span>
curl --silent --location <span class="hljs-string">"https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_<span class="hljs-subst">$(uname -s)</span>_amd64.tar.gz"</span> | tar xz -C /tmp
cp /tmp/eksctl /usr/<span class="hljs-built_in">local</span>/bin

<span class="hljs-comment"># Install helm</span>
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

<span class="hljs-comment"># Install PostgreSQL 13 client (newer version required for RDS SCRAM authentication)</span>
sudo amazon-linux-extras install -y postgresql13

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Management tools installed successfully"</span> &gt; /var/<span class="hljs-built_in">log</span>/setup-complete.log
EOF

<span class="hljs-comment"># Get the latest Amazon Linux 2 AMI ID</span>
<span class="hljs-built_in">export</span> AMI_ID=$(aws ec2 describe-images \
  --owners amazon \
  --filters <span class="hljs-string">"Name=name,Values=amzn2-ami-hvm-*-x86_64-gp2"</span> <span class="hljs-string">"Name=state,Values=available"</span> \
  --query <span class="hljs-string">'Images | sort_by(@, &amp;CreationDate) | [-1].ImageId'</span> \
  --output text)

<span class="hljs-comment"># Launch instance with user data</span>
<span class="hljs-built_in">export</span> INSTANCE_ID=$(aws ec2 run-instances \
  --image-id <span class="hljs-variable">$AMI_ID</span> \
  --count 1 \
  --instance-type t3.micro \
  --subnet-id <span class="hljs-variable">$PUBLIC_SUBNET_1</span> \
  --security-group-ids <span class="hljs-variable">$EC2_SG</span> \
  --iam-instance-profile Name=EKS-Management-Profile \
  --user-data file://user-data.sh \
  --tag-specifications <span class="hljs-string">'ResourceType=instance,Tags=[{Key=Name,Value=EKS-Management},{Key=Environment,Value=Demo}]'</span> \
  --query <span class="hljs-string">'Instances[0].InstanceId'</span> \
  --output text)

<span class="hljs-comment"># Wait for instance to be running</span>
aws ec2 <span class="hljs-built_in">wait</span> instance-running --instance-ids <span class="hljs-variable">$INSTANCE_ID</span>

<span class="hljs-comment"># Get instance public IP</span>
<span class="hljs-built_in">export</span> INSTANCE_IP=$(aws ec2 describe-instances \
  --instance-ids <span class="hljs-variable">$INSTANCE_ID</span> \
  --query <span class="hljs-string">'Reservations[0].Instances[0].PublicIpAddress'</span> \
  --output text)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Management instance is ready. Public IP: <span class="hljs-variable">$INSTANCE_IP</span>"</span>
</code></pre>
<p>Let me walk through what this script does. The user data script starts by updating the system and installing basic utilities like unzip and git. Then it installs the AWS CLI version 2, which we'll use to interact with AWS services from the instance.</p>
<p>Next, we install kubectl, which is the command-line tool for interacting with Kubernetes clusters. We're installing version 1.33.4 to match our EKS cluster version. The script also installs eksctl, which is a higher-level tool for managing EKS clusters, and Helm, which is a package manager for Kubernetes applications.</p>
<p>Finally, we install the PostgreSQL 13 client. This will allow us to connect to the RDS database we'll create later and verify that our pod-level security groups are working correctly. The script writes a completion message to a log file so we can verify later that all tools installed successfully.</p>
<p>When we launch the instance, we're placing it in PUBLIC_SUBNET_1 so it gets a public IP address and can be accessed via SSH. We're attaching the EKS-Management-Profile IAM instance profile, which gives the instance the permissions we configured earlier. We're also using a t3.micro instance type, one of the smallest and cheapest general-purpose instances. It’s perfectly adequate for running kubectl commands and managing the cluster while keeping costs minimal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758425229354/9993f3a8-42d5-4c09-8b22-0749760708ff.png" alt="management ec2 instance running" class="image--center mx-auto" width="1650" height="458" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758425336650/aea9fb9e-07f3-4771-ae3a-feab104621be.png" alt="EC2 Instance management" class="image--center mx-auto" width="2540" height="466" loading="lazy"></p>
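<p>The user data script can take a few minutes to finish after the instance reports as running. If a tool seems to be missing, the cloud-init output log is the first place to look. This is a small sketch to run on the management instance itself once you're connected:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Show the tail of the user data output captured by cloud-init</span>
sudo tail -n 20 /var/log/cloud-init-output.log

<span class="hljs-comment"># The completion marker written by the last line of our script</span>
cat /var/log/setup-complete.log
</code></pre>
<p>If the marker file doesn't exist yet, the script is either still running or stopped at an earlier step, and the log output should show where.</p>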
<h2 id="heading-security-group-configuration">Security Group Configuration</h2>
<p>With our infrastructure and cluster in place, we now need to create and configure the security groups that will control pod-level network access. This is where the real power of Security Groups for Pods comes into play. We'll create security groups that individual pods can use to enforce fine-grained network policies.</p>
<h3 id="heading-retrieving-cluster-network-information">Retrieving Cluster Network Information</h3>
<p>Let's start by connecting to our management instance and gathering the networking details we need from our EKS cluster.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Connect to management instance and get cluster VPC details</span>

<span class="hljs-comment"># verify all installation was successful</span>
cat /var/<span class="hljs-built_in">log</span>/setup-complete.log

<span class="hljs-comment"># Update kubeconfig for your cluster </span>
aws eks update-kubeconfig --name pod-security-cluster-demo --region eu-west-1 

<span class="hljs-comment"># Verify configuration </span>

kubectl get nodes 

<span class="hljs-comment"># Check cluster info </span>
kubectl cluster-info 

<span class="hljs-built_in">export</span> VPC_ID=$(aws eks describe-cluster \
   --name pod-security-cluster-demo \
   --query <span class="hljs-string">"cluster.resourcesVpcConfig.vpcId"</span> \
   --output text)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Cluster VPC ID: <span class="hljs-variable">$VPC_ID</span>"</span>
</code></pre>
<p>First, we're checking the setup completion log to ensure that all our tools installed correctly during the instance's first boot. Then we configure kubectl to communicate with our EKS cluster by updating the kubeconfig file. This command retrieves the cluster endpoint and certificate authority data, storing them in ~/.kube/config so kubectl knows how to authenticate with our cluster.</p>
<p>Running kubectl get nodes should show us our two worker nodes in a Ready state. The kubectl cluster-info command displays the API server endpoint, confirming that we have proper connectivity to the cluster. Finally, we're extracting the VPC ID where our cluster is running. We'll need this ID when creating security groups, since security groups must be associated with a specific VPC.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758431764884/a5f9771d-839e-4bbc-9c5b-ee428f0dde9a.png" alt="retrieve VPC ID" class="image--center mx-auto" width="1370" height="244" loading="lazy"></p>
<h3 id="heading-pod-level-security-group-creation">Pod-Level Security Group Creation</h3>
<p>Now we'll create the security group that specific pods will use when they need database access. This is the security group we'll later assign through a SecurityGroupPolicy.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create security group for pods requiring database access</span>
aws ec2 create-security-group \
   --description <span class="hljs-string">'Pod Security Group - Database Access'</span> \
   --group-name <span class="hljs-string">'POD_SG'</span> \
   --vpc-id <span class="hljs-variable">${VPC_ID}</span>

<span class="hljs-built_in">export</span> POD_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=POD_SG Name=vpc-id,Values=<span class="hljs-variable">${VPC_ID}</span> \
   --query <span class="hljs-string">"SecurityGroups[0].GroupId"</span> --output text)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Pod Security Group ID: <span class="hljs-variable">${POD_SG}</span>"</span>
</code></pre>
<p>This command creates a new security group in our cluster's VPC with a descriptive name and purpose. At this point, the security group has no inbound rules defined. Like every new security group, it starts with a single default outbound rule that allows all egress traffic. We're storing the security group ID in the POD_SG variable because we'll need to reference it multiple times: when creating ingress rules, when setting up the SecurityGroupPolicy, and when verifying our configuration later.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758431859644/2399369f-eb05-44ab-b06d-25b905a0f36d.png" alt="Pod security group created" class="image--center mx-auto" width="1352" height="98" loading="lazy"></p>
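<p>You can inspect the starting rules of the new group directly. This is a minimal check, assuming <code>POD_SG</code> is set as above:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Show the inbound and outbound rules of the freshly created pod security group</span>
aws ec2 describe-security-groups \
  --group-ids <span class="hljs-variable">${POD_SG}</span> \
  --query <span class="hljs-string">'SecurityGroups[0].{Inbound:IpPermissions,Outbound:IpPermissionsEgress}'</span> \
  --output json
</code></pre>
<p>The inbound list should be empty, and the outbound list should contain the default rule that allows all traffic.</p>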
<h3 id="heading-database-security-group-configuration">Database Security Group Configuration</h3>
<p>Next, let's create a dedicated security group for our RDS PostgreSQL database. This security group will strictly control which sources can connect to the database.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create security group for RDS database</span>
aws ec2 create-security-group \
   --description <span class="hljs-string">'RDS Security Group - PostgreSQL Database'</span> \
   --group-name <span class="hljs-string">'RDS_SG'</span> \
   --vpc-id <span class="hljs-variable">${VPC_ID}</span>

<span class="hljs-built_in">export</span> RDS_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=RDS_SG Name=vpc-id,Values=<span class="hljs-variable">${VPC_ID}</span> \
   --query <span class="hljs-string">"SecurityGroups[0].GroupId"</span> --output text)

<span class="hljs-built_in">echo</span> <span class="hljs-string">"RDS Security Group ID: <span class="hljs-variable">${RDS_SG}</span>"</span>
</code></pre>
<p>Similar to the pod security group, we're creating an RDS-specific security group without any inbound rules initially. This security group will be attached to our RDS database instance when we create it. The beauty of this approach is that we can control database access by simply defining which security groups are allowed to communicate with the RDS security group. We don't need to know specific IP addresses: we can allow access based on security group membership instead.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758431821224/c1c6a0b9-1df6-4430-b646-80c356e9f5b8.png" alt="Create and export security group for RDS database " class="image--center mx-auto" width="1298" height="104" loading="lazy"></p>
<h3 id="heading-inter-service-communication-rules">Inter-Service Communication Rules</h3>
<p>Now comes the critical part: configuring the security group rules that will enable the necessary communication between components while maintaining security boundaries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Get cluster's node group security group</span>
<span class="hljs-built_in">export</span> NODE_GROUP_SG=$(aws ec2 describe-security-groups \
   --filters Name=tag:Name,Values=eks-cluster-sg-pod-security-cluster-demo-* Name=vpc-id,Values=<span class="hljs-variable">${VPC_ID}</span> \
   --query <span class="hljs-string">"SecurityGroups[0].GroupId"</span> \
   --output text)

<span class="hljs-comment"># Allow pods with POD_SG to resolve DNS through node group</span>
aws ec2 authorize-security-group-ingress \
   --group-id <span class="hljs-variable">${NODE_GROUP_SG}</span> \
   --protocol tcp \
   --port 53 \
   --source-group <span class="hljs-variable">${POD_SG}</span>

aws ec2 authorize-security-group-ingress \
   --group-id <span class="hljs-variable">${NODE_GROUP_SG}</span> \
   --protocol udp \
   --port 53 \
   --source-group <span class="hljs-variable">${POD_SG}</span>

<span class="hljs-comment"># Allow management instance access to RDS</span>
<span class="hljs-built_in">export</span> MGMT_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=EKS-Management-SG Name=vpc-id,Values=<span class="hljs-variable">${VPC_ID}</span> \
   --query <span class="hljs-string">"SecurityGroups[0].GroupId"</span> --output text)

aws ec2 authorize-security-group-ingress \
   --group-id <span class="hljs-variable">${RDS_SG}</span> \
   --protocol tcp \
   --port 5432 \
   --source-group <span class="hljs-variable">${MGMT_SG}</span>

<span class="hljs-comment"># Allow pods with POD_SG to access RDS (MGMT_SG was already authorized above)</span>
aws ec2 authorize-security-group-ingress \
   --group-id <span class="hljs-variable">${RDS_SG}</span> \
   --protocol tcp \
   --port 5432 \
   --source-group <span class="hljs-variable">${POD_SG}</span>
</code></pre>
<p>Let me explain what each of these rules accomplishes:</p>
<p>First, we're finding the security group that EKS automatically created for our node group. This security group controls traffic to and from the worker nodes themselves.</p>
<p>The first two rules we add allow DNS resolution. When pods with our POD_SG security group need to look up domain names (like our database hostname), they need to query the DNS service that runs on the worker nodes. By allowing both TCP and UDP traffic on port 53 from POD_SG to the node group security group, we ensure that pods with custom security groups can still resolve DNS names. Without these rules, our pods would get ENIs but wouldn't be able to look up any hostnames.</p>
<p>Next, we configure database access rules. We allow the management instance security group to access PostgreSQL port 5432 on the RDS security group. This lets us connect to the database from our management instance to set up test data and verify connectivity.</p>
<p>Most importantly, we allow pods with the POD_SG security group to connect to port 5432 on the RDS security group. This is the rule that will allow our "green pod" (which will be assigned POD_SG) to connect to the database. Notice that we're not allowing the node group security group to access the database - this means that pods without POD_SG cannot connect to the database, even though they're running on the same nodes as pods that can connect.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758432062853/f8fc4c7b-0565-4852-b93a-2a4e997c18f0.png" alt="Configure security group rules" class="image--center mx-auto" width="1484" height="846" loading="lazy"></p>
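<p>To double-check the result, you could list the ingress rules now attached to the RDS security group. This is a minimal sketch rather than part of the original walkthrough: it assumes <code>RDS_SG</code> is still exported in your shell, and the helper name <code>show_rds_ingress</code> is only an illustrative choice.</p>
<pre><code class="lang-bash"># Print each ingress rule on the RDS security group: protocol, port,
# and the security groups that are allowed as sources.
# Assumes RDS_SG was exported in the earlier steps.
show_rds_ingress() {
  aws ec2 describe-security-groups \
    --group-ids "${RDS_SG}" \
    --query 'SecurityGroups[0].IpPermissions[].{Protocol:IpProtocol,Port:FromPort,Sources:UserIdGroupPairs[].GroupId}' \
    --output json
}
</code></pre>
<p>Running <code>show_rds_ingress</code> at this point should report TCP port 5432 with the IDs of <code>MGMT_SG</code> and <code>POD_SG</code> as sources. The node group security group should not appear in the list.</p>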
<h2 id="heading-database-setup">Database Setup</h2>
<p>Now we'll create an Amazon RDS PostgreSQL instance to serve as the protected resource that will demonstrate pod-level access controls. We'll configure the database securely and populate it with test data that we can query from authorized pods.</p>
<h3 id="heading-rds-subnet-group-creation">RDS Subnet Group Creation</h3>
<p>Before creating the RDS instance, we need to define where it can be placed by creating a DB subnet group.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create DB subnet group spanning private subnets</span>
aws rds create-db-subnet-group \
   --db-subnet-group-name rds-ekslab \
   --db-subnet-group-description <span class="hljs-string">"Subnet group for EKS lab RDS instance"</span> \
   --subnet-ids <span class="hljs-variable">${PRIVATE_SUBNET_1}</span> <span class="hljs-variable">${PRIVATE_SUBNET_2}</span>
</code></pre>
<p>A DB subnet group tells RDS which subnets it can use when launching a database instance. We're including both of our private subnets, which serves two important purposes. First, it ensures the database is never exposed directly to the internet. It will only be reachable from within our VPC. Second, it enables multi-AZ deployment if we wanted to add high availability later, since RDS would be able to place a standby replica in the second availability zone.</p>
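<p>If you want to confirm that the subnet group really spans two Availability Zones, a small check like the following may help. It is a sketch under the assumption that the group is named <code>rds-ekslab</code> as above; the function name <code>show_db_subnets</code> is hypothetical.</p>
<pre><code class="lang-bash"># Show each subnet in the DB subnet group together with its Availability Zone.
show_db_subnets() {
  aws rds describe-db-subnet-groups \
    --db-subnet-group-name rds-ekslab \
    --query 'DBSubnetGroups[0].Subnets[].[SubnetIdentifier,SubnetAvailabilityZone.Name]' \
    --output text
}
</code></pre>
<p>Calling <code>show_db_subnets</code> should print one line per private subnet, each with a different Availability Zone.</p>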
<h3 id="heading-secure-password-generation">Secure Password Generation</h3>
<p>Let's generate a cryptographically secure password for our database. This is much safer than using a predictable or manually chosen password.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Generate cryptographically secure password</span>
<span class="hljs-built_in">export</span> RDS_PASSWORD=$(openssl rand -base64 32 | tr -d <span class="hljs-string">"=+/"</span> | cut -c1-25)
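<span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$RDS_PASSWORD</span>"</span> &gt; .rds_password
chmod 600 .rds_password  <span class="hljs-comment"># readable only by the current user</span>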
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Generated secure RDS password"</span>
</code></pre>
<p>Here's what this command does step by step. First, <code>openssl rand -base64 32</code> generates 32 bytes of random data and encodes it in base64 format. The <code>tr</code> command removes characters that might cause issues in connection strings (equals signs, plus signs, and forward slashes); RDS also rejects forward slashes in master passwords. Finally, we truncate the result to 25 characters, which is comfortably within the 8 to 128 characters RDS accepts for a PostgreSQL master password. We save this password to a file so we can retrieve it later when connecting to the database.</p>
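<p>As a quick sanity check, the pipeline can be wrapped in a function and its output validated before it is handed to RDS. This is only a sketch: <code>generate_password</code> and <code>check_password</code> are hypothetical helper names, not part of any AWS tooling.</p>
<pre><code class="lang-bash"># Same pipeline as above, wrapped in a function.
generate_password() {
  openssl rand -base64 32 | tr -d "=+/" | cut -c1-25
}

# Confirm a candidate password has the shape we expect.
check_password() {
  local pw="$1"
  # Expect exactly 25 characters.
  if [ "${#pw}" -ne 25 ]; then
    echo "unexpected length: ${#pw}"
    return 1
  fi
  # Only letters and digits should remain once =, + and / are stripped.
  case "$pw" in
    *[!A-Za-z0-9]*) echo "unexpected character"; return 1 ;;
  esac
  echo "ok"
}
</code></pre>
<p>For example, <code>check_password "$(generate_password)"</code> should print <code>ok</code>.</p>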
<h3 id="heading-rds-instance-configuration">RDS Instance Configuration</h3>
<p>Now we'll create the actual PostgreSQL database instance with security-focused configuration.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create PostgreSQL RDS instance</span>
aws rds create-db-instance \
   --db-instance-identifier rds-ekslab \
   --db-instance-class db.t3.micro \
   --engine postgres \
   --master-username postgres \
   --master-user-password <span class="hljs-variable">${RDS_PASSWORD}</span> \
   --allocated-storage 20 \
   --vpc-security-group-ids <span class="hljs-variable">${RDS_SG}</span> \
   --db-subnet-group-name rds-ekslab \
   --no-publicly-accessible \
   --backup-retention-period 0 \
   --storage-type gp2

<span class="hljs-comment"># Wait for database to become available</span>
aws rds <span class="hljs-built_in">wait</span> db-instance-available --db-instance-identifier rds-ekslab
</code></pre>
<p>Let me walk through these configuration choices. We're using db.t3.micro, one of the smallest instance classes available. It’s well suited to our demonstration while keeping costs minimal. The engine is PostgreSQL, which is a robust open-source relational database that works well for demonstrating network connectivity.</p>
<p>The <code>--vpc-security-group-ids</code> parameter attaches our RDS_SG security group to the database. This is what enforces our access rules: only sources allowed by the security group rules we created earlier will be able to connect.</p>
<p>The <code>--no-publicly-accessible</code> flag is crucial for security. This ensures the database doesn't get a public IP address and can't be reached from the internet. Combined with our private subnet placement, this creates multiple layers of network security.</p>
<p>We're setting <code>--backup-retention-period</code> to 0 because this is a demonstration environment and we don't need automated backups. In a production environment, you would want automated backups enabled. The <code>--storage-type gp2</code> option specifies general-purpose SSD storage, which provides good performance at reasonable cost.</p>
<p>The <code>wait</code> command at the end blocks until the database is fully available, which typically takes 5-10 minutes. During this time, RDS is provisioning the database instance, configuring storage, setting up the master user, and performing initial system setup.</p>
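<p>Once the wait returns, it can be reassuring to confirm the settings that matter for this demonstration. The following is a minimal sketch that assumes the instance identifier <code>rds-ekslab</code> used above; <code>show_rds_summary</code> is a hypothetical helper name.</p>
<pre><code class="lang-bash"># Summarize the new instance: status, public accessibility,
# and the security groups attached to it.
show_rds_summary() {
  aws rds describe-db-instances \
    --db-instance-identifier rds-ekslab \
    --query 'DBInstances[0].{Status:DBInstanceStatus,Public:PubliclyAccessible,SecurityGroups:VpcSecurityGroups[].VpcSecurityGroupId}' \
    --output json
}
</code></pre>
<p>You would expect <code>show_rds_summary</code> to report a status of <code>available</code>, <code>Public</code> set to <code>false</code>, and a single security group ID that matches <code>RDS_SG</code>.</p>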
<h3 id="heading-database-initialization">Database Initialization</h3>
<p>Once the database is available, we need to connect to it and create some test data that will help us verify connectivity from our pods later.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Connect to the management instance and create test data</span>
<span class="hljs-built_in">export</span> RDS_PASSWORD=3aboiP3vKjmfNkWKRF6PXBCro <span class="hljs-comment">#replace</span>
<span class="hljs-built_in">echo</span> <span class="hljs-variable">$RDS_PASSWORD</span> &gt; .rds_password

<span class="hljs-built_in">export</span> RDS_ENDPOINT=$(aws rds describe-db-instances \
   --db-instance-identifier rds-ekslab \
   --query <span class="hljs-string">'DBInstances[0].Endpoint.Address'</span> \
   --output text)

<span class="hljs-comment"># Connect to database and create test table</span>
PGPASSWORD=<span class="hljs-variable">${RDS_PASSWORD}</span> psql -h <span class="hljs-variable">${RDS_ENDPOINT}</span> -U postgres -d postgres &lt;&lt; EOF
CREATE TABLE IF NOT EXISTS test_data (
    id SERIAL PRIMARY KEY,
    message TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO test_data (message) VALUES 
    (<span class="hljs-string">'Hello from authorized pod!'</span>),
    (<span class="hljs-string">'Security groups for pods working correctly!'</span>),
    (<span class="hljs-string">'Fine-grained network access control demonstrated.'</span>);

SELECT * FROM test_data;
EOF
</code></pre>
<p>Here's what we're doing in this initialization script. First, we retrieve the database endpoint hostname from AWS. This is the DNS name we'll use to connect to the database. Then we use the psql command-line tool to connect to PostgreSQL. We pass the password via the <code>PGPASSWORD</code> environment variable, which is a standard way to provide passwords to psql without interactive prompts.</p>
<p>Inside the SQL commands, we create a simple table called <code>test_data</code> with three columns: an auto-incrementing ID, a message text field, and a timestamp that defaults to the current time. We insert three test messages that we'll query later from our pods to verify connectivity. Finally, we select all rows to confirm the data was inserted successfully.</p>
<p>When you run this, you should see the three messages displayed, confirming the database is set up and accessible from the management instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758436248533/9176d2ce-f540-458d-8e6a-bbc1ab4f9d62.png" alt="create database table" class="image--center mx-auto" width="1278" height="306" loading="lazy"></p>
<h2 id="heading-cni-plugin-configuration">CNI Plugin Configuration</h2>
<p>Now we need to configure the AWS VPC CNI plugin to enable the pod-level ENI assignment and branch networking functionality. This is a crucial step that activates the underlying technology that makes Security Groups for Pods possible.</p>
<h3 id="heading-enabling-pod-eni-support">Enabling Pod ENI Support</h3>
<p>We'll activate the feature flag that tells the VPC CNI plugin to support dedicated ENIs for pods.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Enable pod ENI feature on AWS VPC CNI</span>
kubectl -n kube-system <span class="hljs-built_in">set</span> env daemonset aws-node ENABLE_POD_ENI=<span class="hljs-literal">true</span>
kubectl -n kube-system <span class="hljs-built_in">set</span> env ds/aws-node ENABLE_POD_ENI=<span class="hljs-literal">true</span>

<span class="hljs-comment"># Restart CNI pods to apply configuration</span>
kubectl -n kube-system rollout restart daemonset aws-node
kubectl -n kube-system rollout status daemonset aws-node
kubectl -n kube-system rollout restart ds/aws-node
kubectl -n kube-system rollout status ds/aws-node
</code></pre>
<p>Let me explain what's happening here. The aws-node DaemonSet runs the VPC CNI plugin on every worker node in your cluster. This plugin is responsible for assigning IP addresses to pods and configuring their network interfaces. By setting the <code>ENABLE_POD_ENI</code> environment variable to true, we're telling the CNI plugin to support branch networking mode.</p>
<p>When this feature is enabled, the CNI plugin will watch for pods that have SecurityGroupPolicy rules applied to them. For these pods, instead of just assigning an IP address from one of the node's shared ENIs, the plugin will work with the VPC Resource Controller to provision a dedicated branch ENI. This dedicated ENI can then have its own security groups attached, independent of the node's security groups.</p>
<p>The <code>rollout restart</code> command forces all the aws-node pods to restart with the new configuration. The <code>rollout status</code> command then waits for the restart to complete successfully across all nodes. This typically takes a minute or two as each node's CNI pod is restarted in a rolling fashion.</p>
<h3 id="heading-verification-and-troubleshooting">Verification and Troubleshooting</h3>
<p>After enabling the feature, let's verify that everything is configured correctly and that our nodes are ready to support pod ENIs.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Verify CNI configuration</span>
kubectl -n kube-system get daemonset aws-node -o yaml | grep -A 5 -B 5 ENABLE_POD_ENI

<span class="hljs-comment"># Check node ENI capacity</span>
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_ENI:.status.allocatable.vpc\\.amazonaws\\.com/pod-eni

<span class="hljs-comment"># Verify trunk ENI creation on nodes</span>
NODE_ID=$(kubectl get nodes -o jsonpath=<span class="hljs-string">'{.items[0].spec.providerID}'</span> | cut -d<span class="hljs-string">'/'</span> -f5)
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=<span class="hljs-variable">$NODE_ID</span> Name=interface-type,Values=trunk \
  --query <span class="hljs-string">'NetworkInterfaces[*].NetworkInterfaceId'</span>

<span class="hljs-comment"># Check CNI logs for errors</span>
kubectl -n kube-system logs -l k8s-app=aws-node --tail=20
</code></pre>
<p>These verification commands help us confirm that the feature is working as expected. The first command checks that the <code>ENABLE_POD_ENI</code> environment variable is properly set in the DaemonSet configuration. You should see the value set to "true" in the output.</p>
<p>The second command displays the pod-eni capacity for each node. This shows how many pods with dedicated ENIs each node can support. For our m5.large instances, you should see a number like "9" or similar, indicating that each node can support that many pods with custom security groups.</p>
<p>The third command looks for trunk ENIs on one of our nodes. When branch networking is enabled, the VPC Resource Controller attaches a special "trunk" ENI to each supported node, which serves as the anchor point for branch ENIs. If you see a network interface ID returned here, it confirms that trunk networking is properly configured.</p>
<p>Finally, we check the CNI plugin logs for any errors. If everything is working correctly, you shouldn't see any error messages. If there are problems, the logs will typically contain helpful information about what went wrong – perhaps permission issues, insufficient ENI capacity, or configuration problems.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758981390663/faa2d5ef-f6c8-4ac1-926a-aa5b8186add8.png" alt="CNI Verifications" class="image--center mx-auto" width="2764" height="146" loading="lazy"></p>
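<p>The <code>cut</code> step in the trunk ENI check is easy to misread, so here it is in isolation. This is a small sketch; the function name and the sample providerID (including its instance ID) are made up for illustration.</p>
<pre><code class="lang-bash"># Extract the EC2 instance ID from a Kubernetes node providerID.
# On EKS a providerID looks like: aws:///us-west-2a/i-0123456789abcdef0
# Splitting on "/" gives: "aws:", "", "", "us-west-2a", "i-...", so field 5 is the ID.
instance_id_from_provider_id() {
  echo "$1" | cut -d'/' -f5
}

instance_id_from_provider_id "aws:///us-west-2a/i-0123456789abcdef0"
# prints: i-0123456789abcdef0
</code></pre>
<p>If the trunk ENI query returns nothing, printing <code>NODE_ID</code> first is a quick way to rule out a parsing problem before looking at the CNI logs.</p>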
<h2 id="heading-security-policies-implementation">Security Policies Implementation</h2>
<p>With our infrastructure ready and the CNI plugin configured, we can now create the SecurityGroupPolicy resources that define which pods should receive which security groups. This is where we bridge the gap between Kubernetes pod identity (labels) and AWS network security (security groups).</p>
<h3 id="heading-namespace-and-context-setup">Namespace and Context Setup</h3>
<p>Let's start by creating a dedicated namespace for our demonstration resources. This helps keep things organized and makes cleanup easier.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create dedicated namespace for demonstration</span>
kubectl create namespace networking
kubectl config set-context $(kubectl config current-context) --namespace=networking

<span class="hljs-comment"># Verify namespace creation</span>
kubectl get namespaces
</code></pre>
<p>Using a dedicated namespace provides several benefits for our demonstration. First, it isolates our demo resources from system components in the kube-system namespace and from any other applications that might be running. Second, it makes cleanup straightforward – we can delete the entire namespace later to remove all associated resources at once. Third, it provides a scope for our security policies, making it clear which resources they apply to.</p>
<p>The <code>config set-context</code> command changes your default namespace so that subsequent kubectl commands will operate in the networking namespace by default. This saves you from having to specify <code>-n networking</code> with every command.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758438029420/17763b08-5928-49ad-9033-0ea6986abe00.png" alt="Create k8 namespace and context setup" class="image--center mx-auto" width="2078" height="546" loading="lazy"></p>
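<p>Because several later commands rely on this default namespace, it can be worth confirming which namespace your context currently points at. The helper below is a sketch; <code>current_namespace</code> is a hypothetical name.</p>
<pre><code class="lang-bash"># Print the namespace configured for the current kubectl context.
current_namespace() {
  kubectl config view --minify --output 'jsonpath={..namespace}'
}
</code></pre>
<p>After the step above, <code>current_namespace</code> should print <code>networking</code>. When you finish the demonstration, <code>kubectl config set-context --current --namespace=default</code> switches the context back.</p>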
<h3 id="heading-securitygrouppolicy-resource-creation">SecurityGroupPolicy Resource Creation</h3>
<p>Now we'll create the SecurityGroupPolicy custom resource that tells the system which pods should get our POD_SG security group.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Export POD sg</span>
<span class="hljs-built_in">export</span> VPC_ID=$(aws eks describe-cluster \
   --name pod-security-cluster-demo \
   --query <span class="hljs-string">"cluster.resourcesVpcConfig.vpcId"</span> \
   --output text)

<span class="hljs-built_in">export</span> POD_SG=$(aws ec2 describe-security-groups \
   --filters Name=group-name,Values=POD_SG Name=vpc-id,Values=<span class="hljs-variable">${VPC_ID}</span> \
   --query <span class="hljs-string">"SecurityGroups[0].GroupId"</span> --output text)

<span class="hljs-comment"># Verify SecurityGroupPolicy CRD exists</span>
kubectl get crd securitygrouppolicies.vpcresources.k8s.aws

<span class="hljs-comment"># Create security group policy</span>
cat &lt;&lt; EOF &gt; sg-per-pod-policy.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: allow-rds-access
  namespace: networking
spec:
  podSelector:
    matchLabels:
      app: green-pod
  securityGroups:
    groupIds:
      - <span class="hljs-variable">${POD_SG}</span>
EOF


kubectl apply -f sg-per-pod-policy.yaml

<span class="hljs-comment"># Verify policy creation</span>
kubectl -n networking get securitygrouppolicies
kubectl -n networking describe securitygrouppolicy allow-rds-access
</code></pre>
<p>Let me walk you through what this SecurityGroupPolicy does. The <code>podSelector</code> section uses Kubernetes label selectors to identify which pods should receive the security group. In this case, we're matching any pod with the label <code>app: green-pod</code>. This is standard Kubernetes label selector syntax, so you can use more complex selectors if needed (such as multiple labels or <code>matchExpressions</code>).</p>
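<p>To illustrate a more complex selector, here is a sketch of a policy that combines a label match with a set-based expression. It is not part of this walkthrough and is not applied anywhere: the file name, the policy name, and the <code>blue-pod</code> and <code>tier: backend</code> labels are all hypothetical.</p>
<pre><code class="lang-bash"># Hypothetical policy file; nothing here is applied to the cluster.
# The labels "blue-pod" and "tier: backend" are illustrative only.
cat &lt;&lt; EOF &gt; sg-policy-expressions.yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: allow-rds-access-expressions
  namespace: networking
spec:
  podSelector:
    matchLabels:
      tier: backend
    matchExpressions:
      - key: app
        operator: In
        values:
          - green-pod
          - blue-pod
  securityGroups:
    groupIds:
      - ${POD_SG}
EOF
</code></pre>
<p>As with any Kubernetes label selector, <code>matchLabels</code> and <code>matchExpressions</code> are combined with a logical AND, so this variant would only match pods labeled <code>tier: backend</code> whose <code>app</code> label is either <code>green-pod</code> or <code>blue-pod</code>.</p>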
<p>The <code>securityGroups</code> section lists the AWS security group IDs that should be attached to matching pods. When a pod with the label <code>app: green-pod</code> is created in the networking namespace, the VPC Resource Controller sees it matches this policy. The controller then provisions a dedicated ENI for that pod and attaches our <code>POD_SG</code> security group to that ENI.</p>
<p>It's important to understand that this policy doesn't immediately change anything: it creates a rule that applies to pods created from now on. Pods that are already running keep their existing network setup until they are recreated. When you later create a pod with matching labels, that's when the ENI provisioning and security group attachment happens.</p>
<p>The verify commands at the end confirm that the SecurityGroupPolicy was created successfully and show its current status. You should see the policy listed with details about the pod selector and security groups.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758438517423/29876b46-8adc-49d1-8542-f6895a0e44ef.png" alt="Security group policies created" class="image--center mx-auto" width="1574" height="736" loading="lazy"></p>
<h2 id="heading-testing-and-validation">Testing and Validation</h2>
<p>Now comes the exciting part. We'll create two pods to demonstrate that our Security Groups for Pods implementation is working correctly. One pod will have the matching label and should be able to access the database, while the other pod won't have the label and should be blocked.</p>
<h3 id="heading-kubernetes-secrets-for-database-connectivity">Kubernetes Secrets for Database Connectivity</h3>
<p>First, we need to securely store our database connection credentials using Kubernetes secrets. This is a security best practice that keeps sensitive information out of pod specifications.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create secret with RDS connection details</span>
<span class="hljs-built_in">export</span> RDS_PASSWORD=$(cat .rds_password)
<span class="hljs-built_in">export</span> RDS_ENDPOINT=$(aws rds describe-db-instances \
   --db-instance-identifier rds-ekslab \
   --query <span class="hljs-string">'DBInstances[0].Endpoint.Address'</span> \
   --output text)

kubectl -n networking create secret generic rds \
  --from-literal=password=<span class="hljs-string">"<span class="hljs-variable">${RDS_PASSWORD}</span>"</span> \
  --from-literal=host=<span class="hljs-string">"<span class="hljs-variable">${RDS_ENDPOINT}</span>"</span> \
  --from-literal=username=postgres \
  --from-literal=database=postgres \
  --dry-run=client -o yaml | kubectl apply -f -

<span class="hljs-comment"># Verify secret creation</span>
kubectl -n networking describe secret rds
</code></pre>
<p>Here's what we're doing with this secret creation. We're retrieving the password we generated earlier and the database endpoint hostname, then storing them in a Kubernetes secret along with the username and database name. The secret is created in the networking namespace where our test pods will run.</p>
<p>The <code>--dry-run=client -o yaml | kubectl apply -f -</code> pattern is a common Kubernetes technique that makes the command idempotent. If the secret already exists, it updates it rather than failing. This is useful when you need to run the command multiple times during testing or troubleshooting.</p>
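<p>When pods reference this secret, Kubernetes will inject the values as environment variables or mount them as files, depending on how you configure the pod. The sensitive data never appears in the pod specification. Keep in mind that Secret values are only base64-encoded in the Kubernetes API, not encrypted by Kubernetes itself by default. On EKS, the etcd storage volumes are encrypted at rest, and you can additionally enable KMS envelope encryption for secrets.</p>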
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758438556701/9aaa6f4e-7392-46f7-9624-0515494ef049.png" alt="create and verify k8 networking secrets" class="image--center mx-auto" width="1156" height="530" loading="lazy"></p>
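<p>Secret values are stored base64-encoded, so it can be useful to decode one and confirm it holds what you expect. The helper below is a sketch: <code>read_secret_key</code> is a hypothetical name, and it assumes the secret is called <code>rds</code> in the <code>networking</code> namespace as created above.</p>
<pre><code class="lang-bash"># Read one key from the rds secret and decode it from base64.
# Usage: read_secret_key host
read_secret_key() {
  kubectl -n networking get secret rds -o "jsonpath={.data.$1}" | base64 --decode
}
</code></pre>
<p>For example, <code>read_secret_key host</code> should print the RDS endpoint. Anyone with permission to read secrets in the namespace can do the same, which is a good reason to keep that permission tightly scoped.</p>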
<h3 id="heading-green-pod-authorized-database-access">Green Pod (Authorized Database Access)</h3>
<p>Now let's create our green pod – the pod that has the matching label and should successfully connect to the database.</p>
<pre><code class="lang-bash">cat &lt;&lt; EOF &gt; green-pod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: green-pod
  namespace: networking
  labels:
    app: green-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: green-pod
  template:
    metadata:
      labels:
        app: green-pod
    spec:
      containers:
      - name: postgres-client
        image: postgres:13-alpine
        env:
        - name: PGHOST
          valueFrom:
            secretKeyRef:
              name: rds
              key: host
        - name: PGPASSWORD
          valueFrom:
            secretKeyRef:
              name: rds
              key: password
        - name: PGUSER
          valueFrom:
            secretKeyRef:
              name: rds
              key: username
        - name: PGDATABASE
          valueFrom:
            secretKeyRef:
              name: rds
              key: database
        - name: PGSSLMODE
          value: require
        <span class="hljs-built_in">command</span>: [<span class="hljs-string">"/bin/sh"</span>]
        args:
        - -c
        - |
          <span class="hljs-built_in">echo</span> <span class="hljs-string">"Green pod starting - should have database access..."</span>
          <span class="hljs-built_in">echo</span> <span class="hljs-string">"Attempting to connect to database at \$PGHOST"</span>
          <span class="hljs-keyword">if</span> psql -c <span class="hljs-string">"SELECT version();"</span> 2&gt;/dev/null; <span class="hljs-keyword">then</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"SUCCESS: Connected to PostgreSQL!"</span>
            psql -c <span class="hljs-string">"SELECT version();"</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"Test data from database:"</span>
            psql -c <span class="hljs-string">"SELECT id, message FROM test_data ORDER BY id;"</span>
          <span class="hljs-keyword">else</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"ERROR: Could not connect to database"</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"This indicates security group configuration issues"</span>
          <span class="hljs-keyword">fi</span>
          <span class="hljs-built_in">echo</span> <span class="hljs-string">"Sleeping to keep container running..."</span>
          sleep 3600
        resources:
          limits:
            memory: <span class="hljs-string">"128Mi"</span>
            cpu: <span class="hljs-string">"100m"</span>
          requests:
            memory: <span class="hljs-string">"64Mi"</span>
            cpu: <span class="hljs-string">"50m"</span>
EOF

<span class="hljs-comment"># Deploy green pod</span>
kubectl apply -f green-pod.yaml
kubectl rollout status deployment green-pod

<span class="hljs-comment"># Get pod name and check logs</span>
<span class="hljs-built_in">export</span> GREEN_POD_NAME=$(kubectl get pods -l app=green-pod -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Green pod: <span class="hljs-variable">$GREEN_POD_NAME</span>"</span>
kubectl logs <span class="hljs-variable">$GREEN_POD_NAME</span>
</code></pre>
<p>Let me explain what makes this pod special and why it should work. The key is the label <code>app: green-pod</code> in the pod template's metadata section. This label matches our SecurityGroupPolicy selector, so when this pod is created, the VPC Resource Controller will provision a dedicated ENI for it and attach the POD_SG security group.</p>
<p>The pod uses environment variables sourced from our Kubernetes secret to get the database connection details. PostgreSQL's command-line tools (like psql) automatically use these environment variables when set with the PG prefix. This means we don't need to specify connection parameters explicitly – the tools just work.</p>
<p>The startup script in the command section attempts to connect to the database and run a simple query. If the security groups are working correctly, the connection should succeed because this pod's ENI has the POD_SG security group, which is allowed to connect to port 5432 on the RDS security group. The script then queries our test_data table to display the messages we inserted earlier.</p>
<p>The green pod's logs should look similar to this (abridged; the version string depends on your RDS engine version):</p>
<pre><code class="lang-bash">Green pod starting - should have database access...
Attempting to connect to database at rds-ekslab.xxxxx.us-west-2.rds.amazonaws.com
SUCCESS: Connected to PostgreSQL!
 PostgreSQL 13.x on x86_64-pc-linux-gnu...

Test data from database:
 id |                      message
----+---------------------------------------------------
  1 | Hello from authorized pod!
  2 | Security groups for pods working correctly!
  3 | Fine-grained network access control demonstrated.
(3 rows)

Sleeping to keep container running...
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758967069726/d439b2da-64bc-4074-95ca-51ed02263e68.png" alt="Verify access to green pod" class="image--center mx-auto" width="1678" height="874" loading="lazy"></p>
<h3 id="heading-red-pod-unauthorized-database-access">Red Pod (Unauthorized Database Access)</h3>
<p>Now let's create the red pod – a pod without the matching label that should be blocked from accessing the database.</p>
<pre><code class="lang-bash">cat &lt;&lt; EOF &gt; red-pod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: red-pod
  namespace: networking
  labels:
    app: red-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: red-pod
  template:
    metadata:
      labels:
        app: red-pod
    spec:
      containers:
      - name: postgres-client
        image: postgres:13-alpine
        env:
        - name: PGHOST
          valueFrom:
            secretKeyRef:
              name: rds
              key: host
        - name: PGPASSWORD
          valueFrom:
            secretKeyRef:
              name: rds
              key: password
        - name: PGUSER
          valueFrom:
            secretKeyRef:
              name: rds
              key: username
        - name: PGDATABASE
          valueFrom:
            secretKeyRef:
              name: rds
              key: database
        - name: PGSSLMODE
          value: require
        <span class="hljs-built_in">command</span>: [<span class="hljs-string">"/bin/sh"</span>]
        args:
        - -c
        - |
          <span class="hljs-built_in">echo</span> <span class="hljs-string">"Red pod starting - should NOT have database access..."</span>
          <span class="hljs-built_in">echo</span> <span class="hljs-string">"Attempting to connect to database at \$PGHOST"</span>

          <span class="hljs-comment"># Test database connection (should fail)</span>
          <span class="hljs-keyword">if</span> psql -c <span class="hljs-string">"SELECT version();"</span> 2&gt;/dev/null; <span class="hljs-keyword">then</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"UNEXPECTED: Connected to database!"</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"This suggests security group policy is not working correctly"</span>
          <span class="hljs-keyword">else</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"EXPECTED: Could not connect to database"</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"This is correct - red pod should not have database access"</span>
            <span class="hljs-built_in">echo</span> <span class="hljs-string">"Security groups for pods is working properly!"</span>
          <span class="hljs-keyword">fi</span>

          <span class="hljs-comment"># Keep container running for inspection</span>
          <span class="hljs-built_in">echo</span> <span class="hljs-string">"Sleeping to keep container running..."</span>
          sleep 3600
        resources:
          limits:
            memory: <span class="hljs-string">"128Mi"</span>
            cpu: <span class="hljs-string">"100m"</span>
          requests:
            memory: <span class="hljs-string">"64Mi"</span>
            cpu: <span class="hljs-string">"50m"</span>
EOF

<span class="hljs-comment"># Deploy red pod</span>
kubectl apply -f red-pod.yaml
kubectl rollout status deployment red-pod

<span class="hljs-comment"># Get pod name and check logs</span>
<span class="hljs-built_in">export</span> RED_POD_NAME=$(kubectl get pods -l app=red-pod -o jsonpath=<span class="hljs-string">'{.items[0].metadata.name}'</span>)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Red pod: <span class="hljs-variable">$RED_POD_NAME</span>"</span>
kubectl logs <span class="hljs-variable">$RED_POD_NAME</span>
</code></pre>
<p>The red pod is intentionally configured almost identically to the green pod: same container image, same database credentials, same connection attempt. The only significant difference is the label: this pod has <code>app: red-pod</code> instead of <code>app: green-pod</code>.</p>
<p>Because this pod's label doesn't match our SecurityGroupPolicy, the VPC Resource Controller won't provision a dedicated ENI for it. Instead, this pod will use the node's primary network interface and inherit the node's security group. Since we specifically didn't add a rule allowing the node security group to access the RDS security group, this pod's connection attempts should be blocked at the network level.</p>
<p>The expected output from the red pod logs should look like this:</p>
<pre><code class="lang-bash">=== RED POD STARTING ===
This pod should NOT have database access
Attempting connection to: rds-ekslab.xxxxx.us-west-2.rds.amazonaws.com
EXPECTED: Could not connect to database
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758439888619/41226f05-a51a-4c34-bb57-9d0def923a06.png" alt="Unauthorized access to red pod" class="image--center mx-auto" width="1800" height="244" loading="lazy"></p>
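<p>A failed <code>psql</code> call can mean either a network block or a rejected login. One rough way to tell them apart is a plain TCP probe run from inside each pod. The sketch below is an addition to the tutorial, not part of its manifests: <code>tcp_probe</code> is a hypothetical helper name, it assumes the container image ships bash (for the <code>/dev/tcp</code> redirection) and the coreutils <code>timeout</code> command, and port 5432 is only the PostgreSQL default.</p>

```shell
# Hypothetical helper: report whether a TCP connection to host:port opens
# within a time limit. Prints "open" or "unreachable" and always returns 0,
# so the result can be captured with $(...).
tcp_probe() {
  local host="$1" port="$2" limit="${3:-5}"
  # bash treats /dev/tcp/HOST/PORT as a socket. timeout bounds the attempt,
  # because packets dropped by a security group are never answered.
  if timeout "$limit" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  else
    echo "unreachable"
  fi
}

# Example (inside a pod; PGHOST is assumed to hold the RDS endpoint):
#   tcp_probe "$PGHOST" 5432
```

<p>If the green pod reports <code>open</code> and the red pod reports <code>unreachable</code>, the difference is coming from the network path rather than from the database credentials.</p>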
<h3 id="heading-eni-assignment-verification">ENI Assignment Verification</h3>
<p>Let's verify that the green pod actually received a dedicated ENI while the red pod did not.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check green pod ENI assignment</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Green Pod ENI Assignment ==="</span>
kubectl describe pod <span class="hljs-variable">$GREEN_POD_NAME</span> | grep -A 3 -B 3 <span class="hljs-string">"vpc.amazonaws.com/pod-eni"</span>
kubectl get pod <span class="hljs-variable">$GREEN_POD_NAME</span> -o yaml | grep -A 5 <span class="hljs-string">"annotations:"</span> | grep <span class="hljs-string">"vpc.amazonaws.com"</span>

<span class="hljs-comment"># Check red pod networking (should use node networking)</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Red Pod Networking ==="</span>
kubectl describe pod <span class="hljs-variable">$RED_POD_NAME</span> | grep <span class="hljs-string">"vpc.amazonaws.com"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"No dedicated ENI (expected for red pod)"</span>

<span class="hljs-comment"># Verify ENI creation in AWS</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== AWS ENI Verification ==="</span>
aws ec2 describe-network-interfaces \
  --filters Name=description,Values=<span class="hljs-string">"*pod-eni*"</span> \
  --query <span class="hljs-string">'NetworkInterfaces[*].[NetworkInterfaceId,Description,Groups[0].GroupId]'</span> \
  --output table

<span class="hljs-comment"># Compare pod IP addresses</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Pod IP Comparison ==="</span>
kubectl get pods -o wide
</code></pre>
<p>These verification commands help us understand what's happening at the infrastructure level. When you check the green pod's description, you should see annotations like <code>vpc.amazonaws.com/pod-eni</code> that indicate a dedicated ENI was assigned. The annotation will contain the ENI ID and other networking details.</p>
<p>For the red pod, you won't see these annotations because it's using the node's primary network interface instead of a dedicated ENI. This is the expected behavior.</p>
<p>The AWS CLI command queries EC2 for network interfaces with "pod-eni" in the description. This should return the ENI(s) that were created for pods with SecurityGroupPolicy assignments. You'll see the network interface ID, its description, and importantly, the security group ID (which should match our POD_SG).</p>
<p>When you run <code>kubectl get pods -o wide</code>, you can see the IP addresses assigned to each pod. Both pods will have IP addresses from your VPC's CIDR range, but they're coming from different network interfaces at the infrastructure level.</p>
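<p>The annotation check can also be wrapped in a small function so that both pods are reported in the same format. This is a minimal sketch: <code>describe_pod_eni</code> is a hypothetical name, and it assumes only that the annotation value contains an ID of the form <code>eni-…</code> when a dedicated interface was assigned.</p>

```shell
# Hypothetical helper: given the raw value of a pod's
# vpc.amazonaws.com/pod-eni annotation, report which kind of interface
# the pod is using.
describe_pod_eni() {
  local annotation="$1" eni_id
  # Pull the first ENI ID (eni- followed by hex digits) out of the value.
  # "|| true" keeps an empty match from aborting scripts that use set -e.
  eni_id=$(printf '%s' "$annotation" | grep -oE 'eni-[0-9a-f]+' | head -n 1 || true)
  if [ -n "$eni_id" ]; then
    echo "dedicated ENI: $eni_id"
  else
    echo "no dedicated ENI (node interface)"
  fi
}

# Example: read the annotation with kubectl, then classify it.
#   value=$(kubectl get pod "$GREEN_POD_NAME" \
#     -o jsonpath='{.metadata.annotations.vpc\.amazonaws\.com/pod-eni}')
#   describe_pod_eni "$value"
```

<p>Running the same two lines with <code>$RED_POD_NAME</code> should print the "no dedicated ENI" message, because the annotation is absent and the jsonpath query returns an empty string.</p>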
<h3 id="heading-troubleshooting-common-issues">Troubleshooting Common Issues</h3>
<p>If things aren't working as expected, here are some diagnostic commands for resolving common implementation problems:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># If green pod cannot connect to database:</span>

<span class="hljs-comment"># 1. Verify security group rules</span>
aws ec2 describe-security-groups --group-ids <span class="hljs-variable">$POD_SG</span> <span class="hljs-variable">$RDS_SG</span>

<span class="hljs-comment"># 2. Check if POD_SG has access to RDS_SG</span>
aws ec2 describe-security-groups --group-ids <span class="hljs-variable">$RDS_SG</span> --query <span class="hljs-string">'SecurityGroups[0].IpPermissions'</span>

<span class="hljs-comment"># 3. Verify ENI assignment</span>
kubectl describe pod <span class="hljs-variable">$GREEN_POD_NAME</span> | grep -E <span class="hljs-string">"(Events|vpc.amazonaws.com)"</span>

<span class="hljs-comment"># 4. Check CNI plugin status</span>
kubectl -n kube-system logs -l k8s-app=aws-node --tail=100 | grep -i error

<span class="hljs-comment"># 5. Validate SecurityGroupPolicy</span>
kubectl get sgp allow-rds-access -o yaml

<span class="hljs-comment"># 6. Ensure ENABLE_POD_ENI is set</span>
kubectl -n kube-system get ds aws-node -o yaml | grep -A 5 ENABLE_POD_ENI

<span class="hljs-comment"># If red pod unexpectedly connects:</span>

<span class="hljs-comment"># 1. Verify pod labels don't match policy</span>
kubectl get pod <span class="hljs-variable">$RED_POD_NAME</span> --show-labels

<span class="hljs-comment"># 2. Check for unintended security group rules</span>
aws ec2 describe-security-groups --group-ids <span class="hljs-variable">$RDS_SG</span> --query <span class="hljs-string">'SecurityGroups[0].IpPermissions'</span>

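<span class="hljs-comment"># 3. Confirm node group security doesn't allow RDS access</span>
<span class="hljs-comment">#    (Node objects don't list security groups. This assumes the managed node group uses the cluster security group, which is the default without a custom launch template.)</span>
<span class="hljs-built_in">export</span> NODE_SG=$(aws eks describe-cluster --name pod-security-cluster-demo --query <span class="hljs-string">'cluster.resourcesVpcConfig.clusterSecurityGroupId'</span> --output text)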
aws ec2 describe-security-groups --group-ids <span class="hljs-variable">$NODE_SG</span>
</code></pre>
<p>These troubleshooting commands help you systematically diagnose problems. If the green pod can't connect, you work through the checklist: verify the security group rules exist, confirm the ENI was actually assigned, check for CNI errors, and validate the SecurityGroupPolicy configuration.</p>
<p>If the red pod unexpectedly can connect, you check whether it somehow got the wrong labels, whether there's an unintended security group rule allowing node-level access, or whether the node security group itself has database access that it shouldn't have.</p>
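<p>Both checklists come down to one question: which security groups does the RDS security group accept traffic from? The sketch below is one possible way to automate that comparison. The function names are hypothetical, and the check only looks at rules that reference other security groups, so CIDR-based rules would still need to be reviewed by hand.</p>

```shell
# Hypothetical helper: succeed if the first argument (a security group ID)
# appears among the remaining arguments (the allowed source groups).
sg_in_list() {
  local wanted="$1" candidate
  shift
  for candidate in "$@"; do
    if [ "$candidate" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical wrapper: print ALLOWED or DENIED for a source group.
report_sg_access() {
  if sg_in_list "$@"; then
    echo "ALLOWED"
  else
    echo "DENIED"
  fi
}

# Example: list the groups referenced by the RDS ingress rules, then test both.
# $sources is left unquoted on purpose so each group ID becomes its own argument.
#   sources=$(aws ec2 describe-security-groups --group-ids "$RDS_SG" \
#     --query 'SecurityGroups[0].IpPermissions[].UserIdGroupPairs[].GroupId' \
#     --output text)
#   report_sg_access "$POD_SG" $sources    # ALLOWED in a working setup
#   report_sg_access "$NODE_SG" $sources   # DENIED in a working setup
```

<p>If the node security group comes back as <code>ALLOWED</code>, that would explain a red pod that can reach the database.</p>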
<h2 id="heading-cleanup-and-maintenance">Cleanup and Maintenance</h2>
<p>When you're finished with this demonstration, it's important to clean up all the resources to avoid ongoing AWS charges. We'll walk through the cleanup process in the proper order to avoid dependency issues.</p>
<h3 id="heading-kubernetes-resource-cleanup">Kubernetes Resource Cleanup</h3>
<p>Let's start by removing all the Kubernetes resources we created during the demonstration.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Delete application deployments</span>
kubectl delete -f green-pod.yaml
kubectl delete -f red-pod.yaml

<span class="hljs-comment"># Delete security group policy</span>
kubectl delete -f sg-per-pod-policy.yaml

<span class="hljs-comment"># Delete secrets</span>
kubectl delete secret rds-credentials

<span class="hljs-comment"># Delete namespace (removes all resources)</span>
kubectl delete namespace networking

<span class="hljs-comment"># Disable pod ENI feature</span>
kubectl -n kube-system <span class="hljs-built_in">set</span> env daemonset aws-node ENABLE_POD_ENI=<span class="hljs-literal">false</span>
kubectl -n kube-system rollout status daemonset aws-node

<span class="hljs-comment"># Verify ENI cleanup</span>
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_ENI:.status.allocatable.vpc\.amazonaws\.com/pod-eni
</code></pre>
<p>We start by deleting the individual deployments to ensure the pods are terminated gracefully. Then we remove the SecurityGroupPolicy, which stops the VPC Resource Controller from creating new ENIs. Deleting the namespace removes any remaining resources we might have created during testing.</p>
<p>Disabling the ENABLE_POD_ENI feature returns the CNI plugin to its default behavior. This doesn't immediately remove existing trunk ENIs, but it prevents new ones from being created.</p>
<h3 id="heading-rds-and-database-cleanup">RDS and Database Cleanup</h3>
<p>Next, we'll remove the RDS database instance and its associated resources.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Delete RDS instance (skip final snapshot for demo)</span>
aws rds delete-db-instance \
   --db-instance-identifier rds-ekslab \
   --delete-automated-backups \
   --skip-final-snapshot

<span class="hljs-comment"># Wait for deletion completion</span>
aws rds <span class="hljs-built_in">wait</span> db-instance-deleted --db-instance-identifier rds-ekslab

<span class="hljs-comment"># Delete DB subnet group</span>
aws rds delete-db-subnet-group \
   --db-subnet-group-name rds-ekslab
</code></pre>
<p>The <code>--skip-final-snapshot</code> flag means we won't create a snapshot before deleting the database. In a production environment, you'd typically want a final snapshot, but for our demonstration where the data isn't valuable, skipping it speeds up the deletion process. The wait command blocks until RDS confirms the instance is fully deleted, which can take several minutes.</p>
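<p>For comparison, here is a sketch of how the two deletion modes differ on the command line. The helper name <code>rds_delete_args</code> and the <code>-final</code> snapshot suffix are illustrative assumptions. The flags themselves are the ones <code>aws rds delete-db-instance</code> accepts.</p>

```shell
# Hypothetical helper: print the delete-db-instance arguments for a demo
# teardown (no snapshot) or a production-style teardown (final snapshot).
rds_delete_args() {
  local instance="$1" mode="${2:-demo}"
  if [ "$mode" = "keep-snapshot" ]; then
    # The snapshot name is an illustrative convention, not one from this guide.
    echo "--db-instance-identifier $instance --final-db-snapshot-identifier $instance-final"
  else
    echo "--db-instance-identifier $instance --skip-final-snapshot --delete-automated-backups"
  fi
}

# Example:
#   aws rds delete-db-instance $(rds_delete_args rds-ekslab keep-snapshot)
```

<p>When a final snapshot identifier is supplied, RDS creates the snapshot before removing the instance, so the deletion takes longer and the snapshot continues to incur storage charges until you delete it.</p>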
<h3 id="heading-eks-cluster-deletion">EKS Cluster Deletion</h3>
<p>Now we'll delete the EKS cluster and its node groups.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Delete managed node group first</span>
aws eks delete-nodegroup \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers

<span class="hljs-comment"># Wait for node group deletion</span>
aws eks <span class="hljs-built_in">wait</span> nodegroup-deleted \
  --cluster-name pod-security-cluster-demo \
  --nodegroup-name workers

<span class="hljs-comment"># Delete EKS cluster</span>
aws eks delete-cluster --name pod-security-cluster-demo

<span class="hljs-comment"># Wait for cluster deletion</span>
aws eks <span class="hljs-built_in">wait</span> cluster-deleted --name pod-security-cluster-demo
</code></pre>
<p>It's important to delete the node group before deleting the cluster. If you try to delete the cluster first, it will fail because node groups are dependent resources. The node group deletion process terminates all the EC2 instances and cleans up their associated resources. The cluster deletion then removes the control plane components.</p>
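<p>If a cluster has more than one node group, the same ordering applies to each of them. The following sketch prints the commands in dependency order without running anything, so the plan can be reviewed first. <code>plan_eks_teardown</code> is a hypothetical helper, not an AWS CLI feature.</p>

```shell
# Hypothetical helper: print, in order, the commands that remove every
# listed node group before the cluster itself. Nothing is executed.
plan_eks_teardown() {
  local cluster="$1" ng
  shift
  for ng in "$@"; do
    echo "aws eks delete-nodegroup --cluster-name $cluster --nodegroup-name $ng"
    echo "aws eks wait nodegroup-deleted --cluster-name $cluster --nodegroup-name $ng"
  done
  echo "aws eks delete-cluster --name $cluster"
  echo "aws eks wait cluster-deleted --name $cluster"
}

# Example: discover the node groups, then print the plan.
#   nodegroups=$(aws eks list-nodegroups --cluster-name pod-security-cluster-demo \
#     --query 'nodegroups[]' --output text)
#   plan_eks_teardown pod-security-cluster-demo $nodegroups
```

<p>Once the printed plan looks right, it can be piped to <code>bash</code> to execute it line by line.</p>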
<h3 id="heading-management-instance-cleanup">Management Instance Cleanup</h3>
<p>Let's remove the management instance and its associated resources.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Terminate management instance</span>
aws ec2 terminate-instances --instance-ids <span class="hljs-variable">$INSTANCE_ID</span>

<span class="hljs-comment"># Wait for termination</span>
aws ec2 <span class="hljs-built_in">wait</span> instance-terminated --instance-ids <span class="hljs-variable">$INSTANCE_ID</span>

<span class="hljs-comment"># Release Elastic IP</span>
aws ec2 release-address --allocation-id <span class="hljs-variable">$EIP_ALLOC</span>
</code></pre>
<p>Terminating the instance is straightforward: AWS handles the cleanup of attached volumes and network interfaces automatically. But we need to explicitly release the Elastic IP address. AWS bills for allocated Elastic IPs, including ones that are no longer attached to a running instance, so releasing them is important to avoid unnecessary costs.</p>
<h3 id="heading-complete-vpc-infrastructure-removal">Complete VPC Infrastructure Removal</h3>
<p>Now we'll remove all the VPC components. This is the most complex cleanup section because VPC resources have many interdependencies that must be resolved in the correct order.</p>
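<pre><code class="lang-bash"><span class="hljs-comment">#!/bin/bash</span>
<span class="hljs-built_in">set</span> -euo pipefail

<span class="hljs-built_in">export</span> VPC_ID=$(aws ec2 describe-vpcs \
  --filters Name=cidr-block-association.cidr-block,Values=10.0.0.0/16 Name=isDefault,Values=<span class="hljs-literal">false</span> \
  --query <span class="hljs-string">'Vpcs[?State==`available`].VpcId | [0]'</span> --output text)

<span class="hljs-comment"># Stop early if the lookup found nothing (the CLI prints "None" for an empty result)</span>
<span class="hljs-keyword">if</span> [ -z <span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> ] || [ <span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> = <span class="hljs-string">"None"</span> ]; <span class="hljs-keyword">then</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"No matching VPC found; nothing to clean up"</span>
  <span class="hljs-built_in">exit</span> 1
<span class="hljs-keyword">fi</span>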

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Starting comprehensive VPC cleanup ==="</span>

<span class="hljs-comment"># First, let's identify what's still attached</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Remaining dependencies check ==="</span>
aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'NetworkInterfaces[*].[NetworkInterfaceId,Description,Status]'</span> --output table
aws ec2 describe-instances --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> Name=instance-state-name,Values=running,pending,stopping --query <span class="hljs-string">'Reservations[].Instances[*].[InstanceId,State.Name]'</span> --output table

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Force delete any remaining ENIs ==="</span>
<span class="hljs-keyword">for</span> eni <span class="hljs-keyword">in</span> $(aws ec2 describe-network-interfaces --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'NetworkInterfaces[?Status!=`in-use`].NetworkInterfaceId'</span> --output text); <span class="hljs-keyword">do</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleting ENI: <span class="hljs-variable">$eni</span>"</span>
  aws ec2 delete-network-interface --network-interface-id <span class="hljs-string">"<span class="hljs-variable">$eni</span>"</span> || <span class="hljs-literal">true</span>
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Wait for ENI cleanup ==="</span>
sleep 30

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete load balancers ==="</span>
<span class="hljs-comment"># Delete ALB/NLB</span>
<span class="hljs-keyword">for</span> arn <span class="hljs-keyword">in</span> $(aws elbv2 describe-load-balancers --query <span class="hljs-string">"LoadBalancers[?VpcId=='<span class="hljs-variable">$VPC_ID</span>'].LoadBalancerArn"</span> --output text); <span class="hljs-keyword">do</span>
  aws elbv2 delete-load-balancer --load-balancer-arn <span class="hljs-string">"<span class="hljs-variable">$arn</span>"</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted ALB/NLB: <span class="hljs-variable">$arn</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-comment"># Delete Classic ELB</span>
<span class="hljs-keyword">for</span> name <span class="hljs-keyword">in</span> $(aws elb describe-load-balancers --query <span class="hljs-string">"LoadBalancerDescriptions[?VPCId=='<span class="hljs-variable">$VPC_ID</span>'].LoadBalancerName"</span> --output text); <span class="hljs-keyword">do</span>
  aws elb delete-load-balancer --load-balancer-name <span class="hljs-string">"<span class="hljs-variable">$name</span>"</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted Classic ELB: <span class="hljs-variable">$name</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete VPC Endpoints ==="</span>
EP_IDS=$(aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'VpcEndpoints[].VpcEndpointId'</span> --output text || <span class="hljs-literal">true</span>)
<span class="hljs-keyword">if</span> [ -n <span class="hljs-string">"<span class="hljs-variable">${EP_IDS:-}</span>"</span> ]; <span class="hljs-keyword">then</span>
  aws ec2 delete-vpc-endpoints --vpc-endpoint-ids <span class="hljs-variable">$EP_IDS</span>
<span class="hljs-keyword">fi</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete NAT Gateways ==="</span>
<span class="hljs-keyword">for</span> nat <span class="hljs-keyword">in</span> $(aws ec2 describe-nat-gateways --filter Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'NatGateways[?State!=`deleted`].NatGatewayId'</span> --output text); <span class="hljs-keyword">do</span>
  aws ec2 delete-nat-gateway --nat-gateway-id <span class="hljs-string">"<span class="hljs-variable">$nat</span>"</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted NAT Gateway: <span class="hljs-variable">$nat</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-comment"># Wait for NAT Gateway deletion</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Waiting for NAT Gateways to delete..."</span>
<span class="hljs-keyword">while</span> [ $(aws ec2 describe-nat-gateways --filter Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'length(NatGateways[?State!=`deleted`])'</span> --output text) != <span class="hljs-string">"0"</span> ]; <span class="hljs-keyword">do</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Still waiting for NAT Gateway deletion..."</span>
  sleep 15
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete Internet Gateways ==="</span>
<span class="hljs-keyword">for</span> igw <span class="hljs-keyword">in</span> $(aws ec2 describe-internet-gateways --filters Name=attachment.vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'InternetGateways[].InternetGatewayId'</span> --output text); <span class="hljs-keyword">do</span>
  aws ec2 detach-internet-gateway --internet-gateway-id <span class="hljs-string">"<span class="hljs-variable">$igw</span>"</span> --vpc-id <span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> || <span class="hljs-literal">true</span>
  aws ec2 delete-internet-gateway --internet-gateway-id <span class="hljs-string">"<span class="hljs-variable">$igw</span>"</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted Internet Gateway: <span class="hljs-variable">$igw</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Terminate any remaining instances ==="</span>
<span class="hljs-comment"># Capture the IDs once: terminated instances move to shutting-down, so a second state-filtered query would miss them</span>
INSTANCE_IDS=$(aws ec2 describe-instances --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> Name=instance-state-name,Values=pending,running,stopping,stopped --query <span class="hljs-string">'Reservations[].Instances[].InstanceId'</span> --output text)
<span class="hljs-keyword">if</span> [ -n <span class="hljs-string">"<span class="hljs-variable">$INSTANCE_IDS</span>"</span> ]; <span class="hljs-keyword">then</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Terminating instances: <span class="hljs-variable">$INSTANCE_IDS</span>"</span>
  aws ec2 terminate-instances --instance-ids <span class="hljs-variable">$INSTANCE_IDS</span>

  <span class="hljs-comment"># Wait for instances to terminate</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Waiting for instances to terminate..."</span>
  aws ec2 <span class="hljs-built_in">wait</span> instance-terminated --instance-ids <span class="hljs-variable">$INSTANCE_IDS</span>
<span class="hljs-keyword">fi</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete subnets ==="</span>
<span class="hljs-keyword">for</span> subnet <span class="hljs-keyword">in</span> $(aws ec2 describe-subnets --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'Subnets[].SubnetId'</span> --output text); <span class="hljs-keyword">do</span>
  aws ec2 delete-subnet --subnet-id <span class="hljs-string">"<span class="hljs-variable">$subnet</span>"</span> &amp;&amp; <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted subnet: <span class="hljs-variable">$subnet</span>"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"Failed to delete subnet: <span class="hljs-variable">$subnet</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete route tables ==="</span>
<span class="hljs-keyword">for</span> rt <span class="hljs-keyword">in</span> $(aws ec2 describe-route-tables --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'RouteTables[?Associations[?Main==`false`]].RouteTableId'</span> --output text); <span class="hljs-keyword">do</span>
  <span class="hljs-comment"># Disassociate route table first</span>
  <span class="hljs-keyword">for</span> assoc <span class="hljs-keyword">in</span> $(aws ec2 describe-route-tables --route-table-ids <span class="hljs-string">"<span class="hljs-variable">$rt</span>"</span> --query <span class="hljs-string">'RouteTables[].Associations[?Main==`false`].RouteTableAssociationId'</span> --output text); <span class="hljs-keyword">do</span>
    aws ec2 disassociate-route-table --association-id <span class="hljs-string">"<span class="hljs-variable">$assoc</span>"</span> || <span class="hljs-literal">true</span>
  <span class="hljs-keyword">done</span>
  aws ec2 delete-route-table --route-table-id <span class="hljs-string">"<span class="hljs-variable">$rt</span>"</span> &amp;&amp; <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted route table: <span class="hljs-variable">$rt</span>"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"Failed to delete route table: <span class="hljs-variable">$rt</span>"</span>
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Delete security groups ==="</span>
<span class="hljs-comment"># Delete custom security groups (retry logic for dependencies)</span>
<span class="hljs-keyword">for</span> attempt <span class="hljs-keyword">in</span> {1..3}; <span class="hljs-keyword">do</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Security group deletion attempt <span class="hljs-variable">$attempt</span>..."</span>
  <span class="hljs-keyword">for</span> sg <span class="hljs-keyword">in</span> $(aws ec2 describe-security-groups --filters Name=vpc-id,Values=<span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span> --query <span class="hljs-string">'SecurityGroups[?GroupName!=`default`].GroupId'</span> --output text); <span class="hljs-keyword">do</span>
    aws ec2 delete-security-group --group-id <span class="hljs-string">"<span class="hljs-variable">$sg</span>"</span> 2&gt;/dev/null &amp;&amp; <span class="hljs-built_in">echo</span> <span class="hljs-string">"Deleted SG: <span class="hljs-variable">$sg</span>"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"Failed to delete SG: <span class="hljs-variable">$sg</span> (will retry)"</span>
  <span class="hljs-keyword">done</span>
  sleep 10
<span class="hljs-keyword">done</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Final VPC deletion ==="</span>
aws ec2 delete-vpc --vpc-id <span class="hljs-string">"<span class="hljs-variable">$VPC_ID</span>"</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"VPC cleanup completed successfully!"</span>
</code></pre>
<p>This cleanup script is comprehensive and handles all the common dependency issues you might encounter when deleting a VPC. Let me explain the order and reasoning behind each section.</p>
<p>We start by checking what resources are still attached to the VPC. This gives us visibility into any unexpected dependencies that might cause deletion failures. Then we delete any detached ENIs. These are network interfaces that EKS or the CNI plugin might have created that are no longer attached to instances.</p>
<p>Load balancers must be deleted before we can remove subnets, because they create ENIs in the subnets. We check for both modern Application/Network Load Balancers and Classic ELBs. VPC endpoints, if any were created, also need to be removed before subnet deletion.</p>
<p>The NAT Gateway deletion is particularly important to wait for completely, because NAT Gateways take several minutes to fully delete. If you try to delete the subnet while the NAT Gateway is still in "deleting" state, the deletion will fail.</p>
<p>Internet Gateways must be detached before they can be deleted. We use the <code>|| true</code> pattern here because if the detachment fails (maybe it's already detached), we still want to try the deletion.</p>
<p>Subnets can be deleted once all resources using them are removed. Route tables need to be disassociated from subnets before deletion – we only delete non-main route tables, as the main route table is automatically deleted with the VPC.</p>
<p>Security groups often have dependencies on each other (if rules reference other security groups), so we use a retry loop with three attempts. On each pass, some security groups may delete successfully, which removes the references that were blocking the others.</p>
<p>Finally, once all attached resources are cleaned up, we can delete the VPC itself.</p>
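<p>The retry idea used for security groups can be generalized to any deletion that may fail while a dependency is still being released. The function below is a minimal sketch of that pattern. <code>retry</code> is a hypothetical helper, and the attempt count and delay in the example are arbitrary choices rather than recommended values.</p>

```shell
# Hypothetical helper: run a command until it succeeds, up to a maximum
# number of attempts, pausing between tries.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "Giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example: retry a security group deletion that may be blocked by
# rules in other groups that still reference it.
#   retry 3 10 aws ec2 delete-security-group --group-id "$sg"
```

<p>Unlike a fixed loop, this stops as soon as the command succeeds and returns a non-zero status when every attempt fails, so a script running under <code>set -e</code> will not report success by mistake.</p>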
<h3 id="heading-iam-resource-cleanup">IAM Resource Cleanup</h3>
<p>The last step is cleaning up the IAM roles and policies we created at the beginning.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Detach policies from roles</span>
aws iam detach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

aws iam detach-role-policy \
  --role-name EKSClusterRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSVPCResourceController

aws iam detach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

aws iam detach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

aws iam detach-role-policy \
  --role-name EKSNodeGroupRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

<span class="hljs-comment"># Clean up management instance IAM resources</span>
aws iam detach-role-policy \
  --role-name EKS-Management-Role \
  --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/EKS-Management-Policy

aws iam delete-policy \
  --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/EKS-Management-Policy

aws iam remove-role-from-instance-profile \
  --instance-profile-name EKS-Management-Profile \
  --role-name EKS-Management-Role

aws iam delete-instance-profile \
  --instance-profile-name EKS-Management-Profile

<span class="hljs-comment"># Delete IAM roles</span>
aws iam delete-role --role-name EKSClusterRole
aws iam delete-role --role-name EKSNodeGroupRole
aws iam delete-role --role-name EKS-Management-Role

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Complete cleanup finished successfully"</span>
</code></pre>
<p>IAM cleanup follows a specific order: first detach all policies from roles, then delete any custom policies we created, remove roles from instance profiles, delete the instance profiles, and finally delete the roles themselves. IAM requires this order because of the dependency chain: you can't delete a role that still has policies attached, and you can't delete an instance profile that still contains a role.</p>
<p>The custom EKS-Management-Policy that we created needs to be deleted using your account ID in the ARN. The <code>aws sts get-caller-identity</code> command retrieves your account ID dynamically so the command works regardless of which AWS account you're using.</p>
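<p>Because the policy ARN is needed twice, it may be simpler to look the account ID up once and build the ARN in one place. A small sketch, assuming the standard <code>aws</code> partition. <code>customer_policy_arn</code> is a hypothetical helper name.</p>

```shell
# Hypothetical helper: build the ARN of a customer-managed IAM policy.
# Assumes the standard "aws" partition.
customer_policy_arn() {
  local account_id="$1" policy_name="$2"
  echo "arn:aws:iam::${account_id}:policy/${policy_name}"
}

# Example: look the account ID up once and reuse the resulting ARN.
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   POLICY_ARN=$(customer_policy_arn "$ACCOUNT_ID" EKS-Management-Policy)
#   aws iam detach-role-policy --role-name EKS-Management-Role --policy-arn "$POLICY_ARN"
#   aws iam delete-policy --policy-arn "$POLICY_ARN"
```

<p>This keeps the detach and delete commands pointed at the same ARN and avoids a second call to STS.</p>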
<p>Once this cleanup is complete, you've removed all resources created during this guide and won't incur any further charges.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This comprehensive guide demonstrated how to implement Security Groups for Pods in Amazon EKS, providing fine-grained network security controls at the pod level.</p>
<p>As always, I hope you enjoyed this guide and learned something valuable about securing your EKS workloads. If you want to stay connected or see more hands-on DevOps content, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> or <a target="_blank" href="https://twitter.com/caesar_sage">Twitter</a>.</p>
<p>For more practical, hands-on DevOps projects like this one, follow and star this repository: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building">Learn-DevOps-by-building</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Docker Build Tutorial: Learn Contexts, Architecture, and Performance Optimization Techniques ]]>
                </title>
                <description>
                    <![CDATA[ Docker build is a fundamental concept every developer needs to understand. Whether you're containerizing your first application or optimizing existing Docker workflows, understanding Docker build contexts and Docker build architecture is essential fo... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/docker-build-tutorial-learn-contexts-architecture-and-performance-optimization-techniques/</link>
                <guid isPermaLink="false">68e559d8ac28fbe4acae92be</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software development ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Tue, 07 Oct 2025 18:20:08 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759861193876/871b72e7-9673-4572-b788-48f082a6b380.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Docker build is a fundamental concept every developer needs to understand. Whether you're containerizing your first application or optimizing existing Docker workflows, understanding Docker build contexts and Docker build architecture is essential for creating efficient, scalable containerized applications.</p>
<p>This comprehensive guide covers everything from basic concepts to advanced optimization techniques, helping you avoid common pitfalls and build better Docker images.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-docker-build">What is Docker Build?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-docker-build-architecture-how-it-all-works">Docker Build Architecture: How It All Works</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-docker-build-features">Docker Build Features</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-docker-build-context">Docker Build Context</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-types-of-docker-build-contexts">Types of Docker Build Contexts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-common-docker-build-mistakes-and-how-to-fix-them">Common Docker Build Mistakes (And How to Fix Them)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-optimize-and-monitor-build-performance">How to Optimize and Monitor Build Performance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices-for-docker-build-performance">Best Practices for Docker Build Performance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-troubleshooting-docker-build-issues">Troubleshooting Docker Build Issues</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-docker-build">What is Docker Build?</h2>
<p>Docker build is the process of creating a Docker image from a Dockerfile and a set of files called the <strong>build context</strong>. When you run <code>docker build</code>, you're instructing Docker to:</p>
<ol>
<li><p>Read your Dockerfile instructions</p>
</li>
<li><p>Gather the necessary files (build context)</p>
</li>
<li><p>Execute each instruction step-by-step</p>
</li>
<li><p>Create a final Docker image</p>
</li>
</ol>
<p>Think of it like following a recipe: the Dockerfile is your recipe, and the build context contains all the ingredients you might need.</p>
<h2 id="heading-docker-build-architecture-how-it-all-works">Docker Build Architecture: How It All Works</h2>
<p>Docker Build uses a client-server architecture where two separate components (<strong>Buildx and BuildKit</strong>) work together to build your Docker images. This is different from how many people think Docker works, as it's not just one monolithic program doing everything.</p>
<h3 id="heading-what-is-buildx-the-client">What is Buildx (The Client)?</h3>
<p>Buildx serves as the user interface that you interact with directly whenever you work with Docker builds. When you type <code>docker build .</code> in your terminal, you're actually communicating with Buildx, which acts as the intermediary between you and the actual build engine.</p>
<h4 id="heading-buildxs-primary-jobs">Buildx’s primary jobs:</h4>
<ul>
<li><p>Interprets your build command and options</p>
</li>
<li><p>Sends structured build requests to BuildKit</p>
</li>
<li><p>Manages multiple BuildKit instances (builders)</p>
</li>
<li><p>Handles authentication and secrets</p>
</li>
<li><p>Displays build progress to you</p>
</li>
</ul>
<h3 id="heading-what-is-buildkit-the-serverbuilder">What is BuildKit (The Server/Builder)?</h3>
<p>BuildKit functions as the actual build engine that performs all the heavy lifting during the Docker build process. This powerful backend component receives the structured build requests from Buildx and immediately begins reading and interpreting your Dockerfiles line by line.</p>
<h4 id="heading-buildkits-primary-jobs">BuildKit’s primary jobs:</h4>
<ul>
<li><p>Receives build requests from Buildx</p>
</li>
<li><p>Reads and interprets Dockerfiles</p>
</li>
<li><p>Executes build instructions step by step</p>
</li>
<li><p>Manages build cache and layers</p>
</li>
<li><p>Requests only the files it needs from the client</p>
</li>
<li><p>Creates the final Docker image</p>
</li>
</ul>
<h3 id="heading-how-they-communicate">How They Communicate</h3>
<p>Here's what happens when you run <code>docker build .</code>:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758733757378/d3322dad-efac-4c4a-b8f8-69f17a4920e8.png" alt="Diagram showing Docker build process with BuildKit, including sending build request with Dockerfile and build arguments, requesting and receiving package.json, running npm install, requesting and receiving src directory files, copying files, completing build, and optionally pushing to registry." class="image--center mx-auto" width="2947" height="2628" loading="lazy"></p>
<p>When you run <code>docker build</code>, the command initiates a multi-step process with BuildKit (as illustrated in the above image).</p>
<p>First, it sends a build request containing your Dockerfile, build arguments, export options, and cache options. BuildKit then intelligently requests only the files it needs when it needs them, starting with <code>package.json</code> to run <code>npm install</code> for dependency installation.</p>
<p>After that's complete, it requests the <code>src/</code> directory containing your application code and copies those files into the image with the <code>COPY</code> command.</p>
<p>Once all build steps are finished, BuildKit sends back the completed image. Optionally, you can then push this image to a container registry for distribution or deployment.</p>
<p>This on-demand file transfer approach is one of BuildKit's key optimizations: rather than sending your entire build context upfront, it only requests specific files as each build step needs them, making the build process more efficient.</p>
<h3 id="heading-key-communication-details">Key Communication Details</h3>
<p>Build request contains:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"dockerfile"</span>: <span class="hljs-string">"FROM node:18\nWORKDIR /app\n..."</span>,
  <span class="hljs-attr">"buildArgs"</span>: {<span class="hljs-attr">"NODE_ENV"</span>: <span class="hljs-string">"production"</span>},
  <span class="hljs-attr">"exportOptions"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"image"</span>, <span class="hljs-attr">"name"</span>: <span class="hljs-string">"my-app:latest"</span>},
  <span class="hljs-attr">"cacheOptions"</span>: {<span class="hljs-attr">"type"</span>: <span class="hljs-string">"registry"</span>, <span class="hljs-attr">"ref"</span>: <span class="hljs-string">"my-app:cache"</span>}
}
</code></pre>
<p>Resource requests:</p>
<ul>
<li><p>BuildKit asks: "I need the file at <code>./package.json</code>"</p>
</li>
<li><p>Buildx responds: Sends the actual file content</p>
</li>
<li><p>BuildKit asks: "I need the directory <code>./src/</code>"</p>
</li>
<li><p>Buildx responds: Sends all files in that directory</p>
</li>
</ul>
<h3 id="heading-why-this-architecture-exists">Why This Architecture Exists</h3>
<h4 id="heading-1-efficiency">1. Efficiency</h4>
<p>The old Docker builder had a major flaw: it always copied your entire build context upfront, regardless of what was actually needed. Even if your Dockerfile only used a few files, Docker would transfer hundreds of megabytes before starting the build.</p>
<p>BuildKit fixes this through on-demand file transfers. It only requests specific files at each step.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Old Docker Builder (legacy)</span>
<span class="hljs-comment"># Always copied ENTIRE context upfront</span>
$ docker build .
Sending build context to Docker daemon  245.7MB  <span class="hljs-comment"># Everything!</span>

<span class="hljs-comment"># New BuildKit Architecture  </span>
<span class="hljs-comment"># Only requests files when needed</span>
$ docker build .
<span class="hljs-comment">#1 [internal] load build definition from Dockerfile    0.1s</span>
<span class="hljs-comment">#2 [internal] load .dockerignore                       0.1s</span>
<span class="hljs-comment">#3 [1/5] FROM node:18                                  0.5s</span>
<span class="hljs-comment">#4 [internal] load build context                       0.1s</span>
<span class="hljs-comment">#4 transferring context: 234B  # Only package.json initially!</span>
<span class="hljs-comment">#5 [2/5] WORKDIR /app                                  0.2s  </span>
<span class="hljs-comment">#6 [3/5] COPY package*.json ./                         0.1s</span>
<span class="hljs-comment">#7 [4/5] RUN npm install                               5.2s</span>
<span class="hljs-comment">#8 [internal] load build context                       0.3s  </span>
<span class="hljs-comment">#8 transferring context: 2.1MB  # Now requests src/ files</span>
<span class="hljs-comment">#9 [5/5] COPY src/ ./src/                              0.2s</span>
</code></pre>
<h4 id="heading-2-scalability">2. Scalability</h4>
<p>The client-server architecture enables scalability features. Multiple Docker CLI clients can connect to the same BuildKit instance, and BuildKit can run on remote servers instead of your local machine. This means you could execute builds on a cloud server while controlling them from your laptop. Teams can also deploy multiple BuildKit instances for different teams or purposes, scaling from individual developers to large enterprises.</p>
<h4 id="heading-3-security">3. Security</h4>
<p>Security is improved by only requesting sensitive files when explicitly needed. BuildKit never sees files your Dockerfile doesn't reference, reducing the attack surface. It also handles credentials through separate, secure channels rather than mixing them with your build context, preventing secrets from being embedded in image layers or exposed in build logs.</p>
<h3 id="heading-real-world-example">Real-World Example</h3>
<p>Let's trace through a typical build step by step. You can find the full code available here: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/02-python-cache">02-python-cache</a>.</p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> python:<span class="hljs-number">3.9</span>-slim
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> requirements.txt .</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r requirements.txt</span>
<span class="hljs-keyword">COPY</span><span class="bash"> src/ ./src/</span>
<span class="hljs-keyword">COPY</span><span class="bash"> main.py .</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"python"</span>, <span class="hljs-string">"main.py"</span>]</span>
</code></pre>
<p>Let’s see what actually happens here:</p>
<ol>
<li><p>You run <code>docker build .</code></p>
</li>
<li><p>Buildx says to BuildKit:</p>
</li>
</ol>
<pre><code class="lang-bash">   <span class="hljs-string">"Here's a build request with this Dockerfile"</span>
</code></pre>
<ol start="3">
<li><p><strong>BuildKit processes</strong>: <code>FROM python:3.9-slim</code></p>
<ul>
<li>No client files needed, pulls base image</li>
</ul>
</li>
<li><p><strong>BuildKit processes</strong>: <code>COPY requirements.txt .</code></p>
<ul>
<li><p>BuildKit to Buildx: "I need <code>requirements.txt</code>"</p>
</li>
<li><p>Buildx to BuildKit: Sends the file content</p>
</li>
</ul>
</li>
<li><p><strong>BuildKit processes</strong>: <code>RUN pip install -r requirements.txt</code></p>
<ul>
<li>No client files needed, runs inside container</li>
</ul>
</li>
<li><p><strong>BuildKit processes</strong>: <code>COPY src/ ./src/</code></p>
<ul>
<li><p>BuildKit to Buildx: "I need all files in <code>src/</code> directory"</p>
</li>
<li><p>Buildx to BuildKit: Sends all files in src/</p>
</li>
</ul>
</li>
<li><p><strong>BuildKit processes</strong>: <code>COPY main.py .</code></p>
<ul>
<li><p>BuildKit to Buildx: "I need <code>main.py</code>"</p>
</li>
<li><p>Buildx to BuildKit: Sends the file</p>
</li>
</ul>
</li>
<li><p>BuildKit to Buildx: "Build complete, here's your image"</p>
</li>
</ol>
<p>As this walkthrough shows, BuildKit requests only what it needs, when it needs it, rather than the entire context:</p>
<pre><code class="lang-bash">
my-app/
├── src/                 <span class="hljs-comment"># ← Only loaded when COPY src/ runs</span>
├── tests/              <span class="hljs-comment"># ← Never requested (not in Dockerfile)</span>
├── docs/               <span class="hljs-comment"># ← Never requested  </span>
├── node_modules/       <span class="hljs-comment"># ← Never requested (in .dockerignore)</span>
├── requirements.txt    <span class="hljs-comment"># ← Loaded early (first COPY)</span>
└── main.py            <span class="hljs-comment"># ← Loaded later (second COPY)</span>
</code></pre>
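<p>The request pattern above can be approximated by listing the source paths that each <code>COPY</code> instruction names. This is a minimal sketch, not BuildKit's actual logic: <code>copy_sources</code> is a hypothetical helper, and it assumes shell-form <code>COPY</code> lines without line continuations. It skips any <code>COPY --from=...</code> line, since those read from another stage or image rather than from the client's context.</p>

```shell
#!/bin/sh
# copy_sources DOCKERFILE
# Hypothetical helper: prints the context paths named by shell-form COPY lines.
copy_sources() {
    awk '
        toupper($1) == "COPY" {
            # Copies from another stage or image never touch the client context.
            if ($0 ~ /--from=/) next
            n = 0
            # Collect every argument that is not a flag such as --chown.
            for (i = 2; i <= NF; i++) if ($i !~ /^--/) src[++n] = $i
            # The last collected argument is the destination, so stop before it.
            for (i = 1; i < n; i++) print src[i]
        }
    ' "$1"
}

# Recreate the Dockerfile from the walkthrough in a scratch directory.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'FROM python:3.9-slim' \
    'WORKDIR /app' \
    'COPY requirements.txt .' \
    'RUN pip install -r requirements.txt' \
    'COPY src/ ./src/' \
    'COPY main.py .' \
    'CMD ["python", "main.py"]' > "$demo_dir/Dockerfile"

copy_sources "$demo_dir/Dockerfile"
```

<p>For this Dockerfile the helper prints <code>requirements.txt</code>, <code>src/</code>, and <code>main.py</code>, which are the same three paths BuildKit asks for in the steps above. Directories such as <code>tests/</code> and <code>docs/</code> never appear in the list.</p>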
<h2 id="heading-docker-build-features">Docker Build Features</h2>
<h3 id="heading-named-contexts">Named Contexts</h3>
<p>👉 Demo project: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/07-named-contexts">07-named-contexts</a></p>
<p>Named contexts allow you to include files from multiple sources during a build while keeping them logically separated. This is useful when you need documentation, configuration files, or shared libraries from different directories or repositories in your build.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Build with additional named context</span>
docker build --build-context docs=./documentation .
</code></pre>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Use named context in Dockerfile</span>
<span class="hljs-keyword">FROM</span> alpine
<span class="hljs-keyword">COPY</span><span class="bash"> . /app</span>
<span class="hljs-comment"># Mount files from named context</span>
<span class="hljs-keyword">RUN</span><span class="bash"> --mount=from=docs,target=/docs \
    cp /docs/manual.pdf /app/</span>
</code></pre>
<h3 id="heading-build-secrets">Build Secrets</h3>
<p>👉 Demo project: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/06-build-secrets">06-build-secrets</a></p>
<p>Build secrets let you pass sensitive information (like API keys or passwords) to your build without including them in the final image or build history. The secrets are mounted temporarily during specific <code>RUN</code> commands and are never stored in image layers.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Pass secret to build from an environment variable</span>
API_KEY=secret123 docker build --secret id=apikey,env=API_KEY .
</code></pre>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Use secret in Dockerfile</span>
<span class="hljs-keyword">FROM</span> alpine
<span class="hljs-keyword">RUN</span><span class="bash"> --mount=<span class="hljs-built_in">type</span>=secret,id=apikey \
    <span class="hljs-built_in">export</span> API_KEY=$(cat /run/secrets/apikey) &amp;&amp; \
    curl -H <span class="hljs-string">"Authorization: <span class="hljs-variable">$API_KEY</span>"</span> https://api.example.com/data</span>
</code></pre>
<h2 id="heading-docker-build-context">Docker Build Context</h2>
<h3 id="heading-what-is-a-build-context">What is a Build Context?</h3>
<p>The build context is the collection of files and directories that Docker can access during the build process. It's like gathering all your cooking ingredients on the counter before you start cooking.</p>
<pre><code class="lang-bash">docker build [OPTIONS] CONTEXT
                       ^^^^^^^
                       This is your build context
</code></pre>
<h3 id="heading-why-build-contexts-matter">Why Build Contexts Matter</h3>
<ol>
<li><p><strong>Security</strong>: Only files in the context can be accessed during build</p>
</li>
<li><p><strong>Performance</strong>: Large contexts slow down builds</p>
</li>
<li><p><strong>Functionality</strong>: Your Dockerfile can only COPY/ADD files from the context</p>
</li>
<li><p><strong>Efficiency</strong>: Understanding contexts helps you build faster, leaner images</p>
</li>
</ol>
<h2 id="heading-types-of-docker-build-contexts">Types of Docker Build Contexts</h2>
<h3 id="heading-1-local-directory-context-most-common">1. Local Directory Context (Most Common)</h3>
<p>👉 See code here: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/01-node-local-context">01-node-local-context</a></p>
<p>This is the context you'll use most often, a folder on your machine:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Use current directory</span>
docker build .

<span class="hljs-comment"># Use specific directory</span>
docker build /path/to/my/project

<span class="hljs-comment"># Use parent directory</span>
docker build ..
</code></pre>
<p><strong>Example Project Structure:</strong></p>
<pre><code class="lang-bash">my-webapp/
├── src/
│   ├── index.js
│   └── utils.js
├── public/
│   ├── index.html
│   └── styles.css
├── package.json
├── package-lock.json
├── Dockerfile
├── .dockerignore
└── README.md
</code></pre>
<p><strong>Corresponding Dockerfile:</strong></p>
<pre><code class="lang-dockerfile"><span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span>-alpine
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-comment"># Copy package files first for better layer caching</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package*.json ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm ci --omit=dev</span>

<span class="hljs-comment"># Copy application source</span>
<span class="hljs-keyword">COPY</span><span class="bash"> src/ ./src/</span>
<span class="hljs-keyword">COPY</span><span class="bash"> public/ ./public/</span>

<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">3000</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"node"</span>, <span class="hljs-string">"src/index.js"</span>]</span>
</code></pre>
<h3 id="heading-2-remote-git-repository-context">2. Remote Git Repository Context</h3>
<p>You can build directly from Git repositories without cloning locally:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Build from GitHub main branch</span>
docker build https://github.com/&lt;username&gt;/project.git

<span class="hljs-comment"># Build from specific branch</span>
docker build https://github.com/&lt;username&gt;/project.git<span class="hljs-comment">#develop</span>

<span class="hljs-comment"># Build from specific directory in repo</span>
docker build https://github.com/&lt;username&gt;/project.git<span class="hljs-comment">#main:docker</span>

<span class="hljs-comment"># Build with authentication</span>
docker build --ssh default git@github.com:&lt;username&gt;/private-repo.git
</code></pre>
<p>This is useful for CI/CD pipelines, building open-source projects, ensuring clean builds from source control, automated deployments, and similar cases.</p>
<h3 id="heading-3-remote-tarball-context">3. Remote Tarball Context</h3>
<p>You can also build from compressed archives hosted on web servers. A remote <strong>tarball</strong> is a <code>.tar.gz</code> or similar compressed archive file accessible via HTTP/HTTPS. This is useful when your source code is packaged and hosted on a web server, artifact repository, or CDN. Docker downloads and extracts the archive automatically, using its contents as the build context.</p>
<p>This approach works well for CI/CD pipelines where build artifacts are stored centrally, or when you want to build images from released versions of your code without cloning entire repositories.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Build from remote tarball</span>
docker build http://server.com/context.tar.gz

<span class="hljs-comment"># BuildKit downloads and extracts automatically</span>
docker build https://example.com/project-v1.2.3.tar.gz
</code></pre>
<h3 id="heading-4-empty-context-advanced">4. Empty Context (Advanced)</h3>
<p>When you don't need any files, you can pipe the Dockerfile directly:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create image without file context</span>
docker build -t hello-world - &lt;&lt;EOF
FROM alpine:latest
RUN <span class="hljs-built_in">echo</span> <span class="hljs-string">"Hello, World!"</span> &gt; /hello.txt
CMD cat /hello.txt
EOF
</code></pre>
<h2 id="heading-common-docker-build-mistakes-and-how-to-fix-them">Common Docker Build Mistakes (And How to Fix Them)</h2>
<h3 id="heading-mistake-1-wrong-context-directory">Mistake 1: Wrong Context Directory</h3>
<p>👉 Reproduced here: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/04-wrong-context">04-wrong-context</a></p>
<p>This mistake occurs when you run <code>docker build</code> from the wrong directory, causing the build context to be different from what your Dockerfile expects.</p>
<p>In the example below, running <code>docker build frontend/</code> from the <code>/projects/</code> directory means the context is <code>/projects/frontend/</code>, but the Dockerfile tries to access <code>../shared/utils.js</code>, which is outside this context. Docker can only access files within the build context, so any attempt to reference files outside it will fail.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Project structure</span>
/projects/
├── frontend/
│   ├── Dockerfile
│   ├── src/
│   └── package.json
└── shared/
    └── utils.js

<span class="hljs-comment"># WRONG - Running from projects directory</span>
docker build frontend/
<span class="hljs-comment"># This won't work if Dockerfile tries to COPY ../shared/utils.js</span>
</code></pre>
<h4 id="heading-how-to-fix-wrong-context-directory">How to fix wrong context directory:</h4>
<p>The key is aligning your build context with what your Dockerfile needs.</p>
<ul>
<li><p><strong>Option 1</strong> changes your working directory so that <code>frontend/</code> itself is the context root. This works only if the Dockerfile needs nothing outside <code>frontend/</code>, so it does not help when you must copy <code>shared/utils.js</code>.</p>
</li>
<li><p><strong>Option 2</strong> keeps you in the parent directory but explicitly sets it as the context (the <code>.</code> argument) while telling Docker where to find the Dockerfile with the <code>-f</code> flag. Now both <code>frontend/</code> and <code>shared/</code> are accessible since they're both within the <code>/projects/</code> context. Source paths in <code>COPY</code> are then relative to <code>/projects/</code>, so the shared file is written as <code>shared/utils.js</code> rather than <code>../shared/utils.js</code>.</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Option 1: Run from correct directory</span>
<span class="hljs-built_in">cd</span> frontend
docker build .

<span class="hljs-comment"># Option 2: Use parent directory as context</span>
docker build -f frontend/Dockerfile .
</code></pre>
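<p>You can check which of the two options can reach a given file without running a build at all. The sketch below uses a hypothetical helper, <code>in_context</code>, that resolves a path against a context directory and reports whether it stays inside. It only mirrors the rule described above and is not how Docker implements it. The scratch directory layout is an assumption that copies the project structure from this section.</p>

```shell
#!/bin/sh
# in_context CONTEXT_DIR RELATIVE_PATH
# Hypothetical helper: succeeds when RELATIVE_PATH, resolved against
# CONTEXT_DIR, still lives inside CONTEXT_DIR.
in_context() {
    ctx=$(cd "$1" && pwd -P) || return 2
    parent=$(cd "$1/$(dirname "$2")" 2>/dev/null && pwd -P) || return 1
    case "$parent/" in
        "$ctx"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Recreate the example layout in a scratch directory.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/projects/frontend/src" "$demo_root/projects/shared"
touch "$demo_root/projects/frontend/Dockerfile" "$demo_root/projects/shared/utils.js"

# Context is frontend/ (Option 1): the shared file is out of reach.
in_context "$demo_root/projects/frontend" "../shared/utils.js" \
    || echo "frontend/ context: ../shared/utils.js is outside"

# Context is projects/ (Option 2): the shared file is reachable.
in_context "$demo_root/projects" "shared/utils.js" \
    && echo "projects/ context: shared/utils.js is inside"
```

<p>Both messages print for this layout, which matches the explanation above: only the parent directory context contains <code>shared/</code>.</p>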
<h3 id="heading-mistake-2-including-massive-files">Mistake 2: Including Massive Files</h3>
<p>👉 Optimized version with <code>.dockerignore</code>: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/05-dockerignore-optimization">05-dockerignore-optimization</a></p>
<p>This mistake happens when your build context contains large, unnecessary files that slow down the build process.</p>
<p>The legacy builder transfers the entire context to the Docker daemon before the build starts, and even with BuildKit a broad instruction like <code>COPY . .</code> pulls in everything that isn't excluded. Including files like <code>node_modules</code> (which can be hundreds of MB), git history, build artifacts, logs, and database dumps therefore makes builds painfully slow. These files are rarely needed in the final image and should be excluded.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># This context includes everything!</span>
my-app/
├── node_modules/        <span class="hljs-comment"># 200MB+ </span>
├── .git/               <span class="hljs-comment"># Version history</span>
├── dist/               <span class="hljs-comment"># Built files</span>
├── logs/               <span class="hljs-comment"># Log files</span>
├── temp/               <span class="hljs-comment"># Temporary files</span>
├── database.dump       <span class="hljs-comment"># 1GB database backup</span>
└── Dockerfile
</code></pre>
<h4 id="heading-how-to-fix-docker-build-massive-files">How to fix massive files in the build context:</h4>
<p>Use <code>.dockerignore</code> to exclude unnecessary files, dramatically reducing context size and build time. We’ll discuss this in more detail below.</p>
<h3 id="heading-mistake-3-inefficient-layer-caching">Mistake 3: Inefficient Layer Caching</h3>
<p>👉 See good practice code here: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/02-python-cache">02-python-cache</a></p>
<p>This mistake wastes Docker's layer caching system by copying frequently-changing files (like source code) before running expensive operations (like <code>npm install</code>). When you modify your source code, Docker invalidates the cache for that layer and all subsequent layers, forcing <code>npm install</code> to run again even though dependencies haven't changed. This can turn a 5-second build into a 5-minute build.</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># BAD - Changes to source code rebuild npm install</span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . /app</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm install</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"npm"</span>, <span class="hljs-string">"start"</span>]</span>
</code></pre>
<h4 id="heading-how-to-fix-docker-build-inefficient-layer-caching">How to fix inefficient layer caching:</h4>
<p>Copy dependency files first, install dependencies, then copy source code. This way, <code>npm install</code> only runs when <code>package.json</code> actually changes:</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># GOOD - npm install only rebuilds when package.json changes</span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package*.json ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm install</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"npm"</span>, <span class="hljs-string">"start"</span>]</span>
</code></pre>
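<p>The effect of this ordering can be seen without Docker. The sketch below is a simplified stand-in: it assumes a plain checksum of <code>package*.json</code> represents whatever Docker uses to decide that the <code>COPY package*.json ./</code> step has changed. <code>deps_fingerprint</code> and the scratch project are hypothetical names for illustration only.</p>

```shell
#!/bin/sh
# deps_fingerprint DIR
# Hypothetical helper: checksum of the dependency manifests only,
# standing in for the cache key of the "COPY package*.json ./" step.
deps_fingerprint() {
    cat "$1"/package*.json | cksum
}

# Scratch project with one manifest and one source file.
app=$(mktemp -d)
mkdir -p "$app/src"
echo '{"name":"demo","version":"1.0.0"}' > "$app/package.json"
echo 'console.log("v1")' > "$app/src/index.js"

fp_before=$(deps_fingerprint "$app")

# Edit source code only: the manifest fingerprint is unchanged.
echo 'console.log("v2")' > "$app/src/index.js"
fp_after_src=$(deps_fingerprint "$app")

# Edit the manifest: the fingerprint changes.
echo '{"name":"demo","version":"1.0.1"}' > "$app/package.json"
fp_after_deps=$(deps_fingerprint "$app")

[ "$fp_before" = "$fp_after_src" ] && echo "source edit: dependency layer still cacheable"
[ "$fp_before" != "$fp_after_deps" ] && echo "package.json edit: dependency layer must rebuild"
```

<p>In the bad Dockerfile, <code>COPY . /app</code> sits before <code>npm install</code>, so the equivalent fingerprint would cover every file and change on each source edit.</p>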
<h2 id="heading-how-to-optimize-and-monitor-build-performance">How to Optimize and Monitor Build Performance</h2>
<p>Understanding build performance metrics helps you identify bottlenecks and measure improvements.</p>
<h3 id="heading-how-to-optimize-docker-builds-with-dockerignore">How to Optimize Docker Builds with .dockerignore</h3>
<p>The <code>.dockerignore</code> file is your secret weapon for faster, more secure builds. It tells Docker which files to exclude from the build context.</p>
<h4 id="heading-creating-dockerignore-patterns">Creating .dockerignore Patterns</h4>
<p>Create a <code>.dockerignore</code> file in your project root. The syntax is similar to <code>.gitignore</code>, and you can use wildcards (<code>*</code>), match specific file extensions (<code>*.log</code>), exclude entire directories (<code>node_modules/</code>), or use negation patterns (<code>!important.txt</code>) to include files that would otherwise be excluded. Each line represents a pattern, and comments start with <code>#</code>.</p>
<p>Example of a .dockerignore file:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Dependencies</span>
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*

<span class="hljs-comment"># Build outputs</span>
dist/
build/
*.tgz

<span class="hljs-comment"># Version control</span>
.git/
.gitignore
.svn/

<span class="hljs-comment"># IDE and editor files</span>
.vscode/
.idea/
*.swp
*.swo
*~

<span class="hljs-comment"># OS generated files</span>
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

<span class="hljs-comment"># Logs and databases</span>
*.<span class="hljs-built_in">log</span>
*.sqlite
*.db

<span class="hljs-comment"># Environment and secrets</span>
.env
.env.local
.env.*.<span class="hljs-built_in">local</span>
secrets/
*.key
*.pem

<span class="hljs-comment"># Documentation</span>
README.md
docs/
*.md

<span class="hljs-comment"># Test files</span>
<span class="hljs-built_in">test</span>/
tests/
*.test.js
coverage/

<span class="hljs-comment"># Temporary files</span>
tmp/
temp/
*.tmp
</code></pre>
<h3 id="heading-measuring-build-performance">Measuring Build Performance</h3>
<h4 id="heading-analyzing-build-time">Analyzing Build Time</h4>
<p>Understanding where your build spends time helps identify bottlenecks and optimization opportunities. The detailed progress output shows timing for each build step, cache hits/misses, and resource usage.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Enable BuildKit progress output</span>
DOCKER_BUILDKIT=1 docker build --progress=plain .

<span class="hljs-comment"># Use buildx for detailed timing</span>
docker buildx build --progress=plain .
</code></pre>
<h4 id="heading-profiling-context-transfer">Profiling Context Transfer</h4>
<p>Monitor context transfer time to understand how build context size affects overall performance. Profile which directories contribute most to help target <code>.dockerignore</code> optimizations.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Time a full uncached build (the total includes context transfer)</span>
time docker build --no-cache .

<span class="hljs-comment"># Profile context size by directory</span>
du -sh */ | sort -hr
</code></pre>
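<p>To estimate how much of a directory your <code>.dockerignore</code> already covers, you can total the size of the entries it names. The sketch below is deliberately simplified: the hypothetical helper <code>ignored_kb</code> only understands literal top-level entries such as <code>node_modules/</code>, and skips comments, negations, and wildcard patterns, so treat its number as a lower bound rather than what Docker would compute.</p>

```shell
#!/bin/sh
# context_kb DIR: total on-disk size of DIR in KiB.
context_kb() {
    du -sk "$1" | cut -f1
}

# ignored_kb DIR: KiB covered by literal entries in DIR/.dockerignore.
# Simplification: comments, negations, and wildcard patterns are skipped.
ignored_kb() {
    dir=$1
    total=0
    if [ -f "$dir/.dockerignore" ]; then
        while IFS= read -r line || [ -n "$line" ]; do
            case "$line" in
                ''|'#'*|'!'*|*'*'*|*'?'*) continue ;;
            esac
            entry=${line%/}
            if [ -e "$dir/$entry" ]; then
                kb=$(du -sk "$dir/$entry" | cut -f1)
                total=$((total + kb))
            fi
        done < "$dir/.dockerignore"
    fi
    echo "$total"
}

# Scratch project: a 200 KiB file under node_modules/ plus a small source file.
demo=$(mktemp -d)
mkdir -p "$demo/node_modules" "$demo/src"
dd if=/dev/urandom of="$demo/node_modules/big.bin" bs=1024 count=200 2>/dev/null
echo 'console.log("hi")' > "$demo/src/index.js"
printf '%s\n' '# Dependencies' 'node_modules/' '*.log' > "$demo/.dockerignore"

echo "Whole directory: $(context_kb "$demo") KiB"
echo "Covered by literal .dockerignore entries: $(ignored_kb "$demo") KiB"
```

<p>Exact figures depend on your filesystem's block size, so compare the two numbers to each other rather than to a fixed value.</p>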
<h4 id="heading-measuring-dockerignore-impact">Measuring .dockerignore Impact</h4>
<p>Before adding <code>.dockerignore</code>, the <code>transferring context</code> step reports 245.7MB in 15.2s:</p>
<pre><code class="lang-bash">$ docker build .
<span class="hljs-comment">#1 [internal] load build context</span>
<span class="hljs-comment">#1 transferring context: 245.7MB in 15.2s</span>
</code></pre>
<p>After adding the <code>.dockerignore</code> file, the context shrinks to 2.1MB, transferred in 0.3s:</p>
<pre><code class="lang-bash">$ docker build .
<span class="hljs-comment">#1 [internal] load build context  </span>
<span class="hljs-comment">#1 transferring context: 2.1MB in 0.3s</span>
</code></pre>
<p><strong>Result</strong>: 99% reduction in context size and 50x faster context transfer!</p>
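<p>The percentages above come from simple arithmetic on the two <code>transferring context</code> lines. Here is a small sketch you could reuse with your own measurements. The helper names <code>reduction_pct</code> and <code>speedup</code> are hypothetical, and the inputs are the figures from this example.</p>

```shell
#!/bin/sh
# reduction_pct BEFORE AFTER: percentage by which AFTER is smaller than BEFORE.
reduction_pct() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f", (1 - after / before) * 100 }'
}

# speedup BEFORE AFTER: how many times faster AFTER is than BEFORE.
speedup() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f", before / after }'
}

# Sizes in MB and times in seconds, taken from the build output above.
echo "Context size reduction: $(reduction_pct 245.7 2.1)%"
echo "Transfer speedup: $(speedup 15.2 0.3)x"
```

<p>With these inputs the script reports a 99.1% reduction and a 50.7x speedup, which round to the figures quoted above.</p>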
<h2 id="heading-best-practices-for-docker-build-performance">Best Practices for Docker Build Performance</h2>
<p>We've covered several optimization techniques throughout this guide. Here's a quick recap of the key practices, plus some additional strategies:</p>
<ol>
<li><p><strong>Layer Caching</strong> (covered in Mistake 3): Copy dependency files before source code to maximize cache reuse.</p>
</li>
<li><p><strong>Using .dockerignore</strong> (covered in Mistake 2): Exclude unnecessary files to reduce context size and improve build speed.</p>
</li>
<li><p><strong>Choosing the Right Context</strong> (covered earlier): Select appropriate context types (local, Git, tarball) based on your use case.</p>
</li>
</ol>
<p>Now let’s talk about some more ways you can improve performance:</p>
<h3 id="heading-use-multi-stage-builds">Use Multi-Stage Builds</h3>
<p>👉 Demo project: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples/03-multistage-node">03-multistage-node</a></p>
<p>Multi-stage builds let you use one image for building/compiling your application and a different, smaller image for running it. This dramatically reduces your final image size by excluding build tools, source code, and other unnecessary files from the production image.</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Build stage</span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span> AS builder
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> package*.json ./</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm ci</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . .</span>
<span class="hljs-keyword">RUN</span><span class="bash"> npm run build</span>

<span class="hljs-comment"># Production stage</span>
<span class="hljs-keyword">FROM</span> nginx:alpine
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /app/dist /usr/share/nginx/html</span>
<span class="hljs-keyword">EXPOSE</span> <span class="hljs-number">80</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"nginx"</span>, <span class="hljs-string">"-g"</span>, <span class="hljs-string">"daemon off;"</span>]</span>
</code></pre>
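<p>To see the effect on your own project, you could build the <code>builder</code> stage and the final image separately and compare their sizes. This is only a sketch: the <code>demo-app</code> image name and its tags are placeholders, and it assumes the Dockerfile above sits in the current directory.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Build only the first stage (placeholder tag)</span>
docker build --target builder -t demo-app:builder .

<span class="hljs-comment"># Build the full multi-stage image (placeholder tag)</span>
docker build -t demo-app:prod .

<span class="hljs-comment"># List both tags side by side; compare the SIZE column</span>
docker image ls demo-app
</code></pre>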
<h3 id="heading-use-specific-base-images">Use Specific Base Images</h3>
<p>Generic base images like <code>ubuntu:latest</code> include many packages you don't need, making your images larger and slower to download. Specific images like <code>node:18-alpine</code> or distroless images contain only what's necessary for your application to run.</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Large base image</span>
<span class="hljs-keyword">FROM</span> ubuntu:latest

<span class="hljs-comment"># Smaller, more specific base image  </span>
<span class="hljs-keyword">FROM</span> node:<span class="hljs-number">18</span>-alpine

<span class="hljs-comment"># Even smaller distroless image</span>
<span class="hljs-keyword">FROM</span> gcr.io/distroless/nodejs18-debian11
</code></pre>
<h3 id="heading-combine-run-commands">Combine RUN Commands</h3>
<p>Each <code>RUN</code> command creates a new layer in your image. Files written in one layer stay in the image even if a later layer deletes them, so cleaning up in a separate <code>RUN</code> does not make the image smaller. Combining commands into a single <code>RUN</code> instruction creates just one layer, and lets you remove temporary files in the same step so they are never stored.</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Creates multiple layers</span>
<span class="hljs-keyword">RUN</span><span class="bash"> apt-get update</span>
<span class="hljs-keyword">RUN</span><span class="bash"> apt-get install -y curl</span>
<span class="hljs-keyword">RUN</span><span class="bash"> apt-get clean</span>

<span class="hljs-comment"># Single layer</span>
<span class="hljs-keyword">RUN</span><span class="bash"> apt-get update &amp;&amp; \
    apt-get install -y curl &amp;&amp; \
    apt-get clean &amp;&amp; \
    rm -rf /var/lib/apt/lists/*</span>
</code></pre>
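<p>One way to check how many layers an image has, and how much each one adds, is <code>docker history</code>. The sketch below assumes you have already built an image and tagged it <code>demo-app:latest</code> (a placeholder name):</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Show each layer's creating instruction and its size (placeholder image name)</span>
docker <span class="hljs-built_in">history</span> --format <span class="hljs-string">"table {{.CreatedBy}}\t{{.Size}}"</span> demo-app:latest
</code></pre>
<p>Running it before and after merging your <code>RUN</code> instructions should show fewer rows, and the cleanup savings should appear in the merged layer's size.</p>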
<h2 id="heading-troubleshooting-docker-build-issues">Troubleshooting Docker Build Issues</h2>
<h3 id="heading-issue-copy-failed-no-such-file-or-directory">Issue: "COPY failed: no such file or directory"</h3>
<p><strong>Problem</strong>: File not in build context<br><strong>What’s going wrong</strong>: Docker can only access files within the build context (the directory you specify in <code>docker build</code>). If your Dockerfile tries to <code>COPY</code> a file that doesn't exist in the context directory, the build fails. This often happens when running the build command from the wrong directory or when the file path is incorrect relative to the context root.</p>
<p><strong>Solution</strong>:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check what's in your context</span>
ls -la

<span class="hljs-comment"># Verify file path relative to context</span>
docker build -t debug . --progress=plain
</code></pre>
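<p>If the file exists on disk but the build still can't find it, it may be excluded by <code>.dockerignore</code>. A rough way to see what actually reaches the builder is to copy the whole context into a throwaway image and list it. This sketch assumes the <code>busybox</code> image can be pulled, and the <code>/ctx</code> path is arbitrary:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Read a temporary Dockerfile from stdin and use the current directory as context</span>
docker build --no-cache --progress=plain -f - . &lt;&lt;<span class="hljs-string">'EOF'</span>
FROM busybox
WORKDIR /ctx
COPY . .
RUN find . -maxdepth 2 | sort
EOF
</code></pre>
<p>Anything missing from the printed listing was either never inside the context directory or was filtered out by <code>.dockerignore</code>.</p>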
<h3 id="heading-issue-docker-build-is-extremely-slow">Issue: "Docker Build is extremely slow"</h3>
<p><strong>Problem</strong>: Large build context<br><strong>What’s going wrong</strong>: Docker must transfer your entire build context to the BuildKit daemon before building starts. If your context contains large files, directories like <code>node_modules</code>, or unnecessary files, this transfer can take minutes instead of seconds. The larger the context, the slower your builds become.</p>
<p><strong>Solution</strong>:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check context size</span>
du -sh .

<span class="hljs-comment"># Add more patterns to .dockerignore</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"large-directory/"</span> &gt;&gt; .dockerignore
<span class="hljs-built_in">echo</span> <span class="hljs-string">"*.zip"</span> &gt;&gt; .dockerignore
</code></pre>
<h3 id="heading-issue-cannot-locate-specified-dockerfile">Issue: "Cannot locate specified Dockerfile"</h3>
<p><strong>Problem</strong>: Dockerfile not in context root<br><strong>What’s going wrong</strong>: By default, Docker looks for a file named <code>Dockerfile</code> in the root of your build context. If your Dockerfile is in a subdirectory or has a different name, Docker can't find it. This is common in monorepo setups where Dockerfiles are organized in separate folders.</p>
<p><strong>Solution</strong>:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Specify Dockerfile location</span>
docker build -f path/to/Dockerfile .

<span class="hljs-comment"># Or move Dockerfile to context root</span>
mv path/to/Dockerfile .
</code></pre>
<h3 id="heading-issue-cache-misses-on-unchanged-files">Issue: "Cache misses on unchanged files"</h3>
<p><strong>Problem</strong>: File timestamps or permissions changed<br><strong>What’s going wrong</strong>: Docker's layer caching relies on file checksums and metadata. Even if file content is unchanged, different timestamps or permissions can cause cache misses, forcing unnecessary rebuilds. This often happens after git operations, file system operations, or when files are copied between systems.</p>
<p><strong>Solution</strong>:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Check file modifications</span>
git status

<span class="hljs-comment"># Reset timestamps</span>
git ls-files -z | xargs -0 touch -r .git/HEAD
</code></pre>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Understanding Docker build contexts and architecture is essential for achieving faster builds. We’ve covered various techniques in this article, like optimized contexts and caching strategies, creating smaller images with efficient layering and multi-stage builds, maintaining better security with proper secret handling and minimal attack surface, and delivering an improved developer experience with faster iteration cycles.</p>
<p>👉 <strong>Full code examples are available on GitHub here:</strong> <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/tree/main/beginner/docker/docker-build-architecture-examples">Docker build architecture examples</a></p>
<p>As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> or <a target="_blank" href="https://twitter.com/caesar_sage">Twitter</a>.</p>
<p>For more hands-on projects, follow and star this repository: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building">Learn-DevOps-by-building</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Kubernetes Networking Tutorial: A Guide for Developers ]]>
                </title>
                <description>
                    <![CDATA[ Kubernetes networking is one of the most critical and complex parts of running containerized workloads in production. It’s what allows different parts of a Kubernetes system – like containers and services – to talk to each other. This tutorial will w... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/kubernetes-networking-tutorial-for-developers/</link>
                <guid isPermaLink="false">68598f7c91eb0b11714a7c62</guid>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ networking ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 23 Jun 2025 17:31:40 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750697209688/e55bb451-1278-4004-ae3d-fd8bdbae47da.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Kubernetes networking is one of the most critical and complex parts of running containerized workloads in production. It’s what allows different parts of a Kubernetes system – like containers and services – to talk to each other.</p>
<p>This tutorial will walk you through both the theory as well as some hands-on examples and best practices for mastering Kubernetes networking.</p>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<ul>
<li><p>A basic understanding of containers, with <a target="_blank" href="https://docs.docker.com/engine/install/">Docker installed</a> on your system.</p>
</li>
<li><p>A basic understanding of general networking terms.</p>
</li>
<li><p>The <a target="_blank" href="https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/">kubectl</a> tool installed for running Kubernetes commands.</p>
</li>
<li><p>A Kubernetes cluster (<a target="_blank" href="https://kind.sigs.k8s.io/">Kind</a>, <a target="_blank" href="https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/cluster-intro/">Minikube</a>, and so on).</p>
</li>
<li><p><a target="_blank" href="https://helm.sh/docs/intro/install/">Helm installed</a> for Kubernetes package management.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-kubernetes-networking">Introduction to Kubernetes Networking</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-core-concepts-in-kubernetes-networking">Core Concepts in Kubernetes Networking</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cluster-networking-components">Cluster Networking Components</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-dns-and-service-discovery">DNS and Service Discovery</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pod-networking-deep-dive">Pod Networking Deep Dive</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-services-and-load-balancing">Services and Load Balancing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-network-policies-and-security">Network Policies and Security</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-common-pitfalls-and-troubleshooting">Common Pitfalls and Troubleshooting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-summary-and-next-steps">Summary and Next Steps</a></p>
</li>
</ol>
<h2 id="heading-what-is-kubernetes-networking">What is Kubernetes Networking?</h2>
<p>So what actually is networking in Kubernetes? Well, in basic terms, it helps make sure that each container can communicate with the others, even if they're on different machines. It also ensures that outside traffic can reach the right containers when it needs to.</p>
<p>Kubernetes abstracts much of the complexity involved in networking, but understanding its internal workings helps you optimize and troubleshoot applications.</p>
<p>A key factor is that each pod gets a unique IP address and can communicate with all other pods without Network Address Translation (NAT). This simple yet powerful model supports complex distributed systems.</p>
<p><strong>NAT (Network Address Translation)</strong> refers to the process of rewriting the source or destination IP address (and possibly port) of packets as they pass through a router or gateway.</p>
<p>Because NAT alters packet headers, it breaks the “end-to-end” transparency of the network:</p>
<ol>
<li><p>The receiving host sees the NAT device’s address instead of the original sender’s.</p>
</li>
<li><p>Packet captures (for example, via tcpdump) only show the translated addresses, obscuring which internal endpoint truly sent the traffic.</p>
</li>
</ol>
<h3 id="heading-example-home-wi-fi-router-nat"><strong>Example: Home Wi-Fi Router NAT</strong></h3>
<p>Imagine your home network: you have a laptop, a phone, and a smart TV all connected to the same Wi-Fi. Your Internet provider assigns you <strong>one public IP address</strong> (say, 203.0.113.5). Internally, your router gives each device a <strong>private IP</strong> (for example, 192.168.1.10 for your laptop, 192.168.1.11 for your phone, and so on).</p>
<ul>
<li><p><strong>Outbound traffic:</strong> When your laptop (192.168.1.10) requests a webpage, the router rewrites the packet’s source IP from 192.168.1.10 → 203.0.113.5 (and tracks which internal port maps to which device).</p>
</li>
<li><p><strong>Inbound traffic:</strong> When the webpage replies, it arrives at 203.0.113.5, and the router uses its NAT table to forward that packet back to 192.168.1.10.</p>
</li>
</ul>
<p>Because of this translation:</p>
<ol>
<li><p>External servers <strong>only see</strong> the router’s IP (203.0.113.5), not your laptop’s.</p>
</li>
<li><p>Packets are “masqueraded” so multiple devices can share one public address.</p>
</li>
</ol>
<p>In contrast, Kubernetes pods communicate <strong>without</strong> this extra translation layer – each pod IP is “real” within the cluster, so no router-like step obscures who talked to whom.</p>
<h3 id="heading-example-e-commerce-microservices">Example: E-Commerce Microservices</h3>
<p>Consider an online store built as separate microservices, each running in its own pod with a unique IP:</p>
<ul>
<li><p><strong>Product Catalog Service</strong>: 10.244.1.2</p>
</li>
<li><p><strong>Shopping Cart Service</strong>: 10.244.2.3</p>
</li>
<li><p><strong>User Authentication Service</strong>: 10.244.1.4</p>
</li>
<li><p><strong>Payment Processing Service</strong>: 10.244.3.5</p>
</li>
</ul>
<p>When a shopper adds an item to their cart, the Shopping Cart Pod reaches out directly to the Product Catalog Pod at 10.244.1.2. Because there’s no NAT or external proxy in the data path, this communication is fast and reliable – which is crucial for delivering a snappy, real-time user experience.</p>
<p><strong>Tip:</strong> For a complete, hands-on implementation of this scenario (and others), check out the “networking-concepts-practice” section of my repository: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/blob/main/intermediate/k8/networking-concepts-practice/README.md">Learn-DevOps-by-building | networking-concepts-practice</a></p>
<h3 id="heading-importance-in-distributed-systems">Importance in Distributed Systems</h3>
<p>Networking in distributed systems facilitates the interaction of multiple services, enabling microservices architectures to function efficiently. Reliable networking supports redundancy, scalability, and fault tolerance.</p>
<h3 id="heading-kubernetes-networking-model-principles">Kubernetes Networking Model Principles</h3>
<p>Kubernetes networking operates on three foundational pillars that create a consistent and high-performance network environment:</p>
<h4 id="heading-1-unique-ip-per-pod">1. Unique IP per Pod</h4>
<p>Every pod receives its own routable IP address, eliminating port conflicts and simplifying service discovery. This design treats pods like traditional VMs or physical hosts: each can bind to standard ports (for example, 80/443) without remapping.</p>
<p>This helps developers avoid port-management complexity, and tools (like monitoring, tracing) work seamlessly, since pods appear as first-class network endpoints.</p>
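<p>You can see the IP-per-pod model on any running cluster. As a minimal sketch, assuming your current namespace has a few pods in it, the following prints each pod's name, IP, and node:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># One row per pod: its name, its own IP address, and the node it runs on</span>
kubectl get pods -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName
</code></pre>
<p>Each row should show a different IP, even for pods scheduled on the same node.</p>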
<h4 id="heading-2-nat-free-pod-communication">2. NAT-Free Pod Communication</h4>
<p>Pods communicate directly without Network Address Translation (NAT). Packets retain their original source/destination IPs, ensuring end-to-end visibility. This simplifies debugging (for example, <code>tcpdump</code> shows real pod IPs) and enables precise network policies. No translation layer also means lower latency and no hidden stateful bottlenecks.</p>
<h4 id="heading-3-direct-node-pod-routing">3. Direct Node-Pod Routing</h4>
<p>Nodes route traffic to pods without centralized gateways. Each node handles forwarding decisions locally (via CNI plugins), creating a flat L3 network. This avoids single points of failure and optimizes performance – cross-node traffic flows directly between nodes, not through proxies. The design also scales well, because every node you add brings its own forwarding capacity.</p>
<h3 id="heading-challenges-in-container-networking">Challenges in Container Networking</h3>
<p>Common challenges include managing dynamic IP addresses, securing communications, and scaling networks without performance degradation. While Kubernetes abstracts networking complexities, real-world deployments face hurdles, like:</p>
<h4 id="heading-dynamic-ip-management">Dynamic IP Management</h4>
<p>Pods are ephemeral – IPs change constantly during scaling, failures, or updates. Hard-coded IPs break, and DNS caching (with misconfigured TTLs) risks routing to stale endpoints. Solutions like CoreDNS dynamically track pod IPs via the Kubernetes API, while readiness probes ensure only live pods are advertised.</p>
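<p>To watch this churn happen, you could follow a Service's endpoints while scaling the workload behind it. This is a sketch only: <code>my-service</code> and <code>my-app</code> are placeholder names for a Service and the Deployment it selects.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Terminal 1: stream changes to the pod IPs behind the Service (placeholder name)</span>
kubectl get endpoints my-service --watch

<span class="hljs-comment"># Terminal 2: scale the Deployment behind it (placeholder name)</span>
kubectl scale deployment my-app --replicas=5
</code></pre>
<p>New pod IPs should only appear in the first terminal once the pods pass their readiness probes.</p>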
<h4 id="heading-secure-communication">Secure Communication</h4>
<p>Default cluster-wide pod connectivity exposes "east-west" threats. Compromised workloads can scan internal services, and encrypting traffic (for example, mTLS) adds CPU overhead. Network Policies enforce segmentation (for example, isolating PCI-compliant services), and service meshes automate encryption without app changes.</p>
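<p>As a starting point for segmentation, a common pattern is a default-deny ingress policy for a namespace. The sketch below assumes a namespace called <code>demo</code> already exists (a placeholder) and that your CNI plugin enforces Network Policies. Not every plugin does.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Block all incoming traffic to every pod in the placeholder "demo" namespace</span>
kubectl apply -n demo -f - &lt;&lt;<span class="hljs-string">'EOF'</span>
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF
</code></pre>
<p>With this in place, traffic has to be allowed explicitly by further policies, which narrows what a compromised workload can reach.</p>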
<h4 id="heading-performance-at-scale">Performance at Scale</h4>
<p>Large clusters strain legacy tooling. <code>iptables</code> rule sets grow with every Service and endpoint, and because the rules are evaluated sequentially, thousands of Services slow packet processing. Overlay networks (for example, VXLAN) add encapsulation overhead and can fragment packets when the MTU isn't adjusted, and centralized load balancers bottleneck traffic. Modern CNIs (Cilium/eBPF, Calico/BGP) bypass kernel bottlenecks, while kube-proxy's IPVS mode replaces long <code>iptables</code> chains with hash-table lookups.</p>
<h2 id="heading-core-concepts-in-kubernetes-networking">Core Concepts in Kubernetes Networking</h2>
<h3 id="heading-what-are-pods-and-nodes">What are Pods and Nodes?</h3>
<p>Pods are the smallest deployable units in Kubernetes. A pod wraps one or more containers that share a network namespace, and each pod runs on a node, which can be a virtual or physical machine.</p>
<h4 id="heading-scenario-example-web-application-deployment">Scenario Example: Web Application Deployment</h4>
<p>A typical web application might have:</p>
<ul>
<li><p>Three frontend pods running NGINX (distributed across two nodes)</p>
</li>
<li><p>Five backend API pods running Node.js (distributed across three nodes)</p>
</li>
<li><p>Two database pods running PostgreSQL (on dedicated nodes with SSD storage)</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># View pods distributed across nodes</span>
kubectl get pods -o wide

NAME                        READY   STATUS    NODE
frontend-6f4d85b5c9-1p4z2   1/1     Running   worker-node-1
frontend-6f4d85b5c9-2m5x3   1/1     Running   worker-node-1
frontend-6f4d85b5c9-3n6c4   1/1     Running   worker-node-2
backend-7c8d96b6b8-4q7d5    1/1     Running   worker-node-2
backend-7c8d96b6b8-5r8e6    1/1     Running   worker-node-3
...
</code></pre>
<h3 id="heading-what-are-services">What are Services?</h3>
<p>Services expose pods using selectors. They provide a stable network identity even as pod IPs change.</p>
<pre><code class="lang-bash">kubectl expose pod nginx-pod --port=80 --target-port=80 --name=nginx-service
</code></pre>
<h4 id="heading-scenario-example-database-service-migration">Scenario Example: Database Service Migration</h4>
<p>A team needs to migrate their database from MySQL to PostgreSQL without disrupting application functionality:</p>
<ol>
<li><p>Deploy PostgreSQL pods alongside existing MySQL pods</p>
</li>
<li><p>Create a database service that initially selects only MySQL pods:</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">database-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">mysql</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">3306</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">3306</span>
</code></pre>
<ol start="3">
<li><p>Update application to be compatible with both databases</p>
</li>
<li><p>Update the service selector to include both MySQL and PostgreSQL pods:</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">selector:</span>
  <span class="hljs-attr">app:</span> <span class="hljs-string">database</span>  <span class="hljs-comment"># New label applied to both MySQL and PostgreSQL pods</span>
</code></pre>
<ol start="5">
<li>Gradually remove MySQL pods while the service routes traffic to available PostgreSQL pods</li>
</ol>
<p>The service abstraction allows for zero-downtime migration by providing a consistent endpoint throughout the transition.</p>
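<p>The selector switch in step 4 can also be done from the command line. This is a sketch under one assumption: both sets of pods already carry the <code>app: database</code> label through their own pod templates.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Point the Service at the shared label</span>
kubectl patch service database-service -p <span class="hljs-string">'{"spec":{"selector":{"app":"database"}}}'</span>

<span class="hljs-comment"># Confirm which pod IPs now sit behind the Service</span>
kubectl get endpoints database-service
</code></pre>
<p>Checking the endpoints after each step is a cheap way to confirm that traffic is going where you expect before you remove the old pods.</p>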
<h3 id="heading-communication-paths">Communication Paths</h3>
<p>A <strong>communication path</strong> is simply the route that network traffic takes from its source to its destination within (or into/out of) the cluster. In Kubernetes, the three main paths are:</p>
<ul>
<li><p><strong>Pod-to-Pod:</strong> Direct traffic between two pods (possibly on different nodes).</p>
</li>
<li><p><strong>Pod-to-Service:</strong> Traffic from a pod destined for a Kubernetes Service (which then load-balances to one of its backend pods).</p>
</li>
<li><p><strong>External-to-Service:</strong> Traffic originating outside the cluster (e.g. from an end-user or external system) directed at a Service (often via a LoadBalancer or Ingress).</p>
</li>
</ul>
<h4 id="heading-pod-to-pod-communication">Pod-to-Pod Communication</h4>
<p>Pods communicate directly with each other using their IP addresses without NAT. For example:</p>
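<pre><code class="lang-bash"><span class="hljs-comment"># Replace &lt;pod-b-ip&gt; with the IP shown by: kubectl get pod pod-b -o wide</span>
kubectl <span class="hljs-built_in">exec</span> -it pod-a -- ping &lt;pod-b-ip&gt;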
</code></pre>
<h4 id="heading-scenario-example-sidecar-logging">Scenario Example: Sidecar Logging</h4>
<p>In a log aggregation setup, each application pod has a sidecar container that processes and forwards logs:</p>
<ol>
<li><p>Application container writes logs to a shared volume</p>
</li>
<li><p>Sidecar container reads from the volume and forwards to a central logging service</p>
</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># Check communication between application and sidecar</span>
kubectl <span class="hljs-built_in">exec</span> -it app-pod -c app -- ls -la /var/<span class="hljs-built_in">log</span>/app
kubectl <span class="hljs-built_in">exec</span> -it app-pod -c log-forwarder -- tail -f /var/<span class="hljs-built_in">log</span>/app/application.log
</code></pre>
<p>Because both containers are in the same pod, they share a network namespace and can communicate via <code>localhost</code> and shared volumes without any network configuration.</p>
<h4 id="heading-pod-to-service-communication">Pod-to-Service Communication</h4>
<p>Pods communicate with services using DNS names, enabling load-balanced access to multiple pods:</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it pod-a -- curl http://my-service.default.svc.cluster.local
</code></pre>
<h4 id="heading-scenario-example-api-gateway-pattern">Scenario Example: API Gateway Pattern</h4>
<p>A microservices architecture uses an API gateway pattern:</p>
<ol>
<li><p>Frontend pods need to access fifteen or more backend microservices</p>
</li>
<li><p>Instead of tracking individual pod IPs, the frontend connects to service names:</p>
</li>
</ol>
<pre><code class="lang-javascript"><span class="hljs-comment">// Frontend code</span>
<span class="hljs-keyword">const</span> authService = <span class="hljs-string">'http://auth-service.default.svc.cluster.local'</span>;
<span class="hljs-keyword">const</span> userService = <span class="hljs-string">'http://user-service.default.svc.cluster.local'</span>;
<span class="hljs-keyword">const</span> productService = <span class="hljs-string">'http://product-service.default.svc.cluster.local'</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">getUserProducts</span>(<span class="hljs-params">userId</span>) </span>{
  <span class="hljs-keyword">const</span> authResponse = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`<span class="hljs-subst">${authService}</span>/validate`</span>);
  <span class="hljs-keyword">if</span> (authResponse.ok) {
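    <span class="hljs-comment">// fetch() resolves to a Response object, so parse each body (assumed here to be JSON)</span>
    <span class="hljs-keyword">const</span> userResponse = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`<span class="hljs-subst">${userService}</span>/users/<span class="hljs-subst">${userId}</span>`</span>);
    <span class="hljs-keyword">const</span> productsResponse = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">`<span class="hljs-subst">${productService}</span>/products?user=<span class="hljs-subst">${userId}</span>`</span>);
    <span class="hljs-keyword">return</span> { <span class="hljs-attr">user</span>: <span class="hljs-keyword">await</span> userResponse.json(), <span class="hljs-attr">products</span>: <span class="hljs-keyword">await</span> productsResponse.json() };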
  }
}
</code></pre>
<p>Each service name resolves to a stable endpoint, even as the underlying pods are scaled, replaced, or rescheduled.</p>
<h4 id="heading-external-to-service-communication">External-to-Service Communication</h4>
<p>External communication is facilitated through service types like NodePort or LoadBalancer. An example of NodePort usage:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-nodeport-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">NodePort</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">nodePort:</span> <span class="hljs-number">30080</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">my-app</span>
</code></pre>
<p>Now, this service can be accessed externally via:</p>
<pre><code class="lang-bash">curl http://&lt;NodeIP&gt;:30080
</code></pre>
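<p>If you are not sure which address to use for <code>&lt;NodeIP&gt;</code>, you can look it up with <code>kubectl</code>. The sketch below takes the first node's internal IP. It assumes that address is reachable from where you run <code>curl</code>, which is not always true for local clusters such as Kind or Minikube.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Grab the InternalIP of the first node in the cluster</span>
NODE_IP=$(kubectl get nodes -o jsonpath=<span class="hljs-string">'{.items[0].status.addresses[?(@.type=="InternalIP")].address}'</span>)

<span class="hljs-comment"># Call the Service through the NodePort defined above</span>
curl <span class="hljs-string">"http://<span class="hljs-variable">${NODE_IP}</span>:30080"</span>

<span class="hljs-comment"># On Minikube, this prints a reachable URL instead</span>
minikube service my-nodeport-service --url
</code></pre>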
<h4 id="heading-scenario-example-public-facing-web-application">Scenario Example: Public-Facing Web Application</h4>
<p>A company runs a public-facing web application that needs external access:</p>
<ol>
<li><p>Deploy the application pods with three replicas</p>
</li>
<li><p>Create a LoadBalancer service to expose the application:</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">web-app</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">service.beta.kubernetes.io/aws-load-balancer-type:</span> <span class="hljs-string">nlb</span>  <span class="hljs-comment"># Cloud-specific annotation</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">web-app</span>
</code></pre>
<ol start="3">
<li><p>When deployed on AWS, this automatically provisions a Network Load Balancer with a public DNS name</p>
</li>
<li><p>External users access the application through the load balancer, which distributes traffic across all three pods</p>
</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># Check the external IP assigned to the service</span>
kubectl get service web-app

NAME      TYPE           CLUSTER-IP      EXTERNAL-IP            PORT(S)
web-app   LoadBalancer   10.100.41.213   a1b2c3.amazonaws.com   80:32456/TCP
</code></pre>
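<p>For scripting, you may prefer to read the address straight from the Service status instead of parsing the table. This sketch assumes an AWS-style load balancer that reports a <code>hostname</code>. Other providers often populate <code>ip</code> instead, so adjust the field to match.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Read the load balancer hostname from the Service status</span>
LB_HOST=$(kubectl get service web-app -o jsonpath=<span class="hljs-string">'{.status.loadBalancer.ingress[0].hostname}'</span>)

<span class="hljs-comment"># Send a test request through the load balancer</span>
curl <span class="hljs-string">"http://<span class="hljs-variable">${LB_HOST}</span>"</span>
</code></pre>
<p>The hostname can take a minute or two to resolve after the Service is created, so an empty value or a failed lookup at first is normal.</p>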
<h2 id="heading-cluster-networking-components">Cluster Networking Components</h2>
<p>Kubernetes networking transforms abstract principles into reality through tightly orchestrated components. Central to this is the <strong>Container Network Interface (CNI)</strong>, a standardized specification that governs how network connectivity is established for containers.</p>
<h3 id="heading-what-is-a-container-network-interface-cni">What is the Container Network Interface (CNI)?</h3>
<p>At its essence, CNI acts as Kubernetes' networking plugin framework. It’s responsible for dynamically assigning IP addresses to pods, creating virtual network interfaces (like virtual Ethernet pairs), and configuring routes whenever a pod starts or stops.</p>
<p>Crucially, Kubernetes delegates these low-level networking operations to CNI plugins, allowing you to choose implementations aligned with your environment’s needs: whether that’s Flannel’s simple overlay networks for portability, Calico’s high-performance BGP routing for bare-metal efficiency, or Cilium’s eBPF-powered data plane for advanced security and observability.</p>
<p>Working alongside CNI, kube-proxy operates on every node, translating Service abstractions into concrete routing rules within the node’s kernel (using <code>iptables</code> or <code>IPVS</code>). Meanwhile, CoreDNS provides seamless service discovery by dynamically mapping human-readable names (for example, <code>cart-service.production.svc.cluster.local</code>) to stable Service IPs. Together, these components form a cohesive fabric, ensuring pods can communicate reliably whether they’re on the same node or distributed across global clusters.</p>
<h3 id="heading-high-level-cni-plugin-differences"><strong>High-Level CNI Plugin Differences:</strong></h3>
<ul>
<li><p><strong>Flannel:</strong> Simple overlay (VXLAN, host-gw) for basic multi-host networking.</p>
</li>
<li><p><strong>Calico:</strong> Pure-L3 routing using BGP or IP-in-IP, plus rich network policies.</p>
</li>
<li><p><strong>Cilium:</strong> eBPF-based dataplane for ultra-fast packet processing and advanced features like API-aware policies.</p>
</li>
</ul>
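<p>Each of these plugins implements the CNI specification for managing pod IPs and routing. To see which one your cluster runs, list the pods in the <code>kube-system</code> namespace and look for the plugin’s pods (for example, names starting with <code>calico-node</code> or <code>cilium</code>):</p>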
<pre><code class="lang-bash">kubectl get pods -n kube-system
</code></pre>
<h4 id="heading-scenario-example-multi-cloud-deployment-with-calico">Scenario Example: Multi-Cloud Deployment with Calico</h4>
<p>A company operates a multi-cloud deployment across AWS and Azure:</p>
<ol>
<li>Choose Calico as the CNI plugin for consistent networking across clouds:</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># Install Calico on both clusters (confirm the current manifest URL in the Calico docs)</span>
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

<span class="hljs-comment"># Verify Calico pods are running</span>
kubectl get pods -n kube-system -l k8s-app=calico-node
</code></pre>
<p>Calico provides:</p>
<ul>
<li><p>Consistent IPAM (IP Address Management) across both clouds</p>
</li>
<li><p>Network policy enforcement in both environments</p>
</li>
<li><p>BGP routing for optimized cross-node traffic</p>
</li>
</ul>
<ol start="2">
<li>When migrating workloads between clouds, the networking layer behaves consistently despite different underlying infrastructure.</li>
</ol>
<h3 id="heading-what-is-kube-proxy">What is kube-proxy?</h3>
<p>kube-proxy is a network component that runs on each node and implements Kubernetes’ <strong>Service</strong> abstraction. Its responsibilities include:</p>
<ul>
<li><p><strong>Watching the API server</strong> for Service and Endpoint changes.</p>
</li>
<li><p><strong>Programming the node’s packet-filtering layer</strong> (iptables or IPVS) so that traffic to a Service ClusterIP:port gets load-balanced to one of its healthy backend pods.</p>
</li>
<li><p><strong>Handling session affinity,</strong> if configured (so repeated requests from the same client go to the same pod).</p>
</li>
</ul>
<p>By doing this per-node, <code>kube-proxy</code> ensures any pod on that node can reach any Service IP without needing a central gateway.</p>
<h3 id="heading-what-are-iptables-amp-ipvs">What are iptables &amp; IPVS?</h3>
<p>Both iptables and IPVS are Linux kernel subsystems that <code>kube-proxy</code> can use to manage Service traffic:</p>
<h4 id="heading-iptables-mode">iptables mode</h4>
<p><code>kube-proxy</code> generates a set of NAT rules (in the <code>nat</code> table) so that when a packet arrives for a Service IP, the kernel rewrites its destination to one of the backend pod IPs.</p>
<h4 id="heading-ipvs-mode">IPVS mode</h4>
<p>IPVS (IP Virtual Server) runs as part of the kernel’s Netfilter framework. Instead of dozens or hundreds of iptables rules, it keeps a high-performance hash table of virtual services and real servers.</p>
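<p>The difference between the two modes can be pictured with a toy model. The Python sketch below is only an illustration and not how the kernel is implemented: the service table, the addresses, and the function names are all hypothetical. It contrasts a sequential scan over an ordered rule list (iptables-style) with a single hash-table lookup (IPVS-style).</p>
<pre><code class="lang-python"># Toy model only: real iptables and IPVS live in the Linux kernel.
# All names and addresses below are made up for illustration.

def build_services(count):
    """Map (cluster_ip, port) keys to a list of backend pod IPs."""
    services = {}
    for i in range(count):
        key = ("10.96.{}.{}".format(i // 250, i % 250 + 1), 80)
        services[key] = ["10.244.{}.{}".format(i // 250, i % 250 + 1)]
    return services


def iptables_style_lookup(rules, key):
    """Walk the rules in order and count how many were checked."""
    checked = 0
    for rule_key, backends in rules:
        checked += 1
        if rule_key == key:
            return backends, checked
    return None, checked


def ipvs_style_lookup(table, key):
    """One hash lookup, independent of how many services exist."""
    return table.get(key)


services = build_services(1000)
rules = list(services.items())   # ordered list, like a rule chain
last_key = rules[-1][0]          # worst case for a sequential scan

backends, checked = iptables_style_lookup(rules, last_key)
print("sequential scan checked", checked, "rules")
print("hash lookup returned", ipvs_style_lookup(services, last_key))
</code></pre>
<p>With 1,000 Services, the sequential scan inspects every rule before it finds the last one, while the hash lookup costs the same however large the table grows.</p>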
<p>The table below compares the <code>iptables</code> and <code>IPVS</code> modes:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Mode</strong></td><td><strong>Pros</strong></td><td><strong>Cons</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>iptables</strong></td><td>• Simple and universally available on Linux systems<br />• Battle-tested and easy to debug</td><td>• Rule complexity grows linearly with Services/Endpoints<br />• Packet processing slows at scale due to sequential rule checks<br />• Service updates trigger full rule reloads</td></tr>
<tr>
<td><strong>IPVS</strong></td><td>• O(1) lookup time regardless of cluster size<br />• Built-in load-balancing algorithms (RR, LC, SH)<br />• Incremental updates without full rule recomputation<br />• Lower CPU overhead for large clusters</td><td>• Requires Linux kernel ≥4.4 and IPVS modules loaded<br />• More complex initial configuration<br />• Limited visibility with traditional tools</td></tr>
</tbody>
</table>
</div><h4 id="heading-scenario-example-debugging-service-connectivity">Scenario Example: Debugging Service Connectivity</h4>
<p>When troubleshooting service connectivity issues in a production cluster:</p>
<ol>
<li>First, check if kube-proxy is functioning:</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># Check kube-proxy pods</span>
kubectl get pods -n kube-system -l k8s-app=kube-proxy

<span class="hljs-comment"># Examine kube-proxy logs</span>
kubectl logs -n kube-system kube-proxy-a1b2c
</code></pre>
<ol start="2">
<li>Inspect the iptables rules created by kube-proxy on a node:</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># Connect to a node</span>
ssh worker-node-1

<span class="hljs-comment"># View iptables rules for a specific service</span>
sudo iptables-save | grep my-service
</code></pre>
<ol start="3">
<li>The output reveals how traffic to ClusterIP 10.96.45.10 is load-balanced across multiple backend pod IPs:</li>
</ol>
<pre><code class="lang-bash">-A KUBE-SVC-XYZAB12345 -m comment --comment <span class="hljs-string">"default/my-service"</span> -m statistic --mode random --probability 0.33332 -j KUBE-SEP-POD1
-A KUBE-SVC-XYZAB12345 -m comment --comment <span class="hljs-string">"default/my-service"</span> -m statistic --mode random --probability 0.50000 -j KUBE-SEP-POD2
-A KUBE-SVC-XYZAB12345 -m comment --comment <span class="hljs-string">"default/my-service"</span> -j KUBE-SEP-POD3
</code></pre>
<p>Understanding these rules helps diagnose why traffic might not be reaching certain pods.</p>
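<p>The probabilities in those rules look uneven, but they produce an even split because each rule is only evaluated when the earlier ones did not match. Here is a short Python sketch of that arithmetic. The function names are hypothetical, and the code models the rule chain rather than calling any Kubernetes API:</p>
<pre><code class="lang-python">from fractions import Fraction


def rule_probabilities(endpoints):
    """Probability written on each rule: 1/n, 1/(n-1), ..., 1/1."""
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_share(endpoints):
    """Overall share of traffic that each endpoint ends up receiving."""
    shares = []
    remaining = Fraction(1)          # traffic that reaches the current rule
    for probability in rule_probabilities(endpoints):
        shares.append(remaining * probability)
        remaining -= remaining * probability   # the rest falls through
    return shares


print([float(p) for p in rule_probabilities(3)])
print([str(share) for share in effective_share(3)])
</code></pre>
<p>For three endpoints the rules carry probabilities of 1/3, 1/2, and 1 (the last rule has no probability match at all), and every endpoint receives exactly one third of the traffic.</p>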
<h2 id="heading-dns-and-service-discovery">DNS and Service Discovery</h2>
<p>Every service in Kubernetes relies on DNS to map a human-friendly name (for example, <code>my-svc.default.svc.cluster.local</code>) to its ClusterIP. When pods come and go, DNS records must update quickly so clients never hit stale addresses.</p>
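<p>Kubernetes uses <strong>CoreDNS</strong> as its cluster DNS server. When you create a Service, an A (or AAAA) record is added pointing to its ClusterIP, and SRV records are published for its named ports. For a headless Service (<code>clusterIP: None</code>), the name resolves directly to the IPs of the ready backend pods instead. CoreDNS watches the Kubernetes API for Service and endpoint changes, so if a pod crashes or is rescheduled, the affected records are updated in near-real time.</p>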
<p><strong>Key mechanics:</strong></p>
<ol>
<li><p><strong>Service A record →</strong> ClusterIP</p>
</li>
<li><p><strong>SRV records →</strong> named Service ports (and, for headless Services, the individual backend pods &amp; ports)</p>
</li>
<li><p><strong>TTL tuning →</strong> how long clients cache entries</p>
</li>
</ol>
<p><strong>Why recovery matters:</strong></p>
<ul>
<li><p>A DNS TTL that’s too long can leave clients retrying an old IP.</p>
</li>
<li><p>A TTL that’s too short increases DNS load.</p>
</li>
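<li><p>A pod is removed from a Service’s endpoints (and from a headless Service’s DNS records) only after its readiness probe reports “not ready”, so slow or missing probes delay recovery.</p>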
</li>
</ul>
<h3 id="heading-coredns">CoreDNS</h3>
<p>CoreDNS provides DNS resolution for services inside the cluster.</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it busybox -- nslookup nginx-service
</code></pre>
<p>Service discovery is automatic. Every Service gets a DNS name of the form:</p>
<pre><code class="lang-bash">&lt;service&gt;.&lt;namespace&gt;.svc.cluster.local
</code></pre>
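<p>To make the naming scheme concrete, here is a minimal Python sketch that builds a Service’s fully qualified name and lists the names a pod tries when it looks up a short name. It assumes the default cluster domain <code>cluster.local</code> and the search list Kubernetes typically writes to a pod’s <code>/etc/resolv.conf</code>. The function names are hypothetical, and your cluster’s settings may differ.</p>
<pre><code class="lang-python">def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Build the fully qualified DNS name of a Service."""
    return "{}.{}.svc.{}".format(service, namespace, cluster_domain)


def search_candidates(name, pod_namespace, cluster_domain="cluster.local"):
    """Names tried, in order, when a pod looks up a short name.

    Mirrors the typical search list in a pod's /etc/resolv.conf.
    """
    search = [
        "{}.svc.{}".format(pod_namespace, cluster_domain),
        "svc.{}".format(cluster_domain),
        cluster_domain,
    ]
    return ["{}.{}".format(name, suffix) for suffix in search]


print(service_fqdn("nginx-service"))
for candidate in search_candidates("payment-service.finance", "default"):
    print(candidate)
</code></pre>
<p>This is why <code>nslookup nginx-service</code> works from a pod in the same namespace, and why a pod in <code>default</code> can reach a Service in another namespace as <code>payment-service.finance</code>: the second entry in the search list completes the name.</p>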
<h4 id="heading-scenario-example-microservices-environment-variables-vs-dns">Scenario Example: Microservices Environment Variables vs. DNS</h4>
<p>A team is migrating from hardcoded Service IP addresses to Kubernetes DNS names:</p>
<p><strong>Before:</strong> Hardcoded IP addresses in environment variables</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">order-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">order-app</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">order-service:v1</span>
    <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PAYMENT_SERVICE_HOST</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"10.100.45.12"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">INVENTORY_SERVICE_HOST</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"10.100.67.34"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">USER_SERVICE_HOST</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"10.100.23.78"</span>
</code></pre>
<p><strong>After:</strong> Using Kubernetes DNS service discovery</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">order-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">order-app</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">order-service:v2</span>
    <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PAYMENT_SERVICE_HOST</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"payment-service.default.svc.cluster.local"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">INVENTORY_SERVICE_HOST</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"inventory-service.default.svc.cluster.local"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">USER_SERVICE_HOST</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"user-service.default.svc.cluster.local"</span>
</code></pre>
<p>When the team needs to relocate the payment service to a dedicated namespace for PCI compliance:</p>
<ol>
<li><p>Move payment service to "finance" namespace</p>
</li>
<li><p>Update only one environment variable:</p>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">PAYMENT_SERVICE_HOST</span>
  <span class="hljs-attr">value:</span> <span class="hljs-string">"payment-service.finance.svc.cluster.local"</span>
</code></pre>
<ol start="3">
<li>The application continues working without rebuilding container images or updating other services</li>
</ol>
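<p>On the application side, this works because the code only ever reads the host name from its environment. Here is a minimal Python sketch, assuming the service listens on port 8080 (the port, the default value, and the function name are illustrative and not taken from the manifests above):</p>
<pre><code class="lang-python">import os

# Fallback used when the variable is not set (an assumption for this sketch).
DEFAULT_HOST = "payment-service.default.svc.cluster.local"


def payment_service_url(environ=os.environ, port=8080):
    """Build the payment service URL from the pod's environment."""
    host = environ.get("PAYMENT_SERVICE_HOST", DEFAULT_HOST)
    return "http://{}:{}/".format(host, port)


# Simulate the value injected by the updated pod spec.
pod_env = {"PAYMENT_SERVICE_HOST": "payment-service.finance.svc.cluster.local"}
print(payment_service_url(pod_env))
print(payment_service_url({}))
</code></pre>
<p>Because the name is resolved at runtime, moving the Service only changes the pod spec and never the image.</p>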
<h2 id="heading-pod-networking-deep-dive">Pod Networking Deep Dive</h2>
<p>Under the hood, each pod has its own network namespace, virtual Ethernet (<code>veth</code>) pair, and an interface like <code>eth0</code>. The CNI plugin glues these into the cluster fabric.</p>
<p>When the kubelet creates a pod, the container runtime invokes your CNI plugin, which:</p>
<ol>
<li><p><strong>Allocates an IP</strong> from a pool.</p>
</li>
<li><p><strong>Creates a</strong> <code>veth</code> pair and moves one end into the pod’s netns.</p>
</li>
<li><p><strong>Programs routes</strong> on the host so that other nodes know how to reach this IP.</p>
</li>
</ol>
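<p>The IP allocation step can be sketched in a few lines of Python. This is a simplified model of per-node IPAM and not the code of any real CNI plugin: the class name is hypothetical, the pod CIDR <code>10.244.2.0/24</code> is just an example, and reserving the first address for the node-side gateway is an assumption.</p>
<pre><code class="lang-python">import ipaddress


class NodeIpam:
    """Hands out pod IPs from a single node's pod CIDR."""

    def __init__(self, pod_cidr):
        hosts = list(ipaddress.ip_network(pod_cidr).hosts())
        self.gateway = hosts[0]      # assumption: first address is the gateway
        self.free = hosts[1:]
        self.assigned = {}           # pod name mapped to its IP

    def allocate(self, pod_name):
        """Return the pod's IP, assigning the next free one if needed."""
        if pod_name in self.assigned:
            return self.assigned[pod_name]
        if not self.free:
            raise RuntimeError("pod CIDR exhausted")
        ip = self.free.pop(0)
        self.assigned[pod_name] = ip
        return ip

    def release(self, pod_name):
        """Return a deleted pod's IP to the pool."""
        ip = self.assigned.pop(pod_name, None)
        if ip is not None:
            self.free.append(ip)


ipam = NodeIpam("10.244.2.0/24")
print(ipam.allocate("web-frontend-pod"))
print(ipam.allocate("api-service"))
</code></pre>
<p>When the pool runs out, <code>allocate</code> raises an error. That corresponds to the IP allocation failures that can leave a pod stuck without a network.</p>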
<h3 id="heading-namespaces-and-virtual-ethernet">Namespaces and Virtual Ethernet</h3>
<p>Each pod gets a Linux network namespace and connects to the host via a virtual Ethernet pair.</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it nginx-pod -- ip addr
</code></pre>
<h4 id="heading-scenario-example-debugging-network-connectivity">Scenario Example: Debugging Network Connectivity</h4>
<p>When troubleshooting connectivity issues between pods:</p>
<ol>
<li>Examine the network interfaces inside a pod:</li>
</ol>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it web-frontend-pod -- ip addr

1: lo: &lt;LOOPBACK,UP,LOWER_UP&gt; mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
2: eth0@if18: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1450 qdisc noqueue state UP group default
    link/ether 82:cf:d8:e9:7a:12 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.2.45/24 scope global eth0
    inet6 fe80::80cf:d8ff:fee9:7a12/64 scope link
</code></pre>
<ol start="2">
<li>Trace the path from pod to node:</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># On the node hosting the pod</span>
sudo ip netns list
<span class="hljs-comment"># Shows namespace like: cni-1a2b3c4d-e5f6-7890-a1b2-c3d4e5f6a7b8</span>

<span class="hljs-comment"># Examine connections on the node</span>
sudo ip link | grep veth
<span class="hljs-comment"># Shows virtual ethernet pairs like: veth123456@if2: ...</span>

<span class="hljs-comment"># Check routes on the node</span>
sudo ip route | grep 10.244.2.45
<span class="hljs-comment"># Shows how traffic reaches the pod</span>
</code></pre>
<p>This investigation reveals how traffic flows from the pod through its namespace, via virtual ethernet pairs, then through the node's routing table to reach other pods.</p>
<h3 id="heading-shared-networking-in-multi-container-pods">Shared Networking in Multi-Container Pods</h3>
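<p>All containers in a pod share the same network namespace, so they have one IP address and can reach each other over <code>localhost</code>. This is what makes sidecar and helper containers possible.</p>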
<h4 id="heading-scenario-example-service-mesh-sidecar">Scenario Example: Service Mesh Sidecar</h4>
<p>When implementing Istio service mesh with automatic sidecar injection:</p>
<ol>
<li>Deploy an application with Istio sidecar injection enabled:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">api-service</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">sidecar.istio.io/inject:</span> <span class="hljs-string">"true"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">api-app</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">api-service:v1</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
</code></pre>
<ol start="2">
<li>After deployment, the pod has two containers sharing the same network namespace:</li>
</ol>
<pre><code class="lang-bash">kubectl describe pod api-service

Name:         api-service
...
Containers:
  api-app:
    ...
    Ports:          8080/TCP
    ...
  istio-proxy:
    ...
    Ports:          15000/TCP, 15001/TCP, 15006/TCP, 15008/TCP
    ...
</code></pre>
<ol start="3">
<li>The sidecar’s Envoy proxy listens on the ports that the pod’s traffic is redirected to (15001 for outbound and 15006 for inbound):</li>
</ol>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it api-service -c istio-proxy -- netstat -tulpn

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address     Foreign Address     State       PID/Program name
tcp        0      0 0.0.0.0:15001     0.0.0.0:*           LISTEN      1/envoy
tcp        0      0 0.0.0.0:15006     0.0.0.0:*           LISTEN      1/envoy
</code></pre>
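<ol start="4">
<li>Outbound traffic from the application container is transparently redirected through the proxy without requiring application changes. Loopback traffic such as <code>curl localhost:8080</code> stays inside the pod and is not redirected, so test with a request to another Service (the name below is an example):</li>
</ol>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it api-service -c api-app -- curl http://payment-service.default.svc.cluster.local
<span class="hljs-comment"># Looks like a direct request, but iptables rules in the pod redirect it to Envoy on port 15001 first</span>
</code></pre>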
<p>This shared network namespace enables the service mesh to implement features like traffic encryption, routing, and metrics collection without application modifications.</p>
<h2 id="heading-services-and-load-balancing">Services and Load Balancing</h2>
<p>A Kubernetes Service abstracts a set of pods behind a single virtual IP. The Service object defines a stable IP (ClusterIP), a DNS entry, and a selector. kube-proxy then programs each node to intercept traffic to that IP and forward it to one of the matching pods. A Service can be exposed in several ways, depending on its type.</p>
<h3 id="heading-service-types"><strong>Service types:</strong></h3>
<ul>
<li><p><strong>ClusterIP (default):</strong> internal only</p>
</li>
<li><p><strong>NodePort:</strong> opens the same static port (for example, <code>30080</code>) on every node</p>
</li>
<li><p><strong>LoadBalancer:</strong> asks your cloud provider for an external LB</p>
</li>
<li><p><strong>ExternalName:</strong> CNAME to an outside DNS name</p>
</li>
</ul>
<h3 id="heading-load-balancing-mechanics"><strong>Load-balancing mechanics:</strong></h3>
<ul>
<li><p><strong>kube-proxy + iptables/IPVS</strong> (random selection in iptables mode; round-robin, least-connection, and other schedulers in IPVS mode)</p>
</li>
<li><p><strong>External Ingress</strong> (NGINX, Traefik) for HTTP/S with host/path routing</p>
</li>
</ul>
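<p>Round-robin and least-connection scheduling differ only in how the next backend is chosen. The Python sketch below illustrates the two selection rules. It is not kube-proxy or IPVS code, and the backend IPs and connection counts are made up.</p>
<pre><code class="lang-python">import itertools

backends = ["10.244.1.5", "10.244.2.45", "10.244.3.7"]

# Round-robin: cycle through the backends in a fixed order.
round_robin = itertools.cycle(backends)
rr_choices = [next(round_robin) for _ in range(4)]

# Least-connection: pick the backend with the fewest active connections.
active_connections = {"10.244.1.5": 12, "10.244.2.45": 3, "10.244.3.7": 8}


def least_connection(connections):
    """Return the backend that currently has the fewest connections."""
    return min(connections, key=connections.get)


print(rr_choices)
print(least_connection(active_connections))
</code></pre>
<p>Round-robin ignores load and simply wraps around to the first backend on the fourth request, while least-connection picks the backend with only three open connections.</p>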
<h3 id="heading-service-types-1">🔧 Service Types</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type</td><td>Description</td></tr>
</thead>
<tbody>
<tr>
<td>ClusterIP</td><td>Default, internal only</td></tr>
<tr>
<td>NodePort</td><td>Exposes the Service on a static port on every node’s IP</td></tr>
<tr>
<td>LoadBalancer</td><td>Uses cloud provider LB</td></tr>
<tr>
<td>ExternalName</td><td>DNS alias for external service</td></tr>
</tbody>
</table>
</div><h4 id="heading-scenario-example-multi-tier-application-exposure">Scenario Example: Multi-Tier Application Exposure</h4>
<p>A company runs a three-tier web application with different exposure requirements:</p>
<ol>
<li>Frontend web tier (public-facing):</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-service</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">service.beta.kubernetes.io/aws-load-balancer-ssl-cert:</span> <span class="hljs-string">"arn:aws:acm:region:account:certificate/cert-id"</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">443</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
</code></pre>
<ol start="2">
<li>API tier (cluster-internal, called by the frontend):</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">api-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>  <span class="hljs-comment"># Internal only</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8000</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">api</span>
</code></pre>
<ol start="3">
<li>Database tier (cluster-internal, called by the API):</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">db-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
  <span class="hljs-attr">ports:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
    <span class="hljs-attr">targetPort:</span> <span class="hljs-number">5432</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">database</span>
</code></pre>
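<p>This configuration exposes each tier differently:</p>
<ul>
<li><p>Only the frontend is exposed to the internet (with TLS terminated at the load balancer)</p>
</li>
<li><p>The API and the database have no external IP and are reachable only from inside the cluster</p>
</li>
<li><p>A ClusterIP Service does not restrict which pods may connect, so limiting the API to frontend pods and the database to API pods additionally requires Network Policies (covered later in this article)</p>
</li>
</ul>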
<h3 id="heading-ingress-controllers">Ingress Controllers</h3>
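<p>An Ingress resource defines HTTP(S) routing and TLS termination rules, and an Ingress controller such as ingress-nginx enforces them. You can install the controller with Helm:</p>
<pre><code class="lang-bash">helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install my-ingress ingress-nginx/ingress-nginx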
</code></pre>
<h4 id="heading-scenario-example-hosting-multiple-applications-on-a-single-domain">Scenario Example: Hosting Multiple Applications on a Single Domain</h4>
<p>A company hosts multiple microservice applications under the same domain, each on a different path:</p>
<ol>
<li>Deploy nginx-ingress controller:</li>
</ol>
<pre><code class="lang-bash">helm install nginx-ingress ingress-nginx/ingress-nginx --<span class="hljs-built_in">set</span> controller.publishService.enabled=<span class="hljs-literal">true</span>
</code></pre>
<ol start="2">
<li>Configure routing for multiple services:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Ingress</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">company-apps</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">cert-manager.io/cluster-issuer:</span> <span class="hljs-string">letsencrypt-prod</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ingressClassName:</span> <span class="hljs-string">nginx</span>  <span class="hljs-comment"># replaces the deprecated kubernetes.io/ingress.class annotation</span>
  <span class="hljs-attr">tls:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">hosts:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">services.company.com</span>
    <span class="hljs-attr">secretName:</span> <span class="hljs-string">company-tls</span>
  <span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">host:</span> <span class="hljs-string">services.company.com</span>
    <span class="hljs-attr">http:</span>
      <span class="hljs-attr">paths:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/dashboard</span>
        <span class="hljs-attr">pathType:</span> <span class="hljs-string">Prefix</span>
        <span class="hljs-attr">backend:</span>
          <span class="hljs-attr">service:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">dashboard-service</span>
            <span class="hljs-attr">port:</span>
              <span class="hljs-attr">number:</span> <span class="hljs-number">80</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/api</span>
        <span class="hljs-attr">pathType:</span> <span class="hljs-string">Prefix</span>
        <span class="hljs-attr">backend:</span>
          <span class="hljs-attr">service:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">api-gateway</span>
            <span class="hljs-attr">port:</span>
              <span class="hljs-attr">number:</span> <span class="hljs-number">80</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/docs</span>
        <span class="hljs-attr">pathType:</span> <span class="hljs-string">Prefix</span>
        <span class="hljs-attr">backend:</span>
          <span class="hljs-attr">service:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">documentation-service</span>
            <span class="hljs-attr">port:</span>
              <span class="hljs-attr">number:</span> <span class="hljs-number">80</span>
</code></pre>
<ol start="3">
<li><p>User traffic flow:</p>
<ul>
<li><p>User visits <a target="_blank" href="https://services.company.com/dashboard">https://services.company.com/dashboard</a></p>
</li>
<li><p>Traffic hits the LoadBalancer service for the ingress controller</p>
</li>
<li><p>Ingress controller routes to the dashboard-service based on path</p>
</li>
<li><p>Dashboard service load balances across dashboard pods</p>
</li>
</ul>
</li>
</ol>
<p>This allows hosting multiple applications behind a single domain and TLS certificate.</p>
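<p>The <code>Prefix</code> path type used above matches path elements rather than raw characters, and the longest matching path wins. The Python sketch below models those matching rules in simplified form. It is not the NGINX controller’s implementation, the function names are hypothetical, and the rule table simply mirrors the paths in the Ingress above.</p>
<pre><code class="lang-python">def split_path(path):
    """Split a URL path into its non-empty elements."""
    return [part for part in path.split("/") if part]


def prefix_matches(rule_path, request_path):
    """Prefix pathType: compare element by element, not character by character."""
    rule = split_path(rule_path)
    request = split_path(request_path)
    return request[:len(rule)] == rule


def route(rules, request_path):
    """Return the backend of the longest matching rule, or None."""
    matches = [(path, backend) for path, backend in rules.items()
               if prefix_matches(path, request_path)]
    if not matches:
        return None
    return max(matches, key=lambda item: len(split_path(item[0])))[1]


rules = {
    "/dashboard": "dashboard-service",
    "/api": "api-gateway",
    "/docs": "documentation-service",
}

print(route(rules, "/dashboard/settings"))
print(route(rules, "/api/v1/orders"))
print(route(rules, "/apiv2"))
</code></pre>
<p>A request for <code>/apiv2</code> matches no rule, because <code>apiv2</code> is a different path element from <code>api</code>. With this Ingress it would fall through to the controller’s default backend.</p>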
<h2 id="heading-network-policies-and-security">Network Policies and Security</h2>
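<p>Network Policies restrict communication based on pod selectors and namespaces. They are enforced by the CNI plugin, so they only take effect when the plugin supports them (Calico and Cilium do). For example, the following policy (its name is illustrative) selects pods labeled <code>app: frontend</code> and, because it lists no ingress rules, denies all inbound traffic to them:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">frontend-deny-ingress</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">frontend</span>
  <span class="hljs-attr">policyTypes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
</code></pre>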
<h3 id="heading-use-cases">Use Cases</h3>
<ul>
<li><p>Isolate environments (for example, dev vs prod)</p>
</li>
<li><p>Control egress to the internet</p>
</li>
<li><p>Enforce zero-trust networking</p>
</li>
</ul>
<h4 id="heading-scenario-example-pci-compliance-for-payment-processing">Scenario Example: PCI Compliance for Payment Processing</h4>
<p>A financial application processes credit card payments and must comply with PCI DSS requirements:</p>
<ol>
<li>Create a dedicated namespace for the payment workloads:</li>
</ol>
<pre><code class="lang-bash">kubectl create namespace payment-processing
</code></pre>
<ol start="2">
<li>Deploy payment pods to the isolated namespace:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">payment-processor</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">payment-processing</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">payment</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">payment</span>
        <span class="hljs-attr">pci:</span> <span class="hljs-string">"true"</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">payment-app</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">payment-processor:v1</span>
        <span class="hljs-attr">ports:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
</code></pre>
<ol start="3">
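<li><p>Define a network policy that:</p>
<ul>
<li><p>Only allows traffic from authorized services</p>
</li>
<li><p>Blocks all egress except to specific APIs</p>
</li>
<li><p>Allows metrics to reach the logging namespace (a NetworkPolicy only allows or denies traffic, so logging connection attempts needs CNI-specific tooling)</p>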
</li>
</ul>
</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">NetworkPolicy</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pci-payment-policy</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">payment-processing</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">podSelector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">pci:</span> <span class="hljs-string">"true"</span>
  <span class="hljs-attr">policyTypes:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Ingress</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">Egress</span>
  <span class="hljs-attr">ingress:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">from:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
        <span class="hljs-attr">matchLabels:</span>
          <span class="hljs-attr">environment:</span> <span class="hljs-string">production</span>
      <span class="hljs-attr">podSelector:</span>  <span class="hljs-comment"># same list item as namespaceSelector, so both must match</span>
        <span class="hljs-attr">matchLabels:</span>
          <span class="hljs-attr">role:</span> <span class="hljs-string">checkout</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">8080</span>
  <span class="hljs-attr">egress:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">ipBlock:</span>
        <span class="hljs-attr">cidr:</span> <span class="hljs-number">192.168</span><span class="hljs-number">.5</span><span class="hljs-number">.0</span><span class="hljs-string">/24</span>  <span class="hljs-comment"># Payment gateway API</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">443</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">to:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">namespaceSelector:</span>
        <span class="hljs-attr">matchLabels:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">logging</span>
    <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">8125</span>  <span class="hljs-comment"># Metrics port</span>
</code></pre>
<ol start="4">
<li>Validate policy with connectivity tests:</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-comment"># Test from authorized pod (should succeed)</span>
kubectl <span class="hljs-built_in">exec</span> -it -n production checkout-pod -- curl payment-processor.payment-processing.svc.cluster.local:8080

<span class="hljs-comment"># Test from unauthorized pod (should time out)</span>
kubectl <span class="hljs-built_in">exec</span> -it -n default test-pod -- curl --max-time 5 payment-processor.payment-processing.svc.cluster.local:8080
</code></pre>
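<p>This network policy isolates the payment pods so that only permitted peers can reach them. Because egress is restricted as well, DNS lookups to CoreDNS (port 53) are also blocked, so add an egress rule for DNS if the payment pods need to resolve Service names.</p>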
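<p>A common source of mistakes in policies like this is how the entries under <code>from</code> and <code>to</code> combine. Each list item is a separate peer, and peers are ORed. Selectors inside a single item are ANDed. A <code>podSelector</code> on its own only covers pods in the policy’s own namespace. The Python sketch below models just that part of the evaluation. It is a simplification that ignores <code>ipBlock</code> and ports, and the function names are hypothetical.</p>
<pre><code class="lang-python">def selector_matches(selector, labels):
    """A matchLabels selector matches when every key/value pair is present."""
    return all(labels.get(key) == value for key, value in selector.items())


def peer_matches(peer, pod_labels, namespace_labels, same_namespace):
    """Selectors inside one peer are ANDed together."""
    if "namespaceSelector" in peer:
        namespace_ok = selector_matches(peer["namespaceSelector"], namespace_labels)
    else:
        # A podSelector alone only covers the policy's own namespace.
        namespace_ok = same_namespace
    pod_ok = selector_matches(peer.get("podSelector", {}), pod_labels)
    return namespace_ok and pod_ok


def ingress_allowed(peers, pod_labels, namespace_labels, same_namespace=False):
    """Separate peers in a from list are ORed."""
    return any(peer_matches(peer, pod_labels, namespace_labels, same_namespace)
               for peer in peers)


# Two list items: any pod in a production namespace OR local checkout pods.
or_peers = [
    {"namespaceSelector": {"environment": "production"}},
    {"podSelector": {"role": "checkout"}},
]
# One list item: checkout pods AND in a production namespace.
and_peers = [
    {"namespaceSelector": {"environment": "production"},
     "podSelector": {"role": "checkout"}},
]

reporting_pod = {"role": "reporting"}
production = {"environment": "production"}

print(ingress_allowed(or_peers, reporting_pod, production))
print(ingress_allowed(and_peers, reporting_pod, production))
</code></pre>
<p>With two separate list items, a reporting pod in a production namespace is allowed in. With both selectors in one item, only checkout pods from production namespaces are.</p>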
<h2 id="heading-common-pitfalls-and-troubleshooting">Common Pitfalls and Troubleshooting</h2>
<h3 id="heading-pod-not-reachable">Pod Not Reachable</h3>
<ul>
<li><p><strong>Symptom:</strong> <code>ping</code> or application traffic times out.</p>
</li>
<li><p><strong>Steps to troubleshoot:</strong></p>
<ol>
<li><p><strong>Check pod status &amp; logs:</strong></p>
<pre><code class="lang-bash"> kubectl get pod myapp-abc123 -o wide
 kubectl logs myapp-abc123
</code></pre>
</li>
<li><p><strong>Inspect CNI plugin logs:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># e.g. for Calico running in the kube-system namespace:</span>
 kubectl -n kube-system logs ds/calico-node
</code></pre>
</li>
<li><p><strong>Run a network debug container (netshoot):</strong></p>
<pre><code class="lang-bash"> kubectl run -it --rm netshoot --image=nicolaka/netshoot -- bash
 <span class="hljs-comment"># inside netshoot:</span>
 ping &lt;pod-IP&gt;
 ip link show
 ip route show
</code></pre>
</li>
</ol>
</li>
<li><p><strong>Why pods can be unreachable:</strong> IP allocation failures, misconfigured <code>veth</code>, MTU mismatch, CNI initialization errors.</p>
</li>
</ul>
<h3 id="heading-service-unreachable">Service Unreachable</h3>
<ul>
<li><p><strong>Symptom:</strong> Clients can’t hit the Service IP, or <code>curl</code> to <code>ClusterIP:port</code> fails.</p>
</li>
<li><p><strong>Steps to troubleshoot:</strong></p>
<ol>
<li><p><strong>Verify Service and Endpoints:</strong></p>
<pre><code class="lang-bash"> kubectl get svc my-svc -o yaml
 kubectl get endpoints my-svc -o wide
</code></pre>
</li>
<li><p><strong>Inspect kube-proxy rules:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># iptables mode:</span>
 sudo iptables-save | grep &lt;ClusterIP&gt;
 <span class="hljs-comment"># IPVS mode:</span>
 sudo ipvsadm -Ln
</code></pre>
</li>
<li><p><strong>Test connectivity from a pod:</strong></p>
<pre><code class="lang-bash"> kubectl <span class="hljs-built_in">exec</span> -it netshoot -- curl -v http://&lt;ClusterIP&gt;:&lt;port&gt;
</code></pre>
</li>
</ol>
</li>
<li><p><strong>Why services break:</strong> Missing endpoints (selector mismatch), stale kube-proxy rules, DNS entries pointing at wrong IP.</p>
</li>
</ul>
<h3 id="heading-policy-blocked-traffic">Policy-Blocked Traffic</h3>
<ul>
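<li><p><strong>Symptom:</strong> Connections time out, because most CNI plugins silently drop packets that a policy denies. Where a plugin rejects traffic instead, connections are refused or reset right away.</p>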
</li>
<li><p><strong>Steps to troubleshoot:</strong></p>
<ol>
<li><p><strong>List NetworkPolicies in the namespace:</strong></p>
<pre><code class="lang-bash"> kubectl get netpol
</code></pre>
</li>
<li><p><strong>Describe the policy logic:</strong></p>
<pre><code class="lang-bash"> kubectl describe netpol allow-frontend
</code></pre>
</li>
<li><p><strong>Simulate allowed vs. blocked flows:</strong></p>
<pre><code class="lang-bash"> <span class="hljs-comment"># From a debug pod:</span>
 kubectl <span class="hljs-built_in">exec</span> -it netshoot -- \
   curl --connect-timeout 2 http://&lt;target-pod-IP&gt;:&lt;port&gt;
</code></pre>
</li>
</ol>
</li>
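<li><p><strong>Why policies bite you:</strong> Once any NetworkPolicy selects a pod, traffic in that policy's direction that isn't explicitly allowed is denied. Other common causes are an overly strict <code>podSelector</code> or <code>namespaceSelector</code> and missing egress rules (DNS is easy to forget).</p>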
</li>
</ul>
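<p>When you debug a suspected policy block, it helps to tell a silent drop (timeout) apart from an active refusal. The following is a minimal Go sketch of that distinction, not a replacement for <code>curl</code> or netshoot. The names <code>classify</code>, <code>probe</code>, and <code>selfCheck</code> are made up for this illustration, and the <code>syscall.ECONNREFUSED</code> check assumes a Linux or other Unix-like host.</p>

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// classify turns the error from a TCP dial into a coarse verdict.
func classify(err error) string {
	var netErr net.Error
	switch {
	case err == nil:
		return "open"
	case errors.Is(err, syscall.ECONNREFUSED):
		// Something answered with a reset: nothing is listening, or traffic is rejected.
		return "refused"
	case errors.As(err, &netErr) && netErr.Timeout():
		// No answer at all: packets were probably dropped on the way.
		return "timeout"
	default:
		return "other: " + err.Error()
	}
}

// probe dials addr ("host:port") once and classifies the result.
func probe(addr string, timeout time.Duration) string {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err == nil {
		conn.Close()
	}
	return classify(err)
}

// selfCheck runs probe against a loopback listener, then against the
// same port after the listener has been closed.
func selfCheck() (open, closed string, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	addr := ln.Addr().String()
	open = probe(addr, time.Second)
	ln.Close()
	closed = probe(addr, time.Second)
	return open, closed, nil
}

func main() {
	open, closed, err := selfCheck()
	if err != nil {
		fmt.Println("self-check failed:", err)
		return
	}
	fmt.Println("listening port:", open)
	fmt.Println("closed port:", closed)
}
```

<p>Inside a cluster you would call <code>probe</code> with a pod IP and port. A <code>timeout</code> verdict points toward dropped packets (often a policy), while <code>refused</code> usually means the packet arrived but nothing accepted it.</p>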
<h3 id="heading-tools-you-can-use">🔍 Tools you can use:</h3>
<ul>
<li><p><strong>kubectl exec:</strong> Run arbitrary commands <strong>inside any pod</strong>. It’s ideal for running <code>ping</code>, <code>curl</code>, <code>ip</code>, or <code>tcpdump</code> from the pod’s own network namespace.</p>
</li>
<li><p><strong>tcpdump:</strong> Capture raw packets on an interface. Use it (inside netshoot or via <code>kubectl exec</code>) to see if traffic actually leaves/arrives at a pod.</p>
</li>
<li><p><strong>Netshoot:</strong> A utility pod image packed with networking tools (<code>ping</code>, <code>traceroute</code>, <code>dig</code>, <code>curl</code>, <code>tcpdump</code>, and so on) so you don’t have to build your own.</p>
</li>
<li><p><strong>Cilium Hubble:</strong> An observability UI/API for <strong>Cilium</strong> that shows per-connection flows, L4/L7 metadata, and policy verdicts in real time.</p>
</li>
<li><p><strong>Calico Flow Logs:</strong> Calico’s logging of allow/deny decisions and packet metadata. It’s great for auditing exactly which policy rule matched a given packet.</p>
</li>
</ul>
<h4 id="heading-scenario-example-troubleshooting-service-connection-issues">Scenario Example: Troubleshooting Service Connection Issues</h4>
<p>A team is experiencing intermittent connection failures to a database service:</p>
<ol>
<li>Check if the service exists and has endpoints:</li>
</ol>
<pre><code class="lang-bash">kubectl get service postgres-db
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
postgres-db  ClusterIP   10.96.145.232   &lt;none&gt;        5432/TCP   3d

kubectl get endpoints postgres-db
NAME         ENDPOINTS                                   AGE
postgres-db  &lt;none&gt;                                      3d
</code></pre>
<ol start="2">
<li>The service exists but has no endpoints. Check pod selectors:</li>
</ol>
<pre><code class="lang-bash">kubectl describe service postgres-db
Name:              postgres-db
Namespace:         default
Selector:          app=postgres,tier=db
...

kubectl get pods --selector=app=postgres,tier=db
No resources found <span class="hljs-keyword">in</span> default namespace.
</code></pre>
<ol start="3">
<li>Inspect the database pods:</li>
</ol>
<pre><code class="lang-bash">kubectl get pods -l app=postgres
NAME                        READY   STATUS    RESTARTS   AGE
postgres-6b4f87b5c9-8p7x2   1/1     Running   0          3d

kubectl describe pod postgres-6b4f87b5c9-8p7x2
...
Labels:       app=postgres
              pod-template-hash=6b4f87b5c9
...
</code></pre>
<ol start="4">
<li><p>Found the issue: The pod has the label <code>app=postgres</code> but is missing the <code>tier=db</code> label required by the service selector.</p>
</li>
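<li><p>Fix by updating the service selector. <code>kubectl patch</code> merges selector keys by default, so set <code>tier</code> to <code>null</code> to remove it:</p>
</li>
</ol>
<pre><code class="lang-bash">kubectl patch service postgres-db -p <span class="hljs-string">'{"spec":{"selector":{"app":"postgres","tier":null}}}'</span>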
</code></pre>
<ol start="6">
<li>Verify endpoints are now populated:</li>
</ol>
<pre><code class="lang-bash">kubectl get endpoints postgres-db
NAME         ENDPOINTS             AGE
postgres-db  10.244.2.45:5432      3d
</code></pre>
<p>This systematic debugging approach quickly identified a label mismatch causing the connection issues.</p>
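<p>The root cause in this scenario is the matching rule itself: a pod backs a Service only if its labels contain every key/value pair in the selector. Here is a simplified Go sketch of that rule, assuming a non-empty, equality-based selector. <code>selectorMatches</code> is a hypothetical helper written for this illustration, not a Kubernetes API.</p>

```go
package main

import "fmt"

// selectorMatches reports whether labels contain every key/value pair
// in selector. Extra labels on the pod are ignored.
func selectorMatches(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Labels taken from the pod in the scenario above.
	podLabels := map[string]string{
		"app":               "postgres",
		"pod-template-hash": "6b4f87b5c9",
	}

	before := map[string]string{"app": "postgres", "tier": "db"}
	after := map[string]string{"app": "postgres"}

	fmt.Println(selectorMatches(before, podLabels)) // false: the pod has no tier label
	fmt.Println(selectorMatches(after, podLabels))  // true: every selector key is present
}
```

<p>The sketch also shows why extra pod labels such as <code>pod-template-hash</code> are harmless: only the keys named in the selector are checked.</p>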
<h2 id="heading-summary">Summary</h2>
<p>In this tutorial, you explored:</p>
<ul>
<li><p>Pod and service communication</p>
</li>
<li><p>Cluster-wide routing and discovery</p>
</li>
<li><p>Load balancing and ingress</p>
</li>
<li><p>Network policy configuration</p>
</li>
</ul>
<p>As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> or <a target="_blank" href="https://twitter.com/caesar_sage">Twitter</a>.</p>
<p>For more hands-on projects, follow and star this repository: <a target="_blank" href="https://github.com/Caesarsage/Learn-DevOps-by-building/blob/main/intermediate/k8/networking-concepts-practice/README.md">Learn-DevOps-by-building | networking-concepts-practice</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Send and Parse JSON Data in Golang – Data Encoding and Decoding Explained With Examples ]]>
                </title>
                <description>
                    <![CDATA[ When building web applications in Golang, working with JSON data is inevitable. Whether you're sending responses to clients or parsing requests, JSON encoding and decoding are essential skills to master.  In this article, we'll explore the different ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/encoding-and-decoding-data-in-golang/</link>
                <guid isPermaLink="false">66b906a2e4bfcbefb35a6b94</guid>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ json ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 05 Aug 2024 13:00:54 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/ferenc-almasi-HfFoo4d061A-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When building web applications in Golang, working with JSON data is inevitable. Whether you're sending responses to clients or parsing requests, JSON encoding and decoding are essential skills to master. </p>
<p>In this article, we'll explore the different ways to encode and decode JSON in Golang.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-how-to-send-json-responses-encoding">How to Send JSON Responses (Encoding)</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-the-marshal-function-for-json-encoding">How to Use the Marshal Function for JSON Encoding</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-the-newencoder-function">How to Use the NewEncoder Function</a></li>
<li><a class="post-section-overview" href="#heading-how-to-parse-json-requests-decoding">How to Parse JSON Requests (Decoding)</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-the-unmarshal-function-to-parse-json-requests">How to Use the Unmarshal Function to Parse JSON Requests</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-the-newdecoder-function-for-json-decoding">How to Use NewDecoder Function for JSON Decoding</a></li>
<li><a class="post-section-overview" href="#heading-custom-json-marshaling-and-unmarshaling">Custom JSON Marshaling and Unmarshaling</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-json-marshaler">How to Use JSON Marshaler</a></li>
<li><a class="post-section-overview" href="#heading-how-to-use-json-unmarshaler">How to Use JSON Unmarshaler</a></li>
<li><a class="post-section-overview" href="#heading-trade-offs">Trade-offs</a></li>
<li><a class="post-section-overview" href="#heading-use-cases-and-recommendations">Use Cases and Recommendations</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ul>
<h2 id="heading-how-to-send-json-responses-encoding">How to Send JSON Responses (Encoding)</h2>
<p>JSON encoding is the process of converting Go data structures into JSON format.</p>
<p>Encoding refers to the process of converting data from one format to another. In the context of computing and data transmission, encoding typically involves converting data into a standardized format that can be easily stored, transmitted, or processed by different systems or applications.</p>
<p>Think of encoding like packing a suitcase for a trip. You take your clothes (data) and pack them into a suitcase (encoded format) so that they can be easily transported (transmitted) and unpacked (decoded) at your destination.</p>
<p>In the case of JSON encoding, the data is converted into a text-based format that uses human-readable characters to represent the data. This makes it easy for humans to read and understand the data, as well as for different systems to exchange and process the data.</p>
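<p>Data is encoded for different reasons depending on the format. Common ones include the following, of which compatibility and transmission are the ones that apply to JSON:</p>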
<ul>
<li>Data compression: Reducing the size of the data to make it easier to store or transmit.</li>
<li>Data security: Protecting the data from unauthorized access or tampering.</li>
<li>Data compatibility: Converting data into a format that can be read and processed by different systems or applications.</li>
<li>Data transmission: Converting data into a format that can be easily transmitted over a network or other communication channels.</li>
</ul>
<p>In Golang, we can use the <code>encoding/json</code> package to encode JSON data.</p>
<h3 id="heading-how-to-use-the-marshal-function-for-json-encoding">How to Use the Marshal Function for JSON Encoding</h3>
<p>The <code>Marshal</code> function is the most commonly used method for encoding JSON data in Golang. It takes a Go data structure as input and returns the JSON encoding as a byte slice, along with an error.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"encoding/json"</span>
    <span class="hljs-string">"net/http"</span>
)

<span class="hljs-keyword">type</span> Person <span class="hljs-keyword">struct</span> {
    Name <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"name"`</span>
    Age  <span class="hljs-keyword">int</span>    <span class="hljs-string">`json:"age"`</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handler</span><span class="hljs-params">(w http.ResponseWriter, r *http.Request)</span></span> {
    person := Person{Name: <span class="hljs-string">"John"</span>, Age: <span class="hljs-number">30</span>}

    <span class="hljs-comment">// Encoding in one step: Marshal returns the JSON as a byte slice</span>
    jsonBytes, err := json.Marshal(person)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        <span class="hljs-keyword">return</span>
    }

    <span class="hljs-comment">// Tell the client that the body is JSON, then send it</span>
    w.Header().Set(<span class="hljs-string">"Content-Type"</span>, <span class="hljs-string">"application/json"</span>)
    w.Write(jsonBytes)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    http.HandleFunc(<span class="hljs-string">"/"</span>, handler)
    http.ListenAndServe(<span class="hljs-string">":8080"</span>, <span class="hljs-literal">nil</span>)
}
</code></pre>
<h4 id="heading-code-explanation">Code Explanation:</h4>
<h6 id="heading-imports">Imports:</h6>
<ul>
<li><code>encoding/json</code>: Provides functions for encoding and decoding JSON.</li>
<li><code>net/http</code>: Provides the HTTP server and the handler types.</li>
</ul>
<h6 id="heading-user-struct">Person Struct:</h6>
<ul>
<li>Defines a struct <code>Person</code> with fields <code>Name</code> and <code>Age</code>.</li>
<li>Struct tags (for example: <code>json:"name"</code>) specify the JSON key names.</li>
</ul>
<h6 id="heading-main-function">handler Function:</h6>
<ul>
<li>Creates a <code>Person</code> instance named <code>person</code>.</li>
<li>Calls <code>json.Marshal</code> to encode the <code>person</code> struct into JSON. This returns a byte slice and an error.</li>
<li>If there's an error, it responds with <code>500 Internal Server Error</code>. Otherwise, it writes the byte slice to the response with <code>w.Write</code>.</li>
</ul>
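<p>To see what <code>json.Marshal</code> produces without starting a server, here is a small standalone sketch. The unexported <code>password</code> field and the <code>encodePerson</code> helper are additions made up for this illustration. They show that <code>encoding/json</code> only encodes exported fields.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Person struct {
	Name     string `json:"name"`
	Age      int    `json:"age"`
	password string // hypothetical unexported field: encoding/json skips it
}

// encodePerson marshals p and converts the resulting byte slice to a string.
func encodePerson(p Person) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodePerson(Person{Name: "John", Age: 30, password: "secret"})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s) // {"name":"John","age":30}
}
```

<p>The keys come from the struct tags, not from the Go field names, and the unexported field never reaches the output.</p>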
<h3 id="heading-how-to-use-the-newencoder-function">How to Use the NewEncoder Function</h3>
<p>The <code>NewEncoder</code> function is used to encode JSON data to a writer, such as a file or network connection.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"encoding/json"</span>
    <span class="hljs-string">"net/http"</span>
)

<span class="hljs-keyword">type</span> Person <span class="hljs-keyword">struct</span> {
    Name <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"name"`</span>
    Age  <span class="hljs-keyword">int</span>    <span class="hljs-string">`json:"age"`</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handler</span><span class="hljs-params">(w http.ResponseWriter, r *http.Request)</span></span> {
    person := Person{Name: <span class="hljs-string">"John"</span>, Age: <span class="hljs-number">30</span>}

    <span class="hljs-comment">// Encoding in two steps: NewEncoder, then Encode</span>
    encoder := json.NewEncoder(w)

    err := encoder.Encode(person)

    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        <span class="hljs-keyword">return</span>
    }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    http.HandleFunc(<span class="hljs-string">"/"</span>, handler)
    http.ListenAndServe(<span class="hljs-string">":8080"</span>, <span class="hljs-literal">nil</span>)
}
</code></pre>
<h4 id="heading-code-explanation-1">Code Explanation:</h4>
<h6 id="heading-inside-the-handler">Inside the handler:</h6>
<ul>
<li>The <code>handler</code> function is an HTTP handler that handles incoming HTTP requests.</li>
<li><code>w http.ResponseWriter</code>: Used to write the response.</li>
<li><code>r *http.Request</code>: Represents the incoming request.</li>
<li>A <code>Person</code> instance named <code>person</code> was created and initialized with the values <code>Name: "John"</code> and <code>Age: 30</code>.</li>
<li>A JSON encoder was created using <code>json.NewEncoder(w)</code>, which will write the JSON output to the response writer <code>w</code>.</li>
<li>The <code>person</code> struct was encoded to JSON and written to the response using <code>encoder.Encode(person)</code>.</li>
<li>If an error occurs during encoding, it is sent back to the client as an HTTP error response with a status code <code>500 Internal Server Error</code>.</li>
</ul>
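<p>One way to check a handler like this without opening a port is the standard <code>net/http/httptest</code> package. The sketch below is an illustration under a few assumptions: <code>callHandler</code> is a hypothetical helper, and the <code>Content-Type</code> header is set explicitly. It also shows that <code>Encode</code> writes a trailing newline after the JSON value.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	// Encode streams the JSON straight into the response writer.
	if err := json.NewEncoder(w).Encode(Person{Name: "John", Age: 30}); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// callHandler runs handler against an in-memory recorder instead of a real server.
func callHandler() (status int, contentType, body string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	handler(rec, req)
	return rec.Code, rec.Header().Get("Content-Type"), rec.Body.String()
}

func main() {
	status, contentType, body := callHandler()
	fmt.Println(status, contentType)
	fmt.Printf("%q\n", body) // "{\"name\":\"John\",\"age\":30}\n"
}
```

<p>The trailing newline is harmless for HTTP clients, but it matters if you compare the response body byte for byte in a test.</p>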
<h2 id="heading-how-to-parse-json-requests-decoding">How to Parse JSON Requests (Decoding)</h2>
<p>JSON decoding is the process of converting JSON data into Go data structures. </p>
<p>Decoding refers to the process of converting data from a standardized format back into its original form. In computing and data transmission, decoding involves taking encoded data and transforming it into a format that can be easily understood and processed by a specific system or application.</p>
<p>Think of decoding like unpacking a suitcase after a trip. You take the packed suitcase (encoded data) and unpack it, putting each item (data) back to its original place, so that you can use it again.</p>
<p>In the case of JSON decoding, the text-based JSON data is converted back into its original form, such as a Go data structure (like a struct or slice), so that it can be easily accessed and processed by the application.</p>
<p>Some common reasons for decoding data include:</p>
<ul>
<li>Data extraction: Retrieving specific data from a larger encoded dataset.</li>
<li>Data analysis: Converting encoded data into a format that can be easily analyzed or processed.</li>
<li>Data storage: Converting encoded data into a format that can be easily stored in a database or file system.</li>
<li>Data visualization: Converting encoded data into a format that can be easily visualized or displayed.</li>
</ul>
<p>Decoding is essentially the reverse process of encoding, and it's an essential step in many data processing pipelines.</p>
<p>In Golang, we can use the <code>encoding/json</code> package to decode JSON data.</p>
<h3 id="heading-how-to-use-the-unmarshal-function-to-parse-json-requests">How to Use the Unmarshal Function to Parse JSON Requests</h3>
<p>The <code>Unmarshal</code> function is the most commonly used method for decoding JSON data in Golang. It takes a JSON-encoded byte slice and a pointer to a Go value, fills that value with the decoded data, and returns an error.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"encoding/json"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"io"</span>
    <span class="hljs-string">"net/http"</span>
)

<span class="hljs-keyword">type</span> Person <span class="hljs-keyword">struct</span> {
    Name <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"name"`</span>
    Age  <span class="hljs-keyword">int</span>    <span class="hljs-string">`json:"age"`</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handler</span><span class="hljs-params">(w http.ResponseWriter, r *http.Request)</span></span> {
    <span class="hljs-comment">// Read the whole request body into a byte slice</span>
    body, err := io.ReadAll(r.Body)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        http.Error(w, err.Error(), http.StatusBadRequest)
        <span class="hljs-keyword">return</span>
    }

    <span class="hljs-comment">// Decoding in one step: Unmarshal parses the bytes into person</span>
    <span class="hljs-keyword">var</span> person Person
    <span class="hljs-keyword">if</span> err := json.Unmarshal(body, &amp;person); err != <span class="hljs-literal">nil</span> {
        http.Error(w, err.Error(), http.StatusBadRequest)
        <span class="hljs-keyword">return</span>
    }

    fmt.Println(person.Name) <span class="hljs-comment">// Output: John</span>
    fmt.Println(person.Age)  <span class="hljs-comment">// Output: 30</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    http.HandleFunc(<span class="hljs-string">"/"</span>, handler)
    http.ListenAndServe(<span class="hljs-string">":8080"</span>, <span class="hljs-literal">nil</span>)
}
</code></pre>
<h4 id="heading-code-explanation-2">Code Explanation:</h4>
<h6 id="heading-inside-the-handler-1">Inside the handler:</h6>
<ul>
<li>The <code>handler</code> function is an HTTP handler that handles incoming HTTP requests.</li>
<li><code>w http.ResponseWriter</code>: Used to write the response.</li>
<li><code>r *http.Request</code>: Represents the incoming request.</li>
<li><code>io.ReadAll(r.Body)</code>: Reads the entire request body into a byte slice.</li>
<li>A variable <code>person</code> of type <code>Person</code> was declared.</li>
<li><code>json.Unmarshal(body, &amp;person)</code>: This decodes the JSON bytes into the <code>person</code> struct.</li>
<li>If an error occurs while reading or decoding, it sends back a <code>400 Bad Request</code> response.</li>
<li>If decoding is successful, the <code>person</code> struct fields <code>Name</code> and <code>Age</code> are printed using <code>fmt.Println</code>.</li>
</ul>
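<p>When decoding fails, the error type tells you whether the JSON itself was malformed or a value had the wrong type. Here is a hedged sketch using <code>errors.As</code> with the standard <code>*json.SyntaxError</code> and <code>*json.UnmarshalTypeError</code> types. The helpers <code>decodePerson</code> and <code>describe</code> are names invented for this example.</p>

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// decodePerson parses data into a Person using json.Unmarshal.
func decodePerson(data string) (Person, error) {
	var p Person
	err := json.Unmarshal([]byte(data), &p)
	return p, err
}

// describe maps a decoding error to a short category.
func describe(err error) string {
	var syntaxErr *json.SyntaxError
	var typeErr *json.UnmarshalTypeError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &syntaxErr):
		return "malformed JSON"
	case errors.As(err, &typeErr):
		return "wrong type"
	default:
		return "other error"
	}
}

func main() {
	inputs := []string{
		`{"name":"John","age":30}`,       // valid
		`{"name":"John","age":"thirty"}`, // age is a string, not a number
		`{"name": John}`,                 // not valid JSON
	}
	for _, in := range inputs {
		_, err := decodePerson(in)
		fmt.Println(describe(err), "-", err)
	}
}
```

<p>In an HTTP handler, both categories would typically map to <code>400 Bad Request</code>, but the distinction lets you return a more helpful message to the client.</p>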
<h3 id="heading-how-to-use-the-newdecoder-function-for-json-decoding">How to Use the NewDecoder Function for JSON Decoding</h3>
<p>The <code>NewDecoder</code> function creates a decoder that reads JSON directly from a reader, such as a file, a network connection, or an HTTP request body.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
    <span class="hljs-string">"encoding/json"</span>
    <span class="hljs-string">"fmt"</span>
    <span class="hljs-string">"net/http"</span>
)

<span class="hljs-keyword">type</span> Person <span class="hljs-keyword">struct</span> {
    Name <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"name"`</span>
    Age  <span class="hljs-keyword">int</span>    <span class="hljs-string">`json:"age"`</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">handler</span><span class="hljs-params">(w http.ResponseWriter, r *http.Request)</span></span> {
    <span class="hljs-comment">// Decoding in two steps: NewDecoder, then Decode</span>
    decoder := json.NewDecoder(r.Body)

    <span class="hljs-keyword">var</span> person Person
    err := decoder.Decode(&amp;person)

    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        http.Error(w, err.Error(), http.StatusBadRequest)
        <span class="hljs-keyword">return</span>
    }

    fmt.Println(person.Name) <span class="hljs-comment">// Output: John</span>
    fmt.Println(person.Age)  <span class="hljs-comment">// Output: 30</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    http.HandleFunc(<span class="hljs-string">"/"</span>, handler)
    http.ListenAndServe(<span class="hljs-string">":8080"</span>, <span class="hljs-literal">nil</span>)
}
</code></pre>
<h4 id="heading-code-explanation-3">Code Explanation:</h4>
<h6 id="heading-inside-the-handler-function">Inside the handler function:</h6>
<ul>
<li>The <code>handler</code> function is an HTTP handler that handles incoming HTTP requests.</li>
<li><code>w http.ResponseWriter</code>: Used to write the response.</li>
<li><code>r *http.Request</code>: Represents the incoming request.</li>
</ul>
<h6 id="heading-create-a-decoder">Create a Decoder:</h6>
<ul>
<li><code>decoder := json.NewDecoder(r.Body)</code>: Creates a new JSON decoder that reads from the request body.</li>
</ul>
<h6 id="heading-declare-a-person-variable">Declare a Person Variable:</h6>
<ul>
<li><code>var person Person</code>: Declares a variable <code>person</code> of type <code>Person</code>.</li>
</ul>
<h6 id="heading-decode-json-into-person-struct">Decode JSON into Person Struct:</h6>
<ul>
<li><code>err := decoder.Decode(&amp;person)</code>: Decodes the JSON from the request body into the <code>person</code> struct.</li>
<li>If an error occurs during decoding, it sends a <code>400 Bad Request</code> response and returns from the function.</li>
</ul>
<h6 id="heading-print-the-decoded-values">Print the Decoded Values:</h6>
<ul>
<li><code>fmt.Println(person.Name)</code>: Prints the <code>Name</code> field of the <code>person</code> struct.</li>
<li><code>fmt.Println(person.Age)</code>: Prints the <code>Age</code> field of the <code>person</code> struct.</li>
</ul>
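<p>A decoder can also be configured before it reads anything. As a minimal sketch, the example below uses the standard <code>DisallowUnknownFields</code> option so that unexpected keys are rejected instead of silently ignored. The helper <code>strictDecode</code> and the <code>role</code> key are assumptions made for this illustration. In a handler you would pass <code>r.Body</code> to <code>json.NewDecoder</code> instead of a string reader.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// strictDecode decodes body into a Person and fails on keys that
// Person does not declare.
func strictDecode(body string) (Person, error) {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()

	var p Person
	err := dec.Decode(&p)
	return p, err
}

func main() {
	p, err := strictDecode(`{"name":"John","age":30}`)
	fmt.Println(p.Name, p.Age, err)

	// "role" is not a field of Person, so Decode returns an error.
	_, err = strictDecode(`{"name":"John","age":30,"role":"admin"}`)
	fmt.Println(err)
}
```

<p>Without the option, the second input would decode successfully and the extra key would be dropped, which can hide client mistakes such as misspelled field names.</p>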
<h3 id="heading-custom-json-marshaling-and-unmarshaling">Custom JSON Marshaling and Unmarshaling</h3>
<p>In some cases, the default JSON encoding and decoding behavior provided by <code>json.Marshal</code> and <code>json.Unmarshal</code> may not be sufficient. For instance, you may need to customize how certain fields are represented in JSON. This is where the <code>json.Marshaler</code> and <code>json.Unmarshaler</code> interfaces come in handy.</p>
<h4 id="heading-how-to-use-json-marshaler">How to use JSON Marshaler</h4>
<p>The <code>json.Marshaler</code> interface allows you to customize the JSON encoding of a type by implementing the <code>MarshalJSON</code> method. This method returns a JSON-encoded byte slice and an error.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(p Person)</span> <span class="hljs-title">MarshalJSON</span><span class="hljs-params">()</span> <span class="hljs-params">([]<span class="hljs-keyword">byte</span>, error)</span></span> {
    <span class="hljs-comment">// Alias has Person's fields but not its methods, which avoids infinite recursion</span>
    <span class="hljs-keyword">type</span> Alias Person
    <span class="hljs-keyword">return</span> json.Marshal(&amp;<span class="hljs-keyword">struct</span> {
        Alias
        <span class="hljs-comment">// This Age replaces Alias.Age in the JSON output</span>
        Age <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"age"`</span>
    }{
        Alias: (Alias)(p),
        Age:   strconv.Itoa(p.Age) + <span class="hljs-string">" years"</span>,
    })
}
</code></pre>
<p>In this example, the <code>Age</code> field is converted to a string with a " years" suffix when encoding to JSON.</p>
<h4 id="heading-how-to-use-json-unmarshaler">How to use JSON Unmarshaler</h4>
<p>The <code>json.Unmarshaler</code> interface allows you to customize the JSON decoding of a type by implementing the <code>UnmarshalJSON</code> method. This method takes a JSON-encoded byte slice and returns an error.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(p *Person)</span> <span class="hljs-title">UnmarshalJSON</span><span class="hljs-params">(data []<span class="hljs-keyword">byte</span>)</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-comment">// Alias has no UnmarshalJSON method, so decoding into it does not recurse</span>
    <span class="hljs-keyword">type</span> Alias Person
    aux := &amp;<span class="hljs-keyword">struct</span> {
        Alias
        <span class="hljs-comment">// Age arrives as a string such as "30 years"</span>
        Age <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"age"`</span>
    }{Alias: (Alias)(*p)}

    <span class="hljs-comment">// aux is already a pointer, so it can be passed directly</span>
    <span class="hljs-keyword">if</span> err := json.Unmarshal(data, aux); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    <span class="hljs-comment">// Strip the suffix and convert the rest back to an int</span>
    ageStr := strings.TrimSuffix(aux.Age, <span class="hljs-string">" years"</span>)
    age, err := strconv.Atoi(ageStr)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }

    p.Age = age
    p.Name = aux.Name
    <span class="hljs-keyword">return</span> <span class="hljs-literal">nil</span>
}
</code></pre>
<p>In this example, the <code>Age</code> field is converted from a string with a " years" suffix to an integer when decoding from JSON.</p>
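<p>To see both methods working together, here is a self-contained round trip that follows the two snippets above closely. The <code>encode</code> and <code>decode</code> helpers are hypothetical names added for this sketch, and the imports (<code>strconv</code>, <code>strings</code>) are the ones the snippets rely on.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// MarshalJSON writes Age as a string such as "30 years".
func (p Person) MarshalJSON() ([]byte, error) {
	type Alias Person // same fields, no methods, so no recursion
	return json.Marshal(struct {
		Alias
		Age string `json:"age"`
	}{Alias: Alias(p), Age: strconv.Itoa(p.Age) + " years"})
}

// UnmarshalJSON accepts the "30 years" form and stores Age as an int.
func (p *Person) UnmarshalJSON(data []byte) error {
	type Alias Person
	aux := &struct {
		Alias
		Age string `json:"age"`
	}{Alias: Alias(*p)}
	if err := json.Unmarshal(data, aux); err != nil {
		return err
	}
	age, err := strconv.Atoi(strings.TrimSuffix(aux.Age, " years"))
	if err != nil {
		return err
	}
	p.Name = aux.Name
	p.Age = age
	return nil
}

// encode and decode wrap the standard functions for the demo below.
func encode(p Person) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

func decode(s string) (Person, error) {
	var p Person
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

func main() {
	s, err := encode(Person{Name: "John", Age: 30})
	fmt.Println(s, err) // {"name":"John","age":"30 years"} <nil>

	p, err := decode(s)
	fmt.Println(p.Name, p.Age, err) // John 30 <nil>
}
```

<p>Note that callers still use plain <code>json.Marshal</code> and <code>json.Unmarshal</code>. The package detects the custom methods and calls them automatically.</p>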
<h2 id="heading-trade-offs">Trade-offs</h2>
<p>Each of the methods described above for encoding and decoding JSON comes with trade-offs. Here they are for the most commonly used methods:</p>
<h3 id="heading-jsonmarshal-and-jsonunmarshal">json.Marshal and json.Unmarshal:</h3>
<h4 id="heading-pros">Pros:</h4>
<ul>
<li><strong>Ease of Use</strong>: Straightforward for encoding (Marshal) and decoding (Unmarshal) JSON.</li>
<li><strong>Flexibility</strong>: Can be used with various types including structs, maps, slices, and more.</li>
<li><strong>Customization</strong>: Struct tags (<code>json:"name"</code>) allow customization of JSON keys and other options.</li>
</ul>
<h4 id="heading-cons">Cons:</h4>
<ul>
<li><strong>Performance</strong>: The whole JSON document is held in memory as a byte slice, so these functions may not be the best fit for very large payloads.</li>
<li><strong>Error Handling</strong>: Error messages can sometimes be less descriptive for deeply nested or complex data structures.</li>
</ul>
<h3 id="heading-jsonnewencoder-and-jsonnewdecoder">json.NewEncoder and json.NewDecoder:</h3>
<h4 id="heading-pros-1">Pros:</h4>
<ul>
<li><strong>Stream-Based</strong>: Suitable for encoding/decoding JSON in a streaming manner, which can handle large data sets without consuming a lot of memory.</li>
<li><strong>Flexibility</strong>: Can work directly with <code>io.Reader</code> and <code>io.Writer</code> interfaces, making them useful for network operations and large files.</li>
</ul>
<h4 id="heading-cons-1">Cons:</h4>
<ul>
<li><strong>Complexity</strong>: Slightly more complex to use compared to <code>json.Marshal</code> and <code>json.Unmarshal</code>.</li>
<li><strong>Error Handling</strong>: Similar to <code>json.Marshal</code> and <code>json.Unmarshal</code>, error messages can be less clear for complex structures.</li>
</ul>
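<p>To make the streaming advantage concrete, here is a sketch that walks a JSON array one element at a time with <code>Decoder.Token</code> and <code>Decoder.More</code>, so only a single <code>Person</code> is held in memory at once. The function names and the sample data are assumptions for this illustration. With a real file or request body you would pass that reader to <code>sumAges</code> directly.</p>

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// sumAges reads a JSON array of people from r, decoding one element
// at a time instead of loading the whole array.
func sumAges(r io.Reader) (int, error) {
	dec := json.NewDecoder(r)

	// Consume the opening '[' of the array.
	tok, err := dec.Token()
	if err != nil {
		return 0, err
	}
	if delim, ok := tok.(json.Delim); !ok || delim != '[' {
		return 0, fmt.Errorf("expected a JSON array, got %v", tok)
	}

	total := 0
	for dec.More() {
		var p Person
		if err := dec.Decode(&p); err != nil {
			return 0, err
		}
		total += p.Age
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return 0, err
	}
	return total, nil
}

// sumAgesInString is a convenience wrapper for in-memory input.
func sumAgesInString(s string) (int, error) {
	return sumAges(strings.NewReader(s))
}

func main() {
	total, err := sumAgesInString(`[{"name":"John","age":30},{"name":"Jane","age":25}]`)
	fmt.Println(total, err) // 55 <nil>
}
```

<p>The same loop works unchanged whether the array has two elements or two million, which is the practical benefit of the stream-based API.</p>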
<h3 id="heading-custom-marshaler-and-unmarshaler-interfaces-jsonmarshaler-and-jsonunmarshaler">Custom Marshaler and Unmarshaler Interfaces (json.Marshaler and json.Unmarshaler):</h3>
<h4 id="heading-pros-2">Pros:</h4>
<ul>
<li><strong>Customization</strong>: Full control over how types are encoded/decoded. Useful for handling complex types or custom JSON structures.</li>
<li><strong>Flexibility</strong>: Allows for implementing custom logic during marshaling and unmarshaling.</li>
</ul>
<h4 id="heading-cons-2">Cons:</h4>
<ul>
<li><strong>Complexity</strong>: More complex to implement and use, as it requires writing custom methods.</li>
<li><strong>Maintenance</strong>: Increases the maintenance burden since custom logic needs to be kept in sync with any changes in the struct or data format.</li>
</ul>
<h3 id="heading-use-cases-and-recommendations">Use Cases and Recommendations</h3>
<ul>
<li><strong>Simple Data Structures</strong>: Use <code>json.Marshal</code> and <code>json.Unmarshal</code> for straightforward encoding/decoding of simple data structures.</li>
<li><strong>Large Data Streams</strong>: Use <code>json.NewEncoder</code> and <code>json.NewDecoder</code> for working with large data streams or when interacting with files or network operations.</li>
<li><strong>Custom Requirements</strong>: Implement <code>json.Marshaler</code> and <code>json.Unmarshaler</code> interfaces when you need custom behavior for specific types.</li>
<li><strong>Quick Operations</strong>: Use anonymous structs for quick, throwaway operations where defining a full struct type is unnecessary.</li>
</ul>
<p>Each method has its own strengths and trade-offs, and the best choice depends on the specific requirements of your application.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In conclusion, mastering JSON encoding and decoding is crucial for developing web applications in Golang. </p>
<p>By understanding the different methods available in the <code>encoding/json</code> package, you can choose the most suitable approach based on your specific requirements.</p>
<p>The <code>Marshal</code> and <code>Unmarshal</code> functions offer simplicity and flexibility for general use, while <code>NewEncoder</code> and <code>NewDecoder</code> provide efficient streaming capabilities for large datasets. </p>
<p>For scenarios that demand customized JSON representations, implementing the <code>json.Marshaler</code> and <code>json.Unmarshaler</code> interfaces gives you fine-grained control over the encoding and decoding processes. </p>
<p>Each method has its own strengths and trade-offs, and knowing when and how to use them will enable you to handle JSON data effectively in your applications.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Upload Large Files Efficiently with AWS S3 Multipart Upload ]]>
                </title>
                <description>
                    <![CDATA[ Imagine running a media streaming platform where users upload large high-definition videos. Uploading such large files can be slow and may fail if the network is unreliable.  Using traditional single-part uploads can be cumbersome and inefficient for... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/upload-large-files-with-aws/</link>
                <guid isPermaLink="false">66b906c2cacc627a9522d23c</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ S3 ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 08 Jul 2024 12:02:56 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/mr-cup-fabien-barral-o6GEPQXnqMY-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Imagine running a media streaming platform where users upload large high-definition videos. Uploading such large files can be slow and may fail if the network is unreliable. </p>
<p>Using traditional single-part uploads can be cumbersome and inefficient for large files, often leading to timeout errors or the need to restart the entire upload process if any part fails. This is where the Amazon S3 multipart upload feature comes into play, offering a robust solution to these challenges.</p>
<p>In this article, you'll explore how to efficiently handle large files with Amazon S3 multipart upload. We'll discuss the benefits of using this feature, walk through the process of uploading files in parts, and provide code examples using the AWS SDK in a full-stack Node.js and React project. </p>
<p>By the end of this article, you should have a good understanding of how to leverage the Amazon S3 multipart upload to optimize file uploads in your applications.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we start, ensure you have the following:</p>
<ul>
<li>An AWS account with IAM user credentials.</li>
<li>Node.js installed on your development machine.</li>
<li>Basic knowledge of JavaScript, React, and Node.js.</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents:</h2>
<ul>
<li><a class="post-section-overview" href="#">Introduction</a></li>
<li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></li>
<li><a class="post-section-overview" href="#heading-table-of-contents">Table of Contents</a></li>
<li><a class="post-section-overview" href="#heading-how-it-works">How it works</a></li>
<li><a class="post-section-overview" href="#heading-step-1-how-to-set-up-aws-s3">Step 1: How to Set Up AWS S3</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-an-s3-bucket">How to Create an S3 Bucket</a></li>
<li><a class="post-section-overview" href="#heading-how-to-configure-s3-bucket-policy">How to Configure S3 Bucket Policy</a></li>
<li><a class="post-section-overview" href="#heading-step-2-how-to-set-up-aws-s3-backend-with-nodejs">Step 2: How to Set Up AWS S3 Backend with Node.js</a></li>
<li><a class="post-section-overview" href="#heading-how-to-initialize-a-nodejs-project">How to Initialize a Node.js Project</a></li>
<li><a class="post-section-overview" href="#heading-install-required-packages">Install Required Packages</a></li>
<li><a class="post-section-overview" href="#heading-create-server-file">Create Server File</a></li>
<li><a class="post-section-overview" href="#heading-imports-and-configurations">Imports and Configurations</a></li>
<li><a class="post-section-overview" href="#heading-middleware-and-aws-configuration">Middleware and AWS Configuration</a></li>
<li><a class="post-section-overview" href="#heading-routes">Routes</a></li>
<li><a class="post-section-overview" href="#heading-startinitialize-upload-endpoint">Start/Initialize Upload Endpoint</a></li>
<li><a class="post-section-overview" href="#heading-upload-part-endpoint">Upload Part Endpoint</a></li>
<li><a class="post-section-overview" href="#heading-complete-upload-endpoint">Complete Upload Endpoint</a></li>
<li><a class="post-section-overview" href="#heading-start-the-server">Start the Server</a></li>
<li><a class="post-section-overview" href="#heading-environment-variables">Environment Variables</a></li>
<li><a class="post-section-overview" href="#heading-running-the-server">Running the Server</a></li>
<li><a class="post-section-overview" href="#heading-step-3-how-to-set-up-the-frontend-with-react">Step 3: How to Set Up the Frontend with React</a></li>
<li><a class="post-section-overview" href="#heading-how-to-initialize-a-react-project">How to Initialize a React Project</a></li>
<li><a class="post-section-overview" href="#heading-install-required-packages-1">Install Required Packages</a></li>
<li><a class="post-section-overview" href="#heading-create-components">Create Components</a></li>
<li><a class="post-section-overview" href="#heading-app-component">App Component</a></li>
<li><a class="post-section-overview" href="#heading-testing">Testing</a></li>
<li><a class="post-section-overview" href="#heading-part-upload">Part Upload</a></li>
<li><a class="post-section-overview" href="#heading-complete-part-upload">Complete Part Upload</a></li>
<li><a class="post-section-overview" href="#heading-full-code-on-github">Full Code on GitHub</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ul>
<h2 id="heading-how-it-works">How It Works</h2>
<p>A large file is divided into smaller parts (chunks), and each part is uploaded independently to Amazon S3. Once all the parts have been uploaded, S3 combines them to create the final object.</p>
<p>Example: Uploading a 100MB file in 5MB parts results in 20 parts being uploaded to S3. Each part is uploaded with a part number that records its position, so the file can be reassembled in the correct order.</p>
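<p>The arithmetic can be sketched in a few lines. This is only an illustration, not part of the project code: <code>planParts</code> is a hypothetical helper name, and it computes byte ranges without talking to S3.</p>
<pre><code class="lang-javascript">// Hypothetical helper: compute the byte range of every part of a file.
const MB = 1024 * 1024;

function planParts(fileSize, partSize) {
  const totalParts = Math.ceil(fileSize / partSize);
  const parts = [];
  for (let partNumber = 1; partNumber &lt;= totalParts; partNumber++) {
    const start = (partNumber - 1) * partSize;
    // The last part may be smaller than partSize
    const end = Math.min(start + partSize, fileSize);
    parts.push({ partNumber, start, end });
  }
  return parts;
}

const plan = planParts(100 * MB, 5 * MB);
console.log(plan.length); // 20
console.log(plan[19]); // { partNumber: 20, start: 99614720, end: 104857600 }
</code></pre>
<p>S3 requires every part except the last one to be at least 5MB and allows at most 10,000 parts per upload, so very large files need a larger part size.</p>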
<p>Because the parts are independent, a failed part can be retried on its own without restarting the whole upload, and the upload can be paused and resumed at any time. This makes the process more robust and fault-tolerant, especially for large files.</p>
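<p>The code in this article does not retry failed parts. One way to add that is a small wrapper like the sketch below. The name <code>withRetry</code>, the number of attempts, and the delays are all assumptions for illustration.</p>
<pre><code class="lang-javascript">// Hypothetical helper: retry an async operation a few times before giving up.
async function withRetry(operation, maxAttempts = 3, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 1; attempt &lt;= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt &lt; maxAttempts) {
        // Double the wait after each failure: 500ms, then 1000ms, and so on
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) =&gt; setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed, so surface the last error to the caller
  throw lastError;
}
</code></pre>
<p>A single part upload could then be wrapped as <code>withRetry(() =&gt; uploadOnePart(partNumber))</code>, where <code>uploadOnePart</code> stands for whatever function sends one part to the server.</p>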
<p><img src="https://media.amazonwebservices.com/blog/s3_multipart_upload.png" alt="https://media.amazonwebservices.com/blog/s3_multipart_upload.png" width="341" height="377" loading="lazy">
<em>multipart AWS s3 uploads</em></p>
<p>Learn more on the <a target="_blank" href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html">Amazon S3 multipart upload docs</a>.</p>
<p>Let's get started!</p>
<h2 id="heading-step-1-how-to-set-up-aws-s3">Step 1: How to Set Up AWS S3</h2>
<h3 id="heading-how-to-create-an-s3-bucket">How to Create an S3 Bucket</h3>
<p>First, log in to the AWS Management Console.</p>
<ul>
<li>Navigate to the S3 service.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/06/create-bucket.png" alt="How to create an s3 bucket" width="600" height="400" loading="lazy">
<em>How to create an s3 bucket</em></p>
<p>Create a new bucket and take note of the bucket name.</p>
<p>Uncheck the Block Public Access settings for simplicity. After creating the bucket, we'll configure access to it with a bucket policy.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/06/create-bucket2.png" alt="How to create an s3 bucket" width="600" height="400" loading="lazy">
<em>How to create an s3 bucket</em></p>
<ul>
<li>Leave other settings as default and create the bucket.</li>
</ul>
<h3 id="heading-how-to-configure-s3-bucket-policy">How to Configure S3 Bucket Policy</h3>
<p>Now that you have created the bucket, let's set up a policy that allows users to read your objects (files/videos) through their URLs.</p>
<ul>
<li>Click on the bucket name and navigate to the <code>Permissions</code> tab.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/06/permission.png" alt="How to configure s3 bucket policy" width="600" height="400" loading="lazy">
<em>How to configure s3 bucket policy</em></p>
<p>Navigate to the <code>Bucket Policy</code> section and click on Edit.</p>
<p>Input the following policy, and replace <code>your-bucket-name</code> with your actual bucket name:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
  <span class="hljs-attr">"Statement"</span>: [
    {
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Principal"</span>: <span class="hljs-string">"*"</span>,
      <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"s3:GetObject"</span>,
      <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"arn:aws:s3:::your-bucket-name/*"</span>
    }
  ]
}
</code></pre>
<p><code>Version</code>: The version of the policy language. <code>2012-10-17</code> is the current version.</p>
<p><code>Statement</code>: An array of one or more individual statements that define the policy.</p>
<p><code>Effect</code>: The effect determines whether the statement allows or denies access.</p>
<p><code>Principal</code>: The entity that the policy is applied to. In this case, we are allowing all principals. In production, you should specify the IAM user or role that needs access.</p>
<p><code>Action</code>: The action that the policy allows or denies. In this case, we are allowing the <code>s3:GetObject</code> action, which allows users to retrieve objects from the bucket.</p>
<p><code>Resource</code>: The Amazon Resource Name (ARN) of the bucket and objects that the policy applies to. In this case, we are allowing access to all objects in the bucket.</p>
<p>Click on Save changes to apply the policy.</p>
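<p>As noted for <code>Principal</code>, a production policy would usually name a specific IAM user or role instead of <code>*</code>. The sketch below builds such a policy as a JavaScript object. The helper name <code>buildReadPolicy</code>, the account ID, and the role name are placeholders, not values from this project.</p>
<pre><code class="lang-javascript">// Hypothetical helper: build a read-only bucket policy for one principal.
function buildReadPolicy(bucketName, principalArn) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        // A specific IAM user or role instead of "*"
        Principal: { AWS: principalArn },
        Action: "s3:GetObject",
        Resource: `arn:aws:s3:::${bucketName}/*`,
      },
    ],
  };
}

// Placeholder values: replace them with your own bucket name and IAM ARN
const policy = buildReadPolicy(
  "your-bucket-name",
  "arn:aws:iam::123456789012:role/your-role-name"
);
console.log(JSON.stringify(policy, null, 2));
</code></pre>
<p>With a restricted principal, anonymous requests to an object URL are denied, so the public link used later in this tutorial would no longer open for everyone.</p>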
<h2 id="heading-step-2-how-to-set-up-aws-s3-backend-with-nodejs">Step 2: How to Set Up AWS S3 Backend with Node.js</h2>
<p>Next, let's set up the backend server with the AWS SDK to handle the file upload process.</p>
<h3 id="heading-how-to-initialize-a-nodejs-project">How to Initialize a Node.js Project</h3>
<p>Create a new directory for your project and initialize a new Node.js project:</p>
<pre><code class="lang-bash">mkdir s3-multipart-upload
<span class="hljs-built_in">cd</span> s3-multipart-upload
npm init -y
</code></pre>
<h3 id="heading-install-required-packages">Install Required Packages</h3>
<p>Install the following packages using npm:</p>
<pre><code class="lang-bash">npm install express cors dotenv multer aws-sdk
</code></pre>
<h3 id="heading-create-server-file">Create Server File</h3>
<p>Create a new file named <code>app.js</code> and add the following code. For simplicity, this single file will hold all of the upload logic.</p>
<h4 id="heading-imports-and-configurations">Imports and Configurations</h4>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> cors = <span class="hljs-built_in">require</span>(<span class="hljs-string">"cors"</span>);
<span class="hljs-keyword">const</span> express = <span class="hljs-built_in">require</span>(<span class="hljs-string">"express"</span>);
<span class="hljs-keyword">const</span> AWS = <span class="hljs-built_in">require</span>(<span class="hljs-string">"aws-sdk"</span>);
<span class="hljs-keyword">const</span> dotenv = <span class="hljs-built_in">require</span>(<span class="hljs-string">"dotenv"</span>);
<span class="hljs-keyword">const</span> multer = <span class="hljs-built_in">require</span>(<span class="hljs-string">"multer"</span>);

<span class="hljs-keyword">const</span> multerUpload = multer();
dotenv.config();

<span class="hljs-keyword">const</span> app = express();
<span class="hljs-keyword">const</span> port = <span class="hljs-number">3001</span>;
</code></pre>
<h5 id="heading-imports">Imports</h5>
<p><code>cors</code>: Middleware for enabling Cross-Origin Resource Sharing (CORS). This is necessary to allow your frontend application to interact with a backend hosted on a different domain or port.</p>
<p><code>express</code>: A minimal and flexible Node.js web application framework.</p>
<p><code>AWS</code>: The AWS SDK for JavaScript, which allows you to interact with AWS services.</p>
<p><code>dotenv</code>: A module that loads environment variables from a <strong>.env</strong> file into <strong>process.env</strong>.</p>
<p><code>multer</code>: Middleware for handling multipart/form-data, which is primarily used for uploading files.</p>
<h5 id="heading-configurations">Configurations</h5>
<p><code>multerUpload</code>: Initializes <code>multer</code> for handling file uploads.</p>
<p><code>dotenv.config()</code>: Loads the environment variables from a .env file.</p>
<p><code>app</code>: Initializes an Express application.</p>
<p><code>port</code>: Sets the port on which the Express application will run.</p>
<h4 id="heading-middleware-and-aws-configuration">Middleware and AWS Configuration</h4>
<p>Next, add the following code to configure middleware and AWS SDK:</p>
<pre><code class="lang-javascript">app.use(cors());

AWS.config.update({
  <span class="hljs-attr">accessKeyId</span>: process.env.AWS_ACCESS_KEY,
  <span class="hljs-attr">secretAccessKey</span>: process.env.AWS_SECRET_KEY,
  <span class="hljs-attr">region</span>: process.env.AWS_REGION,
});

<span class="hljs-keyword">const</span> s3 = <span class="hljs-keyword">new</span> AWS.S3();
app.use(express.json({ <span class="hljs-attr">limit</span>: <span class="hljs-string">"50mb"</span> }));
app.use(express.urlencoded({ <span class="hljs-attr">limit</span>: <span class="hljs-string">"50mb"</span>, <span class="hljs-attr">extended</span>: <span class="hljs-literal">true</span> }));
</code></pre>
<p><code>app.use(cors())</code>: Enables CORS for all routes, allowing your frontend to communicate with the backend without issues related to cross-origin requests.</p>
<p><code>AWS.config.update({ ... })</code>: Configures the AWS SDK with the access key, secret key, and region from the environment variables.</p>
<p><code>const s3 = new AWS.S3()</code>: Creates an instance of the S3 service.</p>
<p><code>app.use(express.json({ limit: '50mb' }))</code>: Configures Express to parse JSON bodies with a size limit of 50MB.</p>
<p><code>app.use(express.urlencoded({ limit: '50mb', extended: true }))</code>: Configures Express to parse URL-encoded bodies with a size limit of 50MB.</p>
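<p>The body limit matters because the frontend in this article sends each 5MB chunk as a Base64 string inside a JSON body, which is larger than the raw chunk. A quick sketch of the arithmetic (the helper name <code>base64Length</code> is hypothetical) suggests the 50MB limit leaves plenty of room:</p>
<pre><code class="lang-javascript">// Base64 encodes every 3 bytes as 4 characters (including padding).
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

const CHUNK_SIZE = 5 * 1024 * 1024; // 5MB, the chunk size used by the frontend
console.log(base64Length(CHUNK_SIZE)); // 6990508 characters, roughly 6.7MB
</code></pre>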
<h3 id="heading-routes">Routes</h3>
<p>It's time to start creating our routes. The routes required for the multipart upload process are as follows:</p>
<ul>
<li>Initialization of the upload process.</li>
<li>Uploading parts of the file.</li>
<li>Completing the upload process.</li>
</ul>
<h4 id="heading-startinitialize-upload-endpoint">Start/Initialize Upload Endpoint</h4>
<p>This route starts the upload process. Add the following code to create an endpoint for initializing the multipart upload process:</p>
<pre><code class="lang-javascript">app.post(<span class="hljs-string">"/start-upload"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> { fileName, fileType } = req.body;

  <span class="hljs-keyword">const</span> params = {
    <span class="hljs-attr">Bucket</span>: process.env.S3_BUCKET,
    <span class="hljs-attr">Key</span>: fileName,
    <span class="hljs-attr">ContentType</span>: fileType,
  };

  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> upload = <span class="hljs-keyword">await</span> s3.createMultipartUpload(params).promise();
    <span class="hljs-comment">// console.log({ upload });</span>
    res.send({ <span class="hljs-attr">uploadId</span>: upload.UploadId });
  } <span class="hljs-keyword">catch</span> (error) {
    res.status(500).send(error);
  }
});
</code></pre>
<p>The function above creates a POST endpoint <strong>/start-upload</strong> that expects a JSON body with <code>fileName</code> and <code>fileType</code> properties. It then uses the <code>createMultipartUpload</code> method from the S3 service to initialize the multipart upload process. If successful, it returns the <code>uploadId</code> to the user, which will be used to upload parts of the file.</p>
<h4 id="heading-upload-part-endpoint">Upload Part Endpoint</h4>
<p>This route receives the individual parts of the large file and uploads each one to S3 under its part number. Add the following code to create an endpoint for uploading parts of the file:</p>
<pre><code class="lang-javascript">app.post(<span class="hljs-string">"/upload-part"</span>, multerUpload.single(<span class="hljs-string">"fileChunk"</span>), <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> { fileName, partNumber, uploadId, fileChunk } = req.body;

  <span class="hljs-keyword">const</span> params = {
    <span class="hljs-attr">Bucket</span>: process.env.S3_BUCKET,
    <span class="hljs-attr">Key</span>: fileName,
    <span class="hljs-attr">PartNumber</span>: partNumber,
    <span class="hljs-attr">UploadId</span>: uploadId,
    <span class="hljs-attr">Body</span>: Buffer.from(fileChunk, <span class="hljs-string">"base64"</span>),
  };

  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> uploadParts = <span class="hljs-keyword">await</span> s3.uploadPart(params).promise();
    <span class="hljs-built_in">console</span>.log({ uploadParts });
    res.send({ <span class="hljs-attr">ETag</span>: uploadParts.ETag });
  } <span class="hljs-keyword">catch</span> (error) {
    res.status(500).send(error);
  }
});
</code></pre>
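<p>The function above creates a POST endpoint at <strong>/upload-part</strong> that expects a body with <code>fileName</code>, <code>partNumber</code>, <code>uploadId</code>, and a Base64-encoded <code>fileChunk</code>. The React frontend in this article sends these as JSON, so <code>express.json()</code> parses the body, and the <code>multer</code> middleware, which only processes <code>multipart/form-data</code> requests, passes the request through unchanged. The route decodes the chunk into a <code>Buffer</code> and uses the <code>uploadPart</code> method from the S3 service to upload it. If successful, it returns the <code>ETag</code> of the uploaded part to the client.</p>
<p>The <code>ETag</code> is an identifier that S3 returns for each uploaded part. Together with the part number, it will be used to complete the multipart upload.</p>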
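<p>To see why <code>Buffer.from(fileChunk, "base64")</code> recovers the original bytes, here is a small round trip. It is a sketch with hypothetical helper names (<code>encodeChunk</code>, <code>decodeChunk</code>) that mirrors what the browser and the server each do with a chunk.</p>
<pre><code class="lang-javascript">// Client side (browser): bytes to Base64 string, as in the React component
function encodeChunk(bytes) {
  const binary = bytes.reduce(
    (data, byte) =&gt; data + String.fromCharCode(byte),
    ""
  );
  return btoa(binary); // btoa is global in browsers and in Node.js 16+
}

// Server side (Node.js): Base64 string to Buffer, as in the /upload-part route
function decodeChunk(base64) {
  return Buffer.from(base64, "base64");
}

const original = new Uint8Array([0, 127, 128, 255]);
const received = decodeChunk(encodeChunk(original));
console.log(received.equals(Buffer.from(original))); // true
</code></pre>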
<h4 id="heading-complete-upload-endpoint">Complete Upload Endpoint</h4>
<p>Once all the parts have been uploaded, the final step is to combine them to create the final object.</p>
<p>Add the following code to create an endpoint for completing the multipart upload process:</p>
<pre><code class="lang-js">app.post(<span class="hljs-string">"/complete-upload"</span>, <span class="hljs-keyword">async</span> (req, res) =&gt; {
  <span class="hljs-keyword">const</span> { fileName, uploadId, parts } = req.body;

  <span class="hljs-keyword">const</span> params = {
    <span class="hljs-attr">Bucket</span>: process.env.S3_BUCKET,
    <span class="hljs-attr">Key</span>: fileName,
    <span class="hljs-attr">UploadId</span>: uploadId,
    <span class="hljs-attr">MultipartUpload</span>: {
      <span class="hljs-attr">Parts</span>: parts,
    },
  };

  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> complete = <span class="hljs-keyword">await</span> s3.completeMultipartUpload(params).promise();
    <span class="hljs-built_in">console</span>.log({ complete });
    res.send({ <span class="hljs-attr">fileUrl</span>: complete.Location });
  } <span class="hljs-keyword">catch</span> (error) {
    res.status(500).send(error);
  }
});
</code></pre>
<p>The function above creates a POST endpoint at <strong>/complete-upload</strong> that expects a JSON body with <code>uploadId</code>, <code>fileName</code>, and <code>parts</code> properties. It uses the <code>completeMultipartUpload</code> method from the S3 service to combine the uploaded parts and create the final object. If successful, it returns a <code>fileUrl</code> property containing the location of the completed object.</p>
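<p>S3 expects the <code>parts</code> list in ascending order of <code>PartNumber</code>. The frontend in this article uploads parts one after another, so the list is already ordered, but if parts were uploaded in parallel the server could normalize the list first. The helper below is a sketch with a hypothetical name, and the ETag values are made up.</p>
<pre><code class="lang-javascript">// Hypothetical helper: prepare the parts list for completeMultipartUpload.
function normalizeParts(parts) {
  for (const part of parts) {
    if (!part.ETag || !Number.isInteger(part.PartNumber)) {
      throw new Error(`Invalid part: ${JSON.stringify(part)}`);
    }
  }
  // Copy before sorting so the caller's array is left untouched
  return [...parts].sort((a, b) =&gt; a.PartNumber - b.PartNumber);
}

const sorted = normalizeParts([
  { ETag: '"etag-2"', PartNumber: 2 },
  { ETag: '"etag-1"', PartNumber: 1 },
]);
console.log(sorted.map((part) =&gt; part.PartNumber)); // [ 1, 2 ]
</code></pre>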
<h3 id="heading-start-the-server">Start the Server</h3>
<p>Finally, add the following code to start the Express server:</p>
<pre><code class="lang-javascript">app.listen(port, <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Server running on port <span class="hljs-subst">${port}</span>`</span>);
});
</code></pre>
<p>This code starts the Express server on port 3001 and logs a message to the console when the server is running.</p>
<h3 id="heading-environment-variables">Environment Variables</h3>
<p>Create a new file named .env in the root directory of your project and add the following environment variables:</p>
<pre><code class="lang-bash">AWS_ACCESS_KEY=your-access-key
AWS_SECRET_KEY=your-secret-key
AWS_REGION=your-region
S3_BUCKET=your-bucket-name
</code></pre>
<p>Replace <code>your-access-key</code>, <code>your-secret-key</code>, <code>your-region</code>, and <code>your-bucket-name</code> with your actual AWS credentials and bucket name.</p>
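<p>A missing variable usually shows up later as a confusing S3 error, so it can help to check for them when the server starts. The sketch below is one possible check. The names <code>REQUIRED_ENV_VARS</code> and <code>missingEnvVars</code> are hypothetical, and the variable names are the ones used in this article.</p>
<pre><code class="lang-javascript">// Hypothetical helper: list required variables that are missing or empty.
const REQUIRED_ENV_VARS = [
  "AWS_ACCESS_KEY",
  "AWS_SECRET_KEY",
  "AWS_REGION",
  "S3_BUCKET",
];

function missingEnvVars(env) {
  return REQUIRED_ENV_VARS.filter((name) =&gt; !env[name]);
}

const missing = missingEnvVars(process.env);
if (missing.length &gt; 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
  // In app.js you would likely stop here, for example with process.exit(1)
}
</code></pre>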
<h3 id="heading-running-the-server">Running the Server</h3>
<p>To run the server, execute the following command in your terminal:</p>
<pre><code class="lang-bash">node app.js
</code></pre>
<p>This will start the server on port 3001.</p>
<h2 id="heading-step-3-how-to-set-up-the-frontend-with-react">Step 3: How to Set Up the Frontend with React</h2>
<p>Now that the backend is set up, let's create a React frontend to interact with the server and upload files to S3 using the multipart upload process.</p>
<p>The frontend will be in charge of splitting the file into parts, uploading each part to the server, and completing the upload process.</p>
<h3 id="heading-how-to-initialize-a-react-project">How to Initialize a React Project</h3>
<p>Create a new React project using Create React App:</p>
<pre><code class="lang-bash">npx create-react-app s3-multipart-upload-frontend
<span class="hljs-built_in">cd</span> s3-multipart-upload-frontend
</code></pre>
<h3 id="heading-install-required-packages-1">Install Required Packages</h3>
<p>Install the following packages using npm:</p>
<pre><code class="lang-bash">  npm install axios
</code></pre>
<h3 id="heading-create-components">Create Components</h3>
<p>Create a new file named <strong>FileUpload.js</strong> in the src/components directory and add the following code:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> React, { useState } <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>;
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">"axios"</span>;

<span class="hljs-keyword">const</span> CHUNK_SIZE = <span class="hljs-number">5</span> * <span class="hljs-number">1024</span> * <span class="hljs-number">1024</span>; <span class="hljs-comment">// 5MB</span>

<span class="hljs-keyword">const</span> FileUpload = <span class="hljs-function">() =&gt;</span> {
  <span class="hljs-keyword">const</span> [file, setFile] = useState(<span class="hljs-literal">null</span>);
  <span class="hljs-keyword">const</span> [fileUrl, setFileUrl] = useState(<span class="hljs-string">""</span>);

  <span class="hljs-keyword">const</span> handleFileChange = <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
    setFile(e.target.files[<span class="hljs-number">0</span>]);
  };

  <span class="hljs-keyword">const</span> handleFileUpload = <span class="hljs-keyword">async</span> () =&gt; {
    <span class="hljs-keyword">const</span> fileName = file.name;
    <span class="hljs-keyword">const</span> fileType = file.type;
    <span class="hljs-keyword">let</span> uploadId = <span class="hljs-string">""</span>;
    <span class="hljs-keyword">let</span> parts = [];

    <span class="hljs-keyword">try</span> {
      <span class="hljs-comment">// Start the multipart upload</span>
      <span class="hljs-keyword">const</span> startUploadResponse = <span class="hljs-keyword">await</span> axios.post(
        <span class="hljs-string">"http://localhost:3001/start-upload"</span>,
        {
          fileName,
          fileType,
        }
      );

      uploadId = startUploadResponse.data.uploadId;

      <span class="hljs-comment">// Split the file into chunks and upload each part</span>
      <span class="hljs-keyword">const</span> totalParts = <span class="hljs-built_in">Math</span>.ceil(file.size / CHUNK_SIZE);

      <span class="hljs-built_in">console</span>.log(totalParts);

      <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> partNumber = <span class="hljs-number">1</span>; partNumber &lt;= totalParts; partNumber++) {
        <span class="hljs-keyword">const</span> start = (partNumber - <span class="hljs-number">1</span>) * CHUNK_SIZE;
        <span class="hljs-keyword">const</span> end = <span class="hljs-built_in">Math</span>.min(start + CHUNK_SIZE, file.size);
        <span class="hljs-keyword">const</span> fileChunk = file.slice(start, end);

        <span class="hljs-keyword">const</span> reader = <span class="hljs-keyword">new</span> FileReader();
        reader.readAsArrayBuffer(fileChunk);

        <span class="hljs-keyword">const</span> uploadPart = <span class="hljs-function">() =&gt;</span> {
          <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function">(<span class="hljs-params">resolve, reject</span>) =&gt;</span> {
            reader.onload = <span class="hljs-keyword">async</span> () =&gt; {
              <span class="hljs-keyword">try</span> {
                <span class="hljs-keyword">const</span> fileChunkBase64 = btoa(
                  <span class="hljs-keyword">new</span> <span class="hljs-built_in">Uint8Array</span>(reader.result).reduce(
                    <span class="hljs-function">(<span class="hljs-params">data, byte</span>) =&gt;</span> data + <span class="hljs-built_in">String</span>.fromCharCode(byte),
                    <span class="hljs-string">""</span>
                  )
                );

                <span class="hljs-keyword">const</span> uploadPartResponse = <span class="hljs-keyword">await</span> axios.post(
                  <span class="hljs-string">"http://localhost:3001/upload-part"</span>,
                  {
                    fileName,
                    partNumber,
                    uploadId,
                    <span class="hljs-attr">fileChunk</span>: fileChunkBase64,
                  }
                );

                parts.push({
                  <span class="hljs-attr">ETag</span>: uploadPartResponse.data.ETag,
                  <span class="hljs-attr">PartNumber</span>: partNumber,
                });
                resolve();
              } <span class="hljs-keyword">catch</span> (error) {
                <span class="hljs-comment">// Reject so a failed request reaches the outer try/catch</span>
                reject(error);
              }
            };
            reader.onerror = reject;
          });
        };

        <span class="hljs-keyword">await</span> uploadPart();
      }

      <span class="hljs-comment">// Complete the multipart upload</span>
      <span class="hljs-keyword">const</span> completeUploadResponse = <span class="hljs-keyword">await</span> axios.post(
        <span class="hljs-string">"http://localhost:3001/complete-upload"</span>,
        {
          fileName,
          uploadId,
          parts,
        }
      );

      setFileUrl(completeUploadResponse.data.fileUrl);
      alert(<span class="hljs-string">"File uploaded successfully"</span>);
    } <span class="hljs-keyword">catch</span> (error) {
      <span class="hljs-built_in">console</span>.error(<span class="hljs-string">"Error uploading file:"</span>, error);
    }
  };

  <span class="hljs-keyword">return</span> (
    <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"file"</span> <span class="hljs-attr">onChange</span>=<span class="hljs-string">{handleFileChange}</span> /&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">disabled</span>=<span class="hljs-string">{!file}</span> <span class="hljs-attr">onClick</span>=<span class="hljs-string">{handleFileUpload}</span>&gt;</span>
        Upload
      <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">hr</span> /&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">br</span> /&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">br</span> /&gt;</span>
      {fileUrl &amp;&amp; (
        <span class="hljs-tag">&lt;<span class="hljs-name">a</span> <span class="hljs-attr">href</span>=<span class="hljs-string">{fileUrl}</span> <span class="hljs-attr">target</span>=<span class="hljs-string">"_blank"</span> <span class="hljs-attr">rel</span>=<span class="hljs-string">"noopener noreferrer"</span>&gt;</span>
          View Uploaded File
        <span class="hljs-tag">&lt;/<span class="hljs-name">a</span>&gt;</span>
      )}
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
  );
};

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> FileUpload;
</code></pre>
<p>The <code>FileUpload</code> component above handles the file upload process using the multipart upload method. It splits the file into chunks, uploads each part to the server, and completes the upload process.</p>
<p>The component consists of the following key parts:</p>
<p><code>CHUNK_SIZE</code>: The size of each part in bytes. In this case, we are using 5MB parts.</p>
<p><code>handleFileChange</code>: A function that sets the selected file in the state.</p>
<p><code>handleFileUpload</code>: A function that initiates the multipart upload process by sending the file to the server in parts.</p>
<ul>
<li>It starts the upload process by calling the <strong>/start-upload</strong> endpoint and retrieves the uploadId.</li>
<li>It splits the file into chunks and uploads each part to the server using the <strong>/upload-part</strong> endpoint.</li>
<li>It completes the upload process by calling the <strong>/complete-upload</strong> endpoint with the uploadId and parts array.</li>
</ul>
<p><code>fileUrl</code>: A state variable that stores the URL of the uploaded file.</p>
<p>The component renders an input field for selecting a file, a button to upload the file, and a link to view the uploaded file.</p>
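<p>Stripped of the React and file-reading details, the three steps performed by <code>handleFileUpload</code> can be sketched as one function. This is an illustration under assumptions: <code>uploadInParts</code> is a hypothetical name, <code>post(path, body)</code> is a stand-in for an HTTP call that resolves to the JSON response, and <code>chunks</code> is an array of already Base64-encoded chunks.</p>
<pre><code class="lang-javascript">// `post(path, body)` is a stand-in for an HTTP call that resolves to the JSON response.
async function uploadInParts(post, fileName, fileType, chunks) {
  // 1. Start the upload and remember the uploadId
  const { uploadId } = await post("/start-upload", { fileName, fileType });

  // 2. Upload each chunk in order, collecting the ETag of every part
  const parts = [];
  for (let index = 0; index &lt; chunks.length; index++) {
    const partNumber = index + 1; // part numbers start at 1
    const { ETag } = await post("/upload-part", {
      fileName,
      partNumber,
      uploadId,
      fileChunk: chunks[index],
    });
    parts.push({ ETag, PartNumber: partNumber });
  }

  // 3. Ask the server to combine the parts into the final object
  return post("/complete-upload", { fileName, uploadId, parts });
}
</code></pre>
<p>With axios, <code>post</code> could be written as <code>(path, body) =&gt; axios.post("http://localhost:3001" + path, body).then((response) =&gt; response.data)</code>. Passing it in as a parameter also makes the flow easy to test with a fake implementation.</p>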
<h3 id="heading-app-component">App Component</h3>
<p>Update the App.js file in the src directory with the following code:</p>
<pre><code class="lang-javascript">
<span class="hljs-keyword">import</span> React <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>;

<span class="hljs-keyword">import</span> FileUpload <span class="hljs-keyword">from</span> <span class="hljs-string">"./components/FileUpload"</span>;

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">return</span> (
    <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">className</span>=<span class="hljs-string">"App"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Large File Upload with S3 Multipart Upload<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">FileUpload</span> /&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
  );
}


<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> App;
</code></pre>
<p>The App component renders the FileUpload component, which handles the file upload process.</p>
<h3 id="heading-how-to-start-the-frontend">How to Start the Frontend</h3>
<p>To run the frontend, execute the following command in your terminal:</p>
<pre><code class="lang-bash">npm start
</code></pre>
<p>This will start the React development server on port 3000 and open the application in your default web browser.</p>
<h2 id="heading-testing">Testing</h2>
<p>Let's test the application by uploading a large file using the frontend. If you inspect your browser's network tab, you should see the file being uploaded in parts and then combined on the server to create the final object.</p>
<h3 id="heading-part-upload">Part Upload</h3>
<p>In the image below, the <code>start-upload</code> endpoint is called to initialize the upload process. The large file is then broken into chunks, and each chunk is uploaded with the <code>upload-part</code> endpoint. You may see 10 or more of these requests, depending on the ratio of the total file size to the chunk size.</p>
<p>Each uploaded part has a unique identifier, the <code>ETag</code>, which is used to complete the upload.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/06/uplaod-start-parts.png" alt="Image uploading in parts" width="600" height="400" loading="lazy">
<em>Image uploading in parts</em></p>
<h3 id="heading-complete-part-upload">Complete Part Upload</h3>
<p>The final step of the process is the <code>complete-upload</code> endpoint, where the uploaded parts are combined into a single object for the uploaded file.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/06/upload-complete.png" alt="Image uploads completed" width="600" height="400" loading="lazy">
<em>Image uploads completed</em></p>
<p>You can click the <code>View Uploaded File</code> link to access your uploaded file.</p>
<h2 id="heading-full-code-on-github">Full Code on GitHub</h2>
<p>Click the link below to access the full code on GitHub:</p>
<p><a target="_blank" href="https://github.com/Caesarsage/aws-multipart-uploads-react-node.git">Multipart file uploads with react and NodeJS</a></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, we explored how to efficiently handle large files with Amazon S3 multipart upload. We discussed the benefits of using this feature, walked through the process of uploading files in parts, and provided code examples using Node.js and React. </p>
<p>This is a high-level implementation of the multipart upload process. You can enhance it further by adding features like progress tracking, error handling, and resumable uploads.</p>
<p>By leveraging Amazon S3 multipart upload, you can optimize file uploads in your applications by dividing large files into smaller parts, uploading them independently, and combining them to create the final object. This approach not only enhances upload performance but also adds fault tolerance and flexibility to pause and resume uploads, making it ideal for handling large files over unstable networks.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Handle Concurrency with Goroutines and Channels in Go ]]>
                </title>
                <description>
                    <![CDATA[ Concurrency is the ability of a program to perform multiple tasks simultaneously. It is a crucial aspect of building scalable and responsive systems.  Go's concurrency model is based on the concept of goroutines, lightweight threads that can run mult... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-handle-concurrency-in-go/</link>
                <guid isPermaLink="false">66b906aecacc627a9522d238</guid>
                
                    <category>
                        <![CDATA[ concurrency ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Fri, 10 May 2024 15:07:54 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/05/joshua-sortino-LqKhnDzSF-8-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Concurrency is the ability of a program to perform multiple tasks simultaneously. It is a crucial aspect of building scalable and responsive systems. </p>
<p>Go's concurrency model is based on the concept of goroutines, lightweight threads that can run multiple functions concurrently, and channels, a built-in communication mechanism for safe and efficient data exchange between goroutines.</p>
<p>Go's concurrency features enable developers to write programs that can:</p>
<ul>
<li>Handle multiple requests simultaneously, improving responsiveness and throughput.</li>
<li>Utilize multi-core processors efficiently, maximizing system resources.</li>
<li>Write concurrent code that is safe, efficient, and easy to maintain.</li>
</ul>
<p>Go's concurrency model is designed to minimize overhead and reduce latency, and it helps you avoid common concurrency errors like race conditions and deadlocks.</p>
<p>With Go, developers can build high-performance, scalable, and concurrent systems with ease, making it an ideal choice for building modern distributed systems, networks, and cloud infrastructure.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-case-study-a-bank-teller">Case study: A Bank Teller</a></li>
<li><a class="post-section-overview" href="#heading-sequential-processing-no-concurrency">Sequential Processing</a></li>
<li><a class="post-section-overview" href="#heading-concurrency">Concurrency</a></li>
<li><a class="post-section-overview" href="#heading-what-are-goroutines-and-channels">What are Goroutines and Channels?</a></li>
<li><a class="post-section-overview" href="#what-is-a-gourotine">What is a Goroutine?</a></li>
<li><a class="post-section-overview" href="#heading-how-to-implement-a-goroutine">How to Implement a Goroutine</a></li>
<li><a class="post-section-overview" href="#heading-how-does-a-goroutine-work">How Does a Goroutine Work?</a></li>
<li><a class="post-section-overview" href="#heading-what-are-waitgroups">What are WaitGroups?</a></li>
<li><a class="post-section-overview" href="#heading-what-are-channels">What are Channels?</a></li>
<li><a class="post-section-overview" href="#heading-how-to-write-data-to-a-channel">How to Write Data to a Channel</a></li>
<li><a class="post-section-overview" href="#heading-how-to-read-data-from-a-channel">How to Read Data from a Channel</a></li>
<li><a class="post-section-overview" href="#heading-how-to-implement-channels-with-goroutine">How to Implement Channels with Goroutine</a></li>
<li><a class="post-section-overview" href="#heading-what-are-channel-buffers">What are Channel Buffers?</a></li>
<li><a class="post-section-overview" href="#heading-what-is-an-unbuffered-channel">What is an Unbuffered Channel?</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-buffered-channel">How to Create a Buffered Channel</a></li>
<li><a class="post-section-overview" href="#heading-what-are-channel-directions">What are Channel Directions?</a></li>
<li><a class="post-section-overview" href="#heading-how-to-handle-multiple-communication-operations-with-channel-select">How to Handle Multiple Communication Operations with Channel Select</a></li>
<li><a class="post-section-overview" href="#heading-how-to-timeout-long-running-processes-in-a-channel">How to Timeout Long Running Processes in a Channel</a></li>
<li><a class="post-section-overview" href="#heading-how-to-close-a-channel">How to Close a Channel</a></li>
<li><a class="post-section-overview" href="#heading-how-to-iterate-over-channel-messages">How to Iterate Over Channel Messages</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ul>
<p>Let's consider a scenario to illustrate concurrency:</p>
<h2 id="heading-case-study-a-bank-teller">Case Study: A Bank Teller</h2>
<p>Imagine a busy bank with two tellers, Maria and David. Customers arrive at the bank to conduct various transactions like deposits, withdrawals, and transfers. The goal is to serve customers quickly and efficiently.</p>
<h3 id="heading-sequential-processing-no-concurrency">Sequential Processing (No Concurrency)</h3>
<p>Maria and David work sequentially, one at a time. When a customer arrives, Maria helps the customer, and David waits until Maria is finished before helping the next customer. This leads to a long wait time for customers.</p>
<h3 id="heading-concurrency">Concurrency</h3>
<p>Maria and David work concurrently, serving customers simultaneously. When a customer arrives, Maria helps the customer with a transaction, and David simultaneously helps another customer with a different transaction. They work together, sharing resources like the bank's database and cash supplies, to serve multiple customers at the same time.</p>
<p>In this scenario, concurrency enables Maria and David to work together efficiently, serving multiple customers simultaneously, and improving the overall customer experience. This same concept applies to computer programming, where concurrency enables multiple tasks to run simultaneously, improving responsiveness, efficiency, and performance.</p>
<h2 id="heading-what-are-goroutines-and-channels">What are Goroutines and Channels?</h2>
<p>A goroutine is a function that runs concurrently with other code as a lightweight thread managed by the Go runtime. It helps address concurrency and async flow requirements.</p>
<p>Goroutines allow you to start up and run other threads of execution concurrently within your program.</p>
<p>Channels are used to communicate between goroutines. A channel is a typed conduit through which you can send and receive values with the channel operator: <code>&lt;-</code>.</p>
<h3 id="heading-how-to-implement-a-goroutine">How to Implement a Goroutine</h3>
<p>To start a <code>goroutine</code>, put the <code>go</code> keyword before a function call.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"math/rand"</span>
  <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">pause</span><span class="hljs-params">()</span></span> {
  time.Sleep(time.Duration(rand.Intn(<span class="hljs-number">1000</span>)) * time.Millisecond)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">sendMsg</span><span class="hljs-params">(msg <span class="hljs-keyword">string</span>)</span></span> {
  pause()
  fmt.Println(msg)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  sendMsg(<span class="hljs-string">"hello"</span>) <span class="hljs-comment">// sync</span>

  <span class="hljs-keyword">go</span> sendMsg(<span class="hljs-string">"test1"</span>) <span class="hljs-comment">// async</span>
  <span class="hljs-keyword">go</span> sendMsg(<span class="hljs-string">"test2"</span>) <span class="hljs-comment">// async</span>
  <span class="hljs-keyword">go</span> sendMsg(<span class="hljs-string">"test3"</span>) <span class="hljs-comment">// async</span>

  sendMsg(<span class="hljs-string">"main"</span>) <span class="hljs-comment">// sync</span>

  time.Sleep(<span class="hljs-number">2</span> * time.Second)
}
</code></pre>
<p>From the example above,</p>
<ul>
<li>The <code>sendMsg</code> function is called both synchronously and asynchronously.</li>
<li>When <code>sendMsg</code> is called without the <code>go</code> keyword, it runs synchronously.</li>
<li>When <code>sendMsg</code> is called with the <code>go</code> keyword, it runs asynchronously in a new goroutine.</li>
</ul>
<h3 id="heading-how-does-a-goroutine-work">How Does a Goroutine Work?</h3>
<p>When the <code>sendMsg</code> function is called with the <code>go</code> keyword, the call returns immediately: the <code>main</code> function does not wait for <code>sendMsg</code> to finish executing before it continues to the next line of code.</p>
<p>Otherwise, the function is called synchronously, and the <code>main</code> function will wait for the <code>sendMsg</code> function to finish executing before it continues to the next line of code.</p>
<p>The order of the output when you run the above example may differ from the order of the code. The three goroutines run concurrently, and because each one pauses for a random period of time, they print in the order in which they wake up.</p>
<p>The <code>time.Sleep(2 * time.Second)</code> call is a quick and simple way to keep the main function running for 2 seconds so that the goroutines can finish executing before the main function exits. Without it, the main function may return before the goroutines have finished, and the program will exit without printing their output.</p>
<h3 id="heading-what-are-waitgroups">What are WaitGroups?</h3>
<p>Unlike the <code>time.Sleep(2 * time.Second)</code> call used in the example above, a <code>WaitGroup</code> is the standard way to wait for a collection of goroutines to finish executing. It is a simple way to synchronize multiple goroutines.</p>
<p>A goroutine can also be started from an anonymous function, as the first goroutine in this example shows:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"sync"</span>
  <span class="hljs-string">"time"</span>
  <span class="hljs-string">"math/rand"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">pause</span><span class="hljs-params">()</span></span> {
  time.Sleep(time.Duration(rand.Intn(<span class="hljs-number">1000</span>)) * time.Millisecond)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">sendMsg</span><span class="hljs-params">(msg <span class="hljs-keyword">string</span>, wg *sync.WaitGroup)</span></span> {
  <span class="hljs-keyword">defer</span> wg.Done()
  pause()
  fmt.Println(msg)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  <span class="hljs-keyword">var</span> wg sync.WaitGroup

  wg.Add(<span class="hljs-number">3</span>)

  <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(msg <span class="hljs-keyword">string</span>)</span></span> {
    <span class="hljs-keyword">defer</span> wg.Done()
    pause()
    fmt.Println(msg)
  }(<span class="hljs-string">"test1"</span>)


  <span class="hljs-keyword">go</span> sendMsg(<span class="hljs-string">"test2"</span>, &amp;wg)
  <span class="hljs-keyword">go</span> sendMsg(<span class="hljs-string">"test3"</span>, &amp;wg)

  wg.Wait()
}
</code></pre>
<p>From the example above, the <strong><code>sync.WaitGroup</code></strong> is used to wait for the three goroutines to finish executing before the main function exits. It synchronizes the three goroutines and the main function.</p>
<ul>
<li>The <strong><code>sync.WaitGroup (wg)</code></strong> keeps track of the number of goroutines that are still running.</li>
<li>The <strong><code>sync.WaitGroup.Add (wg.Add)</code></strong> method adds the number passed as its argument to the count of goroutines to wait for.</li>
<li>The <strong><code>sync.WaitGroup.Done (wg.Done)</code></strong> method decrements that count by one when a goroutine finishes.</li>
<li>The <strong><code>sync.WaitGroup.Wait (wg.Wait)</code></strong> method blocks until the count reaches zero, so the main function waits for all the goroutines to finish executing before it exits.</li>
</ul>
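<p>As a further sketch of the same pattern, the hypothetical <code>squareAll</code> helper below (it is not part of the original example) starts one goroutine per input, calls <code>wg.Add(1)</code> before each <code>go</code> statement, and lets each goroutine write only to its own slice index so that no extra locking is needed:</p>

```go
package main

import (
	"fmt"
	"sync"
)

// squareAll is a hypothetical helper: it squares each input in its own
// goroutine and waits for all of them before returning.
func squareAll(nums []int) []int {
	results := make([]int, len(nums))
	var wg sync.WaitGroup

	for i, n := range nums {
		wg.Add(1) // register the goroutine before starting it
		go func(i, n int) {
			defer wg.Done()    // decrement the counter when this goroutine ends
			results[i] = n * n // each goroutine writes only its own index
		}(i, n)
	}

	wg.Wait() // block until every goroutine has called Done
	return results
}

func main() {
	fmt.Println(squareAll([]int{1, 2, 3, 4})) // [1 4 9 16]
}
```

<p>Calling <code>wg.Add(1)</code> inside the loop, rather than a single <code>wg.Add(3)</code> up front, keeps the count correct when the number of goroutines is not known in advance.</p>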
<h2 id="heading-what-are-channels">What are Channels?</h2>
<p>Channels are used to communicate between goroutines. A channel is a typed conduit through which you can send and receive messages with the channel operator, <strong><code>&lt;-</code></strong>.</p>
<p>In their simplest form, one goroutine writes messages into the channel and another goroutine reads the same messages out of the channel.</p>
<p>Channels are created using the built-in <code>make</code> function and the <code>chan</code> keyword together with an element type. A channel can only transfer messages of the type it was declared with.</p>
<p>Example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span>{
    msgChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)
    _ = msgChan <span class="hljs-comment">// use the variable so this minimal program compiles</span>
}
</code></pre>
<p>The example above creates a channel <code>msgChan</code> of type <code>string</code>.</p>
<h3 id="heading-how-to-write-data-to-a-channel">How to Write Data to a Channel</h3>
<p>To write data to a channel, first specify the name (<code>msgChan</code>) of the channel, followed by the <code>&lt;-</code> operator and the message. This is considered the <strong>Sender.</strong></p>
<pre><code class="lang-go">msgChan &lt;- <span class="hljs-string">"hello world"</span>
</code></pre>
<h3 id="heading-how-to-read-data-from-a-channel">How to Read Data from a Channel</h3>
<p>To read data from a channel, simply move the operator (<code>&lt;-</code>) to the front of the channel name (<code>msgChan</code>). You can assign the result to a variable. This is considered the <strong>Receiver.</strong></p>
<pre><code class="lang-go">msg := &lt;- msgChan
</code></pre>
<h3 id="heading-how-to-implement-channels-with-goroutine">How to Implement Channels with Goroutine</h3>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"math/rand"</span>
  <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {

  msgChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)

  <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
    time.Sleep(time.Duration(rand.Intn(<span class="hljs-number">1000</span>)) * time.Millisecond)
    msgChan &lt;- <span class="hljs-string">"hello"</span> <span class="hljs-comment">// Write data to the channel</span>
    msgChan &lt;- <span class="hljs-string">"world"</span> <span class="hljs-comment">// Write data to the channel</span>
  }()

  msg1 := &lt;- msgChan
  msg2 := &lt;- msgChan

  fmt.Println(msg1, msg2)
}
</code></pre>
<p>The example above shows how to write and read data from a channel. The <code>msgChan</code> channel is created and the <code>go</code> keyword is used to create a goroutine that writes data to the channel. The <code>msg1</code> and <code>msg2</code> variables are used to read data from the channel.</p>
<p>Channels behave as a <code>first-in-first-out</code> queue. So, when one goroutine writes data to the channel, the other goroutine reads the data from the channel in the same order it was written.</p>
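<p>A minimal sketch of that ordering guarantee follows. The <code>inOrder</code> helper is a hypothetical name introduced only for illustration: one goroutine sends the values, and the caller collects them in the order they arrive.</p>

```go
package main

import "fmt"

// inOrder is a hypothetical helper: one goroutine sends the values and the
// caller receives them, returning them in the order they arrived.
func inOrder(values []string) []string {
	ch := make(chan string) // unbuffered, as in the example above

	go func() {
		for _, v := range values {
			ch <- v // blocks until the receiver takes the value
		}
	}()

	received := make([]string, 0, len(values))
	for range values {
		received = append(received, <-ch)
	}
	return received
}

func main() {
	fmt.Println(inOrder([]string{"first", "second", "third"})) // [first second third]
}
```

<p>Because there is a single sender and a single receiver, the received slice always matches the order in which the values were sent.</p>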
<h2 id="heading-what-are-channel-buffers">What are Channel Buffers?</h2>
<p>Channels can be <code>buffered</code> or <code>unbuffered</code>. The previous examples all use unbuffered channels.</p>
<h3 id="heading-what-is-an-unbuffered-channel">What is an Unbuffered Channel?</h3>
<p>An unbuffered channel has no storage of its own, so a sender blocks until a receiver is ready to take the message.</p>
<h3 id="heading-what-is-a-buffered-channel">What is a Buffered Channel?</h3>
<p>A buffered channel allows the sender to send messages into the channel without blocking until the buffer is full. The sender blocks only once the buffer has filled up, and it stays blocked until another goroutine reads from the channel and frees up space.</p>
<h3 id="heading-how-to-create-a-buffered-channel">How to Create a Buffered Channel</h3>
<p>When creating a buffered channel, use the <code>make</code> function and specify a second parameter to indicate the buffer size.</p>
<pre><code class="lang-go">msgBufChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">2</span>)
</code></pre>
<p>The example above creates a buffered channel <code>msgBufChan</code> of type <code>string</code> with a buffer size of 2. This means that the channel can hold up to two messages before it blocks.</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  size := <span class="hljs-number">3</span>
  msgBufChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">int</span>, size)

  <span class="hljs-comment">// reader (receiver)</span>
  <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
    <span class="hljs-keyword">for</span> {
      _ = &lt;- msgBufChan
      time.Sleep(time.Second)
    }
  }()

  <span class="hljs-comment">//writer (sender)</span>
  writer := <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt;= <span class="hljs-number">10</span>; i++ {
      msgBufChan &lt;- i
      <span class="hljs-built_in">println</span>(i)
    }
  }

  writer()
}
</code></pre>
<p>The example above creates a buffered channel <code>msgBufChan</code> of type <code>int</code> with a buffer size of 3.</p>
<ul>
<li>The <code>writer</code> function writes data to the channel, and the reader goroutine reads data from the channel.</li>
<li>When the program runs, the numbers <code>0</code> through <code>3</code> are printed almost immediately (one value is taken by the reader and three fill the buffer), and the remaining numbers <code>4</code> through <code>10</code> are printed slowly, about one per second, because of the <code>time.Sleep(time.Second)</code> call in the reader.</li>
<li>This shows the effect of a buffered channel: its size determines how many messages it can hold before the sender blocks.</li>
</ul>
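<p>To see the buffer limit without relying on timing, the sketch below uses the built-in <code>len</code> and <code>cap</code> functions, which report how many values are queued in a channel and how large its buffer is. The <code>fillBuffer</code> helper is a hypothetical name used only for this illustration:</p>

```go
package main

import "fmt"

// fillBuffer is a hypothetical helper: it sends into a buffered channel
// until the buffer is full and reports how many sends were accepted.
func fillBuffer(size int) int {
	ch := make(chan int, size)

	sent := 0
	// len(ch) is the number of queued values; cap(ch) is the buffer size.
	for len(ch) < cap(ch) {
		ch <- sent // does not block: there is still room in the buffer
		sent++
	}
	// One more send here would block, because nothing is receiving.
	return sent
}

func main() {
	fmt.Println(fillBuffer(3)) // 3
}
```

<p>With a buffer size of 3, exactly three sends succeed with no receiver present. A fourth send would block until another goroutine reads from the channel.</p>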
<h2 id="heading-what-are-channel-directions">What are Channel Directions?</h2>
<p>When using channels as function parameters, by default, you can send and receive messages within the function. To provide additional safety at compile time, channel function parameters can be defined with a direction. That is, they can be defined to be <strong>read-only</strong> or <strong>write-only</strong>.</p>
<p>Example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">writer</span><span class="hljs-params">(channel <span class="hljs-keyword">chan</span>&lt;- <span class="hljs-keyword">string</span>, msg <span class="hljs-keyword">string</span>)</span></span> {
  channel &lt;- msg
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">reader</span><span class="hljs-params">(channel &lt;-<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)</span></span> {
  <span class="hljs-comment">// keep receiving so the writer is never left blocked on a full buffer</span>
  <span class="hljs-keyword">for</span> msg := <span class="hljs-keyword">range</span> channel {
    fmt.Println(msg)
  }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  msgChan := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">1</span>)

  <span class="hljs-keyword">go</span> reader(msgChan)

  <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">10</span>; i++ {
    writer(msgChan, fmt.Sprintf(<span class="hljs-string">"msg %d"</span>, i))
  }

  time.Sleep(time.Second * <span class="hljs-number">5</span>)
}
</code></pre>
<p>The example above shows how to define a channel with a direction.</p>
<ul>
<li>The <code>writer</code> function is defined with a write-only channel (<code>chan&lt;- string</code>).</li>
<li>The <code>reader</code> function is defined with a read-only channel (<code>&lt;-chan string</code>).</li>
</ul>
<p>The <code>msgChan</code> channel is created with a buffer size of 1. The <code>writer</code> function writes data to the channel and the <code>reader</code> function reads data from the channel.</p>
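<p>The sketch below is another small illustration of channel directions. The <code>produce</code> and <code>total</code> functions are hypothetical names: one is only allowed to send, the other is only allowed to receive, and the compiler rejects any operation in the wrong direction.</p>

```go
package main

import "fmt"

// produce can only send on out; a receive such as <-out would not compile.
func produce(out chan<- int, n int) {
	for i := 1; i <= n; i++ {
		out <- i
	}
}

// total can only receive from in; a send such as in <- 1 would not compile.
func total(in <-chan int, n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += <-in
	}
	return sum
}

func main() {
	ch := make(chan int) // bidirectional here, narrowed at each call
	go produce(ch, 5)
	fmt.Println(total(ch, 5)) // 15
}
```

<p>The channel itself is created as a normal bidirectional channel. Go converts it to the restricted type automatically when it is passed to each function.</p>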
<h2 id="heading-how-to-handle-multiple-communication-operations-with-channel-select">How to Handle Multiple Communication Operations with Channel Select</h2>
<p>The <code>select</code> statement lets a goroutine wait on multiple communication operations. A <code>select</code> blocks until one of its cases can run, then it executes that case. It chooses one at random if multiple are ready.</p>
<p>The <code>select</code> and <code>case</code> statements make it simpler and more readable to wait on multiple channels at once.</p>
<p>Example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"time"</span>
  <span class="hljs-string">"math/rand"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">pause</span><span class="hljs-params">()</span></span> {
  time.Sleep(time.Duration(rand.Intn(<span class="hljs-number">1000</span>)) * time.Millisecond)
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">test1</span><span class="hljs-params">(c <span class="hljs-keyword">chan</span>&lt;- <span class="hljs-keyword">string</span>)</span></span> {
  <span class="hljs-keyword">for</span> {
    pause()
    c &lt;- <span class="hljs-string">"hello"</span>
  }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">test2</span><span class="hljs-params">(c <span class="hljs-keyword">chan</span>&lt;- <span class="hljs-keyword">string</span>)</span></span> {
  <span class="hljs-keyword">for</span> {
    pause()
    c &lt;- <span class="hljs-string">"world"</span>
  }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  rand.Seed(time.Now().Unix())

  c1 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)
  c2 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)

  <span class="hljs-keyword">go</span> test1(c1)
  <span class="hljs-keyword">go</span> test2(c2)

  <span class="hljs-keyword">for</span> {
    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> msg1 := &lt;- c1:
      fmt.Println(msg1)
    <span class="hljs-keyword">case</span> msg2 := &lt;- c2:
      fmt.Println(msg2)
    }
  }
}
</code></pre>
<p>The example above shows how to use the <code>select</code> statement to wait on multiple channels. The <code>test1</code> and <code>test2</code> functions write data to the <code>c1</code> and <code>c2</code> channels respectively. The <code>main</code> function reads data from the <code>c1</code> and <code>c2</code> channels using the <code>select</code> statement.</p>
<p>The select statement will block until one of the channels is ready to send or receive data. If both channels are ready, the select statement will choose one at random.</p>
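<p>The example above loops forever. As a terminating variant, the sketch below receives a fixed number of messages and counts how many came from each channel. The <code>collect</code> and <code>send</code> helpers are hypothetical names introduced for this illustration:</p>

```go
package main

import "fmt"

// collect is a hypothetical helper: it receives n messages in total from
// whichever of the two channels is ready, counting messages per channel.
func collect(c1, c2 <-chan string, n int) (fromC1, fromC2 int) {
	for i := 0; i < n; i++ {
		select {
		case <-c1:
			fromC1++
		case <-c2:
			fromC2++
		}
	}
	return fromC1, fromC2
}

// send writes msg to c the given number of times.
func send(c chan<- string, msg string, times int) {
	for i := 0; i < times; i++ {
		c <- msg
	}
}

func main() {
	c1 := make(chan string)
	c2 := make(chan string)

	go send(c1, "hello", 3)
	go send(c2, "world", 2)

	a, b := collect(c1, c2, 5)
	fmt.Println(a, b) // 3 2
}
```

<p>The order in which the messages arrive can change between runs, but the totals per channel stay the same because <code>collect</code> receives exactly as many messages as the two senders produce.</p>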
<h2 id="heading-how-to-timeout-long-running-processes-in-a-channel">How to Timeout Long Running Processes in a Channel</h2>
<p>The <code>time.After</code> function returns a channel that delivers a value (the current time) once the specified duration has elapsed. This can be used to implement a timeout for a channel.</p>
<p>It can be specified in a <code>select</code> statement to help manage situations where it's taking too long to receive a message from any of the channels being monitored.</p>
<p>Also consider using a <code>timeout</code> when working with external resources. You can never guarantee their response time, so you may need to proactively take action after a predetermined time has passed.</p>
<p>Implementing a <code>timeout</code> with a <code>select</code> statement is very straightforward.</p>
<p>Example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"time"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    c1 := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)

    <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(channel <span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)</span></span> {
        time.Sleep(<span class="hljs-number">1</span> * time.Second)
        channel &lt;- <span class="hljs-string">"hello world"</span>
    }(c1)

    <span class="hljs-keyword">select</span> {
    <span class="hljs-keyword">case</span> msg2 := &lt;-c1:
        fmt.Println(msg2)
    <span class="hljs-keyword">case</span> &lt;-time.After(<span class="hljs-number">2</span> * time.Second): <span class="hljs-comment">// Timeout after 2 seconds</span>
        fmt.Println(<span class="hljs-string">"timeout"</span>)
    }
}
</code></pre>
<ul>
<li>The example above shows how to use the <code>time.After</code> function to create a channel that sends a message after a specified duration.</li>
<li>The <code>main</code> function reads data from the <code>c1</code> channel using the <code>select</code> statement.</li>
<li>The <code>select</code> statement will block until one of the channels is ready to send or receive data.</li>
<li>If the <code>c1</code> channel is ready, the <code>main</code> function will print the message.</li>
<li>If the <code>c1</code> channel is not ready after 2 seconds, the <code>main</code> function will print a timeout message.</li>
</ul>
<h2 id="heading-how-to-close-a-channel">How to Close a Channel</h2>
<p>Closing a channel signals to receivers that no more values will be sent on it.</p>
<p>Go never closes a channel for you: the sender has to close it explicitly. Closing is only necessary when receivers need to know that the stream has ended, for example to stop a <code>range</code> loop. Sending on a closed channel causes a panic.</p>
<p>Closing a channel is done by invoking the built-in <code>close</code> function.</p>
<pre><code class="lang-go"><span class="hljs-built_in">close</span>(channel)
</code></pre>
<p>Example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
  <span class="hljs-string">"bytes"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">process</span><span class="hljs-params">(work &lt;-<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, fin <span class="hljs-keyword">chan</span>&lt;- <span class="hljs-keyword">string</span>)</span></span> {
  <span class="hljs-keyword">var</span> b bytes.Buffer
  <span class="hljs-keyword">for</span> {
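    <span class="hljs-keyword">if</span> msg, notClosed := &lt;-work; notClosed {
      fmt.Printf(<span class="hljs-string">"%s received...\n"</span>, msg)
      b.WriteString(msg) <span class="hljs-comment">// collect each letter so the final result is not empty</span>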
    } <span class="hljs-keyword">else</span> {
      fmt.Println(<span class="hljs-string">"Channel closed"</span>)
      fin &lt;- b.String()
      <span class="hljs-keyword">return</span>
    }
  }
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  work := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">3</span>)
  fin := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>)

  <span class="hljs-keyword">go</span> process(work, fin)

  word := <span class="hljs-string">"hello world"</span>

  <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-built_in">len</span>(word); i++ {
    letter := <span class="hljs-keyword">string</span>(word[i])
    work &lt;- letter
    fmt.Printf(<span class="hljs-string">"%s sent ...\n"</span>, letter)
  }

  <span class="hljs-built_in">close</span>(work)

  fmt.Printf(<span class="hljs-string">"result: %s\n"</span>, &lt;-fin)
}
</code></pre>
<p>The example above shows how to close a channel. The <code>work</code> channel is created with a buffer size of 3. The <code>main</code> function sends each letter on the <code>work</code> channel and then closes it. The <code>process</code> function receives with the two-value form <code>msg, notClosed := &lt;-work</code>. While values are available, <code>notClosed</code> is <code>true</code> and the message is printed. Once the channel has been closed and its buffer drained, <code>notClosed</code> becomes <code>false</code>, so <code>process</code> prints "Channel closed", sends the contents of its buffer on the <code>fin</code> channel, and returns. The <code>main</code> function waits for that value on <code>fin</code> before exiting.</p>
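<p>To see the closing semantics in isolation, here is a minimal sketch. The helper name <code>drain</code> is hypothetical. It shows that values already in the buffer can still be received after <code>close</code>, and that later receives return the zero value with <code>false</code>.</p>

```go
package main

import "fmt"

// drain (hypothetical helper) receives from ch until it is closed and empty,
// returning the values in the order they were sent.
func drain(ch <-chan int) []int {
	var got []int
	for {
		v, ok := <-ch
		if !ok {
			return got // closed, and every buffered value has been read
		}
		got = append(got, v)
	}
}

func main() {
	ch := make(chan int, 3)
	ch <- 1
	ch <- 2
	close(ch) // buffered values remain readable after close

	fmt.Println(drain(ch)) // [1 2]

	v, ok := <-ch
	fmt.Println(v, ok) // 0 false
}
```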
<h2 id="heading-how-to-iterate-over-channel-messages">How to Iterate Over Channel Messages</h2>
<p>Channels can be iterated over with the <code>range</code> keyword, much like arrays, slices, and maps. The loop receives messages one at a time and stops when the channel is closed and empty.</p>
<p>Example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> (
  <span class="hljs-string">"fmt"</span>
)

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
  c := <span class="hljs-built_in">make</span>(<span class="hljs-keyword">chan</span> <span class="hljs-keyword">string</span>, <span class="hljs-number">3</span>)

  <span class="hljs-keyword">go</span> <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">()</span></span> {
    c &lt;- <span class="hljs-string">"hello"</span>
    c &lt;- <span class="hljs-string">"world"</span>
    c &lt;- <span class="hljs-string">"goroutine"</span>
    <span class="hljs-built_in">close</span>(c) <span class="hljs-comment">// Close the channel so the range loop below can finish; without this, main blocks forever and the runtime reports a deadlock</span>
  }()

  <span class="hljs-keyword">for</span> msg := <span class="hljs-keyword">range</span> c {
    fmt.Println(msg)
  }
}
</code></pre>
<p>The example above shows how to iterate over a channel using the <code>range</code> keyword. The <code>c</code> channel is created with a buffer size of 3. The <code>go</code> keyword is used to create a goroutine that writes data to the <code>c</code> channel and then closes it. The <code>main</code> function iterates over the <code>c</code> channel using the <code>range</code> keyword and prints each message until the channel is closed.</p>
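<p>A common way to package this is a function that returns a receive-only channel and closes it when it is done sending. The sketch below assumes a hypothetical function named <code>words</code>. The caller only has to <code>range</code> over the result.</p>

```go
package main

import "fmt"

// words (hypothetical) starts a goroutine that sends each item and then
// closes the channel, which ends the receiver's range loop.
func words(items ...string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, w := range items {
			out <- w
		}
	}()
	return out
}

func main() {
	for msg := range words("hello", "world", "goroutine") {
		fmt.Println(msg)
	}
}
```

<p>Because the sending side owns the channel and closes it with <code>defer</code>, the receiving loop cannot forget to stop.</p>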
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, we learned how to handle concurrency with goroutines and channels in Go. We learned how to create goroutines, and how to use <code>WaitGroups</code> and channels to communicate between goroutines. </p>
<p>We also learned how to use channel buffers, channel directions, channel <code>select</code>, channel timeout, channel closing, and channel range. </p>
<p>Goroutines and channels are powerful features in Go that help address concurrency and async flow requirements.</p>
<p>As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> or <a target="_blank" href="https://twitter.com/caesar_sage">Twitter</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is Amazon EC2 Auto Scaling? ]]>
                </title>
                <description>
                    <![CDATA[ Auto scaling is like having a smart system that keeps an eye on how many people are visiting your website. When you have a lot of people, it quickly adds more servers to handle the extra traffic. And when things quiet down, it scales back to save you... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-amazon-ec2-auto-scaling/</link>
                <guid isPermaLink="false">66b906c577c23fa04d7098f7</guid>
                
                    <category>
                        <![CDATA[ Amazon ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ec2 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ scaling ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 06 May 2024 16:32:47 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/05/christophe-hautier-902vnYeoWS4-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Auto scaling is like having a smart system that keeps an eye on how many people are visiting your website. When you have a lot of people, it quickly adds more servers to handle the extra traffic. And when things quiet down, it scales back to save you money.</p>
<p>In AWS, there are two important services that help with this: Amazon EC2 Auto Scaling and AWS Auto Scaling. Amazon EC2 Auto Scaling is specifically for managing your EC2 servers, while AWS Auto Scaling can also handle other things like DynamoDB tables and Amazon Aurora databases.</p>
<p>In this article, we'll dive deeper into how Amazon EC2 Auto Scaling works and how you can use it to keep your website running smoothly without overspending on servers.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li>Have an AWS account</li>
<li>A basic understanding of EC2 instances</li>
</ul>
<h2 id="heading-table-of-content">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></li>
<li><a class="post-section-overview" href="#heading-example-use-case">Example Use Case</a></li>
<li><a class="post-section-overview" href="#heading-advantages-of-using-amazon-ec2-auto-scaling">Advantages of Using Amazon EC2 Auto Scaling</a></li>
<li><a class="post-section-overview" href="#heading-components-of-ec2-auto-scaling">Components of EC2 Auto Scaling</a></li>
<li><a class="post-section-overview" href="#heading-launch-configurations-and-launch-templates">Launch Configurations and Launch Templates</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-launch-template">How to create a launch template</a></li>
<li><a class="post-section-overview" href="#heading-what-are-auto-scaling-groups-asgs">What are Auto Scaling Groups (ASGs)</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-an-auto-scaling-group">How to create an Auto Scaling Group</a></li>
<li><a class="post-section-overview" href="#heading-what-are-scaling-policies">What are Scaling Policies</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ul>
<h2 id="heading-example-use-case">Example Use Case</h2>
<h3 id="heading-scenario">Scenario:</h3>
<p>Imagine running a website that sells trendy clothes. Sometimes, lots of people visit your site at once, especially during lunch breaks or evenings. Other times, it's pretty quiet.</p>
<h3 id="heading-problem">Problem:</h3>
<p>You need enough servers to handle busy times, but you don't want to waste money on too many servers when it's quiet.</p>
<h3 id="heading-solution-with-amazon-ec2-auto-scaling">Solution with Amazon EC2 Auto Scaling:</h3>
<p><strong>Traffic Analysis</strong>: Look at when people visit your site the most. This helps you understand when you need more servers.</p>
<p><strong>Set Rules</strong>: Decide when to add or remove servers automatically. For example, you might say, "If average CPU usage across our servers stays above 70% for more than 5 minutes, add one more server."</p>
<p><strong>Adjust Server Numbers</strong>: Tell Amazon the smallest and biggest number of servers you need. You can also say how many you'd like running normally (the desired capacity). For instance, you might say, "Keep at least 2 servers running all the time. But if it's busy, go up to 10 servers. And usually, we need around 4."</p>
<p><strong>Load Balancing</strong>: Make sure all servers get some work. Use a load balancer to send visitors to the least busy server. This keeps everything running smoothly even if you have many servers.</p>
<p><strong>Test and Watch</strong>: Before trusting everything, test to see if it works as planned. Keep an eye on it afterward to make sure it's doing its job right.</p>
<p><strong>Save Money</strong>: With auto scaling, you don't pay for servers you're not using. When traffic is low, it reduces the number of servers, saving you money. When traffic picks up, it adds more servers, so your site stays fast.</p>
<h2 id="heading-advantages-of-using-amazon-ec2-auto-scaling">Advantages of Using Amazon EC2 Auto Scaling</h2>
<p><strong>Cost Optimization</strong>: EC2 Auto Scaling helps optimize costs by automatically adjusting the number of EC2 instances based on demand. During periods of low traffic, it reduces the number of instances, saving on operational costs. Conversely, during high traffic, it scales up to ensure optimal performance without over-provisioning resources.</p>
<p><strong>Improved Availability</strong>: By working with a load balancer to distribute incoming traffic across multiple instances, EC2 Auto Scaling improves the availability and fault tolerance of your application. If an instance fails or becomes unhealthy, the Auto Scaling group replaces it with a new one, ensuring minimal disruption to your services.</p>
<p><strong>Scalability</strong>: EC2 Auto Scaling allows your application to handle sudden spikes in traffic or increased workload without manual intervention. </p>
<p><strong>Enhanced Performance</strong>: With EC2 Auto Scaling, you can maintain consistent performance levels even during peak usage periods. By automatically adding more instances when traffic increases, it prevents performance degradation and ensures a smooth user experience.</p>
<p><strong>Ease of Management</strong>: EC2 Auto Scaling simplifies the management of your EC2 fleet by automating instance provisioning, scaling, and monitoring.</p>
<p><strong>Integration with AWS Services</strong>: EC2 Auto Scaling integrates seamlessly with other AWS services such as Elastic Load Balancing (ELB) and Amazon CloudWatch.</p>
<p><strong>Highly Customizable</strong>: EC2 Auto Scaling offers flexibility and customization options to meet the specific needs of your application.</p>
<h2 id="heading-components-of-ec2-auto-scaling">Components of EC2 Auto Scaling</h2>
<p>Let's get a better understanding of how Auto Scaling works by looking at its components.</p>
<p>There are two distinct steps to configuration. The first step is the creation of a launch configuration or launch template. The second is the creation of an Auto Scaling group.</p>
<h2 id="heading-launch-configurations-and-launch-templates">Launch Configurations and Launch Templates</h2>
<p>Launch configurations or launch templates define the configuration settings for the EC2 instances that will be launched by the Auto Scaling group. </p>
<p>These settings include the AMI (Amazon Machine Image), instance type, security groups, key pair, and user data. </p>
<p>Launch configurations are older and being phased out in favor of launch templates, which offer more features and flexibility.</p>
<h3 id="heading-how-to-create-a-launch-template">How to Create a Launch Template</h3>
<p>First, navigate to the EC2 Instances page.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/launch-template-1.png" alt="AWS instance page" width="600" height="400" loading="lazy">
<em>AWS instance page</em></p>
<p>Select Launch Templates under Instances and click the create button.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/launch-template-2.png" alt="AWS launch templates" width="600" height="400" loading="lazy">
<em>AWS launch templates</em></p>
<p>The following screen should show up. It is very similar to the screen for launching an EC2 instance. Fill in the required information.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/screencapture-us-east-1-console-aws-amazon-ec2-home-2024-05-03-22_52_38-1.png" alt="Create AWS launch templates" width="600" height="400" loading="lazy">
<em>Create AWS launch templates</em></p>
<p>After configuring it, click the "Create launch template" button and wait for it to finish. You can then view your newly created launch template, whose default and latest versions are both 1. Later, you can create new versions of this launch template with different settings.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/launch-template-3-1.png" alt="View AWS launch templates" width="600" height="400" loading="lazy">
<em>View AWS launch templates</em></p>
<p>Auto scaling requires either a launch template or launch configuration to identify the instance it's launching and its configurations.</p>
<h2 id="heading-what-are-auto-scaling-groups-asgs">What are Auto Scaling Groups (ASGs)</h2>
<p>Auto Scaling groups are the core component of EC2 Auto Scaling. They define the group of EC2 instances that are managed together and share the same scaling policies. ASGs ensure that your application can automatically scale out (add instances) or scale in (remove instances) based on demand.</p>
<h3 id="heading-how-to-create-an-auto-scaling-group">How to Create an Auto Scaling Group</h3>
<p>First, navigate to the EC2 Instances page, select Auto Scaling Groups, and click the create button.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/asg-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>creating an Auto Scaling group</em></p>
<p>On the create screen, the first step is to give your ASG a <code>Name</code> and then select your <code>launch template</code> created from the steps above. </p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/asg-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>creating a launch template</em></p>
<p>The next step lets you keep or override the instance settings from the launch template. You also select a VPC and one or more subnets.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/asg-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>selecting instance launch template</em></p>
<p>The next step is to configure advanced options such as adding a load balancer and monitoring. You can attach or add a new load balancer but for this article we will skip this part.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/asg-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>configuring advanced options</em></p>
<p>Next, configure the group size and scaling. Here, we set a minimum capacity of 2 and a maximum capacity of 5. Also, set the metric type to average CPU utilization with a target value of 50 (you can increase this to 70 or more).</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/screencapture-us-east-1-console-aws-amazon-ec2-home-2024-05-03-23_41_58.png" alt="Image" width="600" height="400" loading="lazy">
<em>configuring group size and scaling</em></p>
<p>The next two steps are for adding notifications (you will need to create an SNS topic for this) and tags. In this article, we are going to skip these and create our ASG.</p>
<p>Create the ASG and view it. From its <strong>Activity</strong> tab, you can see that two instances were launched. The Instances page should also show two EC2 instances. This is because we set our desired capacity to 2.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/asg-5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Auto Scaling groups</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/05/asg-6.png" alt="Image" width="600" height="400" loading="lazy">
<em>Auto Scaling groups</em></p>
<h2 id="heading-what-are-scaling-policies">What are Scaling Policies?</h2>
<p>Scaling policies define the rules that govern how the Auto Scaling group scales in or out in response to changing demand. There are four ways to scale: manual, scheduled, dynamic, and predictive.</p>
<p>Let's break down each type of scaling with examples.</p>
<h3 id="heading-manual-scaling">Manual Scaling</h3>
<p>Manual scaling involves adjusting the number of EC2 instances in your Auto Scaling group manually, without relying on automated triggers or policies. This type of scaling is typically done in response to predictable events or planned changes in demand.</p>
<p><strong>Example</strong>: Suppose you run an e-commerce website and you know that an upcoming flash sale will attract a large number of visitors. To handle the expected surge in traffic, you can manually increase the desired capacity of your Auto Scaling group before the event, adding more EC2 instances in advance of the anticipated demand spike. After the event is over, you can manually reduce the desired capacity back to its normal level.</p>
<h5 id="heading-pros">Pros:</h5>
<ul>
<li><strong>Control</strong>: Offers direct control over the number of EC2 instances in the Auto Scaling group.</li>
<li><strong>Flexibility</strong>: Allows for immediate adjustments based on specific requirements or events.</li>
</ul>
<h5 id="heading-cons">Cons:</h5>
<ul>
<li><strong>Manual Intervention</strong>: Relies on human intervention, which can be time-consuming and prone to errors.</li>
<li><strong>Lack of Automation</strong>: Not suitable for handling dynamic or unpredictable fluctuations in demand efficiently.</li>
</ul>
<h3 id="heading-schedule-scaling">Scheduled Scaling</h3>
<p>Scheduled scaling involves defining schedules that adjust the number of EC2 instances in your Auto Scaling group automatically. This type of scaling is useful for applications with predictable traffic patterns, such as daily or weekly fluctuations in demand.</p>
<p><strong>Example</strong>: Consider a video streaming service that experiences peak traffic during evenings and weekends. You can set up scheduled actions that increase the desired capacity of your Auto Scaling group every evening at 6 PM and decrease it every morning at 6 AM. This ensures that you have enough capacity to handle peak demand periods without overspending on resources during off-peak hours.</p>
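<p>The decision behind such a schedule is simple enough to sketch in a few lines of Go. This is only an illustration of the logic and does not call AWS. The function name <code>desiredCapacityAt</code> and the capacities 2 and 6 are assumptions made for this example.</p>

```go
package main

import "fmt"

// desiredCapacityAt (hypothetical) returns the capacity a 6 PM / 6 AM
// schedule would request at a given hour of the day (0-23).
func desiredCapacityAt(hour, offPeak, peak int) int {
	// Peak window: 18:00 up to, but not including, 06:00 the next morning.
	if hour >= 18 || hour < 6 {
		return peak
	}
	return offPeak
}

func main() {
	// Assumed capacities: 2 instances off-peak, 6 instances at peak.
	for _, h := range []int{5, 6, 12, 18, 23} {
		fmt.Printf("%02d:00 -> %d instances\n", h, desiredCapacityAt(h, 2, 6))
	}
}
```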
<h5 id="heading-pros-1">Pros:</h5>
<ul>
<li><strong>Predictability</strong>: Well-suited for applications with predictable traffic patterns, such as daily or weekly fluctuations.</li>
<li><strong>Cost Optimization</strong>: Helps optimize costs by aligning resources with expected demand patterns.</li>
</ul>
<h5 id="heading-cons-1">Cons:</h5>
<ul>
<li><strong>Limited Adaptability</strong>: May not be responsive to sudden changes in demand or unexpected traffic spikes.</li>
<li><strong>Requires Planning</strong>: Requires upfront planning and configuration of schedules based on historical data or business insights.</li>
</ul>
<h3 id="heading-dynamic-scaling">Dynamic Scaling</h3>
<p>Dynamic scaling adjusts the number of EC2 instances in your Auto Scaling group automatically based on real-time metrics, such as CPU utilization, network traffic, or other application-specific metrics. This type of scaling is responsive to fluctuations in demand and helps ensure optimal performance and cost-effectiveness.</p>
<h5 id="heading-types">Types:</h5>
<ul>
<li><strong>Step Scaling</strong>: This policy scales the number of instances based on a series of scaling adjustments defined by step adjustments and associated metrics thresholds. </li>
<li><strong>Target Tracking</strong>: This policy automatically adjusts the number of instances to maintain a specified target metric, such as average CPU utilization or network traffic.</li>
</ul>
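<p>To make the difference between the two policy types concrete, here is a simplified sketch in Go. It is not the algorithm AWS uses internally: the proportional rule in <code>targetTrackingCapacity</code>, the thresholds in <code>stepAdjustment</code>, and both function names are assumptions chosen for illustration. The minimum of 2, maximum of 5, and CPU target of 50 match the group created earlier.</p>

```go
package main

import (
	"fmt"
	"math"
)

// clamp keeps n within [lo, hi].
func clamp(n, lo, hi int) int {
	if n < lo {
		return lo
	}
	if n > hi {
		return hi
	}
	return n
}

// targetTrackingCapacity (hypothetical) sizes the group so the average
// metric moves toward the target, using a simple proportional rule.
func targetTrackingCapacity(current int, metric, target float64, minSize, maxSize int) int {
	if target <= 0 {
		return clamp(current, minSize, maxSize)
	}
	desired := int(math.Ceil(float64(current) * metric / target))
	return clamp(desired, minSize, maxSize)
}

// stepAdjustment (hypothetical thresholds) returns how many instances to
// add or remove for a given average CPU reading.
func stepAdjustment(cpu float64) int {
	switch {
	case cpu >= 90:
		return 2
	case cpu >= 70:
		return 1
	case cpu <= 30:
		return -1
	default:
		return 0
	}
}

func main() {
	// 2 instances at 80% average CPU against a 50% target.
	fmt.Println(targetTrackingCapacity(2, 80, 50, 2, 5)) // 4
	// 4 instances at 20% CPU: shrink, but never below the minimum.
	fmt.Println(targetTrackingCapacity(4, 20, 50, 2, 5)) // 2
	// Step scaling reacts in fixed increments per threshold.
	fmt.Println(stepAdjustment(95), stepAdjustment(75), stepAdjustment(50), stepAdjustment(10)) // 2 1 0 -1
}
```

<p>Target tracking only needs one number (the target), while step scaling needs a threshold and an adjustment for every step you define.</p>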
<p>When instances are added to the ASG, it takes a few minutes for them to come online and handle load. This is why a cooldown period should be set.</p>
<p><strong>Scaling Cooldowns:</strong> Scaling cooldowns help prevent rapid fluctuations in the number of instances by imposing a cooldown period after a scaling activity is triggered. During this cooldown period, EC2 Auto Scaling will not launch or terminate additional instances, allowing time for the newly launched instances to stabilize or for the impact of terminated instances to be observed.</p>
<p><strong>Example</strong>: Let's say you operate a ride-sharing platform where demand can vary unpredictably throughout the day. With dynamic scaling, you can configure Auto Scaling policies to add more EC2 instances when the number of ride requests exceeds a certain threshold, and remove instances when demand decreases. This allows you to dynamically adapt to changing traffic patterns in real-time, ensuring a seamless experience for both drivers and passengers.</p>
<h5 id="heading-pros-2">Pros:</h5>
<ul>
<li><strong>Real-Time Responsiveness</strong>: Adjusts resource allocation dynamically in response to actual demand, ensuring optimal performance.</li>
<li><strong>Cost Efficiency</strong>: Automatically scales resources up or down, helping to optimize costs by only using what is needed.</li>
</ul>
<h5 id="heading-cons-2">Cons:</h5>
<ul>
<li><strong>Potential Over-Provisioning</strong>: May lead to over-provisioning during sudden spikes in demand if scaling policies are not properly configured.</li>
<li><strong>Complexity</strong>: Requires careful configuration of scaling policies and monitoring of metrics to ensure effective scaling behavior.</li>
</ul>
<h3 id="heading-predictive-scaling">Predictive Scaling</h3>
<p>Predictive scaling uses machine learning algorithms and historical data to forecast future demand and proactively adjust the number of EC2 instances in your Auto Scaling group. This type of scaling helps prevent under-provisioning or over-provisioning of resources by anticipating changes in demand before they occur.</p>
<p><strong>Example</strong>: Suppose you operate a weather forecasting application that experiences increased demand during severe weather events. By analyzing historical data on weather patterns and user behavior, predictive scaling can predict when a surge in traffic is likely to occur and automatically scale up the capacity of your Auto Scaling group ahead of time. This ensures that your application remains responsive and available during peak usage periods without unnecessary resource waste.</p>
<h5 id="heading-pros-3">Pros:</h5>
<ul>
<li><strong>Proactive Optimization</strong>: Anticipates future demand based on historical data, ensuring resources are provisioned ahead of time.</li>
<li><strong>Improved Cost Management</strong>: Helps prevent under-provisioning and over-provisioning, optimizing resource usage and costs.</li>
</ul>
<h5 id="heading-cons-3">Cons:</h5>
<ul>
<li><strong>Data Dependence</strong>: Relies on accurate historical data and effective machine learning models for accurate predictions.</li>
<li><strong>Initial Setup</strong>: Requires initial setup and configuration of predictive scaling models, which can be complex and resource-intensive.</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In conclusion, Amazon EC2 Auto Scaling offers a range of strategies to effectively manage and optimize the performance of applications running on EC2 instances.</p>
<p>Whether it's through manual adjustments, scheduled scaling, dynamic responses to real-time metrics, or proactive measures based on predictive analytics, EC2 Auto Scaling provides the flexibility and automation needed to ensure that resources are aligned with demand. </p>
<p>By leveraging these scaling capabilities, businesses can enhance availability, improve cost efficiency, and deliver a seamless user experience, ultimately driving better outcomes for their applications and customers on the AWS platform.</p>
<p>As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> or <a target="_blank" href="https://twitter.com/caesar_sage">Twitter</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The new() vs make() Functions in Go – When to Use Each One ]]>
                </title>
                <description>
                    <![CDATA[ Go, also known as Golang, is a statically-typed, compiled programming language designed for simplicity and efficiency.  When it comes to working with data structures like slices, maps, and channels, you'll likely encounter the new() and make() functi... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/new-vs-make-functions-in-go/</link>
                <guid isPermaLink="false">66b906bd53c4132f77b5c303</guid>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Thu, 04 Jan 2024 15:44:43 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/pexels-skitterphoto-422844.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Go, also known as Golang, is a statically-typed, compiled programming language designed for simplicity and efficiency. </p>
<p>When it comes to working with data structures like slices, maps, and channels, you'll likely encounter the <code>new()</code> and <code>make()</code> functions. While both are used for memory allocation, they serve distinct purposes. </p>
<p>In this article, we'll explore the differences between <code>new()</code> and <code>make()</code> in Go and discuss when to use each.</p>
<h2 id="heading-the-new-function">The <code>new()</code> Function</h2>
<p>The <code>new()</code> function in Go is a built-in function that allocates memory for a zeroed value of a specified type and returns a pointer to it. It works with any type, but it is most often used with types such as structs.</p>
<p>Here's a simple example:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> <span class="hljs-string">"fmt"</span>

<span class="hljs-keyword">type</span> Person <span class="hljs-keyword">struct</span> {
    Name   <span class="hljs-keyword">string</span>
    Age    <span class="hljs-keyword">int</span>
    Gender <span class="hljs-keyword">string</span>
}

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    <span class="hljs-comment">// Using new() to allocate memory for a Person struct</span>
    p := <span class="hljs-built_in">new</span>(Person)

    <span class="hljs-comment">// Initializing the fields</span>
    p.Name = <span class="hljs-string">"John Doe"</span>
    p.Age = <span class="hljs-number">30</span>
    p.Gender = <span class="hljs-string">"Male"</span>

    fmt.Println(p)
}
</code></pre>
<p>In this example, <code>new(Person)</code> allocates memory for a new <code>Person</code> struct, and <code>p</code> is a pointer to the newly allocated zeroed value.</p>
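<p>As a further illustration, the sketch below shows that <code>new()</code> works for any type, not just structs, and that <code>new(Person)</code> produces the same zeroed value as the composite literal <code>&amp;Person{}</code>. The two-field <code>Person</code> here is a shortened version of the struct above.</p>

```go
package main

import "fmt"

type Person struct {
	Name string
	Age  int
}

func main() {
	n := new(int)   // *int pointing at a zeroed int
	fmt.Println(*n) // 0
	*n = 42
	fmt.Println(*n) // 42

	a := new(Person) // pointer to a zeroed Person
	b := &Person{}   // composite literal with the same result
	fmt.Println(*a == *b) // true
}
```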
<h2 id="heading-the-make-function">The <code>make()</code> Function</h2>
<p>On the other hand, the <code>make()</code> function is used for initializing slices, maps, and channels, which are data structures that require runtime initialization. Unlike <code>new()</code>, <code>make()</code> returns an initialized, ready-to-use value of the specified type rather than a pointer to it.</p>
<p>Let's look at an example using a slice:</p>
<pre><code class="lang-go"><span class="hljs-keyword">package</span> main

<span class="hljs-keyword">import</span> <span class="hljs-string">"fmt"</span>

<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">main</span><span class="hljs-params">()</span></span> {
    <span class="hljs-comment">// Using make() to create a slice with a specified length and capacity</span>
    s := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">int</span>, <span class="hljs-number">10</span>, <span class="hljs-number">15</span>)

    <span class="hljs-comment">// Initializing the elements</span>
    <span class="hljs-keyword">for</span> i := <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">10</span>; i++ {
        s[i] = i + <span class="hljs-number">1</span>
    }

    fmt.Println(s)
}
</code></pre>
<p>In this example, <code>make([]int, 10, 15)</code> creates a slice of integers with a length of 10 and a capacity of 15. The <code>make()</code> function allocates the underlying array and sets each of the 10 elements to the zero value <code>0</code>. The loop then assigns the values 1 through 10.</p>
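<p>The short sketch below shows <code>make()</code> with all three kinds of types it supports. The helper <code>countWords</code> is a hypothetical name used only for this example.</p>

```go
package main

import "fmt"

// countWords (hypothetical) uses make() to create a map that is ready for writes.
func countWords(words []string) map[string]int {
	counts := make(map[string]int, len(words))
	for _, w := range words {
		counts[w]++
	}
	return counts
}

func main() {
	s := make([]int, 3, 5)
	fmt.Println(s, len(s), cap(s)) // [0 0 0] 3 5

	s = append(s, 7) // fits in the spare capacity
	fmt.Println(s, len(s), cap(s)) // [0 0 0 7] 4 5

	fmt.Println(countWords([]string{"go", "make", "go"})) // map[go:2 make:1]

	ch := make(chan string, 1) // buffered channel with room for one value
	ch <- "ready"
	fmt.Println(<-ch) // ready
}
```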
<h2 id="heading-when-to-use-new-and-make-in-go">When to Use <code>new()</code> and <code>make()</code> in Go</h2>
<h3 id="heading-use-new-for-value-types">Use <code>new()</code> for Value Types</h3>
<p>When dealing with value types like structs, you can use <code>new()</code> to allocate memory for a new zeroed value. This is suitable for scenarios where you want a pointer to a zero-initialized structure.</p>
<pre><code class="lang-go">p := <span class="hljs-built_in">new</span>(Person)
</code></pre>
<h3 id="heading-use-make-for-reference-types">Use <code>make()</code> for Reference Types</h3>
<p>For slices, maps, and channels, where initialization involves setting up data structures and internal pointers, use <code>make()</code> to create an initialized instance.</p>
<pre><code class="lang-go">s := <span class="hljs-built_in">make</span>([]<span class="hljs-keyword">int</span>, <span class="hljs-number">5</span>, <span class="hljs-number">10</span>)
</code></pre>
<h3 id="heading-pointer-vs-value">Pointer vs. Value</h3>
<p>Keep in mind that <code>new()</code> returns a pointer to a zeroed value, while <code>make()</code> returns an initialized value of the type itself, not a pointer. Choose the appropriate method based on whether you need a pointer or an initialized slice, map, or channel.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Understanding the distinction between <code>new()</code> and <code>make()</code> in Go is crucial for writing clean and efficient code. By using the right method for the appropriate data types, you can ensure proper memory allocation and initialization in your Go programs.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy an AWS Lambda Function with Serverless Framework ]]>
                </title>
                <description>
                    <![CDATA[ Serverless computing has revolutionized the way developers build and deploy applications in the cloud. It takes away the complexities of server management, allowing developers to focus solely on writing code and delivering value to their users. In th... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-deploy-aws-lambda-with-serverless/</link>
                <guid isPermaLink="false">66b906ab53c4132f77b5c301</guid>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ aws lambda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless framework ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Destiny Erhabor ]]>
                </dc:creator>
                <pubDate>Mon, 18 Sep 2023 23:52:30 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/09/joshua-woroniecki-lzh3hPtJz9c-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Serverless computing has revolutionized the way developers build and deploy applications in the cloud. It takes away the complexities of server management, allowing developers to focus solely on writing code and delivering value to their users.</p>
<p>In the realm of serverless computing, AWS Lambda stands out as a leading platform for running serverless functions in a scalable and cost-effective manner.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><a class="post-section-overview" href="#heading-what-is-serverless-framework">What is Serverless Framework?</a></li>
<li><a class="post-section-overview" href="#heading-purpose-and-scope-of-the-guide">Purpose and Scope of the Guide</a></li>
<li><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></li>
<li><a class="post-section-overview" href="#heading-how-to-configure-the-aws-cli">How to Configure the AWS CLI</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-the-iam-role">How to Create the IAM Role</a></li>
<li><a class="post-section-overview" href="#heading-how-to-create-a-serverless-project">How to Create a Serverless Project</a></li>
<li><a class="post-section-overview" href="#heading-how-to-write-the-python-function">How to Write the Python Function</a></li>
<li><a class="post-section-overview" href="#heading-how-to-define-serverless-configuration">How to Define Serverless Configuration</a></li>
<li><a class="post-section-overview" href="#heading-how-to-deploy-the-python-function">How to Deploy the Python Function</a></li>
<li><a class="post-section-overview" href="#heading-how-to-test-the-api">How to Test the API</a></li>
<li><a class="post-section-overview" href="#heading-monitoring-and-logging">Monitoring and Logging</a></li>
<li><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></li>
</ul>
<h2 id="heading-what-is-serverless-framework">What is Serverless Framework?</h2>
<p>Serverless Framework is a powerful tool that simplifies the deployment and management of serverless applications across various cloud providers, including Amazon Web Services (AWS). This guide aims to walk you through the process of using the Serverless Framework to deploy a simple Python function to AWS Lambda, expose it via API Gateway, and monitor it using AWS CloudWatch.</p>
<h3 id="heading-purpose-and-scope-of-the-guide">Purpose and Scope of the Guide</h3>
<p>The purpose of this guide is to provide you with a step-by-step tutorial on deploying a serverless Python function on AWS using the Serverless Framework. Whether you're new to serverless computing or looking to expand your skills, this tutorial is designed to help you with the following:</p>
<ul>
<li>How to set up the necessary prerequisites, including AWS account configuration.</li>
<li>How to create a new serverless project using the Serverless Framework.</li>
<li>How to write a Python function that will be deployed to AWS Lambda.</li>
<li>How to define the serverless configuration in a <code>serverless.yml</code> file.</li>
<li>How to deploy the Python function and API Gateway.</li>
<li>How to test the deployed API using various tools like cURL or Postman.</li>
<li>How to set up monitoring and logging with AWS CloudWatch.</li>
</ul>
<p>By the end of this article, you'll have a clear understanding of how to leverage the Serverless Framework to deploy and manage serverless applications on AWS. </p>
<p>You'll also gain practical experience in deploying serverless functions and exposing them through an API endpoint, paving the way for building and scaling serverless applications in your projects.</p>
<p>Now, let's dive into the prerequisites needed to get started with this tutorial.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along with this tutorial, you'll need the following:</p>
<ul>
<li><a target="_blank" href="https://www.console.aws.amazon.com">An AWS account</a>.</li>
<li><a target="_blank" href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html">The AWS CLI</a> (Command Line Interface).</li>
<li><a target="_blank" href="https://www.serverless.com/framework/docs/getting-started/">The Serverless Framework</a>.</li>
</ul>
<h2 id="heading-how-to-configure-the-aws-cli">How to Configure the AWS CLI</h2>
<p>You'll need to set the AWS credentials for the AWS CLI if you have not done that already. You'll be using it along with the Serverless Framework to deploy the resources on AWS. </p>
<p>You can create the AWS credentials and config files by entering the following commands in the terminal:</p>
<pre><code class="lang-bash"># Create the AWS config directory if it doesn't exist yet
mkdir -p ~/.aws

# The closing EOF must start at the beginning of its line
cat &lt;&lt;EOF &gt; ~/.aws/credentials
[default]
aws_access_key_id = &lt;REPLACE_WITH_YOUR_ACCESS_KEY_ID&gt;
aws_secret_access_key = &lt;REPLACE_WITH_YOUR_SECRET_ACCESS_KEY&gt;
EOF

cat &lt;&lt;EOF &gt; ~/.aws/config
[default]
region = eu-west-1
output = json
EOF
</code></pre>
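<p>If you want to double-check the credentials file before deploying, a small script can confirm that both keys are present. The following is only a sketch: the helper name <code>missing_credential_keys</code> and the sample values are made up for illustration, and the sample text stands in for the content of <code>~/.aws/credentials</code>.</p>

```python
import configparser

# The two keys the AWS CLI expects in each credentials profile
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(text, profile="default"):
    """Return the required keys that are absent from the given profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        # No such profile at all, so every key is missing
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]


# Placeholder content standing in for ~/.aws/credentials (hypothetical values)
sample = """
[default]
aws_access_key_id = EXAMPLE_ACCESS_KEY_ID
aws_secret_access_key = EXAMPLE_SECRET_ACCESS_KEY
"""

# An empty list means both keys were found in the profile
print(missing_credential_keys(sample))
```

<p>To check your real file, you could read it with <code>open()</code> and pass its text to the helper instead of the sample string.</p>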
<h2 id="heading-how-to-create-the-iam-role">How to Create the IAM Role</h2>
<p>The Lambda function needs an IAM role that it assumes at runtime, and you'll reference that role in the Serverless Framework configuration later. Enter the following command to create the role:</p>
<pre><code class="lang-bash"> aws iam create-role --role-name serverlessLabs --assume-role-policy-document <span class="hljs-string">'{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}'</span>
</code></pre>
<p>This trust policy allows the AWS Lambda service to assume the role.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/create-role-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
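<p>Quoting a multi-line JSON document on the command line is easy to get wrong. One option is to build the same trust policy as a Python dictionary and serialize it. This is a minimal sketch, and the function name <code>lambda_trust_policy</code> is an assumption made for this example:</p>

```python
import json


def lambda_trust_policy():
    """Build the trust policy that lets the Lambda service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Serialize the dictionary to the JSON text the create-role command expects
policy_json = json.dumps(lambda_trust_policy(), indent=2)
print(policy_json)
```

<p>You could save the printed JSON to a file and pass it to the <code>--assume-role-policy-document</code> option instead of typing it inline.</p>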
<p>Enter the following command to attach the AWS managed <strong><code>AWSLambda_FullAccess</code></strong> policy to the role:</p>
<pre><code class="lang-bash">aws iam attach-role-policy --role-name serverlessLabs --policy-arn arn:aws:iam::aws:policy/AWSLambda_FullAccess
</code></pre>
<p>To verify that the role has been created successfully, you can run the following command to get information about the IAM role:</p>
<pre><code class="lang-bash">aws iam get-role --role-name serverlessLabs
</code></pre>
<p>Here's what the information should look like:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/role-aws-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-create-a-serverless-project">How to Create a Serverless Project</h2>
<p>This project is a simple Python function that the Serverless Framework deploys to AWS Lambda and exposes through API Gateway, with its logs sent to CloudWatch.</p>
<p>The function is triggered by an HTTP GET request and returns a simple string. The function is deployed to the eu-west-1 region.</p>
<p>First, install Serverless Framework using <code>npm</code>:</p>
<pre><code class="lang-bash">npm install -g serverless
</code></pre>
<p>Next, create a new Serverless Framework project by running the <code>serverless</code> command and following the prompts:</p>
<pre><code class="lang-bash">serverless
</code></pre>
<p>Then choose AWS Python Starter from the template list. Give it any name of your choice – I used <strong>serverless-lab</strong>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/serverless-template-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>After the command runs successfully, you will see the two main files it created: <code>serverless.yml</code> and <code>handler.py</code>.</p>
<h2 id="heading-how-to-write-the-python-function">How to Write the Python Function</h2>
<p>To keep things organized, let's create a folder named <strong>functions</strong>, and create a file named <code>__init__.py</code> inside it. You can do that using this command:</p>
<pre><code class="lang-bash">mkdir <span class="hljs-built_in">functions</span> &amp;&amp; touch <span class="hljs-built_in">functions</span>/__init__.py
</code></pre>
<p>Create your first function by creating a file named <code>first_function.py</code> inside the <strong>functions</strong> folder:</p>
<pre><code class="lang-bash">touch <span class="hljs-built_in">functions</span>/first_function.py
</code></pre>
<p>Then open the <code>first_function.py</code> file, and paste the following Python code to define the function you'll deploy:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">first_function</span>(<span class="hljs-params">event, context</span>):</span>
  print(<span class="hljs-string">"The first function has been invoked!!"</span>)
  <span class="hljs-keyword">return</span> {
    <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">200</span>,
    <span class="hljs-string">'body'</span>: <span class="hljs-string">"Hello, World!.\n This is the first function."</span>
  }
</code></pre>
<p>The code above is a simple Python function that returns a dictionary with <code>statusCode</code> and <code>body</code> keys, which API Gateway turns into the HTTP response. The function accepts the two parameters, <code>event</code> and <code>context</code>, that AWS Lambda passes to every Python handler.</p>
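<p>Because the handler is a plain Python function, you can call it locally before deploying anything. The sketch below repeats the function so that it runs on its own. The <code>fake_event</code> dictionary is a hypothetical, heavily simplified stand-in for the event API Gateway sends, not the real payload:</p>

```python
def first_function(event, context):
    print("The first function has been invoked!!")
    return {
        'statusCode': 200,
        'body': "Hello, World!.\n This is the first function."
    }


# Hypothetical, minimal stand-in for the event API Gateway would send
fake_event = {"httpMethod": "GET", "path": "/first"}

# The function never reads the context object, so None is enough for a local call
response = first_function(fake_event, None)

print(response["statusCode"])
print(response["body"])
```

<p>A quick local call like this catches syntax errors and a malformed return value without waiting for a deployment.</p>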
<p>Next, open the <code>handler.py</code> file, delete its content, and paste the following Python code to define the handler that will be invoked when the function is triggered:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> functions.first_function <span class="hljs-keyword">import</span> first_function
</code></pre>
<p>The code above exposes the function you created in the <code>first_function.py</code> file. We imported the function, and exposed it to the framework.</p>
<h2 id="heading-how-to-define-serverless-configuration">How to Define Serverless Configuration</h2>
<p>To start with the configuration, open the <code>serverless.yml</code> file, delete all of its content, and paste the following YAML code to define the microservice you will deploy:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">service:</span> <span class="hljs-string">serverless-lab</span>

<span class="hljs-attr">provider:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">aws</span>
  <span class="hljs-attr">runtime:</span> <span class="hljs-string">python3.7</span>
  <span class="hljs-attr">lambdaHashingVersion:</span> <span class="hljs-number">20201221</span>
  <span class="hljs-attr">region:</span> <span class="hljs-string">eu-west-1</span>
  <span class="hljs-attr">timeout:</span> <span class="hljs-number">10</span> <span class="hljs-comment"># You set a timeout of 10 seconds for the functions</span>
  <span class="hljs-attr">role:</span> <span class="hljs-string">arn:aws:iam::155318317806:role/serverlessLabs</span> <span class="hljs-comment"># Enter your Arn role here</span>
  <span class="hljs-attr">memorySize:</span> <span class="hljs-number">512</span>

<span class="hljs-attr">functions:</span>
  <span class="hljs-attr">first_function:</span>
    <span class="hljs-attr">handler:</span> <span class="hljs-string">handler.first_function</span>
    <span class="hljs-attr">events:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">http:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">first</span>
        <span class="hljs-attr">method:</span> <span class="hljs-string">get</span>
</code></pre>
<p>Let's break down each section line by line:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">service:</span> <span class="hljs-string">serverless-lab</span>
</code></pre>
<p><strong><code>service</code></strong> specifies the name of your Serverless service or project. In this case, it's named "serverless-lab," which will be used as the service name when deploying to AWS.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">provider:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">aws</span>
  <span class="hljs-attr">runtime:</span> <span class="hljs-string">python3.7</span>
  <span class="hljs-attr">lambdaHashingVersion:</span> <span class="hljs-number">20201221</span>
  <span class="hljs-attr">region:</span> <span class="hljs-string">eu-west-1</span> <span class="hljs-comment"># enter your region</span>
  <span class="hljs-attr">timeout:</span> <span class="hljs-number">10</span> <span class="hljs-comment"># You set a timeout of 10 seconds for the functions</span>
  <span class="hljs-attr">role:</span> <span class="hljs-string">arn:aws:iam::155318317806:role/serverlessLabs</span> <span class="hljs-comment"># Enter your Arn role here</span>
  <span class="hljs-attr">memorySize:</span> <span class="hljs-number">512</span>
</code></pre>
<p><strong><code>provider</code></strong> defines the AWS provider for your service. It specifies various configuration settings for AWS Lambda functions and other AWS resources.</p>
<ul>
<li><code>name: aws</code> specifies that you are using AWS as your cloud provider.</li>
<li><code>runtime: python3.7</code> sets the runtime for AWS Lambda functions to Python 3.7.</li>
<li><code>lambdaHashingVersion: 20201221</code> selects the algorithm used to hash Lambda function versions. This is a Serverless Framework setting, not an AWS one.</li>
<li><code>region: eu-west-1</code> specifies the AWS region where your service will be deployed. You can replace "eu-west-1" with your desired AWS region.</li>
<li><code>timeout: 10</code> sets a timeout of 10 seconds for AWS Lambda functions. Lambda stops any invocation that runs longer than this.</li>
<li><code>role: arn:aws:iam::155318317806:role/serverlessLabs</code> specifies the AWS IAM role ARN that your Lambda functions will assume. This role defines the permissions your functions have within AWS services. You can replace this with the ARN of your desired IAM role.</li>
<li><code>memorySize: 512</code> allocates 512 MB of memory to each function.</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">functions:</span>
  <span class="hljs-attr">first_function:</span>
    <span class="hljs-attr">handler:</span> <span class="hljs-string">handler.first_function</span>
    <span class="hljs-attr">events:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">http:</span>
        <span class="hljs-attr">path:</span> <span class="hljs-string">first</span>
        <span class="hljs-attr">method:</span> <span class="hljs-string">get</span>
</code></pre>
<p><strong><code>functions</code></strong> defines the AWS Lambda functions in your service.</p>
<ul>
<li><code>first_function</code> denotes the name of your AWS Lambda function.</li>
<li><code>handler: handler.first_function</code> specifies the entry point for this function: the <code>first_function</code> function in the <code>handler</code> module. The value follows the <code>&lt;module_name&gt;.&lt;function_name&gt;</code> format.</li>
<li><code>events</code> specifies the events that trigger the function.</li>
<li><code>- http</code> indicates that the function is triggered by an HTTP event (API Gateway).</li>
<li><code>path: first</code> specifies the API endpoint path (<code>/first</code>) that triggers the function.</li>
<li><code>method: get</code> specifies that this function is triggered when an HTTP GET request is made to the specified path.</li>
</ul>
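<p>To make the <code>&lt;module_name&gt;.&lt;function_name&gt;</code> handler format more concrete, here is a rough sketch of how such a string can be resolved to a Python function. It illustrates the idea only and is not the actual code AWS Lambda runs. The name <code>resolve_handler</code> is made up for this example, and <code>json.dumps</code> stands in for <code>handler.first_function</code> so that the snippet runs without the project files:</p>

```python
import importlib


def resolve_handler(handler):
    """Turn a '<module_name>.<function_name>' string into the function it names."""
    # Split on the last dot: everything before it is the module path
    module_name, _, function_name = handler.rpartition(".")
    if not module_name or not function_name:
        raise ValueError(f"expected '<module_name>.<function_name>', got {handler!r}")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# json.dumps stands in for handler.first_function so the sketch runs anywhere
dumps = resolve_handler("json.dumps")
print(dumps({"statusCode": 200}))
```

<p>Inside the project folder, <code>resolve_handler("handler.first_function")</code> would import <code>handler.py</code> and return the function it re-exports from the <strong>functions</strong> package.</p>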
<h2 id="heading-how-to-deploy-the-python-function">How to Deploy the Python Function</h2>
<p>You can use the command below to deploy the microservice on AWS:</p>
<pre><code class="lang-bash">serverless deploy
</code></pre>
<p>After a while, the deployment will be completed and you can see information like the endpoint, hosted on API Gateway, to trigger the function you just deployed.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/deploy-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The framework deployed the function on AWS Lambda and, because you attached an HTTP trigger to it, also deployed an API on API Gateway to make the function reachable.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/function-on-aws-api-gateway-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-test-the-api">How to Test the API</h2>
<p>From the deployment, you have a single function named <code>first_function</code>, and a single HTTP GET endpoint. </p>
<p>Using the GET endpoint (the endpoint generated in the terminal after deploying the function) in your browser, you can call the function:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/url-test-2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The image above shows the functionality created in the deployed function running in the browser.</p>
<h2 id="heading-monitoring-and-logging">Monitoring and Logging</h2>
<p>Lambda automatically sends the function's output, including the message from the <code>print</code> statement, to a log group in AWS CloudWatch. Enter the following command to access the function's logs:</p>
<pre><code class="lang-bash">serverless logs -f first_function
</code></pre>
<p>AWS CloudWatch is the native AWS monitoring and logging service. In the CloudWatch console, you can browse log groups and apply filter expressions to retrieve only the log entries you need.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/09/cloudwatch-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
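<p>If you'd like log levels that you can filter on, one option is Python's standard <code>logging</code> module in place of <code>print</code>. This is a sketch of a variation on the tutorial's function, not part of the deployed project, and the logger name is an arbitrary choice:</p>

```python
import logging

# basicConfig only takes effect when no handler is configured yet,
# so it is safe to keep for local runs
logging.basicConfig(level=logging.INFO)

logger = logging.getLogger("first_function")
logger.setLevel(logging.INFO)


def first_function(event, context):
    # Each record carries a level (INFO here) alongside the message
    logger.info("The first function has been invoked!!")
    return {
        'statusCode': 200,
        'body': "Hello, World!.\n This is the first function."
    }


# Local call with an empty event; the context object is not used
result = first_function({}, None)
print(result["statusCode"])
```

<p>With level names in each log line, a filter expression such as <code>ERROR</code> can narrow the results once the function logs more than one kind of message.</p>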
<p>You can delete the microservice and resources you just deployed using the <code>serverless remove</code> command. </p>
<p>Check out <a target="_blank" href="https://github.com/Caesarsage/Devops-projects/tree/main/project-08">my GitHub repository</a> to see the full code.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this comprehensive guide, we've explored the powerful world of serverless computing and demonstrated how to harness its capabilities using the Serverless Framework and Amazon Web Services (AWS). </p>
<p>You've embarked on a journey from setting up your development environment to deploying a simple Python function as an AWS Lambda-backed API, all while gaining insights into monitoring and logging with AWS CloudWatch.</p>
<p>This guide serves as a starting point for your serverless journey. As you become more proficient with the Serverless Framework and AWS, you'll be able to build and deploy sophisticated serverless applications that scale dynamically and meet the demands of modern, cloud-native architectures.</p>
<p>As always, I hope you enjoyed the article and learned something new. If you want, you can also follow me on <a target="_blank" href="https://www.linkedin.com/in/destiny-erhabor">LinkedIn</a> or <a target="_blank" href="https://twitter.com/caesar_sage">Twitter</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
