<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Docker - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Docker - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 14 May 2026 09:13:43 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/docker/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Self-Hosted WhatsApp Bot with n8n and WAHA ]]>
                </title>
                <description>
                    <![CDATA[ WhatsApp is where many of your customers likely already are. For support tickets, order updates, booking reminders, and lead qualification, a WhatsApp channel often converts several times better  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-self-hosted-whatsapp-bot-with-n8n-and-waha/</link>
                <guid isPermaLink="false">6a01e032fca21b0d4b2bb4c1</guid>
                
                    <category>
                        <![CDATA[ whatsapp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ n8n ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ אחיה כהן ]]>
                </dc:creator>
                <pubDate>Mon, 11 May 2026 13:57:06 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/28affe4d-9359-4cbb-a311-a2ee9d0829c0.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>WhatsApp is where many of your customers likely already are. For support tickets, order updates, booking reminders, and lead qualification, a WhatsApp channel often converts several times better than email.</p>
<p>But the official WhatsApp Business Cloud API can be slow to onboard, template-restricted for proactive messages, and priced per conversation — which adds up fast at scale.</p>
<p>There's another path: you can run your own WhatsApp HTTP gateway on a small server, connect it to a workflow engine, and keep every message — inbound and outbound — inside infrastructure you control. No monthly conversation fees, no template approvals for routine replies, no third-party middleman holding your customer data.</p>
<p>In this tutorial, you'll build exactly that. By the end, you'll have a WhatsApp bot that:</p>
<ul>
<li><p>Receives every incoming message through a webhook</p>
</li>
<li><p>Routes messages through an n8n workflow</p>
</li>
<li><p>Replies automatically based on keywords, AI, or any API call you want</p>
</li>
<li><p>Runs entirely on your own server, using two open-source tools</p>
</li>
</ul>
<p>You'll use <strong>WAHA</strong> (WhatsApp HTTP API) as the gateway, and <strong>n8n</strong> as the workflow engine. Both run in Docker, both are free for self-hosting, and together they cover everything from a simple auto-reply to a full CRM integration.</p>
<h2 id="heading-table-of-contents">Table of contents</h2>
<ul>
<li><p><a href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-a-note-on-which-whatsapp-account-to-use">A Note on Which WhatsApp Account to Use</a></p>
</li>
<li><p><a href="#heading-waha-vs-the-official-whatsapp-business-cloud-api">WAHA vs the official WhatsApp Business Cloud API</a></p>
</li>
<li><p><a href="#heading-part-1-understanding-waha">Part 1: Understanding WAHA</a></p>
</li>
<li><p><a href="#heading-part-2-running-waha-with-docker">Part 2: Running WAHA with Docker</a></p>
</li>
<li><p><a href="#heading-part-3-starting-a-whatsapp-session">Part 3: Starting a WhatsApp session</a></p>
</li>
<li><p><a href="#heading-part-4-running-n8n">Part 4: Running n8n</a></p>
</li>
<li><p><a href="#heading-part-5-creating-the-webhook-trigger-in-n8n">Part 5: Creating the Webhook Trigger in n8n</a></p>
</li>
<li><p><a href="#heading-part-6-wiring-waha-to-n8n">Part 6: Wiring WAHA to n8n</a></p>
</li>
<li><p><a href="#heading-part-7-building-the-first-auto-reply">Part 7: Building the first auto-reply</a></p>
</li>
<li><p><a href="#heading-part-8-a-second-example-proactive-booking-confirmations">Part 8: A Second Example — Proactive Booking Confirmations</a></p>
</li>
<li><p><a href="#heading-part-9-going-to-production">Part 9: Going to Production</a></p>
</li>
<li><p><a href="#heading-common-pitfalls">Common Pitfalls</a></p>
</li>
<li><p><a href="#heading-where-to-go-next">Where to Go Next</a></p>
</li>
</ul>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<ul>
<li><p>How WAHA works under the hood and when to use it instead of the official Cloud API</p>
</li>
<li><p>How to run WAHA and n8n side by side with Docker Compose</p>
</li>
<li><p>How to scan the QR code and bind a WhatsApp account to your gateway</p>
</li>
<li><p>How to connect WAHA's webhook to an n8n workflow</p>
</li>
<li><p>How to build a keyword-based auto-reply bot</p>
</li>
<li><p>How to send proactive confirmations from a separate workflow</p>
</li>
<li><p>How to harden the setup for production (HTTPS, API keys, rate limits, Queue Mode)</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A Linux server (any VPS works — 2 GB of RAM is enough for a small bot)</p>
</li>
<li><p>Docker and Docker Compose installed</p>
</li>
<li><p>A public hostname with DNS pointing at the server, or an ngrok tunnel for local testing</p>
</li>
<li><p>A WhatsApp account you're willing to dedicate to the bot (more on that below)</p>
</li>
<li><p>Basic familiarity with JSON and HTTP requests</p>
</li>
</ul>
<p>You don't need prior n8n experience. If you can drag a box and wire it to another box, you can build the flow.</p>
<h2 id="heading-a-note-on-which-whatsapp-account-to-use">A Note on Which WhatsApp Account to Use</h2>
<p>WAHA works by running an actual WhatsApp Web session inside a headless Chromium process. It logs in as a real account — the same way you would open web.whatsapp.com in your browser. Meta doesn't officially endorse this approach for commercial use at scale, and heavy volume from a single number can lead to a ban.</p>
<p>For that reason, use a dedicated number for the bot. Don't use your personal WhatsApp. Get a second SIM, eSIM, or a VoIP number that supports WhatsApp activation. Keep outbound volume reasonable, and you'll be fine for most small-business use cases.</p>
<p>If you plan to send thousands of marketing messages per day, switch to the official WhatsApp Business Cloud API — that's what it exists for. This tutorial is aimed at the middle ground: support bots, order updates, booking confirmations, and similar conversational flows where you need real-time control without enterprise pricing.</p>
<h2 id="heading-waha-vs-the-official-whatsapp-business-cloud-api">WAHA vs the official WhatsApp Business Cloud API</h2>
<p>Before writing any code, it helps to understand when each option is the right fit.</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>WAHA (self-hosted)</th>
<th>WhatsApp Cloud API (Meta)</th>
</tr>
</thead>
<tbody><tr>
<td>Onboarding</td>
<td>Scan a QR code — ready in minutes</td>
<td>Business verification, app review — days to weeks</td>
</tr>
<tr>
<td>Cost</td>
<td>Server cost only</td>
<td>Per-conversation pricing</td>
</tr>
<tr>
<td>Template approval</td>
<td>Not needed</td>
<td>Required for proactive messages outside the 24-hour window</td>
</tr>
<tr>
<td>Session model</td>
<td>One WhatsApp Web session per Core container</td>
<td>Native API, no web session</td>
</tr>
<tr>
<td>Risk</td>
<td>Account ban possible at high unsolicited volume</td>
<td>Rate limits but no ban for normal use</td>
</tr>
<tr>
<td>Vendor lock-in</td>
<td>None — pure open source</td>
<td>Tied to Meta's API and pricing</td>
</tr>
<tr>
<td>Best for</td>
<td>Support bots, small-team workflows, internal tools</td>
<td>High-volume marketing, regulated industries, &gt;100k monthly messages</td>
</tr>
</tbody></table>
<p>Neither is strictly better. If you run a support team for a small business, WAHA is often the pragmatic choice. If you're a bank sending millions of transactional messages, you want the Cloud API. Many teams run both — WAHA for conversational support, Cloud API for bulk transactional traffic.</p>
<h2 id="heading-part-1-understanding-waha">Part 1: Understanding WAHA</h2>
<p>WAHA is an open-source project that wraps WhatsApp Web behind a clean REST API. You <code>POST /api/sendText</code> with a chat ID and a message, and WAHA sends it. You configure a webhook URL, and WAHA <code>POST</code>s to that URL every time a message arrives.</p>
<p>Under the hood, WAHA spawns a Chromium instance, opens WhatsApp Web, and uses an engine (<code>whatsapp-web.js</code>, <code>NOWEB</code>, or <code>GOWS</code>) to automate the session. Your code doesn't see any of that complexity — you just see an HTTP API.</p>
<p>The project ships in two flavors:</p>
<ul>
<li><p><strong>WAHA Core</strong> — free, MIT licensed, one active session per container, community support.</p>
</li>
<li><p><strong>WAHA Plus</strong> — commercial license, multi-session support, priority support, and access to advanced endpoints.</p>
</li>
</ul>
<p>For most developers building a single bot, Core is enough. You can always upgrade later.</p>
<p>Official docs live at <a href="https://waha.devlike.pro/">waha.devlike.pro</a>. Keep that open in another tab — we'll reference specific endpoints as we go.</p>
<h2 id="heading-part-2-running-waha-with-docker">Part 2: Running WAHA with Docker</h2>
<p>Create a fresh directory for the project:</p>
<pre><code class="language-bash">mkdir whatsapp-bot &amp;&amp; cd whatsapp-bot
</code></pre>
<p>Create a <code>docker-compose.yml</code> file:</p>
<pre><code class="language-yaml">services:
  waha:
    image: devlikeapro/waha:latest
    container_name: waha
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - WAHA_DASHBOARD_ENABLED=true
      - WAHA_DASHBOARD_USERNAME=admin
      - WAHA_DASHBOARD_PASSWORD=change-me-now
      - WHATSAPP_API_KEY=super-secret-key-change-me
      - WHATSAPP_DEFAULT_ENGINE=WEBJS
    volumes:
      - ./waha-sessions:/app/.sessions
</code></pre>
<p>A few things to notice:</p>
<ul>
<li><p>The dashboard username and password protect the web UI at <code>http://your-server:3000</code>. Always change the defaults before you expose the port publicly.</p>
</li>
<li><p><code>WHATSAPP_API_KEY</code> is the key every HTTP request to WAHA must include in the <code>X-Api-Key</code> header. Treat it like a database password.</p>
</li>
<li><p><code>WHATSAPP_DEFAULT_ENGINE=WEBJS</code> uses the mature <code>whatsapp-web.js</code> engine. WAHA also supports <code>NOWEB</code> and <code>GOWS</code> engines with different trade-offs — WEBJS is the safest default for a first deployment.</p>
</li>
<li><p>The volume mount persists the session across restarts. Without it, every container rebuild forces you to scan the QR code again.</p>
</li>
</ul>
<p>Start the container:</p>
<pre><code class="language-bash">docker compose up -d
docker compose logs -f waha
</code></pre>
<p>Within about 20 seconds WAHA finishes booting. Visit <code>http://your-server:3000</code> and log in with the dashboard credentials.</p>
<h2 id="heading-part-3-starting-a-whatsapp-session">Part 3: Starting a WhatsApp session</h2>
<p>WAHA calls each WhatsApp account a "session." You can have one session at a time on WAHA Core.</p>
<p>From the dashboard, click <strong>Start New Session</strong> and name it <code>default</code>. WAHA displays a QR code.</p>
<p>On your phone:</p>
<ol>
<li><p>Open WhatsApp.</p>
</li>
<li><p>Tap the three-dot menu (Android) or Settings (iOS).</p>
</li>
<li><p>Tap Linked Devices → Link a Device.</p>
</li>
<li><p>Point the camera at the QR code on your screen.</p>
</li>
</ol>
<p>Within a few seconds the dashboard shows <code>WORKING</code> status. Your session is live.</p>
<p>You can also do this over the API. Start the session (<code>default</code> is the session name, encoded in the URL path):</p>
<pre><code class="language-bash">curl -X POST http://your-server:3000/api/sessions/default/start \
  -H "X-Api-Key: super-secret-key-change-me"
</code></pre>
<p>The call is idempotent — if the session is already running, nothing happens.</p>
<p>Fetch the QR as a PNG:</p>
<pre><code class="language-bash">curl http://your-server:3000/api/default/auth/qr \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Accept: image/png" \
  --output qr.png
</code></pre>
<p>Scan and you're in.</p>
<p>Test that the session works by sending a message to yourself:</p>
<pre><code class="language-bash">curl -X POST http://your-server:3000/api/sendText \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "session": "default",
    "chatId": "15555550123@c.us",
    "text": "Hello from WAHA!"
  }'
</code></pre>
<p>Replace <code>15555550123</code> with your own number (country code plus number, no <code>+</code>, no spaces, no dashes). The <code>@c.us</code> suffix marks it as an individual chat. Groups use <code>@g.us</code>.</p>
<p>If the message lands on your phone — congratulations. The gateway works.</p>
<h2 id="heading-part-4-running-n8n">Part 4: Running n8n</h2>
<p>Add an <code>n8n</code> service to your <code>docker-compose.yml</code> alongside WAHA:</p>
<pre><code class="language-yaml">services:
  waha:
    # ... existing config

  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=n8n.example.com
      - N8N_PORT=5678
      - N8N_PROTOCOL=https
      - WEBHOOK_URL=https://n8n.example.com/
      - GENERIC_TIMEZONE=UTC
    volumes:
      - ./n8n-data:/home/node/.n8n
</code></pre>
<p>Replace <code>n8n.example.com</code> with your real domain. For purely local testing, set:</p>
<pre><code class="language-yaml">- N8N_HOST=localhost
- N8N_PROTOCOL=http
- WEBHOOK_URL=http://localhost:5678/
</code></pre>
<p>If you want to test webhooks from your laptop without a server, run <code>ngrok http 5678</code> in another terminal and use the ngrok HTTPS URL as <code>WEBHOOK_URL</code>. n8n uses <code>WEBHOOK_URL</code> to tell external services where to POST — get this wrong and your webhooks will 404.</p>
<p>Start the stack:</p>
<pre><code class="language-bash">docker compose up -d
</code></pre>
<p>Visit <code>http://your-server:5678</code>. On the first visit, n8n walks you through creating an owner account (email and password). Every subsequent visit requires that login. For extra safety in production, put n8n behind a reverse proxy with an allow-list or an additional auth layer — we'll set that up later.</p>
<h2 id="heading-part-5-creating-the-webhook-trigger-in-n8n">Part 5: Creating the Webhook Trigger in n8n</h2>
<p>Click Create Workflow. You'll see an empty canvas.</p>
<p>Add a Webhook node and configure it:</p>
<ul>
<li><p><strong>HTTP Method</strong>: POST</p>
</li>
<li><p><strong>Path</strong>: <code>whatsapp</code> (this becomes part of the URL)</p>
</li>
<li><p><strong>Response Mode</strong>: Respond Immediately</p>
</li>
<li><p><strong>Response Data</strong>: First Entry JSON</p>
</li>
</ul>
<p>Click Listen for Test Event. n8n shows you two URLs: a test URL and a production URL. Copy the production URL. It looks like this:</p>
<pre><code class="language-plaintext">https://n8n.example.com/webhook/whatsapp
</code></pre>
<p>Not <code>webhook-test</code> — that one only fires while the editor is open. You want <code>webhook</code>.</p>
<h2 id="heading-part-6-wiring-waha-to-n8n">Part 6: Wiring WAHA to n8n</h2>
<p>WAHA can POST to a webhook on every WhatsApp event. Tell it where to send those events.</p>
<p>In the WAHA dashboard, open your session and set the webhook URL. Or do it over the API:</p>
<pre><code class="language-bash">curl -X PUT http://your-server:3000/api/sessions/default \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "webhooks": [
        {
          "url": "https://n8n.example.com/webhook/whatsapp",
          "events": ["message", "session.status"]
        }
      ]
    }
  }'
</code></pre>
<p>The <code>message</code> event fires on every inbound message. <code>session.status</code> fires when the session connects, disconnects, or reconnects — which is useful for alerting when your bot goes down.</p>
<p>Test it. From another phone, send a WhatsApp message to your bot's number. Head back to the n8n editor. Within a second or two the webhook node lights up with the event data.</p>
<p>The payload looks roughly like this:</p>
<pre><code class="language-json">{
  "event": "message",
  "session": "default",
  "payload": {
    "id": "false_15555550123@c.us_3EB0...",
    "from": "15555550123@c.us",
    "body": "Hello",
    "timestamp": 1713801234,
    "fromMe": false
  }
}
</code></pre>
<p>Everything you need is in <code>payload</code>: who sent it (<code>from</code>), what they said (<code>body</code>), and when (<code>timestamp</code>).</p>
<h2 id="heading-part-7-building-the-first-auto-reply">Part 7: Building the first auto-reply</h2>
<p>A bot that only listens is boring. Let's make it answer.</p>
<p>You'll build a tiny keyword router: if the user sends <code>hi</code> or <code>hello</code>, the bot greets them. If they send <code>price</code>, it sends a pricing message. Anything else gets a fallback.</p>
<p>After the Webhook node, add a Switch node.</p>
<p>Configure the Switch node:</p>
<ul>
<li><p><strong>Mode</strong>: Expression</p>
</li>
<li><p><strong>Value</strong>: <code>{{ $json.payload.body.toLowerCase().trim() }}</code></p>
</li>
<li><p>Add routing rules:</p>
<ul>
<li><p>Rule 1: equals <code>hi</code> — output 0</p>
</li>
<li><p>Rule 2: equals <code>hello</code> — output 0</p>
</li>
<li><p>Rule 3: equals <code>price</code> — output 1</p>
</li>
<li><p>Fallback output: 2</p>
</li>
</ul>
</li>
</ul>
<p>After the Switch, add three HTTP Request nodes, one per output.</p>
<p>Configure each HTTP Request node identically, except for the body text:</p>
<ul>
<li><p><strong>Method</strong>: POST</p>
</li>
<li><p><strong>URL</strong>: <code>http://waha:3000/api/sendText</code> (inside the Docker network you can reach WAHA by its service name; from outside, use the full public URL)</p>
</li>
<li><p><strong>Send Headers</strong>: on</p>
<ul>
<li><p><code>X-Api-Key</code>: <code>super-secret-key-change-me</code></p>
</li>
<li><p><code>Content-Type</code>: <code>application/json</code></p>
</li>
</ul>
</li>
<li><p><strong>Send Body</strong>: on</p>
<ul>
<li><p><strong>Body Content Type</strong>: JSON</p>
</li>
<li><p><strong>Specify Body</strong>: Using JSON</p>
</li>
</ul>
</li>
</ul>
<p>For the greeting node, the JSON body is:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "Hi! I'm the bot. Send 'price' to see pricing, or anything else for help."
}
</code></pre>
<p>For the pricing node:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "Our plans start at $49/month. Reply 'sales' to talk to a human."
}
</code></pre>
<p>For the fallback:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "I didn't catch that. Try 'hi' or 'price'."
}
</code></pre>
<p>The <code>={{ ... }}</code> syntax is an n8n expression — at runtime it pulls values from earlier nodes.</p>
<p>Connect the Switch outputs to their matching HTTP Request nodes. Save the workflow. Click Activate in the top-right.</p>
<p>Send <code>hi</code> to your bot from any phone. It should reply within a second.</p>
<p>Congratulations — you have a WhatsApp bot running entirely on your own infrastructure.</p>
<h2 id="heading-part-8-a-second-example-proactive-booking-confirmations">Part 8: A Second Example — Proactive Booking Confirmations</h2>
<p>Auto-reply is useful. Proactive outbound is where the value really compounds. Here's a second workflow that sends a booking confirmation whenever a new row lands in a database.</p>
<p>Create a second workflow in n8n. Use one of these triggers:</p>
<ul>
<li><p><strong>Schedule Trigger</strong> — poll a database every minute for new rows</p>
</li>
<li><p><strong>Webhook Trigger</strong> — listen for a notification from your booking system</p>
</li>
<li><p><strong>Database Trigger</strong> (Postgres, MySQL, Supabase) — react to inserts in real time</p>
</li>
</ul>
<p>For this example, use a Schedule Trigger set to every minute, followed by a Postgres <strong>Execute Query</strong> node that reads pending confirmations:</p>
<pre><code class="language-sql">SELECT id, customer_phone, service_name, booking_time
FROM bookings
WHERE confirmation_sent = false
LIMIT 20;
</code></pre>
<p>After the Postgres node, add an HTTP Request node pointing to the same WAHA <code>sendText</code> endpoint you used earlier. The body:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $json.customer_phone }}@c.us",
  "text": "Hi! Your booking for {{ \(json.service_name }} on {{ \)json.booking_time }} is confirmed. Reply 'change' to reschedule."
}
</code></pre>
<p>Finally, add a second Postgres node that marks the booking as sent:</p>
<pre><code class="language-sql">UPDATE bookings
SET confirmation_sent = true, confirmation_sent_at = NOW()
WHERE id = {{ $json.id }};
</code></pre>
<p>Activate the workflow. Every minute, n8n pulls pending bookings, sends a WhatsApp confirmation, and marks them done.</p>
<p>This pattern generalizes. Replace the SQL with a call to Shopify for order confirmations, Stripe for receipt messages, or Calendly for appointment reminders. The WhatsApp layer stays the same — only the source of truth changes.</p>
<h2 id="heading-part-9-going-to-production">Part 9: Going to Production</h2>
<p>The setup above works, but it's not yet production-ready. Here's what to harden before you point real customers at it.</p>
<h3 id="heading-1-put-everything-behind-https">1. Put Everything Behind HTTPS</h3>
<p>Never expose n8n or WAHA directly on plain HTTP. Put a reverse proxy in front. Caddy is the easiest choice because it handles Let's Encrypt automatically.</p>
<p>A minimal <code>Caddyfile</code>:</p>
<pre><code class="language-plaintext">n8n.example.com {
    reverse_proxy n8n:5678
}

waha.example.com {
    reverse_proxy waha:3000
}
</code></pre>
<p>Run Caddy as another service in the same Docker Compose. TLS certificates are issued and renewed automatically.</p>
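<p>A sketch of that extra service, assuming the <code>Caddyfile</code> sits next to your <code>docker-compose.yml</code> (the volume name here is an arbitrary choice):</p>
<pre><code class="language-yaml">  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy-data:/data   # persists the Let's Encrypt certificates

volumes:
  caddy-data:
</code></pre>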
<h3 id="heading-2-rotate-the-api-keys">2. Rotate the API Keys</h3>
<p>Don't ship <code>super-secret-key-change-me</code> to production. Generate a real key:</p>
<pre><code class="language-bash">openssl rand -hex 32
</code></pre>
<p>Put it in a <code>.env</code> file, reference it as <code>${WHATSAPP_API_KEY}</code> in <code>docker-compose.yml</code>, and add <code>.env</code> to your <code>.gitignore</code>.</p>
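<p>In <code>docker-compose.yml</code>, the hardcoded value then becomes a reference (Compose reads <code>.env</code> from the project directory automatically):</p>
<pre><code class="language-yaml">services:
  waha:
    environment:
      # value is read from the WHATSAPP_API_KEY line in .env
      - WHATSAPP_API_KEY=${WHATSAPP_API_KEY}
</code></pre>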
<h3 id="heading-3-rate-limit-outbound-messages">3. Rate-limit Outbound Messages</h3>
<p>WhatsApp bans accounts that send too many messages too fast. A safe outbound rate for a fresh number is well under 20 messages per minute. For bursty replies, add an n8n Wait node between sends, or queue outgoing messages through a small custom function node that sleeps between requests.</p>
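<p>If you'd rather throttle outside n8n, here's a minimal shell sketch of the same idea. It assumes an <code>outbox.txt</code> file with one tab-separated <code>chatId</code>/<code>text</code> pair per line, and the API key exported as <code>WHATSAPP_API_KEY</code>:</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Drain a simple outbox at ~10 messages per minute.
while IFS=$'\t' read -r chat_id text; do
  curl -s -X POST http://waha:3000/api/sendText \
    -H "X-Api-Key: $WHATSAPP_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"session\": \"default\", \"chatId\": \"$chat_id\", \"text\": \"$text\"}"
  sleep 6  # 6 seconds between sends stays well under 20/minute
done &lt; outbox.txt
</code></pre>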
<h3 id="heading-4-scale-n8n-with-queue-mode">4. Scale n8n with Queue Mode</h3>
<p>By default, n8n runs everything in a single process. That's fine for low volume. For higher throughput, switch to Queue Mode:</p>
<ul>
<li><p>Add a Redis container.</p>
</li>
<li><p>Run one <code>n8n</code> main container (the web UI and webhook receiver).</p>
</li>
<li><p>Run one or more <code>n8n-worker</code> containers that pull jobs from the queue.</p>
</li>
</ul>
<p>Queue Mode is documented at <a href="https://docs.n8n.io/hosting/scaling/queue-mode/">docs.n8n.io/hosting/scaling/queue-mode/</a>. Setup adds two environment variables (<code>EXECUTIONS_MODE=queue</code>, <code>QUEUE_BULL_REDIS_HOST=redis</code>) and decouples incoming webhooks from workflow execution. The webhook responds in milliseconds while workers chew through the queue in the background.</p>
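<p>A rough sketch of the extra services (assuming a Redis service named <code>redis</code>; the main n8n container needs the same two variables so that executions are pushed to the queue instead of running inline):</p>
<pre><code class="language-yaml">  redis:
    image: redis:7
    restart: unless-stopped

  n8n-worker:
    image: n8nio/n8n:latest
    restart: unless-stopped
    command: worker
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
</code></pre>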
<h3 id="heading-5-monitor-the-session">5. Monitor the Session</h3>
<p>WhatsApp Web sessions drop. The phone loses connection, WhatsApp rotates security tokens, or your server reboots. Catch those drops early.</p>
<p>Subscribe to the <code>session.status</code> webhook event in WAHA. When status becomes <code>FAILED</code> or <code>STOPPED</code>, route it to an n8n workflow that posts to Slack, sends an email, or pages you. The faster you know, the faster you recover.</p>
<p>For overall uptime, point something like Uptime Kuma at <code>GET /api/sessions/default</code> on WAHA. If WAHA reports <code>WORKING</code>, you're fine. Anything else triggers an alert.</p>
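<p>The same check is easy to script yourself. A minimal sketch:</p>
<pre><code class="language-bash"># Alert if the session reports anything other than WORKING.
status=$(curl -s http://waha:3000/api/sessions/default \
  -H "X-Api-Key: $WHATSAPP_API_KEY" | grep -o '"status":"[A-Z_]*"')
[ "$status" = '"status":"WORKING"' ] || echo "WAHA session is down: $status"
</code></pre>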
<h3 id="heading-6-back-up-the-sessions-volume">6. Back Up the Sessions Volume</h3>
<p>The <code>waha-sessions</code> directory contains the logged-in state. If you lose it, you have to scan the QR code again — possibly from a phone that's no longer handy. Back it up nightly. A simple cron job with <code>tar</code> and <code>rclone</code> to S3-compatible storage is plenty.</p>
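<p>A sketch of such a job (the rclone remote and bucket name are placeholders you'd configure yourself):</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Archive the session state and ship it to S3-compatible storage.
stamp=$(date +%F)
tar czf "waha-sessions-$stamp.tar.gz" ./waha-sessions
rclone copy "waha-sessions-$stamp.tar.gz" s3:my-backup-bucket/waha/
rm "waha-sessions-$stamp.tar.gz"
</code></pre>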
<h3 id="heading-7-add-a-live-agent-handoff">7. Add a Live-Agent Handoff</h3>
<p>Not every conversation should stay with the bot. When a user types <code>human</code> — or when your intent classifier can't answer confidently — hand off to a real agent.</p>
<p>Chatwoot is a solid open-source option: it has a dedicated WhatsApp channel, agent inbox, team assignment, and conversation history. The handoff is an n8n branch that stops processing bot replies and forwards the message stream to Chatwoot's API.</p>
<h2 id="heading-common-pitfalls">Common Pitfalls</h2>
<p>A few issues catch almost everyone on their first production deploy.</p>
<h3 id="heading-webhooks-timing-out">Webhooks Timing Out</h3>
<p>WAHA gives your webhook a few seconds to respond. If your n8n workflow is slow (calling an LLM, hitting a remote API), the webhook times out and WAHA retries, potentially causing duplicate replies.</p>
<p>Fix: make the webhook return <code>200</code> immediately and offload the slow work. In n8n, set the Webhook node's Response Mode to <em>Using Respond to Webhook Node</em>, add a Respond to Webhook node as the first step with a <code>200</code> and empty body, then do the heavy lifting after that.</p>
<h3 id="heading-duplicate-messages">Duplicate Messages</h3>
<p>WAHA delivers the same <code>message</code> event more than once in edge cases (phone comes back online, session reconnects). Store the <code>payload.id</code> somewhere — Redis, a database, or n8n's static data store — and drop any ID you've already processed.</p>
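<p>With Redis, for example, an atomic <code>SET ... NX</code> makes the check a one-liner (the key prefix and TTL here are arbitrary choices):</p>
<pre><code class="language-bash"># Returns OK the first time an ID is seen, nothing afterwards.
# The 24-hour expiry keeps the key space from growing forever.
redis-cli SET "waha:msg:$MSG_ID" 1 NX EX 86400
</code></pre>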
<h3 id="heading-messages-arriving-out-of-order">Messages Arriving Out of Order</h3>
<p>The webhook is async, and n8n may parallelize executions. If ordering matters — for example, in a multi-step conversation — key a queue by the sender's <code>chatId</code> and process each sender serially.</p>
<h3 id="heading-sessions-disconnecting-after-a-phone-restart">Sessions Disconnecting After a Phone Restart</h3>
<p>Normal WhatsApp Web behavior. WAHA auto-reconnects, but occasionally the linked-devices list needs a manual refresh. If a session refuses to come back, stop the WAHA container, delete that session's folder under <code>waha-sessions/</code>, start the container again, and rescan the QR.</p>
<h3 id="heading-your-number-gets-banned">Your Number Gets Banned</h3>
<p>The single biggest cause is rate: a new number blasting hundreds of messages an hour gets flagged fast. Warm up a number slowly — send a normal, human-like volume for the first week. Don't send to strangers unsolicited. Prefer inbound-driven replies over outbound pushes wherever you can.</p>
<h3 id="heading-the-wrong-chat-id-format">The Wrong Chat ID Format</h3>
<p>WhatsApp individual chats use <code>&lt;number&gt;@c.us</code> and groups use <code>&lt;groupId&gt;@g.us</code>. Don't include the <code>+</code> or spaces in the number. If WAHA returns a 404 when sending, the chat ID is almost always the problem.</p>
<h2 id="heading-where-to-go-next">Where to Go Next</h2>
<p>You now have the foundation. The same two-service stack supports almost any bot you can imagine — you're only limited by what you can build in an n8n workflow.</p>
<p>Some natural next steps:</p>
<ul>
<li><p><strong>Plug in AI replies:</strong> Add an OpenAI or Anthropic node after the Webhook, pass the user's message through it with a short system prompt, and send the response back through WAHA. Cap conversation length to prevent runaway token usage.</p>
</li>
<li><p><strong>Integrate a CRM:</strong> Look up the caller's <code>chatId</code> in HubSpot, Pipedrive, or your own database before deciding how to reply. Segment responses by customer tier.</p>
</li>
<li><p><strong>Send proactive notifications:</strong> Appointment reminders, shipping updates, payment receipts, abandoned-cart nudges. Keep the content transactional and expected — unsolicited marketing blasts are the fastest way to a ban.</p>
</li>
<li><p><strong>Log every conversation:</strong> Add a Postgres or Supabase node after the Webhook to persist messages for analytics and customer history. Your future self (and your support team) will thank you.</p>
</li>
<li><p><strong>Add media handling:</strong> WAHA exposes <code>sendImage</code>, <code>sendFile</code>, and <code>sendVoice</code> endpoints (a <code>sendImage</code> call is sketched after this list). Teach the bot to accept photos for support tickets, or send invoices as PDFs directly inside the chat.</p>
</li>
</ul>
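<p>For reference, the <code>sendImage</code> call follows the same shape as <code>sendText</code>. This is a sketch; double-check the exact body schema in the WAHA docs before relying on it:</p>
<pre><code class="language-bash">curl -X POST http://waha:3000/api/sendImage \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "session": "default",
    "chatId": "15555550123@c.us",
    "file": { "url": "https://example.com/invoice.png" },
    "caption": "Your invoice"
  }'
</code></pre>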
<p>The WhatsApp layer stays the same. Everything interesting happens upstream in the workflow.</p>
<p><em>If you want to see production examples of n8n and WAHA running at scale — or you need a similar automation built for your business — I'm the founder of Achiya Automation, where we ship WhatsApp, n8n, and Chatwoot integrations. You can find more at</em> <a href="https://achiya-automation.com"><em>achiya-automation.com</em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Dockerize a Go Application – Full Step-by-Step Walkthrough ]]>
                </title>
                <description>
                    <![CDATA[ Imagine that you want to share your source code with someone who doesn’t have Go installed on their computer. Unfortunately, this person won’t be able to run your application. Even if they do have Go  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-dockerize-a-go-application-full-step-by-step-walkthrough/</link>
                <guid isPermaLink="false">69f248846e0124c05e445b7a</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker compose ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Njong Emy ]]>
                </dc:creator>
                <pubDate>Wed, 29 Apr 2026 18:05:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e49dda12-fd5e-4474-aa18-b72624640bf3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Imagine that you want to share your source code with someone who doesn’t have Go installed on their computer. Unfortunately, this person won’t be able to run your application. Even if they do have Go installed, application behaviour may differ because your local development environment is different from theirs.</p>
<p>So how do you bundle up your application so that it can run the same way in every local environment? That’s where Docker comes in.</p>
<p>For beginners, Docker isn't always a very easy concept to grasp. But once you get it, I promise that it’s very interesting. So interesting that you’ll want to dockerize every application you lay your hands on.</p>
<p>For this article, a Go application will be our case study. The fundamental concept of containerization as explained here is transferable, so don’t worry too much about what dockerizing an application in another language will look like.</p>
<p>We’ll go through the basics of dockerizing a Go app with just Docker, images and containers, setting up multiple containers in one application with Docker Compose, and the constituents of a Docker Compose file.</p>
<p>By the end of this article, you'll have a basic understanding of what Docker is, what an image or container is, and how to orchestrate multiple, dependent containers with Docker Compose.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ol>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-docker">What is Docker</a>?</p>
</li>
<li><p><a href="#heading-how-to-install-docker">How to Install Docker</a></p>
</li>
<li><p><a href="#heading-what-is-a-dockerfile">What is a Dockerfile</a>?</p>
</li>
<li><p><a href="#heading-what-is-docker-compose">What is Docker Compose</a>?</p>
</li>
<li><p><a href="#heading-the-app-container">The app Container</a></p>
</li>
<li><p><a href="#heading-the-database-container">The database Container</a></p>
</li>
<li><p><a href="#heading-the-phpmyadmin-container">The phpMyAdmin Container</a></p>
</li>
<li><p><a href="#heading-running-everything-together">Running Everything Together</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You don't need any prior knowledge of Docker to follow this tutorial. This article is written with a beginner POV in mind, so it's okay if the concept is new to you.</p>
<p>In order to be fully engaged and understand the Go coding examples used here, it'll be helpful if you have basic knowledge of Golang. If you already understand how to set up a Go application on your local computer, you're good to go. If not, you can check this article on <a href="https://www.freecodecamp.org/news/how-to-get-started-coding-in-golang/">how to get started coding in Go</a>.</p>
<h2 id="heading-what-is-docker">What is Docker?</h2>
<p>Imagine that you have a box. In that box, you put your code and everything that it needs to run. That is, the programming language it uses and any other external packages you need to install.</p>
<p>If someone needs your application, you can just hand them the box. You can also hand this box to as many people as you want. They don’t need to install the language or any other thing on their computer because everything they need is already inside the box. So, when they run the application, what they're actually doing is running an instance of that box.</p>
<p>The app is running within the box which is the standard environment. This means for everyone who got the box and “opened it”, the application is going to run the exact same way.</p>
<p>With the help of Docker, apps can run under the same conditions across different systems, and you avoid the problem of “it works on my machine”.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/3b2b169d-d882-48a8-88bf-233e4acec611.png" alt="A box containing dependencies, runtime, and source code that has arrows pointing to multiple developers" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In technical Docker terms, this box is called an <strong>image</strong> and the running instance is called a <strong>container</strong>.</p>
<p>An image is a lightweight, standalone, executable package that includes everything needed to run a piece of software. That is, code, runtime, libraries, system tools, and even the operating system.</p>
<p>A container is simply a runnable instance of an image. This represents the execution environment for a specific application.</p>
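<p>Once you have Docker installed (next section), you can see both concepts on the command line:</p>
<pre><code class="language-bash">docker images   # the boxes you've built or pulled
docker ps       # the running instances of those boxes
</code></pre>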
<p>If all this seems too abstract, don’t worry. We’ll get our hands dirty in a little bit.</p>
<h2 id="heading-how-to-install-docker">How to Install Docker</h2>
<p>In order to install Docker, we're going to install Docker Desktop, which comes bundled with the Docker Engine. Docker Desktop is a GUI for managing containers, and you'll see how useful it is in subsequent sections.</p>
<p>At the time of writing, I'm using WSL (Windows Subsystem for Linux). If you're doing the same, you'll need to take that into consideration before installing, because Docker requires different installation prerequisites and steps for different operating systems.</p>
<p>To install Docker Desktop on WSL,</p>
<ol>
<li><p>Download and install the <a href="https://desktop.docker.com/win/main/amd64/Docker%20Desktop%20Installer.exe">Windows</a> <code>.exe</code> file</p>
</li>
<li><p>Start Docker Desktop from the Start Menu and navigate to settings</p>
</li>
<li><p>Select <strong>Use WSL 2 based engine</strong> from the <strong>General</strong> tab</p>
</li>
<li><p>Click on apply.</p>
</li>
</ol>
<p>That’s it for the WSL installation. If you are running another operating system, the <a href="https://docs.docker.com/get-started/introduction/get-docker-desktop/">official docs</a> have a list of installation options for you.</p>
<h2 id="heading-what-is-a-dockerfile">What is a Dockerfile?</h2>
<p>In order to build your box in the first place, Docker needs to follow a couple of outlined steps. It needs to know the dependencies and the runtime, and it also needs to have the source code. We list all of these steps in a Dockerfile.</p>
<p>Before we get down to cracking anything, let’s create a working directory and navigate into it.</p>
<pre><code class="language-bash">mkdir go_book_api &amp;&amp; cd go_book_api
</code></pre>
<p>To initialise the Go module in your application, run the following command:</p>
<pre><code class="language-bash">go mod init go_book_api
</code></pre>
<p>This creates a <code>go.mod</code> file to keep track of your project dependencies. In the root of the project, create a <code>cmd</code> directory, and a <code>main.go</code> file in it. This will serve as the entry point of your application. In the <code>main.go</code> file, you can have a simple print statement:</p>
<pre><code class="language-go">// cmd/main.go
package main

import "fmt"

func main() {
	fmt.Println("Look at me gooo!")
}
</code></pre>
<p>Now, go ahead and create a file in the root of your project and call it <code>Dockerfile</code>. This file has no extension, but Docker (and most editors) will recognize it by name as a file of Docker instructions.</p>
<p>Go ahead and paste the following in that file, and then we'll go through each of them one by one:</p>
<pre><code class="language-bash"># base image
FROM golang:1.24

# define the working directory
WORKDIR /app

# copy go.mod (and go.sum, once the project has dependencies) so the
# modules to install are known in the container. ./ here is the WORKDIR, /app
COPY go.mod ./

# command to install modules
RUN go mod download

# copy source code into working dir
COPY . .

# build
RUN CGO_ENABLED=0 GOOS=linux go build -o /docker-gs-ping ./cmd/main.go

# run the compiled binary when the container starts
CMD ["/docker-gs-ping"]
</code></pre>
<p>Most Dockerfiles begin with a base image, which is specified by the <code>FROM</code> keyword. A base image is a foundational template that provides minimal operating system environment, libraries, or dependencies required to build and run an application within a container.</p>
<p>In this case, your base image is <code>golang:1.24</code>. Your base image could have been an operating system like Linux. In that case, when you ship your code to someone who isn’t running a Linux operating system, they wouldn’t have to worry, because they will be running the application in an environment that already has a minimal Linux OS. In the same light, someone who doesn’t have Go installed locally can run your application.</p>
<p>To figure out what base image to use when setting up your Dockerfile, you can always peruse the official Docker Hub repository for published images. For this case, you can check out base images that are officially published by Golang <a href="https://hub.docker.com/hardened-images/catalog/dhi/golang/images">here</a>.</p>
<p>The next step is to define a working directory. Inside your box, you have a filesystem that is almost identical to the ones you’d see on a Linux system. You have folders like <code>/app</code>, <code>/bin</code>, <code>/usr</code>, <code>/var</code>, and so on. The working directory you've defined in this case is <code>/app</code>, and it's done with the <code>WORKDIR</code> command.</p>
<p>After setting a working directory, you want to copy the <code>go.mod</code> file (and <code>go.sum</code>, once it exists) into it, so that Docker knows what dependencies to add into your box.</p>
<p>The <code>COPY</code> command in Docker takes at least two arguments: the source directory(ies), and then the destination directory. In this case, you want to copy <code>go.mod</code> and <code>go.sum</code> into the working directory of your box, <code>/app</code>.</p>
<p>In the box, you'll run a command that downloads and installs all the modules defined in the <code>go.mod</code> file. To run a command in the Docker environment, use <code>RUN</code> and then the command, which is <code>go mod download</code> in this case.</p>
<p>The next step is to copy any source code you have into the working directory.</p>
<p>At this point, you have the dependencies and the source code. The last step is to build the Go application into a single executable file which can be run inside your environment (inside the container).</p>
<p>Within the container, you’ll have a compiled binary at <code>/docker-gs-ping</code>, which is the result of compiling the code in your <code>main.go</code> file. The last step is a <code>CMD</code> instruction that tells Docker to run the executable binary when the container starts. It’s a way of saying “once the container starts running, execute this binary file”.</p>
<p>With these steps, Docker will build an image (a box per our analogy) that you can run. To build the image, you can run this command in your terminal:</p>
<pre><code class="language-go">docker build -t go_book_api .
</code></pre>
<p>The <code>docker build</code> command tells Docker to build an image based on the steps in the Dockerfile. <code>-t</code> is the flag for a tag, and this helps you refer to the image later when running the container.</p>
<p>To accompany your tag, you'll provide a name to the image which is <code>go_book_api</code> in this case. The <code>.</code> at the end is important because it tells Docker where the Dockerfile in question is, and the files that you need to copy into your image.</p>
<p>This is what the building looks like in my IDE:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/361a805e-153d-4034-9d9a-d34c9015738a.png" alt="screenshot of IDE terminal showing a Docker image being built" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you check the Images tab in Docker Desktop, you'll see that an image has been built:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/b569277e-295b-4a3d-8e51-fb91dd7e3d91.png" alt="screenshot of a built container image on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can host this image on a public image repository platform like <a href="https://www.docker.com/products/docker-hub/">Docker Hub</a>, and share it with your friends. They can pull your image, set it up, and run your application even if they don’t have Go installed. All they need to do is get the container running.</p>
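<p>That flow looks roughly like this (<code>yourname</code> is a placeholder for a Docker Hub username):</p>
<pre><code class="language-bash"># you: log in, tag, and push the image
docker login
docker tag go_book_api yourname/go_book_api:latest
docker push yourname/go_book_api:latest

# your friend: pull and run it, no Go installation needed
docker run --rm yourname/go_book_api:latest
</code></pre>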
<p>If you click on the little play button to the far-right, you can spin up an instance of the image (a container).</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/09726294-be22-458d-b660-5f6d32102205.png" alt="screenshot of Docker Compose modal for running a new container" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can give a descriptive name to the container (Docker will generate a random one if you don’t), and click on the Run button. Once the container starts running, you're redirected to its log page.</p>
<p>Your container is up and running! You can see that this is a running instance of your application.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/3133c16c-0950-4f03-9502-ae6495535c13.png" alt="screenshot of a running docker container on Docker Compose" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-what-is-docker-compose">What is Docker Compose?</h2>
<p>If you were building a simple Go application that needed no external dependencies, the above set-up would be more than sufficient.</p>
<p>In our example here, the application is supposed to be a book API, so you’d expect that we'd have some service like a database, and a database administration client like phpMyAdmin to visualize our tables.</p>
<p>Setting all of this up would be complicated with just a Dockerfile, because a single Dockerfile produces a single image: you can't cleanly define your Go app, a database, and phpMyAdmin as separate services in one file.</p>
<p>You could start from a small operating-system base image and run commands to manually install these other services as dependencies, but that makes your application hard to maintain and scale: if one dependency crashes, the whole container goes down with it.</p>
<p>To remedy this, Docker Compose allows you to have multiple containers for your application that are connected together. Docker Compose handles starting the containers in the right order, lets one container share a folder with another, lets containers keep their data in named volumes, and so on.</p>
<p>Our previous analogy of boxes is the same, except with Docker Compose, we don’t necessarily have only one box anymore:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/2c890de4-8d5d-4457-a27a-fc441f58d794.png" alt="image of a box containing multiple containers that have arrows pointing to different developers" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The point of Docker Compose is to help you orchestrate multiple images needed to run your application. You can think of it as connecting several boxes together.</p>
<p>Following the explanation from before, your application will run in the <code>app</code> container, the book data you create with your application will be stored in the <code>database</code> container, which runs MySQL, and you can visualize your database with phpMyAdmin, which runs in the <code>phpmyadmin</code> container.</p>
<p>To see this technically, create a <code>docker-compose.yml</code> file in the root of the project. The name of this file is important: Docker Compose only accepts filenames such as <code>compose.yml</code>, <code>docker-compose.yml</code>, or <code>docker-compose.yaml</code>. The file extension hints that the configuration is written in YAML, a language mostly used for configuration files.</p>
<pre><code class="language-bash">services:
  app:
    depends_on:
      - database
    build: 
      context: .
    container_name: go_book_api
    hostname: go_book_api
    networks:
      - go_book_api_net
    ports:
      - 8080:8080
    env_file:
      - .env
    
  database:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: ${DB_NAME}
      MYSQL_PASSWORD: ${DB_PASSWORD}
      MYSQL_USER: ${DB_USER}
    volumes:
      - mysql-go:/var/lib/mysql
    ports:
      - 3356:3306
    networks:
      - go_book_api_net

  phpmyadmin:
    image: phpmyadmin
    restart: always
    ports:
      - 9000:80
    environment:
      PMA_HOST: database
      PMA_ARBITRARY: 1
    depends_on:
      - database
    networks:
      - go_book_api_net

volumes:
  mysql-go:

networks:
  go_book_api_net:
    driver: bridge
</code></pre>
<p>At the root level of the docker-compose file, you have <code>services</code>. These are all the containers that your application needs to run, and in the context of Docker Compose, each one is regarded as a service.</p>
<h3 id="heading-the-app-container">The <code>app</code> Container</h3>
<pre><code class="language-bash"> app:
    depends_on:
      - database
    build: 
      context: .
    container_name: go_book_api
    hostname: go_book_api
    networks:
      - go_book_api_net
    ports:
      - 8080:8080
    env_file:
      - .env
</code></pre>
<p>The very first container is the <code>app</code> container, which is your Go application. Under the <code>app</code> container, you'll need to define a few parameters that this container also needs to run.</p>
<p>The <code>depends_on</code> attribute controls the start-up and shut-down order of the services in the file. It ensures that if container A depends on container B, container B is started first so that container A can use it. In this case, the <code>database</code> container must be started before the <code>app</code> container. Note that this doesn't mean <code>app</code> will wait for the <code>database</code> to be ready; it only waits for it to be started.</p>
<p>The next attribute, <code>build</code>, tells Docker Compose to build the Docker image from the local project. Since the Dockerfile for your application is in the root of your app, you'll specify the root path with the <code>context</code> attribute as <code>.</code>.</p>
<p>To give a specific name to your container, you'll use <code>container_name</code>. <code>hostname</code> is what other containers will use for communication.</p>
<p>Recall that the point of Docker Compose is to have multiple containers communicating with each other. They do this with the help of networks. So you'll create another attribute, <code>networks</code>, and give it a name, <code>go_book_api_net</code> . To every other container that you want to associate with this <code>app</code>, you're going to specify the same network.</p>
<p>The next attribute is <code>ports</code> . Your application is an API, which means it's running on a backend Go server. To access the API, you'll need to map a local port to a port on the container. You're mapping port <code>8080</code> on your computer to port <code>8080</code> in the container.</p>
<p>The <code>env_file</code> attribute just tells Docker Compose where to read environment variables from. In this case, you can create a <code>.env</code> file in the root of your project to store important variables that your container will need.</p>
<h3 id="heading-the-database-container">The <code>database</code> Container</h3>
<pre><code class="language-bash">  database:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: ${DB_NAME}
      MYSQL_PASSWORD: ${DB_PASSWORD}
      MYSQL_USER: ${DB_USER}
    volumes:
      - mysql-go:/var/lib/mysql
    ports:
      - 3356:3306
    networks:
      - go_book_api_net
</code></pre>
<p>The second container is the <code>database</code> container. Note that you can give whatever name you choose to your listed services, but giving your containers descriptive names is always a good convention to follow.</p>
<p>For your Go application database, you'll be working with a MySQL database in this case. Your application needs MySQL to run, so you must set it up as one of the services.</p>
<p>Remember that to build a container, you need a base image. Your base image in this case is <code>mysql:8.0</code>, as specified with the <code>image</code> property above. When setting up this container, Docker Compose knows to pull this existing official image and create your database container from it.</p>
<p>If you’ve set up a database locally before, you know that configuration is a step you can’t skip. Every database you create needs a user, a password, and the database name. You can set these variables up in the <code>environment</code> property. Instead of hardcoding these values, you can set them up in a <code>.env</code> file, and reference the environmental variables as you've done here.</p>
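<p>The matching <code>.env</code> file could look like this (placeholder values; use real secrets and keep the file out of version control):</p>
<pre><code class="language-bash"># .env
DB_ROOT_PASSWORD=supersecretroot
DB_NAME=go_book_api
DB_USER=books_user
DB_PASSWORD=supersecretuser
</code></pre>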
<p>Database servers usually listen on specific ports for incoming connections, whether the database is running locally or remotely. Just as you specified for your <code>app</code> container, you can set a port for your database and map it to a corresponding port in the container. If you want to access the database locally, you'd do that on port <code>3356</code>, and all requests are forwarded to port <code>3306</code> in the database container.</p>
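<p>For example, with a MySQL client installed on your host machine, you could connect through the mapped port (using the sample <code>.env</code> values above):</p>
<pre><code class="language-bash"># Port 3356 on the host forwards to 3306 inside the container.
mysql -h 127.0.0.1 -P 3356 -u books_user -p go_book_api
</code></pre>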
<p>Once your containers are running and your application starts creating and storing data in the database, you’ll realise that every time you stop and then restart your containers, you lose the data stored in the database.</p>
<p>To avoid this, you'll need to store your data outside the container. That way, you won't lose the contents of your database every time you stop running your containers.</p>
<p>This is what volumes are for. You can allocate a specific location outside the database container to store all that content. For your <code>volume</code> in this case, the storage location you specified is <code>mysql-go:/var/lib/mysql</code> .</p>
<p>Just as you set the network in your <code>app</code> container above to <code>go_book_api_net</code>, you'll specify the same network for this database container. Since you want the containers to communicate with each other, it makes sense that they're within the same network.</p>
<h3 id="heading-the-phpmyadmin-container">The <code>phpMyAdmin</code> Container</h3>
<p>The last service you need to configure (though it's optional) is the phpMyAdmin container. I find it easier to have a database client because it lets me easily see the structure and content of my database.</p>
<pre><code class="language-bash"> phpmyadmin:
    image: phpmyadmin
    restart: always
    ports:
      - 9000:80
    environment:
      PMA_HOST: database
      PMA_ARBITRARY: 1
    depends_on:
      - database
    networks:
      - go_book_api_net
</code></pre>
<p>The process is almost the same as for the previous containers you've configured. You'll start by pulling the official <code>phpmyadmin</code> image from Docker Hub so that your container is built on it.</p>
<p>The <code>restart: always</code> option ensures that if the container stops or crashes, Docker automatically starts it again, so phpMyAdmin comes back up on its own.</p>
<p>On the host machine, which is your local environment, you can have access to this service via port <code>9000</code> and it maps to port <code>80</code> in the container.</p>
<p>As for the <code>environment</code>, <code>PMA_HOST</code> tells phpMyAdmin to connect to a host called <code>database</code> (which is your database container). This works because both containers are on the same network, as you can see in the <code>networks</code> attribute. <code>PMA_ARBITRARY</code> is used so that if you decide to connect to another host (say, you set up another database in future and still wish to connect via phpMyAdmin), you can do that via the UI.</p>
<p>Your database client depends on the <code>database</code> container, and so you need to specify that in <code>depends_on</code>:</p>
<pre><code class="language-bash">volumes:
  mysql-go:

networks:
  go_book_api_net:
    driver: bridge
</code></pre>
<p>The final section of your Docker Compose file is where you declare named values for the volume and network you've used in setting up your containers.</p>
<p>For the <code>volumes</code>, you declare a value called <code>mysql-go</code>. In the container where you want to attach this volume, you assign it a specific storage location. You can see this in use in the database container:</p>
<pre><code class="language-bash"> volumes:
      - mysql-go:/var/lib/mysql
</code></pre>
<p>The same concept follows for the network. You have a named network called <code>go_book_api_net</code> that every container within this same network can use. The <code>driver</code> option specifies the network type, and <code>bridge</code> is the standard driver for private internal networks on a single host.</p>
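<p>If you ever want to confirm the network exists and see which containers have joined it, you can inspect it from the terminal. Note that Docker Compose usually prefixes the network name with your project directory's name, so check <code>docker network ls</code> first:</p>
<pre><code class="language-bash"># list networks; look for the one ending in go_book_api_net
docker network ls

# inspect it to see the attached containers (replace &lt;project&gt; with the actual prefix)
docker network inspect &lt;project&gt;_go_book_api_net
</code></pre>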
<h3 id="heading-running-everything-together">Running Everything Together</h3>
<p>Before Docker Compose, you had one Dockerfile that built a single container for your Go application. With Docker Compose, you're building three containers (your application container, the database, and phpMyAdmin) and orchestrating them to work together as one single application.</p>
<p>You can push all this to a platform like GitHub, and someone can clone, start, and run the application without having any of these services (MySQL or phpMyAdmin) installed locally on their computer. But they do need to have Docker installed.</p>
<p>To build your containers all together, you can use the command <code>docker compose build</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/0040fbdc-c541-494f-af9b-664d6a00bc17.png" alt="screenshot of IDE terminal showing build for an image" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you check Docker Desktop again, you'll see that a new image has been built, and it corresponds to the <code>app</code> service:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/736be9be-feb1-4888-8d15-c818e4683f4b.png" alt="screenshot of a built image on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>To start running the containers, you can use the command <code>docker compose up</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/8ba14bb9-77d5-48a1-b574-54a848f54b1e.png" alt="a screenshot of running containers in terminal IDE" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you navigate to the Containers tab of Docker Desktop, you can see that your containers are up and running:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/82e3d54d-bfec-4cea-806a-c52846a3e077.png" alt="A screenshot of running containers on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The main app service, <code>go_book_api</code>, isn’t running because when you run your image, your binary runs and exits almost immediately.</p>
<p>In your <code>main.go</code>, let’s rewrite the code to set up a minimal HTTP handler function that listens on port <code>8080</code>:</p>
<pre><code class="language-go">// cmd/main.go
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})

	log.Println("listening on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}
</code></pre>
<p>If you’re new to Go, don’t let the code above bother you too much. All it does is set up a <code>/health</code> endpoint with an associated handler function, listen on a port (<code>8080</code> in this case), and respond with “ok”.</p>
<p>In your <code>Dockerfile</code>, let’s add a command to execute the created binary when the container starts:</p>
<pre><code class="language-go"># run the compiled binary when the container starts
CMD ["/docker-gs-ping"]
</code></pre>
<p>After adding this, you'll need to rebuild the containers and start them again.</p>
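<p>One way to do both from the project root (<code>docker compose up -d --build</code> collapses it into a single step):</p>
<pre><code class="language-bash"># rebuild the images, then start the containers in the background
docker compose build
docker compose up -d
</code></pre>
<p>Once the rebuild finishes, you can see that all containers are running now:</p>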
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/3ddf3e15-87b8-4978-851f-d6179e323166.png" alt="A screenshot of running containers on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you click on the <code>go_book_api</code> container, you can see that your server is running on port <code>8080</code> as configured:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/ddd07614-eb53-4bfc-b088-e824f651ef6c.png" alt="A screenshot of a running container on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Since your app is running on port <code>8080</code> and you have a <code>/health</code> endpoint set up for it, you can actually visit that endpoint in a browser to see the output “ok”.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/39a1ea3e-7cbf-4d46-9bbe-bf8053d48586.png" alt="an image of health endpoint showing ok response on the browser" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Also, if you click on the exposed <code>phpmyadmin</code> port, you can access the database client locally on port <code>9000</code>. Based on the environment variables set up in the <code>.env</code> file, you can log in.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/8d7de244-7268-4d17-a779-785feae389c4.png" alt="screenshot of browser with phpMyAdmin login form" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Another interesting thing to look at on Docker Desktop is volumes. There's a Volumes tab where you can see your configured <code>mysql-go</code> volume.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/66d1dde3-2fc1-48aa-b701-7504dba2007f.png" alt="a screenshot of the volumes tab on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can always open these volumes and containers in the Docker Desktop GUI, go through the files and logs, experiment with stopping one container to see how the others respond, and so on.</p>
<p>After this entire setup, what do you notice? You didn’t have to install Go, MySQL, or phpMyAdmin locally. You only used officially published base images to orchestrate a full application. That's the magic of Docker.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Docker can be very abstract at the beginning, but understanding the fundamental purpose behind it makes everything much clearer.</p>
<p>In this article, you've learned what Docker is, how to containerize a basic Go application, and how to manage multiple containers with Docker Compose.</p>
<p>If you have trouble wrapping your head around why the Dockerfile is set up in the order that it is, my advice is not to get too stuck figuring it out on your own. As a Docker beginner, I realised it's easier if you imagine it as writing a recipe: if you try to build an image and it fails, you know there's a step you're skipping.</p>
<p>The <a href="https://www.docker.com/">official docker documentation</a> has amazing resources if you want to understand Docker further than this tutorial. I encourage you to do so because this article only scratches the surface of the amazing things you can achieve with containerization.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Trace Multi-Agent AI Swarms with Jaeger v2 ]]>
                </title>
                <description>
                    <![CDATA[ When you run a single AI agent, debugging is straightforward. You read the log, you see what happened. When you run five agents in a swarm, each spawning its own tool calls and producing its own outpu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/multi-agent-ai-swarms-tracing/</link>
                <guid isPermaLink="false">69eaae45904b915438cefb47</guid>
                
                    <category>
                        <![CDATA[ jaeger ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed tracing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ multi-agent systems ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Christopher Galliart ]]>
                </dc:creator>
                <pubDate>Thu, 23 Apr 2026 23:41:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/308710e6-cfe6-4007-887a-c49a5e2e6b9a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When you run a single AI agent, debugging is straightforward. You read the log, you see what happened.</p>
<p>When you run five agents in a swarm, each spawning its own tool calls and producing its own output, "read the log" stops being a strategy.</p>
<p>I built <a href="https://github.com/HatmanStack/claude-forge">Claude Forge</a> as an adversarial multi-agent coding framework on top of Claude Code. A typical run spawns a planner, an implementer, a reviewer, and a fixer. They evaluate each other's work and loop back when quality checks fail.</p>
<p>But when something went wrong, I had timestamps and text dumps but no way to see which agent was responsible, how long it actually took, or where the tokens went.</p>
<p>Jaeger fixed that. This article covers setting up Jaeger v2 with Docker, wiring it into a multi-agent system through OpenTelemetry, and what I learned along the way.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-distributed-tracing">What Is Distributed Tracing?</a></p>
</li>
<li><p><a href="#heading-why-jaeger-v2">Why Jaeger v2?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-installing-docker-on-debian">Installing Docker on Debian</a></p>
</li>
<li><p><a href="#heading-setting-up-jaeger-v2">Setting Up Jaeger v2</a></p>
</li>
<li><p><a href="#heading-setting-up-claude-forge-tracing">Setting Up Claude Forge Tracing</a></p>
</li>
<li><p><a href="#heading-understanding-the-span-model">Understanding the Span Model</a></p>
</li>
<li><p><a href="#heading-instrumenting-a-multi-agent-swarm">Instrumenting a Multi-Agent Swarm</a></p>
</li>
<li><p><a href="#heading-viewing-traces-in-the-jaeger-ui">Viewing Traces in the Jaeger UI</a></p>
</li>
<li><p><a href="#heading-lessons-from-the-trenches">Lessons from the Trenches</a></p>
</li>
<li><p><a href="#heading-environment-variable-reference">Environment Variable Reference</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-is-distributed-tracing">What Is Distributed Tracing?</h2>
<p>Distributed tracing tracks a single operation as it moves through multiple services. A span is one unit of work with a start time, end time, and key-value attributes. Spans nest into parent-child trees. One tree per operation is one trace.</p>
<p>Microservices people already know this pattern: follow an HTTP request from the gateway through auth, the database, and the cache. Same idea works for multi-agent AI. Follow one swarm invocation from the orchestrator through each subagent and its tool calls.</p>
<p>OpenTelemetry (OTel) is the standard. It gives you SDKs for creating spans and shipping them over OTLP. Jaeger receives that data and renders it as a searchable timeline.</p>
<h2 id="heading-why-jaeger-v2">Why Jaeger v2?</h2>
<p>Jaeger started at Uber and graduated as a CNCF project in 2019. v1 hit end of life in December 2025. v2 is the current release, built on the OpenTelemetry Collector framework. It ships as a single binary containing the collector, query service, and UI, and it speaks OTLP natively on ports 4317 (gRPC) and 4318 (HTTP). There's no separate collector needed for local work.</p>
<p>One important difference from v1: configuration moved from CLI flags and environment variables to a YAML file. The old <code>-e SPAN_STORAGE_TYPE=badger</code> env vars are silently ignored in v2. The container starts fine but falls back to in-memory storage. I lost two days of traces before noticing. More on the correct setup below.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><strong>Docker</strong> installed and running.</p>
</li>
<li><p><strong>Claude Code</strong> installed.</p>
</li>
<li><p><strong>Python 3.8+</strong> for the tracing hook.</p>
</li>
<li><p><strong>Claude Forge</strong> or another multi-agent system to instrument.</p>
</li>
</ul>
<h2 id="heading-installing-docker-on-debian">Installing Docker on Debian</h2>
<p>Skip this if you already have Docker. macOS and Windows users can use Docker Desktop. On Debian:</p>
<pre><code class="language-bash">sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/debian \
  $(. /etc/os-release &amp;&amp; echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
</code></pre>
<p>Ubuntu users: replace both <code>linux/debian</code> URLs with <code>linux/ubuntu</code>.</p>
<h2 id="heading-setting-up-jaeger-v2">Setting Up Jaeger v2</h2>
<h3 id="heading-basic-run">Basic Run</h3>
<p>For quick testing with no persistence:</p>
<pre><code class="language-bash">docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0
</code></pre>
<p>Port 16686 is the UI. Port 4317 is OTLP/gRPC ingestion. Port 4318 is OTLP/HTTP. Remove the container and your traces are gone.</p>
<h3 id="heading-persistent-storage-with-badger">Persistent Storage with Badger</h3>
<p>v2 reads configuration from a YAML file, not environment variables. Save this as <code>~/.local/share/jaeger/config.yaml</code>:</p>
<pre><code class="language-yaml">service:
  extensions: [jaeger_storage, jaeger_query, healthcheckv2]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
extensions:
  healthcheckv2:
    use_v2: true
    http: { endpoint: 0.0.0.0:13133 }
  jaeger_query:
    storage: { traces: main_store }
  jaeger_storage:
    backends:
      main_store:
        badger:
          directories: { keys: /badger/key, values: /badger/data }
          ephemeral: false
          ttl: { spans: 720h }
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch:
exporters:
  jaeger_storage_exporter:
    trace_storage: main_store
</code></pre>
<p>The Jaeger container runs as UID 10001. Docker named volumes default to root ownership. Without fixing permissions first, the container crash-loops with <code>mkdir /badger/key: permission denied</code>.</p>
<p>Pre-create the volume and fix ownership:</p>
<pre><code class="language-bash">docker volume create jaeger-data

docker run --rm \
  -v jaeger-data:/badger \
  alpine sh -c "mkdir -p /badger/data /badger/key &amp;&amp; chown -R 10001:10001 /badger"
</code></pre>
<p>Then run Jaeger with the config mounted in:</p>
<pre><code class="language-bash">docker run -d --name jaeger \
  --restart unless-stopped \
  -v ~/.local/share/jaeger/config.yaml:/etc/jaeger/config.yaml:ro \
  -v jaeger-data:/badger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0 \
  --config /etc/jaeger/config.yaml
</code></pre>
<p>Verify persistence by running <code>docker restart jaeger</code> and confirming a previously recorded trace is still there. Hit <code>http://localhost:16686</code> and you should see the UI.</p>
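<p>As a quick sanity check:</p>
<pre><code class="language-bash"># record at least one trace, then bounce the container
docker restart jaeger

# the UI should come back with the earlier trace still searchable
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:16686
# -&gt; 200
</code></pre>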
<h2 id="heading-setting-up-claude-forge-tracing">Setting Up Claude Forge Tracing</h2>
<h3 id="heading-installing-claude-forge">Installing Claude Forge</h3>
<p>Install it through the Claude Code plugin marketplace:</p>
<pre><code class="language-bash">/plugin marketplace add hatmanstack/claude-forge
/plugin install forge@claude-forge
/reload-plugins
</code></pre>
<p>The install opens a TUI to confirm scope and settings. After reload, commands use the <code>forge:</code> prefix (for example, <code>/forge:pipeline</code>).</p>
<p>You can also clone the repo from <a href="https://github.com/HatmanStack/claude-forge">GitHub</a>.</p>
<h3 id="heading-installing-the-tracing-hook">Installing the Tracing Hook</h3>
<p>From your target project directory, run the install script. For plugin installs:</p>
<pre><code class="language-bash">cd your-project
forge-trace                # if you set up the alias from the README
# or, without the alias:
bash "$(find ~/.claude -path '*/forge*' -name install-tracing.sh 2&gt;/dev/null | head -1)"
</code></pre>
<p>For clone installs:</p>
<pre><code class="language-bash">cd your-project
bash /path/to/claude-forge/bin/install-tracing.sh
</code></pre>
<p>The script builds a dedicated venv at <code>~/.local/share/claude-forge/venv</code> (prefers <code>uv</code>, falls back to <code>python3 -m venv</code>), installs the OpenTelemetry packages, copies the hook into place, merges hook entries into <code>.claude/settings.local.json</code>, and self-tests against the OTLP endpoint.</p>
<p>Pass <code>--no-settings</code> to skip the settings merge, or <code>--uninstall</code> to tear everything down.</p>
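<p>For example, with a clone install:</p>
<pre><code class="language-bash">bash /path/to/claude-forge/bin/install-tracing.sh --no-settings   # install, but skip the settings merge
bash /path/to/claude-forge/bin/install-tracing.sh --uninstall     # remove the venv, hook, and settings entries
</code></pre>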
<h3 id="heading-opting-in">Opting In</h3>
<p>Add to your shell init and restart your terminal:</p>
<pre><code class="language-bash">export CLAUDE_FORGE_TRACING=1
</code></pre>
<p>Restart Claude Code, run <code>/forge:pipeline</code>, then check <code>http://localhost:16686</code> for the <code>claude-forge</code> service.</p>
<h2 id="heading-understanding-the-span-model">Understanding the Span Model</h2>
<p>Here's what the hierarchy looks like for a typical swarm run:</p>
<pre><code class="language-plaintext">session: "implement login form with OAuth"        &lt;- root span
├── subagent:planner
│   ├── tool:Write  (Phase-0.md)                  &lt;- mutation spans (on by default)
│   ├── tool:Write  (Phase-1.md)
│   └── subagent_result:planner                   &lt;- duration, token counts, output
├── subagent:implementer
│   ├── tool:Edit   (src/auth.ts)
│   ├── tool:Bash   (npm test)
│   ├── tool:Write  (src/oauth.ts)
│   └── subagent_result:implementer
├── subagent:reviewer
│   └── subagent_result:reviewer
└── session_complete                              &lt;- session totals
</code></pre>
<p>The root span's name comes from the first line of your prompt. Find traces by what you asked for, not by a UUID.</p>
<p>Subagents get an anchor span on start and a result span on completion. The result carries duration, token counts, prompt, and output.</p>
<h3 id="heading-three-tiers-of-detail">Three Tiers of Detail</h3>
<p>Not all inner tool calls are equally interesting. Write, Edit, MultiEdit, and Bash are mutational: small in number, high signal. They tell you what actually changed. Read, Glob, Grep, and WebFetch are navigation: lots of them, mostly noise.</p>
<p>Tracing captures mutations by default. That middle ground turned out to be the right one. Before this change, you either saw nothing inside subagents or you saw 200+ spans per run.</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Subagents</th>
<th>Mutations (Write/Edit/Bash)</th>
<th>Other inner tools</th>
</tr>
</thead>
<tbody><tr>
<td>Default</td>
<td>yes</td>
<td>yes</td>
<td>no</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_INNER=1</code></td>
<td>yes</td>
<td>yes</td>
<td>yes (minus blocklist)</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_MUTATIONS=0</code></td>
<td>yes</td>
<td>no</td>
<td>no (or per INNER)</td>
</tr>
</tbody></table>
<h3 id="heading-span-attributes">Span Attributes</h3>
<p><strong>On</strong> <code>session_complete</code><strong>:</strong> <code>session.tokens.input</code>, <code>session.tokens.output</code>, <code>session.tokens.total</code>, <code>session.tokens.turns</code>, <code>session.duration_ms</code>, <code>user.prompt</code> (first 2KB).</p>
<p><strong>On</strong> <code>subagent_result</code><strong>:</strong> <code>agent.description</code>, <code>agent.prompt</code>, <code>agent.output</code>, <code>agent.duration_ms</code>, <code>agent.is_error</code>, <code>agent.tokens.input</code>, <code>agent.tokens.output</code>.</p>
<p><strong>On</strong> <code>tool:*</code><strong>:</strong> <code>tool.name</code>, <code>tool.input</code>, <code>tool.output</code>, <code>tool.duration_ms</code>, <code>tool.is_error</code>.</p>
<h2 id="heading-instrumenting-a-multi-agent-swarm">Instrumenting a Multi-Agent Swarm</h2>
<h3 id="heading-hook-architecture">Hook Architecture</h3>
<p>Claude Code has lifecycle hooks that fire scripts on specific events. Four matter here:</p>
<ol>
<li><p><strong>UserPromptSubmit</strong> (create the root span),</p>
</li>
<li><p><strong>PreToolUse</strong> (start a span),</p>
</li>
<li><p><strong>PostToolUse</strong> (end it with results), and</p>
</li>
<li><p><strong>Stop</strong> (finalize the trace).</p>
</li>
</ol>
<p>Each hook gets a JSON payload on stdin and runs as a subprocess.</p>
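<p>The skeleton of such a hook is small. Here's a minimal sketch (the payload keys shown are the ones the correlation code below relies on):</p>
<pre><code class="language-python">#!/usr/bin/env python3
# minimal hook skeleton: each invocation is a fresh process fed JSON on stdin
import json
import sys

payload = json.load(sys.stdin)
tool_name = payload.get("tool_name", "")    # present on PreToolUse / PostToolUse
tool_input = payload.get("tool_input", {})  # identical on Pre and Post

# ...start a span on Pre, or close the matching span on Post...
</code></pre>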
<h3 id="heading-sending-spans-with-opentelemetry">Sending Spans with OpenTelemetry</h3>
<p>Here's some minimal Python to get a span into Jaeger:</p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "my-agent-system"})
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-tracer")

with tracer.start_as_current_span("my-agent-task") as span:
    span.set_attribute("agent.name", "planner")
    span.set_attribute("agent.tokens.input", 1500)
    span.set_attribute("agent.tokens.output", 800)
</code></pre>
<p>Refresh <code>localhost:16686</code>, pick your service, click "Find Traces."</p>
<h3 id="heading-correlating-pre-and-post-events">Correlating Pre and Post Events</h3>
<p>You need to match each PreToolUse to its PostToolUse. Agent-type tool calls didn't include a <code>tool_use_id</code> in the payload, so I hashed the tool name and input instead. Pre and Post carry identical <code>tool_input</code>, so the hashes line up.</p>
<pre><code class="language-python">import hashlib, json

def correlation_key(tool_name: str, tool_input: dict) -&gt; str:
    content = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    return hashlib.sha1(content.encode()).hexdigest()[:16]
</code></pre>
<h3 id="heading-state-across-invocations">State Across Invocations</h3>
<p>Every hook call is a separate process. No shared memory. So I wrote span context to JSON files on Pre and read them back on Post:</p>
<pre><code class="language-plaintext">/tmp/claude-forge-tracing/&lt;session_id&gt;/
├── _root.json              # trace ID, root span context
├── _session_start_ns.json  # timestamp for duration calculation
├── subagent_&lt;hash&gt;.json    # per-subagent span context
└── tool_&lt;hash&gt;.json        # per-tool span context
</code></pre>
<p>File names get sanitized against path traversal. <code>_safe_name()</code> strips everything outside <code>[A-Za-z0-9._-]</code> and falls back to a SHA1 slug.</p>
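<p>Here's a minimal sketch of the pattern (not the hook's actual code; the helper names are mine). The Pre side saves the span's IDs, and the Post side rebuilds a remote parent context from them:</p>
<pre><code class="language-python">import json
import os

from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags

STATE_DIR = "/tmp/claude-forge-tracing"

def save_context(session_id: str, name: str, span: trace.Span) -&gt; None:
    # Pre hook: persist the IDs so a later process can re-parent its spans
    ctx = span.get_span_context()
    path = os.path.join(STATE_DIR, session_id, f"{name}.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"trace_id": ctx.trace_id, "span_id": ctx.span_id}, f)

def load_parent(session_id: str, name: str):
    # Post hook: reconstruct the parent context from the saved IDs
    with open(os.path.join(STATE_DIR, session_id, f"{name}.json")) as f:
        saved = json.load(f)
    parent = SpanContext(
        trace_id=saved["trace_id"],
        span_id=saved["span_id"],
        is_remote=True,
        trace_flags=TraceFlags(TraceFlags.SAMPLED),
    )
    return trace.set_span_in_context(NonRecordingSpan(parent))
</code></pre>
<p>The context returned by <code>load_parent</code> can be passed as the <code>context=</code> argument of <code>tracer.start_as_current_span(...)</code>, so the span created in the later process lands under the right parent in the same trace.</p>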
<h3 id="heading-flushing-without-blocking">Flushing Without Blocking</h3>
<pre><code class="language-python">try:
    provider.force_flush(timeout_millis=1000)
except Exception:
    pass  # Never block the swarm
</code></pre>
<p>I tried 2000ms first and the swarm felt slow. 100ms lost spans on cold TLS connections. 1000ms worked. If Jaeger is down, the swarm keeps running regardless.</p>
<h2 id="heading-viewing-traces-in-the-jaeger-ui">Viewing Traces in the Jaeger UI</h2>
<p>Open <code>http://localhost:16686</code>. Pick <code>claude-forge</code> from the service dropdown. Click "Find Traces."</p>
<p>The trace search filters by operation name, tags, and time range. Since session spans take their name from your prompt, searching "login form" pulls up the runs where you asked for one.</p>
<p>The timeline view is where I spend most of my time. Every span is a horizontal bar, nested by parent-child relationships. I can see the planner took 12 seconds, the implementer 45, the reviewer 8. Click any bar to see token counts, prompts, outputs, error status.</p>
<p>Trace comparison puts two runs side by side. This is good for figuring out why one run succeeded and another did not.</p>
<h2 id="heading-lessons-from-the-trenches">Lessons from the Trenches</h2>
<p><strong>One trace per swarm, not per subagent:</strong> My first version wiped the root span's state file on every Stop event, so each subagent started a new trace. I changed Stop to mark a timestamp while preserving the root.</p>
<p><strong>Use descriptions, not type names:</strong> Subagents all report their type as <code>general-purpose</code>. The description field is where the actual role lives.</p>
<p><strong>Token attribution needs per-agent transcripts:</strong> Claude Code writes subagent transcripts to <code>~/.claude/projects/&lt;project&gt;/&lt;session&gt;/subagents/agent-*.jsonl</code>. Match them via <code>agent-*.meta.json</code>.</p>
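<p>A sketch of the pairing (the directory layout comes from the tip above; <code>&lt;project&gt;</code> and <code>&lt;session&gt;</code> are placeholders you'd fill in):</p>
<pre><code class="language-python"># pair each subagent transcript with its metadata file
from pathlib import Path

subagents = Path.home() / ".claude" / "projects" / "&lt;project&gt;" / "&lt;session&gt;" / "subagents"
for meta in sorted(subagents.glob("agent-*.meta.json")):
    transcript = meta.with_name(meta.name.replace(".meta.json", ".jsonl"))
    print(meta.name, "-&gt;", transcript.exists())
</code></pre>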
<p><strong>Parse boolean env vars explicitly:</strong> <code>bool("0")</code> in Python is <code>True</code>. Use an allowlist: <code>{"1", "true", "yes", "on"}</code>.</p>
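<p>A small parser that gets this right (one way to do it):</p>
<pre><code class="language-python">import os

def env_flag(name: str, default: bool = False) -&gt; bool:
    # bool("0") is True in Python, so compare against an explicit allowlist
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}
</code></pre>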
<h2 id="heading-environment-variable-reference">Environment Variable Reference</h2>
<table>
<thead>
<tr>
<th>Variable</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>CLAUDE_FORGE_TRACING=1</code></td>
<td>Master opt-in. Hook is a no-op without this.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_MUTATIONS=0</code></td>
<td>Disable default mutation spans (Write/Edit/Bash). On by default.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_INNER=1</code></td>
<td>Capture all inner tool calls as child spans (off by default).</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_TOOL_BLOCKLIST</code></td>
<td>Comma-separated tools to skip when inner tracing is on. Defaults to <code>Read,Glob,Grep,TodoWrite,NotebookRead</code>.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_HOOK_DEBUG=1</code></td>
<td>Enable debug logging of raw hook payloads. Off by default.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_HOOK_DEBUG_LOG</code></td>
<td>Override debug log path. Defaults to <code>~/.cache/claude-forge/hook.log</code>.</td>
</tr>
<tr>
<td><code>OTEL_EXPORTER_OTLP_ENDPOINT</code></td>
<td>OTLP/gRPC endpoint. Defaults to <code>http://localhost:4317</code>.</td>
</tr>
</tbody></table>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Without visibility into the process, you're being inefficient with tokens and your time. Multi-agent swarms cost real money on every run. When an agent fails and retries, or when a reviewer rejects work that was close, you're paying for that blind.</p>
<p>Tracing gives you the map. You find out where the failure modes are. You find out which agents burn tokens going nowhere. A 45-second implementer run might have been 10 seconds with a better planner prompt. But you would never know that without seeing the breakdown.</p>
<p>Get observability in early. Jaeger and OpenTelemetry make it cheap to set up. Once you can see where things go wrong you can actually fix them.</p>
<p>Claude Forge tracing is on the <a href="https://github.com/HatmanStack/claude-forge">main branch</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How I Built a Production-Ready CI/CD Pipeline for a Monorepo-Based Microservices System with Jenkins, Docker Compose, and Traefik ]]>
                </title>
                <description>
                    <![CDATA[ This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server. You’ll learn how to expose services on  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-production-ready-ci-cd-pipeline-for-monorepo-based-microservices-system/</link>
                <guid isPermaLink="false">69ea60c8904b915438a58ca2</guid>
                
                    <category>
                        <![CDATA[ Jenkins ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Traefik ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md Tarikul Islam ]]>
                </dc:creator>
                <pubDate>Thu, 23 Apr 2026 18:11:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/66cb39fcaa2a09f9a8d691c1/d59c62f5-e376-4f09-851f-83e437f9960a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server.</p>
<p>You’ll learn how to expose services on a custom domain with auto-renewing HTTPS, and implement a smart deployment strategy that detects changes and redeploys only the affected microservices. This helps avoid unnecessary full-stack redeploys. We'll also cover real production issues and the exact fixes for each one.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a href="#heading-1-what-youll-build">1. What you'll build</a></p>
</li>
<li><p><a href="#heading-2-architecture">2. Architecture</a></p>
</li>
<li><p><a href="#heading-3-server-prerequisites">3. Server prerequisites</a></p>
</li>
<li><p><a href="#heading-4-traefik-the-reverse-proxy">4. Traefik — the reverse proxy</a></p>
</li>
<li><p><a href="#heading-5-run-jenkins-in-docker">5. Run Jenkins in Docker</a></p>
</li>
<li><p><a href="#heading-6-expose-jenkins-on-a-domain-via-traefik">6. Expose Jenkins on a domain via Traefik</a></p>
</li>
<li><p><a href="#heading-7-first-time-jenkins-setup">7. First-time Jenkins setup</a></p>
</li>
<li><p><a href="#heading-8-add-the-github-credential">8. Add the GitHub credential</a></p>
</li>
<li><p><a href="#heading-9-create-the-pipeline-job">9. Create the pipeline job</a></p>
</li>
<li><p><a href="#heading-10-the-jenkinsfile-deploy-only-what-changed">10. The Jenkinsfile (deploy only what changed)</a></p>
</li>
<li><p><a href="#heading-11-end-to-end-test">11. End-to-end test</a></p>
</li>
<li><p><a href="#heading-12-troubleshooting-every-error-we-hit">12. Troubleshooting — every error we hit</a></p>
</li>
<li><p><a href="#heading-13-mental-model-host-vs-container">13. Mental model: host vs. container</a></p>
</li>
<li><p><a href="#heading-14-daily-operations-cheat-sheet">14. Daily operations cheat sheet</a></p>
</li>
<li><p><a href="#heading-15-what-id-do-differently-next-time">15. What I'd do differently next time</a></p>
</li>
<li><p><a href="#heading-closing-thoughts">Closing thoughts</a></p>
</li>
</ul>
<h2 id="heading-1-what-youll-build">1. What You'll Build</h2>
<p>In this tutorial, you'll build a Jenkins instance running inside Docker on the same Linux server as your application stack.</p>
<p>Traefik will act as a reverse proxy in front of Jenkins, exposing it via a clean URL (<a href="https://jenkins.example.com"><code>https://jenkins.example.com</code></a>) with <strong>auto-renewing Let's Encrypt certificates</strong>.</p>
<p>You'll also create a Jenkinsfile in your application repository that:</p>
<ul>
<li><p>Automatically triggers on every push to the <code>staging</code> branch,</p>
</li>
<li><p>Detects which microservices changed in each commit,</p>
</li>
<li><p>Pulls the latest code on the host machine,</p>
</li>
<li><p>Rebuilds and restarts <strong>only the affected services</strong>.</p>
</li>
</ul>
<p>On every push, only the relevant services are redeployed.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>Before jumping in, this guide assumes you’re already comfortable with a few core concepts and tools.</p>
<p>This isn't a beginner-level tutorial — we’ll be working directly with infrastructure, containers, and CI/CD pipelines.</p>
<p>You should be familiar with:</p>
<ul>
<li><p>Basic Linux commands (SSH, file system navigation, permissions)</p>
</li>
<li><p>Docker fundamentals (images, containers, volumes, networks)</p>
</li>
<li><p>Git workflows (clone, pull, branches)</p>
</li>
<li><p>General idea of CI/CD pipelines</p>
</li>
</ul>
<p>Tools and environment required:</p>
<ul>
<li><p>A Linux server (Ubuntu recommended)</p>
</li>
<li><p>Docker Engine + Docker Compose (v2)</p>
</li>
<li><p>A domain name (for Traefik + HTTPS)</p>
</li>
<li><p>GitHub repository (for your backend project)</p>
</li>
<li><p>Basic understanding of microservices architecture</p>
</li>
</ul>
<p>If you’re comfortable with the above, you’re ready to follow along.</p>
<h2 id="heading-2-architecture">2. Architecture</h2>
<p>Here's an overview of the architecture:</p>
<pre><code class="language-plaintext">┌──────────────────────────── Linux server (Ubuntu) ────────────────────────────┐
│                                                                               │
│   /home/developer/projects/                                                  │
│       └── projects-prod-configs/            ← infra repo (compose, Traefik) │
│              ├── docker-compose.staging.yml                                   │
│              ├── traefik.staging.yml                                          │
│              └── projects-backend/         ← app repo (services, gateways) │
│                     ├── Jenkinsfile                                           │
│                     ├── docker-compose.staging.yml                            │
│                     └── apps/                                                 │
│                            ├── services/&lt;name&gt;/                               │
│                            ├── gateways/&lt;name&gt;/                               │
│                            └── core/&lt;name&gt;/                                   │
│                                                                               │
│   ┌─────────────────────── Docker network: proxy ──────────────────────┐      │
│   │  traefik (80, 443)                                                 │      │
│   │     │                                                              │      │
│   │     ├──► jenkins  (projects-jenkins-staging)                     │      │
│   │     │      ↳ /projects  ← bind-mount of the host project tree     │      │
│   │     │      ↳ /var/run/docker.sock ← controls host Docker           │      │
│   │     │                                                              │      │
│   │     └──► your services &amp; gateways (built by the pipeline)          │      │
│   └────────────────────────────────────────────────────────────────────┘      │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
            ▲
            │  webhook on push
            │
   GitHub: &lt;org&gt;/projects-backend (branch: staging)
</code></pre>
<p>There are two key ideas here:</p>
<ol>
<li><p><strong>Jenkins runs in a container</strong>, but it controls the <strong>host's</strong> Docker by mounting <code>/var/run/docker.sock</code>. It also bind-mounts the project folder as <code>/projects/...</code>, so it can <code>cd</code> into the real code on the host and run <code>docker compose</code> there.</p>
</li>
<li><p>The <strong>Jenkinsfile lives inside the app repo</strong>, so the pipeline definition is versioned with the code. Jenkins simply points at it.</p>
</li>
</ol>
<h3 id="heading-3-server-prerequisites">3. Server Prerequisites</h3>
<p>Before we start configuring Jenkins or Traefik, we need to prepare the server properly.</p>
<p>In this step, we’ll:</p>
<ul>
<li><p>Create a dedicated Linux user for managing the project</p>
</li>
<li><p>Install Docker and Docker Compose</p>
</li>
<li><p>Set up the folder structure for our repositories</p>
</li>
</ul>
<p>This ensures our CI/CD pipeline runs in a clean and predictable environment.</p>
<pre><code class="language-bash"># Linux user that owns the project tree
sudo adduser developer

# Docker engine + Compose plugin
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker developer

# Sanity check Compose v2
docker compose version
# -&gt; Docker Compose version v2.x.y

# Find where the Compose plugin binary lives — write it down, you'll need it
ls /usr/libexec/docker/cli-plugins/docker-compose
# (some distros use /usr/lib/docker/cli-plugins/docker-compose)

# Project layout
sudo mkdir -p /home/developer/projects
sudo chown -R developer:developer /home/developer/projects

# Clone both repos in the right place
cd /home/developer/projects
git clone https://github.com/&lt;org&gt;/projects-prod-configs.git
cd projects-prod-configs
git clone -b staging https://github.com/&lt;org&gt;/projects-backend.git
</code></pre>
<p>You should now have:</p>
<pre><code class="language-plaintext">/home/developer/projects/projects-prod-configs/projects-backend
</code></pre>
<p>Memorize this path — your Jenkinsfile references it.</p>
<h3 id="heading-dns">DNS</h3>
<p>Point an A-record for your Jenkins subdomain to the server's public IP <strong>before</strong> the next steps so Let's Encrypt can validate via HTTP challenge:</p>
<pre><code class="language-plaintext">jenkins.example.com   A   &lt;server-public-ip&gt;
</code></pre>
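<p>You can confirm the record has propagated before moving on:</p>
<pre><code class="language-bash">dig +short jenkins.example.com
# -&gt; &lt;server-public-ip&gt;
</code></pre>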
<h2 id="heading-4-traefik-the-reverse-proxy">4. Traefik — the Reverse Proxy</h2>
<p>Traefik acts as the entry point to your entire system. Instead of exposing each service manually with ports, Traefik automatically:</p>
<ul>
<li><p>Routes traffic based on domain names</p>
</li>
<li><p>Generates and renews HTTPS certificates using Let’s Encrypt</p>
</li>
<li><p>Connects to Docker and detects services dynamically</p>
</li>
</ul>
<p>In simple terms, Traefik lets you access services like:</p>
<p><a href="https://jenkins.example.com">https://jenkins.example.com</a><br><a href="https://api.example.com">https://api.example.com</a></p>
<p>…without manually configuring NGINX or managing SSL certificates.</p>
<p>In this setup, Traefik watches Docker containers and routes traffic using labels we'll define later.</p>
<p>Traefik gives every container a real domain and a real cert with <strong>zero per-service config</strong> — you just add a few labels.</p>
<h3 id="heading-traefikstagingyml-static-config"><code>traefik.staging.yml</code> (static config)</h3>
<p>Put this at the root of your infra repo:</p>
<pre><code class="language-yaml">api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: admin@example.com           # ← change me
      storage: /etc/traefik/acme.json

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false              # only containers with traefik.enable=true
    network: proxy
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: INFO

accessLog: {}
</code></pre>
<h3 id="heading-the-traefik-service-in-docker-composestagingyml">The Traefik service in <code>docker-compose.staging.yml</code></h3>
<pre><code class="language-yaml">networks:
  proxy:
    name: proxy
    driver: bridge
  internal:
    name: internal
    driver: bridge

volumes:
  acme-data:
  traefik-logs:
  jenkins-data:

services:
  traefik:
    image: traefik:v2.11
    container_name: projects-traefik-staging
    restart: unless-stopped
    ports:
      - "80:80"        # HTTP (auto-redirects to HTTPS)
      - "443:443"      # HTTPS
      - "8080:8080"    # Traefik dashboard (internal only — protect via firewall)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.staging.yml:/etc/traefik/traefik.yml:ro
      - ./dynamic:/etc/traefik/dynamic:ro
      - acme-data:/etc/traefik           # persists Let's Encrypt certs
      - traefik-logs:/var/log/traefik
    networks:
      - proxy
    command:
      - '--api.insecure=false'
      - '--api.dashboard=true'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
      - '--providers.docker.network=proxy'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
      - '--entrypoints.web.http.redirections.entryPoint.to=websecure'
      - '--entrypoints.web.http.redirections.entryPoint.scheme=https'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
      - '--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL:-admin@example.com}'
      - '--certificatesresolvers.letsencrypt.acme.storage=/etc/traefik/acme.json'
      - '--log.level=INFO'
      - '--accesslog=true'
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      # Traefik's own dashboard
      - "traefik.http.routers.traefik-dash.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-dash.entrypoints=websecure"
      - "traefik.http.routers.traefik-dash.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-dash.service=api@internal"
</code></pre>
<p>Bring it up:</p>
<pre><code class="language-bash">cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d traefik
</code></pre>
<p>Watch the logs the first time — Traefik will request a cert for the dashboard host as soon as DNS resolves.</p>
<pre><code class="language-bash">docker logs -f projects-traefik-staging
</code></pre>
<p><strong>Tip:</strong> While testing, switch ACME to the staging endpoint (<code>acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory</code>) so you don't burn through Let's Encrypt's rate limits if you misconfigure DNS. Remove that flag before going live.</p>
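<p>Concretely, that means adding one flag to the Traefik service's <code>command:</code> list while you test:</p>
<pre><code class="language-yaml">      - '--certificatesresolvers.letsencrypt.acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory'
</code></pre>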
<h2 id="heading-5-run-jenkins-in-docker">5. Run Jenkins in Docker</h2>
<p>Add this Jenkins service to the same <code>docker-compose.staging.yml</code>. Every line matters (and the comments explain why).</p>
<pre><code class="language-yaml">  jenkins:
    image: jenkins/jenkins:lts
    container_name: projects-jenkins-staging
    restart: unless-stopped
    user: root                           # to use host docker.sock without UID juggling
    environment:
      - JAVA_OPTS=-Xmx1g -Xms512m -Duser.timezone=Asia/Dhaka
      - TZ=Asia/Dhaka                    # OS-level timezone inside container
      - JENKINS_OPTS=--prefix=/
    ports:
      - "3095:8080"                      # web UI (also reachable directly if needed)
      - "50000:50000"                    # inbound agent port
    volumes:
      - jenkins-data:/var/jenkins_home   # Jenkins config/jobs/secrets persistence
      - /var/run/docker.sock:/var/run/docker.sock                          # control host Docker
      - /usr/bin/docker:/usr/bin/docker                                     # docker CLI from host
      - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro  # docker compose plugin
      - /home/developer/projects:/projects                                # project tree
      - /etc/localtime:/etc/localtime:ro                                    # match host clock
      - /etc/timezone:/etc/timezone:ro
    networks:
      - proxy
      - internal
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080/login']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    deploy:
      resources:
        limits:
          memory: 1024M
</code></pre>
<p><strong>Why</strong> <code>user: root</code><strong>?</strong> It's the simplest way to share <code>docker.sock</code> and the project bind-mount without UID/GID gymnastics. If you prefer an unprivileged user, you'll need to set <code>group: docker</code> and align UIDs/perms on host folders — possible but out of scope here.</p>
<h2 id="heading-6-expose-jenkins-on-a-domain-via-traefik">6. Expose Jenkins on a Domain via Traefik</h2>
<p>This is the section many guides skip. We'll add <strong>labels</strong> to the Jenkins service so Traefik picks it up automatically. No editing of Traefik config required.</p>
<pre><code class="language-yaml">  jenkins:
    # ... everything above ...
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"

      # 1) Router — match incoming Host
      - "traefik.http.routers.jenkins.rule=Host(`jenkins.example.com`)"
      - "traefik.http.routers.jenkins.entrypoints=websecure"
      - "traefik.http.routers.jenkins.tls.certresolver=letsencrypt"
      - "traefik.http.routers.jenkins.service=jenkins"

      # 2) Service — tell Traefik which container port is the app
      - "traefik.http.services.jenkins.loadbalancer.server.port=8080"

      # 3) Middleware — Jenkins needs X-Forwarded-Proto so it knows it's behind HTTPS
      - "traefik.http.middlewares.jenkins-headers.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.routers.jenkins.middlewares=jenkins-headers"
</code></pre>
<p>What each line does:</p>
<table>
<thead>
<tr>
<th>Label</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>traefik.enable=true</code></td>
<td>Opts this container in (we set <code>exposedByDefault=false</code>).</td>
</tr>
<tr>
<td><code>traefik.docker.network=proxy</code></td>
<td>Tells Traefik which network to talk to Jenkins on (Jenkins is on both <code>proxy</code> and <code>internal</code>).</td>
</tr>
<tr>
<td><code>routers.jenkins.rule=Host(...)</code></td>
<td>Forwards only this hostname to Jenkins.</td>
</tr>
<tr>
<td><code>routers.jenkins.entrypoints=websecure</code></td>
<td>Listens only on 443. (HTTP redirect was set up in section 4.)</td>
</tr>
<tr>
<td><code>routers.jenkins.tls.certresolver=letsencrypt</code></td>
<td>Auto-issues + renews the cert.</td>
</tr>
<tr>
<td><code>services.jenkins.loadbalancer.server.port=8080</code></td>
<td>Jenkins listens on 8080 inside the container.</td>
</tr>
<tr>
<td><code>customrequestheaders.X-Forwarded-Proto=https</code></td>
<td>Without this, Jenkins generates <code>http://</code> URLs in webhooks/links and breaks.</td>
</tr>
</tbody></table>
<p>Bring Jenkins up:</p>
<pre><code class="language-bash">cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d jenkins

# Watch Traefik issue the certificate
docker logs -f projects-traefik-staging | grep -i acme
</code></pre>
<p>After 10–60 seconds you should be able to open <code>https://jenkins.example.com</code> and see Jenkins's setup wizard with a valid lock icon.</p>
<p>Inside Jenkins (after first login):</p>
<p>Manage Jenkins → System → Jenkins URL → set this to: <a href="https://jenkins.example.com/">https://jenkins.example.com/</a></p>
<p>This is important because Jenkins uses this base URL to generate:</p>
<ul>
<li><p>Webhook endpoints (for GitHub triggers)</p>
</li>
<li><p>Links inside emails and build logs</p>
</li>
</ul>
<p>If this isn't set correctly, GitHub webhooks may fail, and any links Jenkins generates will point to the wrong address (often localhost or internal IPs).</p>
<h2 id="heading-7-first-time-jenkins-setup">7. First-Time Jenkins Setup</h2>
<p>If you're running Jenkins for the first time on this server, follow this section to complete the initial setup.</p>
<p>If you already have Jenkins configured, you can skip this section — but make sure the required plugins and settings match what we use later in this guide.</p>
<ol>
<li><p>Open <code>https://jenkins.example.com</code>. Get the initial admin password:</p>
<pre><code class="language-bash">docker exec projects-jenkins-staging cat /var/jenkins_home/secrets/initialAdminPassword
</code></pre>
</li>
<li><p>Paste it, choose Install suggested plugins.</p>
</li>
<li><p>Create your admin user.</p>
</li>
<li><p>Manage Jenkins → Plugins → Available and install:</p>
<ul>
<li><p>GitHub (and GitHub Branch Source)</p>
</li>
<li><p>Pipeline: GitHub</p>
</li>
<li><p>Credentials Binding (usually preinstalled)</p>
</li>
</ul>
</li>
</ol>
<p>That's all the plugins you need for the rest of this guide.</p>
<h2 id="heading-8-add-the-github-credential">8. Add the GitHub Credential</h2>
<p>Jenkins needs permission to access your GitHub repository.</p>
<p>This is done using a GitHub Personal Access Token (PAT), which acts like a password for secure API and Git operations.</p>
<p>We’ll store this token inside Jenkins as a credential so it can pull code during pipeline execution and authenticate securely without exposing secrets in code.</p>
<p>This single credential is used both for the SCM checkout and for the deploy-time <code>git pull</code>.</p>
<ol>
<li><p>Create a Personal Access Token (classic) on GitHub with <code>repo</code> scope.</p>
</li>
<li><p>In Jenkins: Manage Jenkins → Credentials → System → Global → Add Credentials.</p>
</li>
<li><p>Fill in:</p>
<ul>
<li><p>Kind: Username with password</p>
</li>
<li><p>Username: your GitHub username</p>
</li>
<li><p>Password: the token</p>
</li>
<li><p><strong>ID:</strong> <code>github_classic_token</code> <em>(the Jenkinsfile references this exact ID)</em></p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-9-create-the-pipeline-job">9. Create the Pipeline Job</h2>
<p>Now that Jenkins has access to your repository, the next step is to define how deployments should run.</p>
<p>A pipeline job tells Jenkins:</p>
<ul>
<li><p>where your code lives,</p>
</li>
<li><p>which branch to monitor,</p>
</li>
<li><p>and how to execute your deployment process.</p>
</li>
</ul>
<p>In Jenkins, create a new Pipeline job and connect it to your GitHub repository. Once this is set up, Jenkins will automatically trigger deployments whenever you push to the <code>staging</code> branch.</p>
<p>Start by creating a new job:</p>
<p>New Item → Pipeline → name it <code>projects-staging</code> → OK</p>
<p>Then configure the job:</p>
<ul>
<li><p>Under <strong>Build Triggers</strong>, enable:<br><strong>GitHub hook trigger for GITScm polling</strong></p>
</li>
<li><p>Under <strong>Pipeline</strong>:</p>
<ul>
<li><p>Definition: Pipeline script from SCM</p>
</li>
<li><p>SCM: Git</p>
</li>
<li><p>Repository URL: <code>https://github.com/&lt;org&gt;/projects-backend.git</code></p>
</li>
<li><p>Credentials: <code>github_classic_token</code></p>
</li>
<li><p>Branch: <code>*/staging</code></p>
</li>
<li><p>Script Path: <code>Jenkinsfile</code></p>
</li>
</ul>
</li>
</ul>
<p>Save the configuration.</p>
<p>At this point, Jenkins is fully connected to your repository and ready to run your deployment pipeline automatically.</p>
<h2 id="heading-10-the-jenkinsfile-deploy-only-what-changed">10. The Jenkinsfile (Deploy Only What Changed)</h2>
<p>Place this at the root of the <strong>app</strong> repo (<code>projects-backend/Jenkinsfile</code>), branch <code>staging</code>.</p>
<pre><code class="language-groovy">pipeline {
  agent any

  environment {
    PROJECT_PATH = "/projects/projects-prod-configs/projects-backend"
    COMPOSE_FILE = "docker-compose.staging.yml"
  }

  stages {

    stage('Checkout') {
      steps {
        checkout scm
        echo "Checkout completed for branch: ${env.BRANCH_NAME ?: 'staging'}"
      }
    }

    stage('Detect Changes') {
      steps {
        script {
          def changedFiles = sh(
            script: "git diff --name-only HEAD~1 HEAD",
            returnStdout: true
          ).trim()

          echo "Changed files:\n${changedFiles}"

          def services = [] as Set
          changedFiles.split('\n').each { file -&gt;
            def svc  = file =~ /^apps\/services\/([a-z0-9-]+)\//
            def gw   = file =~ /^apps\/gateways\/([a-z0-9-]+)\//
            def core = file =~ /^apps\/core\/([a-z0-9-]+)\//
            if (svc)  { services &lt;&lt; svc[0][1]  }
            if (gw)   { services &lt;&lt; gw[0][1]   }
            if (core) { services &lt;&lt; core[0][1] }
          }
          services = services.findAll { !it.endsWith('-e2e') }
          env.CHANGED_SERVICES = services.join(' ')

          echo "Services to deploy: ${env.CHANGED_SERVICES ?: '(none)'}"
        }
      }
    }

    stage('Deploy') {
      when { expression { return env.CHANGED_SERVICES?.trim() } }
      steps {
        withCredentials([usernamePassword(
          credentialsId: 'github_classic_token',
          usernameVariable: 'GIT_USER',
          passwordVariable: 'GIT_TOKEN'
        )]) {
          sh '''
            set -eu
            git config --global --add safe.directory "${PROJECT_PATH}"
            cd "${PROJECT_PATH}"
            git remote set-url origin "https://github.com/&lt;org&gt;/projects-backend.git"
            git -c credential.helper= \
                -c "credential.helper=!f() { echo username=\({GIT_USER}; echo password=\){GIT_TOKEN}; }; f" \
                pull origin staging
            docker compose -f "\({COMPOSE_FILE}" up -d --build \){CHANGED_SERVICES}
          '''
        }
        echo "Deployed: ${env.CHANGED_SERVICES}"
      }
    }

    stage('Skip Deployment') {
      when { expression { return !env.CHANGED_SERVICES?.trim() } }
      steps { echo "No service changes detected — nothing to deploy." }
    }
  }
}
</code></pre>
<p>Why each tricky line is there:</p>
<ul>
<li><p><code>git config --global --add safe.directory ...</code> — git refuses to operate on a repo whose owner UID differs from the current user's. The repo on disk is owned by <code>developer</code>, but Git inside the container runs as <code>root</code>. This whitelists the path.</p>
</li>
<li><p><code>git remote set-url origin "https://..."</code> — flips the on-disk remote to HTTPS so the <strong>token can be used</strong>. (A PAT can't authenticate <code>git@github.com:</code> URLs — those use SSH.) Idempotent — safe to re-run.</p>
</li>
<li><p><code>git -c credential.helper="!f() { echo username=...; echo password=...; }; f"</code> — feeds the username/token to git for that one command without writing the token to disk and without exposing it on the process command line.</p>
</li>
<li><p><code>${CHANGED_SERVICES}</code> is unquoted on purpose so multiple service names expand as separate arguments (see the sketch after this list).</p>
</li>
</ul>
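<p>To make that last point concrete, here's a minimal illustration (the service names are hypothetical):</p>
<pre><code class="language-bash"># Suppose Detect Changes found two services:
CHANGED_SERVICES="student-apigw auth-service"

# Unquoted, each name expands to its own argument:
docker compose -f docker-compose.staging.yml up -d --build ${CHANGED_SERVICES}
# runs as: ... up -d --build student-apigw auth-service

# Quoted, it would be passed as ONE bogus service name:
# docker compose ... up -d --build "student-apigw auth-service"
</code></pre>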
<h2 id="heading-11-end-to-end-test">11. End-to-End Test</h2>
<p>Before considering the setup complete, we need to verify that the entire pipeline works as expected.</p>
<p>This end-to-end test ensures that:</p>
<ul>
<li><p>GitHub webhooks are triggering Jenkins correctly,</p>
</li>
<li><p>Jenkins can detect which services changed,</p>
</li>
<li><p>and only the affected services are rebuilt and deployed.</p>
</li>
</ul>
<p>In other words, this simulates a real production deployment.</p>
<p>Start by making a small change in your repository. For example, modify a file inside:</p>
<p><code>apps/gateways/student-apigw/</code></p>
<p>Then push the change to the <code>staging</code> branch.</p>
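<p>A minimal way to do that from your workstation (the file touched and the commit message are arbitrary):</p>
<pre><code class="language-bash">cd projects-backend
git checkout staging
echo "# deploy smoke test" &gt;&gt; apps/gateways/student-apigw/README.md
git add apps/gateways/student-apigw/README.md
git commit -m "chore: trigger staging deploy"
git push origin staging
</code></pre>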
<p>Once pushed, Jenkins should automatically trigger via the webhook. If not, you can manually click <strong>Build Now</strong>.</p>
<p>Now open the build’s <strong>Console Output</strong> and verify the flow. You should see something like:</p>
<ul>
<li><p>Checkout completed for branch: staging</p>
</li>
<li><p>Services to deploy: student-apigw</p>
</li>
<li><p>git pull origin staging (successful)</p>
</li>
<li><p>docker compose ... up -d --build student-apigw</p>
</li>
<li><p>Deployed: student-apigw</p>
</li>
</ul>
<p>If you see this sequence, your pipeline is working correctly.</p>
<p>If anything fails, don’t worry — jump to Section 12 where every common issue and its fix is documented.</p>
<h2 id="heading-12-troubleshooting-every-error-we-hit">12. Troubleshooting — Every Error We Hit</h2>
<p>This section covers real issues we faced while setting up this pipeline — and more importantly, <em>why each fix works</em>. Understanding the “why” will help you debug similar problems in your own setup.</p>
<h3 id="heading-cd-cant-cd-to-projectsprojects-prod-configsprojects-backend">cd: can't cd to /projects/projects-prod-configs/projects-backend</h3>
<p><strong>Cause:</strong><br>The Jenkinsfile runs <code>cd $PROJECT_PATH</code>, but inside the container that path doesn’t exist. This usually happens when:</p>
<ul>
<li><p>the project wasn’t cloned on the host, or</p>
</li>
<li><p>the bind mount isn’t configured correctly.</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="language-bash">ls /home/developer/projects/projects-prod-configs/projects-backend
# If missing: git clone -b staging &lt;url&gt; there.
</code></pre>
<p>Confirm the bind mount:</p>
<pre><code class="language-plaintext">docker inspect projects-jenkins-staging --format '{{range .Mounts}}{{.Source}} -&gt; {{.Destination}}{{println}}{{end}}'
</code></pre>
<p>If missing, recreate the container:</p>
<pre><code class="language-plaintext">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Jenkins runs inside a container, but your code lives on the host. The bind mount connects them. Without it, Jenkins cannot access your project directory.</p>
<h3 id="heading-fatal-detected-dubious-ownership-in-repository">fatal: detected dubious ownership in repository</h3>
<p><strong>Cause:</strong><br>Git blocks access when the repository owner differs from the current user.</p>
<ul>
<li><p>Repo owner: <code>developer</code> (host)</p>
</li>
<li><p>Git runs as: <code>root</code> (inside container)</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="language-plaintext">git config --global --add safe.directory "${PROJECT_PATH}"
</code></pre>
<p><strong>Why this works:</strong></p>
<p>This explicitly tells Git that the directory is trusted, bypassing ownership mismatch security restrictions.</p>
<h3 id="heading-host-key-verification-failed-could-not-read-from-remote-repository"><code>Host key verification failed</code> / <code>Could not read from remote repository</code></h3>
<h4 id="heading-cause">Cause:</h4>
<p>The repository uses SSH (<code>git@github.com:...</code>), but:</p>
<ul>
<li><p>the container has no SSH keys</p>
</li>
<li><p>no known_hosts file exists</p>
</li>
</ul>
<p>Also, GitHub tokens cannot authenticate over SSH.</p>
<p><strong>Fix (recommended):</strong></p>
<pre><code class="language-plaintext">git remote set-url origin "https://github.com/&lt;org&gt;/projects-backend.git"
</code></pre>
<p><strong>Why this works:</strong></p>
<p>HTTPS uses token-based authentication (PAT), which works inside containers without SSH configuration.</p>
<h3 id="heading-unknown-shorthand-flag-f-in-f-docker-compose"><code>unknown shorthand flag: 'f' in -f</code> ( <code>docker compose</code>)</h3>
<p><strong>Cause:</strong><br>The Docker CLI exists, but the Docker Compose plugin is missing inside the container.</p>
<p><strong>Fix:</strong></p>
<pre><code class="language-plaintext">volumes:
  - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro
</code></pre>
<p>Find your path if needed:</p>
<pre><code class="language-plaintext">find /usr -name docker-compose -type f 2&gt;/dev/null
</code></pre>
<p>Verify:</p>
<pre><code class="language-plaintext">docker exec projects-jenkins-staging docker compose version
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Docker Compose v2 is a CLI plugin. Mounting this directory makes the <code>docker compose</code> command available inside the container.</p>
<h3 id="heading-wrong-timezone-in-build-timestamps-and-jenkins-ui">Wrong timezone in build timestamps and Jenkins UI</h3>
<p><strong>Fix:</strong> Set both env var and JVM flag, and bind-mount the host's clock files:</p>
<pre><code class="language-yaml">environment:
  - TZ=Asia/Dhaka
  - JAVA_OPTS=... -Duser.timezone=Asia/Dhaka
volumes:
  - /etc/localtime:/etc/localtime:ro
  - /etc/timezone:/etc/timezone:ro
</code></pre>
<p>You <strong>must</strong> recreate the container for env-var changes to take effect:</p>
<pre><code class="language-bash">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong><br>Jenkins runs on Java, which uses its own timezone separate from the OS.<br>By aligning OS timezone, JVM timezone, and host clock, you ensure consistent timestamps everywhere.</p>
<h3 id="heading-errsockettimeout-pnpm-install-fails">ERR_SOCKET_TIMEOUT (pnpm install fails)</h3>
<h4 id="heading-cause">Cause:</h4>
<p>If you have multiple services building in parallel and each runs pnpm install with ~1500 packages, the network gets saturated and a timeout occurs.</p>
<h4 id="heading-fixes">Fixes:</h4>
<p>a) Increase timeout + control concurrency</p>
<pre><code class="language-xml">RUN pnpm install --frozen-lockfile --ignore-scripts 
--network-timeout 600000 
--network-concurrency 8
</code></pre>
<p>Why: Gives pnpm more time and reduces network overload.</p>
<p>b) Enable pnpm cache (BuildKit)</p>
<pre><code class="language-xml">RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store 
pnpm install --frozen-lockfile --ignore-scripts
</code></pre>
<p>Why: Dependencies are cached and reused instead of downloading every time.</p>
<p>c) Avoid unnecessary rebuilds</p>
<pre><code class="language-xml">docker compose -f \(COMPOSE_FILE build \)CHANGED_SERVICES docker compose -f \(COMPOSE_FILE up -d --no-build \)CHANGED_SERVICES
</code></pre>
<p>Why: Only changed services are rebuilt → less network load → fewer failures.</p>
<h3 id="heading-container-changes-dont-apply-after-editing-docker-composeyml">Container changes don’t apply after editing docker-compose.yml</h3>
<h4 id="heading-cause">Cause:</h4>
<p><code>docker compose up -d</code> doesn't always recreate a running container after you edit the compose file; <code>--force-recreate</code> guarantees it.</p>
<h4 id="heading-fix">Fix:</h4>
<pre><code class="language-xml">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong></p>
<p>This forces Docker to recreate the container with updated configuration (env, volumes, labels).</p>
<h3 id="heading-traefik-shows-default-certificate-no-https">Traefik shows default certificate (no HTTPS)</h3>
<h4 id="heading-common-causes">Common causes:</h4>
<ul>
<li><p>DNS not pointing to the server</p>
</li>
<li><p>Port 80 blocked</p>
</li>
<li><p>Wrong Docker network</p>
</li>
</ul>
<h4 id="heading-check">Check:</h4>
<pre><code class="language-xml">dig +short jenkins.example.com docker logs projects-traefik-staging 2&gt;&amp;1 | grep -i acme
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Let’s Encrypt uses HTTP-01 challenge, so it must reach your server via port 80. If DNS or networking is wrong, certificate issuance fails.</p>
<h3 id="heading-jenkins-reverse-proxy-setup-is-broken">Jenkins: "Reverse proxy setup is broken"</h3>
<h4 id="heading-fix">Fix:</h4>
<p>Set the Jenkins URL to <a href="https://jenkins.example.com/">https://jenkins.example.com/</a><br>Ensure header:</p>
<pre><code class="language-xml">X-Forwarded-Proto: https
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Jenkins needs to know it's behind HTTPS. Without this, it generates incorrect URLs (http instead of https), breaking redirects and webhooks.</p>
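<p>A quick external sanity check, assuming the example hostname used throughout:</p>
<pre><code class="language-bash"># Expect an HTTPS response with no redirect loop back to http://
curl -sI https://jenkins.example.com/login | head -5
</code></pre>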
<h2 id="heading-13-mental-model-host-vs-container">13. Mental Model: Host vs. Container</h2>
<p>Many setup mistakes come from confusing the <strong>host</strong> filesystem with the <strong>container</strong> filesystem. This table makes it explicit:</p>
<table>
<thead>
<tr>
<th>Inside the Jenkins container</th>
<th>Comes from on the host</th>
</tr>
</thead>
<tbody><tr>
<td><code>/var/jenkins_home</code></td>
<td>docker volume <code>jenkins-data</code> (Jenkins config, jobs, secrets)</td>
</tr>
<tr>
<td><code>/projects/...</code></td>
<td><code>/home/developer/projects/...</code> (your project tree)</td>
</tr>
<tr>
<td><code>/usr/bin/docker</code></td>
<td>host's <code>/usr/bin/docker</code></td>
</tr>
<tr>
<td><code>/usr/libexec/docker/cli-plugins/docker-compose</code></td>
<td>host plugin (lets <code>docker compose</code> work)</td>
</tr>
<tr>
<td><code>/var/run/docker.sock</code></td>
<td>host Docker daemon (so builds happen on the host's engine)</td>
</tr>
<tr>
<td><code>/etc/localtime</code>, <code>/etc/timezone</code></td>
<td>host clock</td>
</tr>
<tr>
<td><code>~/.ssh</code></td>
<td><strong>nothing</strong> — that's why SSH-to-GitHub doesn't work without extra setup</td>
</tr>
</tbody></table>
<p>When debugging, always ask: <em>"Inside which filesystem is this command running, and does the file/folder it's looking for exist there?"</em></p>
<h2 id="heading-14-daily-operations-cheat-sheet">14. Daily Operations Cheat Sheet</h2>
<pre><code class="language-bash"># Recreate Jenkins after changing compose
cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

# Tail Jenkins logs
docker logs -f projects-jenkins-staging

# Open a shell inside the Jenkins container
docker exec -it projects-jenkins-staging bash

# From inside the container — sanity checks
docker compose version
ls /projects/projects-prod-configs/projects-backend
git -C /projects/projects-prod-configs/projects-backend remote -v

# Manually trigger the same deploy the pipeline does
cd /projects/projects-prod-configs/projects-backend
git pull origin staging
docker compose -f docker-compose.staging.yml up -d --build student-apigw

# Inspect Traefik routing decisions
docker logs projects-traefik-staging 2&gt;&amp;1 | grep -i jenkins

# Check renewed certs
docker exec projects-traefik-staging cat /etc/traefik/acme.json | head -50
</code></pre>
<h2 id="heading-15-what-id-do-differently-next-time">15. What I'd Do Differently Next Time</h2>
<ul>
<li><p><strong>Pre-build a base image</strong> with all node_modules baked in. With ~1500 packages × 15 services, every clean build re-downloads ~22k tarballs. A shared base image cuts that by roughly 90%.</p>
</li>
<li><p><strong>Run a private npm proxy</strong> (Verdaccio / Nexus / GitHub Packages) on the same Docker network — eliminates flaky <code>npmjs.org</code> timeouts entirely (a minimal sketch follows this list).</p>
</li>
<li><p><strong>Per-service Jenkinsfile</strong> if your services drift apart in tooling. With one Jenkinsfile, every team contends for the same pipeline definition.</p>
</li>
<li><p><strong>Replace</strong> <code>git diff HEAD~1 HEAD</code> with <code>git diff $(git merge-base HEAD origin/staging~1) HEAD</code> so squash-merges and force-pushes don't accidentally skip services.</p>
</li>
<li><p><strong>Move secrets to a vault</strong> (HashiCorp Vault / AWS Secrets Manager / Doppler). PATs in Jenkins work, but rotation across many jobs is painful.</p>
</li>
<li><p><strong>Use Jenkins' Configuration-as-Code (JCasC)</strong> so the entire Jenkins setup (jobs, credentials definitions, plugins) is in git. Then a server rebuild is a one-command operation.</p>
</li>
</ul>
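<p>For the private-registry idea above, a minimal sketch assuming Verdaccio (the service and volume names are illustrative):</p>
<pre><code class="language-yaml">services:
  npm-proxy:
    image: verdaccio/verdaccio:5
    ports:
      - "4873:4873"
    volumes:
      - verdaccio-storage:/verdaccio/storage

volumes:
  verdaccio-storage:
</code></pre>
<p>Builds on the same Docker network could then resolve packages via <code>pnpm config set registry http://npm-proxy:4873</code>.</p>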
<h2 id="heading-closing-thoughts">Closing Thoughts</h2>
<p>The pipeline itself is just three stages: <strong>Checkout → Detect Changes → Deploy</strong> — but a real production setup is mostly about <strong>plumbing</strong>: reverse proxy, certificates, bind-mounts, credentials, timezones, build caches. None of these are exotic. Together they decide whether your Friday-afternoon deploy goes silently green or eats your weekend.</p>
<p>Follow sections 1–11 to get a working pipeline. Bookmark section 12 to keep it working.</p>
<p>Happy shipping.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build Microservices-Based REST APIs for Healthcare Portals ]]>
                </title>
                <description>
                    <![CDATA[ Microservices architecture enables healthcare portals to scale, secure sensitive data, and evolve rapidly. Using ASP.NET 10 and C#, you can build independent REST APIs for services like patients, appo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-microservices-based-rest-apis-for-healthcare-portals/</link>
                <guid isPermaLink="false">69e2610cfd22b8ad6251e84b</guid>
                
                    <category>
                        <![CDATA[ REST APIs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Microservices ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ASP.NET 10 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Database per Service Pattern ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Service Communication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containerization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Gopinath Karunanithi ]]>
                </dc:creator>
                <pubDate>Fri, 17 Apr 2026 16:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/d834b346-3fcf-442c-836c-94ed7ef8a17d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Microservices architecture enables healthcare portals to scale, secure sensitive data, and evolve rapidly.</p>
<p>Using ASP.NET 10 and C#, you can build independent REST APIs for services like patients, appointments, and authentication, each with its own database and deployment lifecycle.</p>
<p>Combined with API gateways, JWT-based security, observability, and containerization, this approach ensures reliable, maintainable, and production-ready healthcare systems.</p>
<p>In this tutorial, you’ll learn how to design and build a microservices-based healthcare portal using ASP.NET 10 and C#. We’ll cover how to structure services, implement REST APIs, secure endpoints, enable service communication, and deploy using modern containerization practices.</p>
<p>By the end, you’ll have a clear understanding of how to create scalable, secure, and production-ready healthcare systems.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-overview">Overview</a></p>
</li>
<li><p><a href="#heading-why-use-microservices-for-healthcare-portals">Why Use Microservices for Healthcare Portals?</a></p>
</li>
<li><p><a href="#heading-high-level-architecture">High-Level Architecture</a></p>
</li>
<li><p><a href="#heading-designing-rest-apis-for-healthcare-services">Designing REST APIs for Healthcare Services</a></p>
</li>
<li><p><a href="#heading-how-to-build-a-microservice-with-aspnet-10">How to Build a Microservice with ASP.NET 10</a></p>
</li>
<li><p><a href="#heading-database-per-service-pattern">Database per Service Pattern</a></p>
</li>
<li><p><a href="#heading-service-communication">Service Communication</a></p>
</li>
<li><p><a href="#heading-api-gateway-implementation">API Gateway Implementation</a></p>
</li>
<li><p><a href="#heading-implementing-security-in-healthcare-apis">Implementing Security in Healthcare APIs</a></p>
</li>
<li><p><a href="#heading-observability-and-logging">Observability and Logging</a></p>
</li>
<li><p><a href="#heading-containerization-with-docker">Containerization with Docker</a></p>
</li>
<li><p><a href="#heading-deployment-strategies">Deployment Strategies</a></p>
</li>
<li><p><a href="#heading-best-practices-with-examples">Best Practices (With Examples)</a></p>
</li>
<li><p><a href="#heading-when-not-to-use-microservices">When NOT to Use Microservices</a></p>
</li>
<li><p><a href="#heading-future-enhancements">Future Enhancements</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before getting started, you should be familiar with:</p>
<ul>
<li><p>C# and ASP.NET Core fundamentals</p>
</li>
<li><p>REST API concepts (HTTP methods, routing, status codes)</p>
</li>
<li><p>Basic understanding of microservices architecture</p>
</li>
</ul>
<p>Tools required:</p>
<ul>
<li><p>.NET 10 SDK</p>
</li>
<li><p>Visual Studio or VS Code</p>
</li>
<li><p>Postman or Swagger</p>
</li>
<li><p>Docker (optional but recommended)</p>
</li>
</ul>
<h2 id="heading-overview">Overview</h2>
<p>Healthcare portals power critical workflows such as patient registration, appointment scheduling, electronic health records (EHR), billing, and telemedicine. These systems must handle sensitive data, high availability requirements, and frequent updates.</p>
<p>Traditionally, many healthcare applications were built as monolithic systems. While simple to start with, monoliths quickly become difficult to scale, maintain, and secure. A single failure can impact the entire system, and even small changes require redeploying the entire application.</p>
<p>Microservices architecture addresses these challenges by breaking the application into smaller, independent services. Each service is responsible for a specific domain, such as patient management or appointment scheduling, and can be developed, deployed, and scaled independently.</p>
<p>In this article, you'll learn how to design and implement a microservices-based healthcare REST API using ASP.NET 10 and C#. We'll walk through architecture design, service implementation, communication patterns, security, observability, and deployment strategies.</p>
<h2 id="heading-why-use-microservices-for-healthcare-portals">Why Use Microservices for Healthcare Portals?</h2>
<p>Healthcare systems are inherently complex. They involve multiple domains such as patient records, appointments, billing, authentication and authorization. A microservices approach allows each of these domains to be handled independently. There are many benefits to this approach such as:</p>
<ul>
<li><p><strong>Scalability</strong>: Scale only the services under heavy load (for example, appointments during peak hours)</p>
</li>
<li><p><strong>Fault isolation</strong>: Failure in one service does not crash the entire system</p>
</li>
<li><p><strong>Faster deployment</strong>: Teams can deploy updates independently</p>
</li>
<li><p><strong>Improved security</strong>: Sensitive services can have stricter access controls</p>
</li>
</ul>
<p>For example, a patient service can handle personal data, while a billing service manages transactions, each with different security policies.</p>
<h2 id="heading-high-level-architecture"><strong>High-Level Architecture</strong></h2>
<p>A typical healthcare microservices architecture includes an API Gateway (the central entry point), domain microservices (Patient, Appointment, Auth), a database per service, and a service communication layer.</p>
<p>The request flow is straightforward: the client sends a request, the API Gateway routes it, the target microservice processes it, and a response is returned. This separation ensures modularity and maintainability.</p>
<h2 id="heading-designing-rest-apis-for-healthcare-services">Designing REST APIs for Healthcare Services</h2>
<p>Designing REST APIs in a microservices architecture requires clear, consistent naming conventions so that endpoints are intuitive, predictable, and easy to consume by clients and other services.</p>
<h3 id="heading-naming-conventions">Naming Conventions</h3>
<p>REST APIs are resource-oriented, meaning URLs should represent entities (nouns), not actions (verbs). Each resource corresponds to a domain object in your system, such as patients, appointments, or billing records.</p>
<p><strong>Key principles:</strong></p>
<ul>
<li><p>Use plural nouns for resources (for example, <code>/patients</code>, <code>/appointments</code>)</p>
</li>
<li><p>Avoid verbs in URLs (don't use <code>/getPatients</code>)</p>
</li>
<li><p>Use hierarchical structure for relationships (for example, <code>/patients/{id}/appointments</code>)</p>
</li>
<li><p>Keep naming consistent across all services</p>
</li>
</ul>
<p>These conventions improve API readability, developer experience, and maintainability across teams.</p>
<h4 id="heading-example-patient-api-endpoints">Example: Patient API Endpoints</h4>
<p>The following endpoints represent standard CRUD (Create, Read, Update, Delete) operations for managing patients:</p>
<pre><code class="language-plaintext">GET    /api/patients        // Retrieve all patients
GET    /api/patients/{id}   // Retrieve a specific patient
POST   /api/patients        // Create a new patient
PUT    /api/patients/{id}   // Update an existing patient
DELETE /api/patients/{id}   // Delete a patient
</code></pre>
<p>Each HTTP method defines the type of operation being performed:</p>
<ul>
<li><p>GET: Fetch data (read-only)</p>
</li>
<li><p>POST: Create new resources</p>
</li>
<li><p>PUT: Update existing resources</p>
</li>
<li><p>DELETE: Remove resources</p>
</li>
</ul>
<p>These operations follow REST standards, ensuring consistency across services and making APIs easier to integrate with frontend apps, mobile clients, or third-party healthcare systems.</p>
<h3 id="heading-best-practices-for-designing-healthcare-rest-apis">Best Practices for Designing Healthcare REST APIs</h3>
<p>Designing REST APIs for healthcare systems requires more than standard conventions. It demands careful consideration of performance, data sensitivity, and interoperability.</p>
<h4 id="heading-1-use-proper-http-methods">1. Use proper HTTP methods</h4>
<p>Ensure each endpoint uses the correct HTTP verb (GET, POST, PUT, DELETE) to clearly communicate its purpose. This improves API predictability and aligns with REST standards used across healthcare platforms.</p>
<h4 id="heading-2-return-meaningful-status-codes">2. Return meaningful status codes</h4>
<p>Use appropriate HTTP status codes to indicate the result of a request. For example:</p>
<ul>
<li><p>200 OK for successful retrieval</p>
</li>
<li><p>201 Created for successful resource creation</p>
</li>
<li><p>400 Bad Request for validation errors</p>
</li>
<li><p>404 Not Found when a resource doesn’t exist</p>
</li>
</ul>
<p>Clear status codes help clients handle responses correctly.</p>
<h4 id="heading-3-implement-pagination-for-large-datasets">3. Implement pagination for large datasets</h4>
<p>Healthcare systems often deal with large volumes of data (for example, patient records, appointment logs). Use pagination to limit response size:</p>
<p><code>GET /api/patients?page=1&amp;pageSize=20</code></p>
<p>This improves performance and reduces server load.</p>
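<p>A minimal controller sketch of that query (assuming an in-memory <code>patients</code> list for brevity, like the one built later in this article):</p>
<pre><code class="language-csharp">[HttpGet]
public IActionResult GetPatients(int page = 1, int pageSize = 20)
{
    // Clamp inputs so a bad query string can't request everything at once
    page = Math.Max(page, 1);
    pageSize = Math.Min(Math.Max(pageSize, 1), 100);

    var items = patients
        .Skip((page - 1) * pageSize)
        .Take(pageSize)
        .ToList();

    return Ok(items);
}
</code></pre>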
<h4 id="heading-4-use-api-versioning">4. Use API versioning</h4>
<p>Version your APIs to avoid breaking existing clients when making changes:</p>
<p><code>/api/v1/patients</code></p>
<p>This is especially important in healthcare, where integrations with external systems must remain stable over time.</p>
<h4 id="heading-5-validate-and-sanitize-input-data">5. Validate and sanitize input data</h4>
<p>Always validate incoming data to prevent errors and ensure data integrity. For example, enforce required fields like patient name, date of birth, and contact details.</p>
<h4 id="heading-6-protect-sensitive-data">6. Protect sensitive data</h4>
<p>Avoid exposing sensitive patient information unnecessarily. Use filtering, masking, or field-level access control where needed to comply with healthcare data regulations.</p>
<h4 id="heading-7-ensure-consistent-response-structure">7. Ensure consistent response structure</h4>
<p>Return responses in a standard format (for example, including data, status, and message fields). This makes APIs easier to consume and debug across multiple services.</p>
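<p>One way to do this, sketched below, is a small generic envelope; the field names are illustrative, not a standard:</p>
<pre><code class="language-csharp">// A shared wrapper so every endpoint returns the same shape
public class ApiResponse&lt;T&gt;
{
    public string Status { get; set; } = "success";
    public string? Message { get; set; }
    public T? Data { get; set; }
}

// Usage in a controller action:
// return Ok(new ApiResponse&lt;List&lt;Patient&gt;&gt; { Data = patients });
</code></pre>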
<h2 id="heading-how-to-build-a-microservice-with-aspnet-10">How to Build a Microservice with ASP.NET 10</h2>
<p>Let’s implement a simple Patient Service.</p>
<h3 id="heading-step-1-create-project">Step 1: Create Project</h3>
<p>In this step, we'll create a new <a href="http://ASP.NET">ASP.NET</a> Web API project that will serve as our Patient microservice. This project provides the foundation for defining endpoints, handling HTTP requests, and structuring our service independently from other parts of the system.</p>
<pre><code class="language-shell">dotnet new webapi -n PatientService
cd PatientService
</code></pre>
<h3 id="heading-step-2-define-model">Step 2: Define Model</h3>
<p>Next, we'll define a simple data model representing a patient. Models define the structure of the data your API will send and receive, and they typically map to database entities in real-world applications.</p>
<pre><code class="language-csharp">public class Patient
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}
</code></pre>
<h3 id="heading-step-3-create-controller">Step 3: Create Controller</h3>
<p>Here, we're creating a controller to handle incoming HTTP requests. Controllers define API endpoints and contain the logic for processing requests, interacting with data, and returning responses to clients.</p>
<pre><code class="language-csharp">[ApiController]
[Route("api/patients")]
public class PatientController : ControllerBase
{
    private static List&lt;Patient&gt; patients = new();

    [HttpGet]
    public IActionResult GetPatients()
    {
        return Ok(patients);
    }

    [HttpPost]
    public IActionResult AddPatient(Patient patient)
    {
        patients.Add(patient);
        return CreatedAtAction(nameof(GetPatients), patient);
    }
}
</code></pre>
<h2 id="heading-database-per-service-pattern">Database per Service Pattern</h2>
<p>Each microservice should manage its own database to ensure loose coupling and independent operation. This allows services to evolve, scale, and be deployed without affecting others. It also improves data isolation and aligns with the core principles of microservices architecture.</p>
<p>Here's an example with Entity Framework Core:</p>
<pre><code class="language-csharp">public class PatientDbContext : DbContext
{
    public PatientDbContext(DbContextOptions&lt;PatientDbContext&gt; options)
        : base(options) { }

    public DbSet&lt;Patient&gt; Patients { get; set; }
}
</code></pre>
<p>This matters because it avoids cross-service dependencies, enables independent scaling, and improves data security, making microservices more efficient and secure.</p>
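<p>A sketch of registering the context in <code>Program.cs</code> (the SQL Server provider and the <code>PatientDb</code> connection-string name are assumptions; use whichever database fits your service):</p>
<pre><code class="language-csharp">// Requires the Microsoft.EntityFrameworkCore.SqlServer package
builder.Services.AddDbContext&lt;PatientDbContext&gt;(options =&gt;
    options.UseSqlServer(
        builder.Configuration.GetConnectionString("PatientDb")));
</code></pre>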
<h2 id="heading-service-communication">Service Communication</h2>
<p>Microservices communicate with each other to share data and coordinate workflows across the system. This communication can be handled through synchronous requests or asynchronous messaging, depending on the use case.</p>
<p>Choosing the right approach helps ensure scalability, reliability, and responsiveness in distributed systems.</p>
<h3 id="heading-1-synchronous-communication-http">1. Synchronous Communication (HTTP)</h3>
<pre><code class="language-csharp">var response = await httpClient.GetAsync("http://appointment-service/api/appointments");
</code></pre>
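<p>In practice you'd also check the status code and deserialize the body. A minimal sketch (the <code>Appointment</code> type and the service URL are assumptions):</p>
<pre><code class="language-csharp">using System.Net.Http.Json;

var response = await httpClient.GetAsync("http://appointment-service/api/appointments");
response.EnsureSuccessStatusCode();

// ReadFromJsonAsync comes from the System.Net.Http.Json extensions
var appointments = await response.Content
    .ReadFromJsonAsync&lt;List&lt;Appointment&gt;&gt;();
</code></pre>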
<h3 id="heading-2-asynchronous-communication-messaging">2. Asynchronous Communication (Messaging)</h3>
<p>Using message brokers like RabbitMQ:</p>
<ul>
<li><p>Services publish events</p>
</li>
<li><p>Other services consume them</p>
</li>
</ul>
<p><strong>Example:</strong></p>
<p>When a patient registers, the patient service publishes a "patient registered" event, and the appointment service consumes it to start its own workflow.</p>
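<p>A rough sketch of the publishing side with the <code>RabbitMQ.Client</code> package (v6-style API; the host, queue name, and payload are illustrative assumptions):</p>
<pre><code class="language-csharp">using System.Text;
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "rabbitmq" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Declare the queue so publishing works even if no consumer is up yet
channel.QueueDeclare(queue: "patient-registered", durable: true,
    exclusive: false, autoDelete: false);

var body = Encoding.UTF8.GetBytes("{\"patientId\":42}");
channel.BasicPublish(exchange: "", routingKey: "patient-registered",
    basicProperties: null, body: body);
</code></pre>
<p>On the consuming side, the appointment service would subscribe to the same queue and react to each message asynchronously.</p>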
<h2 id="heading-api-gateway-implementation"><strong>API Gateway Implementation</strong></h2>
<p>An API Gateway acts as the central entry point for all client requests in a microservices architecture. It handles routing, authentication, and request aggregation, simplifying how clients interact with multiple services. This layer helps improve security, scalability, and overall system management.</p>
<p>Here's an example (Ocelot configuration):</p>
<pre><code class="language-json">{
  "Routes": [
    {
      "DownstreamPathTemplate": "/api/patients",
      "UpstreamPathTemplate": "/patients",
      "DownstreamHostAndPorts": [
        { "Host": "localhost", "Port": 5001 }
      ]
    }
  ]
}
</code></pre>
<p>Benefits include centralized routing, authentication handling, and rate limiting.</p>
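<p>For context, here's roughly how a minimal Ocelot gateway host is wired up in <code>Program.cs</code> (assuming the configuration above is saved as <code>ocelot.json</code>):</p>
<pre><code class="language-csharp">using Ocelot.DependencyInjection;
using Ocelot.Middleware;

var builder = WebApplication.CreateBuilder(args);

// Load the route table and register Ocelot's services
builder.Configuration.AddJsonFile("ocelot.json", optional: false, reloadOnChange: true);
builder.Services.AddOcelot(builder.Configuration);

var app = builder.Build();
await app.UseOcelot();
app.Run();
</code></pre>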
<h2 id="heading-implementing-security-in-healthcare-apis">Implementing Security in Healthcare APIs</h2>
<p>Security is critical in healthcare systems due to the sensitive nature of patient data. APIs must enforce strong authentication, authorization, and data protection mechanisms. Proper security ensures compliance, prevents unauthorized access, and safeguards user trust.</p>
<h3 id="heading-1-jwt-authentication">1. JWT Authentication</h3>
<pre><code class="language-csharp">builder.Services.AddAuthentication("Bearer")
    .AddJwtBearer(options =&gt;
    {
        options.Authority = "https://auth-server";
        options.Audience = "healthcare-api";
    });
</code></pre>
<p>JWT (JSON Web Token) authentication is used to verify the identity of users accessing the API.</p>
<p>The authentication scheme ("Bearer") tells the API to expect a token in the Authorization header: <code>Authorization: Bearer &lt;token&gt;</code></p>
<p>Authority represents the trusted authentication server (identity provider) that issues tokens.</p>
<p>And audience ensures that the token is intended specifically for this API.</p>
<p>When a request is made, the API:</p>
<ol>
<li><p>Extracts the JWT from the request header</p>
</li>
<li><p>Validates its signature using the authority</p>
</li>
<li><p>Checks claims like expiration and audience</p>
</li>
<li><p>Grants access only if the token is valid</p>
</li>
</ol>
<p>This ensures that only authenticated users can access healthcare services.</p>
<h3 id="heading-2-role-based-authorization">2. Role-Based Authorization</h3>
<pre><code class="language-csharp">[Authorize(Roles = "Doctor")]
public IActionResult GetSensitiveData()
{
    return Ok();
}
</code></pre>
<p>Role-based authorization restricts access based on user roles.</p>
<ul>
<li><p>The <code>[Authorize]</code> attribute enforces that only authenticated users can access the endpoint.</p>
</li>
<li><p>The <code>Roles = "Doctor"</code> condition ensures that only users with the Doctor role can access this resource.</p>
</li>
</ul>
<p>When a user sends a request:</p>
<ol>
<li><p>Their JWT token is validated</p>
</li>
<li><p>The system checks the role claim inside the token</p>
</li>
<li><p>Access is granted only if the required role matches</p>
</li>
</ol>
<p>This is critical in healthcare systems where doctors access medical records, admins manage system data, and patients access only their own information.</p>
<h3 id="heading-3-secure-secrets-management">3. Secure Secrets Management</h3>
<pre><code class="language-csharp">var connectionString = Environment.GetEnvironmentVariable("DB_CONNECTION");
</code></pre>
<p>Sensitive configuration data such as database connection strings should never be hardcoded in the application.</p>
<p><code>Environment.GetEnvironmentVariable()</code> retrieves secrets securely from the environment. These values are typically stored in:</p>
<ul>
<li><p>Environment variables</p>
</li>
<li><p>Secret managers (Azure Key Vault, AWS Secrets Manager)</p>
</li>
<li><p>Container orchestration platforms</p>
</li>
</ul>
<p>Benefits:</p>
<ul>
<li><p>Prevents exposure of credentials in source code</p>
</li>
<li><p>Supports secure deployments across environments</p>
</li>
<li><p>Simplifies secret rotation without code changes</p>
</li>
</ul>
<h3 id="heading-4-enforce-https">4. Enforce HTTPS</h3>
<pre><code class="language-csharp">app.UseHttpsRedirection();
</code></pre>
<p>HTTPS ensures that all communication between the client and server is encrypted.</p>
<p><code>UseHttpsRedirection()</code> automatically redirects HTTP requests to HTTPS. This protects sensitive healthcare data (such as patient records and credentials) from Man-in-the-Middle attacks, data interception, and unauthorized access.</p>
<p>In healthcare systems, encryption is essential for compliance with data protection standards and regulations.</p>
<p>Together, these security mechanisms provide multiple layers of protection:</p>
<ul>
<li><p>Authentication verifies identity</p>
</li>
<li><p>Authorization controls access</p>
</li>
<li><p>Secrets management protects credentials</p>
</li>
<li><p>HTTPS secures data in transit</p>
</li>
</ul>
<p>This layered approach is essential for safeguarding sensitive healthcare data and ensuring compliance with industry standards.</p>
<h2 id="heading-observability-and-logging"><strong>Observability and Logging</strong></h2>
<p>Observability enables you to monitor system health, diagnose issues, and understand how services interact in real time. By implementing logging, metrics, and tracing, teams can quickly identify failures and performance bottlenecks. This is essential for maintaining reliability in distributed systems.</p>
<p>Here's a basic logging example:</p>
<pre><code class="language-csharp">_logger.LogInformation("Fetching patients");
</code></pre>
<p>This line writes an informational log entry whenever patient data is being retrieved. The <code>_logger</code> instance is part of ASP.NET’s built-in logging framework and is typically injected into the class through dependency injection.</p>
<p>Logging at this level helps developers trace normal application behavior and understand when specific operations occur, which is especially useful during debugging and monitoring in production environments.</p>
<h3 id="heading-application-insights-integration">Application Insights Integration</h3>
<pre><code class="language-csharp">builder.Services.AddApplicationInsightsTelemetry();
</code></pre>
<p>This configuration enables integration with Application Insights, a cloud-based monitoring service. By adding this line, the application automatically collects telemetry data such as request rates, response times, failure rates, and dependency calls. This allows teams to monitor the health of the application in real time and quickly identify performance bottlenecks or failures across distributed microservices.</p>
<h3 id="heading-custom-metrics">Custom Metrics</h3>
<pre><code class="language-csharp">var telemetryClient = new TelemetryClient();
telemetryClient.TrackMetric("PatientsFetched", 1);
</code></pre>
<p>Here, a TelemetryClient instance is used to send custom metrics to the monitoring system. The TrackMetric method records a numerical value –&nbsp;in this case, tracking how many times patients are fetched.</p>
<p>Custom metrics like this help measure business-specific operations and provide deeper insight into how the system is being used beyond standard performance metrics.</p>
<h3 id="heading-health-checks">Health Checks</h3>
<pre><code class="language-csharp">app.MapHealthChecks("/health");
</code></pre>
<p>This line exposes a health check endpoint at <code>/health</code> that external systems can use to verify whether the service is running correctly. When this endpoint is called, it returns the status of the application and any configured dependencies, such as databases or external services.</p>
<p>Health checks are commonly used by load balancers, container orchestrators, and monitoring tools to automatically detect failures and restart or reroute traffic if needed.</p>
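<p>The endpoint also needs to be registered at startup. A minimal sketch, with an optional EF Core database probe (the probe requires the <code>Microsoft.Extensions.Diagnostics.HealthChecks.EntityFrameworkCore</code> package):</p>
<pre><code class="language-csharp">// Program.cs
builder.Services.AddHealthChecks()
    .AddDbContextCheck&lt;PatientDbContext&gt;(); // verifies the database is reachable

var app = builder.Build();
app.MapHealthChecks("/health");
</code></pre>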
<p>Together, logging, telemetry, custom metrics, and health checks provide a complete observability strategy. They allow teams to understand system behavior, detect issues early, and maintain reliability across distributed healthcare services where uptime and performance are critical.</p>
<h2 id="heading-containerization-with-docker">Containerization with Docker</h2>
<p>Containerization allows microservices to run in isolated and consistent environments across development and production. Using Docker, you can package applications with all dependencies, ensuring portability and easier deployment. This approach simplifies scaling and infrastructure management.</p>
<p>The following Dockerfile shows a minimal setup for packaging the Patient Service into a container image:</p>
<pre><code class="language-dockerfile">FROM mcr.microsoft.com/dotnet/aspnet:10.0
WORKDIR /app
COPY . .
ENTRYPOINT ["dotnet", "PatientService.dll"]
</code></pre>
<p>This Dockerfile defines how the Patient Service is packaged into a container image so it can run consistently across different environments.</p>
<p>The <strong>FROM</strong> instruction specifies the base image, which in this case is the official ASP.NET runtime image for .NET 10. This image includes all the necessary runtime components required to execute the application, so you don’t need to install .NET separately inside the container.</p>
<p>The <strong>WORKDIR /app</strong> line sets the working directory inside the container. All subsequent commands will run relative to this directory, helping organize application files in a predictable structure.</p>
<p>The <strong>COPY . .</strong> instruction copies all files from the current project directory on your machine into the container’s working directory. This includes the compiled application binaries and any required resources.</p>
<p>Finally, the <strong>ENTRYPOINT</strong> defines the command that runs when the container starts. In this case, it launches the PatientService application using the .NET runtime.</p>
<p>Together, these steps package the microservice into a portable unit that can be deployed consistently across development, staging, and production environments. This ensures that the application behaves the same regardless of where it is deployed, which is a key advantage of containerization in microservices architectures.</p>
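<p>One common refinement is a multi-stage build, sketched below, which compiles inside the container and ships only the published output (the image tags and project file name are assumptions):</p>
<pre><code class="language-dockerfile"># Build stage: compile and publish with the full SDK
FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish PatientService.csproj -c Release -o /app/publish

# Runtime stage: only the ASP.NET runtime and the published app
FROM mcr.microsoft.com/dotnet/aspnet:10.0
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "PatientService.dll"]
</code></pre>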
<h2 id="heading-deployment-strategies"><strong>Deployment Strategies</strong></h2>
<p>Deploying microservices requires strategies that minimize downtime and reduce risk during updates.</p>
<p>Techniques like rolling updates, canary releases, and blue-green deployments help ensure smooth transitions. These approaches improve system stability and user experience during releases.</p>
<h3 id="heading-key-strategies">Key Strategies</h3>
<p>Deploying microservices requires strategies that minimize downtime, reduce risk, and ensure system stability –&nbsp;especially in healthcare systems where availability and data integrity are critical.</p>
<h4 id="heading-1-rolling-updates">1. Rolling Updates</h4>
<p>Rolling updates deploy changes gradually by updating instances of a service one at a time instead of all at once. As new versions are deployed, old instances are terminated in phases, ensuring that the system remains available throughout the process.</p>
<p>This approach works well for stateless services and is commonly used in container orchestration platforms. It allows continuous availability while still enabling safe deployment of new features.</p>
<p>Rolling updates are best used when:</p>
<ul>
<li><p>You want zero downtime deployments</p>
</li>
<li><p>Backward compatibility between versions is maintained</p>
</li>
<li><p>Changes are relatively low risk</p>
</li>
</ul>
<h4 id="heading-2-canary-deployments">2. Canary Deployments</h4>
<p>Canary deployments release a new version of a service to a small subset of users before rolling it out to everyone. This allows teams to monitor the behavior of the new version in a real-world environment with limited exposure.</p>
<p>If issues are detected, the deployment can be rolled back quickly without affecting the majority of users.</p>
<p>Canary deployments are ideal when:</p>
<ul>
<li><p>Releasing high-risk or complex features</p>
</li>
<li><p>Testing performance under real traffic</p>
</li>
<li><p>Gradually validating new functionality</p>
</li>
</ul>
<h4 id="heading-3-blue-green-deployments">3. Blue-Green Deployments</h4>
<p>Blue-green deployment involves maintaining two identical environments: one running the current version (blue) and one running the new version (green). Traffic is switched from blue to green once the new version is fully tested and ready.</p>
<p>If something goes wrong, traffic can be immediately switched back to the previous version.</p>
<p>This strategy is particularly useful when:</p>
<ul>
<li><p>You need instant rollback capability</p>
</li>
<li><p>System stability is critical</p>
</li>
<li><p>Downtime must be completely avoided</p>
</li>
</ul>
<h3 id="heading-choosing-the-right-strategy-for-healthcare-microservices">Choosing the Right Strategy for Healthcare Microservices</h3>
<p>In a healthcare portal, where reliability and patient data integrity are essential, blue-green deployments are often the safest choice. They allow full validation of the new version before exposing it to users and provide immediate rollback in case of failure.</p>
<p>But rolling updates are also commonly used for routine updates where backward compatibility is ensured, while canary deployments are useful when introducing new features like AI diagnostics or analytics modules.</p>
<h4 id="heading-example-blue-green-deployment-with-containers">Example: Blue-Green Deployment with Containers</h4>
<p>Let’s walk through a simple conceptual example using containers.</p>
<p>Assume you have two environments:</p>
<ul>
<li><p>Blue (current version) running PatientService v1</p>
</li>
<li><p>Green (new version) running PatientService v2</p>
</li>
</ul>
<p>First, you deploy the new version (v2) alongside the existing one without affecting users.</p>
<p>Then you run tests and verify that the new version behaves correctly.</p>
<p>After that, you update the load balancer or API gateway to route traffic from blue to green. Then you monitor the system for errors or performance issues.</p>
<p>If everything is stable, you keep green as the active environment. If not, switch traffic back to blue instantly.</p>
<p>In a real-world setup, this traffic switching is typically handled by:</p>
<ul>
<li><p>API Gateways</p>
</li>
<li><p>Load balancers</p>
</li>
<li><p>Kubernetes services</p>
</li>
</ul>
<p>This approach ensures that users experience no downtime while giving teams full control over deployment risk.</p>
<p>In practice, many production systems combine these strategies –&nbsp;for example, starting with a canary release and then completing deployment with a rolling update – to balance risk and efficiency.</p>
<h2 id="heading-best-practices-with-examples">Best Practices (With Examples)</h2>
<p>Designing reliable microservices for healthcare systems requires applying proven patterns that improve stability, maintainability, and resilience. Below are some key best practices with practical examples.</p>
<h3 id="heading-1-use-api-versioning">1. Use API Versioning</h3>
<p>API versioning ensures backward compatibility when your service evolves. In healthcare systems, where integrations with external systems (labs, insurance, EHR) are common, breaking changes can cause serious issues.</p>
<p>Here's an example:</p>
<pre><code class="language-csharp">[Route("api/v1/patients")]
</code></pre>
<p>This route attribute defines the base URL for the API and explicitly includes a version identifier (v1). By embedding the version in the route, the service can support multiple versions of the same API simultaneously. This allows existing clients to continue using older versions while newer versions are introduced without breaking compatibility.</p>
<p>You can later introduce a new version:</p>
<pre><code class="language-csharp">[Route("api/v2/patients")]
</code></pre>
<p>This represents a newer version of the same API with potentially updated functionality or structure. By separating versions at the routing level, developers can evolve the API safely while giving clients time to migrate.</p>
<p>This approach is especially important in healthcare systems where external integrations must remain stable over long periods.</p>
<p>This allows safe rollout of new features, support for legacy clients, and gradual migration between versions.</p>
<h3 id="heading-2-implement-retry-policies">2. Implement Retry Policies</h3>
<p>Network calls between microservices can fail due to transient issues such as timeouts or temporary service unavailability. Retry policies help automatically recover from such failures.</p>
<p>Here's an example (using Polly):</p>
<pre><code class="language-csharp">services.AddHttpClient("api")
    .AddTransientHttpErrorPolicy(p =&gt; p.RetryAsync(3));
</code></pre>
<p>This code configures an HTTP client with a retry policy using <a href="https://www.pollydocs.org/">Polly</a>, a .NET resilience and transient-fault-handling library. Polly allows developers to define policies such as retries, circuit breakers, and timeouts for handling unreliable network calls.</p>
<p>The <code>AddTransientHttpErrorPolicy</code> method applies a retry strategy for temporary failures such as network timeouts or server errors. The <code>RetryAsync(3)</code> configuration means that if a request fails due to a transient issue, it will automatically be retried up to three times before returning an error.</p>
<p>This improves system reliability by handling temporary issues without requiring manual intervention.</p>
<p>You can also add exponential backoff:</p>
<pre><code class="language-csharp">.AddTransientHttpErrorPolicy(p =&gt;
    p.WaitAndRetryAsync(3, retryAttempt =&gt;
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))));
</code></pre>
<p>This configuration enhances the retry mechanism by introducing exponential backoff. Instead of retrying immediately, the system waits progressively longer between each retry attempt.</p>
<p>Exponential backoff means:</p>
<ul>
<li><p>The first retry waits for 2¹ seconds</p>
</li>
<li><p>The second retry waits for 2² seconds</p>
</li>
<li><p>The third retry waits for 2³ seconds</p>
</li>
</ul>
<p>This approach reduces pressure on failing services and avoids overwhelming them with repeated requests. It's particularly useful in distributed systems where temporary failures are common and services need time to recover.</p>
<p>Together, retries and exponential backoff improve reliability, smooth over transient failures, and remove the need for manual retries.</p>
<h3 id="heading-3-enforce-input-validation">3. Enforce Input Validation</h3>
<p>Validating incoming data is critical, especially in healthcare systems where incorrect data can lead to serious consequences.</p>
<p>Here's an example:</p>
<pre><code class="language-csharp">if (string.IsNullOrEmpty(patient.Name))
    return BadRequest("Name is required");
</code></pre>
<p>This is a simple manual validation check that ensures the Name field is provided before processing the request. If the value is missing or empty, the API immediately returns a <code>BadRequest</code> response, preventing invalid data from entering the system.</p>
<p>A better approach is using data annotations:</p>
<pre><code class="language-csharp">public class Patient
{
    public int Id { get; set; }

    [Required]
    public string Name { get; set; }
}
</code></pre>
<p>This example uses data annotations to enforce validation rules at the model level. The [Required] attribute ensures that the Name property must be provided when a request is made. ASP.NET automatically validates the model during request processing and returns an error response if validation fails.</p>
<p>This approach is more scalable and maintainable than manual checks, especially in larger applications.</p>
<p>This ensures clean and valid data, reduced runtime errors, and better API usability.</p>
<h3 id="heading-4-use-circuit-breaker-pattern">4. Use Circuit Breaker Pattern</h3>
<p>The circuit breaker pattern prevents cascading failures when a dependent service is down or slow.</p>
<p>For example, if the Appointment Service is unavailable, repeated calls from the Patient Service can overload the system. A circuit breaker stops these calls temporarily.</p>
<p>Here's an example (again using Polly):</p>
<pre><code class="language-csharp">services.AddHttpClient("api")
    .AddTransientHttpErrorPolicy(p =&gt;
        p.CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)));
</code></pre>
<p>This means:</p>
<ul>
<li><p>After 5 consecutive failures, the circuit opens</p>
</li>
<li><p>No further requests are sent for 30 seconds</p>
</li>
<li><p>System gets time to recover</p>
</li>
</ul>
<p>After the break period, the circuit enters a half-open state in which a limited number of requests are allowed through to test whether the service has recovered. If they succeed, normal operation resumes; otherwise, the circuit opens again.</p>
<p>This helps in protecting system stability, preventing resource exhaustion, and improving overall resilience: the circuit breaker stops cascading failures and avoids piling load onto a service that is already struggling.</p>
<p>Together, these practices make your microservices backward-compatible (versioning), resilient (retries and circuit breakers), and reliable (validation). Small design decisions like these significantly improve reliability and maintainability. In healthcare systems, where uptime and data integrity are critical and failures can have serious consequences, applying these patterns is essential.</p>
<h2 id="heading-when-not-to-use-microservices">When NOT to Use Microservices</h2>
<p>Microservices are powerful, but they're not a universal solution. In many cases, adopting microservices too early can introduce unnecessary complexity instead of solving real problems.</p>
<p>Before choosing this architecture, it’s important to understand when a simpler approach—such as a monolith—is more appropriate.</p>
<h3 id="heading-1-when-the-application-is-small">1. When the Application Is Small</h3>
<p>If your application has limited functionality (for example, a basic patient registration system or internal tool), splitting it into multiple services adds unnecessary overhead.</p>
<p>A monolithic architecture allows you to develop faster with less setup, debug issues more easily, and avoid managing multiple deployments.</p>
<p><strong>Example:</strong> A simple clinic portal with only patient registration and appointment booking doesn't require separate services for each feature.</p>
<h3 id="heading-2-when-the-team-size-is-limited">2. When the Team Size Is Limited</h3>
<p>When the team size is limited, microservices can become challenging. Managing multiple codebases, handling service communication, and dealing with deployments and monitoring can slow down development, making it tough for small teams to handle the complexity.</p>
<p><strong>Example:</strong> A team of 2–3 developers may spend more time managing infrastructure than building features if microservices are used prematurely.</p>
<h3 id="heading-3-when-deployment-complexity-outweighs-benefits">3. When Deployment Complexity Outweighs Benefits</h3>
<p>Microservices introduce operational complexity, including API gateways, service discovery, container orchestration (for example, Kubernetes), and monitoring and logging across services.</p>
<p>If your application doesn't require independent scaling or frequent deployments, this complexity may not be justified.</p>
<p><strong>Example:</strong> If all components of your system scale together and are updated at the same time, a monolith is often more efficient.</p>
<h3 id="heading-4-when-domain-boundaries-arent-clear">4. When Domain Boundaries Aren't Clear</h3>
<p>Microservices rely on well-defined service boundaries. If your domain isn't clearly understood, splitting into services too early can lead to tight coupling between services, frequent cross-service changes, and poorly designed APIs.</p>
<p>In such cases, starting with a monolith and refactoring later is a better approach.</p>
<h3 id="heading-5-when-you-lack-devops-and-observability-maturity">5. When You Lack DevOps and Observability Maturity</h3>
<p>Microservices require strong DevOps practices, including CI/CD pipelines, centralized logging, distributed tracing and monitoring &amp; alerting. Without these, debugging issues becomes extremely difficult.</p>
<h2 id="heading-future-enhancements"><strong>Future Enhancements</strong></h2>
<p>Healthcare systems are evolving rapidly, and microservices architectures can adapt to support new capabilities. Future improvements may include:</p>
<h3 id="heading-1event-driven-architecture">1.Event-Driven Architecture</h3>
<p>Adopting an event-driven approach allows services to communicate asynchronously through events rather than direct requests. This improves scalability, responsiveness, and fault tolerance, making it easier to handle high volumes of patient data and real-time updates across multiple services.</p>
<h3 id="heading-2-ai-powered-diagnostics">2. AI-Powered Diagnostics</h3>
<p>Integrating AI and machine learning can enhance diagnostic capabilities by analyzing patient data, detecting patterns, and providing predictive insights. This can improve clinical decision-making and streamline workflows within the healthcare portal.</p>
<h3 id="heading-3integration-with-fhir-standards">3.Integration with FHIR Standards</h3>
<p>Supporting FHIR (Fast Healthcare Interoperability Resources) standards enables seamless data exchange between different healthcare systems, labs, and third-party applications. Standardized APIs ensure better interoperability, compliance, and easier integration with external platforms.</p>
<h3 id="heading-4real-time-analytics">4.Real-Time Analytics</h3>
<p>Real-time analytics allows healthcare providers to monitor patient data, system performance, and operational metrics continuously. This supports proactive decision-making, early detection of anomalies, and improved overall quality of care.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Microservices-based REST API development provides a powerful foundation for building scalable and secure healthcare portals. By breaking applications into independent services, teams can achieve better scalability, faster deployments, and improved fault isolation.</p>
<p>However, adopting microservices is not just a technical shift—it is an architectural and operational commitment. Developers should start small, identify clear service boundaries, and gradually evolve their systems.</p>
<p>As your application grows, focus on strengthening security, improving observability, and automating deployments. These practices will ensure your healthcare platform remains reliable, compliant, and ready to scale in a cloud-native world.</p>
<p>The next step is to build your first microservice, deploy it using containers, and incrementally expand your system into a fully distributed healthcare platform.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Open Source Data Lake for Batch Ingestion ]]>
                </title>
                <description>
                    <![CDATA[ Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-open-source-data-lake-for-batch-ingestion/</link>
                <guid isPermaLink="false">69e0f1a7b67a275a9d3c9122</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ apache-airflow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ingestion ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Puneet Singh ]]>
                </dc:creator>
                <pubDate>Thu, 16 Apr 2026 14:26:47 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ef685075-beac-4bf4-b435-6e942e5e1ac1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.</p>
<p>But the trade-off isn't merely renting outside infrastructure. It also includes lock-in to proprietary abstractions, and an operational and security surface area built on top of vendor capabilities.</p>
<p>In this article, you'll set up a batch ingestion layer on an open-source data lake stack where you own every component.</p>
<p>The focus is deliberately narrow. We'll get the ingestion layer up and running end-to-end. Then we'll build on foundations that allow future extension: analytics, governance, and stream processing without locking you into any single tool for those layers. We'll also review documented integration failures along the way: misconfigured catalogs, partition values written as NULL, and Python version mismatches.</p>
<p>By the end, you'll have:</p>
<ul>
<li><p>A working single-node data lake running on Docker (compose), built on RustFS (object storage), Apache Iceberg (table format), and Project Nessie (catalog).</p>
</li>
<li><p>A batch pipeline orchestrated with Apache Airflow, executing PySpark jobs that write versioned, partitioned Iceberg tables.</p>
</li>
<li><p>A real-world ingestion pattern, an external web scraper decoupled from Airflow via Redis, writing raw data to object storage with a lightweight signal table.</p>
</li>
<li><p>A view of what this stack is and isn't, and what you'd add to take it toward production.</p>
</li>
</ul>
<p>A word on scope: this covers the E in <a href="https://www.getdbt.com/blog/extract-load-transform">ELT</a>: getting data in. Transformation (dbt, Spark SQL) and analytics (Trino, Superset) are a natural next layer, but are outside the scope of this article. What you build here is the foundation they'd sit on.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-the-ingestion-problem">The Ingestion Problem</a></p>
</li>
<li><p><a href="#heading-stack">Stack</a></p>
</li>
<li><p><a href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a href="#heading-quick-start">Quick Start</a></p>
</li>
<li><p><a href="#heading-running-the-pipelines">Running the Pipelines</a></p>
</li>
<li><p><a href="#heading-setup">Setup</a></p>
<ul>
<li><p><a href="#heading-rustfs">RustFS</a></p>
</li>
<li><p><a href="#heading-nessie">Nessie</a></p>
</li>
<li><p><a href="#heading-spark">Spark</a></p>
</li>
<li><p><a href="#heading-apache-airflow">Apache Airflow</a></p>
</li>
<li><p><a href="#heading-scrapredis">Scrapredis</a></p>
</li>
<li><p><a href="#heading-scrapworker">Scrapworker</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-path-forward">Path Forward</a></p>
<ul>
<li><p><a href="#heading-extending-capabilities">Extending Capabilities</a></p>
</li>
<li><p><a href="#heading-adding-layers">Adding Layers</a></p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-the-ingestion-problem">The Ingestion Problem</h2>
<p>The structure of a stack is easier to understand with a use case. The high-level goal here is to ingest financial data from external market APIs for trend analysis; you'll focus specifically on landing that data in the warehouse for further analytics.</p>
<p>The data is ingested via a web crawler with a per-endpoint rate limit. For batch processing, time-based partitioning makes the data easy for downstream pipelines to consume, and it also favors cleaner data retention.</p>
<p>The crawler runs as an external process, decoupled from Airflow via a Redis job queue. This keeps rate limiting and crawl lifecycle outside the orchestration layer, with each component failing and recovering independently.</p>
<p>During ingestion, the priority is landing data reliably: crawl jobs aren't idempotent, so a lost result can't simply be regenerated.</p>
<h2 id="heading-stack">Stack</h2>
<ul>
<li><p><a href="https://rustfs.com/"><strong>RustFS</strong></a><strong>:</strong> An S3-compatible object store written in Rust</p>
</li>
<li><p><a href="https://projectnessie.org/"><strong>Project Nessie</strong></a><strong>:</strong> Transactional catalog for Apache Iceberg tables</p>
</li>
<li><p><a href="https://spark.apache.org/"><strong>Apache Spark</strong></a><strong>:</strong> Distributed compute engine</p>
</li>
<li><p><a href="https://airflow.apache.org/"><strong>Apache Airflow</strong></a><strong>:</strong> Job scheduling and orchestration</p>
</li>
<li><p><a href="https://jupyter.org/"><strong>Jupyter Notebook</strong></a> <em>(optional)</em>: Ad-hoc Spark queries against Iceberg tables, not covered in this article</p>
</li>
<li><p><strong>Scrapredis:</strong> Job queue for the web crawler</p>
</li>
<li><p><strong>Scrapworker:</strong> Web crawler and ingestion worker</p>
</li>
</ul>
<p>This setup was tested on a 4-core x86/AMD CPU, 16GB RAM, 60GB disk GCP VM running Debian GNU/Linux 11 (Bullseye). Docker with Compose v2 is required. The setup should work on any comparable Linux environment with similar or better specs.</p>
<h2 id="heading-system-overview">System Overview</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/429a1e8a-bc39-44dc-8e0b-2cd9152370f5.png" alt="Data Platform Architecture" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>As noted above, the crawler runs as an external process, decoupled from Airflow via a Redis job queue. Airflow pushes a job specification to the queue containing the endpoint, query params, and target path. The crawler picks it up, executes the crawl, and writes raw results directly to object storage.</p>
<p>This separation keeps rate limiting and crawl lifecycle concerns outside the orchestration layer, and isolates failure modes.</p>
<p>A crawl failure is harder to recover from, since crawl jobs lack idempotency. Pipeline failures after the crawl stage are independently retryable without re-triggering a crawl.</p>
<h2 id="heading-quick-start">Quick Start</h2>
<p>First, initialize the project:</p>
<pre><code class="language-bash"># Clone the repository
git clone https://github.com/ps-mir/data-platform

# Create the shared Docker network
docker network create data-platform

# Create host directories, set permissions, and download Spark JARs
chmod +x init.sh &amp;&amp; ./init.sh
</code></pre>
<p>Start services in this order (shutdown in reverse):</p>
<ol>
<li><strong>RustFS</strong></li>
</ol>
<pre><code class="language-bash">cd rustfs &amp;&amp; docker compose up -d
</code></pre>
<ol start="2">
<li><strong>Nessie</strong></li>
</ol>
<pre><code class="language-bash">cd nessie &amp;&amp; docker compose up -d
</code></pre>
<ol start="3">
<li><strong>Spark</strong> — requires a build on first run</li>
</ol>
<pre><code class="language-bash">cd spark &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<ol start="4">
<li><strong>Scrapredis</strong></li>
</ol>
<pre><code class="language-bash">cd scrapredis &amp;&amp; docker compose up -d
</code></pre>
<ol start="5">
<li><strong>Airflow</strong> — requires a build on first run</li>
</ol>
<pre><code class="language-bash">cd airflow-docker &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<p>Create the Nessie namespaces once after Nessie is up:</p>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<p>Scrapworker runs on the host directly (it's not dockerized). It requires Python &gt;=3.14:</p>
<pre><code class="language-bash">cd scrapworker
pip install -e .
CONFIG_PATH=./config/config.local.yaml RUSTFS_ACCESS_KEY=rustfsadmin RUSTFS_SECRET_KEY=rustfsadmin python -m scrapworker
</code></pre>
<p>Scrapworker must be running before activating <code>scraper_pipeline_v1</code> in Airflow. Without it, the pipeline will push jobs to the queue with no worker to pick them up and hang indefinitely in <code>wait_for_completion</code>.</p>
<p>Trino is also present in the setup, but its integration with Nessie hasn't been tested yet.</p>
<h2 id="heading-running-the-pipelines">Running the Pipelines</h2>
<p>With the stack running, the next step is to activate the pipelines in Airflow. The four pipelines build on each other in complexity, so working through them in order is the fastest way to confirm that each layer of the stack is wired correctly before moving to the next.</p>
<p>All four are loaded but paused by default. Unpause each one in the Airflow UI before triggering.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/38f95d52-c092-4a00-b660-1233077b781b.png" alt="All Airflow Pipelines" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Let's go over each pipeline:</p>
<h3 id="heading-sparkstaticdatav1skeleton-hello-dag">spark_static_data_v1_skeleton: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step1_hello_dag.py">Hello DAG</a></h3>
<p>This is a minimal DAG with no Spark, just a Python task that prints a message. If it goes green, Airflow's scheduler and worker are healthy, and the task log shows a line like <code>[2026-04-09 22:00:01] INFO - Task operator:&lt;Task(_PythonDecoratedOperator): say_hello&gt;</code>.</p>
<h3 id="heading-sparkstaticdatav2submit-spark-submit">spark_static_data_v2_submit: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step2_spark_submit.py">Spark Submit</a></h3>
<p>This submits a PySpark job via <code>SparkSubmitOperator</code> that writes a static dataset to an Iceberg table. No partitioning, every run overwrites the previous content.</p>
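<p>For orientation, here is a hedged sketch of what that operator wiring can look like; the task id, application path, and connection id are illustrative, not the repo's exact values:</p>
<pre><code class="language-python">from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Illustrative wiring only; paths and ids are assumptions.
submit_job = SparkSubmitOperator(
    task_id="write_static_data",
    application="/opt/airflow/dags/scripts/write_static_data.py",
    conn_id="spark_default",
    # Python apps can't use cluster mode on standalone Spark (see the
    # Deploy Mode section below), so the driver runs on the Airflow worker.
    deploy_mode="client",
)
</code></pre>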
<p>In the Nessie catalog it appears as:</p>
<pre><code class="language-bash">Type: ICEBERG_TABLE
Metadata Location: s3://warehouse/default/static_data_e7e43123-95a7-44d2-b6d5-67c9c7aa4321/metadata/00000-08a5a2db-6f12-4f21-b2a9-de3d9123fbd3.metadata.json
</code></pre>
<h3 id="heading-sparkpartitioneddatav1-spark-partitioned">spark_partitioned_data_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step3_spark_partitioned.py">Spark Partitioned</a></h3>
<p>This extends step2 with time-based partitioning. Partition values are derived from the scheduled slot time, so every run writes to its own <code>(ds, hr, min)</code> partition without touching previous ones.</p>
<p>Example file path in RustFS: <code>warehouse/default/static_data_partitioned_b172c66f-722b-44f3-bbee-069355753ff6/data/ds=2026-03-28/hr=23/min=15/00000-4-7a196a47-2ac0-4023-af68-ca10487fccb2-0-00001.parquet</code></p>
<h3 id="heading-scraperpipelinev1-scraper-pipeline">scraper_pipeline_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/scraper_pipeline.py">Scraper Pipeline</a></h3>
<p>This is the full ingestion flow. Airflow pushes a job to Scrapredis, Scrapworker calls the Binance API and writes raw results to RustFS, then Airflow publishes a signal row to the Nessie catalog.</p>
<p>Every run fetches: <code>https://api.binance.com/api/v3/trades?symbol=BTCUSDT&amp;limit=10</code></p>
<h2 id="heading-setup">Setup</h2>
<p>This is a single-node development setup using Docker Compose. It's built on a well-structured base config that can be extended to production with targeted changes.</p>
<ul>
<li><p>A production deployment would require HA configuration, persistent volume management, and security hardening for each component.</p>
</li>
<li><p>Images are pinned to specific versions to avoid silent breakage between pulls.</p>
</li>
<li><p>All containers share a common external Docker network named <code>data-platform</code>, which allows services to communicate using container names as hostnames.</p>
</li>
<li><p>An <code>init.sh</code> script creates the required local dirs inside the data folder and also creates the Docker network.</p>
</li>
</ul>
<h3 id="heading-rustfs">RustFS</h3>
<p>RustFS is the object storage layer in this stack. Nessie's REST catalog mode has a hard dependency on an S3-compatible endpoint. Running it against a local filesystem fails the Nessie healthcheck at startup and causes catalog initialization to error out. The REST catalog is the recommended mode for new setups because it enables credential vending and multi-engine coordination.</p>
<p>MinIO was the natural choice for self-hosted S3-compatible storage, but it shifted to a more restrictive license. RustFS is the open-source alternative, written in Rust and backed by local disk.</p>
<p>At write time, Spark pushes Parquet files directly to RustFS via S3FileIO. Nessie commits the table metadata alongside, so data and catalog state land together or not at all. This is <a href="https://iceberg.apache.org/">Apache Iceberg</a>'s core guarantee: atomic commits across both data files and metadata.</p>
<p>For production or cloud deployments, managed object storage services like AWS S3, Google Cloud Storage, or Azure Blob Storage are the natural next step. Self-hosted alternatives at scale include <a href="https://github.com/seaweedfs/seaweedfs">SeaweedFS</a>, <a href="https://docs.ceph.com/en/latest/radosgw/">Ceph/RGW</a>, and <a href="https://garagehq.deuxfleurs.fr/">Garage</a>.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Bucket creation:</strong> A <code>rustfs-init</code> sidecar using <code>amazon/aws-cli</code> runs after RustFS passes its healthcheck and creates the <code>s3://warehouse</code> bucket automatically. You don't create the bucket manually.</p>
</li>
<li><p><strong>Permissions:</strong> RustFS runs as uid=10001 inside the container. The host directories (<code>data/rustfs/data</code> and <code>data/rustfs/applogs</code>) must be owned by that uid before the container starts, or it will fail silently. <code>init.sh</code> handles this with <code>sudo chown -R 10001:10001</code>.</p>
</li>
<li><p><strong>Image pinning:</strong> The compose file pins to <code>rustfs/rustfs:1.0.0-alpha.85-glibc</code>. Before upgrading, verify the uid hasn't changed: <code>docker run --rm --entrypoint id rustfs/rustfs:&lt;new-tag&gt;</code>. If it has, re-run <code>init.sh</code> or re-chown manually.</p>
</li>
<li><p><strong>Spark writes:</strong> Spark writes data files directly to RustFS via S3FileIO. Nessie only manages catalog metadata, it doesn't proxy data. The two interact at commit time, not at write time.</p>
</li>
</ul>
<h3 id="heading-nessie">Nessie</h3>
<p>The catalog tracks the list of tables in the warehouse, along with their data files and schema. Without it, query engines like Spark have no shared, consistent view of what's in the warehouse.</p>
<p><a href="https://hive.apache.org/docs/latest/admin/adminmanual-metastore-administration/">Hive Metastore</a> offers a Thrift-based API and has been the catalog standard for years. It provides transaction semantics on metadata updates through its backing database, but those transactions stop at the catalog layer. Data files underneath aren't part of the same commit, and there's no cross-table history beyond what the database retains.</p>
<p>Apache Iceberg closes the data and metadata gap with atomic table commits. Nessie builds on that and goes further: it treats the catalog like a Git repository. Every table write is a commit. You can branch, tag, and roll back across multiple tables atomically.</p>
<p>Spark reads and writes table metadata through Nessie's Iceberg REST endpoint. Catalog state is persisted to Postgres, so it survives container restarts.</p>
<h4 id="heading-namespace-bootstrap">Namespace bootstrap</h4>
<p>Unlike Hive Metastore, Nessie doesn't auto-create namespaces. Attempting to write a table to a namespace that doesn't exist fails after data has already been written to RustFS, leaving orphaned files with no catalog entry. Namespaces are structural metadata and belong in a one-time bootstrap step, not in a pipeline.</p>
<p>Nessie manages the Iceberg catalog metadata under <code>s3://warehouse/</code>. Iceberg table data lands under paths derived from the namespace, for example, <code>s3://warehouse/default/</code> for the <code>default</code> namespace.</p>
<h4 id="heading-s3-credential-configuration-issue">S3 Credential Configuration Issue</h4>
<p>Nessie's S3 credential fields don't accept plain strings (likely for security reasons). They require a secret URI in the form <code>urn:nessie-secret:quarkus:&lt;name&gt;</code> even for local credentials.</p>
<p>Additionally, the SCREAMING_SNAKE_CASE environment variable convention is ambiguous for Quarkus property names containing hyphens. The property is silently ignored, and the default (which fails) is used instead. The working approach is dot-notation keys passed directly in the compose environment block, which Quarkus reads without conversion:</p>
<pre><code class="language-properties">nessie.catalog.service.s3.default-options.access-key: "urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key"
nessie.catalog.secrets.access-key.name: rustfsadmin
nessie.catalog.secrets.access-key.secret: rustfsadmin
</code></pre>
<h4 id="heading-nessie-health-check">Nessie health check</h4>
<p>Once the RustFS settings are corrected, Nessie's health check URL (<a href="http://localhost:9090/q/health">http://localhost:9090/q/health</a>) should return the following response:</p>
<pre><code class="language-json">{
    "status": "UP",
    "checks": [
        {
            "name": "MongoDB connection health check",
            "status": "UP"
        },
        {
            "name": "Warehouses Object Stores",
            "status": "UP",
            "data": {
                "warehouse.warehouse.status": "UP"
            }
        },
        {
            "name": "Database connections health check",
            "status": "UP",
            "data": {
                "&lt;default&gt;": "UP"
            }
        }
    ]
}
</code></pre>
<p>The MongoDB connection health check appears in the response even though this stack doesn't use MongoDB. It's a Quarkus built-in probe registered automatically regardless of store type. With JDBC configured, MongoDB is never connected and the UP report is just a placeholder response.</p>
<h4 id="heading-catalog-endpoint-vs-management">Catalog endpoint vs Management</h4>
<p>Nessie exposes two separate APIs. The Iceberg REST catalog is at <code>/iceberg</code>. This is what Spark and Trino connect to. The Nessie management API is at <code>/api/v2</code>, which is for branch operations, commit history, and table inspection. They aren't interchangeable.</p>
<pre><code class="language-properties"># Iceberg REST API
http://localhost:19120/iceberg/v1/main/namespaces
http://localhost:19120/iceberg/v1/config

# Nessie management API
http://localhost:19120/api/v2/config
</code></pre>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><code>path-style-access: true</code> is required for any non-AWS S3 endpoint. <code>region</code> is a dummy value required by the AWS SDK internally.</p>
</li>
<li><p>Nessie's internal port 9000 is remapped to 9090 on the host to avoid a conflict with RustFS, which occupies 9000 and 9001.</p>
</li>
</ul>
<h4 id="heading-forward-path">Forward path</h4>
<p>Nessie is a stateless REST service, so reads can be scaled behind a load balancer with no coordination between nodes. Durability comes entirely from the backing store.</p>
<h3 id="heading-spark">Spark</h3>
<p>As a distributed compute engine, Apache Spark is a reliable and stable choice for long-running jobs. In the current setup, it executes PySpark jobs submitted by Airflow, reads and writes Iceberg tables via the Nessie REST catalog, and writes data files directly to RustFS using S3FileIO. Spark runs in standalone mode with a single master and worker, configured via <code>spark-defaults.conf</code>.</p>
<p>Two JARs are required and must be placed in <code>data/spark/jars/</code> before starting:</p>
<ul>
<li><p><code>iceberg-spark-runtime-3.5_2.12</code>: Iceberg integration for Spark: SparkCatalog, DataFrameWriterV2, SQL extensions, and all table format logic.</p>
</li>
<li><p><code>iceberg-aws-bundle</code>: AWS SDK v2 and Iceberg's S3FileIO, the storage transport layer for writing data files to RustFS. The Spark base image ships only Hadoop AWS (SDK v1). This bundle provides the SDK v2 classes that S3FileIO requires.</p>
</li>
</ul>
<p>Spark uses a custom Dockerfile to install Python 3.12. Build the image before first use:</p>
<pre><code class="language-bash">cd spark
docker compose build
docker compose up -d
</code></pre>
<p>The PySpark jobs are covered in the Airflow section, where we walk through each DAG and its corresponding Spark script as part of the pipeline.</p>
<p>Before submitting any Spark job that writes an Iceberg table, the target namespace must exist in Nessie. Nessie doesn't auto-create namespaces, unlike Hive Metastore. Attempting to write to a missing namespace fails after data has already been written to RustFS, leaving orphaned files with no catalog entry.</p>
<p>Create the <code>default</code> namespace once before running any pipeline:</p>
<pre><code class="language-bash"># Nessie should be up and running at this point
curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'
{
  "namespace" : [ "default" ],
  "properties" : { }
}
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">curl http://localhost:19120/iceberg/v1/main/namespaces
</code></pre>
<h4 id="heading-catalog-mismatch-tables-missing-across-query-engines">Catalog Mismatch: Tables Missing Across Query Engines</h4>
<p>If tables written by Spark aren't visible in Trino, the likely cause is a catalog mismatch. Spark configured with <code>NessieCatalog</code> and Trino using the Iceberg REST catalog maintain separate metadata views — they don't share table state. Both engines must point at the same catalog endpoint: <code>http://nessie:19120/iceberg</code>.</p>
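<p>As a hedged sketch of the Spark side of that alignment, here are the REST catalog settings expressed in PySpark; in this stack they live in <code>spark-defaults.conf</code>, and the catalog name <code>nessie</code> is illustrative:</p>
<pre><code class="language-python">from pyspark.sql import SparkSession

# Point Spark's Iceberg catalog at the same REST endpoint Trino uses.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.type", "rest")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg")
    .getOrCreate()
)
</code></pre>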
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Worker memory:</strong> The worker is configured with <code>SPARK_WORKER_MEMORY: 8g</code>. Spark's default of 1g is enough for the worker to register, but not enough to run a job without queuing. Tune this based on available host memory.</p>
</li>
<li><p><strong>Remote signing:</strong> set <code>remote-signing-enabled: false</code>. Nessie's REST catalog supports credential vending via IAM/STS, but since that integration isn't present here, remote signing is disabled explicitly to avoid request failures.</p>
</li>
<li><p><strong>Config changes need full restart:</strong> Docker file-level bind mounts cache the inode at container start. Editing <code>spark-defaults.conf</code> won't take effect until Spark and the Airflow worker are restarted. In client mode, the Airflow worker is the Spark driver (the process that reads the config on job submission) and must be restarted too.</p>
</li>
<li><p><strong>Jupyter Notebook:</strong> A Jupyter instance with PySpark is included in the stack for ad-hoc queries against Iceberg tables. It connects to the same Spark cluster and Nessie catalog, so any table written by a pipeline is immediately queryable.</p>
</li>
</ul>
<p>⚠️ <strong>Warning:</strong> The Spark worker and Airflow worker (the driver) must run the same Python minor version. PySpark enforces this at runtime and fails immediately if they diverge. The Spark image in this stack uses a custom Dockerfile to install Python 3.12, matching Airflow's base image. If you upgrade either, verify that the versions stay aligned.</p>
<h3 id="heading-apache-airflow">Apache Airflow</h3>
<p>Airflow makes it easier to author, schedule, and monitor workflows. Here it orchestrates batch ingestion, but it can be extended to use cases like stream processing.</p>
<p>The Airflow components here most closely resemble the DAG-processor architecture from the <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html">official docs</a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/a438e02b-0b16-44c7-bcae-92c954a942cc.png" alt="DAG Processor Airflow Architecture" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Key aspects:</p>
<ul>
<li><p>The DAG Processor continuously parses DAG files and serializes them to the Metadata DB.</p>
</li>
<li><p>The Scheduler reads from there, detects when a DAG run is due, creates task instances, and pushes them to the CeleryExecutor (via Redis queue).</p>
</li>
<li><p>The Celery worker picks up a task and executes it. In the case of a <code>SparkSubmitOperator</code>, the worker process becomes the Spark driver, submitting the job to the Spark cluster.</p>
</li>
<li><p>Executors run on the Spark worker, write Parquet files directly to RustFS, and commit the table metadata to Nessie. Airflow records the task outcome back in the Metadata DB.</p>
</li>
</ul>
<p>Airflow uses a custom Dockerfile to install Java 17 and additional providers. Build the image before first use:</p>
<pre><code class="language-bash">cd airflow-docker
docker compose build
docker compose up -d
</code></pre>
<h4 id="heading-pipelines">Pipelines</h4>
<p>Pipelines need to be created inside the <code>airflow-docker/dags</code> folder so the DAG processor can pick them up and load them into the metadata DB. Four pipeline examples of varying complexity are provided.</p>
<ol>
<li><p><code>step1_hello_dag.py</code>: single-task DAG with no dependencies, just a Python function that prints a message.</p>
</li>
<li><p><code>step2_spark_submit.py</code>: submits a PySpark job via SparkSubmitOperator. The job writes a static dataset to an Iceberg table via the Nessie catalog.</p>
</li>
<li><p><code>step3_spark_partitioned.py</code>: extends step 2 with time-based partitioning. The scheduled slot time is passed to the PySpark script.</p>
<ul>
<li>Time-based partition values are derived from <code>data_interval_start</code> for idempotency (backfills, reruns); a short sketch follows this list.</li>
</ul>
</li>
<li><p><code>scraper_pipeline</code>: a real-world ingestion pipeline. Coordinates with the external task executor <code>scrapworker</code> via the Redis queue <code>scrapredis</code>.</p>
<ul>
<li>Both <code>scrapredis</code> and <code>scrapworker</code> must be up and running for this pipeline to work.</li>
</ul>
</li>
</ol>
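<p>To make the idempotency point in step 3 concrete, here is a minimal sketch of deriving partition values from the interval start; the helper name is an illustration, not the repo's code:</p>
<pre><code class="language-python">from datetime import datetime

def partition_values(data_interval_start: datetime) -&gt; dict:
    # Same scheduled slot in, same partition out: reruns and backfills
    # land in exactly the same (ds, hr, min) partition.
    return {
        "ds": data_interval_start.strftime("%Y-%m-%d"),
        "hr": data_interval_start.strftime("%H"),
        "min": data_interval_start.strftime("%M"),
    }

# A slot starting 2026-03-28 23:15 maps to ds=2026-03-28/hr=23/min=15,
# matching the file path shown earlier.
</code></pre>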
<h4 id="heading-deploy-mode-and-driver-config">Deploy Mode and Driver Config</h4>
<p>The initial <code>SparkSubmitOperator</code> configuration used <code>deploy_mode="cluster"</code>, which runs the driver on the Spark cluster rather than the submitting machine. This fails immediately on Spark standalone clusters with a hard error:</p>
<pre><code class="language-plaintext">Cluster deploy mode is currently not supported for python applications on standalone clusters.
</code></pre>
<p>Cluster mode for Python is only available on YARN and Kubernetes. The fix is <code>deploy_mode="client"</code>, but this shifts the problem: in client mode, the driver runs on the Airflow worker container, which means the worker needs everything the Spark containers have.</p>
<p>Overall, three changes are required in the Airflow worker:</p>
<ul>
<li><p>The Iceberg and Nessie JARs at <code>/opt/spark/user-jars/</code></p>
</li>
<li><p><code>spark-defaults.conf</code> with catalog, extension, and JAR config</p>
</li>
<li><p><code>SPARK_CONF_DIR=/opt/spark/conf</code>, without this, pip-installed PySpark's <code>spark-submit</code> silently ignores the mounted conf file and runs with no catalog config</p>
</li>
</ul>
<p>The fix was adding all three to <code>x-airflow-common</code> in <code>airflow-docker/docker-compose.yaml</code> so every Airflow service inherits them:</p>
<pre><code class="language-yaml">environment:
  SPARK_CONF_DIR: /opt/spark/conf

volumes:
  - ../data/spark/jars:/opt/spark/user-jars:ro
  - ../spark/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf:ro
</code></pre>
<h4 id="heading-partition-values-written-as-null">Partition Values Written as NULL</h4>
<p>When the third pipeline (Spark Partitioned) ran for the first time, the data landed correctly in RustFS, but querying the Iceberg partitions metadata showed:</p>
<pre><code class="language-plaintext">+------------------+----------+
|         partition|file_count|
+------------------+----------+
|{NULL, NULL, NULL}|         2|
+------------------+----------+
</code></pre>
<p>The original script used Spark's DataSource V1 API:</p>
<pre><code class="language-python">df.write.format("iceberg").mode("overwrite").saveAsTable(table)
</code></pre>
<p>This V1 write path with <code>format("iceberg")</code> loads an isolated table reference and bypasses Iceberg's catalog write path. As a result, Iceberg committed the data files to storage but wrote NULL partition values into the manifest metadata.</p>
<p>The fix is Iceberg's native DataFrameWriterV2 API:</p>
<pre><code class="language-python">df.writeTo(table).overwritePartitions()
</code></pre>
<p>This routes through Iceberg's native write path, evaluates partition transforms from the real column values (ds, hr, min), and registers them correctly in the manifest. <code>overwritePartitions()</code> overwrites only the partitions present in the DataFrame. A rerun with the same scheduled time produces the same values and atomically replaces that partition, leaving all others untouched.</p>
<p>⚠️ Existing NULL-partition manifest entries aren't retroactively corrected by subsequent V2 writes. For a brand-new table containing only bad data, DROP TABLE and rewrite is the simplest recovery.</p>
<h3 id="heading-scrapredis">Scrapredis</h3>
<p>Scrapredis is a dedicated Redis instance that sits between Airflow and Scrapworker as a job queue. It's separate from Airflow's internal Redis, which exists solely for CeleryExecutor task dispatch. The separation means the crawler's job queue can be managed, scaled, or replaced without touching Airflow's internals.</p>
<p>The pattern generalises beyond scraping. Any external process that needs its own lifecycle, resource profile, or rate limiting can be wired the same way: Airflow pushes a job, the external worker pops it, and Airflow polls for the result.</p>
<p>The scraper pipeline follows this round-trip:</p>
<ol>
<li>Airflow pushes the job payload to the queue:</li>
</ol>
<pre><code class="language-python">QUEUE_KEY = "scrapworker:jobs"
client.lpush(QUEUE_KEY, json.dumps(payload))
</code></pre>
<ol start="2">
<li>Scrapworker blocks on the queue and pops the next job:</li>
</ol>
<pre><code class="language-python">while True:
    _, payload = client.blpop(redis_cfg["queue_key"])
</code></pre>
<ol start="3">
<li>Once the crawl finishes, Scrapworker writes the outcome and <code>s3_path</code> back to Redis:</li>
</ol>
<pre><code class="language-python">client.set(status_key, json.dumps({"status": "finished", "worker_id": worker_id, "s3_path": job["s3_path"]}), ex=TERMINAL_TTL)
</code></pre>
<ol start="4">
<li>The <code>wait_for_completion</code> task polls for that status key. On success, <code>publish_nessie_signal</code> picks up the <code>s3_path</code> and writes the signal row to Nessie. A polling sketch follows this list.</li>
</ol>
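<p>Here is a hedged sketch of that polling step; the key names mirror the snippets above, while the interval, timeout, and error handling are illustrative:</p>
<pre><code class="language-python">import json
import time

def wait_for_completion(client, status_key, poll_interval=5, timeout=600):
    # Poll the Redis status key until Scrapworker reports a terminal state.
    deadline = time.time() + timeout
    while time.time() &lt; deadline:
        raw = client.get(status_key)
        if raw:
            status = json.loads(raw)
            if status["status"] == "finished":
                return status["s3_path"]
            if status["status"] == "failed":
                raise RuntimeError(f"crawl failed: {status}")
        time.sleep(poll_interval)
    raise TimeoutError(f"no terminal status for {status_key}")
</code></pre>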
<h3 id="heading-scrapworker">Scrapworker</h3>
<p>Scrapworker is a Python app that uses the Scrapy crawling framework to fetch every page of a request. It's decoupled from Airflow because of URL- and client-specific rate-limit semantics. For simplicity, consider it a type of external worker that receives and executes requests from Airflow.</p>
<p>It's responsible for downloading and writing content to object storage (RustFS). The Nessie catalog update is decoupled and kept in a separate Airflow pipeline task.</p>
<h4 id="heading-fixed-signal-table">Fixed Signal Table</h4>
<p>Scrapworker writes raw JSON to RustFS rather than writing scraped data directly as Iceberg columns. The pipeline then publishes a single lightweight signal row to a Nessie-managed Iceberg table.</p>
<p>The signal schema is fixed and minimal (<code>run_id</code>, <code>endpoint</code>, <code>s3_path</code>, <code>ds</code>, <code>hr</code>, <code>min</code>, <code>published_at</code>). It never changes, regardless of what's being scraped.</p>
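<p>A hedged sketch of that publish step (the table name, session wiring, and literal values are assumptions, not the repo's code):</p>
<pre><code class="language-python">from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In the real pipeline these values come from the Airflow context and the
# Redis status payload; the literals here are placeholders.
row = ("manual__2026-03-28T23:15", "binance_trades",
       "s3://warehouse/raw/binance/2026-03-28/23/15.json",
       "2026-03-28", "23", "15",
       datetime.now(timezone.utc).isoformat())

signal = spark.createDataFrame(
    [row],
    "run_id string, endpoint string, s3_path string, "
    "ds string, hr string, min string, published_at string",
)
signal.writeTo("nessie.scraper.signals").append()
</code></pre>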
<p>Mirroring the scraped payload as Iceberg columns would force Scrapworker to own schema evolution across different endpoints. This isn't an ideal place for schema ownership. Instead, schema ownership sits downstream:</p>
<pre><code class="language-plaintext">Scrapworker  →  raw files in RustFS  +  signal row in Iceberg (from Pipeline)
Airflow job  →  reads raw via s3_path, applies schema, writes structured Iceberg table
</code></pre>
<p>The downstream job knows the domain, knows the schema, and is the right place to handle type casting, nulls, and partition layout. Scrapworker stays generic and thin — the same code handles any endpoint without modification.</p>
<h4 id="heading-why-signal-publish-is-a-separate-airflow-task">Why Signal Publish is a Separate Airflow Task</h4>
<p>Scrapworker writes to RustFS and sets <code>status: finished</code> in Redis with the <code>s3_path</code>. A separate Airflow task reads that status and publishes the signal row to Nessie. The two writes are intentionally decoupled.</p>
<p>If scrapworker published to Nessie directly after writing to RustFS, the two writes would share a failure mode. A Nessie failure after a successful RustFS write would leave data stranded with no signal and no clean recovery path. The only option would be a re-crawl, which isn't idempotent.</p>
<p>With the decoupled approach, each failure is isolated. A Nessie failure triggers an Airflow retry of the signal publish task only, no re-scrape, no duplicate crawl. RustFS and Nessie failures are independently recoverable.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p>Raw scraped files are written directly to <code>s3://warehouse/raw/</code>, entirely outside Nessie's management. Nothing in the Iceberg layer touches this path.</p>
</li>
<li><p>The scrapworker signal table lives in a dedicated <code>scraper</code> namespace. Create it once before scrapworker runs for the first time.</p>
</li>
</ul>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<h2 id="heading-path-forward">Path Forward</h2>
<p>The stack we've built here is a working ingestion layer. It lands data reliably, tracks it in a versioned catalog, and gives you a foundation to build on. Two directions are worth considering from here.</p>
<h3 id="heading-extending-capabilities">Extending Capabilities</h3>
<p>These are improvements to what's already in the stack, making it more robust without adding new components.</p>
<p><strong>Ingestion reliability:</strong> Scrapworker currently handles failures by setting <code>status: failed</code> in Redis, which requires Airflow to re-trigger the full pipeline. Adding client-side rate limiting and per-endpoint retry logic with backoff would make crawl jobs more self-healing, so that a failed page fetch can retry independently without surfacing to Airflow at all.</p>
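<p>A minimal sketch of that retry idea, assuming a <code>fetch_page</code> callable that stands in for the real request logic:</p>
<pre><code class="language-python">import random
import time

def fetch_with_retry(fetch_page, url, max_attempts=5, base_delay=1.0):
    # Retry a single page fetch with exponential backoff plus jitter,
    # so a transient failure never has to surface to Airflow.
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(url)
        except Exception:
            if attempt == max_attempts:
                raise  # only escalate after all retries are exhausted
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
</code></pre>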
<p><strong>Config validation:</strong> A misconfigured endpoint schema in <code>config.yaml</code> fails silently at runtime, often deep into a crawl. A <code>validate_config()</code> call at startup would catch missing required fields like <code>offset_param</code> or <code>response_map</code> before any job runs. This becomes more important as more endpoints are added.</p>
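<p>A sketch of that startup check, using the required field names mentioned above (the config shape is otherwise assumed):</p>
<pre><code class="language-python">REQUIRED_ENDPOINT_FIELDS = {"offset_param", "response_map"}

def validate_config(config: dict) -&gt; None:
    # Fail fast at startup instead of deep inside a crawl.
    for name, endpoint in config.get("endpoints", {}).items():
        missing = REQUIRED_ENDPOINT_FIELDS - endpoint.keys()
        if missing:
            raise ValueError(f"endpoint {name!r} is missing: {sorted(missing)}")
</code></pre>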
<p><strong>Observability:</strong> Airflow alerting and SLA monitoring give early warning when pipelines miss their schedule or tasks take longer than expected. The signal table is useful here too. A lightweight monitor that checks for expected signal rows within a time window is a simple SLA check that works without external tooling.</p>
<h3 id="heading-adding-layers">Adding Layers</h3>
<p>These are new capabilities that build on the ingestion foundation.</p>
<p><strong>Transform layer:</strong> The raw Iceberg tables written by the ingestion layer are the input for a transform step. dbt or Spark SQL can read from raw, apply schema, clean types, and write structured tables to a separate namespace. This is the T in ELT and the natural next step once ingestion is stable.</p>
<p><strong>Analytics:</strong> Trino is already in the stack and partially integrated. Connecting it fully to Nessie enables SQL queries across all Iceberg tables. Adding Superset on top gives a visualisation layer without requiring any changes to the ingestion pipeline.</p>
<p><strong>Broader source onboarding:</strong> The current stack handles one ingestion pattern: a scheduled Airflow pipeline triggering an external HTTP crawler. The same foundation supports pull-based sources like databases using CDC, and push-based sources like event streams via Kafka. The Iceberg tables and Nessie catalog serve as the landing zone regardless of how data arrives.</p>
<p><strong>Governance:</strong> Iceberg and Nessie provide the foundations, covering snapshots, schema evolution, commit history, and time travel. The governance layer on top requires deliberate additions: access control, data quality checks, lineage tracking, and schema enforcement. None of these require replacing what's here, as they sit on top of it.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Fashion App That Helps You Organize Your Wardrobe  ]]>
                </title>
                <description>
                    <![CDATA[ I used to spend too long deciding what to wear, even when my closet was full. That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-fashion-app-to-organize-your-wardrobe/</link>
                <guid isPermaLink="false">69de6abf91716f3cfb5448a1</guid>
                
                    <category>
                        <![CDATA[ webdev ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ React ]]>
                    </category>
                
                    <category>
                        <![CDATA[ full stack ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mokshita V P ]]>
                </dc:creator>
                <pubDate>Tue, 14 Apr 2026 16:26:39 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/bf593ff6-6de8-4b30-ab0a-700c3410ccb1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I used to spend too long deciding what to wear, even when my closet was full.</p>
<p>That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better organization, better visibility, and better guidance when making outfit decisions.</p>
<p>So I built a fashion web app that helps users organize their wardrobe, get outfit suggestions, evaluate shopping decisions, and improve recommendations over time using feedback.</p>
<p>In this article, I’ll walk through what the app does, how I built it, the decisions I made along the way, and the challenges that shaped the final result.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-table-of-contents">Table of Contents</a></p>
</li>
<li><p><a href="#heading-what-the-app-does">What the App Does</a></p>
</li>
<li><p><a href="#heading-why-i-built-it">Why I Built It</a></p>
</li>
<li><p><a href="#heading-tech-stack">Tech Stack</a></p>
</li>
<li><p><a href="#heading-product-walkthrough-what-users-see">Product Walkthrough (What Users See)</a></p>
</li>
<li><p><a href="#heading-how-i-built-it">How I Built It</a></p>
</li>
<li><p><a href="#heading-challenges-i-faced">Challenges I Faced</a></p>
</li>
<li><p><a href="#heading-what-i-learned">What I Learned</a></p>
</li>
<li><p><a href="#heading-what-i-want-to-improve-next">What I Want to Improve Next</a></p>
</li>
<li><p><a href="#heading-future-improvements">Future Improvements</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-the-app-does">What the App Does</h2>
<p>At a high level, the app combines six core capabilities:</p>
<ol>
<li><p>Wardrobe management</p>
</li>
<li><p>Outfit recommendations</p>
</li>
<li><p>Shopping suggestions</p>
</li>
<li><p>Discard recommendations</p>
</li>
<li><p>Feedback and usage tracking</p>
</li>
<li><p>Secure multi-user accounts</p>
</li>
</ol>
<p>Users can upload clothing items, explore suggested outfits, and mark recommendations as helpful or not helpful. They can also rate outfits and track whether items are worn, kept, or discarded.</p>
<p>That feedback becomes structured data for improving future recommendation quality.</p>
<h2 id="heading-why-i-built-it">Why I Built It</h2>
<p>I wanted to create something that felt personal and actually useful. A lot of fashion apps look polished, but they do not always help with everyday decisions. My goal was to build something that could make wardrobe management easier and outfit selection less overwhelming. The app needed to do three things well:</p>
<ul>
<li><p>store each user’s wardrobe data</p>
</li>
<li><p>personalize recommendations</p>
</li>
<li><p>learn from user feedback over time.</p>
</li>
</ul>
<p>That feedback loop mattered to me because it makes the app feel more alive instead of static.</p>
<h2 id="heading-tech-stack">Tech Stack</h2>
<p>Here are the tools I used to build the app:</p>
<ul>
<li><p>Frontend: React + Vite</p>
</li>
<li><p>Backend: FastAPI</p>
</li>
<li><p>Database: SQLite (local development)</p>
</li>
<li><p>Background jobs: Celery + Redis</p>
</li>
<li><p>Authentication: JWT (access + refresh token flow)</p>
</li>
<li><p>Deployment support: Docker and GitHub Codespaces</p>
</li>
</ul>
<p>This ended up giving me a modular setup, which helped a lot as features grew: fast frontend iteration, clean API boundaries, and room to evolve recommendations separately from the UI.</p>
<h2 id="heading-product-walkthrough-what-users-see">Product Walkthrough (What Users See)</h2>
<h3 id="heading-1-onboarding-and-account-setup">1. Onboarding and Account Setup</h3>
<p>To start using the app, a user needs to register, verify their email, and complete some profile basics.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/1ff4fb0d-dc97-4088-b720-db917b53ba5b.png" alt="Onboarding screen showing account creation, email verification, and profile fields for body shape, height, weight, and style preferences." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Each account is isolated, so wardrobe history and recommendations stay user-specific.</p>
<p>In this onboarding screen above, you can see account creation, email verification, and profile fields for body shape, height, weight, and style preferences.</p>
<h3 id="heading-2-wardrobe-upload">2. Wardrobe Upload</h3>
<p>Users can upload clothing images.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/d69bf10b-b79b-4294-923c-5c9e5840098a.png" alt="Wardrobe upload form showing clothing image analysis results with category, dominant color, secondary color, and pattern details." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Image analysis labels each item and makes it searchable for recommendations. The wardrobe upload form shows image analysis results with category, dominant color, secondary color, and pattern details listed.</p>
<h3 id="heading-3-outfit-recommendations">3. Outfit Recommendations</h3>
<p>Users can request recommendations, then rate outputs.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/61527ddf-11e4-4284-92fd-2d0c948ae2db.png" alt="Outfit recommendation dashboard showing ranked outfit cards with feedback and rating actions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Above you can see the outfit recommendation dashboard that shows ranked outfit cards with feedback and rating actions. Recommendations are ranked by a weighted scoring model.</p>
<h3 id="heading-4-shopping-and-discard-assistants">4. Shopping and Discard Assistants</h3>
<p>The app evaluates new items against existing wardrobe data and flags low-value wardrobe items that may be worth removing.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/88ed83c4-fdba-40e7-ad32-f77bdf21cb4d.png" alt="Shopping and discard analysis screen showing recommendation scores, written reasons, and styling guidance for each item." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can see the recommendation scores, written reasons (not just a binary decision), and styling guidance for each item above. It also features a "how to style it" section in case the user still wants to keep the item.</p>
<h2 id="heading-how-i-built-it">How I Built It</h2>
<h3 id="heading-1-frontend-setup-react-vite">1. Frontend Setup (React + Vite)</h3>
<p>I used React + Vite because I wanted fast iteration and a clean component structure.</p>
<p>The frontend is split into feature areas like onboarding, wardrobe management, outfits, shopping, and discarded-item suggestions. I also keep API calls in a service layer so the UI components stay focused on rendering and interaction.</p>
<p>The snippet below is a simplified example of the API service pattern used in the app. It is not meant to be copy-pasted as-is, but it shows the same structure the frontend uses when talking to the backend.</p>
<p>Example API client pattern:</p>
<pre><code class="language-javascript">export async function getOutfitRecommendations(userId, params = {}) {
  const query = new URLSearchParams(params).toString();
  const url = `/users/${userId}/outfits/recommend${query ? `?${query}` : ""}`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${localStorage.getItem("access_token")}`,
    },
  });

  if (!response.ok) {
    throw new Error("Failed to fetch outfit recommendations");
  }

  return response.json();
}
</code></pre>
<p>Here's what's happening in that snippet:</p>
<ul>
<li><p><code>URLSearchParams</code> builds optional query strings like <code>occasion</code>, <code>season</code>, or <code>limit</code>.</p>
</li>
<li><p>The request path is user-scoped, which keeps each user’s recommendations isolated.</p>
</li>
<li><p>The <code>Authorization</code> header sends the access token so the backend can verify the session.</p>
</li>
<li><p>The response is checked before parsing so the UI can surface a useful error if the request fails.</p>
</li>
</ul>
<p>This pattern kept the frontend simple and reusable as the number of API calls grew.</p>
<h3 id="heading-2-backend-architecture-with-fastapi">2. Backend Architecture with FastAPI</h3>
<p>The backend is organized around clear route groups:</p>
<ul>
<li><p>auth routes for register, login, refresh, logout, and sessions</p>
</li>
<li><p>user analysis routes</p>
</li>
<li><p>wardrobe CRUD routes</p>
</li>
<li><p>recommendation routes for outfits, shopping, and discard analysis</p>
</li>
<li><p>feedback routes for ratings and helpfulness signals</p>
</li>
</ul>
<p>One of the most important design choices was enforcing ownership checks on user-scoped resources. That prevented one user from accessing another user’s wardrobe or feedback data.</p>
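<p>As a hedged sketch (the dependency names are illustrative, not the app's exact code), an ownership guard in FastAPI can look like this:</p>
<pre><code class="language-python">from fastapi import Depends, HTTPException

def get_current_user():
    # Stand-in for the app's real JWT dependency, which resolves the
    # caller from the access token.
    raise NotImplementedError

def require_owner(user_id: int, current_user=Depends(get_current_user)):
    # Reject any request whose path user_id doesn't match the caller.
    if current_user.id != user_id:
        raise HTTPException(status_code=403, detail="Not your resource")
    return current_user

# Applied per route, e.g.:
# @app.get("/users/{user_id}/wardrobe")
# def list_wardrobe(user_id: int, user=Depends(require_owner)): ...
</code></pre>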
<p>The backend snippet below is another simplified example from the app’s route layer. It shows the request validation and orchestration logic, while the actual scoring work stays in the recommendation service.</p>
<pre><code class="language-python">@app.get("/users/{user_id}/outfits/recommend")
def recommend_outfits(user_id: int, occasion: str | None = None, season: str | None = None, limit: int = 10):
    user = get_user_or_404(user_id)
    wardrobe_items = get_user_wardrobe(user_id)

    if len(wardrobe_items) &lt; 2:
        raise HTTPException(status_code=400, detail="Not enough wardrobe items")

    recommendations = outfit_generator.generate_outfit_recommendations(
        wardrobe_items=wardrobe_items,
        body_shape=user.body_shape,
        undertone=user.undertone,
        occasion=occasion,
        season=season,
        top_k=limit,
    )

    return {"user_id": user_id, "recommendations": recommendations}
</code></pre>
<p>Here's how to read that code:</p>
<ul>
<li><p><code>get_user_or_404</code> loads the profile data needed for personalization.</p>
</li>
<li><p><code>get_user_wardrobe</code> fetches only the current user’s items.</p>
</li>
<li><p>The minimum wardrobe check prevents the recommendation logic from running on incomplete data.</p>
</li>
<li><p><code>generate_outfit_recommendations</code> handles the scoring logic separately, which keeps the route handler small and easier to test.</p>
</li>
<li><p>The response returns the results in a shape the frontend can consume directly.</p>
</li>
</ul>
<p>That separation helped keep the API layer readable while the recommendation logic stayed isolated in its own service.</p>
<h3 id="heading-3-recommendation-logic">3. Recommendation Logic</h3>
<p>I intentionally started with deterministic rules before introducing heavy ML. That made behavior easier to debug and explain.</p>
<p>The outfit recommender scores combinations using weighted signals:</p>
<p>$$\text{outfit score} = 0.4 \cdot \text{color harmony} + 0.4 \cdot \text{body-shape fit} + 0.2 \cdot \text{undertone fit}$$</p>
<p>The snippet below is a simplified example from the recommendation engine. It shows how the app combines multiple signals into a single score:</p>
<pre><code class="language-python">def score_outfit(combo, user_context):
    color_score = color_harmony.score(combo)
    shape_score = body_shape_rules.score(combo, user_context.body_shape)
    undertone_score = undertone_rules.score(combo, user_context.undertone)

    total = 0.4 * color_score + 0.4 * shape_score + 0.2 * undertone_score
    return round(total, 3)
</code></pre>
<p>The logic behind this approach is straightforward:</p>
<ul>
<li><p>color harmony helps the outfit feel visually coherent</p>
</li>
<li><p>body-shape scoring helps the outfit feel flattering</p>
</li>
<li><p>undertone scoring helps the colors work better with the user’s profile</p>
</li>
</ul>
<p>I used a similar structure for discard recommendations and shopping suggestions, but with different factors and thresholds.</p>
<h3 id="heading-4-authentication-and-secure-multi-user-design">4. Authentication and Secure Multi-user Design</h3>
<p>Security was one of the most important parts of this build.</p>
<p>I implemented:</p>
<ul>
<li><p>short-lived access tokens</p>
</li>
<li><p>refresh tokens with JTI tracking</p>
</li>
<li><p>token rotation on refresh</p>
</li>
<li><p>session revocation (single session and all sessions)</p>
</li>
<li><p>email verification and password reset flows</p>
</li>
</ul>
<p>The snippet below is a simplified example of the refresh-token lifecycle used in the app. It shows the important control points rather than every helper function:</p>
<pre><code class="language-python">def refresh_access_token(refresh_token: str):
    payload = decode_jwt(refresh_token)
    jti = payload["jti"]

    token_record = db.get_refresh_token(jti)
    if not token_record or token_record.revoked:
        raise AuthError("Invalid refresh token")

    new_refresh, new_jti = issue_refresh_token(payload["sub"])
    token_record.revoked = True
    token_record.replaced_by_jti = new_jti

    new_access = issue_access_token(payload["sub"])
    return {"access_token": new_access, "refresh_token": new_refresh}
</code></pre>
<p>What this code is doing:</p>
<ul>
<li><p>It decodes the refresh token and looks up its JTI in the database.</p>
</li>
<li><p>It rejects reused or revoked sessions, which helps prevent replay attacks.</p>
</li>
<li><p>It rotates the refresh token instead of reusing it.</p>
</li>
<li><p>It issues a fresh access token so the session stays valid without forcing the user to log in again.</p>
</li>
</ul>
<p>This design made multi-device sessions safer and gave me server-side control over logout behavior.</p>
<h3 id="heading-5-background-jobs-for-long-running-operations">5. Background Jobs for Long-running Operations</h3>
<p>Image analysis can be expensive, especially when the app needs to classify clothing, analyze colors, and estimate body-shape-related signals. To keep the request path responsive, I added Celery + Redis support for background tasks.</p>
<p>That gave the app two modes:</p>
<ul>
<li><p>synchronous processing for simpler local development</p>
</li>
<li><p>queued processing for heavier or slower jobs</p>
</li>
</ul>
<p>That tradeoff mattered because it let me keep the developer experience simple without blocking the app during more expensive work.</p>
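<p>A minimal sketch of the queued mode, assuming a local Redis broker (the task body is a placeholder, not the app's actual pipeline):</p>
<pre><code class="language-python">from celery import Celery

celery_app = Celery("wardrobe", broker="redis://localhost:6379/0")

@celery_app.task
def analyze_image(item_id: int) -&gt; None:
    # Placeholder for the real work: classify the garment, extract
    # colors, and persist the results for the recommendation engine.
    print(f"analyzing wardrobe item {item_id}")

# The request handler enqueues and returns immediately:
# analyze_image.delay(item_id=42)
</code></pre>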
<h3 id="heading-6-data-model-and-feedback-capture">6. Data Model and Feedback Capture</h3>
<p>A recommendation system only improves if it captures the right signals.</p>
<p>So I added dedicated feedback tables for:</p>
<ul>
<li><p>outfit ratings (1-5 + optional comments)</p>
</li>
<li><p>recommendation helpful/unhelpful feedback</p>
</li>
<li><p>item usage actions (worn/kept/discarded)</p>
</li>
</ul>
<p>Here is the shape of one of those models:</p>
<pre><code class="language-python">class RecommendationFeedback(Base):
    __tablename__ = "recommendation_feedback"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    recommendation_type = Column(String(50), nullable=False)
    recommendation_id = Column(Integer, nullable=False)
    helpful = Column(Boolean, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)
</code></pre>
<p>How to read this model:</p>
<ul>
<li><p><code>user_id</code> ties feedback to the person who gave it.</p>
</li>
<li><p><code>recommendation_type</code> tells me whether the feedback belongs to outfits, shopping, or discard suggestions.</p>
</li>
<li><p><code>recommendation_id</code> identifies the exact recommendation.</p>
</li>
<li><p><code>helpful</code> stores the user’s direct response.</p>
</li>
<li><p><code>created_at</code> makes it possible to analyze feedback trends over time.</p>
</li>
</ul>
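<p>To show what this shape enables, here is a hedged example of a SQLAlchemy query that summarizes helpfulness per recommendation type. The <code>session</code> argument and the helper function itself are illustrative, not part of the app:</p>
<pre><code class="language-python">from sqlalchemy import Integer, cast, func

def helpfulness_by_type(session):
    # Casting the boolean to 0/1 and averaging gives a helpful rate,
    # while the row count shows how much signal backs each rate.
    return (
        session.query(
            RecommendationFeedback.recommendation_type,
            func.avg(cast(RecommendationFeedback.helpful, Integer))
                .label("helpful_rate"),
            func.count().label("votes"),
        )
        .group_by(RecommendationFeedback.recommendation_type)
        .all()
    )
</code></pre>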
<p>This part of the system gives the app a real learning foundation, even though the feedback-to-model-update loop is still a future improvement.</p>
<h2 id="heading-challenges-i-faced">Challenges I Faced</h2>
<p>This was the section that taught me the most.</p>
<h3 id="heading-1-image-heavy-endpoints-were-slower-than-i-wanted">1. Image-heavy endpoints were slower than I wanted</h3>
<p>The analyze and wardrobe upload flows were doing a lot of work at once: image validation, classification, color extraction, storage, and database writes.</p>
<p>At first, that made the request flow feel heavier than it should have.</p>
<p>What I changed:</p>
<ul>
<li><p>I bounded concurrent image jobs so the app wouldn't try to do too much at once.</p>
</li>
<li><p>I separated slower jobs into background processing where possible.</p>
</li>
<li><p>I used load-test results to confirm which endpoints were actually expensive.</p>
</li>
</ul>
<p>The practical effect was that heavy image requests stopped competing with each other so aggressively. Instead of letting many expensive tasks pile up inside the same request cycle, I limited the active work and pushed slower operations into the queue when needed.</p>
<p>Why this fixed it:</p>
<ul>
<li><p>Bounding concurrency prevented the system from overloading CPU-bound tasks.</p>
</li>
<li><p>Moving expensive work into async jobs kept the main request/response cycle more responsive.</p>
</li>
<li><p>Load testing gave me evidence instead of guesswork, so I could tune the system based on real performance behavior.</p>
</li>
</ul>
<p>In other words, I didn't just “optimize” the endpoint in theory. I changed the execution model so expensive analysis could not block every other request behind it.</p>
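<p>The bounding idea itself is small. Here is a minimal sketch, assuming an asyncio-based request path; the limit of four and the stubbed analysis coroutine are illustrative:</p>
<pre><code class="language-python">import asyncio

# At most four expensive image jobs run at once; the rest wait their turn.
IMAGE_JOB_SLOTS = asyncio.Semaphore(4)

async def run_analysis(image_bytes: bytes) -&gt; dict:
    # Stand-in for classification and color extraction.
    await asyncio.sleep(0.1)
    return {"status": "ok"}

async def analyze_with_limit(image_bytes: bytes) -&gt; dict:
    async with IMAGE_JOB_SLOTS:
        return await run_analysis(image_bytes)
</code></pre>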
<h3 id="heading-2-jwt-sessions-needed-real-server-side-control">2. JWT sessions needed real server-side control</h3>
<p>A basic JWT setup is easy to get working, but it becomes less useful if you cannot revoke sessions or manage multiple devices cleanly.</p>
<p>What I changed:</p>
<ul>
<li><p>I stored refresh tokens in the database.</p>
</li>
<li><p>I tracked token JTI values.</p>
</li>
<li><p>I rotated refresh tokens when users refreshed their session.</p>
</li>
<li><p>I added endpoints for logging out a single session or all sessions.</p>
</li>
</ul>
<p>The important shift here was moving from “token exists, therefore session is valid” to “token exists, matches the database record, and has not been revoked or replaced.” That gave the server the authority to invalidate old sessions immediately.</p>
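<p>Because every refresh token lives in the database, "log out everywhere" becomes a simple update. Here is a sketch of that idea, with a hypothetical <code>db.get_refresh_tokens_for_user</code> helper in the same style as the earlier snippet:</p>
<pre><code class="language-python">def revoke_all_sessions(user_id: int) -&gt; int:
    # Marking every stored refresh token revoked means each device's
    # next refresh attempt fails, forcing a fresh login.
    revoked = 0
    for record in db.get_refresh_tokens_for_user(user_id):
        if not record.revoked:
            record.revoked = True
            revoked += 1
    return revoked
</code></pre>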
<p>Why this fixed it:</p>
<ul>
<li><p>Server-side token tracking made revocation possible.</p>
</li>
<li><p>Rotation reduced the chance of token reuse.</p>
</li>
<li><p>Session management became visible to the user, which made the app feel more trustworthy.</p>
</li>
</ul>
<p>This is what made logout-all and multi-device management work in a real way instead of just being cosmetic UI actions.</p>
<h3 id="heading-3-user-data-isolation-had-to-be-explicit">3. User data isolation had to be explicit</h3>
<p>Because this is a multi-user app, I had to be careful that one account could never accidentally see another account’s wardrobe data.</p>
<p>What I changed:</p>
<ul>
<li><p>I added ownership checks to user-scoped routes.</p>
</li>
<li><p>I kept all wardrobe and feedback queries filtered by <code>user_id</code>.</p>
</li>
<li><p>I used encrypted image storage instead of exposing raw paths.</p>
</li>
</ul>
<p>In practice, this meant every route had to ask the same question: “Does this user own the resource they are trying to access?” If the answer was no, the request stopped immediately.</p>
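<p>A hedged sketch of that pattern, with a placeholder <code>WardrobeItem</code> model and the same <code>AuthError</code> style as the earlier snippet:</p>
<pre><code class="language-python">def get_owned_item(db, user_id: int, item_id: int):
    # Filtering by user_id inside the query means a wrong owner
    # gets "not found" instead of a glimpse of someone else's data.
    item = (
        db.query(WardrobeItem)
        .filter(WardrobeItem.id == item_id,
                WardrobeItem.user_id == user_id)
        .first()
    )
    if item is None:
        raise AuthError("Item not found for this user")
    return item
</code></pre>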
<p>Why this fixed it:</p>
<ul>
<li><p>Ownership checks made data access rules explicit.</p>
</li>
<li><p>User-filtered queries prevented accidental cross-account reads.</p>
</li>
<li><p>Encrypted storage improved privacy and reduced the risk of exposing image data directly.</p>
</li>
</ul>
<p>That combination is what kept wardrobe data, feedback history, and images separated correctly across accounts.</p>
<h3 id="heading-4-docker-made-the-project-easier-to-share-but-only-after-the-stack-was-organized">4. Docker made the project easier to share, but only after the stack was organized</h3>
<p>The app includes the frontend, backend, Redis, Celery worker, and Celery Beat, so the first challenge was making the setup feel reproducible instead of fragile.</p>
<p>What I changed:</p>
<ul>
<li><p>I defined the stack in Docker Compose.</p>
</li>
<li><p>I documented the required environment variables.</p>
</li>
<li><p>I kept the dev stack aligned with how the app runs in practice.</p>
</li>
</ul>
<p>This removed a lot of setup ambiguity. Instead of asking someone to manually figure out how the frontend, backend, Redis, and workers fit together, I made the stack describe itself.</p>
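<p>A trimmed-down sketch of what that Compose file can look like. The service names, build paths, and commands here are illustrative, not the project's exact file:</p>
<pre><code class="language-yaml">services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    env_file: .env
    depends_on: [redis]
  worker:
    build: ./backend
    command: celery -A app.worker worker --loglevel=info
    env_file: .env
    depends_on: [redis]
  beat:
    build: ./backend
    command: celery -A app.worker beat --loglevel=info
    env_file: .env
    depends_on: [redis]
  redis:
    image: redis:7-alpine
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
    depends_on: [backend]
</code></pre>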
<p>Why this fixed it:</p>
<ul>
<li><p>Docker let contributors start the project with fewer manual steps.</p>
</li>
<li><p>Clear environment configuration reduced setup mistakes.</p>
</li>
<li><p>Matching the stack to the architecture made the app easier to understand and test.</p>
</li>
</ul>
<p>That was important because the app depends on several moving parts, and the simplest way to make the project approachable was to make startup behavior predictable.</p>
<h2 id="heading-what-i-learned">What I Learned</h2>
<p>This project taught me a few important lessons:</p>
<ul>
<li><p>Small features become much more valuable when they work together.</p>
</li>
<li><p>Feedback data is one of the strongest signals for improving recommendations.</p>
</li>
<li><p>Clean data modeling matters a lot when multiple users are involved.</p>
</li>
<li><p>Docker and clear setup instructions make a project much easier for other people to try.</p>
</li>
</ul>
<p>I also learned that a project does not need to be huge to be useful. A focused app that solves one problem well can still feel meaningful.</p>
<h2 id="heading-what-i-want-to-improve-next">What I Want to Improve Next</h2>
<p>My roadmap from here:</p>
<ol>
<li><p>Integrate feedback directly into ranking updates</p>
</li>
<li><p>Add visual analytics for recommendation quality trends</p>
</li>
<li><p>Improve mobile UX parity</p>
</li>
<li><p>Deploy with persistent cloud storage and production database defaults</p>
</li>
<li><p>Provide a public demo mode for easier evaluation</p>
</li>
</ol>
<h2 id="heading-future-improvements">Future Improvements</h2>
<p>There are still a few things I would like to add later:</p>
<ul>
<li><p>a more advanced recommendation engine</p>
</li>
<li><p>visual analytics for user feedback</p>
</li>
<li><p>better mobile support</p>
</li>
<li><p>live deployment with persistent cloud storage</p>
</li>
<li><p>a public demo mode for easier testing</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This project began as a personal frustration and turned into a full web application with authentication, wardrobe storage, recommendation logic, and feedback infrastructure.</p>
<p>The most rewarding part was seeing how practical software decisions, not just flashy UI, can help people make everyday choices faster.</p>
<p>If you want to explore or run the project, <a href="https://github.com/Mokshitavp1/fashion_assistant">check out the repo</a>. You can try the flows and share feedback. I would especially love input on recommendation quality, UX clarity, and what features would make this genuinely useful in daily life.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build and Deploy Multi-Architecture Docker Apps on Google Cloud Using ARM Nodes (Without QEMU)
 ]]>
                </title>
                <description>
                    <![CDATA[ If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening ins ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-and-deploy-multi-architecture-docker-apps-on-google-cloud-using-arm-nodes/</link>
                <guid isPermaLink="false">69dcf2c3f57346bc1e05a01d</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ARM ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Amina Lawal ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:42:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e89ae65a-4b3a-44b7-94d8-d0638f017bf6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening inside cloud data centers.</p>
<p>Google Cloud Axion is Google's own custom ARM-based chip, built to handle the demands of modern cloud workloads. The performance and cost numbers are striking: Google claims Axion delivers up to 60% better energy efficiency and up to 65% better price-performance compared to comparable x86 machines.</p>
<p>AWS has Graviton. Azure has Cobalt. ARM is no longer niche. It's the direction the entire cloud industry is moving.</p>
<p>But there's a problem that catches almost every team off guard when they start this transition: <strong>container architecture mismatch</strong>.</p>
<p>If you build a Docker image on your M-series Mac and push it to an x86 server, it crashes on startup with a cryptic <code>exec format error</code>.</p>
<p>The server isn't broken. It just can't read the compiled instructions inside your image. An ARM binary and an x86 binary are written in fundamentally different languages at the machine level. The CPU literally can't execute instructions it wasn't designed for.</p>
<p>We're going to solve this problem completely in this tutorial. You'll build a single Docker image tag that automatically serves the correct binary on both ARM and x86 machines — no separate pipelines, no separate tags. Then you'll provision Google Cloud ARM nodes in GKE and configure your Kubernetes deployment to route workloads precisely to those cost-efficient nodes.</p>
<p><strong>Here's what you'll build, step by step:</strong></p>
<ul>
<li><p>A Go HTTP server that reports the CPU architecture it's running on at runtime</p>
</li>
<li><p>A multi-stage Dockerfile that cross-compiles for both <code>linux/amd64</code> and <code>linux/arm64</code> without slow QEMU emulation</p>
</li>
<li><p>A multi-arch image in Google Artifact Registry that acts as a single entry point for any architecture</p>
</li>
<li><p>A GKE cluster with two node pools: a standard x86 pool and an ARM Axion pool</p>
</li>
<li><p>A Kubernetes Deployment that pins your workload exclusively to the ARM nodes</p>
</li>
</ul>
<p>By the end, you'll hit a live endpoint and see the word <code>arm64</code> staring back at you from a Google Cloud ARM node. Let's get into it.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</a></p>
</li>
<li><p><a href="#heading-step-3-write-the-application">Step 3: Write the Application</a></p>
</li>
<li><p><a href="#heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</a></p>
</li>
<li><p><a href="#heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</a></p>
</li>
<li><p><a href="#heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</a></p>
</li>
<li><p><a href="#heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</a></p>
</li>
<li><p><a href="#heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-project-file-structure">Project File Structure</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following ready:</p>
<ul>
<li><p><strong>A Google Cloud project</strong> with billing enabled. If you don't have one, create it at <a href="https://console.cloud.google.com">console.cloud.google.com</a>. The total cost to follow this tutorial is around $5–10.</p>
</li>
<li><p><code>gcloud</code> <strong>CLI</strong> installed and authenticated. Run <code>gcloud auth login</code> to sign in and <code>gcloud config set project YOUR_PROJECT_ID</code> to point it at your project.</p>
</li>
<li><p><strong>Docker Desktop</strong> version 19.03 or later. Docker Buildx (the tool we'll use for multi-arch builds) ships bundled with it.</p>
</li>
<li><p><code>kubectl</code> installed. This is the CLI for interacting with Kubernetes clusters.</p>
</li>
<li><p>Basic familiarity with <strong>Docker</strong> (images, layers, Dockerfile) and <strong>Kubernetes</strong> (pods, deployments, services). You don't need to be an expert, but you should know what these things are.</p>
</li>
</ul>
<h2 id="heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</h2>
<p>Before writing a single line of application code, let's get the cloud infrastructure side ready. This is the foundation everything else will build on.</p>
<h3 id="heading-enable-the-required-apis">Enable the Required APIs</h3>
<p>Google Cloud services are off by default in any new project. Run this command to turn on the three APIs we'll need:</p>
<pre><code class="language-bash">gcloud services enable \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  containeranalysis.googleapis.com
</code></pre>
<p>Here's what each one does:</p>
<ul>
<li><p><code>artifactregistry.googleapis.com</code> — enables <strong>Artifact Registry</strong>, where we'll store our Docker images</p>
</li>
<li><p><code>container.googleapis.com</code> — enables <strong>Google Kubernetes Engine (GKE)</strong>, where our cluster will run</p>
</li>
<li><p><code>containeranalysis.googleapis.com</code> — enables vulnerability scanning for images stored in Artifact Registry</p>
</li>
</ul>
<h3 id="heading-create-a-docker-repository-in-artifact-registry">Create a Docker Repository in Artifact Registry</h3>
<p>Artifact Registry is Google Cloud's managed container image store — the place where our built images will live before being deployed to the cluster. Create a dedicated repository for this tutorial:</p>
<pre><code class="language-bash">gcloud artifacts repositories create multi-arch-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Multi-arch tutorial images"
</code></pre>
<p>Breaking down the flags:</p>
<ul>
<li><p><code>--repository-format=docker</code> — tells Artifact Registry this repository stores Docker images (as opposed to npm packages, Maven artifacts, and so on)</p>
</li>
<li><p><code>--location=us-central1</code> — the Google Cloud region where your images will be stored. Use a region that's close to where your cluster will run to minimize image pull latency. Run <code>gcloud artifacts locations list</code> to see all options.</p>
</li>
<li><p><code>--description</code> — a human-readable label for the repository, shown in the console.</p>
</li>
</ul>
<h3 id="heading-authenticate-docker-to-push-to-artifact-registry">Authenticate Docker to Push to Artifact Registry</h3>
<p>Docker needs credentials before it can push images to Google Cloud. Run this command to wire up authentication automatically:</p>
<pre><code class="language-bash">gcloud auth configure-docker us-central1-docker.pkg.dev
</code></pre>
<p>This adds a credential helper entry to your <code>~/.docker/config.json</code> file. What that means in practice: any time Docker tries to push or pull from a URL under <code>us-central1-docker.pkg.dev</code>, it will automatically call <code>gcloud</code> to get a valid auth token. You won't need to run <code>docker login</code> manually.</p>
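<p>After running it, the relevant part of <code>config.json</code> looks roughly like this:</p>
<pre><code class="language-json">{
  "credHelpers": {
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
</code></pre>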
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/31fd020f-ffa2-40bd-9057-57b16a61b325.png" alt="Terminal output of the gcloud artifacts repositories list command, showing a row for multi-arch-repo with format DOCKER, location us-central1" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</h2>
<p>With Artifact Registry ready to receive images, let's create the Kubernetes cluster. We'll start with a standard cluster using x86 nodes and add an ARM node pool later once we have an image to deploy.</p>
<pre><code class="language-bash">gcloud container clusters create axion-tutorial-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-standard-2 \
  --workload-pool=PROJECT_ID.svc.id.goog
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>What each flag does:</p>
<ul>
<li><p><code>--zone=us-central1-a</code> — creates a zonal cluster in a single availability zone. A regional cluster (using <code>--region</code>) would spread nodes across three zones for higher resilience, but for this tutorial a single zone keeps things simple and avoids capacity issues that can affect specific zones. If <code>us-central1-a</code> is unavailable, try <code>us-central1-b</code>.</p>
</li>
<li><p><code>--num-nodes=2</code> — two x86 nodes in this zone. We need at least 2 to have enough capacity alongside our ARM node pool later.</p>
</li>
<li><p><code>--machine-type=e2-standard-2</code> — the machine type for this default node pool. <code>e2-standard-2</code> is a cost-effective x86 machine with 2 vCPUs and 8 GB of memory, good for general workloads.</p>
</li>
<li><p><code>--workload-pool=PROJECT_ID.svc.id.goog</code> — enables <strong>Workload Identity</strong>, which is Google's recommended way for pods to authenticate with Google Cloud APIs. It avoids the need to download and store service account key files inside your cluster.</p>
</li>
</ul>
<p>This command takes a few minutes. While it runs, you can move on to writing the application. We'll come back to the cluster in Step 6.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/332250a8-3f99-4eb1-849f-51ab054c9567.png" alt="GCP Console Kubernetes Engine Clusters page showing axion-tutorial-cluster with a green checkmark status, the zone us-central1-a, and Kubernetes version in the table." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-3-write-the-application">Step 3: Write the Application</h2>
<p>We need an application to containerize. We'll use <strong>Go</strong> for three specific reasons:</p>
<ol>
<li><p>Go compiles into a single, statically-linked binary. There's no runtime to install, no interpreter — just the binary. This makes for extremely lean container images.</p>
</li>
<li><p>Go has first-class, built-in cross-compilation support. We can compile an ARM64 binary from an x86 Mac, or vice versa, by setting two environment variables. This will matter a lot when we get to the Dockerfile.</p>
</li>
<li><p>Go exposes the architecture the binary was compiled for via <code>runtime.GOARCH</code>. Our server will report this at runtime, giving us hard proof that the correct binary is running on the correct hardware.</p>
</li>
</ol>
<p>Start by creating the project directories:</p>
<pre><code class="language-bash">mkdir -p hello-axion/app hello-axion/k8s
cd hello-axion/app
</code></pre>
<p>Initialize the Go module from inside <code>app/</code>. This creates <code>go.mod</code> in the current directory:</p>
<pre><code class="language-bash">go mod init hello-axion
</code></pre>
<p><code>go mod init</code> is Go's built-in command for starting a new module. It writes a <code>go.mod</code> file that declares the module name (<code>hello-axion</code>) and the minimum Go version required. Every modern Go project needs this file — without it, the compiler doesn't know how to resolve packages.</p>
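<p>The generated <code>go.mod</code> is tiny. Depending on your installed toolchain, it looks something like this:</p>
<pre><code class="language-plaintext">module hello-axion

go 1.23
</code></pre>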
<p>Now create the application at <code>app/main.go</code>:</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
    "os"
    "runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
    hostname, _ := os.Hostname()
    fmt.Fprintf(w, "Hello from freeCodeCamp!\n")
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
    fmt.Fprintf(w, "OS           : %s\n", runtime.GOOS)
    fmt.Fprintf(w, "Pod hostname : %s\n", hostname)
}

func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/healthz", healthz)
    fmt.Println("Server starting on port 8080...")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        fmt.Fprintf(os.Stderr, "server error: %v\n", err)
        os.Exit(1)
    }
}
</code></pre>
<p>Verify both files were created:</p>
<pre><code class="language-bash">ls -la
</code></pre>
<p>You should see <code>go.mod</code> and <code>main.go</code> listed.</p>
<p>Let's walk through what this code does:</p>
<ul>
<li><p><code>import "runtime"</code> — imports Go's built-in <code>runtime</code> package, which exposes information about the Go runtime environment, including the CPU architecture.</p>
</li>
<li><p><code>runtime.GOARCH</code> — returns a string like <code>"arm64"</code> or <code>"amd64"</code> representing the architecture this binary was compiled for. When we deploy to an ARM node, this value will be <code>arm64</code>. This is the core of our proof.</p>
</li>
<li><p><code>os.Hostname()</code> — returns the pod's hostname, which Kubernetes sets to the pod name. This lets us see which specific pod responded when we test the app later.</p>
</li>
<li><p><code>handler</code> — the main HTTP handler, registered on the root path <code>/</code>. It writes the architecture, OS, and hostname to the response.</p>
</li>
<li><p><code>healthz</code> — a separate handler registered on <code>/healthz</code>. It returns HTTP 200 with the text <code>ok</code>. Kubernetes will use this endpoint to check whether the container is alive and ready to serve traffic — we'll wire this up in the deployment manifest later.</p>
</li>
<li><p><code>http.ListenAndServe(":8080", nil)</code> — starts the server on port 8080. If it fails to start (for example, if the port is already in use), it prints the error and exits with a non-zero code so Kubernetes knows something went wrong.</p>
</li>
</ul>
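<p>You can preview the cross-compilation point right now, before any Docker is involved. This optional check assumes Go is installed locally and uses the common <code>file</code> utility to confirm the target architecture:</p>
<pre><code class="language-bash"># Compile a Linux ARM64 binary, regardless of your host machine
GOOS=linux GOARCH=arm64 go build -o server-arm64 main.go

# Inspect the result; expect something like "ELF 64-bit ... ARM aarch64"
file server-arm64
</code></pre>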
<h2 id="heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</h2>
<p>Before we write the Dockerfile, we need to understand a fundamental constraint, because it directly shapes how the Dockerfile must be written.</p>
<h3 id="heading-why-your-docker-images-are-architecture-specific-by-default">Why Your Docker Images Are Architecture-Specific By Default</h3>
<p>A CPU only understands instructions written for its specific <strong>Instruction Set Architecture (ISA)</strong>. ARM64 and x86_64 are different ISAs — different vocabularies of machine-level operations. When you compile a Go program, the compiler translates your source code into binary instructions for exactly one ISA. That binary can't run on a different ISA.</p>
<p>When you build a Docker image the normal way (<code>docker build</code>), the binary inside that image is compiled for your local machine's ISA. If you're on an Apple Silicon Mac, you get an ARM64 binary. Push that image to an x86 server, and when Docker tries to execute the binary, the kernel rejects it:</p>
<pre><code class="language-shell">standard_init_linux.go:228: exec user process caused: exec format error
</code></pre>
<p>That's the operating system saying: "This binary was written for a different processor. I have no idea what to do with it."</p>
<h3 id="heading-the-solution-a-single-image-tag-that-serves-any-architecture">The Solution: A Single Image Tag That Serves Any Architecture</h3>
<p>Docker solves this with a structure called a <strong>Manifest List</strong> (also called a multi-arch image index). Instead of one image, a Manifest List is a pointer table. It holds multiple image references — one per architecture — all under the same tag.</p>
<p>When a server pulls <code>hello-axion:v1</code>, here's what actually happens:</p>
<ol>
<li><p>Docker contacts the registry and requests the manifest for <code>hello-axion:v1</code></p>
</li>
<li><p>The registry returns the Manifest List, which looks like this internally:</p>
</li>
</ol>
<pre><code class="language-json">{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}
</code></pre>
<ol>
<li>Docker checks the current machine's architecture, finds the matching entry, and pulls only that specific image layer. The x86 image never downloads onto your ARM server, and vice versa.</li>
</ol>
<p>One tag, two actual images. Completely transparent to your deployment manifests.</p>
<h3 id="heading-set-up-docker-buildx">Set Up Docker Buildx</h3>
<p><strong>Docker Buildx</strong> is the CLI tool that builds these Manifest Lists. It's powered by the <strong>BuildKit</strong> engine and ships bundled with Docker Desktop. Run the following to create and activate a new builder instance:</p>
<pre><code class="language-bash">docker buildx create --name multiarch-builder --use
</code></pre>
<ul>
<li><p><code>--name multiarch-builder</code> — gives this builder a memorable name. You can have multiple builders. This command creates a new one named <code>multiarch-builder</code>.</p>
</li>
<li><p><code>--use</code> — immediately sets this new builder as the active one, so all future <code>docker buildx build</code> commands use it.</p>
</li>
</ul>
<p>Now boot the builder and confirm it supports the platforms we need:</p>
<pre><code class="language-bash">docker buildx inspect --bootstrap
</code></pre>
<ul>
<li><code>--bootstrap</code> — starts the builder container if it isn't already running, and prints its full configuration.</li>
</ul>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">Name:          multiarch-builder
Driver:        docker-container
Platforms:     linux/amd64, linux/arm64, linux/arm/v7, linux/386, ...
</code></pre>
<p>The <code>Platforms</code> line lists every architecture this builder can produce images for. As long as you see <code>linux/amd64</code> and <code>linux/arm64</code> in that list, you're ready to build for both x86 and ARM.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/1c19aca1-30c4-406d-9c37-679ee4f2928f.png" alt="Terminal output showing the multiarch-builder details with Name, Driver set to docker-container, and a Platforms list that includes linux/amd64 and linux/arm64 highlighted." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</h2>
<p>Now we can write the Dockerfile. We'll use two techniques together: a <strong>multi-stage build</strong> to keep the final image tiny, and a <strong>cross-compilation trick</strong> to avoid slow CPU emulation.</p>
<p>Create <code>app/Dockerfile</code> with the following content:</p>
<pre><code class="language-dockerfile"># -----------------------------------------------------------
# Stage 1: Build
# -----------------------------------------------------------
# $BUILDPLATFORM = the machine running this build (your laptop)
# $TARGETOS / $TARGETARCH = the platform we are building FOR
# -----------------------------------------------------------
FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /app

COPY go.mod .
RUN go mod download

COPY main.go .

RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go

# -----------------------------------------------------------
# Stage 2: Runtime
# -----------------------------------------------------------

FROM alpine:latest

RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup
USER appuser

WORKDIR /app
COPY --from=builder /app/server .

EXPOSE 8080
CMD ["./server"]
</code></pre>
<p>There's a lot happening here. Let's go through it carefully.</p>
<h3 id="heading-stage-1-the-builder">Stage 1: The Builder</h3>
<p><code>FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder</code></p>
<p>This is the most important line in the file. <code>$BUILDPLATFORM</code> is a special build argument that Docker Buildx automatically injects — it equals the platform of the machine <em>running the build</em> (your laptop). By pinning the builder stage to <code>$BUILDPLATFORM</code>, the Go compiler always runs natively on your machine, not inside a CPU emulator. This is what makes multi-arch builds fast.</p>
<p>Without <code>--platform=$BUILDPLATFORM</code>, Buildx would have to use <strong>QEMU</strong> — a full CPU emulator — to run an ARM64 build environment on your x86 machine (or vice versa). QEMU works, but it's typically 5–10 times slower than native execution. For a project with many dependencies, that's the difference between a 2-minute build and a 20-minute build.</p>
<p><code>ARG TARGETOS</code> <strong>and</strong> <code>ARG TARGETARCH</code></p>
<p>These two lines declare that our Dockerfile expects build arguments named <code>TARGETOS</code> and <code>TARGETARCH</code>. Buildx injects these automatically based on the <code>--platform</code> flag you pass at build time. For a <code>linux/arm64</code> target, <code>TARGETOS</code> will be <code>linux</code> and <code>TARGETARCH</code> will be <code>arm64</code>.</p>
<p><code>COPY go.mod .</code> <strong>and</strong> <code>RUN go mod download</code></p>
<p>We copy <code>go.mod</code> first, before copying the rest of the source code. Docker builds images layer by layer and caches each layer. By copying only the module file first, we create a cached layer for <code>go mod download</code>.</p>
<p>On future builds, as long as <code>go.mod</code> hasn't changed, Docker skips the download step entirely — even if the source code changed. This speeds up iterative development significantly.</p>
<p><code>RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go</code></p>
<p>This is the cross-compilation step. <code>GOOS</code> and <code>GOARCH</code> are Go's built-in cross-compilation environment variables. Setting them tells the Go compiler to produce a binary for a different target than the machine it's running on. We set them from the <code>$TARGETOS</code> and <code>$TARGETARCH</code> build args injected by Buildx.</p>
<p>The <code>-ldflags="-w -s"</code> flag strips the debug symbol table and the DWARF debugging information from the binary. This has no effect on runtime behavior but reduces the binary size by roughly 30%.</p>
<h3 id="heading-stage-2-the-runtime-image">Stage 2: The Runtime Image</h3>
<p><code>FROM alpine:latest</code></p>
<p>This starts a brand-new image from Alpine Linux — a minimal Linux distribution that weighs about 5 MB. Critically, <code>alpine:latest</code> is itself a multi-arch image, so Docker automatically selects the <code>arm64</code> or <code>amd64</code> Alpine variant depending on which platform this stage is built for.</p>
<p>Everything from Stage 1 — the Go toolchain, the source files, the intermediate object files — is discarded. The final image contains <em>only</em> Alpine Linux plus our binary. Compared to a naive single-stage Go image (~300 MB), this approach produces an image under 15 MB.</p>
<p><code>RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup</code> and <code>USER appuser</code></p>
<p>These two lines create a non-root user and set it as the active user for the container. Running containers as root is a security risk — if an attacker exploits a vulnerability in your application, they gain root access inside the container. Running as a non-root user limits the blast radius.</p>
<p><code>COPY --from=builder /app/server .</code></p>
<p>This is how multi-stage builds work: the <code>--from=builder</code> flag tells Docker to copy files from the <code>builder</code> stage (Stage 1), not from your local disk. Only the compiled binary (<code>server</code>) makes it into the final image.</p>
<h2 id="heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</h2>
<p>With the application and Dockerfile in place, we can now build images for both architectures and push them to Artifact Registry — all in a single command.</p>
<p>From inside the <code>app/</code> directory, run:</p>
<pre><code class="language-bash">docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 \
  --push \
  .
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual GCP project ID.</p>
<p>Here's what each part of this command does:</p>
<ul>
<li><p><code>docker buildx build</code> — uses the Buildx CLI instead of the standard <code>docker build</code>. Buildx is required for multi-platform builds.</p>
</li>
<li><p><code>--platform linux/amd64,linux/arm64</code> — instructs Buildx to build the image twice: once targeting x86 Intel/AMD machines, and once targeting ARM64. Both builds run in parallel. Because our Dockerfile uses the <code>$BUILDPLATFORM</code> cross-compilation trick, both builds run natively on your machine without QEMU emulation.</p>
</li>
<li><p><code>-t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1</code> — the full image path in Artifact Registry. The format is always <code>REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG</code>.</p>
</li>
<li><p><code>--push</code> — multi-arch images can't be loaded into your local Docker daemon (which only understands single-architecture images). This flag tells Buildx to skip local storage and push the completed Manifest List — with both architecture variants — directly to the registry.</p>
</li>
<li><p><code>.</code> — the build context, the directory Docker scans for the Dockerfile and any files the build needs.</p>
</li>
</ul>
<p>Watch the output as the build runs. You'll see BuildKit working on both platforms simultaneously:</p>
<pre><code class="language-plaintext"> =&gt; [linux/amd64 builder 1/5] FROM golang:1.23-alpine
 =&gt; [linux/arm64 builder 1/5] FROM golang:1.23-alpine
 ...
 =&gt; pushing manifest for us-central1-docker.pkg.dev/.../hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/dc88f558-b4ee-4100-bfe1-eaa943bec9bc.png" alt="Terminal showing docker buildx build output with two parallel build tracks labeled linux/amd64 and linux/arm64, and a final line reading pushing manifest for the Artifact Registry image path." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-verify-the-multi-arch-image-in-artifact-registry">Verify the Multi-Arch Image in Artifact Registry</h3>
<p>Once the push completes, navigate to <strong>GCP Console → Artifact Registry → Repositories → multi-arch-repo</strong> and click on <code>hello-axion</code>.</p>
<p>You won't see a single image — you'll see something labelled <strong>"Image Index"</strong>. That's the Manifest List we created. Click into it, and you'll find two child images with separate digests, one for <code>linux/amd64</code> and one for <code>linux/arm64</code>.</p>
<p>You can also inspect this from the command line:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/28d0e4a4-1d45-4c0b-ac47-34dc3b72c11d.png" alt="Google Cloud Artifact Registry console showing hello-axion as an Image Index with two child images: one labeled linux/amd64 and one labeled linux/arm64, each with its own digest and size." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The output lists every manifest inside the image index. You'll see entries for <code>linux/amd64</code> and <code>linux/arm64</code> — those are our two real images. You'll also see two entries with <code>Platform: unknown/unknown</code> labelled as <code>attestation-manifest</code>. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to prove how and where the image was built (a supply chain security feature called SLSA attestation).</p>
<p>The two entries you care about are <code>linux/amd64</code> and <code>linux/arm64</code>. Note the digest for the <code>arm64</code> entry — we'll use it in the verification step to confirm the cluster pulled the right variant.</p>
<h2 id="heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</h2>
<p>We have a universal image. Now we need somewhere to run it.</p>
<p>Recall the cluster we created in Step 2 — it's running <code>e2-standard-2</code> x86 machines. We're going to add a second node pool running ARM machines. This is the key architectural move: a <strong>mixed-architecture cluster</strong> where different workloads can be routed to different hardware.</p>
<h3 id="heading-choosing-your-arm-machine-type">Choosing Your ARM Machine Type</h3>
<p>Google Cloud currently offers two ARM-based machine series in GKE:</p>
<table>
<thead>
<tr>
<th>Series</th>
<th>Example type</th>
<th>What it is</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tau T2A</strong></td>
<td><code>t2a-standard-2</code></td>
<td>First-gen Google ARM (Ampere Altra). Broadly available across regions. Great for getting started.</td>
</tr>
<tr>
<td><strong>Axion (C4A)</strong></td>
<td><code>c4a-standard-2</code></td>
<td>Google's custom ARM chip (Arm Neoverse V2 core). Newest generation, best price-performance. Still expanding availability.</td>
</tr>
</tbody></table>
<p>This tutorial uses <code>t2a-standard-2</code> because it's widely available. The commands are identical for <code>c4a-standard-2</code> — just swap the <code>--machine-type</code> value. If <code>t2a-standard-2</code> isn't available in your zone, GKE will tell you immediately when you run the node pool creation command below, and you can try a neighbouring zone.</p>
<h3 id="heading-create-the-arm-node-pool">Create the ARM Node Pool</h3>
<p>Add the ARM node pool to your existing cluster:</p>
<pre><code class="language-bash">gcloud container node-pools create axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --num-nodes=2 \
  --node-labels=workload-type=arm-optimized
</code></pre>
<p>What each flag does:</p>
<ul>
<li><p><code>--cluster=axion-tutorial-cluster</code> — the name of the cluster we created in Step 2. Node pools are always added to an existing cluster.</p>
</li>
<li><p><code>--zone=us-central1-a</code> — must match the zone you used when creating the cluster.</p>
</li>
<li><p><code>--machine-type=t2a-standard-2</code> — GKE detects this is an ARM machine type and automatically provisions the nodes with an ARM-compatible version of Container-Optimized OS (COS). You don't need to configure anything special for ARM at the OS level.</p>
</li>
<li><p><code>--num-nodes=2</code> — two ARM nodes in the zone, enough to schedule our 3-replica deployment alongside other cluster overhead.</p>
</li>
<li><p><code>--node-labels=workload-type=arm-optimized</code> — attaches a custom label to every node in this pool. We'll use this label in our deployment manifest to target these specific nodes. Using a descriptive custom label (rather than just relying on the automatic <code>kubernetes.io/arch=arm64</code> label) is good practice in real clusters — it communicates the <em>intent</em> of the pool, not just its hardware.</p>
</li>
</ul>
<p>This command takes a few minutes. Once it completes, let's confirm our cluster now has both node pools:</p>
<pre><code class="language-bash">gcloud container clusters get-credentials axion-tutorial-cluster --zone=us-central1-a

kubectl get nodes --label-columns=kubernetes.io/arch
</code></pre>
<p>The <code>get-credentials</code> command configures <code>kubectl</code> to authenticate with your new cluster. The <code>get nodes</code> command then lists all nodes and adds a column showing the <code>kubernetes.io/arch</code> label.</p>
<p>You should see something like:</p>
<pre><code class="language-plaintext">NAME                                    STATUS   ARCH    AGE
gke-...default-pool-abc...              Ready    amd64   15m
gke-...default-pool-def...              Ready    amd64   15m
gke-...axion-pool-jkl...                Ready    arm64   3m
gke-...axion-pool-mno...                Ready    arm64   3m
</code></pre>
<p><code>amd64</code> for the default x86 pool, <code>arm64</code> for our new Axion pool. This <code>kubernetes.io/arch</code> label is applied automatically by GKE — you don't set it, it's derived from the hardware.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/6389f4c6-17fe-4086-982f-39d94dbfa252.png" alt="Terminal output of kubectl get nodes with a ARCH column showing amd64 for two default-pool nodes and arm64 for two axion-pool nodes." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</h2>
<p>We have a multi-arch image and a mixed-architecture cluster. Here's something important to understand before writing the deployment manifest: <strong>Kubernetes doesn't know or care about image architecture by default</strong>.</p>
<p>If you applied a standard Deployment right now, the scheduler would look for any available node with enough CPU and memory and place pods there — potentially landing on x86 nodes instead of your ARM Axion nodes. The multi-arch Manifest List would handle this gracefully (the right binary would run regardless), but you'd lose the cost efficiency you provisioned Axion nodes for in the first place.</p>
<p>To guarantee that pods land on ARM nodes and only ARM nodes, we use a <code>nodeSelector</code>.</p>
<h3 id="heading-how-nodeselector-works">How nodeSelector Works</h3>
<p>A <code>nodeSelector</code> is a set of key-value pairs in your pod spec. Before the Kubernetes scheduler places a pod, it checks every available node's labels. If a node doesn't have all the labels in the <code>nodeSelector</code>, the scheduler skips it — the pod will remain in <code>Pending</code> state rather than land on the wrong node.</p>
<p>This is a hard constraint, which is exactly what we want for cost optimization. Contrast this with Node Affinity's soft preference mode (<code>preferredDuringSchedulingIgnoredDuringExecution</code>), which says "try to use ARM, but fall back to x86 if needed." Soft preferences are useful for resilience, but they undermine the whole point of dedicated ARM pools. We want the hard constraint.</p>
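<p>For reference, the soft-preference variant would look like the snippet below in a pod spec. We are <em>not</em> using this in the tutorial; it's shown only to make the contrast concrete:</p>
<pre><code class="language-yaml"># "Prefer ARM, but fall back to any node" — the opposite of our hard pin
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
</code></pre>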
<h3 id="heading-write-the-deployment-manifest">Write the Deployment Manifest</h3>
<p>Create <code>k8s/deployment.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-axion
  labels:
    app: hello-axion
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-axion
  template:
    metadata:
      labels:
        app: hello-axion
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64

      containers:
      - name: hello-axion
        image: us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        resources:
          requests:
            cpu: "250m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your project ID. Here's what the key sections do:</p>
<p><code>replicas: 3</code> — tells Kubernetes to keep three instances of this pod running at all times. If one crashes or a node goes down, the scheduler spins up a replacement. With two ARM nodes in <code>axion-pool</code>, the scheduler will spread the three replicas across both nodes, so losing a single node never takes the whole service down.</p>
<p><code>selector.matchLabels</code> and <code>template.metadata.labels</code> — these two blocks must match. The <code>selector</code> tells the Deployment which pods it "owns," and the <code>template.metadata.labels</code> is what those pods will be tagged with. If they don't match, Kubernetes won't be able to manage the pods.</p>
<p><code>nodeSelector: kubernetes.io/arch: arm64</code> — this is the pin. The Kubernetes scheduler filters out every node that doesn't carry this label before considering resource availability. Since GKE automatically applies <code>kubernetes.io/arch=arm64</code> to all ARM nodes, our pods will schedule only onto the <code>axion-pool</code> nodes.</p>
<p><code>livenessProbe</code> — periodically calls <code>GET /healthz</code>. If this check fails a certain number of times in a row (indicating the container has deadlocked or is otherwise unresponsive), Kubernetes restarts the container. <code>initialDelaySeconds: 5</code> gives the server 5 seconds to start up before the first check.</p>
<p><code>readinessProbe</code> — similar to the liveness probe, but with a different purpose. While the readiness probe is failing, Kubernetes removes the pod from the service's load balancer, so no traffic is sent to it. This is important during startup — the pod won't receive traffic until it signals it's ready.</p>
<p><code>resources.requests</code> — reserves <code>250m</code> (25% of a CPU core) and <code>64Mi</code> of memory on the node for this pod. The scheduler uses these numbers to decide whether a node has enough room for the pod. Setting requests is required for sensible bin-packing. Without them, nodes can be silently overcommitted.</p>
<p><code>resources.limits</code> — caps the container at <code>500m</code> CPU and <code>128Mi</code> memory. If the container exceeds these limits, Kubernetes throttles the CPU or kills the container (for memory). This prevents a single misbehaving pod from starving other workloads on the same node.</p>
<h3 id="heading-a-note-on-taints-and-tolerations">A Note on Taints and Tolerations</h3>
<p>Once you're comfortable with <code>nodeSelector</code>, the next step in production clusters is adding a <strong>taint</strong> to your ARM node pool. A taint is a repellent — any pod without an explicit <strong>toleration</strong> for that taint is blocked from landing on the tainted node.</p>
<p>This means other workloads in your cluster can't accidentally consume your ARM capacity. You'd add the taint when creating the pool:</p>
<pre><code class="language-bash"># Add --node-taints to the pool creation command:
--node-taints=workload-type=arm-optimized:NoSchedule
</code></pre>
<p>And a matching toleration in the pod spec:</p>
<pre><code class="language-yaml">tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "arm-optimized"
  effect: "NoSchedule"
</code></pre>
<p>We're not doing this in the tutorial to keep things simple, but it's the pattern production multi-tenant clusters use to enforce hard separation between workload types.</p>
<h3 id="heading-write-the-service-manifest">Write the Service Manifest</h3>
<p>We also need a Kubernetes Service to expose the pods over the network. Create <code>k8s/service.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: hello-axion-svc
spec:
  selector:
    app: hello-axion
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
</code></pre>
<ul>
<li><p><code>selector: app: hello-axion</code> — the Service discovers pods using labels. Any pod with <code>app: hello-axion</code> on it will be added to this Service's load balancer pool.</p>
</li>
<li><p><code>port: 80</code> — the port the Service is reachable on from outside the cluster.</p>
</li>
<li><p><code>targetPort: 8080</code> — the port on the pod that traffic gets forwarded to. Our Go server listens on port 8080, so this must match.</p>
</li>
<li><p><code>type: LoadBalancer</code> — tells GKE to provision an external Google Cloud load balancer and assign it a public IP. This is what makes the Service reachable from the internet.</p>
</li>
</ul>
<h3 id="heading-apply-both-manifests">Apply Both Manifests</h3>
<pre><code class="language-bash">kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
</code></pre>
<p><code>kubectl apply</code> reads each manifest file and creates or updates the resources described in it. If the resources don't exist yet, they're created. If they already exist, Kubernetes only applies the diff — it won't restart pods unnecessarily.</p>
<p>Watch the pods come up in real time:</p>
<pre><code class="language-bash">kubectl get pods -w
</code></pre>
<p>The <code>-w</code> flag watches for changes and prints updates as they happen. You should see pods transition from <code>Pending</code> → <code>ContainerCreating</code> → <code>Running</code>. Once all three show <code>Running</code>, press <code>Ctrl+C</code> to stop watching.</p>
<h2 id="heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</h2>
<p>Everything is running. Now we need evidence — not just that pods are up, but that they're on the right nodes and serving the right binary.</p>
<h3 id="heading-confirm-pod-placement">Confirm Pod Placement</h3>
<pre><code class="language-bash">kubectl get pods -o wide
</code></pre>
<p>The <code>-o wide</code> flag adds extra columns to the output, including the name of the node each pod was scheduled on. Look at the <code>NODE</code> column:</p>
<pre><code class="language-plaintext">NAME                          READY   STATUS    NODE
hello-axion-7b8d9f-abc12      1/1     Running   gke-axion-tutorial-axion-pool-a-...
hello-axion-7b8d9f-def34      1/1     Running   gke-axion-tutorial-axion-pool-b-...
hello-axion-7b8d9f-ghi56      1/1     Running   gke-axion-tutorial-axion-pool-c-...
</code></pre>
<p>All three pods should show node names containing <code>axion-pool</code>. None should show <code>default-pool</code>.</p>
<h3 id="heading-confirm-the-nodes-are-arm">Confirm the Nodes Are ARM</h3>
<p>Take one of those node names and verify its architecture label:</p>
<pre><code class="language-bash">kubectl get node NODE_NAME --show-labels | grep kubernetes.io/arch
</code></pre>
<p>Replace <code>NODE_NAME</code> with one of the node names from the previous command. You should see:</p>
<pre><code class="language-plaintext">kubernetes.io/arch=arm64
</code></pre>
<p>That's the automatic label GKE applied when it provisioned the ARM hardware. Our <code>nodeSelector</code> matched on this label to pin the pods here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/815312ea-e2bf-4106-863e-55cd0bdad5f7.png" alt="Terminal split into two sections: the top showing kubectl get pods -o wide with all pods scheduled on nodes containing axion-pool in the name, and the bottom showing kubectl get node with kubernetes.io/arch=arm64 in the labels output." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-ask-the-application-itself">Ask the Application Itself</h3>
<p>This is the most satisfying verification step. Our Go server reports the architecture of the binary that's running. Let's ask it directly.</p>
<p>Use <code>kubectl port-forward</code> to create a secure tunnel from port 8080 on your local machine to port 8080 on the Deployment:</p>
<pre><code class="language-bash">kubectl port-forward deployment/hello-axion 8080:8080
</code></pre>
<p>This command stays running in the foreground — open a <strong>second terminal window</strong> and run:</p>
<pre><code class="language-bash">curl http://localhost:8080
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">Hello from freeCodeCamp!
Architecture : arm64
OS           : linux
Pod hostname : hello-axion-7b8d9f-abc12
</code></pre>
<p><code>Architecture : arm64</code>. That's our Go binary confirming that it was compiled for ARM64 and is executing on an ARM64 CPU. The single image tag we built does the right thing automatically.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/114ff82d-950f-4059-a1fa-89baffb90b6c.png" alt="Terminal output of curl http://localhost:8080 showing the four-line response: Hello from freeCodeCamp, Architecture: arm64, OS: linux, and the pod hostname." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-the-bonus-see-the-manifest-list-in-action">The Bonus: See the Manifest List in Action</h3>
<p>Want to see the multi-arch image indexing at work? Stop the port-forward, then run:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>You'll see four entries in the manifest list. Two are real images — <code>Platform: linux/amd64</code> and <code>Platform: linux/arm64</code>. The other two are the <code>unknown/unknown</code> attestation manifests covered in Step 6: build provenance records that Buildx attaches automatically.</p>
<p>You may notice that if you check the image digest recorded in a running pod:</p>
<pre><code class="language-bash">kubectl get pod POD_NAME \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
</code></pre>
<p>Replace <code>POD_NAME</code> with one of the pod names from earlier.</p>
<p>The digest returned matches the <strong>top-level manifest list digest</strong>, not the <code>arm64</code>-specific one. This is expected behaviour. Modern Kubernetes (using containerd) records the manifest list digest, not the resolved platform digest. The platform resolution already happened when the node pulled the correct image variant.</p>
<p>The definitive proof that the right binary is running is what you already have: the node labeled <code>kubernetes.io/arch=arm64</code> and the application reporting <code>Architecture: arm64</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/7dffe0c8-28cf-4a5d-8459-1e8db3da7dc0.png" alt="top-level manifest list digest" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</h2>
<p>The hands-on work is done. Let's talk about why any of this is worth the effort.</p>
<h3 id="heading-the-cost-math">The Cost Math</h3>
<p>At the time of writing, here's how ARM compares to equivalent x86 machines on Google Cloud (prices are approximate and change over time — check the <a href="https://cloud.google.com/compute/vm-instance-pricing">official pricing page</a> before making decisions):</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>vCPU</th>
<th>Memory</th>
<th>Approx. $/hour</th>
</tr>
</thead>
<tbody><tr>
<td><code>n2-standard-4</code> (x86)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.19</td>
</tr>
<tr>
<td><code>t2a-standard-4</code> (Tau ARM)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.14</td>
</tr>
<tr>
<td><code>c4a-standard-4</code> (Axion)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.15</td>
</tr>
</tbody></table>
<p>That's a raw reduction of roughly 21–26% in compute cost per node. Factor in Google's published claim of up to 65% better price-performance for Axion on relevant workloads — meaning you may need fewer nodes to handle the same traffic — and the savings compound further.</p>
<p>Here's how that looks at scale, for a service running 20 nodes continuously for a year:</p>
<ul>
<li><p>20 × <code>n2-standard-4</code> × $0.19/hour × 8,760 hours = <strong>$33,288/year</strong></p>
</li>
<li><p>20 × <code>t2a-standard-4</code> × $0.14/hour × 8,760 hours = <strong>$24,528/year</strong></p>
</li>
</ul>
<p>That's roughly <strong>$8,760 saved annually</strong> on compute, before committed use discounts (which further widen the gap).</p>
<h3 id="heading-when-arm-is-the-right-choice">When ARM Is the Right Choice</h3>
<p>ARM works best for:</p>
<ul>
<li><p><strong>Stateless API servers and web applications</strong> — like the app we built. ARM excels at high-throughput, low-latency network workloads.</p>
</li>
<li><p><strong>Background workers and queue processors</strong> — long-running services that don't depend on x86-specific binaries.</p>
</li>
<li><p><strong>Microservices written in Go, Rust, or Python</strong> — these ecosystems have mature ARM64 support, so the same code typically builds and runs on ARM without changes.</p>
</li>
</ul>
<h3 id="heading-when-to-proceed-carefully">When to Proceed Carefully</h3>
<ul>
<li><p><strong>Native library dependencies</strong> — some older C libraries, proprietary SDKs, or compiled ML model-serving runtimes don't have ARM64 builds. Always audit your dependency tree before migrating.</p>
</li>
<li><p><strong>CI pipelines need ARM too</strong> — your automated tests should run on ARM, not just x86. An image that silently fails only on ARM is harder to debug than one that never claimed ARM support.</p>
</li>
<li><p><strong>Profile before optimizing</strong> — the cost savings are real, but measure your actual workload behavior on ARM before committing. Not every workload benefits equally.</p>
</li>
</ul>
<h2 id="heading-cleanup">Cleanup</h2>
<p>When you're done, clean up to avoid ongoing charges:</p>
<pre><code class="language-bash"># Remove the Kubernetes resources from the cluster
kubectl delete -f k8s/

# Delete the ARM node pool
gcloud container node-pools delete axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the cluster itself
gcloud container clusters delete axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the images from Artifact Registry (optional — storage costs are minimal)
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Let's recap what you built and why each part matters.</p>
<p>You started with a Go application, a Dockerfile, and a <code>docker buildx build</code> command that produced two images — one for x86, one for ARM64 — wrapped in a single Manifest List tag. Any server that pulls that tag gets the right binary automatically, without you maintaining separate pipelines or separate tags.</p>
<p>You provisioned a GKE cluster with two node pools running different CPU architectures, then used <code>nodeSelector</code> to make sure your ARM-optimized workload lands only on the ARM Axion nodes — not on x86 by accident. The result is a deployment that's both architecture-correct and cost-efficient.</p>
<p>The patterns you practiced here don't stop at this demo. The same Dockerfile technique works for any language with cross-compilation support. The same <code>nodeSelector</code> approach works for any workload you want to pin to ARM. As more teams migrate services to ARM over the coming years, having these skills will be a real asset.</p>
<p><strong>Where to go from here:</strong></p>
<ul>
<li><p>Add a GitHub Actions workflow that runs <code>docker buildx build --platform linux/amd64,linux/arm64</code> on every push, automating this entire process in CI.</p>
</li>
<li><p>Audit one of your existing stateless services for ARM compatibility and try migrating it.</p>
</li>
<li><p>Explore <strong>Node Affinity</strong> as a softer alternative to <code>nodeSelector</code> for workloads that can run on either architecture but prefer ARM.</p>
</li>
<li><p>Look into <strong>GKE Autopilot</strong>, which now supports ARM nodes and handles node pool management automatically.</p>
</li>
</ul>
<p>Happy building.</p>
<h2 id="heading-project-file-structure">Project File Structure</h2>
<pre><code class="language-plaintext">hello-axion/
├── app/
│   ├── main.go          — Go HTTP server
│   ├── go.mod           — Go module definition
│   └── Dockerfile       — Multi-stage Dockerfile
└── k8s/
    ├── deployment.yaml  — Deployment with nodeSelector and probes
    └── service.yaml     — LoadBalancer Service
</code></pre>
<p>All source files for this tutorial are available in the companion GitHub repository: <a href="https://github.com/Amiynarh/multi-arch-docker-gke-arm">https://github.com/Amiynarh/multi-arch-docker-gke-arm</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Self-Host Your Own Server Monitoring Dashboard Using Uptime Kuma and Docker ]]>
                </title>
                <description>
                    <![CDATA[ As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains. And while many SaaS tools can  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/self-host-uptime-kuma-docker/</link>
                <guid isPermaLink="false">69d4185f40c9cabf44851652</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ monitoring ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Ubuntu ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:32:31 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ea068a20-bc19-400a-a42e-1bbb7e492da8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains.</p>
<p>And while many SaaS tools can monitor your site, they often charge high monthly fees for simple alerts.</p>
<p>My goal with this article is to help you stop paying those expensive fees by showing you a powerful, free, open-source alternative called Uptime Kuma.</p>
<p>In this guide, you'll learn how to use Docker to deploy Uptime Kuma safely on a local Ubuntu machine.</p>
<p>By the end of this tutorial, you'll have set up your own private server monitoring dashboard in less than 10 minutes and created an automated Discord alert to ping your phone if your website goes offline.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-update-packages-and-prepare-the-firewall">Step 1: Update Packages and Prepare the Firewall</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-docker-compose-file">Step 2: Create the Docker Compose File</a></p>
</li>
<li><p><a href="#heading-step-3-start-the-application">Step 3: Start the Application</a></p>
</li>
<li><p><a href="#heading-step-4-access-the-dashboard">Step 4: Access the Dashboard</a></p>
</li>
<li><p><a href="#heading-step-5-use-case-monitor-a-website-and-send-discord-alerts">Step 5: Use Case – Monitor a Website and Send Discord Alerts</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>An Ubuntu machine (like a local server, VM, or desktop).</p>
</li>
<li><p>Docker and Docker Compose installed.</p>
</li>
<li><p>Basic knowledge of the Linux terminal.</p>
</li>
</ul>
<h2 id="heading-step-1-update-packages-and-prepare-the-firewall">Step 1: Update Packages and Prepare the Firewall</h2>
<p>First, you'll want to make sure your system has the newest updates. Then, you'll install the Uncomplicated Firewall (UFW) and open the network "door" (port) that Uptime Kuma uses for the dashboard. You'll also need to allow SSH so you don't lock yourself out.</p>
<p>Run these commands in your terminal:</p>
<ol>
<li>Update your packages:</li>
</ol>
<pre><code class="language-shell">sudo apt update &amp;&amp; sudo apt upgrade -y
</code></pre>
<ol>
<li>Install the firewall:</li>
</ol>
<pre><code class="language-shell">sudo apt install ufw -y
</code></pre>
<ol>
<li>Allow SSH and open port 3001:</li>
</ol>
<pre><code class="language-shell">sudo ufw allow ssh
sudo ufw allow 3001/tcp
</code></pre>
<ol>
<li>Enable the firewall:</li>
</ol>
<pre><code class="language-shell">sudo ufw enable
sudo ufw reload
</code></pre>
<h2 id="heading-step-2-create-the-docker-compose-file">Step 2: Create the Docker Compose File</h2>
<p>Using a <code>docker-compose.yml</code> file is the professional way to manage Docker containers. It keeps your setup organised in one single place.</p>
<p>To start, create a new folder for your project and enter it:</p>
<pre><code class="language-shell">mkdir uptime-kuma &amp;&amp; cd uptime-kuma
</code></pre>
<p>Then create the configuration file:</p>
<pre><code class="language-shell">nano docker-compose.yml
</code></pre>
<p>Paste the following code into the editor:</p>
<pre><code class="language-yaml">services:
  uptime-kuma:
    image: louislam/uptime-kuma:2
    restart: unless-stopped
    volumes:
      - ./data:/app/data
    ports:
      - "3001:3001"
</code></pre>
<p><strong>Note</strong>: The <code>./data:/app/data</code> line is very important. It saves your database in a normal folder on your machine, making it easy to back up later.</p>
<p>Finally, save and exit: Press <code>CTRL + X</code>, then <code>Y</code>, then <code>Enter</code>.</p>
<h2 id="heading-step-3-start-the-application">Step 3: Start the Application</h2>
<p>Now, tell Docker to read your file and start the monitoring service in the background.</p>
<pre><code class="language-shell">docker compose up -d
</code></pre>
<p><strong>How to verify:</strong> Docker will pull the Uptime Kuma image and start the container. When it finishes, your terminal should print <code>Started uptime-kuma</code>.</p>
<h2 id="heading-step-4-access-the-dashboard">Step 4: Access the Dashboard</h2>
<p>To access the dashboard, first open your web browser and go to <code>http://localhost:3001</code> (or your machine's local IP address).</p>
<p>When asked to choose the database, select <strong>SQLite</strong>. It's simple, fast, and requires no extra setup.</p>
<p>Then create an account and choose a secure admin username and password.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/02913589-020e-4a8a-aa7a-1bf70a9244c6.png" alt="02913589-020e-4a8a-aa7a-1bf70a9244c6" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-5-use-case-monitor-a-website-and-send-discord-alerts">Step 5: Use Case – Monitor a Website and Send Discord Alerts</h2>
<p>Now you'll put Uptime Kuma to work by monitoring a live website and setting up an alert. Just follow these steps:</p>
<ol>
<li><p>Click Add New Monitor.</p>
</li>
<li><p>Set the Monitor Type to <code>HTTP(s)</code>.</p>
</li>
<li><p>Give it a Friendly Name (e.g., "My Blog") and enter your website's URL.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/74567f1e-acc4-480f-b969-7883e01aa459.png" alt="74567f1e-acc4-480f-b969-7883e01aa459" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-pro-tip-how-to-fix-down-errors-bot-protection">Pro-Tip: How to Fix "Down" Errors (Bot Protection)</h3>
<p>If your site uses strict security, it might block Uptime Kuma and say your site is "Down" with a 403 Forbidden error.</p>
<p><strong>The Fix:</strong> Scroll down to Advanced, find the User Agent box, and paste this text to make Uptime Kuma look like a normal Chrome browser:</p>
<p><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36</code></p>
<h3 id="heading-add-a-discord-alert">Add a Discord Alert</h3>
<p>To get a message on your phone when your site goes down:</p>
<ol>
<li><p>On the right side of the monitor screen, click Setup Notification.</p>
</li>
<li><p>Select Discord from the dropdown list.</p>
</li>
<li><p>Paste a Discord Webhook URL (you can create one in your Discord server settings under Integrations).</p>
</li>
<li><p>Click Test to receive a test ping, then click Save.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You just took control of your server health. By deploying Uptime Kuma, you replaced an expensive SaaS subscription with a powerful, free monitoring tool that alerts you the second a project goes offline.</p>
<p><strong>Let’s connect!</strong> I am a developer and technical writer specialising in writing step-by-step guides and workflows. You can find my latest projects on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me directly on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Bank Ledger in Golang with PostgreSQL using the Double-Entry Accounting Principle. ]]>
                </title>
                <description>
                    <![CDATA[ The Hidden Bugs in How Most Developers Store Money Imagine you're building the backend for a million-dollar fintech app. You store each user's balance as a single number in the database. It feels simp ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-bank-ledger-in-go-with-postgresql-using-the-double-entry-accounting-principle/</link>
                <guid isPermaLink="false">69c4173d10e664c5dac8cea1</guid>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PostgreSQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ banking ]]>
                    </category>
                
                    <category>
                        <![CDATA[ accounting ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ double entry ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Paul Babatuyi ]]>
                </dc:creator>
                <pubDate>Wed, 25 Mar 2026 17:11:25 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/faea1d4c-5319-4746-96b0-315f37017e26.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <h2 id="heading-the-hidden-bugs-in-how-most-developers-store-money">The Hidden Bugs in How Most Developers Store Money</h2>
<p>Imagine you're building the backend for a million-dollar fintech app. You store each user's balance as a single number in the database. It feels simple: just update the number when money moves.</p>
<p>But with one line of code like <code>UPDATE accounts SET balance = balance - 100</code>, you've created a system that can silently lose millions. A server crash, a race condition, or a clever attack, and suddenly money vanishes or appears out of thin air.</p>
<p>There's no audit trail, no way to know what happened, and no way to prove it didn't happen on purpose.</p>
<p>This isn't just a theoretical risk. It's a trap that's caught even experienced developers. The world's most trusted financial systems avoid it by using double-entry accounting. Every transaction creates two records: a debit on one account, a credit on another. This lets you reconstruct every cent from history, catch inconsistencies, and audit every transaction.</p>
<p>There are no deletes, and no silent updates. Just an append-only trail that makes fraud and bugs much harder to hide.</p>
<p>In this guide, you'll build a robust backend in Go and PostgreSQL, using patterns inspired by real fintech companies. You'll learn how to design a double-entry ledger, generate type-safe SQL with sqlc, and write transactions that are safe even under heavy load.</p>
<p>By the end, you'll understand why these patterns matter –&nbsp;and how to use them to build software you can trust with real money.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites-and-project-overview">Prerequisites and Project Overview</a></p>
</li>
<li><p><a href="#heading-the-double-entry-foundation-how-every-penny-is-accounted-for">The Double-Entry Foundation</a></p>
</li>
<li><p><a href="#heading-type-safe-sql-with-sqlc-no-more-surprises">Type-Safe SQL with sqlc</a></p>
</li>
<li><p><a href="#heading-the-store-layer-transactions-and-automatic-retries">The Store Layer: Transactions and Retries</a></p>
</li>
<li><p><a href="#heading-the-service-layer-where-business-logic-meets-double-entry">The Service Layer: Business Logic</a></p>
</li>
<li><p><a href="#heading-the-api-layer-secure-predictable-and-boring-by-design">The API Layer</a></p>
</li>
<li><p><a href="#heading-running-it-locally-your-first-end-to-end-test">Running It Locally</a></p>
</li>
<li><p><a href="#heading-testing-prove-the-system-works">Testing: Prove the System Works</a></p>
</li>
<li><p><a href="#heading-deployment-engineering-decisions-that-matter-in-production">Deployment</a></p>
</li>
<li><p><a href="#heading-conclusion-building-for-the-real-world">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-project-resources">Project Resources:</h3>
<p>Here's the project repository: <a href="https://github.com/PaulBabatuyi/double-entry-bank-Go">https://github.com/PaulBabatuyi/double-entry-bank-Go</a></p>
<p>And here's the front-end repository: <a href="https://github.com/PaulBabatuyi/double-entry-bank">https://github.com/PaulBabatuyi/double-entry-bank</a></p>
<p>You can find the live frontend here: <a href="https://golangbank.app">https://golangbank.app</a></p>
<img src="https://cdn.hashnode.com/uploads/covers/6968db1b0578d1643036e600/2240e617-5a6d-4742-995f-6ecb8fecb56e.png" alt="Double-entry frontend transaction" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can find the live Swagger back-end API here: <a href="https://golangbank.app/swagger">https://golangbank.app/swagger</a></p>
<img src="https://cdn.hashnode.com/uploads/covers/6968db1b0578d1643036e600/3a6c1e02-5ceb-43e4-86a3-0530735b79cb.png" alt="Backend API endpoints (Swagger)" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-prerequisites-and-project-overview">Prerequisites and Project Overview</h2>
<p>Before you dive in, make sure you have the following installed:</p>
<ul>
<li><p>Go 1.23 or newer</p>
</li>
<li><p>Docker and Docker Compose</p>
</li>
<li><p><code>golang-migrate</code> CLI: <code>go install github.com/golang-migrate/migrate/v4/cmd/migrate@latest</code></p>
</li>
<li><p><code>sqlc</code> CLI: <code>go install github.com/sqlc-dev/sqlc/cmd/sqlc@latest</code></p>
</li>
</ul>
<p>You'll also need a basic understanding of PostgreSQL and REST APIs to follow along.</p>
<p>If you've built a CRUD app before, you're ready for this. The project uses sqlc for type-safe queries, JWT for authentication, and a layered architecture that keeps business logic, persistence, and HTTP handling cleanly separated.</p>
<p>Here's how the project is organized:</p>
<pre><code class="language-plaintext">.
├── cmd/                # Server entrypoint
│   └── main.go
├── internal/
│   ├── api/            # HTTP handlers &amp; middleware
│   ├── db/             # Store layer (transactions, sqlc)
│   └── service/        # Business logic (ledger operations)
├── postgres/
│   ├── migrations/     # SQL migration files
│   └── queries/        # sqlc query files
├── docs/               # Swagger docs
├── Dockerfile, docker-compose.yml, Makefile
└── README.md
</code></pre>
<p>The architecture follows a clear three-layer pattern:</p>
<ul>
<li><p><strong>API Layer</strong>: Handles HTTP requests, authentication, and routing.</p>
</li>
<li><p><strong>Service Layer</strong>: Contains the business logic. This is where double-entry rules are enforced.</p>
</li>
<li><p><strong>Store Layer</strong>: Manages database transactions and persistence.</p>
</li>
</ul>
<p>Every request flows from the handler, through the service, to the store, and finally to PostgreSQL. This separation makes the code easier to test, debug, and extend.</p>
<h3 id="heading-backend-request-flow">Backend Request Flow</h3>
<pre><code class="language-mermaid">graph TD
    A[HTTP Request] --&gt; B[Handler - API Layer]
    B --&gt; C[LedgerService - Business Logic]
    C --&gt; D[Store - Persistence Layer]
    D --&gt; E[(PostgreSQL)]
    E --&gt; D
    D --&gt; C
    C --&gt; B
    B --&gt; F[HTTP Response]
</code></pre>
<h2 id="heading-the-double-entry-foundation-how-every-penny-is-accounted-for">The Double-Entry Foundation: How Every Penny is Accounted For</h2>
<p>Let's get to the heart of what makes this system bulletproof: double-entry accounting. Every operation – a deposit, withdrawal, or transfer&nbsp;– creates two entries that always balance. This is the secret sauce that keeps banks, payment apps, and even crypto exchanges from losing track of money.</p>
<p>Picture a simple deposit of $1,000:</p>
<pre><code class="language-plaintext">| Account              | Debit   | Credit  |
|----------------------|---------|---------|
| User Account         |         | 1,000   |
| Settlement Account   | 1,000   |         |
</code></pre>
<p>Total debits always equal total credits. This is the fundamental rule. Every single operation in this system produces exactly this structure, with no exceptions.</p>
<p>Now picture a $200 transfer from User A to User B. Again there are exactly two entries – a debit for the sender and a credit for the receiver – and neither touches the settlement account:</p>
<pre><code class="language-plaintext">| Account       | Debit   | Credit  | Description           |
|---------------|---------|---------|-----------------------|
| User A        | 200     |         | Transfer to User B    |
| User B        |         | 200     | Transfer from User A  |
</code></pre>
<p>Both entries share the same <code>transaction_id</code>, so you can always retrieve the complete picture of what happened with a single query. There's no guessing and no reconstructing, as the ledger tells the full story.</p>
<h3 id="heading-why-the-settlement-account-goes-negative">Why the Settlement Account Goes Negative</h3>
<p>This trips up newcomers, so it's worth explaining explicitly. When a user deposits $1,000, the settlement account is debited $1,000. After several user deposits, the settlement balance will be negative. That's correct and expected: it represents the total amount of real-world money currently held inside the system on behalf of users. The invariant is:</p>
<pre><code class="language-plaintext">SUM(all user account balances) + settlement balance = 0
</code></pre>
<p>If that ever doesn't hold, something is broken.</p>
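<p>If you want to verify this invariant programmatically, a background check can sum every balance and assert the total is zero. The sketch below assumes a <code>*sql.DB</code> handle and the <code>shopspring/decimal</code> package the project already uses – <code>checkLedgerInvariant</code> itself is a hypothetical helper, not part of the repository:</p>
<pre><code class="language-go">import (
    "context"
    "database/sql"
    "fmt"

    "github.com/shopspring/decimal"
)

// checkLedgerInvariant sums every account balance (users plus settlement)
// and verifies the total is exactly zero. Hypothetical helper.
func checkLedgerInvariant(ctx context.Context, db *sql.DB) error {
    var total string
    err := db.QueryRowContext(ctx,
        `SELECT COALESCE(SUM(balance), 0)::TEXT FROM accounts`).Scan(&amp;total)
    if err != nil {
        return err
    }
    sum, err := decimal.NewFromString(total)
    if err != nil {
        return err
    }
    if !sum.IsZero() {
        return fmt.Errorf("ledger invariant violated: balances sum to %s, want 0", total)
    }
    return nil
}
</code></pre>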
<h3 id="heading-enforcing-the-rules-in-the-database">Enforcing the Rules in the Database</h3>
<p>The database itself enforces these rules, not just the application code. Here's the core of the <code>entries</code> table migration:</p>
<pre><code class="language-sql">CREATE TABLE IF NOT EXISTS entries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id UUID NOT NULL REFERENCES accounts(id) ON DELETE RESTRICT,
    debit NUMERIC(19,4) NOT NULL DEFAULT 0.0000 CHECK (debit &gt;= 0),
    credit NUMERIC(19,4) NOT NULL DEFAULT 0.0000 CHECK (credit &gt;= 0),
    transaction_id UUID NOT NULL,
    operation_type operation_type NOT NULL,
    description TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT check_single_side CHECK (
        (debit &gt; 0 AND credit = 0) OR (debit = 0 AND credit &gt; 0)
    )
);
</code></pre>
<p>Let's break down why each piece matters:</p>
<ul>
<li><p><strong>Single-sided entries are impossible.</strong> The <code>check_single_side</code> constraint means every entry must be either a debit or a credit, never both. If you try to insert an invalid row, the database rejects it – there's no way around it.</p>
</li>
<li><p><strong>Every transaction is linked.</strong> Both the debit and credit entries share the same <code>transaction_id</code> (a UUID). This lets you fetch both sides of any operation instantly, making audits and debugging straightforward.</p>
</li>
<li><p><strong>Operation types are explicit.</strong> The <code>operation_type</code> column is an enum at the database level, so only valid types like <code>deposit</code>, <code>withdrawal</code>, or <code>transfer</code> are allowed. There are no typos and no surprises.</p>
</li>
</ul>
<h3 id="heading-the-settlement-account-the-systems-anchor">The Settlement Account: The System's Anchor</h3>
<p>Every real-world ledger needs a way to represent money entering or leaving the system. That's what the settlement account does. Here's how it's seeded in the database:</p>
<pre><code class="language-sql">INSERT INTO accounts (id, name, balance, currency, is_system)
SELECT gen_random_uuid(), 'Settlement Account', 0.0000, 'USD', TRUE
WHERE NOT EXISTS (
    SELECT 1 FROM accounts WHERE is_system = TRUE AND name = 'Settlement Account'
);
</code></pre>
<p>The settlement account represents the "outside world." When a user deposits money, it comes from the settlement account. When they withdraw, it goes back. Using <code>WHERE NOT EXISTS</code> makes this migration idempotent –&nbsp;that is, safe to run multiple times without creating duplicates.</p>
<h2 id="heading-type-safe-sql-with-sqlc-no-more-surprises">Type-Safe SQL with sqlc: No More Surprises</h2>
<p>In financial systems, you can't afford surprises from your database layer. That's why this project uses sqlc, a tool that turns your SQL queries into type-safe Go code at compile time.</p>
<p>With sqlc, you see exactly what SQL runs, catch mistakes before they hit production, and avoid the "magic" (and hidden bugs) of most ORMs. Every query is explicit, every type is checked, and you get the best of both worlds: raw SQL power with Go's safety.</p>
<h3 id="heading-why-numeric-becomes-string-and-not-float64">Why NUMERIC Becomes String (and Not float64)</h3>
<p>Here's a subtle but critical detail from <code>sqlc.yaml</code>:</p>
<pre><code class="language-yaml">overrides:
    - db_type: "pg_catalog.numeric"
      go_type: "string"
    - column: "entries.debit"
      go_type: "string"
    - column: "entries.credit"
      go_type: "string"
    - column: "accounts.balance"
      go_type: "string"
    - db_type: "operation_type"
      go_type: "string"
</code></pre>
<p><strong>Why string, not float64?</strong> Floating point arithmetic is imprecise. <code>0.1 + 0.2</code> in most programming languages does not equal exactly <code>0.3</code>.</p>
<p>For money, you need exact decimal arithmetic. This project uses <code>shopspring/decimal</code> for all calculations and stores amounts as strings, converting at the service layer boundary. The database column itself is <code>NUMERIC(19,4)</code>, which stores exact decimals – no float rounding ever touches your money.</p>
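<p>You can watch the problem happen in a few lines of Go – this is just an illustration, not project code:</p>
<pre><code class="language-go">package main

import (
    "fmt"

    "github.com/shopspring/decimal"
)

func main() {
    // float64: binary floating point can't represent 0.1 or 0.2 exactly.
    x, y := 0.1, 0.2
    fmt.Println(x+y == 0.3)    // false
    fmt.Printf("%.17f\n", x+y) // 0.30000000000000004

    // decimal: exact base-10 arithmetic, the same model as NUMERIC(19,4).
    a := decimal.RequireFromString("0.1")
    b := decimal.RequireFromString("0.2")
    fmt.Println(a.Add(b).Equal(decimal.RequireFromString("0.3"))) // true
}
</code></pre>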
<h3 id="heading-preventing-race-conditions-locking-with-for-update">Preventing Race Conditions: Locking with FOR UPDATE</h3>
<p>One of the most important queries in the system is <code>GetAccountForUpdate</code>:</p>
<pre><code class="language-sql">SELECT * FROM accounts
WHERE id = $1
LIMIT 1
FOR UPDATE; -- locks row for update, prevents TOCTOU races
</code></pre>
<p>This query uses <code>FOR UPDATE</code> to lock the account row during a transaction. Why? Imagine two requests both see a $500 balance and both try to withdraw $400. Without locking, both would succeed, and you'd end up with a negative balance. With <code>FOR UPDATE</code>, the second transaction waits until the first finishes, eliminating this classic race condition.</p>
<h3 id="heading-calculating-the-true-balance-always-trust-the-entries">Calculating the True Balance: Always Trust the Entries</h3>
<p>The real source of truth for any account is the sum of its entries, not the denormalized <code>balance</code> column. Here's the reconciliation query:</p>
<pre><code class="language-sql">SELECT CAST(
    (COALESCE(SUM(credit), 0::NUMERIC) - COALESCE(SUM(debit), 0::NUMERIC))
    AS NUMERIC(19,4)
) AS calculated_balance
FROM entries
WHERE account_id = $1;
</code></pre>
<p>This computes the true balance from the ledger itself. It's how you catch bugs, audit the system, and prove that every penny is accounted for. The <code>balance</code> column on accounts is a denormalized cache for fast reads –&nbsp;and this query is the ground truth that validates it.</p>
<h2 id="heading-the-store-layer-transactions-and-automatic-retries">The Store Layer: Transactions and Automatic Retries</h2>
<p>Every financial operation in this system runs inside a transaction –&nbsp;no exceptions. This is enforced by the <code>ExecTx</code> pattern in the store layer:</p>
<pre><code class="language-go">func (store *Store) ExecTx(ctx context.Context, fn func(q *sqlc.Queries) error) error {
    const maxAttempts = 10
    var lastErr error
    for attempt := 0; attempt &lt; maxAttempts; attempt++ {
        lastErr = store.execTxOnce(ctx, fn)
        if lastErr == nil {
            return nil
        }
        if !isSerializationError(lastErr) {
            return lastErr
        }
        if attempt &lt; maxAttempts-1 {
            if waitErr := sleepWithContext(ctx, retryWait(attempt)); waitErr != nil {
                return waitErr
            }
        }
    }
    return fmt.Errorf("transaction failed after %d attempts due to serialization conflicts: %w", maxAttempts, lastErr)
}
</code></pre>
<h3 id="heading-why-serializable-isolation">Why Serializable Isolation?</h3>
<p>The transaction uses PostgreSQL's strictest isolation level: <code>sql.LevelSerializable</code>. This is like running transactions one at a time, eliminating entire classes of concurrency bugs. If two operations would conflict, PostgreSQL aborts one and returns a serialization error (SQLSTATE 40001).</p>
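<p>The single-attempt helper that <code>ExecTx</code> wraps isn't shown above. It looks roughly like this – a sketch assuming the store holds a <code>*sql.DB</code> as <code>store.db</code> and uses sqlc's generated <code>New</code> constructor; the repository's version may differ in details:</p>
<pre><code class="language-go">func (store *Store) execTxOnce(ctx context.Context, fn func(q *sqlc.Queries) error) error {
    // Open the transaction at PostgreSQL's strictest isolation level.
    tx, err := store.db.BeginTx(ctx, &amp;sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return err
    }
    q := sqlc.New(tx) // bind the generated queries to this transaction
    if err := fn(q); err != nil {
        if rbErr := tx.Rollback(); rbErr != nil {
            return fmt.Errorf("tx error: %v, rollback error: %v", err, rbErr)
        }
        return err
    }
    return tx.Commit()
}
</code></pre>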
<h3 id="heading-automatic-retries-handling-real-world-concurrency">Automatic Retries: Handling Real-World Concurrency</h3>
<p>When a serialization error occurs, the code automatically retries with exponential backoff:</p>
<pre><code class="language-go">func retryWait(attempt int) time.Duration {
    base := 50 * time.Millisecond
    for i := 0; i &lt; attempt; i++ {
        base *= 2
        if base &gt;= time.Second {
            return time.Second
        }
    }
    return base
}

func sleepWithContext(ctx context.Context, d time.Duration) error {
    select {
    case &lt;-ctx.Done():
        return ctx.Err()
    case &lt;-time.After(d):
        return nil
    }
}
</code></pre>
<p>The backoff starts at 50ms and doubles each attempt, capping at 1 second. Up to 10 attempts are made. If the client disconnects mid-retry, <code>sleepWithContext</code> detects the cancelled context and returns immediately. This means no wasted resources.</p>
<h2 id="heading-the-service-layer-where-business-logic-meets-double-entry">The Service Layer: Where Business Logic Meets Double-Entry</h2>
<p>The service layer is the heart of the system. Its job is to translate business operations – deposits, withdrawals, transfers – into double-entry journal entries that always balance.</p>
<h3 id="heading-deposit-crediting-the-user-debiting-the-settlement">Deposit: Crediting the User, Debiting the Settlement</h3>
<p>Every deposit creates two entries: a credit to the user's account and a matching debit to the settlement account. Both entries share the same transaction ID.</p>
<pre><code class="language-go">func (s *LedgerService) Deposit(ctx context.Context, accountID uuid.UUID, amountStr string) error {
    amount, err := validatePositiveAmount(amountStr)
    if err != nil {
        return err
    }
    return s.store.ExecTx(ctx, func(q *sqlc.Queries) error {
        settlement, err := q.GetSettlementAccountForUpdate(ctx)
        if err != nil {
            return fmt.Errorf("settlement account not found: %w", err)
        }
        account, err := q.GetAccountForUpdate(ctx, accountID)
        if err != nil {
            return fmt.Errorf("account not found: %w", err)
        }
        if account.Currency != settlement.Currency {
            return ErrCurrencyMismatch
        }
        txID := uuid.New()
        // 1. Credit user account
        _, err = q.CreateEntry(ctx, sqlc.CreateEntryParams{
            AccountID:     accountID,
            Debit:         decimal.Zero.StringFixed(4),
            Credit:        amount.StringFixed(4),
            TransactionID: txID,
            OperationType: "deposit",
            Description:   sql.NullString{String: "External deposit", Valid: true},
        })
        if err != nil { return err }
        // 2. Debit settlement (opposing entry)
        _, err = q.CreateEntry(ctx, sqlc.CreateEntryParams{
            AccountID:     settlement.ID,
            Debit:         amount.StringFixed(4),
            Credit:        decimal.Zero.StringFixed(4),
            TransactionID: txID,
            OperationType: "deposit",
            Description:   sql.NullString{String: fmt.Sprintf("Deposit to account %s", accountID), Valid: true},
        })
        if err != nil { return err }
        // 3. Update both balances atomically
        if err = q.UpdateAccountBalance(ctx, sqlc.UpdateAccountBalanceParams{
            Balance: amount.StringFixed(4), ID: accountID,
        }); err != nil { return err }
        return q.UpdateAccountBalance(ctx, sqlc.UpdateAccountBalanceParams{
            Balance: amount.Neg().StringFixed(4), ID: settlement.ID,
        })
    })
}
</code></pre>
<p>Two things are worth highlighting. First, both accounts are locked with <code>GetAccountForUpdate</code> and <code>GetSettlementAccountForUpdate</code> before any entries are written. This prevents any other concurrent transaction from reading a stale balance and acting on it.</p>
<p>Second, <code>amount.Neg()</code> is used to debit the settlement. Its balance goes down, representing real money now held inside the system.</p>
<h3 id="heading-withdraw-debiting-the-user-crediting-the-settlement">Withdraw: Debiting the User, Crediting the Settlement</h3>
<p>Withdrawals are the mirror image of deposits. The key difference is the insufficient funds check, which must happen inside the transaction after the lock is acquired:</p>
<pre><code class="language-go">balanceDec, err := decimal.NewFromString(account.Balance)
if err != nil {
    return errors.New("invalid balance")
}
if balanceDec.LessThan(amount) {
    return ErrInsufficientFunds
}
</code></pre>
<p>Checking balance inside the transaction after <code>FOR UPDATE</code> is critical. Checking it before, outside the transaction, would create a classic time-of-check-to-time-of-use (TOCTOU) race. Two concurrent withdrawals could both pass the check, then both execute, overdrawing the account.</p>
<p>The entries for a $500 withdrawal look like this:</p>
<pre><code class="language-plaintext">| Account              | Debit   | Credit  |
|----------------------|---------|---------|
| User Account         | 500     |         |
| Settlement Account   |         | 500     |
</code></pre>
<p>The settlement is credited because real money is leaving the system, and it's being "returned" to the outside world.</p>
<h3 id="heading-transfer-user-to-user-no-settlement-involved">Transfer: User-to-User, No Settlement Involved</h3>
<p>Transfers move money directly between two user accounts. The settlement account isn't involved. Both accounts are locked, currency is validated, and an insufficient funds check runs before any entries are created:</p>
<pre><code class="language-go">func (s *LedgerService) Transfer(ctx context.Context, fromID, toID uuid.UUID, amountStr string) error {
    amount, err := validatePositiveAmount(amountStr)
    if err != nil { return err }
    if fromID == toID {
        return ErrSameAccountTransfer
    }
    return s.store.ExecTx(ctx, func(q *sqlc.Queries) error {
        fromAcc, err := q.GetAccountForUpdate(ctx, fromID)
        if err != nil { return err }
        toAcc, err := q.GetAccountForUpdate(ctx, toID)
        if err != nil { return err }
        if fromAcc.Currency != toAcc.Currency {
            return ErrCurrencyMismatch
        }
        fromBalance, _ := decimal.NewFromString(fromAcc.Balance)
        if fromBalance.LessThan(amount) {
            return ErrInsufficientFunds
        }
        txID := uuid.New()
        // Debit sender, credit receiver — same transaction ID
        // ... CreateEntry calls + UpdateAccountBalance calls
    })
}
</code></pre>
<p>A $200 transfer creates exactly two entries under the same <code>transaction_id</code>:</p>
<pre><code class="language-plaintext">| Account  | Debit   | Credit  |
|----------|---------|---------|
| Sender   | 200     |         |
| Receiver |         | 200     |
</code></pre>
<h3 id="heading-reconcileaccount-trust-but-verify">ReconcileAccount: Trust, But Verify</h3>
<p>Reconciliation is how you prove the system is correct. The <code>ReconcileAccount</code> function compares the stored <code>balance</code> column against the sum of all credits minus debits in the entries table:</p>
<pre><code class="language-go">func (s *LedgerService) ReconcileAccount(ctx context.Context, accountID uuid.UUID) (bool, error) {
    account, err := s.store.GetAccount(ctx, accountID)
    if err != nil { return false, fmt.Errorf("account not found: %w", err) }

    calculatedStr, err := s.store.GetAccountBalance(ctx, accountID)
    if err != nil { return false, fmt.Errorf("failed to calculate balance: %w", err) }

    calculated, _ := decimal.NewFromString(calculatedStr)
    stored, _ := decimal.NewFromString(account.Balance)

    if !stored.Equal(calculated) {
        log.Error().
            Str("stored_balance", account.Balance).
            Str("calculated", calculated.StringFixed(4)).
            Msg("Balance mismatch detected")
        return false, fmt.Errorf("balance mismatch: stored %s, calculated %s",
            account.Balance, calculated.StringFixed(4))
    }
    return true, nil
}
</code></pre>
<p>If they don't match, something has gone wrong: a bug, a direct database modification, or a race condition that slipped through. In production, this check can run as a background job to catch issues before they become incidents.</p>
<h2 id="heading-the-api-layer-secure-predictable-and-boring-by-design">The API Layer: Secure, Predictable, and Boring (By Design)</h2>
<p>The API layer is where your business logic meets the outside world. Its job is to be secure, predictable, and, if you've done things right, a little bit boring.</p>
<h3 id="heading-jwt-authentication-secrets-matter">JWT Authentication: Secrets Matter</h3>
<p>Authentication is handled with JWTs. The secret used to sign tokens must be at least 32 characters long (as shorter secrets are insecure and can be brute-forced). This is enforced at startup:</p>
<pre><code class="language-go">// internal/api/middleware.go
func InitTokenAuth(secret string) error {
    if secret == "" {
        return errors.New("JWT_SECRET environment variable is required")
    }
    if len(secret) &lt; 32 {
        return errors.New("JWT_SECRET must be at least 32 characters")
    }
    TokenAuth = jwtauth.New("HS256", []byte(secret), nil)
    return nil
}
</code></pre>
<p>The server will refuse to start if the secret is missing or too short. There's no fallback and no default: the system fails loudly rather than running insecurely.</p>
<h3 id="heading-the-handler-pattern-parse-authorize-validate-call-respond">The Handler Pattern: Parse, Authorize, Validate, Call, Respond</h3>
<p>Every handler follows the same recipe: extract JWT claims, parse the account ID, fetch the account and verify ownership, decode the request body, call the service, and respond. Authorization always happens before calling the service layer. The service knows nothing about users, keeping business logic clean and testable.</p>
<pre><code class="language-go">// internal/api/handler.go
func (h *Handler) Register(w http.ResponseWriter, r *http.Request) {
    var input struct {
        Email    string `json:"email"`
        Password string `json:"password"`
    }
    if err := json.NewDecoder(r.Body).Decode(&amp;input); err != nil {
        respondError(w, http.StatusBadRequest, "invalid input")
        return
    }
    // ... hash password, create user, generate JWT ...
}
</code></pre>
<h3 id="heading-amount-normalization-defensive-by-default">Amount Normalization: Defensive by Default</h3>
<p>API clients send amounts in different formats –&nbsp;sometimes as strings, sometimes as numbers. The normalization logic ensures all amounts are handled safely:</p>
<pre><code class="language-go">// internal/api/amount.go
func normalizeAmountInput(value interface{}) (string, error) {
    switch v := value.(type) {
    case string:
        return strings.TrimSpace(v), nil
    case json.Number:
        return strings.TrimSpace(v.String()), nil
    case float64:
        return strconv.FormatFloat(v, 'f', -1, 64), nil
    default:
        return "", errors.New("amount must be a number or string")
    }
}
</code></pre>
<p>The decoder uses <code>dec.UseNumber()</code> so JSON numbers arrive as <code>json.Number</code> rather than <code>float64</code>, preserving full precision. The <code>float64</code> case exists as a safety fallback only.</p>
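<p>Here's how the decoder and the normalizer fit together – a sketch, with <code>decodeAmount</code> as a hypothetical name:</p>
<pre><code class="language-go">func decodeAmount(r *http.Request) (string, error) {
    dec := json.NewDecoder(r.Body)
    dec.UseNumber() // numbers decode as json.Number, preserving the exact digits sent

    var body struct {
        Amount interface{} `json:"amount"`
    }
    if err := dec.Decode(&amp;body); err != nil {
        return "", err
    }
    // Accepts "250.00" (string) or 250.00 (number) and returns a clean string.
    return normalizeAmountInput(body.Amount)
}
</code></pre>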
<h3 id="heading-frontend-deployment-boundary">Frontend Deployment Boundary</h3>
<p>The backend no longer serves static frontend files. The frontend is deployed separately at <code>https://golangbank.app</code> from its own repository: <code>https://github.com/PaulBabatuyi/double-entry-bank</code>.</p>
<h2 id="heading-running-it-locally-your-first-end-to-end-test">Running It Locally: Your First End-to-End Test</h2>
<pre><code class="language-bash">git clone https://github.com/PaulBabatuyi/double-entry-bank-Go.git
cd double-entry-bank-Go
cp .env.example .env
# Edit .env — set JWT_SECRET with: openssl rand -base64 32
make postgres
make migrate-up
make server
</code></pre>
<p>Once the server is running:</p>
<ul>
<li><p><strong>Frontend</strong>: <a href="https://golangbank.app">https://golangbank.app</a></p>
</li>
<li><p><strong>Swagger UI</strong>: <a href="http://localhost:8080/swagger/index.html">http://localhost:8080/swagger/index.html</a> (local dev) or <a href="https://golangbank.app/swagger">https://golangbank.app/swagger</a> (production)</p>
</li>
<li><p><strong>Health check</strong>: <a href="http://localhost:8080/health">http://localhost:8080/health</a></p>
</li>
</ul>
<p>The Swagger UI lets you explore every endpoint, authorize with your JWT token, and test operations directly in the browser.</p>
<h2 id="heading-testing-prove-the-system-works">Testing: Prove the System Works</h2>
<p>Testing financial systems is non-negotiable, and claims about correctness need to be backed by code. This project tests all three layers, each targeting a different kind of failure.</p>
<h3 id="heading-service-layer-core-financial-logic">Service Layer: Core Financial Logic</h3>
<p>The most important tests live in <code>internal/service/ledger_test.go</code>. They run against a real PostgreSQL database – not mocks –&nbsp;because mock-based tests can give a false sense of security. Real database tests catch issues that only appear in production-like environments.</p>
<pre><code class="language-go">func TestDeposit_Success(t *testing.T) {
    ledger := setupTestLedger(t)
    accountID := createTestAccount(t, ledger, "0.00")

    err := ledger.Deposit(context.Background(), accountID, "100.00")
    require.NoError(t, err)

    balance := getAccountBalance(t, ledger, accountID)
    assert.Equal(t, "100.0000", balance)
}

func TestWithdraw_InsufficientFunds(t *testing.T) {
    ledger := setupTestLedger(t)
    accountID := createTestAccount(t, ledger, "50.00")

    err := ledger.Withdraw(context.Background(), accountID, "100.00")
    assert.ErrorIs(t, err, ErrInsufficientFunds)
}
</code></pre>
<p>The <code>createTestAccount</code> helper uses the settlement account's currency automatically, which is important: all accounts must share a currency for transfers to work, and tests that silently use a different currency will fail in confusing ways.</p>
<h3 id="heading-concurrency-test-proving-serializable-isolation-works">Concurrency Test: Proving Serializable Isolation Works</h3>
<p>This is the most important test in the suite:</p>
<pre><code class="language-go">func TestConcurrentDeposits(t *testing.T) {
    ledger := setupTestLedger(t)
    accountID := createTestAccount(t, ledger, "0.00")

    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        _ = ledger.Deposit(context.Background(), accountID, "100.00")
    }()
    go func() {
        defer wg.Done()
        _ = ledger.Deposit(context.Background(), accountID, "100.00")
    }()
    wg.Wait()

    balance := getAccountBalance(t, ledger, accountID)
    assert.Equal(t, "200.0000", balance)
}
</code></pre>
<p>Two goroutines deposit simultaneously. The serializable isolation level and retry logic ensure both operations succeed and neither overwrites the other. Without the <code>FOR UPDATE</code> locks and transaction retry logic, this test would fail non-deterministically – which is exactly the kind of bug that's impossible to reproduce in development but devastating in production.</p>
<h3 id="heading-store-layer-transaction-mechanics">Store Layer: Transaction Mechanics</h3>
<p>Tests in <code>internal/db/store_test.go</code> verify the retry infrastructure itself, without needing a database connection:</p>
<pre><code class="language-go">func TestIsSerializationError(t *testing.T) {
    pqErr := &amp;pq.Error{Code: "40001"}
    assert.True(t, isSerializationError(pqErr))
    assert.False(t, isSerializationError(errors.New("some other error")))
}

func TestRetryWait(t *testing.T) {
    assert.Equal(t, 50*time.Millisecond, retryWait(0))
    assert.Equal(t, 100*time.Millisecond, retryWait(1))
    assert.Equal(t, 200*time.Millisecond, retryWait(2))
    assert.Equal(t, time.Second, retryWait(5)) // capped
}

func TestSleepWithContext_Cancel(t *testing.T) {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // cancel immediately
    err := sleepWithContext(ctx, 50*time.Millisecond)
    assert.Error(t, err) // should return immediately, not wait
}
</code></pre>
<h3 id="heading-api-layer-authentication-and-input-handling">API Layer: Authentication and Input Handling</h3>
<p>Handler tests in <code>internal/api/handler_test.go</code> verify that the HTTP layer behaves correctly at its boundaries:</p>
<pre><code class="language-go">func TestRegisterHandler_BadRequest(t *testing.T) {
    h := setupTestHandler(t)
    req := httptest.NewRequest(http.MethodPost, "/register", nil)
    rw := httptest.NewRecorder()
    h.Register(rw, req)
    assert.Equal(t, http.StatusBadRequest, rw.Code)
}

func TestRegisterHandler_Success(t *testing.T) {
    h := setupTestHandler(t)
    _ = InitTokenAuth("fV7sliKV3qn657I60wEFtw/Auk/0bNU9zdp30wFzfDg=")

    email := "testuser_" + uuid.New().String() + "@example.com"
    body, _ := json.Marshal(map[string]string{"email": email, "password": "testpassword123"})

    req := httptest.NewRequest(http.MethodPost, "/register", bytes.NewReader(body))
    rw := httptest.NewRecorder()
    h.Register(rw, req)
    assert.Equal(t, http.StatusCreated, rw.Code)
}
</code></pre>
<p>Using <code>uuid.New().String()</code> in the email ensures each test run creates a unique user, preventing conflicts on repeated runs against the same database.</p>
<p>Middleware tests verify the security boundary itself:</p>
<pre><code class="language-go">func TestInitTokenAuthFromEnv_MissingSecret(t *testing.T) {
    os.Unsetenv("JWT_SECRET")
    err := InitTokenAuthFromEnv()
    assert.Error(t, err) // must fail without a secret
}
</code></pre>
<h3 id="heading-running-the-tests">Running the Tests</h3>
<pre><code class="language-bash"># Start the database
make postgres

# Run all tests with race detection
make test

# Run with coverage report
make coverage

# Run tests the same way CI does (includes migrations)
make ci-test
</code></pre>
<p>The <code>-race</code> flag is non-negotiable for financial code. It instruments the binary to detect data races at runtime –&nbsp;something static analysis can't catch. If a racy code path is exercised while the tests run, the detector will flag it.</p>
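<p>To see what the detector catches, here's a deliberately broken example (not from this project) that <code>go test -race</code> flags immediately:</p>
<pre><code class="language-go">func TestRaceExample(t *testing.T) {
    counter := 0
    var wg sync.WaitGroup
    for i := 0; i &lt; 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // two goroutines write without synchronization: -race reports this
        }()
    }
    wg.Wait()
    t.Log(counter)
}
</code></pre>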
<h2 id="heading-deployment-engineering-decisions-that-matter-in-production">Deployment: Engineering Decisions That Matter in Production</h2>
<p>The deployment setup for this project reflects several engineering decisions worth understanding, regardless of what platform you deploy to.</p>
<h3 id="heading-migrations-on-container-start">Migrations on Container Start</h3>
<p>The Docker entrypoint runs <code>golang-migrate up</code> before starting the Go binary:</p>
<pre><code class="language-sh"># docker-entrypoint
migrate -path /app/postgres/migrations -database "$migrate_db_url" up
exec /usr/local/bin/ledger
</code></pre>
<p>Running migrations at startup rather than as a separate CI step has trade-offs. The upside is simplicity: the container is always self-consistent when it starts. The downside is that each deployment takes slightly longer. For a solo project or small team, this is the right call. At scale you'd separate migrations from deployment.</p>
<h3 id="heading-startup-retry-logic">Startup Retry Logic</h3>
<p>The entrypoint retries migrations up to 12 times with a 5-second sleep between attempts:</p>
<pre><code class="language-sh">max_attempts=12
attempt=1
while [ "\(attempt" -le "\)max_attempts" ]; do
    migration_output=$(migrate ... up 2&gt;&amp;1)
    # If "connection refused" or "timeout", keep retrying
    # If any other error, fail immediately
    attempt=$((attempt + 1))
done
</code></pre>
<p>The critical distinction is which errors trigger a retry. Network-transient errors (connection refused, timeout) are retried. Everything else&nbsp;–&nbsp;bad migration SQL, a missing table&nbsp;–&nbsp;fails immediately. This avoids waiting the full 60 seconds when a deployment has a real problem.</p>
<h3 id="heading-db-url-fallback-chain">DB URL Fallback Chain</h3>
<p>In cloud environments, the internal database URL is often a different variable than what you configure locally. The <code>resolveDBURL</code> function handles this transparently:</p>
<pre><code class="language-go">func resolveDBURL() string {
    connStr := strings.TrimSpace(os.Getenv("DB_URL"))
    fallbackVars := []string{"INTERNAL_DATABASE_URL", "RENDER_DATABASE_URL", "DATABASE_URL"}
    // Falls back through the chain if DB_URL is empty or resolves to localhost
    ...
}
</code></pre>
<p>This pattern means local developers set <code>DB_URL</code> in <code>.env</code> and don't need to think about it, while the deployed container automatically uses the internal database connection without any manual wiring.</p>
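<p>Under that description, the elided body behaves roughly like this sketch (the actual implementation in the repository may differ):</p>
<pre><code class="language-go">func resolveDBURL() string {
    connStr := strings.TrimSpace(os.Getenv("DB_URL"))
    // Use DB_URL as-is unless it's empty or points at localhost,
    // which won't be reachable from inside a cloud container.
    if connStr != "" &amp;&amp; !strings.Contains(connStr, "localhost") &amp;&amp;
        !strings.Contains(connStr, "127.0.0.1") {
        return connStr
    }
    for _, name := range []string{"INTERNAL_DATABASE_URL", "RENDER_DATABASE_URL", "DATABASE_URL"} {
        if v := strings.TrimSpace(os.Getenv(name)); v != "" {
            return v
        }
    }
    return connStr
}
</code></pre>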
<h3 id="heading-http-server-timeouts">HTTP Server Timeouts</h3>
<p>The server is configured with explicit timeouts:</p>
<pre><code class="language-go">srv := &amp;http.Server{
    Addr:              ":" + port,
    Handler:           r,
    ReadTimeout:       15 * time.Second,
    WriteTimeout:      15 * time.Second,
    IdleTimeout:       60 * time.Second,
    ReadHeaderTimeout: 5 * time.Second,
}
</code></pre>
<p>Without timeouts, a slow or malicious client can hold connections open indefinitely, eventually exhausting the server's resources. <code>ReadHeaderTimeout</code> is particularly important: it limits how long the server waits for the HTTP headers before closing the connection, protecting against Slowloris-style attacks.</p>
<h2 id="heading-conclusion-building-for-the-real-world">Conclusion: Building for the Real World</h2>
<p>You've just walked through the core patterns that power real fintech systems:</p>
<ul>
<li><p>Double-entry ledger with database-enforced constraints</p>
</li>
<li><p>Settlement account for tracking external cash flows</p>
</li>
<li><p>Serializable transactions with exponential backoff retry</p>
</li>
<li><p>Reconciliation endpoint for verifying correctness</p>
</li>
<li><p>Type-safe queries with sqlc</p>
</li>
<li><p>Row-level locking to prevent race conditions</p>
</li>
<li><p>Tests that prove correctness under concurrency</p>
</li>
</ul>
<p>These aren't just Go patterns. They're the same principles used at companies like Monzo, Stripe, and Nubank. The implementation details differ, but the underlying ideas are the same: every dollar is accounted for, every operation is atomic, and the system can always explain where every penny went.</p>
<p>What's next? Three concrete next steps:</p>
<ol>
<li><p><strong>Add idempotency keys</strong> to prevent duplicate transactions on retries. If a client retries a deposit because of a network timeout, you need to detect and reject the duplicate – see the sketch after this list.</p>
</li>
<li><p><strong>Add Prometheus metrics</strong> for transaction latency and failure rates. You want to know when your p99 latency spikes before your users do.</p>
</li>
<li><p><strong>Add a scheduled reconciliation job</strong> that runs <code>ReconcileAccount</code> for every account on a schedule and alerts on mismatches. Catch bugs automatically, before they become customer complaints.</p>
</li>
</ol>
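<p>To make the first item concrete, here's a sketch of an idempotency-key guard. Every name here is hypothetical – the repository has no <code>processed_keys</code> table yet – but the shape is standard: insert the key first, and let a unique constraint turn replays into no-ops:</p>
<pre><code class="language-go">// Sketch only: assumes a processed_keys table with a UNIQUE key column.
func depositOnce(ctx context.Context, db *sql.DB, key, accountID, amount string) error {
    tx, err := db.BeginTx(ctx, &amp;sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op if Commit succeeds first

    res, err := tx.ExecContext(ctx,
        `INSERT INTO processed_keys (key) VALUES ($1) ON CONFLICT (key) DO NOTHING`, key)
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return tx.Commit() // duplicate request: acknowledge without re-applying it
    }

    // ...create the two ledger entries and update balances as in Deposit...

    return tx.Commit()
}
</code></pre>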
<p>The developer who stores balance as a single number and updates it directly will eventually have an incident. The developer who builds a ledger has an audit trail, a reconciliation tool, and a system that can explain every penny.</p>
<p>That's the real reason fintech engineers build this way: not because it's more complex, but because it's more honest about what money actually is.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Docker Container Doctor: How I Built an AI Agent That Monitors and Fixes My Containers ]]>
                </title>
                <description>
                    <![CDATA[ Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Go ]]>
                </description>
                <link>https://www.freecodecamp.org/news/docker-container-doctor-how-i-built-an-ai-agent-that-monitors-and-fixes-my-containers/</link>
                <guid isPermaLink="false">69c1768730a9b81e3a833f20</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 17:21:11 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8bb7701d-e519-407f-92ba-59639e13729d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Google the error, and finally restart it. Twenty minutes of your morning gone. And the worst part? It happens again next week.</p>
<p>I got tired of this cycle. I was running 5 containerized services on a single Linode box – a Flask API, a Postgres database, an Nginx reverse proxy, a Redis cache, and a background worker. Every other week, one of them would crash. The logs were messy. The errors weren't obvious. And I'd waste time debugging something that could've been auto-detected and fixed in seconds.</p>
<p>So I built something better: a Python agent that watches your containers in real-time, spots errors, figures out what went wrong using Claude, and fixes them without waking you up. I call it the Container Doctor. It's not magic. It's Docker API + LLM reasoning + some automation glue. Here's exactly how I built it, what went wrong along the way, and what I'd do differently.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</a></p>
</li>
<li><p><a href="#heading-the-architecture">The Architecture</a></p>
</li>
<li><p><a href="#heading-setting-up-the-project">Setting Up the Project</a></p>
</li>
<li><p><a href="#heading-the-monitoring-script--line-by-line">The Monitoring Script — Line by Line</a></p>
</li>
<li><p><a href="#heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</a></p>
</li>
<li><p><a href="#heading-auto-fix-logic--being-conservative-on-purpose">Auto-Fix Logic — Being Conservative on Purpose</a></p>
</li>
<li><p><a href="#heading-adding-slack-notifications">Adding Slack Notifications</a></p>
</li>
<li><p><a href="#heading-health-check-endpoint">Health Check Endpoint</a></p>
</li>
<li><p><a href="#heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</a></p>
</li>
<li><p><a href="#heading-docker-compose--the-full-setup">Docker Compose — The Full Setup</a></p>
</li>
<li><p><a href="#heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</a></p>
</li>
<li><p><a href="#heading-cost-breakdown--what-this-actually-costs">Cost Breakdown — What This Actually Costs</a></p>
</li>
<li><p><a href="#heading-security-considerations">Security Considerations</a></p>
</li>
<li><p><a href="#heading-what-id-do-differently">What I'd Do Differently</a></p>
</li>
<li><p><a href="#heading-whats-next">What's Next?</a></p>
</li>
</ol>
<h2 id="heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</h2>
<p>Fair question. Prometheus, Grafana, DataDog – they're all great. But for my setup, they were overkill. I had 5 containers on a $20/month Linode. Setting up Prometheus means deploying a metrics server, configuring exporters for each service, building Grafana dashboards, and writing alert rules. That's a whole side project just to monitor a side project.</p>
<p>Even then, those tools tell you <em>what</em> happened. They'll show you a spike in memory or a 500 error rate. But they won't tell you <em>why</em>. You still need a human to look at the logs, figure out the root cause, and decide what to do.</p>
<p>That's the gap I wanted to fill. I didn't need another dashboard. I needed something that could read a stack trace, understand the context, and either fix it or tell me exactly what to do when I wake up. Claude turned out to be surprisingly good at this. It can read a Python traceback and tell you the issue faster than most junior devs (and some senior ones, honestly).</p>
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Here's how the pieces fit together:</p>
<pre><code class="language-plaintext">┌─────────────────────────────────────────────┐
│              Docker Host                      │
│                                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   web    │  │   api    │  │    db    │   │
│  │ (nginx)  │  │ (flask)  │  │(postgres)│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │         │
│       └──────────────┼──────────────┘         │
│                      │                         │
│              Docker Socket                     │
│                      │                         │
│            ┌─────────┴─────────┐              │
│            │ Container Doctor  │              │
│            │  (Python agent)   │              │
│            └─────────┬─────────┘              │
│                      │                         │
└──────────────────────┼─────────────────────────┘
                       │
              ┌────────┴────────┐
              │   Claude API    │
              │  (diagnosis)    │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Slack Webhook  │
              │  (alerts)       │
              └─────────────────┘
</code></pre>
<p>The flow works like this:</p>
<ol>
<li><p>The Container Doctor runs in its own container with the Docker socket mounted</p>
</li>
<li><p>Every 10 seconds, it pulls the last 50 lines of logs from each target container</p>
</li>
<li><p>It scans for error patterns (keywords like "error", "exception", "traceback", "fatal")</p>
</li>
<li><p>When it finds something, it sends the logs to Claude with a structured prompt</p>
</li>
<li><p>Claude returns a JSON diagnosis: root cause, severity, suggested fix, and whether it's safe to auto-restart</p>
</li>
<li><p>If severity is high and auto-restart is safe, the script restarts the container</p>
</li>
<li><p>Either way, it sends a Slack notification with the full diagnosis</p>
</li>
<li><p>A simple health endpoint lets you check the doctor's own status</p>
</li>
</ol>
<p>The key insight: the script doesn't try to be smart about the diagnosis itself. It outsources all the thinking to Claude. The script's job is just plumbing: collecting logs, routing them to Claude, and executing the response.</p>
<h2 id="heading-setting-up-the-project">Setting Up the Project</h2>
<p>Create your project directory:</p>
<pre><code class="language-bash">mkdir container-doctor &amp;&amp; cd container-doctor
</code></pre>
<p>Here's your <code>requirements.txt</code>:</p>
<pre><code class="language-plaintext">docker==7.0.0
anthropic&gt;=0.28.0
python-dotenv==1.0.0
flask==3.0.0
requests==2.31.0
</code></pre>
<p>Install locally for testing: <code>pip install -r requirements.txt</code></p>
<p>Create a <code>.env</code> file:</p>
<pre><code class="language-bash">ANTHROPIC_API_KEY=sk-ant-...
TARGET_CONTAINERS=web,api,db
CHECK_INTERVAL=10
LOG_LINES=50
AUTO_FIX=true
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POSTGRES_USER=user
POSTGRES_PASSWORD=changeme
POSTGRES_DB=mydb
MAX_DIAGNOSES_PER_HOUR=20
</code></pre>
<p>A quick note on <code>CHECK_INTERVAL</code>: 10 seconds is aggressive. For production, I'd bump this to 30-60 seconds. I kept it low during development so I could see results faster, and honestly forgot to change it. My API bill reminded me.</p>
<h2 id="heading-the-monitoring-script-line-by-line">The Monitoring Script – Line by Line</h2>
<p>Here's the full <code>container_doctor.py</code>. I'll walk through the important parts after:</p>
<pre><code class="language-python">import docker
import json
import time
import logging
import os
import requests
from datetime import datetime, timedelta
from collections import defaultdict
from threading import Thread
from flask import Flask, jsonify
from anthropic import Anthropic
from dotenv import load_dotenv

# Load .env for local runs; in Docker, compose injects these variables directly
load_dotenv()

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

client = Anthropic()
docker_client = None

# --- Config ---
TARGET_CONTAINERS = os.getenv("TARGET_CONTAINERS", "").split(",")
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
LOG_LINES = int(os.getenv("LOG_LINES", "50"))
AUTO_FIX = os.getenv("AUTO_FIX", "true").lower() == "true"
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
MAX_DIAGNOSES = int(os.getenv("MAX_DIAGNOSES_PER_HOUR", "20"))

# --- State tracking ---
diagnosis_history = []
fix_history = defaultdict(list)
last_error_seen = {}
rate_limit_counter = defaultdict(int)
rate_limit_reset = datetime.now() + timedelta(hours=1)

app = Flask(__name__)


def get_docker_client():
    """Lazily initialize Docker client."""
    global docker_client
    if docker_client is None:
        docker_client = docker.from_env()
    return docker_client


def get_container_logs(container_name):
    """Fetch last N lines from a container."""
    try:
        container = get_docker_client().containers.get(container_name)
        logs = container.logs(
            tail=LOG_LINES,
            timestamps=True
        ).decode("utf-8")
        return logs
    except docker.errors.NotFound:
        logger.warning(f"Container '{container_name}' not found. Skipping.")
        return None
    except docker.errors.APIError as e:
        logger.error(f"Docker API error for {container_name}: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error fetching logs for {container_name}: {e}")
        return None


def detect_errors(logs):
    """Check if logs contain error patterns."""
    error_patterns = [
        "error", "exception", "traceback", "failed", "crash",
        "fatal", "panic", "segmentation fault", "out of memory",
        "killed", "oomkiller", "connection refused", "timeout",
        "permission denied", "no such file", "errno"
    ]
    logs_lower = logs.lower()
    found = []
    for pattern in error_patterns:
        if pattern in logs_lower:
            found.append(pattern)
    return found


def is_new_error(container_name, logs):
    """Check if this is a new error or the same one we already diagnosed."""
    log_hash = hash(logs[-200:])  # Hash last 200 chars
    if last_error_seen.get(container_name) == log_hash:
        return False
    last_error_seen[container_name] = log_hash
    return True


def check_rate_limit():
    """Ensure we don't spam Claude with too many requests."""
    global rate_limit_counter, rate_limit_reset

    now = datetime.now()
    if now &gt; rate_limit_reset:
        rate_limit_counter.clear()
        rate_limit_reset = now + timedelta(hours=1)

    total = sum(rate_limit_counter.values())
    if total &gt;= MAX_DIAGNOSES:
        logger.warning(f"Rate limit reached ({total}/{MAX_DIAGNOSES} per hour). Skipping diagnosis.")
        return False
    return True


def diagnose_with_claude(container_name, logs, error_patterns):
    """Send logs to Claude for diagnosis."""
    if not check_rate_limit():
        return None

    rate_limit_counter[container_name] += 1

    prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

    try:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=600,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    except Exception as e:
        logger.error(f"Claude API error: {e}")
        return None


def parse_diagnosis(diagnosis_text):
    """Extract JSON from Claude's response."""
    if not diagnosis_text:
        return None
    try:
        start = diagnosis_text.find("{")
        end = diagnosis_text.rfind("}") + 1
        if start &gt;= 0 and end &gt; start:
            json_str = diagnosis_text[start:end]
            return json.loads(json_str)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse error: {e}")
        logger.debug(f"Raw response: {diagnosis_text}")
    except Exception as e:
        logger.error(f"Failed to parse diagnosis: {e}")
    return None


def apply_fix(container_name, diagnosis):
    """Apply auto-fixes if safe."""
    if not AUTO_FIX:
        logger.info(f"Auto-fix disabled globally. Skipping {container_name}.")
        return False

    if not diagnosis.get("auto_restart_safe"):
        logger.info(f"Claude says restart is unsafe for {container_name}. Skipping.")
        return False

    # Don't restart the same container more than 3 times per hour
    recent_fixes = [
        t for t in fix_history[container_name]
        if t &gt; datetime.now() - timedelta(hours=1)
    ]
    if len(recent_fixes) &gt;= 3:
        logger.warning(
            f"Container {container_name} already restarted {len(recent_fixes)} "
            f"times this hour. Something deeper is wrong. Skipping."
        )
        send_slack_alert(
            container_name, diagnosis,
            extra="REPEATED FAILURE: This container has been restarted 3+ times "
                  "in the last hour. Manual intervention needed."
        )
        return False

    try:
        container = get_docker_client().containers.get(container_name)
        logger.info(f"Restarting container {container_name}...")
        container.restart(timeout=30)
        fix_history[container_name].append(datetime.now())
        logger.info(f"Container {container_name} restarted successfully")

        # Verify it's actually running after restart
        time.sleep(5)
        container.reload()
        if container.status != "running":
            logger.error(f"Container {container_name} failed to start after restart")
            return False

        return True
    except Exception as e:
        logger.error(f"Failed to restart {container_name}: {e}")
        return False


def send_slack_alert(container_name, diagnosis, extra=""):
    """Send diagnosis to Slack."""
    if not SLACK_WEBHOOK:
        return

    severity_emoji = {
        "low": "🟡",
        "medium": "🟠",
        "high": "🔴"
    }

    severity = diagnosis.get("severity", "unknown")
    emoji = severity_emoji.get(severity, "⚪")

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"{emoji} Container Doctor Alert: {container_name}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {severity}"},
                {"type": "mrkdwn", "text": f"*Container:* `{container_name}`"},
                {"type": "mrkdwn", "text": f"*Root Cause:* {diagnosis.get('root_cause', 'Unknown')}"},
                {"type": "mrkdwn", "text": f"*Fix:* {diagnosis.get('suggested_fix', 'N/A')}"},
            ]
        }
    ]

    if diagnosis.get("config_suggestions"):
        suggestions = "\n".join(
            f"• `{s}`" for s in diagnosis["config_suggestions"]
        )
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Config Suggestions:*\n{suggestions}"
            }
        })

    if extra:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*⚠️ {extra}*"}
        })

    try:
        requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
    except Exception as e:
        logger.error(f"Slack notification failed: {e}")


# --- Health Check Endpoint ---
@app.route("/health")
def health():
    """Health check endpoint for the doctor itself."""
    try:
        get_docker_client().ping()
        docker_ok = True
    except Exception:
        docker_ok = False

    return jsonify({
        "status": "healthy" if docker_ok else "degraded",
        "docker_connected": docker_ok,
        "monitoring": TARGET_CONTAINERS,
        "total_diagnoses": len(diagnosis_history),
        "fixes_applied": {k: len(v) for k, v in fix_history.items()},
        "rate_limit_remaining": MAX_DIAGNOSES - sum(rate_limit_counter.values()),
        "uptime_check": datetime.now().isoformat()
    })


@app.route("/history")
def history():
    """Return recent diagnosis history."""
    return jsonify(diagnosis_history[-50:])


def monitor_containers():
    """Main monitoring loop."""
    logger.info(f"Container Doctor starting up")
    logger.info(f"Monitoring: {TARGET_CONTAINERS}")
    logger.info(f"Check interval: {CHECK_INTERVAL}s")
    logger.info(f"Auto-fix: {AUTO_FIX}")
    logger.info(f"Rate limit: {MAX_DIAGNOSES}/hour")

    while True:
        for container_name in TARGET_CONTAINERS:
            container_name = container_name.strip()
            if not container_name:
                continue

            logs = get_container_logs(container_name)
            if not logs:
                continue

            error_patterns = detect_errors(logs)
            if not error_patterns:
                continue

            # Skip if we already diagnosed this exact error
            if not is_new_error(container_name, logs):
                continue

            logger.warning(
                f"Errors detected in {container_name}: {error_patterns}"
            )

            diagnosis_text = diagnose_with_claude(
                container_name, logs, error_patterns
            )
            if not diagnosis_text:
                continue

            diagnosis = parse_diagnosis(diagnosis_text)
            if not diagnosis:
                logger.error("Failed to parse Claude's response. Skipping.")
                continue

            # Record it
            diagnosis_history.append({
                "container": container_name,
                "timestamp": datetime.now().isoformat(),
                "diagnosis": diagnosis,
                "patterns": error_patterns
            })

            logger.info(
                f"Diagnosis for {container_name}: "
                f"severity={diagnosis.get('severity')}, "
                f"cause={diagnosis.get('root_cause')}"
            )

            # Auto-fix only on high severity
            fixed = False
            if diagnosis.get("severity") == "high":
                fixed = apply_fix(container_name, diagnosis)

            # Always notify Slack
            send_slack_alert(
                container_name, diagnosis,
                extra="Auto-restarted" if fixed else ""
            )

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Run Flask health endpoint in background
    flask_thread = Thread(
        target=lambda: app.run(host="0.0.0.0", port=8080, debug=False),
        daemon=True
    )
    flask_thread.start()
    logger.info("Health endpoint running on :8080")

    try:
        monitor_containers()
    except KeyboardInterrupt:
        logger.info("Container Doctor shutting down")
</code></pre>
<p>That's a lot of code, so let me walk through the parts that matter.</p>
<p><strong>Error deduplication (</strong><code>is_new_error</code><strong>)</strong>: This was a lesson I learned the hard way. Without this, the script would see the same error every 10 seconds and spam Claude with identical requests. I hash the last 200 characters of the log output and skip if it matches the last error we saw. Simple, but it cut my API costs by about 80%.</p>
<p><strong>Rate limiting (</strong><code>check_rate_limit</code><strong>)</strong>: Belt and suspenders. Even with deduplication, I cap it at 20 diagnoses per hour. If something is so broken that it's generating 20+ unique errors per hour, you need a human anyway.</p>
<p><strong>Restart throttling (inside</strong> <code>apply_fix</code><strong>)</strong>: If the same container has been restarted 3 times in an hour, something deeper is wrong. A restart loop won't fix a misconfigured database or a missing volume. The script stops restarting and sends a louder Slack alert instead.</p>
<p><strong>Post-restart verification</strong>: After restarting, the script waits 5 seconds and checks if the container is actually running. I've seen cases where a container restarts and immediately crashes again. Without this check, the script would report success while the container is still down.</p>
<h2 id="heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</h2>
<p>Getting Claude to return parseable JSON took some iteration. My first attempt used a casual prompt and I got back paragraphs of explanation with JSON buried somewhere in the middle. Sometimes it'd use markdown code fences, sometimes not.</p>
<p>The version I landed on is explicit about format:</p>
<pre><code class="language-python">prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""
</code></pre>
<p>A few things I learned:</p>
<p><strong>Include the detected patterns.</strong> Telling Claude "I found 'timeout' and 'connection refused'" helps it focus. Without this, it sometimes fixated on irrelevant warnings in the logs.</p>
<p><strong>Ask for</strong> <code>estimated_impact</code><strong>.</strong> This field turned out to be the most useful in Slack alerts. When your team sees "Database connections will pile up and crash the API within 15 minutes," they act faster than when they see "connection pool exhausted."</p>
<p><code>likely_recurring</code> <strong>is gold.</strong> If Claude says an issue is likely to recur, I know a restart is a band-aid and I need to actually fix the root cause. I flag these in Slack with extra emphasis.</p>
<p>Claude returns something like:</p>
<pre><code class="language-json">{
    "root_cause": "Connection pool exhausted. Default pool size is 5, but app has 8+ concurrent workers.",
    "severity": "high",
    "suggested_fix": "1. Set POOL_SIZE=20 in environment. 2. Add connection timeout of 30s. 3. Consider a connection pooler like PgBouncer.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "CONNECTION_TIMEOUT=30"],
    "likely_recurring": true,
    "estimated_impact": "API requests will queue and timeout. Users will see 503 errors within 2-3 minutes."
}
</code></pre>
<p>I only auto-restart on <code>high</code> severity. Medium and low issues get logged, sent to Slack, and I deal with them during business hours. This distinction matters: you don't want the script restarting containers over every transient warning.</p>
<h2 id="heading-auto-fix-logic-being-conservative-on-purpose">Auto-Fix Logic – Being Conservative on Purpose</h2>
<p>The auto-fix function is intentionally limited. Right now it only restarts containers. It doesn't modify environment variables, change configs, or scale services. Here's why:</p>
<p>Restarting is safe and reversible. If the restart makes things worse, the container just crashes again and I get another alert. But if the script started changing environment variables or modifying docker-compose files, a bad decision could cascade across services.</p>
<p>The three safety checks before any restart:</p>
<ol>
<li><p><strong>Global toggle</strong>: <code>AUTO_FIX=true</code> in .env. I can kill all auto-fixes instantly by changing one variable.</p>
</li>
<li><p><strong>Claude's assessment</strong>: <code>auto_restart_safe</code> must be true. If Claude says "don't restart this, it'll corrupt the database," the script listens.</p>
</li>
<li><p><strong>Restart throttle</strong>: No more than 3 restarts per container per hour. After that, it's a human problem.</p>
</li>
</ol>
<p>If I were building this for a team, I'd add approval flows. Send a Slack message with "Restart?" and two buttons. Wait for a human to click yes. That adds latency but removes the risk of automated chaos.</p>
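<p>As a rough sketch of that approval flow, here's how the prompt side might look with Block Kit buttons. It reuses the <code>requests</code> import and <code>SLACK_WEBHOOK</code> variable from the script above; handling the actual button click requires a Slack app with interactivity enabled and a separate HTTP endpoint, which isn't shown.</p>
<pre><code class="language-python">def request_restart_approval(container_name, diagnosis):
    """Ask a human before restarting, instead of acting automatically.

    The button click is delivered to your Slack app's interactivity URL
    (a separate endpoint, not shown), which would perform the restart.
    """
    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f":warning: *{container_name}* looks unhealthy.\n"
                    f"*Root cause:* {diagnosis.get('root_cause', 'Unknown')}\n"
                    "Restart it?"
                ),
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Restart"},
                    "style": "danger",
                    "action_id": "approve_restart",
                    "value": container_name,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Ignore"},
                    "action_id": "ignore_alert",
                    "value": container_name,
                },
            ],
        },
    ]
    requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
</code></pre>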
<h2 id="heading-adding-slack-notifications">Adding Slack Notifications</h2>
<p>Every diagnosis gets sent to Slack, whether the container was restarted or not. The notification includes color-coded severity, root cause, suggested fix, and config suggestions.</p>
<p>The Slack Block Kit formatting makes these alerts scannable. A red dot for high severity, orange for medium, yellow for low. Your team can glance at the channel and know if they need to drop everything or if it can wait.</p>
<p>To set this up, create a Slack app at <a href="https://api.slack.com/apps">api.slack.com/apps</a>, add an incoming webhook, and paste the URL in your <code>.env</code>.</p>
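<p>If you want to sanity-check the webhook before wiring it into the script, a one-off <code>curl</code> does it (this just posts a plain-text message):</p>
<pre><code class="language-bash">curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Container Doctor webhook test"}' \
  "$SLACK_WEBHOOK_URL"
</code></pre>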
<h2 id="heading-health-check-endpoint">Health Check Endpoint</h2>
<p>The doctor needs a doctor. I added a simple Flask endpoint so I can monitor the monitoring script:</p>
<pre><code class="language-bash">curl http://localhost:8080/health
</code></pre>
<p>Returns:</p>
<pre><code class="language-json">{
    "status": "healthy",
    "docker_connected": true,
    "monitoring": ["web", "api", "db"],
    "total_diagnoses": 14,
    "fixes_applied": {"api": 2, "web": 1},
    "rate_limit_remaining": 6,
    "uptime_check": "2026-03-15T14:30:00"
}
</code></pre>
<p>And <code>/history</code> returns the last 50 diagnoses:</p>
<pre><code class="language-bash">curl http://localhost:8080/history
</code></pre>
<p>I point an uptime checker (UptimeRobot, free tier) at the <code>/health</code> endpoint. If the Container Doctor itself goes down, I get an email. It's monitoring all the way down.</p>
<h2 id="heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</h2>
<p>This is where I burned money during development. Without rate limiting, the script was sending 100+ requests per hour during a container crash loop. At a few cents per request, that's a few dollars per hour. Not catastrophic, but annoying.</p>
<p>The rate limiter is simple: a counter that resets every hour. Default cap is 20 diagnoses per hour. If you hit the limit, the script logs a warning and skips diagnosis until the window resets. Errors still get detected, they just don't get sent to Claude.</p>
<p>Combined with error deduplication (same error won't trigger a second diagnosis), this keeps my Claude bill under $5/month even with 5 containers monitored.</p>
<h2 id="heading-docker-compose-the-full-setup">Docker Compose – The Full Setup</h2>
<p>Here's the complete <code>docker-compose.yml</code> with the Container Doctor, a sample web server, API, and database:</p>
<pre><code class="language-yaml">version: '3.8'

services:
  container_doctor:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: container_doctor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - TARGET_CONTAINERS=web,api,db
      - CHECK_INTERVAL=10
      - LOG_LINES=50
      - AUTO_FIX=true
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - MAX_DIAGNOSES_PER_HOUR=20
    ports:
      - "8080:8080"
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  web:
    image: nginx:latest
    container_name: web
    ports:
      - "80:80"
    restart: unless-stopped

  api:
    build: ./api
    container_name: api
    environment:
      - DATABASE_URL=postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
      - POOL_SIZE=20
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  db_data:
</code></pre>
<p>And the <code>Dockerfile</code>:</p>
<pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY container_doctor.py .

EXPOSE 8080

CMD ["python", "-u", "container_doctor.py"]
</code></pre>
<p>Start everything: <code>docker compose up -d</code></p>
<p><strong>Important:</strong> The socket mount (<code>/var/run/docker.sock:/var/run/docker.sock</code>) gives the Container Doctor full access to the Docker daemon; there's more on what that implies in the Security Considerations section below. Also, don't copy <code>.env</code> into the Docker image: that bakes your API key into an image layer. Pass environment variables via the compose file or at runtime.</p>
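<p>A minimal <code>.dockerignore</code> is cheap insurance on both counts; these entries are typical suggestions rather than anything from the original project:</p>
<pre><code class="language-plaintext">.env
.git
__pycache__/
*.pyc
</code></pre>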
<h2 id="heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</h2>
<p>I've been running this for about 3 weeks now. Here are the actual incidents it caught:</p>
<h3 id="heading-incident-1-oom-kill-week-1">Incident 1: OOM Kill (Week 1)</h3>
<p>Logs showed a single word: <code>Killed</code>. That's Linux's OOMKiller doing its thing.</p>
<p>Claude's diagnosis:</p>
<pre><code class="language-json">{
    "root_cause": "Process killed by OOMKiller. Container is requesting more memory than the 256MB limit allows under load.",
    "severity": "high",
    "suggested_fix": "Increase memory limit to 512MB in docker-compose. Monitor if the leak continues at higher limits.",
    "auto_restart_safe": true,
    "config_suggestions": ["mem_limit: 512m", "memswap_limit: 1g"],
    "likely_recurring": true,
    "estimated_impact": "API is completely down. All requests return 502 from nginx."
}
</code></pre>
<p>The script restarted the container in 3 seconds. I updated the compose file the next morning. Before the Container Doctor, this would've been a 2-hour outage overnight.</p>
<h3 id="heading-incident-2-connection-pool-exhausted-week-2">Incident 2: Connection Pool Exhausted (Week 2)</h3>
<pre><code class="language-plaintext">ERROR: database connection pool exhausted
ERROR: cannot create new pool entry
ERROR: QueuePool limit of 5 overflow 0 reached
</code></pre>
<p>Claude caught that my pool size was too small for the number of workers:</p>
<pre><code class="language-json">{
    "root_cause": "SQLAlchemy connection pool (size=5) can't keep up with 8 concurrent Gunicorn workers. Each worker holds a connection during request processing.",
    "severity": "high",
    "suggested_fix": "Set POOL_SIZE=20 and add POOL_TIMEOUT=30. Long-term: add PgBouncer as a connection pooler.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "POOL_TIMEOUT=30", "POOL_RECYCLE=3600"],
    "likely_recurring": true,
    "estimated_impact": "New API requests will hang for 30s then timeout. Existing requests may complete but slowly."
}
</code></pre>
<h3 id="heading-incident-3-transient-timeout-week-2">Incident 3: Transient Timeout (Week 2)</h3>
<pre><code class="language-plaintext">WARN: timeout connecting to upstream service
WARN: retrying request (attempt 2/3)
INFO: request succeeded on retry
</code></pre>
<p>Claude correctly identified this as a non-issue:</p>
<pre><code class="language-json">{
    "root_cause": "Transient network timeout during a DNS resolution hiccup. Retries succeeded.",
    "severity": "low",
    "suggested_fix": "No action needed. This is expected during brief network blips. Only investigate if frequency increases.",
    "auto_restart_safe": false,
    "config_suggestions": [],
    "likely_recurring": false,
    "estimated_impact": "Minimal. Individual requests delayed by ~2s but all completed."
}
</code></pre>
<p>No restart. No alert ping either: in my own deployment I filter low-severity diagnoses out of Slack (the script as shown above notifies on everything, so add that filter if you want the same behavior). This is the right call: restarting on every transient timeout causes more downtime than it prevents.</p>
<h3 id="heading-incident-4-disk-full-week-3">Incident 4: Disk Full (Week 3)</h3>
<pre><code class="language-plaintext">ERROR: could not write to temporary file: No space left on device
FATAL: data directory has no space
</code></pre>
<pre><code class="language-json">{
    "root_cause": "Postgres data volume is full. WAL files and temporary sort files consumed all available space.",
    "severity": "high",
    "suggested_fix": "1. Clean WAL files: SELECT pg_switch_wal(). 2. Increase volume size. 3. Add log rotation. 4. Set max_wal_size=1GB.",
    "auto_restart_safe": false,
    "config_suggestions": ["max_wal_size=1GB", "log_rotation_age=1d"],
    "likely_recurring": true,
    "estimated_impact": "Database is read-only. All writes fail. API returns 500 on any mutation."
}
</code></pre>
<p>Notice Claude said <code>auto_restart_safe: false</code> here. Restarting Postgres when the disk is full can corrupt data. The script didn't touch it. It just sent me a detailed Slack alert at 4 AM. I cleaned up the WAL files the next morning. Good call by Claude.</p>
<h2 id="heading-cost-breakdown-what-this-actually-costs">Cost Breakdown – What This Actually Costs</h2>
<p>After 3 weeks of running this on 5 containers:</p>
<ul>
<li><p><strong>Claude API</strong>: ~$3.80/month (with rate limiting and deduplication)</p>
</li>
<li><p><strong>Linode compute</strong>: $0 extra (the Container Doctor uses about 50MB RAM)</p>
</li>
<li><p><strong>Slack</strong>: Free tier</p>
</li>
<li><p><strong>My time saved</strong>: ~2-3 hours/month of 3 AM debugging</p>
</li>
</ul>
<p>Without rate limiting, my first week cost $8 in API calls. The deduplication + rate limiter brought that down dramatically. Most of my containers run fine. The script only calls Claude when something actually breaks.</p>
<p>If you're monitoring more containers or have noisier logs, expect higher costs. The <code>MAX_DIAGNOSES_PER_HOUR</code> setting is your budget knob.</p>
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>Let's talk about the elephant in the room: the Docker socket.</p>
<p>Mounting <code>/var/run/docker.sock</code> gives the Container Doctor <strong>root-equivalent access</strong> to your Docker daemon. It can start, stop, and remove any container. It can pull images. It can exec into running containers. If someone compromises the Container Doctor, they own your entire Docker host.</p>
<p>Here's how I mitigate this:</p>
<ol>
<li><p><strong>Network isolation</strong>: The Container Doctor's health endpoint is only exposed on localhost. In production, put it behind a reverse proxy with auth.</p>
</li>
<li><p><strong>Read-mostly access</strong>: The script only <em>reads</em> logs and <em>restarts</em> containers. It never execs into containers, pulls images, or modifies volumes.</p>
</li>
<li><p><strong>No external inputs</strong>: The script doesn't accept commands from Slack or any external source. It's outbound-only (logs out, alerts out).</p>
</li>
<li><p><strong>API key rotation</strong>: I rotate the Anthropic API key monthly. If the container is compromised, the key has limited blast radius.</p>
</li>
</ol>
<p>For a more secure setup, mount the socket read-only (append <code>:ro</code> to the volume mount) and put a tool like <a href="https://github.com/Tecnativa/docker-socket-proxy">docker-socket-proxy</a> in front of it to restrict which API calls the Container Doctor can make.</p>
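<p>Here's a sketch of what that could look like in compose. The proxy whitelists sections of the Docker API through environment flags, and the doctor talks to it over TCP via <code>DOCKER_HOST</code>, which <code>docker.from_env()</code> picks up automatically. Treat the exact flag set below as an assumption to verify against the proxy's README, not tested config from this article.</p>
<pre><code class="language-yaml">services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - CONTAINERS=1       # list/inspect containers and read their logs
      - POST=1             # permit POST requests at all...
      - ALLOW_RESTARTS=1   # ...but only for restart/stop/kill endpoints

  container_doctor:
    build: .
    environment:
      - DOCKER_HOST=tcp://socket-proxy:2375
    depends_on:
      - socket-proxy
</code></pre>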
<h2 id="heading-what-id-do-differently">What I'd Do Differently</h2>
<p>After 3 weeks in production, here's my honest retrospective:</p>
<p><strong>I'd use structured logging from day one.</strong> My regex-based error detection catches too many false positives. A JSON log format with severity levels would make detection way more accurate.</p>
<p><strong>I'd add per-container policies.</strong> Right now, every container gets the same treatment. But you probably want different rules for a database vs a web server. Never auto-restart a database. Always auto-restart a stateless web server.</p>
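<p>A sketch of what that might look like: a policy table that <code>apply_fix</code> consults before touching anything. The names and numbers here are illustrative.</p>
<pre><code class="language-python"># Stateful services never get auto-restarted; stateless ones can be
# restarted more aggressively. Unknown containers default to hands-off.
CONTAINER_POLICIES = {
    "db":  {"auto_restart": False, "max_restarts_per_hour": 0},
    "web": {"auto_restart": True,  "max_restarts_per_hour": 5},
    "api": {"auto_restart": True,  "max_restarts_per_hour": 3},
}

DEFAULT_POLICY = {"auto_restart": False, "max_restarts_per_hour": 0}


def policy_for(container_name):
    """Look up a container's restart policy, defaulting to 'hands off'."""
    return CONTAINER_POLICIES.get(container_name, DEFAULT_POLICY)
</code></pre>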
<p><strong>I'd build a simple web UI.</strong> The <code>/history</code> endpoint returns JSON, but a small React dashboard showing a timeline of incidents, fix success rates, and cost tracking would be much more useful.</p>
<p><strong>I'd try local models first.</strong> For simple errors (OOM, connection refused), a small local model running on Ollama could handle the diagnosis without any API cost. Reserve Claude for the weird, complex stack traces where you actually need strong reasoning.</p>
<p><strong>I'd add a "learning mode."</strong> Run the Container Doctor in observe-only mode for a week. Let it diagnose everything but fix nothing. Review the diagnoses manually. Once you trust its judgment, flip on auto-fix. This builds confidence before you give it restart power.</p>
<h2 id="heading-whats-next">What's Next?</h2>
<p>If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish – Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.</p>
<p>Got questions or built something similar? Drop a comment below or find me on <a href="https://github.com/balajee-asish">GitHub</a> and <a href="https://linkedin.com/in/balajee-asish">LinkedIn</a>.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Troubleshoot Ghost CMS: Fixing WSL, Docker, and ActivityPub Errors ]]>
                </title>
                <description>
                    <![CDATA[ Setting up Ghost CMS (Content Management System) on your local machine is a great way to develop themes and test new features. But if you're using Windows or Docker, you might run into errors that sto ]]>
                </description>
                <link>https://www.freecodecamp.org/news/fix-ghost-cms-errors/</link>
                <guid isPermaLink="false">69bc3254b238fd45a31f6959</guid>
                
                    <category>
                        <![CDATA[ ghost ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Node.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ troubleshooting ]]>
                    </category>
                
                    <category>
                        <![CDATA[ debugging ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Thu, 19 Mar 2026 17:28:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/85f5e0bb-26ff-42ce-ba66-afec6df4bb5d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Setting up Ghost CMS (Content Management System) on your local machine is a great way to develop themes and test new features. But if you're using Windows or Docker, you might run into errors that stop your progress. And debugging takes time away from your actual development work.</p>
<p>In this guide, you'll learn the root causes and exact fixes for three common Ghost CMS deployment errors:</p>
<ul>
<li><p><strong>Error 1:</strong> SQLite installation failures on Windows.</p>
</li>
<li><p><strong>Error 2:</strong> Docker containers crashing with Code 137 (memory limits).</p>
</li>
<li><p><strong>Error 3:</strong> "Loading Interrupted" errors in the ActivityPub Network tab.</p>
</li>
</ul>
<p>By the end of this article, you'll have a stable, working local Ghost setup. You'll know how to properly use WSL for Node.js apps, manage Docker resources, and successfully configure Ghost's new social web features.</p>
<h2 id="heading-error-1-sqlite-installation-failures-on-windows">Error 1: SQLite Installation Failures on Windows</h2>
<h3 id="heading-the-symptom"><strong>The Symptom</strong></h3>
<p>When you run the command <code>ghost install local</code> on a Windows machine, the setup fails. You will see a long list of red text in your terminal that looks like this:</p>
<pre><code class="language-plaintext">Error: Cannot find module 'sqlite3'
...
node-pre-gyp ERR! stack Error: Failed to execute...
...
MSB4019: The imported project "C:\Microsoft.Cpp.Default.props" was not found.
</code></pre>
<p>The error usually mentions "sqlite3" and says it "failed to execute" or is "missing."</p>
<h3 id="heading-the-cause"><strong>The Cause</strong></h3>
<p>Ghost uses SQLite to store your blog's data. SQLite is a "native module": part of it is C code that has to be compiled for your specific operating system during installation.</p>
<p>Because Ghost was created to run on Linux servers, it expects to find Linux build tools to make these files. Windows uses different tools and a different way of organising files. When the Ghost CLI tries to build the SQLite files on Windows, it can't find the tools it needs, so the installation stops. Using WSL gives Ghost the Linux environment it expects.</p>
<h3 id="heading-how-to-fix-it">How to Fix it:</h3>
<p>You can use Windows Subsystem for Linux (WSL) to create a working setup.</p>
<ol>
<li><p>Open your WSL terminal (like Ubuntu).</p>
</li>
<li><p>Check your tools by running <code>node --version</code>, <code>npm --version</code>, and <code>python3 --version</code>.</p>
</li>
<li><p>Install the Ghost CLI globally inside WSL:</p>
<pre><code class="language-plaintext">npm install -g ghost-cli@latest
</code></pre>
</li>
<li><p>Run the local setup command:</p>
<pre><code class="language-plaintext">ghost install local
</code></pre>
</li>
<li><p>Start the server:</p>
<pre><code class="language-plaintext">ghost start
</code></pre>
</li>
</ol>
<h3 id="heading-how-to-verify">How to Verify:</h3>
<p>Open your web browser and go to <code>http://localhost:2368</code>. You should see the default Ghost welcome page load without errors.</p>
<h2 id="heading-error-2-docker-container-exiting-with-code-137">Error 2: Docker Container Exiting with Code 137</h2>
<h3 id="heading-the-symptom">The Symptom:</h3>
<p>When you're running Ghost using Docker Compose, the containers crash. The terminal logs show <code>Ghost admin container exiting with code 137</code> or <code>Admin service killed due to memory constraints</code>.</p>
<h3 id="heading-the-cause">The Cause:</h3>
<p>So why does this happen? Exit code 137 means the container was killed with SIGKILL (128 + 9), which almost always means your computer ran out of memory (RAM) and the kernel stopped the container. This usually happens if you try to run the full Ghost developer setup (which includes 15+ extra tools) on a standard computer.</p>
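<p>You can confirm this from the host before changing anything. If <code>OOMKilled</code> is <code>true</code>, the memory limit (not Ghost itself) is the culprit:</p>
<pre><code class="language-bash">docker inspect "$CONTAINER_NAME" --format '{{.State.OOMKilled}} {{.State.ExitCode}}'
# prints "true 137" when the kernel killed the container for using too much memory
</code></pre>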
<h3 id="heading-how-to-fix-it">How to Fix it:</h3>
<p>To fix this error, you can switch from the complex setup to a simple setup using the official Ghost Docker image.</p>
<p>To do this, first stop and remove the broken containers (note that <code>docker system prune -a</code> deletes <em>all</em> unused images and networks on the machine, not just Ghost's):</p>
<pre><code class="language-plaintext">docker-compose down -v
docker system prune -a
</code></pre>
<p>Then create a new <code>docker-compose.yml</code> file with only the basic tools (Ghost and a database):</p>
<pre><code class="language-plaintext">services:
  ghost:
    image: ghost:latest
    restart: always
    ports:
      - "2368:2368"
    environment:
      database__client: mysql
      database__connection__host: mysql
      database__connection__user: root
      database__connection__password: yourpassword
      database__connection__database: ghost
      url: http://localhost:2368
    volumes:
      - ghost_content:/var/lib/ghost/content

  mysql:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: yourpassword
      MYSQL_DATABASE: ghost
    volumes:
      - mysql_data:/var/lib/mysql

volumes:
  ghost_content:
  mysql_data:
</code></pre>
<p>Then start the simple setup:</p>
<pre><code class="language-plaintext">docker-compose up -d
</code></pre>
<h3 id="heading-how-to-verify">How to Verify:</h3>
<p>Type <code>docker-compose ps</code> in your terminal. You should see both the <code>ghost</code> and <code>mysql</code> containers listed with a status of "Up".</p>
<h2 id="heading-error-3-loading-interrupted-in-network-analytics">Error 3: "Loading Interrupted" in Network Analytics</h2>
<h3 id="heading-the-symptom">The Symptom:</h3>
<p>When you click the <strong>Analytics → Network</strong> tab in your local Ghost admin panel, the page shows a "Loading Interrupted" error. Your terminal logs show 404 errors and webhook failures:</p>
<pre><code class="language-plaintext">INFO "GET /.ghost/activitypub/v1/feed/reader/" 404 52ms
ERROR No webhook secret found - cannot initialise
</code></pre>
<h3 id="heading-the-cause">The Cause:</h3>
<p>The Network tab acts as an ActivityPub reader, not a normal analytics dashboard. This error happens because ActivityPub is not set up for local use. It needs extra tools (Caddy, Redis) and a clean web address without port numbers to work.</p>
<h3 id="heading-how-to-fix-it">How to Fix it:</h3>
<p>To fix this error, just run Ghost with its required Docker tools and update your local config file to turn on the social web features.</p>
<p>First, start the required tools (Caddy, MySQL, Redis) from your Ghost folder:</p>
<pre><code class="language-plaintext">SSH_AUTH_SOCK=/dev/null docker compose up -d caddy mysql redis
</code></pre>
<p>Then open your <code>config.local.json</code> file. Set the URL to a clean localhost address (remove the <code>:2368</code> port) and turn on the developer features:</p>
<pre><code class="language-plaintext">{
    "url": "http://localhost",
    "social_web_enabled": true,
    "enableDeveloperExperiments": true
}
</code></pre>
<p>Stop your current Ghost process:</p>
<pre><code class="language-plaintext">pkill -f "yarn dev:ghost"
</code></pre>
<p>And restart Ghost with the new settings:</p>
<pre><code class="language-plaintext">yarn dev:ghost
</code></pre>
<h3 id="heading-how-to-verify">How to Verify:</h3>
<p>Log back into your Ghost admin panel and click <strong>Analytics → Network</strong>. The error message will be gone, and you will see the ActivityPub feed instead.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Local setups can be hard, especially when mixing Windows, Docker, and new features like ActivityPub.</p>
<p>By fixing these three errors, you did more than just get Ghost running. You learned how to bypass Windows limits using WSL, how to manage Docker memory, and how Ghost routes social web traffic.</p>
<p>You now have a stable, fast, and fully working Ghost CMS workspace ready for your content.</p>
<p><strong>Let’s connect!</strong> You can find my latest work on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Optimize Your Docker Build Cache & Cut Your CI/CD Pipeline Times by 80% ]]>
                </title>
                <description>
                    <![CDATA[ Every developer has been there. You push a one-line fix, grab your coffee, and wait. And wait. Twelve minutes later, your Docker image finishes rebuilding from scratch because something about the cach ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-optimize-your-docker-build-cache/</link>
                <guid isPermaLink="false">69bb1e218c55d6eefb64955f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ optimization ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Wed, 18 Mar 2026 21:50:25 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/9a5ca46f-c571-4d38-90b5-3c6d7d22c00f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every developer has been there. You push a one-line fix, grab your coffee, and wait. And wait. Twelve minutes later, your Docker image finishes rebuilding from scratch because something about the cache broke again.</p>
<p>I spent a good chunk of last year debugging slow Docker builds across multiple teams. The pattern was always the same: builds that should take two minutes were eating up fifteen, and nobody knew why. The fix turned out to be surprisingly systematic once I understood what was actually happening under the hood.</p>
<p>This guide walks you through exactly how to fix slow Docker builds, step by step. We'll start with how the cache actually works, then tear apart the most common mistakes, and finish with production-ready patterns you can copy into your projects today.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-docker-build-cache-actually-works">How Docker Build Cache Actually Works</a></p>
<ul>
<li><p><a href="#heading-how-cache-keys-are-computed">How Cache Keys Are Computed</a></p>
</li>
<li><p><a href="#heading-the-cache-chain-rule">The Cache Chain Rule</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-identify-common-cache-busting-mistakes">How to Identify Common Cache-Busting Mistakes</a></p>
<ul>
<li><p><a href="#heading-mistake-1-copying-everything-too-early">Mistake 1: Copying Everything Too Early</a></p>
</li>
<li><p><a href="#heading-mistake-2-not-separating-dependency-files">Mistake 2: Not Separating Dependency Files</a></p>
</li>
<li><p><a href="#heading-mistake-3-using-add-instead-of-copy">Mistake 3: Using ADD Instead of COPY</a></p>
</li>
<li><p><a href="#heading-mistake-4-splitting-apt-get-update-and-install">Mistake 4: Splitting apt-get update and install</a></p>
</li>
<li><p><a href="#heading-mistake-5-embedding-timestamps-or-git-hashes-too-early">Mistake 5: Embedding Timestamps or Git Hashes Too Early</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-structure-your-dockerfile-for-maximum-cache-reuse">How to Structure Your Dockerfile for Maximum Cache Reuse</a></p>
<ul>
<li><p><a href="#heading-step-1-apply-the-dependency-first-pattern">Step 1: Apply the Dependency-First Pattern</a></p>
</li>
<li><p><a href="#heading-step-2-add-an-aggressive-dockerignore">Step 2: Add an Aggressive .dockerignore</a></p>
</li>
<li><p><a href="#heading-step-3-use-multi-stage-builds">Step 3: Use Multi-Stage Builds</a></p>
</li>
<li><p><a href="#heading-step-4-order-layers-by-change-frequency">Step 4: Order Layers by Change Frequency</a></p>
</li>
<li><p><a href="#heading-step-5-use-buildkit-mount-caches">Step 5: Use BuildKit Mount Caches</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-cicd-cache-backends">How to Set Up CI/CD Cache Backends</a></p>
<ul>
<li><p><a href="#heading-option-a-registry-based-cache">Option A: Registry-Based Cache</a></p>
</li>
<li><p><a href="#heading-option-b-github-actions-cache">Option B: GitHub Actions Cache</a></p>
</li>
<li><p><a href="#heading-option-c-s3-or-cloud-storage">Option C: S3 or Cloud Storage</a></p>
</li>
<li><p><a href="#heading-option-d-local-cache-with-persistent-runners">Option D: Local Cache with Persistent Runners</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-implement-advanced-cache-patterns">How to Implement Advanced Cache Patterns</a></p>
<ul>
<li><p><a href="#heading-parallel-build-stages">Parallel Build Stages</a></p>
</li>
<li><p><a href="#heading-cache-warming-for-feature-branches">Cache Warming for Feature Branches</a></p>
</li>
<li><p><a href="#heading-selective-cache-invalidation-with-build-args">Selective Cache Invalidation with Build Args</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-measure-your-improvements">How to Measure Your Improvements</a></p>
<ul>
<li><p><a href="#heading-the-four-scenarios-to-benchmark">The Four Scenarios to Benchmark</a></p>
</li>
<li><p><a href="#heading-real-world-before-and-after-numbers">Real-World Before and After Numbers</a></p>
</li>
<li><p><a href="#heading-how-to-check-cache-hit-rates">How to Check Cache Hit Rates</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-complete-optimized-dockerfile-examples">Complete Optimized Dockerfile Examples</a></p>
<ul>
<li><p><a href="#heading-nodejs-full-stack-app">Node.js Full-Stack App</a></p>
</li>
<li><p><a href="#heading-python-fastapi-app">Python FastAPI App</a></p>
</li>
<li><p><a href="#heading-go-microservice">Go Microservice</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-troubleshooting-guide">Troubleshooting Guide</a></p>
</li>
<li><p><a href="#heading-quick-reference-checklist">Quick-Reference Checklist</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you'll need:</p>
<ul>
<li><p>A working Docker installation (Docker Desktop or Docker Engine 20.10+)</p>
</li>
<li><p>Basic comfort with writing Dockerfiles</p>
</li>
<li><p>Access to a CI/CD system like GitHub Actions, GitLab CI, or Jenkins</p>
</li>
</ul>
<h2 id="heading-how-docker-build-cache-actually-works">How Docker Build Cache Actually Works</h2>
<p>Every instruction in a Dockerfile produces a <strong>layer</strong>. Docker stores these layers and reuses them when it detects nothing has changed. That's the cache. Simple enough in theory, but the details matter a lot.</p>
<h3 id="heading-how-cache-keys-are-computed">How Cache Keys Are Computed</h3>
<p>Different instructions compute their cache keys differently:</p>
<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cache Key Based On</th>
<th>What Breaks It</th>
</tr>
</thead>
<tbody><tr>
<td><code>RUN</code></td>
<td>The exact command string</td>
<td>Any change to the command text</td>
</tr>
<tr>
<td><code>COPY</code> / <code>ADD</code></td>
<td>File checksums of the source content</td>
<td>Any modification to the copied files</td>
</tr>
<tr>
<td><code>ENV</code> / <code>ARG</code></td>
<td>The variable name and value</td>
<td>Changing the value</td>
</tr>
<tr>
<td><code>FROM</code></td>
<td>The base image digest</td>
<td>A new version of the base image</td>
</tr>
</tbody></table>
<h3 id="heading-the-cache-chain-rule">The Cache Chain Rule</h3>
<p>Here's the thing most people miss: <strong>Docker cache is sequential.</strong> If any layer's cache gets invalidated, every layer after it rebuilds from scratch, even if those later layers haven't changed at all.</p>
<p>Picture a row of dominoes. Knock one over in the middle and everything after it goes down too. This is why the order of instructions in your Dockerfile is so important.</p>
<blockquote>
<p><strong>Key insight:</strong> The single most impactful optimization you can make is reordering your Dockerfile so that the stuff that changes most often comes last.</p>
</blockquote>
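<p>You can watch the dominoes fall with three back-to-back builds. This is a sketch: <code>src/index.js</code> stands in for any file your Dockerfile copies early.</p>
<pre><code class="language-bash"># Build once to populate the cache
docker build -t demo .

# Rebuild with nothing changed: every step reports CACHED
docker build -t demo .

# Change a file that's copied early, then rebuild: the COPY layer
# and every layer after it run again, even the unchanged ones
echo "// trivial edit" &gt;&gt; src/index.js
docker build -t demo .
</code></pre>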
<h2 id="heading-how-to-identify-common-cache-busting-mistakes">How to Identify Common Cache-Busting Mistakes</h2>
<p>Before we fix anything, let's look at what's probably breaking your cache right now. I've seen these patterns in almost every unoptimized Dockerfile I've reviewed.</p>
<h3 id="heading-mistake-1-copying-everything-too-early">Mistake 1: Copying Everything Too Early</h3>
<p>This is the big one. Putting <code>COPY . .</code> near the top of the Dockerfile, before installing dependencies, means that <em>any</em> file change in your project invalidates the cache from that point forward. Changed a README? Cool, now your dependencies reinstall.</p>
<pre><code class="language-dockerfile"># BAD: Any file change invalidates the dependency install
FROM node:20-alpine
WORKDIR /app
COPY . .                    # Cache busted on every commit
RUN npm ci                  # Reinstalls every single time
RUN npm run build
</code></pre>
<h3 id="heading-mistake-2-not-separating-dependency-files">Mistake 2: Not Separating Dependency Files</h3>
<p>Your dependency manifests (<code>package.json</code>, <code>requirements.txt</code>, <code>go.mod</code>, <code>Gemfile</code>) change way less often than your source code. If you don't copy them separately, you're reinstalling all dependencies every time you touch a source file.</p>
<h3 id="heading-mistake-3-using-add-instead-of-copy">Mistake 3: Using ADD Instead of COPY</h3>
<p><code>ADD</code> has special behaviors like auto-extracting archives and fetching remote URLs. Those features make its cache behavior unpredictable. Stick with <code>COPY</code> unless you specifically need archive extraction.</p>
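<p>In practice, that means reaching for <code>COPY</code> by default and reserving <code>ADD</code> for the one job <code>COPY</code> can't do. A quick sketch (the archive name is hypothetical):</p>
<pre><code class="language-dockerfile"># Default: plain file copying
COPY config/ /app/config/

# Exception: ADD auto-extracts local tar archives into the target directory
ADD vendor-libs.tar.gz /opt/vendor/
</code></pre>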
<h3 id="heading-mistake-4-splitting-apt-get-update-and-install">Mistake 4: Splitting apt-get update and install</h3>
<p>When you put <code>apt-get update</code> and <code>apt-get install</code> in separate <code>RUN</code> commands, the update step gets cached with stale package indexes. Then the install step fails or grabs outdated packages.</p>
<pre><code class="language-dockerfile"># BAD: Stale package index
RUN apt-get update
RUN apt-get install -y curl    # May fail with stale index

# GOOD: Always combine them
RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*
</code></pre>
<h3 id="heading-mistake-5-embedding-timestamps-or-git-hashes-too-early">Mistake 5: Embedding Timestamps or Git Hashes Too Early</h3>
<p>Injecting build-time variables like timestamps or git commit hashes via <code>ARG</code> or <code>ENV</code> early in the Dockerfile invalidates the cache on every single build. Move these to the very last layer.</p>
<blockquote>
<p>⚠️ <strong>Watch out for this:</strong> CI/CD systems often inject variables like <code>BUILD_NUMBER</code> or <code>GIT_SHA</code> as build args automatically. If those <code>ARG</code> declarations sit near the top, your cache is toast on every run.</p>
</blockquote>
<h2 id="heading-how-to-structure-your-dockerfile-for-maximum-cache-reuse">How to Structure Your Dockerfile for Maximum Cache Reuse</h2>
<p>Now let's fix those mistakes. These five steps, applied in order, will get you most of the way to an optimized build.</p>
<h3 id="heading-step-1-apply-the-dependency-first-pattern">Step 1: Apply the Dependency-First Pattern</h3>
<p>Copy only the dependency manifests first, install, and then copy the rest of the source code. This one change alone can cut your build times in half.</p>
<pre><code class="language-dockerfile"># GOOD: Dependency-first pattern for Node.js
FROM node:20-alpine
WORKDIR /app

# Copy ONLY dependency files
COPY package.json package-lock.json ./

# Install dependencies (cached unless package files change).
# Plain `npm ci` here, not `npm ci --omit=dev`: the build step below
# needs devDependencies. Prune dev packages in a later stage instead.
RUN npm ci

# Copy source code (only this layer rebuilds on code changes)
COPY . .

# Build
RUN npm run build
</code></pre>
<p>The same idea works across every language:</p>
<table>
<thead>
<tr>
<th>Language</th>
<th>Copy First</th>
<th>Install Command</th>
</tr>
</thead>
<tbody><tr>
<td>Node.js</td>
<td><code>package.json</code>, <code>package-lock.json</code></td>
<td><code>npm ci</code></td>
</tr>
<tr>
<td>Python</td>
<td><code>requirements.txt</code> or <code>pyproject.toml</code></td>
<td><code>pip install -r requirements.txt</code></td>
</tr>
<tr>
<td>Go</td>
<td><code>go.mod</code>, <code>go.sum</code></td>
<td><code>go mod download</code></td>
</tr>
<tr>
<td>Rust</td>
<td><code>Cargo.toml</code>, <code>Cargo.lock</code></td>
<td><code>cargo fetch</code></td>
</tr>
<tr>
<td>Java (Maven)</td>
<td><code>pom.xml</code></td>
<td><code>mvn dependency:go-offline</code></td>
</tr>
<tr>
<td>Ruby</td>
<td><code>Gemfile</code>, <code>Gemfile.lock</code></td>
<td><code>bundle install</code></td>
</tr>
</tbody></table>
<h3 id="heading-step-2-add-an-aggressive-dockerignore">Step 2: Add an Aggressive .dockerignore</h3>
<p>A <code>.dockerignore</code> file keeps irrelevant files out of the build context. Fewer files in the context means fewer things that can break your cache.</p>
<pre><code class="language-plaintext"># .dockerignore
.git
node_modules
dist
*.md
*.log
.env*
docker-compose*.yml
Dockerfile*
.github
tests
coverage
__pycache__
</code></pre>
<h3 id="heading-step-3-use-multi-stage-builds">Step 3: Use Multi-Stage Builds</h3>
<p>Multi-stage builds let you use a full development image for compiling, then copy only the finished artifacts into a slim runtime image. You get smaller images, better security, and improved cache performance because build tools and intermediate files don't carry over.</p>
<pre><code class="language-dockerfile"># Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package.json ./
EXPOSE 3000
CMD ["node", "dist/index.js"]
</code></pre>
<h3 id="heading-step-4-order-layers-by-change-frequency">Step 4: Order Layers by Change Frequency</h3>
<p>Think of your Dockerfile as a stack. Put the boring, stable stuff at the top and the volatile stuff at the bottom (an annotated skeleton follows the list):</p>
<ol>
<li><p>Base image and system dependencies (rarely change)</p>
</li>
<li><p>Language runtime configuration (occasionally change)</p>
</li>
<li><p>Application dependencies (change when you add or remove packages)</p>
</li>
<li><p>Source code (changes on every commit)</p>
</li>
<li><p>Build-time metadata like git hash or version labels (changes every build)</p>
</li>
</ol>
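<p>Here's that ordering as a minimal annotated skeleton for a Node.js app:</p>
<pre><code class="language-dockerfile"># 1. Base image and system dependencies (rarely change)
FROM node:20-alpine
RUN apk add --no-cache curl

# 2. Language runtime configuration (occasionally changes)
WORKDIR /app
ENV NODE_OPTIONS=--max-old-space-size=2048

# 3. Application dependencies (change when packages change)
COPY package.json package-lock.json ./
RUN npm ci

# 4. Source code (changes on every commit)
COPY . .
RUN npm run build

# 5. Build-time metadata (changes on every build)
ARG GIT_SHA=unknown
LABEL git.sha=$GIT_SHA
</code></pre>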
<h3 id="heading-step-5-use-buildkit-mount-caches">Step 5: Use BuildKit Mount Caches</h3>
<p>Docker BuildKit supports <code>RUN --mount=type=cache</code>, which mounts a persistent cache directory that survives across builds. This is a game-changer for package managers that maintain their own download caches.</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .

# Mount pip cache so downloads persist across builds
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY . .
</code></pre>
<p>The best part: mount caches persist even when the layer itself gets invalidated. So if you add one new package, pip only downloads that one package instead of re-fetching everything.</p>
<p>Here are the common cache targets for popular package managers:</p>
<table>
<thead>
<tr>
<th>Package Manager</th>
<th>Cache Target</th>
</tr>
</thead>
<tbody><tr>
<td>pip</td>
<td><code>/root/.cache/pip</code></td>
</tr>
<tr>
<td>npm</td>
<td><code>/root/.npm</code></td>
</tr>
<tr>
<td>yarn</td>
<td><code>/usr/local/share/.cache/yarn</code></td>
</tr>
<tr>
<td>go</td>
<td><code>/go/pkg/mod</code></td>
</tr>
<tr>
<td>apt</td>
<td><code>/var/cache/apt</code></td>
</tr>
<tr>
<td>maven</td>
<td><code>/root/.m2/repository</code></td>
</tr>
</tbody></table>
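<p>For example, the npm row from the table plugs in like this (the same mount shows up again in the complete Node.js example later in this article):</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./

# npm's download cache survives even when this layer is invalidated
RUN --mount=type=cache,target=/root/.npm npm ci
</code></pre>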
<h2 id="heading-how-to-set-up-cicd-cache-backends">How to Set Up CI/CD Cache Backends</h2>
<p>Here's where things get tricky. Your local Docker cache works great on your laptop because the layers persist between builds. But CI/CD runners are usually ephemeral: each job starts with a totally empty cache. Without explicit cache configuration, every CI build is a cold build.</p>
<h3 id="heading-option-a-registry-based-cache">Option A: Registry-Based Cache</h3>
<p>BuildKit can push and pull cache layers from a container registry. This is the most portable approach and works with any CI system.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=registry,ref=myregistry.io/myapp:buildcache \
  --cache-to type=registry,ref=myregistry.io/myapp:buildcache,mode=max \
  --tag myregistry.io/myapp:latest \
  --push .
</code></pre>
<blockquote>
<p>💡 <strong>Use</strong> <code>mode=max</code> to cache all layers including intermediate build stages. The default <code>mode=min</code> only caches layers in the final stage, which means your build stage layers get thrown away.</p>
</blockquote>
<h3 id="heading-option-b-github-actions-cache">Option B: GitHub Actions Cache</h3>
<p>If you're on GitHub Actions, there's native integration with BuildKit through the GitHub Actions cache API. It's fast and requires minimal setup.</p>
<pre><code class="language-yaml"># .github/workflows/build.yml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myregistry.io/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
</code></pre>
<h3 id="heading-option-c-s3-or-cloud-storage">Option C: S3 or Cloud Storage</h3>
<p>For teams on AWS, GCP, or Azure, cloud object storage makes a solid cache backend. It's fast, persistent, and works across any CI system.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=my-docker-cache,name=myapp \
  --cache-to type=s3,region=us-east-1,bucket=my-docker-cache,name=myapp,mode=max \
  --tag myapp:latest .
</code></pre>
<h3 id="heading-option-d-local-cache-with-persistent-runners">Option D: Local Cache with Persistent Runners</h3>
<p>If your CI runners have persistent storage (self-hosted runners, GitLab runners with shared volumes), you can export cache to a local directory.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=local,src=/ci-cache/myapp \
  --cache-to type=local,dest=/ci-cache/myapp,mode=max \
  --tag myapp:latest .
</code></pre>
<h2 id="heading-how-to-implement-advanced-cache-patterns">How to Implement Advanced Cache Patterns</h2>
<p>Once you've nailed the basics, these patterns can squeeze out even more performance.</p>
<h3 id="heading-parallel-build-stages">Parallel Build Stages</h3>
<p>BuildKit builds independent stages in parallel. If your app has a frontend and a backend that don't depend on each other during build, split them into separate stages and let BuildKit run them simultaneously.</p>
<pre><code class="language-dockerfile"># These stages build in parallel
FROM node:20-alpine AS frontend
WORKDIR /frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM python:3.12-slim AS backend
WORKDIR /backend
COPY backend/requirements.txt .
RUN pip install -r requirements.txt
COPY backend/ .

# Final stage combines both
FROM python:3.12-slim
COPY --from=backend /backend /app
COPY --from=frontend /frontend/dist /app/static
CMD ["python", "/app/main.py"]
</code></pre>
<h3 id="heading-cache-warming-for-feature-branches">Cache Warming for Feature Branches</h3>
<p>Feature branches often start with a cold cache because they diverge from main. You can warm the cache by specifying multiple <code>--cache-from</code> sources. Docker checks them in order.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=registry,ref=registry.io/app:cache-${BRANCH} \
  --cache-from type=registry,ref=registry.io/app:cache-main \
  --cache-to type=registry,ref=registry.io/app:cache-${BRANCH},mode=max \
  --tag registry.io/app:${BRANCH} .
</code></pre>
<p>If the branch cache hits, Docker uses it. If not, it falls back to main's cache, which usually shares most layers. This makes a massive difference for short-lived branches.</p>
<h3 id="heading-selective-cache-invalidation-with-build-args">Selective Cache Invalidation with Build Args</h3>
<p>You can use <code>ARG</code> instructions as cache boundaries. Layers above the <code>ARG</code> stay cached when its value changes. Below it, the cache misses at the first instruction that consumes the variable, and because every <code>RUN</code> after an <code>ARG</code> declaration receives its value implicitly as an environment variable, those <code>RUN</code> layers rebuild too.</p>
<pre><code class="language-dockerfile">FROM node:20-alpine
WORKDIR /app

COPY package.json package-lock.json ./
RUN npm ci

# This ARG only invalidates layers below it
ARG CACHE_BUST_CODE=1
COPY . .
RUN npm run build

# This ARG only invalidates the label
ARG GIT_SHA=unknown
LABEL git.sha=$GIT_SHA
</code></pre>
<h2 id="heading-how-to-measure-your-improvements">How to Measure Your Improvements</h2>
<p>Optimization without measurement is just guessing. Here's how to actually prove your changes are working.</p>
<h3 id="heading-the-four-scenarios-to-benchmark">The Four Scenarios to Benchmark</h3>
<p>Run each scenario at least three times and take the median (a shell sketch for producing the timings follows the list):</p>
<ol>
<li><p><strong>Cold build:</strong> No cache at all (first build or after <code>docker builder prune</code>)</p>
</li>
<li><p><strong>Warm build:</strong> No changes, full cache hit</p>
</li>
<li><p><strong>Code change:</strong> Only source code modified</p>
</li>
<li><p><strong>Dependency change:</strong> Package manifest modified</p>
</li>
</ol>
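<p>Here's a shell sketch for timing the four scenarios on a Node.js project. The file and package names are placeholders, and note that <code>COPY</code> cache keys are content checksums, so a plain <code>touch</code> won't bust anything: the file contents have to change.</p>
<pre><code class="language-bash"># Scenario 1: cold build (wipe the BuildKit cache first)
docker builder prune -af
time docker buildx build -t myapp:bench .

# Scenario 2: warm build, no changes at all
time docker buildx build -t myapp:bench .

# Scenario 3: code change only (contents must actually change)
echo "// bench edit" &gt;&gt; src/index.js
time docker buildx build -t myapp:bench .

# Scenario 4: dependency change (any manifest edit works)
npm install --save-exact left-pad
time docker buildx build -t myapp:bench .
</code></pre>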
<h3 id="heading-real-world-before-and-after-numbers">Real-World Before and After Numbers</h3>
<p>Here's what I saw on a mid-sized Node.js project after applying the techniques from this guide:</p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Before</th>
<th>After</th>
<th>Improvement</th>
</tr>
</thead>
<tbody><tr>
<td>Cold build</td>
<td>12 min 34 sec</td>
<td>8 min 10 sec</td>
<td>35%</td>
</tr>
<tr>
<td>Warm build (no changes)</td>
<td>12 min 34 sec</td>
<td>14 sec</td>
<td>98%</td>
</tr>
<tr>
<td>Code change only</td>
<td>12 min 34 sec</td>
<td>1 min 52 sec</td>
<td>85%</td>
</tr>
<tr>
<td>Dependency change</td>
<td>12 min 34 sec</td>
<td>4 min 20 sec</td>
<td>65%</td>
</tr>
</tbody></table>
<p>The "before" column is the same for all rows because without cache optimization, every build was essentially a cold build. That 85% improvement on code-only changes is the number that matters most, since that's what happens on the vast majority of commits.</p>
<h3 id="heading-how-to-check-cache-hit-rates">How to Check Cache Hit Rates</h3>
<p>Set <code>BUILDKIT_PROGRESS=plain</code> to get detailed output showing which layers hit cache:</p>
<pre><code class="language-bash">BUILDKIT_PROGRESS=plain docker buildx build . 2&gt;&amp;1 | grep -E 'CACHED|DONE'
</code></pre>
<p>Look for the <code>CACHED</code> prefix on layers. Your goal is to see <code>CACHED</code> on everything except the layers that actually needed to change.</p>
<h2 id="heading-complete-optimized-dockerfile-examples">Complete Optimized Dockerfile Examples</h2>
<p>Here are production-ready Dockerfiles you can adapt for your own projects.</p>
<h3 id="heading-nodejs-full-stack-app">Node.js Full-Stack App</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN --mount=type=cache,target=/root/.npm npm ci

FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 appgroup \
    &amp;&amp; adduser --system --uid 1001 appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
COPY package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
</code></pre>
<h3 id="heading-python-fastapi-app">Python FastAPI App</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --user -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
</code></pre>
<h3 id="heading-go-microservice">Go Microservice</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -ldflags='-s -w' -o /app/server ./cmd/server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
EXPOSE 8080
ENTRYPOINT ["/server"]
</code></pre>
<h2 id="heading-troubleshooting-guide">Troubleshooting Guide</h2>
<p>When things go wrong, check this table first:</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>All layers rebuild every time</td>
<td><code>COPY . .</code> is too early, or <code>.dockerignore</code> is missing</td>
<td>Move <code>COPY . .</code> after dependency install; add <code>.dockerignore</code></td>
</tr>
<tr>
<td>Cache never hits in CI</td>
<td>No cache backend configured</td>
<td>Add <code>--cache-from</code> / <code>--cache-to</code> with registry, gha, or s3 backend</td>
</tr>
<tr>
<td>Cache hits locally but not in CI</td>
<td>Different Docker versions or BuildKit not enabled</td>
<td>Set <code>DOCKER_BUILDKIT=1</code> and match Docker versions</td>
</tr>
<tr>
<td>Dependency layer always rebuilds</td>
<td>Source files copied before dependency install</td>
<td>Use the dependency-first pattern</td>
</tr>
<tr>
<td>Image size keeps growing</td>
<td>Build artifacts leaking into final image</td>
<td>Use multi-stage builds; only copy runtime artifacts</td>
</tr>
<tr>
<td>Registry cache is very slow</td>
<td><code>mode=max</code> caching too many layers</td>
<td>Try <code>mode=min</code> or switch to gha/s3 for faster backends</td>
</tr>
</tbody></table>
<h2 id="heading-quick-reference-checklist">Quick-Reference Checklist</h2>
<p>Print this out and tape it next to your monitor:</p>
<ul>
<li><p>[ ] Enable BuildKit: set <code>DOCKER_BUILDKIT=1</code> or use <code>docker buildx</code></p>
</li>
<li><p>[ ] Add a comprehensive <code>.dockerignore</code> file</p>
</li>
<li><p>[ ] Use the dependency-first pattern: copy manifests, install, then copy source</p>
</li>
<li><p>[ ] Order layers from least-changed to most-changed</p>
</li>
<li><p>[ ] Combine <code>RUN</code> commands that belong together (<code>apt-get update &amp;&amp; install</code>)</p>
</li>
<li><p>[ ] Use multi-stage builds to separate build and runtime</p>
</li>
<li><p>[ ] Add <code>RUN --mount=type=cache</code> for package manager caches</p>
</li>
<li><p>[ ] Move volatile <code>ARG</code>s (git hash, build number) to the very last layers</p>
</li>
<li><p>[ ] Configure a CI/CD cache backend (registry, gha, or s3)</p>
</li>
<li><p>[ ] Set up cache warming for feature branches from the main branch</p>
</li>
<li><p>[ ] Use <code>COPY</code> instead of <code>ADD</code> unless you need archive extraction</p>
</li>
<li><p>[ ] Benchmark all four scenarios: cold, warm, code change, dependency change</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I used to think slow Docker builds were just something you had to live with. After going through this process on a few projects, I realized the fix is pretty mechanical once you understand that one core principle: cache is sequential, and order matters.</p>
<p>Start with the dependency-first pattern and a <code>.dockerignore</code>. Those two changes alone will probably cut your build times in half. Then add multi-stage builds, mount caches, and CI/CD cache backends as you need them.</p>
<p>The teams I've worked with typically see 70-85% reductions in CI/CD pipeline times after spending a few hours on these changes. That's time you get back on every single commit, every single day.</p>
<p>If you found this helpful, consider sharing it with your team. There's a good chance whoever wrote your Dockerfile last didn't know about half of these tricks. No shade to them, I didn't either until I went looking.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Containerize Your MLOps Pipeline from Training to Serving ]]>
                </title>
                <description>
                    <![CDATA[ Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deplo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/containerize-mlops-pipeline-from-training-to-serving/</link>
                <guid isPermaLink="false">69b33f5993256dfc5313bee2</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ production ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 22:34:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/156eaca3-8884-4f57-9010-9766278dbf5a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.</p>
<p>The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.</p>
<p>Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.</p>
<p>That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.</p>
<p>In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+</p>
</li>
<li><p>For GPU training, you'll need the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA Container Toolkit</a> installed on the host and a compatible GPU driver. Run <code>nvidia-smi</code> to verify your GPU is visible, and <code>docker compose version</code> to check your Compose version.</p>
</li>
<li><p>Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-build-the-training-container">How to Build the Training Container</a></p>
<ul>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</a></p>
</li>
</ul>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-build-the-serving-container">How to Build the Serving Container</a></p>
<ul>
<li><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-decouple-models-from-containers">Decouple Models from Containers</a></li>
</ul>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-where-this-breaks-down">Where This Breaks Down</a></p>
</li>
</ul>
<h2 id="heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</h2>
<p>If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.</p>
<p>An MLOps pipeline is a chain of interdependent stages:</p>
<ol>
<li><p><strong>Data ingestion and validation.</strong> Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.</p>
</li>
<li><p><strong>Feature engineering.</strong> You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.</p>
</li>
<li><p><strong>Experiment tracking.</strong> You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.</p>
</li>
<li><p><strong>Model training.</strong> The model learns from your features. This is the compute-heavy part that often needs GPUs.</p>
</li>
<li><p><strong>Evaluation.</strong> You measure the trained model against test data to see if it's good enough to deploy.</p>
</li>
<li><p><strong>Packaging and serving.</strong> You wrap the trained model in an API so other systems can send it data and get predictions back.</p>
</li>
<li><p><strong>Monitoring.</strong> You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.</p>
</li>
</ol>
<p>Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.</p>
<p>The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.</p>
<p>Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.</p>
<p>This gives you the flexibility to:</p>
<ul>
<li><p>Scale training on expensive GPU instances while running serving on cheaper CPU nodes</p>
</li>
<li><p>Update your feature engineering code without rebuilding your training environment</p>
</li>
<li><p>Version each stage independently in your container registry</p>
</li>
<li><p>Let data scientists and ML engineers work on training while platform engineers optimize serving</p>
</li>
</ul>
<h2 id="heading-how-to-build-the-training-container">How to Build the Training Container</h2>
<p>The training container is where most teams start, and where most teams make their first mistake.</p>
<p>The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.</p>
<p>Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.</p>
<p>If you're new to these concepts: a <strong>multi-stage build</strong> lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.</p>
<p>A <strong>cache mount</strong> tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.</p>
<p>Here's the training Dockerfile:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]
</code></pre>
<p>Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.</p>
<p>That's why we put things in order of how often they change:</p>
<ol>
<li><p><strong>System packages at the top</strong> (they almost never change). Installing <code>python3.11</code> and <code>git</code> takes time, but you only do it once.</p>
</li>
<li><p><strong>Python dependencies in the middle</strong> (they change when you add or update a library). This layer rebuilds when <code>requirements-train.txt</code> changes.</p>
</li>
<li><p><strong>Your actual code at the bottom</strong> (changes on every commit). This is the layer that rebuilds most often.</p>
</li>
</ol>
<p>With this ordering, a code change only rebuilds the final layer, not the entire image. If you put <code>COPY src/</code> before <code>pip install</code>, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.</p>
<p>The <code>--mount=type=cache,target=/root/.cache/pip</code> line on the <code>pip install</code> command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.</p>
<h3 id="heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</h3>
<p>Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.</p>
<p>It's a good idea to maintain separate requirements files:</p>
<pre><code class="language-plaintext"># requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
</code></pre>
<p>The overlap is smaller than you'd think. <code>torch</code> and <code>scikit-learn</code> appear in both because the model needs them for inference. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.</p>
<h3 id="heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</h3>
<p>One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule of thumb: the host driver must support a CUDA version at least as new as the CUDA runtime in the container. For example, CUDA 12.6 requires driver version 560.28+ on Linux.</p>
<p>Make sure you check your host driver version before choosing your base image:</p>
<pre><code class="language-bash"># On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed
</code></pre>
<p>If your host driver is 535.x, don't use a <code>cuda:12.6</code> base image. Use <code>cuda:12.2</code> or upgrade the driver. Mismatched versions produce cryptic errors like <code>CUDA error: no kernel image is available for execution on the device</code> that are painful to debug.</p>
<p>Pin your base images to specific tags (not <code>latest</code>) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.</p>
<h2 id="heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</h2>
<p>If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.</p>
<p><a href="https://mlflow.org/">MLflow</a> is the most widely adopted open-source tool for this. It logs three things for every training run: <strong>parameters</strong> (learning rate, batch size, number of epochs), <strong>metrics</strong> (accuracy, loss, F1 score), and <strong>artifacts</strong> (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.</p>
<p>Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:</p>
<pre><code class="language-yaml">services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:
</code></pre>
<p>Let me break down what's happening here.</p>
<p>The <code>mlflow</code> service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume.</p>
<p>The <code>depends_on</code> with <code>condition: service_healthy</code> tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.</p>
<p>The <code>db</code> service runs Postgres with a health check that uses <code>pg_isready</code>, a built-in Postgres utility that checks if the database is accepting connections. The <code>start_period</code> gives Postgres 10 seconds to initialize before health checks start counting failures.</p>
<p>Your training code connects to MLflow by setting one environment variable:</p>
<pre><code class="language-python">import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")
</code></pre>
<p>After the run completes, open <code>http://localhost:5000</code> in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.</p>
<p>A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.</p>
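<p>A minimal sketch of what that looks like with Compose file-based secrets. The secrets path is hypothetical; the official <code>postgres</code> image reads <code>POSTGRES_PASSWORD_FILE</code> natively, and the MLflow service would need the same credential injected into its <code>--backend-store-uri</code> as well:</p>
<pre><code class="language-yaml">services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      # Read the password from the mounted secret instead of the YAML
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt   # keep this file out of Git
</code></pre>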
<h2 id="heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</h2>
<p>Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.</p>
<p><a href="https://dvc.org/">DVC (Data Version Control)</a> fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a <code>.dvc</code> file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.</p>
<p>The workflow on your local machine looks like this:</p>
<pre><code class="language-bash"># Initialize DVC in your project (one time)
dvc init

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to .gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file to Git
git add data/training_data.parquet.dvc .gitignore
git commit -m "Add training data v1"
</code></pre>
<p>Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run <code>dvc pull</code> and DVC downloads it from remote storage.</p>
<p>The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
</code></pre>
<p>The entrypoint script pulls the data and then starts training:</p>
<pre><code class="language-bash">#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"
</code></pre>
<p>For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:</p>
<pre><code class="language-yaml">training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1
</code></pre>
<p>Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.</p>
<p>You can reproduce any past experiment by checking out the Git commit and running the training container.</p>
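<p>In shell terms, replaying an old experiment looks roughly like this (<code>&lt;commit-sha&gt;</code> is whichever commit you logged in MLflow):</p>
<pre><code class="language-bash"># Check out the code, configs, and .dvc pointer files as they were
git checkout &lt;commit-sha&gt;

# Rebuild the training image from that commit
docker compose --profile train build training

# Run it: the entrypoint pulls the matching data version via DVC,
# then logs a fresh run to MLflow that you can compare to the original
docker compose --profile train run training
</code></pre>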
<h2 id="heading-how-to-build-the-serving-container">How to Build the Serving Container</h2>
<p>"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a <code>/predict</code> endpoint that accepts transaction data and returns a fraud probability.</p>
<p>The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:</p>
<pre><code class="language-dockerfile">FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]
</code></pre>
<p>A few things to understand if you're new to this:</p>
<p><code>uvicorn</code> is a lightweight Python web server that runs <a href="https://fastapi.tiangolo.com/">FastAPI</a> applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.</p>
<p><code>HEALTHCHECK</code> tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the <code>curl</code> command against the <code>/health</code> endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.</p>
<p><code>start-period</code> of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without <code>start-period</code>, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.</p>
<p>Notice we're using <code>python:3.11-slim</code> here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger.</p>
<p>If you want to skip the <code>curl</code> dependency, use Python's built-in <code>urllib</code> for the health check:</p>
<pre><code class="language-dockerfile">HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
</code></pre>
<h3 id="heading-decouple-models-from-containers">Decouple Models from Containers</h3>
<p>This is one of the most important patterns in this article, and the one beginners most often get wrong.</p>
<p>The temptation is to copy your trained model file (the <code>.pkl</code>, <code>.pt</code>, or <code>.onnx</code> file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.</p>
<p>Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.</p>
<p>Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:</p>
<pre><code class="language-python">import os
from contextlib import asynccontextmanager

import mlflow
import pandas as pd
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models:/&lt;model-name&gt;/&lt;stage&gt;" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        return {"status": "loading"}, 503
    return {"status": "healthy"}


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    import pandas as pd

    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
</code></pre>
<p>When a client sends a POST request to <code>/predict</code> with JSON like <code>{"amount": 500, "merchant_category": "electronics", "hour": 23}</code>, the model returns a prediction. The <code>/health</code> endpoint returns 503 while the model is loading and 200 once it's ready, which is exactly what the Docker <code>HEALTHCHECK</code> checks for.</p>
<p>Promoting a new model version means updating the <code>MODEL_URI</code> environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version.</p>
<p>For zero-downtime model updates, implement a reload endpoint that swaps models without restarting:</p>
<pre><code class="language-python">@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}
</code></pre>
<h2 id="heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</h2>
<p>By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.</p>
<p>This requires two things on the host (the machine running Docker, not inside the container):</p>
<ol>
<li><p><strong>NVIDIA GPU drivers</strong> installed and working. Verify with <code>nvidia-smi</code>. If that command shows your GPUs, you're good.</p>
</li>
<li><p><strong>NVIDIA Container Toolkit</strong> installed. This is the bridge between Docker and the GPU drivers. Install it from the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA docs</a> and verify with <code>docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi</code>. If you see your GPU listed, the toolkit is working.</p>
</li>
</ol>
<p>Once the host is set up, GPU access in Docker Compose looks like this:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>The <code>deploy.resources.reservations.devices</code> block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow will see the GPUs and use them automatically. You can verify by adding <code>print(torch.cuda.is_available())</code> to your training script, which should print <code>True</code>.</p>
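<p>A quick way to run that check without editing your training script, a sketch that overrides the training entrypoint for a single run:</p>
<pre><code class="language-bash"># One-off check that PyTorch inside the container can see the GPUs
docker compose --profile train run --rm --entrypoint python \
  training -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
</code></pre>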
<p>If you're running Compose v2.30.0+, you can use the shorter <code>gpus</code> syntax:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using <code>device_ids</code>. This matters when running multiple training jobs at the same time:</p>
<pre><code class="language-yaml">services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
</code></pre>
<p>Note that <code>CUDA_VISIBLE_DEVICES</code> inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.</p>
<h2 id="heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</h2>
<p>If you're new to Compose profiles: by default, <code>docker compose up</code> starts every service defined in your <code>docker-compose.yml</code>. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).</p>
<p>Profiles solve this. When you add <code>profiles: ["train"]</code> to a service, that service is excluded from <code>docker compose up</code> by default. It only starts when you explicitly activate the profile with <code>docker compose --profile train</code>. This means one file defines your entire ML infrastructure, but you control what runs and when.</p>
<p>Here's the complete <code>docker-compose.yml</code> that ties every piece from this article together:</p>
<pre><code class="language-yaml">services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:
</code></pre>
<p>The day-to-day workflow with this file:</p>
<pre><code class="language-bash"># Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'
</code></pre>
<p>This single-file approach means a new team member can clone the repo, run <code>docker compose up -d</code>, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).</p>
<h2 id="heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</h2>
<p>Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.</p>
<p>Here are the practices that make this work:</p>
<h3 id="heading-pin-everything">Pin Everything</h3>
<p>Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with <code>pip freeze &gt; requirements.txt</code>. Use fixed random seeds in your training code and log them in MLflow.</p>
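<p>As an example, here is one way to resolve a tag to an immutable digest you can reference in your <code>FROM</code> line (the <code>python:3.11-slim</code> tag here is just an illustration, not this article's actual base image):</p>
<pre><code class="language-bash"># Resolve the tag you currently build from to a content digest
docker pull python:3.11-slim
docker inspect --format '{{index .RepoDigests 0}}' python:3.11-slim
# python@sha256:&lt;digest&gt;

# Then pin the digest in your Dockerfile instead of the tag:
# FROM python@sha256:&lt;digest&gt;

# And pin Python packages exactly
pip freeze &gt; requirements.txt
</code></pre>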
<h3 id="heading-log-everything">Log Everything</h3>
<p>Every training run should log the exact library versions (<code>pip freeze</code>), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:</p>
<pre><code class="language-python">import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...
</code></pre>
<h3 id="heading-version-everything">Version Everything</h3>
<p>Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.</p>
<h2 id="heading-where-this-breaks-down">Where This Breaks Down</h2>
<p>This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:</p>
<p><strong>Large datasets.</strong> Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.</p>
<p><strong>GPU driver mismatches.</strong> Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.</p>
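<p>One quick way to catch a mismatch early is to compare the host driver against what a CUDA container can actually use. This is a sketch; swap in the CUDA image tag your project is built on:</p>
<pre><code class="language-bash"># Host side: report the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Container side: if the toolkit and driver are compatible, this succeeds
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
</code></pre>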
<p><strong>Multi-node training.</strong> When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.</p>
<p><strong>Serving at scale.</strong> A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with <code>docker compose up --scale serving=3</code>, but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.</p>
<p><strong>Secrets in production.</strong> The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.</p>
<p>That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it. The next model we shipped went from "notebook finished" to "running in production" in two days instead of three weeks, and most of that time was spent on evaluation and review, not fighting environments.</p>
<p>This approach has the limits described in the previous section. But even with those caveats, containerized MLOps eliminates the most common source of ML project delays: environment mismatch between development and production.</p>
<p>Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.</p>
<p>If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Self-Host AFFiNE on Windows with WSL and Docker ]]>
                </title>
                <description>
                    <![CDATA[ Depending on cloud apps means that you don't truly own your notes. If your internet goes down or if the company changes its rules, you could lose access. In this article, you'll learn how to build you ]]>
                </description>
                <link>https://www.freecodecamp.org/news/self-host-affine-windows/</link>
                <guid isPermaLink="false">69b2e3051be92d8f177bf807</guid>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 16:00:05 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/950eee10-aa2c-4071-9c40-abaf759f6d10.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Relying on cloud apps means you don't truly own your notes. If your internet goes down or if the company changes its rules, you could lose access.</p>
<p>In this article, you'll learn how to build your own private workspace using AFFiNE. You'll use Docker Compose to link three separate pieces of software together:</p>
<ul>
<li><p>The AFFiNE Core application.</p>
</li>
<li><p>A PostgreSQL database to store your notes and pages.</p>
</li>
<li><p>A Redis cache to make the app run fast and smooth.</p>
</li>
</ul>
<p>By the end of this article, you'll have a fully functional web app running on your own computer that works just like the cloud version of Notion.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-affine">What is AFFiNE?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-preparing-your-workspace">Step 1: Preparing Your Workspace</a></p>
</li>
<li><p><a href="#heading-step-2-getting-the-official-setup-files">Step 2: Getting the Official Setup Files</a></p>
</li>
<li><p><a href="#heading-step-3-configuring-your-environment-env">Step 3: Configuring Your Environment (.env)</a></p>
</li>
<li><p><a href="#heading-step-4-launching-the-system">Step 4: Launching the System</a></p>
</li>
<li><p><a href="#heading-step-5-accessing-the-admin-panel">Step 5: Accessing the Admin Panel</a></p>
</li>
<li><p><a href="#heading-step-6-configuration-making-it-yours">Step 6: Configuration (Making It Yours)</a></p>
</li>
<li><p><a href="#heading-step-7-connecting-the-desktop-app-optional">Step 7: Connecting the Desktop App (Optional)</a></p>
</li>
<li><p><a href="#heading-step-8-stopping-the-server-and-safe-backups">Step 8: Stopping the Server and Safe Backups</a></p>
</li>
<li><p><a href="#heading-step-9-how-to-upgrade-later">Step 9: How to Upgrade Later</a></p>
</li>
<li><p><a href="#heading-common-installation-errors-and-troubleshooting">Common Installation Errors and Troubleshooting</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-affine">What is AFFiNE?</h2>
<p>AFFiNE is an "all-in-one" workspace that combines the powers of writing, drawing, and planning.</p>
<p>While tools like Notion focus on documents and Miro focus on whiteboards, AFFiNE lets you do both in a single space. You can turn your written notes into a visual canvas with one click. This makes it perfect for brainstorming, tracking tasks, and managing your personal knowledge.</p>
<h3 id="heading-the-power-of-self-hosting">The Power of Self-Hosting</h3>
<p>While AFFiNE offers a cloud version, hosting it yourself gives you three major benefits:</p>
<ul>
<li><p><strong>Total data ownership:</strong> Your notes never leave your machine. You own the database.</p>
</li>
<li><p><strong>Privacy in the AI age:</strong> No big tech company can scan your private ideas or use them for AI training.</p>
</li>
<li><p><strong>Real DevOps skills:</strong> Learning how to manage Docker inside WSL is a high-value skill for any modern developer.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow this article, make sure you have these tools ready on your machine:</p>
<ul>
<li><p><strong>WSL 2 Installation:</strong> You must have WSL 2 installed with a Linux distribution if you are using Windows (I am using Ubuntu for this guide).</p>
</li>
<li><p><strong>Docker and Docker Compose:</strong> These must be installed and running on your machine.</p>
</li>
<li><p><strong>Linux Terminal Commands:</strong> You should be familiar with basic commands like <code>mkdir</code>, <code>cd</code>, and <code>wget</code>.</p>
</li>
</ul>
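<p>If you want to confirm everything is wired up before you start, run a quick check in your WSL terminal. Both commands should print a version number rather than "command not found":</p>
<pre><code class="language-shell">docker --version
docker compose version
</code></pre>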
<h2 id="heading-step-1-preparing-your-workspace">Step 1: Preparing Your Workspace</h2>
<p>To start, create a folder for your AFFiNE files. This keeps your data in one organised place.</p>
<p>Then open your WSL terminal and run these commands:</p>
<pre><code class="language-shell">mkdir affine
cd affine
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/021e4aef-ede1-4bec-b96e-2acaea9d8f40.png" alt="A terminal Showing the commands mkdir and cd" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-2-getting-the-official-setup-files">Step 2: Getting the Official Setup Files</h2>
<p>You will download the official configuration files directly from the AFFiNE GitHub releases page. In your WSL terminal, run these two commands:</p>
<ol>
<li>Download the Docker Compose file:</li>
</ol>
<pre><code class="language-shell">wget -O docker-compose.yml https://github.com/toeverything/affine/releases/latest/download/docker-compose.yml
</code></pre>
<ol start="2">
<li>Download the Environment template:</li>
</ol>
<pre><code class="language-shell">wget -O .env https://github.com/toeverything/affine/releases/latest/download/default.env.example
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/5b366a5f-b426-4e70-95c0-b469f40d6af5.png" alt="A terminal Showing the commands to download affine" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-3-configuring-your-environment-env">Step 3: Configuring Your Environment (.env)</h2>
<p>The <code>.env</code> file is like a hidden settings sheet. It keeps your passwords and setup details private.</p>
<p>To edit this file, you can use Nano, which is a simple text editor built into your Linux terminal. Follow these steps to update your settings:</p>
<ol>
<li><p><strong>Open the file with Nano:</strong></p>
<pre><code class="language-shell">nano .env
</code></pre>
</li>
<li><p><strong>Update the settings:</strong> Use your arrow keys to move around the file. Update these specific lines to match the locations below, which keeps your data safely inside your new <code>affine</code> folder. Also set <code>DB_PASSWORD</code> to a strong password of your own rather than leaving it empty:</p>
<pre><code class="language-plaintext">DB_DATA_LOCATION=./postgres
UPLOAD_LOCATION=./storage
CONFIG_LOCATION=./config

DB_USERNAME=affine
DB_PASSWORD=your-strong-password-here
DB_DATABASE=affine
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/d0f4a358-e221-45d3-94df-d97b606b4afc.png" alt="A terminal to change the values in env file" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>Save and Exit:</strong></p>
<ul>
<li><p>Press <strong>Ctrl + O</strong> to save.</p>
</li>
<li><p>Press <strong>Enter</strong> to confirm the filename.</p>
</li>
<li><p>Press <strong>Ctrl + X</strong> to exit the editor.</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-step-4-launching-the-system">Step 4: Launching the System</h2>
<p>Run this Docker command to build your workspace:</p>
<pre><code class="language-shell">docker compose up -d
</code></pre>
<p>Docker will download the AFFiNE app, a PostgreSQL database, and a Redis cache. The <code>-d</code> flag means they will run quietly in the background.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/407237bd-f805-4fca-b15c-6bf001f467e7.png" alt="A terminal Showing the commands for docker compose" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-5-accessing-the-admin-panel">Step 5: Accessing the Admin Panel</h2>
<p>Once the terminal says "Started," your private server is live!</p>
<p>Open your web browser and go to:</p>
<pre><code class="language-plaintext">http://localhost:3010/
</code></pre>
<p>The first time you visit this page, you must create an admin account. This is the master key to your server.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/780fafda-0afd-4b67-a2fa-6248b4d5d4f3.png" alt="creating an Admin account" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-6-configuration-making-it-yours">Step 6: Configuration (Making It Yours)</h2>
<p>There are two ways to configure your server.</p>
<h3 id="heading-the-easy-way-admin-panel"><strong>The Easy Way: Admin Panel</strong></h3>
<p>In your browser, go to <code>http://localhost:3010/admin/settings</code>. You can change your server name or set up emails here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/0f8d4e97-7a47-4328-8e91-a36582d47143.png" alt="Overview of the settings page" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-the-developer-way-config-file"><strong>The Developer Way: Config File</strong></h3>
<p>You can also create a <code>config.json</code> file inside your <code>./config</code> folder.</p>
<pre><code class="language-json">{
  "$schema": "https://github.com/toeverything/affine/releases/latest/download/config.schema.json",
  "server": {
    "name": "My Private Workspace"
  }
}
</code></pre>
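<p>AFFiNE reads this file when the server starts, so restart the stack from your <code>affine</code> folder after editing it:</p>
<pre><code class="language-shell">docker compose down
docker compose up -d
</code></pre>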
<h2 id="heading-step-7-connecting-the-desktop-app-optional">Step 7: Connecting the Desktop App (Optional)</h2>
<p>You don't have to use the browser. You can connect the official AFFiNE desktop app.</p>
<ol>
<li><p>Download the AFFiNE desktop app.</p>
</li>
<li><p>Click the workspace list panel in the top left corner.</p>
</li>
<li><p>Click "Add Server" and enter <code>http://localhost:3010</code>.</p>
</li>
<li><p>Log in with your account.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/2c668ed4-3552-420f-9217-e5f8d09f311c.png" alt="Connecting your local server to Affine Server" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/3a12b7f6-33b9-497e-8684-7fd7a09d8c42.png" alt="Overview of Workspace" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-8-stopping-the-server-and-safe-backups">Step 8: Stopping the Server and Safe Backups</h2>
<p>You must turn your server off safely before you back up your notes.</p>
<p>To do that, run this command:</p>
<pre><code class="language-shell">docker compose down
</code></pre>
<p>Once it stops, you can safely copy your entire <code>affine</code> folder to a safe place.</p>
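<p>For example, from the parent directory you can create a dated archive of the whole folder. You may need <code>sudo</code>, because the Postgres container writes its data files as a different user:</p>
<pre><code class="language-shell"># Create a compressed, dated backup of the entire affine folder
sudo tar czf affine-backup-$(date +%F).tar.gz affine/
</code></pre>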
<h2 id="heading-step-9-how-to-upgrade-later">Step 9: How to Upgrade Later</h2>
<p>When AFFiNE releases a new version, run these commands inside your <code>affine</code> folder:</p>
<ol>
<li>Download the newest blueprint:</li>
</ol>
<pre><code class="language-shell">wget -O docker-compose.yml https://github.com/toeverything/affine/releases/latest/download/docker-compose.yml
</code></pre>
<ol start="2">
<li>Pull the new images and restart:</li>
</ol>
<pre><code class="language-shell">docker compose pull
docker compose up -d
</code></pre>
<h2 id="heading-common-installation-errors-and-troubleshooting">Common Installation Errors and Troubleshooting</h2>
<h3 id="heading-1-docker-is-not-running">1. Docker is Not Running</h3>
<ul>
<li><p><strong>The Error:</strong> Terminal says <code>docker: command not found</code>.</p>
</li>
<li><p><strong>The Fix:</strong> Open the Docker Desktop app on Windows and wait for it to start.</p>
</li>
</ul>
<h3 id="heading-2-docker-is-not-connected-to-wsl">2. Docker is Not Connected to WSL</h3>
<ul>
<li><strong>The Fix:</strong> In Docker Desktop, go to <strong>Settings &gt; Resources &gt; WSL Integration</strong> and turn it ON for your distro.</li>
</ul>
<h3 id="heading-3-the-port-is-already-in-use">3. The Port is Already in Use</h3>
<ul>
<li><strong>The Fix:</strong> Open <code>docker-compose.yml</code>. Change <code>"3010:3010"</code> to <code>"4000:3010"</code>. You will now visit <code>localhost:4000</code>.</li>
</ul>
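<p>If you'd rather find out what is occupying the port first, you can check from your WSL terminal (assuming <code>lsof</code> is installed, which it is on most Ubuntu setups):</p>
<pre><code class="language-shell"># Show any process listening on port 3010
sudo lsof -i :3010
</code></pre>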
<h3 id="heading-4-permission-denied">4. Permission Denied</h3>
<ul>
<li><strong>The Fix:</strong> If you cannot delete a folder, use the sudo command: <code>sudo rm -rf affine/</code>. Be careful: this permanently deletes the folder and everything in it, including your notes.</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you've successfully built a self-hosted, private workspace. You practised using WSL, Docker Compose, and Postgres. These are valuable skills for any developer.</p>
<p><strong>Your next steps:</strong></p>
<ol>
<li><p>Create a note in AFFiNE documenting what you learned.</p>
</li>
<li><p>Turn off your server (<code>docker compose down</code>) and copy your folder to a backup drive.</p>
</li>
<li><p>Explore Cloudflare Tunnels if you want to access your server from your phone!</p>
</li>
</ol>
<p>Self-hosting takes a little work, but the privacy is worth it.</p>
<p><strong>Let’s connect!</strong> You can find my latest work on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
