<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Docker - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Docker - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 14 May 2026 09:13:43 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/docker/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Self-Hosted WhatsApp Bot with n8n and WAHA ]]>
                </title>
                <description>
                    <![CDATA[ WhatsApp is where many of your customers likely already are. For support tickets, order updates, booking reminders, and lead qualification, a WhatsApp channel often converts several times better  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-self-hosted-whatsapp-bot-with-n8n-and-waha/</link>
                <guid isPermaLink="false">6a01e032fca21b0d4b2bb4c1</guid>
                
                    <category>
                        <![CDATA[ whatsapp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ automation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ n8n ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ אחיה כהן ]]>
                </dc:creator>
                <pubDate>Mon, 11 May 2026 13:57:06 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/28affe4d-9359-4cbb-a311-a2ee9d0829c0.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>WhatsApp is where many of your customers likely already are. For support tickets, order updates, booking reminders, and lead qualification, a WhatsApp channel often converts several times better than email.</p>
<p>But the official WhatsApp Business Cloud API can be slow to onboard, template-restricted for proactive messages, and priced per conversation — which adds up fast at scale.</p>
<p>There's another path: you can run your own WhatsApp HTTP gateway on a small server, connect it to a workflow engine, and keep every message — inbound and outbound — inside infrastructure you control. No monthly conversation fees, no template approvals for routine replies, no third-party middleman holding your customer data.</p>
<p>In this tutorial, you'll build exactly that. By the end, you'll have a WhatsApp bot that:</p>
<ul>
<li><p>Receives every incoming message through a webhook</p>
</li>
<li><p>Routes messages through an n8n workflow</p>
</li>
<li><p>Replies automatically based on keywords, AI, or any API call you want</p>
</li>
<li><p>Runs entirely on your own server, using two open-source tools</p>
</li>
</ul>
<p>You'll use <strong>WAHA</strong> (WhatsApp HTTP API) as the gateway, and <strong>n8n</strong> as the workflow engine. Both run in Docker, both are free for self-hosting, and together they cover everything from a simple auto-reply to a full CRM integration.</p>
<h2 id="heading-table-of-contents">Table of contents</h2>
<ul>
<li><p><a href="#heading-what-youll-learn">What You'll Learn</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-a-note-on-which-whatsapp-account-to-use">A Note on Which WhatsApp Account to Use</a></p>
</li>
<li><p><a href="#heading-waha-vs-the-official-whatsapp-business-cloud-api">WAHA vs the official WhatsApp Business Cloud API</a></p>
</li>
<li><p><a href="#heading-part-1-understanding-waha">Part 1: Understanding WAHA</a></p>
</li>
<li><p><a href="#heading-part-2-running-waha-with-docker">Part 2: Running WAHA with Docker</a></p>
</li>
<li><p><a href="#heading-part-3-starting-a-whatsapp-session">Part 3: Starting a WhatsApp session</a></p>
</li>
<li><p><a href="#heading-part-4-running-n8n">Part 4: Running n8n</a></p>
</li>
<li><p><a href="#heading-part-5-creating-the-webhook-trigger-in-n8n">Part 5: Creating the Webhook Trigger in n8n</a></p>
</li>
<li><p><a href="#heading-part-6-wiring-waha-to-n8n">Part 6: Wiring WAHA to n8n</a></p>
</li>
<li><p><a href="#heading-part-7-building-the-first-auto-reply">Part 7: Building the first auto-reply</a></p>
</li>
<li><p><a href="#heading-part-8-a-second-example-proactive-booking-confirmations">Part 8: A Second Example — Proactive Booking Confirmations</a></p>
</li>
<li><p><a href="#heading-part-9-going-to-production">Part 9: Going to Production</a></p>
</li>
<li><p><a href="#heading-common-pitfalls">Common Pitfalls</a></p>
</li>
<li><p><a href="#heading-where-to-go-next">Where to Go Next</a></p>
</li>
</ul>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<ul>
<li><p>How WAHA works under the hood and when to use it instead of the official Cloud API</p>
</li>
<li><p>How to run WAHA and n8n side by side with Docker Compose</p>
</li>
<li><p>How to scan the QR code and bind a WhatsApp account to your gateway</p>
</li>
<li><p>How to connect WAHA's webhook to an n8n workflow</p>
</li>
<li><p>How to build a keyword-based auto-reply bot</p>
</li>
<li><p>How to send proactive confirmations from a separate workflow</p>
</li>
<li><p>How to harden the setup for production (HTTPS, API keys, rate limits, Queue Mode)</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A Linux server (any VPS works — 2 GB of RAM is enough for a small bot)</p>
</li>
<li><p>Docker and Docker Compose installed</p>
</li>
<li><p>A public hostname with DNS pointing at the server, or an ngrok tunnel for local testing</p>
</li>
<li><p>A WhatsApp account you're willing to dedicate to the bot (more on that below)</p>
</li>
<li><p>Basic familiarity with JSON and HTTP requests</p>
</li>
</ul>
<p>You don't need prior n8n experience. If you can drag a box and wire it to another box, you can build the flow.</p>
<h2 id="heading-a-note-on-which-whatsapp-account-to-use">A Note on Which WhatsApp Account to Use</h2>
<p>WAHA works by running an actual WhatsApp Web session inside a headless Chromium process. It logs in as a real account — the same way you would open web.whatsapp.com in your browser. Meta doesn't officially endorse this approach for commercial use at scale, and heavy volume from a single number can lead to a ban.</p>
<p>For that reason, use a dedicated number for the bot. Don't use your personal WhatsApp. Get a second SIM, eSIM, or a VoIP number that supports WhatsApp activation. Keep outbound volume reasonable, and you'll be fine for most small-business use cases.</p>
<p>If you plan to send thousands of marketing messages per day, switch to the official WhatsApp Business Cloud API — that's what it exists for. This tutorial is aimed at the middle ground: support bots, order updates, booking confirmations, and similar conversational flows where you need real-time control without enterprise pricing.</p>
<h2 id="heading-waha-vs-the-official-whatsapp-business-cloud-api">WAHA vs the official WhatsApp Business Cloud API</h2>
<p>Before writing any code, it helps to understand when each option is the right fit.</p>
<table>
<thead>
<tr>
<th>Dimension</th>
<th>WAHA (self-hosted)</th>
<th>WhatsApp Cloud API (Meta)</th>
</tr>
</thead>
<tbody><tr>
<td>Onboarding</td>
<td>Scan a QR code — ready in minutes</td>
<td>Business verification, app review — days to weeks</td>
</tr>
<tr>
<td>Cost</td>
<td>Server cost only</td>
<td>Per-conversation pricing</td>
</tr>
<tr>
<td>Template approval</td>
<td>Not needed</td>
<td>Required for proactive messages outside the 24-hour window</td>
</tr>
<tr>
<td>Session model</td>
<td>One WhatsApp Web session per Core container</td>
<td>Native API, no web session</td>
</tr>
<tr>
<td>Risk</td>
<td>Account ban possible at high unsolicited volume</td>
<td>Rate limits but no ban for normal use</td>
</tr>
<tr>
<td>Vendor lock-in</td>
<td>None — pure open source</td>
<td>Tied to Meta's API and pricing</td>
</tr>
<tr>
<td>Best for</td>
<td>Support bots, small-team workflows, internal tools</td>
<td>High-volume marketing, regulated industries, &gt;100k monthly messages</td>
</tr>
</tbody></table>
<p>Neither is strictly better. If you run a support team for a small business, WAHA is often the pragmatic choice. If you're a bank sending millions of transactional messages, you want the Cloud API. Many teams run both — WAHA for conversational support, Cloud API for bulk transactional traffic.</p>
<h2 id="heading-part-1-understanding-waha">Part 1: Understanding WAHA</h2>
<p>WAHA is an open-source project that wraps WhatsApp Web behind a clean REST API. You <code>POST /api/sendText</code> with a chat ID and a message, and WAHA sends it. You configure a webhook URL, and WAHA <code>POST</code>s to that URL every time a message arrives.</p>
<p>Under the hood, WAHA spawns a Chromium instance, opens WhatsApp Web, and uses an engine (<code>whatsapp-web.js</code>, <code>NOWEB</code>, or <code>GOWS</code>) to automate the session. Your code doesn't see any of that complexity — you just see an HTTP API.</p>
<p>The project ships in two flavors:</p>
<ul>
<li><p><strong>WAHA Core</strong> — free, MIT licensed, one active session per container, community support.</p>
</li>
<li><p><strong>WAHA Plus</strong> — commercial license, multi-session support, priority support, and access to advanced endpoints.</p>
</li>
</ul>
<p>For most developers building a single bot, Core is enough. You can always upgrade later.</p>
<p>Official docs live at <a href="https://waha.devlike.pro/">waha.devlike.pro</a>. Keep that open in another tab — we'll reference specific endpoints as we go.</p>
<h2 id="heading-part-2-running-waha-with-docker">Part 2: Running WAHA with Docker</h2>
<p>Create a fresh directory for the project:</p>
<pre><code class="language-bash">mkdir whatsapp-bot &amp;&amp; cd whatsapp-bot
</code></pre>
<p>Create a <code>docker-compose.yml</code> file:</p>
<pre><code class="language-yaml">services:
  waha:
    image: devlikeapro/waha:latest
    container_name: waha
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - WAHA_DASHBOARD_ENABLED=true
      - WAHA_DASHBOARD_USERNAME=admin
      - WAHA_DASHBOARD_PASSWORD=change-me-now
      - WHATSAPP_API_KEY=super-secret-key-change-me
      - WHATSAPP_DEFAULT_ENGINE=WEBJS
    volumes:
      - ./waha-sessions:/app/.sessions
</code></pre>
<p>A few things to notice:</p>
<ul>
<li><p>The dashboard username and password protect the web UI at <code>http://your-server:3000</code>. Always change the defaults before you expose the port publicly.</p>
</li>
<li><p><code>WHATSAPP_API_KEY</code> is the key every HTTP request to WAHA must include in the <code>X-Api-Key</code> header. Treat it like a database password.</p>
</li>
<li><p><code>WHATSAPP_DEFAULT_ENGINE=WEBJS</code> uses the mature <code>whatsapp-web.js</code> engine. WAHA also supports <code>NOWEB</code> and <code>GOWS</code> engines with different trade-offs — WEBJS is the safest default for a first deployment.</p>
</li>
<li><p>The volume mount persists the session across restarts. Without it, every container rebuild forces you to scan the QR code again.</p>
</li>
</ul>
<p>Start the container:</p>
<pre><code class="language-bash">docker compose up -d
docker compose logs -f waha
</code></pre>
<p>Within about 20 seconds WAHA finishes booting. Visit <code>http://your-server:3000</code> and log in with the dashboard credentials.</p>
<h2 id="heading-part-3-starting-a-whatsapp-session">Part 3: Starting a WhatsApp session</h2>
<p>WAHA calls each WhatsApp account a "session." You can have one session at a time on WAHA Core.</p>
<p>From the dashboard, click <strong>Start New Session</strong> and name it <code>default</code>. WAHA displays a QR code.</p>
<p>On your phone:</p>
<ol>
<li><p>Open WhatsApp.</p>
</li>
<li><p>Tap the three-dot menu (Android) or Settings (iOS).</p>
</li>
<li><p>Tap Linked Devices → Link a Device.</p>
</li>
<li><p>Point the camera at the QR code on your screen.</p>
</li>
</ol>
<p>Within a few seconds the dashboard shows <code>WORKING</code> status. Your session is live.</p>
<p>You can also do this over the API. Start the session (<code>default</code> is the session name, encoded in the URL path):</p>
<pre><code class="language-bash">curl -X POST http://your-server:3000/api/sessions/default/start \
  -H "X-Api-Key: super-secret-key-change-me"
</code></pre>
<p>The call is idempotent — if the session is already running, nothing happens.</p>
<p>Fetch the QR as a PNG:</p>
<pre><code class="language-bash">curl http://your-server:3000/api/default/auth/qr \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Accept: image/png" \
  --output qr.png
</code></pre>
<p>Scan and you're in.</p>
<p>Test that the session works by sending a message to yourself:</p>
<pre><code class="language-bash">curl -X POST http://your-server:3000/api/sendText \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "session": "default",
    "chatId": "15555550123@c.us",
    "text": "Hello from WAHA!"
  }'
</code></pre>
<p>Replace <code>15555550123</code> with your own number (country code plus number, no <code>+</code>, no spaces, no dashes). The <code>@c.us</code> suffix marks it as an individual chat. Groups use <code>@g.us</code>.</p>
<p>If the message lands on your phone — congratulations. The gateway works.</p>
<h2 id="heading-part-4-running-n8n">Part 4: Running n8n</h2>
<p>Add an <code>n8n</code> service to your <code>docker-compose.yml</code> alongside WAHA:</p>
<pre><code class="language-yaml">services:
  waha:
    # ... existing config

  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "5678:5678"
    environment:
      - N8N_HOST=n8n.example.com
      - N8N_PORT=5678
      - N8N_PROTOCOL=https
      - WEBHOOK_URL=https://n8n.example.com/
      - GENERIC_TIMEZONE=UTC
    volumes:
      - ./n8n-data:/home/node/.n8n
</code></pre>
<p>Replace <code>n8n.example.com</code> with your real domain. For purely local testing, set:</p>
<pre><code class="language-yaml">- N8N_HOST=localhost
- N8N_PROTOCOL=http
- WEBHOOK_URL=http://localhost:5678/
</code></pre>
<p>If you want to test webhooks from your laptop without a server, run <code>ngrok http 5678</code> in another terminal and use the ngrok HTTPS URL as <code>WEBHOOK_URL</code>. n8n uses <code>WEBHOOK_URL</code> to tell external services where to POST — get this wrong and your webhooks will 404.</p>
<p>Start the stack:</p>
<pre><code class="language-bash">docker compose up -d
</code></pre>
<p>Visit <code>http://your-server:5678</code>. On the first visit, n8n walks you through creating an owner account (email and password). Every subsequent visit requires that login. For extra safety in production, put n8n behind a reverse proxy with an allow-list or an additional auth layer — we'll set that up later.</p>
<h2 id="heading-part-5-creating-the-webhook-trigger-in-n8n">Part 5: Creating the Webhook Trigger in n8n</h2>
<p>Click Create Workflow. You'll see an empty canvas.</p>
<p>Add a Webhook node and configure it:</p>
<ul>
<li><p><strong>HTTP Method</strong>: POST</p>
</li>
<li><p><strong>Path</strong>: <code>whatsapp</code> (this becomes part of the URL)</p>
</li>
<li><p><strong>Response Mode</strong>: Respond Immediately</p>
</li>
<li><p><strong>Response Data</strong>: First Entry JSON</p>
</li>
</ul>
<p>Click Listen for Test Event. n8n shows you two URLs: a test URL and a production URL. Copy the production URL. It looks like this:</p>
<pre><code class="language-plaintext">https://n8n.example.com/webhook/whatsapp
</code></pre>
<p>Not <code>webhook-test</code> — that one only fires while the editor is open. You want <code>webhook</code>.</p>
<h2 id="heading-part-6-wiring-waha-to-n8n">Part 6: Wiring WAHA to n8n</h2>
<p>WAHA can POST to a webhook on every WhatsApp event. Tell it where to send those events.</p>
<p>In the WAHA dashboard, open your session and set the webhook URL. Or do it over the API:</p>
<pre><code class="language-bash">curl -X PUT http://your-server:3000/api/sessions/default \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "webhooks": [
        {
          "url": "https://n8n.example.com/webhook/whatsapp",
          "events": ["message", "session.status"]
        }
      ]
    }
  }'
</code></pre>
<p>The <code>message</code> event fires on every inbound message. <code>session.status</code> fires when the session connects, disconnects, or reconnects — which is useful for alerting when your bot goes down.</p>
<p>Test it. From another phone, send a WhatsApp message to your bot's number. Head back to the n8n editor. Within a second or two the webhook node lights up with the event data.</p>
<p>The payload looks roughly like this:</p>
<pre><code class="language-json">{
  "event": "message",
  "session": "default",
  "payload": {
    "id": "false_15555550123@c.us_3EB0...",
    "from": "15555550123@c.us",
    "body": "Hello",
    "timestamp": 1713801234,
    "fromMe": false
  }
}
</code></pre>
<p>Everything you need is in <code>payload</code>: who sent it (<code>from</code>), what they said (<code>body</code>), and when (<code>timestamp</code>).</p>
<h2 id="heading-part-7-building-the-first-auto-reply">Part 7: Building the first auto-reply</h2>
<p>A bot that only listens is boring. Let's make it answer.</p>
<p>You'll build a tiny keyword router: if the user sends <code>hi</code> or <code>hello</code>, the bot greets them. If they send <code>price</code>, it sends a pricing message. Anything else gets a fallback.</p>
<p>After the Webhook node, add a Switch node.</p>
<p>Configure the Switch node:</p>
<ul>
<li><p><strong>Mode</strong>: Expression</p>
</li>
<li><p><strong>Value</strong>: <code>{{ $json.payload.body.toLowerCase().trim() }}</code></p>
</li>
<li><p>Add routing rules:</p>
<ul>
<li><p>Rule 1: equals <code>hi</code> — output 0</p>
</li>
<li><p>Rule 2: equals <code>hello</code> — output 0</p>
</li>
<li><p>Rule 3: equals <code>price</code> — output 1</p>
</li>
<li><p>Fallback output: 2</p>
</li>
</ul>
</li>
</ul>
<p>After the Switch, add three HTTP Request nodes, one per output.</p>
<p>Configure each HTTP Request node identically, except for the body text:</p>
<ul>
<li><p><strong>Method</strong>: POST</p>
</li>
<li><p><strong>URL</strong>: <code>http://waha:3000/api/sendText</code> (inside the Docker network you can reach WAHA by its service name; from outside, use the full public URL)</p>
</li>
<li><p><strong>Send Headers</strong>: on</p>
<ul>
<li><p><code>X-Api-Key</code>: <code>super-secret-key-change-me</code></p>
</li>
<li><p><code>Content-Type</code>: <code>application/json</code></p>
</li>
</ul>
</li>
<li><p><strong>Send Body</strong>: on</p>
<ul>
<li><p><strong>Body Content Type</strong>: JSON</p>
</li>
<li><p><strong>Specify Body</strong>: Using JSON</p>
</li>
</ul>
</li>
</ul>
<p>For the greeting node, the JSON body is:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "Hi! I'm the bot. Send 'price' to see pricing, or anything else for help."
}
</code></pre>
<p>For the pricing node:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "Our plans start at $49/month. Reply 'sales' to talk to a human."
}
</code></pre>
<p>For the fallback:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $('Webhook').item.json.payload.from }}",
  "text": "I didn't catch that. Try 'hi' or 'price'."
}
</code></pre>
<p>The <code>={{ ... }}</code> syntax is an n8n expression — at runtime it pulls values from earlier nodes.</p>
<p>Connect the Switch outputs to their matching HTTP Request nodes. Save the workflow. Click Activate in the top-right.</p>
<p>Send <code>hi</code> to your bot from any phone. It should reply within a second.</p>
<p>Congratulations — you have a WhatsApp bot running entirely on your own infrastructure.</p>
<h2 id="heading-part-8-a-second-example-proactive-booking-confirmations">Part 8: A Second Example — Proactive Booking Confirmations</h2>
<p>Auto-reply is useful. Proactive outbound is where the value really compounds. Here's a second workflow that sends a booking confirmation whenever a new row lands in a database.</p>
<p>Create a second workflow in n8n. Use one of these triggers:</p>
<ul>
<li><p><strong>Schedule Trigger</strong> — poll a database every minute for new rows</p>
</li>
<li><p><strong>Webhook Trigger</strong> — listen for a notification from your booking system</p>
</li>
<li><p><strong>Database Trigger</strong> (Postgres, MySQL, Supabase) — react to inserts in real time</p>
</li>
</ul>
<p>For this example, use a Schedule Trigger set to every minute, followed by a Postgres <strong>Execute Query</strong> node that reads pending confirmations:</p>
<pre><code class="language-sql">SELECT id, customer_phone, service_name, booking_time
FROM bookings
WHERE confirmation_sent = false
LIMIT 20;
</code></pre>
<p>After the Postgres node, add an HTTP Request node pointing to the same WAHA <code>sendText</code> endpoint you used earlier. The body:</p>
<pre><code class="language-json">{
  "session": "default",
  "chatId": "={{ $json.customer_phone }}@c.us",
  "text": "Hi! Your booking for {{ \(json.service_name }} on {{ \)json.booking_time }} is confirmed. Reply 'change' to reschedule."
}
</code></pre>
<p>Finally, add a second Postgres node that marks the booking as sent:</p>
<pre><code class="language-sql">UPDATE bookings
SET confirmation_sent = true, confirmation_sent_at = NOW()
WHERE id = {{ $json.id }};
</code></pre>
<p>Activate the workflow. Every minute, n8n pulls pending bookings, sends a WhatsApp confirmation, and marks them done.</p>
<p>This pattern generalizes. Replace the SQL with a call to Shopify for order confirmations, Stripe for receipt messages, or Calendly for appointment reminders. The WhatsApp layer stays the same — only the source of truth changes.</p>
<h2 id="heading-part-9-going-to-production">Part 9: Going to Production</h2>
<p>The setup above works, but it's not yet production-ready. Here's what to harden before you point real customers at it.</p>
<h3 id="heading-1-put-everything-behind-https">1. Put Everything Behind HTTPS</h3>
<p>Never expose n8n or WAHA directly on plain HTTP. Put a reverse proxy in front. Caddy is the easiest choice because it handles Let's Encrypt automatically.</p>
<p>A minimal <code>Caddyfile</code>:</p>
<pre><code class="language-plaintext">n8n.example.com {
    reverse_proxy n8n:5678
}

waha.example.com {
    reverse_proxy waha:3000
}
</code></pre>
<p>Run Caddy as another service in the same Docker Compose. TLS certificates are issued and renewed automatically.</p>
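<p>A sketch of that extra service, assuming the <code>Caddyfile</code> sits next to your <code>docker-compose.yml</code> (the volume name here is an arbitrary choice):</p>
<pre><code class="language-yaml">  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy-data:/data   # persists the Let's Encrypt certificates

volumes:
  caddy-data:
</code></pre>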
<h3 id="heading-2-rotate-the-api-keys">2. Rotate the API Keys</h3>
<p>Don't ship <code>super-secret-key-change-me</code> to production. Generate a real key:</p>
<pre><code class="language-bash">openssl rand -hex 32
</code></pre>
<p>Put it in a <code>.env</code> file, reference it as <code>${WHATSAPP_API_KEY}</code> in <code>docker-compose.yml</code>, and add <code>.env</code> to your <code>.gitignore</code>.</p>
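<p>In <code>docker-compose.yml</code>, the hardcoded value then becomes a reference (Compose reads <code>.env</code> from the project directory automatically):</p>
<pre><code class="language-yaml">services:
  waha:
    environment:
      # value is read from the WHATSAPP_API_KEY line in .env
      - WHATSAPP_API_KEY=${WHATSAPP_API_KEY}
</code></pre>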
<h3 id="heading-3-rate-limit-outbound-messages">3. Rate-limit Outbound Messages</h3>
<p>WhatsApp bans accounts that send too many messages too fast. A safe outbound rate for a fresh number is well under 20 messages per minute. For bursty replies, add an n8n Wait node between sends, or queue outgoing messages through a small custom function node that sleeps between requests.</p>
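<p>If you'd rather throttle outside n8n, here's a minimal shell sketch of the same idea. It assumes an <code>outbox.txt</code> file with one tab-separated <code>chatId</code>/<code>text</code> pair per line, and the API key exported as <code>WHATSAPP_API_KEY</code>:</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Drain a simple outbox at ~10 messages per minute.
while IFS=$'\t' read -r chat_id text; do
  curl -s -X POST http://waha:3000/api/sendText \
    -H "X-Api-Key: $WHATSAPP_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"session\": \"default\", \"chatId\": \"$chat_id\", \"text\": \"$text\"}"
  sleep 6  # 6 seconds between sends stays well under 20/minute
done &lt; outbox.txt
</code></pre>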
<h3 id="heading-4-scale-n8n-with-queue-mode">4. Scale n8n with Queue Mode</h3>
<p>By default, n8n runs everything in a single process. That's fine for low volume. For higher throughput, switch to Queue Mode:</p>
<ul>
<li><p>Add a Redis container.</p>
</li>
<li><p>Run one <code>n8n</code> main container (the web UI and webhook receiver).</p>
</li>
<li><p>Run one or more <code>n8n-worker</code> containers that pull jobs from the queue.</p>
</li>
</ul>
<p>Queue Mode is documented at <a href="https://docs.n8n.io/hosting/scaling/queue-mode/">docs.n8n.io/hosting/scaling/queue-mode/</a>. Setup adds two environment variables (<code>EXECUTIONS_MODE=queue</code>, <code>QUEUE_BULL_REDIS_HOST=redis</code>) and decouples incoming webhooks from workflow execution. The webhook responds in milliseconds while workers chew through the queue in the background.</p>
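<p>A rough sketch of the extra services (assuming a Redis service named <code>redis</code>; the main n8n container needs the same two variables so that executions are pushed to the queue instead of running inline):</p>
<pre><code class="language-yaml">  redis:
    image: redis:7
    restart: unless-stopped

  n8n-worker:
    image: n8nio/n8n:latest
    restart: unless-stopped
    command: worker
    environment:
      - EXECUTIONS_MODE=queue
      - QUEUE_BULL_REDIS_HOST=redis
</code></pre>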
<h3 id="heading-5-monitor-the-session">5. Monitor the Session</h3>
<p>WhatsApp Web sessions drop. The phone loses connection, WhatsApp rotates security tokens, or your server reboots. Catch those drops early.</p>
<p>Subscribe to the <code>session.status</code> webhook event in WAHA. When status becomes <code>FAILED</code> or <code>STOPPED</code>, route it to an n8n workflow that posts to Slack, sends an email, or pages you. The faster you know, the faster you recover.</p>
<p>For overall uptime, point something like Uptime Kuma at <code>GET /api/sessions/default</code> on WAHA. If WAHA reports <code>WORKING</code>, you're fine. Anything else triggers an alert.</p>
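<p>The same check is easy to script yourself. A minimal sketch:</p>
<pre><code class="language-bash"># Alert if the session reports anything other than WORKING.
status=$(curl -s http://waha:3000/api/sessions/default \
  -H "X-Api-Key: $WHATSAPP_API_KEY" | grep -o '"status":"[A-Z_]*"')
[ "$status" = '"status":"WORKING"' ] || echo "WAHA session is down: $status"
</code></pre>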
<h3 id="heading-6-back-up-the-sessions-volume">6. Back Up the Sessions Volume</h3>
<p>The <code>waha-sessions</code> directory contains the logged-in state. If you lose it, you have to scan the QR code again — possibly from a phone that's no longer handy. Back it up nightly. A simple cron job with <code>tar</code> and <code>rclone</code> to S3-compatible storage is plenty.</p>
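<p>A sketch of such a job (the rclone remote and bucket name are placeholders you'd configure yourself):</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Archive the session state and ship it to S3-compatible storage.
stamp=$(date +%F)
tar czf "waha-sessions-$stamp.tar.gz" ./waha-sessions
rclone copy "waha-sessions-$stamp.tar.gz" s3:my-backup-bucket/waha/
rm "waha-sessions-$stamp.tar.gz"
</code></pre>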
<h3 id="heading-7-add-a-live-agent-handoff">7. Add a Live-Agent Handoff</h3>
<p>Not every conversation should stay with the bot. When a user types <code>human</code> — or when your intent classifier can't answer confidently — hand off to a real agent.</p>
<p>Chatwoot is a solid open-source option: it has a dedicated WhatsApp channel, agent inbox, team assignment, and conversation history. The handoff is an n8n branch that stops processing bot replies and forwards the message stream to Chatwoot's API.</p>
<h2 id="heading-common-pitfalls">Common Pitfalls</h2>
<p>A few issues catch almost everyone on their first production deploy.</p>
<h3 id="heading-webhooks-timing-out">Webhooks Timing Out</h3>
<p>WAHA gives your webhook a few seconds to respond. If your n8n workflow is slow (calling an LLM, hitting a remote API), the webhook times out and WAHA retries, potentially causing duplicate replies.</p>
<p>Fix: make the webhook return <code>200</code> immediately and offload the slow work. In n8n, set the Webhook node's Response Mode to <em>Using Respond to Webhook Node</em>, add a Respond to Webhook node as the first step with a <code>200</code> and empty body, then do the heavy lifting after that.</p>
<h3 id="heading-duplicate-messages">Duplicate Messages</h3>
<p>WAHA delivers the same <code>message</code> event more than once in edge cases (phone comes back online, session reconnects). Store the <code>payload.id</code> somewhere — Redis, a database, or n8n's static data store — and drop any ID you've already processed.</p>
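<p>With Redis, for example, an atomic <code>SET ... NX</code> makes the check a one-liner (the key prefix and TTL here are arbitrary choices):</p>
<pre><code class="language-bash"># Returns OK the first time an ID is seen, nothing afterwards.
# The 24-hour expiry keeps the key space from growing forever.
redis-cli SET "waha:msg:$MSG_ID" 1 NX EX 86400
</code></pre>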
<h3 id="heading-messages-arriving-out-of-order">Messages Arriving Out of Order</h3>
<p>The webhook is async, and n8n may parallelize executions. If ordering matters — for example, in a multi-step conversation — key a queue by the sender's <code>chatId</code> and process each sender serially.</p>
<h3 id="heading-sessions-disconnecting-after-a-phone-restart">Sessions Disconnecting After a Phone Restart</h3>
<p>Normal WhatsApp Web behavior. WAHA auto-reconnects, but occasionally the linked-devices list needs a manual refresh. If a session refuses to come back, stop the WAHA container, delete that session's folder under <code>waha-sessions/</code>, start the container again, and rescan the QR.</p>
<h3 id="heading-your-number-gets-banned">Your Number Gets Banned</h3>
<p>The single biggest cause is rate: a new number blasting hundreds of messages an hour gets flagged fast. Warm up a number slowly — send a normal, human-like volume for the first week. Don't send to strangers unsolicited. Prefer inbound-driven replies over outbound pushes wherever you can.</p>
<h3 id="heading-the-wrong-chat-id-format">The Wrong Chat ID Format</h3>
<p>WhatsApp individual chats use <code>&lt;number&gt;@c.us</code> and groups use <code>&lt;groupId&gt;@g.us</code>. Don't include the <code>+</code> or spaces in the number. If WAHA returns a 404 when sending, the chat ID is almost always the problem.</p>
<h2 id="heading-where-to-go-next">Where to Go Next</h2>
<p>You now have the foundation. The same two-service stack supports almost any bot you can imagine — you're only limited by what you can build in an n8n workflow.</p>
<p>Some natural next steps:</p>
<ul>
<li><p><strong>Plug in AI replies:</strong> Add an OpenAI or Anthropic node after the Webhook, pass the user's message through it with a short system prompt, and send the response back through WAHA. Cap conversation length to prevent runaway token usage.</p>
</li>
<li><p><strong>Integrate a CRM:</strong> Look up the caller's <code>chatId</code> in HubSpot, Pipedrive, or your own database before deciding how to reply. Segment responses by customer tier.</p>
</li>
<li><p><strong>Send proactive notifications:</strong> Appointment reminders, shipping updates, payment receipts, abandoned-cart nudges. Keep the content transactional and expected — unsolicited marketing blasts are the fastest way to a ban.</p>
</li>
<li><p><strong>Log every conversation:</strong> Add a Postgres or Supabase node after the Webhook to persist messages for analytics and customer history. Your future self (and your support team) will thank you.</p>
</li>
<li><p><strong>Add media handling:</strong> WAHA exposes <code>sendImage</code>, <code>sendFile</code>, and <code>sendVoice</code> endpoints (a <code>sendImage</code> call is sketched after this list). Teach the bot to accept photos for support tickets, or send invoices as PDFs directly inside the chat.</p>
</li>
</ul>
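<p>For reference, the <code>sendImage</code> call follows the same shape as <code>sendText</code>. This is a sketch; double-check the exact body schema in the WAHA docs before relying on it:</p>
<pre><code class="language-bash">curl -X POST http://waha:3000/api/sendImage \
  -H "X-Api-Key: super-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{
    "session": "default",
    "chatId": "15555550123@c.us",
    "file": { "url": "https://example.com/invoice.png" },
    "caption": "Your invoice"
  }'
</code></pre>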
<p>The WhatsApp layer stays the same. Everything interesting happens upstream in the workflow.</p>
<p><em>If you want to see production examples of n8n and WAHA running at scale — or you need a similar automation built for your business — I'm the founder of Achiya Automation, where we ship WhatsApp, n8n, and Chatwoot integrations. You can find more at</em> <a href="https://achiya-automation.com"><em>achiya-automation.com</em></a><em>.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Dockerize a Go Application – Full Step-by-Step Walkthrough ]]>
                </title>
                <description>
                    <![CDATA[ Imagine that you want to share your source code with someone who doesn’t have Go installed on their computer. Unfortunately, this person won’t be able to run your application. Even if they do have Go  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-dockerize-a-go-application-full-step-by-step-walkthrough/</link>
                <guid isPermaLink="false">69f248846e0124c05e445b7a</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker compose ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Njong Emy ]]>
                </dc:creator>
                <pubDate>Wed, 29 Apr 2026 18:05:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e49dda12-fd5e-4474-aa18-b72624640bf3.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Imagine that you want to share your source code with someone who doesn’t have Go installed on their computer. Unfortunately, this person won’t be able to run your application. Even if they do have Go installed, application behaviour may differ because your local development environment is different from theirs.</p>
<p>So how do you bundle up your application so that it can run the same way in every local environment? That’s where Docker comes in.</p>
<p>For beginners, Docker isn't always a very easy concept to grasp. But once you get it, I promise that it’s very interesting. So interesting that you’ll want to dockerize every application you lay your hands on.</p>
<p>For this article, a Go application will be our case study. The fundamental concept of containerization as explained here is transferable, so don’t worry too much about what dockerizing an application in another language will look like.</p>
<p>We’ll go through the basics of dockerizing a Go app with just Docker, images and containers, setting up multiple containers in one application with Docker Compose, and the constituents of a Docker Compose file.</p>
<p>By the end of this article, you'll have a basic understanding of what Docker is, what an image or container is, and how to orchestrate multiple, dependent containers with Docker Compose.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ol>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-docker">What is Docker</a>?</p>
</li>
<li><p><a href="#heading-how-to-install-docker">How to Install Docker</a></p>
</li>
<li><p><a href="#heading-what-is-a-dockerfile">What is a Dockerfile</a>?</p>
</li>
<li><p><a href="#heading-what-is-docker-compose">What is Docker Compose</a>?</p>
</li>
<li><p><a href="#heading-the-app-container">The app Container</a></p>
</li>
<li><p><a href="#heading-the-database-container">The database Container</a></p>
</li>
<li><p><a href="#heading-the-phpmyadmin-container">The phpMyAdmin Container</a></p>
</li>
<li><p><a href="#heading-running-everything-together">Running Everything Together</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You don't need any prior knowledge of Docker to follow this tutorial. This article is written with a beginner POV in mind, so it's okay if the concept is new to you.</p>
<p>In order to be fully engaged and understand the Go coding examples used here, it'll be helpful if you have basic knowledge of Golang. If you already understand how to set up a Go application on your local computer, you're good to go. If not, you can check this article on <a href="https://www.freecodecamp.org/news/how-to-get-started-coding-in-golang/">how to get started coding in Go</a>.</p>
<h2 id="heading-what-is-docker">What is Docker?</h2>
<p>Imagine that you have a box. In that box, you put your code and everything that it needs to run. That is, the programming language it uses and any other external packages you need to install.</p>
<p>If someone needs your application, you can just hand them the box. You can also hand this box to as many people as you want. They don’t need to install the language or any other thing on their computer because everything they need is already inside the box. So, when they run the application, what they're actually doing is running an instance of that box.</p>
<p>The app is running within the box which is the standard environment. This means for everyone who got the box and “opened it”, the application is going to run the exact same way.</p>
<p>With the help of Docker, apps can run under the same conditions across different systems, and you avoid the problem of “it works on my machine”.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/3b2b169d-d882-48a8-88bf-233e4acec611.png" alt="A box containing dependencies, runtime, and source code that has arrows pointing to multiple developers" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>In technical Docker terms, this box is called an <strong>image</strong> and the running instance is called a <strong>container</strong>.</p>
<p>An image is a lightweight, standalone, executable package that includes everything needed to run a piece of software. That is, code, runtime, libraries, system tools, and even the operating system.</p>
<p>A container is simply a runnable instance of an image. This represents the execution environment for a specific application.</p>
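<p>Once you have Docker installed (next section), you can see both concepts on the command line:</p>
<pre><code class="language-bash">docker images   # the boxes you've built or pulled
docker ps       # the running instances of those boxes
</code></pre>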
<p>If all this seems too abstract, don’t worry. We’ll get our hands dirty in a little bit.</p>
<h2 id="heading-how-to-install-docker">How to Install Docker</h2>
<p>In order to install Docker, we're going to install Docker Desktop, which comes bundled with the Docker Engine. Docker Desktop is a GUI for managing containers, and you'll see how useful it is in subsequent sections.</p>
<p>At the time of writing, I'm using WSL (Windows Subsystem for Linux). If you're doing the same, you'll need to take that into consideration before installing, because Docker requires different installation prerequisites and steps for different operating systems.</p>
<p>To install Docker Desktop on WSL,</p>
<ol>
<li><p>Download and install the <a href="https://desktop.docker.com/win/main/amd64/Docker%20Desktop%20Installer.exe">Windows</a> <code>.exe</code> file</p>
</li>
<li><p>Start Docker Desktop from the Start Menu and navigate to settings</p>
</li>
<li><p>Select <strong>Use WSL 2 based engine</strong> from the <strong>General</strong> tab</p>
</li>
<li><p>Click on apply.</p>
</li>
</ol>
<p>That’s it for the WSL installation. If you are running another operating system, the <a href="https://docs.docker.com/get-started/introduction/get-docker-desktop/">official docs</a> have a list of installation options for you.</p>
<h2 id="heading-what-is-a-dockerfile">What is a Dockerfile?</h2>
<p>In order to build your box in the first place, Docker needs to follow a couple of outlined steps. It needs to know the dependencies and the runtime, and it also needs to have the source code. We list all of these steps in a Dockerfile.</p>
<p>Before we get down to cracking anything, let’s create a working directory and navigate into it.</p>
<pre><code class="language-bash">mkdir go_book_api &amp;&amp; cd go_book_api
</code></pre>
<p>To initialise the Go module in your application, run the following command:</p>
<pre><code class="language-bash">go mod init go_book_api
</code></pre>
<p>This creates a <code>go.mod</code> file to keep track of your project dependencies. In the root of the project, create a <code>cmd</code> directory, and a <code>main.go</code> file in it. This will serve as the entry point of your application. In the <code>main.go</code> file, you can have a simple print statement:</p>
<pre><code class="language-go">// cmd/main.go
package main

import "fmt"

func main() {
	fmt.Println("Look at me gooo!")
}
</code></pre>
<p>Now, go ahead and create a file in the root of your project and call it <code>Dockerfile</code>. This file has no extension, but Docker (and most editors) will recognize it by name as a file of Docker instructions.</p>
<p>Go ahead and paste the following in that file, and then we'll go through each of them one by one:</p>
<pre><code class="language-bash"># base image
FROM golang:1.24

# define the working directory
WORKDIR /app

# copy go.mod (and go.sum, once the project has dependencies) so the
# modules to install are known in the container. ./ here is the WORKDIR, /app
COPY go.mod ./

# command to install modules
RUN go mod download

# copy source code into working dir
COPY . .

# build
RUN CGO_ENABLED=0 GOOS=linux go build -o /docker-gs-ping ./cmd/main.go

# run the compiled binary when the container starts
CMD ["/docker-gs-ping"]
</code></pre>
<p>Most Dockerfiles begin with a base image, which is specified by the <code>FROM</code> keyword. A base image is a foundational template that provides minimal operating system environment, libraries, or dependencies required to build and run an application within a container.</p>
<p>In this case, your base image is <code>golang:1.24</code>. Your base image could have been an operating system like Linux. In that case, when you ship your code to someone who isn’t running a Linux operating system, they wouldn’t have to worry, because they will be running the application in an environment that already has a minimal Linux OS. In the same light, someone who doesn’t have Go installed locally can run your application.</p>
<p>To figure out what base image to use when setting up your Dockerfile, you can always peruse the official Docker Hub repository for published images. For this case, you can check out base images that are officially published by Golang <a href="https://hub.docker.com/hardened-images/catalog/dhi/golang/images">here</a>.</p>
<p>The next step is to define a working directory. Inside your box, you have a filesystem that is almost identical to the ones you’d see on a Linux system. You have folders like <code>/app</code>, <code>/bin</code>, <code>/usr</code>, <code>/var</code>, and so on. The working directory you've defined in this case is <code>/app</code>, and it's done with the <code>WORKDIR</code> command.</p>
<p>After setting a working directory, you want to copy the <code>go.mod</code> file (and <code>go.sum</code>, once it exists) into it, so that Docker knows what dependencies to add into your box.</p>
<p>The <code>COPY</code> command in Docker takes at least two arguments: the source directory(ies), and then the destination directory. In this case, you want to copy <code>go.mod</code> and <code>go.sum</code> into the working directory of your box, <code>/app</code>.</p>
<p>In the box, you'll run a command that downloads and installs all the modules defined in the <code>go.mod</code> file. To run a command in the Docker environment, use <code>RUN</code> and then the command, which is <code>go mod download</code> in this case.</p>
<p>The next step is to copy any source code you have into the working directory.</p>
<p>At this point, you have the dependencies and the source code. The last step is to build the Go application into a single executable file which can be run inside your environment (inside the container).</p>
<p>Within the container, you’ll have a compiled binary at <code>/docker-gs-ping</code>, which is the result of compiling the code in your <code>main.go</code> file. The last step is a <code>CMD</code> instruction that tells Docker to run the executable binary when the container starts. It’s a way of saying “once the container starts running, execute this binary file”.</p>
<p>With these steps, Docker will build an image (a box per our analogy) that you can run. To build the image, you can run this command in your terminal:</p>
<pre><code class="language-go">docker build -t go_book_api .
</code></pre>
<p>The <code>docker build</code> command tells Docker to build an image based on the steps in the Dockerfile. <code>-t</code> is the flag for a tag, and this helps you refer to the image later when running the container.</p>
<p>To accompany your tag, you'll provide a name to the image which is <code>go_book_api</code> in this case. The <code>.</code> at the end is important because it tells Docker where the Dockerfile in question is, and the files that you need to copy into your image.</p>
<p>This is what the building looks like in my IDE:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/361a805e-153d-4034-9d9a-d34c9015738a.png" alt="screenshot of IDE terminal showing a Docker image being built" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you check the Images tab in Docker Desktop, you'll see that an image has been built:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/b569277e-295b-4a3d-8e51-fb91dd7e3d91.png" alt="screenshot of a built container image on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can host this image on a public image repository platform like <a href="https://www.docker.com/products/docker-hub/">Docker Hub</a>, and share it with your friends. They can pull your image, set it up, and run your application even if they don’t have Go installed. All they need to do is get the container running.</p>
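<p>That flow looks roughly like this (<code>yourname</code> is a placeholder for a Docker Hub username):</p>
<pre><code class="language-bash"># you: log in, tag, and push the image
docker login
docker tag go_book_api yourname/go_book_api:latest
docker push yourname/go_book_api:latest

# your friend: pull and run it, no Go installation needed
docker run --rm yourname/go_book_api:latest
</code></pre>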
<p>If you click on the little play button to the far-right, you can spin up an instance of the image (a container).</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/09726294-be22-458d-b660-5f6d32102205.png" alt="screenshot of Docker Compose modal for running a new container" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can give a descriptive name to the container (Docker will generate a random one if you don’t), and click on the Run button. Once the container starts running, you're redirected to its log page.</p>
<p>Your container is up and running! You can see that this is a running instance of your application.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/3133c16c-0950-4f03-9502-ae6495535c13.png" alt="screenshot of a running docker container on Docker Compose" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-what-is-docker-compose">What is Docker Compose?</h2>
<p>If you were building a simple Go application that needed no external dependencies, the above set-up would be more than sufficient.</p>
<p>In our example here, the application is supposed to be a book API, so you’d expect that we'd have some service like a database, and a database administration client like phpMyAdmin to visualize our tables.</p>
<p>Setting all of this up would be complicated with just a Dockerfile, because a single Dockerfile produces a single image: you can't cleanly define your Go app, a database, and phpMyAdmin as separate services in one file.</p>
<p>You could start from a small operating-system base image and run commands to manually install these other services as dependencies, but that makes your application hard to maintain and scale: if one dependency crashes, the whole container goes down with it.</p>
<p>To remedy this, Docker Compose allows you to have multiple containers for your application that are connected together. Docker Compose handles starting the containers in the right order, lets one container share a folder with another, lets containers keep their data in named volumes, and so on.</p>
<p>Our previous analogy of boxes is the same, except with Docker Compose, we don’t necessarily have only one box anymore:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/2c890de4-8d5d-4457-a27a-fc441f58d794.png" alt="image of a box containing multiple containers that have arrows pointing to different developers" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The point of Docker Compose is to help you orchestrate multiple images needed to run your application. You can think of it as connecting several boxes together.</p>
<p>Following the explanation from before, your application will run in the <code>app</code> container, the book data you create with your application will be stored in the <code>database</code> container, which runs MySQL, and you can visualize your database with phpMyAdmin, which runs in the <code>phpmyadmin</code> container.</p>
<p>To see this technically, create a <code>docker-compose.yml</code> file in the root of the project. The name of this file is important: Docker Compose only accepts filenames such as <code>compose.yml</code>, <code>docker-compose.yml</code>, or <code>docker-compose.yaml</code>. The file extension hints that the configuration is written in YAML, a language mostly used for configuration files.</p>
<pre><code class="language-bash">services:
  app:
    depends_on:
      - database
    build: 
      context: .
    container_name: go_book_api
    hostname: go_book_api
    networks:
      - go_book_api_net
    ports:
      - 8080:8080
    env_file:
      - .env
    
  database:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: ${DB_NAME}
      MYSQL_PASSWORD: ${DB_PASSWORD}
      MYSQL_USER: ${DB_USER}
    volumes:
      - mysql-go:/var/lib/mysql
    ports:
      - 3356:3306
    networks:
      - go_book_api_net

  phpmyadmin:
    image: phpmyadmin
    restart: always
    ports:
      - 9000:80
    environment:
      PMA_HOST: database
      PMA_ARBITRARY: 1
    depends_on:
      - database
    networks:
      - go_book_api_net

volumes:
  mysql-go:

networks:
  go_book_api_net:
    driver: bridge
</code></pre>
<p>At the root level of the docker-compose file, you have <code>services</code>. These are all the containers that your application needs to run, and in the context of Docker Compose, each one is regarded as a service.</p>
<h3 id="heading-the-app-container">The <code>app</code> Container</h3>
<pre><code class="language-bash"> app:
    depends_on:
      - database
    build: 
      context: .
    container_name: go_book_api
    hostname: go_book_api
    networks:
      - go_book_api_net
    ports:
      - 8080:8080
    env_file:
      - .env
</code></pre>
<p>The very first container is the <code>app</code> container, which is your Go application. Under the <code>app</code> container, you'll need to define a few parameters that this container also needs to run.</p>
<p>The <code>depends_on</code> attribute controls the start-up and shut-down order of the services in the file. It ensures that if container A depends on container B, container B is started first so that container A can use it. In this case, the <code>database</code> container must be started before the <code>app</code> container. Note that this doesn't mean <code>app</code> will wait for the <code>database</code> to be ready; it only waits for it to be started.</p>
<p>The next attribute, <code>build</code>, tells Docker Compose to build the Docker image from the local project. Since the Dockerfile for your application is in the root of your app, you'll specify the root path with the <code>context</code> attribute as <code>.</code>.</p>
<p>To give a specific name to your container, you'll use <code>container_name</code>. <code>hostname</code> is what other containers will use for communication.</p>
<p>Recall that the point of Docker Compose is to have multiple containers communicating with each other. They do this with the help of networks. So you'll create another attribute, <code>networks</code>, and give it a name, <code>go_book_api_net</code> . To every other container that you want to associate with this <code>app</code>, you're going to specify the same network.</p>
<p>The next attribute is <code>ports</code> . Your application is an API, which means it's running on a backend Go server. To access the API, you'll need to map a local port to a port on the container. You're mapping port <code>8080</code> on your computer to port <code>8080</code> in the container.</p>
<p>The <code>env_file</code> attribute just tells Docker Compose where to read environment variables from. In this case, you can create a <code>.env</code> file in the root of your project to store important variables that your container will need.</p>
<h3 id="heading-the-database-container">The <code>database</code> Container</h3>
<pre><code class="language-bash">  database:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: ${DB_NAME}
      MYSQL_PASSWORD: ${DB_PASSWORD}
      MYSQL_USER: ${DB_USER}
    volumes:
      - mysql-go:/var/lib/mysql
    ports:
      - 3356:3306
    networks:
      - go_book_api_net
</code></pre>
<p>The second container is the <code>database</code> container. Note that you can give whatever name you choose to your listed services, but giving your containers descriptive names is always a good convention to follow.</p>
<p>For your Go application database, you'll be working with a MySQL database in this case. Your application needs MySQL to run, so you must set it up as one of the services.</p>
<p>Remember that to build a container, you need a base image. Your base image in this case is <code>mysql:8.0</code>, as specified with the <code>image</code> property above. When setting up this container, Docker Compose knows to pull this existing official image and create your database container from it.</p>
<p>If you’ve set up a database locally before, you know that configuration is a step you can’t skip. Every database you create needs a user, a password, and the database name. You can set these variables up in the <code>environment</code> property. Instead of hardcoding these values, you can set them up in a <code>.env</code> file, and reference the environmental variables as you've done here.</p>
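<p>The matching <code>.env</code> file could look like this (placeholder values; use real secrets and keep the file out of version control):</p>
<pre><code class="language-bash"># .env
DB_ROOT_PASSWORD=supersecretroot
DB_NAME=go_book_api
DB_USER=books_user
DB_PASSWORD=supersecretuser
</code></pre>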
<p>Database servers usually listen on specific ports for incoming connections, whether the database is running locally or remotely. Just as you specified for your <code>app</code> container, you can set a port for your database and map it to a corresponding port in the container. If you want to access the database locally, you'd do that on port <code>3356</code>, and all requests are forwarded to port <code>3306</code> in the database container.</p>
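<p>For example, with a MySQL client installed on your host machine, you could connect through the mapped port (using the sample <code>.env</code> values above):</p>
<pre><code class="language-bash"># Port 3356 on the host forwards to 3306 inside the container.
mysql -h 127.0.0.1 -P 3356 -u books_user -p go_book_api
</code></pre>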
<p>Once your containers are running and your application starts creating and storing data in the database, you’ll realise that every time you stop and then restart your containers, you lose the data stored in the database.</p>
<p>To avoid this, you'll need to store your data outside the container. That way, you won't lose the contents of your database every time you stop running your containers.</p>
<p>This is what volumes are for. You can allocate a specific location outside the database container to store all that content. For your <code>volume</code> in this case, the storage location you specified is <code>mysql-go:/var/lib/mysql</code> .</p>
<p>Just as you set the network in your <code>app</code> container above to <code>go_book_api_net</code>, you'll specify the same network for this database container. Since you want the containers to communicate with each other, it makes sense that they're within the same network.</p>
<h3 id="heading-the-phpmyadmin-container">The <code>phpMyAdmin</code> Container</h3>
<p>The last service you need to configure (though it's optional) is the phpMyAdmin container. I find it easier to have a database client because it lets me easily see the structure and content of my database.</p>
<pre><code class="language-bash"> phpmyadmin:
    image: phpmyadmin
    restart: always
    ports:
      - 9000:80
    environment:
      PMA_HOST: database
      PMA_ARBITRARY: 1
    depends_on:
      - database
    networks:
      - go_book_api_net
</code></pre>
<p>The process is almost the same as for the previous containers you've configured. You'll start by pulling the official <code>phpmyadmin</code> image from Docker Hub so that your container is built on it.</p>
<p>The <code>restart: always</code> option ensures that if the container stops or crashes, Docker automatically starts it again, so phpMyAdmin comes back up on its own.</p>
<p>On the host machine, which is your local environment, you can have access to this service via port <code>9000</code> and it maps to port <code>80</code> in the container.</p>
<p>As for the <code>environment</code>, <code>PMA_HOST</code> tells phpMyAdmin to connect to a host called <code>database</code> (which is your database container). This works because both containers are on the same network, as you can see in the <code>networks</code> attribute. <code>PMA_ARBITRARY</code> is used so that if you decide to connect to another host (say, you set up another database in future and still wish to connect via phpMyAdmin), you can do that via the UI.</p>
<p>Your database client depends on the <code>database</code> container, and so you need to specify that in <code>depends_on</code>:</p>
<pre><code class="language-bash">volumes:
  mysql-go:

networks:
  go_book_api_net:
    driver: bridge
</code></pre>
<p>The final section of your Docker Compose file is where you declare named values for the volume and network you've used in setting up your containers.</p>
<p>For the <code>volumes</code>, you declare a value called <code>mysql-go</code>. In the container where you want to attach this volume, you assign it a specific storage location. You can see this in use in the database container:</p>
<pre><code class="language-bash"> volumes:
      - mysql-go:/var/lib/mysql
</code></pre>
<p>The same concept follows for the network. You have a named network called <code>go_book_api_net</code> that every container within this same network can use. The <code>driver</code> option specifies the network type, and <code>bridge</code> is the standard driver for private internal networks on a single host.</p>
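<p>If you ever want to confirm the network exists and see which containers have joined it, you can inspect it from the terminal. Note that Docker Compose usually prefixes the network name with your project directory's name, so check <code>docker network ls</code> first:</p>
<pre><code class="language-bash"># list networks; look for the one ending in go_book_api_net
docker network ls

# inspect it to see the attached containers (replace &lt;project&gt; with the actual prefix)
docker network inspect &lt;project&gt;_go_book_api_net
</code></pre>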
<h3 id="heading-running-everything-together">Running Everything Together</h3>
<p>Before Docker Compose, you had one Dockerfile that built a single container for your Go application. With Docker Compose, you're building three containers (your application container, the database, and phpMyAdmin) and orchestrating them to work together as one single application.</p>
<p>You can push all this to a platform like GitHub, and someone can clone, start, and run the application without having any of these services (MySQL or phpMyAdmin) installed locally on their computer. But they do need to have Docker installed.</p>
<p>To build your containers all together, you can use the command <code>docker compose build</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/0040fbdc-c541-494f-af9b-664d6a00bc17.png" alt="screenshot of IDE terminal showing build for an image" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you check Docker Desktop again, you'll see that a new image has been built, and it corresponds to the <code>app</code> service:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/736be9be-feb1-4888-8d15-c818e4683f4b.png" alt="screenshot of a built image on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>To start running the containers, you can use the command <code>docker compose up</code>:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/8ba14bb9-77d5-48a1-b574-54a848f54b1e.png" alt="a screenshot of running containers in terminal IDE" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you navigate to the Containers tab of Docker Desktop, you can see that your containers are up and running:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/82e3d54d-bfec-4cea-806a-c52846a3e077.png" alt="A screenshot of running containers on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The main app service, <code>go_book_api</code>, isn’t running because when you run your image, your binary runs and exits almost immediately.</p>
<p>In your <code>main.go</code>, let’s rewrite the code to set up a minimal HTTP handler function that listens on port <code>8080</code>:</p>
<pre><code class="language-go">// cmd/main.go
package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})

	log.Println("listening on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}
</code></pre>
<p>If you’re new to Go, don’t let the code above bother you too much. All it does is set up a <code>/health</code> endpoint with an associated handler function, listen on a port (<code>8080</code> in this case), and respond with “ok”.</p>
<p>In your <code>Dockerfile</code>, let’s add a command to execute the created binary when the container starts:</p>
<pre><code class="language-go"># run the compiled binary when the container starts
CMD ["/docker-gs-ping"]
</code></pre>
<p>After adding this, you'll need to rebuild the containers and start them again.</p>
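<p>One way to do both from the project root (<code>docker compose up -d --build</code> collapses it into a single step):</p>
<pre><code class="language-bash"># rebuild the images, then start the containers in the background
docker compose build
docker compose up -d
</code></pre>
<p>Once the rebuild finishes, you can see that all containers are running now:</p>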
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/3ddf3e15-87b8-4978-851f-d6179e323166.png" alt="A screenshot of running containers on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>If you click on the <code>go_book_api</code> container, you can see that your server is running on port <code>8080</code> as configured:</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/ddd07614-eb53-4bfc-b088-e824f651ef6c.png" alt="A screenshot of a running container on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Since your app is running on port <code>8080</code> and you have a <code>/health</code> endpoint set up for it, you can actually visit that endpoint in a browser to see the output “ok”.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/39a1ea3e-7cbf-4d46-9bbe-bf8053d48586.png" alt="an image of health endpoint showing ok response on the browser" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Also, if you click on the exposed <code>phpmyadmin</code> port, you can access the database client locally on port <code>9000</code>. Based on the environment variables set up in the <code>.env</code> file, you can log in.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/8d7de244-7268-4d17-a779-785feae389c4.png" alt="screenshot of browser with phpMyAdmin login form" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Another interesting thing to look at on Docker Desktop is volumes. There's a Volumes tab where you can see your configured <code>mysql-go</code> volume.</p>
<img src="https://cdn.hashnode.com/uploads/covers/61d7e29f8d56921d07b9014e/66d1dde3-2fc1-48aa-b701-7504dba2007f.png" alt="a screenshot of the volumes tab on Docker Desktop" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can always open these volumes and containers in the Docker Desktop GUI, go through the files and logs, experiment with stopping one container to see how the others respond, and so on.</p>
<p>After this entire setup, what do you notice? You didn’t have to install Go, MySQL, or phpMyAdmin locally. You only used officially published base images to orchestrate a full application. That's the magic of Docker.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Docker can be very abstract at the beginning, but understanding the fundamental purpose behind it makes everything much clearer.</p>
<p>In this article, you've learned what Docker is, how to containerize a basic Go application, and how to manage multiple containers with Docker Compose.</p>
<p>If you have trouble wrapping your head around why the Dockerfile is set up in the order that it is, my advice is not to get too stuck figuring it out on your own. As a Docker beginner, I realised it's easier if you imagine it as writing a recipe: if you try to build an image and it fails, you know there's a step you're skipping.</p>
<p>The <a href="https://www.docker.com/">official docker documentation</a> has amazing resources if you want to understand Docker further than this tutorial. I encourage you to do so because this article only scratches the surface of the amazing things you can achieve with containerization.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Trace Multi-Agent AI Swarms with Jaeger v2 ]]>
                </title>
                <description>
                    <![CDATA[ When you run a single AI agent, debugging is straightforward. You read the log, you see what happened. When you run five agents in a swarm, each spawning its own tool calls and producing its own outpu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/multi-agent-ai-swarms-tracing/</link>
                <guid isPermaLink="false">69eaae45904b915438cefb47</guid>
                
                    <category>
                        <![CDATA[ jaeger ]]>
                    </category>
                
                    <category>
                        <![CDATA[ OpenTelemetry ]]>
                    </category>
                
                    <category>
                        <![CDATA[ distributed tracing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ multi-agent systems ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ai agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ observability ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Christopher Galliart ]]>
                </dc:creator>
                <pubDate>Thu, 23 Apr 2026 23:41:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/308710e6-cfe6-4007-887a-c49a5e2e6b9a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When you run a single AI agent, debugging is straightforward. You read the log, you see what happened.</p>
<p>When you run five agents in a swarm, each spawning its own tool calls and producing its own output, "read the log" stops being a strategy.</p>
<p>I built <a href="https://github.com/HatmanStack/claude-forge">Claude Forge</a> as an adversarial multi-agent coding framework on top of Claude Code. A typical run spawns a planner, an implementer, a reviewer, and a fixer. They evaluate each other's work and loop back when quality checks fail.</p>
<p>But when something went wrong, I had timestamps and text dumps but no way to see which agent was responsible, how long it actually took, or where the tokens went.</p>
<p>Jaeger fixed that. This article covers setting up Jaeger v2 with Docker, wiring it into a multi-agent system through OpenTelemetry, and what I learned along the way.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-distributed-tracing">What Is Distributed Tracing?</a></p>
</li>
<li><p><a href="#heading-why-jaeger-v2">Why Jaeger v2?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-installing-docker-on-debian">Installing Docker on Debian</a></p>
</li>
<li><p><a href="#heading-setting-up-jaeger-v2">Setting Up Jaeger v2</a></p>
</li>
<li><p><a href="#heading-setting-up-claude-forge-tracing">Setting Up Claude Forge Tracing</a></p>
</li>
<li><p><a href="#heading-understanding-the-span-model">Understanding the Span Model</a></p>
</li>
<li><p><a href="#heading-instrumenting-a-multi-agent-swarm">Instrumenting a Multi-Agent Swarm</a></p>
</li>
<li><p><a href="#heading-viewing-traces-in-the-jaeger-ui">Viewing Traces in the Jaeger UI</a></p>
</li>
<li><p><a href="#heading-lessons-from-the-trenches">Lessons from the Trenches</a></p>
</li>
<li><p><a href="#heading-environment-variable-reference">Environment Variable Reference</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-is-distributed-tracing">What Is Distributed Tracing?</h2>
<p>Distributed tracing tracks a single operation as it moves through multiple services. A span is one unit of work with a start time, end time, and key-value attributes. Spans nest into parent-child trees. One tree per operation is one trace.</p>
<p>Microservices people already know this pattern: follow an HTTP request from the gateway through auth, the database, and the cache. Same idea works for multi-agent AI. Follow one swarm invocation from the orchestrator through each subagent and its tool calls.</p>
<p>OpenTelemetry (OTel) is the standard. It gives you SDKs for creating spans and shipping them over OTLP. Jaeger receives that data and renders it as a searchable timeline.</p>
<h2 id="heading-why-jaeger-v2">Why Jaeger v2?</h2>
<p>Jaeger started at Uber and graduated as a CNCF project in 2019. v1 hit end of life in December 2025. v2 is the current release, built on the OpenTelemetry Collector framework. It ships as a single binary containing the collector, query service, and UI, and it speaks OTLP natively on ports 4317 (gRPC) and 4318 (HTTP). There's no separate collector needed for local work.</p>
<p>One important difference from v1: configuration moved from CLI flags and environment variables to a YAML file. The old <code>-e SPAN_STORAGE_TYPE=badger</code> env vars are silently ignored in v2. The container starts fine but falls back to in-memory storage. I lost two days of traces before noticing. More on the correct setup below.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><strong>Docker</strong> installed and running.</p>
</li>
<li><p><strong>Claude Code</strong> installed.</p>
</li>
<li><p><strong>Python 3.8+</strong> for the tracing hook.</p>
</li>
<li><p><strong>Claude Forge</strong> or another multi-agent system to instrument.</p>
</li>
</ul>
<h2 id="heading-installing-docker-on-debian">Installing Docker on Debian</h2>
<p>Skip this if you already have Docker. macOS and Windows users can use Docker Desktop. On Debian:</p>
<pre><code class="language-bash">sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/debian \
  $(. /etc/os-release &amp;&amp; echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
</code></pre>
<p>Ubuntu users: replace both <code>linux/debian</code> URLs with <code>linux/ubuntu</code>.</p>
<h2 id="heading-setting-up-jaeger-v2">Setting Up Jaeger v2</h2>
<h3 id="heading-basic-run">Basic Run</h3>
<p>For quick testing with no persistence:</p>
<pre><code class="language-bash">docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0
</code></pre>
<p>Port 16686 is the UI. Port 4317 is OTLP/gRPC ingestion. Port 4318 is OTLP/HTTP. Remove the container and your traces are gone.</p>
<h3 id="heading-persistent-storage-with-badger">Persistent Storage with Badger</h3>
<p>v2 reads configuration from a YAML file, not environment variables. Save this as <code>~/.local/share/jaeger/config.yaml</code>:</p>
<pre><code class="language-yaml">service:
  extensions: [jaeger_storage, jaeger_query, healthcheckv2]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
extensions:
  healthcheckv2:
    use_v2: true
    http: { endpoint: 0.0.0.0:13133 }
  jaeger_query:
    storage: { traces: main_store }
  jaeger_storage:
    backends:
      main_store:
        badger:
          directories: { keys: /badger/key, values: /badger/data }
          ephemeral: false
          ttl: { spans: 720h }
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch:
exporters:
  jaeger_storage_exporter:
    trace_storage: main_store
</code></pre>
<p>The Jaeger container runs as UID 10001. Docker named volumes default to root ownership. Without fixing permissions first, the container crash-loops with <code>mkdir /badger/key: permission denied</code>.</p>
<p>Pre-create the volume and fix ownership:</p>
<pre><code class="language-bash">docker volume create jaeger-data

docker run --rm \
  -v jaeger-data:/badger \
  alpine sh -c "mkdir -p /badger/data /badger/key &amp;&amp; chown -R 10001:10001 /badger"
</code></pre>
<p>Then run Jaeger with the config mounted in:</p>
<pre><code class="language-bash">docker run -d --name jaeger \
  --restart unless-stopped \
  -v ~/.local/share/jaeger/config.yaml:/etc/jaeger/config.yaml:ro \
  -v jaeger-data:/badger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.17.0 \
  --config /etc/jaeger/config.yaml
</code></pre>
<p>Verify persistence by running <code>docker restart jaeger</code> and confirming a previously recorded trace is still there. Hit <code>http://localhost:16686</code> and you should see the UI.</p>
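<p>As a quick sanity check:</p>
<pre><code class="language-bash"># record at least one trace, then bounce the container
docker restart jaeger

# the UI should come back with the earlier trace still searchable
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:16686
# -&gt; 200
</code></pre>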
<h2 id="heading-setting-up-claude-forge-tracing">Setting Up Claude Forge Tracing</h2>
<h3 id="heading-installing-claude-forge">Installing Claude Forge</h3>
<p>Install it through the Claude Code plugin marketplace:</p>
<pre><code class="language-bash">/plugin marketplace add hatmanstack/claude-forge
/plugin install forge@claude-forge
/reload-plugins
</code></pre>
<p>The install opens a TUI to confirm scope and settings. After reload, commands use the <code>forge:</code> prefix (for example, <code>/forge:pipeline</code>).</p>
<p>You can also clone the repo from <a href="https://github.com/HatmanStack/claude-forge">GitHub</a>.</p>
<h3 id="heading-installing-the-tracing-hook">Installing the Tracing Hook</h3>
<p>From your target project directory, run the install script. For plugin installs:</p>
<pre><code class="language-bash">cd your-project
forge-trace                # if you set up the alias from the README
# or, without the alias:
bash "$(find ~/.claude -path '*/forge*' -name install-tracing.sh 2&gt;/dev/null | head -1)"
</code></pre>
<p>For clone installs:</p>
<pre><code class="language-bash">cd your-project
bash /path/to/claude-forge/bin/install-tracing.sh
</code></pre>
<p>The script builds a dedicated venv at <code>~/.local/share/claude-forge/venv</code> (prefers <code>uv</code>, falls back to <code>python3 -m venv</code>), installs the OpenTelemetry packages, copies the hook into place, merges hook entries into <code>.claude/settings.local.json</code>, and self-tests against the OTLP endpoint.</p>
<p>Pass <code>--no-settings</code> to skip the settings merge, or <code>--uninstall</code> to tear everything down.</p>
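<p>For example, with a clone install:</p>
<pre><code class="language-bash">bash /path/to/claude-forge/bin/install-tracing.sh --no-settings   # install, but skip the settings merge
bash /path/to/claude-forge/bin/install-tracing.sh --uninstall     # remove the venv, hook, and settings entries
</code></pre>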
<h3 id="heading-opting-in">Opting In</h3>
<p>Add to your shell init and restart your terminal:</p>
<pre><code class="language-bash">export CLAUDE_FORGE_TRACING=1
</code></pre>
<p>Restart Claude Code, run <code>/forge:pipeline</code>, then check <code>http://localhost:16686</code> for the <code>claude-forge</code> service.</p>
<h2 id="heading-understanding-the-span-model">Understanding the Span Model</h2>
<p>Here's what the hierarchy looks like for a typical swarm run:</p>
<pre><code class="language-plaintext">session: "implement login form with OAuth"        &lt;- root span
├── subagent:planner
│   ├── tool:Write  (Phase-0.md)                  &lt;- mutation spans (on by default)
│   ├── tool:Write  (Phase-1.md)
│   └── subagent_result:planner                   &lt;- duration, token counts, output
├── subagent:implementer
│   ├── tool:Edit   (src/auth.ts)
│   ├── tool:Bash   (npm test)
│   ├── tool:Write  (src/oauth.ts)
│   └── subagent_result:implementer
├── subagent:reviewer
│   └── subagent_result:reviewer
└── session_complete                              &lt;- session totals
</code></pre>
<p>The root span's name comes from the first line of your prompt. Find traces by what you asked for, not by a UUID.</p>
<p>Subagents get an anchor span on start and a result span on completion. The result carries duration, token counts, prompt, and output.</p>
<h3 id="heading-three-tiers-of-detail">Three Tiers of Detail</h3>
<p>Not all inner tool calls are equally interesting. Write, Edit, MultiEdit, and Bash are mutational: small in number, high signal. They tell you what actually changed. Read, Glob, Grep, and WebFetch are navigation: lots of them, mostly noise.</p>
<p>Tracing captures mutations by default. That middle ground turned out to be the right one. Before this change, you either saw nothing inside subagents or you saw 200+ spans per run.</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Subagents</th>
<th>Mutations (Write/Edit/Bash)</th>
<th>Other inner tools</th>
</tr>
</thead>
<tbody><tr>
<td>Default</td>
<td>yes</td>
<td>yes</td>
<td>no</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_INNER=1</code></td>
<td>yes</td>
<td>yes</td>
<td>yes (minus blocklist)</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_MUTATIONS=0</code></td>
<td>yes</td>
<td>no</td>
<td>no (or per INNER)</td>
</tr>
</tbody></table>
<h3 id="heading-span-attributes">Span Attributes</h3>
<p><strong>On</strong> <code>session_complete</code><strong>:</strong> <code>session.tokens.input</code>, <code>session.tokens.output</code>, <code>session.tokens.total</code>, <code>session.tokens.turns</code>, <code>session.duration_ms</code>, <code>user.prompt</code> (first 2KB).</p>
<p><strong>On</strong> <code>subagent_result</code><strong>:</strong> <code>agent.description</code>, <code>agent.prompt</code>, <code>agent.output</code>, <code>agent.duration_ms</code>, <code>agent.is_error</code>, <code>agent.tokens.input</code>, <code>agent.tokens.output</code>.</p>
<p><strong>On</strong> <code>tool:*</code><strong>:</strong> <code>tool.name</code>, <code>tool.input</code>, <code>tool.output</code>, <code>tool.duration_ms</code>, <code>tool.is_error</code>.</p>
<h2 id="heading-instrumenting-a-multi-agent-swarm">Instrumenting a Multi-Agent Swarm</h2>
<h3 id="heading-hook-architecture">Hook Architecture</h3>
<p>Claude Code has lifecycle hooks that fire scripts on specific events. Four matter here:</p>
<ol>
<li><p><strong>UserPromptSubmit</strong> (create the root span),</p>
</li>
<li><p><strong>PreToolUse</strong> (start a span),</p>
</li>
<li><p><strong>PostToolUse</strong> (end it with results), and</p>
</li>
<li><p><strong>Stop</strong> (finalize the trace).</p>
</li>
</ol>
<p>Each hook gets a JSON payload on stdin and runs as a subprocess.</p>
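<p>The skeleton of such a hook is small. Here's a minimal sketch (the payload keys shown are the ones the correlation code below relies on):</p>
<pre><code class="language-python">#!/usr/bin/env python3
# minimal hook skeleton: each invocation is a fresh process fed JSON on stdin
import json
import sys

payload = json.load(sys.stdin)
tool_name = payload.get("tool_name", "")    # present on PreToolUse / PostToolUse
tool_input = payload.get("tool_input", {})  # identical on Pre and Post

# ...start a span on Pre, or close the matching span on Post...
</code></pre>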
<h3 id="heading-sending-spans-with-opentelemetry">Sending Spans with OpenTelemetry</h3>
<p>Here's some minimal Python to get a span into Jaeger:</p>
<pre><code class="language-python">from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "my-agent-system"})
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent-tracer")

with tracer.start_as_current_span("my-agent-task") as span:
    span.set_attribute("agent.name", "planner")
    span.set_attribute("agent.tokens.input", 1500)
    span.set_attribute("agent.tokens.output", 800)
</code></pre>
<p>Refresh <code>localhost:16686</code>, pick your service, click "Find Traces."</p>
<h3 id="heading-correlating-pre-and-post-events">Correlating Pre and Post Events</h3>
<p>You need to match each PreToolUse to its PostToolUse. Agent-type tool calls didn't include a <code>tool_use_id</code> in the payload, so I hashed the tool name and input instead. Pre and Post carry identical <code>tool_input</code>, so the hashes line up.</p>
<pre><code class="language-python">import hashlib, json

def correlation_key(tool_name: str, tool_input: dict) -&gt; str:
    content = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    return hashlib.sha1(content.encode()).hexdigest()[:16]
</code></pre>
<h3 id="heading-state-across-invocations">State Across Invocations</h3>
<p>Every hook call is a separate process. No shared memory. So I wrote span context to JSON files on Pre and read them back on Post:</p>
<pre><code class="language-plaintext">/tmp/claude-forge-tracing/&lt;session_id&gt;/
├── _root.json              # trace ID, root span context
├── _session_start_ns.json  # timestamp for duration calculation
├── subagent_&lt;hash&gt;.json    # per-subagent span context
└── tool_&lt;hash&gt;.json        # per-tool span context
</code></pre>
<p>File names get sanitized against path traversal. <code>_safe_name()</code> strips everything outside <code>[A-Za-z0-9._-]</code> and falls back to a SHA1 slug.</p>
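<p>Here's a minimal sketch of the pattern (not the hook's actual code; the helper names are mine). The Pre side saves the span's IDs, and the Post side rebuilds a remote parent context from them:</p>
<pre><code class="language-python">import json
import os

from opentelemetry import trace
from opentelemetry.trace import NonRecordingSpan, SpanContext, TraceFlags

STATE_DIR = "/tmp/claude-forge-tracing"

def save_context(session_id: str, name: str, span: trace.Span) -&gt; None:
    # Pre hook: persist the IDs so a later process can re-parent its spans
    ctx = span.get_span_context()
    path = os.path.join(STATE_DIR, session_id, f"{name}.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"trace_id": ctx.trace_id, "span_id": ctx.span_id}, f)

def load_parent(session_id: str, name: str):
    # Post hook: reconstruct the parent context from the saved IDs
    with open(os.path.join(STATE_DIR, session_id, f"{name}.json")) as f:
        saved = json.load(f)
    parent = SpanContext(
        trace_id=saved["trace_id"],
        span_id=saved["span_id"],
        is_remote=True,
        trace_flags=TraceFlags(TraceFlags.SAMPLED),
    )
    return trace.set_span_in_context(NonRecordingSpan(parent))
</code></pre>
<p>The context returned by <code>load_parent</code> can be passed as the <code>context=</code> argument of <code>tracer.start_as_current_span(...)</code>, so the span created in the later process lands under the right parent in the same trace.</p>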
<h3 id="heading-flushing-without-blocking">Flushing Without Blocking</h3>
<pre><code class="language-python">try:
    provider.force_flush(timeout_millis=1000)
except Exception:
    pass  # Never block the swarm
</code></pre>
<p>I tried 2000ms first and the swarm felt slow. 100ms lost spans on cold TLS connections. 1000ms worked. If Jaeger is down, the swarm keeps running regardless.</p>
<h2 id="heading-viewing-traces-in-the-jaeger-ui">Viewing Traces in the Jaeger UI</h2>
<p>Open <code>http://localhost:16686</code>. Pick <code>claude-forge</code> from the service dropdown. Click "Find Traces."</p>
<p>The trace search filters by operation name, tags, and time range. Since session spans take their name from your prompt, searching "login form" pulls up the runs where you asked for one.</p>
<p>The timeline view is where I spend most of my time. Every span is a horizontal bar, nested by parent-child relationships. I can see the planner took 12 seconds, the implementer 45, the reviewer 8. Click any bar to see token counts, prompts, outputs, error status.</p>
<p>Trace comparison puts two runs side by side. This is good for figuring out why one run succeeded and another did not.</p>
<h2 id="heading-lessons-from-the-trenches">Lessons from the Trenches</h2>
<p><strong>One trace per swarm, not per subagent:</strong> My first version wiped the root span's state file on every Stop event, so each subagent started a new trace. I changed Stop to mark a timestamp while preserving the root.</p>
<p><strong>Use descriptions, not type names:</strong> Subagents all report their type as <code>general-purpose</code>. The description field is where the actual role lives.</p>
<p><strong>Token attribution needs per-agent transcripts:</strong> Claude Code writes subagent transcripts to <code>~/.claude/projects/&lt;project&gt;/&lt;session&gt;/subagents/agent-*.jsonl</code>. Match them via <code>agent-*.meta.json</code>.</p>
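<p>A sketch of the pairing (the directory layout comes from the tip above; <code>&lt;project&gt;</code> and <code>&lt;session&gt;</code> are placeholders you'd fill in):</p>
<pre><code class="language-python"># pair each subagent transcript with its metadata file
from pathlib import Path

subagents = Path.home() / ".claude" / "projects" / "&lt;project&gt;" / "&lt;session&gt;" / "subagents"
for meta in sorted(subagents.glob("agent-*.meta.json")):
    transcript = meta.with_name(meta.name.replace(".meta.json", ".jsonl"))
    print(meta.name, "-&gt;", transcript.exists())
</code></pre>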
<p><strong>Parse boolean env vars explicitly:</strong> <code>bool("0")</code> in Python is <code>True</code>. Use an allowlist: <code>{"1", "true", "yes", "on"}</code>.</p>
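<p>A small parser that gets this right (one way to do it):</p>
<pre><code class="language-python">import os

def env_flag(name: str, default: bool = False) -&gt; bool:
    # bool("0") is True in Python, so compare against an explicit allowlist
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}
</code></pre>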
<h2 id="heading-environment-variable-reference">Environment Variable Reference</h2>
<table>
<thead>
<tr>
<th>Variable</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>CLAUDE_FORGE_TRACING=1</code></td>
<td>Master opt-in. Hook is a no-op without this.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_MUTATIONS=0</code></td>
<td>Disable default mutation spans (Write/Edit/Bash). On by default.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_INNER=1</code></td>
<td>Capture all inner tool calls as child spans (off by default).</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_TRACE_TOOL_BLOCKLIST</code></td>
<td>Comma-separated tools to skip when inner tracing is on. Defaults to <code>Read,Glob,Grep,TodoWrite,NotebookRead</code>.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_HOOK_DEBUG=1</code></td>
<td>Enable debug logging of raw hook payloads. Off by default.</td>
</tr>
<tr>
<td><code>CLAUDE_FORGE_HOOK_DEBUG_LOG</code></td>
<td>Override debug log path. Defaults to <code>~/.cache/claude-forge/hook.log</code>.</td>
</tr>
<tr>
<td><code>OTEL_EXPORTER_OTLP_ENDPOINT</code></td>
<td>OTLP/gRPC endpoint. Defaults to <code>http://localhost:4317</code>.</td>
</tr>
</tbody></table>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Without visibility into the process, you're being inefficient with tokens and your time. Multi-agent swarms cost real money on every run. When an agent fails and retries, or when a reviewer rejects work that was close, you're paying for that blind.</p>
<p>Tracing gives you the map. You find out where the failure modes are. You find out which agents burn tokens going nowhere. A 45-second implementer run might have been 10 seconds with a better planner prompt. But you would never know that without seeing the breakdown.</p>
<p>Get observability in early. Jaeger and OpenTelemetry make it cheap to set up. Once you can see where things go wrong you can actually fix them.</p>
<p>Claude Forge tracing is on the <a href="https://github.com/HatmanStack/claude-forge">main branch</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How I Built a Production-Ready CI/CD Pipeline for a Monorepo-Based Microservices System with Jenkins, Docker Compose, and Traefik ]]>
                </title>
                <description>
                    <![CDATA[ This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server. You’ll learn how to expose services on  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-production-ready-ci-cd-pipeline-for-monorepo-based-microservices-system/</link>
                <guid isPermaLink="false">69ea60c8904b915438a58ca2</guid>
                
                    <category>
                        <![CDATA[ Jenkins ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ci-cd ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Traefik ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md Tarikul Islam ]]>
                </dc:creator>
                <pubDate>Thu, 23 Apr 2026 18:11:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/66cb39fcaa2a09f9a8d691c1/d59c62f5-e376-4f09-851f-83e437f9960a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>This tutorial is a complete, real-world guide to building a production-ready CI/CD pipeline using Jenkins, Docker Compose, and Traefik on a single Linux server.</p>
<p>You’ll learn how to expose services on a custom domain with auto-renewing HTTPS, and implement a smart deployment strategy that detects changes and redeploys only the affected microservices. This helps avoid unnecessary full-stack redeploys. We'll also cover real production issues and the exact fixes for each one.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a href="#heading-1-what-youll-build">1. What you'll build</a></p>
</li>
<li><p><a href="#heading-2-architecture">2. Architecture</a></p>
</li>
<li><p><a href="#heading-3-server-prerequisites">3. Server prerequisites</a></p>
</li>
<li><p><a href="#heading-4-traefik-the-reverse-proxy">4. Traefik — the reverse proxy</a></p>
</li>
<li><p><a href="#heading-5-run-jenkins-in-docker">5. Run Jenkins in Docker</a></p>
</li>
<li><p><a href="#heading-6-expose-jenkins-on-a-domain-via-traefik">6. Expose Jenkins on a domain via Traefik</a></p>
</li>
<li><p><a href="#heading-7-first-time-jenkins-setup">7. First-time Jenkins setup</a></p>
</li>
<li><p><a href="#heading-8-add-the-github-credential">8. Add the GitHub credential</a></p>
</li>
<li><p><a href="#heading-9-create-the-pipeline-job">9. Create the pipeline job</a></p>
</li>
<li><p><a href="#heading-10-the-jenkinsfile-deploy-only-what-changed">10. The Jenkinsfile (deploy only what changed)</a></p>
</li>
<li><p><a href="#heading-11-end-to-end-test">11. End-to-end test</a></p>
</li>
<li><p><a href="#heading-12-troubleshooting-every-error-we-hit">12. Troubleshooting — every error we hit</a></p>
</li>
<li><p><a href="#heading-13-mental-model-host-vs-container">13. Mental model: host vs. container</a></p>
</li>
<li><p><a href="#heading-14-daily-operations-cheat-sheet">14. Daily operations cheat sheet</a></p>
</li>
<li><p><a href="#heading-15-what-id-do-differently-next-time">15. What I'd do differently next time</a></p>
</li>
<li><p><a href="#heading-closing-thoughts">Closing thoughts</a></p>
</li>
</ul>
<h2 id="heading-1-what-youll-build">1. What You'll Build</h2>
<p>In this tutorial, you'll build a Jenkins instance running inside Docker on the same Linux server as your application stack.</p>
<p>Traefik will act as a reverse proxy in front of Jenkins, exposing it via a clean URL (<a href="https://jenkins.example.com"><code>https://jenkins.example.com</code></a>) with <strong>auto-renewing Let's Encrypt certificates</strong>.</p>
<p>You'll also create a Jenkinsfile in your application repository that:</p>
<ul>
<li><p>Automatically triggers on every push to the <code>staging</code> branch,</p>
</li>
<li><p>Detects which microservices changed in each commit,</p>
</li>
<li><p>Pulls the latest code on the host machine,</p>
</li>
<li><p>Rebuilds and restarts <strong>only the affected services</strong>.</p>
</li>
</ul>
<p>On every push, only the relevant services are redeployed.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>Before jumping in, this guide assumes you’re already comfortable with a few core concepts and tools.</p>
<p>This isn't a beginner-level tutorial — we’ll be working directly with infrastructure, containers, and CI/CD pipelines.</p>
<p>You should be familiar with:</p>
<ul>
<li><p>Basic Linux commands (SSH, file system navigation, permissions)</p>
</li>
<li><p>Docker fundamentals (images, containers, volumes, networks)</p>
</li>
<li><p>Git workflows (clone, pull, branches)</p>
</li>
<li><p>General idea of CI/CD pipelines</p>
</li>
</ul>
<p>Tools and environment required:</p>
<ul>
<li><p>A Linux server (Ubuntu recommended)</p>
</li>
<li><p>Docker Engine + Docker Compose (v2)</p>
</li>
<li><p>A domain name (for Traefik + HTTPS)</p>
</li>
<li><p>GitHub repository (for your backend project)</p>
</li>
<li><p>Basic understanding of microservices architecture</p>
</li>
</ul>
<p>If you’re comfortable with the above, you’re ready to follow along.</p>
<h2 id="heading-2-architecture">2. Architecture</h2>
<p>Here's an overview of the architecture:</p>
<pre><code class="language-plaintext">┌──────────────────────────── Linux server (Ubuntu) ────────────────────────────┐
│                                                                               │
│   /home/developer/projects/                                                  │
│       └── projects-prod-configs/            ← infra repo (compose, Traefik) │
│              ├── docker-compose.staging.yml                                   │
│              ├── traefik.staging.yml                                          │
│              └── projects-backend/         ← app repo (services, gateways) │
│                     ├── Jenkinsfile                                           │
│                     ├── docker-compose.staging.yml                            │
│                     └── apps/                                                 │
│                            ├── services/&lt;name&gt;/                               │
│                            ├── gateways/&lt;name&gt;/                               │
│                            └── core/&lt;name&gt;/                                   │
│                                                                               │
│   ┌─────────────────────── Docker network: proxy ──────────────────────┐      │
│   │  traefik (80, 443)                                                 │      │
│   │     │                                                              │      │
│   │     ├──► jenkins  (projects-jenkins-staging)                     │      │
│   │     │      ↳ /projects  ← bind-mount of the host project tree     │      │
│   │     │      ↳ /var/run/docker.sock ← controls host Docker           │      │
│   │     │                                                              │      │
│   │     └──► your services &amp; gateways (built by the pipeline)          │      │
│   └────────────────────────────────────────────────────────────────────┘      │
│                                                                               │
└───────────────────────────────────────────────────────────────────────────────┘
            ▲
            │  webhook on push
            │
   GitHub: &lt;org&gt;/projects-backend (branch: staging)
</code></pre>
<p>There are two key ideas here:</p>
<ol>
<li><p><strong>Jenkins runs in a container</strong>, but it controls the <strong>host's</strong> Docker by mounting <code>/var/run/docker.sock</code>. It also bind-mounts the project folder as <code>/projects/...</code>, so it can <code>cd</code> into the real code on the host and run <code>docker compose</code> there.</p>
</li>
<li><p>The <strong>Jenkinsfile lives inside the app repo</strong>, so the pipeline definition is versioned with the code. Jenkins simply points at it.</p>
</li>
</ol>
<h3 id="heading-3-server-prerequisites">3. Server Prerequisites</h3>
<p>Before we start configuring Jenkins or Traefik, we need to prepare the server properly.</p>
<p>In this step, we’ll:</p>
<ul>
<li><p>Create a dedicated Linux user for managing the project</p>
</li>
<li><p>Install Docker and Docker Compose</p>
</li>
<li><p>Set up the folder structure for our repositories</p>
</li>
</ul>
<p>This ensures our CI/CD pipeline runs in a clean and predictable environment.</p>
<pre><code class="language-bash"># Linux user that owns the project tree
sudo adduser developer

# Docker engine + Compose plugin
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker developer

# Sanity check Compose v2
docker compose version
# -&gt; Docker Compose version v2.x.y

# Find where the Compose plugin binary lives — write it down, you'll need it
ls /usr/libexec/docker/cli-plugins/docker-compose
# (some distros use /usr/lib/docker/cli-plugins/docker-compose)

# Project layout
sudo mkdir -p /home/developer/projects
sudo chown -R developer:developer /home/developer/projects

# Clone both repos in the right place
cd /home/developer/projects
git clone https://github.com/&lt;org&gt;/projects-prod-configs.git
cd projects-prod-configs
git clone -b staging https://github.com/&lt;org&gt;/projects-backend.git
</code></pre>
<p>You should now have:</p>
<pre><code class="language-plaintext">/home/developer/projects/projects-prod-configs/projects-backend
</code></pre>
<p>Memorize this path — your Jenkinsfile references it.</p>
<h3 id="heading-dns">DNS</h3>
<p>Point an A-record for your Jenkins subdomain to the server's public IP <strong>before</strong> the next steps so Let's Encrypt can validate via HTTP challenge:</p>
<pre><code class="language-plaintext">jenkins.example.com   A   &lt;server-public-ip&gt;
</code></pre>
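<p>You can confirm the record has propagated before moving on:</p>
<pre><code class="language-bash">dig +short jenkins.example.com
# -&gt; &lt;server-public-ip&gt;
</code></pre>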
<h2 id="heading-4-traefik-the-reverse-proxy">4. Traefik — the Reverse Proxy</h2>
<p>Traefik acts as the entry point to your entire system. Instead of exposing each service manually with ports, Traefik automatically:</p>
<ul>
<li><p>Routes traffic based on domain names</p>
</li>
<li><p>Generates and renews HTTPS certificates using Let’s Encrypt</p>
</li>
<li><p>Connects to Docker and detects services dynamically</p>
</li>
</ul>
<p>In simple terms, Traefik lets you access services like:</p>
<p><a href="https://jenkins.example.com">https://jenkins.example.com</a><br><a href="https://api.example.com">https://api.example.com</a></p>
<p>…without manually configuring NGINX or managing SSL certificates.</p>
<p>In this setup, Traefik watches Docker containers and routes traffic using labels we'll define later.</p>
<p>Traefik gives every container a real domain and a real cert with <strong>zero per-service config</strong> — you just add a few labels.</p>
<h3 id="heading-traefikstagingyml-static-config"><code>traefik.staging.yml</code> (static config)</h3>
<p>Put this at the root of your infra repo:</p>
<pre><code class="language-yaml">api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  letsencrypt:
    acme:
      httpChallenge:
        entryPoint: web
      email: admin@example.com           # ← change me
      storage: /etc/traefik/acme.json

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false              # only containers with traefik.enable=true
    network: proxy
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: INFO

accessLog: {}
</code></pre>
<h3 id="heading-the-traefik-service-in-docker-composestagingyml">The Traefik service in <code>docker-compose.staging.yml</code></h3>
<pre><code class="language-yaml">networks:
  proxy:
    name: proxy
    driver: bridge
  internal:
    name: internal
    driver: bridge

volumes:
  acme-data:
  traefik-logs:
  jenkins-data:

services:
  traefik:
    image: traefik:v2.11
    container_name: projects-traefik-staging
    restart: unless-stopped
    ports:
      - "80:80"        # HTTP (auto-redirects to HTTPS)
      - "443:443"      # HTTPS
      - "8080:8080"    # Traefik dashboard (internal only — protect via firewall)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.staging.yml:/etc/traefik/traefik.yml:ro
      - ./dynamic:/etc/traefik/dynamic:ro
      - acme-data:/etc/traefik           # persists Let's Encrypt certs
      - traefik-logs:/var/log/traefik
    networks:
      - proxy
    command:
      - '--api.insecure=false'
      - '--api.dashboard=true'
      - '--providers.docker=true'
      - '--providers.docker.exposedbydefault=false'
      - '--providers.docker.network=proxy'
      - '--entrypoints.web.address=:80'
      - '--entrypoints.websecure.address=:443'
      - '--entrypoints.web.http.redirections.entryPoint.to=websecure'
      - '--entrypoints.web.http.redirections.entryPoint.scheme=https'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge=true'
      - '--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web'
      - '--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL:-admin@example.com}'
      - '--certificatesresolvers.letsencrypt.acme.storage=/etc/traefik/acme.json'
      - '--log.level=INFO'
      - '--accesslog=true'
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      # Traefik's own dashboard
      - "traefik.http.routers.traefik-dash.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.traefik-dash.entrypoints=websecure"
      - "traefik.http.routers.traefik-dash.tls.certresolver=letsencrypt"
      - "traefik.http.routers.traefik-dash.service=api@internal"
</code></pre>
<p>Bring it up:</p>
<pre><code class="language-bash">cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d traefik
</code></pre>
<p>Watch the logs the first time — Traefik will request a cert for the dashboard host as soon as DNS resolves.</p>
<pre><code class="language-bash">docker logs -f projects-traefik-staging
</code></pre>
<p><strong>Tip:</strong> While testing, switch ACME to the staging endpoint (<code>acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory</code>) so you don't burn through Let's Encrypt's rate limits if you misconfigure DNS. Remove that flag before going live.</p>
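<p>Concretely, that means adding one flag to the Traefik service's <code>command:</code> list while you test:</p>
<pre><code class="language-yaml">      - '--certificatesresolvers.letsencrypt.acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory'
</code></pre>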
<h2 id="heading-5-run-jenkins-in-docker">5. Run Jenkins in Docker</h2>
<p>Add this Jenkins service to the same <code>docker-compose.staging.yml</code>. Every line matters (and the comments explain why).</p>
<pre><code class="language-yaml">  jenkins:
    image: jenkins/jenkins:lts
    container_name: projects-jenkins-staging
    restart: unless-stopped
    user: root                           # to use host docker.sock without UID juggling
    environment:
      - JAVA_OPTS=-Xmx1g -Xms512m -Duser.timezone=Asia/Dhaka
      - TZ=Asia/Dhaka                    # OS-level timezone inside container
      - JENKINS_OPTS=--prefix=/
    ports:
      - "3095:8080"                      # web UI (also reachable directly if needed)
      - "50000:50000"                    # inbound agent port
    volumes:
      - jenkins-data:/var/jenkins_home   # Jenkins config/jobs/secrets persistence
      - /var/run/docker.sock:/var/run/docker.sock                          # control host Docker
      - /usr/bin/docker:/usr/bin/docker                                     # docker CLI from host
      - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro  # docker compose plugin
      - /home/developer/projects:/projects                                # project tree
      - /etc/localtime:/etc/localtime:ro                                    # match host clock
      - /etc/timezone:/etc/timezone:ro
    networks:
      - proxy
      - internal
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:8080/login']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    deploy:
      resources:
        limits:
          memory: 1024M
</code></pre>
<p><strong>Why</strong> <code>user: root</code><strong>?</strong> It's the simplest way to share <code>docker.sock</code> and the project bind-mount without UID/GID gymnastics. If you prefer an unprivileged user, you'll need to set <code>group: docker</code> and align UIDs/perms on host folders — possible but out of scope here.</p>
<h2 id="heading-6-expose-jenkins-on-a-domain-via-traefik">6. Expose Jenkins on a Domain via Traefik</h2>
<p>This is the section many guides skip. We'll add <strong>labels</strong> to the Jenkins service so Traefik picks it up automatically. No editing of Traefik config required.</p>
<pre><code class="language-yaml">  jenkins:
    # ... everything above ...
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"

      # 1) Router — match incoming Host
      - "traefik.http.routers.jenkins.rule=Host(`jenkins.example.com`)"
      - "traefik.http.routers.jenkins.entrypoints=websecure"
      - "traefik.http.routers.jenkins.tls.certresolver=letsencrypt"
      - "traefik.http.routers.jenkins.service=jenkins"

      # 2) Service — tell Traefik which container port is the app
      - "traefik.http.services.jenkins.loadbalancer.server.port=8080"

      # 3) Middleware — Jenkins needs X-Forwarded-Proto so it knows it's behind HTTPS
      - "traefik.http.middlewares.jenkins-headers.headers.customrequestheaders.X-Forwarded-Proto=https"
      - "traefik.http.routers.jenkins.middlewares=jenkins-headers"
</code></pre>
<p>What each line does:</p>
<table>
<thead>
<tr>
<th>Label</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><code>traefik.enable=true</code></td>
<td>Opts this container in (we set <code>exposedByDefault=false</code>).</td>
</tr>
<tr>
<td><code>traefik.docker.network=proxy</code></td>
<td>Tells Traefik which network to talk to Jenkins on (Jenkins is on both <code>proxy</code> and <code>internal</code>).</td>
</tr>
<tr>
<td><code>routers.jenkins.rule=Host(...)</code></td>
<td>Forwards only this hostname to Jenkins.</td>
</tr>
<tr>
<td><code>routers.jenkins.entrypoints=websecure</code></td>
<td>Listens only on 443. (HTTP redirect was set up in section 4.)</td>
</tr>
<tr>
<td><code>routers.jenkins.tls.certresolver=letsencrypt</code></td>
<td>Auto-issues + renews the cert.</td>
</tr>
<tr>
<td><code>services.jenkins.loadbalancer.server.port=8080</code></td>
<td>Jenkins listens on 8080 inside the container.</td>
</tr>
<tr>
<td><code>customrequestheaders.X-Forwarded-Proto=https</code></td>
<td>Without this, Jenkins generates <code>http://</code> URLs in webhooks/links and breaks.</td>
</tr>
</tbody></table>
<p>Bring Jenkins up:</p>
<pre><code class="language-bash">cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d jenkins

# Watch Traefik issue the certificate
docker logs -f projects-traefik-staging | grep -i acme
</code></pre>
<p>After 10–60 seconds you should be able to open <code>https://jenkins.example.com</code> and see Jenkins's setup wizard with a valid lock icon.</p>
<p>Inside Jenkins (after first login):</p>
<p>Manage Jenkins → System → Jenkins URL → set this to: <a href="https://jenkins.example.com/">https://jenkins.example.com/</a></p>
<p>This is important because Jenkins uses this base URL to generate:</p>
<ul>
<li><p>Webhook endpoints (for GitHub triggers)</p>
</li>
<li><p>Links inside emails and build logs</p>
</li>
</ul>
<p>If this isn't set correctly, GitHub webhooks may fail, and any links Jenkins generates will point to the wrong address (often localhost or internal IPs).</p>
<h2 id="heading-7-first-time-jenkins-setup">7. First-Time Jenkins Setup</h2>
<p>If you're running Jenkins for the first time on this server, follow this section to complete the initial setup.</p>
<p>If you already have Jenkins configured, you can skip this section — but make sure the required plugins and settings match what we use later in this guide.</p>
<ol>
<li><p>Open <code>https://jenkins.example.com</code>. Get the initial admin password:</p>
<pre><code class="language-bash">docker exec projects-jenkins-staging cat /var/jenkins_home/secrets/initialAdminPassword
</code></pre>
</li>
<li><p>Paste it, choose Install suggested plugins.</p>
</li>
<li><p>Create your admin user.</p>
</li>
<li><p>Manage Jenkins → Plugins → Available and install:</p>
<ul>
<li><p>GitHub (and GitHub Branch Source)</p>
</li>
<li><p>Pipeline: GitHub</p>
</li>
<li><p>Credentials Binding (usually preinstalled)</p>
</li>
</ul>
</li>
</ol>
<p>That's all the plugins you need for the rest of this guide.</p>
<h2 id="heading-8-add-the-github-credential">8. Add the GitHub Credential</h2>
<p>Jenkins needs permission to access your GitHub repository.</p>
<p>This is done using a GitHub Personal Access Token (PAT), which acts like a password for secure API and Git operations.</p>
<p>We’ll store this token inside Jenkins as a credential so it can pull code during pipeline execution and authenticate securely without exposing secrets in code.</p>
<p>This single credential is used both for the SCM checkout and for the deploy-time <code>git pull</code>.</p>
<ol>
<li><p>Create a Personal Access Token (classic) on GitHub with <code>repo</code> scope.</p>
</li>
<li><p>In Jenkins: Manage Jenkins → Credentials → System → Global → Add Credentials.</p>
</li>
<li><p>Fill in:</p>
<ul>
<li><p>Kind: Username with password</p>
</li>
<li><p>Username: your GitHub username</p>
</li>
<li><p>Password: the token</p>
</li>
<li><p><strong>ID:</strong> <code>github_classic_token</code> <em>(the Jenkinsfile references this exact ID)</em></p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-9-create-the-pipeline-job">9. Create the Pipeline Job</h2>
<p>Now that Jenkins has access to your repository, the next step is to define how deployments should run.</p>
<p>A pipeline job tells Jenkins:</p>
<ul>
<li><p>where your code lives,</p>
</li>
<li><p>which branch to monitor,</p>
</li>
<li><p>and how to execute your deployment process.</p>
</li>
</ul>
<p>In Jenkins, create a new Pipeline job and connect it to your GitHub repository. Once this is set up, Jenkins will automatically trigger deployments whenever you push to the <code>staging</code> branch.</p>
<p>Start by creating a new job:</p>
<p>New Item → Pipeline → name it <code>projects-staging</code> → OK</p>
<p>Then configure the job:</p>
<ul>
<li><p>Under <strong>Build Triggers</strong>, enable:<br><strong>GitHub hook trigger for GITScm polling</strong></p>
</li>
<li><p>Under <strong>Pipeline</strong>:</p>
<ul>
<li><p>Definition: Pipeline script from SCM</p>
</li>
<li><p>SCM: Git</p>
</li>
<li><p>Repository URL: <code>https://github.com/&lt;org&gt;/projects-backend.git</code></p>
</li>
<li><p>Credentials: <code>github_classic_token</code></p>
</li>
<li><p>Branch: <code>*/staging</code></p>
</li>
<li><p>Script Path: <code>Jenkinsfile</code></p>
</li>
</ul>
</li>
</ul>
<p>Save the configuration.</p>
<p>At this point, Jenkins is fully connected to your repository and ready to run your deployment pipeline automatically.</p>
<h2 id="heading-10-the-jenkinsfile-deploy-only-what-changed">10. The Jenkinsfile (Deploy Only What Changed)</h2>
<p>Place this at the root of the <strong>app</strong> repo (<code>projects-backend/Jenkinsfile</code>), branch <code>staging</code>.</p>
<pre><code class="language-groovy">pipeline {
  agent any

  environment {
    PROJECT_PATH = "/projects/projects-prod-configs/projects-backend"
    COMPOSE_FILE = "docker-compose.staging.yml"
  }

  stages {

    stage('Checkout') {
      steps {
        checkout scm
        echo "Checkout completed for branch: ${env.BRANCH_NAME ?: 'staging'}"
      }
    }

    stage('Detect Changes') {
      steps {
        script {
          def changedFiles = sh(
            script: "git diff --name-only HEAD~1 HEAD",
            returnStdout: true
          ).trim()

          echo "Changed files:\n${changedFiles}"

          def services = [] as Set
          changedFiles.split('\n').each { file -&gt;
            def svc  = file =~ /^apps\/services\/([a-z0-9-]+)\//
            def gw   = file =~ /^apps\/gateways\/([a-z0-9-]+)\//
            def core = file =~ /^apps\/core\/([a-z0-9-]+)\//
            if (svc)  { services &lt;&lt; svc[0][1]  }
            if (gw)   { services &lt;&lt; gw[0][1]   }
            if (core) { services &lt;&lt; core[0][1] }
          }
          services = services.findAll { !it.endsWith('-e2e') }
          env.CHANGED_SERVICES = services.join(' ')

          echo "Services to deploy: ${env.CHANGED_SERVICES ?: '(none)'}"
        }
      }
    }

    stage('Deploy') {
      when { expression { return env.CHANGED_SERVICES?.trim() } }
      steps {
        withCredentials([usernamePassword(
          credentialsId: 'github_classic_token',
          usernameVariable: 'GIT_USER',
          passwordVariable: 'GIT_TOKEN'
        )]) {
          sh '''
            set -eu
            git config --global --add safe.directory "${PROJECT_PATH}"
            cd "${PROJECT_PATH}"
            git remote set-url origin "https://github.com/&lt;org&gt;/projects-backend.git"
            git -c credential.helper= \
                -c "credential.helper=!f() { echo username=\({GIT_USER}; echo password=\){GIT_TOKEN}; }; f" \
                pull origin staging
            docker compose -f "\({COMPOSE_FILE}" up -d --build \){CHANGED_SERVICES}
          '''
        }
        echo "Deployed: ${env.CHANGED_SERVICES}"
      }
    }

    stage('Skip Deployment') {
      when { expression { return !env.CHANGED_SERVICES?.trim() } }
      steps { echo "No service changes detected — nothing to deploy." }
    }
  }
}
</code></pre>
<p>Why each tricky line is there:</p>
<ul>
<li><p><code>git config --global --add safe.directory ...</code> — git refuses to operate on a repo whose owner UID differs from the current user's. The repo on disk is owned by <code>developer</code>, but Git inside the container runs as <code>root</code>. This whitelists the path.</p>
</li>
<li><p><code>git remote set-url origin "https://..."</code> — flips the on-disk remote to HTTPS so the <strong>token can be used</strong>. (A PAT can't authenticate <code>git@github.com:</code> URLs — those use SSH.) Idempotent — safe to re-run.</p>
</li>
<li><p><code>git -c credential.helper="!f() { echo username=...; echo password=...; }; f"</code> — feeds the username/token to git for that one command without writing the token to disk and without exposing it on the process command line.</p>
</li>
<li><p><code>${CHANGED_SERVICES}</code> is unquoted on purpose so multiple service names expand as separate arguments (see the sketch after this list).</p>
</li>
</ul>
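<p>To make that last point concrete, here's a minimal illustration (the service names are hypothetical):</p>
<pre><code class="language-bash"># Suppose Detect Changes found two services:
CHANGED_SERVICES="student-apigw auth-service"

# Unquoted, each name expands to its own argument:
docker compose -f docker-compose.staging.yml up -d --build ${CHANGED_SERVICES}
# runs as: ... up -d --build student-apigw auth-service

# Quoted, it would be passed as ONE bogus service name:
# docker compose ... up -d --build "student-apigw auth-service"
</code></pre>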
<h2 id="heading-11-end-to-end-test">11. End-to-End Test</h2>
<p>Before considering the setup complete, we need to verify that the entire pipeline works as expected.</p>
<p>This end-to-end test ensures that:</p>
<ul>
<li><p>GitHub webhooks are triggering Jenkins correctly,</p>
</li>
<li><p>Jenkins can detect which services changed,</p>
</li>
<li><p>and only the affected services are rebuilt and deployed.</p>
</li>
</ul>
<p>In other words, this simulates a real production deployment.</p>
<p>Start by making a small change in your repository. For example, modify a file inside:</p>
<p><code>apps/gateways/student-apigw/</code></p>
<p>Then push the change to the <code>staging</code> branch.</p>
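<p>A minimal way to do that from your workstation (the file touched and the commit message are arbitrary):</p>
<pre><code class="language-bash">cd projects-backend
git checkout staging
echo "# deploy smoke test" &gt;&gt; apps/gateways/student-apigw/README.md
git add apps/gateways/student-apigw/README.md
git commit -m "chore: trigger staging deploy"
git push origin staging
</code></pre>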
<p>Once pushed, Jenkins should automatically trigger via the webhook. If not, you can manually click <strong>Build Now</strong>.</p>
<p>Now open the build’s <strong>Console Output</strong> and verify the flow. You should see something like:</p>
<ul>
<li><p>Checkout completed for branch: staging</p>
</li>
<li><p>Services to deploy: student-apigw</p>
</li>
<li><p>git pull origin staging (successful)</p>
</li>
<li><p>docker compose ... up -d --build student-apigw</p>
</li>
<li><p>Deployed: student-apigw</p>
</li>
</ul>
<p>If you see this sequence, your pipeline is working correctly.</p>
<p>If anything fails, don’t worry — jump to Section 12 where every common issue and its fix is documented.</p>
<h2 id="heading-12-troubleshooting-every-error-we-hit">12. Troubleshooting — Every Error We Hit</h2>
<p>This section covers real issues we faced while setting up this pipeline — and more importantly, <em>why each fix works</em>. Understanding the “why” will help you debug similar problems in your own setup.</p>
<h3 id="heading-cd-cant-cd-to-projectsprojects-prod-configsprojects-backend">cd: can't cd to /projects/projects-prod-configs/projects-backend</h3>
<p><strong>Cause:</strong><br>The Jenkinsfile runs <code>cd $PROJECT_PATH</code>, but inside the container that path doesn’t exist. This usually happens when:</p>
<ul>
<li><p>the project wasn’t cloned on the host, or</p>
</li>
<li><p>the bind mount isn’t configured correctly.</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="language-bash">ls /home/developer/projects/projects-prod-configs/projects-backend
# If missing: git clone -b staging &lt;url&gt; there.
</code></pre>
<p>Confirm the bind mount:</p>
<pre><code class="language-plaintext">docker inspect projects-jenkins-staging --format '{{range .Mounts}}{{.Source}} -&gt; {{.Destination}}{{println}}{{end}}'
</code></pre>
<p>If missing, recreate the container:</p>
<pre><code class="language-plaintext">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Jenkins runs inside a container, but your code lives on the host. The bind mount connects them. Without it, Jenkins cannot access your project directory.</p>
<h3 id="heading-fatal-detected-dubious-ownership-in-repository">fatal: detected dubious ownership in repository</h3>
<p><strong>Cause:</strong><br>Git blocks access when the repository owner differs from the current user.</p>
<ul>
<li><p>Repo owner: <code>developer</code> (host)</p>
</li>
<li><p>Git runs as: <code>root</code> (inside container)</p>
</li>
</ul>
<p><strong>Fix:</strong></p>
<pre><code class="language-plaintext">git config --global --add safe.directory "${PROJECT_PATH}"
</code></pre>
<p><strong>Why this works:</strong></p>
<p>This explicitly tells Git that the directory is trusted, bypassing ownership mismatch security restrictions.</p>
<h3 id="heading-host-key-verification-failed-could-not-read-from-remote-repository"><code>Host key verification failed</code> / <code>Could not read from remote repository</code></h3>
<h4 id="heading-cause">Cause:</h4>
<p>The repository uses SSH (<code>git@github.com:...</code>), but:</p>
<ul>
<li><p>the container has no SSH keys</p>
</li>
<li><p>no known_hosts file exists</p>
</li>
</ul>
<p>Also, GitHub tokens cannot authenticate over SSH.</p>
<p><strong>Fix (recommended):</strong></p>
<pre><code class="language-plaintext">git remote set-url origin "https://github.com/&lt;org&gt;/projects-backend.git"
</code></pre>
<p><strong>Why this works:</strong></p>
<p>HTTPS uses token-based authentication (PAT), which works inside containers without SSH configuration.</p>
<h3 id="heading-unknown-shorthand-flag-f-in-f-docker-compose"><code>unknown shorthand flag: 'f' in -f</code> ( <code>docker compose</code>)</h3>
<p><strong>Cause:</strong><br>The Docker CLI exists, but the Docker Compose plugin is missing inside the container.</p>
<p><strong>Fix:</strong></p>
<pre><code class="language-plaintext">volumes:
  - /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro
</code></pre>
<p>Find your path if needed:</p>
<pre><code class="language-plaintext">find /usr -name docker-compose -type f 2&gt;/dev/null
</code></pre>
<p>Verify:</p>
<pre><code class="language-plaintext">docker exec projects-jenkins-staging docker compose version
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Docker Compose v2 is a CLI plugin. Mounting this directory makes the <code>docker compose</code> command available inside the container.</p>
<h3 id="heading-wrong-timezone-in-build-timestamps-and-jenkins-ui">Wrong timezone in build timestamps and Jenkins UI</h3>
<p><strong>Fix:</strong> Set both env var and JVM flag, and bind-mount the host's clock files:</p>
<pre><code class="language-yaml">environment:
  - TZ=Asia/Dhaka
  - JAVA_OPTS=... -Duser.timezone=Asia/Dhaka
volumes:
  - /etc/localtime:/etc/localtime:ro
  - /etc/timezone:/etc/timezone:ro
</code></pre>
<p>You <strong>must</strong> recreate the container for env-var changes to take effect:</p>
<pre><code class="language-bash">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong><br>Jenkins runs on Java, which uses its own timezone separate from the OS.<br>By aligning OS timezone, JVM timezone, and host clock, you ensure consistent timestamps everywhere.</p>
<h3 id="heading-errsockettimeout-pnpm-install-fails">ERR_SOCKET_TIMEOUT (pnpm install fails)</h3>
<h4 id="heading-cause">Cause:</h4>
<p>If you have multiple services building in parallel and each runs pnpm install with ~1500 packages, the network gets saturated and a timeout occurs.</p>
<h4 id="heading-fixes">Fixes:</h4>
<p>a) Increase timeout + control concurrency</p>
<pre><code class="language-xml">RUN pnpm install --frozen-lockfile --ignore-scripts 
--network-timeout 600000 
--network-concurrency 8
</code></pre>
<p>Why: Gives pnpm more time and reduces network overload.</p>
<p>b) Enable pnpm cache (BuildKit)</p>
<pre><code class="language-xml">RUN --mount=type=cache,id=pnpm-store,target=/root/.local/share/pnpm/store 
pnpm install --frozen-lockfile --ignore-scripts
</code></pre>
<p>Why: Dependencies are cached and reused instead of downloading every time.</p>
<p>c) Avoid unnecessary rebuilds</p>
<pre><code class="language-xml">docker compose -f \(COMPOSE_FILE build \)CHANGED_SERVICES docker compose -f \(COMPOSE_FILE up -d --no-build \)CHANGED_SERVICES
</code></pre>
<p>Why: Only changed services are rebuilt → less network load → fewer failures.</p>
<h3 id="heading-container-changes-dont-apply-after-editing-docker-composeyml">Container changes don’t apply after editing docker-compose.yml</h3>
<h4 id="heading-cause">Cause:</h4>
<p><code>docker compose up -d</code> doesn't always recreate a running container after you edit the compose file; <code>--force-recreate</code> guarantees it.</p>
<h4 id="heading-fix">Fix:</h4>
<pre><code class="language-xml">docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins
</code></pre>
<p><strong>Why this works:</strong></p>
<p>This forces Docker to recreate the container with updated configuration (env, volumes, labels).</p>
<h3 id="heading-traefik-shows-default-certificate-no-https">Traefik shows default certificate (no HTTPS)</h3>
<h4 id="heading-common-causes">Common causes:</h4>
<ul>
<li><p>DNS not pointing to the server</p>
</li>
<li><p>Port 80 blocked</p>
</li>
<li><p>Wrong Docker network</p>
</li>
</ul>
<h4 id="heading-check">Check:</h4>
<pre><code class="language-xml">dig +short jenkins.example.com docker logs projects-traefik-staging 2&gt;&amp;1 | grep -i acme
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Let’s Encrypt uses HTTP-01 challenge, so it must reach your server via port 80. If DNS or networking is wrong, certificate issuance fails.</p>
<h3 id="heading-jenkins-reverse-proxy-setup-is-broken">Jenkins: "Reverse proxy setup is broken"</h3>
<h4 id="heading-fix">Fix:</h4>
<p>Set the Jenkins URL to <a href="https://jenkins.example.com/">https://jenkins.example.com/</a><br>Ensure header:</p>
<pre><code class="language-xml">X-Forwarded-Proto: https
</code></pre>
<p><strong>Why this works:</strong></p>
<p>Jenkins needs to know it's behind HTTPS. Without this, it generates incorrect URLs (http instead of https), breaking redirects and webhooks.</p>
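<p>A quick external sanity check, assuming the example hostname used throughout:</p>
<pre><code class="language-bash"># Expect an HTTPS response with no redirect loop back to http://
curl -sI https://jenkins.example.com/login | head -5
</code></pre>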
<h2 id="heading-13-mental-model-host-vs-container">13. Mental Model: Host vs. Container</h2>
<p>Many setup mistakes come from confusing the <strong>host</strong> filesystem with the <strong>container</strong> filesystem. This table makes it explicit:</p>
<table>
<thead>
<tr>
<th>Inside the Jenkins container</th>
<th>Comes from on the host</th>
</tr>
</thead>
<tbody><tr>
<td><code>/var/jenkins_home</code></td>
<td>docker volume <code>jenkins-data</code> (Jenkins config, jobs, secrets)</td>
</tr>
<tr>
<td><code>/projects/...</code></td>
<td><code>/home/developer/projects/...</code> (your project tree)</td>
</tr>
<tr>
<td><code>/usr/bin/docker</code></td>
<td>host's <code>/usr/bin/docker</code></td>
</tr>
<tr>
<td><code>/usr/libexec/docker/cli-plugins/docker-compose</code></td>
<td>host plugin (lets <code>docker compose</code> work)</td>
</tr>
<tr>
<td><code>/var/run/docker.sock</code></td>
<td>host Docker daemon (so builds happen on the host's engine)</td>
</tr>
<tr>
<td><code>/etc/localtime</code>, <code>/etc/timezone</code></td>
<td>host clock</td>
</tr>
<tr>
<td><code>~/.ssh</code></td>
<td><strong>nothing</strong> — that's why SSH-to-GitHub doesn't work without extra setup</td>
</tr>
</tbody></table>
<p>When debugging, always ask: <em>"Inside which filesystem is this command running, and does the file/folder it's looking for exist there?"</em></p>
<h2 id="heading-14-daily-operations-cheat-sheet">14. Daily Operations Cheat Sheet</h2>
<pre><code class="language-bash"># Recreate Jenkins after changing compose
cd /home/developer/projects/projects-prod-configs
docker compose -f docker-compose.staging.yml up -d --force-recreate jenkins

# Tail Jenkins logs
docker logs -f projects-jenkins-staging

# Open a shell inside the Jenkins container
docker exec -it projects-jenkins-staging bash

# From inside the container — sanity checks
docker compose version
ls /projects/projects-prod-configs/projects-backend
git -C /projects/projects-prod-configs/projects-backend remote -v

# Manually trigger the same deploy the pipeline does
cd /projects/projects-prod-configs/projects-backend
git pull origin staging
docker compose -f docker-compose.staging.yml up -d --build student-apigw

# Inspect Traefik routing decisions
docker logs projects-traefik-staging 2&gt;&amp;1 | grep -i jenkins

# Check renewed certs
docker exec projects-traefik-staging cat /etc/traefik/acme.json | head -50
</code></pre>
<h2 id="heading-15-what-id-do-differently-next-time">15. What I'd Do Differently Next Time</h2>
<ul>
<li><p><strong>Pre-build a base image</strong> with all node_modules baked in. With ~1500 packages × 15 services, every clean build re-downloads ~22k tarballs. A shared base image cuts that by roughly 90%.</p>
</li>
<li><p><strong>Run a private npm proxy</strong> (Verdaccio / Nexus / GitHub Packages) on the same Docker network — eliminates flaky <code>npmjs.org</code> timeouts entirely (a minimal sketch follows this list).</p>
</li>
<li><p><strong>Per-service Jenkinsfile</strong> if your services drift apart in tooling. With one Jenkinsfile, every team contends for the same pipeline definition.</p>
</li>
<li><p><strong>Replace</strong> <code>git diff HEAD~1 HEAD</code> with <code>git diff $(git merge-base HEAD origin/staging~1) HEAD</code> so squash-merges and force-pushes don't accidentally skip services.</p>
</li>
<li><p><strong>Move secrets to a vault</strong> (HashiCorp Vault / AWS Secrets Manager / Doppler). PATs in Jenkins work, but rotation across many jobs is painful.</p>
</li>
<li><p><strong>Use Jenkins' Configuration-as-Code (JCasC)</strong> so the entire Jenkins setup (jobs, credentials definitions, plugins) is in git. Then a server rebuild is a one-command operation.</p>
</li>
</ul>
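<p>For the private-registry idea above, a minimal sketch assuming Verdaccio (the service and volume names are illustrative):</p>
<pre><code class="language-yaml">services:
  npm-proxy:
    image: verdaccio/verdaccio:5
    ports:
      - "4873:4873"
    volumes:
      - verdaccio-storage:/verdaccio/storage

volumes:
  verdaccio-storage:
</code></pre>
<p>Builds on the same Docker network could then resolve packages via <code>pnpm config set registry http://npm-proxy:4873</code>.</p>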
<h2 id="heading-closing-thoughts">Closing Thoughts</h2>
<p>The pipeline itself is just three stages: <strong>Checkout → Detect Changes → Deploy</strong> — but a real production setup is mostly about <strong>plumbing</strong>: reverse proxy, certificates, bind-mounts, credentials, timezones, build caches. None of these are exotic. Together they decide whether your Friday-afternoon deploy goes silently green or eats your weekend.</p>
<p>Follow sections 1–11 to get a working pipeline. Bookmark section 12 to keep it working.</p>
<p>Happy shipping.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build Microservices-Based REST APIs for Healthcare Portals ]]>
                </title>
                <description>
                    <![CDATA[ Microservices architecture enables healthcare portals to scale, secure sensitive data, and evolve rapidly. Using ASP.NET 10 and C#, you can build independent REST APIs for services like patients, appo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-microservices-based-rest-apis-for-healthcare-portals/</link>
                <guid isPermaLink="false">69e2610cfd22b8ad6251e84b</guid>
                
                    <category>
                        <![CDATA[ REST APIs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Microservices ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ASP.NET 10 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Database per Service Pattern ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Service Communication ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containerization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Gopinath Karunanithi ]]>
                </dc:creator>
                <pubDate>Fri, 17 Apr 2026 16:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/d834b346-3fcf-442c-836c-94ed7ef8a17d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Microservices architecture enables healthcare portals to scale, secure sensitive data, and evolve rapidly.</p>
<p>Using ASP.NET 10 and C#, you can build independent REST APIs for services like patients, appointments, and authentication, each with its own database and deployment lifecycle.</p>
<p>Combined with API gateways, JWT-based security, observability, and containerization, this approach ensures reliable, maintainable, and production-ready healthcare systems.</p>
<p>In this tutorial, you’ll learn how to design and build a microservices-based healthcare portal using ASP.NET 10 and C#. We’ll cover how to structure services, implement REST APIs, secure endpoints, enable service communication, and deploy using modern containerization practices.</p>
<p>By the end, you’ll have a clear understanding of how to create scalable, secure, and production-ready healthcare systems.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-overview">Overview</a></p>
</li>
<li><p><a href="#heading-why-use-microservices-for-healthcare-portals">Why Use Microservices for Healthcare Portals?</a></p>
</li>
<li><p><a href="#heading-high-level-architecture">High-Level Architecture</a></p>
</li>
<li><p><a href="#heading-designing-rest-apis-for-healthcare-services">Designing REST APIs for Healthcare Services</a></p>
</li>
<li><p><a href="#heading-how-to-build-a-microservice-with-aspnet-10">How to Build a Microservice with ASP.NET 10</a></p>
</li>
<li><p><a href="#heading-database-per-service-pattern">Database per Service Pattern</a></p>
</li>
<li><p><a href="#heading-service-communication">Service Communication</a></p>
</li>
<li><p><a href="#heading-api-gateway-implementation">API Gateway Implementation</a></p>
</li>
<li><p><a href="#heading-implementing-security-in-healthcare-apis">Implementing Security in Healthcare APIs</a></p>
</li>
<li><p><a href="#heading-observability-and-logging">Observability and Logging</a></p>
</li>
<li><p><a href="#heading-containerization-with-docker">Containerization with Docker</a></p>
</li>
<li><p><a href="#heading-deployment-strategies">Deployment Strategies</a></p>
</li>
<li><p><a href="#heading-best-practices-with-examples">Best Practices (With Examples)</a></p>
</li>
<li><p><a href="#heading-when-not-to-use-microservices">When NOT to Use Microservices</a></p>
</li>
<li><p><a href="#heading-future-enhancements">Future Enhancements</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before getting started, you should be familiar with:</p>
<ul>
<li><p>C# and ASP.NET Core fundamentals</p>
</li>
<li><p>REST API concepts (HTTP methods, routing, status codes)</p>
</li>
<li><p>Basic understanding of microservices architecture</p>
</li>
</ul>
<p>Tools required:</p>
<ul>
<li><p>.NET 10 SDK</p>
</li>
<li><p>Visual Studio or VS Code</p>
</li>
<li><p>Postman or Swagger</p>
</li>
<li><p>Docker (optional but recommended)</p>
</li>
</ul>
<h2 id="heading-overview">Overview</h2>
<p>Healthcare portals power critical workflows such as patient registration, appointment scheduling, electronic health records (EHR), billing, and telemedicine. These systems must handle sensitive data, high availability requirements, and frequent updates.</p>
<p>Traditionally, many healthcare applications were built as monolithic systems. While simple to start with, monoliths quickly become difficult to scale, maintain, and secure. A single failure can impact the entire system, and even small changes require redeploying the entire application.</p>
<p>Microservices architecture addresses these challenges by breaking the application into smaller, independent services. Each service is responsible for a specific domain, such as patient management or appointment scheduling, and can be developed, deployed, and scaled independently.</p>
<p>In this article, you'll learn how to design and implement a microservices-based healthcare REST API using ASP.NET 10 and C#. We'll walk through architecture design, service implementation, communication patterns, security, observability, and deployment strategies.</p>
<h2 id="heading-why-use-microservices-for-healthcare-portals">Why Use Microservices for Healthcare Portals?</h2>
<p>Healthcare systems are inherently complex. They involve multiple domains such as patient records, appointments, billing, authentication and authorization. A microservices approach allows each of these domains to be handled independently. There are many benefits to this approach such as:</p>
<ul>
<li><p><strong>Scalability</strong>: Scale only the services under heavy load (for example, appointments during peak hours)</p>
</li>
<li><p><strong>Fault isolation</strong>: Failure in one service does not crash the entire system</p>
</li>
<li><p><strong>Faster deployment</strong>: Teams can deploy updates independently</p>
</li>
<li><p><strong>Improved security</strong>: Sensitive services can have stricter access controls</p>
</li>
</ul>
<p>For example, a patient service can handle personal data, while a billing service manages transactions, each with different security policies.</p>
<h2 id="heading-high-level-architecture"><strong>High-Level Architecture</strong></h2>
<p>A typical healthcare microservices architecture includes an API Gateway (the central entry point), domain microservices (Patient, Appointment, Auth), a database per service, and a service communication layer.</p>
<p>The request flow is straightforward: the client sends a request, the API Gateway routes it, the target microservice processes it, and a response is returned. This separation ensures modularity and maintainability.</p>
<h2 id="heading-designing-rest-apis-for-healthcare-services">Designing REST APIs for Healthcare Services</h2>
<p>Designing REST APIs in a microservices architecture requires clear, consistent naming conventions so that endpoints are intuitive, predictable, and easy to consume by clients and other services.</p>
<h3 id="heading-naming-conventions">Naming Conventions</h3>
<p>REST APIs are resource-oriented, meaning URLs should represent entities (nouns), not actions (verbs). Each resource corresponds to a domain object in your system, such as patients, appointments, or billing records.</p>
<p><strong>Key principles:</strong></p>
<ul>
<li><p>Use plural nouns for resources (for example, <code>/patients</code>, <code>/appointments</code>)</p>
</li>
<li><p>Avoid verbs in URLs (don't use <code>/getPatients</code>)</p>
</li>
<li><p>Use hierarchical structure for relationships (for example, <code>/patients/{id}/appointments</code>)</p>
</li>
<li><p>Keep naming consistent across all services</p>
</li>
</ul>
<p>These conventions improve API readability, developer experience, and maintainability across teams.</p>
<h4 id="heading-example-patient-api-endpoints">Example: Patient API Endpoints</h4>
<p>The following endpoints represent standard CRUD (Create, Read, Update, Delete) operations for managing patients:</p>
<pre><code class="language-plaintext">GET    /api/patients        // Retrieve all patients
GET    /api/patients/{id}   // Retrieve a specific patient
POST   /api/patients        // Create a new patient
PUT    /api/patients/{id}   // Update an existing patient
DELETE /api/patients/{id}   // Delete a patient
</code></pre>
<p>Each HTTP method defines the type of operation being performed:</p>
<ul>
<li><p>GET: Fetch data (read-only)</p>
</li>
<li><p>POST: Create new resources</p>
</li>
<li><p>PUT: Update existing resources</p>
</li>
<li><p>DELETE: Remove resources</p>
</li>
</ul>
<p>These operations follow REST standards, ensuring consistency across services and making APIs easier to integrate with frontend apps, mobile clients, or third-party healthcare systems.</p>
<h3 id="heading-best-practices-for-designing-healthcare-rest-apis">Best Practices for Designing Healthcare REST APIs</h3>
<p>Designing REST APIs for healthcare systems requires more than standard conventions. It demands careful consideration of performance, data sensitivity, and interoperability.</p>
<h4 id="heading-1-use-proper-http-methods">1. Use proper HTTP methods</h4>
<p>Ensure each endpoint uses the correct HTTP verb (GET, POST, PUT, DELETE) to clearly communicate its purpose. This improves API predictability and aligns with REST standards used across healthcare platforms.</p>
<h4 id="heading-2-return-meaningful-status-codes">2. Return meaningful status codes</h4>
<p>Use appropriate HTTP status codes to indicate the result of a request. For example:</p>
<ul>
<li><p>200 OK for successful retrieval</p>
</li>
<li><p>201 Created for successful resource creation</p>
</li>
<li><p>400 Bad Request for validation errors</p>
</li>
<li><p>404 Not Found when a resource doesn’t exist</p>
</li>
</ul>
<p>Clear status codes help clients handle responses correctly.</p>
<h4 id="heading-3-implement-pagination-for-large-datasets">3. Implement pagination for large datasets</h4>
<p>Healthcare systems often deal with large volumes of data (for example, patient records, appointment logs). Use pagination to limit response size:</p>
<p><code>GET /api/patients?page=1&amp;pageSize=20</code></p>
<p>This improves performance and reduces server load.</p>
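<p>A minimal controller sketch of that query (assuming an in-memory <code>patients</code> list for brevity, like the one built later in this article):</p>
<pre><code class="language-csharp">[HttpGet]
public IActionResult GetPatients(int page = 1, int pageSize = 20)
{
    // Clamp inputs so a bad query string can't request everything at once
    page = Math.Max(page, 1);
    pageSize = Math.Min(Math.Max(pageSize, 1), 100);

    var items = patients
        .Skip((page - 1) * pageSize)
        .Take(pageSize)
        .ToList();

    return Ok(items);
}
</code></pre>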
<h4 id="heading-4-use-api-versioning">4. Use API versioning</h4>
<p>Version your APIs to avoid breaking existing clients when making changes:</p>
<p><code>/api/v1/patients</code></p>
<p>This is especially important in healthcare, where integrations with external systems must remain stable over time.</p>
<h4 id="heading-5-validate-and-sanitize-input-data">5. Validate and sanitize input data</h4>
<p>Always validate incoming data to prevent errors and ensure data integrity. For example, enforce required fields like patient name, date of birth, and contact details.</p>
<h4 id="heading-6-protect-sensitive-data">6. Protect sensitive data</h4>
<p>Avoid exposing sensitive patient information unnecessarily. Use filtering, masking, or field-level access control where needed to comply with healthcare data regulations.</p>
<h4 id="heading-7-ensure-consistent-response-structure">7. Ensure consistent response structure</h4>
<p>Return responses in a standard format (for example, including data, status, and message fields). This makes APIs easier to consume and debug across multiple services.</p>
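<p>One way to do this, sketched below, is a small generic envelope; the field names are illustrative, not a standard:</p>
<pre><code class="language-csharp">// A shared wrapper so every endpoint returns the same shape
public class ApiResponse&lt;T&gt;
{
    public string Status { get; set; } = "success";
    public string? Message { get; set; }
    public T? Data { get; set; }
}

// Usage in a controller action:
// return Ok(new ApiResponse&lt;List&lt;Patient&gt;&gt; { Data = patients });
</code></pre>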
<h2 id="heading-how-to-build-a-microservice-with-aspnet-10">How to Build a Microservice with ASP.NET 10</h2>
<p>Let’s implement a simple Patient Service.</p>
<h3 id="heading-step-1-create-project">Step 1: Create Project</h3>
<p>In this step, we'll create a new <a href="http://ASP.NET">ASP.NET</a> Web API project that will serve as our Patient microservice. This project provides the foundation for defining endpoints, handling HTTP requests, and structuring our service independently from other parts of the system.</p>
<pre><code class="language-shell">dotnet new webapi -n PatientService
cd PatientService
</code></pre>
<h3 id="heading-step-2-define-model">Step 2: Define Model</h3>
<p>Next, we'll define a simple data model representing a patient. Models define the structure of the data your API will send and receive, and they typically map to database entities in real-world applications.</p>
<pre><code class="language-csharp">public class Patient
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}
</code></pre>
<h3 id="heading-step-3-create-controller">Step 3: Create Controller</h3>
<p>Here, we're creating a controller to handle incoming HTTP requests. Controllers define API endpoints and contain the logic for processing requests, interacting with data, and returning responses to clients.</p>
<pre><code class="language-csharp">[ApiController]
[Route("api/patients")]
public class PatientController : ControllerBase
{
    private static List&lt;Patient&gt; patients = new();

    [HttpGet]
    public IActionResult GetPatients()
    {
        return Ok(patients);
    }

    [HttpPost]
    public IActionResult AddPatient(Patient patient)
    {
        patients.Add(patient);
        return CreatedAtAction(nameof(GetPatients), patient);
    }
}
</code></pre>
<h2 id="heading-database-per-service-pattern">Database per Service Pattern</h2>
<p>Each microservice should manage its own database to ensure loose coupling and independent operation. This allows services to evolve, scale, and be deployed without affecting others. It also improves data isolation and aligns with the core principles of microservices architecture.</p>
<p>Here's an example with Entity Framework Core:</p>
<pre><code class="language-csharp">public class PatientDbContext : DbContext
{
    public PatientDbContext(DbContextOptions&lt;PatientDbContext&gt; options)
        : base(options) { }

    public DbSet&lt;Patient&gt; Patients { get; set; }
}
</code></pre>
<p>This matters because it avoids cross-service dependencies, enables independent scaling, and improves data security, making microservices more efficient and secure.</p>
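<p>A sketch of registering the context in <code>Program.cs</code> (the SQL Server provider and the <code>PatientDb</code> connection-string name are assumptions; use whichever database fits your service):</p>
<pre><code class="language-csharp">// Requires the Microsoft.EntityFrameworkCore.SqlServer package
builder.Services.AddDbContext&lt;PatientDbContext&gt;(options =&gt;
    options.UseSqlServer(
        builder.Configuration.GetConnectionString("PatientDb")));
</code></pre>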
<h2 id="heading-service-communication">Service Communication</h2>
<p>Microservices communicate with each other to share data and coordinate workflows across the system. This communication can be handled through synchronous requests or asynchronous messaging, depending on the use case.</p>
<p>Choosing the right approach helps ensure scalability, reliability, and responsiveness in distributed systems.</p>
<h3 id="heading-1-synchronous-communication-http">1. Synchronous Communication (HTTP)</h3>
<pre><code class="language-csharp">var response = await httpClient.GetAsync("http://appointment-service/api/appointments");
</code></pre>
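<p>In practice you'd also check the status code and deserialize the body. A minimal sketch (the <code>Appointment</code> type and the service URL are assumptions):</p>
<pre><code class="language-csharp">using System.Net.Http.Json;

var response = await httpClient.GetAsync("http://appointment-service/api/appointments");
response.EnsureSuccessStatusCode();

// ReadFromJsonAsync comes from the System.Net.Http.Json extensions
var appointments = await response.Content
    .ReadFromJsonAsync&lt;List&lt;Appointment&gt;&gt;();
</code></pre>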
<h3 id="heading-2-asynchronous-communication-messaging">2. Asynchronous Communication (Messaging)</h3>
<p>Using message brokers like RabbitMQ:</p>
<ul>
<li><p>Services publish events</p>
</li>
<li><p>Other services consume them</p>
</li>
</ul>
<p><strong>Example:</strong></p>
<p>When a patient registers, the patient service publishes a "patient registered" event, and the appointment service consumes it to start its own workflow.</p>
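<p>A rough sketch of the publishing side with the <code>RabbitMQ.Client</code> package (v6-style API; the host, queue name, and payload are illustrative assumptions):</p>
<pre><code class="language-csharp">using System.Text;
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "rabbitmq" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Declare the queue so publishing works even if no consumer is up yet
channel.QueueDeclare(queue: "patient-registered", durable: true,
    exclusive: false, autoDelete: false);

var body = Encoding.UTF8.GetBytes("{\"patientId\":42}");
channel.BasicPublish(exchange: "", routingKey: "patient-registered",
    basicProperties: null, body: body);
</code></pre>
<p>On the consuming side, the appointment service would subscribe to the same queue and react to each message asynchronously.</p>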
<h2 id="heading-api-gateway-implementation"><strong>API Gateway Implementation</strong></h2>
<p>An API Gateway acts as the central entry point for all client requests in a microservices architecture. It handles routing, authentication, and request aggregation, simplifying how clients interact with multiple services. This layer helps improve security, scalability, and overall system management.</p>
<p>Here's an example (Ocelot configuration):</p>
<pre><code class="language-json">{
  "Routes": [
    {
      "DownstreamPathTemplate": "/api/patients",
      "UpstreamPathTemplate": "/patients",
      "DownstreamHostAndPorts": [
        { "Host": "localhost", "Port": 5001 }
      ]
    }
  ]
}
</code></pre>
<p>Benefits include centralized routing, authentication handling, and rate limiting.</p>
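<p>For context, here's roughly how a minimal Ocelot gateway host is wired up in <code>Program.cs</code> (assuming the configuration above is saved as <code>ocelot.json</code>):</p>
<pre><code class="language-csharp">using Ocelot.DependencyInjection;
using Ocelot.Middleware;

var builder = WebApplication.CreateBuilder(args);

// Load the route table and register Ocelot's services
builder.Configuration.AddJsonFile("ocelot.json", optional: false, reloadOnChange: true);
builder.Services.AddOcelot(builder.Configuration);

var app = builder.Build();
await app.UseOcelot();
app.Run();
</code></pre>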
<h2 id="heading-implementing-security-in-healthcare-apis">Implementing Security in Healthcare APIs</h2>
<p>Security is critical in healthcare systems due to the sensitive nature of patient data. APIs must enforce strong authentication, authorization, and data protection mechanisms. Proper security ensures compliance, prevents unauthorized access, and safeguards user trust.</p>
<h3 id="heading-1-jwt-authentication">1. JWT Authentication</h3>
<pre><code class="language-csharp">builder.Services.AddAuthentication("Bearer")
    .AddJwtBearer(options =&gt;
    {
        options.Authority = "https://auth-server";
        options.Audience = "healthcare-api";
    });
</code></pre>
<p>JWT (JSON Web Token) authentication is used to verify the identity of users accessing the API.</p>
<p>The authentication scheme ("Bearer") tells the API to expect a token in the Authorization header: <code>Authorization: Bearer &lt;token&gt;</code></p>
<p>Authority represents the trusted authentication server (identity provider) that issues tokens.</p>
<p>And audience ensures that the token is intended specifically for this API.</p>
<p>When a request is made, the API:</p>
<ol>
<li><p>Extracts the JWT from the request header</p>
</li>
<li><p>Validates its signature using the authority</p>
</li>
<li><p>Checks claims like expiration and audience</p>
</li>
<li><p>Grants access only if the token is valid</p>
</li>
</ol>
<p>This ensures that only authenticated users can access healthcare services.</p>
<h3 id="heading-2-role-based-authorization">2. Role-Based Authorization</h3>
<pre><code class="language-csharp">[Authorize(Roles = "Doctor")]
public IActionResult GetSensitiveData()
{
    return Ok();
}
</code></pre>
<p>Role-based authorization restricts access based on user roles.</p>
<ul>
<li><p>The <code>[Authorize]</code> attribute enforces that only authenticated users can access the endpoint.</p>
</li>
<li><p>The <code>Roles = "Doctor"</code> condition ensures that only users with the Doctor role can access this resource.</p>
</li>
</ul>
<p>When a user sends a request:</p>
<ol>
<li><p>Their JWT token is validated</p>
</li>
<li><p>The system checks the role claim inside the token</p>
</li>
<li><p>Access is granted only if the required role matches</p>
</li>
</ol>
<p>This is critical in healthcare systems where doctors access medical records, admins manage system data, and patients access only their own information.</p>
<h3 id="heading-3-secure-secrets-management">3. Secure Secrets Management</h3>
<pre><code class="language-csharp">var connectionString = Environment.GetEnvironmentVariable("DB_CONNECTION");
</code></pre>
<p>Sensitive configuration data such as database connection strings should never be hardcoded in the application.</p>
<p><code>Environment.GetEnvironmentVariable()</code> retrieves secrets securely from the environment. These values are typically stored in:</p>
<ul>
<li><p>Environment variables</p>
</li>
<li><p>Secret managers (Azure Key Vault, AWS Secrets Manager)</p>
</li>
<li><p>Container orchestration platforms</p>
</li>
</ul>
<p>Benefits:</p>
<ul>
<li><p>Prevents exposure of credentials in source code</p>
</li>
<li><p>Supports secure deployments across environments</p>
</li>
<li><p>Simplifies secret rotation without code changes</p>
</li>
</ul>
<h3 id="heading-4-enforce-https">4. Enforce HTTPS</h3>
<pre><code class="language-csharp">app.UseHttpsRedirection();
</code></pre>
<p>HTTPS ensures that all communication between the client and server is encrypted.</p>
<p><code>UseHttpsRedirection()</code> automatically redirects HTTP requests to HTTPS. This protects sensitive healthcare data (such as patient records and credentials) from Man-in-the-Middle attacks, data interception, and unauthorized access.</p>
<p>In healthcare systems, encryption is essential for compliance with data protection standards and regulations.</p>
<p>Together, these security mechanisms provide multiple layers of protection:</p>
<ul>
<li><p>Authentication verifies identity</p>
</li>
<li><p>Authorization controls access</p>
</li>
<li><p>Secrets management protects credentials</p>
</li>
<li><p>HTTPS secures data in transit</p>
</li>
</ul>
<p>This layered approach is essential for safeguarding sensitive healthcare data and ensuring compliance with industry standards.</p>
<h2 id="heading-observability-and-logging"><strong>Observability and Logging</strong></h2>
<p>Observability enables you to monitor system health, diagnose issues, and understand how services interact in real time. By implementing logging, metrics, and tracing, teams can quickly identify failures and performance bottlenecks. This is essential for maintaining reliability in distributed systems.</p>
<p>Here's a basic logging example:</p>
<pre><code class="language-csharp">_logger.LogInformation("Fetching patients");
</code></pre>
<p>This line writes an informational log entry whenever patient data is being retrieved. The <code>_logger</code> instance is part of ASP.NET’s built-in logging framework and is typically injected into the class through dependency injection.</p>
<p>Logging at this level helps developers trace normal application behavior and understand when specific operations occur, which is especially useful during debugging and monitoring in production environments.</p>
<h3 id="heading-application-insights-integration">Application Insights Integration</h3>
<pre><code class="language-csharp">builder.Services.AddApplicationInsightsTelemetry();
</code></pre>
<p>This configuration enables integration with Application Insights, a cloud-based monitoring service. By adding this line, the application automatically collects telemetry data such as request rates, response times, failure rates, and dependency calls. This allows teams to monitor the health of the application in real time and quickly identify performance bottlenecks or failures across distributed microservices.</p>
<h3 id="heading-custom-metrics">Custom Metrics</h3>
<pre><code class="language-csharp">var telemetryClient = new TelemetryClient();
telemetryClient.TrackMetric("PatientsFetched", 1);
</code></pre>
<p>Here, a TelemetryClient instance is used to send custom metrics to the monitoring system. The TrackMetric method records a numerical value –&nbsp;in this case, tracking how many times patients are fetched.</p>
<p>Custom metrics like this help measure business-specific operations and provide deeper insight into how the system is being used beyond standard performance metrics.</p>
<h3 id="heading-health-checks">Health Checks</h3>
<pre><code class="language-csharp">app.MapHealthChecks("/health");
</code></pre>
<p>This line exposes a health check endpoint at <code>/health</code> that external systems can use to verify whether the service is running correctly. When this endpoint is called, it returns the status of the application and any configured dependencies, such as databases or external services.</p>
<p>Health checks are commonly used by load balancers, container orchestrators, and monitoring tools to automatically detect failures and restart or reroute traffic if needed.</p>
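<p>The endpoint also needs to be registered at startup. A minimal sketch, with an optional EF Core database probe (the probe requires the <code>Microsoft.Extensions.Diagnostics.HealthChecks.EntityFrameworkCore</code> package):</p>
<pre><code class="language-csharp">// Program.cs
builder.Services.AddHealthChecks()
    .AddDbContextCheck&lt;PatientDbContext&gt;(); // verifies the database is reachable

var app = builder.Build();
app.MapHealthChecks("/health");
</code></pre>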
<p>Together, logging, telemetry, custom metrics, and health checks provide a complete observability strategy. They allow teams to understand system behavior, detect issues early, and maintain reliability across distributed healthcare services where uptime and performance are critical.</p>
<h2 id="heading-containerization-with-docker">Containerization with Docker</h2>
<p>Containerization allows microservices to run in isolated and consistent environments across development and production. Using Docker, you can package applications with all dependencies, ensuring portability and easier deployment. This approach simplifies scaling and infrastructure management.</p>
<p>The following Dockerfile shows a minimal setup for packaging the Patient Service into a container image:</p>
<pre><code class="language-dockerfile">FROM mcr.microsoft.com/dotnet/aspnet:10.0
WORKDIR /app
COPY . .
ENTRYPOINT ["dotnet", "PatientService.dll"]
</code></pre>
<p>This Dockerfile defines how the Patient Service is packaged into a container image so it can run consistently across different environments.</p>
<p>The <strong>FROM</strong> instruction specifies the base image, which in this case is the official ASP.NET runtime image for .NET 10. This image includes all the necessary runtime components required to execute the application, so you don’t need to install .NET separately inside the container.</p>
<p>The <strong>WORKDIR /app</strong> line sets the working directory inside the container. All subsequent commands will run relative to this directory, helping organize application files in a predictable structure.</p>
<p>The <strong>COPY . .</strong> instruction copies all files from the current project directory on your machine into the container’s working directory. This includes the compiled application binaries and any required resources.</p>
<p>Finally, the <strong>ENTRYPOINT</strong> defines the command that runs when the container starts. In this case, it launches the PatientService application using the .NET runtime.</p>
<p>Together, these steps package the microservice into a portable unit that can be deployed consistently across development, staging, and production environments. This ensures that the application behaves the same regardless of where it is deployed, which is a key advantage of containerization in microservices architectures.</p>
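<p>One common refinement is a multi-stage build, sketched below, which compiles inside the container and ships only the published output (the image tags and project file name are assumptions):</p>
<pre><code class="language-dockerfile"># Build stage: compile and publish with the full SDK
FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish PatientService.csproj -c Release -o /app/publish

# Runtime stage: only the ASP.NET runtime and the published app
FROM mcr.microsoft.com/dotnet/aspnet:10.0
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "PatientService.dll"]
</code></pre>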
<h2 id="heading-deployment-strategies"><strong>Deployment Strategies</strong></h2>
<p>Deploying microservices requires strategies that minimize downtime and reduce risk during updates.</p>
<p>Techniques like rolling updates, canary releases, and blue-green deployments help ensure smooth transitions. These approaches improve system stability and user experience during releases.</p>
<h3 id="heading-key-strategies">Key Strategies</h3>
<p>Deploying microservices requires strategies that minimize downtime, reduce risk, and ensure system stability –&nbsp;especially in healthcare systems where availability and data integrity are critical.</p>
<h4 id="heading-1-rolling-updates">1. Rolling Updates</h4>
<p>Rolling updates deploy changes gradually by updating instances of a service one at a time instead of all at once. As new versions are deployed, old instances are terminated in phases, ensuring that the system remains available throughout the process.</p>
<p>This approach works well for stateless services and is commonly used in container orchestration platforms. It allows continuous availability while still enabling safe deployment of new features.</p>
<p>Rolling updates are best used when:</p>
<ul>
<li><p>You want zero downtime deployments</p>
</li>
<li><p>Backward compatibility between versions is maintained</p>
</li>
<li><p>Changes are relatively low risk</p>
</li>
</ul>
<h4 id="heading-2-canary-deployments">2. Canary Deployments</h4>
<p>Canary deployments release a new version of a service to a small subset of users before rolling it out to everyone. This allows teams to monitor the behavior of the new version in a real-world environment with limited exposure.</p>
<p>If issues are detected, the deployment can be rolled back quickly without affecting the majority of users.</p>
<p>Canary deployments are ideal when:</p>
<ul>
<li><p>Releasing high-risk or complex features</p>
</li>
<li><p>Testing performance under real traffic</p>
</li>
<li><p>Gradually validating new functionality</p>
</li>
</ul>
<h4 id="heading-3-blue-green-deployments">3. Blue-Green Deployments</h4>
<p>Blue-green deployment involves maintaining two identical environments: one running the current version (blue) and one running the new version (green). Traffic is switched from blue to green once the new version is fully tested and ready.</p>
<p>If something goes wrong, traffic can be immediately switched back to the previous version.</p>
<p>This strategy is particularly useful when:</p>
<ul>
<li><p>You need instant rollback capability</p>
</li>
<li><p>System stability is critical</p>
</li>
<li><p>Downtime must be completely avoided</p>
</li>
</ul>
<h3 id="heading-choosing-the-right-strategy-for-healthcare-microservices">Choosing the Right Strategy for Healthcare Microservices</h3>
<p>In a healthcare portal, where reliability and patient data integrity are essential, blue-green deployments are often the safest choice. They allow full validation of the new version before exposing it to users and provide immediate rollback in case of failure.</p>
<p>But rolling updates are also commonly used for routine updates where backward compatibility is ensured, while canary deployments are useful when introducing new features like AI diagnostics or analytics modules.</p>
<h4 id="heading-example-blue-green-deployment-with-containers">Example: Blue-Green Deployment with Containers</h4>
<p>Let’s walk through a simple conceptual example using containers.</p>
<p>Assume you have two environments:</p>
<ul>
<li><p>Blue (current version) running PatientService v1</p>
</li>
<li><p>Green (new version) running PatientService v2</p>
</li>
</ul>
<p>First, you deploy the new version (v2) alongside the existing one without affecting users.</p>
<p>Then you run tests and verify that the new version behaves correctly.</p>
<p>After that, you update the load balancer or API gateway to route traffic from blue to green. Then you monitor the system for errors or performance issues.</p>
<p>If everything is stable, you keep green as the active environment. If not, switch traffic back to blue instantly.</p>
<p>In a real-world setup, this traffic switching is typically handled by:</p>
<ul>
<li><p>API Gateways</p>
</li>
<li><p>Load balancers</p>
</li>
<li><p>Kubernetes services</p>
</li>
</ul>
<p>This approach ensures that users experience no downtime while giving teams full control over deployment risk.</p>
<p>In practice, many production systems combine these strategies –&nbsp;for example, starting with a canary release and then completing deployment with a rolling update – to balance risk and efficiency.</p>
<h2 id="heading-best-practices-with-examples">Best Practices (With Examples)</h2>
<p>Designing reliable microservices for healthcare systems requires applying proven patterns that improve stability, maintainability, and resilience. Below are some key best practices with practical examples.</p>
<h3 id="heading-1-use-api-versioning">1. Use API Versioning</h3>
<p>API versioning ensures backward compatibility when your service evolves. In healthcare systems, where integrations with external systems (labs, insurance, EHR) are common, breaking changes can cause serious issues.</p>
<p>Here's an example:</p>
<pre><code class="language-csharp">[Route("api/v1/patients")]
</code></pre>
<p>This route attribute defines the base URL for the API and explicitly includes a version identifier (v1). By embedding the version in the route, the service can support multiple versions of the same API simultaneously. This allows existing clients to continue using older versions while newer versions are introduced without breaking compatibility.</p>
<p>You can later introduce a new version:</p>
<pre><code class="language-csharp">[Route("api/v2/patients")]
</code></pre>
<p>This represents a newer version of the same API with potentially updated functionality or structure. By separating versions at the routing level, developers can evolve the API safely while giving clients time to migrate.</p>
<p>This approach is especially important in healthcare systems where external integrations must remain stable over long periods.</p>
<p>This allows safe rollout of new features, support for legacy clients, and gradual migration between versions.</p>
<h3 id="heading-2-implement-retry-policies">2. Implement Retry Policies</h3>
<p>Network calls between microservices can fail due to transient issues such as timeouts or temporary service unavailability. Retry policies help automatically recover from such failures.</p>
<p>Here's an example (using Polly):</p>
<pre><code class="language-csharp">services.AddHttpClient("api")
    .AddTransientHttpErrorPolicy(p =&gt; p.RetryAsync(3));
</code></pre>
<p>This code configures an HTTP client with a retry policy using <a href="https://www.pollydocs.org/">Polly</a>, a .NET resilience and transient-fault-handling library. Polly allows developers to define policies such as retries, circuit breakers, and timeouts for handling unreliable network calls.</p>
<p>The <code>AddTransientHttpErrorPolicy</code> method applies a retry strategy for temporary failures such as network timeouts or server errors. The <code>RetryAsync(3)</code> configuration means that if a request fails due to a transient issue, it will automatically be retried up to three times before returning an error.</p>
<p>This improves system reliability by handling temporary issues without requiring manual intervention.</p>
<p>You can also add exponential backoff:</p>
<pre><code class="language-csharp">.AddTransientHttpErrorPolicy(p =&gt;
    p.WaitAndRetryAsync(3, retryAttempt =&gt;
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))));
</code></pre>
<p>This configuration enhances the retry mechanism by introducing exponential backoff. Instead of retrying immediately, the system waits progressively longer between each retry attempt.</p>
<p>Exponential backoff means:</p>
<ul>
<li><p>The first retry waits for 2¹ seconds</p>
</li>
<li><p>The second retry waits for 2² seconds</p>
</li>
<li><p>The third retry waits for 2³ seconds</p>
</li>
</ul>
<p>This approach reduces pressure on failing services and avoids overwhelming them with repeated requests. It's particularly useful in distributed systems where temporary failures are common and services need time to recover.</p>
<p>Together, retries and exponential backoff improve reliability, smooth over transient failures, and remove the need for manual retries.</p>
<h3 id="heading-3-enforce-input-validation">3. Enforce Input Validation</h3>
<p>Validating incoming data is critical, especially in healthcare systems where incorrect data can lead to serious consequences.</p>
<p>Here's an example:</p>
<pre><code class="language-csharp">if (string.IsNullOrEmpty(patient.Name))
    return BadRequest("Name is required");
</code></pre>
<p>This is a simple manual validation check that ensures the Name field is provided before processing the request. If the value is missing or empty, the API immediately returns a <code>BadRequest</code> response, preventing invalid data from entering the system.</p>
<p>A better approach is using data annotations:</p>
<pre><code class="language-csharp">public class Patient
{
    public int Id { get; set; }

    [Required]
    public string Name { get; set; }
}
</code></pre>
<p>This example uses data annotations to enforce validation rules at the model level. The [Required] attribute ensures that the Name property must be provided when a request is made. ASP.NET automatically validates the model during request processing and returns an error response if validation fails.</p>
<p>This approach is more scalable and maintainable than manual checks, especially in larger applications.</p>
<p>This ensures clean and valid data, reduced runtime errors, and better API usability.</p>
<h3 id="heading-4-use-circuit-breaker-pattern">4. Use Circuit Breaker Pattern</h3>
<p>The circuit breaker pattern prevents cascading failures when a dependent service is down or slow.</p>
<p>For example, if the Appointment Service is unavailable, repeated calls from the Patient Service can overload the system. A circuit breaker stops these calls temporarily.</p>
<p>Here's an example (again using Polly):</p>
<pre><code class="language-csharp">services.AddHttpClient("api")
    .AddTransientHttpErrorPolicy(p =&gt;
        p.CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)));
</code></pre>
<p>This means:</p>
<ul>
<li><p>After 5 consecutive failures, the circuit opens</p>
</li>
<li><p>No further requests are sent for 30 seconds</p>
</li>
<li><p>System gets time to recover</p>
</li>
</ul>
<p>After the break period, the circuit enters a half-open state in which a limited number of requests are allowed through to test whether the service has recovered. If they succeed, normal operation resumes; otherwise, the circuit opens again.</p>
<p>This helps in protecting system stability, preventing resource exhaustion, and improving overall resilience: the circuit breaker stops cascading failures and avoids piling load onto a service that is already struggling.</p>
<p>Together, these practices make your microservices backward-compatible (versioning), resilient (retries and circuit breakers), and reliable (validation). Small design decisions like these significantly improve reliability and maintainability. In healthcare systems, where uptime and data integrity are critical and failures can have serious consequences, applying these patterns is essential.</p>
<h2 id="heading-when-not-to-use-microservices">When NOT to Use Microservices</h2>
<p>Microservices are powerful, but they're not a universal solution. In many cases, adopting microservices too early can introduce unnecessary complexity instead of solving real problems.</p>
<p>Before choosing this architecture, it’s important to understand when a simpler approach—such as a monolith—is more appropriate.</p>
<h3 id="heading-1-when-the-application-is-small">1. When the Application Is Small</h3>
<p>If your application has limited functionality (for example, a basic patient registration system or internal tool), splitting it into multiple services adds unnecessary overhead.</p>
<p>A monolithic architecture allows you to develop faster with less setup, debug issues more easily, and avoid managing multiple deployments.</p>
<p><strong>Example:</strong> A simple clinic portal with only patient registration and appointment booking doesn't require separate services for each feature.</p>
<h3 id="heading-2-when-the-team-size-is-limited">2. When the Team Size Is Limited</h3>
<p>When the team size is limited, microservices can become challenging. Managing multiple codebases, handling service communication, and dealing with deployments and monitoring can slow down development, making it tough for small teams to handle the complexity.</p>
<p><strong>Example:</strong> A team of 2–3 developers may spend more time managing infrastructure than building features if microservices are used prematurely.</p>
<h3 id="heading-3-when-deployment-complexity-outweighs-benefits">3. When Deployment Complexity Outweighs Benefits</h3>
<p>Microservices introduce operational complexity, including API gateways, service discovery, container orchestration (for example, Kubernetes), and monitoring and logging across services.</p>
<p>If your application doesn't require independent scaling or frequent deployments, this complexity may not be justified.</p>
<p><strong>Example:</strong> If all components of your system scale together and are updated at the same time, a monolith is often more efficient.</p>
<h3 id="heading-4-when-domain-boundaries-arent-clear">4. When Domain Boundaries Aren't Clear</h3>
<p>Microservices rely on well-defined service boundaries. If your domain isn't clearly understood, splitting into services too early can lead to tight coupling between services, frequent cross-service changes, and poorly designed APIs.</p>
<p>In such cases, starting with a monolith and refactoring later is a better approach.</p>
<h3 id="heading-5-when-you-lack-devops-and-observability-maturity">5. When You Lack DevOps and Observability Maturity</h3>
<p>Microservices require strong DevOps practices, including CI/CD pipelines, centralized logging, distributed tracing and monitoring &amp; alerting. Without these, debugging issues becomes extremely difficult.</p>
<h2 id="heading-future-enhancements"><strong>Future Enhancements</strong></h2>
<p>Healthcare systems are evolving rapidly, and microservices architectures can adapt to support new capabilities. Future improvements may include:</p>
<h3 id="heading-1event-driven-architecture">1.Event-Driven Architecture</h3>
<p>Adopting an event-driven approach allows services to communicate asynchronously through events rather than direct requests. This improves scalability, responsiveness, and fault tolerance, making it easier to handle high volumes of patient data and real-time updates across multiple services.</p>
<h3 id="heading-2-ai-powered-diagnostics">2. AI-Powered Diagnostics</h3>
<p>Integrating AI and machine learning can enhance diagnostic capabilities by analyzing patient data, detecting patterns, and providing predictive insights. This can improve clinical decision-making and streamline workflows within the healthcare portal.</p>
<h3 id="heading-3integration-with-fhir-standards">3.Integration with FHIR Standards</h3>
<p>Supporting FHIR (Fast Healthcare Interoperability Resources) standards enables seamless data exchange between different healthcare systems, labs, and third-party applications. Standardized APIs ensure better interoperability, compliance, and easier integration with external platforms.</p>
<h3 id="heading-4real-time-analytics">4.Real-Time Analytics</h3>
<p>Real-time analytics allows healthcare providers to monitor patient data, system performance, and operational metrics continuously. This supports proactive decision-making, early detection of anomalies, and improved overall quality of care.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Microservices-based REST API development provides a powerful foundation for building scalable and secure healthcare portals. By breaking applications into independent services, teams can achieve better scalability, faster deployments, and improved fault isolation.</p>
<p>However, adopting microservices is not just a technical shift—it is an architectural and operational commitment. Developers should start small, identify clear service boundaries, and gradually evolve their systems.</p>
<p>As your application grows, focus on strengthening security, improving observability, and automating deployments. These practices will ensure your healthcare platform remains reliable, compliant, and ready to scale in a cloud-native world.</p>
<p>The next step is to build your first microservice, deploy it using containers, and incrementally expand your system into a fully distributed healthcare platform.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Open Source Data Lake for Batch Ingestion ]]>
                </title>
                <description>
                    <![CDATA[ Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-open-source-data-lake-for-batch-ingestion/</link>
                <guid isPermaLink="false">69e0f1a7b67a275a9d3c9122</guid>
                
                    <category>
                        <![CDATA[ data-engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ apache-airflow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ingestion ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Puneet Singh ]]>
                </dc:creator>
                <pubDate>Thu, 16 Apr 2026 14:26:47 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ef685075-beac-4bf4-b435-6e942e5e1ac1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Creating a data platform has been made easier by cloud data analytics platforms like Databricks, Snowflake, and BigQuery. They offer excellent ramp-up and scaling options for small to mid-size teams.</p>
<p>But the trade-off isn't merely renting outside infrastructure. It also includes lock-in to proprietary abstractions, and an operational and security surface area built on top of vendor capabilities.</p>
<p>In this article, you'll set up a batch ingestion layer on an open-source data lake stack where you own every component.</p>
<p>The focus is deliberately narrow. We'll get the ingestion layer up and running end-to-end. Then we'll build on foundations that allow future extension: analytics, governance, and stream processing without locking you into any single tool for those layers. We'll also review documented integration failures along the way: misconfigured catalogs, partition values written as NULL, and Python version mismatches.</p>
<p>By the end, you'll have:</p>
<ul>
<li><p>A working single-node data lake running on Docker (compose), built on RustFS (object storage), Apache Iceberg (table format), and Project Nessie (catalog).</p>
</li>
<li><p>A batch pipeline orchestrated with Apache Airflow, executing PySpark jobs that write versioned, partitioned Iceberg tables.</p>
</li>
<li><p>A real-world ingestion pattern, an external web scraper decoupled from Airflow via Redis, writing raw data to object storage with a lightweight signal table.</p>
</li>
<li><p>A view of what this stack is and isn't, and what you'd add to take it toward production.</p>
</li>
</ul>
<p>A word on scope: this covers the E in <a href="https://www.getdbt.com/blog/extract-load-transform">ELT</a>: getting data in. Transformation (dbt, Spark SQL) and analytics (Trino, Superset) are a natural next layer, but are outside the scope of this article. What you build here is the foundation they'd sit on.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-the-ingestion-problem">The Ingestion Problem</a></p>
</li>
<li><p><a href="#heading-stack">Stack</a></p>
</li>
<li><p><a href="#heading-system-overview">System Overview</a></p>
</li>
<li><p><a href="#heading-quick-start">Quick Start</a></p>
</li>
<li><p><a href="#heading-running-the-pipelines">Running the Pipelines</a></p>
</li>
<li><p><a href="#heading-setup">Setup</a></p>
<ul>
<li><p><a href="#heading-rustfs">RustFS</a></p>
</li>
<li><p><a href="#heading-nessie">Nessie</a></p>
</li>
<li><p><a href="#heading-spark">Spark</a></p>
</li>
<li><p><a href="#heading-apache-airflow">Apache Airflow</a></p>
</li>
<li><p><a href="#heading-scrapredis">Scrapredis</a></p>
</li>
<li><p><a href="#heading-scrapworker">Scrapworker</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-path-forward">Path Forward</a></p>
<ul>
<li><p><a href="#heading-extending-capabilities">Extending Capabilities</a></p>
</li>
<li><p><a href="#heading-adding-layers">Adding Layers</a></p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-the-ingestion-problem">The Ingestion Problem</h2>
<p>The structure of a stack is easier to understand with a use case. The high-level goal here is to ingest financial data from external market APIs for trend analysis; you'll focus specifically on landing that data in the warehouse for further analytics.</p>
<p>The data is ingested via a web crawler with a per-endpoint rate limit. For batch processing, time-based partitioning makes the data easy for downstream pipelines to consume, and it also favors cleaner data retention.</p>
<p>The crawler runs as an external process, decoupled from Airflow via a Redis job queue. This keeps rate limiting and crawl lifecycle outside the orchestration layer, with each component failing and recovering independently.</p>
<p>During ingestion, the priority is landing data reliably: crawl jobs aren't idempotent, so a lost result can't simply be regenerated.</p>
<h2 id="heading-stack">Stack</h2>
<ul>
<li><p><a href="https://rustfs.com/"><strong>RustFS</strong></a><strong>:</strong> An S3-compatible object store written in Rust</p>
</li>
<li><p><a href="https://projectnessie.org/"><strong>Project Nessie</strong></a><strong>:</strong> Transactional catalog for Apache Iceberg tables</p>
</li>
<li><p><a href="https://spark.apache.org/"><strong>Apache Spark</strong></a><strong>:</strong> Distributed compute engine</p>
</li>
<li><p><a href="https://airflow.apache.org/"><strong>Apache Airflow</strong></a><strong>:</strong> Job scheduling and orchestration</p>
</li>
<li><p><a href="https://jupyter.org/"><strong>Jupyter Notebook</strong></a> <em>(optional)</em>: Ad-hoc Spark queries against Iceberg tables, not covered in this article</p>
</li>
<li><p><strong>Scrapredis:</strong> Job queue for the web crawler</p>
</li>
<li><p><strong>Scrapworker:</strong> Web crawler and ingestion worker</p>
</li>
</ul>
<p>This setup was tested on a 4-core x86/AMD CPU, 16GB RAM, 60GB disk GCP VM running Debian GNU/Linux 11 (Bullseye). Docker with Compose v2 is required. The setup should work on any comparable Linux environment with similar or better specs.</p>
<h2 id="heading-system-overview">System Overview</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/429a1e8a-bc39-44dc-8e0b-2cd9152370f5.png" alt="Data Platform Architecture" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>As noted above, the crawler runs as an external process, decoupled from Airflow via a Redis job queue. Airflow pushes a job specification to the queue containing the endpoint, query params, and target path. The crawler picks it up, executes the crawl, and writes raw results directly to object storage.</p>
<p>This separation keeps rate limiting and crawl lifecycle concerns outside the orchestration layer, and isolates failure modes.</p>
<p>A crawl failure is harder to recover from, since crawl jobs lack idempotency. Pipeline failures after the crawl stage are independently retryable without re-triggering a crawl.</p>
<h2 id="heading-quick-start">Quick Start</h2>
<p>First, initialize the project:</p>
<pre><code class="language-bash"># Clone the repository
git clone https://github.com/ps-mir/data-platform

# Create the shared Docker network
docker network create data-platform

# Create host directories, set permissions, and download Spark JARs
chmod +x init.sh &amp;&amp; ./init.sh
</code></pre>
<p>Start services in this order (shutdown in reverse):</p>
<ol>
<li><strong>RustFS</strong></li>
</ol>
<pre><code class="language-bash">cd rustfs &amp;&amp; docker compose up -d
</code></pre>
<ol start="2">
<li><strong>Nessie</strong></li>
</ol>
<pre><code class="language-bash">cd nessie &amp;&amp; docker compose up -d
</code></pre>
<ol start="3">
<li><strong>Spark</strong> — requires a build on first run</li>
</ol>
<pre><code class="language-bash">cd spark &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<ol start="4">
<li><strong>Scrapredis</strong></li>
</ol>
<pre><code class="language-bash">cd scrapredis &amp;&amp; docker compose up -d
</code></pre>
<ol start="5">
<li><strong>Airflow</strong> — requires a build on first run</li>
</ol>
<pre><code class="language-bash">cd airflow-docker &amp;&amp; docker compose build &amp;&amp; docker compose up -d
</code></pre>
<p>Create the Nessie namespaces once after Nessie is up:</p>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'

curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<p>Scrapworker runs on the host directly (it's not dockerized). It requires Python &gt;=3.14:</p>
<pre><code class="language-bash">cd scrapworker
pip install -e .
CONFIG_PATH=./config/config.local.yaml RUSTFS_ACCESS_KEY=rustfsadmin RUSTFS_SECRET_KEY=rustfsadmin python -m scrapworker
</code></pre>
<p>Scrapworker must be running before activating <code>scraper_pipeline_v1</code> in Airflow. Without it, the pipeline will push jobs to the queue with no worker to pick them up and hang indefinitely in <code>wait_for_completion</code>.</p>
<p>Trino is also present in the setup, but its integration with Nessie hasn't been tested yet.</p>
<h2 id="heading-running-the-pipelines">Running the Pipelines</h2>
<p>With the stack running, the next step is to activate the pipelines in Airflow. The four pipelines build on each other in complexity, so working through them in order is the fastest way to confirm that each layer of the stack is wired correctly before moving to the next.</p>
<p>All four are loaded but paused by default. Unpause each one in the Airflow UI before triggering.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/38f95d52-c092-4a00-b660-1233077b781b.png" alt="All Airflow Pipelines" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Let's go over each pipeline:</p>
<h3 id="heading-sparkstaticdatav1skeleton-hello-dag">spark_static_data_v1_skeleton: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step1_hello_dag.py">Hello DAG</a></h3>
<p>This is a minimal DAG with no Spark, just a Python task that prints a message. If it goes green, Airflow's scheduler and worker are healthy, and the task log shows a line like <code>[2026-04-09 22:00:01] INFO - Task operator:&lt;Task(_PythonDecoratedOperator): say_hello&gt;</code>.</p>
<h3 id="heading-sparkstaticdatav2submit-spark-submit">spark_static_data_v2_submit: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step2_spark_submit.py">Spark Submit</a></h3>
<p>This submits a PySpark job via <code>SparkSubmitOperator</code> that writes a static dataset to an Iceberg table. No partitioning, every run overwrites the previous content.</p>
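<p>For orientation, here is a hedged sketch of what that operator wiring can look like; the task id, application path, and connection id are illustrative, not the repo's exact values:</p>
<pre><code class="language-python">from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Illustrative wiring only; paths and ids are assumptions.
submit_job = SparkSubmitOperator(
    task_id="write_static_data",
    application="/opt/airflow/dags/scripts/write_static_data.py",
    conn_id="spark_default",
    # Python apps can't use cluster mode on standalone Spark (see the
    # Deploy Mode section below), so the driver runs on the Airflow worker.
    deploy_mode="client",
)
</code></pre>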
<p>In the Nessie catalog it appears as:</p>
<pre><code class="language-bash">Type: ICEBERG_TABLE
Metadata Location: s3://warehouse/default/static_data_e7e43123-95a7-44d2-b6d5-67c9c7aa4321/metadata/00000-08a5a2db-6f12-4f21-b2a9-de3d9123fbd3.metadata.json
</code></pre>
<h3 id="heading-sparkpartitioneddatav1-spark-partitioned">spark_partitioned_data_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/step3_spark_partitioned.py">Spark Partitioned</a></h3>
<p>This extends step2 with time-based partitioning. Partition values are derived from the scheduled slot time, so every run writes to its own <code>(ds, hr, min)</code> partition without touching previous ones.</p>
<p>Example file path in RustFS: <code>warehouse/default/static_data_partitioned_b172c66f-722b-44f3-bbee-069355753ff6/data/ds=2026-03-28/hr=23/min=15/00000-4-7a196a47-2ac0-4023-af68-ca10487fccb2-0-00001.parquet</code></p>
<h3 id="heading-scraperpipelinev1-scraper-pipeline">scraper_pipeline_v1: <a href="https://github.com/ps-mir/data-platform/blob/07ad47d68fec51f48cd41560921d509a70c5bb6f/airflow-docker/dags/scraper_pipeline.py">Scraper Pipeline</a></h3>
<p>This is the full ingestion flow. Airflow pushes a job to Scrapredis, Scrapworker calls the Binance API and writes raw results to RustFS, then Airflow publishes a signal row to the Nessie catalog.</p>
<p>Every run fetches: <code>https://api.binance.com/api/v3/trades?symbol=BTCUSDT&amp;limit=10</code></p>
<h2 id="heading-setup">Setup</h2>
<p>This is a single-node development setup using Docker Compose. It's built on a well-structured base config that can be extended to production with targeted changes.</p>
<ul>
<li><p>A production deployment would require HA configuration, persistent volume management, and security hardening for each component.</p>
</li>
<li><p>Images are pinned to specific versions to avoid silent breakage between pulls.</p>
</li>
<li><p>All containers share a common external Docker network named <code>data-platform</code>, which allows services to communicate using container names as hostnames.</p>
</li>
<li><p>An <code>init.sh</code> script creates the required local dirs inside the data folder and also creates the Docker network.</p>
</li>
</ul>
<h3 id="heading-rustfs">RustFS</h3>
<p>RustFS is the object storage layer in this stack. Nessie's REST catalog mode has a hard dependency on an S3-compatible endpoint. Running it against a local filesystem fails the Nessie healthcheck at startup and causes catalog initialization to error out. The REST catalog is the recommended mode for new setups because it enables credential vending and multi-engine coordination.</p>
<p>MinIO was the natural choice for self-hosted S3-compatible storage, but it shifted to a more restrictive license. RustFS is the open-source alternative, written in Rust and backed by local disk.</p>
<p>At write time, Spark pushes Parquet files directly to RustFS via S3FileIO. Nessie commits the table metadata alongside, so data and catalog state land together or not at all. This is <a href="https://iceberg.apache.org/">Apache Iceberg</a>'s core guarantee: atomic commits across both data files and metadata.</p>
<p>For production or cloud deployments, managed object storage services like AWS S3, Google Cloud Storage, or Azure Blob Storage are the natural next step. Self-hosted alternatives at scale include <a href="https://github.com/seaweedfs/seaweedfs">SeaweedFS</a>, <a href="https://docs.ceph.com/en/latest/radosgw/">Ceph/RGW</a>, and <a href="https://garagehq.deuxfleurs.fr/">Garage</a>.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Bucket creation:</strong> A <code>rustfs-init</code> sidecar using <code>amazon/aws-cli</code> runs after RustFS passes its healthcheck and creates the <code>s3://warehouse</code> bucket automatically. You don't create the bucket manually.</p>
</li>
<li><p><strong>Permissions:</strong> RustFS runs as uid=10001 inside the container. The host directories (<code>data/rustfs/data</code> and <code>data/rustfs/applogs</code>) must be owned by that uid before the container starts, or it will fail silently. <code>init.sh</code> handles this with <code>sudo chown -R 10001:10001</code>.</p>
</li>
<li><p><strong>Image pinning:</strong> The compose file pins to <code>rustfs/rustfs:1.0.0-alpha.85-glibc</code>. Before upgrading, verify the uid hasn't changed: <code>docker run --rm --entrypoint id rustfs/rustfs:&lt;new-tag&gt;</code>. If it has, re-run <code>init.sh</code> or re-chown manually.</p>
</li>
<li><p><strong>Spark writes:</strong> Spark writes data files directly to RustFS via S3FileIO. Nessie only manages catalog metadata, it doesn't proxy data. The two interact at commit time, not at write time.</p>
</li>
</ul>
<h3 id="heading-nessie">Nessie</h3>
<p>The catalog tracks the list of tables in the warehouse, along with their data files and schema. Without it, query engines like Spark have no shared, consistent view of what's in the warehouse.</p>
<p><a href="https://hive.apache.org/docs/latest/admin/adminmanual-metastore-administration/">Hive Metastore</a> offers a Thrift-based API and has been the catalog standard for years. It provides transaction semantics on metadata updates through its backing database, but those transactions stop at the catalog layer. Data files underneath aren't part of the same commit, and there's no cross-table history beyond what the database retains.</p>
<p>Apache Iceberg closes the data and metadata gap with atomic table commits. Nessie builds on that and goes further: it treats the catalog like a Git repository. Every table write is a commit. You can branch, tag, and roll back across multiple tables atomically.</p>
<p>Spark reads and writes table metadata through Nessie's Iceberg REST endpoint. Catalog state is persisted to Postgres, so it survives container restarts.</p>
<h4 id="heading-namespace-bootstrap">Namespace bootstrap</h4>
<p>Unlike Hive Metastore, Nessie doesn't auto-create namespaces. Attempting to write a table to a namespace that doesn't exist fails after data has already been written to RustFS, leaving orphaned files with no catalog entry. Namespaces are structural metadata and belong in a one-time bootstrap step, not in a pipeline.</p>
<p>Nessie manages the Iceberg catalog metadata under <code>s3://warehouse/</code>. Iceberg table data lands under paths derived from the namespace, for example, <code>s3://warehouse/default/</code> for the <code>default</code> namespace.</p>
<h4 id="heading-s3-credential-configuration-issue">S3 Credential Configuration Issue</h4>
<p>Nessie's S3 credential fields don't accept plain strings (likely for security reasons). They require a secret URI in the form <code>urn:nessie-secret:quarkus:&lt;name&gt;</code> even for local credentials.</p>
<p>Additionally, the SCREAMING_SNAKE_CASE environment variable convention is ambiguous for Quarkus property names containing hyphens. The property is silently ignored, and the default (which fails) is used instead. The working approach is dot-notation keys passed directly in the compose environment block, which Quarkus reads without conversion:</p>
<pre><code class="language-properties">nessie.catalog.service.s3.default-options.access-key: "urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key"
nessie.catalog.secrets.access-key.name: rustfsadmin
nessie.catalog.secrets.access-key.secret: rustfsadmin
</code></pre>
<h4 id="heading-nessie-health-check">Nessie health check</h4>
<p>Once the RustFS settings are corrected, Nessie's health check URL (<a href="http://localhost:9090/q/health">http://localhost:9090/q/health</a>) should return the following response:</p>
<pre><code class="language-json">{
    "status": "UP",
    "checks": [
        {
            "name": "MongoDB connection health check",
            "status": "UP"
        },
        {
            "name": "Warehouses Object Stores",
            "status": "UP",
            "data": {
                "warehouse.warehouse.status": "UP"
            }
        },
        {
            "name": "Database connections health check",
            "status": "UP",
            "data": {
                "&lt;default&gt;": "UP"
            }
        }
    ]
}
</code></pre>
<p>The MongoDB connection health check appears in the response even though this stack doesn't use MongoDB. It's a Quarkus built-in probe registered automatically regardless of store type. With JDBC configured, MongoDB is never connected and the UP report is just a placeholder response.</p>
<h4 id="heading-catalog-endpoint-vs-management">Catalog endpoint vs Management</h4>
<p>Nessie exposes two separate APIs. The Iceberg REST catalog is at <code>/iceberg</code>. This is what Spark and Trino connect to. The Nessie management API is at <code>/api/v2</code>, which is for branch operations, commit history, and table inspection. They aren't interchangeable.</p>
<pre><code class="language-properties"># Iceberg REST API
http://localhost:19120/iceberg/v1/main/namespaces
http://localhost:19120/iceberg/v1/config

# Nessie management API
http://localhost:19120/api/v2/config
</code></pre>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><code>path-style-access: true</code> is required for any non-AWS S3 endpoint. <code>region</code> is a dummy value required by the AWS SDK internally.</p>
</li>
<li><p>Nessie's internal port 9000 is remapped to 9090 on the host to avoid a conflict with RustFS, which occupies 9000 and 9001.</p>
</li>
</ul>
<h4 id="heading-forward-path">Forward path</h4>
<p>Nessie is a stateless REST service, so reads can be scaled behind a load balancer with no coordination between nodes. Durability comes entirely from the backing store.</p>
<h3 id="heading-spark">Spark</h3>
<p>As a distributed compute engine, Apache Spark is a reliable and stable choice for long-running jobs. In the current setup, it executes PySpark jobs submitted by Airflow, reads and writes Iceberg tables via the Nessie REST catalog, and writes data files directly to RustFS using S3FileIO. Spark runs in standalone mode with a single master and worker, configured via <code>spark-defaults.conf</code>.</p>
<p>Two JARs are required and must be placed in <code>data/spark/jars/</code> before starting:</p>
<ul>
<li><p><code>iceberg-spark-runtime-3.5_2.12</code>: Iceberg integration for Spark: SparkCatalog, DataFrameWriterV2, SQL extensions, and all table format logic.</p>
</li>
<li><p><code>iceberg-aws-bundle</code>: AWS SDK v2 and Iceberg's S3FileIO, the storage transport layer for writing data files to RustFS. The Spark base image ships only Hadoop AWS (SDK v1). This bundle provides the SDK v2 classes that S3FileIO requires.</p>
</li>
</ul>
<p>Spark uses a custom Dockerfile to install Python 3.12. Build the image before first use:</p>
<pre><code class="language-bash">cd spark
docker compose build
docker compose up -d
</code></pre>
<p>The PySpark jobs are covered in the Airflow section, where we walk through each DAG and its corresponding Spark script as part of the pipeline.</p>
<p>Before submitting any Spark job that writes an Iceberg table, the target namespace must exist in Nessie. Nessie doesn't auto-create namespaces, unlike Hive Metastore. Attempting to write to a missing namespace fails after data has already been written to RustFS, leaving orphaned files with no catalog entry.</p>
<p>Create the <code>default</code> namespace once before running any pipeline:</p>
<pre><code class="language-bash"># Nessie should be up and running at this point
curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["default"]}'
{
  "namespace" : [ "default" ],
  "properties" : { }
}
</code></pre>
<p>Verify:</p>
<pre><code class="language-bash">curl http://localhost:19120/iceberg/v1/main/namespaces
</code></pre>
<h4 id="heading-catalog-mismatch-tables-missing-across-query-engines">Catalog Mismatch: Tables Missing Across Query Engines</h4>
<p>If tables written by Spark aren't visible in Trino, the likely cause is a catalog mismatch. Spark configured with <code>NessieCatalog</code> and Trino using the Iceberg REST catalog maintain separate metadata views — they don't share table state. Both engines must point at the same catalog endpoint: <code>http://nessie:19120/iceberg</code>.</p>
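<p>As a hedged sketch of the Spark side of that alignment, here are the REST catalog settings expressed in PySpark; in this stack they live in <code>spark-defaults.conf</code>, and the catalog name <code>nessie</code> is illustrative:</p>
<pre><code class="language-python">from pyspark.sql import SparkSession

# Point Spark's Iceberg catalog at the same REST endpoint Trino uses.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.type", "rest")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg")
    .getOrCreate()
)
</code></pre>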
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p><strong>Worker memory:</strong> The worker is configured with <code>SPARK_WORKER_MEMORY: 8g</code>. Spark's default of 1g is enough for the worker to register, but not enough to run a job without queuing. Tune this based on available host memory.</p>
</li>
<li><p><strong>Remote signing:</strong> set <code>remote-signing-enabled: false</code>. Nessie's REST catalog supports credential vending via IAM/STS, but since that integration isn't present here, remote signing is disabled explicitly to avoid request failures.</p>
</li>
<li><p><strong>Config changes need full restart:</strong> Docker file-level bind mounts cache the inode at container start. Editing <code>spark-defaults.conf</code> won't take effect until Spark and the Airflow worker are restarted. In client mode, the Airflow worker is the Spark driver (the process that reads the config on job submission) and must be restarted too.</p>
</li>
<li><p><strong>Jupyter Notebook:</strong> A Jupyter instance with PySpark is included in the stack for ad-hoc queries against Iceberg tables. It connects to the same Spark cluster and Nessie catalog, so any table written by a pipeline is immediately queryable.</p>
</li>
</ul>
<p>⚠️ <strong>Warning:</strong> The Spark worker and Airflow worker (the driver) must run the same Python minor version. PySpark enforces this at runtime and fails immediately if they diverge. The Spark image in this stack uses a custom Dockerfile to install Python 3.12, matching Airflow's base image. If you upgrade either, verify that the versions stay aligned.</p>
<h3 id="heading-apache-airflow">Apache Airflow</h3>
<p>Airflow makes it easier to author, schedule, and monitor workflows. Here it orchestrates batch ingestion, but it can be extended to use cases like stream processing.</p>
<p>The Airflow components here most closely resemble the DAG-processor architecture from the <a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html">official docs</a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69607e708806706b5c49c7af/a438e02b-0b16-44c7-bcae-92c954a942cc.png" alt="DAG Processor Airflow Architecture" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Key aspects:</p>
<ul>
<li><p>The DAG Processor continuously parses DAG files and serializes them to the Metadata DB.</p>
</li>
<li><p>The Scheduler reads from there, detects when a DAG run is due, creates task instances, and pushes them to the CeleryExecutor (via Redis queue).</p>
</li>
<li><p>The Celery worker picks up a task and executes it. In the case of a <code>SparkSubmitOperator</code>, the worker process becomes the Spark driver, submitting the job to the Spark cluster.</p>
</li>
<li><p>Executors run on the Spark worker, write Parquet files directly to RustFS, and commit the table metadata to Nessie. Airflow records the task outcome back in the Metadata DB.</p>
</li>
</ul>
<p>Airflow uses a custom Dockerfile to install Java 17 and additional providers. Build the image before first use:</p>
<pre><code class="language-bash">cd airflow-docker
docker compose build
docker compose up -d
</code></pre>
<h4 id="heading-pipelines">Pipelines</h4>
<p>Pipelines need to be created inside the <code>airflow-docker/dags</code> folder so the DAG processor can pick them up and load them into the metadata DB. Four pipeline examples of varying complexity are provided.</p>
<ol>
<li><p><code>step1_hello_dag.py</code>: single-task DAG with no dependencies, just a Python function that prints a message.</p>
</li>
<li><p><code>step2_spark_submit.py</code>: submits a PySpark job via SparkSubmitOperator. The job writes a static dataset to an Iceberg table via the Nessie catalog.</p>
</li>
<li><p><code>step3_spark_partitioned.py</code>: extends step 2 with time-based partitioning. The scheduled slot time is passed to the PySpark script.</p>
<ul>
<li>Time-based partition values are derived from <code>data_interval_start</code> for idempotency (backfills, reruns); a short sketch follows this list.</li>
</ul>
</li>
<li><p><code>scraper_pipeline</code>: a real-world ingestion pipeline. Coordinates with the external task executor <code>scrapworker</code> via the Redis queue <code>scrapredis</code>.</p>
<ul>
<li>Both <code>scrapredis</code> and <code>scrapworker</code> must be up and running for this pipeline to work.</li>
</ul>
</li>
</ol>
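<p>To make the idempotency point in step 3 concrete, here is a minimal sketch of deriving partition values from the interval start; the helper name is an illustration, not the repo's code:</p>
<pre><code class="language-python">from datetime import datetime

def partition_values(data_interval_start: datetime) -&gt; dict:
    # Same scheduled slot in, same partition out: reruns and backfills
    # land in exactly the same (ds, hr, min) partition.
    return {
        "ds": data_interval_start.strftime("%Y-%m-%d"),
        "hr": data_interval_start.strftime("%H"),
        "min": data_interval_start.strftime("%M"),
    }

# A slot starting 2026-03-28 23:15 maps to ds=2026-03-28/hr=23/min=15,
# matching the file path shown earlier.
</code></pre>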
<h4 id="heading-deploy-mode-and-driver-config">Deploy Mode and Driver Config</h4>
<p>The initial <code>SparkSubmitOperator</code> configuration used <code>deploy_mode="cluster"</code>, which runs the driver on the Spark cluster rather than the submitting machine. This fails immediately on Spark standalone clusters with a hard error:</p>
<pre><code class="language-plaintext">Cluster deploy mode is currently not supported for python applications on standalone clusters.
</code></pre>
<p>Cluster mode for Python is only available on YARN and Kubernetes. The fix is <code>deploy_mode="client"</code>, but this shifts the problem: in client mode, the driver runs on the Airflow worker container, which means the worker needs everything the Spark containers have.</p>
<p>Overall, three changes are required in the Airflow worker:</p>
<ul>
<li><p>The Iceberg and Nessie JARs at <code>/opt/spark/user-jars/</code></p>
</li>
<li><p><code>spark-defaults.conf</code> with catalog, extension, and JAR config</p>
</li>
<li><p><code>SPARK_CONF_DIR=/opt/spark/conf</code>, without this, pip-installed PySpark's <code>spark-submit</code> silently ignores the mounted conf file and runs with no catalog config</p>
</li>
</ul>
<p>The fix was adding all three to <code>x-airflow-common</code> in <code>airflow-docker/docker-compose.yaml</code> so every Airflow service inherits them:</p>
<pre><code class="language-yaml">environment:
  SPARK_CONF_DIR: /opt/spark/conf

volumes:
  - ../data/spark/jars:/opt/spark/user-jars:ro
  - ../spark/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf:ro
</code></pre>
<h4 id="heading-partition-values-written-as-null">Partition Values Written as NULL</h4>
<p>When the third pipeline (Spark Partitioned) ran for the first time, the data landed correctly in RustFS, but querying the Iceberg partitions metadata showed:</p>
<pre><code class="language-plaintext">+------------------+----------+
|         partition|file_count|
+------------------+----------+
|{NULL, NULL, NULL}|         2|
+------------------+----------+
</code></pre>
<p>The original script used Spark's DataSource V1 API:</p>
<pre><code class="language-python">df.write.format("iceberg").mode("overwrite").saveAsTable(table)
</code></pre>
<p>This V1 write path with <code>format("iceberg")</code> loads an isolated table reference and bypasses Iceberg's catalog write path. As a result, Iceberg committed the data files to storage but wrote NULL partition values into the manifest metadata.</p>
<p>The fix is Iceberg's native DataFrameWriterV2 API:</p>
<pre><code class="language-python">df.writeTo(table).overwritePartitions()
</code></pre>
<p>This routes through Iceberg's native write path, evaluates partition transforms from the real column values (ds, hr, min), and registers them correctly in the manifest. <code>overwritePartitions()</code> overwrites only the partitions present in the DataFrame. A rerun with the same scheduled time produces the same values and atomically replaces that partition, leaving all others untouched.</p>
<p>⚠️ Existing NULL-partition manifest entries aren't retroactively corrected by subsequent V2 writes. For a brand-new table containing only bad data, DROP TABLE and rewrite is the simplest recovery.</p>
<h3 id="heading-scrapredis">Scrapredis</h3>
<p>Scrapredis is a dedicated Redis instance that sits between Airflow and Scrapworker as a job queue. It's separate from Airflow's internal Redis, which exists solely for CeleryExecutor task dispatch. The separation means the crawler's job queue can be managed, scaled, or replaced without touching Airflow's internals.</p>
<p>The pattern generalises beyond scraping. Any external process that needs its own lifecycle, resource profile, or rate limiting can be wired the same way: Airflow pushes a job, the external worker pops it, and Airflow polls for the result.</p>
<p>The scraper pipeline follows this round-trip:</p>
<ol>
<li>Airflow pushes the job payload to the queue:</li>
</ol>
<pre><code class="language-python">QUEUE_KEY = "scrapworker:jobs"
client.lpush(QUEUE_KEY, json.dumps(payload))
</code></pre>
<ol start="2">
<li>Scrapworker blocks on the queue and pops the next job:</li>
</ol>
<pre><code class="language-python">while True:
    _, payload = client.blpop(redis_cfg["queue_key"])
</code></pre>
<ol start="3">
<li>Once the crawl finishes, Scrapworker writes the outcome and <code>s3_path</code> back to Redis:</li>
</ol>
<pre><code class="language-python">client.set(status_key, json.dumps({"status": "finished", "worker_id": worker_id, "s3_path": job["s3_path"]}), ex=TERMINAL_TTL)
</code></pre>
<ol start="4">
<li>The <code>wait_for_completion</code> task polls for that status key. On success, <code>publish_nessie_signal</code> picks up the <code>s3_path</code> and writes the signal row to Nessie. A polling sketch follows this list.</li>
</ol>
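<p>Here is a hedged sketch of that polling step; the key names mirror the snippets above, while the interval, timeout, and error handling are illustrative:</p>
<pre><code class="language-python">import json
import time

def wait_for_completion(client, status_key, poll_interval=5, timeout=600):
    # Poll the Redis status key until Scrapworker reports a terminal state.
    deadline = time.time() + timeout
    while time.time() &lt; deadline:
        raw = client.get(status_key)
        if raw:
            status = json.loads(raw)
            if status["status"] == "finished":
                return status["s3_path"]
            if status["status"] == "failed":
                raise RuntimeError(f"crawl failed: {status}")
        time.sleep(poll_interval)
    raise TimeoutError(f"no terminal status for {status_key}")
</code></pre>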
<h3 id="heading-scrapworker">Scrapworker</h3>
<p>Scrapworker is a Python app that uses the Scrapy crawling framework to fetch every page of a request. It's decoupled from Airflow because of URL- and client-specific rate-limit semantics. For simplicity, consider it a type of external worker that receives and executes requests from Airflow.</p>
<p>It's responsible for downloading and writing content to object storage (RustFS). The Nessie catalog update is decoupled and kept in a separate Airflow pipeline task.</p>
<h4 id="heading-fixed-signal-table">Fixed Signal Table</h4>
<p>Scrapworker writes raw JSON to RustFS rather than writing scraped data directly as Iceberg columns. The pipeline then publishes a single lightweight signal row to a Nessie-managed Iceberg table.</p>
<p>The signal schema is fixed and minimal (<code>run_id</code>, <code>endpoint</code>, <code>s3_path</code>, <code>ds</code>, <code>hr</code>, <code>min</code>, <code>published_at</code>). It never changes, regardless of what's being scraped.</p>
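<p>A hedged sketch of that publish step (the table name, session wiring, and literal values are assumptions, not the repo's code):</p>
<pre><code class="language-python">from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In the real pipeline these values come from the Airflow context and the
# Redis status payload; the literals here are placeholders.
row = ("manual__2026-03-28T23:15", "binance_trades",
       "s3://warehouse/raw/binance/2026-03-28/23/15.json",
       "2026-03-28", "23", "15",
       datetime.now(timezone.utc).isoformat())

signal = spark.createDataFrame(
    [row],
    "run_id string, endpoint string, s3_path string, "
    "ds string, hr string, min string, published_at string",
)
signal.writeTo("nessie.scraper.signals").append()
</code></pre>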
<p>Mirroring the scraped payload as Iceberg columns would force Scrapworker to own schema evolution across different endpoints. This isn't an ideal place for schema ownership. Instead, schema ownership sits downstream:</p>
<pre><code class="language-plaintext">Scrapworker  →  raw files in RustFS  +  signal row in Iceberg (from Pipeline)
Airflow job  →  reads raw via s3_path, applies schema, writes structured Iceberg table
</code></pre>
<p>The downstream job knows the domain, knows the schema, and is the right place to handle type casting, nulls, and partition layout. Scrapworker stays generic and thin — the same code handles any endpoint without modification.</p>
<h4 id="heading-why-signal-publish-is-a-separate-airflow-task">Why Signal Publish is a Separate Airflow Task</h4>
<p>Scrapworker writes to RustFS and sets <code>status: finished</code> in Redis with the <code>s3_path</code>. A separate Airflow task reads that status and publishes the signal row to Nessie. The two writes are intentionally decoupled.</p>
<p>If scrapworker published to Nessie directly after writing to RustFS, the two writes would share a failure mode. A Nessie failure after a successful RustFS write would leave data stranded with no signal and no clean recovery path. The only option would be a re-crawl, which isn't idempotent.</p>
<p>With the decoupled approach, each failure is isolated. A Nessie failure triggers an Airflow retry of the signal publish task only, no re-scrape, no duplicate crawl. RustFS and Nessie failures are independently recoverable.</p>
<h4 id="heading-notes">Notes:</h4>
<ul>
<li><p>Raw scraped files are written directly to <code>s3://warehouse/raw/</code>, entirely outside Nessie's management. Nothing in the Iceberg layer touches this path.</p>
</li>
<li><p>The scrapworker signal table lives in a dedicated <code>scraper</code> namespace. Create it once before scrapworker runs for the first time.</p>
</li>
</ul>
<pre><code class="language-bash">curl -X POST http://localhost:19120/iceberg/v1/main/namespaces \
  -H "Content-Type: application/json" \
  -d '{"namespace": ["scraper"]}'
</code></pre>
<h2 id="heading-path-forward">Path Forward</h2>
<p>The stack we've built here is a working ingestion layer. It lands data reliably, tracks it in a versioned catalog, and gives you a foundation to build on. Two directions are worth considering from here.</p>
<h3 id="heading-extending-capabilities">Extending Capabilities</h3>
<p>These are improvements to what's already in the stack, making it more robust without adding new components.</p>
<p><strong>Ingestion reliability:</strong> Scrapworker currently handles failures by setting <code>status: failed</code> in Redis, which requires Airflow to re-trigger the full pipeline. Adding client-side rate limiting and per-endpoint retry logic with backoff would make crawl jobs more self-healing, so that a failed page fetch can retry independently without surfacing to Airflow at all.</p>
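<p>A minimal sketch of that retry idea, assuming a <code>fetch_page</code> callable that stands in for the real request logic:</p>
<pre><code class="language-python">import random
import time

def fetch_with_retry(fetch_page, url, max_attempts=5, base_delay=1.0):
    # Retry a single page fetch with exponential backoff plus jitter,
    # so a transient failure never has to surface to Airflow.
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(url)
        except Exception:
            if attempt == max_attempts:
                raise  # only escalate after all retries are exhausted
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
</code></pre>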
<p><strong>Config validation:</strong> A misconfigured endpoint schema in <code>config.yaml</code> fails silently at runtime, often deep into a crawl. A <code>validate_config()</code> call at startup would catch missing required fields like <code>offset_param</code> or <code>response_map</code> before any job runs. This becomes more important as more endpoints are added.</p>
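<p>A sketch of that startup check, using the required field names mentioned above (the config shape is otherwise assumed):</p>
<pre><code class="language-python">REQUIRED_ENDPOINT_FIELDS = {"offset_param", "response_map"}

def validate_config(config: dict) -&gt; None:
    # Fail fast at startup instead of deep inside a crawl.
    for name, endpoint in config.get("endpoints", {}).items():
        missing = REQUIRED_ENDPOINT_FIELDS - endpoint.keys()
        if missing:
            raise ValueError(f"endpoint {name!r} is missing: {sorted(missing)}")
</code></pre>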
<p><strong>Observability:</strong> Airflow alerting and SLA monitoring give early warning when pipelines miss their schedule or tasks take longer than expected. The signal table is useful here too. A lightweight monitor that checks for expected signal rows within a time window is a simple SLA check that works without external tooling.</p>
<h3 id="heading-adding-layers">Adding Layers</h3>
<p>These are new capabilities that build on the ingestion foundation.</p>
<p><strong>Transform layer:</strong> The raw Iceberg tables written by the ingestion layer are the input for a transform step. dbt or Spark SQL can read from raw, apply schema, clean types, and write structured tables to a separate namespace. This is the T in ELT and the natural next step once ingestion is stable.</p>
<p><strong>Analytics:</strong> Trino is already in the stack and partially integrated. Connecting it fully to Nessie enables SQL queries across all Iceberg tables. Adding Superset on top gives a visualisation layer without requiring any changes to the ingestion pipeline.</p>
<p><strong>Broader source onboarding:</strong> The current stack handles one ingestion pattern: a scheduled Airflow pipeline triggering an external HTTP crawler. The same foundation supports pull-based sources like databases using CDC, and push-based sources like event streams via Kafka. The Iceberg tables and Nessie catalog serve as the landing zone regardless of how data arrives.</p>
<p><strong>Governance:</strong> Iceberg and Nessie provide the foundations, covering snapshots, schema evolution, commit history, and time travel. The governance layer on top requires deliberate additions: access control, data quality checks, lineage tracking, and schema enforcement. None of these require replacing what's here, as they sit on top of it.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Fashion App That Helps You Organize Your Wardrobe  ]]>
                </title>
                <description>
                    <![CDATA[ I used to spend too long deciding what to wear, even when my closet was full. That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-fashion-app-to-organize-your-wardrobe/</link>
                <guid isPermaLink="false">69de6abf91716f3cfb5448a1</guid>
                
                    <category>
                        <![CDATA[ webdev ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ React ]]>
                    </category>
                
                    <category>
                        <![CDATA[ full stack ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mokshita V P ]]>
                </dc:creator>
                <pubDate>Tue, 14 Apr 2026 16:26:39 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/bf593ff6-6de8-4b30-ab0a-700c3410ccb1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I used to spend too long deciding what to wear, even when my closet was full.</p>
<p>That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better organization, better visibility, and better guidance when making outfit decisions.</p>
<p>So I built a fashion web app that helps users organize their wardrobe, get outfit suggestions, evaluate shopping decisions, and improve recommendations over time using feedback.</p>
<p>In this article, I’ll walk through what the app does, how I built it, the decisions I made along the way, and the challenges that shaped the final result.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-table-of-contents">Table of Contents</a></p>
</li>
<li><p><a href="#heading-what-the-app-does">What the App Does</a></p>
</li>
<li><p><a href="#heading-why-i-built-it">Why I Built It</a></p>
</li>
<li><p><a href="#heading-tech-stack">Tech Stack</a></p>
</li>
<li><p><a href="#heading-product-walkthrough-what-users-see">Product Walkthrough (What Users See)</a></p>
</li>
<li><p><a href="#heading-how-i-built-it">How I Built It</a></p>
</li>
<li><p><a href="#heading-challenges-i-faced">Challenges I Faced</a></p>
</li>
<li><p><a href="#heading-what-i-learned">What I Learned</a></p>
</li>
<li><p><a href="#heading-what-i-want-to-improve-next">What I Want to Improve Next</a></p>
</li>
<li><p><a href="#heading-future-improvements">Future Improvements</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-the-app-does">What the App Does</h2>
<p>At a high level, the app combines six core capabilities:</p>
<ol>
<li><p>Wardrobe management</p>
</li>
<li><p>Outfit recommendations</p>
</li>
<li><p>Shopping suggestions</p>
</li>
<li><p>Discard recommendations</p>
</li>
<li><p>Feedback and usage tracking</p>
</li>
<li><p>Secure multi-user accounts</p>
</li>
</ol>
<p>Users can upload clothing items, explore suggested outfits, and mark recommendations as helpful or not helpful. They can also rate outfits and track whether items are worn, kept, or discarded.</p>
<p>That feedback becomes structured data for improving future recommendation quality.</p>
<h2 id="heading-why-i-built-it">Why I Built It</h2>
<p>I wanted to create something that felt personal and actually useful. A lot of fashion apps look polished, but they do not always help with everyday decisions. My goal was to build something that could make wardrobe management easier and outfit selection less overwhelming. The app needed to do three things well:</p>
<ul>
<li><p>store each user’s wardrobe data</p>
</li>
<li><p>personalize recommendations</p>
</li>
<li><p>learn from user feedback over time.</p>
</li>
</ul>
<p>That feedback loop mattered to me because it makes the app feel more alive instead of static.</p>
<h2 id="heading-tech-stack">Tech Stack</h2>
<p>Here are the tools I used to build the app:</p>
<ul>
<li><p>Frontend: React + Vite</p>
</li>
<li><p>Backend: FastAPI</p>
</li>
<li><p>Database: SQLite (local development)</p>
</li>
<li><p>Background jobs: Celery + Redis</p>
</li>
<li><p>Authentication: JWT (access + refresh token flow)</p>
</li>
<li><p>Deployment support: Docker and GitHub Codespaces</p>
</li>
</ul>
<p>This ended up giving me a modular setup, which helped a lot as features grew: fast frontend iteration, clean API boundaries, and room to evolve recommendations separately from the UI.</p>
<h2 id="heading-product-walkthrough-what-users-see">Product Walkthrough (What Users See)</h2>
<h3 id="heading-1-onboarding-and-account-setup">1. Onboarding and Account Setup</h3>
<p>To start using the app, a user needs to register, verify their email, and complete some profile basics.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/1ff4fb0d-dc97-4088-b720-db917b53ba5b.png" alt="Onboarding screen showing account creation, email verification, and profile fields for body shape, height, weight, and style preferences." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Each account is isolated, so wardrobe history and recommendations stay user-specific.</p>
<p>In this onboarding screen above, you can see account creation, email verification, and profile fields for body shape, height, weight, and style preferences.</p>
<h3 id="heading-2-wardrobe-upload">2. Wardrobe Upload</h3>
<p>Users can upload clothing images.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/d69bf10b-b79b-4294-923c-5c9e5840098a.png" alt="Wardrobe upload form showing clothing image analysis results with category, dominant color, secondary color, and pattern details." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Image analysis labels each item and makes it searchable for recommendations. The wardrobe upload form shows image analysis results with category, dominant color, secondary color, and pattern details listed.</p>
<h3 id="heading-3-outfit-recommendations">3. Outfit Recommendations</h3>
<p>Users can request recommendations, then rate outputs.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/61527ddf-11e4-4284-92fd-2d0c948ae2db.png" alt="Outfit recommendation dashboard showing ranked outfit cards with feedback and rating actions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Above you can see the outfit recommendation dashboard that shows ranked outfit cards with feedback and rating actions. Recommendations are ranked by a weighted scoring model.</p>
<h3 id="heading-4-shopping-and-discard-assistants">4. Shopping and Discard Assistants</h3>
<p>The app evaluates new items against existing wardrobe data and flags low-value wardrobe items that may be worth removing.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/88ed83c4-fdba-40e7-ad32-f77bdf21cb4d.png" alt="Shopping and discard analysis screen showing recommendation scores, written reasons, and styling guidance for each item." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can see the recommendation scores, written reasons (not just a binary decision), and styling guidance for each item above. It also features a "how to style it" section in case the user still wants to keep the item.</p>
<h2 id="heading-how-i-built-it">How I Built It</h2>
<h3 id="heading-1-frontend-setup-react-vite">1. Frontend Setup (React + Vite)</h3>
<p>I used React + Vite because I wanted fast iteration and a clean component structure.</p>
<p>The frontend is split into feature areas like onboarding, wardrobe management, outfits, shopping, and discarded-item suggestions. I also keep API calls in a service layer so the UI components stay focused on rendering and interaction.</p>
<p>The snippet below is a simplified example of the API service pattern used in the app. It is not meant to be copy-pasted as-is, but it shows the same structure the frontend uses when talking to the backend.</p>
<p>Example API client pattern:</p>
<pre><code class="language-javascript">export async function getOutfitRecommendations(userId, params = {}) {
  const query = new URLSearchParams(params).toString();
  const url = `/users/${userId}/outfits/recommend${query ? `?${query}` : ""}`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${localStorage.getItem("access_token")}`,
    },
  });

  if (!response.ok) {
    throw new Error("Failed to fetch outfit recommendations");
  }

  return response.json();
}
</code></pre>
<p>Here's what's happening in that snippet:</p>
<ul>
<li><p><code>URLSearchParams</code> builds optional query strings like <code>occasion</code>, <code>season</code>, or <code>limit</code>.</p>
</li>
<li><p>The request path is user-scoped, which keeps each user’s recommendations isolated.</p>
</li>
<li><p>The <code>Authorization</code> header sends the access token so the backend can verify the session.</p>
</li>
<li><p>The response is checked before parsing so the UI can surface a useful error if the request fails.</p>
</li>
</ul>
<p>This pattern kept the frontend simple and reusable as the number of API calls grew.</p>
<h3 id="heading-2-backend-architecture-with-fastapi">2. Backend Architecture with FastAPI</h3>
<p>The backend is organized around clear route groups:</p>
<ul>
<li><p>auth routes for register, login, refresh, logout, and sessions</p>
</li>
<li><p>user analysis routes</p>
</li>
<li><p>wardrobe CRUD routes</p>
</li>
<li><p>recommendation routes for outfits, shopping, and discard analysis</p>
</li>
<li><p>feedback routes for ratings and helpfulness signals</p>
</li>
</ul>
<p>One of the most important design choices was enforcing ownership checks on user-scoped resources. That prevented one user from accessing another user’s wardrobe or feedback data.</p>
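<p>As a hedged sketch (the dependency names are illustrative, not the app's exact code), an ownership guard in FastAPI can look like this:</p>
<pre><code class="language-python">from fastapi import Depends, HTTPException

def get_current_user():
    # Stand-in for the app's real JWT dependency, which resolves the
    # caller from the access token.
    raise NotImplementedError

def require_owner(user_id: int, current_user=Depends(get_current_user)):
    # Reject any request whose path user_id doesn't match the caller.
    if current_user.id != user_id:
        raise HTTPException(status_code=403, detail="Not your resource")
    return current_user

# Applied per route, e.g.:
# @app.get("/users/{user_id}/wardrobe")
# def list_wardrobe(user_id: int, user=Depends(require_owner)): ...
</code></pre>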
<p>The backend snippet below is another simplified example from the app’s route layer. It shows the request validation and orchestration logic, while the actual scoring work stays in the recommendation service.</p>
<pre><code class="language-python">@app.get("/users/{user_id}/outfits/recommend")
def recommend_outfits(user_id: int, occasion: str | None = None, season: str | None = None, limit: int = 10):
    user = get_user_or_404(user_id)
    wardrobe_items = get_user_wardrobe(user_id)

    if len(wardrobe_items) &lt; 2:
        raise HTTPException(status_code=400, detail="Not enough wardrobe items")

    recommendations = outfit_generator.generate_outfit_recommendations(
        wardrobe_items=wardrobe_items,
        body_shape=user.body_shape,
        undertone=user.undertone,
        occasion=occasion,
        season=season,
        top_k=limit,
    )

    return {"user_id": user_id, "recommendations": recommendations}
</code></pre>
<p>Here's how to read that code:</p>
<ul>
<li><p><code>get_user_or_404</code> loads the profile data needed for personalization.</p>
</li>
<li><p><code>get_user_wardrobe</code> fetches only the current user’s items.</p>
</li>
<li><p>The minimum wardrobe check prevents the recommendation logic from running on incomplete data.</p>
</li>
<li><p><code>generate_outfit_recommendations</code> handles the scoring logic separately, which keeps the route handler small and easier to test.</p>
</li>
<li><p>The response returns the results in a shape the frontend can consume directly.</p>
</li>
</ul>
<p>That separation helped keep the API layer readable while the recommendation logic stayed isolated in its own service.</p>
<h3 id="heading-3-recommendation-logic">3. Recommendation Logic</h3>
<p>I intentionally started with deterministic rules before introducing heavy ML. That made behavior easier to debug and explain.</p>
<p>The outfit recommender scores combinations using weighted signals:</p>
<p>$$\text{outfit score} = 0.4 \cdot \text{color harmony} + 0.4 \cdot \text{body-shape fit} + 0.2 \cdot \text{undertone fit}$$</p>
<p>The snippet below is a simplified example from the recommendation engine. It shows how the app combines multiple signals into a single score:</p>
<pre><code class="language-python">def score_outfit(combo, user_context):
    color_score = color_harmony.score(combo)
    shape_score = body_shape_rules.score(combo, user_context.body_shape)
    undertone_score = undertone_rules.score(combo, user_context.undertone)

    total = 0.4 * color_score + 0.4 * shape_score + 0.2 * undertone_score
    return round(total, 3)
</code></pre>
<p>The logic behind this approach is straightforward:</p>
<ul>
<li><p>color harmony helps the outfit feel visually coherent</p>
</li>
<li><p>body-shape scoring helps the outfit feel flattering</p>
</li>
<li><p>undertone scoring helps the colors work better with the user’s profile</p>
</li>
</ul>
<p>I used a similar structure for discard recommendations and shopping suggestions, but with different factors and thresholds.</p>
<h3 id="heading-4-authentication-and-secure-multi-user-design">4. Authentication and Secure Multi-user Design</h3>
<p>Security was one of the most important parts of this build.</p>
<p>I implemented:</p>
<ul>
<li><p>short-lived access tokens</p>
</li>
<li><p>refresh tokens with JTI tracking</p>
</li>
<li><p>token rotation on refresh</p>
</li>
<li><p>session revocation (single session and all sessions)</p>
</li>
<li><p>email verification and password reset flows</p>
</li>
</ul>
<p>The snippet below is a simplified example of the refresh-token lifecycle used in the app. It shows the important control points rather than every helper function:</p>
<pre><code class="language-python">def refresh_access_token(refresh_token: str):
    payload = decode_jwt(refresh_token)
    jti = payload["jti"]

    token_record = db.get_refresh_token(jti)
    if not token_record or token_record.revoked:
        raise AuthError("Invalid refresh token")

    new_refresh, new_jti = issue_refresh_token(payload["sub"])
    token_record.revoked = True
    token_record.replaced_by_jti = new_jti

    new_access = issue_access_token(payload["sub"])
    return {"access_token": new_access, "refresh_token": new_refresh}
</code></pre>
<p>What this code is doing:</p>
<ul>
<li><p>It decodes the refresh token and looks up its JTI in the database.</p>
</li>
<li><p>It rejects reused or revoked sessions, which helps prevent replay attacks.</p>
</li>
<li><p>It rotates the refresh token instead of reusing it.</p>
</li>
<li><p>It issues a fresh access token so the session stays valid without forcing the user to log in again.</p>
</li>
</ul>
<p>This design made multi-device sessions safer and gave me server-side control over logout behavior.</p>
<h3 id="heading-5-background-jobs-for-long-running-operations">5. Background Jobs for Long-running Operations</h3>
<p>Image analysis can be expensive, especially when the app needs to classify clothing, analyze colors, and estimate body-shape-related signals. To keep the request path responsive, I added Celery + Redis support for background tasks.</p>
<p>That gave the app two modes:</p>
<ul>
<li><p>synchronous processing for simpler local development</p>
</li>
<li><p>queued processing for heavier or slower jobs</p>
</li>
</ul>
<p>That tradeoff mattered because it let me keep the developer experience simple without blocking the app during more expensive work.</p>
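<p>A minimal sketch of the queued mode, assuming a local Redis broker (the task body is a placeholder, not the app's actual pipeline):</p>
<pre><code class="language-python">from celery import Celery

celery_app = Celery("wardrobe", broker="redis://localhost:6379/0")

@celery_app.task
def analyze_image(item_id: int) -&gt; None:
    # Placeholder for the real work: classify the garment, extract
    # colors, and persist the results for the recommendation engine.
    print(f"analyzing wardrobe item {item_id}")

# The request handler enqueues and returns immediately:
# analyze_image.delay(item_id=42)
</code></pre>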
<h3 id="heading-6-data-model-and-feedback-capture">6. Data Model and Feedback Capture</h3>
<p>A recommendation system only improves if it captures the right signals.</p>
<p>So I added dedicated feedback tables for:</p>
<ul>
<li><p>outfit ratings (1-5 + optional comments)</p>
</li>
<li><p>recommendation helpful/unhelpful feedback</p>
</li>
<li><p>item usage actions (worn/kept/discarded)</p>
</li>
</ul>
<p>Here is the shape of one of those models:</p>
<pre><code class="language-python">class RecommendationFeedback(Base):
    __tablename__ = "recommendation_feedback"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    recommendation_type = Column(String(50), nullable=False)
    recommendation_id = Column(Integer, nullable=False)
    helpful = Column(Boolean, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)
</code></pre>
<p>How to read this model:</p>
<ul>
<li><p><code>user_id</code> ties feedback to the person who gave it.</p>
</li>
<li><p><code>recommendation_type</code> tells me whether the feedback belongs to outfits, shopping, or discard suggestions.</p>
</li>
<li><p><code>recommendation_id</code> identifies the exact recommendation.</p>
</li>
<li><p><code>helpful</code> stores the user’s direct response.</p>
</li>
<li><p><code>created_at</code> makes it possible to analyze feedback trends over time.</p>
</li>
</ul>
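<p>To show what this shape enables, here is a hedged example of a SQLAlchemy query that summarizes helpfulness per recommendation type. The <code>session</code> argument and the helper function itself are illustrative, not part of the app:</p>
<pre><code class="language-python">from sqlalchemy import Integer, cast, func

def helpfulness_by_type(session):
    # Casting the boolean to 0/1 and averaging gives a helpful rate,
    # while the row count shows how much signal backs each rate.
    return (
        session.query(
            RecommendationFeedback.recommendation_type,
            func.avg(cast(RecommendationFeedback.helpful, Integer))
                .label("helpful_rate"),
            func.count().label("votes"),
        )
        .group_by(RecommendationFeedback.recommendation_type)
        .all()
    )
</code></pre>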
<p>This part of the system gives the app a real learning foundation, even though the feedback-to-model-update loop is still a future improvement.</p>
<h2 id="heading-challenges-i-faced">Challenges I Faced</h2>
<p>This was the section that taught me the most.</p>
<h3 id="heading-1-image-heavy-endpoints-were-slower-than-i-wanted">1. Image-heavy endpoints were slower than I wanted</h3>
<p>The analyze and wardrobe upload flows were doing a lot of work at once: image validation, classification, color extraction, storage, and database writes.</p>
<p>At first, that made the request flow feel heavier than it should have.</p>
<p>What I changed:</p>
<ul>
<li><p>I bounded concurrent image jobs so the app wouldn't try to do too much at once.</p>
</li>
<li><p>I separated slower jobs into background processing where possible.</p>
</li>
<li><p>I used load-test results to confirm which endpoints were actually expensive.</p>
</li>
</ul>
<p>The practical effect was that heavy image requests stopped competing with each other so aggressively. Instead of letting many expensive tasks pile up inside the same request cycle, I limited the active work and pushed slower operations into the queue when needed.</p>
<p>Why this fixed it:</p>
<ul>
<li><p>Bounding concurrency prevented the system from overloading CPU-bound tasks.</p>
</li>
<li><p>Moving expensive work into async jobs kept the main request/response cycle more responsive.</p>
</li>
<li><p>Load testing gave me evidence instead of guesswork, so I could tune the system based on real performance behavior.</p>
</li>
</ul>
<p>In other words, I didn't just “optimize” the endpoint in theory. I changed the execution model so expensive analysis could not block every other request behind it.</p>
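<p>The bounding idea itself is small. Here is a minimal sketch, assuming an asyncio-based request path; the limit of four and the stubbed analysis coroutine are illustrative:</p>
<pre><code class="language-python">import asyncio

# At most four expensive image jobs run at once; the rest wait their turn.
IMAGE_JOB_SLOTS = asyncio.Semaphore(4)

async def run_analysis(image_bytes: bytes) -&gt; dict:
    # Stand-in for classification and color extraction.
    await asyncio.sleep(0.1)
    return {"status": "ok"}

async def analyze_with_limit(image_bytes: bytes) -&gt; dict:
    async with IMAGE_JOB_SLOTS:
        return await run_analysis(image_bytes)
</code></pre>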
<h3 id="heading-2-jwt-sessions-needed-real-server-side-control">2. JWT sessions needed real server-side control</h3>
<p>A basic JWT setup is easy to get working, but it becomes less useful if you cannot revoke sessions or manage multiple devices cleanly.</p>
<p>What I changed:</p>
<ul>
<li><p>I stored refresh tokens in the database.</p>
</li>
<li><p>I tracked token JTI values.</p>
</li>
<li><p>I rotated refresh tokens when users refreshed their session.</p>
</li>
<li><p>I added endpoints for logging out a single session or all sessions.</p>
</li>
</ul>
<p>The important shift here was moving from “token exists, therefore session is valid” to “token exists, matches the database record, and has not been revoked or replaced.” That gave the server the authority to invalidate old sessions immediately.</p>
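<p>Because every refresh token lives in the database, "log out everywhere" becomes a simple update. Here is a sketch of that idea, with a hypothetical <code>db.get_refresh_tokens_for_user</code> helper in the same style as the earlier snippet:</p>
<pre><code class="language-python">def revoke_all_sessions(user_id: int) -&gt; int:
    # Marking every stored refresh token revoked means each device's
    # next refresh attempt fails, forcing a fresh login.
    revoked = 0
    for record in db.get_refresh_tokens_for_user(user_id):
        if not record.revoked:
            record.revoked = True
            revoked += 1
    return revoked
</code></pre>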
<p>Why this fixed it:</p>
<ul>
<li><p>Server-side token tracking made revocation possible.</p>
</li>
<li><p>Rotation reduced the chance of token reuse.</p>
</li>
<li><p>Session management became visible to the user, which made the app feel more trustworthy.</p>
</li>
</ul>
<p>This is what made logout-all and multi-device management work in a real way instead of just being cosmetic UI actions.</p>
<h3 id="heading-3-user-data-isolation-had-to-be-explicit">3. User data isolation had to be explicit</h3>
<p>Because this is a multi-user app, I had to be careful that one account could never accidentally see another account’s wardrobe data.</p>
<p>What I changed:</p>
<ul>
<li><p>I added ownership checks to user-scoped routes.</p>
</li>
<li><p>I kept all wardrobe and feedback queries filtered by <code>user_id</code>.</p>
</li>
<li><p>I used encrypted image storage instead of exposing raw paths.</p>
</li>
</ul>
<p>In practice, this meant every route had to ask the same question: “Does this user own the resource they are trying to access?” If the answer was no, the request stopped immediately.</p>
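<p>A hedged sketch of that pattern, with a placeholder <code>WardrobeItem</code> model and the same <code>AuthError</code> style as the earlier snippet:</p>
<pre><code class="language-python">def get_owned_item(db, user_id: int, item_id: int):
    # Filtering by user_id inside the query means a wrong owner
    # gets "not found" instead of a glimpse of someone else's data.
    item = (
        db.query(WardrobeItem)
        .filter(WardrobeItem.id == item_id,
                WardrobeItem.user_id == user_id)
        .first()
    )
    if item is None:
        raise AuthError("Item not found for this user")
    return item
</code></pre>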
<p>Why this fixed it:</p>
<ul>
<li><p>Ownership checks made data access rules explicit.</p>
</li>
<li><p>User-filtered queries prevented accidental cross-account reads.</p>
</li>
<li><p>Encrypted storage improved privacy and reduced the risk of exposing image data directly.</p>
</li>
</ul>
<p>That combination is what kept wardrobe data, feedback history, and images separated correctly across accounts.</p>
<h3 id="heading-4-docker-made-the-project-easier-to-share-but-only-after-the-stack-was-organized">4. Docker made the project easier to share, but only after the stack was organized</h3>
<p>The app includes the frontend, backend, Redis, Celery worker, and Celery Beat, so the first challenge was making the setup feel reproducible instead of fragile.</p>
<p>What I changed:</p>
<ul>
<li><p>I defined the stack in Docker Compose.</p>
</li>
<li><p>I documented the required environment variables.</p>
</li>
<li><p>I kept the dev stack aligned with how the app runs in practice.</p>
</li>
</ul>
<p>This removed a lot of setup ambiguity. Instead of asking someone to manually figure out how the frontend, backend, Redis, and workers fit together, I made the stack describe itself.</p>
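<p>A trimmed-down sketch of what that Compose file can look like. The service names, build paths, and commands here are illustrative, not the project's exact file:</p>
<pre><code class="language-yaml">services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    env_file: .env
    depends_on: [redis]
  worker:
    build: ./backend
    command: celery -A app.worker worker --loglevel=info
    env_file: .env
    depends_on: [redis]
  beat:
    build: ./backend
    command: celery -A app.worker beat --loglevel=info
    env_file: .env
    depends_on: [redis]
  redis:
    image: redis:7-alpine
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
    depends_on: [backend]
</code></pre>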
<p>Why this fixed it:</p>
<ul>
<li><p>Docker let contributors start the project with fewer manual steps.</p>
</li>
<li><p>Clear environment configuration reduced setup mistakes.</p>
</li>
<li><p>Matching the stack to the architecture made the app easier to understand and test.</p>
</li>
</ul>
<p>That was important because the app depends on several moving parts, and the simplest way to make the project approachable was to make startup behavior predictable.</p>
<h2 id="heading-what-i-learned">What I Learned</h2>
<p>This project taught me a few important lessons:</p>
<ul>
<li><p>Small features become much more valuable when they work together.</p>
</li>
<li><p>Feedback data is one of the strongest signals for improving recommendations.</p>
</li>
<li><p>Clean data modeling matters a lot when multiple users are involved.</p>
</li>
<li><p>Docker and clear setup instructions make a project much easier for other people to try.</p>
</li>
</ul>
<p>I also learned that a project does not need to be huge to be useful. A focused app that solves one problem well can still feel meaningful.</p>
<h2 id="heading-what-i-want-to-improve-next">What I Want to Improve Next</h2>
<p>My roadmap from here:</p>
<ol>
<li><p>Integrate feedback directly into ranking updates</p>
</li>
<li><p>Add visual analytics for recommendation quality trends</p>
</li>
<li><p>Improve mobile UX parity</p>
</li>
<li><p>Deploy with persistent cloud storage and production database defaults</p>
</li>
<li><p>Provide a public demo mode for easier evaluation</p>
</li>
</ol>
<h2 id="heading-future-improvements">Future Improvements</h2>
<p>There are still a few things I would like to add later:</p>
<ul>
<li><p>a more advanced recommendation engine</p>
</li>
<li><p>visual analytics for user feedback</p>
</li>
<li><p>better mobile support</p>
</li>
<li><p>live deployment with persistent cloud storage</p>
</li>
<li><p>a public demo mode for easier testing</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This project began as a personal frustration and turned into a full web application with authentication, wardrobe storage, recommendation logic, and feedback infrastructure.</p>
<p>The most rewarding part was seeing how practical software decisions, not just flashy UI, can help people make everyday choices faster.</p>
<p>If you want to explore or run the project, <a href="https://github.com/Mokshitavp1/fashion_assistant">check out the repo</a>. You can try the flows and share feedback. I would especially love input on recommendation quality, UX clarity, and what features would make this genuinely useful in daily life.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build and Deploy Multi-Architecture Docker Apps on Google Cloud Using ARM Nodes (Without QEMU)
 ]]>
                </title>
                <description>
                    <![CDATA[ If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening ins ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-and-deploy-multi-architecture-docker-apps-on-google-cloud-using-arm-nodes/</link>
                <guid isPermaLink="false">69dcf2c3f57346bc1e05a01d</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Kubernetes ]]>
                    </category>
                
                    <category>
                        <![CDATA[ google cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ARM ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Amina Lawal ]]>
                </dc:creator>
                <pubDate>Mon, 13 Apr 2026 13:42:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/e89ae65a-4b3a-44b7-94d8-d0638f017bf6.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you've bought a laptop in the last few years, there's a good chance it's running an ARM processor. Apple's M-series chips put ARM on the map for developers, but the real revolution is happening inside cloud data centers.</p>
<p>Google Cloud Axion is Google's own custom ARM-based chip, built to handle the demands of modern cloud workloads. The performance and cost numbers are striking: Google claims Axion delivers up to 60% better energy efficiency and up to 65% better price-performance compared to comparable x86 machines.</p>
<p>AWS has Graviton. Azure has Cobalt. ARM is no longer niche. It's the direction the entire cloud industry is moving.</p>
<p>But there's a problem that catches almost every team off guard when they start this transition: <strong>container architecture mismatch</strong>.</p>
<p>If you build a Docker image on your M-series Mac and push it to an x86 server, it crashes on startup with a cryptic <code>exec format error</code>.</p>
<p>The server isn't broken. It just can't read the compiled instructions inside your image. An ARM binary and an x86 binary are written in fundamentally different languages at the machine level. The CPU literally can't execute instructions it wasn't designed for.</p>
<p>We're going to solve this problem completely in this tutorial. You'll build a single Docker image tag that automatically serves the correct binary on both ARM and x86 machines — no separate pipelines, no separate tags. Then you'll provision Google Cloud ARM nodes in GKE and configure your Kubernetes deployment to route workloads precisely to those cost-efficient nodes.</p>
<p><strong>Here's what you'll build, step by step:</strong></p>
<ul>
<li><p>A Go HTTP server that reports the CPU architecture it's running on at runtime</p>
</li>
<li><p>A multi-stage Dockerfile that cross-compiles for both <code>linux/amd64</code> and <code>linux/arm64</code> without slow QEMU emulation</p>
</li>
<li><p>A multi-arch image in Google Artifact Registry that acts as a single entry point for any architecture</p>
</li>
<li><p>A GKE cluster with two node pools: a standard x86 pool and an ARM Axion pool</p>
</li>
<li><p>A Kubernetes Deployment that pins your workload exclusively to the ARM nodes</p>
</li>
</ul>
<p>By the end, you'll hit a live endpoint and see the word <code>arm64</code> staring back at you from a Google Cloud ARM node. Let's get into it.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</a></p>
</li>
<li><p><a href="#heading-step-3-write-the-application">Step 3: Write the Application</a></p>
</li>
<li><p><a href="#heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</a></p>
</li>
<li><p><a href="#heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</a></p>
</li>
<li><p><a href="#heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</a></p>
</li>
<li><p><a href="#heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</a></p>
</li>
<li><p><a href="#heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</a></p>
</li>
<li><p><a href="#heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</a></p>
</li>
<li><p><a href="#heading-cleanup">Cleanup</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-project-file-structure">Project File Structure</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have the following ready:</p>
<ul>
<li><p><strong>A Google Cloud project</strong> with billing enabled. If you don't have one, create it at <a href="https://console.cloud.google.com">console.cloud.google.com</a>. The total cost to follow this tutorial is around $5–10.</p>
</li>
<li><p><code>gcloud</code> <strong>CLI</strong> installed and authenticated. Run <code>gcloud auth login</code> to sign in and <code>gcloud config set project YOUR_PROJECT_ID</code> to point it at your project.</p>
</li>
<li><p><strong>Docker Desktop</strong> version 19.03 or later. Docker Buildx (the tool we'll use for multi-arch builds) ships bundled with it.</p>
</li>
<li><p><code>kubectl</code> installed. This is the CLI for interacting with Kubernetes clusters.</p>
</li>
<li><p>Basic familiarity with <strong>Docker</strong> (images, layers, Dockerfile) and <strong>Kubernetes</strong> (pods, deployments, services). You don't need to be an expert, but you should know what these things are.</p>
</li>
</ul>
<h2 id="heading-step-1-set-up-your-google-cloud-project">Step 1: Set Up Your Google Cloud Project</h2>
<p>Before writing a single line of application code, let's get the cloud infrastructure side ready. This is the foundation everything else will build on.</p>
<h3 id="heading-enable-the-required-apis">Enable the Required APIs</h3>
<p>Google Cloud services are off by default in any new project. Run this command to turn on the three APIs we'll need:</p>
<pre><code class="language-bash">gcloud services enable \
  artifactregistry.googleapis.com \
  container.googleapis.com \
  containeranalysis.googleapis.com
</code></pre>
<p>Here's what each one does:</p>
<ul>
<li><p><code>artifactregistry.googleapis.com</code> — enables <strong>Artifact Registry</strong>, where we'll store our Docker images</p>
</li>
<li><p><code>container.googleapis.com</code> — enables <strong>Google Kubernetes Engine (GKE)</strong>, where our cluster will run</p>
</li>
<li><p><code>containeranalysis.googleapis.com</code> — enables vulnerability scanning for images stored in Artifact Registry</p>
</li>
</ul>
<h3 id="heading-create-a-docker-repository-in-artifact-registry">Create a Docker Repository in Artifact Registry</h3>
<p>Artifact Registry is Google Cloud's managed container image store — the place where our built images will live before being deployed to the cluster. Create a dedicated repository for this tutorial:</p>
<pre><code class="language-bash">gcloud artifacts repositories create multi-arch-repo \
  --repository-format=docker \
  --location=us-central1 \
  --description="Multi-arch tutorial images"
</code></pre>
<p>Breaking down the flags:</p>
<ul>
<li><p><code>--repository-format=docker</code> — tells Artifact Registry this repository stores Docker images (as opposed to npm packages, Maven artifacts, and so on)</p>
</li>
<li><p><code>--location=us-central1</code> — the Google Cloud region where your images will be stored. Use a region that's close to where your cluster will run to minimize image pull latency. Run <code>gcloud artifacts locations list</code> to see all options.</p>
</li>
<li><p><code>--description</code> — a human-readable label for the repository, shown in the console.</p>
</li>
</ul>
<h3 id="heading-authenticate-docker-to-push-to-artifact-registry">Authenticate Docker to Push to Artifact Registry</h3>
<p>Docker needs credentials before it can push images to Google Cloud. Run this command to wire up authentication automatically:</p>
<pre><code class="language-bash">gcloud auth configure-docker us-central1-docker.pkg.dev
</code></pre>
<p>This adds a credential helper entry to your <code>~/.docker/config.json</code> file. What that means in practice: any time Docker tries to push or pull from a URL under <code>us-central1-docker.pkg.dev</code>, it will automatically call <code>gcloud</code> to get a valid auth token. You won't need to run <code>docker login</code> manually.</p>
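<p>After running it, the relevant part of <code>config.json</code> looks roughly like this:</p>
<pre><code class="language-json">{
  "credHelpers": {
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
</code></pre>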
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/31fd020f-ffa2-40bd-9057-57b16a61b325.png" alt="Terminal output of the gcloud artifacts repositories list command, showing a row for multi-arch-repo with format DOCKER, location us-central1" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-2-create-the-gke-cluster">Step 2: Create the GKE Cluster</h2>
<p>With Artifact Registry ready to receive images, let's create the Kubernetes cluster. We'll start with a standard cluster using x86 nodes and add an ARM node pool later once we have an image to deploy.</p>
<pre><code class="language-bash">gcloud container clusters create axion-tutorial-cluster \
  --zone=us-central1-a \
  --num-nodes=2 \
  --machine-type=e2-standard-2 \
  --workload-pool=PROJECT_ID.svc.id.goog
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>What each flag does:</p>
<ul>
<li><p><code>--zone=us-central1-a</code> — creates a zonal cluster in a single availability zone. A regional cluster (using <code>--region</code>) would spread nodes across three zones for higher resilience, but for this tutorial a single zone keeps things simple and avoids capacity issues that can affect specific zones. If <code>us-central1-a</code> is unavailable, try <code>us-central1-b</code>.</p>
</li>
<li><p><code>--num-nodes=2</code> — two x86 nodes in this zone. We need at least 2 to have enough capacity alongside our ARM node pool later.</p>
</li>
<li><p><code>--machine-type=e2-standard-2</code> — the machine type for this default node pool. <code>e2-standard-2</code> is a cost-effective x86 machine with 2 vCPUs and 8 GB of memory, good for general workloads.</p>
</li>
<li><p><code>--workload-pool=PROJECT_ID.svc.id.goog</code> — enables <strong>Workload Identity</strong>, which is Google's recommended way for pods to authenticate with Google Cloud APIs. It avoids the need to download and store service account key files inside your cluster.</p>
</li>
</ul>
<p>This command takes a few minutes. While it runs, you can move on to writing the application. We'll come back to the cluster in Step 6.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/332250a8-3f99-4eb1-849f-51ab054c9567.png" alt="GCP Console Kubernetes Engine Clusters page showing axion-tutorial-cluster with a green checkmark status, the zone us-central1-a, and Kubernetes version in the table." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-3-write-the-application">Step 3: Write the Application</h2>
<p>We need an application to containerize. We'll use <strong>Go</strong> for three specific reasons:</p>
<ol>
<li><p>Go compiles into a single, statically-linked binary. There's no runtime to install, no interpreter — just the binary. This makes for extremely lean container images.</p>
</li>
<li><p>Go has first-class, built-in cross-compilation support. We can compile an ARM64 binary from an x86 Mac, or vice versa, by setting two environment variables. This will matter a lot when we get to the Dockerfile.</p>
</li>
<li><p>Go exposes the architecture the binary was compiled for via <code>runtime.GOARCH</code>. Our server will report this at runtime, giving us hard proof that the correct binary is running on the correct hardware.</p>
</li>
</ol>
<p>Start by creating the project directories:</p>
<pre><code class="language-bash">mkdir -p hello-axion/app hello-axion/k8s
cd hello-axion/app
</code></pre>
<p>Initialize the Go module from inside <code>app/</code>. This creates <code>go.mod</code> in the current directory:</p>
<pre><code class="language-bash">go mod init hello-axion
</code></pre>
<p><code>go mod init</code> is Go's built-in command for starting a new module. It writes a <code>go.mod</code> file that declares the module name (<code>hello-axion</code>) and the minimum Go version required. Every modern Go project needs this file — without it, the compiler doesn't know how to resolve packages.</p>
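<p>The generated <code>go.mod</code> is tiny. Depending on your installed toolchain, it looks something like this:</p>
<pre><code class="language-plaintext">module hello-axion

go 1.23
</code></pre>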
<p>Now create the application at <code>app/main.go</code>:</p>
<pre><code class="language-go">package main

import (
    "fmt"
    "net/http"
    "os"
    "runtime"
)

func handler(w http.ResponseWriter, r *http.Request) {
    hostname, _ := os.Hostname()
    fmt.Fprintf(w, "Hello from freeCodeCamp!\n")
    fmt.Fprintf(w, "Architecture : %s\n", runtime.GOARCH)
    fmt.Fprintf(w, "OS           : %s\n", runtime.GOOS)
    fmt.Fprintf(w, "Pod hostname : %s\n", hostname)
}

func healthz(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, "ok")
}

func main() {
    http.HandleFunc("/", handler)
    http.HandleFunc("/healthz", healthz)
    fmt.Println("Server starting on port 8080...")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        fmt.Fprintf(os.Stderr, "server error: %v\n", err)
        os.Exit(1)
    }
}
</code></pre>
<p>Verify both files were created:</p>
<pre><code class="language-bash">ls -la
</code></pre>
<p>You should see <code>go.mod</code> and <code>main.go</code> listed.</p>
<p>Let's walk through what this code does:</p>
<ul>
<li><p><code>import "runtime"</code> — imports Go's built-in <code>runtime</code> package, which exposes information about the Go runtime environment, including the CPU architecture.</p>
</li>
<li><p><code>runtime.GOARCH</code> — returns a string like <code>"arm64"</code> or <code>"amd64"</code> representing the architecture this binary was compiled for. When we deploy to an ARM node, this value will be <code>arm64</code>. This is the core of our proof.</p>
</li>
<li><p><code>os.Hostname()</code> — returns the pod's hostname, which Kubernetes sets to the pod name. This lets us see which specific pod responded when we test the app later.</p>
</li>
<li><p><code>handler</code> — the main HTTP handler, registered on the root path <code>/</code>. It writes the architecture, OS, and hostname to the response.</p>
</li>
<li><p><code>healthz</code> — a separate handler registered on <code>/healthz</code>. It returns HTTP 200 with the text <code>ok</code>. Kubernetes will use this endpoint to check whether the container is alive and ready to serve traffic — we'll wire this up in the deployment manifest later.</p>
</li>
<li><p><code>http.ListenAndServe(":8080", nil)</code> — starts the server on port 8080. If it fails to start (for example, if the port is already in use), it prints the error and exits with a non-zero code so Kubernetes knows something went wrong.</p>
</li>
</ul>
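<p>You can preview the cross-compilation point right now, before any Docker is involved. This optional check assumes Go is installed locally and uses the common <code>file</code> utility to confirm the target architecture:</p>
<pre><code class="language-bash"># Compile a Linux ARM64 binary, regardless of your host machine
GOOS=linux GOARCH=arm64 go build -o server-arm64 main.go

# Inspect the result; expect something like "ELF 64-bit ... ARM aarch64"
file server-arm64
</code></pre>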
<h2 id="heading-step-4-enable-multi-arch-builds-with-docker-buildx">Step 4: Enable Multi-Arch Builds with Docker Buildx</h2>
<p>Before we write the Dockerfile, we need to understand a fundamental constraint, because it directly shapes how the Dockerfile must be written.</p>
<h3 id="heading-why-your-docker-images-are-architecture-specific-by-default">Why Your Docker Images Are Architecture-Specific By Default</h3>
<p>A CPU only understands instructions written for its specific <strong>Instruction Set Architecture (ISA)</strong>. ARM64 and x86_64 are different ISAs — different vocabularies of machine-level operations. When you compile a Go program, the compiler translates your source code into binary instructions for exactly one ISA. That binary can't run on a different ISA.</p>
<p>When you build a Docker image the normal way (<code>docker build</code>), the binary inside that image is compiled for your local machine's ISA. If you're on an Apple Silicon Mac, you get an ARM64 binary. Push that image to an x86 server, and when Docker tries to execute the binary, the kernel rejects it:</p>
<pre><code class="language-shell">standard_init_linux.go:228: exec user process caused: exec format error
</code></pre>
<p>That's the operating system saying: "This binary was written for a different processor. I have no idea what to do with it."</p>
<h3 id="heading-the-solution-a-single-image-tag-that-serves-any-architecture">The Solution: A Single Image Tag That Serves Any Architecture</h3>
<p>Docker solves this with a structure called a <strong>Manifest List</strong> (also called a multi-arch image index). Instead of one image, a Manifest List is a pointer table. It holds multiple image references — one per architecture — all under the same tag.</p>
<p>When a server pulls <code>hello-axion:v1</code>, here's what actually happens:</p>
<ol>
<li><p>Docker contacts the registry and requests the manifest for <code>hello-axion:v1</code></p>
</li>
<li><p>The registry returns the Manifest List, which looks like this internally:</p>
</li>
</ol>
<pre><code class="language-json">{
  "manifests": [
    { "digest": "sha256:a1b2...", "platform": { "architecture": "amd64", "os": "linux" } },
    { "digest": "sha256:c3d4...", "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}
</code></pre>
<ol>
<li>Docker checks the current machine's architecture, finds the matching entry, and pulls only that specific image layer. The x86 image never downloads onto your ARM server, and vice versa.</li>
</ol>
<p>One tag, two actual images. Completely transparent to your deployment manifests.</p>
<h3 id="heading-set-up-docker-buildx">Set Up Docker Buildx</h3>
<p><strong>Docker Buildx</strong> is the CLI tool that builds these Manifest Lists. It's powered by the <strong>BuildKit</strong> engine and ships bundled with Docker Desktop. Run the following to create and activate a new builder instance:</p>
<pre><code class="language-bash">docker buildx create --name multiarch-builder --use
</code></pre>
<ul>
<li><p><code>--name multiarch-builder</code> — gives this builder a memorable name. You can have multiple builders. This command creates a new one named <code>multiarch-builder</code>.</p>
</li>
<li><p><code>--use</code> — immediately sets this new builder as the active one, so all future <code>docker buildx build</code> commands use it.</p>
</li>
</ul>
<p>Now boot the builder and confirm it supports the platforms we need:</p>
<pre><code class="language-bash">docker buildx inspect --bootstrap
</code></pre>
<ul>
<li><code>--bootstrap</code> — starts the builder container if it isn't already running, and prints its full configuration.</li>
</ul>
<p>You should see output like this:</p>
<pre><code class="language-plaintext">Name:          multiarch-builder
Driver:        docker-container
Platforms:     linux/amd64, linux/arm64, linux/arm/v7, linux/386, ...
</code></pre>
<p>The <code>Platforms</code> line lists every architecture this builder can produce images for. As long as you see <code>linux/amd64</code> and <code>linux/arm64</code> in that list, you're ready to build for both x86 and ARM.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/1c19aca1-30c4-406d-9c37-679ee4f2928f.png" alt="Terminal output showing the multiarch-builder details with Name, Driver set to docker-container, and a Platforms list that includes linux/amd64 and linux/arm64 highlighted." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-5-write-the-dockerfile">Step 5: Write the Dockerfile</h2>
<p>Now we can write the Dockerfile. We'll use two techniques together: a <strong>multi-stage build</strong> to keep the final image tiny, and a <strong>cross-compilation trick</strong> to avoid slow CPU emulation.</p>
<p>Create <code>app/Dockerfile</code> with the following content:</p>
<pre><code class="language-dockerfile"># -----------------------------------------------------------
# Stage 1: Build
# -----------------------------------------------------------
# $BUILDPLATFORM = the machine running this build (your laptop)
# $TARGETOS / $TARGETARCH = the platform we are building FOR
# -----------------------------------------------------------
FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /app

COPY go.mod .
RUN go mod download

COPY main.go .

RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go

# -----------------------------------------------------------
# Stage 2: Runtime
# -----------------------------------------------------------

FROM alpine:latest

RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup
USER appuser

WORKDIR /app
COPY --from=builder /app/server .

EXPOSE 8080
CMD ["./server"]
</code></pre>
<p>There's a lot happening here. Let's go through it carefully.</p>
<h3 id="heading-stage-1-the-builder">Stage 1: The Builder</h3>
<p><code>FROM --platform=$BUILDPLATFORM golang:1.23-alpine AS builder</code></p>
<p>This is the most important line in the file. <code>$BUILDPLATFORM</code> is a special build argument that Docker Buildx automatically injects — it equals the platform of the machine <em>running the build</em> (your laptop). By pinning the builder stage to <code>$BUILDPLATFORM</code>, the Go compiler always runs natively on your machine, not inside a CPU emulator. This is what makes multi-arch builds fast.</p>
<p>Without <code>--platform=$BUILDPLATFORM</code>, Buildx would have to use <strong>QEMU</strong> — a full CPU emulator — to run an ARM64 build environment on your x86 machine (or vice versa). QEMU works, but it's typically 5–10 times slower than native execution. For a project with many dependencies, that's the difference between a 2-minute build and a 20-minute build.</p>
<p><code>ARG TARGETOS</code> <strong>and</strong> <code>ARG TARGETARCH</code></p>
<p>These two lines declare that our Dockerfile expects build arguments named <code>TARGETOS</code> and <code>TARGETARCH</code>. Buildx injects these automatically based on the <code>--platform</code> flag you pass at build time. For a <code>linux/arm64</code> target, <code>TARGETOS</code> will be <code>linux</code> and <code>TARGETARCH</code> will be <code>arm64</code>.</p>
<p><code>COPY go.mod .</code> <strong>and</strong> <code>RUN go mod download</code></p>
<p>We copy <code>go.mod</code> first, before copying the rest of the source code. Docker builds images layer by layer and caches each layer. By copying only the module file first, we create a cached layer for <code>go mod download</code>.</p>
<p>On future builds, as long as <code>go.mod</code> hasn't changed, Docker skips the download step entirely — even if the source code changed. This speeds up iterative development significantly.</p>
<p><code>RUN GOOS=$TARGETOS GOARCH=$TARGETARCH go build -ldflags="-w -s" -o server main.go</code></p>
<p>This is the cross-compilation step. <code>GOOS</code> and <code>GOARCH</code> are Go's built-in cross-compilation environment variables. Setting them tells the Go compiler to produce a binary for a different target than the machine it's running on. We set them from the <code>$TARGETOS</code> and <code>$TARGETARCH</code> build args injected by Buildx.</p>
<p>The <code>-ldflags="-w -s"</code> flag strips the debug symbol table and the DWARF debugging information from the binary. This has no effect on runtime behavior but reduces the binary size by roughly 30%.</p>
<h3 id="heading-stage-2-the-runtime-image">Stage 2: The Runtime Image</h3>
<p><code>FROM alpine:latest</code></p>
<p>This starts a brand-new image from Alpine Linux — a minimal Linux distribution that weighs about 5 MB. Critically, <code>alpine:latest</code> is itself a multi-arch image, so Docker automatically selects the <code>arm64</code> or <code>amd64</code> Alpine variant depending on which platform this stage is built for.</p>
<p>Everything from Stage 1 — the Go toolchain, the source files, the intermediate object files — is discarded. The final image contains <em>only</em> Alpine Linux plus our binary. Compared to a naive single-stage Go image (~300 MB), this approach produces an image under 15 MB.</p>
<p><code>RUN addgroup -S appgroup &amp;&amp; adduser -S appuser -G appgroup</code> and <code>USER appuser</code></p>
<p>These two lines create a non-root user and set it as the active user for the container. Running containers as root is a security risk — if an attacker exploits a vulnerability in your application, they gain root access inside the container. Running as a non-root user limits the blast radius.</p>
<p><code>COPY --from=builder /app/server .</code></p>
<p>This is how multi-stage builds work: the <code>--from=builder</code> flag tells Docker to copy files from the <code>builder</code> stage (Stage 1), not from your local disk. Only the compiled binary (<code>server</code>) makes it into the final image.</p>
<h2 id="heading-step-6-build-and-push-the-multi-arch-image">Step 6: Build and Push the Multi-Arch Image</h2>
<p>With the application and Dockerfile in place, we can now build images for both architectures and push them to Artifact Registry — all in a single command.</p>
<p>From inside the <code>app/</code> directory, run:</p>
<pre><code class="language-bash">docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1 \
  --push \
  .
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual GCP project ID.</p>
<p>Here's what each part of this command does:</p>
<ul>
<li><p><code>docker buildx build</code> — uses the Buildx CLI instead of the standard <code>docker build</code>. Buildx is required for multi-platform builds.</p>
</li>
<li><p><code>--platform linux/amd64,linux/arm64</code> — instructs Buildx to build the image twice: once targeting x86 Intel/AMD machines, and once targeting ARM64. Both builds run in parallel. Because our Dockerfile uses the <code>$BUILDPLATFORM</code> cross-compilation trick, both builds run natively on your machine without QEMU emulation.</p>
</li>
<li><p><code>-t us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1</code> — the full image path in Artifact Registry. The format is always <code>REGION-docker.pkg.dev/PROJECT_ID/REPO_NAME/IMAGE_NAME:TAG</code>.</p>
</li>
<li><p><code>--push</code> — multi-arch images can't be loaded into your local Docker daemon (which only understands single-architecture images). This flag tells Buildx to skip local storage and push the completed Manifest List — with both architecture variants — directly to the registry.</p>
</li>
<li><p><code>.</code> — the build context, the directory Docker scans for the Dockerfile and any files the build needs.</p>
</li>
</ul>
<p>Watch the output as the build runs. You'll see BuildKit working on both platforms simultaneously:</p>
<pre><code class="language-plaintext"> =&gt; [linux/amd64 builder 1/5] FROM golang:1.23-alpine
 =&gt; [linux/arm64 builder 1/5] FROM golang:1.23-alpine
 ...
 =&gt; pushing manifest for us-central1-docker.pkg.dev/.../hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/dc88f558-b4ee-4100-bfe1-eaa943bec9bc.png" alt="Terminal showing docker buildx build output with two parallel build tracks labeled linux/amd64 and linux/arm64, and a final line reading pushing manifest for the Artifact Registry image path." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-verify-the-multi-arch-image-in-artifact-registry">Verify the Multi-Arch Image in Artifact Registry</h3>
<p>Once the push completes, navigate to <strong>GCP Console → Artifact Registry → Repositories → multi-arch-repo</strong> and click on <code>hello-axion</code>.</p>
<p>You won't see a single image — you'll see something labelled <strong>"Image Index"</strong>. That's the Manifest List we created. Click into it, and you'll find two child images with separate digests, one for <code>linux/amd64</code> and one for <code>linux/arm64</code>.</p>
<p>You can also inspect this from the command line:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/28d0e4a4-1d45-4c0b-ac47-34dc3b72c11d.png" alt="Google Cloud Artifact Registry console showing hello-axion as an Image Index with two child images: one labeled linux/amd64 and one labeled linux/arm64, each with its own digest and size." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The output lists every manifest inside the image index. You'll see entries for <code>linux/amd64</code> and <code>linux/arm64</code> — those are our two real images. You'll also see two entries with <code>Platform: unknown/unknown</code> labelled as <code>attestation-manifest</code>. These are <strong>build provenance records</strong> that Docker Buildx automatically attaches to prove how and where the image was built (a supply chain security feature called SLSA attestation).</p>
<p>The two entries you care about are <code>linux/amd64</code> and <code>linux/arm64</code>. Note the digest for the <code>arm64</code> entry — we'll use it in the verification step to confirm the cluster pulled the right variant.</p>
<h2 id="heading-step-7-add-the-axion-arm-node-pool">Step 7: Add the Axion ARM Node Pool</h2>
<p>We have a universal image. Now we need somewhere to run it.</p>
<p>Recall the cluster we created in Step 2 — it's running <code>e2-standard-2</code> x86 machines. We're going to add a second node pool running ARM machines. This is the key architectural move: a <strong>mixed-architecture cluster</strong> where different workloads can be routed to different hardware.</p>
<h3 id="heading-choosing-your-arm-machine-type">Choosing Your ARM Machine Type</h3>
<p>Google Cloud currently offers two ARM-based machine series in GKE:</p>
<table>
<thead>
<tr>
<th>Series</th>
<th>Example type</th>
<th>What it is</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Tau T2A</strong></td>
<td><code>t2a-standard-2</code></td>
<td>First-gen Google ARM (Ampere Altra). Broadly available across regions. Great for getting started.</td>
</tr>
<tr>
<td><strong>Axion (C4A)</strong></td>
<td><code>c4a-standard-2</code></td>
<td>Google's custom ARM chip (Arm Neoverse V2 core). Newest generation, best price-performance. Still expanding availability.</td>
</tr>
</tbody></table>
<p>This tutorial uses <code>t2a-standard-2</code> because it's widely available. The commands are identical for <code>c4a-standard-2</code> — just swap the <code>--machine-type</code> value. If <code>t2a-standard-2</code> isn't available in your zone, GKE will tell you immediately when you run the node pool creation command below, and you can try a neighbouring zone.</p>
<h3 id="heading-create-the-arm-node-pool">Create the ARM Node Pool</h3>
<p>Add the ARM node pool to your existing cluster:</p>
<pre><code class="language-bash">gcloud container node-pools create axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --num-nodes=2 \
  --node-labels=workload-type=arm-optimized
</code></pre>
<p>What each flag does:</p>
<ul>
<li><p><code>--cluster=axion-tutorial-cluster</code> — the name of the cluster we created in Step 2. Node pools are always added to an existing cluster.</p>
</li>
<li><p><code>--zone=us-central1-a</code> — must match the zone you used when creating the cluster.</p>
</li>
<li><p><code>--machine-type=t2a-standard-2</code> — GKE detects this is an ARM machine type and automatically provisions the nodes with an ARM-compatible version of Container-Optimized OS (COS). You don't need to configure anything special for ARM at the OS level.</p>
</li>
<li><p><code>--num-nodes=2</code> — two ARM nodes in the zone, enough to schedule our 3-replica deployment alongside other cluster overhead.</p>
</li>
<li><p><code>--node-labels=workload-type=arm-optimized</code> — attaches a custom label to every node in this pool. We'll use this label in our deployment manifest to target these specific nodes. Using a descriptive custom label (rather than just relying on the automatic <code>kubernetes.io/arch=arm64</code> label) is good practice in real clusters — it communicates the <em>intent</em> of the pool, not just its hardware.</p>
</li>
</ul>
<p>This command takes a few minutes. Once it completes, let's confirm our cluster now has both node pools:</p>
<pre><code class="language-bash">gcloud container clusters get-credentials axion-tutorial-cluster --zone=us-central1-a

kubectl get nodes --label-columns=kubernetes.io/arch
</code></pre>
<p>The <code>get-credentials</code> command configures <code>kubectl</code> to authenticate with your new cluster. The <code>get nodes</code> command then lists all nodes and adds a column showing the <code>kubernetes.io/arch</code> label.</p>
<p>You should see something like:</p>
<pre><code class="language-plaintext">NAME                                    STATUS   ARCH    AGE
gke-...default-pool-abc...              Ready    amd64   15m
gke-...default-pool-def...              Ready    amd64   15m
gke-...axion-pool-jkl...                Ready    arm64   3m
gke-...axion-pool-mno...                Ready    arm64   3m
</code></pre>
<p><code>amd64</code> for the default x86 pool, <code>arm64</code> for our new Axion pool. This <code>kubernetes.io/arch</code> label is applied automatically by GKE — you don't set it, it's derived from the hardware.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/6389f4c6-17fe-4086-982f-39d94dbfa252.png" alt="Terminal output of kubectl get nodes with a ARCH column showing amd64 for two default-pool nodes and arm64 for two axion-pool nodes." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-8-deploy-the-app-to-the-arm-node-pool">Step 8: Deploy the App to the ARM Node Pool</h2>
<p>We have a multi-arch image and a mixed-architecture cluster. Here's something important to understand before writing the deployment manifest: <strong>Kubernetes doesn't know or care about image architecture by default</strong>.</p>
<p>If you applied a standard Deployment right now, the scheduler would look for any available node with enough CPU and memory and place pods there — potentially landing on x86 nodes instead of your ARM Axion nodes. The multi-arch Manifest List would handle this gracefully (the right binary would run regardless), but you'd lose the cost efficiency you provisioned Axion nodes for in the first place.</p>
<p>To guarantee that pods land on ARM nodes and only ARM nodes, we use a <code>nodeSelector</code>.</p>
<h3 id="heading-how-nodeselector-works">How nodeSelector Works</h3>
<p>A <code>nodeSelector</code> is a set of key-value pairs in your pod spec. Before the Kubernetes scheduler places a pod, it checks every available node's labels. If a node doesn't have all the labels in the <code>nodeSelector</code>, the scheduler skips it — the pod will remain in <code>Pending</code> state rather than land on the wrong node.</p>
<p>This is a hard constraint, which is exactly what we want for cost optimization. Contrast this with Node Affinity's soft preference mode (<code>preferredDuringSchedulingIgnoredDuringExecution</code>), which says "try to use ARM, but fall back to x86 if needed." Soft preferences are useful for resilience, but they undermine the whole point of dedicated ARM pools. We want the hard constraint.</p>
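<p>For reference, the soft-preference variant would look like the snippet below in a pod spec. We are <em>not</em> using this in the tutorial; it's shown only to make the contrast concrete:</p>
<pre><code class="language-yaml"># "Prefer ARM, but fall back to any node" — the opposite of our hard pin
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
</code></pre>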
<h3 id="heading-write-the-deployment-manifest">Write the Deployment Manifest</h3>
<p>Create <code>k8s/deployment.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-axion
  labels:
    app: hello-axion
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-axion
  template:
    metadata:
      labels:
        app: hello-axion
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64

      containers:
      - name: hello-axion
        image: us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        resources:
          requests:
            cpu: "250m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "128Mi"
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your project ID. Here's what the key sections do:</p>
<p><code>replicas: 3</code> — tells Kubernetes to keep three instances of this pod running at all times. If one crashes or a node goes down, the scheduler spins up a replacement. With two ARM nodes in <code>axion-pool</code>, the scheduler will spread the three replicas across both nodes, so losing a single node never takes the whole service down.</p>
<p><code>selector.matchLabels</code> and <code>template.metadata.labels</code> — these two blocks must match. The <code>selector</code> tells the Deployment which pods it "owns," and the <code>template.metadata.labels</code> is what those pods will be tagged with. If they don't match, Kubernetes won't be able to manage the pods.</p>
<p><code>nodeSelector: kubernetes.io/arch: arm64</code> — this is the pin. The Kubernetes scheduler filters out every node that doesn't carry this label before considering resource availability. Since GKE automatically applies <code>kubernetes.io/arch=arm64</code> to all ARM nodes, our pods will schedule only onto the <code>axion-pool</code> nodes.</p>
<p><code>livenessProbe</code> — periodically calls <code>GET /healthz</code>. If this check fails a certain number of times in a row (indicating the container has deadlocked or is otherwise unresponsive), Kubernetes restarts the container. <code>initialDelaySeconds: 5</code> gives the server 5 seconds to start up before the first check.</p>
<p><code>readinessProbe</code> — similar to the liveness probe, but with a different purpose. While the readiness probe is failing, Kubernetes removes the pod from the service's load balancer, so no traffic is sent to it. This is important during startup — the pod won't receive traffic until it signals it's ready.</p>
<p><code>resources.requests</code> — reserves <code>250m</code> (25% of a CPU core) and <code>64Mi</code> of memory on the node for this pod. The scheduler uses these numbers to decide whether a node has enough room for the pod. Setting requests is required for sensible bin-packing. Without them, nodes can be silently overcommitted.</p>
<p><code>resources.limits</code> — caps the container at <code>500m</code> CPU and <code>128Mi</code> memory. If the container exceeds these limits, Kubernetes throttles the CPU or kills the container (for memory). This prevents a single misbehaving pod from starving other workloads on the same node.</p>
<h3 id="heading-a-note-on-taints-and-tolerations">A Note on Taints and Tolerations</h3>
<p>Once you're comfortable with <code>nodeSelector</code>, the next step in production clusters is adding a <strong>taint</strong> to your ARM node pool. A taint is a repellent — any pod without an explicit <strong>toleration</strong> for that taint is blocked from landing on the tainted node.</p>
<p>This means other workloads in your cluster can't accidentally consume your ARM capacity. You'd add the taint when creating the pool:</p>
<pre><code class="language-bash"># Add --node-taints to the pool creation command:
--node-taints=workload-type=arm-optimized:NoSchedule
</code></pre>
<p>And a matching toleration in the pod spec:</p>
<pre><code class="language-yaml">tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "arm-optimized"
  effect: "NoSchedule"
</code></pre>
<p>We're not doing this in the tutorial to keep things simple, but it's the pattern production multi-tenant clusters use to enforce hard separation between workload types.</p>
<h3 id="heading-write-the-service-manifest">Write the Service Manifest</h3>
<p>We also need a Kubernetes Service to expose the pods over the network. Create <code>k8s/service.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: hello-axion-svc
spec:
  selector:
    app: hello-axion
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
</code></pre>
<ul>
<li><p><code>selector: app: hello-axion</code> — the Service discovers pods using labels. Any pod with <code>app: hello-axion</code> on it will be added to this Service's load balancer pool.</p>
</li>
<li><p><code>port: 80</code> — the port the Service is reachable on from outside the cluster.</p>
</li>
<li><p><code>targetPort: 8080</code> — the port on the pod that traffic gets forwarded to. Our Go server listens on port 8080, so this must match.</p>
</li>
<li><p><code>type: LoadBalancer</code> — tells GKE to provision an external Google Cloud load balancer and assign it a public IP. This is what makes the Service reachable from the internet.</p>
</li>
</ul>
<h3 id="heading-apply-both-manifests">Apply Both Manifests</h3>
<pre><code class="language-bash">kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
</code></pre>
<p><code>kubectl apply</code> reads each manifest file and creates or updates the resources described in it. If the resources don't exist yet, they're created. If they already exist, Kubernetes only applies the diff — it won't restart pods unnecessarily.</p>
<p>Watch the pods come up in real time:</p>
<pre><code class="language-bash">kubectl get pods -w
</code></pre>
<p>The <code>-w</code> flag watches for changes and prints updates as they happen. You should see pods transition from <code>Pending</code> → <code>ContainerCreating</code> → <code>Running</code>. Once all three show <code>Running</code>, press <code>Ctrl+C</code> to stop watching.</p>
<h2 id="heading-step-9-verify-the-deployment">Step 9: Verify the Deployment</h2>
<p>Everything is running. Now we need evidence — not just that pods are up, but that they're on the right nodes and serving the right binary.</p>
<h3 id="heading-confirm-pod-placement">Confirm Pod Placement</h3>
<pre><code class="language-bash">kubectl get pods -o wide
</code></pre>
<p>The <code>-o wide</code> flag adds extra columns to the output, including the name of the node each pod was scheduled on. Look at the <code>NODE</code> column:</p>
<pre><code class="language-plaintext">NAME                          READY   STATUS    NODE
hello-axion-7b8d9f-abc12      1/1     Running   gke-axion-tutorial-axion-pool-a-...
hello-axion-7b8d9f-def34      1/1     Running   gke-axion-tutorial-axion-pool-b-...
hello-axion-7b8d9f-ghi56      1/1     Running   gke-axion-tutorial-axion-pool-c-...
</code></pre>
<p>All three pods should show node names containing <code>axion-pool</code>. None should show <code>default-pool</code>.</p>
<h3 id="heading-confirm-the-nodes-are-arm">Confirm the Nodes Are ARM</h3>
<p>Take one of those node names and verify its architecture label:</p>
<pre><code class="language-bash">kubectl get node NODE_NAME --show-labels | grep kubernetes.io/arch
</code></pre>
<p>Replace <code>NODE_NAME</code> with one of the node names from the previous command. You should see:</p>
<pre><code class="language-plaintext">kubernetes.io/arch=arm64
</code></pre>
<p>That's the automatic label GKE applied when it provisioned the ARM hardware. Our <code>nodeSelector</code> matched on this label to pin the pods here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/815312ea-e2bf-4106-863e-55cd0bdad5f7.png" alt="Terminal split into two sections: the top showing kubectl get pods -o wide with all pods scheduled on nodes containing axion-pool in the name, and the bottom showing kubectl get node with kubernetes.io/arch=arm64 in the labels output." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-ask-the-application-itself">Ask the Application Itself</h3>
<p>This is the most satisfying verification step. Our Go server reports the architecture of the binary that's running. Let's ask it directly.</p>
<p>Use <code>kubectl port-forward</code> to create a secure tunnel from port 8080 on your local machine to port 8080 on the Deployment:</p>
<pre><code class="language-bash">kubectl port-forward deployment/hello-axion 8080:8080
</code></pre>
<p>This command stays running in the foreground — open a <strong>second terminal window</strong> and run:</p>
<pre><code class="language-bash">curl http://localhost:8080
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">Hello from freeCodeCamp!
Architecture : arm64
OS           : linux
Pod hostname : hello-axion-7b8d9f-abc12
</code></pre>
<p><code>Architecture : arm64</code>. That's our Go binary confirming that it was compiled for ARM64 and is executing on an ARM64 CPU. The single image tag we built does the right thing automatically.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/114ff82d-950f-4059-a1fa-89baffb90b6c.png" alt="Terminal output of curl http://localhost:8080 showing the four-line response: Hello from freeCodeCamp, Architecture: arm64, OS: linux, and the pod hostname." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-the-bonus-see-the-manifest-list-in-action">The Bonus: See the Manifest List in Action</h3>
<p>Want to see the multi-arch image indexing at work? Stop the port-forward, then run:</p>
<pre><code class="language-bash">docker buildx imagetools inspect \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<p>Replace <code>PROJECT_ID</code> with your actual Google Cloud project ID.</p>
<p>You'll see four entries in the manifest list. Two are real images — <code>Platform: linux/amd64</code> and <code>Platform: linux/arm64</code>. The other two are the <code>unknown/unknown</code> attestation manifests covered in Step 6: build provenance records that Buildx attaches automatically.</p>
<p>You may notice that if you check the image digest recorded in a running pod:</p>
<pre><code class="language-bash">kubectl get pod POD_NAME \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
</code></pre>
<p>Replace <code>POD_NAME</code> with one of the pod names from earlier.</p>
<p>The digest returned matches the <strong>top-level manifest list digest</strong>, not the <code>arm64</code>-specific one. This is expected behaviour. Modern Kubernetes (using containerd) records the manifest list digest, not the resolved platform digest. The platform resolution already happened when the node pulled the correct image variant.</p>
<p>The definitive proof that the right binary is running is what you already have: the node labeled <code>kubernetes.io/arch=arm64</code> and the application reporting <code>Architecture: arm64</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5f97fb446ea7602886a16070/7dffe0c8-28cf-4a5d-8459-1e8db3da7dc0.png" alt="top-level manifest list digest" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-10-cost-savings-and-tradeoffs">Step 10: Cost Savings and Tradeoffs</h2>
<p>The hands-on work is done. Let's talk about why any of this is worth the effort.</p>
<h3 id="heading-the-cost-math">The Cost Math</h3>
<p>At the time of writing, here's how ARM compares to equivalent x86 machines on Google Cloud (prices are approximate and change over time — check the <a href="https://cloud.google.com/compute/vm-instance-pricing">official pricing page</a> before making decisions):</p>
<table>
<thead>
<tr>
<th>Instance</th>
<th>vCPU</th>
<th>Memory</th>
<th>Approx. $/hour</th>
</tr>
</thead>
<tbody><tr>
<td><code>n2-standard-4</code> (x86)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.19</td>
</tr>
<tr>
<td><code>t2a-standard-4</code> (Tau ARM)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.14</td>
</tr>
<tr>
<td><code>c4a-standard-4</code> (Axion)</td>
<td>4</td>
<td>16 GB</td>
<td>~$0.15</td>
</tr>
</tbody></table>
<p>That's a raw reduction of roughly 21–26% in compute cost per node. Factor in Google's published claim of up to 65% better price-performance for Axion on relevant workloads — meaning you may need fewer nodes to handle the same traffic — and the savings compound further.</p>
<p>Here's how that looks at scale, for a service running 20 nodes continuously for a year:</p>
<ul>
<li><p>20 × <code>n2-standard-4</code> × $0.19/hour × 8,760 hours = <strong>$33,288/year</strong></p>
</li>
<li><p>20 × <code>t2a-standard-4</code> × $0.14/hour × 8,760 hours = <strong>$24,528/year</strong></p>
</li>
</ul>
<p>That's roughly <strong>$8,760 saved annually</strong> on compute, before committed use discounts (which further widen the gap).</p>
<h3 id="heading-when-arm-is-the-right-choice">When ARM Is the Right Choice</h3>
<p>ARM works best for:</p>
<ul>
<li><p><strong>Stateless API servers and web applications</strong> — like the app we built. ARM excels at high-throughput, low-latency network workloads.</p>
</li>
<li><p><strong>Background workers and queue processors</strong> — long-running services that don't depend on x86-specific binaries.</p>
</li>
<li><p><strong>Microservices written in Go, Rust, or Python</strong> — these ecosystems have mature ARM64 support, so the same code typically builds and runs on ARM without changes.</p>
</li>
</ul>
<h3 id="heading-when-to-proceed-carefully">When to Proceed Carefully</h3>
<ul>
<li><p><strong>Native library dependencies</strong> — some older C libraries, proprietary SDKs, or compiled ML model-serving runtimes don't have ARM64 builds. Always audit your dependency tree before migrating.</p>
</li>
<li><p><strong>CI pipelines need ARM too</strong> — your automated tests should run on ARM, not just x86. An image that silently fails only on ARM is harder to debug than one that never claimed ARM support.</p>
</li>
<li><p><strong>Profile before optimizing</strong> — the cost savings are real, but measure your actual workload behavior on ARM before committing. Not every workload benefits equally.</p>
</li>
</ul>
<h2 id="heading-cleanup">Cleanup</h2>
<p>When you're done, clean up to avoid ongoing charges:</p>
<pre><code class="language-bash"># Remove the Kubernetes resources from the cluster
kubectl delete -f k8s/

# Delete the ARM node pool
gcloud container node-pools delete axion-pool \
  --cluster=axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the cluster itself
gcloud container clusters delete axion-tutorial-cluster \
  --zone=us-central1-a

# Delete the images from Artifact Registry (optional — storage costs are minimal)
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/multi-arch-repo/hello-axion:v1
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Let's recap what you built and why each part matters.</p>
<p>You started with a Go application, a Dockerfile, and a <code>docker buildx build</code> command that produced two images — one for x86, one for ARM64 — wrapped in a single Manifest List tag. Any server that pulls that tag gets the right binary automatically, without you maintaining separate pipelines or separate tags.</p>
<p>You provisioned a GKE cluster with two node pools running different CPU architectures, then used <code>nodeSelector</code> to make sure your ARM-optimized workload lands only on the ARM Axion nodes — not on x86 by accident. The result is a deployment that's both architecture-correct and cost-efficient.</p>
<p>The patterns you practiced here don't stop at this demo. The same Dockerfile technique works for any language with cross-compilation support. The same <code>nodeSelector</code> approach works for any workload you want to pin to ARM. As more teams migrate services to ARM over the coming years, having these skills will be a real asset.</p>
<p><strong>Where to go from here:</strong></p>
<ul>
<li><p>Add a GitHub Actions workflow that runs <code>docker buildx build --platform linux/amd64,linux/arm64</code> on every push, automating this entire process in CI.</p>
</li>
<li><p>Audit one of your existing stateless services for ARM compatibility and try migrating it.</p>
</li>
<li><p>Explore <strong>Node Affinity</strong> as a softer alternative to <code>nodeSelector</code> for workloads that can run on either architecture but prefer ARM.</p>
</li>
<li><p>Look into <strong>GKE Autopilot</strong>, which now supports ARM nodes and handles node pool management automatically.</p>
</li>
</ul>
<p>Happy building.</p>
<h2 id="heading-project-file-structure">Project File Structure</h2>
<pre><code class="language-plaintext">hello-axion/
├── app/
│   ├── main.go          — Go HTTP server
│   ├── go.mod           — Go module definition
│   └── Dockerfile       — Multi-stage Dockerfile
└── k8s/
    ├── deployment.yaml  — Deployment with nodeSelector and probes
    └── service.yaml     — LoadBalancer Service
</code></pre>
<p>All source files for this tutorial are available in the companion GitHub repository: <a href="https://github.com/Amiynarh/multi-arch-docker-gke-arm">https://github.com/Amiynarh/multi-arch-docker-gke-arm</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Self-Host Your Own Server Monitoring Dashboard Using Uptime Kuma and Docker ]]>
                </title>
                <description>
                    <![CDATA[ As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains. And while many SaaS tools can  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/self-host-uptime-kuma-docker/</link>
                <guid isPermaLink="false">69d4185f40c9cabf44851652</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ monitoring ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Ubuntu ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 20:32:31 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ea068a20-bc19-400a-a42e-1bbb7e492da8.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>As a developer, there's nothing worse than finding out from an angry user that your website is down. Usually, you don't know your server crashed until someone complains.</p>
<p>And while many SaaS tools can monitor your site, they often charge high monthly fees for simple alerts.</p>
<p>My goal with this article is to help you stop paying those expensive fees by showing you a powerful, free, open-source alternative called Uptime Kuma.</p>
<p>In this guide, you'll learn how to use Docker to deploy Uptime Kuma safely on a local Ubuntu machine.</p>
<p>By the end of this tutorial, you'll have set up your own private server monitoring dashboard in less than 10 minutes and created an automated Discord alert to ping your phone if your website goes offline.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-update-packages-and-prepare-the-firewall">Step 1: Update Packages and Prepare the Firewall</a></p>
</li>
<li><p><a href="#heading-step-2-create-the-docker-compose-file">Step 2: Create the Docker Compose File</a></p>
</li>
<li><p><a href="#heading-step-3-start-the-application">Step 3: Start the Application</a></p>
</li>
<li><p><a href="#heading-step-4-access-the-dashboard">Step 4: Access the Dashboard</a></p>
</li>
<li><p><a href="#heading-step-5-use-case-monitor-a-website-and-send-discord-alerts">Step 5: Use Case – Monitor a Website and Send Discord Alerts</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>An Ubuntu machine (like a local server, VM, or desktop).</p>
</li>
<li><p>Docker and Docker Compose installed.</p>
</li>
<li><p>Basic knowledge of the Linux terminal.</p>
</li>
</ul>
<h2 id="heading-step-1-update-packages-and-prepare-the-firewall">Step 1: Update Packages and Prepare the Firewall</h2>
<p>First, you'll want to make sure your system has the newest updates. Then, you'll install the Uncomplicated Firewall (UFW) and open the network "door" (port) that Uptime Kuma uses for the dashboard. You'll also need to allow SSH so you don't lock yourself out.</p>
<p>Run these commands in your terminal:</p>
<ol>
<li>Update your packages:</li>
</ol>
<pre><code class="language-shell">sudo apt update &amp;&amp; sudo apt upgrade -y
</code></pre>
<ol>
<li>Install the firewall:</li>
</ol>
<pre><code class="language-shell">sudo apt install ufw -y
</code></pre>
<ol>
<li>Allow SSH and open port 3001:</li>
</ol>
<pre><code class="language-shell">sudo ufw allow ssh
sudo ufw allow 3001/tcp
</code></pre>
<ol>
<li>Enable the firewall:</li>
</ol>
<pre><code class="language-shell">sudo ufw enable
sudo ufw reload
</code></pre>
<h2 id="heading-step-2-create-the-docker-compose-file">Step 2: Create the Docker Compose File</h2>
<p>Using a <code>docker-compose.yml</code> file is the professional way to manage Docker containers. It keeps your setup organised in one single place.</p>
<p>To start, create a new folder for your project and enter it:</p>
<pre><code class="language-shell">mkdir uptime-kuma &amp;&amp; cd uptime-kuma
</code></pre>
<p>Then create the configuration file:</p>
<pre><code class="language-shell">nano docker-compose.yml
</code></pre>
<p>Paste the following code into the editor:</p>
<pre><code class="language-yaml">services:
  uptime-kuma:
    image: louislam/uptime-kuma:2
    restart: unless-stopped
    volumes:
      - ./data:/app/data
    ports:
      - "3001:3001"
</code></pre>
<p><strong>Note</strong>: The <code>./data:/app/data</code> line is very important. It saves your database in a normal folder on your machine, making it easy to back up later.</p>
<p>Finally, save and exit: Press <code>CTRL + X</code>, then <code>Y</code>, then <code>Enter</code>.</p>
<h2 id="heading-step-3-start-the-application">Step 3: Start the Application</h2>
<p>Now, tell Docker to read your file and start the monitoring service in the background.</p>
<pre><code class="language-shell">docker compose up -d
</code></pre>
<p><strong>How to verify:</strong> Docker will pull the Uptime Kuma image and start the container. When it finishes, your terminal should print <code>Started uptime-kuma</code>.</p>
<h2 id="heading-step-4-access-the-dashboard">Step 4: Access the Dashboard</h2>
<p>To access the dashboard, first open your web browser and go to <code>http://localhost:3001</code> (or your machine's local IP address).</p>
<p>When asked to choose the database, select <strong>SQLite</strong>. It's simple, fast, and requires no extra setup.</p>
<p>Then create an account and choose a secure admin username and password.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/02913589-020e-4a8a-aa7a-1bf70a9244c6.png" alt="02913589-020e-4a8a-aa7a-1bf70a9244c6" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-5-use-case-monitor-a-website-and-send-discord-alerts">Step 5: Use Case – Monitor a Website and Send Discord Alerts</h2>
<p>Now you'll put Uptime Kuma to work by monitoring a live website and setting up an alert. Just follow these steps:</p>
<ol>
<li><p>Click Add New Monitor.</p>
</li>
<li><p>Set the Monitor Type to <code>HTTP(s)</code>.</p>
</li>
<li><p>Give it a Friendly Name (e.g., "My Blog") and enter your website's URL.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/74567f1e-acc4-480f-b969-7883e01aa459.png" alt="74567f1e-acc4-480f-b969-7883e01aa459" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-pro-tip-how-to-fix-down-errors-bot-protection">Pro-Tip: How to Fix "Down" Errors (Bot Protection)</h3>
<p>If your site uses strict security, it might block Uptime Kuma and say your site is "Down" with a 403 Forbidden error.</p>
<p><strong>The Fix:</strong> Scroll down to Advanced, find the User Agent box, and paste this text to make Uptime Kuma look like a normal Chrome browser:</p>
<p><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36</code></p>
<h3 id="heading-add-a-discord-alert">Add a Discord Alert</h3>
<p>To get a message on your phone when your site goes down:</p>
<ol>
<li><p>On the right side of the monitor screen, click Setup Notification.</p>
</li>
<li><p>Select Discord from the dropdown list.</p>
</li>
<li><p>Paste a Discord Webhook URL (you can create one in your Discord server settings under Integrations).</p>
</li>
<li><p>Click Test to receive a test ping, then click Save.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Congratulations! You just took control of your server health. By deploying Uptime Kuma, you replaced an expensive SaaS subscription with a powerful, free monitoring tool that alerts you the second a project goes offline.</p>
<p><strong>Let’s connect!</strong> I am a developer and technical writer specialising in writing step-by-step guides and workflows. You can find my latest projects on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me directly on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Bank Ledger in Golang with PostgreSQL using the Double-Entry Accounting Principle. ]]>
                </title>
                <description>
                    <![CDATA[ The Hidden Bugs in How Most Developers Store Money Imagine you're building the backend for a million-dollar fintech app. You store each user's balance as a single number in the database. It feels simp ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-bank-ledger-in-go-with-postgresql-using-the-double-entry-accounting-principle/</link>
                <guid isPermaLink="false">69c4173d10e664c5dac8cea1</guid>
                
                    <category>
                        <![CDATA[ Go Language ]]>
                    </category>
                
                    <category>
                        <![CDATA[ golang ]]>
                    </category>
                
                    <category>
                        <![CDATA[ PostgreSQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ banking ]]>
                    </category>
                
                    <category>
                        <![CDATA[ accounting ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ double entry ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Paul Babatuyi ]]>
                </dc:creator>
                <pubDate>Wed, 25 Mar 2026 17:11:25 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/faea1d4c-5319-4746-96b0-315f37017e26.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <h2 id="heading-the-hidden-bugs-in-how-most-developers-store-money">The Hidden Bugs in How Most Developers Store Money</h2>
<p>Imagine you're building the backend for a million-dollar fintech app. You store each user's balance as a single number in the database. It feels simple: just update the number when money moves.</p>
<p>But with one line of code like <code>UPDATE accounts SET balance = balance - 100</code>, you've created a system that can silently lose millions. A server crash, a race condition, or a clever attack, and suddenly money vanishes or appears out of thin air.</p>
<p>There's no audit trail, no way to know what happened, and no way to prove it didn't happen on purpose.</p>
<p>This isn't just a theoretical risk. It's a trap that's caught even experienced developers. The world's most trusted financial systems avoid it by using double-entry accounting. Every transaction creates two records: a debit on one account, a credit on another. This lets you reconstruct every cent from history, catch inconsistencies, and audit every transaction.</p>
<p>There are no deletes, and no silent updates. Just an append-only trail that makes fraud and bugs much harder to hide.</p>
<p>In this guide, you'll build a robust backend in Go and PostgreSQL, using patterns inspired by real fintech companies. You'll learn how to design a double-entry ledger, generate type-safe SQL with sqlc, and write transactions that are safe even under heavy load.</p>
<p>By the end, you'll understand why these patterns matter –&nbsp;and how to use them to build software you can trust with real money.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites-and-project-overview">Prerequisites and Project Overview</a></p>
</li>
<li><p><a href="#heading-the-double-entry-foundation-how-every-penny-is-accounted-for">The Double-Entry Foundation</a></p>
</li>
<li><p><a href="#heading-type-safe-sql-with-sqlc-no-more-surprises">Type-Safe SQL with sqlc</a></p>
</li>
<li><p><a href="#heading-the-store-layer-transactions-and-automatic-retries">The Store Layer: Transactions and Retries</a></p>
</li>
<li><p><a href="#heading-the-service-layer-where-business-logic-meets-double-entry">The Service Layer: Business Logic</a></p>
</li>
<li><p><a href="#heading-the-api-layer-secure-predictable-and-boring-by-design">The API Layer</a></p>
</li>
<li><p><a href="#heading-running-it-locally-your-first-end-to-end-test">Running It Locally</a></p>
</li>
<li><p><a href="#heading-testing-prove-the-system-works">Testing: Prove the System Works</a></p>
</li>
<li><p><a href="#heading-deployment-engineering-decisions-that-matter-in-production">Deployment</a></p>
</li>
<li><p><a href="#heading-conclusion-building-for-the-real-world">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-project-resources">Project Resources:</h3>
<p>Here's the project repository: <a href="https://github.com/PaulBabatuyi/double-entry-bank-Go">https://github.com/PaulBabatuyi/double-entry-bank-Go</a></p>
<p>And here's the front-end repository: <a href="https://github.com/PaulBabatuyi/double-entry-bank">https://github.com/PaulBabatuyi/double-entry-bank</a></p>
<p>You can find the live frontend here: <a href="https://golangbank.app">https://golangbank.app</a></p>
<img src="https://cdn.hashnode.com/uploads/covers/6968db1b0578d1643036e600/2240e617-5a6d-4742-995f-6ecb8fecb56e.png" alt="Double-entry frontend transaction" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>You can find the live Swagger back-end API here: <a href="https://golangbank.app/swagger">https://golangbank.app/swagger</a></p>
<img src="https://cdn.hashnode.com/uploads/covers/6968db1b0578d1643036e600/3a6c1e02-5ceb-43e4-86a3-0530735b79cb.png" alt="Backend API endpoints (Swagger)" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-prerequisites-and-project-overview">Prerequisites and Project Overview</h2>
<p>Before you dive in, make sure you have the following installed:</p>
<ul>
<li><p>Go 1.23 or newer</p>
</li>
<li><p>Docker and Docker Compose</p>
</li>
<li><p><code>golang-migrate</code> CLI: <code>go install github.com/golang-migrate/migrate/v4/cmd/migrate@latest</code></p>
</li>
<li><p><code>sqlc</code> CLI: <code>go install github.com/sqlc-dev/sqlc/cmd/sqlc@latest</code></p>
</li>
</ul>
<p>You'll also need a basic understanding of PostgreSQL and REST APIs to follow along.</p>
<p>If you've built a CRUD app before, you're ready for this. The project uses sqlc for type-safe queries, JWT for authentication, and a layered architecture that keeps business logic, persistence, and HTTP handling cleanly separated.</p>
<p>Here's how the project is organized:</p>
<pre><code class="language-plaintext">.
├── cmd/                # Server entrypoint
│   └── main.go
├── internal/
│   ├── api/            # HTTP handlers &amp; middleware
│   ├── db/             # Store layer (transactions, sqlc)
│   └── service/        # Business logic (ledger operations)
├── postgres/
│   ├── migrations/     # SQL migration files
│   └── queries/        # sqlc query files
├── docs/               # Swagger docs
├── Dockerfile, docker-compose.yml, Makefile
└── README.md
</code></pre>
<p>The architecture follows a clear three-layer pattern:</p>
<ul>
<li><p><strong>API Layer</strong>: Handles HTTP requests, authentication, and routing.</p>
</li>
<li><p><strong>Service Layer</strong>: Contains the business logic. This is where double-entry rules are enforced.</p>
</li>
<li><p><strong>Store Layer</strong>: Manages database transactions and persistence.</p>
</li>
</ul>
<p>Every request flows from the handler, through the service, to the store, and finally to PostgreSQL. This separation makes the code easier to test, debug, and extend.</p>
<h3 id="heading-backend-request-flow">Backend Request Flow</h3>
<pre><code class="language-mermaid">graph TD
    A[HTTP Request] --&gt; B[Handler - API Layer]
    B --&gt; C[LedgerService - Business Logic]
    C --&gt; D[Store - Persistence Layer]
    D --&gt; E[(PostgreSQL)]
    E --&gt; D
    D --&gt; C
    C --&gt; B
    B --&gt; F[HTTP Response]
</code></pre>
<h2 id="heading-the-double-entry-foundation-how-every-penny-is-accounted-for">The Double-Entry Foundation: How Every Penny is Accounted For</h2>
<p>Let's get to the heart of what makes this system bulletproof: double-entry accounting. Every operation – a deposit, withdrawal, or transfer&nbsp;– creates two entries that always balance. This is the secret sauce that keeps banks, payment apps, and even crypto exchanges from losing track of money.</p>
<p>Picture a simple deposit of $1,000:</p>
<pre><code class="language-plaintext">| Account              | Debit   | Credit  |
|----------------------|---------|---------|
| User Account         |         | 1,000   |
| Settlement Account   | 1,000   |         |
</code></pre>
<p>Total debits always equal total credits. This is the fundamental rule. Every single operation in this system produces exactly this structure, with no exceptions.</p>
<p>Now picture a $200 transfer from User A to User B. Again there are exactly two entries – a debit for the sender and a credit for the receiver – and neither touches the settlement account:</p>
<pre><code class="language-plaintext">| Account       | Debit   | Credit  | Description           |
|---------------|---------|---------|-----------------------|
| User A        | 200     |         | Transfer to User B    |
| User B        |         | 200     | Transfer from User A  |
</code></pre>
<p>Both entries share the same <code>transaction_id</code>, so you can always retrieve the complete picture of what happened with a single query. There's no guessing and no reconstructing, as the ledger tells the full story.</p>
<h3 id="heading-why-the-settlement-account-goes-negative">Why the Settlement Account Goes Negative</h3>
<p>This trips up newcomers, so it's worth explaining explicitly. When a user deposits $1,000, the settlement account is debited $1,000. After several user deposits, the settlement balance will be negative. That's correct and expected: it represents the total amount of real-world money currently held inside the system on behalf of users. The invariant is:</p>
<pre><code class="language-plaintext">SUM(all user account balances) + settlement balance = 0
</code></pre>
<p>If that ever doesn't hold, something is broken.</p>
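<p>If you want to verify this invariant programmatically, a background check can sum every balance and assert the total is zero. The sketch below assumes a <code>*sql.DB</code> handle and the <code>shopspring/decimal</code> package the project already uses – <code>checkLedgerInvariant</code> itself is a hypothetical helper, not part of the repository:</p>
<pre><code class="language-go">import (
    "context"
    "database/sql"
    "fmt"

    "github.com/shopspring/decimal"
)

// checkLedgerInvariant sums every account balance (users plus settlement)
// and verifies the total is exactly zero. Hypothetical helper.
func checkLedgerInvariant(ctx context.Context, db *sql.DB) error {
    var total string
    err := db.QueryRowContext(ctx,
        `SELECT COALESCE(SUM(balance), 0)::TEXT FROM accounts`).Scan(&amp;total)
    if err != nil {
        return err
    }
    sum, err := decimal.NewFromString(total)
    if err != nil {
        return err
    }
    if !sum.IsZero() {
        return fmt.Errorf("ledger invariant violated: balances sum to %s, want 0", total)
    }
    return nil
}
</code></pre>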
<h3 id="heading-enforcing-the-rules-in-the-database">Enforcing the Rules in the Database</h3>
<p>The database itself enforces these rules, not just the application code. Here's the core of the <code>entries</code> table migration:</p>
<pre><code class="language-sql">CREATE TABLE IF NOT EXISTS entries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id UUID NOT NULL REFERENCES accounts(id) ON DELETE RESTRICT,
    debit NUMERIC(19,4) NOT NULL DEFAULT 0.0000 CHECK (debit &gt;= 0),
    credit NUMERIC(19,4) NOT NULL DEFAULT 0.0000 CHECK (credit &gt;= 0),
    transaction_id UUID NOT NULL,
    operation_type operation_type NOT NULL,
    description TEXT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT check_single_side CHECK (
        (debit &gt; 0 AND credit = 0) OR (debit = 0 AND credit &gt; 0)
    )
);
</code></pre>
<p>Let's break down why each piece matters:</p>
<ul>
<li><p><strong>Single-sided entries are impossible.</strong> The <code>check_single_side</code> constraint means every entry must be either a debit or a credit, never both. If you try to insert an invalid row, the database rejects it – there's no way around it.</p>
</li>
<li><p><strong>Every transaction is linked.</strong> Both the debit and credit entries share the same <code>transaction_id</code> (a UUID). This lets you fetch both sides of any operation instantly, making audits and debugging straightforward.</p>
</li>
<li><p><strong>Operation types are explicit.</strong> The <code>operation_type</code> column is an enum at the database level, so only valid types like <code>deposit</code>, <code>withdrawal</code>, or <code>transfer</code> are allowed. There are no typos and no surprises.</p>
</li>
</ul>
<h3 id="heading-the-settlement-account-the-systems-anchor">The Settlement Account: The System's Anchor</h3>
<p>Every real-world ledger needs a way to represent money entering or leaving the system. That's what the settlement account does. Here's how it's seeded in the database:</p>
<pre><code class="language-sql">INSERT INTO accounts (id, name, balance, currency, is_system)
SELECT gen_random_uuid(), 'Settlement Account', 0.0000, 'USD', TRUE
WHERE NOT EXISTS (
    SELECT 1 FROM accounts WHERE is_system = TRUE AND name = 'Settlement Account'
);
</code></pre>
<p>The settlement account represents the "outside world." When a user deposits money, it comes from the settlement account. When they withdraw, it goes back. Using <code>WHERE NOT EXISTS</code> makes this migration idempotent –&nbsp;that is, safe to run multiple times without creating duplicates.</p>
<h2 id="heading-type-safe-sql-with-sqlc-no-more-surprises">Type-Safe SQL with sqlc: No More Surprises</h2>
<p>In financial systems, you can't afford surprises from your database layer. That's why this project uses sqlc, a tool that turns your SQL queries into type-safe Go code at compile time.</p>
<p>With sqlc, you see exactly what SQL runs, catch mistakes before they hit production, and avoid the "magic" (and hidden bugs) of most ORMs. Every query is explicit, every type is checked, and you get the best of both worlds: raw SQL power with Go's safety.</p>
<h3 id="heading-why-numeric-becomes-string-and-not-float64">Why NUMERIC Becomes String (and Not float64)</h3>
<p>Here's a subtle but critical detail from <code>sqlc.yaml</code>:</p>
<pre><code class="language-yaml">overrides:
    - db_type: "pg_catalog.numeric"
      go_type: "string"
    - column: "entries.debit"
      go_type: "string"
    - column: "entries.credit"
      go_type: "string"
    - column: "accounts.balance"
      go_type: "string"
    - db_type: "operation_type"
      go_type: "string"
</code></pre>
<p><strong>Why string, not float64?</strong> Floating point arithmetic is imprecise. <code>0.1 + 0.2</code> in most programming languages does not equal exactly <code>0.3</code>.</p>
<p>For money, you need exact decimal arithmetic. This project uses <code>shopspring/decimal</code> for all calculations and stores amounts as strings, converting at the service layer boundary. The database column itself is <code>NUMERIC(19,4)</code>, which stores exact decimals – no float rounding ever touches your money.</p>
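<p>You can watch the problem happen in a few lines of Go – this is just an illustration, not project code:</p>
<pre><code class="language-go">package main

import (
    "fmt"

    "github.com/shopspring/decimal"
)

func main() {
    // float64: binary floating point can't represent 0.1 or 0.2 exactly.
    x, y := 0.1, 0.2
    fmt.Println(x+y == 0.3)    // false
    fmt.Printf("%.17f\n", x+y) // 0.30000000000000004

    // decimal: exact base-10 arithmetic, the same model as NUMERIC(19,4).
    a := decimal.RequireFromString("0.1")
    b := decimal.RequireFromString("0.2")
    fmt.Println(a.Add(b).Equal(decimal.RequireFromString("0.3"))) // true
}
</code></pre>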
<h3 id="heading-preventing-race-conditions-locking-with-for-update">Preventing Race Conditions: Locking with FOR UPDATE</h3>
<p>One of the most important queries in the system is <code>GetAccountForUpdate</code>:</p>
<pre><code class="language-sql">SELECT * FROM accounts
WHERE id = $1
LIMIT 1
FOR UPDATE; -- locks row for update, prevents TOCTOU races
</code></pre>
<p>This query uses <code>FOR UPDATE</code> to lock the account row during a transaction. Why? Imagine two requests both see a $500 balance and both try to withdraw $400. Without locking, both would succeed, and you'd end up with a negative balance. With <code>FOR UPDATE</code>, the second transaction waits until the first finishes, eliminating this classic race condition.</p>
<h3 id="heading-calculating-the-true-balance-always-trust-the-entries">Calculating the True Balance: Always Trust the Entries</h3>
<p>The real source of truth for any account is the sum of its entries, not the denormalized <code>balance</code> column. Here's the reconciliation query:</p>
<pre><code class="language-sql">SELECT CAST(
    (COALESCE(SUM(credit), 0::NUMERIC) - COALESCE(SUM(debit), 0::NUMERIC))
    AS NUMERIC(19,4)
) AS calculated_balance
FROM entries
WHERE account_id = $1;
</code></pre>
<p>This computes the true balance from the ledger itself. It's how you catch bugs, audit the system, and prove that every penny is accounted for. The <code>balance</code> column on accounts is a denormalized cache for fast reads –&nbsp;and this query is the ground truth that validates it.</p>
<h2 id="heading-the-store-layer-transactions-and-automatic-retries">The Store Layer: Transactions and Automatic Retries</h2>
<p>Every financial operation in this system runs inside a transaction –&nbsp;no exceptions. This is enforced by the <code>ExecTx</code> pattern in the store layer:</p>
<pre><code class="language-go">func (store *Store) ExecTx(ctx context.Context, fn func(q *sqlc.Queries) error) error {
    const maxAttempts = 10
    var lastErr error
    for attempt := 0; attempt &lt; maxAttempts; attempt++ {
        lastErr = store.execTxOnce(ctx, fn)
        if lastErr == nil {
            return nil
        }
        if !isSerializationError(lastErr) {
            return lastErr
        }
        if attempt &lt; maxAttempts-1 {
            if waitErr := sleepWithContext(ctx, retryWait(attempt)); waitErr != nil {
                return waitErr
            }
        }
    }
    return fmt.Errorf("transaction failed after %d attempts due to serialization conflicts: %w", maxAttempts, lastErr)
}
</code></pre>
<h3 id="heading-why-serializable-isolation">Why Serializable Isolation?</h3>
<p>The transaction uses PostgreSQL's strictest isolation level: <code>sql.LevelSerializable</code>. This is like running transactions one at a time, eliminating entire classes of concurrency bugs. If two operations would conflict, PostgreSQL aborts one and returns a serialization error (SQLSTATE 40001).</p>
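<p>The single-attempt helper that <code>ExecTx</code> wraps isn't shown above. It looks roughly like this – a sketch assuming the store holds a <code>*sql.DB</code> as <code>store.db</code> and uses sqlc's generated <code>New</code> constructor; the repository's version may differ in details:</p>
<pre><code class="language-go">func (store *Store) execTxOnce(ctx context.Context, fn func(q *sqlc.Queries) error) error {
    // Open the transaction at PostgreSQL's strictest isolation level.
    tx, err := store.db.BeginTx(ctx, &amp;sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return err
    }
    q := sqlc.New(tx) // bind the generated queries to this transaction
    if err := fn(q); err != nil {
        if rbErr := tx.Rollback(); rbErr != nil {
            return fmt.Errorf("tx error: %v, rollback error: %v", err, rbErr)
        }
        return err
    }
    return tx.Commit()
}
</code></pre>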
<h3 id="heading-automatic-retries-handling-real-world-concurrency">Automatic Retries: Handling Real-World Concurrency</h3>
<p>When a serialization error occurs, the code automatically retries with exponential backoff:</p>
<pre><code class="language-go">func retryWait(attempt int) time.Duration {
    base := 50 * time.Millisecond
    for i := 0; i &lt; attempt; i++ {
        base *= 2
        if base &gt;= time.Second {
            return time.Second
        }
    }
    return base
}

func sleepWithContext(ctx context.Context, d time.Duration) error {
    select {
    case &lt;-ctx.Done():
        return ctx.Err()
    case &lt;-time.After(d):
        return nil
    }
}
</code></pre>
<p>The backoff starts at 50ms and doubles each attempt, capping at 1 second. Up to 10 attempts are made. If the client disconnects mid-retry, <code>sleepWithContext</code> detects the cancelled context and returns immediately. This means no wasted resources.</p>
<h2 id="heading-the-service-layer-where-business-logic-meets-double-entry">The Service Layer: Where Business Logic Meets Double-Entry</h2>
<p>The service layer is the heart of the system. Its job is to translate business operations – deposits, withdrawals, transfers – into double-entry journal entries that always balance.</p>
<h3 id="heading-deposit-crediting-the-user-debiting-the-settlement">Deposit: Crediting the User, Debiting the Settlement</h3>
<p>Every deposit creates two entries: a credit to the user's account and a matching debit to the settlement account. Both entries share the same transaction ID.</p>
<pre><code class="language-go">func (s *LedgerService) Deposit(ctx context.Context, accountID uuid.UUID, amountStr string) error {
    amount, err := validatePositiveAmount(amountStr)
    if err != nil {
        return err
    }
    return s.store.ExecTx(ctx, func(q *sqlc.Queries) error {
        settlement, err := q.GetSettlementAccountForUpdate(ctx)
        if err != nil {
            return fmt.Errorf("settlement account not found: %w", err)
        }
        account, err := q.GetAccountForUpdate(ctx, accountID)
        if err != nil {
            return fmt.Errorf("account not found: %w", err)
        }
        if account.Currency != settlement.Currency {
            return ErrCurrencyMismatch
        }
        txID := uuid.New()
        // 1. Credit user account
        _, err = q.CreateEntry(ctx, sqlc.CreateEntryParams{
            AccountID:     accountID,
            Debit:         decimal.Zero.StringFixed(4),
            Credit:        amount.StringFixed(4),
            TransactionID: txID,
            OperationType: "deposit",
            Description:   sql.NullString{String: "External deposit", Valid: true},
        })
        if err != nil { return err }
        // 2. Debit settlement (opposing entry)
        _, err = q.CreateEntry(ctx, sqlc.CreateEntryParams{
            AccountID:     settlement.ID,
            Debit:         amount.StringFixed(4),
            Credit:        decimal.Zero.StringFixed(4),
            TransactionID: txID,
            OperationType: "deposit",
            Description:   sql.NullString{String: fmt.Sprintf("Deposit to account %s", accountID), Valid: true},
        })
        if err != nil { return err }
        // 3. Update both balances atomically
        if err = q.UpdateAccountBalance(ctx, sqlc.UpdateAccountBalanceParams{
            Balance: amount.StringFixed(4), ID: accountID,
        }); err != nil { return err }
        return q.UpdateAccountBalance(ctx, sqlc.UpdateAccountBalanceParams{
            Balance: amount.Neg().StringFixed(4), ID: settlement.ID,
        })
    })
}
</code></pre>
<p>Two things are worth highlighting. First, both accounts are locked with <code>GetAccountForUpdate</code> and <code>GetSettlementAccountForUpdate</code> before any entries are written. This prevents any other concurrent transaction from reading a stale balance and acting on it.</p>
<p>Second, <code>amount.Neg()</code> is used to debit the settlement. Its balance goes down, representing real money now held inside the system.</p>
<h3 id="heading-withdraw-debiting-the-user-crediting-the-settlement">Withdraw: Debiting the User, Crediting the Settlement</h3>
<p>Withdrawals are the mirror image of deposits. The key difference is the insufficient funds check, which must happen inside the transaction after the lock is acquired:</p>
<pre><code class="language-go">balanceDec, err := decimal.NewFromString(account.Balance)
if err != nil {
    return errors.New("invalid balance")
}
if balanceDec.LessThan(amount) {
    return ErrInsufficientFunds
}
</code></pre>
<p>Checking balance inside the transaction after <code>FOR UPDATE</code> is critical. Checking it before, outside the transaction, would create a classic time-of-check-to-time-of-use (TOCTOU) race. Two concurrent withdrawals could both pass the check, then both execute, overdrawing the account.</p>
<p>The entries for a $500 withdrawal look like this:</p>
<pre><code class="language-plaintext">| Account              | Debit   | Credit  |
|----------------------|---------|---------|
| User Account         | 500     |         |
| Settlement Account   |         | 500     |
</code></pre>
<p>The settlement is credited because real money is leaving the system, and it's being "returned" to the outside world.</p>
<h3 id="heading-transfer-user-to-user-no-settlement-involved">Transfer: User-to-User, No Settlement Involved</h3>
<p>Transfers move money directly between two user accounts. The settlement account isn't involved. Both accounts are locked, currency is validated, and an insufficient funds check runs before any entries are created:</p>
<pre><code class="language-go">func (s *LedgerService) Transfer(ctx context.Context, fromID, toID uuid.UUID, amountStr string) error {
    amount, err := validatePositiveAmount(amountStr)
    if err != nil { return err }
    if fromID == toID {
        return ErrSameAccountTransfer
    }
    return s.store.ExecTx(ctx, func(q *sqlc.Queries) error {
        fromAcc, err := q.GetAccountForUpdate(ctx, fromID)
        if err != nil { return err }
        toAcc, err := q.GetAccountForUpdate(ctx, toID)
        if err != nil { return err }
        if fromAcc.Currency != toAcc.Currency {
            return ErrCurrencyMismatch
        }
        fromBalance, _ := decimal.NewFromString(fromAcc.Balance)
        if fromBalance.LessThan(amount) {
            return ErrInsufficientFunds
        }
        txID := uuid.New()
        // Debit sender, credit receiver — same transaction ID
        // ... CreateEntry calls + UpdateAccountBalance calls
    })
}
</code></pre>
<p>A $200 transfer creates exactly two entries under the same <code>transaction_id</code>:</p>
<pre><code class="language-plaintext">| Account  | Debit   | Credit  |
|----------|---------|---------|
| Sender   | 200     |         |
| Receiver |         | 200     |
</code></pre>
<h3 id="heading-reconcileaccount-trust-but-verify">ReconcileAccount: Trust, But Verify</h3>
<p>Reconciliation is how you prove the system is correct. The <code>ReconcileAccount</code> function compares the stored <code>balance</code> column against the sum of all credits minus debits in the entries table:</p>
<pre><code class="language-go">func (s *LedgerService) ReconcileAccount(ctx context.Context, accountID uuid.UUID) (bool, error) {
    account, err := s.store.GetAccount(ctx, accountID)
    if err != nil { return false, fmt.Errorf("account not found: %w", err) }

    calculatedStr, err := s.store.GetAccountBalance(ctx, accountID)
    if err != nil { return false, fmt.Errorf("failed to calculate balance: %w", err) }

    calculated, _ := decimal.NewFromString(calculatedStr)
    stored, _ := decimal.NewFromString(account.Balance)

    if !stored.Equal(calculated) {
        log.Error().
            Str("stored_balance", account.Balance).
            Str("calculated", calculated.StringFixed(4)).
            Msg("Balance mismatch detected")
        return false, fmt.Errorf("balance mismatch: stored %s, calculated %s",
            account.Balance, calculated.StringFixed(4))
    }
    return true, nil
}
</code></pre>
<p>If they don't match, something has gone wrong: a bug, a direct database modification, or a race condition that slipped through. In production, this check can run as a background job to catch issues before they become incidents.</p>
<h2 id="heading-the-api-layer-secure-predictable-and-boring-by-design">The API Layer: Secure, Predictable, and Boring (By Design)</h2>
<p>The API layer is where your business logic meets the outside world. Its job is to be secure, predictable, and, if you've done things right, a little bit boring.</p>
<h3 id="heading-jwt-authentication-secrets-matter">JWT Authentication: Secrets Matter</h3>
<p>Authentication is handled with JWTs. The secret used to sign tokens must be at least 32 characters long (as shorter secrets are insecure and can be brute-forced). This is enforced at startup:</p>
<pre><code class="language-go">// internal/api/middleware.go
func InitTokenAuth(secret string) error {
    if secret == "" {
        return errors.New("JWT_SECRET environment variable is required")
    }
    if len(secret) &lt; 32 {
        return errors.New("JWT_SECRET must be at least 32 characters")
    }
    TokenAuth = jwtauth.New("HS256", []byte(secret), nil)
    return nil
}
</code></pre>
<p>The server will refuse to start if the secret is missing or too short. There's no fallback and no default: the system fails loudly rather than running insecurely.</p>
<h3 id="heading-the-handler-pattern-parse-authorize-validate-call-respond">The Handler Pattern: Parse, Authorize, Validate, Call, Respond</h3>
<p>Every handler follows the same recipe: extract JWT claims, parse the account ID, fetch the account and verify ownership, decode the request body, call the service, and respond. Authorization always happens before calling the service layer. The service knows nothing about users, keeping business logic clean and testable.</p>
<pre><code class="language-go">// internal/api/handler.go
func (h *Handler) Register(w http.ResponseWriter, r *http.Request) {
    var input struct {
        Email    string `json:"email"`
        Password string `json:"password"`
    }
    if err := json.NewDecoder(r.Body).Decode(&amp;input); err != nil {
        respondError(w, http.StatusBadRequest, "invalid input")
        return
    }
    // ... hash password, create user, generate JWT ...
}
</code></pre>
<h3 id="heading-amount-normalization-defensive-by-default">Amount Normalization: Defensive by Default</h3>
<p>API clients send amounts in different formats –&nbsp;sometimes as strings, sometimes as numbers. The normalization logic ensures all amounts are handled safely:</p>
<pre><code class="language-go">// internal/api/amount.go
func normalizeAmountInput(value interface{}) (string, error) {
    switch v := value.(type) {
    case string:
        return strings.TrimSpace(v), nil
    case json.Number:
        return strings.TrimSpace(v.String()), nil
    case float64:
        return strconv.FormatFloat(v, 'f', -1, 64), nil
    default:
        return "", errors.New("amount must be a number or string")
    }
}
</code></pre>
<p>The decoder uses <code>dec.UseNumber()</code> so JSON numbers arrive as <code>json.Number</code> rather than <code>float64</code>, preserving full precision. The <code>float64</code> case exists as a safety fallback only.</p>
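<p>Here's how the decoder and the normalizer fit together – a sketch, with <code>decodeAmount</code> as a hypothetical name:</p>
<pre><code class="language-go">func decodeAmount(r *http.Request) (string, error) {
    dec := json.NewDecoder(r.Body)
    dec.UseNumber() // numbers decode as json.Number, preserving the exact digits sent

    var body struct {
        Amount interface{} `json:"amount"`
    }
    if err := dec.Decode(&amp;body); err != nil {
        return "", err
    }
    // Accepts "250.00" (string) or 250.00 (number) and returns a clean string.
    return normalizeAmountInput(body.Amount)
}
</code></pre>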
<h3 id="heading-frontend-deployment-boundary">Frontend Deployment Boundary</h3>
<p>The backend no longer serves static frontend files. The frontend is deployed separately at <code>https://golangbank.app</code> from its own repository: <code>https://github.com/PaulBabatuyi/double-entry-bank</code>.</p>
<h2 id="heading-running-it-locally-your-first-end-to-end-test">Running It Locally: Your First End-to-End Test</h2>
<pre><code class="language-bash">git clone https://github.com/PaulBabatuyi/double-entry-bank-Go.git
cd double-entry-bank-Go
cp .env.example .env
# Edit .env — set JWT_SECRET with: openssl rand -base64 32
make postgres
make migrate-up
make server
</code></pre>
<p>Once the server is running:</p>
<ul>
<li><p><strong>Frontend</strong>: <a href="https://golangbank.app">https://golangbank.app</a></p>
</li>
<li><p><strong>Swagger UI</strong>: <a href="http://localhost:8080/swagger/index.html">http://localhost:8080/swagger/index.html</a> (local dev) or <a href="https://golangbank.app/swagger">https://golangbank.app/swagger</a> (production)</p>
</li>
<li><p><strong>Health check</strong>: <a href="http://localhost:8080/health">http://localhost:8080/health</a></p>
</li>
</ul>
<p>The Swagger UI lets you explore every endpoint, authorize with your JWT token, and test operations directly in the browser.</p>
<h2 id="heading-testing-prove-the-system-works">Testing: Prove the System Works</h2>
<p>Testing financial systems is non-negotiable, and claims about correctness need to be backed by code. This project tests all three layers, each targeting a different kind of failure.</p>
<h3 id="heading-service-layer-core-financial-logic">Service Layer: Core Financial Logic</h3>
<p>The most important tests live in <code>internal/service/ledger_test.go</code>. They run against a real PostgreSQL database – not mocks –&nbsp;because mock-based tests can give a false sense of security. Real database tests catch issues that only appear in production-like environments.</p>
<pre><code class="language-go">func TestDeposit_Success(t *testing.T) {
    ledger := setupTestLedger(t)
    accountID := createTestAccount(t, ledger, "0.00")

    err := ledger.Deposit(context.Background(), accountID, "100.00")
    require.NoError(t, err)

    balance := getAccountBalance(t, ledger, accountID)
    assert.Equal(t, "100.0000", balance)
}

func TestWithdraw_InsufficientFunds(t *testing.T) {
    ledger := setupTestLedger(t)
    accountID := createTestAccount(t, ledger, "50.00")

    err := ledger.Withdraw(context.Background(), accountID, "100.00")
    assert.ErrorIs(t, err, ErrInsufficientFunds)
}
</code></pre>
<p>The <code>createTestAccount</code> helper uses the settlement account's currency automatically, which is important: all accounts must share a currency for transfers to work, and tests that silently use a different currency will fail in confusing ways.</p>
<h3 id="heading-concurrency-test-proving-serializable-isolation-works">Concurrency Test: Proving Serializable Isolation Works</h3>
<p>This is the most important test in the suite:</p>
<pre><code class="language-go">func TestConcurrentDeposits(t *testing.T) {
    ledger := setupTestLedger(t)
    accountID := createTestAccount(t, ledger, "0.00")

    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        _ = ledger.Deposit(context.Background(), accountID, "100.00")
    }()
    go func() {
        defer wg.Done()
        _ = ledger.Deposit(context.Background(), accountID, "100.00")
    }()
    wg.Wait()

    balance := getAccountBalance(t, ledger, accountID)
    assert.Equal(t, "200.0000", balance)
}
</code></pre>
<p>Two goroutines deposit simultaneously. The serializable isolation level and retry logic ensure both operations succeed and neither overwrites the other. Without the <code>FOR UPDATE</code> locks and transaction retry logic, this test would fail non-deterministically – which is exactly the kind of bug that's impossible to reproduce in development but devastating in production.</p>
<h3 id="heading-store-layer-transaction-mechanics">Store Layer: Transaction Mechanics</h3>
<p>Tests in <code>internal/db/store_test.go</code> verify the retry infrastructure itself, without needing a database connection:</p>
<pre><code class="language-go">func TestIsSerializationError(t *testing.T) {
    pqErr := &amp;pq.Error{Code: "40001"}
    assert.True(t, isSerializationError(pqErr))
    assert.False(t, isSerializationError(errors.New("some other error")))
}

func TestRetryWait(t *testing.T) {
    assert.Equal(t, 50*time.Millisecond, retryWait(0))
    assert.Equal(t, 100*time.Millisecond, retryWait(1))
    assert.Equal(t, 200*time.Millisecond, retryWait(2))
    assert.Equal(t, time.Second, retryWait(5)) // capped
}

func TestSleepWithContext_Cancel(t *testing.T) {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // cancel immediately
    err := sleepWithContext(ctx, 50*time.Millisecond)
    assert.Error(t, err) // should return immediately, not wait
}
</code></pre>
<h3 id="heading-api-layer-authentication-and-input-handling">API Layer: Authentication and Input Handling</h3>
<p>Handler tests in <code>internal/api/handler_test.go</code> verify that the HTTP layer behaves correctly at its boundaries:</p>
<pre><code class="language-go">func TestRegisterHandler_BadRequest(t *testing.T) {
    h := setupTestHandler(t)
    req := httptest.NewRequest(http.MethodPost, "/register", nil)
    rw := httptest.NewRecorder()
    h.Register(rw, req)
    assert.Equal(t, http.StatusBadRequest, rw.Code)
}

func TestRegisterHandler_Success(t *testing.T) {
    h := setupTestHandler(t)
    _ = InitTokenAuth("fV7sliKV3qn657I60wEFtw/Auk/0bNU9zdp30wFzfDg=")

    email := "testuser_" + uuid.New().String() + "@example.com"
    body, _ := json.Marshal(map[string]string{"email": email, "password": "testpassword123"})

    req := httptest.NewRequest(http.MethodPost, "/register", bytes.NewReader(body))
    rw := httptest.NewRecorder()
    h.Register(rw, req)
    assert.Equal(t, http.StatusCreated, rw.Code)
}
</code></pre>
<p>Using <code>uuid.New().String()</code> in the email ensures each test run creates a unique user, preventing conflicts on repeated runs against the same database.</p>
<p>Middleware tests verify the security boundary itself:</p>
<pre><code class="language-go">func TestInitTokenAuthFromEnv_MissingSecret(t *testing.T) {
    os.Unsetenv("JWT_SECRET")
    err := InitTokenAuthFromEnv()
    assert.Error(t, err) // must fail without a secret
}
</code></pre>
<h3 id="heading-running-the-tests">Running the Tests</h3>
<pre><code class="language-bash"># Start the database
make postgres

# Run all tests with race detection
make test

# Run with coverage report
make coverage

# Run tests the same way CI does (includes migrations)
make ci-test
</code></pre>
<p>The <code>-race</code> flag is non-negotiable for financial code. It instruments the binary to detect data races at runtime –&nbsp;something static analysis can't catch. If a racy code path is exercised while the tests run, the detector will flag it.</p>
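<p>To see what the detector catches, here's a deliberately broken example (not from this project) that <code>go test -race</code> flags immediately:</p>
<pre><code class="language-go">func TestRaceExample(t *testing.T) {
    counter := 0
    var wg sync.WaitGroup
    for i := 0; i &lt; 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // two goroutines write without synchronization: -race reports this
        }()
    }
    wg.Wait()
    t.Log(counter)
}
</code></pre>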
<h2 id="heading-deployment-engineering-decisions-that-matter-in-production">Deployment: Engineering Decisions That Matter in Production</h2>
<p>The deployment setup for this project reflects several engineering decisions worth understanding, regardless of what platform you deploy to.</p>
<h3 id="heading-migrations-on-container-start">Migrations on Container Start</h3>
<p>The Docker entrypoint runs <code>golang-migrate up</code> before starting the Go binary:</p>
<pre><code class="language-sh"># docker-entrypoint
migrate -path /app/postgres/migrations -database "$migrate_db_url" up
exec /usr/local/bin/ledger
</code></pre>
<p>Running migrations at startup rather than as a separate CI step has trade-offs. The upside is simplicity: the container is always self-consistent when it starts. The downside is that each deployment takes slightly longer. For a solo project or small team, this is the right call. At scale you'd separate migrations from deployment.</p>
<h3 id="heading-startup-retry-logic">Startup Retry Logic</h3>
<p>The entrypoint retries migrations up to 12 times with a 5-second sleep between attempts:</p>
<pre><code class="language-sh">max_attempts=12
attempt=1
while [ "\(attempt" -le "\)max_attempts" ]; do
    migration_output=$(migrate ... up 2&gt;&amp;1)
    # If "connection refused" or "timeout", keep retrying
    # If any other error, fail immediately
    attempt=$((attempt + 1))
done
</code></pre>
<p>The critical distinction is which errors trigger a retry. Network-transient errors (connection refused, timeout) are retried. Everything else&nbsp;–&nbsp;bad migration SQL, a missing table&nbsp;–&nbsp;fails immediately. This avoids waiting the full 60 seconds when a deployment has a real problem.</p>
<h3 id="heading-db-url-fallback-chain">DB URL Fallback Chain</h3>
<p>In cloud environments, the internal database URL is often a different variable than what you configure locally. The <code>resolveDBURL</code> function handles this transparently:</p>
<pre><code class="language-go">func resolveDBURL() string {
    connStr := strings.TrimSpace(os.Getenv("DB_URL"))
    fallbackVars := []string{"INTERNAL_DATABASE_URL", "RENDER_DATABASE_URL", "DATABASE_URL"}
    // Falls back through the chain if DB_URL is empty or resolves to localhost
    ...
}
</code></pre>
<p>This pattern means local developers set <code>DB_URL</code> in <code>.env</code> and don't need to think about it, while the deployed container automatically uses the internal database connection without any manual wiring.</p>
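<p>Under that description, the elided body behaves roughly like this sketch (the actual implementation in the repository may differ):</p>
<pre><code class="language-go">func resolveDBURL() string {
    connStr := strings.TrimSpace(os.Getenv("DB_URL"))
    // Use DB_URL as-is unless it's empty or points at localhost,
    // which won't be reachable from inside a cloud container.
    if connStr != "" &amp;&amp; !strings.Contains(connStr, "localhost") &amp;&amp;
        !strings.Contains(connStr, "127.0.0.1") {
        return connStr
    }
    for _, name := range []string{"INTERNAL_DATABASE_URL", "RENDER_DATABASE_URL", "DATABASE_URL"} {
        if v := strings.TrimSpace(os.Getenv(name)); v != "" {
            return v
        }
    }
    return connStr
}
</code></pre>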
<h3 id="heading-http-server-timeouts">HTTP Server Timeouts</h3>
<p>The server is configured with explicit timeouts:</p>
<pre><code class="language-go">srv := &amp;http.Server{
    Addr:              ":" + port,
    Handler:           r,
    ReadTimeout:       15 * time.Second,
    WriteTimeout:      15 * time.Second,
    IdleTimeout:       60 * time.Second,
    ReadHeaderTimeout: 5 * time.Second,
}
</code></pre>
<p>Without timeouts, a slow or malicious client can hold connections open indefinitely, eventually exhausting the server's resources. <code>ReadHeaderTimeout</code> is particularly important: it limits how long the server waits for the HTTP headers before closing the connection, protecting against Slowloris-style attacks.</p>
<h2 id="heading-conclusion-building-for-the-real-world">Conclusion: Building for the Real World</h2>
<p>You've just walked through the core patterns that power real fintech systems:</p>
<ul>
<li><p>Double-entry ledger with database-enforced constraints</p>
</li>
<li><p>Settlement account for tracking external cash flows</p>
</li>
<li><p>Serializable transactions with exponential backoff retry</p>
</li>
<li><p>Reconciliation endpoint for verifying correctness</p>
</li>
<li><p>Type-safe queries with sqlc</p>
</li>
<li><p>Row-level locking to prevent race conditions</p>
</li>
<li><p>Tests that prove correctness under concurrency</p>
</li>
</ul>
<p>These aren't just Go patterns. They're the same principles used at companies like Monzo, Stripe, and Nubank. The implementation details differ, but the underlying ideas are the same: every dollar is accounted for, every operation is atomic, and the system can always explain where every penny went.</p>
<p>What's next? Three concrete next steps:</p>
<ol>
<li><p><strong>Add idempotency keys</strong> to prevent duplicate transactions on retries. If a client retries a deposit because of a network timeout, you need to detect and reject the duplicate – see the sketch after this list.</p>
</li>
<li><p><strong>Add Prometheus metrics</strong> for transaction latency and failure rates. You want to know when your p99 latency spikes before your users do.</p>
</li>
<li><p><strong>Add a scheduled reconciliation job</strong> that runs <code>ReconcileAccount</code> for every account on a schedule and alerts on mismatches. Catch bugs automatically, before they become customer complaints.</p>
</li>
</ol>
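<p>To make the first item concrete, here's a sketch of an idempotency-key guard. Every name here is hypothetical – the repository has no <code>processed_keys</code> table yet – but the shape is standard: insert the key first, and let a unique constraint turn replays into no-ops:</p>
<pre><code class="language-go">// Sketch only: assumes a processed_keys table with a UNIQUE key column.
func depositOnce(ctx context.Context, db *sql.DB, key, accountID, amount string) error {
    tx, err := db.BeginTx(ctx, &amp;sql.TxOptions{Isolation: sql.LevelSerializable})
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op if Commit succeeds first

    res, err := tx.ExecContext(ctx,
        `INSERT INTO processed_keys (key) VALUES ($1) ON CONFLICT (key) DO NOTHING`, key)
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return tx.Commit() // duplicate request: acknowledge without re-applying it
    }

    // ...create the two ledger entries and update balances as in Deposit...

    return tx.Commit()
}
</code></pre>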
<p>The developer who stores balance as a single number and updates it directly will eventually have an incident. The developer who builds a ledger has an audit trail, a reconciliation tool, and a system that can explain every penny.</p>
<p>That's the real reason fintech engineers build this way: not because it's more complex, but because it's more honest about what money actually is.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Docker Container Doctor: How I Built an AI Agent That Monitors and Fixes My Containers ]]>
                </title>
                <description>
                    <![CDATA[ Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Go ]]>
                </description>
                <link>https://www.freecodecamp.org/news/docker-container-doctor-how-i-built-an-ai-agent-that-monitors-and-fixes-my-containers/</link>
                <guid isPermaLink="false">69c1768730a9b81e3a833f20</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 17:21:11 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8bb7701d-e519-407f-92ba-59639e13729d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Google the error, and finally restart it. Twenty minutes of your morning gone. And the worst part? It happens again next week.</p>
<p>I got tired of this cycle. I was running 5 containerized services on a single Linode box – a Flask API, a Postgres database, an Nginx reverse proxy, a Redis cache, and a background worker. Every other week, one of them would crash. The logs were messy. The errors weren't obvious. And I'd waste time debugging something that could've been auto-detected and fixed in seconds.</p>
<p>So I built something better: a Python agent that watches your containers in real-time, spots errors, figures out what went wrong using Claude, and fixes them without waking you up. I call it the Container Doctor. It's not magic. It's Docker API + LLM reasoning + some automation glue. Here's exactly how I built it, what went wrong along the way, and what I'd do differently.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</a></p>
</li>
<li><p><a href="#heading-the-architecture">The Architecture</a></p>
</li>
<li><p><a href="#heading-setting-up-the-project">Setting Up the Project</a></p>
</li>
<li><p><a href="#heading-the-monitoring-script--line-by-line">The Monitoring Script — Line by Line</a></p>
</li>
<li><p><a href="#heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</a></p>
</li>
<li><p><a href="#heading-auto-fix-logic--being-conservative-on-purpose">Auto-Fix Logic — Being Conservative on Purpose</a></p>
</li>
<li><p><a href="#heading-adding-slack-notifications">Adding Slack Notifications</a></p>
</li>
<li><p><a href="#heading-health-check-endpoint">Health Check Endpoint</a></p>
</li>
<li><p><a href="#heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</a></p>
</li>
<li><p><a href="#heading-docker-compose--the-full-setup">Docker Compose — The Full Setup</a></p>
</li>
<li><p><a href="#heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</a></p>
</li>
<li><p><a href="#heading-cost-breakdown--what-this-actually-costs">Cost Breakdown — What This Actually Costs</a></p>
</li>
<li><p><a href="#heading-security-considerations">Security Considerations</a></p>
</li>
<li><p><a href="#heading-what-id-do-differently">What I'd Do Differently</a></p>
</li>
<li><p><a href="#heading-whats-next">What's Next?</a></p>
</li>
</ol>
<h2 id="heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</h2>
<p>Fair question. Prometheus, Grafana, DataDog – they're all great. But for my setup, they were overkill. I had 5 containers on a $20/month Linode. Setting up Prometheus means deploying a metrics server, configuring exporters for each service, building Grafana dashboards, and writing alert rules. That's a whole side project just to monitor a side project.</p>
<p>Even then, those tools tell you <em>what</em> happened. They'll show you a spike in memory or a 500 error rate. But they won't tell you <em>why</em>. You still need a human to look at the logs, figure out the root cause, and decide what to do.</p>
<p>That's the gap I wanted to fill. I didn't need another dashboard. I needed something that could read a stack trace, understand the context, and either fix it or tell me exactly what to do when I wake up. Claude turned out to be surprisingly good at this. It can read a Python traceback and tell you the issue faster than most junior devs (and some senior ones, honestly).</p>
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Here's how the pieces fit together:</p>
<pre><code class="language-plaintext">┌─────────────────────────────────────────────┐
│              Docker Host                      │
│                                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   web    │  │   api    │  │    db    │   │
│  │ (nginx)  │  │ (flask)  │  │(postgres)│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │         │
│       └──────────────┼──────────────┘         │
│                      │                         │
│              Docker Socket                     │
│                      │                         │
│            ┌─────────┴─────────┐              │
│            │ Container Doctor  │              │
│            │  (Python agent)   │              │
│            └─────────┬─────────┘              │
│                      │                         │
└──────────────────────┼─────────────────────────┘
                       │
              ┌────────┴────────┐
              │   Claude API    │
              │  (diagnosis)    │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Slack Webhook  │
              │  (alerts)       │
              └─────────────────┘
</code></pre>
<p>The flow works like this:</p>
<ol>
<li><p>The Container Doctor runs in its own container with the Docker socket mounted</p>
</li>
<li><p>Every 10 seconds, it pulls the last 50 lines of logs from each target container</p>
</li>
<li><p>It scans for error patterns (keywords like "error", "exception", "traceback", "fatal")</p>
</li>
<li><p>When it finds something, it sends the logs to Claude with a structured prompt</p>
</li>
<li><p>Claude returns a JSON diagnosis: root cause, severity, suggested fix, and whether it's safe to auto-restart</p>
</li>
<li><p>If severity is high and auto-restart is safe, the script restarts the container</p>
</li>
<li><p>Either way, it sends a Slack notification with the full diagnosis</p>
</li>
<li><p>A simple health endpoint lets you check the doctor's own status</p>
</li>
</ol>
<p>The key insight: the script doesn't try to be smart about the diagnosis itself. It outsources all the thinking to Claude. The script's job is just plumbing: collecting logs, routing them to Claude, and executing the response.</p>
<h2 id="heading-setting-up-the-project">Setting Up the Project</h2>
<p>Create your project directory:</p>
<pre><code class="language-bash">mkdir container-doctor &amp;&amp; cd container-doctor
</code></pre>
<p>Here's your <code>requirements.txt</code>:</p>
<pre><code class="language-plaintext">docker==7.0.0
anthropic&gt;=0.28.0
python-dotenv==1.0.0
flask==3.0.0
requests==2.31.0
</code></pre>
<p>Install locally for testing: <code>pip install -r requirements.txt</code></p>
<p>Create a <code>.env</code> file:</p>
<pre><code class="language-bash">ANTHROPIC_API_KEY=sk-ant-...
TARGET_CONTAINERS=web,api,db
CHECK_INTERVAL=10
LOG_LINES=50
AUTO_FIX=true
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POSTGRES_USER=user
POSTGRES_PASSWORD=changeme
POSTGRES_DB=mydb
MAX_DIAGNOSES_PER_HOUR=20
</code></pre>
<p>A quick note on <code>CHECK_INTERVAL</code>: 10 seconds is aggressive. For production, I'd bump this to 30-60 seconds. I kept it low during development so I could see results faster, and honestly forgot to change it. My API bill reminded me.</p>
<h2 id="heading-the-monitoring-script-line-by-line">The Monitoring Script – Line by Line</h2>
<p>Here's the full <code>container_doctor.py</code>. I'll walk through the important parts after:</p>
<pre><code class="language-python">import docker
import json
import time
import logging
import os
import requests
from datetime import datetime, timedelta
from collections import defaultdict
from threading import Thread
from flask import Flask, jsonify
from anthropic import Anthropic
from dotenv import load_dotenv

# Load .env for local runs; in Docker, compose injects these variables directly
load_dotenv()

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

client = Anthropic()
docker_client = None

# --- Config ---
TARGET_CONTAINERS = os.getenv("TARGET_CONTAINERS", "").split(",")
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
LOG_LINES = int(os.getenv("LOG_LINES", "50"))
AUTO_FIX = os.getenv("AUTO_FIX", "true").lower() == "true"
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
MAX_DIAGNOSES = int(os.getenv("MAX_DIAGNOSES_PER_HOUR", "20"))

# --- State tracking ---
diagnosis_history = []
fix_history = defaultdict(list)
last_error_seen = {}
rate_limit_counter = defaultdict(int)
rate_limit_reset = datetime.now() + timedelta(hours=1)

app = Flask(__name__)


def get_docker_client():
    """Lazily initialize Docker client."""
    global docker_client
    if docker_client is None:
        docker_client = docker.from_env()
    return docker_client


def get_container_logs(container_name):
    """Fetch last N lines from a container."""
    try:
        container = get_docker_client().containers.get(container_name)
        logs = container.logs(
            tail=LOG_LINES,
            timestamps=True
        ).decode("utf-8")
        return logs
    except docker.errors.NotFound:
        logger.warning(f"Container '{container_name}' not found. Skipping.")
        return None
    except docker.errors.APIError as e:
        logger.error(f"Docker API error for {container_name}: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error fetching logs for {container_name}: {e}")
        return None


def detect_errors(logs):
    """Check if logs contain error patterns."""
    error_patterns = [
        "error", "exception", "traceback", "failed", "crash",
        "fatal", "panic", "segmentation fault", "out of memory",
        "killed", "oomkiller", "connection refused", "timeout",
        "permission denied", "no such file", "errno"
    ]
    logs_lower = logs.lower()
    found = []
    for pattern in error_patterns:
        if pattern in logs_lower:
            found.append(pattern)
    return found


def is_new_error(container_name, logs):
    """Check if this is a new error or the same one we already diagnosed."""
    log_hash = hash(logs[-200:])  # Hash last 200 chars
    if last_error_seen.get(container_name) == log_hash:
        return False
    last_error_seen[container_name] = log_hash
    return True


def check_rate_limit():
    """Ensure we don't spam Claude with too many requests."""
    global rate_limit_counter, rate_limit_reset

    now = datetime.now()
    if now &gt; rate_limit_reset:
        rate_limit_counter.clear()
        rate_limit_reset = now + timedelta(hours=1)

    total = sum(rate_limit_counter.values())
    if total &gt;= MAX_DIAGNOSES:
        logger.warning(f"Rate limit reached ({total}/{MAX_DIAGNOSES} per hour). Skipping diagnosis.")
        return False
    return True


def diagnose_with_claude(container_name, logs, error_patterns):
    """Send logs to Claude for diagnosis."""
    if not check_rate_limit():
        return None

    rate_limit_counter[container_name] += 1

    prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

    try:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=600,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    except Exception as e:
        logger.error(f"Claude API error: {e}")
        return None


def parse_diagnosis(diagnosis_text):
    """Extract JSON from Claude's response."""
    if not diagnosis_text:
        return None
    try:
        start = diagnosis_text.find("{")
        end = diagnosis_text.rfind("}") + 1
        if start &gt;= 0 and end &gt; start:
            json_str = diagnosis_text[start:end]
            return json.loads(json_str)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse error: {e}")
        logger.debug(f"Raw response: {diagnosis_text}")
    except Exception as e:
        logger.error(f"Failed to parse diagnosis: {e}")
    return None


def apply_fix(container_name, diagnosis):
    """Apply auto-fixes if safe."""
    if not AUTO_FIX:
        logger.info(f"Auto-fix disabled globally. Skipping {container_name}.")
        return False

    if not diagnosis.get("auto_restart_safe"):
        logger.info(f"Claude says restart is unsafe for {container_name}. Skipping.")
        return False

    # Don't restart the same container more than 3 times per hour
    recent_fixes = [
        t for t in fix_history[container_name]
        if t &gt; datetime.now() - timedelta(hours=1)
    ]
    if len(recent_fixes) &gt;= 3:
        logger.warning(
            f"Container {container_name} already restarted {len(recent_fixes)} "
            f"times this hour. Something deeper is wrong. Skipping."
        )
        send_slack_alert(
            container_name, diagnosis,
            extra="REPEATED FAILURE: This container has been restarted 3+ times "
                  "in the last hour. Manual intervention needed."
        )
        return False

    try:
        container = get_docker_client().containers.get(container_name)
        logger.info(f"Restarting container {container_name}...")
        container.restart(timeout=30)
        fix_history[container_name].append(datetime.now())
        logger.info(f"Container {container_name} restarted successfully")

        # Verify it's actually running after restart
        time.sleep(5)
        container.reload()
        if container.status != "running":
            logger.error(f"Container {container_name} failed to start after restart")
            return False

        return True
    except Exception as e:
        logger.error(f"Failed to restart {container_name}: {e}")
        return False


def send_slack_alert(container_name, diagnosis, extra=""):
    """Send diagnosis to Slack."""
    if not SLACK_WEBHOOK:
        return

    severity_emoji = {
        "low": "🟡",
        "medium": "🟠",
        "high": "🔴"
    }

    severity = diagnosis.get("severity", "unknown")
    emoji = severity_emoji.get(severity, "⚪")

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"{emoji} Container Doctor Alert: {container_name}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {severity}"},
                {"type": "mrkdwn", "text": f"*Container:* `{container_name}`"},
                {"type": "mrkdwn", "text": f"*Root Cause:* {diagnosis.get('root_cause', 'Unknown')}"},
                {"type": "mrkdwn", "text": f"*Fix:* {diagnosis.get('suggested_fix', 'N/A')}"},
            ]
        }
    ]

    if diagnosis.get("config_suggestions"):
        suggestions = "\n".join(
            f"• `{s}`" for s in diagnosis["config_suggestions"]
        )
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Config Suggestions:*\n{suggestions}"
            }
        })

    if extra:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*⚠️ {extra}*"}
        })

    try:
        requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
    except Exception as e:
        logger.error(f"Slack notification failed: {e}")


# --- Health Check Endpoint ---
@app.route("/health")
def health():
    """Health check endpoint for the doctor itself."""
    try:
        get_docker_client().ping()
        docker_ok = True
    except Exception:
        docker_ok = False

    return jsonify({
        "status": "healthy" if docker_ok else "degraded",
        "docker_connected": docker_ok,
        "monitoring": TARGET_CONTAINERS,
        "total_diagnoses": len(diagnosis_history),
        "fixes_applied": {k: len(v) for k, v in fix_history.items()},
        "rate_limit_remaining": MAX_DIAGNOSES - sum(rate_limit_counter.values()),
        "uptime_check": datetime.now().isoformat()
    })


@app.route("/history")
def history():
    """Return recent diagnosis history."""
    return jsonify(diagnosis_history[-50:])


def monitor_containers():
    """Main monitoring loop."""
    logger.info(f"Container Doctor starting up")
    logger.info(f"Monitoring: {TARGET_CONTAINERS}")
    logger.info(f"Check interval: {CHECK_INTERVAL}s")
    logger.info(f"Auto-fix: {AUTO_FIX}")
    logger.info(f"Rate limit: {MAX_DIAGNOSES}/hour")

    while True:
        for container_name in TARGET_CONTAINERS:
            container_name = container_name.strip()
            if not container_name:
                continue

            logs = get_container_logs(container_name)
            if not logs:
                continue

            error_patterns = detect_errors(logs)
            if not error_patterns:
                continue

            # Skip if we already diagnosed this exact error
            if not is_new_error(container_name, logs):
                continue

            logger.warning(
                f"Errors detected in {container_name}: {error_patterns}"
            )

            diagnosis_text = diagnose_with_claude(
                container_name, logs, error_patterns
            )
            if not diagnosis_text:
                continue

            diagnosis = parse_diagnosis(diagnosis_text)
            if not diagnosis:
                logger.error("Failed to parse Claude's response. Skipping.")
                continue

            # Record it
            diagnosis_history.append({
                "container": container_name,
                "timestamp": datetime.now().isoformat(),
                "diagnosis": diagnosis,
                "patterns": error_patterns
            })

            logger.info(
                f"Diagnosis for {container_name}: "
                f"severity={diagnosis.get('severity')}, "
                f"cause={diagnosis.get('root_cause')}"
            )

            # Auto-fix only on high severity
            fixed = False
            if diagnosis.get("severity") == "high":
                fixed = apply_fix(container_name, diagnosis)

            # Always notify Slack
            send_slack_alert(
                container_name, diagnosis,
                extra="Auto-restarted" if fixed else ""
            )

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Run Flask health endpoint in background
    flask_thread = Thread(
        target=lambda: app.run(host="0.0.0.0", port=8080, debug=False),
        daemon=True
    )
    flask_thread.start()
    logger.info("Health endpoint running on :8080")

    try:
        monitor_containers()
    except KeyboardInterrupt:
        logger.info("Container Doctor shutting down")
</code></pre>
<p>That's a lot of code, so let me walk through the parts that matter.</p>
<p><strong>Error deduplication (</strong><code>is_new_error</code><strong>)</strong>: This was a lesson I learned the hard way. Without this, the script would see the same error every 10 seconds and spam Claude with identical requests. I hash the last 200 characters of the log output and skip if it matches the last error we saw. Simple, but it cut my API costs by about 80%.</p>
<p><strong>Rate limiting (</strong><code>check_rate_limit</code><strong>)</strong>: Belt and suspenders. Even with deduplication, I cap it at 20 diagnoses per hour. If something is so broken that it's generating 20+ unique errors per hour, you need a human anyway.</p>
<p><strong>Restart throttling (inside</strong> <code>apply_fix</code><strong>)</strong>: If the same container has been restarted 3 times in an hour, something deeper is wrong. A restart loop won't fix a misconfigured database or a missing volume. The script stops restarting and sends a louder Slack alert instead.</p>
<p><strong>Post-restart verification</strong>: After restarting, the script waits 5 seconds and checks if the container is actually running. I've seen cases where a container restarts and immediately crashes again. Without this check, the script would report success while the container is still down.</p>
<h2 id="heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</h2>
<p>Getting Claude to return parseable JSON took some iteration. My first attempt used a casual prompt and I got back paragraphs of explanation with JSON buried somewhere in the middle. Sometimes it'd use markdown code fences, sometimes not.</p>
<p>The version I landed on is explicit about format:</p>
<pre><code class="language-python">prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""
</code></pre>
<p>A few things I learned:</p>
<p><strong>Include the detected patterns.</strong> Telling Claude "I found 'timeout' and 'connection refused'" helps it focus. Without this, it sometimes fixated on irrelevant warnings in the logs.</p>
<p><strong>Ask for</strong> <code>estimated_impact</code><strong>.</strong> This field turned out to be the most useful in Slack alerts. When your team sees "Database connections will pile up and crash the API within 15 minutes," they act faster than when they see "connection pool exhausted."</p>
<p><code>likely_recurring</code> <strong>is gold.</strong> If Claude says an issue is likely to recur, I know a restart is a band-aid and I need to actually fix the root cause. I flag these in Slack with extra emphasis.</p>
<p>Claude returns something like:</p>
<pre><code class="language-json">{
    "root_cause": "Connection pool exhausted. Default pool size is 5, but app has 8+ concurrent workers.",
    "severity": "high",
    "suggested_fix": "1. Set POOL_SIZE=20 in environment. 2. Add connection timeout of 30s. 3. Consider a connection pooler like PgBouncer.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "CONNECTION_TIMEOUT=30"],
    "likely_recurring": true,
    "estimated_impact": "API requests will queue and timeout. Users will see 503 errors within 2-3 minutes."
}
</code></pre>
<p>I only auto-restart on <code>high</code> severity. Medium and low issues get logged, sent to Slack, and I deal with them during business hours. This distinction matters: you don't want the script restarting containers over every transient warning.</p>
<h2 id="heading-auto-fix-logic-being-conservative-on-purpose">Auto-Fix Logic – Being Conservative on Purpose</h2>
<p>The auto-fix function is intentionally limited. Right now it only restarts containers. It doesn't modify environment variables, change configs, or scale services. Here's why:</p>
<p>Restarting is safe and reversible. If the restart makes things worse, the container just crashes again and I get another alert. But if the script started changing environment variables or modifying docker-compose files, a bad decision could cascade across services.</p>
<p>The three safety checks before any restart:</p>
<ol>
<li><p><strong>Global toggle</strong>: <code>AUTO_FIX=true</code> in .env. I can kill all auto-fixes instantly by changing one variable.</p>
</li>
<li><p><strong>Claude's assessment</strong>: <code>auto_restart_safe</code> must be true. If Claude says "don't restart this, it'll corrupt the database," the script listens.</p>
</li>
<li><p><strong>Restart throttle</strong>: No more than 3 restarts per container per hour. After that, it's a human problem.</p>
</li>
</ol>
<p>If I were building this for a team, I'd add approval flows. Send a Slack message with "Restart?" and two buttons. Wait for a human to click yes. That adds latency but removes the risk of automated chaos.</p>
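<p>As a rough sketch of that approval flow, here's how the prompt side might look with Block Kit buttons. It reuses the <code>requests</code> import and <code>SLACK_WEBHOOK</code> variable from the script above; handling the actual button click requires a Slack app with interactivity enabled and a separate HTTP endpoint, which isn't shown.</p>
<pre><code class="language-python">def request_restart_approval(container_name, diagnosis):
    """Ask a human before restarting, instead of acting automatically.

    The button click is delivered to your Slack app's interactivity URL
    (a separate endpoint, not shown), which would perform the restart.
    """
    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f":warning: *{container_name}* looks unhealthy.\n"
                    f"*Root cause:* {diagnosis.get('root_cause', 'Unknown')}\n"
                    "Restart it?"
                ),
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Restart"},
                    "style": "danger",
                    "action_id": "approve_restart",
                    "value": container_name,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Ignore"},
                    "action_id": "ignore_alert",
                    "value": container_name,
                },
            ],
        },
    ]
    requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
</code></pre>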
<h2 id="heading-adding-slack-notifications">Adding Slack Notifications</h2>
<p>Every diagnosis gets sent to Slack, whether the container was restarted or not. The notification includes color-coded severity, root cause, suggested fix, and config suggestions.</p>
<p>The Slack Block Kit formatting makes these alerts scannable. A red dot for high severity, orange for medium, yellow for low. Your team can glance at the channel and know if they need to drop everything or if it can wait.</p>
<p>To set this up, create a Slack app at <a href="https://api.slack.com/apps">api.slack.com/apps</a>, add an incoming webhook, and paste the URL in your <code>.env</code>.</p>
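<p>If you want to sanity-check the webhook before wiring it into the script, a one-off <code>curl</code> does it (this just posts a plain-text message):</p>
<pre><code class="language-bash">curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Container Doctor webhook test"}' \
  "$SLACK_WEBHOOK_URL"
</code></pre>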
<h2 id="heading-health-check-endpoint">Health Check Endpoint</h2>
<p>The doctor needs a doctor. I added a simple Flask endpoint so I can monitor the monitoring script:</p>
<pre><code class="language-bash">curl http://localhost:8080/health
</code></pre>
<p>Returns:</p>
<pre><code class="language-json">{
    "status": "healthy",
    "docker_connected": true,
    "monitoring": ["web", "api", "db"],
    "total_diagnoses": 14,
    "fixes_applied": {"api": 2, "web": 1},
    "rate_limit_remaining": 6,
    "uptime_check": "2026-03-15T14:30:00"
}
</code></pre>
<p>And <code>/history</code> returns the last 50 diagnoses:</p>
<pre><code class="language-bash">curl http://localhost:8080/history
</code></pre>
<p>I point an uptime checker (UptimeRobot, free tier) at the <code>/health</code> endpoint. If the Container Doctor itself goes down, I get an email. It's monitoring all the way down.</p>
<h2 id="heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</h2>
<p>This is where I burned money during development. Without rate limiting, the script was sending 100+ requests per hour during a container crash loop. At a few cents per request, that's a few dollars per hour. Not catastrophic, but annoying.</p>
<p>The rate limiter is simple: a counter that resets every hour. Default cap is 20 diagnoses per hour. If you hit the limit, the script logs a warning and skips diagnosis until the window resets. Errors still get detected, they just don't get sent to Claude.</p>
<p>Combined with error deduplication (same error won't trigger a second diagnosis), this keeps my Claude bill under $5/month even with 5 containers monitored.</p>
<h2 id="heading-docker-compose-the-full-setup">Docker Compose – The Full Setup</h2>
<p>Here's the complete <code>docker-compose.yml</code> with the Container Doctor, a sample web server, API, and database:</p>
<pre><code class="language-yaml">version: '3.8'

services:
  container_doctor:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: container_doctor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - TARGET_CONTAINERS=web,api,db
      - CHECK_INTERVAL=10
      - LOG_LINES=50
      - AUTO_FIX=true
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - MAX_DIAGNOSES_PER_HOUR=20
    ports:
      - "8080:8080"
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  web:
    image: nginx:latest
    container_name: web
    ports:
      - "80:80"
    restart: unless-stopped

  api:
    build: ./api
    container_name: api
    environment:
      - DATABASE_URL=postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
      - POOL_SIZE=20
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  db_data:
</code></pre>
<p>And the <code>Dockerfile</code>:</p>
<pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY container_doctor.py .

EXPOSE 8080

CMD ["python", "-u", "container_doctor.py"]
</code></pre>
<p>Start everything: <code>docker compose up -d</code></p>
<p><strong>Important:</strong> The socket mount (<code>/var/run/docker.sock:/var/run/docker.sock</code>) gives the Container Doctor full access to the Docker daemon; there's more on what that implies in the Security Considerations section below. Also, don't copy <code>.env</code> into the Docker image: that bakes your API key into an image layer. Pass environment variables via the compose file or at runtime.</p>
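<p>A minimal <code>.dockerignore</code> is cheap insurance on both counts; these entries are typical suggestions rather than anything from the original project:</p>
<pre><code class="language-plaintext">.env
.git
__pycache__/
*.pyc
</code></pre>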
<h2 id="heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</h2>
<p>I've been running this for about 3 weeks now. Here are the actual incidents it caught:</p>
<h3 id="heading-incident-1-oom-kill-week-1">Incident 1: OOM Kill (Week 1)</h3>
<p>Logs showed a single word: <code>Killed</code>. That's Linux's OOMKiller doing its thing.</p>
<p>Claude's diagnosis:</p>
<pre><code class="language-json">{
    "root_cause": "Process killed by OOMKiller. Container is requesting more memory than the 256MB limit allows under load.",
    "severity": "high",
    "suggested_fix": "Increase memory limit to 512MB in docker-compose. Monitor if the leak continues at higher limits.",
    "auto_restart_safe": true,
    "config_suggestions": ["mem_limit: 512m", "memswap_limit: 1g"],
    "likely_recurring": true,
    "estimated_impact": "API is completely down. All requests return 502 from nginx."
}
</code></pre>
<p>The script restarted the container in 3 seconds. I updated the compose file the next morning. Before the Container Doctor, this would've been a 2-hour outage overnight.</p>
<h3 id="heading-incident-2-connection-pool-exhausted-week-2">Incident 2: Connection Pool Exhausted (Week 2)</h3>
<pre><code class="language-plaintext">ERROR: database connection pool exhausted
ERROR: cannot create new pool entry
ERROR: QueuePool limit of 5 overflow 0 reached
</code></pre>
<p>Claude caught that my pool size was too small for the number of workers:</p>
<pre><code class="language-json">{
    "root_cause": "SQLAlchemy connection pool (size=5) can't keep up with 8 concurrent Gunicorn workers. Each worker holds a connection during request processing.",
    "severity": "high",
    "suggested_fix": "Set POOL_SIZE=20 and add POOL_TIMEOUT=30. Long-term: add PgBouncer as a connection pooler.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "POOL_TIMEOUT=30", "POOL_RECYCLE=3600"],
    "likely_recurring": true,
    "estimated_impact": "New API requests will hang for 30s then timeout. Existing requests may complete but slowly."
}
</code></pre>
<h3 id="heading-incident-3-transient-timeout-week-2">Incident 3: Transient Timeout (Week 2)</h3>
<pre><code class="language-plaintext">WARN: timeout connecting to upstream service
WARN: retrying request (attempt 2/3)
INFO: request succeeded on retry
</code></pre>
<p>Claude correctly identified this as a non-issue:</p>
<pre><code class="language-json">{
    "root_cause": "Transient network timeout during a DNS resolution hiccup. Retries succeeded.",
    "severity": "low",
    "suggested_fix": "No action needed. This is expected during brief network blips. Only investigate if frequency increases.",
    "auto_restart_safe": false,
    "config_suggestions": [],
    "likely_recurring": false,
    "estimated_impact": "Minimal. Individual requests delayed by ~2s but all completed."
}
</code></pre>
<p>No restart. No alert ping either: in my own deployment I filter low-severity diagnoses out of Slack (the script as shown above notifies on everything, so add that filter if you want the same behavior). This is the right call: restarting on every transient timeout causes more downtime than it prevents.</p>
<h3 id="heading-incident-4-disk-full-week-3">Incident 4: Disk Full (Week 3)</h3>
<pre><code class="language-plaintext">ERROR: could not write to temporary file: No space left on device
FATAL: data directory has no space
</code></pre>
<pre><code class="language-json">{
    "root_cause": "Postgres data volume is full. WAL files and temporary sort files consumed all available space.",
    "severity": "high",
    "suggested_fix": "1. Clean WAL files: SELECT pg_switch_wal(). 2. Increase volume size. 3. Add log rotation. 4. Set max_wal_size=1GB.",
    "auto_restart_safe": false,
    "config_suggestions": ["max_wal_size=1GB", "log_rotation_age=1d"],
    "likely_recurring": true,
    "estimated_impact": "Database is read-only. All writes fail. API returns 500 on any mutation."
}
</code></pre>
<p>Notice Claude said <code>auto_restart_safe: false</code> here. Restarting Postgres when the disk is full can corrupt data. The script didn't touch it. It just sent me a detailed Slack alert at 4 AM. I cleaned up the WAL files the next morning. Good call by Claude.</p>
<h2 id="heading-cost-breakdown-what-this-actually-costs">Cost Breakdown – What This Actually Costs</h2>
<p>After 3 weeks of running this on 5 containers:</p>
<ul>
<li><p><strong>Claude API</strong>: ~$3.80/month (with rate limiting and deduplication)</p>
</li>
<li><p><strong>Linode compute</strong>: $0 extra (the Container Doctor uses about 50MB RAM)</p>
</li>
<li><p><strong>Slack</strong>: Free tier</p>
</li>
<li><p><strong>My time saved</strong>: ~2-3 hours/month of 3 AM debugging</p>
</li>
</ul>
<p>Without rate limiting, my first week cost $8 in API calls. The deduplication + rate limiter brought that down dramatically. Most of my containers run fine. The script only calls Claude when something actually breaks.</p>
<p>If you're monitoring more containers or have noisier logs, expect higher costs. The <code>MAX_DIAGNOSES_PER_HOUR</code> setting is your budget knob.</p>
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>Let's talk about the elephant in the room: the Docker socket.</p>
<p>Mounting <code>/var/run/docker.sock</code> gives the Container Doctor <strong>root-equivalent access</strong> to your Docker daemon. It can start, stop, and remove any container. It can pull images. It can exec into running containers. If someone compromises the Container Doctor, they own your entire Docker host.</p>
<p>Here's how I mitigate this:</p>
<ol>
<li><p><strong>Network isolation</strong>: The Container Doctor's health endpoint is only exposed on localhost. In production, put it behind a reverse proxy with auth.</p>
</li>
<li><p><strong>Read-mostly access</strong>: The script only <em>reads</em> logs and <em>restarts</em> containers. It never execs into containers, pulls images, or modifies volumes.</p>
</li>
<li><p><strong>No external inputs</strong>: The script doesn't accept commands from Slack or any external source. It's outbound-only (logs out, alerts out).</p>
</li>
<li><p><strong>API key rotation</strong>: I rotate the Anthropic API key monthly. If the container is compromised, the key has limited blast radius.</p>
</li>
</ol>
<p>For a more secure setup, mount the socket read-only (append <code>:ro</code> to the volume mount) and put a tool like <a href="https://github.com/Tecnativa/docker-socket-proxy">docker-socket-proxy</a> in front of it to restrict which API calls the Container Doctor can make.</p>
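<p>Here's a sketch of what that could look like in compose. The proxy whitelists sections of the Docker API through environment flags, and the doctor talks to it over TCP via <code>DOCKER_HOST</code>, which <code>docker.from_env()</code> picks up automatically. Treat the exact flag set below as an assumption to verify against the proxy's README, not tested config from this article.</p>
<pre><code class="language-yaml">services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - CONTAINERS=1       # list/inspect containers and read their logs
      - POST=1             # permit POST requests at all...
      - ALLOW_RESTARTS=1   # ...but only for restart/stop/kill endpoints

  container_doctor:
    build: .
    environment:
      - DOCKER_HOST=tcp://socket-proxy:2375
    depends_on:
      - socket-proxy
</code></pre>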
<h2 id="heading-what-id-do-differently">What I'd Do Differently</h2>
<p>After 3 weeks in production, here's my honest retrospective:</p>
<p><strong>I'd use structured logging from day one.</strong> My regex-based error detection catches too many false positives. A JSON log format with severity levels would make detection way more accurate.</p>
<p><strong>I'd add per-container policies.</strong> Right now, every container gets the same treatment. But you probably want different rules for a database vs a web server. Never auto-restart a database. Always auto-restart a stateless web server.</p>
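<p>A sketch of what that might look like: a policy table that <code>apply_fix</code> consults before touching anything. The names and numbers here are illustrative.</p>
<pre><code class="language-python"># Stateful services never get auto-restarted; stateless ones can be
# restarted more aggressively. Unknown containers default to hands-off.
CONTAINER_POLICIES = {
    "db":  {"auto_restart": False, "max_restarts_per_hour": 0},
    "web": {"auto_restart": True,  "max_restarts_per_hour": 5},
    "api": {"auto_restart": True,  "max_restarts_per_hour": 3},
}

DEFAULT_POLICY = {"auto_restart": False, "max_restarts_per_hour": 0}


def policy_for(container_name):
    """Look up a container's restart policy, defaulting to 'hands off'."""
    return CONTAINER_POLICIES.get(container_name, DEFAULT_POLICY)
</code></pre>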
<p><strong>I'd build a simple web UI.</strong> The <code>/history</code> endpoint returns JSON, but a small React dashboard showing a timeline of incidents, fix success rates, and cost tracking would be much more useful.</p>
<p><strong>I'd try local models first.</strong> For simple errors (OOM, connection refused), a small local model running on Ollama could handle the diagnosis without any API cost. Reserve Claude for the weird, complex stack traces where you actually need strong reasoning.</p>
<p><strong>I'd add a "learning mode."</strong> Run the Container Doctor in observe-only mode for a week. Let it diagnose everything but fix nothing. Review the diagnoses manually. Once you trust its judgment, flip on auto-fix. This builds confidence before you give it restart power.</p>
<h2 id="heading-whats-next">What's Next?</h2>
<p>If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish – Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.</p>
<p>Got questions or built something similar? Drop a comment below or find me on <a href="https://github.com/balajee-asish">GitHub</a> and <a href="https://linkedin.com/in/balajee-asish">LinkedIn</a>.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Troubleshoot Ghost CMS: Fixing WSL, Docker, and ActivityPub Errors ]]>
                </title>
                <description>
                    <![CDATA[ Setting up Ghost CMS (Content Management System) on your local machine is a great way to develop themes and test new features. But if you're using Windows or Docker, you might run into errors that sto ]]>
                </description>
                <link>https://www.freecodecamp.org/news/fix-ghost-cms-errors/</link>
                <guid isPermaLink="false">69bc3254b238fd45a31f6959</guid>
                
                    <category>
                        <![CDATA[ ghost ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Node.js ]]>
                    </category>
                
                    <category>
                        <![CDATA[ troubleshooting ]]>
                    </category>
                
                    <category>
                        <![CDATA[ debugging ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Thu, 19 Mar 2026 17:28:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/85f5e0bb-26ff-42ce-ba66-afec6df4bb5d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Setting up Ghost CMS (Content Management System) on your local machine is a great way to develop themes and test new features. But if you're using Windows or Docker, you might run into errors that stop your progress. And debugging takes time away from your actual development work.</p>
<p>In this guide, you'll learn the root causes and exact fixes for three common Ghost CMS deployment errors:</p>
<ul>
<li><p><strong>Error 1:</strong> SQLite installation failures on Windows.</p>
</li>
<li><p><strong>Error 2:</strong> Docker containers crashing with Code 137 (memory limits).</p>
</li>
<li><p><strong>Error 3:</strong> "Loading Interrupted" errors in the ActivityPub Network tab.</p>
</li>
</ul>
<p>By the end of this article, you'll have a stable, working local Ghost setup. You'll know how to properly use WSL for Node.js apps, manage Docker resources, and successfully configure Ghost's new social web features.</p>
<h2 id="heading-error-1-sqlite-installation-failures-on-windows">Error 1: SQLite Installation Failures on Windows</h2>
<h3 id="heading-the-symptom"><strong>The Symptom</strong></h3>
<p>When you run the command <code>ghost install local</code> on a Windows machine, the setup fails. You will see a long list of red text in your terminal that looks like this:</p>
<pre><code class="language-plaintext">Error: Cannot find module 'sqlite3'
...
node-pre-gyp ERR! stack Error: Failed to execute...
...
MSB4019: The imported project "C:\Microsoft.Cpp.Default.props" was not found.
</code></pre>
<p>The error usually mentions "sqlite3" and says it "failed to execute" or is "missing."</p>
<h3 id="heading-the-cause"><strong>The Cause</strong></h3>
<p>Ghost uses SQLite to store your blog's data. SQLite is a "native module": part of it is C code that has to be compiled for your specific operating system during installation.</p>
<p>Because Ghost was created to run on Linux servers, it expects to find Linux build tools to make these files. Windows uses different tools and a different way of organising files. When the Ghost CLI tries to build the SQLite files on Windows, it can't find the tools it needs, so the installation stops. Using WSL gives Ghost the Linux environment it expects.</p>
<h3 id="heading-how-to-fix-it">How to Fix it:</h3>
<p>You can use Windows Subsystem for Linux (WSL) to create a working setup.</p>
<ol>
<li><p>Open your WSL terminal (like Ubuntu).</p>
</li>
<li><p>Check your tools by running <code>node --version</code>, <code>npm --version</code>, and <code>python3 --version</code>.</p>
</li>
<li><p>Install the Ghost CLI globally inside WSL:</p>
<pre><code class="language-plaintext">npm install -g ghost-cli@latest
</code></pre>
</li>
<li><p>Run the local setup command:</p>
<pre><code class="language-plaintext">ghost install local
</code></pre>
</li>
<li><p>Start the server:</p>
<pre><code class="language-plaintext">ghost start
</code></pre>
</li>
</ol>
<h3 id="heading-how-to-verify">How to Verify:</h3>
<p>Open your web browser and go to <code>http://localhost:2368</code>. You should see the default Ghost welcome page load without errors.</p>
<h2 id="heading-error-2-docker-container-exiting-with-code-137">Error 2: Docker Container Exiting with Code 137</h2>
<h3 id="heading-the-symptom">The Symptom:</h3>
<p>When you're running Ghost using Docker Compose, the containers crash. The terminal logs show <code>Ghost admin container exiting with code 137</code> or <code>Admin service killed due to memory constraints</code>.</p>
<h3 id="heading-the-cause">The Cause:</h3>
<p>So why does this happen? Exit code 137 means the container was killed with SIGKILL (128 + 9), which almost always means your computer ran out of memory (RAM) and the kernel stopped the container. This usually happens if you try to run the full Ghost developer setup (which includes 15+ extra tools) on a standard computer.</p>
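<p>You can confirm this from the host before changing anything. If <code>OOMKilled</code> is <code>true</code>, the memory limit (not Ghost itself) is the culprit:</p>
<pre><code class="language-bash">docker inspect "$CONTAINER_NAME" --format '{{.State.OOMKilled}} {{.State.ExitCode}}'
# prints "true 137" when the kernel killed the container for using too much memory
</code></pre>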
<h3 id="heading-how-to-fix-it">How to Fix it:</h3>
<p>To fix this error, you can switch from the complex setup to a simple setup using the official Ghost Docker image.</p>
<p>To do this, first stop and remove the broken containers (note that <code>docker system prune -a</code> deletes <em>all</em> unused images and networks on the machine, not just Ghost's):</p>
<pre><code class="language-plaintext">docker-compose down -v
docker system prune -a
</code></pre>
<p>Then create a new <code>docker-compose.yml</code> file with only the basic tools (Ghost and a database):</p>
<pre><code class="language-plaintext">services:
  ghost:
    image: ghost:latest
    restart: always
    ports:
      - "2368:2368"
    environment:
      database__client: mysql
      database__connection__host: mysql
      database__connection__user: root
      database__connection__password: yourpassword
      database__connection__database: ghost
      url: http://localhost:2368
    volumes:
      - ghost_content:/var/lib/ghost/content

  mysql:
    image: mysql:8.0
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: yourpassword
      MYSQL_DATABASE: ghost
    volumes:
      - mysql_data:/var/lib/mysql

volumes:
  ghost_content:
  mysql_data:
</code></pre>
<p>Then start the simple setup:</p>
<pre><code class="language-plaintext">docker-compose up -d
</code></pre>
<h3 id="heading-how-to-verify">How to Verify:</h3>
<p>Type <code>docker-compose ps</code> in your terminal. You should see both the <code>ghost</code> and <code>mysql</code> containers listed with a status of "Up".</p>
<h2 id="heading-error-3-loading-interrupted-in-network-analytics">Error 3: "Loading Interrupted" in Network Analytics</h2>
<h3 id="heading-the-symptom">The Symptom:</h3>
<p>When you click the <strong>Analytics → Network</strong> tab in your local Ghost admin panel, the page shows a "Loading Interrupted" error. Your terminal logs show 404 errors and webhook failures:</p>
<pre><code class="language-plaintext">INFO "GET /.ghost/activitypub/v1/feed/reader/" 404 52ms
ERROR No webhook secret found - cannot initialise
</code></pre>
<h3 id="heading-the-cause">The Cause:</h3>
<p>The Network tab acts as an ActivityPub reader, not a normal analytics dashboard. This error happens because ActivityPub is not set up for local use. It needs extra tools (Caddy, Redis) and a clean web address without port numbers to work.</p>
<h3 id="heading-how-to-fix-it">How to Fix it:</h3>
<p>To fix this error, just run Ghost with its required Docker tools and update your local config file to turn on the social web features.</p>
<p>First, start the required tools (Caddy, MySQL, Redis) from your Ghost folder:</p>
<pre><code class="language-plaintext">SSH_AUTH_SOCK=/dev/null docker compose up -d caddy mysql redis
</code></pre>
<p>Then open your <code>config.local.json</code> file. Set the URL to a clean localhost address (remove the <code>:2368</code> port) and turn on the developer features:</p>
<pre><code class="language-plaintext">{
    "url": "http://localhost",
    "social_web_enabled": true,
    "enableDeveloperExperiments": true
}
</code></pre>
<p>Stop your current Ghost process:</p>
<pre><code class="language-plaintext">pkill -f "yarn dev:ghost"
</code></pre>
<p>And restart Ghost with the new settings:</p>
<pre><code class="language-plaintext">yarn dev:ghost
</code></pre>
<h3 id="heading-how-to-verify">How to Verify:</h3>
<p>Log back into your Ghost admin panel and click <strong>Analytics → Network</strong>. The error message will be gone, and you will see the ActivityPub feed instead.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Local setups can be hard, especially when mixing Windows, Docker, and new features like ActivityPub.</p>
<p>By fixing these three errors, you did more than just get Ghost running. You learned how to bypass Windows limits using WSL, how to manage Docker memory, and how Ghost routes social web traffic.</p>
<p>You now have a stable, fast, and fully working Ghost CMS workspace ready for your content.</p>
<p><strong>Let’s connect!</strong> You can find my latest work on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Optimize Your Docker Build Cache & Cut Your CI/CD Pipeline Times by 80% ]]>
                </title>
                <description>
                    <![CDATA[ Every developer has been there. You push a one-line fix, grab your coffee, and wait. And wait. Twelve minutes later, your Docker image finishes rebuilding from scratch because something about the cach ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-optimize-your-docker-build-cache/</link>
                <guid isPermaLink="false">69bb1e218c55d6eefb64955f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ optimization ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Wed, 18 Mar 2026 21:50:25 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/9a5ca46f-c571-4d38-90b5-3c6d7d22c00f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every developer has been there. You push a one-line fix, grab your coffee, and wait. And wait. Twelve minutes later, your Docker image finishes rebuilding from scratch because something about the cache broke again.</p>
<p>I spent a good chunk of last year debugging slow Docker builds across multiple teams. The pattern was always the same: builds that should take two minutes were eating up fifteen, and nobody knew why. The fix turned out to be surprisingly systematic once I understood what was actually happening under the hood.</p>
<p>This guide walks you through exactly how to fix slow Docker builds, step by step. We'll start with how the cache actually works, then tear apart the most common mistakes, and finish with production-ready patterns you can copy into your projects today.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-docker-build-cache-actually-works">How Docker Build Cache Actually Works</a></p>
<ul>
<li><p><a href="#heading-how-cache-keys-are-computed">How Cache Keys Are Computed</a></p>
</li>
<li><p><a href="#heading-the-cache-chain-rule">The Cache Chain Rule</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-identify-common-cache-busting-mistakes">How to Identify Common Cache-Busting Mistakes</a></p>
<ul>
<li><p><a href="#heading-mistake-1-copying-everything-too-early">Mistake 1: Copying Everything Too Early</a></p>
</li>
<li><p><a href="#heading-mistake-2-not-separating-dependency-files">Mistake 2: Not Separating Dependency Files</a></p>
</li>
<li><p><a href="#heading-mistake-3-using-add-instead-of-copy">Mistake 3: Using ADD Instead of COPY</a></p>
</li>
<li><p><a href="#heading-mistake-4-splitting-apt-get-update-and-install">Mistake 4: Splitting apt-get update and install</a></p>
</li>
<li><p><a href="#heading-mistake-5-embedding-timestamps-or-git-hashes-too-early">Mistake 5: Embedding Timestamps or Git Hashes Too Early</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-structure-your-dockerfile-for-maximum-cache-reuse">How to Structure Your Dockerfile for Maximum Cache Reuse</a></p>
<ul>
<li><p><a href="#heading-step-1-apply-the-dependency-first-pattern">Step 1: Apply the Dependency-First Pattern</a></p>
</li>
<li><p><a href="#heading-step-2-add-an-aggressive-dockerignore">Step 2: Add an Aggressive .dockerignore</a></p>
</li>
<li><p><a href="#heading-step-3-use-multi-stage-builds">Step 3: Use Multi-Stage Builds</a></p>
</li>
<li><p><a href="#heading-step-4-order-layers-by-change-frequency">Step 4: Order Layers by Change Frequency</a></p>
</li>
<li><p><a href="#heading-step-5-use-buildkit-mount-caches">Step 5: Use BuildKit Mount Caches</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-cicd-cache-backends">How to Set Up CI/CD Cache Backends</a></p>
<ul>
<li><p><a href="#heading-option-a-registry-based-cache">Option A: Registry-Based Cache</a></p>
</li>
<li><p><a href="#heading-option-b-github-actions-cache">Option B: GitHub Actions Cache</a></p>
</li>
<li><p><a href="#heading-option-c-s3-or-cloud-storage">Option C: S3 or Cloud Storage</a></p>
</li>
<li><p><a href="#heading-option-d-local-cache-with-persistent-runners">Option D: Local Cache with Persistent Runners</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-implement-advanced-cache-patterns">How to Implement Advanced Cache Patterns</a></p>
<ul>
<li><p><a href="#heading-parallel-build-stages">Parallel Build Stages</a></p>
</li>
<li><p><a href="#heading-cache-warming-for-feature-branches">Cache Warming for Feature Branches</a></p>
</li>
<li><p><a href="#heading-selective-cache-invalidation-with-build-args">Selective Cache Invalidation with Build Args</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-measure-your-improvements">How to Measure Your Improvements</a></p>
<ul>
<li><p><a href="#heading-the-four-scenarios-to-benchmark">The Four Scenarios to Benchmark</a></p>
</li>
<li><p><a href="#heading-real-world-before-and-after-numbers">Real-World Before and After Numbers</a></p>
</li>
<li><p><a href="#heading-how-to-check-cache-hit-rates">How to Check Cache Hit Rates</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-complete-optimized-dockerfile-examples">Complete Optimized Dockerfile Examples</a></p>
<ul>
<li><p><a href="#heading-nodejs-full-stack-app">Node.js Full-Stack App</a></p>
</li>
<li><p><a href="#heading-python-fastapi-app">Python FastAPI App</a></p>
</li>
<li><p><a href="#heading-go-microservice">Go Microservice</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-troubleshooting-guide">Troubleshooting Guide</a></p>
</li>
<li><p><a href="#heading-quick-reference-checklist">Quick-Reference Checklist</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you'll need:</p>
<ul>
<li><p>A working Docker installation (Docker Desktop or Docker Engine 20.10+)</p>
</li>
<li><p>Basic comfort with writing Dockerfiles</p>
</li>
<li><p>Access to a CI/CD system like GitHub Actions, GitLab CI, or Jenkins</p>
</li>
</ul>
<h2 id="heading-how-docker-build-cache-actually-works">How Docker Build Cache Actually Works</h2>
<p>Every instruction in a Dockerfile produces a <strong>layer</strong>. Docker stores these layers and reuses them when it detects nothing has changed. That's the cache. Simple enough in theory, but the details matter a lot.</p>
<h3 id="heading-how-cache-keys-are-computed">How Cache Keys Are Computed</h3>
<p>Different instructions compute their cache keys differently:</p>
<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cache Key Based On</th>
<th>What Breaks It</th>
</tr>
</thead>
<tbody><tr>
<td><code>RUN</code></td>
<td>The exact command string</td>
<td>Any change to the command text</td>
</tr>
<tr>
<td><code>COPY</code> / <code>ADD</code></td>
<td>File checksums of the source content</td>
<td>Any modification to the copied files</td>
</tr>
<tr>
<td><code>ENV</code> / <code>ARG</code></td>
<td>The variable name and value</td>
<td>Changing the value</td>
</tr>
<tr>
<td><code>FROM</code></td>
<td>The base image digest</td>
<td>A new version of the base image</td>
</tr>
</tbody></table>
<h3 id="heading-the-cache-chain-rule">The Cache Chain Rule</h3>
<p>Here's the thing most people miss: <strong>Docker cache is sequential.</strong> If any layer's cache gets invalidated, every layer after it rebuilds from scratch, even if those later layers haven't changed at all.</p>
<p>Picture a row of dominoes. Knock one over in the middle and everything after it goes down too. This is why the order of instructions in your Dockerfile is so important.</p>
<blockquote>
<p><strong>Key insight:</strong> The single most impactful optimization you can make is reordering your Dockerfile so that the stuff that changes most often comes last.</p>
</blockquote>
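<p>You can watch the dominoes fall with three back-to-back builds. This is a sketch: <code>src/index.js</code> stands in for any file your Dockerfile copies early.</p>
<pre><code class="language-bash"># Build once to populate the cache
docker build -t demo .

# Rebuild with nothing changed: every step reports CACHED
docker build -t demo .

# Change a file that's copied early, then rebuild: the COPY layer
# and every layer after it run again, even the unchanged ones
echo "// trivial edit" &gt;&gt; src/index.js
docker build -t demo .
</code></pre>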
<h2 id="heading-how-to-identify-common-cache-busting-mistakes">How to Identify Common Cache-Busting Mistakes</h2>
<p>Before we fix anything, let's look at what's probably breaking your cache right now. I've seen these patterns in almost every unoptimized Dockerfile I've reviewed.</p>
<h3 id="heading-mistake-1-copying-everything-too-early">Mistake 1: Copying Everything Too Early</h3>
<p>This is the big one. Putting <code>COPY . .</code> near the top of the Dockerfile, before installing dependencies, means that <em>any</em> file change in your project invalidates the cache from that point forward. Changed a README? Cool, now your dependencies reinstall.</p>
<pre><code class="language-dockerfile"># BAD: Any file change invalidates the dependency install
FROM node:20-alpine
WORKDIR /app
COPY . .                    # Cache busted on every commit
RUN npm ci                  # Reinstalls every single time
RUN npm run build
</code></pre>
<h3 id="heading-mistake-2-not-separating-dependency-files">Mistake 2: Not Separating Dependency Files</h3>
<p>Your dependency manifests (<code>package.json</code>, <code>requirements.txt</code>, <code>go.mod</code>, <code>Gemfile</code>) change way less often than your source code. If you don't copy them separately, you're reinstalling all dependencies every time you touch a source file.</p>
<h3 id="heading-mistake-3-using-add-instead-of-copy">Mistake 3: Using ADD Instead of COPY</h3>
<p><code>ADD</code> has special behaviors like auto-extracting archives and fetching remote URLs. Those features make its cache behavior unpredictable. Stick with <code>COPY</code> unless you specifically need archive extraction.</p>
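<p>In practice, that means reaching for <code>COPY</code> by default and reserving <code>ADD</code> for the one job <code>COPY</code> can't do. A quick sketch (the archive name is hypothetical):</p>
<pre><code class="language-dockerfile"># Default: plain file copying
COPY config/ /app/config/

# Exception: ADD auto-extracts local tar archives into the target directory
ADD vendor-libs.tar.gz /opt/vendor/
</code></pre>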
<h3 id="heading-mistake-4-splitting-apt-get-update-and-install">Mistake 4: Splitting apt-get update and install</h3>
<p>When you put <code>apt-get update</code> and <code>apt-get install</code> in separate <code>RUN</code> commands, the update step gets cached with stale package indexes. Then the install step fails or grabs outdated packages.</p>
<pre><code class="language-dockerfile"># BAD: Stale package index
RUN apt-get update
RUN apt-get install -y curl    # May fail with stale index

# GOOD: Always combine them
RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*
</code></pre>
<h3 id="heading-mistake-5-embedding-timestamps-or-git-hashes-too-early">Mistake 5: Embedding Timestamps or Git Hashes Too Early</h3>
<p>Injecting build-time variables like timestamps or git commit hashes via <code>ARG</code> or <code>ENV</code> early in the Dockerfile invalidates the cache on every single build. Move these to the very last layer.</p>
<blockquote>
<p>⚠️ <strong>Watch out for this:</strong> CI/CD systems often inject variables like <code>BUILD_NUMBER</code> or <code>GIT_SHA</code> as build args automatically. If those <code>ARG</code> declarations sit near the top, your cache is toast on every run.</p>
</blockquote>
<h2 id="heading-how-to-structure-your-dockerfile-for-maximum-cache-reuse">How to Structure Your Dockerfile for Maximum Cache Reuse</h2>
<p>Now let's fix those mistakes. These five steps, applied in order, will get you most of the way to an optimized build.</p>
<h3 id="heading-step-1-apply-the-dependency-first-pattern">Step 1: Apply the Dependency-First Pattern</h3>
<p>Copy only the dependency manifests first, install, and then copy the rest of the source code. This one change alone can cut your build times in half.</p>
<pre><code class="language-dockerfile"># GOOD: Dependency-first pattern for Node.js
FROM node:20-alpine
WORKDIR /app

# Copy ONLY dependency files
COPY package.json package-lock.json ./

# Install dependencies (cached unless package files change).
# Plain `npm ci` here, not `npm ci --omit=dev`: the build step below
# needs devDependencies. Prune dev packages in a later stage instead.
RUN npm ci

# Copy source code (only this layer rebuilds on code changes)
COPY . .

# Build
RUN npm run build
</code></pre>
<p>The same idea works across every language:</p>
<table>
<thead>
<tr>
<th>Language</th>
<th>Copy First</th>
<th>Install Command</th>
</tr>
</thead>
<tbody><tr>
<td>Node.js</td>
<td><code>package.json</code>, <code>package-lock.json</code></td>
<td><code>npm ci</code></td>
</tr>
<tr>
<td>Python</td>
<td><code>requirements.txt</code> or <code>pyproject.toml</code></td>
<td><code>pip install -r requirements.txt</code></td>
</tr>
<tr>
<td>Go</td>
<td><code>go.mod</code>, <code>go.sum</code></td>
<td><code>go mod download</code></td>
</tr>
<tr>
<td>Rust</td>
<td><code>Cargo.toml</code>, <code>Cargo.lock</code></td>
<td><code>cargo fetch</code></td>
</tr>
<tr>
<td>Java (Maven)</td>
<td><code>pom.xml</code></td>
<td><code>mvn dependency:go-offline</code></td>
</tr>
<tr>
<td>Ruby</td>
<td><code>Gemfile</code>, <code>Gemfile.lock</code></td>
<td><code>bundle install</code></td>
</tr>
</tbody></table>
<h3 id="heading-step-2-add-an-aggressive-dockerignore">Step 2: Add an Aggressive .dockerignore</h3>
<p>A <code>.dockerignore</code> file keeps irrelevant files out of the build context. Fewer files in the context means fewer things that can break your cache.</p>
<pre><code class="language-plaintext"># .dockerignore
.git
node_modules
dist
*.md
*.log
.env*
docker-compose*.yml
Dockerfile*
.github
tests
coverage
__pycache__
</code></pre>
<h3 id="heading-step-3-use-multi-stage-builds">Step 3: Use Multi-Stage Builds</h3>
<p>Multi-stage builds let you use a full development image for compiling, then copy only the finished artifacts into a slim runtime image. You get smaller images, better security, and improved cache performance because build tools and intermediate files don't carry over.</p>
<pre><code class="language-dockerfile"># Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package.json ./
EXPOSE 3000
CMD ["node", "dist/index.js"]
</code></pre>
<h3 id="heading-step-4-order-layers-by-change-frequency">Step 4: Order Layers by Change Frequency</h3>
<p>Think of your Dockerfile as a stack. Put the boring, stable stuff at the top and the volatile stuff at the bottom (an annotated skeleton follows the list):</p>
<ol>
<li><p>Base image and system dependencies (rarely change)</p>
</li>
<li><p>Language runtime configuration (occasionally change)</p>
</li>
<li><p>Application dependencies (change when you add or remove packages)</p>
</li>
<li><p>Source code (changes on every commit)</p>
</li>
<li><p>Build-time metadata like git hash or version labels (changes every build)</p>
</li>
</ol>
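<p>Here's that ordering as a minimal annotated skeleton for a Node.js app:</p>
<pre><code class="language-dockerfile"># 1. Base image and system dependencies (rarely change)
FROM node:20-alpine
RUN apk add --no-cache curl

# 2. Language runtime configuration (occasionally changes)
WORKDIR /app
ENV NODE_OPTIONS=--max-old-space-size=2048

# 3. Application dependencies (change when packages change)
COPY package.json package-lock.json ./
RUN npm ci

# 4. Source code (changes on every commit)
COPY . .
RUN npm run build

# 5. Build-time metadata (changes on every build)
ARG GIT_SHA=unknown
LABEL git.sha=$GIT_SHA
</code></pre>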
<h3 id="heading-step-5-use-buildkit-mount-caches">Step 5: Use BuildKit Mount Caches</h3>
<p>Docker BuildKit supports <code>RUN --mount=type=cache</code>, which mounts a persistent cache directory that survives across builds. This is a game-changer for package managers that maintain their own download caches.</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .

# Mount pip cache so downloads persist across builds
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY . .
</code></pre>
<p>The best part: mount caches persist even when the layer itself gets invalidated. So if you add one new package, pip only downloads that one package instead of re-fetching everything.</p>
<p>Here are the common cache targets for popular package managers:</p>
<table>
<thead>
<tr>
<th>Package Manager</th>
<th>Cache Target</th>
</tr>
</thead>
<tbody><tr>
<td>pip</td>
<td><code>/root/.cache/pip</code></td>
</tr>
<tr>
<td>npm</td>
<td><code>/root/.npm</code></td>
</tr>
<tr>
<td>yarn</td>
<td><code>/usr/local/share/.cache/yarn</code></td>
</tr>
<tr>
<td>go</td>
<td><code>/go/pkg/mod</code></td>
</tr>
<tr>
<td>apt</td>
<td><code>/var/cache/apt</code></td>
</tr>
<tr>
<td>maven</td>
<td><code>/root/.m2/repository</code></td>
</tr>
</tbody></table>
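<p>For example, the npm row from the table plugs in like this (the same mount shows up again in the complete Node.js example later in this article):</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package.json package-lock.json ./

# npm's download cache survives even when this layer is invalidated
RUN --mount=type=cache,target=/root/.npm npm ci
</code></pre>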
<h2 id="heading-how-to-set-up-cicd-cache-backends">How to Set Up CI/CD Cache Backends</h2>
<p>Here's where things get tricky. Your local Docker cache works great on your laptop because the layers persist between builds. But CI/CD runners are usually ephemeral: each job starts with a totally empty cache. Without explicit cache configuration, every CI build is a cold build.</p>
<h3 id="heading-option-a-registry-based-cache">Option A: Registry-Based Cache</h3>
<p>BuildKit can push and pull cache layers from a container registry. This is the most portable approach and works with any CI system.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=registry,ref=myregistry.io/myapp:buildcache \
  --cache-to type=registry,ref=myregistry.io/myapp:buildcache,mode=max \
  --tag myregistry.io/myapp:latest \
  --push .
</code></pre>
<blockquote>
<p>💡 <strong>Use</strong> <code>mode=max</code> to cache all layers including intermediate build stages. The default <code>mode=min</code> only caches layers in the final stage, which means your build stage layers get thrown away.</p>
</blockquote>
<h3 id="heading-option-b-github-actions-cache">Option B: GitHub Actions Cache</h3>
<p>If you're on GitHub Actions, there's native integration with BuildKit through the GitHub Actions cache API. It's fast and requires minimal setup.</p>
<pre><code class="language-yaml"># .github/workflows/build.yml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myregistry.io/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
</code></pre>
<h3 id="heading-option-c-s3-or-cloud-storage">Option C: S3 or Cloud Storage</h3>
<p>For teams on AWS, GCP, or Azure, cloud object storage makes a solid cache backend. It's fast, persistent, and works across any CI system.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=my-docker-cache,name=myapp \
  --cache-to type=s3,region=us-east-1,bucket=my-docker-cache,name=myapp,mode=max \
  --tag myapp:latest .
</code></pre>
<h3 id="heading-option-d-local-cache-with-persistent-runners">Option D: Local Cache with Persistent Runners</h3>
<p>If your CI runners have persistent storage (self-hosted runners, GitLab runners with shared volumes), you can export cache to a local directory.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=local,src=/ci-cache/myapp \
  --cache-to type=local,dest=/ci-cache/myapp,mode=max \
  --tag myapp:latest .
</code></pre>
<h2 id="heading-how-to-implement-advanced-cache-patterns">How to Implement Advanced Cache Patterns</h2>
<p>Once you've nailed the basics, these patterns can squeeze out even more performance.</p>
<h3 id="heading-parallel-build-stages">Parallel Build Stages</h3>
<p>BuildKit builds independent stages in parallel. If your app has a frontend and a backend that don't depend on each other during build, split them into separate stages and let BuildKit run them simultaneously.</p>
<pre><code class="language-dockerfile"># These stages build in parallel
FROM node:20-alpine AS frontend
WORKDIR /frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM python:3.12-slim AS backend
WORKDIR /backend
COPY backend/requirements.txt .
RUN pip install -r requirements.txt
COPY backend/ .

# Final stage combines both
FROM python:3.12-slim
COPY --from=backend /backend /app
COPY --from=frontend /frontend/dist /app/static
CMD ["python", "/app/main.py"]
</code></pre>
<h3 id="heading-cache-warming-for-feature-branches">Cache Warming for Feature Branches</h3>
<p>Feature branches often start with a cold cache because they diverge from main. You can warm the cache by specifying multiple <code>--cache-from</code> sources. Docker checks them in order.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=registry,ref=registry.io/app:cache-${BRANCH} \
  --cache-from type=registry,ref=registry.io/app:cache-main \
  --cache-to type=registry,ref=registry.io/app:cache-${BRANCH},mode=max \
  --tag registry.io/app:${BRANCH} .
</code></pre>
<p>If the branch cache hits, Docker uses it. If not, it falls back to main's cache, which usually shares most layers. This makes a massive difference for short-lived branches.</p>
<h3 id="heading-selective-cache-invalidation-with-build-args">Selective Cache Invalidation with Build Args</h3>
<p>You can use <code>ARG</code> instructions as cache boundaries. Layers above the <code>ARG</code> stay cached when its value changes. Below it, the cache misses at the first instruction that consumes the variable, and because every <code>RUN</code> after an <code>ARG</code> declaration receives its value implicitly as an environment variable, those <code>RUN</code> layers rebuild too.</p>
<pre><code class="language-dockerfile">FROM node:20-alpine
WORKDIR /app

COPY package.json package-lock.json ./
RUN npm ci

# This ARG only invalidates layers below it
ARG CACHE_BUST_CODE=1
COPY . .
RUN npm run build

# This ARG only invalidates the label
ARG GIT_SHA=unknown
LABEL git.sha=$GIT_SHA
</code></pre>
<h2 id="heading-how-to-measure-your-improvements">How to Measure Your Improvements</h2>
<p>Optimization without measurement is just guessing. Here's how to actually prove your changes are working.</p>
<h3 id="heading-the-four-scenarios-to-benchmark">The Four Scenarios to Benchmark</h3>
<p>Run each scenario at least three times and take the median (a shell sketch for producing the timings follows the list):</p>
<ol>
<li><p><strong>Cold build:</strong> No cache at all (first build or after <code>docker builder prune</code>)</p>
</li>
<li><p><strong>Warm build:</strong> No changes, full cache hit</p>
</li>
<li><p><strong>Code change:</strong> Only source code modified</p>
</li>
<li><p><strong>Dependency change:</strong> Package manifest modified</p>
</li>
</ol>
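<p>Here's a shell sketch for timing the four scenarios on a Node.js project. The file and package names are placeholders, and note that <code>COPY</code> cache keys are content checksums, so a plain <code>touch</code> won't bust anything: the file contents have to change.</p>
<pre><code class="language-bash"># Scenario 1: cold build (wipe the BuildKit cache first)
docker builder prune -af
time docker buildx build -t myapp:bench .

# Scenario 2: warm build, no changes at all
time docker buildx build -t myapp:bench .

# Scenario 3: code change only (contents must actually change)
echo "// bench edit" &gt;&gt; src/index.js
time docker buildx build -t myapp:bench .

# Scenario 4: dependency change (any manifest edit works)
npm install --save-exact left-pad
time docker buildx build -t myapp:bench .
</code></pre>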
<h3 id="heading-real-world-before-and-after-numbers">Real-World Before and After Numbers</h3>
<p>Here's what I saw on a mid-sized Node.js project after applying the techniques from this guide:</p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Before</th>
<th>After</th>
<th>Improvement</th>
</tr>
</thead>
<tbody><tr>
<td>Cold build</td>
<td>12 min 34 sec</td>
<td>8 min 10 sec</td>
<td>35%</td>
</tr>
<tr>
<td>Warm build (no changes)</td>
<td>12 min 34 sec</td>
<td>14 sec</td>
<td>98%</td>
</tr>
<tr>
<td>Code change only</td>
<td>12 min 34 sec</td>
<td>1 min 52 sec</td>
<td>85%</td>
</tr>
<tr>
<td>Dependency change</td>
<td>12 min 34 sec</td>
<td>4 min 20 sec</td>
<td>65%</td>
</tr>
</tbody></table>
<p>The "before" column is the same for all rows because without cache optimization, every build was essentially a cold build. That 85% improvement on code-only changes is the number that matters most, since that's what happens on the vast majority of commits.</p>
<h3 id="heading-how-to-check-cache-hit-rates">How to Check Cache Hit Rates</h3>
<p>Set <code>BUILDKIT_PROGRESS=plain</code> to get detailed output showing which layers hit cache:</p>
<pre><code class="language-bash">BUILDKIT_PROGRESS=plain docker buildx build . 2&gt;&amp;1 | grep -E 'CACHED|DONE'
</code></pre>
<p>Look for the <code>CACHED</code> prefix on layers. Your goal is to see <code>CACHED</code> on everything except the layers that actually needed to change.</p>
<h2 id="heading-complete-optimized-dockerfile-examples">Complete Optimized Dockerfile Examples</h2>
<p>Here are production-ready Dockerfiles you can adapt for your own projects.</p>
<h3 id="heading-nodejs-full-stack-app">Node.js Full-Stack App</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN --mount=type=cache,target=/root/.npm npm ci

FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 appgroup \
    &amp;&amp; adduser --system --uid 1001 appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
COPY package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
</code></pre>
<h3 id="heading-python-fastapi-app">Python FastAPI App</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --user -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
</code></pre>
<h3 id="heading-go-microservice">Go Microservice</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -ldflags='-s -w' -o /app/server ./cmd/server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
EXPOSE 8080
ENTRYPOINT ["/server"]
</code></pre>
<h2 id="heading-troubleshooting-guide">Troubleshooting Guide</h2>
<p>When things go wrong, check this table first:</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>All layers rebuild every time</td>
<td><code>COPY . .</code> is too early, or <code>.dockerignore</code> is missing</td>
<td>Move <code>COPY . .</code> after dependency install; add <code>.dockerignore</code></td>
</tr>
<tr>
<td>Cache never hits in CI</td>
<td>No cache backend configured</td>
<td>Add <code>--cache-from</code> / <code>--cache-to</code> with registry, gha, or s3 backend</td>
</tr>
<tr>
<td>Cache hits locally but not in CI</td>
<td>Different Docker versions or BuildKit not enabled</td>
<td>Set <code>DOCKER_BUILDKIT=1</code> and match Docker versions</td>
</tr>
<tr>
<td>Dependency layer always rebuilds</td>
<td>Source files copied before dependency install</td>
<td>Use the dependency-first pattern</td>
</tr>
<tr>
<td>Image size keeps growing</td>
<td>Build artifacts leaking into final image</td>
<td>Use multi-stage builds; only copy runtime artifacts</td>
</tr>
<tr>
<td>Registry cache is very slow</td>
<td><code>mode=max</code> caching too many layers</td>
<td>Try <code>mode=min</code> or switch to gha/s3 for faster backends</td>
</tr>
</tbody></table>
<h2 id="heading-quick-reference-checklist">Quick-Reference Checklist</h2>
<p>Print this out and tape it next to your monitor:</p>
<ul>
<li><p>[ ] Enable BuildKit: set <code>DOCKER_BUILDKIT=1</code> or use <code>docker buildx</code></p>
</li>
<li><p>[ ] Add a comprehensive <code>.dockerignore</code> file</p>
</li>
<li><p>[ ] Use the dependency-first pattern: copy manifests, install, then copy source</p>
</li>
<li><p>[ ] Order layers from least-changed to most-changed</p>
</li>
<li><p>[ ] Combine <code>RUN</code> commands that belong together (<code>apt-get update &amp;&amp; install</code>)</p>
</li>
<li><p>[ ] Use multi-stage builds to separate build and runtime</p>
</li>
<li><p>[ ] Add <code>RUN --mount=type=cache</code> for package manager caches</p>
</li>
<li><p>[ ] Move volatile <code>ARG</code>s (git hash, build number) to the very last layers</p>
</li>
<li><p>[ ] Configure a CI/CD cache backend (registry, gha, or s3)</p>
</li>
<li><p>[ ] Set up cache warming for feature branches from the main branch</p>
</li>
<li><p>[ ] Use <code>COPY</code> instead of <code>ADD</code> unless you need archive extraction</p>
</li>
<li><p>[ ] Benchmark all four scenarios: cold, warm, code change, dependency change</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I used to think slow Docker builds were just something you had to live with. After going through this process on a few projects, I realized the fix is pretty mechanical once you understand that one core principle: cache is sequential, and order matters.</p>
<p>Start with the dependency-first pattern and a <code>.dockerignore</code>. Those two changes alone will probably cut your build times in half. Then add multi-stage builds, mount caches, and CI/CD cache backends as you need them.</p>
<p>The teams I've worked with typically see 70-85% reductions in CI/CD pipeline times after spending a few hours on these changes. That's time you get back on every single commit, every single day.</p>
<p>If you found this helpful, consider sharing it with your team. There's a good chance whoever wrote your Dockerfile last didn't know about half of these tricks. No shade to them, I didn't either until I went looking.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Containerize Your MLOps Pipeline from Training to Serving ]]>
                </title>
                <description>
                    <![CDATA[ Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deplo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/containerize-mlops-pipeline-from-training-to-serving/</link>
                <guid isPermaLink="false">69b33f5993256dfc5313bee2</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ production ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 22:34:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/156eaca3-8884-4f57-9010-9766278dbf5a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.</p>
<p>The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.</p>
<p>Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.</p>
<p>That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.</p>
<p>In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+</p>
</li>
<li><p>For GPU training, you'll need the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA Container Toolkit</a> installed on the host and a compatible GPU driver. Run <code>nvidia-smi</code> to verify your GPU is visible, and <code>docker compose version</code> to check your Compose version.</p>
</li>
<li><p>Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-build-the-training-container">How to Build the Training Container</a></p>
<ul>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</a></p>
</li>
</ul>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-build-the-serving-container">How to Build the Serving Container</a></p>
<ul>
<li><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-decouple-models-from-containers">Decouple Models from Containers</a></li>
</ul>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-where-this-breaks-down">Where This Breaks Down</a></p>
</li>
</ul>
<h2 id="heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</h2>
<p>If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.</p>
<p>An MLOps pipeline is a chain of interdependent stages:</p>
<ol>
<li><p><strong>Data ingestion and validation.</strong> Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.</p>
</li>
<li><p><strong>Feature engineering.</strong> You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.</p>
</li>
<li><p><strong>Experiment tracking.</strong> You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.</p>
</li>
<li><p><strong>Model training.</strong> The model learns from your features. This is the compute-heavy part that often needs GPUs.</p>
</li>
<li><p><strong>Evaluation.</strong> You measure the trained model against test data to see if it's good enough to deploy.</p>
</li>
<li><p><strong>Packaging and serving.</strong> You wrap the trained model in an API so other systems can send it data and get predictions back.</p>
</li>
<li><p><strong>Monitoring.</strong> You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.</p>
</li>
</ol>
<p>Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.</p>
<p>The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.</p>
<p>Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.</p>
<p>This gives you the flexibility to:</p>
<ul>
<li><p>Scale training on expensive GPU instances while running serving on cheaper CPU nodes</p>
</li>
<li><p>Update your feature engineering code without rebuilding your training environment</p>
</li>
<li><p>Version each stage independently in your container registry</p>
</li>
<li><p>Let data scientists and ML engineers work on training while platform engineers optimize serving</p>
</li>
</ul>
<h2 id="heading-how-to-build-the-training-container">How to Build the Training Container</h2>
<p>The training container is where most teams start, and where most teams make their first mistake.</p>
<p>The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.</p>
<p>Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.</p>
<p>If you're new to these concepts: a <strong>multi-stage build</strong> lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.</p>
<p>A <strong>cache mount</strong> tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.</p>
<p>Here's the training Dockerfile:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]
</code></pre>
<p>Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.</p>
<p>That's why we put things in order of how often they change:</p>
<ol>
<li><p><strong>System packages at the top</strong> (they almost never change). Installing <code>python3.11</code> and <code>git</code> takes time, but you only do it once.</p>
</li>
<li><p><strong>Python dependencies in the middle</strong> (they change when you add or update a library). This layer rebuilds when <code>requirements-train.txt</code> changes.</p>
</li>
<li><p><strong>Your actual code at the bottom</strong> (changes on every commit). This is the layer that rebuilds most often.</p>
</li>
</ol>
<p>With this ordering, a code change only rebuilds the final layer, not the entire image. If you put <code>COPY src/</code> before <code>pip install</code>, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.</p>
<p>The <code>--mount=type=cache,target=/root/.cache/pip</code> line on the <code>pip install</code> command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.</p>
<h3 id="heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</h3>
<p>Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.</p>
<p>It's a good idea to maintain separate requirements files:</p>
<pre><code class="language-plaintext"># requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
</code></pre>
<p>The overlap is smaller than you'd think. <code>torch</code> and <code>scikit-learn</code> appear in both because the model needs them for inference. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.</p>
<h3 id="heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</h3>
<p>One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule of thumb: the host driver must support a CUDA version at least as new as the CUDA runtime in the container. For example, CUDA 12.6 requires driver version 560.28+ on Linux.</p>
<p>Make sure you check your host driver version before choosing your base image:</p>
<pre><code class="language-bash"># On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed
</code></pre>
<p>If your host driver is 535.x, don't use a <code>cuda:12.6</code> base image. Use <code>cuda:12.2</code> or upgrade the driver. Mismatched versions produce cryptic errors like <code>CUDA error: no kernel image is available for execution on the device</code> that are painful to debug.</p>
<p>Pin your base images to specific tags (not <code>latest</code>) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.</p>
<h2 id="heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</h2>
<p>If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.</p>
<p><a href="https://mlflow.org/">MLflow</a> is the most widely adopted open-source tool for this. It logs three things for every training run: <strong>parameters</strong> (learning rate, batch size, number of epochs), <strong>metrics</strong> (accuracy, loss, F1 score), and <strong>artifacts</strong> (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.</p>
<p>Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:</p>
<pre><code class="language-yaml">services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:
</code></pre>
<p>Let me break down what's happening here.</p>
<p>The <code>mlflow</code> service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume.</p>
<p>The <code>depends_on</code> with <code>condition: service_healthy</code> tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.</p>
<p>The <code>db</code> service runs Postgres with a health check that uses <code>pg_isready</code>, a built-in Postgres utility that checks if the database is accepting connections. The <code>start_period</code> gives Postgres 10 seconds to initialize before health checks start counting failures.</p>
<p>Your training code connects to MLflow by setting one environment variable:</p>
<pre><code class="language-python">import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")
</code></pre>
<p>After the run completes, open <code>http://localhost:5000</code> in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.</p>
<p>A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.</p>
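<p>A minimal sketch of what that looks like with Compose file-based secrets. The secrets path is hypothetical; the official <code>postgres</code> image reads <code>POSTGRES_PASSWORD_FILE</code> natively, and the MLflow service would need the same credential injected into its <code>--backend-store-uri</code> as well:</p>
<pre><code class="language-yaml">services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      # Read the password from the mounted secret instead of the YAML
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt   # keep this file out of Git
</code></pre>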
<h2 id="heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</h2>
<p>Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.</p>
<p><a href="https://dvc.org/">DVC (Data Version Control)</a> fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a <code>.dvc</code> file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.</p>
<p>The workflow on your local machine looks like this:</p>
<pre><code class="language-bash"># Initialize DVC in your project (one time)
dvc init

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to .gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file to Git
git add data/training_data.parquet.dvc .gitignore
git commit -m "Add training data v1"
</code></pre>
<p>Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run <code>dvc pull</code> and DVC downloads it from remote storage.</p>
<p>The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
</code></pre>
<p>The entrypoint script pulls the data and then starts training:</p>
<pre><code class="language-bash">#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"
</code></pre>
<p>For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:</p>
<pre><code class="language-yaml">training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1
</code></pre>
<p>Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.</p>
<p>You can reproduce any past experiment by checking out the Git commit and running the training container.</p>
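<p>In shell terms, replaying an old experiment looks roughly like this (<code>&lt;commit-sha&gt;</code> is whichever commit you logged in MLflow):</p>
<pre><code class="language-bash"># Check out the code, configs, and .dvc pointer files as they were
git checkout &lt;commit-sha&gt;

# Rebuild the training image from that commit
docker compose --profile train build training

# Run it: the entrypoint pulls the matching data version via DVC,
# then logs a fresh run to MLflow that you can compare to the original
docker compose --profile train run training
</code></pre>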
<h2 id="heading-how-to-build-the-serving-container">How to Build the Serving Container</h2>
<p>"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a <code>/predict</code> endpoint that accepts transaction data and returns a fraud probability.</p>
<p>The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:</p>
<pre><code class="language-dockerfile">FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]
</code></pre>
<p>A few things to understand if you're new to this:</p>
<p><code>uvicorn</code> is a lightweight Python web server that runs <a href="https://fastapi.tiangolo.com/">FastAPI</a> applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.</p>
<p><code>HEALTHCHECK</code> tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the <code>curl</code> command against the <code>/health</code> endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.</p>
<p><code>start-period</code> of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without <code>start-period</code>, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.</p>
<p>Notice we're using <code>python:3.11-slim</code> here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger.</p>
<p>If you want to skip the <code>curl</code> dependency, use Python's built-in <code>urllib</code> for the health check:</p>
<pre><code class="language-dockerfile">HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
</code></pre>
<h3 id="heading-decouple-models-from-containers">Decouple Models from Containers</h3>
<p>This is one of the most important patterns in this article, and the one beginners most often get wrong.</p>
<p>The temptation is to copy your trained model file (the <code>.pkl</code>, <code>.pt</code>, or <code>.onnx</code> file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.</p>
<p>Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.</p>
<p>Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:</p>
<pre><code class="language-python">import os
from contextlib import asynccontextmanager

import mlflow
import pandas as pd
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models:/&lt;model-name&gt;/&lt;stage&gt;" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        return {"status": "loading"}, 503
    return {"status": "healthy"}


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    import pandas as pd

    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
</code></pre>
<p>When a client sends a POST request to <code>/predict</code> with JSON like <code>{"amount": 500, "merchant_category": "electronics", "hour": 23}</code>, the model returns a prediction. The <code>/health</code> endpoint returns 503 while the model is loading and 200 once it's ready, which is exactly what the Docker <code>HEALTHCHECK</code> checks for.</p>
<p>Promoting a new model version means updating the <code>MODEL_URI</code> environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version.</p>
<p>For zero-downtime model updates, implement a reload endpoint that swaps models without restarting:</p>
<pre><code class="language-python">@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}
</code></pre>
<h2 id="heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</h2>
<p>By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.</p>
<p>This requires two things on the host (the machine running Docker, not inside the container):</p>
<ol>
<li><p><strong>NVIDIA GPU drivers</strong> installed and working. Verify with <code>nvidia-smi</code>. If that command shows your GPUs, you're good.</p>
</li>
<li><p><strong>NVIDIA Container Toolkit</strong> installed. This is the bridge between Docker and the GPU drivers. Install it from the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA docs</a> and verify with <code>docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi</code>. If you see your GPU listed, the toolkit is working.</p>
</li>
</ol>
<p>Once the host is set up, GPU access in Docker Compose looks like this:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>The <code>deploy.resources.reservations.devices</code> block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow will see the GPUs and use them automatically. You can verify by adding <code>print(torch.cuda.is_available())</code> to your training script, which should print <code>True</code>.</p>
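<p>A quick way to run that check without editing your training script, a sketch that overrides the training entrypoint for a single run:</p>
<pre><code class="language-bash"># One-off check that PyTorch inside the container can see the GPUs
docker compose --profile train run --rm --entrypoint python \
  training -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
</code></pre>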
<p>If you're running Compose v2.30.0+, you can use the shorter <code>gpus</code> syntax:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using <code>device_ids</code>. This matters when running multiple training jobs at the same time:</p>
<pre><code class="language-yaml">services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
</code></pre>
<p>Note that <code>CUDA_VISIBLE_DEVICES</code> inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.</p>
<h2 id="heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</h2>
<p>If you're new to Compose profiles: by default, <code>docker compose up</code> starts every service defined in your <code>docker-compose.yml</code>. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).</p>
<p>Profiles solve this. When you add <code>profiles: ["train"]</code> to a service, that service is excluded from <code>docker compose up</code> by default. It only starts when you explicitly activate the profile with <code>docker compose --profile train</code>. This means one file defines your entire ML infrastructure, but you control what runs and when.</p>
<p>Here's the complete <code>docker-compose.yml</code> that ties every piece from this article together:</p>
<pre><code class="language-yaml">services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:
</code></pre>
<p>The day-to-day workflow with this file:</p>
<pre><code class="language-bash"># Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'
</code></pre>
<p>This single-file approach means a new team member can clone the repo, run <code>docker compose up -d</code>, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).</p>
<h2 id="heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</h2>
<p>Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.</p>
<p>Here are the practices that make this work:</p>
<h3 id="heading-pin-everything">Pin Everything</h3>
<p>Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with <code>pip freeze &gt; requirements.txt</code>. Use fixed random seeds in your training code and log them in MLflow.</p>
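<p>As an example, here is one way to resolve a tag to an immutable digest you can reference in your <code>FROM</code> line (the <code>python:3.11-slim</code> tag here is just an illustration, not this article's actual base image):</p>
<pre><code class="language-bash"># Resolve the tag you currently build from to a content digest
docker pull python:3.11-slim
docker inspect --format '{{index .RepoDigests 0}}' python:3.11-slim
# python@sha256:&lt;digest&gt;

# Then pin the digest in your Dockerfile instead of the tag:
# FROM python@sha256:&lt;digest&gt;

# And pin Python packages exactly
pip freeze &gt; requirements.txt
</code></pre>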
<h3 id="heading-log-everything">Log Everything</h3>
<p>Every training run should log the exact library versions (<code>pip freeze</code>), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:</p>
<pre><code class="language-python">import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...
</code></pre>
<h3 id="heading-version-everything">Version Everything</h3>
<p>Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.</p>
<h2 id="heading-where-this-breaks-down">Where This Breaks Down</h2>
<p>This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:</p>
<p><strong>Large datasets.</strong> Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.</p>
<p><strong>GPU driver mismatches.</strong> Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.</p>
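<p>One quick way to catch a mismatch early is to compare the host driver against what a CUDA container can actually use. This is a sketch; swap in the CUDA image tag your project is built on:</p>
<pre><code class="language-bash"># Host side: report the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Container side: if the toolkit and driver are compatible, this succeeds
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
</code></pre>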
<p><strong>Multi-node training.</strong> When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.</p>
<p><strong>Serving at scale.</strong> A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with <code>docker compose up --scale serving=3</code>, but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.</p>
<p><strong>Secrets in production.</strong> The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.</p>
<p>That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it. The next model we shipped went from "notebook finished" to "running in production" in two days instead of three weeks, and most of that time was spent on evaluation and review, not fighting environments.</p>
<p>This approach has the limits described in the previous section. But even with those caveats, containerized MLOps eliminates the most common source of ML project delays: environment mismatch between development and production.</p>
<p>Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.</p>
<p>If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Self-Host AFFiNE on Windows with WSL and Docker ]]>
                </title>
                <description>
                    <![CDATA[ Depending on cloud apps means that you don't truly own your notes. If your internet goes down or if the company changes its rules, you could lose access. In this article, you'll learn how to build you ]]>
                </description>
                <link>https://www.freecodecamp.org/news/self-host-affine-windows/</link>
                <guid isPermaLink="false">69b2e3051be92d8f177bf807</guid>
                
                    <category>
                        <![CDATA[ self-hosted ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Abdul Talha ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 16:00:05 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/950eee10-aa2c-4071-9c40-abaf759f6d10.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>Relying on cloud apps means you don't truly own your notes. If your internet goes down or if the company changes its rules, you could lose access.</p>
<p>In this article, you'll learn how to build your own private workspace using AFFiNE. You'll use Docker Compose to link three separate pieces of software together:</p>
<ul>
<li><p>The AFFiNE Core application.</p>
</li>
<li><p>A PostgreSQL database to store your notes and pages.</p>
</li>
<li><p>A Redis cache to make the app run fast and smooth.</p>
</li>
</ul>
<p>By the end of this article, you'll have a fully functional web app running on your own computer that works just like the cloud version of Notion.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-is-affine">What is AFFiNE?</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-step-1-preparing-your-workspace">Step 1: Preparing Your Workspace</a></p>
</li>
<li><p><a href="#heading-step-2-getting-the-official-setup-files">Step 2: Getting the Official Setup Files</a></p>
</li>
<li><p><a href="#heading-step-3-configuring-your-environment-env">Step 3: Configuring Your Environment (.env)</a></p>
</li>
<li><p><a href="#heading-step-4-launching-the-system">Step 4: Launching the System</a></p>
</li>
<li><p><a href="#heading-step-5-accessing-the-admin-panel">Step 5: Accessing the Admin Panel</a></p>
</li>
<li><p><a href="#heading-step-6-configuration-making-it-yours">Step 6: Configuration (Making It Yours)</a></p>
</li>
<li><p><a href="#heading-step-7-connecting-the-desktop-app-optional">Step 7: Connecting the Desktop App (Optional)</a></p>
</li>
<li><p><a href="#heading-step-8-stopping-the-server-and-safe-backups">Step 8: Stopping the Server and Safe Backups</a></p>
</li>
<li><p><a href="#heading-step-9-how-to-upgrade-later">Step 9: How to Upgrade Later</a></p>
</li>
<li><p><a href="#heading-common-installation-errors-and-troubleshooting">Common Installation Errors and Troubleshooting</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-affine">What is AFFiNE?</h2>
<p>AFFiNE is an "all-in-one" workspace that combines the powers of writing, drawing, and planning.</p>
<p>While tools like Notion focus on documents and Miro focus on whiteboards, AFFiNE lets you do both in a single space. You can turn your written notes into a visual canvas with one click. This makes it perfect for brainstorming, tracking tasks, and managing your personal knowledge.</p>
<h3 id="heading-the-power-of-self-hosting">The Power of Self-Hosting</h3>
<p>While AFFiNE offers a cloud version, hosting it yourself gives you three major benefits:</p>
<ul>
<li><p><strong>Total data ownership:</strong> Your notes never leave your machine. You own the database.</p>
</li>
<li><p><strong>Privacy in the AI age:</strong> No big tech company can scan your private ideas or use them for AI training.</p>
</li>
<li><p><strong>Real DevOps skills:</strong> Learning how to manage Docker inside WSL is a high-value skill for any modern developer.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow this article, make sure you have these tools ready on your machine:</p>
<ul>
<li><p><strong>WSL 2 Installation:</strong> You must have WSL 2 installed with a Linux distribution if you are using Windows (I am using Ubuntu for this guide).</p>
</li>
<li><p><strong>Docker and Docker Compose:</strong> These must be installed and running on your machine.</p>
</li>
<li><p><strong>Linux Terminal Commands:</strong> You should be familiar with basic commands like <code>mkdir</code>, <code>cd</code>, and <code>wget</code>.</p>
</li>
</ul>
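<p>If you want to confirm everything is wired up before you start, run a quick check in your WSL terminal. Both commands should print a version number rather than "command not found":</p>
<pre><code class="language-shell">docker --version
docker compose version
</code></pre>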
<h2 id="heading-step-1-preparing-your-workspace">Step 1: Preparing Your Workspace</h2>
<p>To start, create a folder for your AFFiNE files. This keeps your data in one organised place.</p>
<p>Then open your WSL terminal and run these commands:</p>
<pre><code class="language-shell">mkdir affine
cd affine
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/021e4aef-ede1-4bec-b96e-2acaea9d8f40.png" alt="A terminal Showing the commands mkdir and cd" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-2-getting-the-official-setup-files">Step 2: Getting the Official Setup Files</h2>
<p>You will download the official configuration files directly from the AFFiNE GitHub releases page. In your WSL terminal, run these two commands:</p>
<ol>
<li>Download the Docker Compose file:</li>
</ol>
<pre><code class="language-shell">wget -O docker-compose.yml https://github.com/toeverything/affine/releases/latest/download/docker-compose.yml
</code></pre>
<ol start="2">
<li>Download the Environment template:</li>
</ol>
<pre><code class="language-shell">wget -O .env https://github.com/toeverything/affine/releases/latest/download/default.env.example
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/5b366a5f-b426-4e70-95c0-b469f40d6af5.png" alt="A terminal Showing the commands to download affine" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-3-configuring-your-environment-env">Step 3: Configuring Your Environment (.env)</h2>
<p>The <code>.env</code> file is like a hidden settings sheet. It keeps your passwords and setup details private.</p>
<p>To edit this file, you can use Nano, which is a simple text editor built into your Linux terminal. Follow these steps to update your settings:</p>
<ol>
<li><p><strong>Open the file with Nano:</strong></p>
<pre><code class="language-shell">nano .env
</code></pre>
</li>
<li><p><strong>Update the settings:</strong> Use your arrow keys to move around the file. Update these specific lines to match the locations below, which keeps your data safely inside your new <code>affine</code> folder. Also set <code>DB_PASSWORD</code> to a strong password of your own rather than leaving it empty:</p>
<pre><code class="language-plaintext">DB_DATA_LOCATION=./postgres
UPLOAD_LOCATION=./storage
CONFIG_LOCATION=./config

DB_USERNAME=affine
DB_PASSWORD=your-strong-password-here
DB_DATABASE=affine
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/d0f4a358-e221-45d3-94df-d97b606b4afc.png" alt="A terminal to change the values in env file" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>Save and Exit:</strong></p>
<ul>
<li><p>Press <strong>Ctrl + O</strong> to save.</p>
</li>
<li><p>Press <strong>Enter</strong> to confirm the filename.</p>
</li>
<li><p>Press <strong>Ctrl + X</strong> to exit the editor.</p>
</li>
</ul>
</li>
</ol>
<h2 id="heading-step-4-launching-the-system">Step 4: Launching the System</h2>
<p>Run this Docker command to build your workspace:</p>
<pre><code class="language-shell">docker compose up -d
</code></pre>
<p>Docker will download the AFFiNE app, a PostgreSQL database, and a Redis cache. The <code>-d</code> flag means they will run quietly in the background.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/407237bd-f805-4fca-b15c-6bf001f467e7.png" alt="A terminal Showing the commands for docker compose" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-5-accessing-the-admin-panel">Step 5: Accessing the Admin Panel</h2>
<p>Once the terminal says "Started," your private server is live!</p>
<p>Open your web browser and go to:</p>
<pre><code class="language-plaintext">http://localhost:3010/
</code></pre>
<p>The first time you visit this page, you must create an admin account. This is the master key to your server.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/780fafda-0afd-4b67-a2fa-6248b4d5d4f3.png" alt="creating an Admin account" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-6-configuration-making-it-yours">Step 6: Configuration (Making It Yours)</h2>
<p>There are two ways to configure your server.</p>
<h3 id="heading-the-easy-way-admin-panel"><strong>The Easy Way: Admin Panel</strong></h3>
<p>In your browser, go to <code>http://localhost:3010/admin/settings</code>. You can change your server name or set up emails here.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/0f8d4e97-7a47-4328-8e91-a36582d47143.png" alt="Overview of the settings page" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-the-developer-way-config-file"><strong>The Developer Way: Config File</strong></h3>
<p>You can also create a <code>config.json</code> file inside your <code>./config</code> folder.</p>
<pre><code class="language-json">{
  "$schema": "https://github.com/toeverything/affine/releases/latest/download/config.schema.json",
  "server": {
    "name": "My Private Workspace"
  }
}
</code></pre>
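<p>AFFiNE reads this file when the server starts, so restart the stack from your <code>affine</code> folder after editing it:</p>
<pre><code class="language-shell">docker compose down
docker compose up -d
</code></pre>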
<h2 id="heading-step-7-connecting-the-desktop-app-optional">Step 7: Connecting the Desktop App (Optional)</h2>
<p>You don't have to use the browser. You can connect the official AFFiNE desktop app.</p>
<ol>
<li><p>Download the AFFiNE desktop app.</p>
</li>
<li><p>Click the workspace list panel in the top left corner.</p>
</li>
<li><p>Click "Add Server" and enter <code>http://localhost:3010</code>.</p>
</li>
<li><p>Log in with your account.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/2c668ed4-3552-420f-9217-e5f8d09f311c.png" alt="Connecting your local server to Affine Server" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/6729b04417afd6915f5c2e3e/3a12b7f6-33b9-497e-8684-7fd7a09d8c42.png" alt="Overview of Workspace" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-step-8-stopping-the-server-and-safe-backups">Step 8: Stopping the Server and Safe Backups</h2>
<p>You must turn your server off safely before you back up your notes.</p>
<p>To do that, run this command:</p>
<pre><code class="language-shell">docker compose down
</code></pre>
<p>Once it stops, you can safely copy your entire <code>affine</code> folder to a safe place.</p>
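<p>For example, from the parent directory you can create a dated archive of the whole folder. You may need <code>sudo</code>, because the Postgres container writes its data files as a different user:</p>
<pre><code class="language-shell"># Create a compressed, dated backup of the entire affine folder
sudo tar czf affine-backup-$(date +%F).tar.gz affine/
</code></pre>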
<h2 id="heading-step-9-how-to-upgrade-later">Step 9: How to Upgrade Later</h2>
<p>When AFFiNE releases a new version, run these commands inside your <code>affine</code> folder:</p>
<ol>
<li>Download the newest blueprint:</li>
</ol>
<pre><code class="language-shell">wget -O docker-compose.yml https://github.com/toeverything/affine/releases/latest/download/docker-compose.yml
</code></pre>
<ol start="2">
<li>Pull the new images and restart:</li>
</ol>
<pre><code class="language-shell">docker compose pull
docker compose up -d
</code></pre>
<h2 id="heading-common-installation-errors-and-troubleshooting">Common Installation Errors and Troubleshooting</h2>
<h3 id="heading-1-docker-is-not-running">1. Docker is Not Running</h3>
<ul>
<li><p><strong>The Error:</strong> Terminal says <code>docker: command not found</code>.</p>
</li>
<li><p><strong>The Fix:</strong> Open the Docker Desktop app on Windows and wait for it to start.</p>
</li>
</ul>
<h3 id="heading-2-docker-is-not-connected-to-wsl">2. Docker is Not Connected to WSL</h3>
<ul>
<li><strong>The Fix:</strong> In Docker Desktop, go to <strong>Settings &gt; Resources &gt; WSL Integration</strong> and turn it ON for your distro.</li>
</ul>
<h3 id="heading-3-the-port-is-already-in-use">3. The Port is Already in Use</h3>
<ul>
<li><strong>The Fix:</strong> Open <code>docker-compose.yml</code>. Change <code>"3010:3010"</code> to <code>"4000:3010"</code>. You will now visit <code>localhost:4000</code>.</li>
</ul>
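<p>If you'd rather find out what is occupying the port first, you can check from your WSL terminal (assuming <code>lsof</code> is installed, which it is on most Ubuntu setups):</p>
<pre><code class="language-shell"># Show any process listening on port 3010
sudo lsof -i :3010
</code></pre>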
<h3 id="heading-4-permission-denied">4. Permission Denied</h3>
<ul>
<li><strong>The Fix:</strong> If you cannot delete a folder, use the sudo command: <code>sudo rm -rf affine/</code>. Be careful: this permanently deletes the folder and everything in it, including your notes.</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you've successfully built a self-hosted, private workspace. You practised using WSL, Docker Compose, and Postgres. These are valuable skills for any developer.</p>
<p><strong>Your next steps:</strong></p>
<ol>
<li><p>Create a note in AFFiNE documenting what you learned.</p>
</li>
<li><p>Turn off your server (<code>docker compose down</code>) and copy your folder to a backup drive.</p>
</li>
<li><p>Explore Cloudflare Tunnels if you want to access your server from your phone!</p>
</li>
</ol>
<p>Self-hosting takes a little work, but the privacy is worth it.</p>
<p><strong>Let’s connect!</strong> You can find my latest work on my <a href="https://blog.abdultalha.tech/portfolio"><strong>Technical Writing Portfolio</strong></a> or reach out to me on <a href="https://www.linkedin.com/in/abdul-talha/"><strong>LinkedIn</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
